Andy Heninger [Fri, 14 Feb 2020 05:40:28 +0000 (21:40 -0800)]
ICU-20876 Regex Grapheme Cluster matching with Break Iterators.
Change the implementation of grapheme cluster matching in regex to use an ICU
break iterator instead of a little one-off state machine.
The old implementation had fallen behind the Unicode UAX-29 specification for
grapheme clusters, and could not be easily updated.
The implementation follows the same general pattern that is used for finding
word boundaries with an ICU break iterator. In reviewing that code, a few
improvements to the handling of ICU error codes were also made.
Also note that this change adds a new dependency on Break Iteration. Regex
patterns that previously worked with ICU builds configured without break
iteration will now fail, but only if they include \X for matching
grapheme cluster boundaries.
Jeff Genovy [Tue, 7 Jan 2020 09:38:21 +0000 (01:38 -0800)]
ICU-20322 On MinGW, move the DLLs to the "bin" directory.
This change builds on Vincent Torri's changes.
This installs the ICU DLL files in $prefix/bin instead of $prefix/lib.
Note: To disable this change in behavior, edit the "mh-mingw*" file(s).
If you set the variable MINGW_MOVEDLLSTOBINDIR to NO instead of YES,
the build retains the previous behavior of installing the DLLs into
the lib folder.
Andrew Paprocki [Tue, 12 Nov 2019 00:46:05 +0000 (19:46 -0500)]
ICU-20895 ICU_TIMEZONE_FILES_DIR_PREFIX_ENV_VAR
Adds `ICU_TIMEZONE_FILES_DIR_PREFIX_ENV_VAR`, similar to
`ICU_DATA_DIR_PREFIX_ENV_VAR`, that specifies an environment variable
to retrieve and prepend to the ICU time zone data file path.
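A minimal sketch of the general idea, not ICU's implementation: a build-time setting names an environment variable, and the value of that variable, if set, is prepended to a directory path. The function name `prefixedDir` and its parameters are hypothetical and chosen only for illustration.

```cpp
#include <cstdlib>
#include <string>

// Hypothetical helper (not an ICU function): look up the environment
// variable named by envVarName and prepend its value to dir.
// If no name is given or the variable is unset, dir is returned unchanged.
std::string prefixedDir(const char *envVarName, const std::string &dir) {
    const char *prefix = (envVarName != nullptr) ? std::getenv(envVarName) : nullptr;
    if (prefix == nullptr) {
        return dir;
    }
    return std::string(prefix) + dir;
}
```

With such a helper, a build that names no variable behaves exactly as before, which is presumably why the prefix is opt-in.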
Andy Heninger [Sun, 2 Feb 2020 04:20:37 +0000 (20:20 -0800)]
ICU-20939 Fix problem w regexp \b boundaries & UTF-8 text
In regular expressions, when testing for word boundaries with \b, the
boundaries were incorrect when both of the following held: the pattern was in
Unicode mode (meaning that an ICU word break iterator is used to find the
boundaries), and the text being matched was UTF-8 encoded.
The bug stemmed from a misunderstanding of how string indexes work with UText
and break iterators, leading to the inclusion of code to convert from UTF-8 to
UTF-16 indexing, when what was wanted was the original UTF-8 index everywhere.
Removing the indexing conversion fixes the problem.
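To illustrate why a stray index conversion produces wrong boundaries, the sketch below shows that the same text position has different UTF-8 and UTF-16 indexes as soon as non-ASCII text is involved. The helper `utf16IndexFromUtf8Index` is hypothetical, assumes well-formed UTF-8, and is not the code that was removed from ICU.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper (not ICU code): return the number of UTF-16 code
// units needed to encode the first utf8Index bytes of a well-formed
// UTF-8 string, i.e. the UTF-16 index of the same text position.
std::size_t utf16IndexFromUtf8Index(const std::string &utf8, std::size_t utf8Index) {
    assert(utf8Index <= utf8.size());
    std::size_t utf16Index = 0;
    for (std::size_t i = 0; i < utf8Index; ++i) {
        unsigned char b = static_cast<unsigned char>(utf8[i]);
        if ((b & 0xC0) == 0x80) {
            continue;  // continuation byte, already counted with its lead byte
        }
        // A 4-byte UTF-8 sequence becomes a surrogate pair in UTF-16.
        utf16Index += (b >= 0xF0) ? 2 : 1;
    }
    return utf16Index;
}
```

For pure ASCII text the two indexes coincide, which is one way a bug of this kind can stay hidden until UTF-8 input with multi-byte characters is matched.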
Compiled regular expression patterns make use of several shared common
UnicodeSets. This change simplifies the creation and use of these
static UnicodeSets.
- Pointer fields to the static sets are removed from the compiled patterns,
and the static variables are accessed directly. The deleted pointers
were a hold-over from earlier code that did not use shared statics.
- The UnicodeSet pattern literals are changed from hex constants to
u"string literals".
- The size of fRuleSets (from regexst.h) is changed from a hard-coded 10
to the number of UnicodeSets actually required. Doing this required
a change to regexcst.pl to export the required size. Changing and
rerunning this perl code resulted in massive but benign changes to
the generated file regexcst.h, the result of perl having changed its
order of enumeration of hashes since the file was last regenerated.
- UnicodeSets are frozen when possible. Should result in faster matching.
Frank Tang [Sat, 4 Jan 2020 02:08:51 +0000 (18:08 -0800)]
ICU-20934 Fix TZ test error
Somehow these tests now fail on trunk.
Per https://mm.icann.org/pipermail/tz-announce/2019-July/000056.html
Brazil has canceled DST and will stay on standard time indefinitely.
Joshua Root [Mon, 21 Oct 2019 19:18:00 +0000 (06:18 +1100)]
ICU-20875 Include <cstddef> for max_align_t
The definition of max_align_t is not guaranteed to be available unless
the appropriate header is included. Since use of <stddef.h> from C++ is
deprecated, that's <cstddef>, and max_align_t is thus defined under the
std namespace rather than in the global namespace.
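A short sketch of the pattern described above, assuming a C++11 or later compiler: include `<cstddef>` and refer to the type as `std::max_align_t`. The `AlignedBuffer` struct and `isMaxAligned` function are hypothetical names used only for this example.

```cpp
#include <cstddef>   // declares std::max_align_t
#include <cstdint>   // declares std::uintptr_t

// Hypothetical buffer type: storage aligned strictly enough for any
// scalar type, using the std-qualified name from <cstddef>.
struct AlignedBuffer {
    alignas(std::max_align_t) unsigned char bytes[64];
};

// Hypothetical check: true if p meets the alignment of std::max_align_t.
bool isMaxAligned(const void *p) {
    return reinterpret_cast<std::uintptr_t>(p) % alignof(std::max_align_t) == 0;
}
```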
Caio Lima [Fri, 13 Dec 2019 03:14:28 +0000 (19:14 -0800)]
ICU-20442 Adding support for hour-cycle on DateTimePatternGenerator
DateTimePatternGenerator needs to consider the hour-cycle preferred by
Locale. This means that we need to override the hour-cycle when a
locale contains the "hc" keyword. This patch adds that functionality.
In addition, "DateTimePatternGenerator::adjustFieldTypes" should adjust
the hour field to properly follow the tr35 spec
(https://www.unicode.org/reports/tr35/tr35-dates.html#dfst-hour).
Smaarn [Wed, 16 Oct 2019 20:52:05 +0000 (22:52 +0200)]
ICU-20871 Fixed: no rule was defined to create the $(OUTDIR) directory if it didn't exist.
This would cause failures during cross compilation cases such as:
make[6]: Leaving directory '/spksrc/spk/bazarr/work-qoriq-6.1/icu/source/data'
make[5]: *** No rule to make target 'out', needed by 'out/icudt64b.dat'. Stop.
Markus Scherer [Fri, 20 Dec 2019 00:09:10 +0000 (00:09 +0000)]
ICU-20916 LocaleMatcher distinguish between equivalent locales
- equivalent but originally unequal
- locale distance shifted left for additional fraction bits with micro distance
- Java more verbose matcher debug output
See #949
Andy Heninger [Fri, 23 Aug 2019 00:48:36 +0000 (17:48 -0700)]
ICU-20303 Break Iterator, improve handling of look-ahead rules.
- Merge the look-ahead results slots used when multiple rules share a common accepting state.
- Sequentially number the look-ahead result slots. Will eventually allow replacing the runtime map with an array.
- Inhibit chaining out of look-ahead rules. This could never actually happen; when a hard break
rule matches, the engine is stopped immediately, but the state table was being constructed
as if it could happen. Reduces table size for line break rules.
- Remove incorrect handling of fAccepting and fLookAhead fields of a state table row
when removing duplicate states. Look-ahead slot number was being mis-interpreted as a state number.
Joshua Root [Fri, 22 Nov 2019 10:44:57 +0000 (21:44 +1100)]
ICU-20904 Don't use char16_t with C++98/03
When C code includes the ICU headers, the UChar type is defined to be
uint16_t. But when C++ code includes the headers, UChar is char16_t
even when U_SHOW_CPLUSPLUS_API has been set to 0. Apart from arguably
being an inconsistency in the API, this means that C++98 or C++03 code
can't use the C API even though C99 code can.
So, change unicode/umachine.h to check not just whether __cplusplus is
defined but the value of U_CPLUSPLUS_VERSION when deciding how to
typedef UChar.
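The shape of such a check can be sketched as follows. This is an illustration under stated assumptions, not the contents of unicode/umachine.h: `EXAMPLE_CPLUSPLUS_VERSION` and `ExampleUChar` are hypothetical stand-ins for `U_CPLUSPLUS_VERSION` and `UChar`, and the version values used here are assumed for the example.

```cpp
#include <stdint.h>   // uint16_t; the C header also works from C++98/03

// Hypothetical stand-in for a language-version macro:
// 0 for C, 1 for C++ before C++11, 11 for C++11 or later.
#if defined(__cplusplus) && __cplusplus >= 201103L
#   define EXAMPLE_CPLUSPLUS_VERSION 11
#elif defined(__cplusplus)
#   define EXAMPLE_CPLUSPLUS_VERSION 1
#else
#   define EXAMPLE_CPLUSPLUS_VERSION 0
#endif

// Use char16_t only where the language provides it; testing merely
// whether __cplusplus is defined would also select it for C++98/03.
#if EXAMPLE_CPLUSPLUS_VERSION >= 11
typedef char16_t ExampleUChar;
#else
typedef uint16_t ExampleUChar;
#endif
```

Either way the type is a 16-bit code unit, so C and pre-C++11 callers see the same data layout as C++11 callers.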
Jeff Genovy [Wed, 4 Dec 2019 19:50:55 +0000 (11:50 -0800)]
ICU-20873 Add the PGP "KEYS" file to the ICU repo.
This is a copy of the file from:
http://ssl.icu-project.org/KEYS
The ICU project's PGP KEYS file was previously hosted on a separate
server that not all ICU-TC members have access to. This change copies
the current KEYS file into the top-level git repo, so that we can set up
a redirect on the website to point at the checked-in file, rather than
hosting it separately.
Markus Scherer [Thu, 21 Nov 2019 21:29:18 +0000 (21:29 +0000)]
ICU-20893 Unicode 13 beta
See PR #915, see changes.txt
- Unicode 13 beta data as of 2019-nov-21
- uprops.icu format version 7.7 with more bits for Script/Script_Extensions
- more bits in spoof checker ScriptSet
- root line break rules adjusted for UAX 14 changes, from Andy
- line break tailorings not yet in sync with root