ICU-20876 Regex Grapheme Cluster matching with Break Iterators.
author Andy Heninger <andy.heninger@gmail.com>
Fri, 14 Feb 2020 05:40:28 +0000 (21:40 -0800)
committer Andy Heninger <andy.heninger@gmail.com>
Wed, 19 Feb 2020 02:28:10 +0000 (18:28 -0800)
commit 14bcaaf58eefc2c139350253d3cdfb1940bca7e4
tree 8d1f838d924139d1a853c2a70cba9606aca3bf56
parent ed9ea2e7accc316d4086ac377f622ca90f2c016d

Change the implementation of grapheme cluster matching in regex to use an ICU
break iterator instead of a little one-off state machine.

The old implementation had fallen behind the Unicode UAX #29 specification for
grapheme clusters and could not easily be updated.

The implementation follows the same general pattern that is used for finding
word boundaries with an ICU break iterator. In reviewing that code, a few
improvements to the handling of ICU error codes were also made.
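
The general pattern can be sketched as follows. This is only an illustration,
not the code from this commit: ToyClusterIterator, matchGraphemeCluster and
splitClusters are hypothetical names, and the toy iterator applies a single
made-up rule (combining marks U+0300..U+036F attach to the preceding code
point) rather than the UAX #29 rules that a real break iterator implements.
The point is just that the matcher no longer decides where a cluster ends; it
asks the iterator for the next boundary and consumes input up to it.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for a character break iterator. This is NOT the ICU
// API and NOT UAX #29: its only rule is that a combining mark in
// U+0300..U+036F attaches to the code point before it.
class ToyClusterIterator {
public:
    explicit ToyClusterIterator(const std::u32string &text) : text_(text) {}

    // Returns the first cluster boundary strictly after pos.
    std::size_t following(std::size_t pos) const {
        assert(pos < text_.size());
        std::size_t next = pos + 1;
        while (next < text_.size() && isCombiningMark(text_[next])) {
            ++next;
        }
        return next;
    }

private:
    static bool isCombiningMark(char32_t c) {
        return c >= 0x0300 && c <= 0x036F;
    }

    std::u32string text_;  // private copy of the text being scanned
};

// Sketch of one \X step: fail at end of input, otherwise advance pos to the
// next boundary reported by the iterator.
bool matchGraphemeCluster(const ToyClusterIterator &bi,
                          std::size_t inputLength,
                          std::size_t &pos) {
    if (pos >= inputLength) {
        return false;  // \X must consume at least one code point
    }
    pos = bi.following(pos);
    return true;
}

// Applies the \X step repeatedly, collecting each matched cluster.
std::vector<std::u32string> splitClusters(const std::u32string &text) {
    ToyClusterIterator bi(text);
    std::vector<std::u32string> clusters;
    std::size_t start = 0;
    std::size_t pos = 0;
    while (matchGraphemeCluster(bi, text.size(), pos)) {
        clusters.push_back(text.substr(start, pos - start));
        start = pos;
    }
    return clusters;
}
```

With this split, keeping up with a newer revision of the segmentation rules
only means changing the iterator; the matching step stays the same.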

Note that this change adds a new dependency on break iteration. Regex patterns
that previously worked with ICU builds configured without break iteration will
now fail, but only if they include \X, which matches a grapheme cluster.
icu4c/source/i18n/regexcmp.cpp
icu4c/source/i18n/rematch.cpp
icu4c/source/i18n/unicode/regex.h
icu4c/source/test/testdata/regextst.txt