ICU-20939 Fix problem w regexp \b boundaries & UTF-8 text
author Andy Heninger <andy.heninger@gmail.com>
Sun, 2 Feb 2020 04:20:37 +0000 (20:20 -0800)
committer Steven R. Loomis <srl295@gmail.com>
Tue, 4 Feb 2020 00:51:17 +0000 (16:51 -0800)
commit d6b88d49e3be7096baf3828776c2b482a8ed1780
tree 95495986e3726905d07dfe31823efccd08344311
parent b7d08bc04a4296982fcef8b6b8a354a9e4e7afca

In regular expressions, word boundaries found with \b were incorrect when
two conditions held together: the pattern was in Unicode mode, meaning that
an ICU word break iterator is used to find the boundaries, and the text
being matched was UTF-8 encoded.

The bug stemmed from a misunderstanding of how string indexes work with UText
and break iterators, leading to the inclusion of code to convert from UTF-8 to
UTF-16 indexing, when what was wanted was the original UTF-8 index everywhere.
Removing the indexing conversion fixes the problem.
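
To illustrate why the extra conversion could move a boundary, here is a
minimal, self-contained sketch. It is not the code from rematch.cpp and it
does not use ICU; the helper name utf16IndexForUtf8Index is hypothetical and
the input is assumed to be well-formed UTF-8. It only shows that a UTF-8
index and the corresponding UTF-16 index diverge as soon as the text contains
a non-ASCII character, so an index that has been converted to UTF-16 no
longer lines up with positions expressed as UTF-8 indexes.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper (not an ICU API): returns the number of UTF-16 code
// units needed to encode the UTF-8 prefix s[0, byteIndex).
// Assumes s is well-formed UTF-8 and byteIndex lies on a code point boundary.
std::size_t utf16IndexForUtf8Index(const std::string &s, std::size_t byteIndex) {
    std::size_t units = 0;
    for (std::size_t i = 0; i < byteIndex && i < s.size(); ) {
        unsigned char lead = static_cast<unsigned char>(s[i]);
        if (lead < 0x80)      { i += 1; units += 1; }  // ASCII
        else if (lead < 0xE0) { i += 2; units += 1; }  // 2-byte sequence
        else if (lead < 0xF0) { i += 3; units += 1; }  // 3-byte sequence
        else                  { i += 4; units += 2; }  // 4 bytes, surrogate pair
    }
    return units;
}

// "café au": the word "café" ends at UTF-8 index 5 but UTF-16 index 4.
void checkExample() {
    const std::string text = "caf\xC3\xA9 au";
    assert(utf16IndexForUtf8Index(text, 5) == 4);
}
```

In this example, comparing a match position of 5 (UTF-8) against a boundary
position of 4 (the same place after conversion to UTF-16) would wrongly
report "not at a boundary". For pure ASCII text the two indexes coincide,
which is one plausible reason a mismatch of this kind can go unnoticed.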
icu4c/source/i18n/rematch.cpp
icu4c/source/test/testdata/regextst.txt