author	Andy Heninger <andy.heninger@gmail.com>
	Sun, 2 Feb 2020 04:20:37 +0000 (20:20 -0800)
committer	Steven R. Loomis <srl295@gmail.com>
	Tue, 4 Feb 2020 00:51:17 +0000 (16:51 -0800)
commit	d6b88d49e3be7096baf3828776c2b482a8ed1780
tree	95495986e3726905d07dfe31823efccd08344311
parent	b7d08bc04a4296982fcef8b6b8a354a9e4e7afca

ICU-20939 Fix problem w regexp \b boundaries & UTF-8 text

In regular expressions, word-boundary tests with \b produced incorrect
boundaries when two conditions held together: the pattern was in Unicode mode,
meaning that an ICU word break iterator is used to find the boundaries, and
the text being matched was UTF-8 encoded.

The bug stemmed from a misunderstanding of how string indexes work with UText
and break iterators. The code converted indexes from UTF-8 to UTF-16, when
what was needed everywhere was the original UTF-8 index. Removing the index
conversion fixes the problem.
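
To illustrate why the index conversion was harmful, here is a minimal
standalone sketch. It is not the code from rematch.cpp and does not use ICU;
the helper name utf16IndexForUtf8Offset is hypothetical, and the input is
assumed to be well-formed UTF-8. It only shows that a UTF-8 byte offset and
the corresponding UTF-16 code unit index diverge as soon as the text contains
a non-ASCII character, so passing the converted value to something that
expects the original UTF-8 offset points at the wrong position.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper (not an ICU function): given a byte offset into
// well-formed UTF-8 text, return the index the same position would have
// if the text were encoded as UTF-16.
std::size_t utf16IndexForUtf8Offset(const std::string &utf8,
                                    std::size_t byteOffset) {
    assert(byteOffset <= utf8.size());
    std::size_t units = 0;
    for (std::size_t i = 0; i < byteOffset; ++i) {
        unsigned char b = static_cast<unsigned char>(utf8[i]);
        if ((b & 0xC0) == 0x80) {
            continue;  // continuation byte: belongs to an earlier lead byte
        }
        // A 4-byte sequence encodes a supplementary code point, which needs
        // a surrogate pair (2 code units) in UTF-16; everything else needs 1.
        units += (b >= 0xF0) ? 2 : 1;
    }
    return units;
}
```

For the UTF-8 text "caf\xC3\xA9 au" ("café au"), the position just after
"café" is byte offset 5, while the helper maps it to UTF-16 index 4. Used as a
UTF-8 offset, 4 falls inside the two-byte sequence for "é". For pure ASCII
text the two values coincide, which is consistent with the problem only
showing up on non-ASCII input.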

icu4c/source/i18n/rematch.cpp
icu4c/source/test/testdata/regextst.txt