Frank Tang [Wed, 29 Jul 2020 00:05:26 +0000 (17:05 -0700)]
ICU-20684 Fix uninitialized in isMatchAtCPBoundary
Downstream bug https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=15505
Fix Fuzzer-detected Use-of-uninitialized-value in isMatchAtCPBoundary
To reproduce the bug with the new test case, configure and build with
CFLAGS="-fsanitize=memory" CXXFLAGS="-fsanitize=memory" ./runConfigureICU \
--enable-debug --disable-release Linux --disable-layoutex
Andy Heninger [Wed, 8 Jul 2020 00:12:09 +0000 (17:12 -0700)]
ICU-21178 Add check for corrupt rbbitst.txt data.
In the test data from rbbitst.txt, two or more adjacent boundary markers with
no intervening test data were accepted, with no indication of a problem.
This situation occurred, as described in bug ICU-21178, with a bad import of
some test cases from CLDR. PR #1194 corrected the problem with the test data
in ICU4C. This PR adds code to flag this situation in the test data, and
also propagates the data fix to ICU4J's copy of rbbitst.txt.
Andy Heninger [Sat, 27 Jun 2020 00:52:40 +0000 (17:52 -0700)]
ICU-13590 RBBI, improve handling of concurrent look-ahead rules.
Change the mapping from rule number to boundary position to use a simple array
instead of a linear search lookup map.
Look-ahead rules have a preceding context, a boundary position, and following context.
In the implementation, when the preceding context matches, the potential boundary
position is saved. Then, if the following context proves to match, the saved boundary is
returned as an actual boundary.
Look-ahead rules are numbered, and the implementation maintains a map from
rule number to the tentative saved boundary position.
In an earlier improvement to the rule builder, the rule numbering was changed from the
original sparse numbering to a contiguous sequence, in anticipation of
changing the mapping from number to position to use a simple array.
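The array-based mapping described above could be sketched roughly as follows. This is an illustration only, not the actual RBBI code: the class name LookAheadPositions, the kNoPosition sentinel and the method names are all hypothetical, and the sketch assumes look-ahead rules are numbered contiguously from 0.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical sentinel meaning "no tentative boundary saved for this rule".
constexpr int32_t kNoPosition = -1;

// Hypothetical helper (not ICU's class): maps a look-ahead rule number
// directly to its tentative boundary position via array indexing.
class LookAheadPositions {
public:
    // Assumes rules are numbered contiguously, 0 .. ruleCount-1.
    explicit LookAheadPositions(int32_t ruleCount)
        : positions_(static_cast<std::size_t>(ruleCount), kNoPosition) {}

    // Called when a rule's preceding context has matched at position pos.
    void save(int32_t ruleNumber, int32_t pos) {
        assert(inRange(ruleNumber));
        positions_[static_cast<std::size_t>(ruleNumber)] = pos;
    }

    // Called when the rule's following context has matched; returns the
    // saved position, or kNoPosition if nothing was saved for this rule.
    int32_t get(int32_t ruleNumber) const {
        assert(inRange(ruleNumber));
        return positions_[static_cast<std::size_t>(ruleNumber)];
    }

private:
    bool inRange(int32_t ruleNumber) const {
        return ruleNumber >= 0 &&
               static_cast<std::size_t>(ruleNumber) < positions_.size();
    }
    std::vector<int32_t> positions_;
};
```

Because each rule owns its own slot, several look-ahead rules can have tentative positions pending at the same time, and both save and lookup are constant-time.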
ICU-20545 Ensure that path ends with detected file separator
CharString, when asked, appends U_FILE_SEP_CHAR at the end of the string
it holds, if it does not find U_FILE_SEP_CHAR or U_FILE_ALT_SEP_CHAR there.
The problem arises when the dir variable uses
U_FILE_ALT_SEP_CHAR and that differs from U_FILE_SEP_CHAR. The
resulting path could then look like this
../data\
instead of this
../data/
This patch uses U_FILE_SEP_CHAR unless it detects that the dir variable
doesn't use it, in which case it uses U_FILE_ALT_SEP_CHAR instead.
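A minimal sketch of the separator selection described above, written against std::string rather than CharString. The function name ensureEndsWithSeparator is hypothetical, and the sep and altSep parameters stand in for U_FILE_SEP_CHAR and U_FILE_ALT_SEP_CHAR so the logic can be exercised on any platform. Leaving an empty path untouched is an assumption of this sketch, not something the commit states.

```cpp
#include <string>

// Hypothetical stand-in for the patch's logic, not the actual ICU code.
// sep / altSep play the roles of U_FILE_SEP_CHAR / U_FILE_ALT_SEP_CHAR.
std::string ensureEndsWithSeparator(std::string dir, char sep, char altSep) {
    if (dir.empty()) {
        return dir;  // assumption: an empty path is left as-is
    }
    const char last = dir.back();
    if (last == sep || last == altSep) {
        return dir;  // already terminated by one of the separators
    }
    // Prefer sep; use altSep only when dir uses altSep and never uses sep.
    const bool usesSep = dir.find(sep) != std::string::npos;
    const bool usesAlt = dir.find(altSep) != std::string::npos;
    dir.push_back((!usesSep && usesAlt) ? altSep : sep);
    return dir;
}
```

With sep = '\\' and altSep = '/', the input "../data" becomes "../data/" rather than "../data\".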
Andy Heninger [Mon, 1 Jun 2020 19:59:47 +0000 (12:59 -0700)]
ICU-20869 Fix compiler warning in FixedDecimal::getFractionalDigits().
Fix a clang compiler warning and a potential undefined behavior arising
from casting an out-of-range double to an int. See the Jira ticket for a
more detailed description of the problem.
This PR is to fix the immediate problem. Longer term, the function
may be replaced entirely - see issue ICU-21147.
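For illustration, one common way to avoid the undefined behavior of converting an out-of-range double to an integer is to range-check before the cast. The helper below is a hypothetical sketch, not the change made in FixedDecimal::getFractionalDigits(). The name saturatingToInt64, the choice of int64_t as the target type, and the policy of mapping NaN to 0 are all assumptions.

```cpp
#include <cmath>
#include <cstdint>
#include <limits>

// Hypothetical helper (not ICU's): converts a double to int64_t without
// undefined behavior. Out-of-range values saturate; NaN maps to 0.
int64_t saturatingToInt64(double d) {
    if (std::isnan(d)) {
        return 0;  // assumed policy for NaN
    }
    // 2^63 is exactly representable as a double; INT64_MAX itself is not.
    constexpr double kTwoTo63 = 9223372036854775808.0;
    if (d >= kTwoTo63) {
        return std::numeric_limits<int64_t>::max();
    }
    if (d < -kTwoTo63) {
        return std::numeric_limits<int64_t>::min();
    }
    // In range: the cast truncates toward zero and is well defined.
    return static_cast<int64_t>(d);
}
```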
If udata_create does not find U_FILE_SEP_CHAR at the end of a dir variable,
it appends it. The problem arises when the dir variable uses
U_FILE_ALT_SEP_CHAR and that differs from U_FILE_SEP_CHAR. The
resulting path could then look like this
../data\mappings/cns-11643-1992.ucm
instead of this
../data/mappings/cns-11643-1992.ucm
This patch uses U_FILE_SEP_CHAR unless it detects that the dir variable
doesn't use it, in which case it uses U_FILE_ALT_SEP_CHAR instead.
- Set intltest's Current Working Directory correctly to enable finding
resources.
- Adds c_cpp_properties.json, primarily for the includePath settings.
- Load average takes a while to respond, so specify -j24 to always limit
parallel jobs to a maximum of 24.
- make's "-l" parameter is system load average, not CPU percentage.
A load average of 90 makes my laptop unusable, so change it to -l20.
- Make running all tests the unit-testing default.
- Document the adjustments that can be made in the README.
- Skip these json files when checking for copyright notices. Pure json
does not permit comments, so c_cpp_properties.json cannot have
comments.
- defines += U_DISABLE_RENAMING=1 to simplify reference following.
Andy Heninger [Tue, 9 Jun 2020 20:19:17 +0000 (13:19 -0700)]
ICU-13565 Break Iteration, remove the dictionary bit from the implementation.
For identifying text that needs to be handled by a word dictionary for Break Iteration,
change from using a bit in the character category to sorting all dictionary categories
together, and recording the boundary between the non-dictionary and dictionary ranges.
This is internal to the implementation. It does not affect behavior.
It does increase the number of character categories that can be handled using a
compact 8 bit Trie, from 127 to 255.
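The idea of sorting dictionary categories together can be sketched as below. This is an illustration under assumptions, not the RBBI builder's code: the names CategoryLayout, sortDictionaryCategories, dictCategoriesStart and isDictionaryCategory are hypothetical. Presumably, with no bit reserved as a dictionary flag, the whole 8-bit value is available for the category number, and a single comparison against the recorded start of the dictionary range replaces the bit test.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical result of renumbering character categories.
struct CategoryLayout {
    std::vector<uint8_t> newNumber;  // newNumber[old] = renumbered category
    int dictCategoriesStart = 0;     // first category number in the dictionary range
};

// Renumbers categories so that all non-dictionary categories come first and
// all dictionary categories follow, recording where the dictionary range starts.
CategoryLayout sortDictionaryCategories(const std::vector<bool>& isDictionary) {
    assert(isDictionary.size() <= 255);  // must fit the 8-bit category space
    CategoryLayout layout;
    layout.newNumber.resize(isDictionary.size());
    uint8_t next = 0;
    for (std::size_t i = 0; i < isDictionary.size(); ++i) {
        if (!isDictionary[i]) layout.newNumber[i] = next++;
    }
    layout.dictCategoriesStart = next;
    for (std::size_t i = 0; i < isDictionary.size(); ++i) {
        if (isDictionary[i]) layout.newNumber[i] = next++;
    }
    return layout;
}

// A range comparison replaces testing a dedicated dictionary bit.
inline bool isDictionaryCategory(const CategoryLayout& layout, uint8_t category) {
    return category >= layout.dictCategoriesStart;
}
```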
Fredrik Roubert [Tue, 2 Jun 2020 20:35:20 +0000 (22:35 +0200)]
ICU-21143 Applying non-zero offset to null pointer is undefined behaviour.
The result of pointer end + 1 will not be used if end is nullptr so it
doesn't really matter that the result of this operation is undefined,
but it's therefore also unnecessary to perform the operation at all.
Changing this removes this unnecessary operation and by doing so gives
the undefined behaviour sanitizer one thing less to worry about.
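The pattern can be illustrated with a small hypothetical example (not the function touched by this change): the pointer arithmetic is performed only after the null check, so end + 1 is never evaluated for a null end. The function name nextItem and the ':' separator are assumptions made for the sketch.

```cpp
#include <cstring>

// Hypothetical example: returns the start of the item following the first
// ':' in list, or nullptr when there is no further item.
const char* nextItem(const char* list) {
    const char* end = std::strchr(list, ':');
    // Evaluating end + 1 when end is nullptr would be undefined behaviour,
    // even if the result were never used, so check before doing arithmetic.
    return end != nullptr ? end + 1 : nullptr;
}
```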
Steven R. Loomis [Wed, 20 May 2020 20:29:35 +0000 (13:29 -0700)]
ICU-21098 fix ticket URLs for logKnownIssue tickets.
- Still allows "1234" or "cldrbug:1234" format ticket IDs
- However, docs recommend "ICU-1234" or "CLDR-1234" format
in the future.
- Other ticket IDs could be used, but won't be linkified.
Jeff Genovy [Sun, 3 May 2020 19:41:08 +0000 (12:41 -0700)]
ICU-21104 Add build bots using Ubuntu 18.04 with C++14
This change adds an Azure build bot that builds using Clang on Ubuntu
18.04 with C++14 in debug mode, and a build bot on Travis that builds
using GCC with C++14.
Note: The Ubuntu 18.04 image doesn't have HarfBuzz, so we need to disable
building the layout engine.
Robert Melo [Tue, 21 Apr 2020 21:25:46 +0000 (18:25 -0300)]
ICU-21071 Fix lenient parse rules
- Check non-lenient rules before calling lenient parsing
- Remove logKnownIssue 9503 from test code
- Adjust TestAllLocales test on ICU4C
- Add lenient checks on ICU4J
Fredrik Roubert [Wed, 11 Mar 2020 22:17:05 +0000 (23:17 +0100)]
ICU-20068 Shorten .gitignore by making common rules more generic.
Wildcard expressions, directory names and file names that identify
artefacts that should be ignored regardless of which subdirectory
they are in can be consolidated, thereby shortening the list considerably.