From: Matthew Fernandez Date: Thu, 14 Jul 2022 04:52:21 +0000 (-0700) Subject: fix UTF-8 character decoding in record labels X-Git-Tag: 5.0.1~36^2 X-Git-Url: https://granicus.if.org/sourcecode?a=commitdiff_plain;h=2e61f3319486e0442d2cd8f4ae9c8777758aa431;p=graphviz fix UTF-8 character decoding in record labels When using `shape=record`, certain characters within labels have alternate semantics. The switch in `parse_reclbl` handles these, with most characters having no special semantics and branching to the default case. The trailing loop in this case was attempting to accrue UTF-8 continuation bytes. But lets take a look at the Wikipedia table describing UTF-8:¹ ┌────────────────┬───────────────┬──────────┬──────────┬──────────┬──────────┐ │First code point│Last code point│ Byte 1 │ Byte 2 │ Byte 3 │ Byte 4 │ ├────────────────┼───────────────┼──────────┼──────────┼──────────┼──────────┤ │ U+0000│ U+007F│ 0xxxxxxx │ │ │ │ │ U+0080│ U+07FF│ 110xxxxx │ 10xxxxxx │ │ │ │ U+0800│ U+FFFF│ 1110xxxx │ 10xxxxxx │ 10xxxxxx │ │ │ U+10000│ U+10FFFF│ 11110xxx │ 10xxxxxx │ 10xxxxxx │ 10xxxxxx │ └────────────────┴───────────────┴──────────┴──────────┴──────────┴──────────┘ Notice that the continuation bytes are distinguished by upper 0b10 bits. Now consider that the trailing loop was using a mask with 128 (0b10000000) to identify such bytes. Such a value masks _out_ bit 6. That is, this loop condition expression was true for all values with upper 0b10 bits _or_ upper 0b11 bits. The first consequence of this is that this loop thought multiple consecutive non-ASCII characters were a single character. It treated the 0b11xxxxxx byte 1 of a new non-ASCII character as if it were another continuation byte of the preceding non-ASCII character. The second consequence of this is that an ASCII character followed by non-ASCII characters would be treated as a single character. That is, 0b0xxxxxxx bytes led to adjacent following 0b1xxxxxxx bytes being consumed along with them. These factors combine in the #925 example to confuse the state machine of the containing loop and result in a malformed label being produced. This fix adjusts the mask such that it takes _both_ bit 7 and bit 6 and compares against 0b10xxxxxx. There are a number of other odd things going on with this code that I did not attempt to change: 1. ASCII characters are allowed to have following UTF-8 continuation bytes. This would be considered malformed, but this code treats it as legal. 2. UTF-8 permits a maximum of 3 continuation bytes (see table above) but this code allows an arbitrary number of continuation bytes. It also does no validation that the leading byte’s upper bit values and the number of continuation bytes correspond. 3. There is something called “hard space” mode that this code toggles to keep track of when a space that would otherwise be omitted needs to be preserved in the output. Once the hard space flag is toggled on, the code appears to never toggle it off. This looks like it has the (presumably unintended) effect of something like `"\\ "` causing all later spaces in the label to become hard spaces. Gitlab: fixes #925 ¹ https://en.wikipedia.org/wiki/UTF-8 --- diff --git a/CHANGELOG.md b/CHANGELOG.md index 1bce7141a..40b3f344f 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -12,6 +12,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0 returned invalid button event"` failed #2256 - missing Perl includes patch #2262 - smyrna: incorrect tokenization in frmobjectui.c:set_attr_object_type #2259 +- [Dot] Record shape+UTF+" | "=Eats spaces. #925 ## [5.0.0] – 2022-07-07 diff --git a/lib/common/shapes.c b/lib/common/shapes.c index 3f256ee1a..b385e2c56 100644 --- a/lib/common/shapes.c +++ b/lib/common/shapes.c @@ -3342,7 +3342,7 @@ static field_t *parse_reclbl(node_t *n, bool LR, int flag, char *text) { hspsp = psp - 1; } reclblp++; - while (*reclblp & 128) + while ((*reclblp & 0xc0) == 0x80) *tsp++ = *reclblp++; break; } diff --git a/tests/test_regression.py b/tests/test_regression.py index 61b3a20eb..1073bcaf9 100644 --- a/tests/test_regression.py +++ b/tests/test_regression.py @@ -407,7 +407,6 @@ def test_827(): assert ret == 1, "Graphviz crashed when processing b15.gv" -@pytest.mark.xfail(strict=True) def test_925(): """ spaces should be handled correctly in UTF-8-containing labels in record shapes