From: Matthew Fernandez <matthew.fernandez@gmail.com>
Date: Thu, 14 Jul 2022 04:52:21 +0000 (-0700)
Subject: fix UTF-8 character decoding in record labels
X-Git-Tag: 5.0.1~36^2
X-Git-Url: https://granicus.if.org/sourcecode?a=commitdiff_plain;h=2e61f3319486e0442d2cd8f4ae9c8777758aa431;p=graphviz

fix UTF-8 character decoding in record labels

When using `shape=record`, certain characters within labels have alternate
semantics. The switch in `parse_reclbl` handles these, with most characters
having no special semantics and branching to the default case. The trailing loop
in this case was attempting to accrue UTF-8 continuation bytes. But let's take a
look at the Wikipedia table describing UTF-8:¹

  ┌────────────────┬───────────────┬──────────┬──────────┬──────────┬──────────┐
  │First code point│Last code point│  Byte 1  │  Byte 2  │  Byte 3  │  Byte 4  │
  ├────────────────┼───────────────┼──────────┼──────────┼──────────┼──────────┤
  │          U+0000│         U+007F│ 0xxxxxxx │          │          │          │
  │          U+0080│         U+07FF│ 110xxxxx │ 10xxxxxx │          │          │
  │          U+0800│         U+FFFF│ 1110xxxx │ 10xxxxxx │ 10xxxxxx │          │
  │         U+10000│       U+10FFFF│ 11110xxx │ 10xxxxxx │ 10xxxxxx │ 10xxxxxx │
  └────────────────┴───────────────┴──────────┴──────────┴──────────┴──────────┘

Notice that the continuation bytes are distinguished by upper 0b10 bits. Now
consider that the trailing loop was using a mask with 128 (0b10000000) to
identify such bytes. That mask tests only bit 7 and ignores bit 6, so the loop
condition was true for all values with upper 0b10 bits _or_ upper 0b11 bits.
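
As a minimal sketch of that observation, the old condition can be pulled out
into a standalone predicate and tried on one byte from each class in the table.
The function names here are hypothetical and do not exist in Graphviz; only the
`& 128` test is taken from the original loop.

```c
#include <assert.h>
#include <stdbool.h>

/* The pre-fix loop condition in isolation: only bit 7 is tested. */
bool old_loop_condition(unsigned char byte) {
  return (byte & 128) != 0;
}

/* Illustrative self-check (hypothetical, not part of Graphviz). */
void check_old_condition(void) {
  assert(!old_loop_condition(0x41)); /* 0b01000001, ASCII 'A' */
  assert(old_loop_condition(0xA9));  /* 0b10101001, a continuation byte */
  assert(old_loop_condition(0xC3));  /* 0b11000011, byte 1 of a 2-byte sequence */
}
```

The last assertion is the problematic case: a leading byte is accepted as if it
were a continuation byte.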

The first consequence of this is that this loop thought multiple consecutive
non-ASCII characters were a single character. It treated the 0b11xxxxxx byte 1
of a new non-ASCII character as if it were another continuation byte of the
preceding non-ASCII character.

The second consequence of this is that an ASCII character followed by non-ASCII
characters would be treated as a single character. That is, 0b0xxxxxxx bytes led
to adjacent following 0b1xxxxxxx bytes being consumed along with them.

These factors combine in the #925 example to confuse the state machine of the
containing loop and result in a malformed label being produced.

This fix adjusts the mask such that it takes _both_ bit 7 and bit 6 and compares
against 0b10xxxxxx.
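
To compare the two masks side by side, here is a simplified model of the
default case. It is a sketch only: `count_groups`, `old_test` and `new_test`
are hypothetical names, and the real `parse_reclbl` does considerably more than
count groups of bytes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Pre-fix test: true for any byte with bit 7 set. */
bool old_test(unsigned char byte) { return (byte & 128) != 0; }

/* Post-fix test: true only for 0b10xxxxxx bytes. */
bool new_test(unsigned char byte) { return (byte & 0xc0) == 0x80; }

/* Hypothetical helper: count the groups formed by taking one byte and then
 * every following byte that `is_continuation` accepts. */
size_t count_groups(const char *s, bool (*is_continuation)(unsigned char)) {
  assert(s != NULL);
  size_t groups = 0;
  while (*s != '\0') {
    s++; /* the first byte of the group */
    while (is_continuation((unsigned char)*s))
      s++; /* bytes absorbed into the same group */
    groups++;
  }
  return groups;
}
```

For the bytes of "aéé" (`"a\xc3\xa9\xc3\xa9"`), `count_groups` with `old_test`
yields 1 group, while `new_test` yields 3, one per character.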

There are a number of other odd things going on with this code that I did not
attempt to change:

  1. ASCII characters are allowed to have following UTF-8 continuation bytes.
     This would be considered malformed, but this code treats it as legal.

  2. UTF-8 permits a maximum of 3 continuation bytes (see table above) but this
     code allows an arbitrary number of continuation bytes. It also does no
     validation that the leading byte’s upper bit values and the number of
     continuation bytes correspond.

  3. There is something called “hard space” mode that this code toggles to keep
     track of when a space that would otherwise be omitted needs to be
     preserved in the output. Once the hard space flag is toggled on, the code
     appears to never toggle it off. This looks like it has the (presumably
     unintended) effect of something like `"\\ "` causing all later spaces in
     the label to become hard spaces.
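
For reference, the validation that item 2 says is missing could look roughly
like the following. This is a sketch under the assumption that only the byte
patterns in the table above are checked; both function names are hypothetical,
nothing like this is added by this commit, and overlong encodings and
surrogates are not rejected.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: number of continuation bytes implied by a leading
 * byte, or -1 if the byte cannot start a sequence. */
int expected_continuations(unsigned char lead) {
  if ((lead & 0x80) == 0x00) return 0; /* 0xxxxxxx */
  if ((lead & 0xe0) == 0xc0) return 1; /* 110xxxxx */
  if ((lead & 0xf0) == 0xe0) return 2; /* 1110xxxx */
  if ((lead & 0xf8) == 0xf0) return 3; /* 11110xxx */
  return -1;                           /* 10xxxxxx or 11111xxx */
}

/* Hypothetical helper: length in bytes of the sequence starting at `s`, or 0
 * if the leading byte and the continuation bytes that follow do not
 * correspond. Stops at the first mismatch, so it never reads past the
 * terminating NUL. */
size_t sequence_length(const char *s) {
  assert(s != NULL);
  int n = expected_continuations((unsigned char)s[0]);
  if (n < 0)
    return 0;
  for (int i = 1; i <= n; i++) {
    if (((unsigned char)s[i] & 0xc0) != 0x80)
      return 0;
  }
  return (size_t)n + 1;
}
```

With this, an ASCII byte always has length 1, so a continuation byte following
it (item 1) would be examined on its own and rejected.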

Gitlab: fixes #925

¹ https://en.wikipedia.org/wiki/UTF-8
---

diff --git a/CHANGELOG.md b/CHANGELOG.md
index 1bce7141a..40b3f344f 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -12,6 +12,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
   returned invalid button event"` failed #2256
 - missing Perl includes patch #2262
 - smyrna: incorrect tokenization in frmobjectui.c:set_attr_object_type #2259
+- [Dot] Record shape+UTF+" | "=Eats spaces. #925
 
 ## [5.0.0] – 2022-07-07
 
diff --git a/lib/common/shapes.c b/lib/common/shapes.c
index 3f256ee1a..b385e2c56 100644
--- a/lib/common/shapes.c
+++ b/lib/common/shapes.c
@@ -3342,7 +3342,7 @@ static field_t *parse_reclbl(node_t *n, bool LR, int flag, char *text) {
 		    hspsp = psp - 1;
 	    }
 	    reclblp++;
-	    while (*reclblp & 128)
+	    while ((*reclblp & 0xc0) == 0x80)
 		*tsp++ = *reclblp++;
 	    break;
 	}
diff --git a/tests/test_regression.py b/tests/test_regression.py
index 61b3a20eb..1073bcaf9 100644
--- a/tests/test_regression.py
+++ b/tests/test_regression.py
@@ -407,7 +407,6 @@ def test_827():
 
   assert ret == 1, "Graphviz crashed when processing b15.gv"
 
-@pytest.mark.xfail(strict=True)
 def test_925():
   """
   spaces should be handled correctly in UTF-8-containing labels in record shapes