fix UTF-8 character decoding in record labels
When using `shape=record`, certain characters within labels have alternate
semantics. The switch in `parse_reclbl` handles these, with most characters
having no special semantics and branching to the default case. The trailing loop
in this case was attempting to accrue UTF-8 continuation bytes. But lets take a
look at the Wikipedia table describing UTF-8:¹
┌────────────────┬───────────────┬──────────┬──────────┬──────────┬──────────┐
│First code point│Last code point│ Byte 1 │ Byte 2 │ Byte 3 │ Byte 4 │
├────────────────┼───────────────┼──────────┼──────────┼──────────┼──────────┤
│ U+0000│ U+007F│ 0xxxxxxx │ │ │ │
│ U+0080│ U+07FF│ 110xxxxx │ 10xxxxxx │ │ │
│ U+0800│ U+FFFF│ 1110xxxx │ 10xxxxxx │ 10xxxxxx │ │
│ U+10000│ U+10FFFF│ 11110xxx │ 10xxxxxx │ 10xxxxxx │ 10xxxxxx │
└────────────────┴───────────────┴──────────┴──────────┴──────────┴──────────┘
Notice that the continuation bytes are distinguished by upper 0b10 bits. Now
consider that the trailing loop was using a mask with 128 (
0b10000000) to
identify such bytes. Such a value masks _out_ bit 6. That is, this loop
condition expression was true for all values with upper 0b10 bits _or_ upper
0b11 bits.
The first consequence of this is that this loop thought multiple consecutive
non-ASCII characters were a single character. It treated the 0b11xxxxxx byte 1
of a new non-ASCII character as if it were another continuation byte of the
preceding non-ASCII character.
The second consequence of this is that an ASCII character followed by non-ASCII
characters would be treated as a single character. That is, 0b0xxxxxxx bytes led
to adjacent following 0b1xxxxxxx bytes being consumed along with them.
These factors combine in the #925 example to confuse the state machine of the
containing loop and result in a malformed label being produced.
This fix adjusts the mask such that it takes _both_ bit 7 and bit 6 and compares
against 0b10xxxxxx.
There are a number of other odd things going on with this code that I did not
attempt to change:
1. ASCII characters are allowed to have following UTF-8 continuation bytes.
This would be considered malformed, but this code treats it as legal.
2. UTF-8 permits a maximum of 3 continuation bytes (see table above) but this
code allows an arbitrary number of continuation bytes. It also does no
validation that the leading byte’s upper bit values and the number of
continuation bytes correspond.
3. There is something called “hard space” mode that this code toggles to keep
track of when a space that would otherwise be omitted needs to be
preserved in the output. Once the hard space flag is toggled on, the code
appears to never toggle it off. This looks like it has the (presumably
unintended) effect of something like `"\\ "` causing all later spaces in
the label to become hard spaces.
Gitlab: fixes #925
¹ https://en.wikipedia.org/wiki/UTF-8