From: Matthew Fernandez
Date: Sat, 16 Oct 2021 05:24:34 +0000 (-0700)
Subject: xml_core: support a mode for escaping UTF-8 characters
X-Git-Tag: 2.50.0~64^2~2
X-Git-Url: https://granicus.if.org/sourcecode?a=commitdiff_plain;h=da5e778ed1747a3b2a71a73e8a685e2a9d6cdfa6;p=graphviz

xml_core: support a mode for escaping UTF-8 characters

This is modeled after `html_string` in the VML plugin and intended to
replace that function in a future commit. It differs from `html_string`
in the following ways:

  * More limited unicode character detection. `html_string` has a very
    generalized notion of a valid character that extends to lengths
    beyond what UTF-8 allows. This new implementation in `xml_core`
    adheres more strictly to only valid UTF-8 character lengths.
  * Simpler character parsing. `html_string` is written in a style to
    (1) decode character byte length without branching and (2) use the
    outer loop to also loop over the UTF-8 character’s bytes. This new
    implementation in `xml_core` uses simpler, more obvious code for
    decoding the byte length and consumes more than one character of the
    input instead of reusing the outer loop. This code is not on a hot
    path and it is not necessary or helpful to micro-optimize the
    control flow.
  * Hex escapes instead of decimal escapes. `html_string` uses
    `&#[0-9]+;` escapes while this new implementation uses
    `&#x[0-9a-f]+;` escapes. For many characters, this results in a
    shorter sequence. A compiler that knows `snprintf` as a built-in
    (all recent GCC and Clang) should also be able to generate a hex
    escape without using any division operations.

Note that nothing yet uses this functionality; all existing calls that
go through this code have the `utf8` flag unset.

This code aborts on encountering an invalid UTF-8 character. This is not
ideal, but matches `html_string`’s error handling. Perhaps this can be
improved in future.

Related to #1868.
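The point about hex escapes not needing division can be illustrated with a small sketch: each hex digit is one 4-bit nibble of the value, so it can be extracted with a shift and a mask. The helper `hex_escape` below is hypothetical and is not part of Graphviz (the patch itself calls `snprintf`); it only shows what an optimizing compiler could reduce the formatting to.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

// Hypothetical helper, not Graphviz code: write "&#x<hex>;" for `code_point`
// into `out` using only shifts and masks (no division or modulo).
// `out` must have room for at least sizeof("&#xffffffff;") bytes.
// Returns the number of characters written, excluding the terminator.
static size_t hex_escape(uint32_t code_point, char *out) {
  static const char digits[] = "0123456789abcdef";
  assert(out != NULL);

  size_t n = 0;
  out[n++] = '&';
  out[n++] = '#';
  out[n++] = 'x';

  // skip leading zero nibbles, but always keep at least one digit
  int shift = 28;
  while (shift > 0 && ((code_point >> shift) & 0xf) == 0)
    shift -= 4;

  // emit one hex digit per nibble, most significant first
  for (; shift >= 0; shift -= 4)
    out[n++] = digits[(code_point >> shift) & 0xf];

  out[n++] = ';';
  out[n] = '\0';
  return n;
}
```

Called with a buffer declared as `char buf[sizeof("&#xffffffff;")]`, `hex_escape(0x20ac, buf)` leaves `&#x20ac;` in `buf`. A decimal escape has no equivalent shortcut, because its digits do not line up with bit boundaries.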
---

diff --git a/lib/common/utils.h b/lib/common/utils.h
index 68815bff4..3e4ace495 100644
--- a/lib/common/utils.h
+++ b/lib/common/utils.h
@@ -39,6 +39,8 @@ extern "C" {
   unsigned dash : 1;
   // escape consecutive ' '
   unsigned nbsp : 1;
+  // anticipate non-ASCII characters that need to be encoded
+  unsigned utf8 : 1;
 } xml_flags_t;
 
 UTILS_API nodequeue *new_queue(int);
diff --git a/lib/common/xml.c b/lib/common/xml.c
index 7a6faf9fc..50bf82a02 100644
--- a/lib/common/xml.c
+++ b/lib/common/xml.c
@@ -1,7 +1,12 @@
+#include
 #include
 #include
 #include
+#include
 #include
+#include
+#include
+#include
 
 // variant of `isalpha` that assumes a C locale
 static bool isalpha_no_locale(char c) {
@@ -62,7 +67,7 @@ static int xml_core(char previous, const char **current, xml_flags_t flags,
   const char *s = *current;
   char c = *s;
 
-  // we always consume one character for now
+  // we will consume at least one character, so note that now
   ++*current;
 
   // escape '&' only if not part of a legal entity sequence
@@ -97,6 +102,76 @@ static int xml_core(char previous, const char **current, xml_flags_t flags,
   if (c == '\r' && flags.raw)
     return cb(state, "&#13;");
 
+  unsigned char uc = (unsigned char)c;
+  if (uc > 0x7f && flags.utf8) {
+
+    // replicating a table from https://en.wikipedia.org/wiki/UTF-8:
+    //
+    // ┌────────────────┬───────────────┬────────┬────────┬────────┬────────┐
+    // │First code point│Last code point│Byte 1  │Byte 2  │Byte 3  │Byte 4  │
+    // ├────────────────┼───────────────┼────────┼────────┼────────┼────────┤
+    // │          U+0000│         U+007F│0xxxxxxx│        │        │        │
+    // │          U+0080│         U+07FF│110xxxxx│10xxxxxx│        │        │
+    // │          U+0800│         U+FFFF│1110xxxx│10xxxxxx│10xxxxxx│        │
+    // │         U+10000│       U+10FFFF│11110xxx│10xxxxxx│10xxxxxx│10xxxxxx│
+    // └────────────────┴───────────────┴────────┴────────┴────────┴────────┘
+    //
+    // from which we can calculate the byte length of the current character
+    size_t length =
+        (uc >> 5) == 6 ? 2 : (uc >> 4) == 14 ? 3 : (uc >> 3) == 30 ? 4 : 0;
+
+    // was the length malformed or is the follow on sequence truncated?
+    bool is_invalid = length == 0;
+    for (size_t l = 1; !is_invalid && length > l; ++l)
+      is_invalid |= s[l] == '\0';
+
+    // TODO: a better strategy than aborting on malformed data
+    if (is_invalid) {
+      fprintf(stderr, "Error during conversion to \"UTF-8\". Quiting.\n");
+      exit(EXIT_FAILURE);
+    }
+
+    // Decode the character. Refer again to the above table to understand this
+    // algorithm.
+    uint32_t utf8_char = 0;
+    switch (length) {
+    case 2: {
+      uint32_t low = ((uint32_t)s[1]) & ((1 << 6) - 1);
+      uint32_t high = ((uint32_t)s[0]) & ((1 << 5) - 1);
+      utf8_char = low | (high << 6);
+      break;
+    }
+    case 3: {
+      uint32_t low = ((uint32_t)s[2]) & ((1 << 6) - 1);
+      uint32_t mid = ((uint32_t)s[1]) & ((1 << 6) - 1);
+      uint32_t high = ((uint32_t)s[0]) & ((1 << 4) - 1);
+      utf8_char = low | (mid << 6) | (high << 12);
+      break;
+    }
+    case 4: {
+      uint32_t low = ((uint32_t)s[3]) & ((1 << 6) - 1);
+      uint32_t mid1 = ((uint32_t)s[2]) & ((1 << 6) - 1);
+      uint32_t mid2 = ((uint32_t)s[1]) & ((1 << 6) - 1);
+      uint32_t high = ((uint32_t)s[0]) & ((1 << 3) - 1);
+      utf8_char = low | (mid1 << 6) | (mid2 << 12) | (high << 18);
+      break;
+    }
+    default:
+      UNREACHABLE();
+    }
+
+    // setup a buffer that will fit the largest escape we need to print
+    char buffer[sizeof("&#xFFFFFFFF;")];
+
+    // emit the escape sequence itself
+    snprintf(buffer, sizeof(buffer), "&#x%" PRIx32 ";", utf8_char);
+
+    // note how many extra characters we consumed
+    *current += length - 1;
+
+    return cb(state, buffer);
+  }
+
   // otherwise, output the character as-is
   char buffer[2] = {c, '\0'};
   return cb(state, buffer);
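For readers who want to try the decoding step outside of Graphviz, the following is a minimal standalone sketch of the same idea. The function `escape_utf8_char` and its signature are hypothetical and do not exist in the patch. It folds the three `switch` cases into one loop, and it returns 0 on malformed or truncated input where the patch calls `exit`. Like the patch, it checks only for truncation and does not verify that continuation bytes have the `10xxxxxx` form.

```c
#include <assert.h>
#include <inttypes.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

// Hypothetical standalone helper, not the Graphviz function: decode the
// multi-byte UTF-8 character at the start of `s` and write its hex escape
// into `out`, which holds `out_size` bytes. Returns the number of input
// bytes consumed, or 0 if the lead byte is malformed or the sequence is
// truncated (the patch aborts the process in that case).
static size_t escape_utf8_char(const char *s, char *out, size_t out_size) {
  assert(s != NULL && out != NULL);
  unsigned char lead = (unsigned char)s[0];

  // byte length from the lead byte's high bits: 110xxxxx, 1110xxxx, 11110xxx
  size_t length = (lead >> 5) == 6    ? 2
                  : (lead >> 4) == 14 ? 3
                  : (lead >> 3) == 30 ? 4
                                      : 0;
  if (length == 0)
    return 0;

  // reject sequences cut short by the string terminator
  for (size_t i = 1; i < length; ++i)
    if (s[i] == '\0')
      return 0;

  // payload bits of the lead byte: 5, 4 or 3 bits for lengths 2, 3 or 4
  uint32_t code_point = lead & ((1u << (7 - length)) - 1);

  // each continuation byte contributes its low 6 bits
  for (size_t i = 1; i < length; ++i)
    code_point = (code_point << 6) | ((unsigned char)s[i] & 0x3f);

  snprintf(out, out_size, "&#x%" PRIx32 ";", code_point);
  return length;
}
```

With `char buf[sizeof("&#xffffffff;")]`, calling `escape_utf8_char("\xe2\x82\xac", buf, sizeof(buf))` consumes the three bytes of U+20AC and leaves `&#x20ac;` in `buf`. A caller would advance its input pointer by the returned length, much as the patch adjusts `*current`.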