From: Matthew Fernandez <matthew.fernandez@gmail.com>
Date: Sat, 16 Oct 2021 05:24:34 +0000 (-0700)
Subject: xml_core: support a mode for escaping UTF-8 characters
X-Git-Tag: 2.50.0~64^2~2
X-Git-Url: https://granicus.if.org/sourcecode?a=commitdiff_plain;h=da5e778ed1747a3b2a71a73e8a685e2a9d6cdfa6;p=graphviz

xml_core: support a mode for escaping UTF-8 characters

This is modeled after `html_string` in the VML plugin and intended to replace
that function in a future commit. It differs from `html_string` in the following
ways:

  * More limited Unicode character detection. `html_string` has a very
    generalized notion of a valid character that extends to lengths beyond what
    UTF-8 allows. This new implementation in `xml_core` adheres more strictly to
    only valid UTF-8 character lengths.

  * Simpler character parsing. `html_string` is written in a style to (1) decode
    character byte length without branching and (2) use the outer loop to also
    loop over the UTF-8 character's bytes. This new implementation in `xml_core`
    uses simpler, more obvious code for decoding the byte length and may consume
    more than one byte of the input per call instead of reusing the outer loop.
    This code is not on a hot path and it is not necessary or helpful to
    micro-optimize the control flow.

  * Hex escapes instead of decimal escapes. `html_string` uses `&#[0-9]+;`
    escapes while this new implementation uses `&#x[0-9a-f]+;` escapes. The
    two forms are of comparable length: hex needs an extra `x` but often one
    fewer digit. A compiler that knows `snprintf` as a built-in (all recent GCC
    and Clang) should also be able to generate a hex escape without using any
    division operations.
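
As a rough illustration of the decoding and escaping described above, here is a
minimal standalone sketch. The helper names `utf8_length`, `utf8_decode` and
`hex_escape` are hypothetical and do not appear in this commit; the logic only
mirrors the approach, it is not the code added to `xml_core`.

```c
#include <assert.h>
#include <inttypes.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

// Hypothetical helper: byte length implied by a UTF-8 lead byte, or 0 if the
// byte cannot start a multi-byte sequence.
static size_t utf8_length(unsigned char lead) {
  if ((lead >> 5) == 0x06)
    return 2; // 110xxxxx
  if ((lead >> 4) == 0x0e)
    return 3; // 1110xxxx
  if ((lead >> 3) == 0x1e)
    return 4; // 11110xxx
  return 0;
}

// Hypothetical helper: decode one multi-byte sequence of `length` bytes.
// Assumes the caller has already checked that all `length` bytes are present.
static uint32_t utf8_decode(const char *s, size_t length) {
  assert(length >= 2 && length <= 4);
  // the lead byte carries 5, 4 or 3 payload bits for lengths 2, 3 and 4
  uint32_t cp = (unsigned char)s[0] & ((1u << (7 - length)) - 1);
  // each continuation byte contributes its low 6 bits
  for (size_t i = 1; i < length; ++i)
    cp = (cp << 6) | ((unsigned char)s[i] & 0x3f);
  return cp;
}

// Hypothetical helper: format a code point as a hex character reference.
// Returns what `snprintf` returns, i.e. the length of the full escape.
static int hex_escape(uint32_t cp, char *out, size_t size) {
  return snprintf(out, size, "&#x%" PRIx32 ";", cp);
}
```

For example, the two bytes `0xc3 0xa9` decode to U+00E9, which this sketch
would escape as `&#xe9;`.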

Note that nothing yet uses this functionality; all existing calls that go
through this code have the `utf8` flag unset.

This code exits with an error on encountering an invalid UTF-8 character. This
is not ideal, but matches `html_string`'s error handling. Perhaps this can be
improved in the future.
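
One possible direction, sketched here purely as an assumption and not part of
this commit, is a predicate that reports a malformed sequence to the caller
instead of exiting. The names `lead_length` and `utf8_sequence_ok` are
hypothetical. Unlike the check in this commit, which only looks for a premature
NUL, this sketch also requires each follow-on byte to have the `10xxxxxx` form.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

// Hypothetical helper: byte length implied by a UTF-8 lead byte, or 0 if the
// byte cannot start a multi-byte sequence.
static size_t lead_length(unsigned char lead) {
  if ((lead >> 5) == 0x06)
    return 2;
  if ((lead >> 4) == 0x0e)
    return 3;
  if ((lead >> 3) == 0x1e)
    return 4;
  return 0;
}

// Hypothetical predicate: does the NUL-terminated string `s` begin with a
// complete multi-byte sequence?
static bool utf8_sequence_ok(const char *s) {
  assert(s != NULL);
  size_t length = lead_length((unsigned char)s[0]);
  if (length == 0)
    return false; // malformed lead byte
  for (size_t i = 1; i < length; ++i) {
    // A NUL fails this test too, so we return before reading past the
    // terminator of a truncated string.
    if (((unsigned char)s[i] & 0xc0) != 0x80)
      return false;
  }
  return true;
}
```

A caller could then decide whether to skip the byte, emit a replacement, or
propagate an error, rather than terminating the process.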

Related to #1868.
---

diff --git a/lib/common/utils.h b/lib/common/utils.h
index 68815bff4..3e4ace495 100644
--- a/lib/common/utils.h
+++ b/lib/common/utils.h
@@ -39,6 +39,8 @@ extern "C" {
       unsigned dash : 1;
       // escape consecutive ' '
       unsigned nbsp : 1;
+      // anticipate non-ASCII characters that need to be encoded
+      unsigned utf8 : 1;
     } xml_flags_t;
 
     UTILS_API nodequeue *new_queue(int);
diff --git a/lib/common/xml.c b/lib/common/xml.c
index 7a6faf9fc..50bf82a02 100644
--- a/lib/common/xml.c
+++ b/lib/common/xml.c
@@ -1,7 +1,12 @@
+#include <cgraph/unreachable.h>
 #include <common/types.h>
 #include <common/utils.h>
 #include <ctype.h>
+#include <inttypes.h>
 #include <stdbool.h>
+#include <stdint.h>
+#include <stdio.h>
+#include <stdlib.h>
 
 // variant of `isalpha` that assumes a C locale
 static bool isalpha_no_locale(char c) {
@@ -62,7 +67,7 @@ static int xml_core(char previous, const char **current, xml_flags_t flags,
   const char *s = *current;
   char c = *s;
 
-  // we always consume one character for now
+  // we will consume at least one character, so note that now
   ++*current;
 
   // escape '&' only if not part of a legal entity sequence
@@ -97,6 +102,76 @@ static int xml_core(char previous, const char **current, xml_flags_t flags,
   if (c == '\r' && flags.raw)
     return cb(state, "&#13;");
 
+  unsigned char uc = (unsigned char)c;
+  if (uc > 0x7f && flags.utf8) {
+
+    // replicating a table from https://en.wikipedia.org/wiki/UTF-8:
+    //
+    //   ┌────────────────┬───────────────┬────────┬────────┬────────┬────────┐
+    //   │First code point│Last code point│Byte 1  │Byte 2  │Byte 3  │Byte 4  │
+    //   ├────────────────┼───────────────┼────────┼────────┼────────┼────────┤
+    //   │          U+0000│         U+007F│0xxxxxxx│        │        │        │
+    //   │          U+0080│         U+07FF│110xxxxx│10xxxxxx│        │        │
+    //   │          U+0800│         U+FFFF│1110xxxx│10xxxxxx│10xxxxxx│        │
+    //   │         U+10000│       U+10FFFF│11110xxx│10xxxxxx│10xxxxxx│10xxxxxx│
+    //   └────────────────┴───────────────┴────────┴────────┴────────┴────────┘
+    //
+    // from which we can calculate the byte length of the current character
+    size_t length =
+        (uc >> 5) == 6 ? 2 : (uc >> 4) == 14 ? 3 : (uc >> 3) == 30 ? 4 : 0;
+
+    // was the length malformed or is the follow on sequence truncated?
+    bool is_invalid = length == 0;
+    for (size_t l = 1; !is_invalid && length > l; ++l)
+      is_invalid |= s[l] == '\0';
+
+    // TODO: a better strategy than aborting on malformed data
+    if (is_invalid) {
+      fprintf(stderr, "Error during conversion to \"UTF-8\". Quiting.\n");
+      exit(EXIT_FAILURE);
+    }
+
+    // Decode the character. Refer again to the above table to understand this
+    // algorithm.
+    uint32_t utf8_char = 0;
+    switch (length) {
+    case 2: {
+      uint32_t low = ((uint32_t)s[1]) & ((1 << 6) - 1);
+      uint32_t high = ((uint32_t)s[0]) & ((1 << 5) - 1);
+      utf8_char = low | (high << 6);
+      break;
+    }
+    case 3: {
+      uint32_t low = ((uint32_t)s[2]) & ((1 << 6) - 1);
+      uint32_t mid = ((uint32_t)s[1]) & ((1 << 6) - 1);
+      uint32_t high = ((uint32_t)s[0]) & ((1 << 4) - 1);
+      utf8_char = low | (mid << 6) | (high << 12);
+      break;
+    }
+    case 4: {
+      uint32_t low = ((uint32_t)s[3]) & ((1 << 6) - 1);
+      uint32_t mid1 = ((uint32_t)s[2]) & ((1 << 6) - 1);
+      uint32_t mid2 = ((uint32_t)s[1]) & ((1 << 6) - 1);
+      uint32_t high = ((uint32_t)s[0]) & ((1 << 3) - 1);
+      utf8_char = low | (mid1 << 6) | (mid2 << 12) | (high << 18);
+      break;
+    }
+    default:
+      UNREACHABLE();
+    }
+
+    // set up a buffer that will fit the largest escape we need to print
+    char buffer[sizeof("&#xFFFFFFFF;")];
+
+    // emit the escape sequence itself
+    snprintf(buffer, sizeof(buffer), "&#x%" PRIx32 ";", utf8_char);
+
+    // note how many extra characters we consumed
+    *current += length - 1;
+
+    return cb(state, buffer);
+  }
+
   // otherwise, output the character as-is
   char buffer[2] = {c, '\0'};
   return cb(state, buffer);