xml_core: support a mode for escaping UTF-8 characters
This is modeled after `html_string` in the VML plugin and intended to replace
that function in a future commit. It differs from `html_string` in the following
ways:
* More limited Unicode character detection. `html_string` has a very
generalized notion of a valid character that extends to lengths beyond what
UTF-8 allows. This new implementation in `xml_core` adheres more strictly to
only valid UTF-8 character lengths.
* Simpler character parsing. `html_string` is written to (1) decode a
character’s byte length without branching and (2) use the outer loop to also
loop over the UTF-8 character’s bytes. This new implementation in `xml_core`
uses simpler, more obvious code for decoding the byte length and consumes
all of a character’s bytes in one step instead of reusing the outer loop.
This code is not on a hot path, so it is not necessary or helpful to
micro-optimize the control flow.
* Hex escapes instead of decimal escapes. `html_string` uses `&#[0-9]+;`
escapes while this new implementation uses `&#x[0-9a-f]+;` escapes. For
many characters, this results in a shorter sequence. A compiler that knows
`snprintf` as a built-in (all recent GCC and Clang) should also be able to
generate a hex escape without using any division operations.
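To illustrate the three points above, here is a minimal sketch of the
approach, not the code in this commit. The names `utf8_length` and
`escape_utf8_char` are hypothetical. The sketch rejects invalid lead and
continuation bytes by aborting, but as a simplification it does not reject
overlong encodings or surrogate code points.

```c
#include <assert.h>
#include <inttypes.h>
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: byte length of a UTF-8 sequence, judged from its lead
 * byte. Returns 0 for a byte that cannot start a valid sequence. */
size_t utf8_length(unsigned char lead) {
  if ((lead >> 7) == 0x00) return 1; /* 0xxxxxxx */
  if ((lead >> 5) == 0x06) return 2; /* 110xxxxx */
  if ((lead >> 4) == 0x0e) return 3; /* 1110xxxx */
  if ((lead >> 3) == 0x1e) return 4; /* 11110xxx */
  return 0; /* stray continuation byte, or a length UTF-8 does not allow */
}

/* Hypothetical helper: decode the single character starting at `s` and write
 * its `&#x...;` escape into `buf`. Returns the number of input bytes
 * consumed. Aborts on invalid UTF-8. */
size_t escape_utf8_char(const char *s, char *buf, size_t buf_size) {
  /* Payload bits carried by the lead byte, indexed by sequence length. */
  static const unsigned char lead_mask[] = {0x00, 0x7f, 0x1f, 0x0f, 0x07};

  const unsigned char *p = (const unsigned char *)s;
  const size_t length = utf8_length(p[0]);
  if (length == 0) abort();

  /* Consume every byte of the character here rather than in an outer loop. */
  uint32_t code_point = p[0] & lead_mask[length];
  for (size_t i = 1; i < length; ++i) {
    /* A continuation byte must look like 10xxxxxx; this also catches a
     * premature NUL terminator. */
    if ((p[i] & 0xc0) != 0x80) abort();
    code_point = (code_point << 6) | (p[i] & 0x3f);
  }

  /* Hex escape; the longest possible output, "&#x1fffff;", is 11 bytes
   * including the terminator. */
  const int written = snprintf(buf, buf_size, "&#x%" PRIx32 ";", code_point);
  assert(written > 0 && (size_t)written < buf_size);
  (void)written; /* silence unused-variable warnings under NDEBUG */

  return length;
}
```

With this sketch, the two-byte input `"\xc3\xa9"` (U+00E9) would be written
as `&#xe9;`, where a decimal escape would be `&#233;`.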
Note that nothing yet uses this functionality; all existing calls that go
through this code have the `utf8` flag unset.
This code aborts on encountering an invalid UTF-8 character. This is not ideal,
but matches `html_string`’s error handling. Perhaps this can be improved in
the future.