From fbdcab953d086dffd5228e1ff6374cd2b1e8023c Mon Sep 17 00:00:00 2001
From: Alex Dowad
Date: Mon, 9 Nov 2020 21:40:08 +0200
Subject: [PATCH] Unicode -> SJIS-mac conversion doesn't reject valid codepoints after a bad transcoding hint

To give the background on this issue, here is an excerpt from JAPANESE.txt,
from the Unicode Consortium:

    Apple has defined a block of 32 corporate characters as "transcoding
    hints." These are used in combination with standard Unicode characters
    to force them to be treated in a special way for mapping to other
    encodings; they have no other effect. Sixteen of these transcoding
    hints are "grouping hints" - they indicate that the next 2-4 Unicode
    characters should be treated as a single entity for transcoding. The
    other sixteen transcoding hints are "variant tags" - they are like
    combining characters, and can follow a standard Unicode (or a sequence
    consisting of a base character and other combining characters) to
    cause it to be treated in a special way for transcoding. These always
    terminate a combining-character sequence.

    The transcoding coding hints used in this mapping table are:
      0xF860 group next 2 characters as a single entity for transcoding
      0xF861 group next 3 characters as a single entity for transcoding
      0xF862 group next 4 characters as a single entity for transcoding
      0xF87A variant tag for "negative" (i.e. black & white reversed)
      0xF87E variant tag for vertical form
      0xF87F variant tag for other alternate form

    For example, the Apple addition character 0x85AB is Roman numeral
    thirteen. There is no single Unicode for this (although there are
    standard Unicodes for Roman numerals 1-12). Using the grouping hint
    0xF862 in combination with standard Unicodes, we can map this as
    0xF862+0x0058+0x0049+0x0049+0x0049 (i.e. X + I + I + I).

Our SJIS-mac conversion code actually recognizes some special sequences
which start with an Apple 'transcoding hint'.
However, if a transcoding hint is misplaced and is not followed by one of
the expected sequences, we can just emit one error marker for the bad
transcoding hint and then process the following codepoint as normal.
---
 ext/mbstring/libmbfl/filters/mbfilter_sjis_mac.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/ext/mbstring/libmbfl/filters/mbfilter_sjis_mac.c b/ext/mbstring/libmbfl/filters/mbfilter_sjis_mac.c
index 78bf8e3671..45b87a8f98 100644
--- a/ext/mbstring/libmbfl/filters/mbfilter_sjis_mac.c
+++ b/ext/mbstring/libmbfl/filters/mbfilter_sjis_mac.c
@@ -408,6 +408,7 @@ mbfl_filt_conv_wchar_sjis_mac(int c, mbfl_convert_filter *filter)
 		}
 
 		if (c == 0xf860 || c == 0xf861 || c == 0xf862) {
+			/* Apple 'transcoding hint' codepoints (from private use area) */
 			filter->status = 2;
 			filter->cache = c;
 			return c;
@@ -527,8 +528,9 @@ mbfl_filt_conv_wchar_sjis_mac(int c, mbfl_convert_filter *filter)
 		}
 
 		if (filter->status == 0) {
+			/* Didn't find any of expected codepoints after Apple transcoding hint */
 			CK(mbfl_filt_conv_illegal_output(c1, filter));
-			CK(mbfl_filt_conv_illegal_output(c, filter));
+			return mbfl_filt_conv_wchar_sjis_mac(c, filter);
 		}
 		break;
 
-- 
2.40.0
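
The recovery rule described in the commit message (one error marker for the misplaced hint, then the following codepoint is handled from the initial state) can be illustrated with a toy model. This is only a sketch and is not the libmbfl code: `toy_filter`, `toy_feed` and `is_grouping_hint` are hypothetical names, and the single "recognized" sequence (0xF860 followed by 'X') is an assumption made purely for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified model of the recovery behaviour; NOT libmbfl. */
typedef struct {
    int pending_hint; /* buffered grouping hint, or 0 if none */
    int errors;       /* number of error markers emitted */
    int passed;       /* number of codepoints/sequences handled normally */
} toy_filter;

/* The three grouping hints checked in the patched function. */
static int is_grouping_hint(int c)
{
    return c == 0xF860 || c == 0xF861 || c == 0xF862;
}

static void toy_feed(toy_filter *f, int c)
{
    assert(f != NULL);

    if (f->pending_hint) {
        int hint = f->pending_hint;
        f->pending_hint = 0; /* back to the initial state */

        /* Assumed toy rule: only 0xF860 followed by 'X' is "recognized". */
        if (hint == 0xF860 && c == 'X') {
            f->passed++;
            return;
        }

        /* Misplaced hint: emit ONE error marker, for the hint only... */
        f->errors++;
        /* ...then reprocess c as if it had arrived in the initial state. */
        toy_feed(f, c);
        return;
    }

    if (is_grouping_hint(c)) {
        f->pending_hint = c; /* wait for the next codepoint */
        return;
    }

    f->passed++; /* ordinary codepoint */
}
```

In this model, feeding 0xF860 and then 'A' yields one error (for the hint) while 'A' is still converted normally, which mirrors the intent of replacing the second `mbfl_filt_conv_illegal_output` call with a recursive call on `c`.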