From a5be6e9a1dfe820807f9ccb21dec5144982618e6 Mon Sep 17 00:00:00 2001 From: Heikki Linnakangas Date: Fri, 25 Jan 2019 14:54:38 +0200 Subject: [PATCH] Fix comments that claimed that mblen() only looks at first byte. GB18030's mblen() function looks at the first and the second byte of the multibyte character, to determine its length. copy.c had made the assumption that mblen() only looks at the first byte, but it turns out to work out fine, because of the way the GB18030 encoding works. COPY will see a 4-byte encoded character as two 2-byte encoded characters, which is enough for COPY's purposes. It cannot mix those up with delimiter or escaping characters, because only single-byte ASCII characters are supported as delimiters or escape characters. Discussion: https://www.postgresql.org/message-id/7704d099-9643-2a55-fb0e-becd64400dcb%40iki.fi --- src/backend/commands/copy.c | 7 ++++++- src/backend/utils/mb/wchar.c | 32 +++++++++++++++++++++++++------- 2 files changed, 31 insertions(+), 8 deletions(-) diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c index 03745cca75..1c90934d97 100644 --- a/src/backend/commands/copy.c +++ b/src/backend/commands/copy.c @@ -4121,9 +4121,14 @@ not_end_of_copy: { int mblen; + /* + * It is enough to look at the first byte in all our encodings, to + * get the length. (GB18030 is a bit special, but still works for + * our purposes; see comment in pg_gb18030_mblen()) + */ mblen_str[0] = c; - /* All our encodings only read the first byte to get the length */ mblen = pg_encoding_mblen(cstate->file_encoding, mblen_str); + IF_NEED_REFILL_AND_NOT_EOF_CONTINUE(mblen - 1); IF_NEED_REFILL_AND_EOF_BREAK(mblen - 1); raw_buf_ptr += mblen - 1; diff --git a/src/backend/utils/mb/wchar.c b/src/backend/utils/mb/wchar.c index a5fdda456e..8e5116dfc1 100644 --- a/src/backend/utils/mb/wchar.c +++ b/src/backend/utils/mb/wchar.c @@ -15,16 +15,23 @@ /* - * conversion to pg_wchar is done by "table driven." - * to add an encoding support, define mb2wchar_with_len(), mblen(), dsplen() - * for the particular encoding. Note that if the encoding is only - * supported in the client, you don't need to define - * mb2wchar_with_len() function (SJIS is the case). + * Operations on multi-byte encodings are driven by a table of helper + * functions. + * + * To add an encoding support, define mblen(), dsplen() and verifier() for + * the encoding. For server-encodings, also define mb2wchar() and wchar2mb() + * conversion functions. * * These functions generally assume that their input is validly formed. * The "verifier" functions, further down in the file, have to be more - * paranoid. We expect that mblen() does not need to examine more than - * the first byte of the character to discover the correct length. + * paranoid. + * + * We expect that mblen() does not need to examine more than the first byte + * of the character to discover the correct length. GB18030 is an exception + * to that rule, though, as it also looks at second byte. But even that + * behaves in a predictable way, if you only pass the first byte: it will + * treat 4-byte encoded characters as two 2-byte encoded characters, which is + * good enough for all current uses. * * Note: for the display output of psql to work properly, the return values * of the dsplen functions must conform to the Unicode standard. In particular @@ -1073,6 +1080,17 @@ pg_uhc_dsplen(const unsigned char *s) * GB18030 * Added by Bill Huang , */ + +/* + * Unlike all other mblen() functions, this also looks at the second byte of + * the input. However, if you only pass the first byte of a multi-byte + * string, and \0 as the second byte, this still works in a predictable way: + * a 4-byte character will be reported as two 2-byte characters. That's + * enough for all current uses, as a client-only encoding. It works that + * way, because in any valid 4-byte GB18030-encoded character, the third and + * fourth byte look like a 2-byte encoded character, when looked at + * separately. + */ static int pg_gb18030_mblen(const unsigned char *s) { -- 2.40.0