From cf92226e9f7f985a678287167b954a831a3b660c Mon Sep 17 00:00:00 2001
From: Tom Lane
Date: Mon, 20 May 2019 18:39:53 -0400
Subject: [PATCH] Doc: improve description of regexp character classes.

Define the meanings of the POSIX-spec character classes in line,
rather than referring to the ctype(3) man page. That man page
doesn't even exist on many modern systems, and if it does exist
it probably says the wrong things about non-ASCII characters.

Also document our non-POSIX-spec "ascii" character class.

Also, point out here that this behavior is controlled by collation
or LC_CTYPE, since the existing text explaining that is pretty
far away.

Per gripe from Geert Lobbestael. Given the lack of prior complaints,
I'm not excited about back-patching this.

Discussion: https://postgr.es/m/155837022049.1359.2948065118562813468@wrigleys.postgresql.org
---
 doc/src/sgml/func.sgml | 46 +++++++++++++++++++++++++++++-------------
 1 file changed, 32 insertions(+), 14 deletions(-)

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index bc2275c8fe..a79e7c0380 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -5104,18 +5104,37 @@ SELECT foo FROM regexp_split_to_table('the quick brown fox', '\s*') AS foo;
 Within a bracket expression, the name of a character class enclosed in [: and :] stands
- for the list of all characters belonging to that class. Standard
- character class names are: alnum,
- alpha, blank,
- cntrl, digit,
- graph, lower,
- print, punct,
- space, upper,
- xdigit. These stand for the character classes
- defined in
- ctype3.
- A locale can provide others. A character class cannot be used as
- an endpoint of a range.
+ for the list of all characters belonging to that class. A character
+ class cannot be used as an endpoint of a range.
+ The POSIX standard defines these character class
+ names:
+ alnum (letters and numeric digits),
+ alpha (letters),
+ blank (space and tab),
+ cntrl (control characters),
+ digit (numeric digits),
+ graph (printable characters except space),
+ lower (lower-case letters),
+ print (printable characters including space),
+ punct (punctuation),
+ space (any white space),
+ upper (upper-case letters),
+ and xdigit (hexadecimal digits).
+ The behavior of these standard character classes is generally
+ consistent across platforms for characters in the 7-bit ASCII set.
+ Whether a given non-ASCII character is considered to belong to one
+ of these classes depends on the collation
+ that is used for the regular-expression function or operator
+ (see ), or by default on the
+ database's LC_CTYPE locale setting (see
+ ). The classification of non-ASCII
+ characters can vary across platforms even in similarly-named
+ locales. (But the C locale never considers any
+ non-ASCII characters to belong to any of these classes.)
+ In addition to these standard character
+ classes, PostgreSQL defines
+ the ascii character class, which contains exactly
+ the 7-bit ASCII set.
@@ -5126,8 +5145,7 @@ SELECT foo FROM regexp_split_to_table('the quick brown fox', '\s*') AS foo;
 and end of a word respectively. A word is defined as a sequence of word characters that is neither preceded nor followed by word characters. A word character is an alnum character (as
- defined by
- ctype3)
+ defined by the POSIX character class described above)
 or an underscore. This is an extension, compatible with but not specified by POSIX 1003.2, and should be used with caution in software intended to be portable to other systems.
-- 
2.40.0
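
For illustration only, and not part of the patch: the following psql sketch exercises the behavior the new text describes. It assumes a database whose encoding can represent 'é' (for example UTF8); the column aliases are made up for this example. Results are noted only where the documentation text above determines them.

```sql
-- ASCII input: the new text says classification is generally consistent
-- across platforms for the 7-bit ASCII set.
SELECT 'a' ~ '^[[:alpha:]]$'  AS alpha_a,      -- true: 'a' is a letter
       '7' ~ '^[[:xdigit:]]$' AS xdigit_7,     -- true: '7' is a hexadecimal digit
       ' ' ~ '^[[:graph:]]$'  AS graph_space,  -- false: graph excludes space
       ' ' ~ '^[[:print:]]$'  AS print_space;  -- true: print includes space

-- Non-ASCII input under the default collation: the answer depends on the
-- collation in effect (or the database's LC_CTYPE) and can vary by platform,
-- so no expected result is given here.
SELECT 'é' ~ '^[[:alpha:]]$' AS alpha_default;

-- Non-ASCII input with an explicit C collation: per the new text, the C
-- locale never puts a non-ASCII character in any of the standard classes.
SELECT 'é' ~ '^[[:alpha:]]$' COLLATE "C" AS alpha_c;   -- false

-- The PostgreSQL-specific ascii class holds exactly the 7-bit ASCII set.
SELECT 'e' ~ '^[[:ascii:]]$' AS ascii_e,        -- true
       'é' ~ '^[[:ascii:]]$' AS ascii_e_acute;  -- false
```

Attaching COLLATE "C" to the pattern gives the operator an explicit collation, which is how the second and third queries can differ on the same input.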