operating system C library. These are the locales that most tools
provided by the operating system use. Another provider
is <literal>icu</literal>, which uses the external
- ICU<indexterm><primary>ICU</></> library. Support for ICU has to be
- configured when PostgreSQL is built.
+ ICU<indexterm><primary>ICU</></> library. ICU locales can only be
+ used if support for ICU was configured when PostgreSQL was built.
</para>
<para>
</para>
<para>
- A collation provided by <literal>icu</literal> maps to a named collator
- provided by the ICU library. ICU does not support
- separate <quote>collate</quote> and <quote>ctype</quote> settings, so they
- are always the same. Also, ICU collations are independent of the
- encoding, so there is always only one ICU collation for a given name in a
- database.
+ A collation object provided by <literal>icu</literal> maps to a named
+ collator provided by the ICU library. ICU does not support
+ separate <quote>collate</quote> and <quote>ctype</quote> settings, so
+ they are always the same. Also, ICU collations are independent of the
+ encoding, so there is always only one ICU collation of a given name in
+ a database.
</para>
<sect3>
<para>
If the operating system provides support for using multiple locales
within a single program (<function>newlocale</> and related functions),
- or support for ICU is configured,
+ or if support for ICU is configured,
then when a database cluster is initialized, <command>initdb</command>
populates the system catalog <literal>pg_collation</literal> with
- collations based on all the locales it finds on the operating
+ collations based on all the locales it finds in the operating
system at the time.
</para>
directly to the locales installed in the operating system, which can be
listed using the command <literal>locale -a</literal>. In case
a <literal>libc</literal> collation is needed that has different values
- for <symbol>LC_COLLATE</symbol> and <symbol>LC_CTYPE</symbol>, or new
+ for <symbol>LC_COLLATE</symbol> and <symbol>LC_CTYPE</symbol>, or if new
locales are installed in the operating system after the database system
was initialized, then a new collation may be created using
the <xref linkend="sql-createcollation"> command.
+ New operating system locales can also be imported en masse using
+ the <link linkend="functions-admin-collation"><function>pg_import_system_collations()</function></link> function.
</para>
<para>
Use of the stripped collation names is recommended, since it will
make one less thing you need to change if you decide to change to
another database encoding. Note however that the <literal>default</>,
- <literal>C</>, and <literal>POSIX</> collations, as well as all collations
- provided by ICU can be used regardless of the database encoding.
+ <literal>C</>, and <literal>POSIX</> collations can be used regardless of
+ the database encoding.
</para>
<para>
Collations provided by ICU are created with names in BCP 47 language tag
format, with a <quote>private use</quote>
extension <literal>-x-icu</literal> appended, to distinguish them from
- libc locales. So <literal>de-x-icu</literal> would be an example.
+ libc locales. So <literal>de-x-icu</literal> would be an example name.
</para>
<para>
See <ulink url="http://userguide.icu-project.org/locale"></ulink> for
information on ICU locale naming. <command>initdb</command> uses the ICU
APIs to extract a set of locales with distinct collation rules to populate
- the initial set of collations. Here are some examples collations that
+ the initial set of collations. Here are some example collations that
might be created:
<variablelist>
<listitem>
<para>German collation for Austria, default variant</para>
<para>
- (Note that as of this writing, there is no,
+ (As of this writing, there is no,
say, <literal>de-DE-x-icu</literal> or <literal>de-CH-x-icu</literal>,
because those are equivalent to <literal>de-x-icu</literal>.)
</para>
</para>
<para>
- Some (less frequently used) encodings are not supported by ICU. If the
- database cluster was initialized with such an encoding, no ICU collations
- will be predefined.
+ Some (less frequently used) encodings are not supported by ICU. When the
+ database encoding is one of these, ICU collation entries
+ in <literal>pg_collation</literal> are ignored. Attempting to use one
+ will draw an error along the lines of <quote>collation "de-x-icu" for
+ encoding "WIN874" does not exist</>.
</para>
</sect4>
</sect3>
classification) and <envar>LC_COLLATE</> (string sort order) locale
settings. For <literal>C</> or
<literal>POSIX</> locale, any character set is allowed, but for other
- locales there is only one character set that will work correctly.
+ libc-provided locales there is only one character set that will work
+ correctly.
(On Windows, however, UTF-8 encoding can be used with any locale.)
+ If you have ICU support configured, ICU-provided locales can be used
+ with most but not all server-side encodings.
</para>
<sect2 id="multibyte-charset-supported">
<table id="charset-table">
<title><productname>PostgreSQL</productname> Character Sets</title>
- <tgroup cols="6">
+ <tgroup cols="7">
<thead>
<row>
<entry>Name</entry>
<entry>Description</entry>
<entry>Language</entry>
<entry>Server?</entry>
+ <entry>ICU?</entry>
<!--
The Bytes/Char field is populated by looking at the values returned
by pg_wchar_table.mblen function for each encoding.
<entry>Big Five</entry>
<entry>Traditional Chinese</entry>
<entry>No</entry>
+ <entry>No</entry>
<entry>1-2</entry>
<entry><literal>WIN950</>, <literal>Windows950</></entry>
</row>
<entry>Extended UNIX Code-CN</entry>
<entry>Simplified Chinese</entry>
<entry>Yes</entry>
+ <entry>Yes</entry>
<entry>1-3</entry>
<entry></entry>
</row>
<entry>Extended UNIX Code-JP</entry>
<entry>Japanese</entry>
<entry>Yes</entry>
+ <entry>Yes</entry>
<entry>1-3</entry>
<entry></entry>
</row>
<entry>Extended UNIX Code-JP, JIS X 0213</entry>
<entry>Japanese</entry>
<entry>Yes</entry>
+ <entry>No</entry>
<entry>1-3</entry>
<entry></entry>
</row>
<entry>Extended UNIX Code-KR</entry>
<entry>Korean</entry>
<entry>Yes</entry>
+ <entry>Yes</entry>
<entry>1-3</entry>
<entry></entry>
</row>
<entry>Extended UNIX Code-TW</entry>
<entry>Traditional Chinese, Taiwanese</entry>
<entry>Yes</entry>
+ <entry>Yes</entry>
<entry>1-3</entry>
<entry></entry>
</row>
<entry>National Standard</entry>
<entry>Chinese</entry>
<entry>No</entry>
+ <entry>No</entry>
<entry>1-4</entry>
<entry></entry>
</row>
<entry>Extended National Standard</entry>
<entry>Simplified Chinese</entry>
<entry>No</entry>
+ <entry>No</entry>
<entry>1-2</entry>
<entry><literal>WIN936</>, <literal>Windows936</></entry>
</row>
<entry>ISO 8859-5, <acronym>ECMA</> 113</entry>
<entry>Latin/Cyrillic</entry>
<entry>Yes</entry>
+ <entry>Yes</entry>
<entry>1</entry>
<entry></entry>
</row>
<entry>ISO 8859-6, <acronym>ECMA</> 114</entry>
<entry>Latin/Arabic</entry>
<entry>Yes</entry>
+ <entry>Yes</entry>
<entry>1</entry>
<entry></entry>
</row>
<entry>ISO 8859-7, <acronym>ECMA</> 118</entry>
<entry>Latin/Greek</entry>
<entry>Yes</entry>
+ <entry>Yes</entry>
<entry>1</entry>
<entry></entry>
</row>
<entry>ISO 8859-8, <acronym>ECMA</> 121</entry>
<entry>Latin/Hebrew</entry>
<entry>Yes</entry>
+ <entry>Yes</entry>
<entry>1</entry>
<entry></entry>
</row>
<entry><acronym>JOHAB</></entry>
<entry>Korean (Hangul)</entry>
<entry>No</entry>
+ <entry>No</entry>
<entry>1-3</entry>
<entry></entry>
</row>
<entry><acronym>KOI</acronym>8-R</entry>
<entry>Cyrillic (Russian)</entry>
<entry>Yes</entry>
+ <entry>Yes</entry>
<entry>1</entry>
<entry><literal>KOI8</></entry>
</row>
<entry><acronym>KOI</acronym>8-U</entry>
<entry>Cyrillic (Ukrainian)</entry>
<entry>Yes</entry>
+ <entry>Yes</entry>
<entry>1</entry>
<entry></entry>
</row>
<entry>ISO 8859-1, <acronym>ECMA</> 94</entry>
<entry>Western European</entry>
<entry>Yes</entry>
+ <entry>Yes</entry>
<entry>1</entry>
<entry><literal>ISO88591</></entry>
</row>
<entry>ISO 8859-2, <acronym>ECMA</> 94</entry>
<entry>Central European</entry>
<entry>Yes</entry>
+ <entry>Yes</entry>
<entry>1</entry>
<entry><literal>ISO88592</></entry>
</row>
<entry>ISO 8859-3, <acronym>ECMA</> 94</entry>
<entry>South European</entry>
<entry>Yes</entry>
+ <entry>Yes</entry>
<entry>1</entry>
<entry><literal>ISO88593</></entry>
</row>
<entry>ISO 8859-4, <acronym>ECMA</> 94</entry>
<entry>North European</entry>
<entry>Yes</entry>
+ <entry>Yes</entry>
<entry>1</entry>
<entry><literal>ISO88594</></entry>
</row>
<entry>ISO 8859-9, <acronym>ECMA</> 128</entry>
<entry>Turkish</entry>
<entry>Yes</entry>
+ <entry>Yes</entry>
<entry>1</entry>
<entry><literal>ISO88599</></entry>
</row>
<entry>ISO 8859-10, <acronym>ECMA</> 144</entry>
<entry>Nordic</entry>
<entry>Yes</entry>
+ <entry>Yes</entry>
<entry>1</entry>
<entry><literal>ISO885910</></entry>
</row>
<entry>ISO 8859-13</entry>
<entry>Baltic</entry>
<entry>Yes</entry>
+ <entry>Yes</entry>
<entry>1</entry>
<entry><literal>ISO885913</></entry>
</row>
<entry>ISO 8859-14</entry>
<entry>Celtic</entry>
<entry>Yes</entry>
+ <entry>Yes</entry>
<entry>1</entry>
<entry><literal>ISO885914</></entry>
</row>
<entry>ISO 8859-15</entry>
<entry>LATIN1 with Euro and accents</entry>
<entry>Yes</entry>
+ <entry>Yes</entry>
<entry>1</entry>
<entry><literal>ISO885915</></entry>
</row>
<entry>ISO 8859-16, <acronym>ASRO</> SR 14111</entry>
<entry>Romanian</entry>
<entry>Yes</entry>
+ <entry>No</entry>
<entry>1</entry>
<entry><literal>ISO885916</></entry>
</row>
<entry>Mule internal code</entry>
<entry>Multilingual Emacs</entry>
<entry>Yes</entry>
+ <entry>No</entry>
<entry>1-4</entry>
<entry></entry>
</row>
<entry>Shift JIS</entry>
<entry>Japanese</entry>
<entry>No</entry>
+ <entry>No</entry>
<entry>1-2</entry>
<entry><literal>Mskanji</>, <literal>ShiftJIS</>, <literal>WIN932</>, <literal>Windows932</></entry>
</row>
<entry>Shift JIS, JIS X 0213</entry>
<entry>Japanese</entry>
<entry>No</entry>
+ <entry>No</entry>
<entry>1-2</entry>
<entry></entry>
</row>
<entry>unspecified (see text)</entry>
<entry><emphasis>any</></entry>
<entry>Yes</entry>
+ <entry>No</entry>
<entry>1</entry>
<entry></entry>
</row>
<entry>Unified Hangul Code</entry>
<entry>Korean</entry>
<entry>No</entry>
+ <entry>No</entry>
<entry>1-2</entry>
<entry><literal>WIN949</>, <literal>Windows949</></entry>
</row>
<entry>Unicode, 8-bit</entry>
<entry><emphasis>all</></entry>
<entry>Yes</entry>
+ <entry>Yes</entry>
<entry>1-4</entry>
<entry><literal>Unicode</></entry>
</row>
<entry>Windows CP866</entry>
<entry>Cyrillic</entry>
<entry>Yes</entry>
+ <entry>Yes</entry>
<entry>1</entry>
<entry><literal>ALT</></entry>
</row>
<entry>Windows CP874</entry>
<entry>Thai</entry>
<entry>Yes</entry>
+ <entry>No</entry>
<entry>1</entry>
<entry></entry>
</row>
<entry>Windows CP1250</entry>
<entry>Central European</entry>
<entry>Yes</entry>
+ <entry>Yes</entry>
<entry>1</entry>
<entry></entry>
</row>
<entry>Windows CP1251</entry>
<entry>Cyrillic</entry>
<entry>Yes</entry>
+ <entry>Yes</entry>
<entry>1</entry>
<entry><literal>WIN</></entry>
</row>
<entry>Windows CP1252</entry>
<entry>Western European</entry>
<entry>Yes</entry>
+ <entry>Yes</entry>
<entry>1</entry>
<entry></entry>
</row>
<entry>Windows CP1253</entry>
<entry>Greek</entry>
<entry>Yes</entry>
+ <entry>Yes</entry>
<entry>1</entry>
<entry></entry>
</row>
<entry>Windows CP1254</entry>
<entry>Turkish</entry>
<entry>Yes</entry>
+ <entry>Yes</entry>
<entry>1</entry>
<entry></entry>
</row>
<entry>Windows CP1255</entry>
<entry>Hebrew</entry>
<entry>Yes</entry>
+ <entry>Yes</entry>
<entry>1</entry>
<entry></entry>
</row>
<entry>Windows CP1256</entry>
<entry>Arabic</entry>
<entry>Yes</entry>
+ <entry>Yes</entry>
<entry>1</entry>
<entry></entry>
</row>
<entry>Windows CP1257</entry>
<entry>Baltic</entry>
<entry>Yes</entry>
+ <entry>Yes</entry>
<entry>1</entry>
<entry></entry>
</row>
<entry>Windows CP1258</entry>
<entry>Vietnamese</entry>
<entry>Yes</entry>
+ <entry>Yes</entry>
<entry>1</entry>
<entry><literal>ABC</>, <literal>TCVN</>, <literal>TCVN5712</>, <literal>VSCII</></entry>
</row>
return visible;
}
+/*
+ * lookup_collation
+ * If there's a collation of the given name/namespace, and it works
+ * with the given encoding, return its OID. Else return InvalidOid.
+ */
+static Oid
+lookup_collation(const char *collname, Oid collnamespace, int32 encoding)
+{
+ Oid collid;
+ HeapTuple colltup;
+ Form_pg_collation collform;
+
+ /* Check for encoding-specific entry (exact match) */
+ collid = GetSysCacheOid3(COLLNAMEENCNSP,
+ PointerGetDatum(collname),
+ Int32GetDatum(encoding),
+ ObjectIdGetDatum(collnamespace));
+ if (OidIsValid(collid))
+ return collid;
+
+ /*
+ * Check for any-encoding entry. This takes a bit more work: while libc
+ * collations with collencoding = -1 do work with all encodings, ICU
+ * collations only work with certain encodings, so we have to check that
+ * aspect before deciding it's a match.
+ */
+ colltup = SearchSysCache3(COLLNAMEENCNSP,
+ PointerGetDatum(collname),
+ Int32GetDatum(-1),
+ ObjectIdGetDatum(collnamespace));
+ if (!HeapTupleIsValid(colltup))
+ return InvalidOid;
+ collform = (Form_pg_collation) GETSTRUCT(colltup);
+ if (collform->collprovider == COLLPROVIDER_ICU)
+ {
+ if (is_encoding_supported_by_icu(encoding))
+ collid = HeapTupleGetOid(colltup);
+ else
+ collid = InvalidOid;
+ }
+ else
+ {
+ collid = HeapTupleGetOid(colltup);
+ }
+ ReleaseSysCache(colltup);
+ return collid;
+}
+
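The two-step lookup added above (exact-encoding entry first, then the any-encoding entry with an extra ICU check) can be sketched outside the backend. This is only an illustrative model, not the patch's code: the `Demo*`/`demo_*` names, the encoding ids, and the in-memory table are all hypothetical stand-ins for the syscache lookups and `is_encoding_supported_by_icu()`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical encoding ids; the values are arbitrary, not PostgreSQL's. */
enum { DEMO_ANY_ENCODING = -1, DEMO_UTF8 = 1, DEMO_WIN874 = 2 };

typedef struct
{
	unsigned	oid;		/* 0 plays the role of InvalidOid */
	const char *name;
	int			encoding;	/* DEMO_ANY_ENCODING or a specific encoding */
	char		provider;	/* 'c' = libc, 'i' = ICU */
} DemoCollation;

/* Tiny stand-in for pg_collation. */
static const DemoCollation demo_catalog[] = {
	{100, "C", DEMO_ANY_ENCODING, 'c'},
	{101, "de_DE", DEMO_UTF8, 'c'},
	{102, "de-x-icu", DEMO_ANY_ENCODING, 'i'},
};

/* Assumption for the sketch: ICU handles everything except DEMO_WIN874. */
static bool
demo_icu_supports(int encoding)
{
	return encoding != DEMO_WIN874;
}

/* Find an entry by exact (name, encoding) key, like a syscache probe. */
static const DemoCollation *
demo_find(const char *name, int encoding)
{
	size_t		i;

	for (i = 0; i < sizeof(demo_catalog) / sizeof(demo_catalog[0]); i++)
	{
		if (demo_catalog[i].encoding == encoding &&
			strcmp(demo_catalog[i].name, name) == 0)
			return &demo_catalog[i];
	}
	return NULL;
}

/* Return the collation's oid if it is usable with "encoding", else 0. */
static unsigned
demo_lookup_collation(const char *name, int encoding)
{
	const DemoCollation *c;

	/* Step 1: encoding-specific entry (exact match). */
	c = demo_find(name, encoding);
	if (c != NULL)
		return c->oid;

	/* Step 2: any-encoding entry; ICU entries need an extra check. */
	c = demo_find(name, DEMO_ANY_ENCODING);
	if (c == NULL)
		return 0;
	if (c->provider == 'i' && !demo_icu_supports(encoding))
		return 0;
	return c->oid;
}

/* Usage example: the ICU entry disappears under an unsupported encoding. */
static void
demo_self_check(void)
{
	assert(demo_lookup_collation("de-x-icu", DEMO_UTF8) == 102);
	assert(demo_lookup_collation("de-x-icu", DEMO_WIN874) == 0);
	assert(demo_lookup_collation("C", DEMO_WIN874) == 100);
}
```

In this model a libc entry with the any-encoding marker is always returned, while the ICU entry is treated as nonexistent under an unsupported encoding, which is the behavior the documentation change describes for `de-x-icu` under `WIN874`.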
/*
* CollationGetCollid
* Try to resolve an unqualified collation name.
* Returns OID if collation found in search path, else InvalidOid.
+ *
+ * Note that this will only find collations that work with the current
+ * database's encoding.
*/
Oid
CollationGetCollid(const char *collname)
if (namespaceId == myTempNamespace)
continue; /* do not look in temp namespace */
- /* Check for database-encoding-specific entry */
- collid = GetSysCacheOid3(COLLNAMEENCNSP,
- PointerGetDatum(collname),
- Int32GetDatum(dbencoding),
- ObjectIdGetDatum(namespaceId));
- if (OidIsValid(collid))
- return collid;
-
- /* Check for any-encoding entry */
- collid = GetSysCacheOid3(COLLNAMEENCNSP,
- PointerGetDatum(collname),
- Int32GetDatum(-1),
- ObjectIdGetDatum(namespaceId));
+ collid = lookup_collation(collname, namespaceId, dbencoding);
if (OidIsValid(collid))
return collid;
}
* Determine whether a collation (identified by OID) is visible in the
* current search path. Visible means "would be found by searching
* for the unqualified collation name".
+ *
+ * Note that only collations that work with the current database's encoding
+ * will be considered visible.
*/
bool
CollationIsVisible(Oid collid)
{
/*
* If it is in the path, it might still not be visible; it could be
- * hidden by another conversion of the same name earlier in the path.
- * So we must do a slow check to see if this conversion would be found
- * by CollationGetCollid.
+ * hidden by another collation of the same name earlier in the path,
+ * or it might not work with the current DB encoding. So we must do a
+ * slow check to see if this collation would be found by
+ * CollationGetCollid.
*/
char *collname = NameStr(collform->collname);
/*
* get_collation_oid - find a collation by possibly qualified name
+ *
+ * Note that this will only find collations that work with the current
+ * database's encoding.
*/
Oid
get_collation_oid(List *name, bool missing_ok)
if (missing_ok && !OidIsValid(namespaceId))
return InvalidOid;
- /* first try for encoding-specific entry, then any-encoding */
- colloid = GetSysCacheOid3(COLLNAMEENCNSP,
- PointerGetDatum(collation_name),
- Int32GetDatum(dbencoding),
- ObjectIdGetDatum(namespaceId));
- if (OidIsValid(colloid))
- return colloid;
- colloid = GetSysCacheOid3(COLLNAMEENCNSP,
- PointerGetDatum(collation_name),
- Int32GetDatum(-1),
- ObjectIdGetDatum(namespaceId));
+ colloid = lookup_collation(collation_name, namespaceId, dbencoding);
if (OidIsValid(colloid))
return colloid;
}
if (namespaceId == myTempNamespace)
continue; /* do not look in temp namespace */
- colloid = GetSysCacheOid3(COLLNAMEENCNSP,
- PointerGetDatum(collation_name),
- Int32GetDatum(dbencoding),
- ObjectIdGetDatum(namespaceId));
- if (OidIsValid(colloid))
- return colloid;
- colloid = GetSysCacheOid3(COLLNAMEENCNSP,
- PointerGetDatum(collation_name),
- Int32GetDatum(-1),
- ObjectIdGetDatum(namespaceId));
+ colloid = lookup_collation(collation_name, namespaceId, dbencoding);
if (OidIsValid(colloid))
return colloid;
}
}
+/*
+ * Check a string to see if it is pure ASCII
+ */
+static bool
+is_all_ascii(const char *str)
+{
+ while (*str)
+ {
+ if (IS_HIGHBIT_SET(*str))
+ return false;
+ str++;
+ }
+ return true;
+}
+
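For readers without the backend headers at hand, here is a standalone sketch of the same check. It assumes `IS_HIGHBIT_SET` tests bit 0x80 of the byte after a cast to `unsigned char`; the `DEMO_`/`demo_` names are hypothetical and exist only so the snippet compiles on its own.

```c
#include <assert.h>
#include <stdbool.h>

/* Local stand-in for the backend macro; name is hypothetical. */
#define DEMO_HIGHBIT 0x80
#define DEMO_IS_HIGHBIT_SET(ch) ((unsigned char) (ch) & DEMO_HIGHBIT)

/* Return true if every byte of the NUL-terminated string is 7-bit ASCII. */
static bool
demo_is_all_ascii(const char *str)
{
	while (*str)
	{
		if (DEMO_IS_HIGHBIT_SET(*str))
			return false;
		str++;
	}
	return true;
}

/* Usage example: a UTF-8 encoded c-cedilla makes the string non-ASCII. */
static void
demo_ascii_self_check(void)
{
	assert(demo_is_all_ascii("de_DE.utf8"));
	assert(!demo_is_all_ascii("fran\xc3\xa7" "ais"));
}
```

The cast to `unsigned char` matters on platforms where plain `char` is signed: the test then works on the raw byte value rather than on a negative number.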
/* will we use "locale -a" in pg_import_system_collations? */
#if defined(HAVE_LOCALE_T) && !defined(WIN32)
#define READ_LOCALE_A_OUTPUT
/*
* Get a comment (specifically, the display name) for an ICU locale.
- * The result is a palloc'd string.
+ * The result is a palloc'd string, or NULL if we can't get a comment
+ * or find that it's not all ASCII. (We can *not* accept non-ASCII
+ * comments, because the contents of template0 must be encoding-agnostic.)
*/
static char *
get_icu_locale_comment(const char *localename)
UErrorCode status;
UChar displayname[128];
int32 len_uchar;
+ int32 i;
char *result;
status = U_ZERO_ERROR;
displayname, lengthof(displayname),
&status);
if (U_FAILURE(status))
- ereport(ERROR,
- (errmsg("could not get display name for locale \"%s\": %s",
- localename, u_errorName(status))));
+ return NULL; /* no good reason to raise an error */
+
+ /* Check for non-ASCII comment (can't use is_all_ascii: this is UChar data) */
+ for (i = 0; i < len_uchar; i++)
+ {
+ if (displayname[i] > 127)
+ return NULL;
+ }
- icu_from_uchar(&result, displayname, len_uchar);
+ /* OK, transcribe */
+ result = palloc(len_uchar + 1);
+ for (i = 0; i < len_uchar; i++)
+ result[i] = displayname[i];
+ result[len_uchar] = '\0';
return result;
}
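The check-then-transcribe step in get_icu_locale_comment can be tried in isolation. The sketch below is an approximation under stated assumptions: `uint16_t` stands in for ICU's `UChar`, `malloc` stands in for `palloc`, and `demo_utf16_to_ascii` is a hypothetical name, not a function in the patch.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/*
 * Convert "len" UTF-16 code units to a NUL-terminated ASCII string.
 * Returns NULL if any code unit is outside the ASCII range (or if
 * allocation fails); the caller frees the result.
 */
static char *
demo_utf16_to_ascii(const uint16_t *src, int32_t len)
{
	char	   *result;
	int32_t		i;

	/* Reject the whole string if any unit is non-ASCII. */
	for (i = 0; i < len; i++)
	{
		if (src[i] > 127)
			return NULL;
	}

	/* Every unit fits in one byte, so a plain narrowing copy is enough. */
	result = malloc((size_t) len + 1);
	if (result == NULL)
		return NULL;
	for (i = 0; i < len; i++)
		result[i] = (char) src[i];
	result[len] = '\0';
	return result;
}

/* Usage example: ASCII input is copied, non-ASCII input is refused. */
static void
demo_transcribe_self_check(void)
{
	const uint16_t ok[] = {'G', 'e', 'r', 'm', 'a', 'n'};
	const uint16_t bad[] = {'f', 0x00E7};
	char	   *s = demo_utf16_to_ascii(ok, 6);

	assert(s != NULL && strcmp(s, "German") == 0);
	free(s);
	assert(demo_utf16_to_ascii(bad, 2) == NULL);
}
```

Because the result is pure ASCII, it reads the same in any server encoding, which is why no encoding conversion is needed once the check has passed.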
{
size_t len;
int enc;
- bool skip;
char alias[NAMEDATALEN];
len = strlen(localebuf);
* interpret the non-ASCII characters. We can't do much with
* those, so we filter them out.
*/
- skip = false;
- for (i = 0; i < len; i++)
- {
- if (IS_HIGHBIT_SET(localebuf[i]))
- {
- skip = true;
- break;
- }
- }
- if (skip)
+ if (!is_all_ascii(localebuf))
{
elog(DEBUG1, "locale name has non-ASCII characters, skipped: \"%s\"", localebuf);
continue;
/* Load collations known to ICU */
#ifdef USE_ICU
- if (!is_encoding_supported_by_icu(GetDatabaseEncoding()))
- {
- ereport(NOTICE,
- (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
- errmsg("encoding \"%s\" not supported by ICU",
- pg_encoding_to_char(GetDatabaseEncoding()))));
- }
- else
{
int i;
{
const char *name;
char *langtag;
+ char *icucomment;
const char *collcollate;
UEnumeration *en;
UErrorCode status;
langtag = get_icu_language_tag(name);
collcollate = U_ICU_VERSION_MAJOR_NUM >= 54 ? langtag : name;
+
+ /*
+ * Be paranoid about not allowing any non-ASCII strings into
+ * pg_collation
+ */
+ if (!is_all_ascii(langtag) || !is_all_ascii(collcollate))
+ continue;
+
collid = CollationCreate(psprintf("%s-x-icu", langtag),
nspid, GetUserId(),
COLLPROVIDER_ICU, -1,
CommandCounterIncrement();
- CreateComments(collid, CollationRelationId, 0,
- get_icu_locale_comment(name));
+ icucomment = get_icu_locale_comment(name);
+ if (icucomment)
+ CreateComments(collid, CollationRelationId, 0,
+ icucomment);
}
/*
langtag = get_icu_language_tag(localeid);
collcollate = U_ICU_VERSION_MAJOR_NUM >= 54 ? langtag : localeid;
+
+ /*
+ * Be paranoid about not allowing any non-ASCII strings into
+ * pg_collation
+ */
+ if (!is_all_ascii(langtag) || !is_all_ascii(collcollate))
+ continue;
+
collid = CollationCreate(psprintf("%s-x-icu", langtag),
nspid, GetUserId(),
COLLPROVIDER_ICU, -1,
CommandCounterIncrement();
- CreateComments(collid, CollationRelationId, 0,
- get_icu_locale_comment(localeid));
+ icucomment = get_icu_locale_comment(localeid);
+ if (icucomment)
+ CreateComments(collid, CollationRelationId, 0,
+ icucomment);
}
}
if (U_FAILURE(status))