1 <!-- $Header: /cvsroot/pgsql/doc/src/sgml/charset.sgml,v 2.3 2000/09/30 16:58:20 petere Exp $ -->
8 Describes the available localization features from the point of
9 view of the administrator.
14 <productname>Postgres</productname> supports localization with
20 Using the locale features of the operating system to provide
21 locale-specific collation order, number formatting, and other
28 Using explicit multiple-byte character sets defined in the
29 <productname>Postgres</productname> server to support languages
30 that require more characters than will fit into a single byte,
31 and to provide character set recoding between client and server.
32 The number of supported character sets is fixed at the time the
33 server is compiled, and internal operations such as string
34 comparisons require expansion of each character into a 32-bit
41 Single byte character recoding provides a more light-weight
42 solution for users of multiple, yet single-byte character sets.
50 <title>Locale Support</title>
53 <firstterm>Locale</> support refers to an application respecting
54 cultural preferences regarding alphabets, sorting, number
55 formatting, etc. <productname>PostgreSQL</> uses the standard ISO
56 C and POSIX-like locale facilities provided by the server operating
57 system. For additional information refer the documentation of your
65 Locale support is not build into <productname>PostgreSQL</> by
66 default; to enable it, supply the <option>--enable-locale</> option
67 to the <filename>configure</> script:
70 <prompt>$ </><userinput>./configure --enable-locale</>
73 Locale support only affects the server; all clients are compatible
74 with servers with or without locale support.
78 The information about which particular cultural rules to use is
79 determined by standard environment variables. If you are getting
80 localized behavior from other programs you probably have them set
81 up already. The simplest way to set the localization information
82 is the <envar>LANG</> variable, for example:
86 This sets the locale to Swedish (<literal>sv</>) as spoken in
87 Sweden (<literal>SE</>). Other possibilities might be
88 <literal>en_US</> (U.S. English) and <literal>fr_CA</> (Canada,
89 French). If more than one character set can be useful for a locale
90 then the specifications look like this:
91 <literal>cs_CZ.ISO8859-2</>. What locales are available under what
92 names on your system depends on what was provided by the operating
93 system vendor and what was installed.
97 Occasionally it is useful to mix rules from several locales, e.g.,
98 use U.S. rules but Spanish messages. To do that a set of
99 environment variables exist that override the default of
100 <envar>LANG</> for a particular category:
107 <entry>String sort order</>
111 <entry>Character classification (What is a letter? What is the upper-case equivalent of this letter?)</>
114 <entry>LC_MESSAGES</>
115 <entry>Language of messages</>
118 <entry>LC_MONETARY</>
119 <entry>Formatting of currency amounts</>
123 <entry>Formatting of numbers</>
127 <entry>Formatting of dates and times</>
133 <envar>LC_MESSAGES</> only affects the messages that come from the
134 operating system, not <productname>PostgreSQL</>.
138 If you want the system to behave as if it had no locale support,
139 use the special locale <literal>C</> or <literal>POSIX</>, or
140 simply unset all locale related variables.
144 Once you have chosen a set of localization rules this way you must
145 keep them fixed for any particular database cluster. That means
146 that the locales that were active when you ran <filename>initdb</>
147 must be kept the same when you start the postmaster. Otherwise,
148 the changed sort order can corrupt indexes or make your data
149 disappear mysteriously. It is currently not possible to change the
150 locales after database initialization or to use more than one set
151 of locales for a given database cluster.
159 Locale support influences in particular the following features:
164 Sort order in <command>ORDER BY</> queries.
170 The <function>to_char</> family of functions
176 The <literal>LIKE</> and <literal>~</> operators for pattern
184 The only severe drawback of using the locale support in
185 <productname>PostgreSQL</> is its speed. So use locale only if you
194 If locale support doesn't work in spite of the explanation above,
195 check that the locale support in your operating system is okay.
196 To check whether a given locale is installed and functional you
197 can use <application>Perl</>, for example. Perl has also support
198 for locales and if a locale is broken <command>perl -v</> will
199 complain something like this:
201 <prompt>$</> <userinput>export LC_CTYPE='not_exist'</>
202 <prompt>$</> <userinput>perl -v</>
204 perl: warning: Setting locale failed.
205 perl: warning: Please check that your locale settings:
207 LC_CTYPE = "not_exist",
209 are supported and installed on your system.
210 perl: warning: Falling back to the standard locale ("C").
216 Check that your locale files are in the right location. Possible
217 locations include: <filename>/usr/lib/locale</filename> (Linux,
218 Solaris), <filename>/usr/share/locale</filename> (Linux),
219 <filename>/usr/lib/nls/loc</filename> (DUX 4.0). Check the locale
220 man page of your system if you are not sure.
224 The directory <filename>src/test/locale</> contains a test suite
225 for <productname>PostgreSQL</>'s locale support.
231 <sect1 id="multibyte">
232 <title>Multibyte Support</title>
235 <title>Author</title>
238 <ulink url="mailto:ishii@postgresql.org">Tatsuo Ishii</ulink>,
239 last updated 2000-03-22.
241 url="http://www.sra.co.jp/people/t-ishii/PostgreSQL/">Tatsuo's
242 web site</ulink> for more information.
247 Multibyte (<acronym>MB</acronym>) support is intended to allow
248 <productname>Postgres</productname> to handle
249 multiple-byte character sets such as EUC (Extended Unix Code), Unicode and
250 Mule internal code. With <acronym>MB</acronym> enabled you can use multi-byte
251 character sets in regular expressions (regexp), LIKE, and some
252 other functions. The default
253 encoding system is selected while initializing your
254 <productname>Postgres</productname> installation using
255 <application>initdb</application>. Note that this can be
256 overridden when you create a database using
257 <application>createdb</application> or by using the SQL command
258 CREATE DATABASE. So you can have multiple databases each with
259 a different encoding system.
263 <acronym>MB</acronym> also fixes some problems concerning 8-bit single byte
264 character sets including ISO8859. (I would not say all of problems
265 have been fixed. I just confirmed that the regression test ran fine
266 and a few French characters could be used with the patch. Please let
267 me know if you find any problem while using 8-bit characters.)
271 <title>Enabling MB</title>
274 Run configure with a multibyte option:
277 % ./configure --enable-multibyte[=<replaceable>encoding_system</replaceable>]
280 where <replaceable>encoding_system</replaceable> can be one of the
281 values in the following table:
284 <title><productname>Postgres</productname> Character Set Encodings</title>
285 <titleabbrev>Encodings</titleabbrev>
289 <entry>Encoding</entry>
290 <entry>Description</entry>
295 <entry>SQL_ASCII</entry>
299 <entry>EUC_JP</entry>
300 <entry>Japanese EUC</entry>
303 <entry>EUC_CN</entry>
304 <entry>Chinese EUC</entry>
307 <entry>EUC_KR</entry>
308 <entry>Korean EUC</entry>
311 <entry>EUC_TW</entry>
312 <entry>Taiwan EUC</entry>
315 <entry>UNICODE</entry>
316 <entry>Unicode(UTF-8)</entry>
319 <entry>MULE_INTERNAL</entry>
320 <entry>Mule internal</entry>
323 <entry>LATIN1</entry>
324 <entry>ISO 8859-1 English and some European languages</entry>
327 <entry>LATIN2</entry>
328 <entry>ISO 8859-2 English and some European languages</entry>
331 <entry>LATIN3</entry>
332 <entry>ISO 8859-3 English and some European languages</entry>
335 <entry>LATIN4</entry>
336 <entry>ISO 8859-4 English and some European languages</entry>
339 <entry>LATIN5</entry>
340 <entry>ISO 8859-5 English and some European languages</entry>
344 <entry>KOI8-R</entry>
348 <entry>Windows CP1251</entry>
352 <entry>Windows CP866</entry>
360 Here is an example of configuring
361 <productname>Postgres</productname> to use a Japanese encoding by
365 % ./configure --enable-multibyte=EUC_JP
370 If the encoding system is omitted (./configure --enable-multibyte),
371 SQL_ASCII is assumed.
376 <title>Setting the Encoding</title>
379 <application>initdb</application> defines the default encoding
380 for a <productname>Postgres</productname> installation. For example:
386 sets the default encoding to EUC_JP(Extended Unix Code for Japanese).
387 Note that you can use "--encoding" instead of "-E" if you prefer
388 to type longer option strings.
389 If no -E or --encoding option is given, the encoding
390 specified at the compile time is used.
394 You can create a database with a different encoding:
397 % createdb -E EUC_KR korean
400 will create a database named "korean" with EUC_KR encoding. The
401 another way to accomplish this is to use a SQL command:
404 CREATE DATABASE korean WITH ENCODING = 'EUC_KR';
407 The encoding for a database is represented as an
408 <firstterm>encoding column</firstterm> in the
409 <literal>pg_database</literal> system catalog.
410 You can see that by using -l or \l of psql
416 Database | Owner | Encoding
417 ---------------+---------+---------------
418 euc_cn | t-ishii | EUC_CN
419 euc_jp | t-ishii | EUC_JP
420 euc_kr | t-ishii | EUC_KR
421 euc_tw | t-ishii | EUC_TW
422 mule_internal | t-ishii | MULE_INTERNAL
423 regression | t-ishii | SQL_ASCII
424 template1 | t-ishii | EUC_JP
425 test | t-ishii | EUC_JP
426 unicode | t-ishii | UNICODE
433 <title>Automatic encoding translation between backend and
437 <productname>Postgres</productname> supports an automatic
438 encoding translation between backend
439 and frontend for some encodings.
442 <title><productname>Postgres</productname> Client/Server Character Set Encodings</title>
443 <titleabbrev>Communication Encodings</titleabbrev>
447 <entry>Server Encoding</entry>
448 <entry>Available Client Encodings</entry>
453 <entry>EUC_JP</entry>
454 <entry>EUC_JP, SJIS</entry>
457 <entry>EUC_TW</entry>
458 <entry>EUC_TW, BIG5</entry>
461 <entry>LATIN2</entry>
462 <entry>LATIN2, WIN1250</entry>
465 <entry>LATIN5</entry>
466 <entry>LATIN5, WIN, ALT</entry>
469 <entry>MULE_INTERNAL</entry>
470 <entry>EUC_JP, SJIS, EUC_KR, EUC_CN,
471 EUC_TW, BIG5, LATIN1 to LATIN5,
472 WIN, ALT, WIN1250</entry>
480 To enable the automatic encoding translation, you have to tell
481 <productname>Postgres</productname> the encoding you would like
482 to use in frontend. There are
483 several ways to accomplish this.
488 Using the <command>\encoding</command> command in
489 <application>psql</application>.
490 <command>\encoding</command> allows you to change frontend
491 encoding on the fly. For
492 example, to change the encoding to SJIS, type:
502 Using libpq functions.
503 <command>\encoding</command> actually calls
504 PQsetClientEncoding() for its purpose.
507 int PQsetClientEncoding(PGconn *<replaceable>conn</replaceable>, const char *<replaceable>encoding</replaceable>)
510 where <replaceable>conn</replaceable> is a connection to the backend,
511 and <replaceable>encoding</replaceable> is an encoding you
512 want to use. If it successfully sets the encoding, it returns 0,
513 otherwise -1. The current encoding for this connection can be shown by
517 int PQclientEncoding(const PGconn *<replaceable>conn</replaceable>)
520 Note that it returns the "encoding id," not the encoding symbol string
521 such as "EUC_JP." To convert an encoding id to an encoding symbol, you
525 char *pg_encoding_to_char(int <replaceable>encoding_id</replaceable>)
532 Using <envar>PGCLIENTENCODING</envar>.
534 If an environment variable <envar>PGCLIENTENCODING</envar> is defined in the
535 frontend, an automatic encoding translation is done by the backend.
541 Using <command>SET CLIENT_ENCODING TO</command>.
543 Setting the frontend side encoding can be done a SQL command:
546 SET CLIENT_ENCODING TO 'encoding';
549 Also you can use SQL92 syntax "SET NAMES" for this purpose:
552 SET NAMES 'encoding';
555 To query the current the frontend encoding:
558 SHOW CLIENT_ENCODING;
561 To return to the default encoding:
564 RESET CLIENT_ENCODING;
573 <title>About Unicode</title>
576 An automatic encoding translation between Unicode and other
577 encodings is not yet supported.
582 <title>What happens if the translation is not possible?</title>
585 Suppose you choose EUC_JP for the backend, LATIN1 for the frontend,
586 then some Japanese characters could not be translated into LATIN1. In
587 this case, a letter cannot be represented in the LATIN1 character set,
588 would be transformed as:
597 <title>References</title>
600 These are good sources to start learning various kind of encoding
606 <ulink url="ftp://ftp.ora.com/pub/examples/nutshell/ujip/doc/cjk.inf">
607 ftp://ftp.ora.com/pub/examples/nutshell/ujip/doc/cjk.inf</ulink>
608 Detailed explanations of EUC_JP, EUC_CN, EUC_KR, EUC_TW
609 appear in section 3.2.
615 Unicode: <ulink url="http://www.unicode.org/">http://www.unicode.org/</ulink>
616 The homepage of UNICODE.
622 <literal>RFC 2044</literal>
623 UTF-8 is defined here.
631 <title>History</title>
636 * SJIS UDC (NEC selection IBM kanji) support contributed
638 * Changes above will appear in 7.0.1
641 * Add new libpq functions PQsetClientEncoding, PQclientEncoding
642 * ./configure --with-mb=EUC_JP
644 ./configure --enable-multibyte=EUC_JP
646 * Add SQL_ASCII regression test case
647 * Add SJIS User Defined Character (UDC) support
648 * All of above will appear in 7.0
651 * Add support for WIN1250 (Windows Czech) as a client encoding
652 (contributed by Pavel Behal)
653 * fix some compiler warnings (contributed by Tomoaki Nishiyama)
656 * Add support for KOI8(KOI8-R), WIN(CP1251), ALT(CP866)
657 (thanks Oleg Broytmann for testing)
658 * Fix problem with MB and locale
661 * Add support for Big5 for fronend encoding
662 (you need to create a database with EUC_TW to use Big5)
663 * Add regression test case for EUC_TW
664 (contributed by <ulink url="mailto:jonahkuo@mail.ttn.com.tw">Jonah Kuo</ulink>)
667 * Bugs related to SQL_ASCII support fixed
670 * 6.4 release. In this version, pg_database has "encoding"
671 column that represents the database encoding
674 * determine encoding at initdb/createdb rather than compile time
675 * support for PGCLIENTENCODING when issuing COPY command
676 * support for SQL92 syntax "SET NAMES"
677 * support for LATIN2-5
678 * add UNICODE regression test case
679 * new test suite for MB
680 * clean up source files
683 * add support for the encoding translation between the backend
685 * new command SET CLIENT_ENCODING etc. added
686 * add support for LATIN1 character set
687 * enhance 8 bit cleaness
689 April 21, 1998 some enhancements/fixes
690 * character_length(), position(), substring() are now aware of
691 multi-byte characters
693 * add --with-mb option to configure
694 * new regression tests for EUC_KR
695 (contributed by <ulink url="mailto:hong@lunaris.hanmesoft.co.kr">Soonmyung. Hong</ulink>)
696 * add some test cases to the EUC_JP regression test
697 * fix problem in regress/regress.sh in case of System V
698 * fix toupper(), tolower() to handle 8bit chars
700 Mar 25, 1998 MB PL2 is incorporated into PostgreSQL 6.3.1
702 Mar 10, 1998 PL2 released
703 * add regression test for EUC_JP, EUC_CN and MULE_INTERNAL
704 * add an English document (this file)
705 * fix problems concerning 8-bit single byte characters
707 Mar 1, 1998 PL1 released
713 <title>WIN1250 on Windows/ODBC</title>
717 [Here is a good documentation explaining how to use WIN1250 on
718 Windows/ODBC from Pavel Behal. Please note that Installation step 1)
719 is not necceary in 6.5.1 - Tatsuo]
721 Version: 0.91 for PgSQL 6.5
723 Revised by: Tatsuo Ishii
724 Email: <ulink url="mailto:behal@opf.slu.cz">behal@opf.slu.cz</ulink>
725 Licence: The Same as PostgreSQL
727 Sorry for my Eglish and C code, I'm not native :-)
729 !!!!!!!!!!!!!!!!!!!!!!!!! NO WARRANTY !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
732 The WIN1250 character set on Windows client platforms can be used
733 with <productname>Postgres</productname> with locale support
738 The following should be kept in mind:
743 Success depends on proper system locales. This has been tested
744 with RH6.0 and Slackware 3.6, with cs_CZ.iso8859-2 locale.
750 Never try to set the server multibyte database encoding to WIN1250.
751 Always use LATIN2 instead since there is not a WIN1250 locale
758 WIN1250 encoding is useable only for M$W ODBC clients. The
759 characters are recoded on the fly, to be displayed and stored
767 When running, it is important to remember the following:
772 This configuration reorders your sort order depending on your
773 <envar>LC_<replaceable>x</replaceable></envar> settings. Don't be
774 confused with the regression test results since they don't use
781 A locale such as "<literal>ch</literal>" is correctly sorted
783 supports that locale; older systems may not do so but new ones
790 You have to insert money as '<literal>162,50</literal>' (note
791 comma within the single-quotes).
797 At the time of writing (early 1999), this configuration has
798 not received extensive testing. Please let us know of any
799 changes you had to make!
806 <title>WIN1250 on Windows/ODBC</title>
809 Change the three relevant files in the source directories.
815 Compile <productname>Postgres</productname> with local enabled
816 and the multibyte encoding set to <literal>LATIN2</literal>.
822 Set up your instalation. Do not forget to create locale
823 variables in your profile (environment). For example (this may
824 not be correct for <emphasis>your</emphasis> environment):
827 LC_ALL=cs_CZ.ISO8859-2
828 LC_COLLATE=cs_CZ.ISO8859-2
829 LC_CTYPE=cs_CZ.ISO8859-2
830 LC_MONETARY=cs_CZ.ISO8859-2
831 LC_NUMERIC=cs_CZ.ISO8859-2
832 LC_TIME=cs_CZ.ISO8859-2
839 You have to start the postmaster with locales set!
845 Try it with Czech language, and have it sort on a query.
851 Install ODBC driver for PgSQL on your M$ Windows machine.
857 Setup properly your data source. Include this line in your ODBC
858 configuration dialog in the field <literal>Connect Settings</literal>:
861 SET CLIENT_ENCODING = 'WIN1250';
868 Now try it again, but in Windows with ODBC.
877 <title>Single-byte character set recoding</>
878 <!-- formerly in README.charsets, by Josef Balatka, <balatka@email.cz> -->
881 You can set up this feature with the <option>--enable-recode</> option
882 to <filename>configure</>. This option was formerly described as
883 <quote>Cyrillic recode support</> which doesn't express all its
884 power. It can be used for <emphasis>any</> single-byte character
889 This method uses a file <filename>charset.conf</> file located in
890 the database directory (<envar>PGDATA</>). It's a typical
891 configuration text file where spaces and newlines separate items
892 and records and # specifies comments. Three keywords with the
893 following syntax are recognized here:
895 BaseCharset <replaceable>server_charset</>
896 RecodeTable <replaceable>from_charset</> <replaceable>to_charset</> <replaceable>file_name</>
897 HostCharset <replaceable>host_spec</> <replaceable>host_charset</>
902 <token>BaseCharset</> defines the encoding of the database server.
903 All character set names are only used for mapping inside of
904 <filename>charset.conf</> so you can freely use typing-friendly
909 <token>RecodeTable</> records specify translation tables between
910 server and client. The file name is relative to the
911 <envar>PGDATA</> directory. The table file format is very
912 simple. There are no keywords and characters are represented by a
913 pair of decimal or hexadecimal (0x prefixed) values on single
916 <replaceable>char_value</> <replaceable>translated_char_value</>
921 <token>HostCharset</> records define the client character set by IP
922 address. You can use a single IP address, an IP mask range starting
923 from the given address or an IP interval (e.g., 127.0.0.1,
924 192.168.1.100/24, 192.168.1.20-192.168.1.40).
928 The <filename>charset.conf</> file is always processed up to the
929 end, so you can easily specify exceptions from the previous
930 rules. In the src/data you will find charset.conf example and a few
935 As this solution is based on the client's IP address and character
936 set mapping there are obviously some restrictions as well. You
937 cannot use different encodings on the same host at the same
938 time. It is also inconvenient when you boot your client hosts into
939 more operating systems. Nevertheless, when these restrictions are
940 not limiting and you do not need multi-byte characters than it is a
941 simple and effective solution.
947 <!-- Keep this comment at the end of the file
952 sgml-minimize-attributes:nil
953 sgml-always-quote-attributes:t
956 sgml-parent-document:nil
957 sgml-default-dtd-file:"./reference.ced"
958 sgml-exposed-tags:nil
959 sgml-local-catalogs:("/usr/lib/sgml/catalog")
960 sgml-local-ecat-files:nil