   1 <!-- $Header: /cvsroot/pgsql/doc/src/sgml/charset.sgml,v 2.17 2001/11/19 03:58:24 tgl Exp $ -->
   2
   3 <chapter id="charset">
   4  <title>Localization</>
   5
   6  <abstract>
   7   <para>
   8    Describes the available localization features from the point of
   9    view of the administrator.
  10   </para>
  11  </abstract>
  12
  13   <para>
  14    <productname>Postgres</productname> supports localization with
  15    three approaches:
  16
  17    <itemizedlist>
  18     <listitem>
  19      <para>
  20       Using the locale features of the operating system to provide
  21       locale-specific collation order, number formatting, translated
  22       messages, and other aspects.
  23      </para>
  24     </listitem>
  25
  26     <listitem>
  27      <para>
  28       Using explicit multiple-byte character sets defined in the
  29       <productname>Postgres</productname> server to support languages
  30       that require more characters than will fit into a single byte,
  31       and to provide character set recoding between client and server.
  32       The number of supported character sets is fixed at the time the
  33       server is compiled, and internal operations such as string
  34       comparisons require expansion of each character into a 32-bit
  35       word.
  36      </para>
  37     </listitem>
  38
  39     <listitem>
  40      <para>
  41       Using single-byte character recoding, which provides a
  42       lighter-weight solution for users of several single-byte character sets.
  43      </para>
  44     </listitem>
  45    </itemizedlist>
  46   </para>
  47
  48
  49  <sect1 id="locale">
  50   <title>Locale Support</title>
  51
  52   <indexterm zone="locale"><primary>locale</></>
  53
  54   <para>
  55    <firstterm>Locale</> support refers to an application respecting
  56    cultural preferences regarding alphabets, sorting, number
  57    formatting, etc.  <productname>PostgreSQL</> uses the standard ISO
  58    C and <acronym>POSIX</acronym>-like locale facilities provided by the server operating
  59    system.  For additional information refer to the documentation of your
  60    system.
  61   </para>
  62
  63   <sect2>
  64    <title>Overview</>
  65
  66   <para>
  67     Locale support is not built into <productname>PostgreSQL</> by
  68     default; to enable it, supply the <option>--enable-locale</> option
  69     to the <filename>configure</> script:
  70 <informalexample>
  71 <screen>
  72 <prompt>$ </><userinput>./configure --enable-locale</>
  73 </screen>
  74 </informalexample>
  75     Locale support only affects the server; all clients are compatible
  76     with servers with or without locale support.
  77    </para>
  78
  79    <para>
  80     To enable messages translated to the user's preferred language,
  81     the <option>--enable-nls</option> option must be used.  This
  82     option is independent of the other locale support.
  83    </para>
  84
  85    <para>
  86     The information about which particular cultural rules to use is
  87     determined by standard environment variables.  If you are getting
  88     localized behavior from other programs you probably have them set
  89     up already.  The simplest way to set the localization information
  90     is the <envar>LANG</> variable, for example:
  91 <programlisting>
  92 export LANG=sv_SE
  93 </programlisting>
  94     This sets the locale to Swedish (<literal>sv</>) as spoken in
  95     Sweden (<literal>SE</>).  Other possibilities might be
  96     <literal>en_US</> (U.S. English) and <literal>fr_CA</> (Canada,
  97     French).  If more than one character set can be useful for a locale
  98     then the specifications look like this:
  99     <literal>cs_CZ.ISO8859-2</>. What locales are available under what
 100     names on your system depends on what was provided by the operating
 101     system vendor and what was installed.
 102    </para>
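   <para>
    As a rough illustration of how such names are composed, the
    following shell sketch splits a locale name into its language,
    territory, and character set parts.  The function name
    <function>split_locale</function> is made up for this example, and
    only the forms shown above are handled:
<programlisting>
```shell
# Split a locale name of the form language[_TERRITORY][.codeset]
# into its parts using POSIX parameter expansion.
split_locale() {
    name=$1
    codeset=
    territory=
    case $name in
        *.*) codeset=${name#*.}; name=${name%%.*} ;;
    esac
    case $name in
        *_*) territory=${name#*_}; name=${name%%_*} ;;
    esac
    echo "language=$name territory=$territory codeset=$codeset"
}

split_locale sv_SE
split_locale cs_CZ.ISO8859-2
```
</programlisting>
    The second call prints
    <literal>language=cs territory=CZ codeset=ISO8859-2</literal>.
   </para>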
 103
 104    <para>
 105     Occasionally it is useful to mix rules from several locales, e.g.,
 106     use U.S. collation rules but Spanish messages.  To do that, there is a
 107     set of environment variables that override the default of
 108     <envar>LANG</> for a particular category:
 109
 110     <informaltable>
 111      <tgroup cols="2">
 112       <tbody>
 113        <row>
 114         <entry><envar>LC_COLLATE</></>
 115         <entry>String sort order</>
 116        </row>
 117        <row>
 118         <entry><envar>LC_CTYPE</></>
 119         <entry>Character classification (What is a letter? The upper-case equivalent?)</>
 120        </row>
 121        <row>
 122         <entry><envar>LC_MESSAGES</></>
 123         <entry>Language of messages</>
 124        </row>
 125        <row>
 126         <entry><envar>LC_MONETARY</></>
 127         <entry>Formatting of currency amounts</>
 128        </row>
 129        <row>
 130         <entry><envar>LC_NUMERIC</></>
 131         <entry>Formatting of numbers</>
 132        </row>
 133        <row>
 134         <entry><envar>LC_TIME</></>
 135         <entry>Formatting of dates and times</>
 136        </row>
 137       </tbody>
 138      </tgroup>
 139     </informaltable>
 140
 141     Additionally, all of these specific variables and the
 142     <envar>LANG</> variable can be overridden with the
 143     <envar>LC_ALL</> environment variable.
 144    </para>
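   <para>
    The precedence described above can be sketched as a small shell
    function.  This is only an illustration: the function name
    <function>effective_locale</function> and its argument order are
    invented here, and the C library performs the real lookup itself:
<programlisting>
```shell
# Print the locale that applies to one category.
# Arguments: value of LC_ALL, value of the category variable, value of LANG
# (pass an empty string for a variable that is not set).
effective_locale() {
    if [ -n "$1" ]; then
        echo "$1"
    elif [ -n "$2" ]; then
        echo "$2"
    elif [ -n "$3" ]; then
        echo "$3"
    else
        echo "C"
    fi
}

# LANG=en_US and LC_MESSAGES=es_ES: U.S. collation, Spanish messages.
effective_locale "" "" en_US        # LC_COLLATE
effective_locale "" es_ES en_US     # LC_MESSAGES

# The same question asked of the current environment.
effective_locale "${LC_ALL:-}" "${LC_MESSAGES:-}" "${LANG:-}"
```
</programlisting>
    The first two calls print <literal>en_US</literal> and
    <literal>es_ES</literal>; the output of the last call depends on
    the environment it is run in.
   </para>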
 145
 146    <note>
 147     <para>
 148      Some message localization libraries also look at the environment
 149      variable <envar>LANGUAGE</envar>, which overrides all other locale
 150      settings for the purpose of setting the language of messages.  If
 151      in doubt, please refer to the documentation of your operating
 152      system, in particular the
 153      <citerefentry><refentrytitle>gettext</><manvolnum>3</></> manual
 154      page, for more information.
 155     </para>
 156    </note>
 157
 158    <para>
 159     If you want the system to behave as if it had no locale support,
 160     use the special locale <literal>C</> or <literal>POSIX</>, or
 161     simply unset all locale-related variables.
 162    </para>
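   <para>
    To get a feeling for what the <literal>C</literal> locale means for
    sorting, the operating system's <command>sort</command> utility can
    be used outside of the database.  The following sketch is not a
    <productname>PostgreSQL</productname> command; it only shows the
    plain byte order that the <literal>C</literal> locale implies:
<programlisting>
```shell
# Sort four strings under the C locale: plain byte order, so all
# upper-case ASCII letters come before all lower-case ones.
printf '%s\n' b A a B | LC_ALL=C sort
```
</programlisting>
    This prints <literal>A</literal>, <literal>B</literal>,
    <literal>a</literal>, <literal>b</literal>, one per line.  Other
    locales may order the same strings differently.
   </para>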
 163
 164    <para>
 165     Note that the locale behavior of the server is determined by the
 166     environment variables seen by the server, not by the environment
 167     of any client.  Therefore, be careful to set these variables
 168     before starting the server.  A consequence of this is that if
 169     client and server are set up to different locales, messages may
 170     appear in different languages depending on where they originated.
 171    </para>
 172
 173    <para>
 174     The <envar>LC_COLLATE</> and <envar>LC_CTYPE</> variables affect the
 175     sort order of indexes.  Therefore, these values must be kept fixed
 176     for any particular database cluster, or indexes on text columns will
 177     become corrupt.  <productname>Postgres</productname> enforces this
 178     by recording the values of <envar>LC_COLLATE</> and <envar>LC_CTYPE</>
 179     that are seen by <application>initdb</>.  The server automatically adopts
 180     those two values when it is started; only the other <envar>LC_</>
 181     categories can be set from the environment at server startup.
 182     In short, only one collation order can be used in a database cluster,
 183     and it is chosen at <application>initdb</> time.
 184    </para>
 185   </sect2>
 186
 187   <sect2>
 188    <title>Benefits</>
 189
 190    <para>
 191     Locale support influences in particular the following features:
 192
 193     <itemizedlist>
 194      <listitem>
 195       <para>
 196        Sort order in <command>ORDER BY</> queries.
 197        <indexterm><primary>ORDER BY</></>
 198       </para>
 199      </listitem>
 200
 201      <listitem>
 202       <para>
 203        The <function>to_char</> family of functions
 204       </para>
 205      </listitem>
 206
 207      <listitem>
 208       <para>
 209        The <literal>LIKE</> and <literal>~</> operators for pattern
 210        matching
 211       </para>
 212      </listitem>
 213     </itemizedlist>
 214    </para>
 215
 216    <para>
 217     The only severe drawback of using the locale support in
 218     <productname>PostgreSQL</> is reduced speed, so use locale support only if you
 219     actually need it.  It should be noted in particular that selecting
 220     a non-C locale disables index optimizations for <literal>LIKE</> and
 221     <literal>~</> operators, which can make a huge difference in the
 222     speed of searches that use those operators.
 223    </para>
 224   </sect2>
 225
 226   <sect2>
 227    <title>Problems</>
 228
 229    <para>
 230     If locale support doesn't work in spite of the explanation above,
 231     check that the locale support in your operating system is correctly configured.
 232     To check whether a given locale is installed and functional you
 233     can use <application>Perl</>, for example.  Perl also has support
 234     for locales, and if a locale is broken <command>perl -v</> will
 235     complain with something like this:
 236 <screen>
 237 <prompt>$</> <userinput>export LC_CTYPE='not_exist'</>
 238 <prompt>$</> <userinput>perl -v</>
 239 <computeroutput>
 240 perl: warning: Setting locale failed.
 241 perl: warning: Please check that your locale settings:
 242 LC_ALL = (unset),
 243 LC_CTYPE = "not_exist",
 244 LANG = (unset)
 245 are supported and installed on your system.
 246 perl: warning: Falling back to the standard locale ("C").
 247 </computeroutput>
 248 </screen>
 249    </para>
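   <para>
    If Perl is not at hand, a similar check can be sketched with the
    <command>locale</command> utility.  This assumes that your system
    provides <command>locale -a</command>; the function name
    <function>locale_installed</function> is invented for this example,
    and the spelling of names in the list can differ between systems:
<programlisting>
```shell
# Succeed if the named locale appears in the system's locale list.
# Returns failure when the locale utility is missing.
locale_installed() {
    command -v locale >/dev/null 2>/dev/null || return 1
    locale -a 2>/dev/null | grep -Fqx -- "$1"
}

if locale_installed not_exist; then
    echo "not_exist is installed"
else
    echo "not_exist is not installed"
fi
```
</programlisting>
   </para>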
 250
 251    <para>
 252     Check that your locale files are in the right location.  Possible
 253     locations include: <filename>/usr/lib/locale</filename> (<systemitem class="osname">Linux</>,
 254     <systemitem class="osname">Solaris</>), <filename>/usr/share/locale</filename> (<systemitem class="osname">Linux</>),
 255     <filename>/usr/lib/nls/loc</filename> (<systemitem class="osname">DUX 4.0</>).  Check the locale
 256     man page of your system if you are not sure.
 257    </para>
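   <para>
    A quick way to see which of these directories exist on a given
    machine is a loop such as the following sketch.  The function name
    <function>check_locale_dirs</function> is made up here, and the list
    contains only the locations named above:
<programlisting>
```shell
# Report which of the locale directories named above exist here.
check_locale_dirs() {
    for dir in /usr/lib/locale /usr/share/locale /usr/lib/nls/loc; do
        if [ -d "$dir" ]; then
            echo "found: $dir"
        else
            echo "missing: $dir"
        fi
    done
}

check_locale_dirs
```
</programlisting>
   </para>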
 258
 259    <para>
 260     Check that <productname>PostgreSQL</> is actually using the locale that
 261     you think it is.  <envar>LC_COLLATE</> and <envar>LC_CTYPE</> settings are
 262     determined at <application>initdb</> time and cannot be changed without
 263     repeating <application>initdb</>.  Other locale settings including
 264     <envar>LC_MESSAGES</> and <envar>LC_MONETARY</> are determined by the
 265     environment the postmaster is started in, and can be changed with a simple
 266     postmaster restart.  You can check the <envar>LC_COLLATE</> and
 267     <envar>LC_CTYPE</> settings of
 268     a database with the <filename>contrib/pg_controldata</> utility program.
 269    </para>
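   <para>
    The following sketch shows one way to pick a single setting out of
    such a report.  It assumes that the utility prints lines of the
    form <literal>LC_COLLATE: value</literal>; check the actual output
    of your version before relying on it.  The function name
    <function>controldata_value</function> and the sample lines are
    invented for illustration:
<programlisting>
```shell
# Extract the value of one setting from "name: value" lines read on
# standard input (the line format is an assumption, see above).
controldata_value() {
    sed -n "s/^$1:[[:space:]]*//p"
}

# Sample input for illustration only; real output has more lines.
printf '%s\n' 'LC_COLLATE:   sv_SE' 'LC_CTYPE:     sv_SE' |
    controldata_value LC_COLLATE
```
</programlisting>
    With the sample input this prints <literal>sv_SE</literal>.  In
    real use, the output of the utility would be piped into the
    function instead.
   </para>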
 270
 271    <para>
 272     The directory <filename>src/test/locale</> contains a test suite
 273     for <productname>PostgreSQL</>'s locale support.
 274    </para>
 275
 276    <para>
 277     Client applications that handle server-side errors by parsing the
 278     text of the error message will obviously have problems when the
 279     server's messages are in a different language.  If you create such
 280     an application you need to devise a plan to cope with this
 281     situation.  The embedded SQL interface (<application>ecpg</>) is
 282     also affected by this problem.  It is currently recommended that
 283     servers interfacing with <application>ecpg</> applications be
 284     configured to send messages in English.
 285    </para>
 286
 287    <para>
 288     Maintaining catalogs of message translations requires the on-going
 289     efforts of many volunteers who want to see
 290     <productname>PostgreSQL</> speak their preferred language well.
 291     If messages in your language are currently not available or not fully
 292     translated, your assistance would be appreciated.  If you want to
 293     help, refer to the <citetitle>Developer's Guide</> or write to the
 294     developers' mailing list.
 295    </para>
 296   </sect2>
 297  </sect1>
 298
 299
 300   <sect1 id="multibyte">
 301    <title>Multibyte Support</title>
 302
 303   <indexterm zone="multibyte"><primary>multibyte</></>
 304
 305    <note>
 306     <title>Author</title>
 307
 308     <para>
 309      Tatsuo Ishii (<email>ishii@postgresql.org</email>),
 310      last updated 2000-03-22.
 311      Check <ulink
 312       url="http://www.sra.co.jp/people/t-ishii/PostgreSQL/">Tatsuo's
 313       web site</ulink> for more information.
 314     </para>
 315    </note>
 316
 317    <para>
 318     Multibyte (<acronym>MB</acronym>) support is intended to allow
 319     <productname>Postgres</productname> to handle
 320     multiple-byte character sets such as <acronym>EUC</> (Extended Unix Code), Unicode and
 321     Mule internal code. With <acronym>MB</acronym> enabled you can use multibyte
 322     character sets in regular expressions (regexp), LIKE, and some
 323     other functions. The default
 324     encoding system is selected while initializing your
 325     <productname>Postgres</productname> installation using
 326     <application>initdb</application>. Note that this can be
 327     overridden when you create a database using
 328     <application>createdb</application> or by using the SQL command
 329     CREATE DATABASE. So you can have multiple databases each with
 330     a different encoding system.
 331    </para>
 332
 333    <para>
 334     <acronym>MB</acronym> also fixes some problems concerning 8-bit single byte
 335     character sets including ISO8859. (I would not say all problems
 336     have been fixed. I just confirmed that the regression test ran fine
 337     and a few French characters could be used with the patch. Please let
 338     me know if you find any problem while using 8-bit characters.)
 339    </para>
 340
 341    <sect2>
 342     <title>Enabling MB</title>
 343
 344     <para>
 345      Run configure with the multibyte option:
 346
 347      <programlisting>
 348 % ./configure --enable-multibyte[=<replaceable>encoding_system</replaceable>]
 349      </programlisting>
 350
 351      where <replaceable>encoding_system</replaceable> can be one of the
 352      values in the following table:
 353
 354      <table tocentry="1">
 355       <title>Character Set Encodings</title>
 356       <titleabbrev>Encodings</titleabbrev>
 357       <tgroup cols="2">
 358        <thead>
 359         <row>
 360          <entry>Encoding</entry>
 361          <entry>Description</entry>
 362         </row>
 363        </thead>
 364        <tbody>
 365         <row>
 366          <entry><literal>SQL_ASCII</literal></entry>
 367          <entry><acronym>US ASCII</acronym></entry>
 368         </row>
 369         <row>
 370          <entry><literal>EUC_JP</literal></entry>
 371          <entry>Japanese <acronym>EUC</></entry>
 372         </row>
 373         <row>
 374          <entry><literal>EUC_CN</literal></entry>
 375          <entry>Chinese <acronym>EUC</></entry>
 376         </row>
 377         <row>
 378          <entry><literal>EUC_KR</literal></entry>
 379          <entry>Korean <acronym>EUC</></entry>
 380         </row>
 381         <row>
 382          <entry><literal>EUC_TW</literal></entry>
 383          <entry>Taiwan <acronym>EUC</acronym></entry>
 384         </row>
 385         <row>
 386          <entry><literal>UNICODE</literal></entry>
 387          <entry>Unicode (<acronym>UTF</acronym>-8)</entry>
 388         </row>
 389         <row>
 390          <entry><literal>MULE_INTERNAL</literal></entry>
 391          <entry>Mule internal code</entry>
 392         </row>
 393         <row>
 394          <entry><literal>LATIN1</literal></entry>
 395          <entry>ISO 8859-1 ECMA-94 Latin Alphabet No.1</entry>
 396         </row>
 397         <row>
 398          <entry><literal>LATIN2</literal></entry>
 399          <entry>ISO 8859-2 ECMA-94 Latin Alphabet No.2</entry>
 400         </row>
 401         <row>
 402          <entry><literal>LATIN3</literal></entry>
 403          <entry>ISO 8859-3 ECMA-94 Latin Alphabet No.3</entry>
 404         </row>
 405         <row>
 406          <entry><literal>LATIN4</literal></entry>
 407          <entry>ISO 8859-4 ECMA-94 Latin Alphabet No.4</entry>
 408         </row>
 409         <row>
 410          <entry><literal>LATIN5</literal></entry>
 411          <entry>ISO 8859-9 ECMA-128 Latin Alphabet No.5</entry>
 412         </row>
 413         <row>
 414          <entry><literal>LATIN6</literal></entry>
 415          <entry>ISO 8859-10 ECMA-144 Latin Alphabet No.6</entry>
 416         </row>
 417         <row>
 418          <entry><literal>LATIN7</literal></entry>
 419          <entry>ISO 8859-13 Latin Alphabet No.7</entry>
 420         </row>
 421         <row>
 422          <entry><literal>LATIN8</literal></entry>
 423          <entry>ISO 8859-14 Latin Alphabet No.8</entry>
 424         </row>
 425         <row>
 426          <entry><literal>LATIN9</literal></entry>
 427          <entry>ISO 8859-15 Latin Alphabet No.9</entry>
 428         </row>
 429         <row>
 430          <entry><literal>LATIN10</literal></entry>
 431          <entry>ISO 8859-16 ASRO SR 14111 Latin Alphabet No.10</entry>
 432         </row>
 433         <row>
 434          <entry><literal>ISO_8859_5</literal></entry>
 435          <entry>ECMA-113 Latin/Cyrillic</entry>
 436         </row>
 437         <row>
 438          <entry><literal>ISO_8859_6</literal></entry>
 439          <entry>ECMA-114 Latin/Arabic</entry>
 440         </row>
 441         <row>
 442          <entry><literal>ISO_8859_7</literal></entry>
 443          <entry>ECMA-118 Latin/Greek</entry>
 444         </row>
 445         <row>
 446          <entry><literal>ISO_8859_8</literal></entry>
 447          <entry>ECMA-121 Latin/Hebrew</entry>
 448         </row>
 449         <row>
 450          <entry><literal>KOI8</literal></entry>
 451          <entry><acronym>KOI</acronym>8-R(U)</entry>
 452         </row>
 453         <row>
 454          <entry><literal>WIN</literal></entry>
 455          <entry>Windows CP1251</entry>
 456         </row>
 457         <row>
 458          <entry><literal>ALT</literal></entry>
 459          <entry>Windows CP866</entry>
 460         </row>
 461        </tbody>
 462       </tgroup>
 463      </table>
 464     </para>
 465
 466     <para>
 467     CAUTION1: Note that before 7.2, LATIN5 mistakenly meant ISO 8859-5.  In 7.2,
 468     LATIN5 means ISO 8859-9. If you have a LATIN5 database created on
 469     7.1 or before and want to migrate to 7.2, you should be very
 470     careful about this change.
 471     </para>
 472
 473     <para>
 474     CAUTION2: Not all APIs support the encodings listed above. For example,
 475     the PostgreSQL JDBC driver does not support MULE_INTERNAL, LATIN6,
 476     LATIN8 and LATIN10.
 477     </para>
 478
 479     <para>
 480      Here is an example of configuring
 481      <productname>Postgres</productname> to use a Japanese encoding by
 482      default:
 483
 484      <programlisting>
 485 % ./configure --enable-multibyte=EUC_JP
 486      </programlisting>
 487     </para>
 488
 489     <para>
 490      If the encoding system is omitted (./configure --enable-multibyte),
 491      SQL_ASCII is assumed.
 492     </para>
 493    </sect2>
 494
 495    <sect2>
 496     <title>Setting the Encoding</title>
 497
 498     <para>
 499      <application>initdb</application> defines the default encoding
 500      for a <productname>Postgres</productname> installation. For example:
 501
 502      <programlisting>
 503 % initdb -E EUC_JP
 504      </programlisting>
 505
 506      sets the default encoding to <literal>EUC_JP</literal> (Extended Unix Code for Japanese).
 507      Note that you can use <option>--encoding</option> instead of <option>-E</option> if you prefer
 508      to type longer option strings.
 509      If no -E or --encoding option is given, the encoding
 510      specified at configure time is used.
 511     </para>
 512
 513     <para>
 514      You can create a database with a different encoding. For example:
 515
 516      <programlisting>
 517 % createdb -E EUC_KR korean
 518      </programlisting>
 519
 520      will create a database named <database>korean</database> with <literal>EUC_KR</literal> encoding.
 521      Another way to accomplish this is to use a SQL command:
 522
 523      <programlisting>
 524 CREATE DATABASE korean WITH ENCODING = 'EUC_KR';
 525      </programlisting>
 526
 527      The encoding for a database is represented as an
 528      <firstterm>encoding column</firstterm>  in the
 529      <literal>pg_database</literal> system catalog.
 530      You can see it by using the <option>-l</option> option or the <command>\l</command> command of
 531      <command>psql</command>:
 532
 533      <programlisting>
 534 $ psql -l
 535             List of databases
 536    Database    |  Owner  |   Encoding
 537 ---------------+---------+---------------
 538  euc_cn        | t-ishii | EUC_CN
 539  euc_jp        | t-ishii | EUC_JP
 540  euc_kr        | t-ishii | EUC_KR
 541  euc_tw        | t-ishii | EUC_TW
 542  mule_internal | t-ishii | MULE_INTERNAL
 543  regression    | t-ishii | SQL_ASCII
 544  template1     | t-ishii | EUC_JP
 545  test          | t-ishii | EUC_JP
 546  unicode       | t-ishii | UNICODE
 547 (9 rows)
 548      </programlisting>
 549     </para>
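    <para>
     A script that needs the encoding of a single database could filter
     such a listing, as in the following sketch.  It assumes the
     three-column layout shown above; the function name
     <function>database_encoding</function> is invented for this
     example:
<programlisting>
```shell
# Print the encoding of one database from "psql -l"-style rows of the
# form " name | owner | encoding" read on standard input.
database_encoding() {
    awk -F'|' -v db="$1" '
        NF == 3 {
            name = $1; enc = $3
            gsub(/^ +| +$/, "", name)
            gsub(/^ +| +$/, "", enc)
            if (name == db) print enc
        }'
}

# Two rows copied from the listing above, for illustration.
printf '%s\n' ' euc_kr        | t-ishii | EUC_KR' \
              ' unicode       | t-ishii | UNICODE' |
    database_encoding euc_kr
```
</programlisting>
     With these two rows the function prints <literal>EUC_KR</literal>.
     In real use, the output of <command>psql -l</command> would be
     piped into it.
    </para>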
 550    </sect2>
 551
 552    <sect2>
 553     <title>Automatic encoding translation between backend and
 554      frontend</title>
 555
 556     <para>
 557      <productname>Postgres</productname> supports an automatic
 558      encoding translation between backend
 559      and frontend for some encodings.
 560
 561      <table tocentry="1">
 562       <title>Client/Server Character Set Encodings</title>
 563       <titleabbrev>Communication Encodings</titleabbrev>
 564       <tgroup cols="2">
 565        <thead>
 566         <row>
 567          <entry>Server Encoding</entry>
 568          <entry>Available Client Encodings</entry>
 569         </row>
 570        </thead>
 571        <tbody>
 572         <row>
 573          <entry><literal>SQL_ASCII</literal></entry>
 574          <entry><literal>SQL_ASCII</literal>, <literal>UNICODE</literal>, <literal>MULE_INTERNAL</literal>
 575          </entry>
 576         </row>
 577         <row>
 578          <entry><literal>EUC_JP</literal></entry>
 579          <entry><literal>EUC_JP</literal>, <literal>SJIS</literal>,
 580          <literal>UNICODE</literal>, <literal>MULE_INTERNAL</literal>
 581          </entry>
 582         </row>
 583         <row>
 584          <entry><literal>EUC_TW</literal></entry>
 585          <entry><literal>EUC_TW</literal>, <literal>BIG5</literal>,
 586          <literal>UNICODE</literal>, <literal>MULE_INTERNAL</literal>
 587          </entry>
 588         </row>
 589         <row>
 590          <entry><literal>LATIN1</literal></entry>
 591          <entry><literal>LATIN1</literal>, <literal>UNICODE</literal>,
 592          <literal>MULE_INTERNAL</literal>
 593          </entry>
 594         </row>
 595         <row>
 596          <entry><literal>LATIN2</literal></entry>
 597          <entry><literal>LATIN2</literal>, <literal>WIN1250</literal>,
 598          <literal>UNICODE</literal>,
 599          <literal>MULE_INTERNAL</literal>
 600          </entry>
 601         </row>
 602         <row>
 603          <entry><literal>LATIN3</literal></entry>
 604          <entry><literal>LATIN3</literal>, <literal>UNICODE</literal>,
 605          <literal>MULE_INTERNAL</literal>
 606          </entry>
 607         </row>
 608         <row>
 609          <entry><literal>LATIN4</literal></entry>
 610          <entry><literal>LATIN4</literal>, <literal>UNICODE</literal>,
 611          <literal>MULE_INTERNAL</literal>
 612          </entry>
 613         </row>
 614         <row>
 615          <entry><literal>LATIN5</literal></entry>
 616          <entry><literal>LATIN5</literal>, <literal>UNICODE</literal>,
 617          <literal>MULE_INTERNAL</literal>
 618          </entry>
 619         </row>
 620         <row>
 621          <entry><literal>LATIN6</literal></entry>
 622          <entry><literal>LATIN6</literal>, <literal>UNICODE</literal>,
 623          <literal>MULE_INTERNAL</literal>
 624          </entry>
 625         </row>
 626         <row>
 627          <entry><literal>LATIN7</literal></entry>
 628          <entry><literal>LATIN7</literal>, <literal>UNICODE</literal>,
 629          <literal>MULE_INTERNAL</literal>
 630          </entry>
 631         </row>
 632         <row>
 633          <entry><literal>LATIN8</literal></entry>
 634          <entry><literal>LATIN8</literal>, <literal>UNICODE</literal>,
 635          <literal>MULE_INTERNAL</literal>
 636          </entry>
 637         </row>
 638         <row>
 639          <entry><literal>LATIN9</literal></entry>
 640          <entry><literal>LATIN9</literal>, <literal>UNICODE</literal>,
 641          <literal>MULE_INTERNAL</literal>
 642          </entry>
 643         </row>
 644         <row>
 645          <entry><literal>LATIN10</literal></entry>
 646          <entry><literal>LATIN10</literal>, <literal>UNICODE</literal>,
 647          <literal>MULE_INTERNAL</literal>
 648          </entry>
 649         </row>
 650         <row>
 651          <entry><literal>ISO_8859_5</literal></entry>
 652          <entry><literal>ISO_8859_5</literal>,
 653          <literal>UNICODE</literal>
 654          </entry>
 655         </row>
 656         <row>
 657          <entry><literal>ISO_8859_6</literal></entry>
 658          <entry><literal>ISO_8859_6</literal>,
 659          <literal>UNICODE</literal>
 660          </entry>
 661         </row>
 662         <row>
 663          <entry><literal>ISO_8859_7</literal></entry>
 664          <entry><literal>ISO_8859_7</literal>,
 665          <literal>UNICODE</literal>
 666          </entry>
 667         </row>
 668         <row>
 669          <entry><literal>ISO_8859_8</literal></entry>
 670          <entry><literal>ISO_8859_8</literal>,
 671          <literal>UNICODE</literal>
 672          </entry>
 673         </row>
 674         <row>
 675          <entry><literal>ISO_8859_9</literal></entry>
 676          <entry><literal>ISO_8859_9</literal>, <literal>WIN</literal>,
 677          <literal>ALT</literal>, <literal>KOI8</literal>,
 678          <literal>UNICODE</literal>, <literal>MULE_INTERNAL</literal>
 679          </entry>
 680         </row>
 681         <row>
 682          <entry><literal>UNICODE</literal></entry>
 683          <entry>
 684          <literal>EUC_JP</literal>, <literal>SJIS</literal>,
 685          <literal>EUC_KR</literal>, <literal>EUC_CN</literal>,
 686          <literal>EUC_TW</literal>, <literal>BIG5</literal>,
 687          <literal>LATIN1</literal> to <literal>LATIN10</literal>,
 688          <literal>ISO_8859_5</literal>,
 689          <literal>ISO_8859_6</literal>,
 690          <literal>ISO_8859_7</literal>,
 691          <literal>ISO_8859_8</literal>,
 692          <literal>WIN</literal>, <literal>ALT</literal>,
 693          <literal>KOI8</literal>
 694          </entry>
 695         </row>
 696         <row>
 697          <entry><literal>MULE_INTERNAL</literal></entry>
 698          <entry><literal>EUC_JP</literal>, <literal>SJIS</literal>, <literal>EUC_KR</literal>, <literal>EUC_CN</literal>,
 699           <literal>EUC_TW</literal>, <literal>BIG5</literal>, <literal>LATIN1</literal> to <literal>LATIN5</literal>,
 700           <literal>WIN</literal>, <literal>ALT</literal>, <literal>WIN1250</literal></entry>
 701         </row>
 702         <row>
 703          <entry><literal>KOI8</literal></entry>
 704          <entry><literal>ISO_8859_9</literal>, <literal>WIN</literal>,
 705          <literal>ALT</literal>, <literal>KOI8</literal>,
 706          <literal>UNICODE</literal>, <literal>MULE_INTERNAL</literal>
 707          </entry>
 708         </row>
 709         <row>
 710          <entry><literal>WIN</literal></entry>
 711          <entry><literal>ISO_8859_9</literal>, <literal>WIN</literal>,
 712          <literal>ALT</literal>, <literal>KOI8</literal>,
 713          <literal>UNICODE</literal>, <literal>MULE_INTERNAL</literal>
 714          </entry>
 715         </row>
 716         <row>
 717          <entry><literal>ALT</literal></entry>
 718          <entry><literal>ISO_8859_9</literal>, <literal>WIN</literal>,
 719          <literal>ALT</literal>, <literal>KOI8</literal>,
 720          <literal>UNICODE</literal>, <literal>MULE_INTERNAL</literal>
 721          </entry>
 722         </row>
 723        </tbody>
 724       </tgroup>
 725      </table>
 726     </para>
 727
 728     <para>
 729      To enable the automatic encoding translation, you have to tell
 730      <productname>Postgres</productname> the encoding you would like
 731      to use in the frontend. There are
 732      several ways to accomplish this.
 733
 734      <itemizedlist>
 735       <listitem>
 736        <para>
 737         Using the <command>\encoding</command> command in
 738         <application>psql</application>.
 739         <command>\encoding</command> allows you to change the frontend
 740         encoding on the fly. For
 741         example, to change the encoding to <literal>SJIS</literal>, type:
 742
 743         <programlisting>
 744 \encoding SJIS
 745         </programlisting>
 746        </para>
 747       </listitem>
 748
 749       <listitem>
 750        <para>
 751         Using <application>libpq</> functions.
 752         <command>\encoding</command> actually calls
 753         <function>PQsetClientEncoding()</function> for this purpose.
 754
 755         <programlisting>
 756 int PQsetClientEncoding(PGconn *<replaceable>conn</replaceable>, const char *<replaceable>encoding</replaceable>)
 757         </programlisting>
 758
 759         where <replaceable>conn</replaceable> is a connection to the backend,
 760         and <replaceable>encoding</replaceable> is the encoding you
 761         want to use. The function returns 0 if it successfully sets the encoding,
 762         and -1 otherwise. The current encoding for this connection can be obtained by
 763         using:
 764
 765         <programlisting>
 766 int PQclientEncoding(const PGconn *<replaceable>conn</replaceable>)
 767         </programlisting>
 768
 769         Note that it returns the encoding id, not the encoding symbol string
 770         such as <literal>EUC_JP</literal>. To convert an encoding id to an encoding symbol, you
 771         can use:
 772
 773         <programlisting>
 774 char *pg_encoding_to_char(int <replaceable>encoding_id</replaceable>)
 775         </programlisting>
 776        </para>
 777       </listitem>
 778
 779       <listitem>
 780        <para>
 781         Using <command>SET CLIENT_ENCODING TO</command>.
 782
 783         Setting the frontend side encoding can be done by this SQL command:
 784
 785         <programlisting>
 786 SET CLIENT_ENCODING TO 'encoding';
 787         </programlisting>
 788
 789         You can also use the SQL92 syntax <literal>SET NAMES</literal> for this purpose:
 790
 791         <programlisting>
 792 SET NAMES 'encoding';
 793         </programlisting>
 794
 795         To query the current frontend encoding:
 796
 797         <programlisting>
 798 SHOW CLIENT_ENCODING;
 799         </programlisting>
 800
 801         To return to the default encoding:
 802
 803         <programlisting>
 804 RESET CLIENT_ENCODING;
 805         </programlisting>
 806        </para>
 807       </listitem>
 808
 809       <listitem>
 810        <para>
 811         If the environment variable <envar>PGCLIENTENCODING</envar> is defined
 812
 813         If environment variable <envar>PGCLIENTENCODING</envar> is defined
 814         in the client's environment, that client encoding is automatically
 815         selected when a backend connection is made.  (This can subsequently
 816         be overridden using any of the other methods mentioned above.)
 817        </para>
 818       </listitem>
 819      </itemizedlist>
 820     </para>
 821    </sect2>
 822
 823    <sect2>
 824     <title>About Unicode</title>
 825
 826     <indexterm><primary>Unicode</></>
 827
 828     <para>
 829      Automatic encoding translation between Unicode and other
 830      encodings has been supported since PostgreSQL 7.1.
 831      In 7.1 it is not enabled by default.
 832      To enable this feature, run <filename>configure</filename> with the
 833      <option>--enable-unicode-conversion</option> option. Note that this also
 834      requires the <option>--enable-multibyte</option> option.
 835     </para>
 836     <para>
 837      In 7.2, <option>--enable-unicode-conversion</option> is not necessary.
 838      The Unicode conversion functionality is automatically enabled
 839      if <option>--enable-multibyte</option> is specified.
 840     </para>
 841    </sect2>
 842
 843    <sect2>
 844     <title>What happens if the translation is not possible?</title>
 845
 846     <para>
 847      Suppose you choose <literal>EUC_JP</literal> for the backend and <literal>LATIN1</literal> for the frontend.
 848      Some Japanese characters then cannot be translated into <literal>LATIN1</literal>. In
 849      this case, a character that cannot be represented in the <literal>LATIN1</literal> character set
 850      is transformed as:
 851
 852      <programlisting>
 853 (HEXA DECIMAL)
 854      </programlisting>
 855     </para>
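    <para>
     The fallback described above can be illustrated with a minimal Python
     sketch. This is not the server's code: the function name is hypothetical,
     and the exact output format (the character's server-side bytes as
     lowercase hexadecimal digits in parentheses) is an assumption made for
     illustration only.
    </para>

```python
def to_latin1_with_fallback(text, server_encoding="euc_jp"):
    """Convert text to LATIN1 bytes, replacing unrepresentable characters.

    ASSUMPTION: the replacement is the character's bytes in the server
    encoding, written as lowercase hex inside parentheses. The real
    server output may be formatted differently.
    """
    out = bytearray()
    for ch in text:
        try:
            # Characters that exist in LATIN1 pass through unchanged.
            out += ch.encode("latin-1")
        except UnicodeEncodeError:
            # Otherwise emit the server-side byte sequence as "(hex)".
            raw = ch.encode(server_encoding)
            out += b"(" + raw.hex().encode("ascii") + b")"
    return bytes(out)
```

    <para>
     Under these assumptions, a string made of the letter
     <literal>a</literal> followed by HIRAGANA LETTER A (bytes
     <literal>A4 A2</literal> in <literal>EUC_JP</literal>) would be sent to
     the frontend as <literal>a(a4a2)</literal>.
    </para>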
 856    </sect2>
 857
 858    <sect2>
 859     <title>References</title>
 860
 861     <para>
 862      These are good sources to start learning about various kinds of encoding
 863      systems.
 864
 865      <variablelist>
 866       <varlistentry>
 867        <term><ulink url="ftp://ftp.ora.com/pub/examples/nutshell/ujip/doc/cjk.inf"></ulink></term>
 868
 869        <listitem>
 870         <para>
 871          Detailed explanations of <literal>EUC_JP</literal>,
 872          <literal>EUC_CN</literal>, <literal>EUC_KR</literal>,
 873          <literal>EUC_TW</literal> appear in section 3.2.
 874         </para>
 875        </listitem>
 876       </varlistentry>
 877
 878       <varlistentry>
 879        <term><ulink url="http://www.unicode.org/"></ulink></term>
 880
 881        <listitem>
 882         <para>
 883          The web site of the Unicode Consortium
 884         </para>
 885        </listitem>
 886       </varlistentry>
 887
 888       <varlistentry>
 889        <term>RFC 2044</term>
 890
 891        <listitem>
 892         <para>
 893          <acronym>UTF</acronym>-8 is defined here (since obsoleted by RFC 2279).
 894         </para>
 895        </listitem>
 896       </varlistentry>
 897      </variablelist>
 898     </para>
 899    </sect2>
 900
 901    <sect2>
 902     <title>History</title>
 903
 904 <literallayout class="monospaced">
 905 Dec 7, 2000
 906         * Automatic encoding translation between Unicode and other
 907           encodings is implemented
 908         * Changes above will appear in 7.1
 909
 910 May 20, 2000
 911         * SJIS UDC (NEC selection IBM kanji) support contributed
 912           by Eiji Tokuya
 913         * Changes above will appear in 7.0.1
 914
 915 Mar 22, 2000
 916         * Add new libpq functions PQsetClientEncoding, PQclientEncoding
 917         * ./configure --with-mb=EUC_JP
 918           now deprecated. use
 919           ./configure --enable-multibyte=EUC_JP
 920           instead
 921         * Add SQL_ASCII regression test case
 922         * Add SJIS User Defined Character (UDC) support
 923         * All of above will appear in 7.0
 924
 925 July 11, 1999
 926         * Add support for WIN1250 (Windows Czech) as a client encoding
 927           (contributed by Pavel Behal)
 928         * fix some compiler warnings (contributed by Tomoaki Nishiyama)
 929
 930 Mar 23, 1999
 931         * Add support for KOI8(KOI8-R), WIN(CP1251), ALT(CP866)
 932           (thanks Oleg Broytmann for testing)
 933         * Fix problem with MB and locale
 934
 935 Jan 26, 1999
 936         * Add support for Big5 for frontend encoding
 937           (you need to create a database with EUC_TW to use Big5)
 938         * Add regression test case for EUC_TW
 939           (contributed by Jonah Kuo <email>jonahkuo@mail.ttn.com.tw</email>)
 940
 941 Dec 15, 1998
 942         * Bugs related to SQL_ASCII support fixed
 943
 944 Nov 5, 1998
 945         * 6.4 release. In this version, pg_database has "encoding"
 946           column that represents the database encoding
 947
 948 Jul 22, 1998
 949         * determine encoding at initdb/createdb rather than compile time
 950         * support for PGCLIENTENCODING when issuing COPY command
 951         * support for SQL92 syntax "SET NAMES"
 952         * support for LATIN2-5
 953         * add UNICODE regression test case
 954         * new test suite for MB
 955         * clean up source files
 956
 957 Jun 5, 1998
 958         * add support for the encoding translation between the backend
 959           and the frontend
 960         * new command SET CLIENT_ENCODING etc. added
 961         * add support for LATIN1 character set
 962         * enhance 8 bit cleanness
 963
 964 April 21, 1998 some enhancements/fixes
 965         * character_length(), position(), substring() are now aware of
 966           multi-byte characters
 967         * add octet_length()
 968         * add --with-mb option to configure
 969         * new regression tests for EUC_KR
 970           (contributed by Soonmyung Hong <email>hong@lunaris.hanmesoft.co.kr</email>)
 971         * add some test cases to the EUC_JP regression test
 972         * fix problem in regress/regress.sh in case of System V
 973         * fix toupper(), tolower() to handle 8bit chars
 974
 975 Mar 25, 1998 MB PL2 is incorporated into PostgreSQL 6.3.1
 976
 977 Mar 10, 1998 PL2 released
 978         * add regression test for EUC_JP, EUC_CN and MULE_INTERNAL
 979         * add an English document (this file)
 980         * fix problems concerning 8-bit single byte characters
 981
 982 Mar 1, 1998 PL1 released
 983 </literallayout>
 984    </sect2>
 985
 986    <sect2>
 987     <title>WIN1250 on Windows/ODBC</title>
 988
 989     <para>
 990 <!--
 991 [Here is a good documentation explaining how to use WIN1250 on
 992 Windows/ODBC from Pavel Behal]
 993
 994 Version: 0.91 for PgSQL 6.5
 995 Author: Pavel Behal
 996 Revised by: Tatsuo Ishii
 997 Email: behal@opf.slu.cz
 998 Licence: The Same as PostgreSQL
 999
1000 Sorry for my Eglish and C code, I'm not native :-)
1001
1002 !!!!!!!!!!!!!!!!!!!!!!!!! NO WARRANTY !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
1003 -->
1004
1005      The WIN1250 character set on Windows client platforms can be used
1006      with <productname>Postgres</productname> with locale support
1007      enabled.
1008     </para>
1009
1010     <para>
1011      The following should be kept in mind:
1012
1013      <itemizedlist>
1014       <listitem>
1015        <para>
1016         Success depends on proper system locales. This has been tested
1017         with <systemitem class="osname">Red Hat 6.0</> and <systemitem
1018         class="osname">Slackware 3.6</>, with <literal>cs_CZ.iso8859-2</literal> locale.
1019        </para>
1020       </listitem>
1021
1022       <listitem>
1023        <para>
1024         Never try to set the server multibyte database encoding to WIN1250.
1025         Always use LATIN2 instead, since there is no WIN1250 locale
1026         on Unix.
1027        </para>
1028       </listitem>
1029
1030       <listitem>
1031        <para>
1032         WIN1250 encoding is usable only for Windows ODBC clients. The
1033         characters are recoded on the fly, to be displayed and stored
1034         back properly.
1035        </para>
1036       </listitem>
1037      </itemizedlist>
1038     </para>
1039
1040     <para>
1041      When running, it is important to remember the following:
1042
1043      <itemizedlist>
1044       <listitem>
1045        <para>
1046         This configuration changes the sort order according to your
1047         <envar>LC_<replaceable>x</replaceable></envar> settings. Do not be
1048         confused by the regression test results, since they do not use
1049         locale.
1050        </para>
1051       </listitem>
1052
1053       <listitem>
1054        <para>
1055         A locale such as <literal>ch</literal> is correctly sorted
1056         only if your system
1057         supports that locale; older systems may not do so but new ones
1058         (e.g. RH6.0) do.
1059        </para>
1060       </listitem>
1061
1062       <listitem>
1063        <para>
1064         Money values must be entered as '<literal>162,50</literal>' (note
1065         the comma inside the single quotes).
1066        </para>
1067       </listitem>
1068
1069       <listitem>
1070        <para>
1071         At the time of writing (early 1999), this configuration has
1072         not received extensive testing. Please let us know of any
1073         changes you had to make!
1074        </para>
1075       </listitem>
1076      </itemizedlist>
1077     </para>
1078
1079     <procedure>
1080      <title>WIN1250 on Windows/ODBC</title>
1081      <step>
1082       <para>
1083        Compile <productname>Postgres</productname> with locale enabled
1084        and the multibyte encoding set to <literal>LATIN2</literal>.
1085       </para>
1086      </step>
1087
1088      <step>
1089       <para>
1090        Set up your installation. Do not forget to create locale
1091        variables in your profile (environment). For example (this may
1092        not be correct for <emphasis>your</emphasis> environment):
1093
1094        <programlisting>
1095 LC_ALL=cs_CZ.ISO8859-2
1096 LC_COLLATE=cs_CZ.ISO8859-2
1097 LC_CTYPE=cs_CZ.ISO8859-2
1098 LC_MONETARY=cs_CZ.ISO8859-2
1099 LC_NUMERIC=cs_CZ.ISO8859-2
1100 LC_TIME=cs_CZ.ISO8859-2
1101        </programlisting>
1102       </para>
1103      </step>
1104
1105      <step>
1106       <para>
1107        Start the postmaster with these locale variables set.
1108       </para>
1109      </step>
1110
1111      <step>
1112       <para>
1113        Try it with the Czech language, and have it sort the results of a query.
1114       </para>
1115      </step>
1116
1117      <step>
1118       <para>
1119        Install the ODBC driver for <productname>PostgreSQL</productname> on your Windows machine.
1120       </para>
1121      </step>
1122
1123      <step>
1124       <para>
1125        Set up your data source properly. Include this line in the
1126        <literal>Connect Settings</literal> field of your ODBC configuration dialog:
1127
1128        <programlisting>
1129 SET CLIENT_ENCODING = 'WIN1250';
1130        </programlisting>
1131       </para>
1132      </step>
1133
1134      <step>
1135       <para>
1136        Now try it again, but in Windows with ODBC.
1137       </para>
1138      </step>
1139     </procedure>
1140    </sect2>
1141   </sect1>
1142
1143
1144  <sect1 id="recode">
1145   <title>Single-byte character set recoding</>
1146 <!-- formerly in README.charsets, by Josef Balatka, <balatka@email.cz> -->
1147
1148   <para>
1149    You can set up this feature with the <option>--enable-recode</> option
1150    to <filename>configure</>. This option was formerly described as
1151    <quote>Cyrillic recode support</> which doesn't express all its
1152    power. It can be used for <emphasis>any</> single-byte character
1153    set recoding.
1154   </para>
1155
1156   <para>
1157    This method uses a <filename>charset.conf</> file located in
1158    the database directory (<envar>PGDATA</>).  It is a typical
1159    text configuration file in which spaces and newlines separate items
1160    and records, and <literal>#</literal> starts a comment.  Three keywords with the
1161    following syntax are recognized:
1162 <synopsis>
1163 BaseCharset      <replaceable>server_charset</>
1164 RecodeTable      <replaceable>from_charset</> <replaceable>to_charset</> <replaceable>file_name</>
1165 HostCharset      <replaceable>host_spec</>    <replaceable>host_charset</>
1166 </synopsis>
1167   </para>
1168
1169   <para>
1170    <token>BaseCharset</> defines the encoding of the database server.
1171    All character set names are only used for mapping inside of
1172    <filename>charset.conf</> so you can freely use typing-friendly
1173    names.
1174   </para>
1175
1176   <para>
1177    <token>RecodeTable</> records specify translation tables between
1178    server and client.  The file name is relative to the
1179    <envar>PGDATA</> directory.  The table file format is very
1180    simple. There are no keywords, and each line represents one
1181    character mapping as a pair of decimal or hexadecimal
1182    (0x-prefixed) values:
1183 <synopsis>
1184 <replaceable>char_value</>   <replaceable>translated_char_value</>
1185 </synopsis>
1186   </para>
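  <para>
   To make the table file format concrete, here is a minimal Python sketch of
   a reader for such a file. It is not the server's parser. The function
   names are hypothetical, and two behaviors are assumptions: bytes that the
   table does not list map to themselves, and <literal>#</literal> comments
   are accepted in table files as they are in <filename>charset.conf</>.
  </para>

```python
def parse_value(token):
    """Parse one decimal or 0x-prefixed hexadecimal byte value."""
    if token.lower().startswith("0x"):
        value = int(token[2:], 16)
    else:
        value = int(token, 10)
    if not 0 <= value <= 255:
        raise ValueError(f"not a single-byte value: {token}")
    return value


def load_recode_table(text):
    """Build a 256-entry translation table from table-file text."""
    # ASSUMPTION: bytes not mentioned in the file map to themselves.
    table = list(range(256))
    for line in text.splitlines():
        # ASSUMPTION: '#' starts a comment, as in charset.conf.
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        source, target = line.split()
        table[parse_value(source)] = parse_value(target)
    return bytes(table)


def recode(data, table):
    """Apply a translation table to a byte string."""
    return data.translate(table)
```

  <para>
   For example, a table containing the two lines <literal>0xE0 0xC0</literal>
   and <literal>225 193</literal> maps byte 0xE0 to 0xC0 and byte 0xE1 to
   0xC1, and in this sketch leaves every other byte unchanged.
  </para>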
1187
1188   <para>
1189    <token>HostCharset</> records define the client character set by IP
1190    address. You can use a single IP address, an IP mask range starting
1191    from the given address or an IP interval (e.g., 127.0.0.1,
1192    192.168.1.100/24, 192.168.1.20-192.168.1.40).
1193   </para>
1194
1195   <para>
1196    The <filename>charset.conf</> file is always processed up to the
1197    end, so you can easily specify exceptions from the previous
1198    rules. In the <filename>src/data/</> directory you will find an
1199    example <filename>charset.conf</> and a few recoding tables.
1200   </para>
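  <para>
   The <token>HostCharset</> matching rules can be sketched in Python as
   follows. This is an illustration under stated assumptions, not the
   server's implementation: the function names and charset names are
   hypothetical, an <replaceable>address</>/<replaceable>bits</> spec is
   interpreted as the network containing that address, and
   <quote>processed up to the end</quote> is taken to mean that the last
   matching record wins.
  </para>

```python
import ipaddress


def host_matches(spec, client):
    """Return True if the client address matches a host spec.

    Supported forms: a single address, address/bits, or low-high.
    """
    addr = ipaddress.ip_address(client)
    if "-" in spec:
        low, high = (ipaddress.ip_address(p.strip()) for p in spec.split("-", 1))
        return low <= addr <= high
    if "/" in spec:
        # ASSUMPTION: the mask form denotes the network containing the address.
        return addr in ipaddress.ip_network(spec, strict=False)
    return addr == ipaddress.ip_address(spec)


def charset_for(records, client, default=None):
    """Pick the charset for a client from (host_spec, host_charset) records.

    ASSUMPTION: every record is examined and the last match wins, so
    later records can act as exceptions to earlier ones.
    """
    chosen = default
    for spec, charset in records:
        if host_matches(spec, client):
            chosen = charset
    return chosen
```

  <para>
   With the records <literal>192.168.1.100/24 koi8</literal> followed by
   <literal>192.168.1.20-192.168.1.40 win1250</literal>, this sketch selects
   <literal>win1250</literal> for a client at 192.168.1.30, because the later
   and more specific record overrides the earlier one.
  </para>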
1201
1202   <para>
1203    As this solution is based on the client's IP address and character
1204    set mapping there are obviously some restrictions as well. You
1205    cannot use different encodings on the same host at the same
1206    time. It is also inconvenient when you boot your client hosts into
1207    multiple operating systems.  Nevertheless, when these restrictions are
1208    not limiting and you do not need multibyte characters, then it is a
1209    simple and effective solution.
1210   </para>
1211  </sect1>
1212
1213 </chapter>
1214
1215 <!-- Keep this comment at the end of the file
1216 Local variables:
1217 mode:sgml
1218 sgml-omittag:nil
1219 sgml-shorttag:t
1220 sgml-minimize-attributes:nil
1221 sgml-always-quote-attributes:t
1222 sgml-indent-step:1
1223 sgml-indent-data:t
1224 sgml-parent-document:nil
1225 sgml-default-dtd-file:"./reference.ced"
1226 sgml-exposed-tags:nil
1227 sgml-local-catalogs:("/usr/lib/sgml/catalog")
1228 sgml-local-ecat-files:nil
1229 End:
1230 -->