   1 <!-- $Header: /cvsroot/pgsql/doc/src/sgml/charset.sgml,v 2.19 2001/11/28 20:49:09 petere Exp $ -->
   2
   3 <chapter id="charset">
   4  <title>Localization</>
   5
   6  <abstract>
   7   <para>
   8    Describes the available localization features from the point of
   9    view of the administrator.
  10   </para>
  11  </abstract>
  12
  13   <para>
  14    <productname>PostgreSQL</productname> supports localization with
  15    three approaches:
  16
  17    <itemizedlist>
  18     <listitem>
  19      <para>
  20       Using the locale features of the operating system to provide
  21       locale-specific collation order, number formatting, translated
  22       messages, and other aspects.
  23      </para>
  24     </listitem>
  25
  26     <listitem>
  27      <para>
  28       Using explicit multiple-byte character sets defined in the
  29       <productname>PostgreSQL</productname> server to support languages
  30       that require more characters than will fit into a single byte,
  31       and to provide character set recoding between client and server.
  32       The number of supported character sets is fixed at the time the
  33       server is compiled, and internal operations such as string
  34       comparisons require expansion of each character into a 32-bit
  35       word.
  36      </para>
  37     </listitem>
  38
  39     <listitem>
  40      <para>
  41       Using single-byte character recoding, which provides a more
  42       lightweight solution for users of multiple, yet single-byte, character sets.
  43      </para>
  44     </listitem>
  45    </itemizedlist>
  46   </para>
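  <para>
   The cost mentioned for multibyte character sets comes from turning a
   variable-length byte sequence into fixed-width units before comparing.
   The following Python sketch illustrates the idea only.  It uses Unicode
   code points as the 32-bit values, which is an assumption made for
   illustration and not the server's internal representation; the function
   name <literal>expand_to_words</literal> is hypothetical.
  </para>

```python
import struct


def expand_to_words(data: bytes, encoding: str) -> bytes:
    """Expand an encoded string into one 32-bit word per character."""
    text = data.decode(encoding)
    # "<I" packs each code point as a little-endian unsigned 32-bit integer.
    return b"".join(struct.pack("<I", ord(ch)) for ch in text)


if __name__ == "__main__":
    raw = "na\u00efve".encode("utf-8")      # 6 bytes for 5 characters
    words = expand_to_words(raw, "utf-8")   # 5 characters * 4 bytes each
    print(len(raw), "bytes ->", len(words), "bytes after expansion")
```

  <para>
   The expanded form is larger, but every character then occupies the same
   width, which is what makes character-by-character operations simple.
  </para>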
  47
  48
  49  <sect1 id="locale">
  50   <title>Locale Support</title>
  51
  52   <indexterm zone="locale"><primary>locale</></>
  53
  54   <para>
  55    <firstterm>Locale</> support refers to an application respecting
  56    cultural preferences regarding alphabets, sorting, number
  57    formatting, etc.  <productname>PostgreSQL</> uses the standard ISO
  58    C and <acronym>POSIX</acronym>-like locale facilities provided by the server operating
  59    system.  For additional information refer to the documentation of your
  60    system.
  61   </para>
  62
  63   <sect2>
  64    <title>Overview</>
  65
  66   <para>
  67     Locale support is not built into <productname>PostgreSQL</> by
  68     default; to enable it, supply the <option>--enable-locale</> option
  69     to the <filename>configure</> script:
  70 <informalexample>
  71 <screen>
  72 <prompt>$ </><userinput>./configure --enable-locale</>
  73 </screen>
  74 </informalexample>
  75     Locale support only affects the server; all clients are compatible
  76     with servers with or without locale support.
  77    </para>
  78
  79    <para>
  80     To enable messages translated to the user's preferred language,
  81     the <option>--enable-nls</option> option must be used.  This
  82     option is independent of the other locale support.
  83    </para>
  84
  85    <para>
  86     The information about which particular cultural rules to use is
  87     determined by standard environment variables.  If you are getting
  88     localized behavior from other programs you probably have them set
  89     up already.  The simplest way to set the localization information
  90     is the <envar>LANG</> variable, for example:
  91 <programlisting>
  92 export LANG=sv_SE
  93 </programlisting>
  94     This sets the locale to Swedish (<literal>sv</>) as spoken in
  95     Sweden (<literal>SE</>).  Other possibilities might be
  96     <literal>en_US</> (U.S. English) and <literal>fr_CA</> (Canada,
  97     French).  If more than one character set can be useful for a locale
  98     then the specifications look like this:
  99     <literal>cs_CZ.ISO8859-2</>. What locales are available under what
 100     names on your system depends on what was provided by the operating
 101     system vendor and what was installed.
 102    </para>
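   <para>
    As a rough illustration of how such names are put together, the
    following Python sketch splits a locale name into language, territory,
    and character set.  The helper <literal>parse_locale_name</literal> is
    hypothetical, and it deliberately ignores any further vendor-specific
    parts a system might allow in a name.
   </para>

```python
from typing import NamedTuple, Optional


class LocaleName(NamedTuple):
    language: str
    territory: Optional[str]
    charset: Optional[str]


def parse_locale_name(name: str) -> LocaleName:
    """Split a name such as 'cs_CZ.ISO8859-2' into its parts."""
    # Everything after the first "." is the character set, if present.
    base, _, charset = name.partition(".")
    # Everything after the first "_" in the remainder is the territory.
    language, _, territory = base.partition("_")
    return LocaleName(language, territory or None, charset or None)


if __name__ == "__main__":
    for name in ("sv_SE", "fr_CA", "cs_CZ.ISO8859-2", "C"):
        print(name, "->", parse_locale_name(name))
```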
 103
 104    <para>
 105     Occasionally it is useful to mix rules from several locales, e.g.,
 106     use U.S. collation rules but Spanish messages.  To do that, a set of
 107     environment variables exists, each of which overrides the default of
 108     <envar>LANG</> for a particular category:
 109
 110     <informaltable>
 111      <tgroup cols="2">
 112       <tbody>
 113        <row>
 114         <entry><envar>LC_COLLATE</></>
 115         <entry>String sort order</>
 116        </row>
 117        <row>
 118         <entry><envar>LC_CTYPE</></>
 119         <entry>Character classification (What is a letter? The upper-case equivalent?)</>
 120        </row>
 121        <row>
 122         <entry><envar>LC_MESSAGES</></>
 123         <entry>Language of messages</>
 124        </row>
 125        <row>
 126         <entry><envar>LC_MONETARY</></>
 127         <entry>Formatting of currency amounts</>
 128        </row>
 129        <row>
 130         <entry><envar>LC_NUMERIC</></>
 131         <entry>Formatting of numbers</>
 132        </row>
 133        <row>
 134         <entry><envar>LC_TIME</></>
 135         <entry>Formatting of dates and times</>
 136        </row>
 137       </tbody>
 138      </tgroup>
 139     </informaltable>
 140
 141     Additionally, all of these specific variables and the
 142     <envar>LANG</> variable can be overridden with the
 143     <envar>LC_ALL</> environment variable.
 144    </para>
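   <para>
    The precedence described above (<envar>LC_ALL</envar> first, then the
    category-specific variable, then <envar>LANG</envar>) can be sketched
    in Python as follows.  The function name
    <literal>effective_locale</literal> is hypothetical, and treating an
    empty value as unset and falling back to <literal>C</literal> are
    assumptions based on common POSIX behavior.
   </para>

```python
import os
from typing import Mapping, Optional


def effective_locale(category: str,
                     env: Optional[Mapping[str, str]] = None) -> str:
    """Return the locale name that applies to a category such as 'LC_COLLATE'."""
    if env is None:
        env = os.environ
    # Highest priority first; empty values count as unset.
    for variable in ("LC_ALL", category, "LANG"):
        value = env.get(variable)
        if value:
            return value
    # Nothing set at all: behave as if there were no locale support.
    return "C"


if __name__ == "__main__":
    env = {"LANG": "en_US", "LC_MESSAGES": "es_ES"}
    print(effective_locale("LC_COLLATE", env))    # taken from LANG
    print(effective_locale("LC_MESSAGES", env))   # taken from LC_MESSAGES
```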
 145
 146    <note>
 147     <para>
 148      Some message localization libraries also look at the environment
 149      variable <envar>LANGUAGE</envar> which overrides all other locale
 150      settings for the purpose of setting the language of messages.  If
 151      in doubt, please refer to the documentation of your operating
 152      system, in particular the
 153      <citerefentry><refentrytitle>gettext</><manvolnum>3</></> manual
 154      page, for more information.
 155     </para>
 156    </note>
 157
 158    <para>
 159     If you want the system to behave as if it had no locale support,
 160     use the special locale <literal>C</> or <literal>POSIX</>, or
 161     simply unset all locale-related variables.
 162    </para>
 163
 164    <para>
 165     Note that the locale behavior of the server is determined by the
 166     environment variables seen by the server, not by the environment
 167     of any client.  Therefore, be careful to set these variables
 168     before starting the server.  A consequence of this is that if
 169     client and server are set up to different locales, messages may
 170     appear in different languages depending on where they originated.
 171    </para>
 172
 173    <para>
 174     The <envar>LC_COLLATE</> and <envar>LC_CTYPE</> variables affect the
 175     sort order of indexes.  Therefore, these values must be kept fixed
 176     for any particular database cluster, or indexes on text columns will
 177     become corrupt.  <productname>PostgreSQL</productname> enforces this
 178     by recording the values of <envar>LC_COLLATE</> and <envar>LC_CTYPE</>
 179     that are seen by <application>initdb</>.  The server automatically adopts
 180     those two values when it is started; only the other <envar>LC_</>
 181     categories can be set from the environment at server startup.
 182     In short, only one collation order can be used in a database cluster,
 183     and it is chosen at <application>initdb</> time.
 184    </para>
 185   </sect2>
 186
 187   <sect2>
 188    <title>Benefits</>
 189
 190    <para>
 191     Locale support influences in particular the following features:
 192
 193     <itemizedlist>
 194      <listitem>
 195       <para>
 196        Sort order in <command>ORDER BY</> queries.
 197        <indexterm><primary>ORDER BY</></>
 198       </para>
 199      </listitem>
 200
 201      <listitem>
 202       <para>
 203        The <function>to_char</> family of functions
 204       </para>
 205      </listitem>
 206
 207      <listitem>
 208       <para>
 209        The <literal>LIKE</> and <literal>~</> operators for pattern
 210        matching
 211       </para>
 212      </listitem>
 213     </itemizedlist>
 214    </para>
 215
 216    <para>
 217     The only severe drawback of using the locale support in
 218     <productname>PostgreSQL</> is reduced speed.  So use locale support only if you
 219     actually need it.  Note in particular that selecting
 220     a non-C locale disables index optimizations for <literal>LIKE</> and
 221     <literal>~</> operators, which can make a huge difference in the
 222     speed of searches that use those operators.
 223    </para>
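   <para>
    To see why a prefix search can use an index only under simple
    character-by-character ordering, consider the following Python sketch.
    It turns a prefix into a range of values, which is one common way such
    an optimization can work; it is not taken from the server's code, and
    the helper <literal>prefix_range</literal> is hypothetical.  The range
    is valid only when strings compare code point by code point, as they
    do in the C locale.
   </para>

```python
def prefix_range(prefix: str):
    """Return (low, high) with low <= s < high for every s starting with prefix.

    Valid only under code-point ordering, and assumes the last character
    of the prefix is not the highest possible code point.
    """
    if not prefix:
        raise ValueError("prefix must not be empty")
    # Increment the last character to get the first string past the prefix.
    high = prefix[:-1] + chr(ord(prefix[-1]) + 1)
    return prefix, high


if __name__ == "__main__":
    words = sorted(["abd", "abc", "abcde", "ab", "abcz", "b"])
    low, high = prefix_range("abc")
    # A sorted index can jump straight to this range instead of scanning.
    print([w for w in words if low <= w < high])
```

   <para>
    Under a locale with more elaborate collation rules, strings sharing a
    prefix are not guaranteed to fall into such a contiguous range, so the
    shortcut cannot be relied upon.
   </para>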
 224   </sect2>
 225
 226   <sect2>
 227    <title>Problems</>
 228
 229    <para>
 230     If locale support doesn't work in spite of the explanation above,
 231     check that the locale support in your operating system is correctly configured.
 232     To check whether a given locale is installed and functional you
 233     can use <application>Perl</>, for example.  Perl also supports
 234     locales, and if a locale is broken, <command>perl -v</> will
 235     complain with something like this:
 236 <screen>
 237 <prompt>$</> <userinput>export LC_CTYPE='not_exist'</>
 238 <prompt>$</> <userinput>perl -v</>
 239 <computeroutput>
 240 perl: warning: Setting locale failed.
 241 perl: warning: Please check that your locale settings:
 242 LC_ALL = (unset),
 243 LC_CTYPE = "not_exist",
 244 LANG = (unset)
 245 are supported and installed on your system.
 246 perl: warning: Falling back to the standard locale ("C").
 247 </computeroutput>
 248 </screen>
 249    </para>
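   <para>
    The same kind of check can be sketched in Python, whose
    <literal>locale</literal> module calls the C library's
    <function>setlocale</function>.  The helper name
    <literal>locale_is_usable</literal> is hypothetical, and the result
    depends on the C library: some libraries accept locale names they do
    not really provide.
   </para>

```python
import locale


def locale_is_usable(name: str, category: int = locale.LC_CTYPE) -> bool:
    """Return True if the C library accepts the locale name for the category."""
    saved = locale.setlocale(category)        # query the current setting
    try:
        locale.setlocale(category, name)
        return True
    except locale.Error:
        return False
    finally:
        locale.setlocale(category, saved)     # restore the previous setting


if __name__ == "__main__":
    for candidate in ("C", "sv_SE", "not_exist"):
        print(candidate, locale_is_usable(candidate))
```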
 250
 251    <para>
 252     Check that your locale files are in the right location.  Possible
 253     locations include: <filename>/usr/lib/locale</filename> (<systemitem class="osname">Linux</>,
 254     <systemitem class="osname">Solaris</>), <filename>/usr/share/locale</filename> (<systemitem class="osname">Linux</>),
 255     <filename>/usr/lib/nls/loc</filename> (<systemitem class="osname">DUX 4.0</>).  Check the locale
 256     man page of your system if you are not sure.
 257    </para>
 258
 259    <para>
 260     Check that <productname>PostgreSQL</> is actually using the locale that
 261     you think it is.  <envar>LC_COLLATE</> and <envar>LC_CTYPE</> settings are
 262     determined at <application>initdb</> time and cannot be changed without
 263     repeating <application>initdb</>.  Other locale settings including
 264     <envar>LC_MESSAGES</> and <envar>LC_MONETARY</> are determined by the
 265     environment the postmaster is started in, and can be changed with a simple
 266     postmaster restart.  You can check the <envar>LC_COLLATE</> and
 267     <envar>LC_CTYPE</> settings of
 268     a database with the <filename>contrib/pg_controldata</> utility program.
 269    </para>
 270
 271    <para>
 272     The directory <filename>src/test/locale</> contains a test suite
 273     for <productname>PostgreSQL</>'s locale support.
 274    </para>
 275
 276    <para>
 277     Client applications that handle server-side errors by parsing the
 278     text of the error message will obviously have problems when the
 279     server's messages are in a different language.  If you create such
 280     an application you need to devise a plan to cope with this
 281     situation.  The embedded SQL interface (<application>ecpg</>) is
 282     also affected by this problem.  It is currently recommended that
 283     servers interfacing with <application>ecpg</> applications be
 284     configured to send messages in English.
 285    </para>
 286
 287    <para>
 288     Maintaining catalogs of message translations requires the on-going
 289     efforts of many volunteers who want to see
 290     <productname>PostgreSQL</> speak their preferred language well.
 291     If messages in your language are currently not available or not fully
 292     translated, your assistance would be appreciated.  If you want to
 293     help, refer to the <citetitle>Developer's Guide</> or write to the
 294     developers' mailing list.
 295    </para>
 296   </sect2>
 297  </sect1>
 298
 299
 300   <sect1 id="multibyte">
 301    <title>Multibyte Support</title>
 302
 303   <indexterm zone="multibyte"><primary>multibyte</></>
 304
 305    <note>
 306     <title>Author</title>
 307
 308     <para>
 309      Tatsuo Ishii (<email>ishii@postgresql.org</email>),
 310      last updated 2000-03-22.
 311      Check <ulink
 312       url="http://www.sra.co.jp/people/t-ishii/PostgreSQL/">Tatsuo's
 313       web site</ulink> for more information.
 314     </para>
 315    </note>
 316
 317    <para>
 318     Multibyte (<acronym>MB</acronym>) support is intended to allow
 319     <productname>PostgreSQL</productname> to handle
 320     multiple-byte character sets such as <acronym>EUC</> (Extended Unix Code), Unicode, and
 321     Mule internal code. With <acronym>MB</acronym> enabled you can use multibyte
 322     character sets in regular expressions (regexp), LIKE, and some
 323     other functions. The default
 324     encoding system is selected while initializing your
 325     <productname>PostgreSQL</productname> installation using
 326     <application>initdb</application>. Note that this can be
 327     overridden when you create a database using
 328     <application>createdb</application> or by using the SQL command
 329     <command>CREATE DATABASE</>. So you can have multiple databases each with
 330     a different encoding system.
 331    </para>
 332
 333    <sect2>
 334     <title>Enabling Multibyte Support</title>
 335
 336     <para>
 337      Run configure with the multibyte option:
 338
 339 <synopsis>
 340 ./configure --enable-multibyte<optional>=<replaceable>encoding_system</replaceable></optional>
 341 </synopsis>
 342
 343      where <replaceable>encoding_system</replaceable> can be one of the
 344      values in the following table:
 345
 346      <table tocentry="1">
 347       <title>Character Set Encodings</title>
 348       <titleabbrev>Encodings</titleabbrev>
 349       <tgroup cols="2">
 350        <thead>
 351         <row>
 352          <entry>Encoding</entry>
 353          <entry>Description</entry>
 354         </row>
 355        </thead>
 356        <tbody>
 357         <row>
 358          <entry><literal>SQL_ASCII</literal></entry>
 359          <entry><acronym>ASCII</acronym></entry>
 360         </row>
 361         <row>
 362          <entry><literal>EUC_JP</literal></entry>
 363          <entry>Japanese <acronym>EUC</></entry>
 364         </row>
 365         <row>
 366          <entry><literal>EUC_CN</literal></entry>
 367          <entry>Chinese <acronym>EUC</></entry>
 368         </row>
 369         <row>
 370          <entry><literal>EUC_KR</literal></entry>
 371          <entry>Korean <acronym>EUC</></entry>
 372         </row>
 373         <row>
 374          <entry><literal>EUC_TW</literal></entry>
 375          <entry>Taiwan <acronym>EUC</acronym></entry>
 376         </row>
 377         <row>
 378          <entry><literal>UNICODE</literal></entry>
 379          <entry>Unicode (<acronym>UTF</acronym>-8)</entry>
 380         </row>
 381         <row>
 382          <entry><literal>MULE_INTERNAL</literal></entry>
 383          <entry>Mule internal code</entry>
 384         </row>
 385         <row>
 386          <entry><literal>LATIN1</literal></entry>
 387          <entry>ISO 8859-1 ECMA-94 Latin Alphabet No.1</entry>
 388         </row>
 389         <row>
 390          <entry><literal>LATIN2</literal></entry>
 391          <entry>ISO 8859-2 ECMA-94 Latin Alphabet No.2</entry>
 392         </row>
 393         <row>
 394          <entry><literal>LATIN3</literal></entry>
 395          <entry>ISO 8859-3 ECMA-94 Latin Alphabet No.3</entry>
 396         </row>
 397         <row>
 398          <entry><literal>LATIN4</literal></entry>
 399          <entry>ISO 8859-4 ECMA-94 Latin Alphabet No.4</entry>
 400         </row>
 401         <row>
 402          <entry><literal>LATIN5</literal></entry>
 403          <entry>ISO 8859-9 ECMA-128 Latin Alphabet No.5</entry>
 404         </row>
 405         <row>
 406          <entry><literal>LATIN6</literal></entry>
 407          <entry>ISO 8859-10 ECMA-144 Latin Alphabet No.6</entry>
 408         </row>
 409         <row>
 410          <entry><literal>LATIN7</literal></entry>
 411          <entry>ISO 8859-13 Latin Alphabet No.7</entry>
 412         </row>
 413         <row>
 414          <entry><literal>LATIN8</literal></entry>
 415          <entry>ISO 8859-14 Latin Alphabet No.8</entry>
 416         </row>
 417         <row>
 418          <entry><literal>LATIN9</literal></entry>
 419          <entry>ISO 8859-15 Latin Alphabet No.9</entry>
 420         </row>
 421         <row>
 422          <entry><literal>LATIN10</literal></entry>
 423          <entry>ISO 8859-16 ASRO SR 14111 Latin Alphabet No.10</entry>
 424         </row>
 425         <row>
 426          <entry><literal>ISO-8859-5</literal></entry>
 427          <entry>ECMA-113 Latin/Cyrillic</entry>
 428         </row>
 429         <row>
 430          <entry><literal>ISO-8859-6</literal></entry>
 431          <entry>ECMA-114 Latin/Arabic</entry>
 432         </row>
 433         <row>
 434          <entry><literal>ISO-8859-7</literal></entry>
 435          <entry>ECMA-118 Latin/Greek</entry>
 436         </row>
 437         <row>
 438          <entry><literal>ISO-8859-8</literal></entry>
 439          <entry>ECMA-121 Latin/Hebrew</entry>
 440         </row>
 441         <row>
 442          <entry><literal>KOI8</literal></entry>
 443          <entry><acronym>KOI</acronym>8-R(U)</entry>
 444         </row>
 445         <row>
 446          <entry><literal>WIN</literal></entry>
 447          <entry>Windows CP1251</entry>
 448         </row>
 449         <row>
 450          <entry><literal>ALT</literal></entry>
 451          <entry>Windows CP866</entry>
 452         </row>
 453        </tbody>
 454       </tgroup>
 455      </table>
 456     </para>
 457
 458     <important>
 459      <para>
 460       Before <productname>PostgreSQL</> 7.2, <literal>LATIN5</> mistakenly
 461       meant ISO 8859-5.  From 7.2 on,
 462       <literal>LATIN5</> means ISO 8859-9. If you have a <literal>LATIN5</>
 463       database created on 7.1 or earlier and want to migrate to 7.2 (or
 464       later), you should be very careful about this change.
 465      </para>
 466     </important>
 467
 468     <important>
 469      <para>
 470       Not all APIs support all the encodings listed above. For example, the
 471       <productname>PostgreSQL</>
 472       JDBC driver does not support <literal>MULE_INTERNAL</>, <literal>LATIN6</>,
 473       <literal>LATIN8</>, and <literal>LATIN10</>.
 474      </para>
 475     </important>
 476
 477     <para>
 478      Here is an example of configuring
 479      <productname>PostgreSQL</productname> to use a Japanese encoding by
 480      default:
 481
 482 <screen>
 483 $ <userinput>./configure --enable-multibyte=EUC_JP</userinput>
 484 </screen>
 485     </para>
 486
 487     <para>
 488      If the encoding system is omitted (<literal>./configure --enable-multibyte</literal>),
 489      <literal>SQL_ASCII</> is assumed.
 490     </para>
 491    </sect2>
 492
 493    <sect2>
 494     <title>Setting the Encoding</title>
 495
 496     <para>
 497      <application>initdb</application> defines the default encoding
 498      for a <productname>PostgreSQL</productname> installation. For example:
 499
 500 <screen>
 501 $ <userinput>initdb -E EUC_JP</>
 502 </screen>
 503
 504      sets the default encoding to <literal>EUC_JP</literal> (Extended Unix Code for Japanese).
 505      Note that you can use <option>--encoding</option> instead of <option>-E</option> if you prefer
 506      to type longer option strings.
 507      If no <option>-E</> or <option>--encoding</option> option is given, the encoding
 508      specified at configure time is used.
 509     </para>
 510
 511     <para>
 512      You can create a database with a different encoding:
 513
 514 <screen>
 515 $ <userinput>createdb -E EUC_KR korean</>
 516 </screen>
 517
 518      This creates a database named <database>korean</database> with <literal>EUC_KR</literal> encoding.
 519      Another way to accomplish this is to use a SQL command:
 520
 521 <programlisting>
 522 CREATE DATABASE korean WITH ENCODING = 'EUC_KR';
 523 </programlisting>
 524
 525      The encoding of each database is recorded in the
 526      <firstterm>encoding column</firstterm> of the
 527      <literal>pg_database</literal> system catalog.
 528      You can see that by using the <option>-l</option> option or the
 529      <command>\l</command> command of <command>psql</command>.
 530
 531 <screen>
 532 $ <userinput>psql -l</userinput>
 533             List of databases
 534    Database    |  Owner  |   Encoding
 535 ---------------+---------+---------------
 536  euc_cn        | t-ishii | EUC_CN
 537  euc_jp        | t-ishii | EUC_JP
 538  euc_kr        | t-ishii | EUC_KR
 539  euc_tw        | t-ishii | EUC_TW
 540  mule_internal | t-ishii | MULE_INTERNAL
 541  regression    | t-ishii | SQL_ASCII
 542  template1     | t-ishii | EUC_JP
 543  test          | t-ishii | EUC_JP
 544  unicode       | t-ishii | UNICODE
 545 (9 rows)
 546 </screen>
 547     </para>
 548    </sect2>
 549
 550    <sect2>
 551     <title>Automatic encoding translation between server and
 552      client</title>
 553
 554     <para>
 555      <productname>PostgreSQL</productname> supports an automatic
 556      encoding translation between server
 557      and client for some encodings.  The available combinations are
 558      listed in <xref linkend="multibyte-translation-table">.
 559     </para>
 560
 561      <table tocentry="1" id="multibyte-translation-table">
 562       <title>Client/Server Character Set Encodings</title>
 563       <titleabbrev>Communication Encodings</titleabbrev>
 564       <tgroup cols="2">
 565        <thead>
 566         <row>
 567          <entry>Server Encoding</entry>
 568          <entry>Available Client Encodings</entry>
 569         </row>
 570        </thead>
 571        <tbody>
 572         <row>
 573          <entry><literal>SQL_ASCII</literal></entry>
 574          <entry><literal>SQL_ASCII</literal>, <literal>UNICODE</literal>, <literal>MULE_INTERNAL</literal>
 575          </entry>
 576         </row>
 577         <row>
 578          <entry><literal>EUC_JP</literal></entry>
 579          <entry><literal>EUC_JP</literal>, <literal>SJIS</literal>,
 580          <literal>UNICODE</literal>, <literal>MULE_INTERNAL</literal>
 581          </entry>
 582         </row>
 583         <row>
 584          <entry><literal>EUC_TW</literal></entry>
 585          <entry><literal>EUC_TW</literal>, <literal>BIG5</literal>,
 586          <literal>UNICODE</literal>, <literal>MULE_INTERNAL</literal>
 587          </entry>
 588         </row>
 589         <row>
 590          <entry><literal>LATIN1</literal></entry>
 591          <entry><literal>LATIN1</literal>, <literal>UNICODE</literal>,
 592          <literal>MULE_INTERNAL</literal>
 593          </entry>
 594         </row>
 595         <row>
 596          <entry><literal>LATIN2</literal></entry>
 597          <entry><literal>LATIN2</literal>, <literal>WIN1250</literal>,
 598          <literal>UNICODE</literal>,
 599          <literal>MULE_INTERNAL</literal>
 600          </entry>
 601         </row>
 602         <row>
 603          <entry><literal>LATIN3</literal></entry>
 604          <entry><literal>LATIN3</literal>, <literal>UNICODE</literal>,
 605          <literal>MULE_INTERNAL</literal>
 606          </entry>
 607         </row>
 608         <row>
 609          <entry><literal>LATIN4</literal></entry>
 610          <entry><literal>LATIN4</literal>, <literal>UNICODE</literal>,
 611          <literal>MULE_INTERNAL</literal>
 612          </entry>
 613         </row>
 614         <row>
 615          <entry><literal>LATIN5</literal></entry>
 616          <entry><literal>LATIN5</literal>, <literal>UNICODE</literal>,
 617          <literal>MULE_INTERNAL</literal>
 618          </entry>
 619         </row>
 620         <row>
 621          <entry><literal>LATIN6</literal></entry>
 622          <entry><literal>LATIN6</literal>, <literal>UNICODE</literal>,
 623          <literal>MULE_INTERNAL</literal>
 624          </entry>
 625         </row>
 626         <row>
 627          <entry><literal>LATIN7</literal></entry>
 628          <entry><literal>LATIN7</literal>, <literal>UNICODE</literal>,
 629          <literal>MULE_INTERNAL</literal>
 630          </entry>
 631         </row>
 632         <row>
 633          <entry><literal>LATIN8</literal></entry>
 634          <entry><literal>LATIN8</literal>, <literal>UNICODE</literal>,
 635          <literal>MULE_INTERNAL</literal>
 636          </entry>
 637         </row>
 638         <row>
 639          <entry><literal>LATIN9</literal></entry>
 640          <entry><literal>LATIN9</literal>, <literal>UNICODE</literal>,
 641          <literal>MULE_INTERNAL</literal>
 642          </entry>
 643         </row>
 644         <row>
 645          <entry><literal>LATIN10</literal></entry>
 646          <entry><literal>LATIN10</literal>, <literal>UNICODE</literal>,
 647          <literal>MULE_INTERNAL</literal>
 648          </entry>
 649         </row>
 650         <row>
 651          <entry><literal>ISO_8859_5</literal></entry>
 652          <entry><literal>ISO_8859_5</literal>,
 653          <literal>UNICODE</literal>
 654          </entry>
 655         </row>
 656         <row>
 657          <entry><literal>ISO_8859_6</literal></entry>
 658          <entry><literal>ISO_8859_6</literal>,
 659          <literal>UNICODE</literal>
 660          </entry>
 661         </row>
 662         <row>
 663          <entry><literal>ISO_8859_7</literal></entry>
 664          <entry><literal>ISO_8859_7</literal>,
 665          <literal>UNICODE</literal>
 666          </entry>
 667         </row>
 668         <row>
 669          <entry><literal>ISO_8859_8</literal></entry>
 670          <entry><literal>ISO_8859_8</literal>,
 671          <literal>UNICODE</literal>
 672          </entry>
 673         </row>
 674         <row>
 675          <entry><literal>ISO_8859_9</literal></entry>
 676          <entry><literal>ISO_8859_9</literal>, <literal>WIN</literal>,
 677          <literal>ALT</literal>, <literal>KOI8R</literal>,
 678          <literal>UNICODE</literal>, <literal>MULE_INTERNAL</literal>
 679          </entry>
 680         </row>
 681         <row>
 682          <entry><literal>UNICODE</literal></entry>
 683          <entry>
 684          <literal>EUC_JP</literal>, <literal>SJIS</literal>,
 685          <literal>EUC_KR</literal>, <literal>EUC_CN</literal>,
 686          <literal>EUC_TW</literal>, <literal>BIG5</literal>,
 687          <literal>LATIN1</literal> to <literal>LATIN10</literal>,
 688          <literal>ISO_8859_5</literal>,
 689          <literal>ISO_8859_6</literal>,
 690          <literal>ISO_8859_7</literal>,
 691          <literal>ISO_8859_8</literal>,
 692          <literal>WIN</literal>, <literal>ALT</literal>,
 693          <literal>KOI8</literal>
 694          </entry>
 695         </row>
 696         <row>
 697          <entry><literal>MULE_INTERNAL</literal></entry>
 698          <entry><literal>EUC_JP</literal>, <literal>SJIS</literal>, <literal>EUC_KR</literal>, <literal>EUC_CN</literal>,
 699           <literal>EUC_TW</literal>, <literal>BIG5</literal>, <literal>LATIN1</literal> to <literal>LATIN5</literal>,
 700           <literal>WIN</literal>, <literal>ALT</literal>, <literal>WIN1250</literal></entry>
 701         </row>
 702         <row>
 703          <entry><literal>KOI8</literal></entry>
 704          <entry><literal>ISO_8859_9</literal>, <literal>WIN</literal>,
 705          <literal>ALT</literal>, <literal>KOI8</literal>,
 706          <literal>UNICODE</literal>, <literal>MULE_INTERNAL</literal>
 707          </entry>
 708         </row>
 709         <row>
 710          <entry><literal>WIN</literal></entry>
 711          <entry><literal>ISO_8859_9</literal>, <literal>WIN</literal>,
 712          <literal>ALT</literal>, <literal>KOI8</literal>,
 713          <literal>UNICODE</literal>, <literal>MULE_INTERNAL</literal>
 714          </entry>
 715         </row>
 716         <row>
 717          <entry><literal>ALT</literal></entry>
 718          <entry><literal>ISO_8859_9</literal>, <literal>WIN</literal>,
 719          <literal>ALT</literal>, <literal>KOI8</literal>,
 720          <literal>UNICODE</literal>, <literal>MULE_INTERNAL</literal>
 721          </entry>
 722         </row>
 723        </tbody>
 724       </tgroup>
 725      </table>
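    <para>
     To make concrete what a translation between a server encoding and a
     client encoding amounts to, the following Python sketch re-encodes
     Japanese text from <literal>EUC_JP</literal> to
     <literal>SJIS</literal>.  It uses Python's own codecs
     (<literal>euc_jp</literal> and <literal>shift_jis</literal>), not the
     server's conversion routines, so individual character mappings may
     differ; the function name <literal>recode</literal> is hypothetical.
    </para>

```python
def recode(data: bytes, server_encoding: str, client_encoding: str) -> bytes:
    """Re-encode bytes from the server encoding into the client encoding."""
    # Decode to an abstract string first, then encode for the client.
    return data.decode(server_encoding).encode(client_encoding)


if __name__ == "__main__":
    # Three Japanese characters as an EUC_JP server would store them.
    stored = "\u65e5\u672c\u8a9e".encode("euc_jp")
    # The bytes an SJIS client would expect to receive.
    sent = recode(stored, "euc_jp", "shift_jis")
    print(stored.hex(), "->", sent.hex())
```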
 726
 727     <para>
 728      To enable the automatic encoding translation, you have to tell
 729      <productname>PostgreSQL</productname> the encoding you would like
 730      to use in the client. There are
 731      several ways to accomplish this.
 732
 733      <itemizedlist>
 734       <listitem>
 735        <para>
 736         Using the <command>\encoding</command> command in
 737         <application>psql</application>.
 738         <command>\encoding</command> allows you to change client
 739         encoding on the fly. For
 740         example, to change the encoding to <literal>SJIS</literal>, type:
 741
 742 <programlisting>
 743 \encoding SJIS
 744 </programlisting>
 745        </para>
 746       </listitem>
 747
 748       <listitem>
 749        <para>
 750         Using <application>libpq</> functions.
 751         <command>\encoding</command> actually calls
 752         <function>PQsetClientEncoding()</function> for this purpose.
 753
 754 <synopsis>
 755 int PQsetClientEncoding(PGconn *<replaceable>conn</replaceable>, const char *<replaceable>encoding</replaceable>)
 756 </synopsis>
 757
 758         where <replaceable>conn</replaceable> is a connection to the server,
 759         and <replaceable>encoding</replaceable> is an encoding you
 760         want to use. If it successfully sets the encoding, it returns 0,
 761         otherwise -1. The current encoding for this connection can be shown by
 762         using:
 763
 764 <synopsis>
 765 int PQclientEncoding(const PGconn *<replaceable>conn</replaceable>)
 766 </synopsis>
 767
 768         Note that it returns the encoding ID, not a symbolic string
 769         such as <literal>EUC_JP</literal>. To convert an encoding ID to an encoding name, you
 770         can use:
 771
 772 <synopsis>
 773 char *pg_encoding_to_char(int <replaceable>encoding_id</replaceable>)
 774 </synopsis>
 775        </para>
 776       </listitem>
 777
 778       <listitem>
 779        <para>
 780         Using <command>SET CLIENT_ENCODING TO</command>.
 781
 782         Setting the client encoding can be done with this SQL command:
 783
 784 <programlisting>
 785 SET CLIENT_ENCODING TO 'encoding';
 786 </programlisting>
 787
 788         Also you can use the SQL92 syntax <literal>SET NAMES</literal> for this purpose:
 789
 790 <programlisting>
 791 SET NAMES 'encoding';
 792 </programlisting>
 793
 794         To query the current client encoding:
 795
 796 <programlisting>
 797 SHOW CLIENT_ENCODING;
 798 </programlisting>
 799
 800         To return to the default encoding:
 801
 802 <programlisting>
 803 RESET CLIENT_ENCODING;
 804 </programlisting>
 805        </para>
 806       </listitem>
 807
 808       <listitem>
 809        <para>
 810         Using <envar>PGCLIENTENCODING</envar>.
 811
 812         If environment variable <envar>PGCLIENTENCODING</envar> is defined
 813         in the client's environment, that client encoding is automatically
 814         selected when a connection to the server is made.  (This can subsequently
 815         be overridden using any of the other methods mentioned above.)
 816        </para>
 817       </listitem>
 818      </itemizedlist>
 819     </para>
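    <para>
     As a minimal sketch of how the three <application>libpq</application>
     functions described above fit together, the following C program
     sets the client encoding, reads it back as an encoding ID, and
     converts the ID to a name. The connection string
     <literal>dbname=test</literal> and the choice of
     <literal>EUC_JP</literal> are assumptions made for illustration.
     The sketch also assumes that <function>pg_encoding_to_char</function>
     is declared by <filename>libpq-fe.h</filename> in your installation
     and that a server with multibyte support is running.

<programlisting>
#include &lt;stdio.h&gt;
#include &lt;stdlib.h&gt;
#include &lt;libpq-fe.h&gt;

int
main(void)
{
    /* Hypothetical connection string; adjust it for your installation. */
    PGconn *conn = PQconnectdb("dbname=test");

    if (PQstatus(conn) != CONNECTION_OK)
    {
        fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
        PQfinish(conn);
        return EXIT_FAILURE;
    }

    /* PQsetClientEncoding returns 0 on success and -1 on failure. */
    if (PQsetClientEncoding(conn, "EUC_JP") != 0)
    {
        fprintf(stderr, "could not set client encoding: %s",
                PQerrorMessage(conn));
        PQfinish(conn);
        return EXIT_FAILURE;
    }

    /* PQclientEncoding returns a numeric ID; convert it to a name. */
    printf("client encoding: %s\n",
           pg_encoding_to_char(PQclientEncoding(conn)));

    PQfinish(conn);
    return EXIT_SUCCESS;
}
</programlisting>

     Checking the return value of <function>PQsetClientEncoding</function>
     matters because the request can fail, for example when the server
     does not recognize the encoding name.
    </para>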
 820    </sect2>
 821
 822    <sect2>
 823     <title>About Unicode</title>
 824
 825     <indexterm><primary>Unicode</></>
 826
 827     <para>
 828      An automatic encoding translation between Unicode and other
 829      encodings has been supported since <productname>PostgreSQL</> 7.1.
 830      For 7.1 it was not enabled by default.
 831      To enable this feature, run configure with the
 832      <option>--enable-unicode-conversion</option> option. Note that this requires
 833      the <option>--enable-multibyte</option> option also.
 834     </para>
 835     <para>
    For 7.2, <option>--enable-unicode-conversion</option> is not necessary.
    The Unicode conversion functionality is automatically enabled
    if <option>--enable-multibyte</option> is specified.
 839     </para>
 840    </sect2>
 841
 842    <sect2>
 843     <title>What happens if the translation is not possible?</title>
 844
 845     <para>
     Suppose you choose <literal>EUC_JP</literal> for the server
     and <literal>LATIN1</literal> for the client. Some Japanese
     characters then cannot be translated into <literal>LATIN1</literal>. In
     this case, a letter that cannot be represented in the <literal>LATIN1</literal> character set
     is transformed as:
 851
 852 <synopsis>
(HEXADECIMAL)
 854 </synopsis>
 855     </para>
 856    </sect2>
 857
 858    <sect2>
 859     <title>References</title>
 860
 861     <para>
 862      These are good sources to start learning about various kinds of encoding
 863      systems.
 864
 865      <variablelist>
 866       <varlistentry>
 867        <term><ulink url="ftp://ftp.ora.com/pub/examples/nutshell/ujip/doc/cjk.inf"></ulink></term>
 868
 869        <listitem>
 870         <para>
 871          Detailed explanations of <literal>EUC_JP</literal>,
 872          <literal>EUC_CN</literal>, <literal>EUC_KR</literal>,
 873          <literal>EUC_TW</literal> appear in section 3.2.
 874         </para>
 875        </listitem>
 876       </varlistentry>
 877
 878       <varlistentry>
 879        <term><ulink url="http://www.unicode.org/"></ulink></term>
 880
 881        <listitem>
 882         <para>
 883          The web site of the Unicode Consortium
 884         </para>
 885        </listitem>
 886       </varlistentry>
 887
 888       <varlistentry>
 889        <term>RFC 2044</term>
 890
 891        <listitem>
 892         <para>
 893          <acronym>UTF</acronym>-8 is defined here.
 894         </para>
 895        </listitem>
 896       </varlistentry>
 897      </variablelist>
 898     </para>
 899    </sect2>
 900
 901    <sect2>
 902     <title>History</title>
 903
 904 <literallayout class="monospaced">
 905 Dec 7, 2000
        * An automatic encoding translation between Unicode and other
          encodings is implemented
 908         * Changes above will appear in 7.1
 909
 910 May 20, 2000
 911         * SJIS UDC (NEC selection IBM kanji) support contributed
 912           by Eiji Tokuya
 913         * Changes above will appear in 7.0.1
 914
 915 Mar 22, 2000
 916         * Add new libpq functions PQsetClientEncoding, PQclientEncoding
 917         * ./configure --with-mb=EUC_JP
 918           now deprecated. use
 919           ./configure --enable-multibyte=EUC_JP
 920           instead
 921         * Add SQL_ASCII regression test case
 922         * Add SJIS User Defined Character (UDC) support
 923         * All of above will appear in 7.0
 924
 925 July 11, 1999
 926         * Add support for WIN1250 (Windows Czech) as a client encoding
 927           (contributed by Pavel Behal)
 928         * fix some compiler warnings (contributed by Tomoaki Nishiyama)
 929
 930 Mar 23, 1999
 931         * Add support for KOI8(KOI8-R), WIN(CP1251), ALT(CP866)
 932           (thanks Oleg Broytmann for testing)
 933         * Fix problem with MB and locale
 934
 935 Jan 26, 1999
        * Add support for Big5 for frontend encoding
 937           (you need to create a database with EUC_TW to use Big5)
 938         * Add regression test case for EUC_TW
 939           (contributed by Jonah Kuo <email>jonahkuo@mail.ttn.com.tw</email>)
 940
 941 Dec 15, 1998
 942         * Bugs related to SQL_ASCII support fixed
 943
 944 Nov 5, 1998
 945         * 6.4 release. In this version, pg_database has "encoding"
 946           column that represents the database encoding
 947
 948 Jul 22, 1998
 949         * determine encoding at initdb/createdb rather than compile time
 950         * support for PGCLIENTENCODING when issuing COPY command
 951         * support for SQL92 syntax "SET NAMES"
 952         * support for LATIN2-5
 953         * add UNICODE regression test case
 954         * new test suite for MB
 955         * clean up source files
 956
 957 Jun 5, 1998
 958         * add support for the encoding translation between the backend
 959           and the frontend
 960         * new command SET CLIENT_ENCODING etc. added
 961         * add support for LATIN1 character set
        * enhance 8 bit cleanness
 963
 964 April 21, 1998 some enhancements/fixes
 965         * character_length(), position(), substring() are now aware of
 966           multi-byte characters
 967         * add octet_length()
 968         * add --with-mb option to configure
 969         * new regression tests for EUC_KR
 970           (contributed by Soonmyung Hong <email>hong@lunaris.hanmesoft.co.kr</email>)
 971         * add some test cases to the EUC_JP regression test
 972         * fix problem in regress/regress.sh in case of System V
 973         * fix toupper(), tolower() to handle 8bit chars
 974
 975 Mar 25, 1998 MB PL2 is incorporated into <productname>PostgreSQL</> 6.3.1
 976
 977 Mar 10, 1998 PL2 released
 978         * add regression test for EUC_JP, EUC_CN and MULE_INTERNAL
 979         * add an English document (this file)
 980         * fix problems concerning 8-bit single byte characters
 981
 982 Mar 1, 1998 PL1 released
 983 </literallayout>
 984    </sect2>
 985
 986    <sect2>
 987     <title>WIN1250 on Windows/ODBC</title>
 988
 989     <para>
 990 <!--
 991 [Here is a good documentation explaining how to use WIN1250 on
 992 Windows/ODBC from Pavel Behal]
 993
 994 Version: 0.91 for PgSQL 6.5
 995 Author: Pavel Behal
 996 Revised by: Tatsuo Ishii
 997 Email: behal@opf.slu.cz
 998 License: The Same as <productname>PostgreSQL</>
 999
1000 Sorry for my Eglish and C code, I'm not native :-)
1001
1002 !!!!!!!!!!!!!!!!!!!!!!!!! NO WARRANTY !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
1003 -->
1004
1005      The WIN1250 character set on Windows client platforms can be used
1006      with <productname>PostgreSQL</productname> with locale support
1007      enabled.
1008     </para>
1009
1010     <para>
1011      The following should be kept in mind:
1012
1013      <itemizedlist>
1014       <listitem>
1015        <para>
1016         Success depends on proper system locales. This has been tested
1017         with <systemitem class="osname">Red Hat 6.0</> and <systemitem
1018         class="osname">Slackware 3.6</>, with the
1019         <literal>cs_CZ.iso8859-2</literal> locale.
1020        </para>
1021       </listitem>
1022
1023       <listitem>
1024        <para>
        Never try to set the server's database encoding to WIN1250.
        Always use LATIN2 instead, since there is no WIN1250 locale
        on Unix.
1028        </para>
1029       </listitem>
1030
1031       <listitem>
1032        <para>
1033         The WIN1250 encoding is usable only for Windows ODBC clients. The
1034         characters are recoded on the fly, to be displayed and stored
1035         back properly.
1036        </para>
1037       </listitem>
1038      </itemizedlist>
1039     </para>
1040
1041     <procedure>
1042      <title>WIN1250 on Windows/ODBC</title>
1043      <step>
1044       <para>
1045        Compile <productname>PostgreSQL</productname> with locale enabled
1046        and the server-side encoding set to <literal>LATIN2</literal>.
1047       </para>
1048      </step>
1049
1050      <step>
1051       <para>
1052        Set up your installation. Do not forget to create locale
1053        variables in your environment. For example (this may
1054        not be correct for <emphasis>your</emphasis> environment):
1055
1056 <programlisting>
1057 LC_ALL=cs_CZ.ISO8859-2
1058 </programlisting>
1059       </para>
1060      </step>
1061
1062      <step>
1063       <para>
1064        You have to start the server with locales set!
1065       </para>
1066      </step>
1067
1068      <step>
1069       <para>
1070        Try it with the Czech language, and have it sort on a query.
1071       </para>
1072      </step>
1073
1074      <step>
1075       <para>
       Install the ODBC driver for <productname>PostgreSQL</productname> on your Windows machine.
1077       </para>
1078      </step>
1079
1080      <step>
1081       <para>
1082        Set up your data source properly. Include this line in your ODBC
1083        configuration dialog in the field <guilabel>Connect Settings</guilabel>:
1084
1085 <programlisting>
1086 SET CLIENT_ENCODING = 'WIN1250';
1087 </programlisting>
1088       </para>
1089      </step>
1090
1091      <step>
1092       <para>
1093        Now try it again, but in Windows with ODBC.
1094       </para>
1095      </step>
1096     </procedure>
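    <para>
     The sorting test in the procedure can be sketched as follows. The
     table name <literal>czech_words</literal> and the sample words are
     assumptions made for illustration. In the Czech alphabet the
     digraph <quote>ch</quote> is collated after <quote>h</quote>, so
     with a working Czech locale <literal>chata</literal> would be
     expected to sort last. If it sorts first, the server is probably
     not using the locale. The actual result depends on the locale
     definitions supplied by your operating system.

<programlisting>
CREATE TABLE czech_words (word text);
INSERT INTO czech_words VALUES ('hrad');
INSERT INTO czech_words VALUES ('chata');
INSERT INTO czech_words VALUES ('cihla');
SELECT word FROM czech_words ORDER BY word;
</programlisting>
    </para>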
1097    </sect2>
1098   </sect1>
1099
1100
1101  <sect1 id="recode">
1102   <title>Single-byte character set recoding</>
1103 <!-- formerly in README.charsets, by Josef Balatka, <balatka@email.cz> -->
1104
1105   <para>
1106    You can set up this feature with the <option>--enable-recode</> option
1107    to <filename>configure</>. This option was formerly described as
1108    <quote>Cyrillic recode support</> which doesn't express all its
1109    power. It can be used for <emphasis>any</> single-byte character
1110    set recoding.
1111   </para>
1112
1113   <para>
   This method uses a <filename>charset.conf</> file located in
1115    the database directory (<envar>PGDATA</>).  It's a typical
1116    configuration text file where spaces and newlines separate items
1117    and records and # specifies comments.  Three keywords with the
1118    following syntax are recognized here:
1119 <synopsis>
1120 BaseCharset      <replaceable>server_charset</>
1121 RecodeTable      <replaceable>from_charset</> <replaceable>to_charset</> <replaceable>file_name</>
1122 HostCharset      <replaceable>host_spec</>    <replaceable>host_charset</>
1123 </synopsis>
1124   </para>
1125
1126   <para>
1127    <token>BaseCharset</> defines the encoding of the database server.
1128    All character set names are only used for mapping inside of
1129    <filename>charset.conf</> so you can freely use typing-friendly
1130    names.
1131   </para>
1132
1133   <para>
1134    <token>RecodeTable</> records specify translation tables between
1135    server and client.  The file name is relative to the
1136    <envar>PGDATA</> directory.  The table file format is very
1137    simple. There are no keywords and characters are represented by a
1138    pair of decimal or hexadecimal (0x prefixed) values on single
1139    lines:
1140 <synopsis>
1141 <replaceable>char_value</>   <replaceable>translated_char_value</>
1142 </synopsis>
1143   </para>
1144
1145   <para>
1146    <token>HostCharset</> records define the client character set by IP
1147    address. You can use a single IP address, an IP mask range starting
1148    from the given address or an IP interval (e.g., 127.0.0.1,
1149    192.168.1.100/24, 192.168.1.20-192.168.1.40).
1150   </para>
1151
1152   <para>
1153    The <filename>charset.conf</> file is always processed up to the
1154    end, so you can easily specify exceptions from the previous
1155    rules. In the <filename>src/data/</> directory you will find an
1156    example <filename>charset.conf</> and a few recoding tables.
1157   </para>
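  <para>
   As a rough illustration of how the three keywords combine, a
   <filename>charset.conf</filename> might look like the sketch
   below. The character set names (<literal>koi</literal>,
   <literal>win</literal>), the table file names, and the addresses
   are assumptions chosen for this example. Whether a separate table
   is needed for each direction is also an assumption. Because the
   file is processed to the end, the last <token>HostCharset</token>
   line is intended to act as an exception to the broader rule before
   it.

<programlisting>
# character set used by the database server
BaseCharset     koi

# translation tables, file names relative to PGDATA
RecodeTable     koi     win     koi-win.tab
RecodeTable     win     koi     win-koi.tab

# client character sets by address
HostCharset     127.0.0.1                       koi
HostCharset     192.168.1.100/24                win
HostCharset     192.168.1.20-192.168.1.40       koi
</programlisting>

   A table file such as the hypothetical
   <filename>koi-win.tab</filename> would then contain one pair of
   values per line, in either hexadecimal or decimal notation:

<programlisting>
0xC1    0xE0
194     225
</programlisting>
  </para>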
1158
1159   <para>
1160    As this solution is based on the client's IP address and character
1161    set mapping there are obviously some restrictions as well. You
1162    cannot use different encodings on the same host at the same
1163    time. It is also inconvenient when you boot your client hosts into
1164    multiple operating systems.  Nevertheless, when these restrictions are
1165    not limiting and you do not need multibyte characters then it is a
1166    simple and effective solution.
1167   </para>
1168  </sect1>
1169
1170 </chapter>
1171
1172 <!-- Keep this comment at the end of the file
1173 Local variables:
1174 mode:sgml
1175 sgml-omittag:nil
1176 sgml-shorttag:t
1177 sgml-minimize-attributes:nil
1178 sgml-always-quote-attributes:t
1179 sgml-indent-step:1
1180 sgml-indent-data:t
1181 sgml-parent-document:nil
1182 sgml-default-dtd-file:"./reference.ced"
1183 sgml-exposed-tags:nil
1184 sgml-local-catalogs:("/usr/lib/sgml/catalog")
1185 sgml-local-ecat-files:nil
1186 End:
1187 -->