From: Tatsuo Ishii
Date: Tue, 9 Jan 2001 04:40:31 +0000 (+0000)
Subject: README.mb has been unified into SGML documents.
X-Git-Tag: REL7_1_BETA3~5
X-Git-Url: https://granicus.if.org/sourcecode?a=commitdiff_plain;h=31a81ea8ec1fae50391b3819df266634ec0bfd8e;p=postgresql

README.mb has been unified into SGML documents.
---

diff --git a/doc/README.mb b/doc/README.mb
deleted file mode 100644
index 973d5a0db3..0000000000
--- a/doc/README.mb
+++ /dev/null
@@ -1,325 +0,0 @@
-PostgreSQL 7.0 multi-byte (MB) support README        Mar 22 2000
-
-                                                Tatsuo Ishii
-                                                ishii@postgresql.org
-                http://www.sra.co.jp/people/t-ishii/PostgreSQL/
-
-0. Introduction
-
-The MB support is intended for allowing PostgreSQL to handle
-multi-byte character sets such as EUC(Extended Unix Code), Unicode and
-Mule internal code. With the MB enabled you can use multi-byte
-character sets in regexp ,LIKE and some other functions. The default
-encoding system chosen is determined while initializing your
-PostgreSQL installation using initdb(1). Note that this can be
-overridden when you create a database using createdb(1) or by using a
-create database SQL command. So you could have multiple databases with
-each different encoding system.
-
-MB also fixes some problems concerning with 8-bit single byte
-character sets including ISO8859. (I would not say all of problems
-have been fixed. I just confirmed that the regression test ran fine
-and a few French characters could be used with the patch. Please let
-me know if you find any problem while using 8-bit characters)
-
-1. How to use
-
-run configure with a multibyte option:
-
-        % ./configure --enable-multibyte[=encoding_system]
-
-where the encoding_system is one of:
-
-        SQL_ASCII               ASCII
-        EUC_JP                  Japanese EUC
-        EUC_CN                  Chinese EUC
-        EUC_KR                  Korean EUC
-        EUC_TW                  Taiwan EUC
-        UNICODE                 Unicode(UTF-8)
-        MULE_INTERNAL           Mule internal
-        LATIN1                  ISO 8859-1 English and some European languages
-        LATIN2                  ISO 8859-2 English and some European languages
-        LATIN3                  ISO 8859-3 English and some European languages
-        LATIN4                  ISO 8859-4 English and some European languages
-        LATIN5                  ISO 8859-5 English and some European languages
-        KOI8                    KOI8-R
-        WIN                     Windows CP1251
-        ALT                     Windows CP866
-
-Example:
-
-        % ./configure --enable-multibyte=EUC_JP
-
-If the encoding system is omitted (./configure --enable-multibyte),
-SQL_ASCII is assumed.
-
-2. How to set the encoding
-
-initdb command defines the default encoding for a PostgreSQL
-installation. For example:
-
-        % initdb -E EUC_JP
-
-sets the default encoding to EUC_JP(Extended Unix Code for Japanese).
-Note that you can use "--encoding" instead of "-E" if you like longer
-option string:-) If no -E or --encoding option is given, the encoding
-specified at the compile time is used.
-
-You can create a database with a different encoding.
-
-        % createdb -E EUC_KR korean
-
-will create a database named "korean" with EUC_KR encoding. The
-another way to accomplish this is to use a SQL command:
-
-        CREATE DATABASE korean WITH ENCODING = 'EUC_KR';
-
-The encoding for a database is represented as "encoding" column in the
-pg_database system catalog. You can see that by using -l or \l of psql
-command.
-
-$ psql -l
-            List of databases
-   Database    |  Owner  |   Encoding
----------------+---------+---------------
- euc_cn        | t-ishii | EUC_CN
- euc_jp        | t-ishii | EUC_JP
- euc_kr        | t-ishii | EUC_KR
- euc_tw        | t-ishii | EUC_TW
- mule_internal | t-ishii | MULE_INTERNAL
- regression    | t-ishii | SQL_ASCII
- template1     | t-ishii | EUC_JP
- test          | t-ishii | EUC_JP
- unicode       | t-ishii | UNICODE
-(9 rows)
-
-3. Automatic encoding translation between backend and frontend
-
-PostgreSQL supports an automatic encoding translation between backend
-and frontend for some encodings.
-
-        encoding of backend     available encoding of frontend
-        --------------------------------------------------------------------
-        EUC_JP                  EUC_JP, SJIS
-
-        EUC_TW                  EUC_TW, BIG5
-
-        LATIN2                  LATIN2, WIN1250
-
-        LATIN5                  LATIN5, WIN, ALT
-
-        MULE_INTERNAL           EUC_JP, SJIS, EUC_KR, EUC_CN,
-                                EUC_TW, BIG5, LATIN1 to LATIN5,
-                                WIN, ALT, WIN1250
-
-To enable the automatic encoding translation, you have to tell
-PostgreSQL the encoding you would like to use in frontend. There are
-several ways to accomplish this.
-
-o using \encoding command in psql
-
-\encoding allows you to change frontend encoding on the fly. For
-example, to change the encoding to SJIS, type:
-
-        \encoding SJIS
-
-o using libpq functions
-
-\encoding actually calls PQsetClientEncoding() for its purpose.
-
-        int PQsetClientEncoding(PGconn *conn, const char *encoding)
-
-conn is a connection to the backend, and encoding is an encoding you
-want to use. If it successfully sets the encoding, it returns 0,
-otherwise -1. The current encoding for this connection can be shown by
-using:
-
-        int PQclientEncoding(const PGconn *conn)
-
-Note that it returns the "encoding id," not the encoding symbol string
-such as "EUC_JP." To convert an encoding id to an encoding symbol, you
-can use:
-
-char *pg_encoding_to_char(int encoding_id)
-
-o using PGCLIENTENCODING
-
-If an environment variable PGCLIENTENCODING is defined in the
-frontend, an automatic encoding translation is done by the backend.
-
-o using SET CLIENT_ENCODING TO command
-
-Setting the frontend side encoding can be done a SQL command:
-
-        SET CLIENT_ENCODING TO 'encoding';
-
-Also you can use SQL92 syntax "SET NAMES" for this purpose:
-
-        SET NAMES 'encoding';
-
-To query the current the frontend encoding:
-
-        SHOW CLIENT_ENCODING;
-
-To return to the default encoding:
-
-        RESET CLIENT_ENCODING;
-
-4. About Unicode
-
-An automatic encoding translation between Unicode and any other
-encodings is not supported (yet).
-
-5. What happens if the translation is not possible?
-
-Suppose you choose EUC_JP for the backend, LATIN1 for the frontend,
-then some Japanese characters could not be translated into LATIN1. In
-this case, a letter cannot be represented in the LATIN1 character set,
-would be transformed as:
-
-        (HEXA DECIMAL)
-
-6. References
-
-These are good sources to start learning various kind of encoding
-systems.
-
-ftp://ftp.ora.com/pub/examples/nutshell/ujip/doc/cjk.inf
-        Detailed explanations of EUC_JP, EUC_CN, EUC_KR, EUC_TW
-        appear in section 3.2.
-
-Unicode: http://www.unicode.org/
-        The homepage of UNICODE.
-
-        RFC 2044
-        UTF-8 is defined here.
-
-5. History
-
-May 20, 2000
-        * SJIS UDC (NEC selection IBM kanji) support contributed
-          by Eiji Tokuya
-        * Changes above will appear in 7.0.1
-
-Mar 22, 2000
-        * Add new libpq functions PQsetClientEncoding, PQclientEncoding
-        * ./configure --with-mb=EUC_JP
-          now deprecated. use
-          ./configure --enable-multibyte=EUC_JP
-          instead
-        * Add SQL_ASCII regression test case
-        * Add SJIS User Defined Character (UDC) support
-        * All of above will appear in 7.0
-
-July 11, 1999
-        * Add support for WIN1250 (Windows Czech) as a client encoding
-          (contributed by Pavel Behal)
-        * fix some compiler warnings (contributed by Tomoaki Nishiyama)
-
-Mar 23, 1999
-        * Add support for KOI8(KOI8-R), WIN(CP1251), ALT(CP866)
-          (thanks Oleg Broytmann for testing)
-        * Fix problem with MB and locale
-
-Jan 26, 1999
-        * Add support for Big5 for fronend encoding
-          (you need to create a database with EUC_TW to use Big5)
-        * Add regression test case for EUC_TW
-          (contributed by Jonah Kuo )
-
-Dec 15, 1998
-        * Bugs related to SQL_ASCII support fixed
-
-Nov 5, 1998
-        * 6.4 release. In this version, pg_database has "encoding"
-          column that represents the database encoding
-
-Jul 22, 1998
-        * determine encoding at initdb/createdb rather than compile time
-        * support for PGCLIENTENCODING when issuing COPY command
-        * support for SQL92 syntax "SET NAMES"
-        * support for LATIN2-5
-        * add UNICODE regression test case
-        * new test suite for MB
-        * clean up source files
-
-Jun 5, 1998
-        * add support for the encoding translation between the backend
-          and the frontend
-        * new command SET CLIENT_ENCODING etc. added
-        * add support for LATIN1 character set
-        * enhance 8 bit cleaness
-
-April 21, 1998 some enhancements/fixes
-        * character_length(), position(), substring() are now aware of
-          multi-byte characters
-        * add octet_length()
-        * add --with-mb option to configure
-        * new regression tests for EUC_KR
-          (contributed by "Soonmyung. Hong" )
-        * add some test cases to the EUC_JP regression test
-        * fix problem in regress/regress.sh in case of System V
-        * fix toupper(), tolower() to handle 8bit chars
-
-Mar 25, 1998 MB PL2 is incorporated into PostgreSQL 6.3.1
-
-Mar 10, 1998 PL2 released
-        * add regression test for EUC_JP, EUC_CN and MULE_INTERNAL
-        * add an English document (this file)
-        * fix problems concerning 8-bit single byte characters
-
-Mar 1, 1998 PL1 released
-
-Appendix:
-
-[Here is a good documentation explaining how to use WIN1250 on
-Windows/ODBC from Pavel Behal. Please note that Installation step 1)
-is not necceary in 6.5.1 -- Tatsuo]
-
-Version: 0.91 for PgSQL 6.5
-Author: Pavel Behal
-Revised by: Tatsuo Ishii
-Email: behal@opf.slu.cz
-Licence: The Same as PostgreSQL
-
-Sorry for my Eglish and C code, I'm not native :-)
-
-!!!!!!!!!!!!!!!!!!!!!!!!! NO WARRANTY !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
-
-Instalation:
-------------
-1) Change three affected files in source directories
-   (I don't have time to create proper patch diffs, I don't know how)
-2) Compile with enabled locale and multibyte set to LATIN2
-3) Setup properly your instalation, do not forget to create locale
-   variables in your profile (environment). Ex. (may not be exactly true):
-        LC_ALL=cs_CZ.ISO8859-2
-        LC_COLLATE=cs_CZ.ISO8859-2
-        LC_CTYPE=cs_CZ.ISO8859-2
-        LC_MONETARY=cs_CZ.ISO8859-2
-        LC_NUMERIC=cs_CZ.ISO8859-2
-        LC_TIME=cs_CZ.ISO8859-2
-4) You have to start the postmaster with locales set!
-5) Try it with Czech language, it have to sort
-5) Install ODBC driver for PgSQL into your M$ Windows
-6) Setup properly your data source. Include this line in your ODBC
-   configuration dialog in field "Connect Settings:" :
-        SET CLIENT_ENCODING = 'WIN1250';
-7) Now try it again, but in Windows with ODBC.
-
-Description:
-------------
-- Depends on proper system locales, tested with RH6.0 and Slackware 3.6,
-  with cs_CZ.iso8859-2 loacle
-- Never try to set-up server multibyte database encoding to WIN1250,
-  always use LATIN2 instead. There is not WIN1250 locale in Unix
-- WIN1250 encoding is useable only for M$W ODBC clients. The characters are
-  on thy fly re-coded, to be displayed and stored back properly
-
-Important:
-----------
-- it reorders your sort order depending on your LC_... setting, so don't be
-  confused with regression tests, they don't use locale
-- "ch" is corectly sorted only in some newer locales (Ex. RH6.0)
-- you have to insert money as '162,50' (with comma in aphostrophes!)
-- not tested properly
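
Note on sections 3 and 5 of the removed README: the backend/frontend translation and the fallback for untranslatable characters can be illustrated outside PostgreSQL. The sketch below is not PostgreSQL code. It assumes Python's built-in codecs as a stand-in for the backend's conversion routines, and the helper name translate is hypothetical. The parenthesised hex output is only one assumed reading of the README's "(HEXA DECIMAL)" placeholder; the exact format the backend emits is not stated in the text above.

```python
# Illustration only: Python codecs stand in for the backend's conversion
# routines. The function name and the "(HEX)" fallback format are assumptions.

def translate(data: bytes, backend: str, frontend: str) -> bytes:
    """Re-encode bytes from the backend encoding into the frontend encoding.

    A character with no representation in the frontend encoding is replaced
    by the hex of its backend bytes, wrapped in parentheses.
    """
    out = bytearray()
    for ch in data.decode(backend):
        try:
            out += ch.encode(frontend)
        except UnicodeEncodeError:
            # Untranslatable: emit the backend byte values as hex digits.
            hex_digits = ch.encode(backend).hex().upper()
            out += b"(" + hex_digits.encode("ascii") + b")"
    return bytes(out)


if __name__ == "__main__":
    hiragana_a = "\u3042".encode("euc_jp")  # HIRAGANA LETTER A as EUC_JP bytes

    # EUC_JP backend, SJIS frontend: every character can be translated.
    print(translate(hiragana_a, "euc_jp", "shift_jis"))

    # EUC_JP backend, LATIN1 frontend: the kana has no LATIN1 equivalent,
    # so the fallback path is taken for it while "a" passes through.
    print(translate(b"a" + hiragana_a, "euc_jp", "latin-1"))
```

The first call corresponds to the EUC_JP row of the translation table in section 3 (EUC_JP backend, SJIS frontend); the second corresponds to the EUC_JP/LATIN1 case described in section 5.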