From: Tatsuo Ishii
Date: Tue, 9 Jan 2001 09:54:11 +0000 (+0000)
Subject: Add a README file for multi-byte. This file is contributed by
X-Git-Tag: REL7_1_BETA3~1
X-Git-Url: https://granicus.if.org/sourcecode?a=commitdiff_plain;h=eea348b72b3f470401d2a6b1d429de8ad921a8c4;p=postgresql

Add a README file for multi-byte. This file is contributed by
Chih-Chang Hsieh, written in traditional Chinese (Big5).
---

diff --git a/doc/README.mb.big5 b/doc/README.mb.big5
new file mode 100644
index 0000000000..eabbcc42e4
--- /dev/null
+++ b/doc/README.mb.big5
@@ -0,0 +1,326 @@
+PostgreSQL 7.0.1 multi-byte (MB) support README         May 20 2000
+
+                                                Tatsuo Ishii
+                                                ishii@postgresql.org
+                  http://www.sra.co.jp/people/t-ishii/PostgreSQL/
+
+[Translator's notes] 1. Thanks to Mr. Tatsuo Ishii!
+                     2. The bracketed notes are not in the original text.
+                        If you find errors in the Chinese translation,
+                        please contact cch@cc.kmu.edu.tw
+
+
+0. Introduction
+
+MB support is intended to let PostgreSQL handle multi-byte characters,
+such as EUC (Extended Unix Code), Unicode and Mule internal code.
+With MB support enabled you can use multi-byte characters in regular
+expressions (regexp), LIKE and some other functions. The default encoding
+system is determined by the initdb(1) command when you install PostgreSQL,
+and can also be chosen with the createdb(1) command or the SQL command
+that creates a database. So you can have several databases with
+different encoding systems.
+
+MB support also fixes some problems concerning 8-bit single-byte character
+sets (including ISO-8859-1). (I am not saying that all such problems have
+been fixed. I only confirmed that the regression tests ran successfully
+and that a few French characters could be used with the MB patch. If you
+find any problem while using 8-bit characters, please let me know.)
+
+1. How to use
+
+Before compiling PostgreSQL, run configure with the multibyte option:
+
+        % ./configure --enable-multibyte[=encoding_system]
+
+where encoding_system is one of the following:
+
+        SQL_ASCII               ASCII
+        EUC_JP                  Japanese EUC
+        EUC_CN                  Chinese EUC
+        EUC_KR                  Korean EUC
+        EUC_TW                  Taiwan EUC
+        UNICODE                 Unicode(UTF-8)
+        MULE_INTERNAL           Mule internal
+        LATIN1                  ISO 8859-1 English and some European languages
+        LATIN2                  ISO 8859-2 English and some European languages
+        LATIN3                  ISO 8859-3 English and some European languages
+        LATIN4                  ISO 8859-4 English and some European languages
+        LATIN5                  ISO 8859-5 English and some European languages
+        KOI8                    KOI8-R
+        WIN                     Windows CP1251
+        ALT                     Windows CP866
+
+Example:
+
+        % ./configure --enable-multibyte=EUC_JP
+
+If the encoding system is omitted, the default is SQL_ASCII.
+
+2. How to set the encoding
+
+The initdb command defines the default encoding of a PostgreSQL
+installation. For example:
+
+        % initdb -E EUC_JP
+
+sets the default encoding to EUC_JP (Extended Unix Code for Japanese).
+If you prefer longer option strings, you can use "--encoding" instead of
+"-E". If neither -E nor --encoding is given, the encoding chosen at
+compile time becomes the default.
+
+You can create a database that uses a different encoding:
+
+        % createdb -E EUC_KR korean
+
+This command creates a database named "korean" that uses the EUC_KR
+encoding. Another way to achieve the same thing is to use an SQL command:
+
+        CREATE DATABASE korean WITH ENCODING = 'EUC_KR';
+
+The pg_database system catalog has an "encoding" column that records the
+encoding of each database. You can see which encoding a database uses
+with psql -l, or with the \l command inside psql:
+
+$ psql -l
+            List of databases
+   Database    |  Owner  |   Encoding
+---------------+---------+---------------
+ euc_cn        | t-ishii | EUC_CN
+ euc_jp        | t-ishii | EUC_JP
+ euc_kr        | t-ishii | EUC_KR
+ euc_tw        | t-ishii | EUC_TW
+ mule_internal | t-ishii | MULE_INTERNAL
+ regression    | t-ishii | SQL_ASCII
+ template1     | t-ishii | EUC_JP
+ test          | t-ishii | EUC_JP
+ unicode       | t-ishii | UNICODE
+(9 rows)
+
+3. Automatic encoding translation between frontend and backend
+
+[Note: "frontend" means any client program: the psql command interpreter,
+a C program using libpq, a Perl program, or a Windows application going
+through ODBC. "Backend" means the PostgreSQL database server program.]
+
+PostgreSQL supports automatic translation between frontend and backend
+for some encodings: [Note: automatic translation here means that the
+encodings you declare for the frontend and the backend differ, but as long
+as PostgreSQL supports translation between the two, it converts the data
+for you before it is stored or retrieved.]
+
+        encoding of backend     available encoding of frontend
+        --------------------------------------------------------------------
+        EUC_JP                  EUC_JP, SJIS
+
+        EUC_TW                  EUC_TW, BIG5
+
+        LATIN2                  LATIN2, WIN1250
+
+        LATIN5                  LATIN5, WIN, ALT
+
+        MULE_INTERNAL           EUC_JP, SJIS, EUC_KR, EUC_CN,
+                                EUC_TW, BIG5, LATIN1 to LATIN5,
+                                WIN, ALT, WIN1250
+
+Before automatic encoding translation can take place, you have to tell
+PostgreSQL which encoding you want to use in the frontend. There are
+several ways to do this:
+
+o Using the \encoding command in the psql command interpreter
+
+The \encoding command lets you switch the frontend encoding on the fly.
+For example, to switch the frontend encoding to SJIS, type:
+
+        \encoding SJIS
+
+o Using libpq functions [Note: libpq is the C API library of PostgreSQL]
+
+The \encoding command of psql simply calls the PQsetClientEncoding()
+function to do its work.
+
+        int PQsetClientEncoding(PGconn *conn, const char *encoding)
+
+Here conn is a connection to the backend and encoding is the encoding
+you want to use. The function returns 0 if it sets the encoding
+successfully and -1 if it fails. The encoding of the current connection
+can be obtained with:
+
+        int PQclientEncoding(const PGconn *conn)
+
+Note that this function returns the encoding id (an integer value), not
+the encoding name string (such as "EUC_JP"). To obtain the encoding name
+from an encoding id, call:
+
+        char *pg_encoding_to_char(int encoding_id)
+
+o Using the PGCLIENTENCODING environment variable
+
+If the PGCLIENTENCODING environment variable is set in the frontend's
+environment, the backend performs the automatic encoding translation.
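As a reading aid for the conversion table in section 3, the following Python sketch writes the table as a dictionary and checks whether a frontend encoding can be paired with a given backend encoding. The names `CONVERSIONS` and `can_convert` are hypothetical and exist only for this illustration. PostgreSQL does not expose such a function, and the data is transcribed from the table above rather than queried from a server.

```python
# Backend encoding -> frontend encodings listed for it in section 3.
# Transcribed by hand from the README table; not read from a live server.
LATIN_1_TO_5 = ["LATIN1", "LATIN2", "LATIN3", "LATIN4", "LATIN5"]

CONVERSIONS = {
    "EUC_JP": {"EUC_JP", "SJIS"},
    "EUC_TW": {"EUC_TW", "BIG5"},
    "LATIN2": {"LATIN2", "WIN1250"},
    "LATIN5": {"LATIN5", "WIN", "ALT"},
    "MULE_INTERNAL": {
        "EUC_JP", "SJIS", "EUC_KR", "EUC_CN", "EUC_TW", "BIG5",
        *LATIN_1_TO_5, "WIN", "ALT", "WIN1250",
    },
}


def can_convert(backend: str, frontend: str) -> bool:
    """Return True if the table lists `frontend` for `backend`.

    Assumption of this sketch: identical encodings always pass,
    because no translation is needed in that case.
    """
    backend, frontend = backend.upper(), frontend.upper()
    if backend == frontend:
        return True
    return frontend in CONVERSIONS.get(backend, set())


if __name__ == "__main__":
    # A Big5 client needs an EUC_TW (or MULE_INTERNAL) database.
    for backend in ("EUC_TW", "EUC_JP", "MULE_INTERNAL"):
        print(backend, "<-> BIG5:", can_convert(backend, "BIG5"))
```

A lookup like this could be used to validate a client setting before issuing `\encoding` or `SET CLIENT_ENCODING`, so that an unsupported pairing is caught early instead of producing an error from the server.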
+
+[Note] PostgreSQL 7.0.0 through 7.0.3 have a bug: they do not recognize
+       this environment variable.
+
+o Using the SET CLIENT_ENCODING TO SQL command
+
+To set the frontend encoding, use the following SQL command:
+
+        SET CLIENT_ENCODING TO 'encoding';
+
+You can also use the SQL92 syntax "SET NAMES" for the same purpose:
+
+        SET NAMES 'encoding';
+
+To query the current frontend encoding, use the following SQL command:
+
+        SHOW CLIENT_ENCODING;
+
+To return to the original default encoding, use the following SQL command:
+
+        RESET CLIENT_ENCODING;
+
+[Note] When using the psql command interpreter, this method is not
+       recommended; use \encoding instead.
+
+4. About Unicode
+
+Translation between Unicode and other encodings will probably not be
+implemented until after version 7.1.
+
+5. What happens if the translation is not possible?
+
+Suppose you choose EUC_JP for the backend and LATIN1 for the frontend
+(some Japanese characters cannot be translated into LATIN1). In this
+case, a character that cannot be represented in the LATIN1 character set
+is transformed into the following form:
+
+        (hexadecimal value)
+
+6. References
+
+These are good sources to start learning about various kinds of encoding
+systems.
+
+ftp://ftp.ora.com/pub/examples/nutshell/ujip/doc/cjk.inf
+        Detailed explanations of EUC_JP, EUC_CN, EUC_KR, EUC_TW
+        appear in section 3.2.
+
+Unicode: http://www.unicode.org/
+        The homepage of UNICODE.
+
+        RFC 2044
+        UTF-8 is defined here.
+
+7. History
+
+May 20, 2000
+        * SJIS UDC (NEC selection IBM kanji) support contributed
+          by Eiji Tokuya
+        * Changes above will appear in 7.0.1
+
+Mar 22, 2000
+        * Add new libpq functions PQsetClientEncoding, PQclientEncoding
+        * ./configure --with-mb=EUC_JP
+          now deprecated. use
+          ./configure --enable-multibyte=EUC_JP
+          instead
+        * Add SQL_ASCII regression test case
+        * Add SJIS User Defined Character (UDC) support
+        * All of above will appear in 7.0
+
+July 11, 1999
+        * Add support for WIN1250 (Windows Czech) as a client encoding
+          (contributed by Pavel Behal)
+        * fix some compiler warnings (contributed by Tomoaki Nishiyama)
+
+Mar 23, 1999
+        * Add support for KOI8(KOI8-R), WIN(CP1251), ALT(CP866)
+          (thanks Oleg Broytmann for testing)
+        * Fix problem with MB and locale
+
+Jan 26, 1999
+        * Add support for Big5 for frontend encoding
+          (you need to create a database with EUC_TW to use Big5)
+        * Add regression test case for EUC_TW
+          (contributed by Jonah Kuo)
+
+Dec 15, 1998
+        * Bugs related to SQL_ASCII support fixed
+
+Nov 5, 1998
+        * 6.4 release. In this version, pg_database has "encoding"
+          column that represents the database encoding
+
+Jul 22, 1998
+        * determine encoding at initdb/createdb rather than compile time
+        * support for PGCLIENTENCODING when issuing COPY command
+        * support for SQL92 syntax "SET NAMES"
+        * support for LATIN2-5
+        * add UNICODE regression test case
+        * new test suite for MB
+        * clean up source files
+
+Jun 5, 1998
+        * add support for the encoding translation between the backend
+          and the frontend
+        * new command SET CLIENT_ENCODING etc. added
+        * add support for LATIN1 character set
+        * enhance 8-bit cleanness
+
+April 21, 1998 some enhancements/fixes
+        * character_length(), position(), substring() are now aware of
+          multi-byte characters
+        * add octet_length()
+        * add --with-mb option to configure
+        * new regression tests for EUC_KR
+          (contributed by "Soonmyung. Hong")
+        * add some test cases to the EUC_JP regression test
+        * fix problem in regress/regress.sh in case of System V
+        * fix toupper(), tolower() to handle 8bit chars
+
+Mar 25, 1998 MB PL2 is incorporated into PostgreSQL 6.3.1
+
+Mar 10, 1998 PL2 released
+        * add regression test for EUC_JP, EUC_CN and MULE_INTERNAL
+        * add an English document (this file)
+        * fix problems concerning 8-bit single byte characters
+
+Mar 1, 1998 PL1 released
+
+Appendix:
+
+[Here is good documentation from Pavel Behal explaining how to use
+WIN1250 on Windows/ODBC. Please note that Installation step 1)
+is not necessary in 6.5.1 -- Tatsuo]
+
+Version: 0.91 for PgSQL 6.5
+Author: Pavel Behal
+Revised by: Tatsuo Ishii
+Email: behal@opf.slu.cz
+Licence: The Same as PostgreSQL
+
+Sorry for my English and C code, I'm not native :-)
+
+!!!!!!!!!!!!!!!!!!!!!!!!! NO WARRANTY !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
+
+Installation:
+-------------
+1) Change three affected files in source directories
+   (I don't have time to create proper patch diffs, I don't know how)
+2) Compile with locale enabled and multibyte set to LATIN2
+3) Set up your installation properly; do not forget to create locale
+   variables in your profile (environment). Ex. (may not be exactly true):
+        LC_ALL=cs_CZ.ISO8859-2
+        LC_COLLATE=cs_CZ.ISO8859-2
+        LC_CTYPE=cs_CZ.ISO8859-2
+        LC_MONETARY=cs_CZ.ISO8859-2
+        LC_NUMERIC=cs_CZ.ISO8859-2
+        LC_TIME=cs_CZ.ISO8859-2
+4) You have to start the postmaster with locales set!
+5) Try it with the Czech language; it has to sort correctly
+6) Install the ODBC driver for PgSQL into your M$ Windows
+7) Set up your data source properly. Include this line in your ODBC
+   configuration dialog in the field "Connect Settings:" :
+        SET CLIENT_ENCODING = 'WIN1250';
+8) Now try it again, but in Windows with ODBC.
+
+Description:
+------------
+- Depends on proper system locales, tested with RH6.0 and Slackware 3.6,
+  with the cs_CZ.iso8859-2 locale
+- Never try to set the server multibyte database encoding to WIN1250;
+  always use LATIN2 instead. There is no WIN1250 locale in Unix
+- WIN1250 encoding is usable only for M$W ODBC clients. The characters are
+  recoded on the fly, to be displayed and stored back properly
+
+Important:
+----------
+- it changes your sort order depending on your LC_... setting, so don't be
+  confused by the regression tests; they don't use locale
+- "ch" is correctly sorted only in some newer locales (Ex. RH6.0)
+- you have to insert money as '162,50' (with a comma, inside apostrophes!)
+- not tested properly
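To make the on-the-fly recoding between WIN1250 and LATIN2 described in the appendix more concrete, here is a small Python sketch. It uses Python's own `cp1250` and `iso8859_2` codecs as stand-ins for the translation PostgreSQL performs between an ODBC client and the server. It does not call PostgreSQL, and the function names are hypothetical.

```python
def win1250_to_latin2(data: bytes) -> bytes:
    """Recode bytes from a WIN1250 client into LATIN2, as stored on the server.

    Raises UnicodeEncodeError for WIN1250 characters that have no
    LATIN2 equivalent (this sketch does not emulate any fallback).
    """
    return data.decode("cp1250").encode("iso8859_2")


def latin2_to_win1250(data: bytes) -> bytes:
    """Recode LATIN2 bytes from the server for display on a WIN1250 client."""
    return data.decode("iso8859_2").encode("cp1250")


if __name__ == "__main__":
    text = "žluťoučký kůň"              # Czech sample text
    client_bytes = text.encode("cp1250")
    server_bytes = win1250_to_latin2(client_bytes)

    # Several Czech letters have different byte values in the two encodings,
    # which is why the bytes cannot simply be passed through unchanged.
    print("WIN1250:", client_bytes.hex())
    print("LATIN2: ", server_bytes.hex())

    # Recoding back gives the client exactly what it sent.
    assert latin2_to_win1250(server_bytes) == client_bytes
```

The sketch shows why the appendix insists on a LATIN2 database with `SET CLIENT_ENCODING = 'WIN1250'` on the client side: the stored bytes follow the encoding that the Unix locale understands, and only the client-facing bytes are recoded.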