PostgreSQL 7.0 multi-byte (MB) support README          Mar 22 2000

http://www.sra.co.jp/people/t-ishii/PostgreSQL/

The MB support is intended to allow PostgreSQL to handle
multi-byte character sets such as EUC (Extended Unix Code), Unicode and
Mule internal code. With MB enabled you can use multi-byte
character sets in regexp, LIKE and some other functions. The default
encoding system is chosen when you initialize your
PostgreSQL installation using initdb(1). Note that this can be
overridden when you create a database using createdb(1) or the
CREATE DATABASE SQL command, so you can have multiple databases,
each with a different encoding system.
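
To see why multi-byte aware functions matter, the following sketch
compares the number of characters in a string with the number of bytes
it occupies. It is written in Python, which is not part of PostgreSQL;
the Python codec names "euc_jp" and "utf-8" are stand-ins for the
EUC_JP and UNICODE encodings, and the helper names char_length and
byte_lengths are hypothetical, chosen only for this illustration.

```python
# Illustration only: Python codec names stand in for PostgreSQL encoding names.
TEXT = "日本語"  # three Japanese characters


def char_length(data: bytes, codec: str) -> int:
    """Count characters, not bytes, in an encoded string (hypothetical helper)."""
    return len(data.decode(codec))


def byte_lengths(text: str, codecs=("euc_jp", "utf-8")) -> dict:
    """Return the encoded size in bytes of text for each codec."""
    return {codec: len(text.encode(codec)) for codec in codecs}


if __name__ == "__main__":
    for codec, nbytes in byte_lengths(TEXT).items():
        encoded = TEXT.encode(codec)
        print(f"{codec}: {char_length(encoded, codec)} characters, {nbytes} bytes")
```

A function that counts bytes would report 6 for this string in EUC_JP
and 9 in UTF-8, while a multi-byte aware function reports 3 in both
cases.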

MB also fixes some problems concerning 8-bit single-byte
character sets, including ISO 8859. (I would not say that all problems
have been fixed. I have only confirmed that the regression tests ran fine
and that a few French characters could be used with the patch. Please let
me know if you find any problem while using 8-bit characters.)

Run configure with the multibyte option:

        % ./configure --enable-multibyte[=encoding_system]

where encoding_system is one of:
        UNICODE         Unicode (UTF-8)
        MULE_INTERNAL   Mule internal
        LATIN1          ISO 8859-1 English and some European languages
        LATIN2          ISO 8859-2 English and some European languages
        LATIN3          ISO 8859-3 English and some European languages
        LATIN4          ISO 8859-4 English and some European languages
        LATIN5          ISO 8859-5 English and some European languages

        % ./configure --enable-multibyte=EUC_JP

If the encoding system is omitted (./configure --enable-multibyte),

2. How to set the encoding

The initdb command defines the default encoding for a PostgreSQL
installation. For example:

        % initdb -E EUC_JP

sets the default encoding to EUC_JP (Extended Unix Code for Japanese).
Note that you can use "--encoding" instead of "-E" if you like the longer
option string :-) If no -E or --encoding option is given, the encoding
specified at compile time is used.

You can create a database with a different encoding.

        % createdb -E EUC_KR korean

will create a database named "korean" with the EUC_KR encoding.
Another way to accomplish this is to use an SQL command:

        CREATE DATABASE korean WITH ENCODING = 'EUC_KR';

The encoding for a database is stored in the "encoding" column of the
pg_database system catalog. You can see it by using the -l option or the
\l command of psql.

           Database    |  Owner  |   Encoding
        ---------------+---------+---------------
         euc_cn        | t-ishii | EUC_CN
         euc_jp        | t-ishii | EUC_JP
         euc_kr        | t-ishii | EUC_KR
         euc_tw        | t-ishii | EUC_TW
         mule_internal | t-ishii | MULE_INTERNAL
         regression    | t-ishii | SQL_ASCII
         template1     | t-ishii | EUC_JP
         test          | t-ishii | EUC_JP
         unicode       | t-ishii | UNICODE

3. Automatic encoding translation between backend and frontend

PostgreSQL supports automatic encoding translation between the backend
and the frontend for some encodings.

        encoding of backend     available encoding of frontend
        --------------------------------------------------------------------
        LATIN2                  LATIN2, WIN1250
        LATIN5                  LATIN5, WIN, ALT
        MULE_INTERNAL           EUC_JP, SJIS, EUC_KR, EUC_CN,
                                EUC_TW, BIG5, LATIN1 to LATIN5,

To enable automatic encoding translation, you have to tell
PostgreSQL the encoding you would like to use in the frontend. There are
several ways to accomplish this.

o using the \encoding command in psql

\encoding allows you to change the frontend encoding on the fly. For
example, to change the encoding to SJIS, type:

        \encoding SJIS

o using libpq functions

\encoding actually calls PQsetClientEncoding() for this purpose.

        int PQsetClientEncoding(PGconn *conn, const char *encoding)

where conn is a connection to the backend and encoding is the encoding you
want to use. If it successfully sets the encoding, it returns 0,
otherwise -1. The current encoding for this connection can be shown by
using:

        int PQclientEncoding(const PGconn *conn)

Note that it returns the "encoding id," not an encoding symbol string
such as "EUC_JP." To convert an encoding id to an encoding symbol, you
can use:

        char *pg_encoding_to_char(int encoding_id)

o using PGCLIENTENCODING

If the environment variable PGCLIENTENCODING is defined in the
frontend's environment, automatic encoding translation is done by the
backend.

o using the SET CLIENT_ENCODING TO command

The frontend encoding can be set with an SQL command:

        SET CLIENT_ENCODING TO 'encoding';

You can also use the SQL92 syntax "SET NAMES" for this purpose:

        SET NAMES 'encoding';

To query the current frontend encoding:

        SHOW CLIENT_ENCODING;

To return to the default encoding:

        RESET CLIENT_ENCODING;
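
Conceptually, the translation described in this section re-encodes the
same characters from the frontend encoding into the backend encoding
and back. The sketch below shows that idea in Python; it is not
PostgreSQL code. The function name recode is hypothetical, and the
mapping from the encoding names used in this README to Python codec
names is an assumption made for illustration (LATIN5 as ISO 8859-5,
WIN as CP1251, ALT as CP866).

```python
# Assumed mapping from the encoding names in this README to Python codecs.
PYTHON_CODEC = {
    "LATIN5": "iso8859_5",
    "WIN": "cp1251",
    "ALT": "cp866",
}


def recode(data: bytes, source: str, target: str) -> bytes:
    """Re-encode data from one encoding to another (hypothetical helper)."""
    return data.decode(PYTHON_CODEC[source]).encode(PYTHON_CODEC[target])


if __name__ == "__main__":
    from_client = "Привет".encode("cp1251")        # bytes sent by a WIN frontend
    stored = recode(from_client, "WIN", "LATIN5")  # bytes kept by a LATIN5 backend
    back = recode(stored, "LATIN5", "WIN")         # bytes returned to the frontend
    print(from_client.hex(), stored.hex(), back == from_client)
```

The byte values differ between the two encodings, but the round trip
returns exactly what the frontend sent, because every character in the
sample text exists in both character sets.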

Automatic encoding translation between Unicode and any other
encoding is not supported (yet).

5. What happens if the translation is not possible?

Suppose you choose EUC_JP for the backend and LATIN1 for the frontend;
then some Japanese characters cannot be translated into LATIN1. In
this case, a letter that cannot be represented in the LATIN1 character
set is transformed as:

These are good sources to start learning about various kinds of encoding
systems.

ftp://ftp.ora.com/pub/examples/nutshell/ujip/doc/cjk.inf
        Detailed explanations of EUC_JP, EUC_CN, EUC_KR, EUC_TW
        appear in section 3.2.

Unicode: http://www.unicode.org/
        The homepage of UNICODE.

        UTF-8 is defined here.

        * Add new libpq functions PQsetClientEncoding, PQclientEncoding
        * ./configure --with-mb=EUC_JP
          should now be written as
          ./configure --enable-multibyte=EUC_JP
        * Add SQL_ASCII regression test case
        * Add SJIS User Defined Character (UDC) support
        * All of the above will appear in 7.0

        * Add support for WIN1250 (Windows Czech) as a client encoding
          (contributed by Pavel Behal)
        * Fix some compiler warnings (contributed by Tomoaki Nishiyama)

        * Add support for KOI8(KOI8-R), WIN(CP1251), ALT(CP866)
          (thanks to Oleg Broytmann for testing)
        * Fix problem with MB and locale

        * Add support for Big5 as a frontend encoding
          (you need to create a database with EUC_TW to use Big5)
        * Add regression test case for EUC_TW
          (contributed by Jonah Kuo <jonahkuo@mail.ttn.com.tw>)

        * Bugs related to SQL_ASCII support fixed

        * 6.4 release. In this version, pg_database has an "encoding"
          column that represents the database encoding

        * determine encoding at initdb/createdb rather than compile time
        * support for PGCLIENTENCODING when issuing COPY command
        * support for SQL92 syntax "SET NAMES"
        * support for LATIN2-5
        * add UNICODE regression test case
        * new test suite for MB
        * clean up source files

        * add support for the encoding translation between the backend
          and the frontend
        * new command SET CLIENT_ENCODING etc. added
        * add support for LATIN1 character set
        * enhance 8-bit cleanness

April 21, 1998 some enhancements/fixes
        * character_length(), position(), substring() are now aware of
          multi-byte characters

        * add --with-mb option to configure
        * new regression tests for EUC_KR
          (contributed by "Soonmyung. Hong" <hong@lunaris.hanmesoft.co.kr>)
        * add some test cases to the EUC_JP regression test
        * fix problem in regress/regress.sh in case of System V
        * fix toupper(), tolower() to handle 8-bit chars

Mar 25, 1998 MB PL2 is incorporated into PostgreSQL 6.3.1

Mar 10, 1998 PL2 released
        * add regression test for EUC_JP, EUC_CN and MULE_INTERNAL
        * add an English document (this file)
        * fix problems concerning 8-bit single byte characters

Mar 1, 1998 PL1 released

[Here is good documentation from Pavel Behal explaining how to use
WIN1250 on Windows/ODBC. Please note that Installation step 1)
is not necessary in 6.5.1 -- Tatsuo]

Version: 0.91 for PgSQL 6.5
Revised by: Tatsuo Ishii
Email: behal@opf.slu.cz
Licence: The Same as PostgreSQL

Sorry for my English and C code, I'm not native :-)

!!!!!!!!!!!!!!!!!!!!!!!!! NO WARRANTY !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

1) Change three affected files in source directories
   (I don't have time to create proper patch diffs, I don't know how)
2) Compile with locale enabled and multibyte set to LATIN2
3) Set up your installation properly; do not forget to create locale
   variables in your profile (environment). Ex. (may not be exactly true):
        LC_ALL=cs_CZ.ISO8859-2
        LC_COLLATE=cs_CZ.ISO8859-2
        LC_CTYPE=cs_CZ.ISO8859-2
        LC_MONETARY=cs_CZ.ISO8859-2
        LC_NUMERIC=cs_CZ.ISO8859-2
        LC_TIME=cs_CZ.ISO8859-2
4) You have to start the postmaster with locales set!
5) Try it with the Czech language; it has to sort
6) Install the ODBC driver for PgSQL into your M$ Windows
7) Set up your data source properly. Include this line in your ODBC
   configuration dialog in the field "Connect Settings:":
        SET CLIENT_ENCODING = 'WIN1250';
8) Now try it again, but in Windows with ODBC.

- Depends on proper system locales; tested with RH6.0 and Slackware 3.6,
  with the cs_CZ.iso8859-2 locale
- Never try to set up the server multibyte database encoding as WIN1250;
  always use LATIN2 instead. There is no WIN1250 locale in Unix
- WIN1250 encoding is usable only for M$W ODBC clients. The characters are
  re-coded on the fly, to be displayed and stored back properly
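
The on-the-fly re-coding mentioned above is needed because some Czech
letters have different byte values in WIN1250 and LATIN2. The sketch
below illustrates this in Python; it is not the PostgreSQL conversion
code. The function names are hypothetical, and the Python codecs
"cp1250" and "iso8859_2" are assumed to correspond to WIN1250 and
LATIN2.

```python
# Illustration only: "cp1250" and "iso8859_2" are assumed to match
# the WIN1250 and LATIN2 encodings named in this document.

def win1250_to_latin2(data: bytes) -> bytes:
    """Re-code bytes from a WIN1250 client into LATIN2 (hypothetical helper)."""
    return data.decode("cp1250").encode("iso8859_2")


def latin2_to_win1250(data: bytes) -> bytes:
    """Re-code LATIN2 bytes back into WIN1250 (hypothetical helper)."""
    return data.decode("iso8859_2").encode("cp1250")


def differing_letters(text: str) -> list:
    """List the characters whose byte value differs between the two encodings."""
    return [ch for ch in text if ch.encode("cp1250") != ch.encode("iso8859_2")]


if __name__ == "__main__":
    word = "žluťoučký kůň"
    client_bytes = word.encode("cp1250")
    server_bytes = win1250_to_latin2(client_bytes)
    print(differing_letters(word))
    print(latin2_to_win1250(server_bytes) == client_bytes)
```

Letters such as "ž" and "ť" change their byte value in the conversion,
while letters such as "č" keep the same value in both encodings, which
is why data stored without re-coding would be only partly garbled.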

- It reorders your sort order depending on your LC_... setting, so don't be
  confused by the regression tests; they don't use locale
- "ch" is correctly sorted only in some newer locales (Ex. RH6.0)
- You have to insert money as '162,50' (with the comma, inside apostrophes!)
- Not tested properly