PostgreSQL Bugs

Collected from the PG bugs email list.

Bug ID16469
PG Version11.7
OSUbuntu 18.04
Opened2020-05-29 00:50:48+00
Reported byKeith Erskine
StatusNew

Body of first available message related to this bug follows.

The following bug has been logged on the website:

Bug reference:      16469
Logged by:          Keith Erskine
Email address:      (redacted)
PostgreSQL version: 11.7
Operating system:   Ubuntu 18.04
Description:        

I was trying to insert a high unicode character (i.e. beyond the basic
multilingual plane) into a table using the Python module pyodbc.  The INSERT
succeeded but when I read the text back, it appeared to be different.  The
unicode character in question was U+1F31C (LAST QUARTER MOON WITH FACE).

Here is my setup:
Postgres 11.7
Linux 18.04
unixODBC 2.3.7
driver psqlodbcw.so 12.01.0000
pyodbc 4.0.30
Python 3.7

Steps to reproduce:
1) Create a utf-8 database:
psql -c "CREATE DATABASE test WITH encoding='UTF8' LC_COLLATE='en_US.UTF-8'
LC_CTYPE='en_US.UTF-8'" -U postgres
2) Run the following Python 3.7 code (NOT Python 2.7):
import pyodbc
cnxn = pyodbc.connect("DRIVER={PostgreSQL
Unicode};SERVER=localhost;UID=postgres;DATABASE=test", autocommit=True)
cnxn.setdecoding(pyodbc.SQL_WCHAR, encoding='utf-8')
cnxn.setencoding(str, encoding='utf-8')
cnxn.setencoding(unicode, encoding='utf-8')
crsr = cnxn.cursor()
crsr.execute("DROP TABLE IF EXISTS t1")
crsr.execute("CREATE TABLE t1 (s varchar(50))")
v = "aaa \U0001F31C zzz"
crsr.execute(r"insert into t1 values ('{}')".format(v))
rows = crsr.execute("select * from t1").fetchall()
print(rows)
crsr.close()
cnxn.close()

Note:
1) I can write and read 2-byte unicode characters, without any problem.
2) Curiously, the same Python code actually works on Windows.  The correct
characters are returned.
3) Here's an ODBC trace fragment for a similar example:
(writing)
[ODBC][25398][1590664490.057932][SQLExecDirect.c][240]
		Entry:
			Statement = 0x5612f0515100
			SQL = [insert into t1 values (1, U&'x \+01F31C z', 'x 🌜 z')][length =
55]
[ODBC][25398][1590664490.058301][SQLExecDirect.c][515]
		Exit:[SQL_SUCCESS]

...

(reading)
[ODBC][25398][1590664490.061578][SQLGetData.c][237]
		Entry:
			Statement = 0x5612f0515100
			Column Number = 2
			Target Type = 1 SQL_CHAR
			Buffer Length = 4096
			Target Value = 0x5612f051bfb0
			StrLen Or Ind = 0x7ffcfd40cfa8
[ODBC][25398][1590664490.061584][SQLGetData.c][534]
		Exit:[SQL_SUCCESS]                
			Buffer = [x 🌜 z]                
			Strlen Or Ind = 0x7ffcfd40cfa8 -> 8
[ODBC][25398][1590664490.061593][SQLGetData.c][237]
		Entry:
			Statement = 0x5612f0515100
			Column Number = 3
			Target Type = 1 SQL_CHAR
			Buffer Length = 4096
			Target Value = 0x5612f051bfb0
			StrLen Or Ind = 0x7ffcfd40cfa8
[ODBC][25398][1590664490.061599][SQLGetData.c][534]
		Exit:[SQL_SUCCESS]                
			Buffer = [x � z]                
			Strlen Or Ind = 0x7ffcfd40cfa8 -> 12


As you can see, if I use the PostgreSQL unicode format in the INSERT
statement (the first text column, column number 2), it is read back fine. 
But in column number 3, with an embedded literal high unicode character, the
text is not read back correctly.  It appear as if the original INSERT SQL
statement was not decoded as utf-8 text, but was treated as UCS-2 or ASCII. 
Hence, when the data is read back out and decoded into utf-8, it is
mangled.

A broader question, when using SQLExecDirect with Postgres, how should the
SQL statement be encoded?  Does it have to be utf-8, or perhaps UCS-2?  Any
documentation on this would be greatly appreciated.  I haven't been able to
find much official documentation on the subject.  Even a code reference
would help.

Many thanks for any and all help.

Messages

DateAuthorSubject
2020-05-29 00:50:48+00PG Bug reporting formBUG #16469: High unicode character mangled on write and read with ODBC driver