From 9076f9e187c4a0620b52bff7f3c4659647141783 Mon Sep 17 00:00:00 2001 From: Victor Stinner Date: Fri, 14 May 2010 16:08:46 +0000 Subject: [PATCH] Merged revisions 81168 via svnmerge from svn+ssh://pythondev@svn.python.org/python/branches/py3k ........ r81168 | victor.stinner | 2010-05-14 17:58:55 +0200 (ven., 14 mai 2010) | 10 lines Issue #8711: Document PyUnicode_DecodeFSDefault*() functions * Add paragraph titles to c-api/unicode.rst. * Fix PyUnicode_DecodeFSDefault*() comment: it now uses the "surrogateescape" error handler (and not "replace") * Remove "The function is intended to be used for paths and file names only during bootstrapping process where the codecs are not set up." from PyUnicode_FSConverter() comment: it is used after the bootstrapping and for other purposes than file names ........ --- Doc/c-api/unicode.rst | 128 ++++++++++++++++++++++++++++------------ Include/unicodeobject.h | 20 ++++--- 2 files changed, 101 insertions(+), 47 deletions(-) diff --git a/Doc/c-api/unicode.rst b/Doc/c-api/unicode.rst index 4c0d6a462d..b89c0983b1 100644 --- a/Doc/c-api/unicode.rst +++ b/Doc/c-api/unicode.rst @@ -10,11 +10,12 @@ Unicode Objects and Codecs Unicode Objects ^^^^^^^^^^^^^^^ +Unicode Type +"""""""""""" + These are the basic Unicode object types used for the Unicode implementation in Python: -.. % --- Unicode Type ------------------------------------------------------- - .. ctype:: Py_UNICODE @@ -89,12 +90,13 @@ access internal read-only data of Unicode objects: Clear the free list. Return the total number of freed items. +Unicode Character Properties +"""""""""""""""""""""""""""" + Unicode provides many different character properties. The most often needed ones are available through these macros which are mapped to C functions depending on the Python configuration. -.. % --- Unicode character properties --------------------------------------- - .. 
cfunction:: int Py_UNICODE_ISSPACE(Py_UNICODE ch) @@ -192,11 +194,13 @@ These APIs can be used for fast direct character conversions: Return the character *ch* converted to a double. Return ``-1.0`` if this is not possible. This macro does not raise exceptions. + +Plain Py_UNICODE +"""""""""""""""" + To create Unicode objects and access their basic sequence properties, use these APIs: -.. % --- Plain Py_UNICODE --------------------------------------------------- - .. cfunction:: PyObject* PyUnicode_FromUnicode(const Py_UNICODE *u, Py_ssize_t size) @@ -346,9 +350,47 @@ Python can interface directly to this type using the following functions. Support is optimized if Python's own :ctype:`Py_UNICODE` type is identical to the system's :ctype:`wchar_t`. -.. % --- wchar_t support for platforms which support it --------------------- + +File System Encoding +"""""""""""""""""""" + +To encode and decode file names and other environment strings, +:cdata:`Py_FileSystemDefaultEncoding` should be used as the encoding, and +``"surrogateescape"`` should be used as the error handler (:pep:`383`). To +encode file names during argument parsing, the ``"O&"`` converter should be +used, passing :func:`PyUnicode_FSConverter` as the conversion function: + +.. cfunction:: int PyUnicode_FSConverter(PyObject* obj, void* result) + + Convert *obj* into *result*, using :cdata:`Py_FileSystemDefaultEncoding` + and the ``"surrogateescape"`` error handler. *result* must point to a + ``PyObject*``, which receives a :func:`bytes` object that must be released + when it is no longer used. + + .. versionadded:: 3.1 + +.. cfunction:: PyObject* PyUnicode_DecodeFSDefaultAndSize(const char *s, Py_ssize_t size) + + Decode a string using :cdata:`Py_FileSystemDefaultEncoding` and + the ``"surrogateescape"`` error handler. + + If :cdata:`Py_FileSystemDefaultEncoding` is not set, fall back to UTF-8. + +..
cfunction:: PyObject* PyUnicode_DecodeFSDefault(const char *s) + + Decode a null-terminated string using :cdata:`Py_FileSystemDefaultEncoding` + and the ``"surrogateescape"`` error handler. + + If :cdata:`Py_FileSystemDefaultEncoding` is not set, fall back to UTF-8. + + Use :func:`PyUnicode_DecodeFSDefaultAndSize` if you know the string length. +wchar_t Support +""""""""""""""" + +:ctype:`wchar_t` support for platforms that provide it: + .. cfunction:: PyObject* PyUnicode_FromWideChar(const wchar_t *w, Py_ssize_t size) Create a Unicode object from the :ctype:`wchar_t` buffer *w* of the given size. @@ -395,9 +437,11 @@ built-in codecs is "strict" (:exc:`ValueError` is raised). The codecs all use a similar interface. Only deviation from the following generic ones are documented for simplicity. -These are the generic codec APIs: -.. % --- Generic Codecs ----------------------------------------------------- +Generic Codecs +"""""""""""""" + +These are the generic codec APIs: .. cfunction:: PyObject* PyUnicode_Decode(const char *s, Py_ssize_t size, const char *encoding, const char *errors) @@ -426,9 +470,11 @@ These are the generic codec APIs: using the Python codec registry. Return *NULL* if an exception was raised by the codec. -These are the UTF-8 codec APIs: -.. % --- UTF-8 Codecs ------------------------------------------------------- +UTF-8 Codecs +"""""""""""" + +These are the UTF-8 codec APIs: .. cfunction:: PyObject* PyUnicode_DecodeUTF8(const char *s, Py_ssize_t size, const char *errors) @@ -458,9 +504,11 @@ These are the UTF-8 codec APIs: object. Error handling is "strict". Return *NULL* if an exception was raised by the codec. -These are the UTF-32 codec APIs: -.. % --- UTF-32 Codecs ------------------------------------------------------ */ +UTF-32 Codecs +""""""""""""" + +These are the UTF-32 codec APIs: .. cfunction:: PyObject* PyUnicode_DecodeUTF32(const char *s, Py_ssize_t size, const char *errors, int *byteorder) @@ -525,9 +573,10 @@ These are the UTF-32 codec APIs: Return *NULL* if an exception was raised by the codec.
-These are the UTF-16 codec APIs: +UTF-16 Codecs +""""""""""""" -.. % --- UTF-16 Codecs ------------------------------------------------------ */ +These are the UTF-16 codec APIs: .. cfunction:: PyObject* PyUnicode_DecodeUTF16(const char *s, Py_ssize_t size, const char *errors, int *byteorder) @@ -591,9 +640,11 @@ These are the UTF-16 codec APIs: order. The string always starts with a BOM mark. Error handling is "strict". Return *NULL* if an exception was raised by the codec. -These are the "Unicode Escape" codec APIs: -.. % --- Unicode-Escape Codecs ---------------------------------------------- +Unicode-Escape Codecs +""""""""""""""""""""" + +These are the "Unicode Escape" codec APIs: .. cfunction:: PyObject* PyUnicode_DecodeUnicodeEscape(const char *s, Py_ssize_t size, const char *errors) @@ -615,9 +666,11 @@ These are the "Unicode Escape" codec APIs: string object. Error handling is "strict". Return *NULL* if an exception was raised by the codec. -These are the "Raw Unicode Escape" codec APIs: -.. % --- Raw-Unicode-Escape Codecs ------------------------------------------ +Raw-Unicode-Escape Codecs +""""""""""""""""""""""""" + +These are the "Raw Unicode Escape" codec APIs: .. cfunction:: PyObject* PyUnicode_DecodeRawUnicodeEscape(const char *s, Py_ssize_t size, const char *errors) @@ -639,11 +692,13 @@ These are the "Raw Unicode Escape" codec APIs: Python string object. Error handling is "strict". Return *NULL* if an exception was raised by the codec. + +Latin-1 Codecs +"""""""""""""" + These are the Latin-1 codec APIs: Latin-1 corresponds to the first 256 Unicode ordinals and only these are accepted by the codecs during encoding. -.. % --- Latin-1 Codecs ----------------------------------------------------- - .. cfunction:: PyObject* PyUnicode_DecodeLatin1(const char *s, Py_ssize_t size, const char *errors) @@ -664,11 +719,13 @@ ordinals and only these are accepted by the codecs during encoding. object. Error handling is "strict". 
Return *NULL* if an exception was raised by the codec. + +ASCII Codecs +"""""""""""" + These are the ASCII codec APIs. Only 7-bit ASCII data is accepted. All other codes generate errors. -.. % --- ASCII Codecs ------------------------------------------------------- - .. cfunction:: PyObject* PyUnicode_DecodeASCII(const char *s, Py_ssize_t size, const char *errors) @@ -689,9 +746,11 @@ codes generate errors. object. Error handling is "strict". Return *NULL* if an exception was raised by the codec. -These are the mapping codec APIs: -.. % --- Character Map Codecs ----------------------------------------------- +Character Map Codecs +"""""""""""""""""""" + +These are the mapping codec APIs: This codec is special in that it can be used to implement many different codecs (and this is in fact what was done to obtain most of the standard codecs @@ -760,7 +819,9 @@ use the Win32 MBCS converters to implement the conversions. Note that MBCS (or DBCS) is a class of encodings, not just one. The target encoding is defined by the user settings on the machine running the codec. -.. % --- MBCS codecs for Windows -------------------------------------------- + +MBCS Codecs for Windows +""""""""""""""""""""""" .. cfunction:: PyObject* PyUnicode_DecodeMBCS(const char *s, Py_ssize_t size, const char *errors) @@ -790,20 +851,9 @@ the user settings on the machine running the codec. object. Error handling is "strict". Return *NULL* if an exception was raised by the codec. -For decoding file names and other environment strings, :cdata:`Py_FileSystemEncoding` -should be used as the encoding, and ``"surrogateescape"`` should be used as the error -handler. For encoding file names during argument parsing, the ``O&`` converter should -be used, passsing PyUnicode_FSConverter as the conversion function: - -.. cfunction:: int PyUnicode_FSConverter(PyObject* obj, void* result) - - Convert *obj* into *result*, using the file system encoding, and the ``surrogateescape`` - error handler.
*result* must be a ``PyObject*``, yielding a bytes or bytearray object - which must be released if it is no longer used. - - .. versionadded:: 3.1 -.. % --- Methods & Slots ---------------------------------------------------- +Methods & Slots +""""""""""""""" .. _unicodemethodsandslots: diff --git a/Include/unicodeobject.h b/Include/unicodeobject.h index d21dd96ede..cc2d5359b7 100644 --- a/Include/unicodeobject.h +++ b/Include/unicodeobject.h @@ -1238,25 +1238,29 @@ PyAPI_FUNC(int) PyUnicode_EncodeDecimal( /* --- File system encoding ---------------------------------------------- */ /* ParseTuple converter which converts a Unicode object into the file - system encoding, using the PEP 383 error handler; bytes objects are - output as-is. */ + system encoding as a bytes object, using the "surrogateescape" error + handler; bytes objects are output as-is. */ PyAPI_FUNC(int) PyUnicode_FSConverter(PyObject*, void*); -/* Decode a null-terminated string using Py_FileSystemDefaultEncoding. +/* Decode a null-terminated string using Py_FileSystemDefaultEncoding + and the "surrogateescape" error handler. - If the encoding is supported by one of the built-in codecs (i.e., UTF-8, - UTF-16, UTF-32, Latin-1 or MBCS), otherwise fallback to UTF-8 and replace - invalid characters with '?'. + If Py_FileSystemDefaultEncoding is not set, fall back to UTF-8. - The function is intended to be used for paths and file names only - during bootstrapping process where the codecs are not set up. + Use PyUnicode_DecodeFSDefaultAndSize() if you have the string length. */ PyAPI_FUNC(PyObject*) PyUnicode_DecodeFSDefault( const char *s /* encoded string */ ); +/* Decode a string using Py_FileSystemDefaultEncoding + and the "surrogateescape" error handler. + + If Py_FileSystemDefaultEncoding is not set, fall back to UTF-8. +*/ + PyAPI_FUNC(PyObject*) PyUnicode_DecodeFSDefaultAndSize( const char *s, /* encoded string */ Py_ssize_t size /* size */ -- 2.40.0