Refactor a bit the codecs doc.

author Ezio Melotti <ezio.melotti@gmail.com>

Tue, 25 Oct 2011 07:40:38 +0000 (10:40 +0300)

committer Ezio Melotti <ezio.melotti@gmail.com>

Tue, 25 Oct 2011 07:40:38 +0000 (10:40 +0300)
diff --git a/Doc/library/codecs.rst b/Doc/library/codecs.rst

index 9477133794f2a9ebd74ea50328948bb1283f3109..7747794091ea562f2beb230fc2a4368704427e52 100644 (file)
--- a/Doc/library/codecs.rst
+++ b/Doc/library/codecs.rst
@@ -810,27 +810,28 @@ e.g. :file:`encodings/cp1252.py` (which is an encoding that is used primarily on
  Windows). There's a string constant with 256 characters that shows you which
  character is mapped to which byte value.
  
-All of these encodings can only encode 256 of the 65536 (or 1114111) codepoints
+All of these encodings can only encode 256 of the 1114112 codepoints
  defined in Unicode. A simple and straightforward way that can store each Unicode
-code point, is to store each codepoint as two consecutive bytes. There are two
-possibilities: Store the bytes in big endian or in little endian order. These
-two encodings are called UTF-16-BE and UTF-16-LE respectively. Their
-disadvantage is that if e.g. you use UTF-16-BE on a little endian machine you
-will always have to swap bytes on encoding and decoding. UTF-16 avoids this
-problem: Bytes will always be in natural endianness. When these bytes are read
+code point, is to store each codepoint as four consecutive bytes. There are two
+possibilities: store the bytes in big endian or in little endian order. These
+two encodings are called ``UTF-32-BE`` and ``UTF-32-LE`` respectively. Their
+disadvantage is that if e.g. you use ``UTF-32-BE`` on a little endian machine you
+will always have to swap bytes on encoding and decoding. ``UTF-32`` avoids this
+problem: bytes will always be in natural endianness. When these bytes are read
  by a CPU with a different endianness, then bytes have to be swapped though. To
-be able to detect the endianness of a UTF-16 byte sequence, there's the so
-called BOM (the "Byte Order Mark"). This is the Unicode character ``U+FEFF``.
-This character will be prepended to every UTF-16 byte sequence. The byte swapped
-version of this character (``0xFFFE``) is an illegal character that may not
-appear in a Unicode text. So when the first character in an UTF-16 byte sequence
+be able to detect the endianness of a ``UTF-16`` or ``UTF-32`` byte sequence,
+there's the so called BOM ("Byte Order Mark"). This is the Unicode character
+``U+FEFF``. This character can be prepended to every ``UTF-16`` or ``UTF-32``
+byte sequence. The byte swapped version of this character (``0xFFFE``) is an
+illegal character that may not appear in a Unicode text. So when the
+first character in an ``UTF-16`` or ``UTF-32`` byte sequence
  appears to be a ``U+FFFE`` the bytes have to be swapped on decoding.
-Unfortunately upto Unicode 4.0 the character ``U+FEFF`` had a second purpose as
-a ``ZERO WIDTH NO-BREAK SPACE``: A character that has no width and doesn't allow
+Unfortunately the character ``U+FEFF`` had a second purpose as
+a ``ZERO WIDTH NO-BREAK SPACE``: a character that has no width and doesn't allow
  a word to be split. It can e.g. be used to give hints to a ligature algorithm.
  With Unicode 4.0 using ``U+FEFF`` as a ``ZERO WIDTH NO-BREAK SPACE`` has been
  deprecated (with ``U+2060`` (``WORD JOINER``) assuming this role). Nevertheless
-Unicode software still must be able to handle ``U+FEFF`` in both roles: As a BOM
+Unicode software still must be able to handle ``U+FEFF`` in both roles: as a BOM
  it's a device to determine the storage layout of the encoded bytes, and vanishes
  once the byte sequence has been decoded into a string; as a ``ZERO WIDTH
  NO-BREAK SPACE`` it's a normal character that will be decoded like any other.
@@ -838,7 +839,7 @@ NO-BREAK SPACE`` it's a normal character that will be decoded like any other.
  There's another encoding that is able to encoding the full range of Unicode
  characters: UTF-8. UTF-8 is an 8-bit encoding, which means there are no issues
  with byte order in UTF-8. Each byte in a UTF-8 byte sequence consists of two
-parts: Marker bits (the most significant bits) and payload bits. The marker bits
+parts: marker bits (the most significant bits) and payload bits. The marker bits
  are a sequence of zero to four ``1`` bits followed by a ``0`` bit. Unicode characters are
  encoded like this (with x being payload bits, which when concatenated give the
  Unicode character):
@@ -877,13 +878,14 @@ map to
     | RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK
     | INVERTED QUESTION MARK
  
-in iso-8859-1), this increases the probability that a utf-8-sig encoding can be
+in iso-8859-1), this increases the probability that a ``utf-8-sig`` encoding can be
  correctly guessed from the byte sequence. So here the BOM is not used to be able
  to determine the byte order used for generating the byte sequence, but as a
  signature that helps in guessing the encoding. On encoding the utf-8-sig codec
  will write ``0xef``, ``0xbb``, ``0xbf`` as the first three bytes to the file. On
-decoding utf-8-sig will skip those three bytes if they appear as the first three
-bytes in the file.
+decoding ``utf-8-sig`` will skip those three bytes if they appear as the first
+three bytes in the file.  In UTF-8, the use of the BOM is discouraged and
+should generally be avoided.
  
  
  .. _standard-encodings:
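
Illustration (not part of the commit): the BOM behaviour described in the patched text can be checked with a short Python sketch. It assumes Python 3 and uses only the standard ``codecs`` module; the sample string and the variable names are made up for this example.

```python
import codecs

text = "abc"  # arbitrary sample string, chosen only for illustration

# The generic codecs prepend a BOM in the machine's native byte order.
utf16 = text.encode("utf-16")
utf32 = text.encode("utf-32")
utf16_has_bom = utf16[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)
utf32_has_bom = utf32[:4] in (codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE)

# The explicit-endianness codecs write no BOM: four bytes per code point.
utf32_be = text.encode("utf-32-be")

# Decoding with the generic codec consumes the BOM and uses it to pick
# the byte order, so the BOM vanishes from the decoded string.
roundtrip = utf32.decode("utf-32")

# Decoding with an explicit byte order treats U+FEFF as a normal
# character (ZERO WIDTH NO-BREAK SPACE) and keeps it.
kept = (codecs.BOM_UTF16_BE + text.encode("utf-16-be")).decode("utf-16-be")

# utf-8-sig writes 0xef 0xbb 0xbf first and skips it again on decoding;
# plain utf-8 leaves the signature in the string as U+FEFF.
sig = text.encode("utf-8-sig")
sig_stripped = sig.decode("utf-8-sig")
sig_kept = sig.decode("utf-8")
```

Which BOM the generic ``utf-16`` and ``utf-32`` codecs emit depends on the platform's endianness, so the sketch checks for either variant instead of assuming one.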