Added a note in each regarding the fact that unicode strings that look the same

author Mark Summerfield <list@qtrac.plus.com>

Thu, 16 Aug 2007 10:09:22 +0000 (10:09 +0000)

committer Mark Summerfield <list@qtrac.plus.com>

Thu, 16 Aug 2007 10:09:22 +0000 (10:09 +0000)
diff --git a/Doc/library/unicodedata.rst b/Doc/library/unicodedata.rst

index 017d4ee785fc0782fa9eecce725fd7402527c5eb..ec788c5f063c1a4dd32d7b1c3d57d7cf347777ac 100644 (file)
--- a/Doc/library/unicodedata.rst
+++ b/Doc/library/unicodedata.rst
@@ -107,7 +107,7 @@ the following functions:
     based on the definition of canonical equivalence and compatibility equivalence.
     In Unicode, several characters can be expressed in various way. For example, the
     character U+00C7 (LATIN CAPITAL LETTER C WITH CEDILLA) can also be expressed as
-   the sequence U+0043 (LATIN CAPITAL LETTER C) U+0327 (COMBINING CEDILLA).
+   the sequence U+0327 (COMBINING CEDILLA) U+0043 (LATIN CAPITAL LETTER C).
  
     For each character, there are two normal forms: normal form C and normal form D.
     Normal form D (NFD) is also known as canonical decomposition, and translates
@@ -126,6 +126,10 @@ the following functions:
     (NFKC) first applies the compatibility decomposition, followed by the canonical
     composition.
  
+   Even if two unicode strings are normalized and look the same to
+   a human reader, if one has combining characters and the other
+   doesn't, they may not compare equal.
+
     .. versionadded:: 2.3
  
  In addition, the module exposes the following constant:
diff --git a/Doc/reference/expressions.rst b/Doc/reference/expressions.rst

index 3364fd69950ac90c7a6b21704af7b002731452e0..a1c4185dd3ddee0d2e6d89f850a908091ebd3476 100644 (file)
--- a/Doc/reference/expressions.rst
+++ b/Doc/reference/expressions.rst
@@ -1040,7 +1040,7 @@ Comparison of objects of the same type depends on the type:
  
  * Strings are compared lexicographically using the numeric equivalents (the
    result of the built-in function :func:`ord`) of their characters.  Unicode and
-  8-bit strings are fully interoperable in this behavior.
+  8-bit strings are fully interoperable in this behavior. [#]_
  
  * Tuples and lists are compared lexicographically using comparison of
    corresponding elements.  This means that to compare equal, each element must
@@ -1328,6 +1328,12 @@ groups from right to left).
     cases, Python returns the latter result, in order to preserve that
     ``divmod(x,y)[0] * y + x % y`` be very close to ``x``.
  
+.. [#] While comparisons between unicode strings make sense at the byte
+   level, they may be counter-intuitive to users. For example, the
+   strings ``u"\u00C7"`` and ``u"\u0327\u0043"`` compare differently,
+   even though they both represent the same unicode character (LATIN
+   CAPITAL LETTER C WITH CEDILLA).
+
  .. [#] The implementation computes this efficiently, without constructing lists or
     sorting.
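
The behaviour the two added notes describe can be checked with a short sketch. It assumes Python 3, where every ``str`` is a unicode string, and the variable names are illustrative only. It writes the decomposed form as the base letter U+0043 followed by the combining mark U+0327, because Unicode places combining marks after their base character and only that order is canonically equivalent to U+00C7.

```python
import unicodedata

# Two spellings of LATIN CAPITAL LETTER C WITH CEDILLA (names are illustrative).
precomposed = "\u00C7"        # single code point
decomposed = "\u0043\u0327"   # LATIN CAPITAL LETTER C + COMBINING CEDILLA

# Comparison works on code points, so the raw strings differ.
raw_equal = precomposed == decomposed

# Normalizing both sides to the same form makes them compare equal.
nfc_equal = (unicodedata.normalize("NFC", precomposed)
             == unicodedata.normalize("NFC", decomposed))
nfd_equal = (unicodedata.normalize("NFD", precomposed)
             == unicodedata.normalize("NFD", decomposed))

# A combining mark placed before the letter does not compose with it.
reversed_order = "\u0327\u0043"
reversed_composes = unicodedata.normalize("NFC", reversed_order) == precomposed

print(raw_equal, nfc_equal, nfd_equal, reversed_composes)
```

Choosing one normal form (NFC or NFD) and applying it to both operands before comparing is the usual way to avoid the surprise described in the footnote.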