based on the definition of canonical equivalence and compatibility equivalence.
In Unicode, several characters can be expressed in various ways. For example, the
character U+00C7 (LATIN CAPITAL LETTER C WITH CEDILLA) can also be expressed as
- the sequence U+0043 (LATIN CAPITAL LETTER C) U+0327 (COMBINING CEDILLA).
+ the sequence U+0043 (LATIN CAPITAL LETTER C) U+0327 (COMBINING CEDILLA).
For each character, there are two normal forms: normal form C and normal form D.
Normal form D (NFD) is also known as canonical decomposition, and translates
(NFKC) first applies the compatibility decomposition, followed by the canonical
composition.
+ Even if two unicode strings look the same to a human reader, they
+ may not compare equal unless both are normalized to the same form:
+ one may use combining characters where the other does not.
+
.. versionadded:: 2.3
In addition, the module exposes the following constant:
* Strings are compared lexicographically using the numeric equivalents (the
result of the built-in function :func:`ord`) of their characters. Unicode and
- 8-bit strings are fully interoperable in this behavior.
+ 8-bit strings are fully interoperable in this behavior. [#]_
* Tuples and lists are compared lexicographically using comparison of
corresponding elements. This means that to compare equal, each element must
cases, Python returns the latter result, in order to preserve that
``divmod(x,y)[0] * y + x % y`` be very close to ``x``.
+.. [#] While comparisons between unicode strings make sense at the code
+ point level, they may be counter-intuitive to users. For example, the
+ strings ``u"\u00C7"`` and ``u"\u0043\u0327"`` compare differently,
+ even though they both represent the same unicode character (LATIN
+ CAPITAL LETTER C WITH CEDILLA).
+
.. [#] The implementation computes this efficiently, without constructing lists or
sorting.
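
The comparison behaviour discussed in this change can be checked with a short sketch. It is illustrative only and is not part of the patch. The variable names are hypothetical, and it assumes an interpreter that accepts ``u""`` literals (Python 2, or Python 3.3 and later). A combining character follows its base character, so the decomposed spelling is U+0043 followed by U+0327.

```python
import unicodedata

composed = u"\u00C7"          # LATIN CAPITAL LETTER C WITH CEDILLA
decomposed = u"\u0043\u0327"  # LATIN CAPITAL LETTER C, then COMBINING CEDILLA

# Plain comparison works on the numeric values of the characters,
# so the two spellings are not equal as they stand.
raw_equal = composed == decomposed

# After normalizing both strings to the same form, they compare equal.
nfc_equal = (unicodedata.normalize("NFC", composed)
             == unicodedata.normalize("NFC", decomposed))
nfd_equal = (unicodedata.normalize("NFD", composed)
             == unicodedata.normalize("NFD", decomposed))

# With the combining mark placed before the base letter, the cedilla has
# no preceding base to attach to, so composition does not produce U+00C7.
reversed_order = u"\u0327\u0043"
reversed_equal = unicodedata.normalize("NFC", reversed_order) == composed

print(raw_equal, nfc_equal, nfd_equal, reversed_equal)
```

Normalizing both operands to the same form (NFC or NFD) before comparing is one way to get the result a human reader would expect. Which form to choose depends on the application.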