closes bpo-37966: Fully implement the UAX #15 quick-check algorithm. (GH-15558)

The purpose of the `unicodedata.is_normalized` function is to answer
the question `str == unicodedata.normalize(form, str)` more
efficiently than writing just that, by using the "quick check"
optimization described in the Unicode standard in UAX #15.
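
As a small illustration of that contract, the sketch below compares
`unicodedata.is_normalized` against the straightforward
normalize-and-compare definition. The helper name `is_normalized_slow`
and the sample strings are chosen here for illustration only, and the
snippet assumes Python 3.8 or later, where `is_normalized` exists.

```python
import unicodedata

def is_normalized_slow(form, s):
    # The straightforward definition: normalize the whole string, then compare.
    return s == unicodedata.normalize(form, s)

# A few illustrative inputs: plain ASCII, a precomposed letter, a base letter
# plus combining mark, a CJK compatibility ideograph, and a lone combining mark.
samples = ["abc", "\u00e9", "e\u0301", "\uf900", "\u0338"]

for form in ("NFC", "NFD", "NFKC", "NFKD"):
    for s in samples:
        # Both ways of asking the question must give the same answer.
        assert unicodedata.is_normalized(form, s) == is_normalized_slow(form, s)
```

The two always agree on the result; the difference is only in how much
work is needed to reach it.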

However, it turns out the code doesn't implement the full algorithm
from the standard, and as a result we often miss the optimization and
end up having to compute the whole normalized string after all.

Implement the standard's algorithm.  This greatly speeds up
`unicodedata.is_normalized` in many cases where our partial variant
of quick-check had been returning MAYBE and the standard algorithm
returns NO.
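
For readers unfamiliar with the algorithm, here is a rough Python sketch
of the quick-check loop from UAX #15, specialized to NFD. It is not the
C code in `Modules/unicodedata.c`: the names `nfd_allowed` and
`quick_check_nfd` are hypothetical, and the per-character property is
approximated by normalizing a single character instead of looking it up
in a table.

```python
import unicodedata

YES, MAYBE, NO = "YES", "MAYBE", "NO"

def nfd_allowed(ch):
    # Stand-in for the NFD_Quick_Check property (an assumption of this
    # sketch): a character is allowed in NFD exactly when NFD leaves it alone.
    return YES if unicodedata.normalize("NFD", ch) == ch else NO

def quick_check_nfd(s):
    last_ccc = 0
    result = YES
    for ch in s:
        ccc = unicodedata.combining(ch)  # canonical combining class
        # Combining marks out of canonical order: definitely not normalized.
        if ccc != 0 and last_ccc > ccc:
            return NO
        check = nfd_allowed(ch)
        if check == NO:
            return NO  # a definite answer, no need to normalize anything
        if check == MAYBE:
            result = MAYBE  # kept for generality; NFD never yields MAYBE
        last_ccc = ccc
    return result
```

The point of the tri-state result is that both YES and NO are final
answers; only MAYBE requires the slow normalize-and-compare fallback.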

In a quick test on my desktop, the existing code takes about 4.4 ms/MB
(so 4.4 ns per byte) when the partial quick-check returns MAYBE and it
has to do the slow normalize-and-compare:

  $ build.base/python -m timeit -s 'import unicodedata; s = "\uf900"*500000' \
      -- 'unicodedata.is_normalized("NFD", s)'
  50 loops, best of 5: 4.39 msec per loop

With this patch, it gets the answer instantly (58 ns) on the same 1 MB
string:

  $ build.dev/python -m timeit -s 'import unicodedata; s = "\uf900"*500000' \
      -- 'unicodedata.is_normalized("NFD", s)'
  5000000 loops, best of 5: 58.2 nsec per loop
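
The arithmetic behind those figures can be checked as follows. This is a
back-of-the-envelope sketch: it assumes the test string is stored at
2 bytes per code point, which is how CPython's compact string
representation (PEP 393) holds a string whose largest code point lies
between U+0100 and U+FFFF. The variable names are illustrative.

```python
s = "\uf900" * 500000

# Assumption: 2 bytes per code point for this BMP-only string.
size_bytes = len(s) * 2

slow_seconds = 4.39e-3   # measured above with build.base
fast_seconds = 58.2e-9   # measured above with build.dev

ns_per_byte = slow_seconds / size_bytes * 1e9
speedup = slow_seconds / fast_seconds

print(f"{size_bytes} bytes, {ns_per_byte:.2f} ns/byte, about {speedup:,.0f}x faster")
```

So the string is 1,000,000 bytes under that assumption, the old code
spends about 4.4 ns per byte, and the patched code answers roughly
75,000 times faster on this input.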

This restores a small optimization that the original version of this
code had for the `unicodedata.normalize` use case.

With this, that case is actually faster than in master!

  $ build.base/python -m timeit -s 'import unicodedata; s = "\u0338"*500000' \
      -- 'unicodedata.normalize("NFD", s)'
  500 loops, best of 5: 561 usec per loop

  $ build.dev/python -m timeit -s 'import unicodedata; s = "\u0338"*500000' \
      -- 'unicodedata.normalize("NFD", s)'
  500 loops, best of 5: 512 usec per loop

(cherry picked from commit 2f09413947d1ce0043de62ed2346f9a2b4e5880b)

Co-authored-by: Greg Price <gnprice@gmail.com>

author	Miss Islington (bot) <31488909+miss-islington@users.noreply.github.com>
	Wed, 4 Sep 2019 03:03:37 +0000 (20:03 -0700)
committer	GitHub <noreply@github.com>
	Wed, 4 Sep 2019 03:03:37 +0000 (20:03 -0700)
commit	4dd1c9d9c2bca4744c70c9556b7051f4465ede3e
tree	3d6401fa900d729bc66dd0c057d3c47577442cc6
parent	952ea67289ffbd2f4785a9e537884a63d1208101

Doc/whatsnew/3.8.rst
Lib/test/test_unicodedata.py
Misc/NEWS.d/next/Core and Builtins/2019-08-27-21-21-36.bpo-37966.5OBLez.rst	[new file with mode: 0644]
Modules/unicodedata.c