Issue #8188: Introduce a new scheme for computing hashes of numbers

author Mark Dickinson <dickinsm@gmail.com>

Sun, 23 May 2010 13:33:13 +0000 (13:33 +0000)

committer Mark Dickinson <dickinsm@gmail.com>

Sun, 23 May 2010 13:33:13 +0000 (13:33 +0000)
diff --git a/Doc/library/stdtypes.rst b/Doc/library/stdtypes.rst

index c5d6766c39e04ea1e4cf5f3961ac3f84923e8ea9..b07c7f8b53de96ee665f32d06af6467173da30b6 100644
--- a/Doc/library/stdtypes.rst
+++ b/Doc/library/stdtypes.rst
@@ -595,6 +595,109 @@ hexadecimal string representing the same number::
     '0x1.d380000000000p+11'
  
  
+.. _numeric-hash:
+
+Hashing of numeric types
+------------------------
+
+For numbers ``x`` and ``y``, possibly of different types, it's a requirement
+that ``hash(x) == hash(y)`` whenever ``x == y`` (see the :meth:`__hash__`
+method documentation for more details).  For ease of implementation and
+efficiency across a variety of numeric types (including :class:`int`,
+:class:`float`, :class:`decimal.Decimal` and :class:`fractions.Fraction`)
+Python's hash for numeric types is based on a single mathematical function
+that's defined for any rational number, and hence applies to all instances of
+:class:`int` and :class:`fractions.Fraction`, and all finite instances of
+:class:`float` and :class:`decimal.Decimal`.  Essentially, this function is
+given by reduction modulo ``P`` for a fixed prime ``P``.  The value of ``P`` is
+made available to Python as the :attr:`modulus` attribute of
+:data:`sys.hash_info`.
+
+.. impl-detail::
+
+   Currently, the prime used is ``P = 2**31 - 1`` on machines with 32-bit C
+   longs and ``P = 2**61 - 1`` on machines with 64-bit C longs.
+
+Here are the rules in detail:
+
+ - If ``x = m / n`` is a nonnegative rational number and ``n`` is not divisible
+   by ``P``, define ``hash(x)`` as ``m * invmod(n, P) % P``, where ``invmod(n,
+   P)`` gives the inverse of ``n`` modulo ``P``.
+
+ - If ``x = m / n`` is a nonnegative rational number and ``n`` is
+   divisible by ``P`` (but ``m`` is not) then ``n`` has no inverse
+   modulo ``P`` and the rule above doesn't apply; in this case define
+   ``hash(x)`` to be the constant value ``sys.hash_info.inf``.
+
+ - If ``x = m / n`` is a negative rational number define ``hash(x)``
+   as ``-hash(-x)``.  If the resulting hash is ``-1``, replace it with
+   ``-2``.
+
+ - The particular values ``sys.hash_info.inf``, ``-sys.hash_info.inf``
+   and ``sys.hash_info.nan`` are used as hash values for positive
+   infinity, negative infinity, or nans (respectively).  (All hashable
+   nans have the same hash value.)
+
+ - For a :class:`complex` number ``z``, the hash values of the real
+   and imaginary parts are combined by computing ``hash(z.real) +
+   sys.hash_info.imag * hash(z.imag)``, reduced modulo
+   ``2**sys.hash_info.width`` so that it lies in
+   ``range(-2**(sys.hash_info.width - 1), 2**(sys.hash_info.width -
+   1))``.  Again, if the result is ``-1``, it's replaced with ``-2``.
+
+
+To clarify the above rules, here's some example Python code,
+equivalent to the builtin hash, for computing the hash of a rational
+number, :class:`float`, or :class:`complex`::
+
+
+   import sys, math
+
+   def hash_fraction(m, n):
+       """Compute the hash of a rational number m / n.
+
+       Assumes m and n are integers, with n positive.
+       Equivalent to hash(fractions.Fraction(m, n)).
+
+       """
+       P = sys.hash_info.modulus
+       # Remove common factors of P.  (Unnecessary if m and n already coprime.)
+       while m % P == n % P == 0:
+           m, n = m // P, n // P
+
+       if n % P == 0:
+           hash_ = sys.hash_info.inf
+       else:
+           # Fermat's Little Theorem: pow(n, P-1, P) is 1, so
+           # pow(n, P-2, P) gives the inverse of n modulo P.
+           hash_ = (abs(m) % P) * pow(n, P - 2, P) % P
+       if m < 0:
+           hash_ = -hash_
+       if hash_ == -1:
+           hash_ = -2
+       return hash_
+
+   def hash_float(x):
+       """Compute the hash of a float x."""
+
+       if math.isnan(x):
+           return sys.hash_info.nan
+       elif math.isinf(x):
+           return sys.hash_info.inf if x > 0 else -sys.hash_info.inf
+       else:
+           return hash_fraction(*x.as_integer_ratio())
+
+   def hash_complex(z):
+       """Compute the hash of a complex number z."""
+
+       hash_ = hash_float(z.real) + sys.hash_info.imag * hash_float(z.imag)
+       # do a signed reduction modulo 2**sys.hash_info.width
+       M = 2**(sys.hash_info.width - 1)
+       hash_ = (hash_ & (M - 1)) - (hash_ & M)
+       if hash_ == -1:
+           hash_ = -2
+       return hash_
+
  .. _typeiter:
  
  Iterator Types
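As a quick illustration of the rules documented in the hunk above, the following sketch checks that numerically equal values of different types share a hash, and that a few boundary cases behave as described. The expected values are derived from the rules themselves (for example, the inverse of 2 modulo ``P`` is ``(P + 1) // 2``). It assumes an interpreter that implements this numeric hash scheme; the variable names are illustrative only.

```python
import sys
from decimal import Decimal
from fractions import Fraction

P = sys.hash_info.modulus

# Equal values should hash equal, regardless of type.
values = [1, 1.0, Decimal(1), Fraction(1), complex(1, 0), True]
same_hash = len({hash(v) for v in values}) == 1

# 2 * ((P + 1) // 2) == P + 1, which is 1 modulo P, so (P + 1) // 2 is
# the inverse of 2 modulo P and therefore the hash of one half.
half_hash = (P + 1) // 2
half_ok = (hash(0.5) == hash(Fraction(1, 2))
           == hash(Decimal("0.5")) == half_hash)

# A denominator divisible by P has no inverse modulo P, so the hash
# falls back to the constant used for positive infinity.
inf_ok = hash(Fraction(1, P)) == sys.hash_info.inf == hash(float("inf"))

# A result of -1 is replaced with -2, as the rules state.
minus_one_ok = hash(-1) == hash(-1.0) == hash(Fraction(-1)) == -2

print(same_hash, half_ok, inf_ok, minus_one_ok)
```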
diff --git a/Doc/library/sys.rst b/Doc/library/sys.rst

index 3b9bbb076cb3e4840d0d30837f1ab9a366fe0884..e2a2f72e019d73caa9e477ed56f645ccf3440f4f 100644
--- a/Doc/library/sys.rst
+++ b/Doc/library/sys.rst
@@ -446,6 +446,30 @@ always available.
        Changed to a named tuple and added *service_pack_minor*,
        *service_pack_major*, *suite_mask*, and *product_type*.
  
+
+.. data:: hash_info
+
+   A structseq giving parameters of the numeric hash implementation.  For
+   more details about hashing of numeric types, see :ref:`numeric-hash`.
+
+   +---------------------+--------------------------------------------------+
+   | attribute           | explanation                                      |
+   +=====================+==================================================+
+   | :const:`width`      | width in bits used for hash values               |
+   +---------------------+--------------------------------------------------+
+   | :const:`modulus`    | prime modulus P used for numeric hash scheme     |
+   +---------------------+--------------------------------------------------+
+   | :const:`inf`        | hash value returned for a positive infinity      |
+   +---------------------+--------------------------------------------------+
+   | :const:`nan`        | hash value returned for a nan                    |
+   +---------------------+--------------------------------------------------+
+   | :const:`imag`       | multiplier used for the imaginary part of a      |
+   |                     | complex number                                   |
+   +---------------------+--------------------------------------------------+
+
+   .. versionadded:: 3.2
+
+
  .. data:: hexversion
  
     The version number encoded as a single integer.  This is guaranteed to increase
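A short sketch of how the new ``sys.hash_info`` structseq might be inspected. It reads the five attributes documented in the table above by name rather than by position, on the assumption that other Python versions may expose a different number of fields. The ``fields`` and ``is_mersenne_form`` names are illustrative.

```python
import sys

info = sys.hash_info

# Look the documented attributes up by name, not by index.
fields = {name: getattr(info, name)
          for name in ("width", "modulus", "inf", "nan", "imag")}

for name, value in fields.items():
    print(f"{name:8} = {value}")

# The modulus described in this change has the form 2**n - 1, so
# modulus + 1 should be a power of two (no bits in common with modulus).
is_mersenne_form = ((info.modulus + 1) & info.modulus) == 0
```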
diff --git a/Include/pyport.h b/Include/pyport.h

index dc5c1fd389d2240e437097a5f049acb2b3fed433..95612851b04ff8474ae50b94b2dd5d5f68a9f719 100644
--- a/Include/pyport.h
+++ b/Include/pyport.h
@@ -126,6 +126,20 @@ Used in:  PY_LONG_LONG
  #endif
  #endif
  
+/* Parameters used for the numeric hash implementation.  See notes for
+   _PyHash_Double in Objects/object.c.  Numeric hashes are based on
+   reduction modulo the prime 2**_PyHASH_BITS - 1. */
+
+#if SIZEOF_LONG >= 8
+#define _PyHASH_BITS 61
+#else
+#define _PyHASH_BITS 31
+#endif
+#define _PyHASH_MODULUS ((1UL << _PyHASH_BITS) - 1)
+#define _PyHASH_INF 314159
+#define _PyHASH_NAN 0
+#define _PyHASH_IMAG 1000003UL
+
  /* uintptr_t is the C9X name for an unsigned integral type such that a
   * legitimate void* can be cast to uintptr_t and then back to void* again
   * without loss of information.  Similarly for intptr_t, wrt a signed
diff --git a/Lib/decimal.py b/Lib/decimal.py

index cc71cd834032e358594ebed9fb8c0e9c864d776a..29ce39838f399647ec2d92ecf4f5b1b263748ba7 100644
--- a/Lib/decimal.py
+++ b/Lib/decimal.py
@@ -862,7 +862,7 @@ class Decimal(object):
      # that specified by IEEE 754.
  
      def __eq__(self, other, context=None):
-        other = _convert_other(other, allow_float=True)
+        other = _convert_other(other, allow_float = True)
          if other is NotImplemented:
              return other
          if self._check_nans(other, context):
@@ -870,7 +870,7 @@ class Decimal(object):
          return self._cmp(other) == 0
  
      def __ne__(self, other, context=None):
-        other = _convert_other(other, allow_float=True)
+        other = _convert_other(other, allow_float = True)
          if other is NotImplemented:
              return other
          if self._check_nans(other, context):
@@ -879,7 +879,7 @@ class Decimal(object):
  
  
      def __lt__(self, other, context=None):
-        other = _convert_other(other, allow_float=True)
+        other = _convert_other(other, allow_float = True)
          if other is NotImplemented:
              return other
          ans = self._compare_check_nans(other, context)
@@ -888,7 +888,7 @@ class Decimal(object):
          return self._cmp(other) < 0
  
      def __le__(self, other, context=None):
-        other = _convert_other(other, allow_float=True)
+        other = _convert_other(other, allow_float = True)
          if other is NotImplemented:
              return other
          ans = self._compare_check_nans(other, context)
@@ -897,7 +897,7 @@ class Decimal(object):
          return self._cmp(other) <= 0
  
      def __gt__(self, other, context=None):
-        other = _convert_other(other, allow_float=True)
+        other = _convert_other(other, allow_float = True)
          if other is NotImplemented:
              return other
          ans = self._compare_check_nans(other, context)
@@ -906,7 +906,7 @@ class Decimal(object):
          return self._cmp(other) > 0
  
      def __ge__(self, other, context=None):
-        other = _convert_other(other, allow_float=True)
+        other = _convert_other(other, allow_float = True)
          if other is NotImplemented:
              return other
          ans = self._compare_check_nans(other, context)
@@ -935,55 +935,28 @@ class Decimal(object):
  
      def __hash__(self):
          """x.__hash__() <==> hash(x)"""
-        # Decimal integers must hash the same as the ints
-        #
-        # The hash of a nonspecial noninteger Decimal must depend only
-        # on the value of that Decimal, and not on its representation.
-        # For example: hash(Decimal('100E-1')) == hash(Decimal('10')).
-
-        # Equality comparisons involving signaling nans can raise an
-        # exception; since equality checks are implicitly and
-        # unpredictably used when checking set and dict membership, we
-        # prevent signaling nans from being used as set elements or
-        # dict keys by making __hash__ raise an exception.
+
+        # In order to make sure that the hash of a Decimal instance
+        # agrees with the hash of a numerically equal integer, float
+        # or Fraction, we follow the rules for numeric hashes outlined
+        # in the documentation.  (See library docs, 'Built-in Types').
          if self._is_special:
              if self.is_snan():
                  raise TypeError('Cannot hash a signaling NaN value.')
              elif self.is_nan():
-                # 0 to match hash(float('nan'))
-                return 0
+                return _PyHASH_NAN
              else:
-                # values chosen to match hash(float('inf')) and
-                # hash(float('-inf')).
                  if self._sign:
-                    return -271828
+                    return -_PyHASH_INF
                  else:
-                    return 314159
-
-        # In Python 2.7, we're allowing comparisons (but not
-        # arithmetic operations) between floats and Decimals;  so if
-        # a Decimal instance is exactly representable as a float then
-        # its hash should match that of the float.
-        self_as_float = float(self)
-        if Decimal.from_float(self_as_float) == self:
-            return hash(self_as_float)
-
-        if self._isinteger():
-            op = _WorkRep(self.to_integral_value())
-            # to make computation feasible for Decimals with large
-            # exponent, we use the fact that hash(n) == hash(m) for
-            # any two nonzero integers n and m such that (i) n and m
-            # have the same sign, and (ii) n is congruent to m modulo
-            # 2**64-1.  So we can replace hash((-1)**s*c*10**e) with
-            # hash((-1)**s*c*pow(10, e, 2**64-1).
-            return hash((-1)**op.sign*op.int*pow(10, op.exp, 2**64-1))
-        # The value of a nonzero nonspecial Decimal instance is
-        # faithfully represented by the triple consisting of its sign,
-        # its adjusted exponent, and its coefficient with trailing
-        # zeros removed.
-        return hash((self._sign,
-                     self._exp+len(self._int),
-                     self._int.rstrip('0')))
+                    return _PyHASH_INF
+
+        if self._exp >= 0:
+            exp_hash = pow(10, self._exp, _PyHASH_MODULUS)
+        else:
+            exp_hash = pow(_PyHASH_10INV, -self._exp, _PyHASH_MODULUS)
+        hash_ = int(self._int) * exp_hash % _PyHASH_MODULUS
+        return hash_ if self >= 0 else -hash_
  
      def as_tuple(self):
          """Represents the number as a triple tuple.
@@ -6218,6 +6191,17 @@ _NegativeOne = Decimal(-1)
  # _SignedInfinity[sign] is infinity w/ that sign
  _SignedInfinity = (_Infinity, _NegativeInfinity)
  
+# Constants related to the hash implementation;  hash(x) is based
+# on the reduction of x modulo _PyHASH_MODULUS
+import sys
+_PyHASH_MODULUS = sys.hash_info.modulus
+# hash values to use for positive and negative infinities, and nans
+_PyHASH_INF = sys.hash_info.inf
+_PyHASH_NAN = sys.hash_info.nan
+del sys
+
+# _PyHASH_10INV is the inverse of 10 modulo the prime _PyHASH_MODULUS
+_PyHASH_10INV = pow(10, _PyHASH_MODULUS - 2, _PyHASH_MODULUS)
  
  
  if __name__ == '__main__':
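The new ``Decimal.__hash__`` reduces the coefficient and the power of ten separately, using a precomputed inverse of 10 for negative exponents. A minimal standalone sketch of the same idea, working from the public ``as_tuple()`` method instead of the private ``_int`` and ``_exp`` attributes, might look like the following. ``decimal_hash_sketch`` is a hypothetical helper, handles finite values only, and applies the ``-1`` to ``-2`` substitution explicitly because it does not go through the built-in ``hash()``.

```python
import sys
from decimal import Decimal

P = sys.hash_info.modulus
INV10 = pow(10, P - 2, P)  # inverse of 10 modulo the prime P


def decimal_hash_sketch(d):
    """Hash of a finite Decimal d, following the rules in this change."""
    sign, digits, exp = d.as_tuple()
    coeff = int("".join(map(str, digits)))
    if exp >= 0:
        exp_hash = pow(10, exp, P)
    else:
        exp_hash = pow(INV10, -exp, P)
    h = coeff * exp_hash % P
    if sign:
        h = -h
    # The built-in hash() never returns -1; it substitutes -2.
    return -2 if h == -1 else h


samples = ["0", "-0.0", "1.5", "-0.25", "123e2", "12300.00",
           "1e-30", "-1", "1e500"]
matches = all(decimal_hash_sketch(Decimal(s)) == hash(Decimal(s))
              for s in samples)
print(matches)
```

Because only modular exponentiation is involved, a value such as ``Decimal('1e500')`` is hashed without ever building the 501-digit integer.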
diff --git a/Lib/fractions.py b/Lib/fractions.py

index fc8a12c0144ab06d92c690c953775719944a9ea9..51e67e22eade0ea5f33ba3608c326bb1e79afef6 100644
--- a/Lib/fractions.py
+++ b/Lib/fractions.py
@@ -8,6 +8,7 @@ import math
  import numbers
  import operator
  import re
+import sys
  
  __all__ = ['Fraction', 'gcd']
  
@@ -23,6 +24,12 @@ def gcd(a, b):
          a, b = b, a%b
      return a
  
+# Constants related to the hash implementation;  hash(x) is based
+# on the reduction of x modulo the prime _PyHASH_MODULUS.
+_PyHASH_MODULUS = sys.hash_info.modulus
+# Value to be used for rationals that reduce to infinity modulo
+# _PyHASH_MODULUS.
+_PyHASH_INF = sys.hash_info.inf
  
  _RATIONAL_FORMAT = re.compile(r"""
      \A\s*                      # optional whitespace at the start, then
@@ -528,16 +535,22 @@ class Fraction(numbers.Rational):
  
          """
          # XXX since this method is expensive, consider caching the result
-        if self._denominator == 1:
-            # Get integers right.
-            return hash(self._numerator)
-        # Expensive check, but definitely correct.
-        if self == float(self):
-            return hash(float(self))
+
+        # In order to make sure that the hash of a Fraction agrees
+        # with the hash of a numerically equal integer, float or
+        # Decimal instance, we follow the rules for numeric hashes
+        # outlined in the documentation.  (See library docs, 'Built-in
+        # Types').
+
+        # dinv is the inverse of self._denominator modulo the prime
+        # _PyHASH_MODULUS, or 0 if self._denominator is divisible by
+        # _PyHASH_MODULUS.
+        dinv = pow(self._denominator, _PyHASH_MODULUS - 2, _PyHASH_MODULUS)
+        if not dinv:
+            hash_ = _PyHASH_INF
          else:
-            # Use tuple's hash to avoid a high collision rate on
-            # simple fractions.
-            return hash((self._numerator, self._denominator))
+            hash_ = abs(self._numerator) * dinv % _PyHASH_MODULUS
+        return hash_ if self >= 0 else -hash_
  
      def __eq__(a, b):
          """a == b"""
diff --git a/Lib/test/test_float.py b/Lib/test/test_float.py

index b52b1db9e461882ba749b1e16fd8b62c9f4b50d8..cabeb16c42b1512acd76f03a1cec2eea8438c262 100644
--- a/Lib/test/test_float.py
+++ b/Lib/test/test_float.py
@@ -914,15 +914,6 @@ class InfNanTest(unittest.TestCase):
          self.assertFalse(NAN.is_inf())
          self.assertFalse((0.).is_inf())
  
-    def test_hash_inf(self):
-        # the actual values here should be regarded as an
-        # implementation detail, but they need to be
-        # identical to those used in the Decimal module.
-        self.assertEqual(hash(float('inf')), 314159)
-        self.assertEqual(hash(float('-inf')), -271828)
-        self.assertEqual(hash(float('nan')), 0)
-
-
  fromHex = float.fromhex
  toHex = float.hex
  class HexFloatTestCase(unittest.TestCase):
diff --git a/Lib/test/test_numeric_tower.py b/Lib/test/test_numeric_tower.py

new file mode 100644

index 0000000..eafdb0f
--- /dev/null
+++ b/Lib/test/test_numeric_tower.py
@@ -0,0 +1,151 @@
+# test interactions between int, float, Decimal and Fraction
+
+import unittest
+import random
+import math
+import sys
+import operator
+from test.support import run_unittest
+
+from decimal import Decimal as D
+from fractions import Fraction as F
+
+# Constants related to the hash implementation;  hash(x) is based
+# on the reduction of x modulo the prime _PyHASH_MODULUS.
+_PyHASH_MODULUS = sys.hash_info.modulus
+_PyHASH_INF = sys.hash_info.inf
+
+class HashTest(unittest.TestCase):
+    def check_equal_hash(self, x, y):
+        # check both that x and y are equal and that their hashes are equal
+        self.assertEqual(hash(x), hash(y),
+                         "got different hashes for {!r} and {!r}".format(x, y))
+        self.assertEqual(x, y)
+
+    def test_bools(self):
+        self.check_equal_hash(False, 0)
+        self.check_equal_hash(True, 1)
+
+    def test_integers(self):
+        # check that equal values hash equal
+
+        # exact integers
+        for i in range(-1000, 1000):
+            self.check_equal_hash(i, float(i))
+            self.check_equal_hash(i, D(i))
+            self.check_equal_hash(i, F(i))
+
+        # the current hash is based on reduction modulo 2**n-1 for some
+        # n, so pay special attention to numbers of the form 2**n and 2**n-1.
+        for i in range(100):
+            n = 2**i - 1
+            if n == int(float(n)):
+                self.check_equal_hash(n, float(n))
+                self.check_equal_hash(-n, -float(n))
+            self.check_equal_hash(n, D(n))
+            self.check_equal_hash(n, F(n))
+            self.check_equal_hash(-n, D(-n))
+            self.check_equal_hash(-n, F(-n))
+
+            n = 2**i
+            self.check_equal_hash(n, float(n))
+            self.check_equal_hash(-n, -float(n))
+            self.check_equal_hash(n, D(n))
+            self.check_equal_hash(n, F(n))
+            self.check_equal_hash(-n, D(-n))
+            self.check_equal_hash(-n, F(-n))
+
+        # random values of various sizes
+        for _ in range(1000):
+            e = random.randrange(300)
+            n = random.randrange(-10**e, 10**e)
+            self.check_equal_hash(n, D(n))
+            self.check_equal_hash(n, F(n))
+            if n == int(float(n)):
+                self.check_equal_hash(n, float(n))
+
+    def test_binary_floats(self):
+        # check that floats hash equal to corresponding Fractions and Decimals
+
+        # floats that are distinct but numerically equal should hash the same
+        self.check_equal_hash(0.0, -0.0)
+
+        # zeros
+        self.check_equal_hash(0.0, D(0))
+        self.check_equal_hash(-0.0, D(0))
+        self.check_equal_hash(-0.0, D('-0.0'))
+        self.check_equal_hash(0.0, F(0))
+
+        # infinities and nans
+        self.check_equal_hash(float('inf'), D('inf'))
+        self.check_equal_hash(float('-inf'), D('-inf'))
+
+        for _ in range(1000):
+            x = random.random() * math.exp(random.random()*200.0 - 100.0)
+            self.check_equal_hash(x, D.from_float(x))
+            self.check_equal_hash(x, F.from_float(x))
+
+    def test_complex(self):
+        # complex numbers with zero imaginary part should hash equal to
+        # the corresponding float
+
+        test_values = [0.0, -0.0, 1.0, -1.0, 0.40625, -5136.5,
+                       float('inf'), float('-inf')]
+
+        for zero in -0.0, 0.0:
+            for value in test_values:
+                self.check_equal_hash(value, complex(value, zero))
+
+    def test_decimals(self):
+        # check that Decimal instances that have different representations
+        # but equal values give the same hash
+        zeros = ['0', '-0', '0.0', '-0.0e10', '000e-10']
+        for zero in zeros:
+            self.check_equal_hash(D(zero), D(0))
+
+        self.check_equal_hash(D('1.00'), D(1))
+        self.check_equal_hash(D('1.00000'), D(1))
+        self.check_equal_hash(D('-1.00'), D(-1))
+        self.check_equal_hash(D('-1.00000'), D(-1))
+        self.check_equal_hash(D('123e2'), D(12300))
+        self.check_equal_hash(D('1230e1'), D(12300))
+        self.check_equal_hash(D('12300'), D(12300))
+        self.check_equal_hash(D('12300.0'), D(12300))
+        self.check_equal_hash(D('12300.00'), D(12300))
+        self.check_equal_hash(D('12300.000'), D(12300))
+
+    def test_fractions(self):
+        # check special case for fractions where either the numerator
+        # or the denominator is a multiple of _PyHASH_MODULUS
+        self.assertEqual(hash(F(1, _PyHASH_MODULUS)), _PyHASH_INF)
+        self.assertEqual(hash(F(-1, 3*_PyHASH_MODULUS)), -_PyHASH_INF)
+        self.assertEqual(hash(F(7*_PyHASH_MODULUS, 1)), 0)
+        self.assertEqual(hash(F(-_PyHASH_MODULUS, 1)), 0)
+
+    def test_hash_normalization(self):
+        # Test for a bug encountered while changing long_hash.
+        #
+        # Given objects x and y, it should be possible for y's
+        # __hash__ method to return hash(x) in order to ensure that
+        # hash(x) == hash(y).  But hash(x) is not exactly equal to the
+        # result of x.__hash__(): there's some internal normalization
+        # to make sure that the result fits in a C long, and is not
+        # equal to the invalid hash value -1.  This internal
+        # normalization must therefore not change the result of
+        # hash(x) for any x.
+
+        class HalibutProxy:
+            def __hash__(self):
+                return hash('halibut')
+            def __eq__(self, other):
+                return other == 'halibut'
+
+        x = {'halibut', HalibutProxy()}
+        self.assertEqual(len(x), 1)
+
+
+def test_main():
+    run_unittest(HashTest)
+
+if __name__ == '__main__':
+    test_main()
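The special cases exercised by ``test_fractions`` follow from the way ``Fraction.__hash__`` computes ``dinv``: ``pow(d, P - 2, P)`` is the inverse of ``d`` modulo ``P`` when one exists, and ``0`` when ``P`` divides ``d``. The sketch below reproduces that logic outside the class so the four test cases can be traced by hand. ``fraction_hash_sketch`` is a hypothetical helper that expects a fraction in lowest terms with a positive denominator, and it applies the ``-1`` to ``-2`` substitution itself.

```python
import sys
from fractions import Fraction

P = sys.hash_info.modulus
INF = sys.hash_info.inf


def fraction_hash_sketch(num, den):
    """Hash of num/den, given in lowest terms with den > 0."""
    # By Fermat's little theorem, pow(den, P - 2, P) is the inverse of
    # den modulo P; if P divides den the power is 0 ("no inverse").
    dinv = pow(den, P - 2, P)
    h = INF if not dinv else abs(num) * dinv % P
    if num < 0:
        h = -h
    return -2 if h == -1 else h


cases = [(1, P), (-1, 3 * P), (7 * P, 1), (-P, 1), (3, 4), (-22, 7)]
results = {case: fraction_hash_sketch(*case) for case in cases}
agree = all(results[case] == hash(Fraction(*case)) for case in cases)
print(agree)
```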
diff --git a/Lib/test/test_sys.py b/Lib/test/test_sys.py

index 2caf09fc784e792b7ea792807d8f726dd188aea0..c056f9a77cce6a6bd27cc19a6a8397d725744a4f 100644
--- a/Lib/test/test_sys.py
+++ b/Lib/test/test_sys.py
@@ -426,6 +426,23 @@ class SysModuleTest(unittest.TestCase):
          self.assertEqual(type(sys.int_info.bits_per_digit), int)
          self.assertEqual(type(sys.int_info.sizeof_digit), int)
          self.assertIsInstance(sys.hexversion, int)
+
+        self.assertEqual(len(sys.hash_info), 5)
+        self.assertLess(sys.hash_info.modulus, 2**sys.hash_info.width)
+        # sys.hash_info.modulus should be a prime; we do a quick
+        # probable primality test (doesn't exclude the possibility of
+        # a Carmichael number)
+        for x in range(1, 100):
+            self.assertEqual(
+                pow(x, sys.hash_info.modulus-1, sys.hash_info.modulus),
+                1,
+                "sys.hash_info.modulus {} is a non-prime".format(
+                    sys.hash_info.modulus)
+                )
+        self.assertIsInstance(sys.hash_info.inf, int)
+        self.assertIsInstance(sys.hash_info.nan, int)
+        self.assertIsInstance(sys.hash_info.imag, int)
+
          self.assertIsInstance(sys.maxsize, int)
          self.assertIsInstance(sys.maxunicode, int)
          self.assertIsInstance(sys.platform, str)
diff --git a/Misc/NEWS b/Misc/NEWS

index 498c4abe8b0aa2bb34c2e9a0bc34ca2753e1fd13..36f374bdc955932b18f83f47fa6e0bc68be45b63 100644
--- a/Misc/NEWS
+++ b/Misc/NEWS
@@ -12,6 +12,11 @@ What's New in Python 3.2 Alpha 1?
  Core and Builtins
  -----------------
  
+- Issue #8188: Introduce a new scheme for computing hashes of numbers
+  (instances of int, float, complex, decimal.Decimal and
+  fractions.Fraction) that makes it easy to maintain the invariant
+  that hash(x) == hash(y) whenever x and y have equal value.
+
  - Issue #8748: Fix two issues with comparisons between complex and integer
    objects.  (1) The comparison could incorrectly return True in some cases
    (2**53+1 == complex(2**53) == 2**53), breaking transitivity of equality.
diff --git a/Objects/complexobject.c b/Objects/complexobject.c

index 9e1e2178561711000c927482acf5cfe61e63122f..7594c886ece4b3cbcd4f8d0ad193518895162905 100644
--- a/Objects/complexobject.c
+++ b/Objects/complexobject.c
@@ -403,12 +403,12 @@ complex_str(PyComplexObject *v)
  static long
  complex_hash(PyComplexObject *v)
  {
-    long hashreal, hashimag, combined;
-    hashreal = _Py_HashDouble(v->cval.real);
-    if (hashreal == -1)
+    unsigned long hashreal, hashimag, combined;
+    hashreal = (unsigned long)_Py_HashDouble(v->cval.real);
+    if (hashreal == (unsigned long)-1)
          return -1;
-    hashimag = _Py_HashDouble(v->cval.imag);
-    if (hashimag == -1)
+    hashimag = (unsigned long)_Py_HashDouble(v->cval.imag);
+    if (hashimag == (unsigned long)-1)
          return -1;
      /* Note:  if the imaginary part is 0, hashimag is 0 now,
       * so the following returns hashreal unchanged.  This is
@@ -416,10 +416,10 @@ complex_hash(PyComplexObject *v)
       * compare equal must have the same hash value, so that
       * hash(x + 0*j) must equal hash(x).
       */
-    combined = hashreal + 1000003 * hashimag;
-    if (combined == -1)
-        combined = -2;
-    return combined;
+    combined = hashreal + _PyHASH_IMAG * hashimag;
+    if (combined == (unsigned long)-1)
+        combined = (unsigned long)-2;
+    return (long)combined;
  }
  
  /* This macro may return! */
diff --git a/Objects/longobject.c b/Objects/longobject.c

index 850396b8efe1a3193d36b1e9739a3fc693870725..564d1a0459ac4e11e57ad222ee9bebc16052a08e 100644
--- a/Objects/longobject.c
+++ b/Objects/longobject.c
@@ -2571,18 +2571,37 @@ long_hash(PyLongObject *v)
          sign = -1;
          i = -(i);
      }
-    /* The following loop produces a C unsigned long x such that x is
-       congruent to the absolute value of v modulo ULONG_MAX.  The
-       resulting x is nonzero if and only if v is. */
      while (--i >= 0) {
-        /* Force a native long #-bits (32 or 64) circular shift */
-        x = (x >> (8*SIZEOF_LONG-PyLong_SHIFT)) | (x << PyLong_SHIFT);
+        /* Here x is a quantity in the range [0, _PyHASH_MODULUS); we
+           want to compute x * 2**PyLong_SHIFT + v->ob_digit[i] modulo
+           _PyHASH_MODULUS.
+
+           The computation of x * 2**PyLong_SHIFT % _PyHASH_MODULUS
+           amounts to a rotation of the bits of x.  To see this, write
+
+             x * 2**PyLong_SHIFT = y * 2**_PyHASH_BITS + z
+
+           where y = x >> (_PyHASH_BITS - PyLong_SHIFT) gives the top
+           PyLong_SHIFT bits of x (those that are shifted out of the
+           original _PyHASH_BITS bits), and z = (x << PyLong_SHIFT) &
+           _PyHASH_MODULUS gives the bottom _PyHASH_BITS - PyLong_SHIFT
+           bits of x, shifted up.  Then since 2**_PyHASH_BITS is
+           congruent to 1 modulo _PyHASH_MODULUS, y*2**_PyHASH_BITS is
+           congruent to y modulo _PyHASH_MODULUS.  So
+
+             x * 2**PyLong_SHIFT = y + z (mod _PyHASH_MODULUS).
+
+           The right-hand side is just the result of rotating the
+           _PyHASH_BITS bits of x left by PyLong_SHIFT places; since
+           not all _PyHASH_BITS bits of x are 1s, the same is true
+           after rotation, so 0 <= y+z < _PyHASH_MODULUS and y + z is
+           the reduction of x*2**PyLong_SHIFT modulo
+           _PyHASH_MODULUS. */
+        x = ((x << PyLong_SHIFT) & _PyHASH_MODULUS) |
+            (x >> (_PyHASH_BITS - PyLong_SHIFT));
          x += v->ob_digit[i];
-        /* If the addition above overflowed we compensate by
-           incrementing.  This preserves the value modulo
-           ULONG_MAX. */
-        if (x < v->ob_digit[i])
-            x++;
+        if (x >= _PyHASH_MODULUS)
+            x -= _PyHASH_MODULUS;
      }
      x = x * sign;
      if (x == (unsigned long)-1)
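The rotation argument in the ``long_hash`` comment can be checked numerically. The sketch below is a Python model of the loop, not the C implementation: ``rotate_left`` and ``int_hash_sketch`` are hypothetical names, ``SHIFT`` stands in for ``PyLong_SHIFT`` (taken here from ``sys.int_info.bits_per_digit``), and ``BITS`` is recovered from the modulus on the assumption that it has the form ``2**BITS - 1``.

```python
import random
import sys

P = sys.hash_info.modulus            # assumed to equal 2**BITS - 1
BITS = P.bit_length()                # plays the role of _PyHASH_BITS
SHIFT = sys.int_info.bits_per_digit  # plays the role of PyLong_SHIFT


def rotate_left(x, shift=SHIFT, bits=BITS):
    """Rotate the low `bits` bits of x left by `shift` places."""
    mask = (1 << bits) - 1
    return ((x << shift) & mask) | (x >> (bits - shift))


def int_hash_sketch(n):
    """Model of the long_hash loop: digits from most to least significant."""
    sign = -1 if n < 0 else 1
    n = abs(n)
    digits = []
    while n:
        digits.append(n & ((1 << SHIFT) - 1))
        n >>= SHIFT
    x = 0
    for digit in reversed(digits):
        x = rotate_left(x)   # x * 2**SHIFT modulo P
        x += digit
        if x >= P:           # one subtraction suffices: x < P + 2**SHIFT
            x -= P
    x *= sign
    return -2 if x == -1 else x


rng = random.Random(8188)
xs = [0, 1, P - 1] + [rng.randrange(P) for _ in range(1000)]
rotation_ok = all(rotate_left(x) == x * 2**SHIFT % P for x in xs)

ns = [0, -1, P, -P, 2**100, 1 - 2**100]
ns += [rng.randrange(-10**40, 10**40) for _ in range(1000)]
hash_ok = all(int_hash_sketch(n) == hash(n) for n in ns)
print(rotation_ok, hash_ok)
```

The rotation identity only holds for ``0 <= x < P``, which is the loop invariant the comment relies on; the all-ones pattern ``x == P`` would rotate to itself rather than to ``0``.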
diff --git a/Objects/object.c b/Objects/object.c

index 0802348455b73c1fb5b06d73b0cc8193f5938103..76d018f7f1392fc7e9908349018e687ba41ea752 100644
--- a/Objects/object.c
+++ b/Objects/object.c
@@ -647,63 +647,101 @@ PyObject_RichCompareBool(PyObject *v, PyObject *w, int op)
     All the utility functions (_Py_Hash*()) return "-1" to signify an error.
  */
  
+/* For numeric types, the hash of a number x is based on the reduction
+   of x modulo the prime P = 2**_PyHASH_BITS - 1.  It's designed so that
+   hash(x) == hash(y) whenever x and y are numerically equal, even if
+   x and y have different types.
+
+   A quick summary of the hashing strategy:
+
+   (1) First define the 'reduction of x modulo P' for any rational
+   number x; this is a standard extension of the usual notion of
+   reduction modulo P for integers.  If x == p/q (written in lowest
+   terms), the reduction is interpreted as the reduction of p times
+   the inverse of the reduction of q, all modulo P; if q is exactly
+   divisible by P then define the reduction to be infinity.  So we've
+   got a well-defined map
+
+      reduce : { rational numbers } -> { 0, 1, 2, ..., P-1, infinity }.
+
+   (2) Now for a rational number x, define hash(x) by:
+
+      reduce(x)   if x >= 0
+      -reduce(-x) if x < 0
+
+   If the result of the reduction is infinity (this is impossible for
+   integers, floats and Decimals) then use the predefined hash value
+   _PyHASH_INF for x >= 0, or -_PyHASH_INF for x < 0, instead.
+   _PyHASH_INF, -_PyHASH_INF and _PyHASH_NAN are also used for the
+   hashes of float and Decimal infinities and nans.
+
+   A selling point for the above strategy is that it makes it possible
+   to compute hashes of decimal and binary floating-point numbers
+   efficiently, even if the exponent of the binary or decimal number
+   is large.  The key point is that
+
+      reduce(x * y) == reduce(x) * reduce(y) (modulo _PyHASH_MODULUS)
+
+   provided that {reduce(x), reduce(y)} != {0, infinity}.  The reduction of a
+   binary or decimal float is never infinity, since the denominator is a power
+   of 2 (for binary) or a divisor of a power of 10 (for decimal).  So we have,
+   for nonnegative x,
+
+      reduce(x * 2**e) == reduce(x) * reduce(2**e) % _PyHASH_MODULUS
+
+      reduce(x * 10**e) == reduce(x) * reduce(10**e) % _PyHASH_MODULUS
+
+   and reduce(10**e) can be computed efficiently by the usual modular
+   exponentiation algorithm.  For reduce(2**e) it's even better: since
+   P is of the form 2**n-1, reduce(2**e) is 2**(e mod n), and multiplication
+   by 2**(e mod n) modulo 2**n-1 just amounts to a rotation of bits.
+
+   */
+
  long
  _Py_HashDouble(double v)
  {
-    double intpart, fractpart;
-    int expo;
-    long hipart;
-    long x;             /* the final hash value */
-    /* This is designed so that Python numbers of different types
-     * that compare equal hash to the same value; otherwise comparisons
-     * of mapping keys will turn out weird.
-     */
+    int e, sign;
+    double m;
+    unsigned long x, y;
  
      if (!Py_IS_FINITE(v)) {
          if (Py_IS_INFINITY(v))
-            return v < 0 ? -271828 : 314159;
+            return v > 0 ? _PyHASH_INF : -_PyHASH_INF;
          else
-            return 0;
+            return _PyHASH_NAN;
      }
-    fractpart = modf(v, &intpart);
-    if (fractpart == 0.0) {
-        /* This must return the same hash as an equal int or long. */
-        if (intpart > LONG_MAX/2 || -intpart > LONG_MAX/2) {
-            /* Convert to long and use its hash. */
-            PyObject *plong;                    /* converted to Python long */
-            plong = PyLong_FromDouble(v);
-            if (plong == NULL)
-                return -1;
-            x = PyObject_Hash(plong);
-            Py_DECREF(plong);
-            return x;
-        }
-        /* Fits in a C long == a Python int, so is its own hash. */
-        x = (long)intpart;
-        if (x == -1)
-            x = -2;
-        return x;
-    }
-    /* The fractional part is non-zero, so we don't have to worry about
-     * making this match the hash of some other type.
-     * Use frexp to get at the bits in the double.
-     * Since the VAX D double format has 56 mantissa bits, which is the
-     * most of any double format in use, each of these parts may have as
-     * many as (but no more than) 56 significant bits.
-     * So, assuming sizeof(long) >= 4, each part can be broken into two
-     * longs; frexp and multiplication are used to do that.
-     * Also, since the Cray double format has 15 exponent bits, which is
-     * the most of any double format in use, shifting the exponent field
-     * left by 15 won't overflow a long (again assuming sizeof(long) >= 4).
-     */
-    v = frexp(v, &expo);
-    v *= 2147483648.0;          /* 2**31 */
-    hipart = (long)v;           /* take the top 32 bits */
-    v = (v - (double)hipart) * 2147483648.0; /* get the next 32 bits */
-    x = hipart + (long)v + (expo << 15);
-    if (x == -1)
-        x = -2;
-    return x;
+
+    m = frexp(v, &e);
+
+    sign = 1;
+    if (m < 0) {
+        sign = -1;
+        m = -m;
+    }
+
+    /* process 28 bits at a time;  this should work well both for binary
+       and hexadecimal floating point. */
+    x = 0;
+    while (m) {
+        x = ((x << 28) & _PyHASH_MODULUS) | x >> (_PyHASH_BITS - 28);
+        m *= 268435456.0;  /* 2**28 */
+        e -= 28;
+        y = (unsigned long)m;  /* pull out integer part */
+        m -= y;
+        x += y;
+        if (x >= _PyHASH_MODULUS)
+            x -= _PyHASH_MODULUS;
+    }
+
+    /* adjust for the exponent;  first reduce it modulo _PyHASH_BITS */
+    e = e >= 0 ? e % _PyHASH_BITS : _PyHASH_BITS-1-((-1-e) % _PyHASH_BITS);
+    x = ((x << e) & _PyHASH_MODULUS) | x >> (_PyHASH_BITS - e);
+
+    x = x * sign;
+    if (x == (unsigned long)-1)
+        x = (unsigned long)-2;
+    return (long)x;
  }
  
  long
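
Note on the hunk above: the new float-hashing loop can be hard to follow in C because of the unsigned arithmetic. The following is a minimal pure-Python sketch of the finite-float branch of _Py_HashDouble, written for illustration only. The names `rotl` and `hash_double` are hypothetical helpers, not part of the patch, and the sketch assumes a CPython build whose `sys.hash_info.modulus` has the form 2**n - 1, as described in the comment above.

```python
import math
import sys

P = sys.hash_info.modulus          # assumed to be of the form 2**n - 1
BITS = P.bit_length()              # the n in P == 2**n - 1


def rotl(x, k):
    """Rotate the BITS-bit value x left by k bits (x * 2**k modulo P)."""
    k %= BITS                      # Python's % is nonnegative, so negative k is fine
    return ((x << k) & P) | (x >> (BITS - k))


def hash_double(v):
    """Illustrative port of the finite-float branch of _Py_HashDouble."""
    if not math.isfinite(v):
        raise ValueError("this sketch handles finite floats only")
    m, e = math.frexp(v)           # v == m * 2**e
    sign = 1
    if m < 0:
        sign, m = -1, -m
    x = 0
    while m:
        x = rotl(x, 28)            # multiply the running value by 2**28 mod P
        m *= 268435456.0           # 2**28
        e -= 28
        y = int(m)                 # pull out the integer part
        m -= y
        x += y
        if x >= P:
            x -= P
    x = rotl(x, e)                 # adjust for the exponent
    x *= sign
    return -2 if x == -1 else x    # -1 is reserved as an error marker
```

On an interpreter that uses this scheme, `hash_double(v)` would be expected to agree with the built-in `hash(v)` for finite floats, and `rotl(x, k)` with `x * pow(2, k, P) % P` for `0 <= x < P`, which is the "rotation of bits" observation made in the comment.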
diff --git a/Objects/typeobject.c b/Objects/typeobject.c

index 369bac6bb9770e187f7424a487a88b19951602f5..adfb0ec06740511c22c09ed614a045e3f6513f54 100644 (file)
--- a/Objects/typeobject.c
+++ b/Objects/typeobject.c
@@ -4921,6 +4921,7 @@ slot_tp_hash(PyObject *self)
      PyObject *func, *res;
      static PyObject *hash_str;
      long h;
+    int overflow;
  
      func = lookup_method(self, "__hash__", &hash_str);
  
@@ -4937,14 +4938,27 @@ slot_tp_hash(PyObject *self)
      Py_DECREF(func);
      if (res == NULL)
          return -1;
-    if (PyLong_Check(res))
+
+    if (!PyLong_Check(res)) {
+        PyErr_SetString(PyExc_TypeError,
+                        "__hash__ method should return an integer");
+        return -1;
+    }
+    /* Transform the PyLong `res` to a C long `h`.  For an existing
+       hashable Python object x, hash(x) will always lie within the range
+       of a C long.  Therefore our transformation must preserve values
+       that already lie within this range, to ensure that if x.__hash__()
+       returns hash(y) then hash(x) == hash(y). */
+    h = PyLong_AsLongAndOverflow(res, &overflow);
+    if (overflow)
+        /* res was not within the range of a C long, so we're free to
+           use any sufficiently bit-mixing transformation;
+           long.__hash__ will do nicely. */
          h = PyLong_Type.tp_hash(res);
-    else
-        h = PyLong_AsLong(res);
      Py_DECREF(res);
-           if (h == -1 && !PyErr_Occurred())
-           h = -2;
-           return h;
+    if (h == -1 && !PyErr_Occurred())
+        h = -2;
+    return h;
  }
  
  static PyObject *
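
Note on the hunk above: the behaviour of the rewritten slot_tp_hash can be observed from Python. The sketch below is illustrative only; the four classes are hypothetical and exist just to exercise each branch (value in range, value out of range, the reserved value -1, and a non-integer result).

```python
class InRange:
    def __hash__(self):
        return 12345          # already fits the native hash type: preserved


class OutOfRange:
    def __hash__(self):
        return 2**100         # too large: reduced via the integer hash


class MinusOne:
    def __hash__(self):
        return -1             # -1 is reserved for errors: mapped to -2


class NotAnInt:
    def __hash__(self):
        return 1.5            # not an integer: hash() raises TypeError


h_in = hash(InRange())
h_out = hash(OutOfRange())
h_minus_one = hash(MinusOne())

try:
    hash(NotAnInt())
    raised_type_error = False
except TypeError:
    raised_type_error = True
```

The out-of-range case is the one the new comment is about: any bit-mixing transformation would do there, and reusing the integer hash means `hash(OutOfRange())` should equal `hash(2**100)`.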
diff --git a/Python/sysmodule.c b/Python/sysmodule.c

index 77b120fbe2e6366ecbf46fd9dbb3710cba7db177..4c87d54c7c41265c6ec6063ac480b3578cb1c46a 100644 (file)
--- a/Python/sysmodule.c
+++ b/Python/sysmodule.c
@@ -570,6 +570,57 @@ sys_setrecursionlimit(PyObject *self, PyObject *args)
      return Py_None;
  }
  
+static PyTypeObject Hash_InfoType;
+
+PyDoc_STRVAR(hash_info_doc,
+"hash_info\n\
+\n\
+A struct sequence providing parameters used for computing\n\
+numeric hashes.  The attributes are read only.");
+
+static PyStructSequence_Field hash_info_fields[] = {
+    {"width", "width of the type used for hashing, in bits"},
+    {"modulus", "prime number giving the modulus on which the hash "
+                "function is based"},
+    {"inf", "value to be used for hash of a positive infinity"},
+    {"nan", "value to be used for hash of a nan"},
+    {"imag", "multiplier used for the imaginary part of a complex number"},
+    {NULL, NULL}
+};
+
+static PyStructSequence_Desc hash_info_desc = {
+    "sys.hash_info",
+    hash_info_doc,
+    hash_info_fields,
+    5,
+};
+
+PyObject *
+get_hash_info(void)
+{
+    PyObject *hash_info;
+    int field = 0;
+    hash_info = PyStructSequence_New(&Hash_InfoType);
+    if (hash_info == NULL)
+        return NULL;
+    PyStructSequence_SET_ITEM(hash_info, field++,
+                              PyLong_FromLong(8*sizeof(long)));
+    PyStructSequence_SET_ITEM(hash_info, field++,
+                              PyLong_FromLong(_PyHASH_MODULUS));
+    PyStructSequence_SET_ITEM(hash_info, field++,
+                              PyLong_FromLong(_PyHASH_INF));
+    PyStructSequence_SET_ITEM(hash_info, field++,
+                              PyLong_FromLong(_PyHASH_NAN));
+    PyStructSequence_SET_ITEM(hash_info, field++,
+                              PyLong_FromLong(_PyHASH_IMAG));
+    if (PyErr_Occurred()) {
+        Py_CLEAR(hash_info);
+        return NULL;
+    }
+    return hash_info;
+}
+
+
  PyDoc_STRVAR(setrecursionlimit_doc,
  "setrecursionlimit(n)\n\
  \n\
@@ -1482,6 +1533,11 @@ _PySys_Init(void)
                          PyFloat_GetInfo());
      SET_SYS_FROM_STRING("int_info",
                          PyLong_GetInfo());
+    /* initialize hash_info */
+    if (Hash_InfoType.tp_name == 0)
+        PyStructSequence_InitType(&Hash_InfoType, &hash_info_desc);
+    SET_SYS_FROM_STRING("hash_info",
+                        get_hash_info());
      SET_SYS_FROM_STRING("maxunicode",
                          PyLong_FromLong(PyUnicode_GetMax()));
      SET_SYS_FROM_STRING("builtin_module_names",
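
Note on the sysmodule.c changes above: once sys.hash_info is exposed, the documented rule for nonnegative rationals can be checked from Python. This is a minimal sketch under stated assumptions; `rational_hash` is a hypothetical helper, and the modular inverse is computed with Fermat's little theorem (valid because the modulus is prime) rather than with whatever method the interpreter uses internally.

```python
import sys
from fractions import Fraction

info = sys.hash_info
P = info.modulus


def rational_hash(m, n):
    """Hash of the nonnegative rational m/n, assuming n is not divisible by P."""
    inverse = pow(n, P - 2, P)     # n**(P-2) is the inverse of n modulo the prime P
    return m * inverse % P


three_sevenths = rational_hash(3, 7)
one_half = rational_hash(1, 2)
positive_infinity = hash(float("inf"))
negative_infinity = hash(float("-inf"))
```

With the scheme in this commit, `three_sevenths` should match `hash(Fraction(3, 7))`, `one_half` should match both `hash(0.5)` and `hash(Fraction(1, 2))`, and the two infinities should hash to `info.inf` and `-info.inf`.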
Files changed:
Doc/library/stdtypes.rst
Doc/library/sys.rst
Include/pyport.h
Lib/decimal.py
Lib/fractions.py
Lib/test/test_float.py
Lib/test/test_numeric_tower.py [new file with mode: 0644]
Lib/test/test_sys.py
Misc/NEWS
Objects/complexobject.c
Objects/longobject.c
Objects/object.c
Objects/typeobject.c
Python/sysmodule.c