From: Tom Lane Date: Wed, 15 Jun 2005 16:24:07 +0000 (+0000) Subject: Improve hash method for bitmapsets: some examination of actual outputs X-Git-Tag: REL8_1_0BETA1~543 X-Git-Url: https://granicus.if.org/sourcecode?a=commitdiff_plain;h=e5a11a8879c653d171497f8adec68f8997a4c4cf;p=postgresql Improve hash method for bitmapsets: some examination of actual outputs shows that adding a circular shift between words greatly improves the distribution of hash outputs. --- diff --git a/src/backend/nodes/bitmapset.c b/src/backend/nodes/bitmapset.c index 5f4ca9a779..d74ba6189e 100644 --- a/src/backend/nodes/bitmapset.c +++ b/src/backend/nodes/bitmapset.c @@ -14,7 +14,7 @@ * Copyright (c) 2003-2005, PostgreSQL Global Development Group * * IDENTIFICATION - * $PostgreSQL: pgsql/src/backend/nodes/bitmapset.c,v 1.8 2005/06/08 23:02:04 tgl Exp $ + * $PostgreSQL: pgsql/src/backend/nodes/bitmapset.c,v 1.9 2005/06/15 16:24:07 tgl Exp $ * *------------------------------------------------------------------------- */ @@ -769,22 +769,36 @@ bms_first_member(Bitmapset *a) * * Note: we must ensure that any two bitmapsets that are bms_equal() will * hash to the same value; in practice this means that trailing all-zero - * words cannot affect the result. Longitudinal XOR provides a reasonable - * hash value that has this property. + * words cannot affect the result. The circular-shift-and-XOR hash method + * used here has this property, so long as we work from back to front. + * + * Note: you might wonder why we bother with the circular shift; at first + * glance a straight longitudinal XOR seems as good and much simpler. The + * reason is empirical: this gives a better distribution of hash values on + * the bitmapsets actually generated by the planner. A common way to have + * multiword bitmapsets is "a JOIN b JOIN c JOIN d ...", which gives rise + * to rangetables in which base tables and JOIN nodes alternate; so + * bitmapsets of base table RT indexes tend to use only odd-numbered or only + * even-numbered bits. A straight longitudinal XOR would preserve this + * property, leading to a much smaller set of possible outputs than if + * we include a shift. */ uint32 bms_hash_value(const Bitmapset *a) { bitmapword result = 0; - int nwords; int wordnum; - if (a == NULL) + if (a == NULL || a->nwords <= 0) return 0; /* All empty sets hash to 0 */ - nwords = a->nwords; - for (wordnum = 0; wordnum < nwords; wordnum++) + for (wordnum = a->nwords; --wordnum > 0; ) { result ^= a->words[wordnum]; + if (result & ((bitmapword) 1 << (BITS_PER_BITMAPWORD - 1))) + result = (result << 1) | 1; + else + result = (result << 1); } + result ^= a->words[0]; return (uint32) result; }