The PDB Serialized Hash Table Format\r
====================================\r
+\r
+.. contents::\r
+ :local:\r
+\r
+.. _hash_intro:\r
+\r
+Introduction\r
+============\r
+\r
+One of the design goals of the PDB format is to provide accelerated access to
+debug information. For this reason, hash tables are serialized and embedded
+directly in the file in several places, rather than requiring a consumer to
+read a list of values and reconstruct the hash table on the fly.
+\r
+The serialization format supports hash tables of arbitrarily large size and
+capacity, as well as arbitrary value types and hash functions. The only
+supported key type is uint32. The only requirement is that the producer and
+consumer agree on the hash function. As such, the hash function is not
+discussed further in this document; it is assumed that for a particular
+instance of a PDB file hash table, the appropriate hash function is being used.
+\r
+On-Disk Format\r
+==============\r
+\r
+.. code-block:: none\r
+\r
+  .--------------------.-- +0
+  |        Size        |
+  .--------------------.-- +4
+  |      Capacity      |
+  .--------------------.-- +8
+  | Present Bit Vector |
+  .--------------------.-- +N
+  | Deleted Bit Vector |
+  .--------------------.-- +M                  ─╮
+  |        Key         |                        │
+  .--------------------.-- +M+4                 │
+  |       Value        |                        │
+  .--------------------.-- +M+4+sizeof(Value)   │
+           ...                                  ├─ |Capacity| Bucket entries
+  .--------------------.                        │
+  |        Key         |                        │
+  .--------------------.                        │
+  |       Value        |                        │
+  .--------------------.                       ─╯
+\r
+- **Size** - The number of values contained in the hash table.
+
+- **Capacity** - The number of buckets in the hash table. Producers should
+  keep ``Size`` no greater than ``2/3*Capacity+1``, which bounds the load
+  factor at roughly two thirds.
+
+- **Present Bit Vector** - A serialized bit vector which records which buckets
+  hold valid values. If a bucket has a value, the corresponding bit is set. If
+  the bucket doesn't have a value (either because the bucket is empty or
+  because the value is a tombstone), the bit is unset.
+
+- **Deleted Bit Vector** - A serialized bit vector which records which buckets
+  hold tombstone values. If the entry in a bucket has been deleted, the
+  corresponding bit is set; otherwise it is unset.
+
+- **Keys and Values** - A list of ``Capacity`` hash buckets. In each bucket,
+  the first field is the key (always a uint32) and the second field is the
+  value. The state of each bucket (valid, empty, deleted) can be determined by
+  examining the present and deleted bit vectors.
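+
+As a rough illustration of the field descriptions above, the following Python
+sketch computes the recommended size bound and classifies a bucket from its two
+status bits. The names ``max_load``, ``bucket_state`` and ``BucketState`` are
+hypothetical. The use of integer division is an assumption, as is treating a
+bucket with both bits set as valid; the text above does not cover either case.
+

```python
from enum import Enum


class BucketState(Enum):
    EMPTY = "empty"
    VALID = "valid"
    DELETED = "deleted"


def max_load(capacity: int) -> int:
    """Largest Size recommended for a table with `capacity` buckets."""
    # Assumption: 2/3*Capacity+1 is evaluated with integer division.
    return capacity * 2 // 3 + 1


def bucket_state(present: bool, deleted: bool) -> BucketState:
    """Classify one bucket from its bits in the two bit vectors."""
    # Assumption: the present bit wins if both bits happen to be set.
    if present:
        return BucketState.VALID
    if deleted:
        return BucketState.DELETED
    return BucketState.EMPTY
```

+
+Under these assumptions, a producer holding ``max_load(Capacity)`` values
+would grow the table before inserting another one.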
+\r
+\r
+.. _hash_bit_vectors:\r
+\r
+Present and Deleted Bit Vectors\r
+===============================\r
+\r
+The bit vectors indicating the status of each bucket are serialized as follows:\r
+\r
+.. code-block:: none\r
+\r
+  .--------------------.-- +0
+  |     Word Count     |
+  .--------------------.-- +4
+  |       Word_0       |        ─╮
+  .--------------------.-- +8    │
+  |       Word_1       |         │
+  .--------------------.-- +12   ├─ |Word Count| values
+           ...                   │
+  .--------------------.         │
+  |       Word_N       |         │
+  .--------------------.        ─╯
+\r
+The words, when viewed as a contiguous block of bytes, represent a bit vector with\r
+the following layout:\r
+\r
+.. code-block:: none\r
+\r
+      .------------.         .------------.------------.
+      |   Word_N   |   ...   |   Word_1   |   Word_0   |
+      .------------.         .------------.------------.
+      |            |         |            |            |
+  +(N+1)*32      +N*32      +64          +32          +0
+\r
+where the k'th bit of this bit vector represents the status of the k'th bucket\r
+in the hash table.\r
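+
+The lookup described above can be sketched in Python as follows. This is a
+minimal illustration, not a reference implementation. The helper names
+``read_bit_vector`` and ``is_bit_set`` are hypothetical, little-endian byte
+order is an assumption, and the 32-bit word size is inferred from the 4-byte
+offsets in the diagram. Treating bits beyond the last word as unset is also an
+assumption.
+

```python
import struct


def read_bit_vector(data: bytes, offset: int = 0):
    """Return (words, next_offset) for a bit vector starting at `offset`."""
    # Assumption: Word Count and every word are little-endian uint32 values.
    (word_count,) = struct.unpack_from("<I", data, offset)
    words = list(struct.unpack_from(f"<{word_count}I", data, offset + 4))
    return words, offset + 4 + 4 * word_count


def is_bit_set(words, k: int) -> bool:
    """Report whether the k'th bit (the k'th bucket's status) is set."""
    word_index, bit_index = divmod(k, 32)
    if word_index >= len(words):
        # Assumption: bits past the serialized words are treated as unset.
        return False
    return (words[word_index] >> bit_index) & 1 == 1


# Example: Word Count = 2, Word_0 = 0b101, Word_1 = 0b1.
example = struct.pack("<3I", 2, 0b101, 0b1)
words, next_offset = read_bit_vector(example)
set_bits = [k for k in range(64) if is_bit_set(words, k)]
```

+
+In this example ``set_bits`` holds the indices of buckets 0, 2 and 32, and
+``next_offset`` is the position at which the next field would start.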