From: Nico Weber Date: Wed, 1 May 2019 19:15:05 +0000 (+0000) Subject: Convert PDB docs to unix line endings. No other changes. X-Git-Url: https://granicus.if.org/sourcecode?a=commitdiff_plain;h=52810e6419bb5c8ba922ff0ed3c20ee03601191b;p=llvm Convert PDB docs to unix line endings. No other changes. git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@359712 91177308-0d34-0410-b5e6-96231b3b80d8 --- diff --git a/docs/PDB/GlobalStream.rst b/docs/PDB/GlobalStream.rst index 314b9f01ffa..dcc99ae3a0e 100644 --- a/docs/PDB/GlobalStream.rst +++ b/docs/PDB/GlobalStream.rst @@ -1,3 +1,3 @@ -===================================== -The PDB Global Symbol Stream -===================================== +===================================== +The PDB Global Symbol Stream +===================================== diff --git a/docs/PDB/HashTable.rst b/docs/PDB/HashTable.rst index c296f3efadb..eb7abbd5107 100644 --- a/docs/PDB/HashTable.rst +++ b/docs/PDB/HashTable.rst @@ -1,103 +1,103 @@ -The PDB Serialized Hash Table Format -==================================== - -.. contents:: - :local: - -.. _hash_intro: - -Introduction -============ - -One of the design goals of the PDB format is to provide accelerated access to -debug information, and for this reason there are several occasions where hash -tables are serialized and embedded directly to the file, rather than requiring -a consumer to read a list of values and reconstruct the hash table on the fly. - -The serialization format supports hash tables of arbitrarily large size and -capacity, as well as value types and hash functions. The only supported key -value type is a uint32. The only requirement is that the producer and consumer -agree on the hash function. As such, the hash function can is not discussed -further in this document, it is assumed that for a particular instance of a PDB -file hash table, the appropriate hash function is being used. - -On-Disk Format -============== - -.. 
code-block:: none - - .--------------------.-- +0 - | Size | - .--------------------.-- +4 - | Capacity | - .--------------------.-- +8 - | Present Bit Vector | - .--------------------.-- +N - | Deleted Bit Vector | - .--------------------.-- +M ─╮ - | Key | │ - .--------------------.-- +M+4 │ - | Value | │ - .--------------------.-- +M+4+sizeof(Value) │ - ... ├─ |Capacity| Bucket entries - .--------------------. │ - | Key | │ - .--------------------. │ - | Value | │ - .--------------------. ─╯ - -- **Size** - The number of values contained in the hash table. - -- **Capacity** - The number of buckets in the hash table. Producers should - maintain a load factor of no greater than ``2/3*Capacity+1``. - -- **Present Bit Vector** - A serialized bit vector which contains information - about which buckets have valid values. If the bucket has a value, the - corresponding bit will be set, and if the bucket doesn't have a value (either - because the bucket is empty or because the value is a tombstone value) the bit - will be unset. - -- **Deleted Bit Vector** - A serialized bit vector which contains information - about which buckets have tombstone values. If the entry in this bucket is - deleted, the bit will be set, otherwise it will be unset. - -- **Keys and Values** - A list of ``Capacity`` hash buckets, where the first - entry is the key (always a uint32), and the second entry is the value. The - state of each bucket (valid, empty, deleted) can be determined by examining - the present and deleted bit vectors. - - -.. _hash_bit_vectors: - -Present and Deleted Bit Vectors -=============================== - -The bit vectors indicating the status of each bucket are serialized as follows: - -.. code-block:: none - - .--------------------.-- +0 - | Word Count | - .--------------------.-- +4 - | Word_0 | ─╮ - .--------------------.-- +8 │ - | Word_1 | │ - .--------------------.-- +12 ├─ |Word Count| values - ... │ - .--------------------. 
│ - | Word_N | │ - .--------------------. ─╯ - -The words, when viewed as a contiguous block of bytes, represent a bit vector with -the following layout: - -.. code-block:: none - - .------------. .------------.------------. - | Word_N | ... | Word_1 | Word_0 | - .------------. .------------.------------. - | | | | | - +N*32 +(N-1)*32 +64 +32 +0 - -where the k'th bit of this bit vector represents the status of the k'th bucket -in the hash table. +The PDB Serialized Hash Table Format +==================================== + +.. contents:: + :local: + +.. _hash_intro: + +Introduction +============ + +One of the design goals of the PDB format is to provide accelerated access to +debug information, and for this reason there are several occasions where hash +tables are serialized and embedded directly to the file, rather than requiring +a consumer to read a list of values and reconstruct the hash table on the fly. + +The serialization format supports hash tables of arbitrarily large size and +capacity, as well as value types and hash functions. The only supported key +value type is a uint32. The only requirement is that the producer and consumer +agree on the hash function. As such, the hash function can is not discussed +further in this document, it is assumed that for a particular instance of a PDB +file hash table, the appropriate hash function is being used. + +On-Disk Format +============== + +.. code-block:: none + + .--------------------.-- +0 + | Size | + .--------------------.-- +4 + | Capacity | + .--------------------.-- +8 + | Present Bit Vector | + .--------------------.-- +N + | Deleted Bit Vector | + .--------------------.-- +M ─╮ + | Key | │ + .--------------------.-- +M+4 │ + | Value | │ + .--------------------.-- +M+4+sizeof(Value) │ + ... ├─ |Capacity| Bucket entries + .--------------------. │ + | Key | │ + .--------------------. │ + | Value | │ + .--------------------. ─╯ + +- **Size** - The number of values contained in the hash table. 
+ +- **Capacity** - The number of buckets in the hash table. Producers should + maintain a load factor of no greater than ``2/3*Capacity+1``. + +- **Present Bit Vector** - A serialized bit vector which contains information + about which buckets have valid values. If the bucket has a value, the + corresponding bit will be set, and if the bucket doesn't have a value (either + because the bucket is empty or because the value is a tombstone value) the bit + will be unset. + +- **Deleted Bit Vector** - A serialized bit vector which contains information + about which buckets have tombstone values. If the entry in this bucket is + deleted, the bit will be set, otherwise it will be unset. + +- **Keys and Values** - A list of ``Capacity`` hash buckets, where the first + entry is the key (always a uint32), and the second entry is the value. The + state of each bucket (valid, empty, deleted) can be determined by examining + the present and deleted bit vectors. + + +.. _hash_bit_vectors: + +Present and Deleted Bit Vectors +=============================== + +The bit vectors indicating the status of each bucket are serialized as follows: + +.. code-block:: none + + .--------------------.-- +0 + | Word Count | + .--------------------.-- +4 + | Word_0 | ─╮ + .--------------------.-- +8 │ + | Word_1 | │ + .--------------------.-- +12 ├─ |Word Count| values + ... │ + .--------------------. │ + | Word_N | │ + .--------------------. ─╯ + +The words, when viewed as a contiguous block of bytes, represent a bit vector with +the following layout: + +.. code-block:: none + + .------------. .------------.------------. + | Word_N | ... | Word_1 | Word_0 | + .------------. .------------.------------. + | | | | | + +N*32 +(N-1)*32 +64 +32 +0 + +where the k'th bit of this bit vector represents the status of the k'th bucket +in the hash table. 
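Note on the HashTable.rst text above: the bucket-status lookup it describes (bit k of each bit vector gives the status of bucket k, with Word_0 holding bits 0 to 31) can be sketched roughly as follows. This is only an illustration under stated assumptions, not LLVM's implementation: the names ``BucketState``, ``testBit`` and ``bucketState`` are hypothetical, and the words are assumed to have already been read from the file and decoded from little-endian into host ``uint32_t`` values.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical names for illustration only; these are not part of LLVM's API.
enum class BucketState { Empty, Valid, Deleted };

// Returns bit K of a bit vector stored as 32-bit words, where Word_0 holds
// bits 0..31, Word_1 holds bits 32..63, and so on. Assumption: bits beyond
// the last serialized word are treated as unset.
inline bool testBit(const std::vector<uint32_t> &Words, uint32_t K) {
  uint32_t WordIndex = K / 32;
  if (WordIndex >= Words.size())
    return false;
  return ((Words[WordIndex] >> (K % 32)) & 1u) != 0;
}

// Classifies bucket K using the present and deleted bit vectors: a set
// present bit means a valid entry, a set deleted bit means a tombstone,
// and neither bit set means the bucket is empty.
inline BucketState bucketState(const std::vector<uint32_t> &Present,
                               const std::vector<uint32_t> &Deleted,
                               uint32_t Capacity, uint32_t K) {
  assert(K < Capacity && "bucket index out of range");
  if (testBit(Present, K))
    return BucketState::Valid;
  if (testBit(Deleted, K))
    return BucketState::Deleted;
  return BucketState::Empty;
}
```

A reader would typically call ``bucketState`` for each index below ``Capacity`` and deserialize the key and value only for buckets reported as ``Valid``. Treating missing trailing words as unset is a design choice of this sketch; the text above does not say whether a producer may omit them.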
diff --git a/docs/PDB/ModiStream.rst b/docs/PDB/ModiStream.rst index 7e500bd921c..8104c5308b7 100644 --- a/docs/PDB/ModiStream.rst +++ b/docs/PDB/ModiStream.rst @@ -1,80 +1,80 @@ -===================================== -The Module Information Stream -===================================== - -.. contents:: - :local: - -.. _modi_stream_intro: - -Introduction -============ - -The Module Info Stream (henceforth referred to as the Modi stream) contains -information about a single module (object file, import library, etc that -contributes to the binary this PDB contains debug information about. There -is one modi stream for each module, and the mapping between modi stream index -and module is contained in the :doc:`DBI Stream `. The modi stream -for a single module contains line information for the compiland, as well as -all CodeView information for the symbols defined in the compiland. Finally, -there is a "global refs" substream which is not well understood. - -.. _modi_stream_layout: - -Stream Layout -============= - -A modi stream is laid out as follows: - - -.. code-block:: c++ - - struct ModiStream { - uint32_t Signature; - uint8_t Symbols[SymbolSize-4]; - uint8_t C11LineInfo[C11Size]; - uint8_t C13LineInfo[C13Size]; - - uint32_t GlobalRefsSize; - uint8_t GlobalRefs[GlobalRefsSize]; - }; - -- **Signature** - Unknown. In practice only the value of ``4`` has been - observed. It is hypothesized that this value corresponds to the set of - ``CV_SIGNATURE_xx`` defines in ``cvinfo.h``, with the value of ``4`` - meaning that this module has C13 line information (as opposed to C11 line - information). A corollary of this is that we expect to only ever see - C13 line info, and that we do not understand the format of C11 line info. - -- **Symbols** - The :ref:`CodeView Symbol Substream `. - ``SymbolSize`` is equal to the value of ``SymByteSize`` for the - corresponding module's entry in the :ref:`Module Info Substream ` - of the :doc:`DBI Stream `. 
- -- **C11LineInfo** - A block containing CodeView line information in C11 - format. ``C11Size`` is equal to the value of ``C11ByteSize`` from the - :ref:`Module Info Substream ` of the - :doc:`DBI Stream `. If this value is ``0``, then C11 line - information is not present. As mentioned previously, the format of - C11 line info is not understood and we assume all line in modern PDBs - to be in C13 format. - -- **C13LineInfo** - A block containing CodeView line information in C13 - format. ``C13Size`` is equal to the value of ``C13ByteSize`` from the - :ref:`Module Info Substream ` of the - :doc:`DBI Stream `. If this value is ``0``, then C13 line - information is not present. - -- **GlobalRefs** - The meaning of this substream is not understood. - -.. _modi_symbol_substream: - -The CodeView Symbol Substream -============================= - -The CodeView Symbol Substream. This is an array of variable length -records describing the functions, variables, inlining information, -and other symbols defined in the compiland. The entire array consumes -``SymbolSize-4`` bytes. The format of a CodeView Symbol Record (and -thusly, an array of CodeView Symbol Records) is described in -:doc:`CodeViewSymbols`. +===================================== +The Module Information Stream +===================================== + +.. contents:: + :local: + +.. _modi_stream_intro: + +Introduction +============ + +The Module Info Stream (henceforth referred to as the Modi stream) contains +information about a single module (object file, import library, etc that +contributes to the binary this PDB contains debug information about. There +is one modi stream for each module, and the mapping between modi stream index +and module is contained in the :doc:`DBI Stream `. The modi stream +for a single module contains line information for the compiland, as well as +all CodeView information for the symbols defined in the compiland. 
Finally, +there is a "global refs" substream which is not well understood. + +.. _modi_stream_layout: + +Stream Layout +============= + +A modi stream is laid out as follows: + + +.. code-block:: c++ + + struct ModiStream { + uint32_t Signature; + uint8_t Symbols[SymbolSize-4]; + uint8_t C11LineInfo[C11Size]; + uint8_t C13LineInfo[C13Size]; + + uint32_t GlobalRefsSize; + uint8_t GlobalRefs[GlobalRefsSize]; + }; + +- **Signature** - Unknown. In practice only the value of ``4`` has been + observed. It is hypothesized that this value corresponds to the set of + ``CV_SIGNATURE_xx`` defines in ``cvinfo.h``, with the value of ``4`` + meaning that this module has C13 line information (as opposed to C11 line + information). A corollary of this is that we expect to only ever see + C13 line info, and that we do not understand the format of C11 line info. + +- **Symbols** - The :ref:`CodeView Symbol Substream `. + ``SymbolSize`` is equal to the value of ``SymByteSize`` for the + corresponding module's entry in the :ref:`Module Info Substream ` + of the :doc:`DBI Stream `. + +- **C11LineInfo** - A block containing CodeView line information in C11 + format. ``C11Size`` is equal to the value of ``C11ByteSize`` from the + :ref:`Module Info Substream ` of the + :doc:`DBI Stream `. If this value is ``0``, then C11 line + information is not present. As mentioned previously, the format of + C11 line info is not understood and we assume all line in modern PDBs + to be in C13 format. + +- **C13LineInfo** - A block containing CodeView line information in C13 + format. ``C13Size`` is equal to the value of ``C13ByteSize`` from the + :ref:`Module Info Substream ` of the + :doc:`DBI Stream `. If this value is ``0``, then C13 line + information is not present. + +- **GlobalRefs** - The meaning of this substream is not understood. + +.. _modi_symbol_substream: + +The CodeView Symbol Substream +============================= + +The CodeView Symbol Substream. 
This is an array of variable length +records describing the functions, variables, inlining information, +and other symbols defined in the compiland. The entire array consumes +``SymbolSize-4`` bytes. The format of a CodeView Symbol Record (and +thusly, an array of CodeView Symbol Records) is described in +:doc:`CodeViewSymbols`. diff --git a/docs/PDB/MsfFile.rst b/docs/PDB/MsfFile.rst index dfbbf9ded7f..a53ebe3e884 100644 --- a/docs/PDB/MsfFile.rst +++ b/docs/PDB/MsfFile.rst @@ -1,179 +1,179 @@ -===================================== -The MSF File Format -===================================== - -.. contents:: - :local: - -.. _msf_layout: - -File Layout -=========== - -The MSF file format consists of the following components: - -1. :ref:`msf_superblock` -2. :ref:`msf_freeblockmap` (also know as Free Page Map, or FPM) -3. Data - -Each component is stored as an indexed block, the length of which is specified -in ``SuperBlock::BlockSize``. The file consists of 1 or more iterations of the -following pattern (sometimes referred to as an "interval"): - -1. 1 block of data -2. Free Block Map 1 (corresponds to ``SuperBlock::FreeBlockMapBlock`` 1) -3. Free Block Map 2 (corresponds to ``SuperBlock::FreeBlockMapBlock`` 2) -4. ``SuperBlock::BlockSize - 3`` blocks of data - -In the first interval, the first data block is used to store -:ref:`msf_superblock`. - -The following diagram demonstrates the general layout of the file (\| denotes -the end of an interval, and is for visualization purposes only): - -+-------------+-----------------------+------------------+------------------+----------+----+------+------+------+-------------+----+-----+ -| Block Index | 0 | 1 | 2 | 3 - 4095 | \| | 4096 | 4097 | 4098 | 4099 - 8191 | \| | ... 
| -+=============+=======================+==================+==================+==========+====+======+======+======+=============+====+=====+ -| Meaning | :ref:`msf_superblock` | Free Block Map 1 | Free Block Map 2 | Data | \| | Data | FPM1 | FPM2 | Data | \| | ... | -+-------------+-----------------------+------------------+------------------+----------+----+------+------+------+-------------+----+-----+ - -The file may end after any block, including immediately after a FPM1. - -.. note:: - LLVM only supports 4096 byte blocks (sometimes referred to as the "BigMsf" - variant), so the rest of this document will assume a block size of 4096. - -.. _msf_superblock: - -The Superblock -============== -At file offset 0 in an MSF file is the MSF *SuperBlock*, which is laid out as -follows: - -.. code-block:: c++ - - struct SuperBlock { - char FileMagic[sizeof(Magic)]; - ulittle32_t BlockSize; - ulittle32_t FreeBlockMapBlock; - ulittle32_t NumBlocks; - ulittle32_t NumDirectoryBytes; - ulittle32_t Unknown; - ulittle32_t BlockMapAddr; - }; - -- **FileMagic** - Must be equal to ``"Microsoft C / C++ MSF 7.00\\r\\n"`` - followed by the bytes ``1A 44 53 00 00 00``. -- **BlockSize** - The block size of the internal file system. Valid values are - 512, 1024, 2048, and 4096 bytes. Certain aspects of the MSF file layout vary - depending on the block sizes. For the purposes of LLVM, we handle only block - sizes of 4KiB, and all further discussion assumes a block size of 4KiB. -- **FreeBlockMapBlock** - The index of a block within the file, at which begins - a bitfield representing the set of all blocks within the file which are "free" - (i.e. the data within that block is not used). See :ref:`msf_freeblockmap` for - more information. - **Important**: ``FreeBlockMapBlock`` can only be ``1`` or ``2``! -- **NumBlocks** - The total number of blocks in the file. ``NumBlocks * BlockSize`` - should equal the size of the file on disk. 
-- **NumDirectoryBytes** - The size of the stream directory, in bytes. The stream - directory contains information about each stream's size and the set of blocks - that it occupies. It will be described in more detail later. -- **BlockMapAddr** - The index of a block within the MSF file. At this block is - an array of ``ulittle32_t``'s listing the blocks that the stream directory - resides on. For large MSF files, the stream directory (which describes the - block layout of each stream) may not fit entirely on a single block. As a - result, this extra layer of indirection is introduced, whereby this block - contains the list of blocks that the stream directory occupies, and the stream - directory itself can be stitched together accordingly. The number of - ``ulittle32_t``'s in this array is given by ``ceil(NumDirectoryBytes / BlockSize)``. - -.. _msf_freeblockmap: - -The Free Block Map -================== - -The Free Block Map (sometimes referred to as the Free Page Map, or FPM) is a -series of blocks which contains a bit flag for every block in the file. The -flag will be set to 0 if the block is in use, and 1 if the block is unused. - -Each file contains two FPMs, one of which is active at any given time. This -feature is designed to support incremental and atomic updates of the underlying -MSF file. While writing to an MSF file, if the active FPM is FPM1, you can -write your new modified bitfield to FPM2, and vice versa. Only when you commit -the file to disk do you need to swap the value in the SuperBlock to point to -the new ``FreeBlockMapBlock``. - -The Free Block Maps are stored as a series of single blocks thoughout the file -at intervals of BlockSize. Because each FPM block is of size ``BlockSize`` -bytes, it contains 8 times as many bits as an interval has blocks. This means -that the first block of each FPM refers to the first 8 intervals of the file -(the first 32768 blocks), the second block of each FPM refers to the next 8 -blocks, and so on. 
This results in far more FPM blocks being present than are -required, but in order to maintain backwards compatibility the format must stay -this way. - -The Stream Directory -==================== -The Stream Directory is the root of all access to the other streams in an MSF -file. Beginning at byte 0 of the stream directory is the following structure: - -.. code-block:: c++ - - struct StreamDirectory { - ulittle32_t NumStreams; - ulittle32_t StreamSizes[NumStreams]; - ulittle32_t StreamBlocks[NumStreams][]; - }; - -And this structure occupies exactly ``SuperBlock->NumDirectoryBytes`` bytes. -Note that each of the last two arrays is of variable length, and in particular -that the second array is jagged. - -**Example:** Suppose a hypothetical PDB file with a 4KiB block size, and 4 -streams of lengths {1000 bytes, 8000 bytes, 16000 bytes, 9000 bytes}. - -Stream 0: ceil(1000 / 4096) = 1 block - -Stream 1: ceil(8000 / 4096) = 2 blocks - -Stream 2: ceil(16000 / 4096) = 4 blocks - -Stream 3: ceil(9000 / 4096) = 3 blocks - -In total, 10 blocks are used. Let's see what the stream directory might look -like: - -.. code-block:: c++ - - struct StreamDirectory { - ulittle32_t NumStreams = 4; - ulittle32_t StreamSizes[] = {1000, 8000, 16000, 9000}; - ulittle32_t StreamBlocks[][] = { - {4}, - {5, 6}, - {11, 9, 7, 8}, - {10, 15, 12} - }; - }; - -In total, this occupies ``15 * 4 = 60`` bytes, so ``SuperBlock->NumDirectoryBytes`` -would equal ``60``, and ``SuperBlock->BlockMapAddr`` would be an array of one -``ulittle32_t``, since ``60 <= SuperBlock->BlockSize``. - -Note also that the streams are discontiguous, and that part of stream 3 is in the -middle of part of stream 2. You cannot assume anything about the layout of the -blocks! - -Alignment and Block Boundaries -============================== -As may be clear by now, it is possible for a single field (whether it be a high -level record, a long string field, or even a single ``uint16``) to begin and -end in separate blocks. 
For example, if the block size is 4096 bytes, and a -``uint16`` field begins at the last byte of the current block, then it would -need to end on the first byte of the next block. Since blocks are not -necessarily contiguously laid out in the file, this means that both the consumer -and the producer of an MSF file must be prepared to split data apart -accordingly. In the aforementioned example, the high byte of the ``uint16`` -would be written to the last byte of block N, and the low byte would be written -to the first byte of block N+1, which could be tens of thousands of bytes later -(or even earlier!) in the file, depending on what the stream directory says. +===================================== +The MSF File Format +===================================== + +.. contents:: + :local: + +.. _msf_layout: + +File Layout +=========== + +The MSF file format consists of the following components: + +1. :ref:`msf_superblock` +2. :ref:`msf_freeblockmap` (also know as Free Page Map, or FPM) +3. Data + +Each component is stored as an indexed block, the length of which is specified +in ``SuperBlock::BlockSize``. The file consists of 1 or more iterations of the +following pattern (sometimes referred to as an "interval"): + +1. 1 block of data +2. Free Block Map 1 (corresponds to ``SuperBlock::FreeBlockMapBlock`` 1) +3. Free Block Map 2 (corresponds to ``SuperBlock::FreeBlockMapBlock`` 2) +4. ``SuperBlock::BlockSize - 3`` blocks of data + +In the first interval, the first data block is used to store +:ref:`msf_superblock`. + +The following diagram demonstrates the general layout of the file (\| denotes +the end of an interval, and is for visualization purposes only): + ++-------------+-----------------------+------------------+------------------+----------+----+------+------+------+-------------+----+-----+ +| Block Index | 0 | 1 | 2 | 3 - 4095 | \| | 4096 | 4097 | 4098 | 4099 - 8191 | \| | ... 
| ++=============+=======================+==================+==================+==========+====+======+======+======+=============+====+=====+ +| Meaning | :ref:`msf_superblock` | Free Block Map 1 | Free Block Map 2 | Data | \| | Data | FPM1 | FPM2 | Data | \| | ... | ++-------------+-----------------------+------------------+------------------+----------+----+------+------+------+-------------+----+-----+ + +The file may end after any block, including immediately after a FPM1. + +.. note:: + LLVM only supports 4096 byte blocks (sometimes referred to as the "BigMsf" + variant), so the rest of this document will assume a block size of 4096. + +.. _msf_superblock: + +The Superblock +============== +At file offset 0 in an MSF file is the MSF *SuperBlock*, which is laid out as +follows: + +.. code-block:: c++ + + struct SuperBlock { + char FileMagic[sizeof(Magic)]; + ulittle32_t BlockSize; + ulittle32_t FreeBlockMapBlock; + ulittle32_t NumBlocks; + ulittle32_t NumDirectoryBytes; + ulittle32_t Unknown; + ulittle32_t BlockMapAddr; + }; + +- **FileMagic** - Must be equal to ``"Microsoft C / C++ MSF 7.00\\r\\n"`` + followed by the bytes ``1A 44 53 00 00 00``. +- **BlockSize** - The block size of the internal file system. Valid values are + 512, 1024, 2048, and 4096 bytes. Certain aspects of the MSF file layout vary + depending on the block sizes. For the purposes of LLVM, we handle only block + sizes of 4KiB, and all further discussion assumes a block size of 4KiB. +- **FreeBlockMapBlock** - The index of a block within the file, at which begins + a bitfield representing the set of all blocks within the file which are "free" + (i.e. the data within that block is not used). See :ref:`msf_freeblockmap` for + more information. + **Important**: ``FreeBlockMapBlock`` can only be ``1`` or ``2``! +- **NumBlocks** - The total number of blocks in the file. ``NumBlocks * BlockSize`` + should equal the size of the file on disk. 
+- **NumDirectoryBytes** - The size of the stream directory, in bytes. The stream + directory contains information about each stream's size and the set of blocks + that it occupies. It will be described in more detail later. +- **BlockMapAddr** - The index of a block within the MSF file. At this block is + an array of ``ulittle32_t``'s listing the blocks that the stream directory + resides on. For large MSF files, the stream directory (which describes the + block layout of each stream) may not fit entirely on a single block. As a + result, this extra layer of indirection is introduced, whereby this block + contains the list of blocks that the stream directory occupies, and the stream + directory itself can be stitched together accordingly. The number of + ``ulittle32_t``'s in this array is given by ``ceil(NumDirectoryBytes / BlockSize)``. + +.. _msf_freeblockmap: + +The Free Block Map +================== + +The Free Block Map (sometimes referred to as the Free Page Map, or FPM) is a +series of blocks which contains a bit flag for every block in the file. The +flag will be set to 0 if the block is in use, and 1 if the block is unused. + +Each file contains two FPMs, one of which is active at any given time. This +feature is designed to support incremental and atomic updates of the underlying +MSF file. While writing to an MSF file, if the active FPM is FPM1, you can +write your new modified bitfield to FPM2, and vice versa. Only when you commit +the file to disk do you need to swap the value in the SuperBlock to point to +the new ``FreeBlockMapBlock``. + +The Free Block Maps are stored as a series of single blocks thoughout the file +at intervals of BlockSize. Because each FPM block is of size ``BlockSize`` +bytes, it contains 8 times as many bits as an interval has blocks. This means +that the first block of each FPM refers to the first 8 intervals of the file +(the first 32768 blocks), the second block of each FPM refers to the next 8 +blocks, and so on. 
This results in far more FPM blocks being present than are +required, but in order to maintain backwards compatibility the format must stay +this way. + +The Stream Directory +==================== +The Stream Directory is the root of all access to the other streams in an MSF +file. Beginning at byte 0 of the stream directory is the following structure: + +.. code-block:: c++ + + struct StreamDirectory { + ulittle32_t NumStreams; + ulittle32_t StreamSizes[NumStreams]; + ulittle32_t StreamBlocks[NumStreams][]; + }; + +And this structure occupies exactly ``SuperBlock->NumDirectoryBytes`` bytes. +Note that each of the last two arrays is of variable length, and in particular +that the second array is jagged. + +**Example:** Suppose a hypothetical PDB file with a 4KiB block size, and 4 +streams of lengths {1000 bytes, 8000 bytes, 16000 bytes, 9000 bytes}. + +Stream 0: ceil(1000 / 4096) = 1 block + +Stream 1: ceil(8000 / 4096) = 2 blocks + +Stream 2: ceil(16000 / 4096) = 4 blocks + +Stream 3: ceil(9000 / 4096) = 3 blocks + +In total, 10 blocks are used. Let's see what the stream directory might look +like: + +.. code-block:: c++ + + struct StreamDirectory { + ulittle32_t NumStreams = 4; + ulittle32_t StreamSizes[] = {1000, 8000, 16000, 9000}; + ulittle32_t StreamBlocks[][] = { + {4}, + {5, 6}, + {11, 9, 7, 8}, + {10, 15, 12} + }; + }; + +In total, this occupies ``15 * 4 = 60`` bytes, so ``SuperBlock->NumDirectoryBytes`` +would equal ``60``, and ``SuperBlock->BlockMapAddr`` would be an array of one +``ulittle32_t``, since ``60 <= SuperBlock->BlockSize``. + +Note also that the streams are discontiguous, and that part of stream 3 is in the +middle of part of stream 2. You cannot assume anything about the layout of the +blocks! + +Alignment and Block Boundaries +============================== +As may be clear by now, it is possible for a single field (whether it be a high +level record, a long string field, or even a single ``uint16``) to begin and +end in separate blocks. 
For example, if the block size is 4096 bytes, and a +``uint16`` field begins at the last byte of the current block, then it would +need to end on the first byte of the next block. Since blocks are not +necessarily contiguously laid out in the file, this means that both the consumer +and the producer of an MSF file must be prepared to split data apart +accordingly. In the aforementioned example, the high byte of the ``uint16`` +would be written to the last byte of block N, and the low byte would be written +to the first byte of block N+1, which could be tens of thousands of bytes later +(or even earlier!) in the file, depending on what the stream directory says. diff --git a/docs/PDB/PublicStream.rst b/docs/PDB/PublicStream.rst index 5b413cfb886..7c860c266ee 100644 --- a/docs/PDB/PublicStream.rst +++ b/docs/PDB/PublicStream.rst @@ -1,3 +1,3 @@ -===================================== -The PDB Public Symbol Stream -===================================== +===================================== +The PDB Public Symbol Stream +===================================== diff --git a/docs/PDB/TpiStream.rst b/docs/PDB/TpiStream.rst index 91429b8b90a..314f688d108 100644 --- a/docs/PDB/TpiStream.rst +++ b/docs/PDB/TpiStream.rst @@ -1,312 +1,312 @@ -===================================== -The PDB TPI and IPI Streams -===================================== - -.. contents:: - :local: - -.. _tpi_intro: - -Introduction -============ - -The PDB TPI Stream (Index 2) and IPI Stream (Index 4) contain information about -all types used in the program. It is organized as a :ref:`header ` -followed by a list of :doc:`CodeView Type Records `. Types are -referenced from various streams and records throughout the PDB by their -:ref:`type index `. In general, the sequence of type records -following the :ref:`header ` forms a topologically sorted DAG -(directed acyclic graph), which means that a type record B can only refer to -the type A if ``A.TypeIndex < B.TypeIndex``. 
While there are rare cases where -this property will not hold (particularly when dealing with object files -compiled with MASM), an implementation should try very hard to make this -property hold, as it means the entire type graph can be constructed in a single -pass. - -.. important:: - Type records form a topologically sorted DAG (directed acyclic graph). - -.. _tpi_ipi: - -TPI vs IPI Stream -================= - -Recent versions of the PDB format (aka all versions covered by this document) -have 2 streams with identical layout, henceforth referred to as the TPI stream -and IPI stream. Subsequent contents of this document describing the on-disk -format apply equally whether it is for the TPI Stream or the IPI Stream. The -only difference between the two is in *which* CodeView records are allowed to -appear in each one, summarized by the following table: - -+----------------------+---------------------+ -| TPI Stream | IPI Stream | -+======================+=====================+ -| LF_POINTER | LF_FUNC_ID | -+----------------------+---------------------+ -| LF_MODIFIER | LF_MFUNC_ID | -+----------------------+---------------------+ -| LF_PROCEDURE | LF_BUILDINFO | -+----------------------+---------------------+ -| LF_MFUNCTION | LF_SUBSTR_LIST | -+----------------------+---------------------+ -| LF_LABEL | LF_STRING_ID | -+----------------------+---------------------+ -| LF_ARGLIST | LF_UDT_SRC_LINE | -+----------------------+---------------------+ -| LF_FIELDLIST | LF_UDT_MOD_SRC_LINE | -+----------------------+---------------------+ -| LF_ARRAY | | -+----------------------+---------------------+ -| LF_CLASS | | -+----------------------+---------------------+ -| LF_STRUCTURE | | -+----------------------+---------------------+ -| LF_INTERFACE | | -+----------------------+---------------------+ -| LF_UNION | | -+----------------------+---------------------+ -| LF_ENUM | | -+----------------------+---------------------+ -| LF_TYPESERVER2 | | 
-+----------------------+---------------------+ -| LF_VFTABLE | | -+----------------------+---------------------+ -| LF_VTSHAPE | | -+----------------------+---------------------+ -| LF_BITFIELD | | -+----------------------+---------------------+ -| LF_METHODLIST | | -+----------------------+---------------------+ -| LF_PRECOMP | | -+----------------------+---------------------+ -| LF_ENDPRECOMP | | -+----------------------+---------------------+ - -The usage of these records is described in more detail in -:doc:`CodeView Type Records `. - -.. _type_indices: - -Type Indices -============ - -A type index is a 32-bit integer that uniquely identifies a type inside of an -object file's ``.debug$T`` section or a PDB file's TPI or IPI stream. The -value of the type index for the first type record from the TPI stream is given -by the ``TypeIndexBegin`` member of the :ref:`TPI Stream Header ` -although in practice this value is always equal to 0x1000 (4096). - -Any type index with a high bit set is considered to come from the IPI stream, -although this appears to be more of a hack, and LLVM does not generate type -indices of this nature. They can, however, be observed in Microsoft PDBs -occasionally, so one should be prepared to handle them. Note that having the -high bit set is not a necessary condition to determine whether a type index -comes from the IPI stream, it is only sufficient. - -Once the high bit is cleared, any type index >= ``TypeIndexBegin`` is presumed -to come from the appropriate stream, and any type index less than this is a -bitmask which can be decomposed as follows: - -.. code-block:: none - - .---------------------------.------.----------. - | Unused | Mode | Kind | - '---------------------------'------'----------' - |+32 |+12 |+8 |+0 - - -- **Kind** - A value from the following enum: - -.. 
code-block:: c++ - - enum class SimpleTypeKind : uint32_t { - None = 0x0000, // uncharacterized type (no type) - Void = 0x0003, // void - NotTranslated = 0x0007, // type not translated by cvpack - HResult = 0x0008, // OLE/COM HRESULT - - SignedCharacter = 0x0010, // 8 bit signed - UnsignedCharacter = 0x0020, // 8 bit unsigned - NarrowCharacter = 0x0070, // really a char - WideCharacter = 0x0071, // wide char - Character16 = 0x007a, // char16_t - Character32 = 0x007b, // char32_t - - SByte = 0x0068, // 8 bit signed int - Byte = 0x0069, // 8 bit unsigned int - Int16Short = 0x0011, // 16 bit signed - UInt16Short = 0x0021, // 16 bit unsigned - Int16 = 0x0072, // 16 bit signed int - UInt16 = 0x0073, // 16 bit unsigned int - Int32Long = 0x0012, // 32 bit signed - UInt32Long = 0x0022, // 32 bit unsigned - Int32 = 0x0074, // 32 bit signed int - UInt32 = 0x0075, // 32 bit unsigned int - Int64Quad = 0x0013, // 64 bit signed - UInt64Quad = 0x0023, // 64 bit unsigned - Int64 = 0x0076, // 64 bit signed int - UInt64 = 0x0077, // 64 bit unsigned int - Int128Oct = 0x0014, // 128 bit signed int - UInt128Oct = 0x0024, // 128 bit unsigned int - Int128 = 0x0078, // 128 bit signed int - UInt128 = 0x0079, // 128 bit unsigned int - - Float16 = 0x0046, // 16 bit real - Float32 = 0x0040, // 32 bit real - Float32PartialPrecision = 0x0045, // 32 bit PP real - Float48 = 0x0044, // 48 bit real - Float64 = 0x0041, // 64 bit real - Float80 = 0x0042, // 80 bit real - Float128 = 0x0043, // 128 bit real - - Complex16 = 0x0056, // 16 bit complex - Complex32 = 0x0050, // 32 bit complex - Complex32PartialPrecision = 0x0055, // 32 bit PP complex - Complex48 = 0x0054, // 48 bit complex - Complex64 = 0x0051, // 64 bit complex - Complex80 = 0x0052, // 80 bit complex - Complex128 = 0x0053, // 128 bit complex - - Boolean8 = 0x0030, // 8 bit boolean - Boolean16 = 0x0031, // 16 bit boolean - Boolean32 = 0x0032, // 32 bit boolean - Boolean64 = 0x0033, // 64 bit boolean - Boolean128 = 0x0034, // 128 bit boolean 
- }; - -- **Mode** - A value from the following enum: - -.. code-block:: c++ - - enum class SimpleTypeMode : uint32_t { - Direct = 0, // Not a pointer - NearPointer = 1, // Near pointer - FarPointer = 2, // Far pointer - HugePointer = 3, // Huge pointer - NearPointer32 = 4, // 32 bit near pointer - FarPointer32 = 5, // 32 bit far pointer - NearPointer64 = 6, // 64 bit near pointer - NearPointer128 = 7 // 128 bit near pointer - }; - -Note that for pointers, the bitness is represented in the mode. So a ``void*`` -would have a type index with ``Mode=NearPointer32, Kind=Void`` if built for 32-bits -but a type index with ``Mode=NearPointer64, Kind=Void`` if built for 64-bits. - -By convention, the type index for ``std::nullptr_t`` is constructed the same way -as the type index for ``void*``, but using the bitless enumeration value -``NearPointer``. - - - -.. _tpi_header: - -Stream Header -============= -At offset 0 of the TPI Stream is a header with the following layout: - - -.. code-block:: c++ - - struct TpiStreamHeader { - uint32_t Version; - uint32_t HeaderSize; - uint32_t TypeIndexBegin; - uint32_t TypeIndexEnd; - uint32_t TypeRecordBytes; - - uint16_t HashStreamIndex; - uint16_t HashAuxStreamIndex; - uint32_t HashKeySize; - uint32_t NumHashBuckets; - - int32_t HashValueBufferOffset; - uint32_t HashValueBufferLength; - - int32_t IndexOffsetBufferOffset; - uint32_t IndexOffsetBufferLength; - - int32_t HashAdjBufferOffset; - uint32_t HashAdjBufferLength; - }; - -- **Version** - A value from the following enum. - -.. code-block:: c++ - - enum class TpiStreamVersion : uint32_t { - V40 = 19950410, - V41 = 19951122, - V50 = 19961031, - V70 = 19990903, - V80 = 20040203, - }; - -Similar to the :doc:`PDB Stream `, this value always appears to be -``V80``, and no other values have been observed. It is assumed that should -another value be observed, the layout described by this document may not be -accurate. 
- -- **HeaderSize** - ``sizeof(TpiStreamHeader)`` - -- **TypeIndexBegin** - The numeric value of the type index representing the - first type record in the TPI stream. This is usually the value 0x1000 as type - indices lower than this are reserved (see :ref:`Type Indices <type_indices>` for - a discussion of reserved type indices). - -- **TypeIndexEnd** - One greater than the numeric value of the type index - representing the last type record in the TPI stream. The total number of type - records in the TPI stream can be computed as ``TypeIndexEnd - TypeIndexBegin``. - -- **TypeRecordBytes** - The number of bytes of type record data following the header. - -- **HashStreamIndex** - The index of a stream which contains a list of hashes for - every type record. This value may be -1, indicating that hash information is not - present. In practice a valid stream index is always observed, so any producer - implementation should be prepared to emit this stream to ensure compatibility with - tools which may expect it to be present. - -- **HashAuxStreamIndex** - Presumably the index of a stream which contains a separate - hash table, although this has not been observed in practice and it's unclear what it - might be used for. - -- **HashKeySize** - The size of a hash value (usually 4 bytes). - -- **NumHashBuckets** - The number of buckets used to generate the hash values in the - aforementioned hash streams. - -- **HashValueBufferOffset / HashValueBufferLength** - The offset and size within - the TPI Hash Stream of the list of hash values. It should be assumed that there - are either 0 hash values, or a number equal to the number of type records in the - TPI stream (``TypeIndexEnd - TypeIndexBegin``). Thus, if ``HashValueBufferLength`` is - not equal to ``(TypeIndexEnd - TypeIndexBegin) * HashKeySize`` we can consider the - PDB malformed. - -- **IndexOffsetBufferOffset / IndexOffsetBufferLength** - The offset and size - within the TPI Hash Stream of the Type Index Offsets Buffer.
This is a list of - pairs of uint32_t's where the first value is a :ref:`Type Index <type_indices>` - and the second value is the offset in the type record data of the type with this - index. This can be used to do a binary search followed by a linear search to - get amortized O(log n) lookup by type index. - -- **HashAdjBufferOffset / HashAdjBufferLength** - The offset and size within - the TPI hash stream of a serialized hash table whose keys are the hash values - in the hash value buffer and whose values are type indices. This appears to - be useful in incremental linking scenarios, so that if a type is modified an - entry can be created mapping the old hash value to the new type index so that - a PDB file consumer can always have the most up to date version of the type - without forcing the incremental linker to garbage collect and update - references that point to the old version to now point to the new version. - The layout of this hash table is described in :doc:`HashTable`. - -.. _tpi_records: - -CodeView Type Record List -========================= -Following the header, there are ``TypeRecordBytes`` bytes of data that represent a -variable length array of :doc:`CodeView type records <CodeViewTypes>`. The number -of such records (i.e. the length of the array) can be determined by computing the -value ``Header.TypeIndexEnd - Header.TypeIndexBegin``. - -O(log n) random access is provided by way of the Type Index Offsets array (if present) -described previously. \ No newline at end of file +===================================== +The PDB TPI and IPI Streams +===================================== + +.. contents:: + :local: + +.. _tpi_intro: + +Introduction +============ + +The PDB TPI Stream (Index 2) and IPI Stream (Index 4) contain information about +all types used in the program. It is organized as a :ref:`header <tpi_header>` +followed by a list of :doc:`CodeView Type Records <CodeViewTypes>`. Types are +referenced from various streams and records throughout the PDB by their +:ref:`type index <type_indices>`.
In general, the sequence of type records +following the :ref:`header ` forms a topologically sorted DAG +(directed acyclic graph), which means that a type record B can only refer to +the type A if ``A.TypeIndex < B.TypeIndex``. While there are rare cases where +this property will not hold (particularly when dealing with object files +compiled with MASM), an implementation should try very hard to make this +property hold, as it means the entire type graph can be constructed in a single +pass. + +.. important:: + Type records form a topologically sorted DAG (directed acyclic graph). + +.. _tpi_ipi: + +TPI vs IPI Stream +================= + +Recent versions of the PDB format (aka all versions covered by this document) +have 2 streams with identical layout, henceforth referred to as the TPI stream +and IPI stream. Subsequent contents of this document describing the on-disk +format apply equally whether it is for the TPI Stream or the IPI Stream. The +only difference between the two is in *which* CodeView records are allowed to +appear in each one, summarized by the following table: + ++----------------------+---------------------+ +| TPI Stream | IPI Stream | ++======================+=====================+ +| LF_POINTER | LF_FUNC_ID | ++----------------------+---------------------+ +| LF_MODIFIER | LF_MFUNC_ID | ++----------------------+---------------------+ +| LF_PROCEDURE | LF_BUILDINFO | ++----------------------+---------------------+ +| LF_MFUNCTION | LF_SUBSTR_LIST | ++----------------------+---------------------+ +| LF_LABEL | LF_STRING_ID | ++----------------------+---------------------+ +| LF_ARGLIST | LF_UDT_SRC_LINE | ++----------------------+---------------------+ +| LF_FIELDLIST | LF_UDT_MOD_SRC_LINE | ++----------------------+---------------------+ +| LF_ARRAY | | ++----------------------+---------------------+ +| LF_CLASS | | ++----------------------+---------------------+ +| LF_STRUCTURE | | ++----------------------+---------------------+ +| 
LF_INTERFACE | | ++----------------------+---------------------+ +| LF_UNION | | ++----------------------+---------------------+ +| LF_ENUM | | ++----------------------+---------------------+ +| LF_TYPESERVER2 | | ++----------------------+---------------------+ +| LF_VFTABLE | | ++----------------------+---------------------+ +| LF_VTSHAPE | | ++----------------------+---------------------+ +| LF_BITFIELD | | ++----------------------+---------------------+ +| LF_METHODLIST | | ++----------------------+---------------------+ +| LF_PRECOMP | | ++----------------------+---------------------+ +| LF_ENDPRECOMP | | ++----------------------+---------------------+ + +The usage of these records is described in more detail in +:doc:`CodeView Type Records `. + +.. _type_indices: + +Type Indices +============ + +A type index is a 32-bit integer that uniquely identifies a type inside of an +object file's ``.debug$T`` section or a PDB file's TPI or IPI stream. The +value of the type index for the first type record from the TPI stream is given +by the ``TypeIndexBegin`` member of the :ref:`TPI Stream Header ` +although in practice this value is always equal to 0x1000 (4096). + +Any type index with a high bit set is considered to come from the IPI stream, +although this appears to be more of a hack, and LLVM does not generate type +indices of this nature. They can, however, be observed in Microsoft PDBs +occasionally, so one should be prepared to handle them. Note that having the +high bit set is not a necessary condition to determine whether a type index +comes from the IPI stream, it is only sufficient. + +Once the high bit is cleared, any type index >= ``TypeIndexBegin`` is presumed +to come from the appropriate stream, and any type index less than this is a +bitmask which can be decomposed as follows: + +.. code-block:: none + + .---------------------------.------.----------. 
+ | Unused | Mode | Kind | + '---------------------------'------'----------' + |+32 |+12 |+8 |+0 + + +- **Kind** - A value from the following enum: + +.. code-block:: c++ + + enum class SimpleTypeKind : uint32_t { + None = 0x0000, // uncharacterized type (no type) + Void = 0x0003, // void + NotTranslated = 0x0007, // type not translated by cvpack + HResult = 0x0008, // OLE/COM HRESULT + + SignedCharacter = 0x0010, // 8 bit signed + UnsignedCharacter = 0x0020, // 8 bit unsigned + NarrowCharacter = 0x0070, // really a char + WideCharacter = 0x0071, // wide char + Character16 = 0x007a, // char16_t + Character32 = 0x007b, // char32_t + + SByte = 0x0068, // 8 bit signed int + Byte = 0x0069, // 8 bit unsigned int + Int16Short = 0x0011, // 16 bit signed + UInt16Short = 0x0021, // 16 bit unsigned + Int16 = 0x0072, // 16 bit signed int + UInt16 = 0x0073, // 16 bit unsigned int + Int32Long = 0x0012, // 32 bit signed + UInt32Long = 0x0022, // 32 bit unsigned + Int32 = 0x0074, // 32 bit signed int + UInt32 = 0x0075, // 32 bit unsigned int + Int64Quad = 0x0013, // 64 bit signed + UInt64Quad = 0x0023, // 64 bit unsigned + Int64 = 0x0076, // 64 bit signed int + UInt64 = 0x0077, // 64 bit unsigned int + Int128Oct = 0x0014, // 128 bit signed int + UInt128Oct = 0x0024, // 128 bit unsigned int + Int128 = 0x0078, // 128 bit signed int + UInt128 = 0x0079, // 128 bit unsigned int + + Float16 = 0x0046, // 16 bit real + Float32 = 0x0040, // 32 bit real + Float32PartialPrecision = 0x0045, // 32 bit PP real + Float48 = 0x0044, // 48 bit real + Float64 = 0x0041, // 64 bit real + Float80 = 0x0042, // 80 bit real + Float128 = 0x0043, // 128 bit real + + Complex16 = 0x0056, // 16 bit complex + Complex32 = 0x0050, // 32 bit complex + Complex32PartialPrecision = 0x0055, // 32 bit PP complex + Complex48 = 0x0054, // 48 bit complex + Complex64 = 0x0051, // 64 bit complex + Complex80 = 0x0052, // 80 bit complex + Complex128 = 0x0053, // 128 bit complex + + Boolean8 = 0x0030, // 8 bit boolean + 
Boolean16 = 0x0031, // 16 bit boolean + Boolean32 = 0x0032, // 32 bit boolean + Boolean64 = 0x0033, // 64 bit boolean + Boolean128 = 0x0034, // 128 bit boolean + }; + +- **Mode** - A value from the following enum: + +.. code-block:: c++ + + enum class SimpleTypeMode : uint32_t { + Direct = 0, // Not a pointer + NearPointer = 1, // Near pointer + FarPointer = 2, // Far pointer + HugePointer = 3, // Huge pointer + NearPointer32 = 4, // 32 bit near pointer + FarPointer32 = 5, // 32 bit far pointer + NearPointer64 = 6, // 64 bit near pointer + NearPointer128 = 7 // 128 bit near pointer + }; + +Note that for pointers, the bitness is represented in the mode. So a ``void*`` +would have a type index with ``Mode=NearPointer32, Kind=Void`` if built for 32-bits +but a type index with ``Mode=NearPointer64, Kind=Void`` if built for 64-bits. + +By convention, the type index for ``std::nullptr_t`` is constructed the same way +as the type index for ``void*``, but using the bitless enumeration value +``NearPointer``. + + + +.. _tpi_header: + +Stream Header +============= +At offset 0 of the TPI Stream is a header with the following layout: + + +.. code-block:: c++ + + struct TpiStreamHeader { + uint32_t Version; + uint32_t HeaderSize; + uint32_t TypeIndexBegin; + uint32_t TypeIndexEnd; + uint32_t TypeRecordBytes; + + uint16_t HashStreamIndex; + uint16_t HashAuxStreamIndex; + uint32_t HashKeySize; + uint32_t NumHashBuckets; + + int32_t HashValueBufferOffset; + uint32_t HashValueBufferLength; + + int32_t IndexOffsetBufferOffset; + uint32_t IndexOffsetBufferLength; + + int32_t HashAdjBufferOffset; + uint32_t HashAdjBufferLength; + }; + +- **Version** - A value from the following enum. + +.. code-block:: c++ + + enum class TpiStreamVersion : uint32_t { + V40 = 19950410, + V41 = 19951122, + V50 = 19961031, + V70 = 19990903, + V80 = 20040203, + }; + +Similar to the :doc:`PDB Stream `, this value always appears to be +``V80``, and no other values have been observed. 
It is assumed that should +another value be observed, the layout described by this document may not be +accurate. + +- **HeaderSize** - ``sizeof(TpiStreamHeader)`` + +- **TypeIndexBegin** - The numeric value of the type index representing the + first type record in the TPI stream. This is usually the value 0x1000 as type + indices lower than this are reserved (see :ref:`Type Indices <type_indices>` for + a discussion of reserved type indices). + +- **TypeIndexEnd** - One greater than the numeric value of the type index + representing the last type record in the TPI stream. The total number of type + records in the TPI stream can be computed as ``TypeIndexEnd - TypeIndexBegin``. + +- **TypeRecordBytes** - The number of bytes of type record data following the header. + +- **HashStreamIndex** - The index of a stream which contains a list of hashes for + every type record. This value may be -1, indicating that hash information is not + present. In practice a valid stream index is always observed, so any producer + implementation should be prepared to emit this stream to ensure compatibility with + tools which may expect it to be present. + +- **HashAuxStreamIndex** - Presumably the index of a stream which contains a separate + hash table, although this has not been observed in practice and it's unclear what it + might be used for. + +- **HashKeySize** - The size of a hash value (usually 4 bytes). + +- **NumHashBuckets** - The number of buckets used to generate the hash values in the + aforementioned hash streams. + +- **HashValueBufferOffset / HashValueBufferLength** - The offset and size within + the TPI Hash Stream of the list of hash values. It should be assumed that there + are either 0 hash values, or a number equal to the number of type records in the + TPI stream (``TypeIndexEnd - TypeIndexBegin``). Thus, if ``HashValueBufferLength`` is + not equal to ``(TypeIndexEnd - TypeIndexBegin) * HashKeySize`` we can consider the + PDB malformed.
+ +- **IndexOffsetBufferOffset / IndexOffsetBufferLength** - The offset and size + within the TPI Hash Stream of the Type Index Offsets Buffer. This is a list of + pairs of uint32_t's where the first value is a :ref:`Type Index <type_indices>` + and the second value is the offset in the type record data of the type with this + index. This can be used to do a binary search followed by a linear search to + get amortized O(log n) lookup by type index. + +- **HashAdjBufferOffset / HashAdjBufferLength** - The offset and size within + the TPI hash stream of a serialized hash table whose keys are the hash values + in the hash value buffer and whose values are type indices. This appears to + be useful in incremental linking scenarios, so that if a type is modified an + entry can be created mapping the old hash value to the new type index so that + a PDB file consumer can always have the most up to date version of the type + without forcing the incremental linker to garbage collect and update + references that point to the old version to now point to the new version. + The layout of this hash table is described in :doc:`HashTable`. + +.. _tpi_records: + +CodeView Type Record List +========================= +Following the header, there are ``TypeRecordBytes`` bytes of data that represent a +variable length array of :doc:`CodeView type records <CodeViewTypes>`. The number +of such records (i.e. the length of the array) can be determined by computing the +value ``Header.TypeIndexEnd - Header.TypeIndexBegin``. + +O(log n) random access is provided by way of the Type Index Offsets array (if present) +described previously. diff --git a/docs/PDB/index.rst b/docs/PDB/index.rst index 0662e9d9e58..88e6015642a 100644 --- a/docs/PDB/index.rst +++ b/docs/PDB/index.rst @@ -1,168 +1,168 @@ -===================================== -The PDB File Format -===================================== - -.. contents:: - :local: - -..
_pdb_intro: - -Introduction -============ - -PDB (Program Database) is a file format invented by Microsoft and which contains -debug information that can be consumed by debuggers and other tools. Since -officially supported APIs exist on Windows for querying debug information from -PDBs even without the user understanding the internals of the file format, a -large ecosystem of tools has been built for Windows to consume this format. In -order for Clang to be able to generate programs that can interoperate with these -tools, it is necessary for us to generate PDB files ourselves. - -At the same time, LLVM has a long history of being able to cross-compile from -any platform to any platform, and we wish for the same to be true here. So it -is necessary for us to understand the PDB file format at the byte-level so that -we can generate PDB files entirely on our own. - -This manual describes what we know about the PDB file format today. The layout -of the file, the various streams contained within, the format of individual -records within, and more. - -We would like to extend our heartfelt gratitude to Microsoft, without whom we -would not be where we are today. Much of the knowledge contained within this -manual was learned through reading code published by Microsoft on their `GitHub -repo `__. - -.. _pdb_layout: - -File Layout -=========== - -.. important:: - Unless otherwise specified, all numeric values are encoded in little endian. - If you see a type such as ``uint16_t`` or ``uint64_t`` going forward, always - assume it is little endian! - -.. toctree:: - :hidden: - - MsfFile - PdbStream - TpiStream - DbiStream - ModiStream - PublicStream - GlobalStream - HashTable - CodeViewSymbols - CodeViewTypes - -.. _msf: - -The MSF Container ------------------ -A PDB file is really just a special case of an MSF (Multi-Stream Format) file. -An MSF file is actually a miniature "file system within a file". 
It contains -multiple streams (aka files) which can represent arbitrary data, and these -streams are divided into blocks which may not necessarily be contiguously -laid out within the file (aka fragmented). Additionally, the MSF contains a -stream directory (aka MFT) which describes how the streams (files) are laid -out within the MSF. - -For more information about the MSF container format, stream directory, and -block layout, see :doc:`MsfFile`. - -.. _streams: - -Streams -------- -The PDB format contains a number of streams which describe various information -such as the types, symbols, source files, and compilands (e.g. object files) -of a program, as well as some additional streams containing hash tables that are -used by debuggers and other tools to provide fast lookup of records and types -by name, and various other information about how the program was compiled such -as the specific toolchain used, and more. A summary of streams contained in a -PDB file is as follows: - -+--------------------+------------------------------+-------------------------------------------+ -| Name | Stream Index | Contents | -+====================+==============================+===========================================+ -| Old Directory | - Fixed Stream Index 0 | - Previous MSF Stream Directory | -+--------------------+------------------------------+-------------------------------------------+ -| PDB Stream | - Fixed Stream Index 1 | - Basic File Information | -| | | - Fields to match EXE to this PDB | -| | | - Map of named streams to stream indices | -+--------------------+------------------------------+-------------------------------------------+ -| TPI Stream | - Fixed Stream Index 2 | - CodeView Type Records | -| | | - Index of TPI Hash Stream | -+--------------------+------------------------------+-------------------------------------------+ -| DBI Stream | - Fixed Stream Index 3 | - Module/Compiland Information | -| | | - Indices of individual module streams | -| | | - 
Indices of public / global streams | -| | | - Section Contribution Information | -| | | - Source File Information | -| | | - References to streams containing | -| | | FPO / PGO Data | -+--------------------+------------------------------+-------------------------------------------+ -| IPI Stream | - Fixed Stream Index 4 | - CodeView Type Records | -| | | - Index of IPI Hash Stream | -+--------------------+------------------------------+-------------------------------------------+ -| /LinkInfo | - Contained in PDB Stream | - Unknown | -| | Named Stream map | | -+--------------------+------------------------------+-------------------------------------------+ -| /src/headerblock | - Contained in PDB Stream | - Summary of embedded source file content | -| | Named Stream map | (e.g. natvis files) | -+--------------------+------------------------------+-------------------------------------------+ -| /names | - Contained in PDB Stream | - PDB-wide global string table used for | -| | Named Stream map | string de-duplication | -+--------------------+------------------------------+-------------------------------------------+ -| Module Info Stream | - Contained in DBI Stream | - CodeView Symbol Records for this module | -| | - One for each compiland | - Line Number Information | -+--------------------+------------------------------+-------------------------------------------+ -| Public Stream | - Contained in DBI Stream | - Public (Exported) Symbol Records | -| | | - Index of Public Hash Stream | -+--------------------+------------------------------+-------------------------------------------+ -| Global Stream | - Contained in DBI Stream | - Single combined master symbol-table | -| | | - Index of Global Hash Stream | -+--------------------+------------------------------+-------------------------------------------+ -| TPI Hash Stream | - Contained in TPI Stream | - Hash table for looking up TPI records | -| | | by name | 
-+--------------------+------------------------------+-------------------------------------------+ -| IPI Hash Stream | - Contained in IPI Stream | - Hash table for looking up IPI records | -| | | by name | -+--------------------+------------------------------+-------------------------------------------+ - -More information about the structure of each of these can be found on the -following pages: - -:doc:`PdbStream` - Information about the PDB Info Stream and how it is used to match PDBs to EXEs. - -:doc:`TpiStream` - Information about the TPI stream and the CodeView records contained within. - -:doc:`DbiStream` - Information about the DBI stream and relevant substreams including the Module Substreams, - source file information, and CodeView symbol records contained within. - -:doc:`ModiStream` - Information about the Module Information Stream, of which there is one for each compilation - unit and the format of symbols contained within. - -:doc:`PublicStream` - Information about the Public Symbol Stream. - -:doc:`GlobalStream` - Information about the Global Symbol Stream. - -:doc:`HashTable` - Information about the serialized hash table format used internally to represent things such - as the Named Stream Map and the Hash Adjusters in the :doc:`TPI/IPI Stream `. - -CodeView -======== -CodeView is another format which comes into the picture. While MSF defines -the structure of the overall file, and PDB defines the set of streams that -appear within the MSF file and the format of those streams, CodeView defines -the format of **symbol and type records** that appear within specific streams. -Refer to the pages on :doc:`CodeViewSymbols` and :doc:`CodeViewTypes` for -more information about the CodeView format. +===================================== +The PDB File Format +===================================== + +.. contents:: + :local: + +.. 
_pdb_intro:
+
+Introduction
+============
+
+PDB (Program Database) is a file format invented by Microsoft and which contains
+debug information that can be consumed by debuggers and other tools. Since
+officially supported APIs exist on Windows for querying debug information from
+PDBs even without the user understanding the internals of the file format, a
+large ecosystem of tools has been built for Windows to consume this format. In
+order for Clang to be able to generate programs that can interoperate with these
+tools, it is necessary for us to generate PDB files ourselves.
+
+At the same time, LLVM has a long history of being able to cross-compile from
+any platform to any platform, and we wish for the same to be true here. So it
+is necessary for us to understand the PDB file format at the byte-level so that
+we can generate PDB files entirely on our own.
+
+This manual describes what we know about the PDB file format today. The layout
+of the file, the various streams contained within, the format of individual
+records within, and more.
+
+We would like to extend our heartfelt gratitude to Microsoft, without whom we
+would not be where we are today. Much of the knowledge contained within this
+manual was learned through reading code published by Microsoft on their `GitHub
+repo `__.
+
+.. _pdb_layout:
+
+File Layout
+===========
+
+.. important::
+   Unless otherwise specified, all numeric values are encoded in little endian.
+   If you see a type such as ``uint16_t`` or ``uint64_t`` going forward, always
+   assume it is little endian!
+
+.. toctree::
+   :hidden:
+
+   MsfFile
+   PdbStream
+   TpiStream
+   DbiStream
+   ModiStream
+   PublicStream
+   GlobalStream
+   HashTable
+   CodeViewSymbols
+   CodeViewTypes
+
+.. _msf:
+
+The MSF Container
+-----------------
+A PDB file is really just a special case of an MSF (Multi-Stream Format) file.
+An MSF file is actually a miniature "file system within a file". It contains
+multiple streams (aka files) which can represent arbitrary data, and these
+streams are divided into blocks which may not necessarily be contiguously
+laid out within the file (aka fragmented). Additionally, the MSF contains a
+stream directory (aka MFT) which describes how the streams (files) are laid
+out within the MSF.
+
+For more information about the MSF container format, stream directory, and
+block layout, see :doc:`MsfFile`.
+
+.. _streams:
+
+Streams
+-------
+The PDB format contains a number of streams which describe various information
+such as the types, symbols, source files, and compilands (e.g. object files)
+of a program, as well as some additional streams containing hash tables that are
+used by debuggers and other tools to provide fast lookup of records and types
+by name, and various other information about how the program was compiled such
+as the specific toolchain used, and more. A summary of streams contained in a
+PDB file is as follows:
+
++--------------------+------------------------------+-------------------------------------------+
+| Name               | Stream Index                 | Contents                                  |
++====================+==============================+===========================================+
+| Old Directory      | - Fixed Stream Index 0       | - Previous MSF Stream Directory           |
++--------------------+------------------------------+-------------------------------------------+
+| PDB Stream         | - Fixed Stream Index 1       | - Basic File Information                  |
+|                    |                              | - Fields to match EXE to this PDB         |
+|                    |                              | - Map of named streams to stream indices  |
++--------------------+------------------------------+-------------------------------------------+
+| TPI Stream         | - Fixed Stream Index 2       | - CodeView Type Records                   |
+|                    |                              | - Index of TPI Hash Stream                |
++--------------------+------------------------------+-------------------------------------------+
+| DBI Stream         | - Fixed Stream Index 3       | - Module/Compiland Information            |
+|                    |                              | - Indices of individual module streams    |
+|                    |                              | - Indices of public / global streams      |
+|                    |                              | - Section Contribution Information        |
+|                    |                              | - Source File Information                 |
+|                    |                              | - References to streams containing        |
+|                    |                              |   FPO / PGO Data                          |
++--------------------+------------------------------+-------------------------------------------+
+| IPI Stream         | - Fixed Stream Index 4       | - CodeView Type Records                   |
+|                    |                              | - Index of IPI Hash Stream                |
++--------------------+------------------------------+-------------------------------------------+
+| /LinkInfo          | - Contained in PDB Stream    | - Unknown                                 |
+|                    |   Named Stream map           |                                           |
++--------------------+------------------------------+-------------------------------------------+
+| /src/headerblock   | - Contained in PDB Stream    | - Summary of embedded source file content |
+|                    |   Named Stream map           |   (e.g. natvis files)                     |
++--------------------+------------------------------+-------------------------------------------+
+| /names             | - Contained in PDB Stream    | - PDB-wide global string table used for   |
+|                    |   Named Stream map           |   string de-duplication                   |
++--------------------+------------------------------+-------------------------------------------+
+| Module Info Stream | - Contained in DBI Stream    | - CodeView Symbol Records for this module |
+|                    | - One for each compiland     | - Line Number Information                 |
++--------------------+------------------------------+-------------------------------------------+
+| Public Stream      | - Contained in DBI Stream    | - Public (Exported) Symbol Records        |
+|                    |                              | - Index of Public Hash Stream             |
++--------------------+------------------------------+-------------------------------------------+
+| Global Stream      | - Contained in DBI Stream    | - Single combined master symbol-table     |
+|                    |                              | - Index of Global Hash Stream             |
++--------------------+------------------------------+-------------------------------------------+
+| TPI Hash Stream    | - Contained in TPI Stream    | - Hash table for looking up TPI records   |
+|                    |                              |   by name                                 |
++--------------------+------------------------------+-------------------------------------------+
+| IPI Hash Stream    | - Contained in IPI Stream    | - Hash table for looking up IPI records   |
+|                    |                              |   by name                                 |
++--------------------+------------------------------+-------------------------------------------+
+
+More information about the structure of each of these can be found on the
+following pages:
+
+:doc:`PdbStream`
+   Information about the PDB Info Stream and how it is used to match PDBs to EXEs.
+
+:doc:`TpiStream`
+   Information about the TPI stream and the CodeView records contained within.
+
+:doc:`DbiStream`
+   Information about the DBI stream and relevant substreams including the Module Substreams,
+   source file information, and CodeView symbol records contained within.
+
+:doc:`ModiStream`
+   Information about the Module Information Stream, of which there is one for each compilation
+   unit and the format of symbols contained within.
+
+:doc:`PublicStream`
+   Information about the Public Symbol Stream.
+
+:doc:`GlobalStream`
+   Information about the Global Symbol Stream.
+
+:doc:`HashTable`
+   Information about the serialized hash table format used internally to represent things such
+   as the Named Stream Map and the Hash Adjusters in the :doc:`TPI/IPI Stream `.
+
+CodeView
+========
+CodeView is another format which comes into the picture. While MSF defines
+the structure of the overall file, and PDB defines the set of streams that
+appear within the MSF file and the format of those streams, CodeView defines
+the format of **symbol and type records** that appear within specific streams.
+Refer to the pages on :doc:`CodeViewSymbols` and :doc:`CodeViewTypes` for
+more information about the CodeView format.