[PDB] Begin adding documentation for the PDB file format.

author Zachary Turner <zturner@google.com>

Thu, 10 Nov 2016 19:24:21 +0000 (19:24 +0000)

committer Zachary Turner <zturner@google.com>

Thu, 10 Nov 2016 19:24:21 +0000 (19:24 +0000)
diff --git a/docs/PDB/DbiStream.rst b/docs/PDB/DbiStream.rst
new file mode 100644
index 0000000..0a247a1
--- /dev/null
+++ b/docs/PDB/DbiStream.rst
@@ -0,0 +1,3 @@
+=====================================\r
+The PDB DBI (Debug Info) Stream\r
+=====================================\r
diff --git a/docs/PDB/GlobalStream.rst b/docs/PDB/GlobalStream.rst
new file mode 100644
index 0000000..314b9f0
--- /dev/null
+++ b/docs/PDB/GlobalStream.rst
@@ -0,0 +1,3 @@
+=====================================\r
+The PDB Global Symbol Stream\r
+=====================================\r
diff --git a/docs/PDB/HashStream.rst b/docs/PDB/HashStream.rst
new file mode 100644
index 0000000..a758db4
--- /dev/null
+++ b/docs/PDB/HashStream.rst
@@ -0,0 +1,3 @@
+=====================================\r
+The TPI & IPI Hash Streams\r
+=====================================\r
diff --git a/docs/PDB/ModiStream.rst b/docs/PDB/ModiStream.rst
new file mode 100644
index 0000000..3eb4505
--- /dev/null
+++ b/docs/PDB/ModiStream.rst
@@ -0,0 +1,3 @@
+=====================================\r
+The Module Information Stream\r
+=====================================\r
diff --git a/docs/PDB/MsfFile.rst b/docs/PDB/MsfFile.rst
new file mode 100644
index 0000000..bdceca3
--- /dev/null
+++ b/docs/PDB/MsfFile.rst
@@ -0,0 +1,121 @@
+=====================================\r
+The MSF File Format\r
+=====================================\r
+\r
+.. contents::\r
+   :local:\r
+\r
+.. _msf_superblock:\r
+\r
+The Superblock\r
+==============\r
+At file offset 0 in an MSF file is the MSF *SuperBlock*, which is laid out as\r
+follows:\r
+\r
+.. code-block:: c++\r
+\r
+  struct SuperBlock {\r
+    char FileMagic[sizeof(Magic)];\r
+    ulittle32_t BlockSize;\r
+    ulittle32_t FreeBlockMapBlock;\r
+    ulittle32_t NumBlocks;\r
+    ulittle32_t NumDirectoryBytes;\r
+    ulittle32_t Unknown;\r
+    ulittle32_t BlockMapAddr;\r
+  };\r
+\r
+- **FileMagic** - Must be equal to ``"Microsoft C/C++ MSF 7.00\r\n"``\r
+  followed by the bytes ``1A 44 53 00 00 00``.\r
+- **BlockSize** - The block size of the internal file system.  Valid values are\r
+  512, 1024, 2048, and 4096 bytes.  Certain aspects of the MSF file layout vary\r
+  depending on the block size.  For the purposes of LLVM, we handle only block\r
+  sizes of 4KiB, and all further discussion assumes a block size of 4KiB.\r
+- **FreeBlockMapBlock** - The index of a block within the file, at which begins\r
+  a bitfield representing the set of all blocks within the file which are "free"\r
+  (i.e. the data within that block is not used).  This bitfield is spread across\r
+  the MSF file at ``BlockSize`` intervals.\r
+  **Important**: ``FreeBlockMapBlock`` can only be ``1`` or ``2``!  This field\r
+  is designed to support incremental and atomic updates of the underlying MSF\r
+  file.  While writing to an MSF file, if the value of this field is ``1``, you\r
+  can write your new modified bitfield to block 2, and vice versa.  Only when\r
+  you commit the file to disk do you need to swap the value in the SuperBlock\r
+  to point to the new ``FreeBlockMapBlock``.\r
+- **NumBlocks** - The total number of blocks in the file.  ``NumBlocks * BlockSize``\r
+  should equal the size of the file on disk.\r
+- **NumDirectoryBytes** - The size of the stream directory, in bytes.  The stream\r
+  directory contains information about each stream's size and the set of blocks\r
+  that it occupies.  It will be described in more detail later.\r
+- **BlockMapAddr** - The index of a block within the MSF file.  At this block is\r
+  an array of ``ulittle32_t``'s listing the blocks that the stream directory\r
+  resides on.  For large MSF files, the stream directory (which describes the\r
+  block layout of each stream) may not fit entirely on a single block.  As a\r
+  result, this extra layer of indirection is introduced, whereby this block\r
+  contains the list of blocks that the stream directory occupies, and the stream\r
+  directory itself can be stitched together accordingly.  The number of\r
+  ``ulittle32_t``'s in this array is given by ``ceil(NumDirectoryBytes / BlockSize)``.\r
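
The field constraints listed above can be checked mechanically. The following is a minimal sketch, not LLVM's actual reader: ``SuperBlockInfo``, ``kMsfMagic``, ``readLE32``, ``parseSuperBlock`` and ``numDirectoryBlocks`` are hypothetical names introduced for illustration, and the magic is assumed to be the 32-byte sequence with no spaces around the slash.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Host-side copy of the SuperBlock fields that follow the magic (hypothetical).
struct SuperBlockInfo {
  uint32_t BlockSize;
  uint32_t FreeBlockMapBlock;
  uint32_t NumBlocks;
  uint32_t NumDirectoryBytes;
  uint32_t Unknown;
  uint32_t BlockMapAddr;
};

// Assumed magic: "Microsoft C/C++ MSF 7.00\r\n" then 1A 44 53 00 00 00.
static const unsigned char kMsfMagic[32] = {
    'M', 'i', 'c', 'r', 'o', 's', 'o', 'f', 't', ' ', 'C', '/', 'C', '+', '+', ' ',
    'M', 'S', 'F', ' ', '7', '.', '0', '0', '\r', '\n', 0x1A, 'D', 'S', 0, 0, 0};

// Decode one little-endian 32-bit value.
static uint32_t readLE32(const unsigned char *P) {
  return uint32_t(P[0]) | (uint32_t(P[1]) << 8) | (uint32_t(P[2]) << 16) |
         (uint32_t(P[3]) << 24);
}

// Parse the SuperBlock at the start of Data and check the documented
// constraints. Returns false on a short buffer, bad magic, or invalid field.
bool parseSuperBlock(const unsigned char *Data, std::size_t Size,
                     SuperBlockInfo &SB) {
  const std::size_t HeaderSize = sizeof(kMsfMagic) + 6 * sizeof(uint32_t);
  if (Size < HeaderSize ||
      std::memcmp(Data, kMsfMagic, sizeof(kMsfMagic)) != 0)
    return false;
  const unsigned char *P = Data + sizeof(kMsfMagic);
  SB.BlockSize = readLE32(P);
  SB.FreeBlockMapBlock = readLE32(P + 4);
  SB.NumBlocks = readLE32(P + 8);
  SB.NumDirectoryBytes = readLE32(P + 12);
  SB.Unknown = readLE32(P + 16);
  SB.BlockMapAddr = readLE32(P + 20);
  const bool ValidBlockSize = SB.BlockSize == 512 || SB.BlockSize == 1024 ||
                              SB.BlockSize == 2048 || SB.BlockSize == 4096;
  const bool ValidFreeMap =
      SB.FreeBlockMapBlock == 1 || SB.FreeBlockMapBlock == 2;
  return ValidBlockSize && ValidFreeMap;
}

// Entries in the block map: ceil(NumDirectoryBytes / BlockSize).
uint32_t numDirectoryBlocks(const SuperBlockInfo &SB) {
  assert(SB.BlockSize != 0 && "call parseSuperBlock first");
  return uint32_t((uint64_t(SB.NumDirectoryBytes) + SB.BlockSize - 1) /
                  SB.BlockSize);
}
```

A caller would pass the first bytes of the file to ``parseSuperBlock`` and, on success, read ``numDirectoryBlocks(SB)`` block indices from the block at ``BlockMapAddr``. The sketch deliberately does not compare ``NumBlocks * BlockSize`` against the file size, since the text only says the two "should" match.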
+  \r
+The Stream Directory\r
+====================\r
+The Stream Directory is the root of all access to the other streams in an MSF\r
+file.  Beginning at byte 0 of the stream directory is the following structure:\r
+\r
+.. code-block:: c++\r
+\r
+  struct StreamDirectory {\r
+    ulittle32_t NumStreams;\r
+    ulittle32_t StreamSizes[NumStreams];\r
+    ulittle32_t StreamBlocks[NumStreams][];\r
+  };\r
+  \r
+This structure occupies exactly ``SuperBlock->NumDirectoryBytes`` bytes.\r
+Note that both arrays are of variable length, and that ``StreamBlocks`` is\r
+jagged: stream ``i`` has ``ceil(StreamSizes[i] / BlockSize)`` block indices.\r
+\r
+**Example:** Suppose a hypothetical PDB file with a 4KiB block size, and 4\r
+streams of lengths {1000 bytes, 8000 bytes, 16000 bytes, 9000 bytes}.\r
+\r
+Stream 0: ceil(1000 / 4096) = 1 block\r
+\r
+Stream 1: ceil(8000 / 4096) = 2 blocks\r
+\r
+Stream 2: ceil(16000 / 4096) = 4 blocks\r
+\r
+Stream 3: ceil(9000 / 4096) = 3 blocks\r
+\r
+In total, 10 blocks are used.  Let's see what the stream directory might look\r
+like:\r
+\r
+.. code-block:: c++\r
+\r
+  struct StreamDirectory {\r
+    ulittle32_t NumStreams = 4;\r
+    ulittle32_t StreamSizes[] = {1000, 8000, 16000, 9000};\r
+    ulittle32_t StreamBlocks[][] = {\r
+      {4},\r
+      {5, 6},\r
+      {11, 9, 7, 8},\r
+      {10, 15, 12}\r
+    };\r
+  };\r
+  \r
+In total, this occupies ``15 * 4 = 60`` bytes, so ``SuperBlock->NumDirectoryBytes``\r
+would equal ``60``, and the block at ``SuperBlock->BlockMapAddr`` would hold an\r
+array of one ``ulittle32_t``, since ``60 <= SuperBlock->BlockSize``.\r
+\r
+Note also that the streams are discontiguous, and that part of stream 3 is in the\r
+middle of part of stream 2.  You cannot assume anything about the layout of the\r
+blocks!\r
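
The jagged layout in this example can be decoded with a single pass over the directory. This is a sketch under stated assumptions: the directory has already been stitched together and decoded into host-order 32-bit words, and ``StreamLayout`` and ``parseStreamDirectory`` are hypothetical names, not part of any existing API.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// One stream's entry in the directory (hypothetical helper type).
struct StreamLayout {
  uint32_t Size;                // StreamSizes[i], in bytes
  std::vector<uint32_t> Blocks; // StreamBlocks[i], in stream order
};

// Words holds the whole directory as host-order 32-bit values:
// NumStreams, then StreamSizes, then each stream's block list back to back.
// Returns false if the directory is too short for what it declares; Streams
// may then hold a partial result.
bool parseStreamDirectory(const std::vector<uint32_t> &Words,
                          uint32_t BlockSize,
                          std::vector<StreamLayout> &Streams) {
  assert(BlockSize != 0 && "BlockSize comes from a validated SuperBlock");
  Streams.clear();
  if (Words.empty())
    return false;
  const std::size_t NumStreams = Words[0];
  if (NumStreams > Words.size() - 1)
    return false; // StreamSizes would run past the end
  std::size_t Next = 1 + NumStreams; // index of the first StreamBlocks entry
  for (std::size_t I = 0; I < NumStreams; ++I) {
    StreamLayout S;
    S.Size = Words[1 + I];
    // ceil(Size / BlockSize) block indices belong to this stream.
    const uint64_t Count = (uint64_t(S.Size) + BlockSize - 1) / BlockSize;
    if (Count > Words.size() - Next)
      return false; // block list would run past the end
    const auto First = Words.begin() + static_cast<std::ptrdiff_t>(Next);
    S.Blocks.assign(First, First + static_cast<std::ptrdiff_t>(Count));
    Next += static_cast<std::size_t>(Count);
    Streams.push_back(S);
  }
  return true;
}
```

Fed the 15 words of the example (``4``, the four sizes, then the ten block indices), this yields four streams whose block lists match the rows of ``StreamBlocks`` above. The block counts are never stored in the directory; they are recomputed from each stream's size.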
+\r
+Alignment and Block Boundaries\r
+==============================\r
+As may be clear by now, it is possible for a single field (whether it be a high\r
+level record, a long string field, or even a single ``uint16``) to begin and\r
+end in separate blocks.  For example, if the block size is 4096 bytes, and a\r
+``uint16`` field begins at the last byte of the current block, then it would\r
+need to end on the first byte of the next block.  Since blocks are not\r
+necessarily contiguously laid out in the file, this means that both the consumer\r
+and the producer of an MSF file must be prepared to split data apart\r
+accordingly.  In the aforementioned example, assuming the little-endian layout\r
+implied by ``ulittle32_t``, the low byte of the ``uint16`` would be written to the\r
+last byte of one block and the high byte to the first byte of the stream's next block,\r
+which could be tens of thousands of bytes later (or even earlier!) in the file.\r
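
One way for a consumer to cope with a field that straddles a block boundary is to translate each stream offset to a file offset separately. The sketch below assumes little-endian fields (as the ``ulittle32_t`` types suggest) and an in-memory copy of the file; ``readStreamByte`` and ``readStreamU16`` are hypothetical helpers, and the stream's length is not checked.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Fetch the byte at a stream-relative Offset by mapping it through the
// stream's block list. Returns false if Offset falls outside the listed
// blocks or outside the file.
bool readStreamByte(const std::vector<unsigned char> &File,
                    const std::vector<uint32_t> &Blocks, uint32_t BlockSize,
                    uint64_t Offset, unsigned char &Out) {
  assert(BlockSize != 0 && "BlockSize comes from a validated SuperBlock");
  const uint64_t Index = Offset / BlockSize; // which block of the stream
  if (Index >= Blocks.size())
    return false;
  const uint64_t FileOffset =
      uint64_t(Blocks[static_cast<std::size_t>(Index)]) * BlockSize +
      Offset % BlockSize;
  if (FileOffset >= File.size())
    return false;
  Out = File[static_cast<std::size_t>(FileOffset)];
  return true;
}

// Read a little-endian uint16 whose two bytes may live in different blocks.
bool readStreamU16(const std::vector<unsigned char> &File,
                   const std::vector<uint32_t> &Blocks, uint32_t BlockSize,
                   uint64_t Offset, uint16_t &Out) {
  unsigned char Lo = 0, Hi = 0;
  if (!readStreamByte(File, Blocks, BlockSize, Offset, Lo) ||
      !readStreamByte(File, Blocks, BlockSize, Offset + 1, Hi))
    return false;
  Out = static_cast<uint16_t>(Lo | (Hi << 8));
  return true;
}
```

For a stream whose blocks are ``{3, 1}`` with a 4096-byte block size, a ``uint16`` at stream offset 4095 takes its low byte from the end of file block 3 and its high byte from the start of file block 1, which lies earlier in the file. A real reader would copy whole runs of bytes per block rather than one byte at a time; the per-byte form is used here only to keep the mapping obvious.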
diff --git a/docs/PDB/PdbStream.rst b/docs/PDB/PdbStream.rst
new file mode 100644
index 0000000..65adb9c
--- /dev/null
+++ b/docs/PDB/PdbStream.rst
@@ -0,0 +1,3 @@
+========================================\r
+The PDB Info Stream (aka the PDB Stream)\r
+========================================\r
diff --git a/docs/PDB/PublicStream.rst b/docs/PDB/PublicStream.rst
new file mode 100644
index 0000000..5b413cf
--- /dev/null
+++ b/docs/PDB/PublicStream.rst
@@ -0,0 +1,3 @@
+=====================================\r
+The PDB Public Symbol Stream\r
+=====================================\r
diff --git a/docs/PDB/TpiStream.rst b/docs/PDB/TpiStream.rst
new file mode 100644
index 0000000..1e3297e
--- /dev/null
+++ b/docs/PDB/TpiStream.rst
@@ -0,0 +1,3 @@
+=====================================\r
+The PDB TPI Stream\r
+=====================================\r
diff --git a/docs/PDB/index.rst b/docs/PDB/index.rst
new file mode 100644
index 0000000..2c1c3e3
--- /dev/null
+++ b/docs/PDB/index.rst
@@ -0,0 +1,160 @@
+=====================================\r
+The PDB File Format\r
+=====================================\r
+\r
+.. contents::\r
+   :local:\r
+\r
+.. _pdb_intro:\r
+\r
+Introduction\r
+============\r
+\r
+PDB (Program Database) is a file format, invented by Microsoft, that contains\r
+debug information for consumption by debuggers and other tools.  Since\r
+officially supported APIs exist on Windows for querying debug information from\r
+PDBs even without the user understanding the internals of the file format, a\r
+large ecosystem of tools has been built for Windows to consume this format.  In\r
+order for Clang to be able to generate programs that can interoperate with these\r
+tools, it is necessary for us to generate PDB files ourselves.\r
+\r
+At the same time, LLVM has a long history of being able to cross-compile from\r
+any platform to any platform, and we wish for the same to be true here.  So it\r
+is necessary for us to understand the PDB file format at the byte-level so that\r
+we can generate PDB files entirely on our own.\r
+\r
+This manual describes what we know about the PDB file format today: the layout\r
+of the file, the various streams contained within, the format of individual\r
+records within, and more.\r
+\r
+We would like to extend our heartfelt gratitude to Microsoft, without whom we\r
+would not be where we are today.  Much of the knowledge contained within this\r
+manual was learned through reading code published by Microsoft on their `GitHub\r
+repo <https://github.com/Microsoft/microsoft-pdb>`__.\r
+\r
+.. _pdb_layout:\r
+\r
+File Layout\r
+===========\r
+\r
+.. toctree::\r
+   :hidden:\r
+   \r
+   MsfFile\r
+   PdbStream\r
+   TpiStream\r
+   DbiStream\r
+   ModiStream\r
+   PublicStream\r
+   GlobalStream\r
+   HashStream\r
+\r
+.. _msf:\r
+\r
+The MSF Container\r
+-----------------\r
+A PDB file is really just a special case of an MSF (Multi-Stream Format) file.\r
+An MSF file is actually a miniature "file system within a file".  It contains\r
+multiple streams (aka files) which can represent arbitrary data, and these\r
+streams are divided into blocks which may not necessarily be contiguously\r
+laid out within the file (aka fragmented).  Additionally, the MSF contains a\r
+stream directory (aka MFT) which describes how the streams (files) are laid\r
+out within the MSF.\r
+\r
+For more information about the MSF container format, stream directory, and\r
+block layout, see :doc:`MsfFile`.\r
+\r
+.. _streams:\r
+\r
+Streams\r
+-------\r
+The PDB format contains a number of streams which describe various information\r
+such as the types, symbols, source files, and compilands (e.g. object files)\r
+of a program, as well as how the program was compiled (for example, the\r
+specific toolchain used).  Some additional streams contain hash tables that\r
+are used by debuggers and other tools to provide fast lookup of records and\r
+types by name.  A summary of streams contained in a PDB file is as\r
+follows:\r
+\r
++--------------------+------------------------------+-------------------------------------------+\r
+| Name               | Stream Index                 | Contents                                  |\r
++====================+==============================+===========================================+\r
+| Old Directory      | - Fixed Stream Index 0       | - Previous MSF Stream Directory           |\r
++--------------------+------------------------------+-------------------------------------------+\r
+| PDB Stream         | - Fixed Stream Index 1       | - Basic File Information                  |\r
+|                    |                              | - Fields to match EXE to this PDB         |\r
+|                    |                              | - Map of named streams to stream indices  |\r
++--------------------+------------------------------+-------------------------------------------+\r
+| TPI Stream         | - Fixed Stream Index 2       | - CodeView Type Records                   |\r
+|                    |                              | - Index of TPI Hash Stream                |\r
++--------------------+------------------------------+-------------------------------------------+\r
+| DBI Stream         | - Fixed Stream Index 3       | - Module/Compiland Information            |\r
+|                    |                              | - Indices of individual module streams    |\r
+|                    |                              | - Indices of public / global streams      |\r
+|                    |                              | - Section Contribution Information        |\r
+|                    |                              | - Source File Information                 |\r
+|                    |                              | - FPO / PGO Data                          |\r
++--------------------+------------------------------+-------------------------------------------+\r
+| IPI Stream         | - Fixed Stream Index 4       | - CodeView Type Records                   |\r
+|                    |                              | - Index of IPI Hash Stream                |\r
++--------------------+------------------------------+-------------------------------------------+\r
+| /LinkInfo          | - Contained in PDB Stream    | - Unknown                                 |\r
+|                    |   Named Stream map           |                                           |\r
++--------------------+------------------------------+-------------------------------------------+\r
+| /src/headerblock   | - Contained in PDB Stream    | - Unknown                                 |\r
+|                    |   Named Stream map           |                                           |\r
++--------------------+------------------------------+-------------------------------------------+\r
+| /names             | - Contained in PDB Stream    | - PDB-wide global string table used for   |\r
+|                    |   Named Stream map           |   string de-duplication                   |\r
++--------------------+------------------------------+-------------------------------------------+\r
+| Module Info Stream | - Contained in DBI Stream    | - CodeView Symbol Records for this module |\r
+|                    | - One for each compiland     | - Line Number Information                 |\r
++--------------------+------------------------------+-------------------------------------------+\r
+| Public Stream      | - Contained in DBI Stream    | - Public (Exported) Symbol Records        |\r
+|                    |                              | - Index of Public Hash Stream             |\r
++--------------------+------------------------------+-------------------------------------------+\r
+| Global Stream      | - Contained in DBI Stream    | - Global Symbol Records                   |\r
+|                    |                              | - Index of Global Hash Stream             |\r
++--------------------+------------------------------+-------------------------------------------+\r
+| TPI Hash Stream    | - Contained in TPI Stream    | - Hash table for looking up TPI records   |\r
+|                    |                              |   by name                                 |\r
++--------------------+------------------------------+-------------------------------------------+\r
+| IPI Hash Stream    | - Contained in IPI Stream    | - Hash table for looking up IPI records   |\r
+|                    |                              |   by name                                 |\r
++--------------------+------------------------------+-------------------------------------------+\r
+\r
+More information about the structure of each of these can be found on the\r
+following pages:\r
+   \r
+:doc:`PdbStream`\r
+   Information about the PDB Info Stream and how it is used to match PDBs to EXEs.\r
+\r
+:doc:`TpiStream`\r
+   Information about the TPI stream and the CodeView records contained within.\r
+\r
+:doc:`DbiStream`\r
+   Information about the DBI stream and relevant substreams including the Module Substreams,\r
+   source file information, and CodeView symbol records contained within.\r
+\r
+:doc:`ModiStream`\r
+   Information about the Module Information Stream, of which there is one for each\r
+   compilation unit, and the format of symbols contained within.\r
+\r
+:doc:`PublicStream`\r
+   Information about the Public Symbol Stream.\r
+\r
+:doc:`GlobalStream`\r
+   Information about the Global Symbol Stream.\r
+\r
+:doc:`HashStream`\r
+   Information about the Hash Table stream, and how it can be used to quickly look up records\r
+   by name.\r
+\r
+CodeView\r
+========\r
+CodeView is another format which comes into the picture.  While MSF defines\r
+the structure of the overall file, and PDB defines the set of streams that\r
+appear within the MSF file and the format of those streams, CodeView defines\r
+the format of **symbol and type records** that appear within specific streams.\r
+Refer to the pages on `CodeView Symbol Records` and `CodeView Type Records` for\r
+more information about the CodeView format.\r
diff --git a/docs/index.rst b/docs/index.rst
index 634e19a5c6f656a460e816c5fba7eddf25ee46bb..341a9c16325b9ab390e79eb8770d9cb931347b05 100644
--- a/docs/index.rst
+++ b/docs/index.rst
@@ -274,6 +274,7 @@ For API clients and LLVM developers.
     Coroutines
     GlobalISel
     XRay
+   PDB/index
  
  :doc:`WritingAnLLVMPass`
     Information on how to write LLVM transformations and analyses.
@@ -398,6 +399,9 @@ For API clients and LLVM developers.
  :doc:`XRay`
    High-level documentation of how to use XRay in LLVM.
  
+:doc:`The Microsoft PDB File Format <PDB/index>`
+  A detailed description of the Microsoft PDB (Program Database) file format.
+
  Development Process Documentation
  =================================