granicus.if.org Git - postgresql/blob - doc/src/sgml/storage.sgml

   1 <!-- doc/src/sgml/storage.sgml -->
   2
   3 <chapter id="storage">
   4
   5 <title>Database Physical Storage</title>
   6
   7 <para>
   8 This chapter provides an overview of the physical storage format used by
   9 <productname>PostgreSQL</productname> databases.
  10 </para>
  11
  12 <sect1 id="storage-file-layout">
  13
  14 <title>Database File Layout</title>
  15
  16 <para>
  17 This section describes the storage format at the level of files and
  18 directories.
  19 </para>
  20
  21 <para>
  22 Traditionally, the configuration and data files used by a database
  23 cluster are stored together within the cluster's data
  24 directory, commonly referred to as <varname>PGDATA</> (after the name of the
  25 environment variable that can be used to define it).  A common location for
  26 <varname>PGDATA</> is <filename>/var/lib/pgsql/data</>.  Multiple clusters,
  27 managed by different server instances, can exist on the same machine.
  28 </para>
  29
  30 <para>
  31 The <varname>PGDATA</> directory contains several subdirectories and control
  32 files, as shown in <xref linkend="pgdata-contents-table">.  In addition to
  33 these required items, the cluster configuration files
  34 <filename>postgresql.conf</filename>, <filename>pg_hba.conf</filename>, and
  35 <filename>pg_ident.conf</filename> are traditionally stored in
  36 <varname>PGDATA</>, although it is possible to place them elsewhere.
  37 </para>
  38
  39 <table tocentry="1" id="pgdata-contents-table">
  40 <title>Contents of <varname>PGDATA</></title>
  41 <tgroup cols="2">
  42 <thead>
  43 <row>
  44 <entry>
  45 Item
  46 </entry>
  47 <entry>Description</entry>
  48 </row>
  49 </thead>
  50
  51 <tbody>
  52
  53 <row>
  54  <entry><filename>PG_VERSION</></entry>
  55  <entry>A file containing the major version number of <productname>PostgreSQL</productname></entry>
  56 </row>
  57
  58 <row>
  59  <entry><filename>base</></entry>
  60  <entry>Subdirectory containing per-database subdirectories</entry>
  61 </row>
  62
  63 <row>
  64  <entry><filename>global</></entry>
  65  <entry>Subdirectory containing cluster-wide tables, such as
  66  <structname>pg_database</></entry>
  67 </row>
  68
  69 <row>
  70  <entry><filename>pg_commit_ts</></entry>
  71  <entry>Subdirectory containing transaction commit timestamp data</entry>
  72 </row>
  73
  74 <row>
  75  <entry><filename>pg_clog</></entry>
  76  <entry>Subdirectory containing transaction commit status data</entry>
  77 </row>
  78
  79 <row>
  80  <entry><filename>pg_dynshmem</></entry>
  81  <entry>Subdirectory containing files used by the dynamic shared memory
  82   subsystem</entry>
  83 </row>
  84
  85 <row>
  86  <entry><filename>pg_logical</></entry>
  87  <entry>Subdirectory containing status data for logical decoding</entry>
  88 </row>
  89
  90 <row>
  91  <entry><filename>pg_multixact</></entry>
  92  <entry>Subdirectory containing multitransaction status data
  93   (used for shared row locks)</entry>
  94 </row>
  95
  96 <row>
  97  <entry><filename>pg_notify</></entry>
  98  <entry>Subdirectory containing LISTEN/NOTIFY status data</entry>
  99 </row>
 100
 101 <row>
 102  <entry><filename>pg_replslot</></entry>
 103  <entry>Subdirectory containing replication slot data</entry>
 104 </row>
 105
 106 <row>
 107  <entry><filename>pg_serial</></entry>
 108  <entry>Subdirectory containing information about committed serializable transactions</entry>
 109 </row>
 110
 111 <row>
 112  <entry><filename>pg_snapshots</></entry>
 113  <entry>Subdirectory containing exported snapshots</entry>
 114 </row>
 115
 116 <row>
 117  <entry><filename>pg_stat</></entry>
 118  <entry>Subdirectory containing permanent files for the statistics
 119   subsystem</entry>
 120 </row>
 121
 122 <row>
 123  <entry><filename>pg_stat_tmp</></entry>
 124  <entry>Subdirectory containing temporary files for the statistics
 125   subsystem</entry>
 126 </row>
 127
 128 <row>
 129  <entry><filename>pg_subtrans</></entry>
 130  <entry>Subdirectory containing subtransaction status data</entry>
 131 </row>
 132
 133 <row>
 134  <entry><filename>pg_tblspc</></entry>
 135  <entry>Subdirectory containing symbolic links to tablespaces</entry>
 136 </row>
 137
 138 <row>
 139  <entry><filename>pg_twophase</></entry>
 140  <entry>Subdirectory containing state files for prepared transactions</entry>
 141 </row>
 142
 143 <row>
 144  <entry><filename>pg_xlog</></entry>
 145  <entry>Subdirectory containing WAL (Write Ahead Log) files</entry>
 146 </row>
 147
 148 <row>
 149  <entry><filename>postgresql.auto.conf</></entry>
 150  <entry>A file used for storing configuration parameters that are set by
 151 <command>ALTER SYSTEM</command></entry>
 152 </row>
 153
 154 <row>
 155  <entry><filename>postmaster.opts</></entry>
 156  <entry>A file recording the command-line options the server was
 157 last started with</entry>
 158 </row>
 159
 160 <row>
 161  <entry><filename>postmaster.pid</></entry>
 162  <entry>A lock file recording the current postmaster process ID (PID),
 163   cluster data directory path,
 164   postmaster start timestamp,
 165   port number,
 166   Unix-domain socket directory path (empty on Windows),
 167   first valid listen_address (IP address or <literal>*</>, or empty if
 168   not listening on TCP),
 169   and shared memory segment ID
 170   (this file is not present after server shutdown)</entry>
 171 </row>
 172
 173 </tbody>
 174 </tgroup>
 175 </table>
 176
 177 <para>
 178 For each database in the cluster there is a subdirectory within
 179 <varname>PGDATA</><filename>/base</>, named after the database's OID in
 180 <structname>pg_database</>.  This subdirectory is the default location
 181 for the database's files; in particular, its system catalogs are stored
 182 there.
 183 </para>
 184
 185 <para>
 186 Each table and index is stored in a separate file.  For ordinary relations,
 187 these files are named after the table or index's <firstterm>filenode</> number,
 188 which can be found in <structname>pg_class</>.<structfield>relfilenode</>. But
 189 for temporary relations, the file name is of the form
 190 <literal>t<replaceable>BBB</>_<replaceable>FFF</></>, where <replaceable>BBB</>
 191 is the backend ID of the backend which created the file, and <replaceable>FFF</>
 192 is the filenode number.  In either case, in addition to the main file (a/k/a
 193 main fork), each table and index has a <firstterm>free space map</> (see <xref
 194 linkend="storage-fsm">), which stores information about free space available in
 195 the relation.  The free space map is stored in a file named with the filenode
 196 number plus the suffix <literal>_fsm</>.  Tables also have a
 197 <firstterm>visibility map</>, stored in a fork with the suffix <literal>_vm</>,
 198 to track which pages are known to have no dead tuples.  The visibility map is
 199 described further in <xref linkend="storage-vm">.  Unlogged tables and indexes
 200 have a third fork, known as the initialization fork, which is stored in a fork
 201 with the suffix <literal>_init</literal> (see <xref linkend="storage-init">).
 202 </para>
 203
 204 <caution>
 205 <para>
 206 Note that while a table's filenode often matches its OID, this is
 207 <emphasis>not</> necessarily the case; some operations, like
 208 <command>TRUNCATE</>, <command>REINDEX</>, <command>CLUSTER</> and some forms
 209 of <command>ALTER TABLE</>, can change the filenode while preserving the OID.
 210 Avoid assuming that filenode and table OID are the same.
 211 Also, for certain system catalogs including <structname>pg_class</> itself,
 212 <structname>pg_class</>.<structfield>relfilenode</> contains zero.  The
 213 actual filenode number of these catalogs is stored in a lower-level data
 214 structure, and can be obtained using the <function>pg_relation_filenode()</>
 215 function.
 216 </para>
 217 </caution>
 218
 219 <para>
 220 When a table or index exceeds 1 GB, it is divided into gigabyte-sized
 221 <firstterm>segments</>.  The first segment's file name is the same as the
 222 filenode; subsequent segments are named filenode.1, filenode.2, etc.
 223 This arrangement avoids problems on platforms that have file size limitations.
 224 (Actually, 1 GB is just the default segment size.  The segment size can be
 225 adjusted using the configuration option <option>--with-segsize</option>
 226 when building <productname>PostgreSQL</>.)
 227 In principle, free space map and visibility map forks could require multiple
 228 segments as well, though this is unlikely to happen in practice.
 229 </para>
 230
 231 <para>
 232 A table that has columns with potentially large entries will have an
 233 associated <firstterm>TOAST</> table, which is used for out-of-line storage of
 234 field values that are too large to keep in the table rows proper.
 235 <structname>pg_class</>.<structfield>reltoastrelid</> links from a table to
 236 its <acronym>TOAST</> table, if any.
 237 See <xref linkend="storage-toast"> for more information.
 238 </para>
 239
 240 <para>
 241 The contents of tables and indexes are discussed further in
 242 <xref linkend="storage-page-layout">.
 243 </para>
 244
 245 <para>
 246 Tablespaces make the scenario more complicated.  Each user-defined tablespace
 247 has a symbolic link inside the <varname>PGDATA</><filename>/pg_tblspc</>
 248 directory, which points to the physical tablespace directory (i.e., the
 249 location specified in the tablespace's <command>CREATE TABLESPACE</> command).
 250 This symbolic link is named after
 251 the tablespace's OID.  Inside the physical tablespace directory there is
 252 a subdirectory with a name that depends on the <productname>PostgreSQL</>
 253 server version, such as <literal>PG_9.0_201008051</>.  (The reason for using
 254 this subdirectory is so that successive versions of the database can use
 255 the same <command>CREATE TABLESPACE</> location value without conflicts.)
 256 Within the version-specific subdirectory, there is
 257 a subdirectory for each database that has elements in the tablespace, named
 258 after the database's OID.  Tables and indexes are stored within that
 259 directory, using the filenode naming scheme.
 260 The <literal>pg_default</> tablespace is not accessed through
 261 <filename>pg_tblspc</>, but corresponds to
 262 <varname>PGDATA</><filename>/base</>.  Similarly, the <literal>pg_global</>
 263 tablespace is not accessed through <filename>pg_tblspc</>, but corresponds to
 264 <varname>PGDATA</><filename>/global</>.
 265 </para>
 266
 267 <para>
 268 The <function>pg_relation_filepath()</> function shows the entire path
 269 (relative to <varname>PGDATA</>) of any relation.  It is often useful
 270 as a substitute for remembering many of the above rules.  But keep in
 271 mind that this function just gives the name of the first segment of the
 272 main fork of the relation &mdash; you may need to append a segment number
 273 and/or <literal>_fsm</>, <literal>_vm</>, or <literal>_init</> to find all
 274 the files associated with the relation.
 275 </para>
 276
 277 <para>
 278 Temporary files (for operations such as sorting more data than can fit in
 279 memory) are created within <varname>PGDATA</><filename>/base/pgsql_tmp</>,
 280 or within a <filename>pgsql_tmp</> subdirectory of a tablespace directory
 281 if a tablespace other than <literal>pg_default</> is specified for them.
 282 The name of a temporary file has the form
 283 <filename>pgsql_tmp<replaceable>PPP</>.<replaceable>NNN</></filename>,
 284 where <replaceable>PPP</> is the PID of the owning backend and
 285 <replaceable>NNN</> distinguishes different temporary files of that backend.
 286 </para>
 287
 288 </sect1>
 289
 290 <sect1 id="storage-toast">
 291
 292 <title>TOAST</title>
 293
 294     <indexterm>
 295      <primary>TOAST</primary>
 296     </indexterm>
 297     <indexterm><primary>sliced bread</><see>TOAST</></indexterm>
 298
 299 <para>
 300 This section provides an overview of <acronym>TOAST</> (The
 301 Oversized-Attribute Storage Technique).
 302 </para>
 303
 304 <para>
 305 <productname>PostgreSQL</productname> uses a fixed page size (commonly
 306 8 kB), and does not allow tuples to span multiple pages.  Therefore, it is
 307 not possible to store very large field values directly.  To overcome
 308 this limitation, large field values are compressed and/or broken up into
 309 multiple physical rows.  This happens transparently to the user, with only
 310 small impact on most of the backend code.  The technique is affectionately
 311 known as <acronym>TOAST</> (or <quote>the best thing since sliced bread</>).
 312 The <acronym>TOAST</> infrastructure is also used to improve handling of
 313 large data values in-memory.
 314 </para>
 315
 316 <para>
 317 Only certain data types support <acronym>TOAST</> &mdash; there is no need to
 318 impose the overhead on data types that cannot produce large field values.
 319 To support <acronym>TOAST</>, a data type must have a variable-length
 320 (<firstterm>varlena</>) representation, in which, ordinarily, the first
 321 four-byte word of any stored value contains the total length of the value in
 322 bytes (including itself).  <acronym>TOAST</> does not constrain the rest
 323 of the data type's representation.  The special representations collectively
 324 called <firstterm><acronym>TOAST</>ed values</firstterm> work by modifying or
 325 reinterpreting this initial length word.  Therefore, the C-level functions
 326 supporting a <acronym>TOAST</>-able data type must be careful about how they
 327 handle potentially <acronym>TOAST</>ed input values: an input might not
 328 actually consist of a four-byte length word and contents until after it's
 329 been <firstterm>detoasted</>.  (This is normally done by invoking
 330 <function>PG_DETOAST_DATUM</> before doing anything with an input value,
 331 but in some cases more efficient approaches are possible.
 332 See <xref linkend="xtypes-toast"> for more detail.)
 333 </para>
 334
 335 <para>
 336 <acronym>TOAST</> usurps two bits of the varlena length word (the high-order
 337 bits on big-endian machines, the low-order bits on little-endian machines),
 338 thereby limiting the logical size of any value of a <acronym>TOAST</>-able
 339 data type to 1 GB (2<superscript>30</> - 1 bytes).  When both bits are zero,
 340 the value is an ordinary un-<acronym>TOAST</>ed value of the data type, and
 341 the remaining bits of the length word give the total datum size (including
 342 length word) in bytes.  When the highest-order or lowest-order bit is set,
 343 the value has only a single-byte header instead of the normal four-byte
 344 header, and the remaining bits of that byte give the total datum size
 345 (including length byte) in bytes.  This alternative supports space-efficient
 346 storage of values shorter than 127 bytes, while still allowing the data type
 347 to grow to 1 GB at need.  Values with single-byte headers aren't aligned on
 348 any particular boundary, whereas values with four-byte headers are aligned on
 349 at least a four-byte boundary; this omission of alignment padding provides
 350 additional space savings that is significant compared to short values.
 351 As a special case, if the remaining bits of a single-byte header are all
 352 zero (which would be impossible for a self-inclusive length), the value is
 353 a pointer to out-of-line data, with several possible alternatives as
 354 described below.  The type and size of such a <firstterm>TOAST pointer</>
 355 are determined by a code stored in the second byte of the datum.
 356 Lastly, when the highest-order or lowest-order bit is clear but the adjacent
 357 bit is set, the content of the datum has been compressed and must be
 358 decompressed before use.  In this case the remaining bits of the four-byte
 359 length word give the total size of the compressed datum, not the
 360 original data.  Note that compression is also possible for out-of-line data
 361 but the varlena header does not tell whether it has occurred &mdash;
 362 the content of the <acronym>TOAST</> pointer tells that, instead.
 363 </para>
 364
 365 <para>
 366 As mentioned, there are multiple types of <acronym>TOAST</> pointer datums.
 367 The oldest and most common type is a pointer to out-of-line data stored in
 368 a <firstterm><acronym>TOAST</> table</firstterm> that is separate from, but
 369 associated with, the table containing the <acronym>TOAST</> pointer datum
 370 itself.  These <firstterm>on-disk</> pointer datums are created by the
 371 <acronym>TOAST</> management code (in <filename>access/heap/tuptoaster.c</>)
 372 when a tuple to be stored on disk is too large to be stored as-is.
 373 Further details appear in <xref linkend="storage-toast-ondisk">.
 374 Alternatively, a <acronym>TOAST</> pointer datum can contain a pointer to
 375 out-of-line data that appears elsewhere in memory.  Such datums are
 376 necessarily short-lived, and will never appear on-disk, but they are very
 377 useful for avoiding copying and redundant processing of large data values.
 378 Further details appear in <xref linkend="storage-toast-inmemory">.
 379 </para>
 380
 381 <para>
 382 The compression technique used for either in-line or out-of-line compressed
 383 data is a fairly simple and very fast member
 384 of the LZ family of compression techniques.  See
 385 <filename>src/common/pg_lzcompress.c</> for the details.
 386 </para>
 387
 388 <sect2 id="storage-toast-ondisk">
 389  <title>Out-of-line, on-disk TOAST storage</title>
 390
 391 <para>
 392 If any of the columns of a table are <acronym>TOAST</>-able, the table will
 393 have an associated <acronym>TOAST</> table, whose OID is stored in the table's
 394 <structname>pg_class</>.<structfield>reltoastrelid</> entry.  On-disk
 395 <acronym>TOAST</>ed values are kept in the <acronym>TOAST</> table, as
 396 described in more detail below.
 397 </para>
 398
 399 <para>
 400 Out-of-line values are divided (after compression if used) into chunks of at
 401 most <symbol>TOAST_MAX_CHUNK_SIZE</> bytes (by default this value is chosen
 402 so that four chunk rows will fit on a page, making it about 2000 bytes).
 403 Each chunk is stored as a separate row in the <acronym>TOAST</> table
 404 belonging to the owning table.  Every
 405 <acronym>TOAST</> table has the columns <structfield>chunk_id</> (an OID
 406 identifying the particular <acronym>TOAST</>ed value),
 407 <structfield>chunk_seq</> (a sequence number for the chunk within its value),
 408 and <structfield>chunk_data</> (the actual data of the chunk).  A unique index
 409 on <structfield>chunk_id</> and <structfield>chunk_seq</> provides fast
 410 retrieval of the values.  A pointer datum representing an out-of-line on-disk
 411 <acronym>TOAST</>ed value therefore needs to store the OID of the
 412 <acronym>TOAST</> table in which to look and the OID of the specific value
 413 (its <structfield>chunk_id</>).  For convenience, pointer datums also store the
 414 logical datum size (original uncompressed data length) and physical stored size
 415 (different if compression was applied).  Allowing for the varlena header bytes,
 416 the total size of an on-disk <acronym>TOAST</> pointer datum is therefore 18
 417 bytes regardless of the actual size of the represented value.
 418 </para>
 419
 420 <para>
 421 The <acronym>TOAST</> management code is triggered only
 422 when a row value to be stored in a table is wider than
 423 <symbol>TOAST_TUPLE_THRESHOLD</> bytes (normally 2 kB).
 424 The <acronym>TOAST</> code will compress and/or move
 425 field values out-of-line until the row value is shorter than
 426 <symbol>TOAST_TUPLE_TARGET</> bytes (also normally 2 kB)
 427 or no more gains can be had.  During an UPDATE
 428 operation, values of unchanged fields are normally preserved as-is; so an
 429 UPDATE of a row with out-of-line values incurs no <acronym>TOAST</> costs if
 430 none of the out-of-line values change.
 431 </para>
 432
 433 <para>
 434 The <acronym>TOAST</> management code recognizes four different strategies
 435 for storing <acronym>TOAST</>-able columns on disk:
 436
 437    <itemizedlist>
 438     <listitem>
 439      <para>
 440       <literal>PLAIN</literal> prevents either compression or
 441       out-of-line storage; furthermore it disables use of single-byte headers
 442       for varlena types.
 443       This is the only possible strategy for
 444       columns of non-<acronym>TOAST</>-able data types.
 445      </para>
 446     </listitem>
 447     <listitem>
 448      <para>
 449       <literal>EXTENDED</literal> allows both compression and out-of-line
 450       storage.  This is the default for most <acronym>TOAST</>-able data types.
 451       Compression will be attempted first, then out-of-line storage if
 452       the row is still too big.
 453      </para>
 454     </listitem>
 455     <listitem>
 456      <para>
 457       <literal>EXTERNAL</literal> allows out-of-line storage but not
 458       compression.  Use of <literal>EXTERNAL</literal> will
 459       make substring operations on wide <type>text</type> and
 460       <type>bytea</type> columns faster (at the penalty of increased storage
 461       space) because these operations are optimized to fetch only the
 462       required parts of the out-of-line value when it is not compressed.
 463      </para>
 464     </listitem>
 465     <listitem>
 466      <para>
 467       <literal>MAIN</literal> allows compression but not out-of-line
 468       storage.  (Actually, out-of-line storage will still be performed
 469       for such columns, but only as a last resort when there is no other
 470       way to make the row small enough to fit on a page.)
 471      </para>
 472     </listitem>
 473    </itemizedlist>
 474
 475 Each <acronym>TOAST</>-able data type specifies a default strategy for columns
 476 of that data type, but the strategy for a given table column can be altered
 477 with <command>ALTER TABLE SET STORAGE</>.
 478 </para>
 479
 480 <para>
 481 This scheme has a number of advantages compared to a more straightforward
 482 approach such as allowing row values to span pages.  Assuming that queries are
 483 usually qualified by comparisons against relatively small key values, most of
 484 the work of the executor will be done using the main row entry. The big values
 485 of <acronym>TOAST</>ed attributes will only be pulled out (if selected at all)
 486 at the time the result set is sent to the client. Thus, the main table is much
 487 smaller and more of its rows fit in the shared buffer cache than would be the
 488 case without any out-of-line storage. Sort sets shrink also, and sorts will
 489 more often be done entirely in memory. A little test showed that a table
 490 containing typical HTML pages and their URLs was stored in about half of the
 491 raw data size including the <acronym>TOAST</> table, and that the main table
 492 contained only about 10% of the entire data (the URLs and some small HTML
 493 pages). There was no run time difference compared to an un-<acronym>TOAST</>ed
 494 comparison table, in which all the HTML pages were cut down to 7 kB to fit.
 495 </para>
 496
 497 </sect2>
 498
 499 <sect2 id="storage-toast-inmemory">
 500  <title>Out-of-line, in-memory TOAST storage</title>
 501
 502 <para>
 503 <acronym>TOAST</> pointers can point to data that is not on disk, but is
 504 elsewhere in the memory of the current server process.  Such pointers
 505 obviously cannot be long-lived, but they are nonetheless useful.  There
 506 are currently two sub-cases:
 507 pointers to <firstterm>indirect</> data and
 508 pointers to <firstterm>expanded</> data.
 509 </para>
 510
 511 <para>
 512 Indirect <acronym>TOAST</> pointers simply point at a non-indirect varlena
 513 value stored somewhere in memory.  This case was originally created merely
 514 as a proof of concept, but it is currently used during logical decoding to
 515 avoid possibly having to create physical tuples exceeding 1 GB (as pulling
 516 all out-of-line field values into the tuple might do).  The case is of
 517 limited use since the creator of the pointer datum is entirely responsible
 518 that the referenced data survives for as long as the pointer could exist,
 519 and there is no infrastructure to help with this.
 520 </para>
 521
 522 <para>
 523 Expanded <acronym>TOAST</> pointers are useful for complex data types
 524 whose on-disk representation is not especially suited for computational
 525 purposes.  As an example, the standard varlena representation of a
 526 <productname>PostgreSQL</> array includes dimensionality information, a
 527 nulls bitmap if there are any null elements, then the values of all the
 528 elements in order.  When the element type itself is variable-length, the
 529 only way to find the <replaceable>N</>'th element is to scan through all the
 530 preceding elements.  This representation is appropriate for on-disk storage
 531 because of its compactness, but for computations with the array it's much
 532 nicer to have an <quote>expanded</> or <quote>deconstructed</>
 533 representation in which all the element starting locations have been
 534 identified.  The <acronym>TOAST</> pointer mechanism supports this need by
 535 allowing a pass-by-reference Datum to point to either a standard varlena
 536 value (the on-disk representation) or a <acronym>TOAST</> pointer that
 537 points to an expanded representation somewhere in memory.  The details of
 538 this expanded representation are up to the data type, though it must have
 539 a standard header and meet the other API requirements given
 540 in <filename>src/include/utils/expandeddatum.h</>.  C-level functions
 541 working with the data type can choose to handle either representation.
 542 Functions that do not know about the expanded representation, but simply
 543 apply <function>PG_DETOAST_DATUM</> to their inputs, will automatically
 544 receive the traditional varlena representation; so support for an expanded
 545 representation can be introduced incrementally, one function at a time.
 546 </para>
 547
 548 <para>
 549 <acronym>TOAST</> pointers to expanded values are further broken down
 550 into <firstterm>read-write</> and <firstterm>read-only</> pointers.
 551 The pointed-to representation is the same either way, but a function that
 552 receives a read-write pointer is allowed to modify the referenced value
 553 in-place, whereas one that receives a read-only pointer must not; it must
 554 first create a copy if it wants to make a modified version of the value.
 555 This distinction and some associated conventions make it possible to avoid
 556 unnecessary copying of expanded values during query execution.
 557 </para>
 558
 559 <para>
 560 For all types of in-memory <acronym>TOAST</> pointer, the <acronym>TOAST</>
 561 management code ensures that no such pointer datum can accidentally get
 562 stored on disk.  In-memory <acronym>TOAST</> pointers are automatically
 563 expanded to normal in-line varlena values before storage &mdash; and then
 564 possibly converted to on-disk <acronym>TOAST</> pointers, if the containing
 565 tuple would otherwise be too big.
 566 </para>
 567
 568 </sect2>
 569
 570 </sect1>
 571
 572 <sect1 id="storage-fsm">
 573
 574 <title>Free Space Map</title>
 575
 576 <indexterm>
 577  <primary>Free Space Map</primary>
 578 </indexterm>
 579 <indexterm><primary>FSM</><see>Free Space Map</></indexterm>
 580
 581 <para>
 582 Each heap and index relation, except for hash indexes, has a Free Space Map
 583 (FSM) to keep track of available space in the relation. It's stored
 584 alongside the main relation data in a separate relation fork, named after the
 585 filenode number of the relation, plus a <literal>_fsm</> suffix. For example,
 586 if the filenode of a relation is 12345, the FSM is stored in a file called
 587 <filename>12345_fsm</>, in the same directory as the main relation file.
 588 </para>
 589
 590 <para>
 591 The Free Space Map is organized as a tree of <acronym>FSM</> pages. The
 592 bottom level <acronym>FSM</> pages store the free space available on each
 593 heap (or index) page, using one byte to represent each such page. The upper
 594 levels aggregate information from the lower levels.
 595 </para>
 596
 597 <para>
 598 Within each <acronym>FSM</> page is a binary tree, stored in an array with
 599 one byte per node. Each leaf node represents a heap page, or a lower level
 600 <acronym>FSM</> page. In each non-leaf node, the higher of its children's
 601 values is stored. The maximum value in the leaf nodes is therefore stored
 602 at the root.
 603 </para>
 604
 605 <para>
 606 See <filename>src/backend/storage/freespace/README</> for more details on
 607 how the <acronym>FSM</> is structured, and how it's updated and searched.
 608 The <xref linkend="pgfreespacemap"> module
 609 can be used to examine the information stored in free space maps.
 610 </para>
 611
 612 </sect1>
 613
 614 <sect1 id="storage-vm">
 615
 616 <title>Visibility Map</title>
 617
 618 <indexterm>
 619  <primary>Visibility Map</primary>
 620 </indexterm>
 621 <indexterm><primary>VM</><see>Visibility Map</></indexterm>
 622
 623 <para>
 624 Each heap relation has a Visibility Map
 625 (VM) to keep track of which pages contain only tuples that are known to be
 626 visible to all active transactions; it also keeps track of which pages contain
 627 only unfrozen tuples.  It's stored
 628 alongside the main relation data in a separate relation fork, named after the
 629 filenode number of the relation, plus a <literal>_vm</> suffix. For example,
 630 if the filenode of a relation is 12345, the VM is stored in a file called
 631 <filename>12345_vm</>, in the same directory as the main relation file.
 632 Note that indexes do not have VMs.
 633 </para>
 634
 635 <para>
 636 The visibility map stores two bits per heap page.  The first bit, if set,
 637 indicates that the page is all-visible, or in other words that the page does
 638 not contain any tuples that need to be vacuumed.
 639 This information can also be used by <firstterm>index-only scans</> to answer
 640 queries using only the index tuple.
 641 The second bit, if set, means that all tuples on the page have been frozen.
 642 </para>
 643
 644 <para>
 645 The map is conservative in the sense that we make sure that whenever a bit is
 646 set, we know the condition is true, but if a bit is not set, it might or
 647 might not be true. Visibility map bits are only set by vacuum, but are
 648 cleared by any data-modifying operations on a page.
 649 </para>
 650
 651 </sect1>
 652
 653 <sect1 id="storage-init">
 654
 655 <title>The Initialization Fork</title>
 656
 657 <indexterm>
 658  <primary>Initialization Fork</primary>
 659 </indexterm>
 660
 661 <para>
 662 Each unlogged table, and each index on an unlogged table, has an initialization
 663 fork.  The initialization fork is an empty table or index of the appropriate
 664 type.  When an unlogged table must be reset to empty due to a crash, the
 665 initialization fork is copied over the main fork, and any other forks are
 666 erased (they will be recreated automatically as needed).
 667 </para>
 668
 669 </sect1>
 670
 671 <sect1 id="storage-page-layout">
 672
 673 <title>Database Page Layout</title>
 674
 675 <para>
 676 This section provides an overview of the page format used within
 677 <productname>PostgreSQL</productname> tables and indexes.<footnote>
 678   <para>
 679     Actually, index access methods need not use this page format.
 680     All the existing index methods do use this basic format,
 681     but the data kept on index metapages usually doesn't follow
 682     the item layout rules.
 683   </para>
 684 </footnote>
 685 Sequences and <acronym>TOAST</> tables are formatted just like a regular table.
 686 </para>
 687
 688 <para>
 689 In the following explanation, a
 690 <firstterm>byte</firstterm>
 691 is assumed to contain 8 bits.  In addition, the term
 692 <firstterm>item</firstterm>
 693 refers to an individual data value that is stored on a page.  In a table,
 694 an item is a row; in an index, an item is an index entry.
 695 </para>
 696
 697 <para>
 698 Every table and index is stored as an array of <firstterm>pages</> of a
 699 fixed size (usually 8 kB, although a different page size can be selected
 700 when compiling the server).  In a table, all the pages are logically
 701 equivalent, so a particular item (row) can be stored in any page.  In
 702 indexes, the first page is generally reserved as a <firstterm>metapage</>
 703 holding control information, and there can be different types of pages
 704 within the index, depending on the index access method.
 705 </para>
 706
 707 <para>
 708 <xref linkend="page-table"> shows the overall layout of a page.
 709 There are five parts to each page.
 710 </para>
 711
 712 <table tocentry="1" id="page-table">
 713 <title>Overall Page Layout</title>
 714 <titleabbrev>Page Layout</titleabbrev>
 715 <tgroup cols="2">
 716 <thead>
 717 <row>
 718 <entry>
 719 Item
 720 </entry>
 721 <entry>Description</entry>
 722 </row>
 723 </thead>
 724
 725 <tbody>
 726
 727 <row>
 728  <entry>PageHeaderData</entry>
 729  <entry>24 bytes long. Contains general information about the page, including
 730 free space pointers.</entry>
 731 </row>
 732
 733 <row>
 734 <entry>ItemIdData</entry>
 735 <entry>Array of (offset,length) pairs pointing to the actual items.
 736 4 bytes per item.</entry>
 737 </row>
 738
 739 <row>
 740 <entry>Free space</entry>
 741 <entry>The unallocated space. New item pointers are allocated from the start
 742 of this area, new items from the end.</entry>
 743 </row>
 744
 745 <row>
 746 <entry>Items</entry>
 747 <entry>The actual items themselves.</entry>
 748 </row>
 749
 750 <row>
 751 <entry>Special space</entry>
 752 <entry>Index access method specific data. Different methods store different
 753 data. Empty in ordinary tables.</entry>
 754 </row>
 755
 756 </tbody>
 757 </tgroup>
 758 </table>
 759
 760  <para>
 761
 762   The first 24 bytes of each page consists of a page header
 763   (<structname>PageHeaderData</>). Its format is detailed in <xref
 764   linkend="pageheaderdata-table">. The first field tracks the most
 765   recent WAL entry related to this page. The second field contains
 766   the page checksum if <xref linkend="app-initdb-data-checksums"> are
 767   enabled.  Next is a 2-byte field containing flag bits. This is followed
 768   by three 2-byte integer fields (<structfield>pd_lower</structfield>,
 769   <structfield>pd_upper</structfield>, and
 770   <structfield>pd_special</structfield>).  These contain byte offsets
 771   from the page start to the start of unallocated space, to the end of
 772   unallocated space, and to the start of the special space.  The next 2
 773   bytes of the page header, <structfield>pd_pagesize_version</structfield>,
 774   store both the page size and a version indicator.  Beginning with
 775   <productname>PostgreSQL</productname> 8.3 the version number is 4;
 776   <productname>PostgreSQL</productname> 8.1 and 8.2 used version number 3;
 777   <productname>PostgreSQL</productname> 8.0 used version number 2;
 778   <productname>PostgreSQL</productname> 7.3 and 7.4 used version number 1;
 779   prior releases used version number 0.
 780   (The basic page layout and header format has not changed in most of these
 781   versions, but the layout of heap row headers has.)  The page size
 782   is basically only present as a cross-check; there is no support for having
 783   more than one page size in an installation.
 784   The last field is a hint that shows whether pruning the page is likely
 785   to be profitable: it tracks the oldest un-pruned XMAX on the page.
 786
 787  </para>
 788
 789  <table tocentry="1" id="pageheaderdata-table">
 790  <title>PageHeaderData Layout</title>
 791  <titleabbrev>PageHeaderData Layout</titleabbrev>
 792  <tgroup cols="4">
 793  <thead>
 794   <row>
 795    <entry>Field</entry>
 796    <entry>Type</entry>
 797    <entry>Length</entry>
 798    <entry>Description</entry>
 799   </row>
 800  </thead>
 801  <tbody>
 802   <row>
 803    <entry>pd_lsn</entry>
 804    <entry>PageXLogRecPtr</entry>
 805    <entry>8 bytes</entry>
 806    <entry>LSN: next byte after last byte of xlog record for last change
 807    to this page</entry>
 808   </row>
 809   <row>
 810    <entry>pd_checksum</entry>
 811    <entry>uint16</entry>
 812    <entry>2 bytes</entry>
 813    <entry>Page checksum</entry>
 814   </row>
 815   <row>
 816    <entry>pd_flags</entry>
 817    <entry>uint16</entry>
 818    <entry>2 bytes</entry>
 819    <entry>Flag bits</entry>
 820   </row>
 821   <row>
 822    <entry>pd_lower</entry>
 823    <entry>LocationIndex</entry>
 824    <entry>2 bytes</entry>
 825    <entry>Offset to start of free space</entry>
 826   </row>
 827   <row>
 828    <entry>pd_upper</entry>
 829    <entry>LocationIndex</entry>
 830    <entry>2 bytes</entry>
 831    <entry>Offset to end of free space</entry>
 832   </row>
 833   <row>
 834    <entry>pd_special</entry>
 835    <entry>LocationIndex</entry>
 836    <entry>2 bytes</entry>
 837    <entry>Offset to start of special space</entry>
 838   </row>
 839   <row>
 840    <entry>pd_pagesize_version</entry>
 841    <entry>uint16</entry>
 842    <entry>2 bytes</entry>
 843    <entry>Page size and layout version number information</entry>
 844   </row>
 845   <row>
 846    <entry>pd_prune_xid</entry>
 847    <entry>TransactionId</entry>
 848    <entry>4 bytes</entry>
 849    <entry>Oldest unpruned XMAX on page, or zero if none</entry>
 850   </row>
 851  </tbody>
 852  </tgroup>
 853  </table>
 854
 855  <para>
 856   All the details can be found in
 857   <filename>src/include/storage/bufpage.h</filename>.
 858  </para>
 859
 860  <para>
 861
 862   Following the page header are item identifiers
 863   (<type>ItemIdData</type>), each requiring four bytes.
 864   An item identifier contains a byte-offset to
 865   the start of an item, its length in bytes, and a few attribute bits
 866   which affect its interpretation.
 867   New item identifiers are allocated
 868   as needed from the beginning of the unallocated space.
 869   The number of item identifiers present can be determined by looking at
 870   <structfield>pd_lower</>, which is increased to allocate a new identifier.
 871   Because an item
 872   identifier is never moved until it is freed, its index can be used on a
 873   long-term basis to reference an item, even when the item itself is moved
 874   around on the page to compact free space.  In fact, every pointer to an
 875   item (<type>ItemPointer</type>, also known as
 876   <type>CTID</type>) created by
 877   <productname>PostgreSQL</productname> consists of a page number and the
 878   index of an item identifier.
 879
 880  </para>
 881
 882  <para>
 883
 884   The items themselves are stored in space allocated backwards from the end
 885   of unallocated space.  The exact structure varies depending on what the
 886   table is to contain. Tables and sequences both use a structure named
 887   <type>HeapTupleHeaderData</type>, described below.
 888
 889  </para>
 890
 891  <para>
 892
 893   The final section is the <quote>special section</quote> which can
 894  contain anything the access method wishes to store.  For example,
 895   b-tree indexes store links to the page's left and right siblings,
 896   as well as some other data relevant to the index structure.
 897   Ordinary tables do not use a special section at all (indicated by setting
 898   <structfield>pd_special</> to equal the page size).
 899
 900  </para>
 901
 902  <para>
 903
 904   All table rows are structured in the same way. There is a fixed-size
 905   header (occupying 23 bytes on most machines), followed by an optional null
 906   bitmap, an optional object ID field, and the user data. The header is
 907   detailed
 908   in <xref linkend="heaptupleheaderdata-table">.  The actual user data
 909   (columns of the row) begins at the offset indicated by
 910   <structfield>t_hoff</>, which must always be a multiple of the MAXALIGN
 911   distance for the platform.
 912   The null bitmap is
 913   only present if the <firstterm>HEAP_HASNULL</firstterm> bit is set in
 914   <structfield>t_infomask</structfield>. If it is present it begins just after
 915   the fixed header and occupies enough bytes to have one bit per data column
 916   (that is, <structfield>t_natts</> bits altogether). In this list of bits, a
 917   1 bit indicates not-null, a 0 bit is a null.  When the bitmap is not
 918   present, all columns are assumed not-null.
 919   The object ID is only present if the <firstterm>HEAP_HASOID</firstterm> bit
 920   is set in <structfield>t_infomask</structfield>.  If present, it appears just
 921   before the <structfield>t_hoff</> boundary.  Any padding needed to make
 922   <structfield>t_hoff</> a MAXALIGN multiple will appear between the null
 923   bitmap and the object ID.  (This in turn ensures that the object ID is
 924   suitably aligned.)
 925
 926  </para>
 927
 928  <table tocentry="1" id="heaptupleheaderdata-table">
 929  <title>HeapTupleHeaderData Layout</title>
 930  <titleabbrev>HeapTupleHeaderData Layout</titleabbrev>
 931  <tgroup cols="4">
 932  <thead>
 933   <row>
 934    <entry>Field</entry>
 935    <entry>Type</entry>
 936    <entry>Length</entry>
 937    <entry>Description</entry>
 938   </row>
 939  </thead>
 940  <tbody>
 941   <row>
 942    <entry>t_xmin</entry>
 943    <entry>TransactionId</entry>
 944    <entry>4 bytes</entry>
 945    <entry>insert XID stamp</entry>
 946   </row>
 947   <row>
 948    <entry>t_xmax</entry>
 949    <entry>TransactionId</entry>
 950    <entry>4 bytes</entry>
 951    <entry>delete XID stamp</entry>
 952   </row>
 953   <row>
 954    <entry>t_cid</entry>
 955    <entry>CommandId</entry>
 956    <entry>4 bytes</entry>
 957    <entry>insert and/or delete CID stamp (overlays with t_xvac)</entry>
 958   </row>
 959   <row>
 960    <entry>t_xvac</entry>
 961    <entry>TransactionId</entry>
 962    <entry>4 bytes</entry>
 963    <entry>XID for VACUUM operation moving a row version</entry>
 964   </row>
 965   <row>
 966    <entry>t_ctid</entry>
 967    <entry>ItemPointerData</entry>
 968    <entry>6 bytes</entry>
 969    <entry>current TID of this or newer row version</entry>
 970   </row>
 971   <row>
 972    <entry>t_infomask2</entry>
 973    <entry>uint16</entry>
 974    <entry>2 bytes</entry>
 975    <entry>number of attributes, plus various flag bits</entry>
 976   </row>
 977   <row>
 978    <entry>t_infomask</entry>
 979    <entry>uint16</entry>
 980    <entry>2 bytes</entry>
 981    <entry>various flag bits</entry>
 982   </row>
 983   <row>
 984    <entry>t_hoff</entry>
 985    <entry>uint8</entry>
 986    <entry>1 byte</entry>
 987    <entry>offset to user data</entry>
 988   </row>
 989  </tbody>
 990  </tgroup>
 991  </table>
 992
 993  <para>
 994    All the details can be found in
 995    <filename>src/include/access/htup_details.h</filename>.
 996  </para>
 997
 998  <para>
 999
1000   Interpreting the actual data can only be done with information obtained
1001   from other tables, mostly <structname>pg_attribute</structname>. The
1002   key values needed to identify field locations are
1003   <structfield>attlen</structfield> and <structfield>attalign</structfield>.
1004   There is no way to directly get a
1005   particular attribute, except when there are only fixed width fields and no
1006   null values. All this trickery is wrapped up in the functions
1007   <firstterm>heap_getattr</firstterm>, <firstterm>fastgetattr</firstterm>
1008   and <firstterm>heap_getsysattr</firstterm>.
1009
1010  </para>
1011  <para>
1012
1013   To read the data you need to examine each attribute in turn. First check
1014   whether the field is NULL according to the null bitmap. If it is, go to
1015   the next. Then make sure you have the right alignment.  If the field is a
1016   fixed width field, then all the bytes are simply placed. If it's a
1017   variable length field (attlen = -1) then it's a bit more complicated.
1018   All variable-length data types share the common header structure
1019   <type>struct varlena</type>, which includes the total length of the stored
1020   value and some flag bits.  Depending on the flags, the data can be either
1021   inline or in a <acronym>TOAST</> table;
1022   it might be compressed, too (see <xref linkend="storage-toast">).
1023
1024  </para>
1025 </sect1>
1026
1027 </chapter>