granicus.if.org Git - postgresql/blob - src/backend/access/transam/README

   1 src/backend/access/transam/README
   2
   3 The Transaction System
   4 ======================
   5
   6 PostgreSQL's transaction system is a three-layer system.  The bottom layer
   7 implements low-level transactions and subtransactions, on top of which rests
   8 the mainloop's control code, which in turn implements user-visible
   9 transactions and savepoints.
  10
  11 The middle layer of code is called by postgres.c before and after the
  12 processing of each query, or after detecting an error:
  13
  14                 StartTransactionCommand
  15                 CommitTransactionCommand
  16                 AbortCurrentTransaction
  17
  18 Meanwhile, the user can alter the system's state by issuing the SQL commands
  19 BEGIN, COMMIT, ROLLBACK, SAVEPOINT, ROLLBACK TO or RELEASE.  The traffic cop
  20 redirects these calls to the toplevel routines
  21
  22                 BeginTransactionBlock
  23                 EndTransactionBlock
  24                 UserAbortTransactionBlock
  25                 DefineSavepoint
  26                 RollbackToSavepoint
  27                 ReleaseSavepoint
  28
  29 respectively.  Depending on the current state of the system, these functions
  30 call low level functions to activate the real transaction system:
  31
  32                 StartTransaction
  33                 CommitTransaction
  34                 AbortTransaction
  35                 CleanupTransaction
  36                 StartSubTransaction
  37                 CommitSubTransaction
  38                 AbortSubTransaction
  39                 CleanupSubTransaction
  40
  41 Additionally, within a transaction, CommandCounterIncrement is called to
  42 increment the command counter, which allows future commands to "see" the
  43 effects of previous commands within the same transaction.  Note that this is
  44 done automatically by CommitTransactionCommand after each query inside a
  45 transaction block, but some utility functions also do it internally to allow
  46 some operations (usually in the system catalogs) to be seen by future
  47 operations in the same utility command.  (For example, in DefineRelation it is
  48 done after creating the heap so the pg_class row is visible, to be able to
  49 lock it.)
  50
  51
  52 For example, consider the following sequence of user commands:
  53
  54 1)              BEGIN
  55 2)              SELECT * FROM foo
  56 3)              INSERT INTO foo VALUES (...)
  57 4)              COMMIT
  58
  59 In the main processing loop, this results in the following function call
  60 sequence:
  61
  62      /  StartTransactionCommand;
  63     /       StartTransaction;
  64 1) <        ProcessUtility;             << BEGIN
  65     \       BeginTransactionBlock;
  66      \  CommitTransactionCommand;
  67
  68     /   StartTransactionCommand;
  69 2) /        ProcessQuery;               << SELECT ...
  70    \        CommitTransactionCommand;
  71     \       CommandCounterIncrement;
  72
  73     /   StartTransactionCommand;
  74 3) /        ProcessQuery;               << INSERT ...
  75    \        CommitTransactionCommand;
  76     \       CommandCounterIncrement;
  77
  78      /  StartTransactionCommand;
  79     /   ProcessUtility;                 << COMMIT
  80 4) <            EndTransactionBlock;
  81     \   CommitTransactionCommand;
  82      \      CommitTransaction;
  83
  84 The point of this example is to demonstrate the need for
  85 StartTransactionCommand and CommitTransactionCommand to be state smart -- they
  86 should call CommandCounterIncrement between the calls to BeginTransactionBlock
  87 and EndTransactionBlock and outside these calls they need to do normal start,
  88 commit or abort processing.
  89
  90 Furthermore, suppose the "SELECT * FROM foo" caused an abort condition. In
  91 this case AbortCurrentTransaction is called, and the transaction is put in
  92 aborted state.  In this state, any user input is ignored except for
  93 transaction-termination statements, or ROLLBACK TO <savepoint> commands.
  94
  95 Transaction aborts can occur in two ways:
  96
  97 1) system dies from some internal cause  (syntax error, etc)
  98 2) user types ROLLBACK
  99
 100 The reason we have to distinguish them is illustrated by the following two
 101 situations:
 102
 103         case 1                                  case 2
 104         ------                                  ------
 105 1) user types BEGIN                     1) user types BEGIN
 106 2) user does something                  2) user does something
 107 3) user does not like what              3) system aborts for some reason
 108    she sees and types ABORT                (syntax error, etc)
 109
 110 In case 1, we want to abort the transaction and return to the default state.
 111 In case 2, there may be more commands coming our way which are part of the
 112 same transaction block; we have to ignore these commands until we see a COMMIT
 113 or ROLLBACK.
 114
 115 Internal aborts are handled by AbortCurrentTransaction, while user aborts are
 116 handled by UserAbortTransactionBlock.  Both of them rely on AbortTransaction
 117 to do all the real work.  The only difference is what state we enter after
 118 AbortTransaction does its work:
 119
 120 * AbortCurrentTransaction leaves us in TBLOCK_ABORT,
 121 * UserAbortTransactionBlock leaves us in TBLOCK_ABORT_END
 122
 123 Low-level transaction abort handling is divided in two phases:
 124 * AbortTransaction executes as soon as we realize the transaction has
 125   failed.  It should release all shared resources (locks etc) so that we do
 126   not delay other backends unnecessarily.
 127 * CleanupTransaction executes when we finally see a user COMMIT
 128   or ROLLBACK command; it cleans things up and gets us out of the transaction
 129   completely.  In particular, we mustn't destroy TopTransactionContext until
 130   this point.
 131
 132 Also, note that when a transaction is committed, we don't close it right away.
 133 Rather it's put in TBLOCK_END state, which means that when
 134 CommitTransactionCommand is called after the query has finished processing,
 135 the transaction has to be closed.  The distinction is subtle but important,
 136 because it means that control will leave the xact.c code with the transaction
 137 open, and the main loop will be able to keep processing inside the same
 138 transaction.  So, in a sense, transaction commit is also handled in two
 139 phases, the first at EndTransactionBlock and the second at
 140 CommitTransactionCommand (which is where CommitTransaction is actually
 141 called).
 142
 143 The rest of the code in xact.c are routines to support the creation and
 144 finishing of transactions and subtransactions.  For example, AtStart_Memory
 145 takes care of initializing the memory subsystem at main transaction start.
 146
 147
 148 Subtransaction Handling
 149 -----------------------
 150
 151 Subtransactions are implemented using a stack of TransactionState structures,
 152 each of which has a pointer to its parent transaction's struct.  When a new
 153 subtransaction is to be opened, PushTransaction is called, which creates a new
 154 TransactionState, with its parent link pointing to the current transaction.
 155 StartSubTransaction is in charge of initializing the new TransactionState to
 156 sane values, and properly initializing other subsystems (AtSubStart routines).
 157
 158 When closing a subtransaction, either CommitSubTransaction has to be called
 159 (if the subtransaction is committing), or AbortSubTransaction and
 160 CleanupSubTransaction (if it's aborting).  In either case, PopTransaction is
 161 called so the system returns to the parent transaction.
 162
 163 One important point regarding subtransaction handling is that several may need
 164 to be closed in response to a single user command.  That's because savepoints
 165 have names, and we allow to commit or rollback a savepoint by name, which is
 166 not necessarily the one that was last opened.  Also a COMMIT or ROLLBACK
 167 command must be able to close out the entire stack.  We handle this by having
 168 the utility command subroutine mark all the state stack entries as commit-
 169 pending or abort-pending, and then when the main loop reaches
 170 CommitTransactionCommand, the real work is done.  The main point of doing
 171 things this way is that if we get an error while popping state stack entries,
 172 the remaining stack entries still show what we need to do to finish up.
 173
 174 In the case of ROLLBACK TO <savepoint>, we abort all the subtransactions up
 175 through the one identified by the savepoint name, and then re-create that
 176 subtransaction level with the same name.  So it's a completely new
 177 subtransaction as far as the internals are concerned.
 178
 179 Other subsystems are allowed to start "internal" subtransactions, which are
 180 handled by BeginInternalSubtransaction.  This is to allow implementing
 181 exception handling, e.g. in PL/pgSQL.  ReleaseCurrentSubTransaction and
 182 RollbackAndReleaseCurrentSubTransaction allows the subsystem to close said
 183 subtransactions.  The main difference between this and the savepoint/release
 184 path is that we execute the complete state transition immediately in each
 185 subroutine, rather than deferring some work until CommitTransactionCommand.
 186 Another difference is that BeginInternalSubtransaction is allowed when no
 187 explicit transaction block has been established, while DefineSavepoint is not.
 188
 189
 190 Transaction and Subtransaction Numbering
 191 ----------------------------------------
 192
 193 Transactions and subtransactions are assigned permanent XIDs only when/if
 194 they first do something that requires one --- typically, insert/update/delete
 195 a tuple, though there are a few other places that need an XID assigned.
 196 If a subtransaction requires an XID, we always first assign one to its
 197 parent.  This maintains the invariant that child transactions have XIDs later
 198 than their parents, which is assumed in a number of places.
 199
 200 The subsidiary actions of obtaining a lock on the XID and entering it into
 201 pg_subtrans and PG_PROC are done at the time it is assigned.
 202
 203 A transaction that has no XID still needs to be identified for various
 204 purposes, notably holding locks.  For this purpose we assign a "virtual
 205 transaction ID" or VXID to each top-level transaction.  VXIDs are formed from
 206 two fields, the backendID and a backend-local counter; this arrangement allows
 207 assignment of a new VXID at transaction start without any contention for
 208 shared memory.  To ensure that a VXID isn't re-used too soon after backend
 209 exit, we store the last local counter value into shared memory at backend
 210 exit, and initialize it from the previous value for the same backendID slot
 211 at backend start.  All these counters go back to zero at shared memory
 212 re-initialization, but that's OK because VXIDs never appear anywhere on-disk.
 213
 214 Internally, a backend needs a way to identify subtransactions whether or not
 215 they have XIDs; but this need only lasts as long as the parent top transaction
 216 endures.  Therefore, we have SubTransactionId, which is somewhat like
 217 CommandId in that it's generated from a counter that we reset at the start of
 218 each top transaction.  The top-level transaction itself has SubTransactionId 1,
 219 and subtransactions have IDs 2 and up.  (Zero is reserved for
 220 InvalidSubTransactionId.)  Note that subtransactions do not have their
 221 own VXIDs; they use the parent top transaction's VXID.
 222
 223
 224 Interlocking Transaction Begin, Transaction End, and Snapshots
 225 --------------------------------------------------------------
 226
 227 We try hard to minimize the amount of overhead and lock contention involved
 228 in the frequent activities of beginning/ending a transaction and taking a
 229 snapshot.  Unfortunately, we must have some interlocking for this, because
 230 we must ensure consistency about the commit order of transactions.
 231 For example, suppose an UPDATE in xact A is blocked by xact B's prior
 232 update of the same row, and xact B is doing commit while xact C gets a
 233 snapshot.  Xact A can complete and commit as soon as B releases its locks.
 234 If xact C's GetSnapshotData sees xact B as still running, then it had
 235 better see xact A as still running as well, or it will be able to see two
 236 tuple versions - one deleted by xact B and one inserted by xact A.  Another
 237 reason why this would be bad is that C would see (in the row inserted by A)
 238 earlier changes by B, and it would be inconsistent for C not to see any
 239 of B's changes elsewhere in the database.
 240
 241 Formally, the correctness requirement is "if a snapshot A considers
 242 transaction X as committed, and any of transaction X's snapshots considered
 243 transaction Y as committed, then snapshot A must consider transaction Y as
 244 committed".
 245
 246 What we actually enforce is strict serialization of commits and rollbacks
 247 with snapshot-taking: we do not allow any transaction to exit the set of
 248 running transactions while a snapshot is being taken.  (This rule is
 249 stronger than necessary for consistency, but is relatively simple to
 250 enforce, and it assists with some other issues as explained below.)  The
 251 implementation of this is that GetSnapshotData takes the ProcArrayLock in
 252 shared mode (so that multiple backends can take snapshots in parallel),
 253 but ProcArrayEndTransaction must take the ProcArrayLock in exclusive mode
 254 while clearing MyPgXact->xid at transaction end (either commit or abort).
 255
 256 ProcArrayEndTransaction also holds the lock while advancing the shared
 257 latestCompletedXid variable.  This allows GetSnapshotData to use
 258 latestCompletedXid + 1 as xmax for its snapshot: there can be no
 259 transaction >= this xid value that the snapshot needs to consider as
 260 completed.
 261
 262 In short, then, the rule is that no transaction may exit the set of
 263 currently-running transactions between the time we fetch latestCompletedXid
 264 and the time we finish building our snapshot.  However, this restriction
 265 only applies to transactions that have an XID --- read-only transactions
 266 can end without acquiring ProcArrayLock, since they don't affect anyone
 267 else's snapshot nor latestCompletedXid.
 268
 269 Transaction start, per se, doesn't have any interlocking with these
 270 considerations, since we no longer assign an XID immediately at transaction
 271 start.  But when we do decide to allocate an XID, GetNewTransactionId must
 272 store the new XID into the shared ProcArray before releasing XidGenLock.
 273 This ensures that all top-level XIDs <= latestCompletedXid are either
 274 present in the ProcArray, or not running anymore.  (This guarantee doesn't
 275 apply to subtransaction XIDs, because of the possibility that there's not
 276 room for them in the subxid array; instead we guarantee that they are
 277 present or the overflow flag is set.)  If a backend released XidGenLock
 278 before storing its XID into MyPgXact, then it would be possible for another
 279 backend to allocate and commit a later XID, causing latestCompletedXid to
 280 pass the first backend's XID, before that value became visible in the
 281 ProcArray.  That would break GetOldestXmin, as discussed below.
 282
 283 We allow GetNewTransactionId to store the XID into MyPgXact->xid (or the
 284 subxid array) without taking ProcArrayLock.  This was once necessary to
 285 avoid deadlock; while that is no longer the case, it's still beneficial for
 286 performance.  We are thereby relying on fetch/store of an XID to be atomic,
 287 else other backends might see a partially-set XID.  This also means that
 288 readers of the ProcArray xid fields must be careful to fetch a value only
 289 once, rather than assume they can read it multiple times and get the same
 290 answer each time.  (Use volatile-qualified pointers when doing this, to
 291 ensure that the C compiler does exactly what you tell it to.)
 292
 293 Another important activity that uses the shared ProcArray is GetOldestXmin,
 294 which must determine a lower bound for the oldest xmin of any active MVCC
 295 snapshot, system-wide.  Each individual backend advertises the smallest
 296 xmin of its own snapshots in MyPgXact->xmin, or zero if it currently has no
 297 live snapshots (eg, if it's between transactions or hasn't yet set a
 298 snapshot for a new transaction).  GetOldestXmin takes the MIN() of the
 299 valid xmin fields.  It does this with only shared lock on ProcArrayLock,
 300 which means there is a potential race condition against other backends
 301 doing GetSnapshotData concurrently: we must be certain that a concurrent
 302 backend that is about to set its xmin does not compute an xmin less than
 303 what GetOldestXmin returns.  We ensure that by including all the active
 304 XIDs into the MIN() calculation, along with the valid xmins.  The rule that
 305 transactions can't exit without taking exclusive ProcArrayLock ensures that
 306 concurrent holders of shared ProcArrayLock will compute the same minimum of
 307 currently-active XIDs: no xact, in particular not the oldest, can exit
 308 while we hold shared ProcArrayLock.  So GetOldestXmin's view of the minimum
 309 active XID will be the same as that of any concurrent GetSnapshotData, and
 310 so it can't produce an overestimate.  If there is no active transaction at
 311 all, GetOldestXmin returns latestCompletedXid + 1, which is a lower bound
 312 for the xmin that might be computed by concurrent or later GetSnapshotData
 313 calls.  (We know that no XID less than this could be about to appear in
 314 the ProcArray, because of the XidGenLock interlock discussed above.)
 315
 316 GetSnapshotData also performs an oldest-xmin calculation (which had better
 317 match GetOldestXmin's) and stores that into RecentGlobalXmin, which is used
 318 for some tuple age cutoff checks where a fresh call of GetOldestXmin seems
 319 too expensive.  Note that while it is certain that two concurrent
 320 executions of GetSnapshotData will compute the same xmin for their own
 321 snapshots, as argued above, it is not certain that they will arrive at the
 322 same estimate of RecentGlobalXmin.  This is because we allow XID-less
 323 transactions to clear their MyPgXact->xmin asynchronously (without taking
 324 ProcArrayLock), so one execution might see what had been the oldest xmin,
 325 and another not.  This is OK since RecentGlobalXmin need only be a valid
 326 lower bound.  As noted above, we are already assuming that fetch/store
 327 of the xid fields is atomic, so assuming it for xmin as well is no extra
 328 risk.
 329
 330
 331 pg_clog and pg_subtrans
 332 -----------------------
 333
 334 pg_clog and pg_subtrans are permanent (on-disk) storage of transaction related
 335 information.  There is a limited number of pages of each kept in memory, so
 336 in many cases there is no need to actually read from disk.  However, if
 337 there's a long running transaction or a backend sitting idle with an open
 338 transaction, it may be necessary to be able to read and write this information
 339 from disk.  They also allow information to be permanent across server restarts.
 340
 341 pg_clog records the commit status for each transaction that has been assigned
 342 an XID.  A transaction can be in progress, committed, aborted, or
 343 "sub-committed".  This last state means that it's a subtransaction that's no
 344 longer running, but its parent has not updated its state yet.  It is not
 345 necessary to update a subtransaction's transaction status to subcommit, so we
 346 can just defer it until main transaction commit.  The main role of marking
 347 transactions as sub-committed is to provide an atomic commit protocol when
 348 transaction status is spread across multiple clog pages. As a result, whenever
 349 transaction status spreads across multiple pages we must use a two-phase commit
 350 protocol: the first phase is to mark the subtransactions as sub-committed, then
 351 we mark the top level transaction and all its subtransactions committed (in
 352 that order).  Thus, subtransactions that have not aborted appear as in-progress
 353 even when they have already finished, and the subcommit status appears as a
 354 very short transitory state during main transaction commit.  Subtransaction
 355 abort is always marked in clog as soon as it occurs.  When the transaction
 356 status all fit in a single CLOG page, we atomically mark them all as committed
 357 without bothering with the intermediate sub-commit state.
 358
 359 Savepoints are implemented using subtransactions.  A subtransaction is a
 360 transaction inside a transaction; its commit or abort status is not only
 361 dependent on whether it committed itself, but also whether its parent
 362 transaction committed.  To implement multiple savepoints in a transaction we
 363 allow unlimited transaction nesting depth, so any particular subtransaction's
 364 commit state is dependent on the commit status of each and every ancestor
 365 transaction.
 366
 367 The "subtransaction parent" (pg_subtrans) mechanism records, for each
 368 transaction with an XID, the TransactionId of its parent transaction.  This
 369 information is stored as soon as the subtransaction is assigned an XID.
 370 Top-level transactions do not have a parent, so they leave their pg_subtrans
 371 entries set to the default value of zero (InvalidTransactionId).
 372
 373 pg_subtrans is used to check whether the transaction in question is still
 374 running --- the main Xid of a transaction is recorded in the PGXACT struct,
 375 but since we allow arbitrary nesting of subtransactions, we can't fit all Xids
 376 in shared memory, so we have to store them on disk.  Note, however, that for
 377 each transaction we keep a "cache" of Xids that are known to be part of the
 378 transaction tree, so we can skip looking at pg_subtrans unless we know the
 379 cache has been overflowed.  See storage/ipc/procarray.c for the gory details.
 380
 381 slru.c is the supporting mechanism for both pg_clog and pg_subtrans.  It
 382 implements the LRU policy for in-memory buffer pages.  The high-level routines
 383 for pg_clog are implemented in transam.c, while the low-level functions are in
 384 clog.c.  pg_subtrans is contained completely in subtrans.c.
 385
 386
 387 Write-Ahead Log Coding
 388 ----------------------
 389
 390 The WAL subsystem (also called XLOG in the code) exists to guarantee crash
 391 recovery.  It can also be used to provide point-in-time recovery, as well as
 392 hot-standby replication via log shipping.  Here are some notes about
 393 non-obvious aspects of its design.
 394
 395 A basic assumption of a write AHEAD log is that log entries must reach stable
 396 storage before the data-page changes they describe.  This ensures that
 397 replaying the log to its end will bring us to a consistent state where there
 398 are no partially-performed transactions.  To guarantee this, each data page
 399 (either heap or index) is marked with the LSN (log sequence number --- in
 400 practice, a WAL file location) of the latest XLOG record affecting the page.
 401 Before the bufmgr can write out a dirty page, it must ensure that xlog has
 402 been flushed to disk at least up to the page's LSN.  This low-level
 403 interaction improves performance by not waiting for XLOG I/O until necessary.
 404 The LSN check exists only in the shared-buffer manager, not in the local
 405 buffer manager used for temp tables; hence operations on temp tables must not
 406 be WAL-logged.
 407
 408 During WAL replay, we can check the LSN of a page to detect whether the change
 409 recorded by the current log entry is already applied (it has been, if the page
 410 LSN is >= the log entry's WAL location).
 411
 412 Usually, log entries contain just enough information to redo a single
 413 incremental update on a page (or small group of pages).  This will work only
 414 if the filesystem and hardware implement data page writes as atomic actions,
 415 so that a page is never left in a corrupt partly-written state.  Since that's
 416 often an untenable assumption in practice, we log additional information to
 417 allow complete reconstruction of modified pages.  The first WAL record
 418 affecting a given page after a checkpoint is made to contain a copy of the
 419 entire page, and we implement replay by restoring that page copy instead of
 420 redoing the update.  (This is more reliable than the data storage itself would
 421 be because we can check the validity of the WAL record's CRC.)  We can detect
 422 the "first change after checkpoint" by noting whether the page's old LSN
 423 precedes the end of WAL as of the last checkpoint (the RedoRecPtr).
 424
 425 The general schema for executing a WAL-logged action is
 426
 427 1. Pin and exclusive-lock the shared buffer(s) containing the data page(s)
 428 to be modified.
 429
 430 2. START_CRIT_SECTION()  (Any error during the next three steps must cause a
 431 PANIC because the shared buffers will contain unlogged changes, which we
 432 have to ensure don't get to disk.  Obviously, you should check conditions
 433 such as whether there's enough free space on the page before you start the
 434 critical section.)
 435
 436 3. Apply the required changes to the shared buffer(s).
 437
 438 4. Mark the shared buffer(s) as dirty with MarkBufferDirty().  (This must
 439 happen before the WAL record is inserted; see notes in SyncOneBuffer().)
 440 Note that marking a buffer dirty with MarkBufferDirty() should only
 441 happen iff you write a WAL record; see Writing Hints below.
 442
 443 5. If the relation requires WAL-logging, build a WAL log record and pass it
 444 to XLogInsert(); then update the page's LSN using the returned XLOG
 445 location.  For instance,
 446
 447                 recptr = XLogInsert(rmgr_id, info, rdata);
 448
 449                 PageSetLSN(dp, recptr);
 450                 // Note that we no longer do PageSetTLI() from 9.3 onwards
 451                 // since that field on a page has now changed its meaning.
 452
 453 6. END_CRIT_SECTION()
 454
 455 7. Unlock and unpin the buffer(s).
 456
 457 XLogInsert's "rdata" argument is an array of pointer/size items identifying
 458 chunks of data to be written in the XLOG record, plus optional shared-buffer
 459 IDs for chunks that are in shared buffers rather than temporary variables.
 460 The "rdata" array must mention (at least once) each of the shared buffers
 461 being modified, unless the action is such that the WAL replay routine can
 462 reconstruct the entire page contents.  XLogInsert includes the logic that
 463 tests to see whether a shared buffer has been modified since the last
 464 checkpoint.  If not, the entire page contents are logged rather than just the
 465 portion(s) pointed to by "rdata".
 466
 467 Because XLogInsert drops the rdata components associated with buffers it
 468 chooses to log in full, the WAL replay routines normally need to test to see
 469 which buffers were handled that way --- otherwise they may be misled about
 470 what the XLOG record actually contains.  XLOG records that describe multi-page
 471 changes therefore require some care to design: you must be certain that you
 472 know what data is indicated by each "BKP" bit.  An example of the trickiness
 473 is that in a HEAP_UPDATE record, BKP(0) normally is associated with the source
 474 page and BKP(1) is associated with the destination page --- but if these are
 475 the same page, only BKP(0) would have been set.
 476
 477 For this reason as well as the risk of deadlocking on buffer locks, it's best
 478 to design WAL records so that they reflect small atomic actions involving just
 479 one or a few pages.  The current XLOG infrastructure cannot handle WAL records
 480 involving references to more than four shared buffers, anyway.
 481
 482 In the case where the WAL record contains enough information to re-generate
 483 the entire contents of a page, do *not* show that page's buffer ID in the
 484 rdata array, even if some of the rdata items point into the buffer.  This is
 485 because you don't want XLogInsert to log the whole page contents.  The
 486 standard replay-routine pattern for this case is
 487
 488         buffer = XLogReadBuffer(rnode, blkno, true);
 489         Assert(BufferIsValid(buffer));
 490         page = (Page) BufferGetPage(buffer);
 491
 492         ... initialize the page ...
 493
 494         PageSetLSN(page, lsn);
 495         MarkBufferDirty(buffer);
 496         UnlockReleaseBuffer(buffer);
 497
 498 In the case where the WAL record provides only enough information to
 499 incrementally update the page, the rdata array *must* mention the buffer
 500 ID at least once; otherwise there is no defense against torn-page problems.
 501 The standard replay-routine pattern for this case is
 502
 503         if (record->xl_info & XLR_BKP_BLOCK(N))
 504         {
 505                 /* apply the change from the full-page image */
 506                 (void) RestoreBackupBlock(lsn, record, N, false, false);
 507                 return;
 508         }
 509
 510         buffer = XLogReadBuffer(rnode, blkno, false);
 511         if (!BufferIsValid(buffer))
 512         {
 513                 /* page has been deleted, so we need do nothing */
 514                 return;
 515         }
 516         page = (Page) BufferGetPage(buffer);
 517
 518         if (XLByteLE(lsn, PageGetLSN(page)))
 519         {
 520                 /* changes are already applied */
 521                 UnlockReleaseBuffer(buffer);
 522                 return;
 523         }
 524
 525         ... apply the change ...
 526
 527         PageSetLSN(page, lsn);
 528         MarkBufferDirty(buffer);
 529         UnlockReleaseBuffer(buffer);
 530
 531 As noted above, for a multi-page update you need to be able to determine
 532 which XLR_BKP_BLOCK(N) flag applies to each page.  If a WAL record reflects
 533 a combination of fully-rewritable and incremental updates, then the rewritable
 534 pages don't count for the XLR_BKP_BLOCK(N) numbering.  (XLR_BKP_BLOCK(N) is
 535 associated with the N'th distinct buffer ID seen in the "rdata" array, and
 536 per the above discussion, fully-rewritable buffers shouldn't be mentioned in
 537 "rdata".)
 538
 539 When replaying a WAL record that describes changes on multiple pages, you
 540 must be careful to lock the pages properly to prevent concurrent Hot Standby
 541 queries from seeing an inconsistent state.  If this requires that two
 542 or more buffer locks be held concurrently, the coding pattern shown above
 543 is too simplistic, since it assumes the routine can exit as soon as it's
 544 known the current page requires no modification.  Instead, you might have
 545 something like
 546
 547         if (record->xl_info & XLR_BKP_BLOCK(0))
 548         {
 549                 /* apply the change from the full-page image */
 550                 buffer0 = RestoreBackupBlock(lsn, record, 0, false, true);
 551         }
 552         else
 553         {
 554                 buffer0 = XLogReadBuffer(rnode, blkno, false);
 555                 if (BufferIsValid(buffer0))
 556                 {
 557                         ... apply the change if not already done ...
 558                         MarkBufferDirty(buffer0);
 559                 }
 560         }
 561
 562         ... similarly apply the changes for remaining pages ...
 563
 564         /* and now we can release the lock on the first page */
 565         if (BufferIsValid(buffer0))
 566                 UnlockReleaseBuffer(buffer0);
 567
 568 Note that we must only use PageSetLSN/PageGetLSN() when we know the action
 569 is serialised. Only Startup process may modify data blocks during recovery,
 570 so Startup process may execute PageGetLSN() without fear of serialisation
 571 problems. All other processes must only call PageSet/GetLSN when holding
 572 either an exclusive buffer lock or a shared lock plus buffer header lock,
 573 or be writing the data block directly rather than through shared buffers
 574 while holding AccessExclusiveLock on the relation.
 575
 576 Due to all these constraints, complex changes (such as a multilevel index
 577 insertion) normally need to be described by a series of atomic-action WAL
 578 records.  What do you do if the intermediate states are not self-consistent?
 579 The answer is that the WAL replay logic has to be able to fix things up.
 580 In btree indexes, for example, a page split requires insertion of a new key in
 581 the parent btree level, but for locking reasons this has to be reflected by
 582 two separate WAL records.  The replay code has to remember "unfinished" split
 583 operations, and match them up to subsequent insertions in the parent level.
 584 If no matching insert has been found by the time the WAL replay ends, the
 585 replay code has to do the insertion on its own to restore the index to
 586 consistency.  Such insertions occur after WAL is operational, so they can
 587 and should write WAL records for the additional generated actions.
 588
 589 Writing Hints
 590 -------------
 591
 592 In some cases, we write additional information to data blocks without
 593 writing a preceding WAL record. This should only happen iff the data can
 594 be reconstructed later following a crash and the action is simply a way
 595 of optimising for performance. When a hint is written we use
 596 MarkBufferDirtyHint() to mark the block dirty.
 597
 598 If the buffer is clean and checksums are in use then
 599 MarkBufferDirtyHint() inserts an XLOG_HINT record to ensure that we
 600 take a full page image that includes the hint. We do this to avoid
 601 a partial page write, when we write the dirtied page. WAL is not
 602 written during recovery, so we simply skip dirtying blocks because
 603 of hints when in recovery.
 604
 605 If you do decide to optimise away a WAL record, then any calls to
 606 MarkBufferDirty() must be replaced by MarkBufferDirtyHint(),
 607 otherwise you will expose the risk of partial page writes.
 608
 609
 610 Write-Ahead Logging for Filesystem Actions
 611 ------------------------------------------
 612
 613 The previous section described how to WAL-log actions that only change page
 614 contents within shared buffers.  For that type of action it is generally
 615 possible to check all likely error cases (such as insufficient space on the
 616 page) before beginning to make the actual change.  Therefore we can make
 617 the change and the creation of the associated WAL log record "atomic" by
 618 wrapping them into a critical section --- the odds of failure partway
 619 through are low enough that PANIC is acceptable if it does happen.
 620
 621 Clearly, that approach doesn't work for cases where there's a significant
 622 probability of failure within the action to be logged, such as creation
 623 of a new file or database.  We don't want to PANIC, and we especially don't
 624 want to PANIC after having already written a WAL record that says we did
 625 the action --- if we did, replay of the record would probably fail again
 626 and PANIC again, making the failure unrecoverable.  This means that the
 627 ordinary WAL rule of "write WAL before the changes it describes" doesn't
 628 work, and we need a different design for such cases.
 629
 630 There are several basic types of filesystem actions that have this
 631 issue.  Here is how we deal with each:
 632
 633 1. Adding a disk page to an existing table.
 634
 635 This action isn't WAL-logged at all.  We extend a table by writing a page
 636 of zeroes at its end.  We must actually do this write so that we are sure
 637 the filesystem has allocated the space.  If the write fails we can just
 638 error out normally.  Once the space is known allocated, we can initialize
 639 and fill the page via one or more normal WAL-logged actions.  Because it's
 640 possible that we crash between extending the file and writing out the WAL
 641 entries, we have to treat discovery of an all-zeroes page in a table or
 642 index as being a non-error condition.  In such cases we can just reclaim
 643 the space for re-use.
 644
 645 2. Creating a new table, which requires a new file in the filesystem.
 646
 647 We try to create the file, and if successful we make a WAL record saying
 648 we did it.  If not successful, we can just throw an error.  Notice that
 649 there is a window where we have created the file but not yet written any
 650 WAL about it to disk.  If we crash during this window, the file remains
 651 on disk as an "orphan".  It would be possible to clean up such orphans
 652 by having database restart search for files that don't have any committed
 653 entry in pg_class, but that currently isn't done because of the possibility
 654 of deleting data that is useful for forensic analysis of the crash.
 655 Orphan files are harmless --- at worst they waste a bit of disk space ---
 656 because we check for on-disk collisions when allocating new relfilenode
 657 OIDs.  So cleaning up isn't really necessary.
 658
 659 3. Deleting a table, which requires an unlink() that could fail.
 660
 661 Our approach here is to WAL-log the operation first, but to treat failure
 662 of the actual unlink() call as a warning rather than error condition.
 663 Again, this can leave an orphan file behind, but that's cheap compared to
 664 the alternatives.  Since we can't actually do the unlink() until after
 665 we've committed the DROP TABLE transaction, throwing an error would be out
 666 of the question anyway.  (It may be worth noting that the WAL entry about
 667 the file deletion is actually part of the commit record for the dropping
 668 transaction.)
 669
 670 4. Creating and deleting databases and tablespaces, which requires creating
 671 and deleting directories and entire directory trees.
 672
 673 These cases are handled similarly to creating individual files, ie, we
 674 try to do the action first and then write a WAL entry if it succeeded.
 675 The potential amount of wasted disk space is rather larger, of course.
 676 In the creation case we try to delete the directory tree again if creation
 677 fails, so as to reduce the risk of wasted space.  Failure partway through
 678 a deletion operation results in a corrupt database: the DROP failed, but
 679 some of the data is gone anyway.  There is little we can do about that,
 680 though, and in any case it was presumably data the user no longer wants.
 681
 682 In all of these cases, if WAL replay fails to redo the original action
 683 we must panic and abort recovery.  The DBA will have to manually clean up
 684 (for instance, free up some disk space or fix directory permissions) and
 685 then restart recovery.  This is part of the reason for not writing a WAL
 686 entry until we've successfully done the original action.
 687
 688
 689 Asynchronous Commit
 690 -------------------
 691
 692 As of PostgreSQL 8.3 it is possible to perform asynchronous commits - i.e.,
 693 we don't wait while the WAL record for the commit is fsync'ed.
 694 We perform an asynchronous commit when synchronous_commit = off.  Instead
 695 of performing an XLogFlush() up to the LSN of the commit, we merely note
 696 the LSN in shared memory.  The backend then continues with other work.
 697 We record the LSN only for an asynchronous commit, not an abort; there's
 698 never any need to flush an abort record, since the presumption after a
 699 crash would be that the transaction aborted anyway.
 700
 701 We always force synchronous commit when the transaction is deleting
 702 relations, to ensure the commit record is down to disk before the relations
 703 are removed from the filesystem.  Also, certain utility commands that have
 704 non-roll-backable side effects (such as filesystem changes) force sync
 705 commit to minimize the window in which the filesystem change has been made
 706 but the transaction isn't guaranteed committed.
 707
 708 Every wal_writer_delay milliseconds, the walwriter process performs an
 709 XLogBackgroundFlush().  This checks the location of the last completely
 710 filled WAL page.  If that has moved forwards, then we write all the changed
 711 buffers up to that point, so that under full load we write only whole
 712 buffers.  If there has been a break in activity and the current WAL page is
 713 the same as before, then we find out the LSN of the most recent
 714 asynchronous commit, and flush up to that point, if required (i.e.,
 715 if it's in the current WAL page).  This arrangement in itself would
 716 guarantee that an async commit record reaches disk during at worst the
 717 second walwriter cycle after the transaction completes.  However, we also
 718 allow XLogFlush to flush full buffers "flexibly" (ie, not wrapping around
 719 at the end of the circular WAL buffer area), so as to minimize the number
 720 of writes issued under high load when multiple WAL pages are filled per
 721 walwriter cycle.  This makes the worst-case delay three walwriter cycles.
 722
 723 There are some other subtle points to consider with asynchronous commits.
 724 First, for each page of CLOG we must remember the LSN of the latest commit
 725 affecting the page, so that we can enforce the same flush-WAL-before-write
 726 rule that we do for ordinary relation pages.  Otherwise the record of the
 727 commit might reach disk before the WAL record does.  Again, abort records
 728 need not factor into this consideration.
 729
 730 In fact, we store more than one LSN for each clog page.  This relates to
 731 the way we set transaction status hint bits during visibility tests.
 732 We must not set a transaction-committed hint bit on a relation page and
 733 have that record make it to disk prior to the WAL record of the commit.
 734 Since visibility tests are normally made while holding buffer share locks,
 735 we do not have the option of changing the page's LSN to guarantee WAL
 736 synchronization.  Instead, we defer the setting of the hint bit if we have
 737 not yet flushed WAL as far as the LSN associated with the transaction.
 738 This requires tracking the LSN of each unflushed async commit.  It is
 739 convenient to associate this data with clog buffers: because we will flush
 740 WAL before writing a clog page, we know that we do not need to remember a
 741 transaction's LSN longer than the clog page holding its commit status
 742 remains in memory.  However, the naive approach of storing an LSN for each
 743 clog position is unattractive: the LSNs are 32x bigger than the two-bit
 744 commit status fields, and so we'd need 256K of additional shared memory for
 745 each 8K clog buffer page.  We choose instead to store a smaller number of
 746 LSNs per page, where each LSN is the highest LSN associated with any
 747 transaction commit in a contiguous range of transaction IDs on that page.
 748 This saves storage at the price of some possibly-unnecessary delay in
 749 setting transaction hint bits.
 750
 751 How many transactions should share the same cached LSN (N)?  If the
 752 system's workload consists only of small async-commit transactions, then
 753 it's reasonable to have N similar to the number of transactions per
 754 walwriter cycle, since that is the granularity with which transactions will
 755 become truly committed (and thus hintable) anyway.  The worst case is where
 756 a sync-commit xact shares a cached LSN with an async-commit xact that
 757 commits a bit later; even though we paid to sync the first xact to disk,
 758 we won't be able to hint its outputs until the second xact is sync'd, up to
 759 three walwriter cycles later.  This argues for keeping N (the group size)
 760 as small as possible.  For the moment we are setting the group size to 32,
 761 which makes the LSN cache space the same size as the actual clog buffer
 762 space (independently of BLCKSZ).
 763
 764 It is useful that we can run both synchronous and asynchronous commit
 765 transactions concurrently, but the safety of this is perhaps not
 766 immediately obvious.  Assume we have two transactions, T1 and T2.  The Log
 767 Sequence Number (LSN) is the point in the WAL sequence where a transaction
 768 commit is recorded, so LSN1 and LSN2 are the commit records of those
 769 transactions.  If T2 can see changes made by T1 then when T2 commits it
 770 must be true that LSN2 follows LSN1.  Thus when T2 commits it is certain
 771 that all of the changes made by T1 are also now recorded in the WAL.  This
 772 is true whether T1 was asynchronous or synchronous.  As a result, it is
 773 safe for asynchronous commits and synchronous commits to work concurrently
 774 without endangering data written by synchronous commits.  Sub-transactions
 775 are not important here since the final write to disk only occurs at the
 776 commit of the top level transaction.
 777
 778 Changes to data blocks cannot reach disk unless WAL is flushed up to the
 779 point of the LSN of the data blocks.  Any attempt to write unsafe data to
 780 disk will trigger a write which ensures the safety of all data written by
 781 that and prior transactions.  Data blocks and clog pages are both protected
 782 by LSNs.
 783
 784 Changes to a temp table are not WAL-logged, hence could reach disk in
 785 advance of T1's commit, but we don't care since temp table contents don't
 786 survive crashes anyway.
 787
 788 Database writes made via any of the paths we have introduced to avoid WAL
 789 overhead for bulk updates are also safe.  In these cases it's entirely
 790 possible for the data to reach disk before T1's commit, because T1 will
 791 fsync it down to disk without any sort of interlock, as soon as it finishes
 792 the bulk update.  However, all these paths are designed to write data that
 793 no other transaction can see until after T1 commits.  The situation is thus
 794 not different from ordinary WAL-logged updates.
 795
 796 Transaction Emulation during Recovery
 797 -------------------------------------
 798
 799 During Recovery we replay transaction changes in the order they occurred.
 800 As part of this replay we emulate some transactional behaviour, so that
 801 read only backends can take MVCC snapshots. We do this by maintaining a
 802 list of XIDs belonging to transactions that are being replayed, so that
 803 each transaction that has recorded WAL records for database writes exist
 804 in the array until it commits. Further details are given in comments in
 805 procarray.c.
 806
 807 Many actions write no WAL records at all, for example read only transactions.
 808 These have no effect on MVCC in recovery and we can pretend they never
 809 occurred at all. Subtransaction commit does not write a WAL record either
 810 and has very little effect, since lock waiters need to wait for the
 811 parent transaction to complete.
 812
 813 Not all transactional behaviour is emulated, for example we do not insert
 814 a transaction entry into the lock table, nor do we maintain the transaction
 815 stack in memory. Clog entries are made normally. Multixact is not maintained
 816 because its purpose is to record tuple level locks that an application has
 817 requested to prevent other tuple locks. Since tuple locks cannot be obtained at
 818 all, there is never any conflict and so there is no reason to update multixact.
 819 Subtrans is maintained during recovery but the details of the transaction
 820 tree are ignored and all subtransactions reference the top-level TransactionId
 821 directly. Since commit is atomic this provides correct lock wait behaviour
 822 yet simplifies emulation of subtransactions considerably.
 823
 824 Further details on locking mechanics in recovery are given in comments
 825 with the Lock rmgr code.