$PostgreSQL: pgsql/src/backend/access/transam/README,v 1.4 2006/03/29 21:17:37 tgl Exp $

The Transaction System
----------------------

PostgreSQL's transaction system is a three-layer system.  The bottom layer
implements low-level transactions and subtransactions, on top of which rests
the mainloop's control code, which in turn implements user-visible
transactions and savepoints.

The middle layer of code is called by postgres.c before and after the
processing of each query, or after detecting an error:

		StartTransactionCommand
		CommitTransactionCommand
		AbortCurrentTransaction

Meanwhile, the user can alter the system's state by issuing the SQL commands
BEGIN, COMMIT, ROLLBACK, SAVEPOINT, ROLLBACK TO or RELEASE.  The traffic cop
redirects these calls to the toplevel routines

		BeginTransactionBlock
		EndTransactionBlock
		UserAbortTransactionBlock
		DefineSavepoint
		RollbackToSavepoint
		ReleaseSavepoint

respectively.  Depending on the current state of the system, these functions
call low level functions to activate the real transaction system:

		StartTransaction
		CommitTransaction
		AbortTransaction
		CleanupTransaction
		StartSubTransaction
		CommitSubTransaction
		AbortSubTransaction
		CleanupSubTransaction

Additionally, within a transaction, CommandCounterIncrement is called to
increment the command counter, which allows future commands to "see" the
effects of previous commands within the same transaction.  Note that this is
done automatically by CommitTransactionCommand after each query inside a
transaction block, but some utility functions also do it internally to allow
some operations (usually in the system catalogs) to be seen by future
operations in the same utility command.  (For example, in DefineRelation it is
done after creating the heap so the pg_class row is visible, to be able to
lock it.)


For example, consider the following sequence of user commands:

1)		BEGIN
2)		SELECT * FROM foo
3)		INSERT INTO foo VALUES (...)
4)		COMMIT

In the main processing loop, this results in the following function call
sequence:

	 /	StartTransactionCommand;
	/		StartTransaction;
1) <		ProcessUtility;				<< BEGIN
	\		BeginTransactionBlock;
	 \	CommitTransactionCommand;

	/	StartTransactionCommand;
2) /		ProcessQuery;				<< SELECT ...
   \		CommitTransactionCommand;
	\		CommandCounterIncrement;

	/	StartTransactionCommand;
3) /		ProcessQuery;				<< INSERT ...
   \		CommitTransactionCommand;
	\		CommandCounterIncrement;

	 /	StartTransactionCommand;
	/	ProcessUtility;				<< COMMIT
4) <			EndTransactionBlock;
	\	CommitTransactionCommand;
	 \		CommitTransaction;

The point of this example is to demonstrate the need for
StartTransactionCommand and CommitTransactionCommand to be state smart -- they
should call CommandCounterIncrement between the calls to BeginTransactionBlock
and EndTransactionBlock and outside these calls they need to do normal start,
commit or abort processing.

Furthermore, suppose the "SELECT * FROM foo" caused an abort condition.	In
this case AbortCurrentTransaction is called, and the transaction is put in
aborted state.  In this state, any user input is ignored except for
transaction-termination statements, or ROLLBACK TO <savepoint> commands.

Transaction aborts can occur in two ways:

1)	system dies from some internal cause  (syntax error, etc)
2)	user types ROLLBACK

The reason we have to distinguish them is illustrated by the following two
situations:

	case 1					case 2
	------					------
1) user types BEGIN			1) user types BEGIN
2) user does something			2) user does something
3) user does not like what		3) system aborts for some reason
   she sees and types ABORT		   (syntax error, etc)

In case 1, we want to abort the transaction and return to the default state.
In case 2, there may be more commands coming our way which are part of the
same transaction block; we have to ignore these commands until we see a COMMIT
or ROLLBACK.

Internal aborts are handled by AbortCurrentTransaction, while user aborts are
handled by UserAbortTransactionBlock.  Both of them rely on AbortTransaction
to do all the real work.  The only difference is what state we enter after
AbortTransaction does its work:

* AbortCurrentTransaction leaves us in TBLOCK_ABORT,
* UserAbortTransactionBlock leaves us in TBLOCK_ABORT_END

Low-level transaction abort handling is divided in two phases:
* AbortTransaction executes as soon as we realize the transaction has
  failed.  It should release all shared resources (locks etc) so that we do
  not delay other backends unnecessarily.
* CleanupTransaction executes when we finally see a user COMMIT
  or ROLLBACK command; it cleans things up and gets us out of the transaction
  completely.  In particular, we mustn't destroy TopTransactionContext until
  this point.

Also, note that when a transaction is committed, we don't close it right away.
Rather it's put in TBLOCK_END state, which means that when
CommitTransactionCommand is called after the query has finished processing,
the transaction has to be closed.  The distinction is subtle but important,
because it means that control will leave the xact.c code with the transaction
open, and the main loop will be able to keep processing inside the same
transaction.  So, in a sense, transaction commit is also handled in two
phases, the first at EndTransactionBlock and the second at
CommitTransactionCommand (which is where CommitTransaction is actually
called).

The rest of the code in xact.c are routines to support the creation and
finishing of transactions and subtransactions.  For example, AtStart_Memory
takes care of initializing the memory subsystem at main transaction start.


Subtransaction handling
-----------------------

Subtransactions are implemented using a stack of TransactionState structures,
each of which has a pointer to its parent transaction's struct.  When a new
subtransaction is to be opened, PushTransaction is called, which creates a new
TransactionState, with its parent link pointing to the current transaction.
StartSubTransaction is in charge of initializing the new TransactionState to
sane values, and properly initializing other subsystems (AtSubStart routines).

When closing a subtransaction, either CommitSubTransaction has to be called
(if the subtransaction is committing), or AbortSubTransaction and
CleanupSubTransaction (if it's aborting).  In either case, PopTransaction is
called so the system returns to the parent transaction.

One important point regarding subtransaction handling is that several may need
to be closed in response to a single user command.  That's because savepoints
have names, and we allow to commit or rollback a savepoint by name, which is
not necessarily the one that was last opened.  Also a COMMIT or ROLLBACK
command must be able to close out the entire stack.  We handle this by having
the utility command subroutine mark all the state stack entries as commit-
pending or abort-pending, and then when the main loop reaches
CommitTransactionCommand, the real work is done.  The main point of doing
things this way is that if we get an error while popping state stack entries,
the remaining stack entries still show what we need to do to finish up.

In the case of ROLLBACK TO <savepoint>, we abort all the subtransactions up
through the one identified by the savepoint name, and then re-create that
subtransaction level with the same name.  So it's a completely new
subtransaction as far as the internals are concerned.

Other subsystems are allowed to start "internal" subtransactions, which are
handled by BeginInternalSubtransaction.  This is to allow implementing
exception handling, e.g. in PL/pgSQL.  ReleaseCurrentSubTransaction and
RollbackAndReleaseCurrentSubTransaction allows the subsystem to close said
subtransactions.  The main difference between this and the savepoint/release
path is that we execute the complete state transition immediately in each
subroutine, rather than deferring some work until CommitTransactionCommand.
Another difference is that BeginInternalSubtransaction is allowed when no
explicit transaction block has been established, while DefineSavepoint is not.


Subtransaction numbering
------------------------

A top-level transaction is always given a TransactionId (XID) as soon as it is
created.  This is necessary for a number of reasons, notably XMIN bookkeeping
for VACUUM.  However, a subtransaction doesn't need its own XID unless it
(or one of its child subxacts) writes tuples into the database.  Therefore,
we postpone assigning XIDs to subxacts until and unless they call
GetCurrentTransactionId.  The subsidiary actions of obtaining a lock on the
XID and and entering it into pg_subtrans and PG_PROC are done at the same time.

Internally, a backend needs a way to identify subtransactions whether or not
they have XIDs; but this need only lasts as long as the parent top transaction
endures.  Therefore, we have SubTransactionId, which is somewhat like
CommandId in that it's generated from a counter that we reset at the start of
each top transaction.  The top-level transaction itself has SubTransactionId 1,
and subtransactions have IDs 2 and up.  (Zero is reserved for
InvalidSubTransactionId.)


pg_clog and pg_subtrans
-----------------------

pg_clog and pg_subtrans are permanent (on-disk) storage of transaction related
information.  There is a limited number of pages of each kept in memory, so
in many cases there is no need to actually read from disk.  However, if
there's a long running transaction or a backend sitting idle with an open
transaction, it may be necessary to be able to read and write this information
from disk.  They also allow information to be permanent across server restarts.

pg_clog records the commit status for each transaction that has been assigned
an XID.  A transaction can be in progress, committed, aborted, or
"sub-committed".  This last state means that it's a subtransaction that's no
longer running, but its parent has not updated its state yet (either it is
still running, or the backend crashed without updating its status).  A
sub-committed transaction's status will be updated again to the final value as
soon as the parent commits or aborts, or when the parent is detected to be
aborted.

Savepoints are implemented using subtransactions.  A subtransaction is a
transaction inside a transaction; its commit or abort status is not only
dependent on whether it committed itself, but also whether its parent
transaction committed.  To implement multiple savepoints in a transaction we
allow unlimited transaction nesting depth, so any particular subtransaction's
commit state is dependent on the commit status of each and every ancestor
transaction.

The "subtransaction parent" (pg_subtrans) mechanism records, for each
transaction with an XID, the TransactionId of its parent transaction.  This
information is stored as soon as the subtransaction is assigned an XID.
Top-level transactions do not have a parent, so they leave their pg_subtrans
entries set to the default value of zero (InvalidTransactionId).

pg_subtrans is used to check whether the transaction in question is still
running --- the main Xid of a transaction is recorded in the PGPROC struct,
but since we allow arbitrary nesting of subtransactions, we can't fit all Xids
in shared memory, so we have to store them on disk.  Note, however, that for
each transaction we keep a "cache" of Xids that are known to be part of the
transaction tree, so we can skip looking at pg_subtrans unless we know the
cache has been overflowed.  See storage/ipc/procarray.c for the gory details.

slru.c is the supporting mechanism for both pg_clog and pg_subtrans.  It
implements the LRU policy for in-memory buffer pages.  The high-level routines
for pg_clog are implemented in transam.c, while the low-level functions are in
clog.c.  pg_subtrans is contained completely in subtrans.c.


Write-Ahead Log coding
----------------------

The WAL subsystem (also called XLOG in the code) exists to guarantee crash
recovery.  It can also be used to provide point-in-time recovery, as well as
hot-standby replication via log shipping.  Here are some notes about
non-obvious aspects of its design.

A basic assumption of a write AHEAD log is that log entries must reach stable
storage before the data-page changes they describe.  This ensures that
replaying the log to its end will bring us to a consistent state where there
are no partially-performed transactions.  To guarantee this, each data page
(either heap or index) is marked with the LSN (log sequence number --- in
practice, a WAL file location) of the latest XLOG record affecting the page.
Before the bufmgr can write out a dirty page, it must ensure that xlog has
been flushed to disk at least up to the page's LSN.  This low-level
interaction improves performance by not waiting for XLOG I/O until necessary.
The LSN check exists only in the shared-buffer manager, not in the local
buffer manager used for temp tables; hence operations on temp tables must not
be WAL-logged.

During WAL replay, we can check the LSN of a page to detect whether the change
recorded by the current log entry is already applied (it has been, if the page
LSN is >= the log entry's WAL location).

Usually, log entries contain just enough information to redo a single
incremental update on a page (or small group of pages).  This will work only
if the filesystem and hardware implement data page writes as atomic actions,
so that a page is never left in a corrupt partly-written state.  Since that's
often an untenable assumption in practice, we log additional information to
allow complete reconstruction of modified pages.  The first WAL record
affecting a given page after a checkpoint is made to contain a copy of the
entire page, and we implement replay by restoring that page copy instead of
redoing the update.  (This is more reliable than the data storage itself would
be because we can check the validity of the WAL record's CRC.)  We can detect
the "first change after checkpoint" by noting whether the page's old LSN
precedes the end of WAL as of the last checkpoint (the RedoRecPtr).

The general schema for executing a WAL-logged action is

1. Pin and exclusive-lock the shared buffer(s) containing the data page(s)
to be modified.

2. START_CRIT_SECTION()  (Any error during the next two steps must cause a
PANIC because the shared buffers will contain unlogged changes, which we
have to ensure don't get to disk.  Obviously, you should check conditions
such as whether there's enough free space on the page before you start the
critical section.)

3. Apply the required changes to the shared buffer(s).

4. Build a WAL log record and pass it to XLogInsert(); then update the page's
LSN and TLI using the returned XLOG location.  For instance,

		recptr = XLogInsert(rmgr_id, info, rdata);

		PageSetLSN(dp, recptr);
		PageSetTLI(dp, ThisTimeLineID);

5. END_CRIT_SECTION()

6. Unlock and write the buffer(s):

		LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
		WriteBuffer(buffer);

(Note: WriteBuffer doesn't really "write" the buffer anymore, it just marks it
dirty and unpins it.  The write will not happen until a checkpoint occurs or
the shared buffer is needed for another page.)

XLogInsert's "rdata" argument is an array of pointer/size items identifying
chunks of data to be written in the XLOG record, plus optional shared-buffer
IDs for chunks that are in shared buffers rather than temporary variables.
The "rdata" array must mention (at least once) each of the shared buffers
being modified, unless the action is such that the WAL replay routine can
reconstruct the entire page contents.  XLogInsert includes the logic that
tests to see whether a shared buffer has been modified since the last
checkpoint.  If not, the entire page contents are logged rather than just the
portion(s) pointed to by "rdata".

Because XLogInsert drops the rdata components associated with buffers it
chooses to log in full, the WAL replay routines normally need to test to see
which buffers were handled that way --- otherwise they may be misled about
what the XLOG record actually contains.  XLOG records that describe multi-page
changes therefore require some care to design: you must be certain that you
know what data is indicated by each "BKP" bit.  An example of the trickiness
is that in a HEAP_UPDATE record, BKP(1) normally is associated with the source
page and BKP(2) is associated with the destination page --- but if these are
the same page, only BKP(1) would have been set.

For this reason as well as the risk of deadlocking on buffer locks, it's best
to design WAL records so that they reflect small atomic actions involving just
one or a few pages.  The current XLOG infrastructure cannot handle WAL records
involving references to more than three shared buffers, anyway.

In the case where the WAL record contains enough information to re-generate
the entire contents of a page, do *not* show that page's buffer ID in the
rdata array, even if some of the rdata items point into the buffer.  This is
because you don't want XLogInsert to log the whole page contents.  The
standard replay-routine pattern for this case is

	reln = XLogOpenRelation(rnode);
	buffer = XLogReadBuffer(reln, blkno, true);
	Assert(BufferIsValid(buffer));
	page = (Page) BufferGetPage(buffer);

	... initialize the page ...

	PageSetLSN(page, lsn);
	PageSetTLI(page, ThisTimeLineID);
	LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
	WriteBuffer(buffer);

In the case where the WAL record provides only enough information to
incrementally update the page, the rdata array *must* mention the buffer
ID at least once; otherwise there is no defense against torn-page problems.
The standard replay-routine pattern for this case is

	if (record->xl_info & XLR_BKP_BLOCK_n)
		<< do nothing, page was rewritten from logged copy >>;

	reln = XLogOpenRelation(rnode);
	buffer = XLogReadBuffer(reln, blkno, false);
	if (!BufferIsValid(buffer))
		<< do nothing, page has been deleted >>;
	page = (Page) BufferGetPage(buffer);

	if (XLByteLE(lsn, PageGetLSN(page)))
	{
		/* changes are already applied */
		LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
		ReleaseBuffer(buffer);
		return;
	}

	... apply the change ...

	PageSetLSN(page, lsn);
	PageSetTLI(page, ThisTimeLineID);
	LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
	WriteBuffer(buffer);

As noted above, for a multi-page update you need to be able to determine
which XLR_BKP_BLOCK_n flag applies to each page.  If a WAL record reflects
a combination of fully-rewritable and incremental updates, then the rewritable
pages don't count for the XLR_BKP_BLOCK_n numbering.  (XLR_BKP_BLOCK_n is
associated with the n'th distinct buffer ID seen in the "rdata" array, and
per the above discussion, fully-rewritable buffers shouldn't be mentioned in
"rdata".)

Due to all these constraints, complex changes (such as a multilevel index
insertion) normally need to be described by a series of atomic-action WAL
records.  What do you do if the intermediate states are not self-consistent?
The answer is that the WAL replay logic has to be able to fix things up.
In btree indexes, for example, a page split requires insertion of a new key in
the parent btree level, but for locking reasons this has to be reflected by
two separate WAL records.  The replay code has to remember "unfinished" split
operations, and match them up to subsequent insertions in the parent level.
If no matching insert has been found by the time the WAL replay ends, the
replay code has to do the insertion on its own to restore the index to
consistency.