--- /dev/null
+<!-- $Header: /cvsroot/pgsql/doc/src/sgml/wal.sgml,v 1.1 2001/01/24 23:15:19 petere Exp $ -->
+
+<chapter id="wal">
+ <title>Write-Ahead Logging (<acronym>WAL</acronym>)</title>
+
+ <note>
+ <title>Author</title>
+ <para>
+ Vadim Mikheev and Oliver Elphick
+ </para>
+ </note>
+
+ <sect1 id="wal-general">
+ <title>General Description</title>
+
+ <para>
+ <firstterm>Write-Ahead Logging</firstterm> (<acronym>WAL</acronym>)
+ is a standard approach to transaction logging. A detailed
+ description can be found in most (if not all) books about
+ transaction processing. Briefly, <acronym>WAL</acronym>'s central
+ concept is that changes to data files (where tables and indices
+ reside) must be written only after those changes have been logged,
+ that is, after the log records describing them have been flushed
+ to permanent storage. When we follow this procedure, we do not
+ need to flush
+ data pages to disk on every transaction commit, because we know
+ that in the event of a crash we will be able to recover the
+ database using the log: any changes that have not been applied to
+ the data pages will first be redone from the log records (this is
+ roll-forward recovery, also known as REDO) and then changes made by
+ uncommitted transactions will be removed from the data pages
+ (roll-backward recovery, also known as UNDO).
+ </para>
+
+ <sect2 id="wal-benefits-now">
+ <title>Immediate Benefits of <acronym>WAL</acronym></title>
+
+ <para>
+ The first obvious benefit of using <acronym>WAL</acronym> is a
+ significantly reduced number of disk writes, since only the log
+ file needs to be flushed to disk at the time of transaction
+ commit; in multi-user environments, commits of many transactions
+ may be accomplished with a single <function>fsync()</function> of
+ the log file. Furthermore, the log file is written sequentially,
+ and so the cost of syncing the log is much less than the cost of
+ flushing the data pages.
+ </para>
+
+ <para>
+ The next benefit is consistency of the data pages. Before
+ <acronym>WAL</acronym>, <productname>PostgreSQL</productname>
+ could not guarantee consistency in the case of a crash. Any
+ crash during writing could result in:
+
+ <orderedlist>
+ <listitem>
+ <simpara>index tuples pointing to non-existent table rows</simpara>
+ </listitem>
+
+ <listitem>
+ <simpara>index tuples lost in split operations</simpara>
+ </listitem>
+
+ <listitem>
+ <simpara>totally corrupted table or index page content, because
+ of partially written data pages</simpara>
+ </listitem>
+ </orderedlist>
+
+ Problems with indices (items 1 and 2) could possibly have been
+ fixed by additional <function>fsync()</function> calls, but it is
+ not obvious how to handle the last case without
+ <acronym>WAL</acronym>; <acronym>WAL</acronym> saves the entire
+ data page content in the log if that is required to ensure page
+ consistency for after-crash recovery.
+ </para>
+ </sect2>
+
+ <sect2 id="wal-benefits-later">
+ <title>Future Benefits</title>
+
+ <para>
+ In this first release of <acronym>WAL</acronym>, the UNDO operation
+ is not implemented, for lack of time. This means that changes
+ made by aborted transactions will still occupy disk space and that
+ we still need a permanent <filename>pg_log</filename> file to hold
+ the status of transactions, since we are not able to re-use
+ transaction identifiers. Once UNDO is implemented,
+ <filename>pg_log</filename> will no longer be required to be
+ permanent; it will be possible to remove
+ <filename>pg_log</filename> at shutdown, or to split it into
+ segments and remove old segments.
+ </para>
+
+ <para>
+ With UNDO, it will also be possible to implement
+ <firstterm>savepoints</firstterm> to allow partial rollback of
+ invalid transaction operations (parser errors caused by mistyping
+ commands, insertion of duplicate primary/unique keys and so on)
+ with the ability to continue or commit valid operations made by
+ the transaction before the error. At present, any error will
+ invalidate the whole transaction and require a transaction abort.
+ </para>
+
+ <para>
+ <acronym>WAL</acronym> offers the opportunity for a new method for
+ database on-line backup and restore (<acronym>BAR</acronym>). To
+ use this method, one would have to make periodic saves of data
+ files to another disk, a tape or another host and also archive the
+ <acronym>WAL</acronym> log files. The database file copy and the
+ archived log files could be used to restore just as if one were
+ restoring after a crash. Each time a new database file copy was
+ made, the old log files could be removed. Implementing this
+ facility will require the logging of data file and index creation
+ and deletion; it will also require development of a method for
+ copying the data files (operating system copy commands are not
+ suitable).
+ </para>
+ </sect2>
+ </sect1>
+
+ <sect1 id="wal-implementation">
+ <title>Implementation</title>
+
+ <para>
+ <acronym>WAL</acronym> is automatically enabled from release 7.1
+ onwards. No action is required from the administrator with the
+ exception of ensuring that the additional disk-space requirements
+ of the <acronym>WAL</acronym> logs are met, and that any necessary
+ tuning is done (see <xref linkend="wal-configuration">).
+ </para>
+
+ <para>
+ <acronym>WAL</acronym> logs are stored in the directory
+ <filename><replaceable>$PGDATA</replaceable>/pg_xlog</filename>, as
+ a set of segment files, each 16 MB in size. Each segment is
+ divided into 8 kB pages. The log record headers are described in
+ <filename>access/xlog.h</filename>; record content is dependent on
+ the type of event that is being logged. Segment files are given
+ sequential numbers as names, starting at
+ <filename>0000000000000000</filename>. The numbers do not wrap, at
+ present, but it should take a very long time to exhaust the
+ available stock of numbers.
+ </para>
+
+ <para>
+ The <acronym>WAL</acronym> buffers and control structure are in
+ shared memory, and are handled by the backends; they are protected
+ by spinlocks. The demand on shared memory is dependent on the
+ number of buffers; the default size of the <acronym>WAL</acronym>
+ buffers is 64 kB.
+ </para>
+
+ <para>
+ It is advantageous to locate the log on a different disk from the
+ main database files. This may be achieved by moving the directory
+ <filename>pg_xlog</filename> to another location (while the
+ postmaster is shut down, of course) and creating a symbolic link
+ from the original location in <replaceable>$PGDATA</replaceable> to
+ the new location.
+ </para>
+
+ <para>
+ The aim of <acronym>WAL</acronym>, to ensure that the log is
+ written before database records are altered, may be subverted by
+ disk drives that falsely report a successful write to the kernel,
+ when, in fact, they have only cached the data and not yet stored it
+ on the disk. A power failure in such a situation may still lead to
+ irrecoverable data corruption; administrators should try to ensure
+ that disks holding <productname>PostgreSQL</productname>'s data and
+ log files do not make such false reports.
+ </para>
+
+ <sect2 id="wal-recovery">
+ <title>Database Recovery with <acronym>WAL</acronym></title>
+
+ <para>
+ After a checkpoint has been made and the log flushed, the
+ checkpoint's position is saved in the file
+ <filename>pg_control</filename>. Therefore, when recovery is to be
+ done, the backend first reads <filename>pg_control</filename> and
+ then the checkpoint record; next it reads the redo record, whose
+ position is saved in the checkpoint, and begins the REDO operation.
+ Because the entire content of a page is saved in the log on its
+ first modification after a checkpoint, the pages will first be
+ restored to a consistent state.
+ </para>
+
+ <para>
+ Using <filename>pg_control</filename> to get the checkpoint
+ position speeds up the recovery process, but to handle possible
+ corruption of <filename>pg_control</filename>, we should actually
+ implement the reading of existing log segments in reverse order
+ (newest to oldest) in order to find the last checkpoint. This has
+ not yet been done in release 7.1.
+ </para>
+ </sect2>
+ </sect1>
+
+ <sect1 id="wal-configuration">
+ <title><acronym>WAL</acronym> Configuration</title>
+
+ <para>
+ There are several <acronym>WAL</acronym>-related parameters that
+ affect database performance. This section explains their use.
+ Consult <xref linkend="runtime-config"> for details about setting
+ configuration parameters.
+ </para>
+
+ <para>
+ There are two commonly used <acronym>WAL</acronym> functions:
+ <function>LogInsert</function> and <function>LogFlush</function>.
+ <function>LogInsert</function> is used to place a new record into
+ the <acronym>WAL</acronym> buffers in shared memory. If there is no
+ space for the new record, <function>LogInsert</function> will have
+ to write (move to kernel cache) a few filled <acronym>WAL</acronym>
+ buffers. This is undesirable because <function>LogInsert</function>
+ is used on every low-level database modification (for example,
+ tuple insertion) at a time when an exclusive lock is held on
+ affected data pages and the operation is supposed to be as fast as
+ possible; what is worse, writing <acronym>WAL</acronym> buffers may
+ also cause the creation of a new log segment, which takes even more
+ time. Normally, <acronym>WAL</acronym> buffers should be written
+ and flushed by a <function>LogFlush</function> request, which is
+ made, for the most part, at transaction commit time to ensure that
+ transaction records are flushed to permanent storage. On systems
+ with high log output, <function>LogFlush</function> requests may
+ not occur often enough to prevent <acronym>WAL</acronym> buffers
+ from being written by <function>LogInsert</function>. On such systems
+ one should increase the number of <acronym>WAL</acronym> buffers by
+ modifying the <varname>WAL_BUFFERS</varname> parameter. The default
+ number of <acronym>WAL</acronym> buffers is 8. Increasing this
+ value will have an impact on shared memory usage.
+ </para>
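+
+ <para>
+ As a minimal sketch of such a change, the fragment below doubles
+ the number of <acronym>WAL</acronym> buffers. It assumes the
+ <literal>name = value</literal> syntax of the
+ <filename>postgresql.conf</filename> file covered in
+ <xref linkend="runtime-config">; the value 16 is purely
+ illustrative and not a recommendation. If each buffer holds one
+ 8 kB log page, as the defaults of 8 buffers and 64 kB suggest,
+ this setting would take roughly 16 x 8 kB = 128 kB of shared
+ memory.
+ </para>
+
+<programlisting>
+# Illustrative value only; the default is 8.
+WAL_BUFFERS = 16
+</programlisting>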
+
+ <para>
+ <firstterm>Checkpoints</firstterm> are points in the sequence of
+ transactions at which it is guaranteed that the data files have
+ been updated with all information logged before the checkpoint. At
+ checkpoint time, all dirty data pages are flushed to disk and a
+ special checkpoint record is written to the log file. As a result, in
+ the event of a crash, the recovery procedure knows from which record
+ in the log (known as the redo record) it should start the REDO operation,
+ since any changes made to data files before that record are already
+ on disk. After a checkpoint has been made, any log segments written
+ before the redo record are removed, so checkpoints are used to free
+ disk space in the <acronym>WAL</acronym> directory. (When
+ <acronym>WAL</acronym>-based <acronym>BAR</acronym> is implemented,
+ the log segments can be archived instead of just being removed.)
+ The checkpoint maker is also able to create a few log segments for
+ future use, so as to avoid the need for
+ <function>LogInsert</function> or <function>LogFlush</function> to
+ spend time in creating them.
+ </para>
+
+ <para>
+ The <acronym>WAL</acronym> log is held on the disk as a set of 16
+ MB files called <firstterm>segments</firstterm>. By default, a new
+ segment is created only if more than 75% of the current segment is
+ used. One can instruct the server to create up to 64 log segments
+ at checkpoint time by modifying the <varname>WAL_FILES</varname>
+ configuration parameter.
+ </para>
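+
+ <para>
+ For example, a setting along the following lines would ask the
+ server to prepare several segments ahead of time. This is only a
+ sketch: it assumes the usual <literal>name = value</literal>
+ configuration syntax and that the value is the number of segments
+ to create, and the value 8 is an arbitrary choice for
+ illustration. Since each segment is 16 MB, eight such segments
+ would occupy about 8 x 16 MB = 128 MB of disk space.
+ </para>
+
+<programlisting>
+# Illustrative value only; the text above gives 64 as the upper limit.
+WAL_FILES = 8
+</programlisting>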
+
+ <para>
+ For faster after-crash recovery, it would be better to create
+ checkpoints more often. However, one should balance this against
+ the cost of flushing dirty data pages; in addition, to ensure data
+ page consistency, the first modification of a data page after each
+ checkpoint results in logging the entire page content, thus
+ increasing output to the log and the log's size.
+ </para>
+
+ <para>
+ By default, the postmaster spawns a special backend process to
+ create the next checkpoint 300 seconds after the previous
+ checkpoint's creation. One can change this interval by modifying
+ the <varname>CHECKPOINT_TIMEOUT</varname> parameter. It is also
+ possible to force a checkpoint by using the SQL command
+ <command>CHECKPOINT</command>.
+ </para>
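+
+ <para>
+ As a hedged illustration, the fragment below lengthens the
+ interval between automatic checkpoints to ten minutes. It assumes
+ the usual <literal>name = value</literal> configuration syntax and
+ that the value is given in seconds, as the default of 300 seconds
+ implies; the value 600 is illustrative only. A longer interval
+ means fewer flushes of dirty data pages and fewer full-page log
+ entries, at the price of a longer REDO phase after a crash.
+ </para>
+
+<programlisting>
+# Illustrative value only; the default is 300 seconds.
+CHECKPOINT_TIMEOUT = 600
+</programlisting>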
+
+ <para>
+ Setting the <varname>WAL_DEBUG</varname> parameter to any non-zero
+ value will result in each <function>LogInsert</function> and
+ <function>LogFlush</function> <acronym>WAL</acronym> call being
+ logged to standard error. At present, it makes no difference what
+ the non-zero value is. This option may be replaced by a more
+ general mechanism in the future.
+ </para>
+
+ <para>
+ The <varname>COMMIT_DELAY</varname> parameter defines how long
+ the backend is forced to sleep after writing a commit record
+ to the log with a <function>LogInsert</function> call but before
+ performing a <function>LogFlush</function>. This delay allows other
+ backends to add their commit records to the log so as to have all
+ of them flushed with a single log sync. Unfortunately, this
+ mechanism is not fully implemented in release 7.1, so there is at
+ present no point in changing this parameter from its default value
+ of 5 microseconds.
+ </para>
+ </sect1>
+</chapter>
+
+<!-- Keep this comment at the end of the file
+Local variables:
+mode:sgml
+sgml-omittag:nil
+sgml-shorttag:t
+sgml-minimize-attributes:nil
+sgml-always-quote-attributes:t
+sgml-indent-step:1
+sgml-indent-data:t
+sgml-parent-document:nil
+sgml-default-dtd-file:"./reference.ced"
+sgml-exposed-tags:nil
+sgml-local-catalogs:("/usr/lib/sgml/catalog")
+sgml-local-ecat-files:nil
+End:
+-->