<!-- $PostgreSQL: pgsql/doc/src/sgml/wal.sgml,v 1.28 2004/03/09 16:57:47 neilc Exp $ -->

<title>Write-Ahead Logging (<acronym>WAL</acronym>)</title>

<primary>transaction log</primary>

<firstterm>Write-Ahead Logging</firstterm> (<acronym>WAL</acronym>)
is a standard approach to transaction logging. Its detailed
description may be found in most (if not all) books about
transaction processing. Briefly, <acronym>WAL</acronym>'s central
concept is that changes to data files (where tables and indexes
reside) must be written only after those changes have been logged,
that is, when log records describing the changes have been flushed
to permanent storage. If we follow this procedure, we do not need
to flush data pages to disk on every transaction commit, because we
know that in the event of a crash we will be able to recover the
database using the log: any changes that have not been applied to
the data pages will first be redone from the log records (this is
roll-forward recovery, also known as REDO) and then changes made by
uncommitted transactions will be removed from the data pages
(roll-backward recovery, UNDO).
<sect1 id="wal-benefits-now">
<title>Benefits of <acronym>WAL</acronym></title>

<indexterm zone="wal-benefits-now">
<primary>fsync</primary>

The first obvious benefit of using <acronym>WAL</acronym> is a
significantly reduced number of disk writes, since only the log
file needs to be flushed to disk at the time of transaction
commit; in multiuser environments, commits of many transactions
may be accomplished with a single <function>fsync()</function> of
the log file. Furthermore, the log file is written sequentially,
and so the cost of syncing the log is much less than the cost of
flushing the data pages.

The next benefit is consistency of the data pages. The truth is
that, before <acronym>WAL</acronym>,
<productname>PostgreSQL</productname> was never able to guarantee
consistency in the case of a crash. Before
<acronym>WAL</acronym>, any crash during writing could result in:

<simpara>index rows pointing to nonexistent table rows</simpara>
<simpara>index rows lost in split operations</simpara>
<simpara>totally corrupted table or index page content, because
of partially written data pages</simpara>

Problems with indexes (problems 1 and 2) could possibly have been
fixed by additional <function>fsync()</function> calls, but it is
not obvious how to handle the last case without
<acronym>WAL</acronym>; <acronym>WAL</acronym> saves the entire data
page content in the log if that is required to ensure page
consistency for after-crash recovery.
<sect1 id="wal-benefits-later">
<title>Future Benefits</title>

The UNDO operation is not implemented. This means that changes
made by aborted transactions will still occupy disk space and that
a permanent <filename>pg_clog</filename> file to hold
the status of transactions is still needed, since
transaction identifiers cannot be reused. Once UNDO is implemented,
<filename>pg_clog</filename> will no longer be required to be
permanent; it will be possible to remove
<filename>pg_clog</filename> at shutdown. (However, the urgency of
this concern has decreased greatly with the adoption of a segmented
storage method for <filename>pg_clog</filename>: it is no longer
necessary to keep old <filename>pg_clog</filename> entries around
indefinitely.)
With UNDO, it will also be possible to implement
<firstterm>savepoints</firstterm><indexterm><primary>savepoint</></> to allow partial rollback of
invalid transaction operations (parser errors caused by mistyping
commands, insertion of duplicate primary/unique keys, and so on)
with the ability to continue or commit valid operations made by
the transaction before the error. At present, any error will
invalidate the whole transaction and require a transaction abort.

<acronym>WAL</acronym> offers the opportunity for a new method for
database on-line backup and restore (<acronym>BAR</acronym>). To
use this method, one would have to make periodic saves of data
files to another disk, a tape, or another host and also archive the
<acronym>WAL</acronym> log files. The database file copy and the
archived log files could be used to restore just as if one were
restoring after a crash. Each time a new database file copy was
made, the old log files could be removed. Implementing this
facility will require the logging of data file and index creation
and deletion; it will also require development of a method for
copying the data files (operating system copy commands are not
suitable).
A difficulty standing in the way of realizing these benefits is that
they require saving <acronym>WAL</acronym> entries for considerable
periods of time (e.g., as long as the longest possible transaction if
transaction UNDO is wanted). The present <acronym>WAL</acronym>
format is extremely bulky since it includes many disk page
snapshots. This is not a serious concern at present, since the
entries only need to be kept for one or two checkpoint intervals;
but to achieve these future benefits some sort of compressed
<acronym>WAL</acronym> format will be needed.

<sect1 id="wal-configuration">
<title><acronym>WAL</acronym> Configuration</title>

There are several <acronym>WAL</acronym>-related configuration parameters that
affect database performance. This section explains their use.
Consult <xref linkend="runtime-config"> for details about setting
configuration parameters.
<firstterm>Checkpoints</firstterm><indexterm><primary>checkpoint</></> are points in the sequence of
transactions at which it is guaranteed that the data files have
been updated with all information logged before the checkpoint. At
checkpoint time, all dirty data pages are flushed to disk and a
special checkpoint record is written to the log file. As a result, in
the event of a crash, the recoverer knows from what record in the
log (known as the redo record) it should start the REDO operation,
since any changes made to data files before that record are already
on disk. After a checkpoint has been made, any log segments written
before the redo record are no longer needed and can be recycled or
removed. (When <acronym>WAL</acronym>-based <acronym>BAR</acronym> is
implemented, the log segments would be archived before being recycled
or removed.)
The server spawns a special process every so often to create the
next checkpoint. A checkpoint is created every <xref
linkend="guc-checkpoint-segments"> log segments, or every <xref
linkend="guc-checkpoint-timeout"> seconds, whichever comes first.
The default settings are 3 segments and 300 seconds respectively.
It is also possible to force a checkpoint by using the SQL command
<command>CHECKPOINT</command>.
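<para>
As an illustration, the defaults just mentioned correspond to these
<filename>postgresql.conf</filename> entries (a minimal sketch; the
values shown are simply the defaults described above):
<programlisting>
checkpoint_segments = 3      # maximum log segments between checkpoints
checkpoint_timeout = 300     # maximum seconds between checkpoints
</programlisting>
An immediate checkpoint can be forced from SQL at any time:
<programlisting>
CHECKPOINT;
</programlisting>
</para>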
Reducing <varname>checkpoint_segments</varname> and/or
<varname>checkpoint_timeout</varname> causes checkpoints to be done
more often. This allows faster after-crash recovery (since less work
will need to be redone). However, one must balance this against the
increased cost of flushing dirty data pages more often. In addition,
to ensure data page consistency, the first modification of a data
page after each checkpoint results in logging the entire page
content. Thus a smaller checkpoint interval increases the volume of
output to the log, partially negating the goal of using a smaller
interval, and in any case causing more disk I/O.
There will always be at least one 16 MB segment file, and there will
normally be no more than 2 * <varname>checkpoint_segments</varname> + 1
files. You can use this to estimate the space requirements for
<acronym>WAL</acronym>.
Ordinarily, when old log segment files are no longer needed, they
are recycled (renamed to become the next segments in the numbered
sequence). If, due to a short-term peak of log output rate, there
are more than 2 * <varname>checkpoint_segments</varname> + 1
segment files, the unneeded segment files will be deleted instead
of recycled until the system gets back under this limit.
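<para>
As a worked example: with the default
<varname>checkpoint_segments</varname> = 3, there will normally be at
most 2 * 3 + 1 = 7 segment files in <filename>pg_xlog</filename>, or
about 7 * 16 MB = 112 MB of disk space.
</para>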
There are two commonly used <acronym>WAL</acronym> functions:
<function>LogInsert</function> and <function>LogFlush</function>.
<function>LogInsert</function> is used to place a new record into
the <acronym>WAL</acronym> buffers in shared memory. If there is no
space for the new record, <function>LogInsert</function> will have
to write (move to kernel cache) a few filled <acronym>WAL</acronym>
buffers. This is undesirable because <function>LogInsert</function>
is used on every low-level database modification (for example, row
insertion) at a time when an exclusive lock is held on affected
data pages, so the operation needs to be as fast as possible. What
is worse, writing <acronym>WAL</acronym> buffers may also force the
creation of a new log segment, which takes even more
time. Normally, <acronym>WAL</acronym> buffers should be written
and flushed by a <function>LogFlush</function> request, which is
made, for the most part, at transaction commit time to ensure that
transaction records are flushed to permanent storage. On systems
with high log output, <function>LogFlush</function> requests may
not occur often enough to prevent <acronym>WAL</acronym> buffers
from being written by <function>LogInsert</function>. On such systems
one should increase the number of <acronym>WAL</acronym> buffers by
modifying the configuration parameter <xref
linkend="guc-wal-buffers">. The default number of
<acronym>WAL</acronym> buffers is 8. Increasing this value will
correspondingly increase shared memory usage.
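<para>
For example, one might double the default number of buffers with the
following <filename>postgresql.conf</filename> entry (the value is
purely illustrative, not a recommendation; the change takes effect at
server start, since it affects shared memory allocation):
<programlisting>
wal_buffers = 16     # default is 8 buffers of 8 kB each
</programlisting>
</para>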
Checkpoints are fairly expensive because they force all dirty kernel
buffers to disk using the operating system <literal>sync()</> call.
Busy servers may fill checkpoint segment files too quickly,
causing excessive checkpointing. If such forced checkpoints happen
more frequently than <xref linkend="guc-checkpoint-warning"> seconds,
a message will be output to the server logs recommending increasing
<varname>checkpoint_segments</varname>.
The <xref linkend="guc-commit-delay"> parameter defines for how many
microseconds the server process will sleep after writing a commit
record to the log with <function>LogInsert</function> but before
performing a <function>LogFlush</function>. This delay allows other
server processes to add their commit records to the log so as to have all
of them flushed with a single log sync. No sleep will occur if
<xref linkend="guc-fsync">
is not enabled, nor if fewer than <xref linkend="guc-commit-siblings">
other sessions are currently in active transactions; this avoids
sleeping when it's unlikely that any other session will commit soon.
Note that on most platforms, the resolution of a sleep request is
ten milliseconds, so that any nonzero <varname>commit_delay</varname>
setting between 1 and 10000 microseconds would have the same effect.
Good values for these parameters are not yet clear; experimentation
is encouraged.
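<para>
A sketch of how these parameters might appear in
<filename>postgresql.conf</filename> (the values shown are the
defaults):
<programlisting>
commit_delay = 0         # microseconds to sleep between commit record write and flush
commit_siblings = 5      # minimum concurrent open transactions before sleeping
</programlisting>
</para>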
The <xref linkend="guc-wal-sync-method"> parameter determines how
<productname>PostgreSQL</productname> will ask the kernel to force
<acronym>WAL</acronym> updates out to disk.
All the options should be the same as far as reliability goes,
but it's quite platform-specific which one will be the fastest.
Note that this parameter is irrelevant if <varname>fsync</varname>
has been turned off.
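<para>
For example (which settings are available, and which is fastest,
depends on the platform; <literal>fsync</literal> is a commonly
supported choice):
<programlisting>
wal_sync_method = fsync
</programlisting>
</para>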
Enabling the <xref linkend="guc-wal-debug"> configuration parameter
(provided that <productname>PostgreSQL</productname> has been
compiled with support for it) will result in each
<function>LogInsert</function> and <function>LogFlush</function>
<acronym>WAL</acronym> call being logged to the server log. This
option may be replaced by a more general mechanism in the future.
<sect1 id="wal-internals">
<title>Internals</title>

<acronym>WAL</acronym> is automatically enabled; no action is
required from the administrator except ensuring that the additional
disk-space requirements of the <acronym>WAL</acronym> logs are met,
and that any necessary tuning is done (see <xref
linkend="wal-configuration">).

<acronym>WAL</acronym> logs are stored in the directory
<filename>pg_xlog</filename> under the data directory, as a set of
segment files, each 16 MB in size. Each segment is divided into 8
kB pages. The log record headers are described in
<filename>access/xlog.h</filename>; the record content is dependent
on the type of event that is being logged. Segment files are given
ever-increasing numbers as names, starting at
<filename>0000000000000000</filename>. The numbers do not wrap, at
present, but it should take a very long time to exhaust the
available stock of numbers.
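<para>
For instance, the <filename>pg_xlog</filename> directory of a young
cluster might look something like this (a hypothetical listing):
<screen>
$ ls -1 pg_xlog
0000000000000000
0000000000000001
0000000000000002
</screen>
</para>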
The <acronym>WAL</acronym> buffers and control structure are in
shared memory and are handled by the server child processes; they
are protected by lightweight locks. The demand on shared memory is
dependent on the number of buffers. The default size of the
<acronym>WAL</acronym> buffers is 8 buffers of 8 kB each, or 64 kB
total.
It is advantageous if the log is located on a different disk from
the main database files. This may be achieved by moving the
directory <filename>pg_xlog</filename> to another location (while
the server is shut down, of course) and creating a symbolic link
from the original location in the main data directory to the new
location.
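<para>
A minimal sketch of this procedure on a Unix-like system, assuming the
data directory is <filename>/usr/local/pgsql/data</filename> and the
new location is <filename>/mnt/fastdisk/pg_xlog</filename> (both paths
are hypothetical):
<programlisting>
pg_ctl -D /usr/local/pgsql/data stop
mv /usr/local/pgsql/data/pg_xlog /mnt/fastdisk/pg_xlog
ln -s /mnt/fastdisk/pg_xlog /usr/local/pgsql/data/pg_xlog
pg_ctl -D /usr/local/pgsql/data start
</programlisting>
</para>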
The aim of <acronym>WAL</acronym>, to ensure that the log is
written before database records are altered, may be subverted by
disk drives<indexterm><primary>disk drive</></> that falsely report a successful write to the kernel,
when, in fact, they have only cached the data and not yet stored it
on the disk. A power failure in such a situation may still lead to
irrecoverable data corruption. Administrators should try to ensure
that disks holding <productname>PostgreSQL</productname>'s
<acronym>WAL</acronym> log files do not make such false reports.
After a checkpoint has been made and the log flushed, the
checkpoint's position is saved in the file
<filename>pg_control</filename>. Therefore, when recovery is to be
done, the server first reads <filename>pg_control</filename> and
then the checkpoint record; then it performs the REDO operation by
scanning forward from the log position indicated in the checkpoint
record. Because the entire content of data pages is saved in the
log on the first page modification after a checkpoint, all pages
changed since the checkpoint will be restored to a consistent
state.
Using <filename>pg_control</filename> to get the checkpoint
position speeds up the recovery process, but to handle possible
corruption of <filename>pg_control</filename>, we should actually
implement the reading of existing log segments in reverse order --
newest to oldest -- in order to find the last checkpoint. This has
not yet been implemented.