<!-- $PostgreSQL: pgsql/doc/src/sgml/wal.sgml,v 1.28 2004/03/09 16:57:47 neilc Exp $ -->

<title>Write-Ahead Logging (<acronym>WAL</acronym>)</title>

<primary>transaction log</primary>

<firstterm>Write-Ahead Logging</firstterm> (<acronym>WAL</acronym>)
is a standard approach to transaction logging. Its detailed
description may be found in most (if not all) books about
transaction processing. Briefly, <acronym>WAL</acronym>'s central
concept is that changes to data files (where tables and indexes
reside) must be written only after those changes have been logged,
that is, when log records describing the changes have been flushed
to permanent storage. If we follow this procedure, we do not need
to flush data pages to disk on every transaction commit, because we
know that in the event of a crash we will be able to recover the
database using the log: any changes that have not been applied to
the data pages will first be redone from the log records (this is
roll-forward recovery, also known as REDO) and then changes made by
uncommitted transactions will be removed from the data pages
(roll-backward recovery, UNDO).
<sect1 id="wal-benefits-now">
<title>Benefits of <acronym>WAL</acronym></title>

<indexterm zone="wal-benefits-now">
<primary>fsync</primary>

The first obvious benefit of using <acronym>WAL</acronym> is a
significantly reduced number of disk writes, since only the log
file needs to be flushed to disk at the time of transaction
commit; in multiuser environments, commits of many transactions
may be accomplished with a single <function>fsync()</function> of
the log file. Furthermore, the log file is written sequentially,
and so the cost of syncing the log is much less than the cost of
flushing the data pages.

The next benefit is consistency of the data pages. The truth is
that, before <acronym>WAL</acronym>,
<productname>PostgreSQL</productname> was never able to guarantee
consistency in the case of a crash. Before
<acronym>WAL</acronym>, any crash during writing could result in:

<simpara>index rows pointing to nonexistent table rows</simpara>
<simpara>index rows lost in split operations</simpara>
<simpara>totally corrupted table or index page content, because
of partially written data pages</simpara>

Problems with indexes (problems 1 and 2) could possibly have been
fixed by additional <function>fsync()</function> calls, but it is
not obvious how to handle the last case without
<acronym>WAL</acronym>; <acronym>WAL</acronym> saves the entire data
page content in the log if that is required to ensure page
consistency for after-crash recovery.
<sect1 id="wal-benefits-later">
<title>Future Benefits</title>

The UNDO operation is not implemented. This means that changes
made by aborted transactions will still occupy disk space and that
a permanent <filename>pg_clog</filename> file to hold
the status of transactions is still needed, since
transaction identifiers cannot be reused. Once UNDO is implemented,
<filename>pg_clog</filename> will no longer be required to be
permanent; it will be possible to remove
<filename>pg_clog</filename> at shutdown. (However, the urgency of
this concern has decreased greatly with the adoption of a segmented
storage method for <filename>pg_clog</filename>: it is no longer
necessary to keep old <filename>pg_clog</filename> entries around
indefinitely.)
With UNDO, it will also be possible to implement
<firstterm>savepoints</firstterm><indexterm><primary>savepoint</></> to allow partial rollback of
invalid transaction operations (parser errors caused by mistyping
commands, insertion of duplicate primary/unique keys, and so on)
with the ability to continue or commit valid operations made by
the transaction before the error. At present, any error will
invalidate the whole transaction and require a transaction abort.

<acronym>WAL</acronym> offers the opportunity for a new method for
database on-line backup and restore (<acronym>BAR</acronym>). To
use this method, one would have to make periodic saves of data
files to another disk, a tape, or another host and also archive the
<acronym>WAL</acronym> log files. The database file copy and the
archived log files could be used to restore just as if one were
restoring after a crash. Each time a new database file copy was
made, the old log files could be removed. Implementing this
facility will require the logging of data file and index creation
and deletion; it will also require development of a method for
copying the data files (operating system copy commands are not
suitable).
A difficulty standing in the way of realizing these benefits is that
they require saving <acronym>WAL</acronym> entries for considerable
periods of time (e.g., as long as the longest possible transaction if
transaction UNDO is wanted). The present <acronym>WAL</acronym>
format is extremely bulky since it includes many disk page
snapshots. This is not a serious concern at present, since the
entries only need to be kept for one or two checkpoint intervals;
but to achieve these future benefits some sort of compressed
<acronym>WAL</acronym> format will be needed.

<sect1 id="wal-configuration">
<title><acronym>WAL</acronym> Configuration</title>

There are several <acronym>WAL</acronym>-related configuration parameters that
affect database performance. This section explains their use.
Consult <xref linkend="runtime-config"> for details about setting
configuration parameters.
<firstterm>Checkpoints</firstterm><indexterm><primary>checkpoint</></> are points in the sequence of
transactions at which it is guaranteed that the data files have
been updated with all information logged before the checkpoint. At
checkpoint time, all dirty data pages are flushed to disk and a
special checkpoint record is written to the log file. As a result, in
the event of a crash, the recoverer knows from what record in the
log (known as the redo record) it should start the REDO operation,
since any changes made to data files before that record are already
on disk. After a checkpoint has been made, any log segments written
before the redo record are no longer needed and can be recycled or
removed. (When <acronym>WAL</acronym>-based <acronym>BAR</acronym> is
implemented, the log segments would be archived before being recycled
or removed.)
The server spawns a special process every so often to create the
next checkpoint. A checkpoint is created every <xref
linkend="guc-checkpoint-segments"> log segments, or every <xref
linkend="guc-checkpoint-timeout"> seconds, whichever comes first.
The default settings are 3 segments and 300 seconds respectively.
It is also possible to force a checkpoint by using the SQL command
<command>CHECKPOINT</command>.
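<para>
As an illustration, the defaults just mentioned correspond to these
<filename>postgresql.conf</filename> entries (a minimal sketch; the
values shown are simply the defaults described above):
<programlisting>
checkpoint_segments = 3      # maximum log segments between checkpoints
checkpoint_timeout = 300     # maximum seconds between checkpoints
</programlisting>
An immediate checkpoint can be forced from SQL at any time:
<programlisting>
CHECKPOINT;
</programlisting>
</para>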
Reducing <varname>checkpoint_segments</varname> and/or
<varname>checkpoint_timeout</varname> causes checkpoints to be done
more often. This allows faster after-crash recovery (since less work
will need to be redone). However, one must balance this against the
increased cost of flushing dirty data pages more often. In addition,
to ensure data page consistency, the first modification of a data
page after each checkpoint results in logging the entire page
content. Thus a smaller checkpoint interval increases the volume of
output to the log, partially negating the goal of using a smaller
interval, and in any case causing more disk I/O.
There will always be at least one 16 MB segment file, and there will
normally be no more than 2 * <varname>checkpoint_segments</varname> + 1
files. You can use this to estimate the space requirements for
<acronym>WAL</acronym>.
Ordinarily, when old log segment files are no longer needed, they
are recycled (renamed to become the next segments in the numbered
sequence). If, due to a short-term peak of log output rate, there
are more than 2 * <varname>checkpoint_segments</varname> + 1
segment files, the unneeded segment files will be deleted instead
of recycled until the system gets back under this limit.
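<para>
As a worked example: with the default
<varname>checkpoint_segments</varname> = 3, there will normally be at
most 2 * 3 + 1 = 7 segment files in <filename>pg_xlog</filename>, or
about 7 * 16 MB = 112 MB of disk space.
</para>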
There are two commonly used <acronym>WAL</acronym> functions:
<function>LogInsert</function> and <function>LogFlush</function>.
<function>LogInsert</function> is used to place a new record into
the <acronym>WAL</acronym> buffers in shared memory. If there is no
space for the new record, <function>LogInsert</function> will have
to write (move to kernel cache) a few filled <acronym>WAL</acronym>
buffers. This is undesirable because <function>LogInsert</function>
is used on every low-level database modification (for example, row
insertion) at a time when an exclusive lock is held on affected
data pages, so the operation needs to be as fast as possible. What
is worse, writing <acronym>WAL</acronym> buffers may also force the
creation of a new log segment, which takes even more
time. Normally, <acronym>WAL</acronym> buffers should be written
and flushed by a <function>LogFlush</function> request, which is
made, for the most part, at transaction commit time to ensure that
transaction records are flushed to permanent storage. On systems
with high log output, <function>LogFlush</function> requests may
not occur often enough to prevent <acronym>WAL</acronym> buffers
from being written by <function>LogInsert</function>. On such systems
one should increase the number of <acronym>WAL</acronym> buffers by
modifying the configuration parameter <xref
linkend="guc-wal-buffers">. The default number of
<acronym>WAL</acronym> buffers is 8. Increasing this value will
correspondingly increase shared memory usage.
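<para>
For example, one might double the default number of buffers with the
following <filename>postgresql.conf</filename> entry (the value is
purely illustrative, not a recommendation; the change takes effect at
server start, since it affects shared memory allocation):
<programlisting>
wal_buffers = 16     # default is 8 buffers of 8 kB each
</programlisting>
</para>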
Checkpoints are fairly expensive because they force all dirty kernel
buffers to disk using the operating system <literal>sync()</> call.
Busy servers may fill checkpoint segment files too quickly,
causing excessive checkpointing. If such forced checkpoints happen
more frequently than <xref linkend="guc-checkpoint-warning"> seconds,
a message will be output to the server logs recommending increasing
<varname>checkpoint_segments</varname>.
The <xref linkend="guc-commit-delay"> parameter defines for how many
microseconds the server process will sleep after writing a commit
record to the log with <function>LogInsert</function> but before
performing a <function>LogFlush</function>. This delay allows other
server processes to add their commit records to the log so as to have all
of them flushed with a single log sync. No sleep will occur if
<xref linkend="guc-fsync">
is not enabled, nor if fewer than <xref linkend="guc-commit-siblings">
other sessions are currently in active transactions; this avoids
sleeping when it's unlikely that any other session will commit soon.
Note that on most platforms, the resolution of a sleep request is
ten milliseconds, so that any nonzero <varname>commit_delay</varname>
setting between 1 and 10000 microseconds would have the same effect.
Good values for these parameters are not yet clear; experimentation
is encouraged.
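<para>
A sketch of how these parameters might appear in
<filename>postgresql.conf</filename> (the values shown are the
defaults):
<programlisting>
commit_delay = 0         # microseconds to sleep between commit record write and flush
commit_siblings = 5      # minimum concurrent open transactions before sleeping
</programlisting>
</para>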
The <xref linkend="guc-wal-sync-method"> parameter determines how
<productname>PostgreSQL</productname> will ask the kernel to force
<acronym>WAL</acronym> updates out to disk.
All the options should be the same as far as reliability goes,
but it's quite platform-specific which one will be the fastest.
Note that this parameter is irrelevant if <varname>fsync</varname>
has been turned off.
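<para>
For example (which settings are available, and which is fastest,
depends on the platform; <literal>fsync</literal> is a commonly
supported choice):
<programlisting>
wal_sync_method = fsync
</programlisting>
</para>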
Enabling the <xref linkend="guc-wal-debug"> configuration parameter
(provided that <productname>PostgreSQL</productname> has been
compiled with support for it) will result in each
<function>LogInsert</function> and <function>LogFlush</function>
<acronym>WAL</acronym> call being logged to the server log. This
option may be replaced by a more general mechanism in the future.
<sect1 id="wal-internals">
<title>Internals</title>

<acronym>WAL</acronym> is automatically enabled; no action is
required from the administrator except ensuring that the additional
disk-space requirements of the <acronym>WAL</acronym> logs are met,
and that any necessary tuning is done (see <xref
linkend="wal-configuration">).

<acronym>WAL</acronym> logs are stored in the directory
<filename>pg_xlog</filename> under the data directory, as a set of
segment files, each 16 MB in size. Each segment is divided into 8
kB pages. The log record headers are described in
<filename>access/xlog.h</filename>; the record content is dependent
on the type of event that is being logged. Segment files are given
ever-increasing numbers as names, starting at
<filename>0000000000000000</filename>. The numbers do not wrap, at
present, but it should take a very long time to exhaust the
available stock of numbers.
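<para>
For instance, the <filename>pg_xlog</filename> directory of a young
cluster might look something like this (a hypothetical listing):
<screen>
$ ls -1 pg_xlog
0000000000000000
0000000000000001
0000000000000002
</screen>
</para>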
The <acronym>WAL</acronym> buffers and control structure are in
shared memory and are handled by the server child processes; they
are protected by lightweight locks. The demand on shared memory is
dependent on the number of buffers. The default size of the
<acronym>WAL</acronym> buffers is 8 buffers of 8 kB each, or 64 kB
total.
It is advantageous if the log is located on a different disk from
the main database files. This may be achieved by moving the
directory <filename>pg_xlog</filename> to another location (while
the server is shut down, of course) and creating a symbolic link
from the original location in the main data directory to the new
location.
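<para>
A minimal sketch of this procedure on a Unix-like system, assuming the
data directory is <filename>/usr/local/pgsql/data</filename> and the
new location is <filename>/mnt/fastdisk/pg_xlog</filename> (both paths
are hypothetical):
<programlisting>
pg_ctl -D /usr/local/pgsql/data stop
mv /usr/local/pgsql/data/pg_xlog /mnt/fastdisk/pg_xlog
ln -s /mnt/fastdisk/pg_xlog /usr/local/pgsql/data/pg_xlog
pg_ctl -D /usr/local/pgsql/data start
</programlisting>
</para>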
The aim of <acronym>WAL</acronym>, to ensure that the log is
written before database records are altered, may be subverted by
disk drives<indexterm><primary>disk drive</></> that falsely report a successful write to the kernel,
when, in fact, they have only cached the data and not yet stored it
on the disk. A power failure in such a situation may still lead to
irrecoverable data corruption. Administrators should try to ensure
that disks holding <productname>PostgreSQL</productname>'s
<acronym>WAL</acronym> log files do not make such false reports.
After a checkpoint has been made and the log flushed, the
checkpoint's position is saved in the file
<filename>pg_control</filename>. Therefore, when recovery is to be
done, the server first reads <filename>pg_control</filename> and
then the checkpoint record; then it performs the REDO operation by
scanning forward from the log position indicated in the checkpoint
record. Because the entire content of data pages is saved in the
log on the first page modification after a checkpoint, all pages
changed since the checkpoint will be restored to a consistent
state.
Using <filename>pg_control</filename> to get the checkpoint
position speeds up the recovery process, but to handle possible
corruption of <filename>pg_control</filename>, we should actually
implement the reading of existing log segments in reverse order --
newest to oldest -- in order to find the last checkpoint. This has
not yet been implemented.