From 70ec2f8f4392f4e3d379c2c759789d631ffeec10 Mon Sep 17 00:00:00 2001
From: Tom Lane
Date: Fri, 15 Mar 2013 17:41:47 -0400
Subject: [PATCH] Improve the documentation about commit_delay.

Clarify the docs explaining what commit_delay does, and add a
recommendation about a useful value for it, namely half of the
single-page fsync time reported by pg_test_fsync. This is informed by
testing of the new-in-9.3 implementation of commit_delay; in prior
versions it was far harder to arrive at a useful setting.

In passing, do some wordsmithing and markup-fixing in the same general
area.

Also, change pg_test_fsync's default time-per-test from 2 seconds to 5.
The old value was about the minimum at which the results could be taken
seriously at all, and so seems a tad optimistic as a default.

Peter Geoghegan, reviewed by Noah Misch; some additional editing by me
---
 contrib/pg_test_fsync/pg_test_fsync.c |   2 +-
 doc/src/sgml/config.sgml              |   9 +-
 doc/src/sgml/pgtestfsync.sgml         |   8 +-
 doc/src/sgml/wal.sgml                 | 169 ++++++++++++++++----------
 4 files changed, 118 insertions(+), 70 deletions(-)

diff --git a/contrib/pg_test_fsync/pg_test_fsync.c b/contrib/pg_test_fsync/pg_test_fsync.c
index ec4b90c797..5ee03981a3 100644
--- a/contrib/pg_test_fsync/pg_test_fsync.c
+++ b/contrib/pg_test_fsync/pg_test_fsync.c
@@ -60,7 +60,7 @@ do { \
  static const char *progname;
- static int secs_per_test = 2;
+ static int secs_per_test = 5;
  static int needs_unlink = 0;
  static char full_buf[XLOG_SEG_SIZE],
         *buf,
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index ae6ee60ab1..575b40b58d 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -1603,8 +1603,8 @@ include 'filename'
  Write Ahead Log
- See also for details on WAL
- and checkpoint tuning.
+ For additional information on tuning these settings,
+ see .
@@ -1957,7 +1957,7 @@ include 'filename'
  given interval. However, it also increases latency by up to
  commit_delay microseconds for each WAL flush.
  Because the delay is just wasted if no other transactions
- become ready to commit, it is only performed if at least
+ become ready to commit, a delay is only performed if at least
  commit_siblings other transactions are active
  immediately before a flush would otherwise have been initiated.
  In PostgreSQL releases prior to 9.3,
@@ -1968,7 +1968,8 @@ include 'filename'
  the first process that becomes ready to flush waits for the configured
  interval, while subsequent processes wait only until the leader
  completes the flush. The default commit_delay is zero
- (no delay), and only honored if fsync is enabled.
+ (no delay). No delays are performed unless fsync
+ is enabled.
diff --git a/doc/src/sgml/pgtestfsync.sgml b/doc/src/sgml/pgtestfsync.sgml
index 00ef209fa2..8c58985c90 100644
--- a/doc/src/sgml/pgtestfsync.sgml
+++ b/doc/src/sgml/pgtestfsync.sgml
@@ -36,8 +36,8 @@
  difference in real database throughput, especially since many
  database servers are not speed-limited by their transaction logs.
  pg_test_fsync reports average file sync operation
- time in microseconds for each wal_sync_method, which can be used to inform
- efforts to optimize the value of commit_delay.
+ time in microseconds for each wal_sync_method, which can also be used to
+ inform efforts to optimize the value of .
@@ -72,8 +72,8 @@
  Specifies the number of seconds for each test. The more time
  per test, the greater the test's accuracy, but the longer it takes
- to run. The default is 2 seconds, which allows the program to
- complete in about 30 seconds.
+ to run. The default is 5 seconds, which allows the program to
+ complete in under 2 minutes.
diff --git a/doc/src/sgml/wal.sgml b/doc/src/sgml/wal.sgml
index fc5c3b24c3..dbaadb6f15 100644
--- a/doc/src/sgml/wal.sgml
+++ b/doc/src/sgml/wal.sgml
@@ -133,7 +133,7 @@
  (BBU) disk controllers. In such setups, the synchronize
  command forces all data from the controller cache to the disks,
  eliminating much of the benefit of the BBU.
  You can run the
- module to see
+ program to see
  if you are affected. If you are affected, the performance benefits
  of the BBU can be regained by turning off write barriers in
  the file system or reconfiguring the disk controller, if that is
@@ -372,11 +372,12 @@
  asynchronous commit, but it is actually a synchronous commit method
  (in fact, commit_delay is ignored during an
  asynchronous commit). commit_delay causes a delay
- just before a synchronous commit attempts to flush
- WAL to disk, in the hope that a single flush
- executed by one such transaction can also serve other transactions
- committing at about the same time. Setting commit_delay
- can only help when there are many concurrently committing transactions.
+ just before a transaction flushes WAL to disk, in
+ the hope that a single flush executed by one such transaction can also
+ serve other transactions committing at about the same time. The
+ setting can be thought of as a way of increasing the time window in
+ which transactions can join a group about to participate in a single
+ flush, to amortize the cost of the flush among multiple transactions.
@@ -394,15 +395,16 @@
  Checkpoints
  are points in the sequence of transactions at which it is guaranteed
- that the heap and index data files have been updated with all information written before
- the checkpoint. At checkpoint time, all dirty data pages are flushed to
- disk and a special checkpoint record is written to the log file.
- (The changes were previously flushed to the WAL files.)
+ that the heap and index data files have been updated with all
+ information written before that checkpoint. At checkpoint time, all
+ dirty data pages are flushed to disk and a special checkpoint record is
+ written to the log file. (The change records were previously flushed
+ to the WAL files.)
  In the event of a crash, the crash recovery procedure looks at the latest
  checkpoint record to determine the point in the log (known as the redo
  record) from which it should start the REDO operation. Any changes made to
- data files before that point are guaranteed to be already on disk. Hence, after
- a checkpoint, log segments preceding the one containing
+ data files before that point are guaranteed to be already on disk.
+ Hence, after a checkpoint, log segments preceding the one containing
  the redo record are no longer needed and can be recycled or removed. (When
  WAL archiving is being done, the log segments must be
  archived before being recycled or removed.)
@@ -411,31 +413,32 @@
  The checkpoint requirement of flushing all dirty data pages to disk
  can cause a significant I/O load. For this reason, checkpoint
- activity is throttled so I/O begins at checkpoint start and completes
- before the next checkpoint starts; this minimizes performance
+ activity is throttled so that I/O begins at checkpoint start and completes
+ before the next checkpoint is due to start; this minimizes performance
  degradation during checkpoints.
  The server's checkpointer process automatically performs
- a checkpoint every so often. A checkpoint is created every
  log segments, or every
  seconds, whichever comes first.
  The default settings are 3 segments and 300 seconds (5 minutes), respectively.
- In cases where no WAL has been written since the previous checkpoint, new
- checkpoints will be skipped even if checkpoint_timeout has passed.
- If WAL archiving is being used and you want to put a lower limit on
- how often files are archived in order to bound potential data
- loss, you should adjust archive_timeout parameter rather than the checkpoint
- parameters. It is also possible to force a checkpoint by using the SQL
+ If no WAL has been written since the previous checkpoint, new checkpoints
+ will be skipped even if checkpoint_timeout has passed.
+ (If WAL archiving is being used and you want to put a lower limit on how
+ often files are archived in order to bound potential data loss, you should
+ adjust the parameter rather than the
+ checkpoint parameters.)
+ It is also possible to force a checkpoint by using the SQL
  command CHECKPOINT.
  Reducing checkpoint_segments and/or
  checkpoint_timeout causes checkpoints to occur
- more often. This allows faster after-crash recovery (since less work
- will need to be redone). However, one must balance this against the
+ more often. This allows faster after-crash recovery, since less work
+ will need to be redone. However, one must balance this against the
  increased cost of flushing dirty data pages more often. If
  is set (as is the default), there is
  another factor to consider. To ensure data page consistency,
@@ -450,7 +453,7 @@
  Checkpoints are fairly expensive, first because they require writing
  out all currently dirty buffers, and second because they result in
  extra subsequent WAL traffic as discussed above. It is therefore
- wise to set the checkpointing parameters high enough that checkpoints
+ wise to set the checkpointing parameters high enough so that checkpoints
  don't happen too often. As a simple sanity check on your checkpointing
  parameters, you can set the
  parameter. If checkpoints happen closer together than
@@ -498,7 +501,7 @@
  altered when building the server). You can use this to estimate space
  requirements for WAL.
  Ordinarily, when old log segment files are no longer needed, they
- are recycled (renamed to become the next segments in the numbered
+ are recycled (that is, renamed to become future segments in the numbered
  sequence).
  If, due to a short-term peak of log output rate, there
  are more than 3 * checkpoint_segments + 1
  segment files, the unneeded segment files will be deleted instead
@@ -507,64 +510,108 @@
  In archive recovery or standby mode, the server periodically performs
- restartpoints
+ restartpoints,
  which are similar to checkpoints in normal operation: the server forces
  all its state to disk, updates the pg_control file to
  indicate that the already-processed WAL data need not be scanned again,
- and then recycles any old log segment files in pg_xlog
- directory. A restartpoint is triggered if at least one checkpoint record
- has been replayed and checkpoint_timeout seconds have passed
- since last restartpoint. In standby mode, a restartpoint is also triggered
- if checkpoint_segments log segments have been replayed since
- last restartpoint and at least one checkpoint record has been replayed.
+ and then recycles any old log segment files in the pg_xlog
+ directory.
  Restartpoints can't be performed more frequently than checkpoints in the
  master because restartpoints can only be performed at checkpoint records.
+ A restartpoint is triggered when a checkpoint record is reached if at
+ least checkpoint_timeout seconds have passed since the last
+ restartpoint. In standby mode, a restartpoint is also triggered if at
+ least checkpoint_segments log segments have been replayed
+ since the last restartpoint.
  There are two commonly used internal WAL functions:
- LogInsert and LogFlush.
- LogInsert is used to place a new record into
+ XLogInsert and XLogFlush.
+ XLogInsert is used to place a new record into
  the WAL buffers in shared memory. If there is no
- space for the new record, LogInsert will have
+ space for the new record, XLogInsert will have
  to write (move to kernel cache) a few filled WAL
- buffers. This is undesirable because LogInsert
+ buffers. This is undesirable because XLogInsert
  is used on every database low level modification (for example, row
  insertion) at a time when an exclusive lock is held on affected
  data pages, so the operation needs to be as fast as possible. What
  is worse, writing WAL buffers might also force the
  creation of a new log segment, which takes even more
  time. Normally, WAL buffers should be written
- and flushed by a LogFlush request, which is
+ and flushed by an XLogFlush request, which is
  made, for the most part, at transaction commit time to ensure that
  transaction records are flushed to permanent storage. On systems
- with high log output, LogFlush requests might
- not occur often enough to prevent LogInsert
+ with high log output, XLogFlush requests might
+ not occur often enough to prevent XLogInsert
  from having to do writes. On such systems
  one should increase the number of WAL buffers by
- modifying the configuration parameter . When
+ modifying the parameter. When
  is set and the system is very busy,
- setting this value higher will help smooth response times during the
- period immediately following each checkpoint.
+ setting wal_buffers higher will help smooth response times
+ during the period immediately following each checkpoint.
  The parameter defines for how many
- microseconds the server process will sleep after writing a commit
- record to the log with LogInsert but before
- performing a LogFlush. This delay allows other
- server processes to add their commit records to the log so as to have all
- of them flushed with a single log sync. No sleep will occur if
-
- is not enabled, or if fewer than
- other sessions are currently in active transactions; this avoids
- sleeping when it's unlikely that any other session will commit soon.
- Note that on most platforms, the resolution of a sleep request is
- ten milliseconds, so that any nonzero commit_delay
- setting between 1 and 10000 microseconds would have the same effect.
- Good values for these parameters are not yet clear; experimentation
- is encouraged.
+ microseconds a group commit leader process will sleep after acquiring a
+ lock within XLogFlush, while group commit
+ followers queue up behind the leader. This delay allows other server
+ processes to add their commit records to the WAL buffers so that all of
+ them will be flushed by the leader's eventual sync operation. No sleep
+ will occur if is not enabled, or if fewer
+ than other sessions are currently
+ in active transactions; this avoids sleeping when it's unlikely that
+ any other session will commit soon. Note that on some platforms, the
+ resolution of a sleep request is ten milliseconds, so that any nonzero
+ commit_delay setting between 1 and 10000
+ microseconds would have the same effect. Note also that on some
+ platforms, sleep operations may take slightly longer than requested by
+ the parameter.
+
+
+
+ Since the purpose of commit_delay is to allow the
+ cost of each flush operation to be amortized across concurrently
+ committing transactions (potentially at the expense of transaction
+ latency), it is necessary to quantify that cost before the setting can
+ be chosen intelligently. The higher that cost is, the more effective
+ commit_delay is expected to be in increasing
+ transaction throughput, up to a point. The program can be used to measure the average time
+ in microseconds that a single WAL flush operation takes. A value of
+ half of the average time the program reports it takes to flush after a
+ single 8kB write operation is often the most effective setting for
+ commit_delay, so this value is recommended as the
+ starting point to use when optimizing for a particular workload. While
+ tuning commit_delay is particularly useful when the
+ WAL log is stored on high-latency rotating disks, benefits can be
+ significant even on storage media with very fast sync times, such as
+ solid-state drives or RAID arrays with a battery-backed write cache;
+ but this should definitely be tested against a representative workload.
+ Higher values of commit_siblings should be used in
+ such cases, whereas smaller commit_siblings values
+ are often helpful on higher latency media. Note that it is quite
+ possible that a setting of commit_delay that is too
+ high can increase transaction latency by so much that total transaction
+ throughput suffers.
+
+
+
+ When commit_delay is set to zero (the default), it
+ is still possible for a form of group commit to occur, but each group
+ will consist only of sessions that reach the point where they need to
+ flush their commit records during the window in which the previous
+ flush operation (if any) is occurring. At higher client counts a
+ gangway effect tends to occur, so that the effects of group
+ commit become significant even when commit_delay is
+ zero, and thus explicitly setting commit_delay tends
+ to help less. Setting commit_delay can only help
+ when (1) there are some concurrently committing transactions, and (2)
+ throughput is limited to some degree by commit rate; but with high
+ rotational latency this setting can be effective in increasing
+ transaction throughput with as few as two clients (that is, a single
+ committing client with one sibling transaction).
@@ -574,9 +621,9 @@
  All the options should be the same in terms of reliability, with
  the exception of fsync_writethrough, which can sometimes
  force a flush of the disk cache even when other options do not do so.
- However, it's quite platform-specific which one will be the fastest;
- you can test option speeds using the module.
+ However, it's quite platform-specific which one will be the fastest.
+ You can test the speeds of different options using the program.
  Note that this parameter is irrelevant if fsync
  has been turned off.
@@ -585,7 +632,7 @@
  Enabling the configuration parameter (provided that
  PostgreSQL has been compiled with support for it)
  will result in each
- LogInsert and LogFlush
+ XLogInsert and XLogFlush
  WAL call being logged to the server log. This
  option might be replaced by a more general mechanism in the future.
--
2.40.0
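
Reviewer's note: the recommendation added to wal.sgml above (start with half of the average single-8kB-write flush time that pg_test_fsync reports) can be sketched as a small calculation. This is only an illustration, not part of the patch. The helper names suggested_commit_delay and conf_line are hypothetical, the 407 microsecond input is a made-up measurement, and the 100000 microsecond ceiling is assumed here to be the upper bound that commit_delay accepts.

```python
# Hypothetical helper, not part of PostgreSQL: derive a starting
# commit_delay from a measured average WAL flush time.

# Assumed upper bound for commit_delay, in microseconds.
MAX_COMMIT_DELAY_USECS = 100_000


def suggested_commit_delay(avg_flush_usecs: float) -> int:
    """Return half of the measured flush time, in whole microseconds.

    avg_flush_usecs is the average time pg_test_fsync reports for a
    flush after a single 8kB write with the wal_sync_method in use.
    """
    if avg_flush_usecs < 0:
        raise ValueError("flush time must be non-negative")
    # Floor division keeps the result an integral number of microseconds.
    return min(MAX_COMMIT_DELAY_USECS, int(avg_flush_usecs // 2))


def conf_line(avg_flush_usecs: float) -> str:
    """Format the suggestion as a postgresql.conf setting."""
    return f"commit_delay = {suggested_commit_delay(avg_flush_usecs)}"


if __name__ == "__main__":
    # 407 usecs/op is an invented example measurement.
    print(conf_line(407))
```

The result is a starting point only; as the patched text says, the value should be tested against a representative workload, together with commit_siblings.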