1 <!-- doc/src/sgml/high-availability.sgml -->
3 <chapter id="high-availability">
4 <title>High Availability, Load Balancing, and Replication</title>
6 <indexterm><primary>high availability</></>
7 <indexterm><primary>failover</></>
8 <indexterm><primary>replication</></>
9 <indexterm><primary>load balancing</></>
10 <indexterm><primary>clustering</></>
11 <indexterm><primary>data partitioning</></>
14 Database servers can work together to allow a second server to
15 take over quickly if the primary server fails (high
16 availability), or to allow several computers to serve the same
17 data (load balancing). Ideally, database servers could work
18 together seamlessly. Web servers serving static web pages can
19 be combined quite easily by merely load-balancing web requests
20 to multiple machines. In fact, read-only database servers can
21 be combined relatively easily too. Unfortunately, most database
22 servers have a read/write mix of requests, and read/write servers
are much harder to combine. This is because, although read-only
data needs to be placed on each server only once, a write to any
server must be propagated to all servers so that future read
requests to those servers return consistent results.
30 This synchronization problem is the fundamental difficulty for
31 servers working together. Because there is no single solution
32 that eliminates the impact of the sync problem for all use cases,
33 there are multiple solutions. Each solution addresses this
problem in a different way, and minimizes its impact for a specific workload.
39 Some solutions deal with synchronization by allowing only one
40 server to modify the data. Servers that can modify data are
41 called read/write, <firstterm>master</> or <firstterm>primary</> servers.
42 Servers that track changes in the master are called <firstterm>standby</>
43 or <firstterm>slave</> servers. A standby server that cannot be connected
44 to until it is promoted to a master server is called a <firstterm>warm
45 standby</> server, and one that can accept connections and serves read-only
46 queries is called a <firstterm>hot standby</> server.
50 Some solutions are synchronous,
51 meaning that a data-modifying transaction is not considered
52 committed until all servers have committed the transaction. This
53 guarantees that a failover will not lose any data and that all
54 load-balanced servers will return consistent results no matter
55 which server is queried. In contrast, asynchronous solutions allow some
56 delay between the time of a commit and its propagation to the other servers,
57 opening the possibility that some transactions might be lost in
the switch to a backup server, and that load-balanced servers
59 might return slightly stale results. Asynchronous communication
60 is used when synchronous would be too slow.
64 Solutions can also be categorized by their granularity. Some solutions
65 can deal only with an entire database server, while others allow control
66 at the per-table or per-database level.
70 Performance must be considered in any choice. There is usually a
71 trade-off between functionality and
72 performance. For example, a fully synchronous solution over a slow
73 network might cut performance by more than half, while an asynchronous
74 one might have a minimal performance impact.
78 The remainder of this section outlines various failover, replication,
79 and load balancing solutions. A <ulink
80 url="http://www.postgres-r.org/documentation/terms">glossary</ulink> is
84 <sect1 id="different-replication-solutions">
85 <title>Comparison of Different Solutions</title>
90 <term>Shared Disk Failover</term>
94 Shared disk failover avoids synchronization overhead by having only one
95 copy of the database. It uses a single disk array that is shared by
96 multiple servers. If the main database server fails, the standby server
97 is able to mount and start the database as though it were recovering from
98 a database crash. This allows rapid failover with no data loss.
102 Shared hardware functionality is common in network storage devices.
103 Using a network file system is also possible, though care must be
104 taken that the file system has full <acronym>POSIX</> behavior (see <xref
105 linkend="creating-cluster-nfs">). One significant limitation of this
106 method is that if the shared disk array fails or becomes corrupt, the
107 primary and standby servers are both nonfunctional. Another issue is
108 that the standby server should never access the shared storage while
109 the primary server is running.
116 <term>File System (Block-Device) Replication</term>
120 A modified version of shared hardware functionality is file system
121 replication, where all changes to a file system are mirrored to a file
122 system residing on another computer. The only restriction is that
123 the mirroring must be done in a way that ensures the standby server
124 has a consistent copy of the file system — specifically, writes
125 to the standby must be done in the same order as those on the master.
<productname>DRBD</> is a popular file system replication solution for Linux.
143 <term>Transaction Log Shipping</term>
147 Warm and hot standby servers can be kept current by reading a
148 stream of write-ahead log (<acronym>WAL</>)
149 records. If the main server fails, the standby contains
150 almost all of the data of the main server, and can be quickly
151 made the new master database server. This can be synchronous or
152 asynchronous and can only be done for the entire database server.
155 A standby server can be implemented using file-based log shipping
156 (<xref linkend="warm-standby">) or streaming replication (see
157 <xref linkend="streaming-replication">), or a combination of both. For
158 information on hot standby, see <xref linkend="hot-standby">.
164 <term>Trigger-Based Master-Standby Replication</term>
168 A master-standby replication setup sends all data modification
169 queries to the master server. The master server asynchronously
170 sends data changes to the standby server. The standby can answer
171 read-only queries while the master server is running. The
172 standby server is ideal for data warehouse queries.
176 <productname>Slony-I</> is an example of this type of replication, with per-table
177 granularity, and support for multiple standby servers. Because it
updates the standby server asynchronously (in batches), data loss is
possible during failover.
185 <term>Statement-Based Replication Middleware</term>
189 With statement-based replication middleware, a program intercepts
190 every SQL query and sends it to one or all servers. Each server
191 operates independently. Read-write queries must be sent to all servers,
192 so that every server receives any changes. But read-only queries can be
sent to just one server, allowing the read workload to be distributed
among them.
198 If queries are simply broadcast unmodified, functions like
199 <function>random()</>, <function>CURRENT_TIMESTAMP</>, and
200 sequences can have different values on different servers.
201 This is because each server operates independently, and because
202 SQL queries are broadcast (and not actual modified rows). If
203 this is unacceptable, either the middleware or the application
204 must query such values from a single server and then use those
values in write queries. Another option is to combine this
middleware with a traditional master-standby setup, i.e., data modification
207 queries are sent only to the master and are propagated to the
208 standby servers via master-standby replication, not by the replication
209 middleware. Care must also be taken that all
210 transactions either commit or abort on all servers, perhaps
211 using two-phase commit (<xref linkend="sql-prepare-transaction">
and <xref linkend="sql-commit-prepared">).
213 <productname>Pgpool-II</> and <productname>Continuent Tungsten</>
214 are examples of this type of replication.
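<para>
As a hedged illustration, the two-phase commit commands look like
this (the <literal>accounts</> table and the transaction identifier
are hypothetical; the middleware would issue the same statements on
every server and proceed to <command>COMMIT PREPARED</> only after
all servers have prepared successfully):
<programlisting>
BEGIN;
UPDATE accounts SET balance = balance - 100.00 WHERE id = 1;
PREPARE TRANSACTION 'middleware_tx_42';
-- After every server has prepared successfully:
COMMIT PREPARED 'middleware_tx_42';
-- If any server failed to prepare, instead run:
-- ROLLBACK PREPARED 'middleware_tx_42';
</programlisting>
</para>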
220 <term>Asynchronous Multimaster Replication</term>
224 For servers that are not regularly connected, like laptops or
225 remote servers, keeping data consistent among servers is a
226 challenge. Using asynchronous multimaster replication, each
227 server works independently, and periodically communicates with
228 the other servers to identify conflicting transactions. The
229 conflicts can be resolved by users or conflict resolution rules.
230 Bucardo is an example of this type of replication.
236 <term>Synchronous Multimaster Replication</term>
240 In synchronous multimaster replication, each server can accept
241 write requests, and modified data is transmitted from the
242 original server to every other server before each transaction
243 commits. Heavy write activity can cause excessive locking,
244 leading to poor performance. In fact, write performance is
245 often worse than that of a single server. Read requests can
246 be sent to any server. Some implementations use shared disk
247 to reduce the communication overhead. Synchronous multimaster
248 replication is best for mostly read workloads, though its big
249 advantage is that any server can accept write requests —
250 there is no need to partition workloads between master and
251 standby servers, and because the data changes are sent from one
252 server to another, there is no problem with non-deterministic
253 functions like <function>random()</>.
257 <productname>PostgreSQL</> does not offer this type of replication,
258 though <productname>PostgreSQL</> two-phase commit (<xref
259 linkend="sql-prepare-transaction"> and <xref
260 linkend="sql-commit-prepared">)
261 can be used to implement this in application code or middleware.
267 <term>Commercial Solutions</term>
271 Because <productname>PostgreSQL</> is open source and easily
272 extended, a number of companies have taken <productname>PostgreSQL</>
273 and created commercial closed-source solutions with unique
274 failover, replication, and load balancing capabilities.
282 <xref linkend="high-availability-matrix"> summarizes
283 the capabilities of the various solutions listed above.
286 <table id="high-availability-matrix">
287 <title>High Availability, Load Balancing, and Replication Feature Matrix</title>
291 <entry>Feature</entry>
292 <entry>Shared Disk Failover</entry>
293 <entry>File System Replication</entry>
294 <entry>Transaction Log Shipping</entry>
295 <entry>Trigger-Based Master-Standby Replication</entry>
296 <entry>Statement-Based Replication Middleware</entry>
297 <entry>Asynchronous Multimaster Replication</entry>
298 <entry>Synchronous Multimaster Replication</entry>
305 <entry>Most Common Implementation</entry>
306 <entry align="center">NAS</entry>
307 <entry align="center">DRBD</entry>
308 <entry align="center">Streaming Repl.</entry>
309 <entry align="center">Slony</entry>
310 <entry align="center">pgpool-II</entry>
311 <entry align="center">Bucardo</entry>
312 <entry align="center"></entry>
316 <entry>Communication Method</entry>
317 <entry align="center">shared disk</entry>
318 <entry align="center">disk blocks</entry>
319 <entry align="center">WAL</entry>
320 <entry align="center">table rows</entry>
321 <entry align="center">SQL</entry>
322 <entry align="center">table rows</entry>
323 <entry align="center">table rows and row locks</entry>
327 <entry>No special hardware required</entry>
328 <entry align="center"></entry>
329 <entry align="center">•</entry>
330 <entry align="center">•</entry>
331 <entry align="center">•</entry>
332 <entry align="center">•</entry>
333 <entry align="center">•</entry>
334 <entry align="center">•</entry>
338 <entry>Allows multiple master servers</entry>
339 <entry align="center"></entry>
340 <entry align="center"></entry>
341 <entry align="center"></entry>
342 <entry align="center"></entry>
343 <entry align="center">•</entry>
344 <entry align="center">•</entry>
345 <entry align="center">•</entry>
349 <entry>No master server overhead</entry>
350 <entry align="center">•</entry>
351 <entry align="center"></entry>
352 <entry align="center">•</entry>
353 <entry align="center"></entry>
354 <entry align="center">•</entry>
355 <entry align="center"></entry>
356 <entry align="center"></entry>
360 <entry>No waiting for multiple servers</entry>
361 <entry align="center">•</entry>
362 <entry align="center"></entry>
363 <entry align="center">with sync off</entry>
364 <entry align="center">•</entry>
365 <entry align="center"></entry>
366 <entry align="center">•</entry>
367 <entry align="center"></entry>
371 <entry>Master failure will never lose data</entry>
372 <entry align="center">•</entry>
373 <entry align="center">•</entry>
374 <entry align="center">with sync on</entry>
375 <entry align="center"></entry>
376 <entry align="center">•</entry>
377 <entry align="center"></entry>
378 <entry align="center">•</entry>
<entry>Standby accepts read-only queries</entry>
383 <entry align="center"></entry>
384 <entry align="center"></entry>
385 <entry align="center">with hot</entry>
386 <entry align="center">•</entry>
387 <entry align="center">•</entry>
388 <entry align="center">•</entry>
389 <entry align="center">•</entry>
393 <entry>Per-table granularity</entry>
394 <entry align="center"></entry>
395 <entry align="center"></entry>
396 <entry align="center"></entry>
397 <entry align="center">•</entry>
398 <entry align="center"></entry>
399 <entry align="center">•</entry>
400 <entry align="center">•</entry>
404 <entry>No conflict resolution necessary</entry>
405 <entry align="center">•</entry>
406 <entry align="center">•</entry>
407 <entry align="center">•</entry>
408 <entry align="center">•</entry>
409 <entry align="center"></entry>
410 <entry align="center"></entry>
411 <entry align="center">•</entry>
419 There are a few solutions that do not fit into the above categories:
425 <term>Data Partitioning</term>
429 Data partitioning splits tables into data sets. Each set can
430 be modified by only one server. For example, data can be
431 partitioned by offices, e.g., London and Paris, with a server
432 in each office. If queries combining London and Paris data
433 are necessary, an application can query both servers, or
434 master/standby replication can be used to keep a read-only copy
435 of the other office's data on each server.
441 <term>Multiple-Server Parallel Query Execution</term>
445 Many of the above solutions allow multiple servers to handle multiple
446 queries, but none allow a single query to use multiple servers to
447 complete faster. This solution allows multiple servers to work
448 concurrently on a single query. It is usually accomplished by
449 splitting the data among servers and having each server execute its
450 part of the query and return results to a central server where they
451 are combined and returned to the user. <productname>Pgpool-II</>
452 has this capability. Also, this can be implemented using the
453 <productname>PL/Proxy</> tool set.
464 <sect1 id="warm-standby">
465 <title>Log-Shipping Standby Servers</title>
469 Continuous archiving can be used to create a <firstterm>high
470 availability</> (HA) cluster configuration with one or more
471 <firstterm>standby servers</> ready to take over operations if the
472 primary server fails. This capability is widely referred to as
473 <firstterm>warm standby</> or <firstterm>log shipping</>.
The primary and standby servers work together to provide this capability,
478 though the servers are only loosely coupled. The primary server operates
479 in continuous archiving mode, while each standby server operates in
480 continuous recovery mode, reading the WAL files from the primary. No
481 changes to the database tables are required to enable this capability,
482 so it offers low administration overhead compared to some other
483 replication solutions. This configuration also has relatively low
484 performance impact on the primary server.
488 Directly moving WAL records from one database server to another
489 is typically described as log shipping. <productname>PostgreSQL</>
490 implements file-based log shipping by transferring WAL records
491 one file (WAL segment) at a time. WAL files (16MB) can be
492 shipped easily and cheaply over any distance, whether it be to an
493 adjacent system, another system at the same site, or another system on
494 the far side of the globe. The bandwidth required for this technique
495 varies according to the transaction rate of the primary server.
496 Record-based log shipping is more granular and streams WAL changes
497 incrementally over a network connection (see <xref
498 linkend="streaming-replication">).
502 It should be noted that log shipping is asynchronous, i.e., the WAL
503 records are shipped after transaction commit. As a result, there is a
504 window for data loss should the primary server suffer a catastrophic
505 failure; transactions not yet shipped will be lost. The size of the
506 data loss window in file-based log shipping can be limited by use of the
507 <varname>archive_timeout</varname> parameter, which can be set as low
508 as a few seconds. However such a low setting will
509 substantially increase the bandwidth required for file shipping.
510 Streaming replication (see <xref linkend="streaming-replication">)
511 allows a much smaller window of data loss.
515 Recovery performance is sufficiently good that the standby will
516 typically be only moments away from full
517 availability once it has been activated. As a result, this is called
a warm standby configuration, which offers high
519 availability. Restoring a server from an archived base backup and
520 rollforward will take considerably longer, so that technique only
521 offers a solution for disaster recovery, not high availability.
522 A standby server can also be used for read-only queries, in which case
it is called a Hot Standby server. See <xref linkend="hot-standby"> for
more information.
527 <indexterm zone="high-availability">
528 <primary>warm standby</primary>
531 <indexterm zone="high-availability">
532 <primary>PITR standby</primary>
535 <indexterm zone="high-availability">
536 <primary>standby server</primary>
539 <indexterm zone="high-availability">
540 <primary>log shipping</primary>
543 <indexterm zone="high-availability">
544 <primary>witness server</primary>
547 <indexterm zone="high-availability">
548 <primary>STONITH</primary>
551 <sect2 id="standby-planning">
552 <title>Planning</title>
555 It is usually wise to create the primary and standby servers
556 so that they are as similar as possible, at least from the
557 perspective of the database server. In particular, the path names
558 associated with tablespaces will be passed across unmodified, so both
559 primary and standby servers must have the same mount paths for
560 tablespaces if that feature is used. Keep in mind that if
561 <xref linkend="sql-createtablespace">
562 is executed on the primary, any new mount point needed for it must
563 be created on the primary and all standby servers before the command
564 is executed. Hardware need not be exactly the same, but experience shows
565 that maintaining two identical systems is easier than maintaining two
566 dissimilar ones over the lifetime of the application and system.
567 In any case the hardware architecture must be the same — shipping
568 from, say, a 32-bit to a 64-bit system will not work.
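<para>
For example, if tablespaces are in use, a command like the following
(the tablespace name and path are purely illustrative) requires that
the directory already exist, with appropriate ownership and
permissions, on the primary and on every standby:
<programlisting>
CREATE TABLESPACE fastspace LOCATION '/mnt/ssd/postgresql/data';
</programlisting>
</para>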
572 In general, log shipping between servers running different major
573 <productname>PostgreSQL</> release
574 levels is not possible. It is the policy of the PostgreSQL Global
575 Development Group not to make changes to disk formats during minor release
576 upgrades, so it is likely that running different minor release levels
577 on primary and standby servers will work successfully. However, no
578 formal support for that is offered and you are advised to keep primary
579 and standby servers at the same release level as much as possible.
580 When updating to a new minor release, the safest policy is to update
581 the standby servers first — a new minor release is more likely
to be able to read WAL files from a previous minor release than vice
versa.
588 <sect2 id="standby-server-operation">
589 <title>Standby Server Operation</title>
592 In standby mode, the server continuously applies WAL received from the
593 master server. The standby server can read WAL from a WAL archive
594 (see <xref linkend="restore-command">) or directly from the master
595 over a TCP connection (streaming replication). The standby server will
596 also attempt to restore any WAL found in the standby cluster's
597 <filename>pg_xlog</> directory. That typically happens after a server
restart, when the standby again replays WAL that was streamed from the
599 master before the restart, but you can also manually copy files to
600 <filename>pg_xlog</> at any time to have them replayed.
604 At startup, the standby begins by restoring all WAL available in the
605 archive location, calling <varname>restore_command</>. Once it
606 reaches the end of WAL available there and <varname>restore_command</>
607 fails, it tries to restore any WAL available in the <filename>pg_xlog</> directory.
608 If that fails, and streaming replication has been configured, the
609 standby tries to connect to the primary server and start streaming WAL
610 from the last valid record found in archive or <filename>pg_xlog</>. If that fails
611 or streaming replication is not configured, or if the connection is
later disconnected, the standby goes back to the beginning and tries to
restore the file from the archive again. This loop of retries from the
614 archive, <filename>pg_xlog</>, and via streaming replication goes on until the server
615 is stopped or failover is triggered by a trigger file.
619 Standby mode is exited and the server switches to normal operation
620 when <command>pg_ctl promote</> is run or a trigger file is found
621 (<varname>trigger_file</>). Before failover,
622 any WAL immediately available in the archive or in <filename>pg_xlog</> will be
623 restored, but no attempt is made to connect to the master.
627 <sect2 id="preparing-master-for-standby">
628 <title>Preparing the Master for Standby Servers</title>
631 Set up continuous archiving on the primary to an archive directory
632 accessible from the standby, as described
633 in <xref linkend="continuous-archiving">. The archive location should be
634 accessible from the standby even when the master is down, i.e. it should
reside on the standby server itself or another trusted server, not on
the master server.
640 If you want to use streaming replication, set up authentication on the
641 primary server to allow replication connections from the standby
642 server(s); that is, create a role and provide a suitable entry or
643 entries in <filename>pg_hba.conf</> with the database field set to
644 <literal>replication</>. Also ensure <varname>max_wal_senders</> is set
to a sufficiently large value in the configuration file of the primary
server.
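<para>
A minimal sketch of the relevant settings on the primary follows; the
archive path and sender count are illustrative, and
<literal>wal_level = hot_standby</> assumes the standbys will also
serve read-only queries:
<programlisting>
# postgresql.conf on the primary
wal_level = hot_standby
archive_mode = on
archive_command = 'cp %p /path/to/archive/%f'
max_wal_senders = 5
</programlisting>
See <xref linkend="streaming-replication-authentication"> for the
corresponding <filename>pg_hba.conf</> entry.
</para>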
650 Take a base backup as described in <xref linkend="backup-base-backup">
651 to bootstrap the standby server.
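<para>
For example, assuming the primary at <literal>192.168.1.50</> and the
replication user <literal>foo</> used in the examples below, a base
backup could be taken with <application>pg_basebackup</>; the target
directory is illustrative:
<programlisting>
# Run on the standby host; -P reports progress
pg_basebackup -h 192.168.1.50 -U foo -D /path/to/standby/data -P
</programlisting>
</para>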
655 <sect2 id="standby-server-setup">
656 <title>Setting Up a Standby Server</title>
To set up the standby server, restore the base backup taken from the primary
660 server (see <xref linkend="backup-pitr-recovery">). Create a recovery
661 command file <filename>recovery.conf</> in the standby's cluster data
662 directory, and turn on <varname>standby_mode</>. Set
663 <varname>restore_command</> to a simple command to copy files from
664 the WAL archive. If you plan to have multiple standby servers for high
665 availability purposes, set <varname>recovery_target_timeline</> to
666 <literal>latest</>, to make the standby server follow the timeline change
667 that occurs at failover to another standby.
672 Do not use pg_standby or similar tools with the built-in standby mode
673 described here. <varname>restore_command</> should return immediately
674 if the file does not exist; the server will retry the command again if
675 necessary. See <xref linkend="log-shipping-alternative">
676 for using tools like pg_standby.
681 If you want to use streaming replication, fill in
682 <varname>primary_conninfo</> with a libpq connection string, including
683 the host name (or IP address) and any additional details needed to
684 connect to the primary server. If the primary needs a password for
685 authentication, the password needs to be specified in
686 <varname>primary_conninfo</> as well.
690 If you're setting up the standby server for high availability purposes,
691 set up WAL archiving, connections and authentication like the primary
server, because the standby server will work as a primary server after
failover.
697 If you're using a WAL archive, its size can be minimized using the <xref
698 linkend="archive-cleanup-command"> parameter to remove files that are no
699 longer required by the standby server.
700 The <application>pg_archivecleanup</> utility is designed specifically to
be used with <varname>archive_cleanup_command</> in typical single-standby
configurations; see <xref linkend="pgarchivecleanup">.
Note, however, that if you're using the archive for backup purposes, you
704 need to retain files needed to recover from at least the latest base
705 backup, even if they're no longer needed by the standby.
709 A simple example of a <filename>recovery.conf</> is:
standby_mode = 'on'
primary_conninfo = 'host=192.168.1.50 port=5432 user=foo password=foopass'
713 restore_command = 'cp /path/to/archive/%f %p'
714 archive_cleanup_command = 'pg_archivecleanup /path/to/archive %r'
719 You can have any number of standby servers, but if you use streaming
720 replication, make sure you set <varname>max_wal_senders</> high enough in
721 the primary to allow them to be connected simultaneously.
726 <sect2 id="streaming-replication">
727 <title>Streaming Replication</title>
729 <indexterm zone="high-availability">
730 <primary>Streaming Replication</primary>
734 Streaming replication allows a standby server to stay more up-to-date
735 than is possible with file-based log shipping. The standby connects
736 to the primary, which streams WAL records to the standby as they're
737 generated, without waiting for the WAL file to be filled.
741 Streaming replication is asynchronous, so there is still a small delay
between committing a transaction on the primary and the changes
becoming visible on the standby. The delay is, however, much smaller than with
744 file-based log shipping, typically under one second assuming the standby
745 is powerful enough to keep up with the load. With streaming replication,
<varname>archive_timeout</> is not required to reduce the data loss
window.
751 If you use streaming replication without file-based continuous
752 archiving, you have to set <varname>wal_keep_segments</> in the master
753 to a value high enough to ensure that old WAL segments are not recycled
754 too early, while the standby might still need them to catch up. If the
755 standby falls behind too much, it needs to be reinitialized from a new
756 base backup. If you set up a WAL archive that's accessible from the
757 standby, <varname>wal_keep_segments</> is not required as the standby can always
758 use the archive to catch up.
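<para>
A sketch of such a setting; the value is illustrative and depends on
how far the standby may fall behind:
<programlisting>
# postgresql.conf on the master
wal_keep_segments = 64    # 64 segments x 16MB = 1GB of WAL retained in pg_xlog
</programlisting>
</para>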
762 To use streaming replication, set up a file-based log-shipping standby
763 server as described in <xref linkend="warm-standby">. The step that
turns a file-based log-shipping standby into a streaming replication
standby is setting <varname>primary_conninfo</> in the
766 <filename>recovery.conf</> file to point to the primary server. Set
767 <xref linkend="guc-listen-addresses"> and authentication options
768 (see <filename>pg_hba.conf</>) on the primary so that the standby server
769 can connect to the <literal>replication</> pseudo-database on the primary
770 server (see <xref linkend="streaming-replication-authentication">).
774 On systems that support the keepalive socket option, setting
775 <xref linkend="guc-tcp-keepalives-idle">,
776 <xref linkend="guc-tcp-keepalives-interval"> and
777 <xref linkend="guc-tcp-keepalives-count"> helps the primary promptly
778 notice a broken connection.
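<para>
A sketch of such settings on the primary (the values are
illustrative):
<programlisting>
# postgresql.conf on the primary
tcp_keepalives_idle = 60        # seconds of idleness before the first keepalive
tcp_keepalives_interval = 10    # seconds between keepalives
tcp_keepalives_count = 5        # lost keepalives before the connection is dead
</programlisting>
</para>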
782 Set the maximum number of concurrent connections from the standby servers
783 (see <xref linkend="guc-max-wal-senders"> for details).
787 When the standby is started and <varname>primary_conninfo</> is set
788 correctly, the standby will connect to the primary after replaying all
789 WAL files available in the archive. If the connection is established
790 successfully, you will see a walreceiver process in the standby, and
791 a corresponding walsender process in the primary.
794 <sect3 id="streaming-replication-authentication">
795 <title>Authentication</title>
797 It is very important that the access privileges for replication be set up
798 so that only trusted users can read the WAL stream, because it is
799 easy to extract privileged information from it. Standby servers must
800 authenticate to the primary as a superuser or an account that has the
801 <literal>REPLICATION</> privilege. It is recommended to create a
802 dedicated user account with <literal>REPLICATION</> and <literal>LOGIN</>
803 privileges for replication. While <literal>REPLICATION</> privilege gives
804 very high permissions, it does not allow the user to modify any data on
805 the primary system, which the <literal>SUPERUSER</> privilege does.
809 Client authentication for replication is controlled by a
810 <filename>pg_hba.conf</> record specifying <literal>replication</> in the
811 <replaceable>database</> field. For example, if the standby is running on
812 host IP <literal>192.168.1.100</> and the account name for replication
813 is <literal>foo</>, the administrator can add the following line to the
814 <filename>pg_hba.conf</> file on the primary:
817 # Allow the user "foo" from host 192.168.1.100 to connect to the primary
818 # as a replication standby if the user's password is correctly supplied.
# TYPE  DATABASE        USER            ADDRESS                 METHOD
host    replication     foo             192.168.1.100/32        md5
825 The host name and port number of the primary, connection user name,
826 and password are specified in the <filename>recovery.conf</> file.
827 The password can also be set in the <filename>~/.pgpass</> file on the
standby (specify <literal>replication</> in the <replaceable>database</>
field).
830 For example, if the primary is running on host IP <literal>192.168.1.50</>,
831 port <literal>5432</literal>, the account name for replication is
832 <literal>foo</>, and the password is <literal>foopass</>, the administrator
can add the following line to the <filename>recovery.conf</> file on the
standby:
837 # The standby connects to the primary that is running on host 192.168.1.50
838 # and port 5432 as the user "foo" whose password is "foopass".
839 primary_conninfo = 'host=192.168.1.50 port=5432 user=foo password=foopass'
844 <sect3 id="streaming-replication-monitoring">
845 <title>Monitoring</title>
847 An important health indicator of streaming replication is the amount
848 of WAL records generated in the primary, but not yet applied in the
849 standby. You can calculate this lag by comparing the current WAL write
850 location on the primary with the last WAL location received by the
851 standby. They can be retrieved using
852 <function>pg_current_xlog_location</> on the primary and the
853 <function>pg_last_xlog_receive_location</> on the standby,
854 respectively (see <xref linkend="functions-admin-backup-table"> and
855 <xref linkend="functions-recovery-info-table"> for details).
856 The last WAL receive location in the standby is also displayed in the
857 process status of the WAL receiver process, displayed using the
858 <command>ps</> command (see <xref linkend="monitoring-ps"> for details).
861 You can retrieve a list of WAL sender processes via the
862 <link linkend="monitoring-stats-views-table">
863 <literal>pg_stat_replication</></link> view. Large differences between
<function>pg_current_xlog_location</> and the <literal>sent_location</> field
865 might indicate that the master server is under heavy load, while
866 differences between <literal>sent_location</> and
867 <function>pg_last_xlog_receive_location</> on the standby might indicate
868 network delay, or that the standby is under heavy load.
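<para>
For example (run the first query on the primary and the second on the
standby):
<programlisting>
-- On the primary: current WAL write location
SELECT pg_current_xlog_location();

-- On the standby: last WAL location received
SELECT pg_last_xlog_receive_location();

-- On the primary: one row per connected standby
SELECT application_name, state, sent_location, replay_location
FROM pg_stat_replication;
</programlisting>
</para>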
873 <sect2 id="cascading-replication">
874 <title>Cascading Replication</title>
876 <indexterm zone="high-availability">
877 <primary>Cascading Replication</primary>
881 The cascading replication feature allows a standby server to accept replication
882 connections and stream WAL records to other standbys, acting as a relay.
883 This can be used to reduce the number of direct connections to the master
and also to minimize inter-site bandwidth overheads.
888 A standby acting as both a receiver and a sender is known as a cascading
889 standby. Standbys that are more directly connected to the master are known
890 as upstream servers, while those standby servers further away are downstream
891 servers. Cascading replication does not place limits on the number or
892 arrangement of downstream servers, though each standby connects to only
one upstream server, which eventually links to a single master/primary
server.
898 A cascading standby sends not only WAL records received from the
899 master but also those restored from the archive. So even if the replication
connection to some upstream server is terminated, streaming replication
901 continues downstream for as long as new WAL records are available.
905 Cascading replication is currently asynchronous. Synchronous replication
906 (see <xref linkend="synchronous-replication">) settings have no effect on
907 cascading replication at present.
911 Hot Standby feedback propagates upstream, whatever the cascaded arrangement.
915 Promoting a cascading standby terminates the immediate downstream replication
916 connections which it serves. This is because the timeline becomes different
917 between standbys, and they can no longer continue replication. The
918 affected standby(s) may reconnect to reestablish streaming replication.
922 To use cascading replication, set up the cascading standby so that it can
923 accept replication connections (that is, set
924 <xref linkend="guc-max-wal-senders"> and <xref linkend="guc-hot-standby">,
and configure
<link linkend="auth-pg-hba-conf">host-based authentication</link>).
927 You will also need to set <varname>primary_conninfo</> in the downstream
928 standby to point to the cascading standby.
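<para>
A minimal sketch (the host name is illustrative): the cascading
standby is configured like a sending server, and the downstream
standby simply points its <varname>primary_conninfo</> at it:
<programlisting>
# postgresql.conf on the cascading standby
max_wal_senders = 5
hot_standby = on

# recovery.conf on the downstream standby
standby_mode = 'on'
primary_conninfo = 'host=cascading-standby.example.com port=5432 user=foo password=foopass'
</programlisting>
</para>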
932 <sect2 id="synchronous-replication">
933 <title>Synchronous Replication</title>
935 <indexterm zone="high-availability">
936 <primary>Synchronous Replication</primary>
940 <productname>PostgreSQL</> streaming replication is asynchronous by
941 default. If the primary server
942 crashes then some transactions that were committed may not have been
943 replicated to the standby server, causing data loss. The amount
of data loss is proportional to the replication delay at the time of
failover.
949 Synchronous replication offers the ability to confirm that all changes
950 made by a transaction have been transferred to one synchronous standby
951 server. This extends the standard level of durability
952 offered by a transaction commit. This level of protection is referred
953 to as 2-safe replication in computer science theory.
957 When requesting synchronous replication, each commit of a
958 write transaction will wait until confirmation is
959 received that the commit has been written to the transaction log on disk
960 of both the primary and standby server. The only possibility that data
961 can be lost is if both the primary and the standby suffer crashes at the
962 same time. This can provide a much higher level of durability, though only
963 if the sysadmin is cautious about the placement and management of the two
964 servers. Waiting for confirmation increases the user's confidence that the
965 changes will not be lost in the event of server crashes but it also
966 necessarily increases the response time for the requesting transaction.
The minimum wait time is the roundtrip time between primary and standby.
971 Read only transactions and transaction rollbacks need not wait for
972 replies from standby servers. Subtransaction commits do not wait for
973 responses from standby servers, only top-level commits. Long
974 running actions such as data loading or index building do not wait
975 until the very final commit message. All two-phase commit actions
976 require commit waits, including both prepare and commit.
979 <sect3 id="synchronous-replication-config">
980 <title>Basic Configuration</title>
983 Once streaming replication has been configured, configuring synchronous
984 replication requires only one additional configuration step:
985 <xref linkend="guc-synchronous-standby-names"> must be set to
986 a non-empty value. <varname>synchronous_commit</> must also be set to
987 <literal>on</>, but since this is the default value, typically no change is
988 required. (See <xref linkend="runtime-config-wal-settings"> and
989 <xref linkend="runtime-config-replication-master">.)
990 This configuration will cause each commit to wait for
confirmation that the standby has written the commit record to durable
storage.
993 <varname>synchronous_commit</> can be set by individual
994 users, so it can be configured in the configuration file, for particular
995 users or databases, or dynamically by applications, in order to control
996 the durability guarantee on a per-transaction basis.
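<para>
A minimal sketch, assuming one standby whose
<varname>primary_conninfo</> sets
<literal>application_name=standby1</> (the name is illustrative):
<programlisting>
# postgresql.conf on the primary
synchronous_standby_names = 'standby1'
synchronous_commit = on        # the default
</programlisting>
</para>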
1000 After a commit record has been written to disk on the primary, the
1001 WAL record is then sent to the standby. The standby sends reply
1002 messages each time a new batch of WAL data is written to disk, unless
1003 <varname>wal_receiver_status_interval</> is set to zero on the standby.
1004 If the standby is the first matching standby, as specified in
1005 <varname>synchronous_standby_names</> on the primary, the reply
1006 messages from that standby will be used to wake users waiting for
1007 confirmation that the commit record has been received. These parameters
1008 allow the administrator to specify which standby servers should be
1009 synchronous standbys. Note that the configuration of synchronous
1010 replication is mainly on the master. Named standbys must be directly
1011 connected to the master; the master knows nothing about downstream
1012 standby servers using cascaded replication.
1016 Setting <varname>synchronous_commit</> to <literal>remote_write</> will
1017 cause each commit to wait for confirmation that the standby has received
1018 the commit record and written it out to its own operating system, but not
1019 for the data to be flushed to disk on the standby. This
1020 setting provides a weaker guarantee of durability than <literal>on</>
1021 does: the standby could lose the data in the event of an operating system
1022 crash, though not a <productname>PostgreSQL</> crash.
1023 However, it's a useful setting in practice
1024 because it can decrease the response time for the transaction.
1025 Data loss could only occur if both the primary and the standby crash and
1026 the database of the primary gets corrupted at the same time.
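<para>
Because <varname>synchronous_commit</> can be changed at runtime, a
session willing to accept this weaker guarantee can simply run:
<programlisting>
SET synchronous_commit = remote_write;
</programlisting>
</para>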
1030 Users will stop waiting if a fast shutdown is requested. However, as
when using asynchronous replication, the server will not fully
shut down until all outstanding WAL records are transferred to the currently
1033 connected standby servers.
1038 <sect3 id="synchronous-replication-performance">
1039 <title>Planning for Performance</title>
1042 Synchronous replication usually requires carefully planned and placed
1043 standby servers to ensure applications perform acceptably. Waiting
doesn't utilize system resources, but transaction locks continue to be
1045 held until the transfer is confirmed. As a result, incautious use of
1046 synchronous replication will reduce performance for database
1047 applications because of increased response times and higher contention.
1051 <productname>PostgreSQL</> allows the application developer
1052 to specify the durability level required via replication. This can be
1053 specified for the system overall, though it can also be specified for
1054 specific users or connections, or even individual transactions.
1058 For example, an application workload might consist of:
10% of changes are to important customer details, while
90% of changes are to less important data that the business can more
easily survive losing, such as chat messages between users.
1065 With synchronous replication options specified at the application level
1066 (on the primary) we can offer synchronous replication for the most
1067 important changes, without slowing down the bulk of the total workload.
1068 Application level options are an important and practical tool for allowing
1069 the benefits of synchronous replication for high performance applications.
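<para>
A sketch of this split, assuming a hypothetical
<literal>chat_messages</> table: the server default stays
<literal>on</> for the important changes, while the chat workload
lowers its own durability requirement per transaction:
<programlisting>
BEGIN;
SET LOCAL synchronous_commit = local;    -- do not wait for the standby
INSERT INTO chat_messages (body) VALUES ('hello');
COMMIT;
</programlisting>
Alternatively, the setting can be attached to the role used by the
chat application with <command>ALTER ROLE ... SET synchronous_commit
= local</>.
</para>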
1073 You should consider that the network bandwidth must be higher than
1074 the rate of generation of WAL data.
1079 <sect3 id="synchronous-replication-ha">
1080 <title>Planning for High Availability</title>
1083 Commits made when <varname>synchronous_commit</> is set to <literal>on</>
1084 or <literal>remote_write</> will wait until the synchronous standby responds. The response
1085 may never occur if the last, or only, standby should crash.
1089 The best solution for avoiding data loss is to ensure you don't lose
1090 your last remaining synchronous standby. This can be achieved by naming multiple
1091 potential synchronous standbys using <varname>synchronous_standby_names</>.
1092 The first named standby will be used as the synchronous standby. Standbys
1093 listed after this will take over the role of synchronous standby if the
1094 first one should fail.
1098 When a standby first attaches to the primary, it will not yet be properly
1099 synchronized. This is described as <literal>catchup</> mode. Once
1100 the lag between standby and primary reaches zero for the first time
1101 we move to real-time <literal>streaming</> state.
1102 The catch-up duration may be long immediately after the standby has
1103 been created. If the standby is shut down, then the catch-up period
1104 will increase according to the length of time the standby has been down.
1105 The standby is only able to become a synchronous standby
1106 once it has reached <literal>streaming</> state.
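<para>
The standby's progression through these states can be observed from
the primary; for example:
<programlisting>
SELECT application_name, state, sync_state FROM pg_stat_replication;
</programlisting>
Here <literal>state</> reports <literal>catchup</> or
<literal>streaming</>, and <literal>sync_state</> shows whether a
standby is currently the synchronous one.
</para>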
If the primary restarts while commits are waiting for acknowledgement, those
waiting transactions will be marked fully committed once the primary
database recovers.
1113 There is no way to be certain that all standbys have received all
outstanding WAL data at the time of the crash of the primary. Some
1115 transactions may not show as committed on the standby, even though
1116 they show as committed on the primary. The guarantee we offer is that
1117 the application will not receive explicit acknowledgement of the
1118 successful commit of a transaction until the WAL data is known to be
1119 safely received by the standby.
1123 If you really do lose your last standby server then you should disable
1124 <varname>synchronous_standby_names</> and reload the configuration file
1125 on the primary server.
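<para>
For example (a sketch), edit <filename>postgresql.conf</> on the
primary:
<programlisting>
synchronous_standby_names = ''
</programlisting>
and then reload with <command>pg_ctl reload</>, or run
<literal>SELECT pg_reload_conf();</> from a superuser session.
</para>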
1129 If the primary is isolated from remaining standby servers you should
1130 fail over to the best candidate of those other remaining standby servers.
1134 If you need to re-create a standby server while transactions are
waiting, make sure that the commands <function>pg_start_backup()</> and
<function>pg_stop_backup()</> are run in a session with
1137 <varname>synchronous_commit</> = <literal>off</>, otherwise those
1138 requests will wait forever for the standby to appear.
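<para>
For example, a sketch of such a session (the backup label is
illustrative):
<programlisting>
SET synchronous_commit = off;
SELECT pg_start_backup('rebuild-standby');
-- copy the data directory to the new standby here
SELECT pg_stop_backup();
</programlisting>
</para>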
1145 <sect1 id="warm-standby-failover">
1146 <title>Failover</title>
1149 If the primary server fails then the standby server should begin
1150 failover procedures.
1154 If the standby server fails then no failover need take place. If the
1155 standby server can be restarted, even some time later, then the recovery
1156 process can also be restarted immediately, taking advantage of
1157 restartable recovery. If the standby server cannot be restarted, then a
1158 full new standby server instance should be created.
1162 If the primary server fails and the standby server becomes the
1163 new primary, and then the old primary restarts, you must have
1164 a mechanism for informing the old primary that it is no longer the primary. This is
1165 sometimes known as <acronym>STONITH</> (Shoot The Other Node In The Head), which is
1166 necessary to avoid situations where both systems think they are the
1167 primary, which will lead to confusion and ultimately data loss.
1171 Many failover systems use just two systems, the primary and the standby,
1172 connected by some kind of heartbeat mechanism to continually verify the
1173 connectivity between the two and the viability of the primary. It is
1174 also possible to use a third system (called a witness server) to prevent
1175 some cases of inappropriate failover, but the additional complexity
might not be worthwhile unless it is set up with sufficient care and
rigorous testing.
1181 <productname>PostgreSQL</productname> does not provide the system
1182 software required to identify a failure on the primary and notify
1183 the standby database server. Many such tools exist and are well
1184 integrated with the operating system facilities required for
1185 successful failover, such as IP address migration.
1189 Once failover to the standby occurs, there is only a
1190 single server in operation. This is known as a degenerate state.
1191 The former standby is now the primary, but the former primary is down
and might stay down. To return to normal operation, a standby server
needs to be recreated,
either on the former primary system when it comes up, or on a third,
1195 possibly new, system. Once complete, the primary and standby can be
1196 considered to have switched roles. Some people choose to use a third
1197 server to provide backup for the new primary until the new standby
1198 server is recreated,
1199 though clearly this complicates the system configuration and
1200 operational processes.
1204 So, switching from primary to standby server can be fast but requires
1205 some time to re-prepare the failover cluster. Regular switching from
1206 primary to standby is useful, since it allows regular downtime on
1207 each system for maintenance. This also serves as a test of the
1208 failover mechanism to ensure that it will really work when you need it.
1209 Written administration procedures are advised.
1213 To trigger failover of a log-shipping standby server,
1214 run <command>pg_ctl promote</> or create a trigger
1215 file with the file name and path specified by the <varname>trigger_file</>
1216 setting in <filename>recovery.conf</>. If you're planning to use
1217 <command>pg_ctl promote</> to fail over, <varname>trigger_file</> is
not required. If you're setting up reporting servers that are
only used to offload read-only queries from the primary, not for high
availability purposes, you don't need to promote them.
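<para>
For a standby that should be promoted, either of the following (the
paths are illustrative) triggers failover:
<programlisting>
pg_ctl promote -D /path/to/standby/data
</programlisting>
or, if <varname>trigger_file</> is set in
<filename>recovery.conf</>:
<programlisting>
touch /path/to/trigger_file
</programlisting>
</para>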
1224 <sect1 id="log-shipping-alternative">
1225 <title>Alternative Method for Log Shipping</title>
1228 An alternative to the built-in standby mode described in the previous
1229 sections is to use a <varname>restore_command</> that polls the archive location.
1230 This was the only option available in versions 8.4 and below. In this
1231 setup, set <varname>standby_mode</> off, because you are implementing
1232 the polling required for standby operation yourself. See the
1233 <xref linkend="pgstandby"> module for a reference
1234 implementation of this.
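<para>
A sketch of such a <filename>recovery.conf</>, using
<application>pg_standby</>; the archive path and trigger file
location are illustrative:
<programlisting>
restore_command = 'pg_standby -t /tmp/pgsql.trigger.5442 /path/to/archive %f %p %r'
</programlisting>
</para>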
1238 Note that in this mode, the server will apply WAL one file at a
1239 time, so if you use the standby server for queries (see Hot Standby),
1240 there is a delay between an action in the master and when the
action becomes visible in the standby, corresponding to the time it takes
1242 to fill up the WAL file. <varname>archive_timeout</> can be used to make that delay
shorter. Also note that you can't combine streaming replication with
this method.
1248 The operations that occur on both primary and standby servers are
1249 normal continuous archiving and recovery tasks. The only point of
1250 contact between the two database servers is the archive of WAL files
1251 that both share: primary writing to the archive, standby reading from
1252 the archive. Care must be taken to ensure that WAL archives from separate
1253 primary servers do not become mixed together or confused. The archive
1254 need not be large if it is only required for standby operation.
1258 The magic that makes the two loosely coupled servers work together is
1259 simply a <varname>restore_command</> used on the standby that,
1260 when asked for the next WAL file, waits for it to become available from
1261 the primary. The <varname>restore_command</> is specified in the
1262 <filename>recovery.conf</> file on the standby server. Normal recovery
1263 processing would request a file from the WAL archive, reporting failure
1264 if the file was unavailable. For standby processing it is normal for
1265 the next WAL file to be unavailable, so the standby must wait for
1266 it to appear. For files ending in <literal>.backup</> or
1267 <literal>.history</> there is no need to wait, and a non-zero return
1268 code must be returned. A waiting <varname>restore_command</> can be
1269 written as a custom script that loops after polling for the existence of
1270 the next WAL file. There must also be some way to trigger failover, which
1271 should interrupt the <varname>restore_command</>, break the loop and
1272 return a file-not-found error to the standby server. This ends recovery
1273 and the standby will then come up as a normal server.
1277 Pseudocode for a suitable <varname>restore_command</> is:
triggered = false;
while (!NextWALFileReady() && !triggered)
{
    sleep(100000L);        /* wait for ~0.1 sec */
    if (CheckForExternalTrigger())
        triggered = true;
}
if (!triggered)
    CopyWALFileForRecovery();
1292 A working example of a waiting <varname>restore_command</> is provided
1293 in the <xref linkend="pgstandby"> module. It
1294 should be used as a reference on how to correctly implement the logic
1295 described above. It can also be extended as needed to support specific
1296 configurations and environments.
1300 The method for triggering failover is an important part of planning
and design. One potential option is the <varname>restore_command</>.
It is executed once for each WAL file, but the process
1303 running the <varname>restore_command</> is created and dies for
1304 each file, so there is no daemon or server process, and
1305 signals or a signal handler cannot be used. Therefore, the
1306 <varname>restore_command</> is not suitable to trigger failover.
1307 It is possible to use a simple timeout facility, especially if
1308 used in conjunction with a known <varname>archive_timeout</>
1309 setting on the primary. However, this is somewhat error prone
1310 since a network problem or busy primary server might be sufficient
1311 to initiate failover. A notification mechanism such as the explicit
1312 creation of a trigger file is ideal, if this can be arranged.
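<para>
With the <application>pg_standby</> example above, failover would be
triggered simply by creating the agreed-upon file:
<programlisting>
touch /tmp/pgsql.trigger.5442
</programlisting>
</para>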
1315 <sect2 id="warm-standby-config">
1316 <title>Implementation</title>
1319 The short procedure for configuring a standby server using this alternative
1320 method is as follows. For
1321 full details of each step, refer to previous sections as noted.
1325 Set up primary and standby systems as nearly identical as
1326 possible, including two identical copies of
1327 <productname>PostgreSQL</> at the same release level.
1332 Set up continuous archiving from the primary to a WAL archive
1333 directory on the standby server. Ensure that
1334 <xref linkend="guc-archive-mode">,
1335 <xref linkend="guc-archive-command"> and
1336 <xref linkend="guc-archive-timeout">
1337 are set appropriately on the primary
1338 (see <xref linkend="backup-archiving-wal">).
1343 Make a base backup of the primary server (see <xref
1344 linkend="backup-base-backup">), and load this data onto the standby.
1349 Begin recovery on the standby server from the local WAL
1350 archive, using a <filename>recovery.conf</> that specifies a
1351 <varname>restore_command</> that waits as described
1352 previously (see <xref linkend="backup-pitr-recovery">).
1359 Recovery treats the WAL archive as read-only, so once a WAL file has
1360 been copied to the standby system it can be copied to tape at the same
1361 time as it is being read by the standby database server.
1362 Thus, running a standby server for high availability can be performed at
the same time as files are stored for longer term disaster recovery
purposes.
1368 For testing purposes, it is possible to run both primary and standby
1369 servers on the same system. This does not provide any worthwhile
1370 improvement in server robustness, nor would it be described as HA.
1374 <sect2 id="warm-standby-record">
1375 <title>Record-based Log Shipping</title>
1378 It is also possible to implement record-based log shipping using this
1379 alternative method, though this requires custom development, and changes
1380 will still only become visible to hot standby queries after a full WAL
1381 file has been shipped.
1385 An external program can call the <function>pg_xlogfile_name_offset()</>
1386 function (see <xref linkend="functions-admin">)
1387 to find out the file name and the exact byte offset within it of
1388 the current end of WAL. It can then access the WAL file directly
1389 and copy the data from the last known end of WAL through the current end
1390 over to the standby servers. With this approach, the window for data
1391 loss is the polling cycle time of the copying program, which can be very
1392 small, and there is no wasted bandwidth from forcing partially-used
1393 segment files to be archived. Note that the standby servers'
1394 <varname>restore_command</> scripts can only deal with whole WAL files,
1395 so the incrementally copied data is not ordinarily made available to
1396 the standby servers. It is of use only when the primary dies —
1397 then the last partial WAL file is fed to the standby before allowing
1398 it to come up. The correct implementation of this process requires
cooperation of the <varname>restore_command</> script with the data
copying program.
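<para>
For example, the copying program could poll with:
<programlisting>
SELECT * FROM pg_xlogfile_name_offset(pg_current_xlog_location());
</programlisting>
which returns the current WAL file name and the byte offset of the
end of WAL within it.
</para>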
1404 Starting with <productname>PostgreSQL</> version 9.0, you can use
1405 streaming replication (see <xref linkend="streaming-replication">) to
1406 achieve the same benefits with less effort.
1411 <sect1 id="hot-standby">
1412 <title>Hot Standby</title>
1414 <indexterm zone="high-availability">
1415 <primary>Hot Standby</primary>
1419 Hot Standby is the term used to describe the ability to connect to
1420 the server and run read-only queries while the server is in archive
1421 recovery or standby mode. This
1422 is useful both for replication purposes and for restoring a backup
1423 to a desired state with great precision.
1424 The term Hot Standby also refers to the ability of the server to move
1425 from recovery through to normal operation while users continue running
1426 queries and/or keep their connections open.
1430 Running queries in hot standby mode is similar to normal query operation,
though there are several usage and administrative differences
explained below.
1435 <sect2 id="hot-standby-users">
1436 <title>User's Overview</title>
1439 When the <xref linkend="guc-hot-standby"> parameter is set to true on a
1440 standby server, it will begin accepting connections once the recovery has
1441 brought the system to a consistent state. All such connections are
1442 strictly read-only; not even temporary tables may be written.
1446 The data on the standby takes some time to arrive from the primary server
1447 so there will be a measurable delay between primary and standby. Running the
1448 same query nearly simultaneously on both primary and standby might therefore
1449 return differing results. We say that data on the standby is
1450 <firstterm>eventually consistent</firstterm> with the primary. Once the
1451 commit record for a transaction is replayed on the standby, the changes
1452 made by that transaction will be visible to any new snapshots taken on
1453 the standby. Snapshots may be taken at the start of each query or at the
1454 start of each transaction, depending on the current transaction isolation
1455 level. For more details, see <xref linkend="transaction-iso">.
1459 Transactions started during hot standby may issue the following commands:
1464 Query access - <command>SELECT</>, <command>COPY TO</>
1469 Cursor commands - <command>DECLARE</>, <command>FETCH</>, <command>CLOSE</>
1474 Parameters - <command>SHOW</>, <command>SET</>, <command>RESET</>
1479 Transaction management commands
1483 <command>BEGIN</>, <command>END</>, <command>ABORT</>, <command>START TRANSACTION</>
1488 <command>SAVEPOINT</>, <command>RELEASE</>, <command>ROLLBACK TO SAVEPOINT</>
1493 <command>EXCEPTION</> blocks and other internal subtransactions
1501 <command>LOCK TABLE</>, though only when explicitly in one of these modes:
1502 <literal>ACCESS SHARE</>, <literal>ROW SHARE</> or <literal>ROW EXCLUSIVE</>.
1507 Plans and resources - <command>PREPARE</>, <command>EXECUTE</>,
1508 <command>DEALLOCATE</>, <command>DISCARD</>
1513 Plugins and extensions - <command>LOAD</>
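For instance, the following read-only session uses only commands from the list
above (the <structname>accounts</> table is hypothetical):
<programlisting>
BEGIN;
LOCK TABLE accounts IN ACCESS SHARE MODE;       -- one of the permitted lock modes
DECLARE cur CURSOR FOR SELECT * FROM accounts;
FETCH 10 FROM cur;
CLOSE cur;
COMMIT;
</programlisting>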
1520 Transactions started during hot standby will never be assigned a
1521 transaction ID and cannot write to the system write-ahead log.
Therefore, the following actions will produce error messages (an example of such an error follows the list):
1527 Data Manipulation Language (DML) - <command>INSERT</>,
1528 <command>UPDATE</>, <command>DELETE</>, <command>COPY FROM</>,
1529 <command>TRUNCATE</>.
1530 Note that there are no allowed actions that result in a trigger
1531 being executed during recovery. This restriction applies even to
1532 temporary tables, because table rows cannot be read or written without
1533 assigning a transaction ID, which is currently not possible in a
1534 Hot Standby environment.
1539 Data Definition Language (DDL) - <command>CREATE</>,
1540 <command>DROP</>, <command>ALTER</>, <command>COMMENT</>.
1541 This restriction applies even to temporary tables, because carrying
1542 out these operations would require updating the system catalog tables.
1547 <command>SELECT ... FOR SHARE | UPDATE</>, because row locks cannot be
1548 taken without updating the underlying data files.
1553 Rules on <command>SELECT</> statements that generate DML commands.
1558 <command>LOCK</> that explicitly requests a mode higher than <literal>ROW EXCLUSIVE MODE</>.
1563 <command>LOCK</> in short default form, since it requests <literal>ACCESS EXCLUSIVE MODE</>.
1568 Transaction management commands that explicitly set non-read-only state:
1572 <command>BEGIN READ WRITE</>,
1573 <command>START TRANSACTION READ WRITE</>
1578 <command>SET TRANSACTION READ WRITE</>,
1579 <command>SET SESSION CHARACTERISTICS AS TRANSACTION READ WRITE</>
1584 <command>SET transaction_read_only = off</>
1592 Two-phase commit commands - <command>PREPARE TRANSACTION</>,
1593 <command>COMMIT PREPARED</>, <command>ROLLBACK PREPARED</>
1594 because even read-only transactions need to write WAL in the
prepare phase (the first phase of two-phase commit).
1600 Sequence updates - <function>nextval()</>, <function>setval()</>
1605 <command>LISTEN</>, <command>UNLISTEN</>, <command>NOTIFY</>
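For example, any attempt at DML fails immediately (again using a hypothetical
<structname>accounts</> table):
<programlisting>
BEGIN;
INSERT INTO accounts VALUES (1);
ERROR:  cannot execute INSERT in a read-only transaction
</programlisting>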
1612 In normal operation, <quote>read-only</> transactions are allowed to
1613 update sequences and to use <command>LISTEN</>, <command>UNLISTEN</>, and
1614 <command>NOTIFY</>, so Hot Standby sessions operate under slightly tighter
1615 restrictions than ordinary read-only sessions. It is possible that some
1616 of these restrictions might be loosened in a future release.
1620 During hot standby, the parameter <varname>transaction_read_only</> is always
1621 true and may not be changed. But as long as no attempt is made to modify
1622 the database, connections during hot standby will act much like any other
1623 database connection. If failover or switchover occurs, the database will
1624 switch to normal processing mode. Sessions will remain connected while the
1625 server changes mode. Once hot standby finishes, it will be possible to
initiate read-write transactions (even from a session begun during hot standby).
1631 Users will be able to tell whether their session is read-only by
1632 issuing <command>SHOW transaction_read_only</>. In addition, a set of
1633 functions (<xref linkend="functions-recovery-info-table">) allow users to
1634 access information about the standby server. These allow you to write
1635 programs that are aware of the current state of the database. These
1636 can be used to monitor the progress of recovery, or to allow you to
1637 write complex programs that restore the database to particular states.
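For example (<function>pg_is_in_recovery()</> is one of the functions in that
table):
<programlisting>
SHOW transaction_read_only;     -- shows 'on' during hot standby
SELECT pg_is_in_recovery();     -- returns true while the server is in recovery
</programlisting>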
1641 <sect2 id="hot-standby-conflict">
1642 <title>Handling Query Conflicts</title>
1645 The primary and standby servers are in many ways loosely connected. Actions
1646 on the primary will have an effect on the standby. As a result, there is
1647 potential for negative interactions or conflicts between them. The easiest
1648 conflict to understand is performance: if a huge data load is taking place
1649 on the primary then this will generate a similar stream of WAL records on the
1650 standby, so standby queries may contend for system resources, such as I/O.
1654 There are also additional types of conflict that can occur with Hot Standby.
1655 These conflicts are <emphasis>hard conflicts</> in the sense that queries
1656 might need to be canceled and, in some cases, sessions disconnected to resolve them.
1657 The user is provided with several ways to handle these
1658 conflicts. Conflict cases include:
1663 Access Exclusive locks taken on the primary server, including both
1664 explicit <command>LOCK</> commands and various <acronym>DDL</>
1665 actions, conflict with table accesses in standby queries.
1670 Dropping a tablespace on the primary conflicts with standby queries
1671 using that tablespace for temporary work files.
1676 Dropping a database on the primary conflicts with sessions connected
1677 to that database on the standby.
1682 Application of a vacuum cleanup record from WAL conflicts with
1683 standby transactions whose snapshots can still <quote>see</> any of
1684 the rows to be removed.
1689 Application of a vacuum cleanup record from WAL conflicts with
1690 queries accessing the target page on the standby, whether or not
1691 the data to be removed is visible.
On the primary server, these cases simply result in waiting, and the
user might choose to cancel either of the conflicting actions. However,
1700 on the standby there is no choice: the WAL-logged action already occurred
1701 on the primary so the standby must not fail to apply it. Furthermore,
1702 allowing WAL application to wait indefinitely may be very undesirable,
1703 because the standby's state will become increasingly far behind the
1704 primary's. Therefore, a mechanism is provided to forcibly cancel standby
1705 queries that conflict with to-be-applied WAL records.
1709 An example of the problem situation is an administrator on the primary
1710 server running <command>DROP TABLE</> on a table that is currently being
1711 queried on the standby server. Clearly the standby query cannot continue
1712 if the <command>DROP TABLE</> is applied on the standby. If this situation
1713 occurred on the primary, the <command>DROP TABLE</> would wait until the
1714 other query had finished. But when <command>DROP TABLE</> is run on the
1715 primary, the primary doesn't have information about what queries are
1716 running on the standby, so it will not wait for any such standby
1717 queries. The WAL change records come through to the standby while the
1718 standby query is still running, causing a conflict. The standby server
1719 must either delay application of the WAL records (and everything after
1720 them, too) or else cancel the conflicting query so that the <command>DROP
1721 TABLE</> can be applied.
1725 When a conflicting query is short, it's typically desirable to allow it to
1726 complete by delaying WAL application for a little bit; but a long delay in
1727 WAL application is usually not desirable. So the cancel mechanism has
1728 parameters, <xref linkend="guc-max-standby-archive-delay"> and <xref
1729 linkend="guc-max-standby-streaming-delay">, that define the maximum
1730 allowed delay in WAL application. Conflicting queries will be canceled
1731 once it has taken longer than the relevant delay setting to apply any
1732 newly-received WAL data. There are two parameters so that different delay
1733 values can be specified for the case of reading WAL data from an archive
1734 (i.e., initial recovery from a base backup or <quote>catching up</> a
1735 standby server that has fallen far behind) versus reading WAL data via
1736 streaming replication.
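As an illustration, a standby's <filename>postgresql.conf</> might set both
limits as follows (the values are examples only, not recommendations):
<programlisting>
max_standby_archive_delay = 30s      # limit when reading WAL from an archive
max_standby_streaming_delay = 30s    # limit when reading WAL via streaming
</programlisting>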
1740 In a standby server that exists primarily for high availability, it's
1741 best to set the delay parameters relatively short, so that the server
1742 cannot fall far behind the primary due to delays caused by standby
1743 queries. However, if the standby server is meant for executing
1744 long-running queries, then a high or even infinite delay value may be
1745 preferable. Keep in mind however that a long-running query could
1746 cause other sessions on the standby server to not see recent changes
1747 on the primary, if it delays application of WAL records.
1751 Once the delay specified by <varname>max_standby_archive_delay</> or
1752 <varname>max_standby_streaming_delay</> has been exceeded, conflicting
1753 queries will be canceled. This usually results just in a cancellation
1754 error, although in the case of replaying a <command>DROP DATABASE</>
1755 the entire conflicting session will be terminated. Also, if the conflict
1756 is over a lock held by an idle transaction, the conflicting session is
1757 terminated (this behavior might change in the future).
1761 Canceled queries may be retried immediately (after beginning a new
1762 transaction, of course). Since query cancellation depends on
1763 the nature of the WAL records being replayed, a query that was
1764 canceled may well succeed if it is executed again.
1768 Keep in mind that the delay parameters are compared to the elapsed time
1769 since the WAL data was received by the standby server. Thus, the grace
1770 period allowed to any one query on the standby is never more than the
1771 delay parameter, and could be considerably less if the standby has already
1772 fallen behind as a result of waiting for previous queries to complete, or
1773 as a result of being unable to keep up with a heavy update load.
1777 The most common reason for conflict between standby queries and WAL replay
1778 is <quote>early cleanup</>. Normally, <productname>PostgreSQL</> allows
1779 cleanup of old row versions when there are no transactions that need to
1780 see them to ensure correct visibility of data according to MVCC rules.
1781 However, this rule can only be applied for transactions executing on the
1782 master. So it is possible that cleanup on the master will remove row
1783 versions that are still visible to a transaction on the standby.
1787 Experienced users should note that both row version cleanup and row version
1788 freezing will potentially conflict with standby queries. Running a manual
1789 <command>VACUUM FREEZE</> is likely to cause conflicts even on tables with
1790 no updated or deleted rows.
1794 Users should be clear that tables that are regularly and heavily updated
1795 on the primary server will quickly cause cancellation of longer running
1796 queries on the standby. In such cases the setting of a finite value for
1797 <varname>max_standby_archive_delay</> or
1798 <varname>max_standby_streaming_delay</> can be considered similar to
1799 setting <varname>statement_timeout</>.
1803 Remedial possibilities exist if the number of standby-query cancellations
1804 is found to be unacceptable. The first option is to set the parameter
1805 <varname>hot_standby_feedback</>, which prevents <command>VACUUM</> from
1806 removing recently-dead rows and so cleanup conflicts do not occur.
If you do this, you should note that this will delay cleanup of dead rows on the primary,
1809 which may result in undesirable table bloat. However, the cleanup
1810 situation will be no worse than if the standby queries were running
1811 directly on the primary server, and you are still getting the benefit of
1812 off-loading execution onto the standby.
1813 <varname>max_standby_archive_delay</> must be kept large in this case,
1814 because delayed WAL files might already contain entries that conflict with
1815 the desired standby queries.
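A minimal sketch of this first option, in the standby's
<filename>postgresql.conf</> (the parameter takes effect on configuration
reload):
<programlisting>
hot_standby_feedback = on    # standby reports its oldest snapshot to the primary
</programlisting>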
1819 Another option is to increase <xref linkend="guc-vacuum-defer-cleanup-age">
1820 on the primary server, so that dead rows will not be cleaned up as quickly
1821 as they normally would be. This will allow more time for queries to
1822 execute before they are canceled on the standby, without having to set
1823 a high <varname>max_standby_streaming_delay</>. However it is
1824 difficult to guarantee any specific execution-time window with this
1825 approach, since <varname>vacuum_defer_cleanup_age</> is measured in
1826 transactions executed on the primary server.
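For example, in the primary's <filename>postgresql.conf</> (an illustrative
value; remember that the unit is transactions, not time):
<programlisting>
vacuum_defer_cleanup_age = 10000   # defer cleanup by roughly 10000 transactions
</programlisting>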
1830 The number of query cancels and the reason for them can be viewed using
1831 the <structname>pg_stat_database_conflicts</> system view on the standby
1832 server. The <structname>pg_stat_database</> system view also contains
1833 summary information.
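For example, to see the per-database cancellation counts broken down by cause:
<programlisting>
-- Run on the standby
SELECT datname, confl_tablespace, confl_lock, confl_snapshot,
       confl_bufferpin, confl_deadlock
  FROM pg_stat_database_conflicts;
</programlisting>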
1837 <sect2 id="hot-standby-admin">
1838 <title>Administrator's Overview</title>
1841 If <varname>hot_standby</> is turned <literal>on</> in
1842 <filename>postgresql.conf</> and there is a <filename>recovery.conf</>
1843 file present, the server will run in Hot Standby mode.
1844 However, it may take some time for Hot Standby connections to be allowed,
1845 because the server will not accept connections until it has completed
1846 sufficient recovery to provide a consistent state against which queries
1847 can run. During this period,
1848 clients that attempt to connect will be refused with an error message.
1849 To confirm the server has come up, either loop trying to connect from
1850 the application, or look for these messages in the server logs:
1853 LOG: entering standby mode
1855 ... then some time later ...
1857 LOG: consistent recovery state reached
1858 LOG: database system is ready to accept read only connections
1861 Consistency information is recorded once per checkpoint on the primary.
1862 It is not possible to enable hot standby when reading WAL
1863 written during a period when <varname>wal_level</> was not set to
1864 <literal>hot_standby</> on the primary. Reaching a consistent state can
1865 also be delayed in the presence of both of these conditions:
1870 A write transaction has more than 64 subtransactions
1875 Very long-lived write transactions
1880 If you are running file-based log shipping ("warm standby"), you might need
1881 to wait until the next WAL file arrives, which could be as long as the
1882 <varname>archive_timeout</> setting on the primary.
1886 The setting of some parameters on the standby will need reconfiguration
1887 if they have been changed on the primary. For these parameters,
1888 the value on the standby must
1889 be equal to or greater than the value on the primary. If these parameters
1890 are not set high enough then the standby will refuse to start.
1891 Higher values can then be supplied and the server
restarted to begin recovery again. These parameters are as follows (a query for inspecting them appears after the list):
1897 <varname>max_connections</>
1902 <varname>max_prepared_transactions</>
1907 <varname>max_locks_per_transaction</>
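A quick way to compare the settings is to run the same query on both servers,
for example:
<programlisting>
SELECT name, setting
  FROM pg_settings
 WHERE name IN ('max_connections',
                'max_prepared_transactions',
                'max_locks_per_transaction');
</programlisting>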
1914 It is important that the administrator select appropriate settings for
1915 <xref linkend="guc-max-standby-archive-delay"> and <xref
1916 linkend="guc-max-standby-streaming-delay">. The best choices vary
1917 depending on business priorities. For example if the server is primarily
1918 tasked as a High Availability server, then you will want low delay
1919 settings, perhaps even zero, though that is a very aggressive setting. If
1920 the standby server is tasked as an additional server for decision support
1921 queries then it might be acceptable to set the maximum delay values to
1922 many hours, or even -1 which means wait forever for queries to complete.
Transaction status "hint bits" written on the primary are not WAL-logged,
so the standby will likely set those hint bits again itself.
1928 Thus, the standby server will still perform disk writes even though
1929 all users are read-only; no changes occur to the data values
1930 themselves. Users will still write large sort temporary files and
1931 re-generate relcache info files, so no part of the database
1932 is truly read-only during hot standby mode.
Note also that writes to remote databases using the
<application>dblink</application> module, and other operations outside the
database using PL functions, will still be possible, even though the
transaction is read-only locally.
1940 The following types of administration commands are not accepted
1941 during recovery mode:
1946 Data Definition Language (DDL) - e.g. <command>CREATE INDEX</>
1951 Privilege and Ownership - <command>GRANT</>, <command>REVOKE</>,
1952 <command>REASSIGN</>
1957 Maintenance commands - <command>ANALYZE</>, <command>VACUUM</>,
1958 <command>CLUSTER</>, <command>REINDEX</>
1965 Again, note that some of these commands are actually allowed during
1966 "read only" mode transactions on the primary.
1970 As a result, you cannot create additional indexes that exist solely
1971 on the standby, nor statistics that exist solely on the standby.
1972 If these administration commands are needed, they should be executed
on the primary, and eventually those changes will propagate to the standby.
1978 <function>pg_cancel_backend()</>
1979 and <function>pg_terminate_backend()</> will work on user backends,
1980 but not the Startup process, which performs
1981 recovery. <structname>pg_stat_activity</structname> does not show an
1982 entry for the Startup process, nor do recovering transactions show
1983 as active. As a result, <structname>pg_prepared_xacts</structname>
1984 is always empty during recovery. If you wish to resolve in-doubt
1985 prepared transactions, view <literal>pg_prepared_xacts</> on the
1986 primary and issue commands to resolve transactions there.
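For example, either of these can be issued from another standby session
(12345 stands in for the target backend's process ID):
<programlisting>
SELECT pg_cancel_backend(12345);      -- cancel the backend's current query
SELECT pg_terminate_backend(12345);   -- disconnect the backend entirely
</programlisting>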
1990 <structname>pg_locks</structname> will show locks held by backends,
1991 as normal. <structname>pg_locks</structname> also shows
1992 a virtual transaction managed by the Startup process that owns all
1993 <literal>AccessExclusiveLocks</> held by transactions being replayed by recovery.
1994 Note that the Startup process does not acquire locks to
1995 make database changes, and thus locks other than <literal>AccessExclusiveLocks</>
1996 do not show in <structname>pg_locks</structname> for the Startup
1997 process; they are just presumed to exist.
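For example, to see which <literal>AccessExclusiveLocks</> recovery is
currently enforcing:
<programlisting>
-- Run on the standby; the Startup process's locks appear under a
-- virtual transaction rather than an ordinary transaction ID
SELECT locktype, relation, mode, virtualtransaction
  FROM pg_locks
 WHERE mode = 'AccessExclusiveLock';
</programlisting>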
2001 The <productname>Nagios</> plugin <productname>check_pgsql</> will
2002 work, because the simple information it checks for exists.
2003 The <productname>check_postgres</> monitoring script will also work,
2004 though some reported values could give different or confusing results.
2005 For example, last vacuum time will not be maintained, since no
2006 vacuum occurs on the standby. Vacuums running on the primary
2007 do still send their changes to the standby.
WAL file control commands will not work during recovery,
e.g., <function>pg_start_backup</>, <function>pg_switch_xlog</>.
2016 Dynamically loadable modules work, including <structname>pg_stat_statements</>.
2020 Advisory locks work normally in recovery, including deadlock detection.
2021 Note that advisory locks are never WAL logged, so it is impossible for
2022 an advisory lock on either the primary or the standby to conflict with WAL
2023 replay. Nor is it possible to acquire an advisory lock on the primary
2024 and have it initiate a similar advisory lock on the standby. Advisory
2025 locks relate only to the server on which they are acquired.
2029 Trigger-based replication systems such as <productname>Slony</>,
2030 <productname>Londiste</> and <productname>Bucardo</> won't run on the
2031 standby at all, though they will run happily on the primary server as
2032 long as the changes are not sent to standby servers to be applied.
2033 WAL replay is not trigger-based so you cannot relay from the
2034 standby to any system that requires additional database writes or
2035 relies on the use of triggers.
2039 New OIDs cannot be assigned, though some <acronym>UUID</> generators may still
2040 work as long as they do not rely on writing new status to the database.
2044 Currently, temporary table creation is not allowed during read only
2045 transactions, so in some cases existing scripts will not run correctly.
2046 This restriction might be relaxed in a later release. This is
2047 both a SQL Standard compliance issue and a technical issue.
2051 <command>DROP TABLESPACE</> can only succeed if the tablespace is empty.
2052 Some standby users may be actively using the tablespace via their
2053 <varname>temp_tablespaces</> parameter. If there are temporary files in the
2054 tablespace, all active queries are canceled to ensure that temporary
files are removed, so the tablespace can be removed and WAL replay can continue.
2060 Running <command>DROP DATABASE</> or <command>ALTER DATABASE ... SET
2061 TABLESPACE</> on the primary
2062 will generate a WAL entry that will cause all users connected to that
2063 database on the standby to be forcibly disconnected. This action occurs
2064 immediately, whatever the setting of
2065 <varname>max_standby_streaming_delay</>. Note that
<command>ALTER DATABASE ... RENAME</> does not disconnect users, which
in most cases will go unnoticed, though it might in some cases confuse a
program that depends in some way upon the database name.
2072 In normal (non-recovery) mode, if you issue <command>DROP USER</> or <command>DROP ROLE</>
2073 for a role with login capability while that user is still connected then
nothing happens to the connected user; they remain connected. The user cannot
reconnect, however. This behavior applies in recovery also, so a
2076 <command>DROP USER</> on the primary does not disconnect that user on the standby.
2080 The statistics collector is active during recovery. All scans, reads, blocks,
2081 index usage, etc., will be recorded normally on the standby. Replayed
actions will not duplicate their effects on the primary, so replaying an
insert will not increment the Inserts column of <structname>pg_stat_user_tables</>.
2084 The stats file is deleted at the start of recovery, so stats from primary
2085 and standby will differ; this is considered a feature, not a bug.
Autovacuum is not active during recovery. It will start normally at the end of recovery.
2094 The background writer is active during recovery and will perform
2095 restartpoints (similar to checkpoints on the primary) and normal block
2096 cleaning activities. This can include updates of the hint bit
2097 information stored on the standby server.
2098 The <command>CHECKPOINT</> command is accepted during recovery,
2099 though it performs a restartpoint rather than a new checkpoint.
2103 <sect2 id="hot-standby-parameters">
2104 <title>Hot Standby Parameter Reference</title>
2107 Various parameters have been mentioned above in
2108 <xref linkend="hot-standby-conflict"> and
2109 <xref linkend="hot-standby-admin">.
2113 On the primary, parameters <xref linkend="guc-wal-level"> and
2114 <xref linkend="guc-vacuum-defer-cleanup-age"> can be used.
2115 <xref linkend="guc-max-standby-archive-delay"> and
<xref linkend="guc-max-standby-streaming-delay"> have no effect if set on the primary.
2121 On the standby, parameters <xref linkend="guc-hot-standby">,
2122 <xref linkend="guc-max-standby-archive-delay"> and
2123 <xref linkend="guc-max-standby-streaming-delay"> can be used.
2124 <xref linkend="guc-vacuum-defer-cleanup-age"> has no effect
2125 as long as the server remains in standby mode, though it will
2126 become relevant if the standby becomes primary.
2130 <sect2 id="hot-standby-caveats">
2131 <title>Caveats</title>
2134 There are several limitations of Hot Standby.
2135 These can and probably will be fixed in future releases:
2140 Operations on hash indexes are not presently WAL-logged, so
2141 replay will not update these indexes.
2146 Full knowledge of running transactions is required before snapshots
2147 can be taken. Transactions that use large numbers of subtransactions
2148 (currently greater than 64) will delay the start of read only
2149 connections until the completion of the longest running write transaction.
2150 If this situation occurs, explanatory messages will be sent to the server log.
2155 Valid starting points for standby queries are generated at each
2156 checkpoint on the master. If the standby is shut down while the master
2157 is in a shutdown state, it might not be possible to re-enter Hot Standby
2158 until the primary is started up, so that it generates further starting
points in the WAL logs. This is not a problem in the most common
situations where it might happen. Generally, if the primary is
shut down and no longer available, that is likely due to a serious
failure that requires the standby to be converted to operate as
the new primary anyway. And in situations where the primary is
being intentionally taken down, coordinating to make sure the standby
becomes the new primary smoothly is also standard procedure.
2170 At the end of recovery, <literal>AccessExclusiveLocks</> held by prepared transactions
2171 will require twice the normal number of lock table entries. If you plan
2172 on running either a large number of concurrent prepared transactions
2173 that normally take <literal>AccessExclusiveLocks</>, or you plan on having one
2174 large transaction that takes many <literal>AccessExclusiveLocks</>, you are
2175 advised to select a larger value of <varname>max_locks_per_transaction</>,
perhaps as much as twice the value of the parameter on the primary server,
as illustrated below. You need not consider this at all if
2178 your setting of <varname>max_prepared_transactions</> is 0.
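For instance, if the primary runs with the default
<varname>max_locks_per_transaction</> of 64, the standby might be configured
as follows (an illustrative value only; this parameter can only be set at
server start):
<programlisting>
max_locks_per_transaction = 128   # roughly twice the primary's setting
</programlisting>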
2183 The Serializable transaction isolation level is not yet available in hot
2184 standby. (See <xref linkend="xact-serializable"> and
2185 <xref linkend="serializable-consistency"> for details.)
2186 An attempt to set a transaction to the serializable isolation level in
2187 hot standby mode will generate an error.