1 <!-- doc/src/sgml/high-availability.sgml -->
3 <chapter id="high-availability">
4 <title>High Availability, Load Balancing, and Replication</title>
6 <indexterm><primary>high availability</></>
7 <indexterm><primary>failover</></>
8 <indexterm><primary>replication</></>
9 <indexterm><primary>load balancing</></>
10 <indexterm><primary>clustering</></>
11 <indexterm><primary>data partitioning</></>
14 Database servers can work together to allow a second server to
15 take over quickly if the primary server fails (high
16 availability), or to allow several computers to serve the same
17 data (load balancing). Ideally, database servers could work
18 together seamlessly. Web servers serving static web pages can
19 be combined quite easily by merely load-balancing web requests
20 to multiple machines. In fact, read-only database servers can
21 be combined relatively easily too. Unfortunately, most database
22 servers have a read/write mix of requests, and read/write servers
are much harder to combine. This is because, although read-only
data needs to be placed on each server only once, a write to any
server must be propagated to all servers so that future read
requests to those servers return consistent results.
30 This synchronization problem is the fundamental difficulty for
31 servers working together. Because there is no single solution
32 that eliminates the impact of the sync problem for all use cases,
33 there are multiple solutions. Each solution addresses this
problem in a different way, and minimizes its impact for a specific workload.
39 Some solutions deal with synchronization by allowing only one
40 server to modify the data. Servers that can modify data are
41 called read/write, <firstterm>master</> or <firstterm>primary</> servers.
42 Servers that track changes in the master are called <firstterm>standby</>
43 or <firstterm>slave</> servers. A standby server that cannot be connected
44 to until it is promoted to a master server is called a <firstterm>warm
45 standby</> server, and one that can accept connections and serves read-only
46 queries is called a <firstterm>hot standby</> server.
50 Some solutions are synchronous,
51 meaning that a data-modifying transaction is not considered
52 committed until all servers have committed the transaction. This
53 guarantees that a failover will not lose any data and that all
54 load-balanced servers will return consistent results no matter
55 which server is queried. In contrast, asynchronous solutions allow some
56 delay between the time of a commit and its propagation to the other servers,
57 opening the possibility that some transactions might be lost in
the switch to a backup server, and that load-balanced servers
59 might return slightly stale results. Asynchronous communication
60 is used when synchronous would be too slow.
64 Solutions can also be categorized by their granularity. Some solutions
65 can deal only with an entire database server, while others allow control
66 at the per-table or per-database level.
70 Performance must be considered in any choice. There is usually a
71 trade-off between functionality and
72 performance. For example, a fully synchronous solution over a slow
73 network might cut performance by more than half, while an asynchronous
74 one might have a minimal performance impact.
78 The remainder of this section outlines various failover, replication,
79 and load balancing solutions. A <ulink
80 url="http://www.postgres-r.org/documentation/terms">glossary</ulink> is
84 <sect1 id="different-replication-solutions">
85 <title>Comparison of Different Solutions</title>
90 <term>Shared Disk Failover</term>
94 Shared disk failover avoids synchronization overhead by having only one
95 copy of the database. It uses a single disk array that is shared by
96 multiple servers. If the main database server fails, the standby server
97 is able to mount and start the database as though it were recovering from
98 a database crash. This allows rapid failover with no data loss.
102 Shared hardware functionality is common in network storage devices.
103 Using a network file system is also possible, though care must be
104 taken that the file system has full <acronym>POSIX</> behavior (see <xref
105 linkend="creating-cluster-nfs">). One significant limitation of this
106 method is that if the shared disk array fails or becomes corrupt, the
107 primary and standby servers are both nonfunctional. Another issue is
108 that the standby server should never access the shared storage while
109 the primary server is running.
116 <term>File System (Block-Device) Replication</term>
120 A modified version of shared hardware functionality is file system
121 replication, where all changes to a file system are mirrored to a file
122 system residing on another computer. The only restriction is that
123 the mirroring must be done in a way that ensures the standby server
124 has a consistent copy of the file system — specifically, writes
125 to the standby must be done in the same order as those on the master.
<productname>DRBD</> is a popular file system replication solution for Linux.
143 <term>Transaction Log Shipping</term>
147 Warm and hot standby servers can be kept current by reading a
148 stream of write-ahead log (<acronym>WAL</>)
149 records. If the main server fails, the standby contains
150 almost all of the data of the main server, and can be quickly
151 made the new master database server. This can be synchronous or
152 asynchronous and can only be done for the entire database server.
155 A standby server can be implemented using file-based log shipping
156 (<xref linkend="warm-standby">) or streaming replication (see
157 <xref linkend="streaming-replication">), or a combination of both. For
158 information on hot standby, see <xref linkend="hot-standby">.
164 <term>Trigger-Based Master-Standby Replication</term>
168 A master-standby replication setup sends all data modification
169 queries to the master server. The master server asynchronously
170 sends data changes to the standby server. The standby can answer
171 read-only queries while the master server is running. The
172 standby server is ideal for data warehouse queries.
176 <productname>Slony-I</> is an example of this type of replication, with per-table
177 granularity, and support for multiple standby servers. Because it
updates the standby server asynchronously (in batches), data loss is
possible during failover.
185 <term>Statement-Based Replication Middleware</term>
189 With statement-based replication middleware, a program intercepts
190 every SQL query and sends it to one or all servers. Each server
191 operates independently. Read-write queries must be sent to all servers,
192 so that every server receives any changes. But read-only queries can be
sent to just one server, allowing the read workload to be distributed
among them.
198 If queries are simply broadcast unmodified, functions like
199 <function>random()</>, <function>CURRENT_TIMESTAMP</>, and
200 sequences can have different values on different servers.
201 This is because each server operates independently, and because
202 SQL queries are broadcast (and not actual modified rows). If
203 this is unacceptable, either the middleware or the application
204 must query such values from a single server and then use those
values in write queries. Another option is to combine this
middleware with a traditional master-standby setup, i.e., data modification
207 queries are sent only to the master and are propagated to the
208 standby servers via master-standby replication, not by the replication
209 middleware. Care must also be taken that all
210 transactions either commit or abort on all servers, perhaps
211 using two-phase commit (<xref linkend="sql-prepare-transaction">
and <xref linkend="sql-commit-prepared">).
213 <productname>Pgpool-II</> and <productname>Continuent Tungsten</>
214 are examples of this type of replication.
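<para>
As a hedged illustration, the two-phase commit commands look like
this (the <literal>accounts</> table and the transaction identifier
are hypothetical; the middleware would issue the same statements on
every server and proceed to <command>COMMIT PREPARED</> only after
all servers have prepared successfully):
<programlisting>
BEGIN;
UPDATE accounts SET balance = balance - 100.00 WHERE id = 1;
PREPARE TRANSACTION 'middleware_tx_42';
-- After every server has prepared successfully:
COMMIT PREPARED 'middleware_tx_42';
-- If any server failed to prepare, instead run:
-- ROLLBACK PREPARED 'middleware_tx_42';
</programlisting>
</para>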
220 <term>Asynchronous Multimaster Replication</term>
224 For servers that are not regularly connected, like laptops or
225 remote servers, keeping data consistent among servers is a
226 challenge. Using asynchronous multimaster replication, each
227 server works independently, and periodically communicates with
228 the other servers to identify conflicting transactions. The
229 conflicts can be resolved by users or conflict resolution rules.
230 Bucardo is an example of this type of replication.
236 <term>Synchronous Multimaster Replication</term>
240 In synchronous multimaster replication, each server can accept
241 write requests, and modified data is transmitted from the
242 original server to every other server before each transaction
243 commits. Heavy write activity can cause excessive locking,
244 leading to poor performance. In fact, write performance is
245 often worse than that of a single server. Read requests can
246 be sent to any server. Some implementations use shared disk
247 to reduce the communication overhead. Synchronous multimaster
248 replication is best for mostly read workloads, though its big
249 advantage is that any server can accept write requests —
250 there is no need to partition workloads between master and
251 standby servers, and because the data changes are sent from one
252 server to another, there is no problem with non-deterministic
253 functions like <function>random()</>.
257 <productname>PostgreSQL</> does not offer this type of replication,
258 though <productname>PostgreSQL</> two-phase commit (<xref
259 linkend="sql-prepare-transaction"> and <xref
260 linkend="sql-commit-prepared">)
261 can be used to implement this in application code or middleware.
267 <term>Commercial Solutions</term>
271 Because <productname>PostgreSQL</> is open source and easily
272 extended, a number of companies have taken <productname>PostgreSQL</>
273 and created commercial closed-source solutions with unique
274 failover, replication, and load balancing capabilities.
282 <xref linkend="high-availability-matrix"> summarizes
283 the capabilities of the various solutions listed above.
286 <table id="high-availability-matrix">
287 <title>High Availability, Load Balancing, and Replication Feature Matrix</title>
291 <entry>Feature</entry>
292 <entry>Shared Disk Failover</entry>
293 <entry>File System Replication</entry>
294 <entry>Transaction Log Shipping</entry>
295 <entry>Trigger-Based Master-Standby Replication</entry>
296 <entry>Statement-Based Replication Middleware</entry>
297 <entry>Asynchronous Multimaster Replication</entry>
298 <entry>Synchronous Multimaster Replication</entry>
305 <entry>Most Common Implementation</entry>
306 <entry align="center">NAS</entry>
307 <entry align="center">DRBD</entry>
308 <entry align="center">Streaming Repl.</entry>
309 <entry align="center">Slony</entry>
310 <entry align="center">pgpool-II</entry>
311 <entry align="center">Bucardo</entry>
312 <entry align="center"></entry>
316 <entry>Communication Method</entry>
317 <entry align="center">shared disk</entry>
318 <entry align="center">disk blocks</entry>
319 <entry align="center">WAL</entry>
320 <entry align="center">table rows</entry>
321 <entry align="center">SQL</entry>
322 <entry align="center">table rows</entry>
323 <entry align="center">table rows and row locks</entry>
327 <entry>No special hardware required</entry>
328 <entry align="center"></entry>
329 <entry align="center">•</entry>
330 <entry align="center">•</entry>
331 <entry align="center">•</entry>
332 <entry align="center">•</entry>
333 <entry align="center">•</entry>
334 <entry align="center">•</entry>
338 <entry>Allows multiple master servers</entry>
339 <entry align="center"></entry>
340 <entry align="center"></entry>
341 <entry align="center"></entry>
342 <entry align="center"></entry>
343 <entry align="center">•</entry>
344 <entry align="center">•</entry>
345 <entry align="center">•</entry>
349 <entry>No master server overhead</entry>
350 <entry align="center">•</entry>
351 <entry align="center"></entry>
352 <entry align="center">•</entry>
353 <entry align="center"></entry>
354 <entry align="center">•</entry>
355 <entry align="center"></entry>
356 <entry align="center"></entry>
360 <entry>No waiting for multiple servers</entry>
361 <entry align="center">•</entry>
362 <entry align="center"></entry>
363 <entry align="center">with sync off</entry>
364 <entry align="center">•</entry>
365 <entry align="center"></entry>
366 <entry align="center">•</entry>
367 <entry align="center"></entry>
371 <entry>Master failure will never lose data</entry>
372 <entry align="center">•</entry>
373 <entry align="center">•</entry>
374 <entry align="center">with sync on</entry>
375 <entry align="center"></entry>
376 <entry align="center">•</entry>
377 <entry align="center"></entry>
378 <entry align="center">•</entry>
<entry>Standby accepts read-only queries</entry>
383 <entry align="center"></entry>
384 <entry align="center"></entry>
385 <entry align="center">with hot</entry>
386 <entry align="center">•</entry>
387 <entry align="center">•</entry>
388 <entry align="center">•</entry>
389 <entry align="center">•</entry>
393 <entry>Per-table granularity</entry>
394 <entry align="center"></entry>
395 <entry align="center"></entry>
396 <entry align="center"></entry>
397 <entry align="center">•</entry>
398 <entry align="center"></entry>
399 <entry align="center">•</entry>
400 <entry align="center">•</entry>
404 <entry>No conflict resolution necessary</entry>
405 <entry align="center">•</entry>
406 <entry align="center">•</entry>
407 <entry align="center">•</entry>
408 <entry align="center">•</entry>
409 <entry align="center"></entry>
410 <entry align="center"></entry>
411 <entry align="center">•</entry>
419 There are a few solutions that do not fit into the above categories:
425 <term>Data Partitioning</term>
429 Data partitioning splits tables into data sets. Each set can
430 be modified by only one server. For example, data can be
431 partitioned by offices, e.g., London and Paris, with a server
432 in each office. If queries combining London and Paris data
433 are necessary, an application can query both servers, or
434 master/standby replication can be used to keep a read-only copy
435 of the other office's data on each server.
441 <term>Multiple-Server Parallel Query Execution</term>
445 Many of the above solutions allow multiple servers to handle multiple
446 queries, but none allow a single query to use multiple servers to
447 complete faster. This solution allows multiple servers to work
448 concurrently on a single query. It is usually accomplished by
449 splitting the data among servers and having each server execute its
450 part of the query and return results to a central server where they
451 are combined and returned to the user. <productname>Pgpool-II</>
452 has this capability. Also, this can be implemented using the
453 <productname>PL/Proxy</> tool set.
464 <sect1 id="warm-standby">
465 <title>Log-Shipping Standby Servers</title>
469 Continuous archiving can be used to create a <firstterm>high
470 availability</> (HA) cluster configuration with one or more
471 <firstterm>standby servers</> ready to take over operations if the
472 primary server fails. This capability is widely referred to as
473 <firstterm>warm standby</> or <firstterm>log shipping</>.
The primary and standby servers work together to provide this capability,
478 though the servers are only loosely coupled. The primary server operates
479 in continuous archiving mode, while each standby server operates in
480 continuous recovery mode, reading the WAL files from the primary. No
481 changes to the database tables are required to enable this capability,
482 so it offers low administration overhead compared to some other
483 replication solutions. This configuration also has relatively low
484 performance impact on the primary server.
488 Directly moving WAL records from one database server to another
489 is typically described as log shipping. <productname>PostgreSQL</>
490 implements file-based log shipping by transferring WAL records
491 one file (WAL segment) at a time. WAL files (16MB) can be
492 shipped easily and cheaply over any distance, whether it be to an
493 adjacent system, another system at the same site, or another system on
494 the far side of the globe. The bandwidth required for this technique
495 varies according to the transaction rate of the primary server.
496 Record-based log shipping is more granular and streams WAL changes
497 incrementally over a network connection (see <xref
498 linkend="streaming-replication">).
502 It should be noted that log shipping is asynchronous, i.e., the WAL
503 records are shipped after transaction commit. As a result, there is a
504 window for data loss should the primary server suffer a catastrophic
505 failure; transactions not yet shipped will be lost. The size of the
506 data loss window in file-based log shipping can be limited by use of the
507 <varname>archive_timeout</varname> parameter, which can be set as low
508 as a few seconds. However such a low setting will
509 substantially increase the bandwidth required for file shipping.
510 Streaming replication (see <xref linkend="streaming-replication">)
511 allows a much smaller window of data loss.
515 Recovery performance is sufficiently good that the standby will
516 typically be only moments away from full
517 availability once it has been activated. As a result, this is called
a warm standby configuration, which offers high
519 availability. Restoring a server from an archived base backup and
520 rollforward will take considerably longer, so that technique only
521 offers a solution for disaster recovery, not high availability.
522 A standby server can also be used for read-only queries, in which case
it is called a Hot Standby server. See <xref linkend="hot-standby"> for
more information.
527 <indexterm zone="high-availability">
528 <primary>warm standby</primary>
531 <indexterm zone="high-availability">
532 <primary>PITR standby</primary>
535 <indexterm zone="high-availability">
536 <primary>standby server</primary>
539 <indexterm zone="high-availability">
540 <primary>log shipping</primary>
543 <indexterm zone="high-availability">
544 <primary>witness server</primary>
547 <indexterm zone="high-availability">
548 <primary>STONITH</primary>
551 <sect2 id="standby-planning">
552 <title>Planning</title>
555 It is usually wise to create the primary and standby servers
556 so that they are as similar as possible, at least from the
557 perspective of the database server. In particular, the path names
558 associated with tablespaces will be passed across unmodified, so both
559 primary and standby servers must have the same mount paths for
560 tablespaces if that feature is used. Keep in mind that if
561 <xref linkend="sql-createtablespace">
562 is executed on the primary, any new mount point needed for it must
563 be created on the primary and all standby servers before the command
564 is executed. Hardware need not be exactly the same, but experience shows
565 that maintaining two identical systems is easier than maintaining two
566 dissimilar ones over the lifetime of the application and system.
567 In any case the hardware architecture must be the same — shipping
568 from, say, a 32-bit to a 64-bit system will not work.
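<para>
For example, if tablespaces are in use, a command like the following
(the tablespace name and path are purely illustrative) requires that
the directory already exist, with appropriate ownership and
permissions, on the primary and on every standby:
<programlisting>
CREATE TABLESPACE fastspace LOCATION '/mnt/ssd/postgresql/data';
</programlisting>
</para>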
572 In general, log shipping between servers running different major
573 <productname>PostgreSQL</> release
574 levels is not possible. It is the policy of the PostgreSQL Global
575 Development Group not to make changes to disk formats during minor release
576 upgrades, so it is likely that running different minor release levels
577 on primary and standby servers will work successfully. However, no
578 formal support for that is offered and you are advised to keep primary
579 and standby servers at the same release level as much as possible.
580 When updating to a new minor release, the safest policy is to update
581 the standby servers first — a new minor release is more likely
to be able to read WAL files from a previous minor release than vice
versa.
588 <sect2 id="standby-server-operation">
589 <title>Standby Server Operation</title>
592 In standby mode, the server continuously applies WAL received from the
593 master server. The standby server can read WAL from a WAL archive
594 (see <xref linkend="restore-command">) or directly from the master
595 over a TCP connection (streaming replication). The standby server will
596 also attempt to restore any WAL found in the standby cluster's
597 <filename>pg_xlog</> directory. That typically happens after a server
restart, when the standby again replays WAL that was streamed from the
599 master before the restart, but you can also manually copy files to
600 <filename>pg_xlog</> at any time to have them replayed.
604 At startup, the standby begins by restoring all WAL available in the
605 archive location, calling <varname>restore_command</>. Once it
606 reaches the end of WAL available there and <varname>restore_command</>
607 fails, it tries to restore any WAL available in the <filename>pg_xlog</> directory.
608 If that fails, and streaming replication has been configured, the
609 standby tries to connect to the primary server and start streaming WAL
610 from the last valid record found in archive or <filename>pg_xlog</>. If that fails
611 or streaming replication is not configured, or if the connection is
later disconnected, the standby goes back to the beginning and tries to
restore the file from the archive again. This loop of retries from the
614 archive, <filename>pg_xlog</>, and via streaming replication goes on until the server
615 is stopped or failover is triggered by a trigger file.
619 Standby mode is exited and the server switches to normal operation
620 when <command>pg_ctl promote</> is run or a trigger file is found
621 (<varname>trigger_file</>). Before failover,
622 any WAL immediately available in the archive or in <filename>pg_xlog</> will be
623 restored, but no attempt is made to connect to the master.
627 <sect2 id="preparing-master-for-standby">
628 <title>Preparing the Master for Standby Servers</title>
631 Set up continuous archiving on the primary to an archive directory
632 accessible from the standby, as described
633 in <xref linkend="continuous-archiving">. The archive location should be
634 accessible from the standby even when the master is down, i.e. it should
reside on the standby server itself or another trusted server, not on
the master server.
640 If you want to use streaming replication, set up authentication on the
641 primary server to allow replication connections from the standby
642 server(s); that is, create a role and provide a suitable entry or
643 entries in <filename>pg_hba.conf</> with the database field set to
644 <literal>replication</>. Also ensure <varname>max_wal_senders</> is set
to a sufficiently large value in the configuration file of the primary
server.
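<para>
A minimal sketch of the relevant settings on the primary follows; the
archive path and sender count are illustrative, and
<literal>wal_level = hot_standby</> assumes the standbys will also
serve read-only queries:
<programlisting>
# postgresql.conf on the primary
wal_level = hot_standby
archive_mode = on
archive_command = 'cp %p /path/to/archive/%f'
max_wal_senders = 5
</programlisting>
See <xref linkend="streaming-replication-authentication"> for the
corresponding <filename>pg_hba.conf</> entry.
</para>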
650 Take a base backup as described in <xref linkend="backup-base-backup">
651 to bootstrap the standby server.
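<para>
For example, assuming the primary at <literal>192.168.1.50</> and the
replication user <literal>foo</> used in the examples below, a base
backup could be taken with <application>pg_basebackup</>; the target
directory is illustrative:
<programlisting>
# Run on the standby host; -P reports progress
pg_basebackup -h 192.168.1.50 -U foo -D /path/to/standby/data -P
</programlisting>
</para>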
655 <sect2 id="standby-server-setup">
656 <title>Setting Up a Standby Server</title>
To set up the standby server, restore the base backup taken from the primary
660 server (see <xref linkend="backup-pitr-recovery">). Create a recovery
661 command file <filename>recovery.conf</> in the standby's cluster data
662 directory, and turn on <varname>standby_mode</>. Set
663 <varname>restore_command</> to a simple command to copy files from
664 the WAL archive. If you plan to have multiple standby servers for high
665 availability purposes, set <varname>recovery_target_timeline</> to
666 <literal>latest</>, to make the standby server follow the timeline change
667 that occurs at failover to another standby.
672 Do not use pg_standby or similar tools with the built-in standby mode
673 described here. <varname>restore_command</> should return immediately
674 if the file does not exist; the server will retry the command again if
675 necessary. See <xref linkend="log-shipping-alternative">
676 for using tools like pg_standby.
681 If you want to use streaming replication, fill in
682 <varname>primary_conninfo</> with a libpq connection string, including
683 the host name (or IP address) and any additional details needed to
684 connect to the primary server. If the primary needs a password for
685 authentication, the password needs to be specified in
686 <varname>primary_conninfo</> as well.
690 If you're setting up the standby server for high availability purposes,
691 set up WAL archiving, connections and authentication like the primary
server, because the standby server will work as a primary server after
failover.
697 If you're using a WAL archive, its size can be minimized using the <xref
698 linkend="archive-cleanup-command"> parameter to remove files that are no
699 longer required by the standby server.
700 The <application>pg_archivecleanup</> utility is designed specifically to
be used with <varname>archive_cleanup_command</> in typical single-standby
configurations; see <xref linkend="pgarchivecleanup">.
Note, however, that if you're using the archive for backup purposes, you
704 need to retain files needed to recover from at least the latest base
705 backup, even if they're no longer needed by the standby.
709 A simple example of a <filename>recovery.conf</> is:
standby_mode = 'on'
primary_conninfo = 'host=192.168.1.50 port=5432 user=foo password=foopass'
713 restore_command = 'cp /path/to/archive/%f %p'
714 archive_cleanup_command = 'pg_archivecleanup /path/to/archive %r'
719 You can have any number of standby servers, but if you use streaming
720 replication, make sure you set <varname>max_wal_senders</> high enough in
721 the primary to allow them to be connected simultaneously.
726 <sect2 id="streaming-replication">
727 <title>Streaming Replication</title>
729 <indexterm zone="high-availability">
730 <primary>Streaming Replication</primary>
734 Streaming replication allows a standby server to stay more up-to-date
735 than is possible with file-based log shipping. The standby connects
736 to the primary, which streams WAL records to the standby as they're
737 generated, without waiting for the WAL file to be filled.
741 Streaming replication is asynchronous, so there is still a small delay
between committing a transaction on the primary and the changes
becoming visible on the standby. The delay is, however, much smaller than with
744 file-based log shipping, typically under one second assuming the standby
745 is powerful enough to keep up with the load. With streaming replication,
<varname>archive_timeout</> is not required to reduce the data loss
window.
751 If you use streaming replication without file-based continuous
752 archiving, you have to set <varname>wal_keep_segments</> in the master
753 to a value high enough to ensure that old WAL segments are not recycled
754 too early, while the standby might still need them to catch up. If the
755 standby falls behind too much, it needs to be reinitialized from a new
756 base backup. If you set up a WAL archive that's accessible from the
757 standby, <varname>wal_keep_segments</> is not required as the standby can always
758 use the archive to catch up.
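<para>
A sketch of such a setting; the value is illustrative and depends on
how far the standby may fall behind:
<programlisting>
# postgresql.conf on the master
wal_keep_segments = 64    # 64 segments x 16MB = 1GB of WAL retained in pg_xlog
</programlisting>
</para>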
762 To use streaming replication, set up a file-based log-shipping standby
763 server as described in <xref linkend="warm-standby">. The step that
turns a file-based log-shipping standby into a streaming replication
standby is setting <varname>primary_conninfo</> in the
766 <filename>recovery.conf</> file to point to the primary server. Set
767 <xref linkend="guc-listen-addresses"> and authentication options
768 (see <filename>pg_hba.conf</>) on the primary so that the standby server
769 can connect to the <literal>replication</> pseudo-database on the primary
770 server (see <xref linkend="streaming-replication-authentication">).
774 On systems that support the keepalive socket option, setting
775 <xref linkend="guc-tcp-keepalives-idle">,
776 <xref linkend="guc-tcp-keepalives-interval"> and
777 <xref linkend="guc-tcp-keepalives-count"> helps the primary promptly
778 notice a broken connection.
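<para>
A sketch of such settings on the primary (the values are
illustrative):
<programlisting>
# postgresql.conf on the primary
tcp_keepalives_idle = 60        # seconds of idleness before the first keepalive
tcp_keepalives_interval = 10    # seconds between keepalives
tcp_keepalives_count = 5        # lost keepalives before the connection is dead
</programlisting>
</para>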
782 Set the maximum number of concurrent connections from the standby servers
783 (see <xref linkend="guc-max-wal-senders"> for details).
787 When the standby is started and <varname>primary_conninfo</> is set
788 correctly, the standby will connect to the primary after replaying all
789 WAL files available in the archive. If the connection is established
790 successfully, you will see a walreceiver process in the standby, and
791 a corresponding walsender process in the primary.
794 <sect3 id="streaming-replication-authentication">
795 <title>Authentication</title>
797 It is very important that the access privileges for replication be set up
798 so that only trusted users can read the WAL stream, because it is
799 easy to extract privileged information from it. Standby servers must
800 authenticate to the primary as a superuser or an account that has the
801 <literal>REPLICATION</> privilege. It is recommended to create a
802 dedicated user account with <literal>REPLICATION</> and <literal>LOGIN</>
803 privileges for replication. While <literal>REPLICATION</> privilege gives
804 very high permissions, it does not allow the user to modify any data on
805 the primary system, which the <literal>SUPERUSER</> privilege does.
809 Client authentication for replication is controlled by a
810 <filename>pg_hba.conf</> record specifying <literal>replication</> in the
811 <replaceable>database</> field. For example, if the standby is running on
812 host IP <literal>192.168.1.100</> and the account name for replication
813 is <literal>foo</>, the administrator can add the following line to the
814 <filename>pg_hba.conf</> file on the primary:
817 # Allow the user "foo" from host 192.168.1.100 to connect to the primary
818 # as a replication standby if the user's password is correctly supplied.
# TYPE  DATABASE        USER            ADDRESS                 METHOD
host    replication     foo             192.168.1.100/32        md5
825 The host name and port number of the primary, connection user name,
826 and password are specified in the <filename>recovery.conf</> file.
827 The password can also be set in the <filename>~/.pgpass</> file on the
standby (specify <literal>replication</> in the <replaceable>database</>
field).
830 For example, if the primary is running on host IP <literal>192.168.1.50</>,
831 port <literal>5432</literal>, the account name for replication is
832 <literal>foo</>, and the password is <literal>foopass</>, the administrator
can add the following line to the <filename>recovery.conf</> file on the
standby:
837 # The standby connects to the primary that is running on host 192.168.1.50
838 # and port 5432 as the user "foo" whose password is "foopass".
839 primary_conninfo = 'host=192.168.1.50 port=5432 user=foo password=foopass'
844 <sect3 id="streaming-replication-monitoring">
845 <title>Monitoring</title>
847 An important health indicator of streaming replication is the amount
848 of WAL records generated in the primary, but not yet applied in the
849 standby. You can calculate this lag by comparing the current WAL write
850 location on the primary with the last WAL location received by the
851 standby. They can be retrieved using
852 <function>pg_current_xlog_location</> on the primary and the
853 <function>pg_last_xlog_receive_location</> on the standby,
854 respectively (see <xref linkend="functions-admin-backup-table"> and
855 <xref linkend="functions-recovery-info-table"> for details).
856 The last WAL receive location in the standby is also displayed in the
857 process status of the WAL receiver process, displayed using the
858 <command>ps</> command (see <xref linkend="monitoring-ps"> for details).
861 You can retrieve a list of WAL sender processes via the
862 <link linkend="monitoring-stats-views-table">
863 <literal>pg_stat_replication</></link> view. Large differences between
<function>pg_current_xlog_location</> and the <literal>sent_location</> field
865 might indicate that the master server is under heavy load, while
866 differences between <literal>sent_location</> and
867 <function>pg_last_xlog_receive_location</> on the standby might indicate
868 network delay, or that the standby is under heavy load.
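<para>
For example (run the first query on the primary and the second on the
standby):
<programlisting>
-- On the primary: current WAL write location
SELECT pg_current_xlog_location();

-- On the standby: last WAL location received
SELECT pg_last_xlog_receive_location();

-- On the primary: one row per connected standby
SELECT application_name, state, sent_location, replay_location
FROM pg_stat_replication;
</programlisting>
</para>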
873 <sect2 id="cascading-replication">
874 <title>Cascading Replication</title>
876 <indexterm zone="high-availability">
877 <primary>Cascading Replication</primary>
881 The cascading replication feature allows a standby server to accept replication
882 connections and stream WAL records to other standbys, acting as a relay.
883 This can be used to reduce the number of direct connections to the master
and also to minimize inter-site bandwidth overheads.
888 A standby acting as both a receiver and a sender is known as a cascading
889 standby. Standbys that are more directly connected to the master are known
890 as upstream servers, while those standby servers further away are downstream
891 servers. Cascading replication does not place limits on the number or
892 arrangement of downstream servers, though each standby connects to only
one upstream server, which eventually links to a single master/primary
server.
898 A cascading standby sends not only WAL records received from the
899 master but also those restored from the archive. So even if the replication
connection to some upstream server is terminated, streaming replication
901 continues downstream for as long as new WAL records are available.
905 Cascading replication is currently asynchronous. Synchronous replication
906 (see <xref linkend="synchronous-replication">) settings have no effect on
907 cascading replication at present.
911 Hot Standby feedback propagates upstream, whatever the cascaded arrangement.
915 Promoting a cascading standby terminates the immediate downstream replication
916 connections which it serves. This is because the timeline becomes different
917 between standbys, and they can no longer continue replication. The
918 affected standby(s) may reconnect to reestablish streaming replication.
922 To use cascading replication, set up the cascading standby so that it can
923 accept replication connections (that is, set
924 <xref linkend="guc-max-wal-senders"> and <xref linkend="guc-hot-standby">,
and configure
<link linkend="auth-pg-hba-conf">host-based authentication</link>).
927 You will also need to set <varname>primary_conninfo</> in the downstream
928 standby to point to the cascading standby.
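<para>
A minimal sketch (the host name is illustrative): the cascading
standby is configured like a sending server, and the downstream
standby simply points its <varname>primary_conninfo</> at it:
<programlisting>
# postgresql.conf on the cascading standby
max_wal_senders = 5
hot_standby = on

# recovery.conf on the downstream standby
standby_mode = 'on'
primary_conninfo = 'host=cascading-standby.example.com port=5432 user=foo password=foopass'
</programlisting>
</para>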
932 <sect2 id="synchronous-replication">
933 <title>Synchronous Replication</title>
935 <indexterm zone="high-availability">
936 <primary>Synchronous Replication</primary>
940 <productname>PostgreSQL</> streaming replication is asynchronous by
941 default. If the primary server
942 crashes then some transactions that were committed may not have been
943 replicated to the standby server, causing data loss. The amount
of data loss is proportional to the replication delay at the time of
failover.
949 Synchronous replication offers the ability to confirm that all changes
950 made by a transaction have been transferred to one synchronous standby
951 server. This extends the standard level of durability
952 offered by a transaction commit. This level of protection is referred
953 to as 2-safe replication in computer science theory.
957 When requesting synchronous replication, each commit of a
958 write transaction will wait until confirmation is
959 received that the commit has been written to the transaction log on disk
960 of both the primary and standby server. The only possibility that data
961 can be lost is if both the primary and the standby suffer crashes at the
962 same time. This can provide a much higher level of durability, though only
963 if the sysadmin is cautious about the placement and management of the two
964 servers. Waiting for confirmation increases the user's confidence that the
965 changes will not be lost in the event of server crashes but it also
966 necessarily increases the response time for the requesting transaction.
The minimum wait time is the roundtrip time between primary and standby.
971 Read only transactions and transaction rollbacks need not wait for
972 replies from standby servers. Subtransaction commits do not wait for
973 responses from standby servers, only top-level commits. Long
974 running actions such as data loading or index building do not wait
975 until the very final commit message. All two-phase commit actions
976 require commit waits, including both prepare and commit.
979 <sect3 id="synchronous-replication-config">
980 <title>Basic Configuration</title>
983 Once streaming replication has been configured, configuring synchronous
984 replication requires only one additional configuration step:
985 <xref linkend="guc-synchronous-standby-names"> must be set to
986 a non-empty value. <varname>synchronous_commit</> must also be set to
987 <literal>on</>, but since this is the default value, typically no change is
988 required. (See <xref linkend="runtime-config-wal-settings"> and
989 <xref linkend="runtime-config-replication-master">.)
990 This configuration will cause each commit to wait for
confirmation that the standby has written the commit record to durable
storage.
993 <varname>synchronous_commit</> can be set by individual
994 users, so it can be configured in the configuration file, for particular
995 users or databases, or dynamically by applications, in order to control
996 the durability guarantee on a per-transaction basis.
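<para>
A minimal sketch, assuming one standby whose
<varname>primary_conninfo</> sets
<literal>application_name=standby1</> (the name is illustrative):
<programlisting>
# postgresql.conf on the primary
synchronous_standby_names = 'standby1'
synchronous_commit = on        # the default
</programlisting>
</para>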
1000 After a commit record has been written to disk on the primary, the
1001 WAL record is then sent to the standby. The standby sends reply
1002 messages each time a new batch of WAL data is written to disk, unless
1003 <varname>wal_receiver_status_interval</> is set to zero on the standby.
1004 If the standby is the first matching standby, as specified in
1005 <varname>synchronous_standby_names</> on the primary, the reply
1006 messages from that standby will be used to wake users waiting for
1007 confirmation that the commit record has been received. These parameters
1008 allow the administrator to specify which standby servers should be
1009 synchronous standbys. Note that the configuration of synchronous
1010 replication is mainly on the master. Named standbys must be directly
1011 connected to the master; the master knows nothing about downstream
1012 standby servers using cascaded replication.
1016 Setting <varname>synchronous_commit</> to <literal>remote_write</> will
1017 cause each commit to wait for confirmation that the standby has received
1018 the commit record and written it out to its own operating system, but not
1019 for the data to be flushed to disk on the standby. This
1020 setting provides a weaker guarantee of durability than <literal>on</>
1021 does: the standby could lose the data in the event of an operating system
1022 crash, though not a <productname>PostgreSQL</> crash.
1023 However, it's a useful setting in practice
1024 because it can decrease the response time for the transaction.
1025 Data loss could only occur if both the primary and the standby crash and
1026 the database of the primary gets corrupted at the same time.
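<para>
Because <varname>synchronous_commit</> can be changed at runtime, a
session willing to accept this weaker guarantee can simply run:
<programlisting>
SET synchronous_commit = remote_write;
</programlisting>
</para>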
1030 Users will stop waiting if a fast shutdown is requested. However, as
when using asynchronous replication, the server will not fully
shut down until all outstanding WAL records are transferred to the currently
1033 connected standby servers.
1038 <sect3 id="synchronous-replication-performance">
1039 <title>Planning for Performance</title>
1042 Synchronous replication usually requires carefully planned and placed
1043 standby servers to ensure applications perform acceptably. Waiting
doesn't utilize system resources, but transaction locks continue to be
1045 held until the transfer is confirmed. As a result, incautious use of
1046 synchronous replication will reduce performance for database
1047 applications because of increased response times and higher contention.
1051 <productname>PostgreSQL</> allows the application developer
1052 to specify the durability level required via replication. This can be
1053 specified for the system overall, though it can also be specified for
1054 specific users or connections, or even individual transactions.
1058 For example, an application workload might consist of:
10% of changes are to important customer details, while
90% of changes are to less important data that the business can more
easily survive losing, such as chat messages between users.
1065 With synchronous replication options specified at the application level
1066 (on the primary) we can offer synchronous replication for the most
1067 important changes, without slowing down the bulk of the total workload.
1068 Application level options are an important and practical tool for allowing
1069 the benefits of synchronous replication for high performance applications.
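<para>
A sketch of this split, assuming a hypothetical
<literal>chat_messages</> table: the server default stays
<literal>on</> for the important changes, while the chat workload
lowers its own durability requirement per transaction:
<programlisting>
BEGIN;
SET LOCAL synchronous_commit = local;    -- do not wait for the standby
INSERT INTO chat_messages (body) VALUES ('hello');
COMMIT;
</programlisting>
Alternatively, the setting can be attached to the role used by the
chat application with <command>ALTER ROLE ... SET synchronous_commit
= local</>.
</para>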
1073 You should consider that the network bandwidth must be higher than
1074 the rate of generation of WAL data.
1079 <sect3 id="synchronous-replication-ha">
1080 <title>Planning for High Availability</title>
1083 Commits made when <varname>synchronous_commit</> is set to <literal>on</>
1084 or <literal>remote_write</> will wait until the synchronous standby responds. The response
1085 may never occur if the last, or only, standby should crash.
1089 The best solution for avoiding data loss is to ensure you don't lose
1090 your last remaining synchronous standby. This can be achieved by naming multiple
1091 potential synchronous standbys using <varname>synchronous_standby_names</>.
1092 The first named standby will be used as the synchronous standby. Standbys
1093 listed after this will take over the role of synchronous standby if the
1094 first one should fail.
1098 When a standby first attaches to the primary, it will not yet be properly
1099 synchronized. This is described as <literal>catchup</> mode. Once
1100 the lag between standby and primary reaches zero for the first time
1101 we move to real-time <literal>streaming</> state.
1102 The catch-up duration may be long immediately after the standby has
1103 been created. If the standby is shut down, then the catch-up period
1104 will increase according to the length of time the standby has been down.
1105 The standby is only able to become a synchronous standby
1106 once it has reached <literal>streaming</> state.
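<para>
The standby's progression through these states can be observed from
the primary; for example:
<programlisting>
SELECT application_name, state, sync_state FROM pg_stat_replication;
</programlisting>
Here <literal>state</> reports <literal>catchup</> or
<literal>streaming</>, and <literal>sync_state</> shows whether a
standby is currently the synchronous one.
</para>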
If the primary restarts while commits are waiting for acknowledgement, those
waiting transactions will be marked fully committed once the primary
database recovers.
1113 There is no way to be certain that all standbys have received all
outstanding WAL data at the time of the crash of the primary. Some
1115 transactions may not show as committed on the standby, even though
1116 they show as committed on the primary. The guarantee we offer is that
1117 the application will not receive explicit acknowledgement of the
1118 successful commit of a transaction until the WAL data is known to be
1119 safely received by the standby.
1123 If you really do lose your last standby server then you should disable
1124 <varname>synchronous_standby_names</> and reload the configuration file
1125 on the primary server.
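<para>
For example (a sketch), edit <filename>postgresql.conf</> on the
primary:
<programlisting>
synchronous_standby_names = ''
</programlisting>
and then reload with <command>pg_ctl reload</>, or run
<literal>SELECT pg_reload_conf();</> from a superuser session.
</para>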
1129 If the primary is isolated from remaining standby servers you should
1130 fail over to the best candidate of those other remaining standby servers.
1134 If you need to re-create a standby server while transactions are
waiting, make sure that the commands <function>pg_start_backup()</> and
<function>pg_stop_backup()</> are run in a session with
1137 <varname>synchronous_commit</> = <literal>off</>, otherwise those
1138 requests will wait forever for the standby to appear.
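<para>
For example, a sketch of such a session (the backup label is
illustrative):
<programlisting>
SET synchronous_commit = off;
SELECT pg_start_backup('rebuild-standby');
-- copy the data directory to the new standby here
SELECT pg_stop_backup();
</programlisting>
</para>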
1145 <sect1 id="warm-standby-failover">
1146 <title>Failover</title>
1149 If the primary server fails then the standby server should begin
1150 failover procedures.
1154 If the standby server fails then no failover need take place. If the
1155 standby server can be restarted, even some time later, then the recovery
1156 process can also be restarted immediately, taking advantage of
1157 restartable recovery. If the standby server cannot be restarted, then a
1158 full new standby server instance should be created.
1162 If the primary server fails and the standby server becomes the
1163 new primary, and then the old primary restarts, you must have
1164 a mechanism for informing the old primary that it is no longer the primary. This is
1165 sometimes known as <acronym>STONITH</> (Shoot The Other Node In The Head), which is
1166 necessary to avoid situations where both systems think they are the
1167 primary, which will lead to confusion and ultimately data loss.
1171 Many failover systems use just two systems, the primary and the standby,
1172 connected by some kind of heartbeat mechanism to continually verify the
1173 connectivity between the two and the viability of the primary. It is
1174 also possible to use a third system (called a witness server) to prevent
1175 some cases of inappropriate failover, but the additional complexity
might not be worthwhile unless it is set up with sufficient care and
rigorous testing.
1181 <productname>PostgreSQL</productname> does not provide the system
1182 software required to identify a failure on the primary and notify
1183 the standby database server. Many such tools exist and are well
1184 integrated with the operating system facilities required for
1185 successful failover, such as IP address migration.
1189 Once failover to the standby occurs, there is only a
1190 single server in operation. This is known as a degenerate state.
1191 The former standby is now the primary, but the former primary is down
and might stay down. To return to normal operation, a standby server
needs to be recreated,
either on the former primary system when it comes up, or on a third,
1195 possibly new, system. Once complete, the primary and standby can be
1196 considered to have switched roles. Some people choose to use a third
1197 server to provide backup for the new primary until the new standby
1198 server is recreated,
1199 though clearly this complicates the system configuration and
1200 operational processes.
1204 So, switching from primary to standby server can be fast but requires
1205 some time to re-prepare the failover cluster. Regular switching from
1206 primary to standby is useful, since it allows regular downtime on
1207 each system for maintenance. This also serves as a test of the
1208 failover mechanism to ensure that it will really work when you need it.
1209 Written administration procedures are advised.
1213 To trigger failover of a log-shipping standby server,
1214 run <command>pg_ctl promote</> or create a trigger
1215 file with the file name and path specified by the <varname>trigger_file</>
1216 setting in <filename>recovery.conf</>. If you're planning to use
1217 <command>pg_ctl promote</> to fail over, <varname>trigger_file</> is
not required. If you're setting up reporting servers that are
only used to offload read-only queries from the primary, not for high
availability purposes, you don't need to promote them.
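<para>
For a standby that should be promoted, either of the following (the
paths are illustrative) triggers failover:
<programlisting>
pg_ctl promote -D /path/to/standby/data
</programlisting>
or, if <varname>trigger_file</> is set in
<filename>recovery.conf</>:
<programlisting>
touch /path/to/trigger_file
</programlisting>
</para>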
1224 <sect1 id="log-shipping-alternative">
1225 <title>Alternative Method for Log Shipping</title>
1228 An alternative to the built-in standby mode described in the previous
1229 sections is to use a <varname>restore_command</> that polls the archive location.
1230 This was the only option available in versions 8.4 and below. In this
1231 setup, set <varname>standby_mode</> off, because you are implementing
1232 the polling required for standby operation yourself. See the
1233 <xref linkend="pgstandby"> module for a reference
1234 implementation of this.
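<para>
A sketch of such a <filename>recovery.conf</>, using
<application>pg_standby</>; the archive path and trigger file
location are illustrative:
<programlisting>
restore_command = 'pg_standby -t /tmp/pgsql.trigger.5442 /path/to/archive %f %p %r'
</programlisting>
</para>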
1238 Note that in this mode, the server will apply WAL one file at a
1239 time, so if you use the standby server for queries (see Hot Standby),
1240 there is a delay between an action in the master and when the
action becomes visible in the standby, corresponding to the time it takes
1242 to fill up the WAL file. <varname>archive_timeout</> can be used to make that delay
shorter. Also note that you can't combine streaming replication with
this method.
1248 The operations that occur on both primary and standby servers are
1249 normal continuous archiving and recovery tasks. The only point of
1250 contact between the two database servers is the archive of WAL files
1251 that both share: primary writing to the archive, standby reading from
1252 the archive. Care must be taken to ensure that WAL archives from separate
1253 primary servers do not become mixed together or confused. The archive
1254 need not be large if it is only required for standby operation.
1258 The magic that makes the two loosely coupled servers work together is
1259 simply a <varname>restore_command</> used on the standby that,
1260 when asked for the next WAL file, waits for it to become available from
1261 the primary. The <varname>restore_command</> is specified in the
1262 <filename>recovery.conf</> file on the standby server. Normal recovery
1263 processing would request a file from the WAL archive, reporting failure
1264 if the file was unavailable. For standby processing it is normal for
1265 the next WAL file to be unavailable, so the standby must wait for
1266 it to appear. For files ending in <literal>.backup</> or
1267 <literal>.history</> there is no need to wait, and a non-zero return
1268 code must be returned. A waiting <varname>restore_command</> can be
1269 written as a custom script that loops after polling for the existence of
1270 the next WAL file. There must also be some way to trigger failover, which
1271 should interrupt the <varname>restore_command</>, break the loop and
1272 return a file-not-found error to the standby server. This ends recovery
1273 and the standby will then come up as a normal server.
1277 Pseudocode for a suitable <varname>restore_command</> is:
triggered = false;
while (!NextWALFileReady() && !triggered)
{
    sleep(100000L);        /* wait for ~0.1 sec */
    if (CheckForExternalTrigger())
        triggered = true;
}
if (!triggered)
    CopyWALFileForRecovery();
1292 A working example of a waiting <varname>restore_command</> is provided
1293 in the <xref linkend="pgstandby"> module. It
1294 should be used as a reference on how to correctly implement the logic
1295 described above. It can also be extended as needed to support specific
1296 configurations and environments.
1300 The method for triggering failover is an important part of planning
and design. One potential option is the <varname>restore_command</>.
It is executed once for each WAL file, but the process
1303 running the <varname>restore_command</> is created and dies for
1304 each file, so there is no daemon or server process, and
1305 signals or a signal handler cannot be used. Therefore, the
1306 <varname>restore_command</> is not suitable to trigger failover.
1307 It is possible to use a simple timeout facility, especially if
1308 used in conjunction with a known <varname>archive_timeout</>
1309 setting on the primary. However, this is somewhat error prone
1310 since a network problem or busy primary server might be sufficient
1311 to initiate failover. A notification mechanism such as the explicit
1312 creation of a trigger file is ideal, if this can be arranged.
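<para>
With the <application>pg_standby</> example above, failover would be
triggered simply by creating the agreed-upon file:
<programlisting>
touch /tmp/pgsql.trigger.5442
</programlisting>
</para>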
1315 <sect2 id="warm-standby-config">
1316 <title>Implementation</title>
1319 The short procedure for configuring a standby server using this alternative
1320 method is as follows. For
1321 full details of each step, refer to previous sections as noted.
1325 Set up primary and standby systems as nearly identical as
1326 possible, including two identical copies of
1327 <productname>PostgreSQL</> at the same release level.
1332 Set up continuous archiving from the primary to a WAL archive
1333 directory on the standby server. Ensure that
1334 <xref linkend="guc-archive-mode">,
1335 <xref linkend="guc-archive-command"> and
1336 <xref linkend="guc-archive-timeout">
1337 are set appropriately on the primary
1338 (see <xref linkend="backup-archiving-wal">).
1343 Make a base backup of the primary server (see <xref
1344 linkend="backup-base-backup">), and load this data onto the standby.
1349 Begin recovery on the standby server from the local WAL
1350 archive, using a <filename>recovery.conf</> that specifies a
1351 <varname>restore_command</> that waits as described
1352 previously (see <xref linkend="backup-pitr-recovery">).
1359 Recovery treats the WAL archive as read-only, so once a WAL file has
1360 been copied to the standby system it can be copied to tape at the same
1361 time as it is being read by the standby database server.
1362 Thus, running a standby server for high availability can be performed at
the same time as files are stored for longer term disaster recovery
purposes.
1368 For testing purposes, it is possible to run both primary and standby
1369 servers on the same system. This does not provide any worthwhile
1370 improvement in server robustness, nor would it be described as HA.
1374 <sect2 id="warm-standby-record">
1375 <title>Record-based Log Shipping</title>
1378 It is also possible to implement record-based log shipping using this
1379 alternative method, though this requires custom development, and changes
1380 will still only become visible to hot standby queries after a full WAL
1381 file has been shipped.
1385 An external program can call the <function>pg_xlogfile_name_offset()</>
1386 function (see <xref linkend="functions-admin">)
1387 to find out the file name and the exact byte offset within it of
1388 the current end of WAL. It can then access the WAL file directly
1389 and copy the data from the last known end of WAL through the current end
1390 over to the standby servers. With this approach, the window for data
1391 loss is the polling cycle time of the copying program, which can be very
1392 small, and there is no wasted bandwidth from forcing partially-used
1393 segment files to be archived. Note that the standby servers'
1394 <varname>restore_command</> scripts can only deal with whole WAL files,
1395 so the incrementally copied data is not ordinarily made available to
1396 the standby servers. It is of use only when the primary dies —
1397 then the last partial WAL file is fed to the standby before allowing
1398 it to come up. The correct implementation of this process requires
cooperation of the <varname>restore_command</> script with the data
copying program.
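<para>
For example, the copying program could poll with:
<programlisting>
SELECT * FROM pg_xlogfile_name_offset(pg_current_xlog_location());
</programlisting>
which returns the current WAL file name and the byte offset of the
end of WAL within it.
</para>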
1404 Starting with <productname>PostgreSQL</> version 9.0, you can use
1405 streaming replication (see <xref linkend="streaming-replication">) to
1406 achieve the same benefits with less effort.
1411 <sect1 id="hot-standby">
1412 <title>Hot Standby</title>
1414 <indexterm zone="high-availability">
1415 <primary>Hot Standby</primary>
1419 Hot Standby is the term used to describe the ability to connect to
1420 the server and run read-only queries while the server is in archive
1421 recovery or standby mode. This
1422 is useful both for replication purposes and for restoring a backup
1423 to a desired state with great precision.
1424 The term Hot Standby also refers to the ability of the server to move
1425 from recovery through to normal operation while users continue running
1426 queries and/or keep their connections open.
1430 Running queries in hot standby mode is similar to normal query operation,
though there are several usage and administrative differences
explained below.
1435 <sect2 id="hot-standby-users">
1436 <title>User's Overview</title>
1439 When the <xref linkend="guc-hot-standby"> parameter is set to true on a
1440 standby server, it will begin accepting connections once the recovery has
1441 brought the system to a consistent state. All such connections are
1442 strictly read-only; not even temporary tables may be written.
1446 The data on the standby takes some time to arrive from the primary server
1447 so there will be a measurable delay between primary and standby. Running the
1448 same query nearly simultaneously on both primary and standby might therefore
1449 return differing results. We say that data on the standby is
1450 <firstterm>eventually consistent</firstterm> with the primary. Once the
1451 commit record for a transaction is replayed on the standby, the changes
1452 made by that transaction will be visible to any new snapshots taken on
1453 the standby. Snapshots may be taken at the start of each query or at the
1454 start of each transaction, depending on the current transaction isolation
1455 level. For more details, see <xref linkend="transaction-iso">.
1459 Transactions started during hot standby may issue the following commands:
1464 Query access - <command>SELECT</>, <command>COPY TO</>
1469 Cursor commands - <command>DECLARE</>, <command>FETCH</>, <command>CLOSE</>
1474 Parameters - <command>SHOW</>, <command>SET</>, <command>RESET</>
1479 Transaction management commands
1483 <command>BEGIN</>, <command>END</>, <command>ABORT</>, <command>START TRANSACTION</>
1488 <command>SAVEPOINT</>, <command>RELEASE</>, <command>ROLLBACK TO SAVEPOINT</>
1493 <command>EXCEPTION</> blocks and other internal subtransactions
1501 <command>LOCK TABLE</>, though only when explicitly in one of these modes:
1502 <literal>ACCESS SHARE</>, <literal>ROW SHARE</> or <literal>ROW EXCLUSIVE</>.
1507 Plans and resources - <command>PREPARE</>, <command>EXECUTE</>,
1508 <command>DEALLOCATE</>, <command>DISCARD</>
1513 Plugins and extensions - <command>LOAD</>
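For instance, the following read-only session uses only commands from the list
above (the <structname>accounts</> table is hypothetical):
<programlisting>
BEGIN;
LOCK TABLE accounts IN ACCESS SHARE MODE;       -- one of the permitted lock modes
DECLARE cur CURSOR FOR SELECT * FROM accounts;
FETCH 10 FROM cur;
CLOSE cur;
COMMIT;
</programlisting>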
1520 Transactions started during hot standby will never be assigned a
1521 transaction ID and cannot write to the system write-ahead log.
Therefore, the following actions will produce error messages (an example of such an error follows the list):
1527 Data Manipulation Language (DML) - <command>INSERT</>,
1528 <command>UPDATE</>, <command>DELETE</>, <command>COPY FROM</>,
1529 <command>TRUNCATE</>.
1530 Note that there are no allowed actions that result in a trigger
1531 being executed during recovery. This restriction applies even to
1532 temporary tables, because table rows cannot be read or written without
1533 assigning a transaction ID, which is currently not possible in a
1534 Hot Standby environment.
1539 Data Definition Language (DDL) - <command>CREATE</>,
1540 <command>DROP</>, <command>ALTER</>, <command>COMMENT</>.
1541 This restriction applies even to temporary tables, because carrying
1542 out these operations would require updating the system catalog tables.
1547 <command>SELECT ... FOR SHARE | UPDATE</>, because row locks cannot be
1548 taken without updating the underlying data files.
1553 Rules on <command>SELECT</> statements that generate DML commands.
1558 <command>LOCK</> that explicitly requests a mode higher than <literal>ROW EXCLUSIVE MODE</>.
1563 <command>LOCK</> in short default form, since it requests <literal>ACCESS EXCLUSIVE MODE</>.
1568 Transaction management commands that explicitly set non-read-only state:
1572 <command>BEGIN READ WRITE</>,
1573 <command>START TRANSACTION READ WRITE</>
1578 <command>SET TRANSACTION READ WRITE</>,
1579 <command>SET SESSION CHARACTERISTICS AS TRANSACTION READ WRITE</>
1584 <command>SET transaction_read_only = off</>
1592 Two-phase commit commands - <command>PREPARE TRANSACTION</>,
1593 <command>COMMIT PREPARED</>, <command>ROLLBACK PREPARED</>
1594 because even read-only transactions need to write WAL in the
prepare phase (the first phase of two-phase commit).
1600 Sequence updates - <function>nextval()</>, <function>setval()</>
1605 <command>LISTEN</>, <command>UNLISTEN</>, <command>NOTIFY</>
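For example, any attempt at DML fails immediately (again using a hypothetical
<structname>accounts</> table):
<programlisting>
BEGIN;
INSERT INTO accounts VALUES (1);
ERROR:  cannot execute INSERT in a read-only transaction
</programlisting>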
1612 In normal operation, <quote>read-only</> transactions are allowed to
1613 update sequences and to use <command>LISTEN</>, <command>UNLISTEN</>, and
1614 <command>NOTIFY</>, so Hot Standby sessions operate under slightly tighter
1615 restrictions than ordinary read-only sessions. It is possible that some
1616 of these restrictions might be loosened in a future release.
1620 During hot standby, the parameter <varname>transaction_read_only</> is always
1621 true and may not be changed. But as long as no attempt is made to modify
1622 the database, connections during hot standby will act much like any other
1623 database connection. If failover or switchover occurs, the database will
1624 switch to normal processing mode. Sessions will remain connected while the
1625 server changes mode. Once hot standby finishes, it will be possible to
initiate read-write transactions (even from a session begun during hot standby).
1631 Users will be able to tell whether their session is read-only by
1632 issuing <command>SHOW transaction_read_only</>. In addition, a set of
1633 functions (<xref linkend="functions-recovery-info-table">) allow users to
1634 access information about the standby server. These allow you to write
1635 programs that are aware of the current state of the database. These
1636 can be used to monitor the progress of recovery, or to allow you to
1637 write complex programs that restore the database to particular states.
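For example (<function>pg_is_in_recovery()</> is one of the functions in that
table):
<programlisting>
SHOW transaction_read_only;     -- shows 'on' during hot standby
SELECT pg_is_in_recovery();     -- returns true while the server is in recovery
</programlisting>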
1641 <sect2 id="hot-standby-conflict">
1642 <title>Handling Query Conflicts</title>
1645 The primary and standby servers are in many ways loosely connected. Actions
1646 on the primary will have an effect on the standby. As a result, there is
1647 potential for negative interactions or conflicts between them. The easiest
1648 conflict to understand is performance: if a huge data load is taking place
1649 on the primary then this will generate a similar stream of WAL records on the
1650 standby, so standby queries may contend for system resources, such as I/O.
1654 There are also additional types of conflict that can occur with Hot Standby.
1655 These conflicts are <emphasis>hard conflicts</> in the sense that queries
1656 might need to be canceled and, in some cases, sessions disconnected to resolve them.
1657 The user is provided with several ways to handle these
1658 conflicts. Conflict cases include:
1663 Access Exclusive locks taken on the primary server, including both
1664 explicit <command>LOCK</> commands and various <acronym>DDL</>
1665 actions, conflict with table accesses in standby queries.
1670 Dropping a tablespace on the primary conflicts with standby queries
1671 using that tablespace for temporary work files.
1676 Dropping a database on the primary conflicts with sessions connected
1677 to that database on the standby.
1682 Application of a vacuum cleanup record from WAL conflicts with
1683 standby transactions whose snapshots can still <quote>see</> any of
1684 the rows to be removed.
1689 Application of a vacuum cleanup record from WAL conflicts with
1690 queries accessing the target page on the standby, whether or not
1691 the data to be removed is visible.
On the primary server, these cases simply result in waiting, and the
user might choose to cancel either of the conflicting actions. However,
1700 on the standby there is no choice: the WAL-logged action already occurred
1701 on the primary so the standby must not fail to apply it. Furthermore,
1702 allowing WAL application to wait indefinitely may be very undesirable,
1703 because the standby's state will become increasingly far behind the
1704 primary's. Therefore, a mechanism is provided to forcibly cancel standby
1705 queries that conflict with to-be-applied WAL records.
1709 An example of the problem situation is an administrator on the primary
1710 server running <command>DROP TABLE</> on a table that is currently being
1711 queried on the standby server. Clearly the standby query cannot continue
1712 if the <command>DROP TABLE</> is applied on the standby. If this situation
1713 occurred on the primary, the <command>DROP TABLE</> would wait until the
1714 other query had finished. But when <command>DROP TABLE</> is run on the
1715 primary, the primary doesn't have information about what queries are
1716 running on the standby, so it will not wait for any such standby
1717 queries. The WAL change records come through to the standby while the
1718 standby query is still running, causing a conflict. The standby server
1719 must either delay application of the WAL records (and everything after
1720 them, too) or else cancel the conflicting query so that the <command>DROP
1721 TABLE</> can be applied.
1725 When a conflicting query is short, it's typically desirable to allow it to
1726 complete by delaying WAL application for a little bit; but a long delay in
1727 WAL application is usually not desirable. So the cancel mechanism has
1728 parameters, <xref linkend="guc-max-standby-archive-delay"> and <xref
1729 linkend="guc-max-standby-streaming-delay">, that define the maximum
1730 allowed delay in WAL application. Conflicting queries will be canceled
1731 once it has taken longer than the relevant delay setting to apply any
1732 newly-received WAL data. There are two parameters so that different delay
1733 values can be specified for the case of reading WAL data from an archive
1734 (i.e., initial recovery from a base backup or <quote>catching up</> a
1735 standby server that has fallen far behind) versus reading WAL data via
1736 streaming replication.
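As an illustration, a standby's <filename>postgresql.conf</> might set both
limits as follows (the values are examples only, not recommendations):
<programlisting>
max_standby_archive_delay = 30s      # limit when reading WAL from an archive
max_standby_streaming_delay = 30s    # limit when reading WAL via streaming
</programlisting>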
1740 In a standby server that exists primarily for high availability, it's
1741 best to set the delay parameters relatively short, so that the server
1742 cannot fall far behind the primary due to delays caused by standby
1743 queries. However, if the standby server is meant for executing
1744 long-running queries, then a high or even infinite delay value may be
1745 preferable. Keep in mind however that a long-running query could
1746 cause other sessions on the standby server to not see recent changes
1747 on the primary, if it delays application of WAL records.
1751 Once the delay specified by <varname>max_standby_archive_delay</> or
1752 <varname>max_standby_streaming_delay</> has been exceeded, conflicting
1753 queries will be canceled. This usually results just in a cancellation
1754 error, although in the case of replaying a <command>DROP DATABASE</>
1755 the entire conflicting session will be terminated. Also, if the conflict
1756 is over a lock held by an idle transaction, the conflicting session is
1757 terminated (this behavior might change in the future).
1761 Canceled queries may be retried immediately (after beginning a new
1762 transaction, of course). Since query cancellation depends on
1763 the nature of the WAL records being replayed, a query that was
1764 canceled may well succeed if it is executed again.
1768 Keep in mind that the delay parameters are compared to the elapsed time
1769 since the WAL data was received by the standby server. Thus, the grace
1770 period allowed to any one query on the standby is never more than the
1771 delay parameter, and could be considerably less if the standby has already
1772 fallen behind as a result of waiting for previous queries to complete, or
1773 as a result of being unable to keep up with a heavy update load.
1777 The most common reason for conflict between standby queries and WAL replay
1778 is <quote>early cleanup</>. Normally, <productname>PostgreSQL</> allows
1779 cleanup of old row versions when there are no transactions that need to
1780 see them to ensure correct visibility of data according to MVCC rules.
1781 However, this rule can only be applied for transactions executing on the
1782 master. So it is possible that cleanup on the master will remove row
1783 versions that are still visible to a transaction on the standby.
1787 Experienced users should note that both row version cleanup and row version
1788 freezing will potentially conflict with standby queries. Running a manual
1789 <command>VACUUM FREEZE</> is likely to cause conflicts even on tables with
1790 no updated or deleted rows.
1794 Users should be clear that tables that are regularly and heavily updated
1795 on the primary server will quickly cause cancellation of longer running
1796 queries on the standby. In such cases the setting of a finite value for
1797 <varname>max_standby_archive_delay</> or
1798 <varname>max_standby_streaming_delay</> can be considered similar to
1799 setting <varname>statement_timeout</>.
1803 Remedial possibilities exist if the number of standby-query cancellations
1804 is found to be unacceptable. The first option is to set the parameter
1805 <varname>hot_standby_feedback</>, which prevents <command>VACUUM</> from
1806 removing recently-dead rows and so cleanup conflicts do not occur.
If you do this, you should note that this will delay cleanup of dead rows on the primary,
1809 which may result in undesirable table bloat. However, the cleanup
1810 situation will be no worse than if the standby queries were running
1811 directly on the primary server, and you are still getting the benefit of
1812 off-loading execution onto the standby.
1813 <varname>max_standby_archive_delay</> must be kept large in this case,
1814 because delayed WAL files might already contain entries that conflict with
1815 the desired standby queries.
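A minimal sketch of this first option, in the standby's
<filename>postgresql.conf</> (the parameter takes effect on configuration
reload):
<programlisting>
hot_standby_feedback = on    # standby reports its oldest snapshot to the primary
</programlisting>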
1819 Another option is to increase <xref linkend="guc-vacuum-defer-cleanup-age">
1820 on the primary server, so that dead rows will not be cleaned up as quickly
1821 as they normally would be. This will allow more time for queries to
1822 execute before they are canceled on the standby, without having to set
1823 a high <varname>max_standby_streaming_delay</>. However it is
1824 difficult to guarantee any specific execution-time window with this
1825 approach, since <varname>vacuum_defer_cleanup_age</> is measured in
1826 transactions executed on the primary server.
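For example, in the primary's <filename>postgresql.conf</> (an illustrative
value; remember that the unit is transactions, not time):
<programlisting>
vacuum_defer_cleanup_age = 10000   # defer cleanup by roughly 10000 transactions
</programlisting>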
1830 The number of query cancels and the reason for them can be viewed using
1831 the <structname>pg_stat_database_conflicts</> system view on the standby
1832 server. The <structname>pg_stat_database</> system view also contains
1833 summary information.
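For example, to see the per-database cancellation counts broken down by cause:
<programlisting>
-- Run on the standby
SELECT datname, confl_tablespace, confl_lock, confl_snapshot,
       confl_bufferpin, confl_deadlock
  FROM pg_stat_database_conflicts;
</programlisting>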
1837 <sect2 id="hot-standby-admin">
1838 <title>Administrator's Overview</title>
1841 If <varname>hot_standby</> is turned <literal>on</> in
1842 <filename>postgresql.conf</> and there is a <filename>recovery.conf</>
1843 file present, the server will run in Hot Standby mode.
1844 However, it may take some time for Hot Standby connections to be allowed,
1845 because the server will not accept connections until it has completed
1846 sufficient recovery to provide a consistent state against which queries
1847 can run. During this period,
1848 clients that attempt to connect will be refused with an error message.
1849 To confirm the server has come up, either loop trying to connect from
1850 the application, or look for these messages in the server logs:
1853 LOG: entering standby mode
1855 ... then some time later ...
1857 LOG: consistent recovery state reached
1858 LOG: database system is ready to accept read only connections
1861 Consistency information is recorded once per checkpoint on the primary.
1862 It is not possible to enable hot standby when reading WAL
1863 written during a period when <varname>wal_level</> was not set to
1864 <literal>hot_standby</> on the primary. Reaching a consistent state can
1865 also be delayed in the presence of both of these conditions:
1870 A write transaction has more than 64 subtransactions
1875 Very long-lived write transactions
1880 If you are running file-based log shipping ("warm standby"), you might need
1881 to wait until the next WAL file arrives, which could be as long as the
1882 <varname>archive_timeout</> setting on the primary.
1886 The setting of some parameters on the standby will need reconfiguration
1887 if they have been changed on the primary. For these parameters,
1888 the value on the standby must
1889 be equal to or greater than the value on the primary. If these parameters
1890 are not set high enough then the standby will refuse to start.
1891 Higher values can then be supplied and the server
restarted to begin recovery again. These parameters are as follows (a query for inspecting them appears after the list):
1897 <varname>max_connections</>
1902 <varname>max_prepared_transactions</>
1907 <varname>max_locks_per_transaction</>
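A quick way to compare the settings is to run the same query on both servers,
for example:
<programlisting>
SELECT name, setting
  FROM pg_settings
 WHERE name IN ('max_connections',
                'max_prepared_transactions',
                'max_locks_per_transaction');
</programlisting>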
1914 It is important that the administrator select appropriate settings for
1915 <xref linkend="guc-max-standby-archive-delay"> and <xref
1916 linkend="guc-max-standby-streaming-delay">. The best choices vary
1917 depending on business priorities. For example if the server is primarily
1918 tasked as a High Availability server, then you will want low delay
1919 settings, perhaps even zero, though that is a very aggressive setting. If
1920 the standby server is tasked as an additional server for decision support
1921 queries then it might be acceptable to set the maximum delay values to
1922 many hours, or even -1 which means wait forever for queries to complete.
Transaction status "hint bits" written on the primary are not WAL-logged,
so the standby will likely set those hint bits again itself.
1928 Thus, the standby server will still perform disk writes even though
1929 all users are read-only; no changes occur to the data values
1930 themselves. Users will still write large sort temporary files and
1931 re-generate relcache info files, so no part of the database
1932 is truly read-only during hot standby mode.
Note also that writes to remote databases using the
<application>dblink</application> module, and other operations outside the
database using PL functions, will still be possible, even though the
transaction is read-only locally.
1940 The following types of administration commands are not accepted
1941 during recovery mode:
1946 Data Definition Language (DDL) - e.g. <command>CREATE INDEX</>
1951 Privilege and Ownership - <command>GRANT</>, <command>REVOKE</>,
1952 <command>REASSIGN</>
1957 Maintenance commands - <command>ANALYZE</>, <command>VACUUM</>,
1958 <command>CLUSTER</>, <command>REINDEX</>
1965 Again, note that some of these commands are actually allowed during
1966 "read only" mode transactions on the primary.
1970 As a result, you cannot create additional indexes that exist solely
1971 on the standby, nor statistics that exist solely on the standby.
1972 If these administration commands are needed, they should be executed
on the primary, and eventually those changes will propagate to the standby.
1978 <function>pg_cancel_backend()</>
1979 and <function>pg_terminate_backend()</> will work on user backends,
1980 but not the Startup process, which performs
1981 recovery. <structname>pg_stat_activity</structname> does not show an
1982 entry for the Startup process, nor do recovering transactions show
1983 as active. As a result, <structname>pg_prepared_xacts</structname>
1984 is always empty during recovery. If you wish to resolve in-doubt
1985 prepared transactions, view <literal>pg_prepared_xacts</> on the
1986 primary and issue commands to resolve transactions there.
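For example, either of these can be issued from another standby session
(12345 stands in for the target backend's process ID):
<programlisting>
SELECT pg_cancel_backend(12345);      -- cancel the backend's current query
SELECT pg_terminate_backend(12345);   -- disconnect the backend entirely
</programlisting>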
1990 <structname>pg_locks</structname> will show locks held by backends,
1991 as normal. <structname>pg_locks</structname> also shows
1992 a virtual transaction managed by the Startup process that owns all
1993 <literal>AccessExclusiveLocks</> held by transactions being replayed by recovery.
1994 Note that the Startup process does not acquire locks to
1995 make database changes, and thus locks other than <literal>AccessExclusiveLocks</>
1996 do not show in <structname>pg_locks</structname> for the Startup
1997 process; they are just presumed to exist.
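For example, to see which <literal>AccessExclusiveLocks</> recovery is
currently enforcing:
<programlisting>
-- Run on the standby; the Startup process's locks appear under a
-- virtual transaction rather than an ordinary transaction ID
SELECT locktype, relation, mode, virtualtransaction
  FROM pg_locks
 WHERE mode = 'AccessExclusiveLock';
</programlisting>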
2001 The <productname>Nagios</> plugin <productname>check_pgsql</> will
2002 work, because the simple information it checks for exists.
2003 The <productname>check_postgres</> monitoring script will also work,
2004 though some reported values could give different or confusing results.
2005 For example, last vacuum time will not be maintained, since no
2006 vacuum occurs on the standby. Vacuums running on the primary
2007 do still send their changes to the standby.
WAL file control commands will not work during recovery,
e.g., <function>pg_start_backup</>, <function>pg_switch_xlog</>.
2016 Dynamically loadable modules work, including <structname>pg_stat_statements</>.
2020 Advisory locks work normally in recovery, including deadlock detection.
2021 Note that advisory locks are never WAL logged, so it is impossible for
2022 an advisory lock on either the primary or the standby to conflict with WAL
2023 replay. Nor is it possible to acquire an advisory lock on the primary
2024 and have it initiate a similar advisory lock on the standby. Advisory
2025 locks relate only to the server on which they are acquired.
2029 Trigger-based replication systems such as <productname>Slony</>,
2030 <productname>Londiste</> and <productname>Bucardo</> won't run on the
2031 standby at all, though they will run happily on the primary server as
2032 long as the changes are not sent to standby servers to be applied.
2033 WAL replay is not trigger-based so you cannot relay from the
2034 standby to any system that requires additional database writes or
2035 relies on the use of triggers.
2039 New OIDs cannot be assigned, though some <acronym>UUID</> generators may still
2040 work as long as they do not rely on writing new status to the database.
2044 Currently, temporary table creation is not allowed during read only
2045 transactions, so in some cases existing scripts will not run correctly.
2046 This restriction might be relaxed in a later release. This is
2047 both a SQL Standard compliance issue and a technical issue.
2051 <command>DROP TABLESPACE</> can only succeed if the tablespace is empty.
2052 Some standby users may be actively using the tablespace via their
2053 <varname>temp_tablespaces</> parameter. If there are temporary files in the
2054 tablespace, all active queries are canceled to ensure that temporary
files are removed, so the tablespace can be removed and WAL replay can continue.
2060 Running <command>DROP DATABASE</> or <command>ALTER DATABASE ... SET
2061 TABLESPACE</> on the primary
2062 will generate a WAL entry that will cause all users connected to that
2063 database on the standby to be forcibly disconnected. This action occurs
2064 immediately, whatever the setting of
2065 <varname>max_standby_streaming_delay</>. Note that
<command>ALTER DATABASE ... RENAME</> does not disconnect users, which
in most cases will go unnoticed, though it might in some cases confuse a
program that depends in some way upon the database name.
2072 In normal (non-recovery) mode, if you issue <command>DROP USER</> or <command>DROP ROLE</>
2073 for a role with login capability while that user is still connected then
nothing happens to the connected user; they remain connected. The user cannot
reconnect, however. This behavior applies in recovery also, so a
2076 <command>DROP USER</> on the primary does not disconnect that user on the standby.
2080 The statistics collector is active during recovery. All scans, reads, blocks,
2081 index usage, etc., will be recorded normally on the standby. Replayed
actions will not duplicate their effects on the primary, so replaying an
insert will not increment the Inserts column of <structname>pg_stat_user_tables</>.
2084 The stats file is deleted at the start of recovery, so stats from primary
2085 and standby will differ; this is considered a feature, not a bug.
Autovacuum is not active during recovery. It will start normally at the end of recovery.
2094 The background writer is active during recovery and will perform
2095 restartpoints (similar to checkpoints on the primary) and normal block
2096 cleaning activities. This can include updates of the hint bit
2097 information stored on the standby server.
2098 The <command>CHECKPOINT</> command is accepted during recovery,
2099 though it performs a restartpoint rather than a new checkpoint.
2103 <sect2 id="hot-standby-parameters">
2104 <title>Hot Standby Parameter Reference</title>
2107 Various parameters have been mentioned above in
2108 <xref linkend="hot-standby-conflict"> and
2109 <xref linkend="hot-standby-admin">.
2113 On the primary, parameters <xref linkend="guc-wal-level"> and
2114 <xref linkend="guc-vacuum-defer-cleanup-age"> can be used.
2115 <xref linkend="guc-max-standby-archive-delay"> and
<xref linkend="guc-max-standby-streaming-delay"> have no effect if set on the primary.
2121 On the standby, parameters <xref linkend="guc-hot-standby">,
2122 <xref linkend="guc-max-standby-archive-delay"> and
2123 <xref linkend="guc-max-standby-streaming-delay"> can be used.
2124 <xref linkend="guc-vacuum-defer-cleanup-age"> has no effect
2125 as long as the server remains in standby mode, though it will
2126 become relevant if the standby becomes primary.
2130 <sect2 id="hot-standby-caveats">
2131 <title>Caveats</title>
2134 There are several limitations of Hot Standby.
2135 These can and probably will be fixed in future releases:
2140 Operations on hash indexes are not presently WAL-logged, so
2141 replay will not update these indexes.
2146 Full knowledge of running transactions is required before snapshots
2147 can be taken. Transactions that use large numbers of subtransactions
2148 (currently greater than 64) will delay the start of read only
2149 connections until the completion of the longest running write transaction.
2150 If this situation occurs, explanatory messages will be sent to the server log.
2155 Valid starting points for standby queries are generated at each
2156 checkpoint on the master. If the standby is shut down while the master
2157 is in a shutdown state, it might not be possible to re-enter Hot Standby
2158 until the primary is started up, so that it generates further starting
points in the WAL logs. This is not a problem in the most common
situations where it might happen. Generally, if the primary is
shut down and no longer available, that is likely due to a serious
failure that requires the standby to be converted to operate as
the new primary anyway. And in situations where the primary is
being intentionally taken down, coordinating to make sure the standby
becomes the new primary smoothly is also standard procedure.
2170 At the end of recovery, <literal>AccessExclusiveLocks</> held by prepared transactions
2171 will require twice the normal number of lock table entries. If you plan
2172 on running either a large number of concurrent prepared transactions
2173 that normally take <literal>AccessExclusiveLocks</>, or you plan on having one
2174 large transaction that takes many <literal>AccessExclusiveLocks</>, you are
2175 advised to select a larger value of <varname>max_locks_per_transaction</>,
perhaps as much as twice the value of the parameter on the primary server,
as illustrated below. You need not consider this at all if
2178 your setting of <varname>max_prepared_transactions</> is 0.
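For instance, if the primary runs with the default
<varname>max_locks_per_transaction</> of 64, the standby might be configured
as follows (an illustrative value only; this parameter can only be set at
server start):
<programlisting>
max_locks_per_transaction = 128   # roughly twice the primary's setting
</programlisting>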
2183 The Serializable transaction isolation level is not yet available in hot
2184 standby. (See <xref linkend="xact-serializable"> and
2185 <xref linkend="serializable-consistency"> for details.)
2186 An attempt to set a transaction to the serializable isolation level in
2187 hot standby mode will generate an error.