-<!-- $PostgreSQL: pgsql/doc/src/sgml/protocol.sgml,v 1.83 2010/02/22 18:12:04 momjian Exp $ -->
+<!-- doc/src/sgml/protocol.sgml -->
<chapter id="protocol">
<title>Frontend/Backend Protocol</title>
</para>
<para>
- The CopyInResponse and CopyOutResponse messages include fields that
- inform the frontend of the number of columns per row and the format
- codes being used for each column. (As of the present implementation,
- all columns in a given <command>COPY</> operation will use the same
- format, but the message design does not assume this.)
+ There is another Copy-related mode called Copy-both, which allows
+ high-speed bulk data transfer to <emphasis>and</> from the server.
+ Copy-both mode is initiated when a backend in walsender mode
+ executes a <command>START_REPLICATION</command> statement. The
+ backend sends a CopyBothResponse message to the frontend. Both
+ the backend and the frontend may then send CopyData messages
+ until the connection is terminated. See <xref
+ linkend="protocol-replication">.
</para>
+
+ <para>
+ The CopyInResponse, CopyOutResponse and CopyBothResponse messages
+ include fields that inform the frontend of the number of columns
+ per row and the format codes being used for each column. (As of
+ the present implementation, all columns in a given <command>COPY</>
+ operation will use the same format, but the message design does not
+ assume this.)
+ </para>
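+
+ <para>
+ As a rough illustration of these fields, the following Python sketch
+ decodes a CopyBothResponse message. The function and class names are
+ hypothetical, and the code assumes that integers are sent in network
+ byte order (big-endian) and that the length field counts itself but
+ not the leading type byte.
+ </para>

```python
import struct
from typing import List, NamedTuple


class CopyBothResponse(NamedTuple):
    overall_format: int        # 0 = textual, 1 = binary
    column_formats: List[int]  # one format code per column


def parse_copy_both_response(message: bytes) -> CopyBothResponse:
    """Parse one complete CopyBothResponse message, type byte included."""
    if message[:1] != b"W":
        raise ValueError("not a CopyBothResponse message")
    # Int32 length, Int8 overall format, Int16 column count N.
    # Assumption: integers are big-endian (network byte order).
    length, overall_format, ncols = struct.unpack_from("!ibh", message, 1)
    if length != len(message) - 1:
        raise ValueError("length field does not match message size")
    # The N Int16 format codes follow the 8 bytes read so far.
    column_formats = list(struct.unpack_from(f"!{ncols}h", message, 8))
    return CopyBothResponse(overall_format, column_formats)
```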
+
</sect2>
<sect2 id="protocol-async">
</sect2>
</sect1>
+<sect1 id="protocol-replication">
+<title>Streaming Replication Protocol</title>
+
+<para>
+To initiate streaming replication, the frontend sends the
+<literal>replication</> parameter in the startup message. This tells the
+backend to go into walsender mode, wherein a small set of replication commands
+can be issued instead of SQL statements. Only the simple query protocol can be
+used in walsender mode.
+
+The commands accepted in walsender mode are:
+
+<variablelist>
+ <varlistentry>
+ <term>IDENTIFY_SYSTEM</term>
+ <listitem>
+ <para>
+ Requests the server to identify itself. The server replies with a
+ result set consisting of a single row with two fields:
+ </para>
+
+ <para>
+ <variablelist>
+ <varlistentry>
+ <term>
+ systemid
+ </term>
+ <listitem>
+ <para>
+ The unique system identifier identifying the cluster. This
+ can be used to check that the base backup used to initialize the
+ standby came from the same cluster.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term>
+ timeline
+ </term>
+ <listitem>
+ <para>
+ Current TimelineID. Also useful to check that the standby is
+ consistent with the master.
+ </para>
+ </listitem>
+ </varlistentry>
+ </variablelist>
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term>START_REPLICATION <replaceable>XXX</>/<replaceable>XXX</></term>
+ <listitem>
+ <para>
+ Instructs the server to start streaming WAL, starting at
+ WAL position <replaceable>XXX</>/<replaceable>XXX</>.
+ The server can reply with an error, e.g., if the requested section of WAL
+ has already been recycled. On success, the server responds with a
+ CopyBothResponse message, and then starts to stream WAL to the frontend.
+ WAL will continue to be streamed until the connection is broken;
+ no further commands will be accepted.
+ </para>
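+
+ <para>
+ A minimal sketch of building the command text in Python follows. The
+ function name is hypothetical. The code assumes that each half of the
+ WAL position is a 32-bit value written as a hexadecimal number, which
+ is how WAL positions are conventionally displayed; the text above
+ does not itself spell out the number format.
+ </para>

```python
def start_replication_command(log_id: int, byte_offset: int) -> str:
    """Build a START_REPLICATION command for the given WAL position."""
    for part in (log_id, byte_offset):
        if not 0 <= part <= 0xFFFFFFFF:
            raise ValueError("each half of the position must fit in 32 bits")
    # Assumption: both halves are written as uppercase hexadecimal numbers.
    return f"START_REPLICATION {log_id:X}/{byte_offset:X}"
```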
+
+ <para>
+ WAL data is sent as a series of CopyData messages. (This allows
+ other information to be intermixed; in particular the server can send
+ an ErrorResponse message if it encounters a failure after beginning
+ to stream.) The payload in each CopyData message follows this format:
+ </para>
+
+ <para>
+ <variablelist>
+ <varlistentry>
+ <term>
+ XLogData (B)
+ </term>
+ <listitem>
+ <para>
+ <variablelist>
+ <varlistentry>
+ <term>
+ Byte1('w')
+ </term>
+ <listitem>
+ <para>
+ Identifies the message as WAL data.
+ </para>
+ </listitem>
+ </varlistentry>
+ <varlistentry>
+ <term>
+ Byte8
+ </term>
+ <listitem>
+ <para>
+ The starting point of the WAL data in this message, given in
+ XLogRecPtr format.
+ </para>
+ </listitem>
+ </varlistentry>
+ <varlistentry>
+ <term>
+ Byte8
+ </term>
+ <listitem>
+ <para>
+ The current end of WAL on the server, given in
+ XLogRecPtr format.
+ </para>
+ </listitem>
+ </varlistentry>
+ <varlistentry>
+ <term>
+ Byte8
+ </term>
+ <listitem>
+ <para>
+ The server's system clock at the time of transmission,
+ given in TimestampTz format.
+ </para>
+ </listitem>
+ </varlistentry>
+ <varlistentry>
+ <term>
+ Byte<replaceable>n</replaceable>
+ </term>
+ <listitem>
+ <para>
+ A section of the WAL data stream.
+ </para>
+ </listitem>
+ </varlistentry>
+ </variablelist>
+ </para>
+ </listitem>
+ </varlistentry>
+ </variablelist>
+ </para>
+ <para>
+ A single WAL record is never split across two CopyData messages.
+ When a WAL record crosses a WAL page boundary, and is therefore
+ already split using continuation records, it can be split at the page
+ boundary. In other words, the first main WAL record and its
+ continuation records can be sent in different CopyData messages.
+ </para>
+ <para>
+ Note that all fields within the WAL data and the above-described header
+ will be in the sending server's native format. Endianness, and the
+ format for the timestamp, are unpredictable unless the receiver has
+ verified that the sender's system identifier matches its own
+ <filename>pg_control</> contents.
+ </para>
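+
+ <para>
+ The layout described above can be sketched in Python as follows. The
+ names are hypothetical. Because the three 8-byte header fields are in
+ the sending server's native format, this sketch deliberately leaves
+ them as raw bytes instead of guessing at a byte order.
+ </para>

```python
from typing import NamedTuple


class XLogData(NamedTuple):
    start_ptr: bytes   # 8 raw bytes, XLogRecPtr format
    end_ptr: bytes     # 8 raw bytes, XLogRecPtr format
    send_time: bytes   # 8 raw bytes, TimestampTz format
    wal_data: bytes    # a section of the WAL data stream


# Byte1('w') followed by three Byte8 fields.
HEADER_SIZE = 1 + 8 + 8 + 8


def split_xlogdata(payload: bytes) -> XLogData:
    """Split the payload of one CopyData message into XLogData fields."""
    if len(payload) < HEADER_SIZE:
        raise ValueError("payload too short for an XLogData header")
    if payload[:1] != b"w":
        raise ValueError("not an XLogData message")
    # The header fields stay uninterpreted: their layout is server-native.
    return XLogData(payload[1:9], payload[9:17], payload[17:25], payload[25:])
```

+ <para>
+ A receiver that has verified the system identifier could then decode
+ the raw fields using its own native layout.
+ </para>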
+ <para>
+ If the WAL sender process is terminated normally (during postmaster
+ shutdown), it will send a CommandComplete message before exiting.
+ This might not happen during an abnormal shutdown, of course.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term>BASE_BACKUP [<literal>LABEL</literal> <replaceable>'label'</replaceable>] [<literal>PROGRESS</literal>] [<literal>FAST</literal>]</term>
+ <listitem>
+ <para>
+ Instructs the server to start streaming a base backup.
+ The system will automatically be put in backup mode before the backup
+ is started, and taken out of it when the backup is complete. The
+ following options are accepted:
+ <variablelist>
+ <varlistentry>
+ <term><literal>LABEL</literal> <replaceable>'label'</replaceable></term>
+ <listitem>
+ <para>
+ Sets the label of the backup. If none is specified, a backup label
+ of <literal>base backup</literal> will be used. The quoting rules
+ for the label are the same as a standard SQL string with
+ <xref linkend="guc-standard-conforming-strings"> turned on.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><literal>PROGRESS</></term>
+ <listitem>
+ <para>
+ Request information required to generate a progress report. This will
+ send back an approximate size in the header of each tablespace, which
+ can be used to calculate how far along the stream is. The size is
+ calculated by enumerating all the file sizes once before the transfer
+ is started, which can hurt performance; in particular, it may take
+ longer before the first data is streamed. Since the database files can
+ change during the backup, the size is only approximate and may grow
+ or shrink between the time of approximation and the sending of the
+ actual files.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><literal>FAST</></term>
+ <listitem>
+ <para>
+ Request a fast checkpoint.
+ </para>
+ </listitem>
+ </varlistentry>
+ </variablelist>
+ </para>
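+
+ <para>
+ As an illustration of the options and the label quoting rule, here is
+ a minimal Python sketch that assembles the command text. The function
+ names are hypothetical. The code assumes that, with
+ <literal>standard_conforming_strings</literal> on, only single quotes
+ need escaping (by doubling them) and backslashes are ordinary
+ characters.
+ </para>

```python
from typing import Optional


def quote_label(label: str) -> str:
    """Quote a backup label as a standard-conforming SQL string."""
    # Only the single quote is special; it is escaped by doubling it.
    return "'" + label.replace("'", "''") + "'"


def base_backup_command(label: Optional[str] = None,
                        progress: bool = False,
                        fast: bool = False) -> str:
    """Build a BASE_BACKUP command with the options in documented order."""
    parts = ["BASE_BACKUP"]
    if label is not None:
        parts += ["LABEL", quote_label(label)]
    if progress:
        parts.append("PROGRESS")
    if fast:
        parts.append("FAST")
    return " ".join(parts)
```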
+ <para>
+ When the backup is started, the server will first send a header in
+ ordinary result set format, followed by one or more CopyOutResponse
+ results, one for PGDATA and one for each additional tablespace other
+ than <literal>pg_default</> and <literal>pg_global</>. The data in
+ each CopyOutResponse result is a dump of the tablespace contents in
+ tar format (using ustar00 extensions).
+ </para>
+ <para>
+ The header is an ordinary result set with one row for each tablespace.
+ The fields in this row are:
+ <variablelist>
+ <varlistentry>
+ <term>spcoid</term>
+ <listitem>
+ <para>
+ The OID of the tablespace, or <literal>NULL</> if it's the base
+ directory.
+ </para>
+ </listitem>
+ </varlistentry>
+ <varlistentry>
+ <term>spclocation</term>
+ <listitem>
+ <para>
+ The full path of the tablespace directory, or <literal>NULL</>
+ if it's the base directory.
+ </para>
+ </listitem>
+ </varlistentry>
+ <varlistentry>
+ <term>size</term>
+ <listitem>
+ <para>
+ The approximate size of the tablespace, if progress report has
+ been requested; otherwise it's <literal>NULL</>.
+ </para>
+ </listitem>
+ </varlistentry>
+ </variablelist>
+ </para>
+ <para>
+ The tar archive for the data directory and each tablespace will contain
+ all files in the directories, regardless of whether they are
+ <productname>PostgreSQL</> files or other files added to the same
+ directory. The only excluded files are:
+ <itemizedlist spacing="compact" mark="bullet">
+ <listitem>
+ <para>
+ <filename>postmaster.pid</>
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ <filename>pg_xlog</> (including subdirectories)
+ </para>
+ </listitem>
+ </itemizedlist>
+ Owner, group and file mode are set if the underlying filesystem on
+ the server supports it.
+ </para>
+ </listitem>
+ </varlistentry>
+</variablelist>
+
+</para>
+
+</sect1>
+
<sect1 id="protocol-message-types">
<title>Message Data Types</title>
characters, etc).
1 indicates the overall copy format is binary (similar
to DataRow format).
- See <xref linkend="sql-copy" endterm="sql-copy-title">
+ See <xref linkend="sql-copy">
for more information.
</para>
</listitem>
is textual (rows separated by newlines, columns
separated by separator characters, etc). 1 indicates
the overall copy format is binary (similar to DataRow
- format). See <xref linkend="sql-copy"
- endterm="sql-copy-title"> for more information.
+ format). See <xref linkend="sql-copy"> for more information.
+</para>
+</listitem>
+</varlistentry>
+<varlistentry>
+<term>
+ Int16
+</term>
+<listitem>
+<para>
+ The number of columns in the data to be copied
+ (denoted <replaceable>N</> below).
+</para>
+</listitem>
+</varlistentry>
+<varlistentry>
+<term>
+ Int16[<replaceable>N</>]
+</term>
+<listitem>
+<para>
+ The format codes to be used for each column.
+ Each must presently be zero (text) or one (binary).
+ All must be zero if the overall copy format is textual.
+</para>
+</listitem>
+</varlistentry>
+</variablelist>
+
+</para>
+</listitem>
+</varlistentry>
+
+
+<varlistentry>
+<term>
+CopyBothResponse (B)
+</term>
+<listitem>
+<para>
+
+<variablelist>
+<varlistentry>
+<term>
+ Byte1('W')
+</term>
+<listitem>
+<para>
+ Identifies the message as a Start Copy Both response.
+ This message is used only for Streaming Replication.
+</para>
+</listitem>
+</varlistentry>
+<varlistentry>
+<term>
+ Int32
+</term>
+<listitem>
+<para>
+ Length of message contents in bytes, including self.
+</para>
+</listitem>
+</varlistentry>
+<varlistentry>
+<term>
+ Int8
+</term>
+<listitem>
+<para>
+ 0 indicates the overall <command>COPY</command> format
+ is textual (rows separated by newlines, columns
+ separated by separator characters, etc). 1 indicates
+ the overall copy format is binary (similar to DataRow
+ format). See <xref linkend="sql-copy"> for more information.
</para>
</listitem>
</varlistentry>
</sect1>
-<sect1 id="protocol-replication">
-<title>Streaming Replication Protocol</title>
-
-<para>
-To initiate streaming replication, the frontend sends the "replication"
-parameter in the startup message. This tells the backend to go into
-walsender mode, where a small set of replication commands can be issued
-instead of SQL statements. Only the simple query protocol can be used in
-walsender mode.
-
-The commands accepted in walsender mode are:
-
-<variablelist>
- <varlistentry>
- <term>IDENTIFY_SYSTEM</term>
- <listitem>
- <para>
- Requests the server to idenfity itself. Server replies with a result
- set of a single row, and two fields:
-
- systemid: The unique system identifier identifying the cluster. This
- can be used to check that the base backup used to initialize the
- slave came from the same cluster.
-
- timeline: Current TimelineID. Also used to check that the slave is
- consistent with the master.
- </para>
- </listitem>
- </varlistentry>
-
- <varlistentry>
- <term>START_REPLICATION XXX/XXX</term>
- <listitem>
- <para>
- Instructs backend to start streaming WAL, starting at point XXX/XXX.
- Server can reply with an error e.g if the requested piece of WAL has
- already been recycled. On success, server responds with a
- CopyOutResponse message, and backend starts to stream WAL as CopyData
- messages.
- The payload in CopyData message consists of the following format.
- </para>
-
- <para>
- <variablelist>
- <varlistentry>
- <term>
- XLogData (B)
- </term>
- <listitem>
- <para>
- <variablelist>
- <varlistentry>
- <term>
- Byte1('w')
- </term>
- <listitem>
- <para>
- Identifies the message as WAL data.
- </para>
- </listitem>
- </varlistentry>
- <varlistentry>
- <term>
- Int32
- </term>
- <listitem>
- <para>
- The log file number of the LSN, indicating the starting point of
- the WAL in the message.
- </para>
- </listitem>
- </varlistentry>
- <varlistentry>
- <term>
- Int32
- </term>
- <listitem>
- <para>
- The byte offset of the LSN, indicating the starting point of
- the WAL in the message.
- </para>
- </listitem>
- </varlistentry>
- <varlistentry>
- <term>
- Byte<replaceable>n</replaceable>
- </term>
- <listitem>
- <para>
- Data that forms part of WAL data stream.
- </para>
- </listitem>
- </varlistentry>
- </variablelist>
- </para>
- </listitem>
- </varlistentry>
- </variablelist>
- </para>
- <para>
- A single WAL record is never split across two CopyData messages. When
- a WAL record crosses a WAL page boundary, however, and is therefore
- already split using continuation records, it can be split at the page
- boundary. In other words, the first main WAL record and its
- continuation records can be split across different CopyData messages.
- </para>
- </listitem>
- </varlistentry>
-</variablelist>
-
-</para>
-
-</sect1>
-
<sect1 id="protocol-changes">
<title>Summary of Changes since Protocol 2.0</title>