Need to hold ControlFileLock while updating control file. Update
minRecoveryPoint in control file when replaying a parameter change record,
to ensure that we don't allow hot standby on WAL generated without
wal_level='hot_standby' after a standby restart.
Add cross-reference from wal_level to hot_standby setting. Update
the PITR documentation to mention that you need to set wal_level to
'archive' or 'hot_standby', to enable WAL archiving. Per Simon's request.
Tom Lane [Sun, 2 May 2010 22:28:05 +0000 (22:28 +0000)]
Fix replay of XLOG_HEAP_NEWPAGE WAL records to pay attention to the forknum
field of the WAL record. The previous coding always wrote to the main fork,
resulting in data corruption if the page was meant to go into a non-default
fork.
At present, the only operation that can produce such WAL records is
ALTER TABLE/INDEX SET TABLESPACE when executed with archive_mode = on.
Data corruption would be observed on standby slaves, and could occur on the
master as well if a database crash and recovery occurred after committing
the ALTER and before the next checkpoint. Per report from Gordon Shannon.
Back-patch to 8.4; the problem doesn't exist in earlier branches because
we didn't have a concept of multiple relation forks then.
Simon Riggs [Sun, 2 May 2010 11:32:53 +0000 (11:32 +0000)]
Mention that max_standby_delay has units of milliseconds. Units are mentioned
for all other parameters where the default is expressed in a different unit.
Tom Lane [Sun, 2 May 2010 02:10:33 +0000 (02:10 +0000)]
Clean up some awkward, inaccurate, and inefficient processing around
MaxStandbyDelay. Use the GUC units mechanism for the value, and choose more
appropriate timestamp functions for performing tests with it. Make the
ps_activity manipulation in ResolveRecoveryConflictWithVirtualXIDs have
behavior similar to ps_activity code elsewhere, notably not updating the
display when update_process_title is off and not truncating the display
contents at an arbitrarily-chosen length. Improve the docs to be explicit
about what MaxStandbyDelay actually measures, viz the difference between
primary and standby servers' clocks, and the possible hazards if their clocks
aren't in sync.
Tom Lane [Sat, 1 May 2010 22:46:30 +0000 (22:46 +0000)]
Add code to InternalIpcMemoryCreate() to handle the case where shmget()
returns EINVAL for an existing shared memory segment. Although it's not
terribly sensible, that behavior does meet the POSIX spec because EINVAL
is the appropriate error code when the existing segment is smaller than the
requested size, and the spec explicitly disclaims any particular ordering of
error checks. Moreover, it does in fact happen on OS X and probably other
BSD-derived kernels. (We were able to talk NetBSD into changing their code,
but purging that behavior from the wild completely seems unlikely to happen.)
We need to distinguish collision with a pre-existing segment from invalid size
request in order to behave sensibly, so it's worth some extra code here to get
it right. Per report from Gavin Kistner and subsequent investigation.
Back-patch to all supported versions, since any of them could get used
with a kernel having the debatable behavior.
Tom Lane [Sat, 1 May 2010 21:31:17 +0000 (21:31 +0000)]
Install hack workaround for failure of 'make all' in VPATH builds.
It appears that gmake gets confused if postgres.sgml is not present in
the working directory, and instantiates some default rule or other that
would let postgres.sgml be built from postgres.xml. I haven't been able
to track down exactly where that's coming from, but the problem can be
dodged by specifying srcdir explicitly in the rule for postgres.xml.
Per report from Vladimir Kokovic.
Tom Lane [Sat, 1 May 2010 18:15:07 +0000 (18:15 +0000)]
Adjust postgres.xml rule so that make will notice a failure exit from osx.
The previous coding had it in a pipe, which on most shells won't report
the error. Per experimentation with a bug report from Vladimir Kokovic.
This doesn't actually fix his problem, but it does explain why make
didn't report that there was a problem.
Tom Lane [Fri, 30 Apr 2010 22:24:50 +0000 (22:24 +0000)]
Update our information about OS X shared memory configuration: it's now
possible to set most of the SHM kernel parameters without a reboot.
Also, reorder the paragraph to explain the modern configuration method first.
There are probably not too many people who still care about how to do it on
OS X 10.3 or older.
Tom Lane [Fri, 30 Apr 2010 19:15:45 +0000 (19:15 +0000)]
Fix multiple memory leaks in PLy_spi_execute_fetch_result: it would leak
memory if the result had zero rows, and also if there was any sort of error
while converting the result tuples into Python data. Reported and partially
fixed by Andres Freund.
Back-patch to all supported versions. Note: I haven't tested the 7.4 fix.
7.4's configure check for python is so obsolete it doesn't work on my
current machines :-(. The logic change is pretty straightforward though.
Tom Lane [Fri, 30 Apr 2010 17:09:13 +0000 (17:09 +0000)]
Fix a couple of places where the result of fgets() wasn't checked.
This is mostly to suppress compiler warnings, although in principle
the cases could result in undesirable behavior.
Fix handling of b-tree reuse WAL records when hot standby is disabled,
and add missing code in btree_desc for them. This fixes the bug
with "tree_redo: unknown op code 208" error reported by Jaime Casanova.
Tom Lane [Thu, 29 Apr 2010 21:49:03 +0000 (21:49 +0000)]
Adjust error checks in pg_start_backup and pg_stop_backup to make it possible
to perform a backup without archive_mode being enabled. This gives up some
user-error protection in order to improve usefulness for streaming-replication
scenarios. Per discussion.
Tom Lane [Thu, 29 Apr 2010 21:36:19 +0000 (21:36 +0000)]
Rename the parameter recovery_connections to hot_standby, to reduce possible
confusion with streaming-replication settings. Also, change its default
value to "off", because of concern about executing new and poorly-tested
code during ordinary non-replicating operation. Per discussion.
In passing do some minor editing of related documentation.
Tom Lane [Thu, 29 Apr 2010 16:32:41 +0000 (16:32 +0000)]
Install a workaround for 'TeX capacity exceeded' problem
when building PDF output for recent versions of the documentation.
There is probably a better answer out there somewhere, but
we need something now so we can build beta releases.
Tom Lane [Wed, 28 Apr 2010 19:38:49 +0000 (19:38 +0000)]
Minor editorializing on pg_controldata and pg_resetxlog: adjust some message
wording, deal explicitly with some fields that were being silently left zero.
Tom Lane [Wed, 28 Apr 2010 16:54:16 +0000 (16:54 +0000)]
Modify ShmemInitStruct and ShmemInitHash to throw errors internally,
rather than returning NULL for some-but-not-all failures as they used to.
Remove now-redundant tests for NULL from call sites.
We had to do something about this because many call sites were failing to
check for NULL; and changing it like this seems a lot more useful and
mistake-proof than adding checks to the call sites without them.
Introduce wal_level GUC to explicitly control if information needed for
archival or hot standby should be WAL-logged, instead of deducing that from
other options like archive_mode. This replaces recovery_connections GUC in
the primary, where it now has no effect, but it's still used in the standby
to enable/disable hot standby.
Remove the WAL-logging of "unlogged operations", like creating an index
without WAL-logging and fsyncing it at the end. Instead, we keep a copy of
the wal_mode setting and the settings that affect how much shared memory a
hot standby server needs to track master transactions (max_connections,
max_prepared_xacts, max_locks_per_xact) in pg_control. Whenever the settings
change, at server restart, write a WAL record noting the new settings and
update pg_control. This allows us to notice the change in those settings in
the standby at the right moment, they used to be included in checkpoint
records, but that meant that a changed value was not reflected in the
standby until the first checkpoint after the change.
Bump PG_CONTROL_VERSION and XLOG_PAGE_MAGIC. Whack XLOG_PAGE_MAGIC back to
the sequence it used to follow, before hot standby and subsequent patches
changed it to 0x9003.
Tom Lane [Wed, 28 Apr 2010 02:04:16 +0000 (02:04 +0000)]
Modify the built-in text search parser to handle URLs more nearly according
to RFC 3986. In particular, these characters now terminate the path part
of a URL: '"', '<', '>', '\', '^', '`', '{', '|', '}'. The previous behavior
was inconsistent and depended on whether a "?" was present in the path.
Per gripe from Donald Fraser and spec research by Kevin Grittner.
This is a pre-existing bug, but not back-patching since the risks of
breaking existing applications seem to outweigh the benefits.
Tom Lane [Wed, 28 Apr 2010 00:09:05 +0000 (00:09 +0000)]
Replace the KnownAssignedXids hash table with a sorted-array data structure,
and be more tense about the locking requirements for it, to improve performance
in Hot Standby mode. In passing fix a few bugs and improve a number of
comments in the existing HS code.
If a base backup is cancelled by server shutdown or crash, throw an error
in WAL recovery when it sees the shutdown checkpoint record. It's more
user-friendly to find out about it at that point than at the end of
recovery, and you're not left wondering why your hot standby server never
opens up for read-only connections.
Robert Haas [Mon, 26 Apr 2010 10:52:00 +0000 (10:52 +0000)]
When we're restricting who can connect, don't allow new walsenders.
Normal superuser processes are allowed to connect even when the database
system is shutting down, or when fewer than superuser_reserved_connection
slots remain. This is intended to make sure an administrator can log in
and troubleshoot, so don't extend these same courtesies to users connecting
for replication.
Simon Riggs [Fri, 23 Apr 2010 22:23:39 +0000 (22:23 +0000)]
Add missing optimizer hooks for function cost and number of rows.
Closely follow design of other optimizer hooks: if hook exists
retrieve value from plugin; if still not set then get from cache.
Simon Riggs [Fri, 23 Apr 2010 19:57:19 +0000 (19:57 +0000)]
Make CheckRequiredParameterValues() depend upon correct combination
of parameters. Fix bug report by Robert Haas that error message and
hint was incorrect if wrong mode parameters specified on master.
Internal changes only. Proposals for parameter simplification on
master/primary still under way.
Simon Riggs [Thu, 22 Apr 2010 08:04:25 +0000 (08:04 +0000)]
Optimise btree delete processing when no active backends.
Clarify comments, downgrade a message to DEBUG and remove some
debug counters. Direct from ideas by Heikki Linnakangas.
Simon Riggs [Thu, 22 Apr 2010 02:15:45 +0000 (02:15 +0000)]
Further reductions in Hot Standby conflict processing. These
come from the realistion that HEAP2_CLEAN records don't
always remove user visible data, so conflict processing for
them can be skipped. Confirm validity using Assert checks,
clarify circumstances under which we log heap_cleanup_info
records. Tuning arises from bug fixing of earlier safety
check failures.
Fix encoding issue when lc_monetary or lc_numeric are different encoding
from lc_ctype, that could happen on Windows. We need to change lc_ctype
together with lc_monetary or lc_numeric, and convert strings in lconv
from lc_ctype encoding to the database encoding.
The bug reported by Mikko, original patch by Hiroshi Inoue,
with changes by Bruce and me.
Tom Lane [Wed, 21 Apr 2010 20:54:19 +0000 (20:54 +0000)]
Enforce superuser permissions checks during ALTER ROLE/DATABASE SET, rather
than during define_custom_variable(). This entails rejecting an ALTER
command if the target variable doesn't have a known (non-placeholder)
definition, unless the calling user is superuser. When the variable *is*
known, we can correctly apply the rule that only superusers can issue ALTER
for SUSET parameters. This allows define_custom_variable to apply ALTER's
values for SUSET parameters at module load time, secure in the knowledge
that only a superuser could have set the ALTER value. This change fixes a
longstanding gotcha in the usage of SUSET-level custom parameters; which
is a good thing to fix now that plpgsql defines such a parameter.
Simon Riggs [Wed, 21 Apr 2010 19:53:24 +0000 (19:53 +0000)]
Only send cleanup_info messages if VACUUM removes any tuples.
There is no other purpose for this message type than to report
the latestRemovedXid of removed tuples, prior to index scans.
Removes overlooked path for sending invalid latestRemovedXid.
Fixes buildfarm failure on centaur.
Simon Riggs [Wed, 21 Apr 2010 19:08:14 +0000 (19:08 +0000)]
Relax locking during GetCurrentVirtualXIDs(). Earlier improvements
to handling of btree delete records mean that all snapshot
conflicts on standby now have a valid, useful latestRemovedXid.
Our earlier approach using LW_EXCLUSIVE was useful when we didnt
always have a valid value, though is no longer useful or necessary.
Asserts added to code path to prove and ensure this is the case.
This will reduce contention and improve performance of larger Hot
Standby servers.
Simon Riggs [Wed, 21 Apr 2010 17:20:56 +0000 (17:20 +0000)]
Fix oversight in collecting values for cleanup_info records.
vacuum_log_cleanup_info() now generates log records with a valid
latestRemovedXid set in all cases. Also be careful not to zero the
value when we do a round of vacuuming part-way through lazy_scan_heap().
Incidentally, this reduces frequency of conflicts in Hot Standby.
Tom Lane [Wed, 21 Apr 2010 03:32:53 +0000 (03:32 +0000)]
Fix pg_hba.conf matching so that replication connections only match records
with database = replication. The previous coding would allow them to match
ordinary records too, but that seems like a recipe for security breaches.
Improve the messages associated with no-such-pg_hba.conf entry to report
replication connections as such, since that's now a critical aspect of
whether the connection matches. Make some cursory improvements in the related
documentation, too.
Tom Lane [Wed, 21 Apr 2010 00:51:57 +0000 (00:51 +0000)]
Move the check for whether walreceiver has authenticated as a superuser
from walsender.c, where it didn't really belong, to postinit.c where it does
belong (and is essentially free, too).
Tom Lane [Tue, 20 Apr 2010 23:48:47 +0000 (23:48 +0000)]
Arrange for client authentication to occur before we select a specific
database to connect to. This is necessary for the walsender code to work
properly (it was previously using an untenable assumption that template1 would
always be available to connect to). This also gets rid of a small security
shortcoming that was introduced in the original patch to eliminate the flat
authentication files: before, you could find out whether or not the requested
database existed even if you couldn't pass the authentication checks.
The changes needed to support this are mainly just to treat pg_authid and
pg_auth_members as nailed relations, so that we can read them without having
to be able to locate real pg_class entries for them. This mechanism was
already debugged for pg_database, but we hadn't recognized the value of
applying it to those catalogs too.
Since the current code doesn't have support for accessing toast tables before
we've brought up all of the relcache, remove pg_authid's toast table to ensure
that no one can store an out-of-line toasted value of rolpassword. The case
seems quite unlikely to occur in practice, and was effectively unsupported
anyway in the old "flatfiles" implementation.
Update genbki.pl to actually implement the same rules as bootstrap.c does for
not-nullability of catalog columns. The previous coding was a bit cheesy but
worked all right for the previous set of bootstrap catalogs. It does not work
for pg_authid, where rolvaliduntil needs to be nullable.
Initdb forced due to minor catalog changes (mainly the toast table removal).
Robert Haas [Tue, 20 Apr 2010 11:15:06 +0000 (11:15 +0000)]
Rename standby_keep_segments to wal_keep_segments.
Also, make the name of the GUC and the name of the backing variable match.
Alnong the way, clean up a couple of slight typographical errors in the
related docs.
Tom Lane [Tue, 20 Apr 2010 01:38:52 +0000 (01:38 +0000)]
Move the responsibility for calling StartupXLOG into InitPostgres, for
those process types that go through InitPostgres; in particular, bootstrap
and standalone-backend cases. This ensures that we have set up a PGPROC
and done some other basic initialization steps (corresponding to the
if (IsUnderPostmaster) block in AuxiliaryProcessMain) before we attempt to
run WAL recovery in a standalone backend. As was discovered last September,
this is necessary for some corner-case code paths during WAL recovery,
particularly end-of-WAL cleanup.
Moving the bootstrap case here too is not necessary for correctness, but it
seems like a good idea since it reduces the number of distinct code paths.
Robert Haas [Tue, 20 Apr 2010 00:26:06 +0000 (00:26 +0000)]
Update docs as to when WAL logging can be skipped.
In 8.4 and prior, WAL-logging could potentially be skipped whenever
archive_mode=off. With streaming replication, this is now true only
if max_wal_senders=0.
Simon Riggs [Mon, 19 Apr 2010 18:03:38 +0000 (18:03 +0000)]
Check RecoveryInProgress() while holding ProcArrayLock during snapshots.
This prevents a rare, yet possible race condition at the exact moment
of transition from recovery to normal running.
Tom Lane [Mon, 19 Apr 2010 17:54:48 +0000 (17:54 +0000)]
Fix uninitialized local variables. Not sure why gcc doesn't complain about
these --- maybe because they're effectively unused? MSVC does complain though,
per buildfarm.
Magnus Hagander [Mon, 19 Apr 2010 14:10:45 +0000 (14:10 +0000)]
Add wrapper function libpqrcv_PQexec() in the walreceiver that uses async
libpq to send queries, making the waiting for responses interruptible on
platforms where PQexec() can't normally be interrupted by signals, such
as win32.
Robert Haas [Mon, 19 Apr 2010 00:55:26 +0000 (00:55 +0000)]
Add an 'enable_material' GUC.
The logic for determining whether to materialize has been significantly
overhauled for 9.0. In case there should be any doubt about whether
materialization is a win in any particular case, this should provide a
convenient way of seeing what happens without it; but even with enable_material
turned off, we still materialize in cases where it is required for
correctness.
Simon Riggs [Sun, 18 Apr 2010 18:44:53 +0000 (18:44 +0000)]
Improve sequence and sense of messages from pg_stop_backup().
Now doesn't report it is waiting until it actually is waiting,
plus message doesn't appear until at least 5 seconds wait, so
we avoid reporting the wait before we've given the archiver
a reasonable time to wake up and archive the file we just
created earlier in the function.
Also add new unconditional message to confirm safe completion.
Now a normal, healthy execution does not report waiting at
all, just safe completion.
Simon Riggs [Sun, 18 Apr 2010 18:06:07 +0000 (18:06 +0000)]
Tune GetSnapshotData() during Hot Standby by avoiding loop
through normal backends. Makes code clearer also, since we
avoid various Assert()s. Performance of snapshots taken
during recovery no longer depends upon number of read-only
backends.
On Windows, syslogger runs in two threads. The main thread processes config
reload and rotation signals, and a helper thread reads messages from the
pipe and writes them to the log file. However, server code isn't generally
thread-safe, so if both try to do e.g palloc()/pfree() at the same time,
bad things will happen. To fix that, use a critical section (which is like
a mutex) to enforce that only one the threads are active at a time.
In standby mode, suppress repeated LOG messages about a corrupt record,
which just indicates that we've reached the end of valid WAL found in
the standby.
Tom Lane [Wed, 14 Apr 2010 23:52:10 +0000 (23:52 +0000)]
Fix plpgsql's exec_eval_expr() to ensure it returns a sane type OID
even when the expression is a query that returns no rows.
So far as I can tell, the only caller that actually fails when a garbage
OID is returned is exec_stmt_case(), which is new in 8.4 --- in all other
cases, we might make a useless trip through casting logic, but we won't
fail since the isnull flag will be set. Hence, backpatch only to 8.4,
just in case there are apps out there that aren't expecting an error to
be thrown if the query returns more or less than one column. (Which seems
unlikely, since the error would be thrown if the query ever did return a
row; but it's possible there's some never-exercised code out there.)
Tom Lane [Wed, 14 Apr 2010 21:31:11 +0000 (21:31 +0000)]
Fix a problem introduced by my patch of 2010-01-12 that revised the way
relcache reload works. In the patched code, a relcache entry in process of
being rebuilt doesn't get unhooked from the relcache hash table; which means
that if a cache flush occurs due to sinval queue overrun while we're
rebuilding it, the entry could get blown away by RelationCacheInvalidate,
resulting in crash or misbehavior. Fix by ensuring that an entry being
rebuilt has positive refcount, so it won't be seen as a target for removal
if a cache flush occurs. (This will mean that the entry gets rebuilt twice
in such a scenario, but that's okay.) It appears that the problem can only
arise within a transaction that has previously reassigned the relfilenode of
a pre-existing table, via TRUNCATE or a similar operation. Per bug #5412
from Rusty Conover.
Back-patch to 8.2, same as the patch that introduced the problem.
I think that the failure can't actually occur in 8.2, since it lacks the
rd_newRelfilenodeSubid optimization, but let's make it work like the later
branches anyway.
Magnus Hagander [Tue, 13 Apr 2010 08:16:09 +0000 (08:16 +0000)]
Only try to do a graceful disconnect if we've successfully loaded the
shared library with the disconnect function in it. Fixes segmentation
fault reported by Jeff Davis.