Check for partial WAL files in standby mode. If restore_command restores
a partial WAL file, assume it's because the file is just being copied to
the archive and treat it the same as "file not found" in standby mode.
pg_standby has a similar check, so it seems reasonable to have the same
level of protection in the built-in standby mode.
Simon Riggs [Thu, 11 Feb 2010 19:35:22 +0000 (19:35 +0000)]
Fix typo bug in Hot Standby from recent refactoring. Bug introduced
into code recently patched by Andres Freund, so quickly fixed by him
when bug report from Tatsuo Ishii arrived.
Teodor Sigaev [Thu, 11 Feb 2010 14:29:50 +0000 (14:29 +0000)]
Generic implementation of red-black binary tree. It's planned to use in
several places, but for now only GIN uses it during index creation.
Using self-balanced tree greatly speeds up index creation in corner cases
with preordered data.
Now that streaming replication switches between streaming mode and
restoring from archive, the last WAL segment is not necessarily open at
the end of recovery. Fix assertion that assumed that.
Fujii Masao, fixing the assertion failure reported by Martin Pihlak.
Tom Lane [Wed, 10 Feb 2010 03:38:35 +0000 (03:38 +0000)]
Improve planner's choices about when to use hashing vs sorting for DISTINCT.
The previous coding missed a bet by sometimes picking the "sorted" path
from query_planner even though hashing would be preferable. To fix, we have
to be willing to make the choice sooner. This contorts things a little bit,
but I thought of a factorization that makes it not too awful.
Tom Lane [Tue, 9 Feb 2010 21:43:30 +0000 (21:43 +0000)]
Fix up rickety handling of relation-truncation interlocks.
Move rd_targblock, rd_fsm_nblocks, and rd_vm_nblocks from relcache to the smgr
relation entries, so that they will get reset to InvalidBlockNumber whenever
an smgr-level flush happens. Because we now send smgr invalidation messages
immediately (not at end of transaction) when a relation truncation occurs,
this ensures that other backends will reset their values before they next
access the relation. We no longer need the unreliable assumption that a
VACUUM that's doing a truncation will hold its AccessExclusive lock until
commit --- in fact, we can intentionally release that lock as soon as we've
completed the truncation. This patch therefore reverts (most of) Alvaro's
patch of 2009-11-10, as well as my marginal hacking on it yesterday. We can
also get rid of assorted no-longer-needed relcache flushes, which are far more
expensive than an smgr flush because they kill a lot more state.
In passing this patch fixes smgr_redo's failure to perform visibility-map
truncation, and cleans up some rather dubious assumptions in freespace.c and
visibilitymap.c about when rd_fsm_nblocks and rd_vm_nblocks can be out of
date.
Magnus Hagander [Tue, 9 Feb 2010 19:55:14 +0000 (19:55 +0000)]
Define the value for in6addr_any on MingW, since it provides the struct
only in the header files and not in any libraries, yet declare it as
an extern.
Move "Warm Standby Servers for High Availability" and "Hot Standby"
sections under "High Availability, Load Balancing, and Replication"
chapter. Streaming replication chapter needs a lot more work, but this
commit just moves things around.
Tom Lane [Tue, 9 Feb 2010 00:28:30 +0000 (00:28 +0000)]
Rearrange lazy-vacuum code a little bit to reduce the window between
truncating the table and transaction commit. This isn't really making
it safe, but at least there is no good reason to do free space map
cleanup within the risk window. Don't lock out cancel interrupts
until we have to, either.
Tom Lane [Mon, 8 Feb 2010 20:39:52 +0000 (20:39 +0000)]
Create an official API function for C functions to use to check if they are
being called as aggregates, and to get the aggregate transition state memory
context if needed. Use it instead of poking directly into AggState and
WindowAggState in places that shouldn't know so much.
We should have done this in 8.4, probably, but better late than never.
Tom Lane [Mon, 8 Feb 2010 16:50:21 +0000 (16:50 +0000)]
Fix serious performance bug in new implementation of VACUUM FULL:
cluster_rel necessarily builds an all-new toast table, so it's useless to
then go and VACUUM FULL the toast table.
Remove piece of code to zero out minRecoveryPoint when starting crash
recovery. It's zeroed out whenever a checkpoint is written, so the only
scenario where the removed code did anything is when you kill archive
recovery, remove recovery.conf, and start up the server, so that it goes
into crash recovery instead. That's a "don't do that" scenario, but it
seems better to not clear minRecoveryPoint but instead update it like we
do in archive recovery, which is what will now happen.
Tom Lane [Mon, 8 Feb 2010 05:53:55 +0000 (05:53 +0000)]
Remove CatalogCacheFlushRelation, and the reloidattr infrastructure that was
needed by nothing else.
The restructuring I just finished doing on cache management exposed to me how
silly this routine was. Its function was to go into the catcache and blow
away all entries related to a given relation when there was a relcache flush
on that relation. However, there is no point in removing a catcache entry
if the catalog row it represents is still valid --- and if it isn't valid,
there must have been a catcache entry flush on it, because that's triggered
directly by heap_update or heap_delete on the catalog row. So this routine
accomplished nothing except to blow away valid cache entries that we'd very
likely be wanting in the near future to help reconstruct the relcache entry.
Dumb.
On top of which, it required a subtle and easy-to-get-wrong attribute in
syscache definitions, ie, the column containing the OID of the related
relation if any. Removing that is a very useful maintenance simplification.
Tom Lane [Mon, 8 Feb 2010 04:33:55 +0000 (04:33 +0000)]
Remove old-style VACUUM FULL (which was known for a little while as
VACUUM FULL INPLACE), along with a boatload of subsidiary code and complexity.
Per discussion, the use case for this method of vacuuming is no longer large
enough to justify maintaining it; not to mention that we don't wish to invest
the work that would be needed to make it play nicely with Hot Standby.
Aside from the code directly related to old-style VACUUM FULL, this commit
removes support for certain WAL record types that could only be generated
within VACUUM FULL, redirect-pointer removal in heap_page_prune, and
nontransactional generation of cache invalidation sinval messages (the last
being the sticking point for Hot Standby).
We still have to retain all code that copes with finding HEAP_MOVED_OFF and
HEAP_MOVED_IN flag bits on existing tuples. This can't be removed as long
as we want to support in-place update from pre-9.0 databases.
Tom Lane [Sun, 7 Feb 2010 22:40:33 +0000 (22:40 +0000)]
Work around deadlock problems with VACUUM FULL/CLUSTER on system catalogs,
as per my recent proposal.
First, teach IndexBuildHeapScan to not wait for INSERT_IN_PROGRESS or
DELETE_IN_PROGRESS tuples to commit unless the index build is checking
uniqueness/exclusion constraints. If it isn't, there's no harm in just
indexing the in-doubt tuple.
Second, modify VACUUM FULL/CLUSTER to suppress reverifying
uniqueness/exclusion constraint properties while rebuilding indexes of
the target relation. This is reasonable because these commands aren't
meant to deal with corrupted-data situations. Constraint properties
will still be rechecked when an index is rebuilt by a REINDEX command.
This gets us out of the problem that new-style VACUUM FULL would often
wait for other transactions while holding exclusive lock on a system
catalog, leading to probable deadlock because those other transactions
need to look at the catalogs too. Although the real ultimate cause of
the problem is a debatable choice to release locks early after modifying
system catalogs, changing that choice would require pretty serious
analysis and is not something to be undertaken lightly or on a tight
schedule. The present patch fixes the problem in a fairly reasonable
way and should also improve the speed of VACUUM FULL/CLUSTER a little bit.
Tom Lane [Sun, 7 Feb 2010 20:48:13 +0000 (20:48 +0000)]
Create a "relation mapping" infrastructure to support changing the relfilenodes
of shared or nailed system catalogs. This has two key benefits:
* The new CLUSTER-based VACUUM FULL can be applied safely to all catalogs.
* We no longer have to use an unsafe reindex-in-place approach for reindexing
shared catalogs.
CLUSTER on nailed catalogs now works too, although I left it disabled on
shared catalogs because the resulting pg_index.indisclustered update would
only be visible in one database.
Since reindexing shared system catalogs is now fully transactional and
crash-safe, the former special cases in REINDEX behavior have been removed;
shared catalogs are treated the same as non-shared.
This commit does not do anything about the recently-discussed problem of
deadlocks between VACUUM FULL/CLUSTER on a system catalog and other
concurrent queries; will address that in a separate patch. As a stopgap,
parallel_schedule has been tweaked to run vacuum.sql by itself, to avoid
such failures during the regression tests.
Bruce Momjian [Fri, 5 Feb 2010 23:37:43 +0000 (23:37 +0000)]
Document that archive_timeout will force new WAL files even if a single
checkpoint has happened, and recommend adjusting checkpoint_timeout to
reduce the impact of this.
Joe Conway [Fri, 5 Feb 2010 03:09:05 +0000 (03:09 +0000)]
Modify recently added PQconnectdbParams() with new argument, expand_dbname.
If expand_dbname is non-zero and dbname contains an = sign, it is taken as
a conninfo string in exactly the same way as if it had been passed to
PQconnectdb. This is equivalent to the way PQsetdbLogin() works, allowing
PQconnectdbParams() to be a complete alternative.
Also improve the way the new function is called from psql and replace a
previously missed call to PQsetdbLogin() in psql. Additionally use
PQconnectdbParams() for pg_dump and friends, and the bin/scripts
command line utilities such as vacuumdb, createdb, etc.
Finally, update the documentation for the new parameter, as well as the
nuances of precedence in cases where key words are repeated or duplicated
in the conninfo string.
Tom Lane [Thu, 4 Feb 2010 00:09:14 +0000 (00:09 +0000)]
Restructure CLUSTER/newstyle VACUUM FULL/ALTER TABLE support so that swapping
of old and new toast tables can be done either at the logical level (by
swapping the heaps' reltoastrelid links) or at the physical level (by swapping
the relfilenodes of the toast tables and their indexes). This is necessary
infrastructure for upcoming changes to support CLUSTER/VAC FULL on shared
system catalogs, where we cannot change reltoastrelid. The physical swap
saves a few catalog updates too.
We unfortunately have to keep the logical-level swap logic because in some
cases we will be adding or deleting a toast table, so there's no possibility
of a physical swap. However, that only happens as a consequence of schema
changes in the table, which we do not need to support for system catalogs,
so such cases aren't an obstacle for that.
In passing, refactor the cluster support functions a little bit to eliminate
unnecessarily-duplicated code; and fix the problem that while CLUSTER had
been taught to rename the final toast table at need, ALTER TABLE had not.
Joe Conway [Wed, 3 Feb 2010 23:01:11 +0000 (23:01 +0000)]
Check to ensure the number of primary key fields supplied does not
exceed the total number of non-dropped source table fields for
dblink_build_sql_*(). Addresses bug report from Rushabh Lathia.
Move the responsibility of writing a "unlogged WAL operation" record from
heap_sync() to the callers, because heap_sync() is sometimes called even
if the operation itself is WAL-logged. This eliminates the bogus unlogged
records from CLUSTER that Simon Riggs reported, patch by Fujii Masao.
Add a message type header to the CopyData messages sent from primary
to standby in streaming replication. While we only have one message type
at the moment, adding a message type header makes this easier to extend.
Tom Lane [Wed, 3 Feb 2010 03:21:25 +0000 (03:21 +0000)]
Fix timing-sensitive regression test result I just created :-( --- the
DROP USER at the end of the cluster.sql test could fail, if the temp
table created in the previous session hadn't finished getting dropped.
Unluckily, I didn't see this in several repetitions of the parallel
regression tests, but it's popping up on quite a few buildfarm machines.
Tom Lane [Wed, 3 Feb 2010 01:14:17 +0000 (01:14 +0000)]
Assorted cleanups in preparation for using a map file to support altering
the relfilenode of currently-not-relocatable system catalogs.
1. Get rid of inval.c's dependency on relfilenode, by not having it emit
smgr invalidations as a result of relcache flushes. Instead, smgr sinval
messages are sent directly from smgr.c when an actual relation delete or
truncate is done. This makes considerably more structural sense and allows
elimination of a large number of useless smgr inval messages that were
formerly sent even in cases where nothing was changing at the
physical-relation level. Note that this reintroduces the concept of
nontransactional inval messages, but that's okay --- because the messages
are sent by smgr.c, they will be sent in Hot Standby slaves, just from a
lower logical level than before.
2. Move setNewRelfilenode out of catalog/index.c, where it never logically
belonged, into relcache.c; which is a somewhat debatable choice as well but
better than before. (I considered catalog/storage.c, but that seemed too
low level.) Rename to RelationSetNewRelfilenode.
3. Cosmetic cleanups of some other relfilenode manipulations.
Tom Lane [Tue, 2 Feb 2010 19:12:29 +0000 (19:12 +0000)]
CLUSTER specified the wrong namespace when renaming toast tables of temporary
relations (they don't live in pg_toast). This caused an Assert failure in
assert-enabled builds. So far as I can see, in a non-assert build it would
only have messed up the checks for conflicting names, so a failure would be
quite improbable but perhaps not impossible.
Robert Haas [Tue, 2 Feb 2010 18:52:33 +0000 (18:52 +0000)]
Fold FindConversion() into FindConversionByName() and remove ACL check.
All callers of FindConversionByName() already do suitable permissions
checking already apart from this function, but this is not just dead
code removal: the unnecessary permissions check can actually lead to
spurious failures - there's no reason why inability to execute the
underlying function should prohibit renaming the conversion, for example.
(The error messages in these cases were also rather poor:
FindConversion would return InvalidOid, eventually leading to a complaint
that the conversion "did not exist", which was not correct.)
Tom Lane [Tue, 2 Feb 2010 18:16:10 +0000 (18:16 +0000)]
The particular table names used in the new inheritance regression test are
prone to sort differently in different locales, as seen in buildfarm results.
Let's cast to name not text to avoid that.
Robert Haas [Mon, 1 Feb 2010 19:28:56 +0000 (19:28 +0000)]
Tighten integrity checks on ALTER TABLE ... ALTER COLUMN ... RENAME.
When a column is renamed, we recursively rename the same column in
all descendent tables. But if one of those tables also inherits that
column from a table outside the inheritance hierarchy rooted at the
named table, we must throw an error. The previous coding correctly
prohibited the rename when the parent had inherited the column from
elsewhere, but overlooked the case where the parent was OK but a child
table also inherited the same column from a second, unrelated parent.
For now, not backpatched due to lack of complaints from the field.
KaiGai Kohei, with further changes by me.
Reviewed by Bernd Helme and Tom Lane.
Robert Haas [Mon, 1 Feb 2010 15:43:36 +0000 (15:43 +0000)]
Augment EXPLAIN output with more details on Hash nodes.
We show the number of buckets, the number of batches (and also the original
number if it has changed), and the peak space used by the hash table. Minor
executor changes to track peak space used.
Add string_agg aggregate functions. The one argument version concatenates
the input values into a string. The two argument version also does the same
thing, but inserts delimiters between elements.
Original patch by Pavel Stehule, reviewed by David E. Wheeler and me.
Tom Lane [Mon, 1 Feb 2010 02:45:29 +0000 (02:45 +0000)]
Change regexp engine's ccondissect/crevdissect routines to perform DFA
matching before recursing instead of after. The DFA match eliminates
unworkable midpoint choices a lot faster than the recursive check, in most
cases, so doing it first can speed things up; particularly in pathological
cases such as recently exhibited by Michael Glaesemann.
In addition, apply some cosmetic changes that were applied upstream (in the
Tcl project) at the same time, in order to sync with upstream version 1.15
of regexec.c.
Upstream apparently intends to backpatch this, so I will too. The
pathological behavior could be unpleasant if encountered in the field,
which seems to justify any risk of introducing new bugs.
Tom Lane, reviewed by Donal K. Fellows of Tcl project
Simon Riggs [Sun, 31 Jan 2010 19:01:11 +0000 (19:01 +0000)]
Detect early deadlock in Hot Standby when Startup is already waiting. First
stage of required deadlock detection to allow re-enabling max_standby_delay
setting of -1, which is now essential in the absence of improved relation-
specific conflict resoluton. Requested by Greg Stark et al.
Tom Lane [Sun, 31 Jan 2010 18:15:39 +0000 (18:15 +0000)]
Fix memory leak created by deferrable-index-constraints patches.
We need to free the OID list returned by ExecInsertIndexTuples to avoid
a query-lifespan memory leak. When many rows require rechecking, this
can be a significant leak --- it's even more than the space used for the
queued trigger events.
Magnus Hagander [Sun, 31 Jan 2010 17:16:23 +0000 (17:16 +0000)]
Fix race condition in win32 signal handling.
There was a race condition where the receiving pipe could be closed by the
child thread if the main thread was pre-empted before it got a chance to
create a new one, and the dispatch thread ran to completion during that time.
One symptom of this is that rows in pg_listener could be dropped under
heavy load.
Analysis and original patch by Radu Ilie, with some small
modifications by Magnus Hagander.
Tom Lane [Sat, 30 Jan 2010 20:09:53 +0000 (20:09 +0000)]
Avoid performing encoding conversion on command tag strings during EndCommand.
Since all current and foreseeable future command tags will be pure ASCII,
there is no need to do conversion on them. This saves a few cycles and also
avoids polluting otherwise-pristine subtransaction memory contexts, which
is the cause of the backend memory leak exhibited in bug #5302. (Someday
we'll probably want to have a better method of determining whether
subtransaction contexts need to be kept around, but today is not that day.)
Backpatch to 8.0. The cycle-shaving aspect of this would work in 7.4
too, but without subtransactions the memory-leak aspect doesn't apply,
so it doesn't seem worth touching 7.4.
Tom Lane [Sat, 30 Jan 2010 18:59:51 +0000 (18:59 +0000)]
Fix memory leakage introduced into print_aligned_text by 8.4 changes
(failure to free col_lineptrs[] array elements) and exacerbated in the
current devel cycle (failure to free "wrap"). This resulted in moderate
bloat of psql over long script runs. Noted while testing bug #5302,
although what the reporter was complaining of was backend-side leakage.
Andrew Dunstan [Sat, 30 Jan 2010 01:46:57 +0000 (01:46 +0000)]
Add plperl.on_perl_init setting to provide for initializing the perl library on load. Also, handle END blocks in plperl.
Database access is disallowed during both these operations, although it might be allowed in END blocks in future.
Simon Riggs [Fri, 29 Jan 2010 19:45:12 +0000 (19:45 +0000)]
Adjust GetLockConflicts() so that it uses TopMemoryContext when
executed InHotStandby. Cleaner solution than using malloc or palloc
depending upon situation, as proposed by Tom.
Simon Riggs [Fri, 29 Jan 2010 18:39:05 +0000 (18:39 +0000)]
Augment WAL records for btree delete with GetOldestXmin() to reduce
false positives during Hot Standby conflict processing. Simple
patch to enhance conflict processing, following previous discussions.
Controlled by parameter minimize_standby_conflicts = on | off, with
default off allows measurement of performance impact to see whether
it should be set on all the time.
Simon Riggs [Fri, 29 Jan 2010 17:10:05 +0000 (17:10 +0000)]
Filter recovery conflicts based upon dboid from relfilenode of WAL
records for heap and btree. Minor change, mostly API changes to
pass through the required values. This is a simple change though
also provides the refactoring required for further enhancements
to conflict processing using the relOid. Changes only have effect
during Hot Standby.
Andrew Dunstan [Thu, 28 Jan 2010 23:59:52 +0000 (23:59 +0000)]
Add new make targets "world", "install-world" and "installcheck-world" to build, install and check just about everything.
In addition to everything built installed and tested by all, install and installcheck targets, these build HTML Docs,
build and test contrib, and test PLs and ECPG.
Simon Riggs [Thu, 28 Jan 2010 10:05:37 +0000 (10:05 +0000)]
Use malloc() in GetLockConflicts() when called InHotStandby to avoid repeated
palloc calls. Current code assumed this was already true, so this is a bug fix.
Change a few remaining calls of XLogArchivingActive() to use
XLogIsNeeded() instead, to determine if an otherwise non-logged operation
needs to be logged in WAL for standby servers.
Joe Conway [Thu, 28 Jan 2010 06:28:26 +0000 (06:28 +0000)]
Introduce two new libpq connection functions, PQconnectdbParams and
PQconnectStartParams. These are analogous to PQconnectdb and PQconnectStart
respectively. They differ from the legacy functions in that they accept
two NULL-terminated arrays, keywords and values, rather than conninfo
strings. This avoids the need to build the conninfo string in cases
where it might be inconvenient to do so. Includes documentation.
Also modify psql to utilize PQconnectdbParams rather than PQsetdbLogin.
This allows the new config parameter application_name to be set, which
in turn is displayed in the pg_stat_activity view and included in CSV
log entries. This will also ensure both new functions get regularly
exercised.
Patch by Guillaume Lelarge with review and minor adjustments by
Joe Conway.
Fix bug in wasender's xlogid boundary handling, reported by Erik Rijkers.
LogwrtRqst.Write can be set to non-existent FF log segment, we mustn't
try to send that in XLogSend().
Also fix similar bug in ReadRecord(), which I just introduced in the
ReadRecord() refactoring patch.
Make standby server continuously retry restoring the next WAL segment with
restore_command, if the connection to the primary server is lost. This
ensures that the standby can recover automatically, if the connection is
lost for a long time and standby falls behind so much that the required
WAL segments have been archived and deleted in the master.
This also makes standby_mode useful without streaming replication; the
server will keep retrying restore_command every few seconds until the
trigger file is found. That's the same basic functionality pg_standby
offers, but without the bells and whistles.
To implement that, refactor the ReadRecord/FetchRecord functions. The
FetchRecord() function introduced in the original streaming replication
patch is removed, and all the retry logic is now in a new function called
XLogReadPage(). XLogReadPage() is now responsible for executing
restore_command, launching walreceiver, and waiting for new WAL to arrive
from primary, as required.
This also changes the life cycle of walreceiver. When launched, it now only
tries to connect to the master once, and exits if the connection fails, or
is lost during streaming for any reason. The startup process detects the
death, and re-launches walreceiver if necessary.