Don't use O_DIRECT when writing WAL files if archiving or streaming is
enabled. Bypassing the kernel cache is counter-productive in that case,
because the archiver/walsender process will read from the WAL file
soon after it's written, and if it's not cached the read will cause
a physical read, eating I/O bandwidth available on the WAL drive.
Also, the walreceiver process does unaligned writes, so disable O_DIRECT
in the walreceiver process for that reason too.
Itagaki Takahiro [Fri, 19 Feb 2010 01:04:03 +0000 (01:04 +0000)]
Fix STOP WAL LOCATION in backup history files not to return the next
segment of XLOG_BACKUP_END record even if the record is placed
at a segment boundary. Furthermore the previous implementation could
return a nonexistent segment file name when the boundary is in a segment
that has "FE" suffix; we never use segments with "FF" suffix.
Backpatch to 8.0, where hot backup was introduced.
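The boundary case can be sketched as follows. This is only an illustration, not the backend code: it assumes the default 16 MB segment size and a WAL location expressed as a (log id, byte offset) pair, and all function names are hypothetical.

```python
# Sketch only: assumes 16 MB segments and (xlogid, xrecoff) WAL locations.
XLOG_SEG_SIZE = 16 * 1024 * 1024
# 0xFFFFFFFF // 16 MB == 255 segments per log file, so suffixes run 00..FE.
SEGS_PER_LOGFILE = 0xFFFFFFFF // XLOG_SEG_SIZE


def seg_containing(xlogid, xrecoff):
    """Segment holding the byte AT this location.

    If the location sits exactly on a segment boundary, this is the NEXT
    segment, which is the behaviour the fix avoids for the stop location.
    """
    return xlogid, xrecoff // XLOG_SEG_SIZE


def seg_ending_at(xlogid, xrecoff):
    """Segment holding the last byte BEFORE this location (xrecoff > 0)."""
    if xrecoff <= 0:
        raise ValueError("this sketch only handles xrecoff > 0")
    return xlogid, (xrecoff - 1) // XLOG_SEG_SIZE


def seg_file_name(tli, xlogid, seg):
    """Format a segment file name as timeline, log id, segment (8 hex digits each)."""
    return "%08X%08X%08X" % (tli, xlogid, seg)
```

For a stop location of (0, 0xFF000000) on timeline 1, seg_containing yields a name ending in "FF", a file that never exists, while seg_ending_at yields 0000000100000000000000FE.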
Tom Lane [Thu, 18 Feb 2010 23:50:06 +0000 (23:50 +0000)]
Volatile-ize all five places where we expect a PG_TRY block to restore
old memory context in plpython. Before only one of them was marked
volatile, but per report from Zdenek Kotala, some compilers do the
wrong thing here.
Tom Lane [Thu, 18 Feb 2010 22:43:31 +0000 (22:43 +0000)]
Provide some rather hokey ways for EXPLAIN to print FieldStore and assignment
ArrayRef expressions that are not in the immediate context of an INSERT or
UPDATE targetlist. Such cases never arise in stored rules, so ruleutils.c
hadn't tried to handle them. However, they do occur in the targetlists of
plans derived from such statements, and now that EXPLAIN VERBOSE tries to
print targetlists, we need some way to deal with the case.
I chose to represent an assignment ArrayRef as "array[subscripts] := source",
which is fairly reasonable and doesn't omit any information. However,
FieldStore is problematic because the planner will fold multiple assignments
to fields of the same composite column into one FieldStore, resulting in a
structure that is hard to understand at all, let alone display comprehensibly.
So in that case I punted and just made it print the source expression(s).
Backpatch to 8.4 --- the lack of functionality exists in older releases,
but doesn't seem to be important for lack of anything that would call it.
Tom Lane [Thu, 18 Feb 2010 18:41:47 +0000 (18:41 +0000)]
Fix ExecEvalArrayRef to pass down the old value of the array element or slice
being assigned to, in case the expression to be assigned is a FieldStore that
would need to modify that value. The need for this was foreseen some time
ago, but not implemented then because we did not have arrays of composites.
Now we do, but the point evidently got overlooked in that patch. Net result
is that updating a field of an array element doesn't work right, as
illustrated if you try the new regression test on an unpatched backend.
Noted while experimenting with EXPLAIN VERBOSE, which has also got some issues
in this area.
Backpatch to 8.3, where arrays of composites were introduced.
Fix pq_getbyte_if_available() function. It was confused on what it
returns if no data is immediately available. Patch by me with numerous
fixes from Fujii Masao and Magnus Hagander.
Tom Lane [Thu, 18 Feb 2010 03:06:46 +0000 (03:06 +0000)]
Force READY portals into FAILED state when a transaction or subtransaction
is aborted, if they were created within the failed xact. This prevents
ExecutorEnd from being run on them, which is a good idea because they may
contain references to tables or other objects that no longer exist.
In particular this is hazardous when auto_explain is active, but it's
really rather surprising that nobody has seen an issue with this before.
I'm back-patching this to 8.4, since that's the first version that contains
auto_explain or an ExecutorEnd hook, but I wonder whether we shouldn't
back-patch further.
Tom Lane [Thu, 18 Feb 2010 01:29:10 +0000 (01:29 +0000)]
Fix up pg_dump's treatment of large object ownership and ACLs. We now emit
a separate archive entry for each BLOB, and use pg_dump's standard methods
for dealing with its ownership, ACL if any, and comment if any. This means
that switches like --no-owner and --no-privileges do what they're supposed
to. Preliminary testing says that performance is still reasonable even
with many blobs, though we'll have to see how that shakes out in the field.
Itagaki Takahiro [Wed, 17 Feb 2010 04:09:40 +0000 (04:09 +0000)]
Support new syntax and improve handling of parentheses in psql tab-completion.
Newly supported syntaxes are:
- ALTER {TABLE|INDEX|TABLESPACE} {SET|RESET} with options
- ALTER TABLE ALTER COLUMN {SET|RESET} with options
- ALTER TABLE ALTER COLUMN SET STORAGE
- CREATE INDEX CONCURRENTLY
- CREATE INDEX ON (without name)
- CREATE INDEX ... USING with pg_am.amname instead of hard-coded names
- CREATE TRIGGER with events
- DROP AGGREGATE function with arguments
Tom Lane [Wed, 17 Feb 2010 03:10:33 +0000 (03:10 +0000)]
When updating ShmemVariableCache from a checkpoint record, be sure to set
all the values derived from oldestXid, not just that field. Brain fade in
one of my patches associated with flat file removal, exposed by a report
from Fujii Masao.
With this change, xidVacLimit should always be valid, so remove a couple of
bits of complexity associated with the previous assumption that sometimes
it wouldn't get set right away.
Tom Lane [Wed, 17 Feb 2010 00:52:09 +0000 (00:52 +0000)]
Make NOTIFY_PAYLOAD_MAX_LENGTH depend explicitly on BLCKSZ and
NAMEDATALEN, so this code doesn't go nuts with smaller than default
BLCKSZ or larger than default NAMEDATALEN. The standard value is
still exactly 8000.
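As a rough illustration of the dependency, one formula consistent with the stated default of 8000 is the block size minus the name length minus a fixed allowance. The 128-byte allowance and the function name are assumptions made for this sketch, not values taken from the commit.

```python
def notify_payload_max_length(blcksz=8192, namedatalen=64, overhead=128):
    """Largest payload that fits in one block alongside a channel name.

    `overhead` is an ASSUMED fixed allowance for per-entry bookkeeping;
    the defaults for blcksz and namedatalen are the standard build values.
    """
    limit = blcksz - namedatalen - overhead
    if limit <= 0:
        raise ValueError("block size too small for this NAMEDATALEN")
    return limit
```

With the standard build values this gives 8000, and the limit shrinks automatically for a smaller BLCKSZ or a larger NAMEDATALEN instead of overflowing a page.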
Tom Lane [Tue, 16 Feb 2010 22:34:57 +0000 (22:34 +0000)]
Replace the pg_listener-based LISTEN/NOTIFY mechanism with an in-memory queue.
In addition, add support for a "payload" string to be passed along with
each notify event.
This implementation should be significantly more efficient than the old one,
and is also more compatible with Hot Standby usage. There is not yet any
facility for HS slaves to receive notifications generated on the master,
although such a thing is possible in future.
Joachim Wieland, reviewed by Jeff Davis; also hacked on by me.
Andrew Dunstan [Tue, 16 Feb 2010 21:39:52 +0000 (21:39 +0000)]
Clean up package namespace use and use of Safe in plperl.
Prevent use of another buggy version of Safe.pm.
Only register the exit handler if we have successfully created an interpreter.
Change log level of perl warnings from NOTICE to WARNING.
The infrastructure is there if in future we decide to allow
DBAs to specify extra modules that will be allowed in trusted code.
However, for now the relevant variables are declared as lexicals
rather than as package variables, so that they are not (or should not be)
accessible.
Mostly code from Tim Bunce, reviewed by Alex Hunsaker, with some
tweaks by me.
Magnus Hagander [Tue, 16 Feb 2010 19:26:02 +0000 (19:26 +0000)]
Add emulation of non-blocking sockets to the win32 socket/signal layer,
and use this in pq_getbyte_if_available.
It's only a limited implementation which switches the whole emulation layer
to non-blocking mode, but that's enough as long as non-blocking is only
used during a short period of time, and only one socket is accessed during
this time.
Alvaro Herrera [Mon, 15 Feb 2010 22:23:25 +0000 (22:23 +0000)]
Move main error message text in plperl into errmsg from errdetail,
and move the context information into errcontext instead of errmsg.
This makes them better conform to our guidelines.
Also remove a few errcode declarations that were providing the default
value ERRCODE_INTERNAL_ERROR.
Greg Stark [Mon, 15 Feb 2010 02:36:26 +0000 (02:36 +0000)]
Display explain buffers measurements in memory units rather than blocks. Also show "Total Buffer Usage" to hint that these are totals, not averages per loop.
Greg Stark [Mon, 15 Feb 2010 00:50:57 +0000 (00:50 +0000)]
Speed up CREATE DATABASE by deferring the fsyncs until after copying
all the data and using posix_fadvise to nudge the OS into flushing it
earlier. This also hopefully makes CREATE DATABASE avoid spamming the
cache.
Tests show a big speedup on Linux at least on some filesystems.
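The two-pass idea can be sketched outside the backend like this. It is a minimal illustration under stated assumptions: the function name is hypothetical, the choice of POSIX_FADV_DONTNEED as the advice value is an assumption, and the hint is skipped on platforms without posix_fadvise.

```python
import os
import shutil


def copy_tree_deferred_sync(src_dir, dst_dir):
    """Copy every regular file in src_dir to dst_dir, fsyncing only at the end."""
    os.makedirs(dst_dir, exist_ok=True)
    copied = []
    # Pass 1: copy the data and hint the kernel to start flushing early.
    for name in sorted(os.listdir(src_dir)):
        src = os.path.join(src_dir, name)
        if not os.path.isfile(src):
            continue
        dst = os.path.join(dst_dir, name)
        shutil.copyfile(src, dst)
        if hasattr(os, "posix_fadvise"):
            fd = os.open(dst, os.O_RDWR)
            try:
                # Advisory only (assumed advice value); failure is harmless.
                os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
            except OSError:
                pass
            finally:
                os.close(fd)
        copied.append(dst)
    # Pass 2: the deferred fsyncs, after all the data has been written.
    for dst in copied:
        fd = os.open(dst, os.O_RDWR)
        try:
            os.fsync(fd)
        finally:
            os.close(fd)
    return copied
```

Deferring the fsyncs lets the OS schedule the writes for all files together rather than stalling after each one.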
Robert Haas [Sun, 14 Feb 2010 18:42:19 +0000 (18:42 +0000)]
Wrap calls to SearchSysCache and related functions using macros.
The purpose of this change is to eliminate the need for every caller
of SearchSysCache, SearchSysCacheCopy, SearchSysCacheExists,
GetSysCacheOid, and SearchSysCacheList to know the maximum number
of allowable keys for a syscache entry (currently 4). This will
make it far easier to increase the maximum number of keys in a
future release should we choose to do so, and it makes the code
shorter, too.
Magnus Hagander [Sun, 14 Feb 2010 14:10:23 +0000 (14:10 +0000)]
Make the msvc build system ask python about details of version and installation
prefix, instead of assuming it will always be following the default layout.
Not all the information we need is available on Windows, but the number of
assumptions is at least smaller this way than before.
Tom Lane [Sat, 13 Feb 2010 20:46:52 +0000 (20:46 +0000)]
Don't expose the inline definition of MemoryContextSwitchTo when FRONTEND is
defined. Its reference to CurrentMemoryContext causes link failures on some
platforms, evidently because the inline function gets compiled despite lack of
use. Per buildfarm member warthog.
Simon Riggs [Sat, 13 Feb 2010 16:15:48 +0000 (16:15 +0000)]
Fix relcache init file invalidation during Hot Standby for the case
where a database has a non-default tablespaceid. Pass thru MyDatabaseId
and MyDatabaseTableSpace to allow file path to be re-created in
standby and correct invalidation to take place in all cases.
Update and rework xact_commit_desc() debug messages.
Bug report from Tom by code inspection. Fix by me.
Tom Lane [Sat, 13 Feb 2010 02:34:16 +0000 (02:34 +0000)]
Support inlining various small performance-critical functions on non-GCC
compilers, by applying a configure check to see if the compiler will accept
an unreferenced "static inline foo ..." function without warnings. It is
believed that such warnings are the only reason not to declare inlined
functions in headers, if the compiler understands "inline" at all.
Simon Riggs [Sat, 13 Feb 2010 01:32:20 +0000 (01:32 +0000)]
Re-enable max_standby_delay = -1 using deadlock detection on startup
process. If startup waits on a buffer pin we send a request to all
backends to cancel themselves if they are holding the buffer pin
required and they are also waiting on a lock. If not, startup waits
until max_standby_delay before cancelling any backend waiting for
the requested buffer pin.
Simon Riggs [Sat, 13 Feb 2010 00:59:58 +0000 (00:59 +0000)]
Introduce WAL records to log reuse of btree pages, allowing conflict
resolution during Hot Standby. Page reuse interlock requested by Tom.
Analysis and patch by me.
Tom Lane [Fri, 12 Feb 2010 22:48:56 +0000 (22:48 +0000)]
Tweak the order of processing of WITH clauses so that they are processed
before we start analyzing the parent statement. This is to make it
more clear that the WITH isn't affected by anything in the parent.
I don't believe there's any actual bug here, because the stuff that
was being done before WITH didn't affect subqueries; but it's certainly
a potential for error (and apparently misled Marko into committing some
real errors...).
Andrew Dunstan [Fri, 12 Feb 2010 19:35:25 +0000 (19:35 +0000)]
Add plperl.on_plperl_init and plperl.on_plperlu_init settings for language-specific startup. Rename recently added plperl.on_perl_init to plperl.on_init. Also, code cleanup for utf8 hack. Patch from Tim Bunce, reviewed by Alex Hunsaker.
Tom Lane [Fri, 12 Feb 2010 17:33:21 +0000 (17:33 +0000)]
Extend the set of frame options supported for window functions.
This patch allows the frame to start from CURRENT ROW (in either RANGE or
ROWS mode), and it also adds support for ROWS n PRECEDING and ROWS n FOLLOWING
start and end points. (RANGE value PRECEDING/FOLLOWING isn't there yet ---
the grammar works, but that's all.)
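To make the ROWS-mode bounds concrete, here is a small model of how a frame could be derived from offsets relative to the current row. It is a sketch of the semantics only, not the executor code; the function name and the signed-offset encoding are assumptions of this example.

```python
def rows_frame(n_rows, current, start=None, end=0):
    """Return (first, last) row indexes of a ROWS-mode frame, or None if empty.

    Indexes are 0-based positions within a partition of n_rows rows.
    Offsets are relative to the current row: negative means n PRECEDING,
    positive means n FOLLOWING, 0 means CURRENT ROW. start=None stands for
    UNBOUNDED PRECEDING and end=None for UNBOUNDED FOLLOWING.
    """
    first = 0 if start is None else max(current + start, 0)
    last = n_rows - 1 if end is None else min(current + end, n_rows - 1)
    return (first, last) if first <= last else None
```

For example, in a five-row partition, ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING at the first row maps to rows_frame(5, 0, -1, 1), and ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING at the last row maps to rows_frame(5, 4, 0, None).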
Reduce the chatter to the log when starting a standby server. Don't
echo all the recovery.conf options. Don't emit the "initializing
recovery connections" message, which doesn't mean anything to a user.
Remove the "starting archive recovery" message and replace the
"automatic recovery in progress" message with a more informative message
saying whether the server is doing PITR, normal archive recovery, or
standby mode.
Check for partial WAL files in standby mode. If restore_command restores
a partial WAL file, assume it's because the file is just being copied to
the archive and treat it the same as "file not found" in standby mode.
pg_standby has a similar check, so it seems reasonable to have the same
level of protection in the built-in standby mode.
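The check can be pictured as a size comparison on the restored file. The sketch below assumes the default 16 MB segment size; the function name, the return values, and the non-standby outcome are hypothetical and chosen only for illustration.

```python
import os

XLOG_SEG_SIZE = 16 * 1024 * 1024  # assumed default segment size


def classify_restored_file(path, standby_mode, seg_size=XLOG_SEG_SIZE):
    """Return 'ok', 'not_found', or 'error' for a file left by restore_command."""
    if not os.path.exists(path):
        return "not_found"
    if os.path.getsize(path) == seg_size:
        return "ok"
    # Wrong size: in standby mode, assume the file is still being copied to
    # the archive and behave as if it were missing, so it is retried later.
    # Outside standby mode this sketch reports an error (an assumption).
    return "not_found" if standby_mode else "error"
```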
Simon Riggs [Thu, 11 Feb 2010 19:35:22 +0000 (19:35 +0000)]
Fix typo bug in Hot Standby from recent refactoring. Bug introduced
into code recently patched by Andres Freund, so quickly fixed by him
when bug report from Tatsuo Ishii arrived.
Teodor Sigaev [Thu, 11 Feb 2010 14:29:50 +0000 (14:29 +0000)]
Generic implementation of red-black binary tree. It's planned to be used in
several places, but for now only GIN uses it during index creation.
Using a self-balancing tree greatly speeds up index creation in corner cases
with preordered data.
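The preordered corner case is easy to reproduce with a toy tree. The sketch below does not implement a red-black tree; it assumes a plain non-balancing binary search tree as the point of comparison and only measures how deep it gets. The function name is hypothetical.

```python
def bst_height(keys):
    """Insert keys into a plain (non-balancing) BST and return its height."""
    root = None  # each node is a list: [key, left, right]
    height = 0
    for key in keys:
        depth = 1
        if root is None:
            root = [key, None, None]
        else:
            node = root
            while True:
                depth += 1
                side = 1 if key < node[0] else 2
                if node[side] is None:
                    node[side] = [key, None, None]
                    break
                node = node[side]
        height = max(height, depth)
    return height
```

Feeding it 1000 keys in sorted order gives a height of 1000, so each insert walks the whole chain. A red-black tree keeps the height at no more than 2*log2(n+1), about 20 for the same input.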
Now that streaming replication switches between streaming mode and
restoring from archive, the last WAL segment is not necessarily open at
the end of recovery. Fix assertion that assumed that.
Fujii Masao, fixing the assertion failure reported by Martin Pihlak.
Tom Lane [Wed, 10 Feb 2010 03:38:35 +0000 (03:38 +0000)]
Improve planner's choices about when to use hashing vs sorting for DISTINCT.
The previous coding missed a bet by sometimes picking the "sorted" path
from query_planner even though hashing would be preferable. To fix, we have
to be willing to make the choice sooner. This contorts things a little bit,
but I thought of a factorization that makes it not too awful.
Tom Lane [Tue, 9 Feb 2010 21:43:30 +0000 (21:43 +0000)]
Fix up rickety handling of relation-truncation interlocks.
Move rd_targblock, rd_fsm_nblocks, and rd_vm_nblocks from relcache to the smgr
relation entries, so that they will get reset to InvalidBlockNumber whenever
an smgr-level flush happens. Because we now send smgr invalidation messages
immediately (not at end of transaction) when a relation truncation occurs,
this ensures that other backends will reset their values before they next
access the relation. We no longer need the unreliable assumption that a
VACUUM that's doing a truncation will hold its AccessExclusive lock until
commit --- in fact, we can intentionally release that lock as soon as we've
completed the truncation. This patch therefore reverts (most of) Alvaro's
patch of 2009-11-10, as well as my marginal hacking on it yesterday. We can
also get rid of assorted no-longer-needed relcache flushes, which are far more
expensive than an smgr flush because they kill a lot more state.
In passing this patch fixes smgr_redo's failure to perform visibility-map
truncation, and cleans up some rather dubious assumptions in freespace.c and
visibilitymap.c about when rd_fsm_nblocks and rd_vm_nblocks can be out of
date.
Magnus Hagander [Tue, 9 Feb 2010 19:55:14 +0000 (19:55 +0000)]
Define the value for in6addr_any on MingW, since it provides the struct
only in the header files and not in any libraries, yet declares it as
an extern.
Move "Warm Standby Servers for High Availability" and "Hot Standby"
sections under "High Availability, Load Balancing, and Replication"
chapter. Streaming replication chapter needs a lot more work, but this
commit just moves things around.
Tom Lane [Tue, 9 Feb 2010 00:28:30 +0000 (00:28 +0000)]
Rearrange lazy-vacuum code a little bit to reduce the window between
truncating the table and transaction commit. This isn't really making
it safe, but at least there is no good reason to do free space map
cleanup within the risk window. Don't lock out cancel interrupts
until we have to, either.
Tom Lane [Mon, 8 Feb 2010 20:39:52 +0000 (20:39 +0000)]
Create an official API function for C functions to use to check if they are
being called as aggregates, and to get the aggregate transition state memory
context if needed. Use it instead of poking directly into AggState and
WindowAggState in places that shouldn't know so much.
We should have done this in 8.4, probably, but better late than never.
Tom Lane [Mon, 8 Feb 2010 16:50:21 +0000 (16:50 +0000)]
Fix serious performance bug in new implementation of VACUUM FULL:
cluster_rel necessarily builds an all-new toast table, so it's useless to
then go and VACUUM FULL the toast table.
Remove piece of code to zero out minRecoveryPoint when starting crash
recovery. It's zeroed out whenever a checkpoint is written, so the only
scenario where the removed code did anything is when you kill archive
recovery, remove recovery.conf, and start up the server, so that it goes
into crash recovery instead. That's a "don't do that" scenario, but it
seems better to not clear minRecoveryPoint but instead update it like we
do in archive recovery, which is what will now happen.
Tom Lane [Mon, 8 Feb 2010 05:53:55 +0000 (05:53 +0000)]
Remove CatalogCacheFlushRelation, and the reloidattr infrastructure that was
needed by nothing else.
The restructuring I just finished doing on cache management exposed to me how
silly this routine was. Its function was to go into the catcache and blow
away all entries related to a given relation when there was a relcache flush
on that relation. However, there is no point in removing a catcache entry
if the catalog row it represents is still valid --- and if it isn't valid,
there must have been a catcache entry flush on it, because that's triggered
directly by heap_update or heap_delete on the catalog row. So this routine
accomplished nothing except to blow away valid cache entries that we'd very
likely be wanting in the near future to help reconstruct the relcache entry.
Dumb.
On top of which, it required a subtle and easy-to-get-wrong attribute in
syscache definitions, ie, the column containing the OID of the related
relation if any. Removing that is a very useful maintenance simplification.
Tom Lane [Mon, 8 Feb 2010 04:33:55 +0000 (04:33 +0000)]
Remove old-style VACUUM FULL (which was known for a little while as
VACUUM FULL INPLACE), along with a boatload of subsidiary code and complexity.
Per discussion, the use case for this method of vacuuming is no longer large
enough to justify maintaining it; not to mention that we don't wish to invest
the work that would be needed to make it play nicely with Hot Standby.
Aside from the code directly related to old-style VACUUM FULL, this commit
removes support for certain WAL record types that could only be generated
within VACUUM FULL, redirect-pointer removal in heap_page_prune, and
nontransactional generation of cache invalidation sinval messages (the last
being the sticking point for Hot Standby).
We still have to retain all code that copes with finding HEAP_MOVED_OFF and
HEAP_MOVED_IN flag bits on existing tuples. This can't be removed as long
as we want to support in-place update from pre-9.0 databases.
Tom Lane [Sun, 7 Feb 2010 22:40:33 +0000 (22:40 +0000)]
Work around deadlock problems with VACUUM FULL/CLUSTER on system catalogs,
as per my recent proposal.
First, teach IndexBuildHeapScan to not wait for INSERT_IN_PROGRESS or
DELETE_IN_PROGRESS tuples to commit unless the index build is checking
uniqueness/exclusion constraints. If it isn't, there's no harm in just
indexing the in-doubt tuple.
Second, modify VACUUM FULL/CLUSTER to suppress reverifying
uniqueness/exclusion constraint properties while rebuilding indexes of
the target relation. This is reasonable because these commands aren't
meant to deal with corrupted-data situations. Constraint properties
will still be rechecked when an index is rebuilt by a REINDEX command.
This gets us out of the problem that new-style VACUUM FULL would often
wait for other transactions while holding exclusive lock on a system
catalog, leading to probable deadlock because those other transactions
need to look at the catalogs too. Although the real ultimate cause of
the problem is a debatable choice to release locks early after modifying
system catalogs, changing that choice would require pretty serious
analysis and is not something to be undertaken lightly or on a tight
schedule. The present patch fixes the problem in a fairly reasonable
way and should also improve the speed of VACUUM FULL/CLUSTER a little bit.