Tom Lane [Mon, 22 Feb 2010 15:29:46 +0000 (15:29 +0000)]
Let's try forcing errno to zero before issuing fsync. The current buildfarm
results claiming EBADF seem improbable enough that I'm not convinced fsync
is really returning that --- could it be failing to set errno at all?
Tom Lane [Mon, 22 Feb 2010 15:26:14 +0000 (15:26 +0000)]
Adjust pg_fsync_writethrough so that it will set errno when failing
on a platform that doesn't support this operation. The former coding
would allow an unrelated errno to be reported, which would be quite
misleading. Not sure if this has anything to do with the current
buildfarm failures, but it's certainly bogus as-is.
Move documentation of all recovery.conf option to a new chapter.
They used to be scattered between the "backup and restore" and "streaming
replication" chapters.
Tom Lane [Sat, 20 Feb 2010 21:24:02 +0000 (21:24 +0000)]
Clean up handling of XactReadOnly and RecoveryInProgress checks.
Add some checks that seem logically necessary, in particular let's make
real sure that HS slave sessions cannot create temp tables. (If they did
they would think that temp tables belonging to the master's session with
the same BackendId were theirs. We *must* not allow myTempNamespace to
become set in a slave session.)
Change setval() and nextval() so that they are only allowed on temp sequences
in a read-only transaction. This seems consistent with what we allow for
table modifications in read-only transactions. Since an HS slave can't have a
temp sequence, this also provides a nicer cure for the setval PANIC reported
by Erik Rijkers.
Make the error messages more uniform, and have them mention the specific
command being complained of. This seems worth the trifling amount of extra
code, since people are likely to see such messages a lot more than before.
Tom Lane [Fri, 19 Feb 2010 21:49:10 +0000 (21:49 +0000)]
Reduce the rescan cost estimate for Materialize nodes to cpu_operator_cost per
tuple, instead of the former cpu_tuple_cost. It is sane to charge less than
cpu_tuple_cost because Materialize never does any qual-checking or projection,
so it's got less overhead than most plan node types. In particular, we want
to have the same charge here as is charged for readout in cost_sort. That
avoids the problem recently exhibited by Teodor wherein the planner prefers
a useless sort over a materialize step in a context where a lot of rescanning
will happen. The rescan costs should be just about the same for both node
types, so make their estimates the same.
Not back-patching because all of the current logic for rescan cost estimates
is new in 9.0. The old handling of rescans is sufficiently not-sane that
changing this in that structure is a bit pointless, and might indeed cause
regressions.
Don't use O_DIRECT when writing WAL files if archiving or streaming is
enabled. Bypassing the kernel cache is counter-productive in that case,
because the archiver/walsender process will read from the WAL file
soon after it's written, and if it's not cached the read will cause
a physical read, eating I/O bandwidth available on the WAL drive.
Also, walreceiver process does unaligned writes, so disable O_DIRECT
in walreceiver process for that reason too.
Itagaki Takahiro [Fri, 19 Feb 2010 01:04:03 +0000 (01:04 +0000)]
Fix STOP WAL LOCATION in backup history files no to return the next
segment of XLOG_BACKUP_END record even if the the record is placed
at a segment boundary. Furthermore the previous implementation could
return nonexistent segment file name when the boundary is in segments
that has "FE" suffix; We never use segments with "FF" suffix.
Backpatch to 8.0, where hot backup was introduced.
Tom Lane [Thu, 18 Feb 2010 23:50:06 +0000 (23:50 +0000)]
Volatile-ize all five places where we expect a PG_TRY block to restore
old memory context in plpython. Before only one of them was marked
volatile, but per report from Zdenek Kotala, some compilers do the
wrong thing here.
Tom Lane [Thu, 18 Feb 2010 22:43:31 +0000 (22:43 +0000)]
Provide some rather hokey ways for EXPLAIN to print FieldStore and assignment
ArrayRef expressions that are not in the immediate context of an INSERT or
UPDATE targetlist. Such cases never arise in stored rules, so ruleutils.c
hadn't tried to handle them. However, they do occur in the targetlists of
plans derived from such statements, and now that EXPLAIN VERBOSE tries to
print targetlists, we need some way to deal with the case.
I chose to represent an assignment ArrayRef as "array[subscripts] := source",
which is fairly reasonable and doesn't omit any information. However,
FieldStore is problematic because the planner will fold multiple assignments
to fields of the same composite column into one FieldStore, resulting in a
structure that is hard to understand at all, let alone display comprehensibly.
So in that case I punted and just made it print the source expression(s).
Backpatch to 8.4 --- the lack of functionality exists in older releases,
but doesn't seem to be important for lack of anything that would call it.
Tom Lane [Thu, 18 Feb 2010 18:41:47 +0000 (18:41 +0000)]
Fix ExecEvalArrayRef to pass down the old value of the array element or slice
being assigned to, in case the expression to be assigned is a FieldStore that
would need to modify that value. The need for this was foreseen some time
ago, but not implemented then because we did not have arrays of composites.
Now we do, but the point evidently got overlooked in that patch. Net result
is that updating a field of an array element doesn't work right, as
illustrated if you try the new regression test on an unpatched backend.
Noted while experimenting with EXPLAIN VERBOSE, which has also got some issues
in this area.
Backpatch to 8.3, where arrays of composites were introduced.
Fix pq_getbyte_if_available() function. It was confused on what it
returns if no data is immediately available. Patch by me with numerous
fixes from Fujii Masao and Magnus Hagander.
Tom Lane [Thu, 18 Feb 2010 03:06:46 +0000 (03:06 +0000)]
Force READY portals into FAILED state when a transaction or subtransaction
is aborted, if they were created within the failed xact. This prevents
ExecutorEnd from being run on them, which is a good idea because they may
contain references to tables or other objects that no longer exist.
In particular this is hazardous when auto_explain is active, but it's
really rather surprising that nobody has seen an issue with this before.
I'm back-patching this to 8.4, since that's the first version that contains
auto_explain or an ExecutorEnd hook, but I wonder whether we shouldn't
back-patch further.
Tom Lane [Thu, 18 Feb 2010 01:29:10 +0000 (01:29 +0000)]
Fix up pg_dump's treatment of large object ownership and ACLs. We now emit
a separate archive entry for each BLOB, and use pg_dump's standard methods
for dealing with its ownership, ACL if any, and comment if any. This means
that switches like --no-owner and --no-privileges do what they're supposed
to. Preliminary testing says that performance is still reasonable even
with many blobs, though we'll have to see how that shakes out in the field.
Itagaki Takahiro [Wed, 17 Feb 2010 04:09:40 +0000 (04:09 +0000)]
Support new syntax and improve handling of parentheses in psql tab-completion.
Newly supported syntax are:
- ALTER {TABLE|INDEX|TABLESPACE} {SET|RESET} with options
- ALTER TABLE ALTER COLUMN {SET|RESET} with options
- ALTER TABLE ALTER COLUMN SET STORAGE
- CREATE INDEX CONCURRENTLY
- CREATE INDEX ON (without name)
- CREATE INDEX ... USING with pg_am.amname instead of hard-corded names
- CREATE TRIGGER with events
- DROP AGGREGATE function with arguments
Tom Lane [Wed, 17 Feb 2010 03:10:33 +0000 (03:10 +0000)]
When updating ShmemVariableCache from a checkpoint record, be sure to set
all the values derived from oldestXid, not just that field. Brain fade in
one of my patches associated with flat file removal, exposed by a report
from Fujii Masao.
With this change, xidVacLimit should always be valid, so remove a couple of
bits of complexity associated with the previous assumption that sometimes
it wouldn't get set right away.
Tom Lane [Wed, 17 Feb 2010 00:52:09 +0000 (00:52 +0000)]
Make NOTIFY_PAYLOAD_MAX_LENGTH depend explicitly on BLCKSZ and
NAMEDATALEN, so this code doesn't go nuts with smaller than default
BLCKSZ or larger than default NAMEDATALEN. The standard value is
still exactly 8000.
Tom Lane [Tue, 16 Feb 2010 22:34:57 +0000 (22:34 +0000)]
Replace the pg_listener-based LISTEN/NOTIFY mechanism with an in-memory queue.
In addition, add support for a "payload" string to be passed along with
each notify event.
This implementation should be significantly more efficient than the old one,
and is also more compatible with Hot Standby usage. There is not yet any
facility for HS slaves to receive notifications generated on the master,
although such a thing is possible in future.
Joachim Wieland, reviewed by Jeff Davis; also hacked on by me.
Andrew Dunstan [Tue, 16 Feb 2010 21:39:52 +0000 (21:39 +0000)]
Clean up package namespace use and use of Safe in plperl.
Prevent use of another buggy version of Safe.pm.
Only register the exit handler if we have successfully created an interpreter.
Change log level of perl warnings from NOTICE to WARNING.
The infrastructure is there if in future we decide to allow
DBAs to specify extra modules that will be allowed in trusted code.
However, for now the relevant variables are declared as lexicals
rather than as package variables, so that they are not (or should not be)
accessible.
Mostly code from Tim Bunce, reviewed by Alex Hunsaker, with some
tweaks by me.
Magnus Hagander [Tue, 16 Feb 2010 19:26:02 +0000 (19:26 +0000)]
Add emulation of non-blocking sockets to the win32 socket/signal layer,
and use this in pq_getbyte_if_available.
It's only a limited implementation which swithes the whole emulation layer
no non-blocking mode, but that's enough as long as non-blocking is only
used during a short period of time, and only one socket is accessed during
this time.
Alvaro Herrera [Mon, 15 Feb 2010 22:23:25 +0000 (22:23 +0000)]
Move main error message text in plperl into errmsg from errdetail,
and move the context information into errcontext instead of errmsg.
This makes them better conform to our guidelines.
Also remove a few errcode declarations that were providing the default
value ERRCODE_INTERNAL_ERROR.
Greg Stark [Mon, 15 Feb 2010 02:36:26 +0000 (02:36 +0000)]
Display explain buffers measurements in memory units rather than blocks. Also show "Total Buffer Usage" to hint that these are totals not averages per loop
Greg Stark [Mon, 15 Feb 2010 00:50:57 +0000 (00:50 +0000)]
Speed up CREATE DATABASE by deferring the fsyncs until after copying
all the data and using posix_fadvise to nudge the OS into flushing it
earlier. This also hopefully makes CREATE DATABASE avoid spamming the
cache.
Tests show a big speedup on Linux at least on some filesystems.
Robert Haas [Sun, 14 Feb 2010 18:42:19 +0000 (18:42 +0000)]
Wrap calls to SearchSysCache and related functions using macros.
The purpose of this change is to eliminate the need for every caller
of SearchSysCache, SearchSysCacheCopy, SearchSysCacheExists,
GetSysCacheOid, and SearchSysCacheList to know the maximum number
of allowable keys for a syscache entry (currently 4). This will
make it far easier to increase the maximum number of keys in a
future release should we choose to do so, and it makes the code
shorter, too.
Magnus Hagander [Sun, 14 Feb 2010 14:10:23 +0000 (14:10 +0000)]
Make the msvc build system ask python about details of version and installation
prefix, instead of assuming it will always be following the default layout.
All information we need is not available on Windows, but the number of
assumptions are at least fewer this way than before.
Tom Lane [Sat, 13 Feb 2010 20:46:52 +0000 (20:46 +0000)]
Don't expose the inline definition of MemoryContextSwitchTo when FRONTEND is
defined. Its reference to CurrentMemoryContext causes link failures on some
platforms, evidently because the inline function gets compiled despite lack of
use. Per buildfarm member warthog.
Simon Riggs [Sat, 13 Feb 2010 16:15:48 +0000 (16:15 +0000)]
Fix relcache init file invalidation during Hot Standby for the case
where a database has a non-default tablespaceid. Pass thru MyDatabaseId
and MyDatabaseTableSpace to allow file path to be re-created in
standby and correct invalidation to take place in all cases.
Update and rework xact_commit_desc() debug messages.
Bug report from Tom by code inspection. Fix by me.
Tom Lane [Sat, 13 Feb 2010 02:34:16 +0000 (02:34 +0000)]
Support inlining various small performance-critical functions on non-GCC
compilers, by applying a configure check to see if the compiler will accept
an unreferenced "static inline foo ..." function without warnings. It is
believed that such warnings are the only reason not to declare inlined
functions in headers, if the compiler understands "inline" at all.
Simon Riggs [Sat, 13 Feb 2010 01:32:20 +0000 (01:32 +0000)]
Re-enable max_standby_delay = -1 using deadlock detection on startup
process. If startup waits on a buffer pin we send a request to all
backends to cancel themselves if they are holding the buffer pin
required and they are also waiting on a lock. If not, startup waits
until max_standby_delay before cancelling any backend waiting for
the requested buffer pin.
Simon Riggs [Sat, 13 Feb 2010 00:59:58 +0000 (00:59 +0000)]
Introduce WAL records to log reuse of btree pages, allowing conflict
resolution during Hot Standby. Page reuse interlock requested by Tom.
Analysis and patch by me.
Tom Lane [Fri, 12 Feb 2010 22:48:56 +0000 (22:48 +0000)]
Tweak the order of processing of WITH clauses so that they are processed
before we start analyzing the parent statement. This is to make it
more clear that the WITH isn't affected by anything in the parent.
I don't believe there's any actual bug here, because the stuff that
was being done before WITH didn't affect subqueries; but it's certainly
a potential for error (and apparently misled Marko into committing some
real errors...).
Andrew Dunstan [Fri, 12 Feb 2010 19:35:25 +0000 (19:35 +0000)]
Add plperl.on_plperl_init and plperl.on_plperlu_init settings for language-specific startup. Rename recently added plperl.on_perl_init to plperl.on_init. Also, code cleanup for utf8 hack. Patch from Tim Bunce, reviewed by Alex Hunsaker.
Tom Lane [Fri, 12 Feb 2010 17:33:21 +0000 (17:33 +0000)]
Extend the set of frame options supported for window functions.
This patch allows the frame to start from CURRENT ROW (in either RANGE or
ROWS mode), and it also adds support for ROWS n PRECEDING and ROWS n FOLLOWING
start and end points. (RANGE value PRECEDING/FOLLOWING isn't there yet ---
the grammar works, but that's all.)
Reduce the chatter to the log when starting a standby server. Don't
echo all the recovery.conf options. Don't emit the "initializing
recovery connections" message, which doesn't mean anything to a user.
Remove the "starting archive recovery" message and replace the
"automatic recovery in progress" message with a more informative message
saying whether the server is doing PITR, normal archive recovery, or
standby mode.
Check for partial WAL files in standby mode. If restore_command restores
a partial WAL file, assume it's because the file is just being copied to
the archive and treat it the same as "file not found" in standby mode.
pg_standby has a similar check, so it seems reasonable to have the same
level of protection in the built-in standby mode.
Simon Riggs [Thu, 11 Feb 2010 19:35:22 +0000 (19:35 +0000)]
Fix typo bug in Hot Standby from recent refactoring. Bug introduced
into code recently patched by Andres Freund, so quickly fixed by him
when bug report from Tatsuo Ishii arrived.
Teodor Sigaev [Thu, 11 Feb 2010 14:29:50 +0000 (14:29 +0000)]
Generic implementation of red-black binary tree. It's planned to use in
several places, but for now only GIN uses it during index creation.
Using self-balanced tree greatly speeds up index creation in corner cases
with preordered data.
Now that streaming replication switches between streaming mode and
restoring from archive, the last WAL segment is not necessarily open at
the end of recovery. Fix assertion that assumed that.
Fujii Masao, fixing the assertion failure reported by Martin Pihlak.
Tom Lane [Wed, 10 Feb 2010 03:38:35 +0000 (03:38 +0000)]
Improve planner's choices about when to use hashing vs sorting for DISTINCT.
The previous coding missed a bet by sometimes picking the "sorted" path
from query_planner even though hashing would be preferable. To fix, we have
to be willing to make the choice sooner. This contorts things a little bit,
but I thought of a factorization that makes it not too awful.
Tom Lane [Tue, 9 Feb 2010 21:43:30 +0000 (21:43 +0000)]
Fix up rickety handling of relation-truncation interlocks.
Move rd_targblock, rd_fsm_nblocks, and rd_vm_nblocks from relcache to the smgr
relation entries, so that they will get reset to InvalidBlockNumber whenever
an smgr-level flush happens. Because we now send smgr invalidation messages
immediately (not at end of transaction) when a relation truncation occurs,
this ensures that other backends will reset their values before they next
access the relation. We no longer need the unreliable assumption that a
VACUUM that's doing a truncation will hold its AccessExclusive lock until
commit --- in fact, we can intentionally release that lock as soon as we've
completed the truncation. This patch therefore reverts (most of) Alvaro's
patch of 2009-11-10, as well as my marginal hacking on it yesterday. We can
also get rid of assorted no-longer-needed relcache flushes, which are far more
expensive than an smgr flush because they kill a lot more state.
In passing this patch fixes smgr_redo's failure to perform visibility-map
truncation, and cleans up some rather dubious assumptions in freespace.c and
visibilitymap.c about when rd_fsm_nblocks and rd_vm_nblocks can be out of
date.