Tom Lane [Mon, 4 Jan 2010 16:34:11 +0000 (16:34 +0000)]
Improve PGXS makefile system to allow the module's makefile to specify
where to install DATA and DOCS files. This is mainly intended to allow
versioned installation, eg, install into contrib/fooM.N/ rather than
directly into contrib/.
Write an end-of-backup WAL record at pg_stop_backup(), and wait for it at
recovery instead of reading the backup history file. This is more robust,
as it stops you from prematurely starting up an inconsisten cluster if the
backup history file is lost for some reason, or if the base backup was
never finished with pg_stop_backup().
This also paves the way for a simpler streaming replication patch, which
doesn't need to care about backup history files anymore.
The backup history file is still created and archived as before, but it's
not used by the system anymore. It's just for informational purposes now.
Bump PG_CONTROL_VERSION as the location of the backup startpoint is now
written to a new field in pg_control, and catversion because initdb is
required
Original patch by Fujii Masao per Simon's idea, with further fixes by me.
Tom Lane [Mon, 4 Jan 2010 02:44:40 +0000 (02:44 +0000)]
When estimating the selectivity of an inequality "column > constant" or
"column < constant", and the comparison value is in the first or last
histogram bin or outside the histogram entirely, try to fetch the actual
column min or max value using an index scan (if there is an index on the
column). If successful, replace the lower or upper histogram bound with
that value before carrying on with the estimate. This limits the
estimation error caused by moving min/max values when the comparison
value is close to the min or max. Per a complaint from Josh Berkus.
It is tempting to consider using this mechanism for mergejoinscansel as well,
but that would inject index fetches into main-line join estimation not just
endpoint cases. I'm refraining from that until we can get a better handle
on the costs of doing this type of lookup.
Tom Lane [Sun, 3 Jan 2010 05:39:08 +0000 (05:39 +0000)]
Dept of second thoughts: my first cut at supporting "x IS NOT NULL" btree
indexscans would do the wrong thing if index_rescan() was called with a
NULL instead of a new set of scankeys and the index was DESC order,
because sk_strategy would not get flipped a second time. I think
that those provisions for a NULL argument are dead code now as far as the
core backend goes, but possibly somebody somewhere is still using it.
In any case, this refactoring seems clearer, and it's definitely shorter.
Tom Lane [Sat, 2 Jan 2010 20:59:16 +0000 (20:59 +0000)]
Fix similar_escape() to convert parentheses to non-capturing style.
This is needed to avoid unwanted interference with SUBSTRING behavior,
as per bug #5257 from Roman Kononov. Also, add some basic intelligence
about character classes (bracket expressions) since we now have several
behaviors that aren't appropriate inside a character class.
As with the previous patch in this area, I'm reluctant to back-patch
since it might affect applications that are relying on the prior
behavior.
Tom Lane [Sat, 2 Jan 2010 17:53:57 +0000 (17:53 +0000)]
check_exclusion_constraint didn't actually work correctly for index
expressions: FormIndexDatum requires the estate's scantuple to already point
at the tuple the values are supposedly being extracted from. Adjust test
case so that this type of confusion will be exposed.
Per report from hubert depesz lubaczewski.
Tom Lane [Fri, 1 Jan 2010 23:03:10 +0000 (23:03 +0000)]
Add an "argisrow" field to NullTest nodes, following a plan made way back in
8.2beta but never carried out. This avoids repetitive tests of whether the
argument is of scalar or composite type. Also, be a bit more paranoid about
composite arguments in some places where we previously weren't checking.
Tom Lane [Fri, 1 Jan 2010 21:53:49 +0000 (21:53 +0000)]
Support "x IS NOT NULL" clauses as indexscan conditions. This turns out
to be just a minor extension of the previous patch that made "x IS NULL"
indexable, because we can treat the IS NOT NULL condition as if it were
"x < NULL" or "x > NULL" (depending on the index's NULLS FIRST/LAST option),
just like IS NULL is treated like "x = NULL". Aside from any possible
usefulness in its own right, this is an important improvement for
index-optimized MAX/MIN aggregates: it is now reliably possible to get
a column's min or max value cheaply, even when there are a lot of nulls
cluttering the interesting end of the index.
Magnus Hagander [Fri, 1 Jan 2010 14:57:16 +0000 (14:57 +0000)]
Make the win32 putenv() override update *all* present versions of the
MSVCRxx runtime, not just the current + Visual Studio 6 (MSVCRT). Clearly
there can be an almost unlimited number of runtimes loaded at the same
time.
Tom Lane [Wed, 30 Dec 2009 21:21:33 +0000 (21:21 +0000)]
Dept of second thoughts: recursive case in ANALYZE shouldn't emit a
pgstats message. This might need to be done differently later, but
with the current logic that's what should happen.
Tom Lane [Wed, 30 Dec 2009 20:32:14 +0000 (20:32 +0000)]
Revise pgstat's tracking of tuple changes to improve the reliability of
decisions about when to auto-analyze.
The previous code depended on n_live_tuples + n_dead_tuples - last_anl_tuples,
where all three of these numbers could be bad estimates from ANALYZE itself.
Even worse, in the presence of a steady flow of HOT updates and matching
HOT-tuple reclamations, auto-analyze might never trigger at all, even if all
three numbers are exactly right, because n_dead_tuples could hold steady.
To fix, replace last_anl_tuples with an accurately tracked count of the total
number of committed tuple inserts + updates + deletes since the last ANALYZE
on the table. This can still be compared to the same threshold as before, but
it's much more trustworthy than the old computation. Tracking this requires
one more intra-transaction counter per modified table within backends, but no
additional memory space in the stats collector. There probably isn't any
measurable speed difference; if anything it might be a bit faster than before,
since I was able to eliminate some per-tuple arithmetic operations in favor of
adding sums once per (sub)transaction.
Also, simplify the logic around pgstat vacuum and analyze reporting messages
by not trying to fold VACUUM ANALYZE into a single pgstat message.
The original thought behind this patch was to allow scheduling of analyzes
on parent tables by artificially inflating their changes_since_analyze count.
I've left that for a separate patch since this change seems to stand on its
own merit.
Tom Lane [Wed, 30 Dec 2009 03:45:46 +0000 (03:45 +0000)]
Set errno to zero before invoking SSL_read or SSL_write. It appears that
at least in some Windows versions, these functions are capable of returning
a failure indication without setting errno. That puts us into an infinite
loop if the previous value happened to be EINTR. Per report from Brendan
Hill.
Back-patch to 8.2. We could take it further back, but since this is only
known to be an issue on Windows and we don't support Windows before 8.2,
it does not seem worth the trouble.
Robert Haas [Wed, 30 Dec 2009 01:29:22 +0000 (01:29 +0000)]
Reject invalid input in int2vectorin.
Since the int2vector type is intended only for internal use, this patch doesn't
worry about prettifying the error messages, which has the fringe benefit of
avoiding creating additional translatable strings. For a type intended to be
used by end-users, we would want to do better, but the approach taken here
seems like the correct trade-off for this case.
Tom Lane [Tue, 29 Dec 2009 22:00:14 +0000 (22:00 +0000)]
Add an index on pg_inherits.inhparent, and use it to avoid seqscans in
find_inheritance_children(). This is a complete no-op in databases without
any inheritance. In databases where there are just a few entries in
pg_inherits, it could conceivably be a small loss. However, in databases with
many inheritance parents, it can be a big win.
Tom Lane [Tue, 29 Dec 2009 20:11:45 +0000 (20:11 +0000)]
Add the ability to store inheritance-tree statistics in pg_statistic,
and teach ANALYZE to compute such stats for tables that have subclasses.
Per my proposal of yesterday.
autovacuum still needs to be taught about running ANALYZE on parent tables
when their subclasses change, but the feature is useful even without that.
Previous fix for temporary file management broke returning a set from
PL/pgSQL function within an exception handler. Make sure we use the right
resource owner when we create the tuplestore to hold returned tuples.
Simplify tuplestore API so that the caller doesn't need to be in the right
memory context when calling tuplestore_put* functions. tuplestore.c
automatically switches to the memory context used when the tuplestore was
created. Tuplesort was already modified like this earlier. This patch also
removes the now useless MemoryContextSwitch calls from callers.
Report by Aleksei on pgsql-bugs on Dec 22 2009. Backpatch to 8.1, like
the previous patch that broke this.
Tom Lane [Sun, 27 Dec 2009 19:40:07 +0000 (19:40 +0000)]
Avoid memory leak if pgstat_vacuum_stat is interrupted partway through.
The temporary hash tables made by pgstat_collect_oids should be allocated
in a short-term memory context, which is not the default behavior of
hash_create. Noted while looking through hash_create calls in connection
with Robert Haas' recent complaint.
This is a pre-existing bug, but it doesn't seem important enough to
back-patch. The hash table is not so large that it would matter unless this
happened many times within a session, which seems quite unlikely.
Tom Lane [Sun, 27 Dec 2009 18:55:52 +0000 (18:55 +0000)]
Remove a couple of unnecessary calls of CreateCacheMemoryContext. These
probably got there via blind copy-and-paste from one of the legitimate
callers, so rearrange and comment that code a bit to make it clearer that
this isn't a necessary prerequisite to hash_create. Per observation
from Robert Haas.
Magnus Hagander [Sun, 27 Dec 2009 16:01:39 +0000 (16:01 +0000)]
If the MSVCRT module is not found in the current binary, proceed to update
system and local environments anyway, instead of aborting. (This will
happen in a MSVC build with no or very few external libraries linked in)
Tom Lane [Fri, 25 Dec 2009 17:11:32 +0000 (17:11 +0000)]
Fix brain fade in join-removal patch: a pushed-down clause in the outer join's
restrict list is not just something to ignore, it's actually grounds to
abandon the optimization entirely. Per bug #5255 from Matteo Beccati.
Tom Lane [Thu, 24 Dec 2009 23:36:39 +0000 (23:36 +0000)]
Try to improve the clarity of the psql documentation for the \d family of
commands, as per recent discussion. Includes suggestions from Adrian Klaver
and Filip Rembialkowski.
Bruce Momjian [Thu, 24 Dec 2009 22:09:24 +0000 (22:09 +0000)]
Binary upgrade:
Modify pg_dump --binary-upgrade and add backend support routines to
support the preservation of pg_type oids when doing a binary upgrade.
This allows user-defined composite types and arrays to be binary
upgraded.
Tom Lane [Thu, 24 Dec 2009 17:52:04 +0000 (17:52 +0000)]
Fix wrong WAL info value generated when gistContinueInsert() performs an
index page split. This would result in index corruption, or even more likely
an error during WAL replay, if we were unlucky enough to crash during
end-of-recovery cleanup after having completed an incomplete GIST insertion.
Tom Lane [Wed, 23 Dec 2009 17:41:45 +0000 (17:41 +0000)]
Allow the index name to be omitted in CREATE INDEX, causing the system to
choose an index name the same as it would do for an unnamed index constraint.
(My recent changes to the index naming logic have helped to ensure that this
will be a reasonable choice.) Per a suggestion from Peter.
A necessary side-effect is to promote CONCURRENTLY to type_func_name_keyword
status, ie, it can't be a table/column/index name anymore unless quoted.
This is not all bad, since we have heard more than once of people typing
CREATE INDEX CONCURRENTLY ON foo (...) and getting a normal index build of
an index named "concurrently", which was not what they wanted. Now this
syntax will result in a concurrent build of an index with system-chosen
name; which they can rename afterwards if they want something else.
Tom Lane [Wed, 23 Dec 2009 16:43:43 +0000 (16:43 +0000)]
Remove code that attempted to rename index columns to keep them in sync with
their underlying table columns. That code was not bright enough to cope with
collision situations (ie, new name conflicts with some other column of the
index). Since there is no functional reason to do this at all, trying to
upgrade the logic to be bulletproof doesn't seem worth the trouble.
This change means that both the index name and the column names of an index
are set when it's created, and won't be automatically changed when the
underlying table columns are renamed. Neatnik DBAs are still free to rename
them manually, of course.
Always pass catalog id to the options validator function specified in
CREATE FOREIGN DATA WRAPPER. Arguably it wasn't a bug because the
documentation said that it's passed the catalog ID or zero, but surely
we should provide it when it's known. And there isn't currently any
scenario where it's not known, and I can't imagine having one in the
future either, so better remove the "or zero" escape hatch and always
pass a valid catalog ID. Backpatch to 8.4.
Tom Lane [Wed, 23 Dec 2009 02:35:25 +0000 (02:35 +0000)]
Adjust naming of indexes and their columns per recent discussion.
Index expression columns are now named after the FigureColname result for
their expressions, rather than always being "pg_expression_N". Digits are
appended to this name if needed to make the column name unique within the
index. (That happens for regular columns too, thus fixing the old problem
that CREATE INDEX fooi ON foo (f1, f1) fails. Before exclusion indexes
there was no real reason to do such a thing, but now maybe there is.)
Default names for indexes and associated constraints now include the column
names of all their columns, not only the first one as in previous practice.
(Of course, this will be truncated as needed to fit in NAMEDATALEN. Also,
pkey indexes retain the historical behavior of not naming specific columns
at all.)
An example of the results:
regression=# create table foo (f1 int, f2 text,
regression(# exclude (f1 with =, lower(f2) with =));
NOTICE: CREATE TABLE / EXCLUDE will create implicit index "foo_f1_lower_exclusion" for table "foo"
CREATE TABLE
regression=# \d foo_f1_lower_exclusion
Index "public.foo_f1_lower_exclusion"
Column | Type | Definition
--------+---------+------------
f1 | integer | f1
lower | text | lower(f2)
btree, for table "public.foo"
Tom Lane [Tue, 22 Dec 2009 23:54:17 +0000 (23:54 +0000)]
Disallow comments on columns of relation types other than tables, views,
and composite types, which are the only relkinds for which pg_dump support
exists for dumping column comments. There is no obvious usefulness for
comments on columns of sequences or toast tables; and while comments on
index columns might have some value, it's not worth the risk of compatibility
problems due to possible changes in the algorithm for assigning names to
index columns. Per discussion.
In consequence, remove now-dead code for copying such comments in CREATE TABLE
LIKE.
Robert Haas [Mon, 21 Dec 2009 01:34:11 +0000 (01:34 +0000)]
More cleanups for the recent large object permissions patch.
Rewrite or adjust various comments for clarity. Remove one bogus comment that
doesn't reflect what the code actually does. Improve the description of the
lo_compat_privileges option.
Tom Lane [Sun, 20 Dec 2009 18:28:14 +0000 (18:28 +0000)]
There is no good reason for the CREATE TABLE LIKE INCLUDING COMMENTS code to
have hard-wired knowledge of the rules for naming index columns. It can
just look at the actual names in the source index, instead. Do some minor
formatting cleanup too.
Tom Lane [Sat, 19 Dec 2009 04:08:32 +0000 (04:08 +0000)]
Bump catversion to reflect the fact that HS patch changed pg_proc
contents, and PG_CONTROL_VERSION to reflect the fact that it changed
pg_control contents. (I see we did at least remember to change
XLOG_PAGE_MAGIC for the WAL contents changes.)
Simon Riggs [Sat, 19 Dec 2009 01:32:45 +0000 (01:32 +0000)]
Allow read only connections during recovery, known as Hot Standby.
Enabled by recovery_connections = on (default) and forcing archive recovery using a recovery.conf. Recovery processing now emulates the original transactions as they are replayed, providing full locking and MVCC behaviour for read only queries. Recovery must enter consistent state before connections are allowed, so there is a delay, typically short, before connections succeed. Replay of recovering transactions can conflict and in some cases deadlock with queries during recovery; these result in query cancellation after max_standby_delay seconds have expired. Infrastructure changes have minor effects on normal running, though introduce four new types of WAL record.
New test mode "make standbycheck" allows regression tests of static command behaviour on a standby server while in recovery. Typical and extreme dynamic behaviours have been checked via code inspection and manual testing. Few port specific behaviours have been utilised, though primary testing has been on Linux only so far.
This commit is the basic patch. Additional changes will follow in this release to enhance some aspects of behaviour, notably improved handling of conflicts, deadlock detection and query cancellation. Changes to VACUUM FULL are also required.
Simon Riggs, with significant and lengthy review by Heikki Linnakangas, including streamlined redesign of snapshot creation and two-phase commit.
Important contributions from Florian Pflug, Mark Kirkwood, Merlin Moncure, Greg Stark, Gianni Ciolli, Gabriele Bartolini, Hannu Krosing, Robert Haas, Tatsuo Ishii, Hiroyuki Yamada plus support and feedback from many other community members.
Tom Lane [Fri, 18 Dec 2009 18:45:50 +0000 (18:45 +0000)]
Force the TZ environment variable to be set during initdb. This is to
short-circuit the rather expensive identify_system_timezone() procedure,
which we have no real need for during initdb since nothing done here depends
on the timezone setting. Since we launch quite a few standalone backends
during the initdb sequence, this adds up to a significant savings, and seems
worth doing to save developer time even though it will hardly matter to end
users. Per my report today on pgsql-hackers.
Robert Haas [Thu, 17 Dec 2009 14:36:16 +0000 (14:36 +0000)]
Improve documentation for pg_largeobject changes.
Rewrite the documentation in more idiomatic English, and in the process make
it somewhat more succinct. Move the discussion of specific large object
privileges out of the "server-side functions" section, where it certainly
doesn't belong, and into "implementation features". That might not be
exactly right either, but it doesn't seem worth creating a new section for
this amount of information. Fix a few spelling and layout problems, too.
Peter Eisentraut [Wed, 16 Dec 2009 23:05:00 +0000 (23:05 +0000)]
Don't unblock SIGQUIT in the SIGQUIT handler
This was possibly linked to a deadlock-like situation in glibc syslog code
invoked by the ereport call in quickdie(). In any case, a signal handler
should not unblock its own signal unless there is a specific reason to.
Tom Lane [Wed, 16 Dec 2009 22:24:13 +0000 (22:24 +0000)]
Avoid a premature coercion failure in transformSetOperationTree() when
presented with an UNKNOWN-type Var, which can happen in cases where an
unknown literal appeared in a subquery. While many such cases will fail
later on anyway in the planner, there are some cases where the planner is
able to flatten the query and replace the Var by the constant before it has
to coerce the union column to the final type. I had added this check in 8.4
to provide earlier/better error detection, but it causes a regression for
some cases that worked OK before. Fix by not making the check if the input
node is UNKNOWN type and not a Const or Param. If it isn't going to work,
it will fail anyway at plan time, with the only real loss being inability to
provide an error cursor. Per gripe from Britt Piehler.
In passing, rename a couple of variables to remove confusion from an
inner scope masking the same variable names in an outer scope.
Robert Haas [Wed, 16 Dec 2009 22:16:16 +0000 (22:16 +0000)]
Several fixes for EXPLAIN (FORMAT YAML), plus one for EXPLAIN (FORMAT JSON).
ExplainSeparatePlans() was busted for both JSON and YAML output - the present
code is a holdover from the original version of my machine-readable explain
patch, which didn't have the grouping_stack machinery. Also, fix an odd
distribution of labor between ExplainBeginGroup() and ExplainYAMLLineStarting()
when marking lists with "- ", with each providing one character. This broke
the output format for multi-query statements. Also, fix ExplainDummyGroup()
for the YAML output format.
Along the way, make the YAML format use escape_yaml() in situations where the
JSON format uses escape_json(). Right now, it doesn't matter because all the
values are known not to need escaping, but it seems safer this way. Finally,
I added some comments to better explain what the YAML output format is doing.
Greg Sabino Mullane reported the issues with multi-query statements.
Analysis and remaining cleanups by me.
Michael Meskes [Wed, 16 Dec 2009 10:15:07 +0000 (10:15 +0000)]
Fixed auto-prepare to not try preparing statements that are not preparable. Bug
found and solved by Boszormenyi Zoltan <zb@cybertec.at>, some small adjustments
by me.
Peter Eisentraut [Tue, 15 Dec 2009 22:59:55 +0000 (22:59 +0000)]
Python 3 support in PL/Python
Behaves more or less unchanged compared to Python 2, but the new language
variant is called plpython3u. Documentation describing the naming scheme
is included.