Tom Lane [Wed, 30 Dec 2009 21:21:33 +0000 (21:21 +0000)]
Dept of second thoughts: recursive case in ANALYZE shouldn't emit a
pgstats message. This might need to be done differently later, but
with the current logic that's what should happen.
Tom Lane [Wed, 30 Dec 2009 20:32:14 +0000 (20:32 +0000)]
Revise pgstat's tracking of tuple changes to improve the reliability of
decisions about when to auto-analyze.
The previous code depended on n_live_tuples + n_dead_tuples - last_anl_tuples,
where all three of these numbers could be bad estimates from ANALYZE itself.
Even worse, in the presence of a steady flow of HOT updates and matching
HOT-tuple reclamations, auto-analyze might never trigger at all, even if all
three numbers are exactly right, because n_dead_tuples could hold steady.
To fix, replace last_anl_tuples with an accurately tracked count of the total
number of committed tuple inserts + updates + deletes since the last ANALYZE
on the table. This can still be compared to the same threshold as before, but
it's much more trustworthy than the old computation. Tracking this requires
one more intra-transaction counter per modified table within backends, but no
additional memory space in the stats collector. There probably isn't any
measurable speed difference; if anything it might be a bit faster than before,
since I was able to eliminate some per-tuple arithmetic operations in favor of
adding sums once per (sub)transaction.
Also, simplify the logic around pgstat vacuum and analyze reporting messages
by not trying to fold VACUUM ANALYZE into a single pgstat message.
The original thought behind this patch was to allow scheduling of analyzes
on parent tables by artificially inflating their changes_since_analyze count.
I've left that for a separate patch since this change seems to stand on its
own merit.
Tom Lane [Wed, 30 Dec 2009 03:45:46 +0000 (03:45 +0000)]
Set errno to zero before invoking SSL_read or SSL_write. It appears that
at least in some Windows versions, these functions are capable of returning
a failure indication without setting errno. That puts us into an infinite
loop if the previous value happened to be EINTR. Per report from Brendan
Hill.
Back-patch to 8.2. We could take it further back, but since this is only
known to be an issue on Windows and we don't support Windows before 8.2,
it does not seem worth the trouble.
Robert Haas [Wed, 30 Dec 2009 01:29:22 +0000 (01:29 +0000)]
Reject invalid input in int2vectorin.
Since the int2vector type is intended only for internal use, this patch doesn't
worry about prettifying the error messages, which has the fringe benefit of
avoiding creating additional translatable strings. For a type intended to be
used by end-users, we would want to do better, but the approach taken here
seems like the correct trade-off for this case.
Tom Lane [Tue, 29 Dec 2009 22:00:14 +0000 (22:00 +0000)]
Add an index on pg_inherits.inhparent, and use it to avoid seqscans in
find_inheritance_children(). This is a complete no-op in databases without
any inheritance. In databases where there are just a few entries in
pg_inherits, it could conceivably be a small loss. However, in databases with
many inheritance parents, it can be a big win.
Tom Lane [Tue, 29 Dec 2009 20:11:45 +0000 (20:11 +0000)]
Add the ability to store inheritance-tree statistics in pg_statistic,
and teach ANALYZE to compute such stats for tables that have subclasses.
Per my proposal of yesterday.
autovacuum still needs to be taught about running ANALYZE on parent tables
when their subclasses change, but the feature is useful even without that.
Previous fix for temporary file management broke returning a set from
PL/pgSQL function within an exception handler. Make sure we use the right
resource owner when we create the tuplestore to hold returned tuples.
Simplify tuplestore API so that the caller doesn't need to be in the right
memory context when calling tuplestore_put* functions. tuplestore.c
automatically switches to the memory context used when the tuplestore was
created. Tuplesort was already modified like this earlier. This patch also
removes the now useless MemoryContextSwitch calls from callers.
Report by Aleksei on pgsql-bugs on Dec 22 2009. Backpatch to 8.1, like
the previous patch that broke this.
Tom Lane [Sun, 27 Dec 2009 19:40:07 +0000 (19:40 +0000)]
Avoid memory leak if pgstat_vacuum_stat is interrupted partway through.
The temporary hash tables made by pgstat_collect_oids should be allocated
in a short-term memory context, which is not the default behavior of
hash_create. Noted while looking through hash_create calls in connection
with Robert Haas' recent complaint.
This is a pre-existing bug, but it doesn't seem important enough to
back-patch. The hash table is not so large that it would matter unless this
happened many times within a session, which seems quite unlikely.
Tom Lane [Sun, 27 Dec 2009 18:55:52 +0000 (18:55 +0000)]
Remove a couple of unnecessary calls of CreateCacheMemoryContext. These
probably got there via blind copy-and-paste from one of the legitimate
callers, so rearrange and comment that code a bit to make it clearer that
this isn't a necessary prerequisite to hash_create. Per observation
from Robert Haas.
Magnus Hagander [Sun, 27 Dec 2009 16:01:39 +0000 (16:01 +0000)]
If the MSVCRT module is not found in the current binary, proceed to update
system and local environments anyway, instead of aborting. (This will
happen in a MSVC build with no or very few external libraries linked in)
Tom Lane [Fri, 25 Dec 2009 17:11:32 +0000 (17:11 +0000)]
Fix brain fade in join-removal patch: a pushed-down clause in the outer join's
restrict list is not just something to ignore, it's actually grounds to
abandon the optimization entirely. Per bug #5255 from Matteo Beccati.
Tom Lane [Thu, 24 Dec 2009 23:36:39 +0000 (23:36 +0000)]
Try to improve the clarity of the psql documentation for the \d family of
commands, as per recent discussion. Includes suggestions from Adrian Klaver
and Filip Rembialkowski.
Bruce Momjian [Thu, 24 Dec 2009 22:09:24 +0000 (22:09 +0000)]
Binary upgrade:
Modify pg_dump --binary-upgrade and add backend support routines to
support the preservation of pg_type oids when doing a binary upgrade.
This allows user-defined composite types and arrays to be binary
upgraded.
Tom Lane [Thu, 24 Dec 2009 17:52:04 +0000 (17:52 +0000)]
Fix wrong WAL info value generated when gistContinueInsert() performs an
index page split. This would result in index corruption, or even more likely
an error during WAL replay, if we were unlucky enough to crash during
end-of-recovery cleanup after having completed an incomplete GIST insertion.
Tom Lane [Wed, 23 Dec 2009 17:41:45 +0000 (17:41 +0000)]
Allow the index name to be omitted in CREATE INDEX, causing the system to
choose an index name the same as it would do for an unnamed index constraint.
(My recent changes to the index naming logic have helped to ensure that this
will be a reasonable choice.) Per a suggestion from Peter.
A necessary side-effect is to promote CONCURRENTLY to type_func_name_keyword
status, ie, it can't be a table/column/index name anymore unless quoted.
This is not all bad, since we have heard more than once of people typing
CREATE INDEX CONCURRENTLY ON foo (...) and getting a normal index build of
an index named "concurrently", which was not what they wanted. Now this
syntax will result in a concurrent build of an index with system-chosen
name; which they can rename afterwards if they want something else.
Tom Lane [Wed, 23 Dec 2009 16:43:43 +0000 (16:43 +0000)]
Remove code that attempted to rename index columns to keep them in sync with
their underlying table columns. That code was not bright enough to cope with
collision situations (ie, new name conflicts with some other column of the
index). Since there is no functional reason to do this at all, trying to
upgrade the logic to be bulletproof doesn't seem worth the trouble.
This change means that both the index name and the column names of an index
are set when it's created, and won't be automatically changed when the
underlying table columns are renamed. Neatnik DBAs are still free to rename
them manually, of course.
Always pass catalog id to the options validator function specified in
CREATE FOREIGN DATA WRAPPER. Arguably it wasn't a bug because the
documentation said that it's passed the catalog ID or zero, but surely
we should provide it when it's known. And there isn't currently any
scenario where it's not known, and I can't imagine having one in the
future either, so better remove the "or zero" escape hatch and always
pass a valid catalog ID. Backpatch to 8.4.
Tom Lane [Wed, 23 Dec 2009 02:35:25 +0000 (02:35 +0000)]
Adjust naming of indexes and their columns per recent discussion.
Index expression columns are now named after the FigureColname result for
their expressions, rather than always being "pg_expression_N". Digits are
appended to this name if needed to make the column name unique within the
index. (That happens for regular columns too, thus fixing the old problem
that CREATE INDEX fooi ON foo (f1, f1) fails. Before exclusion indexes
there was no real reason to do such a thing, but now maybe there is.)
Default names for indexes and associated constraints now include the column
names of all their columns, not only the first one as in previous practice.
(Of course, this will be truncated as needed to fit in NAMEDATALEN. Also,
pkey indexes retain the historical behavior of not naming specific columns
at all.)
An example of the results:
regression=# create table foo (f1 int, f2 text,
regression(# exclude (f1 with =, lower(f2) with =));
NOTICE: CREATE TABLE / EXCLUDE will create implicit index "foo_f1_lower_exclusion" for table "foo"
CREATE TABLE
regression=# \d foo_f1_lower_exclusion
Index "public.foo_f1_lower_exclusion"
Column | Type | Definition
--------+---------+------------
f1 | integer | f1
lower | text | lower(f2)
btree, for table "public.foo"
Tom Lane [Tue, 22 Dec 2009 23:54:17 +0000 (23:54 +0000)]
Disallow comments on columns of relation types other than tables, views,
and composite types, which are the only relkinds for which pg_dump support
exists for dumping column comments. There is no obvious usefulness for
comments on columns of sequences or toast tables; and while comments on
index columns might have some value, it's not worth the risk of compatibility
problems due to possible changes in the algorithm for assigning names to
index columns. Per discussion.
In consequence, remove now-dead code for copying such comments in CREATE TABLE
LIKE.
Robert Haas [Mon, 21 Dec 2009 01:34:11 +0000 (01:34 +0000)]
More cleanups for the recent large object permissions patch.
Rewrite or adjust various comments for clarity. Remove one bogus comment that
doesn't reflect what the code actually does. Improve the description of the
lo_compat_privileges option.
Tom Lane [Sun, 20 Dec 2009 18:28:14 +0000 (18:28 +0000)]
There is no good reason for the CREATE TABLE LIKE INCLUDING COMMENTS code to
have hard-wired knowledge of the rules for naming index columns. It can
just look at the actual names in the source index, instead. Do some minor
formatting cleanup too.
Tom Lane [Sat, 19 Dec 2009 04:08:32 +0000 (04:08 +0000)]
Bump catversion to reflect the fact that HS patch changed pg_proc
contents, and PG_CONTROL_VERSION to reflect the fact that it changed
pg_control contents. (I see we did at least remember to change
XLOG_PAGE_MAGIC for the WAL contents changes.)
Simon Riggs [Sat, 19 Dec 2009 01:32:45 +0000 (01:32 +0000)]
Allow read only connections during recovery, known as Hot Standby.
Enabled by recovery_connections = on (default) and forcing archive recovery using a recovery.conf. Recovery processing now emulates the original transactions as they are replayed, providing full locking and MVCC behaviour for read only queries. Recovery must enter consistent state before connections are allowed, so there is a delay, typically short, before connections succeed. Replay of recovering transactions can conflict and in some cases deadlock with queries during recovery; these result in query cancellation after max_standby_delay seconds have expired. Infrastructure changes have minor effects on normal running, though introduce four new types of WAL record.
New test mode "make standbycheck" allows regression tests of static command behaviour on a standby server while in recovery. Typical and extreme dynamic behaviours have been checked via code inspection and manual testing. Few port specific behaviours have been utilised, though primary testing has been on Linux only so far.
This commit is the basic patch. Additional changes will follow in this release to enhance some aspects of behaviour, notably improved handling of conflicts, deadlock detection and query cancellation. Changes to VACUUM FULL are also required.
Simon Riggs, with significant and lengthy review by Heikki Linnakangas, including streamlined redesign of snapshot creation and two-phase commit.
Important contributions from Florian Pflug, Mark Kirkwood, Merlin Moncure, Greg Stark, Gianni Ciolli, Gabriele Bartolini, Hannu Krosing, Robert Haas, Tatsuo Ishii, Hiroyuki Yamada plus support and feedback from many other community members.
Tom Lane [Fri, 18 Dec 2009 18:45:50 +0000 (18:45 +0000)]
Force the TZ environment variable to be set during initdb. This is to
short-circuit the rather expensive identify_system_timezone() procedure,
which we have no real need for during initdb since nothing done here depends
on the timezone setting. Since we launch quite a few standalone backends
during the initdb sequence, this adds up to a significant savings, and seems
worth doing to save developer time even though it will hardly matter to end
users. Per my report today on pgsql-hackers.
Robert Haas [Thu, 17 Dec 2009 14:36:16 +0000 (14:36 +0000)]
Improve documentation for pg_largeobject changes.
Rewrite the documentation in more idiomatic English, and in the process make
it somewhat more succinct. Move the discussion of specific large object
privileges out of the "server-side functions" section, where it certainly
doesn't belong, and into "implementation features". That might not be
exactly right either, but it doesn't seem worth creating a new section for
this amount of information. Fix a few spelling and layout problems, too.
Peter Eisentraut [Wed, 16 Dec 2009 23:05:00 +0000 (23:05 +0000)]
Don't unblock SIGQUIT in the SIGQUIT handler
This was possibly linked to a deadlock-like situation in glibc syslog code
invoked by the ereport call in quickdie(). In any case, a signal handler
should not unblock its own signal unless there is a specific reason to.
Tom Lane [Wed, 16 Dec 2009 22:24:13 +0000 (22:24 +0000)]
Avoid a premature coercion failure in transformSetOperationTree() when
presented with an UNKNOWN-type Var, which can happen in cases where an
unknown literal appeared in a subquery. While many such cases will fail
later on anyway in the planner, there are some cases where the planner is
able to flatten the query and replace the Var by the constant before it has
to coerce the union column to the final type. I had added this check in 8.4
to provide earlier/better error detection, but it causes a regression for
some cases that worked OK before. Fix by not making the check if the input
node is UNKNOWN type and not a Const or Param. If it isn't going to work,
it will fail anyway at plan time, with the only real loss being inability to
provide an error cursor. Per gripe from Britt Piehler.
In passing, rename a couple of variables to remove confusion from an
inner scope masking the same variable names in an outer scope.
Robert Haas [Wed, 16 Dec 2009 22:16:16 +0000 (22:16 +0000)]
Several fixes for EXPLAIN (FORMAT YAML), plus one for EXPLAIN (FORMAT JSON).
ExplainSeparatePlans() was busted for both JSON and YAML output - the present
code is a holdover from the original version of my machine-readable explain
patch, which didn't have the grouping_stack machinery. Also, fix an odd
distribution of labor between ExplainBeginGroup() and ExplainYAMLLineStarting()
when marking lists with "- ", with each providing one character. This broke
the output format for multi-query statements. Also, fix ExplainDummyGroup()
for the YAML output format.
Along the way, make the YAML format use escape_yaml() in situations where the
JSON format uses escape_json(). Right now, it doesn't matter because all the
values are known not to need escaping, but it seems safer this way. Finally,
I added some comments to better explain what the YAML output format is doing.
Greg Sabino Mullane reported the issues with multi-query statements.
Analysis and remaining cleanups by me.
Michael Meskes [Wed, 16 Dec 2009 10:15:07 +0000 (10:15 +0000)]
Fixed auto-prepare to not try preparing statements that are not preparable. Bug
found and solved by Boszormenyi Zoltan <zb@cybertec.at>, some small adjustments
by me.
Peter Eisentraut [Tue, 15 Dec 2009 22:59:55 +0000 (22:59 +0000)]
Python 3 support in PL/Python
Behaves more or less unchanged compared to Python 2, but the new language
variant is called plpython3u. Documentation describing the naming scheme
is included.
Tom Lane [Tue, 15 Dec 2009 20:37:17 +0000 (20:37 +0000)]
Avoid unnecessary copying of source string when generating a cloned TParser.
For long source strings the copying results in O(N^2) behavior, and the
multiplier can be significant if wide-char conversion is involved.
Tom Lane [Tue, 15 Dec 2009 17:57:48 +0000 (17:57 +0000)]
Support ORDER BY within aggregate function calls, at long last providing a
non-kluge method for controlling the order in which values are fed to an
aggregate function. At the same time eliminate the old implementation
restriction that DISTINCT was only supported for single-argument aggregates.
Possibly release-notable behavioral change: formerly, agg(DISTINCT x)
dropped null values of x unconditionally. Now, it does so only if the
agg transition function is strict; otherwise nulls are treated as DISTINCT
normally would, ie, you get one copy.
Robert Haas [Tue, 15 Dec 2009 04:57:48 +0000 (04:57 +0000)]
Add an EXPLAIN (BUFFERS) option to show buffer-usage statistics.
This patch also removes buffer-usage statistics from the track_counts
output, since this (or the global server statistics) is deemed to be a better
interface to this information.
Itagaki Takahiro, reviewed by Euler Taveira de Oliveira.
Tom Lane [Mon, 14 Dec 2009 02:15:54 +0000 (02:15 +0000)]
Fix a bug introduced when set-returning SQL functions were made inline-able:
we have to cope with the possibility that the declared result rowtype contains
dropped columns. This fails in 8.4, as per bug #5240.
While at it, be more paranoid about inserting binary coercions when inlining.
The pre-8.4 code did not really need to worry about that because it could not
inline at all in any case where an added coercion could change the behavior
of the function's statement. However, when inlining a SRF we allow sorting,
grouping, and set-ops such as UNION. In these cases, modifying one of the
targetlist entries that the sort/group/setop depends on could conceivably
change the behavior of the function's statement --- so don't inline when
such a case applies.
Itagaki Takahiro [Mon, 14 Dec 2009 00:39:11 +0000 (00:39 +0000)]
Additional fixes for large object access control.
Use pg_largeobject_metadata.oid instead of pg_largeobject.loid
to enumerate existing large objects in pg_dump, pg_restore, and
contrib modules.
Magnus Hagander [Sat, 12 Dec 2009 21:35:21 +0000 (21:35 +0000)]
Allow LDAP authentication to operate in search+bind mode, meaning it
does a search for the user in the directory first, and then binds with
the DN found for this user.
This allows for LDAP logins in scenarios where the DN of the user cannot
be determined simply by prefix and suffix, such as the case where different
users are located in different containers.
The old way of authentication can be significantly faster, so it's kept
as an option.
Tom Lane [Sat, 12 Dec 2009 19:24:35 +0000 (19:24 +0000)]
Fix integer-to-bit-string conversions to handle the first fractional byte
correctly when the output bit width is wider than the given integer by
something other than a multiple of 8 bits.
This has been wrong since I first wrote that code for 8.0 :-(. Kudos to
Roman Kononov for being the first to notice, though I didn't use his
patch. Per bug #5237.
Robert Haas [Sat, 12 Dec 2009 00:35:34 +0000 (00:35 +0000)]
Export ExplainBeginOutput() and ExplainEndOutput() for auto_explain.
Without these functions, anyone outside of explain.c can't actually use
ExplainPrintPlan, because the ExplainState won't be initialized properly.
The user-visible result of this was a crash when using auto_explain with
the JSON output format.
Report by Euler Taveira de Oliveira. Analysis by Tom Lane. Patch by me.
Tom Lane [Fri, 11 Dec 2009 21:50:06 +0000 (21:50 +0000)]
Arrange to generate different random sequences in the different child
processes of a pgbench run, when we are using -j > 1 and are emulating
threads via fork(). Otherwise the children all inherit the same random
sequence state and produce the same random-number sequence.
In the threaded case the different threads will share one RNG state, so
they will produce different subsets of one sequence, which is maybe more
correlated than a purist would like but will not be "the same". So we
leave that case alone.
First noticed by Takahiro Itagaki, and is also part of the explanation
for the pgbench misbehavior recently reported by Jaime Casanova.
Tom Lane [Fri, 11 Dec 2009 18:14:43 +0000 (18:14 +0000)]
Ensure that the result tuple of an EvalPlanQual cycle gets materialized
before we zap the input tuple. Otherwise, pass-by-reference columns of
the result slot are likely to contain just references to the input
tuple, leading to big trouble if the pfree'd space is reused. Per
trouble report from Jaime Casanova. This is a new bug in the recent
rewrite of EvalPlanQual, so nothing to back-patch.
Peter Eisentraut [Thu, 10 Dec 2009 06:32:28 +0000 (06:32 +0000)]
Add init[db] option to pg_ctl
pg_ctl gets a new mode that runs initdb. Adjust the documentation a bit to
not assume that initdb is the only way to run database cluster initialization.
But don't replace initdb as the canonical way.
Tom Lane [Wed, 9 Dec 2009 21:57:51 +0000 (21:57 +0000)]
Prevent indirect security attacks via changing session-local state within
an allegedly immutable index function. It was previously recognized that
we had to prevent such a function from executing SET/RESET ROLE/SESSION
AUTHORIZATION, or it could trivially obtain the privileges of the session
user. However, since there is in general no privilege checking for changes
of session-local state, it is also possible for such a function to change
settings in a way that might subvert later operations in the same session.
Examples include changing search_path to cause an unexpected function to
be called, or replacing an existing prepared statement with another one
that will execute a function of the attacker's choosing.
The present patch secures VACUUM, ANALYZE, and CREATE INDEX/REINDEX against
these threats, which are the same places previously deemed to need protection
against the SET ROLE issue. GUC changes are still allowed, since there are
many useful cases for that, but we prevent security problems by forcing a
rollback of any GUC change after completing the operation. Other cases are
handled by throwing an error if any change is attempted; these include temp
table creation, closing a cursor, and creating or deleting a prepared
statement. (In 7.4, the infrastructure to roll back GUC changes doesn't
exist, so we settle for rejecting changes of "search_path" in these contexts.)
Original report and patch by Gurjeet Singh, additional analysis by
Tom Lane.
Magnus Hagander [Wed, 9 Dec 2009 06:37:06 +0000 (06:37 +0000)]
Reject certificates with embedded NULLs in the commonName field. This stops
attacks where an attacker would put <attack>\0<propername> in the field and
trick the validation code that the certificate was for <attack>.
This is a very low risk attack since it reuqires the attacker to trick the
CA into issuing a certificate with an incorrect field, and the common
PostgreSQL deployments are with private CAs, and not external ones. Also,
default mode in 8.4 does not do any name validation, and is thus also not
vulnerable - but the higher security modes are.
Backpatch all the way. Even though versions 8.3.x and before didn't have
certificate name validation support, they still exposed this field for
the user to perform the validation in the application code, and there
is no way to detect this problem through that API.