Robert Haas [Mon, 5 Dec 2016 20:54:28 +0000 (15:54 -0500)]
Ensure gatherstate->nextreader is properly initialized.
The previously code worked OK as long as a Gather node was never
rescanned, or if it was rescanned, as long as it got at least as
many workers on rescan as it had originally. But if the number
of workers ever decreased on a rescan, then it could crash.
Stephen Frost [Mon, 5 Dec 2016 20:50:55 +0000 (15:50 -0500)]
Add support for restrictive RLS policies
We have had support for restrictive RLS policies since 9.5, but they
were only available through extensions which use the appropriate hooks.
This adds support into the grammer, catalog, psql and pg_dump for
restrictive RLS policies, thus reducing the cases where an extension is
necessary.
In passing, also move away from using "AND"d and "OR"d in comments.
As pointed out by Alvaro, it's not really appropriate to attempt
to make verbs out of "AND" and "OR", so reword those comments which
attempted to.
Reviewed By: Jeevan Chalke, Dean Rasheed
Discussion: https://postgr.es/m/20160901063404.GY4028@tamriel.snowman.net
If we failed to connect to one or more hosts, and then afterwards we
find one that fails to be read-write, the latter error message was
clobbering any earlier ones. Repair.
Robert Haas [Mon, 5 Dec 2016 15:53:21 +0000 (10:53 -0500)]
Reduce the default for max_worker_processes back to 8.
Commit b460f5d6693103076dc554aa7cbb96e1e53074f9 -- at my suggestion --
increased the default value of max_worker_processes from 8 to 16, on
the theory that this would be harmless and convenient for users.
Unfortunately, this caused some buildfarm machines with low connection
limits to start failing, so apparently it's not harmless after all.
Robert Haas [Mon, 5 Dec 2016 15:00:49 +0000 (10:00 -0500)]
Try to fix some DSA-related compiler warnings.
Commit 13df76a537cca3b8884911d8fdf7c89a457a8dd3 was overconfident
about how portable %016lx is. Some compilers complain because they
need %016llx, while platforms where DSA pointers are only 32 bits
get unhappy about using a 64-bit format for a 32-bit quantity.
Replace PostmasterRandom() with a stronger source, second attempt.
This adds a new routine, pg_strong_random() for generating random bytes,
for use in both frontend and backend. At the moment, it's only used in
the backend, but the upcoming SCRAM authentication patches need strong
random numbers in libpq as well.
pg_strong_random() is based on, and replaces, the existing implementation
in pgcrypto. It can acquire strong random numbers from a number of sources,
depending on what's available:
- OpenSSL RAND_bytes(), if built with OpenSSL
- On Windows, the native cryptographic functions are used
- /dev/urandom
Unlike the current pgcrypto function, the source is chosen by configure.
That makes it easier to test different implementations, and ensures that
we don't accidentally fall back to a less secure implementation, if the
primary source fails. All of those methods are quite reliable, it would be
pretty surprising for them to fail, so we'd rather find out by failing
hard.
If no strong random source is available, we fall back to using erand48(),
seeded from current timestamp, like PostmasterRandom() was. That isn't
cryptographically secure, but allows us to still work on platforms that
don't have any of the above stronger sources. Because it's not very secure,
the built-in implementation is only used if explicitly requested with
--disable-strong-random.
This replaces the more complicated Fortuna algorithm we used to have in
pgcrypto, which is unfortunate, but all modern platforms have /dev/urandom,
so it doesn't seem worth the maintenance effort to keep that. pgcrypto
functions that require strong random numbers will be disabled with
--disable-strong-random.
Original patch by Magnus Hagander, tons of further work by Michael Paquier
and me.
Fujii Masao [Mon, 5 Dec 2016 11:29:41 +0000 (20:29 +0900)]
Fix incorrect output from gin_desc().
Previously gin_desc() displayed incorrect output "unknown action 0"
for XLOG_GIN_INSERT and XLOG_GIN_VACUUM_DATA_LEAF_PAGE records with
valid actions. The cause of this problem was that gin_desc() wrongly
used XLogRecGetData() to extract data from those records.
Since they were registered by XLogRegisterBufData(), gin_desc() should
have used XLogRecGetBlockData(), instead, like gin_redo().
Also there were other differences about how to treat XLOG_GIN_INSERT
record between gin_desc() and gin_redo().
This commit fixes gin_desc() routine so that it treats those records
in the same way as gin_redo().
Batch-patch to 9.5 where WAL record format was revamped and
XLogRegisterBufData() was added.
Reported-By: Andres Freund Reviewed-By: Tom Lane
Discussion: <20160509194645.7lewnpw647zegx2m@alap3.anarazel.de>
Tom Lane [Sun, 4 Dec 2016 20:02:27 +0000 (15:02 -0500)]
Don't mess up pstate->p_next_resno in transformOnConflictClause().
transformOnConflictClause incremented p_next_resno while generating the
phony targetlist for the EXCLUDED pseudo-rel. Then that field got
incremented some more during transformTargetList, possibly leading to
free_parsestate concluding that we'd overrun the allowed length of a tlist,
as reported by Justin Pryzby.
We could fix this by resetting p_next_resno to 1 after using it for the
EXCLUDED pseudo-rel tlist, but it seems easier and less coupled to other
places if we just don't use that field at all in this loop. (Note that
this doesn't change anything about the resnos that end up appearing in
the main target list, because those are all replaced with target-column
numbers by updateTargetListEntry.)
In passing, fix incorrect type OID assigned to the whole-row Var for
"EXCLUDED.*" (somehow this escaped having any bad consequences so far,
but it's certainly wrong); remove useless assignment to var->location;
pstrdup the column names in case of a relcache flush; and improve
nearby comments.
Back-patch to 9.5 where ON CONFLICT was introduced.
Noah Misch [Sun, 4 Dec 2016 05:16:54 +0000 (00:16 -0500)]
Make pgwin32_putenv() probe every known CRT, regardless of compiler.
This extends to MinGW builds the provision for MSVC-built libraries to
see putenv() effects. Doing so repairs, for example, the handling of
the krb_server_keyfile parameter when linked with MSVC-built MIT
Kerberos. Like the previous commit, no back-patch.
Noah Misch [Sat, 3 Dec 2016 20:46:36 +0000 (15:46 -0500)]
Make pgwin32_putenv() follow DLL loading and unloading.
Until now, the first putenv() call of a given postgres.exe process would
cache the set of loaded CRTs. If a CRT unloaded after that call, the
next putenv() would crash. That risk was largely theoretical, because
the first putenv() precedes all PostgreSQL-initiated module loading.
However, this might explain bad interactions with antivirus and other
software that injects threads asynchronously. If an additional CRT
loaded after the first putenv(), pgwin32_putenv() would not discover it.
That CRT would have all environment changes predating its load, but it
would not receive later PostgreSQL-initiated changes. An additional CRT
loading concurrently with the first putenv() might miss that change in
addition to missing later changes. Fix all those problems. This
removes the cache mechanism from pgwin32_putenv(); the cost, less than
100 μs per backend startup, is negligible.
No resulting misbehavior was known to be user-visible given the core
distribution alone, but one can readily construct an affected extension
module. No back-patch given the lack of complaints and the potential
for behavior changes in non-PostgreSQL code running in the backend.
Noah Misch [Sat, 3 Dec 2016 20:46:36 +0000 (15:46 -0500)]
Make pgwin32_putenv() visit debug CRTs.
This has no effect in the most conventional case, where no relevant DLL
uses a debug build. For an example where it does matter, given a debug
build of MIT Kerberos, the krb_server_keyfile parameter usually had no
effect. Since nobody wants a Heisenbug, back-patch to 9.2 (all
supported versions).
Noah Misch [Sat, 3 Dec 2016 20:46:35 +0000 (15:46 -0500)]
Remove wrong CloseHandle() call.
In accordance with its own documentation, invoke CloseHandle() only when
directed in the documentation for the function that furnished the
handle. GetModuleHandle() does not so direct. We have been issuing
this call only in the rare event that a CRT DLL contains no "_putenv"
symbol, so lack of bug reports is uninformative. Back-patch to 9.2 (all
supported versions).
Noah Misch [Sat, 3 Dec 2016 20:46:35 +0000 (15:46 -0500)]
Refine win32env.c cosmetics.
Replace use of plain 0 as a null pointer constant. In comments, update
terminology and lessen redundancy. Back-patch to 9.2 (all supported
versions) for the convenience of back-patching the next two commits.
Christian Ullrich and Noah Misch, reviewed (in earlier versions) by
Michael Paquier.
Tom Lane [Fri, 2 Dec 2016 22:23:54 +0000 (17:23 -0500)]
Fix broken wait-for-previous-process-to-exit loop in regression test.
Must do pg_stat_clear_snapshot() inside test's loop, or our snapshot of
pg_stat_activity will never change :-(. Thinko in b3427dade -- evidently
my workstation never really iterated the loop in testing. Per buildfarm.
Tom Lane [Fri, 2 Dec 2016 19:57:35 +0000 (14:57 -0500)]
Delete deleteWhatDependsOn() in favor of more performDeletion() flag bits.
deleteWhatDependsOn() had grown an uncomfortably large number of
assumptions about what it's used for. There are actually only two minor
differences between what it does and what a regular performDeletion() call
can do, so let's invent additional bits in performDeletion's existing flags
argument that specify those behaviors, and get rid of deleteWhatDependsOn()
as such. (We'd probably have done it this way from the start, except that
performDeletion didn't originally have a flags argument, IIRC.)
Also, add a SKIP_EXTENSIONS flag bit that prevents ever recursing to an
extension, and use that when dropping temporary objects at session end.
This provides a more general solution to the problem addressed in a hacky
way in commit 08dd23cec: if an extension script creates temp objects and
forgets to remove them again, the whole extension went away when its
contained temp objects were deleted. The previous solution only covered
temp relations, but this solves it for all object types.
These changes require minor additions in dependency.c to pass the flags
to subroutines that previously didn't get them, but it's still a net
savings of code, and it seems cleaner than before.
Having done this, revert the special-case code added in 08dd23cec that
prevented addition of pg_depend records for temp table extension
membership, because that caused its own oddities: dropping an extension
that had created such a table didn't automatically remove the table,
leading to a failure if the table had another dependency on the extension
(such as use of an extension data type), or to a duplicate-name failure if
you then tried to recreate the extension. But we keep the part that
prevents the pg_temp_nnn schema from becoming an extension member; we never
want that to happen. Add a regression test case covering these behaviors.
Although this fixes some arguable bugs, we've heard few field complaints,
and any such problems are easily worked around by explicitly dropping temp
objects at the end of extension scripts (which seems like good practice
anyway). So I won't risk a back-patch.
Robert Haas [Fri, 2 Dec 2016 17:34:36 +0000 (12:34 -0500)]
Introduce dynamic shared memory areas.
Programmers discovered decades ago that it was useful to have a simple
interface for allocating and freeing memory, which is why malloc() and
free() were invented. Unfortunately, those handy tools don't work
with dynamic shared memory segments because those are specific to
PostgreSQL and are not necessarily mapped at the same address in every
cooperating process. So invent our own allocator instead. This makes
it possible for processes cooperating as part of parallel query
execution to allocate and free chunks of memory without having to
reserve them prior to the start of execution. It could also be used
for longer lived objects; for example, we could consider storing data
for pg_stat_statements or the stats collector in shared memory using
these interfaces, rather than writing them to files. Basically,
anything that needs shared memory but can't predict in advance how
much it's going to need might find this useful.
Thomas Munro and Robert Haas. The original code (of mine) on which
Thomas based his work was actually designed to be a new backend-local
memory allocator for PostgreSQL, but that hasn't gone anywhere - or
not yet, anyway. Thomas took that work and performed major
refactoring and extensive modifications to make it work with dynamic
shared memory, including the addition of appropriate locking.
Robert Haas [Fri, 2 Dec 2016 17:03:30 +0000 (12:03 -0500)]
Management of free memory pages.
This is intended as infrastructure for a full-fledged allocator for
dynamic shared memory. The interface looks a bit like a real
allocator, but only supports allocating and freeing memory in
multiples of the 4kB page size. Further, to free memory, you must
know the size of the span you wish to free, in pages. While these are
make it unsuitable as an allocator in and of itself, it still serves
as very useful scaffolding for a full-fledged allocator.
Robert Haas and Thomas Munro. This code is mostly the same as my 2014
submission, but Thomas fixed quite a few bugs and made some changes to
the interface.
Robert Haas [Fri, 2 Dec 2016 16:29:01 +0000 (11:29 -0500)]
Add a crude facility for dealing with relative pointers.
C doesn't have any sort of built-in understanding of a pointer
relative to some arbitrary base address, but dynamic shared memory
segments can be mapped at different addresses in different processes,
so any sort of shared data structure stored within a dynamic shared
memory segment can't use absolute pointers. We could use something
like Size to represent a relative pointer, but then the compiler
provides no type-checking. Use stupid macro tricks to get some
type-checking.
Patch originally by me. Concept suggested by Andres Freund. Recently
resubmitted as part of Thomas Munro's work on dynamic shared memory
allocation.
Robert Haas [Fri, 2 Dec 2016 13:58:41 +0000 (08:58 -0500)]
Clarify that pg_stat_activity.query has a length limit.
There was always documentation of the GUC that controlled what the
limit actually was, but previously the documentation of the field
itself made no mention of that limit.
Robert Haas [Fri, 2 Dec 2016 12:42:58 +0000 (07:42 -0500)]
Add max_parallel_workers GUC.
Increase the default value of the existing max_worker_processes GUC
from 8 to 16, and add a new max_parallel_workers GUC with a maximum
of 8. This way, even if the maximum amount of parallel query is
happening, there is still room for background workers that do other
things, as originally envisioned when max_worker_processes was added.
Julien Rouhaud, reviewed by Amit Kapila and by revised by me.
Alvaro Herrera [Fri, 2 Dec 2016 03:34:01 +0000 (00:34 -0300)]
Permit dump/reload of not-too-large >1GB tuples
Our documentation states that our maximum field size is 1 GB, and that
our maximum row size of 1.6 TB. However, while this might be attainable
in theory with enough contortions, it is not workable in practice; for
starters, pg_dump fails to dump tables containing rows larger than 1 GB,
even if individual columns are well below the limit; and even if one
does manage to manufacture a dump file containing a row that large, the
server refuses to load it anyway.
This commit enables dumping and reloading of such tuples, provided two
conditions are met:
1. no single column is larger than 1 GB (in output size -- for bytea
this includes the formatting overhead)
2. the whole row is not larger than 2 GB
There are three related changes to enable this:
a. StringInfo's API now has two additional functions that allow creating
a string that grows beyond the typical 1GB limit (and "long" string).
ABI compatibility is maintained. We still limit these strings to 2 GB,
though, for reasons explained below.
b. COPY now uses long StringInfos, so that pg_dump doesn't choke
trying to emit rows longer than 1GB.
c. heap_form_tuple now uses the MCXT_ALLOW_HUGE flag in its allocation
for the input tuple, which means that large tuples are accepted on
input. Note that at this point we do not apply any further limit to the
input tuple size.
The main reason to limit to 2 GB is that the FE/BE protocol uses 32 bit
length words to describe each row; and because the documentation is
ambiguous on its signedness and libpq does consider it signed, we cannot
use the highest-order bit. Additionally, the StringInfo API uses "int"
(which is 4 bytes wide in most platforms) in many places, so we'd need
to change that API too in order to improve, which has lots of fallout.
Backpatch to 9.5, which is the oldest that has
MemoryContextAllocExtended, a necessary piece of infrastructure. We
could apply to 9.4 with very minimal additional effort, but any further
than that would require backpatching "huge" allocations too.
This is the largest set of changes we could find that can be
back-patched without breaking compatibility with existing systems.
Fixing a bigger set of problems (for example, dumping tuples bigger than
2GB, or dumping fields bigger than 1GB) would require changing the FE/BE
protocol and/or changing the StringInfo API in an ABI-incompatible way,
neither of which would be back-patchable.
Authors: Daniel Vérité, Álvaro Herrera
Reviewed by: Tomas Vondra
Discussion: https://postgr.es/m/20160229183023.GA286012@alvherre.pgsql
Peter Eisentraut [Wed, 30 Nov 2016 17:00:00 +0000 (12:00 -0500)]
Refactor libpqwalreceiver
The whole walreceiver API is now wrapped into a struct, like most of our
other loadable module APIs. The libpq connection is no longer a global
variable in libpqwalreceiver. Instead, it is encapsulated into a struct
that is passed around the functions. This allows multiple walreceivers
to run at the same time.
Add some rudimentary support for logical replication connections to
libpqwalreceiver.
These changes are mostly cosmetic and are going to be useful for the
future logical replication patches.
Peter Eisentraut [Thu, 15 Sep 2016 17:00:00 +0000 (12:00 -0500)]
Use grammar symbol function_with_argtypes consistently
Instead of sometimes referring to a function signature like func_name
func_args, use the existing function_with_argtypes symbol, which
combines the two.
Reviewed-by: Alvaro Herrera <alvherre@2ndquadrant.com> Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Robert Haas [Thu, 1 Dec 2016 19:36:39 +0000 (14:36 -0500)]
libpq: Fix inadvertent change in PQhost() behavior.
Commit 274bb2b3857cc987cfa21d14775cae9b0dababa5 caused PQhost() to
return the value of the hostaddr parameter rather than the relevant
host when the latter parameter was specified. That's wrong. Commit 9a1d0af4ad2cbd419115b453d811c141b80d872b then amplified the damage by
using PQhost() in more places, so that the SSL test suite started
failing.
Andres Freund [Thu, 1 Dec 2016 01:30:09 +0000 (17:30 -0800)]
User narrower representative tuples in the hash-agg hashtable.
So far the hashtable stored representative tuples in the form of its
input slot, with all columns in the hashtable that are not
needed (i.e. not grouped upon or functionally dependent) set to NULL.
Thats good for saving memory, but it turns out that having tuples full
of NULL isn't free. slot_deform_tuple is faster if there's no NULL
bitmap even if no NULLs are encountered, and skipping over leading NULLs
isn't free.
So compute a separate tuple descriptor that only contains the needed
columns. As columns have already been moved in/out the slot for the
hashtable that does not imply additional per-row overhead.
Author: Andres Freund Reviewed-By: Heikki Linnakangas
Discussion: https://postgr.es/m/20161103110721.h5i5t5saxfk5eeik@alap3.anarazel.de
Andres Freund [Thu, 1 Dec 2016 00:08:11 +0000 (16:08 -0800)]
Perform one only projection to compute agg arguments.
Previously we did a ExecProject() for each individual aggregate
argument. That turned out to be a performance bottleneck in queries with
multiple aggregates.
Doing all the argument computations in one ExecProject() is quite a bit
cheaper because ExecProject's fastpath can do the work at once in a
relatively tight loop, and because it can get all the required columns
with a single slot_getsomeattr and save some other redundant setup
costs.
Author: Andres Freund Reviewed-By: Heikki Linnakangas
Discussion: https://postgr.es/m/20161103110721.h5i5t5saxfk5eeik@alap3.anarazel.de
Robert Haas [Wed, 30 Nov 2016 20:39:21 +0000 (15:39 -0500)]
Improve hash index bucket split behavior.
Previously, the right to split a bucket was represented by a
heavyweight lock on the page number of the primary bucket page.
Unfortunately, this meant that every scan needed to take a heavyweight
lock on that bucket also, which was bad for concurrency. Instead, use
a cleanup lock on the primary bucket page to indicate the right to
begin a split, so that scans only need to retain a pin on that page,
which is they would have to acquire anyway, and which is also much
cheaper.
In addition to reducing the locking cost, this also avoids locking out
scans and inserts for the entire lifetime of the split: while the new
bucket is being populated with copies of the appropriate tuples from
the old bucket, scans and inserts can happen in parallel. There are
minor concurrency improvements for vacuum operations as well, though
the situation there is still far from ideal.
This patch also removes the unworldly assumption that a split will
never be interrupted. With the new code, a split is done in a series
of small steps and the system can pick up where it left off if it is
interrupted prior to completion. While this patch does not itself add
write-ahead logging for hash indexes, it is clearly a necessary first
step, since one of the things that could interrupt a split is the
removal of electrical power from the machine performing it.
Amit Kapila. I wrote the original design on which this patch is
based, and did a good bit of work on the comments and README through
multiple rounds of review, but all of the code is Amit's. Also
reviewed by Jesper Pedersen, Jeff Janes, and others.
Tom Lane [Wed, 30 Nov 2016 18:34:13 +0000 (13:34 -0500)]
Doc: improve description of trim() and related functions.
Per bug #14441 from Mark Pether, the documentation could be misread,
mainly because some of the examples failed to show what happens with
a multicharacter "characters to trim" string. Also, while the text
description in most of these entries was fairly clear that the
"characters" argument is a set of characters not a substring to match,
some of them used variant wording that was a bit less clear.
trim() itself suffered from both deficiencies and was thus pretty
misinterpretable.
Also fix failure to explain which of LEADING/TRAILING/BOTH is the
default.
Rewrite the perl scripts to produce our Unicode conversion tables.
Generate EUC_CN mappings from gb-18030-2000.xml, because GB2312.TXT is no
longer available.
Get UHC from windows-949-2000.xml, it's more up-to-date.
Plus tons more small changes. With these changes, the perl scripts
faithfully produce the *.map files we have in the repository, from the
external source files.
In the passing, fix the Makefile to also download CP932.TXT and CP950.TXT.
Based on patches by Kyotaro Horiguchi, reviewed by Daniel Gustafsson.
Remove leading zeros, for consistency with other map files.
The common style is to pad to 4 digits.
Running the current perl scripts to generate these map files would override
this change, but the next commit will rewrite the perl scripts to produce
this style. I'm doing this as a separate commit, to make it more clear what
non-cosmetic changes the next commit makes to the map files.
Remove code points < 0x80 from character conversion tables.
PostgreSQL treats characters with < 0x80 leading byte as plain ASCII, and
they are not even passed to the conversion routines. There is no point in
having them in the conversion tables.
Everything in the tables were direct ASCII-ASCII mappings, except for two:
* SHIFT_JIS_2004 code point 0x5C (backslash in ASCII) was mapped to Unicode
YEN SIGN character.
* Unicode 0x5C (backslash again) was mapped to "REVERSE SOLIDUS" in
SHIFT_JIS_2004
These mappings never had any effect, so there's no functional change from
removing them.
pgp-pubkey-DISABLED test has been unused since 2006, when support for
built-in bignum math was added (commit 1abf76e8). pgp-encrypt-DISABLED has
been unused forever, AFAICS.
Tom Lane [Wed, 30 Nov 2016 00:32:35 +0000 (19:32 -0500)]
Fix bogus handling of JOIN_UNIQUE_OUTER/INNER cases for parallel joins.
consider_parallel_nestloop passed the wrong jointype down to its
subroutines for JOIN_UNIQUE_INNER cases (it should pass JOIN_INNER), and it
thought that it could pass paths other than innerrel->cheapest_total_path
to create_unique_path, which create_unique_path is not on board with.
These bugs would lead to assertion failures or other errors, suggesting
that this code path hasn't been tested much.
hash_inner_and_outer's code for parallel join effectively treated both
JOIN_UNIQUE_OUTER and JOIN_UNIQUE_INNER the same as JOIN_INNER (for
different reasons :-(), leading to incorrect plans that treated a semijoin
as if it were a plain join.
Michael Day submitted a test case demonstrating that hash_inner_and_outer
failed for JOIN_UNIQUE_OUTER, and I found the other cases through code
review.
Tom Lane [Tue, 29 Nov 2016 23:00:49 +0000 (18:00 -0500)]
Improve eqjoinsel_semi's behavior for small inner relations with no stats.
If we don't have any MCV statistics for the inner relation, and we don't
trust its numdistinct estimate either, eqjoinsel_semi falls back to a very
conservative estimate (that 50% of the outer rows have matches). This is
particularly problematic if the inner relation is completely empty, since
then even an explicit ANALYZE won't produce any pg_statistic entries,
so there's no way to budge the planner off the bad estimate.
We'd produce a better estimate in such cases if we used the nd2/nd1
selectivity heuristic, so an easy fix is to treat the nd2 estimate as
non-default if we derive it from clamping to the inner rel's rowcount
estimate. This won't fix every related case (mainly because the rowcount
estimate might be larger than DEFAULT_NUM_DISTINCT), but it seems like a
sane extension of the existing logic, so let's apply the change in HEAD
and see if anyone complains. Per bug #14438 from Nikolay Nikitin.
Tom Lane [Tue, 29 Nov 2016 20:05:22 +0000 (15:05 -0500)]
Test all contrib-created operator classes with amvalidate.
I'd supposed that people would do this manually when creating new operator
classes, but the folly of that was exposed today. The tests seem fast
enough that we can just apply them during the normal regression tests.
contrib/isn fails the checks for lack of complete sets of cross-type
operators. That's a nice-to-have policy rather than a functional
requirement, so leave it as-is, but insert ORDER BY in the query to
ensure consistent cross-platform output.
Robert Haas [Tue, 29 Nov 2016 17:18:31 +0000 (12:18 -0500)]
libpq: Add target_session_attrs parameter.
Commit 274bb2b3857cc987cfa21d14775cae9b0dababa5 made it possible to
specify multiple IPs in a connection string, but that's not good
enough for the case where you have a read-write master and a bunch of
read-only standbys and want to connect to whichever server is the
master at the current time. This commit allows that, by making it
possible to specify target_session_attrs=read-write as a connection
parameter.
There was extensive discussion of the best name for the connection
parameter and its values as well as the best way to distinguish master
and standbys. For now, adopt the same solution as JDBC: if the user
wants a read-write connection, issue 'show transaction_read_only' and
rejection the connection if the result is 'on'. In the future, we
could add additional values of this new target_session_attrs parameter
that issue different queries; or we might have some way of
distinguishing the server type without resorting to an SQL query; but
right now, we have this, and that's (hopefully) a good start.
Victor Wagner and Mithun Cy. Design review by Álvaro Herrera, Catalin
Iacob, Takayuki Tsunakawa, and Craig Ringer; code review by me. I
changed Mithun's patch to skip all remaining IPs for a host if we
reject a connection based on this new parameter, rewrote the
documentation, and did some other cosmetic cleanup.
Stephen Frost [Tue, 29 Nov 2016 16:09:35 +0000 (11:09 -0500)]
Add --no-blobs option to pg_dump
Add an option to exclude blobs when running pg_dump. By default, blobs
are included but this option can be used to exclude them while keeping
the rest of the dump.
Commment updates and regression tests from me.
Author: Guillaume Lelarge Reviewed-by: Amul Sul
Discussion: https://postgr.es/m/VisenaEmail.48.49926ea6f91dceb6.15355a48249@tc7-visena
Stephen Frost [Tue, 29 Nov 2016 15:35:04 +0000 (10:35 -0500)]
Clarify pg_dump -b documentation
The documentation around the -b/--blobs option to pg_dump seemed to
imply that it might be possible to add blobs to a "schema-only" dump or
similar. Clarify that blobs are data and therefore will only be
included in dumps where data is being included, even when -b is used to
request blobs be included.
The -b option has been around since before 9.2, so back-patch to all
supported branches.
Tom Lane [Tue, 29 Nov 2016 00:08:38 +0000 (19:08 -0500)]
Fix estimate_expression_value to constant-fold SQLValueFunction nodes.
Oversight in my commit 0bb51aa96. Noted while poking at a recent
bug report --- HEAD's estimates for a query using CURRENT_DATE
were unexpectedly much worse than 9.6's.
Tom Lane [Mon, 28 Nov 2016 02:23:39 +0000 (21:23 -0500)]
Code review for early drop of orphaned temp relations in autovacuum.
Commit a734fd5d1 exposed some race conditions that existed previously
in the autovac code, but were basically harmless because autovac would
not try to delete orphaned relations immediately. Specifically, the test
for orphaned-ness was made on a pg_class tuple that might be dead by now,
allowing autovac to try to remove a table that the owning backend had just
finished deleting. This resulted in a hard crash due to inadequate caution
about accessing the table's catalog entries without any lock. We must take
a relation lock and then recheck whether the table is still present and
still looks deletable before we do anything.
Also, it seemed to me that deleting multiple tables per transaction, and
trying to continue after errors, represented unjustifiable complexity.
We do not expect this code path to be taken often in the field, nor even
during testing, which means that prioritizing performance over correctness
is a bad tradeoff. Rip all that out in favor of just starting a new
transaction after each successful temp table deletion. If we're unlucky
enough to get an error, which shouldn't happen anyway now that we're being
more cautious, let the autovacuum worker fail as it normally would.
In passing, improve the order of operations in the initial scan loop.
Now that we don't care about whether a temp table is a wraparound hazard,
there's no need to perform extract_autovac_opts, get_pgstat_tabentry_relid,
or relation_needs_vacanalyze for temp tables.
Also, if GetTempNamespaceBackendId returns InvalidBackendId (indicating
it doesn't recognize the schema as temp), treat that as meaning it's NOT
an orphaned temp table, not that it IS one, which is what happened before
because BackendIdGetProc necessarily failed. The case really shouldn't
come up for a table that has RELPERSISTENCE_TEMP, but the consequences
if it did seem undesirable. (This might represent a back-patchable bug
fix; not sure if it's worth the trouble.)
Tom Lane [Sat, 26 Nov 2016 18:31:35 +0000 (13:31 -0500)]
Fix test about ignoring extension dependencies during extension scripts.
Commit 08dd23cec introduced an exception to the rule that extension member
objects can only be dropped as part of dropping the whole extension,
intending to allow such drops while running the extension's own creation or
update scripts. However, the exception was only applied at the outermost
recursion level, because it was modeled on a pre-existing check to ignore
dependencies on objects listed in pendingObjects. Bug #14434 from Philippe
Beaudoin shows that this is inadequate: in some cases we can reach an
extension member object by recursion from another one. (The bug concerns
the serial-sequence case; I'm not sure if there are other cases, but there
might well be.)
To fix, revert 08dd23cec's changes to findDependentObjects() and instead
apply the creating_extension exception regardless of stack level.
Having seen this example, I'm a bit suspicious that the pendingObjects
logic is also wrong and such cases should likewise be allowed at any
recursion level. However, changing that would interact in subtle ways
with the recursion logic (at least it would need to be moved to after the
recursing-from check). Given that the code's been like that a long time,
I'll refrain from touching it without a clear example showing it's wrong.
Back-patch to all active branches. In HEAD and 9.6, where suitable
test infrastructure exists, add a regression test case based on the
bug report.
Tom Lane [Fri, 25 Nov 2016 23:36:10 +0000 (18:36 -0500)]
Bring some clarity to the defaults for the xxx_flush_after parameters.
Instead of confusingly stating platform-dependent defaults for these
parameters in the comments in postgresql.conf.sample (with the main
entry being a lie on Linux), teach initdb to install the correct
platform-dependent value in postgresql.conf, similarly to the way
we handle other platform-dependent defaults. This won't do anything
for existing 9.6 installations, but since it's effectively only a
documentation improvement, that seems OK.
Since this requires initdb to have access to the default values,
move the #define's for those to pg_config_manual.h; the original
placement in bufmgr.h is unworkable because that file can't be
included by frontend programs.
Adjust the default value for wal_writer_flush_after so that it is 1MB
regardless of XLOG_BLCKSZ, conforming to what is stated in both the
SGML docs and postgresql.conf. (We could alternatively make it scale
with XLOG_BLCKSZ, but I'm not sure I see the point.)
Copy-edit related SGML documentation.
Fabien Coelho and Tom Lane, per a gripe from Tomas Vondra.
Tom Lane [Fri, 25 Nov 2016 21:20:12 +0000 (16:20 -0500)]
Mark a query's topmost Paths parallel-unsafe if they will have initPlans.
Andreas Seltenreich found another case where we were being too optimistic
about allowing a plan to be considered parallelizable despite it containing
initPlans. It seems like the real issue here is that if we know we are
going to tack initPlans onto the topmost Plan node for a subquery, we
had better mark that subquery's result Paths as not-parallel-safe. That
fixes this problem and allows reversion of a kluge (added in commit 7b67a0a49 and extended in f24cf960d) to not trust the parallel_safe flag
at top level.
Tom Lane [Fri, 25 Nov 2016 18:44:47 +0000 (13:44 -0500)]
Check for pending trigger events on far end when dropping an FK constraint.
When dropping a foreign key constraint with ALTER TABLE DROP CONSTRAINT,
we refuse the drop if there are any pending trigger events on the named
table; this ensures that we won't remove the pg_trigger row that will be
consulted by those events. But we should make the same check for the
referenced relation, else we might remove a due-to-be-referenced pg_trigger
row for that relation too, resulting in "could not find trigger NNN" or
"relation NNN has no triggers" errors at commit. Per bug #14431 from
Benjie Gillam. Back-patch to all supported branches.
Alvaro Herrera [Thu, 24 Nov 2016 18:39:55 +0000 (15:39 -0300)]
Fix commit_ts for FrozenXid and BootstrapXid
Previously, requesting commit timestamp for transactions
FrozenTransactionId and BootstrapTransactionId resulted in an error.
But since those values can validly appear in committed tuples' Xmin,
this behavior is unhelpful and error prone: each caller would have to
special-case those values before requesting timestamp data for an Xid.
We already have a perfectly good interface for returning "the Xid you
requested is too old for us to have commit TS data for it", so let's use
that instead.
Backpatch to 9.5, where commit timestamps appeared.
Tom Lane [Wed, 23 Nov 2016 18:45:55 +0000 (13:45 -0500)]
Make sure ALTER TABLE preserves index tablespaces.
When rebuilding an existing index, ALTER TABLE correctly kept the
physical file in the same tablespace, but it messed up the pg_class
entry if the index had been in the database's default tablespace
and "default_tablespace" was set to some non-default tablespace.
This led to an inaccessible index.
Fix by fixing pg_get_indexdef_string() to always include a tablespace
clause, whether or not the index is in the default tablespace. The
previous behavior was installed in commit 537e92e41, and I think it just
wasn't thought through very clearly; certainly the possible effect of
default_tablespace wasn't considered. There's some risk in changing the
behavior of this function, but there are no other call sites in the core
code. Even if it's being used by some third party extension, it's fairly
hard to envision a usage that is okay with a tablespace clause being
appended some of the time but can't handle it being appended all the time.
Back-patch to all supported versions.
Code fix by me, investigation and test cases by Michael Paquier.
Tom Lane [Tue, 22 Nov 2016 22:56:16 +0000 (17:56 -0500)]
Doc: improve documentation about composite-value usage.
Create a section specifically for the syntactic rules around whole-row
variable usage, such as expansion of "foo.*". This was previously
documented only haphazardly, with some critical info buried in
unexpected places like xfunc-sql-composite-functions. Per repeated
questions in different mailing lists.
Avoid memory leak in conninfo_uri_parse_options. Use the current host
rather than the comma-separated list of host names when the host name
is needed for GSS, SSPI, or SSL authentication. Document the way
connect_timeout interacts with multiple host specifications.
Tom Lane [Tue, 22 Nov 2016 20:19:57 +0000 (15:19 -0500)]
Improve handling of "UPDATE ... SET (column_list) = row_constructor".
Previously, the right-hand side of a multiple-column assignment, if it
wasn't a sub-SELECT, had to be a simple parenthesized expression list,
because gram.y was responsible for "bursting" the construct into
independent column assignments. This had the minor defect that you
couldn't write ROW (though you should be able to, since the standard says
this is a row constructor), and the rather larger defect that unlike other
uses of row constructors, we would not expand a "foo.*" item into multiple
columns.
Fix that by changing the RHS to be just "a_expr" in the grammar, leaving
it to transformMultiAssignRef to separate the elements of a RowExpr;
which it will do only after performing standard transformation of the
RowExpr, so that "foo.*" behaves as expected.
The key reason we didn't do that before was the hard-wired handling of
DEFAULT tokens (SetToDefault nodes). This patch deals with that issue by
allowing DEFAULT in any a_expr and having parse analysis throw an error
if SetToDefault is found in an unexpected place. That's an improvement
anyway since the error can be more specific than just "syntax error".
The SQL standard suggests that the RHS could be any a_expr yielding a
suitable row value. This patch doesn't really move the goal posts in that
respect --- you're still limited to RowExpr or a sub-SELECT --- but it does
fix the grammar restriction, so it provides some tangible progress towards
a full implementation. And the limitation is now documented by an explicit
error message rather than an unhelpful "syntax error".
Robert Haas [Tue, 22 Nov 2016 19:26:40 +0000 (14:26 -0500)]
Support condition variables.
Condition variables provide a flexible way to sleep until a
cooperating process causes an arbitrary condition to become true. In
simple cases, this can be accomplished with a WaitLatch/ResetLatch
loop; the cooperating process can call SetLatch after performing work
that might cause the condition to be satisfied, and the waiting
process can recheck the condition each time. However, if the process
performing the work doesn't have an easy way to identify which
processes might be waiting, this doesn't work, because it can't
identify which latches to set. Condition variables solve that problem
by internally maintaining a list of waiters; a process that may have
caused some waiter's condition to be satisfied must "signal" or
"broadcast" on the condition variable.
Tom Lane [Tue, 22 Nov 2016 19:02:52 +0000 (14:02 -0500)]
Doc: add a section in Part II concerning RETURNING.
There are assorted references to RETURNING in Part II, but nothing
that would qualify as an explanation of the feature, which seems
like an oversight considering how useful it is. Add something.
Noted while looking for a place to point a cross-reference to ...
Tom Lane [Mon, 21 Nov 2016 23:21:55 +0000 (18:21 -0500)]
Fix PGLC_localeconv() to handle errors better.
The code was intentionally not very careful about leaking strdup'd
strings in case of an error. That was forgivable probably, but it
also failed to notice strdup() failures, which could lead to subsequent
null-pointer-dereference crashes, since many callers unsurprisingly
didn't check for null pointers in the struct lconv fields. An even
worse problem is that it could throw error while we were setlocale'd
to a non-C locale, causing unwanted behavior in subsequent libc calls.
Rewrite to ensure that we cannot throw elog(ERROR) until after we've
restored the previous locale settings, or at least attempted to.
(I'm sorely tempted to make restore failure be a FATAL error, but
will refrain for the moment.) Having done that, it's not much more
work to ensure that we clean up strdup'd storage on the way out, too.
This code is substantially the same in all supported branches, so
back-patch all the way.
Tom Lane [Mon, 21 Nov 2016 18:19:14 +0000 (13:19 -0500)]
Fix optimization for skipping searches for parallel-query hazards.
Fix thinko in commit da1c91631: even if the original query was free of
parallel hazards, we might introduce such a hazard by adding PARAM_EXEC
Param nodes. Adjust is_parallel_safe() so that it will scan the given
expression whenever any such nodes have been created. Per report from
Andreas Seltenreich.
Robert Haas [Mon, 21 Nov 2016 17:54:19 +0000 (12:54 -0500)]
autovacuum: Drop orphan temp tables more quickly but with more caution.
Previously, we only dropped an orphan temp table when it became old
enough to threaten wraparound; instead, doing it immediately. The
only value of waiting is that someone might be able to examine the
contents of the orphan temp table for forensic purposes, but it's
pretty difficult to actually do that and few users will wish to do so.
On the flip side, not performing the drop immediately generates log
spam and bloats pg_class.
In addition, per a report from Grigory Smolkin, if a temporary schema
contains a very large number of temporary tables, a backend attempting
to clear the temporary schema might fail due to lock table exhaustion.
It's helpful for autovacuum to clean up after such cases, and we don't
want it to wait for wraparound to threaten before doing so. To
prevent autovacuum from failing in the same manner as a backend trying
to drop an entire temp schema, remove orphan temp tables in batches of
50, committing after each batch, so that we don't accumulate an
unbounded number of locks. If a drop fails, retry other orphan tables
that need to be dropped up to 10 times before giving up. With this
system, if a backend does fail to clean a temporary schema due to
lock table exhaustion, autovacuum should hopefully put things right
the next time it processes the database.
Tom Lane [Mon, 21 Nov 2016 16:09:24 +0000 (11:09 -0500)]
Fix test for subplans in force-parallel mode.
We mustn't force parallel mode if the query has any subplans, since
ExecSerializePlan doesn't transmit them to workers. Testing
top_plan->initPlan is inadequate because (1) there might be initPlans
attached to lower plan nodes, and (2) non-initPlan subplans don't
work either. There's certainly room for improvement in those
restrictions, but for the moment that's what we've got.
Tom Lane [Sun, 20 Nov 2016 19:26:19 +0000 (14:26 -0500)]
Prevent multicolumn expansion of "foo.*" in an UPDATE source expression.
Because we use transformTargetList() for UPDATE as well as SELECT
tlists, the code accidentally tried to expand a "*" reference into
several columns. This is nonsensical, because the UPDATE syntax
provides exactly one target column to put the value into. The
immediate result was that transformUpdateTargetList() got confused
and reported "UPDATE target count mismatch --- internal error".
It seems better to treat such a reference as a plain whole-row
variable, as it would be in other contexts. (This could produce
useful results when the target column is of composite type.)
Fix by tweaking transformTargetList() to perform *-expansion only
conditionally, depending on its exprKind parameter.
Back-patch to 9.3. The problem exists further back, but a fix would be
much more invasive before that, because transformTargetList() wasn't
told what kind of list it was working on. Doesn't seem worth the
trouble given the lack of field reports. (I only noticed it because
I was checking the code while trying to improve the documentation about
how we handle "foo.*".)
Tom Lane [Sat, 19 Nov 2016 20:06:45 +0000 (15:06 -0500)]
Fix latent costing error in create_merge_append_path.
create_merge_append_path should use the path rowcount it just computed,
not rel->tuples, for costing purposes. Those numbers should always be
the same at present, but if we ever support parameterized MergeAppend
paths (a case this function is otherwise prepared for), the former would
be right and the latter wrong.
No need for back-patch since the problem is only latent.
Tom Lane [Sat, 19 Nov 2016 19:26:19 +0000 (14:26 -0500)]
Code review for GUC serialization/deserialization code.
The serialization code dumped core for a string-valued GUC whose value
is NULL, which is a legal state. The infrastructure isn't capable of
transmitting that state exactly, but fortunately, transmitting an empty
string instead should be close enough (compare, eg, commit e45e990e4).
The code potentially underestimated the space required to format a
real-valued variable, both because it made an unwarranted assumption that
%g output would never be longer than %e output, and because it didn't count
right even for %e format. In practice this would pretty much always be
masked by overestimates for other variables, but it's still wrong.
Also fix boundary-case error in read_gucstate, incorrect handling of the
case where guc_sourcefile is non-NULL but zero length (not clear that can
happen, but if it did, this code would get totally confused), and
confusingly useless check for a NULL result from read_gucstate.
Andreas Seltenreich discovered the core dump; other issues noted while
reading nearby code. Back-patch to 9.5 where this code was introduced.
Peter Eisentraut [Fri, 18 Nov 2016 17:00:00 +0000 (12:00 -0500)]
Add pg_sequences view
Like pg_tables, pg_views, and others, this view contains information
about sequences in a way that is independent of the system catalog
layout but more comprehensive than the information schema.
To help implement the view, add a new internal function
pg_sequence_last_value() to return the last value of a sequence. This
is kept separate from pg_sequence_parameters() to separate querying
run-time state from catalog-like information.
Stephen Frost [Fri, 18 Nov 2016 19:21:33 +0000 (14:21 -0500)]
Clean up pg_dump tests, re-enable BLOB testing
Add a loop to check that each test covers all of the pg_dump runs. We
(I) had been a bit sloppy when adding new runs and not making sure to
mark if they should be under like or unlike for each test, this loop
makes sure that the test system will complain if any are forgotten in
the future.
The loop also correctly handles the 'catch all' cases, which are used to
avoid running unnecessary specific checks when a single catch-all can be
done (eg: a no-acl run should not have any GRANT commands).
Also, re-enable the testing of blobs, but use lo_from_bytea() instead of
trying to be cute and writing out to a file and then reading it back in
with psql, which proved to be difficult for some buildfarm members.
This allows us to add support for testing the --no-blobs option which
will be getting added shortly, provided the buildfarm doesn't blow up on
this.
Robert Haas [Thu, 17 Nov 2016 21:54:24 +0000 (16:54 -0500)]
Remove or reduce verbosity of some debug messages.
The debug messages that merely print StartTransactionCommand,
CommitTransactionCommand, ProcessUtilty, or ProcessQuery with no
additional details seem to be useless. Get rid of them.
The transaction status messages produced by ShowTransactionState are
occasionally useful, but they are extremely verbose, producing
multiple lines of log output every time they fire, which can happens
multiple times per transaction. So, reduce the level to DEBUG5; avoid
emitting an extra line just to explain which debug point is at issue;
and tighten up the rest of the message so it doesn't use quite so much
horizontal space.
With these changes, it's possible to run a somewhat busy system with a
log level even as high as DEBUG4, whereas previously anything above
DEBUG2 would flood the log with output that probably wasn't really all
that useful.
Tom Lane [Thu, 17 Nov 2016 20:25:59 +0000 (15:25 -0500)]
Fix pg_dump's handling of circular dependencies in views.
pg_dump's traditional solution for breaking a circular dependency involving
a view was to create the view with CREATE TABLE and then later issue CREATE
RULE "_RETURN" ... to convert the table to a view, relying on the backend's
very very ancient code that supports making views that way. We've wanted
to get rid of that kluge for a long time, but the thing that finally
motivates doing something about it is the recognition that this method
fails with the --clean option, because it leads to issuing DROP RULE
"_RETURN" followed by DROP TABLE --- and the backend won't let you drop a
view's _RETURN rule.
Instead, let's break circular dependencies by initially creating the view
using CREATE VIEW AS SELECT NULL::columntype AS columnname, ... (so that
it has the right column names and types to support external references,
but no dependencies beyond the column data types), and then later dumping
the ON SELECT rule using the spelling CREATE OR REPLACE VIEW. This method
wasn't available when this code was originally written, but it's been
possible since PG 7.3, so it seems fine to start relying on it now.
To solve the --clean problem, make the dropStmt for an ON SELECT rule
be CREATE OR REPLACE VIEW with the same dummy target list as above.
In this way, during the DROP phase, we first reduce the view to have
no extra dependencies, and then we can drop it entirely when we've
gotten rid of whatever had a circular dependency on it.
(Note: this should work adequately well with the --if-exists option, since
the CREATE OR REPLACE VIEW will go through whether the view exists or not.
It could fail if the view exists with a conflicting column set, but we
don't really support --clean against a non-matching database anyway.)
This allows cleaning up some other kluges inside pg_dump, notably that
we don't need a notion of reloptions attached to a rule anymore.
Although this is a bug fix, commit to HEAD only for now. The problem's
existed for a long time and we've had relatively few complaints, so it
doesn't really seem worth taking risks to fix it in the back branches.
We might revisit that choice if no problems emerge.
Teach it not to complain if the dropStmt attached to an archive entry
is actually spelled CREATE OR REPLACE VIEW, since that will happen due to
an upcoming bug fix. Also, if it doesn't recognize a dropStmt, have it
print a WARNING and then emit the dropStmt unmodified. That seems like a
much saner behavior than Assert'ing or dumping core due to a null-pointer
dereference, which is what would happen before :-(.
Back-patch to 9.4 where this option was introduced.
Alvaro Herrera [Thu, 17 Nov 2016 16:31:30 +0000 (13:31 -0300)]
Avoid pin scan for replay of XLOG_BTREE_VACUUM in all cases
Replay of XLOG_BTREE_VACUUM during Hot Standby was previously thought to
require complex interlocking that matched the requirements on the
master. This required an O(N) operation that became a significant
problem with large indexes, causing replication delays of seconds or in
some cases minutes while the XLOG_BTREE_VACUUM was replayed.
This commit skips the “pin scan” that was previously required, by
observing in detail when and how it is safe to do so, with full
documentation. The pin scan is skipped only in replay; the VACUUM code
path on master is not touched here.
No tests included. Manual tests using an additional patch to view WAL records
and their timing have shown the change in WAL records and their handling has
successfully reduced replication delay.
This is a back-patch of commits 687f2cd7a015, 3e4b7d87988f, b60284261375
by Simon Riggs, to branches 9.4 and 9.5. No further backpatch is
possible because this depends on catalog scans being MVCC. I (Álvaro)
additionally updated a slight problem in the README, which explains why
this touches the 9.6 and master branches.