If --rate was used to throttle pgbench, it failed to sleep when it had
nothing to do, leading to a busy-wait with 100% CPU usage. This bug was
introduced in the refactoring in v10. Before that, sleep() was called with
a timeout, even when there were no file descriptors to wait for.
Reported by Jeff Janes, patch by Fabien COELHO. Backpatch to v10.
Tom Lane [Fri, 29 Sep 2017 20:26:21 +0000 (16:26 -0400)]
Fix inadequate locking during get_rel_oids().
get_rel_oids used to not take any relation locks at all, but that stopped
being a good idea with commit 3c3bb9933, which inserted a syscache lookup
into the function. A concurrent DROP TABLE could now produce "cache lookup
failed", which we don't want to have happen in normal operation. The best
solution seems to be to transiently take a lock on the relation named by
the RangeVar (which also makes the result of RangeVarGetRelid a lot less
spongy). But we shouldn't hold the lock beyond this function, because we
don't want VACUUM to lock more than one table at a time. (That would not
be a big problem right now, but it will become one after the pending
feature patch to allow multiple tables to be named in VACUUM.)
In passing, adjust vacuum_rel and analyze_rel to document that we don't
trust the passed RangeVar to be accurate, and allow the RangeVar to
possibly be NULL --- which it is anyway for a whole-database VACUUM,
though we accidentally didn't crash for that case.
The passed RangeVar is in fact inaccurate when dealing with a child
partition, as of v10, and it has been wrong for a whole long time in the
case of vacuum_rel() recursing to a TOAST table. None of these things
present visible bugs up to now, because the passed RangeVar is in fact
only consulted for autovacuum logging, and in that particular context it's
always accurate because autovacuum doesn't let vacuum.c expand partitions
nor recurse to toast tables. Still, this seems like trouble waiting to
happen, so let's nail the door at least partly shut. (Further cleanup
is planned, in HEAD only, as part of the pending feature patch.)
Fix some sadly inaccurate/obsolete comments too. Back-patch to v10.
Peter Eisentraut [Mon, 25 Sep 2017 15:59:46 +0000 (11:59 -0400)]
psql: Update \d sequence display
For \d sequencename, the psql code just did SELECT * FROM sequencename
to get the information to display, but this does not contain much
interesting information anymore in PostgreSQL 10, because the metadata
has been moved to a separate system catalog.
This patch creates a newly designed sequence display that is not merely
an extension of the general relation/table display as it was previously.
Vacuum calls page-level HOT prune to remove dead HOT tuples before doing
liveness checks (HeapTupleSatisfiesVacuum) on the remaining tuples. But
concurrent transaction commit/abort may turn DEAD some of the HOT tuples
that survived the prune, before HeapTupleSatisfiesVacuum tests them.
This happens to activate the code that decides to freeze the tuple ...
which resuscitates it, duplicating data.
(This is especially bad if there's any unique constraints, because those
are now internally violated due to the duplicate entries, though you
won't know until you try to REINDEX or dump/restore the table.)
One possible fix would be to simply skip doing anything to the tuple,
and hope that the next HOT prune would remove it. But there is a
problem: if the tuple is older than freeze horizon, this would leave an
unfrozen XID behind, and if no HOT prune happens to clean it up before
the containing pg_clog segment is truncated away, it'd later cause an
error when the XID is looked up.
Fix the problem by having the tuple freezing routines cope with the
situation: don't freeze the tuple (and keep it dead). In the cases that
the XID is older than the freeze age, set the HEAP_XMAX_COMMITTED flag
so that there is no need to look up the XID in pg_clog later on.
An isolation test is included, authored by Michael Paquier, loosely
based on Daniel Wood's original reproducer. It only tests one
particular scenario, though, not all the possible ways for this problem
to surface; it be good to have a more reliable way to test this more
fully, but it'd require more work.
In message https://postgr.es/m/20170911140103.5akxptyrwgpc25bw@alvherre.pgsql
I outlined another test case (more closely matching Dan Wood's) that
exposed a few more ways for the problem to occur.
Backpatch all the way back to 9.3, where this problem was introduced by
multixact juggling. In branches 9.3 and 9.4, this includes a backpatch
of commit e5ff9fefcd50 (of 9.5 era), since the original is not
correctable without matching the coding pattern in 9.5 up.
Reported-by: Daniel Wood Diagnosed-by: Daniel Wood Reviewed-by: Yi Wen Wong, Michaël Paquier
Discussion: https://postgr.es/m/E5711E62-8FDF-4DCA-A888-C200BF6B5742@amazon.com
Tom Lane [Wed, 27 Sep 2017 21:05:53 +0000 (17:05 -0400)]
Fix behavior when converting a float infinity to numeric.
float8_numeric() and float4_numeric() failed to consider the possibility
that the input is an IEEE infinity. The results depended on the
platform-specific behavior of sprintf(): on most platforms you'd get
something like
ERROR: invalid input syntax for type numeric: "inf"
but at least on Windows it's possible for the conversion to succeed and
deliver a finite value (typically 1), due to a nonstandard output format
from sprintf and lack of syntax error checking in these functions.
Since our numeric type lacks the concept of infinity, a suitable conversion
is impossible; the best thing to do is throw an explicit error before
letting sprintf do its thing.
While at it, let's use snprintf not sprintf. Overrunning the buffer
should be impossible if sprintf does what it's supposed to, but this
is cheap insurance against a stack smash if it doesn't.
Problem reported by Taiki Kondo. Patch by me based on fix suggestion
from KaiGai Kohei. Back-patch to all supported branches.
Tom Lane [Wed, 27 Sep 2017 20:14:37 +0000 (16:14 -0400)]
Revert to 9.6 treatment of ALTER TYPE enumtype ADD VALUE.
This reverts commit 15bc038f9, along with the followon commits 1635e80d3
and 984c92074 that tried to clean up the problems exposed by bug #14825.
The result was incomplete because it failed to address parallel-query
requirements. With 10.0 release so close upon us, now does not seem like
the time to be adding more code to fix that. I hope we can un-revert this
code and add the missing parallel query support during the v11 cycle.
Dean Rasheed [Wed, 27 Sep 2017 16:13:37 +0000 (17:13 +0100)]
Improve the CREATE POLICY documentation.
Provide a correct description of how multiple policies are combined,
clarify when SELECT permissions are required, mention SELECT FOR
UPDATE/SHARE, and do some other more minor tidying up.
It drops objects outside information_schema that depend on objects
inside information_schema. For example, it will drop a user-defined
view if the view query refers to information_schema.
posix_fallocate() is not quite a drop-in replacement for fallocate(),
because it is defined to return the error code as its function result,
not in "errno". I (tgl) missed this because RHEL6's version seems
to set errno as well. That is not the case on more modern Linuxen,
though, as per buildfarm results.
Aside from fixing the return-convention confusion, remove the test
for ENOSYS; we expect that glibc will mask that for posix_fallocate,
though it does not for fallocate. Keep the test for EINTR, because
POSIX specifies that as a possible result, and buildfarm results
suggest that it can happen in practice.
Tom Lane [Tue, 26 Sep 2017 17:12:13 +0000 (13:12 -0400)]
Remove heuristic same-transaction test from check_safe_enum_use().
The blacklist mechanism added by the preceding commit directly fixes
most of the practical cases that the same-transaction test was meant
to cover. What remains is use-cases like
begin;
create type e as enum('x');
alter type e add value 'y';
-- use 'y' somehow
commit;
However, because the same-transaction test is heuristic, it fails on
small variants of that, such as renaming the type or changing its
owner. Rather than try to explain the behavior to users, let's
remove it and just have a rule that the newly added value can't be
used before being committed, full stop. Perhaps later it will be
worth the implementation effort and overhead to have a more accurate
test for type-was-created-in-this-transaction. We'll wait for some
field experience with v10 before deciding to do that.
Tom Lane [Tue, 26 Sep 2017 17:12:03 +0000 (13:12 -0400)]
Use a blacklist to distinguish original from add-on enum values.
Commit 15bc038f9 allowed ALTER TYPE ADD VALUE to be executed inside
transaction blocks, by disallowing the use of the added value later
in the same transaction, except under limited circumstances. However,
the test for "limited circumstances" was heuristic and could reject
references to enum values that were created during CREATE TYPE AS ENUM,
not just later. This breaks the use-case of restoring pg_dump scripts
in a single transaction, as reported in bug #14825 from Balazs Szilfai.
We can improve this by keeping a "blacklist" table of enum value OIDs
created by ALTER TYPE ADD VALUE during the current transaction. Any
visible-but-uncommitted value whose OID is not in the blacklist must
have been created by CREATE TYPE AS ENUM, and can be used safely
because it could not have a lifespan shorter than its parent enum type.
This change also removes the restriction that a renamed enum value
can't be used before being committed (unless it was on the blacklist).
Andrew Dunstan, with cosmetic improvements by me.
Back-patch to v10.
Peter Eisentraut [Tue, 26 Sep 2017 14:03:56 +0000 (10:03 -0400)]
Handle heap rewrites better in logical replication
A FOR ALL TABLES publication naturally considers all base tables to be a
candidate for replication. This includes transient heaps that are
created during a table rewrite during DDL. This causes failures on the
subscriber side because it will not have a table like pg_temp_16386 to
receive data (and if it did, it would be the wrong table).
The prevent this problem, we filter out any tables that match this
naming pattern and match an actual table from FOR ALL TABLES
publications. This is only a heuristic, meaning that user tables that
match that naming could accidentally be omitted. A more robust solution
might require an explicit marking of such tables in pg_class somehow.
Reported-by: yxq <yxq@o2.pl>
Bug: #14785 Reviewed-by: Andres Freund <andres@anarazel.de> Reviewed-by: Petr Jelinek <petr.jelinek@2ndquadrant.com>
Tom Lane [Mon, 25 Sep 2017 20:09:19 +0000 (16:09 -0400)]
Avoid SIGBUS on Linux when a DSM memory request overruns tmpfs.
On Linux, shared memory segments created with shm_open() are backed by
swap files created in tmpfs. If the swap file needs to be extended,
but there's no tmpfs space left, you get a very unfriendly SIGBUS trap.
To avoid this, force allocation of the full request size when we create
the segment. This adds a few cycles, but none that we wouldn't expend
later anyway, assuming the request isn't hugely bigger than the actual
need.
Make this code #ifdef __linux__, because (a) there's not currently a
reason to think the same problem exists on other platforms, and (b)
applying posix_fallocate() to an FD created by shm_open() isn't very
portable anyway.
Peter Eisentraut [Sun, 24 Sep 2017 04:56:31 +0000 (00:56 -0400)]
Allow ICU to use SortSupport on Windows with UTF-8
There is no reason to ever prevent the use of SortSupport on Windows
when ICU locales are used. We previously avoided SortSupport on Windows
with UTF-8 server encoding and a non C-locale due to restrictions in
Windows' libc functionality.
This is now considered to be a restriction in one platform's libc
collation provider, and not a more general platform restriction.
Peter Eisentraut [Sun, 24 Sep 2017 04:29:59 +0000 (00:29 -0400)]
doc: Expand user documentation on SCRAM
Explain more about how the different password authentication methods and
the password_encryption settings relate to each other, give some
upgrading advice, and set a better link from the release notes.
Peter Eisentraut [Sun, 24 Sep 2017 02:59:26 +0000 (22:59 -0400)]
Fix pg_basebackup test to original intent
One test case was meant to check that pg_basebackup does not succeed
when a slot is specified with -S but WAL streaming is not selected,
which used to require specifying -X stream. Since -X stream is the
default in PostgreSQL 10, this test case no longer covers that meaning,
but the pg_basebackup invocation happened to fail anyway for the
unrelated reason that the specified replication slot does not exist. To
fix, move the test case to later in the file where the slot does exist,
and add -X none to the invocation so that it covers the originally meant
behavior.
extracted from a patch by Michael Banck <michael.banck@credativ.de>
Peter Eisentraut [Fri, 22 Sep 2017 20:50:59 +0000 (16:50 -0400)]
Fix saving and restoring umask
In two cases, we set a different umask for some piece of code and
restore it afterwards. But if the contained code errors out, the umask
is not restored. So add TRY/CATCH blocks to fix that.
Reported-by: Jeremy Schneider
Author: Michael Paquier
Discussion: https://postgr.es/m/CA+fnDAZaPCwfY8Lp-pfLnUGFAXRu1VfLyRgdup-L-kwcBj8MqQ@mail.gmail.com
Tom Lane [Fri, 22 Sep 2017 04:04:21 +0000 (00:04 -0400)]
Sync our copy of the timezone library with IANA tzcode master.
This patch absorbs a few unreleased fixes in the IANA code.
It corresponds to commit 2d8b944c1cec0808ac4f7a9ee1a463c28f9cd00a
in https://github.com/eggert/tz. Non-cosmetic changes include:
TZDEFRULESTRING is updated to match current US DST practice,
rather than what it was over ten years ago. This only matters
for interpretation of POSIX-style zone names (e.g., "EST5EDT"),
and only if the timezone database doesn't include either an exact
match for the zone name or a "posixrules" entry. The latter
should not be true in any current Postgres installation, but
this could possibly matter when using --with-system-tzdata.
Get rid of a nonportable use of "++var" on a bool var.
This is part of a larger fix that eliminates some vestigial
support for consecutive leap seconds, and adds checks to
the "zic" compiler that the data files do not specify that.
Remove a couple of ancient compatibility hacks. The IANA
crew think these are obsolete, and I tend to agree. But
perhaps our buildfarm will think different.
Back-patch to all supported branches, in line with our policy
that all branches should be using current IANA code. Before v10,
this includes application of current pgindent rules, to avoid
whitespace problems in future back-patches.
Tom Lane [Thu, 21 Sep 2017 22:13:11 +0000 (18:13 -0400)]
Give a better error for duplicate entries in VACUUM/ANALYZE column list.
Previously, the code didn't think about this case and would just try to
analyze such a column twice. That would fail at the point of inserting
the second version of the pg_statistic row, with obscure error messsages
like "duplicate key value violates unique constraint" or "tuple already
updated by self", depending on context and PG version. We could allow
the case by ignoring duplicate column specifications, but it seems better
to reject it explicitly.
The bogus error messages seem like arguably a bug, so back-patch to
all supported versions.
Nathan Bossart, per a report from Michael Paquier, and whacked
around a bit by me.
Tom Lane [Wed, 20 Sep 2017 17:52:36 +0000 (13:52 -0400)]
Improve dubious memory management in pg_newlocale_from_collation().
pg_newlocale_from_collation() used malloc() and strdup() directly,
which is generally not per backend coding style, and it didn't bother
to check for failure results, but would just SIGSEGV instead. Also,
if one of the numerous error checks in the middle of the function
failed, the already-allocated memory would be leaked permanently.
Admittedly, it's not a lot of memory, but it could build up if this
function were called repeatedly for a bad collation.
The first two problems are easily cured by palloc'ing in TopMemoryContext
instead of calling libc directly. We can fairly easily dodge the leakage
problem for the struct pg_locale_struct by filling in a temporary variable
and allocating permanent storage only once we reach the bottom of the
function. It's harder to get rid of the potential leakage for ICU's copy
of the collcollate string, but at least that's only allocated after most
of the error checks; so live with that aspect.
Back-patch to v10 where this code came in, with one or another of the
ICU patches.
Tom Lane [Wed, 20 Sep 2017 15:28:34 +0000 (11:28 -0400)]
Fix instability in subscription regression test.
005_encoding.pl neglected to wait for the subscriber's initial
synchronization to happen. While we have not seen this fail in
the buildfarm, it's pretty easy to demonstrate there's an issue
by hacking logicalrep_worker_launch() to fail most of the time.
Tom Lane [Wed, 20 Sep 2017 15:10:42 +0000 (11:10 -0400)]
Fix erroneous documentation about noise word GROUP.
GRANT, REVOKE, and some allied commands allow the noise word GROUP
before a role name (cf. grantee production in gram.y). This option
does not exist elsewhere, but it had nonetheless snuck into the
documentation for ALTER ROLE, ALTER USER, and CREATE SCHEMA.
Seems to be a copy-and-pasteo in commit 31eae6028, which did expand the
syntax choices here, but not in that way. Back-patch to 9.5 where that
came in.
Magnus Hagander [Wed, 20 Sep 2017 12:09:05 +0000 (14:09 +0200)]
Mention need for --no-inc-recursive in rsync command
Since rsync 3.0.0 (released in 2008), the default way to enumerate
changes was changed in a way that makes it less likely that the hardlink
sync mode works. Since the whole point of the documented procedure is
for the hardlinks to work, change our docs to suggest using the
backwards compatibility switch.
Tom Lane [Mon, 18 Sep 2017 15:39:44 +0000 (11:39 -0400)]
Fix, or at least ameliorate, bugs in logicalrep_worker_launch().
If we failed to get a background worker slot, the code just walked
away from the logicalrep-worker slot it already had, leaving that
looking like the worker is still starting up. This led to an indefinite
hang in subscription startup, as reported by Thomas Munro. We must
release the slot on failure.
Also fix a thinko: we must capture the worker slot's generation before
releasing LogicalRepWorkerLock the first time, else testing to see if
it's changed is pretty meaningless.
BTW, the CHECK_FOR_INTERRUPTS() in WaitForReplicationWorkerAttach is a
ticking time bomb, even without considering the possibility of elog(ERROR)
in one of the other functions it calls. Really, this entire business needs
a redesign with some actual thought about error recovery. But for now
I'm just band-aiding the case observed in testing.
Peter Eisentraut [Mon, 18 Sep 2017 01:37:02 +0000 (21:37 -0400)]
Fix DROP SUBSCRIPTION hang
When ALTER SUBSCRIPTION DISABLE is run in the same transaction before
DROP SUBSCRIPTION, the latter will hang because workers will still be
running, not having seen the DISABLE committed, and DROP SUBSCRIPTION
will wait until the workers have vacated the replication origin slots.
Previously, DROP SUBSCRIPTION killed the logical replication workers
immediately only if it was going to drop the replication slot, otherwise
it scheduled the worker killing for the end of the transaction, as a
result of 7e174fa793a2df89fe03d002a5087ef67abcdde8. This, however,
causes the present problem. To fix, kill the workers immediately in all
cases. This covers all cases: A subscription that doesn't have a
replication slot must be disabled. It was either disabled in the same
transaction, or it was already disabled before the current transaction,
but then there shouldn't be any workers left and this won't make a
difference.
Tom Lane [Sun, 17 Sep 2017 21:04:21 +0000 (17:04 -0400)]
Doc: update v10 release notes through today.
Add item about number of times statement-level triggers will be fired.
Rearrange the compatibility items into (what seems to me) a less
random ordering.
Tom Lane [Sun, 17 Sep 2017 18:50:01 +0000 (14:50 -0400)]
Fix possible dangling pointer dereference in trigger.c.
AfterTriggerEndQuery correctly notes that the query_stack could get
repalloc'd during a trigger firing, but it nonetheless passes the address
of a query_stack entry to afterTriggerInvokeEvents, so that if such a
repalloc occurs, afterTriggerInvokeEvents is already working with an
obsolete dangling pointer while it scans the rest of the events. Oops.
The only code at risk is its "delete_ok" cleanup code, so we can
prevent unsafe behavior by passing delete_ok = false instead of true.
However, that could have a significant performance penalty, because the
point of passing delete_ok = true is to not have to re-scan possibly
a large number of dead trigger events on the next time through the loop.
There's more than one way to skin that cat, though. What we can do is
delete all the "chunks" in the event list except the last one, since
we know all events in them must be dead. Deleting the chunks is work
we'd have had to do later in AfterTriggerEndQuery anyway, and it ends
up saving rescanning of just about the same events we'd have gotten
rid of with delete_ok = true.
In v10 and HEAD, we also have to be careful to mop up any per-table
after_trig_events pointers that would become dangling. This is slightly
annoying, but I don't think that normal use-cases will traverse this code
path often enough for it to be a performance problem.
It's pretty hard to hit this in practice because of the unlikelihood
of the query_stack getting resized at just the wrong time. Nonetheless,
it's definitely a live bug of ancient standing, so back-patch to all
supported branches.
Tom Lane [Sun, 17 Sep 2017 16:16:38 +0000 (12:16 -0400)]
Ensure that BEFORE STATEMENT triggers fire the right number of times.
Commit 0f79440fb introduced mechanism to keep AFTER STATEMENT triggers
from firing more than once per statement, which was formerly possible
if more than one FK enforcement action had to be applied to a given
table. Add a similar mechanism for BEFORE STATEMENT triggers, so that
we don't have the unexpected situation of firing BEFORE STATEMENT
triggers more often than AFTER STATEMENT.
Tom Lane [Sat, 16 Sep 2017 19:31:26 +0000 (15:31 -0400)]
Doc: add example of transition table use in a trigger.
I noticed that there were exactly no complete examples of use of
a transition table in a trigger function, and no clear description
of just how you'd do it either. Improve that.
Tom Lane [Sat, 16 Sep 2017 17:20:32 +0000 (13:20 -0400)]
Fix SQL-spec incompatibilities in new transition table feature.
The standard says that all changes of the same kind (insert, update, or
delete) caused in one table by a single SQL statement should be reported
in a single transition table; and by that, they mean to include foreign key
enforcement actions cascading from the statement's direct effects. It's
also reasonable to conclude that if the standard had wCTEs, they would say
that effects of wCTEs applying to the same table as each other or the outer
statement should be merged into one transition table. We weren't doing it
like that.
Hence, arrange to merge tuples from multiple update actions into a single
transition table as much as we can. There is a problem, which is that if
the firing of FK enforcement triggers and after-row triggers with
transition tables is interspersed, we might need to report more tuples
after some triggers have already seen the transition table. It seems like
a bad idea for the transition table to be mutable between trigger calls.
There's no good way around this without a major redesign of the FK logic,
so for now, resolve it by opening a new transition table each time this
happens.
Also, ensure that AFTER STATEMENT triggers fire just once per statement,
or once per transition table when we're forced to make more than one.
Previous versions of Postgres have allowed each FK enforcement query
to cause an additional firing of the AFTER STATEMENT triggers for the
referencing table, but that's certainly not per spec. (We're still
doing multiple firings of BEFORE STATEMENT triggers, though; is that
something worth changing?)
Also, forbid using transition tables with column-specific UPDATE triggers.
The spec requires such transition tables to show only the tuples for which
the UPDATE trigger would have fired, which means maintaining multiple
transition tables or else somehow filtering the contents at readout.
Maybe someday we'll bother to support that option, but it looks like a
lot of trouble for a marginal feature.
The transition tables are now managed by the AfterTriggers data structures,
rather than being directly the responsibility of ModifyTable nodes. This
removes a subtransaction-lifespan memory leak introduced by my previous
band-aid patch 3c4359521.
In passing, refactor the AfterTriggers data structures to reduce the
management overhead for them, by using arrays of structs rather than
several parallel arrays for per-query-level and per-subtransaction state.
I failed to resist the temptation to do some copy-editing on the SGML
docs about triggers, above and beyond merely documenting the effects
of this patch.
Back-patch to v10, because we don't want the semantics of transition
tables to change post-release.
Patch by me, with help and review from Thomas Munro.
Bruce Momjian [Sat, 16 Sep 2017 15:58:00 +0000 (11:58 -0400)]
docs: clarify pg_upgrade docs regarding standbys and rsync
Document that rsync is an _optional_ way to upgrade standbys, suggest
rsync option --dry-run, and mention a way of upgrading one standby from
another using rsync. Also clarify some instructions by specifying if
they operate on the old or new clusters.
Reported-by: Stephen Frost, Magnus Hagander
Discussion: https://postgr.es/m/20170914191250.GB6595@momjian.us
Robert Haas [Sat, 16 Sep 2017 01:15:55 +0000 (21:15 -0400)]
After a MINVALUE/MAXVALUE bound, allow only more of the same.
In the old syntax, which used UNBOUNDED, we had a similar restriction,
but commit d363d42bb9a4399a0207bd3b371c966e22e06bd3, which changed the
syntax, eliminated it. Put it back.
Robert Haas [Thu, 14 Sep 2017 14:43:44 +0000 (10:43 -0400)]
Set partitioned_rels appropriately when UNION ALL is used.
In most cases, this omission won't matter, because the appropriate
locks will have been acquired during parse/plan or by AcquireExecutorLocks.
But it's a bug all the same.
Report by Ashutosh Bapat. Patch by me, reviewed by Amit Langote.
Andres Freund [Thu, 14 Sep 2017 08:53:10 +0000 (01:53 -0700)]
Properly check interrupts in execScan.c.
During the development of d47cfef711 the CFI()s in ExecScan() were
moved back and forth, ending up in the wrong place. Thus queries that
largely spend their time in ExecScan(), and have neither projection
nor a qual, can't be cancelled in a timely manner.
Reported-By: Jeff Janes
Author: Andres Freund
Discussion: https://postgr.es/m/CAMkU=1weDXp8eLLPt9SO1LEUsJYYK9cScaGhLKpuN+WbYo9b5g@mail.gmail.com
Backpatch: 10, as d47cfef711
Stephen Frost [Thu, 14 Sep 2017 00:04:43 +0000 (20:04 -0400)]
Fix ordering in pg_dump of GRANTs
The order in which GRANTs are output is important as GRANTs which have
been GRANT'd by individuals via WITH GRANT OPTION GRANTs have to come
after the GRANT which included the WITH GRANT OPTION. This happens
naturally in the backend during normal operation as we only change
existing ACLs in-place, only add new ACLs to the end, and when removing
an ACL we remove any which depend on it also.
Also, adjust the comments in acl.h to make this clear.
Unfortunately, the updates to pg_dump to handle initial privileges
involved pulling apart ACLs and then combining them back together and
could end up putting them back together in an invalid order, leading to
dumps which wouldn't restore.
Fix this by adjusting the queries used by pg_dump to ensure that the
ACLs are rebuilt in the same order in which they were originally.
Back-patch to 9.6 where the changes for initial privileges were done.
The documentation claimed that one should send
"pg_same_as_startup_message" as the user name in the SCRAM messages, but
this did not match the actual implementation, so remove it.
Bruce Momjian [Wed, 13 Sep 2017 13:11:28 +0000 (09:11 -0400)]
docs: improve pg_upgrade standby instructions
This makes it clear that pg_upgrade standby upgrade instructions should
only be used in link mode, adds examples, and explains how rsync works
with links.
Reported-by: Andreas Joseph Krogh
Discussion: https://postgr.es/m/VisenaEmail.6c.c0e592c5af4ef0a2.15e785dcb61@tc7-visena
Tom Lane [Tue, 12 Sep 2017 02:02:58 +0000 (22:02 -0400)]
Fix RecursiveCopy.pm to cope with disappearing files.
When copying from an active database tree, it's possible for files to be
deleted after we see them in a readdir() scan but before we can open them.
(Once we've got a file open, we don't expect any further errors from it
getting unlinked, though.) Tweak RecursiveCopy so it can cope with this
case, so as to avoid irreproducible test failures.
Back-patch to 9.6 where this code was added. In v10 and HEAD, also
remove unused "use RecursiveCopy" in one recovery test script.
Tom Lane [Sun, 10 Sep 2017 18:59:56 +0000 (14:59 -0400)]
Quick-hack fix for foreign key cascade vs triggers with transition tables.
AFTER triggers using transition tables crashed if they were fired due
to a foreign key ON CASCADE update. This is because ExecEndModifyTable
flushes the transition tables, on the assumption that any trigger that
could need them was already fired during ExecutorFinish. Normally
that's true, because we don't allow transition-table-using triggers
to be deferred. However, foreign key CASCADE updates force any
triggers on the referencing table to be deferred to the outer query
level, by means of the EXEC_FLAG_SKIP_TRIGGERS flag. I don't recall
all the details of why it's like that and am pretty loath to redesign
it right now. Instead, just teach ExecEndModifyTable to skip destroying
the TransitionCaptureState when that flag is set. This will allow the
transition table data to survive until end of the current subtransaction.
This isn't a terribly satisfactory solution, because (1) we might be
leaking the transition tables for much longer than really necessary,
and (2) as things stand, an AFTER STATEMENT trigger will fire once per
RI updating query, ie once per row updated or deleted in the referenced
table. I suspect that is not per SQL spec. But redesigning this is a
research project that we're certainly not going to get done for v10.
So let's go with this hackish answer for now.
In passing, tweak AfterTriggerSaveEvent to not save the transition_capture
pointer into the event record for a deferrable trigger. This is not
necessary to fix the current bug, but it avoids letting dangling pointers
to long-gone transition tables persist in the trigger event queue. That's
at least a safety feature. It might also allow merging shared trigger
states in more cases than before.
I added a regression test that demonstrates the crash on unpatched code,
and also exposes the behavior of firing the AFTER STATEMENT triggers
once per row update.
Per bug #14808 from Philippe Beaudoin. Back-patch to v10.
Tom Lane [Fri, 8 Sep 2017 23:04:32 +0000 (19:04 -0400)]
Fix uninitialized-variable bug.
map_partition_varattnos() failed to set its found_whole_row output
parameter if the given expression list was NIL. This seems to be
a pre-existing bug that chanced to be exposed by commit 6f6b99d13.
It might be unreachable in v10, but I have little faith in that
proposition, so back-patch.
Robert Haas [Thu, 7 Sep 2017 14:55:45 +0000 (10:55 -0400)]
Even if some partitions are foreign, allow tuple routing.
This doesn't allow routing tuple to the foreign partitions themselves,
but it permits tuples to be routed to regular partitions despite the
presence of foreign partitions in the same inheritance hierarchy.
Etsuro Fujita, reviewed by Amit Langote and by me.
Tom Lane [Wed, 6 Sep 2017 15:35:31 +0000 (11:35 -0400)]
Add psql variables showing server version and psql version.
We already had a psql variable VERSION that shows the verbose form of
psql's own version. Add VERSION_NAME to show the short form (e.g.,
"11devel") and VERSION_NUM to show the numeric form (e.g., 110000).
Also add SERVER_VERSION_NAME and SERVER_VERSION_NUM to show the short and
numeric forms of the server's version. (We'd probably add SERVER_VERSION
with the verbose string if it were readily available; but adding another
network round trip to get it seems too expensive.)
The numeric forms, in particular, are expected to be useful for scripting
purposes, now that psql can do conditional tests.
Tom Lane [Wed, 6 Sep 2017 14:41:05 +0000 (10:41 -0400)]
Clean up handling of dropped columns in NAMEDTUPLESTORE RTEs.
The NAMEDTUPLESTORE patch piggybacked on the infrastructure for
TABLEFUNC/VALUES/CTE RTEs, none of which can ever have dropped columns,
so the possibility was ignored most places. Fix that, including adding a
specification to parsenodes.h about what it's supposed to look like.
In passing, clean up assorted comments that hadn't been maintained
properly by said patch.
Per bug #14799 from Philippe Beaudoin. Back-patch to v10.
Throttling for sending a base backup in walsender is broken for the case
where there is a lot of WAL traffic, because the latch used to put the
walsender to sleep is also signalled by regular WAL traffic (and each
signal causes an additional batch of data to be sent); the net effect is
that there is no or little actual throttling. This is undesirable, so
rewrite the sleep into a loop to achieve the desired effeect.
Author: Jeff Janes, small tweaks by me Reviewed-by: Antonin Houska
Discussion: https://postgr.es/m/CAMkU=1xH6mde-yL-Eo1TKBGNd0PB1-TMxvrNvqcAkN-qr2E9mw@mail.gmail.com
Tom Lane [Sun, 3 Sep 2017 15:01:08 +0000 (11:01 -0400)]
Fix macro-redefinition warning on MSVC.
In commit 9d6b160d7, I tweaked pg_config.h.win32 to use
"#define HAVE_LONG_LONG_INT_64 1" rather than defining it as empty,
for consistency with what happens in an autoconf'd build.
But Solution.pm injects another definition of that macro into
ecpg_config.h, leading to justifiable (though harmless) compiler whining.
Make that one consistent too. Back-patch, like the previous patch.
Tom Lane [Fri, 1 Sep 2017 21:38:54 +0000 (17:38 -0400)]
Improve division of labor between execParallel.c and nodeGather[Merge].c.
Move the responsibility for creating/destroying TupleQueueReaders into
execParallel.c, to avoid duplicative coding in nodeGather.c and
nodeGatherMerge.c. Also, instead of having DestroyTupleQueueReader do
shm_mq_detach, do it in the caller (which is now only ExecParallelFinish).
This means execParallel.c does both the attaching and detaching of the
tuple-queue-reader shm_mqs, which seems less weird than the previous
arrangement.
These changes also eliminate a vestigial memory leak (of the pei->tqueue
array). It's now demonstrable that rescans of Gather or GatherMerge don't
leak memory.
Tom Lane [Fri, 1 Sep 2017 19:14:18 +0000 (15:14 -0400)]
Make [U]INT64CONST safe for use in #if conditions.
Instead of using a cast to force the constant to be the right width,
assume we can plaster on an L, UL, LL, or ULL suffix as appropriate.
The old approach to this is very hoary, dating from before we were
willing to require compilers to have working int64 types.
This fix makes the PG_INT64_MIN, PG_INT64_MAX, and PG_UINT64_MAX
constants safe to use in preprocessor conditions, where a cast
doesn't work. Other symbolic constants that might be defined using
[U]INT64CONST are likewise safer than before.
Also fix the SIZE_MAX macro to be similarly safe, if we are forced
to provide a definition for that. The test added in commit 2e70d6b5e
happens to do what we want even with the hack "(size_t) -1" definition,
but we could easily get burnt on other tests in future.
Back-patch to all supported branches, like the previous commits.
Tom Lane [Fri, 1 Sep 2017 17:52:53 +0000 (13:52 -0400)]
Ensure SIZE_MAX can be used throughout our code.
Pre-C99 platforms may lack <stdint.h> and thereby SIZE_MAX. We have
a couple of places using the hack "(size_t) -1" as a fallback, but
it wasn't universally available; which means the code added in commit 2e70d6b5e fails to compile everywhere. Move that hack to c.h so that
we can rely on having SIZE_MAX everywhere.
Per discussion, it'd be a good idea to make the macro's value safe
for use in #if-tests, but that will take a bit more work. This is
just a quick expedient to get the buildfarm green again.
Back-patch to all supported branches, like the previous commit.
The original code had a race condition because it never ensured the
standby was caught up before proceeding; add a wait similar to every
other place that does this.
Do for replication origins what the previous commit did for replication
slots: restore the original behavior of replication origin drop to raise
an error rather than blocking, because users might be depending on the
original behavior. Maintain the blocking behavior when invoked
internally from logical replication subscription handling.
Commit 9915de6c1cb2 changed the default behavior of
DROP_REPLICATION_SLOT so that it would wait until any session holding
the slot active would release it, instead of raising an error. But
users are already depending on the original behavior, so revert to it by
default and add a WAIT option to invoke the new behavior.
Per complaint from Simone Gotti, in
Discussion: https://postgr.es/m/CAEvsy6Wgdf90O6pUvg2wSVXL2omH5OPC-38OD4Zzgk-FXavj3Q@mail.gmail.com
Tom Lane [Thu, 31 Aug 2017 20:20:58 +0000 (16:20 -0400)]
Avoid memory leaks when a GatherMerge node is rescanned.
Rescanning a GatherMerge led to leaking some memory in the executor's
query-lifespan context, because most of the node's working data structures
were simply abandoned and rebuilt from scratch. In practice, this might
never amount to much, given the cost of relaunching worker processes ---
but it's still pretty messy, so let's fix it.
We can rearrange things so that the tuple arrays are simply cleared and
reused, and we don't need to rebuild the TupleTableSlots either, just
clear them. One small complication is that because we might get a
different number of workers on each iteration, we can't keep the old
convention that the leader's gm_slots[] entry is the last one; the leader
might clobber a TupleTableSlot that we need for a worker in a future
iteration. Hence, adjust the logic so that the leader has slot 0 always,
while the active workers have slots 1..n.
Back-patch to v10 to keep all the existing versions of nodeGatherMerge.c
in sync --- because of the renumbering of the slots, there would otherwise
be a very large risk that any future backpatches in this module would
introduce bugs.
Tom Lane [Thu, 31 Aug 2017 19:10:24 +0000 (15:10 -0400)]
Clean up shm_mq cleanup.
The logic around shm_mq_detach was a few bricks shy of a load, because
(contrary to the comments for shm_mq_attach) all it did was update the
shared shm_mq state. That left us leaking a bit of process-local
memory, but much worse, the on_dsm_detach callback for shm_mq_detach
was still armed. That means that whenever we ultimately detach from
the DSM segment, we'd run shm_mq_detach again for already-detached,
possibly long-dead queues. This accidentally fails to fail today,
because we only ever re-use a shm_mq's memory for another shm_mq, and
multiple detach attempts on the last such shm_mq are fairly harmless.
But it's gonna bite us someday, so let's clean it up.
To do that, change shm_mq_detach's API so it takes a shm_mq_handle
not the underlying shm_mq. This makes the callers simpler in most
cases anyway. Also fix a few places in parallel.c that were just
pfree'ing the handle structs rather than doing proper cleanup.
Back-patch to v10 because of the risk that the revenant shm_mq_detach
callbacks would cause a live bug sometime. Since this is an API
change, it's too late to do it in 9.6. (We could make a variant
patch that preserves API, but I'm not excited enough to do that.)
Tom Lane [Wed, 30 Aug 2017 21:21:08 +0000 (17:21 -0400)]
Code review for nodeGatherMerge.c.
Comment the fields of GatherMergeState, and organize them a bit more
sensibly. Comment GMReaderTupleBuffer more usefully too. Improve
assorted other comments that were obsolete or just not very good English.
Get rid of the use of a GMReaderTupleBuffer for the leader process;
that was confusing, since only the "done" field was used, and that
in a way redundant with need_to_scan_locally.
In gather_merge_init, avoid calling load_tuple_array for
already-known-exhausted workers. I'm not sure if there's a live bug there,
but the case is unlikely to be well tested due to timing considerations.
Remove some useless code, such as duplicating the tts_isempty test done by
TupIsNull.
Remove useless initialization of ps.qual, replacing that with an assertion
that we have no qual to check. (If we did, the code would fail to check
it.)
Avoid applying heap_copytuple to a null tuple. While that fails to crash,
it's confusing and it makes the code less legible not more so IMO.
Propagate a couple of these changes into nodeGather.c, as well.
Back-patch to v10, partly because of the possibility that the
gather_merge_init change is fixing a live bug, but mostly to keep
the branches in sync to ease future bug fixes.
Tom Lane [Wed, 30 Aug 2017 17:18:16 +0000 (13:18 -0400)]
Separate reinitialization of shared parallel-scan state from ExecReScan.
Previously, the parallel executor logic did reinitialization of shared
state within the ExecReScan code for parallel-aware scan nodes. This is
problematic, because it means that the ExecReScan call has to occur
synchronously (ie, during the parent Gather node's ReScan call). That is
swimming very much against the tide so far as the ExecReScan machinery is
concerned; the fact that it works at all today depends on a lot of fragile
assumptions, such as that no plan node between Gather and a parallel-aware
scan node is parameterized. Another objection is that because ExecReScan
might be called in workers as well as the leader, hacky extra tests are
needed in some places to prevent unwanted shared-state resets.
Hence, let's separate this code into two functions, a ReInitializeDSM
call and the ReScan call proper. ReInitializeDSM is called only in
the leader and is guaranteed to run before we start new workers.
ReScan is returned to its traditional function of resetting only local
state, which means that ExecReScan's usual habits of delaying or
eliminating child rescan calls are safe again.
As with the preceding commit 7df2c1f8d, it doesn't seem to be necessary
to make these changes in 9.6, which is a good thing because the FDW and
CustomScan APIs are impacted.
Revert the reversion commits a20aac890 and 9b644745c. In the wake of
commit 7df2c1f8d, we should get stable buildfarm results from this test;
if not, I'd like to know sooner not later.
Tom Lane [Wed, 30 Aug 2017 13:29:56 +0000 (09:29 -0400)]
Force rescanning of parallel-aware scan nodes below a Gather[Merge].
The ExecReScan machinery contains various optimizations for postponing
or skipping rescans of plan subtrees; for example a HashAgg node may
conclude that it can re-use the table it built before, instead of
re-reading its input subtree. But that is wrong if the input contains
a parallel-aware table scan node, since the portion of the table scanned
by the leader process is likely to vary from one rescan to the next.
This explains the timing-dependent buildfarm failures we saw after
commit a2b70c89c.
The established mechanism for showing that a plan node's output is
potentially variable is to mark it as depending on some runtime Param.
Hence, to fix this, invent a dummy Param (one that has a PARAM_EXEC
parameter number, but carries no actual value) associated with each Gather
or GatherMerge node, mark parallel-aware nodes below that node as dependent
on that Param, and arrange for ExecReScanGather[Merge] to flag that Param
as changed whenever the Gather[Merge] node is rescanned.
This solution breaks an undocumented assumption made by the parallel
executor logic, namely that all rescans of nodes below a Gather[Merge]
will happen synchronously during the ReScan of the top node itself.
But that's fundamentally contrary to the design of the ExecReScan code,
and so was doomed to fail someday anyway (even if you want to argue
that the bug being fixed here wasn't a failure of that assumption).
A follow-on patch will address that issue. In the meantime, the worst
that's expected to happen is that given very bad timing luck, the leader
might have to do all the work during a rescan, because workers think
they have nothing to do, if they are able to start up before the eventual
ReScan of the leader's parallel-aware table scan node has reset the
shared scan state.
Although this problem exists in 9.6, there does not seem to be any way
for it to manifest there. Without GatherMerge, it seems that a plan tree
that has a rescan-short-circuiting node below Gather will always also
have one above it that will short-circuit in the same cases, preventing
the Gather from being rescanned. Hence we won't take the risk of
back-patching this change into 9.6. But v10 needs it.
Peter Eisentraut [Tue, 29 Aug 2017 23:33:24 +0000 (19:33 -0400)]
doc: Avoid sidebar element
The formatting of the sidebar element didn't carry over to the new tool
chain. Instead of inventing a whole new way of dealing with it, just
convert the one use to a "note".
Tom Lane [Tue, 29 Aug 2017 19:38:05 +0000 (15:38 -0400)]
Doc: document libpq's restriction to INT_MAX rows in a PGresult.
As long as PQntuples, PQgetvalue, etc, use "int" for row numbers, we're
pretty much stuck with this limitation. The documentation formerly stated
that the result of PQntuples "might overflow on 32-bit operating systems",
which is just nonsense: that's not where the overflow would happen, and
if you did reach an overflow it would not be on a 32-bit machine, because
you'd have OOM'd long since.
Tom Lane [Tue, 29 Aug 2017 19:18:01 +0000 (15:18 -0400)]
Teach libpq to detect integer overflow in the row count of a PGresult.
Adding more than 1 billion rows to a PGresult would overflow its ntups and
tupArrSize fields, leading to client crashes. It'd be desirable to use
wider fields on 64-bit machines, but because all of libpq's external APIs
use plain "int" for row counters, that's going to be hard to accomplish
without an ABI break. Given the lack of complaints so far, and the general
pain that would be involved in using such huge PGresults, let's settle for
just preventing the overflow and reporting a useful error message if it
does happen. Also, for a couple more lines of code we can increase the
threshold of trouble from INT_MAX/2 to INT_MAX rows.
To do that, refactor pqAddTuple() to allow returning an error message that
replaces the default assumption that it failed because of out-of-memory.
Along the way, fix PQsetvalue() so that it reports all failures via
pqInternalNotice(). It already did so in the case of bad field number,
but neglected to report anything for other error causes.
Because of the potential for crashes, this seems like a back-patchable
bug fix, despite the lack of field reports.
Tom Lane [Tue, 29 Aug 2017 13:34:21 +0000 (09:34 -0400)]
Improve docs about numeric formatting patterns (to_char/to_number).
The explanation about "0" versus "9" format characters was confusing
and arguably wrong; the discussion of sign handling wasn't very good
either. Notably, while it's accurate to say that "FM" strips leading
zeroes in date/time values, what it really does with numeric values
is to strip *trailing* zeroes, and then only if you wrote "9" rather
than "0". Per gripes from Erwin Brandstetter.