Enforce cube dimension limit in all cube construction functions
contrib/cube has a limit to 100 dimensions for cube datatype. However, it's
not enforced everywhere, and one can actually construct cube with more than
100 dimensions having then trouble with dump/restore. This commit add checks
for dimensions limit in all functions responsible for cube construction.
Backpatch to all supported versions.
Reported-by: Andrew Gierth
Discussion: https://postgr.es/m/87va7uybt4.fsf%40news-spur.riddles.org.uk
Author: Andrey Borodin with small additions by me
Review: Tom Lane
Backpatch-through: 9.3
Split contrib/cube platform-depended checks into separate test
We're currently maintaining two outputs for cube regression test. But that
appears to be unsuitable, because these outputs are different in out few checks
involving scientific notation. So, split checks involving scientific notation
into separate test, making contrib/cube easier to maintain. Backpatch to all
supported versions in order to make further backpatching easier.
Discussion: https://postgr.es/m/CAPpHfdvJgWjxHsJTtT%2Bo1tz3OR8EFHcLQjhp-d3%2BUcmJLh-fQA%40mail.gmail.com
Author: Alexander Korotkov
Backpatch-through: 9.3
Tom Lane [Fri, 31 Aug 2018 16:26:20 +0000 (12:26 -0400)]
Make checksum_impl.h safe to compile with -fstrict-aliasing.
In general, Postgres requires -fno-strict-aliasing with compilers that
implement C99 strict aliasing rules. There's little hope of getting
rid of that overall. But it seems like it would be a good idea if
storage/checksum_impl.h in particular didn't depend on it, because
that header is explicitly intended to be included by external programs.
We don't have a lot of control over the compiler switches that an
external program might use, as shown by Michael Banck's report of
failure in a privately-modified version of pg_verify_checksums.
Hence, switch to using a union in place of willy-nilly pointer casting
inside this file. I think this makes the code a bit more readable
anyway.
checksum_impl.h hasn't changed since it was introduced in 9.3,
so back-patch all the way.
Michael Paquier [Thu, 30 Aug 2018 00:11:27 +0000 (17:11 -0700)]
Stop bgworkers during fast shutdown with postmaster in startup phase
When a postmaster gets into its phase PM_STARTUP, it would start
background workers using BgWorkerStart_PostmasterStart mode immediately,
which would cause problems for a fast shutdown as the postmaster forgets
to send SIGTERM to already-started background workers. With smart and
immediate shutdowns, this correctly happened, and fast shutdown is the
only mode missing the shot.
Author: Alexander Kukushkin Reviewed-by: Michael Paquier
Discussion: https://postgr.es/m/CAFh8B=mvnD8+DZUfzpi50DoaDfZRDfd7S=gwj5vU9GYn8UvHkA@mail.gmail.com
Backpatch-through: 9.5
Tom Lane [Tue, 28 Aug 2018 23:46:59 +0000 (19:46 -0400)]
Make pg_restore's identify_locking_dependencies() more bulletproof.
This function had a blacklist of dump object types that it believed
needed exclusive lock ... but we hadn't maintained that, so that it
was missing ROW SECURITY, POLICY, and INDEX ATTACH items, all of
which need (or should be treated as needing) exclusive lock.
Since the same oversight seems likely in future, let's reverse the
sense of the test so that the code has a whitelist of safe object
types; better to wrongly assume a command can't be run in parallel
than the opposite. Currently the only POST_DATA object type that's
safe is CREATE INDEX ... and that list hasn't changed in a long time.
Andrew Gierth [Tue, 28 Aug 2018 13:43:51 +0000 (14:43 +0100)]
postgres_fdw: don't push ORDER BY with no vars (bug #15352)
Commit aa09cd242 changed a condition in find_em_expr_for_rel from
being a bms_equal comparison of relids to bms_is_subset, in order to
support order by clauses on foreign joins. But this also allows
through the degenerate case of expressions with no Vars at all (and
hence empty relids), including integer constants which will be parsed
unexpectedly on the remote (viz. "ERROR: ORDER BY position 0 is not in
select list" as in the bug report).
Repair by adding an additional !bms_is_empty test.
Backpatch through to 9.6 where the aforementioned change was made.
Per bug #15352 from Maksym Boguk; analysis and patch by me.
Andrew Gierth [Tue, 28 Aug 2018 08:52:25 +0000 (09:52 +0100)]
Avoid quadratic slowdown in regexp match/split functions.
regexp_matches, regexp_split_to_table and regexp_split_to_array all
work by compiling a list of match positions as character offsets (NOT
byte positions) in the source string.
Formerly, they then used text_substr to extract the matched text; but
in a multi-byte encoding, that counts the characters in the string,
and the characters needed to reach the starting byte position, on
every call. Accordingly, the performance degraded as the product of
the input string length and the number of match positions, such that
splitting a string of a few hundred kbytes could take many minutes.
Repair by keeping the wide-character copy of the input string
available (only in the case where encoding_max_length is not 1) after
performing the match operation, and extracting substrings from that
instead. This reduces the complexity to being linear in the number of
result bytes, discounting the actual regexp match itself (which is not
affected by this patch).
In passing, remove cleanup using retail pfree() which was obsoleted by
commit ff428cded (Feb 2008) which made cleanup of SRF multi-call
contexts automatic. Also increase (to ~134 million) the maximum number
of matches and provide an error message when it is reached.
Backpatch all the way because this has been wrong forever.
Tom Lane [Mon, 27 Aug 2018 19:11:12 +0000 (15:11 -0400)]
Fix missing dependency for pg_dump's ENABLE ROW LEVEL SECURITY items.
The archive should show a dependency on the item's table, but it failed
to include one. This could cause failures in parallel restore due to
emitting ALTER TABLE ... ENABLE ROW LEVEL SECURITY before restoring
the table's data. In practice the odds of a problem seem low, since
you would typically need to have set FORCE ROW LEVEL SECURITY as well,
and you'd also need a very high --jobs count to have any chance of this
happening. That probably explains the lack of field reports.
Still, it's a bug, so back-patch to 9.5 where RLS was introduced.
Tom Lane [Sun, 26 Aug 2018 18:21:55 +0000 (14:21 -0400)]
Make syslogger more robust against failures in opening CSV log files.
The previous coding figured it'd be good enough to postpone opening
the first CSV log file until we got a message we needed to write there.
This is unsafe, though, because if the open fails we end up in infinite
recursion trying to report the failure. Instead make the CSV log file
management code look as nearly as possible like the longstanding logic
for the stderr log file. In particular, open it immediately at postmaster
startup (if enabled), or when we get a SIGHUP in which we find that
log_destination has been changed to enable CSV logging.
It seems OK to fail if a postmaster-start-time open attempt fails, as
we've long done for the stderr log file. But we can't die if we fail
to open a CSV log file during SIGHUP, so we're still left with a problem.
In that case, write any output meant for the CSV log file to the stderr
log file. (This will also cover race-condition cases in which backends
send CSV log data before or after we have the CSV log file open.)
This patch also fixes an ancient oversight that, if CSV logging was
turned off during a SIGHUP, we never actually closed the last CSV
log file.
In passing, remember to reset whereToSendOutput = DestNone during syslogger
start, since (unlike all other postmaster children) it's forked before the
postmaster has done that. This made for a platform-dependent difference
in error reporting behavior between the syslogger and other children:
except on Windows, it'd report problems to the original postmaster stderr
as well as the normal error log file(s). It's barely possible that that
was intentional at some point; but it doesn't seem likely to be desirable
in production, and the platform dependency definitely isn't desirable.
Per report from Alexander Kukushkin. It's been like this for a long time,
so back-patch to all supported branches.
Bruce Momjian [Sat, 25 Aug 2018 17:35:14 +0000 (13:35 -0400)]
doc: "Latest checkpoint location" will not match in pg_upgrade
Mention that "Latest checkpoint location" will not match in pg_upgrade
if the standby server is still running during the upgrade, which is
possible. "Match" text first appeared in PG 9.5.
Reported-by: Paul Bonaud
Discussion: https://postgr.es/m/c7268794-edb4-1772-3bfd-04c54585c24e@trainline.com
Andrew Gierth [Thu, 23 Aug 2018 17:29:18 +0000 (18:29 +0100)]
Fix lexing of standard multi-character operators in edge cases.
Commits c6b3c939b (which fixed the precedence of >=, <=, <> operators)
and 865f14a2d (which added support for the standard => notation for
named arguments) created a class of lexer tokens which look like
multi-character operators but which have their own token IDs distinct
from Op. However, longest-match rules meant that following any of
these tokens with another operator character, as in (1<>-1), would
cause them to be incorrectly returned as Op.
The error here isn't immediately obvious, because the parser would
usually still find the correct operator via the Op token, but there
were more subtle problems:
1. If immediately followed by a comment or +-, >= <= <> would be given
the old precedence of Op rather than the correct new precedence;
2. If followed by a comment, != would be returned as Op rather than as
NOT_EQUAL, causing it not to be found at all;
3. If followed by a comment or +-, the => token for named arguments
would be lexed as Op, causing the argument to be mis-parsed as a
simple expression, usually causing an error.
Fix by explicitly checking for the operators in the {operator} code
block in addition to all the existing special cases there.
Backpatch to 9.5 where the problem was introduced.
Analysis and patch by me; review by Tom Lane.
Discussion: https://postgr.es/m/87va851ppl.fsf@news-spur.riddles.org.uk
Andrew Gierth [Thu, 23 Aug 2018 19:00:33 +0000 (20:00 +0100)]
Reduce an unnecessary O(N^3) loop in lexer.
The lexer's handling of operators contained an O(N^3) hazard when
dealing with long strings of + or - characters; it seems hard to
prevent this case from being O(N^2), but the additional N multiplier
was not needed.
Backpatch all the way since this has been there since 7.x, and it
presents at least a mild hazard in that trying to do Bind, PREPARE or
EXPLAIN on a hostile query could take excessive time (without
honouring cancels or timeouts) even if the query was never executed.
Michael Paquier [Tue, 21 Aug 2018 06:18:00 +0000 (15:18 +0900)]
Fix set of NLS translation issues
While monitoring the code, a couple of issues related to string
translation has showed up:
- Some routines for auto-updatable views return an error string, which
sometimes missed the shot. A comment regarding string translation is
added for each routine to help with future features.
- GSSAPI authentication missed two translations.
- vacuumdb handles non-translated strings.
Reported-by: Kyotaro Horiguchi
Author: Kyotaro Horiguchi Reviewed-by: Michael Paquier, Tom Lane
Discussion: https://postgr.es/m/20180810.152131.31921918.horiguchi.kyotaro@lab.ntt.co.jp
Backpatch-through: 9.3
Tom Lane [Fri, 17 Aug 2018 21:12:21 +0000 (17:12 -0400)]
Ensure schema qualification in pg_restore DISABLE/ENABLE TRIGGER commands.
Previously, this code blindly followed the common coding pattern of
passing PQserverVersion(AH->connection) as the server-version parameter
of fmtQualifiedId. That works as long as we have a connection; but in
pg_restore with text output, we don't. Instead we got a zero from
PQserverVersion, which fmtQualifiedId interpreted as "server is too old to
have schemas", and so the name went unqualified. That still accidentally
managed to work in many cases, which is probably why this ancient bug went
undetected for so long. It only became obvious in the wake of the changes
to force dump/restore to execute with restricted search_path.
In HEAD/v11, let's deal with this by ripping out fmtQualifiedId's server-
version behavioral dependency, and just making it schema-qualify all the
time. We no longer support pg_dump from servers old enough to need the
ability to omit schema name, let alone restoring to them. (Also, the few
callers outside pg_dump already didn't work with pre-schema servers.)
In older branches, that's not an acceptable solution, so instead just
tweak the DISABLE/ENABLE TRIGGER logic to ensure it will schema-qualify
its output regardless of server version.
Per bug #15338 from Oleg somebody. Back-patch to all supported branches.
Andrew Gierth [Fri, 17 Aug 2018 14:04:26 +0000 (15:04 +0100)]
Set scan direction appropriately for SubPlans (bug #15336)
When executing a SubPlan in an expression, the EState's direction
field was left alone, resulting in an attempt to execute the subplan
backwards if it was encountered during a backwards scan of a cursor.
Also, though much less likely, it was possible to reach the execution
of an InitPlan while in backwards-scan state.
Repair by saving/restoring estate->es_direction and forcing forward
scan mode in the relevant places.
Backpatch all the way, since this has been broken since 8.3 (prior to
commit c7ff7663e, SubPlans had their own EStates rather than sharing
the parent plan's, so there was no confusion over scan direction).
Per bug #15336 reported by Vladimir Baranoff; analysis and patch by
me, review by Tom Lane.
Bruce Momjian [Fri, 17 Aug 2018 14:25:48 +0000 (10:25 -0400)]
pg_upgrade: issue helpful error message for use on standbys
Commit 777e6ddf1723306bd2bf8fe6f804863f459b0323 checked for a shut down
message from a standby and allowed it to continue. This patch reports a
helpful error message in these cases, suggesting to use rsync as
documented.
Tomas Vondra [Thu, 16 Aug 2018 14:49:10 +0000 (16:49 +0200)]
Close the file descriptor in ApplyLogicalMappingFile
The function was forgetting to close the file descriptor, resulting
in failures like this:
ERROR: 53000: exceeded maxAllocatedDescs (492) while trying to open
file "pg_logical/mappings/map-4000-4eb-1_60DE1E08-5376b5-537c6b"
LOCATION: OpenTransientFile, fd.c:2161
Simply close the file at the end, and backpatch to 9.4 (where logical
decoding was introduced). While at it, fix a nearby typo.
Tom Lane [Wed, 15 Aug 2018 21:25:23 +0000 (17:25 -0400)]
Make snprintf.c follow the C99 standard for snprintf's result value.
C99 says that the result should be the number of bytes that would have
been emitted given a large enough buffer, not the number we actually
were able to put in the buffer. It's time to make our substitute
implementation comply with that. Not doing so results in inefficiency
in buffer-enlargement cases, and also poses a portability hazard for
third-party code that might expect C99-compliant snprintf behavior
within Postgres.
In passing, remove useless tests for str == NULL; neither C99 nor
predecessor standards ever allowed that except when count == 0,
so I see no reason to expend cycles on making that a non-crash case
for this implementation. Also, don't waste a byte in pg_vfprintf's
local I/O buffer; this might have performance benefits by allowing
aligned writes during flushbuffer calls.
Back-patch of commit 805889d7d. There was some concern about this
possibly breaking code that assumes pre-C99 behavior, but there is
much more risk (and reality, in our own code) of code that assumes
C99 behavior and hence fails to detect buffer overrun without this.
Alvaro Herrera [Wed, 15 Aug 2018 21:09:29 +0000 (18:09 -0300)]
Update FSM on WAL replay of page all-visible/frozen
We aren't very strict about keeping FSM up to date on WAL replay,
because per-page freespace values aren't critical in replicas (can't
write to heap in a replica; and if the replica is promoted, the values
would be updated by VACUUM anyway). However, VACUUM since 9.6 can skip
processing pages marked all-visible or all-frozen, and if such pages are
recorded in FSM with wrong values, those values are blindly propagated
to FSM's upper layers by VACUUM's FreeSpaceMapVacuum. (This rationale
assumes that crashes are not very frequent, because those would cause
outdated FSM to occur in the primary.)
Even when the FSM is outdated in standby, things are not too bad
normally, because, most per-page FSM values will be zero (other than
those propagated with the base-backup that created the standby); only
once the remaining free space is less than 0.2*BLCKSZ the per-page value
is maintained by WAL replay of heap ins/upd/del. However, if
wal_log_hints=on causes complete FSM pages to be propagated to a standby
via full-page images, many too-optimistic per-page values can end up
being registered in the standby.
Incorrect per-page values aren't critical in most cases, since an
inserter that is given a page that doesn't actually contain the claimed
free space will update FSM with the correct value, and retry until it
finds a usable page. However, if there are many such updates to do, an
inserter can spend a long time doing them before a usable page is found;
in a heavily trafficked insert-only table with many concurrent inserters
this has been observed to cause several second stalls, causing visible
application malfunction.
To fix this problem, it seems sufficient to have heap_xlog_visible
(replay of setting all-visible and all-frozen VM bits for a heap page)
update the FSM value for the page being processed. This fixes the
per-page counters together with making the page skippable to vacuum, so
when vacuum does FreeSpaceMapVacuum, the values propagated to FSM upper
layers are the correct ones, avoiding the problem.
While at it, apply the same fix to heap_xlog_clean (replay of tuple
removal by HOT pruning and vacuum). This makes any space freed by the
cleaning available earlier than the next vacuum in the promoted replica.
Backpatch to 9.6, where this problem was diagnosed on an insert-only
table with all-frozen pages, which were introduced as a concept in that
release. Theoretically it could apply with all-visible pages to older
branches, but there's been no report of that and it doesn't backpatch
cleanly anyway.
Tom Lane [Wed, 15 Aug 2018 20:29:32 +0000 (16:29 -0400)]
Clean up assorted misuses of snprintf()'s result value.
Fix a small number of places that were testing the result of snprintf()
but doing so incorrectly. The right test for buffer overrun, per C99,
is "result >= bufsize" not "result > bufsize". Some places were also
checking for failure with "result == -1", but the standard only says
that a negative value is delivered on failure.
(Note that this only makes these places correct if snprintf() delivers
C99-compliant results. But at least now these places are consistent
with all the other places where we assume that.)
Also, make psql_start_test() and isolation_start_test() check for
buffer overrun while constructing their shell commands. There seems
like a higher risk of overrun, with more severe consequences, here
than there is for the individual file paths that are made elsewhere
in the same functions, so this seemed like a worthwhile change.
Also fix guc.c's do_serialize() to initialize errno = 0 before
calling vsnprintf. In principle, this should be unnecessary because
vsnprintf should have set errno if it returns a failure indication ...
but the other two places this coding pattern is cribbed from don't
assume that, so let's be consistent.
These errors are all very old, so back-patch as appropriate. I think
that only the shell command overrun cases are even theoretically
reachable in practice, but there's not much point in erroneous error
checks.
Bruce Momjian [Tue, 14 Aug 2018 21:19:02 +0000 (17:19 -0400)]
pg_upgrade: fix shutdown check for standby servers
Commit 244142d32afd02e7408a2ef1f249b00393983822 only tested for the
pg_controldata output for primary servers, but standby servers have
different "Database cluster state" output, so check for that too.
Diagnosed-by: Michael Paquier
Discussion: https://postgr.es/m/20180810164240.GM13638@paquier.xyz
Amit Kapila [Mon, 13 Aug 2018 03:26:37 +0000 (08:56 +0530)]
Prohibit shutting down resources if there is a possibility of back up.
Currently, we release the asynchronous resources as soon as it is evident
that no more rows will be needed e.g. when a Limit is filled. This can be
problematic especially for custom and foreign scans where we can scan
backward. Fix that by disallowing the shutting down of resources in such
cases.
Reported-by: Robert Haas Analysed-by: Robert Haas and Amit Kapila
Author: Amit Kapila Reviewed-by: Robert Haas
Backpatch-through: 9.6 where this code was introduced
Discussion: https://postgr.es/m/86137f17-1dfb-42f9-7421-82fd786b04a1@anayrat.info
Bruce Momjian [Thu, 9 Aug 2018 14:13:15 +0000 (10:13 -0400)]
docs: Only first instance of a PREPARE parameter sets data type
If the first reference to $1 is "($1 = col) or ($1 is null)", the data
type can be determined, but not for "($1 is null) or ($1 = col)". This
change documents this.
Don't run atexit callbacks in quickdie signal handlers.
exit() is not async-signal safe. Even if the libc implementation is, 3rd
party libraries might have installed unsafe atexit() callbacks. After
receiving SIGQUIT, we really just want to exit as quickly as possible, so
we don't really want to run the atexit() callbacks anyway.
The original report by Jimmy Yih was a self-deadlock in startup_die().
However, this patch doesn't address that scenario; the signal handling
while waiting for the startup packet is more complicated. But at least this
alleviates similar problems in the SIGQUIT handlers, like that reported
by Asim R P later in the same thread.
Tom Lane [Tue, 7 Aug 2018 20:32:50 +0000 (16:32 -0400)]
Don't record FDW user mappings as members of extensions.
CreateUserMapping has a recordDependencyOnCurrentExtension call that's
been there since extensions were introduced (very possibly my fault).
However, there's no support anywhere else for user mappings as members
of extensions, nor are they listed as a possible member object type in
the documentation. Nor does it really seem like a good idea for user
mappings to belong to extensions when roles don't. Hence, remove the
bogus call.
(As we saw in bug #15310, the lack of any pg_dump support for this case
ensures that any such membership record would silently disappear during
pg_upgrade. So there's probably no need for us to do anything else
about cleaning up after this mistake.)
Tom Lane [Tue, 7 Aug 2018 20:00:44 +0000 (16:00 -0400)]
Fix incorrect initialization of BackendActivityBuffer.
Since commit c8e8b5a6e, this has been zeroed out using the wrong length.
In practice the length would always be too small, leading to not zeroing
the whole buffer rather than clobbering additional memory; and that's
pretty harmless, both because shmem would likely start out as zeroes
and because we'd reinitialize any given entry before use. Still,
it's bogus, so fix it.
Tom Lane [Tue, 7 Aug 2018 19:43:49 +0000 (15:43 -0400)]
Fix pg_upgrade to handle event triggers in extensions correctly.
pg_dump with --binary-upgrade must emit ALTER EXTENSION ADD commands
for all objects that are members of extensions. It forgot to do so for
event triggers, as per bug #15310 from Nick Barnes. Back-patch to 9.3
where event triggers were introduced.
Tom Lane [Tue, 7 Aug 2018 17:13:42 +0000 (13:13 -0400)]
Ensure pg_dump_sort.c sorts null vs non-null namespace consistently.
The original coding here (which is, I believe, my fault) supposed that
it didn't need to concern itself with the possibility that one object
of a given type-priority has a namespace while another doesn't. But
that's not reliably true anymore, if it ever was; and if it does happen
then it's possible that DOTypeNameCompare returns self-inconsistent
comparison results. That leads to unspecified behavior in qsort()
and a resultant weird output order from pg_dump.
This should end up being only a cosmetic problem, because any ordering
constraints that actually matter should be enforced by the later
dependency-based sort. Still, it's a bug, so back-patch.
Report and fix by Jacob Champion, though I editorialized on his
patch to the extent of making NULL sort after non-NULL, for consistency
with our usual sorting definitions.
Tom Lane [Mon, 6 Aug 2018 14:53:35 +0000 (10:53 -0400)]
Fix failure to reset libpq's state fully between connection attempts.
The logic in PQconnectPoll() did not take care to ensure that all of
a PGconn's internal state variables were reset before trying a new
connection attempt. If we got far enough in the connection sequence
to have changed any of these variables, and then decided to try a new
server address or server name, the new connection might be completed
with some state that really only applied to the failed connection.
While this has assorted bad consequences, the only one that is clearly
a security issue is that password_needed didn't get reset, so that
if the first server asked for a password and the second didn't,
PQconnectionUsedPassword() would return an incorrect result. This
could be leveraged by unprivileged users of dblink or postgres_fdw
to allow them to use server-side login credentials that they should
not be able to use.
Other notable problems include the possibility of forcing a v2-protocol
connection to a server capable of supporting v3, or overriding
"sslmode=prefer" to cause a non-encrypted connection to a server that
would have accepted an encrypted one. Those are certainly bugs but
it's harder to paint them as security problems in themselves. However,
forcing a v2-protocol connection could result in libpq having a wrong
idea of the server's standard_conforming_strings setting, which opens
the door to SQL-injection attacks. The extent to which that's actually
a problem, given the prerequisite that the attacker needs control of
the client's connection parameters, is unclear.
These problems have existed for a long time, but became more easily
exploitable in v10, both because it introduced easy ways to force libpq
to abandon a connection attempt at a late stage and then try another one
(rather than just giving up), and because it provided an easy way to
specify multiple target hosts.
Fix by rearranging PQconnectPoll's state machine to provide centralized
places to reset state properly when moving to a new target host or when
dropping and retrying a connection to the same host.
Tom Lane, reviewed by Noah Misch. Our thanks to Andrew Krasichkov
for finding and reporting the problem.
Tom Lane [Sat, 4 Aug 2018 23:38:58 +0000 (19:38 -0400)]
Fix INSERT ON CONFLICT UPDATE through a view that isn't just SELECT *.
When expanding an updatable view that is an INSERT's target, the rewriter
failed to rewrite Vars in the ON CONFLICT UPDATE clause. This accidentally
worked if the view was just "SELECT * FROM ...", as the transformation
would be a no-op in that case. With more complicated view targetlists,
this omission would often lead to "attribute ... has the wrong type" errors
or even crashes, as reported by Mario De Frutos Dieguez.
Fix by adding code to rewriteTargetView to fix up the data structure
correctly. The easiest way to update the exclRelTlist list is to rebuild
it from scratch looking at the new target relation, so factor the code
for that out of transformOnConflictClause to make it sharable.
In passing, avoid duplicate permissions checks against the EXCLUDED
pseudo-relation, and prevent useless view expansion of that relation's
dummy RTE. The latter is only known to happen (after this patch) in cases
where the query would fail later due to not having any INSTEAD OF triggers
for the view. But by exactly that token, it would create an unintended
and very poorly tested state of the query data structure, so it seems like
a good idea to prevent it from happening at all.
This has been broken since ON CONFLICT was introduced, so back-patch
to 9.5.
Dean Rasheed, based on an earlier patch by Amit Langote;
comment-kibitzing and back-patching by me
Michael Paquier [Sat, 4 Aug 2018 20:32:27 +0000 (05:32 +0900)]
Reset properly errno before calling write()
6cb3372 enforces errno to ENOSPC when less bytes than what is expected
have been written when it is unset, though it forgot to properly reset
errno before doing a system call to write(), causing errno to
potentially come from a previous system call.
Reported-by: Tom Lane
Author: Michael Paquier Reviewed-by: Tom Lane
Discussion: https://postgr.es/m/31797.1533326676@sss.pgh.pa.us
Peter Geoghegan [Fri, 3 Aug 2018 21:44:44 +0000 (14:44 -0700)]
Add table relcache invalidation to index builds.
It's necessary to make sure that owning tables have a relcache
invalidation prior to advancing the command counter to make
newly-entered catalog tuples for the index visible. inval.c must be
able to maintain the consistency of the local caches in the event of
transaction abort. There is usually only a problem when CREATE INDEX
transactions abort, since there is a generic invalidation once we reach
index_update_stats().
This bug is of long standing. Problems were made much more likely by
the addition of parallel CREATE INDEX (commit 9da0cc35284), but it is
strongly suspected that similar problems can be triggered without
involving plan_create_index_workers(). (plan_create_index_workers()
triggers a relcache build or rebuild, which previously only happened in
rare edge cases.)
Author: Peter Geoghegan Reported-By: Luca Ferrari Diagnosed-By: Andres Freund Reviewed-By: Andres Freund
Discussion: https://postgr.es/m/CAKoxK+5fVodiCtMsXKV_1YAKXbzwSfp7DgDqUmcUAzeAhf=HEQ@mail.gmail.com
Backpatch: 9.3-
Amit Kapila [Fri, 3 Aug 2018 06:13:01 +0000 (11:43 +0530)]
Fix buffer usage stats for parallel nodes.
The buffer usage stats is accounted only for the execution phase of the
node. For Gather and Gather Merge nodes, such stats are accumulated at
the time of shutdown of workers which is done after execution of node due
to which we missed to account them for such nodes. Fix it by treating
nodes as running while we shut down them.
We can also miss accounting for a Limit node when Gather or Gather Merge
is beneath it, because it can finish the execution before shutting down
such nodes. So we allow a Limit node to shut down the resources before it
completes the execution.
In the passing fix the gather node code to allow workers to shut down as
soon as we find that all the tuples from the workers have been retrieved.
The original code use to do that, but is accidently removed by commit 01edb5c7fc.
Reported-by: Adrien Nayrat
Author: Amit Kapila and Robert Haas Reviewed-by: Robert Haas and Andres Freund
Backpatch-through: 9.6 where this code was introduced
Discussion: https://postgr.es/m/86137f17-1dfb-42f9-7421-82fd786b04a1@anayrat.info
Amit Kapila [Fri, 3 Aug 2018 04:37:56 +0000 (10:07 +0530)]
Match the buffer usage tracking for leader and worker backends.
In the leader backend, we don't track the buffer usage for ExecutorStart
phase whereas in worker backend we track it for ExecutorStart phase as
well. This leads to different value for buffer usage stats for the
parallel and non-parallel query. Change the code so that worker backend
also starts tracking buffer usage after ExecutorStart.
Author: Amit Kapila and Robert Haas Reviewed-by: Robert Haas and Andres Freund
Backpatch-through: 9.6 where this code was introduced
Discussion: https://postgr.es/m/86137f17-1dfb-42f9-7421-82fd786b04a1@anayrat.info
Tom Lane [Tue, 31 Jul 2018 17:00:08 +0000 (13:00 -0400)]
Further fixes for quoted-list GUC values in pg_dump and ruleutils.c.
Commits 742869946 et al turn out to be a couple bricks shy of a load.
We were dumping the stored values of GUC_LIST_QUOTE variables as they
appear in proconfig or setconfig catalog columns. However, although that
quoting rule looks a lot like SQL-identifier double quotes, there are two
critical differences: empty strings ("") are legal, and depending on which
variable you're considering, values longer than NAMEDATALEN might be valid
too. So the current technique fails altogether on empty-string list
entries (as reported by Steven Winfield in bug #15248) and it also risks
truncating file pathnames during dump/reload of GUC values that are lists
of pathnames.
To fix, split the stored value without any downcasing or truncation,
and then emit each element as a SQL string literal.
This is a tad annoying, because we now have three copies of the
comma-separated-string splitting logic in varlena.c as well as a fourth
one in dumputils.c. (Not to mention the randomly-different-from-those
splitting logic in libpq...) I looked at unifying these, but it would
be rather a mess unless we're willing to tweak the API definitions of
SplitIdentifierString, SplitDirectoriesString, or both. That might be
worth doing in future; but it seems pretty unsafe for a back-patched
bug fix, so for now accept the duplication.
Back-patch to all supported branches, as the previous fix was.
Tom Lane [Mon, 30 Jul 2018 16:35:49 +0000 (12:35 -0400)]
Fix pg_dump's failure to dump REPLICA IDENTITY for constraint indexes.
pg_dump knew about printing ALTER TABLE ... REPLICA IDENTITY USING INDEX
for indexes declared as indexes, but it failed to print that for indexes
declared as unique or primary-key constraints. Per report from Achilleas
Mantzios.
This has been broken since the feature was introduced, AFAICS.
Back-patch to 9.4.
Tom Lane [Mon, 30 Jul 2018 15:54:41 +0000 (11:54 -0400)]
Doc: fix oversimplified example for CREATE POLICY.
As written, this policy constrained only the post-image not the pre-image
of rows, meaning that users could delete other users' rows or take
ownership of such rows, contrary to what the docs claimed would happen.
We need two separate policies to achieve the documented effect.
While at it, try to explain what's happening a bit more fully.
Per report from Олег Самойлов. Back-patch to 9.5 where this was added.
Thanks to Stephen Frost for off-list discussion.
Affected test queries have been testing the wrong thing since their
introduction in commit 4c1383efd132e4f532213c8a8cc63a455f55e344.
Back-patch to 9.3 (all supported versions).
Document security implications of qualified names.
Commit 5770172cb0c9df9e6ce27c507b449557e5b45124 documented secure schema
usage, and that advice suffices for using unqualified names securely.
Document, in typeconv-func primarily, the additional issues that arise
with qualified names. Back-patch to 9.3 (all supported versions).
Bruce Momjian [Sat, 28 Jul 2018 19:01:55 +0000 (15:01 -0400)]
pg_upgrade: check for clean server shutdowns
Previously pg_upgrade checked for the pid file and started/stopped the
server to force a clean shutdown. However, "pg_ctl -m immediate"
removes the pid file but doesn't do a clean shutdown, so check
pg_controldata for a clean shutdown too.
Diagnosed-by: Vimalraj A
Discussion: https://postgr.es/m/CAFKBAK5e4Q-oTUuPPJ56EU_d2Rzodq6GWKS3ncAk3xo7hAsOZg@mail.gmail.com
Andres Freund [Tue, 24 Jul 2018 17:51:17 +0000 (10:51 -0700)]
doc: Fix reference to "decoder" to instead be the correct "output plugin".
Author: Jonathan Katz
Discussion: https://postgr.es/m/DD02DD86-5989-4BFD-8712-468541F68383@postgresql.org
Backpatch: 9.4-, where logical decoding was added
Tom Lane [Sat, 21 Jul 2018 19:40:51 +0000 (15:40 -0400)]
Further portability hacking in pg_upgrade's test script.
I blew the dust off a Bourne shell (file date 1996, yea verily) and
tried to run test.sh with it. It mostly worked, but I found that the
temp-directory creation code introduced by commit be76a6d39 was not
compatible, for a couple of reasons: this shell thinks "set -e" should
force an exit if a command within backticks fails, and it also thinks code
within braces should be executed by a sub-shell, meaning that variable
settings don't propagate back up to the parent shell. In view of Victor
Wagner's report that Solaris is still using pre-POSIX shells, seems like
we oughta make this case work. It's not like the code is any less
idiomatic this way; the prior coding technique appeared nowhere else.
(There is a remaining bash-ism here, which is that $RANDOM doesn't do
what the code hopes in non-bash shells. But the use of $$ elsewhere in
that path should be enough to ensure uniqueness and some amount of
randomness, so I think it's okay as-is.)
Back-patch to all supported branches, as the previous commit was.
Fix handling of empty uncompressed posting list pages in GIN
PostgreSQL 9.4 introduces posting list compression in GIN. This feature
supports online upgrade, so that after pg_upgrade uncompressed posting
lists are compressed on-the-fly. Underlying code appears to always
expect at least one item on uncompressed posting list page. But there
could be completely empty pages, because VACUUM never deletes leftmost
and rightmost pages from posting trees. This commit fixes that.
I was confused by what "intended to be parallel serially" meant, until
Robert Haas and David G. Johnston explained it. Rephrase the comment to
make it more clear, using David's suggested wording.
Michael Paquier [Thu, 19 Jul 2018 00:55:25 +0000 (09:55 +0900)]
Fix print of Path nodes when using OPTIMIZER_DEBUG
GatherMergePath (introduced in 10) and CustomPath (introduced in 9.5)
have gone missing. The order of the Path nodes was inconsistent with
what is listed in nodes.h, so make the order consistent at the same time
to ease future checks and additions.
Author: Sawada Masahiko Reviewed-by: Michael Paquier
Discussion: https://postgr.es/m/CAD21AoBQMLoc=ohH-oocuAPsELrmk8_EsRJjOyR8FQLZkbE0wA@mail.gmail.com
Tom Lane [Fri, 13 Jul 2018 22:45:30 +0000 (18:45 -0400)]
Fix crash in contrib/ltree's lca() function for empty input array.
lca_inner() wasn't prepared for the possibility of getting no inputs.
Fix that, and make some cosmetic improvements to the code while at it.
Also, I thought the documentation of this function as returning the
"longest common prefix" of the paths was entirely misleading; it really
returns a path one shorter than the longest common prefix, for the typical
definition of "prefix". Don't use that term in the docs, and adjust the
examples to clarify what really happens.
This has been broken since its beginning, so back-patch to all supported
branches.
Per report from Hailong Li. Thanks to Pierre Ducroquet for diagnosing
and for the initial patch, though I whacked it around some and added
test cases.
Tom Lane [Fri, 13 Jul 2018 15:52:50 +0000 (11:52 -0400)]
Fix inadequate buffer locking in FSM and VM page re-initialization.
When reading an existing FSM or VM page that was found to be corrupt by the
buffer manager, the code applied PageInit() to reinitialize the page, but
did so without any locking. There is thus a hazard that two backends might
concurrently do PageInit, which in itself would still be OK, but the slower
one might then zero over subsequent data changes applied by the faster one.
Even that is unlikely to be fatal; but it's not desirable, so add locking
to prevent it.
This does not add any locking overhead in the normal code path where the
page is OK. It's not immediately obvious that that's safe, but I believe
it is, for reasons explained in the added comments.
Problem noted by R P Asim. It's been like this for a long time, so
back-patch to all supported branches.
Tom Lane [Thu, 12 Jul 2018 16:28:43 +0000 (12:28 -0400)]
Doc: minor improvement in pl/pgsql FETCH/MOVE documentation.
Explain that you can use any integer expression for the "count" in
pl/pgsql's versions of FETCH/MOVE, unlike the SQL versions which only
allow a constant.
Remove the duplicate version of this para under MOVE. I don't see
a good reason to maintain two identical paras when we just said that
MOVE works exactly like FETCH.
Michael Paquier [Thu, 12 Jul 2018 01:20:14 +0000 (10:20 +0900)]
Make logical WAL sender report streaming state appropriately
WAL senders sending logically-decoded data fail to properly report in
"streaming" state when starting up, hence as long as one extra record is
not replayed, such WAL senders would remain in a "catchup" state, which
is inconsistent with the physical cousin.
This can be easily reproduced by for example using pg_recvlogical and
restarting the upstream server. The TAP tests have been slightly
modified to detect the failure and strengthened so as future tests also
make sure that a node is in streaming state when waiting for its
catchup.
Backpatch down to 9.4 where this code has been introduced.
Reported-by: Sawada Masahiko
Author: Simon Riggs, Sawada Masahiko Reviewed-by: Petr Jelinek, Michael Paquier, Vaishnavi Prabakaran
Discussion: https://postgr.es/m/CAD21AoB2ZbCCqOx=bgKMcLrAvs1V0ZMqzs7wBTuDySezTGtMZA@mail.gmail.com
Tom Lane [Wed, 11 Jul 2018 19:25:29 +0000 (15:25 -0400)]
Fix create_scan_plan's handling of sortgrouprefs for physical tlists.
We should only run apply_pathtarget_labeling_to_tlist if CP_LABEL_TLIST
was specified, because only in that case has use_physical_tlist checked
that the labeling will succeed; otherwise we may get an "ORDER/GROUP BY
expression not found in targetlist" error. (This subsumes the previous
test about gating_clauses, because we reset "flags" to zero earlier
if there are gating clauses to apply.)
The only known case in which a failure can occur is with a ProjectSet
path directly atop a table scan path, although it seems likely that there
are other cases or will be such in future. This means that the failure
is currently only visible in the v10 branch: 9.6 didn't have ProjectSet,
while in v11 and HEAD, apply_scanjoin_target_to_paths for some weird
reason is using create_projection_path not apply_projection_to_path,
masking the problem because there's a ProjectionPath in between.
Nonetheless this code is clearly wrong on its own terms, so back-patch
to 9.6 where this logic was introduced.
Tom Lane [Mon, 9 Jul 2018 23:26:19 +0000 (19:26 -0400)]
Avoid emitting a bogus WAL record when recycling an all-zero btree page.
Commit fafa374f2 caused _bt_getbuf() to possibly emit a WAL record for
a page that it was about to recycle. However, it failed to distinguish
all-zero pages from dead pages, which is important because only the
latter have valid btpo.xact values, or indeed any special space at all.
Recycling an all-zero page with XLogStandbyInfoActive() enabled therefore
led to an Assert failure, or to emission of a WAL record containing a
bogus cutoff XID, which might lead to unnecessary query cancellations
on hot standby servers.
Per reports from Antonin Houska and 自己. Amit Kapila was first to
propose this fix, and Robert Haas, myself, and Kyotaro Horiguchi
reviewed it at various times.
This is an old bug, so back-patch to all supported branches.
Tom Lane [Mon, 9 Jul 2018 21:23:31 +0000 (17:23 -0400)]
Prevent accidental linking of system-supplied copies of libpq.so etc.
Back-patch commit dddfc4cb2, which broke LDFLAGS and related Makefile
variables into two parts, one for within-build-tree library references and
one for external libraries, to ensure that the order of -L flags has all
of the former before all of the latter. This turns out to fix a problem
recently noted on buildfarm member peripatus, that we attempted to
incorporate code from libpgport.a into a shared library. That will fail on
platforms that are sticky about putting non-PIC code into shared libraries.
(It's quite surprising we hadn't seen such failures before, since the code
in question has been like that for a long time.)
I think that peripatus' problem could have been fixed with just a subset
of this patch; but since the previous issue of accidentally linking to the
wrong copy of a Postgres shlib seems likely to bite people in the field,
let's just back-patch the whole change. Now that commit dddfc4cb2 has
survived some beta testing, I'm less afraid to back-patch it than I was
at the time.
This also fixes undesired inclusion of "-DFRONTEND" in pg_config's CPPFLAGS
output (in 9.6 and up) and undesired inclusion of "-L../../src/common" in
its LDFLAGS output (in all supported branches).
Back-patch to v10 and older branches; this is already in v11.
Michael Paquier [Mon, 9 Jul 2018 01:26:41 +0000 (10:26 +0900)]
Rework order of end-of-recovery actions to delay timeline history write
A critical failure in some of the end-of-recovery actions before the
end-of-recovery record is written can cause PostgreSQL to react
inconsistently with the rest of the cluster in the event of a crash
before the final record is written. Two such failures are for example
an error while processing a two-phase state files or when operating on
recovery.conf. With this commit, the failures are still considered
FATAL, but the write of the timeline history file is delayed as much as
possible so as the window between the moment the file is written and the
end-of-recovery record is generated gets minimized. This way, in the
event of a crash or a failure, the new timeline decided at promotion
will not seem taken by other nodes in the cluster. It is not really
possible to reduce to zero this window, hence one could still see
failures if a crash happens between the history file write and the
end-of-recovery record, so any future code should be careful when
adding new end-of-recovery actions. The original report from Magnus
Hagander mentioned a renamed recovery.conf as original end-of-recovery
failure which caused a timeline to be seen as taken but the subsequent
processing on the now-missing recovery.conf cause the startup process to
issue stop on FATAL, which at follow-up startup made the system
inconsistent because of on-disk changes which already happened.
Processing of two-phase state files still needs some work as corrupted
entries are simply ignored now. This is left as a future item and this
commit fixes the original complain.
Reported-by: Magnus Hagander
Author: Heikki Linnakangas Reviewed-by: Alexander Korotkov, Michael Paquier, David Steele
Discussion: https://postgr.es/m/CABUevEz09XY2EevA2dLjPCY-C5UO4Hq=XxmXLmF6ipNFecbShQ@mail.gmail.com
Michael Paquier [Fri, 6 Jul 2018 23:10:47 +0000 (08:10 +0900)]
Add note in pg_rewind documentation about read-only files
When performing pg_rewind, the presence of a read-only file which is not
accessible for writes will cause a failure while processing. This can
cause the control file of the target data folder to be truncated,
causing it to not be reusable with a successive run.
Also, when pg_rewind fails mid-flight, there is likely no way to be able
to recover the target data folder anyway, in which case a new base
backup is the best option. A note is added in the documentation as
well about.
Reported-by: Christian H.
Author: Michael Paquier Reviewed-by: Andrew Dunstan
Discussion: https://postgr.es/m/20180104200633.17004.16377%40wrigleys.postgresql.org
logical decoding: beware of an unset specinsert change
Coverity complains that there is no protection in the code (at least in
non-assertion-enabled builds) against speculative insertion failing to
follow the expected protocol. Add an elog(ERROR) for the case.
Reduce cost of test_decoding's new oldest_xmin test
Change a whole-database VACUUM into doing just pg_attribute, which is
the portion that verifies what we want it to do. The original
formulation wastes a lot of CPU time, which leads the test to fail when
runtime exceeds isolationtester timeout when it's super-slow, such as
under CLOBBER_CACHE_ALWAYS. Per buildfarm member friarbird.
It turns out that the previous shape of the test doesn't always detect
the condition it is supposed to detect (on unpatched reorderbuffer
code): the reason is that there is a good chance of encountering a
xl_running_xacts record (logged every 15 seconds) before the checkpoint
-- and because we advance the xmin when we receive that WAL record, and
we *don't* advance the xmin twice consecutively without receiving a
client message in between, that means the xmin is not advanced enough
for the tuple to be pruned from pg_attribute by VACUUM. So the test
would spuriously pass.
The reason this test deficiency wasn't detected earlier is that HOT
pruning removes the tuple anyway, even if vacuum leaves it in place, so
the test correctly fails (detecting the coding mistake), but for the
wrong reason.
To fix this mess, run the s0_get_changes step twice before vacuum
instead of once: this seems to cause the xmin to be advanced reliably,
wreaking havoc with more certainty.
Michael Paquier [Thu, 5 Jul 2018 01:47:19 +0000 (10:47 +0900)]
Prevent references to invalid relation pages after fresh promotion
If a standby crashes after promotion before having completed its first
post-recovery checkpoint, then the minimal recovery point which marks
the LSN position where the cluster is able to reach consistency may be
set to a position older than the first end-of-recovery checkpoint while
all the WAL available should be replayed. This leads to the instance
thinking that it contains inconsistent pages, causing a PANIC and a hard
instance crash even if all the WAL available has not been replayed for
certain sets of records replayed. When in crash recovery,
minRecoveryPoint is expected to always be set to InvalidXLogRecPtr,
which forces the recovery to replay all the WAL available, so this
commit makes sure that the local copy of minRecoveryPoint from the
control file is initialized properly and stays as it is while crash
recovery is performed. Once switching to archive recovery or if crash
recovery finishes, then the local copy minRecoveryPoint can be safely
updated.
Pavan Deolasee has reported and diagnosed the failure in the first
place, and the base fix idea to rely on the local copy of
minRecoveryPoint comes from Kyotaro Horiguchi, which has been expanded
into a full-fledged patch by me. The test included in this commit has
been written by Álvaro Herrera and Pavan Deolasee, which I have modified
to make it faster and more reliable with sleep phases.
Backpatch down to all supported versions where the bug appears, aka 9.3
which is where the end-of-recovery checkpoint is not run by the startup
process anymore. The test gets easily supported down to 10, still it
has been tested on all branches.
Andres Freund [Wed, 4 Jul 2018 21:53:49 +0000 (14:53 -0700)]
Check for interrupts inside the nbtree page deletion code.
When deleting pages the nbtree code has to walk through siblings of a
tree node. When those sibling links are corrupted that can lead to
endless loops - which are currently not interruptible. This is
especially problematic if autovacuum is repeatedly blocked on such
indexes, as it can be hard to get out of that situation without
resorting to single user mode.
Thus add interrupt checks to appropriate places in such
loops. Unfortunately in one of the cases it's it's not easy to do so.
Between 9.3 and 9.4 the page deletion (and page split) code changed
significantly. Before it was significantly less robust against
interruptions. Therefore don't backpatch to 9.3.
Author: Andres Freund
Discussion: https://postgr.es/m/20180627191629.wkunw2qbibnvlz53@alap3.anarazel.de
Backpatch: 9.4-
Improve the performance of relation deletes during recovery.
When multiple relations are deleted at the same transaction,
the files of those relations are deleted by one call to smgrdounlinkall(),
which leads to scan whole shared_buffers only one time. OTOH,
previously, during recovery, smgrdounlink() (not smgrdounlinkall()) was
called for each file to delete, which led to scan shared_buffers
multiple times. Obviously this could cause to increase the WAL replay
time very much especially when shared_buffers was huge.
To alleviate this situation, this commit changes the recovery so that
it also calls smgrdounlinkall() only one time to delete multiple
relation files.
This is just fix for oversight of commit 279628a0a7, not new feature.
So, per discussion on pgsql-hackers, we concluded to backpatch this
to all supported versions.
Author: Fujii Masao Reviewed-by: Michael Paquier, Andres Freund, Thomas Munro, Kyotaro Horiguchi, Takayuki Tsunakawa
Discussion: https://postgr.es/m/CAHGQGwHVQkdfDqtvGVkty+19cQakAydXn1etGND3X0PHbZ3+6w@mail.gmail.com
When these programs call pg_catalog.set_config, they need to check for
PGRES_TUPLES_OK instead of PGRES_COMMAND_OK. Fix for 5770172cb0c9df9e6ce27c507b449557e5b45124.
Michael Paquier [Fri, 29 Jun 2018 13:17:54 +0000 (22:17 +0900)]
Replace search.cpan.org with metacpan.org
search.cpan.org has been EOL'd, with metacpan.org being the official
replacement to which URLs now redirect. Update links to match the new
URL. Also update links to CPAN to use https as it will redirect from
http.
Author: Daniel Gustafsson
Discussion: https://postgr.es/m/B74C0219-6BA9-46E1-A524-5B9E8CD3BDB3@yesql.se
Alvaro Herrera [Tue, 26 Jun 2018 20:38:34 +0000 (16:38 -0400)]
Fix "base" snapshot handling in logical decoding
Two closely related bugs are fixed. First, xmin of logical slots was
advanced too early. During xl_running_xacts processing, xmin of the
slot was set to the oldest running xid in the record, but that's wrong:
actually, snapshots which will be used for not-yet-replayed transactions
might consider older txns as running too, so we need to keep xmin back
for them. The problem wasn't noticed earlier because DDL which allows
to delete tuple (set xmax) while some another not-yet-committed
transaction looks at it is pretty rare, if not unique: e.g. all forms of
ALTER TABLE which change schema acquire ACCESS EXCLUSIVE lock
conflicting with any inserts. The included test case (test_decoding's
oldest_xmin) uses ALTER of a composite type, which doesn't have such
interlocking.
To deal with this, we must be able to quickly retrieve oldest xmin
(oldest running xid among all assigned snapshots) from ReorderBuffer. To
fix, add another list of ReorderBufferTXNs to the reorderbuffer, where
transactions are sorted by base-snapshot-LSN. This is slightly
different from the existing (sorted by first-LSN) list, because a
transaction can have an earlier LSN but a later Xmin, if its first
record does not obtain an xmin (eg. xl_xact_assignment). Note this new
list doesn't fully replace the existing txn list: we still need that one
to prevent WAL recycling.
The second issue concerns SnapBuilder snapshots and subtransactions.
SnapBuildDistributeNewCatalogSnapshot never assigned a snapshot to a
transaction that is known to be a subtxn, which is good in the common
case that the top-level transaction already has one (no point in doing
so), but a bug otherwise. To fix, arrange to transfer the snapshot from
the subtxn to its top-level txn as soon as the kinship gets known.
test_decoding's snapshot_transfer verifies this.
Also, fix a minor memory leak: refcount of toplevel's old base snapshot
was not decremented when the snapshot is transferred from child.
Liberally sprinkle code comments, and rewrite a few existing ones. This
part is my (Álvaro's) contribution to this commit, as I had to write all
those comments in order to understand the existing code and Arseny's
patch.
Fujii Masao [Tue, 26 Jun 2018 15:45:21 +0000 (00:45 +0900)]
Fix documentation bug related to backup history file.
The backup history file has been no longer necessary for recovery
since the version 9.0. It's now basically just for informational purpose.
But previously the documentations still described that a recovery
requests the backup history file to proceed. The commit fixes this
documentation bug.
Thomas Munro [Mon, 18 Jun 2018 06:33:53 +0000 (18:33 +1200)]
Add PGTYPESchar_free() to avoid cross-module problems on Windows.
On Windows, it is sometimes important for corresponding malloc() and
free() calls to be made from the same DLL, since some build options can
result in multiple allocators being active at the same time. For that
reason we already provided PQfreemem(). This commit adds a similar
function for freeing string results allocated by the pgtypes library.
Thomas Munro [Tue, 26 Jun 2018 06:44:31 +0000 (18:44 +1200)]
Move RecoveryLockList into a hash table.
Standbys frequently need to release all locks held by a given xid.
Instead of searching one big list linearly, let's create one list
per xid and put them in a hash table, so we can find what we need
in O(1) time.
Earlier analysis and a prototype were done by David Rowley, though
this isn't his patch.
Back-patch all the way.
Author: Thomas Munro Diagnosed-by: David Rowley, Andres Freund Reviewed-by: Andres Freund, Tom Lane, Robert Haas
Discussion: https://postgr.es/m/CAEepm%3D1mL0KiQ2KJ4yuPpLGX94a4Ns_W6TL4EGRouxWibu56pA%40mail.gmail.com
Discussion: https://postgr.es/m/CAKJS1f9vJ841HY%3DwonnLVbfkTWGYWdPN72VMxnArcGCjF3SywA%40mail.gmail.com
Michael Paquier [Mon, 25 Jun 2018 02:20:50 +0000 (11:20 +0900)]
Address set of issues with errno handling
System calls mixed up in error code paths are causing two issues which
several code paths have not correctly handled:
1) For write() calls, sometimes the system may return less bytes than
what has been written without errno being set. Some paths were careful
enough to consider that case, and assumed that errno should be set to
ENOSPC, other calls missed that.
2) errno generated by a system call is overwritten by other system calls
which may succeed once an error code path is taken, causing what is
reported to the user to be incorrect.
This patch uses the brute-force approach of correcting all those code
paths. Some refactoring could happen in the future, but this is let as
future work, which is not targeted for back-branches anyway.
Author: Michael Paquier Reviewed-by: Ashutosh Sharma
Discussion: https://postgr.es/m/20180622061535.GD5215@paquier.xyz
Tom Lane [Mon, 18 Jun 2018 19:55:06 +0000 (15:55 -0400)]
Fix contrib/hstore_plperl to look through scalar refs.
Bring this transform function into sync with the policy established
by commit 3a382983d.
Also, fix it to make sure that what it drills down to is indeed a
hash, and not some other kind of Perl SV. Previously, the test
cases added here provoked crashes.
Because of the crash hazard, back-patch to 9.5 where this module
was introduced.
Michael Paquier [Mon, 18 Jun 2018 01:43:27 +0000 (10:43 +0900)]
Prevent hard failures of standbys caused by recycled WAL segments
When a standby's WAL receiver stops reading WAL from a WAL stream, it
writes data to the current WAL segment without having priorily zero'ed
the page currently written to, which can cause the WAL reader to read
junk data from a past recycled segment and then it would try to get a
record from it. While sanity checks in place provide most of the
protection needed, in some rare circumstances, with chances increasing
when a record header crosses a page boundary, then the startup process
could fail violently on an allocation failure, as follows:
FATAL: invalid memory alloc request size XXX
This is confusing for the user and also unhelpful as this requires in
the worst case a manual restart of the instance, impacting potentially
the availability of the cluster, and this also makes WAL data look like
it is in a corrupted state.
The chances of seeing failures are higher if the connection between the
standby and its root node is unstable, causing WAL pages to be written
in the middle. A couple of approaches have been discussed, like
zero-ing new WAL pages within the WAL receiver itself but this has the
disadvantage of impacting performance of any existing instances as this
breaks the sequential writes done by the WAL receiver. This commit
deals with the problem with a more simple approach, which has no
performance impact without reducing the detection of the problem: if a
record is found with a length higher than 1GB for backends, then do not
try any allocation and report a soft failure which will force the
standby to retry reading WAL. It could be possible that the allocation
call passes and that an unnecessary amount of memory is allocated,
however follow-up checks on records would just fail, making this
allocation short-lived anyway.
This patch owes a great deal to Tsunakawa Takayuki for reporting the
failure first, and then discussing a couple of potential approaches to
the problem.
Backpatch down to 9.5, which is where palloc_extended has been
introduced.
Tom Lane [Sat, 16 Jun 2018 19:34:07 +0000 (15:34 -0400)]
Use -Wno-format-truncation and -Wno-stringop-truncation, if available.
gcc 8 has started emitting some warnings that are largely useless for
our purposes, particularly since they complain about code following
the project-standard coding convention that path names are assumed
to be shorter than MAXPGPATH. Even if we make the effort to remove
that assumption in some future release, the changes wouldn't get
back-patched. Hence, just suppress these warnings, on compilers that
have these switches.
Tom Lane [Sat, 16 Jun 2018 18:58:11 +0000 (14:58 -0400)]
Avoid unnecessary use of strncpy in a couple of places in ecpg.
Use of strncpy with a length limit based on the source, rather than
the destination, is non-idiomatic and draws warnings from gcc 8.
Replace with memcpy, which does exactly the same thing in these cases,
but with less chance for confusion.
Tom Lane [Sat, 16 Jun 2018 18:45:47 +0000 (14:45 -0400)]
Use snprintf not sprintf in pg_waldump's timestamptz_to_str.
This could only cause an issue if strftime returned a ridiculously
long timezone name, which seems unlikely; and it wouldn't qualify
as a security problem even then, since pg_waldump (nee pg_xlogdump)
is a debug tool not part of the server. But gcc 8 has started issuing
warnings about it, so let's use snprintf and be safe.
Alvaro Herrera [Thu, 14 Jun 2018 16:51:32 +0000 (12:51 -0400)]
Fail BRIN control functions during recovery explicitly
They already fail anyway, but prior to this patch they raise an ugly
error message about a lock that cannot be acquired. This just improves
the message.
Author: Masahiko Sawada Reported-by: Masahiko Sawada
Discussion: https://postgr.es/m/CAD21AoBZau4g4_NUf3BKNd=CdYK+xaPdtJCzvOC1TxGdTiJx_Q@mail.gmail.com Reviewed-by: Kuntal Ghosh, Alexander Korotkov, Simon Riggs, Michaël Paquier, Álvaro Herrera
Documentation of word_similarity() and strict_word_similarity() functions
contains some vague wordings which could confuse users. This patch makes
those wordings more clear. word_similarity() was introduced in PostgreSQL 9.6,
and corresponding part of documentation needs to be backpatched.
Author: Bruce Momjian, Alexander Korotkov
Discussion: https://postgr.es/m/20180526165648.GB12510%40momjian.us
Backpatch: 9.6, where word_similarity() was introduced
Andres Freund [Tue, 12 Jun 2018 18:13:21 +0000 (11:13 -0700)]
Fix bugs in vacuum of shared rels, by keeping their relcache entries current.
When vacuum processes a relation it uses the corresponding relcache
entry's relfrozenxid / relminmxid as a cutoff for when to remove
tuples etc. Unfortunately for nailed relations (i.e. critical system
catalogs) bugs could frequently lead to the corresponding relcache
entry being stale.
This set of bugs could cause actual data corruption as vacuum would
potentially not remove the correct row versions, potentially reviving
them at a later point. After 699bf7d05c some corruptions in this vein
were prevented, but the additional error checks could also trigger
spuriously. Examples of such errors are:
ERROR: found xmin ... from before relfrozenxid ...
and
ERROR: found multixact ... from before relminmxid ...
To be caused by this bug the errors have to occur on system catalog
tables.
The two bugs are:
1) Invalidations for nailed relations were ignored, based on the
theory that the relcache entry for such tables doesn't
change. Which is largely true, except for fields like relfrozenxid
etc. This means that changes to relations vacuumed in other
sessions weren't picked up by already existing sessions. Luckily
autovacuum doesn't have particularly longrunning sessions.
2) For shared *and* nailed relations, the shared relcache init file
was never invalidated while running. That means that for such
tables (e.g. pg_authid, pg_database) it's not just already existing
sessions that are affected, but even new connections are as well.
That explains why the reports usually were about pg_authid et. al.
To fix 1), revalidate the rd_rel portion of a relcache entry when
invalid. This implies a bit of extra complexity to deal with
bootstrapping, but it's not too bad. The fix for 2) is simpler,
simply always remove both the shared and local init files.
Alvaro Herrera [Wed, 6 Jun 2018 18:46:53 +0000 (14:46 -0400)]
Fix function code in error report
This bug causes a lseek() failure to be reported as a "could not open"
failure in the error message, muddling bug reports. I introduced this
copy-and-pasteo in commit 78e122010422.
Noticed while reviewing code for bug report #15221, from lily liang. In
version 10 the affected function is only used by multixact.c and
commit_ts, and only in corner-case circumstances, neither of which are
involved in the reported bug (a pg_subtrans failure.)