Alvaro Herrera [Wed, 1 Aug 2018 19:06:47 +0000 (15:06 -0400)]
Fix per-tuple memory leak in partition tuple routing
Some operations were being done in a longer-lived memory context,
causing intra-query leaks. It's not noticeable unless you're doing a
large COPY, but if you are, it eats enough memory to cause a problem.
Tom Lane [Wed, 1 Aug 2018 16:30:36 +0000 (12:30 -0400)]
Fix libpq's code for searching .pgpass; rationalize empty-list-item cases.
Before v10, we always searched ~/.pgpass using the host parameter,
and nothing else, to match to the "hostname" field of ~/.pgpass.
(However, null host or host matching DEFAULT_PGSOCKET_DIR was replaced by
"localhost".) In v10, this got broken by commit 274bb2b38, repaired by
commit bdac9836d, and broken again by commit 7b02ba62e; in the code
actually shipped, we'd search with hostaddr if both that and host were
specified --- though oddly, *not* if only hostaddr were specified.
Since this is directly contrary to the documentation, and not
backwards-compatible, it's clearly a bug.
However, the change wasn't totally without justification, even though it
wasn't done quite right, because the pre-v10 behavior has arguably been
buggy since we added hostaddr. If hostaddr is specified and host isn't,
the pre-v10 code will search ~/.pgpass for "localhost", and ship that
password off to a server that most likely isn't local at all. That's
unhelpful at best, and could be a security breach at worst.
Therefore, rather than just revert to that old behavior, let's define
the behavior as "search with host if provided, else with hostaddr if
provided, else search for localhost". (As before, a host name matching
DEFAULT_PGSOCKET_DIR is replaced by localhost.) This matches the
behavior of the actual connection code, so that we don't pick up an
inappropriate password; and it allows useful searches to happen when
only hostaddr is given.
While we're messing around here, ensure that empty elements within a
host or hostaddr list select the same behavior as a totally-empty
field would; for instance "host=a,,b" is equivalent to "host=a,/tmp,b"
if DEFAULT_PGSOCKET_DIR is /tmp. Things worked that way in some cases
already, but not consistently so, which contributed to the confusion
about what key ~/.pgpass would get searched with.
Update documentation accordingly, and also clarify some nearby text.
Back-patch to v10 where the host/hostaddr list functionality was
introduced.
Tom Lane [Tue, 31 Jul 2018 17:00:08 +0000 (13:00 -0400)]
Further fixes for quoted-list GUC values in pg_dump and ruleutils.c.
Commits 742869946 et al turn out to be a couple bricks shy of a load.
We were dumping the stored values of GUC_LIST_QUOTE variables as they
appear in proconfig or setconfig catalog columns. However, although that
quoting rule looks a lot like SQL-identifier double quotes, there are two
critical differences: empty strings ("") are legal, and depending on which
variable you're considering, values longer than NAMEDATALEN might be valid
too. So the current technique fails altogether on empty-string list
entries (as reported by Steven Winfield in bug #15248) and it also risks
truncating file pathnames during dump/reload of GUC values that are lists
of pathnames.
To fix, split the stored value without any downcasing or truncation,
and then emit each element as a SQL string literal.
This is a tad annoying, because we now have three copies of the
comma-separated-string splitting logic in varlena.c as well as a fourth
one in dumputils.c. (Not to mention the randomly-different-from-those
splitting logic in libpq...) I looked at unifying these, but it would
be rather a mess unless we're willing to tweak the API definitions of
SplitIdentifierString, SplitDirectoriesString, or both. That might be
worth doing in future; but it seems pretty unsafe for a back-patched
bug fix, so for now accept the duplication.
Back-patch to all supported branches, as the previous fix was.
Change bms_add_range to be a no-op for empty ranges
In commit 84940644de93, bms_add_range was added with an API to fail with
an error if an empty range was specified. This seems arbitrary and
unhelpful, so turn that case into a no-op instead. Callers that require
further verification on the arguments or result can apply them by
themselves.
This fixes the bug that partition pruning throws an API error for a case
involving the default partition of a default partition, as in the
included test case.
Reported-by: Rajkumar Raghuwanshi <rajkumar.raghuwanshi@enterprisedb.com> Diagnosed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/16590.1532622503@sss.pgh.pa.us
Tom Lane [Mon, 30 Jul 2018 22:04:39 +0000 (18:04 -0400)]
Ensure we build generated headers at the start of some more cases.
"make installcheck" and some related cases, when invoked from the toplevel
directory, start out by doing "make all" in src/test/regress. Since that's
one make recursion level down, the submake-generated-headers target will
do nothing, causing us to fail to create/update generated headers before
building pg_regress. This is, I believe, a new failure mode induced by
commit 3b8f6e75f, so let's fix it. To do so, we have to invoke
submake-generated-headers at the top level.
Set ActiveSnapshot when logically replaying inserts
Input functions for the inserted tuples may require a snapshot, when
they are replayed by native logical replication. An example is a domain
with a constraint using a SQL-language function, which prior to this
commit failed to apply on the subscriber side.
Tom Lane [Mon, 30 Jul 2018 16:35:49 +0000 (12:35 -0400)]
Fix pg_dump's failure to dump REPLICA IDENTITY for constraint indexes.
pg_dump knew about printing ALTER TABLE ... REPLICA IDENTITY USING INDEX
for indexes declared as indexes, but it failed to print that for indexes
declared as unique or primary-key constraints. Per report from Achilleas
Mantzios.
This has been broken since the feature was introduced, AFAICS.
Back-patch to 9.4.
Tom Lane [Mon, 30 Jul 2018 15:54:41 +0000 (11:54 -0400)]
Doc: fix oversimplified example for CREATE POLICY.
As written, this policy constrained only the post-image not the pre-image
of rows, meaning that users could delete other users' rows or take
ownership of such rows, contrary to what the docs claimed would happen.
We need two separate policies to achieve the documented effect.
While at it, try to explain what's happening a bit more fully.
Per report from Олег Самойлов. Back-patch to 9.5 where this was added.
Thanks to Stephen Frost for off-list discussion.
Affected test queries have been testing the wrong thing since their
introduction in commit 4c1383efd132e4f532213c8a8cc63a455f55e344.
Back-patch to 9.3 (all supported versions).
Document security implications of qualified names.
Commit 5770172cb0c9df9e6ce27c507b449557e5b45124 documented secure schema
usage, and that advice suffices for using unqualified names securely.
Document, in typeconv-func primarily, the additional issues that arise
with qualified names. Back-patch to 9.3 (all supported versions).
Bruce Momjian [Sat, 28 Jul 2018 19:01:55 +0000 (15:01 -0400)]
pg_upgrade: check for clean server shutdowns
Previously pg_upgrade checked for the pid file and started/stopped the
server to force a clean shutdown. However, "pg_ctl -m immediate"
removes the pid file but doesn't do a clean shutdown, so check
pg_controldata for a clean shutdown too.
Diagnosed-by: Vimalraj A
Discussion: https://postgr.es/m/CAFKBAK5e4Q-oTUuPPJ56EU_d2Rzodq6GWKS3ncAk3xo7hAsOZg@mail.gmail.com
Amit Kapila [Fri, 27 Jul 2018 05:26:07 +0000 (10:56 +0530)]
Fix the buffer release order for parallel index scans.
During parallel index scans, if the current page to be read is deleted, we
skip it and try to get the next page for a scan without releasing the buffer
lock on the current page. To get the next page, sometimes it needs to wait
for another process to complete its scan and advance it to the next page.
Now, it is quite possible that the master backend has errored out before
advancing the scan and issued a termination signal for all workers. The
workers failed to notice the termination request during wait because the
interrupts are held due to buffer lock on the previous page. This lead to
all workers being stuck.
The fix is to release the buffer lock on current page before trying to get
the next page. We are already doing same in backward scans, but missed
it for forward scans.
Reported-by: Victor Yegorov
Bug: 15290 Diagnosed-by: Thomas Munro and Amit Kapila
Author: Amit Kapila Reviewed-by: Thomas Munro Tested-By: Thomas Munro and Victor Yegorov
Backpatch-through: 10 where parallel index scans were introduced
Discussion:https://postgr.es/m/153228422922.1395.1746424054206154747@wrigleys.postgresql.org
Michael Paquier [Fri, 27 Jul 2018 04:43:36 +0000 (13:43 +0900)]
Fix handling of pgbench's hash when no argument is provided
Depending on the platform used, this can cause a crash in the worst
case, or an unhelpful error message, so fail gracefully.
Author: Fabien Coelho
Discussion: https://postgr.es/m/alpine.DEB.2.21.1807262302550.29874@lancre
Backpatch: 11-, where hash() has been added in pgbench.
Tom Lane [Thu, 26 Jul 2018 22:18:37 +0000 (18:18 -0400)]
Provide plpgsql tests for cases involving record field changes.
We suppressed one of these test cases in commit feb1cc559 because
it was failing to produce the expected results on CLOBBER_CACHE_ALWAYS
buildfarm members. But now we need another test with similar behavior,
so let's set up a test file that is expected to vary between regular and
CLOBBER_CACHE_ALWAYS cases, and provide variant expected files.
Someday we should fix plpgsql's failure for change-of-field-type, and
then the discrepancy will go away and we can fold these tests back
into plpgsql_record.sql. But today is not that day.
Tom Lane [Thu, 26 Jul 2018 20:08:45 +0000 (16:08 -0400)]
Avoid crash in eval_const_expressions if a Param's type changes.
Since commit 6719b238e it's been possible for the values of plpgsql
record field variables to be exposed to the planner as Params.
(Before that, plpgsql never supplied values for such variables during
planning, so that the problematic code wasn't reached.) Other places
that touch potentially-type-mutable Params either cope gracefully or
do runtime-test-and-ereport checks that the type is what they expect.
But eval_const_expressions() just had an Assert, meaning that it either
failed the assertion or risked crashes due to using an incompatible
value.
In this case, rather than throwing an ereport immediately, we can just
not perform a const-substitution in case of a mismatch. This seems
important for the same reason that the Param fetch was speculative:
we might not actually reach this part of the expression at runtime.
Test case will follow in a separate commit.
Patch by me, pursuant to bug report from Andrew Gierth.
Back-patch to v11 where the previous commit appeared.
Andres Freund [Wed, 25 Jul 2018 23:31:49 +0000 (16:31 -0700)]
LLVMJIT: Release JIT context after running ExprContext shutdown callbacks.
Due to inlining it previously was possible that an ExprContext's
shutdown callback pointed to a JITed function. As the JIT context
previously was shut down before the shutdown callbacks were called,
that could lead to segfaults. Fix the ordering.
Reported-By: Dmitry Dolgov
Author: Andres Freund
Discussion: https://postgr.es/m/CA+q6zcWO7CeAJtHBxgcHn_hj+PenM=tvG0RJ93X1uEJ86+76Ug@mail.gmail.com
Backpatch: 11-, where JIT compilation was added
Thomas Munro [Tue, 24 Jul 2018 22:58:44 +0000 (10:58 +1200)]
Pad semaphores to avoid false sharing.
In a USE_UNNAMED_SEMAPHORES build, the default on Linux and FreeBSD
since commit ecb0d20a, we have an array of sem_t objects. This
turned out to reduce performance compared to the previous default
USE_SYSV_SEMAPHORES on an 8 socket system. Testing showed that the
lost performance could be regained by padding the array elements so
that they have their own cache lines. This matches what we do for
similar hot arrays (see LWLockPadded, WALInsertLockPadded).
Back-patch to 10, where unnamed semaphores were adopted as the default
semaphore interface on those operating systems.
Author: Thomas Munro Reviewed-by: Andres Freund Reported-by: Mithun Cy Tested-by: Mithun Cy, Tom Lane, Thomas Munro
Discussion: https://postgr.es/m/CAD__OugYDM3O%2BdyZnnZSbJprSfsGFJcQ1R%3De59T3hcLmDug4_w%40mail.gmail.com
Andres Freund [Tue, 24 Jul 2018 17:51:06 +0000 (10:51 -0700)]
doc: Fix reference to "decoder" to instead be the correct "output plugin".
Author: Jonathan Katz
Discussion: https://postgr.es/m/DD02DD86-5989-4BFD-8712-468541F68383@postgresql.org
Backpatch: 9.4-, where logical decoding was added
Michael Paquier [Tue, 24 Jul 2018 01:33:07 +0000 (10:33 +0900)]
Fix calculation for WAL segment recycling and removal
Commit 4b0d28de06 has removed the prior checkpoint and related
facilities but has left WAL recycling based on the LSN of the prior
checkpoint, which causes incorrect calculations for WAL removal and
recycling for max_wal_size and min_wal_size. This commit changes things
so as the base calculation point is the last checkpoint generated.
Reported-by: Kyotaro Horiguchi
Author: Kyotaro Horiguchi Reviewed-by: Michael Paquier
Discussion: https://postgr.es/m/20180723.135748.42558387.horiguchi.kyotaro@lab.ntt.co.jp
Backpatch: 11-, where the prior checkpoint has been removed.
Andres Freund [Mon, 23 Jul 2018 04:13:20 +0000 (21:13 -0700)]
LLVMJIT: Adapt to API changes in gdb and perf support.
During the work of upstreaming my previous patches for gdb and perf
support the API changed. Adapt. Normally this wouldn't necessarily be
something to backpatch, but the previous API wasn't upstream, and at
least the gdb support is quite useful for debugging.
Author: Andres Freund
Backpatch: 11, where LLVM based JIT support was added.
Andres Freund [Sun, 22 Jul 2018 23:47:00 +0000 (16:47 -0700)]
Fix JITed EEOP_AGG_INIT_TRANS, which missed some state.
The JIT compiled implementation missed maintaining
AggState->{current_set,curaggcontext}. That could lead to trouble
because the transition value could be allocated in the wrong context.
Reported-By: Rushabh Lathia Diagnosed-By: Dmitry Dolgov
Author: Dmitry Dolgov, with minor changes by me
Discussion: https://postgr.es/m/CAGPqQf165-=+Drw3Voim7M5EjHT1zwPF9BQRjLFQzCzYnNZEiQ@mail.gmail.com
Backpatch: 11-, where JIT compilation support was added
Tom Lane [Sat, 21 Jul 2018 19:40:51 +0000 (15:40 -0400)]
Further portability hacking in pg_upgrade's test script.
I blew the dust off a Bourne shell (file date 1996, yea verily) and
tried to run test.sh with it. It mostly worked, but I found that the
temp-directory creation code introduced by commit be76a6d39 was not
compatible, for a couple of reasons: this shell thinks "set -e" should
force an exit if a command within backticks fails, and it also thinks code
within braces should be executed by a sub-shell, meaning that variable
settings don't propagate back up to the parent shell. In view of Victor
Wagner's report that Solaris is still using pre-POSIX shells, seems like
we oughta make this case work. It's not like the code is any less
idiomatic this way; the prior coding technique appeared nowhere else.
(There is a remaining bash-ism here, which is that $RANDOM doesn't do
what the code hopes in non-bash shells. But the use of $$ elsewhere in
that path should be enough to ensure uniqueness and some amount of
randomness, so I think it's okay as-is.)
Back-patch to all supported branches, as the previous commit was.
Tom Lane [Sat, 21 Jul 2018 16:05:25 +0000 (12:05 -0400)]
Be more paranoid about quoting in pg_upgrade's test script.
Double-quote $PGDATA in "find" commands introduced by commit da9b580d8,
in case that path contains spaces or other special characters.
Adjust a few other places so that quoting is done more consistently.
None of the others are actual bugs AFAICS, but it's confusing to readers
if the same thing is done differently in different places.
Tom Lane [Fri, 20 Jul 2018 17:59:27 +0000 (13:59 -0400)]
Avoid unportable shell syntax in pg_upgrade's test script.
Most of test.sh uses traditional backtick syntax for command substitution,
but commit da9b580d8 introduced two uses of $(...) syntax, which is not
recognized by very old shells. Bring those into line with the rest.
Dean Rasheed [Fri, 20 Jul 2018 07:57:08 +0000 (08:57 +0100)]
Guard against rare RAND_bytes() failures in pg_strong_random().
When built using OpenSSL, pg_strong_random() uses RAND_bytes() to
generate the random number. On very rare occasions that can fail, if
its PRNG has not been seeded with enough data. Additionally, once it
does fail, all subsequent calls will also fail until more seed data is
added. Since this is required during backend startup, this can result
in all new backends failing to start until a postmaster restart.
Guard against that by checking the state of OpenSSL's PRNG using
RAND_status(), and if necessary (very rarely), seeding it using
RAND_poll().
Back-patch to v10, where pg_strong_random() was introduced.
Fix handling of empty uncompressed posting list pages in GIN
PostgreSQL 9.4 introduces posting list compression in GIN. This feature
supports online upgrade, so that after pg_upgrade uncompressed posting
lists are compressed on-the-fly. Underlying code appears to always
expect at least one item on uncompressed posting list page. But there
could be completely empty pages, because VACUUM never deletes leftmost
and rightmost pages from posting trees. This commit fixes that.
Tom Lane [Thu, 19 Jul 2018 19:41:46 +0000 (15:41 -0400)]
Remove undocumented restriction against duplicate partition key columns.
transformPartitionSpec rejected duplicate simple partition columns
(e.g., "PARTITION BY RANGE (x,x)") but paid no attention to expression
columns, resulting in inconsistent behavior. Worse, cases like
"PARTITION BY RANGE (x,(x))") were accepted but would then result in
dump/reload failures, since the expression (x) would get simplified
to a plain column later.
There seems no better reason for this restriction than there was for
the one against duplicate included index columns (cf commit 701fd0bbc),
so let's just remove it.
Tom Lane [Thu, 19 Jul 2018 18:53:41 +0000 (14:53 -0400)]
Improve psql's \d command to show whether index columns are key columns.
This is essential information when looking at an index that has
"included" columns. Per discussion, follow the style used in \dC
and some other places: column header is "Key?" and values are "yes"
or "no" (all translatable).
While at it, revise describeOneTableDetails to be a bit more maintainable:
avoid hard-wired column numbers and multiple repetitions of what needs
to be identical test logic. This also results in the emitted catalog
query corresponding more closely to what we print, which should be a
benefit to users of ECHO_HIDDEN mode, and perhaps a bit faster too
(the old logic sometimes asked for values it would not print, even
ones that are fairly expensive to get).
Tom Lane [Thu, 19 Jul 2018 17:48:05 +0000 (13:48 -0400)]
Fix pg_get_indexdef()'s behavior for included index columns.
The multi-argument form of pg_get_indexdef() failed to print anything when
asked to print a single index column that is an included column rather than
a key column. This seems an unintentional result of someone having tried
to take a short-cut and use the attrsOnly flag for two different purposes.
To fix, split said flag into two flags, attrsOnly which suppresses
non-attribute info, and keysOnly which suppresses included columns.
Add a test case using psql's \d command, which relies on that function.
(It's mighty tempting at this point to replace pg_get_indexdef_worker's
mess of boolean flag arguments with a single bitmask-of-flags argument,
which would allow making the call sites much more self-documenting.
But I refrained for the moment.)
Rewrite comments in replication slot advance implementation
The code added by 9c7d06d60680 was a bit obscure; clarify that by
rewriting the comments. Lack of clarity has already caused bugs, so
it's a worthy goal.
I was confused by what "intended to be parallel serially" meant, until
Robert Haas and David G. Johnston explained it. Rephrase the comment to
make it more clear, using David's suggested wording.
Michael Paquier [Thu, 19 Jul 2018 00:55:02 +0000 (09:55 +0900)]
Fix print of Path nodes when using OPTIMIZER_DEBUG
GatherMergePath (introduced in 10) and CustomPath (introduced in 9.5)
have gone missing. The order of the Path nodes was inconsistent with
what is listed in nodes.h, so make the order consistent at the same time
to ease future checks and additions.
Author: Sawada Masahiko Reviewed-by: Michael Paquier
Discussion: https://postgr.es/m/CAD21AoBQMLoc=ohH-oocuAPsELrmk8_EsRJjOyR8FQLZkbE0wA@mail.gmail.com
Tom Lane [Wed, 18 Jul 2018 21:39:27 +0000 (17:39 -0400)]
Remove race-prone hot_standby_feedback test cases in 001_stream_rep.pl.
This script supposed that if it turned hot_standby_feedback on and then
shut down the standby server, at least one feedback message would be
guaranteed to be sent before the standby stops. But there is no such
guarantee, if the standby's walreceiver process is slow enough --- and
we've seen multiple failures in the buildfarm showing that that does
happen in practice. While we could rearrange the walreceiver logic to
make it less likely, it seems probably impossible to create a really
bulletproof guarantee of that sort; and if we tried, we might create
situations where the walreceiver wouldn't react in a timely manner to
shutdown commands. It seems better instead to remove the script's
assumption that feedback will occur before shutdown.
But once we do that, these last few tests seem quite redundant with
the earlier tests in the script. So let's just drop them altogether
and save some buildfarm cycles.
Tom Lane [Wed, 18 Jul 2018 18:43:03 +0000 (14:43 -0400)]
Drop the rule against included index columns duplicating key columns.
The initial version of the included-index-column feature stated that
included columns couldn't be the same as any key column of the index.
While it'd be pretty silly to do that, since the included column would be
entirely redundant, we've never prohibited redundant index columns before
so it's not very consistent to do so here. Moreover, the prohibition
was itself badly implemented, so that it failed to reject columns that
were effectively identical but not spelled quite alike, as reported by
Aditya Toshniwal.
(Moreover, it's not hard to imagine that for some non-btree index types,
such cases would be non-silly anyhow: the index might use a lossy
representation for key columns but be able to support retrieval of the
original form of included columns.)
Hence, let's just drop the prohibition.
In passing, do some copy-editing on the documentation for the
included-column feature.
Yugo Nagata; documentation and test corrections by me
doc: move PARTITION OF stanza to just below PARTITION BY
It's more logical this way, since the new ordering matches the way the
tables are created; but in any case, the previous location of PARTITION OF
did not appear carefully chosen anyway (since it didn't match the
location in which it appears in the synopsys either, which is what we
normally do.)
In the PARTITION BY stanza, add a link to the partitioning section in
the DDL chapter, too.
Suggested-by: David G. Johnston
Discussion: https://postgr.es/m/CAKFQuwY4Ld7ecxL_KAmaxwt0FUu5VcPPN2L4dh+3BeYbrdBa5g@mail.gmail.com
Fix ALTER TABLE...SET STATS error message for included columns
The existing error message was complaining that the column is not an
expression, which is not correct. Introduce a suitable wording
variation and a test.
The original code was unable to prune partitions that could not possibly
contain NULL values, when the query specified less than all columns in a
multicolumn partition key. Reorder the if-tests so that it is, and add
more commentary and regression tests.
Robert Haas [Mon, 16 Jul 2018 21:33:22 +0000 (17:33 -0400)]
Add subtransaction handling for table synchronization workers.
Since the old logic was completely unaware of subtransactions, a
change made in a subsequently-aborted subtransaction would still cause
workers to be stopped at toplevel transaction commit. Fix that by
managing a stack of worker lists rather than just one.
Peter Eisentraut [Mon, 16 Jul 2018 08:44:06 +0000 (10:44 +0200)]
doc: Update redirecting links
Update links that resulted in redirects. Most are changes from http to
https, but there are also some other minor edits. (There are still some
redirects where the target URL looks less elegant than the one we
currently have. I have left those as is.)
Tom Lane [Sat, 14 Jul 2018 15:59:12 +0000 (11:59 -0400)]
Fix hashjoin costing mistake introduced with inner_unique optimization.
In final_cost_hashjoin(), commit 9c7f5229a allowed inner_unique cases
to follow a code path previously used only for SEMI/ANTI joins; but it
neglected to fix an if-test within that path that assumed SEMI and ANTI
were the only possible cases. This resulted in a wrong value for
hashjointuples, and an ensuing bad cost estimate, for inner_unique normal
joins. Fortunately, for inner_unique normal joins we can assume the number
of joined tuples is the same as for a SEMI join; so there's no need for
more code, we just have to invert the test to check for ANTI not SEMI.
It turns out that in two contrib tests in which commit 9c7f5229a
changed the plan expected for a query, the change was actually wrong
and induced by this estimation error, not by any real improvement.
Hence this patch also reverts those changes.
Per report from RK Korlapati. Backpatch to v10 where the error was
introduced.
Tom Lane [Fri, 13 Jul 2018 22:45:30 +0000 (18:45 -0400)]
Fix crash in contrib/ltree's lca() function for empty input array.
lca_inner() wasn't prepared for the possibility of getting no inputs.
Fix that, and make some cosmetic improvements to the code while at it.
Also, I thought the documentation of this function as returning the
"longest common prefix" of the paths was entirely misleading; it really
returns a path one shorter than the longest common prefix, for the typical
definition of "prefix". Don't use that term in the docs, and adjust the
examples to clarify what really happens.
This has been broken since its beginning, so back-patch to all supported
branches.
Per report from Hailong Li. Thanks to Pierre Ducroquet for diagnosing
and for the initial patch, though I whacked it around some and added
test cases.
Peter Eisentraut [Fri, 13 Jul 2018 19:23:41 +0000 (21:23 +0200)]
Update documentation editor setup instructions
Now that the documentation sources are in XML rather than SGML, some of
the documentation about the editor, or more specifically Emacs, setup
needs updating. The updated instructions recommend using nxml-mode,
which works mostly out of the box, with some small tweaks in
emacs.samples and .dir-locals.el.
Also remove some obsolete stuff in .dir-locals.el. I did, however,
leave the sgml-mode settings in there so that someone using Emacs
without emacs.samples gets those settings when editing a *.sgml file.
Tom Lane [Fri, 13 Jul 2018 18:16:47 +0000 (14:16 -0400)]
Fix crash in json{b}_populate_recordset() and json{b}_to_recordset().
As of commit 37a795a60, populate_recordset_worker() tried to pass back
(as rsi.setDesc) a tupdesc that it also had cached in its fn_extra.
But the core executor would free the passed-back tupdesc, risking a
crash if the function were called again in the same query. The safest
and least invasive way to fix that is to make an extra tupdesc copy
to pass back.
While at it, I failed to resist the temptation to get rid of unnecessary
get_fn_expr_argtype() calls here and in populate_record_worker().
Per report from Dmitry Dolgov; thanks to Michael Paquier and
Andrew Gierth for investigation and discussion.
The patch that ended up as commit 3de241dba86f ("Foreign keys on
partitioned tables") lacked pg_dump tests, so the pg_dump code that was
there to support it inadvertently stopped working when in later
development I modified the backend code not to emit pg_trigger rows for
the partitioned table itself.
Bug analysis and code fix is by Michaël. I (Álvaro) added the test.
Tom Lane [Fri, 13 Jul 2018 15:52:50 +0000 (11:52 -0400)]
Fix inadequate buffer locking in FSM and VM page re-initialization.
When reading an existing FSM or VM page that was found to be corrupt by the
buffer manager, the code applied PageInit() to reinitialize the page, but
did so without any locking. There is thus a hazard that two backends might
concurrently do PageInit, which in itself would still be OK, but the slower
one might then zero over subsequent data changes applied by the faster one.
Even that is unlikely to be fatal; but it's not desirable, so add locking
to prevent it.
This does not add any locking overhead in the normal code path where the
page is OK. It's not immediately obvious that that's safe, but I believe
it is, for reasons explained in the added comments.
Problem noted by R P Asim. It's been like this for a long time, so
back-patch to all supported branches.
Prohibit transaction commands in security definer procedures
Starting and aborting transactions in security definer procedures
doesn't work. StartTransaction() insists that the security context
stack is empty, so this would currently cause a crash, and
AbortTransaction() resets it. This could be made to work by
reorganizing the code, but right now we just prohibit it.
Reported-by: amul sul <sulamul@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/CAAJ_b96Gupt_LFL7uNyy3c50-wbhA68NUjiK5%3DrF6_w%3Dpq_T%3DQ%40mail.gmail.com
Peter Eisentraut [Thu, 12 Jul 2018 18:22:17 +0000 (20:22 +0200)]
Reset shmem_exit_inprogress after shmem_exit()
In ad9a274778d2d88c46b90309212b92ee7fdf9afe, shmem_exit_inprogress was
introduced. But we need to reset it after shmem_exit(), because unlike
the similar proc_exit(), shmem_exit() can also be called for cleanup
when the process will not exit.
Reported-by: Andrew Gierth <andrew@tao11.riddles.org.uk>
Tom Lane [Thu, 12 Jul 2018 16:28:43 +0000 (12:28 -0400)]
Doc: minor improvement in pl/pgsql FETCH/MOVE documentation.
Explain that you can use any integer expression for the "count" in
pl/pgsql's versions of FETCH/MOVE, unlike the SQL versions which only
allow a constant.
Remove the duplicate version of this para under MOVE. I don't see
a good reason to maintain two identical paras when we just said that
MOVE works exactly like FETCH.
Fix FK checks of TRUNCATE involving partitioned tables
When truncating a table that is referenced by foreign keys in
partitioned tables, the check to ensure the referencing table are also
truncated spuriously failed. This is because it was relying on
relhastriggers as a proxy for the table having FKs, and that's wrong for
partitioned tables. Fix it to consider such tables separately. There
may be a better way ... but this code is pretty inefficient already.
Author: Álvaro Herrera <alvherre@alvh.no-ip.org> Reviewed-by: Michael Paquiër <michael@paquier.xyz>
Discussion: https://postgr.es/m/20180711000624.zmeizicibxeehhsg@alvherre.pgsql
Tom Lane [Thu, 12 Jul 2018 15:10:24 +0000 (11:10 -0400)]
Doc: update documentation for requirement of ORDER BY in GROUPS mode.
Commit ff4f88916 adjusted the code to enforce the SQL spec's requirement
that a window using GROUPS mode must have an ORDER BY clause. But I missed
that the documentation explicitly said you didn't have to have one.
Also minor wordsmithing in the window-function section of select.sgml.
Per Masahiko Sawada, though I didn't use his patch.
Amit Kapila [Thu, 12 Jul 2018 06:47:27 +0000 (12:17 +0530)]
Allow using the updated tuple while moving it to a different partition.
An update that causes the tuple to be moved to a different partition was
missing out on re-constructing the to-be-updated tuple, based on the latest
tuple in the update chain. Instead, it's simply deleting the latest tuple
and inserting a new tuple in the new partition based on the old tuple.
Commit 2f17844104 didn't consider this case, so some of the updates were
getting lost.
In passing, change the argument order for output parameter in ExecDelete
and add some commentary about it.
Reported-by: Pavan Deolasee
Author: Amit Khandekar, with minor changes by me Reviewed-by: Dilip Kumar, Amit Kapila and Alvaro Herrera
Backpatch-through: 11
Discussion: https://postgr.es/m/CAJ3gD9fRbEzDqdeDq1jxqZUb47kJn+tQ7=Bcgjc8quqKsDViKQ@mail.gmail.com
Michael Paquier [Thu, 12 Jul 2018 01:19:51 +0000 (10:19 +0900)]
Make logical WAL sender report streaming state appropriately
WAL senders sending logically-decoded data fail to properly report in
"streaming" state when starting up, hence as long as one extra record is
not replayed, such WAL senders would remain in a "catchup" state, which
is inconsistent with the physical cousin.
This can be easily reproduced by for example using pg_recvlogical and
restarting the upstream server. The TAP tests have been slightly
modified to detect the failure and strengthened so as future tests also
make sure that a node is in streaming state when waiting for its
catchup.
Backpatch down to 9.4 where this code has been introduced.
Reported-by: Sawada Masahiko
Author: Simon Riggs, Sawada Masahiko Reviewed-by: Petr Jelinek, Michael Paquier, Vaishnavi Prabakaran
Discussion: https://postgr.es/m/CAD21AoB2ZbCCqOx=bgKMcLrAvs1V0ZMqzs7wBTuDySezTGtMZA@mail.gmail.com
Tom Lane [Wed, 11 Jul 2018 19:25:28 +0000 (15:25 -0400)]
Fix create_scan_plan's handling of sortgrouprefs for physical tlists.
We should only run apply_pathtarget_labeling_to_tlist if CP_LABEL_TLIST
was specified, because only in that case has use_physical_tlist checked
that the labeling will succeed; otherwise we may get an "ORDER/GROUP BY
expression not found in targetlist" error. (This subsumes the previous
test about gating_clauses, because we reset "flags" to zero earlier
if there are gating clauses to apply.)
The only known case in which a failure can occur is with a ProjectSet
path directly atop a table scan path, although it seems likely that there
are other cases or will be such in future. This means that the failure
is currently only visible in the v10 branch: 9.6 didn't have ProjectSet,
while in v11 and HEAD, apply_scanjoin_target_to_paths for some weird
reason is using create_projection_path not apply_projection_to_path,
masking the problem because there's a ProjectionPath in between.
Nonetheless this code is clearly wrong on its own terms, so back-patch
to 9.6 where this logic was introduced.
Tom Lane [Wed, 11 Jul 2018 16:07:21 +0000 (12:07 -0400)]
Fix bugs with degenerate window ORDER BY clauses in GROUPS/RANGE mode.
nodeWindowAgg.c failed to cope with the possibility that no ordering
columns are defined in the window frame for GROUPS mode or RANGE OFFSET
mode, leading to assertion failures or odd errors, as reported by Masahiko
Sawada and Lukas Eder. In RANGE OFFSET mode, an ordering column is really
required, so add an Assert about that. In GROUPS mode, the code would
work, except that the node initialization code wasn't in sync with the
execution code about when to set up tuplestore read pointers and spare
slots. Fix the latter for consistency's sake (even though I think the
changes described below make the out-of-sync cases unreachable for now).
Per SQL spec, a single ordering column is required for RANGE OFFSET mode,
and at least one ordering column is required for GROUPS mode. The parser
enforced the former but not the latter; add a check for that.
We were able to reach the no-ordering-column cases even with fully spec
compliant queries, though, because the planner would drop partitioning
and ordering columns from the generated plan if they were redundant with
earlier columns according to the redundant-pathkey logic, for instance
"PARTITION BY x ORDER BY y" in the presence of a "WHERE x=y" qual.
While in principle that's an optimization that could save some pointless
comparisons at runtime, it seems unlikely to be meaningful in the real
world. I think this behavior was not so much an intentional optimization
as a side-effect of an ancient decision to construct the plan node's
ordering-column info by reverse-engineering the PathKeys of the input
path. If we give up redundant-column removal then it takes very little
code to generate the plan node info directly from the WindowClause,
ensuring that we have the expected number of ordering columns in all
cases. (If anyone does complain about this, the planner could perhaps
be taught to remove redundant columns only when it's safe to do so,
ie *not* in RANGE OFFSET mode. But I doubt anyone ever will.)
With these changes, the WindowAggPath.winpathkeys field is not used for
anything anymore, so remove it.
The test cases added here are not actually very interesting given the
removal of the redundant-column-removal logic, but they would represent
important corner cases if anyone ever tries to put that back.
Tom Lane and Masahiko Sawada. Back-patch to v11 where RANGE OFFSET
and GROUPS modes were added.
Michael Paquier [Tue, 10 Jul 2018 23:57:18 +0000 (08:57 +0900)]
Block replication slot advance for these not yet reserving WAL
Such replication slots are physical slots freshly created without WAL
being reserved, which is the default behavior, which have not been used
yet as WAL consumption resources to retain WAL. This prevents advancing
a slot to a position older than any WAL available, which could falsify
calculations for WAL segment recycling.
This also cleans up a bit the code, as ReplicationSlotRelease() would be
called on ERROR, and improves error messages.
Reported-by: Kyotaro Horiguchi
Author: Michael Paquier Reviewed-by: Andres Freund, Álvaro Herrera, Kyotaro Horiguchi
Discussion: https://postgr.es/m/20180626071305.GH31353@paquier.xyz
We fail to handle polymorphic types properly when they are used as
partition keys: we were unnecessarily adding a RelabelType node on top,
which confuses code examining the nodes. In particular, this makes
predtest.c-based partition pruning not to work, and ruleutils.c to emit
expressions that are uglier than needed. Fix it by not adding RelabelType
when not needed.
In master/11 the new pruning code is separate so it doesn't suffer from
this problem, since we already fixed it (in essentially the same way) in e5dcbb88a15d, which also added a few tests; back-patch those tests to
pg10 also. But since UPDATE/DELETE still uses predtest.c in pg11, this
change improves partitioning for those cases too. Add tests for this.
The ruleutils.c behavior change is relevant in pg11/master too.
Co-authored-by: Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> Co-authored-by: Álvaro Herrera <alvherre@alvh.no-ip.org> Reviewed-by: Álvaro Herrera <alvherre@alvh.no-ip.org> Reviewed-by: Robert Haas <robertmhaas@gmail.com>
Discussion: https://postgr.es/m/54745d13-7ed4-54ac-97d8-ea1eec95ae25@lab.ntt.co.jp
Tom Lane [Mon, 9 Jul 2018 23:26:19 +0000 (19:26 -0400)]
Avoid emitting a bogus WAL record when recycling an all-zero btree page.
Commit fafa374f2 caused _bt_getbuf() to possibly emit a WAL record for
a page that it was about to recycle. However, it failed to distinguish
all-zero pages from dead pages, which is important because only the
latter have valid btpo.xact values, or indeed any special space at all.
Recycling an all-zero page with XLogStandbyInfoActive() enabled therefore
led to an Assert failure, or to emission of a WAL record containing a
bogus cutoff XID, which might lead to unnecessary query cancellations
on hot standby servers.
Per reports from Antonin Houska and 自己. Amit Kapila was first to
propose this fix, and Robert Haas, myself, and Kyotaro Horiguchi
reviewed it at various times.
This is an old bug, so back-patch to all supported branches.
Tom Lane [Mon, 9 Jul 2018 18:28:04 +0000 (14:28 -0400)]
Fix yet more problems with incorrectly-constructed zero-length arrays.
Commit 716ea626a attempted to fix the problem of building 1-D zero-size
arrays once and for all. But it turns out that contrib/intarray has some
code that doesn't use construct_array() but just builds arrays by hand,
so it didn't get the memo. This appears to affect all of subarray(),
intset_subtract(), inner_int_union(), inner_int_inter(), and
intarray_concat_arrays().
Back-patch into v11. In the past we've not back-patched this type of
change, but since v11 is still in beta it seems all right to include
this fix in it. Besides it's more consistent to make the fix in v11
where 716ea626a appeared.
Report and patch by Alexey Kryuchkov, some cosmetic adjustments by me
Michael Paquier [Mon, 9 Jul 2018 01:25:40 +0000 (10:25 +0900)]
Rework order of end-of-recovery actions to delay timeline history write
A critical failure in some of the end-of-recovery actions before the
end-of-recovery record is written can cause PostgreSQL to react
inconsistently with the rest of the cluster in the event of a crash
before the final record is written. Two such failures are for example
an error while processing a two-phase state files or when operating on
recovery.conf. With this commit, the failures are still considered
FATAL, but the write of the timeline history file is delayed as much as
possible so as the window between the moment the file is written and the
end-of-recovery record is generated gets minimized. This way, in the
event of a crash or a failure, the new timeline decided at promotion
will not seem taken by other nodes in the cluster. It is not really
possible to reduce to zero this window, hence one could still see
failures if a crash happens between the history file write and the
end-of-recovery record, so any future code should be careful when
adding new end-of-recovery actions. The original report from Magnus
Hagander mentioned a renamed recovery.conf as original end-of-recovery
failure which caused a timeline to be seen as taken but the subsequent
processing on the now-missing recovery.conf cause the startup process to
issue stop on FATAL, which at follow-up startup made the system
inconsistent because of on-disk changes which already happened.
Processing of two-phase state files still needs some work as corrupted
entries are simply ignored now. This is left as a future item and this
commit fixes the original complain.
Reported-by: Magnus Hagander
Author: Heikki Linnakangas Reviewed-by: Alexander Korotkov, Michael Paquier, David Steele
Discussion: https://postgr.es/m/CABUevEz09XY2EevA2dLjPCY-C5UO4Hq=XxmXLmF6ipNFecbShQ@mail.gmail.com
Add separate error message for procedure does not exist
While we probably don't want to split up all error messages into
function and procedure variants, this one is a very prominent one, so
it's helpful to be more specific here.
Michael Paquier [Fri, 6 Jul 2018 23:10:32 +0000 (08:10 +0900)]
Add note in pg_rewind documentation about read-only files
When performing pg_rewind, the presence of a read-only file which is not
accessible for writes will cause a failure while processing. This can
cause the control file of the target data folder to be truncated,
causing it to not be reusable with a successive run.
Also, when pg_rewind fails mid-flight, there is likely no way to be able
to recover the target data folder anyway, in which case a new base
backup is the best option. A note is added in the documentation as
well about.
Reported-by: Christian H.
Author: Michael Paquier Reviewed-by: Andrew Dunstan
Discussion: https://postgr.es/m/20180104200633.17004.16377%40wrigleys.postgresql.org
Peter Eisentraut [Fri, 29 Jun 2018 11:28:39 +0000 (13:28 +0200)]
Fix assert in nested SQL procedure call
When executing CALL in PL/pgSQL, we need to set a snapshot before
invoking the to-be-called procedure. Otherwise, the to-be-called
procedure might end up running without a snapshot. For LANGUAGE SQL
procedures, this would result in an assertion failure. (For most other
languages, this is usually not a problem, because those use SPI and SPI
sets snapshots in most cases.) Setting the snapshot restores the
behavior of how CALL worked when it was handled as a generic SQL
statement in PL/pgSQL (exec_stmt_execsql()).
This change revealed another problem: In SPI_commit(), we popped the
active snapshot before committing the transaction, to avoid "snapshot %p
still active" errors. However, there is no particular reason why only
at most one snapshot should be on the stack. So change this to pop all
active snapshots instead of only one.
Allow replication slots to be dropped in single-user mode
Starting with commit 9915de6c1cb2, replication slot drop uses a
condition variable sleep to wait until the current user of the slot goes
away. This is more user friendly than the previous behavior of erroring
out if the slot is in use, but it fails with a not-for-user-consumption
error message in single-user mode; plus, if you're using single-user
mode because you don't want to start the server in the regular mode
(say, disk is full and WAL won't recycle because of the slot), it's
inconvenient.
Fix by skipping the cond variable sleep in single-user mode, since
there can't be anybody to wait for anyway.
doc: Reword old inheritance partitioning documentation
Prefer to use phrases like "child" instead of "partition" when
describing the legacy inheritance-based partitioning. The word
"partition" now has a fixed meaning for the built-in partitioning, so
keeping it out of the documentation of the old method makes things
clearer.
logical decoding: beware of an unset specinsert change
Coverity complains that there is no protection in the code (at least in
non-assertion-enabled builds) against speculative insertion failing to
follow the expected protocol. Add an elog(ERROR) for the case.
Reduce cost of test_decoding's new oldest_xmin test
Change a whole-database VACUUM into doing just pg_attribute, which is
the portion that verifies what we want it to do. The original
formulation wastes a lot of CPU time, which leads the test to fail when
runtime exceeds isolationtester timeout when it's super-slow, such as
under CLOBBER_CACHE_ALWAYS. Per buildfarm member friarbird.
It turns out that the previous shape of the test doesn't always detect
the condition it is supposed to detect (on unpatched reorderbuffer
code): the reason is that there is a good chance of encountering a
xl_running_xacts record (logged every 15 seconds) before the checkpoint
-- and because we advance the xmin when we receive that WAL record, and
we *don't* advance the xmin twice consecutively without receiving a
client message in between, that means the xmin is not advanced enough
for the tuple to be pruned from pg_attribute by VACUUM. So the test
would spuriously pass.
The reason this test deficiency wasn't detected earlier is that HOT
pruning removes the tuple anyway, even if vacuum leaves it in place, so
the test correctly fails (detecting the coding mistake), but for the
wrong reason.
To fix this mess, run the s0_get_changes step twice before vacuum
instead of once: this seems to cause the xmin to be advanced reliably,
wreaking havoc with more certainty.
Michael Paquier [Thu, 5 Jul 2018 01:46:43 +0000 (10:46 +0900)]
Prevent references to invalid relation pages after fresh promotion
If a standby crashes after promotion before having completed its first
post-recovery checkpoint, then the minimal recovery point which marks
the LSN position where the cluster is able to reach consistency may be
set to a position older than the first end-of-recovery checkpoint while
all the WAL available should be replayed. This leads to the instance
thinking that it contains inconsistent pages, causing a PANIC and a hard
instance crash even if all the WAL available has not been replayed for
certain sets of records replayed. When in crash recovery,
minRecoveryPoint is expected to always be set to InvalidXLogRecPtr,
which forces the recovery to replay all the WAL available, so this
commit makes sure that the local copy of minRecoveryPoint from the
control file is initialized properly and stays as it is while crash
recovery is performed. Once switching to archive recovery or if crash
recovery finishes, then the local copy minRecoveryPoint can be safely
updated.
Pavan Deolasee has reported and diagnosed the failure in the first
place, and the base fix idea to rely on the local copy of
minRecoveryPoint comes from Kyotaro Horiguchi, which has been expanded
into a full-fledged patch by me. The test included in this commit has
been written by Álvaro Herrera and Pavan Deolasee, which I have modified
to make it faster and more reliable with sleep phases.
Backpatch down to all supported versions where the bug appears, aka 9.3
which is where the end-of-recovery checkpoint is not run by the startup
process anymore. The test gets easily supported down to 10, still it
has been tested on all branches.
Andres Freund [Thu, 5 Jul 2018 00:36:01 +0000 (17:36 -0700)]
Use context with correct lifetime in hypothetical_dense_rank_final.
The query lifetime expression context created in
hypothetical_dense_rank_final() was buggily allocated in the calling
memory context. I (Andres) broke that in bf6c614a2f2.
Reported-By: Rajkumar Raghuwanshi
Author: Amit Langote
Discussion: https://postgr.es/m/CAKcux6kmzWmur5HhA_aU6gYVFu0RLQdgJJ+aC9SLdcOvBSrpfA@mail.gmail.com
Backpatch: 11-
Andres Freund [Wed, 4 Jul 2018 21:53:30 +0000 (14:53 -0700)]
Check for interrupts inside the nbtree page deletion code.
When deleting pages the nbtree code has to walk through siblings of a
tree node. When those sibling links are corrupted that can lead to
endless loops - which are currently not interruptible. This is
especially problematic if autovacuum is repeatedly blocked on such
indexes, as it can be hard to get out of that situation without
resorting to single user mode.
Thus add interrupt checks to appropriate places in such
loops. Unfortunately in one of the cases it's it's not easy to do so.
Between 9.3 and 9.4 the page deletion (and page split) code changed
significantly. Before it was significantly less robust against
interruptions. Therefore don't backpatch to 9.3.
Author: Andres Freund
Discussion: https://postgr.es/m/20180627191629.wkunw2qbibnvlz53@alap3.anarazel.de
Backpatch: 9.4-
Improve the performance of relation deletes during recovery.
When multiple relations are deleted at the same transaction,
the files of those relations are deleted by one call to smgrdounlinkall(),
which leads to scan whole shared_buffers only one time. OTOH,
previously, during recovery, smgrdounlink() (not smgrdounlinkall()) was
called for each file to delete, which led to scan shared_buffers
multiple times. Obviously this could cause to increase the WAL replay
time very much especially when shared_buffers was huge.
To alleviate this situation, this commit changes the recovery so that
it also calls smgrdounlinkall() only one time to delete multiple
relation files.
This is just fix for oversight of commit 279628a0a7, not new feature.
So, per discussion on pgsql-hackers, we concluded to backpatch this
to all supported versions.
Author: Fujii Masao Reviewed-by: Michael Paquier, Andres Freund, Thomas Munro, Kyotaro Horiguchi, Takayuki Tsunakawa
Discussion: https://postgr.es/m/CAHGQGwHVQkdfDqtvGVkty+19cQakAydXn1etGND3X0PHbZ3+6w@mail.gmail.com
Michael Paquier [Wed, 4 Jul 2018 01:41:44 +0000 (10:41 +0900)]
Remove dead code for temporary relations in partition planning
Since recent commit 1c7c317c, temporary relations cannot be mixed with
permanent relations within the same partition tree, and the same counts
for temporary relations created by other sessions, which the planner
simply discarded. Instead be paranoid and issue an error, as those
should be blocked at definition time, at least for now.
At the same time, a test case is added to stress what has been moved
when expand_partitioned_rtentry gets called recursively but bumps on a
partitioned relation with no partitions which should be handled the same
way as the non-inheritance case. This code may be reworked in a close
future, and covering this code path will limit surprises.
Reported-by: David Rowley
Author: David Rowley Reviewed-by: Amit Langote, Robert Haas, Michael Paquier
Discussion: https://postgr.es/m/CAKJS1f_HyV1txn_4XSdH5EOhBMYaCwsXyAj6bHXk9gOu4JKsbw@mail.gmail.com
When these programs call pg_catalog.set_config, they need to check for
PGRES_TUPLES_OK instead of PGRES_COMMAND_OK. Fix for 5770172cb0c9df9e6ce27c507b449557e5b45124.