Tom Lane [Sat, 14 Jul 2018 15:59:12 +0000 (11:59 -0400)]
Fix hashjoin costing mistake introduced with inner_unique optimization.
In final_cost_hashjoin(), commit 9c7f5229a allowed inner_unique cases
to follow a code path previously used only for SEMI/ANTI joins; but it
neglected to fix an if-test within that path that assumed SEMI and ANTI
were the only possible cases. This resulted in a wrong value for
hashjointuples, and an ensuing bad cost estimate, for inner_unique normal
joins. Fortunately, for inner_unique normal joins we can assume the number
of joined tuples is the same as for a SEMI join; so there's no need for
more code, we just have to invert the test to check for ANTI not SEMI.
It turns out that in two contrib tests in which commit 9c7f5229a
changed the plan expected for a query, the change was actually wrong
and induced by this estimation error, not by any real improvement.
Hence this patch also reverts those changes.
Per report from RK Korlapati. Backpatch to v10 where the error was
introduced.
Tom Lane [Fri, 13 Jul 2018 22:45:30 +0000 (18:45 -0400)]
Fix crash in contrib/ltree's lca() function for empty input array.
lca_inner() wasn't prepared for the possibility of getting no inputs.
Fix that, and make some cosmetic improvements to the code while at it.
Also, I thought the documentation of this function as returning the
"longest common prefix" of the paths was entirely misleading; it really
returns a path one shorter than the longest common prefix, for the typical
definition of "prefix". Don't use that term in the docs, and adjust the
examples to clarify what really happens.
This has been broken since its beginning, so back-patch to all supported
branches.
Per report from Hailong Li. Thanks to Pierre Ducroquet for diagnosing
and for the initial patch, though I whacked it around some and added
test cases.
Peter Eisentraut [Fri, 13 Jul 2018 19:23:41 +0000 (21:23 +0200)]
Update documentation editor setup instructions
Now that the documentation sources are in XML rather than SGML, some of
the documentation about the editor, or more specifically Emacs, setup
needs updating. The updated instructions recommend using nxml-mode,
which works mostly out of the box, with some small tweaks in
emacs.samples and .dir-locals.el.
Also remove some obsolete stuff in .dir-locals.el. I did, however,
leave the sgml-mode settings in there so that someone using Emacs
without emacs.samples gets those settings when editing a *.sgml file.
Tom Lane [Fri, 13 Jul 2018 18:16:47 +0000 (14:16 -0400)]
Fix crash in json{b}_populate_recordset() and json{b}_to_recordset().
As of commit 37a795a60, populate_recordset_worker() tried to pass back
(as rsi.setDesc) a tupdesc that it also had cached in its fn_extra.
But the core executor would free the passed-back tupdesc, risking a
crash if the function were called again in the same query. The safest
and least invasive way to fix that is to make an extra tupdesc copy
to pass back.
While at it, I failed to resist the temptation to get rid of unnecessary
get_fn_expr_argtype() calls here and in populate_record_worker().
Per report from Dmitry Dolgov; thanks to Michael Paquier and
Andrew Gierth for investigation and discussion.
The patch that ended up as commit 3de241dba86f ("Foreign keys on
partitioned tables") lacked pg_dump tests, so the pg_dump code that was
there to support it inadvertently stopped working when in later
development I modified the backend code not to emit pg_trigger rows for
the partitioned table itself.
Bug analysis and code fix is by Michaël. I (Álvaro) added the test.
Improve performance of tuple conversion map generation
Previously convert_tuples_by_name_map naively performed a search of each
outdesc column starting at the first column in indesc and searched each
indesc column until a match was found. When partitioned tables had many
columns this could result in slow generation of the tuple conversion maps.
For INSERT and UPDATE statements that touched few rows, this could mean a
very large overhead indeed.
We can do a bit better with this loop. It's quite likely that the columns
in partitioned tables and their partitions are in the same order, so it
makes sense to start searching for each column outer column at the inner
column position 1 after where the previous match was found (per idea from
Alexander Kuzmenkov). This makes the best case search O(N) instead of
O(N^2). The worst case is still O(N^2), but it seems unlikely that would
happen.
Likewise, in the planner, make_inh_translation_list's search for the
matching column could often end up falling back on an O(N^2) type search.
This commit also improves that by first checking the column that follows
the previous match, instead of the column with the same attnum. If we
fail to match here we fallback on the syscache's hashtable lookup.
Author: David Rowley Reviewed-by: Alexander Kuzmenkov
Discussion: https://www.postgresql.org/message-id/CAKJS1f9-wijVgMdRp6_qDMEQDJJ%2BA_n%3DxzZuTmLx5Fz6cwf%2B8A%40mail.gmail.com
Tom Lane [Fri, 13 Jul 2018 15:52:50 +0000 (11:52 -0400)]
Fix inadequate buffer locking in FSM and VM page re-initialization.
When reading an existing FSM or VM page that was found to be corrupt by the
buffer manager, the code applied PageInit() to reinitialize the page, but
did so without any locking. There is thus a hazard that two backends might
concurrently do PageInit, which in itself would still be OK, but the slower
one might then zero over subsequent data changes applied by the faster one.
Even that is unlikely to be fatal; but it's not desirable, so add locking
to prevent it.
This does not add any locking overhead in the normal code path where the
page is OK. It's not immediately obvious that that's safe, but I believe
it is, for reasons explained in the added comments.
Problem noted by R P Asim. It's been like this for a long time, so
back-patch to all supported branches.
Prohibit transaction commands in security definer procedures
Starting and aborting transactions in security definer procedures
doesn't work. StartTransaction() insists that the security context
stack is empty, so this would currently cause a crash, and
AbortTransaction() resets it. This could be made to work by
reorganizing the code, but right now we just prohibit it.
Reported-by: amul sul <sulamul@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/CAAJ_b96Gupt_LFL7uNyy3c50-wbhA68NUjiK5%3DrF6_w%3Dpq_T%3DQ%40mail.gmail.com
Peter Eisentraut [Fri, 13 Jul 2018 08:01:04 +0000 (10:01 +0200)]
Remove obsolete documentation build tools for Windows
The scripts and instructions have been nonfunctional at least since
PostgreSQL 10 (commit 510074f9f0131a04322d6a3d2a51c87e6db243f9) and
nobody has stepped up to fix them. So right now just remove them until
someone wants to resurrect them.
Thomas Munro [Fri, 13 Jul 2018 04:12:03 +0000 (16:12 +1200)]
Accept invalidation messages in InitializeSessionUserId().
If the authentication method modified the system catalogs through a
separate database connection (say, to create a new role on the fly),
make sure syscache sees the changes before we try to find the user.
Author: Thomas Munro Reviewed-by: Tom Lane, Andres Freund
Discussion: https://postgr.es/m/CAEepm%3D3_h0_cgmz5PMyab4xk_OFrg6G5VCN%3DnF4chFXM9iFOqA%40mail.gmail.com
Michael Paquier [Fri, 13 Jul 2018 00:32:12 +0000 (09:32 +0900)]
Fix argument of pg_create_logical_replication_slot for slot name
All attributes and arguments using a slot name map to the data type
"name", but this function has been using "text". This is cosmetic, as
even if text is used then the slot name would be truncated to 64
characters anyway and stored as such. The documentation already said
so and the function already assumed that the argument was of this type
when fetching its value.
Michael Paquier [Thu, 12 Jul 2018 21:43:20 +0000 (06:43 +0900)]
Clean up temporary WAL segments after an instance crash
Temporary WAL segments are created in pg_wal and named as xlogtemp.pid
before being renamed to the real deal when creating a new segment. If
an instance crashes after the temporary segment is created and before
the rename is done, then the server would finish with unremovable data.
After an instance crash, scan pg_wal and remove any such segments. With
repetitive unlucky crashes this would contribute to disk bloat and
presents risks of ENOSPC especially with max_wal_size close to the
maximum allowed.
Author: Michael Paquier Reviewed-by: Yugo Nagata, Heikki Linnakangas
Discussion: https://postgr.es/m/20180514054955.GF1528@paquier.xyz
Peter Eisentraut [Thu, 12 Jul 2018 18:22:17 +0000 (20:22 +0200)]
Reset shmem_exit_inprogress after shmem_exit()
In ad9a274778d2d88c46b90309212b92ee7fdf9afe, shmem_exit_inprogress was
introduced. But we need to reset it after shmem_exit(), because unlike
the similar proc_exit(), shmem_exit() can also be called for cleanup
when the process will not exit.
Reported-by: Andrew Gierth <andrew@tao11.riddles.org.uk>
Tom Lane [Thu, 12 Jul 2018 16:28:43 +0000 (12:28 -0400)]
Doc: minor improvement in pl/pgsql FETCH/MOVE documentation.
Explain that you can use any integer expression for the "count" in
pl/pgsql's versions of FETCH/MOVE, unlike the SQL versions which only
allow a constant.
Remove the duplicate version of this para under MOVE. I don't see
a good reason to maintain two identical paras when we just said that
MOVE works exactly like FETCH.
Fix FK checks of TRUNCATE involving partitioned tables
When truncating a table that is referenced by foreign keys in
partitioned tables, the check to ensure the referencing table are also
truncated spuriously failed. This is because it was relying on
relhastriggers as a proxy for the table having FKs, and that's wrong for
partitioned tables. Fix it to consider such tables separately. There
may be a better way ... but this code is pretty inefficient already.
Author: Álvaro Herrera <alvherre@alvh.no-ip.org> Reviewed-by: Michael Paquiër <michael@paquier.xyz>
Discussion: https://postgr.es/m/20180711000624.zmeizicibxeehhsg@alvherre.pgsql
Tom Lane [Thu, 12 Jul 2018 15:10:24 +0000 (11:10 -0400)]
Doc: update documentation for requirement of ORDER BY in GROUPS mode.
Commit ff4f88916 adjusted the code to enforce the SQL spec's requirement
that a window using GROUPS mode must have an ORDER BY clause. But I missed
that the documentation explicitly said you didn't have to have one.
Also minor wordsmithing in the window-function section of select.sgml.
Per Masahiko Sawada, though I didn't use his patch.
Amit Kapila [Thu, 12 Jul 2018 07:21:39 +0000 (12:51 +0530)]
Allow using the updated tuple while moving it to a different partition.
An update that causes the tuple to be moved to a different partition was
missing out on re-constructing the to-be-updated tuple, based on the latest
tuple in the update chain. Instead, it's simply deleting the latest tuple
and inserting a new tuple in the new partition based on the old tuple.
Commit 2f17844104 didn't consider this case, so some of the updates were
getting lost.
In passing, change the argument order for output parameter in ExecDelete
and add some commentary about it.
Reported-by: Pavan Deolasee
Author: Amit Khandekar, with minor changes by me Reviewed-by: Dilip Kumar, Amit Kapila and Alvaro Herrera
Backpatch-through: 11
Discussion: https://postgr.es/m/CAJ3gD9fRbEzDqdeDq1jxqZUb47kJn+tQ7=Bcgjc8quqKsDViKQ@mail.gmail.com
Michael Paquier [Thu, 12 Jul 2018 05:28:28 +0000 (14:28 +0900)]
Rename VACOPT_NOWAIT to VACOPT_SKIP_LOCKED
When it comes to SELECT ... FOR or LOCK, NOWAIT means to not wait for
something to happen, and issue an error. SKIP LOCKED means to not wait
for something to happen but to move on without issuing an error. The
internal option of autovacuum and autoanalyze mentioned above, used only
when wraparound is not involved was named NOWAIT, but behaves like SKIP
LOCKED which is confusing.
Michael Paquier [Thu, 12 Jul 2018 04:50:17 +0000 (13:50 +0900)]
Add assertion in expand_vacuum_rel() for non-autovacuum path
The code path where the assertion is added helps to check that
autovacuum always includes a relation OID when doing a vacuum on it.
Extracted from a larger patch set to add support for SKIP LOCKED with
manual VACUUM commands.
Michael Paquier [Thu, 12 Jul 2018 01:19:35 +0000 (10:19 +0900)]
Make logical WAL sender report streaming state appropriately
WAL senders sending logically-decoded data fail to properly report in
"streaming" state when starting up, hence as long as one extra record is
not replayed, such WAL senders would remain in a "catchup" state, which
is inconsistent with the physical cousin.
This can be easily reproduced by for example using pg_recvlogical and
restarting the upstream server. The TAP tests have been slightly
modified to detect the failure and strengthened so as future tests also
make sure that a node is in streaming state when waiting for its
catchup.
Backpatch down to 9.4 where this code has been introduced.
Reported-by: Sawada Masahiko
Author: Simon Riggs, Sawada Masahiko Reviewed-by: Petr Jelinek, Michael Paquier, Vaishnavi Prabakaran
Discussion: https://postgr.es/m/CAD21AoB2ZbCCqOx=bgKMcLrAvs1V0ZMqzs7wBTuDySezTGtMZA@mail.gmail.com
Tom Lane [Wed, 11 Jul 2018 22:47:31 +0000 (18:47 -0400)]
Mark built-in btree comparison functions as leakproof where it's safe.
Generally, if the comparison operators for a datatype or pair of datatypes
are leakproof, the corresponding btree comparison support function can be
considered so as well. But we had not originally worried about marking
support functions as leakproof, reasoning that they'd not likely be used in
queries so the marking wouldn't matter. It turns out there's at least one
place where it does matter: calc_arraycontsel() finds the target datatype's
default btree comparison function and tries to use that to estimate
selectivity, but it will be blocked in some cases if the function isn't
leakproof. This leads to unnecessarily poor selectivity estimates and bad
plans, as seen in bug #15251.
Hence, run around and apply proleakproof markings where the corresponding
btree comparison operators are leakproof. (I did eyeball each function
to verify that it wasn't doing anything surprising, too.)
This isn't a full solution to bug #15251, and it's not back-patchable
because of the need for a catversion bump. A more useful response probably
is to consider whether we can check permissions on the parent table instead
of the child. However, this change will help in some cases where that
won't, and it's easy enough to do in HEAD, so let's do so.
Tom Lane [Wed, 11 Jul 2018 19:25:28 +0000 (15:25 -0400)]
Fix create_scan_plan's handling of sortgrouprefs for physical tlists.
We should only run apply_pathtarget_labeling_to_tlist if CP_LABEL_TLIST
was specified, because only in that case has use_physical_tlist checked
that the labeling will succeed; otherwise we may get an "ORDER/GROUP BY
expression not found in targetlist" error. (This subsumes the previous
test about gating_clauses, because we reset "flags" to zero earlier
if there are gating clauses to apply.)
The only known case in which a failure can occur is with a ProjectSet
path directly atop a table scan path, although it seems likely that there
are other cases or will be such in future. This means that the failure
is currently only visible in the v10 branch: 9.6 didn't have ProjectSet,
while in v11 and HEAD, apply_scanjoin_target_to_paths for some weird
reason is using create_projection_path not apply_projection_to_path,
masking the problem because there's a ProjectionPath in between.
Nonetheless this code is clearly wrong on its own terms, so back-patch
to 9.6 where this logic was introduced.
Tom Lane [Wed, 11 Jul 2018 16:07:20 +0000 (12:07 -0400)]
Fix bugs with degenerate window ORDER BY clauses in GROUPS/RANGE mode.
nodeWindowAgg.c failed to cope with the possibility that no ordering
columns are defined in the window frame for GROUPS mode or RANGE OFFSET
mode, leading to assertion failures or odd errors, as reported by Masahiko
Sawada and Lukas Eder. In RANGE OFFSET mode, an ordering column is really
required, so add an Assert about that. In GROUPS mode, the code would
work, except that the node initialization code wasn't in sync with the
execution code about when to set up tuplestore read pointers and spare
slots. Fix the latter for consistency's sake (even though I think the
changes described below make the out-of-sync cases unreachable for now).
Per SQL spec, a single ordering column is required for RANGE OFFSET mode,
and at least one ordering column is required for GROUPS mode. The parser
enforced the former but not the latter; add a check for that.
We were able to reach the no-ordering-column cases even with fully spec
compliant queries, though, because the planner would drop partitioning
and ordering columns from the generated plan if they were redundant with
earlier columns according to the redundant-pathkey logic, for instance
"PARTITION BY x ORDER BY y" in the presence of a "WHERE x=y" qual.
While in principle that's an optimization that could save some pointless
comparisons at runtime, it seems unlikely to be meaningful in the real
world. I think this behavior was not so much an intentional optimization
as a side-effect of an ancient decision to construct the plan node's
ordering-column info by reverse-engineering the PathKeys of the input
path. If we give up redundant-column removal then it takes very little
code to generate the plan node info directly from the WindowClause,
ensuring that we have the expected number of ordering columns in all
cases. (If anyone does complain about this, the planner could perhaps
be taught to remove redundant columns only when it's safe to do so,
ie *not* in RANGE OFFSET mode. But I doubt anyone ever will.)
With these changes, the WindowAggPath.winpathkeys field is not used for
anything anymore, so remove it.
The test cases added here are not actually very interesting given the
removal of the redundant-column-removal logic, but they would represent
important corner cases if anyone ever tries to put that back.
Tom Lane and Masahiko Sawada. Back-patch to v11 where RANGE OFFSET
and GROUPS modes were added.
Rethink how to get float.h in old Windows API for isnan/isinf
We include <float.h> in every place that needs isnan(), because MSVC
used to require it. However, since MSVC 2013 that's no longer necessary
(cf. commit cec8394b5ccd), so we can retire the inclusion to a
version-specific stanza in win32_port.h, where it doesn't need to
pollute random .c files. The header is of course still needed in a few
places for other reasons.
I (Álvaro) removed float.h from a few more files than in Emre's original
patch. This doesn't break the build in my system, but we'll see what
the buildfarm has to say about it all.
Header comment of shm_mq.c was mistakenly specifying path to shm_mq.h.
It was introduced in ec9037df. So, theoretically it could be
backpatched to 9.4, but it doesn't seem to worth it.
Thomas Munro [Wed, 11 Jul 2018 00:40:58 +0000 (12:40 +1200)]
Use signals for postmaster death on Linux.
Linux provides a way to ask for a signal when your parent process dies.
Use that to make PostmasterIsAlive() very cheap.
Based on a suggestion from Andres Freund.
Author: Thomas Munro, Heikki Linnakangas Reviewed-By: Michael Paquier
Discussion: https://postgr.es/m/7261eb39-0369-f2f4-1bb5-62f3b6083b5e%40iki.fi
Discussion: https://postgr.es/m/20180411002643.6buofht4ranhei7k%40alap3.anarazel.de
Michael Paquier [Tue, 10 Jul 2018 23:56:24 +0000 (08:56 +0900)]
Block replication slot advance for these not yet reserving WAL
Such replication slots are physical slots freshly created without WAL
being reserved, which is the default behavior, which have not been used
yet as WAL consumption resources to retain WAL. This prevents advancing
a slot to a position older than any WAL available, which could falsify
calculations for WAL segment recycling.
This also cleans up a bit the code, as ReplicationSlotRelease() would be
called on ERROR, and improves error messages.
Reported-by: Kyotaro Horiguchi
Author: Michael Paquier Reviewed-by: Andres Freund, Álvaro Herrera, Kyotaro Horiguchi
Discussion: https://postgr.es/m/20180626071305.GH31353@paquier.xyz
We fail to handle polymorphic types properly when they are used as
partition keys: we were unnecessarily adding a RelabelType node on top,
which confuses code examining the nodes. In particular, this makes
predtest.c-based partition pruning not to work, and ruleutils.c to emit
expressions that are uglier than needed. Fix it by not adding RelabelType
when not needed.
In master/11 the new pruning code is separate so it doesn't suffer from
this problem, since we already fixed it (in essentially the same way) in e5dcbb88a15d, which also added a few tests; back-patch those tests to
pg10 also. But since UPDATE/DELETE still uses predtest.c in pg11, this
change improves partitioning for those cases too. Add tests for this.
The ruleutils.c behavior change is relevant in pg11/master too.
Co-authored-by: Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> Co-authored-by: Álvaro Herrera <alvherre@alvh.no-ip.org> Reviewed-by: Álvaro Herrera <alvherre@alvh.no-ip.org> Reviewed-by: Robert Haas <robertmhaas@gmail.com>
Discussion: https://postgr.es/m/54745d13-7ed4-54ac-97d8-ea1eec95ae25@lab.ntt.co.jp
Peter Eisentraut [Tue, 10 Jul 2018 15:37:42 +0000 (17:37 +0200)]
Remove dynamic_shared_memory_type=none
PostgreSQL nowadays offers some kind of dynamic shared memory feature on
all supported platforms. Having the choice of "none" prevents us from
relying on DSM in core features. So this patch removes the choice of
"none".
Add test case for EEOP_INNER_SYSVAR/EEOP_OUTER_SYSVAR executor opcodes.
The EEOP_INNER_SYSVAR and EEOP_OUTER_SYSVAR executor opcodes are not
exercised by normal queries, because setrefs.c will resolve the references
to system columns in the scan nodes already. Join nodes refer to them by
their position in the child node's target list, like user columns.
The only place where those opcodes are used, is in evaluating a trigger's
WHEN condition that references system columns. Trigger evaluation abuses
the INNER/OUTER Vars to refer to the OLD and NEW tuples. The code to handle
the opcodes is pretty straightforward, but it seems like a good idea to
have some test coverage for them, anyway, so that they don't get removed or
broken by accident.
Author: Ashutosh Bapat, with some changes by me.
Discussion: https://www.postgresql.org/message-id/CAFjFpRerUFX=T0nSnCoroXAJMoo-xah9J+pi7+xDUx86PtQmew@mail.gmail.com
Michael Paquier [Mon, 9 Jul 2018 23:51:10 +0000 (08:51 +0900)]
Add pg_rewind --no-sync
This is an option consistent with what pg_dump and pg_basebackup provide
which is useful for leveraging the I/O effort when testing things, not
to be used in a production environment.
Author: Michael Paquier Reviewed-by: Heikki Linnakangas
Discussion: https://postgr.es/m/20180325122607.GB3707@paquier.xyz
Michael Paquier [Mon, 9 Jul 2018 23:39:27 +0000 (08:39 +0900)]
Simplify logic to sync target directory at the end of pg_rewind
The previous sync logic relied on looking for and then launching
externally initdb -S, which is a simple wrapper on top of fsync_pgdata.
There is nothing preventing pg_rewind to directly call this routine, so
remove the dependency to initdb and just call it directly.
Author: Michael Paquier Reviewed-by: Heikki Linnakangas
Discussion: https://postgr.es/m/20180325122607.GB3707@paquier.xyz
Tom Lane [Mon, 9 Jul 2018 23:26:19 +0000 (19:26 -0400)]
Avoid emitting a bogus WAL record when recycling an all-zero btree page.
Commit fafa374f2 caused _bt_getbuf() to possibly emit a WAL record for
a page that it was about to recycle. However, it failed to distinguish
all-zero pages from dead pages, which is important because only the
latter have valid btpo.xact values, or indeed any special space at all.
Recycling an all-zero page with XLogStandbyInfoActive() enabled therefore
led to an Assert failure, or to emission of a WAL record containing a
bogus cutoff XID, which might lead to unnecessary query cancellations
on hot standby servers.
Per reports from Antonin Houska and 自己. Amit Kapila was first to
propose this fix, and Robert Haas, myself, and Kyotaro Horiguchi
reviewed it at various times.
This is an old bug, so back-patch to all supported branches.
Tom Lane [Mon, 9 Jul 2018 18:28:04 +0000 (14:28 -0400)]
Fix yet more problems with incorrectly-constructed zero-length arrays.
Commit 716ea626a attempted to fix the problem of building 1-D zero-size
arrays once and for all. But it turns out that contrib/intarray has some
code that doesn't use construct_array() but just builds arrays by hand,
so it didn't get the memo. This appears to affect all of subarray(),
intset_subtract(), inner_int_union(), inner_int_inter(), and
intarray_concat_arrays().
Back-patch into v11. In the past we've not back-patched this type of
change, but since v11 is still in beta it seems all right to include
this fix in it. Besides it's more consistent to make the fix in v11
where 716ea626a appeared.
Report and patch by Alexey Kryuchkov, some cosmetic adjustments by me
Michael Paquier [Mon, 9 Jul 2018 01:22:34 +0000 (10:22 +0900)]
Rework order of end-of-recovery actions to delay timeline history write
A critical failure in some of the end-of-recovery actions before the
end-of-recovery record is written can cause PostgreSQL to react
inconsistently with the rest of the cluster in the event of a crash
before the final record is written. Two such failures are for example
an error while processing a two-phase state files or when operating on
recovery.conf. With this commit, the failures are still considered
FATAL, but the write of the timeline history file is delayed as much as
possible so as the window between the moment the file is written and the
end-of-recovery record is generated gets minimized. This way, in the
event of a crash or a failure, the new timeline decided at promotion
will not seem taken by other nodes in the cluster. It is not really
possible to reduce to zero this window, hence one could still see
failures if a crash happens between the history file write and the
end-of-recovery record, so any future code should be careful when
adding new end-of-recovery actions. The original report from Magnus
Hagander mentioned a renamed recovery.conf as original end-of-recovery
failure which caused a timeline to be seen as taken but the subsequent
processing on the now-missing recovery.conf cause the startup process to
issue stop on FATAL, which at follow-up startup made the system
inconsistent because of on-disk changes which already happened.
Processing of two-phase state files still needs some work as corrupted
entries are simply ignored now. This is left as a future item and this
commit fixes the original complain.
Reported-by: Magnus Hagander
Author: Heikki Linnakangas Reviewed-by: Alexander Korotkov, Michael Paquier, David Steele
Discussion: https://postgr.es/m/CABUevEz09XY2EevA2dLjPCY-C5UO4Hq=XxmXLmF6ipNFecbShQ@mail.gmail.com
Jeff Davis [Sun, 8 Jul 2018 07:14:51 +0000 (00:14 -0700)]
Fix WITH CHECK OPTION on views referencing postgres_fdw tables.
If a view references a foreign table, and the foreign table has a
BEFORE INSERT trigger, then it's possible for a tuple inserted or
updated through the view to be changed such that it violates the
view's WITH CHECK OPTION constraint.
Before this commit, postgres_fdw handled this case inconsistently. A
RETURNING clause on the INSERT or UPDATE statement targeting the view
would cause the finally-inserted tuple to be read back, and the WITH
CHECK OPTION violation would throw an error. But without a RETURNING
clause, postgres_fdw would not read the final tuple back, and WITH
CHECK OPTION would not throw an error for the violation (or may throw
an error when there is no real violation). AFTER ROW triggers on the
foreign table had a similar effect as a RETURNING clause on the INSERT
or UPDATE statement.
To fix, this commit retrieves the attributes needed to enforce the
WITH CHECK OPTION constraint along with the attributes needed for the
RETURNING clause (if any) from the remote side. Thus, the WITH CHECK
OPTION constraint is always evaluated against the final tuple after
any triggers on the remote side.
This fix may be considered inconsistent with CHECK constraints
declared on foreign tables, which are not enforced locally at all
(because the constraint is on a remote object). The discussion
concluded that this difference is reasonable, because the WITH CHECK
OPTION is a constraint on the local view (not any remote object);
therefore it only makes sense to enforce its WITH CHECK OPTION
constraint locally.
Author: Etsuro Fujita Reviewed-by: Arthur Zakirov, Stephen Frost
Discussion: https://www.postgresql.org/message-id/7eb58fab-fd3b-781b-ac33-f7cfec96021f%40lab.ntt.co.jp
Peter Geoghegan [Sun, 8 Jul 2018 17:50:13 +0000 (10:50 -0700)]
Correct obsolete unique index insertion comment.
Commit bc292937ae6 failed to update a comment about unique index
checking. _bt_insertonpg() is no longer responsible for finding an
insertion location while preventing conflicting insertions.
Michael Paquier [Sun, 8 Jul 2018 09:53:20 +0000 (18:53 +0900)]
Use access() to check file existence in GetNewRelFileNode()
Previous code used BasicOpenFile() and close() just to check for a file
collision, while there is no need to hold open a file descriptor but
that's an overkill here.
Author: Paul Guo Reviewed-by: Peter Eisentraut, Michael Paquier
Discussion: https://postgr.es/m/CABQrizcUtiHaquxK=d4etBX8GF9kbZB50Nt1gO9_aN-e9SptyQ@mail.gmail.com
Add separate error message for procedure does not exist
While we probably don't want to split up all error messages into
function and procedure variants, this one is a very prominent one, so
it's helpful to be more specific here.
Michael Paquier [Fri, 6 Jul 2018 23:10:10 +0000 (08:10 +0900)]
Add note in pg_rewind documentation about read-only files
When performing pg_rewind, the presence of a read-only file which is not
accessible for writes will cause a failure while processing. This can
cause the control file of the target data folder to be truncated,
causing it to not be reusable with a successive run.
Also, when pg_rewind fails mid-flight, there is likely no way to be able
to recover the target data folder anyway, in which case a new base
backup is the best option. A note is added in the documentation as
well about.
Reported-by: Christian H.
Author: Michael Paquier Reviewed-by: Andrew Dunstan
Discussion: https://postgr.es/m/20180104200633.17004.16377%40wrigleys.postgresql.org
Peter Eisentraut [Fri, 29 Jun 2018 11:28:39 +0000 (13:28 +0200)]
Fix assert in nested SQL procedure call
When executing CALL in PL/pgSQL, we need to set a snapshot before
invoking the to-be-called procedure. Otherwise, the to-be-called
procedure might end up running without a snapshot. For LANGUAGE SQL
procedures, this would result in an assertion failure. (For most other
languages, this is usually not a problem, because those use SPI and SPI
sets snapshots in most cases.) Setting the snapshot restores the
behavior of how CALL worked when it was handled as a generic SQL
statement in PL/pgSQL (exec_stmt_execsql()).
This change revealed another problem: In SPI_commit(), we popped the
active snapshot before committing the transaction, to avoid "snapshot %p
still active" errors. However, there is no particular reason why only
at most one snapshot should be on the stack. So change this to pop all
active snapshots instead of only one.
Allow replication slots to be dropped in single-user mode
Starting with commit 9915de6c1cb2, replication slot drop uses a
condition variable sleep to wait until the current user of the slot goes
away. This is more user friendly than the previous behavior of erroring
out if the slot is in use, but it fails with a not-for-user-consumption
error message in single-user mode; plus, if you're using single-user
mode because you don't want to start the server in the regular mode
(say, disk is full and WAL won't recycle because of the slot), it's
inconvenient.
Fix by skipping the cond variable sleep in single-user mode, since
there can't be anybody to wait for anyway.
logical decoding: beware of an unset specinsert change
Coverity complains that there is no protection in the code (at least in
non-assertion-enabled builds) against speculative insertion failing to
follow the expected protocol. Add an elog(ERROR) for the case.
doc: Reword old inheritance partitioning documentation
Prefer to use phrases like "child" instead of "partition" when
describing the legacy inheritance-based partitioning. The word
"partition" now has a fixed meaning for the built-in partitioning, so
keeping it out of the documentation of the old method makes things
clearer.
Reduce cost of test_decoding's new oldest_xmin test
Change a whole-database VACUUM into doing just pg_attribute, which is
the portion that verifies what we want it to do. The original
formulation wastes a lot of CPU time, which leads the test to fail when
runtime exceeds isolationtester timeout when it's super-slow, such as
under CLOBBER_CACHE_ALWAYS. Per buildfarm member friarbird.
It turns out that the previous shape of the test doesn't always detect
the condition it is supposed to detect (on unpatched reorderbuffer
code): the reason is that there is a good chance of encountering a
xl_running_xacts record (logged every 15 seconds) before the checkpoint
-- and because we advance the xmin when we receive that WAL record, and
we *don't* advance the xmin twice consecutively without receiving a
client message in between, that means the xmin is not advanced enough
for the tuple to be pruned from pg_attribute by VACUUM. So the test
would spuriously pass.
The reason this test deficiency wasn't detected earlier is that HOT
pruning removes the tuple anyway, even if vacuum leaves it in place, so
the test correctly fails (detecting the coding mistake), but for the
wrong reason.
To fix this mess, run the s0_get_changes step twice before vacuum
instead of once: this seems to cause the xmin to be advanced reliably,
wreaking havoc with more certainty.
Michael Paquier [Thu, 5 Jul 2018 01:46:18 +0000 (10:46 +0900)]
Prevent references to invalid relation pages after fresh promotion
If a standby crashes after promotion before having completed its first
post-recovery checkpoint, then the minimal recovery point which marks
the LSN position where the cluster is able to reach consistency may be
set to a position older than the first end-of-recovery checkpoint while
all the WAL available should be replayed. This leads to the instance
thinking that it contains inconsistent pages, causing a PANIC and a hard
instance crash even if all the WAL available has not been replayed for
certain sets of records replayed. When in crash recovery,
minRecoveryPoint is expected to always be set to InvalidXLogRecPtr,
which forces the recovery to replay all the WAL available, so this
commit makes sure that the local copy of minRecoveryPoint from the
control file is initialized properly and stays as it is while crash
recovery is performed. Once switching to archive recovery or if crash
recovery finishes, then the local copy minRecoveryPoint can be safely
updated.
Pavan Deolasee has reported and diagnosed the failure in the first
place, and the base fix idea to rely on the local copy of
minRecoveryPoint comes from Kyotaro Horiguchi, which has been expanded
into a full-fledged patch by me. The test included in this commit has
been written by Álvaro Herrera and Pavan Deolasee, which I have modified
to make it faster and more reliable with sleep phases.
Backpatch down to all supported versions where the bug appears, aka 9.3
which is where the end-of-recovery checkpoint is not run by the startup
process anymore. The test gets easily supported down to 10, still it
has been tested on all branches.
Andres Freund [Thu, 5 Jul 2018 00:36:01 +0000 (17:36 -0700)]
Use context with correct lifetime in hypothetical_dense_rank_final.
The query lifetime expression context created in
hypothetical_dense_rank_final() was buggily allocated in the calling
memory context. I (Andres) broke that in bf6c614a2f2.
Reported-By: Rajkumar Raghuwanshi
Author: Amit Langote
Discussion: https://postgr.es/m/CAKcux6kmzWmur5HhA_aU6gYVFu0RLQdgJJ+aC9SLdcOvBSrpfA@mail.gmail.com
Backpatch: 11-
Andres Freund [Wed, 4 Jul 2018 21:53:30 +0000 (14:53 -0700)]
Check for interrupts inside the nbtree page deletion code.
When deleting pages the nbtree code has to walk through siblings of a
tree node. When those sibling links are corrupted that can lead to
endless loops - which are currently not interruptible. This is
especially problematic if autovacuum is repeatedly blocked on such
indexes, as it can be hard to get out of that situation without
resorting to single user mode.
Thus add interrupt checks to appropriate places in such
loops. Unfortunately in one of the cases it's it's not easy to do so.
Between 9.3 and 9.4 the page deletion (and page split) code changed
significantly. Before it was significantly less robust against
interruptions. Therefore don't backpatch to 9.3.
Author: Andres Freund
Discussion: https://postgr.es/m/20180627191629.wkunw2qbibnvlz53@alap3.anarazel.de
Backpatch: 9.4-
Improve the performance of relation deletes during recovery.
When multiple relations are deleted at the same transaction,
the files of those relations are deleted by one call to smgrdounlinkall(),
which leads to scan whole shared_buffers only one time. OTOH,
previously, during recovery, smgrdounlink() (not smgrdounlinkall()) was
called for each file to delete, which led to scan shared_buffers
multiple times. Obviously this could cause to increase the WAL replay
time very much especially when shared_buffers was huge.
To alleviate this situation, this commit changes the recovery so that
it also calls smgrdounlinkall() only one time to delete multiple
relation files.
This is just fix for oversight of commit 279628a0a7, not new feature.
So, per discussion on pgsql-hackers, we concluded to backpatch this
to all supported versions.
Author: Fujii Masao Reviewed-by: Michael Paquier, Andres Freund, Thomas Munro, Kyotaro Horiguchi, Takayuki Tsunakawa
Discussion: https://postgr.es/m/CAHGQGwHVQkdfDqtvGVkty+19cQakAydXn1etGND3X0PHbZ3+6w@mail.gmail.com
Michael Paquier [Wed, 4 Jul 2018 01:37:40 +0000 (10:37 +0900)]
Remove dead code for temporary relations in partition planning
Since recent commit 1c7c317c, temporary relations cannot be mixed with
permanent relations within the same partition tree, and the same counts
for temporary relations created by other sessions, which the planner
simply discarded. Instead be paranoid and issue an error, as those
should be blocked at definition time, at least for now.
At the same time, a test case is added to stress what has been moved
when expand_partitioned_rtentry gets called recursively but bumps on a
partitioned relation with no partitions which should be handled the same
way as the non-inheritance case. This code may be reworked in a close
future, and covering this code path will limit surprises.
Reported-by: David Rowley
Author: David Rowley Reviewed-by: Amit Langote, Robert Haas, Michael Paquier
Discussion: https://postgr.es/m/CAKJS1f_HyV1txn_4XSdH5EOhBMYaCwsXyAj6bHXk9gOu4JKsbw@mail.gmail.com
Michael Paquier [Mon, 2 Jul 2018 13:19:46 +0000 (22:19 +0900)]
Add wait event for fsync of WAL segments
This has been visibly a forgotten spot in the first implementation of
wait events for I/O added by 249cf07, and what has been missing is a
fsync call for WAL segments which is a wrapper reacting on the value of
GUC wal_sync_method.
Reported-by: Konstantin Knizhnik
Author: Konstantin Knizhnik Reviewed-by: Craig Ringer, Michael Paquier
Discussion: https://postgr.es/m/4a243897-0ad8-f471-aa40-242591f2476e@postgrespro.ru
When these programs call pg_catalog.set_config, they need to check for
PGRES_TUPLES_OK instead of PGRES_COMMAND_OK. Fix for 5770172cb0c9df9e6ce27c507b449557e5b45124.
Michael Paquier [Sun, 1 Jul 2018 11:20:06 +0000 (20:20 +0900)]
Add tests for inheritance trees mixing permanent and temporary relations
While working on 1c7c317c and related things, which has clarified the
use of partitions with temporary tables, I have noticed that there could
be better coverage for inheritance trees mixing temporary and permanent
relations. A lot of cross-checks happen in MergeAttributes() which is
not designed for this purpose, so the tests added in this commit will
make sure that any kind of future refactoring will limit the amount of
compatibility breakage.
Author: Michael Paquier Reviewed-by: Ashutosh Bapat
Discussion: https://postgr.es/m/20180619022131.GE3314@paquier.xyz
Peter Eisentraut [Tue, 22 May 2018 18:25:01 +0000 (14:25 -0400)]
Use $Test::Builder::Level in TAP test functions
In TAP test functions, that is, those that produce test results, locally
increment $Test::Builder::Level. This has the effect that test failures
are reported at the callers location rather than somewhere in the test
support libraries.
Michael Paquier [Sun, 1 Jul 2018 06:10:10 +0000 (15:10 +0900)]
Use optimized bitmap set function for membership test in postgres_fdw
Deparsing logic in postgres_fdw for locking, FROM clause (alias) and Var
(column qualification) does not need to know the exact number of members
involved, which can be calculated with bms_num_members(), but just if
there is more than one relation involved, which is what bms_membership()
does. The latter is more performant than the former so this shaves a
couple of cycles.
Author: Daniel Gustafsson Reviewed-by: Ashutosh Bapat, Nathan Bossart
Discussion: https://postgr.es/m/C73594E0-2B67-4E10-BB35-CDE0E41CC384@yesql.se
Alvaro Herrera [Fri, 29 Jun 2018 15:40:36 +0000 (11:40 -0400)]
psql: show cloned triggers in partitions
In a partition, row triggers that had been cloned from their parent
partitioned table would not be listed at all in psql's \d, which could
surprise users, per insistent complaint from Ashutosh Bapat (though his
aim was elsewhere). The simplest possible fix, suggested by Peter
Eisentraut, seems to be to list triggers marked as internal if they have
a row in pg_depend that points to some other trigger.
Alvaro Herrera [Fri, 29 Jun 2018 15:27:57 +0000 (11:27 -0400)]
Fix crash when ALTER TABLE recreates indexes on partitions
The skip_build flag was not being passed correctly when recursing to
indexes on partitions, leading to attempts to rebuild indexes when they
were not yet ready to be rebuilt.
Michael Paquier [Fri, 29 Jun 2018 13:02:20 +0000 (22:02 +0900)]
Replace search.cpan.org with metacpan.org
search.cpan.org has been EOL'd, with metacpan.org being the official
replacement to which URLs now redirect. Update links to match the new
URL. Also update links to CPAN to use https as it will redirect from
http.
Author: Daniel Gustafsson
Discussion: https://postgr.es/m/B74C0219-6BA9-46E1-A524-5B9E8CD3BDB3@yesql.se
Andres Freund [Wed, 27 Jun 2018 06:40:32 +0000 (23:40 -0700)]
Change pqformat.h's integer handling functions to take unsigned integers.
As added in 1de09ad8eb1fa673ee7899d6dfbb2b49ba204818 the new functions
all accept signed integers as parameters. That's not great, because
it's perfectly reasonable to call them with unsigned parameters.
Unfortunately unsigned to signed conversion is not well defined, when
exceeding the range of the signed value. That's presently not a
practical issue in postgres (among other reasons because we force
gcc's hand with -fwrapv). But it's clearly not quite right.
Thus change the signatures to accept unsigned integers instead, signed
to unsigned conversion is always well defined. Also change the
backward compat pq_sendint() - while it's deprecated it seems better
to be consistent.
Per discussion between Andrew Gierth, Michael Paquier, Alvaro Herrera
and Tom Lane.
Reported-By: Andrew Gierth
Author: Andres Freund
Discussion: https://postgr.es/m/87r2m10zm2.fsf@news-spur.riddles.org.uk
Amit Kapila [Wed, 27 Jun 2018 02:46:13 +0000 (08:16 +0530)]
Cosmetic improvements for faster column addition.
Changed the name of few structure members for the sake of clarity and
removed spurious whitespace.
Reported-by: Amit Kapila
Author: Amit Kapila, based on suggestion by Andrew Dunstan Reviewed-by: Alvaro Herrera
Discussion: https://postgr.es/m/CAA4eK1K2znsFpC+NQ9A4vxT4uDxADN4RmvHX0L6Y=aHVo9gB4Q@mail.gmail.com
Alvaro Herrera [Tue, 26 Jun 2018 20:38:34 +0000 (16:38 -0400)]
Fix "base" snapshot handling in logical decoding
Two closely related bugs are fixed. First, xmin of logical slots was
advanced too early. During xl_running_xacts processing, xmin of the
slot was set to the oldest running xid in the record, but that's wrong:
actually, snapshots which will be used for not-yet-replayed transactions
might consider older txns as running too, so we need to keep xmin back
for them. The problem wasn't noticed earlier because DDL which allows
to delete tuple (set xmax) while some another not-yet-committed
transaction looks at it is pretty rare, if not unique: e.g. all forms of
ALTER TABLE which change schema acquire ACCESS EXCLUSIVE lock
conflicting with any inserts. The included test case (test_decoding's
oldest_xmin) uses ALTER of a composite type, which doesn't have such
interlocking.
To deal with this, we must be able to quickly retrieve oldest xmin
(oldest running xid among all assigned snapshots) from ReorderBuffer. To
fix, add another list of ReorderBufferTXNs to the reorderbuffer, where
transactions are sorted by base-snapshot-LSN. This is slightly
different from the existing (sorted by first-LSN) list, because a
transaction can have an earlier LSN but a later Xmin, if its first
record does not obtain an xmin (eg. xl_xact_assignment). Note this new
list doesn't fully replace the existing txn list: we still need that one
to prevent WAL recycling.
The second issue concerns SnapBuilder snapshots and subtransactions.
SnapBuildDistributeNewCatalogSnapshot never assigned a snapshot to a
transaction that is known to be a subtxn, which is good in the common
case that the top-level transaction already has one (no point in doing
so), but a bug otherwise. To fix, arrange to transfer the snapshot from
the subtxn to its top-level txn as soon as the kinship gets known.
test_decoding's snapshot_transfer verifies this.
Also, fix a minor memory leak: refcount of toplevel's old base snapshot
was not decremented when the snapshot is transferred from child.
Liberally sprinkle code comments, and rewrite a few existing ones. This
part is my (Álvaro's) contribution to this commit, as I had to write all
those comments in order to understand the existing code and Arseny's
patch.
Fix upper limit for vacuum_cleanup_index_scale_factor
6ca33a88 sets upper limit for vacuum_cleanup_index_scale_factor to
DBL_MAX. DBL_MAX appears to be platform-dependent. That causes
many buildfarm animals to fail, because we check boundaries of
vacuum_cleanup_index_scale_factor in regression tests.
This commit changes upper limit from DBL_MAX to just "large enough"
limit, which was arbitrary selected as 1e10.
Author: Alexander Korotkov Reported-by: Tom Lane, Darafei Praliaskouski
Discussion: https://postgr.es/m/CAPpHfdvewmr4PcpRjrkstoNn1n2_6dL-iHRB21CCfZ0efZdBTg%40mail.gmail.com
Discussion: https://postgr.es/m/CAC8Q8tLYFOpKNaPS_E7V8KtPdE%3D_TnAn16t%3DA3LuL%3DXjfOO-BQ%40mail.gmail.com
Peter Geoghegan [Tue, 26 Jun 2018 18:16:20 +0000 (11:16 -0700)]
Correct a comment on logtape.c's leader tape.
randomAccess parallel tuplesorts are disallowed because the leader would
try to write to its own leader tape, not because the leader would try to
write to a worker tape directly.