granicus.if.org Git - postgresql/log

pg_basebackup pg_receivexlog: Issue fsync more carefully

Several places weren't careful about fsyncing in the way. See 1d4a0ab1
and 606e0f98 for details about required fsyncs.

This adds a couple of functions in src/common/ that have an equivalent
in the backend: durable_rename(), fsync_parent_path()

From: Michael Paquier <michael.paquier@gmail.com>

Move fsync routines of initdb into src/common/

The intention is to used those in other utilities such as pg_basebackup
and pg_receivexlog.

From: Michael Paquier <michael.paquier@gmail.com>

Don't bother to lock bufmgr partitions in pg_buffercache.

That makes the view a lot less disruptive to use on a production system.
Without the locks, you don't get a consistent snapshot across all buffers,
but that's OK. It wasn't a very useful guarantee in practice.

Ivan Kartyshov, reviewed by Tomas Vondra and Robert Haas.

Discusssion: <f9d6cab2-73a7-7a84-55a8-07dcb8516ae5@postgrespro.ru>

Exclude additional directories in pg_basebackup

The list of files and directories that pg_basebackup excludes from the
backup was somewhat incomplete and unorganized.  Change that with having
the exclusion driven from tables.  Clean up some code around it.  Also
document the exclusions in more detail so that users of pg_start_backup
can make use of it as well.

The contents of these directories are now excluded from the backup:
pg_dynshmem, pg_notify, pg_serial, pg_snapshots, pg_subtrans

Also fix a bug that a pg_repl_slot or pg_stat_tmp being a symlink would
cause a corrupt tar header to be created.  Now such symlinks are
included in the backup as empty directories.  Bug found by Ashutosh
Sharma <ashu.coek88@gmail.com>.

From: David Steele <david@pgmasters.net>
Reviewed-by: Michael Paquier <michael.paquier@gmail.com>

Silence compiler warnings

Reported by Peter Eisentraut. Coding suggested by Tom Lane.

Rationalize format-picture caching logic in formatting.c.

Add a validity flag to DCHCacheEntry and NUMCacheEntry entries, and
do not set it true until after we've parsed the supplied format string.
This allows dealing with possible errors while parsing the format
without the baroque hack that was there before (which only covered
errors within NUMDesc_prepare, anyway).  We can get rid of the PG_TRY in
NUMDesc_prepare, as well as last_NUMCacheEntry and NUM_cache_remove.
(Essentially, this reverts commit ff783fbae in favor of a less fragile
solution; the problems with that approach are well illustrated by later
hacking such as 55f927a46.)

In passing, define the size of these caches as DCH_CACHE_ENTRIES not
DCH_CACHE_FIELDS + 1 (whoever thought that was a good definition?)
and likewise for the NUM cache.  Also const-ify format string parameters
where convenient, and merge duplicated cache lookup logic.

This is primarily driven by a proposed patch from Artur Zakirov,
which introduced some ereport's into format string parsing for
the datetime case.  He proposed preventing the creation of invalid
cache entries by parsing the format string first into a local-variable
array, and then copying that to a cache entry.  That seemed a bit
ugly to me, and anyway randomly different from the way the identical
problem had been solved for the numeric case.  Let's make the two
sets of code more similar not less so.

I'm not sure whether we'll adopt the new error conditions Artur proposes,
but this patch seems like good code cleanup and future-proofing in any
case.  The existing code is critically (and undocumented-ly) dependent on
no elog being thrown out of several nontrivial functions, which is trouble
waiting to happen, though it doesn't seem to be actively broken today.

Discussion: <b2a39359-3282-b402-f4a3-057aae500ee7@postgrespro.ru>

Make to_timestamp() and to_date() range-check fields of their input.

Historically, something like to_date('2009-06-40','YYYY-MM-DD') would
return '2009-07-10' because there was no prohibition on out-of-range
month or day numbers.  This has been widely panned, and it also turns
out that Oracle throws an error in such cases.  Since these functions
are nominally Oracle-compatibility features, let's change that.

There's no particular restriction on year (modulo the fact that the
scanner may not believe that more than 4 digits are year digits,
a matter to be addressed separately if at all).  But we now check month,
day, hour, minute, second, and fractional-second fields, as well as
day-of-year and second-of-day fields if those are used.

Currently, no checks are made on ISO-8601-style week numbers or day
numbers; it's not very clear what the appropriate rules would be there,
and they're probably so little used that it's not worth sweating over.

Artur Zakirov, reviewed by Amul Sul, further adjustments by me

Discussion: <1873520224.1784572.1465833145330.JavaMail.yahoo@mail.yahoo.com>
See-Also: <57786490.9010201@wars-nicht.de>

Remove dead line of code

worker_spi: Call pgstat_report_stat.

Without this, statistics changes accumulated by the worker never get
reported to the stats collector, which is bad.

Julien Rouhaud

Fix CRC check handling in get_controlfile

The previous patch broke this by returning NULL for a failed CRC check,
which pg_controldata would then try to read. Fix by returning the
result of the CRC check in a separate argument.

Michael Paquier and myself

Fix dangling pointer problem in ReorderBufferSerializeChange.

Commit 3fe3511d05127cc024b221040db2eeb352e7d716 introduced a new
case into this function, but neglected to ensure that the "ondisk"
pointer got updated after a possible reallocation as the code does
in other cases.

Stas Kelvich, per diagnosis by Konstantin Knizhnik.

Turn password_encryption GUC into an enum.

This makes the parameter easier to extend, to support other password-based
authentication protocols than MD5. (SCRAM is being worked on.)

The GUC still accepts on/off as aliases for "md5" and "plain", although
we may want to remove those once we actually add support for another
password hash type.

Michael Paquier, reviewed by David Steele, with some further edits by me.

Discussion: <CAB7nPqSMXU35g=W9X74HVeQp0uvgJxvYOuA4A-A3M+0wfEBv-w@mail.gmail.com>

Disallow pushing volatile quals past set-returning functions.

Pushing an upper-level restriction clause into an unflattened
subquery-in-FROM is okay when the subquery contains no SRFs in its
targetlist, or when it does but the SRFs are unreferenced by the clause
*and the clause is not volatile*.  Otherwise, we're changing the number
of times the clause is evaluated, which is bad for volatile quals, and
possibly changing the result, since a volatile qual might succeed for some
SRF output rows and not others despite not referencing any of the changing
columns.  (Indeed, if the clause is something like "random() > 0.5", the
user is probably expecting exactly that behavior.)

We had most of these restrictions down, but not the one about the upper
clause not being volatile.  Fix that, and add a regression test to
illustrate the expected behavior.

Although this is definitely a bug, it doesn't seem like back-patch
material, since possibly some users don't realize that the broken
behavior is broken and are relying on what happens now.  Also, while
the added test is quite cheap in the wake of commit a4c35ea1c, it would
be much more expensive (or else messier) in older branches.

Per report from Tom van Tilburg.

Discussion: <CAP3PPDiucxYCNev52=YPVkrQAPVF1C5PFWnrQPT7iMzO1fiKFQ@mail.gmail.com>

Make struct ParallelSlot private within pg_dump/parallel.c.

The only field of this struct that other files have any need to touch
is the pointer to the TocEntry a worker is working on. (Well,
pg_backup_archiver.c is actually looking at workerStatus too, but that
can be finessed by specifying that the TocEntry pointer is NULL for a
non-busy worker.)

Hence, move out the TocEntry pointers to a separate array within
struct ParallelState, and then we can make struct ParallelSlot private.

I noted the possibility of this previously, but hadn't got round to
actually doing it.

Discussion: <1188.1464544443@sss.pgh.pa.us>

Rationalize parallel dump/restore's handling of worker cmd/status messages.

The existing APIs for creating and parsing command and status messages are
rather messy; for example, archive-format modules have to provide code
for constructing command messages, which is entirely pointless since
the code to read them is hard-wired in WaitForCommands() and hence
no format-specific variation is actually possible. But there's little
foreseeable reason to need format-specific variation anyway.

The situation for status messages is no better; at least those are both
constructed and parsed by format-specific code, but said code is quite
redundant since there's no actual need for format-specific variation.

To add insult to injury, the first API involves returning pointers to
static buffers, which is bad, while the second involves returning pointers
to malloc'd strings, which is safer but randomly inconsistent.

Hence, get rid of the MasterStartParallelItem and MasterEndParallelItem
APIs, and instead write centralized functions that construct and parse
command and status messages. If we ever do need more flexibility, these
functions can be the standard implementations of format-specific
callback methods, but that's a long way off if it ever happens.

Tom Lane, reviewed by Kevin Grittner

Discussion: <17340.1464465717@sss.pgh.pa.us>

Redesign parallel dump/restore's wait-for-workers logic.

The ListenToWorkers/ReapWorkerStatus APIs were messy and hard to use.
Instead, make DispatchJobForTocEntry register a callback function that
will take care of state cleanup, doing whatever had been done by the caller
of ReapWorkerStatus in the old design.  (This callback is essentially just
the old mark_work_done function in the restore case, and a trivial test for
worker failure in the dump case.)  Then we can have ListenToWorkers call
the callback immediately on receipt of a status message, and return the
worker to WRKR_IDLE state; so the WRKR_FINISHED state goes away.

This allows us to design a unified wait-for-worker-messages loop:
WaitForWorkers replaces EnsureIdleWorker and EnsureWorkersFinished as well
as the mess in restore_toc_entries_parallel.  Also, we no longer need the
fragile API spec that the caller of DispatchJobForTocEntry is responsible
for ensuring there's an idle worker, since DispatchJobForTocEntry can just
wait until there is one.

In passing, I got rid of the ParallelArgs struct, which was a net negative
in terms of notational verboseness, and didn't seem to be providing any
noticeable amount of abstraction either.

Tom Lane, reviewed by Kevin Grittner

Discussion: <1188.1464544443@sss.pgh.pa.us>

Improve contrib/cube's handling of zero-D cubes, infinities, and NaNs.

It's always been possible to create a zero-dimensional cube by converting
from a zero-length float8 array, but cube_in failed to accept the '()'
representation that cube_out produced for that case, resulting in a
dump/reload hazard.  Make it accept the case.  Also fix a couple of
other places that didn't behave sanely for zero-dimensional cubes:
cube_size would produce 1.0 when surely the answer should be 0.0,
and g_cube_distance risked a divide-by-zero failure.

Likewise, it's always been possible to create cubes containing float8
infinity or NaN coordinate values, but cube_in couldn't parse such input,
and cube_out produced platform-dependent spellings of the values.  Convert
them to use float8in_internal and float8out_internal so that the behavior
will be the same as for float8, as we recently did for the core geometric
types (cf commit 50861cd68).  As in that commit, I don't pretend that this
patch fixes all insane corner-case behaviors that may exist for NaNs, but
it's a step forward.

(This change allows removal of the separate cube_1.out and cube_3.out
expected-files, as the platform dependency that previously required them
is now gone: an underflowing coordinate value will now produce an error
not plus or minus zero.)

Make errors from cube_in follow project conventions as to spelling
("invalid input syntax for cube" not "bad cube representation")
and errcode (INVALID_TEXT_REPRESENTATION not SYNTAX_ERROR).

Also a few marginal code cleanups and comment improvements.

Tom Lane, reviewed by Amul Sul

Discussion: <15085.1472494782@sss.pgh.pa.us>

Include <sys/select.h> where needed

<sys/select.h> is required by POSIX.1-2001 to get the prototype of
select(2), but nearly no systems enforce that because older standards
let you get away with including some other headers. Recent OpenBSD
hacking has removed that frail touch of friendliness, however, which
broke some compiles; fix all the way back to 9.1 by adding the required
standard. Only vacuumdb.c was reported to fail, but it seems easier to
fix the whole lot in a fell swoop.

Per bug #14334 by Sean Farrell.

Fix some typos in comment

Fix newly-introduced issues in pgbench.

The result of FD_ISSET() doesn't necessarily fit in a bool, though
assigning it to one might accidentally work depending on platform and which
socket FD number is being inquired of.  Rewrite to test it with if(),
rather than making any specific assumption about the result width,
to match the way every other such call in PG is written.

Don't break out of the input_mask-filling loop after finding the first
client that we're waiting for results from.  That mostly breaks parallel
query management.

Also, if we choose not to call select(), be sure to clear out any bits
the mask-filling loop might have set, so that we don't accidentally call
doCustom for clients we don't know have input.  Doing so would likely
be harmless, but it's a waste of cycles and doesn't seem to be intended.

Make this_usec wide enough.  (Yeah, the value would usually fit in an
int, but then why are we using int64 everywhere else?)

Minor cosmetic adjustments, mostly comment improvements.

Problems introduced by commit 12788ae49.  The first issue was discovered
by buildfarm testing, the others by code review.

Replace the built-in GIN array opclasses with a single polymorphic opclass.

We had thirty different GIN array opclasses sharing the same operators and
support functions.  That still didn't cover all the built-in types, nor
did it cover arrays of extension-added types.  What we want is a single
polymorphic opclass for "anyarray".  There were two missing features needed
to make this possible:

1. We have to be able to declare the index storage type as ANYELEMENT
when the opclass is declared to index ANYARRAY.  This just takes a few
more lines in index_create().  Although this currently seems of use only
for GIN, there's no reason to make index_create() restrict it to that.

2. We have to be able to identify the proper GIN compare function for
the index storage type.  This patch proceeds by making the compare function
optional in GIN opclass definitions, and specifying that the default btree
comparison function for the index storage type will be looked up when the
opclass omits it.  Again, that seems pretty generically useful.

Since the comparison function lookup is done in initGinState(), making
use of the second feature adds an additional cache lookup to GIN index
access setup.  It seems unlikely that that would be very noticeable given
the other costs involved, but maybe at some point we should consider
making GinState data persist longer than it now does --- we could keep it
in the index relcache entry, perhaps.

Rather fortuitously, we don't seem to need to do anything to get this
change to play nice with dump/reload or pg_upgrade scenarios: the new
opclass definition is automatically selected to replace existing index
definitions, and the on-disk data remains compatible.  Also, if a user has
created a custom opclass definition for a non-builtin type, this doesn't
break that, since CREATE INDEX will prefer an exact match to opcintype
over a match to ANYARRAY.  However, if there's anyone out there with
handwritten DDL that explicitly specifies _bool_ops or one of the other
replaced opclass names, they'll need to adjust that.

Tom Lane, reviewed by Enrique Meneses

Discussion: <14436.1470940379@sss.pgh.pa.us>

Document has_type_privilege().

Evidently an oversight in commit 729205571. Back-patch to 9.2 where
privileges for types were introduced.

Report: <20160922173517.8214.88959@wrigleys.postgresql.org>

Refactor script execution state machine in pgbench.

The doCustom() function had grown into quite a mess. Rewrite it, in a more
explicit state machine style, for readability.

This also fixes one minor bug: if a script consisted entirely of meta
commands, doCustom() never returned to the caller, so progress reports
with the -P option were not printed. I don't want to backpatch this
refactoring, and the bug is quite insignificant, so only commit this to
master, and leave the bug unfixed in back-branches.

Review and original bug report by Fabien Coelho.

Discussion: <alpine.DEB.2.20.1607090850120.3412@sto>

Refer to OS X as "macOS", except for the port name which is still "darwin".

We weren't terribly consistent about whether to call Apple's OS "OS X"
or "Mac OS X", and the former is probably confusing to people who aren't
Apple users.  Now that Apple has rebranded it "macOS", follow their lead
to establish a consistent naming pattern.  Also, avoid the use of the
ancient project name "Darwin", except as the port code name which does not
seem desirable to change.  (In short, this patch touches documentation and
comments, but no actual code.)

I didn't touch contrib/start-scripts/osx/, either.  I suspect those are
obsolete and due for a rewrite, anyway.

I dithered about whether to apply this edit to old release notes, but
those were responsible for quite a lot of the inconsistencies, so I ended
up changing them too.  Anyway, Apple's being ahistorical about this,
so why shouldn't we be?

Do a final round of updates on the 9.6 release notes.

Set release date, document a few recent commits, do one last pass of
copy-editing.

Install TAP test infrastructure so it's available for extension testing.

When configured with --enable-tap-tests, "make install" will now install
the Perl support files for TAP testing where PGXS will find them.
This allows extensions to rely on $(prove_check) even when being built
out-of-tree. Back-patch to 9.4 where we first started to support TAP
testing, to reduce the number of cases extension makefiles need to
consider.

Craig Ringer

Discussion: <CAMsr+YFXv+2qne6xJW7z_25mYBtktRX5rpkrgrb+DRgQ_FxgHQ@mail.gmail.com>

Doc: fix examples of # operators so they actually work.

These worked as-is until around 7.0, but fail in newer versions because
there are more operators named "#". Besides it's a bit inconsistent that
only two of the examples on this page lack type names on their constants.

Report: <20160923081530.1517.75670@wrigleys.postgresql.org>

Fix incorrect logic for excluding range constructor functions in pg_dump.

Faulty AND/OR nesting in the WHERE clause of getFuncs' SQL query led to
dumping range constructor functions if they are part of an extension
and we're in binary-upgrade mode.  Actually, we don't want to dump them
separately even then, since CREATE TYPE AS RANGE will create the range's
constructor functions regardless.  Per report from Andrew Dunstan.

It looks like this mistake was introduced by me, in commit b985d4877, in
perhaps-overzealous refactoring to reduce code duplication.  I'm suitably
embarrassed.

Report: <34854939-02d7-f591-5677-ce2994104599@dunslane.net>

Remove useless code.

Apparent copy-and-pasteo in standby_desc_invalidations() had two
entries for msg->id == SHAREDINVALRELMAP_ID.

Aleksander Alekseev

Discussion: <20160923090814.GB1238@e733>

Don't trust CreateFileMapping() to clear the error code on success.

We must test GetLastError() even when CreateFileMapping() returns a
non-null handle. If that value were left over from some previous system
call, we might be fooled into thinking the segment already existed.
Experimentation on Windows 7 suggests that CreateFileMapping() clears
the error code on success, but it is not documented to do so, so let's
not rely on that happening in all Windows releases.

Amit Kapila

Discussion: <20811.1474390987@sss.pgh.pa.us>

Avoid using PostmasterRandom() for DSM control segment ID.

Commits 470d886c3 et al intended to fix the problem that the postmaster
selected the same "random" DSM control segment ID on every start. But
using PostmasterRandom() for that destroys the intended property that the
delay between random_start_time and random_stop_time will be unpredictable.
(Said delay is probably already more predictable than we could wish, but
that doesn't mean that reducing it by a couple orders of magnitude is OK.)
Revert the previous patch and add a comment warning against misuse of
PostmasterRandom. Fix the original problem by calling srandom() early in
PostmasterMain, using a low-security seed that will later be overwritten
by PostmasterRandom.

Discussion: <20789.1474390434@sss.pgh.pa.us>

pg_ctl: Add promote wait option to help output

pointed out by Masahiko Sawada <sawada.mshk@gmail.com>

Improve error message on MSVC if perl*.lib is not found.

John Harvey, reviewed by Michael Paquier

Discussion: <CABcP5fjEjgOsh097cWnQrsK9yCswo4DZxp-V47DKCH-MxY9Gig@mail.gmail.com>

Fix typo in comment.

Daniel Gustafsson

C comment: fix function header comment

Fix for transformOnConflictClause().

Author: Tomonari Katsumata

Remove nearly-unused SizeOfIptrData macro.

Past refactorings have removed all but one reference to SizeOfIptrData
(and that one place was in a pretty noncritical spot). Since nobody's
complained, it seems probable that there are no supported compilers
that don't think sizeof(ItemPointerData) is 6. If there are, we're
wasting MAXALIGN per heap tuple anyway, so it's rather silly to worry
about whether we can shave space in places like WAL records.

Pavan Deolasee

Discussion: <CABOikdOOawDda4hwLOT6zdA6MFfPLu3Z2YBZkX0JdayNS6JOeQ@mail.gmail.com>

Be sure to rewind the tuplestore read pointer in non-leader CTEScan nodes.

ExecInitCteScan supposed that it didn't have to do anything to the extra
tuplestore read pointer it gets from tuplestore_alloc_read_pointer.
However, it needs this read pointer to be positioned at the start of the
tuplestore, while tuplestore_alloc_read_pointer is actually defined as
cloning the current position of read pointer 0.  In normal situations
that accidentally works because we initialize the whole plan tree at once,
before anything gets read.  But it fails in an EvalPlanQual recheck, as
illustrated in bug #14328 from Dima Pavlov.  To fix, just forcibly rewind
the pointer after tuplestore_alloc_read_pointer.  The cost of doing so is
negligible unless the tuplestore is already in TSS_READFILE state, which
wouldn't happen in normal cases.  We could consider altering tuplestore's
API to make that case cheaper, but that would make for a more invasive
back-patch and it doesn't seem worth it.

This has been broken probably for as long as we've had CTEs, so back-patch
to all supported branches.

Discussion: <32468.1474548308@sss.pgh.pa.us>

Add tests for various connection string issues

Add tests for consistent support of connection strings in frontend
programs as well as proper handling of unusual characters in database
and user names. These tests were developed for the issues of
CVE-2016-5424.

To allow testing of names with spaces, change the pg_regress
command-line options --create-role and --dbname to split their arguments
by comma only, not space or comma as before. Only commas were actually
used in existing uses.

Noah Misch, Michael Paquier, Peter Eisentraut

pg_ctl: Add wait option to promote action

When waiting is selected for the promote action, look into pg_control
until the state changes, then use the PQping-based waiting until the
server is reachable.

Reviewed-by: Michael Paquier <michael.paquier@gmail.com>

Delay updating control file to "in production"

Move the updating of the control file to "in production" status until
the point where WAL writes are allowed. Before, there could be a
significant gap between the control file update and write transactions
actually being allowed. This makes it more reliable to use the control
status to verify the end of a promotion.

From: Michael Paquier <michael.paquier@gmail.com>

pg_ctl: Detect current standby state from pg_control

pg_ctl used to determine whether a server was in standby mode by looking
for a recovery.conf file. With this change, it instead looks into
pg_control, which is potentially more accurate. There are also
occasional discussions about removing recovery.conf, so this removes one
dependency.

Reviewed-by: Michael Paquier <michael.paquier@gmail.com>

pg_ctl: Add tests for promote action

Reviewed-by: Michael Paquier <michael.paquier@gmail.com>

Make command_like output more compact

Consistently print the test name, not the full command, which can be
quite lenghty and include temporary directory names and other
distracting details.

Reviewed-by: Michael Paquier <michael.paquier@gmail.com>

Fix typo

From: Michael Paquier <michael.paquier@gmail.com>

Add more parallel query documentation.

Previously, the individual settings were documented, but there was
no overall discussion of the capabilities and limitations of the
feature. Add that.

Patch by me, reviewed by Peter Eisentraut and Álvaro Herrera.

Print test parameters like "foo: 123", and results like "foo = 123".

The way "latency average" was printed was differently if it was calculated
from the overall run time or was measured on a per-transaction basis.
Also, the per-script weight is a test parameter, rather than a result, so
use the "weight: %f" style for that.

Backpatch to 9.6, since the inconsistency on "latency average" was
introduced there.

Fabien Coelho

Discussion: <alpine.DEB.2.20.1607131015370.7486@sto>

Fix pgbench's calculation of average latency, when -T is not used.

If the test duration was given in # of transactions (-t or no option),
rather as a duration (-T), the latency average was always printed as 0.
It has been broken ever since the display of latency average was added,
in 9.4.

Fabien Coelho

Discussion: <alpine.DEB.2.20.1607131015370.7486@sto>

pg_restore: Add -N option to exclude schemas

This is similar to the -N option in pg_dump, except that it doesn't take
a pattern, just like the existing -n option in pg_restore.

From: Michael Banck <michael.banck@credativ.de>

doc: Fix documentation to match actual make output

based on patch from Takeshi Ideriha <iderihatakeshi@gmail.com>

doc: Correct ALTER USER MAPPING example

The existing example threw an error.

From: gabrielle <gorthx@gmail.com>

Re-add translation markers that were lost

When win32security.c was moved from src/backend/port/win32/security.c,
the message writing function was changed from write_stderr to log_error,
but nls.mk was not updated. We could add log_error to GETTEXT_TRIGGERS,
but it's also used in src/common/exec.c in a different way and that
would create some confusion or a larger patch. For now, just put an
explicit translation marker onto the strings that were previously
translated.

Use PostmasterRandom(), not random(), for DSM control segment ID.

Otherwise, every startup gets the same "random" value, which is
definitely not what was intended.

Retry DSM control segment creation if Windows indicates access denied.

Otherwise, attempts to run multiple postmasters running on the same
machine may fail, because Windows sometimes returns ERROR_ACCESS_DENIED
rather than ERROR_ALREADY_EXISTS when there is an existing segment.

Hitting this bug is much more likely because of another defect not
fixed by this patch, namely that dsm_postmaster_startup() uses
random() which returns the same value every time. But that's not
a reason not to fix this.

Kyotaro Horiguchi and Amit Kapila, reviewed by Michael Paquier

Discussion: <CAA4eK1JyNdMeF-dgrpHozDecpDfsRZUtpCi+1AbtuEkfG3YooQ@mail.gmail.com>

Fix outdated comments, GIST search queue is not an RBTree anymore.

The GiST search queue is implemented as a pairing heap rather than as
Red-Black Tree, since 9.5 (commit e7032610). I neglected these comments
in that commit.

Fix latency calculation when there are \sleep commands in the script.

We can't use txn_scheduled to hold the sleep-until time for \sleep, because
that interferes with calculation of the latency of the transaction as whole.

Backpatch to 9.4, where this bug was introduced.

Fabien COELHO

Discussion: <alpine.DEB.2.20.1608231622170.7102@lancre>

Remove obsolete warning from docs.

Python 2.4 and Fedora 4 are both obsolete at this point, especially
unpatched debug builds.

Discussion: <85e377b2-d459-396e-59b1-115548bbc059@iki.fi>

MSVC: Include pg_recvlogical in client-only install.

MauMau, reviewed by Michael Paquier

Update recovery_min_apply_delay docs for remote_apply mode.

Bernd Helmle, reviewed by Thomas Munro, tweaked by me.

Fix ecpg -? option on Windows, add -V alias for --version.

This makes the -? and -V options work consistently with other binaries.
--help and --version are now only recognized as the first option, i.e.
"ecpg --foobar --help" no longer prints the help, but that's consistent
with most of our other binaries, too.

Backpatch to all supported versions.

Haribabu Kommi

Discussion: <CAJrrPGfnRXvmCzxq6Dy=stAWebfNHxiL+Y_z7uqksZUCkW_waQ@mail.gmail.com>

Add debugging aid "bmsToString(Bitmapset *bms)".

This function has no direct callers at present, but it's convenient for
manual use in a debugger, rather than having to inspect memory and do
bit-counting in your head.

In passing, get rid of useless outBitmapset() wrapper around
_outBitmapset(); let's just export the function that does the work.
Likewise for outToken().

Ashutosh Bapat, tweaked a bit by me

Discussion: <CAFjFpRdiht8e1HTVirbubr4YzaON5iZTzFJjq909y4sU8M_6eA@mail.gmail.com>

Clarify policy on marking inherited constraints as valid.

Amit Langote and Robert Haas

Fix building with LibreSSL.

LibreSSL defines OPENSSL_VERSION_NUMBER to claim that it is version 2.0.0,
but it doesn't have the functions added in OpenSSL 1.1.0. Add autoconf
checks for the individual functions we need, and stop relying on
OPENSSL_VERSION_NUMBER.

Backport to 9.5 and 9.6, like the patch that broke this. In the
back-branches, there are still a few OPENSSL_VERSION_NUMBER checks left,
to check for OpenSSL 0.9.8 or 0.9.7. I left them as they were - LibreSSL
has all those functions, so they work as intended.

Per buildfarm member curculio.

Discussion: <2442.1473957669@sss.pgh.pa.us>

Fix typo in comment.

Amit Langote

Make min_parallel_relation_size's default value platform-independent.

The documentation states that the default value is 8MB, but this was
only true at BLCKSZ = 8kB, because the default was hard-coded as 1024.
Make the code match the docs by computing the default as 8MB/BLCKSZ.

Oversight in commit 75be66464, noted pursuant to a gripe from Peter E.

Discussion: <90634e20-097a-e4fd-67d5-fb2c42f0dd71@2ndquadrant.com>

pg_buffercache: Allow huge allocations.

Otherwise, users who have configured shared_buffers >= 256GB won't
be able to use this module. There probably aren't many of those, but
it doesn't hurt anything to fix it so that it works.

Backpatch to 9.4, where MemoryContextAllocHuge was introduced. The
same problem exists in older branches, but there's no easy way to
fix it there.

KaiGai Kohei

Support OpenSSL 1.1.0.

Changes needed to build at all:

- Check for SSL_new in configure, now that SSL_library_init is a macro.
- Do not access struct members directly. This includes some new code in
  pgcrypto, to use the resource owner mechanism to ensure that we don't
  leak OpenSSL handles, now that we can't embed them in other structs
  anymore.
- RAND_SSLeay() -> RAND_OpenSSL()

Changes that were needed to silence deprecation warnings, but were not
strictly necessary:

- RAND_pseudo_bytes() -> RAND_bytes().
- SSL_library_init() and OpenSSL_config() -> OPENSSL_init_ssl()
- ASN1_STRING_data() -> ASN1_STRING_get0_data()
- DH_generate_parameters() -> DH_generate_parameters()
- Locking callbacks are not needed with OpenSSL 1.1.0 anymore. (Good
  riddance!)

Also change references to SSLEAY_VERSION_NUMBER with OPENSSL_VERSION_NUMBER,
for the sake of consistency. OPENSSL_VERSION_NUMBER has existed since time
immemorial.

Fix SSL test suite to work with OpenSSL 1.1.0. CA certificates must have
the "CA:true" basic constraint extension now, or OpenSSL will refuse them.
Regenerate the test certificates with that. The "openssl" binary, used to
generate the certificates, is also now more picky, and throws an error
if an X509 extension is specified in "req_extensions", but that section
is empty.

Backpatch to all supported branches, per popular demand. In back-branches,
we still support OpenSSL 0.9.7 and above. OpenSSL 0.9.6 should still work
too, but I didn't test it. In master, we only support 0.9.8 and above.

Patch by Andreas Karlsson, with additional changes by me.

Discussion: <20160627151604.GD1051@msg.df7cb.de>

Fix and clarify comments on replacement selection.

These were modified by the patch to only use replacement selection for the
first run in an external sort.

Add overflow checks to money type input function

The money type input function did not have any overflow checks at all.
There were some regression tests that purported to check for overflow,
but they actually checked for the overflow behavior of the int8 type
before casting to money. Remove those unnecessary checks and add some
that actually check the money input function.

Reviewed-by: Fabien COELHO <coelho@cri.ensmp.fr>

Tweak targetlist-SRF tests some more.

Seems like it would be good to have a test case documenting the
existing behavior for non-top-level SRFs.

Improve code comment for GatherPath's single_copy flag.

Discussion: 5934.1472642782@sss.pgh.pa.us

Tweak targetlist-SRF tests.

Add a test case showing that we don't support SRFs in window-function
arguments. Remove a duplicate test case for SRFs in aggregate arguments.

Be pickier about converting between Name and Datum.

We were misapplying NameGetDatum() to plain C strings in some places.
This worked, because it was just a pointer cast anyway, but it's a type
cheat in some sense. Use CStringGetDatum instead, and modify the
NameGetDatum macro so it won't compile if applied to something that's
not a pointer to NameData. This should result in no changes to
generated code, but it is logically cleaner.

Mark Dilger, tweaked a bit by me

Discussion: <EFD8AC94-4C1F-40C1-A5EA-304080089C1B@gmail.com>

Fix executor/README to reflect disallowing SRFs in UPDATE.

The parenthetical comment here is obsoleted by commit a4c35ea1c.
Noted by Andres Freund.

Improve parser's and planner's handling of set-returning functions.

Teach the parser to reject misplaced set-returning functions during parse
analysis using p_expr_kind, in much the same way as we do for aggregates
and window functions (cf commit eaccfded9).  While this isn't complete
(it misses nesting-based restrictions), it's much better than the previous
error reporting for such cases, and it allows elimination of assorted
ad-hoc expression_returns_set() error checks.  We could add nesting checks
later if it seems important to catch all cases at parse time.

There is one case the parser will now throw error for although previous
versions allowed it, which is SRFs in the tlist of an UPDATE.  That never
behaved sensibly (since it's ill-defined which generated row should be
used to perform the update) and it's hard to see why it should not be
treated as an error.  It's a release-note-worthy change though.

Also, add a new Query field hasTargetSRFs reporting whether there are
any SRFs in the targetlist (including GROUP BY/ORDER BY expressions).
The parser can now set that basically for free during parse analysis,
and we can use it in a number of places to avoid expression_returns_set
searches.  (There will be more such checks soon.)  In some places, this
allows decontorting the logic since it's no longer expensive to check for
SRFs in the tlist --- so I made the checks parallel to the handling of
hasAggs/hasWindowFuncs wherever it seemed appropriate.

catversion bump because adding a Query field changes stored rules.

Andres Freund and Tom Lane

Discussion: <24639.1473782855@sss.pgh.pa.us>

Have heapam.h include lockdefs.h rather than lock.h.

lockdefs.h was only split from lock.h relatively recently, and
represents a minimal subset of the old lock.h. heapam.h only needs
that smaller subset, so adjust it to include only that. This requires
some corresponding adjustments elsewhere.

Peter Geoghegan

Remove user_relns() SRF from regression tests.

The output of the function changes whenever previous (or, as in this
case, concurrent) tests leave a table in place. That causes unneeded
churn.

This should fix failures due to the tests added bfe16d1a5, like on
lapwing, caused by the tsrf test running concurrently with misc. Those
could also have been addressed by using temp tables, but that test has
annoyed me before.

Discussion: <27626.1473729905@sss.pgh.pa.us>

Address portability issues in bfe16d1a5 test output.

Add more tests for targetlist SRFs.

We're considering changing the implementation of targetlist SRFs
considerably, and a lot of the current behaviour isn't tested in our
regression tests. Thus it seems useful to increase coverage to avoid
accidental behaviour changes.

It's quite possible that some of the plans here will require adjustments
to avoid falling afoul of ordering differences (e.g. hashed group
bys). The buildfarm will tell us.

Reviewed-By: Tom Lane
Discussion: <20160827214829.zo2dfb5jaikii5nw@alap3.anarazel.de>

Docs: assorted minor cleanups.

Standardize on "user_name" for a field name in related examples in
ddl.sgml; before we had variously "user_name", "username", and "user".
The last is flat wrong because it conflicts with a reserved word.

Be consistent about entry capitalization in a table in func.sgml.

Fix a typo in pgtrgm.sgml.

Back-patch to 9.6 and 9.5 as relevant.

Alexander Law

pg_basebackup: Clean created directories on failure

Like initdb, clean up created data and xlog directories, unless the new
-n/--noclean option is specified.

Tablespace directories are not cleaned up, but a message is written
about that.

Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com>

Fix recent commit for tab-completion of database template.

The details of commit 52803098ab26051c0c9802f3c19edffc90de8843 were
based on a misunderstanding of the role inheritance allowing use
of a database for a template. While the CREATEDB privilege is not
inherited, the database ownership is privileges are.

Pointed out by Vitaly Burovoy and Tom Lane.
Fix provided by Tom Lane, reviewed by Vitaly Burovoy.

Fix copy/pasto in file identification

Daniel Gustafsson

Identify walsenders in pg_stat_activity

Following 8299471c37fff0b walsender procs are now visible in pg_stat_activity.
Set query to ‘walsender’ for walsender procs to allow them to be identified.

Discussion:CAB7nPqS8c76KPSufK_HSDeYrbtg+zZ7D0EEkjeM6txSEuCB_jA@mail.gmail.com

Michael Paquier, issue raised by Fujii Masao, reviewed by Tom Lane

Raise max setting of checkpoint_timeout to 1d

Previously checkpoint_timeout was capped at 3600s
New max setting is 86400s = 24h = 1d

Discussion: 32558.1454471895@sss.pgh.pa.us

psql tab completion for CREATE DATABASE ... TEMPLATE ...

Sehrope Sarkuni, reviewed by Merlin Moncure & Vitaly Burovoy
with some editing by me

Allow CREATE EXTENSION to follow extension update paths.

Previously, to update an extension you had to produce both a version-update
script and a new base installation script.  It's become more and more
obvious that that's tedious, duplicative, and error-prone.  This patch
attempts to improve matters by allowing the new base installation script
to be omitted.  CREATE EXTENSION will install a requested version if it
can find a base script and a chain of update scripts that will get there.
As in the existing update logic, shorter chains are preferred if there's
more than one possibility, with an arbitrary tie-break rule for chains
of equal length.

Also adjust the pg_available_extension_versions view to show such versions
as installable.

While at it, refactor the code so that CASCADE processing works for
extensions requested during ApplyExtensionUpdates().  Without this,
addition of a new requirement in an updated extension would require
creating a new base script, even if there was no other reason to do that.
(It would be easy at this point to add a CASCADE option to ALTER EXTENSION
UPDATE, to allow the same thing to happen during a manually-commanded
version update, but I have not done that here.)

Tom Lane, reviewed by Andres Freund

Discussion: <20160905005919.jz2m2yh3und2dsuy@alap3.anarazel.de>

Fix and simplify MSVC build's handling of xml/xslt/uuid dependencies.

Solution.pm mistakenly believed that the xml option requires the xslt
option, when actually the dependency is the other way around; and it
believed that libxml requires libiconv, which is not necessarily so,
so we shouldn't enforce it here. Fix the option cross-checking logic.

Also, since AddProject already takes care of adding libxml and libxslt
include and library dependencies to every project, there's no need
for the custom code that did that in mkvcbuild. While at it, let's
handle the similar dependencies for uuid in a similar fashion.

Given the lack of field complaints about these overly strict build
dependency requirements, there seems no need for a back-patch.

Michael Paquier

Discussion: <CAB7nPqR0+gpu3mRQvFjf-V-bMxmiSJ6NpTg9_WzVDL+a31cV2g@mail.gmail.com>

Implement binary heap replace-top operation in a smarter way.

In external sort's merge phase, we maintain a binary heap holding the next
tuple from each input tape. On each step, the topmost tuple is returned,
and replaced with the next tuple from the same tape. We were doing the
replacement by deleting the top node in one operation, and inserting the
next tuple after that. However, you can do a "replace-top" operation more
efficiently, in one "sift-up". A deletion will always walk the heap from
top to bottom, but in a replacement, we can stop as soon as we find the
right place for the new tuple. This is particularly helpful, if the tapes
are not in completely random order, so that the next tuple from a tape is
likely to land near the top of the heap.

Peter Geoghegan, reviewed by Claudio Freire, with some editing by me.

Discussion: <CAM3SWZRhBhiknTF_=NjDSnNZ11hx=U_SEYwbc5vd=x7M4mMiCw@mail.gmail.com>

Improve unreachability recognition in elog() macro.

Some experimentation with an older version of gcc showed that it is able
to determine whether "if (elevel_ >= ERROR)" is compile-time constant
if elevel_ is declared "const", but otherwise not so much. We had
accounted for that in ereport() but were too miserly with braces to
make it so in elog(). I don't know how many currently-interesting
compilers have the same quirk, but in case it will save some code
space, let's make sure that elog() is on the same footing as ereport()
for this purpose.

Back-patch to 9.3 where we introduced pg_unreachable() calls into
elog/ereport.

Fix miserable coding in pg_stat_get_activity().

Commit dd1a3bccc replaced a test on whether a subroutine returned a
null pointer with a test on whether &pointer->backendStatus was null.
This accidentally failed to fail, at least on common compilers, because
backendStatus is the first field in the struct; but it was surely trouble
waiting to happen. Commit f91feba87 then messed things up further,
changing the logic to

local_beentry = pgstat_fetch_stat_local_beentry(curr_backend);
if (!local_beentry)
continue;
beentry = &local_beentry->backendStatus;
if (!beentry)
{

where the second "if" is now dead code, so that the intended behavior of
printing a row with "<backend information not available>" cannot occur.

I suspect this is all moot because pgstat_fetch_stat_local_beentry
will never actually return null in this function's usage, but it's still
very poor coding. Repair back to 9.4 where the original problem was
introduced.

Rewrite PageIndexDeleteNoCompact into a form that only deletes 1 tuple.

The full generality of deleting an arbitrary number of tuples is no longer
needed, so let's save some code and cycles by replacing the original coding
with an implementation based on PageIndexTupleDelete.

We can always get back the old code from git if we need it again for new
callers (though I don't care for its willingness to mess with line pointers
it wasn't told to mess with).

Discussion: <552.1473445163@sss.pgh.pa.us>

Convert PageAddItem into a macro to save a few cycles.

Nowadays this is just a backwards-compatibility wrapper around
PageAddItemExtended, so let's avoid the extra level of function call.
In addition, because pretty much all callers are passing constants
for the two bool arguments, compilers will be able to constant-fold
the conversion to a flags bitmask.

Discussion: <552.1473445163@sss.pgh.pa.us>

Invent PageIndexTupleOverwrite, and teach BRIN and GiST to use it.

PageIndexTupleOverwrite performs approximately the same function as
PageIndexTupleDelete (or PageIndexDeleteNoCompact) followed by PageAddItem
targeting the same item pointer offset. But in the case where the new
tuple is the same size as the old, it avoids shuffling other data around on
the page, because the new tuple is placed where the old one was rather than
being appended to the end of the page. This has been shown to provide a
substantial speedup for some GiST use-cases.

Also, this change allows some API simplifications: we can get rid of
the rather klugy and error-prone PAI_ALLOW_FAR_OFFSET flag for
PageAddItemExtended, since that was used only to cover a corner case
for BRIN that's better expressed by using PageIndexTupleOverwrite.

Note that this patch causes a rather subtle WAL incompatibility: the
physical page content change represented by certain WAL records is now
different than it was before, because while the tuples have the same
itempointer line numbers, the tuples themselves are in different places.
I have not bumped the WAL version number because I think it doesn't matter
unless you are trying to do bitwise comparisons of original and replayed
pages, and in any case we're early in a devel cycle and there will probably
be more WAL changes before v10 gets out the door.

There is probably room to make use of PageIndexTupleOverwrite in SP-GiST
and GIN too, but that is left for a future patch.

Andrey Borodin, reviewed by Anastasia Lubennikova, whacked around a bit
by me

Discussion: <CAJEAwVGQjGGOj6mMSgMwGvtFd5Kwe6VFAxY=uEPZWMDjzbn4VQ@mail.gmail.com>

Fix locking a tuple updated by an aborted (sub)transaction

When heap_lock_tuple decides to follow the update chain, it tried to
also lock any version of the tuple that was created by an update that
was subsequently rolled back. This is pointless, since for all intents
and purposes that tuple exists no more; and moreover it causes
misbehavior, as reported independently by Marko Tiikkaja and Marti
Raudsepp: some SELECT FOR UPDATE/SHARE queries may fail to return
the tuples, and assertion-enabled builds crash.

Fix by having heap_lock_updated_tuple test the xmin and return success
immediately if the tuple was created by an aborted transaction.

The condition where tuples become invisible occurs when an updated tuple
chain is followed by heap_lock_updated_tuple, which reports the problem
as HeapTupleSelfUpdated to its caller heap_lock_tuple, which in turn
propagates that code outwards possibly leading the calling code
(ExecLockRows) to believe that the tuple exists no longer.

Backpatch to 9.3. Only on 9.5 and newer this leads to a visible
failure, because of commit 27846f02c176; before that, heap_lock_tuple
skips the whole dance when the tuple is already locked by the same
transaction, because of the ancient HeapTupleSatisfiesUpdate behavior.
Still, the buggy condition may also exist in more convoluted scenarios
involving concurrent transactions, so it seems safer to fix the bug in
the old branches too.

Discussion:
https://www.postgresql.org/message-id/CABRT9RC81YUf1=jsmWopcKJEro=VoeG2ou6sPwyOUTx_qteRsg@mail.gmail.com
https://www.postgresql.org/message-id/48d3eade-98d3-8b9a-477e-1a8dc32a724d@joh.to

In PageIndexTupleDelete, don't assume stored item lengths are MAXALIGNed.

PageAddItem stores the item length as-is. It MAXALIGN's the amount of
space actually allocated for each tuple, but not the stored length.
PageRepairFragmentation, PageIndexMultiDelete, and PageIndexDeleteNoCompact
are all on board with this and MAXALIGN item lengths after fetching them.
But PageIndexTupleDelete expects the stored length to be a MAXALIGN
multiple already. This accidentally works for existing index AMs because
they all maxalign their tuple sizes internally; but we don't do that for
heap tuples, and it shouldn't be a requirement for index tuples either.

So, sync PageIndexTupleDelete with the rest of bufpage.c by having it
maxalign the item size after fetching.

Also add a check that pd_special is maxaligned, to ensure that the test
"(offset + size) > phdr->pd_special" is still doing the right thing.
(If offset and pd_special are aligned, it doesn't matter whether size is.)
Again, this is in sync with the rest of the routines here, except for
PageAddItem which doesn't test because it doesn't actually do anything
for which pd_special alignment matters.

This shouldn't have any immediate functional impact; it just adds the
flexibility to use PageIndexTupleDelete on index tuples with non-aligned
lengths.

Discussion: <3814.1473366762@sss.pgh.pa.us>

Make better use of existing enums in plpgsql

plpgsql.h defines a number of enums, but most of the code passes them
around as ints. Update structs and function prototypes to take enum
types instead. This clarifies the struct definitions in plpgsql.h in
particular.

Reviewed-by: Pavel Stehule <pavel.stehule@gmail.com>

Avoid reporting "cache lookup failed" for some user-reachable cases.

We have a not-terribly-thoroughly-enforced-yet project policy that internal
errors with SQLSTATE XX000 (ie, plain elog) should not be triggerable from
SQL.  record_in, domain_in, and PL validator functions all failed to meet
this standard, because they threw plain elog("cache lookup failed for XXX")
errors on bad OIDs, and those are all invokable from SQL.

For record_in, the best fix is to upgrade typcache.c (lookup_type_cache)
to throw a user-facing error for this case.  That seems consistent because
it was more than halfway there already, having user-facing errors for shell
types and non-composite types.  Having done that, tweak domain_in to rely
on the typcache to throw an appropriate error.  (This costs little because
InitDomainConstraintRef would fetch the typcache entry anyway.)

For the PL validator functions, we already have a single choke point at
CheckFunctionValidatorAccess, so just fix its error to be user-facing.

Dilip Kumar, reviewed by Haribabu Kommi

Discussion: <87wpxfygg9.fsf@credativ.de>

Fix corruption of 2PC recovery with subxacts

Reading 2PC state files during recovery was borked, causing corruptions during
recovery. Effect limited to servers with 2PC, subtransactions and
recovery/replication.

Stas Kelvich, reviewed by Michael Paquier and Pavan Deolasee

Correct TABLESAMPLE docs

Revert to original use of word “sample”, though with clarification,
per Tom Lane.

Discussion: 29052.1471015383@sss.pgh.pa.us

Improve scalability of md.c for large relations.

So far md.c used a linked list of segments. That proved to be a problem
when processing large relations, because every smgr.c/md.c level access
to a page incurred walking through a linked list of all preceding
segments. Thus making accessing pages O(#segments).

Replace the linked list of segments hanging off SMgrRelationData with an
array of opened segments. That allows O(1) access to individual
segments, if they've previously been opened.

Discussion: <20140331101001.GE13135@alap3.anarazel.de>
Reviewed-By: Peter Geoghegan, Tom Lane (in an older version)