granicus.if.org Git - postgresql/log

Fix possible crash in ALTER TABLE ... REPLICA IDENTITY USING INDEX.

Careless coding added by commit 07cacba983ef79be could result in a crash
or a bizarre error message if someone tried to select an index on the
OID column as the replica identity index for a table. Back-patch to 9.4
where the feature was introduced.

Discussion: CAKJS1f8TQYgTRDyF1_u9PVCKWRWz+DkieH=U7954HeHVPJKaKg@mail.gmail.com

David Rowley

postgres_fdw: Clean up handling of system columns.

Previously, querying the xmin column of a single postgres_fdw foreign
table fetched the tuple length, xmax the typmod, and cmin or cmax the
composite type OID of the tuple.  However, when you queried several
such tables and the join got shipped to the remote side, these columns
ended up containing the remote values of the corresponding columns.
Both behaviors are rather unprincipled, the former for obvious reasons
and the latter because the remote values of these columns don't have
any local significance; our transaction IDs are in a different space
than those of the remote machine.  Clean this up by setting all of
these fields to 0 in both cases.  Also fix the handling of tableoid
to be sane.

Robert Haas and Ashutosh Bapat, reviewed by Etsuro Fujita.

Tweak EXPLAIN for parallel query to show workers launched.

The previous display was sort of confusing, because it didn't
distinguish between the number of workers that we planned to launch
and the number that actually got launched. This has already confused
several people, so display both numbers and label them clearly.

Julien Rouhaud, reviewed by me.

Fix portability problem induced by commit a6f6b7819.

pg_xlogdump includes bufmgr.h.  With a compiler that emits code for
static inline functions even when they're unreferenced, that leads
to unresolved external references in the new static-inline version
of BufferGetPage().  So hide it with #ifndef FRONTEND, as we've done
for similar issues elsewhere.  Per buildfarm member pademelon.

Fix typo in comment

Update helptext for vcregress.pl

This has clearly not been tracking the code changse for quite some time.

Michael Paquier, problem spotted by Kyotaro HORIGUCHI

Make regression test for multiple synchronous standbys more stable.

The regression test checks whether the output of pg_stat_replication is
expected or not after changing synchronous_standby_names and reloading
the configuration file. Regarding this test logic, previously there was
a timing issue which made the test result unstable. That is,
pg_stat_replication could return unexpected result during small window
after the configuration file was reloaded before new setting value
took effect, and which made the test fail.

This commit changes the test logic so that it uses a loop with a timeout
to give some room for the test to pass. Now the test fails only when
pg_stat_replication keeps returning unexpected result for 30 seconds.

Michael Paquier

Fix memory leak in GIN index scans.

The code had a query-lifespan memory leak when encountering GIN entries
that have posting lists (rather than posting trees, ie, there are a
relatively small number of heap tuples containing this index key value).
With a suitable data distribution this could add up to a lot of leakage.
Problem seems to have been introduced by commit 36a35c550, so back-patch
to 9.4.

Julien Rouhaud

Rethink \crosstabview's argument parsing logic.

\crosstabview interpreted its arguments in an unusual way, including
doing case-insensitive matching of unquoted column names, which is
surely not the right thing. Rip that out in favor of doing something
equivalent to the dequoting/case-folding rules used by other psql
commands. To keep it simple, change the syntax so that the optional
sort column is specified as a separate argument, instead of the
also-quite-unusual syntax that attached it to the colH argument with
a colon.

Also, rework the error messages to be closer to project style.

Make init_spin_delay() C89 compliant #2.

My previous attempt at doing so, in 80abbeba23, was not sufficient. While that
fixed the problem for bufmgr.c and lwlock.c , s_lock.c still has non-constant
expressions in the struct initializer, because the file/line/function
information comes from the caller of s_lock().

Give up on using a macro, and use a static inline instead.

Discussion: 4369.1460435533@sss.pgh.pa.us

Remove trailing commas in enums.

These aren't valid C89. Found thanks to gcc's -Wc90-c99-compat. These
exist in differing places in most supported branches.

Fix trivial typo.

Fix core dump in ReorderBufferRestoreChange on alignment-picky platforms.

When re-reading an update involving both an old tuple and a new tuple from
disk, reorderbuffer.c was careless about whether the new tuple is suitably
aligned for direct access --- in general, it isn't. We'd missed seeing
this in the buildfarm because the contrib/test_decoding tests exercise this
code path only a few times, and by chance all of those cases have old
tuples with length a multiple of 4, which is usually enough to make the
access to the new tuple's t_len safe. For some still-not-entirely-clear
reason, however, Debian's sparc build gets a bus error, as reported by
Christoph Berg; perhaps it's assuming 8-byte alignment of the pointer?

The lack of previous field reports is probably because you need all of
these conditions to trigger a crash: an alignment-picky platform (not
Intel), a transaction large enough to spill to disk, an update within
that xact that changes a primary-key field and has an odd-length old tuple,
and of course logical decoding tracing the transaction.

Avoid the alignment assumption by using memcpy instead of fetching t_len
directly, and add a test case that exposes the crash on picky platforms.
Back-patch to 9.4 where the bug was introduced.

Discussion: <20160413094117.GC21485@msg.credativ.de>

Adjust signature of walrcv_receive hook.

Commit 314cbfc5da988eff redefined the signature of this hook as
typedef int (*walrcv_receive_type) (char **buffer, int *wait_fd);

But in fact the type of the "wait_fd" variable ought to be pgsocket,
which is what WaitLatchOrSocket expects, and which is necessary if
we want to be able to assign PGINVALID_SOCKET to it on Windows.
So fix that.

Adjust datatype of ReplicationState.acquired_by.

It was declared as "pid_t", which would be fine except that none of
the places that printed it in error messages took any thought for the
possibility that it's not equivalent to "int".  This leads to warnings
on some buildfarm members, and could possibly lead to actually wrong
error messages on those platforms.  There doesn't seem to be any very
good reason not to just make it "int"; it's only ever assigned from
MyProcPid, which is int.  If we want to cope with PIDs that are wider
than int, this is not the place to start.

Also, fix the comment, which seems to perhaps be a leftover from a time
when the field was only a bool?

Per buildfarm.  Back-patch to 9.5 which has same issue.

Docs: clarify description of LIMIT/OFFSET behavior.

Section 7.6 was a tad confusing because it specified what LIMIT NULL
does, but neglected to do the same for OFFSET NULL, making this look
like perhaps a special case or a wrong restatement of the bit about
LIMIT ALL. Wordsmith a bit while at it. Per bug #14084.

Fix prototype of pgwin32_bind().

I (tgl) had copied-and-pasted this from pgwin32_accept(), failing to
notice that the third parameter should be "int" not "int *".

David Rowley

Fix broken dependency-mongering for index operator classes/families.

For a long time, opclasscmds.c explained that "we do not create a
dependency link to the AM [for an opclass or opfamily], because we don't
currently support DROP ACCESS METHOD". Commit 473b93287040b200 invented
DROP ACCESS METHOD, but it batted only 1 for 2 on adding the dependency
links, and 0 for 2 on updating the comments about the topic.

In passing, undo the same commit's entirely inappropriate decision to
blow away an existing index as a side-effect of create_am.sql.

Fix duplicated index entry in doc.

Commit cfe96ae corrected the name of pg_logical_emit_message()
in its index entry. But this typo fix caused duplicated index
entry because there was another index entry for the function.

Spotted by Tom Lane.

Disallow SET SESSION AUTHORIZATION pg_*

As part of reserving the pg_* namespace for default roles and in line
with SET ROLE and other previous efforts, disallow settings the role
to a default/reserved role using SET SESSION AUTHORIZATION.

These checks and restrictions on what is allowed regarding default /
reserved roles are under debate, but it seems prudent to ensure that
the existing checks at least cover the intended cases while the
debate rages on. On me to clean it up if the consensus decision is
to remove these checks.

Add required database and origin filtering for logical messages.

Logical messages, added in 3fe3511d05, during decoding failed to filter
messages emitted in other databases and messages emitted "under" a
replication origin the output plugin isn't interested in.

Add tests to verify that both types of filtering actually work. While
touching message.sql remove hunk obsoleted by d25379e.

Bump XLOG_PAGE_MAGIC because xl_logical_message changed and because
3fe3511d05 had omitted doing so. 3fe3511d05 additionally didn't bump
catversion, but 7a542700d has done so since.

Author: Petr Jelinek
Reported-By: Andres Freund
Discussion: 20160406142513.wotqy3ba3kanr423@alap3.anarazel.de

Make init_spin_delay() C89 compliant and change stuck spinlock reporting.

The current definition of init_spin_delay (introduced recently in
48354581a) wasn't C89 compliant. It's not legal to refer to refer to
non-constant expressions, and the ptr argument was one. This, as
reported by Tom, lead to a failure on buildfarm animal pademelon.

The pointer, especially on system systems with ASLR, isn't super helpful
anyway, though. So instead of making init_spin_delay into an inline
function, make s_lock_stuck() report the function name in addition to
file:line and change init_spin_delay() accordingly. While not a direct
replacement, the function name is likely more useful anyway (line
numbers are often hard to interpret in third party reports).

This also fixes what file/line number is reported for waits via
s_lock().

As PG_FUNCNAME_MACRO is now used outside of elog.h, move it to c.h.

Reported-By: Tom Lane
Discussion: 4369.1460435533@sss.pgh.pa.us

Fix pg_dump so pg_upgrade'ing an extension with simple opfamilies works.

As reported by Michael Feld, pg_upgrade'ing an installation having
extensions with operator families that contain just a single operator class
failed to reproduce the extension membership of those operator families.
This caused no immediate ill effects, but would create problems when later
trying to do a plain dump and restore, because the seemingly-not-part-of-
the-extension operator families would appear separately in the pg_dump
output, and then would conflict with the families created by loading the
extension.  This has been broken ever since extensions were introduced,
and many of the standard contrib extensions are affected, so it's a bit
astonishing nobody complained before.

The cause of the problem is a perhaps-ill-considered decision to omit
such operator families from pg_dump's output on the grounds that the
CREATE OPERATOR CLASS commands could recreate them, and having explicit
CREATE OPERATOR FAMILY commands would impede loading the dump script into
pre-8.3 servers.  Whatever the merits of that decision when 8.3 was being
written, it looks like a poor tradeoff now.  We can fix the pg_upgrade
problem simply by removing that code, so that the operator families are
dumped explicitly (and then will be properly made to be part of their
extensions).

Although this fixes the behavior of future pg_upgrade runs, it does nothing
to clean up existing installations that may have improperly-linked operator
families.  Given the small number of complaints to date, maybe we don't
need to worry about providing an automated solution for that; anyone who
needs to clean it up can do so with manual "ALTER EXTENSION ADD OPERATOR
FAMILY" commands, or even just ignore the duplicate-opfamily errors they
get during a pg_restore.  In any case we need this fix.

Back-patch to all supported branches.

Discussion: <20228.1460575691@sss.pgh.pa.us>

Avoid atomic operation in MarkLocalBufferDirty().

The recent patch to make Pin/UnpinBuffer lockfree in the hot
path (48354581a), accidentally used pg_atomic_fetch_or_u32() in
MarkLocalBufferDirty(). Other code operating on local buffers was
careful to only use pg_atomic_read/write_u32 which just read/write from
memory; to avoid unnecessary overhead.

On its own that'd just make MarkLocalBufferDirty() slightly less
efficient, but in addition InitLocalBuffers() doesn't call
pg_atomic_init_u32() - thus the spinlock fallback for the atomic
operations isn't initialized. That in turn caused, as reported by Tom,
buildfarm animal gaur to fail. As those errors are actually useful
against this type of error, continue to omit - intentionally this time -
initialization of the atomic variable.

In addition, add an explicit note about only using pg_atomic_read/write
on local buffers's state to BufferDesc's description.

Reported-By: Tom Lane
Discussion: 1881.1460431476@sss.pgh.pa.us

Widen amount-to-flush arguments of FileWriteback and callers.

It's silly to define these counts as narrower than they might someday
need to be. Also, I believe that the BLCKSZ * nflush calculation in
mdwriteback was capable of overflowing an int.

Fix assorted portability issues with using msync() for data flushing.

Commit 428b1d6b29ca599c5700d4bc4f4ce4c5880369bf introduced the use of
msync() for flushing dirty data from the kernel's file buffers.  Several
portability issues were overlooked, though:

* Not all implementations of mmap() think that nbytes == 0 means "map
the whole file".  To fix, use lseek() to find out the true length.
Fix callers of pg_flush_data to be aware that nbytes == 0 may result
in trashing the file's seek position.

* Not all implementations of mmap() will accept partial-page mmap
requests.  To fix, round down the length request to whatever sysconf()
says the page size is.  (I think this is OK from a portability standpoint,
because sysconf() is required by SUS v2, and we aren't trying to compile
this part on Windows anyway.  Buildfarm should let us know if not.)

* On 32-bit machines, the file size might exceed the available free
address space, or even exceed what will fit in size_t.  Check for
the latter explicitly to avoid passing a false request size to mmap().
If mmap fails, silently fall through to the next implementation method,
rather than bleating to the postmaster log and giving up.

* mmap'ing directories fails on some platforms, and even if it works,
msync'ing the directory is quite unlikely to help, as for that matter are
the other flush implementations.  In pre_sync_fname(), just skip flush
attempts on directories.

In passing, copy-edit the comments a bit.

Stas Kelvich and myself

Improve documentation for \crosstabview.

Fix misleading syntax summary (there cannot be a space between colH and
scolH). Provide a link from the existing crosstab() function's
documentation to \crosstabview. Copy-edit the command's description.

Christoph Berg and Tom Lane

Use PG_INT32_MIN instead of reiterating the constant.

Makes no difference, but it's cleaner this way.

Michael Paquier

Provide errno-translation wrappers around bind() and listen() on Windows.

I've seen one too many "could not bind IPv4 socket: No error" log entries
from the Windows buildfarm members. Per previous discussion, this is
likely caused by the fact that we're doing nothing to translate
WSAGetLastError() to errno. Put in a wrapper layer to do that.

If this works as expected, it should get back-patched, but let's see what
happens in the buildfarm first.

Discussion: <4065.1452450340@sss.pgh.pa.us>

Fix costing for parallel aggregation.

The original patch kind of ignored the fact that we were doing something
different from a costing point of view, but nobody noticed. This patch
fixes that oversight.

David Rowley

Remove unused function GetOldestWALSendPointer from walsender code.

That unused function was introduced as a sample because synchronous
replication or replication monitoring tools might need it in the future.
Recently commit 989be08 added the function SyncRepGetOldestSyncRecPtr
which provides almost the same functionality for multiple synchronous
standbys feature. So it's time to remove that unused sample function.
This commit does that.

Redefine create_upper_paths_hook as being invoked once per upper relation.

Per discussion, this gives potential users of the hook more flexibility,
because they can build custom Paths that implement only one stage of
upper processing atop core-provided Paths for earlier stages.

Improve coding of column-name parsing in psql's new crosstabview.c.

Coverity complained about this code, not without reason because it was
rather messy. Adjust it to not scribble on the passed string; that adds
one malloc/free cycle per column name, which is going to be insignificant
in context. We can actually const-ify both the string argument and the
PGresult.

Daniel Verité, with some further cleanup by me

Avoid extra locks in GetSnapshotData if old_snapshot_threshold < 0

On a big NUMA machine with 1000 connections in saturation load
there was a performance regression due to spinlock contention, for
acquiring values which were never used. Just fill with dummy
values if we're not going to use them.

This patch has not been benchmarked yet on a big NUMA machine, but
it seems like a good idea on general principle, and it seemed to
prevent an apparent 2.2% regression on a single-socket i7 box
running 200 connections at saturation load.

Improve API of GenericXLogRegister().

Rename this function to GenericXLogRegisterBuffer() to make it clearer
what it does, and leave room for other sorts of "register" actions in
future. Also, replace its "bool isNew" argument with an integer flags
argument, so as to allow adding more flags in future without an API
break.

Alexander Korotkov, adjusted slightly by me

In generic WAL application and replay, ensure page "hole" is always zero.

The previous coding could allow the contents of the "hole" between pd_lower
and pd_upper to diverge during replay from what it had been when the update
was originally applied. This would pose a problem if checksums were in
use, and in any case would complicate forensic comparisons between master
and slave servers. So force the "hole" to contain zeroes, both at initial
application of a generically-logged action, and at replay.

Alexander Korotkov, adjusted slightly by me

Add page id to bloom index

Added to ensure that bloom index pages can be distinguished from other pages
by pg_filedump. Because there wasn't any public/production versions before,
it doesn't pay attention to any compatibility issues.

Per notice from Tom Lane

Remove unnecessary definition of _WIN64 in libpq/win32.mak.

In commit b0e40d189325dc7a54d2546245e766f8c47a7c8d, I should have just
removed the /D switch defining WIN64.  The reason the code worked before
is that all Windows64 compilers automatically predefine _WIN64.  Perhaps
at one time we had code that depended on WIN64 being defined, but it's
long gone, and we should not encourage any reappearance.  Per discussion
with Christian Ullrich.

Correct copyright for newly added genericdesc.c

It's 2016 these days (no, not entirely sure how we got here either).

Pointed out by Amit Langote

Fix whitespace

Fix _SPI_execute_plan() for CREATE TABLE IF NOT EXISTS foo AS ...

When IF NOT EXISTS was added to CREATE TABLE AS, this logic didn't get
the memo, possibly resulting in an Assert failure. It looks like there
would have been no ill effects in a non-Assert build, though. Back-patch
to 9.5 where the IF NOT EXISTS option was added.

Stas Kelvich

Fix two places that thought Windows64 is indicated by WIN64 macro.

Everyplace else thinks it's _WIN64, so make these places fall in line.

The pg_regress.c usage is not going to result in any change in behavior,
only suppressing (or not) a compiler warning about downcasting HANDLEs.
So there seems no need for back-patching there.

The libpq/win32.mak usage might represent an actual bug, if anyone were
using this script to build for Windows64, which perhaps nobody is.
Given the lack of field complaints, no back-patch here either.

pg_regress.c problem found by Christian Ullrich, the other by me.

Fix freshly-introduced PL/Python portability bug.

It turns out that those PyErr_Clear() calls I removed from plpy_elog.c
in 7e3bb080387f4143 et al were not quite as random as they appeared: they
mask a Python 2.3.x bug.  (Specifically, it turns out that PyType_Ready()
can fail if the error indicator is set on entry, and PLy_traceback's fetch
of frame.f_code may be the first operation in a session that requires the
"frame" type to be readied.  Ick.)  Put back the clear call, but in a more
centralized place closer to what it's protecting, and this time with a
comment warning what it's really for.

Per buildfarm member prairiedog.  Although prairiedog was only failing
on HEAD, it seems clearly possible for this to occur in older branches
as well, so back-patch to 9.2 the same as the previous patch.

Use static inline function for BufferGetPage()

I was initially concerned that the some of the hundreds of
references to BufferGetPage() where the literal
BGP_NO_SNAPSHOT_TEST were passed might not optimize as well as a
macro, leading to some hard-to-find performance regressions in
corner cases. Inspection of disassembled code has shown identical
code at all inspected locations, and the size difference doesn't
amount to even one byte per such call. So make it readable.

Per gripes from Álvaro Herrera and Tom Lane

Make oldSnapshotControl a pointer to a volatile structure

It was incorrectly declared as a volatile pointer to a non-volatile
structure. Eliminate the OldSnapshotControl struct definition; it
is really not needed. Pointed out by Tom Lane.

While at it, add OldSnapshotControlData to pgindent's list of
structures.

Fix whitespace

Prefix RLS regression test roles with 'regress_'

To avoid any possible overlap with existing roles on a system when
doing a 'make installcheck', use role names which start with
'regress_'.

Pointed out by Tom.

Add directory created during build to gitignore

Fix missing "volatile" in PLy_output().

Commit 5c3c3cd0a3046339 plastered "volatile" on a bunch of variables
in PLy_output(), but removed the one that actually mattered, ie the
one on "oldcontext". This allows some versions of clang to generate
code in which "oldcontext" has been trashed when control reaches the
PG_CATCH block. Per buildfarm member tick.

cpluspluscheck: Update include path

Some things in src/include/fe_utils require libpq headers, so add
libpq's include path to the command line used here.

Fix documented return type of pg_logical_emit_message() in func.sgml.

Use ereport(ERROR) instead of Assert() to emit syncrep_parser error.

The existing code would either Assert or generate an invalid
SyncRepConfig variable, neither of which is desirable. A regular
error should be thrown instead.

This commit silences compiler warning in non assertion-enabled builds.

Per report from Jeff Janes.
Suggested fix by Tom Lane.

Fix poorly thought-through code from commit 5c3c3cd0a3046339.

It's not entirely clear to me whether PyString_AsString can return
null (looks like the answer might vary between Python 2 and 3).
But in any case, this code's attempt to cope with the possibility
was quite broken, because pstrdup() neither allows a null argument
nor ever returns a null.

Moreover, the code below this point assumes that "message" is a
palloc'd string, which would not be the case for a dgettext result.

Fix both problems by doing the pstrdup step separately.

pg_dump: add missing "destroyPQExpBuffer(query)" in dumpForeignServer().

Coverity complained about this resource leak (why now, I don't know,
since it's been like that a long time). Our general policy in pg_dump
is that PQExpBuffers are worth cleaning up, so do it here too. But
don't bother with a back-patch, because it seems unlikely that very
many databases contain enough FOREIGN SERVER objects to notice.

Add comment about intentional fallthrough in switch.

Coverity complained about an apparent missing "break" in a switch
added by bb140506df605fab. The human-readable comments are pretty
clear that this is intentional, but add a standard /* FALL THRU */
comment to make it clear to tools too.

Clean up foreign-key caching code in planner.

Coverity complained that the code added by 015e88942aa50f0d lacked an
error check for SearchSysCache1 failures, which it should have. But
the code was pretty duff in other ways too, including failure to think
about whether it could really cope with arrays of different lengths.

Fix access-to-already-freed-memory issue in plpython's error handling.

PLy_elog() could attempt to access strings that Python had already freed,
because the strings that PLy_get_spi_error_data() returns are simply
pointers into storage associated with the error "val" PyObject.  That's
fine at the instant PLy_get_spi_error_data() returns them, but just after
that PLy_traceback() intentionally releases the only refcount on that
object, allowing it to be freed --- so that the strings we pass to
ereport() are dangling pointers.

In principle this could result in garbage output or a coredump.  In
practice, I think the risk is pretty low, because there are no Python
operations between where we decrement that refcount and where we use the
strings (and copy them into PG storage), and thus no reason for Python
to recycle the storage.  Still, it's clearly hazardous, and it leads to
Valgrind complaints when running under a Valgrind that hasn't been
lobotomized to ignore Python memory allocations.

The code was a mess anyway: we fetched the error data out of Python
(clearing Python's error indicator) with PyErr_Fetch, examined it, pushed
it back into Python with PyErr_Restore (re-setting the error indicator),
then immediately pulled it back out with another PyErr_Fetch.  Just to
confuse matters even more, there were some gratuitous-and-yet-hazardous
PyErr_Clear calls in the "examine" step, and we didn't get around to doing
PyErr_NormalizeException until after the second PyErr_Fetch, making it even
less clear which object was being manipulated where and whether we still
had a refcount on it.  (If PyErr_NormalizeException did substitute a
different "val" object, it's possible that the problem could manifest for
real, because then we'd be doing assorted Python stuff with no refcount
on the object we have string pointers into.)

So, rearrange all that into some semblance of sanity, and don't decrement
the refcount on the Python error objects until the end of PLy_elog().
In HEAD, I failed to resist the temptation to reformat some messy bits
from 5c3c3cd0a3046339 along the way.

Back-patch as far as 9.2, because the code is substantially the same
that far back.  I believe that 9.1 has the bug as well; but the code
around it is rather different and I don't want to take a chance on
breaking something for what seems a low-probability problem.

Avoid the use of a separate spinlock to protect a LWLock's wait queue.

Previously we used a spinlock, in adition to the atomically manipulated
->state field, to protect the wait queue. But it's pretty simple to
instead perform the locking using a flag in state.

Due to 6150a1b0 BufferDescs, on platforms (like PPC) with > 1 byte
spinlocks, increased their size above 64byte. As 64 bytes are the size
we pad allocated BufferDescs to, this can increase false sharing;
causing performance problems in turn. Together with the previous commit
this reduces the size to <= 64 bytes on all common platforms.

Author: Andres Freund
Discussion: CAA4eK1+ZeB8PMwwktf+3bRS0Pt4Ux6Rs6Aom0uip8c6shJWmyg@mail.gmail.com
20160327121858.zrmrjegmji2ymnvr@alap3.anarazel.de

Allow Pin/UnpinBuffer to operate in a lockfree manner.

Pinning/Unpinning a buffer is a very frequent operation; especially in
read-mostly cache resident workloads. Benchmarking shows that in various
scenarios the spinlock protecting a buffer header's state becomes a
significant bottleneck. The problem can be reproduced with pgbench -S on
larger machines, but can be considerably worse for queries which touch
the same buffers over and over at a high frequency (e.g. nested loops
over a small inner table).

To allow atomic operations to be used, cram BufferDesc's flags,
usage_count, buf_hdr_lock, refcount into a single 32bit atomic variable;
that allows to manipulate them together using 32bit compare-and-swap
operations. This requires reducing MAX_BACKENDS to 2^18-1 (which could
be lifted by using a 64bit field, but it's not a realistic configuration
atm).

As not all operations can easily implemented in a lockfree manner,
implement the previous buf_hdr_lock via a flag bit in the atomic
variable. That way we can continue to lock the header in places where
it's needed, but can get away without acquiring it in the more frequent
hot-paths. There's some additional operations which can be done without
the lock, but aren't in this patch; but the most important places are
covered.

As bufmgr.c now essentially re-implements spinlocks, abstract the delay
logic from s_lock.c into something more generic. It now has already two
users, and more are coming up; there's a follupw patch for lwlock.c at
least.

This patch is based on a proof-of-concept written by me, which Alexander
Korotkov made into a fully working patch; the committed version is again
revised by me. Benchmarking and testing has, amongst others, been
provided by Dilip Kumar, Alexander Korotkov, Robert Haas.

On a large x86 system improvements for readonly pgbench, with a high
client count, of a factor of 8 have been observed.

Author: Alexander Korotkov and Andres Freund
Discussion: 2400449.GjM57CE0Yg@dinodell

Improve contrib/bloom regression test using code coverage info.

Originally, this test created a 100000-row test table, which made it
run rather slowly compared to other contrib tests. Investigation with
gcov showed that we got no further improvement in code coverage after
the first 700 or so rows, making the large table 99% a waste of time.
Cut it back to 2000 rows to fix the runtime problem and still leave
some headroom for testing behaviors that may appear later.

A closer look at the gcov results showed that the main coverage
omissions in contrib/bloom occurred because the test never filled more
than one entry in the notFullPage array; which is unsurprising because
it exercised index cleanup only in the scenario of complete table
deletion, allowing every page in the index to become deleted rather
than not-full. Add testing that allows the not-full path to be
exercised as well.

Also, test the amvalidate function, because blvalidate.c had zero
coverage without that, and besides it's a good idea to check for
mistakes in the bloom opclass definitions.

Fix possible NULL dereference in ExecAlterObjectDependsStmt

I used the wrong variable here. Doesn't make a difference today because
the only plausible caller passes a non-NULL variable, but someday it
will be wrong, and even today's correctness is subtle: the caller that
does pass a NULL is never invoked because of object type constraints.
Surely not a condition to rely on.

Noted by Coverity

Further minor improvement in generic_xlog.c: always say REGBUF_STANDARD.

Since we're requiring pages handled by generic_xlog.c to be standard
format, specify REGBUF_STANDARD when doing a full-page image, so that
xloginsert.c can compress out the "hole" between pd_lower and pd_upper.
Given the current API in which this path will be taken only for a newly
initialized page, the hole is likely to be particularly large in such
cases, so that this oversight could easily be performance-significant.
I don't notice any particular change in the runtime of contrib/bloom's
regression test, though.

Micro-optimize GenericXLogFinish().

Make the inner comparison loops of computeDelta() as tight as possible by
pulling considerations of valid and invalid ranges out of the inner loops,
and extending a match or non-match detection as far as possible before
deciding what to do next.  To keep this tractable, give up the possibility
of merging fragments across the pd_lower to pd_upper gap.  The fraction of
pages where that could happen (ie, there are 4 or fewer bytes in the gap,
*and* data changes immediately adjacent to it on both sides) is too small
to be worth spending cycles on.

Also, avoid two BLCKSZ-length memcpy()s by computing the delta before
moving data into the target buffer, instead of after.  This doesn't save
nearly as many cycles as being tenser about computeDelta(), but it still
seems worth doing.

On my machine, this patch cuts a full 40% off the runtime of
contrib/bloom's regression test.

Fix PL/Python ereport() test to work on Python 2.3.

Per buildfarm.

Pavel Stehule

Get rid of GenericXLogUnregister().

This routine is unsafe as implemented, because it invalidates the page
image pointers returned by previous GenericXLogRegister() calls.

Rather than complicate the API or the implementation to avoid that,
let's just get rid of it; the use-case for having it seems much
too thin to justify a lot of work here.

While at it, do some wordsmithing on the SGML docs for generic WAL.

Get rid of blinsert()'s use of GenericXLogUnregister().

That routine is dangerous, and unnecessary once we get rid of this
one caller.

In passing, fix failure to clean up temp memory context, or switch
back to caller's context, during slowest exit path.

Code review/prettification for generic_xlog.c.

Improve commentary, use more specific names for the delta fields,
const-ify pointer arguments where possible, avoid assuming that
initializing only the first element of a local array will guarantee
that the remaining elements end up as we need them. (I think that
code in generic_redo actually worked, but only because InvalidBuffer
is zero; this is a particularly ugly way of depending on that ...)

Run pgindent on generic_xlog.c.

This code desperately needs some micro-optimization, and I'd like it
to be formatted a bit more nicely while I work on it.

Fix typo in C comment.

Turn special page pointer validation to static inline function

Inclusion of multiple macros inside another macro was pushing MSVC
past its size liimit. Reported by buildfarm.

Move \crosstabview regression tests to a separate file

It cannot run in the same parallel group as misc, because it creates a
table which is unpredictably visible in that test.

Per buildfarm member crake.

Support \crosstabview in psql

\crosstabview is a completely different way to display results from a
query: instead of a vertical display of rows, the data values are placed
in a grid where the column and row headers come from the data itself,
similar to a spreadsheet.

The sort order of the horizontal header can be specified by using
another column in the query, and the vertical header determines its
ordering from the order in which they appear in the query.

This only allows displaying a single value in each cell.  If more than
one value correspond to the same cell, an error is thrown.  Merging of
values can be done in the query itself, if necessary.  This may be
revisited in the future.

Author: Daniel Verité
Reviewed-by: Pavel Stehule, Dean Rasheed

Add snapshot_too_old to NSVC @contrib_excludes

The buildfarm showed failure for Windows MSVC builds due to this
omission. This might not be the only problem with the Makefile for
this feature, but hopefully this will get it past the immediate
problem.

Fix suggested by Tom Lane

Expose more out/readfuncs support functions.

Previously bcac23d exposed a subset of support functions, namely the
ones Kaigai found useful. In
20160304193704.elq773pyg5fyl3mi@alap3.anarazel.de I mentioned that
there's some functions missing to use the facility in an external
project.

To avoid having to add functions piecemeal, add all the functions which
are used to define READ_* and WRITE_* macros; users of the extensible
node functionality are likely to need these. Additionally expose
outDatum(), which doesn't have it's own WRITE_ macro, as it needs
information from the embedding struct.

Discussion: 20160304193704.elq773pyg5fyl3mi@alap3.anarazel.de

Create default roles

This creates an initial set of default roles which administrators may
use to grant access to, historically, superuser-only functions. Using
these roles instead of granting superuser access reduces the number of
superuser roles required for a system. Documention for each of the
default roles has been added to user-manag.sgml.

Bump catversion to 201604082, as we had a commit that bumped it to
201604081 and another that set it back to 201604071...

Reviews by José Luis Tallón and Robert Haas

Reserve the "pg_" namespace for roles

This will prevent users from creating roles which begin with "pg_" and
will check for those roles before allowing an upgrade using pg_upgrade.

This will allow for default roles to be provided at initdb time.

Reviews by José Luis Tallón and Robert Haas

Fix improper usage of 'dump' bitmap

Now that 'dump' is a bitmap, we can't simply set it to 'true'.

Noticed while debugging the prior issue.

Add the "snapshot too old" feature

This feature is controlled by a new old_snapshot_threshold GUC.  A
value of -1 disables the feature, and that is the default.  The
value of 0 is just intended for testing.  Above that it is the
number of minutes a snapshot can reach before pruning and vacuum
are allowed to remove dead tuples which the snapshot would
otherwise protect.  The xmin associated with a transaction ID does
still protect dead tuples.  A connection which is using an "old"
snapshot does not get an error unless it accesses a page modified
recently enough that it might not be able to produce accurate
results.

This is similar to the Oracle feature, and we use the same SQLSTATE
and error message for compatibility.

Modify BufferGetPage() to prepare for "snapshot too old" feature

This patch is a no-op patch which is intended to reduce the chances
of failures of omission once the functional part of the "snapshot
too old" patch goes in.  It adds parameters for snapshot, relation,
and an enum to specify whether the snapshot age check needs to be
done for the page at this point.  This initial patch passes NULL
for the first two new parameters and BGP_NO_SNAPSHOT_TEST for the
third.  The follow-on patch will change the places where the test
needs to be made.

In dumpTable, re-instate the skipping logic

Pretty sure I removed this based on some incorrect thinking that it was
no longer possible to reach this point for a table which will not be
dumped, but that's clearly wrong.

Pointed out on IRC by Erik Rijkers.

Revert CREATE INDEX ... INCLUDING ...

It's not ready yet, revert two commits
690c543550b0d2852060c18d270cdb534d339d9a - unstable test output
386e3d7609c49505e079c40c65919d99feb82505 - patch itself

Add authentication parameters compat_realm and upn_usename for SSPI

These parameters are available for SSPI authentication only, to make
it possible to make it behave more like "normal gssapi", while
making it possible to maintain compatibility.

compat_realm is on by default, but can be turned off to make the
authentication use the full Kerberos realm instead of the NetBIOS name.

upn_username is off by default, and can be turned on to return the users
Kerberos UPN rather than the SAM-compatible name (a user in Active
Directory can have both a legacy SAM-compatible username and a new
Kerberos one. Normally they are the same, but not always)

Author: Christian Ullrich
Reviewed by: Robbie Harwood, Alvaro Herrera, me

Fix possible use of uninitialised value in ts_headline()

Found during investigation of failure of skink buildfarm member and its
valgrind report.

Backpatch to all supported branches

Fix unstable regression test output.

Output order from the pg_indexes view might vary depending on the
phase of the moon, so add ORDER BY to ensure stable results of tests
added by commit 386e3d7609c49505e079c40c65919d99feb82505.
Per buildfarm.

Distrust external OpenSSL clients; clear err queue

OpenSSL has an unfortunate tendency to mix per-session state error
handling with per-thread error handling.  This can cause problems when
programs that link to libpq with OpenSSL enabled have some other use of
OpenSSL; without care, one caller of OpenSSL may cause problems for the
other caller.  Backend code might similarly be affected, for example
when a third party extension independently uses OpenSSL without taking
the appropriate precautions.

To fix, don't trust other users of OpenSSL to clear the per-thread error
queue.  Instead, clear the entire per-thread queue ahead of certain I/O
operations when it appears that there might be trouble (these I/O
operations mostly need to call SSL_get_error() to check for success,
which relies on the queue being empty).  This is slightly aggressive,
but it's pretty clear that the other callers have a very dubious claim
to ownership of the per-thread queue.  Do this is both frontend and
backend code.

Finally, be more careful about clearing our own error queue, so as to
not cause these problems ourself.  It's possibly that control previously
did not always reach SSLerrmessage(), where ERR_get_error() was supposed
to be called to clear the queue's earliest code.  Make sure
ERR_get_error() is always called, so as to spare other users of OpenSSL
the possibility of similar problems caused by libpq (as opposed to
problems caused by a third party OpenSSL library like PHP's OpenSSL
extension).  Again, do this is both frontend and backend code.

See bug #12799 and https://bugs.php.net/bug.php?id=68276

Based on patches by Dave Vitek and Peter Eisentraut.

From: Peter Geoghegan <pg@bowt.ie>

Add BSD authentication method.

Create a "bsd" auth method that works the same as "password" so far as
clients are concerned, but calls the BSD Authentication service to
check the password. This is currently only available on OpenBSD.

Marisa Emerson, reviewed by Thomas Munro

Add combine functions for various floating-point aggregates.

This allows parallel aggregation to use them. It may seem surprising
that we use float8_combine for both float4_accum and float8_accum
transition functions, but that's because those functions differ only
in the type of the non-transition-state argument.

Haribabu Kommi, reviewed by David Rowley and Tomas Vondra

Fix output of regression test of contrib/tsearch2

Just forget to add in 1ec4c7c055ca045c5df6352a4cdacd9aa778e598

Restore original tsquery operation numbering.

As noticed by Tom Lane changing operation's number in commit
bb140506df605fab58f48926ee1db1f80bdafb59 causes on-disk format incompatibility.
Revert to previous numbering, that is reason to add special array to store
priorities of operation. Also it reverts order of tsquery to previous.

Author: Dmitry Ivanov

Silence warning from modern perl about unescaped braces

CREATE INDEX ... INCLUDING (column[, ...])

Now indexes (but only B-tree for now) can contain "extra" column(s) which
doesn't participate in index structure, they are just stored in leaf
tuples. It allows to use index only scan by using single index instead
of two or more indexes.

Author: Anastasia Lubennikova with minor editorializing by me
Reviewers: David Rowley, Peter Geoghegan, Jeff Janes

Replace printf format %i by %d

see also ce8d7bb6440710058503d213b2aafcdf56a5b481

Turn down MSVC compiler verbosity

Most of what is produced by the detailed verbosity level is of no
interest at all, so switch to the normal level for more usable output.

Christian Ullrich

Backpatch to all live branches

Fix printf format

Fix multiple bugs in tablespace symlink removal.

Don't try to examine S_ISLNK(st.st_mode) after a failed lstat().
It's undefined.

Also, if the lstat() reported ENOENT, we do not wish that to be a hard
error, but the code might nonetheless treat it as one (giving an entirely
misleading error message, too) depending on luck-of-the-draw as to what
S_ISLNK() returned.

Don't throw error for ENOENT from rmdir(), either. (We're not really
expecting ENOENT because we just stat'd the file successfully; but
if we're going to allow ENOENT in the symlink code path, surely the
directory code path should too.)

Generate an appropriate errcode for its-the-wrong-type-of-file complaints.
(ERRCODE_SYSTEM_ERROR doesn't seem appropriate, and failing to write
errcode() around it certainly doesn't work, and not writing an errcode
at all is not per project policy.)

Valgrind noticed the undefined S_ISLNK result; the other problems emerged
while reading the code in the area.

All of this appears to have been introduced in 8f15f74a44f68f9c.
Back-patch to 9.5 where that commit appeared.

Document which aggregates support partial mode.

David Rowley, reviewed by Tomas Vondra

Enhanced custom error in PLPythonu

Patch adds a new, more rich, way to emit error message or exception from
PL/Pythonu code.

Author: Pavel Stehule
Reviewers: Catalin Iacob, Peter Eisentraut, Jim Nasby

Increase maximum number of clog buffers.

Benchmarking has shown that the current number of clog buffers limits
scalability. We've previously increased the number in 33aaa139, but
that's not sufficient with a large number of clients.

We've benchmarked the cost of increasing the limit by benchmarking worst
case scenarios; testing showed that 128 buffers don't cause a
regression, even in contrived scenarios, whereas 256 does

There are a number of more complex patches flying around to address
various clog scalability problems, but this is simple enough that we can
get it into 9.6; and is beneficial even after those patches have been
applied.

It is a bit unsatisfactory to increase this in small steps every few
releases, but a better solution seems to require a rewrite of slru.c;
not something done quickly.

Author: Amit Kapila and Andres Freund
Discussion: CAA4eK1+-=18HOrdqtLXqOMwZDbC_15WTyHiFruz7BvVArZPaAw@mail.gmail.com

Add a 'parallel_degree' reloption.

The code that estimates what parallel degree should be uesd for the
scan of a relation is currently rather stupid, so add a parallel_degree
reloption that can be used to override the planner's rather limited
judgement.

Julien Rouhaud, reviewed by David Rowley, James Sewell, Amit Kapila,
and me. Some further hacking by me.

Attempt to fix breakage due to declaration following code.

Per Tom Lane and the buildfarm.