granicus.if.org Git - postgresql/log

Make the to_reg*() functions accept text not cstring.

Using cstring as the input type was a poor decision, because that's not
really a full-fledged type. In particular, it lacks implicit coercions
from text or varchar, meaning that usages like to_regproc('foo'||'bar')
wouldn't work; basically the only case that did work without explicit
casting was a simple literal constant argument.

The lack of field complaints about this suggests that hardly anyone
is using these functions, so hopefully fixing it won't cause much of
a compatibility problem. They've only been there since 9.4, anyway.

Petr Korobeinikov

Make pg_shseclabel available in early backend startup

While the in-core authentication mechanism doesn't need to access
pg_shseclabel at all, it's reasonable to think that an authentication
hook will want to look at the label for the role logging in, or for rows
in other catalogs used during the authentication phase of startup.

Catalog version bumped, because this changes the "is nailed" status for
pg_shseclabel.

Author: Adam Brightwell

Add to_regnamespace() and to_regrole() to the documentation.

Commits cb9fa802b32b222b and 0c90f6769de6a60f added these functions,
but did not bother with documentation.

Convert psql's tab completion for backslash commands to the new style.

This requires adding some more infrastructure to handle both case-sensitive
and case-insensitive matching, as well as the ability to match a prefix of
a previous word. So it ends up being about a wash line-count-wise, but
it's just as big a readability win here as in the SQL tab completion rules.

Michael Paquier, some adjustments by me

In psql's tab completion, change most TailMatches patterns to Matches.

In the refactoring in commit d37b816dc9e8f976c8913296781e08cbd45c5af1,
we mostly kept to the original design whereby only the last few words
on the line were matched to identify a completable pattern.  However,
after commit d854118c8df8c413d069f7e88bb01b9e18e4c8ed, there's really
no reason to do it like that: where it's sensible, we can use patterns
that expect to match the entire input line.  And mostly, it's sensible.
Matching the entire line greatly reduces the odds of a false match that
leads to offering irrelevant completions.  Moreover (though I've not
tried to measure this), it should make tab completion faster since
many of the patterns will be discarded after a single integer comparison
that finds that the wrong number of words appear on the line.

There are certain identifiable places where we still need to use
TailMatches because the statement in question is allowed to appear
embedded in a larger statement.  These are just a small minority of
the existing patterns, though, so the benefit of switching where
possible is large.

It's possible that this patch has removed some within-line matching
behaviors that are in fact desirable, but we can put those back when
we get complaints.  Most of the removed behaviors are certainly silly.

Michael Paquier, with some further adjustments by me

Docs: provide a concrete discussion and example for RLS race conditions.

Commit 43cd468cf01007f3 added some wording to create_policy.sgml purporting
to warn users against a race condition of the sort that had been noted some
time ago by Peter Geoghegan. However, that warning was far too vague to be
useful (or at least, I completely failed to grasp what it was on about).
Since the problem case occurs with a security design pattern that lots of
people are likely to try to use, we need to be as clear as possible about
it. Provide a concrete example in the main-line docs in place of the
original warning.

Adjust behavior of row_security GUC to match the docs.

Some time back we agreed that row_security=off should not be a way to
bypass RLS entirely, but only a way to get an error if it was being
applied.  However, the code failed to act that way for table owners.
Per discussion, this is a must-fix bug for 9.5.0.

Adjust the logic in rls.c to behave as expected; also, modify the
error message to be more consistent with the new interpretation.
The regression tests need minor corrections as well.  Also update
the comments about row_security in ddl.sgml to be correct.  (The
official description of the GUC in config.sgml is already correct.)

I failed to resist the temptation to do some other very minor
cleanup as well, such as getting rid of a duplicate extern declaration.

Fix typo in comment.

Masahiko Sawada

Fix regrole and regnamespace output functions to do quoting, too.

We discussed this but somehow failed to implement it...

Fix regrole and regnamespace types to honor quoting like other reg* types.

Aside from any consistency arguments, this is logically necessary because
the I/O functions for these types also handle numeric OID values.  Without
a quoting rule it is impossible to distinguish numeric OIDs from role or
namespace names that happen to contain only digits.

Also change the to_regrole and to_regnamespace functions to dequote their
arguments.  While not logically essential, this seems like a good idea
since the other to_reg* functions do it.  Anyone who really wants raw
lookup of an uninterpreted name can fall back on the time-honored solution
of (SELECT oid FROM pg_namespace WHERE nspname = whatever).

Report and patch by Jim Nasby, reviewed by Michael Paquier

Fix bogus lock release in RemovePolicyById and RemoveRoleFromObjectPolicy.

Can't release the AccessExclusiveLock on the target table until commit.
Otherwise there is a race condition whereby other backends might service
our cache invalidation signals before they can actually see the updated
catalog rows.

Just to add insult to injury, RemovePolicyById was closing the rel (with
incorrect lock drop) and then passing the now-dangling rel pointer to
CacheInvalidateRelcache. Probably the reason this doesn't fall over on
CLOBBER_CACHE buildfarm members is that some outer level of the DROP logic
is still holding the rel open ... but it'd have bit us on the arse
eventually, no doubt.

Do some copy-editing on the docs for row-level security.

Clarifications, markup improvements, corrections of misleading or
outright wrong statements.

Guard against null arguments in binary_upgrade_create_empty_extension().

The CHECK_IS_BINARY_UPGRADE macro is not sufficient security protection
if we're going to dereference pass-by-reference arguments before it.

But in any case we really need to explicitly check PG_ARGISNULL for all
the arguments of a non-strict function, not only the ones we expect null
values for.

Oversight in commits 30982be4e5019684e1772dd9170aaa53f5a8e894 and
f92fc4c95ddcc25978354a8248d3df22269201bc. Found by Andreas Seltenreich.
(The other usages in pg_upgrade_support.c seem safe.)

Do some copy-editing on the docs for replication origins.

Minor grammar and markup improvements.

Do a final round of copy-editing on the 9.5 release notes.

Fix treatment of *lpNumberOfBytesRecvd == 0: that's a completion condition.

pgwin32_recv() has treated a non-error return of zero bytes from WSARecv()
as being a reason to block ever since the current implementation was
introduced in commit a4c40f140d23cefb.  However, so far as one can tell
from Microsoft's documentation, that is just wrong: what it means is
graceful connection closure (in stream protocols) or receipt of a
zero-length message (in message protocols), and neither case should result
in blocking here.  The only reason the code worked at all was that control
then fell into the retry loop, which did *not* treat zero bytes specially,
so we'd get out after only wasting some cycles.  But as of 9.5 we do not
normally reach the retry loop and so the bug is exposed, as reported by
Shay Rojansky and diagnosed by Andres Freund.

Remove the unnecessary test on the byte count, and rearrange the code
in the retry loop so that it looks identical to the initial sequence.

Back-patch to 9.5.  The code is wrong all the way back, AFAICS, but
since it's relatively harmless in earlier branches we'll leave it alone.

Teach pg_dump to quote reloption values safely.

Commit c7e27becd2e6eb93 fixed this on the backend side, but we neglected
the fact that several code paths in pg_dump were printing reloptions
values that had not gotten massaged by ruleutils. Apply essentially the
same quoting logic in those places, too.

Fix overly-strict assertions in spgtextproc.c.

spg_text_inner_consistent is capable of reconstructing an empty string
to pass down to the next index level; this happens if we have an empty
string coming in, no prefix, and a dummy node label.  (In practice, what
is needed to trigger that is insertion of a whole bunch of empty-string
values.)  Then, we will arrive at the next level with in->level == 0
and a non-NULL (but zero length) in->reconstructedValue, which is valid
but the Assert tests weren't expecting it.

Per report from Andreas Seltenreich.  This has no impact in non-Assert
builds, so should not be a problem in production, but back-patch to
all affected branches anyway.

In passing, remove a couple of useless variable initializations and
shorten the code by not duplicating DatumGetPointer() calls.

Adjust back-branch release note description of commits a2a718b22 et al.

As pointed out by Michael Paquier, recovery_min_apply_delay didn't exist
in 9.0-9.3, making the release note text not very useful. Instead make it
talk about recovery_target_xid, which did exist then.

9.0 is already out of support, but we can fix the text in the newer
branches' copies of its release notes.

Make copyright.pl cope with nonstandard case choices in copyright notices.

The need for this is shown by the files it missed in Bruce's recent run.
I fixed it so that it will actually adjust the case when needed.

In passing, also make it skip .po files, since those will just get
overwritten anyway from the translation repository.

Update copyright for 2016

On closer inspection, the reason copyright.pl was missing files is
that it is looking for 'Copyright (c)' and they had 'Copyright (C)'.
Fix that, and update a couple more that grepping for that revealed.

Update copyright for 2016

Manually fix some copyright lines missed by the automated script.

Update copyright for 2016

Backpatch certain files through 9.1

Cover heap_page_prune_opt()'s cleanup lock tactic in README.

Jeff Janes, reviewed by Jim Nasby.

Teach flatten_reloptions() to quote option values safely.

flatten_reloptions() supposed that it didn't really need to do anything
beyond inserting commas between reloption array elements.  However, in
principle the value of a reloption could be nearly anything, since the
grammar allows a quoted string there.  Any restrictions on it would come
from validity checking appropriate to the particular option, if any.

A reloption value that isn't a simple identifier or number could thus lead
to dump/reload failures due to syntax errors in CREATE statements issued
by pg_dump.  We've gotten away with not worrying about this so far with
the core-supported reloptions, but extensions might allow reloption values
that cause trouble, as in bug #13840 from Kouhei Sutou.

To fix, split the reloption array elements explicitly, and then convert
any value that doesn't look like a safe identifier to a string literal.
(The details of the quoting rule could be debated, but this way is safe
and requires little code.)  While we're at it, also quote reloption names
if they're not safe identifiers; that may not be a likely problem in the
field, but we might as well try to be bulletproof here.

It's been like this for a long time, so back-patch to all supported
branches.

Kouhei Sutou, adjusted some by me

Add some more defenses against silly estimates to gincostestimate().

A report from Andy Colson showed that gincostestimate() was not being
nearly paranoid enough about whether to believe the statistics it finds in
the index metapage.  The problem is that the metapage stats (other than the
pending-pages count) are only updated by VACUUM, and in the worst case
could still reflect the index's original empty state even when it has grown
to many entries.  We attempted to deal with that by scaling up the stats to
match the current index size, but if nEntries is zero then scaling it up
still gives zero.  Moreover, the proportion of pages that are entry pages
vs. data pages vs. pending pages is unlikely to be estimated very well by
scaling if the index is now orders of magnitude larger than before.

We can improve matters by expanding the use of the rule-of-thumb estimates
I introduced in commit 7fb008c5ee59b040: if the index has grown by more
than a cutoff amount (here set at 4X growth) since VACUUM, then use the
rule-of-thumb numbers instead of scaling.  This might not be exactly right
but it seems much less likely to produce insane estimates.

I also improved both the scaling estimate and the rule-of-thumb estimate
to account for numPendingPages, since it's reasonable to expect that that
is accurate in any case, and certainly pages that are in the pending list
are not either entry or data pages.

As a somewhat separate issue, adjust the estimation equations that are
concerned with extra fetches for partial-match searches.  These equations
suppose that a fraction partialEntries / numEntries of the entry and data
pages will be visited as a consequence of a partial-match search.  Now,
it's physically impossible for that fraction to exceed one, but our
estimate of partialEntries is mostly bunk, and our estimate of numEntries
isn't exactly gospel either, so we could arrive at a silly value.  In the
example presented by Andy we were coming out with a value of 100, leading
to insane cost estimates.  Clamp the fraction to one to avoid that.

Like the previous patch, back-patch to all supported branches; this
problem can be demonstrated in one form or another in all of them.

Fix comments about WAL rule "write xlog before data" versus pg_multixact.

Recovery does not achieve its goal of zeroing all pg_multixact entries
whose accompanying WAL records never reached disk.  Remove that claim
and justify its expendability.  Detail the need for TrimMultiXact(),
which has little in common with the TrimCLOG() rationale.  Merge two
tightly-related comments.  Stop presenting pg_multixact as specific to
heap_lock_tuple(); PostgreSQL 9.3 extended its use to heap_update().

Noticed while investigating a report from Andres Freund.

doc: Remove redundant duplicate URLs from ulink elements

Empty ulink elements default to displaying the URL, so there is no need
to specify the URL again. This was already done for most occurrences,
but some cases didn't follow this convention.

doc: Add index entries and better documentation link for Linux OOM

Add a comment noting that FDWs don't have to implement EXCEPT or LIMIT TO.

postgresImportForeignSchema pays attention to IMPORT's EXCEPT and LIMIT TO
options, but only as an efficiency hack, not for correctness' sake. The
FDW documentation does explain that, but someone using postgres_fdw.c
as a coding guide might not remember it, so let's add a comment here.
Per question from Regina Obe.

Fix ALTER OPERATOR to update dependencies properly.

Fix an oversight in commit 321eed5f0f7563a0: replacing an operator's
selectivity functions needs to result in a corresponding update in
pg_depend. We have a function that can handle that, but it was not
called by AlterOperator().

To fix this without enlarging pg_operator.h's #include list beyond
what clients can safely include, split off the function definitions
into a new file pg_operator_fn.h, similarly to what we've done for
some other catalog header files. It's not entirely clear whether
any client-side code needs to include pg_operator.h, but it seems
prudent to assume that there is some such code somewhere.

Dept of second thoughts: the !scan_all exit mustn't increase scanned_pages.

In the extreme edge case where contended pages are the only ones that
escape being scanned, the previous commit would have allowed us to think
that relfrozenxid could be advanced, which is exactly wrong.

Avoid useless truncation attempts during VACUUM.

VACUUM can skip heap pages altogether when there's a run of consecutive
pages that are all-visible according to the visibility map.  This causes it
to not update its nonempty_pages count, just as if those pages were empty,
which means that at the end we will think they are candidates for deletion.
Thus, we may take the table's AccessExclusive lock only to find that no
pages are really truncatable.  This usually causes no real problems on a
master server, thanks to the lock being acquired only conditionally; but on
hot-standby servers, the same lock must be acquired unconditionally which
can result in unnecessary query cancellations.

To improve matters, force examination of the table's last page whenever
we reach there with a nonempty_pages count that would allow a truncation
attempt.  If it's not empty, we'll advance nonempty_pages and thereby
prevent the truncation attempt.

If we are unable to acquire cleanup lock on that page, there's no need to
force it, unless we're doing an anti-wraparound vacuum.  We can just check
for tuples with a shared buffer lock and then give up.  (When we are doing
an anti-wraparound vacuum, and decide it's okay to skip the page because it
contains no freezable tuples, this patch still improves matters because
nonempty_pages is properly updated, which it was not before.)

Since only the last page is special-cased in this way, we might attempt a
truncation that will release many fewer pages than the normal heuristic
would suggest; at worst, only one page would be truncated.  But that seems
all right, because the situation won't repeat during the next vacuum.
The real problem with the old logic is that the useless truncation attempt
happens every time we vacuum, so long as the state of the last few dozen
pages doesn't change.

This is a longstanding deficiency, but since the consequences aren't very
severe in most scenarios, I'm not going to risk a back-patch.

Jeff Janes and Tom Lane

Minor hacking on contrib/cube documentation.

Improve markup, particularly of the table of functions; add or improve
examples for some of the functions; wordsmith some of the function
descriptions.

Add some comments about division of labor between rewriter and planner.

The rationale for the way targetlist processing is done wasn't clearly
stated anywhere, and I for one had forgotten some of the details. Having
just painfully re-learned them, add some breadcrumbs for the next person.

Put back one copyObject() in rewriteTargetView().

Commit 6f8cb1e23485bd6d tried to centralize rewriteTargetView's copying
of a target view's Query struct.  However, it ignored the fact that the
jointree->quals field was used twice.  This only accidentally failed to
fail immediately because the same ChangeVarNodes mutation is applied in
both cases, so that we end up with logically identical expression trees
for both uses (and, as the code stands, the second ChangeVarNodes call
actually does nothing).  However, we end up linking *physically*
identical expression trees into both an RTE's securityQuals list and
the WithCheckOption list.  That's pretty dangerous, mainly because
prepsecurity.c is utterly cavalier about further munging such structures
without copying them first.

There may be no live bug in HEAD as a consequence of the fact that we apply
preprocess_expression in between here and prepsecurity.c, and that will
make a copy of the tree anyway.  Or it may just be that the regression
tests happen to not trip over it.  (I noticed this only because things
fell over pretty badly when I tried to relocate the planner's call of
expand_security_quals to before expression preprocessing.)  In any case
it's very fragile because if anyone tried to make the securityQuals and
WithCheckOption trees diverge before we reach preprocess_expression, it
would not work.  The fact that the current code will preprocess
securityQuals and WithCheckOptions lists at completely different times in
different query levels does nothing to increase my trust that that can't
happen.

In view of the fact that 9.5.0 is almost upon us and the aforesaid commit
has seen exactly zero field testing, the prudent course is to make an extra
copy of the quals so that the behavior is not different from what has been
in the field during beta.

Rename (new|old)estCommitTs to (new|old)estCommitTsXid

The variables newestCommitTs and oldestCommitTs sound as if they are
timestamps, but in fact they are the transaction Ids that correspond
to the newest and oldest timestamps rather than the actual timestamps.
Rename these variables to reflect that they are actually xids: to wit
newestCommitTsXid and oldestCommitTsXid respectively. Also modify
related code in a similar fashion, particularly the user facing output
emitted by pg_controldata and pg_resetxlog.

Complaint and patch by me, review by Tom Lane and Alvaro Herrera.
Backpatch to 9.5 where these variables were first introduced.

Code and docs review for cube kNN support.

Commit 33bd250f6c4cc309f4eeb657da80f1e7743b3e5c could have done with
some more review:

Adjust coding so that compilers unfamiliar with elog/ereport don't complain
about uninitialized values.

Fix misuse of PG_GETARG_INT16 to retrieve arguments declared as "integer"
at the SQL level. (This was evidently copied from cube_ll_coord and
cube_ur_coord, but those were wrong too.)

Fix non-style-guide-conforming error messages.

Fix underparenthesized if statements, which pgindent would have made a
hash of, and remove some unnecessary parens elsewhere.

Run pgindent over new code.

Revise documentation: repeated accretion of more operators without any
rethinking of the text already there had left things in a bit of a mess.
Merge all the cube operators into one table and adjust surrounding text
appropriately.

David Rowley and Tom Lane

Document brin_summarize_new_pages

Pointer out by Jeff Janes

Document the exponentiation operator as associating left to right.

Common mathematical convention is that exponentiation associates right to
left. We aren't going to change the parser for this, but we could note
it in the operator's description. (It's already noted in the operator
precedence/associativity table, but users might not look there.)
Per bug #13829 from Henrik Pauli.

Fix omission of -X (--no-psqlrc) in some psql invocations.

As of commit d5563d7df, psql -c no longer implies -X, but not all of
our regression testing scripts had gotten that memo.

To ensure consistency of results across different developers, make
sure that *all* invocations of psql in all scripts in our tree
use -X, even where this is not what previously happened.

Michael Paquier and Tom Lane

doc: pg_committs -> pg_commit_ts

Reported by: Alain Laporte (#13836)

Update documentation about pseudo-types.

Tone down an overly strong statement about which pseudo-types PLs are
likely to allow. Add "event_trigger" to the list, as well as
"pg_ddl_command" in 9.5/HEAD. Back-patch to 9.3 where event_trigger
was added.

Fix translation domain in pg_basebackup

For some reason, we've been overlooking the fact that pg_receivexlog
and pg_recvlogical are using wrong translation domains all along,
so their output hasn't ever been translated. The right domain is
pg_basebackup, not their own executable names.

Noticed by Ioseph Kim, who's been working on the Korean translation.

Backpatch pg_receivexlog to 9.2 and pg_recvlogical to 9.4.

Add forgotten CHECK_FOR_INTERRUPT calls in pgcrypto's crypt()

Both Blowfish and DES implementations of crypt() can take arbitrarily
long time, depending on the number of rounds specified by the caller;
make sure they can be interrupted.

Author: Andreas Karlsson
Reviewer: Jeff Janes

Backpatch to 9.1.

Include typmod when complaining about inherited column type mismatches.

MergeAttributes() rejects cases where columns to be merged have the same
type but different typmod, which is correct; but the error message it
printed didn't show either typmod, which is unhelpful. Changing this
requires using format_type_with_typemod() in place of TypeNameToString(),
which will have some minor side effects on the way some type names are
printed, but on balance this is an improvement: the old code sometimes
printed one type according to one set of rules and the other type according
to the other set, which could be confusing in its own way.

Oddly, there were no regression test cases covering any of this behavior,
so add some.

Complaint and fix by Amit Langote

Fix brin_summarize_new_values() to check index type and ownership.

brin_summarize_new_values() did not check that the passed OID was for
an index at all, much less that it was a BRIN index, and would fail in
obscure ways if it wasn't (possibly damaging data first?). It also
lacked any permissions test; by analogy to VACUUM, we should only allow
the table's owner to summarize.

Noted by Jeff Janes, fix by Michael Paquier and me

Improve SECURITY LABEL tab completion

Add DATABASE, EVENT TRIGGER, FOREIGN TABLE, ROLE, and TABLESPACE to
tab completion for SECURITY LABEL.

Kyotaro Horiguchi

Improve the gin index scan performance in pg_trgm.

Previous coding assumes too pessimistic upper bound of similarity
in consistent method of GIN.

Author: Fornaroli Christophe with comments by me.

Remove unnecessary row ordering dependency in pg_rewind test suite.

t/002_databases.pl was expecting to see a specific physical order of the
rows in pg_database. I broke that in HEAD with commit 01e386a325549b77,
but I'd say it's a pretty fragile test methodology in any case, so fix
it in 9.5 as well.

Docs: fix erroneously-given function name.

pg_replication_session_is_setup() exists nowhere; apparently this is
meant to refer to pg_replication_origin_session_is_setup().

Adrien Nayrat

Fix factual and grammatical errors in comments for struct _tableInfo.

Amit Langote, further adjusted by me

Docs typo fix.

Michael Paquier

Avoid VACUUM FULL altogether in initdb.

Commit ed7b3b3811c5836a purported to remove initdb's use of VACUUM FULL,
as had been agreed to in a pghackers discussion back in Dec 2014.
But it missed this one ...

Improve handling of password reuse in src/bin/scripts programs.

This reverts most of commit 83dec5a71 in favor of having connectDatabase()
store the possibly-reusable password in a static variable, similar to the
coding we've had for a long time in pg_dump's version of that function.
To avoid possible problems with unwanted password reuse, make callers
specify whether it's reasonable to attempt to re-use the password.
This is a wash for cases where re-use isn't needed, but it is far simpler
for callers that do want that. Functionally there should be no difference.

Even though we're past RC1, it seems like a good idea to back-patch this
into 9.5, like the prior commit. Otherwise, if there are any third-party
users of connectDatabase(), they'll have to deal with an API change in
9.5 and then another one in 9.6.

Michael Paquier

In pg_dump, remember connection passwords no matter how we got them.

When pg_dump prompts the user for a password, it remembers the password
for possible re-use by parallel worker processes.  However, libpq might
have extracted the password from a connection string originally passed
as "dbname".  Since we don't record the original form of dbname but
break it down to host/port/etc, the password gets lost.  Fix that by
retrieving the actual password from the PGconn.

(It strikes me that this whole approach is rather broken, as it will also
lose other information such as options that might have been present in
the connection string.  But we'll leave that problem for another day.)

In passing, get rid of rather silly use of malloc() for small fixed-size
arrays.

Back-patch to 9.3 where parallel pg_dump was introduced.

Report and fix by Zeus Kronion, adjusted a bit by Michael Paquier and me

Read from the same worker repeatedly until it returns no tuple.

The original coding read tuples from workers in round-robin fashion,
but performance testing shows that it works much better to read enough
to empty one queue before moving on to the next.  I believe the
reason for this is that, with the old approach, we could easily wake
up a worker repeatedly to write only one new tuple into the shm_mq
each time.  With this approach, by the time the process gets scheduled,
it has a decent chance of being able to fill the entire buffer in
one go.

Patch by me.  Dilip Kumar helped with performance testing.

Change Gather not to use a physical tlist.

This should have been part of the original commit, but was missed.
Pushing data between processes is expensive, so we definitely want
to project away unneeded columns here, just as we do for other nodes
like Sort and Hash that care about the volume of data.

Remove unnecessary escaping in C character literals

'\"' is more commonly written simply as '"'.

Allow omitting one or both boundaries in an array slice specifier.

Omitted boundaries represent the upper or lower limit of the corresponding
array subscript. This allows simpler specification of many common
use-cases.

(Revised version of commit 9246af6799819847faa33baf441251003acbb8fe)

YUriy Zhuravlev

Comment improvements for abbreviated keys.

Peter Geoghegan and Robert Haas

postgres_fdw: Consider requesting sorted data so we can do a merge join.

When use_remote_estimate is enabled, consider adding ORDER BY to the
query we sending to the remote server so that we can use that ordered
data for a merge join.  Commit f18c944b6137329ac4a6b2dce5745c5dc21a8578
arranges to push down the query pathkeys, which seems like the case
mostly likely to be a win, but testing shows this can sometimes win,
too.

For a regular table, we know which indexes are present and therefore
test whether the ordering provided by each such index is useful.  Here,
we take the opposite approach: guess what orderings would be useful if
they could be generated cheaply, and then ask the remote side what those
will cost.

Ashutosh Bapat, with very substantial cosmetic revisions by me.  Also
reviewed by Rushabh Lathia.

Fix calculation of space needed for parsed words in tab completion.

Yesterday in commit d854118c8, I had a serious brain fade leading me to
underestimate the number of words that the tab-completion logic could
divide a line into. On input such as "(((((", each character will get
seen as a separate word, which means we do indeed sometimes need more
space for the words than for the original line. Fix that.

Make viewquery a copy in rewriteTargetView()

Rather than expect the Query returned by get_view_query() to be
read-only and then copy bits and pieces of it out, simply copy the
entire structure when we get it. This addresses an issue where
AcquireRewriteLocks, which is called by acquireLocksOnSubLinks(),
scribbles on the parsetree passed in, which was actually an entry
in relcache, leading to segfaults with certain view definitions.
This also future-proofs us a bit for anyone adding more code to this
path.

The acquireLocksOnSubLinks() was added in commit c3e0ddd40.

Back-patch to 9.3 as that commit was.

Remove silly completion for "DELETE FROM tabname ...".

psql offered USING, WHERE, and SET in this context, but SET is not a valid
possibility here. Seems to have been a thinko in commit f5ab0a14ea83eb6c
which added DELETE's USING option.

Teach psql's tab completion to consider the entire input string.

Up to now, the tab completion logic has only examined the last few words
of the current input line; "last few" being originally as few as four
words, but lately up to nine words.  Furthermore, it only looked at what
libreadline considers the current line of input, which made it rather
myopic if you split your command across lines.  This was tolerable,
sort of, so long as the match patterns were only designed to consider the
last few words of input; but with the recent addition of HeadMatches()
and Matches() matching rules, we really have to do better if we want
those to behave sanely.

Hence, change the code to break the entire line down into words, and to
include any previous lines in the command buffer along with the active
readline input buffer.

This will be a little bit slower than the previous coding, but some
measurements say that even a query of several thousand characters can be
parsed in a hundred or so microseconds on modern machines; so it's really
not going to be significant for interactive tab completion.  To reduce
the cost some, I arranged to avoid the per-word malloc calls that used
to occur: all the words are now kept in one malloc'd buffer.

psql: Review of new help output strings

Add missing COSTS OFF to EXPLAIN commands in rowsecurity.sql.

Commit e5e11c8cc added a bunch of EXPLAIN statements without COSTS OFF
to the regression tests. This is contrary to project policy since it
results in unnecessary platform dependencies in the output (it's just
luck that we didn't get buildfarm failures from it). Per gripe from
Mike Wilson.

Adopt a more compact, less error-prone notation for tab completion code.

Replace tests like

    else if (pg_strcasecmp(prev4_wd, "CREATE") == 0 &&
             pg_strcasecmp(prev3_wd, "TRIGGER") == 0 &&
             (pg_strcasecmp(prev_wd, "BEFORE") == 0 ||
              pg_strcasecmp(prev_wd, "AFTER") == 0))

with new notation like this:

    else if (TailMatches4("CREATE", "TRIGGER", MatchAny, "BEFORE|AFTER"))

In addition, provide some macros COMPLETE_WITH_LISTn() to reduce the amount
of clutter needed to specify a small number of predetermined completion
alternatives.

This makes the code substantially more compact: tab-complete.c gets over a
thousand lines shorter in this patch, despite the addition of a couple of
hundred lines of infrastructure for the new notations.  The new way of
specifying match rules seems a whole lot more readable and less
error-prone, too.

There's a lot more that could be done now to make matching faster and more
reliable; for example I suspect that most of the TailMatches() rules should
now be Matches() rules.  That would allow them to be skipped after a single
integer comparison if there aren't the right number of words on the line,
and it would reduce the risk of unintended matches.  But for now, (mostly)
refrain from reworking any match rules in favor of just converting what
we've got into the new notation.

Thomas Munro, reviewed by Michael Paquier, some adjustments by me

Fix whitespace

Fix tab completion for ALTER ... TABLESPACE ... OWNED BY.

Previously the completion used the wrong word to match 'BY'. This was
introduced brokenly, in b2de2a. While at it, also add completion of
IN TABLESPACE ... OWNED BY and fix comments referencing nonexistent
syntax.

Reported-By: Michael Paquier
Author: Michael Paquier and Andres Freund
Discussion: CAB7nPqSHDdSwsJqX0d2XzjqOHr==HdWiubCi4L=Zs7YFTUne8w@mail.gmail.com
Backpatch: 9.4, like the commit introducing the bug

Revert 9246af6799819847faa33baf441251003acbb8fe because
I miss too much. Patch is returned to commitfest process.

pgbench: Change terminology from "threshold" to "parameter".

Per a recommendation from Tomas Vondra, it's more helpful to refer to
the value that determines how skewed a Gaussian or exponential
distribution is as a parameter rather than a threshold.

Since it's not quite too late to get this right in 9.5, where it was
introduced, back-patch this. Most of the patch changes only comments
and documentation, but a few pgbench messages are altered to match.

Fabien Coelho, reviewed by Michael Paquier and by me.

Remove duplicate word.

Kyotaro Horiguchi

Fix TupleQueueReaderNext not to ignore its nowait argument.

This was a silly goof on my (rhaas's) part.

Report and fix by Rushabh Lathia.

Fix copy-and-paste error in logical decoding callback.

This could result in the error context misidentifying where the error
actually occurred.

Craig Ringer

Fix typo in comment.

Amit Langote

Allow to omit boundaries in array subscript

Allow to omiy lower or upper or both boundaries in array subscript
for selecting slice of array.

Author: YUriy Zhuravlev

Cube extension kNN support

Introduce distance operators over cubes:
<#> taxicab distance
<-> euclidean distance
<=> chebyshev distance

Also add kNN support of those distances in GiST opclass.

Author: Stas Kelvich

Remove unreferenced function declarations.

datapagemap_create() and datapagemap_destroy() were declared extern,
but they don't actually exist anywhere. Per YUriy Zhuravlev and
Michael Paquier.

Use just one standalone-backend session for initdb's post-bootstrap steps.

Previously, each subroutine in initdb fired up its own standalone backend
session.  Over time we'd grown as many as fifteen of these sessions,
and the cumulative startup and shutdown work for them was getting pretty
noticeable.  Combining things so that all these steps share a single
backend session cuts a good 10% off the total runtime of initdb, more
if you're not fsync'ing.

The main stumbling block to doing this before was that some of the sessions
were run with -j and some not.  The improved definition of -j mode
implemented by my previous commit makes it possible to fix that by running
all the post-bootstrap steps with -j; we just have to use double instead of
single newlines to end command strings.  (This is only absolutely necessary
around the VACUUM and CREATE DATABASE steps, since those can't be run in a
transaction block.  But it seems best to make them all use double newlines
so that the commands remain separate for error-reporting purposes.)

A minor disadvantage is that since initdb can't tell how much of its
output the backend has executed, we can no longer have the per-step
progress reporting initdb used to print.  But things are fast enough
nowadays that that's not really all that useful anyway.

In passing, add more const decoration to some of the static arrays in
initdb.c.

Adjust behavior of single-user -j mode for better initdb error reporting.

Previously, -j caused the entire input file to be read in and executed as
a single command string.  That's undesirable, not least because any error
causes the entire file to be regurgitated as the "failing query".  Some
experimentation suggests a better rule: end the command string when we see
a semicolon immediately followed by two newlines, ie, an empty line after
a query.  This serves nicely to break up the existing examples such as
information_schema.sql and system_views.sql.  A limitation is that it's
no longer possible to write such a sequence within a string literal or
multiline comment in a file meant to be read with -j; but there are no
instances of such a problem within the data currently used by initdb.
(If someone does make such a mistake in future, it'll be obvious because
they'll get an unterminated-literal or unterminated-comment syntax error.)
Other than that, there shouldn't be any negative consequences; you're not
forced to end statements that way, it's just a better idea in most cases.

In passing, remove src/include/tcop/tcopdebug.h, which is dead code
because it's not included anywhere, and hasn't been for more than
ten years.  One of the debug-support symbols it purported to describe
has been unreferenced for at least the same amount of time, and the
other is removed by this commit on the grounds that it was useless:
forcing -j mode all the time would have broken initdb.  The lack of
complaints about that, or about the missing inclusion, shows that
no one has tried to use TCOP_DONTUSENEWLINE in many years.

Fix improper initialization order for readline.

Turns out we must set rl_basic_word_break_characters *before* we call
rl_initialize() the first time, because it will quietly copy that value
elsewhere --- but only on the first call.  (Love these undocumented
dependencies.)  I broke this yesterday in commit 2ec477dc8108339d;
like that commit, back-patch to all active branches.  Per report from
Pavel Stehule.

Rework internals of changing a type's ownership

This is necessary so that REASSIGN OWNED does the right thing with
composite types, to wit, that it also alters ownership of the type's
pg_class entry -- previously, the pg_class entry remained owned by the
original user, which caused later other failures such as the new owner's
inability to use ALTER TYPE to rename an attribute of the affected
composite.  Also, if the original owner is later dropped, the pg_class
entry becomes owned by a non-existant user which is bogus.

To fix, create a new routine AlterTypeOwner_oid which knows whether to
pass the request to ATExecChangeOwner or deal with it directly, and use
that in shdepReassignOwner rather than calling AlterTypeOwnerInternal
directly.  AlterTypeOwnerInternal is now simpler in that it only
modifies the pg_type entry and recurses to handle a possible array type;
higher-level tasks are handled by either AlterTypeOwner directly or
AlterTypeOwner_oid.

I took the opportunity to add a few more objects to the test rig for
REASSIGN OWNED, so that more cases are exercised.  Additional ones could
be added for superuser-only-ownable objects (such as FDWs and event
triggers) but I didn't want to push my luck by adding a new superuser to
the tests on a backpatchable bug fix.

Per bug #13666 reported by Chris Pacejo.

Backpatch to 9.5.

(I would back-patch this all the way back, except that it doesn't apply
cleanly in 9.4 and earlier because 59367fdf9 wasn't backpatched.  If we
decide that we need this in earlier branches too, we should backpatch
both.)

Cope with Readline's failure to track SIGWINCH events outside of input.

It emerges that libreadline doesn't notice terminal window size change
events unless they occur while collecting input.  This is easy to stumble
over if you resize the window while using a pager to look at query output,
but it can be demonstrated without any pager involvement.  The symptom is
that queries exceeding one line are misdisplayed during subsequent input
cycles, because libreadline has the wrong idea of the screen dimensions.

The safest, simplest way to fix this is to call rl_reset_screen_size()
just before calling readline().  That causes an extra ioctl(TIOCGWINSZ)
for every command; but since it only happens when reading from a tty, the
performance impact should be negligible.  A more valid objection is that
this still leaves a tiny window during entry to readline() wherein delivery
of SIGWINCH will be missed; but the practical consequences of that are
probably negligible.  In any case, there doesn't seem to be any good way to
avoid the race, since readline exposes no functions that seem safe to call
from a generic signal handler --- rl_reset_screen_size() certainly isn't.

It turns out that we also need an explicit rl_initialize() call, else
rl_reset_screen_size() dumps core when called before the first readline()
call.

rl_reset_screen_size() is not present in old versions of libreadline,
so we need a configure test for that.  (rl_initialize() is present at
least back to readline 4.0, so we won't bother with a test for it.)
We would need a configure test anyway since libedit's emulation of
libreadline doesn't currently include such a function.  Fortunately,
libedit seems not to have any corresponding bug.

Merlin Moncure, adjusted a bit by me

Speed up CREATE INDEX CONCURRENTLY's TID sort.

Encode TIDs as 64-bit integers to speed up comparisons. This seems to
speed things up on all platforms, but is even more beneficial when
8-byte integers are passed by value.

Peter Geoghegan. Design suggestions and review by Tom Lane. Review
also by Simon Riggs and by me.

Mark CHECK constraints declared NOT VALID valid if created with table.

FOREIGN KEY constraints have behaved this way for a long time, but for
some reason the behavior of CHECK constraints has been inconsistent up
until now.

Amit Langote and Amul Sul, with assorted tweaks by me.

Document use of Subject Alternative Names in SSL server certificates.

Commit acd08d764 did not bother with updating the documentation.

Update 9.5 release notes through today.

Also do another round of copy-editing, and fix up remaining FIXME items.

Teach mdnblocks() not to create zero-length files.

It's entirely surprising that mdnblocks() has the side effect of
creating new files on disk, so let's make it not do that. One
consequence of the old behavior is that, if running on a damaged
cluster that is missing a file, mdnblocks() can recreate the file
and allow a subsequent _mdfd_getseg() for a higher segment to succeed.
This happens because, while mdnblocks() stops when it finds a segment
that is shorter than 1GB, _mdfd_getseg() has no such check, and thus
the empty file created by mdnblocks() can allow it to continue its
traversal and find higher-numbered segments which remain.

It might be a good idea for _mdfd_getseg() to actually verify that
each segment it finds is exactly 1GB before proceeding to the next
one, but that would involve some additional system calls, so for
now I'm just doing this much.

Patch by me, per off-list analysis by Kevin Grittner and Rahila Syed.
Review by Andres Freund.

Move buffer I/O and content LWLocks out of the main tranche.

Move the content lock directly into the BufferDesc, so that locking and
pinning a buffer touches only one cache line rather than two. Adjust
the definition of BufferDesc slightly so that this doesn't make the
BufferDesc any larger than one cache line (at least on platforms where
a spinlock is only 1 or 2 bytes).

We can't fit the I/O locks into the BufferDesc and stay within one
cache line, so move those to a completely separate tranche. This
leaves a relatively limited number of LWLocks in the main tranche, so
increase the padding of those remaining locks to a full cache line,
rather than allowing adjacent locks to share a cache line, hopefully
reducing false sharing.

Performance testing shows that these changes make little difference
on laptop-class machines, but help significantly on larger servers,
especially those with more than 2 sockets.

Andres Freund, originally based on an earlier patch by Simon Riggs.
Review and cosmetic adjustments (including heavy rewriting of the
comments) by me.

Provide a way to predefine LWLock tranche IDs.

It's a bit cumbersome to use LWLockNewTrancheId(), because the returned
value needs to be shared between backends so that each backend can call
LWLockRegisterTranche() with the correct ID. So, for built-in tranches,
use a hard-coded value instead.

This is motivated by an upcoming patch adding further built-in tranches.

Andres Freund and Robert Haas

Improve CREATE POLICY documentation

Clarify that SELECT policies are now applied when SELECT rights
are required for a given query, even if the query is an UPDATE or
DELETE query. Pointed out by Noah.

Additionally, note the risk regarding concurrently open transactions
where a relation which controls access to the rows of another relation
are updated and the rows of the primary relation are also being
modified. Pointed out by Peter Geoghegan.

Back-patch to 9.5.

Collect the global OR of hasRowSecurity flags for plancache

We carry around information about if a given query has row security or
not to allow the plancache to use that information to invalidate a
planned query in the event that the environment changes.

Previously, the flag of one of the subqueries was simply being copied
into place to indicate if the query overall included RLS components.
That's wrong as we need the global OR of all subqueries. Fix by
changing the code to match how fireRIRules works, which is results
in OR'ing all of the flags.

Noted by Tom.

Back-patch to 9.5 where RLS was introduced.

Add missing cleanup logic in pg_rewind/t/005_same_timeline.pl test.

Per Michael Paquier

Add missing CHECK_FOR_INTERRUPTS in lseg_inside_poly

Apparently, there are bugs in this code that cause it to loop endlessly.
That bug still needs more research, but in the meantime it's clear that
the loop is missing a check for interrupts so that it can be cancelled
timely.

Backpatch to 9.1 -- this has been missing since 49475aab8d0d.

Remove xmlparse(document '') test

This one test was behaving differently between the ubuntu fix for
CVE-2015-7499 and the base "expected" file. It's not worth having
yet another version of the expected file for this test, so drop it.
Perhaps at some point when all distros have settled down to the
same behavior on this test, it can be restored.

Problem found by me on libxml2 (2.9.1+dfsg1-3ubuntu4.6).
Solution suggested by Tom Lane.
Backpatch to 9.5, where the test was added.

Fix out-of-memory error handling in ParameterDescription message processing.

If libpq ran out of memory while constructing the result set, it would hang,
waiting for more data from the server, which might never arrive. To fix,
distinguish between out-of-memory error and not-enough-data cases, and give
a proper error message back to the client on OOM.

There are still similar issues in handling COPY start messages, but let's
handle that as a separate patch.

Michael Paquier, Amit Kapila and me. Backpatch to all supported versions.

Fix bug in SetOffsetVacuumLimit() triggered by find_multixact_start() failure.

Previously, if find_multixact_start() failed, SetOffsetVacuumLimit() would
install 0 into MultiXactState->offsetStopLimit if it previously succeeded.
Luckily, there are no known cases where find_multixact_start() will return
an error in 9.5 and above. But if it were to happen, for example due to
filesystem permission issues, it'd be somewhat bad: GetNewMultiXactId()
could continue allocating mxids even if close to a wraparound, or it could
erroneously stop allocating mxids, even if no wraparound is looming. The
wrong value would be corrected the next time SetOffsetVacuumLimit() is
called, or by a restart.

Reported-By: Noah Misch, although this is not his preferred fix
Discussion: 20151210140450.GA22278@alap3.anarazel.de
Backpatch: 9.5, where the bug was introduced as part of 4f627f

Correct statement to actually be the intended assert statement.

e3f4cfc7 introduced a LWLockHeldByMe() call, without the corresponding
Assert() surrounding it.

Spotted by Coverity.

Backpatch: 9.1+, like the previous commit