Tom Lane [Thu, 12 Oct 2017 02:18:01 +0000 (22:18 -0400)]
Prevent sharing transition states between ordered-set aggregates.
This ought to work, but the built-in OSAs are not capable of coping,
because their final-functions destructively modify their transition
state (specifically, the contained tuplesort object). That was fine
when those functions were written, but commit 804163bc2 moved the
goalposts without telling orderedsetaggs.c.
We should fix the built-in OSAs to support this, but it will take
a little work, especially if we don't want to sacrifice performance
in the normal non-shared-state case. Given that it took a year after
9.6 release for anyone to notice this bug, we should not prioritize
sharable-state over nonsharable-state performance. And a proper fix
is likely to be more complicated than we'd want to back-patch, too.
Therefore, let's just put in this stop-gap patch to prevent nodeAgg.c
from choosing to use shared state for OSAs. We can revert it in HEAD
when we get a better fix.
Report from Lukas Eder, diagnosis by me, patch by David Rowley.
Back-patch to 9.6 where the problem was introduced.
Andres Freund [Thu, 12 Oct 2017 02:06:29 +0000 (19:06 -0700)]
Temporary attempt at a workaround for further MSVC restrict build failures.
It appears some versions of msvc use __declspec(restrict) in stdlib.h
and subsidiary headers. Including those after defining 'restrict' to
'__restrict' doesn't work. Try to get the buildfarm green to see
whether there's further problems, by including stdlib.h just before
said define.
Andres Freund [Wed, 11 Oct 2017 23:49:31 +0000 (16:49 -0700)]
Improve performance of SendRowDescriptionMessage.
There's three categories of changes leading to better performance:
- Splitting the per-attribute part of SendRowDescriptionMessage into a
v2 and a v3 version allows avoiding branches for every attribute.
- Preallocating the size of the buffer to be big enough for all
attributes and then using pq_write* avoids unnecessary buffer
size checks & resizing.
- Reusing a persistently allocated StringInfo for all
SendRowDescriptionMessage() invocations avoids repeated allocations
& reallocations.
Author: Andres Freund
Discussion: https://postgr.es/m/20170914063418.sckdzgjfrsbekae4@alap3.anarazel.de
Robert Haas [Wed, 11 Oct 2017 23:52:46 +0000 (19:52 -0400)]
pg_stat_statements: Widen query IDs from 32 bits to 64 bits.
This takes advantage of the infrastructure introduced by commit 81c5e46c490e2426db243eada186995da5bb0ba7 to greatly reduce the
likelihood that two different queries will end up with the same query
ID. It's still possible, of course, but whereas before it the chances
of a collision reached 25% around 50,000 queries, it will now take
more than 3 billion queries.
Backward incompatibility: Because the type exposed at the SQL level is
int8, users may now see negative query IDs in the pg_stat_statements
view (and also, query IDs more than 4 billion, which was the old
limit).
Patch by me, reviewed by Michael Paquier and Peter Geoghegan.
Andres Freund [Wed, 11 Oct 2017 23:26:35 +0000 (16:26 -0700)]
Use one stringbuffer for all rows printed in printtup.c.
This avoids newly allocating, and then possibly growing, the
stringbuffer for every row. For wide rows this can substantially
reduce memory allocator overhead, at the price of not immediately
reducing memory usage after outputting an especially wide row.
Author: Andres Freund
Discussion: https://postgr.es/m/20170914063418.sckdzgjfrsbekae4@alap3.anarazel.de
Andres Freund [Wed, 11 Oct 2017 23:01:52 +0000 (16:01 -0700)]
Add more efficient functions to pqformat API.
There's three prongs to achieve greater efficiency here:
1) Allow reusing a stringbuffer across pq_beginmessage/endmessage,
with the new pq_beginmessage_reuse/endmessage_reuse. This can be
beneficial both because it avoids allocating the initial buffer,
and because it's more likely to already have an correctly sized
buffer.
2) Replacing pq_sendint() with pq_sendint$width() inline
functions. Previously unnecessary and unpredictable branches in
pq_sendint() were needed. Additionally the replacement functions
are implemented more efficiently. pq_sendint is now deprecated, a
separate commit will convert all in-tree callers.
3) Add pq_writeint$width(), pq_writestring(). These rely on sufficient
space in the StringInfo's buffer, avoiding individual space checks
& potential individual resizing. To allow this to be used for
strings, expose mbutil.c's MAX_CONVERSION_GROWTH.
Followup commits will make use of these facilities.
Author: Andres Freund
Discussion: https://postgr.es/m/20170914063418.sckdzgjfrsbekae4@alap3.anarazel.de
Andres Freund [Wed, 11 Oct 2017 23:01:52 +0000 (16:01 -0700)]
Allow to avoid NUL-byte management for stringinfos and use in format.c.
In a lot of the places having appendBinaryStringInfo() maintain a
trailing NUL byte wasn't actually meaningful, e.g. when appending an
integer which can contain 0 in one of its bytes.
Removing this yields some small speedup, but more importantly will be
more consistent when providing faster variants of pq_sendint etc.
Author: Andres Freund
Discussion: https://postgr.es/m/20170914063418.sckdzgjfrsbekae4@alap3.anarazel.de
Andres Freund [Wed, 11 Oct 2017 23:01:52 +0000 (16:01 -0700)]
Add configure infrastructure to detect support for C99's restrict.
Will be used in later commits improving performance for a few key
routines where information about aliasing allows for significantly
better code generation.
This allows to use the C99 'restrict' keyword without breaking C89, or
for that matter C++, compilers. If not supported it's defined to be
empty.
Author: Andres Freund
Discussion: https://postgr.es/m/20170914063418.sckdzgjfrsbekae4@alap3.anarazel.de
Tom Lane [Wed, 11 Oct 2017 21:43:50 +0000 (17:43 -0400)]
Remove unnecessary PG_TRY overhead for CurrentResourceOwner changes.
resowner/README contained advice to use a PG_TRY block to restore the
old CurrentResourceOwner value anywhere that that variable is transiently
changed. That advice was only inconsistently followed, however, and
on reflection it seems like unnecessary overhead. We don't bother
with such a convention for transient CurrentMemoryContext changes,
on the grounds that any (sub)transaction abort will start out by
resetting CurrentMemoryContext to what it wants. But the same is
true of CurrentResourceOwner, so there seems no need to treat it
differently.
Hence, remove PG_TRY blocks that exist only to restore CurrentResourceOwner
before re-throwing the error. There are a couple of places that restore
it along with some other actions, and I left those alone; the restore is
probably unnecessary but no noticeable gain will result from removing it.
Andres Freund [Wed, 11 Oct 2017 19:03:26 +0000 (12:03 -0700)]
Prevent idle in transaction session timeout from sometimes being ignored.
The previous coding in ProcessInterrupts() could lead to
idle_in_transaction_session_timeout being ignored, when
statement_timeout occurred earlier.
The problem was that ProcessInterrupts() would return before
processing the transaction timeout if QueryCancelPending was set while
QueryCancelHoldoffCount != 0 - which is the case when reading new
commands from the client. Ergo when the idle transaction timeout would
hit.
Fix that by removing the early return. Alternatively the transaction
timeout code could have been moved up, but that early return seems
like an issue that could hit other cases too.
Author: Lukas Fittl
Bug: #14821
Discussion:
https://www.postgresql.org/message-id/20170921010956.17345.61461%40wrigleys.postgresql.org
https://www.postgresql.org/message-id/CAP53PkxQnv3OWJpyNPGJYT62uY=n1=2CF_Lpc6gVOFnc0-gazw@mail.gmail.com
Backpatch: 9.6-, where idle_in_transaction_session_timeout was introduced.
Tom Lane [Wed, 11 Oct 2017 20:56:23 +0000 (16:56 -0400)]
Doc: fix missing explanation of default object privileges.
The GRANT reference page, which lists the default privileges for new
objects, failed to mention that USAGE is granted by default for data
types and domains. As a lesser sin, it also did not specify anything
about the initial privileges for sequences, FDWs, foreign servers,
or large objects. Fix that, and add a comment to acldefault() in the
probably vain hope of getting people to maintain this list in future.
Noted by Laurenz Albe, though I editorialized on the wording a bit.
Back-patch to all supported branches, since they all have this behavior.
Tom Lane [Wed, 11 Oct 2017 18:28:33 +0000 (14:28 -0400)]
Fix low-probability loss of NOTIFY messages due to XID wraparound.
Up to now async.c has used TransactionIdIsInProgress() to detect whether
a notify message's source transaction is still running. However, that
function has a quick-exit path that reports that XIDs before RecentXmin
are no longer running. If a listening backend is doing nothing but
listening, and not running any queries, there is nothing that will advance
its value of RecentXmin. Once 2 billion transactions elapse, the
RecentXmin check causes active transactions to be reported as not running.
If they aren't committed yet according to CLOG, async.c decides they
aborted and discards their messages. The timing for that is a bit tight
but it can happen when multiple backends are sending notifies concurrently.
The net symptom therefore is that a sufficiently-long-surviving
listen-only backend starts to miss some fraction of NOTIFY traffic,
but only under heavy load.
The only function that updates RecentXmin is GetSnapshotData().
A brute-force fix would therefore be to take a snapshot before
processing incoming notify messages. But that would add cycles,
as well as contention for the ProcArrayLock. We can be smarter:
having taken the snapshot, let's use that to check for running
XIDs, and not call TransactionIdIsInProgress() at all. In this
way we reduce the number of ProcArrayLock acquisitions from one
per message to one per notify interrupt; that's the same under
light load but should be a benefit under heavy load. Light testing
says that this change is a wash performance-wise for normal loads.
I looked around for other callers of TransactionIdIsInProgress()
that might be at similar risk, and didn't find any; all of them
are inside transactions that presumably have already taken a
snapshot.
Problem report and diagnosis by Marko Tiikkaja, patch by me.
Back-patch to all supported branches, since it's been like this
since 9.0.
Andres Freund [Tue, 10 Oct 2017 21:42:16 +0000 (14:42 -0700)]
Rewrite strnlen replacement implementation from 8a241792f96.
The previous placement of the fallback implementation in libpgcommon
was problematic, because libpqport functions need strnlen
functionality.
Move replacement into libpgport. Provide strnlen() under its posix
name, instead of pg_strnlen(). Fix stupid configure bug, executing the
test only when compiled with threading support.
Author: Andres Freund
Discussion: https://postgr.es/m/E1e1gR2-0005fB-SI@gemulon.postgresql.org
Tom Lane [Tue, 10 Oct 2017 16:51:09 +0000 (12:51 -0400)]
Add missing clean step to src/test/modules/brin/Makefile.
I noticed the tmp_check subdirectory wasn't getting cleaned up
after a check-world run. Apparently pgxs.mk will only do this
for you if you've defined REGRESS. The only other src/test/modules
Makefile that does not set that is snapshot_too_old, and it
does it like this.
Andres Freund [Sun, 8 Oct 2017 22:08:25 +0000 (15:08 -0700)]
Reduce memory usage of targetlist SRFs.
Previously nodeProjectSet only released memory once per input tuple,
rather than once per returned tuple. If the computation of an
individual returned tuple requires a lot of memory, that can lead to
problems.
Instead change things so that the expression context can be reset once
per output tuple, which requires a new memory context to store SRF
arguments in.
This is a longstanding issue, but was hard to fix before 9.6, due to
the way tSRFs where evaluated. But it's fairly easy to fix now. We
could backpatch this into 10, but given there've been fewc omplaints
that doesn't seem worth the risk so far.
Reported-By: Lucas Fairchild
Author: Andres Freund, per discussion with Tom Lane
Discussion: https://postgr.es/m/4514.1507318623@sss.pgh.pa.us
Tom Lane [Sun, 8 Oct 2017 19:25:26 +0000 (15:25 -0400)]
Increase distance between flush requests during bulk file copies.
copy_file() reads and writes data 64KB at a time (with default BLCKSZ),
and historically has issued a pg_flush_data request after each write.
This turns out to interact really badly with macOS's new APFS file
system: a large file copy takes over 100X longer than it ought to on
APFS, as reported by Brent Dearth. While that's arguably a macOS bug,
it's not clear whether Apple will do anything about it in the near
future, and in any case experimentation suggests that issuing flushes
a bit less often can be helpful on other platforms too.
Hence, rearrange the logic in copy_file() so that flush requests are
issued once per N writes rather than every time through the loop.
I set the FLUSH_DISTANCE to 32MB on macOS (any less than that still
results in a noticeable speed degradation on APFS), but 1MB elsewhere.
In limited testing on Linux and FreeBSD, this seems slightly faster
than the previous code, and certainly no worse. It helps noticeably
on macOS even with the older HFS filesystem.
A simpler change would have been to just increase the size of the
copy buffer without changing the loop logic, but that seems likely
to trash the processor cache without really helping much.
Back-patch to 9.6 where we introduced msync() as an implementation
option for pg_flush_data(). The problem seems specific to APFS's
mmap/msync support, so I don't think we need to go further back.
Tom Lane [Sun, 8 Oct 2017 16:23:32 +0000 (12:23 -0400)]
Reduce "X = X" to "X IS NOT NULL", if it's easy to do so.
If the operator is a strict btree equality operator, and X isn't volatile,
then the clause must yield true for any non-null value of X, or null if X
is null. At top level of a WHERE clause, we can ignore the distinction
between false and null results, so it's valid to simplify the clause to
"X IS NOT NULL". This is a useful improvement mainly because we'll get
a far better selectivity estimate in most cases.
Because such cases seldom arise in well-written queries, it is unappetizing
to expend a lot of planner cycles looking for them ... but it turns out
that there's a place we can shoehorn this in practically for free, because
equivclass.c already has to detect and reject candidate equivalences of the
form X = X. That doesn't catch every place that it would be valid to
simplify to X IS NOT NULL, but it catches the typical case. Working harder
doesn't seem justified.
Tom Lane [Sat, 7 Oct 2017 22:04:25 +0000 (18:04 -0400)]
Improve pg_regress's error reporting for schedule-file problems.
The previous coding here trashed the line buffer as it scanned it,
making it impossible to print the source line in subsequent error
messages. With a few save/restore/strdup pushups we can improve
that situation.
In passing, move the free'ing of the various strings that are collected
while processing one set of tests down to the bottom of the loop.
That's simpler, less surprising, and should make valgrind less unhappy
about the strings that were previously leaked by the last iteration.
Tom Lane [Sat, 7 Oct 2017 21:20:09 +0000 (17:20 -0400)]
Enforce our convention about max number of parallel regression tests.
We have a very old rule that parallel_schedule should have no more
than twenty tests in any one parallel group, so as to provide a
bound on the number of concurrently running processes needed to
pass the tests. But people keep forgetting the rule, so let's add
a few lines of code to check it.
Tom Lane [Sat, 7 Oct 2017 17:19:13 +0000 (13:19 -0400)]
Clean up sloppy maintenance of regression test schedule files.
The partition_join test was added to a parallel group that was already
at the maximum of 20 concurrent tests. The hash_func test wasn't
added to serial_schedule at all. The identity and partition_join tests
were added to serial_schedule with the aid of a dartboard, rather than
maintaining consistency with parallel_schedule.
There are proposals afoot to make these sorts of errors harder to make,
but in the meantime let's fix the ones already in place.
Tom Lane [Fri, 6 Oct 2017 23:18:58 +0000 (19:18 -0400)]
Fix crash when logical decoding is invoked from a PL function.
The logical decoding functions do BeginInternalSubTransaction and
RollbackAndReleaseCurrentSubTransaction to clean up after themselves.
It turns out that AtEOSubXact_SPI has an unrecognized assumption that
we always need to cancel the active SPI operation in the SPI context
that surrounds the subtransaction (if there is one). That's true
when the RollbackAndReleaseCurrentSubTransaction call is coming from
the SPI-using function itself, but not when it's happening inside
some unrelated function invoked by a SPI query. In practice the
affected callers are the various PLs.
To fix, record the current subtransaction ID when we begin a SPI
operation, and clean up only if that ID is the subtransaction being
canceled.
Also, remove AtEOSubXact_SPI's assertion that it must have cleaned
up the surrounding SPI context's active tuptable. That's proven
wrong by the same test case.
Also clarify (or, if you prefer, reinterpret) the calling conventions
for _SPI_begin_call and _SPI_end_call. The memory context cleanup
in the latter means that these have always had the flavor of a matched
resource-management pair, but they weren't documented that way before.
Per report from Ben Chobot.
Back-patch to 9.4 where logical decoding came in. In principle,
the SPI changes should go all the way back, since the problem dates
back to commit 7ec1c5a86. But given the lack of field complaints
it seems few people are using internal subtransactions in this way.
So I don't feel a need to take any risks in 9.2/9.3.
Tom Lane [Fri, 6 Oct 2017 18:28:42 +0000 (14:28 -0400)]
Fix intra-query memory leakage in nodeProjectSet.c.
Both ExecMakeFunctionResultSet() and evaluation of simple expressions
need to be done in the per-tuple memory context, not per-query, else
we leak data until end of query. This is a consideration that was
missed while refactoring code in the ProjectSet patch (note that in
pre-v10, ExecMakeFunctionResult is called in the per-tuple context).
Per bug #14843 from Ben M. Diagnosed independently by Andres and myself.
Tom Lane [Fri, 6 Oct 2017 16:20:12 +0000 (12:20 -0400)]
Fix access-off-end-of-array in clog.c.
Sloppy loop coding in set_status_by_pages() resulted in fetching one array
element more than it should from the subxids[] array. The odds of this
resulting in SIGSEGV are pretty small, but we've certainly seen that happen
with similar mistakes elsewhere. While at it, we can get rid of an extra
TransactionIdToPage() calculation per loop.
Per report from David Binderman. Back-patch to all supported branches,
since this code is quite old.
Tom Lane [Fri, 6 Oct 2017 15:35:49 +0000 (11:35 -0400)]
#ifdef out some dead code in psql/mainloop.c.
This pg_send_history() call is unreachable, since the block it's in
is currently only entered in !cur_cmd_interactive mode. But rather
than just delete it, make it #ifdef NOT_USED, in hopes that we'll
remember to enable it if we ever change that decision.
Per report from David Binderman. Since this is basically cosmetic,
I see no great need to back-patch.
Alvaro Herrera [Fri, 6 Oct 2017 15:14:42 +0000 (17:14 +0200)]
Fix traversal of half-frozen update chains
When some tuple versions in an update chain are frozen due to them being
older than freeze_min_age, the xmax/xmin trail can become broken. This
breaks HOT (and probably other things). A subsequent VACUUM can break
things in more serious ways, such as leaving orphan heap-only tuples
whose root HOT redirect items were removed. This can be seen because
index creation (or REINDEX) complain like
ERROR: XX000: failed to find parent tuple for heap-only tuple at (0,7) in table "t"
Because of relfrozenxid contraints, we cannot avoid the freezing of the
early tuples, so we must cope with the results: whenever we see an Xmin
of FrozenTransactionId, consider it a match for whatever the previous
Xmax value was.
This problem seems to have appeared in 9.3 with multixact changes,
though strictly speaking it seems unrelated.
Since 9.4 we have commit 37484ad2a "Change the way we mark tuples as
frozen", so the fix is simple: just compare the raw Xmin (still stored
in the tuple header, since freezing merely set an infomask bit) to the
Xmax. But in 9.3 we rewrite the Xmin value to FrozenTransactionId, so
the original value is lost and we have nothing to compare the Xmax with.
To cope with that case we need to compare the Xmin with FrozenXid,
assume it's a match, and hope for the best. Sadly, since you can
pg_upgrade a 9.3 instance containing half-frozen pages to newer
releases, we need to keep the old check in newer versions too, which
seems a bit brittle; I hope we can somehow get rid of that.
I didn't optimize the new function for performance. The new coding is
probably a bit slower than before, since there is a function call rather
than a straight comparison, but I'd rather have it work correctly than
be fast but wrong.
This is a followup after 20b655224249 fixed a few related problems.
Apparently, in 9.6 and up there are more ways to get into trouble, but
in 9.3 - 9.5 I cannot reproduce a problem anymore with this patch, so
there must be a separate bug.
Reported-by: Peter Geoghegan Diagnosed-by: Peter Geoghegan, Michael Paquier, Daniel Wood,
Yi Wen Wong, Ćlvaro
Discussion: https://postgr.es/m/CAH2-Wznm4rCrhFAiwKPWTpEw2bXDtgROZK7jWWGucXeH3D1fmA@mail.gmail.com
Robert Haas [Fri, 6 Oct 2017 15:11:10 +0000 (11:11 -0400)]
Basic partition-wise join functionality.
Instead of joining two partitioned tables in their entirety we can, if
it is an equi-join on the partition keys, join the matching partitions
individually. This involves teaching the planner about "other join"
rels, which are related to regular join rels in the same way that
other member rels are related to baserels. This can use significantly
more CPU time and memory than regular join planning, because there may
now be a set of "other" rels not only for every base relation but also
for every join relation. In most practical cases, this probably
shouldn't be a problem, because (1) it's probably unusual to join many
tables each with many partitions using the partition keys for all
joins and (2) if you do that scenario then you probably have a big
enough machine to handle the increased memory cost of planning and (3)
the resulting plan is highly likely to be better, so what you spend in
planning you'll make up on the execution side. All the same, for now,
turn this feature off by default.
Currently, we can only perform joins between two tables whose
partitioning schemes are absolutely identical. It would be nice to
cope with other scenarios, such as extra partitions on one side or the
other with no match on the other side, but that will have to wait for
a future patch.
Ashutosh Bapat, reviewed and tested by Rajkumar Raghuwanshi, Amit
Langote, Rafia Sabih, Thomas Munro, Dilip Kumar, Antonin Houska, Amit
Khandekar, and by me. A few final adjustments by me.
Robert Haas [Thu, 5 Oct 2017 17:06:46 +0000 (13:06 -0400)]
On attach, consider skipping validation of subpartitions individually.
If the table attached as a partition is itself partitioned, individual
partitions might have constraints strong enough to skip scanning the
table even if the table actually attached does not. This is pretty
cheap to check, and possibly a big win if it works out.
Robert Haas [Thu, 5 Oct 2017 15:34:38 +0000 (11:34 -0400)]
Allow DML commands that create tables to use parallel query.
Haribabu Kommi, reviewed by Dilip Kumar and Rafia Sabih. Various
cosmetic changes by me to explain why this appears to be safe but
allowing inserts in parallel mode in general wouldn't be. Also, I
removed the REFRESH MATERIALIZED VIEW case from Haribabu's patch,
since I'm not convinced that case is OK, and hacked on the
documentation somewhat.
Tom Lane [Thu, 5 Oct 2017 14:47:47 +0000 (10:47 -0400)]
Improve comments in vacuum_rel() and analyze_rel().
Remove obsolete references to get_rel_oids(). Avoid listing specific
relkinds in the comments, since we seem unable to keep such things
in sync with the code, and it's not all that helpful anyhow.
Noted by Michael Paquier, though I rewrote the comments a bit more.
Peter Eisentraut [Thu, 31 Aug 2017 02:16:50 +0000 (22:16 -0400)]
Document and use SPI_result_code_string()
A lot of semi-internal code just prints out numeric SPI error codes,
which is not very helpful. We already have an API function to convert
the codes to a string, so let's make more use of that.
Reviewed-by: Michael Paquier <michael.paquier@gmail.com>
Andres Freund [Wed, 4 Oct 2017 07:22:38 +0000 (00:22 -0700)]
Replace binary search in fmgr_isbuiltin with a lookup array.
Turns out we have enough functions that the binary search is quite
noticeable in profiles.
Thus have Gen_fmgrtab.pl build a new mapping from a builtin function's
oid to an index in the existing fmgr_builtins array. That keeps the
additional memory usage at a reasonable amount.
Author: Andres Freund, with input from Tom Lane
Discussion: https://postgr.es/m/20170914065128.a5sk7z4xde5uy3ei@alap3.anarazel.de
Tom Lane [Tue, 3 Oct 2017 22:53:44 +0000 (18:53 -0400)]
Allow multiple tables to be specified in one VACUUM or ANALYZE command.
Not much to say about this; does what it says on the tin.
However, formerly, if there was a column list then the ANALYZE action was
implied; now it must be specified, or you get an error. This is because
it would otherwise be a bit unclear what the user meant if some tables
have column lists and some don't.
Nathan Bossart, reviewed by Michael Paquier and Masahiko Sawada, with some
editorialization by me
Tom Lane [Tue, 3 Oct 2017 18:00:56 +0000 (14:00 -0400)]
Fix race condition with unprotected use of a latch pointer variable.
Commit 597a87ccc introduced a latch pointer variable to replace use
of a long-lived shared latch in the shared WalRcvData structure.
This was not well thought out, because there are now hazards of the
pointer variable changing while it's being inspected by another
process. This could obviously lead to a core dump in code like
if (WalRcv->latch)
SetLatch(WalRcv->latch);
and there's a more remote risk of a torn read, if we have any
platforms where reading/writing a pointer is not atomic.
An actual problem would occur only if the walreceiver process
exits (gracefully) while the startup process is trying to
signal it, but that seems well within the realm of possibility.
To fix, treat the pointer variable (not the referenced latch)
as being protected by the WalRcv->mutex spinlock. There
remains a race condition that we could apply SetLatch to a
process latch that no longer belongs to the walreceiver, but
I believe that's harmless: at worst it'd cause an extra wakeup
of the next process to use that PGPROC structure.
Back-patch to v10 where the faulty code was added.
Alvaro Herrera [Tue, 3 Oct 2017 12:58:25 +0000 (14:58 +0200)]
Fix coding rules violations in walreceiver.c
1. Since commit b1a9bad9e744 we had pstrdup() inside a
spinlock-protected critical section; reported by Andreas Seltenreich.
Turn those into strlcpy() to stack-allocated variables instead.
Backpatch to 9.6.
2. Since commit 9ed551e0a4fd we had a pfree() uselessly inside a
spinlock-protected critical section. Tom Lane noticed in code review.
Move down. Backpatch to 9.6.
3. Since commit 64233902d22b we had GetCurrentTimestamp() (a kernel
call) inside a spinlock-protected critical section. Tom Lane noticed in
code review. Move it up. Backpatch to 9.2.
4. Since commit 1bb2558046cc we did elog(PANIC) while holding spinlock.
Tom Lane noticed in code review. Release spinlock before dying.
Backpatch to 9.2.
Peter Eisentraut [Fri, 22 Sep 2017 17:51:01 +0000 (13:51 -0400)]
Expand collation documentation
Document better how to create custom collations and what locale strings
ICU accepts. Explain the ICU examples in more detail. Also update the
text on the CREATE COLLATION reference page a bit to take ICU more into
account.
Andres Freund [Sun, 1 Oct 2017 22:36:14 +0000 (15:36 -0700)]
Replace most usages of ntoh[ls] and hton[sl] with pg_bswap.h.
All postgres internal usages are replaced, it's just libpq example
usages that haven't been converted. External users of libpq can't
generally rely on including postgres internal headers.
Note that this includes replacing open-coded byte swapping of 64bit
integers (using two 32 bit swaps) with a single 64bit swap.
Where it looked applicable, I have removed netinet/in.h and
arpa/inet.h usage, which previously provided the relevant
functionality. It's perfectly possible that I missed other reasons for
including those, the buildfarm will tell.
Author: Andres Freund
Discussion: https://postgr.es/m/20170927172019.gheidqy6xvlxb325@alap3.anarazel.de
Andres Freund [Sun, 1 Oct 2017 22:17:10 +0000 (15:17 -0700)]
Allow pg_ctl kill to send SIGKILL.
Previously that was disallowed out of an abundance of
caution. Providing KILL support however is helpful to make the
013_crash_restart.pl test portable, and there's no actual issue with
allowing it. SIGABRT, which has similar consequences except it also
dumps core, was already allowed.
Author: Andres Freund
Discussion: https://postgr.es/m/45d42d41-6145-9be1-7261-84acf6d9e344@2ndQuadrant.com
Tom Lane [Sun, 1 Oct 2017 16:43:46 +0000 (12:43 -0400)]
Use a longer connection timeout in pg_isready test.
Buildfarm members skink and sungazer have both recently failed this
test, with symptoms indicating that the default 3-second timeout
isn't quite enough for those very slow systems. There's no reason
to be miserly with this timeout, so boost it to 60 seconds.
Back-patch to all versions containing this test. That may be overkill,
because the failure has only been observed in the v10 branch, but
I don't feel like having to revisit this later.
If --rate was used to throttle pgbench, it failed to sleep when it had
nothing to do, leading to a busy-wait with 100% CPU usage. This bug was
introduced in the refactoring in v10. Before that, sleep() was called with
a timeout, even when there were no file descriptors to wait for.
Reported by Jeff Janes, patch by Fabien COELHO. Backpatch to v10.
Tom Lane [Sat, 30 Sep 2017 21:05:07 +0000 (17:05 -0400)]
Fix pg_dump to assign domain array type OIDs during pg_upgrade.
During a binary upgrade, all type OIDs are supposed to be assigned by
pg_dump based on their values in the old cluster. But now that domains
have arrays, there's nothing to base the arrays' type OIDs on, if we're
upgrading from a pre-v11 cluster. Make pg_dump search for an unused type
OID to use for this purpose. Per buildfarm.
Tom Lane [Sat, 30 Sep 2017 17:40:56 +0000 (13:40 -0400)]
Support arrays over domains.
Allowing arrays with a domain type as their element type was left un-done
in the original domain patch, but not for any very good reason. This
omission leads to such surprising results as array_agg() not working on
a domain column, because the parser can't identify a suitable output type
for the polymorphic aggregate.
In order to fix this, first clean up the APIs of coerce_to_domain() and
some internal functions in parse_coerce.c so that we consistently pass
around a CoercionContext along with CoercionForm. Previously, we sometimes
passed an "isExplicit" boolean flag instead, which is strictly less
information; and coerce_to_domain() didn't even get that, but instead had
to reverse-engineer isExplicit from CoercionForm. That's contrary to the
documentation in primnodes.h that says that CoercionForm only affects
display and not semantics. I don't think this change fixes any live bugs,
but it makes things more consistent. The main reason for doing it though
is that now build_coercion_expression() receives ccontext, which it needs
in order to be able to recursively invoke coerce_to_target_type().
Next, reimplement ArrayCoerceExpr so that the node does not directly know
any details of what has to be done to the individual array elements while
performing the array coercion. Instead, the per-element processing is
represented by a sub-expression whose input is a source array element and
whose output is a target array element. This simplifies life in
parse_coerce.c, because it can build that sub-expression by a recursive
invocation of coerce_to_target_type(). The executor now handles the
per-element processing as a compiled expression instead of hard-wired code.
The main advantage of this is that we can use a single ArrayCoerceExpr to
handle as many as three successive steps per element: base type conversion,
typmod coercion, and domain constraint checking. The old code used two
stacked ArrayCoerceExprs to handle type + typmod coercion, which was pretty
inefficient, and adding yet another array deconstruction to do domain
constraint checking seemed very unappetizing.
In the case where we just need a single, very simple coercion function,
doing this straightforwardly leads to a noticeable increase in the
per-array-element runtime cost. Hence, add an additional shortcut evalfunc
in execExprInterp.c that skips unnecessary overhead for that specific form
of expression. The runtime speed of simple cases is within 1% or so of
where it was before, while cases that previously required two levels of
array processing are significantly faster.
Finally, create an implicit array type for every domain type, as we do for
base types, enums, etc. Everything except the array-coercion case seems
to just work without further effort.
Andres Freund [Fri, 29 Sep 2017 22:52:55 +0000 (15:52 -0700)]
Extend & revamp pg_bswap.h infrastructure.
Upcoming patches are going to address performance issues that involve
slow system provided ntohs/htons etc. To address that expand
pg_bswap.h to provide pg_ntoh{16,32,64}, pg_hton{16,32,64} and
optimize their respective implementations by using compiler intrinsics
for gcc compatible compilers and msvc. Fall back to manual
implementations using shifts etc otherwise.
Additionally remove multiple evaluation hazards from the existing
BSWAP32/64 macros, by replacing them with inline functions when
necessary. In the course of that the naming scheme is changed to
pg_bswap16/32/64.
Author: Andres Freund
Discussion: https://postgr.es/m/20170927172019.gheidqy6xvlxb325@alap3.anarazel.de
Tom Lane [Fri, 29 Sep 2017 20:26:21 +0000 (16:26 -0400)]
Fix inadequate locking during get_rel_oids().
get_rel_oids used to not take any relation locks at all, but that stopped
being a good idea with commit 3c3bb9933, which inserted a syscache lookup
into the function. A concurrent DROP TABLE could now produce "cache lookup
failed", which we don't want to have happen in normal operation. The best
solution seems to be to transiently take a lock on the relation named by
the RangeVar (which also makes the result of RangeVarGetRelid a lot less
spongy). But we shouldn't hold the lock beyond this function, because we
don't want VACUUM to lock more than one table at a time. (That would not
be a big problem right now, but it will become one after the pending
feature patch to allow multiple tables to be named in VACUUM.)
In passing, adjust vacuum_rel and analyze_rel to document that we don't
trust the passed RangeVar to be accurate, and allow the RangeVar to
possibly be NULL --- which it is anyway for a whole-database VACUUM,
though we accidentally didn't crash for that case.
The passed RangeVar is in fact inaccurate when dealing with a child
partition, as of v10, and it has been wrong for a whole long time in the
case of vacuum_rel() recursing to a TOAST table. None of these things
present visible bugs up to now, because the passed RangeVar is in fact
only consulted for autovacuum logging, and in that particular context it's
always accurate because autovacuum doesn't let vacuum.c expand partitions
nor recurse to toast tables. Still, this seems like trouble waiting to
happen, so let's nail the door at least partly shut. (Further cleanup
is planned, in HEAD only, as part of the pending feature patch.)
Fix some sadly inaccurate/obsolete comments too. Back-patch to v10.
Robert Haas [Fri, 29 Sep 2017 19:59:11 +0000 (15:59 -0400)]
psql: Don't try to print a partition constraint we didn't fetch.
If \d rather than \d+ is used, then verbose is false and we don't ask
the server for the partition constraint; so we shouldn't print it in
that case either.
Maksim Milyutin, per a report from Jesper Pedersen. Reviewed by
Jesper Pedersen and Amit Langote.
Peter Eisentraut [Mon, 25 Sep 2017 15:59:46 +0000 (11:59 -0400)]
psql: Update \d sequence display
For \d sequencename, the psql code just did SELECT * FROM sequencename
to get the information to display, but this does not contain much
interesting information anymore in PostgreSQL 10, because the metadata
has been moved to a separate system catalog.
This patch creates a newly designed sequence display that is not merely
an extension of the general relation/table display as it was previously.
Tom Lane [Fri, 29 Sep 2017 15:32:05 +0000 (11:32 -0400)]
Marginal improvement for generated code in execExprInterp.c.
Avoid the coding pattern "*op->resvalue = f();", as some compilers think
that requires them to evaluate "op->resvalue" before the function call.
Unless there are lots of free registers, this can lead to a useless
register spill and reload across the call.
I changed all the cases like this in ExecInterpExpr(), but didn't bother
in the out-of-line opcode eval subroutines, since those are presumably
not as performance-critical.
Peter Eisentraut [Thu, 31 Aug 2017 16:24:47 +0000 (12:24 -0400)]
Add background worker type
Add bgw_type field to background worker structure. It is intended to be
set to the same value for all workers of the same type, so they can be
grouped in pg_stat_activity, for example.
The backend_type column in pg_stat_activity now shows bgw_type for a
background worker. The ps listing also no longer calls out that a
process is a background worker but just show the bgw_type. That way,
being a background worker is more of an implementation detail now that
is not shown to the user. However, most log messages still refer to
'background worker "%s"'; otherwise constructing sensible and
translatable log messages would become tricky.
Reviewed-by: Michael Paquier <michael.paquier@gmail.com> Reviewed-by: Daniel Gustafsson <daniel@yesql.se>
Robert Haas [Fri, 29 Sep 2017 14:20:44 +0000 (10:20 -0400)]
Remove replacement selection sort.
At the time replacement_sort_tuples was introduced, there were still
cases where replacement selection sort noticeably outperformed using
quicksort even for the first run. However, those cases seem to have
evaporated as a result of further improvements made since that time
(and perhaps also advances in CPU technology). So remove replacement
selection and the controlling GUC entirely. This makes tuplesort.c
noticeably simpler and probably paves the way for further
optimizations someone might want to do later.
Peter Geoghegan, with review and testing by Tomas Vondra and me.
Peter Eisentraut [Fri, 11 Aug 2017 03:33:47 +0000 (23:33 -0400)]
Add lcov --initial
By just running lcov on the produced .gcda data files, we don't account
for source files that are not touched by tests at all. To fix that, run
lcov --initial to create a base line info file with all zero counters,
and merge that with the actual counters when creating the final report.
Reviewed-by: Michael Paquier <michael.paquier@gmail.com>
Peter Eisentraut [Thu, 28 Sep 2017 20:17:28 +0000 (16:17 -0400)]
Remove SGML marked sections
For XML compatibility, replace marked sections <![IGNORE[ ]]> with
comments <!-- -->. In some cases it seemed better to remove the ignored
text altogether, and in one case the text should not have been ignored.
Vacuum calls page-level HOT prune to remove dead HOT tuples before doing
liveness checks (HeapTupleSatisfiesVacuum) on the remaining tuples. But
concurrent transaction commit/abort may turn DEAD some of the HOT tuples
that survived the prune, before HeapTupleSatisfiesVacuum tests them.
This happens to activate the code that decides to freeze the tuple ...
which resuscitates it, duplicating data.
(This is especially bad if there's any unique constraints, because those
are now internally violated due to the duplicate entries, though you
won't know until you try to REINDEX or dump/restore the table.)
One possible fix would be to simply skip doing anything to the tuple,
and hope that the next HOT prune would remove it. But there is a
problem: if the tuple is older than freeze horizon, this would leave an
unfrozen XID behind, and if no HOT prune happens to clean it up before
the containing pg_clog segment is truncated away, it'd later cause an
error when the XID is looked up.
Fix the problem by having the tuple freezing routines cope with the
situation: don't freeze the tuple (and keep it dead). In the cases that
the XID is older than the freeze age, set the HEAP_XMAX_COMMITTED flag
so that there is no need to look up the XID in pg_clog later on.
An isolation test is included, authored by Michael Paquier, loosely
based on Daniel Wood's original reproducer. It only tests one
particular scenario, though, not all the possible ways for this problem
to surface; it be good to have a more reliable way to test this more
fully, but it'd require more work.
In message https://postgr.es/m/20170911140103.5akxptyrwgpc25bw@alvherre.pgsql
I outlined another test case (more closely matching Dan Wood's) that
exposed a few more ways for the problem to occur.
Backpatch all the way back to 9.3, where this problem was introduced by
multixact juggling. In branches 9.3 and 9.4, this includes a backpatch
of commit e5ff9fefcd50 (of 9.5 era), since the original is not
correctable without matching the coding pattern in 9.5 up.
Reported-by: Daniel Wood Diagnosed-by: Daniel Wood Reviewed-by: Yi Wen Wong, Michaƫl Paquier
Discussion: https://postgr.es/m/E5711E62-8FDF-4DCA-A888-C200BF6B5742@amazon.com
Peter Eisentraut [Fri, 11 Aug 2017 03:33:47 +0000 (23:33 -0400)]
Run only top-level recursive lcov
This is the way lcov was intended to be used. It is much faster and
more robust and makes the makefiles simpler than running it in each
subdirectory.
The previous coding ran gcov before lcov, but that is useless because
lcov/geninfo call gcov internally and use that information. Moreover,
this led to complications and failures during parallel make. This
separates the two targets: You either use "make coverage" to get
textual output from gcov or "make coverage-html" to get an HTML report
via lcov. (Using both is still problematic because they write the same
output files.)
Reviewed-by: Michael Paquier <michael.paquier@gmail.com>
Tom Lane [Wed, 27 Sep 2017 21:05:53 +0000 (17:05 -0400)]
Fix behavior when converting a float infinity to numeric.
float8_numeric() and float4_numeric() failed to consider the possibility
that the input is an IEEE infinity. The results depended on the
platform-specific behavior of sprintf(): on most platforms you'd get
something like
ERROR: invalid input syntax for type numeric: "inf"
but at least on Windows it's possible for the conversion to succeed and
deliver a finite value (typically 1), due to a nonstandard output format
from sprintf and lack of syntax error checking in these functions.
Since our numeric type lacks the concept of infinity, a suitable conversion
is impossible; the best thing to do is throw an explicit error before
letting sprintf do its thing.
While at it, let's use snprintf not sprintf. Overrunning the buffer
should be impossible if sprintf does what it's supposed to, but this
is cheap insurance against a stack smash if it doesn't.
Problem reported by Taiki Kondo. Patch by me based on fix suggestion
from KaiGai Kohei. Back-patch to all supported branches.
Tom Lane [Wed, 27 Sep 2017 20:14:37 +0000 (16:14 -0400)]
Revert to 9.6 treatment of ALTER TYPE enumtype ADD VALUE.
This reverts commit 15bc038f9, along with the followon commits 1635e80d3
and 984c92074 that tried to clean up the problems exposed by bug #14825.
The result was incomplete because it failed to address parallel-query
requirements. With 10.0 release so close upon us, now does not seem like
the time to be adding more code to fix that. I hope we can un-revert this
code and add the missing parallel query support during the v11 cycle.
Peter Eisentraut [Wed, 27 Sep 2017 19:51:04 +0000 (15:51 -0400)]
Fix plperl build
The changes in 639928c988c1c2f52bbe7ca89e8c7c78a041b3e2 turned out to
require Perl 5.9.3, which is newer than our minimum required version.
So revert back to the old code for the normal case and only use the new
variant when both coverage and vpath are used. As the minimum Perl
version moves forward, we can drop the old code sometime.
Dean Rasheed [Wed, 27 Sep 2017 16:16:15 +0000 (17:16 +0100)]
Improve the CREATE POLICY documentation.
Provide a correct description of how multiple policies are combined,
clarify when SELECT permissions are required, mention SELECT FOR
UPDATE/SHARE, and do some other more minor tidying up.
Peter Eisentraut [Fri, 11 Aug 2017 03:33:47 +0000 (23:33 -0400)]
Improve vpath support in plperl build
Run xsubpp with the -output option instead of redirecting stdout. That
ensures that the #line directives in the output file point to the right
place in a vpath build. This in turn fixes an error in coverage builds
that it can't find the source files.
Refactor the makefile rules while we're here.
Reviewed-by: Michael Paquier <michael.paquier@gmail.com>
Peter Eisentraut [Fri, 15 Sep 2017 14:17:37 +0000 (10:17 -0400)]
Get rid of parameterized marked sections in SGML
Previously, we created a variant of the installation instructions for
producing the plain-text INSTALL file by marking up certain parts of
installation.sgml using SGML parameterized marked sections. Marked
sections will not work anymore in XML, so before we can convert the
documentation to XML, we need a new approach.
DocBook provides a "profiling" feature that allows selecting content
based on attributes, which would work here. But it imposes a noticeable
overhead when building the full documentation and causes complications
when building some output formats, and given that we recently spent a
fair amount of effort optimizing the documentation build time, it seems
sad to have to accept that.
So as an alternative, (1) we create our own mini-profiling layer that
adjusts just the text we want, and (2) assemble the pieces of content
that we want in the INSTALL file using XInclude. That way, there is no
overhead when building the full documentation and most of the "ugly"
stuff in installation.sgml can be removed and dealt with out of line.
Peter Eisentraut [Tue, 26 Sep 2017 20:07:52 +0000 (16:07 -0400)]
pg_basebackup: Add option to create replication slot
When requesting a particular replication slot, the new pg_basebackup
option -C/--create-slot creates it before starting to replicate from it.
Further refactor the slot creation logic to include the temporary slot
creation logic into the same function. Add new arguments is_temporary
and preserve_wal to CreateReplicationSlot(). Print in --verbose mode
that a slot has been created.
It drops objects outside information_schema that depend on objects
inside information_schema. For example, it will drop a user-defined
view if the view query refers to information_schema.