Tom Lane [Thu, 19 Feb 2015 01:53:14 +0000 (20:53 -0500)]
Split array_push into separate array_append and array_prepend functions.
There wasn't any good reason for a single C function to implement both
these SQL functions: it saved very little code overall, and it required
significant pushups to re-determine at runtime which case applied. Redoing
it as two functions ends up with just slightly more lines of code, but it's
simpler to understand, and faster too because we need not repeat syscache
lookups on every call.
An important side benefit is that this eliminates the only case in which
different aliases of the same C function had both anyarray and anyelement
arguments at the same position, which would almost always be a mistake.
The opr_sanity regression test will now notice such mistakes since there's
no longer a valid case where it happens.
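A minimal sketch of the resulting shape (array_insert_common is a hypothetical helper invented here for illustration, not the committed code; Datum and PG_FUNCTION_ARGS are the usual fmgr conventions):

    Datum
    array_append(PG_FUNCTION_ARGS)
    {
        /* the element argument's position is fixed at compile time,
         * so no runtime re-detection or syscache lookup is needed */
        return array_insert_common(fcinfo, false);  /* hypothetical helper */
    }

    Datum
    array_prepend(PG_FUNCTION_ARGS)
    {
        return array_insert_common(fcinfo, true);   /* insert at the front */
    }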
Alvaro Herrera [Wed, 18 Feb 2015 17:44:27 +0000 (14:44 -0300)]
Fix opclass/opfamily identity strings
The original representation uses "opcname for amname", which is good
enough; but if we replace "for" with "using", we can apply the returned
identity directly in a DROP command, as in
DROP OPERATOR CLASS opcname USING amname
This slightly simplifies code using object identities to programmatically
execute commands on these kinds of objects.
Note backwards-incompatible change:
The previous representation dates back to 9.3 when object identities
were introduced by commit f8348ea3, but we don't want to change the
behavior on released branches unnecessarily and so this is not
backpatched.
Tom Lane [Wed, 18 Feb 2015 17:23:40 +0000 (12:23 -0500)]
Fix placement of "SET row_security" command issuance in pg_dump.
Somebody apparently threw darts at the code to decide where to insert
these. They certainly didn't proceed by adding them where other similar
SETs were handled. This at least broke pg_restore, and perhaps other
use-cases too.
Tom Lane [Wed, 18 Feb 2015 16:43:00 +0000 (11:43 -0500)]
Fix failure to honor -Z compression level option in pg_dump -Fd.
cfopen() and cfopen_write() failed to pass the compression level through
to zlib, so that you always got the default compression level if you got
any at all.
In passing, also fix these and related functions so that the correct errno
is reliably returned on failure; the original coding supposes that free()
cannot change errno, which is untrue on at least some platforms.
Per bug #12779 from Christoph Berg. Back-patch to 9.1 where the faulty
code was introduced.
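The errno fix follows the standard save-and-restore pattern; a minimal sketch (not the literal cfopen() code; the function and variable names here are illustrative):

    #include <errno.h>
    #include <stdio.h>
    #include <stdlib.h>

    static FILE *
    open_with_cleanup(const char *path, void *state)
    {
        FILE *fp = fopen(path, "rb");

        if (fp == NULL)
        {
            int save_errno = errno; /* free() may clobber errno on some platforms */

            free(state);            /* cleanup must not hide the real error */
            errno = save_errno;     /* report the original failure cause */
            return NULL;
        }
        return fp;
    }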
Tom Lane [Tue, 17 Feb 2015 23:04:11 +0000 (18:04 -0500)]
Fix EXPLAIN output for cases where parent table is excluded by constraints.
The previous coding in EXPLAIN always labeled a ModifyTable node with the
name of the target table affected by its first child plan. When originally
written, this was necessarily the parent table of the inheritance tree,
so everything was unconfusing. But when we added NO INHERIT constraints,
it became possible for the parent table to be deleted from the plan by
constraint exclusion while still leaving child tables present. This led to
the ModifyTable plan node being labeled with the first surviving child,
which was deemed confusing. Fix it by retaining the parent table's RT
index in a new field in ModifyTable.
Etsuro Fujita, reviewed by Ashutosh Bapat and myself
After removal, the next_sibling pointer of a node was sometimes incorrectly
left to point to another node in the heap, which meant that a node was
sometimes linked twice into the heap. Surprisingly that didn't cause any
crashes in my testing, but it was clearly wrong and could easily segfault
in other scenarios.
Also always keep the prev_or_parent pointer as NULL on the root node. That
was not a correctness issue AFAICS, but let's be tidy.
Add a debugging function, to dump the contents of a pairing heap as a
string. It's #ifdef'd out, as it's not used for anything in any normal
code, but it was highly useful in debugging this. Let's keep it handy for
further reference.
Fix knn-GiST queue comparison function to return heap tuples first.
The part of the comparison function that was supposed to keep heap tuples
ahead of index items was backwards. It would not lead to incorrect results,
but it is more efficient to return heap tuples first, before scanning more
index pages, when both have the same distance.
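Conceptually the corrected tie-break looks like this (a sketch with a hypothetical item type, not the actual knn-GiST queue code):

    #include <stdbool.h>

    typedef struct QueueItem        /* hypothetical stand-in for the real struct */
    {
        double distance;
        bool   is_heap_tuple;
    } QueueItem;

    static int
    queue_item_cmp(const QueueItem *a, const QueueItem *b)
    {
        if (a->distance != b->distance)
            return (a->distance < b->distance) ? -1 : 1;
        /* equal distance: heap tuples sort ahead of index-page items */
        if (a->is_heap_tuple != b->is_heap_tuple)
            return a->is_heap_tuple ? -1 : 1;
        return 0;
    }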
Tom Lane [Tue, 17 Feb 2015 17:49:18 +0000 (12:49 -0500)]
Remove code to match IPv4 pg_hba.conf entries to IPv4-in-IPv6 addresses.
In investigating yesterday's crash report from Hugo Osvaldo Barrera, I only
looked back as far as commit f3aec2c7f51904e7 where the breakage occurred
(which is why I thought the IPv4-in-IPv6 business was undocumented). But
actually the logic dates back to commit 3c9bb8886df7d56a and was simply
broken by erroneous refactoring in the later commit. A bit of archives
excavation shows that we added the whole business in response to a report
that some 2003-era Linux kernels would report IPv4 connections as having
IPv4-in-IPv6 addresses. The fact that we've had no complaints since 9.0
seems to be sufficient confirmation that no modern kernels do that, so
let's just rip it all out rather than trying to fix it.
Do this in the back branches too, thus essentially deciding that our
effective behavior since 9.0 is correct. If there are any platforms on
which the kernel reports IPv4-in-IPv6 addresses as such, yesterday's fix
would have made for a subtle and potentially security-sensitive change in
the effective meaning of IPv4 pg_hba.conf entries, which does not seem like
a good thing to do in minor releases. So let's let the post-9.0 behavior
stand, and change the documentation to match it.
In passing, I failed to resist the temptation to wordsmith the description
of pg_hba.conf IPv4 and IPv6 address entries a bit. A lot of this text
hasn't been touched since we were IPv4-only.
Robert Haas [Tue, 17 Feb 2015 15:19:30 +0000 (10:19 -0500)]
Improve pg_check_dir code and comments.
Avoid losing errno if readdir() fails and closedir() works. Consistently
return 4 rather than 3 if both a lost+found directory and other files are
found, rather than returning one value or the other depending on the
order of the directory listing. Update comments to match the actual
behavior.
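The errno handling amounts to the following pattern (a sketch of the idea, not the exact pg_check_dir() code):

    #include <dirent.h>
    #include <errno.h>

    static int
    check_dir_sketch(const char *path)
    {
        DIR    *dir;
        struct dirent *de;
        int     result = 1;         /* 1 = empty, per pg_check_dir's convention */
        int     readdir_errno = 0;

        if ((dir = opendir(path)) == NULL)
            return -1;

        errno = 0;
        while ((de = readdir(dir)) != NULL)
        {
            /* classify the entry here; set result to 2, 3 or 4 as appropriate */
            errno = 0;              /* reset before the next readdir() call */
        }
        if (errno)
            readdir_errno = errno;  /* remember readdir()'s failure */

        if (closedir(dir))
            result = -1;            /* closedir() failed; its errno stands */

        if (readdir_errno != 0)
        {
            result = -1;
            errno = readdir_errno;  /* a successful closedir() must not hide it */
        }
        return result;
    }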
Tom Lane [Mon, 16 Feb 2015 21:17:48 +0000 (16:17 -0500)]
Fix misuse of memcpy() in check_ip().
The previous coding copied garbage into a local variable, pretty much
ensuring that the intended test of an IPv6 connection address against a
promoted IPv4 address from pg_hba.conf would never match. The lack of
field complaints likely indicates that nobody realized this was supposed
to work, which is unsurprising considering that no user-facing docs suggest
it should work.
In principle this could have led to a SIGSEGV due to reading off the end of
memory, but since the source address would have pointed to somewhere in the
function's stack frame, that's quite unlikely. What led to discovery of
the bug is Hugo Osvaldo Barrera's report of a crash after an OS upgrade,
which is probably because he is now running a system in which memcpy calls
abort() upon detecting overlapping source and destination areas. (You'd
have to additionally suppose some things about the stack frame layout to
arrive at this conclusion, but it seems plausible.)
This has been broken since the code was added, in commit f3aec2c7f51904e7,
so back-patch to all supported branches.
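For context, overlapping copies are undefined behavior with memcpy(), and an implementation is entitled to detect them and abort, which is evidently what the upgraded system's libc does:

    #include <string.h>

    int
    main(void)
    {
        char buf[16] = "abcdefgh";

        /* memcpy(buf + 2, buf, 8) would be undefined behavior here: the
         * source and destination regions overlap, and a checking libc
         * may call abort() on it */
        memmove(buf + 2, buf, 8);   /* memmove() is defined for overlaps */
        return 0;
    }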
Restore the SSL_set_session_id_context() call to OpenSSL renegotiation.
This reverts the removal of the call in commit 272923a0. It turns out it
wasn't superfluous after all: without it, renegotiation fails if a client
certificate was used. The rest of the changes in that commit are still OK
and not reverted.
Per investigation of bug #12769 by Arne Scheffer, although this doesn't fix
the reported bug yet.
Tom Lane [Mon, 16 Feb 2015 20:28:40 +0000 (15:28 -0500)]
Use fast path in plpgsql's RETURN/RETURN NEXT in more cases.
exec_stmt_return() and exec_stmt_return_next() have fast-path code for
handling a simple variable reference (i.e. "return var") without going
through the full expression evaluation machinery. For some reason,
pl_gram.y was under the impression that this fast path only applied for
record/row variables; but in reality code for handling regular scalar
variables has been there all along. Adjusting the logic to allow that
code to be used actually results in a net savings of code in pl_gram.y
(by eliminating some redundancy), and it buys a measurable though not
very impressive amount of speedup.
Noted while fooling with my expanded-array patch, wherein this makes a much
bigger difference because it enables returning an expanded array variable
without an extra flattening step. But AFAICS this is a win regardless,
so commit it separately.
In the SSL test suite, use a root CA cert that won't expire (so quickly)
All the other certificates were created to be valid for 10000 days, because
we don't want to have to recreate them. But I missed the root CA cert, and
the pre-created certificates included in the repository expired in January.
Fix, and re-create all the certificates.
Tom Lane [Mon, 16 Feb 2015 17:23:58 +0000 (12:23 -0500)]
Rationalize the APIs of array element/slice access functions.
The four functions array_ref, array_set, array_get_slice, array_set_slice
have traditionally declared their array inputs and results as being of type
"ArrayType *". This is a lie, and has been since Berkeley days, because
they actually also support "fixed-length array" types such as "name" and
"point"; not to mention that the inputs could be toasted. These values
should be declared Datum instead to avoid confusion. The current coding
already risks possible misoptimization by compilers, and it'll get worse
when "expanded" array representations become a valid alternative.
However, there's a fair amount of code using array_ref and array_set with
arrays that *are* known to be ArrayType structures, and there might be more
such places in third-party code. Rather than cluttering those call sites
with PointerGetDatum/DatumGetArrayTypeP cruft, what I did was to rename the
existing functions to array_get_element/array_set_element, fix their
signatures, then reincarnate array_ref/array_set as backwards compatibility
wrappers.
array_get_slice/array_set_slice have no such constituency in the core code,
and probably not in third-party code either, so I just changed their APIs.
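The compatibility wrappers are thin; approximately the following (check arrayfuncs.c for the authoritative signatures):

    #include "postgres.h"
    #include "utils/array.h"

    Datum
    array_ref(ArrayType *array, int nSubscripts, int *indx,
              int arraytyplen, int elmlen, bool elmbyval, char elmalign,
              bool *isNull)
    {
        /* old entry point, now just a wrapper with the corrected types */
        return array_get_element(PointerGetDatum(array), nSubscripts, indx,
                                 arraytyplen, elmlen, elmbyval, elmalign,
                                 isNull);
    }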
Tom Lane [Mon, 16 Feb 2015 04:26:45 +0000 (23:26 -0500)]
Fix null-pointer-deref crash while doing COPY IN with check constraints.
In commit bf7ca15875988a88e97302e012d7c4808bef3ea9 I introduced an
assumption that an RTE referenced by a whole-row Var must have a valid eref
field. This is false for RTEs constructed by DoCopy, and there are other
places taking similar shortcuts. Perhaps we should make all those places
go through addRangeTableEntryForRelation or its siblings instead of having
ad-hoc logic, but the most reliable fix seems to be to make the new code in
ExecEvalWholeRowVar cope if there's no eref. We can reasonably assume that
there's no need to insert column aliases if no aliases were provided.
Add a regression test case covering this, and also verifying that a sane
column name is in fact available in this situation.
Although the known case only crashes in 9.4 and HEAD, it seems prudent to
back-patch the code change to 9.2, since all the ingredients for a similar
failure exist in the variant patch applied to 9.3 and 9.2.
Tom Lane [Sat, 14 Feb 2015 17:20:56 +0000 (12:20 -0500)]
Avoid returning undefined bytes in chkpass_in().
We can't really fix the problem that the result is defined to depend on
random(), so it is still going to fail the "unstable input conversion"
test in parse_type.c. However, we can at least satisfy valgrind. (It
looks like this code used to be valgrind-clean, actually, until somebody
did a careless s/strncpy/strlcpy/g on it.)
In passing, let's just make real sure that chkpass_out doesn't overrun
its output buffer.
No need for backpatch, I think, since this is just to satisfy debugging
tools.
Simplify waiting logic in reading from / writing to client.
The client socket is always in non-blocking mode, and if we actually want
blocking behaviour, we emulate it by sleeping and retrying. But we have
retry loops at different layers for reads and writes, which was confusing.
To simplify, remove all the sleeping and retrying code from the lower
levels, from be_tls_read and secure_raw_read and secure_raw_write, and put
all the logic in secure_read() and secure_write().
Simplify the way OpenSSL renegotiation is initiated in server.
At least in all modern versions of OpenSSL, it is enough to call
SSL_renegotiate() once, and then forget about it. Subsequent SSL_write()
and SSL_read() calls will finish the handshake.
The SSL_set_session_id_context() call is unnecessary too. We only have
one SSL context, and the SSL session was created with that to begin with.
Bruce Momjian [Thu, 12 Feb 2015 02:02:07 +0000 (21:02 -0500)]
pg_upgrade: preserve freeze info for postgres/template1 dbs
pg_database.datfrozenxid and pg_database.datminmxid were not preserved
for the 'postgres' and 'template1' databases. This could cause missing
clog file errors on access to user tables and indexes after upgrades in
these databases.
Tom Lane [Thu, 12 Feb 2015 00:09:54 +0000 (19:09 -0500)]
Fix minor memory leak in ident_inet().
We'd leak the ident_serv data structure if the second pg_getaddrinfo_all
(the one for the local address) failed. This is not of great consequence
because a failure return here just leads directly to backend exit(), but
if this function is going to try to clean up after itself at all, it should
not have such holes in the logic. Try to fix it in a future-proof way by
having all the failure exits go through the same cleanup path, rather than
"optimizing" some of them.
Per Coverity. Back-patch to 9.2, which is as far back as this patch
applies cleanly.
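The future-proof shape is the classic single-exit pattern; a generic sketch using the standard getaddrinfo() API rather than the literal ident_inet() code:

    #include <stddef.h>
    #include <netdb.h>

    static int
    lookup_both(const char *remote_host, const char *local_host)
    {
        struct addrinfo hints = {0};
        struct addrinfo *remote_ai = NULL;
        struct addrinfo *local_ai = NULL;
        int     result = -1;

        if (getaddrinfo(remote_host, "ident", &hints, &remote_ai) != 0)
            goto cleanup;
        if (getaddrinfo(local_host, NULL, &hints, &local_ai) != 0)
            goto cleanup;       /* this path previously leaked remote_ai */

        result = 0;             /* ... use the addresses here ... */

    cleanup:
        /* every exit, success or failure, runs the same cleanup */
        if (remote_ai)
            freeaddrinfo(remote_ai);
        if (local_ai)
            freeaddrinfo(local_ai);
        return result;
    }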
Tom Lane [Wed, 11 Feb 2015 23:35:23 +0000 (18:35 -0500)]
Fix more memory leaks in failure path in buildACLCommands.
We already had one go at this issue in commit d73b7f973db5ec7e, but we
failed to notice that buildACLCommands also leaked several PQExpBuffers
along with a simply malloc'd string. This time let's try to make the
fix a bit more future-proof by eliminating the separate exit path.
It's still not exactly critical because pg_dump will curl up and die on
failure; but since the amount of the potential leak is now several KB,
it seems worth back-patching as far as 9.2 where the previous fix landed.
Per Coverity, which evidently is smarter than clang's static analyzer.
Tom Lane [Wed, 11 Feb 2015 03:38:15 +0000 (22:38 -0500)]
Fix pg_dump's heuristic for deciding which casts to dump.
Back in 2003 we had a discussion about how to decide which casts to dump.
At the time pg_dump really only considered an object's containing schema
to decide what to dump (ie, dump whatever's not in pg_catalog), and so
we chose a complicated idea involving whether the underlying types were to
be dumped (cf commit a6790ce85752b67ad994f55fdf1a450262ccc32e). But users
are allowed to create casts between built-in types, and we failed to dump
such casts. Let's get rid of that heuristic, which has accreted even more
ugliness since then, in favor of just looking at the cast's OID to decide
if it's a built-in cast or not.
In passing, also fix some really ancient code that supposed that it had to
manufacture a dependency for the cast on its cast function; that's only
true when dumping from a pre-7.3 server. This just resulted in some wasted
cycles and duplicate dependency-list entries with newer servers, but we
might as well improve it.
Per gripes from a number of people, most recently Greg Sabino Mullane.
Back-patch to all supported branches.
Tom Lane [Wed, 11 Feb 2015 01:37:19 +0000 (20:37 -0500)]
Fix GEQO to not assume its join order heuristic always works.
Back in commit 400e2c934457bef4bc3cc9a3e49b6289bd761bc0 I rewrote GEQO's
gimme_tree function to improve its heuristic for modifying the given tour
into a legal join order. In what can only be called a fit of hubris,
I supposed that this new heuristic would *always* find a legal join order,
and ripped out the old logic that allowed gimme_tree to sometimes fail.
The folly of this is exposed by bug #12760, in which the "greedy" clumping
behavior of merge_clump() can lead it into a dead end which could only be
recovered from by un-clumping. We have no code for that and wouldn't know
exactly what to do with it if we did. Rather than try to improve the
heuristic rules still further, let's just recognize that it *is* a
heuristic and probably must always have failure cases. So, put back the
code removed in the previous commit to allow for failure (but comment it
a bit better this time).
It's possible that this code was actually fully correct at the time and
has only been broken by the introduction of LATERAL. But having seen this
example I no longer have much faith in that proposition, so back-patch to
all supported branches.
Michael Meskes [Tue, 10 Feb 2015 11:00:13 +0000 (12:00 +0100)]
Fixed array handling in ecpg.
When ecpg was rewritten for the new protocol version, not all variable types
were corrected. This patch rewrites the code for those types to fix that. It
also fixes the documentation to correctly describe the status of array handling.
Speed up CRC calculation using slicing-by-8 algorithm.
This speeds up WAL generation and replay. The new algorithm is
significantly faster with large inputs, like full-page images or when
inserting wide rows. It is slower with tiny inputs, i.e. less than 10 bytes
or so, but the speedup with longer inputs more than makes up for that. Even
small WAL records have at least a 24-byte header at the front.
The output is identical to the current byte-at-a-time computation, so this
does not affect compatibility. The new algorithm is only used for the
CRC-32C variant, not the legacy version used in tsquery or the
"traditional" CRC-32 used in hstore and ltree. Those are not as performance
critical, and are usually only applied over small inputs, so it seems
better to not carry around the extra lookup tables to speed up those rare
cases.
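For reference, a self-contained little-endian sketch of slicing-by-8 for CRC-32C (reflected polynomial 0x82F63B78); the actual implementation uses precomputed tables and handles byte order portably:

    #include <stddef.h>
    #include <stdint.h>

    static uint32_t T[8][256];

    static void
    init_tables(void)
    {
        for (int i = 0; i < 256; i++)
        {
            uint32_t crc = i;

            for (int j = 0; j < 8; j++)
                crc = (crc >> 1) ^ ((crc & 1) ? 0x82F63B78 : 0);
            T[0][i] = crc;
        }
        /* T[k] advances the CRC state by k additional zero bytes */
        for (int k = 1; k < 8; k++)
            for (int i = 0; i < 256; i++)
                T[k][i] = (T[k - 1][i] >> 8) ^ T[0][T[k - 1][i] & 0xFF];
    }

    /* start with crc = 0xFFFFFFFF; the final CRC-32C value is ~result */
    static uint32_t
    crc32c_sb8(uint32_t crc, const unsigned char *p, size_t len)
    {
        while (len >= 8)        /* consume 8 input bytes per iteration */
        {
            crc ^= (uint32_t) p[0] | ((uint32_t) p[1] << 8) |
                   ((uint32_t) p[2] << 16) | ((uint32_t) p[3] << 24);
            crc = T[7][crc & 0xFF] ^ T[6][(crc >> 8) & 0xFF] ^
                  T[5][(crc >> 16) & 0xFF] ^ T[4][crc >> 24] ^
                  T[3][p[4]] ^ T[2][p[5]] ^ T[1][p[6]] ^ T[0][p[7]];
            p += 8;
            len -= 8;
        }
        while (len-- > 0)       /* tail: byte at a time, same result */
            crc = (crc >> 8) ^ T[0][(crc ^ *p++) & 0xFF];
        return crc;
    }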
Tom Lane [Mon, 9 Feb 2015 17:30:52 +0000 (12:30 -0500)]
Minor cleanup/code review for "indirect toast" stuff.
Fix some issues I noticed while fooling with an extension to allow an
additional kind of toast pointer. Much of this is just comment
improvement, but there are a couple of actual bugs, which might or might
not be reachable today depending on what can happen during logical
decoding. An example is that toast_flatten_tuple() failed to cover the
possibility of an indirection pointer in its input. Back-patch to 9.4
just in case that is reachable now.
In HEAD, also correct some really minor issues with recent compression
reorganization, such as dangerously underparenthesized macros.
Move pg_crc.c to src/common, and remove pg_crc_tables.h
To get CRC functionality in a client program, you now need to link with
libpgcommon instead of libpgport. The CRC code has nothing to do with
portability, so libpgcommon is a better home. (libpgcommon didn't exist
when pg_crc.c was originally moved to src/port.)
Remove the possibility to get CRC functionality by just #including
pg_crc_tables.h. I'm not aware of any extensions that actually did that and
couldn't simply link with libpgcommon.
This also moves the pg_crc.h header file from src/include/utils to
src/include/common, which will require changes to any external programs
that currently do #include "utils/pg_crc.h". That seems acceptable, as
include/common is clearly the right home for it now, and the change needed
to any such programs is trivial.
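A sketch of what client-side usage looks like after the move (macro and type names as of this commit; check common/pg_crc.h for the authoritative versions), linking with libpgcommon as described above:

    #include "c.h"
    #include "common/pg_crc.h"

    static pg_crc32
    buffer_crc32c(const char *data, size_t len)
    {
        pg_crc32 crc;

        INIT_CRC32C(crc);
        COMP_CRC32C(crc, data, len);
        FIN_CRC32C(crc);
        return crc;
    }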
Fujii Masao [Mon, 9 Feb 2015 06:15:24 +0000 (15:15 +0900)]
Move pg_lzcompress.c to src/common.
The PGLZ_Header metadata is removed, to make the compression and
decompression code independent of the backend-only varlena facility.
PGLZ_Header was used to store metadata about the data being compressed,
such as the raw length of the uncompressed record, plus some
varlena-related data; that made the code unmovable to src/common as long
as it contained backend-only code paths for managing varlena structures.
The PGLZ APIs are reworked at the same time to do only compression and
decompression of buffers, without the metadata layer, simplifying them
for more general use. The on-disk format is preserved, so there is no
incompatibility with previous major versions of PostgreSQL for TOAST
entries.
Exposing pglz's compression and decompression APIs makes them usable by
extensions and contrib modules. In particular, this commit is required for
the upcoming WAL compression feature, so that the WAL reader facility can
decompress WAL data using pglz_decompress.
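A sketch of the reworked buffer-level API (see common/pg_lzcompress.h for the authoritative signatures); since PGLZ_Header is gone, callers must track the raw length themselves:

    #include "postgres.h"
    #include "common/pg_lzcompress.h"

    /* returns the compressed size, or -1 if the data was not compressible;
     * dst must have room for at least PGLZ_MAX_OUTPUT(slen) bytes */
    static int32
    compress_buffer(const char *src, int32 slen, char *dst)
    {
        return pglz_compress(src, slen, dst, PGLZ_strategy_default);
    }

    /* rawsize is the original uncompressed length, remembered by the caller */
    static int32
    decompress_buffer(const char *src, int32 slen, char *dst, int32 rawsize)
    {
        return pglz_decompress(src, slen, dst, rawsize);
    }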
Noah Misch [Sat, 7 Feb 2015 04:39:52 +0000 (23:39 -0500)]
Check DCH_MAX_ITEM_SIZ limits with <=, not <.
We reserve space for the full amount, not one less. The affected checks
deal with localized month and day names. Today's DCH_MAX_ITEM_SIZ value
would suffice for a 60-byte day name, while the longest known is the
49-byte mn_CN.utf-8 word for "Saturday." Thus, the upshot of this
change is merely to avoid misdirecting future readers of the code; users
are not expected to see errors either way.
Report WAL flush, not insert, position in replication IDENTIFY_SYSTEM
When beginning streaming replication, the client usually issues the
IDENTIFY_SYSTEM command, which used to return the current WAL insert
position. That's not suitable for the intended purpose of that field,
however. pg_receivexlog uses it to start replication from the reported
point, but if it hasn't been flushed to disk yet, it will fail. Change
IDENTIFY_SYSTEM to report the flush position instead.
Backpatch to 9.1 and above. 9.0 doesn't report any WAL position.
Michael Meskes [Thu, 5 Feb 2015 14:12:34 +0000 (15:12 +0100)]
This routine was calling ecpg_alloc to allocate memory but did not
actually check the returned pointer, which could be NULL if the
underlying malloc call failed.
Issue noted by Coverity, fixed by Michael Paquier <michael@otacoo.com>
It was getting tedious to track and release all the different things that
form a scan key. We were leaking at least the queryCategories array, and
possibly more, on a rescan. That was visible if a GIN index was used in a
nested loop join. This also protects against leaks in the extractQuery method.
No backpatching, given the lack of complaints from the field. Maybe later,
after this has received more field testing.
Fix reference-after-free when waiting for another xact due to constraint.
If an insertion or update had to wait for another transaction to finish,
because there was another insertion with a conflicting key in progress,
we would pass a just-free'd item pointer to XactLockTableWait().
All calls to XactLockTableWait() and MultiXactIdWait() had similar issues.
Some passed a pointer to a buffer in the buffer cache, after already
releasing the lock. The call in EvalPlanQualFetch had already released the
pin too. All but the call in execUtils.c would merely lead to reporting a
bogus ctid, however (or an assertion failure, if enabled).
All the callers that passed HeapTuple->t_data->t_ctid were slightly bogus
anyway: if the tuple was updated (again) in the same transaction, its ctid
field would point to the next tuple in the chain, not the tuple itself.
Backpatch to 9.4, where the 'ctid' argument to XactLockTableWait was added
(in commit f88d4cfc)
Add dummy PQsslAttributes function for non-SSL builds.
All the other new SSL information functions had dummy versions in
fe-secure.c, but I missed PQsslAttributes(). Oops. Surprisingly, even
though the function is exported, the linker did not complain about it
being missing on most platforms represented in the buildfarm, except for
a few Windows systems.
Andres Freund [Tue, 3 Feb 2015 22:52:15 +0000 (23:52 +0100)]
Remove ill-conceived Assertion in ProcessClientWriteInterrupt().
It's perfectly fine to have blocked interrupts when
ProcessClientWriteInterrupt() is called. In fact it's commonly the
case when emitting error reports. And we deal with that correctly.
Even if that'd not be the case, it'd be a bad location for such an
assertion: because ProcessClientWriteInterrupt() is only called when
the socket is blocked, it's hard to hit.
Per Heikki and buildfarm animals nightjar and dunlin.
Andres Freund [Tue, 3 Feb 2015 22:25:00 +0000 (23:25 +0100)]
Remove the option to service interrupts during PGSemaphoreLock().
The remaining caller (lwlocks) doesn't need that facility, and we plan
to remove ImmediateInterruptOK entirely. That means that interrupts
can't be serviced race-free and portably anyway, so there's little
reason for keeping the feature.
Andres Freund [Tue, 3 Feb 2015 22:24:38 +0000 (23:24 +0100)]
Move deadlock and other interrupt handling in proc.c out of signal handlers.
Deadlock checking was performed inside signal handlers up to
now. While it's a remarkable feat to have made this work reliably,
it's quite complex to understand why that is the case. Partially it
worked due to the assumption that semaphores are signal safe - which
is not actually documented to be the case for sysv semaphores.
The reason we had to rely on performing this work inside signal
handlers is that semaphores aren't guaranteed to be interruptible by
signals on all platforms. But now that latches provide a somewhat
similar API, which actually has the guarantee of being interruptible,
we can avoid doing so.
Signalling between ProcSleep, ProcWakeup, ProcWaitForSignal and
ProcSendSignal is now done using latches. This increases the
likelihood of spurious wakeups. As spurious wakeups were already
possible and aren't likely to be frequent enough to be an actual
problem, this seems acceptable.
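Spurious wakeups are inherent in the latch API's canonical wait loop, which always re-checks its condition after waking (a sketch following the pattern documented in latch.h; condition_is_satisfied() is a placeholder):

    #include "postgres.h"
    #include "miscadmin.h"
    #include "storage/latch.h"
    #include "storage/proc.h"

    static void
    wait_for_condition(void)
    {
        for (;;)
        {
            ResetLatch(&MyProc->procLatch);
            if (condition_is_satisfied())   /* placeholder for the real test */
                break;
            WaitLatch(&MyProc->procLatch, WL_LATCH_SET, 0);
            CHECK_FOR_INTERRUPTS();
        }
    }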
This change would allow for further simplification of the deadlock
checking, now that it doesn't have to run in a signal handler. But
even if I were motivated to do so right now, it would still be better
to do that separately. Such a cleanup shouldn't have to be reviewed at
the same time as the more fundamental changes in this commit.
There is one possible usability regression due to this commit. Namely
it is more likely than before that log_lock_waits messages are output
more than once.
Andres Freund [Tue, 3 Feb 2015 21:54:48 +0000 (22:54 +0100)]
Don't allow immediate interrupts during authentication anymore.
We used to handle authentication_timeout by setting
ImmediateInterruptOK to true during large parts of the authentication
phase of a new connection. While that happens to work acceptably in
practice, it's not particularly nice and has ugly corner cases.
Previous commits converted the FE/BE communication to use latches and
implemented support for interrupt handling during both
send/recv. Building on top of that work we can get rid of
ImmediateInterruptOK during authentication, by immediately treating
timeouts during authentication as a reason to die. As die interrupts
are handled immediately during client communication, that provides a
reasonably quick reaction time to authentication timeout.
Additionally add a few CHECK_FOR_INTERRUPTS() to some more complex
authentication methods. More could be added, but this should already
provide reasonable coverage.
While this overall increases the maximum time until a timeout is
reacted to, it greatly reduces complexity and increases reliability.
That seems like an overall win. If the increase proves to be noticeable
we can deal with those cases by moving to nonblocking network code and
adding interrupt checking there.
Tom Lane [Tue, 3 Feb 2015 21:50:50 +0000 (16:50 -0500)]
Remove unused "m" field in LSEG.
This field has been unreferenced since 1998, and does not appear in lseg
values stored on disk (since sizeof(lseg) is only 32 bytes according to
pg_type). There was apparently some idea of maintaining it just in values
appearing in memory, but the bookkeeping required to make that work would
surely far outweigh the cost of recalculating the line's slope when needed.
Remove it to (a) simplify matters and (b) suppress some uninitialized-field
whining from Coverity.
Andres Freund [Tue, 3 Feb 2015 21:45:45 +0000 (22:45 +0100)]
Process 'die' interrupts while reading/writing from the client socket.
Up to now it was impossible to terminate a backend that was trying to
send/recv data to/from the client when the socket's buffer was already
full/empty. While the send/recv calls themselves might have gotten
interrupted by signals on some platforms, we just immediately retried.
That could lead to situations where a backend couldn't be terminated,
after a client died without the connection being closed, because it
was blocked in send/recv.
The problem was far more likely to be hit when sending data than when
reading. That's because while reading a command from the client, and
during authentication, we processed interrupts immediately. That
primarily left COPY FROM STDIN as being problematic for recv.
Change things so that we process 'die' events immediately when
the appropriate signal arrives. We can't sensibly react to query
cancels at that point, because we might lose sync with the client as
we could be in the middle of writing a message.
We only interrupt writes when the write buffer is full, as indicated by
write() returning EWOULDBLOCK; interrupting writes that could still
proceed would lead to fewer error messages reaching clients.
Per discussion with Kyotaro HORIGUCHI and Heikki Linnakangas
Andres Freund [Tue, 3 Feb 2015 21:25:20 +0000 (22:25 +0100)]
Introduce and use infrastructure for interrupt processing during client reads.
Up to now large swathes of backend code ran inside signal handlers
while reading commands from the client, to allow for speedy reaction to
asynchronous events. Most prominently shared invalidation and NOTIFY
handling. That meant that complex code like the starting/stopping of
transactions was run in signal handlers... The required code was
fragile and verbose, and was likely to contain bugs.
That approach also severely limited what could be done while
communicating with the client. As the read might be from within
openssl it wasn't safely possible to trigger an error, e.g. to cancel
a backend in idle-in-transaction state. We did that in some cases,
namely fatal errors, nonetheless.
Now that FE/BE communication in the backend employs non-blocking
sockets and latches to block, we can quite simply interrupt reads from
signal handlers by setting the latch. That allows us to signal an
interrupted read, which is supposed to be retried after returning from
within the ssl library.
As signal handlers now only need to set the latch to guarantee timely
interrupt processing, remove a fair amount of complicated & fragile
code from async.c and sinval.c.
We could now actually start to process some kinds of interrupts, like
sinval ones, more often than before, but that seems better done
separately.
This work will hopefully allow cases like being blocked while sending
data, interrupting idle transactions, and similar to be implemented
without too much effort. In addition to allowing getting rid of
ImmediateInterruptOK, that is.
Author: Andres Freund
Reviewed-By: Heikki Linnakangas
Andres Freund [Tue, 3 Feb 2015 21:03:48 +0000 (22:03 +0100)]
Use a nonblocking socket for FE/BE communication and block using latches.
This allows introducing more elaborate handling of interrupts while
reading from a socket. Currently some interrupt handlers have to do
significant work from inside signal handlers, and it's very hard to
correctly write code to do so. Generic signal handler limitations,
combined with the fact that we can't safely jump out of a signal
handler while reading from the client, have prohibited implementation
of features like timeouts for idle-in-transaction.
Additionally we use the latch code to wait in a couple of places where we
previously only had waiting code on Windows, as other platforms just
busy-looped.
This can increase the number of system calls happening during FE/BE
communication. Benchmarks so far indicate that the impact isn't very
high, and there's room for optimization in the latch code. The chances
for cleanup that the usage of latches gives us seem to outweigh the
risk of small performance regressions.
This commit theoretically can't be used without the next patch in the
series, as WaitLatchOrSocket is not defined to be fully signal
safe. As we already do that in some cases though, it seems better to
keep the commits separate, so they're easier to understand.
Author: Andres Freund
Reviewed-By: Heikki Linnakangas
Tom Lane [Tue, 3 Feb 2015 20:20:45 +0000 (15:20 -0500)]
Fix breakage in GEODEBUG debug code.
LINE doesn't have an "m" field (anymore anyway). Also fix unportable
assumption that %x can print the result of pointer subtraction.
In passing, improve single_decode() in minor ways:
* Remove unnecessary leading-whitespace skip (strtod does that already).
* Make GEODEBUG message more intelligible.
* Remove entirely-useless test to see if strtod returned a silly pointer.
* Don't bother computing trailing-whitespace skip unless caller wants
an ending pointer.
Add API functions to libpq to interrogate SSL related stuff.
This makes it possible to query for things like the SSL version and cipher
used, without depending on OpenSSL functions or macros. That is a good
thing if we ever get another SSL implementation.
PQgetssl() still works, but it should be considered deprecated, since it
only works with OpenSSL. In particular, PQsslInUse() should be used to
check if a connection uses SSL, because as soon as we have another
implementation, PQgetssl() will return NULL even if SSL is in use.
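A usage sketch of the new calls (attribute names per the accompanying documentation; PQsslAttribute() can return NULL for attributes the library doesn't provide, so the sketch guards against that):

    #include <stdio.h>
    #include "libpq-fe.h"

    static const char *
    ssl_attr(PGconn *conn, const char *name)
    {
        const char *v = PQsslAttribute(conn, name);

        return v ? v : "unknown";   /* NULL means the attribute isn't available */
    }

    static void
    report_ssl(PGconn *conn)
    {
        if (!PQsslInUse(conn))
        {
            printf("connection is not encrypted\n");
            return;
        }
        /* works for any SSL implementation, no OpenSSL calls required */
        printf("SSL: library=%s protocol=%s cipher=%s\n",
               ssl_attr(conn, "library"),
               ssl_attr(conn, "protocol"),
               ssl_attr(conn, "cipher"));
    }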
The logic to compact away removed tuples from a page was duplicated with
small differences in PageRepairFragmentation, PageIndexMultiDelete, and
PageIndexDeleteNoCompact. Put it into a common function.
Robert Haas [Mon, 2 Feb 2015 21:23:59 +0000 (16:23 -0500)]
Add new function BackgroundWorkerInitializeConnectionByOid.
Sometimes it's useful for a background worker to be able to initialize
its database connection by OID rather than by name, so provide a way
to do that.
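A sketch of a worker entry point using the new call (signature as of this commit; the database OID is assumed to have been packed into bgw_main_arg at registration time):

    #include "postgres.h"
    #include "miscadmin.h"
    #include "postmaster/bgworker.h"

    void
    my_worker_main(Datum main_arg)      /* illustrative worker entry point */
    {
        Oid dboid = DatumGetObjectId(main_arg);

        BackgroundWorkerUnblockSignals();
        /* InvalidOid as the user OID selects the bootstrap superuser */
        BackgroundWorkerInitializeConnectionByOid(dboid, InvalidOid);

        /* ... now run queries against the selected database ... */
    }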
Be more careful to not lose sync in the FE/BE protocol.
If any error occurred while we were in the middle of reading a protocol
message from the client, we could lose sync, and incorrectly try to
interpret a part of another message as a new protocol message. That will
usually lead to an "invalid frontend message" error that terminates the
connection. However, this is a security issue because an attacker might
be able to deliberately cause an error, inject a Query message in what's
supposed to be just user data, and have the server execute it.
We were quite careful to not have CHECK_FOR_INTERRUPTS() calls or other
operations that could ereport(ERROR) in the middle of processing a message,
but a query cancel interrupt or statement timeout could nevertheless cause
it to happen. Also, the V2 fastpath and COPY handling were not so careful.
It's very difficult to recover in the V2 COPY protocol, so we will just
terminate the connection on error. In practice, that's what happened
previously anyway, as we lost protocol sync.
To fix, add a new variable in pqcomm.c, PqCommReadingMsg, that is set
whenever we're in the middle of reading a message. When it's set, we cannot
safely ERROR out and continue running, because we might've read only part
of a message. PqCommReadingMsg acts somewhat similarly to critical sections
in that if an error occurs while it's set, the error handler will force the
connection to be terminated, as if the error was FATAL. It's not
implemented by promoting ERROR to FATAL in elog.c, like ERROR is promoted
to PANIC in critical sections, because we want to be able to use
PG_TRY/CATCH to recover and regain protocol sync. pq_getmessage() takes
advantage of that to prevent an OOM error from terminating the connection.
To prevent unnecessary connection terminations, add a holdoff mechanism
similar to HOLD/RESUME_INTERRUPTS() that can be used to hold off query cancel
interrupts, but still allow die interrupts. The rules on which interrupts
are processed when are now a bit more complicated, so refactor
ProcessInterrupts() and the calls to it in signal handlers so that the
signal handlers always call it if ImmediateInterruptOK is set, and
ProcessInterrupts() can decide to not do anything if the other conditions
are not met.
Reported by Emil Lenngren. Patch reviewed by Noah Misch and Andres Freund.
Backpatch to all supported versions.
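A sketch of how the holdoff mechanism brackets message reading (the macro names are from this commit; the exact placement in the real code may differ):

    #include "postgres.h"
    #include "lib/stringinfo.h"
    #include "libpq/libpq.h"
    #include "miscadmin.h"

    static bool
    read_one_message(StringInfo input_message)
    {
        int rc;

        HOLD_CANCEL_INTERRUPTS();   /* defer query cancel, still allow die */
        rc = pq_getmessage(input_message, 0);
        RESUME_CANCEL_INTERRUPTS();

        CHECK_FOR_INTERRUPTS();     /* safe point: no partial message pending */
        return rc != EOF;
    }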
Noah Misch [Mon, 2 Feb 2015 15:00:45 +0000 (10:00 -0500)]
Prevent Valgrind Memcheck errors around px_acquire_system_randomness().
This function uses uninitialized stack and heap buffers as supplementary
entropy sources. Mark them so Memcheck will not complain. Back-patch
to 9.4, where Valgrind Memcheck cooperation first appeared.
Noah Misch [Mon, 2 Feb 2015 15:00:45 +0000 (10:00 -0500)]
Cherry-pick security-relevant fixes from upstream imath library.
This covers alterations to buffer sizing and zeroing made between imath
1.3 and imath 1.20. Valgrind Memcheck identified the buffer overruns
and reliance on uninitialized data; their exploit potential is unknown.
Builds specifying --with-openssl are unaffected, because they use the
OpenSSL BIGNUM facility instead of imath. Back-patch to 9.0 (all
supported versions).
Noah Misch [Mon, 2 Feb 2015 15:00:45 +0000 (10:00 -0500)]
Fix buffer overrun after incomplete read in pullf_read_max().
Most callers pass a stack buffer. The ensuing stack smash can crash the
server, and we have not ruled out the viability of attacks that lead to
privilege escalation. Back-patch to 9.0 (all supported versions).
Bruce Momjian [Mon, 2 Feb 2015 15:00:45 +0000 (10:00 -0500)]
port/snprintf(): fix overflow and do padding
Prevent port/snprintf() from overflowing its local fixed-size
buffer and pad to the desired number of digits with zeros, even
if the precision is beyond the ability of the native sprintf().
port/snprintf() is only used on systems that lack a native
snprintf().
Reported by Bruce Momjian. Patch by Tom Lane. Backpatch to all
supported versions.
doc: Improve claim about location of pg_service.conf
The previous wording claimed that the file was always in /etc, but of
course this varies with the installation layout. Write instead that it
can be found via `pg_config --sysconfdir`. Even though this is still
somewhat incorrect because it doesn't account for moved installations, it
at least conveys that the location depends on the installation.
Tom Lane [Sat, 31 Jan 2015 23:35:13 +0000 (18:35 -0500)]
Fix documentation of psql's ECHO all mode.
"ECHO all" is ignored for interactive input, and has been for a very long
time, though possibly not for as long as the documentation has claimed the
opposite. Fix that, and also note that empty lines aren't echoed, which
while dubious is another longstanding behavior (it's embedded in our
regression test files for one thing). Per bug #12721 from Hans Ginzel.
In HEAD, also improve the code comments in this area, and suppress an
unnecessary fflush(stdout) when we're not echoing. That would likely
be safe to back-patch, but I'll not risk it mere hours before a release
wrap.
Tom Lane [Sat, 31 Jan 2015 22:30:30 +0000 (17:30 -0500)]
First-draft release notes for 9.4.1 et al.
As usual, the release notes for older branches will be made by cutting
these down, but put them up for community review first.
Note: a significant fraction of these items don't apply to 9.4.1, only to
older branches, because the fixes already appeared in 9.4.0. These can be
distinguished by noting the branch commits in the associated SGML comments.
This will be adjusted tomorrow while copying items into the older
release-X.Y.sgml files. In a few cases I've made two separate entries with
different wordings for 9.4 than for the equivalent commits in the older
branches.
Stephen Frost [Fri, 30 Jan 2015 21:09:41 +0000 (16:09 -0500)]
Policy documentation improvements
In ALTER POLICY, use 'check_expression' instead of 'expression' for the
parameter, to match up with the recent CREATE POLICY change.
In CREATE POLICY, frame the discussion as granting access to rows
instead of limiting access to rows. Further, clarify that the
expression must return true for rows to be visible/allowed and that a
false or NULL result will mean the row is not visible/allowed.
Tom Lane [Fri, 30 Jan 2015 19:44:46 +0000 (14:44 -0500)]
Fix jsonb Unicode escape processing, and in consequence disallow \u0000.
We've been trying to support \u0000 in JSON values since commit 78ed8e03c67d7333, and have introduced increasingly worse hacks to try to
make it work, such as commit 0ad1a816320a2b53. However, it fundamentally
can't work in the way envisioned, because the stored representation looks
the same as for \\u0000 which is not the same thing at all. It's also
entirely bogus to output \u0000 when de-escaped output is called for.
The right way to do this would be to store an actual 0x00 byte, and then
throw error only if asked to produce de-escaped textual output. However,
getting to that point seems likely to take considerable work and may well
never be practical in the 9.4.x series.
To preserve our options for better behavior while getting rid of the nasty
side-effects of 0ad1a816320a2b53, revert that commit in toto and instead
throw error if \u0000 is used in a context where it needs to be de-escaped.
(These are the same contexts where non-ASCII Unicode escapes throw error
if the database encoding isn't UTF8, so this behavior is by no means
without precedent.)
In passing, make both the \u0000 case and the non-ASCII Unicode case report
ERRCODE_UNTRANSLATABLE_CHARACTER / "unsupported Unicode escape sequence"
rather than claiming there's something wrong with the input syntax.
Back-patch to 9.4, where we have to do something because 0ad1a816320a2b53
broke things for many cases having nothing to do with \u0000. 9.3 also has
bogus behavior, but only for that specific escape value, so given the lack
of field complaints it seems better to leave 9.3 alone.
Tom Lane [Fri, 30 Jan 2015 18:04:56 +0000 (13:04 -0500)]
Fix Coverity warning about contrib/pgcrypto's mdc_finish().
Coverity points out that mdc_finish returns a pointer to a local buffer
(which of course is gone as soon as the function returns), leaving open
a risk of misbehaviors possibly as bad as a stack overwrite.
In reality, the only possible call site is in process_data_packets()
which does not examine the returned pointer at all. So there's no
live bug, but nonetheless the code is confusing and risky. Refactor
to avoid the issue by letting process_data_packets() call mdc_finish()
directly instead of going through the pullf_read() API.
Although this is only cosmetic, it seems good to back-patch so that
the logic in pgp-decrypt.c stays in sync across all branches.
Tom Lane [Fri, 30 Jan 2015 17:30:38 +0000 (12:30 -0500)]
Fix assorted oversights in range selectivity estimation.
calc_rangesel() failed outright when comparing range variables to empty
constant ranges with < or >=, as a result of missing cases in a switch.
It also produced a bogus estimate for > comparison to an empty range.
On top of that, the >= and > cases were mislabeled throughout. For
nonempty constant ranges, they managed to produce the right answers
anyway as a result of counterbalancing typos.
Also, default_range_selectivity() omitted cases for elem <@ range,
range &< range, and range &> range, so that rather dubious defaults
were applied for these operators.
In passing, rearrange the code in rangesel() so that the elem <@ range
case is handled in a less opaque fashion.
Report and patch by Emre Hasegeli, some additional work by me
The requiredEntries / additionalEntries arrays were not freed in
freeScanKeys() like other per-key stuff.
It's not obvious, but startScanKey() was only ever called after the keys
have been initialized with ginNewScanKey(). That's why it doesn't need to
worry about freeing existing arrays. The ginIsNewKey() test in gingetbitmap
was never true, because ginrescan free's the existing keys, and it's not OK
to call gingetbitmap twice in a row without calling ginrescan in between.
To make that clear, remove the unnecessary ginIsNewKey(). And just to be
extra sure that nothing funny happens if there is an existing key after all,
call freeScanKeys() to free it if it exists. This makes the code more
straightforward.
(I'm seeing other similar leaks in testing a query that rescans a GIN index
scan, but that's a different issue. This just fixes the obvious leak with
those two arrays.)
Kevin Grittner [Fri, 30 Jan 2015 14:57:24 +0000 (08:57 -0600)]
Allow pg_dump to use jobs and serializable transactions together.
Since 9.3, when the --jobs option was introduced, using it together
with the --serializable-deferrable option generated multiple
errors. We can get correct behavior by allowing the connection
which acquires the snapshot to use SERIALIZABLE, READ ONLY,
DEFERRABLE and pass that to the workers running the other
connections using REPEATABLE READ, READ ONLY. This is a bit of a
kluge since the SERIALIZABLE behavior is achieved by running some
of the participating connections at a different isolation level,
but it is a simple and safe change, suitable for back-patching.
This will be followed by a proposal for a more invasive fix with
some slight behavioral changes on just the master branch, based on
suggestions from Andres Freund, but the kluge will be applied to
master until something is agreed along those lines.
Back-patched to 9.3, where the --jobs option was added.
Stephen Frost [Fri, 30 Jan 2015 02:59:34 +0000 (21:59 -0500)]
Fix BuildIndexValueDescription for expressions
In 804b6b6db4dcfc590a468e7be390738f9f7755fb we modified
BuildIndexValueDescription to pay attention to which columns are visible
to the user, but unfortunately that commit neglected to consider indexes
which are built on expressions.
Handle error-reporting of violations of constraint indexes based on
expressions by not returning any detail when the user does not have
table-level SELECT rights.
Tom Lane [Fri, 30 Jan 2015 01:18:33 +0000 (20:18 -0500)]
Handle unexpected query results, especially NULLs, safely in connectby().
connectby() didn't adequately check that the constructed SQL query returns
what it's expected to; in fact, since commit 08c33c426bfebb32 it wasn't
checking that at all. This could result in a null-pointer-dereference
crash if the constructed query returns only one column instead of the
expected two. Less excitingly, it could also result in surprising data
conversion failures if the constructed query returned values that were
not I/O-conversion-compatible with the types specified by the query
calling connectby().
In all branches, insist that the query return at least two columns;
this seems like a minimal sanity check that can't break any reasonable
use-cases.
In HEAD, insist that the constructed query return the types specified by
the outer query, including checking for typmod incompatibility, which the
code never did even before it got broken. This is to hide the fact that
the implementation does a conversion to text and back; someday we might
want to improve that.
In back branches, leave that alone, since adding a type check in a minor
release is more likely to break things than make people happy. Type
inconsistencies will continue to work so long as the actual type and
declared type are I/O representation compatible, and otherwise will fail
the same way they used to.
Also, in all branches, be on guard for NULL results from the constructed
query, which formerly would cause null-pointer dereference crashes.
We now print the row with the NULL but don't recurse down from it.
In passing, get rid of the rather pointless idea that
build_tuplestore_recursively() should return the same tuplestore that's
passed to it.
Andres Freund [Thu, 29 Jan 2015 16:49:03 +0000 (17:49 +0100)]
Properly terminate the array returned by GetLockConflicts().
GetLockConflicts() has for a long time not properly terminated the
returned array. During normal processing the returned array is
zero-initialized which, while not pretty, is sufficient to be recognized
as an invalid virtual transaction id. But the HotStandby case is more than
aesthetically broken: The allocated (and reused) array is neither
zeroed upon allocation, nor reinitialized, nor terminated.
Not having a terminating element means that the end of the array will
not be recognized and that recovery conflict handling will thus read
ahead into adjacent memory, only stopping upon hitting memory
content that looks like an invalid virtual transaction id. Luckily
this seems so far not to have caused significant problems, besides making
recovery conflict handling more expensive.
Andres Freund [Thu, 29 Jan 2015 16:49:03 +0000 (17:49 +0100)]
Align buffer descriptors to cache line boundaries.
Benchmarks have shown that aligning the buffer descriptor array to
cache lines is important for scalability, especially on bigger,
multi-socket machines.
Currently the array sometimes already happens to be aligned by
happenstance, depending on how large previous shared memory allocations
were. That can lead to wildly varying performance results after minor
configuration changes.
In addition to aligning the start of the descriptor array, also force the
size of individual descriptors to be of a common cache line size (64
bytes). That happens to already be the case on 64bit platforms, but
this way we can change the struct BufferDesc more easily.
As the alignment primarily matters in highly concurrent workloads
which probably all are 64bit these days, and the space wastage of
element alignment would be a bit more noticeable on 32bit systems, we
don't force the stride to be cacheline sized on 32bit platforms for
now. If somebody does actual performance testing, we can reevaluate
that decision by changing the definition of BUFFERDESC_PADDED_SIZE.
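The technique, shown with a stand-in struct (the real code applies it to BufferDesc via the BUFFERDESC_PADDED_SIZE constant mentioned above):

    #include <stdint.h>

    #define CACHE_LINE_SIZE 64          /* the common cache line size */

    typedef struct Descriptor           /* stand-in for the real BufferDesc */
    {
        uint32_t tag;
        uint32_t state;
        /* more fields, well under 64 bytes in total */
    } Descriptor;

    /* padding the union out to a cache line keeps neighboring array
     * elements from ever sharing a line, avoiding false sharing */
    typedef union DescriptorPadded
    {
        Descriptor desc;
        char       pad[CACHE_LINE_SIZE];
    } DescriptorPadded;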
Fix bug where GIN scan keys were not initialized with gin_fuzzy_search_limit.
When gin_fuzzy_search_limit was used, we could jump out of startScan()
without calling startScanKey(). That was harmless in 9.3 and below, because
startScanKey() didn't do anything interesting, but in 9.4 it initializes
information needed for skipping entries (aka GIN fast scans), and you
readily get a segfault if it's not done. Nevertheless, it was clearly wrong
all along, so backpatch all the way to 9.1 where the early return was
introduced.
(AFAICS startScanKey() did nothing useful in 9.3 and below, because the
fields it initialized were already initialized in ginFillScanKey(), but I
don't dare to change that in a minor release. ginFillScanKey() is always
called in gingetbitmap() even though there's a check there to see if the
scan keys have already been initialized, because they never are; ginrescan()
frees them.)
In passing, remove an unnecessary if-check from the second inner loop in
startScan(). We already check in the first loop that the condition is true
for all entries.
Reported by Olaf Gawenda, bug #12694. Backpatch to 9.1 and above, although
AFAICS it causes a live bug only in 9.4.
Robert Haas [Thu, 29 Jan 2015 15:23:38 +0000 (10:23 -0500)]
Move out-of-memory error checks from aset.c to mcxt.c
This potentially allows us to add mcxt.c interfaces that do something
other than throw an error when memory cannot be allocated. We'll
handle adding those interfaces in a separate commit.
Stephen Frost [Thu, 29 Jan 2015 04:21:54 +0000 (23:21 -0500)]
Reword CREATE POLICY parameter descriptions
The parameter descriptions for using_expression and check_expression
in CREATE POLICY were unclear and arguably included a typo. Clarify
and improve the consistency of that language.