granicus.if.org Git - postgresql/log

Add defenses against integer overflow in dynahash numbuckets calculations.

The dynahash code requires the number of buckets in a hash table to fit
in an int; but since we calculate the desired hash table size dynamically,
there are various scenarios where we might calculate too large a value.
The resulting overflow can lead to infinite loops, division-by-zero
crashes, etc.  I (tgl) had previously installed some defenses against that
in commit 299d1716525c659f0e02840e31fbe4dea3, but that covered only one
call path.  Moreover it worked by limiting the request size to work_mem,
but in a 64-bit machine it's possible to set work_mem high enough that the
problem appears anyway.  So let's fix the problem at the root by installing
limits in the dynahash.c functions themselves.

Trouble report and patch by Jeff Davis.

Disable event triggers in standalone mode.

Per discussion, this seems necessary to allow recovery from broken event
triggers, or broken indexes on pg_event_trigger.

Dimitri Fontaine

Fix performance problems with autovacuum truncation in busy workloads.

In situations where there are over 8MB of empty pages at the end of
a table, the truncation work for trailing empty pages takes longer
than deadlock_timeout, and there is frequent access to the table by
processes other than autovacuum, there was a problem with the
autovacuum worker process being canceled by the deadlock checking
code. The truncation work done by autovacuum up that point was
lost, and the attempt tried again by a later autovacuum worker. The
attempts could continue indefinitely without making progress,
consuming resources and blocking other processes for up to
deadlock_timeout each time.

This patch has the autovacuum worker checking whether it is
blocking any other thread at 20ms intervals. If such a condition
develops, the autovacuum worker will persist the work it has done
so far, release its lock on the table, and sleep in 50ms intervals
for up to 5 seconds, hoping to be able to re-acquire the lock and
try again. If it is unable to get the lock in that time, it moves
on and a worker will try to continue later from the point this one
left off.

While this patch doesn't change the rules about when and what to
truncate, it does cause the truncation to occur sooner, with less
blocking, and with the consumption of fewer resources when there is
contention for the table's lock.

The only user-visible change other than improved performance is
that the table size during truncation may change incrementally
instead of just once.

This problem exists in all supported versions but is infrequently
reported, although some reports of performance problems when
autovacuum runs might be caused by this. Initial commit is just the
master branch, but this should probably be backpatched once the
build farm and general developer usage confirm that there are no
surprising effects.

Jan Wieck

Fix pg_upgrade for invalid indexes

All versions of pg_upgrade upgraded invalid indexes caused by CREATE
INDEX CONCURRENTLY failures and marked them as valid. The patch adds a
check to all pg_upgrade versions and throws an error during upgrade or
--check.

Backpatch to 9.2, 9.1, 9.0. Patch slightly adjusted.

Consistency check should compare last record replayed, not last record read.

EndRecPtr is the last record that we've read, but not necessarily yet
replayed. CheckRecoveryConsistency should compare minRecoveryPoint with the
last replayed record instead. This caused recovery to think it's reached
consistency too early.

Now that we do the check in CheckRecoveryConsistency correctly, we have to
move the call of that function to after redoing a record. The current place,
after reading a record but before replaying it, is wrong. In particular, if
there are no more records after the one ending at minRecoveryPoint, we don't
enter hot standby until one extra record is generated and read by the
standby, and CheckRecoveryConsistency is called. These two bugs conspired
to make the code appear to work correctly, except for the small window
between reading the last record that reaches minRecoveryPoint, and
replaying it.

In the passing, rename recoveryLastRecPtr, which is the last record
replayed, to lastReplayedEndRecPtr. This makes it slightly less confusing
with replayEndRecPtr, which is the last record read that we're about to
replay.

Original report from Kyotaro HORIGUCHI, further diagnosis by Fujii Masao.
Backpatch to 9.0, where Hot Standby subtly changed the test from
"minRecoveryPoint < EndRecPtr" to "minRecoveryPoint <= EndRecPtr". The
former works because where the test is performed, we have always read one
more record than we've replayed.

Add mode where contrib installcheck runs each module in a separately named database.

Normally each module is tested in a database named contrib_regression,
which is dropped and recreated at the beginhning of each pg_regress run.
This new mode, enabled by adding USE_MODULE_DB=1 to the make command
line, runs most modules in a database with the module name embedded in
it.

This will make testing pg_upgrade on clusters with the contrib modules
a lot easier.

Second attempt at this, this time accomodating make versions older
than 3.82.

Still to be done: adapt to the MSVC build system.

Backpatch to 9.0, which is the earliest version it is reasonably
possible to test upgrading from.

Fix pg_upgrade -O/-o options

Fix previous commit that added synchronous_commit=off, but broke -O/-o
due to missing space in argument passing.

Backpatch to 9.2.

doc: Remove blastwave.org link

Apparently, this service has been dead since 2008.

Update minimum recovery point on truncation.

If a file is truncated, we must update minRecoveryPoint. Once a file is
truncated, there's no going back; it would not be safe to stop recovery
at a point earlier than that anymore.

Per report from Kyotaro HORIGUCHI. Backpatch to 8.4. Before that,
minRecoveryPoint was not updated during recovery at all.

Fix the tracking of min recovery point timeline.

Forgot to update it at the right place. Also, consider checkpoint record
that switches to new timelne to be on the new timeline.

This fixes erroneous "requested timeline 2 does not contain minimum recovery
point" errors, pointed out by Amit Kapila while testing another patch.

Fix assorted bugs in privileges-for-types patch.

Commit 729205571e81b4767efc42ad7beb53663e08d1ff added privileges on data
types, but there were a number of oversights. The implementation of
default privileges for types missed a few places, and pg_dump was
utterly innocent of the whole concept. Per bug #7741 from Nathan Alden,
and subsequent wider investigation.

Support automatically-updatable views.

This patch makes "simple" views automatically updatable, without the need
to create either INSTEAD OF triggers or INSTEAD rules.  "Simple" views
are those classified as updatable according to SQL-92 rules.  The rewriter
transforms INSERT/UPDATE/DELETE commands on such views directly into an
equivalent command on the underlying table, which will generally have
noticeably better performance than is possible with either triggers or
user-written rules.  A view that has INSTEAD OF triggers or INSTEAD rules
continues to operate the same as before.

For the moment, security_barrier views are not considered simple.
Also, we do not support WITH CHECK OPTION.  These features may be
added in future.

Dean Rasheed, reviewed by Amit Kapila

Update iso.org page link

The old one is responding with 404.

Improve pg_upgrade's status display

Pg_upgrade displays file names during copy and database names during
dump/restore.  Andrew Dunstan identified three bugs:

*  long file names were being truncated to 60 _leading_ characters, which
   often do not change for long file names

*  file names were truncated to 60 characters in log files

*  carriage returns were being output to log files

This commit fixes these --- it prints 60 _trailing_ characters to the
status display, and full path names without carriage returns to log
files.  It also suppresses status output to the log file unless verbose
mode is used.

Correct xmax test for COPY FREEZE

Optimize COPY FREEZE with CREATE TABLE also.

Jeff Davis, additional test by me

Clarify that COPY FREEZE is not a hard rule.
Remove message when FREEZE not honoured,
clarify reasons in comments and docs.

Improve pl/pgsql to support composite-type expressions in RETURN.

For some reason lost in the mists of prehistory, RETURN was only coded to
allow a simple reference to a composite variable when the function's return
type is composite.  Allow an expression instead, while preserving the
efficiency of the original code path in the case where the expression is
indeed just a composite variable's name.  Likewise for RETURN NEXT.

As is true in various other places, the supplied expression must yield
exactly the number and data types of the required columns.  There was some
discussion of relaxing that for pl/pgsql, but no consensus yet, so this
patch doesn't address that.

Asif Rehman, reviewed by Pavel Stehule

Background worker processes

Background workers are postmaster subprocesses that run arbitrary
user-specified code.  They can request shared memory access as well as
backend database connections; or they can just use plain libpq frontend
database connections.

Modules listed in shared_preload_libraries can register background
workers in their _PG_init() function; this is early enough that it's not
necessary to provide an extra GUC option, because the necessary extra
resources can be allocated early on.  Modules can install more than one
bgworker, if necessary.

Care is taken that these extra processes do not interfere with other
postmaster tasks: only one such process is started on each ServerLoop
iteration.  This means a large number of them could be waiting to be
started up and postmaster is still able to quickly service external
connection requests.  Also, shutdown sequence should not be impacted by
a worker process that's reasonably well behaved (i.e. promptly responds
to termination signals.)

The current implementation lets worker processes specify their start
time, i.e. at what point in the server startup process they are to be
started: right after postmaster start (in which case they mustn't ask
for shared memory access), when consistent state has been reached
(useful during recovery in a HOT standby server), or when recovery has
terminated (i.e. when normal backends are allowed).

In case of a bgworker crash, actions to take depend on registration
data: if shared memory was requested, then all other connections are
taken down (as well as other bgworkers), just like it were a regular
backend crashing.  The bgworker itself is restarted, too, within a
configurable timeframe (which can be configured to be never).

More features to add to this framework can be imagined without much
effort, and have been discussed, but this seems good enough as a useful
unit already.

An elementary sample module is supplied.

Author: Álvaro Herrera

This patch is loosely based on prior patches submitted by KaiGai Kohei,
and unsubmitted code by Simon Riggs.

Reviewed by: KaiGai Kohei, Markus Wanner, Andres Freund,
Heikki Linnakangas, Simon Riggs, Amit Kapila

Fix intermittent crash in DROP INDEX CONCURRENTLY.

When deleteOneObject closes and reopens the pg_depend relation,
we must see to it that the relcache pointer held by the calling function
(typically performMultipleDeletions) is updated.  Usually the relcache
entry is retained so that the pointer value doesn't change, which is why
the problem had escaped notice ... but after a cache flush event there's
no guarantee that the same memory will be reassigned.  To fix, change
the recursive functions' APIs so that we pass around a "Relation *"
not just "Relation".

Per investigation of occasional buildfarm failures.  This is trivial
to reproduce with -DCLOBBER_CACHE_ALWAYS, which points up the sad
lack of any buildfarm member running that way on a regular basis.

Update comment at top of index_create

I neglected to update it in commit f4c4335.

Michael Paquier

Ensure recovery pause feature doesn't pause unless users can connect.

If we're not in hot standby mode, then there's no way for users to connect
to reset the recoveryPause flag, so we shouldn't pause.  The code was aware
of this but the test to see if pausing was safe was seriously inadequate:
it wasn't paying attention to reachedConsistency, and besides what it was
testing was that we could legally enter hot standby, not that we have
done so.  Get rid of that in favor of checking LocalHotStandbyActive,
which because of the coding in CheckRecoveryConsistency is tantamount to
checking that we have told the postmaster to enter hot standby.

Also, move the recoveryPausesHere() call that reacts to asynchronous
recoveryPause requests so that it's not in the middle of application of a
WAL record.  I put it next to the recoveryStopsHere() call --- in future
those are going to need to interact significantly, so this seems like a
good waystation.

Also, don't bother trying to read another WAL record if we've already
decided not to continue recovery.  This was no big deal when the code was
written originally, but now that reading a record might entail actions like
fetching an archive file, it seems a bit silly to do it like that.

Per report from Jeff Janes and subsequent discussion.  The pause feature
needs quite a lot more work, but this gets rid of some indisputable bugs,
and seems safe enough to back-patch.

Oops, meant to change the comment in writeTimeLineHistory.

Must not reach consistency before XLOG_BACKUP_RECORD
When waiting for an XLOG_BACKUP_RECORD the minRecoveryPoint
will be incorrect, so we must not declare recovery as consistent
before we have seen the record. Major bug allowing recovery to end
too early in some cases, allowing people to see inconsistent db.
This patch to HEAD and 9.2, other fix required for 9.1 and 9.0

Simon Riggs and Andres Freund, bug report by Jeff Janes

Add pgstatginindex() function to get the size of the GIN pending list.

Fujii Masao, reviewed by Kyotaro Horiguchi.

Attempt to un-break Windows builds with USE_LDAP.

The buildfarm shows this case is entirely broken, and I'm betting the
reason is lack of any include file.

Include isinf.o in libecpg if isinf() is not available on the system.

Patch done by Jiang Guiqing <jianggq@cn.fujitsu.com>.

Downgrade a status message from LOG to DEBUG2.

I never intended this to be anything other than a debugging aid, but forgot
to change the level before committing.

Write exact xlog position of timeline switch in the timeline history file.

This allows us to do some more rigorous sanity checking for various
incorrect point-in-time recovery scenarios, and provides more information
for debugging purposes. It will also come handy in the upcoming patch to
allow timeline switches to be replicated by streaming replication.

In initdb.c, move auth warning code into main() from secondary function.

In pg_upgrade testing script, turn off command echo at the end so status
report is clearer.

Fix build of LDAP URL feature

Some code was not ifdef'ed out for non-LDAP builds.

patch from Bruce Momjian

Track the timeline associated with minRecoveryPoint, for more sanity checks.

This allows recovery to notice certain incorrect recovery scenarios.
If a server has recovered to point X on timeline 5, and you restart
recovery, it better be on timeline 5 when it reaches point X again, not on
some timeline with a higher ID. This can happen e.g if you a standby server
is shut down, a new timeline appears in the WAL archive, and the standby
server is restarted. It will try to follow the new timeline, which is wrong
because some WAL on the old timeline was already replayed before shutdown.

Requires an initdb (or at least pg_resetxlog), because this adds a field to
the control file.

Restore set -x in pg_upgrade/test.sh, so the user can see what is being
executed.

Add support for LDAP URLs

Allow specifying LDAP authentication parameters as RFC 4516 LDAP URLs.

In initdb.c, rename some newly created functions, and move the directory
creation and xlog symlink creation to separate functions.

Per suggestions from Andrew Dunstan.

Add initdb --sync-only option to sync the data directory to durable
storage.

Have pg_upgrade use it, and enable server options fsync=off and
full_page_writes=off.

Document that users turning fsync from off to on should run initdb
--sync-only.

[ Previous commit was incorrectly applied as a git merge. ]

Revert initdb --sync-only patch that had incorrect commit messages.

dummy commit

In pg_upgrade, fix bug where no users were dumped in pg_dumpall
binary-upgrade mode; instead only skip dumping the current user.

This bug was introduced in during the removal of split_old_dump(). Bug
discovered during local testing.

Update release notes for 9.2.2, 9.1.7, 9.0.11, 8.4.15, 8.3.22.

Revert "Add mode where contrib installcheck runs each module in a separately named database."

This reverts commit e2b3c21b05c78c3a726b189242e41d4aa4422bf1.

Avoid holding vmbuffer pin after VACUUM.
During VACUUM if we pause to perform a cycle
of index cleanup we drop the vmbuffer pin,
so we should do the same thing when heap
scan completes. This avoids holding vmbuffer
pin across the main index cleanup in VACUUM,
which could be minutes or hours longer than
necessary for correctness.

Bug report and suggested fix from Pavan Deolasee

Fix documentation of path(polygon) function.

Obviously, this returns type "path", but somebody made a copy-and-pasteo
long ago.

Dagfinn Ilmari Mannsåker

Attempt to unbreak MSVC builds broken by f21bb9cfb5646e1793dcc9c0ea697bab99afa523.

We can't use type uint, so use uint32.

Refactor inCommit flag into generic delayChkpt flag.
Rename PGXACT->inCommit flag into delayChkpt flag,
and generalise comments to allow use in other situations,
such as the forthcoming potential use in checksum patch.
Replace wait loop to look for VXIDs with delayChkpt set.
No user visible changes, not behaviour changes at present.

Simon Riggs, reviewed and rebased by Jeff Davis

Clarify locking for PageGetLSN() in XLogCheckBuffer()

Clarify when to use PageSetLSN/PageGetLSN().
Update README to explain prerequisites for
correct access to LSN fields of a page.
Independent chunk removed from checksums
patch to reduce size of patch.

Refactor the code implementing standby-mode logic.

It is now easier to see that it's a state machine, making the code easier
to understand overall.

Add mode where contrib installcheck runs each module in a separately named database.

Normally each module is tested in aq database named contrib_regression,
which is dropped and recreated at the beginhning of each pg_regress run.
This mode, enabled by adding USE_MODULE_DB=1 to the make command line,
runs most modules in a database with the module name embedded in it.

This will make testing pg_upgrade on clusters with the contrib modules
a lot easier.

Still to be done: adapt to the MSVC build system.

Backpatch to 9.0, which is the earliest version it is reasonably possible
to test upgrading from.

Update time zone data files to tzdata release 2012j.

DST law changes in Cuba, Israel, Jordan, Libya, Palestine, Western Samoa,
and portions of Brazil.

Recommend triggers, not rules, in the CREATE VIEW reference page.

We've generally recommended use of INSTEAD triggers over rules since that
feature was added; but this old text in the CREATE VIEW reference page
didn't get the memo. Noted by Thomas Kellerer.

Reduce scope of changes for COPY FREEZE.
Allow support only for freezing tuples by explicit
command. Previous coding mistakenly extended
slightly beyond what was agreed as correct on -hackers.
So essentially a partial revoke of earlier work,
leaving just the COPY FREEZE command.

Don't advance checkPoint.nextXid near the end of a checkpoint sequence.

This reverts commit c11130690d6dca64267201a169cfb38c1adec5ef in favor of
actually fixing the problem: namely, that we should never have been
modifying the checkpoint record's nextXid at this point to begin with.
The nextXid should match the state as of the checkpoint's logical WAL
position (ie the redo point), not the state as of its physical position.
It's especially bogus to advance it in some wal_levels and not others.
In any case there is no need for the checkpoint record to carry the
same nextXid shown in the XLOG_RUNNING_XACTS record just emitted by
LogStandbySnapshot, as any replay operation will already have adopted
that value as current.

This fixes bug #7710 from Tarvi Pillessaar, and probably also explains bug
#6291 from Daniel Farina, in that if a checkpoint were in progress at the
instant of XID wraparound, the epoch bump would be lost as reported.
(And, of course, these days there's at least a 50-50 chance of a checkpoint
being in progress at any given instant.)

Diagnosed by me and independently by Andres Freund. Back-patch to all
branches supporting hot standby.

Rearrange storage of data in xl_running_xacts.
Previously we stored all xids mixed together.
Now we store top-level xids first, followed
by all subxids. Also skip logging any subxids
if the snapshot is suboverflowed, since there
are potentially large numbers of them and they
are not useful in that case anyway. Has value
in the envisaged design for decoding of WAL.
No planned effect on Hot Standby.

Andres Freund, reviewed by me

XidEpoch++ if wraparound during checkpoint.
If wal_level = hot_standby we update the checkpoint nextxid,
though in the case where a wraparound occurred half-way through
a checkpoint we would neglect updating the epoch also. Updating
the nextxid is arguably the wrong thing to do, but changing that
may introduce subtle bugs into hot standby startup, while updating
the value doesn't cause any known bugs yet. Minimal fix now to
HEAD and backbranches, wider fix later in HEAD.

Bug reported in #6291 by Daniel Farina and slightly differently in

Cause analysis and recommended fixes from Tom Lane and Andres Freund.

Applied patch is minimal version of Andres Freund's work.

Clarify operation of online checkpoints.
Previous comments left, but were too obscure
for such an important aspect of the system.

Fix psql crash while parsing SQL file whose encoding is different from
client encoding and the client encoding is not *safe* one. Such an
example is, file encoding is UTF-8 and client encoding SJIS. Patch
contributed by Jiang Guiqing.

Prevent passing gmake's environment variables down through pg_regress.

When we do "make install" to create a temp installation, we don't want
that instance of make to try to communicate with any instance of make
that might be calling us. This is known to cause problems if the
upper make has a -jN flag, and in principle could cause problems even
without that. Unset the relevant environment variables to prevent such
issues.

Andres Freund

Make sure sharedir/extension/ directory is created when needed.

The previous coding worked as long as MODULEDIR wasn't set explicitly,
because we create sharedir/$(datamoduledir) and the default value of
that is "extension". But if some other value is specified for MODULEDIR
then the installation directory needed for the control file wasn't made.

Cédric Villemain

Allow adding values to an enum type created in the current transaction.

Normally it is unsafe to allow ALTER TYPE ADD VALUE in a transaction block,
because instances of the value could be added to indexes later in the same
transaction, and then they would still be accessible even if the
transaction rolls back. However, we can allow this if the enum type itself
was created in the current transaction, because then any such indexes would
have to go away entirely on rollback.

The reason for allowing this is to support pg_upgrade's new usage of
pg_restore --single-transaction: in --binary-upgrade mode, pg_dump emits
enum types as a succession of ALTER TYPE ADD VALUE commands so that it can
preserve the values' OIDs. The support is a bit limited, so we'll leave
it undocumented.

Andres Freund

In pg_upgrade, remove 'set -x' from test script.

Revert:

In pg_upgrade, remove pg_restore's --single-transaction option,
as it throws errors in certain cases.

Remove pg_restore's --single-transaction option, as it throws errors in
certain cases.

Second tweak of COPY FREEZE

Tweak tests in COPY FREEZE

COPY FREEZE and mark committed on fresh tables.
When a relfilenode is created in this subtransaction or
a committed child transaction and it cannot otherwise
be seen by our own process, mark tuples committed ahead
of transaction commit for all COPY commands in same
transaction. If FREEZE specified on COPY
and pre-conditions met then rows will also be frozen.
Both options designed to avoid revisiting rows after commit,
increasing performance of subsequent commands after
data load and upgrade. pg_restore changes later.

Simon Riggs, review comments from Heikki Linnakangas, Noah Misch and design
input from Tom Lane, Robert Haas and Kevin Grittner

doc: Fix broken links to DocBook wiki

In pg_upgrade, improve status wording now that we have per-database
status output for dump/restore.

Change test ExceptionalCondition to return void

Commit 81107282a changed it in assert.c, but overlooked this other file.

Take buffer lock while inspecting btree index pages in contrib/pageinspect.

It's not safe to examine a shared buffer without any lock.

Split initdb.c main() code into multiple functions, for easier
maintenance.

In pg_upgrade, dump each database separately and use
--single-transaction to restore each database schema. This yields
performance improvements for databases with many tables. Also, remove
split_old_dump() as it is no longer needed.

Move long_options structures to the top of main() functions, for
consistency.

Per suggestion from Tom.

Add missing buffer lock acquisition in GetTupleForTrigger().

If we had not been holding buffer pin continuously since the tuple was
initially fetched by the UPDATE or DELETE query, it would be possible for
VACUUM or a page-prune operation to move the tuple while we're trying to
copy it.  This would result in a garbage "old" tuple value being passed to
an AFTER ROW UPDATE or AFTER ROW DELETE trigger.  The preconditions for
this are somewhat improbable, and the timing constraints are very tight;
so it's not so surprising that this hasn't been reported from the field,
even though the bug has been there a long time.

Problem found by Andres Freund.  Back-patch to all active branches.

Clean environment for pg_upgrade test.

This removes exisiting PG settings from the environment for
pg_upgrade tests, just like pg_regress does.

Add libpq function PQconninfo()

This allows a caller to get back the exact conninfo array that was
used to create a connection, including parameters read from the
environment.

In doing this, restructure how options are copied from the conninfo
to the actual connection.

Zoltan Boszormenyi and Magnus Hagander

Produce a more useful error message for over-length Unix socket paths.

The length of a socket path name is constrained by the size of struct
sockaddr_un, and there's not a lot we can do about it since that is a
kernel API.  However, it would be a good thing if we produced an
intelligible error message when the user specifies a socket path that's too
long --- and getaddrinfo's standard API is too impoverished to do this in
the natural way.  So insert explicit tests at the places where we construct
a socket path name.  Now you'll get an error that makes sense and even
tells you what the limit is, rather than something generic like
"Non-recoverable failure in name resolution".

Per trouble report from Jeremy Drake and a fix idea from Andrew Dunstan.

Correctly init fast path fields on PGPROC

Cleanup VirtualXact at end of Hot Standby.

Basic binary heap implementation.

There are probably other places where this can be used, but for now,
this just makes MergeAppend use it, so that this code will have test
coverage. There is other work in the queue that will use this, as
well.

Abhijit Menon-Sen, reviewed by Andres Freund, Robert Haas, Álvaro
Herrera, Tom Lane, and others.

When processing nested structure pointer variables ecpg always expected an
array datatype which of course is wrong.

Applied patch by Muhammad Usama <m.usama@gmail.com> to fix this.

Suppress parallel build in interfaces/ecpg/preproc/.

This is to see if it will stop intermittent build failures on buildfarm
member okapi. We know that gmake 3.82 has some problems with sometimes
not honoring dependencies in parallel builds, and it seems likely that
this is more of the same. Since the vast bulk of the work in the preproc
directory is associated with creating preproc.c and then preproc.o,
parallelism buys us hardly anything here anyway.

Also, make both this .NOTPARALLEL and the one previously added in
interfaces/ecpg/Makefile be conditional on "ifeq ($(MAKE_VERSION),3.82)".
The known bug in gmake is fixed upstream and should not be present in
3.83 and up, and there's no reason to think it affects older releases.

Fix assorted bugs in CREATE/DROP INDEX CONCURRENTLY.

Commit 8cb53654dbdb4c386369eb988062d0bbb6de725e, which introduced DROP
INDEX CONCURRENTLY, managed to break CREATE INDEX CONCURRENTLY via a poor
choice of catalog state representation.  The pg_index state for an index
that's reached the final pre-drop stage was the same as the state for an
index just created by CREATE INDEX CONCURRENTLY.  This meant that the
(necessary) change to make RelationGetIndexList ignore about-to-die indexes
also made it ignore freshly-created indexes; which is catastrophic because
the latter do need to be considered in HOT-safety decisions.  Failure to
do so leads to incorrect index entries and subsequently wrong results from
queries depending on the concurrently-created index.

To fix, add an additional boolean column "indislive" to pg_index, so that
the freshly-created and about-to-die states can be distinguished.  (This
change obviously is only possible in HEAD.  This patch will need to be
back-patched, but in 9.2 we'll use a kluge consisting of overloading the
formerly-impossible state of indisvalid = true and indisready = false.)

In addition, change CREATE/DROP INDEX CONCURRENTLY so that the pg_index
flag changes they make without exclusive lock on the index are made via
heap_inplace_update() rather than a normal transactional update.  The
latter is not very safe because moving the pg_index tuple could result in
concurrent SnapshotNow scans finding it twice or not at all, thus possibly
resulting in index corruption.  This is a pre-existing bug in CREATE INDEX
CONCURRENTLY, which was copied into the DROP code.

In addition, fix various places in the code that ought to check to make
sure that the indexes they are manipulating are valid and/or ready as
appropriate.  These represent bugs that have existed since 8.2, since
a failed CREATE INDEX CONCURRENTLY could leave a corrupt or invalid
index behind, and we ought not try to do anything that might fail with
such an index.

Also fix RelationReloadIndexInfo to ensure it copies all the pg_index
columns that are allowed to change after initial creation.  Previously we
could have been left with stale values of some fields in an index relcache
entry.  It's not clear whether this actually had any user-visible
consequences, but it's at least a bug waiting to happen.

In addition, do some code and docs review for DROP INDEX CONCURRENTLY;
some cosmetic code cleanup but mostly addition and revision of comments.

This will need to be back-patched, but in a noticeably different form,
so I'm committing it to HEAD before working on the back-patch.

Problem reported by Amit Kapila, diagnosis by Pavan Deolassee,
fix by Tom Lane and Andres Freund.

Split out rmgr rm_desc functions into their own files

This is necessary (but not sufficient) to have them compilable outside
of a backend environment.

If we don't have a backup-end-location, don't claim we've reached it.

This was apparently a typo, which caused recovery to think that it
immediately reached the end of backup, and allowed the database to start
up too early.

Reported by Jeff Janes. Backpatch to 9.2, where this code was introduced.

Add explicit casts in ilist.h's inline functions.

Needed to silence C++ errors, per report from Peter Eisentraut.

Andres Freund

Add OpenTransientFile, with automatic cleanup at end-of-xact.

Files opened with BasicOpenFile or PathNameOpenFile are not automatically
cleaned up on error. That puts unnecessary burden on callers that only want
to keep the file open for a short time. There is AllocateFile, but that
returns a buffered FILE * stream, which in many cases is not the nicest API
to work with. So add function called OpenTransientFile, which returns a
unbuffered fd that's cleaned up like the FILE* returned by AllocateFile().

This plugs a few rare fd leaks in error cases:

1. copy_file() - fixed by by using OpenTransientFile instead of BasicOpenFile
2. XLogFileInit() - fixed by adding close() calls to the error cases. Can't
   use OpenTransientFile here because the fd is supposed to persist over
   transaction boundaries.
3. lo_import/lo_export - fixed by using OpenTransientFile instead of
   PathNameOpenFile.

In addition to plugging those leaks, this replaces many BasicOpenFile() calls
with OpenTransientFile() that were not leaking, because the code meticulously
closed the file on error. That wasn't strictly necessary, but IMHO it's good
for robustness.

The same leaks exist in older versions, but given the rarity of the issues,
I'm not backpatching this. Not yet, anyway - it might be good to backpatch
later, after this mechanism has had some more testing in master branch.

Revert patch for taking fewer snapshots.

This reverts commit d573e239f03506920938bf0be56c868d9c3416da, "Take fewer
snapshots".  While that seemed like a good idea at the time, it caused
execution to use a snapshot that had been acquired before locking any of
the tables mentioned in the query.  This created user-visible anomalies
that were not present in any prior release of Postgres, as reported by
Tomas Vondra.  While this whole area could do with a redesign (since there
are related cases that have anomalies anyway), it doesn't seem likely that
any future patch would be reasonably back-patchable; and we don't want 9.2
to exhibit a behavior that's subtly unlike either past or future releases.
Hence, revert to prior code while we rethink the problem.

Fix SELECT DISTINCT with index-optimized MIN/MAX on inheritance trees.

In a query such as "SELECT DISTINCT min(x) FROM tab", the DISTINCT is
pretty useless (there being only one output row), but nonetheless it
shouldn't fail.  But it could fail if "tab" is an inheritance parent,
because planagg.c's code for fixing up equivalence classes after making the
index-optimized MIN/MAX transformation wasn't prepared to find child-table
versions of the aggregate expression.  The least ugly fix seems to be
to add an option to mutate_eclass_expressions() to skip child-table
equivalence class members, which aren't used anymore at this stage of
planning so it's not really necessary to fix them.  Since child members
are ignored in many cases already, it seems plausible for
mutate_eclass_expressions() to have an option to ignore them too.

Per bug #7703 from Maxim Boguk.

Back-patch to 9.1.  Although the same code exists before that, it cannot
encounter child-table aggregates AFAICS, because the index optimization
transformation cannot succeed on inheritance trees before 9.1 (for lack
of MergeAppend).

In pg_upgrade, simplify function copy_file() by using pg_malloc() and
centralizing error/shutdown code.

In pg_upgrade, fix a few place that used maloc/free rather than
pg_malloc/pg_free.

Remove -Wlogical-op from standard compiler flags

It creates too many warnings with GCC 4.3 and 4.4.

Applied patch by Chen Huajun <chenhj@cn.fujitsu.com> to make ecpg able to cope
with very long structs.

Fix pg_resetxlog to use correct path to postmaster.pid.

Since we've already chdir'd into the data directory, the file should
be referenced as just "postmaster.pid", without prefixing the directory
path. This is harmless in the normal case where an absolute PGDATA path
is used, but quite dangerous if a relative path is specified, since the
program might then fail to notice an active postmaster.

Reported by Hari Babu. This got broken in my commit
eb5949d190e80360386113fde0f05854f0c9824d, so patch all active versions.

Avoid bogus "out-of-sequence timeline ID" errors in standby-mode.

When startup process opens a WAL segment after replaying part of it, it
validates the first page on the WAL segment, even though the page it's
really interested in later in the file. As part of the validation, it checks
that the TLI on the page header is >= the TLI it saw on the last page it
read. If the segment contains a timeline switch, and we have already
replayed it, and then re-open the WAL segment (because of streaming
replication got disconnected and reconnected, for example), the TLI check
will fail when the first page is validated. Fix that by relaxing the TLI
check when re-opening a WAL segment.

Backpatch to 9.0. Earlier versions had the same code, but before standby
mode was introduced in 9.0, recovery never tried to re-read a segment after
partially replaying it.

Reported by Amit Kapila, while testing a new feature.

Don't launch new child processes after we've been told to shut down.

Once we've received a shutdown signal (SIGINT or SIGTERM), we should not
launch any more child processes, even if we get signals requesting such.
The normal code path for spawning backends has always understood that,
but the postmaster's infrastructure for hot standby and autovacuum didn't
get the memo.  As reported by Hari Babu in bug #7643, this could lead to
failure to shut down at all in some cases, such as when SIGINT is received
just before the startup process sends PMSIGNAL_RECOVERY_STARTED: we'd
launch a bgwriter and checkpointer, and then those processes would have no
idea that they ought to quit.  Similarly, launching a new autovacuum worker
would result in waiting till it finished before shutting down.

Also, switch the order of the code blocks in reaper() that detect startup
process crash versus shutdown termination.  Once we've sent it a signal,
we should not consider that exit(1) is surprising.  This is just a cosmetic
fix since shutdown occurs correctly anyway, but better not to log a phony
complaint about startup process crash.

Back-patch to 9.0.  Some parts of this might be applicable before that,
but given the lack of prior complaints I'm not going to worry too much
about older branches.