to identify all the matching items at once, copying their heap tuple IDs
into backend-local storage. The heap tuple IDs are then processed while
not holding any page lock within the index. We do continue to hold a pin
-on the leaf page, to protect against concurrent deletions (see below).
-In this state the scan is effectively stopped "between" pages, either
-before or after the page it has pinned. This is safe in the presence of
-concurrent insertions and even page splits, because items are never moved
-across pre-existing page boundaries --- so the scan cannot miss any items
-it should have seen, nor accidentally return the same item twice. The scan
-must remember the page's right-link at the time it was scanned, since that
-is the page to move right to; if we move right to the current right-link
-then we'd re-scan any items moved by a page split. We don't similarly
-remember the left-link, since it's best to use the most up-to-date
-left-link when trying to move left (see detailed move-left algorithm below).
+on the leaf page in some circumstances, to protect against concurrent
+deletions (see below). In this state the scan is effectively stopped
+"between" pages, either before or after the page it has pinned. This is
+safe in the presence of concurrent insertions and even page splits, because
+items are never moved across pre-existing page boundaries --- so the scan
+cannot miss any items it should have seen, nor accidentally return the same
+item twice. The scan must remember the page's right-link at the time it
+was scanned, since that is the page to move right to; if we move right to
+the current right-link then we'd re-scan any items moved by a page split.
+We don't similarly remember the left-link, since it's best to use the most
+up-to-date left-link when trying to move left (see detailed move-left
+algorithm below).
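+
+As a simplified illustration (this mirrors _bt_readpage and _bt_steppage in
+nbtsearch.c; error handling, scan direction, and the items[] bookkeeping are
+omitted), the pattern is roughly:
+
+	page = BufferGetPage(so->currPos.buf);
+	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+	so->currPos.nextPage = opaque->btpo_next;	/* right-link as of this read */
+	/* ... copy matching heap TIDs into backend-local so->currPos.items[] ... */
+	LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);	/* now "between" pages */
+
+	/* later, to continue the scan, move right via the remembered link,
+	 * not the page's current right-link */
+	so->currPos.buf = _bt_getbuf(rel, so->currPos.nextPage, BT_READ);
+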
In most cases we release our lock and pin on a page before attempting
to acquire pin and lock on the page we are moving to. In a few places
operations themselves; as explained above, index scans logically stop
"between" pages and so can't lose their place. The reason we do it is to
provide an interlock between non-full VACUUM and indexscans. Since VACUUM
-deletes index entries before deleting tuples, the super-exclusive lock
-guarantees that VACUUM can't delete any heap tuple that an indexscanning
-process might be about to visit. (This guarantee works only for simple
-indexscans that visit the heap in sync with the index scan, not for bitmap
-scans. We only need the guarantee when using non-MVCC snapshot rules; in
-an MVCC snapshot, it wouldn't matter if the heap tuple were replaced with
-an unrelated tuple at the same TID, because the new tuple wouldn't be
-visible to our scan anyway.)
-
-Because a page can be split even while someone holds a pin on it, it is
-possible that an indexscan will return items that are no longer stored on
-the page it has a pin on, but rather somewhere to the right of that page.
-To ensure that VACUUM can't prematurely remove such heap tuples, we require
-btbulkdelete to obtain super-exclusive lock on every leaf page in the index,
-even pages that don't contain any deletable tuples. This guarantees that
-the btbulkdelete call cannot return while any indexscan is still holding
-a copy of a deleted index tuple. Note that this requirement does not say
-that btbulkdelete must visit the pages in any particular order. (See also
-on-the-fly deletion, below.)
+deletes index entries before reclaiming heap tuple line pointers, the
+super-exclusive lock guarantees that VACUUM can't reclaim for re-use a
+line pointer that an indexscanning process might be about to visit. This
+guarantee works only for simple indexscans that visit the heap in sync
+with the index scan, not for bitmap scans. We only need the guarantee
+when using non-MVCC snapshot rules; when using an MVCC snapshot, it
+doesn't matter if the heap tuple is replaced with an unrelated tuple at
+the same TID, because the new tuple won't be visible to our scan anyway.
+Therefore, a scan using an MVCC snapshot which has no other confounding
+factors will not hold the pin after the page contents are read. The
+current reasons for exceptions, where a pin is still needed, are if the
+index is not WAL-logged or if the scan is an index-only scan. If later
+work allows the pin to be dropped for all cases we will be able to
+simplify the vacuum code, since the concept of a super-exclusive lock
+for btree indexes will no longer be needed.
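+
+In code form, the decision applied when leaving a just-read leaf page is
+essentially the following (this is what _bt_drop_lock_and_maybe_pin in
+nbtsearch.c does):
+
+	LockBuffer(sp->buf, BUFFER_LOCK_UNLOCK);	/* always drop the read lock */
+	if (IsMVCCSnapshot(scan->xs_snapshot) &&
+		RelationNeedsWAL(scan->indexRelation) &&
+		!scan->xs_want_itup)	/* i.e. not an index-only scan */
+	{
+		/* safe to drop the pin too */
+		ReleaseBuffer(sp->buf);
+		sp->buf = InvalidBuffer;
+	}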
+
+Because a pin is not always held, and a page can be split even while
+someone does hold a pin on it, it is possible that an indexscan will
+return items that are no longer stored on the page it has a pin on, but
+rather somewhere to the right of that page. To ensure that VACUUM can't
+prematurely remove such heap tuples, we require btbulkdelete to obtain a
+super-exclusive lock on every leaf page in the index, even pages that
+don't contain any deletable tuples. Any scan which could yield incorrect
+results if the tuple at a TID matching the scan's range and filter
+conditions were replaced by a different tuple while the scan is in
+progress must hold the pin on each index page until all index entries read
+from the page have been processed. This guarantees that the btbulkdelete
+call cannot return while any indexscan is still holding a copy of a
+deleted index tuple if the scan could be confused by that. Note that this
+requirement does not say that btbulkdelete must visit the pages in any
+particular order. (See also on-the-fly deletion, below.)
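+
+For reference, the super-exclusive ("cleanup") lock is obtained by pinning
+the page and then waiting for all other pins to go away; this is
+approximately what btvacuumpage() does for each leaf page (variable names
+follow that function):
+
+	buf = ReadBufferExtended(rel, MAIN_FORKNUM, blkno, RBM_NORMAL,
+							 info->strategy);
+	LockBufferForCleanup(buf);	/* blocks while any scan still holds a pin */
+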
There is no such interlocking for deletion of items in internal pages,
since backends keep no lock nor pin on a page they have descended past.
not so: as long as an indexscanning process has a pin on the page where
the index item used to be, VACUUM cannot complete its btbulkdelete scan
and so cannot remove the heap tuple. This is another reason why
-btbulkdelete has to get super-exclusive lock on every leaf page, not only
-the ones where it actually sees items to delete.
+btbulkdelete has to get a super-exclusive lock on every leaf page, not
+only the ones where it actually sees items to delete. So that we can
+handle the cases where we attempt LP_DEAD flagging for a page after we
+have released its pin, we remember the LSN of the index page when we read
+the index tuples from it; we do not attempt to flag index tuples as dead
+if we didn't hold the pin the entire time and the LSN has changed.
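+
+In outline, the deferred-flagging path (see _bt_killitems in nbtutils.c,
+below) re-reads the page and checks the remembered LSN before setting any
+LP_DEAD bits:
+
+	buf = _bt_getbuf(scan->indexRelation, so->currPos.currPage, BT_READ);
+	if (PageGetLSN(BufferGetPage(buf)) != so->currPos.lsn)
+	{
+		/* page changed while we held no pin; a TID may have been recycled */
+		_bt_relbuf(scan->indexRelation, buf);
+		return;					/* give up on hinting these items */
+	}
+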
WAL Considerations
------------------
of the insertion into the parent level. When splitting the root page, the
metapage update is handled as part of the "new root" action.
-Each step in page deletion are logged as separate WAL entries: marking the
+Each step in page deletion is logged as a separate WAL entry: marking the
leaf as half-dead and removing the downlink is one record, and unlinking a
page is a second record. If vacuum is interrupted for some reason, or the
system crashes, the tree is consistent for searches and insertions. The
* If there's not enough room in the space, we try to make room by
* removing any LP_DEAD tuples.
*
- * On entry, *buf and *offsetptr point to the first legal position
+ * On entry, *bufptr and *offsetptr point to the first legal position
* where the new tuple could be inserted. The caller should hold an
- * exclusive lock on *buf. *offsetptr can also be set to
+ * exclusive lock on *bufptr. *offsetptr can also be set to
* InvalidOffsetNumber, in which case the function will search for the
* right location within the page if needed. On exit, they point to the
* chosen insert location. If _bt_findinsertloc decides to move right,
/* allocate private workspace */
so = (BTScanOpaque) palloc(sizeof(BTScanOpaqueData));
- so->currPos.buf = so->markPos.buf = InvalidBuffer;
+ BTScanPosInvalidate(so->currPos);
+ BTScanPosInvalidate(so->markPos);
if (scan->numberOfKeys > 0)
so->keyData = (ScanKey) palloc(scan->numberOfKeys * sizeof(ScanKeyData));
else
* scan->xs_itupdesc whether we'll need it or not, since that's so cheap.
*/
so->currTuples = so->markTuples = NULL;
- so->currPos.nextTupleOffset = 0;
- so->markPos.nextTupleOffset = 0;
scan->xs_itupdesc = RelationGetDescr(rel);
{
/* Before leaving current page, deal with any killed items */
if (so->numKilled > 0)
- _bt_killitems(scan, false);
- ReleaseBuffer(so->currPos.buf);
- so->currPos.buf = InvalidBuffer;
+ _bt_killitems(scan);
+ BTScanPosUnpinIfPinned(so->currPos);
+ BTScanPosInvalidate(so->currPos);
}
- if (BTScanPosIsValid(so->markPos))
- {
- ReleaseBuffer(so->markPos.buf);
- so->markPos.buf = InvalidBuffer;
- }
so->markItemIndex = -1;
+ BTScanPosUnpinIfPinned(so->markPos);
+ BTScanPosInvalidate(so->markPos);
/*
* Allocate tuple workspace arrays, if needed for an index-only scan and
{
/* Before leaving current page, deal with any killed items */
if (so->numKilled > 0)
- _bt_killitems(scan, false);
- ReleaseBuffer(so->currPos.buf);
- so->currPos.buf = InvalidBuffer;
+ _bt_killitems(scan);
+ BTScanPosUnpinIfPinned(so->currPos);
}
- if (BTScanPosIsValid(so->markPos))
- {
- ReleaseBuffer(so->markPos.buf);
- so->markPos.buf = InvalidBuffer;
- }
so->markItemIndex = -1;
+ BTScanPosUnpinIfPinned(so->markPos);
+
+ /* No need to invalidate positions, the RAM is about to be freed. */
/* Release storage */
if (so->keyData != NULL)
IndexScanDesc scan = (IndexScanDesc) PG_GETARG_POINTER(0);
BTScanOpaque so = (BTScanOpaque) scan->opaque;
- /* we aren't holding any read locks, but gotta drop the pin */
- if (BTScanPosIsValid(so->markPos))
- {
- ReleaseBuffer(so->markPos.buf);
- so->markPos.buf = InvalidBuffer;
- }
+ /* There may be an old mark with a pin (but no lock). */
+ BTScanPosUnpinIfPinned(so->markPos);
/*
* Just record the current itemIndex. If we later step to next page
if (BTScanPosIsValid(so->currPos))
so->markItemIndex = so->currPos.itemIndex;
else
+ {
+ BTScanPosInvalidate(so->markPos);
so->markItemIndex = -1;
+ }
/* Also record the current positions of any array keys */
if (so->numArrayKeys)
if (so->markItemIndex >= 0)
{
/*
- * The mark position is on the same page we are currently on. Just
+ * The scan has never moved to a new page since the last mark. Just
* restore the itemIndex.
+ *
+ * NB: In this case we can't count on anything in so->markPos to be
+ * accurate.
*/
so->currPos.itemIndex = so->markItemIndex;
}
+ else if (so->currPos.currPage == so->markPos.currPage)
+ {
+ /*
+ * so->markItemIndex < 0 but mark and current positions are on the
+ * same page. This would be an unusual case, where the scan moved to
+ * a new index page after the mark, restored, and later restored again
+ * without moving off the marked page. It is not clear that this code
+ * can currently be reached, but it seems better to make this function
+ * robust for this case than to Assert() or elog() that it can't
+ * happen.
+ *
+ * We neither want to set so->markItemIndex >= 0 (because that could
+ * cause a later move to a new page to redo the memcpy() executions)
+ * nor re-execute the memcpy() functions for a restore within the same
+ * page. The previous restore to this page already set everything
+ * except markPos as it should be.
+ */
+ so->currPos.itemIndex = so->markPos.itemIndex;
+ }
else
{
- /* we aren't holding any read locks, but gotta drop the pin */
+ /*
+ * The scan moved to a new page after last mark or restore, and we are
+ * now restoring to the marked page. We aren't holding any read
+ * locks, but if we're still holding the pin for the current position,
+ * we must drop it.
+ */
if (BTScanPosIsValid(so->currPos))
{
/* Before leaving current page, deal with any killed items */
- if (so->numKilled > 0 &&
- so->currPos.buf != so->markPos.buf)
- _bt_killitems(scan, false);
- ReleaseBuffer(so->currPos.buf);
- so->currPos.buf = InvalidBuffer;
+ if (so->numKilled > 0)
+ _bt_killitems(scan);
+ BTScanPosUnpinIfPinned(so->currPos);
}
if (BTScanPosIsValid(so->markPos))
{
/* bump pin on mark buffer for assignment to current buffer */
- IncrBufferRefCount(so->markPos.buf);
+ if (BTScanPosIsPinned(so->markPos))
+ IncrBufferRefCount(so->markPos.buf);
memcpy(&so->currPos, &so->markPos,
offsetof(BTScanPosData, items[1]) +
so->markPos.lastItem * sizeof(BTScanPosItem));
memcpy(so->currTuples, so->markTuples,
so->markPos.nextTupleOffset);
}
+ else
+ BTScanPosInvalidate(so->currPos);
}
PG_RETURN_VOID();
#include "storage/predicate.h"
#include "utils/lsyscache.h"
#include "utils/rel.h"
+#include "utils/tqual.h"
static bool _bt_readpage(IndexScanDesc scan, ScanDirection dir,
static bool _bt_steppage(IndexScanDesc scan, ScanDirection dir);
static Buffer _bt_walk_left(Relation rel, Buffer buf);
static bool _bt_endpoint(IndexScanDesc scan, ScanDirection dir);
+static void _bt_drop_lock_and_maybe_pin(IndexScanDesc scan, BTScanPos sp);
+
+
+/*
+ * _bt_drop_lock_and_maybe_pin()
+ *
+ * Unlock the buffer; and if it is safe to release the pin, do that, too. It
+ * is safe if the scan is using an MVCC snapshot and the index is WAL-logged.
+ * This will prevent vacuum from stalling in a blocked state trying to read a
+ * page when a cursor is sitting on it -- at least in many important cases.
+ *
+ * Set the buffer to invalid if the pin is released, since the buffer may be
+ * re-used. If we need to go back to this block (for example, to apply
+ * LP_DEAD hints) we must get a fresh reference to the buffer. Hopefully it
+ * will remain in shared memory for as long as it takes to scan the index
+ * buffer page.
+ */
+static void
+_bt_drop_lock_and_maybe_pin(IndexScanDesc scan, BTScanPos sp)
+{
+ LockBuffer(sp->buf, BUFFER_LOCK_UNLOCK);
+
+ if (IsMVCCSnapshot(scan->xs_snapshot) &&
+ RelationNeedsWAL(scan->indexRelation) &&
+ !scan->xs_want_itup)
+ {
+ ReleaseBuffer(sp->buf);
+ sp->buf = InvalidBuffer;
+ }
+}
/*
StrategyNumber strat_total;
BTScanPosItem *currItem;
+ Assert(!BTScanPosIsValid(so->currPos));
+
pgstat_count_index_scan(rel);
/*
/* don't need to keep the stack around... */
_bt_freestack(stack);
- /* remember which buffer we have pinned, if any */
- so->currPos.buf = buf;
-
if (!BufferIsValid(buf))
{
/*
break;
}
+ /* remember which buffer we have pinned, if any */
+ Assert(!BTScanPosIsValid(so->currPos));
+ so->currPos.buf = buf;
+
/*
* Now load data from the first page of the scan.
*/
* There's no actually-matching data on this page. Try to advance to
* the next page. Return false if there's no matching data at all.
*/
+ LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
if (!_bt_steppage(scan, dir))
return false;
}
-
- /* Drop the lock, but not pin, on the current page */
- LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
+ else
+ {
+ /* Drop the lock, and maybe the pin, on the current page */
+ _bt_drop_lock_and_maybe_pin(scan, &so->currPos);
+ }
/* OK, itemIndex says what to return */
currItem = &so->currPos.items[so->currPos.itemIndex];
/*
* _bt_next() -- Get the next item in a scan.
*
- * On entry, so->currPos describes the current page, which is pinned
- * but not locked, and so->currPos.itemIndex identifies which item was
+ * On entry, so->currPos describes the current page, which may be pinned
+ * but is not locked, and so->currPos.itemIndex identifies which item was
* previously returned.
*
* On successful exit, scan->xs_ctup.t_self is set to the TID of the
{
if (++so->currPos.itemIndex > so->currPos.lastItem)
{
- /* We must acquire lock before applying _bt_steppage */
- Assert(BufferIsValid(so->currPos.buf));
- LockBuffer(so->currPos.buf, BT_READ);
if (!_bt_steppage(scan, dir))
return false;
- /* Drop the lock, but not pin, on the new page */
- LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
}
}
else
{
if (--so->currPos.itemIndex < so->currPos.firstItem)
{
- /* We must acquire lock before applying _bt_steppage */
- Assert(BufferIsValid(so->currPos.buf));
- LockBuffer(so->currPos.buf, BT_READ);
if (!_bt_steppage(scan, dir))
return false;
- /* Drop the lock, but not pin, on the new page */
- LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
}
}
IndexTuple itup;
bool continuescan;
- /* we must have the buffer pinned and locked */
+ /*
+ * We must have the buffer pinned and locked, but the usual macro can't be
+ * used here; this function is what makes it good for currPos.
+ */
Assert(BufferIsValid(so->currPos.buf));
page = BufferGetPage(so->currPos.buf);
minoff = P_FIRSTDATAKEY(opaque);
maxoff = PageGetMaxOffsetNumber(page);
+ /*
+ * We note the buffer's block number so that we can release the pin later.
+ * This allows us to re-read the buffer if it is needed again for hinting.
+ */
+ so->currPos.currPage = BufferGetBlockNumber(so->currPos.buf);
+
+ /*
+ * We save the LSN of the page as we read it, so that we know whether it
+	 * is safe to apply LP_DEAD hints to the page later.  This allows us to
+	 * drop the pin for MVCC scans, which allows vacuum to avoid blocking.
+ */
+ so->currPos.lsn = PageGetLSN(page);
+
/*
* we must save the page's right-link while scanning it; this tells us
* where to step right to after we're done with these items. There is no
/* initialize tuple workspace to empty */
so->currPos.nextTupleOffset = 0;
+ /*
+ * Now that the current page has been made consistent, the macro should be
+ * good.
+ */
+ Assert(BTScanPosIsPinned(so->currPos));
+
if (ScanDirectionIsForward(dir))
{
/* load items[] in ascending order */
/*
* _bt_steppage() -- Step to next page containing valid data for scan
*
- * On entry, so->currPos.buf must be pinned and read-locked. We'll drop
- * the lock and pin before moving to next page.
+ * On entry, if so->currPos.buf is valid the buffer is pinned but not locked;
+ * if pinned, we'll drop the pin before moving to next page. The buffer is
+ * not locked on entry.
*
- * On success exit, we hold pin and read-lock on the next interesting page,
- * and so->currPos is updated to contain data from that page.
+ * On success exit, so->currPos is updated to contain data from the next
+ * interesting page. For success on a scan using a non-MVCC snapshot we hold
+ * a pin, but not a read lock, on that page. If we do not hold the pin, we
+ * set so->currPos.buf to InvalidBuffer. We return TRUE to indicate success.
*
* If there are no more matching records in the given direction, we drop all
* locks and pins, set so->currPos.buf to InvalidBuffer, and return FALSE.
Page page;
BTPageOpaque opaque;
- /* we must have the buffer pinned and locked */
- Assert(BufferIsValid(so->currPos.buf));
+ Assert(BTScanPosIsValid(so->currPos));
/* Before leaving current page, deal with any killed items */
if (so->numKilled > 0)
- _bt_killitems(scan, true);
+ _bt_killitems(scan);
/*
* Before we modify currPos, make a copy of the page data if there was a
if (so->markItemIndex >= 0)
{
/* bump pin on current buffer for assignment to mark buffer */
- IncrBufferRefCount(so->currPos.buf);
+ if (BTScanPosIsPinned(so->currPos))
+ IncrBufferRefCount(so->currPos.buf);
memcpy(&so->markPos, &so->currPos,
offsetof(BTScanPosData, items[1]) +
so->currPos.lastItem * sizeof(BTScanPosItem));
/* Remember we left a page with data */
so->currPos.moreLeft = true;
+ /* release the previous buffer, if pinned */
+ BTScanPosUnpinIfPinned(so->currPos);
+
for (;;)
{
- /* release the previous buffer */
- _bt_relbuf(rel, so->currPos.buf);
- so->currPos.buf = InvalidBuffer;
/* if we're at end of scan, give up */
if (blkno == P_NONE || !so->currPos.moreRight)
+ {
+ BTScanPosInvalidate(so->currPos);
return false;
+ }
/* check for interrupts while we're not holding any buffer lock */
CHECK_FOR_INTERRUPTS();
/* step right one page */
if (_bt_readpage(scan, dir, P_FIRSTDATAKEY(opaque)))
break;
}
+
/* nope, keep going */
blkno = opaque->btpo_next;
+ _bt_relbuf(rel, so->currPos.buf);
}
}
else
* to our left splits while we are in flight to it, plus the
* possibility that the page we were on gets deleted after we leave
* it. See nbtree/README for details.
+ *
+ * It might be possible to rearrange this code to have less overhead
+ * in pinning and locking, but that would require capturing the left
+ * pointer when the page is initially read, and using it here, along
+ * with big changes to _bt_walk_left() and the code below. It is not
+ * clear whether this would be a win, since if the page immediately to
+ * the left splits after we read this page and before we step left, we
+ * would need to visit more pages than with the current code.
+ *
+ * Note that if we change the code so that we drop the pin for a scan
+ * which uses a non-MVCC snapshot, we will need to modify the code for
+ * walking left, to allow for the possibility that a referenced page
+ * has been deleted. As long as the buffer is pinned or the snapshot
+ * is MVCC the page cannot move past the half-dead state to fully
+ * deleted.
*/
+ if (BTScanPosIsPinned(so->currPos))
+ LockBuffer(so->currPos.buf, BT_READ);
+ else
+ so->currPos.buf = _bt_getbuf(rel, so->currPos.currPage, BT_READ);
+
for (;;)
{
/* Done if we know there are no matching keys to the left */
if (!so->currPos.moreLeft)
{
_bt_relbuf(rel, so->currPos.buf);
- so->currPos.buf = InvalidBuffer;
+ BTScanPosInvalidate(so->currPos);
return false;
}
/* if we're physically at end of index, return failure */
if (so->currPos.buf == InvalidBuffer)
+ {
+ BTScanPosInvalidate(so->currPos);
return false;
+ }
/*
* Okay, we managed to move left to a non-deleted page. Done if
}
}
+ /* Drop the lock, and maybe the pin, on the current page */
+ _bt_drop_lock_and_maybe_pin(scan, &so->currPos);
+
return true;
}
* exists.
*/
PredicateLockRelation(rel, scan->xs_snapshot);
- so->currPos.buf = InvalidBuffer;
+ BTScanPosInvalidate(so->currPos);
return false;
}
* There's no actually-matching data on this page. Try to advance to
* the next page. Return false if there's no matching data at all.
*/
+ LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
if (!_bt_steppage(scan, dir))
return false;
}
-
- /* Drop the lock, but not pin, on the current page */
- LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
+ else
+ {
+ /* Drop the lock, and maybe the pin, on the current page */
+ _bt_drop_lock_and_maybe_pin(scan, &so->currPos);
+ }
/* OK, itemIndex says what to return */
currItem = &so->currPos.items[so->currPos.itemIndex];
* _bt_killitems - set LP_DEAD state for items an indexscan caller has
* told us were killed
*
- * scan->so contains information about the current page and killed tuples
- * thereon (generally, this should only be called if so->numKilled > 0).
+ * scan->opaque, referenced locally through so, contains information about the
+ * current page and killed tuples thereon (generally, this should only be
+ * called if so->numKilled > 0).
*
- * The caller must have pin on so->currPos.buf, but may or may not have
- * read-lock, as indicated by haveLock. Note that we assume read-lock
- * is sufficient for setting LP_DEAD status (which is only a hint).
+ * The caller does not have a lock on the page and may or may not have the
+ * page pinned in a buffer. Note that read-lock is sufficient for setting
+ * LP_DEAD status (which is only a hint).
*
* We match items by heap TID before assuming they are the right ones to
* delete. We cope with cases where items have moved right due to insertions.
* If an item has moved off the current page due to a split, we'll fail to
* find it and do nothing (this is not an error case --- we assume the item
- * will eventually get marked in a future indexscan). Note that because we
- * hold pin on the target page continuously from initially reading the items
- * until applying this function, VACUUM cannot have deleted any items from
- * the page, and so there is no need to search left from the recorded offset.
- * (This observation also guarantees that the item is still the right one
- * to delete, which might otherwise be questionable since heap TIDs can get
- * recycled.)
+ * will eventually get marked in a future indexscan).
+ *
+ * Note that if we hold a pin on the target page continuously from initially
+ * reading the items until applying this function, VACUUM cannot have deleted
+ * any items from the page, and so there is no need to search left from the
+ * recorded offset. (This observation also guarantees that the item is still
+ * the right one to delete, which might otherwise be questionable since heap
+ * TIDs can get recycled.) This holds true even if the page has been modified
+ * by inserts and page splits, so there is no need to consult the LSN.
+ *
+ * If the pin was released after reading the page, then we re-read it. If it
+ * has been modified since we read it (as determined by the LSN), we dare not
+ * flag any entries because it is possible that the old entry was vacuumed
+ * away and the TID was re-used by a completely different heap tuple.
*/
void
-_bt_killitems(IndexScanDesc scan, bool haveLock)
+_bt_killitems(IndexScanDesc scan)
{
BTScanOpaque so = (BTScanOpaque) scan->opaque;
Page page;
OffsetNumber minoff;
OffsetNumber maxoff;
int i;
+ int numKilled = so->numKilled;
bool killedsomething = false;
- Assert(BufferIsValid(so->currPos.buf));
+ Assert(BTScanPosIsValid(so->currPos));
+
+ /*
+ * Always reset the scan state, so we don't look for same items on other
+ * pages.
+ */
+ so->numKilled = 0;
- if (!haveLock)
+ if (BTScanPosIsPinned(so->currPos))
+ {
+ /*
+ * We have held the pin on this page since we read the index tuples,
+ * so all we need to do is lock it. The pin will have prevented
+ * re-use of any TID on the page, so there is no need to check the
+ * LSN.
+ */
LockBuffer(so->currPos.buf, BT_READ);
- page = BufferGetPage(so->currPos.buf);
+ page = BufferGetPage(so->currPos.buf);
+ }
+ else
+ {
+ Buffer buf;
+
+ /* Attempt to re-read the buffer, getting pin and lock. */
+ buf = _bt_getbuf(scan->indexRelation, so->currPos.currPage, BT_READ);
+
+ /* It might not exist anymore; in which case we can't hint it. */
+ if (!BufferIsValid(buf))
+ return;
+
+ page = BufferGetPage(buf);
+ if (PageGetLSN(page) == so->currPos.lsn)
+ so->currPos.buf = buf;
+ else
+ {
+ /* Modified while not pinned means hinting is not safe. */
+ _bt_relbuf(scan->indexRelation, buf);
+ return;
+ }
+ }
+
opaque = (BTPageOpaque) PageGetSpecialPointer(page);
minoff = P_FIRSTDATAKEY(opaque);
maxoff = PageGetMaxOffsetNumber(page);
- for (i = 0; i < so->numKilled; i++)
+ for (i = 0; i < numKilled; i++)
{
int itemIndex = so->killedItems[i];
BTScanPosItem *kitem = &so->currPos.items[itemIndex];
MarkBufferDirtyHint(so->currPos.buf, true);
}
- if (!haveLock)
- LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
-
- /*
- * Always reset the scan state, so we don't look for same items on other
- * pages.
- */
- so->numKilled = 0;
+ LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
}
* unpinned between the lastBlockVacuumed and the current block, if there
* are any. This prevents replay of the VACUUM from reaching the stage of
* removing heap tuples while there could still be indexscans "in flight"
- * to those particular tuples (see nbtree/README).
+ * to those particular tuples for those scans which could be confused by
+ * finding new tuples at the old TID locations (see nbtree/README).
*
* It might be worth checking if there are actually any backends running;
* if not, we could just skip this.
handles this efficiently. It is considered OK to hold a pin for long
intervals --- for example, sequential scans hold a pin on the current page
until done processing all the tuples on the page, which could be quite a
-while if the scan is the outer scan of a join. Similarly, btree index
-scans hold a pin on the current index page. This is OK because normal
+while if the scan is the outer scan of a join. Similarly, a btree index
+scan may hold a pin on the current index page. This is OK because normal
operations never wait for a page's pin count to drop to zero. (Anything
that might need to do such a wait is instead handled by waiting to obtain
the relation-level lock, which is why you'd better hold one first.) Pins
{
Buffer buf; /* if valid, the buffer is pinned */
+ XLogRecPtr lsn; /* pos in the WAL stream when page was read */
+	BlockNumber currPage;	/* page referenced by items array */
BlockNumber nextPage; /* page's right link when we scanned it */
/*
typedef BTScanPosData *BTScanPos;
-#define BTScanPosIsValid(scanpos) BufferIsValid((scanpos).buf)
+#define BTScanPosIsPinned(scanpos) \
+( \
+ AssertMacro(BlockNumberIsValid((scanpos).currPage) || \
+ !BufferIsValid((scanpos).buf)), \
+ BufferIsValid((scanpos).buf) \
+)
+#define BTScanPosUnpin(scanpos) \
+ do { \
+ ReleaseBuffer((scanpos).buf); \
+ (scanpos).buf = InvalidBuffer; \
+ } while (0)
+#define BTScanPosUnpinIfPinned(scanpos) \
+ do { \
+ if (BTScanPosIsPinned(scanpos)) \
+ BTScanPosUnpin(scanpos); \
+ } while (0)
+
+#define BTScanPosIsValid(scanpos) \
+( \
+ AssertMacro(BlockNumberIsValid((scanpos).currPage) || \
+ !BufferIsValid((scanpos).buf)), \
+ BlockNumberIsValid((scanpos).currPage) \
+)
+#define BTScanPosInvalidate(scanpos) \
+ do { \
+ (scanpos).currPage = InvalidBlockNumber; \
+ (scanpos).nextPage = InvalidBlockNumber; \
+ (scanpos).buf = InvalidBuffer; \
+ (scanpos).lsn = InvalidXLogRecPtr; \
+ (scanpos).nextTupleOffset = 0; \
+	} while (0)
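+
+/*
+ * Typical usage of the position macros (cf. btrescan): a scan position can
+ * remain valid after its pin has been dropped, so callers drop any pin and
+ * invalidate the position as separate steps, for example:
+ *
+ *		if (BTScanPosIsValid(so->currPos))
+ *		{
+ *			if (so->numKilled > 0)
+ *				_bt_killitems(scan);
+ *			BTScanPosUnpinIfPinned(so->currPos);
+ *			BTScanPosInvalidate(so->currPos);
+ *		}
+ */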
/* We need one of these for each equality-type SK_SEARCHARRAY scan key */
typedef struct BTArrayKeyInfo
extern IndexTuple _bt_checkkeys(IndexScanDesc scan,
Page page, OffsetNumber offnum,
ScanDirection dir, bool *continuescan);
-extern void _bt_killitems(IndexScanDesc scan, bool haveLock);
+extern void _bt_killitems(IndexScanDesc scan);
extern BTCycleId _bt_vacuum_cycleid(Relation rel);
extern BTCycleId _bt_start_vacuum(Relation rel);
extern void _bt_end_vacuum(Relation rel);