-$Header: /cvsroot/pgsql/src/backend/access/nbtree/README,v 1.1.1.1 1996/07/09 06:21:12 scrappy Exp $
+$Header: /cvsroot/pgsql/src/backend/access/nbtree/README,v 1.2 2000/07/21 06:42:32 tgl Exp $
This directory contains a correct implementation of Lehman and Yao's
-btree management algorithm that supports concurrent access for Postgres.
+high-concurrency B-tree management algorithm (P. Lehman and S. Yao,
+Efficient Locking for Concurrent Operations on B-Trees, ACM Transactions
+on Database Systems, Vol 6, No. 4, December 1981, pp 650-670).
+
We have made the following changes in order to incorporate their algorithm
into Postgres:
- + The requirement that all btree keys be unique is too onerous,
- but the algorithm won't work correctly without it. As a result,
- this implementation adds an OID (guaranteed to be unique) to
- every key in the index. This guarantees uniqueness within a set
- of duplicates. Space overhead is four bytes.
-
- For this reason, when we're passed an index tuple to store by the
- common access method code, we allocate a larger one and copy the
- supplied tuple into it. No Postgres code outside of the btree
- access method knows about this xid or sequence number.
-
- + Lehman and Yao don't require read locks, but assume that in-
- memory copies of tree nodes are unshared. Postgres shares
- in-memory buffers among backends. As a result, we do page-
- level read locking on btree nodes in order to guarantee that
- no record is modified while we are examining it. This reduces
- concurrency but guaranteees correct behavior.
-
- + Read locks on a page are held for as long as a scan has a pointer
- to the page. However, locks are always surrendered before the
- sibling page lock is acquired (for readers), so we remain deadlock-
- free. I will do a formal proof if I get bored anytime soon.
++ The requirement that all btree keys be unique is too onerous,
+ but the algorithm won't work correctly without it. Fortunately, it is
+ only necessary that keys be unique on a single tree level, because L&Y
+ only use the assumption of key uniqueness when re-finding a key in a
+ parent node (to determine where to insert the key for a split page).
+ Therefore, we can use the link field to disambiguate multiple
+ occurrences of the same user key: only one entry in the parent level
+ will be pointing at the page we had split. (Indeed we need not look at
+ the real "key" at all, just at the link field.) We can distinguish
+ items at the leaf level in the same way, by examining their links to
+ heap tuples; we'd never have two items for the same heap tuple.
+
++ Lehman and Yao assume that the key range for a subtree S is described
+ by Ki < v <= Ki+1 where Ki and Ki+1 are the adjacent keys in the parent
+ node. This does not work for nonunique keys (for example, if we have
+ enough equal keys to spread across several leaf pages, there *must* be
+ some equal bounding keys in the first level up). Therefore we assume
+ Ki <= v <= Ki+1 instead. A search that finds exact equality to a
+ bounding key in an upper tree level must descend to the left of that
+ key to ensure it finds any equal keys in the preceding page. An
+ insertion that sees the high key of its target page is equal to the key
+ to be inserted has a choice whether or not to move right, since the new
+ key could go on either page. (Currently, we try to find a page where
+ there is room for the new key without a split.)
+
++ Lehman and Yao don't require read locks, but assume that in-memory
+ copies of tree nodes are unshared. Postgres shares in-memory buffers
+ among backends. As a result, we do page-level read locking on btree
+ nodes in order to guarantee that no record is modified while we are
+ examining it. This reduces concurrency but guarantees correct
+ behavior. An advantage is that when trading in a read lock for a
+ write lock, we need not re-read the page after getting the write lock.
+ Since we're also holding a pin on the shared buffer containing the
+ page, we know that buffer still contains the page and is up-to-date.
+
++ We support the notion of an ordered "scan" of an index as well as
+ insertions, deletions, and simple lookups. A scan in the forward
+ direction is no problem: we just use the right-sibling pointers that
+ L&Y require anyway. (Thus, once we have descended the tree to the
+ correct start point for the scan, the scan looks only at leaf pages
+ and never at higher tree levels.) To support scans in the backward
+ direction, we also store a "left sibling" link much like the "right
+ sibling". (This adds an extra step to the L&Y split algorithm: while
+ holding the write lock on the page being split, we also lock its former
+ right sibling to update that page's left-link. This is safe since no
+ writer of that page can be interested in acquiring a write lock on our
+ page.) A backwards scan has one additional bit of complexity: after
+ following the left-link we must account for the possibility that the
+ left sibling page got split before we could read it. So, we have to
+ move right until we find a page whose right-link matches the page we
+ came from.
+
++ Read locks on a page are held for as long as a scan has a pointer
+ to the page. However, locks are always surrendered before the
+ sibling page lock is acquired (for readers), so we remain deadlock-
+ free. I will do a formal proof if I get bored anytime soon.
+ NOTE: nbtree.c arranges to drop the read lock, but not the buffer pin,
+ on the current page of a scan before control leaves nbtree. When we
+ come back to resume the scan, we have to re-grab the read lock and
+ then move right if the current item moved (see _bt_restscan()).
+
++ Lehman and Yao fail to discuss what must happen when the root page
+ becomes full and must be split. Our implementation is to split the
+ root in the same way that any other page would be split, then construct
+ a new root page holding pointers to both of the resulting pages (which
+ now become siblings on level 2 of the tree). The new root page is then
+ installed by altering the root pointer in the meta-data page (see
+ below). This works because the root is not treated specially in any
+ other way --- in particular, searches will move right using its link
+ pointer if the link is set. Therefore, searches will find the data
+ that's been moved into the right sibling even if they read the metadata
+ page before it got updated. This is the same reasoning that makes a
+ split of a non-root page safe. The locking considerations are similar too.
+
++ Lehman and Yao assume fixed-size keys, but we must deal with
+ variable-size keys. Therefore there is not a fixed maximum number of
+ keys per page; we just stuff in as many as will fit. When we split a
+ page, we try to equalize the number of bytes, not items, assigned to
+ each of the resulting pages. Note we must include the incoming item in
+ this calculation; otherwise it is possible to find that the incoming
+ item doesn't fit on the split page where it needs to go!
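
The byte-balancing rule in the last item above can be illustrated with a
small standalone sketch. This is not the code in nbtinsert.c: the name
sketch_choose_split, the flat array of item sizes, and the return
convention are all assumptions made for the example. It only shows why the
incoming item has to be counted when choosing the split point.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical sketch: choose a split point that balances bytes, not items.
 *
 * sizes[0..nitems-1] are the sizes of the existing items in page order.
 * The incoming item (newsz bytes) goes before existing item 'newpos';
 * newpos == nitems means "after the last existing item".
 *
 * Returns how many entries of the combined sequence (existing items plus
 * the incoming one) go to the left page.  Each page gets at least one.
 */
static size_t
sketch_choose_split(const size_t *sizes, size_t nitems,
                    size_t newpos, size_t newsz)
{
    size_t total = newsz;               /* bytes in the combined sequence */
    size_t left = 0;                    /* bytes assigned to the left page */
    size_t best_k = 1;
    size_t best_delta = (size_t) -1;
    size_t i;
    size_t k;

    assert(sizes != NULL && nitems >= 1 && newpos <= nitems);

    for (i = 0; i < nitems; i++)
        total += sizes[i];

    /* Try every split that leaves at least one entry on the right. */
    for (k = 1; k <= nitems; k++)
    {
        size_t idx = k - 1;             /* entry just moved to the left */
        size_t sz;
        size_t right;
        size_t delta;

        if (idx < newpos)
            sz = sizes[idx];
        else if (idx == newpos)
            sz = newsz;                 /* the incoming item itself */
        else
            sz = sizes[idx - 1];

        left += sz;
        right = total - left;
        delta = (left > right) ? left - right : right - left;
        if (delta < best_delta)
        {
            best_delta = delta;
            best_k = k;
        }
    }
    return best_k;
}
```

With four existing 10-byte items and a 100-byte item arriving at the end,
the sketch puts all four existing items on the left and the new item alone
on the right. A split by item count (two or three entries on the left)
would leave at least 110 of the 140 bytes on the right page.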
In addition, the following things are handy to know:
- + Page zero of every btree is a meta-data page. This page stores
- the location of the root page, a pointer to a list of free
- pages, and other stuff that's handy to know.
-
- + This algorithm doesn't really work, since it requires ordered
- writes, and UNIX doesn't support ordered writes.
-
- + There's one other case where we may screw up in this
- implementation. When we start a scan, we descend the tree
- to the key nearest the one in the qual, and once we get there,
- position ourselves correctly for the qual type (eg, <, >=, etc).
- If we happen to step off a page, decide we want to get back to
- it, and fetch the page again, and if some bad person has split
- the page and moved the last tuple we saw off of it, then the
- code complains about botched concurrency in an elog(WARN, ...)
- and gives up the ghost. This is the ONLY violation of Lehman
- and Yao's guarantee of correct behavior that I am aware of in
- this code.
++ Page zero of every btree is a meta-data page. This page stores
+ the location of the root page, a pointer to a list of free
+ pages, and other stuff that's handy to know. (Currently, we
+ never shrink btree indexes so there are never any free pages.)
+
++ The algorithm assumes we can fit at least three items per page
+ (a "high key" and two real data items). Therefore it's unsafe
+ to accept items larger than 1/3rd page size. Larger items would
+ work sometimes, but could cause failures later on depending on
+ what else gets put on their page.
+
++ This algorithm doesn't guarantee btree consistency after a kernel crash
+ or hardware failure. To do that, we'd need ordered writes, and UNIX
+ doesn't support ordered writes (short of fsync'ing every update, which
+ is too high a price). Rebuilding corrupted indexes during restart
+ seems more attractive.
+
++ On deletions, we need to adjust the position of active scans on
+ the index. The code in nbtscan.c handles this. We don't need to
+ do this for insertions or splits because _bt_restscan can find the
+ new position of the previously-found item. NOTE that nbtscan.c
+ only copes with deletions issued by the current backend. This
+ essentially means that concurrent deletions are not supported, but
+ that's true already in the Lehman and Yao algorithm. nbtscan.c
+ exists only to support VACUUM and allow it to delete items while
+ it's scanning the index.
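
The one-third-of-a-page limit mentioned above can be worked out with a
short sketch. The arithmetic follows the size expression visible in
_bt_insertonpg in this patch, but the function names, the 8-byte alignment
quantum, and the sample sizes used below are assumptions for illustration,
not values taken from the Postgres headers.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed alignment quantum; the real MAXALIGN is platform-dependent. */
#define SKETCH_ALIGN 8

/* Round len up to the next multiple of SKETCH_ALIGN. */
static size_t
sketch_maxalign(size_t len)
{
    return (len + (SKETCH_ALIGN - 1)) & ~((size_t) (SKETCH_ALIGN - 1));
}

/*
 * Largest item that still lets three items share one page:
 * (page - page header - aligned opaque area) / 3, minus one line pointer.
 */
static size_t
sketch_max_item_size(size_t page_size, size_t header_size,
                     size_t opaque_size, size_t itemid_size)
{
    size_t usable;

    assert(page_size > header_size + sketch_maxalign(opaque_size));
    usable = page_size - header_size - sketch_maxalign(opaque_size);
    assert(usable / 3 > itemid_size);
    return usable / 3 - itemid_size;
}
```

For example, assuming an 8192-byte page, a 24-byte page header, a 16-byte
opaque area and 4-byte line pointers, the limit comes out at
(8192 - 24 - 16) / 3 - 4 = 2713 bytes.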
+
+Notes about data representation:
+
++ The right-sibling link required by L&Y is kept in the page "opaque
+ data" area, as is the left-sibling link and some flags.
+
++ We also keep a parent link in the opaque data, but this link is not
+ very trustworthy because it is not updated when the parent page splits.
+ Thus, it points to some page on the parent level, but possibly a page
+ well to the left of the page's actual current parent. In most cases
+ we do not need this link at all. Normally we return to a parent page
+ using a stack of entries that are made as we descend the tree, as in L&Y.
+ There is exactly one case where the stack will not help: concurrent
+ root splits. If an inserter process needs to split what had been the
+ root when it started its descent, but finds that that page is no longer
+ the root (because someone else split it meanwhile), then it uses the
+ parent link to move up to the next level. This is OK because we do fix
+ the parent link in a former root page when splitting it. This logic
+ will work even if the root is split multiple times (even up to creation
+ of multiple new levels) before an inserter returns to it. The same
+ could not be said of finding the new root via the metapage, since that
+ would work only for a single level of added root.
+
++ The Postgres disk block data format (an array of items) doesn't fit
+ Lehman and Yao's alternating-keys-and-pointers notion of a disk page,
+ so we have to play some games.
+
++ On a page that is not rightmost in its tree level, the "high key" is
+ kept in the page's first item, and real data items start at item 2.
+ The link portion of the "high key" item goes unused. A page that is
+ rightmost has no "high key", so data items start with the first item.
+ Putting the high key at the left, rather than the right, may seem odd,
+ but it avoids moving the high key as we add data items.
+
++ On a leaf page, the data items are simply links to (TIDs of) tuples
+ in the relation being indexed, with the associated key values.
+
++ On a non-leaf page, the data items are down-links to child pages with
+ bounding keys. The key in each data item is the *lower* bound for
+ keys on that child page, so logically the key is to the left of that
+ downlink. The high key (if present) is the upper bound for the last
+ downlink. The first data item on each such page has no lower bound
+ --- or lower bound of minus infinity, if you prefer. The comparison
+ routines must treat it accordingly. The actual key stored in the
+ item is irrelevant, and need not be stored at all. This arrangement
+ corresponds to the fact that an L&Y non-leaf page has one more pointer
+ than key.
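
The item layout described above can be sketched in a few lines. This is a
simplified model, not the nbtree code: SketchPage, the integer keys, and
both function names are hypothetical. It shows where data items start on a
page and how a comparison routine can treat the first data item of a
non-leaf page as minus infinity.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the flags kept in the page opaque data. */
typedef struct
{
    bool rightmost;     /* true: page has no high key */
    bool leaf;          /* true: items point at heap tuples */
} SketchPage;

/* Item 1 is the high key unless the page is rightmost in its level. */
static int
sketch_first_data_offset(const SketchPage *page)
{
    return page->rightmost ? 1 : 2;
}

/*
 * Compare search_key with the item at 'offset' (1-based; keys[0] unused).
 * Returns <0, 0 or >0.  On a non-leaf page the first data item has no
 * lower bound, so every search key compares greater than it and the
 * stored key value is never examined.
 */
static int
sketch_compare(const SketchPage *page, const int *keys,
               int offset, int search_key)
{
    assert(offset >= 1);

    if (!page->leaf && offset == sketch_first_data_offset(page))
        return 1;       /* anything is greater than minus infinity */

    return (search_key > keys[offset]) - (search_key < keys[offset]);
}
```

On a non-rightmost internal page, offset 1 (the high key) is compared
normally, while offset 2 always compares as smaller than the search key,
whatever value happens to be stored there.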
Notes to operator class implementors:
- With this implementation, we require the user to supply us with
- a procedure for pg_amproc. This procedure should take two keys
- A and B and return < 0, 0, or > 0 if A < B, A = B, or A > B,
- respectively. See the contents of that relation for the btree
- access method for some samples.
-
-Notes to mao for implementation document:
-
- On deletions, we need to adjust the position of active scans on
- the index. The code in nbtscan.c handles this. We don't need to
- do this for splits because of the way splits are handled; if they
- happen behind us, we'll automatically go to the next page, and if
- they happen in front of us, we're not affected by them. For
- insertions, if we inserted a tuple behind the current scan location
- on the current scan page, we move one space ahead.
++ With this implementation, we require the user to supply us with
+ a procedure for pg_amproc. This procedure should take two keys
+ A and B and return < 0, 0, or > 0 if A < B, A = B, or A > B,
+ respectively. See the contents of that relation for the btree
+ access method for some samples.
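
As an illustration of the contract just described, here is a minimal
three-way comparison for 32-bit integer keys. It is a plain C sketch with a
hypothetical name (sketch_int4cmp); it does not show how a procedure is
registered in pg_amproc or how Postgres actually calls it.

```c
#include <stdint.h>

/*
 * Hypothetical comparison procedure: returns < 0, 0, or > 0 when
 * a < b, a = b, or a > b respectively.
 *
 * Written with two comparisons instead of "a - b", because the
 * subtraction can overflow for operands of opposite sign.
 */
static int32_t
sketch_int4cmp(int32_t a, int32_t b)
{
    return (a > b) - (a < b);
}
```

Only the sign of the result matters to the caller, so returning -1, 0 or 1
satisfies the contract.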
*
*
* IDENTIFICATION
- * $Header: /cvsroot/pgsql/src/backend/access/nbtree/nbtinsert.c,v 1.59 2000/06/08 22:36:52 momjian Exp $
+ * $Header: /cvsroot/pgsql/src/backend/access/nbtree/nbtinsert.c,v 1.60 2000/07/21 06:42:32 tgl Exp $
*
*-------------------------------------------------------------------------
*/
#include "access/nbtree.h"
-static InsertIndexResult _bt_insertonpg(Relation rel, Buffer buf, BTStack stack, int keysz, ScanKey scankey, BTItem btitem, BTItem afteritem);
-static Buffer _bt_split(Relation rel, Size keysz, ScanKey scankey,
- Buffer buf, OffsetNumber firstright);
-static OffsetNumber _bt_findsplitloc(Relation rel, Size keysz, ScanKey scankey,
- Page page, OffsetNumber start,
- OffsetNumber maxoff, Size llimit);
+typedef struct
+{
+ /* context data for _bt_checksplitloc */
+ Size newitemsz; /* size of new item to be inserted */
+ bool non_leaf; /* T if splitting an internal node */
+
+ bool have_split; /* found a valid split? */
+
+ /* these fields valid only if have_split is true */
+ bool newitemonleft; /* new item on left or right of best split */
+ OffsetNumber firstright; /* best split point */
+ int best_delta; /* best size delta so far */
+} FindSplitData;
+
+
+static TransactionId _bt_check_unique(Relation rel, BTItem btitem,
+ Relation heapRel, Buffer buf,
+ ScanKey itup_scankey);
+static InsertIndexResult _bt_insertonpg(Relation rel, Buffer buf,
+ BTStack stack,
+ int keysz, ScanKey scankey,
+ BTItem btitem,
+ OffsetNumber afteritem);
+static Buffer _bt_split(Relation rel, Buffer buf, OffsetNumber firstright,
+ OffsetNumber newitemoff, Size newitemsz,
+ BTItem newitem, bool newitemonleft,
+ OffsetNumber *itup_off, BlockNumber *itup_blkno);
+static OffsetNumber _bt_findsplitloc(Relation rel, Page page,
+ OffsetNumber newitemoff,
+ Size newitemsz,
+ bool *newitemonleft);
+static void _bt_checksplitloc(FindSplitData *state, OffsetNumber firstright,
+ int leftfree, int rightfree,
+ bool newitemonleft, Size firstrightitemsz);
+static Buffer _bt_getstackbuf(Relation rel, BTStack stack);
static void _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf);
-static OffsetNumber _bt_pgaddtup(Relation rel, Buffer buf, int keysz, ScanKey itup_scankey, Size itemsize, BTItem btitem, BTItem afteritem);
-static bool _bt_goesonpg(Relation rel, Buffer buf, Size keysz, ScanKey scankey, BTItem afteritem);
-static void _bt_updateitem(Relation rel, Size keysz, Buffer buf, BTItem oldItem, BTItem newItem);
-static bool _bt_isequal(TupleDesc itupdesc, Page page, OffsetNumber offnum, int keysz, ScanKey scankey);
-static int32 _bt_tuplecompare(Relation rel, Size keysz, ScanKey scankey,
- IndexTuple tuple1, IndexTuple tuple2);
+static void _bt_pgaddtup(Relation rel, Page page,
+ Size itemsize, BTItem btitem,
+ OffsetNumber itup_off, const char *where);
+static bool _bt_isequal(TupleDesc itupdesc, Page page, OffsetNumber offnum,
+ int keysz, ScanKey scankey);
/*
* _bt_doinsert() -- Handle insertion of a single btitem in the tree.
*
* This routine is called by the public interface routines, btbuild
- * and btinsert. By here, btitem is filled in, and has a unique
- * (xid, seqno) pair.
+ * and btinsert. By here, btitem is filled in, including the TID.
*/
InsertIndexResult
-_bt_doinsert(Relation rel, BTItem btitem, bool index_is_unique, Relation heapRel)
+_bt_doinsert(Relation rel, BTItem btitem,
+ bool index_is_unique, Relation heapRel)
{
+ IndexTuple itup = &(btitem->bti_itup);
+ int natts = rel->rd_rel->relnatts;
ScanKey itup_scankey;
- IndexTuple itup;
BTStack stack;
Buffer buf;
- BlockNumber blkno;
- int natts = rel->rd_rel->relnatts;
InsertIndexResult res;
- Buffer buffer;
-
- itup = &(btitem->bti_itup);
/* we need a scan key to do our search, so build one */
itup_scankey = _bt_mkscankey(rel, itup);
+top:
/* find the page containing this key */
- stack = _bt_search(rel, natts, itup_scankey, &buf);
+ stack = _bt_search(rel, natts, itup_scankey, &buf, BT_WRITE);
/* trade in our read lock for a write lock */
LockBuffer(buf, BUFFER_LOCK_UNLOCK);
LockBuffer(buf, BT_WRITE);
-l1:
-
/*
* If the page was split between the time that we surrendered our read
* lock and acquired our write lock, then this page may no longer be
* need to move right in the tree. See Lehman and Yao for an
* excruciatingly precise description.
*/
-
buf = _bt_moveright(rel, buf, natts, itup_scankey, BT_WRITE);
- blkno = BufferGetBlockNumber(buf);
- /* if we're not allowing duplicates, make sure the key isn't */
- /* already in the node */
+ /*
+ * If we're not allowing duplicates, make sure the key isn't
+ * already in the index. XXX this belongs somewhere else, likely
+ */
if (index_is_unique)
{
- OffsetNumber offset,
- maxoff;
- Page page;
+ TransactionId xwait;
- page = BufferGetPage(buf);
- maxoff = PageGetMaxOffsetNumber(page);
+ xwait = _bt_check_unique(rel, btitem, heapRel, buf, itup_scankey);
+
+ if (TransactionIdIsValid(xwait))
+ {
+ /* Have to wait for the other guy ... */
+ _bt_relbuf(rel, buf, BT_WRITE);
+ XactLockTableWait(xwait);
+ /* start over... */
+ _bt_freestack(stack);
+ goto top;
+ }
+ }
+
+ /* do the insertion */
+ res = _bt_insertonpg(rel, buf, stack, natts, itup_scankey, btitem, 0);
+
+ /* be tidy */
+ _bt_freestack(stack);
+ _bt_freeskey(itup_scankey);
+
+ return res;
+}
+
+/*
+ * _bt_check_unique() -- Check for violation of unique index constraint
+ *
+ * Returns NullTransactionId if there is no conflict, else an xact ID we
+ * must wait for to see if it commits a conflicting tuple. If an actual
+ * conflict is detected, no return --- just elog().
+ */
+static TransactionId
+_bt_check_unique(Relation rel, BTItem btitem, Relation heapRel,
+ Buffer buf, ScanKey itup_scankey)
+{
+ TupleDesc itupdesc = RelationGetDescr(rel);
+ int natts = rel->rd_rel->relnatts;
+ OffsetNumber offset,
+ maxoff;
+ Page page;
+ BTPageOpaque opaque;
+ Buffer nbuf = InvalidBuffer;
+ bool chtup = true;
+
+ page = BufferGetPage(buf);
+ opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+ maxoff = PageGetMaxOffsetNumber(page);
+
+ /*
+ * Find first item >= proposed new item. Note we could also get
+ * a pointer to end-of-page here.
+ */
+ offset = _bt_binsrch(rel, buf, natts, itup_scankey);
- offset = _bt_binsrch(rel, buf, natts, itup_scankey, BT_DESCENT);
+ /*
+ * Scan over all equal tuples, looking for live conflicts.
+ */
+ for (;;)
+ {
+ HeapTupleData htup;
+ Buffer buffer;
+ BTItem cbti;
+ BlockNumber nblkno;
- /* make sure the offset we're given points to an actual */
- /* key on the page before trying to compare it */
- if (!PageIsEmpty(page) && offset <= maxoff)
+ /*
+ * _bt_compare returns 0 for (1,NULL) and (1,NULL) - this's
+ * how we handling NULLs - and so we must not use _bt_compare
+ * in real comparison, but only for ordering/finding items on
+ * pages. - vadim 03/24/97
+ *
+ * make sure the offset points to an actual key
+ * before trying to compare it...
+ */
+ if (offset <= maxoff)
{
- TupleDesc itupdesc;
- BTItem cbti;
- HeapTupleData htup;
- BTPageOpaque opaque;
- Buffer nbuf;
- BlockNumber nblkno;
- bool chtup = true;
-
- itupdesc = RelationGetDescr(rel);
- nbuf = InvalidBuffer;
- opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+ if (! _bt_isequal(itupdesc, page, offset, natts, itup_scankey))
+ break; /* we're past all the equal tuples */
/*
- * _bt_compare returns 0 for (1,NULL) and (1,NULL) - this's
- * how we handling NULLs - and so we must not use _bt_compare
- * in real comparison, but only for ordering/finding items on
- * pages. - vadim 03/24/97
- *
- * while ( !_bt_compare (rel, itupdesc, page, natts,
- * itup_scankey, offset) )
+ * Check whether the heap tuple being inserted is itself a
+ * deleted one (i.e. just moved to another place by vacuum).
+ * We only need to do this once, but don't want to do it at all
+ * unless we see equal tuples, so as not to slow down the unequal case.
*/
- while (_bt_isequal(itupdesc, page, offset, natts, itup_scankey))
- { /* they're equal */
-
- /*
- * Have to check is inserted heap tuple deleted one (i.e.
- * just moved to another place by vacuum)!
- */
- if (chtup)
- {
- htup.t_self = btitem->bti_itup.t_tid;
- heap_fetch(heapRel, SnapshotDirty, &htup, &buffer);
- if (htup.t_data == NULL) /* YES! */
- break;
- /* Live tuple was inserted */
- ReleaseBuffer(buffer);
- chtup = false;
- }
- cbti = (BTItem) PageGetItem(page, PageGetItemId(page, offset));
- htup.t_self = cbti->bti_itup.t_tid;
+ if (chtup)
+ {
+ htup.t_self = btitem->bti_itup.t_tid;
heap_fetch(heapRel, SnapshotDirty, &htup, &buffer);
- if (htup.t_data != NULL) /* it is a duplicate */
- {
- TransactionId xwait =
+ if (htup.t_data == NULL) /* YES! */
+ break;
+ /* Live tuple is being inserted, so continue checking */
+ ReleaseBuffer(buffer);
+ chtup = false;
+ }
+
+ cbti = (BTItem) PageGetItem(page, PageGetItemId(page, offset));
+ htup.t_self = cbti->bti_itup.t_tid;
+ heap_fetch(heapRel, SnapshotDirty, &htup, &buffer);
+ if (htup.t_data != NULL) /* it is a duplicate */
+ {
+ TransactionId xwait =
(TransactionIdIsValid(SnapshotDirty->xmin)) ?
SnapshotDirty->xmin : SnapshotDirty->xmax;
- /*
- * If this tuple is being updated by other transaction
- * then we have to wait for its commit/abort.
- */
- ReleaseBuffer(buffer);
- if (TransactionIdIsValid(xwait))
- {
- if (nbuf != InvalidBuffer)
- _bt_relbuf(rel, nbuf, BT_READ);
- _bt_relbuf(rel, buf, BT_WRITE);
- XactLockTableWait(xwait);
- buf = _bt_getbuf(rel, blkno, BT_WRITE);
- goto l1;/* continue from the begin */
- }
- elog(ERROR, "Cannot insert a duplicate key into unique index %s", RelationGetRelationName(rel));
- }
- /* htup null so no buffer to release */
- /* get next offnum */
- if (offset < maxoff)
- offset = OffsetNumberNext(offset);
- else
- { /* move right ? */
- if (P_RIGHTMOST(opaque))
- break;
- if (!_bt_isequal(itupdesc, page, P_HIKEY,
- natts, itup_scankey))
- break;
-
- /*
- * min key of the right page is the same, ooh - so
- * many dead duplicates...
- */
- nblkno = opaque->btpo_next;
+ /*
+ * If this tuple is being updated by other transaction
+ * then we have to wait for its commit/abort.
+ */
+ ReleaseBuffer(buffer);
+ if (TransactionIdIsValid(xwait))
+ {
if (nbuf != InvalidBuffer)
_bt_relbuf(rel, nbuf, BT_READ);
- for (nbuf = InvalidBuffer;;)
- {
- nbuf = _bt_getbuf(rel, nblkno, BT_READ);
- page = BufferGetPage(nbuf);
- maxoff = PageGetMaxOffsetNumber(page);
- opaque = (BTPageOpaque) PageGetSpecialPointer(page);
- offset = P_RIGHTMOST(opaque) ? P_HIKEY : P_FIRSTKEY;
- if (!PageIsEmpty(page) && offset <= maxoff)
- { /* Found some key */
- break;
- }
- else
- { /* Empty or "pseudo"-empty page - get next */
- nblkno = opaque->btpo_next;
- _bt_relbuf(rel, nbuf, BT_READ);
- nbuf = InvalidBuffer;
- if (nblkno == P_NONE)
- break;
- }
- }
- if (nbuf == InvalidBuffer)
- break;
+ /* Tell _bt_doinsert to wait... */
+ return xwait;
}
+ /*
+ * Otherwise we have a definite conflict.
+ */
+ elog(ERROR, "Cannot insert a duplicate key into unique index %s",
+ RelationGetRelationName(rel));
}
+ /* htup null so no buffer to release */
+ }
+
+ /*
+ * Advance to next tuple to continue checking.
+ */
+ if (offset < maxoff)
+ offset = OffsetNumberNext(offset);
+ else
+ {
+ /* If scankey == hikey we gotta check the next page too */
+ if (P_RIGHTMOST(opaque))
+ break;
+ if (!_bt_isequal(itupdesc, page, P_HIKEY,
+ natts, itup_scankey))
+ break;
+ nblkno = opaque->btpo_next;
if (nbuf != InvalidBuffer)
_bt_relbuf(rel, nbuf, BT_READ);
+ nbuf = _bt_getbuf(rel, nblkno, BT_READ);
+ page = BufferGetPage(nbuf);
+ opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+ maxoff = PageGetMaxOffsetNumber(page);
+ offset = P_FIRSTDATAKEY(opaque);
}
}
- /* do the insertion */
- res = _bt_insertonpg(rel, buf, stack, natts, itup_scankey,
- btitem, (BTItem) NULL);
+ if (nbuf != InvalidBuffer)
+ _bt_relbuf(rel, nbuf, BT_READ);
- /* be tidy */
- _bt_freestack(stack);
- _bt_freeskey(itup_scankey);
-
- return res;
+ return NullTransactionId;
}
-/*
+/*----------
* _bt_insertonpg() -- Insert a tuple on a particular page in the index.
*
* This recursive procedure does the following things:
*
- * + if necessary, splits the target page.
- * + finds the right place to insert the tuple (taking into
- * account any changes induced by a split).
+ * + finds the right place to insert the tuple.
+ * + if necessary, splits the target page (making sure that the
+ * split is equitable as far as post-insert free space goes).
* + inserts the tuple.
* + if the page was split, pops the parent stack, and finds the
* right place to insert the new child pointer (by walking
* right using information stored in the parent stack).
- * + invoking itself with the appropriate tuple for the right
+ * + invokes itself with the appropriate tuple for the right
* child page on the parent.
*
* On entry, we must have the right buffer on which to do the
* insertion, and the buffer must be pinned and locked. On return,
* we will have dropped both the pin and the write lock on the buffer.
*
+ * If 'afteritem' is >0 then the new tuple must be inserted after the
+ * existing item of that number, noplace else. If 'afteritem' is 0
+ * then the procedure finds the exact spot to insert it by searching.
+ * (keysz and scankey parameters are used ONLY if afteritem == 0.)
+ *
+ * NOTE: if the new key is equal to one or more existing keys, we can
+ * legitimately place it anywhere in the series of equal keys --- in fact,
+ * if the new key is equal to the page's "high key" we can place it on
+ * the next page. If it is equal to the high key, and there's not room
+ * to insert the new tuple on the current page without splitting, then
+ * we move right hoping to find more free space and avoid a split.
+ * Ordinarily, though, we'll insert it before the existing equal keys
+ * because of the way _bt_binsrch() works.
+ *
* The locking interactions in this code are critical. You should
* grok Lehman and Yao's paper before making any changes. In addition,
* you need to understand how we disambiguate duplicate keys in this
* implementation, in order to be able to find our location using
* L&Y "move right" operations. Since we may insert duplicate user
- * keys, and since these dups may propogate up the tree, we use the
+ * keys, and since these dups may propagate up the tree, we use the
* 'afteritem' parameter to position ourselves correctly for the
* insertion on internal pages.
+ *----------
*/
static InsertIndexResult
_bt_insertonpg(Relation rel,
int keysz,
ScanKey scankey,
BTItem btitem,
- BTItem afteritem)
+ OffsetNumber afteritem)
{
InsertIndexResult res;
Page page;
BTPageOpaque lpageop;
- BlockNumber itup_blkno;
OffsetNumber itup_off;
+ BlockNumber itup_blkno;
+ OffsetNumber newitemoff;
OffsetNumber firstright = InvalidOffsetNumber;
Size itemsz;
- bool do_split = false;
- bool keys_equal = false;
page = BufferGetPage(buf);
lpageop = (BTPageOpaque) PageGetSpecialPointer(page);
(PageGetPageSize(page) - sizeof(PageHeaderData) - MAXALIGN(sizeof(BTPageOpaqueData))) /3 - sizeof(ItemIdData));
/*
- * If we have to insert item on the leftmost page which is the first
- * page in the chain of duplicates then: 1. if scankey == hikey (i.e.
- * - new duplicate item) then insert it here; 2. if scankey < hikey
- * then: 2.a if there is duplicate key(s) here - we force splitting;
- * 2.b else - we may "eat" this page from duplicates chain.
+ * Determine exactly where new item will go.
*/
- if (lpageop->btpo_flags & BTP_CHAIN)
+ if (afteritem > 0)
{
- OffsetNumber maxoff = PageGetMaxOffsetNumber(page);
- ItemId hitemid;
- BTItem hitem;
-
- Assert(!P_RIGHTMOST(lpageop));
- hitemid = PageGetItemId(page, P_HIKEY);
- hitem = (BTItem) PageGetItem(page, hitemid);
- if (maxoff > P_HIKEY &&
- !_bt_itemcmp(rel, keysz, scankey, hitem,
- (BTItem) PageGetItem(page, PageGetItemId(page, P_FIRSTKEY)),
- BTEqualStrategyNumber))
- elog(FATAL, "btree: bad key on the page in the chain of duplicates");
-
- if (!_bt_skeycmp(rel, keysz, scankey, page, hitemid,
- BTEqualStrategyNumber))
- {
- if (!P_LEFTMOST(lpageop))
- elog(FATAL, "btree: attempt to insert bad key on the non-leftmost page in the chain of duplicates");
- if (!_bt_skeycmp(rel, keysz, scankey, page, hitemid,
- BTLessStrategyNumber))
- elog(FATAL, "btree: attempt to insert higher key on the leftmost page in the chain of duplicates");
- if (maxoff > P_HIKEY) /* have duplicate(s) */
- {
- firstright = P_FIRSTKEY;
- do_split = true;
- }
- else
-/* "eat" page */
- {
- Buffer pbuf;
- Page ppage;
-
- itup_blkno = BufferGetBlockNumber(buf);
- itup_off = PageAddItem(page, (Item) btitem, itemsz,
- P_FIRSTKEY, LP_USED);
- if (itup_off == InvalidOffsetNumber)
- elog(FATAL, "btree: failed to add item");
- lpageop->btpo_flags &= ~BTP_CHAIN;
- pbuf = _bt_getstackbuf(rel, stack, BT_WRITE);
- ppage = BufferGetPage(pbuf);
- PageIndexTupleDelete(ppage, stack->bts_offset);
- pfree(stack->bts_btitem);
- stack->bts_btitem = _bt_formitem(&(btitem->bti_itup));
- ItemPointerSet(&(stack->bts_btitem->bti_itup.t_tid),
- itup_blkno, P_HIKEY);
- _bt_wrtbuf(rel, buf);
- res = _bt_insertonpg(rel, pbuf, stack->bts_parent,
- keysz, scankey, stack->bts_btitem,
- NULL);
- ItemPointerSet(&(res->pointerData), itup_blkno, itup_off);
- return res;
- }
- }
- else
- {
- keys_equal = true;
- if (PageGetFreeSpace(page) < itemsz)
- do_split = true;
- }
+ newitemoff = afteritem + 1;
}
- else if (PageGetFreeSpace(page) < itemsz)
- do_split = true;
- else if (PageGetFreeSpace(page) < 3 * itemsz + 2 * sizeof(ItemIdData))
- {
- OffsetNumber offnum = (P_RIGHTMOST(lpageop)) ? P_HIKEY : P_FIRSTKEY;
- OffsetNumber maxoff = PageGetMaxOffsetNumber(page);
-
- if (offnum < maxoff) /* can't split unless at least 2 items... */
- {
- ItemId itid;
- BTItem previtem,
- chkitem;
- Size maxsize;
- Size currsize;
-
- /* find largest group of identically-keyed items on page */
- itid = PageGetItemId(page, offnum);
- previtem = (BTItem) PageGetItem(page, itid);
- maxsize = currsize = (ItemIdGetLength(itid) + sizeof(ItemIdData));
- for (offnum = OffsetNumberNext(offnum);
- offnum <= maxoff; offnum = OffsetNumberNext(offnum))
- {
- itid = PageGetItemId(page, offnum);
- chkitem = (BTItem) PageGetItem(page, itid);
- if (!_bt_itemcmp(rel, keysz, scankey,
- previtem, chkitem,
- BTEqualStrategyNumber))
- {
- if (currsize > maxsize)
- maxsize = currsize;
- currsize = 0;
- previtem = chkitem;
- }
- currsize += (ItemIdGetLength(itid) + sizeof(ItemIdData));
- }
- if (currsize > maxsize)
- maxsize = currsize;
- /* Decide to split if largest group is > 1/2 page size */
- maxsize += sizeof(PageHeaderData) +
- MAXALIGN(sizeof(BTPageOpaqueData));
- if (maxsize >= PageGetPageSize(page) / 2)
- do_split = true;
- }
- }
-
- if (do_split)
+ else
{
- Buffer rbuf;
- Page rpage;
- BTItem ritem;
- BlockNumber rbknum;
- BTPageOpaque rpageop;
- Buffer pbuf;
- Page ppage;
- BTPageOpaque ppageop;
- BlockNumber bknum = BufferGetBlockNumber(buf);
- BTItem lowLeftItem;
- OffsetNumber maxoff;
- bool shifted = false;
- bool left_chained = (lpageop->btpo_flags & BTP_CHAIN) ? true : false;
- bool is_root = lpageop->btpo_flags & BTP_ROOT;
-
/*
- * Instead of splitting leaf page in the chain of duplicates by
- * new duplicate, insert it into some right page.
+ * If we will need to split the page to put the item here,
+ * check whether we can put the tuple somewhere to the right,
+ * instead. Keep scanning until we find enough free space or
+ * reach the last page where the tuple can legally go.
*/
- if ((lpageop->btpo_flags & BTP_CHAIN) &&
- (lpageop->btpo_flags & BTP_LEAF) && keys_equal)
+ while (PageGetFreeSpace(page) < itemsz &&
+ !P_RIGHTMOST(lpageop) &&
+ _bt_compare(rel, keysz, scankey, page, P_HIKEY) == 0)
{
- rbuf = _bt_getbuf(rel, lpageop->btpo_next, BT_WRITE);
- rpage = BufferGetPage(rbuf);
- rpageop = (BTPageOpaque) PageGetSpecialPointer(rpage);
+ /* step right one page */
+ BlockNumber rblkno = lpageop->btpo_next;
- /*
- * some checks
- */
- if (!P_RIGHTMOST(rpageop)) /* non-rightmost page */
- { /* If we have the same hikey here then
- * it's yet another page in chain. */
- if (_bt_skeycmp(rel, keysz, scankey, rpage,
- PageGetItemId(rpage, P_HIKEY),
- BTEqualStrategyNumber))
- {
- if (!(rpageop->btpo_flags & BTP_CHAIN))
- elog(FATAL, "btree: lost page in the chain of duplicates");
- }
- else if (_bt_skeycmp(rel, keysz, scankey, rpage,
- PageGetItemId(rpage, P_HIKEY),
- BTGreaterStrategyNumber))
- elog(FATAL, "btree: hikey is out of order");
- else if (rpageop->btpo_flags & BTP_CHAIN)
-
- /*
- * If hikey > scankey then it's last page in chain and
- * BTP_CHAIN must be OFF
- */
- elog(FATAL, "btree: lost last page in the chain of duplicates");
- }
- else
-/* rightmost page */
- Assert(!(rpageop->btpo_flags & BTP_CHAIN));
_bt_relbuf(rel, buf, BT_WRITE);
- return (_bt_insertonpg(rel, rbuf, stack, keysz,
- scankey, btitem, afteritem));
+ buf = _bt_getbuf(rel, rblkno, BT_WRITE);
+ page = BufferGetPage(buf);
+ lpageop = (BTPageOpaque) PageGetSpecialPointer(page);
}
-
/*
- * If after splitting un-chained page we'll got chain of pages
- * with duplicates then we want to know 1. on which of two pages
- * new btitem will go (current _bt_findsplitloc is quite bad); 2.
- * what parent (if there's one) thinking about it (remember about
- * deletions)
+ * This is it, so find the position...
*/
- else if (!(lpageop->btpo_flags & BTP_CHAIN))
- {
- OffsetNumber start = (P_RIGHTMOST(lpageop)) ? P_HIKEY : P_FIRSTKEY;
- Size llimit;
-
- maxoff = PageGetMaxOffsetNumber(page);
- llimit = PageGetPageSize(page) - sizeof(PageHeaderData) -
- MAXALIGN(sizeof(BTPageOpaqueData))
- +sizeof(ItemIdData);
- llimit /= 2;
- firstright = _bt_findsplitloc(rel, keysz, scankey,
- page, start, maxoff, llimit);
-
- if (_bt_itemcmp(rel, keysz, scankey,
- (BTItem) PageGetItem(page, PageGetItemId(page, start)),
- (BTItem) PageGetItem(page, PageGetItemId(page, firstright)),
- BTEqualStrategyNumber))
- {
- if (_bt_skeycmp(rel, keysz, scankey, page,
- PageGetItemId(page, firstright),
- BTLessStrategyNumber))
-
- /*
- * force moving current items to the new page: new
- * item will go on the current page.
- */
- firstright = start;
- else
-
- /*
- * new btitem >= firstright, start item == firstright
- * - new chain of duplicates: if this non-leftmost
- * leaf page and parent item < start item then force
- * moving all items to the new page - current page
- * will be "empty" after it.
- */
- {
- if (!P_LEFTMOST(lpageop) &&
- (lpageop->btpo_flags & BTP_LEAF))
- {
- ItemPointerSet(&(stack->bts_btitem->bti_itup.t_tid),
- bknum, P_HIKEY);
- pbuf = _bt_getstackbuf(rel, stack, BT_WRITE);
- if (_bt_itemcmp(rel, keysz, scankey,
- stack->bts_btitem,
- (BTItem) PageGetItem(page,
- PageGetItemId(page, start)),
- BTLessStrategyNumber))
- {
- firstright = start;
- shifted = true;
- }
- _bt_relbuf(rel, pbuf, BT_WRITE);
- }
- }
- } /* else - no new chain if start item <
- * firstright one */
- }
+ newitemoff = _bt_binsrch(rel, buf, keysz, scankey);
+ }
- /* split the buffer into left and right halves */
- rbuf = _bt_split(rel, keysz, scankey, buf, firstright);
+ /*
+ * Do we need to split the page to fit the item on it?
+ */
+ if (PageGetFreeSpace(page) < itemsz)
+ {
+ Buffer rbuf;
+ BlockNumber bknum = BufferGetBlockNumber(buf);
+ BlockNumber rbknum;
+ bool is_root = P_ISROOT(lpageop);
+ bool newitemonleft;
- /* which new page (left half or right half) gets the tuple? */
- if (_bt_goesonpg(rel, buf, keysz, scankey, afteritem))
- {
- /* left page */
- itup_off = _bt_pgaddtup(rel, buf, keysz, scankey,
- itemsz, btitem, afteritem);
- itup_blkno = BufferGetBlockNumber(buf);
- }
- else
- {
- /* right page */
- itup_off = _bt_pgaddtup(rel, rbuf, keysz, scankey,
- itemsz, btitem, afteritem);
- itup_blkno = BufferGetBlockNumber(rbuf);
- }
+ /* Choose the split point */
+ firstright = _bt_findsplitloc(rel, page,
+ newitemoff, itemsz,
+ &newitemonleft);
- maxoff = PageGetMaxOffsetNumber(page);
- if (shifted)
- {
- if (maxoff > P_FIRSTKEY)
- elog(FATAL, "btree: shifted page is not empty");
- lowLeftItem = (BTItem) NULL;
- }
- else
- {
- if (maxoff < P_FIRSTKEY)
- elog(FATAL, "btree: un-shifted page is empty");
- lowLeftItem = (BTItem) PageGetItem(page,
- PageGetItemId(page, P_FIRSTKEY));
- if (_bt_itemcmp(rel, keysz, scankey, lowLeftItem,
- (BTItem) PageGetItem(page, PageGetItemId(page, P_HIKEY)),
- BTEqualStrategyNumber))
- lpageop->btpo_flags |= BTP_CHAIN;
- }
+ /* split the buffer into left and right halves */
+ rbuf = _bt_split(rel, buf, firstright,
+ newitemoff, itemsz, btitem, newitemonleft,
+ &itup_off, &itup_blkno);
- /*
+ /*----------
* By here,
*
- * + our target page has been split; + the original tuple has been
- * inserted; + we have write locks on both the old (left half)
- * and new (right half) buffers, after the split; and + we have
- * the key we want to insert into the parent.
+ * + our target page has been split;
+ * + the original tuple has been inserted;
+ * + we have write locks on both the old (left half)
+ * and new (right half) buffers, after the split; and
+ * + we know the key we want to insert into the parent
+ * (it's the "high key" on the left child page).
+ *
+ * We're ready to do the parent insertion. We need to hold onto the
+ * locks for the child pages until we locate the parent, but we can
+ * release them before doing the actual insertion (see Lehman and Yao
+ * for the reasoning).
*
- * Do the parent insertion. We need to hold onto the locks for the
- * child pages until we locate the parent, but we can release them
- * before doing the actual insertion (see Lehman and Yao for the
- * reasoning).
+ * Here we have to do something Lehman and Yao don't talk about:
+ * deal with a root split and construction of a new root. If our
+ * stack is empty then we have just split a node on what had been
+ * the root level when we descended the tree. If it is still the
+ * root then we perform a new-root construction. If it *wasn't*
+ * the root anymore, use the parent pointer to get up to the root
+ * level that someone constructed meanwhile, and find the right
+ * place to insert as for the normal case.
+ *----------
*/
-l_spl: ;
- if (stack == (BTStack) NULL)
+ if (is_root)
{
- if (!is_root) /* if this page was not root page */
- {
- elog(DEBUG, "btree: concurrent ROOT page split");
- stack = (BTStack) palloc(sizeof(BTStackData));
- stack->bts_blkno = lpageop->btpo_parent;
- stack->bts_offset = InvalidOffsetNumber;
- stack->bts_btitem = (BTItem) palloc(sizeof(BTItemData));
- /* bts_btitem will be initialized below */
- stack->bts_parent = NULL;
- goto l_spl;
- }
+ Assert(stack == (BTStack) NULL);
/* create a new root node and release the split buffers */
_bt_newroot(rel, buf, rbuf);
}
else
{
- ScanKey newskey;
InsertIndexResult newres;
BTItem new_item;
- OffsetNumber upditem_offset = P_HIKEY;
- bool do_update = false;
- bool update_in_place = true;
- bool parent_chained;
+ BTStackData fakestack;
+ BTItem ritem;
+ Buffer pbuf;
- /* form a index tuple that points at the new right page */
- rbknum = BufferGetBlockNumber(rbuf);
- rpage = BufferGetPage(rbuf);
- rpageop = (BTPageOpaque) PageGetSpecialPointer(rpage);
-
- /*
- * By convention, the first entry (1) on every non-rightmost
- * page is the high key for that page. In order to get the
- * lowest key on the new right page, we actually look at its
- * second (2) entry.
- */
-
- if (!P_RIGHTMOST(rpageop))
+ /* Set up a phony stack entry if we haven't got a real one */
+ if (stack == (BTStack) NULL)
{
- ritem = (BTItem) PageGetItem(rpage,
- PageGetItemId(rpage, P_FIRSTKEY));
- if (_bt_itemcmp(rel, keysz, scankey,
- ritem,
- (BTItem) PageGetItem(rpage,
- PageGetItemId(rpage, P_HIKEY)),
- BTEqualStrategyNumber))
- rpageop->btpo_flags |= BTP_CHAIN;
+ elog(DEBUG, "btree: concurrent ROOT page split");
+ stack = &fakestack;
+ stack->bts_blkno = lpageop->btpo_parent;
+ stack->bts_offset = InvalidOffsetNumber;
+ /* bts_btitem will be initialized below */
+ stack->bts_parent = NULL;
}
- else
- ritem = (BTItem) PageGetItem(rpage,
- PageGetItemId(rpage, P_HIKEY));
- /* get a unique btitem for this key */
- new_item = _bt_formitem(&(ritem->bti_itup));
+ /* get high key from left page == lowest key on new right page */
+ ritem = (BTItem) PageGetItem(page,
+ PageGetItemId(page, P_HIKEY));
+ /* form an index tuple that points at the new right page */
+ new_item = _bt_formitem(&(ritem->bti_itup));
+ rbknum = BufferGetBlockNumber(rbuf);
ItemPointerSet(&(new_item->bti_itup.t_tid), rbknum, P_HIKEY);
/*
* Oops - if we were moved right then we need to change stack
* item! We want to find parent pointing to where we are,
* right ? - vadim 05/27/97
- */
- ItemPointerSet(&(stack->bts_btitem->bti_itup.t_tid),
- bknum, P_HIKEY);
- pbuf = _bt_getstackbuf(rel, stack, BT_WRITE);
- ppage = BufferGetPage(pbuf);
- ppageop = (BTPageOpaque) PageGetSpecialPointer(ppage);
- parent_chained = ((ppageop->btpo_flags & BTP_CHAIN)) ? true : false;
-
- if (parent_chained && !left_chained)
- elog(FATAL, "nbtree: unexpected chained parent of unchained page");
-
- /*
- * If the key of new_item is < than the key of the item in the
- * parent page pointing to the left page (stack->bts_btitem),
- * we have to update the latter key; otherwise the keys on the
- * parent page wouldn't be monotonically increasing after we
- * inserted the new pointer to the right page (new_item). This
- * only happens if our left page is the leftmost page and a
- * new minimum key had been inserted before, which is not
- * reflected in the parent page but didn't matter so far. If
- * there are duplicate keys and this new minimum key spills
- * over to our new right page, we get an inconsistency if we
- * don't update the left key in the parent page.
*
- * Also, new duplicates handling code require us to update parent
- * item if some smaller items left on the left page (which is
- * possible in splitting leftmost page) and current parent
- * item == new_item. - vadim 05/27/97
+ * Interestingly, this means we didn't *really* need to stack
+ * the parent key at all; all we really care about is the
+ * saved block and offset as a starting point for our search...
*/
- if (_bt_itemcmp(rel, keysz, scankey,
- stack->bts_btitem, new_item,
- BTGreaterStrategyNumber) ||
- (!shifted &&
- _bt_itemcmp(rel, keysz, scankey,
- stack->bts_btitem, new_item,
- BTEqualStrategyNumber) &&
- _bt_itemcmp(rel, keysz, scankey,
- lowLeftItem, new_item,
- BTLessStrategyNumber)))
- {
- do_update = true;
-
- /*
- * figure out which key is leftmost (if the parent page is
- * rightmost, too, it must be the root)
- */
- if (P_RIGHTMOST(ppageop))
- upditem_offset = P_HIKEY;
- else
- upditem_offset = P_FIRSTKEY;
- if (!P_LEFTMOST(lpageop) ||
- stack->bts_offset != upditem_offset)
- elog(FATAL, "btree: items are out of order (leftmost %d, stack %u, update %u)",
- P_LEFTMOST(lpageop), stack->bts_offset, upditem_offset);
- }
-
- if (do_update)
- {
- if (shifted)
- elog(FATAL, "btree: attempt to update parent for shifted page");
-
- /*
- * Try to update in place. If out parent page is chained
- * then we must forse insertion.
- */
- if (!parent_chained &&
- MAXALIGN(IndexTupleDSize(lowLeftItem->bti_itup)) ==
- MAXALIGN(IndexTupleDSize(stack->bts_btitem->bti_itup)))
- {
- _bt_updateitem(rel, keysz, pbuf,
- stack->bts_btitem, lowLeftItem);
- _bt_wrtbuf(rel, buf);
- _bt_wrtbuf(rel, rbuf);
- }
- else
- {
- update_in_place = false;
- PageIndexTupleDelete(ppage, upditem_offset);
-
- /*
- * don't write anything out yet--we still have the
- * write lock, and now we call another _bt_insertonpg
- * to insert the correct key. First, make a new item,
- * using the tuple data from lowLeftItem. Point it to
- * the left child. Update it on the stack at the same
- * time.
- */
- pfree(stack->bts_btitem);
- stack->bts_btitem = _bt_formitem(&(lowLeftItem->bti_itup));
- ItemPointerSet(&(stack->bts_btitem->bti_itup.t_tid),
- bknum, P_HIKEY);
-
- /*
- * Unlock the children before doing this
- */
- _bt_wrtbuf(rel, buf);
- _bt_wrtbuf(rel, rbuf);
-
- /*
- * A regular _bt_binsrch should find the right place
- * to put the new entry, since it should be lower than
- * any other key on the page. Therefore set afteritem
- * to NULL.
- */
- newskey = _bt_mkscankey(rel, &(stack->bts_btitem->bti_itup));
- newres = _bt_insertonpg(rel, pbuf, stack->bts_parent,
- keysz, newskey, stack->bts_btitem,
- NULL);
-
- pfree(newres);
- pfree(newskey);
-
- /*
- * we have now lost our lock on the parent buffer, and
- * need to get it back.
- */
- pbuf = _bt_getstackbuf(rel, stack, BT_WRITE);
- }
- }
- else
- {
- _bt_wrtbuf(rel, buf);
- _bt_wrtbuf(rel, rbuf);
- }
+ ItemPointerSet(&(stack->bts_btitem.bti_itup.t_tid),
+ bknum, P_HIKEY);
- newskey = _bt_mkscankey(rel, &(new_item->bti_itup));
+ pbuf = _bt_getstackbuf(rel, stack);
- afteritem = stack->bts_btitem;
- if (parent_chained && !update_in_place)
- {
- ppage = BufferGetPage(pbuf);
- ppageop = (BTPageOpaque) PageGetSpecialPointer(ppage);
- if (ppageop->btpo_flags & BTP_CHAIN)
- elog(FATAL, "btree: unexpected BTP_CHAIN flag in parent after update");
- if (P_RIGHTMOST(ppageop))
- elog(FATAL, "btree: chained parent is RIGHTMOST after update");
- maxoff = PageGetMaxOffsetNumber(ppage);
- if (maxoff != P_FIRSTKEY)
- elog(FATAL, "btree: FIRSTKEY was unexpected in parent after update");
- if (_bt_skeycmp(rel, keysz, newskey, ppage,
- PageGetItemId(ppage, P_FIRSTKEY),
- BTLessEqualStrategyNumber))
- elog(FATAL, "btree: parent FIRSTKEY is >= duplicate key after update");
- if (!_bt_skeycmp(rel, keysz, newskey, ppage,
- PageGetItemId(ppage, P_HIKEY),
- BTEqualStrategyNumber))
- elog(FATAL, "btree: parent HIGHKEY is not equal duplicate key after update");
- afteritem = (BTItem) NULL;
- }
- else if (left_chained && !update_in_place)
- {
- ppage = BufferGetPage(pbuf);
- ppageop = (BTPageOpaque) PageGetSpecialPointer(ppage);
- if (!P_RIGHTMOST(ppageop) &&
- _bt_skeycmp(rel, keysz, newskey, ppage,
- PageGetItemId(ppage, P_HIKEY),
- BTGreaterStrategyNumber))
- afteritem = (BTItem) NULL;
- }
- if (afteritem == (BTItem) NULL)
- {
- rbuf = _bt_getbuf(rel, ppageop->btpo_next, BT_WRITE);
- _bt_relbuf(rel, pbuf, BT_WRITE);
- pbuf = rbuf;
- }
+ /* Now we can write and unlock the children */
+ _bt_wrtbuf(rel, rbuf);
+ _bt_wrtbuf(rel, buf);
+ /* Recursively update the parent */
newres = _bt_insertonpg(rel, pbuf, stack->bts_parent,
- keysz, newskey, new_item,
- afteritem);
+ 0, NULL, new_item, stack->bts_offset);
/* be tidy */
pfree(newres);
- pfree(newskey);
pfree(new_item);
}
}
else
{
- itup_off = _bt_pgaddtup(rel, buf, keysz, scankey,
- itemsz, btitem, afteritem);
+ _bt_pgaddtup(rel, page, itemsz, btitem, newitemoff, "page");
+ itup_off = newitemoff;
itup_blkno = BufferGetBlockNumber(buf);
-
- _bt_relbuf(rel, buf, BT_WRITE);
+ /* Write out the updated page and release pin/lock */
+ _bt_wrtbuf(rel, buf);
}
- /* by here, the new tuple is inserted */
+ /* by here, the new tuple is inserted at itup_blkno/itup_off */
res = (InsertIndexResult) palloc(sizeof(InsertIndexResultData));
ItemPointerSet(&(res->pointerData), itup_blkno, itup_off);
* _bt_split() -- split a page in the btree.
*
* On entry, buf is the page to split, and is write-locked and pinned.
- * Returns the new right sibling of buf, pinned and write-locked. The
- * pin and lock on buf are maintained.
+ * firstright is the item index of the first item to be moved to the
+ * new right page. newitemoff etc. tell us about the new item that
+ * must be inserted along with the data from the old page.
+ *
+ * Returns the new right sibling of buf, pinned and write-locked.
+ * The pin and lock on buf are maintained. *itup_off and *itup_blkno
+ * are set to the exact location where newitem was inserted.
*/
static Buffer
-_bt_split(Relation rel, Size keysz, ScanKey scankey,
- Buffer buf, OffsetNumber firstright)
+_bt_split(Relation rel, Buffer buf, OffsetNumber firstright,
+ OffsetNumber newitemoff, Size newitemsz, BTItem newitem,
+ bool newitemonleft,
+ OffsetNumber *itup_off, BlockNumber *itup_blkno)
{
Buffer rbuf;
Page origpage;
BTItem item;
OffsetNumber leftoff,
rightoff;
- OffsetNumber start;
OffsetNumber maxoff;
OffsetNumber i;
leftpage = PageGetTempPage(origpage, sizeof(BTPageOpaqueData));
rightpage = BufferGetPage(rbuf);
- _bt_pageinit(rightpage, BufferGetPageSize(rbuf));
_bt_pageinit(leftpage, BufferGetPageSize(buf));
+ _bt_pageinit(rightpage, BufferGetPageSize(rbuf));
/* init btree private data */
oopaque = (BTPageOpaque) PageGetSpecialPointer(origpage);
/* if we're splitting this page, it won't be the root when we're done */
oopaque->btpo_flags &= ~BTP_ROOT;
- oopaque->btpo_flags &= ~BTP_CHAIN;
lopaque->btpo_flags = ropaque->btpo_flags = oopaque->btpo_flags;
lopaque->btpo_prev = oopaque->btpo_prev;
- ropaque->btpo_prev = BufferGetBlockNumber(buf);
lopaque->btpo_next = BufferGetBlockNumber(rbuf);
+ ropaque->btpo_prev = BufferGetBlockNumber(buf);
ropaque->btpo_next = oopaque->btpo_next;
+ /*
+ * Must copy the original parent link into both new pages, even though
+ * it might be quite obsolete by now. We might need it if this level
+ * is or recently was the root (see README).
+ */
lopaque->btpo_parent = ropaque->btpo_parent = oopaque->btpo_parent;
/*
* If the page we're splitting is not the rightmost page at its level
- * in the tree, then the first (0) entry on the page is the high key
+ * in the tree, then the first entry on the page is the high key
* for the page. We need to copy that to the right half. Otherwise
- * (meaning the rightmost page case), we should treat the line
- * pointers beginning at zero as user data.
- *
- * We leave a blank space at the start of the line table for the left
- * page. We'll come back later and fill it in with the high key item
- * we get from the right key.
+ * (meaning the rightmost page case), all the items on the right half
+ * will be user data.
*/
+ rightoff = P_HIKEY;
- leftoff = P_FIRSTKEY;
- ropaque->btpo_next = oopaque->btpo_next;
if (!P_RIGHTMOST(oopaque))
{
- /* splitting a non-rightmost page, start at the first data item */
- start = P_FIRSTKEY;
-
itemid = PageGetItemId(origpage, P_HIKEY);
itemsz = ItemIdGetLength(itemid);
item = (BTItem) PageGetItem(origpage, itemid);
- if (PageAddItem(rightpage, (Item) item, itemsz, P_HIKEY, LP_USED) == InvalidOffsetNumber)
+ if (PageAddItem(rightpage, (Item) item, itemsz, rightoff,
+ LP_USED) == InvalidOffsetNumber)
elog(FATAL, "btree: failed to add hikey to the right sibling");
- rightoff = P_FIRSTKEY;
+ rightoff = OffsetNumberNext(rightoff);
}
- else
- {
- /* splitting a rightmost page, "high key" is the first data item */
- start = P_HIKEY;
- /* the new rightmost page will not have a high key */
- rightoff = P_HIKEY;
+ /*
+ * The "high key" for the new left page will be the first key that's
+ * going to go into the new right page. This might be either the
+ * existing data item at position firstright, or the incoming tuple.
+ */
+ leftoff = P_HIKEY;
+ if (!newitemonleft && newitemoff == firstright)
+ {
+ /* incoming tuple will become first on right page */
+ itemsz = newitemsz;
+ item = newitem;
}
- maxoff = PageGetMaxOffsetNumber(origpage);
- if (firstright == InvalidOffsetNumber)
+ else
{
- Size llimit = PageGetFreeSpace(leftpage) / 2;
-
- firstright = _bt_findsplitloc(rel, keysz, scankey,
- origpage, start, maxoff, llimit);
+ /* existing item at firstright will become first on right page */
+ itemid = PageGetItemId(origpage, firstright);
+ itemsz = ItemIdGetLength(itemid);
+ item = (BTItem) PageGetItem(origpage, itemid);
}
+ if (PageAddItem(leftpage, (Item) item, itemsz, leftoff,
+ LP_USED) == InvalidOffsetNumber)
+ elog(FATAL, "btree: failed to add hikey to the left sibling");
+ leftoff = OffsetNumberNext(leftoff);
- for (i = start; i <= maxoff; i = OffsetNumberNext(i))
+ /*
+ * Now transfer all the data items to the appropriate page
+ */
+ maxoff = PageGetMaxOffsetNumber(origpage);
+
+ for (i = P_FIRSTDATAKEY(oopaque); i <= maxoff; i = OffsetNumberNext(i))
{
itemid = PageGetItemId(origpage, i);
itemsz = ItemIdGetLength(itemid);
item = (BTItem) PageGetItem(origpage, itemid);
+ /* does new item belong before this one? */
+ if (i == newitemoff)
+ {
+ if (newitemonleft)
+ {
+ _bt_pgaddtup(rel, leftpage, newitemsz, newitem, leftoff,
+ "left sibling");
+ *itup_off = leftoff;
+ *itup_blkno = BufferGetBlockNumber(buf);
+ leftoff = OffsetNumberNext(leftoff);
+ }
+ else
+ {
+ _bt_pgaddtup(rel, rightpage, newitemsz, newitem, rightoff,
+ "right sibling");
+ *itup_off = rightoff;
+ *itup_blkno = BufferGetBlockNumber(rbuf);
+ rightoff = OffsetNumberNext(rightoff);
+ }
+ }
+
/* decide which page to put it on */
if (i < firstright)
{
- if (PageAddItem(leftpage, (Item) item, itemsz, leftoff,
- LP_USED) == InvalidOffsetNumber)
- elog(FATAL, "btree: failed to add item to the left sibling");
+ _bt_pgaddtup(rel, leftpage, itemsz, item, leftoff,
+ "left sibling");
leftoff = OffsetNumberNext(leftoff);
}
else
{
- if (PageAddItem(rightpage, (Item) item, itemsz, rightoff,
- LP_USED) == InvalidOffsetNumber)
- elog(FATAL, "btree: failed to add item to the right sibling");
+ _bt_pgaddtup(rel, rightpage, itemsz, item, rightoff,
+ "right sibling");
rightoff = OffsetNumberNext(rightoff);
}
}
- /*
- * Okay, page has been split, high key on right page is correct. Now
- * set the high key on the left page to be the min key on the right
- * page.
- */
-
- if (P_RIGHTMOST(ropaque))
- itemid = PageGetItemId(rightpage, P_HIKEY);
- else
- itemid = PageGetItemId(rightpage, P_FIRSTKEY);
- itemsz = ItemIdGetLength(itemid);
- item = (BTItem) PageGetItem(rightpage, itemid);
-
- /*
- * We left a hole for the high key on the left page; fill it. The
- * modal crap is to tell the page manager to put the new item on the
- * page and not screw around with anything else. Whoever designed
- * this interface has presumably crawled back into the dung heap they
- * came from. No one here will admit to it.
- */
-
- PageManagerModeSet(OverwritePageManagerMode);
- if (PageAddItem(leftpage, (Item) item, itemsz, P_HIKEY, LP_USED) == InvalidOffsetNumber)
- elog(FATAL, "btree: failed to add hikey to the left sibling");
- PageManagerModeSet(ShufflePageManagerMode);
+ /* cope with possibility that newitem goes at the end */
+ if (i <= newitemoff)
+ {
+ if (newitemonleft)
+ {
+ _bt_pgaddtup(rel, leftpage, newitemsz, newitem, leftoff,
+ "left sibling");
+ *itup_off = leftoff;
+ *itup_blkno = BufferGetBlockNumber(buf);
+ leftoff = OffsetNumberNext(leftoff);
+ }
+ else
+ {
+ _bt_pgaddtup(rel, rightpage, newitemsz, newitem, rightoff,
+ "right sibling");
+ *itup_off = rightoff;
+ *itup_blkno = BufferGetBlockNumber(rbuf);
+ rightoff = OffsetNumberNext(rightoff);
+ }
+ }
/*
* By here, the original data page has been split into two new halves,
PageRestoreTempPage(leftpage, origpage);
- /* write these guys out */
- _bt_wrtnorelbuf(rel, rbuf);
- _bt_wrtnorelbuf(rel, buf);
-
/*
* Finally, we need to grab the right sibling (if any) and fix the
* prev pointer there. We are guaranteed that this is deadlock-free
- * since no other writer will be moving holding a lock on that page
+ * since no other writer will be holding a lock on that page
* and trying to move left, and all readers release locks on a page
* before trying to fetch its neighbors.
*/
}
/*
- * _bt_findsplitloc() -- find a safe place to split a page.
+ * _bt_findsplitloc() -- find an appropriate place to split a page.
+ *
+ * The idea here is to equalize the free space that will be on each split
+ * page, *after accounting for the inserted tuple*. (If we fail to account
+ * for it, we might find ourselves with too little room on the page that
+ * it needs to go into!)
*
- * In order to guarantee the proper handling of searches for duplicate
- * keys, the first duplicate in the chain must either be the first
- * item on the page after the split, or the entire chain must be on
- * one of the two pages. That is,
- * [1 2 2 2 3 4 5]
- * must become
- * [1] [2 2 2 3 4 5]
- * or
- * [1 2 2 2] [3 4 5]
- * but not
- * [1 2 2] [2 3 4 5].
- * However,
- * [2 2 2 2 2 3 4]
- * may be split as
- * [2 2 2 2] [2 3 4].
+ * We are passed the intended insert position of the new tuple, expressed as
+ * the offsetnumber of the tuple it must go in front of. (This could be
+ * maxoff+1 if the tuple is to go at the end.)
+ *
+ * We return the index of the first existing tuple that should go on the
+ * righthand page, plus a boolean indicating whether the new tuple goes on
+ * the left or right page. The bool is necessary to disambiguate the case
+ * where firstright == newitemoff.
*/
static OffsetNumber
_bt_findsplitloc(Relation rel,
- Size keysz,
- ScanKey scankey,
Page page,
- OffsetNumber start,
- OffsetNumber maxoff,
- Size llimit)
+ OffsetNumber newitemoff,
+ Size newitemsz,
+ bool *newitemonleft)
{
- OffsetNumber i;
- OffsetNumber saferight;
- ItemId nxtitemid,
- safeitemid;
- BTItem safeitem,
- nxtitem;
- Size nbytes;
-
- if (start >= maxoff)
- elog(FATAL, "btree: cannot split if start (%d) >= maxoff (%d)",
- start, maxoff);
- saferight = start;
- safeitemid = PageGetItemId(page, saferight);
- nbytes = ItemIdGetLength(safeitemid) + sizeof(ItemIdData);
- safeitem = (BTItem) PageGetItem(page, safeitemid);
-
- i = OffsetNumberNext(start);
-
- while (nbytes < llimit)
+ BTPageOpaque opaque;
+ OffsetNumber offnum;
+ OffsetNumber maxoff;
+ ItemId itemid;
+ FindSplitData state;
+ int leftspace,
+ rightspace,
+ dataitemtotal,
+ dataitemstoleft;
+
+ opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+
+ state.newitemsz = newitemsz;
+ state.non_leaf = ! P_ISLEAF(opaque);
+ state.have_split = false;
+
+ /* Total free space available on a btree page, after fixed overhead */
+ leftspace = rightspace =
+ PageGetPageSize(page) - sizeof(PageHeaderData) -
+ MAXALIGN(sizeof(BTPageOpaqueData))
+ + sizeof(ItemIdData);
+
+ /* The right page will have the same high key as the old page */
+ if (!P_RIGHTMOST(opaque))
{
- /* check the next item on the page */
- nxtitemid = PageGetItemId(page, i);
- nbytes += (ItemIdGetLength(nxtitemid) + sizeof(ItemIdData));
- nxtitem = (BTItem) PageGetItem(page, nxtitemid);
+ itemid = PageGetItemId(page, P_HIKEY);
+ rightspace -= (int) (ItemIdGetLength(itemid) + sizeof(ItemIdData));
+ }
+
+ /* Count up total space in data items without actually scanning 'em */
+ dataitemtotal = rightspace - (int) PageGetFreeSpace(page);
+
+ /*
+ * Scan through the data items and calculate space usage for a split
+ * at each possible position. XXX we could probably stop somewhere
+ * near the middle...
+ */
+ dataitemstoleft = 0;
+ maxoff = PageGetMaxOffsetNumber(page);
+
+ for (offnum = P_FIRSTDATAKEY(opaque);
+ offnum <= maxoff;
+ offnum = OffsetNumberNext(offnum))
+ {
+ Size itemsz;
+ int leftfree,
+ rightfree;
+
+ itemid = PageGetItemId(page, offnum);
+ itemsz = ItemIdGetLength(itemid) + sizeof(ItemIdData);
/*
- * Test against last known safe item: if the tuple we're looking
- * at isn't equal to the last safe one we saw, then it's our new
- * safe tuple.
+ * We have to allow for the current item becoming the high key of
+ * the left page; therefore it counts against left space.
*/
- if (!_bt_itemcmp(rel, keysz, scankey,
- safeitem, nxtitem, BTEqualStrategyNumber))
+ leftfree = leftspace - dataitemstoleft - (int) itemsz;
+ rightfree = rightspace - (dataitemtotal - dataitemstoleft);
+ if (offnum < newitemoff)
+ _bt_checksplitloc(&state, offnum, leftfree, rightfree,
+ false, itemsz);
+ else if (offnum > newitemoff)
+ _bt_checksplitloc(&state, offnum, leftfree, rightfree,
+ true, itemsz);
+ else
{
- safeitem = nxtitem;
- saferight = i;
+ /* need to try it both ways!! */
+ _bt_checksplitloc(&state, offnum, leftfree, rightfree,
+ false, newitemsz);
+ _bt_checksplitloc(&state, offnum, leftfree, rightfree,
+ true, itemsz);
}
- if (i < maxoff)
- i = OffsetNumberNext(i);
- else
- break;
+
+ dataitemstoleft += itemsz;
}
+ if (! state.have_split)
+ elog(FATAL, "_bt_findsplitloc: can't find a feasible split point for %s",
+ RelationGetRelationName(rel));
+ *newitemonleft = state.newitemonleft;
+ return state.firstright;
+}
+
+static void
+_bt_checksplitloc(FindSplitData *state, OffsetNumber firstright,
+ int leftfree, int rightfree,
+ bool newitemonleft, Size firstrightitemsz)
+{
+ if (newitemonleft)
+ leftfree -= (int) state->newitemsz;
+ else
+ rightfree -= (int) state->newitemsz;
+ /*
+ * If we are not on the leaf level, we will be able to discard the
+ * key data from the first item that winds up on the right page.
+ */
+ if (state->non_leaf)
+ rightfree += (int) firstrightitemsz -
+ (int) (sizeof(BTItemData) + sizeof(ItemIdData));
/*
- * If the chain of dups starts at the beginning of the page and
- * extends past the halfway mark, we can split it in the middle.
+ * If feasible split point, remember best delta.
*/
+ if (leftfree >= 0 && rightfree >= 0)
+ {
+ int delta = leftfree - rightfree;
+
+ if (delta < 0)
+ delta = -delta;
+ if (!state->have_split || delta < state->best_delta)
+ {
+ state->have_split = true;
+ state->newitemonleft = newitemonleft;
+ state->firstright = firstright;
+ state->best_delta = delta;
+ }
+ }
+}
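
Reviewer's sketch (not part of the patch): the split-point search introduced above can be modeled without any page machinery. The following is a minimal illustration, assuming a page is just an array of item sizes and `capacity` is the usable space per page. The names `ToySplit`, `toy_checksplitloc` and `toy_findsplitloc` are hypothetical. The non-leaf adjustment and the right page's inherited high key are deliberately left out, and, as in the patch, the existing item at the candidate position is the one charged to the left page as its high key.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical result record, loosely modeled on FindSplitData. */
typedef struct ToySplit
{
    bool   have_split;     /* was any feasible split point seen? */
    size_t firstright;     /* index of first existing item moved right */
    bool   newitemonleft;  /* does the incoming item land on the left? */
    int    best_delta;     /* |leftfree - rightfree| of the best split */
} ToySplit;

/* Charge the new item to one side; remember the most balanced feasible split. */
static void
toy_checksplitloc(ToySplit *state, size_t firstright,
                  int leftfree, int rightfree,
                  bool newitemonleft, int newitemsz)
{
    int delta;

    if (newitemonleft)
        leftfree -= newitemsz;
    else
        rightfree -= newitemsz;

    if (leftfree < 0 || rightfree < 0)
        return;                 /* one side would overflow: not feasible */

    delta = leftfree - rightfree;
    if (delta < 0)
        delta = -delta;

    if (!state->have_split || delta < state->best_delta)
    {
        state->have_split = true;
        state->firstright = firstright;
        state->newitemonleft = newitemonleft;
        state->best_delta = delta;
    }
}

/*
 * itemsz[0..nitems-1] are the sizes of the existing data items.
 * newitemoff is the index the new item must go in front of
 * (nitems means "at the end").
 */
ToySplit
toy_findsplitloc(const int *itemsz, size_t nitems,
                 size_t newitemoff, int newitemsz, int capacity)
{
    ToySplit state = {false, 0, false, 0};
    int      total = 0;
    int      toleft = 0;
    size_t   i;

    assert(newitemoff <= nitems);

    for (i = 0; i < nitems; i++)
        total += itemsz[i];

    for (i = 0; i < nitems; i++)
    {
        /* item i would become the left page's high key, so it costs left space */
        int leftfree = capacity - toleft - itemsz[i];
        int rightfree = capacity - (total - toleft);

        if (i < newitemoff)
            toy_checksplitloc(&state, i, leftfree, rightfree, false, newitemsz);
        else if (i > newitemoff)
            toy_checksplitloc(&state, i, leftfree, rightfree, true, newitemsz);
        else
        {
            /* split lands exactly at the insert position: try both sides */
            toy_checksplitloc(&state, i, leftfree, rightfree, false, newitemsz);
            toy_checksplitloc(&state, i, leftfree, rightfree, true, newitemsz);
        }
        toleft += itemsz[i];
    }
    return state;
}
```

With four 10-byte items, a 30-byte capacity and a 10-byte item appended at the end, the only feasible choice in this model is `firstright == 2` with the new item on the right page. If no position is feasible, `have_split` stays false, which corresponds to the `elog(FATAL, ...)` branch in the patch.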
+
+/*
+ * _bt_getstackbuf() -- Walk back up the tree one step, and find the item
+ * we last looked at in the parent.
+ *
+ * This is possible because we save a bit image of the last item
+ * we looked at in the parent, and the update algorithm guarantees
+ * that if items above us in the tree move, they only move right.
+ *
+ * Also, re-set bts_blkno & bts_offset if changed.
+ */
+static Buffer
+_bt_getstackbuf(Relation rel, BTStack stack)
+{
+ BlockNumber blkno;
+ Buffer buf;
+ OffsetNumber start,
+ offnum,
+ maxoff;
+ Page page;
+ ItemId itemid;
+ BTItem item;
+ BTPageOpaque opaque;
+
+ blkno = stack->bts_blkno;
+ buf = _bt_getbuf(rel, blkno, BT_WRITE);
+ page = BufferGetPage(buf);
+ opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+ maxoff = PageGetMaxOffsetNumber(page);
- if (saferight == start)
- saferight = i;
+ start = stack->bts_offset;
+ /*
+ * _bt_insertonpg set bts_offset to InvalidOffsetNumber in the
+ * case of concurrent ROOT page split. Also, watch out for
+ * possibility that page has a high key now when it didn't before.
+ */
+ if (start < P_FIRSTDATAKEY(opaque))
+ start = P_FIRSTDATAKEY(opaque);
- if (saferight == maxoff && (maxoff - start) > 1)
- saferight = start + (maxoff - start) / 2;
+ for (;;)
+ {
+ /* see if it's on this page */
+ for (offnum = start;
+ offnum <= maxoff;
+ offnum = OffsetNumberNext(offnum))
+ {
+ itemid = PageGetItemId(page, offnum);
+ item = (BTItem) PageGetItem(page, itemid);
+ if (BTItemSame(item, &stack->bts_btitem))
+ {
+ /* Return accurate pointer to where link is now */
+ stack->bts_blkno = blkno;
+ stack->bts_offset = offnum;
+ return buf;
+ }
+ }
+ /* by here, the item we're looking for moved right at least one page */
+ if (P_RIGHTMOST(opaque))
+ elog(FATAL, "_bt_getstackbuf: my bits moved right off the end of the world!"
+ "\n\tRecreate index %s.", RelationGetRelationName(rel));
- return saferight;
+ blkno = opaque->btpo_next;
+ _bt_relbuf(rel, buf, BT_WRITE);
+ buf = _bt_getbuf(rel, blkno, BT_WRITE);
+ page = BufferGetPage(buf);
+ opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+ maxoff = PageGetMaxOffsetNumber(page);
+ start = P_FIRSTDATAKEY(opaque);
+ }
}
/*
* graph.
*
* On entry, lbuf (the old root) and rbuf (its new peer) are write-
- * locked. We don't drop the locks in this routine; that's done by
- * the caller. On exit, a new root page exists with entries for the
- * two new children. The new root page is neither pinned nor locked.
+ * locked. On exit, a new root page exists with entries for the
+ * two new children. The new root page is neither pinned nor locked, and
+ * we have also written out lbuf and rbuf and dropped their pins/locks.
*/
static void
_bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf)
rootbuf = _bt_getbuf(rel, P_NEW, BT_WRITE);
rootpage = BufferGetPage(rootbuf);
rootbknum = BufferGetBlockNumber(rootbuf);
- _bt_pageinit(rootpage, BufferGetPageSize(rootbuf));
/* set btree special data */
rootopaque = (BTPageOpaque) PageGetSpecialPointer(rootpage);
rootopaque->btpo_prev = rootopaque->btpo_next = P_NONE;
rootopaque->btpo_flags |= BTP_ROOT;
- /*
- * Insert the internal tuple pointers.
- */
-
lbkno = BufferGetBlockNumber(lbuf);
rbkno = BufferGetBlockNumber(rbuf);
lpage = BufferGetPage(lbuf);
rpage = BufferGetPage(rbuf);
+ /*
+ * Make sure pages in old root level have valid parent links --- we will
+ * need this in _bt_insertonpg() if a concurrent root split happens (see
+ * README).
+ */
((BTPageOpaque) PageGetSpecialPointer(lpage))->btpo_parent =
((BTPageOpaque) PageGetSpecialPointer(rpage))->btpo_parent =
rootbknum;
/*
- * step over the high key on the left page while building the left
- * page pointer.
+ * Create downlink item for left page (old root). Since this will be
+ * the first item in a non-leaf page, it implicitly has minus-infinity
+ * key value, so we need not store any actual key in it.
*/
- itemid = PageGetItemId(lpage, P_FIRSTKEY);
- itemsz = ItemIdGetLength(itemid);
- item = (BTItem) PageGetItem(lpage, itemid);
- new_item = _bt_formitem(&(item->bti_itup));
+ itemsz = sizeof(BTItemData);
+ new_item = (BTItem) palloc(itemsz);
+ new_item->bti_itup.t_info = itemsz;
ItemPointerSet(&(new_item->bti_itup.t_tid), lbkno, P_HIKEY);
/*
- * insert the left page pointer into the new root page. the root page
- * is the rightmost page on its level so the "high key" item is the
- * first data item.
+ * Insert the left page pointer into the new root page. The root page
+ * is the rightmost page on its level so there is no "high key" in it;
+ * the two items will go into positions P_HIKEY and P_FIRSTKEY.
*/
if (PageAddItem(rootpage, (Item) new_item, itemsz, P_HIKEY, LP_USED) == InvalidOffsetNumber)
elog(FATAL, "btree: failed to add leftkey to new root page");
pfree(new_item);
/*
- * the right page is the rightmost page on the second level, so the
- * "high key" item is the first data item on that page as well.
+ * Create downlink item for right page. The key for it is obtained from
+ * the "high key" position in the left page.
*/
- itemid = PageGetItemId(rpage, P_HIKEY);
+ itemid = PageGetItemId(lpage, P_HIKEY);
itemsz = ItemIdGetLength(itemid);
- item = (BTItem) PageGetItem(rpage, itemid);
+ item = (BTItem) PageGetItem(lpage, itemid);
new_item = _bt_formitem(&(item->bti_itup));
ItemPointerSet(&(new_item->bti_itup.t_tid), rbkno, P_HIKEY);
elog(FATAL, "btree: failed to add rightkey to new root page");
pfree(new_item);
- /* write and let go of the root buffer */
+ /* write and let go of the new root buffer */
_bt_wrtbuf(rel, rootbuf);
/* update metadata page with new root block number */
_bt_metaproot(rel, rootbknum, 0);
- _bt_wrtbuf(rel, lbuf);
+ /* update and release new sibling, and finally the old root */
_bt_wrtbuf(rel, rbuf);
+ _bt_wrtbuf(rel, lbuf);
}
/*
* _bt_pgaddtup() -- add a tuple to a particular page in the index.
*
- * This routine adds the tuple to the page as requested, and keeps the
- * write lock and reference associated with the page's buffer. It is
- * an error to call pgaddtup() without a write lock and reference. If
- * afteritem is non-null, it's the item that we expect our new item
- * to follow. Otherwise, we do a binary search for the correct place
- * and insert the new item there.
+ * This routine adds the tuple to the page as requested. It does
+ * not affect pin/lock status, but you'd better have a write lock
+ * and pin on the target buffer! Don't forget to write and release
+ * the buffer afterwards, either.
+ *
+ * The main difference between this routine and a bare PageAddItem call
+ * is that this code knows that the leftmost data item on a non-leaf
+ * btree page doesn't need to have a key. Therefore, it strips such
+ * items down to just the item header. CAUTION: this works ONLY if
+ * we insert the items in order, so that the given itup_off does
+ * represent the final position of the item!
*/
-static OffsetNumber
+static void
_bt_pgaddtup(Relation rel,
- Buffer buf,
- int keysz,
- ScanKey itup_scankey,
+ Page page,
Size itemsize,
BTItem btitem,
- BTItem afteritem)
-{
- OffsetNumber itup_off;
- OffsetNumber first;
- Page page;
- BTPageOpaque opaque;
- BTItem chkitem;
-
- page = BufferGetPage(buf);
- opaque = (BTPageOpaque) PageGetSpecialPointer(page);
- first = P_RIGHTMOST(opaque) ? P_HIKEY : P_FIRSTKEY;
-
- if (afteritem == (BTItem) NULL)
- itup_off = _bt_binsrch(rel, buf, keysz, itup_scankey, BT_INSERTION);
- else
- {
- itup_off = first;
-
- do
- {
- chkitem = (BTItem) PageGetItem(page, PageGetItemId(page, itup_off));
- itup_off = OffsetNumberNext(itup_off);
- } while (!BTItemSame(chkitem, afteritem));
- }
-
- if (PageAddItem(page, (Item) btitem, itemsize, itup_off, LP_USED) == InvalidOffsetNumber)
- elog(FATAL, "btree: failed to add item to the page");
-
- /* write the buffer, but hold our lock */
- _bt_wrtnorelbuf(rel, buf);
-
- return itup_off;
-}
-
-/*
- * _bt_goesonpg() -- Does a new tuple belong on this page?
- *
- * This is part of the complexity introduced by allowing duplicate
- * keys into the index. The tuple belongs on this page if:
- *
- * + there is no page to the right of this one; or
- * + it is less than the high key on the page; or
- * + the item it is to follow ("afteritem") appears on this
- * page.
- */
-static bool
-_bt_goesonpg(Relation rel,
- Buffer buf,
- Size keysz,
- ScanKey scankey,
- BTItem afteritem)
-{
- Page page;
- ItemId hikey;
- BTPageOpaque opaque;
- BTItem chkitem;
- OffsetNumber offnum,
- maxoff;
- bool found;
-
- page = BufferGetPage(buf);
-
- /* no right neighbor? */
- opaque = (BTPageOpaque) PageGetSpecialPointer(page);
- if (P_RIGHTMOST(opaque))
- return true;
-
- /*
- * this is a non-rightmost page, so it must have a high key item.
- *
- * If the scan key is < the high key (the min key on the next page), then
- * it for sure belongs here.
- */
- hikey = PageGetItemId(page, P_HIKEY);
- if (_bt_skeycmp(rel, keysz, scankey, page, hikey, BTLessStrategyNumber))
- return true;
-
- /*
- * If the scan key is > the high key, then it for sure doesn't belong
- * here.
- */
-
- if (_bt_skeycmp(rel, keysz, scankey, page, hikey, BTGreaterStrategyNumber))
- return false;
-
- /*
- * If we have no adjacency information, and the item is equal to the
- * high key on the page (by here it is), then the item does not belong
- * on this page.
- *
- * Now it's not true in all cases. - vadim 06/10/97
- */
-
- if (afteritem == (BTItem) NULL)
- {
- if (opaque->btpo_flags & BTP_LEAF)
- return false;
- if (opaque->btpo_flags & BTP_CHAIN)
- return true;
- if (_bt_skeycmp(rel, keysz, scankey, page,
- PageGetItemId(page, P_FIRSTKEY),
- BTEqualStrategyNumber))
- return true;
- return false;
- }
-
- /* damn, have to work for it. i hate that. */
- maxoff = PageGetMaxOffsetNumber(page);
-
- /*
- * Search the entire page for the afteroid. We need to do this,
- * rather than doing a binary search and starting from there, because
- * if the key we're searching for is the leftmost key in the tree at
- * this level, then a binary search will do the wrong thing. Splits
- * are pretty infrequent, so the cost isn't as bad as it could be.
- */
-
- found = false;
- for (offnum = P_FIRSTKEY;
- offnum <= maxoff;
- offnum = OffsetNumberNext(offnum))
- {
- chkitem = (BTItem) PageGetItem(page, PageGetItemId(page, offnum));
-
- if (BTItemSame(chkitem, afteritem))
- {
- found = true;
- break;
- }
- }
-
- return found;
-}
-
-/*
- * _bt_tuplecompare() -- compare two IndexTuples,
- * return -1, 0, or +1
- *
- */
-static int32
-_bt_tuplecompare(Relation rel,
- Size keysz,
- ScanKey scankey,
- IndexTuple tuple1,
- IndexTuple tuple2)
+ OffsetNumber itup_off,
+ const char *where)
{
- TupleDesc tupDes;
- int i;
- int32 compare = 0;
+ BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+ BTItemData truncitem;
- tupDes = RelationGetDescr(rel);
-
- for (i = 1; i <= (int) keysz; i++)
- {
- ScanKey entry = &scankey[i - 1];
- Datum attrDatum1,
- attrDatum2;
- bool isFirstNull,
- isSecondNull;
-
- attrDatum1 = index_getattr(tuple1, i, tupDes, &isFirstNull);
- attrDatum2 = index_getattr(tuple2, i, tupDes, &isSecondNull);
-
- /* see comments about NULLs handling in btbuild */
- if (isFirstNull) /* attr in tuple1 is NULL */
- {
- if (isSecondNull) /* attr in tuple2 is NULL too */
- compare = 0;
- else
- compare = 1; /* NULL ">" not-NULL */
- }
- else if (isSecondNull) /* attr in tuple1 is NOT_NULL and */
- { /* attr in tuple2 is NULL */
- compare = -1; /* not-NULL "<" NULL */
- }
- else
- {
- compare = DatumGetInt32(FunctionCall2(&entry->sk_func,
- attrDatum1, attrDatum2));
- }
-
- if (compare != 0)
- break; /* done when we find unequal attributes */
- }
-
- return compare;
-}
-
-/*
- * _bt_itemcmp() -- compare two BTItems using a requested
- * strategy (<, <=, =, >=, >)
- *
- */
-bool
-_bt_itemcmp(Relation rel,
- Size keysz,
- ScanKey scankey,
- BTItem item1,
- BTItem item2,
- StrategyNumber strat)
-{
- int32 compare;
-
- compare = _bt_tuplecompare(rel, keysz, scankey,
- &(item1->bti_itup),
- &(item2->bti_itup));
-
- switch (strat)
+ if (! P_ISLEAF(opaque) && itup_off == P_FIRSTDATAKEY(opaque))
{
- case BTLessStrategyNumber:
- return (bool) (compare < 0);
- case BTLessEqualStrategyNumber:
- return (bool) (compare <= 0);
- case BTEqualStrategyNumber:
- return (bool) (compare == 0);
- case BTGreaterEqualStrategyNumber:
- return (bool) (compare >= 0);
- case BTGreaterStrategyNumber:
- return (bool) (compare > 0);
+ memcpy(&truncitem, btitem, sizeof(BTItemData));
+ truncitem.bti_itup.t_info = sizeof(BTItemData);
+ btitem = &truncitem;
+ itemsize = sizeof(BTItemData);
}
- elog(ERROR, "_bt_itemcmp: bogus strategy %d", (int) strat);
- return false;
-}
-
-/*
- * _bt_updateitem() -- updates the key of the item identified by the
- * oid with the key of newItem (done in place if
- * possible)
- *
- */
-static void
-_bt_updateitem(Relation rel,
- Size keysz,
- Buffer buf,
- BTItem oldItem,
- BTItem newItem)
-{
- Page page;
- OffsetNumber maxoff;
- OffsetNumber i;
- ItemPointerData itemPtrData;
- BTItem item;
- IndexTuple oldIndexTuple,
- newIndexTuple;
- int first;
-
- page = BufferGetPage(buf);
- maxoff = PageGetMaxOffsetNumber(page);
-
- /* locate item on the page */
- first = P_RIGHTMOST((BTPageOpaque) PageGetSpecialPointer(page))
- ? P_HIKEY : P_FIRSTKEY;
- i = first;
- do
- {
- item = (BTItem) PageGetItem(page, PageGetItemId(page, i));
- i = OffsetNumberNext(i);
- } while (i <= maxoff && !BTItemSame(item, oldItem));
-
- /* this should never happen (in theory) */
- if (!BTItemSame(item, oldItem))
- elog(FATAL, "_bt_getstackbuf was lying!!");
-
- /*
- * It's defined by caller (_bt_insertonpg)
- */
-
- /*
- * if(IndexTupleDSize(newItem->bti_itup) >
- * IndexTupleDSize(item->bti_itup)) { elog(NOTICE, "trying to
- * overwrite a smaller value with a bigger one in _bt_updateitem");
- * elog(ERROR, "this is not good."); }
- */
-
- oldIndexTuple = &(item->bti_itup);
- newIndexTuple = &(newItem->bti_itup);
-
- /* keep the original item pointer */
- ItemPointerCopy(&(oldIndexTuple->t_tid), &itemPtrData);
- CopyIndexTuple(newIndexTuple, &oldIndexTuple);
- ItemPointerCopy(&itemPtrData, &(oldIndexTuple->t_tid));
-
+ if (PageAddItem(page, (Item) btitem, itemsize, itup_off,
+ LP_USED) == InvalidOffsetNumber)
+ elog(FATAL, "btree: failed to add item to the %s for %s",
+ where, RelationGetRelationName(rel));
}
/*
* _bt_isequal - used in _bt_doinsert in check for duplicates.
*
+ * This is very similar to _bt_compare, except for NULL handling.
* Rule is simple: NOT_NULL is not equal to NULL, and NULL is not equal to NULL either.
*/
static bool
_bt_isequal(TupleDesc itupdesc, Page page, OffsetNumber offnum,
int keysz, ScanKey scankey)
{
- Datum datum;
BTItem btitem;
IndexTuple itup;
- ScanKey entry;
- AttrNumber attno;
- int32 result;
int i;
- bool null;
+
+ /* Better be comparing to a leaf item */
+ Assert(P_ISLEAF((BTPageOpaque) PageGetSpecialPointer(page)));
btitem = (BTItem) PageGetItem(page, PageGetItemId(page, offnum));
itup = &(btitem->bti_itup);
for (i = 1; i <= keysz; i++)
{
- entry = &scankey[i - 1];
+ ScanKey entry = &scankey[i - 1];
+ AttrNumber attno;
+ Datum datum;
+ bool isNull;
+ int32 result;
+
attno = entry->sk_attno;
Assert(attno == i);
- datum = index_getattr(itup, attno, itupdesc, &null);
+ datum = index_getattr(itup, attno, itupdesc, &isNull);
- /* NULLs are not equal */
- if (entry->sk_flags & SK_ISNULL || null)
+ /* NULLs are never equal to anything */
+ if (entry->sk_flags & SK_ISNULL || isNull)
return false;
result = DatumGetInt32(FunctionCall2(&entry->sk_func,
- entry->sk_argument, datum));
+ entry->sk_argument,
+ datum));
+
if (result != 0)
return false;
}
- /* by here, the keys are equal */
+ /* if we get here, the keys are equal */
return true;
}
-
-#ifdef NOT_USED
-/*
- * _bt_shift - insert btitem on the passed page after shifting page
- * to the right in the tree.
- *
- * NOTE: tested for shifting leftmost page only, having btitem < hikey.
- */
-static InsertIndexResult
-_bt_shift(Relation rel, Buffer buf, BTStack stack, int keysz,
- ScanKey scankey, BTItem btitem, BTItem hikey)
-{
- InsertIndexResult res;
- int itemsz;
- Page page;
- BlockNumber bknum;
- BTPageOpaque pageop;
- Buffer rbuf;
- Page rpage;
- BTPageOpaque rpageop;
- Buffer pbuf;
- Page ppage;
- BTPageOpaque ppageop;
- Buffer nbuf;
- Page npage;
- BTPageOpaque npageop;
- BlockNumber nbknum;
- BTItem nitem;
- OffsetNumber afteroff;
-
- btitem = _bt_formitem(&(btitem->bti_itup));
- hikey = _bt_formitem(&(hikey->bti_itup));
-
- page = BufferGetPage(buf);
-
- /* grab new page */
- nbuf = _bt_getbuf(rel, P_NEW, BT_WRITE);
- nbknum = BufferGetBlockNumber(nbuf);
- npage = BufferGetPage(nbuf);
- _bt_pageinit(npage, BufferGetPageSize(nbuf));
- npageop = (BTPageOpaque) PageGetSpecialPointer(npage);
-
- /* copy content of the passed page */
- memmove((char *) npage, (char *) page, BufferGetPageSize(buf));
-
- /* re-init old (passed) page */
- _bt_pageinit(page, BufferGetPageSize(buf));
- pageop = (BTPageOpaque) PageGetSpecialPointer(page);
-
- /* init old page opaque */
- pageop->btpo_flags = npageop->btpo_flags; /* restore flags */
- pageop->btpo_flags &= ~BTP_CHAIN;
- if (_bt_itemcmp(rel, keysz, scankey, hikey, btitem, BTEqualStrategyNumber))
- pageop->btpo_flags |= BTP_CHAIN;
- pageop->btpo_prev = npageop->btpo_prev; /* restore prev */
- pageop->btpo_next = nbknum; /* next points to the new page */
- pageop->btpo_parent = npageop->btpo_parent;
-
- /* init shifted page opaque */
- npageop->btpo_prev = bknum = BufferGetBlockNumber(buf);
-
- /* shifted page is ok, populate old page */
-
- /* add passed hikey */
- itemsz = IndexTupleDSize(hikey->bti_itup)
- + (sizeof(BTItemData) - sizeof(IndexTupleData));
- itemsz = MAXALIGN(itemsz);
- if (PageAddItem(page, (Item) hikey, itemsz, P_HIKEY, LP_USED) == InvalidOffsetNumber)
- elog(FATAL, "btree: failed to add hikey in _bt_shift");
- pfree(hikey);
-
- /* add btitem */
- itemsz = IndexTupleDSize(btitem->bti_itup)
- + (sizeof(BTItemData) - sizeof(IndexTupleData));
- itemsz = MAXALIGN(itemsz);
- if (PageAddItem(page, (Item) btitem, itemsz, P_FIRSTKEY, LP_USED) == InvalidOffsetNumber)
- elog(FATAL, "btree: failed to add firstkey in _bt_shift");
- pfree(btitem);
- nitem = (BTItem) PageGetItem(page, PageGetItemId(page, P_FIRSTKEY));
- btitem = _bt_formitem(&(nitem->bti_itup));
- ItemPointerSet(&(btitem->bti_itup.t_tid), bknum, P_HIKEY);
-
- /* ok, write them out */
- _bt_wrtnorelbuf(rel, nbuf);
- _bt_wrtnorelbuf(rel, buf);
-
- /* fix btpo_prev on right sibling of old page */
- if (!P_RIGHTMOST(npageop))
- {
- rbuf = _bt_getbuf(rel, npageop->btpo_next, BT_WRITE);
- rpage = BufferGetPage(rbuf);
- rpageop = (BTPageOpaque) PageGetSpecialPointer(rpage);
- rpageop->btpo_prev = nbknum;
- _bt_wrtbuf(rel, rbuf);
- }
-
- /* get parent pointing to the old page */
- ItemPointerSet(&(stack->bts_btitem->bti_itup.t_tid),
- bknum, P_HIKEY);
- pbuf = _bt_getstackbuf(rel, stack, BT_WRITE);
- ppage = BufferGetPage(pbuf);
- ppageop = (BTPageOpaque) PageGetSpecialPointer(ppage);
-
- _bt_relbuf(rel, nbuf, BT_WRITE);
- _bt_relbuf(rel, buf, BT_WRITE);
-
- /* re-set parent' pointer - we shifted our page to the right ! */
- nitem = (BTItem) PageGetItem(ppage,
- PageGetItemId(ppage, stack->bts_offset));
- ItemPointerSet(&(nitem->bti_itup.t_tid), nbknum, P_HIKEY);
- ItemPointerSet(&(stack->bts_btitem->bti_itup.t_tid), nbknum, P_HIKEY);
- _bt_wrtnorelbuf(rel, pbuf);
-
- /*
- * Now we want insert into the parent pointer to our old page. It has
- * to be inserted before the pointer to new page. You may get problems
- * here (in the _bt_goesonpg and/or _bt_pgaddtup), but may be not - I
- * don't know. It works if old page is leftmost (nitem is NULL) and
- * btitem < hikey and it's all what we need currently. - vadim
- * 05/30/97
- */
- nitem = NULL;
- afteroff = P_FIRSTKEY;
- if (!P_RIGHTMOST(ppageop))
- afteroff = OffsetNumberNext(afteroff);
- if (stack->bts_offset >= afteroff)
- {
- afteroff = OffsetNumberPrev(stack->bts_offset);
- nitem = (BTItem) PageGetItem(ppage, PageGetItemId(ppage, afteroff));
- nitem = _bt_formitem(&(nitem->bti_itup));
- }
- res = _bt_insertonpg(rel, pbuf, stack->bts_parent,
- keysz, scankey, btitem, nitem);
- pfree(btitem);
-
- ItemPointerSet(&(res->pointerData), nbknum, P_HIKEY);
-
- return res;
-}
-
-#endif
*
*
* IDENTIFICATION
- * $Header: /cvsroot/pgsql/src/backend/access/nbtree/nbtpage.c,v 1.36 2000/04/12 17:14:49 momjian Exp $
+ * $Header: /cvsroot/pgsql/src/backend/access/nbtree/nbtpage.c,v 1.37 2000/07/21 06:42:32 tgl Exp $
*
* NOTES
* Postgres btree pages look like ordinary relation pages. The opaque
metad.btm_version = BTREE_VERSION;
metad.btm_root = P_NONE;
metad.btm_level = 0;
- memmove((char *) BTPageGetMeta(pg), (char *) &metad, sizeof(metad));
+ memcpy((char *) BTPageGetMeta(pg), (char *) &metad, sizeof(metad));
op = (BTPageOpaque) PageGetSpecialPointer(pg);
op->btpo_flags = BTP_META;
UnlockRelation(rel, AccessExclusiveLock);
}
-#ifdef NOT_USED
-/*
- * _bt_checkmeta() -- Verify that the metadata stored in a btree are
- * reasonable.
- */
-void
-_bt_checkmeta(Relation rel)
-{
- Buffer metabuf;
- Page metap;
- BTMetaPageData *metad;
- BTPageOpaque op;
- int nblocks;
-
- /* if the relation is empty, this is init time; don't complain */
- if ((nblocks = RelationGetNumberOfBlocks(rel)) == 0)
- return;
-
- metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_READ);
- metap = BufferGetPage(metabuf);
- op = (BTPageOpaque) PageGetSpecialPointer(metap);
- if (!(op->btpo_flags & BTP_META))
- {
- elog(ERROR, "Invalid metapage for index %s",
- RelationGetRelationName(rel));
- }
- metad = BTPageGetMeta(metap);
-
- if (metad->btm_magic != BTREE_MAGIC)
- {
- elog(ERROR, "Index %s is not a btree",
- RelationGetRelationName(rel));
- }
-
- if (metad->btm_version != BTREE_VERSION)
- {
- elog(ERROR, "Version mismatch on %s: version %d file, version %d code",
- RelationGetRelationName(rel),
- metad->btm_version, BTREE_VERSION);
- }
-
- _bt_relbuf(rel, metabuf, BT_READ);
-}
-
-#endif
-
/*
* _bt_getroot() -- Get the root page of the btree.
*
* standard class of race conditions exists here; I think I covered
* them all in the Hopi Indian rain dance of lock requests below.
*
- * We pass in the access type (BT_READ or BT_WRITE), and return the
- * root page's buffer with the appropriate lock type set. Reference
- * count on the root page gets bumped by ReadBuffer. The metadata
- * page is unlocked and unreferenced by this process when this routine
- * returns.
+ * The access type parameter (BT_READ or BT_WRITE) controls whether
+ * a new root page will be created or not. If access = BT_READ,
+ * and no root page exists, we just return InvalidBuffer. For
+ * BT_WRITE, we try to create the root page if it doesn't exist.
+ * NOTE that the returned root page will have only a read lock set
+ * on it even if access = BT_WRITE!
+ *
+ * On successful return, the root page is pinned and read-locked.
+ * The metadata page is not locked or pinned on exit.
*/
Buffer
_bt_getroot(Relation rel, int access)
metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_READ);
metapg = BufferGetPage(metabuf);
metaopaque = (BTPageOpaque) PageGetSpecialPointer(metapg);
- Assert(metaopaque->btpo_flags & BTP_META);
metad = BTPageGetMeta(metapg);
- if (metad->btm_magic != BTREE_MAGIC)
- {
+ if (!(metaopaque->btpo_flags & BTP_META) ||
+ metad->btm_magic != BTREE_MAGIC)
elog(ERROR, "Index %s is not a btree",
RelationGetRelationName(rel));
- }
if (metad->btm_version != BTREE_VERSION)
- {
- elog(ERROR, "Version mismatch on %s: version %d file, version %d code",
+ elog(ERROR, "Version mismatch on %s: version %d file, version %d code",
RelationGetRelationName(rel),
metad->btm_version, BTREE_VERSION);
- }
/* if no root page initialized yet, do it */
if (metad->btm_root == P_NONE)
{
+ /* If access = BT_READ, caller doesn't want us to create root yet */
+ if (access == BT_READ)
+ {
+ _bt_relbuf(rel, metabuf, BT_READ);
+ return InvalidBuffer;
+ }
- /* turn our read lock in for a write lock */
- _bt_relbuf(rel, metabuf, BT_READ);
- metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_WRITE);
- metapg = BufferGetPage(metabuf);
- metaopaque = (BTPageOpaque) PageGetSpecialPointer(metapg);
- Assert(metaopaque->btpo_flags & BTP_META);
- metad = BTPageGetMeta(metapg);
+ /* trade in our read lock for a write lock */
+ LockBuffer(metabuf, BUFFER_LOCK_UNLOCK);
+ LockBuffer(metabuf, BT_WRITE);
/*
* Race condition: if someone else initialized the metadata
* between the time we released the read lock and acquired the
- * write lock, above, we want to avoid doing it again.
+ * write lock, above, we must avoid doing it again.
*/
-
if (metad->btm_root == P_NONE)
{
/*
* Get, initialize, write, and leave a lock of the appropriate
* type on the new root page. Since this is the first page in
- * the tree, it's a leaf.
+ * the tree, it's a leaf as well as the root.
*/
-
rootbuf = _bt_getbuf(rel, P_NEW, BT_WRITE);
rootblkno = BufferGetBlockNumber(rootbuf);
rootpg = BufferGetPage(rootbuf);
+
metad->btm_root = rootblkno;
metad->btm_level = 1;
+
_bt_pageinit(rootpg, BufferGetPageSize(rootbuf));
rootopaque = (BTPageOpaque) PageGetSpecialPointer(rootpg);
rootopaque->btpo_flags |= (BTP_LEAF | BTP_ROOT);
_bt_wrtnorelbuf(rel, rootbuf);
- /* swap write lock for read lock, if appropriate */
- if (access != BT_WRITE)
- {
- LockBuffer(rootbuf, BUFFER_LOCK_UNLOCK);
- LockBuffer(rootbuf, BT_READ);
- }
+ /* swap write lock for read lock */
+ LockBuffer(rootbuf, BUFFER_LOCK_UNLOCK);
+ LockBuffer(rootbuf, BT_READ);
- /* okay, metadata is correct */
+ /* okay, metadata is correct, write and release it */
_bt_wrtbuf(rel, metabuf);
}
else
{
-
/*
* Metadata initialized by someone else. In order to
* guarantee no deadlocks, we have to release the metadata
* page and start all over again.
*/
-
_bt_relbuf(rel, metabuf, BT_WRITE);
return _bt_getroot(rel, access);
}
rootblkno = metad->btm_root;
_bt_relbuf(rel, metabuf, BT_READ); /* done with the meta page */
- rootbuf = _bt_getbuf(rel, rootblkno, access);
+ rootbuf = _bt_getbuf(rel, rootblkno, BT_READ);
}
/*
* Race condition: If the root page split between the time we looked
* at the metadata page and got the root buffer, then we got the wrong
- * buffer.
+ * buffer. Release it and try again.
*/
-
rootpg = BufferGetPage(rootbuf);
rootopaque = (BTPageOpaque) PageGetSpecialPointer(rootpg);
- if (!(rootopaque->btpo_flags & BTP_ROOT))
- {
+ if (! P_ISROOT(rootopaque))
+ {
/* it happened, try again */
- _bt_relbuf(rel, rootbuf, access);
+ _bt_relbuf(rel, rootbuf, BT_READ);
return _bt_getroot(rel, access);
}
* count is correct, and we have no lock set on the metadata page.
* Return the root block.
*/
-
return rootbuf;
}
* _bt_getbuf() -- Get a buffer by block number for read or write.
*
* When this routine returns, the appropriate lock is set on the
- * requested buffer its reference count is correct.
+ * requested buffer and its reference count has been incremented
+ * (ie, the buffer is "locked and pinned").
*/
Buffer
_bt_getbuf(Relation rel, BlockNumber blkno, int access)
{
Buffer buf;
- Page page;
if (blkno != P_NEW)
{
+ /* Read an existing block of the relation */
buf = ReadBuffer(rel, blkno);
LockBuffer(buf, access);
}
else
{
+ Page page;
/*
- * Extend bufmgr code is unclean and so we have to use locking
+ * Extend the relation by one page.
+ *
+ * The bufmgr's extend code is unclean, so we have to use extra locking
* here.
*/
LockPage(rel, 0, ExclusiveLock);
buf = ReadBuffer(rel, blkno);
+ LockBuffer(buf, access);
UnlockPage(rel, 0, ExclusiveLock);
- blkno = BufferGetBlockNumber(buf);
+
+ /* Initialize the new page before returning it */
page = BufferGetPage(buf);
_bt_pageinit(page, BufferGetPageSize(buf));
- LockBuffer(buf, access);
}
/* ref count and lock type are correct */
/*
* _bt_relbuf() -- release a locked buffer.
+ *
+ * Lock and pin (refcount) are both dropped.
*/
void
_bt_relbuf(Relation rel, Buffer buf, int access)
/*
* _bt_wrtbuf() -- write a btree page to disk.
*
- * This routine releases the lock held on the buffer and our reference
- * to it. It is an error to call _bt_wrtbuf() without a write lock
- * or a reference to the buffer.
+ * This routine releases the lock held on the buffer and our refcount
+ * for it. It is an error to call _bt_wrtbuf() without a write lock
+ * and a pin on the buffer.
+ *
+ * NOTE: actually, the buffer manager just marks the shared buffer page
+ * dirty here; the real I/O happens later. Since we can't persuade the
+ * Unix kernel to schedule disk writes in a particular order, there's not
+ * much point in worrying about this. The most we can say is that all the
+ * writes will occur before commit.
*/
void
_bt_wrtbuf(Relation rel, Buffer buf)
* our reference or lock.
*
* It is an error to call _bt_wrtnorelbuf() without a write lock
- * or a reference to the buffer.
+ * and a pin on the buffer.
+ *
+ * See above NOTE.
*/
void
_bt_wrtnorelbuf(Relation rel, Buffer buf)
* we split the root page, we record the new parent in the metadata page
* for the relation. This routine does the work.
*
- * No direct preconditions, but if you don't have the a write lock on
+ * No direct preconditions, but if you don't have the write lock on
* at least the old root page when you call this, you're making a big
* mistake. On exit, metapage data is correct and we no longer have
- * a reference to or lock on the metapage.
+ * a pin or lock on the metapage.
*/
void
_bt_metaproot(Relation rel, BlockNumber rootbknum, int level)
}
/*
- * _bt_getstackbuf() -- Walk back up the tree one step, and find the item
- * we last looked at in the parent.
- *
- * This is possible because we save a bit image of the last item
- * we looked at in the parent, and the update algorithm guarantees
- * that if items above us in the tree move, they only move right.
- *
- * Also, re-set bts_blkno & bts_offset if changed and
- * bts_btitem (it may be changed - see _bt_insertonpg).
+ * Delete an item from a btree. It had better be a leaf item...
*/
-Buffer
-_bt_getstackbuf(Relation rel, BTStack stack, int access)
-{
- Buffer buf;
- BlockNumber blkno;
- OffsetNumber start,
- offnum,
- maxoff;
- OffsetNumber i;
- Page page;
- ItemId itemid;
- BTItem item;
- BTPageOpaque opaque;
- BTItem item_save;
- int item_nbytes;
-
- blkno = stack->bts_blkno;
- buf = _bt_getbuf(rel, blkno, access);
- page = BufferGetPage(buf);
- opaque = (BTPageOpaque) PageGetSpecialPointer(page);
- maxoff = PageGetMaxOffsetNumber(page);
-
- if (stack->bts_offset == InvalidOffsetNumber ||
- maxoff >= stack->bts_offset)
- {
-
- /*
- * _bt_insertonpg set bts_offset to InvalidOffsetNumber in the
- * case of concurrent ROOT page split
- */
- if (stack->bts_offset == InvalidOffsetNumber)
- i = P_RIGHTMOST(opaque) ? P_HIKEY : P_FIRSTKEY;
- else
- {
- itemid = PageGetItemId(page, stack->bts_offset);
- item = (BTItem) PageGetItem(page, itemid);
-
- /* if the item is where we left it, we're done */
- if (BTItemSame(item, stack->bts_btitem))
- {
- pfree(stack->bts_btitem);
- item_nbytes = ItemIdGetLength(itemid);
- item_save = (BTItem) palloc(item_nbytes);
- memmove((char *) item_save, (char *) item, item_nbytes);
- stack->bts_btitem = item_save;
- return buf;
- }
- i = OffsetNumberNext(stack->bts_offset);
- }
-
- /* if the item has just moved right on this page, we're done */
- for (;
- i <= maxoff;
- i = OffsetNumberNext(i))
- {
- itemid = PageGetItemId(page, i);
- item = (BTItem) PageGetItem(page, itemid);
-
- /* if the item is where we left it, we're done */
- if (BTItemSame(item, stack->bts_btitem))
- {
- stack->bts_offset = i;
- pfree(stack->bts_btitem);
- item_nbytes = ItemIdGetLength(itemid);
- item_save = (BTItem) palloc(item_nbytes);
- memmove((char *) item_save, (char *) item, item_nbytes);
- stack->bts_btitem = item_save;
- return buf;
- }
- }
- }
-
- /* by here, the item we're looking for moved right at least one page */
- for (;;)
- {
- blkno = opaque->btpo_next;
- if (P_RIGHTMOST(opaque))
- elog(FATAL, "my bits moved right off the end of the world!\
-\n\tRecreate index %s.", RelationGetRelationName(rel));
-
- _bt_relbuf(rel, buf, access);
- buf = _bt_getbuf(rel, blkno, access);
- page = BufferGetPage(buf);
- maxoff = PageGetMaxOffsetNumber(page);
- opaque = (BTPageOpaque) PageGetSpecialPointer(page);
-
- /* if we have a right sibling, step over the high key */
- start = P_RIGHTMOST(opaque) ? P_HIKEY : P_FIRSTKEY;
-
- /* see if it's on this page */
- for (offnum = start;
- offnum <= maxoff;
- offnum = OffsetNumberNext(offnum))
- {
- itemid = PageGetItemId(page, offnum);
- item = (BTItem) PageGetItem(page, itemid);
- if (BTItemSame(item, stack->bts_btitem))
- {
- stack->bts_offset = offnum;
- stack->bts_blkno = blkno;
- pfree(stack->bts_btitem);
- item_nbytes = ItemIdGetLength(itemid);
- item_save = (BTItem) palloc(item_nbytes);
- memmove((char *) item_save, (char *) item, item_nbytes);
- stack->bts_btitem = item_save;
- return buf;
- }
- }
- }
-}
-
void
_bt_pagedel(Relation rel, ItemPointer tid)
{
* Portions Copyright (c) 1994, Regents of the University of California
*
* IDENTIFICATION
- * $Header: /cvsroot/pgsql/src/backend/access/nbtree/nbtree.c,v 1.61 2000/07/14 22:17:33 tgl Exp $
+ * $Header: /cvsroot/pgsql/src/backend/access/nbtree/nbtree.c,v 1.62 2000/07/21 06:42:32 tgl Exp $
*
*-------------------------------------------------------------------------
*/
#include "executor/executor.h"
#include "miscadmin.h"
+
bool BuildingBtree = false; /* see comment in btbuild() */
bool FastBuild = true; /* use sort/build instead of insertion
* build */
* btree pages - NULLs greater NOT_NULLs and NULL = NULL is TRUE.
* Sure, it's just rule for placing/finding items and no more -
* keytest'll return FALSE for a = 5 for items having 'a' isNULL.
- * Look at _bt_skeycmp, _bt_compare and _bt_itemcmp for how it
- * works. - vadim 03/23/97
+ * Look at _bt_compare for how it works.
+ * - vadim 03/23/97
*
* if (itup->t_info & INDEX_NULL_MASK) { pfree(itup); continue; }
*/
/* generate an index tuple */
itup = index_formtuple(RelationGetDescr(rel), datum, nulls);
itup->t_tid = *ht_ctid;
-
- /*
- * See comments in btbuild.
- *
- * if (itup->t_info & INDEX_NULL_MASK)
- * PG_RETURN_POINTER((InsertIndexResult) NULL);
- */
-
btitem = _bt_formitem(itup);
res = _bt_doinsert(rel, btitem, rel->rd_uniqueindex, heapRel);
if (ItemPointerIsValid(&(scan->currentItemData)))
{
-
/*
* Restore scan position using heap TID returned by previous call
- * to btgettuple(). _bt_restscan() locks buffer.
+ * to btgettuple(). _bt_restscan() re-grabs the read lock on
+ * the buffer, too.
*/
_bt_restscan(scan);
res = _bt_next(scan, dir);
res = _bt_first(scan, dir);
/*
- * Save heap TID to use it in _bt_restscan. Unlock buffer before
- * leaving index !
+ * Save heap TID to use it in _bt_restscan. Then release the read
+ * lock on the buffer so that we aren't blocking other backends.
+ * NOTE: we do keep the pin on the buffer!
*/
if (res)
{
so = (BTScanOpaque) scan->opaque;
- /* we don't hold a read lock on the current page in the scan */
+ if (so == NULL) /* if called from btbeginscan */
+ {
+ so = (BTScanOpaque) palloc(sizeof(BTScanOpaqueData));
+ so->btso_curbuf = so->btso_mrkbuf = InvalidBuffer;
+ so->keyData = (ScanKey) NULL;
+ if (scan->numberOfKeys > 0)
+ so->keyData = (ScanKey) palloc(scan->numberOfKeys * sizeof(ScanKeyData));
+ scan->opaque = so;
+ scan->flags = 0x0;
+ }
+
+ /* we aren't holding any read locks, but gotta drop the pins */
if (ItemPointerIsValid(iptr = &(scan->currentItemData)))
{
ReleaseBuffer(so->btso_curbuf);
ItemPointerSetInvalid(iptr);
}
- /* and we don't hold a read lock on the last marked item in the scan */
if (ItemPointerIsValid(iptr = &(scan->currentMarkData)))
{
ReleaseBuffer(so->btso_mrkbuf);
ItemPointerSetInvalid(iptr);
}
- if (so == NULL) /* if called from btbeginscan */
- {
- so = (BTScanOpaque) palloc(sizeof(BTScanOpaqueData));
- so->btso_curbuf = so->btso_mrkbuf = InvalidBuffer;
- so->keyData = (ScanKey) NULL;
- if (scan->numberOfKeys > 0)
- so->keyData = (ScanKey) palloc(scan->numberOfKeys * sizeof(ScanKeyData));
- scan->opaque = so;
- scan->flags = 0x0;
- }
-
/*
* Reset the scan keys. Note that keys ordering stuff moved to
* _bt_first. - vadim 05/05/97
so = (BTScanOpaque) scan->opaque;
- /* we don't hold a read lock on the current page in the scan */
+ /* we aren't holding any read locks, but gotta drop the pin */
if (ItemPointerIsValid(iptr = &(scan->currentItemData)))
{
ReleaseBuffer(so->btso_curbuf);
ItemPointerSetInvalid(iptr);
}
-/* scan->keyData[0].sk_argument = v; */
so->keyData[0].sk_argument = v;
}
so = (BTScanOpaque) scan->opaque;
- /* we don't hold any read locks */
+ /* we aren't holding any read locks, but gotta drop the pins */
if (ItemPointerIsValid(iptr = &(scan->currentItemData)))
{
if (BufferIsValid(so->btso_curbuf))
so = (BTScanOpaque) scan->opaque;
- /* we don't hold any read locks */
+ /* we aren't holding any read locks, but gotta drop the pin */
if (ItemPointerIsValid(iptr = &(scan->currentMarkData)))
{
ReleaseBuffer(so->btso_mrkbuf);
ItemPointerSetInvalid(iptr);
}
- /* bump pin on current buffer */
+ /* bump pin on current buffer for assignment to mark buffer */
if (ItemPointerIsValid(&(scan->currentItemData)))
{
so->btso_mrkbuf = ReadBuffer(scan->relation,
so = (BTScanOpaque) scan->opaque;
- /* we don't hold any read locks */
+ /* we aren't holding any read locks, but gotta drop the pin */
if (ItemPointerIsValid(iptr = &(scan->currentItemData)))
{
ReleaseBuffer(so->btso_curbuf);
{
so->btso_curbuf = ReadBuffer(scan->relation,
BufferGetBlockNumber(so->btso_mrkbuf));
-
scan->currentItemData = scan->currentMarkData;
so->curHeapIptr = so->mrkHeapIptr;
}
PG_RETURN_VOID();
}
+/*
+ * Restore scan position when btgettuple is called to continue a scan.
+ */
static void
_bt_restscan(IndexScanDesc scan)
{
BTItem item;
BlockNumber blkno;
- LockBuffer(buf, BT_READ); /* lock buffer first! */
+ /*
+ * Get back the read lock we were holding on the buffer.
+ * (We still have a reference-count pin on it, though.)
+ */
+ LockBuffer(buf, BT_READ);
+
page = BufferGetPage(buf);
maxoff = PageGetMaxOffsetNumber(page);
opaque = (BTPageOpaque) PageGetSpecialPointer(page);
*/
if (!ItemPointerIsValid(&target))
{
- ItemPointerSetOffsetNumber(&(scan->currentItemData),
- OffsetNumberPrev(P_RIGHTMOST(opaque) ? P_HIKEY : P_FIRSTKEY));
+ ItemPointerSetOffsetNumber(current,
+ OffsetNumberPrev(P_FIRSTDATAKEY(opaque)));
return;
}
- if (maxoff >= offnum)
+ /*
+ * The item we were on may have moved right due to insertions.
+ * Find it again.
+ */
+ for (;;)
{
-
- /*
- * if the item is where we left it or has just moved right on this
- * page, we're done
- */
+ /* Check for item on this page */
for (;
offnum <= maxoff;
offnum = OffsetNumberNext(offnum))
{
item = (BTItem) PageGetItem(page, PageGetItemId(page, offnum));
- if (item->bti_itup.t_tid.ip_blkid.bi_hi == \
- target.ip_blkid.bi_hi && \
- item->bti_itup.t_tid.ip_blkid.bi_lo == \
- target.ip_blkid.bi_lo && \
+ if (item->bti_itup.t_tid.ip_blkid.bi_hi ==
+ target.ip_blkid.bi_hi &&
+ item->bti_itup.t_tid.ip_blkid.bi_lo ==
+ target.ip_blkid.bi_lo &&
item->bti_itup.t_tid.ip_posid == target.ip_posid)
{
current->ip_posid = offnum;
return;
}
}
- }
- /*
- * By here, the item we're looking for moved right at least one page
- */
- for (;;)
- {
+ /*
+ * By here, the item we're looking for moved right at least one page
+ */
if (P_RIGHTMOST(opaque))
- elog(FATAL, "_bt_restscan: my bits moved right off the end of the world!\
-\n\tRecreate index %s.", RelationGetRelationName(rel));
+ elog(FATAL, "_bt_restscan: my bits moved right off the end of the world!"
+ "\n\tRecreate index %s.", RelationGetRelationName(rel));
blkno = opaque->btpo_next;
_bt_relbuf(rel, buf, BT_READ);
 buf = _bt_getbuf(rel, blkno, BT_READ);
page = BufferGetPage(buf);
maxoff = PageGetMaxOffsetNumber(page);
opaque = (BTPageOpaque) PageGetSpecialPointer(page);
-
- /* see if it's on this page */
- for (offnum = P_RIGHTMOST(opaque) ? P_HIKEY : P_FIRSTKEY;
- offnum <= maxoff;
- offnum = OffsetNumberNext(offnum))
- {
- item = (BTItem) PageGetItem(page, PageGetItemId(page, offnum));
- if (item->bti_itup.t_tid.ip_blkid.bi_hi == \
- target.ip_blkid.bi_hi && \
- item->bti_itup.t_tid.ip_blkid.bi_lo == \
- target.ip_blkid.bi_lo && \
- item->bti_itup.t_tid.ip_posid == target.ip_posid)
- {
- ItemPointerSet(current, blkno, offnum);
- so->btso_curbuf = buf;
- return;
- }
- }
+ offnum = P_FIRSTDATAKEY(opaque);
+ ItemPointerSet(current, blkno, offnum);
+ so->btso_curbuf = buf;
}
}
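The reworked _bt_restscan() loop above can be pictured with a toy model. The following is only a sketch under stated assumptions: RsPage and sketch_refind are hypothetical names, heap TIDs are reduced to plain ints, and all buffer locking and pinning is left out. It shows the shape of the new single loop: scan the rest of the current page, and if the target is not there, step to the right sibling and start again at its first data slot.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a leaf page: a few TIDs and a right link. */
typedef struct RsPage
{
    int            ntids;     /* number of valid entries in tids[] */
    int            tids[4];   /* heap TIDs, reduced to ints for the sketch */
    struct RsPage *next;      /* right sibling, NULL on the rightmost page */
} RsPage;

/*
 * Re-find 'target', starting at (*pageP, *offP) and moving right only.
 * Returns 1 and updates the position if found, 0 if we ran off the
 * rightmost page (the real code reports a FATAL error in that case).
 */
static int
sketch_refind(const RsPage **pageP, int *offP, int target)
{
    const RsPage *page = *pageP;
    int           off = *offP;

    assert(page != NULL);
    for (;;)
    {
        /* Check for the item on this page */
        for (; off < page->ntids; off++)
        {
            if (page->tids[off] == target)
            {
                *pageP = page;
                *offP = off;
                return 1;
            }
        }
        /* Not here: the item moved right at least one page */
        if (page->next == NULL)
            return 0;
        page = page->next;
        off = 0;              /* first data slot of the sibling */
    }
}
```

Because the search never looks left of the saved position, it only copes with items that moved right, which is what insertions and page splits produce.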
*
*
* IDENTIFICATION
- * $Header: /cvsroot/pgsql/src/backend/access/nbtree/Attic/nbtscan.c,v 1.31 2000/04/12 17:14:49 momjian Exp $
+ * $Header: /cvsroot/pgsql/src/backend/access/nbtree/Attic/nbtscan.c,v 1.32 2000/07/21 06:42:32 tgl Exp $
*
*
* NOTES
* Because we can be doing an index scan on a relation while we update
* it, we need to avoid missing data that moves around in the index.
- * The routines and global variables in this file guarantee that all
- * scans in the local address space stay correctly positioned. This
- * is all we need to worry about, since write locking guarantees that
- * no one else will be on the same page at the same time as we are.
+ * Insertions and page splits are no problem because _bt_restscan()
+ * can figure out where the current item moved to, but if a deletion
+ * happens at or before the current scan position, we'd better do
+ * something to stay in sync.
+ *
+ * The routines in this file handle the problem for deletions issued
+ * by the current backend. Currently, that's all we need, since
+ * deletions are only done by VACUUM and it gets an exclusive lock.
*
* The scheme is to manage a list of active scans in the current backend.
- * Whenever we add or remove records from an index, or whenever we
- * split a leaf page, we check the list of active scans to see if any
- * has been affected. A scan is affected only if it is on the same
- * relation, and the same page, as the update.
+ * Whenever we remove a record from an index, we check the list of active
+ * scans to see if any has been affected. A scan is affected only if it
+ * is on the same relation, and the same page, as the update.
*
*-------------------------------------------------------------------------
*/
/*
* _bt_adjscans() -- adjust all scans in the scan list to compensate
- * for a given deletion or insertion
+ * for a given deletion
*/
void
_bt_adjscans(Relation rel, ItemPointer tid)
{
page = BufferGetPage(buf);
opaque = (BTPageOpaque) PageGetSpecialPointer(page);
- start = P_RIGHTMOST(opaque) ? P_HIKEY : P_FIRSTKEY;
+ start = P_FIRSTDATAKEY(opaque);
if (ItemPointerGetOffsetNumber(current) == start)
ItemPointerSetInvalid(&(so->curHeapIptr));
else
*/
LockBuffer(buf, BT_READ);
_bt_step(scan, &buf, BackwardScanDirection);
- so->btso_curbuf = buf;
if (ItemPointerIsValid(current))
{
Page pg = BufferGetPage(buf);
&& ItemPointerGetBlockNumber(current) == blkno
&& ItemPointerGetOffsetNumber(current) >= offno)
{
-
page = BufferGetPage(so->btso_mrkbuf);
opaque = (BTPageOpaque) PageGetSpecialPointer(page);
- start = P_RIGHTMOST(opaque) ? P_HIKEY : P_FIRSTKEY;
+ start = P_FIRSTDATAKEY(opaque);
if (ItemPointerGetOffsetNumber(current) == start)
ItemPointerSetInvalid(&(so->mrkHeapIptr));
/*-------------------------------------------------------------------------
*
- * btsearch.c
+ * nbtsearch.c
* search code for postgres btrees.
*
+ *
* Portions Copyright (c) 1996-2000, PostgreSQL, Inc
* Portions Copyright (c) 1994, Regents of the University of California
*
- *
* IDENTIFICATION
- * $Header: /cvsroot/pgsql/src/backend/access/nbtree/nbtsearch.c,v 1.60 2000/05/30 04:24:33 tgl Exp $
+ * $Header: /cvsroot/pgsql/src/backend/access/nbtree/nbtsearch.c,v 1.61 2000/07/21 06:42:32 tgl Exp $
*
*-------------------------------------------------------------------------
*/
#include "access/nbtree.h"
+static RetrieveIndexResult _bt_endpoint(IndexScanDesc scan, ScanDirection dir);
-static BTStack _bt_searchr(Relation rel, int keysz, ScanKey scankey,
- Buffer *bufP, BTStack stack_in);
-static int32 _bt_compare(Relation rel, TupleDesc itupdesc, Page page,
- int keysz, ScanKey scankey, OffsetNumber offnum);
-static bool
- _bt_twostep(IndexScanDesc scan, Buffer *bufP, ScanDirection dir);
-static RetrieveIndexResult
- _bt_endpoint(IndexScanDesc scan, ScanDirection dir);
/*
- * _bt_search() -- Search for a scan key in the index.
+ * _bt_search() -- Search the tree for a particular scankey,
+ * or more precisely for the first leaf page it could be on.
+ *
+ * Return value is a stack of parent-page pointers. *bufP is set to the
+ * address of the leaf-page buffer, which is read-locked and pinned.
+ * No locks are held on the parent pages, however!
*
- * This routine is actually just a helper that sets things up and
- * calls a recursive-descent search routine on the tree.
+ * NOTE that the returned buffer is read-locked regardless of the access
+ * parameter. However, access = BT_WRITE will allow an empty root page
+ * to be created and returned. When access = BT_READ, an empty index
+ * will result in *bufP being set to InvalidBuffer.
*/
BTStack
-_bt_search(Relation rel, int keysz, ScanKey scankey, Buffer *bufP)
-{
- *bufP = _bt_getroot(rel, BT_READ);
- return _bt_searchr(rel, keysz, scankey, bufP, (BTStack) NULL);
-}
-
-/*
- * _bt_searchr() -- Search the tree recursively for a particular scankey.
- */
-static BTStack
-_bt_searchr(Relation rel,
- int keysz,
- ScanKey scankey,
- Buffer *bufP,
- BTStack stack_in)
+_bt_search(Relation rel, int keysz, ScanKey scankey,
+ Buffer *bufP, int access)
{
- BTStack stack;
- OffsetNumber offnum;
- Page page;
- BTPageOpaque opaque;
- BlockNumber par_blkno;
- BlockNumber blkno;
- ItemId itemid;
- BTItem btitem;
- BTItem item_save;
- int item_nbytes;
- IndexTuple itup;
+ BTStack stack_in = NULL;
- /* if this is a leaf page, we're done */
- page = BufferGetPage(*bufP);
- opaque = (BTPageOpaque) PageGetSpecialPointer(page);
- if (opaque->btpo_flags & BTP_LEAF)
- return stack_in;
+ /* Get the root page to start with */
+ *bufP = _bt_getroot(rel, access);
- /*
- * Find the appropriate item on the internal page, and get the child
- * page that it points to.
- */
+ /* If index is empty and access = BT_READ, no root page is created. */
+ if (! BufferIsValid(*bufP))
+ return (BTStack) NULL;
- par_blkno = BufferGetBlockNumber(*bufP);
- offnum = _bt_binsrch(rel, *bufP, keysz, scankey, BT_DESCENT);
- itemid = PageGetItemId(page, offnum);
- btitem = (BTItem) PageGetItem(page, itemid);
- itup = &(btitem->bti_itup);
- blkno = ItemPointerGetBlockNumber(&(itup->t_tid));
+ /* Loop iterates once per level descended in the tree */
+ for (;;)
+ {
+ Page page;
+ BTPageOpaque opaque;
+ OffsetNumber offnum;
+ ItemId itemid;
+ BTItem btitem;
+ IndexTuple itup;
+ BlockNumber blkno;
+ BlockNumber par_blkno;
+ BTStack new_stack;
+
+ /* if this is a leaf page, we're done */
+ page = BufferGetPage(*bufP);
+ opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+ if (P_ISLEAF(opaque))
+ break;
- /*
- * We need to save the bit image of the index entry we chose in the
- * parent page on a stack. In case we split the tree, we'll use this
- * bit image to figure out what our real parent page is, in case the
- * parent splits while we're working lower in the tree. See the paper
- * by Lehman and Yao for how this is detected and handled. (We use
- * unique OIDs to disambiguate duplicate keys in the index -- Lehman
- * and Yao disallow duplicate keys).
- */
+ /*
+ * Find the appropriate item on the internal page, and get the
+ * child page that it points to.
+ */
+ offnum = _bt_binsrch(rel, *bufP, keysz, scankey);
+ itemid = PageGetItemId(page, offnum);
+ btitem = (BTItem) PageGetItem(page, itemid);
+ itup = &(btitem->bti_itup);
+ blkno = ItemPointerGetBlockNumber(&(itup->t_tid));
+ par_blkno = BufferGetBlockNumber(*bufP);
- item_nbytes = ItemIdGetLength(itemid);
- item_save = (BTItem) palloc(item_nbytes);
- memmove((char *) item_save, (char *) btitem, item_nbytes);
- stack = (BTStack) palloc(sizeof(BTStackData));
- stack->bts_blkno = par_blkno;
- stack->bts_offset = offnum;
- stack->bts_btitem = item_save;
- stack->bts_parent = stack_in;
+ /*
+ * We need to save the bit image of the index entry we chose in the
+ * parent page on a stack. In case we split the tree, we'll use this
+ * bit image to figure out what our real parent page is, in case the
+ * parent splits while we're working lower in the tree. See the paper
+ * by Lehman and Yao for how this is detected and handled. (We use the
+ * child link to disambiguate duplicate keys in the index -- Lehman
+ * and Yao disallow duplicate keys.)
+ */
+ new_stack = (BTStack) palloc(sizeof(BTStackData));
+ new_stack->bts_blkno = par_blkno;
+ new_stack->bts_offset = offnum;
+ memcpy(&new_stack->bts_btitem, btitem, sizeof(BTItemData));
+ new_stack->bts_parent = stack_in;
- /* drop the read lock on the parent page and acquire one on the child */
- _bt_relbuf(rel, *bufP, BT_READ);
- *bufP = _bt_getbuf(rel, blkno, BT_READ);
+ /* drop the read lock on the parent page, acquire one on the child */
+ _bt_relbuf(rel, *bufP, BT_READ);
+ *bufP = _bt_getbuf(rel, blkno, BT_READ);
- /*
- * Race -- the page we just grabbed may have split since we read its
- * pointer in the parent. If it has, we may need to move right to its
- * new sibling. Do that.
- */
+ /*
+ * Race -- the page we just grabbed may have split since we read its
+ * pointer in the parent. If it has, we may need to move right to its
+ * new sibling. Do that.
+ */
+ *bufP = _bt_moveright(rel, *bufP, keysz, scankey, BT_READ);
- *bufP = _bt_moveright(rel, *bufP, keysz, scankey, BT_READ);
+ /* okay, all set to move down a level */
+ stack_in = new_stack;
+ }
- /* okay, all set to move down a level */
- return _bt_searchr(rel, keysz, scankey, bufP, stack);
+ return stack_in;
}
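As a rough illustration of the recursion-to-loop change in _bt_search() above, here is a minimal sketch on a toy in-memory tree. DsNode, DsStack, sketch_search and sketch_free_stack are hypothetical names, keys are plain ints, and no locking or move-right step is modelled. It only shows how the parent stack is built one entry per level while descending iteratively.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical toy node; keys[0] is ignored on internal nodes. */
typedef struct DsNode
{
    int            is_leaf;
    int            nkeys;
    int            keys[4];
    struct DsNode *child[4];
} DsNode;

/* One stack entry per level descended, linked toward the root. */
typedef struct DsStack
{
    const DsNode   *parent;   /* node we descended from */
    int             offset;   /* slot chosen in that node */
    struct DsStack *up;       /* entry for the level above, or NULL */
} DsStack;

static DsStack *
sketch_search(const DsNode *root, int scankey, const DsNode **leafP)
{
    DsStack      *stack_in = NULL;
    const DsNode *node = root;

    assert(root != NULL && leafP != NULL);

    /* Loop iterates once per level descended in the tree */
    while (!node->is_leaf)
    {
        DsStack *new_stack;
        int      off = 0;     /* slot 0 acts as "minus infinity" */

        /* last key < scankey (linear scan, for brevity) */
        for (int i = 1; i < node->nkeys; i++)
            if (node->keys[i] < scankey)
                off = i;

        new_stack = malloc(sizeof(DsStack));
        if (new_stack == NULL)
            abort();
        new_stack->parent = node;
        new_stack->offset = off;
        new_stack->up = stack_in;

        stack_in = new_stack;
        node = node->child[off];
    }

    *leafP = node;
    return stack_in;          /* NULL if the root is itself a leaf */
}

static void
sketch_free_stack(DsStack *stack)
{
    while (stack != NULL)
    {
        DsStack *up = stack->up;

        free(stack);
        stack = up;
    }
}
```

The caller owns the returned stack and releases it with sketch_free_stack() once it no longer needs the parent pointers.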
/*
*
* On entry, we have the buffer pinned and a lock of the proper type.
* If we move right, we release the buffer and lock and acquire the
- * same on the right sibling.
+ * same on the right sibling. Return value is the buffer we stop at.
*/
Buffer
_bt_moveright(Relation rel,
{
Page page;
BTPageOpaque opaque;
- ItemId hikey;
- BlockNumber rblkno;
- int natts = rel->rd_rel->relnatts;
page = BufferGetPage(buf);
opaque = (BTPageOpaque) PageGetSpecialPointer(page);
- /* if we're on a rightmost page, we don't need to move right */
- if (P_RIGHTMOST(opaque))
- return buf;
-
- /* by convention, item 0 on non-rightmost pages is the high key */
- hikey = PageGetItemId(page, P_HIKEY);
-
/*
- * If the scan key that brought us to this page is >= the high key
+ * If the scan key that brought us to this page is > the high key
* stored on the page, then the page has split and we need to move
- * right.
+ * right. (If the scan key is equal to the high key, we might or
+ * might not need to move right; have to scan the page first anyway.)
+ * It could even have split more than once, so scan as far as needed.
*/
-
- if (_bt_skeycmp(rel, keysz, scankey, page, hikey,
- BTGreaterEqualStrategyNumber))
+ while (!P_RIGHTMOST(opaque) &&
+ _bt_compare(rel, keysz, scankey, page, P_HIKEY) > 0)
{
- /* move right as long as we need to */
- do
- {
- OffsetNumber offmax = PageGetMaxOffsetNumber(page);
-
- /*
- * If this page consists of all duplicate keys (hikey and
- * first key on the page have the same value), then we don't
- * need to step right.
- *
- * NOTE for multi-column indices: we may do scan using keys not
- * for all attrs. But we handle duplicates using all attrs in
- * _bt_insert/_bt_spool code. And so we've to compare scankey
- * with _last_ item on this page to do not lose "good" tuples
- * if number of attrs > keysize. Example: (2,0) - last items
- * on this page, (2,1) - first item on next page (hikey), our
- * scankey is x = 2. Scankey == (2,1) because of we compare
- * first attrs only, but we shouldn't to move right of here. -
- * vadim 04/15/97
- *
- * Also, if this page is not LEAF one (and # of attrs > keysize)
- * then we can't move too. - vadim 10/22/97
- */
-
- if (_bt_skeycmp(rel, keysz, scankey, page, hikey,
- BTEqualStrategyNumber))
- {
- if (opaque->btpo_flags & BTP_CHAIN)
- {
- Assert((opaque->btpo_flags & BTP_LEAF) || offmax > P_HIKEY);
- break;
- }
- if (offmax > P_HIKEY)
- {
- if (natts == keysz) /* sanity checks */
- {
- if (_bt_skeycmp(rel, keysz, scankey, page,
- PageGetItemId(page, P_FIRSTKEY),
- BTEqualStrategyNumber))
- elog(FATAL, "btree: BTP_CHAIN flag was expected in %s (access = %s)",
- RelationGetRelationName(rel), access ? "bt_write" : "bt_read");
- if (_bt_skeycmp(rel, keysz, scankey, page,
- PageGetItemId(page, offmax),
- BTEqualStrategyNumber))
- elog(FATAL, "btree: unexpected equal last item");
- if (_bt_skeycmp(rel, keysz, scankey, page,
- PageGetItemId(page, offmax),
- BTLessStrategyNumber))
- elog(FATAL, "btree: unexpected greater last item");
- /* move right */
- }
- else if (!(opaque->btpo_flags & BTP_LEAF))
- break;
- else if (_bt_skeycmp(rel, keysz, scankey, page,
- PageGetItemId(page, offmax),
- BTLessEqualStrategyNumber))
- break;
- }
- }
+ /* step right one page */
+ BlockNumber rblkno = opaque->btpo_next;
- /* step right one page */
- rblkno = opaque->btpo_next;
- _bt_relbuf(rel, buf, access);
- buf = _bt_getbuf(rel, rblkno, access);
- page = BufferGetPage(buf);
- opaque = (BTPageOpaque) PageGetSpecialPointer(page);
- hikey = PageGetItemId(page, P_HIKEY);
-
- } while (!P_RIGHTMOST(opaque)
- && _bt_skeycmp(rel, keysz, scankey, page, hikey,
- BTGreaterEqualStrategyNumber));
+ _bt_relbuf(rel, buf, access);
+ buf = _bt_getbuf(rel, rblkno, access);
+ page = BufferGetPage(buf);
+ opaque = (BTPageOpaque) PageGetSpecialPointer(page);
}
+
return buf;
}
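The simplified move-right test above (strictly greater than the high key, repeated as often as needed) can be sketched on a toy page chain. This is an illustration only: MrPage and sketch_moveright are hypothetical names, keys are ints, and the lock hand-off between siblings that the real routine performs is omitted.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical page: just a high key and a right-sibling link. */
typedef struct MrPage
{
    int            high_key;  /* upper bound of keys; unused if rightmost */
    struct MrPage *next;      /* right sibling, NULL on the rightmost page */
} MrPage;

/*
 * Follow right links while the scan key is strictly greater than the
 * page's high key.  A key equal to the high key stays put: the caller
 * has to scan the page before deciding whether to move on.
 */
static const MrPage *
sketch_moveright(const MrPage *page, int scankey)
{
    assert(page != NULL);
    while (page->next != NULL && scankey > page->high_key)
        page = page->next;
    return page;
}
```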
/*
- * _bt_skeycmp() -- compare a scan key to a particular item on a page using
- * a requested strategy (<, <=, =, >=, >).
+ * _bt_binsrch() -- Do a binary search for a key on a particular page.
*
- * We ignore the unique OIDs stored in the btree item here. Those
- * numbers are intended for use internally only, in repositioning a
- * scan after a page split. They do not impose any meaningful ordering.
+ * The scankey we get has the compare function stored in the procedure
+ * entry of each data struct. We invoke this regproc to do the
+ * comparison for every key in the scankey.
*
- * The comparison is A <op> B, where A is the scan key and B is the
- * tuple pointed at by itemid on page.
- */
-bool
-_bt_skeycmp(Relation rel,
- Size keysz,
- ScanKey scankey,
- Page page,
- ItemId itemid,
- StrategyNumber strat)
-{
- BTItem item;
- IndexTuple indexTuple;
- TupleDesc tupDes;
- int i;
- int32 compare = 0;
-
- item = (BTItem) PageGetItem(page, itemid);
- indexTuple = &(item->bti_itup);
-
- tupDes = RelationGetDescr(rel);
-
- for (i = 1; i <= (int) keysz; i++)
- {
- ScanKey entry = &scankey[i - 1];
- Datum attrDatum;
- bool isNull;
-
- Assert(entry->sk_attno == i);
- attrDatum = index_getattr(indexTuple,
- entry->sk_attno,
- tupDes,
- &isNull);
-
- /* see comments about NULLs handling in btbuild */
- if (entry->sk_flags & SK_ISNULL) /* key is NULL */
- {
- if (isNull)
- compare = 0; /* NULL key "=" NULL datum */
- else
- compare = 1; /* NULL key ">" not-NULL datum */
- }
- else if (isNull) /* key is NOT_NULL and item is NULL */
- {
- compare = -1; /* not-NULL key "<" NULL datum */
- }
- else
- compare = DatumGetInt32(FunctionCall2(&entry->sk_func,
- entry->sk_argument,
- attrDatum));
-
- if (compare != 0)
- break; /* done when we find unequal attributes */
- }
-
- switch (strat)
- {
- case BTLessStrategyNumber:
- return (bool) (compare < 0);
- case BTLessEqualStrategyNumber:
- return (bool) (compare <= 0);
- case BTEqualStrategyNumber:
- return (bool) (compare == 0);
- case BTGreaterEqualStrategyNumber:
- return (bool) (compare >= 0);
- case BTGreaterStrategyNumber:
- return (bool) (compare > 0);
- }
-
- elog(ERROR, "_bt_skeycmp: bogus strategy %d", (int) strat);
- return false;
-}
-
-/*
- * _bt_binsrch() -- Do a binary search for a key on a particular page.
+ * On a leaf page, _bt_binsrch() returns the OffsetNumber of the first
+ * key >= given scankey. (NOTE: in particular, this means it is possible
+ * to return a value 1 greater than the number of keys on the page,
+ * if the scankey is > all keys on the page.)
*
- * The scankey we get has the compare function stored in the procedure
- * entry of each data struct. We invoke this regproc to do the
- * comparison for every key in the scankey. _bt_binsrch() returns
- * the OffsetNumber of the first matching key on the page, or the
- * OffsetNumber at which the matching key would appear if it were
- * on this page. (NOTE: in particular, this means it is possible to
- * return a value 1 greater than the number of keys on the page, if
- * the scankey is > all keys on the page.)
+ * On an internal (non-leaf) page, _bt_binsrch() returns the OffsetNumber
+ * of the last key < given scankey. (Since _bt_compare treats the first
+ * data key of such a page as minus infinity, there will be at least one
+ * key < scankey, so the result always points at one of the keys on the
+ * page.) This key indicates the right place to descend to be sure we
+ * find all leaf keys >= given scankey.
*
- * By the time this procedure is called, we're sure we're looking
- * at the right page -- don't need to walk right. _bt_binsrch() has
- * no lock or refcount side effects on the buffer.
+ * This procedure is not responsible for walking right, it just examines
+ * the given page. _bt_binsrch() has no lock or refcount side effects
+ * on the buffer.
*/
OffsetNumber
_bt_binsrch(Relation rel,
Buffer buf,
int keysz,
- ScanKey scankey,
- int srchtype)
+ ScanKey scankey)
{
TupleDesc itupdesc;
Page page;
BTPageOpaque opaque;
OffsetNumber low,
high;
- bool haveEq;
- int natts = rel->rd_rel->relnatts;
int32 result;
itupdesc = RelationGetDescr(rel);
page = BufferGetPage(buf);
opaque = (BTPageOpaque) PageGetSpecialPointer(page);
- /* by convention, item 1 on any non-rightmost page is the high key */
- low = P_RIGHTMOST(opaque) ? P_HIKEY : P_FIRSTKEY;
-
+ low = P_FIRSTDATAKEY(opaque);
high = PageGetMaxOffsetNumber(page);
/*
* If there are no keys on the page, return the first available slot.
* Note this covers two cases: the page is really empty (no keys), or
* it contains only a high key. The latter case is possible after
- * vacuuming.
+ * vacuuming. This can never happen on an internal page, however,
+ * since they are never empty (an internal page must have children).
*/
if (high < low)
return low;
/*
* Binary search to find the first key on the page >= scan key. Loop
* invariant: all slots before 'low' are < scan key, all slots at or
- * after 'high' are >= scan key. Also, haveEq is true if the tuple at
- * 'high' is == scan key. We can fall out when high == low.
+ * after 'high' are >= scan key. We can fall out when high == low.
*/
high++; /* establish the loop invariant for high */
- haveEq = false;
while (high > low)
{
 OffsetNumber mid = low + ((high - low) / 2);
/* We have low <= mid < high, so mid points at a real slot */
- result = _bt_compare(rel, itupdesc, page, keysz, scankey, mid);
+ result = _bt_compare(rel, keysz, scankey, page, mid);
if (result > 0)
low = mid + 1;
else
- {
high = mid;
- haveEq = (result == 0);
- }
}
/*--------------------
* At this point we have high == low, but be careful: they could point
- * past the last slot on the page. We also know that haveEq is true
- * if and only if there is an equal key (in which case high&low point
- * at the first equal key).
+ * past the last slot on the page.
*
* On a leaf page, we always return the first key >= scan key
* (which could be the last slot + 1).
*--------------------
*/
-
- if (opaque->btpo_flags & BTP_LEAF)
+ if (P_ISLEAF(opaque))
return low;
/*--------------------
- * On a non-leaf page, there are special cases:
- *
- * For an insertion (srchtype != BT_DESCENT and natts == keysz)
- * always return first key >= scan key (which could be off the end).
- *
- * For a standard search (srchtype == BT_DESCENT and natts == keysz)
- * return the first equal key if one exists, else the last lesser key
- * if one exists, else the first slot on the page.
- *
- * For a partial-match search (srchtype == BT_DESCENT and natts > keysz)
- * return the last lesser key if one exists, else the first slot.
- *
- * Old comments:
- * For multi-column indices, we may scan using keys
- * not for all attrs. But we handle duplicates using all attrs
- * in _bt_insert/_bt_spool code. And so while searching on
- * internal pages having number of attrs > keysize we want to
- * point at the last item < the scankey, not at the first item
- * = the scankey (!!!), and let _bt_moveright decide later
- * whether to move right or not (see comments and example
- * there). Note also that INSERTions are not affected by this
- * code (since natts == keysz for inserts). - vadim 04/15/97
+ * On a non-leaf page, return the last key < scan key.
+ * There must be one if _bt_compare() is playing by the rules.
*--------------------
*/
-
- if (haveEq)
- {
-
- /*
- * There is an equal key. We return either the first equal key
- * (which we just found), or the last lesser key.
- *
- * We need not check srchtype != BT_DESCENT here, since if that is
- * true then natts == keysz by assumption.
- */
- if (natts == keysz)
- return low; /* return first equal key */
- }
- else
- {
-
- /*
- * There is no equal key. We return either the first greater key
- * (which we just found), or the last lesser key.
- */
- if (srchtype != BT_DESCENT)
- return low; /* return first greater key */
- }
-
-
- if (low == (P_RIGHTMOST(opaque) ? P_HIKEY : P_FIRSTKEY))
- return low; /* there is no prior item */
+ Assert(low > P_FIRSTDATAKEY(opaque));
return OffsetNumberPrev(low);
}
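To make the two return conventions of _bt_binsrch() concrete, here is a small self-contained sketch. It is not the real routine: BinPage, sketch_compare and sketch_binsrch are hypothetical names, keys are ints in a plain array, and slots are numbered from 0 rather than from P_FIRSTDATAKEY. On a leaf it returns the first slot whose key is >= the scan key; on an internal page it returns the last slot whose key is < the scan key, relying on slot 0 being treated as minus infinity.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a btree page: a sorted array of int keys. */
typedef struct
{
    const int *keys;     /* data keys in ascending order */
    size_t     nkeys;
    bool       is_leaf;
} BinPage;

/* <0, 0, >0 as scankey is <, ==, > the key in 'slot'. */
static int
sketch_compare(const BinPage *page, int scankey, size_t slot)
{
    /* first data key of an internal page counts as minus infinity */
    if (!page->is_leaf && slot == 0)
        return 1;
    if (scankey < page->keys[slot])
        return -1;
    if (scankey > page->keys[slot])
        return 1;
    return 0;
}

static size_t
sketch_binsrch(const BinPage *page, int scankey)
{
    size_t low = 0;
    size_t high = page->nkeys;   /* one past the last slot */

    /*
     * Invariant: all slots before 'low' are < scankey, all slots at or
     * after 'high' are >= scankey.
     */
    while (high > low)
    {
        size_t mid = low + (high - low) / 2;

        if (sketch_compare(page, scankey, mid) > 0)
            low = mid + 1;
        else
            high = mid;
    }

    /* leaf: first key >= scankey, possibly one past the end */
    if (page->is_leaf)
        return low;

    /* internal: last key < scankey; slot 0 guarantees there is one */
    assert(low > 0);
    return low - 1;
}
```

With keys {10, 20, 20, 30}, a leaf search for 20 lands on slot 1 (the first equal key), while the same search on an internal page lands on slot 0, the down-link to the left of the first 20.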
-/*
+/*----------
* _bt_compare() -- Compare scankey to a particular tuple on the page.
*
+ * keysz: number of key conditions to be checked (might be less than the
+ * total length of the scan key!)
+ * page/offnum: location of btree item to be compared to.
+ *
* This routine returns:
* <0 if scankey < tuple at offnum;
* 0 if scankey == tuple at offnum;
* >0 if scankey > tuple at offnum.
+ * NULLs in the keys are treated as sortable values. Therefore
+ * "equality" does not necessarily mean that the item should be
+ * returned to the caller as a matching key!
*
- * -- Old comments:
- * In order to avoid having to propagate changes up the tree any time
- * a new minimal key is inserted, the leftmost entry on the leftmost
- * page is less than all possible keys, by definition.
- *
- * -- New ones:
- * New insertion code (fix against updating _in_place_ if new minimal
- * key has bigger size than old one) may delete P_HIKEY entry on the
- * root page in order to insert new minimal key - and so this definition
- * does not work properly in this case and breaks key' order on root
- * page. BTW, this propagation occures only while page' splitting,
- * but not "any time a new min key is inserted" (see _bt_insertonpg).
- * - vadim 12/05/96
+ * CRUCIAL NOTE: on a non-leaf page, the first data key is assumed to be
+ * "minus infinity": this routine will always claim it is less than the
+ * scankey. The actual key value stored (if any, which there probably isn't)
+ * does not matter. This convention allows us to implement the Lehman and
+ * Yao convention that the first down-link pointer is before the first key.
+ * See backend/access/nbtree/README for details.
+ *----------
*/
-static int32
+int32
_bt_compare(Relation rel,
- TupleDesc itupdesc,
- Page page,
int keysz,
ScanKey scankey,
+ Page page,
OffsetNumber offnum)
{
- Datum datum;
+ TupleDesc itupdesc = RelationGetDescr(rel);
+ BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);
BTItem btitem;
IndexTuple itup;
- BTPageOpaque opaque;
- ScanKey entry;
- AttrNumber attno;
- int32 result;
int i;
- bool null;
/*
- * If this is a leftmost internal page, and if our comparison is with
- * the first key on the page, then the item at that position is by
- * definition less than the scan key.
- *
- * - see new comments above...
+ * Force result ">" if target item is first data item on an internal
+ * page --- see NOTE above.
*/
-
- opaque = (BTPageOpaque) PageGetSpecialPointer(page);
-
- if (!(opaque->btpo_flags & BTP_LEAF)
- && P_LEFTMOST(opaque)
- && offnum == P_HIKEY)
- {
-
- /*
- * we just have to believe that this will only be called with
- * offnum == P_HIKEY when P_HIKEY is the OffsetNumber of the first
- * actual data key (i.e., this is also a rightmost page). there
- * doesn't seem to be any code that implies that the leftmost page
- * is normally missing a high key as well as the rightmost page.
- * but that implies that this code path only applies to the root
- * -- which seems unlikely..
- *
- * - see new comments above...
- */
- if (!P_RIGHTMOST(opaque))
- elog(ERROR, "_bt_compare: invalid comparison to high key");
-
-#ifdef NOT_USED
-
- /*
- * We just have to belive that right answer will not break
- * anything. I've checked code and all seems to be ok. See new
- * comments above...
- *
- * -- Old comments If the item on the page is equal to the scankey,
- * that's okay to admit. We just can't claim that the first key
- * on the page is greater than anything.
- */
-
- if (_bt_skeycmp(rel, keysz, scankey, page, PageGetItemId(page, offnum),
- BTEqualStrategyNumber))
- return 0;
+ if (! P_ISLEAF(opaque) && offnum == P_FIRSTDATAKEY(opaque))
return 1;
-#endif
- }
btitem = (BTItem) PageGetItem(page, PageGetItemId(page, offnum));
itup = &(btitem->bti_itup);
* they be in order. If you think about how multi-key ordering works,
* you'll understand why this is.
*
- * We don't test for violation of this condition here.
+ * We don't test for violation of this condition here, however. The
+ * initial setup for the index scan had better have gotten it right
+ * (see _bt_first).
*/
- for (i = 1; i <= keysz; i++)
+ for (i = 0; i < keysz; i++)
{
- entry = &scankey[i - 1];
- attno = entry->sk_attno;
- datum = index_getattr(itup, attno, itupdesc, &null);
+ ScanKey entry = &scankey[i];
+ Datum datum;
+ bool isNull;
+ int32 result;
+
+ datum = index_getattr(itup, entry->sk_attno, itupdesc, &isNull);
/* see comments about NULLs handling in btbuild */
- if (entry->sk_flags & SK_ISNULL) /* key is NULL */
+ if (entry->sk_flags & SK_ISNULL) /* key is NULL */
{
- if (null)
+ if (isNull)
result = 0; /* NULL "=" NULL */
else
result = 1; /* NULL ">" NOT_NULL */
}
- else if (null) /* key is NOT_NULL and item is NULL */
+ else if (isNull) /* key is NOT_NULL and item is NULL */
{
result = -1; /* NOT_NULL "<" NULL */
}
else
+ {
result = DatumGetInt32(FunctionCall2(&entry->sk_func,
- entry->sk_argument, datum));
+ entry->sk_argument,
+ datum));
+ }
/* if the keys are unequal, return the difference */
if (result != 0)
return result;
}
- /* by here, the keys are equal */
+ /* if we get here, the keys are equal */
return 0;
}
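
The NULL ordering applied by the comparison loop above (NULL "=" NULL, NULL ">" NOT_NULL) can be shown in isolation. The following is only a minimal sketch under stated assumptions: `DemoAttr` and `demo_compare` are hypothetical names that do not exist in the tree, plain `int` values stand in for `Datum`, and an inline three-way comparison stands in for the `sk_func` support procedure.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for one index attribute: a value that may be NULL. */
typedef struct
{
    bool is_null;
    int  value;
} DemoAttr;

/*
 * Compare a scan key against an index tuple, attribute by attribute.
 * Returns <0, 0 or >0 as the key is less than, equal to, or greater than
 * the tuple.  NULLs compare equal to each other and greater than any
 * non-NULL value, as in the loop in the diff.
 */
static int
demo_compare(const DemoAttr *key, const DemoAttr *tuple, size_t nkeys)
{
    assert(key != NULL && tuple != NULL);

    for (size_t i = 0; i < nkeys; i++)
    {
        int result;

        if (key[i].is_null)
            result = tuple[i].is_null ? 0 : 1;  /* NULL "=" NULL, NULL ">" NOT_NULL */
        else if (tuple[i].is_null)
            result = -1;                        /* NOT_NULL "<" NULL */
        else
            result = (key[i].value > tuple[i].value) -
                     (key[i].value < tuple[i].value);

        /* the first unequal attribute decides the ordering */
        if (result != 0)
            return result;
    }
    /* if we get here, the keys are equal */
    return 0;
}
```

Because the first unequal attribute decides the result, a two-attribute key of (5, NULL) sorts after a tuple of (5, 9) but before a tuple of (7, 1).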
* _bt_next() -- Get the next item in a scan.
*
* On entry, we have a valid currentItemData in the scan, and a
- * read lock on the page that contains that item. We do not have
- * the page pinned. We return the next item in the scan. On
- * exit, we have the page containing the next item locked but not
- * pinned.
+ * read lock and pin count on the page that contains that item.
+ * We return the next item in the scan, or NULL if no more.
+ * On successful exit, the page containing the new item is locked
+ * and pinned; on NULL exit, no lock or pin is held.
*/
RetrieveIndexResult
_bt_next(IndexScanDesc scan, ScanDirection dir)
Buffer buf;
Page page;
OffsetNumber offnum;
- RetrieveIndexResult res;
ItemPointer current;
BTItem btitem;
IndexTuple itup;
so = (BTScanOpaque) scan->opaque;
current = &(scan->currentItemData);
- Assert(BufferIsValid(so->btso_curbuf));
-
/* we still have the buffer pinned and locked */
buf = so->btso_curbuf;
+ Assert(BufferIsValid(buf));
do
{
if (!_bt_step(scan, &buf, dir))
return (RetrieveIndexResult) NULL;
- /* by here, current is the tuple we want to return */
+ /* current is the next candidate tuple to return */
offnum = ItemPointerGetOffsetNumber(current);
page = BufferGetPage(buf);
btitem = (BTItem) PageGetItem(page, PageGetItemId(page, offnum));
if (_bt_checkkeys(scan, itup, &keysok))
{
+ /* tuple passes all scan key conditions, so return it */
Assert(keysok == so->numberOfKeys);
- res = FormRetrieveIndexResult(current, &(itup->t_tid));
-
- /* remember which buffer we have pinned and locked */
- so->btso_curbuf = buf;
- return res;
+ return FormRetrieveIndexResult(current, &(itup->t_tid));
}
+ /* This tuple doesn't pass, but there might be more that do */
} while (keysok >= so->numberOfFirstKeys ||
(keysok == ((Size) -1) && ScanDirectionIsBackward(dir)));
+ /* No more items, so close down the current-item info */
ItemPointerSetInvalid(current);
so->btso_curbuf = InvalidBuffer;
_bt_relbuf(rel, buf, BT_READ);
_bt_first(IndexScanDesc scan, ScanDirection dir)
{
Relation rel;
- TupleDesc itupdesc;
Buffer buf;
Page page;
- BTPageOpaque pop;
BTStack stack;
- OffsetNumber offnum,
- maxoff;
- bool offGmax = false;
+ OffsetNumber offnum;
BTItem btitem;
IndexTuple itup;
ItemPointer current;
int32 result;
BTScanOpaque so;
Size keysok;
-
bool strategyCheck;
ScanKey scankeys = 0;
int keysCount = 0;
return _bt_endpoint(scan, dir);
}
- itupdesc = RelationGetDescr(rel);
- current = &(scan->currentItemData);
-
/*
* Okay, we want something more complicated. What we'll do is use the
* first item in the scan key passed in (which has been correctly
* ordered to take advantage of index ordering) to position ourselves
* at the right place in the scan.
*/
- /* _bt_orderkeys disallows it, but it's place to add some code latter */
scankeys = (ScanKey) palloc(keysCount * sizeof(ScanKeyData));
for (i = 0; i < keysCount; i++)
{
j = nKeyIs[i];
+ /* _bt_orderkeys disallows it, but this is the place to add some code later */
if (so->keyData[j].sk_flags & SK_ISNULL)
{
pfree(nKeyIs);
if (nKeyIs)
pfree(nKeyIs);
- stack = _bt_search(rel, keysCount, scankeys, &buf);
- _bt_freestack(stack);
-
- blkno = BufferGetBlockNumber(buf);
- page = BufferGetPage(buf);
+ current = &(scan->currentItemData);
/*
- * This will happen if the tree we're searching is entirely empty, or
- * if we're doing a search for a key that would appear on an entirely
- * empty internal page. In either case, there are no matching tuples
- * in the index.
+ * Use the manufactured scan key to descend the tree and position
+ * ourselves on the target leaf page.
*/
+ stack = _bt_search(rel, keysCount, scankeys, &buf, BT_READ);
- if (PageIsEmpty(page))
+ /* don't need to keep the stack around... */
+ _bt_freestack(stack);
+
+ if (! BufferIsValid(buf))
{
+ /* Only get here if index is completely empty */
ItemPointerSetInvalid(current);
so->btso_curbuf = InvalidBuffer;
- _bt_relbuf(rel, buf, BT_READ);
pfree(scankeys);
return (RetrieveIndexResult) NULL;
}
- maxoff = PageGetMaxOffsetNumber(page);
- pop = (BTPageOpaque) PageGetSpecialPointer(page);
-
- /*
- * Now _bt_moveright doesn't move from non-rightmost leaf page if
- * scankey == hikey and there is only hikey there. It's good for
- * insertion, but we need to do work for scan here. - vadim 05/27/97
- */
-
- while (maxoff == P_HIKEY && !P_RIGHTMOST(pop) &&
- _bt_skeycmp(rel, keysCount, scankeys, page,
- PageGetItemId(page, P_HIKEY),
- BTGreaterEqualStrategyNumber))
- {
- /* step right one page */
- blkno = pop->btpo_next;
- _bt_relbuf(rel, buf, BT_READ);
- buf = _bt_getbuf(rel, blkno, BT_READ);
- page = BufferGetPage(buf);
- if (PageIsEmpty(page))
- {
- ItemPointerSetInvalid(current);
- so->btso_curbuf = InvalidBuffer;
- _bt_relbuf(rel, buf, BT_READ);
- pfree(scankeys);
- return (RetrieveIndexResult) NULL;
- }
- maxoff = PageGetMaxOffsetNumber(page);
- pop = (BTPageOpaque) PageGetSpecialPointer(page);
- }
-
- /* find the nearest match to the manufactured scan key on the page */
- offnum = _bt_binsrch(rel, buf, keysCount, scankeys, BT_DESCENT);
+ /* remember which buffer we have pinned */
+ so->btso_curbuf = buf;
+ blkno = BufferGetBlockNumber(buf);
+ page = BufferGetPage(buf);
- if (offnum > maxoff)
- {
- offnum = maxoff;
- offGmax = true;
- }
+ offnum = _bt_binsrch(rel, buf, keysCount, scankeys);
ItemPointerSet(current, blkno, offnum);
- /*
- * Now find the right place to start the scan. Result is the value
- * we're looking for minus the value we're looking at in the index.
+ /*----------
+ * At this point we are positioned at the first item >= scan key,
+ * or possibly at the end of a page on which all the existing items
+ * are < scan key and we know that everything on later pages is
+ * >= scan key. We could step forward in the latter case, but that'd
+ * be a waste of time if we want to scan backwards. So, it's now time to
+ * examine the scan strategy to find the exact place to start the scan.
+ *
+ * Note: if _bt_step fails (meaning we fell off the end of the index
+ * in one direction or the other), we either return NULL (no matches) or
+ * call _bt_endpoint() to set up a scan starting at that index endpoint,
+ * as appropriate for the desired scan type.
+ *
+ * This is yet another place to add some code later for is(not)null ...
+ *----------
*/
- result = _bt_compare(rel, itupdesc, page, keysCount, scankeys, offnum);
-
- /* it's yet other place to add some code latter for is(not)null */
-
- strat = strat_total;
- switch (strat)
+ switch (strat_total)
{
case BTLessStrategyNumber:
- if (result <= 0)
+ /*
+ * Back up one to arrive at last item < scankey
+ */
+ if (!_bt_step(scan, &buf, BackwardScanDirection))
{
- do
- {
- if (!_bt_twostep(scan, &buf, BackwardScanDirection))
- break;
-
- offnum = ItemPointerGetOffsetNumber(current);
- page = BufferGetPage(buf);
- result = _bt_compare(rel, itupdesc, page, keysCount, scankeys, offnum);
- } while (result <= 0);
-
+ pfree(scankeys);
+ return (RetrieveIndexResult) NULL;
}
break;
case BTLessEqualStrategyNumber:
- if (result >= 0)
+ /*
+ * We need to find the last item <= scankey, so step forward
+ * till we find one > scankey, then step back one.
+ */
+ if (offnum > PageGetMaxOffsetNumber(page))
{
- do
+ if (!_bt_step(scan, &buf, ForwardScanDirection))
{
- if (!_bt_twostep(scan, &buf, ForwardScanDirection))
- break;
-
- offnum = ItemPointerGetOffsetNumber(current);
- page = BufferGetPage(buf);
- result = _bt_compare(rel, itupdesc, page, keysCount, scankeys, offnum);
- } while (result >= 0);
+ pfree(scankeys);
+ return _bt_endpoint(scan, dir);
+ }
+ }
+ for (;;)
+ {
+ offnum = ItemPointerGetOffsetNumber(current);
+ page = BufferGetPage(buf);
+ result = _bt_compare(rel, keysCount, scankeys, page, offnum);
+ if (result < 0)
+ break;
+ if (!_bt_step(scan, &buf, ForwardScanDirection))
+ {
+ pfree(scankeys);
+ return _bt_endpoint(scan, dir);
+ }
+ }
+ if (!_bt_step(scan, &buf, BackwardScanDirection))
+ {
+ pfree(scankeys);
+ return (RetrieveIndexResult) NULL;
}
- if (result < 0)
- _bt_twostep(scan, &buf, BackwardScanDirection);
break;
case BTEqualStrategyNumber:
- if (result != 0)
+ /*
+ * Make sure we are on the first equal item; might have to step
+ * forward if currently at end of page.
+ */
+ if (offnum > PageGetMaxOffsetNumber(page))
{
- _bt_relbuf(scan->relation, buf, BT_READ);
- so->btso_curbuf = InvalidBuffer;
- ItemPointerSetInvalid(&(scan->currentItemData));
- pfree(scankeys);
- return (RetrieveIndexResult) NULL;
+ if (!_bt_step(scan, &buf, ForwardScanDirection))
+ {
+ pfree(scankeys);
+ return (RetrieveIndexResult) NULL;
+ }
+ offnum = ItemPointerGetOffsetNumber(current);
+ page = BufferGetPage(buf);
}
- else if (ScanDirectionIsBackward(dir))
+ result = _bt_compare(rel, keysCount, scankeys, page, offnum);
+ if (result != 0)
+ goto nomatches; /* no equal items! */
+ /*
+ * If a backward scan was specified, need to start with last
+ * equal item not first one.
+ */
+ if (ScanDirectionIsBackward(dir))
{
do
{
- if (!_bt_twostep(scan, &buf, ForwardScanDirection))
- break;
-
+ if (!_bt_step(scan, &buf, ForwardScanDirection))
+ {
+ pfree(scankeys);
+ return _bt_endpoint(scan, dir);
+ }
offnum = ItemPointerGetOffsetNumber(current);
page = BufferGetPage(buf);
- result = _bt_compare(rel, itupdesc, page, keysCount, scankeys, offnum);
+ result = _bt_compare(rel, keysCount, scankeys, page, offnum);
} while (result == 0);
-
- if (result < 0)
- _bt_twostep(scan, &buf, BackwardScanDirection);
+ if (!_bt_step(scan, &buf, BackwardScanDirection))
+ elog(ERROR, "_bt_first: equal items disappeared?");
}
break;
case BTGreaterEqualStrategyNumber:
- if (offGmax)
+ /*
+ * We want the first item >= scankey, which is where we are...
+ * unless we're not anywhere at all...
+ */
+ if (offnum > PageGetMaxOffsetNumber(page))
{
- if (result < 0)
+ if (!_bt_step(scan, &buf, ForwardScanDirection))
{
- Assert(!P_RIGHTMOST(pop) && maxoff == P_HIKEY);
- if (!_bt_step(scan, &buf, ForwardScanDirection))
- {
- _bt_relbuf(scan->relation, buf, BT_READ);
- so->btso_curbuf = InvalidBuffer;
- ItemPointerSetInvalid(&(scan->currentItemData));
- pfree(scankeys);
- return (RetrieveIndexResult) NULL;
- }
- }
- else if (result > 0)
- { /* Just remember: _bt_binsrch() returns
- * the OffsetNumber of the first matching
- * key on the page, or the OffsetNumber at
- * which the matching key WOULD APPEAR IF
- * IT WERE on this page. No key on this
- * page, but offnum from _bt_binsrch()
- * greater maxoff - have to move right. -
- * vadim 12/06/96 */
- _bt_twostep(scan, &buf, ForwardScanDirection);
+ pfree(scankeys);
+ return (RetrieveIndexResult) NULL;
}
}
- else if (result < 0)
- {
- do
- {
- if (!_bt_twostep(scan, &buf, BackwardScanDirection))
- break;
-
- page = BufferGetPage(buf);
- offnum = ItemPointerGetOffsetNumber(current);
- result = _bt_compare(rel, itupdesc, page, keysCount, scankeys, offnum);
- } while (result < 0);
-
- if (result > 0)
- _bt_twostep(scan, &buf, ForwardScanDirection);
- }
break;
case BTGreaterStrategyNumber:
- /* offGmax helps as above */
- if (result >= 0 || offGmax)
+ /*
+ * We want the first item > scankey, so make sure we are on
+ * an item and then step over any equal items.
+ */
+ if (offnum > PageGetMaxOffsetNumber(page))
{
- do
+ if (!_bt_step(scan, &buf, ForwardScanDirection))
{
- if (!_bt_twostep(scan, &buf, ForwardScanDirection))
- break;
-
- offnum = ItemPointerGetOffsetNumber(current);
- page = BufferGetPage(buf);
- result = _bt_compare(rel, itupdesc, page, keysCount, scankeys, offnum);
- } while (result >= 0);
+ pfree(scankeys);
+ return (RetrieveIndexResult) NULL;
+ }
+ offnum = ItemPointerGetOffsetNumber(current);
+ page = BufferGetPage(buf);
+ }
+ result = _bt_compare(rel, keysCount, scankeys, page, offnum);
+ while (result == 0)
+ {
+ if (!_bt_step(scan, &buf, ForwardScanDirection))
+ {
+ pfree(scankeys);
+ return (RetrieveIndexResult) NULL;
+ }
+ offnum = ItemPointerGetOffsetNumber(current);
+ page = BufferGetPage(buf);
+ result = _bt_compare(rel, keysCount, scankeys, page, offnum);
}
break;
}
- pfree(scankeys);
/* okay, current item pointer for the scan is right */
offnum = ItemPointerGetOffsetNumber(current);
page = BufferGetPage(buf);
btitem = (BTItem) PageGetItem(page, PageGetItemId(page, offnum));
itup = &btitem->bti_itup;
+ /* is the first item actually acceptable? */
if (_bt_checkkeys(scan, itup, &keysok))
{
+ /* yes, return it */
res = FormRetrieveIndexResult(current, &(itup->t_tid));
-
- /* remember which buffer we have pinned */
- so->btso_curbuf = buf;
- }
- else if (keysok >= so->numberOfFirstKeys)
- {
- so->btso_curbuf = buf;
- return _bt_next(scan, dir);
}
- else if (keysok == ((Size) -1) && ScanDirectionIsBackward(dir))
+ else if (keysok >= so->numberOfFirstKeys ||
+ (keysok == ((Size) -1) && ScanDirectionIsBackward(dir)))
{
- so->btso_curbuf = buf;
- return _bt_next(scan, dir);
+ /* no, but there might be another one that is */
+ res = _bt_next(scan, dir);
}
else
{
+ /* no tuples in the index match this scan key */
+nomatches:
ItemPointerSetInvalid(current);
so->btso_curbuf = InvalidBuffer;
_bt_relbuf(rel, buf, BT_READ);
res = (RetrieveIndexResult) NULL;
}
+ pfree(scankeys);
+
return res;
}
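
The per-strategy positioning in the rewritten `_bt_first` may be easier to follow on a flat sorted array. The sketch below is an illustration under simplifying assumptions, not the btree code: `DemoStrategy`, `demo_lower_bound` and `demo_start` are hypothetical names, there are no pages, locks or scan directions, and "falling off the end of the index" is reduced to returning -1.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical strategy codes standing in for the BT*StrategyNumber values. */
typedef enum { DEMO_LT, DEMO_LE, DEMO_EQ, DEMO_GE, DEMO_GT } DemoStrategy;

/* Index of the first element >= key, or n if there is none (binary search). */
static size_t
demo_lower_bound(const int *a, size_t n, int key)
{
    size_t lo = 0, hi = n;

    while (lo < hi)
    {
        size_t mid = lo + (hi - lo) / 2;

        if (a[mid] < key)
            lo = mid + 1;
        else
            hi = mid;
    }
    return lo;
}

/* Index of the item the scan should start at, or -1 if there is none. */
static long
demo_start(const int *a, size_t n, int key, DemoStrategy strat)
{
    size_t pos;

    assert(a != NULL || n == 0);
    pos = demo_lower_bound(a, n, key);  /* first item >= key, or n */

    switch (strat)
    {
        case DEMO_LT:           /* back up one to the last item < key */
            return (long) pos - 1;
        case DEMO_LE:           /* step forward past items <= key, then back one */
            while (pos < n && a[pos] <= key)
                pos++;
            return (long) pos - 1;
        case DEMO_EQ:           /* must be sitting on an equal item */
            return (pos < n && a[pos] == key) ? (long) pos : -1;
        case DEMO_GE:           /* already there, unless past the end */
            return (pos < n) ? (long) pos : -1;
        case DEMO_GT:           /* step over any equal items */
            while (pos < n && a[pos] == key)
                pos++;
            return (pos < n) ? (long) pos : -1;
    }
    return -1;
}
```

With the array {1, 3, 3, 5} and key 3, the binary search lands on index 1; the `<` case backs up to index 0, `<=` ends on index 2, and `>` steps over both equal items to index 3.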
* _bt_step() -- Step one item in the requested direction in a scan on
* the tree.
*
- * If no adjacent record exists in the requested direction, return
- * false. Else, return true and set the currentItemData for the
- * scan to the right thing.
+ * *bufP is the current buffer (read-locked and pinned). If we change
+ * pages, it's updated appropriately.
+ *
+ * If successful, update scan's currentItemData and return true.
+ * If no adjacent record exists in the requested direction,
+ * release buffer pin/locks and return false.
*/
bool
_bt_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
{
+ Relation rel = scan->relation;
+ ItemPointer current = &(scan->currentItemData);
+ BTScanOpaque so = (BTScanOpaque) scan->opaque;
Page page;
BTPageOpaque opaque;
OffsetNumber offnum,
maxoff;
- OffsetNumber start;
BlockNumber blkno;
BlockNumber obknum;
- BTScanOpaque so;
- ItemPointer current;
- Relation rel;
-
- rel = scan->relation;
- current = &(scan->currentItemData);
/*
* Don't use ItemPointerGetOffsetNumber here: it would risk an assertion
* failure, because ip_posid may be equal to 0.
*/
offnum = current->ip_posid;
+
page = BufferGetPage(*bufP);
opaque = (BTPageOpaque) PageGetSpecialPointer(page);
- so = (BTScanOpaque) scan->opaque;
maxoff = PageGetMaxOffsetNumber(page);
- /* get the next tuple */
if (ScanDirectionIsForward(dir))
{
if (!PageIsEmpty(page) && offnum < maxoff)
offnum = OffsetNumberNext(offnum);
else
{
-
- /* if we're at end of scan, release the buffer and return */
- blkno = opaque->btpo_next;
- if (P_RIGHTMOST(opaque))
- {
- _bt_relbuf(rel, *bufP, BT_READ);
- ItemPointerSetInvalid(current);
- *bufP = so->btso_curbuf = InvalidBuffer;
- return false;
- }
- else
+ /* walk right to the next page with data */
+ for (;;)
{
-
- /* walk right to the next page with data */
- _bt_relbuf(rel, *bufP, BT_READ);
- for (;;)
+ /* if we're at end of scan, release the buffer and return */
+ if (P_RIGHTMOST(opaque))
{
- *bufP = _bt_getbuf(rel, blkno, BT_READ);
- page = BufferGetPage(*bufP);
- opaque = (BTPageOpaque) PageGetSpecialPointer(page);
- maxoff = PageGetMaxOffsetNumber(page);
- start = P_RIGHTMOST(opaque) ? P_HIKEY : P_FIRSTKEY;
-
- if (!PageIsEmpty(page) && start <= maxoff)
- break;
- else
- {
- blkno = opaque->btpo_next;
- _bt_relbuf(rel, *bufP, BT_READ);
- if (blkno == P_NONE)
- {
- *bufP = so->btso_curbuf = InvalidBuffer;
- ItemPointerSetInvalid(current);
- return false;
- }
- }
+ _bt_relbuf(rel, *bufP, BT_READ);
+ ItemPointerSetInvalid(current);
+ *bufP = so->btso_curbuf = InvalidBuffer;
+ return false;
}
- offnum = start;
+ /* step right one page */
+ blkno = opaque->btpo_next;
+ _bt_relbuf(rel, *bufP, BT_READ);
+ *bufP = _bt_getbuf(rel, blkno, BT_READ);
+ page = BufferGetPage(*bufP);
+ opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+ maxoff = PageGetMaxOffsetNumber(page);
+ /* done if it's not empty */
+ offnum = P_FIRSTDATAKEY(opaque);
+ if (!PageIsEmpty(page) && offnum <= maxoff)
+ break;
}
}
}
- else if (ScanDirectionIsBackward(dir))
+ else
{
-
- /* remember that high key is item zero on non-rightmost pages */
- start = P_RIGHTMOST(opaque) ? P_HIKEY : P_FIRSTKEY;
-
- if (offnum > start)
+ if (offnum > P_FIRSTDATAKEY(opaque))
offnum = OffsetNumberPrev(offnum);
else
{
-
- /* if we're at end of scan, release the buffer and return */
- blkno = opaque->btpo_prev;
- if (P_LEFTMOST(opaque))
- {
- _bt_relbuf(rel, *bufP, BT_READ);
- *bufP = so->btso_curbuf = InvalidBuffer;
- ItemPointerSetInvalid(current);
- return false;
- }
- else
+ /* walk left to the next page with data */
+ for (;;)
{
-
+ /* if we're at end of scan, release the buffer and return */
+ if (P_LEFTMOST(opaque))
+ {
+ _bt_relbuf(rel, *bufP, BT_READ);
+ ItemPointerSetInvalid(current);
+ *bufP = so->btso_curbuf = InvalidBuffer;
+ return false;
+ }
+ /* step left */
obknum = BufferGetBlockNumber(*bufP);
-
- /* walk right to the next page with data */
+ blkno = opaque->btpo_prev;
_bt_relbuf(rel, *bufP, BT_READ);
- for (;;)
+ *bufP = _bt_getbuf(rel, blkno, BT_READ);
+ page = BufferGetPage(*bufP);
+ opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+ /*
+ * If the adjacent page just split, then we have to walk
+ * right to find the block that's now adjacent to where
+ * we were. Because pages only split right, we don't have
+ * to worry about this failing to terminate.
+ */
+ while (opaque->btpo_next != obknum)
{
+ blkno = opaque->btpo_next;
+ _bt_relbuf(rel, *bufP, BT_READ);
*bufP = _bt_getbuf(rel, blkno, BT_READ);
page = BufferGetPage(*bufP);
opaque = (BTPageOpaque) PageGetSpecialPointer(page);
- maxoff = PageGetMaxOffsetNumber(page);
-
- /*
- * If the adjacent page just split, then we may have
- * the wrong block. Handle this case. Because pages
- * only split right, we don't have to worry about this
- * failing to terminate.
- */
-
- while (opaque->btpo_next != obknum)
- {
- blkno = opaque->btpo_next;
- _bt_relbuf(rel, *bufP, BT_READ);
- *bufP = _bt_getbuf(rel, blkno, BT_READ);
- page = BufferGetPage(*bufP);
- opaque = (BTPageOpaque) PageGetSpecialPointer(page);
- maxoff = PageGetMaxOffsetNumber(page);
- }
-
- /* don't consider the high key */
- start = P_RIGHTMOST(opaque) ? P_HIKEY : P_FIRSTKEY;
-
- /* anything to look at here? */
- if (!PageIsEmpty(page) && maxoff >= start)
- break;
- else
- {
- blkno = opaque->btpo_prev;
- obknum = BufferGetBlockNumber(*bufP);
- _bt_relbuf(rel, *bufP, BT_READ);
- if (blkno == P_NONE)
- {
- *bufP = so->btso_curbuf = InvalidBuffer;
- ItemPointerSetInvalid(current);
- return false;
- }
- }
}
- offnum = maxoff;/* XXX PageIsEmpty? */
+ /* done if it's not empty */
+ maxoff = PageGetMaxOffsetNumber(page);
+ offnum = maxoff;
+ if (!PageIsEmpty(page) && maxoff >= P_FIRSTDATAKEY(opaque))
+ break;
}
}
}
- blkno = BufferGetBlockNumber(*bufP);
+
+ /* Update scan state */
so->btso_curbuf = *bufP;
+ blkno = BufferGetBlockNumber(*bufP);
ItemPointerSet(current, blkno, offnum);
return true;
}
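
The walk-left logic in `_bt_step` relies on pages only splitting to the right: if the page reached through the remembered left link no longer points back at the page we came from, the true left sibling can be found by following right links. Below is a minimal sketch of that recovery step, assuming an in-memory array of pages instead of buffers and locks; `DemoPage`, `DEMO_NONE` and `demo_find_left_sibling` are hypothetical names.

```c
#include <assert.h>

#define DEMO_NONE (-1)          /* hypothetical "no sibling" marker */

/* Hypothetical page: only the sibling links matter for this sketch. */
typedef struct
{
    int prev;                   /* block number of left sibling, or DEMO_NONE */
    int next;                   /* block number of right sibling, or DEMO_NONE */
} DemoPage;

/*
 * We were on block 'origin' and had read its left link as 'remembered_prev'
 * before releasing it.  If that page has split since then, its right link
 * no longer points at 'origin', so walk right until we reach the page that
 * is now immediately to the left of 'origin'.
 */
static int
demo_find_left_sibling(const DemoPage *pages, int origin, int remembered_prev)
{
    int blk = remembered_prev;

    assert(blk != DEMO_NONE);
    while (pages[blk].next != origin)
    {
        blk = pages[blk].next;
        /* pages only split right, so origin must still lie to our right */
        assert(blk != DEMO_NONE);
    }
    return blk;
}
```

For example, if block 0 was the left sibling of block 1 and has since split into blocks 0 and 2, starting from block 0 the loop follows one right link and returns block 2.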
-/*
- * _bt_twostep() -- Move to an adjacent record in a scan on the tree,
- * if an adjacent record exists.
- *
- * This is like _bt_step, except that if no adjacent record exists
- * it restores us to where we were before trying the step. This is
- * only hairy when you cross page boundaries, since the page you cross
- * from could have records inserted or deleted, or could even split.
- * This is unlikely, but we try to handle it correctly here anyway.
- *
- * This routine contains the only case in which our changes to Lehman
- * and Yao's algorithm.
- *
- * Like step, this routine leaves the scan's currentItemData in the
- * proper state and acquires a lock and pin on *bufP. If the twostep
- * succeeded, we return true; otherwise, we return false.
- */
-static bool
-_bt_twostep(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
-{
- Page page;
- BTPageOpaque opaque;
- OffsetNumber offnum,
- maxoff;
- OffsetNumber start;
- ItemPointer current;
- ItemId itemid;
- int itemsz;
- BTItem btitem;
- BTItem svitem;
- BlockNumber blkno;
-
- blkno = BufferGetBlockNumber(*bufP);
- page = BufferGetPage(*bufP);
- opaque = (BTPageOpaque) PageGetSpecialPointer(page);
- maxoff = PageGetMaxOffsetNumber(page);
- current = &(scan->currentItemData);
- offnum = ItemPointerGetOffsetNumber(current);
-
- start = P_RIGHTMOST(opaque) ? P_HIKEY : P_FIRSTKEY;
-
- /* if we're safe, just do it */
- if (ScanDirectionIsForward(dir) && offnum < maxoff)
- { /* XXX PageIsEmpty? */
- ItemPointerSet(current, blkno, OffsetNumberNext(offnum));
- return true;
- }
- else if (ScanDirectionIsBackward(dir) && offnum > start)
- {
- ItemPointerSet(current, blkno, OffsetNumberPrev(offnum));
- return true;
- }
-
- /* if we've hit end of scan we don't have to do any work */
- if (ScanDirectionIsForward(dir) && P_RIGHTMOST(opaque))
- return false;
- else if (ScanDirectionIsBackward(dir) && P_LEFTMOST(opaque))
- return false;
-
- /*
- * Okay, it's off the page; let _bt_step() do the hard work, and we'll
- * try to remember where we were. This is not guaranteed to work;
- * this is the only place in the code where concurrency can screw us
- * up, and it's because we want to be able to move in two directions
- * in the scan.
- */
-
- itemid = PageGetItemId(page, offnum);
- itemsz = ItemIdGetLength(itemid);
- btitem = (BTItem) PageGetItem(page, itemid);
- svitem = (BTItem) palloc(itemsz);
- memmove((char *) svitem, (char *) btitem, itemsz);
-
- if (_bt_step(scan, bufP, dir))
- {
- pfree(svitem);
- return true;
- }
-
- /* try to find our place again */
- *bufP = _bt_getbuf(scan->relation, blkno, BT_READ);
- page = BufferGetPage(*bufP);
- maxoff = PageGetMaxOffsetNumber(page);
-
- while (offnum <= maxoff)
- {
- itemid = PageGetItemId(page, offnum);
- btitem = (BTItem) PageGetItem(page, itemid);
- if (BTItemSame(btitem, svitem))
- {
- pfree(svitem);
- ItemPointerSet(current, blkno, offnum);
- return false;
- }
- }
-
- /*
- * XXX crash and burn -- can't find our place. We can be a little
- * smarter -- walk to the next page to the right, for example, since
- * that's the only direction that splits happen in. Deletions screw
- * us up less often since they're only done by the vacuum daemon.
- */
-
- elog(ERROR, "btree synchronization error: concurrent update botched scan");
-
- return false;
-}
-
/*
* _bt_endpoint() -- Find the first or last key in the index.
+ *
+ * This is used by _bt_first() to set up a scan when we've determined
+ * that the scan must start at the beginning or end of the index (for
+ * a forward or backward scan respectively).
*/
static RetrieveIndexResult
_bt_endpoint(IndexScanDesc scan, ScanDirection dir)
ItemPointer current;
OffsetNumber offnum,
maxoff;
- OffsetNumber start = 0;
+ OffsetNumber start;
BlockNumber blkno;
BTItem btitem;
IndexTuple itup;
current = &(scan->currentItemData);
so = (BTScanOpaque) scan->opaque;
+ /*
+ * Scan down to the leftmost or rightmost leaf page. This is a
+ * simplified version of _bt_search(). We don't maintain a stack
+ * since we know we won't need it.
+ */
buf = _bt_getroot(rel, BT_READ);
+
+ if (! BufferIsValid(buf))
+ {
+ /* empty index... */
+ ItemPointerSetInvalid(current);
+ so->btso_curbuf = InvalidBuffer;
+ return (RetrieveIndexResult) NULL;
+ }
+
blkno = BufferGetBlockNumber(buf);
page = BufferGetPage(buf);
opaque = (BTPageOpaque) PageGetSpecialPointer(page);
for (;;)
{
- if (opaque->btpo_flags & BTP_LEAF)
+ if (P_ISLEAF(opaque))
break;
if (ScanDirectionIsForward(dir))
- offnum = P_RIGHTMOST(opaque) ? P_HIKEY : P_FIRSTKEY;
+ offnum = P_FIRSTDATAKEY(opaque);
else
offnum = PageGetMaxOffsetNumber(page);
btitem = (BTItem) PageGetItem(page, PageGetItemId(page, offnum));
itup = &(btitem->bti_itup);
-
blkno = ItemPointerGetBlockNumber(&(itup->t_tid));
_bt_relbuf(rel, buf, BT_READ);
buf = _bt_getbuf(rel, blkno, BT_READ);
+
page = BufferGetPage(buf);
opaque = (BTPageOpaque) PageGetSpecialPointer(page);
/*
- * Race condition: If the child page we just stepped onto is in
- * the process of being split, we need to make sure we're all the
- * way at the right edge of the tree. See the paper by Lehman and
- * Yao.
+ * Race condition: If the child page we just stepped onto was just
+ * split, we need to make sure we're all the way at the right edge
+ * of the tree. See the paper by Lehman and Yao.
*/
-
if (ScanDirectionIsBackward(dir) && !P_RIGHTMOST(opaque))
{
do
if (ScanDirectionIsForward(dir))
{
- if (!P_LEFTMOST(opaque))/* non-leftmost page ? */
- elog(ERROR, "_bt_endpoint: leftmost page (%u) has not leftmost flag", blkno);
- start = P_RIGHTMOST(opaque) ? P_HIKEY : P_FIRSTKEY;
-
- /*
- * I don't understand this stuff! It doesn't work for
- * non-rightmost pages with only one element (P_HIKEY) which we
- * have after deletion itups by vacuum (it's case of start >
- * maxoff). Scanning in BackwardScanDirection is not
- * understandable at all. Well - new stuff. - vadim 12/06/96
- */
-#ifdef NOT_USED
- if (PageIsEmpty(page) || start > maxoff)
- {
- ItemPointerSet(current, blkno, maxoff);
- if (!_bt_step(scan, &buf, BackwardScanDirection))
- return (RetrieveIndexResult) NULL;
-
- start = ItemPointerGetOffsetNumber(current);
- page = BufferGetPage(buf);
- }
-#endif
- if (PageIsEmpty(page))
- {
- if (start != P_HIKEY) /* non-rightmost page */
- elog(ERROR, "_bt_endpoint: non-rightmost page (%u) is empty", blkno);
+ Assert(P_LEFTMOST(opaque));
- /*
- * It's left- & right- most page - root page, - and it's
- * empty...
- */
- _bt_relbuf(rel, buf, BT_READ);
- ItemPointerSetInvalid(current);
- so->btso_curbuf = InvalidBuffer;
- return (RetrieveIndexResult) NULL;
- }
- if (start > maxoff) /* start == 2 && maxoff == 1 */
- {
- ItemPointerSet(current, blkno, maxoff);
- if (!_bt_step(scan, &buf, ForwardScanDirection))
- return (RetrieveIndexResult) NULL;
-
- start = ItemPointerGetOffsetNumber(current);
- page = BufferGetPage(buf);
- }
- /* new stuff ends here */
- else
- ItemPointerSet(current, blkno, start);
+ start = P_FIRSTDATAKEY(opaque);
}
else if (ScanDirectionIsBackward(dir))
{
+ Assert(P_RIGHTMOST(opaque));
- /*
- * I don't understand this stuff too! If RIGHT-most leaf page is
- * empty why do scanning in ForwardScanDirection ??? Well - new
- * stuff. - vadim 12/06/96
- */
-#ifdef NOT_USED
- if (PageIsEmpty(page))
- {
- ItemPointerSet(current, blkno, FirstOffsetNumber);
- if (!_bt_step(scan, &buf, ForwardScanDirection))
- return (RetrieveIndexResult) NULL;
-
- start = ItemPointerGetOffsetNumber(current);
- page = BufferGetPage(buf);
- }
-#endif
- if (PageIsEmpty(page))
- {
- /* If it's leftmost page too - it's empty root page... */
- if (P_LEFTMOST(opaque))
- {
- _bt_relbuf(rel, buf, BT_READ);
- ItemPointerSetInvalid(current);
- so->btso_curbuf = InvalidBuffer;
- return (RetrieveIndexResult) NULL;
- }
- /* Go back ! */
- ItemPointerSet(current, blkno, FirstOffsetNumber);
- if (!_bt_step(scan, &buf, BackwardScanDirection))
- return (RetrieveIndexResult) NULL;
-
- start = ItemPointerGetOffsetNumber(current);
- page = BufferGetPage(buf);
- }
- /* new stuff ends here */
- else
- {
- start = PageGetMaxOffsetNumber(page);
- ItemPointerSet(current, blkno, start);
- }
+ start = PageGetMaxOffsetNumber(page);
+ if (start < P_FIRSTDATAKEY(opaque)) /* watch out for empty page */
+ start = P_FIRSTDATAKEY(opaque);
}
else
+ {
elog(ERROR, "Illegal scan direction %d", dir);
+ start = 0; /* keep compiler quiet */
+ }
+
+ ItemPointerSet(current, blkno, start);
+ /* remember which buffer we have pinned */
+ so->btso_curbuf = buf;
+
+ /*
+ * The leftmost/rightmost page could be empty due to deletions;
+ * if so, step until we find a nonempty page.
+ */
+ if (start > maxoff)
+ {
+ if (!_bt_step(scan, &buf, dir))
+ return (RetrieveIndexResult) NULL;
+ start = ItemPointerGetOffsetNumber(current);
+ page = BufferGetPage(buf);
+ }
btitem = (BTItem) PageGetItem(page, PageGetItemId(page, start));
itup = &(btitem->bti_itup);
/* see if we picked a winner */
if (_bt_checkkeys(scan, itup, &keysok))
{
+ /* yes, return it */
res = FormRetrieveIndexResult(current, &(itup->t_tid));
-
- /* remember which buffer we have pinned */
- so->btso_curbuf = buf;
- }
- else if (keysok >= so->numberOfFirstKeys)
- {
- so->btso_curbuf = buf;
- return _bt_next(scan, dir);
}
- else if (keysok == ((Size) -1) && ScanDirectionIsBackward(dir))
+ else if (keysok >= so->numberOfFirstKeys ||
+ (keysok == ((Size) -1) && ScanDirectionIsBackward(dir)))
{
- so->btso_curbuf = buf;
- return _bt_next(scan, dir);
+ /* no, but there might be another one that is */
+ res = _bt_next(scan, dir);
}
else
{
+ /* no tuples in the index match this scan key */
ItemPointerSetInvalid(current);
so->btso_curbuf = InvalidBuffer;
_bt_relbuf(rel, buf, BT_READ);
*
* We use tuplesort.c to sort the given index tuples into order.
* Then we scan the index tuples in order and build the btree pages
- * for each level. When we have only one page on a level, it must be the
- * root -- it can be attached to the btree metapage and we are done.
+ * for each level. We load source tuples into leaf-level pages.
+ * Whenever we fill a page at one level, we add a link to it to its
+ * parent level (starting a new parent level if necessary). When
+ * done, we write out each final page on each level, adding it to
+ * its parent level. When we have only one page on a level, it must be
+ * the root -- it can be attached to the btree metapage and we are done.
*
* this code is moderately slow (~10% slower) compared to the regular
* btree (insertion) build code on sorted or well-clustered data. on
* something like the standard 70% steady-state load factor for btrees
* would probably be better.
*
+ * Another limitation is that we currently load full copies of all keys
+ * into upper tree levels. The leftmost data key in each non-leaf node
+ * could be omitted as far as normal btree operations are concerned
+ * (see README for more info). However, because we build the tree from
+ * the bottom up, we need that data key to insert into the node's parent.
+ * This could be fixed by keeping a spare copy of the minimum key in the
+ * state stack, but I haven't time for that right now.
+ *
*
* Portions Copyright (c) 1996-2000, PostgreSQL, Inc
* Portions Copyright (c) 1994, Regents of the University of California
*
* IDENTIFICATION
- * $Header: /cvsroot/pgsql/src/backend/access/nbtree/nbtsort.c,v 1.54 2000/06/15 04:09:36 momjian Exp $
+ * $Header: /cvsroot/pgsql/src/backend/access/nbtree/nbtsort.c,v 1.55 2000/07/21 06:42:33 tgl Exp $
*
*-------------------------------------------------------------------------
*/
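
The bottom-up build described in the header comment above (fill leaf pages, link each filled page into its parent level, then flush the final page of each level) can be sketched with counters alone. This is a rough model under stated assumptions, not the nbtsort.c code: `DemoBuild`, `demo_add` and `demo_build_height` are hypothetical names, every page holds the same fixed number of items, and high keys and fill factors are ignored.

```c
#include <assert.h>

#define DEMO_MAX_LEVELS 32      /* hypothetical bound on tree height */

/* Hypothetical per-build state: one counter pair per active tree level. */
typedef struct
{
    int items_on_page[DEMO_MAX_LEVELS]; /* items on the page being filled */
    int pages_done[DEMO_MAX_LEVELS];    /* pages finished on this level */
    int nlevels;                        /* number of levels started so far */
    int capacity;                       /* items per page */
} DemoBuild;

/* Add one item to 'level'; a full page is finished and linked into its parent. */
static void
demo_add(DemoBuild *b, int level)
{
    assert(level < DEMO_MAX_LEVELS);
    if (level >= b->nlevels)
        b->nlevels = level + 1;         /* start a new parent level */

    if (b->items_on_page[level] == b->capacity)
    {
        b->pages_done[level]++;         /* page is full: finish it */
        b->items_on_page[level] = 0;
        demo_add(b, level + 1);         /* add a link to it in the parent */
    }
    b->items_on_page[level]++;
}

/* Number of levels in a tree built from 'ntuples' sorted tuples. */
static int
demo_build_height(int ntuples, int capacity)
{
    DemoBuild b = {{0}, {0}, 0, 0};
    int level;

    assert(ntuples >= 0 && capacity >= 2);
    b.capacity = capacity;

    for (int i = 0; i < ntuples; i++)
        demo_add(&b, 0);                /* source tuples go to leaf level */

    /* shutdown: finish the last page on each level, bottom to top */
    for (level = 0; level < b.nlevels; level++)
    {
        if (b.items_on_page[level] > 0)
            b.pages_done[level]++;
        if (b.pages_done[level] <= 1)
            return level + 1;           /* only one page here: the root */
        demo_add(&b, level + 1);        /* link final page into its parent */
    }
    return b.nlevels;                   /* reached only for an empty input */
}
```

With two items per page, five tuples give three leaf pages, two pages on the level above, and a single root, so the model reports a height of 3.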
bool isunique;
};
+/*
+ * Status record for a btree page being built. We have one of these
+ * for each active tree level.
+ */
+typedef struct BTPageState
+{
+ Buffer btps_buf; /* current buffer & page */
+ Page btps_page;
+ OffsetNumber btps_lastoff; /* last item offset loaded */
+ int btps_level;
+ struct BTPageState *btps_next; /* link to parent level, if any */
+} BTPageState;
+
+
#define BTITEMSZ(btitem) \
((btitem) ? \
(IndexTupleDSize((btitem)->bti_itup) + \
static void _bt_load(Relation index, BTSpool *btspool);
-static BTItem _bt_buildadd(Relation index, Size keysz, ScanKey scankey,
- BTPageState *state, BTItem bti, int flags);
+static void _bt_buildadd(Relation index, BTPageState *state,
+ BTItem bti, int flags);
static BTItem _bt_minitem(Page opage, BlockNumber oblkno, int atend);
-static BTPageState *_bt_pagestate(Relation index, int flags,
- int level, bool doupper);
-static void _bt_uppershutdown(Relation index, Size keysz, ScanKey scankey,
- BTPageState *state);
+static BTPageState *_bt_pagestate(Relation index, int flags, int level);
+static void _bt_uppershutdown(Relation index, BTPageState *state);
/*
BTPageOpaque opaque;
*buf = _bt_getbuf(index, P_NEW, BT_WRITE);
-#ifdef NOT_USED
- printf("\tblk=%d\n", BufferGetBlockNumber(*buf));
-#endif
*page = BufferGetPage(*buf);
_bt_pageinit(*page, BufferGetPageSize(*buf));
opaque = (BTPageOpaque) PageGetSpecialPointer(*page);
* is suitable for immediate use by _bt_buildadd.
*/
static BTPageState *
-_bt_pagestate(Relation index, int flags, int level, bool doupper)
+_bt_pagestate(Relation index, int flags, int level)
{
BTPageState *state = (BTPageState *) palloc(sizeof(BTPageState));
MemSet((char *) state, 0, sizeof(BTPageState));
_bt_blnewpage(index, &(state->btps_buf), &(state->btps_page), flags);
- state->btps_firstoff = InvalidOffsetNumber;
state->btps_lastoff = P_HIKEY;
- state->btps_lastbti = (BTItem) NULL;
state->btps_next = (BTPageState *) NULL;
state->btps_level = level;
- state->btps_doupper = doupper;
return state;
}
}
/*
- * add an item to a disk page from a merge tape block.
+ * add an item to a disk page from the sort output.
*
* we must be careful to observe the following restrictions, placed
* upon us by the conventions in nbtsearch.c:
* - rightmost pages start data items at P_HIKEY instead of at
* P_FIRSTKEY.
- * - duplicates cannot be split among pages unless the chain of
- * duplicates starts at the first data item.
*
* a leaf page being built looks like:
*
* +----------------+---------------------------------+
* | PageHeaderData | linp0 linp1 linp2 ... |
* +-----------+----+---------------------------------+
- * | ... linpN | ^ first |
+ * | ... linpN | |
* +-----------+--------------------------------------+
* | ^ last |
* | |
- * | v last |
* +-------------+------------------------------------+
* | | itemN ... |
* +-------------+------------------+-----------------+
* | ... item3 item2 item1 | "special space" |
* +--------------------------------+-----------------+
- * ^ first
*
* contrast this with the diagram in bufpage.h; note the mismatch
* between linps and items. this is because we reserve linp0 as a
* filled up the page, we will set linp0 to point to itemN and clear
* linpN.
*
- * 'last' pointers indicate the last offset/item added to the page.
- * 'first' pointers indicate the first offset/item that is part of a
- * chain of duplicates extending from 'first' to 'last'.
- *
- * if all keys are unique, 'first' will always be the same as 'last'.
+ * 'last' pointer indicates the last offset added to the page.
*/
-static BTItem
-_bt_buildadd(Relation index, Size keysz, ScanKey scankey,
- BTPageState *state, BTItem bti, int flags)
+static void
+_bt_buildadd(Relation index, BTPageState *state, BTItem bti, int flags)
{
Buffer nbuf;
Page npage;
- BTItem last_bti;
- OffsetNumber first_off;
OffsetNumber last_off;
- OffsetNumber off;
Size pgspc;
Size btisz;
nbuf = state->btps_buf;
npage = state->btps_page;
- first_off = state->btps_firstoff;
last_off = state->btps_lastoff;
- last_bti = state->btps_lastbti;
pgspc = PageGetFreeSpace(npage);
btisz = BTITEMSZ(bti);
if (pgspc < btisz)
{
+ /*
+ * Item won't fit on this page, so finish off the page and
+ * write it out.
+ */
Buffer obuf = nbuf;
Page opage = npage;
- OffsetNumber o,
- n;
ItemId ii;
ItemId hii;
+ BTItem nbti;
_bt_blnewpage(index, &nbuf, &npage, flags);
/*
- * if 'last' is part of a chain of duplicates that does not start
- * at the beginning of the old page, the entire chain is copied to
- * the new page; we delete all of the duplicates from the old page
- * except the first, which becomes the high key item of the old
- * page.
+ * We copy the last item on the page into the new page, and then
+ * rearrange the old page so that the 'last item' becomes its high
+ * key rather than a true data item.
*
- * if the chain starts at the beginning of the page or there is no
- * chain ('first' == 'last'), we need only copy 'last' to the new
- * page. again, 'first' (== 'last') becomes the high key of the
- * old page.
- *
- * note that in either case, we copy at least one item to the new
- * page, so 'last_bti' will always be valid. 'bti' will never be
- * the first data item on the new page.
+ * note that since we always copy an item to the new page,
+ * 'bti' will never be the first data item on the new page.
*/
- if (first_off == P_FIRSTKEY)
+ ii = PageGetItemId(opage, last_off);
+ if (PageAddItem(npage, PageGetItem(opage, ii), ii->lp_len,
+ P_FIRSTKEY, LP_USED) == InvalidOffsetNumber)
+ elog(FATAL, "btree: failed to add item to the page in _bt_sort (1)");
+#ifdef FASTBUILD_DEBUG
{
- Assert(last_off != P_FIRSTKEY);
- first_off = last_off;
+ bool isnull;
+ BTItem tmpbti =
+ (BTItem) PageGetItem(npage, PageGetItemId(npage, P_FIRSTKEY));
+ Datum d = index_getattr(&(tmpbti->bti_itup), 1,
+ index->rd_att, &isnull);
+
+ printf("_bt_buildadd: moved <%x> to offset %d at level %d\n",
+ d, P_FIRSTKEY, state->btps_level);
}
- for (o = first_off, n = P_FIRSTKEY;
- o <= last_off;
- o = OffsetNumberNext(o), n = OffsetNumberNext(n))
- {
- ii = PageGetItemId(opage, o);
- if (PageAddItem(npage, PageGetItem(opage, ii),
- ii->lp_len, n, LP_USED) == InvalidOffsetNumber)
- elog(FATAL, "btree: failed to add item to the page in _bt_sort (1)");
-#ifdef FASTBUILD_DEBUG
- {
- bool isnull;
- BTItem tmpbti =
- (BTItem) PageGetItem(npage, PageGetItemId(npage, n));
- Datum d = index_getattr(&(tmpbti->bti_itup), 1,
- index->rd_att, &isnull);
-
- printf("_bt_buildadd: moved <%x> to offset %d at level %d\n",
- d, n, state->btps_level);
- }
#endif
- }
/*
- * this loop is backward because PageIndexTupleDelete shuffles the
- * tuples to fill holes in the page -- by starting at the end and
- * working back, we won't create holes (and thereby avoid
- * shuffling).
+ * Move 'last' into the high key position on opage
*/
- for (o = last_off; o > first_off; o = OffsetNumberPrev(o))
- PageIndexTupleDelete(opage, o);
hii = PageGetItemId(opage, P_HIKEY);
- ii = PageGetItemId(opage, first_off);
*hii = *ii;
ii->lp_flags &= ~LP_USED;
((PageHeader) opage)->pd_lower -= sizeof(ItemIdData);
- first_off = P_FIRSTKEY;
+ /*
+ * Reset last_off to point to new page
+ */
last_off = PageGetMaxOffsetNumber(npage);
- last_bti = (BTItem) PageGetItem(npage, PageGetItemId(npage, last_off));
/*
* set the page (side link) pointers.
oopaque->btpo_next = BufferGetBlockNumber(nbuf);
nopaque->btpo_prev = BufferGetBlockNumber(obuf);
nopaque->btpo_next = P_NONE;
-
- if (_bt_itemcmp(index, keysz, scankey,
- (BTItem) PageGetItem(opage, PageGetItemId(opage, P_HIKEY)),
- (BTItem) PageGetItem(opage, PageGetItemId(opage, P_FIRSTKEY)),
- BTEqualStrategyNumber))
- oopaque->btpo_flags |= BTP_CHAIN;
}
/*
- * copy the old buffer's minimum key to its parent. if we don't
- * have a parent, we have to create one; this adds a new btree
- * level.
+ * Link the old buffer into its parent, using its minimum key.
+ * If we don't have a parent, we have to create one;
+ * this adds a new btree level.
*/
- if (state->btps_doupper)
+ if (state->btps_next == (BTPageState *) NULL)
{
- BTItem nbti;
-
- if (state->btps_next == (BTPageState *) NULL)
- {
- state->btps_next =
- _bt_pagestate(index, 0, state->btps_level + 1, true);
- }
- nbti = _bt_minitem(opage, BufferGetBlockNumber(obuf), 0);
- _bt_buildadd(index, keysz, scankey, state->btps_next, nbti, 0);
- pfree((void *) nbti);
+ state->btps_next =
+ _bt_pagestate(index, 0, state->btps_level + 1);
}
+ nbti = _bt_minitem(opage, BufferGetBlockNumber(obuf), 0);
+ _bt_buildadd(index, state->btps_next, nbti, 0);
+ pfree((void *) nbti);
/*
* write out the old stuff. we never want to see it again, so we
}
/*
- * if this item is different from the last item added, we start a new
- * chain of duplicates.
+ * Add the new item into the current page.
*/
- off = OffsetNumberNext(last_off);
- if (PageAddItem(npage, (Item) bti, btisz, off, LP_USED) == InvalidOffsetNumber)
+ last_off = OffsetNumberNext(last_off);
+ if (PageAddItem(npage, (Item) bti, btisz,
+ last_off, LP_USED) == InvalidOffsetNumber)
elog(FATAL, "btree: failed to add item to the page in _bt_sort (2)");
#ifdef FASTBUILD_DEBUG
{
Datum d = index_getattr(&(bti->bti_itup), 1, index->rd_att, &isnull);
printf("_bt_buildadd: inserted <%x> at offset %d at level %d\n",
- d, off, state->btps_level);
+ d, last_off, state->btps_level);
}
#endif
- if (last_bti == (BTItem) NULL)
- first_off = P_FIRSTKEY;
- else if (!_bt_itemcmp(index, keysz, scankey,
- bti, last_bti, BTEqualStrategyNumber))
- first_off = off;
- last_off = off;
- last_bti = (BTItem) PageGetItem(npage, PageGetItemId(npage, off));
state->btps_buf = nbuf;
state->btps_page = npage;
- state->btps_lastbti = last_bti;
state->btps_lastoff = last_off;
- state->btps_firstoff = first_off;
-
- return last_bti;
}
+/*
+ * Finish writing out the completed btree.
+ */
static void
-_bt_uppershutdown(Relation index, Size keysz, ScanKey scankey,
- BTPageState *state)
+_bt_uppershutdown(Relation index, BTPageState *state)
{
BTPageState *s;
BlockNumber blkno;
BTPageOpaque opaque;
BTItem bti;
+ /*
+ * Each iteration of this loop completes one more level of the tree.
+ */
for (s = state; s != (BTPageState *) NULL; s = s->btps_next)
{
blkno = BufferGetBlockNumber(s->btps_buf);
opaque = (BTPageOpaque) PageGetSpecialPointer(s->btps_page);
/*
- * if this is the root, attach it to the metapage. otherwise,
- * stick the minimum key of the last page on this level (which has
- * not been split, or else it wouldn't be the last page) into its
- * parent. this may cause the last page of upper levels to split,
- * but that's not a problem -- we haven't gotten to them yet.
+ * We have to link the last page on this level to somewhere.
+ *
+ * If we're at the top, it's the root, so attach it to the metapage.
+ * Otherwise, add an entry for it to its parent using its minimum
+ * key. This may cause the last page of the parent level to split,
+ * but that's not a problem -- we haven't gotten to it yet.
*/
- if (s->btps_doupper)
+ if (s->btps_next == (BTPageState *) NULL)
{
- if (s->btps_next == (BTPageState *) NULL)
- {
- opaque->btpo_flags |= BTP_ROOT;
- _bt_metaproot(index, blkno, s->btps_level + 1);
- }
- else
- {
- bti = _bt_minitem(s->btps_page, blkno, 0);
- _bt_buildadd(index, keysz, scankey, s->btps_next, bti, 0);
- pfree((void *) bti);
- }
+ opaque->btpo_flags |= BTP_ROOT;
+ _bt_metaproot(index, blkno, s->btps_level + 1);
+ }
+ else
+ {
+ bti = _bt_minitem(s->btps_page, blkno, 0);
+ _bt_buildadd(index, s->btps_next, bti, 0);
+ pfree((void *) bti);
}
/*
- * this is the rightmost page, so the ItemId array needs to be
- * slid back one slot.
+ * This is the rightmost page, so the ItemId array needs to be
+ * slid back one slot. Then we can dump out the page.
*/
_bt_slideleft(index, s->btps_buf, s->btps_page);
_bt_wrtbuf(index, s->btps_buf);
static void
_bt_load(Relation index, BTSpool *btspool)
{
- BTPageState *state;
- ScanKey skey;
- int natts;
- BTItem bti;
- bool should_free;
-
- /*
- * initialize state needed for the merge into the btree leaf pages.
- */
- state = _bt_pagestate(index, BTP_LEAF, 0, true);
-
- skey = _bt_mkscankey_nodata(index);
- natts = RelationGetNumberOfAttributes(index);
+ BTPageState *state = NULL;
for (;;)
{
+ BTItem bti;
+ bool should_free;
+
bti = (BTItem) tuplesort_getindextuple(btspool->sortstate, true,
&should_free);
if (bti == (BTItem) NULL)
break;
- _bt_buildadd(index, natts, skey, state, bti, BTP_LEAF);
+
+ /* When we see first tuple, create first index page */
+ if (state == NULL)
+ state = _bt_pagestate(index, BTP_LEAF, 0);
+
+ _bt_buildadd(index, state, bti, BTP_LEAF);
if (should_free)
pfree((void *) bti);
}
- _bt_uppershutdown(index, natts, skey, state);
-
- _bt_freeskey(skey);
+ if (state != NULL)
+ _bt_uppershutdown(index, state);
}
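The bottom-up load performed by _bt_load, _bt_buildadd and _bt_uppershutdown above can be illustrated with a much smaller model. The following is only a sketch under stated assumptions, not the nbtsort.c implementation: keys are plain integers, a page holds a fixed number of keys, high keys and side links are ignored, and every toy_* name and TOY_* constant is hypothetical. It keeps one state record per active tree level and pushes the minimum key of each finished page into the parent level, creating that level on demand.

```c
#include <assert.h>
#include <stdlib.h>

#define TOY_PAGE_CAP   4    /* keys per page; hypothetical, tiny on purpose */
#define TOY_MAX_LEVELS 16   /* size callers should give pages_per_level[] */

/* Per-level build state, loosely modeled on BTPageState (an assumption). */
typedef struct ToyLevel
{
    int nkeys;              /* keys on the page currently being filled */
    int minkey;             /* first key placed on that page */
    int npages;             /* pages started on this level so far */
    struct ToyLevel *next;  /* link to parent level, if any */
} ToyLevel;

static ToyLevel *
toy_newlevel(void)
{
    ToyLevel *lv = calloc(1, sizeof(ToyLevel));

    assert(lv != NULL);
    lv->npages = 1;         /* a new level starts with one empty page */
    return lv;
}

/* Add a key to the current page of a level; finish the page if it is full. */
static void
toy_buildadd(ToyLevel *lv, int key)
{
    if (lv->nkeys == TOY_PAGE_CAP)
    {
        /* Link the full page into its parent, using its minimum key. */
        if (lv->next == NULL)
            lv->next = toy_newlevel();  /* this adds a new tree level */
        toy_buildadd(lv->next, lv->minkey);

        /* Start a fresh page on this level. */
        lv->nkeys = 0;
        lv->npages++;
    }
    if (lv->nkeys == 0)
        lv->minkey = key;
    lv->nkeys++;
}

/*
 * Finish the tree.  Each iteration completes one more level: the last page
 * of the level is linked into its parent, or is the root if there is none.
 * Returns the number of levels and fills pages_per_level[] from the leaves up.
 */
static int
toy_uppershutdown(ToyLevel *leaf, int *pages_per_level)
{
    int       nlevels = 0;
    ToyLevel *lv = leaf;

    while (lv != NULL)
    {
        ToyLevel *parent = lv->next;

        if (parent != NULL)
            toy_buildadd(parent, lv->minkey);  /* may split the parent */
        assert(nlevels < TOY_MAX_LEVELS);
        pages_per_level[nlevels++] = lv->npages;
        free(lv);
        lv = parent;
    }
    return nlevels;
}

/* Load the sorted keys 0 .. nkeys-1; an empty input builds no pages at all. */
static int
toy_load(int nkeys, int *pages_per_level)
{
    ToyLevel *leaf = NULL;

    for (int k = 0; k < nkeys; k++)
    {
        if (leaf == NULL)   /* create the first page when the first key shows up */
            leaf = toy_newlevel();
        toy_buildadd(leaf, k);
    }
    return (leaf != NULL) ? toy_uppershutdown(leaf, pages_per_level) : 0;
}
```

With a capacity of four keys per page, loading 17 keys in this model yields five leaf pages, two pages on the next level and a single root page. The lazy creation of the leaf state mirrors the patch's `state == NULL` test in _bt_load.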
*
*
* IDENTIFICATION
- * $Header: /cvsroot/pgsql/src/backend/access/nbtree/nbtutils.c,v 1.37 2000/05/30 04:24:33 tgl Exp $
+ * $Header: /cvsroot/pgsql/src/backend/access/nbtree/nbtutils.c,v 1.38 2000/07/21 06:42:33 tgl Exp $
*
*-------------------------------------------------------------------------
*/
#include "access/nbtree.h"
#include "executor/execdebug.h"
-extern int NIndexTupleProcessed;
-
/*
* _bt_mkscankey
* Build a scan key that contains comparison data from itup
* as well as comparator routines appropriate to the key datatypes.
*
- * The result is intended for use with _bt_skeycmp() or _bt_compare(),
- * although it could be used with _bt_itemcmp() or _bt_tuplecompare().
+ * The result is intended for use with _bt_compare().
*/
ScanKey
_bt_mkscankey(Relation rel, IndexTuple itup)
* Build a scan key that contains comparator routines appropriate to
* the key datatypes, but no comparison data.
*
- * The result can be used with _bt_itemcmp() or _bt_tuplecompare(),
- * but not with _bt_skeycmp() or _bt_compare().
+ * The result cannot be used with _bt_compare(). Currently this
+ * routine is only called by utils/sort/tuplesort.c, which has its
+ * own comparison routine.
*/
ScanKey
_bt_mkscankey_nodata(Relation rel)
{
ostack = stack;
stack = stack->bts_parent;
- pfree(ostack->bts_btitem);
pfree(ostack);
}
}
Size tuplen;
extern Oid newoid();
- /*
- * see comments in btbuild
- *
- * if (itup->t_info & INDEX_NULL_MASK) elog(ERROR, "btree indices cannot
- * include null keys");
- */
-
/* make a copy of the index tuple with room for the sequence number */
tuplen = IndexTupleSize(itup);
nbytes_btitem = tuplen + (sizeof(BTItemData) - sizeof(IndexTupleData));
btitem = (BTItem) palloc(nbytes_btitem);
- memmove((char *) &(btitem->bti_itup), (char *) itup, tuplen);
+ memcpy((char *) &(btitem->bti_itup), (char *) itup, tuplen);
return btitem;
}
-#ifdef NOT_USED
-bool
-_bt_checkqual(IndexScanDesc scan, IndexTuple itup)
-{
- BTScanOpaque so;
-
- so = (BTScanOpaque) scan->opaque;
- if (so->numberOfKeys > 0)
- return (index_keytest(itup, RelationGetDescr(scan->relation),
- so->numberOfKeys, so->keyData));
- else
- return true;
-}
-
-#endif
-
-#ifdef NOT_USED
-bool
-_bt_checkforkeys(IndexScanDesc scan, IndexTuple itup, Size keysz)
-{
- BTScanOpaque so;
-
- so = (BTScanOpaque) scan->opaque;
- if (keysz > 0 && so->numberOfKeys >= keysz)
- return (index_keytest(itup, RelationGetDescr(scan->relation),
- keysz, so->keyData));
- else
- return true;
-}
-
-#endif
-
bool
_bt_checkkeys(IndexScanDesc scan, IndexTuple tuple, Size *keysok)
{
*
*
* IDENTIFICATION
- * $Header: /cvsroot/pgsql/src/backend/storage/page/bufpage.c,v 1.30 2000/07/03 02:54:16 vadim Exp $
+ * $Header: /cvsroot/pgsql/src/backend/storage/page/bufpage.c,v 1.31 2000/07/21 06:42:33 tgl Exp $
*
*-------------------------------------------------------------------------
*/
#include "storage/bufpage.h"
+
static void PageIndexTupleDeleteAdjustLinePointers(PageHeader phdr,
char *location, Size size);
-static bool PageManagerShuffle = true; /* default is shuffle mode */
/* ----------------------------------------------------------------
* Page support functions
/* ----------------
* PageAddItem
*
- * add an item to a page.
- *
- * !!! ELOG(ERROR) IS DISALLOWED HERE !!!
+ * Add an item to a page. Return value is offset at which it was
+ * inserted, or InvalidOffsetNumber if there's not room to insert.
*
- * Notes on interface:
- * If offsetNumber is valid, shuffle ItemId's down to make room
- * to use it, if PageManagerShuffle is true. If PageManagerShuffle is
- * false, then overwrite the specified ItemId. (PageManagerShuffle is
- * true by default, and is modified by calling PageManagerModeSet.)
+ * If offsetNumber is valid and <= current max offset in the page,
+ * insert item into the array at that position by shuffling ItemId's
+ * down to make room.
* If offsetNumber is not valid, then assign one by finding the first
* one that is both unused and deallocated.
*
- * NOTE: If offsetNumber is valid, and PageManagerShuffle is true, it
- * is assumed that there is room on the page to shuffle the ItemId's
- * down by one.
+ * !!! ELOG(ERROR) IS DISALLOWED HERE !!!
+ *
* ----------------
*/
OffsetNumber
Offset lower;
Offset upper;
ItemId itemId;
- ItemId fromitemId,
- toitemId;
OffsetNumber limit;
-
- bool shuffled = false;
+ bool needshuffle = false;
/*
* Find first unallocated offsetNumber
/* was offsetNumber passed in? */
if (OffsetNumberIsValid(offsetNumber))
{
- if (PageManagerShuffle == true)
- {
- /* shuffle ItemId's (Do the PageManager Shuffle...) */
- for (i = (limit - 1); i >= offsetNumber; i--)
- {
- fromitemId = &((PageHeader) page)->pd_linp[i - 1];
- toitemId = &((PageHeader) page)->pd_linp[i];
- *toitemId = *fromitemId;
- }
- shuffled = true; /* need to increase "lower" */
- }
- else
- { /* overwrite mode */
- itemId = &((PageHeader) page)->pd_linp[offsetNumber - 1];
- if (((*itemId).lp_flags & LP_USED) ||
- ((*itemId).lp_len != 0))
- {
- elog(NOTICE, "PageAddItem: tried overwrite of used ItemId");
- return InvalidOffsetNumber;
- }
- }
+ needshuffle = true; /* need to increase "lower" */
+ /* don't actually do the shuffle till we've checked free space! */
}
else
- { /* offsetNumber was not passed in, so find
- * one */
+ {
+ /* offsetNumber was not passed in, so find one */
/* look for "recyclable" (unused & deallocated) ItemId */
for (offsetNumber = 1; offsetNumber < limit; offsetNumber++)
{
break;
}
}
+
+ /*
+ * Compute new lower and upper pointers for page, see if it'll fit
+ */
if (offsetNumber > limit)
lower = (Offset) (((char *) (&((PageHeader) page)->pd_linp[offsetNumber])) - ((char *) page));
- else if (offsetNumber == limit || shuffled == true)
+ else if (offsetNumber == limit || needshuffle)
lower = ((PageHeader) page)->pd_lower + sizeof(ItemIdData);
else
lower = ((PageHeader) page)->pd_lower;
if (lower > upper)
return InvalidOffsetNumber;
+ /*
+ * OK to insert the item. First, shuffle the existing pointers if needed.
+ */
+ if (needshuffle)
+ {
+ /* shuffle ItemId's (Do the PageManager Shuffle...) */
+ for (i = (limit - 1); i >= offsetNumber; i--)
+ {
+ ItemId fromitemId,
+ toitemId;
+
+ fromitemId = &((PageHeader) page)->pd_linp[i - 1];
+ toitemId = &((PageHeader) page)->pd_linp[i];
+ *toitemId = *fromitemId;
+ }
+ }
+
itemId = &((PageHeader) page)->pd_linp[offsetNumber - 1];
(*itemId).lp_off = upper;
(*itemId).lp_len = size;
PageHeader thdr;
pageSize = PageGetPageSize(page);
-
- if ((temp = (Page) palloc(pageSize)) == (Page) NULL)
- elog(FATAL, "Cannot allocate %d bytes for temp page.", pageSize);
+ temp = (Page) palloc(pageSize);
thdr = (PageHeader) temp;
/* copy old page in */
return space;
}
-/*
- * PageManagerModeSet
- *
- * Sets mode to either: ShufflePageManagerMode (the default) or
- * OverwritePageManagerMode. For use by access methods code
- * for determining semantics of PageAddItem when the offsetNumber
- * argument is passed in.
- */
-void
-PageManagerModeSet(PageManagerMode mode)
-{
- if (mode == ShufflePageManagerMode)
- PageManagerShuffle = true;
- else if (mode == OverwritePageManagerMode)
- PageManagerShuffle = false;
-}
-
/*
*----------------------------------------------------------------
* PageIndexTupleDelete
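The reordering in PageAddItem above (decide whether a shuffle is needed, check free space, and only then move the line pointers) can be shown on a toy structure. This is a sketch under assumptions, not the bufpage.c code: ToyPage, toy_add_item and the TOY_* constants are hypothetical, the "page" is a fixed array of integers, and free space is modeled as a simple slot count. The point is that a failed insert leaves the page untouched.

```c
#include <assert.h>

#define TOY_SLOTS          4   /* hypothetical capacity of the pointer array */
#define TOY_INVALID_OFFSET 0   /* stands in for InvalidOffsetNumber */

typedef struct ToyPage
{
    int nused;                 /* number of slots currently in use */
    int slot[TOY_SLOTS];       /* stands in for the ItemId array */
} ToyPage;

/*
 * Insert 'value' at 1-based offset 'off' (1 <= off <= nused + 1).
 * Returns the offset used, or TOY_INVALID_OFFSET if there is no room,
 * in which case the page is not modified.
 */
static int
toy_add_item(ToyPage *page, int value, int off)
{
    int i;

    if (off < 1 || off > page->nused + 1)
        return TOY_INVALID_OFFSET;

    /* Check free space BEFORE moving anything. */
    if (page->nused >= TOY_SLOTS)
        return TOY_INVALID_OFFSET;

    /* OK to insert: shuffle later slots down by one to open the gap. */
    for (i = page->nused; i >= off; i--)
        page->slot[i] = page->slot[i - 1];

    page->slot[off - 1] = value;
    page->nused++;
    return off;
}
```

Doing the space check first means the failure path needs no undo step, which matches the patch's comment about not shuffling until free space has been checked.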
* Portions Copyright (c) 1996-2000, PostgreSQL, Inc
* Portions Copyright (c) 1994, Regents of the University of California
*
- * $Id: nbtree.h,v 1.38 2000/06/15 03:32:31 momjian Exp $
+ * $Id: nbtree.h,v 1.39 2000/07/21 06:42:35 tgl Exp $
*
*-------------------------------------------------------------------------
*/
* info. In addition, we need to know what sort of page this is
* (leaf or internal), and whether the page is available for reuse.
*
- * Lehman and Yao's algorithm requires a ``high key'' on every page.
- * The high key on a page is guaranteed to be greater than or equal
- * to any key that appears on this page. Our insertion algorithm
- * guarantees that we can use the initial least key on our right
- * sibling as the high key. We allocate space for the line pointer
- * to the high key in the opaque data at the end of the page.
- *
- * Rightmost pages in the tree have no high key.
+ * We also store a back-link to the parent page, but this cannot be trusted
+ * very far since it does not get updated when the parent is split.
+ * See backend/access/nbtree/README for details.
*/
typedef struct BTPageOpaqueData
BlockNumber btpo_parent;
uint16 btpo_flags;
-#define BTP_LEAF (1 << 0)
-#define BTP_ROOT (1 << 1)
-#define BTP_FREE (1 << 2)
-#define BTP_META (1 << 3)
-#define BTP_CHAIN (1 << 4)
+/* Bits defined in btpo_flags */
+#define BTP_LEAF (1 << 0) /* It's a leaf page */
+#define BTP_ROOT (1 << 1) /* It's the root page (has no parent) */
+#define BTP_FREE (1 << 2) /* not currently used... */
+#define BTP_META (1 << 3) /* Set in the meta-page only */
} BTPageOpaqueData;
typedef BTScanOpaqueData *BTScanOpaque;
/*
- * BTItems are what we store in the btree. Each item has an index
- * tuple, including key and pointer values. In addition, we must
- * guarantee that all tuples in the index are unique, in order to
- * satisfy some assumptions in Lehman and Yao. The way that we do
- * this is by generating a new OID for every insertion that we do in
- * the tree. This adds eight bytes to the size of btree index
- * tuples. Note that we do not use the OID as part of a composite
- * key; the OID only serves as a unique identifier for a given index
- * tuple (logical position within a page).
+ * BTItems are what we store in the btree. Each item is an index tuple,
+ * including key and pointer values. (In some cases either the key or the
+ * pointer may go unused, see backend/access/nbtree/README for details.)
+ *
+ * Old comments:
+ * In addition, we must guarantee that all tuples in the index are unique,
+ * in order to satisfy some assumptions in Lehman and Yao. The way that we
+ * do this is by generating a new OID for every insertion that we do in the
+ * tree. This adds eight bytes to the size of btree index tuples. Note
+ * that we do not use the OID as part of a composite key; the OID only
+ * serves as a unique identifier for a given index tuple (logical position
+ * within a page).
*
* New comments:
* actually, we must guarantee that all tuples in A LEVEL
* are unique, not in ALL INDEX. So, we can use bti_itup->t_tid
* as unique identifier for a given index tuple (logical position
- * within a level). - vadim 04/09/97
+ * within a level). - vadim 04/09/97
*/
typedef struct BTItemData
typedef BTItemData *BTItem;
-#define BTItemSame(i1, i2) ( i1->bti_itup.t_tid.ip_blkid.bi_hi == \
- i2->bti_itup.t_tid.ip_blkid.bi_hi && \
- i1->bti_itup.t_tid.ip_blkid.bi_lo == \
- i2->bti_itup.t_tid.ip_blkid.bi_lo && \
- i1->bti_itup.t_tid.ip_posid == \
- i2->bti_itup.t_tid.ip_posid )
+/* Test whether items are the "same" per the above notes */
+#define BTItemSame(i1, i2) ( (i1)->bti_itup.t_tid.ip_blkid.bi_hi == \
+ (i2)->bti_itup.t_tid.ip_blkid.bi_hi && \
+ (i1)->bti_itup.t_tid.ip_blkid.bi_lo == \
+ (i2)->bti_itup.t_tid.ip_blkid.bi_lo && \
+ (i1)->bti_itup.t_tid.ip_posid == \
+ (i2)->bti_itup.t_tid.ip_posid )
/*
* BTStackData -- As we descend a tree, we push the (key, pointer)
{
BlockNumber bts_blkno;
OffsetNumber bts_offset;
- BTItem bts_btitem;
+ BTItemData bts_btitem;
struct BTStackData *bts_parent;
} BTStackData;
typedef BTStackData *BTStack;
-typedef struct BTPageState
-{
- Buffer btps_buf;
- Page btps_page;
- BTItem btps_lastbti;
- OffsetNumber btps_lastoff;
- OffsetNumber btps_firstoff;
- int btps_level;
- bool btps_doupper;
- struct BTPageState *btps_next;
-} BTPageState;
-
/*
* We need to be able to tell the difference between read and write
* requests for pages, in order to do locking correctly.
#define BT_READ BUFFER_LOCK_SHARE
#define BT_WRITE BUFFER_LOCK_EXCLUSIVE
-/*
- * Similarly, the difference between insertion and non-insertion binary
- * searches on a given page makes a difference when we're descending the
- * tree.
- */
-
-#define BT_INSERTION 0
-#define BT_DESCENT 1
-
/*
* In general, the btree code tries to localize its knowledge about
* page layout to a couple of routines. However, we need a special
* value to indicate "no page number" in those places where we expect
- * page numbers.
+ * page numbers. We can use zero for this because we never need to
+ * make a pointer to the metadata page.
*/
#define P_NONE 0
+
+/*
+ * Macros to test whether a page is leftmost or rightmost on its tree level,
+ * as well as other state info kept in the opaque data.
+ */
#define P_LEFTMOST(opaque) ((opaque)->btpo_prev == P_NONE)
#define P_RIGHTMOST(opaque) ((opaque)->btpo_next == P_NONE)
+#define P_ISLEAF(opaque) ((opaque)->btpo_flags & BTP_LEAF)
+#define P_ISROOT(opaque) ((opaque)->btpo_flags & BTP_ROOT)
+
+/*
+ * Lehman and Yao's algorithm requires a ``high key'' on every non-rightmost
+ * page. The high key is not a data key, but gives info about what range of
+ * keys is supposed to be on this page. The high key on a page is required
+ * to be greater than or equal to any data key that appears on the page.
+ * If we find ourselves trying to insert a key > high key, we know we need
+ * to move right (this should only happen if the page was split since we
+ * examined the parent page).
+ *
+ * Our insertion algorithm guarantees that we can use the initial least key
+ * on our right sibling as the high key. Once a page is created, its high
+ * key changes only if the page is split.
+ *
+ * On a non-rightmost page, the high key lives in item 1 and data items
+ * start in item 2. Rightmost pages have no high key, so we store data
+ * items beginning in item 1.
+ */
#define P_HIKEY ((OffsetNumber) 1)
#define P_FIRSTKEY ((OffsetNumber) 2)
+#define P_FIRSTDATAKEY(opaque) (P_RIGHTMOST(opaque) ? P_HIKEY : P_FIRSTKEY)
/*
- * Strategy numbers -- ordering of these is <, <=, =, >=, >
+ * Operator strategy numbers -- ordering of these is <, <=, =, >=, >
*/
#define BTLessStrategyNumber 1
#define BTORDER_PROC 1
+/*
+ * prototypes for functions in nbtree.c (external entry points for btree)
+ */
+extern bool BuildingBtree; /* in nbtree.c */
+
+extern Datum btbuild(PG_FUNCTION_ARGS);
+extern Datum btinsert(PG_FUNCTION_ARGS);
+extern Datum btgettuple(PG_FUNCTION_ARGS);
+extern Datum btbeginscan(PG_FUNCTION_ARGS);
+extern Datum btrescan(PG_FUNCTION_ARGS);
+extern void btmovescan(IndexScanDesc scan, Datum v);
+extern Datum btendscan(PG_FUNCTION_ARGS);
+extern Datum btmarkpos(PG_FUNCTION_ARGS);
+extern Datum btrestrpos(PG_FUNCTION_ARGS);
+extern Datum btdelete(PG_FUNCTION_ARGS);
+
/*
* prototypes for functions in nbtinsert.c
*/
extern InsertIndexResult _bt_doinsert(Relation rel, BTItem btitem,
bool index_is_unique, Relation heapRel);
-extern bool _bt_itemcmp(Relation rel, Size keysz, ScanKey scankey,
- BTItem item1, BTItem item2, StrategyNumber strat);
/*
* prototypes for functions in nbtpage.c
extern void _bt_wrtnorelbuf(Relation rel, Buffer buf);
extern void _bt_pageinit(Page page, Size size);
extern void _bt_metaproot(Relation rel, BlockNumber rootbknum, int level);
-extern Buffer _bt_getstackbuf(Relation rel, BTStack stack, int access);
extern void _bt_pagedel(Relation rel, ItemPointer tid);
-/*
- * prototypes for functions in nbtree.c
- */
-extern bool BuildingBtree; /* in nbtree.c */
-
-extern Datum btbuild(PG_FUNCTION_ARGS);
-extern Datum btinsert(PG_FUNCTION_ARGS);
-extern Datum btgettuple(PG_FUNCTION_ARGS);
-extern Datum btbeginscan(PG_FUNCTION_ARGS);
-extern Datum btrescan(PG_FUNCTION_ARGS);
-extern void btmovescan(IndexScanDesc scan, Datum v);
-extern Datum btendscan(PG_FUNCTION_ARGS);
-extern Datum btmarkpos(PG_FUNCTION_ARGS);
-extern Datum btrestrpos(PG_FUNCTION_ARGS);
-extern Datum btdelete(PG_FUNCTION_ARGS);
-
/*
* prototypes for functions in nbtscan.c
*/
* prototypes for functions in nbtsearch.c
*/
extern BTStack _bt_search(Relation rel, int keysz, ScanKey scankey,
- Buffer *bufP);
+ Buffer *bufP, int access);
extern Buffer _bt_moveright(Relation rel, Buffer buf, int keysz,
ScanKey scankey, int access);
-extern bool _bt_skeycmp(Relation rel, Size keysz, ScanKey scankey,
- Page page, ItemId itemid, StrategyNumber strat);
extern OffsetNumber _bt_binsrch(Relation rel, Buffer buf, int keysz,
- ScanKey scankey, int srchtype);
+ ScanKey scankey);
+extern int32 _bt_compare(Relation rel, int keysz, ScanKey scankey,
+ Page page, OffsetNumber offnum);
extern RetrieveIndexResult _bt_next(IndexScanDesc scan, ScanDirection dir);
extern RetrieveIndexResult _bt_first(IndexScanDesc scan, ScanDirection dir);
extern bool _bt_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir);
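The high-key layout described in the nbtree.h comments above (item 1 is the high key on a non-rightmost page, while rightmost pages store data from item 1) can be exercised with stand-in definitions. The sketch below is an assumption-laden illustration: the TOY_* macros only mirror the shape of P_RIGHTMOST and P_FIRSTDATAKEY, and ToyOpaque and toy_ndataitems are hypothetical names that do not appear in the patch.

```c
#include <assert.h>

typedef unsigned short ToyOffset;          /* stands in for OffsetNumber */

#define TOY_NONE     0                     /* "no page", like P_NONE */
#define TOY_HIKEY    ((ToyOffset) 1)
#define TOY_FIRSTKEY ((ToyOffset) 2)

typedef struct ToyOpaque
{
    unsigned next;                         /* right sibling, or TOY_NONE */
} ToyOpaque;

#define TOY_RIGHTMOST(opaque)    ((opaque)->next == TOY_NONE)
#define TOY_FIRSTDATAKEY(opaque) \
    (TOY_RIGHTMOST(opaque) ? TOY_HIKEY : TOY_FIRSTKEY)

/*
 * Count the data items on a page whose highest used offset is 'maxoff'.
 * A non-rightmost page spends item 1 on its high key, so it holds one
 * fewer data item than a rightmost page with the same 'maxoff'.
 */
static int
toy_ndataitems(const ToyOpaque *opaque, ToyOffset maxoff)
{
    ToyOffset first = TOY_FIRSTDATAKEY(opaque);

    return (maxoff >= first) ? (int) (maxoff - first) + 1 : 0;
}
```

A loop over data items would likewise start at the value of the first-data-key macro rather than hard-coding 1 or 2, so the same code can serve both rightmost and non-rightmost pages.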
* Portions Copyright (c) 1996-2000, PostgreSQL, Inc
* Portions Copyright (c) 1994, Regents of the University of California
*
- * $Id: bufpage.h,v 1.30 2000/07/03 02:54:21 vadim Exp $
+ * $Id: bufpage.h,v 1.31 2000/07/21 06:42:39 tgl Exp $
*
*-------------------------------------------------------------------------
*/
extern void PageRestoreTempPage(Page tempPage, Page oldPage);
extern void PageRepairFragmentation(Page page);
extern Size PageGetFreeSpace(Page page);
-extern void PageManagerModeSet(PageManagerMode mode);
extern void PageIndexTupleDelete(Page page, OffsetNumber offset);