From: Tom Lane
Date: Fri, 21 Jul 2000 06:42:39 +0000 (+0000)
Subject: Major overhaul of btree index code. Eliminate special BTP_CHAIN logic for
X-Git-Tag: REL7_1_BETA~881
X-Git-Url: https://granicus.if.org/sourcecode?a=commitdiff_plain;h=9e85183bfc3;p=postgresql

Major overhaul of btree index code. Eliminate special BTP_CHAIN logic for
duplicate keys by letting search go to the left rather than right when an
equal key is seen at an upper tree level. Fix poor choice of page split
point (leading to insertion failures) that was forced by chaining logic.
Don't store leftmost key in non-leaf pages, since it's not necessary.
Don't create root page until something is first stored in the index, so an
unused index is now 8K not 16K. (Doesn't seem to be as easy to get rid of
the metadata page, unfortunately.) Massive cleanup of unreadable code,
fix poor, obsolete, and just plain wrong documentation and comments.
See src/backend/access/nbtree/README for the gory details.
---

diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README
index a204ad4af0..9ae596ab23 100644
--- a/src/backend/access/nbtree/README
+++ b/src/backend/access/nbtree/README
@@ -1,68 +1,175 @@
-$Header: /cvsroot/pgsql/src/backend/access/nbtree/README,v 1.1.1.1 1996/07/09 06:21:12 scrappy Exp $
+$Header: /cvsroot/pgsql/src/backend/access/nbtree/README,v 1.2 2000/07/21 06:42:32 tgl Exp $
 This directory contains a correct implementation of Lehman and Yao's
-btree management algorithm that supports concurrent access for Postgres.
+high-concurrency B-tree management algorithm (P. Lehman and S. Yao,
+Efficient Locking for Concurrent Operations on B-Trees, ACM Transactions
+on Database Systems, Vol 6, No. 4, December 1981, pp 650-670).
+
 We have made the following changes in order to incorporate their algorithm into Postgres:
-	+ The requirement that all btree keys be unique is too onerous,
-	  but the algorithm won't work correctly without it.
As a result,
-	  this implementation adds an OID (guaranteed to be unique) to
-	  every key in the index. This guarantees uniqueness within a set
-	  of duplicates. Space overhead is four bytes.
-
-	  For this reason, when we're passed an index tuple to store by the
-	  common access method code, we allocate a larger one and copy the
-	  supplied tuple into it. No Postgres code outside of the btree
-	  access method knows about this xid or sequence number.
-
-	+ Lehman and Yao don't require read locks, but assume that in-
-	  memory copies of tree nodes are unshared. Postgres shares
-	  in-memory buffers among backends. As a result, we do page-
-	  level read locking on btree nodes in order to guarantee that
-	  no record is modified while we are examining it. This reduces
-	  concurrency but guaranteees correct behavior.
-
-	+ Read locks on a page are held for as long as a scan has a pointer
-	  to the page. However, locks are always surrendered before the
-	  sibling page lock is acquired (for readers), so we remain deadlock-
-	  free. I will do a formal proof if I get bored anytime soon.
++ The requirement that all btree keys be unique is too onerous,
+  but the algorithm won't work correctly without it. Fortunately, it is
+  only necessary that keys be unique on a single tree level, because L&Y
+  only use the assumption of key uniqueness when re-finding a key in a
+  parent node (to determine where to insert the key for a split page).
+  Therefore, we can use the link field to disambiguate multiple
+  occurrences of the same user key: only one entry in the parent level
+  will be pointing at the page we had split. (Indeed we need not look at
+  the real "key" at all, just at the link field.) We can distinguish
+  items at the leaf level in the same way, by examining their links to
+  heap tuples; we'd never have two items for the same heap tuple.
+
++ Lehman and Yao assume that the key range for a subtree S is described
+  by Ki < v <= Ki+1 where Ki and Ki+1 are the adjacent keys in the parent
+  node. This does not work for nonunique keys (for example, if we have
+  enough equal keys to spread across several leaf pages, there *must* be
+  some equal bounding keys in the first level up). Therefore we assume
+  Ki <= v <= Ki+1 instead. A search that finds exact equality to a
+  bounding key in an upper tree level must descend to the left of that
+  key to ensure it finds any equal keys in the preceding page. An
+  insertion that sees the high key of its target page is equal to the key
+  to be inserted has a choice whether or not to move right, since the new
+  key could go on either page. (Currently, we try to find a page where
+  there is room for the new key without a split.)
+
++ Lehman and Yao don't require read locks, but assume that in-memory
+  copies of tree nodes are unshared. Postgres shares in-memory buffers
+  among backends. As a result, we do page-level read locking on btree
+  nodes in order to guarantee that no record is modified while we are
+  examining it. This reduces concurrency but guaranteees correct
+  behavior. An advantage is that when trading in a read lock for a
+  write lock, we need not re-read the page after getting the write lock.
+  Since we're also holding a pin on the shared buffer containing the
+  page, we know that buffer still contains the page and is up-to-date.
+
++ We support the notion of an ordered "scan" of an index as well as
+  insertions, deletions, and simple lookups. A scan in the forward
+  direction is no problem, we just use the right-sibling pointers that
+  L&Y require anyway. (Thus, once we have descended the tree to the
+  correct start point for the scan, the scan looks only at leaf pages
+  and never at higher tree levels.) To support scans in the backward
+  direction, we also store a "left sibling" link much like the "right
+  sibling".
(This adds an extra step to the L&Y split algorithm: while
+  holding the write lock on the page being split, we also lock its former
+  right sibling to update that page's left-link. This is safe since no
+  writer of that page can be interested in acquiring a write lock on our
+  page.) A backwards scan has one additional bit of complexity: after
+  following the left-link we must account for the possibility that the
+  left sibling page got split before we could read it. So, we have to
+  move right until we find a page whose right-link matches the page we
+  came from.
+
++ Read locks on a page are held for as long as a scan has a pointer
+  to the page. However, locks are always surrendered before the
+  sibling page lock is acquired (for readers), so we remain deadlock-
+  free. I will do a formal proof if I get bored anytime soon.
+  NOTE: nbtree.c arranges to drop the read lock, but not the buffer pin,
+  on the current page of a scan before control leaves nbtree. When we
+  come back to resume the scan, we have to re-grab the read lock and
+  then move right if the current item moved (see _bt_restscan()).
+
++ Lehman and Yao fail to discuss what must happen when the root page
+  becomes full and must be split. Our implementation is to split the
+  root in the same way that any other page would be split, then construct
+  a new root page holding pointers to both of the resulting pages (which
+  now become siblings on level 2 of the tree). The new root page is then
+  installed by altering the root pointer in the meta-data page (see
+  below). This works because the root is not treated specially in any
+  other way --- in particular, searches will move right using its link
+  pointer if the link is set. Therefore, searches will find the data
+  that's been moved into the right sibling even if they read the metadata
+  page before it got updated. This is the same reasoning that makes a
+  split of a non-root page safe. The locking considerations are similar too.
+
++ Lehman and Yao assume fixed-size keys, but we must deal with
+  variable-size keys. Therefore there is not a fixed maximum number of
+  keys per page; we just stuff in as many as will fit. When we split a
+  page, we try to equalize the number of bytes, not items, assigned to
+  each of the resulting pages. Note we must include the incoming item in
+  this calculation, otherwise it is possible to find that the incoming
+  item doesn't fit on the split page where it needs to go!
 In addition, the following things are handy to know:
-	+ Page zero of every btree is a meta-data page. This page stores
-	  the location of the root page, a pointer to a list of free
-	  pages, and other stuff that's handy to know.
-
-	+ This algorithm doesn't really work, since it requires ordered
-	  writes, and UNIX doesn't support ordered writes.
-
-	+ There's one other case where we may screw up in this
-	  implementation. When we start a scan, we descend the tree
-	  to the key nearest the one in the qual, and once we get there,
-	  position ourselves correctly for the qual type (eg, <, >=, etc).
-	  If we happen to step off a page, decide we want to get back to
-	  it, and fetch the page again, and if some bad person has split
-	  the page and moved the last tuple we saw off of it, then the
-	  code complains about botched concurrency in an elog(WARN, ...)
-	  and gives up the ghost. This is the ONLY violation of Lehman
-	  and Yao's guarantee of correct behavior that I am aware of in
-	  this code.
++ Page zero of every btree is a meta-data page. This page stores
+  the location of the root page, a pointer to a list of free
+  pages, and other stuff that's handy to know. (Currently, we
+  never shrink btree indexes so there are never any free pages.)
+
++ The algorithm assumes we can fit at least three items per page
+  (a "high key" and two real data items). Therefore it's unsafe
+  to accept items larger than 1/3rd page size.
Larger items would
+  work sometimes, but could cause failures later on depending on
+  what else gets put on their page.
+
++ This algorithm doesn't guarantee btree consistency after a kernel crash
+  or hardware failure. To do that, we'd need ordered writes, and UNIX
+  doesn't support ordered writes (short of fsync'ing every update, which
+  is too high a price). Rebuilding corrupted indexes during restart
+  seems more attractive.
+
++ On deletions, we need to adjust the position of active scans on
+  the index. The code in nbtscan.c handles this. We don't need to
+  do this for insertions or splits because _bt_restscan can find the
+  new position of the previously-found item. NOTE that nbtscan.c
+  only copes with deletions issued by the current backend. This
+  essentially means that concurrent deletions are not supported, but
+  that's true already in the Lehman and Yao algorithm. nbtscan.c
+  exists only to support VACUUM and allow it to delete items while
+  it's scanning the index.
+
+Notes about data representation:
+
++ The right-sibling link required by L&Y is kept in the page "opaque
+  data" area, as is the left-sibling link and some flags.
+
++ We also keep a parent link in the opaque data, but this link is not
+  very trustworthy because it is not updated when the parent page splits.
+  Thus, it points to some page on the parent level, but possibly a page
+  well to the left of the page's actual current parent. In most cases
+  we do not need this link at all. Normally we return to a parent page
+  using a stack of entries that are made as we descend the tree, as in L&Y.
+  There is exactly one case where the stack will not help: concurrent
+  root splits. If an inserter process needs to split what had been the
+  root when it started its descent, but finds that that page is no longer
+  the root (because someone else split it meanwhile), then it uses the
+  parent link to move up to the next level.
This is OK because we do fix
+  the parent link in a former root page when splitting it. This logic
+  will work even if the root is split multiple times (even up to creation
+  of multiple new levels) before an inserter returns to it. The same
+  could not be said of finding the new root via the metapage, since that
+  would work only for a single level of added root.
+
++ The Postgres disk block data format (an array of items) doesn't fit
+  Lehman and Yao's alternating-keys-and-pointers notion of a disk page,
+  so we have to play some games.
+
++ On a page that is not rightmost in its tree level, the "high key" is
+  kept in the page's first item, and real data items start at item 2.
+  The link portion of the "high key" item goes unused. A page that is
+  rightmost has no "high key", so data items start with the first item.
+  Putting the high key at the left, rather than the right, may seem odd,
+  but it avoids moving the high key as we add data items.
+
++ On a leaf page, the data items are simply links to (TIDs of) tuples
+  in the relation being indexed, with the associated key values.
+
++ On a non-leaf page, the data items are down-links to child pages with
+  bounding keys. The key in each data item is the *lower* bound for
+  keys on that child page, so logically the key is to the left of that
+  downlink. The high key (if present) is the upper bound for the last
+  downlink. The first data item on each such page has no lower bound
+  --- or lower bound of minus infinity, if you prefer. The comparison
+  routines must treat it accordingly. The actual key stored in the
+  item is irrelevant, and need not be stored at all. This arrangement
+  corresponds to the fact that an L&Y non-leaf page has one more pointer
+  than key.
 Notes to operator class implementors:
-	With this implementation, we require the user to supply us with
-	a procedure for pg_amproc. This procedure should take two keys
-	A and B and return < 0, 0, or > 0 if A < B, A = B, or A > B,
-	respectively.
See the contents of that relation for the btree
-	access method for some samples.
-
-Notes to mao for implementation document:
-
-	On deletions, we need to adjust the position of active scans on
-	the index. The code in nbtscan.c handles this. We don't need to
-	do this for splits because of the way splits are handled; if they
-	happen behind us, we'll automatically go to the next page, and if
-	they happen in front of us, we're not affected by them. For
-	insertions, if we inserted a tuple behind the current scan location
-	on the current scan page, we move one space ahead.
++ With this implementation, we require the user to supply us with
+  a procedure for pg_amproc. This procedure should take two keys
+  A and B and return < 0, 0, or > 0 if A < B, A = B, or A > B,
+  respectively. See the contents of that relation for the btree
+  access method for some samples.
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index 7d65c63dc8..6be8e97b50 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -8,7 +8,7 @@
  *
  *
  * IDENTIFICATION
- *	  $Header: /cvsroot/pgsql/src/backend/access/nbtree/nbtinsert.c,v 1.59 2000/06/08 22:36:52 momjian Exp $
+ *	  $Header: /cvsroot/pgsql/src/backend/access/nbtree/nbtinsert.c,v 1.60 2000/07/21 06:42:32 tgl Exp $
  *
  *-------------------------------------------------------------------------
  */
@@ -19,53 +19,76 @@
 #include "access/nbtree.h"
-static InsertIndexResult _bt_insertonpg(Relation rel, Buffer buf, BTStack stack, int keysz, ScanKey scankey, BTItem btitem, BTItem afteritem);
-static Buffer _bt_split(Relation rel, Size keysz, ScanKey scankey,
-		  Buffer buf, OffsetNumber firstright);
-static OffsetNumber _bt_findsplitloc(Relation rel, Size keysz, ScanKey scankey,
-				 Page page, OffsetNumber start,
-				 OffsetNumber maxoff, Size llimit);
+typedef struct
+{
+	/* context data for _bt_checksplitloc */
+	Size		newitemsz;		/* size of new item to be inserted */
+	bool
non_leaf;		/* T if splitting an internal node */
+
+	bool		have_split;		/* found a valid split? */
+
+	/* these fields valid only if have_split is true */
+	bool		newitemonleft;	/* new item on left or right of best split */
+	OffsetNumber firstright;	/* best split point */
+	int			best_delta;		/* best size delta so far */
+} FindSplitData;
+
+
+static TransactionId _bt_check_unique(Relation rel, BTItem btitem,
+				 Relation heapRel, Buffer buf,
+				 ScanKey itup_scankey);
+static InsertIndexResult _bt_insertonpg(Relation rel, Buffer buf,
+			   BTStack stack,
+			   int keysz, ScanKey scankey,
+			   BTItem btitem,
+			   OffsetNumber afteritem);
+static Buffer _bt_split(Relation rel, Buffer buf, OffsetNumber firstright,
+		  OffsetNumber newitemoff, Size newitemsz,
+		  BTItem newitem, bool newitemonleft,
+		  OffsetNumber *itup_off, BlockNumber *itup_blkno);
+static OffsetNumber _bt_findsplitloc(Relation rel, Page page,
+				 OffsetNumber newitemoff,
+				 Size newitemsz,
+				 bool *newitemonleft);
+static void _bt_checksplitloc(FindSplitData *state, OffsetNumber firstright,
+				  int leftfree, int rightfree,
+				  bool newitemonleft, Size firstrightitemsz);
+static Buffer _bt_getstackbuf(Relation rel, BTStack stack);
 static void _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf);
-static OffsetNumber _bt_pgaddtup(Relation rel, Buffer buf, int keysz, ScanKey itup_scankey, Size itemsize, BTItem btitem, BTItem afteritem);
-static bool _bt_goesonpg(Relation rel, Buffer buf, Size keysz, ScanKey scankey, BTItem afteritem);
-static void _bt_updateitem(Relation rel, Size keysz, Buffer buf, BTItem oldItem, BTItem newItem);
-static bool _bt_isequal(TupleDesc itupdesc, Page page, OffsetNumber offnum, int keysz, ScanKey scankey);
-static int32 _bt_tuplecompare(Relation rel, Size keysz, ScanKey scankey,
-				 IndexTuple tuple1, IndexTuple tuple2);
+static void _bt_pgaddtup(Relation rel, Page page,
+			 Size itemsize, BTItem btitem,
+			 OffsetNumber itup_off, const char *where);
+static bool _bt_isequal(TupleDesc itupdesc, Page page,
OffsetNumber offnum,
+			int keysz, ScanKey scankey);
 /*
  *	_bt_doinsert() -- Handle insertion of a single btitem in the tree.
  *
  *		This routine is called by the public interface routines, btbuild
- *		and btinsert. By here, btitem is filled in, and has a unique
- *		(xid, seqno) pair.
+ *		and btinsert. By here, btitem is filled in, including the TID.
  */
 InsertIndexResult
-_bt_doinsert(Relation rel, BTItem btitem, bool index_is_unique, Relation heapRel)
+_bt_doinsert(Relation rel, BTItem btitem,
+			 bool index_is_unique, Relation heapRel)
 {
+	IndexTuple	itup = &(btitem->bti_itup);
+	int			natts = rel->rd_rel->relnatts;
 	ScanKey		itup_scankey;
-	IndexTuple	itup;
 	BTStack		stack;
 	Buffer		buf;
-	BlockNumber blkno;
-	int			natts = rel->rd_rel->relnatts;
 	InsertIndexResult res;
-	Buffer		buffer;
-
-	itup = &(btitem->bti_itup);
 	/* we need a scan key to do our search, so build one */
 	itup_scankey = _bt_mkscankey(rel, itup);
+top:
 	/* find the page containing this key */
-	stack = _bt_search(rel, natts, itup_scankey, &buf);
+	stack = _bt_search(rel, natts, itup_scankey, &buf, BT_WRITE);
 	/* trade in our read lock for a write lock */
 	LockBuffer(buf, BUFFER_LOCK_UNLOCK);
 	LockBuffer(buf, BT_WRITE);
-l1:
-
 	/*
 	 * If the page was split between the time that we surrendered our read
 	 * lock and acquired our write lock, then this page may no longer be
@@ -73,176 +96,212 @@ l1:
 	 * need to move right in the tree. See Lehman and Yao for an
 	 * excruciatingly precise description.
 	 */
-
 	buf = _bt_moveright(rel, buf, natts, itup_scankey, BT_WRITE);
-	blkno = BufferGetBlockNumber(buf);
-	/* if we're not allowing duplicates, make sure the key isn't */
-	/* already in the node */
+	/*
+	 * If we're not allowing duplicates, make sure the key isn't
+	 * already in the index.
XXX this belongs somewhere else, likely
+	 */
 	if (index_is_unique)
 	{
-		OffsetNumber offset,
-					maxoff;
-		Page		page;
+		TransactionId xwait;
 
-		page = BufferGetPage(buf);
-		maxoff = PageGetMaxOffsetNumber(page);
+		xwait = _bt_check_unique(rel, btitem, heapRel, buf, itup_scankey);
+
+		if (TransactionIdIsValid(xwait))
+		{
+			/* Have to wait for the other guy ... */
+			_bt_relbuf(rel, buf, BT_WRITE);
+			XactLockTableWait(xwait);
+			/* start over... */
+			_bt_freestack(stack);
+			goto top;
+		}
+	}
+
+	/* do the insertion */
+	res = _bt_insertonpg(rel, buf, stack, natts, itup_scankey, btitem, 0);
+
+	/* be tidy */
+	_bt_freestack(stack);
+	_bt_freeskey(itup_scankey);
+
+	return res;
+}
+
+/*
+ *	_bt_check_unique() -- Check for violation of unique index constraint
+ *
+ * Returns NullTransactionId if there is no conflict, else an xact ID we
+ * must wait for to see if it commits a conflicting tuple. If an actual
+ * conflict is detected, no return --- just elog().
+ */
+static TransactionId
+_bt_check_unique(Relation rel, BTItem btitem, Relation heapRel,
+				 Buffer buf, ScanKey itup_scankey)
+{
+	TupleDesc	itupdesc = RelationGetDescr(rel);
+	int			natts = rel->rd_rel->relnatts;
+	OffsetNumber offset,
+				maxoff;
+	Page		page;
+	BTPageOpaque opaque;
+	Buffer		nbuf = InvalidBuffer;
+	bool		chtup = true;
+
+	page = BufferGetPage(buf);
+	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+	maxoff = PageGetMaxOffsetNumber(page);
+
+	/*
+	 * Find first item >= proposed new item. Note we could also get
+	 * a pointer to end-of-page here.
+	 */
+	offset = _bt_binsrch(rel, buf, natts, itup_scankey);
 
-		offset = _bt_binsrch(rel, buf, natts, itup_scankey, BT_DESCENT);
+	/*
+	 * Scan over all equal tuples, looking for live conflicts.
+	 */
+	for (;;)
+	{
+		HeapTupleData htup;
+		Buffer		buffer;
+		BTItem		cbti;
+		BlockNumber nblkno;
 
-		/* make sure the offset we're given points to an actual */
-		/* key on the page before trying to compare it */
-		if (!PageIsEmpty(page) && offset <= maxoff)
+		/*
+		 * _bt_compare returns 0 for (1,NULL) and (1,NULL) - this's
+		 * how we handling NULLs - and so we must not use _bt_compare
+		 * in real comparison, but only for ordering/finding items on
+		 * pages. - vadim 03/24/97
+		 *
+		 * make sure the offset points to an actual key
+		 * before trying to compare it...
+		 */
+		if (offset <= maxoff)
 		{
-			TupleDesc	itupdesc;
-			BTItem		cbti;
-			HeapTupleData htup;
-			BTPageOpaque opaque;
-			Buffer		nbuf;
-			BlockNumber nblkno;
-			bool		chtup = true;
-
-			itupdesc = RelationGetDescr(rel);
-			nbuf = InvalidBuffer;
-			opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+			if (! _bt_isequal(itupdesc, page, offset, natts, itup_scankey))
+				break;			/* we're past all the equal tuples */
 			/*
-			 * _bt_compare returns 0 for (1,NULL) and (1,NULL) - this's
-			 * how we handling NULLs - and so we must not use _bt_compare
-			 * in real comparison, but only for ordering/finding items on
-			 * pages. - vadim 03/24/97
-			 *
-			 * while ( !_bt_compare (rel, itupdesc, page, natts,
-			 * itup_scankey, offset) )
+			 * Have to check is inserted heap tuple deleted one (i.e.
+			 * just moved to another place by vacuum)! We only need to
+			 * do this once, but don't want to do it at all unless
+			 * we see equal tuples, so as not to slow down unequal case.
 			 */
-			while (_bt_isequal(itupdesc, page, offset, natts, itup_scankey))
-			{					/* they're equal */
-
-				/*
-				 * Have to check is inserted heap tuple deleted one (i.e.
-				 * just moved to another place by vacuum)!
-				 */
-				if (chtup)
-				{
-					htup.t_self = btitem->bti_itup.t_tid;
-					heap_fetch(heapRel, SnapshotDirty, &htup, &buffer);
-					if (htup.t_data == NULL)	/* YES!
*/
-						break;
-					/* Live tuple was inserted */
-					ReleaseBuffer(buffer);
-					chtup = false;
-				}
-				cbti = (BTItem) PageGetItem(page, PageGetItemId(page, offset));
-				htup.t_self = cbti->bti_itup.t_tid;
+			if (chtup)
+			{
+				htup.t_self = btitem->bti_itup.t_tid;
 				heap_fetch(heapRel, SnapshotDirty, &htup, &buffer);
-				if (htup.t_data != NULL)		/* it is a duplicate */
-				{
-					TransactionId xwait =
+				if (htup.t_data == NULL)		/* YES! */
+					break;
+				/* Live tuple is being inserted, so continue checking */
+				ReleaseBuffer(buffer);
+				chtup = false;
+			}
+
+			cbti = (BTItem) PageGetItem(page, PageGetItemId(page, offset));
+			htup.t_self = cbti->bti_itup.t_tid;
+			heap_fetch(heapRel, SnapshotDirty, &htup, &buffer);
+			if (htup.t_data != NULL)		/* it is a duplicate */
+			{
+				TransactionId xwait =
 					(TransactionIdIsValid(SnapshotDirty->xmin)) ?
 					SnapshotDirty->xmin : SnapshotDirty->xmax;
-					/*
-					 * If this tuple is being updated by other transaction
-					 * then we have to wait for its commit/abort.
-					 */
-					ReleaseBuffer(buffer);
-					if (TransactionIdIsValid(xwait))
-					{
-						if (nbuf != InvalidBuffer)
-							_bt_relbuf(rel, nbuf, BT_READ);
-						_bt_relbuf(rel, buf, BT_WRITE);
-						XactLockTableWait(xwait);
-						buf = _bt_getbuf(rel, blkno, BT_WRITE);
-						goto l1;/* continue from the begin */
-					}
-					elog(ERROR, "Cannot insert a duplicate key into unique index %s", RelationGetRelationName(rel));
-				}
-				/* htup null so no buffer to release */
-				/* get next offnum */
-				if (offset < maxoff)
-					offset = OffsetNumberNext(offset);
-				else
-				{				/* move right ? */
-					if (P_RIGHTMOST(opaque))
-						break;
-					if (!_bt_isequal(itupdesc, page, P_HIKEY,
-									 natts, itup_scankey))
-						break;
-
-					/*
-					 * min key of the right page is the same, ooh - so
-					 * many dead duplicates...
-					 */
-					nblkno = opaque->btpo_next;
+				/*
+				 * If this tuple is being updated by other transaction
+				 * then we have to wait for its commit/abort.
+				 */
+				ReleaseBuffer(buffer);
+				if (TransactionIdIsValid(xwait))
+				{
 					if (nbuf != InvalidBuffer)
 						_bt_relbuf(rel, nbuf, BT_READ);
-					for (nbuf = InvalidBuffer;;)
-					{
-						nbuf = _bt_getbuf(rel, nblkno, BT_READ);
-						page = BufferGetPage(nbuf);
-						maxoff = PageGetMaxOffsetNumber(page);
-						opaque = (BTPageOpaque) PageGetSpecialPointer(page);
-						offset = P_RIGHTMOST(opaque) ? P_HIKEY : P_FIRSTKEY;
-						if (!PageIsEmpty(page) && offset <= maxoff)
-						{		/* Found some key */
-							break;
-						}
-						else
-						{		/* Empty or "pseudo"-empty page - get next */
-							nblkno = opaque->btpo_next;
-							_bt_relbuf(rel, nbuf, BT_READ);
-							nbuf = InvalidBuffer;
-							if (nblkno == P_NONE)
-								break;
-						}
-					}
-					if (nbuf == InvalidBuffer)
-						break;
+					/* Tell _bt_doinsert to wait... */
+					return xwait;
 				}
+				/*
+				 * Otherwise we have a definite conflict.
+				 */
+				elog(ERROR, "Cannot insert a duplicate key into unique index %s",
+					 RelationGetRelationName(rel));
 			}
+			/* htup null so no buffer to release */
+		}
+
+		/*
+		 * Advance to next tuple to continue checking.
+		 */
+		if (offset < maxoff)
+			offset = OffsetNumberNext(offset);
+		else
+		{
+			/* If scankey == hikey we gotta check the next page too */
+			if (P_RIGHTMOST(opaque))
+				break;
+			if (!_bt_isequal(itupdesc, page, P_HIKEY,
+							 natts, itup_scankey))
+				break;
+			nblkno = opaque->btpo_next;
 			if (nbuf != InvalidBuffer)
 				_bt_relbuf(rel, nbuf, BT_READ);
+			nbuf = _bt_getbuf(rel, nblkno, BT_READ);
+			page = BufferGetPage(nbuf);
+			opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+			maxoff = PageGetMaxOffsetNumber(page);
+			offset = P_FIRSTDATAKEY(opaque);
 		}
 	}
-	/* do the insertion */
-	res = _bt_insertonpg(rel, buf, stack, natts, itup_scankey,
-						 btitem, (BTItem) NULL);
+	if (nbuf != InvalidBuffer)
+		_bt_relbuf(rel, nbuf, BT_READ);
 
-	/* be tidy */
-	_bt_freestack(stack);
-	_bt_freeskey(itup_scankey);
-
-	return res;
+	return NullTransactionId;
 }
-/*
+/*----------
  *	_bt_insertonpg() -- Insert a tuple on a particular page in the index.
 *
 *		This recursive procedure does the following things:
 *
- *			+ if necessary, splits the target page.
- *			+ finds the right place to insert the tuple (taking into
- *			  account any changes induced by a split).
+ *			+ finds the right place to insert the tuple.
+ *			+ if necessary, splits the target page (making sure that the
+ *			  split is equitable as far as post-insert free space goes).
 *			+ inserts the tuple.
 *			+ if the page was split, pops the parent stack, and finds the
 *			  right place to insert the new child pointer (by walking
 *			  right using information stored in the parent stack).
- *			+ invoking itself with the appropriate tuple for the right
+ *			+ invokes itself with the appropriate tuple for the right
 *			  child page on the parent.
 *
 *		On entry, we must have the right buffer on which to do the
 *		insertion, and the buffer must be pinned and locked. On return,
 *		we will have dropped both the pin and the write lock on the buffer.
 *
+ *		If 'afteritem' is >0 then the new tuple must be inserted after the
+ *		existing item of that number, noplace else. If 'afteritem' is 0
+ *		then the procedure finds the exact spot to insert it by searching.
+ *		(keysz and scankey parameters are used ONLY if afteritem == 0.)
+ *
+ *		NOTE: if the new key is equal to one or more existing keys, we can
+ *		legitimately place it anywhere in the series of equal keys --- in fact,
+ *		if the new key is equal to the page's "high key" we can place it on
+ *		the next page. If it is equal to the high key, and there's not room
+ *		to insert the new tuple on the current page without splitting, then
+ *		we move right hoping to find more free space and avoid a split.
+ *		Ordinarily, though, we'll insert it before the existing equal keys
+ *		because of the way _bt_binsrch() works.
+ *
 *		The locking interactions in this code are critical. You should
 *		grok Lehman and Yao's paper before making any changes.
In addition,
 *		you need to understand how we disambiguate duplicate keys in this
 *		implementation, in order to be able to find our location using
 *		L&Y "move right" operations. Since we may insert duplicate user
- *		keys, and since these dups may propogate up the tree, we use the
+ *		keys, and since these dups may propagate up the tree, we use the
 *		'afteritem' parameter to position ourselves correctly for the
 *		insertion on internal pages.
+ *----------
  */
 static InsertIndexResult
 _bt_insertonpg(Relation rel,
@@ -251,17 +310,16 @@ _bt_insertonpg(Relation rel,
 			   int keysz,
 			   ScanKey scankey,
 			   BTItem btitem,
-			   BTItem afteritem)
+			   OffsetNumber afteritem)
 {
 	InsertIndexResult res;
 	Page		page;
 	BTPageOpaque lpageop;
-	BlockNumber itup_blkno;
 	OffsetNumber itup_off;
+	BlockNumber itup_blkno;
+	OffsetNumber newitemoff;
 	OffsetNumber firstright = InvalidOffsetNumber;
 	Size		itemsz;
-	bool		do_split = false;
-	bool		keys_equal = false;
 	page = BufferGetPage(buf);
 	lpageop = (BTPageOpaque) PageGetSpecialPointer(page);
@@ -285,355 +343,117 @@ _bt_insertonpg(Relation rel,
 			 (PageGetPageSize(page) - sizeof(PageHeaderData) - MAXALIGN(sizeof(BTPageOpaqueData))) /3 - sizeof(ItemIdData));
 	/*
-	 * If we have to insert item on the leftmost page which is the first
-	 * page in the chain of duplicates then: 1. if scankey == hikey (i.e.
-	 * - new duplicate item) then insert it here; 2. if scankey < hikey
-	 * then: 2.a if there is duplicate key(s) here - we force splitting;
-	 * 2.b else - we may "eat" this page from duplicates chain.
+	 * Determine exactly where new item will go.
 	 */
-	if (lpageop->btpo_flags & BTP_CHAIN)
+	if (afteritem > 0)
 	{
-		OffsetNumber maxoff = PageGetMaxOffsetNumber(page);
-		ItemId		hitemid;
-		BTItem		hitem;
-
-		Assert(!P_RIGHTMOST(lpageop));
-		hitemid = PageGetItemId(page, P_HIKEY);
-		hitem = (BTItem) PageGetItem(page, hitemid);
-		if (maxoff > P_HIKEY &&
-			!_bt_itemcmp(rel, keysz, scankey, hitem,
-			 (BTItem) PageGetItem(page, PageGetItemId(page, P_FIRSTKEY)),
-						 BTEqualStrategyNumber))
-			elog(FATAL, "btree: bad key on the page in the chain of duplicates");
-
-		if (!_bt_skeycmp(rel, keysz, scankey, page, hitemid,
-						 BTEqualStrategyNumber))
-		{
-			if (!P_LEFTMOST(lpageop))
-				elog(FATAL, "btree: attempt to insert bad key on the non-leftmost page in the chain of duplicates");
-			if (!_bt_skeycmp(rel, keysz, scankey, page, hitemid,
-							 BTLessStrategyNumber))
-				elog(FATAL, "btree: attempt to insert higher key on the leftmost page in the chain of duplicates");
-			if (maxoff > P_HIKEY)		/* have duplicate(s) */
-			{
-				firstright = P_FIRSTKEY;
-				do_split = true;
-			}
-			else
-/* "eat" page */
-			{
-				Buffer		pbuf;
-				Page		ppage;
-
-				itup_blkno = BufferGetBlockNumber(buf);
-				itup_off = PageAddItem(page, (Item) btitem, itemsz,
-									   P_FIRSTKEY, LP_USED);
-				if (itup_off == InvalidOffsetNumber)
-					elog(FATAL, "btree: failed to add item");
-				lpageop->btpo_flags &= ~BTP_CHAIN;
-				pbuf = _bt_getstackbuf(rel, stack, BT_WRITE);
-				ppage = BufferGetPage(pbuf);
-				PageIndexTupleDelete(ppage, stack->bts_offset);
-				pfree(stack->bts_btitem);
-				stack->bts_btitem = _bt_formitem(&(btitem->bti_itup));
-				ItemPointerSet(&(stack->bts_btitem->bti_itup.t_tid),
-							   itup_blkno, P_HIKEY);
-				_bt_wrtbuf(rel, buf);
-				res = _bt_insertonpg(rel, pbuf, stack->bts_parent,
-									 keysz, scankey, stack->bts_btitem,
-									 NULL);
-				ItemPointerSet(&(res->pointerData), itup_blkno, itup_off);
-				return res;
-			}
-		}
-		else
-		{
-			keys_equal = true;
-			if (PageGetFreeSpace(page) < itemsz)
-				do_split = true;
-		}
+		newitemoff = afteritem + 1;
 	}
-	else if (PageGetFreeSpace(page) < itemsz)
-		do_split = true;
-
else if (PageGetFreeSpace(page) < 3 * itemsz + 2 * sizeof(ItemIdData)) - { - OffsetNumber offnum = (P_RIGHTMOST(lpageop)) ? P_HIKEY : P_FIRSTKEY; - OffsetNumber maxoff = PageGetMaxOffsetNumber(page); - - if (offnum < maxoff) /* can't split unless at least 2 items... */ - { - ItemId itid; - BTItem previtem, - chkitem; - Size maxsize; - Size currsize; - - /* find largest group of identically-keyed items on page */ - itid = PageGetItemId(page, offnum); - previtem = (BTItem) PageGetItem(page, itid); - maxsize = currsize = (ItemIdGetLength(itid) + sizeof(ItemIdData)); - for (offnum = OffsetNumberNext(offnum); - offnum <= maxoff; offnum = OffsetNumberNext(offnum)) - { - itid = PageGetItemId(page, offnum); - chkitem = (BTItem) PageGetItem(page, itid); - if (!_bt_itemcmp(rel, keysz, scankey, - previtem, chkitem, - BTEqualStrategyNumber)) - { - if (currsize > maxsize) - maxsize = currsize; - currsize = 0; - previtem = chkitem; - } - currsize += (ItemIdGetLength(itid) + sizeof(ItemIdData)); - } - if (currsize > maxsize) - maxsize = currsize; - /* Decide to split if largest group is > 1/2 page size */ - maxsize += sizeof(PageHeaderData) + - MAXALIGN(sizeof(BTPageOpaqueData)); - if (maxsize >= PageGetPageSize(page) / 2) - do_split = true; - } - } - - if (do_split) + else { - Buffer rbuf; - Page rpage; - BTItem ritem; - BlockNumber rbknum; - BTPageOpaque rpageop; - Buffer pbuf; - Page ppage; - BTPageOpaque ppageop; - BlockNumber bknum = BufferGetBlockNumber(buf); - BTItem lowLeftItem; - OffsetNumber maxoff; - bool shifted = false; - bool left_chained = (lpageop->btpo_flags & BTP_CHAIN) ? true : false; - bool is_root = lpageop->btpo_flags & BTP_ROOT; - /* - * Instead of splitting leaf page in the chain of duplicates by - * new duplicate, insert it into some right page. + * If we will need to split the page to put the item here, + * check whether we can put the tuple somewhere to the right, + * instead. 
Keep scanning until we find enough free space or + * reach the last page where the tuple can legally go. */ - if ((lpageop->btpo_flags & BTP_CHAIN) && - (lpageop->btpo_flags & BTP_LEAF) && keys_equal) + while (PageGetFreeSpace(page) < itemsz && + !P_RIGHTMOST(lpageop) && + _bt_compare(rel, keysz, scankey, page, P_HIKEY) == 0) { - rbuf = _bt_getbuf(rel, lpageop->btpo_next, BT_WRITE); - rpage = BufferGetPage(rbuf); - rpageop = (BTPageOpaque) PageGetSpecialPointer(rpage); + /* step right one page */ + BlockNumber rblkno = lpageop->btpo_next; - /* - * some checks - */ - if (!P_RIGHTMOST(rpageop)) /* non-rightmost page */ - { /* If we have the same hikey here then - * it's yet another page in chain. */ - if (_bt_skeycmp(rel, keysz, scankey, rpage, - PageGetItemId(rpage, P_HIKEY), - BTEqualStrategyNumber)) - { - if (!(rpageop->btpo_flags & BTP_CHAIN)) - elog(FATAL, "btree: lost page in the chain of duplicates"); - } - else if (_bt_skeycmp(rel, keysz, scankey, rpage, - PageGetItemId(rpage, P_HIKEY), - BTGreaterStrategyNumber)) - elog(FATAL, "btree: hikey is out of order"); - else if (rpageop->btpo_flags & BTP_CHAIN) - - /* - * If hikey > scankey then it's last page in chain and - * BTP_CHAIN must be OFF - */ - elog(FATAL, "btree: lost last page in the chain of duplicates"); - } - else -/* rightmost page */ - Assert(!(rpageop->btpo_flags & BTP_CHAIN)); _bt_relbuf(rel, buf, BT_WRITE); - return (_bt_insertonpg(rel, rbuf, stack, keysz, - scankey, btitem, afteritem)); + buf = _bt_getbuf(rel, rblkno, BT_WRITE); + page = BufferGetPage(buf); + lpageop = (BTPageOpaque) PageGetSpecialPointer(page); } - /* - * If after splitting un-chained page we'll got chain of pages - * with duplicates then we want to know 1. on which of two pages - * new btitem will go (current _bt_findsplitloc is quite bad); 2. - * what parent (if there's one) thinking about it (remember about - * deletions) + * This is it, so find the position... 
*/ - else if (!(lpageop->btpo_flags & BTP_CHAIN)) - { - OffsetNumber start = (P_RIGHTMOST(lpageop)) ? P_HIKEY : P_FIRSTKEY; - Size llimit; - - maxoff = PageGetMaxOffsetNumber(page); - llimit = PageGetPageSize(page) - sizeof(PageHeaderData) - - MAXALIGN(sizeof(BTPageOpaqueData)) - +sizeof(ItemIdData); - llimit /= 2; - firstright = _bt_findsplitloc(rel, keysz, scankey, - page, start, maxoff, llimit); - - if (_bt_itemcmp(rel, keysz, scankey, - (BTItem) PageGetItem(page, PageGetItemId(page, start)), - (BTItem) PageGetItem(page, PageGetItemId(page, firstright)), - BTEqualStrategyNumber)) - { - if (_bt_skeycmp(rel, keysz, scankey, page, - PageGetItemId(page, firstright), - BTLessStrategyNumber)) - - /* - * force moving current items to the new page: new - * item will go on the current page. - */ - firstright = start; - else - - /* - * new btitem >= firstright, start item == firstright - * - new chain of duplicates: if this non-leftmost - * leaf page and parent item < start item then force - * moving all items to the new page - current page - * will be "empty" after it. - */ - { - if (!P_LEFTMOST(lpageop) && - (lpageop->btpo_flags & BTP_LEAF)) - { - ItemPointerSet(&(stack->bts_btitem->bti_itup.t_tid), - bknum, P_HIKEY); - pbuf = _bt_getstackbuf(rel, stack, BT_WRITE); - if (_bt_itemcmp(rel, keysz, scankey, - stack->bts_btitem, - (BTItem) PageGetItem(page, - PageGetItemId(page, start)), - BTLessStrategyNumber)) - { - firstright = start; - shifted = true; - } - _bt_relbuf(rel, pbuf, BT_WRITE); - } - } - } /* else - no new chain if start item < - * firstright one */ - } + newitemoff = _bt_binsrch(rel, buf, keysz, scankey); + } - /* split the buffer into left and right halves */ - rbuf = _bt_split(rel, keysz, scankey, buf, firstright); + /* + * Do we need to split the page to fit the item on it? 
+ */ + if (PageGetFreeSpace(page) < itemsz) + { + Buffer rbuf; + BlockNumber bknum = BufferGetBlockNumber(buf); + BlockNumber rbknum; + bool is_root = P_ISROOT(lpageop); + bool newitemonleft; - /* which new page (left half or right half) gets the tuple? */ - if (_bt_goesonpg(rel, buf, keysz, scankey, afteritem)) - { - /* left page */ - itup_off = _bt_pgaddtup(rel, buf, keysz, scankey, - itemsz, btitem, afteritem); - itup_blkno = BufferGetBlockNumber(buf); - } - else - { - /* right page */ - itup_off = _bt_pgaddtup(rel, rbuf, keysz, scankey, - itemsz, btitem, afteritem); - itup_blkno = BufferGetBlockNumber(rbuf); - } + /* Choose the split point */ + firstright = _bt_findsplitloc(rel, page, + newitemoff, itemsz, + &newitemonleft); - maxoff = PageGetMaxOffsetNumber(page); - if (shifted) - { - if (maxoff > P_FIRSTKEY) - elog(FATAL, "btree: shifted page is not empty"); - lowLeftItem = (BTItem) NULL; - } - else - { - if (maxoff < P_FIRSTKEY) - elog(FATAL, "btree: un-shifted page is empty"); - lowLeftItem = (BTItem) PageGetItem(page, - PageGetItemId(page, P_FIRSTKEY)); - if (_bt_itemcmp(rel, keysz, scankey, lowLeftItem, - (BTItem) PageGetItem(page, PageGetItemId(page, P_HIKEY)), - BTEqualStrategyNumber)) - lpageop->btpo_flags |= BTP_CHAIN; - } + /* split the buffer into left and right halves */ + rbuf = _bt_split(rel, buf, firstright, + newitemoff, itemsz, btitem, newitemonleft, + &itup_off, &itup_blkno); - /* + /*---------- * By here, * - * + our target page has been split; + the original tuple has been - * inserted; + we have write locks on both the old (left half) - * and new (right half) buffers, after the split; and + we have - * the key we want to insert into the parent. + * + our target page has been split; + * + the original tuple has been inserted; + * + we have write locks on both the old (left half) + * and new (right half) buffers, after the split; and + * + we know the key we want to insert into the parent + * (it's the "high key" on the left child page). 
+ * + * We're ready to do the parent insertion. We need to hold onto the + * locks for the child pages until we locate the parent, but we can + * release them before doing the actual insertion (see Lehman and Yao + * for the reasoning). * - * Do the parent insertion. We need to hold onto the locks for the - * child pages until we locate the parent, but we can release them - * before doing the actual insertion (see Lehman and Yao for the - * reasoning). + * Here we have to do something Lehman and Yao don't talk about: + * deal with a root split and construction of a new root. If our + * stack is empty then we have just split a node on what had been + * the root level when we descended the tree. If it is still the + * root then we perform a new-root construction. If it *wasn't* + * the root anymore, use the parent pointer to get up to the root + * level that someone constructed meanwhile, and find the right + * place to insert as for the normal case. + *---------- */ -l_spl: ; - if (stack == (BTStack) NULL) + if (is_root) { - if (!is_root) /* if this page was not root page */ - { - elog(DEBUG, "btree: concurrent ROOT page split"); - stack = (BTStack) palloc(sizeof(BTStackData)); - stack->bts_blkno = lpageop->btpo_parent; - stack->bts_offset = InvalidOffsetNumber; - stack->bts_btitem = (BTItem) palloc(sizeof(BTItemData)); - /* bts_btitem will be initialized below */ - stack->bts_parent = NULL; - goto l_spl; - } + Assert(stack == (BTStack) NULL); /* create a new root node and release the split buffers */ _bt_newroot(rel, buf, rbuf); } else { - ScanKey newskey; InsertIndexResult newres; BTItem new_item; - OffsetNumber upditem_offset = P_HIKEY; - bool do_update = false; - bool update_in_place = true; - bool parent_chained; + BTStackData fakestack; + BTItem ritem; + Buffer pbuf; - /* form a index tuple that points at the new right page */ - rbknum = BufferGetBlockNumber(rbuf); - rpage = BufferGetPage(rbuf); - rpageop = (BTPageOpaque) PageGetSpecialPointer(rpage); - - /* - 
* By convention, the first entry (1) on every non-rightmost - * page is the high key for that page. In order to get the - * lowest key on the new right page, we actually look at its - * second (2) entry. - */ - - if (!P_RIGHTMOST(rpageop)) + /* Set up a phony stack entry if we haven't got a real one */ + if (stack == (BTStack) NULL) { - ritem = (BTItem) PageGetItem(rpage, - PageGetItemId(rpage, P_FIRSTKEY)); - if (_bt_itemcmp(rel, keysz, scankey, - ritem, - (BTItem) PageGetItem(rpage, - PageGetItemId(rpage, P_HIKEY)), - BTEqualStrategyNumber)) - rpageop->btpo_flags |= BTP_CHAIN; + elog(DEBUG, "btree: concurrent ROOT page split"); + stack = &fakestack; + stack->bts_blkno = lpageop->btpo_parent; + stack->bts_offset = InvalidOffsetNumber; + /* bts_btitem will be initialized below */ + stack->bts_parent = NULL; } - else - ritem = (BTItem) PageGetItem(rpage, - PageGetItemId(rpage, P_HIKEY)); - /* get a unique btitem for this key */ - new_item = _bt_formitem(&(ritem->bti_itup)); + /* get high key from left page == lowest key on new right page */ + ritem = (BTItem) PageGetItem(page, + PageGetItemId(page, P_HIKEY)); + /* form an index tuple that points at the new right page */ + new_item = _bt_formitem(&(ritem->bti_itup)); + rbknum = BufferGetBlockNumber(rbuf); ItemPointerSet(&(new_item->bti_itup.t_tid), rbknum, P_HIKEY); /* @@ -642,192 +462,39 @@ l_spl: ; * Oops - if we were moved right then we need to change stack * item! We want to find parent pointing to where we are, * right ? - vadim 05/27/97 - */ - ItemPointerSet(&(stack->bts_btitem->bti_itup.t_tid), - bknum, P_HIKEY); - pbuf = _bt_getstackbuf(rel, stack, BT_WRITE); - ppage = BufferGetPage(pbuf); - ppageop = (BTPageOpaque) PageGetSpecialPointer(ppage); - parent_chained = ((ppageop->btpo_flags & BTP_CHAIN)) ? 
true : false; - - if (parent_chained && !left_chained) - elog(FATAL, "nbtree: unexpected chained parent of unchained page"); - - /* - * If the key of new_item is < than the key of the item in the - * parent page pointing to the left page (stack->bts_btitem), - * we have to update the latter key; otherwise the keys on the - * parent page wouldn't be monotonically increasing after we - * inserted the new pointer to the right page (new_item). This - * only happens if our left page is the leftmost page and a - * new minimum key had been inserted before, which is not - * reflected in the parent page but didn't matter so far. If - * there are duplicate keys and this new minimum key spills - * over to our new right page, we get an inconsistency if we - * don't update the left key in the parent page. * - * Also, new duplicates handling code require us to update parent - * item if some smaller items left on the left page (which is - * possible in splitting leftmost page) and current parent - * item == new_item. - vadim 05/27/97 + * Interestingly, this means we didn't *really* need to stack + * the parent key at all; all we really care about is the + * saved block and offset as a starting point for our search... 
*/ - if (_bt_itemcmp(rel, keysz, scankey, - stack->bts_btitem, new_item, - BTGreaterStrategyNumber) || - (!shifted && - _bt_itemcmp(rel, keysz, scankey, - stack->bts_btitem, new_item, - BTEqualStrategyNumber) && - _bt_itemcmp(rel, keysz, scankey, - lowLeftItem, new_item, - BTLessStrategyNumber))) - { - do_update = true; - - /* - * figure out which key is leftmost (if the parent page is - * rightmost, too, it must be the root) - */ - if (P_RIGHTMOST(ppageop)) - upditem_offset = P_HIKEY; - else - upditem_offset = P_FIRSTKEY; - if (!P_LEFTMOST(lpageop) || - stack->bts_offset != upditem_offset) - elog(FATAL, "btree: items are out of order (leftmost %d, stack %u, update %u)", - P_LEFTMOST(lpageop), stack->bts_offset, upditem_offset); - } - - if (do_update) - { - if (shifted) - elog(FATAL, "btree: attempt to update parent for shifted page"); - - /* - * Try to update in place. If out parent page is chained - * then we must forse insertion. - */ - if (!parent_chained && - MAXALIGN(IndexTupleDSize(lowLeftItem->bti_itup)) == - MAXALIGN(IndexTupleDSize(stack->bts_btitem->bti_itup))) - { - _bt_updateitem(rel, keysz, pbuf, - stack->bts_btitem, lowLeftItem); - _bt_wrtbuf(rel, buf); - _bt_wrtbuf(rel, rbuf); - } - else - { - update_in_place = false; - PageIndexTupleDelete(ppage, upditem_offset); - - /* - * don't write anything out yet--we still have the - * write lock, and now we call another _bt_insertonpg - * to insert the correct key. First, make a new item, - * using the tuple data from lowLeftItem. Point it to - * the left child. Update it on the stack at the same - * time. 
- */ - pfree(stack->bts_btitem); - stack->bts_btitem = _bt_formitem(&(lowLeftItem->bti_itup)); - ItemPointerSet(&(stack->bts_btitem->bti_itup.t_tid), - bknum, P_HIKEY); - - /* - * Unlock the children before doing this - */ - _bt_wrtbuf(rel, buf); - _bt_wrtbuf(rel, rbuf); - - /* - * A regular _bt_binsrch should find the right place - * to put the new entry, since it should be lower than - * any other key on the page. Therefore set afteritem - * to NULL. - */ - newskey = _bt_mkscankey(rel, &(stack->bts_btitem->bti_itup)); - newres = _bt_insertonpg(rel, pbuf, stack->bts_parent, - keysz, newskey, stack->bts_btitem, - NULL); - - pfree(newres); - pfree(newskey); - - /* - * we have now lost our lock on the parent buffer, and - * need to get it back. - */ - pbuf = _bt_getstackbuf(rel, stack, BT_WRITE); - } - } - else - { - _bt_wrtbuf(rel, buf); - _bt_wrtbuf(rel, rbuf); - } + ItemPointerSet(&(stack->bts_btitem.bti_itup.t_tid), + bknum, P_HIKEY); - newskey = _bt_mkscankey(rel, &(new_item->bti_itup)); + pbuf = _bt_getstackbuf(rel, stack); - afteritem = stack->bts_btitem; - if (parent_chained && !update_in_place) - { - ppage = BufferGetPage(pbuf); - ppageop = (BTPageOpaque) PageGetSpecialPointer(ppage); - if (ppageop->btpo_flags & BTP_CHAIN) - elog(FATAL, "btree: unexpected BTP_CHAIN flag in parent after update"); - if (P_RIGHTMOST(ppageop)) - elog(FATAL, "btree: chained parent is RIGHTMOST after update"); - maxoff = PageGetMaxOffsetNumber(ppage); - if (maxoff != P_FIRSTKEY) - elog(FATAL, "btree: FIRSTKEY was unexpected in parent after update"); - if (_bt_skeycmp(rel, keysz, newskey, ppage, - PageGetItemId(ppage, P_FIRSTKEY), - BTLessEqualStrategyNumber)) - elog(FATAL, "btree: parent FIRSTKEY is >= duplicate key after update"); - if (!_bt_skeycmp(rel, keysz, newskey, ppage, - PageGetItemId(ppage, P_HIKEY), - BTEqualStrategyNumber)) - elog(FATAL, "btree: parent HIGHKEY is not equal duplicate key after update"); - afteritem = (BTItem) NULL; - } - else if (left_chained && 
!update_in_place) - { - ppage = BufferGetPage(pbuf); - ppageop = (BTPageOpaque) PageGetSpecialPointer(ppage); - if (!P_RIGHTMOST(ppageop) && - _bt_skeycmp(rel, keysz, newskey, ppage, - PageGetItemId(ppage, P_HIKEY), - BTGreaterStrategyNumber)) - afteritem = (BTItem) NULL; - } - if (afteritem == (BTItem) NULL) - { - rbuf = _bt_getbuf(rel, ppageop->btpo_next, BT_WRITE); - _bt_relbuf(rel, pbuf, BT_WRITE); - pbuf = rbuf; - } + /* Now we can write and unlock the children */ + _bt_wrtbuf(rel, rbuf); + _bt_wrtbuf(rel, buf); + /* Recursively update the parent */ newres = _bt_insertonpg(rel, pbuf, stack->bts_parent, - keysz, newskey, new_item, - afteritem); + 0, NULL, new_item, stack->bts_offset); /* be tidy */ pfree(newres); - pfree(newskey); pfree(new_item); } } else { - itup_off = _bt_pgaddtup(rel, buf, keysz, scankey, - itemsz, btitem, afteritem); + _bt_pgaddtup(rel, page, itemsz, btitem, newitemoff, "page"); + itup_off = newitemoff; itup_blkno = BufferGetBlockNumber(buf); - - _bt_relbuf(rel, buf, BT_WRITE); + /* Write out the updated page and release pin/lock */ + _bt_wrtbuf(rel, buf); } - /* by here, the new tuple is inserted */ + /* by here, the new tuple is inserted at itup_blkno/itup_off */ res = (InsertIndexResult) palloc(sizeof(InsertIndexResultData)); ItemPointerSet(&(res->pointerData), itup_blkno, itup_off); @@ -838,12 +505,19 @@ l_spl: ; * _bt_split() -- split a page in the btree. * * On entry, buf is the page to split, and is write-locked and pinned. - * Returns the new right sibling of buf, pinned and write-locked. The - * pin and lock on buf are maintained. + * firstright is the item index of the first item to be moved to the + * new right page. newitemoff etc. tell us about the new item that + * must be inserted along with the data from the old page. + * + * Returns the new right sibling of buf, pinned and write-locked. + * The pin and lock on buf are maintained. *itup_off and *itup_blkno + * are set to the exact location where newitem was inserted. 
*/ static Buffer -_bt_split(Relation rel, Size keysz, ScanKey scankey, - Buffer buf, OffsetNumber firstright) +_bt_split(Relation rel, Buffer buf, OffsetNumber firstright, + OffsetNumber newitemoff, Size newitemsz, BTItem newitem, + bool newitemonleft, + OffsetNumber *itup_off, BlockNumber *itup_blkno) { Buffer rbuf; Page origpage; @@ -860,7 +534,6 @@ _bt_split(Relation rel, Size keysz, ScanKey scankey, BTItem item; OffsetNumber leftoff, rightoff; - OffsetNumber start; OffsetNumber maxoff; OffsetNumber i; @@ -869,8 +542,8 @@ _bt_split(Relation rel, Size keysz, ScanKey scankey, leftpage = PageGetTempPage(origpage, sizeof(BTPageOpaqueData)); rightpage = BufferGetPage(rbuf); - _bt_pageinit(rightpage, BufferGetPageSize(rbuf)); _bt_pageinit(leftpage, BufferGetPageSize(buf)); + _bt_pageinit(rightpage, BufferGetPageSize(rbuf)); /* init btree private data */ oopaque = (BTPageOpaque) PageGetSpecialPointer(origpage); @@ -879,106 +552,130 @@ _bt_split(Relation rel, Size keysz, ScanKey scankey, /* if we're splitting this page, it won't be the root when we're done */ oopaque->btpo_flags &= ~BTP_ROOT; - oopaque->btpo_flags &= ~BTP_CHAIN; lopaque->btpo_flags = ropaque->btpo_flags = oopaque->btpo_flags; lopaque->btpo_prev = oopaque->btpo_prev; - ropaque->btpo_prev = BufferGetBlockNumber(buf); lopaque->btpo_next = BufferGetBlockNumber(rbuf); + ropaque->btpo_prev = BufferGetBlockNumber(buf); ropaque->btpo_next = oopaque->btpo_next; + /* + * Must copy the original parent link into both new pages, even though + * it might be quite obsolete by now. We might need it if this level + * is or recently was the root (see README). + */ lopaque->btpo_parent = ropaque->btpo_parent = oopaque->btpo_parent; /* * If the page we're splitting is not the rightmost page at its level - * in the tree, then the first (0) entry on the page is the high key + * in the tree, then the first entry on the page is the high key * for the page. We need to copy that to the right half. 
Otherwise - * (meaning the rightmost page case), we should treat the line - * pointers beginning at zero as user data. - * - * We leave a blank space at the start of the line table for the left - * page. We'll come back later and fill it in with the high key item - * we get from the right key. + * (meaning the rightmost page case), all the items on the right half + * will be user data. */ + rightoff = P_HIKEY; - leftoff = P_FIRSTKEY; - ropaque->btpo_next = oopaque->btpo_next; if (!P_RIGHTMOST(oopaque)) { - /* splitting a non-rightmost page, start at the first data item */ - start = P_FIRSTKEY; - itemid = PageGetItemId(origpage, P_HIKEY); itemsz = ItemIdGetLength(itemid); item = (BTItem) PageGetItem(origpage, itemid); - if (PageAddItem(rightpage, (Item) item, itemsz, P_HIKEY, LP_USED) == InvalidOffsetNumber) + if (PageAddItem(rightpage, (Item) item, itemsz, rightoff, + LP_USED) == InvalidOffsetNumber) elog(FATAL, "btree: failed to add hikey to the right sibling"); - rightoff = P_FIRSTKEY; + rightoff = OffsetNumberNext(rightoff); } - else - { - /* splitting a rightmost page, "high key" is the first data item */ - start = P_HIKEY; - /* the new rightmost page will not have a high key */ - rightoff = P_HIKEY; + /* + * The "high key" for the new left page will be the first key that's + * going to go into the new right page. This might be either the + * existing data item at position firstright, or the incoming tuple. 
+ */ + leftoff = P_HIKEY; + if (!newitemonleft && newitemoff == firstright) + { + /* incoming tuple will become first on right page */ + itemsz = newitemsz; + item = newitem; } - maxoff = PageGetMaxOffsetNumber(origpage); - if (firstright == InvalidOffsetNumber) + else { - Size llimit = PageGetFreeSpace(leftpage) / 2; - - firstright = _bt_findsplitloc(rel, keysz, scankey, - origpage, start, maxoff, llimit); + /* existing item at firstright will become first on right page */ + itemid = PageGetItemId(origpage, firstright); + itemsz = ItemIdGetLength(itemid); + item = (BTItem) PageGetItem(origpage, itemid); } + if (PageAddItem(leftpage, (Item) item, itemsz, leftoff, + LP_USED) == InvalidOffsetNumber) + elog(FATAL, "btree: failed to add hikey to the left sibling"); + leftoff = OffsetNumberNext(leftoff); - for (i = start; i <= maxoff; i = OffsetNumberNext(i)) + /* + * Now transfer all the data items to the appropriate page + */ + maxoff = PageGetMaxOffsetNumber(origpage); + + for (i = P_FIRSTDATAKEY(oopaque); i <= maxoff; i = OffsetNumberNext(i)) { itemid = PageGetItemId(origpage, i); itemsz = ItemIdGetLength(itemid); item = (BTItem) PageGetItem(origpage, itemid); + /* does new item belong before this one? 
*/ + if (i == newitemoff) + { + if (newitemonleft) + { + _bt_pgaddtup(rel, leftpage, newitemsz, newitem, leftoff, + "left sibling"); + *itup_off = leftoff; + *itup_blkno = BufferGetBlockNumber(buf); + leftoff = OffsetNumberNext(leftoff); + } + else + { + _bt_pgaddtup(rel, rightpage, newitemsz, newitem, rightoff, + "right sibling"); + *itup_off = rightoff; + *itup_blkno = BufferGetBlockNumber(rbuf); + rightoff = OffsetNumberNext(rightoff); + } + } + /* decide which page to put it on */ if (i < firstright) { - if (PageAddItem(leftpage, (Item) item, itemsz, leftoff, - LP_USED) == InvalidOffsetNumber) - elog(FATAL, "btree: failed to add item to the left sibling"); + _bt_pgaddtup(rel, leftpage, itemsz, item, leftoff, + "left sibling"); leftoff = OffsetNumberNext(leftoff); } else { - if (PageAddItem(rightpage, (Item) item, itemsz, rightoff, - LP_USED) == InvalidOffsetNumber) - elog(FATAL, "btree: failed to add item to the right sibling"); + _bt_pgaddtup(rel, rightpage, itemsz, item, rightoff, + "right sibling"); rightoff = OffsetNumberNext(rightoff); } } - /* - * Okay, page has been split, high key on right page is correct. Now - * set the high key on the left page to be the min key on the right - * page. - */ - - if (P_RIGHTMOST(ropaque)) - itemid = PageGetItemId(rightpage, P_HIKEY); - else - itemid = PageGetItemId(rightpage, P_FIRSTKEY); - itemsz = ItemIdGetLength(itemid); - item = (BTItem) PageGetItem(rightpage, itemid); - - /* - * We left a hole for the high key on the left page; fill it. The - * modal crap is to tell the page manager to put the new item on the - * page and not screw around with anything else. Whoever designed - * this interface has presumably crawled back into the dung heap they - * came from. No one here will admit to it. 
- */ - - PageManagerModeSet(OverwritePageManagerMode); - if (PageAddItem(leftpage, (Item) item, itemsz, P_HIKEY, LP_USED) == InvalidOffsetNumber) - elog(FATAL, "btree: failed to add hikey to the left sibling"); - PageManagerModeSet(ShufflePageManagerMode); + /* cope with possibility that newitem goes at the end */ + if (i <= newitemoff) + { + if (newitemonleft) + { + _bt_pgaddtup(rel, leftpage, newitemsz, newitem, leftoff, + "left sibling"); + *itup_off = leftoff; + *itup_blkno = BufferGetBlockNumber(buf); + leftoff = OffsetNumberNext(leftoff); + } + else + { + _bt_pgaddtup(rel, rightpage, newitemsz, newitem, rightoff, + "right sibling"); + *itup_off = rightoff; + *itup_blkno = BufferGetBlockNumber(rbuf); + rightoff = OffsetNumberNext(rightoff); + } + } /* * By here, the original data page has been split into two new halves, @@ -992,14 +689,10 @@ _bt_split(Relation rel, Size keysz, ScanKey scankey, PageRestoreTempPage(leftpage, origpage); - /* write these guys out */ - _bt_wrtnorelbuf(rel, rbuf); - _bt_wrtnorelbuf(rel, buf); - /* * Finally, we need to grab the right sibling (if any) and fix the * prev pointer there. We are guaranteed that this is deadlock-free - * since no other writer will be moving holding a lock on that page + * since no other writer will be holding a lock on that page * and trying to move left, and all readers release locks on a page * before trying to fetch its neighbors. */ @@ -1020,87 +713,214 @@ _bt_split(Relation rel, Size keysz, ScanKey scankey, } /* - * _bt_findsplitloc() -- find a safe place to split a page. + * _bt_findsplitloc() -- find an appropriate place to split a page. + * + * The idea here is to equalize the free space that will be on each split + * page, *after accounting for the inserted tuple*. (If we fail to account + * for it, we might find ourselves with too little room on the page that + * it needs to go into!) 
* - * In order to guarantee the proper handling of searches for duplicate - * keys, the first duplicate in the chain must either be the first - * item on the page after the split, or the entire chain must be on - * one of the two pages. That is, - * [1 2 2 2 3 4 5] - * must become - * [1] [2 2 2 3 4 5] - * or - * [1 2 2 2] [3 4 5] - * but not - * [1 2 2] [2 3 4 5]. - * However, - * [2 2 2 2 2 3 4] - * may be split as - * [2 2 2 2] [2 3 4]. + * We are passed the intended insert position of the new tuple, expressed as + * the offsetnumber of the tuple it must go in front of. (This could be + * maxoff+1 if the tuple is to go at the end.) + * + * We return the index of the first existing tuple that should go on the + * righthand page, plus a boolean indicating whether the new tuple goes on + * the left or right page. The bool is necessary to disambiguate the case + * where firstright == newitemoff. */ static OffsetNumber _bt_findsplitloc(Relation rel, - Size keysz, - ScanKey scankey, Page page, - OffsetNumber start, - OffsetNumber maxoff, - Size llimit) + OffsetNumber newitemoff, + Size newitemsz, + bool *newitemonleft) { - OffsetNumber i; - OffsetNumber saferight; - ItemId nxtitemid, - safeitemid; - BTItem safeitem, - nxtitem; - Size nbytes; - - if (start >= maxoff) - elog(FATAL, "btree: cannot split if start (%d) >= maxoff (%d)", - start, maxoff); - saferight = start; - safeitemid = PageGetItemId(page, saferight); - nbytes = ItemIdGetLength(safeitemid) + sizeof(ItemIdData); - safeitem = (BTItem) PageGetItem(page, safeitemid); - - i = OffsetNumberNext(start); - - while (nbytes < llimit) + BTPageOpaque opaque; + OffsetNumber offnum; + OffsetNumber maxoff; + ItemId itemid; + FindSplitData state; + int leftspace, + rightspace, + dataitemtotal, + dataitemstoleft; + + opaque = (BTPageOpaque) PageGetSpecialPointer(page); + + state.newitemsz = newitemsz; + state.non_leaf = ! 
P_ISLEAF(opaque); + state.have_split = false; + + /* Total free space available on a btree page, after fixed overhead */ + leftspace = rightspace = + PageGetPageSize(page) - sizeof(PageHeaderData) - + MAXALIGN(sizeof(BTPageOpaqueData)) + + sizeof(ItemIdData); + + /* The right page will have the same high key as the old page */ + if (!P_RIGHTMOST(opaque)) { - /* check the next item on the page */ - nxtitemid = PageGetItemId(page, i); - nbytes += (ItemIdGetLength(nxtitemid) + sizeof(ItemIdData)); - nxtitem = (BTItem) PageGetItem(page, nxtitemid); + itemid = PageGetItemId(page, P_HIKEY); + rightspace -= (int) (ItemIdGetLength(itemid) + sizeof(ItemIdData)); + } + + /* Count up total space in data items without actually scanning 'em */ + dataitemtotal = rightspace - (int) PageGetFreeSpace(page); + + /* + * Scan through the data items and calculate space usage for a split + * at each possible position. XXX we could probably stop somewhere + * near the middle... + */ + dataitemstoleft = 0; + maxoff = PageGetMaxOffsetNumber(page); + + for (offnum = P_FIRSTDATAKEY(opaque); + offnum <= maxoff; + offnum = OffsetNumberNext(offnum)) + { + Size itemsz; + int leftfree, + rightfree; + + itemid = PageGetItemId(page, offnum); + itemsz = ItemIdGetLength(itemid) + sizeof(ItemIdData); /* - * Test against last known safe item: if the tuple we're looking - * at isn't equal to the last safe one we saw, then it's our new - * safe tuple. + * We have to allow for the current item becoming the high key of + * the left page; therefore it counts against left space. 
*/ - if (!_bt_itemcmp(rel, keysz, scankey, - safeitem, nxtitem, BTEqualStrategyNumber)) + leftfree = leftspace - dataitemstoleft - (int) itemsz; + rightfree = rightspace - (dataitemtotal - dataitemstoleft); + if (offnum < newitemoff) + _bt_checksplitloc(&state, offnum, leftfree, rightfree, + false, itemsz); + else if (offnum > newitemoff) + _bt_checksplitloc(&state, offnum, leftfree, rightfree, + true, itemsz); + else { - safeitem = nxtitem; - saferight = i; + /* need to try it both ways!! */ + _bt_checksplitloc(&state, offnum, leftfree, rightfree, + false, newitemsz); + _bt_checksplitloc(&state, offnum, leftfree, rightfree, + true, itemsz); } - if (i < maxoff) - i = OffsetNumberNext(i); - else - break; + + dataitemstoleft += itemsz; } + if (! state.have_split) + elog(FATAL, "_bt_findsplitloc: can't find a feasible split point for %s", + RelationGetRelationName(rel)); + *newitemonleft = state.newitemonleft; + return state.firstright; +} + +static void +_bt_checksplitloc(FindSplitData *state, OffsetNumber firstright, + int leftfree, int rightfree, + bool newitemonleft, Size firstrightitemsz) +{ + if (newitemonleft) + leftfree -= (int) state->newitemsz; + else + rightfree -= (int) state->newitemsz; + /* + * If we are not on the leaf level, we will be able to discard the + * key data from the first item that winds up on the right page. + */ + if (state->non_leaf) + rightfree += (int) firstrightitemsz - + (int) (sizeof(BTItemData) + sizeof(ItemIdData)); /* - * If the chain of dups starts at the beginning of the page and - * extends past the halfway mark, we can split it in the middle. + * If feasible split point, remember best delta. 
*/ + if (leftfree >= 0 && rightfree >= 0) + { + int delta = leftfree - rightfree; + + if (delta < 0) + delta = -delta; + if (!state->have_split || delta < state->best_delta) + { + state->have_split = true; + state->newitemonleft = newitemonleft; + state->firstright = firstright; + state->best_delta = delta; + } + } +} + +/* + * _bt_getstackbuf() -- Walk back up the tree one step, and find the item + * we last looked at in the parent. + * + * This is possible because we save a bit image of the last item + * we looked at in the parent, and the update algorithm guarantees + * that if items above us in the tree move, they only move right. + * + * Also, re-set bts_blkno & bts_offset if changed. + */ +static Buffer +_bt_getstackbuf(Relation rel, BTStack stack) +{ + BlockNumber blkno; + Buffer buf; + OffsetNumber start, + offnum, + maxoff; + Page page; + ItemId itemid; + BTItem item; + BTPageOpaque opaque; + + blkno = stack->bts_blkno; + buf = _bt_getbuf(rel, blkno, BT_WRITE); + page = BufferGetPage(buf); + opaque = (BTPageOpaque) PageGetSpecialPointer(page); + maxoff = PageGetMaxOffsetNumber(page); - if (saferight == start) - saferight = i; + start = stack->bts_offset; + /* + * _bt_insertonpg set bts_offset to InvalidOffsetNumber in the + * case of concurrent ROOT page split. Also, watch out for + * possibility that page has a high key now when it didn't before. 
+	 */
+	if (start < P_FIRSTDATAKEY(opaque))
+		start = P_FIRSTDATAKEY(opaque);
-	if (saferight == maxoff && (maxoff - start) > 1)
-		saferight = start + (maxoff - start) / 2;
+	for (;;)
+	{
+		/* see if it's on this page */
+		for (offnum = start;
+			 offnum <= maxoff;
+			 offnum = OffsetNumberNext(offnum))
+		{
+			itemid = PageGetItemId(page, offnum);
+			item = (BTItem) PageGetItem(page, itemid);
+			if (BTItemSame(item, &stack->bts_btitem))
+			{
+				/* Return accurate pointer to where link is now */
+				stack->bts_blkno = blkno;
+				stack->bts_offset = offnum;
+				return buf;
+			}
+		}
+		/* by here, the item we're looking for moved right at least one page */
+		if (P_RIGHTMOST(opaque))
+			elog(FATAL, "_bt_getstackbuf: my bits moved right off the end of the world!"
+				 "\n\tRecreate index %s.", RelationGetRelationName(rel));
-	return saferight;
+		blkno = opaque->btpo_next;
+		_bt_relbuf(rel, buf, BT_WRITE);
+		buf = _bt_getbuf(rel, blkno, BT_WRITE);
+		page = BufferGetPage(buf);
+		opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+		maxoff = PageGetMaxOffsetNumber(page);
+		start = P_FIRSTDATAKEY(opaque);
+	}
 }
 /*
@@ -1116,9 +936,9 @@ _bt_findsplitloc(Relation rel,
  * graph.
  *
  * On entry, lbuf (the old root) and rbuf (its new peer) are write-
- * locked. We don't drop the locks in this routine; that's done by
- * the caller. On exit, a new root page exists with entries for the
- * two new children. The new root page is neither pinned nor locked.
+ * locked. On exit, a new root page exists with entries for the
+ * two new children. The new root page is neither pinned nor locked, and
+ * we have also written out lbuf and rbuf and dropped their pins/locks.
  */
 static void
 _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf)
@@ -1140,52 +960,52 @@ _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf)
 	rootbuf = _bt_getbuf(rel, P_NEW, BT_WRITE);
 	rootpage = BufferGetPage(rootbuf);
 	rootbknum = BufferGetBlockNumber(rootbuf);
-	_bt_pageinit(rootpage, BufferGetPageSize(rootbuf));
 	/* set btree special data */
 	rootopaque = (BTPageOpaque) PageGetSpecialPointer(rootpage);
 	rootopaque->btpo_prev = rootopaque->btpo_next = P_NONE;
 	rootopaque->btpo_flags |= BTP_ROOT;
-	/*
-	 * Insert the internal tuple pointers.
-	 */
-
 	lbkno = BufferGetBlockNumber(lbuf);
 	rbkno = BufferGetBlockNumber(rbuf);
 	lpage = BufferGetPage(lbuf);
 	rpage = BufferGetPage(rbuf);
+	/*
+	 * Make sure pages in old root level have valid parent links --- we will
+	 * need this in _bt_insertonpg() if a concurrent root split happens (see
+	 * README).
+	 */
 	((BTPageOpaque) PageGetSpecialPointer(lpage))->btpo_parent =
 		((BTPageOpaque) PageGetSpecialPointer(rpage))->btpo_parent =
 		rootbknum;
 	/*
-	 * step over the high key on the left page while building the left
-	 * page pointer.
+	 * Create downlink item for left page (old root). Since this will be
+	 * the first item in a non-leaf page, it implicitly has minus-infinity
+	 * key value, so we need not store any actual key in it.
 	 */
-	itemid = PageGetItemId(lpage, P_FIRSTKEY);
-	itemsz = ItemIdGetLength(itemid);
-	item = (BTItem) PageGetItem(lpage, itemid);
-	new_item = _bt_formitem(&(item->bti_itup));
+	itemsz = sizeof(BTItemData);
+	new_item = (BTItem) palloc(itemsz);
+	new_item->bti_itup.t_info = itemsz;
 	ItemPointerSet(&(new_item->bti_itup.t_tid), lbkno, P_HIKEY);
 	/*
-	 * insert the left page pointer into the new root page. the root page
-	 * is the rightmost page on its level so the "high key" item is the
-	 * first data item.
+	 * Insert the left page pointer into the new root page. The root page
+	 * is the rightmost page on its level so there is no "high key" in it;
+	 * the two items will go into positions P_HIKEY and P_FIRSTKEY.
 	 */
 	if (PageAddItem(rootpage, (Item) new_item, itemsz, P_HIKEY, LP_USED) == InvalidOffsetNumber)
 		elog(FATAL, "btree: failed to add leftkey to new root page");
 	pfree(new_item);
 	/*
-	 * the right page is the rightmost page on the second level, so the
-	 * "high key" item is the first data item on that page as well.
+	 * Create downlink item for right page. The key for it is obtained from
+	 * the "high key" position in the left page.
 	 */
-	itemid = PageGetItemId(rpage, P_HIKEY);
+	itemid = PageGetItemId(lpage, P_HIKEY);
 	itemsz = ItemIdGetLength(itemid);
-	item = (BTItem) PageGetItem(rpage, itemid);
+	item = (BTItem) PageGetItem(lpage, itemid);
 	new_item = _bt_formitem(&(item->bti_itup));
 	ItemPointerSet(&(new_item->bti_itup.t_tid), rbkno, P_HIKEY);
@@ -1196,497 +1016,101 @@ _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf)
 		elog(FATAL, "btree: failed to add rightkey to new root page");
 	pfree(new_item);
-	/* write and let go of the root buffer */
+	/* write and let go of the new root buffer */
 	_bt_wrtbuf(rel, rootbuf);
 	/* update metadata page with new root block number */
 	_bt_metaproot(rel, rootbknum, 0);
-	_bt_wrtbuf(rel, lbuf);
+	/* update and release new sibling, and finally the old root */
 	_bt_wrtbuf(rel, rbuf);
+	_bt_wrtbuf(rel, lbuf);
 }
 /*
  * _bt_pgaddtup() -- add a tuple to a particular page in the index.
  *
- * This routine adds the tuple to the page as requested, and keeps the
- * write lock and reference associated with the page's buffer. It is
- * an error to call pgaddtup() without a write lock and reference. If
- * afteritem is non-null, it's the item that we expect our new item
- * to follow. Otherwise, we do a binary search for the correct place
- * and insert the new item there.
+ * This routine adds the tuple to the page as requested. It does
+ * not affect pin/lock status, but you'd better have a write lock
+ * and pin on the target buffer! Don't forget to write and release
+ * the buffer afterwards, either.
+ *
+ * The main difference between this routine and a bare PageAddItem call
+ * is that this code knows that the leftmost data item on a non-leaf
+ * btree page doesn't need to have a key. Therefore, it strips such
+ * items down to just the item header. CAUTION: this works ONLY if
+ * we insert the items in order, so that the given itup_off does
+ * represent the final position of the item!
  */
-static OffsetNumber
+static void
 _bt_pgaddtup(Relation rel,
-			 Buffer buf,
-			 int keysz,
-			 ScanKey itup_scankey,
+			 Page page,
 			 Size itemsize,
 			 BTItem btitem,
-			 BTItem afteritem)
-{
-	OffsetNumber itup_off;
-	OffsetNumber first;
-	Page page;
-	BTPageOpaque opaque;
-	BTItem chkitem;
-
-	page = BufferGetPage(buf);
-	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
-	first = P_RIGHTMOST(opaque) ? P_HIKEY : P_FIRSTKEY;
-
-	if (afteritem == (BTItem) NULL)
-		itup_off = _bt_binsrch(rel, buf, keysz, itup_scankey, BT_INSERTION);
-	else
-	{
-		itup_off = first;
-
-		do
-		{
-			chkitem = (BTItem) PageGetItem(page, PageGetItemId(page, itup_off));
-			itup_off = OffsetNumberNext(itup_off);
-		} while (!BTItemSame(chkitem, afteritem));
-	}
-
-	if (PageAddItem(page, (Item) btitem, itemsize, itup_off, LP_USED) == InvalidOffsetNumber)
-		elog(FATAL, "btree: failed to add item to the page");
-
-	/* write the buffer, but hold our lock */
-	_bt_wrtnorelbuf(rel, buf);
-
-	return itup_off;
-}
-
-/*
- * _bt_goesonpg() -- Does a new tuple belong on this page?
- *
- * This is part of the complexity introduced by allowing duplicate
- * keys into the index. The tuple belongs on this page if:
- *
- * + there is no page to the right of this one; or
- * + it is less than the high key on the page; or
- * + the item it is to follow ("afteritem") appears on this
- * page.
- */
-static bool
-_bt_goesonpg(Relation rel,
-			 Buffer buf,
-			 Size keysz,
-			 ScanKey scankey,
-			 BTItem afteritem)
-{
-	Page page;
-	ItemId hikey;
-	BTPageOpaque opaque;
-	BTItem chkitem;
-	OffsetNumber offnum,
-				maxoff;
-	bool found;
-
-	page = BufferGetPage(buf);
-
-	/* no right neighbor? */
-	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
-	if (P_RIGHTMOST(opaque))
-		return true;
-
-	/*
-	 * this is a non-rightmost page, so it must have a high key item.
-	 *
-	 * If the scan key is < the high key (the min key on the next page), then
-	 * it for sure belongs here.
-	 */
-	hikey = PageGetItemId(page, P_HIKEY);
-	if (_bt_skeycmp(rel, keysz, scankey, page, hikey, BTLessStrategyNumber))
-		return true;
-
-	/*
-	 * If the scan key is > the high key, then it for sure doesn't belong
-	 * here.
-	 */
-
-	if (_bt_skeycmp(rel, keysz, scankey, page, hikey, BTGreaterStrategyNumber))
-		return false;
-
-	/*
-	 * If we have no adjacency information, and the item is equal to the
-	 * high key on the page (by here it is), then the item does not belong
-	 * on this page.
-	 *
-	 * Now it's not true in all cases. - vadim 06/10/97
-	 */
-
-	if (afteritem == (BTItem) NULL)
-	{
-		if (opaque->btpo_flags & BTP_LEAF)
-			return false;
-		if (opaque->btpo_flags & BTP_CHAIN)
-			return true;
-		if (_bt_skeycmp(rel, keysz, scankey, page,
-						PageGetItemId(page, P_FIRSTKEY),
-						BTEqualStrategyNumber))
-			return true;
-		return false;
-	}
-
-	/* damn, have to work for it. i hate that. */
-	maxoff = PageGetMaxOffsetNumber(page);
-
-	/*
-	 * Search the entire page for the afteroid. We need to do this,
-	 * rather than doing a binary search and starting from there, because
-	 * if the key we're searching for is the leftmost key in the tree at
-	 * this level, then a binary search will do the wrong thing. Splits
-	 * are pretty infrequent, so the cost isn't as bad as it could be.
-	 */
-
-	found = false;
-	for (offnum = P_FIRSTKEY;
-		 offnum <= maxoff;
-		 offnum = OffsetNumberNext(offnum))
-	{
-		chkitem = (BTItem) PageGetItem(page, PageGetItemId(page, offnum));
-
-		if (BTItemSame(chkitem, afteritem))
-		{
-			found = true;
-			break;
-		}
-	}
-
-	return found;
-}
-
-/*
- * _bt_tuplecompare() -- compare two IndexTuples,
- * return -1, 0, or +1
- *
- */
-static int32
-_bt_tuplecompare(Relation rel,
-				 Size keysz,
-				 ScanKey scankey,
-				 IndexTuple tuple1,
-				 IndexTuple tuple2)
+			 OffsetNumber itup_off,
+			 const char *where)
 {
-	TupleDesc tupDes;
-	int i;
-	int32 compare = 0;
+	BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+	BTItemData truncitem;
-	tupDes = RelationGetDescr(rel);
-
-	for (i = 1; i <= (int) keysz; i++)
-	{
-		ScanKey entry = &scankey[i - 1];
-		Datum attrDatum1,
-					attrDatum2;
-		bool isFirstNull,
-					isSecondNull;
-
-		attrDatum1 = index_getattr(tuple1, i, tupDes, &isFirstNull);
-		attrDatum2 = index_getattr(tuple2, i, tupDes, &isSecondNull);
-
-		/* see comments about NULLs handling in btbuild */
-		if (isFirstNull) /* attr in tuple1 is NULL */
-		{
-			if (isSecondNull) /* attr in tuple2 is NULL too */
-				compare = 0;
-			else
-				compare = 1; /* NULL ">" not-NULL */
-		}
-		else if (isSecondNull) /* attr in tuple1 is NOT_NULL and */
-		{ /* attr in tuple2 is NULL */
-			compare = -1; /* not-NULL "<" NULL */
-		}
-		else
-		{
-			compare = DatumGetInt32(FunctionCall2(&entry->sk_func,
-												  attrDatum1, attrDatum2));
-		}
-
-		if (compare != 0)
-			break; /* done when we find unequal attributes */
-	}
-
-	return compare;
-}
-
-/*
- * _bt_itemcmp() -- compare two BTItems using a requested
- * strategy (<, <=, =, >=, >)
- *
- */
-bool
-_bt_itemcmp(Relation rel,
-			Size keysz,
-			ScanKey scankey,
-			BTItem item1,
-			BTItem item2,
-			StrategyNumber strat)
-{
-	int32 compare;
-
-	compare = _bt_tuplecompare(rel, keysz, scankey,
-							   &(item1->bti_itup),
-							   &(item2->bti_itup));
-
-	switch (strat)
+	if (!
P_ISLEAF(opaque) && itup_off == P_FIRSTDATAKEY(opaque))
 	{
-		case BTLessStrategyNumber:
-			return (bool) (compare < 0);
-		case BTLessEqualStrategyNumber:
-			return (bool) (compare <= 0);
-		case BTEqualStrategyNumber:
-			return (bool) (compare == 0);
-		case BTGreaterEqualStrategyNumber:
-			return (bool) (compare >= 0);
-		case BTGreaterStrategyNumber:
-			return (bool) (compare > 0);
+		memcpy(&truncitem, btitem, sizeof(BTItemData));
+		truncitem.bti_itup.t_info = sizeof(BTItemData);
+		btitem = &truncitem;
+		itemsize = sizeof(BTItemData);
 	}
-	elog(ERROR, "_bt_itemcmp: bogus strategy %d", (int) strat);
-	return false;
-}
-
-/*
- * _bt_updateitem() -- updates the key of the item identified by the
- * oid with the key of newItem (done in place if
- * possible)
- *
- */
-static void
-_bt_updateitem(Relation rel,
-			   Size keysz,
-			   Buffer buf,
-			   BTItem oldItem,
-			   BTItem newItem)
-{
-	Page page;
-	OffsetNumber maxoff;
-	OffsetNumber i;
-	ItemPointerData itemPtrData;
-	BTItem item;
-	IndexTuple oldIndexTuple,
-				newIndexTuple;
-	int first;
-
-	page = BufferGetPage(buf);
-	maxoff = PageGetMaxOffsetNumber(page);
-
-	/* locate item on the page */
-	first = P_RIGHTMOST((BTPageOpaque) PageGetSpecialPointer(page))
-		?
P_HIKEY : P_FIRSTKEY;
-	i = first;
-	do
-	{
-		item = (BTItem) PageGetItem(page, PageGetItemId(page, i));
-		i = OffsetNumberNext(i);
-	} while (i <= maxoff && !BTItemSame(item, oldItem));
-
-	/* this should never happen (in theory) */
-	if (!BTItemSame(item, oldItem))
-		elog(FATAL, "_bt_getstackbuf was lying!!");
-
-	/*
-	 * It's defined by caller (_bt_insertonpg)
-	 */
-
-	/*
-	 * if(IndexTupleDSize(newItem->bti_itup) >
-	 * IndexTupleDSize(item->bti_itup)) { elog(NOTICE, "trying to
-	 * overwrite a smaller value with a bigger one in _bt_updateitem");
-	 * elog(ERROR, "this is not good."); }
-	 */
-
-	oldIndexTuple = &(item->bti_itup);
-	newIndexTuple = &(newItem->bti_itup);
-
-	/* keep the original item pointer */
-	ItemPointerCopy(&(oldIndexTuple->t_tid), &itemPtrData);
-	CopyIndexTuple(newIndexTuple, &oldIndexTuple);
-	ItemPointerCopy(&itemPtrData, &(oldIndexTuple->t_tid));
-
+	if (PageAddItem(page, (Item) btitem, itemsize, itup_off,
+					LP_USED) == InvalidOffsetNumber)
+		elog(FATAL, "btree: failed to add item to the %s for %s",
+			 where, RelationGetRelationName(rel));
 }
 /*
  * _bt_isequal - used in _bt_doinsert in check for duplicates.
  *
+ * This is very similar to _bt_compare, except for NULL handling.
  * Rule is simple: NOT_NULL not equal NULL, NULL not_equal NULL too.
  */
 static bool
 _bt_isequal(TupleDesc itupdesc, Page page, OffsetNumber offnum,
 			int keysz, ScanKey scankey)
 {
-	Datum datum;
 	BTItem btitem;
 	IndexTuple itup;
-	ScanKey entry;
-	AttrNumber attno;
-	int32 result;
 	int i;
-	bool null;
+
+	/* Better be comparing to a leaf item */
+	Assert(P_ISLEAF((BTPageOpaque) PageGetSpecialPointer(page)));
 	btitem = (BTItem) PageGetItem(page, PageGetItemId(page, offnum));
 	itup = &(btitem->bti_itup);
 	for (i = 1; i <= keysz; i++)
 	{
-		entry = &scankey[i - 1];
+		ScanKey entry = &scankey[i - 1];
+		AttrNumber attno;
+		Datum datum;
+		bool isNull;
+		int32 result;
+
 		attno = entry->sk_attno;
 		Assert(attno == i);
-		datum = index_getattr(itup, attno, itupdesc, &null);
+		datum = index_getattr(itup, attno, itupdesc, &isNull);
-		/* NULLs are not equal */
-		if (entry->sk_flags & SK_ISNULL || null)
+		/* NULLs are never equal to anything */
+		if (entry->sk_flags & SK_ISNULL || isNull)
 			return false;
 		result = DatumGetInt32(FunctionCall2(&entry->sk_func,
-											 entry->sk_argument, datum));
+											 entry->sk_argument,
+											 datum));
+
 		if (result != 0)
 			return false;
 	}
-	/* by here, the keys are equal */
+	/* if we get here, the keys are equal */
 	return true;
 }
-
-#ifdef NOT_USED
-/*
- * _bt_shift - insert btitem on the passed page after shifting page
- * to the right in the tree.
- *
- * NOTE: tested for shifting leftmost page only, having btitem < hikey.
- */
-static InsertIndexResult
-_bt_shift(Relation rel, Buffer buf, BTStack stack, int keysz,
-		  ScanKey scankey, BTItem btitem, BTItem hikey)
-{
-	InsertIndexResult res;
-	int itemsz;
-	Page page;
-	BlockNumber bknum;
-	BTPageOpaque pageop;
-	Buffer rbuf;
-	Page rpage;
-	BTPageOpaque rpageop;
-	Buffer pbuf;
-	Page ppage;
-	BTPageOpaque ppageop;
-	Buffer nbuf;
-	Page npage;
-	BTPageOpaque npageop;
-	BlockNumber nbknum;
-	BTItem nitem;
-	OffsetNumber afteroff;
-
-	btitem = _bt_formitem(&(btitem->bti_itup));
-	hikey = _bt_formitem(&(hikey->bti_itup));
-
-	page = BufferGetPage(buf);
-
-	/* grab new page */
-	nbuf = _bt_getbuf(rel, P_NEW, BT_WRITE);
-	nbknum = BufferGetBlockNumber(nbuf);
-	npage = BufferGetPage(nbuf);
-	_bt_pageinit(npage, BufferGetPageSize(nbuf));
-	npageop = (BTPageOpaque) PageGetSpecialPointer(npage);
-
-	/* copy content of the passed page */
-	memmove((char *) npage, (char *) page, BufferGetPageSize(buf));
-
-	/* re-init old (passed) page */
-	_bt_pageinit(page, BufferGetPageSize(buf));
-	pageop = (BTPageOpaque) PageGetSpecialPointer(page);
-
-	/* init old page opaque */
-	pageop->btpo_flags = npageop->btpo_flags; /* restore flags */
-	pageop->btpo_flags &= ~BTP_CHAIN;
-	if (_bt_itemcmp(rel, keysz, scankey, hikey, btitem, BTEqualStrategyNumber))
-		pageop->btpo_flags |= BTP_CHAIN;
-	pageop->btpo_prev = npageop->btpo_prev; /* restore prev */
-	pageop->btpo_next = nbknum; /* next points to the new page */
-	pageop->btpo_parent = npageop->btpo_parent;
-
-	/* init shifted page opaque */
-	npageop->btpo_prev = bknum = BufferGetBlockNumber(buf);
-
-	/* shifted page is ok, populate old page */
-
-	/* add passed hikey */
-	itemsz = IndexTupleDSize(hikey->bti_itup)
-		+ (sizeof(BTItemData) - sizeof(IndexTupleData));
-	itemsz = MAXALIGN(itemsz);
-	if (PageAddItem(page, (Item) hikey, itemsz, P_HIKEY, LP_USED) == InvalidOffsetNumber)
-		elog(FATAL, "btree: failed to add hikey in _bt_shift");
-	pfree(hikey);
-
-	/* add btitem */
-	itemsz =
IndexTupleDSize(btitem->bti_itup)
-		+ (sizeof(BTItemData) - sizeof(IndexTupleData));
-	itemsz = MAXALIGN(itemsz);
-	if (PageAddItem(page, (Item) btitem, itemsz, P_FIRSTKEY, LP_USED) == InvalidOffsetNumber)
-		elog(FATAL, "btree: failed to add firstkey in _bt_shift");
-	pfree(btitem);
-	nitem = (BTItem) PageGetItem(page, PageGetItemId(page, P_FIRSTKEY));
-	btitem = _bt_formitem(&(nitem->bti_itup));
-	ItemPointerSet(&(btitem->bti_itup.t_tid), bknum, P_HIKEY);
-
-	/* ok, write them out */
-	_bt_wrtnorelbuf(rel, nbuf);
-	_bt_wrtnorelbuf(rel, buf);
-
-	/* fix btpo_prev on right sibling of old page */
-	if (!P_RIGHTMOST(npageop))
-	{
-		rbuf = _bt_getbuf(rel, npageop->btpo_next, BT_WRITE);
-		rpage = BufferGetPage(rbuf);
-		rpageop = (BTPageOpaque) PageGetSpecialPointer(rpage);
-		rpageop->btpo_prev = nbknum;
-		_bt_wrtbuf(rel, rbuf);
-	}
-
-	/* get parent pointing to the old page */
-	ItemPointerSet(&(stack->bts_btitem->bti_itup.t_tid),
-				   bknum, P_HIKEY);
-	pbuf = _bt_getstackbuf(rel, stack, BT_WRITE);
-	ppage = BufferGetPage(pbuf);
-	ppageop = (BTPageOpaque) PageGetSpecialPointer(ppage);
-
-	_bt_relbuf(rel, nbuf, BT_WRITE);
-	_bt_relbuf(rel, buf, BT_WRITE);
-
-	/* re-set parent' pointer - we shifted our page to the right ! */
-	nitem = (BTItem) PageGetItem(ppage,
-								 PageGetItemId(ppage, stack->bts_offset));
-	ItemPointerSet(&(nitem->bti_itup.t_tid), nbknum, P_HIKEY);
-	ItemPointerSet(&(stack->bts_btitem->bti_itup.t_tid), nbknum, P_HIKEY);
-	_bt_wrtnorelbuf(rel, pbuf);
-
-	/*
-	 * Now we want insert into the parent pointer to our old page. It has
-	 * to be inserted before the pointer to new page. You may get problems
-	 * here (in the _bt_goesonpg and/or _bt_pgaddtup), but may be not - I
-	 * don't know. It works if old page is leftmost (nitem is NULL) and
-	 * btitem < hikey and it's all what we need currently.
- vadim
-	 * 05/30/97
-	 */
-	nitem = NULL;
-	afteroff = P_FIRSTKEY;
-	if (!P_RIGHTMOST(ppageop))
-		afteroff = OffsetNumberNext(afteroff);
-	if (stack->bts_offset >= afteroff)
-	{
-		afteroff = OffsetNumberPrev(stack->bts_offset);
-		nitem = (BTItem) PageGetItem(ppage, PageGetItemId(ppage, afteroff));
-		nitem = _bt_formitem(&(nitem->bti_itup));
-	}
-	res = _bt_insertonpg(rel, pbuf, stack->bts_parent,
-						 keysz, scankey, btitem, nitem);
-	pfree(btitem);
-
-	ItemPointerSet(&(res->pointerData), nbknum, P_HIKEY);
-
-	return res;
-}
-
-#endif
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index 1a623698f5..40604dbc25 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -9,7 +9,7 @@
  *
  *
  * IDENTIFICATION
- * $Header: /cvsroot/pgsql/src/backend/access/nbtree/nbtpage.c,v 1.36 2000/04/12 17:14:49 momjian Exp $
+ * $Header: /cvsroot/pgsql/src/backend/access/nbtree/nbtpage.c,v 1.37 2000/07/21 06:42:32 tgl Exp $
  *
  * NOTES
  * Postgres btree pages look like ordinary relation pages. The opaque
@@ -90,7 +90,7 @@ _bt_metapinit(Relation rel)
 	metad.btm_version = BTREE_VERSION;
 	metad.btm_root = P_NONE;
 	metad.btm_level = 0;
-	memmove((char *) BTPageGetMeta(pg), (char *) &metad, sizeof(metad));
+	memcpy((char *) BTPageGetMeta(pg), (char *) &metad, sizeof(metad));
 	op = (BTPageOpaque) PageGetSpecialPointer(pg);
 	op->btpo_flags = BTP_META;
@@ -102,52 +102,6 @@ _bt_metapinit(Relation rel)
 	UnlockRelation(rel, AccessExclusiveLock);
 }
-#ifdef NOT_USED
-/*
- * _bt_checkmeta() -- Verify that the metadata stored in a btree are
- * reasonable.
- */
-void
-_bt_checkmeta(Relation rel)
-{
-	Buffer metabuf;
-	Page metap;
-	BTMetaPageData *metad;
-	BTPageOpaque op;
-	int nblocks;
-
-	/* if the relation is empty, this is init time; don't complain */
-	if ((nblocks = RelationGetNumberOfBlocks(rel)) == 0)
-		return;
-
-	metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_READ);
-	metap = BufferGetPage(metabuf);
-	op = (BTPageOpaque) PageGetSpecialPointer(metap);
-	if (!(op->btpo_flags & BTP_META))
-	{
-		elog(ERROR, "Invalid metapage for index %s",
-			 RelationGetRelationName(rel));
-	}
-	metad = BTPageGetMeta(metap);
-
-	if (metad->btm_magic != BTREE_MAGIC)
-	{
-		elog(ERROR, "Index %s is not a btree",
-			 RelationGetRelationName(rel));
-	}
-
-	if (metad->btm_version != BTREE_VERSION)
-	{
-		elog(ERROR, "Version mismatch on %s: version %d file, version %d code",
-			 RelationGetRelationName(rel),
-			 metad->btm_version, BTREE_VERSION);
-	}
-
-	_bt_relbuf(rel, metabuf, BT_READ);
-}
-
-#endif
-
 /*
  * _bt_getroot() -- Get the root page of the btree.
  *
@@ -157,11 +111,15 @@ _bt_checkmeta(Relation rel)
  * standard class of race conditions exists here; I think I covered
  * them all in the Hopi Indian rain dance of lock requests below.
  *
- * We pass in the access type (BT_READ or BT_WRITE), and return the
- * root page's buffer with the appropriate lock type set. Reference
- * count on the root page gets bumped by ReadBuffer. The metadata
- * page is unlocked and unreferenced by this process when this routine
- * returns.
+ * The access type parameter (BT_READ or BT_WRITE) controls whether
+ * a new root page will be created or not. If access = BT_READ,
+ * and no root page exists, we just return InvalidBuffer. For
+ * BT_WRITE, we try to create the root page if it doesn't exist.
+ * NOTE that the returned root page will have only a read lock set
+ * on it even if access = BT_WRITE!
+ *
+ * On successful return, the root page is pinned and read-locked.
+ * The metadata page is not locked or pinned on exit.
  */
 Buffer
 _bt_getroot(Relation rel, int access)
@@ -178,78 +136,71 @@ _bt_getroot(Relation rel, int access)
 	metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_READ);
 	metapg = BufferGetPage(metabuf);
 	metaopaque = (BTPageOpaque) PageGetSpecialPointer(metapg);
-	Assert(metaopaque->btpo_flags & BTP_META);
 	metad = BTPageGetMeta(metapg);
-	if (metad->btm_magic != BTREE_MAGIC)
-	{
+	if (!(metaopaque->btpo_flags & BTP_META) ||
+		metad->btm_magic != BTREE_MAGIC)
 		elog(ERROR, "Index %s is not a btree",
 			 RelationGetRelationName(rel));
-	}
 	if (metad->btm_version != BTREE_VERSION)
-	{
-		elog(ERROR, "Version mismatch on %s: version %d file, version %d code",
+		elog(ERROR, "Version mismatch on %s: version %d file, version %d code",
 			 RelationGetRelationName(rel),
 			 metad->btm_version, BTREE_VERSION);
-	}
 	/* if no root page initialized yet, do it */
 	if (metad->btm_root == P_NONE)
 	{
+		/* If access = BT_READ, caller doesn't want us to create root yet */
+		if (access == BT_READ)
+		{
+			_bt_relbuf(rel, metabuf, BT_READ);
+			return InvalidBuffer;
+		}
-		/* turn our read lock in for a write lock */
-		_bt_relbuf(rel, metabuf, BT_READ);
-		metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_WRITE);
-		metapg = BufferGetPage(metabuf);
-		metaopaque = (BTPageOpaque) PageGetSpecialPointer(metapg);
-		Assert(metaopaque->btpo_flags & BTP_META);
-		metad = BTPageGetMeta(metapg);
+		/* trade in our read lock for a write lock */
+		LockBuffer(metabuf, BUFFER_LOCK_UNLOCK);
+		LockBuffer(metabuf, BT_WRITE);
 		/*
 		 * Race condition: if someone else initialized the metadata
 		 * between the time we released the read lock and acquired the
-		 * write lock, above, we want to avoid doing it again.
+		 * write lock, above, we must avoid doing it again.
 		 */
-
 		if (metad->btm_root == P_NONE)
 		{
 			/*
 			 * Get, initialize, write, and leave a lock of the appropriate
 			 * type on the new root page. Since this is the first page in
-			 * the tree, it's a leaf.
+			 * the tree, it's a leaf as well as the root.
 			 */
-
 			rootbuf = _bt_getbuf(rel, P_NEW, BT_WRITE);
 			rootblkno = BufferGetBlockNumber(rootbuf);
 			rootpg = BufferGetPage(rootbuf);
+
 			metad->btm_root = rootblkno;
 			metad->btm_level = 1;
+
 			_bt_pageinit(rootpg, BufferGetPageSize(rootbuf));
 			rootopaque = (BTPageOpaque) PageGetSpecialPointer(rootpg);
 			rootopaque->btpo_flags |= (BTP_LEAF | BTP_ROOT);
 			_bt_wrtnorelbuf(rel, rootbuf);
-			/* swap write lock for read lock, if appropriate */
-			if (access != BT_WRITE)
-			{
-				LockBuffer(rootbuf, BUFFER_LOCK_UNLOCK);
-				LockBuffer(rootbuf, BT_READ);
-			}
+			/* swap write lock for read lock */
+			LockBuffer(rootbuf, BUFFER_LOCK_UNLOCK);
+			LockBuffer(rootbuf, BT_READ);
-			/* okay, metadata is correct */
+			/* okay, metadata is correct, write and release it */
 			_bt_wrtbuf(rel, metabuf);
 		}
 		else
 		{
-
 			/*
 			 * Metadata initialized by someone else. In order to
 			 * guarantee no deadlocks, we have to release the metadata
 			 * page and start all over again.
 			 */
-
 			_bt_relbuf(rel, metabuf, BT_WRITE);
 			return _bt_getroot(rel, access);
 		}
@@ -259,22 +210,21 @@ _bt_getroot(Relation rel, int access)
 		rootblkno = metad->btm_root;
 		_bt_relbuf(rel, metabuf, BT_READ); /* done with the meta page */
-		rootbuf = _bt_getbuf(rel, rootblkno, access);
+		rootbuf = _bt_getbuf(rel, rootblkno, BT_READ);
 	}
 	/*
 	 * Race condition: If the root page split between the time we looked
 	 * at the metadata page and got the root buffer, then we got the wrong
-	 * buffer.
+	 * buffer. Release it and try again.
 	 */
-
 	rootpg = BufferGetPage(rootbuf);
 	rootopaque = (BTPageOpaque) PageGetSpecialPointer(rootpg);
-	if (!(rootopaque->btpo_flags & BTP_ROOT))
-	{
+	if (! P_ISROOT(rootopaque))
+	{
 		/* it happened, try again */
-		_bt_relbuf(rel, rootbuf, access);
+		_bt_relbuf(rel, rootbuf, BT_READ);
 		return _bt_getroot(rel, access);
 	}
@@ -283,7 +233,6 @@ _bt_getroot(Relation rel, int access)
 	 * count is correct, and we have no lock set on the metadata page.
 	 * Return the root block.
 	 */
-
 	return rootbuf;
 }
@@ -291,33 +240,38 @@ _bt_getroot(Relation rel, int access)
  * _bt_getbuf() -- Get a buffer by block number for read or write.
  *
  * When this routine returns, the appropriate lock is set on the
- * requested buffer its reference count is correct.
+ * requested buffer and its reference count has been incremented
+ * (ie, the buffer is "locked and pinned").
  */
 Buffer
 _bt_getbuf(Relation rel, BlockNumber blkno, int access)
 {
 	Buffer buf;
-	Page page;
 	if (blkno != P_NEW)
 	{
+		/* Read an existing block of the relation */
 		buf = ReadBuffer(rel, blkno);
 		LockBuffer(buf, access);
 	}
 	else
 	{
+		Page page;
 		/*
-		 * Extend bufmgr code is unclean and so we have to use locking
+		 * Extend the relation by one page.
+		 *
+		 * Extend bufmgr code is unclean and so we have to use extra locking
 		 * here.
 		 */
 		LockPage(rel, 0, ExclusiveLock);
 		buf = ReadBuffer(rel, blkno);
+		LockBuffer(buf, access);
 		UnlockPage(rel, 0, ExclusiveLock);
-		blkno = BufferGetBlockNumber(buf);
+
+		/* Initialize the new page before returning it */
 		page = BufferGetPage(buf);
 		_bt_pageinit(page, BufferGetPageSize(buf));
-		LockBuffer(buf, access);
 	}
 	/* ref count and lock type are correct */
@@ -326,6 +280,8 @@ _bt_getbuf(Relation rel, BlockNumber blkno, int access)
 /*
  * _bt_relbuf() -- release a locked buffer.
+ *
+ * Lock and pin (refcount) are both dropped.
  */
 void
 _bt_relbuf(Relation rel, Buffer buf, int access)
@@ -337,9 +293,15 @@ _bt_relbuf(Relation rel, Buffer buf, int access)
 /*
  * _bt_wrtbuf() -- write a btree page to disk.
  *
- * This routine releases the lock held on the buffer and our reference
- * to it. It is an error to call _bt_wrtbuf() without a write lock
- * or a reference to the buffer.
+ * This routine releases the lock held on the buffer and our refcount
+ * for it. It is an error to call _bt_wrtbuf() without a write lock
+ * and a pin on the buffer.
+ *
+ * NOTE: actually, the buffer manager just marks the shared buffer page
+ * dirty here, the real I/O happens later.
Since we can't persuade the
+ * Unix kernel to schedule disk writes in a particular order, there's not
+ * much point in worrying about this. The most we can say is that all the
+ * writes will occur before commit.
  */
 void
 _bt_wrtbuf(Relation rel, Buffer buf)
@@ -353,7 +315,9 @@ _bt_wrtbuf(Relation rel, Buffer buf)
  * our reference or lock.
  *
  * It is an error to call _bt_wrtnorelbuf() without a write lock
- * or a reference to the buffer.
+ * and a pin on the buffer.
+ *
+ * See above NOTE.
  */
 void
 _bt_wrtnorelbuf(Relation rel, Buffer buf)
@@ -389,10 +353,10 @@ _bt_pageinit(Page page, Size size)
  * we split the root page, we record the new parent in the metadata page
  * for the relation. This routine does the work.
  *
- * No direct preconditions, but if you don't have the a write lock on
+ * No direct preconditions, but if you don't have the write lock on
  * at least the old root page when you call this, you're making a big
  * mistake. On exit, metapage data is correct and we no longer have
- * a reference to or lock on the metapage.
+ * a pin or lock on the metapage.
  */
 void
 _bt_metaproot(Relation rel, BlockNumber rootbknum, int level)
@@ -416,127 +380,8 @@ _bt_metaproot(Relation rel, BlockNumber rootbknum, int level)
 }
 /*
- * _bt_getstackbuf() -- Walk back up the tree one step, and find the item
- * we last looked at in the parent.
- *
- * This is possible because we save a bit image of the last item
- * we looked at in the parent, and the update algorithm guarantees
- * that if items above us in the tree move, they only move right.
- *
- * Also, re-set bts_blkno & bts_offset if changed and
- * bts_btitem (it may be changed - see _bt_insertonpg).
+ * Delete an item from a btree. It had better be a leaf item...
  */
-Buffer
-_bt_getstackbuf(Relation rel, BTStack stack, int access)
-{
-	Buffer buf;
-	BlockNumber blkno;
-	OffsetNumber start,
-				offnum,
-				maxoff;
-	OffsetNumber i;
-	Page page;
-	ItemId itemid;
-	BTItem item;
-	BTPageOpaque opaque;
-	BTItem item_save;
-	int item_nbytes;
-
-	blkno = stack->bts_blkno;
-	buf = _bt_getbuf(rel, blkno, access);
-	page = BufferGetPage(buf);
-	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
-	maxoff = PageGetMaxOffsetNumber(page);
-
-	if (stack->bts_offset == InvalidOffsetNumber ||
-		maxoff >= stack->bts_offset)
-	{
-
-		/*
-		 * _bt_insertonpg set bts_offset to InvalidOffsetNumber in the
-		 * case of concurrent ROOT page split
-		 */
-		if (stack->bts_offset == InvalidOffsetNumber)
-			i = P_RIGHTMOST(opaque) ? P_HIKEY : P_FIRSTKEY;
-		else
-		{
-			itemid = PageGetItemId(page, stack->bts_offset);
-			item = (BTItem) PageGetItem(page, itemid);
-
-			/* if the item is where we left it, we're done */
-			if (BTItemSame(item, stack->bts_btitem))
-			{
-				pfree(stack->bts_btitem);
-				item_nbytes = ItemIdGetLength(itemid);
-				item_save = (BTItem) palloc(item_nbytes);
-				memmove((char *) item_save, (char *) item, item_nbytes);
-				stack->bts_btitem = item_save;
-				return buf;
-			}
-			i = OffsetNumberNext(stack->bts_offset);
-		}
-
-		/* if the item has just moved right on this page, we're done */
-		for (;
-			 i <= maxoff;
-			 i = OffsetNumberNext(i))
-		{
-			itemid = PageGetItemId(page, i);
-			item = (BTItem) PageGetItem(page, itemid);
-
-			/* if the item is where we left it, we're done */
-			if (BTItemSame(item, stack->bts_btitem))
-			{
-				stack->bts_offset = i;
-				pfree(stack->bts_btitem);
-				item_nbytes = ItemIdGetLength(itemid);
-				item_save = (BTItem) palloc(item_nbytes);
-				memmove((char *) item_save, (char *) item, item_nbytes);
-				stack->bts_btitem = item_save;
-				return buf;
-			}
-		}
-	}
-
-	/* by here, the item we're looking for moved right at least one page */
-	for (;;)
-	{
-		blkno = opaque->btpo_next;
-		if (P_RIGHTMOST(opaque))
-			elog(FATAL, "my bits moved right off the
end of the world!\
-\n\tRecreate index %s.", RelationGetRelationName(rel));
-
-		_bt_relbuf(rel, buf, access);
-		buf = _bt_getbuf(rel, blkno, access);
-		page = BufferGetPage(buf);
-		maxoff = PageGetMaxOffsetNumber(page);
-		opaque = (BTPageOpaque) PageGetSpecialPointer(page);
-
-		/* if we have a right sibling, step over the high key */
-		start = P_RIGHTMOST(opaque) ? P_HIKEY : P_FIRSTKEY;
-
-		/* see if it's on this page */
-		for (offnum = start;
-			 offnum <= maxoff;
-			 offnum = OffsetNumberNext(offnum))
-		{
-			itemid = PageGetItemId(page, offnum);
-			item = (BTItem) PageGetItem(page, itemid);
-			if (BTItemSame(item, stack->bts_btitem))
-			{
-				stack->bts_offset = offnum;
-				stack->bts_blkno = blkno;
-				pfree(stack->bts_btitem);
-				item_nbytes = ItemIdGetLength(itemid);
-				item_save = (BTItem) palloc(item_nbytes);
-				memmove((char *) item_save, (char *) item, item_nbytes);
-				stack->bts_btitem = item_save;
-				return buf;
-			}
-		}
-	}
-}
-
 void
 _bt_pagedel(Relation rel, ItemPointer tid)
 {
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index b174d30317..072d400070 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -12,7 +12,7 @@
  * Portions Copyright (c) 1994, Regents of the University of California
  *
  * IDENTIFICATION
- * $Header: /cvsroot/pgsql/src/backend/access/nbtree/nbtree.c,v 1.61 2000/07/14 22:17:33 tgl Exp $
+ * $Header: /cvsroot/pgsql/src/backend/access/nbtree/nbtree.c,v 1.62 2000/07/21 06:42:32 tgl Exp $
  *
  *-------------------------------------------------------------------------
  */
@@ -26,6 +26,7 @@
 #include "executor/executor.h"
 #include "miscadmin.h"
+
 bool BuildingBtree = false; /* see comment in btbuild() */
 bool FastBuild = true; /* use sort/build instead of insertion
  * build */
@@ -206,8 +207,8 @@ btbuild(PG_FUNCTION_ARGS)
 		 * btree pages - NULLs greater NOT_NULLs and NULL = NULL is TRUE.
          * Sure, it's just rule for placing/finding items and no more -
          * keytest'll return FALSE for a = 5 for items having 'a' isNULL.
-         * Look at _bt_skeycmp, _bt_compare and _bt_itemcmp for how it
-         * works.                 - vadim 03/23/97
+         * Look at _bt_compare for how it works.
+         *                 - vadim 03/23/97
          *
          * if (itup->t_info & INDEX_NULL_MASK) { pfree(itup); continue; }
          */
@@ -321,14 +322,6 @@ btinsert(PG_FUNCTION_ARGS)
     /* generate an index tuple */
     itup = index_formtuple(RelationGetDescr(rel), datum, nulls);
     itup->t_tid = *ht_ctid;
-
-    /*
-     * See comments in btbuild.
-     *
-     * if (itup->t_info & INDEX_NULL_MASK)
-     *     PG_RETURN_POINTER((InsertIndexResult) NULL);
-     */
-
     btitem = _bt_formitem(itup);
     res = _bt_doinsert(rel, btitem, rel->rd_uniqueindex, heapRel);
@@ -357,10 +350,10 @@ btgettuple(PG_FUNCTION_ARGS)
     if (ItemPointerIsValid(&(scan->currentItemData)))
     {
-
         /*
          * Restore scan position using heap TID returned by previous call
-         * to btgettuple(). _bt_restscan() locks buffer.
+         * to btgettuple(). _bt_restscan() re-grabs the read lock on
+         * the buffer, too.
          */
         _bt_restscan(scan);
         res = _bt_next(scan, dir);
@@ -369,8 +362,9 @@ btgettuple(PG_FUNCTION_ARGS)
         res = _bt_first(scan, dir);
     /*
-     * Save heap TID to use it in _bt_restscan.  Unlock buffer before
-     * leaving index !
+     * Save heap TID to use it in _bt_restscan.  Then release the read
+     * lock on the buffer so that we aren't blocking other backends.
+     * NOTE: we do keep the pin on the buffer!
      */
     if (res)
     {
@@ -419,7 +413,18 @@ btrescan(PG_FUNCTION_ARGS)
     so = (BTScanOpaque) scan->opaque;
-    /* we don't hold a read lock on the current page in the scan */
+    if (so == NULL)             /* if called from btbeginscan */
+    {
+        so = (BTScanOpaque) palloc(sizeof(BTScanOpaqueData));
+        so->btso_curbuf = so->btso_mrkbuf = InvalidBuffer;
+        so->keyData = (ScanKey) NULL;
+        if (scan->numberOfKeys > 0)
+            so->keyData = (ScanKey) palloc(scan->numberOfKeys * sizeof(ScanKeyData));
+        scan->opaque = so;
+        scan->flags = 0x0;
+    }
+
+    /* we aren't holding any read locks, but gotta drop the pins */
     if (ItemPointerIsValid(iptr = &(scan->currentItemData)))
     {
         ReleaseBuffer(so->btso_curbuf);
@@ -427,7 +432,6 @@ btrescan(PG_FUNCTION_ARGS)
         ItemPointerSetInvalid(iptr);
     }
-    /* and we don't hold a read lock on the last marked item in the scan */
     if (ItemPointerIsValid(iptr = &(scan->currentMarkData)))
     {
         ReleaseBuffer(so->btso_mrkbuf);
@@ -435,17 +439,6 @@ btrescan(PG_FUNCTION_ARGS)
         ItemPointerSetInvalid(iptr);
     }
-    if (so == NULL)             /* if called from btbeginscan */
-    {
-        so = (BTScanOpaque) palloc(sizeof(BTScanOpaqueData));
-        so->btso_curbuf = so->btso_mrkbuf = InvalidBuffer;
-        so->keyData = (ScanKey) NULL;
-        if (scan->numberOfKeys > 0)
-            so->keyData = (ScanKey) palloc(scan->numberOfKeys * sizeof(ScanKeyData));
-        scan->opaque = so;
-        scan->flags = 0x0;
-    }
-
     /*
      * Reset the scan keys. Note that keys ordering stuff moved to
      * _bt_first.      - vadim 05/05/97
@@ -472,7 +465,7 @@ btmovescan(IndexScanDesc scan, Datum v)
     so = (BTScanOpaque) scan->opaque;
-    /* we don't hold a read lock on the current page in the scan */
+    /* we aren't holding any read locks, but gotta drop the pin */
     if (ItemPointerIsValid(iptr = &(scan->currentItemData)))
     {
         ReleaseBuffer(so->btso_curbuf);
@@ -480,7 +473,6 @@ btmovescan(IndexScanDesc scan, Datum v)
         ItemPointerSetInvalid(iptr);
     }
-/*    scan->keyData[0].sk_argument = v;    */
     so->keyData[0].sk_argument = v;
 }
@@ -496,7 +488,7 @@ btendscan(PG_FUNCTION_ARGS)
     so = (BTScanOpaque) scan->opaque;
-    /* we don't hold any read locks */
+    /* we aren't holding any read locks, but gotta drop the pins */
     if (ItemPointerIsValid(iptr = &(scan->currentItemData)))
     {
         if (BufferIsValid(so->btso_curbuf))
@@ -534,7 +526,7 @@ btmarkpos(PG_FUNCTION_ARGS)
     so = (BTScanOpaque) scan->opaque;
-    /* we don't hold any read locks */
+    /* we aren't holding any read locks, but gotta drop the pin */
     if (ItemPointerIsValid(iptr = &(scan->currentMarkData)))
     {
         ReleaseBuffer(so->btso_mrkbuf);
@@ -542,7 +534,7 @@ btmarkpos(PG_FUNCTION_ARGS)
         ItemPointerSetInvalid(iptr);
     }
-    /* bump pin on current buffer */
+    /* bump pin on current buffer for assignment to mark buffer */
     if (ItemPointerIsValid(&(scan->currentItemData)))
     {
         so->btso_mrkbuf = ReadBuffer(scan->relation,
@@ -566,7 +558,7 @@ btrestrpos(PG_FUNCTION_ARGS)
     so = (BTScanOpaque) scan->opaque;
-    /* we don't hold any read locks */
+    /* we aren't holding any read locks, but gotta drop the pin */
     if (ItemPointerIsValid(iptr = &(scan->currentItemData)))
     {
         ReleaseBuffer(so->btso_curbuf);
@@ -579,7 +571,6 @@ btrestrpos(PG_FUNCTION_ARGS)
     {
         so->btso_curbuf = ReadBuffer(scan->relation,
                                      BufferGetBlockNumber(so->btso_mrkbuf));
-
         scan->currentItemData = scan->currentMarkData;
         so->curHeapIptr = so->mrkHeapIptr;
     }
@@ -603,6 +594,9 @@ btdelete(PG_FUNCTION_ARGS)
     PG_RETURN_VOID();
 }
+/*
+ * Restore scan position when btgettuple is called to continue a scan.
+ */
 static void
 _bt_restscan(IndexScanDesc scan)
 {
@@ -618,7 +612,12 @@ _bt_restscan(IndexScanDesc scan)
     BTItem      item;
     BlockNumber blkno;
-    LockBuffer(buf, BT_READ);   /* lock buffer first! */
+    /*
+     * Get back the read lock we were holding on the buffer.
+     * (We still have a reference-count pin on it, though.)
+     */
+    LockBuffer(buf, BT_READ);
+
     page = BufferGetPage(buf);
     maxoff = PageGetMaxOffsetNumber(page);
     opaque = (BTPageOpaque) PageGetSpecialPointer(page);
@@ -631,43 +630,40 @@ _bt_restscan(IndexScanDesc scan)
      */
     if (!ItemPointerIsValid(&target))
     {
-        ItemPointerSetOffsetNumber(&(scan->currentItemData),
-               OffsetNumberPrev(P_RIGHTMOST(opaque) ? P_HIKEY : P_FIRSTKEY));
+        ItemPointerSetOffsetNumber(current,
+                                   OffsetNumberPrev(P_FIRSTDATAKEY(opaque)));
         return;
     }
-    if (maxoff >= offnum)
+    /*
+     * The item we were on may have moved right due to insertions.
+     * Find it again.
+     */
+    for (;;)
     {
-
-        /*
-         * if the item is where we left it or has just moved right on this
-         * page, we're done
-         */
+        /* Check for item on this page */
         for (;
              offnum <= maxoff;
              offnum = OffsetNumberNext(offnum))
         {
             item = (BTItem) PageGetItem(page, PageGetItemId(page, offnum));
-            if (item->bti_itup.t_tid.ip_blkid.bi_hi == \
-                target.ip_blkid.bi_hi && \
-                item->bti_itup.t_tid.ip_blkid.bi_lo == \
-                target.ip_blkid.bi_lo && \
+            if (item->bti_itup.t_tid.ip_blkid.bi_hi ==
+                target.ip_blkid.bi_hi &&
+                item->bti_itup.t_tid.ip_blkid.bi_lo ==
+                target.ip_blkid.bi_lo &&
                 item->bti_itup.t_tid.ip_posid == target.ip_posid)
             {
                 current->ip_posid = offnum;
                 return;
             }
         }
-    }
-    /*
-     * By here, the item we're looking for moved right at least one page
-     */
-    for (;;)
-    {
+        /*
+         * By here, the item we're looking for moved right at least one page
+         */
         if (P_RIGHTMOST(opaque))
-            elog(FATAL, "_bt_restscan: my bits moved right off the end of the world!\
-\n\tRecreate index %s.", RelationGetRelationName(rel));
+            elog(FATAL, "_bt_restscan: my bits moved right off the end of the world!"
+                 "\n\tRecreate index %s.", RelationGetRelationName(rel));
         blkno = opaque->btpo_next;
         _bt_relbuf(rel, buf, BT_READ);
@@ -675,23 +671,8 @@ _bt_restscan(IndexScanDesc scan)
         page = BufferGetPage(buf);
         maxoff = PageGetMaxOffsetNumber(page);
         opaque = (BTPageOpaque) PageGetSpecialPointer(page);
-
-        /* see if it's on this page */
-        for (offnum = P_RIGHTMOST(opaque) ? P_HIKEY : P_FIRSTKEY;
-             offnum <= maxoff;
-             offnum = OffsetNumberNext(offnum))
-        {
-            item = (BTItem) PageGetItem(page, PageGetItemId(page, offnum));
-            if (item->bti_itup.t_tid.ip_blkid.bi_hi == \
-                target.ip_blkid.bi_hi && \
-                item->bti_itup.t_tid.ip_blkid.bi_lo == \
-                target.ip_blkid.bi_lo && \
-                item->bti_itup.t_tid.ip_posid == target.ip_posid)
-            {
-                ItemPointerSet(current, blkno, offnum);
-                so->btso_curbuf = buf;
-                return;
-            }
-        }
+        offnum = P_FIRSTDATAKEY(opaque);
+        ItemPointerSet(current, blkno, offnum);
+        so->btso_curbuf = buf;
     }
 }
diff --git a/src/backend/access/nbtree/nbtscan.c b/src/backend/access/nbtree/nbtscan.c
index 37469365bc..5d48895c1a 100644
--- a/src/backend/access/nbtree/nbtscan.c
+++ b/src/backend/access/nbtree/nbtscan.c
@@ -8,22 +8,25 @@
  *
  *
  * IDENTIFICATION
- *    $Header: /cvsroot/pgsql/src/backend/access/nbtree/Attic/nbtscan.c,v 1.31 2000/04/12 17:14:49 momjian Exp $
+ *    $Header: /cvsroot/pgsql/src/backend/access/nbtree/Attic/nbtscan.c,v 1.32 2000/07/21 06:42:32 tgl Exp $
  *
  *
  * NOTES
  *   Because we can be doing an index scan on a relation while we update
  *   it, we need to avoid missing data that moves around in the index.
- *   The routines and global variables in this file guarantee that all
- *   scans in the local address space stay correctly positioned.  This
- *   is all we need to worry about, since write locking guarantees that
- *   no one else will be on the same page at the same time as we are.
+ *   Insertions and page splits are no problem because _bt_restscan()
+ *   can figure out where the current item moved to, but if a deletion
+ *   happens at or before the current scan position, we'd better do
+ *   something to stay in sync.
+ *
+ *   The routines in this file handle the problem for deletions issued
+ *   by the current backend.  Currently, that's all we need, since
+ *   deletions are only done by VACUUM and it gets an exclusive lock.
  *
  *   The scheme is to manage a list of active scans in the current backend.
- *   Whenever we add or remove records from an index, or whenever we
- *   split a leaf page, we check the list of active scans to see if any
- *   has been affected.  A scan is affected only if it is on the same
- *   relation, and the same page, as the update.
+ *   Whenever we remove a record from an index, we check the list of active
+ *   scans to see if any has been affected.  A scan is affected only if it
+ *   is on the same relation, and the same page, as the update.
  *
  *-------------------------------------------------------------------------
  */
@@ -111,7 +114,7 @@ _bt_dropscan(IndexScanDesc scan)
 /*
  *  _bt_adjscans() -- adjust all scans in the scan list to compensate
- *                    for a given deletion or insertion
+ *                    for a given deletion
  */
 void
 _bt_adjscans(Relation rel, ItemPointer tid)
@@ -153,7 +156,7 @@ _bt_scandel(IndexScanDesc scan, BlockNumber blkno, OffsetNumber offno)
     {
         page = BufferGetPage(buf);
         opaque = (BTPageOpaque) PageGetSpecialPointer(page);
-        start = P_RIGHTMOST(opaque) ? P_HIKEY : P_FIRSTKEY;
+        start = P_FIRSTDATAKEY(opaque);
         if (ItemPointerGetOffsetNumber(current) == start)
             ItemPointerSetInvalid(&(so->curHeapIptr));
         else
@@ -165,7 +168,6 @@ _bt_scandel(IndexScanDesc scan, BlockNumber blkno, OffsetNumber offno)
              */
             LockBuffer(buf, BT_READ);
             _bt_step(scan, &buf, BackwardScanDirection);
-            so->btso_curbuf = buf;
             if (ItemPointerIsValid(current))
             {
                 Page        pg = BufferGetPage(buf);
@@ -183,10 +185,9 @@ _bt_scandel(IndexScanDesc scan, BlockNumber blkno, OffsetNumber offno)
         && ItemPointerGetBlockNumber(current) == blkno
         && ItemPointerGetOffsetNumber(current) >= offno)
     {
-
         page = BufferGetPage(so->btso_mrkbuf);
         opaque = (BTPageOpaque) PageGetSpecialPointer(page);
-        start = P_RIGHTMOST(opaque) ? P_HIKEY : P_FIRSTKEY;
+        start = P_FIRSTDATAKEY(opaque);
         if (ItemPointerGetOffsetNumber(current) == start)
             ItemPointerSetInvalid(&(so->mrkHeapIptr));
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index 54c15b2f6a..49aec3b23d 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -1,14 +1,14 @@
 /*-------------------------------------------------------------------------
  *
- * btsearch.c
+ * nbtsearch.c
  *    search code for postgres btrees.
  *
+ *
  * Portions Copyright (c) 1996-2000, PostgreSQL, Inc
  * Portions Copyright (c) 1994, Regents of the University of California
  *
- *
  * IDENTIFICATION
- *    $Header: /cvsroot/pgsql/src/backend/access/nbtree/nbtsearch.c,v 1.60 2000/05/30 04:24:33 tgl Exp $
+ *    $Header: /cvsroot/pgsql/src/backend/access/nbtree/nbtsearch.c,v 1.61 2000/07/21 06:42:32 tgl Exp $
  *
  *-------------------------------------------------------------------------
  */
@@ -19,102 +19,96 @@
 #include "access/nbtree.h"
+static RetrieveIndexResult _bt_endpoint(IndexScanDesc scan, ScanDirection dir);
-static BTStack _bt_searchr(Relation rel, int keysz, ScanKey scankey,
-            Buffer *bufP, BTStack stack_in);
-static int32 _bt_compare(Relation rel, TupleDesc itupdesc, Page page,
-            int keysz, ScanKey scankey, OffsetNumber offnum);
-static bool
-            _bt_twostep(IndexScanDesc scan, Buffer *bufP, ScanDirection dir);
-static RetrieveIndexResult
-            _bt_endpoint(IndexScanDesc scan, ScanDirection dir);
 /*
- *  _bt_search() -- Search for a scan key in the index.
+ *  _bt_search() -- Search the tree for a particular scankey,
+ *      or more precisely for the first leaf page it could be on.
+ *
+ * Return value is a stack of parent-page pointers.  *bufP is set to the
+ * address of the leaf-page buffer, which is read-locked and pinned.
+ * No locks are held on the parent pages, however!
  *
- *      This routine is actually just a helper that sets things up and
- *      calls a recursive-descent search routine on the tree.
+ * NOTE that the returned buffer is read-locked regardless of the access
+ * parameter.  However, access = BT_WRITE will allow an empty root page
+ * to be created and returned.  When access = BT_READ, an empty index
+ * will result in *bufP being set to InvalidBuffer.
  */
 BTStack
-_bt_search(Relation rel, int keysz, ScanKey scankey, Buffer *bufP)
-{
-    *bufP = _bt_getroot(rel, BT_READ);
-    return _bt_searchr(rel, keysz, scankey, bufP, (BTStack) NULL);
-}
-
-/*
- *  _bt_searchr() -- Search the tree recursively for a particular scankey.
- */
-static BTStack
-_bt_searchr(Relation rel,
-            int keysz,
-            ScanKey scankey,
-            Buffer *bufP,
-            BTStack stack_in)
+_bt_search(Relation rel, int keysz, ScanKey scankey,
+           Buffer *bufP, int access)
 {
-    BTStack     stack;
-    OffsetNumber offnum;
-    Page        page;
-    BTPageOpaque opaque;
-    BlockNumber par_blkno;
-    BlockNumber blkno;
-    ItemId      itemid;
-    BTItem      btitem;
-    BTItem      item_save;
-    int         item_nbytes;
-    IndexTuple  itup;
+    BTStack     stack_in = NULL;
-    /* if this is a leaf page, we're done */
-    page = BufferGetPage(*bufP);
-    opaque = (BTPageOpaque) PageGetSpecialPointer(page);
-    if (opaque->btpo_flags & BTP_LEAF)
-        return stack_in;
+    /* Get the root page to start with */
+    *bufP = _bt_getroot(rel, access);
-    /*
-     * Find the appropriate item on the internal page, and get the child
-     * page that it points to.
-     */
+    /* If index is empty and access = BT_READ, no root page is created. */
+    if (! BufferIsValid(*bufP))
+        return (BTStack) NULL;
-    par_blkno = BufferGetBlockNumber(*bufP);
-    offnum = _bt_binsrch(rel, *bufP, keysz, scankey, BT_DESCENT);
-    itemid = PageGetItemId(page, offnum);
-    btitem = (BTItem) PageGetItem(page, itemid);
-    itup = &(btitem->bti_itup);
-    blkno = ItemPointerGetBlockNumber(&(itup->t_tid));
+    /* Loop iterates once per level descended in the tree */
+    for (;;)
+    {
+        Page        page;
+        BTPageOpaque opaque;
+        OffsetNumber offnum;
+        ItemId      itemid;
+        BTItem      btitem;
+        IndexTuple  itup;
+        BlockNumber blkno;
+        BlockNumber par_blkno;
+        BTStack     new_stack;
+
+        /* if this is a leaf page, we're done */
+        page = BufferGetPage(*bufP);
+        opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+        if (P_ISLEAF(opaque))
+            break;
-    /*
-     * We need to save the bit image of the index entry we chose in the
-     * parent page on a stack.  In case we split the tree, we'll use this
-     * bit image to figure out what our real parent page is, in case the
-     * parent splits while we're working lower in the tree.  See the paper
-     * by Lehman and Yao for how this is detected and handled.  (We use
-     * unique OIDs to disambiguate duplicate keys in the index -- Lehman
-     * and Yao disallow duplicate keys).
-     */
+        /*
+         * Find the appropriate item on the internal page, and get the
+         * child page that it points to.
+         */
+        offnum = _bt_binsrch(rel, *bufP, keysz, scankey);
+        itemid = PageGetItemId(page, offnum);
+        btitem = (BTItem) PageGetItem(page, itemid);
+        itup = &(btitem->bti_itup);
+        blkno = ItemPointerGetBlockNumber(&(itup->t_tid));
+        par_blkno = BufferGetBlockNumber(*bufP);
-    item_nbytes = ItemIdGetLength(itemid);
-    item_save = (BTItem) palloc(item_nbytes);
-    memmove((char *) item_save, (char *) btitem, item_nbytes);
-    stack = (BTStack) palloc(sizeof(BTStackData));
-    stack->bts_blkno = par_blkno;
-    stack->bts_offset = offnum;
-    stack->bts_btitem = item_save;
-    stack->bts_parent = stack_in;
+        /*
+         * We need to save the bit image of the index entry we chose in the
+         * parent page on a stack.  In case we split the tree, we'll use this
+         * bit image to figure out what our real parent page is, in case the
+         * parent splits while we're working lower in the tree.  See the paper
+         * by Lehman and Yao for how this is detected and handled.  (We use the
+         * child link to disambiguate duplicate keys in the index -- Lehman
+         * and Yao disallow duplicate keys.)
+         */
+        new_stack = (BTStack) palloc(sizeof(BTStackData));
+        new_stack->bts_blkno = par_blkno;
+        new_stack->bts_offset = offnum;
+        memcpy(&new_stack->bts_btitem, btitem, sizeof(BTItemData));
+        new_stack->bts_parent = stack_in;
-    /* drop the read lock on the parent page and acquire one on the child */
-    _bt_relbuf(rel, *bufP, BT_READ);
-    *bufP = _bt_getbuf(rel, blkno, BT_READ);
+        /* drop the read lock on the parent page, acquire one on the child */
+        _bt_relbuf(rel, *bufP, BT_READ);
+        *bufP = _bt_getbuf(rel, blkno, BT_READ);
-    /*
-     * Race -- the page we just grabbed may have split since we read its
-     * pointer in the parent.  If it has, we may need to move right to its
-     * new sibling.  Do that.
-     */
+        /*
+         * Race -- the page we just grabbed may have split since we read its
+         * pointer in the parent.  If it has, we may need to move right to its
+         * new sibling.  Do that.
+         */
+        *bufP = _bt_moveright(rel, *bufP, keysz, scankey, BT_READ);
-    *bufP = _bt_moveright(rel, *bufP, keysz, scankey, BT_READ);
+        /* okay, all set to move down a level */
+        stack_in = new_stack;
+    }
-    /* okay, all set to move down a level */
-    return _bt_searchr(rel, keysz, scankey, bufP, stack);
+    return stack_in;
 }
 /*
@@ -133,7 +127,7 @@ _bt_searchr(Relation rel,
  *
  *      On entry, we have the buffer pinned and a lock of the proper type.
  *      If we move right, we release the buffer and lock and acquire the
- *      same on the right sibling.
+ *      same on the right sibling.  Return value is the buffer we stop at.
  */
 Buffer
 _bt_moveright(Relation rel,
@@ -144,231 +138,81 @@ _bt_moveright(Relation rel,
 {
     Page        page;
     BTPageOpaque opaque;
-    ItemId      hikey;
-    BlockNumber rblkno;
-    int         natts = rel->rd_rel->relnatts;
     page = BufferGetPage(buf);
     opaque = (BTPageOpaque) PageGetSpecialPointer(page);
-    /* if we're on a rightmost page, we don't need to move right */
-    if (P_RIGHTMOST(opaque))
-        return buf;
-
-    /* by convention, item 0 on non-rightmost pages is the high key */
-    hikey = PageGetItemId(page, P_HIKEY);
-
     /*
-     * If the scan key that brought us to this page is >= the high key
+     * If the scan key that brought us to this page is > the high key
      * stored on the page, then the page has split and we need to move
-     * right.
+     * right.  (If the scan key is equal to the high key, we might or
+     * might not need to move right; have to scan the page first anyway.)
+     * It could even have split more than once, so scan as far as needed.
      */
-
-    if (_bt_skeycmp(rel, keysz, scankey, page, hikey,
-                    BTGreaterEqualStrategyNumber))
+    while (!P_RIGHTMOST(opaque) &&
+           _bt_compare(rel, keysz, scankey, page, P_HIKEY) > 0)
     {
-        /* move right as long as we need to */
-        do
-        {
-            OffsetNumber offmax = PageGetMaxOffsetNumber(page);
-
-            /*
-             * If this page consists of all duplicate keys (hikey and
-             * first key on the page have the same value), then we don't
-             * need to step right.
-             *
-             * NOTE for multi-column indices: we may do scan using keys not
-             * for all attrs. But we handle duplicates using all attrs in
-             * _bt_insert/_bt_spool code. And so we've to compare scankey
-             * with _last_ item on this page to do not lose "good" tuples
-             * if number of attrs > keysize. Example: (2,0) - last items
-             * on this page, (2,1) - first item on next page (hikey), our
-             * scankey is x = 2. Scankey == (2,1) because of we compare
-             * first attrs only, but we shouldn't to move right of here. -
-             * vadim 04/15/97
-             *
-             * Also, if this page is not LEAF one (and # of attrs > keysize)
-             * then we can't move too. - vadim 10/22/97
-             */
-
-            if (_bt_skeycmp(rel, keysz, scankey, page, hikey,
-                            BTEqualStrategyNumber))
-            {
-                if (opaque->btpo_flags & BTP_CHAIN)
-                {
-                    Assert((opaque->btpo_flags & BTP_LEAF) || offmax > P_HIKEY);
-                    break;
-                }
-                if (offmax > P_HIKEY)
-                {
-                    if (natts == keysz) /* sanity checks */
-                    {
-                        if (_bt_skeycmp(rel, keysz, scankey, page,
-                                        PageGetItemId(page, P_FIRSTKEY),
-                                        BTEqualStrategyNumber))
-                            elog(FATAL, "btree: BTP_CHAIN flag was expected in %s (access = %s)",
-                                 RelationGetRelationName(rel), access ? "bt_write" : "bt_read");
-                        if (_bt_skeycmp(rel, keysz, scankey, page,
-                                        PageGetItemId(page, offmax),
-                                        BTEqualStrategyNumber))
-                            elog(FATAL, "btree: unexpected equal last item");
-                        if (_bt_skeycmp(rel, keysz, scankey, page,
-                                        PageGetItemId(page, offmax),
-                                        BTLessStrategyNumber))
-                            elog(FATAL, "btree: unexpected greater last item");
-                        /* move right */
-                    }
-                    else if (!(opaque->btpo_flags & BTP_LEAF))
-                        break;
-                    else if (_bt_skeycmp(rel, keysz, scankey, page,
-                                         PageGetItemId(page, offmax),
-                                         BTLessEqualStrategyNumber))
-                        break;
-                }
-            }
+        /* step right one page */
+        BlockNumber rblkno = opaque->btpo_next;
-            /* step right one page */
-            rblkno = opaque->btpo_next;
-            _bt_relbuf(rel, buf, access);
-            buf = _bt_getbuf(rel, rblkno, access);
-            page = BufferGetPage(buf);
-            opaque = (BTPageOpaque) PageGetSpecialPointer(page);
-            hikey = PageGetItemId(page, P_HIKEY);
-
-        } while (!P_RIGHTMOST(opaque)
-                 && _bt_skeycmp(rel, keysz, scankey, page, hikey,
-                                BTGreaterEqualStrategyNumber));
+        _bt_relbuf(rel, buf, access);
+        buf = _bt_getbuf(rel, rblkno, access);
+        page = BufferGetPage(buf);
+        opaque = (BTPageOpaque) PageGetSpecialPointer(page);
     }
+
     return buf;
 }
 /*
- *  _bt_skeycmp() -- compare a scan key to a particular item on a page using
- *                   a requested strategy (<, <=, =, >=, >).
+ *  _bt_binsrch() -- Do a binary search for a key on a particular page.
  *
- *      We ignore the unique OIDs stored in the btree item here.  Those
- *      numbers are intended for use internally only, in repositioning a
- *      scan after a page split.  They do not impose any meaningful ordering.
+ * The scankey we get has the compare function stored in the procedure
+ * entry of each data struct.  We invoke this regproc to do the
+ * comparison for every key in the scankey.
  *
- *      The comparison is A B, where A is the scan key and B is the
- *      tuple pointed at by itemid on page.
- */
-bool
-_bt_skeycmp(Relation rel,
-            Size keysz,
-            ScanKey scankey,
-            Page page,
-            ItemId itemid,
-            StrategyNumber strat)
-{
-    BTItem      item;
-    IndexTuple  indexTuple;
-    TupleDesc   tupDes;
-    int         i;
-    int32       compare = 0;
-
-    item = (BTItem) PageGetItem(page, itemid);
-    indexTuple = &(item->bti_itup);
-
-    tupDes = RelationGetDescr(rel);
-
-    for (i = 1; i <= (int) keysz; i++)
-    {
-        ScanKey     entry = &scankey[i - 1];
-        Datum       attrDatum;
-        bool        isNull;
-
-        Assert(entry->sk_attno == i);
-        attrDatum = index_getattr(indexTuple,
-                                  entry->sk_attno,
-                                  tupDes,
-                                  &isNull);
-
-        /* see comments about NULLs handling in btbuild */
-        if (entry->sk_flags & SK_ISNULL)    /* key is NULL */
-        {
-            if (isNull)
-                compare = 0;    /* NULL key "=" NULL datum */
-            else
-                compare = 1;    /* NULL key ">" not-NULL datum */
-        }
-        else if (isNull)        /* key is NOT_NULL and item is NULL */
-        {
-            compare = -1;       /* not-NULL key "<" NULL datum */
-        }
-        else
-            compare = DatumGetInt32(FunctionCall2(&entry->sk_func,
-                                                  entry->sk_argument,
-                                                  attrDatum));
-
-        if (compare != 0)
-            break;              /* done when we find unequal attributes */
-    }
-
-    switch (strat)
-    {
-        case BTLessStrategyNumber:
-            return (bool) (compare < 0);
-        case BTLessEqualStrategyNumber:
-            return (bool) (compare <= 0);
-        case BTEqualStrategyNumber:
-            return (bool) (compare == 0);
-        case BTGreaterEqualStrategyNumber:
-            return (bool) (compare >= 0);
-        case BTGreaterStrategyNumber:
-            return (bool) (compare > 0);
-    }
-
-    elog(ERROR, "_bt_skeycmp: bogus strategy %d", (int) strat);
-    return false;
-}
-
-/*
- *  _bt_binsrch() -- Do a binary search for a key on a particular page.
+ * On a leaf page, _bt_binsrch() returns the OffsetNumber of the first
+ * key >= given scankey.  (NOTE: in particular, this means it is possible
+ * to return a value 1 greater than the number of keys on the page,
+ * if the scankey is > all keys on the page.)
  *
- *      The scankey we get has the compare function stored in the procedure
- *      entry of each data struct.  We invoke this regproc to do the
- *      comparison for every key in the scankey.  _bt_binsrch() returns
- *      the OffsetNumber of the first matching key on the page, or the
- *      OffsetNumber at which the matching key would appear if it were
- *      on this page.  (NOTE: in particular, this means it is possible to
- *      return a value 1 greater than the number of keys on the page, if
- *      the scankey is > all keys on the page.)
+ * On an internal (non-leaf) page, _bt_binsrch() returns the OffsetNumber
+ * of the last key < given scankey.  (Since _bt_compare treats the first
+ * data key of such a page as minus infinity, there will be at least one
+ * key < scankey, so the result always points at one of the keys on the
+ * page.)  This key indicates the right place to descend to be sure we
+ * find all leaf keys >= given scankey.
  *
- *      By the time this procedure is called, we're sure we're looking
- *      at the right page -- don't need to walk right.  _bt_binsrch() has
- *      no lock or refcount side effects on the buffer.
+ * This procedure is not responsible for walking right, it just examines
+ * the given page.  _bt_binsrch() has no lock or refcount side effects
+ * on the buffer.
  */
 OffsetNumber
 _bt_binsrch(Relation rel,
             Buffer buf,
             int keysz,
-            ScanKey scankey,
-            int srchtype)
+            ScanKey scankey)
 {
     TupleDesc   itupdesc;
     Page        page;
     BTPageOpaque opaque;
     OffsetNumber low,
                 high;
-    bool        haveEq;
-    int         natts = rel->rd_rel->relnatts;
     int32       result;
     itupdesc = RelationGetDescr(rel);
     page = BufferGetPage(buf);
     opaque = (BTPageOpaque) PageGetSpecialPointer(page);
-    /* by convention, item 1 on any non-rightmost page is the high key */
-    low = P_RIGHTMOST(opaque) ? P_HIKEY : P_FIRSTKEY;
-
+    low = P_FIRSTDATAKEY(opaque);
     high = PageGetMaxOffsetNumber(page);
     /*
      * If there are no keys on the page, return the first available slot.
      * Note this covers two cases: the page is really empty (no keys), or
      * it contains only a high key.  The latter case is possible after
-     * vacuuming.
+     * vacuuming.  This can never happen on an internal page, however,
+     * since they are never empty (an internal page must have children).
      */
     if (high < low)
         return low;
@@ -376,11 +220,9 @@ _bt_binsrch(Relation rel,
     /*
      * Binary search to find the first key on the page >= scan key. Loop
      * invariant: all slots before 'low' are < scan key, all slots at or
-     * after 'high' are >= scan key.  Also, haveEq is true if the tuple at
-     * 'high' is == scan key.  We can fall out when high == low.
+     * after 'high' are >= scan key.  We can fall out when high == low.
      */
     high++;                     /* establish the loop invariant for high */
-    haveEq = false;
     while (high > low)
     {
@@ -388,175 +230,77 @@ _bt_binsrch(Relation rel,
         /* We have low <= mid < high, so mid points at a real slot */
-        result = _bt_compare(rel, itupdesc, page, keysz, scankey, mid);
+        result = _bt_compare(rel, keysz, scankey, page, mid);
         if (result > 0)
             low = mid + 1;
         else
-        {
             high = mid;
-            haveEq = (result == 0);
-        }
     }
     /*--------------------
      * At this point we have high == low, but be careful: they could point
-     * past the last slot on the page.  We also know that haveEq is true
-     * if and only if there is an equal key (in which case high&low point
-     * at the first equal key).
+     * past the last slot on the page.
      *
      * On a leaf page, we always return the first key >= scan key
      * (which could be the last slot + 1).
      *--------------------
      */
-
-    if (opaque->btpo_flags & BTP_LEAF)
+    if (P_ISLEAF(opaque))
         return low;
     /*--------------------
-     * On a non-leaf page, there are special cases:
-     *
-     * For an insertion (srchtype != BT_DESCENT and natts == keysz)
-     * always return first key >= scan key (which could be off the end).
-     *
-     * For a standard search (srchtype == BT_DESCENT and natts == keysz)
-     * return the first equal key if one exists, else the last lesser key
-     * if one exists, else the first slot on the page.
-     *
-     * For a partial-match search (srchtype == BT_DESCENT and natts > keysz)
-     * return the last lesser key if one exists, else the first slot.
-     *
-     * Old comments:
-     *      For multi-column indices, we may scan using keys
-     *      not for all attrs. But we handle duplicates using all attrs
-     *      in _bt_insert/_bt_spool code. And so while searching on
-     *      internal pages having number of attrs > keysize we want to
-     *      point at the last item < the scankey, not at the first item
-     *      = the scankey (!!!), and let _bt_moveright decide later
-     *      whether to move right or not (see comments and example
-     *      there). Note also that INSERTions are not affected by this
-     *      code (since natts == keysz for inserts). - vadim 04/15/97
+     * On a non-leaf page, return the last key < scan key.
+     * There must be one if _bt_compare() is playing by the rules.
      *--------------------
      */
-
-    if (haveEq)
-    {
-
-        /*
-         * There is an equal key.  We return either the first equal key
-         * (which we just found), or the last lesser key.
-         *
-         * We need not check srchtype != BT_DESCENT here, since if that is
-         * true then natts == keysz by assumption.
-         */
-        if (natts == keysz)
-            return low;         /* return first equal key */
-    }
-    else
-    {
-
-        /*
-         * There is no equal key.  We return either the first greater key
-         * (which we just found), or the last lesser key.
-         */
-        if (srchtype != BT_DESCENT)
-            return low;         /* return first greater key */
-    }
-
-
-    if (low == (P_RIGHTMOST(opaque) ? P_HIKEY : P_FIRSTKEY))
-        return low;             /* there is no prior item */
+    Assert(low > P_FIRSTDATAKEY(opaque));
     return OffsetNumberPrev(low);
 }
-/*
+/*----------
  *  _bt_compare() -- Compare scankey to a particular tuple on the page.
  *
+ *  keysz: number of key conditions to be checked (might be less than the
+ *  total length of the scan key!)
+ *  page/offnum: location of btree item to be compared to.
+ *
  *      This routine returns:
  *          <0 if scankey < tuple at offnum;
  *           0 if scankey == tuple at offnum;
  *          >0 if scankey > tuple at offnum.
+ *      NULLs in the keys are treated as sortable values.  Therefore
+ *      "equality" does not necessarily mean that the item should be
+ *      returned to the caller as a matching key!
  *
- *      -- Old comments:
- *      In order to avoid having to propagate changes up the tree any time
- *      a new minimal key is inserted, the leftmost entry on the leftmost
- *      page is less than all possible keys, by definition.
- *
- *      -- New ones:
- *      New insertion code (fix against updating _in_place_ if new minimal
- *      key has bigger size than old one) may delete P_HIKEY entry on the
- *      root page in order to insert new minimal key - and so this definition
- *      does not work properly in this case and breaks key' order on root
- *      page. BTW, this propagation occures only while page' splitting,
- *      but not "any time a new min key is inserted" (see _bt_insertonpg).
- *              - vadim 12/05/96
+ * CRUCIAL NOTE: on a non-leaf page, the first data key is assumed to be
+ * "minus infinity": this routine will always claim it is less than the
+ * scankey.  The actual key value stored (if any, which there probably isn't)
+ * does not matter.  This convention allows us to implement the Lehman and
+ * Yao convention that the first down-link pointer is before the first key.
+ * See backend/access/nbtree/README for details.
+ *----------
  */
-static int32
+int32
 _bt_compare(Relation rel,
-            TupleDesc itupdesc,
-            Page page,
             int keysz,
             ScanKey scankey,
+            Page page,
             OffsetNumber offnum)
 {
-    Datum       datum;
+    TupleDesc   itupdesc = RelationGetDescr(rel);
+    BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);
     BTItem      btitem;
     IndexTuple  itup;
-    BTPageOpaque opaque;
-    ScanKey     entry;
-    AttrNumber  attno;
-    int32       result;
     int         i;
-    bool        null;
     /*
-     * If this is a leftmost internal page, and if our comparison is with
-     * the first key on the page, then the item at that position is by
-     * definition less than the scan key.
-     *
-     * - see new comments above...
+ * Force result ">" if target item is first data item on an internal + * page --- see NOTE above. */ - - opaque = (BTPageOpaque) PageGetSpecialPointer(page); - - if (!(opaque->btpo_flags & BTP_LEAF) - && P_LEFTMOST(opaque) - && offnum == P_HIKEY) - { - - /* - * we just have to believe that this will only be called with - * offnum == P_HIKEY when P_HIKEY is the OffsetNumber of the first - * actual data key (i.e., this is also a rightmost page). there - * doesn't seem to be any code that implies that the leftmost page - * is normally missing a high key as well as the rightmost page. - * but that implies that this code path only applies to the root - * -- which seems unlikely.. - * - * - see new comments above... - */ - if (!P_RIGHTMOST(opaque)) - elog(ERROR, "_bt_compare: invalid comparison to high key"); - -#ifdef NOT_USED - - /* - * We just have to belive that right answer will not break - * anything. I've checked code and all seems to be ok. See new - * comments above... - * - * -- Old comments If the item on the page is equal to the scankey, - * that's okay to admit. We just can't claim that the first key - * on the page is greater than anything. - */ - - if (_bt_skeycmp(rel, keysz, scankey, page, PageGetItemId(page, offnum), - BTEqualStrategyNumber)) - return 0; + if (! P_ISLEAF(opaque) && offnum == P_FIRSTDATAKEY(opaque)) return 1; -#endif - } btitem = (BTItem) PageGetItem(page, PageGetItemId(page, offnum)); itup = &(btitem->bti_itup); @@ -568,37 +312,45 @@ _bt_compare(Relation rel, * they be in order. If you think about how multi-key ordering works, * you'll understand why this is. * - * We don't test for violation of this condition here. + * We don't test for violation of this condition here, however. The + * initial setup for the index scan had better have gotten it right + * (see _bt_first). 
*/ - for (i = 1; i <= keysz; i++) + for (i = 0; i < keysz; i++) { - entry = &scankey[i - 1]; - attno = entry->sk_attno; - datum = index_getattr(itup, attno, itupdesc, &null); + ScanKey entry = &scankey[i]; + Datum datum; + bool isNull; + int32 result; + + datum = index_getattr(itup, entry->sk_attno, itupdesc, &isNull); /* see comments about NULLs handling in btbuild */ - if (entry->sk_flags & SK_ISNULL) /* key is NULL */ + if (entry->sk_flags & SK_ISNULL) /* key is NULL */ { - if (null) + if (isNull) result = 0; /* NULL "=" NULL */ else result = 1; /* NULL ">" NOT_NULL */ } - else if (null) /* key is NOT_NULL and item is NULL */ + else if (isNull) /* key is NOT_NULL and item is NULL */ { result = -1; /* NOT_NULL "<" NULL */ } else + { result = DatumGetInt32(FunctionCall2(&entry->sk_func, - entry->sk_argument, datum)); + entry->sk_argument, + datum)); + } /* if the keys are unequal, return the difference */ if (result != 0) return result; } - /* by here, the keys are equal */ + /* if we get here, the keys are equal */ return 0; } @@ -606,10 +358,10 @@ _bt_compare(Relation rel, * _bt_next() -- Get the next item in a scan. * * On entry, we have a valid currentItemData in the scan, and a - * read lock on the page that contains that item. We do not have - * the page pinned. We return the next item in the scan. On - * exit, we have the page containing the next item locked but not - * pinned. + * read lock and pin count on the page that contains that item. + * We return the next item in the scan, or NULL if no more. + * On successful exit, the page containing the new item is locked + * and pinned; on NULL exit, no lock or pin is held. 
*/ RetrieveIndexResult _bt_next(IndexScanDesc scan, ScanDirection dir) @@ -618,7 +370,6 @@ _bt_next(IndexScanDesc scan, ScanDirection dir) Buffer buf; Page page; OffsetNumber offnum; - RetrieveIndexResult res; ItemPointer current; BTItem btitem; IndexTuple itup; @@ -629,10 +380,9 @@ _bt_next(IndexScanDesc scan, ScanDirection dir) so = (BTScanOpaque) scan->opaque; current = &(scan->currentItemData); - Assert(BufferIsValid(so->btso_curbuf)); - /* we still have the buffer pinned and locked */ buf = so->btso_curbuf; + Assert(BufferIsValid(buf)); do { @@ -640,7 +390,7 @@ _bt_next(IndexScanDesc scan, ScanDirection dir) if (!_bt_step(scan, &buf, dir)) return (RetrieveIndexResult) NULL; - /* by here, current is the tuple we want to return */ + /* current is the next candidate tuple to return */ offnum = ItemPointerGetOffsetNumber(current); page = BufferGetPage(buf); btitem = (BTItem) PageGetItem(page, PageGetItemId(page, offnum)); @@ -648,17 +398,16 @@ _bt_next(IndexScanDesc scan, ScanDirection dir) if (_bt_checkkeys(scan, itup, &keysok)) { + /* tuple passes all scan key conditions, so return it */ Assert(keysok == so->numberOfKeys); - res = FormRetrieveIndexResult(current, &(itup->t_tid)); - - /* remember which buffer we have pinned and locked */ - so->btso_curbuf = buf; - return res; + return FormRetrieveIndexResult(current, &(itup->t_tid)); } + /* This tuple doesn't pass, but there might be more that do */ } while (keysok >= so->numberOfFirstKeys || (keysok == ((Size) -1) && ScanDirectionIsBackward(dir))); + /* No more items, so close down the current-item info */ ItemPointerSetInvalid(current); so->btso_curbuf = InvalidBuffer; _bt_relbuf(rel, buf, BT_READ); @@ -680,14 +429,10 @@ RetrieveIndexResult _bt_first(IndexScanDesc scan, ScanDirection dir) { Relation rel; - TupleDesc itupdesc; Buffer buf; Page page; - BTPageOpaque pop; BTStack stack; - OffsetNumber offnum, - maxoff; - bool offGmax = false; + OffsetNumber offnum; BTItem btitem; IndexTuple itup; ItemPointer 
current; @@ -698,7 +443,6 @@ _bt_first(IndexScanDesc scan, ScanDirection dir) int32 result; BTScanOpaque so; Size keysok; - bool strategyCheck; ScanKey scankeys = 0; int keysCount = 0; @@ -784,20 +528,17 @@ _bt_first(IndexScanDesc scan, ScanDirection dir) return _bt_endpoint(scan, dir); } - itupdesc = RelationGetDescr(rel); - current = &(scan->currentItemData); - /* * Okay, we want something more complicated. What we'll do is use the * first item in the scan key passed in (which has been correctly * ordered to take advantage of index ordering) to position ourselves * at the right place in the scan. */ - /* _bt_orderkeys disallows it, but it's place to add some code latter */ scankeys = (ScanKey) palloc(keysCount * sizeof(ScanKeyData)); for (i = 0; i < keysCount; i++) { j = nKeyIs[i]; + /* _bt_orderkeys disallows it, but it's place to add some code latter */ if (so->keyData[j].sk_flags & SK_ISNULL) { pfree(nKeyIs); @@ -812,234 +553,213 @@ _bt_first(IndexScanDesc scan, ScanDirection dir) if (nKeyIs) pfree(nKeyIs); - stack = _bt_search(rel, keysCount, scankeys, &buf); - _bt_freestack(stack); - - blkno = BufferGetBlockNumber(buf); - page = BufferGetPage(buf); + current = &(scan->currentItemData); /* - * This will happen if the tree we're searching is entirely empty, or - * if we're doing a search for a key that would appear on an entirely - * empty internal page. In either case, there are no matching tuples - * in the index. + * Use the manufactured scan key to descend the tree and position + * ourselves on the target leaf page. */ + stack = _bt_search(rel, keysCount, scankeys, &buf, BT_READ); - if (PageIsEmpty(page)) + /* don't need to keep the stack around... */ + _bt_freestack(stack); + + if (! 
BufferIsValid(buf)) { + /* Only get here if index is completely empty */ ItemPointerSetInvalid(current); so->btso_curbuf = InvalidBuffer; - _bt_relbuf(rel, buf, BT_READ); pfree(scankeys); return (RetrieveIndexResult) NULL; } - maxoff = PageGetMaxOffsetNumber(page); - pop = (BTPageOpaque) PageGetSpecialPointer(page); - - /* - * Now _bt_moveright doesn't move from non-rightmost leaf page if - * scankey == hikey and there is only hikey there. It's good for - * insertion, but we need to do work for scan here. - vadim 05/27/97 - */ - - while (maxoff == P_HIKEY && !P_RIGHTMOST(pop) && - _bt_skeycmp(rel, keysCount, scankeys, page, - PageGetItemId(page, P_HIKEY), - BTGreaterEqualStrategyNumber)) - { - /* step right one page */ - blkno = pop->btpo_next; - _bt_relbuf(rel, buf, BT_READ); - buf = _bt_getbuf(rel, blkno, BT_READ); - page = BufferGetPage(buf); - if (PageIsEmpty(page)) - { - ItemPointerSetInvalid(current); - so->btso_curbuf = InvalidBuffer; - _bt_relbuf(rel, buf, BT_READ); - pfree(scankeys); - return (RetrieveIndexResult) NULL; - } - maxoff = PageGetMaxOffsetNumber(page); - pop = (BTPageOpaque) PageGetSpecialPointer(page); - } - - /* find the nearest match to the manufactured scan key on the page */ - offnum = _bt_binsrch(rel, buf, keysCount, scankeys, BT_DESCENT); + /* remember which buffer we have pinned */ + so->btso_curbuf = buf; + blkno = BufferGetBlockNumber(buf); + page = BufferGetPage(buf); - if (offnum > maxoff) - { - offnum = maxoff; - offGmax = true; - } + offnum = _bt_binsrch(rel, buf, keysCount, scankeys); ItemPointerSet(current, blkno, offnum); - /* - * Now find the right place to start the scan. Result is the value - * we're looking for minus the value we're looking at in the index. + /*---------- + * At this point we are positioned at the first item >= scan key, + * or possibly at the end of a page on which all the existing items + * are < scan key and we know that everything on later pages is + * >= scan key. 
We could step forward in the latter case, but that'd + * be a waste of time if we want to scan backwards. So, it's now time to + * examine the scan strategy to find the exact place to start the scan. + * + * Note: if _bt_step fails (meaning we fell off the end of the index + * in one direction or the other), we either return NULL (no matches) or + * call _bt_endpoint() to set up a scan starting at that index endpoint, + * as appropriate for the desired scan type. + * + * it's yet other place to add some code latter for is(not)null ... + *---------- */ - result = _bt_compare(rel, itupdesc, page, keysCount, scankeys, offnum); - - /* it's yet other place to add some code latter for is(not)null */ - - strat = strat_total; - switch (strat) + switch (strat_total) { case BTLessStrategyNumber: - if (result <= 0) + /* + * Back up one to arrive at last item < scankey + */ + if (!_bt_step(scan, &buf, BackwardScanDirection)) { - do - { - if (!_bt_twostep(scan, &buf, BackwardScanDirection)) - break; - - offnum = ItemPointerGetOffsetNumber(current); - page = BufferGetPage(buf); - result = _bt_compare(rel, itupdesc, page, keysCount, scankeys, offnum); - } while (result <= 0); - + pfree(scankeys); + return (RetrieveIndexResult) NULL; } break; case BTLessEqualStrategyNumber: - if (result >= 0) + /* + * We need to find the last item <= scankey, so step forward + * till we find one > scankey, then step back one. 
+ */ + if (offnum > PageGetMaxOffsetNumber(page)) { - do + if (!_bt_step(scan, &buf, ForwardScanDirection)) { - if (!_bt_twostep(scan, &buf, ForwardScanDirection)) - break; - - offnum = ItemPointerGetOffsetNumber(current); - page = BufferGetPage(buf); - result = _bt_compare(rel, itupdesc, page, keysCount, scankeys, offnum); - } while (result >= 0); + pfree(scankeys); + return _bt_endpoint(scan, dir); + } + } + for (;;) + { + offnum = ItemPointerGetOffsetNumber(current); + page = BufferGetPage(buf); + result = _bt_compare(rel, keysCount, scankeys, page, offnum); + if (result < 0) + break; + if (!_bt_step(scan, &buf, ForwardScanDirection)) + { + pfree(scankeys); + return _bt_endpoint(scan, dir); + } + } + if (!_bt_step(scan, &buf, BackwardScanDirection)) + { + pfree(scankeys); + return (RetrieveIndexResult) NULL; } - if (result < 0) - _bt_twostep(scan, &buf, BackwardScanDirection); break; case BTEqualStrategyNumber: - if (result != 0) + /* + * Make sure we are on the first equal item; might have to step + * forward if currently at end of page. + */ + if (offnum > PageGetMaxOffsetNumber(page)) { - _bt_relbuf(scan->relation, buf, BT_READ); - so->btso_curbuf = InvalidBuffer; - ItemPointerSetInvalid(&(scan->currentItemData)); - pfree(scankeys); - return (RetrieveIndexResult) NULL; + if (!_bt_step(scan, &buf, ForwardScanDirection)) + { + pfree(scankeys); + return (RetrieveIndexResult) NULL; + } + offnum = ItemPointerGetOffsetNumber(current); + page = BufferGetPage(buf); } - else if (ScanDirectionIsBackward(dir)) + result = _bt_compare(rel, keysCount, scankeys, page, offnum); + if (result != 0) + goto nomatches; /* no equal items! */ + /* + * If a backward scan was specified, need to start with last + * equal item not first one. 
+ */ + if (ScanDirectionIsBackward(dir)) { do { - if (!_bt_twostep(scan, &buf, ForwardScanDirection)) - break; - + if (!_bt_step(scan, &buf, ForwardScanDirection)) + { + pfree(scankeys); + return _bt_endpoint(scan, dir); + } offnum = ItemPointerGetOffsetNumber(current); page = BufferGetPage(buf); - result = _bt_compare(rel, itupdesc, page, keysCount, scankeys, offnum); + result = _bt_compare(rel, keysCount, scankeys, page, offnum); } while (result == 0); - - if (result < 0) - _bt_twostep(scan, &buf, BackwardScanDirection); + if (!_bt_step(scan, &buf, BackwardScanDirection)) + elog(ERROR, "_bt_first: equal items disappeared?"); } break; case BTGreaterEqualStrategyNumber: - if (offGmax) + /* + * We want the first item >= scankey, which is where we are... + * unless we're not anywhere at all... + */ + if (offnum > PageGetMaxOffsetNumber(page)) { - if (result < 0) + if (!_bt_step(scan, &buf, ForwardScanDirection)) { - Assert(!P_RIGHTMOST(pop) && maxoff == P_HIKEY); - if (!_bt_step(scan, &buf, ForwardScanDirection)) - { - _bt_relbuf(scan->relation, buf, BT_READ); - so->btso_curbuf = InvalidBuffer; - ItemPointerSetInvalid(&(scan->currentItemData)); - pfree(scankeys); - return (RetrieveIndexResult) NULL; - } - } - else if (result > 0) - { /* Just remember: _bt_binsrch() returns - * the OffsetNumber of the first matching - * key on the page, or the OffsetNumber at - * which the matching key WOULD APPEAR IF - * IT WERE on this page. No key on this - * page, but offnum from _bt_binsrch() - * greater maxoff - have to move right. 
- - * vadim 12/06/96 */ - _bt_twostep(scan, &buf, ForwardScanDirection); + pfree(scankeys); + return (RetrieveIndexResult) NULL; } } - else if (result < 0) - { - do - { - if (!_bt_twostep(scan, &buf, BackwardScanDirection)) - break; - - page = BufferGetPage(buf); - offnum = ItemPointerGetOffsetNumber(current); - result = _bt_compare(rel, itupdesc, page, keysCount, scankeys, offnum); - } while (result < 0); - - if (result > 0) - _bt_twostep(scan, &buf, ForwardScanDirection); - } break; case BTGreaterStrategyNumber: - /* offGmax helps as above */ - if (result >= 0 || offGmax) + /* + * We want the first item > scankey, so make sure we are on + * an item and then step over any equal items. + */ + if (offnum > PageGetMaxOffsetNumber(page)) { - do + if (!_bt_step(scan, &buf, ForwardScanDirection)) { - if (!_bt_twostep(scan, &buf, ForwardScanDirection)) - break; - - offnum = ItemPointerGetOffsetNumber(current); - page = BufferGetPage(buf); - result = _bt_compare(rel, itupdesc, page, keysCount, scankeys, offnum); - } while (result >= 0); + pfree(scankeys); + return (RetrieveIndexResult) NULL; + } + offnum = ItemPointerGetOffsetNumber(current); + page = BufferGetPage(buf); + } + result = _bt_compare(rel, keysCount, scankeys, page, offnum); + while (result == 0) + { + if (!_bt_step(scan, &buf, ForwardScanDirection)) + { + pfree(scankeys); + return (RetrieveIndexResult) NULL; + } + offnum = ItemPointerGetOffsetNumber(current); + page = BufferGetPage(buf); + result = _bt_compare(rel, keysCount, scankeys, page, offnum); } break; } - pfree(scankeys); /* okay, current item pointer for the scan is right */ offnum = ItemPointerGetOffsetNumber(current); page = BufferGetPage(buf); btitem = (BTItem) PageGetItem(page, PageGetItemId(page, offnum)); itup = &btitem->bti_itup; + /* is the first item actually acceptable? 
*/ if (_bt_checkkeys(scan, itup, &keysok)) { + /* yes, return it */ res = FormRetrieveIndexResult(current, &(itup->t_tid)); - - /* remember which buffer we have pinned */ - so->btso_curbuf = buf; - } - else if (keysok >= so->numberOfFirstKeys) - { - so->btso_curbuf = buf; - return _bt_next(scan, dir); } - else if (keysok == ((Size) -1) && ScanDirectionIsBackward(dir)) + else if (keysok >= so->numberOfFirstKeys || + (keysok == ((Size) -1) && ScanDirectionIsBackward(dir))) { - so->btso_curbuf = buf; - return _bt_next(scan, dir); + /* no, but there might be another one that is */ + res = _bt_next(scan, dir); } else { + /* no tuples in the index match this scan key */ +nomatches: ItemPointerSetInvalid(current); so->btso_curbuf = InvalidBuffer; _bt_relbuf(rel, buf, BT_READ); res = (RetrieveIndexResult) NULL; } + pfree(scankeys); + return res; } @@ -1047,276 +767,128 @@ _bt_first(IndexScanDesc scan, ScanDirection dir) * _bt_step() -- Step one item in the requested direction in a scan on * the tree. * - * If no adjacent record exists in the requested direction, return - * false. Else, return true and set the currentItemData for the - * scan to the right thing. + * *bufP is the current buffer (read-locked and pinned). If we change + * pages, it's updated appropriately. + * + * If successful, update scan's currentItemData and return true. + * If no adjacent record exists in the requested direction, + * release buffer pin/locks and return false. 
*/ bool _bt_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir) { + Relation rel = scan->relation; + ItemPointer current = &(scan->currentItemData); + BTScanOpaque so = (BTScanOpaque) scan->opaque; Page page; BTPageOpaque opaque; OffsetNumber offnum, maxoff; - OffsetNumber start; BlockNumber blkno; BlockNumber obknum; - BTScanOpaque so; - ItemPointer current; - Relation rel; - - rel = scan->relation; - current = &(scan->currentItemData); /* * Don't use ItemPointerGetOffsetNumber or you risk to get assertion * due to ability of ip_posid to be equal 0. */ offnum = current->ip_posid; + page = BufferGetPage(*bufP); opaque = (BTPageOpaque) PageGetSpecialPointer(page); - so = (BTScanOpaque) scan->opaque; maxoff = PageGetMaxOffsetNumber(page); - /* get the next tuple */ if (ScanDirectionIsForward(dir)) { if (!PageIsEmpty(page) && offnum < maxoff) offnum = OffsetNumberNext(offnum); else { - - /* if we're at end of scan, release the buffer and return */ - blkno = opaque->btpo_next; - if (P_RIGHTMOST(opaque)) - { - _bt_relbuf(rel, *bufP, BT_READ); - ItemPointerSetInvalid(current); - *bufP = so->btso_curbuf = InvalidBuffer; - return false; - } - else + /* walk right to the next page with data */ + for (;;) { - - /* walk right to the next page with data */ - _bt_relbuf(rel, *bufP, BT_READ); - for (;;) + /* if we're at end of scan, release the buffer and return */ + if (P_RIGHTMOST(opaque)) { - *bufP = _bt_getbuf(rel, blkno, BT_READ); - page = BufferGetPage(*bufP); - opaque = (BTPageOpaque) PageGetSpecialPointer(page); - maxoff = PageGetMaxOffsetNumber(page); - start = P_RIGHTMOST(opaque) ? 
P_HIKEY : P_FIRSTKEY; - - if (!PageIsEmpty(page) && start <= maxoff) - break; - else - { - blkno = opaque->btpo_next; - _bt_relbuf(rel, *bufP, BT_READ); - if (blkno == P_NONE) - { - *bufP = so->btso_curbuf = InvalidBuffer; - ItemPointerSetInvalid(current); - return false; - } - } + _bt_relbuf(rel, *bufP, BT_READ); + ItemPointerSetInvalid(current); + *bufP = so->btso_curbuf = InvalidBuffer; + return false; } - offnum = start; + /* step right one page */ + blkno = opaque->btpo_next; + _bt_relbuf(rel, *bufP, BT_READ); + *bufP = _bt_getbuf(rel, blkno, BT_READ); + page = BufferGetPage(*bufP); + opaque = (BTPageOpaque) PageGetSpecialPointer(page); + maxoff = PageGetMaxOffsetNumber(page); + /* done if it's not empty */ + offnum = P_FIRSTDATAKEY(opaque); + if (!PageIsEmpty(page) && offnum <= maxoff) + break; } } } - else if (ScanDirectionIsBackward(dir)) + else { - - /* remember that high key is item zero on non-rightmost pages */ - start = P_RIGHTMOST(opaque) ? P_HIKEY : P_FIRSTKEY; - - if (offnum > start) + if (offnum > P_FIRSTDATAKEY(opaque)) offnum = OffsetNumberPrev(offnum); else { - - /* if we're at end of scan, release the buffer and return */ - blkno = opaque->btpo_prev; - if (P_LEFTMOST(opaque)) - { - _bt_relbuf(rel, *bufP, BT_READ); - *bufP = so->btso_curbuf = InvalidBuffer; - ItemPointerSetInvalid(current); - return false; - } - else + /* walk left to the next page with data */ + for (;;) { - + /* if we're at end of scan, release the buffer and return */ + if (P_LEFTMOST(opaque)) + { + _bt_relbuf(rel, *bufP, BT_READ); + ItemPointerSetInvalid(current); + *bufP = so->btso_curbuf = InvalidBuffer; + return false; + } + /* step left */ obknum = BufferGetBlockNumber(*bufP); - - /* walk right to the next page with data */ + blkno = opaque->btpo_prev; _bt_relbuf(rel, *bufP, BT_READ); - for (;;) + *bufP = _bt_getbuf(rel, blkno, BT_READ); + page = BufferGetPage(*bufP); + opaque = (BTPageOpaque) PageGetSpecialPointer(page); + /* + * If the adjacent page just split, then we 
have to walk + * right to find the block that's now adjacent to where + * we were. Because pages only split right, we don't have + * to worry about this failing to terminate. + */ + while (opaque->btpo_next != obknum) { + blkno = opaque->btpo_next; + _bt_relbuf(rel, *bufP, BT_READ); *bufP = _bt_getbuf(rel, blkno, BT_READ); page = BufferGetPage(*bufP); opaque = (BTPageOpaque) PageGetSpecialPointer(page); - maxoff = PageGetMaxOffsetNumber(page); - - /* - * If the adjacent page just split, then we may have - * the wrong block. Handle this case. Because pages - * only split right, we don't have to worry about this - * failing to terminate. - */ - - while (opaque->btpo_next != obknum) - { - blkno = opaque->btpo_next; - _bt_relbuf(rel, *bufP, BT_READ); - *bufP = _bt_getbuf(rel, blkno, BT_READ); - page = BufferGetPage(*bufP); - opaque = (BTPageOpaque) PageGetSpecialPointer(page); - maxoff = PageGetMaxOffsetNumber(page); - } - - /* don't consider the high key */ - start = P_RIGHTMOST(opaque) ? P_HIKEY : P_FIRSTKEY; - - /* anything to look at here? */ - if (!PageIsEmpty(page) && maxoff >= start) - break; - else - { - blkno = opaque->btpo_prev; - obknum = BufferGetBlockNumber(*bufP); - _bt_relbuf(rel, *bufP, BT_READ); - if (blkno == P_NONE) - { - *bufP = so->btso_curbuf = InvalidBuffer; - ItemPointerSetInvalid(current); - return false; - } - } } - offnum = maxoff;/* XXX PageIsEmpty? */ + /* done if it's not empty */ + maxoff = PageGetMaxOffsetNumber(page); + offnum = maxoff; + if (!PageIsEmpty(page) && maxoff >= P_FIRSTDATAKEY(opaque)) + break; } } } - blkno = BufferGetBlockNumber(*bufP); + + /* Update scan state */ so->btso_curbuf = *bufP; + blkno = BufferGetBlockNumber(*bufP); ItemPointerSet(current, blkno, offnum); return true; } -/* - * _bt_twostep() -- Move to an adjacent record in a scan on the tree, - * if an adjacent record exists. - * - * This is like _bt_step, except that if no adjacent record exists - * it restores us to where we were before trying the step. 
This is - * only hairy when you cross page boundaries, since the page you cross - * from could have records inserted or deleted, or could even split. - * This is unlikely, but we try to handle it correctly here anyway. - * - * This routine contains the only case in which our changes to Lehman - * and Yao's algorithm. - * - * Like step, this routine leaves the scan's currentItemData in the - * proper state and acquires a lock and pin on *bufP. If the twostep - * succeeded, we return true; otherwise, we return false. - */ -static bool -_bt_twostep(IndexScanDesc scan, Buffer *bufP, ScanDirection dir) -{ - Page page; - BTPageOpaque opaque; - OffsetNumber offnum, - maxoff; - OffsetNumber start; - ItemPointer current; - ItemId itemid; - int itemsz; - BTItem btitem; - BTItem svitem; - BlockNumber blkno; - - blkno = BufferGetBlockNumber(*bufP); - page = BufferGetPage(*bufP); - opaque = (BTPageOpaque) PageGetSpecialPointer(page); - maxoff = PageGetMaxOffsetNumber(page); - current = &(scan->currentItemData); - offnum = ItemPointerGetOffsetNumber(current); - - start = P_RIGHTMOST(opaque) ? P_HIKEY : P_FIRSTKEY; - - /* if we're safe, just do it */ - if (ScanDirectionIsForward(dir) && offnum < maxoff) - { /* XXX PageIsEmpty? */ - ItemPointerSet(current, blkno, OffsetNumberNext(offnum)); - return true; - } - else if (ScanDirectionIsBackward(dir) && offnum > start) - { - ItemPointerSet(current, blkno, OffsetNumberPrev(offnum)); - return true; - } - - /* if we've hit end of scan we don't have to do any work */ - if (ScanDirectionIsForward(dir) && P_RIGHTMOST(opaque)) - return false; - else if (ScanDirectionIsBackward(dir) && P_LEFTMOST(opaque)) - return false; - - /* - * Okay, it's off the page; let _bt_step() do the hard work, and we'll - * try to remember where we were. This is not guaranteed to work; - * this is the only place in the code where concurrency can screw us - * up, and it's because we want to be able to move in two directions - * in the scan. 
- */ - - itemid = PageGetItemId(page, offnum); - itemsz = ItemIdGetLength(itemid); - btitem = (BTItem) PageGetItem(page, itemid); - svitem = (BTItem) palloc(itemsz); - memmove((char *) svitem, (char *) btitem, itemsz); - - if (_bt_step(scan, bufP, dir)) - { - pfree(svitem); - return true; - } - - /* try to find our place again */ - *bufP = _bt_getbuf(scan->relation, blkno, BT_READ); - page = BufferGetPage(*bufP); - maxoff = PageGetMaxOffsetNumber(page); - - while (offnum <= maxoff) - { - itemid = PageGetItemId(page, offnum); - btitem = (BTItem) PageGetItem(page, itemid); - if (BTItemSame(btitem, svitem)) - { - pfree(svitem); - ItemPointerSet(current, blkno, offnum); - return false; - } - } - - /* - * XXX crash and burn -- can't find our place. We can be a little - * smarter -- walk to the next page to the right, for example, since - * that's the only direction that splits happen in. Deletions screw - * us up less often since they're only done by the vacuum daemon. - */ - - elog(ERROR, "btree synchronization error: concurrent update botched scan"); - - return false; -} - /* * _bt_endpoint() -- Find the first or last key in the index. + * + * This is used by _bt_first() to set up a scan when we've determined + * that the scan must start at the beginning or end of the index (for + * a forward or backward scan respectively). */ static RetrieveIndexResult _bt_endpoint(IndexScanDesc scan, ScanDirection dir) @@ -1328,7 +900,7 @@ _bt_endpoint(IndexScanDesc scan, ScanDirection dir) ItemPointer current; OffsetNumber offnum, maxoff; - OffsetNumber start = 0; + OffsetNumber start; BlockNumber blkno; BTItem btitem; IndexTuple itup; @@ -1340,38 +912,50 @@ _bt_endpoint(IndexScanDesc scan, ScanDirection dir) current = &(scan->currentItemData); so = (BTScanOpaque) scan->opaque; + /* + * Scan down to the leftmost or rightmost leaf page. This is a + * simplified version of _bt_search(). We don't maintain a stack + * since we know we won't need it. 
+ */ buf = _bt_getroot(rel, BT_READ); + + if (! BufferIsValid(buf)) + { + /* empty index... */ + ItemPointerSetInvalid(current); + so->btso_curbuf = InvalidBuffer; + return (RetrieveIndexResult) NULL; + } + blkno = BufferGetBlockNumber(buf); page = BufferGetPage(buf); opaque = (BTPageOpaque) PageGetSpecialPointer(page); for (;;) { - if (opaque->btpo_flags & BTP_LEAF) + if (P_ISLEAF(opaque)) break; if (ScanDirectionIsForward(dir)) - offnum = P_RIGHTMOST(opaque) ? P_HIKEY : P_FIRSTKEY; + offnum = P_FIRSTDATAKEY(opaque); else offnum = PageGetMaxOffsetNumber(page); btitem = (BTItem) PageGetItem(page, PageGetItemId(page, offnum)); itup = &(btitem->bti_itup); - blkno = ItemPointerGetBlockNumber(&(itup->t_tid)); _bt_relbuf(rel, buf, BT_READ); buf = _bt_getbuf(rel, blkno, BT_READ); + page = BufferGetPage(buf); opaque = (BTPageOpaque) PageGetSpecialPointer(page); /* - * Race condition: If the child page we just stepped onto is in - * the process of being split, we need to make sure we're all the - * way at the right edge of the tree. See the paper by Lehman and - * Yao. + * Race condition: If the child page we just stepped onto was just + * split, we need to make sure we're all the way at the right edge + * of the tree. See the paper by Lehman and Yao. */ - if (ScanDirectionIsBackward(dir) && !P_RIGHTMOST(opaque)) { do @@ -1390,101 +974,39 @@ _bt_endpoint(IndexScanDesc scan, ScanDirection dir) if (ScanDirectionIsForward(dir)) { - if (!P_LEFTMOST(opaque))/* non-leftmost page ? */ - elog(ERROR, "_bt_endpoint: leftmost page (%u) has not leftmost flag", blkno); - start = P_RIGHTMOST(opaque) ? P_HIKEY : P_FIRSTKEY; - - /* - * I don't understand this stuff! It doesn't work for - * non-rightmost pages with only one element (P_HIKEY) which we - * have after deletion itups by vacuum (it's case of start > - * maxoff). Scanning in BackwardScanDirection is not - * understandable at all. Well - new stuff. 
- vadim 12/06/96 - */ -#ifdef NOT_USED - if (PageIsEmpty(page) || start > maxoff) - { - ItemPointerSet(current, blkno, maxoff); - if (!_bt_step(scan, &buf, BackwardScanDirection)) - return (RetrieveIndexResult) NULL; - - start = ItemPointerGetOffsetNumber(current); - page = BufferGetPage(buf); - } -#endif - if (PageIsEmpty(page)) - { - if (start != P_HIKEY) /* non-rightmost page */ - elog(ERROR, "_bt_endpoint: non-rightmost page (%u) is empty", blkno); + Assert(P_LEFTMOST(opaque)); - /* - * It's left- & right- most page - root page, - and it's - * empty... - */ - _bt_relbuf(rel, buf, BT_READ); - ItemPointerSetInvalid(current); - so->btso_curbuf = InvalidBuffer; - return (RetrieveIndexResult) NULL; - } - if (start > maxoff) /* start == 2 && maxoff == 1 */ - { - ItemPointerSet(current, blkno, maxoff); - if (!_bt_step(scan, &buf, ForwardScanDirection)) - return (RetrieveIndexResult) NULL; - - start = ItemPointerGetOffsetNumber(current); - page = BufferGetPage(buf); - } - /* new stuff ends here */ - else - ItemPointerSet(current, blkno, start); + start = P_FIRSTDATAKEY(opaque); } else if (ScanDirectionIsBackward(dir)) { + Assert(P_RIGHTMOST(opaque)); - /* - * I don't understand this stuff too! If RIGHT-most leaf page is - * empty why do scanning in ForwardScanDirection ??? Well - new - * stuff. - vadim 12/06/96 - */ -#ifdef NOT_USED - if (PageIsEmpty(page)) - { - ItemPointerSet(current, blkno, FirstOffsetNumber); - if (!_bt_step(scan, &buf, ForwardScanDirection)) - return (RetrieveIndexResult) NULL; - - start = ItemPointerGetOffsetNumber(current); - page = BufferGetPage(buf); - } -#endif - if (PageIsEmpty(page)) - { - /* If it's leftmost page too - it's empty root page... */ - if (P_LEFTMOST(opaque)) - { - _bt_relbuf(rel, buf, BT_READ); - ItemPointerSetInvalid(current); - so->btso_curbuf = InvalidBuffer; - return (RetrieveIndexResult) NULL; - } - /* Go back ! 
*/ - ItemPointerSet(current, blkno, FirstOffsetNumber); - if (!_bt_step(scan, &buf, BackwardScanDirection)) - return (RetrieveIndexResult) NULL; - - start = ItemPointerGetOffsetNumber(current); - page = BufferGetPage(buf); - } - /* new stuff ends here */ - else - { - start = PageGetMaxOffsetNumber(page); - ItemPointerSet(current, blkno, start); - } + start = PageGetMaxOffsetNumber(page); + if (start < P_FIRSTDATAKEY(opaque)) /* watch out for empty page */ + start = P_FIRSTDATAKEY(opaque); } else + { elog(ERROR, "Illegal scan direction %d", dir); + start = 0; /* keep compiler quiet */ + } + + ItemPointerSet(current, blkno, start); + /* remember which buffer we have pinned */ + so->btso_curbuf = buf; + + /* + * Left/rightmost page could be empty due to deletions, + * if so step till we find a nonempty page. + */ + if (start > maxoff) + { + if (!_bt_step(scan, &buf, dir)) + return (RetrieveIndexResult) NULL; + start = ItemPointerGetOffsetNumber(current); + page = BufferGetPage(buf); + } btitem = (BTItem) PageGetItem(page, PageGetItemId(page, start)); itup = &(btitem->bti_itup); @@ -1492,23 +1014,18 @@ _bt_endpoint(IndexScanDesc scan, ScanDirection dir) /* see if we picked a winner */ if (_bt_checkkeys(scan, itup, &keysok)) { + /* yes, return it */ res = FormRetrieveIndexResult(current, &(itup->t_tid)); - - /* remember which buffer we have pinned */ - so->btso_curbuf = buf; - } - else if (keysok >= so->numberOfFirstKeys) - { - so->btso_curbuf = buf; - return _bt_next(scan, dir); } - else if (keysok == ((Size) -1) && ScanDirectionIsBackward(dir)) + else if (keysok >= so->numberOfFirstKeys || + (keysok == ((Size) -1) && ScanDirectionIsBackward(dir))) { - so->btso_curbuf = buf; - return _bt_next(scan, dir); + /* no, but there might be another one that is */ + res = _bt_next(scan, dir); } else { + /* no tuples in the index match this scan key */ ItemPointerSetInvalid(current); so->btso_curbuf = InvalidBuffer; _bt_relbuf(rel, buf, BT_READ); diff --git 
a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 458abe7754..1981f55469 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -6,8 +6,12 @@
  *
  * We use tuplesort.c to sort the given index tuples into order.
  * Then we scan the index tuples in order and build the btree pages
- * for each level. When we have only one page on a level, it must be the
- * root -- it can be attached to the btree metapage and we are done.
+ * for each level. We load source tuples into leaf-level pages.
+ * Whenever we fill a page at one level, we add a link to it to its
+ * parent level (starting a new parent level if necessary). When
+ * done, we write out each final page on each level, adding it to
+ * its parent level. When we have only one page on a level, it must be
+ * the root -- it can be attached to the btree metapage and we are done.
  *
  * this code is moderately slow (~10% slower) compared to the regular
  * btree (insertion) build code on sorted or well-clustered data. on
@@ -23,12 +27,20 @@
  * something like the standard 70% steady-state load factor for btrees
  * would probably be better.
  *
+ * Another limitation is that we currently load full copies of all keys
+ * into upper tree levels. The leftmost data key in each non-leaf node
+ * could be omitted as far as normal btree operations are concerned
+ * (see README for more info). However, because we build the tree from
+ * the bottom up, we need that data key to insert into the node's parent.
+ * This could be fixed by keeping a spare copy of the minimum key in the
+ * state stack, but I haven't time for that right now.
+ *
  *
  * Portions Copyright (c) 1996-2000, PostgreSQL, Inc
  * Portions Copyright (c) 1994, Regents of the University of California
  *
  * IDENTIFICATION
- *    $Header: /cvsroot/pgsql/src/backend/access/nbtree/nbtsort.c,v 1.54 2000/06/15 04:09:36 momjian Exp $
+ *    $Header: /cvsroot/pgsql/src/backend/access/nbtree/nbtsort.c,v 1.55 2000/07/21 06:42:33 tgl Exp $
  *
  *-------------------------------------------------------------------------
  */
@@ -57,6 +69,20 @@ struct BTSpool
     bool isunique;
 };

+/*
+ * Status record for a btree page being built. We have one of these
+ * for each active tree level.
+ */
+typedef struct BTPageState
+{
+    Buffer btps_buf; /* current buffer & page */
+    Page btps_page;
+    OffsetNumber btps_lastoff; /* last item offset loaded */
+    int btps_level;
+    struct BTPageState *btps_next; /* link to parent level, if any */
+} BTPageState;
+
+
 #define BTITEMSZ(btitem) \
     ((btitem) ? \
      (IndexTupleDSize((btitem)->bti_itup) + \
@@ -65,13 +91,11 @@ struct BTSpool
 static void _bt_load(Relation index, BTSpool *btspool);
-static BTItem _bt_buildadd(Relation index, Size keysz, ScanKey scankey,
-             BTPageState *state, BTItem bti, int flags);
+static void _bt_buildadd(Relation index, BTPageState *state,
+             BTItem bti, int flags);
 static BTItem _bt_minitem(Page opage, BlockNumber oblkno, int atend);
-static BTPageState *_bt_pagestate(Relation index, int flags,
-              int level, bool doupper);
-static void _bt_uppershutdown(Relation index, Size keysz, ScanKey scankey,
-                  BTPageState *state);
+static BTPageState *_bt_pagestate(Relation index, int flags, int level);
+static void _bt_uppershutdown(Relation index, BTPageState *state);
 /*
@@ -159,9 +183,6 @@ _bt_blnewpage(Relation index, Buffer *buf, Page *page, int flags)
     BTPageOpaque opaque;
     *buf = _bt_getbuf(index, P_NEW, BT_WRITE);
-#ifdef NOT_USED
-    printf("\tblk=%d\n", BufferGetBlockNumber(*buf));
-#endif
     *page = BufferGetPage(*buf);
     _bt_pageinit(*page, BufferGetPageSize(*buf));
     opaque = (BTPageOpaque) PageGetSpecialPointer(*page);
@@
-202,18 +223,15 @@ _bt_slideleft(Relation index, Buffer buf, Page page)
  * is suitable for immediate use by _bt_buildadd.
  */
 static BTPageState *
-_bt_pagestate(Relation index, int flags, int level, bool doupper)
+_bt_pagestate(Relation index, int flags, int level)
 {
     BTPageState *state = (BTPageState *) palloc(sizeof(BTPageState));
     MemSet((char *) state, 0, sizeof(BTPageState));
     _bt_blnewpage(index, &(state->btps_buf), &(state->btps_page), flags);
-    state->btps_firstoff = InvalidOffsetNumber;
     state->btps_lastoff = P_HIKEY;
-    state->btps_lastbti = (BTItem) NULL;
     state->btps_next = (BTPageState *) NULL;
     state->btps_level = level;
-    state->btps_doupper = doupper;
     return state;
 }
@@ -240,31 +258,27 @@ _bt_minitem(Page opage, BlockNumber oblkno, int atend)
 }
 /*
- * add an item to a disk page from a merge tape block.
+ * add an item to a disk page from the sort output.
  *
  * we must be careful to observe the following restrictions, placed
  * upon us by the conventions in nbtsearch.c:
  * - rightmost pages start data items at P_HIKEY instead of at
  *   P_FIRSTKEY.
- * - duplicates cannot be split among pages unless the chain of
- *   duplicates starts at the first data item.
  *
  * a leaf page being built looks like:
  *
  * +----------------+---------------------------------+
  * | PageHeaderData | linp0 linp1 linp2 ...           |
  * +-----------+----+---------------------------------+
- * | ... linpN | ^ first                              |
+ * | ... linpN |                                      |
  * +-----------+--------------------------------------+
  * |     ^ last                                       |
  * |                                                  |
- * |     v last                                       |
  * +-------------+------------------------------------+
  * |             | itemN ...                          |
  * +-------------+------------------+-----------------+
  * |          ... item3 item2 item1 | "special space" |
  * +--------------------------------+-----------------+
- *                            ^ first
  *
  * contrast this with the diagram in bufpage.h; note the mismatch
  * between linps and items.
this is because we reserve linp0 as a
@@ -272,30 +286,20 @@ _bt_minitem(Page opage, BlockNumber oblkno, int atend)
  * filled up the page, we will set linp0 to point to itemN and clear
  * linpN.
  *
- * 'last' pointers indicate the last offset/item added to the page.
- * 'first' pointers indicate the first offset/item that is part of a
- * chain of duplicates extending from 'first' to 'last'.
- *
- * if all keys are unique, 'first' will always be the same as 'last'.
+ * 'last' pointer indicates the last offset added to the page.
  */
-static BTItem
-_bt_buildadd(Relation index, Size keysz, ScanKey scankey,
-             BTPageState *state, BTItem bti, int flags)
+static void
+_bt_buildadd(Relation index, BTPageState *state, BTItem bti, int flags)
 {
     Buffer nbuf;
     Page npage;
-    BTItem last_bti;
-    OffsetNumber first_off;
     OffsetNumber last_off;
-    OffsetNumber off;
     Size pgspc;
     Size btisz;
     nbuf = state->btps_buf;
     npage = state->btps_page;
-    first_off = state->btps_firstoff;
     last_off = state->btps_lastoff;
-    last_bti = state->btps_lastbti;
     pgspc = PageGetFreeSpace(npage);
     btisz = BTITEMSZ(bti);
@@ -319,75 +323,55 @@ _bt_buildadd(Relation index, Size keysz, ScanKey scankey,
     if (pgspc < btisz)
     {
+        /*
+         * Item won't fit on this page, so finish off the page and
+         * write it out.
+         */
         Buffer obuf = nbuf;
         Page opage = npage;
-        OffsetNumber o,
-                     n;
         ItemId ii;
         ItemId hii;
+        BTItem nbti;
         _bt_blnewpage(index, &nbuf, &npage, flags);
         /*
-         * if 'last' is part of a chain of duplicates that does not start
-         * at the beginning of the old page, the entire chain is copied to
-         * the new page; we delete all of the duplicates from the old page
-         * except the first, which becomes the high key item of the old
-         * page.
+         * We copy the last item on the page into the new page, and then
+         * rearrange the old page so that the 'last item' becomes its high
+         * key rather than a true data item.
         *
-         * if the chain starts at the beginning of the page or there is no
-         * chain ('first' == 'last'), we need only copy 'last' to the new
-         * page. again, 'first' (== 'last') becomes the high key of the
-         * old page.
-         *
-         * note that in either case, we copy at least one item to the new
-         * page, so 'last_bti' will always be valid. 'bti' will never be
-         * the first data item on the new page.
+         * note that since we always copy an item to the new page,
+         * 'bti' will never be the first data item on the new page.
          */
-        if (first_off == P_FIRSTKEY)
+        ii = PageGetItemId(opage, last_off);
+        if (PageAddItem(npage, PageGetItem(opage, ii), ii->lp_len,
+                        P_FIRSTKEY, LP_USED) == InvalidOffsetNumber)
+            elog(FATAL, "btree: failed to add item to the page in _bt_sort (1)");
+#ifdef FASTBUILD_DEBUG
         {
-            Assert(last_off != P_FIRSTKEY);
-            first_off = last_off;
+            bool isnull;
+            BTItem tmpbti =
+                (BTItem) PageGetItem(npage, PageGetItemId(npage, P_FIRSTKEY));
+            Datum d = index_getattr(&(tmpbti->bti_itup), 1,
+                                    index->rd_att, &isnull);
+
+            printf("_bt_buildadd: moved <%x> to offset %d at level %d\n",
+                   d, P_FIRSTKEY, state->btps_level);
         }
-        for (o = first_off, n = P_FIRSTKEY;
-             o <= last_off;
-             o = OffsetNumberNext(o), n = OffsetNumberNext(n))
-        {
-            ii = PageGetItemId(opage, o);
-            if (PageAddItem(npage, PageGetItem(opage, ii),
-                            ii->lp_len, n, LP_USED) == InvalidOffsetNumber)
-                elog(FATAL, "btree: failed to add item to the page in _bt_sort (1)");
-#ifdef FASTBUILD_DEBUG
-            {
-                bool isnull;
-                BTItem tmpbti =
-                    (BTItem) PageGetItem(npage, PageGetItemId(npage, n));
-                Datum d = index_getattr(&(tmpbti->bti_itup), 1,
-                                        index->rd_att, &isnull);
-
-                printf("_bt_buildadd: moved <%x> to offset %d at level %d\n",
-                       d, n, state->btps_level);
-            }
 #endif
-        }
         /*
-         * this loop is backward because PageIndexTupleDelete shuffles the
-         * tuples to fill holes in the page -- by starting at the end and
-         * working back, we won't create holes (and thereby avoid
-         * shuffling).
+         * Move 'last' into the high key position on opage
          */
-        for (o = last_off; o > first_off; o = OffsetNumberPrev(o))
-            PageIndexTupleDelete(opage, o);
         hii = PageGetItemId(opage, P_HIKEY);
-        ii = PageGetItemId(opage, first_off);
         *hii = *ii;
         ii->lp_flags &= ~LP_USED;
         ((PageHeader) opage)->pd_lower -= sizeof(ItemIdData);
-        first_off = P_FIRSTKEY;
+        /*
+         * Reset last_off to point to new page
+         */
         last_off = PageGetMaxOffsetNumber(npage);
-        last_bti = (BTItem) PageGetItem(npage, PageGetItemId(npage, last_off));
         /*
          * set the page (side link) pointers.
@@ -399,32 +383,21 @@ _bt_buildadd(Relation index, Size keysz, ScanKey scankey,
             oopaque->btpo_next = BufferGetBlockNumber(nbuf);
             nopaque->btpo_prev = BufferGetBlockNumber(obuf);
             nopaque->btpo_next = P_NONE;
-
-            if (_bt_itemcmp(index, keysz, scankey,
-                    (BTItem) PageGetItem(opage, PageGetItemId(opage, P_HIKEY)),
-                    (BTItem) PageGetItem(opage, PageGetItemId(opage, P_FIRSTKEY)),
-                    BTEqualStrategyNumber))
-                oopaque->btpo_flags |= BTP_CHAIN;
         }
         /*
-         * copy the old buffer's minimum key to its parent. if we don't
-         * have a parent, we have to create one; this adds a new btree
-         * level.
+         * Link the old buffer into its parent, using its minimum key.
+         * If we don't have a parent, we have to create one;
+         * this adds a new btree level.
          */
-        if (state->btps_doupper)
+        if (state->btps_next == (BTPageState *) NULL)
         {
-            BTItem nbti;
-
-            if (state->btps_next == (BTPageState *) NULL)
-            {
-                state->btps_next =
-                    _bt_pagestate(index, 0, state->btps_level + 1, true);
-            }
-            nbti = _bt_minitem(opage, BufferGetBlockNumber(obuf), 0);
-            _bt_buildadd(index, keysz, scankey, state->btps_next, nbti, 0);
-            pfree((void *) nbti);
+            state->btps_next =
+                _bt_pagestate(index, 0, state->btps_level + 1);
         }
+        nbti = _bt_minitem(opage, BufferGetBlockNumber(obuf), 0);
+        _bt_buildadd(index, state->btps_next, nbti, 0);
+        pfree((void *) nbti);
         /*
          * write out the old stuff.
we never want to see it again, so we
@@ -435,11 +408,11 @@ _bt_buildadd(Relation index, Size keysz, ScanKey scankey,
     }
     /*
-     * if this item is different from the last item added, we start a new
-     * chain of duplicates.
+     * Add the new item into the current page.
      */
-    off = OffsetNumberNext(last_off);
-    if (PageAddItem(npage, (Item) bti, btisz, off, LP_USED) == InvalidOffsetNumber)
+    last_off = OffsetNumberNext(last_off);
+    if (PageAddItem(npage, (Item) bti, btisz,
+                    last_off, LP_USED) == InvalidOffsetNumber)
         elog(FATAL, "btree: failed to add item to the page in _bt_sort (2)");
 #ifdef FASTBUILD_DEBUG
     {
@@ -447,65 +420,57 @@ _bt_buildadd(Relation index, Size keysz, ScanKey scankey,
         Datum d = index_getattr(&(bti->bti_itup), 1, index->rd_att, &isnull);
         printf("_bt_buildadd: inserted <%x> at offset %d at level %d\n",
-               d, off, state->btps_level);
+               d, last_off, state->btps_level);
     }
 #endif
-    if (last_bti == (BTItem) NULL)
-        first_off = P_FIRSTKEY;
-    else if (!_bt_itemcmp(index, keysz, scankey,
-                          bti, last_bti, BTEqualStrategyNumber))
-        first_off = off;
-    last_off = off;
-    last_bti = (BTItem) PageGetItem(npage, PageGetItemId(npage, off));
     state->btps_buf = nbuf;
     state->btps_page = npage;
-    state->btps_lastbti = last_bti;
     state->btps_lastoff = last_off;
-    state->btps_firstoff = first_off;
-
-    return last_bti;
 }
+/*
+ * Finish writing out the completed btree.
+ */
 static void
-_bt_uppershutdown(Relation index, Size keysz, ScanKey scankey,
-                  BTPageState *state)
+_bt_uppershutdown(Relation index, BTPageState *state)
 {
     BTPageState *s;
     BlockNumber blkno;
     BTPageOpaque opaque;
     BTItem bti;
+    /*
+     * Each iteration of this loop completes one more level of the tree.
+     */
     for (s = state; s != (BTPageState *) NULL; s = s->btps_next)
     {
         blkno = BufferGetBlockNumber(s->btps_buf);
         opaque = (BTPageOpaque) PageGetSpecialPointer(s->btps_page);
         /*
-         * if this is the root, attach it to the metapage.
otherwise,
-         * stick the minimum key of the last page on this level (which has
-         * not been split, or else it wouldn't be the last page) into its
-         * parent. this may cause the last page of upper levels to split,
-         * but that's not a problem -- we haven't gotten to them yet.
+         * We have to link the last page on this level to somewhere.
+         *
+         * If we're at the top, it's the root, so attach it to the metapage.
+         * Otherwise, add an entry for it to its parent using its minimum
+         * key. This may cause the last page of the parent level to split,
+         * but that's not a problem -- we haven't gotten to it yet.
          */
-        if (s->btps_doupper)
+        if (s->btps_next == (BTPageState *) NULL)
         {
-            if (s->btps_next == (BTPageState *) NULL)
-            {
-                opaque->btpo_flags |= BTP_ROOT;
-                _bt_metaproot(index, blkno, s->btps_level + 1);
-            }
-            else
-            {
-                bti = _bt_minitem(s->btps_page, blkno, 0);
-                _bt_buildadd(index, keysz, scankey, s->btps_next, bti, 0);
-                pfree((void *) bti);
-            }
+            opaque->btpo_flags |= BTP_ROOT;
+            _bt_metaproot(index, blkno, s->btps_level + 1);
+        }
+        else
+        {
+            bti = _bt_minitem(s->btps_page, blkno, 0);
+            _bt_buildadd(index, s->btps_next, bti, 0);
+            pfree((void *) bti);
         }
         /*
-         * this is the rightmost page, so the ItemId array needs to be
-         * slid back one slot.
+         * This is the rightmost page, so the ItemId array needs to be
+         * slid back one slot. Then we can dump out the page.
          */
         _bt_slideleft(index, s->btps_buf, s->btps_page);
         _bt_wrtbuf(index, s->btps_buf);
@@ -519,32 +484,27 @@ _bt_uppershutdown(Relation index, Size keysz, ScanKey scankey,
 static void
 _bt_load(Relation index, BTSpool *btspool)
 {
-    BTPageState *state;
-    ScanKey skey;
-    int natts;
-    BTItem bti;
-    bool should_free;
-
-    /*
-     * initialize state needed for the merge into the btree leaf pages.
-     */
-    state = _bt_pagestate(index, BTP_LEAF, 0, true);
-
-    skey = _bt_mkscankey_nodata(index);
-    natts = RelationGetNumberOfAttributes(index);
+    BTPageState *state = NULL;
     for (;;)
     {
+        BTItem bti;
+        bool should_free;
+
         bti = (BTItem) tuplesort_getindextuple(btspool->sortstate, true, &should_free);
         if (bti == (BTItem) NULL)
             break;
-        _bt_buildadd(index, natts, skey, state, bti, BTP_LEAF);
+
+        /* When we see first tuple, create first index page */
+        if (state == NULL)
+            state = _bt_pagestate(index, BTP_LEAF, 0);
+
+        _bt_buildadd(index, state, bti, BTP_LEAF);
         if (should_free)
             pfree((void *) bti);
     }
-    _bt_uppershutdown(index, natts, skey, state);
-
-    _bt_freeskey(skey);
+    if (state != NULL)
+        _bt_uppershutdown(index, state);
 }
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index 5853267670..aabdf80900 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -8,7 +8,7 @@
  *
  *
  * IDENTIFICATION
- *    $Header: /cvsroot/pgsql/src/backend/access/nbtree/nbtutils.c,v 1.37 2000/05/30 04:24:33 tgl Exp $
+ *    $Header: /cvsroot/pgsql/src/backend/access/nbtree/nbtutils.c,v 1.38 2000/07/21 06:42:33 tgl Exp $
  *
  *-------------------------------------------------------------------------
  */
@@ -20,16 +20,13 @@
 #include "access/nbtree.h"
 #include "executor/execdebug.h"
-extern int NIndexTupleProcessed;
-
 /*
  * _bt_mkscankey
  *    Build a scan key that contains comparison data from itup
  *    as well as comparator routines appropriate to the key datatypes.
  *
- *    The result is intended for use with _bt_skeycmp() or _bt_compare(),
- *    although it could be used with _bt_itemcmp() or _bt_tuplecompare().
+ *    The result is intended for use with _bt_compare().
  */
 ScanKey
 _bt_mkscankey(Relation rel, IndexTuple itup)
@@ -68,8 +65,9 @@ _bt_mkscankey(Relation rel, IndexTuple itup)
  *    Build a scan key that contains comparator routines appropriate to
  *    the key datatypes, but no comparison data.
  *
- *    The result can be used with _bt_itemcmp() or _bt_tuplecompare(),
- *    but not with _bt_skeycmp() or _bt_compare().
+ *    The result cannot be used with _bt_compare(). Currently this
+ *    routine is only called by utils/sort/tuplesort.c, which has its
+ *    own comparison routine.
  */
 ScanKey
 _bt_mkscankey_nodata(Relation rel)
@@ -114,7 +112,6 @@ _bt_freestack(BTStack stack)
     {
         ostack = stack;
         stack = stack->bts_parent;
-        pfree(ostack->bts_btitem);
         pfree(ostack);
     }
 }
@@ -331,55 +328,16 @@ _bt_formitem(IndexTuple itup)
     Size tuplen;
     extern Oid newoid();
-    /*
-     * see comments in btbuild
-     *
-     * if (itup->t_info & INDEX_NULL_MASK) elog(ERROR, "btree indices cannot
-     * include null keys");
-     */
-
     /* make a copy of the index tuple with room for the sequence number */
     tuplen = IndexTupleSize(itup);
     nbytes_btitem = tuplen + (sizeof(BTItemData) - sizeof(IndexTupleData));
     btitem = (BTItem) palloc(nbytes_btitem);
-    memmove((char *) &(btitem->bti_itup), (char *) itup, tuplen);
+    memcpy((char *) &(btitem->bti_itup), (char *) itup, tuplen);
     return btitem;
 }
-#ifdef NOT_USED
-bool
-_bt_checkqual(IndexScanDesc scan, IndexTuple itup)
-{
-    BTScanOpaque so;
-
-    so = (BTScanOpaque) scan->opaque;
-    if (so->numberOfKeys > 0)
-        return (index_keytest(itup, RelationGetDescr(scan->relation),
-                              so->numberOfKeys, so->keyData));
-    else
-        return true;
-}
-
-#endif
-
-#ifdef NOT_USED
-bool
-_bt_checkforkeys(IndexScanDesc scan, IndexTuple itup, Size keysz)
-{
-    BTScanOpaque so;
-
-    so = (BTScanOpaque) scan->opaque;
-    if (keysz > 0 && so->numberOfKeys >= keysz)
-        return (index_keytest(itup, RelationGetDescr(scan->relation),
-                              keysz, so->keyData));
-    else
-        return true;
-}
-
-#endif
-
 bool
 _bt_checkkeys(IndexScanDesc scan, IndexTuple tuple, Size *keysok)
 {
diff --git a/src/backend/storage/page/bufpage.c b/src/backend/storage/page/bufpage.c
index 43cabceba1..1a970a1375 100644
--- a/src/backend/storage/page/bufpage.c
+++ b/src/backend/storage/page/bufpage.c
@@ -8,7 +8,7 @@
  *
  *
  * IDENTIFICATION
- *
$Header: /cvsroot/pgsql/src/backend/storage/page/bufpage.c,v 1.30 2000/07/03 02:54:16 vadim Exp $
+ *    $Header: /cvsroot/pgsql/src/backend/storage/page/bufpage.c,v 1.31 2000/07/21 06:42:33 tgl Exp $
  *
  *-------------------------------------------------------------------------
  */
@@ -19,10 +19,10 @@
 #include "storage/bufpage.h"
+
 static void PageIndexTupleDeleteAdjustLinePointers(PageHeader phdr,
                                        char *location, Size size);
-static bool PageManagerShuffle = true; /* default is shuffle mode */
 /* ----------------------------------------------------------------
  * Page support functions
@@ -53,21 +53,17 @@ PageInit(Page page, Size pageSize, Size specialSize)
 /* ----------------
  * PageAddItem
  *
- * add an item to a page.
- *
- * !!! ELOG(ERROR) IS DISALLOWED HERE !!!
+ * Add an item to a page. Return value is offset at which it was
+ * inserted, or InvalidOffsetNumber if there's not room to insert.
  *
- * Notes on interface:
- * If offsetNumber is valid, shuffle ItemId's down to make room
- * to use it, if PageManagerShuffle is true. If PageManagerShuffle is
- * false, then overwrite the specified ItemId. (PageManagerShuffle is
- * true by default, and is modified by calling PageManagerModeSet.)
+ * If offsetNumber is valid and <= current max offset in the page,
+ * insert item into the array at that position by shuffling ItemId's
+ * down to make room.
  * If offsetNumber is not valid, then assign one by finding the first
  * one that is both unused and deallocated.
  *
- * NOTE: If offsetNumber is valid, and PageManagerShuffle is true, it
- * is assumed that there is room on the page to shuffle the ItemId's
- * down by one.
+ * !!! ELOG(ERROR) IS DISALLOWED HERE !!!
+ *
  * ----------------
  */
 OffsetNumber
@@ -82,11 +78,8 @@ PageAddItem(Page page,
     Offset lower;
     Offset upper;
     ItemId itemId;
-    ItemId fromitemId,
-           toitemId;
     OffsetNumber limit;
-
-    bool shuffled = false;
+    bool needshuffle = false;
     /*
      * Find first unallocated offsetNumber
@@ -96,31 +89,12 @@ PageAddItem(Page page,
     /* was offsetNumber passed in? */
     if (OffsetNumberIsValid(offsetNumber))
     {
-        if (PageManagerShuffle == true)
-        {
-            /* shuffle ItemId's (Do the PageManager Shuffle...) */
-            for (i = (limit - 1); i >= offsetNumber; i--)
-            {
-                fromitemId = &((PageHeader) page)->pd_linp[i - 1];
-                toitemId = &((PageHeader) page)->pd_linp[i];
-                *toitemId = *fromitemId;
-            }
-            shuffled = true; /* need to increase "lower" */
-        }
-        else
-        { /* overwrite mode */
-            itemId = &((PageHeader) page)->pd_linp[offsetNumber - 1];
-            if (((*itemId).lp_flags & LP_USED) ||
-                ((*itemId).lp_len != 0))
-            {
-                elog(NOTICE, "PageAddItem: tried overwrite of used ItemId");
-                return InvalidOffsetNumber;
-            }
-        }
+        needshuffle = true; /* need to increase "lower" */
+        /* don't actually do the shuffle till we've checked free space! */
     }
     else
-    { /* offsetNumber was not passed in, so find
-       * one */
+    {
+        /* offsetNumber was not passed in, so find one */
         /* look for "recyclable" (unused & deallocated) ItemId */
         for (offsetNumber = 1; offsetNumber < limit; offsetNumber++)
         {
@@ -130,9 +104,13 @@ PageAddItem(Page page,
                 break;
         }
     }
+
+    /*
+     * Compute new lower and upper pointers for page, see if it'll fit
+     */
     if (offsetNumber > limit)
         lower = (Offset) (((char *) (&((PageHeader) page)->pd_linp[offsetNumber])) - ((char *) page));
-    else if (offsetNumber == limit || shuffled == true)
+    else if (offsetNumber == limit || needshuffle)
         lower = ((PageHeader) page)->pd_lower + sizeof(ItemIdData);
     else
         lower = ((PageHeader) page)->pd_lower;
@@ -144,6 +122,23 @@ PageAddItem(Page page,
     if (lower > upper)
         return InvalidOffsetNumber;
+    /*
+     * OK to insert the item. First, shuffle the existing pointers if needed.
+     */
+    if (needshuffle)
+    {
+        /* shuffle ItemId's (Do the PageManager Shuffle...) */
+        for (i = (limit - 1); i >= offsetNumber; i--)
+        {
+            ItemId fromitemId,
+                   toitemId;
+
+            fromitemId = &((PageHeader) page)->pd_linp[i - 1];
+            toitemId = &((PageHeader) page)->pd_linp[i];
+            *toitemId = *fromitemId;
+        }
+    }
+
     itemId = &((PageHeader) page)->pd_linp[offsetNumber - 1];
     (*itemId).lp_off = upper;
     (*itemId).lp_len = size;
@@ -168,9 +163,7 @@ PageGetTempPage(Page page, Size specialSize)
     PageHeader thdr;
     pageSize = PageGetPageSize(page);
-
-    if ((temp = (Page) palloc(pageSize)) == (Page) NULL)
-        elog(FATAL, "Cannot allocate %d bytes for temp page.", pageSize);
+    temp = (Page) palloc(pageSize);
     thdr = (PageHeader) temp;
     /* copy old page in */
@@ -327,23 +320,6 @@ PageGetFreeSpace(Page page)
     return space;
 }
-/*
- * PageManagerModeSet
- *
- * Sets mode to either: ShufflePageManagerMode (the default) or
- * OverwritePageManagerMode. For use by access methods code
- * for determining semantics of PageAddItem when the offsetNumber
- * argument is passed in.
- */
-void
-PageManagerModeSet(PageManagerMode mode)
-{
-    if (mode == ShufflePageManagerMode)
-        PageManagerShuffle = true;
-    else if (mode == OverwritePageManagerMode)
-        PageManagerShuffle = false;
-}
-
 /*
  *----------------------------------------------------------------
  * PageIndexTupleDelete
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 49d9dd07dc..3f8eebc3b3 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -7,7 +7,7 @@
  * Portions Copyright (c) 1996-2000, PostgreSQL, Inc
  * Portions Copyright (c) 1994, Regents of the University of California
  *
- * $Id: nbtree.h,v 1.38 2000/06/15 03:32:31 momjian Exp $
+ * $Id: nbtree.h,v 1.39 2000/07/21 06:42:35 tgl Exp $
  *
  *-------------------------------------------------------------------------
  */
@@ -24,14 +24,9 @@
  * info.
In addition, we need to know what sort of page this is
  * (leaf or internal), and whether the page is available for reuse.
  *
- * Lehman and Yao's algorithm requires a ``high key'' on every page.
- * The high key on a page is guaranteed to be greater than or equal
- * to any key that appears on this page. Our insertion algorithm
- * guarantees that we can use the initial least key on our right
- * sibling as the high key. We allocate space for the line pointer
- * to the high key in the opaque data at the end of the page.
- *
- * Rightmost pages in the tree have no high key.
+ * We also store a back-link to the parent page, but this cannot be trusted
+ * very far since it does not get updated when the parent is split.
+ * See backend/access/nbtree/README for details.
  */
 typedef struct BTPageOpaqueData
@@ -41,11 +36,11 @@ typedef struct BTPageOpaqueData
     BlockNumber btpo_parent;
     uint16 btpo_flags;

-#define BTP_LEAF (1 << 0)
-#define BTP_ROOT (1 << 1)
-#define BTP_FREE (1 << 2)
-#define BTP_META (1 << 3)
-#define BTP_CHAIN (1 << 4)
+/* Bits defined in btpo_flags */
+#define BTP_LEAF (1 << 0) /* It's a leaf page */
+#define BTP_ROOT (1 << 1) /* It's the root page (has no parent) */
+#define BTP_FREE (1 << 2) /* not currently used... */
+#define BTP_META (1 << 3) /* Set in the meta-page only */

 } BTPageOpaqueData;
@@ -84,21 +79,24 @@ typedef struct BTScanOpaqueData
 typedef BTScanOpaqueData *BTScanOpaque;
 /*
- * BTItems are what we store in the btree. Each item has an index
- * tuple, including key and pointer values. In addition, we must
- * guarantee that all tuples in the index are unique, in order to
- * satisfy some assumptions in Lehman and Yao. The way that we do
- * this is by generating a new OID for every insertion that we do in
- * the tree. This adds eight bytes to the size of btree index
- * tuples.
Note that we do not use the OID as part of a composite
- * key; the OID only serves as a unique identifier for a given index
- * tuple (logical position within a page).
+ * BTItems are what we store in the btree. Each item is an index tuple,
+ * including key and pointer values. (In some cases either the key or the
+ * pointer may go unused, see backend/access/nbtree/README for details.)
+ *
+ * Old comments:
+ * In addition, we must guarantee that all tuples in the index are unique,
+ * in order to satisfy some assumptions in Lehman and Yao. The way that we
+ * do this is by generating a new OID for every insertion that we do in the
+ * tree. This adds eight bytes to the size of btree index tuples. Note
+ * that we do not use the OID as part of a composite key; the OID only
+ * serves as a unique identifier for a given index tuple (logical position
+ * within a page).
  *
  * New comments:
  * actually, we must guarantee that all tuples in A LEVEL
  * are unique, not in ALL INDEX. So, we can use bti_itup->t_tid
  * as unique identifier for a given index tuple (logical position
- * within a level). - vadim 04/09/97
+ * within a level).
- vadim 04/09/97
  */
 typedef struct BTItemData
@@ -108,12 +106,13 @@ typedef struct BTItemData
 typedef BTItemData *BTItem;
-#define BTItemSame(i1, i2) ( i1->bti_itup.t_tid.ip_blkid.bi_hi == \
-                             i2->bti_itup.t_tid.ip_blkid.bi_hi && \
-                             i1->bti_itup.t_tid.ip_blkid.bi_lo == \
-                             i2->bti_itup.t_tid.ip_blkid.bi_lo && \
-                             i1->bti_itup.t_tid.ip_posid == \
-                             i2->bti_itup.t_tid.ip_posid )
+/* Test whether items are the "same" per the above notes */
+#define BTItemSame(i1, i2) ( (i1)->bti_itup.t_tid.ip_blkid.bi_hi == \
+                             (i2)->bti_itup.t_tid.ip_blkid.bi_hi && \
+                             (i1)->bti_itup.t_tid.ip_blkid.bi_lo == \
+                             (i2)->bti_itup.t_tid.ip_blkid.bi_lo && \
+                             (i1)->bti_itup.t_tid.ip_posid == \
+                             (i2)->bti_itup.t_tid.ip_posid )
 /*
  * BTStackData -- As we descend a tree, we push the (key, pointer)
@@ -129,24 +128,12 @@ typedef struct BTStackData
 {
     BlockNumber bts_blkno;
     OffsetNumber bts_offset;
-    BTItem bts_btitem;
+    BTItemData bts_btitem;
     struct BTStackData *bts_parent;
 } BTStackData;
 typedef BTStackData *BTStack;
-typedef struct BTPageState
-{
-    Buffer btps_buf;
-    Page btps_page;
-    BTItem btps_lastbti;
-    OffsetNumber btps_lastoff;
-    OffsetNumber btps_firstoff;
-    int btps_level;
-    bool btps_doupper;
-    struct BTPageState *btps_next;
-} BTPageState;
-
 /*
  * We need to be able to tell the difference between read and write
  * requests for pages, in order to do locking correctly.
@@ -155,31 +142,49 @@ typedef struct BTPageState
 #define BT_READ BUFFER_LOCK_SHARE
 #define BT_WRITE BUFFER_LOCK_EXCLUSIVE
-/*
- * Similarly, the difference between insertion and non-insertion binary
- * searches on a given page makes a difference when we're descending the
- * tree.
- */
-
-#define BT_INSERTION 0
-#define BT_DESCENT 1
-
 /*
  * In general, the btree code tries to localize its knowledge about
  * page layout to a couple of routines. However, we need a special
  * value to indicate "no page number" in those places where we expect
- * page numbers.
We can use zero for this because we never need to
+ * make a pointer to the metadata page.
  */
 #define P_NONE 0
+
+/*
+ * Macros to test whether a page is leftmost or rightmost on its tree level,
+ * as well as other state info kept in the opaque data.
+ */
 #define P_LEFTMOST(opaque) ((opaque)->btpo_prev == P_NONE)
 #define P_RIGHTMOST(opaque) ((opaque)->btpo_next == P_NONE)
+#define P_ISLEAF(opaque) ((opaque)->btpo_flags & BTP_LEAF)
+#define P_ISROOT(opaque) ((opaque)->btpo_flags & BTP_ROOT)
+
+/*
+ * Lehman and Yao's algorithm requires a ``high key'' on every non-rightmost
+ * page. The high key is not a data key, but gives info about what range of
+ * keys is supposed to be on this page. The high key on a page is required
+ * to be greater than or equal to any data key that appears on the page.
+ * If we find ourselves trying to insert a key > high key, we know we need
+ * to move right (this should only happen if the page was split since we
+ * examined the parent page).
+ *
+ * Our insertion algorithm guarantees that we can use the initial least key
+ * on our right sibling as the high key. Once a page is created, its high
+ * key changes only if the page is split.
+ *
+ * On a non-rightmost page, the high key lives in item 1 and data items
+ * start in item 2. Rightmost pages have no high key, so we store data
+ * items beginning in item 1.
+ */
 #define P_HIKEY ((OffsetNumber) 1)
 #define P_FIRSTKEY ((OffsetNumber) 2)
+#define P_FIRSTDATAKEY(opaque) (P_RIGHTMOST(opaque) ?
P_HIKEY : P_FIRSTKEY)
 /*
- * Strategy numbers -- ordering of these is <, <=, =, >=, >
+ * Operator strategy numbers -- ordering of these is <, <=, =, >=, >
  */
 #define BTLessStrategyNumber 1
@@ -199,13 +204,27 @@ typedef struct BTPageState
 #define BTORDER_PROC 1
+/*
+ * prototypes for functions in nbtree.c (external entry points for btree)
+ */
+extern bool BuildingBtree; /* in nbtree.c */
+
+extern Datum btbuild(PG_FUNCTION_ARGS);
+extern Datum btinsert(PG_FUNCTION_ARGS);
+extern Datum btgettuple(PG_FUNCTION_ARGS);
+extern Datum btbeginscan(PG_FUNCTION_ARGS);
+extern Datum btrescan(PG_FUNCTION_ARGS);
+extern void btmovescan(IndexScanDesc scan, Datum v);
+extern Datum btendscan(PG_FUNCTION_ARGS);
+extern Datum btmarkpos(PG_FUNCTION_ARGS);
+extern Datum btrestrpos(PG_FUNCTION_ARGS);
+extern Datum btdelete(PG_FUNCTION_ARGS);
+
 /*
  * prototypes for functions in nbtinsert.c
  */
 extern InsertIndexResult _bt_doinsert(Relation rel, BTItem btitem,
              bool index_is_unique, Relation heapRel);
-extern bool _bt_itemcmp(Relation rel, Size keysz, ScanKey scankey,
-            BTItem item1, BTItem item2, StrategyNumber strat);
 /*
  * prototypes for functions in nbtpage.c
@@ -218,25 +237,8 @@ extern void _bt_wrtbuf(Relation rel, Buffer buf);
 extern void _bt_wrtnorelbuf(Relation rel, Buffer buf);
 extern void _bt_pageinit(Page page, Size size);
 extern void _bt_metaproot(Relation rel, BlockNumber rootbknum, int level);
-extern Buffer _bt_getstackbuf(Relation rel, BTStack stack, int access);
 extern void _bt_pagedel(Relation rel, ItemPointer tid);
-/*
- * prototypes for functions in nbtree.c
- */
-extern bool BuildingBtree; /* in nbtree.c */
-
-extern Datum btbuild(PG_FUNCTION_ARGS);
-extern Datum btinsert(PG_FUNCTION_ARGS);
-extern Datum btgettuple(PG_FUNCTION_ARGS);
-extern Datum btbeginscan(PG_FUNCTION_ARGS);
-extern Datum btrescan(PG_FUNCTION_ARGS);
-extern void btmovescan(IndexScanDesc scan, Datum v);
-extern Datum btendscan(PG_FUNCTION_ARGS);
-extern Datum btmarkpos(PG_FUNCTION_ARGS);
-extern Datum
btrestrpos(PG_FUNCTION_ARGS);
-extern Datum btdelete(PG_FUNCTION_ARGS);
-
 /*
  * prototypes for functions in nbtscan.c
  */
@@ -249,13 +251,13 @@ extern void AtEOXact_nbtree(void);
  * prototypes for functions in nbtsearch.c
  */
 extern BTStack _bt_search(Relation rel, int keysz, ScanKey scankey,
-           Buffer *bufP);
+           Buffer *bufP, int access);
 extern Buffer _bt_moveright(Relation rel, Buffer buf, int keysz,
               ScanKey scankey, int access);
-extern bool _bt_skeycmp(Relation rel, Size keysz, ScanKey scankey,
-            Page page, ItemId itemid, StrategyNumber strat);
 extern OffsetNumber _bt_binsrch(Relation rel, Buffer buf, int keysz,
-            ScanKey scankey, int srchtype);
+            ScanKey scankey);
+extern int32 _bt_compare(Relation rel, int keysz, ScanKey scankey,
+            Page page, OffsetNumber offnum);
 extern RetrieveIndexResult _bt_next(IndexScanDesc scan, ScanDirection dir);
 extern RetrieveIndexResult _bt_first(IndexScanDesc scan, ScanDirection dir);
 extern bool _bt_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir);
diff --git a/src/include/storage/bufpage.h b/src/include/storage/bufpage.h
index 30b5a93ad6..8498c783a1 100644
--- a/src/include/storage/bufpage.h
+++ b/src/include/storage/bufpage.h
@@ -7,7 +7,7 @@
  * Portions Copyright (c) 1996-2000, PostgreSQL, Inc
  * Portions Copyright (c) 1994, Regents of the University of California
  *
- * $Id: bufpage.h,v 1.30 2000/07/03 02:54:21 vadim Exp $
+ * $Id: bufpage.h,v 1.31 2000/07/21 06:42:39 tgl Exp $
  *
  *-------------------------------------------------------------------------
  */
@@ -309,7 +309,6 @@ extern Page PageGetTempPage(Page page, Size specialSize);
 extern void PageRestoreTempPage(Page tempPage, Page oldPage);
 extern void PageRepairFragmentation(Page page);
 extern Size PageGetFreeSpace(Page page);
-extern void PageManagerModeSet(PageManagerMode mode);
 extern void PageIndexTupleDelete(Page page, OffsetNumber offset);
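
Note on the high-key layout: the nbtree.h hunk in this diff says that a non-rightmost page keeps its high key in item 1 and its data in item 2 onward, while a rightmost page starts data in item 1. The standalone sketch below only illustrates that convention and the empty-page clamp used in the new _bt_endpoint. The macros are copied from the hunk; DemoPageOpaque, demo_backward_start and demo_must_step are hypothetical names invented for this note and are not part of the PostgreSQL source, and the typedefs are simplified stand-ins.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins for the PostgreSQL typedefs (assumptions). */
typedef unsigned int BlockNumber;
typedef unsigned short OffsetNumber;

/* Hypothetical cut-down page opaque data: sibling links only. */
typedef struct DemoPageOpaque
{
    BlockNumber btpo_prev;
    BlockNumber btpo_next;
} DemoPageOpaque;

/* Macros as they appear in the nbtree.h hunk of this diff. */
#define P_NONE 0
#define P_LEFTMOST(opaque) ((opaque)->btpo_prev == P_NONE)
#define P_RIGHTMOST(opaque) ((opaque)->btpo_next == P_NONE)
#define P_HIKEY ((OffsetNumber) 1)
#define P_FIRSTKEY ((OffsetNumber) 2)
#define P_FIRSTDATAKEY(opaque) (P_RIGHTMOST(opaque) ? P_HIKEY : P_FIRSTKEY)

/*
 * Hypothetical helper: starting offset for a backward scan, following the
 * clamp in the new _bt_endpoint. maxoff stands in for the value that
 * PageGetMaxOffsetNumber(page) would return.
 */
static OffsetNumber
demo_backward_start(const DemoPageOpaque *opaque, OffsetNumber maxoff)
{
    OffsetNumber start = maxoff;

    if (start < P_FIRSTDATAKEY(opaque)) /* watch out for empty page */
        start = P_FIRSTDATAKEY(opaque);
    return start;
}

/*
 * Hypothetical helper: nonzero when the start offset lies past the last
 * item, meaning the page has no data items and the scan has to step on.
 */
static int
demo_must_step(OffsetNumber start, OffsetNumber maxoff)
{
    return start > maxoff;
}
```

With an empty rightmost page (maxoff of 0) the clamp yields offset 1, which is past the last item, so the caller would step to a neighbouring page; this mirrors the "if (start > maxoff)" test added in the nbtsearch.c hunk.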