granicus.if.org Git - postgresql/blob - src/backend/access/nbtree/README

   1 $PostgreSQL: pgsql/src/backend/access/nbtree/README,v 1.21 2009/12/19 01:32:32 sriggs Exp $
   2
   3 Btree Indexing
   4 ==============
   5
   6 This directory contains a correct implementation of Lehman and Yao's
   7 high-concurrency B-tree management algorithm (P. Lehman and S. Yao,
   8 Efficient Locking for Concurrent Operations on B-Trees, ACM Transactions
   9 on Database Systems, Vol 6, No. 4, December 1981, pp 650-670).  We also
  10 use a simplified version of the deletion logic described in Lanin and
  11 Shasha (V. Lanin and D. Shasha, A Symmetric Concurrent B-Tree Algorithm,
  12 Proceedings of 1986 Fall Joint Computer Conference, pp 380-389).
  13
  14 The Lehman and Yao Algorithm and Insertions
  15 -------------------------------------------
  16
  17 We have made the following changes in order to incorporate the L&Y algorithm
  18 into Postgres:
  19
  20 The requirement that all btree keys be unique is too onerous,
  21 but the algorithm won't work correctly without it.  Fortunately, it is
  22 only necessary that keys be unique on a single tree level, because L&Y
  23 only use the assumption of key uniqueness when re-finding a key in a
  24 parent page (to determine where to insert the key for a split page).
  25 Therefore, we can use the link field to disambiguate multiple
  26 occurrences of the same user key: only one entry in the parent level
  27 will be pointing at the page we had split.  (Indeed we need not look at
  28 the real "key" at all, just at the link field.)  We can distinguish
  29 items at the leaf level in the same way, by examining their links to
  30 heap tuples; we'd never have two items for the same heap tuple.
  31
  32 Lehman and Yao assume that the key range for a subtree S is described
  33 by Ki < v <= Ki+1 where Ki and Ki+1 are the adjacent keys in the parent
  34 page.  This does not work for nonunique keys (for example, if we have
  35 enough equal keys to spread across several leaf pages, there *must* be
  36 some equal bounding keys in the first level up).  Therefore we assume
  37 Ki <= v <= Ki+1 instead.  A search that finds exact equality to a
  38 bounding key in an upper tree level must descend to the left of that
  39 key to ensure it finds any equal keys in the preceding page.  An
  40 insertion that sees the high key of its target page is equal to the key
  41 to be inserted has a choice whether or not to move right, since the new
  42 key could go on either page.  (Currently, we try to find a page where
  43 there is room for the new key without a split.)
  44
  45 Lehman and Yao don't require read locks, but assume that in-memory
  46 copies of tree pages are unshared.  Postgres shares in-memory buffers
  47 among backends.  As a result, we do page-level read locking on btree
  48 pages in order to guarantee that no record is modified while we are
  49 examining it.  This reduces concurrency but guaranteees correct
  50 behavior.  An advantage is that when trading in a read lock for a
  51 write lock, we need not re-read the page after getting the write lock.
  52 Since we're also holding a pin on the shared buffer containing the
  53 page, we know that buffer still contains the page and is up-to-date.
  54
  55 We support the notion of an ordered "scan" of an index as well as
  56 insertions, deletions, and simple lookups.  A scan in the forward
  57 direction is no problem, we just use the right-sibling pointers that
  58 L&Y require anyway.  (Thus, once we have descended the tree to the
  59 correct start point for the scan, the scan looks only at leaf pages
  60 and never at higher tree levels.)  To support scans in the backward
  61 direction, we also store a "left sibling" link much like the "right
  62 sibling".  (This adds an extra step to the L&Y split algorithm: while
  63 holding the write lock on the page being split, we also lock its former
  64 right sibling to update that page's left-link.  This is safe since no
  65 writer of that page can be interested in acquiring a write lock on our
  66 page.)  A backwards scan has one additional bit of complexity: after
  67 following the left-link we must account for the possibility that the
  68 left sibling page got split before we could read it.  So, we have to
  69 move right until we find a page whose right-link matches the page we
  70 came from.  (Actually, it's even harder than that; see deletion discussion
  71 below.)
  72
  73 Page read locks are held only for as long as a scan is examining a page.
  74 To minimize lock/unlock traffic, an index scan always searches a leaf page
  75 to identify all the matching items at once, copying their heap tuple IDs
  76 into backend-local storage.  The heap tuple IDs are then processed while
  77 not holding any page lock within the index.  We do continue to hold a pin
  78 on the leaf page, to protect against concurrent deletions (see below).
  79 In this state the scan is effectively stopped "between" pages, either
  80 before or after the page it has pinned.  This is safe in the presence of
  81 concurrent insertions and even page splits, because items are never moved
  82 across pre-existing page boundaries --- so the scan cannot miss any items
  83 it should have seen, nor accidentally return the same item twice.  The scan
  84 must remember the page's right-link at the time it was scanned, since that
  85 is the page to move right to; if we move right to the current right-link
  86 then we'd re-scan any items moved by a page split.  We don't similarly
  87 remember the left-link, since it's best to use the most up-to-date
  88 left-link when trying to move left (see detailed move-left algorithm below).
  89
  90 In most cases we release our lock and pin on a page before attempting
  91 to acquire pin and lock on the page we are moving to.  In a few places
  92 it is necessary to lock the next page before releasing the current one.
  93 This is safe when moving right or up, but not when moving left or down
  94 (else we'd create the possibility of deadlocks).
  95
  96 Lehman and Yao fail to discuss what must happen when the root page
  97 becomes full and must be split.  Our implementation is to split the
  98 root in the same way that any other page would be split, then construct
  99 a new root page holding pointers to both of the resulting pages (which
 100 now become siblings on the next level of the tree).  The new root page
 101 is then installed by altering the root pointer in the meta-data page (see
 102 below).  This works because the root is not treated specially in any
 103 other way --- in particular, searches will move right using its link
 104 pointer if the link is set.  Therefore, searches will find the data
 105 that's been moved into the right sibling even if they read the meta-data
 106 page before it got updated.  This is the same reasoning that makes a
 107 split of a non-root page safe.  The locking considerations are similar too.
 108
 109 When an inserter recurses up the tree, splitting internal pages to insert
 110 links to pages inserted on the level below, it is possible that it will
 111 need to access a page above the level that was the root when it began its
 112 descent (or more accurately, the level that was the root when it read the
 113 meta-data page).  In this case the stack it made while descending does not
 114 help for finding the correct page.  When this happens, we find the correct
 115 place by re-descending the tree until we reach the level one above the
 116 level we need to insert a link to, and then moving right as necessary.
 117 (Typically this will take only two fetches, the meta-data page and the new
 118 root, but in principle there could have been more than one root split
 119 since we saw the root.  We can identify the correct tree level by means of
 120 the level numbers stored in each page.  The situation is rare enough that
 121 we do not need a more efficient solution.)
 122
 123 Lehman and Yao assume fixed-size keys, but we must deal with
 124 variable-size keys.  Therefore there is not a fixed maximum number of
 125 keys per page; we just stuff in as many as will fit.  When we split a
 126 page, we try to equalize the number of bytes, not items, assigned to
 127 each of the resulting pages.  Note we must include the incoming item in
 128 this calculation, otherwise it is possible to find that the incoming
 129 item doesn't fit on the split page where it needs to go!
 130
 131 The Deletion Algorithm
 132 ----------------------
 133
 134 Before deleting a leaf item, we get a super-exclusive lock on the target
 135 page, so that no other backend has a pin on the page when the deletion
 136 starts.  This is not necessary for correctness in terms of the btree index
 137 operations themselves; as explained above, index scans logically stop
 138 "between" pages and so can't lose their place.  The reason we do it is to
 139 provide an interlock between non-full VACUUM and indexscans.  Since VACUUM
 140 deletes index entries before deleting tuples, the super-exclusive lock
 141 guarantees that VACUUM can't delete any heap tuple that an indexscanning
 142 process might be about to visit.  (This guarantee works only for simple
 143 indexscans that visit the heap in sync with the index scan, not for bitmap
 144 scans.  We only need the guarantee when using non-MVCC snapshot rules such
 145 as SnapshotNow, so in practice this is only important for system catalog
 146 accesses.)
 147
 148 Because a page can be split even while someone holds a pin on it, it is
 149 possible that an indexscan will return items that are no longer stored on
 150 the page it has a pin on, but rather somewhere to the right of that page.
 151 To ensure that VACUUM can't prematurely remove such heap tuples, we require
 152 btbulkdelete to obtain super-exclusive lock on every leaf page in the index,
 153 even pages that don't contain any deletable tuples.  This guarantees that
 154 the btbulkdelete call cannot return while any indexscan is still holding
 155 a copy of a deleted index tuple.  Note that this requirement does not say
 156 that btbulkdelete must visit the pages in any particular order.  (See also
 157 on-the-fly deletion, below.)
 158
 159 There is no such interlocking for deletion of items in internal pages,
 160 since backends keep no lock nor pin on a page they have descended past.
 161 Hence, when a backend is ascending the tree using its stack, it must
 162 be prepared for the possibility that the item it wants is to the left of
 163 the recorded position (but it can't have moved left out of the recorded
 164 page).  Since we hold a lock on the lower page (per L&Y) until we have
 165 re-found the parent item that links to it, we can be assured that the
 166 parent item does still exist and can't have been deleted.  Also, because
 167 we are matching downlink page numbers and not data keys, we don't have any
 168 problem with possibly misidentifying the parent item.
 169
 170 We consider deleting an entire page from the btree only when it's become
 171 completely empty of items.  (Merging partly-full pages would allow better
 172 space reuse, but it seems impractical to move existing data items left or
 173 right to make this happen --- a scan moving in the opposite direction
 174 might miss the items if so.  We could do it during VACUUM FULL, though.)
 175 Also, we *never* delete the rightmost page on a tree level (this
 176 restriction simplifies the traversal algorithms, as explained below).
 177
 178 To delete an empty page, we acquire write lock on its left sibling (if
 179 any), the target page itself, the right sibling (there must be one), and
 180 the parent page, in that order.  The parent page must be found using the
 181 same type of search as used to find the parent during an insertion split.
 182 Then we update the side-links in the siblings, mark the target page
 183 deleted, and remove the downlink from the parent, as well as the parent's
 184 upper bounding key for the target (the one separating it from its right
 185 sibling).  This causes the target page's key space to effectively belong
 186 to its right sibling.  (Neither the left nor right sibling pages need to
 187 change their "high key" if any; so there is no problem with possibly not
 188 having enough space to replace a high key.)  The side-links in the target
 189 page are not changed.
 190
 191 (Note: Lanin and Shasha prefer to make the key space move left, but their
 192 argument for doing so hinges on not having left-links, which we have
 193 anyway.  So we simplify the algorithm by moving key space right.)
 194
 195 To preserve consistency on the parent level, we cannot merge the key space
 196 of a page into its right sibling unless the right sibling is a child of
 197 the same parent --- otherwise, the parent's key space assignment changes
 198 too, meaning we'd have to make bounding-key updates in its parent, and
 199 perhaps all the way up the tree.  Since we can't possibly do that
 200 atomically, we forbid this case.  That means that the rightmost child of a
 201 parent node can't be deleted unless it's the only remaining child.
 202
 203 When we delete the last remaining child of a parent page, we mark the
 204 parent page "half-dead" as part of the atomic update that deletes the
 205 child page.  This implicitly transfers the parent's key space to its right
 206 sibling (which it must have, since we never delete the overall-rightmost
 207 page of a level).  Searches ignore the half-dead page and immediately move
 208 right.  We need not worry about insertions into a half-dead page --- insertions
 209 into upper tree levels happen only as a result of splits of child pages, and
 210 the half-dead page no longer has any children that could split.  Therefore
 211 the page stays empty even when we don't have lock on it, and we can complete
 212 its deletion in a second atomic action.
 213
 214 The notion of a half-dead page means that the key space relationship between
 215 the half-dead page's level and its parent's level may be a little out of
 216 whack: key space that appears to belong to the half-dead page's parent on the
 217 parent level may really belong to its right sibling.  To prevent any possible
 218 problems, we hold lock on the deleted child page until we have finished
 219 deleting any now-half-dead parent page(s).  This prevents any insertions into
 220 the transferred keyspace until the operation is complete.  The reason for
 221 doing this is that a sufficiently large number of insertions into the
 222 transferred keyspace, resulting in multiple page splits, could propagate keys
 223 from that keyspace into the parent level, resulting in transiently
 224 out-of-order keys in that level.  It is thought that that wouldn't cause any
 225 serious problem, but it seems too risky to allow.
 226
 227 A deleted page cannot be reclaimed immediately, since there may be other
 228 processes waiting to reference it (ie, search processes that just left the
 229 parent, or scans moving right or left from one of the siblings).  These
 230 processes must observe that the page is marked dead and recover
 231 accordingly.  Searches and forward scans simply follow the right-link
 232 until they find a non-dead page --- this will be where the deleted page's
 233 key-space moved to.
 234
 235 Moving left in a backward scan is complicated because we must consider
 236 the possibility that the left sibling was just split (meaning we must find
 237 the rightmost page derived from the left sibling), plus the possibility
 238 that the page we were just on has now been deleted and hence isn't in the
 239 sibling chain at all anymore.  So the move-left algorithm becomes:
 240 0. Remember the page we are on as the "original page".
 241 1. Follow the original page's left-link (we're done if this is zero).
 242 2. If the current page is live and its right-link matches the "original
 243    page", we are done.
 244 3. Otherwise, move right one or more times looking for a live page whose
 245    right-link matches the "original page".  If found, we are done.  (In
 246    principle we could scan all the way to the right end of the index, but
 247    in practice it seems better to give up after a small number of tries.
 248    It's unlikely the original page's sibling split more than a few times
 249    while we were in flight to it; if we do not find a matching link in a
 250    few tries, then most likely the original page is deleted.)
 251 4. Return to the "original page".  If it is still live, return to step 1
 252    (we guessed wrong about it being deleted, and should restart with its
 253    current left-link).  If it is dead, move right until a non-dead page
 254    is found (there must be one, since rightmost pages are never deleted),
 255    mark that as the new "original page", and return to step 1.
 256 This algorithm is correct because the live page found by step 4 will have
 257 the same left keyspace boundary as the page we started from.  Therefore,
 258 when we ultimately exit, it must be on a page whose right keyspace
 259 boundary matches the left boundary of where we started --- which is what
 260 we need to be sure we don't miss or re-scan any items.
 261
 262 A deleted page can only be reclaimed once there is no scan or search that
 263 has a reference to it; until then, it must stay in place with its
 264 right-link undisturbed.  We implement this by waiting until all
 265 transactions that were running at the time of deletion are dead; which is
 266 overly strong, but is simple to implement within Postgres.  When marked
 267 dead, a deleted page is labeled with the next-transaction counter value.
 268 VACUUM can reclaim the page for re-use when this transaction number is
 269 older than the oldest open transaction.  (NOTE: VACUUM FULL can reclaim
 270 such pages immediately.)
 271
 272 Reclaiming a page doesn't actually change its state on disk --- we simply
 273 record it in the shared-memory free space map, from which it will be
 274 handed out the next time a new page is needed for a page split.  The
 275 deleted page's contents will be overwritten by the split operation.
 276 (Note: if we find a deleted page with an extremely old transaction
 277 number, it'd be worthwhile to re-mark it with FrozenTransactionId so that
 278 a later xid wraparound can't cause us to think the page is unreclaimable.
 279 But in more normal situations this would be a waste of a disk write.)
 280
 281 Because we never delete the rightmost page of any level (and in particular
 282 never delete the root), it's impossible for the height of the tree to
 283 decrease.  After massive deletions we might have a scenario in which the
 284 tree is "skinny", with several single-page levels below the root.
 285 Operations will still be correct in this case, but we'd waste cycles
 286 descending through the single-page levels.  To handle this we use an idea
 287 from Lanin and Shasha: we keep track of the "fast root" level, which is
 288 the lowest single-page level.  The meta-data page keeps a pointer to this
 289 level as well as the true root.  All ordinary operations initiate their
 290 searches at the fast root not the true root.  When we split a page that is
 291 alone on its level or delete the next-to-last page on a level (both cases
 292 are easily detected), we have to make sure that the fast root pointer is
 293 adjusted appropriately.  In the split case, we do this work as part of the
 294 atomic update for the insertion into the parent level; in the delete case
 295 as part of the atomic update for the delete (either way, the metapage has
 296 to be the last page locked in the update to avoid deadlock risks).  This
 297 avoids race conditions if two such operations are executing concurrently.
 298
 299 VACUUM needs to do a linear scan of an index to search for deleted pages
 300 that can be reclaimed because they are older than all open transactions.
 301 For efficiency's sake, we'd like to use the same linear scan to search for
 302 deletable tuples.  Before Postgres 8.2, btbulkdelete scanned the leaf pages
 303 in index order, but it is possible to visit them in physical order instead.
 304 The tricky part of this is to avoid missing any deletable tuples in the
 305 presence of concurrent page splits: a page split could easily move some
 306 tuples from a page not yet passed over by the sequential scan to a
 307 lower-numbered page already passed over.  (This wasn't a concern for the
 308 index-order scan, because splits always split right.)  To implement this,
 309 we provide a "vacuum cycle ID" mechanism that makes it possible to
 310 determine whether a page has been split since the current btbulkdelete
 311 cycle started.  If btbulkdelete finds a page that has been split since
 312 it started, and has a right-link pointing to a lower page number, then
 313 it temporarily suspends its sequential scan and visits that page instead.
 314 It must continue to follow right-links and vacuum dead tuples until
 315 reaching a page that either hasn't been split since btbulkdelete started,
 316 or is above the location of the outer sequential scan.  Then it can resume
 317 the sequential scan.  This ensures that all tuples are visited.  It may be
 318 that some tuples are visited twice, but that has no worse effect than an
 319 inaccurate index tuple count (and we can't guarantee an accurate count
 320 anyway in the face of concurrent activity).  Note that this still works
 321 if the has-been-recently-split test has a small probability of false
 322 positives, so long as it never gives a false negative.  This makes it
 323 possible to implement the test with a small counter value stored on each
 324 index page.
 325
 326 On-the-Fly Deletion Of Index Tuples
 327 -----------------------------------
 328
 329 If a process visits a heap tuple and finds that it's dead and removable
 330 (ie, dead to all open transactions, not only that process), then we can
 331 return to the index and mark the corresponding index entry "known dead",
 332 allowing subsequent index scans to skip visiting the heap tuple.  The
 333 "known dead" marking works by setting the index item's lp_flags state
 334 to LP_DEAD.  This is currently only done in plain indexscans, not bitmap
 335 scans, because only plain scans visit the heap and index "in sync" and so
 336 there's not a convenient way to do it for bitmap scans.
 337
 338 Once an index tuple has been marked LP_DEAD it can actually be removed
 339 from the index immediately; since index scans only stop "between" pages,
 340 no scan can lose its place from such a deletion.  We separate the steps
 341 because we allow LP_DEAD to be set with only a share lock (it's exactly
 342 like a hint bit for a heap tuple), but physically removing tuples requires
 343 exclusive lock.  In the current code we try to remove LP_DEAD tuples when
 344 we are otherwise faced with having to split a page to do an insertion (and
 345 hence have exclusive lock on it already).
 346
 347 This leaves the index in a state where it has no entry for a dead tuple
 348 that still exists in the heap.  This is not a problem for the current
 349 implementation of VACUUM, but it could be a problem for anything that
 350 explicitly tries to find index entries for dead tuples.  (However, the
 351 same situation is created by REINDEX, since it doesn't enter dead
 352 tuples into the index.)
 353
 354 It's sufficient to have an exclusive lock on the index page, not a
 355 super-exclusive lock, to do deletion of LP_DEAD items.  It might seem
 356 that this breaks the interlock between VACUUM and indexscans, but that is
 357 not so: as long as an indexscanning process has a pin on the page where
 358 the index item used to be, VACUUM cannot complete its btbulkdelete scan
 359 and so cannot remove the heap tuple.  This is another reason why
 360 btbulkdelete has to get super-exclusive lock on every leaf page, not only
 361 the ones where it actually sees items to delete.
 362
 363 WAL Considerations
 364 ------------------
 365
 366 The insertion and deletion algorithms in themselves don't guarantee btree
 367 consistency after a crash.  To provide robustness, we depend on WAL
 368 replay.  A single WAL entry is effectively an atomic action --- we can
 369 redo it from the log if it fails to complete.
 370
 371 Ordinary item insertions (that don't force a page split) are of course
 372 single WAL entries, since they only affect one page.  The same for
 373 leaf-item deletions (if the deletion brings the leaf page to zero items,
 374 it is now a candidate to be deleted, but that is a separate action).
 375
 376 An insertion that causes a page split is logged as a single WAL entry for
 377 the changes occuring on the insertion's level --- including update of the
 378 right sibling's left-link --- followed by a second WAL entry for the
 379 insertion on the parent level (which might itself be a page split, requiring
 380 an additional insertion above that, etc).
 381
 382 For a root split, the followon WAL entry is a "new root" entry rather than
 383 an "insertion" entry, but details are otherwise much the same.
 384
 385 Because insertion involves multiple atomic actions, the WAL replay logic
 386 has to detect the case where a page split isn't followed by a matching
 387 insertion on the parent level, and then do that insertion on its own (and
 388 recursively for any subsequent parent insertion, of course).  This is
 389 feasible because the WAL entry for the split contains enough info to know
 390 what must be inserted in the parent level.
 391
 392 When splitting a non-root page that is alone on its level, the required
 393 metapage update (of the "fast root" link) is performed and logged as part
 394 of the insertion into the parent level.  When splitting the root page, the
 395 metapage update is handled as part of the "new root" action.
 396
 397 A page deletion is logged as a single WAL entry covering all four
 398 required page updates (target page, left and right siblings, and parent)
 399 as an atomic event.  (Any required fast-root link update is also part
 400 of the WAL entry.)  If the parent page becomes half-dead but is not
 401 immediately deleted due to a subsequent crash, there is no loss of
 402 consistency, and the empty page will be picked up by the next VACUUM.
 403
 404 Scans during Recovery
 405 ---------------------
 406
 407 The btree index type can be safely used during recovery. During recovery
 408 we have at most one writer and potentially many readers. In that
 409 situation the locking requirements can be relaxed and we do not need
 410 double locking during block splits. Each WAL record makes changes to a
 411 single level of the btree using the correct locking sequence and so
 412 is safe for concurrent readers. Some readers may observe a block split
 413 in progress as they descend the tree, but they will simply move right
 414 onto the correct page.
 415
 416 During recovery all index scans start with ignore_killed_tuples = false
 417 and we never set kill_prior_tuple. We do this because the oldest xmin
 418 on the standby server can be older than the oldest xmin on the master
 419 server, which means tuples can be marked as killed even when they are
 420 still visible on the standby. We don't WAL log tuple killed bits, but
 421 they can still appear in the standby because of full page writes. So
 422 we must always ignore them in standby, and that means it's not worth
 423 setting them either.
 424
 425 Note that we talk about scans that are started during recovery. We go to
 426 a little trouble to allow a scan to start during recovery and end during
 427 normal running after recovery has completed. This is a key capability
 428 because it allows running applications to continue while the standby
 429 changes state into a normally running server.
 430
 431 Other Things That Are Handy to Know
 432 -----------------------------------
 433
 434 Page zero of every btree is a meta-data page.  This page stores the
 435 location of the root page --- both the true root and the current effective
 436 root ("fast" root).  To avoid fetching the metapage for every single index
 437 search, we cache a copy of the meta-data information in the index's
 438 relcache entry (rd_amcache).  This is a bit ticklish since using the cache
 439 implies following a root page pointer that could be stale.  We require
 440 every metapage update to send out a SI "relcache inval" message on the
 441 index relation.  That ensures that each backend will flush its cached copy
 442 not later than the start of its next transaction.  Therefore, stale
 443 pointers cannot be used for longer than the current transaction, which
 444 reduces the problem to the same one already dealt with for concurrent
 445 VACUUM --- we can just imagine that each open transaction is potentially
 446 "already in flight" to the old root.
 447
 448 The algorithm assumes we can fit at least three items per page
 449 (a "high key" and two real data items).  Therefore it's unsafe
 450 to accept items larger than 1/3rd page size.  Larger items would
 451 work sometimes, but could cause failures later on depending on
 452 what else gets put on their page.
 453
 454 "ScanKey" data structures are used in two fundamentally different ways
 455 in this code, which we describe as "search" scankeys and "insertion"
 456 scankeys.  A search scankey is the kind passed to btbeginscan() or
 457 btrescan() from outside the btree code.  The sk_func pointers in a search
 458 scankey point to comparison functions that return boolean, such as int4lt.
 459 There might be more than one scankey entry for a given index column, or
 460 none at all.  (We require the keys to appear in index column order, but
 461 the order of multiple keys for a given column is unspecified.)  An
 462 insertion scankey uses the same array-of-ScanKey data structure, but the
 463 sk_func pointers point to btree comparison support functions (ie, 3-way
 464 comparators that return int4 values interpreted as <0, =0, >0).  In an
 465 insertion scankey there is exactly one entry per index column.  Insertion
 466 scankeys are built within the btree code (eg, by _bt_mkscankey()) and are
 467 used to locate the starting point of a scan, as well as for locating the
 468 place to insert a new index tuple.  (Note: in the case of an insertion
 469 scankey built from a search scankey, there might be fewer keys than
 470 index columns, indicating that we have no constraints for the remaining
 471 index columns.)  After we have located the starting point of a scan, the
 472 original search scankey is consulted as each index entry is sequentially
 473 scanned to decide whether to return the entry and whether the scan can
 474 stop (see _bt_checkkeys()).
 475
 476 Notes About Data Representation
 477 -------------------------------
 478
 479 The right-sibling link required by L&Y is kept in the page "opaque
 480 data" area, as is the left-sibling link, the page level, and some flags.
 481 The page level counts upwards from zero at the leaf level, to the tree
 482 depth minus 1 at the root.  (Counting up from the leaves ensures that we
 483 don't need to renumber any existing pages when splitting the root.)
 484
 485 The Postgres disk block data format (an array of items) doesn't fit
 486 Lehman and Yao's alternating-keys-and-pointers notion of a disk page,
 487 so we have to play some games.
 488
 489 On a page that is not rightmost in its tree level, the "high key" is
 490 kept in the page's first item, and real data items start at item 2.
 491 The link portion of the "high key" item goes unused.  A page that is
 492 rightmost has no "high key", so data items start with the first item.
 493 Putting the high key at the left, rather than the right, may seem odd,
 494 but it avoids moving the high key as we add data items.
 495
 496 On a leaf page, the data items are simply links to (TIDs of) tuples
 497 in the relation being indexed, with the associated key values.
 498
 499 On a non-leaf page, the data items are down-links to child pages with
 500 bounding keys.  The key in each data item is the *lower* bound for
 501 keys on that child page, so logically the key is to the left of that
 502 downlink.  The high key (if present) is the upper bound for the last
 503 downlink.  The first data item on each such page has no lower bound
 504 --- or lower bound of minus infinity, if you prefer.  The comparison
 505 routines must treat it accordingly.  The actual key stored in the
 506 item is irrelevant, and need not be stored at all.  This arrangement
 507 corresponds to the fact that an L&Y non-leaf page has one more pointer
 508 than key.
 509
 510 Notes to Operator Class Implementors
 511 ------------------------------------
 512
 513 With this implementation, we require each supported combination of
 514 datatypes to supply us with a comparison procedure via pg_amproc.
 515 This procedure must take two nonnull values A and B and return an int32 < 0,
 516 0, or > 0 if A < B, A = B, or A > B, respectively.  The procedure must
 517 not return INT_MIN for "A < B", since the value may be negated before
 518 being tested for sign.  A null result is disallowed, too.  See nbtcompare.c
 519 for examples.
 520
 521 There are some basic assumptions that a btree operator family must satisfy:
 522
 523 An = operator must be an equivalence relation; that is, for all non-null
 524 values A,B,C of the datatype:
 525
 526         A = A is true                                           reflexive law
 527         if A = B, then B = A                                    symmetric law
 528         if A = B and B = C, then A = C                          transitive law
 529
 530 A < operator must be a strong ordering relation; that is, for all non-null
 531 values A,B,C:
 532
 533         A < A is false                                          irreflexive law
 534         if A < B and B < C, then A < C                          transitive law
 535
 536 Furthermore, the ordering is total; that is, for all non-null values A,B:
 537
 538         exactly one of A < B, A = B, and B < A is true          trichotomy law
 539
 540 (The trichotomy law justifies the definition of the comparison support
 541 procedure, of course.)
 542
 543 The other three operators are defined in terms of these two in the obvious way,
 544 and must act consistently with them.
 545
 546 For an operator family supporting multiple datatypes, the above laws must hold
 547 when A,B,C are taken from any datatypes in the family.  The transitive laws
 548 are the trickiest to ensure, as in cross-type situations they represent
 549 statements that the behaviors of two or three different operators are
 550 consistent.  As an example, it would not work to put float8 and numeric into
 551 an opfamily, at least not with the current semantics that numerics are
 552 converted to float8 for comparison to a float8.  Because of the limited
 553 accuracy of float8, this means there are distinct numeric values that will
 554 compare equal to the same float8 value, and thus the transitive law fails.
 555
 556 It should be fairly clear why a btree index requires these laws to hold within
 557 a single datatype: without them there is no ordering to arrange the keys with.
 558 Also, index searches using a key of a different datatype require comparisons
 559 to behave sanely across two datatypes.  The extensions to three or more
 560 datatypes within a family are not strictly required by the btree index
 561 mechanism itself, but the planner relies on them for optimization purposes.