granicus.if.org Git - postgresql/blob - src/backend/access/nbtree/README

   1 src/backend/access/nbtree/README
   2
   3 Btree Indexing
   4 ==============
   5
   6 This directory contains a correct implementation of Lehman and Yao's
   7 high-concurrency B-tree management algorithm (P. Lehman and S. Yao,
   8 Efficient Locking for Concurrent Operations on B-Trees, ACM Transactions
   9 on Database Systems, Vol 6, No. 4, December 1981, pp 650-670).  We also
  10 use a simplified version of the deletion logic described in Lanin and
  11 Shasha (V. Lanin and D. Shasha, A Symmetric Concurrent B-Tree Algorithm,
  12 Proceedings of 1986 Fall Joint Computer Conference, pp 380-389).
  13
  14 The basic Lehman & Yao Algorithm
  15 --------------------------------
  16
  17 Compared to a classic B-tree, L&Y adds a right-link pointer to each page,
  18 to the page's right sibling.  It also adds a "high key" to each page, which
  19 is an upper bound on the keys that are allowed on that page.  These two
  20 additions make it possible detect a concurrent page split, which allows the
  21 tree to be searched without holding any read locks (except to keep a single
  22 page from being modified while reading it).
  23
  24 When a search follows a downlink to a child page, it compares the page's
  25 high key with the search key.  If the search key is greater than the high
  26 key, the page must've been split concurrently, and you must follow the
  27 right-link to find the new page containing the key range you're looking
  28 for.  This might need to be repeated, if the page has been split more than
  29 once.
  30
  31 Differences to the Lehman & Yao algorithm
  32 -----------------------------------------
  33
  34 We have made the following changes in order to incorporate the L&Y algorithm
  35 into Postgres:
  36
  37 The requirement that all btree keys be unique is too onerous,
  38 but the algorithm won't work correctly without it.  Fortunately, it is
  39 only necessary that keys be unique on a single tree level, because L&Y
  40 only use the assumption of key uniqueness when re-finding a key in a
  41 parent page (to determine where to insert the key for a split page).
  42 Therefore, we can use the link field to disambiguate multiple
  43 occurrences of the same user key: only one entry in the parent level
  44 will be pointing at the page we had split.  (Indeed we need not look at
  45 the real "key" at all, just at the link field.)  We can distinguish
  46 items at the leaf level in the same way, by examining their links to
  47 heap tuples; we'd never have two items for the same heap tuple.
  48
  49 Lehman and Yao assume that the key range for a subtree S is described
  50 by Ki < v <= Ki+1 where Ki and Ki+1 are the adjacent keys in the parent
  51 page.  This does not work for nonunique keys (for example, if we have
  52 enough equal keys to spread across several leaf pages, there *must* be
  53 some equal bounding keys in the first level up).  Therefore we assume
  54 Ki <= v <= Ki+1 instead.  A search that finds exact equality to a
  55 bounding key in an upper tree level must descend to the left of that
  56 key to ensure it finds any equal keys in the preceding page.  An
  57 insertion that sees the high key of its target page is equal to the key
  58 to be inserted has a choice whether or not to move right, since the new
  59 key could go on either page.  (Currently, we try to find a page where
  60 there is room for the new key without a split.)
  61
  62 Lehman and Yao don't require read locks, but assume that in-memory
  63 copies of tree pages are unshared.  Postgres shares in-memory buffers
  64 among backends.  As a result, we do page-level read locking on btree
  65 pages in order to guarantee that no record is modified while we are
  66 examining it.  This reduces concurrency but guarantees correct
  67 behavior.  An advantage is that when trading in a read lock for a
  68 write lock, we need not re-read the page after getting the write lock.
  69 Since we're also holding a pin on the shared buffer containing the
  70 page, we know that buffer still contains the page and is up-to-date.
  71
  72 We support the notion of an ordered "scan" of an index as well as
  73 insertions, deletions, and simple lookups.  A scan in the forward
  74 direction is no problem, we just use the right-sibling pointers that
  75 L&Y require anyway.  (Thus, once we have descended the tree to the
  76 correct start point for the scan, the scan looks only at leaf pages
  77 and never at higher tree levels.)  To support scans in the backward
  78 direction, we also store a "left sibling" link much like the "right
  79 sibling".  (This adds an extra step to the L&Y split algorithm: while
  80 holding the write lock on the page being split, we also lock its former
  81 right sibling to update that page's left-link.  This is safe since no
  82 writer of that page can be interested in acquiring a write lock on our
  83 page.)  A backwards scan has one additional bit of complexity: after
  84 following the left-link we must account for the possibility that the
  85 left sibling page got split before we could read it.  So, we have to
  86 move right until we find a page whose right-link matches the page we
  87 came from.  (Actually, it's even harder than that; see deletion discussion
  88 below.)
  89
  90 Page read locks are held only for as long as a scan is examining a page.
  91 To minimize lock/unlock traffic, an index scan always searches a leaf page
  92 to identify all the matching items at once, copying their heap tuple IDs
  93 into backend-local storage.  The heap tuple IDs are then processed while
  94 not holding any page lock within the index.  We do continue to hold a pin
  95 on the leaf page in some circumstances, to protect against concurrent
  96 deletions (see below).  In this state the scan is effectively stopped
  97 "between" pages, either before or after the page it has pinned.  This is
  98 safe in the presence of concurrent insertions and even page splits, because
  99 items are never moved across pre-existing page boundaries --- so the scan
 100 cannot miss any items it should have seen, nor accidentally return the same
 101 item twice.  The scan must remember the page's right-link at the time it
 102 was scanned, since that is the page to move right to; if we move right to
 103 the current right-link then we'd re-scan any items moved by a page split.
 104 We don't similarly remember the left-link, since it's best to use the most
 105 up-to-date left-link when trying to move left (see detailed move-left
 106 algorithm below).
 107
 108 In most cases we release our lock and pin on a page before attempting
 109 to acquire pin and lock on the page we are moving to.  In a few places
 110 it is necessary to lock the next page before releasing the current one.
 111 This is safe when moving right or up, but not when moving left or down
 112 (else we'd create the possibility of deadlocks).
 113
 114 Lehman and Yao fail to discuss what must happen when the root page
 115 becomes full and must be split.  Our implementation is to split the
 116 root in the same way that any other page would be split, then construct
 117 a new root page holding pointers to both of the resulting pages (which
 118 now become siblings on the next level of the tree).  The new root page
 119 is then installed by altering the root pointer in the meta-data page (see
 120 below).  This works because the root is not treated specially in any
 121 other way --- in particular, searches will move right using its link
 122 pointer if the link is set.  Therefore, searches will find the data
 123 that's been moved into the right sibling even if they read the meta-data
 124 page before it got updated.  This is the same reasoning that makes a
 125 split of a non-root page safe.  The locking considerations are similar too.
 126
 127 When an inserter recurses up the tree, splitting internal pages to insert
 128 links to pages inserted on the level below, it is possible that it will
 129 need to access a page above the level that was the root when it began its
 130 descent (or more accurately, the level that was the root when it read the
 131 meta-data page).  In this case the stack it made while descending does not
 132 help for finding the correct page.  When this happens, we find the correct
 133 place by re-descending the tree until we reach the level one above the
 134 level we need to insert a link to, and then moving right as necessary.
 135 (Typically this will take only two fetches, the meta-data page and the new
 136 root, but in principle there could have been more than one root split
 137 since we saw the root.  We can identify the correct tree level by means of
 138 the level numbers stored in each page.  The situation is rare enough that
 139 we do not need a more efficient solution.)
 140
 141 Lehman and Yao assume fixed-size keys, but we must deal with
 142 variable-size keys.  Therefore there is not a fixed maximum number of
 143 keys per page; we just stuff in as many as will fit.  When we split a
 144 page, we try to equalize the number of bytes, not items, assigned to
 145 each of the resulting pages.  Note we must include the incoming item in
 146 this calculation, otherwise it is possible to find that the incoming
 147 item doesn't fit on the split page where it needs to go!
 148
 149 The Deletion Algorithm
 150 ----------------------
 151
 152 Before deleting a leaf item, we get a super-exclusive lock on the target
 153 page, so that no other backend has a pin on the page when the deletion
 154 starts.  This is not necessary for correctness in terms of the btree index
 155 operations themselves; as explained above, index scans logically stop
 156 "between" pages and so can't lose their place.  The reason we do it is to
 157 provide an interlock between non-full VACUUM and indexscans.  Since VACUUM
 158 deletes index entries before reclaiming heap tuple line pointers, the
 159 super-exclusive lock guarantees that VACUUM can't reclaim for re-use a
 160 line pointer that an indexscanning process might be about to visit.  This
 161 guarantee works only for simple indexscans that visit the heap in sync
 162 with the index scan, not for bitmap scans.  We only need the guarantee
 163 when using non-MVCC snapshot rules; when using an MVCC snapshot, it
 164 doesn't matter if the heap tuple is replaced with an unrelated tuple at
 165 the same TID, because the new tuple won't be visible to our scan anyway.
 166 Therefore, a scan using an MVCC snapshot which has no other confounding
 167 factors will not hold the pin after the page contents are read.  The
 168 current reasons for exceptions, where a pin is still needed, are if the
 169 index is not WAL-logged or if the scan is an index-only scan.  If later
 170 work allows the pin to be dropped for all cases we will be able to
 171 simplify the vacuum code, since the concept of a super-exclusive lock
 172 for btree indexes will no longer be needed.
 173
 174 Because a pin is not always held, and a page can be split even while
 175 someone does hold a pin on it, it is possible that an indexscan will
 176 return items that are no longer stored on the page it has a pin on, but
 177 rather somewhere to the right of that page.  To ensure that VACUUM can't
 178 prematurely remove such heap tuples, we require btbulkdelete to obtain a
 179 super-exclusive lock on every leaf page in the index, even pages that
 180 don't contain any deletable tuples.  Any scan which could yield incorrect
 181 results if the tuple at a TID matching the scan's range and filter
 182 conditions were replaced by a different tuple while the scan is in
 183 progress must hold the pin on each index page until all index entries read
 184 from the page have been processed.  This guarantees that the btbulkdelete
 185 call cannot return while any indexscan is still holding a copy of a
 186 deleted index tuple if the scan could be confused by that.  Note that this
 187 requirement does not say that btbulkdelete must visit the pages in any
 188 particular order.  (See also on-the-fly deletion, below.)
 189
 190 There is no such interlocking for deletion of items in internal pages,
 191 since backends keep no lock nor pin on a page they have descended past.
 192 Hence, when a backend is ascending the tree using its stack, it must
 193 be prepared for the possibility that the item it wants is to the left of
 194 the recorded position (but it can't have moved left out of the recorded
 195 page).  Since we hold a lock on the lower page (per L&Y) until we have
 196 re-found the parent item that links to it, we can be assured that the
 197 parent item does still exist and can't have been deleted.  Also, because
 198 we are matching downlink page numbers and not data keys, we don't have any
 199 problem with possibly misidentifying the parent item.
 200
 201 Page Deletion
 202 -------------
 203
 204 We consider deleting an entire page from the btree only when it's become
 205 completely empty of items.  (Merging partly-full pages would allow better
 206 space reuse, but it seems impractical to move existing data items left or
 207 right to make this happen --- a scan moving in the opposite direction
 208 might miss the items if so.)  Also, we *never* delete the rightmost page
 209 on a tree level (this restriction simplifies the traversal algorithms, as
 210 explained below).  Page deletion always begins from an empty leaf page.  An
 211 internal page can only be deleted as part of a branch leading to a leaf
 212 page, where each internal page has only one child and that child is also to
 213 be deleted.
 214
 215 Deleting a leaf page is a two-stage process.  In the first stage, the page
 216 is unlinked from its parent, and marked as half-dead.  The parent page must
 217 be found using the same type of search as used to find the parent during an
 218 insertion split.  We lock the target and the parent pages, change the
 219 target's downlink to point to the right sibling, and remove its old
 220 downlink.  This causes the target page's key space to effectively belong to
 221 its right sibling.  (Neither the left nor right sibling pages need to
 222 change their "high key" if any; so there is no problem with possibly not
 223 having enough space to replace a high key.)  At the same time, we mark the
 224 target page as half-dead, which causes any subsequent searches to ignore it
 225 and move right (or left, in a backwards scan).  This leaves the tree in a
 226 similar state as during a page split: the page has no downlink pointing to
 227 it, but it's still linked to its siblings.
 228
 229 (Note: Lanin and Shasha prefer to make the key space move left, but their
 230 argument for doing so hinges on not having left-links, which we have
 231 anyway.  So we simplify the algorithm by moving key space right.)
 232
 233 To preserve consistency on the parent level, we cannot merge the key space
 234 of a page into its right sibling unless the right sibling is a child of
 235 the same parent --- otherwise, the parent's key space assignment changes
 236 too, meaning we'd have to make bounding-key updates in its parent, and
 237 perhaps all the way up the tree.  Since we can't possibly do that
 238 atomically, we forbid this case.  That means that the rightmost child of a
 239 parent node can't be deleted unless it's the only remaining child, in which
 240 case we will delete the parent too (see below).
 241
 242 In the second-stage, the half-dead leaf page is unlinked from its siblings.
 243 We first lock the left sibling (if any) of the target, the target page
 244 itself, and its right sibling (there must be one) in that order.  Then we
 245 update the side-links in the siblings, and mark the target page deleted.
 246
 247 When we're about to delete the last remaining child of a parent page, things
 248 are slightly more complicated.  In the first stage, we leave the immediate
 249 parent of the leaf page alone, and remove the downlink to the parent page
 250 instead, from the grandparent.  If it's the last child of the grandparent
 251 too, we recurse up until we find a parent with more than one child, and
 252 remove the downlink of that page.  The leaf page is marked as half-dead, and
 253 the block number of the page whose downlink was removed is stashed in the
 254 half-dead leaf page.  This leaves us with a chain of internal pages, with
 255 one downlink each, leading to the half-dead leaf page, and no downlink
 256 pointing to the topmost page in the chain.
 257
 258 While we recurse up to find the topmost parent in the chain, we keep the
 259 leaf page locked, but don't need to hold locks on the intermediate pages
 260 between the leaf and the topmost parent -- insertions into upper tree levels
 261 happen only as a result of splits of child pages, and that can't happen as
 262 long as we're keeping the leaf locked.  The internal pages in the chain
 263 cannot acquire new children afterwards either, because the leaf page is
 264 marked as half-dead and won't be split.
 265
 266 Removing the downlink to the top of the to-be-deleted chain effectively
 267 transfers the key space to the right sibling for all the intermediate levels
 268 too, in one atomic operation.  A concurrent search might still visit the
 269 intermediate pages, but it will move right when it reaches the half-dead page
 270 at the leaf level.
 271
 272 In the second stage, the topmost page in the chain is unlinked from its
 273 siblings, and the half-dead leaf page is updated to point to the next page
 274 down in the chain.  This is repeated until there are no internal pages left
 275 in the chain.  Finally, the half-dead leaf page itself is unlinked from its
 276 siblings.
 277
 278 A deleted page cannot be reclaimed immediately, since there may be other
 279 processes waiting to reference it (ie, search processes that just left the
 280 parent, or scans moving right or left from one of the siblings).  These
 281 processes must observe that the page is marked dead and recover
 282 accordingly.  Searches and forward scans simply follow the right-link
 283 until they find a non-dead page --- this will be where the deleted page's
 284 key-space moved to.
 285
 286 Moving left in a backward scan is complicated because we must consider
 287 the possibility that the left sibling was just split (meaning we must find
 288 the rightmost page derived from the left sibling), plus the possibility
 289 that the page we were just on has now been deleted and hence isn't in the
 290 sibling chain at all anymore.  So the move-left algorithm becomes:
 291 0. Remember the page we are on as the "original page".
 292 1. Follow the original page's left-link (we're done if this is zero).
 293 2. If the current page is live and its right-link matches the "original
 294    page", we are done.
 295 3. Otherwise, move right one or more times looking for a live page whose
 296    right-link matches the "original page".  If found, we are done.  (In
 297    principle we could scan all the way to the right end of the index, but
 298    in practice it seems better to give up after a small number of tries.
 299    It's unlikely the original page's sibling split more than a few times
 300    while we were in flight to it; if we do not find a matching link in a
 301    few tries, then most likely the original page is deleted.)
 302 4. Return to the "original page".  If it is still live, return to step 1
 303    (we guessed wrong about it being deleted, and should restart with its
 304    current left-link).  If it is dead, move right until a non-dead page
 305    is found (there must be one, since rightmost pages are never deleted),
 306    mark that as the new "original page", and return to step 1.
 307 This algorithm is correct because the live page found by step 4 will have
 308 the same left keyspace boundary as the page we started from.  Therefore,
 309 when we ultimately exit, it must be on a page whose right keyspace
 310 boundary matches the left boundary of where we started --- which is what
 311 we need to be sure we don't miss or re-scan any items.
 312
 313 A deleted page can only be reclaimed once there is no scan or search that
 314 has a reference to it; until then, it must stay in place with its
 315 right-link undisturbed.  We implement this by waiting until all active
 316 snapshots and registered snapshots as of the deletion are gone; which is
 317 overly strong, but is simple to implement within Postgres.  When marked
 318 dead, a deleted page is labeled with the next-transaction counter value.
 319 VACUUM can reclaim the page for re-use when this transaction number is
 320 older than RecentGlobalXmin.  As collateral damage, this implementation
 321 also waits for running XIDs with no snapshots and for snapshots taken
 322 until the next transaction to allocate an XID commits.
 323
 324 Reclaiming a page doesn't actually change its state on disk --- we simply
 325 record it in the shared-memory free space map, from which it will be
 326 handed out the next time a new page is needed for a page split.  The
 327 deleted page's contents will be overwritten by the split operation.
 328 (Note: if we find a deleted page with an extremely old transaction
 329 number, it'd be worthwhile to re-mark it with FrozenTransactionId so that
 330 a later xid wraparound can't cause us to think the page is unreclaimable.
 331 But in more normal situations this would be a waste of a disk write.)
 332
 333 Because we never delete the rightmost page of any level (and in particular
 334 never delete the root), it's impossible for the height of the tree to
 335 decrease.  After massive deletions we might have a scenario in which the
 336 tree is "skinny", with several single-page levels below the root.
 337 Operations will still be correct in this case, but we'd waste cycles
 338 descending through the single-page levels.  To handle this we use an idea
 339 from Lanin and Shasha: we keep track of the "fast root" level, which is
 340 the lowest single-page level.  The meta-data page keeps a pointer to this
 341 level as well as the true root.  All ordinary operations initiate their
 342 searches at the fast root not the true root.  When we split a page that is
 343 alone on its level or delete the next-to-last page on a level (both cases
 344 are easily detected), we have to make sure that the fast root pointer is
 345 adjusted appropriately.  In the split case, we do this work as part of the
 346 atomic update for the insertion into the parent level; in the delete case
 347 as part of the atomic update for the delete (either way, the metapage has
 348 to be the last page locked in the update to avoid deadlock risks).  This
 349 avoids race conditions if two such operations are executing concurrently.
 350
 351 VACUUM needs to do a linear scan of an index to search for deleted pages
 352 that can be reclaimed because they are older than all open transactions.
 353 For efficiency's sake, we'd like to use the same linear scan to search for
 354 deletable tuples.  Before Postgres 8.2, btbulkdelete scanned the leaf pages
 355 in index order, but it is possible to visit them in physical order instead.
 356 The tricky part of this is to avoid missing any deletable tuples in the
 357 presence of concurrent page splits: a page split could easily move some
 358 tuples from a page not yet passed over by the sequential scan to a
 359 lower-numbered page already passed over.  (This wasn't a concern for the
 360 index-order scan, because splits always split right.)  To implement this,
 361 we provide a "vacuum cycle ID" mechanism that makes it possible to
 362 determine whether a page has been split since the current btbulkdelete
 363 cycle started.  If btbulkdelete finds a page that has been split since
 364 it started, and has a right-link pointing to a lower page number, then
 365 it temporarily suspends its sequential scan and visits that page instead.
 366 It must continue to follow right-links and vacuum dead tuples until
 367 reaching a page that either hasn't been split since btbulkdelete started,
 368 or is above the location of the outer sequential scan.  Then it can resume
 369 the sequential scan.  This ensures that all tuples are visited.  It may be
 370 that some tuples are visited twice, but that has no worse effect than an
 371 inaccurate index tuple count (and we can't guarantee an accurate count
 372 anyway in the face of concurrent activity).  Note that this still works
 373 if the has-been-recently-split test has a small probability of false
 374 positives, so long as it never gives a false negative.  This makes it
 375 possible to implement the test with a small counter value stored on each
 376 index page.
 377
 378 On-the-Fly Deletion Of Index Tuples
 379 -----------------------------------
 380
 381 If a process visits a heap tuple and finds that it's dead and removable
 382 (ie, dead to all open transactions, not only that process), then we can
 383 return to the index and mark the corresponding index entry "known dead",
 384 allowing subsequent index scans to skip visiting the heap tuple.  The
 385 "known dead" marking works by setting the index item's lp_flags state
 386 to LP_DEAD.  This is currently only done in plain indexscans, not bitmap
 387 scans, because only plain scans visit the heap and index "in sync" and so
 388 there's not a convenient way to do it for bitmap scans.
 389
 390 Once an index tuple has been marked LP_DEAD it can actually be removed
 391 from the index immediately; since index scans only stop "between" pages,
 392 no scan can lose its place from such a deletion.  We separate the steps
 393 because we allow LP_DEAD to be set with only a share lock (it's exactly
 394 like a hint bit for a heap tuple), but physically removing tuples requires
 395 exclusive lock.  In the current code we try to remove LP_DEAD tuples when
 396 we are otherwise faced with having to split a page to do an insertion (and
 397 hence have exclusive lock on it already).
 398
 399 This leaves the index in a state where it has no entry for a dead tuple
 400 that still exists in the heap.  This is not a problem for the current
 401 implementation of VACUUM, but it could be a problem for anything that
 402 explicitly tries to find index entries for dead tuples.  (However, the
 403 same situation is created by REINDEX, since it doesn't enter dead
 404 tuples into the index.)
 405
 406 It's sufficient to have an exclusive lock on the index page, not a
 407 super-exclusive lock, to do deletion of LP_DEAD items.  It might seem
 408 that this breaks the interlock between VACUUM and indexscans, but that is
 409 not so: as long as an indexscanning process has a pin on the page where
 410 the index item used to be, VACUUM cannot complete its btbulkdelete scan
 411 and so cannot remove the heap tuple.  This is another reason why
 412 btbulkdelete has to get a super-exclusive lock on every leaf page, not
 413 only the ones where it actually sees items to delete.  So that we can
 414 handle the cases where we attempt LP_DEAD flagging for a page after we
 415 have released its pin, we remember the LSN of the index page when we read
 416 the index tuples from it; we do not attempt to flag index tuples as dead
 417 if the we didn't hold the pin the entire time and the LSN has changed.
 418
 419 WAL Considerations
 420 ------------------
 421
 422 The insertion and deletion algorithms in themselves don't guarantee btree
 423 consistency after a crash.  To provide robustness, we depend on WAL
 424 replay.  A single WAL entry is effectively an atomic action --- we can
 425 redo it from the log if it fails to complete.
 426
 427 Ordinary item insertions (that don't force a page split) are of course
 428 single WAL entries, since they only affect one page.  The same for
 429 leaf-item deletions (if the deletion brings the leaf page to zero items,
 430 it is now a candidate to be deleted, but that is a separate action).
 431
 432 An insertion that causes a page split is logged as a single WAL entry for
 433 the changes occurring on the insertion's level --- including update of the
 434 right sibling's left-link --- followed by a second WAL entry for the
 435 insertion on the parent level (which might itself be a page split, requiring
 436 an additional insertion above that, etc).
 437
 438 For a root split, the followon WAL entry is a "new root" entry rather than
 439 an "insertion" entry, but details are otherwise much the same.
 440
 441 Because splitting involves multiple atomic actions, it's possible that the
 442 system crashes between splitting a page and inserting the downlink for the
 443 new half to the parent.  After recovery, the downlink for the new page will
 444 be missing.  The search algorithm works correctly, as the page will be found
 445 by following the right-link from its left sibling, although if a lot of
 446 downlinks in the tree are missing, performance will suffer.  A more serious
 447 consequence is that if the page without a downlink gets split again, the
 448 insertion algorithm will fail to find the location in the parent level to
 449 insert the downlink.
 450
 451 Our approach is to create any missing downlinks on-the-fly, when searching
 452 the tree for a new insertion.  It could be done during searches, too, but
 453 it seems best not to put any extra updates in what would otherwise be a
 454 read-only operation (updating is not possible in hot standby mode anyway).
 455 It would seem natural to add the missing downlinks in VACUUM, but since
 456 inserting a downlink might require splitting a page, it might fail if you
 457 run out of disk space.  That would be bad during VACUUM - the reason for
 458 running VACUUM in the first place might be that you run out of disk space,
 459 and now VACUUM won't finish because you're out of disk space.  In contrast,
 460 an insertion can require enlarging the physical file anyway.
 461
 462 To identify missing downlinks, when a page is split, the left page is
 463 flagged to indicate that the split is not yet complete (INCOMPLETE_SPLIT).
 464 When the downlink is inserted to the parent, the flag is cleared atomically
 465 with the insertion.  The child page is kept locked until the insertion in
 466 the parent is finished and the flag in the child cleared, but can be
 467 released immediately after that, before recursing up the tree if the parent
 468 also needs to be split.  This ensures that incompletely split pages should
 469 not be seen under normal circumstances; only if insertion to the parent
 470 has failed for some reason.
 471
 472 We flag the left page, even though it's the right page that's missing the
 473 downlink, beacuse it's more convenient to know already when following the
 474 right-link from the left page to the right page that it will need to have
 475 its downlink inserted to the parent.
 476
 477 When splitting a non-root page that is alone on its level, the required
 478 metapage update (of the "fast root" link) is performed and logged as part
 479 of the insertion into the parent level.  When splitting the root page, the
 480 metapage update is handled as part of the "new root" action.
 481
 482 Each step in page deletion is logged as a separate WAL entry: marking the
 483 leaf as half-dead and removing the downlink is one record, and unlinking a
 484 page is a second record.  If vacuum is interrupted for some reason, or the
 485 system crashes, the tree is consistent for searches and insertions.  The
 486 next VACUUM will find the half-dead leaf page and continue the deletion.
 487
 488 Before 9.4, we used to keep track of incomplete splits and page deletions
 489 during recovery and finish them immediately at end of recovery, instead of
 490 doing it lazily at the next  insertion or vacuum.  However, that made the
 491 recovery much more complicated, and only fixed the problem when crash
 492 recovery was performed.  An incomplete split can also occur if an otherwise
 493 recoverable error, like out-of-memory or out-of-disk-space, happens while
 494 inserting the downlink to the parent.
 495
 496 Scans during Recovery
 497 ---------------------
 498
 499 The btree index type can be safely used during recovery. During recovery
 500 we have at most one writer and potentially many readers. In that
 501 situation the locking requirements can be relaxed and we do not need
 502 double locking during block splits. Each WAL record makes changes to a
 503 single level of the btree using the correct locking sequence and so
 504 is safe for concurrent readers. Some readers may observe a block split
 505 in progress as they descend the tree, but they will simply move right
 506 onto the correct page.
 507
 508 During recovery all index scans start with ignore_killed_tuples = false
 509 and we never set kill_prior_tuple. We do this because the oldest xmin
 510 on the standby server can be older than the oldest xmin on the master
 511 server, which means tuples can be marked as killed even when they are
 512 still visible on the standby. We don't WAL log tuple killed bits, but
 513 they can still appear in the standby because of full page writes. So
 514 we must always ignore them in standby, and that means it's not worth
 515 setting them either.
 516
 517 Note that we talk about scans that are started during recovery. We go to
 518 a little trouble to allow a scan to start during recovery and end during
 519 normal running after recovery has completed. This is a key capability
 520 because it allows running applications to continue while the standby
 521 changes state into a normally running server.
 522
 523 Other Things That Are Handy to Know
 524 -----------------------------------
 525
 526 Page zero of every btree is a meta-data page.  This page stores the
 527 location of the root page --- both the true root and the current effective
 528 root ("fast" root).  To avoid fetching the metapage for every single index
 529 search, we cache a copy of the meta-data information in the index's
 530 relcache entry (rd_amcache).  This is a bit ticklish since using the cache
 531 implies following a root page pointer that could be stale.  However, a
 532 backend following a cached pointer can sufficiently verify whether it
 533 reached the intended page; either by checking the is-root flag when it
 534 is going to the true root, or by checking that the page has no siblings
 535 when going to the fast root.  At worst, this could result in descending
 536 some extra tree levels if we have a cached pointer to a fast root that is
 537 now above the real fast root.  Such cases shouldn't arise often enough to
 538 be worth optimizing; and in any case we can expect a relcache flush will
 539 discard the cached metapage before long, since a VACUUM that's moved the
 540 fast root pointer can be expected to issue a statistics update for the
 541 index.
 542
 543 The algorithm assumes we can fit at least three items per page
 544 (a "high key" and two real data items).  Therefore it's unsafe
 545 to accept items larger than 1/3rd page size.  Larger items would
 546 work sometimes, but could cause failures later on depending on
 547 what else gets put on their page.
 548
 549 "ScanKey" data structures are used in two fundamentally different ways
 550 in this code, which we describe as "search" scankeys and "insertion"
 551 scankeys.  A search scankey is the kind passed to btbeginscan() or
 552 btrescan() from outside the btree code.  The sk_func pointers in a search
 553 scankey point to comparison functions that return boolean, such as int4lt.
 554 There might be more than one scankey entry for a given index column, or
 555 none at all.  (We require the keys to appear in index column order, but
 556 the order of multiple keys for a given column is unspecified.)  An
 557 insertion scankey uses the same array-of-ScanKey data structure, but the
 558 sk_func pointers point to btree comparison support functions (ie, 3-way
 559 comparators that return int4 values interpreted as <0, =0, >0).  In an
 560 insertion scankey there is exactly one entry per index column.  Insertion
 561 scankeys are built within the btree code (eg, by _bt_mkscankey()) and are
 562 used to locate the starting point of a scan, as well as for locating the
 563 place to insert a new index tuple.  (Note: in the case of an insertion
 564 scankey built from a search scankey, there might be fewer keys than
 565 index columns, indicating that we have no constraints for the remaining
 566 index columns.)  After we have located the starting point of a scan, the
 567 original search scankey is consulted as each index entry is sequentially
 568 scanned to decide whether to return the entry and whether the scan can
 569 stop (see _bt_checkkeys()).
 570
 571 Notes About Data Representation
 572 -------------------------------
 573
 574 The right-sibling link required by L&Y is kept in the page "opaque
 575 data" area, as is the left-sibling link, the page level, and some flags.
 576 The page level counts upwards from zero at the leaf level, to the tree
 577 depth minus 1 at the root.  (Counting up from the leaves ensures that we
 578 don't need to renumber any existing pages when splitting the root.)
 579
 580 The Postgres disk block data format (an array of items) doesn't fit
 581 Lehman and Yao's alternating-keys-and-pointers notion of a disk page,
 582 so we have to play some games.
 583
 584 On a page that is not rightmost in its tree level, the "high key" is
 585 kept in the page's first item, and real data items start at item 2.
 586 The link portion of the "high key" item goes unused.  A page that is
 587 rightmost has no "high key", so data items start with the first item.
 588 Putting the high key at the left, rather than the right, may seem odd,
 589 but it avoids moving the high key as we add data items.
 590
 591 On a leaf page, the data items are simply links to (TIDs of) tuples
 592 in the relation being indexed, with the associated key values.
 593
 594 On a non-leaf page, the data items are down-links to child pages with
 595 bounding keys.  The key in each data item is the *lower* bound for
 596 keys on that child page, so logically the key is to the left of that
 597 downlink.  The high key (if present) is the upper bound for the last
 598 downlink.  The first data item on each such page has no lower bound
 599 --- or lower bound of minus infinity, if you prefer.  The comparison
 600 routines must treat it accordingly.  The actual key stored in the
 601 item is irrelevant, and need not be stored at all.  This arrangement
 602 corresponds to the fact that an L&Y non-leaf page has one more pointer
 603 than key.
 604
 605 Notes to Operator Class Implementors
 606 ------------------------------------
 607
 608 With this implementation, we require each supported combination of
 609 datatypes to supply us with a comparison procedure via pg_amproc.
 610 This procedure must take two nonnull values A and B and return an int32 < 0,
 611 0, or > 0 if A < B, A = B, or A > B, respectively.  The procedure must
 612 not return INT_MIN for "A < B", since the value may be negated before
 613 being tested for sign.  A null result is disallowed, too.  See nbtcompare.c
 614 for examples.
 615
 616 There are some basic assumptions that a btree operator family must satisfy:
 617
 618 An = operator must be an equivalence relation; that is, for all non-null
 619 values A,B,C of the datatype:
 620
 621         A = A is true                                           reflexive law
 622         if A = B, then B = A                                    symmetric law
 623         if A = B and B = C, then A = C                          transitive law
 624
 625 A < operator must be a strong ordering relation; that is, for all non-null
 626 values A,B,C:
 627
 628         A < A is false                                          irreflexive law
 629         if A < B and B < C, then A < C                          transitive law
 630
 631 Furthermore, the ordering is total; that is, for all non-null values A,B:
 632
 633         exactly one of A < B, A = B, and B < A is true          trichotomy law
 634
 635 (The trichotomy law justifies the definition of the comparison support
 636 procedure, of course.)
 637
 638 The other three operators are defined in terms of these two in the obvious way,
 639 and must act consistently with them.
 640
 641 For an operator family supporting multiple datatypes, the above laws must hold
 642 when A,B,C are taken from any datatypes in the family.  The transitive laws
 643 are the trickiest to ensure, as in cross-type situations they represent
 644 statements that the behaviors of two or three different operators are
 645 consistent.  As an example, it would not work to put float8 and numeric into
 646 an opfamily, at least not with the current semantics that numerics are
 647 converted to float8 for comparison to a float8.  Because of the limited
 648 accuracy of float8, this means there are distinct numeric values that will
 649 compare equal to the same float8 value, and thus the transitive law fails.
 650
 651 It should be fairly clear why a btree index requires these laws to hold within
 652 a single datatype: without them there is no ordering to arrange the keys with.
 653 Also, index searches using a key of a different datatype require comparisons
 654 to behave sanely across two datatypes.  The extensions to three or more
 655 datatypes within a family are not strictly required by the btree index
 656 mechanism itself, but the planner relies on them for optimization purposes.