granicus.if.org Git - postgresql/blob - src/backend/access/spgist/README

   1 src/backend/access/spgist/README
   2
   3 SP-GiST is an abbreviation of space-partitioned GiST.  It provides a
   4 generalized infrastructure for implementing space-partitioned data
   5 structures, such as quadtrees, k-d trees, and radix trees (tries).  When
   6 implemented in main memory, these structures are usually designed as a set of
   7 dynamically-allocated nodes linked by pointers.  This is not suitable for
   8 direct storing on disk, since the chains of pointers can be rather long and
   9 require too many disk accesses. In contrast, disk based data structures
  10 should have a high fanout to minimize I/O.  The challenge is to map tree
  11 nodes to disk pages in such a way that the search algorithm accesses only a
  12 few disk pages, even if it traverses many nodes.
  13
  14
  15 COMMON STRUCTURE DESCRIPTION
  16
  17 Logically, an SP-GiST tree is a set of tuples, each of which can be either
  18 an inner or leaf tuple.  Each inner tuple contains "nodes", which are
  19 (label,pointer) pairs, where the pointer (ItemPointerData) is a pointer to
  20 another inner tuple or to the head of a list of leaf tuples.  Inner tuples
  21 can have different numbers of nodes (children).  Branches can be of different
  22 depth (actually, there is no control or code to support balancing), which
  23 means that the tree is non-balanced.  However, leaf and inner tuples cannot
  24 be intermixed at the same level: a downlink from a node of an inner tuple
  25 leads either to one inner tuple, or to a list of leaf tuples.
  26
  27 The SP-GiST core requires that inner and leaf tuples fit on a single index
  28 page, and even more stringently that the list of leaf tuples reached from a
  29 single inner-tuple node all be stored on the same index page.  (Restricting
  30 such lists to not cross pages reduces seeks, and allows the list links to be
  31 stored as simple 2-byte OffsetNumbers.)  SP-GiST index opclasses should
  32 therefore ensure that not too many nodes can be needed in one inner tuple,
  33 and that inner-tuple prefixes and leaf-node datum values not be too large.
  34
  35 Inner and leaf tuples are stored separately: the former are stored only on
  36 "inner" pages, the latter only on "leaf" pages.  Also, there are special
  37 restrictions on the root page.  Early in an index's life, when there is only
  38 one page's worth of data, the root page contains an unorganized set of leaf
  39 tuples.  After the first page split has occurred, the root is required to
  40 contain exactly one inner tuple.
  41
  42 When the search traversal algorithm reaches an inner tuple, it chooses a set
  43 of nodes to continue tree traverse in depth.  If it reaches a leaf page it
  44 scans a list of leaf tuples to find the ones that match the query.
  45
  46 The insertion algorithm descends the tree similarly, except it must choose
  47 just one node to descend to from each inner tuple.  Insertion might also have
  48 to modify the inner tuple before it can descend: it could add a new node, or
  49 it could "split" the tuple to obtain a less-specific prefix that can match
  50 the value to be inserted.  If it's necessary to append a new leaf tuple to a
  51 list and there is no free space on page, then SP-GiST creates a new inner
  52 tuple and distributes leaf tuples into a set of lists on, perhaps, several
  53 pages.
  54
  55 Inner tuple consists of:
  56
  57   optional prefix value - all successors must be consistent with it.
  58     Example:
  59         radix tree   - prefix value is a common prefix string
  60         quad tree    - centroid
  61         k-d tree     - one coordinate
  62
  63   list of nodes, where node is a (label, pointer) pair.
  64     Example of a label: a single character for radix tree
  65
  66 Leaf tuple consists of:
  67
  68   a leaf value
  69     Example:
  70         radix tree - the rest of string (postfix)
  71         quad and k-d tree - the point itself
  72
  73   ItemPointer to the heap
  74
  75
  76 NULLS HANDLING
  77
  78 We assume that SPGiST-indexable operators are strict (can never succeed for
  79 null inputs).  It is still desirable to index nulls, so that whole-table
  80 indexscans are possible and so that "x IS NULL" can be implemented by an
  81 SPGiST indexscan.  However, we prefer that SPGiST index opclasses not have
  82 to cope with nulls.  Therefore, the main tree of an SPGiST index does not
  83 include any null entries.  We store null entries in a separate SPGiST tree
  84 occupying a disjoint set of pages (in particular, its own root page).
  85 Insertions and searches in the nulls tree do not use any of the
  86 opclass-supplied functions, but just use hardwired logic comparable to
  87 AllTheSame cases in the normal tree.
  88
  89
  90 INSERTION ALGORITHM
  91
  92 Insertion algorithm is designed to keep the tree in a consistent state at
  93 any moment.  Here is a simplified insertion algorithm specification
  94 (numbers refer to notes below):
  95
  96   Start with the first tuple on the root page (1)
  97
  98   loop:
  99     if (page is leaf) then
 100         if (enough space)
 101             insert on page and exit (5)
 102         else (7)
 103             call PickSplitFn() (2)
 104         end if
 105     else
 106         switch (chooseFn())
 107             case MatchNode  - descend through selected node
 108             case AddNode    - add node and then retry chooseFn (3, 6)
 109             case SplitTuple - split inner tuple to prefix and postfix, then
 110                               retry chooseFn with the prefix tuple (4, 6)
 111     end if
 112
 113 Notes:
 114
 115 (1) Initially, we just dump leaf tuples into the root page until it is full;
 116 then we split it.  Once the root is not a leaf page, it can have only one
 117 inner tuple, so as to keep the amount of free space on the root as large as
 118 possible.  Both of these rules are meant to postpone doing PickSplit on the
 119 root for as long as possible, so that the topmost partitioning of the search
 120 space is as good as we can easily make it.
 121
 122 (2) Current implementation allows to do picksplit and insert a new leaf tuple
 123 in one operation, if the new list of leaf tuples fits on one page. It's
 124 always possible for trees with small nodes like quad tree or k-d tree, but
 125 radix trees may require another picksplit.
 126
 127 (3) Addition of node must keep size of inner tuple small enough to fit on a
 128 page.  After addition, inner tuple could become too large to be stored on
 129 current page because of other tuples on page. In this case it will be moved
 130 to another inner page (see notes about page management). When moving tuple to
 131 another page, we can't change the numbers of other tuples on the page, else
 132 we'd make downlink pointers to them invalid. To prevent that, SP-GiST leaves
 133 a "placeholder" tuple, which can be reused later whenever another tuple is
 134 added to the page. See also Concurrency and Vacuum sections below. Right now
 135 only radix trees could add a node to the tuple; quad trees and k-d trees
 136 make all possible nodes at once in PickSplitFn() call.
 137
 138 (4) Prefix value could only partially match a new value, so the SplitTuple
 139 action allows breaking the current tree branch into upper and lower sections.
 140 Another way to say it is that we can split the current inner tuple into
 141 "prefix" and "postfix" parts, where the prefix part is able to match the
 142 incoming new value. Consider example of insertion into a radix tree. We use
 143 the following notation, where tuple's id is just for discussion (no such id
 144 is actually stored):
 145
 146 inner tuple: {tuple id}(prefix string)[ comma separated list of node labels ]
 147 leaf tuple: {tuple id}<value>
 148
 149 Suppose we need to insert string 'www.gogo.com' into inner tuple
 150
 151     {1}(www.google.com/)[a, i]
 152
 153 The string does not match the prefix so we cannot descend.  We must
 154 split the inner tuple into two tuples:
 155
 156     {2}(www.go)[o]  - prefix tuple
 157                 |
 158                 {3}(gle.com/)[a,i] - postfix tuple
 159
 160 On the next iteration of loop we find that 'www.gogo.com' matches the
 161 prefix, but not any node label, so we add a node [g] to tuple {2}:
 162
 163                    NIL (no child exists yet)
 164                    |
 165     {2}(www.go)[o, g]
 166                 |
 167                 {3}(gle.com/)[a,i]
 168
 169 Now we can descend through the [g] node, which will cause us to update
 170 the target string to just 'o.com'.  Finally, we'll insert a leaf tuple
 171 bearing that string:
 172
 173                   {4}<o.com>
 174                    |
 175     {2}(www.go)[o, g]
 176                 |
 177                 {3}(gle.com/)[a,i]
 178
 179 As we can see, the original tuple's node array moves to postfix tuple without
 180 any changes.  Note also that SP-GiST core assumes that prefix tuple is not
 181 larger than old inner tuple.  That allows us to store prefix tuple directly
 182 in place of old inner tuple.  SP-GiST core will try to store postfix tuple on
 183 the same page if possible, but will use another page if there is not enough
 184 free space (see notes 5 and 6).  Currently, quad and k-d trees don't use this
 185 feature, because they have no concept of a prefix being "inconsistent" with
 186 any new value.  They grow their depth only by PickSplitFn() call.
 187
 188 (5) If pointer from node of parent is a NIL pointer, algorithm chooses a leaf
 189 page to store on.  At first, it tries to use the last-used leaf page with the
 190 largest free space (which we track in each backend) to better utilize disk
 191 space.  If that's not large enough, then the algorithm allocates a new page.
 192
 193 (6) Management of inner pages is very similar to management of leaf pages,
 194 described in (5).
 195
 196 (7) Actually, current implementation can move the whole list of leaf tuples
 197 and a new tuple to another page, if the list is short enough. This improves
 198 space utilization, but doesn't change the basis of the algorithm.
 199
 200
 201 CONCURRENCY
 202
 203 While descending the tree, the insertion algorithm holds exclusive lock on
 204 two tree levels at a time, ie both parent and child pages (but parent and
 205 child pages can be the same, see notes above).  There is a possibility of
 206 deadlock between two insertions if there are cross-referenced pages in
 207 different branches.  That is, if inner tuple on page M has a child on page N
 208 while an inner tuple from another branch is on page N and has a child on
 209 page M, then two insertions descending the two branches could deadlock,
 210 since they will each hold their parent page's lock while trying to get the
 211 child page's lock.
 212
 213 Currently, we deal with this by conditionally locking buffers as we descend
 214 the tree.  If we fail to get lock on a buffer, we release both buffers and
 215 restart the insertion process.  This is potentially inefficient, but the
 216 locking costs of a more deterministic approach seem very high.
 217
 218 To reduce the number of cases where that happens, we introduce a concept of
 219 "triple parity" of pages: if inner tuple is on page with BlockNumber N, then
 220 its child tuples should be placed on the same page, or else on a page with
 221 BlockNumber M where (N+1) mod 3 == M mod 3.  This rule ensures that tuples
 222 on page M will have no children on page N, since (M+1) mod 3 != N mod 3.
 223 That makes it unlikely that two insertion processes will conflict against
 224 each other while descending the tree.  It's not perfect though: in the first
 225 place, we could still get a deadlock among three or more insertion processes,
 226 and in the second place, it's impractical to preserve this invariant in every
 227 case when we expand or split an inner tuple.  So we still have to allow for
 228 deadlocks.
 229
 230 Insertion may also need to take locks on an additional inner and/or leaf page
 231 to add tuples of the right type(s), when there's not enough room on the pages
 232 it descended through.  However, we don't care exactly which such page we add
 233 to, so deadlocks can be avoided by conditionally locking the additional
 234 buffers: if we fail to get lock on an additional page, just try another one.
 235
 236 Search traversal algorithm is rather traditional.  At each non-leaf level, it
 237 share-locks the page, identifies which node(s) in the current inner tuple
 238 need to be visited, and puts those addresses on a stack of pages to examine
 239 later.  It then releases lock on the current buffer before visiting the next
 240 stack item.  So only one page is locked at a time, and no deadlock is
 241 possible.  But instead, we have to worry about race conditions: by the time
 242 we arrive at a pointed-to page, a concurrent insertion could have replaced
 243 the target inner tuple (or leaf tuple chain) with data placed elsewhere.
 244 To handle that, whenever the insertion algorithm changes a nonempty downlink
 245 in an inner tuple, it places a "redirect tuple" in place of the lower-level
 246 inner tuple or leaf-tuple chain head that the link formerly led to.  Scans
 247 (though not insertions) must be prepared to honor such redirects.  Only a
 248 scan that had already visited the parent level could possibly reach such a
 249 redirect tuple, so we can remove redirects once all active transactions have
 250 been flushed out of the system.
 251
 252
 253 DEAD TUPLES
 254
 255 Tuples on leaf pages can be in one of four states:
 256
 257 SPGIST_LIVE: normal, live pointer to a heap tuple.
 258
 259 SPGIST_REDIRECT: placeholder that contains a link to another place in the
 260 index.  When a chain of leaf tuples has to be moved to another page, a
 261 redirect tuple is inserted in place of the chain's head tuple.  The parent
 262 inner tuple's downlink is updated when this happens, but concurrent scans
 263 might be "in flight" from the parent page to the child page (since they
 264 release lock on the parent page before attempting to lock the child).
 265 The redirect pointer serves to tell such a scan where to go.  A redirect
 266 pointer is only needed for as long as such concurrent scans could be in
 267 progress.  Eventually, it's converted into a PLACEHOLDER dead tuple by
 268 VACUUM, and is then a candidate for replacement.  Searches that find such
 269 a tuple (which should never be part of a chain) should immediately proceed
 270 to the other place, forgetting about the redirect tuple.  Insertions that
 271 reach such a tuple should raise error, since a valid downlink should never
 272 point to such a tuple.
 273
 274 SPGIST_DEAD: tuple is dead, but it cannot be removed or moved to a
 275 different offset on the page because there is a link leading to it from
 276 some inner tuple elsewhere in the index.  (Such a tuple is never part of a
 277 chain, since we don't need one unless there is nothing live left in its
 278 chain.)  Searches should ignore such entries.  If an insertion action
 279 arrives at such a tuple, it should either replace it in-place (if there's
 280 room on the page to hold the desired new leaf tuple) or replace it with a
 281 redirection pointer to wherever it puts the new leaf tuple.
 282
 283 SPGIST_PLACEHOLDER: tuple is dead, and there are known to be no links to
 284 it from elsewhere.  When a live tuple is deleted or moved away, and not
 285 replaced by a redirect pointer, it is replaced by a placeholder to keep
 286 the offsets of later tuples on the same page from changing.  Placeholders
 287 can be freely replaced when adding a new tuple to the page, and also
 288 VACUUM will delete any that are at the end of the range of valid tuple
 289 offsets.  Both searches and insertions should complain if a link from
 290 elsewhere leads them to a placeholder tuple.
 291
 292 When the root page is also a leaf, all its tuple should be in LIVE state;
 293 there's no need for the others since there are no links and no need to
 294 preserve offset numbers.
 295
 296 Tuples on inner pages can be in LIVE, REDIRECT, or PLACEHOLDER states.
 297 The REDIRECT state has the same function as on leaf pages, to send
 298 concurrent searches to the place where they need to go after an inner
 299 tuple is moved to another page.  Expired REDIRECT pointers are converted
 300 to PLACEHOLDER status by VACUUM, and are then candidates for replacement.
 301 DEAD state is not currently possible, since VACUUM does not attempt to
 302 remove unused inner tuples.
 303
 304
 305 VACUUM
 306
 307 VACUUM (or more precisely, spgbulkdelete) performs a single sequential scan
 308 over the entire index.  On both leaf and inner pages, we can convert old
 309 REDIRECT tuples into PLACEHOLDER status, and then remove any PLACEHOLDERs
 310 that are at the end of the page (since they aren't needed to preserve the
 311 offsets of any live tuples).  On leaf pages, we scan for tuples that need
 312 to be deleted because their heap TIDs match a vacuum target TID.
 313
 314 If we find a deletable tuple that is not at the head of its chain, we
 315 can simply replace it with a PLACEHOLDER, updating the chain links to
 316 remove it from the chain.  If it is at the head of its chain, but there's
 317 at least one live tuple remaining in the chain, we move that live tuple
 318 to the head tuple's offset, replacing it with a PLACEHOLDER to preserve
 319 the offsets of other tuples.  This keeps the parent inner tuple's downlink
 320 valid.  If we find ourselves deleting all live tuples in a chain, we
 321 replace the head tuple with a DEAD tuple and the rest with PLACEHOLDERS.
 322 The parent inner tuple's downlink thus points to the DEAD tuple, and the
 323 rules explained in the previous section keep everything working.
 324
 325 VACUUM doesn't know a-priori which tuples are heads of their chains, but
 326 it can easily figure that out by constructing a predecessor array that's
 327 the reverse map of the nextOffset links (ie, when we see tuple x links to
 328 tuple y, we set predecessor[y] = x).  Then head tuples are the ones with
 329 no predecessor.
 330
 331 Because insertions can occur while VACUUM runs, a pure sequential scan
 332 could miss deleting some target leaf tuples, because they could get moved
 333 from a not-yet-visited leaf page to an already-visited leaf page as a
 334 consequence of a PickSplit or MoveLeafs operation.  Failing to delete any
 335 target TID is not acceptable, so we have to extend the algorithm to cope
 336 with such cases.  We recognize that such a move might have occurred when
 337 we see a leaf-page REDIRECT tuple whose XID indicates it might have been
 338 created after the VACUUM scan started.  We add the redirection target TID
 339 to a "pending list" of places we need to recheck.  Between pages of the
 340 main sequential scan, we empty the pending list by visiting each listed
 341 TID.  If it points to an inner tuple (from a PickSplit), add each downlink
 342 TID to the pending list.  If it points to a leaf page, vacuum that page.
 343 (We could just vacuum the single pointed-to chain, but vacuuming the
 344 whole page simplifies the code and reduces the odds of VACUUM having to
 345 modify the same page multiple times.)  To ensure that pending-list
 346 processing can never get into an endless loop, even in the face of
 347 concurrent index changes, we don't remove list entries immediately but
 348 only after we've completed all pending-list processing; instead we just
 349 mark items as done after processing them.  Adding a TID that's already in
 350 the list is a no-op, whether or not that item is marked done yet.
 351
 352 spgbulkdelete also updates the index's free space map.
 353
 354 Currently, spgvacuumcleanup has nothing to do if spgbulkdelete was
 355 performed; otherwise, it does an spgbulkdelete scan with an empty target
 356 list, so as to clean up redirections and placeholders, update the free
 357 space map, and gather statistics.
 358
 359
 360 LAST USED PAGE MANAGEMENT
 361
 362 The list of last used pages contains four pages - a leaf page and three
 363 inner pages, one from each "triple parity" group.  (Actually, there's one
 364 such list for the main tree and a separate one for the nulls tree.)  This
 365 list is stored between calls on the index meta page, but updates are never
 366 WAL-logged to decrease WAL traffic.  Incorrect data on meta page isn't
 367 critical, because we could allocate a new page at any moment.
 368
 369
 370 AUTHORS
 371
 372     Teodor Sigaev <teodor@sigaev.ru>
 373     Oleg Bartunov <oleg@sai.msu.su>