granicus.if.org Git - postgresql/blob - src/backend/access/gin/README

   1 src/backend/access/gin/README
   2
   3 Gin for PostgreSQL
   4 ==================
   5
   6 Gin was sponsored by jfg://networks (http://www.jfg-networks.com/)
   7
   8 Gin stands for Generalized Inverted Index and should be considered as a genie,
   9 not a drink.
  10
  11 Generalized means that the index does not know which operation it accelerates.
  12 It instead works with custom strategies, defined for specific data types (read
  13 "Index Method Strategies" in the PostgreSQL documentation). In that sense, Gin
  14 is similar to GiST and differs from btree indices, which have predefined,
  15 comparison-based operations.
  16
  17 An inverted index is an index structure storing a set of (key, posting list)
  18 pairs, where 'posting list' is a set of heap rows in which the key occurs.
  19 (A text document would usually contain many keys.)  The primary goal of
  20 Gin indices is support for highly scalable, full-text search in PostgreSQL.
  21
  22 A Gin index consists of a B-tree index constructed over key values,
  23 where each key is an element of some indexed items (element of array, lexeme
  24 for tsvector) and where each tuple in a leaf page contains either a pointer to
  25 a B-tree over item pointers (posting tree), or a simple list of item pointers
  26 (posting list) if the list is small enough.
  27
  28 Note: There is no delete operation in the key (entry) tree. The reason for
  29 this is that in our experience, the set of distinct words in a large corpus
  30 changes very slowly.  This greatly simplifies the code and concurrency
  31 algorithms.
  32
  33 Core PostgreSQL includes built-in Gin support for one-dimensional arrays
  34 (eg. integer[], text[]).  The following operations are available:
  35
  36   * contains: value_array @> query_array
  37   * overlaps: value_array && query_array
  38   * is contained by: value_array <@ query_array
  39
  40 Synopsis
  41 --------
  42
  43 =# create index txt_idx on aa using gin(a);
  44
  45 Features
  46 --------
  47
  48   * Concurrency
  49   * Write-Ahead Logging (WAL).  (Recoverability from crashes.)
  50   * User-defined opclasses.  (The scheme is similar to GiST.)
  51   * Optimized index creation (Makes use of maintenance_work_mem to accumulate
  52     postings in memory.)
  53   * Text search support via an opclass
  54   * Soft upper limit on the returned results set using a GUC variable:
  55     gin_fuzzy_search_limit
  56
  57 Gin Fuzzy Limit
  58 ---------------
  59
  60 There are often situations when a full-text search returns a very large set of
  61 results.  Since reading tuples from the disk and sorting them could take a
  62 lot of time, this is unacceptable for production.  (Note that the search
  63 itself is very fast.)
  64
  65 Such queries usually contain very frequent lexemes, so the results are not
  66 very helpful. To facilitate execution of such queries Gin has a configurable
  67 soft upper limit on the size of the returned set, determined by the
  68 'gin_fuzzy_search_limit' GUC variable.  This is set to 0 by default (no
  69 limit).
  70
  71 If a non-zero search limit is set, then the returned set is a subset of the
  72 whole result set, chosen at random.
  73
  74 "Soft" means that the actual number of returned results could differ
  75 from the specified limit, depending on the query and the quality of the
  76 system's random number generator.
  77
  78 From experience, a value of 'gin_fuzzy_search_limit' in the thousands
  79 (eg. 5000-20000) works well.  This means that 'gin_fuzzy_search_limit' will
  80 have no effect for queries returning a result set with less tuples than this
  81 number.
  82
  83 Index structure
  84 ---------------
  85
  86 The "items" that a GIN index indexes are composite values that contain
  87 zero or more "keys".  For example, an item might be an integer array, and
  88 then the keys would be the individual integer values.  The index actually
  89 stores and searches for the key values, not the items per se.  In the
  90 pg_opclass entry for a GIN opclass, the opcintype is the data type of the
  91 items, and the opckeytype is the data type of the keys.  GIN is optimized
  92 for cases where items contain many keys and the same key values appear
  93 in many different items.
  94
  95 A GIN index contains a metapage, a btree of key entries, and possibly
  96 "posting tree" pages, which hold the overflow when a key entry acquires
  97 too many heap tuple pointers to fit in a btree page.  Additionally, if the
  98 fast-update feature is enabled, there can be "list pages" holding "pending"
  99 key entries that haven't yet been merged into the main btree.  The list
 100 pages have to be scanned linearly when doing a search, so the pending
 101 entries should be merged into the main btree before there get to be too
 102 many of them.  The advantage of the pending list is that bulk insertion of
 103 a few thousand entries can be much faster than retail insertion.  (The win
 104 comes mainly from not having to do multiple searches/insertions when the
 105 same key appears in multiple new heap tuples.)
 106
 107 Key entries are nominally of the same IndexTuple format as used in other
 108 index types, but since a leaf key entry typically refers to multiple heap
 109 tuples, there are significant differences.  (See GinFormTuple, which works
 110 by building a "normal" index tuple and then modifying it.)  The points to
 111 know are:
 112
 113 * In a single-column index, a key tuple just contains the key datum, but
 114 in a multi-column index, a key tuple contains the pair (column number,
 115 key datum) where the column number is stored as an int2.  This is needed
 116 to support different key data types in different columns.  This much of
 117 the tuple is built by index_form_tuple according to the usual rules.
 118 The column number (if present) can never be null, but the key datum can
 119 be, in which case a null bitmap is present as usual.  (As usual for index
 120 tuples, the size of the null bitmap is fixed at INDEX_MAX_KEYS.)
 121
 122 * If the key datum is null (ie, IndexTupleHasNulls() is true), then
 123 just after the nominal index data (ie, at offset IndexInfoFindDataOffset
 124 or IndexInfoFindDataOffset + sizeof(int2)) there is a byte indicating
 125 the "category" of the null entry.  These are the possible categories:
 126         1 = ordinary null key value extracted from an indexable item
 127         2 = placeholder for zero-key indexable item
 128         3 = placeholder for null indexable item
 129 Placeholder null entries are inserted into the index because otherwise
 130 there would be no index entry at all for an empty or null indexable item,
 131 which would mean that full index scans couldn't be done and various corner
 132 cases would give wrong answers.  The different categories of null entries
 133 are treated as distinct keys by the btree, but heap itempointers for the
 134 same category of null entry are merged into one index entry just as happens
 135 with ordinary key entries.
 136
 137 * In a key entry at the btree leaf level, at the next SHORTALIGN boundary,
 138 there is a list of item pointers, in compressed format (see Posting List
 139 Compression section), pointing to the heap tuples for which the indexable
 140 items contain this key. This is called the "posting list".
 141
 142 If the list would be too big for the index tuple to fit on an index page, the
 143 ItemPointers are pushed out to a separate posting page or pages, and none
 144 appear in the key entry itself.  The separate pages are called a "posting
 145 tree" (see below); Note that in either case, the ItemPointers associated with
 146 a key can easily be read out in sorted order; this is relied on by the scan
 147 algorithms.
 148
 149 * The index tuple header fields of a leaf key entry are abused as follows:
 150
 151 1) Posting list case:
 152
 153 * ItemPointerGetBlockNumber(&itup->t_tid) contains the offset from index
 154   tuple start to the posting list.
 155   Access macros: GinGetPostingOffset(itup) / GinSetPostingOffset(itup,n)
 156
 157 * ItemPointerGetOffsetNumber(&itup->t_tid) contains the number of elements
 158   in the posting list (number of heap itempointers).
 159   Access macros: GinGetNPosting(itup) / GinSetNPosting(itup,n)
 160
 161 * If IndexTupleHasNulls(itup) is true, the null category byte can be
 162   accessed/set with GinGetNullCategory(itup,gs) / GinSetNullCategory(itup,gs,c)
 163
 164 * The posting list can be accessed with GinGetPosting(itup)
 165
 166 * If GinITupIsCompressed(itup), the posting list is stored in compressed
 167   format. Otherwise it is just an array of ItemPointers. New tuples are always
 168   stored in compressed format, uncompressed items can be present if the
 169   database was migrated from 9.3 or earlier version.
 170
 171 2) Posting tree case:
 172
 173 * ItemPointerGetBlockNumber(&itup->t_tid) contains the index block number
 174   of the root of the posting tree.
 175   Access macros: GinGetPostingTree(itup) / GinSetPostingTree(itup, blkno)
 176
 177 * ItemPointerGetOffsetNumber(&itup->t_tid) contains the magic number
 178   GIN_TREE_POSTING, which distinguishes this from the posting-list case
 179   (it's large enough that that many heap itempointers couldn't possibly
 180   fit on an index page).  This value is inserted automatically by the
 181   GinSetPostingTree macro.
 182
 183 * If IndexTupleHasNulls(itup) is true, the null category byte can be
 184   accessed/set with GinGetNullCategory(itup,gs) / GinSetNullCategory(itup,gs,c)
 185
 186 * The posting list is not present and must not be accessed.
 187
 188 Use the macro GinIsPostingTree(itup) to determine which case applies.
 189
 190 In both cases, itup->t_info & INDEX_SIZE_MASK contains actual total size of
 191 tuple, and the INDEX_VAR_MASK and INDEX_NULL_MASK bits have their normal
 192 meanings as set by index_form_tuple.
 193
 194 Index tuples in non-leaf levels of the btree contain the optional column
 195 number, key datum, and null category byte as above.  They do not contain
 196 a posting list.  ItemPointerGetBlockNumber(&itup->t_tid) is the downlink
 197 to the next lower btree level, and ItemPointerGetOffsetNumber(&itup->t_tid)
 198 is InvalidOffsetNumber.  Use the access macros GinGetDownlink/GinSetDownlink
 199 to get/set the downlink.
 200
 201 Index entries that appear in "pending list" pages work a tad differently as
 202 well.  The optional column number, key datum, and null category byte are as
 203 for other GIN index entries.  However, there is always exactly one heap
 204 itempointer associated with a pending entry, and it is stored in the t_tid
 205 header field just as in non-GIN indexes.  There is no posting list.
 206 Furthermore, the code that searches the pending list assumes that all
 207 entries for a given heap tuple appear consecutively in the pending list and
 208 are sorted by the column-number-plus-key-datum.  The GIN_LIST_FULLROW page
 209 flag bit tells whether entries for a given heap tuple are spread across
 210 multiple pending-list pages.  If GIN_LIST_FULLROW is set, the page contains
 211 all the entries for one or more heap tuples.  If GIN_LIST_FULLROW is clear,
 212 the page contains entries for only one heap tuple, *and* they are not all
 213 the entries for that tuple.  (Thus, a heap tuple whose entries do not all
 214 fit on one pending-list page must have those pages to itself, even if this
 215 results in wasting much of the space on the preceding page and the last
 216 page for the tuple.)
 217
 218 Posting tree
 219 ------------
 220
 221 If a posting list is too large to store in-line in a key entry, a posting tree
 222 is created. A posting tree is a B-tree structure, where the ItemPointer is
 223 used as the key.
 224
 225 Internal posting tree pages use the standard PageHeader and the same "opaque"
 226 struct as other GIN page, but do not contain regular index tuples. Instead,
 227 the contents of the page is an array of PostingItem structs. Each PostingItem
 228 consists of the block number of the child page, and the right bound of that
 229 child page, as an ItemPointer. The right bound of the page is stored right
 230 after the page header, before the PostingItem array.
 231
 232 Posting tree leaf pages also use the standard PageHeader and opaque struct,
 233 and the right bound of the page is stored right after the page header, but
 234 the page content comprises of a number of compressed posting lists. The
 235 compressed posting lists are stored one after each other, between page header
 236 and pd_lower. The space between pd_lower and pd_upper is unused, which allows
 237 full-page images of posting tree leaf pages to skip the unused space in middle
 238 (buffer_std = true in XLogRecData).
 239
 240 The item pointers are stored in a number of independent compressed posting
 241 lists (also called segments), instead of one big one, to make random access
 242 to a given item pointer faster: to find an item in a compressed list, you
 243 have to read the list from the beginning, but when the items are split into
 244 multiple lists, you can first skip over to the list containing the item you're
 245 looking for, and read only that segment. Also, an update only needs to
 246 re-encode the affected segment.
 247
 248 Posting List Compression
 249 ------------------------
 250
 251 To fit as many item pointers on a page as possible, posting tree leaf pages
 252 and posting lists stored inline in entry tree leaf tuples use a lightweight
 253 form of compression. We take advantage of the fact that the item pointers
 254 are stored in sorted order. Instead of storing the block and offset number of
 255 each item pointer separately, we store the difference from the previous item.
 256 That in itself doesn't do much, but it allows us to use so-called varbyte
 257 encoding to compress them.
 258
 259 Varbyte encoding is a method to encode integers, allowing smaller numbers to
 260 take less space at the cost of larger numbers. Each integer is represented by
 261 variable number of bytes. High bit of each byte in varbyte encoding determines
 262 whether the next byte is still part of this number. Therefore, to read a single
 263 varbyte encoded number, you have to read bytes until you find a byte with the
 264 high bit not set.
 265
 266 When encoding, the block and offset number forming the item pointer are
 267 combined into a single integer. The offset number is stored in the 11 low
 268 bits (see MaxHeapTuplesPerPageBits in ginpostinglist.c), and the block number
 269 is stored in the higher bits. That requires 43 bits in total, which
 270 conveniently fits in at most 6 bytes.
 271
 272 A compressed posting list is passed around and stored on disk in a
 273 PackedPostingList struct. The first item in the list is stored uncompressed
 274 as a regular ItemPointerData, followed by the length of the list in bytes,
 275 followed by the packed items.
 276
 277 Concurrency
 278 -----------
 279
 280 The entry tree and each posting tree is a B-tree, with right-links connecting
 281 sibling pages at the same level. This is the same structure that is used in
 282 the regular B-tree indexam (invented by Lehman & Yao), but we don't support
 283 scanning a GIN trees backwards, so we don't need left-links.
 284
 285 To avoid deadlocks, B-tree pages must always be locked in the same order:
 286 left to right, and bottom to top. When searching, the tree is traversed from
 287 top to bottom, so the lock on the parent page must be released before
 288 descending to the next level. Concurrent page splits move the keyspace to
 289 right, so after following a downlink, the page actually containing the key
 290 we're looking for might be somewhere to the right of the page we landed on.
 291 In that case, we follow the right-links until we find the page we're looking
 292 for.
 293
 294 To delete a page, the page's left sibling, the target page, and its parent,
 295 are locked in that order, and the page is marked as deleted. However, a
 296 concurrent search might already have read a pointer to the page, and might be
 297 just about to follow it. A page can be reached via the right-link of its left
 298 sibling, or via its downlink in the parent.
 299
 300 To prevent a backend from reaching a deleted page via a right-link, when
 301 following a right-link the lock on the previous page is not released until
 302 the lock on next page has been acquired.
 303
 304 The downlink is more tricky. A search descending the tree must release the
 305 lock on the parent page before locking the child, or it could deadlock with
 306 a concurrent split of the child page; a page split locks the parent, while
 307 already holding a lock on the child page. However, posting trees are only
 308 fully searched from left to right, starting from the leftmost leaf. (The
 309 tree-structure is only needed by insertions, to quickly find the correct
 310 insert location). So as long as we don't delete the leftmost page on each
 311 level, a search can never follow a downlink to page that's about to be
 312 deleted.
 313
 314 The previous paragraph's reasoning only applies to searches, and only to
 315 posting trees. To protect from inserters following a downlink to a deleted
 316 page, vacuum simply locks out all concurrent insertions to the posting tree,
 317 by holding a super-exclusive lock on the posting tree root. Inserters hold a
 318 pin on the root page, but searches do not, so while new searches cannot begin
 319 while root page is locked, any already-in-progress scans can continue
 320 concurrently with vacuum. In the entry tree, we never delete pages.
 321
 322 (This is quite different from the mechanism the btree indexam uses to make
 323 page-deletions safe; it stamps the deleted pages with an XID and keeps the
 324 deleted pages around with the right-link intact until all concurrent scans
 325 have finished.)
 326
 327 Compatibility
 328 -------------
 329
 330 Compression of TIDs was introduced in 9.4. Some GIN indexes could remain in
 331 uncompressed format because of pg_upgrade from 9.3 or earlier versions.
 332 For compatibility, old uncompressed format is also supported. Following
 333 rules are used to handle it:
 334
 335 * GIN_ITUP_COMPRESSED flag marks index tuples that contain a posting list.
 336 This flag is stored in high bit of ItemPointerGetBlockNumber(&itup->t_tid).
 337 Use GinItupIsCompressed(itup) to check the flag.
 338
 339 * Posting tree pages in the new format are marked with the GIN_COMPRESSED flag.
 340   Macros GinPageIsCompressed(page) and GinPageSetCompressed(page) are used to
 341   check and set this flag.
 342
 343 * All scan operations check format of posting list add use corresponding code
 344 to read its content.
 345
 346 * When updating an index tuple containing an uncompressed posting list, it
 347 will be replaced with new index tuple containing a compressed list.
 348
 349 * When updating an uncompressed posting tree leaf page, it's compressed.
 350
 351 * If vacuum finds some dead TIDs in uncompressed posting lists, they are
 352 converted into compressed posting lists. This assumes that the compressed
 353 posting list fits in the space occupied by the uncompressed list. IOW, we
 354 assume that the compressed version of the page, with the dead items removed,
 355 takes less space than the old uncompressed version.
 356
 357 Limitations
 358 -----------
 359
 360   * Gin doesn't use scan->kill_prior_tuple & scan->ignore_killed_tuples
 361   * Gin searches entries only by equality matching, or simple range
 362     matching using the "partial match" feature.
 363
 364 TODO
 365 ----
 366
 367 Nearest future:
 368
 369   * Opclasses for more types (no programming, just many catalog changes)
 370
 371 Distant future:
 372
 373   * Replace B-tree of entries to something like GiST
 374
 375 Authors
 376 -------
 377
 378 Original work was done by Teodor Sigaev (teodor@sigaev.ru) and Oleg Bartunov
 379 (oleg@sai.msu.su).