granicus.if.org Git - postgresql/blob - src/backend/access/heap/README.HOT

   1 $PostgreSQL: pgsql/src/backend/access/heap/README.HOT,v 1.3 2008/03/21 13:23:27 momjian Exp $
   2
   3 Heap Only Tuples (HOT)
   4 ======================
   5
   6 The Heap Only Tuple (HOT) feature eliminates redundant index entries and
   7 allows the re-use of space taken by DELETEd or obsoleted UPDATEd tuples
   8 without performing a table-wide vacuum.  It does this by allowing
   9 single-page vacuuming, also called "defragmentation".
  10
  11 Note: there is a Glossary at the end of this document that may be helpful
  12 for first-time readers.
  13
  14
  15 Technical Challenges
  16 --------------------
  17
  18 Page-at-a-time vacuuming is normally impractical because of the costs of
  19 finding and removing the index entries that link to the tuples to be
  20 reclaimed.  Standard vacuuming scans the indexes to ensure all such index
  21 entries are removed, amortizing the index scan cost across as many dead
  22 tuples as possible; this approach does not scale down well to the case of
  23 reclaiming just a few tuples.  In principle one could recompute the index
  24 keys and do standard index searches to find the index entries, but this is
  25 risky in the presence of possibly-buggy user-defined functions in
  26 functional indexes.  An allegedly immutable function that in fact is not
  27 immutable might prevent us from re-finding an index entry (and we cannot
  28 throw an error for not finding it, in view of the fact that dead index
  29 entries are sometimes reclaimed early).  That would lead to a seriously
  30 corrupt index, in the form of entries pointing to tuple slots that by now
  31 contain some unrelated content.  In any case we would prefer to be able
  32 to do vacuuming without invoking any user-written code.
  33
  34 HOT solves this problem for a restricted but useful special case:
  35 where a tuple is repeatedly updated in ways that do not change its
  36 indexed columns.  (Here, "indexed column" means any column referenced
  37 at all in an index definition, including for example columns that are
  38 tested in a partial-index predicate but are not stored in the index.)
  39
  40 An additional property of HOT is that it reduces index size by avoiding
  41 the creation of identically-keyed index entries.  This improves search
  42 speeds.
  43
  44
  45 Update Chains With a Single Index Entry
  46 ---------------------------------------
  47
  48 Without HOT, every version of a row in an update chain has its own index
  49 entries, even if all indexed columns are the same.  With HOT, a new tuple
  50 placed on the same page and with all indexed columns the same as its
  51 parent row version does not get new index entries.  This means there is
  52 only one index entry for the entire update chain on the heap page.
  53 An index-entry-less tuple is marked with the HEAP_ONLY_TUPLE flag.
  54 The prior row version is marked HEAP_HOT_UPDATED, and (as always in an
  55 update chain) its t_ctid field links forward to the newer version.
  56
  57 For example:
  58
  59         Index points to 1
  60         lp [1]  [2]
  61
  62         [111111111]->[2222222222]
  63
  64 In the above diagram, the index points to line pointer 1, and tuple 1 is
  65 marked as HEAP_HOT_UPDATED.  Tuple 2 is a HOT tuple, meaning it has
  66 no index entry pointing to it, and is marked as HEAP_ONLY_TUPLE.
  67 Although tuple 2 is not directly referenced by the index, it can still be
  68 found by an index search: after traversing from the index to tuple 1,
  69 the index search proceeds forward to child tuples as long as it sees the
  70 HEAP_HOT_UPDATED flag set.  Since we restrict the HOT chain to lie within
  71 a single page, this requires no additional page fetches and doesn't
  72 introduce much performance penalty.
  73
  74 Eventually, tuple 1 will no longer be visible to any transaction.
  75 At that point its space could be reclaimed, but its line pointer cannot,
  76 since the index still links to that line pointer and we still need to
  77 be able to find tuple 2 in an index search.  HOT handles this by turning
  78 line pointer 1 into a "redirecting line pointer", which links to tuple 2
  79 but has no actual tuple attached.  This state of affairs looks like
  80
  81         Index points to 1
  82         lp [1]->[2]
  83
  84         [2222222222]
  85
  86 If now the row is updated again, to version 3, the page looks like this:
  87
  88         Index points to 1
  89         lp [1]->[2]  [3]
  90
  91         [2222222222]->[3333333333]
  92
  93 At some later time when no transaction can see tuple 2 in its snapshot,
  94 tuple 2 and its line pointer can be pruned entirely:
  95
  96         Index points to 1
  97         lp [1]------>[3]
  98
  99         [3333333333]
 100
 101 This is safe because no index entry points to line pointer 2.  Subsequent
 102 insertions into the page can now recycle both line pointer 2 and the
 103 space formerly used by tuple 2.
 104
 105 If an update changes any indexed column, or there is not room on the
 106 same page for the new tuple, then the HOT chain ends: the last member
 107 has a regular t_ctid link to the next version and is not marked
 108 HEAP_HOT_UPDATED.  (In principle we could continue a HOT chain across
 109 pages, but this would destroy the desired property of being able to
 110 reclaim space with just page-local manipulations.  Anyway, we don't
 111 want to have to chase through multiple heap pages to get from an index
 112 entry to the desired tuple, so it seems better to create a new index
 113 entry for the new tuple.)  If further updates occur, the next version
 114 could become the root of a new HOT chain.
 115
 116 Line pointer 1 has to remain as long as there is any non-dead member of
 117 the chain on the page.  When there is not, it is marked "dead".
 118 This lets us reclaim the last child line pointer and associated tuple
 119 immediately.  The next regular VACUUM pass can reclaim the index entries
 120 pointing at the line pointer and then the line pointer itself.  Since a
 121 line pointer is small compared to a tuple, this does not represent an
 122 undue space cost.
 123
 124 Note: we can use a "dead" line pointer for any DELETEd tuple,
 125 whether it was part of a HOT chain or not.  This allows space reclamation
 126 in advance of running VACUUM for plain DELETEs as well as HOT updates.
 127
 128 The requirement for doing a HOT update is that none of the indexed
 129 columns are changed.  This is checked at execution time by comparing the
 130 binary representation of the old and new values.  We insist on bitwise
 131 equality rather than using datatype-specific equality routines.  The
 132 main reason to avoid the latter is that there might be multiple notions
 133 of equality for a datatype, and we don't know exactly which one is
 134 relevant for the indexes at hand.  We assume that bitwise equality
 135 guarantees equality for all purposes.
 136
 137
 138 Abort Cases
 139 -----------
 140
 141 If a heap-only tuple's xmin is aborted, then it can be removed immediately:
 142 it was never visible to any other transaction, and all descendant row
 143 versions must be aborted as well.  Therefore we need not consider it part
 144 of a HOT chain.  By the same token, if a HOT-updated tuple's xmax is
 145 aborted, there is no need to follow the chain link.  However, there is a
 146 race condition here: the transaction that did the HOT update might abort
 147 between the time we inspect the HOT-updated tuple and the time we reach
 148 the descendant heap-only tuple.  It is conceivable that someone prunes
 149 the heap-only tuple before that, and even conceivable that the line pointer
 150 is re-used for another purpose.  Therefore, when following a HOT chain,
 151 it is always necessary to be prepared for the possibility that the
 152 linked-to item pointer is unused, dead, or redirected; and if it is a
 153 normal item pointer, we still have to check that XMIN of the tuple matches
 154 the XMAX of the tuple we left.  Otherwise we should assume that we have
 155 come to the end of the HOT chain.  Note that this sort of XMIN/XMAX
 156 matching is required when following ordinary update chains anyway.
 157
 158 (Early versions of the HOT code assumed that holding pin on the page
 159 buffer while following a HOT link would prevent this type of problem,
 160 but checking XMIN/XMAX matching is a much more robust solution.)
 161
 162
 163 Index/Sequential Scans
 164 ----------------------
 165
 166 When doing an index scan, whenever we reach a HEAP_HOT_UPDATED tuple whose
 167 xmax is not aborted, we need to follow its t_ctid link and check that
 168 entry as well; possibly repeatedly until we reach the end of the HOT
 169 chain.  (When using an MVCC snapshot it is possible to optimize this a
 170 bit: there can be at most one visible tuple in the chain, so we can stop
 171 when we find it.  This rule does not work for non-MVCC snapshots, though.)
 172
 173 Sequential scans do not need to pay attention to the HOT links because
 174 they scan every item pointer on the page anyway.  The same goes for a
 175 bitmap heap scan with a lossy bitmap.
 176
 177
 178 Pruning
 179 -------
 180
 181 HOT pruning means updating item pointers so that HOT chains are
 182 reduced in length, by collapsing out line pointers for intermediate dead
 183 tuples.  Although this makes those line pointers available for re-use,
 184 it does not immediately make the space occupied by their tuples available.
 185
 186
 187 Defragmentation
 188 ---------------
 189
 190 Defragmentation centralizes unused space.  After we have converted root
 191 line pointers to redirected line pointers and pruned away any dead
 192 intermediate line pointers, the tuples they linked to are free space.
 193 But unless that space is adjacent to the central "hole" on the page
 194 (the pd_lower-to-pd_upper area) it cannot be used by tuple insertion.
 195 Defragmentation moves the surviving tuples to coalesce all the free
 196 space into one "hole".  This is done with the same PageRepairFragmentation
 197 function that regular VACUUM uses.
 198
 199
 200 When can/should we prune or defragment?
 201 ---------------------------------------
 202
 203 This is the most interesting question in HOT implementation, since there
 204 is no simple right answer: we must use heuristics to determine when it's
 205 most efficient to perform pruning and/or defragmenting.
 206
 207 We cannot prune or defragment unless we can get a "buffer cleanup lock"
 208 on the target page; otherwise, pruning might destroy line pointers that
 209 other backends have live references to, and defragmenting might move
 210 tuples that other backends have live pointers to.  Thus the general
 211 approach must be to heuristically decide if we should try to prune
 212 or defragment, and if so try to acquire the buffer cleanup lock without
 213 blocking.  If we succeed we can proceed with our housekeeping work.
 214 If we cannot get the lock (which should not happen often, except under
 215 very heavy contention) then the housekeeping has to be postponed till
 216 some other time.  The worst-case consequence of this is only that an
 217 UPDATE cannot be made HOT but has to link to a new tuple version placed on
 218 some other page, for lack of centralized space on the original page.
 219
 220 Ideally we would do defragmenting only when we are about to attempt
 221 heap_update on a HOT-safe tuple.  The difficulty with this approach
 222 is that the update query has certainly got a pin on the old tuple, and
 223 therefore our attempt to acquire a buffer cleanup lock will always fail.
 224 (This corresponds to the idea that we don't want to move the old tuple
 225 out from under where the query's HeapTuple pointer points.  It might
 226 be possible to finesse that, but it seems fragile.)
 227
 228 Pruning, however, is potentially useful even when we are not about to
 229 insert a new tuple, since shortening a HOT chain reduces the cost of
 230 subsequent index searches.  However it is unclear that this gain is
 231 large enough to accept any extra maintenance burden for.
 232
 233 The currently planned heuristic is to prune and defrag when first accessing
 234 a page that potentially has prunable tuples (as flagged by the pd_prune_xid
 235 page hint field) and that either has free space less than MAX(fillfactor
 236 target free space, BLCKSZ/10) *or* has recently had an UPDATE fail to
 237 find enough free space to store an updated tuple version.  (These rules
 238 are subject to change.)
 239
 240 We have effectively implemented the "truncate dead tuples to just line
 241 pointer" idea that has been proposed and rejected before because of fear
 242 of line pointer bloat: we might end up with huge numbers of line pointers
 243 and just a few actual tuples on a page.  To limit the damage in the worst
 244 case, and to keep various work arrays as well as the bitmaps in bitmap
 245 scans reasonably sized, the maximum number of line pointers per page
 246 is arbitrarily capped at MaxHeapTuplesPerPage (the most tuples that
 247 could fit without HOT pruning).
 248
 249
 250 VACUUM
 251 ------
 252
 253 There is little change to regular vacuum.  It performs pruning to remove
 254 dead heap-only tuples, and cleans up any dead line pointers as if they were
 255 regular dead tuples.
 256
 257
 258 VACUUM FULL
 259 -----------
 260
 261 VACUUM FULL performs an extra operation of collapsing out redirecting line
 262 pointers, by moving the first non-DEAD tuple of each HOT chain to the root
 263 position and clearing its heap-only-tuple flag.  This effectively changes
 264 the user-visible CTID of that tuple.  This would be completely unsafe
 265 during normal concurrent operation, but since VACUUM FULL takes full
 266 exclusive lock on the table, it should be OK.  (Note that VACUUM FULL has
 267 always felt free to change tuples' CTIDs by moving them across pages.)
 268 Eliminating redirection links means that the main body of VACUUM FULL
 269 doesn't have to deal with them, which seems a good thing since VACUUM FULL
 270 is horrendously complex already.
 271
 272 When VACUUM FULL tries to move tuple chains, it does not distinguish regular
 273 and heap-only tuples, but just moves both types the same.  This is OK because
 274 it will move the entire non-DEAD tail of an update chain and remove index
 275 entries for each item moved.  At worst, we'll uselessly search for index
 276 entries matching the heap-only tuples included in the move.
 277
 278
 279 Statistics
 280 ----------
 281
 282 Currently, we count HOT updates the same as cold updates for statistics
 283 purposes, though there is an additional per-table counter that counts
 284 only HOT updates.  When a page pruning operation is able to remove a
 285 physical tuple by eliminating an intermediate heap-only tuple or
 286 replacing a physical root tuple by a redirect pointer, a decrement in
 287 the table's number of dead tuples is reported to pgstats, which may
 288 postpone autovacuuming.  Note that we do not count replacing a root tuple
 289 by a DEAD item pointer as decrementing n_dead_tuples; we still want
 290 autovacuum to run to clean up the index entries and DEAD item.
 291
 292 This area probably needs further work ...
 293
 294
 295 CREATE INDEX
 296 ------------
 297
 298 CREATE INDEX presents a problem for HOT updates.  While the existing HOT
 299 chains all have the same index values for existing indexes, the columns
 300 in the new index might change within a pre-existing HOT chain, creating
 301 a "broken" chain that can't be indexed properly.
 302
 303 To address this issue, regular (non-concurrent) CREATE INDEX makes the
 304 new index usable only by transactions newer than the CREATE INDEX
 305 command.  This prevents transactions that can see the inconsistent HOT
 306 chains from trying to use the new index and getting incorrect results.
 307 New transactions can only see the rows visible after the index was
 308 created, hence the HOT chains are consistent for them.
 309
 310 Entries in the new index point to root tuples (tuples with current index
 311 pointers) so that our index uses the same index pointers as all other
 312 indexes on the table.  However the row we want to index is actually at
 313 the *end* of the chain, ie, the most recent live tuple on the HOT chain.
 314 That is the one we compute the index entry values for, but the TID
 315 we put into the index is that of the root tuple.  Since transactions that
 316 will be allowed to use the new index cannot see any of the older tuple
 317 versions in the chain, the fact that they might not match the index entry
 318 isn't a problem.  (Such transactions will check the tuple visibility
 319 information of the older versions and ignore them, without ever looking at
 320 their contents, so the content inconsistency is OK.)  Subsequent updates
 321 to the live tuple will be allowed to extend the HOT chain only if they are
 322 HOT-safe for all the indexes.
 323
 324 Because we have ShareLock on the table, any DELETE_IN_PROGRESS or
 325 INSERT_IN_PROGRESS tuples should have come from our own transaction.
 326 Therefore we can consider them committed since if the CREATE INDEX
 327 commits, they will be committed, and if it aborts the index is discarded.
 328 An exception to this is that early lock release is customary for system
 329 catalog updates, and so we might find such tuples when reindexing a system
 330 catalog.  In that case we deal with it by waiting for the source
 331 transaction to commit or roll back.  (We could do that for user tables
 332 too, but since the case is unexpected we prefer to throw an error.)
 333
 334 Practically, we prevent old transactions from using the new index by
 335 setting pg_index.indcheckxmin to TRUE.  Queries are allowed to use such an
 336 index only after pg_index.xmin is below their TransactionXmin horizon,
 337 thereby ensuring that any incompatible rows in HOT chains are dead to them.
 338 (pg_index.xmin will be the XID of the CREATE INDEX transaction.  The reason
 339 for using xmin rather than a normal column is that the regular vacuum
 340 freezing mechanism will take care of converting xmin to FrozenTransactionId
 341 before it can wrap around.)
 342
 343 This means in particular that the transaction creating the index will be
 344 unable to use the index.  We alleviate that problem somewhat by not setting
 345 indcheckxmin unless the table actually contains HOT chains with
 346 RECENTLY_DEAD members.  (In 8.4 we may be able to improve the situation,
 347 at least for non-serializable transactions, because we expect to be able to
 348 advance TransactionXmin intratransaction.)
 349
 350 Another unpleasant consequence is that it is now risky to use SnapshotAny
 351 in an index scan: if the index was created more recently than the last
 352 vacuum, it's possible that some of the visited tuples do not match the
 353 index entry they are linked to.  This does not seem to be a fatal
 354 objection, since there are few users of SnapshotAny and most use seqscans.
 355 The only exception at this writing is CLUSTER, which is okay because it
 356 does not require perfect ordering of the indexscan readout (and especially
 357 so because CLUSTER tends to write recently-dead tuples out of order anyway).
 358
 359
 360 CREATE INDEX CONCURRENTLY
 361 -------------------------
 362
 363 In the concurrent case we must take a different approach.  We create the
 364 pg_index entry immediately, before we scan the table.  The pg_index entry
 365 is marked as "not ready for inserts".  Then we commit and wait for any
 366 transactions which have the table open to finish.  This ensures that no
 367 new HOT updates will change the key value for our new index, because all
 368 transactions will see the existence of the index and will respect its
 369 constraint on which updates can be HOT.  Other transactions must include
 370 such an index when determining HOT-safety of updates, even though they
 371 must ignore it for both insertion and searching purposes.
 372
 373 We must do this to avoid making incorrect index entries.  For example,
 374 suppose we are building an index on column X and we make an index entry for
 375 a non-HOT tuple with X=1.  Then some other backend, unaware that X is an
 376 indexed column, HOT-updates the row to have X=2, and commits.  We now have
 377 an index entry for X=1 pointing at a HOT chain whose live row has X=2.
 378 We could make an index entry with X=2 during the validation pass, but
 379 there is no nice way to get rid of the wrong entry with X=1.  So we must
 380 have the HOT-safety property enforced before we start to build the new
 381 index.
 382
 383 After waiting for transactions which had the table open, we build the index
 384 for all rows that are valid in a fresh snapshot.  Any tuples visible in the
 385 snapshot will have only valid forward-growing HOT chains.  (They might have
 386 older HOT updates behind them which are broken, but this is OK for the same
 387 reason it's OK in a regular index build.)  As above, we point the index
 388 entry at the root of the HOT-update chain but we use the key value from the
 389 live tuple.
 390
 391 We mark the index open for inserts (but still not ready for reads) then
 392 we again wait for transactions which have the table open.  Then we take
 393 a second reference snapshot and validate the index.  This searches for
 394 tuples missing from the index, and inserts any missing ones.  Again,
 395 the index entries have to have TIDs equal to HOT-chain root TIDs, but
 396 the value to be inserted is the one from the live tuple.
 397
 398 Then we wait until every transaction that could have a snapshot older than
 399 the second reference snapshot is finished.  This ensures that nobody is
 400 alive any longer who could need to see any tuples that might be missing
 401 from the index, as well as ensuring that no one can see any inconsistent
 402 rows in a broken HOT chain (the first condition is stronger than the
 403 second).  Finally, we can mark the index valid for searches.
 404
 405
 406 Limitations and Restrictions
 407 ----------------------------
 408
 409 It is worth noting that HOT forever forecloses alternative approaches
 410 to vacuuming, specifically the recompute-the-index-keys approach alluded
 411 to in Technical Challenges above.  It'll be tough to recompute the index
 412 keys for a root line pointer you don't have data for anymore ...
 413
 414
 415 Glossary
 416 --------
 417
 418 Broken HOT Chain
 419
 420         A HOT chain in which the key value for an index has changed.
 421
 422         This is not allowed to occur normally but if a new index is created
 423         it can happen.  In that case various strategies are used to ensure
 424         that no transaction for which the older tuples are visible can
 425         use the index.
 426
 427 Cold update
 428
 429         A normal, non-HOT update, in which index entries are made for
 430         the new version of the tuple.
 431
 432 Dead line pointer
 433
 434         A stub line pointer, that does not point to anything, but cannot
 435         be removed or reused yet because there are index pointers to it.
 436         Semantically same as a dead tuple.  It has state LP_DEAD.
 437
 438 Heap-only tuple
 439
 440         A heap tuple with no index pointers, which can only be reached
 441         from indexes indirectly through its ancestral root tuple.
 442         Marked with HEAP_ONLY_TUPLE flag.
 443
 444 HOT-safe
 445
 446         A proposed tuple update is said to be HOT-safe if it changes
 447         none of the tuple's indexed columns.  It will only become an
 448         actual HOT update if we can find room on the same page for
 449         the new tuple version.
 450
 451 HOT update
 452
 453         An UPDATE where the new tuple becomes a heap-only tuple, and no
 454         new index entries are made.
 455
 456 HOT-updated tuple
 457
 458         An updated tuple, for which the next tuple in the chain is a
 459         heap-only tuple.  Marked with HEAP_HOT_UPDATED flag.
 460
 461 Indexed column
 462
 463         A column used in an index definition.  The column might not
 464         actually be stored in the index --- it could be used in a
 465         functional index's expression, or used in a partial index
 466         predicate.  HOT treats all these cases alike.
 467
 468 Redirecting line pointer
 469
 470         A line pointer that points to another line pointer and has no
 471         associated tuple.  It has the special lp_flags state LP_REDIRECT,
 472         and lp_off is the OffsetNumber of the line pointer it links to.
 473         This is used when a root tuple becomes dead but we cannot prune
 474         the line pointer because there are non-dead heap-only tuples
 475         further down the chain.
 476
 477 Root tuple
 478
 479         The first tuple in a HOT update chain; the one that indexes point to.
 480
 481 Update chain
 482
 483         A chain of updated tuples, in which each tuple's ctid points to
 484         the next tuple in the chain. A HOT update chain is an update chain
 485         (or portion of an update chain) that consists of a root tuple and
 486         one or more heap-only tuples.  A complete update chain can contain
 487         both HOT and non-HOT (cold) updated tuples.