granicus.if.org Git - postgresql/blob - src/backend/storage/buffer/README

   1 src/backend/storage/buffer/README
   2
   3 Notes About Shared Buffer Access Rules
   4 ======================================
   5
   6 There are two separate access control mechanisms for shared disk buffers:
   7 reference counts (a/k/a pin counts) and buffer content locks.  (Actually,
   8 there's a third level of access control: one must hold the appropriate kind
   9 of lock on a relation before one can legally access any page belonging to
  10 the relation.  Relation-level locks are not discussed here.)
  11
  12 Pins: one must "hold a pin on" a buffer (increment its reference count)
  13 before being allowed to do anything at all with it.  An unpinned buffer is
  14 subject to being reclaimed and reused for a different page at any instant,
  15 so touching it is unsafe.  Normally a pin is acquired via ReadBuffer and
  16 released via ReleaseBuffer.  It is OK and indeed common for a single
  17 backend to pin a page more than once concurrently; the buffer manager
  18 handles this efficiently.  It is considered OK to hold a pin for long
  19 intervals --- for example, sequential scans hold a pin on the current page
  20 until done processing all the tuples on the page, which could be quite a
  21 while if the scan is the outer scan of a join.  Similarly, btree index
  22 scans hold a pin on the current index page.  This is OK because normal
  23 operations never wait for a page's pin count to drop to zero.  (Anything
  24 that might need to do such a wait is instead handled by waiting to obtain
  25 the relation-level lock, which is why you'd better hold one first.)  Pins
  26 may not be held across transaction boundaries, however.
  27
  28 Buffer content locks: there are two kinds of buffer lock, shared and exclusive,
  29 which act just as you'd expect: multiple backends can hold shared locks on
  30 the same buffer, but an exclusive lock prevents anyone else from holding
  31 either shared or exclusive lock.  (These can alternatively be called READ
  32 and WRITE locks.)  These locks are intended to be short-term: they should not
  33 be held for long.  Buffer locks are acquired and released by LockBuffer().
  34 It will *not* work for a single backend to try to acquire multiple locks on
  35 the same buffer.  One must pin a buffer before trying to lock it.
  36
  37 Buffer access rules:
  38
  39 1. To scan a page for tuples, one must hold a pin and either shared or
  40 exclusive content lock.  To examine the commit status (XIDs and status bits)
  41 of a tuple in a shared buffer, one must likewise hold a pin and either shared
  42 or exclusive lock.
  43
  44 2. Once one has determined that a tuple is interesting (visible to the
  45 current transaction) one may drop the content lock, yet continue to access
  46 the tuple's data for as long as one holds the buffer pin.  This is what is
  47 typically done by heap scans, since the tuple returned by heap_fetch
  48 contains a pointer to tuple data in the shared buffer.  Therefore the
  49 tuple cannot go away while the pin is held (see rule #5).  Its state could
  50 change, but that is assumed not to matter after the initial determination
  51 of visibility is made.
  52
  53 3. To add a tuple or change the xmin/xmax fields of an existing tuple,
  54 one must hold a pin and an exclusive content lock on the containing buffer.
  55 This ensures that no one else might see a partially-updated state of the
  56 tuple while they are doing visibility checks.
  57
  58 4. It is considered OK to update tuple commit status bits (ie, OR the
  59 values HEAP_XMIN_COMMITTED, HEAP_XMIN_INVALID, HEAP_XMAX_COMMITTED, or
  60 HEAP_XMAX_INVALID into t_infomask) while holding only a shared lock and
  61 pin on a buffer.  This is OK because another backend looking at the tuple
  62 at about the same time would OR the same bits into the field, so there
  63 is little or no risk of conflicting update; what's more, if there did
  64 manage to be a conflict it would merely mean that one bit-update would
  65 be lost and need to be done again later.  These four bits are only hints
  66 (they cache the results of transaction status lookups in pg_clog), so no
  67 great harm is done if they get reset to zero by conflicting updates.
  68 Note, however, that a tuple is frozen by setting both HEAP_XMIN_INVALID
  69 and HEAP_XMIN_COMMITTED; this is a critical update and accordingly requires
  70 an exclusive buffer lock (and it must also be WAL-logged).
  71
  72 5. To physically remove a tuple or compact free space on a page, one
  73 must hold a pin and an exclusive lock, *and* observe while holding the
  74 exclusive lock that the buffer's shared reference count is one (ie,
  75 no other backend holds a pin).  If these conditions are met then no other
  76 backend can perform a page scan until the exclusive lock is dropped, and
  77 no other backend can be holding a reference to an existing tuple that it
  78 might expect to examine again.  Note that another backend might pin the
  79 buffer (increment the refcount) while one is performing the cleanup, but
  80 it won't be able to actually examine the page until it acquires shared
  81 or exclusive content lock.
  82
  83
  84 Rule #5 only affects VACUUM operations.  Obtaining the
  85 necessary lock is done by the bufmgr routine LockBufferForCleanup().
  86 It first gets an exclusive lock and then checks to see if the shared pin
  87 count is currently 1.  If not, it releases the exclusive lock (but not the
  88 caller's pin) and waits until signaled by another backend, whereupon it
  89 tries again.  The signal will occur when UnpinBuffer decrements the shared
  90 pin count to 1.  As indicated above, this operation might have to wait a
  91 good while before it acquires lock, but that shouldn't matter much for
  92 concurrent VACUUM.  The current implementation only supports a single
  93 waiter for pin-count-1 on any particular shared buffer.  This is enough
  94 for VACUUM's use, since we don't allow multiple VACUUMs concurrently on a
  95 single relation anyway.
  96
  97
  98 Buffer Manager's Internal Locking
  99 ---------------------------------
 100
 101 Before PostgreSQL 8.1, all operations of the shared buffer manager itself
 102 were protected by a single system-wide lock, the BufMgrLock, which
 103 unsurprisingly proved to be a source of contention.  The new locking scheme
 104 avoids grabbing system-wide exclusive locks in common code paths.  It works
 105 like this:
 106
 107 * There is a system-wide LWLock, the BufMappingLock, that notionally
 108 protects the mapping from buffer tags (page identifiers) to buffers.
 109 (Physically, it can be thought of as protecting the hash table maintained
 110 by buf_table.c.)  To look up whether a buffer exists for a tag, it is
 111 sufficient to obtain share lock on the BufMappingLock.  Note that one
 112 must pin the found buffer, if any, before releasing the BufMappingLock.
 113 To alter the page assignment of any buffer, one must hold exclusive lock
 114 on the BufMappingLock.  This lock must be held across adjusting the buffer's
 115 header fields and changing the buf_table hash table.  The only common
 116 operation that needs exclusive lock is reading in a page that was not
 117 in shared buffers already, which will require at least a kernel call
 118 and usually a wait for I/O, so it will be slow anyway.
 119
 120 * As of PG 8.2, the BufMappingLock has been split into NUM_BUFFER_PARTITIONS
 121 separate locks, each guarding a portion of the buffer tag space.  This allows
 122 further reduction of contention in the normal code paths.  The partition
 123 that a particular buffer tag belongs to is determined from the low-order
 124 bits of the tag's hash value.  The rules stated above apply to each partition
 125 independently.  If it is necessary to lock more than one partition at a time,
 126 they must be locked in partition-number order to avoid risk of deadlock.
 127
 128 * A separate system-wide LWLock, the BufFreelistLock, provides mutual
 129 exclusion for operations that access the buffer free list or select
 130 buffers for replacement.  This is always taken in exclusive mode since
 131 there are no read-only operations on those data structures.  The buffer
 132 management policy is designed so that BufFreelistLock need not be taken
 133 except in paths that will require I/O, and thus will be slow anyway.
 134 (Details appear below.)  It is never necessary to hold the BufMappingLock
 135 and the BufFreelistLock at the same time.
 136
 137 * Each buffer header contains a spinlock that must be taken when examining
 138 or changing fields of that buffer header.  This allows operations such as
 139 ReleaseBuffer to make local state changes without taking any system-wide
 140 lock.  We use a spinlock, not an LWLock, since there are no cases where
 141 the lock needs to be held for more than a few instructions.
 142
 143 Note that a buffer header's spinlock does not control access to the data
 144 held within the buffer.  Each buffer header also contains an LWLock, the
 145 "buffer content lock", that *does* represent the right to access the data
 146 in the buffer.  It is used per the rules above.
 147
 148 There is yet another set of per-buffer LWLocks, the io_in_progress locks,
 149 that are used to wait for I/O on a buffer to complete.  The process doing
 150 a read or write takes exclusive lock for the duration, and processes that
 151 need to wait for completion try to take shared locks (which they release
 152 immediately upon obtaining).  XXX on systems where an LWLock represents
 153 nontrivial resources, it's fairly annoying to need so many locks.  Possibly
 154 we could use per-backend LWLocks instead (a buffer header would then contain
 155 a field to show which backend is doing its I/O).
 156
 157
 158 Normal Buffer Replacement Strategy
 159 ----------------------------------
 160
 161 There is a "free list" of buffers that are prime candidates for replacement.
 162 In particular, buffers that are completely free (contain no valid page) are
 163 always in this list.  We could also throw buffers into this list if we
 164 consider their pages unlikely to be needed soon; however, the current
 165 algorithm never does that.  The list is singly-linked using fields in the
 166 buffer headers; we maintain head and tail pointers in global variables.
 167 (Note: although the list links are in the buffer headers, they are
 168 considered to be protected by the BufFreelistLock, not the buffer-header
 169 spinlocks.)  To choose a victim buffer to recycle when there are no free
 170 buffers available, we use a simple clock-sweep algorithm, which avoids the
 171 need to take system-wide locks during common operations.  It works like
 172 this:
 173
 174 Each buffer header contains a usage counter, which is incremented (up to a
 175 small limit value) whenever the buffer is pinned.  (This requires only the
 176 buffer header spinlock, which would have to be taken anyway to increment the
 177 buffer reference count, so it's nearly free.)
 178
 179 The "clock hand" is a buffer index, nextVictimBuffer, that moves circularly
 180 through all the available buffers.  nextVictimBuffer is protected by the
 181 BufFreelistLock.
 182
 183 The algorithm for a process that needs to obtain a victim buffer is:
 184
 185 1. Obtain BufFreelistLock.
 186
 187 2. If buffer free list is nonempty, remove its head buffer.  If the buffer
 188 is pinned or has a nonzero usage count, it cannot be used; ignore it and
 189 return to the start of step 2.  Otherwise, pin the buffer, release
 190 BufFreelistLock, and return the buffer.
 191
 192 3. Otherwise, select the buffer pointed to by nextVictimBuffer, and
 193 circularly advance nextVictimBuffer for next time.
 194
 195 4. If the selected buffer is pinned or has a nonzero usage count, it cannot
 196 be used.  Decrement its usage count (if nonzero) and return to step 3 to
 197 examine the next buffer.
 198
 199 5. Pin the selected buffer, release BufFreelistLock, and return the buffer.
 200
 201 (Note that if the selected buffer is dirty, we will have to write it out
 202 before we can recycle it; if someone else pins the buffer meanwhile we will
 203 have to give up and try another buffer.  This however is not a concern
 204 of the basic select-a-victim-buffer algorithm.)
 205
 206
 207 Buffer Ring Replacement Strategy
 208 ---------------------------------
 209
 210 When running a query that needs to access a large number of pages just once,
 211 such as VACUUM or a large sequential scan, a different strategy is used.
 212 A page that has been touched only by such a scan is unlikely to be needed
 213 again soon, so instead of running the normal clock sweep algorithm and
 214 blowing out the entire buffer cache, a small ring of buffers is allocated
 215 using the normal clock sweep algorithm and those buffers are reused for the
 216 whole scan.  This also implies that much of the write traffic caused by such
 217 a statement will be done by the backend itself and not pushed off onto other
 218 processes.
 219
 220 For sequential scans, a 256KB ring is used. That's small enough to fit in L2
 221 cache, which makes transferring pages from OS cache to shared buffer cache
 222 efficient.  Even less would often be enough, but the ring must be big enough
 223 to accommodate all pages in the scan that are pinned concurrently.  256KB
 224 should also be enough to leave a small cache trail for other backends to
 225 join in a synchronized seq scan.  If a ring buffer is dirtied and its LSN
 226 updated, we would normally have to write and flush WAL before we could
 227 re-use the buffer; in this case we instead discard the buffer from the ring
 228 and (later) choose a replacement using the normal clock-sweep algorithm.
 229 Hence this strategy works best for scans that are read-only (or at worst
 230 update hint bits).  In a scan that modifies every page in the scan, like a
 231 bulk UPDATE or DELETE, the buffers in the ring will always be dirtied and
 232 the ring strategy effectively degrades to the normal strategy.
 233
 234 VACUUM uses a 256KB ring like sequential scans, but dirty pages are not
 235 removed from the ring.  Instead, WAL is flushed if needed to allow reuse of
 236 the buffers.  Before introducing the buffer ring strategy in 8.3, VACUUM's
 237 buffers were sent to the freelist, which was effectively a buffer ring of 1
 238 buffer, resulting in excessive WAL flushing.  Allowing VACUUM to update
 239 256KB between WAL flushes should be more efficient.
 240
 241 Bulk writes work similarly to VACUUM.  Currently this applies only to
 242 COPY IN and CREATE TABLE AS SELECT.  (Might it be interesting to make
 243 seqscan UPDATE and DELETE use the bulkwrite strategy?)  For bulk writes
 244 we use a ring size of 16MB (but not more than 1/8th of shared_buffers).
 245 Smaller sizes have been shown to result in the COPY blocking too often
 246 for WAL flushes.  While it's okay for a background vacuum to be slowed by
 247 doing its own WAL flushing, we'd prefer that COPY not be subject to that,
 248 so we let it use up a bit more of the buffer arena.
 249
 250
 251 Background Writer's Processing
 252 ------------------------------
 253
 254 The background writer is designed to write out pages that are likely to be
 255 recycled soon, thereby offloading the writing work from active backends.
 256 To do this, it scans forward circularly from the current position of
 257 nextVictimBuffer (which it does not change!), looking for buffers that are
 258 dirty and not pinned nor marked with a positive usage count.  It pins,
 259 writes, and releases any such buffer.
 260
 261 If we can assume that reading nextVictimBuffer is an atomic action, then
 262 the writer doesn't even need to take the BufFreelistLock in order to look
 263 for buffers to write; it needs only to spinlock each buffer header for long
 264 enough to check the dirtybit.  Even without that assumption, the writer
 265 only needs to take the lock long enough to read the variable value, not
 266 while scanning the buffers.  (This is a very substantial improvement in
 267 the contention cost of the writer compared to PG 8.0.)
 268
 269 During a checkpoint, the writer's strategy must be to write every dirty
 270 buffer (pinned or not!).  We may as well make it start this scan from
 271 nextVictimBuffer, however, so that the first-to-be-written pages are the
 272 ones that backends might otherwise have to write for themselves soon.
 273
 274 The background writer takes shared content lock on a buffer while writing it
 275 out (and anyone else who flushes buffer contents to disk must do so too).
 276 This ensures that the page image transferred to disk is reasonably consistent.
 277 We might miss a hint-bit update or two but that isn't a problem, for the same
 278 reasons mentioned under buffer access rules.
 279
 280 As of 8.4, background writer starts during recovery mode when there is
 281 some form of potentially extended recovery to perform. It performs an
 282 identical service to normal processing, except that checkpoints it
 283 writes are technically restartpoints.