Added documentation for the new interface between the buffer manager

author Jan Wieck <JanWieck@Yahoo.com>

Fri, 14 Nov 2003 04:32:11 +0000 (04:32 +0000)

committer Jan Wieck <JanWieck@Yahoo.com>

Fri, 14 Nov 2003 04:32:11 +0000 (04:32 +0000)
diff --git a/src/backend/storage/buffer/README b/src/backend/storage/buffer/README

index a5d0c926de17c2f4901497a36f98a6182a415a5e..8a68ff054ec5cdd82277fe0518c3362b7132dc29 100644
--- a/src/backend/storage/buffer/README
+++ b/src/backend/storage/buffer/README
@@ -1,4 +1,4 @@
-$Header: /cvsroot/pgsql/src/backend/storage/buffer/README,v 1.4 2003/10/31 22:48:08 tgl Exp $
+$Header: /cvsroot/pgsql/src/backend/storage/buffer/README,v 1.5 2003/11/14 04:32:11 wieck Exp $
  
  Notes about shared buffer access rules
  --------------------------------------
@@ -95,3 +95,155 @@ concurrent VACUUM.  The current implementation only supports a single
  waiter for pin-count-1 on any particular shared buffer.  This is enough
  for VACUUM's use, since we don't allow multiple VACUUMs concurrently on a
  single relation anyway.
+
+
+Buffer replacement strategy interface:
+
+The two files freelist.c and buf_table.c contain the buffer cache
+replacement strategy. The interface to the strategy is:
+
+       BufferDesc *
+       StrategyBufferLookup(BufferTag *tagPtr, bool recheck)
+
+               This is always the first call made by the buffer manager
+               to check if a disk page is in memory. If so, the function
+               returns the buffer descriptor and no further action is
+               required.
+
+               If the page is not in memory, StrategyBufferLookup()
+               returns NULL.
+
+               The recheck flag tells the strategy that this is a second
+               lookup after flushing a dirty block. If the buffer manager
+               has to evict another buffer, it releases the bufmgr lock
+               while doing the write IO. During this time, another backend
+               could fault in the same page this backend is after, so the
+               buffer manager has to check again after the IO is done
+               whether the page is in memory now.
+
+       BufferDesc *
+       StrategyGetBuffer(void)
+
+               The buffer manager calls this function to get an unpinned
+               cache buffer whose content can be evicted. The returned
+               buffer might be empty, clean or dirty.
+
+               The returned buffer is only a candidate for replacement.
+               It is possible that while the buffer is being written,
+               another backend finds and modifies it, so that it is dirty
+               again. The buffer manager will then call StrategyGetBuffer()
+               again to ask for another candidate.
+
+       void
+       StrategyReplaceBuffer(BufferDesc *buf, Relation rnode, 
+                       BlockNumber blockNum)
+               
+               Called by the buffer manager at the time it is about to
+               change the association of a buffer with a disk page.
+
+               Before this call, StrategyBufferLookup() still has to find
+               the buffer even if it was returned by StrategyGetBuffer()
+               as a candidate for replacement.
+
+               After this call, this buffer must be returned for a
+               lookup of the new page identified by rnode and blockNum.
+
+       void
+       StrategyInvalidateBuffer(BufferDesc *buf)
+
+               Called from various places to tell the strategy that the
+               content of this buffer has been thrown away. This happens,
+               for example, when a relation is dropped.
+
+               The buffer must be clean and unpinned on call.
+
+               If the buffer was associated with a disk page, StrategyBufferLookup()
+               must not return it for this page after the call.
+
+       void
+       StrategyHintVacuum(bool vacuum_active)
+
+               Because vacuum reads all relations of the entire database
+               through the buffer manager, it can greatly disturb the
+               buffer replacement strategy. Vacuum uses this function to
+               tell the strategy that all subsequent buffer lookups are
+               caused by vacuum scanning relations.
+
+               
+Buffer replacement strategy:
+
+The buffer replacement strategy actually used in freelist.c is a
+version of the Adaptive Replacement Cache (ARC) specially tailored for
+PostgreSQL.
+
+The algorithm works as follows:
+
+       C is the size of the cache in number of pages (conf: shared_buffers).
+       ARC uses 2*C Cache Directory Blocks (CDB). A cache directory block
+       is always associated with one unique file page and "can" point to
+       one shared buffer.
+
+       All file pages known by the directory are managed in 4 LRU lists
+       named B1, T1, T2 and B2. The T1 and T2 lists are the "real" cache
+       entries, linking a file page to a memory buffer where the page is
+       currently cached. Consequently T1len+T2len <= C. B1 and B2 are
+       ghost cache directories that extend T1 and T2 so that the strategy
+       remembers pages longer. The strategy tries to keep B1len+T1len and
+       B2len+T2len both at C. T1len and T2len vary over the runtime
+       depending on the lookup pattern and its resulting cache hits. The
+       desired size of T1 is called T1target.
+
+       Assuming we have a full cache, one of 5 cases happens on a lookup:
+
+       MISS    On a cache miss, the LRU buffer of T1 or T2 is evicted,
+                       depending on T1target and the actual T1len. Its CDB is
+                       removed from the T list and added as MRU of the
+                       corresponding B list. The now free buffer is replaced with
+                       the requested page and added as MRU of T1.
+
+       T1 hit  The T1 CDB is moved to the MRU position of the T2 list.
+
+       T2 hit  The T2 CDB is moved to the MRU position of the T2 list.
+
+       B1 hit  This means that a buffer that was evicted from the T1
+                       list is now requested again, indicating that T1target is
+                       too small (otherwise it would still be in T1 and thus in
+                       memory). The strategy raises T1target, evicts a buffer
+                       depending on T1target and T1len and places the CDB at
+                       MRU of T2.
+
+       B2 hit  This is the opposite of a B1 hit: the T2 list is probably
+                       too small. So the strategy lowers T1target, evicts a
+                       buffer and places the CDB at MRU of T2.
+
+       Thus, every page that is found on lookup in any of the four lists
+       ends up as the MRU of the T2 list. The T2 list therefore is the
+       "frequency" cache, holding frequently requested pages.
+
+       Every page that is seen for the first time ends up as the MRU of
+       the T1 list. The T1 list is the "recency" cache, holding recent
+       newcomers.
+
+       The tailoring done for PostgreSQL has to do with the way the
+       query executor works. A typical UPDATE or DELETE first scans the
+       relation, searching for the tuples, and then calls heap_update() or
+       heap_delete(). This causes at least 2 lookups for the block in the
+       same statement, and even more if several tuples match in one
+       block. As a result, every block touched in an UPDATE or DELETE
+       would directly jump into the T2 cache, which is wrong. To prevent
+       this, the strategy remembers which transaction added a buffer to
+       the T1 list and will not promote it from there into the T2 cache
+       during the same transaction.
+       
+       Another specialty is the change of the strategy during VACUUM.
+       Lookups during VACUUM do not represent application needs, so it
+       would be wrong to let them change the cache balance T1target or
+       cause massive cache evictions. Therefore, a page read in to
+       satisfy vacuum (as opposed to one that causes a hit on any list)
+       is placed at the LRU position of the T1 list, for immediate
+       reuse. Since VACUUM usually requests many pages in quick
+       succession, the natural side effect is that on the next call it
+       gets back the very buffers it just filled and possibly modified.
+       It therefore does its work in a few shared memory buffers, while
+       using whatever it finds in the cache already.
+
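The five lookup cases described in the README text (MISS, T1 hit, T2 hit, B1 hit, B2 hit) can be illustrated with a small stand-alone simulation. This is only a sketch under stated assumptions, not the code in freelist.c: all names (List, arc_access, arc_run, arc_reset) are hypothetical, pages are plain integers, T1target moves by exactly one per ghost hit, and the victim rule "evict from T1 when T1len >= max(1, T1target), else from T2" is one plausible reading of "depending on T1target and T1len". The PostgreSQL-specific tweaks (same-transaction promotion check, VACUUM hint) are left out.

```c
#include <assert.h>
#include <string.h>

#define C 4                             /* cache size in pages (shared_buffers) */

enum { B1, T1, T2, B2, NLISTS };

/* One LRU list of page numbers: index 0 is the LRU end, len-1 the MRU end. */
typedef struct { int page[2 * C]; int len; } List;

static List lists[NLISTS];
static int  t1_target = C / 2;          /* assumed starting balance */

static void arc_reset(void)
{
    memset(lists, 0, sizeof(lists));
    t1_target = C / 2;
}

static int find(const List *l, int page)
{
    for (int i = 0; i < l->len; i++)
        if (l->page[i] == page)
            return i;
    return -1;
}

static void remove_at(List *l, int i)
{
    memmove(&l->page[i], &l->page[i + 1],
            (size_t) (l->len - i - 1) * sizeof(int));
    l->len--;
}

static void push_mru(List *l, int page) { l->page[l->len++] = page; }

/* Move the LRU entry of T1 or T2 to the MRU end of its ghost list. */
static void evict(void)
{
    int min_t1 = t1_target > 1 ? t1_target : 1;
    int from = (lists[T1].len >= min_t1 || lists[T2].len == 0) ? T1 : T2;
    int victim = lists[from].page[0];

    remove_at(&lists[from], 0);
    push_mru(&lists[from == T1 ? B1 : B2], victim);
}

/* Forget the oldest ghosts so that B1len+T1len and B2len+T2len stay <= C. */
static void trim_ghosts(void)
{
    while (lists[B1].len > 0 && lists[B1].len + lists[T1].len > C)
        remove_at(&lists[B1], 0);
    while (lists[B2].len > 0 && lists[B2].len + lists[T2].len > C)
        remove_at(&lists[B2], 0);
}

/* Look up one page; returns 1 if it was in memory (T1 or T2), else 0. */
static int arc_access(int page)
{
    int where = -1, pos = -1;

    for (int l = 0; l < NLISTS && where < 0; l++)
        if ((pos = find(&lists[l], page)) >= 0)
            where = l;
    if (where >= 0)
        remove_at(&lists[where], pos);

    if (where == T1 || where == T2)     /* T1 hit or T2 hit: MRU of T2 */
    {
        push_mru(&lists[T2], page);
        trim_ghosts();
        return 1;
    }
    if (where == B1 && t1_target < C)   /* B1 hit: T1 was too small */
        t1_target++;
    if (where == B2 && t1_target > 0)   /* B2 hit: T2 was too small */
        t1_target--;
    if (lists[T1].len + lists[T2].len == C)
        evict();                        /* cache full: free one buffer */
    push_mru(&lists[where < 0 ? T1 : T2], page);  /* MISS -> T1, ghost -> T2 */
    trim_ghosts();
    assert(lists[T1].len + lists[T2].len <= C);
    return 0;
}

/* Feed a whole trace through the cache and count the in-memory hits. */
static int arc_run(const int *pages, int n)
{
    int hits = 0;

    for (int i = 0; i < n; i++)
        hits += arc_access(pages[i]);
    return hits;
}
```

With C = 4, feeding the trace 1 2 3 4 1 5 2 through arc_run() gives one in-memory hit (the second lookup of page 1, which moves it to T2). Page 2 is evicted to B1 when page 5 arrives, so its second lookup is a B1 hit: T1target rises from 2 to 3 and page 2 re-enters the cache at the MRU end of T2.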
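The call order that the README describes for the strategy interface (lookup first, then ask for a candidate, flush it if dirty, recheck, and only then replace) can be sketched as follows. Everything here is a toy stand-in and an assumption: toy_lookup(), toy_get_buffer() and toy_replace() only mimic the roles of StrategyBufferLookup(), StrategyGetBuffer() and StrategyReplaceBuffer(), the types ToyTag and ToyBuf are invented, the candidate choice is plain round-robin, and pinning and locking are not modeled. Because the sketch is single-threaded, the recheck never finds the page and a flushed buffer is never dirtied again; those branches are present only to show where they would sit.

```c
#include <stdbool.h>
#include <stddef.h>

#define NBUFFERS 2

typedef struct { int rel; int block; } ToyTag;              /* hypothetical page id */
typedef struct { ToyTag tag; bool valid; bool dirty; } ToyBuf;

static ToyBuf pool[NBUFFERS];
static int    next_victim;      /* round-robin cursor for candidates */
static int    writes;           /* number of simulated write IOs */

/* Role of StrategyBufferLookup(): return the buffer holding tag, or NULL. */
static ToyBuf *toy_lookup(const ToyTag *tag, bool recheck)
{
    (void) recheck;             /* this toy strategy ignores the hint */
    for (int i = 0; i < NBUFFERS; i++)
        if (pool[i].valid && pool[i].tag.rel == tag->rel &&
            pool[i].tag.block == tag->block)
            return &pool[i];
    return NULL;
}

/* Role of StrategyGetBuffer(): hand out a replacement candidate. */
static ToyBuf *toy_get_buffer(void)
{
    ToyBuf *buf = &pool[next_victim];

    next_victim = (next_victim + 1) % NBUFFERS;
    return buf;
}

/* Role of StrategyReplaceBuffer(): from now on, lookups of tag find buf. */
static void toy_replace(ToyBuf *buf, const ToyTag *tag)
{
    buf->tag = *tag;
    buf->valid = true;
    buf->dirty = false;
}

/* Simulated write IO; a real buffer manager drops its lock around this. */
static void toy_flush(ToyBuf *buf)
{
    writes++;
    buf->dirty = false;
}

/* The buffer manager side: find or load the page identified by tag. */
static ToyBuf *toy_read_buffer(ToyTag tag)
{
    ToyBuf *buf = toy_lookup(&tag, false);      /* always the first call */

    if (buf != NULL)
        return buf;                             /* page already in memory */

    for (;;)
    {
        buf = toy_get_buffer();                 /* only a candidate */
        if (!buf->dirty)
            break;
        toy_flush(buf);

        /* Second lookup: someone may have faulted the page in meanwhile. */
        ToyBuf *found = toy_lookup(&tag, true);

        if (found != NULL)
            return found;
        if (!buf->dirty)
            break;                              /* still clean: use it */
        /* dirtied again during the write: ask for another candidate */
    }
    toy_replace(buf, &tag);
    return buf;
}
```

A caller would request pages with toy_read_buffer((ToyTag){rel, block}); setting the dirty flag on a returned buffer before further requests exercises the flush-and-recheck path.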