*
*
* IDENTIFICATION
- * $Header: /cvsroot/pgsql/src/backend/access/transam/xact.c,v 1.104 2001/06/22 19:16:21 wieck Exp $
+ * $Header: /cvsroot/pgsql/src/backend/access/transam/xact.c,v 1.105 2001/07/06 21:04:25 tgl Exp $
*
* NOTES
* Transaction aborts can now occur two ways:
RecordTransactionCommit()
{
TransactionId xid;
- int leak;
+ bool leak;
xid = GetCurrentTransactionId();
--- /dev/null
+$Header: /cvsroot/pgsql/src/backend/storage/buffer/README,v 1.1 2001/07/06 21:04:25 tgl Exp $
+
+Notes about shared buffer access rules
+--------------------------------------
+
+There are two separate access control mechanisms for shared disk buffers:
+reference counts (a/k/a pin counts) and buffer locks. (Actually, there's
+a third level of access control: one must hold the appropriate kind of
+lock on a relation before one can legally access any page belonging to
+the relation. Relation-level locks are not discussed here.)
+
+Pins: one must "hold a pin on" a buffer (increment its reference count)
+before being allowed to do anything at all with it. An unpinned buffer is
+subject to being reclaimed and reused for a different page at any instant,
+so touching it is unsafe. Typically a pin is acquired via ReadBuffer and
+released via WriteBuffer (if one modified the page) or ReleaseBuffer (if not).
+It is OK and indeed common for a single backend to pin a page more than
+once concurrently; the buffer manager handles this efficiently. It is
+considered OK to hold a pin for long intervals --- for example, sequential
+scans hold a pin on the current page until done processing all the tuples
+on the page, which could be quite a while if the scan is the outer scan of
+a join. Similarly, btree index scans hold a pin on the current index page.
+This is OK because normal operations never wait for a page's pin count to
+drop to zero. (Anything that might need to do such a wait is instead
+handled by waiting to obtain the relation-level lock, which is why you'd
+better hold one first.) Pins may not be held across transaction
+boundaries, however.
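+
+As a concrete illustration, the typical pin lifecycle looks like this
+(a simplified sketch: error handling and the buffer locks discussed
+below are omitted, and the caller is assumed to already hold the
+appropriate relation-level lock):
+
+	Buffer	buf;
+	bool	dirtied = false;
+
+	buf = ReadBuffer(reln, blkno);		/* acquires a pin */
+	/* ... examine or modify the page; set dirtied if modified ... */
+	if (dirtied)
+		WriteBuffer(buf);		/* releases pin, marks page dirty */
+	else
+		ReleaseBuffer(buf);		/* releases pin only */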
+
+Buffer locks: there are two kinds of buffer locks, shared and exclusive,
+which act just as you'd expect: multiple backends can hold shared locks on
+the same buffer, but an exclusive lock prevents anyone else from holding
+either shared or exclusive lock. (These can alternatively be called READ
+and WRITE locks.) These locks are short-term: they should not be held for
+long. They are implemented as per-buffer spinlocks, so another backend
+trying to acquire a competing lock will spin as long as you hold yours!
+Buffer locks are acquired and released by LockBuffer(). It will *not* work
+for a single backend to try to acquire multiple locks on the same buffer.
+One must pin a buffer before trying to lock it.
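+
+For example, a minimal shared-lock critical section might look like
+this (a sketch only; BUFFER_LOCK_EXCLUSIVE would be used in place of
+BUFFER_LOCK_SHARE where the rules below require it):
+
+	buf = ReadBuffer(reln, blkno);		/* pin before locking */
+	LockBuffer(buf, BUFFER_LOCK_SHARE);
+	/* ... examine the page; keep this interval short ... */
+	LockBuffer(buf, BUFFER_LOCK_UNLOCK);	/* release lock promptly */
+	ReleaseBuffer(buf);			/* drop pin when fully done */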
+
+Buffer access rules:
+
+1. To scan a page for tuples, one must hold a pin and either shared or
+exclusive lock. To examine the commit status (XIDs and status bits) of
+a tuple in a shared buffer, one must likewise hold a pin and either shared
+or exclusive lock.
+
+2. Once one has determined that a tuple is interesting (visible to the
+current transaction) one may drop the buffer lock, yet continue to access
+the tuple's data for as long as one holds the buffer pin. This is what is
+typically done by heap scans, since the tuple returned by heap_fetch
+contains a pointer to tuple data in the shared buffer. Therefore the
+tuple cannot go away while the pin is held (see rule #5). Its state could
+change, but that is assumed not to matter after the initial determination
+of visibility is made (see the sketch following these rules).
+
+3. To add a tuple or change the xmin/xmax fields of an existing tuple,
+one must hold a pin and an exclusive lock on the containing buffer.
+This ensures that no one else might see a partially-updated state of the
+tuple.
+
+4. It is considered OK to update tuple commit status bits (ie, OR the
+values HEAP_XMIN_COMMITTED, HEAP_XMIN_INVALID, HEAP_XMAX_COMMITTED, or
+HEAP_XMAX_INVALID into t_infomask) while holding only a shared lock and
+pin on a buffer. This is OK because another backend looking at the tuple
+at about the same time would OR the same bits into the field, so there
+is little or no risk of conflicting update; what's more, if there did
+manage to be a conflict it would merely mean that one bit-update would
+be lost and need to be done again later. These four bits are only hints
+(they cache the results of transaction status lookups in pg_log), so no
+great harm is done if they get reset to zero by conflicting updates.
+
+5. To physically remove a tuple or compact free space on a page, one
+must hold a pin and an exclusive lock, *and* observe while holding the
+exclusive lock that the buffer's shared reference count is one (ie,
+no other backend holds a pin). If these conditions are met then no other
+backend can perform a page scan until the exclusive lock is dropped, and
+no other backend can be holding a reference to an existing tuple that it
+might expect to examine again. Note that another backend might pin the
+buffer (increment the refcount) while one is performing the cleanup, but
+it won't be able to actually examine the page until it acquires shared
+or exclusive lock.
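+
+The sketch below illustrates rule #2: a saved pointer into the page
+remains usable after the buffer lock is dropped, so long as the pin is
+held (illustrative only; the tuple-location details are elided):
+
+	buf = ReadBuffer(reln, blkno);		/* pin */
+	LockBuffer(buf, BUFFER_LOCK_SHARE);
+	/* ... find a visible tuple, save a pointer to its data ... */
+	LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+	/*
+	 * The saved pointer is still valid here: the pin prevents the
+	 * tuple's removal (rule #5), though its status bits may change.
+	 */
+	ReleaseBuffer(buf);	/* pointer must not be used after this */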
+
+
+As of 7.1, the only operation that removes tuples or compacts free space is
+(oldstyle) VACUUM. It does not have to implement rule #5 directly, because
+it instead acquires exclusive lock at the relation level, which ensures
+indirectly that no one else is accessing pages of the relation at all.
+
+To implement concurrent VACUUM we will need to make it obey rule #5 fully.
+To do this, we'll create a new buffer manager operation
+LockBufferForCleanup() that gets an exclusive lock and then checks to see
+if the shared pin count is currently 1. If not, it releases the exclusive
+lock (but not the caller's pin) and waits until signaled by another backend,
+whereupon it tries again. The signal will occur when UnpinBuffer
+decrements the shared pin count to 1. As indicated above, this operation
+might have to wait a good while before it acquires lock, but that shouldn't
+matter much for concurrent VACUUM. The current implementation only
+supports a single waiter for pin-count-1 on any particular shared buffer.
+This is enough for VACUUM's use, since we don't allow multiple VACUUMs
+concurrently on a single relation anyway.
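+
+A concurrent VACUUM would then use the new operation roughly as
+follows (a hypothetical caller; the buffer-manager calls shown are the
+real entry points):
+
+	buf = ReadBuffer(onerel, blkno);	/* pin the target page */
+	LockBufferForCleanup(buf);	/* excl lock, observed pin count 1 */
+	/* ... delete dead tuples and compact free space ... */
+	LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+	WriteBuffer(buf);			/* mark dirty, release pin */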
*
*
* IDENTIFICATION
- * $Header: /cvsroot/pgsql/src/backend/storage/buffer/buf_init.c,v 1.42 2001/03/22 03:59:44 momjian Exp $
+ * $Header: /cvsroot/pgsql/src/backend/storage/buffer/buf_init.c,v 1.43 2001/07/06 21:04:25 tgl Exp $
*
*-------------------------------------------------------------------------
*/
bits8 *BufferLocks; /* flag bits showing locks I have set */
BufferTag *BufferTagLastDirtied; /* tag buffer had when last
* dirtied by me */
-BufferBlindId *BufferBlindLastDirtied;
bool *BufferDirtiedByMe; /* T if buf has been dirtied in cur xact */
PrivateRefCount = (long *) calloc(NBuffers, sizeof(long));
BufferLocks = (bits8 *) calloc(NBuffers, sizeof(bits8));
BufferTagLastDirtied = (BufferTag *) calloc(NBuffers, sizeof(BufferTag));
- BufferBlindLastDirtied = (BufferBlindId *) calloc(NBuffers, sizeof(BufferBlindId));
BufferDirtiedByMe = (bool *) calloc(NBuffers, sizeof(bool));
/*
*
*
* IDENTIFICATION
- * $Header: /cvsroot/pgsql/src/backend/storage/buffer/bufmgr.c,v 1.115 2001/07/02 18:47:18 tgl Exp $
+ * $Header: /cvsroot/pgsql/src/backend/storage/buffer/bufmgr.c,v 1.116 2001/07/06 21:04:25 tgl Exp $
*
*-------------------------------------------------------------------------
*/
#include <math.h>
#include <signal.h>
-#include "executor/execdebug.h"
#include "miscadmin.h"
#include "storage/buf_internals.h"
#include "storage/bufmgr.h"
-#include "storage/s_lock.h"
+#include "storage/proc.h"
#include "storage/smgr.h"
#include "utils/relcache.h"
-#include "catalog/pg_database.h"
#include "pgstat.h"
if (!BufTableDelete(bufHdr))
{
SpinRelease(BufMgrLock);
- elog(FATAL, "BufRead: buffer table broken after IO error\n");
+ elog(FATAL, "BufRead: buffer table broken after IO error");
}
/* remember that BufferAlloc() pinned the buffer */
UnpinBuffer(bufHdr);
if (smok == FALSE)
{
- elog(NOTICE, "BufferAlloc: cannot write block %u for %s/%s",
- buf->tag.blockNum, buf->blind.dbname, buf->blind.relname);
+ elog(NOTICE, "BufferAlloc: cannot write block %u for %u/%u",
+ buf->tag.blockNum,
+ buf->tag.rnode.tblNode, buf->tag.rnode.relNode);
inProgress = FALSE;
buf->flags |= BM_IO_ERROR;
buf->flags &= ~BM_IO_IN_PROGRESS;
TerminateBufferIO(buf);
- PrivateRefCount[BufferDescriptorGetBuffer(buf) - 1] = 0;
- Assert(buf->refcount > 0);
- buf->refcount--;
- if (buf->refcount == 0)
- {
- AddBufferToFreelist(buf);
- buf->flags |= BM_FREE;
- }
+ UnpinBuffer(buf);
buf = (BufferDesc *) NULL;
}
else
{
-
/*
* BM_JUST_DIRTIED cleared by BufferReplace and shouldn't
 * be set by anyone. - vadim 01/17/97
*/
if (buf->flags & BM_JUST_DIRTIED)
{
- elog(STOP, "BufferAlloc: content of block %u (%s) changed while flushing",
- buf->tag.blockNum, buf->blind.relname);
+ elog(STOP, "BufferAlloc: content of block %u (%u/%u) changed while flushing",
+ buf->tag.blockNum,
+ buf->tag.rnode.tblNode, buf->tag.rnode.relNode);
}
else
buf->flags &= ~BM_DIRTY;
inProgress = FALSE;
buf->flags &= ~BM_IO_IN_PROGRESS;
TerminateBufferIO(buf);
- PrivateRefCount[BufferDescriptorGetBuffer(buf) - 1] = 0;
- buf->refcount--;
+ UnpinBuffer(buf);
buf = (BufferDesc *) NULL;
}
{
buf->flags &= ~BM_IO_IN_PROGRESS;
TerminateBufferIO(buf);
- /* give up the buffer since we don't need it any more */
- PrivateRefCount[BufferDescriptorGetBuffer(buf) - 1] = 0;
- Assert(buf->refcount > 0);
- buf->refcount--;
- if (buf->refcount == 0)
- {
- AddBufferToFreelist(buf);
- buf->flags |= BM_FREE;
- }
+ /* give up old buffer since we don't need it any more */
+ UnpinBuffer(buf);
}
PinBuffer(buf2);
if (!BufTableDelete(buf))
{
SpinRelease(BufMgrLock);
- elog(FATAL, "buffer wasn't in the buffer table\n");
+ elog(FATAL, "buffer wasn't in the buffer table");
}
- /* record the database name and relation name for this buffer */
- strcpy(buf->blind.dbname, (DatabaseName) ? DatabaseName : "Recovery");
- strcpy(buf->blind.relname, RelationGetPhysicalRelationName(reln));
-
INIT_BUFFERTAG(&(buf->tag), reln, blockNum);
+
if (!BufTableInsert(buf))
{
SpinRelease(BufMgrLock);
- elog(FATAL, "Buffer in lookup table twice \n");
+ elog(FATAL, "Buffer in lookup table twice");
}
/*
else
{
SpinAcquire(BufMgrLock);
- PrivateRefCount[buffer - 1] = 0;
- Assert(bufHdr->refcount > 0);
- bufHdr->refcount--;
- if (bufHdr->refcount == 0)
- {
- AddBufferToFreelist(bufHdr);
- bufHdr->flags |= BM_FREE;
- }
+ UnpinBuffer(bufHdr);
return ReadBufferInternal(relation, blockNum, true);
}
}
}
if (status == SM_FAIL) /* disk failure ?! */
- elog(STOP, "BufferSync: cannot write %u for %s",
- bufHdr->tag.blockNum, bufHdr->blind.relname);
+ elog(STOP, "BufferSync: cannot write %u for %u/%u",
+ bufHdr->tag.blockNum,
+ bufHdr->tag.rnode.tblNode, bufHdr->tag.rnode.relNode);
/*
 * Note that it's safe to change cntxDirty here because we
{
BufferDesc *buf = &BufferDescriptors[i];
+ PrivateRefCount[i] = 1; /* make sure we release shared pin */
SpinAcquire(BufMgrLock);
- PrivateRefCount[i] = 0;
- Assert(buf->refcount > 0);
- buf->refcount--;
- if (buf->refcount == 0)
- {
- AddBufferToFreelist(buf);
- buf->flags |= BM_FREE;
- }
+ UnpinBuffer(buf);
SpinRelease(BufMgrLock);
+ Assert(PrivateRefCount[i] == 0);
}
}
smgrabort();
}
-/* -----------------------------------------------
- * BufferPoolCheckLeak
+/*
+ * BufferPoolCheckLeak
*
* check if there is buffer leak
- *
- * -----------------------------------------------
*/
-int
-BufferPoolCheckLeak()
+bool
+BufferPoolCheckLeak(void)
{
int i;
- int result = 0;
+ bool result = false;
- for (i = 1; i <= NBuffers; i++)
+ for (i = 0; i < NBuffers; i++)
{
- if (PrivateRefCount[i - 1] != 0)
+ if (PrivateRefCount[i] != 0)
{
- BufferDesc *buf = &(BufferDescriptors[i - 1]);
+ BufferDesc *buf = &(BufferDescriptors[i]);
elog(NOTICE,
"Buffer Leak: [%03d] (freeNext=%d, freePrev=%d, \
-relname=%s, blockNum=%d, flags=0x%x, refcount=%d %ld)",
- i - 1, buf->freeNext, buf->freePrev,
- buf->blind.relname, buf->tag.blockNum, buf->flags,
- buf->refcount, PrivateRefCount[i - 1]);
- result = 1;
+rel=%u/%u, blockNum=%u, flags=0x%x, refcount=%d %ld)",
+ i, buf->freeNext, buf->freePrev,
+ buf->tag.rnode.tblNode, buf->tag.rnode.relNode,
+ buf->tag.blockNum, buf->flags,
+ buf->refcount, PrivateRefCount[i]);
+ result = true;
}
}
return result;
SpinAcquire(BufMgrLock);
for (i = 0; i < NBuffers; ++i, ++buf)
{
- elog(DEBUG, "[%02d] (freeNext=%d, freePrev=%d, relname=%s, \
-blockNum=%d, flags=0x%x, refcount=%d %ld)",
+ elog(DEBUG, "[%02d] (freeNext=%d, freePrev=%d, rel=%u/%u, \
+blockNum=%u, flags=0x%x, refcount=%d %ld)",
i, buf->freeNext, buf->freePrev,
- buf->blind.relname, buf->tag.blockNum, buf->flags,
+ buf->tag.rnode.tblNode, buf->tag.rnode.relNode,
+ buf->tag.blockNum, buf->flags,
buf->refcount, PrivateRefCount[i]);
}
SpinRelease(BufMgrLock);
/* interactive backend */
for (i = 0; i < NBuffers; ++i, ++buf)
{
- printf("[%-2d] (%s, %d) flags=0x%x, refcnt=%d %ld)\n",
- i, buf->blind.relname, buf->tag.blockNum,
+		printf("[%-2d] (%u/%u, %u) flags=0x%x, refcnt=%d %ld\n",
+ i, buf->tag.rnode.tblNode, buf->tag.rnode.relNode,
+ buf->tag.blockNum,
buf->flags, buf->refcount, PrivateRefCount[i]);
}
}
for (i = 0; i < NBuffers; ++i, ++buf)
{
if (PrivateRefCount[i] > 0)
- elog(NOTICE, "[%02d] (freeNext=%d, freePrev=%d, relname=%s, \
-blockNum=%d, flags=0x%x, refcount=%d %ld)\n",
- i, buf->freeNext, buf->freePrev, buf->blind.relname,
+ elog(NOTICE, "[%02d] (freeNext=%d, freePrev=%d, rel=%u/%u, \
+blockNum=%u, flags=0x%x, refcount=%d %ld)",
+ i, buf->freeNext, buf->freePrev,
+ buf->tag.rnode.tblNode, buf->tag.rnode.relNode,
buf->tag.blockNum, buf->flags,
buf->refcount, PrivateRefCount[i]);
}
(char *) MAKE_PTR(bufHdr->data));
if (status == SM_FAIL) /* disk failure ?! */
- elog(STOP, "FlushRelationBuffers: cannot write %u for %s",
- bufHdr->tag.blockNum, bufHdr->blind.relname);
+ elog(STOP, "FlushRelationBuffers: cannot write %u for %u/%u",
+ bufHdr->tag.blockNum,
+ bufHdr->tag.rnode.tblNode,
+ bufHdr->tag.rnode.relNode);
BufferFlushCount++;
/*
* ReleaseBuffer -- remove the pin on a buffer without
* marking it dirty.
- *
*/
int
ReleaseBuffer(Buffer buffer)
else
{
SpinAcquire(BufMgrLock);
- PrivateRefCount[buffer - 1] = 0;
- Assert(bufHdr->refcount > 0);
- bufHdr->refcount--;
- if (bufHdr->refcount == 0)
- {
- AddBufferToFreelist(bufHdr);
- bufHdr->flags |= BM_FREE;
- }
+ UnpinBuffer(bufHdr);
SpinRelease(BufMgrLock);
}
/*
* ReleaseBufferWithBufferLock
- * Same as ReleaseBuffer except we hold the lock
+ * Same as ReleaseBuffer except we hold the bufmgr lock
*/
static int
ReleaseBufferWithBufferLock(Buffer buffer)
if (PrivateRefCount[buffer - 1] > 1)
PrivateRefCount[buffer - 1]--;
else
- {
- PrivateRefCount[buffer - 1] = 0;
- Assert(bufHdr->refcount > 0);
- bufHdr->refcount--;
- if (bufHdr->refcount == 0)
- {
- AddBufferToFreelist(bufHdr);
- bufHdr->flags |= BM_FREE;
- }
- }
+ UnpinBuffer(bufHdr);
return STATUS_OK;
}
{
BufferDesc *buf = &BufferDescriptors[buffer - 1];
- fprintf(stderr, "PIN(Incr) %d relname = %s, blockNum = %d, \
+ fprintf(stderr, "PIN(Incr) %d rel = %u/%u, blockNum = %u, \
refcount = %ld, file: %s, line: %d\n",
- buffer, buf->blind.relname, buf->tag.blockNum,
+ buffer,
+ buf->tag.rnode.tblNode, buf->tag.rnode.relNode,
+ buf->tag.blockNum,
PrivateRefCount[buffer - 1], file, line);
}
}
{
BufferDesc *buf = &BufferDescriptors[buffer - 1];
- fprintf(stderr, "UNPIN(Rel) %d relname = %s, blockNum = %d, \
+ fprintf(stderr, "UNPIN(Rel) %d rel = %u/%u, blockNum = %u, \
refcount = %ld, file: %s, line: %d\n",
- buffer, buf->blind.relname, buf->tag.blockNum,
+ buffer,
+ buf->tag.rnode.tblNode, buf->tag.rnode.relNode,
+ buf->tag.blockNum,
PrivateRefCount[buffer - 1], file, line);
}
}
{
BufferDesc *buf = &BufferDescriptors[buffer - 1];
- fprintf(stderr, "UNPIN(Rel&Rd) %d relname = %s, blockNum = %d, \
+ fprintf(stderr, "UNPIN(Rel&Rd) %d rel = %u/%u, blockNum = %u, \
refcount = %ld, file: %s, line: %d\n",
- buffer, buf->blind.relname, buf->tag.blockNum,
+ buffer,
+ buf->tag.rnode.tblNode, buf->tag.rnode.relNode,
+ buf->tag.blockNum,
PrivateRefCount[buffer - 1], file, line);
}
if (ShowPinTrace && BufferIsLocal(buffer) && is_userbuffer(buffer))
{
BufferDesc *buf = &BufferDescriptors[b - 1];
- fprintf(stderr, "PIN(Rel&Rd) %d relname = %s, blockNum = %d, \
+ fprintf(stderr, "PIN(Rel&Rd) %d rel = %u/%u, blockNum = %u, \
refcount = %ld, file: %s, line: %d\n",
- b, buf->blind.relname, buf->tag.blockNum,
+ b,
+ buf->tag.rnode.tblNode, buf->tag.rnode.relNode,
+ buf->tag.blockNum,
PrivateRefCount[b - 1], file, line);
}
return b;
* and die if there's anything fishy.
*/
+void
_bm_trace(Oid dbId, Oid relId, int blkNo, int bufNo, int allocType)
{
long start,
*CurTraceBuf = (start + 1) % BMT_LIMIT;
}
+void
_bm_die(Oid dbId, Oid relId, int blkNo, int bufNo,
int allocType, long start, long cur)
{
tb = &TraceBuf[i];
if (tb->bmt_op != BMT_NOTUSED)
{
- fprintf(fp, " [%3d]%spid %d buf %2d for <%d,%u,%d> ",
+ fprintf(fp, " [%3d]%spid %d buf %2d for <%u,%u,%u> ",
i, (i == cur ? " ---> " : "\t"),
tb->bmt_pid, tb->bmt_buf,
tb->bmt_dbid, tb->bmt_relid, tb->bmt_blkno);
for (i = 0; i < NBuffers; i++)
{
- if (BufferLocks[i] == 0)
+ bits8 buflocks = BufferLocks[i];
+
+ if (buflocks == 0)
continue;
Assert(BufferIsValid(i + 1));
S_LOCK(&(buf->cntx_lock));
- if (BufferLocks[i] & BL_R_LOCK)
+ if (buflocks & BL_R_LOCK)
{
Assert(buf->r_locks > 0);
(buf->r_locks)--;
}
- if (BufferLocks[i] & BL_RI_LOCK)
+ if (buflocks & BL_RI_LOCK)
{
-
/*
* Someone else could remove our RI lock when acquiring W
* lock. This is possible if we came here from elog(ERROR)
*/
buf->ri_lock = false;
}
- if (BufferLocks[i] & BL_W_LOCK)
+ if (buflocks & BL_W_LOCK)
{
Assert(buf->w_lock);
buf->w_lock = false;
S_UNLOCK(&(buf->cntx_lock));
+ if (buflocks & BL_PIN_COUNT_LOCK)
+ {
+ SpinAcquire(BufMgrLock);
+ /*
+ * Don't complain if flag bit not set; it could have been reset
+ * but we got a cancel/die interrupt before getting the signal.
+ */
+ if ((buf->flags & BM_PIN_COUNT_WAITER) != 0 &&
+ buf->wait_backend_id == MyBackendId)
+ buf->flags &= ~BM_PIN_COUNT_WAITER;
+ SpinRelease(BufMgrLock);
+ ProcCancelWaitForSignal();
+ }
+
BufferLocks[i] = 0;
RESUME_INTERRUPTS();
RESUME_INTERRUPTS();
}
+/*
+ * LockBufferForCleanup - lock a buffer in preparation for deleting items
+ *
+ * Items may be deleted from a disk page only when the caller (a) holds an
+ * exclusive lock on the buffer and (b) has observed that no other backend
+ * holds a pin on the buffer. If there is a pin, then the other backend
+ * might have a pointer into the buffer (for example, a heapscan reference
+ * to an item --- see README for more details). It's OK if a pin is added
+ * after the cleanup starts, however; the newly-arrived backend will be
+ * unable to look at the page until we release the exclusive lock.
+ *
+ * To implement this protocol, a would-be deleter must pin the buffer and
+ * then call LockBufferForCleanup(). LockBufferForCleanup() is similar to
+ * LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE), except that it loops until
+ * it has successfully observed pin count = 1.
+ */
+void
+LockBufferForCleanup(Buffer buffer)
+{
+ BufferDesc *bufHdr;
+ bits8 *buflock;
+
+ Assert(BufferIsValid(buffer));
+
+ if (BufferIsLocal(buffer))
+ {
+ /* There should be exactly one pin */
+ if (LocalRefCount[-buffer - 1] != 1)
+ elog(ERROR, "LockBufferForCleanup: wrong local pin count");
+ /* Nobody else to wait for */
+ return;
+ }
+
+ /* There should be exactly one local pin */
+ if (PrivateRefCount[buffer - 1] != 1)
+ elog(ERROR, "LockBufferForCleanup: wrong local pin count");
+
+ bufHdr = &BufferDescriptors[buffer - 1];
+ buflock = &(BufferLocks[buffer - 1]);
+
+ for (;;)
+ {
+ /* Try to acquire lock */
+ LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
+ SpinAcquire(BufMgrLock);
+ Assert(bufHdr->refcount > 0);
+ if (bufHdr->refcount == 1)
+ {
+ /* Successfully acquired exclusive lock with pincount 1 */
+ SpinRelease(BufMgrLock);
+ return;
+ }
+ /* Failed, so mark myself as waiting for pincount 1 */
+ if (bufHdr->flags & BM_PIN_COUNT_WAITER)
+ {
+ SpinRelease(BufMgrLock);
+ LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
+ elog(ERROR, "Multiple backends attempting to wait for pincount 1");
+ }
+ bufHdr->wait_backend_id = MyBackendId;
+ bufHdr->flags |= BM_PIN_COUNT_WAITER;
+ *buflock |= BL_PIN_COUNT_LOCK;
+ SpinRelease(BufMgrLock);
+ LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
+ /* Wait to be signaled by UnpinBuffer() */
+ ProcWaitForSignal();
+ *buflock &= ~BL_PIN_COUNT_LOCK;
+ /* Loop back and try again */
+ }
+}
+
/*
* Functions for IO error handling
*
/* Issue notice if this is not the first failure... */
if (buf->flags & BM_IO_ERROR)
{
- elog(NOTICE, "write error may be permanent: cannot write block %u for %s/%s",
- buf->tag.blockNum, buf->blind.dbname, buf->blind.relname);
+ elog(NOTICE, "write error may be permanent: cannot write block %u for %u/%u",
+ buf->tag.blockNum,
+ buf->tag.rnode.tblNode, buf->tag.rnode.relNode);
}
buf->flags |= BM_DIRTY;
}
}
}
-/*
- * Cleanup buffer or mark it for cleanup. Buffer may be cleaned
- * up if it's pinned only once.
- *
- * NOTE: buffer must be excl locked.
- */
-void
-MarkBufferForCleanup(Buffer buffer, void (*CleanupFunc) (Buffer))
-{
- BufferDesc *bufHdr = &BufferDescriptors[buffer - 1];
-
- Assert(PrivateRefCount[buffer - 1] > 0);
-
- if (PrivateRefCount[buffer - 1] > 1)
- {
- LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
- PrivateRefCount[buffer - 1]--;
- SpinAcquire(BufMgrLock);
- Assert(bufHdr->refcount > 0);
- bufHdr->flags |= (BM_DIRTY | BM_JUST_DIRTIED);
- bufHdr->CleanupFunc = CleanupFunc;
- SpinRelease(BufMgrLock);
- return;
- }
-
- SpinAcquire(BufMgrLock);
- Assert(bufHdr->refcount > 0);
- if (bufHdr->refcount == 1)
- {
- SpinRelease(BufMgrLock);
- CleanupFunc(buffer);
- CleanupFunc = NULL;
- }
- else
- SpinRelease(BufMgrLock);
-
- LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
-
- SpinAcquire(BufMgrLock);
- PrivateRefCount[buffer - 1] = 0;
- Assert(bufHdr->refcount > 0);
- bufHdr->flags |= (BM_DIRTY | BM_JUST_DIRTIED);
- bufHdr->CleanupFunc = CleanupFunc;
- bufHdr->refcount--;
- if (bufHdr->refcount == 0)
- {
- AddBufferToFreelist(bufHdr);
- bufHdr->flags |= BM_FREE;
- }
- SpinRelease(BufMgrLock);
- return;
-}
-
RelFileNode
BufferGetFileNode(Buffer buffer)
{
*
*
* IDENTIFICATION
- * $Header: /cvsroot/pgsql/src/backend/storage/buffer/freelist.c,v 1.23 2001/01/24 19:43:06 momjian Exp $
+ * $Header: /cvsroot/pgsql/src/backend/storage/buffer/freelist.c,v 1.24 2001/07/06 21:04:26 tgl Exp $
*
*-------------------------------------------------------------------------
*/
#include "storage/buf_internals.h"
#include "storage/bufmgr.h"
+#include "storage/proc.h"
static BufferDesc *SharedFreeList;
-/* only actually used in debugging. The lock
- * should be acquired before calling the freelist manager.
+/*
+ * State-checking macros
*/
-extern SPINLOCK BufMgrLock;
#define IsInQueue(bf) \
( \
AssertMacro((bf->flags & BM_FREE)) \
)
-#define NotInQueue(bf) \
+#define IsNotInQueue(bf) \
( \
AssertMacro((bf->freeNext == INVALID_DESCRIPTOR)), \
AssertMacro((bf->freePrev == INVALID_DESCRIPTOR)), \
* the manner in which buffers are added to the freelist queue.
* Currently, they are added on an LRU basis.
*/
-void
+static void
AddBufferToFreelist(BufferDesc *bf)
{
#ifdef BMTRACE
_bm_trace(bf->tag.relId.dbId, bf->tag.relId.relId, bf->tag.blockNum,
BufferDescriptorGetBuffer(bf), BMT_DEALLOC);
#endif /* BMTRACE */
- NotInQueue(bf);
+ IsNotInQueue(bf);
/* change bf so it points to inFrontOfNew and its successor */
bf->freePrev = SharedFreeList->freePrev;
/*
* PinBuffer -- make buffer unavailable for replacement.
+ *
+ * This should be applied only to shared buffers, never local ones.
+ * Bufmgr lock must be held by caller.
*/
void
PinBuffer(BufferDesc *buf)
{
- long b;
-
- /* Assert (buf->refcount < 25); */
+ int b = BufferDescriptorGetBuffer(buf) - 1;
if (buf->refcount == 0)
{
buf->flags &= ~BM_FREE;
}
else
- NotInQueue(buf);
+ IsNotInQueue(buf);
- b = BufferDescriptorGetBuffer(buf) - 1;
- Assert(PrivateRefCount[b] >= 0);
if (PrivateRefCount[b] == 0)
buf->refcount++;
PrivateRefCount[b]++;
+ Assert(PrivateRefCount[b] > 0);
}
#ifdef NOT_USED
/*
* UnpinBuffer -- make buffer available for replacement.
+ *
+ * This should be applied only to shared buffers, never local ones.
+ * Bufmgr lock must be held by caller.
*/
void
UnpinBuffer(BufferDesc *buf)
{
- long b = BufferDescriptorGetBuffer(buf) - 1;
+ int b = BufferDescriptorGetBuffer(buf) - 1;
+ IsNotInQueue(buf);
Assert(buf->refcount > 0);
Assert(PrivateRefCount[b] > 0);
PrivateRefCount[b]--;
if (PrivateRefCount[b] == 0)
buf->refcount--;
- NotInQueue(buf);
if (buf->refcount == 0)
{
+ /* buffer is now unpinned */
AddBufferToFreelist(buf);
buf->flags |= BM_FREE;
}
+ else if ((buf->flags & BM_PIN_COUNT_WAITER) != 0 &&
+ buf->refcount == 1)
+ {
+ /* we just released the last pin other than the waiter's */
+ buf->flags &= ~BM_PIN_COUNT_WAITER;
+ ProcSendSignal(buf->wait_backend_id);
+ }
else
{
/* do nothing */
/*
* GetFreeBuffer() -- get the 'next' buffer from the freelist.
- *
*/
BufferDesc *
-GetFreeBuffer()
+GetFreeBuffer(void)
{
BufferDesc *buf;
if (Free_List_Descriptor == SharedFreeList->freeNext)
{
-
/* queue is empty. All buffers in the buffer pool are pinned. */
- elog(ERROR, "out of free buffers: time to abort !\n");
+ elog(ERROR, "out of free buffers: time to abort!");
return NULL;
}
buf = &(BufferDescriptors[SharedFreeList->freeNext]);
if (init)
{
- /* we only do this once, normally the postmaster */
+ /* we only do this once, normally in the postmaster */
SharedFreeList->data = INVALID_OFFSET;
SharedFreeList->flags = 0;
SharedFreeList->flags &= ~(BM_VALID | BM_DELETED | BM_FREE);
buf = &(BufferDescriptors[SharedFreeList->freeNext]);
for (i = 0; i < nfree; i++, buf = &(BufferDescriptors[buf->freeNext]))
{
-
if (!(buf->flags & (BM_FREE)))
{
if (buf != SharedFreeList)
- {
printf("\tfree list corrupted: %d flags %x\n",
buf->buf_id, buf->flags);
- }
else
- {
printf("\tfree list corrupted: too short -- %d not %d\n",
i, nfree);
-
- }
-
-
}
if ((BufferDescriptors[buf->freeNext].freePrev != buf->buf_id) ||
(BufferDescriptors[buf->freePrev].freeNext != buf->buf_id))
- {
printf("\tfree list links corrupted: %d %ld %ld\n",
buf->buf_id, buf->freePrev, buf->freeNext);
- }
-
}
if (buf != SharedFreeList)
- {
printf("\tfree list corrupted: %d-th buffer is %d\n",
nfree, buf->buf_id);
-
- }
}
#endif
*
*
* IDENTIFICATION
- * $Header: /cvsroot/pgsql/src/backend/storage/ipc/sinval.c,v 1.34 2001/06/19 19:42:15 tgl Exp $
+ * $Header: /cvsroot/pgsql/src/backend/storage/ipc/sinval.c,v 1.35 2001/07/06 21:04:26 tgl Exp $
*
*-------------------------------------------------------------------------
*/
#include <sys/types.h>
-#include "storage/backendid.h"
#include "storage/proc.h"
#include "storage/sinval.h"
#include "storage/sinvaladt.h"
return (urec);
}
+
+/*
+ * BackendIdGetProc - given a BackendId, find its PROC structure
+ *
+ * This is a trivial lookup in the ProcState array. We assume that the caller
+ * knows that the backend isn't going to go away, so we do not bother with
+ * locking.
+ */
+struct proc *
+BackendIdGetProc(BackendId procId)
+{
+ SISeg *segP = shmInvalBuffer;
+
+ if (procId > 0 && procId <= segP->lastBackend)
+ {
+ ProcState *stateP = &segP->procState[procId - 1];
+ SHMEM_OFFSET pOffset = stateP->procStruct;
+
+ if (pOffset != INVALID_OFFSET)
+ {
+ PROC *proc = (PROC *) MAKE_PTR(pOffset);
+
+ return proc;
+ }
+ }
+
+ return NULL;
+}
*
*
* IDENTIFICATION
- * $Header: /cvsroot/pgsql/src/backend/storage/lmgr/proc.c,v 1.103 2001/06/16 22:58:16 tgl Exp $
+ * $Header: /cvsroot/pgsql/src/backend/storage/lmgr/proc.c,v 1.104 2001/07/06 21:04:26 tgl Exp $
*
*-------------------------------------------------------------------------
*/
#include "access/xact.h"
#include "storage/proc.h"
+#include "storage/sinval.h"
int DeadlockTimeout = 1000;
PROC *MyProc = NULL;
static bool waitingForLock = false;
+static bool waitingForSignal = false;
static void ProcKill(void);
static void ProcGetNewSemIdAndNum(IpcSemaphoreId *semId, int *semNum);
AbortBufferIO();
}
+/*
+ * ProcWaitForSignal - wait for a signal from another backend.
+ *
+ * This can share the semaphore normally used for waiting for locks,
+ * since a backend could never be waiting for a lock and a signal at
+ * the same time. As with locks, it's OK if the signal arrives just
+ * before we actually reach the waiting state.
+ */
+void
+ProcWaitForSignal(void)
+{
+ waitingForSignal = true;
+ IpcSemaphoreLock(MyProc->sem.semId, MyProc->sem.semNum, true);
+ waitingForSignal = false;
+}
+
+/*
+ * ProcCancelWaitForSignal - clean up an aborted wait for signal
+ *
+ * We need this in case the signal arrived after we aborted waiting,
+ * or if it arrived but we never reached ProcWaitForSignal() at all.
+ * Caller should call this after resetting the signal request status.
+ */
+void
+ProcCancelWaitForSignal(void)
+{
+ ZeroProcSemaphore(MyProc);
+ waitingForSignal = false;
+}
+
+/*
+ * ProcSendSignal - send a signal to a backend identified by BackendId
+ */
+void
+ProcSendSignal(BackendId procId)
+{
+ PROC *proc = BackendIdGetProc(procId);
+
+ if (proc != NULL)
+ IpcSemaphoreUnlock(proc->sem.semId, proc->sem.semNum);
+}
+
+
/*****************************************************************************
*
*****************************************************************************/
* Portions Copyright (c) 1996-2001, PostgreSQL Global Development Group
* Portions Copyright (c) 1994, Regents of the University of California
*
- * $Id: buf_internals.h,v 1.48 2001/03/22 04:01:05 momjian Exp $
+ * $Id: buf_internals.h,v 1.49 2001/07/06 21:04:26 tgl Exp $
*
*-------------------------------------------------------------------------
*/
#ifndef BUFMGR_INTERNALS_H
#define BUFMGR_INTERNALS_H
+#include "storage/backendid.h"
#include "storage/buf.h"
#include "storage/lmgr.h"
#include "storage/s_lock.h"
+
/* Buf Mgr constants */
/* in bufmgr.c */
extern int Data_Descriptors;
#define BM_IO_IN_PROGRESS (1 << 5)
#define BM_IO_ERROR (1 << 6)
#define BM_JUST_DIRTIED (1 << 7)
+#define BM_PIN_COUNT_WAITER (1 << 8)
typedef bits16 BufFlags;
+/*
+ * Buffer tag identifies which disk block the buffer contains.
+ *
+ * Note: the BufferTag data must be sufficient to determine where to write the
+ * block, even during a "blind write" with no relcache entry. It's possible
+ * that the backend flushing the buffer doesn't even believe the relation is
+ * visible yet (its xact may have started before the xact that created the
+ * rel). The storage manager must be able to cope anyway.
+ */
typedef struct buftag
{
RelFileNode rnode;
(a)->rnode = (xx_reln)->rd_node \
)
-/*
- * We don't need this data any more but it allows more user
- * friendly error messages. Feel free to get rid of it
- * (and change a lot of places -:))
- */
-typedef struct bufblindid
-{
- char dbname[NAMEDATALEN]; /* name of db in which buf belongs */
- char relname[NAMEDATALEN]; /* name of reln */
-} BufferBlindId;
-
/*
* BufferDesc -- shared buffer cache metadata for a single
* shared buffer descriptor.
- *
- * We keep the name of the database and relation in which this
- * buffer appears in order to avoid a catalog lookup on cache
- * flush if we don't have the reldesc in the cache. It is also
- * possible that the relation to which this buffer belongs is
- * not visible to all backends at the time that it gets flushed.
- * Dbname, relname, dbid, and relid are enough to determine where
- * to put the buffer, for all storage managers.
*/
typedef struct sbufdesc
{
Buffer freePrev;
SHMEM_OFFSET data; /* pointer to data in buf pool */
- /* tag and id must be together for table lookup to work */
+ /* tag and id must be together for table lookup (still true?) */
BufferTag tag; /* file/block identifier */
- int buf_id; /* maps global desc to local desc */
+ int buf_id; /* buffer's index number (from 0) */
BufFlags flags; /* see bit definitions above */
- unsigned refcount; /* # of times buffer is pinned */
+ unsigned refcount; /* # of backends holding pins on buffer */
- slock_t io_in_progress_lock; /* to block for I/O to complete */
+ slock_t io_in_progress_lock; /* to wait for I/O to complete */
slock_t cntx_lock; /* to lock access to page context */
unsigned r_locks; /* # of shared locks */
bool cntxDirty; /* new way to mark block as dirty */
- BufferBlindId blind; /* was used to support blind write */
-
/*
- * When we can't delete item from page (someone else has buffer
- * pinned) we mark buffer for cleanup by specifying appropriate for
- * buffer content cleanup function. Buffer will be cleaned up from
- * release buffer functions.
+ * We can't physically remove items from a disk page if another backend
+ * has the buffer pinned. Hence, a backend may need to wait for all
+ * other pins to go away. This is signaled by setting its own backend ID
+ * into wait_backend_id and setting flag bit BM_PIN_COUNT_WAITER.
+ * At present, there can be only one such waiter per buffer.
*/
- void (*CleanupFunc) (Buffer);
+ BackendId wait_backend_id; /* backend ID of pin-count waiter */
} BufferDesc;
#define BufferDescriptorGetBuffer(bdesc) ((bdesc)->buf_id + 1)
#define BL_R_LOCK (1 << 1)
#define BL_RI_LOCK (1 << 2)
#define BL_W_LOCK (1 << 3)
+#define BL_PIN_COUNT_LOCK (1 << 4)
/*
* mao tracing buffer allocation
*/
/*#define BMTRACE*/
+
#ifdef BMTRACE
typedef struct _bmtrace
{
int bmt_pid;
- long bmt_buf;
- long bmt_dbid;
- long bmt_relid;
- int bmt_blkno;
+ int bmt_buf;
+ Oid bmt_dbid;
+ Oid bmt_relid;
+ BlockNumber bmt_blkno;
int bmt_op;
#define BMT_NOTUSED 0
/* Internal routines: only called by buf.c */
/*freelist.c*/
-extern void AddBufferToFreelist(BufferDesc *bf);
extern void PinBuffer(BufferDesc *buf);
-extern void PinBuffer_Debug(char *file, int line, BufferDesc *buf);
extern void UnpinBuffer(BufferDesc *buf);
extern BufferDesc *GetFreeBuffer(void);
extern void InitFreeList(bool init);
extern BufferDesc *BufferDescriptors;
extern bits8 *BufferLocks;
extern BufferTag *BufferTagLastDirtied;
-extern BufferBlindId *BufferBlindLastDirtied;
extern LockRelId *BufferRelidLastDirtied;
extern bool *BufferDirtiedByMe;
extern SPINLOCK BufMgrLock;
* Portions Copyright (c) 1996-2001, PostgreSQL Global Development Group
* Portions Copyright (c) 1994, Regents of the University of California
*
- * $Id: bufmgr.h,v 1.53 2001/06/29 21:08:25 tgl Exp $
+ * $Id: bufmgr.h,v 1.54 2001/07/06 21:04:26 tgl Exp $
*
*-------------------------------------------------------------------------
*/
extern void PrintBufferUsage(FILE *statfp);
extern void ResetBufferUsage(void);
extern void ResetBufferPool(bool isCommit);
-extern int BufferPoolCheckLeak(void);
+extern bool BufferPoolCheckLeak(void);
extern void FlushBufferPool(void);
extern BlockNumber BufferGetBlockNumber(Buffer buffer);
extern BlockNumber RelationGetNumberOfBlocks(Relation relation);
extern void UnlockBuffers(void);
extern void LockBuffer(Buffer buffer, int mode);
-extern void AbortBufferIO(void);
+extern void LockBufferForCleanup(Buffer buffer);
-extern bool BufferIsUpdatable(Buffer buffer);
-extern void MarkBufferForCleanup(Buffer buffer, void (*CleanupFunc) (Buffer));
+extern void AbortBufferIO(void);
extern void BufmgrCommit(void);
extern void BufferSync(void);
* Portions Copyright (c) 1996-2001, PostgreSQL Global Development Group
* Portions Copyright (c) 1994, Regents of the University of California
*
- * $Id: proc.h,v 1.44 2001/06/16 22:58:17 tgl Exp $
+ * $Id: proc.h,v 1.45 2001/07/06 21:04:26 tgl Exp $
*
*-------------------------------------------------------------------------
*/
#define _PROC_H_
#include "access/xlog.h"
+#include "storage/backendid.h"
#include "storage/lock.h"
/* configurable option */
extern bool LockWaitCancel(void);
extern void HandleDeadLock(SIGNAL_ARGS);
+extern void ProcWaitForSignal(void);
+extern void ProcCancelWaitForSignal(void);
+extern void ProcSendSignal(BackendId procId);
+
#endif /* PROC_H */
* Portions Copyright (c) 1996-2001, PostgreSQL Global Development Group
* Portions Copyright (c) 1994, Regents of the University of California
*
- * $Id: sinval.h,v 1.19 2001/06/19 19:42:16 tgl Exp $
+ * $Id: sinval.h,v 1.20 2001/07/06 21:04:26 tgl Exp $
*
*-------------------------------------------------------------------------
*/
#ifndef SINVAL_H
#define SINVAL_H
+#include "storage/backendid.h"
#include "storage/itemptr.h"
#include "storage/spin.h"
extern bool TransactionIdIsInProgress(TransactionId xid);
extern void GetXmaxRecent(TransactionId *XmaxRecent);
extern int CountActiveBackends(void);
+/* Use "struct proc", not PROC, to avoid including proc.h here */
+extern struct proc *BackendIdGetProc(BackendId procId);
#endif /* SINVAL_H */