Improve comment for tricky aspect of index-only scans.

author Jeff Davis <jdavis@postgresql.org>

Sun, 4 May 2014 20:18:55 +0000 (13:18 -0700)

committer Jeff Davis <jdavis@postgresql.org>

Wed, 7 May 2014 02:27:43 +0000 (19:27 -0700)
diff --git a/src/backend/executor/nodeIndexonlyscan.c b/src/backend/executor/nodeIndexonlyscan.c

index c55723608d60460b2acad6974ec65ab0183dfb40..afcd1ff353e71ab39616cd49919c1f8a5d6bd54b 100644 (file)
--- a/src/backend/executor/nodeIndexonlyscan.c
+++ b/src/backend/executor/nodeIndexonlyscan.c
@@ -88,15 +88,31 @@ IndexOnlyNext(IndexOnlyScanState *node)
                  * Note on Memory Ordering Effects: visibilitymap_test does not lock
                  * the visibility map buffer, and therefore the result we read here
                  * could be slightly stale.  However, it can't be stale enough to
-                * matter.  It suffices to show that (1) there is a read barrier
-                * between the time we read the index TID and the time we test the
-                * visibility map; and (2) there is a write barrier between the time
-                * some other concurrent process clears the visibility map bit and the
-                * time it inserts the index TID.  Since acquiring or releasing a
-                * LWLock interposes a full barrier, this is easy to show: (1) is
-                * satisfied by the release of the index buffer content lock after
-                * reading the TID; and (2) is satisfied by the acquisition of the
-                * buffer content lock in order to insert the TID.
+                * matter.
+                *
+                * We need to detect clearing a VM bit due to an insert right away,
+                * because the tuple is present in the index page but not visible. The
+                * reading of the TID by this scan (using a shared lock on the index
+                * buffer) is serialized with the insert of the TID into the index
+                * (using an exclusive lock on the index buffer). Because the VM bit
+                * is cleared before updating the index, and locking/unlocking of the
+                * index page acts as a full memory barrier, we are sure to see the
+                * cleared bit if we see a recently-inserted TID.
+                *
+                * Deletes do not update the index page (only VACUUM will clear out
+                * the TID), so the clearing of the VM bit by a delete is not
+                * serialized with this test below, and we may see a value that is
+                * significantly stale. However, we don't care about the delete right
+                * away, because the tuple is still visible until the deleting
+                * transaction commits or the statement ends (if it's our
+                * transaction). In either case, the lock on the VM buffer will have
+                * been released (acting as a write barrier) after clearing the
+                * bit. And for us to have a snapshot that includes the deleting
+                * transaction (making the tuple invisible), we must have acquired
+                * ProcArrayLock after that time, acting as a read barrier.
+                *
+                * It's worth going through this complexity to avoid needing to lock
+                * the VM buffer, which could cause significant contention.
                  */
                 if (!visibilitymap_test(scandesc->heapRelation,
                                                                 ItemPointerGetBlockNumber(tid),
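
The insert-path argument in the new comment can be checked with a toy model. The sketch below is not PostgreSQL code: the event names and both functions are hypothetical, and it assumes the four events are totally ordered, which is what the full barriers implied by locking and unlocking the index page provide in the comment's argument. It enumerates every interleaving of the inserter's two steps with the scan's two steps and counts the cases where the scan sees the new TID but still sees the VM bit set.

```c
#include <stdbool.h>

/* Hypothetical event labels for the toy model; not PostgreSQL identifiers. */
enum { W_CLEAR_VM, W_INSERT_TID, R_READ_TID, R_TEST_VM };

/*
 * Replay one interleaving of the four events against a toy shared state.
 * Returns true for the dangerous outcome: the scan saw the newly inserted
 * TID but still saw the VM bit set.
 */
static bool
replay_is_unsafe(const int *order)
{
    bool vm_bit_set = true;     /* heap page starts out all-visible */
    bool tid_in_index = false;  /* TID not yet inserted */
    bool scan_saw_tid = false;
    bool scan_saw_bit = false;

    for (int i = 0; i < 4; i++)
    {
        switch (order[i])
        {
            case W_CLEAR_VM:   vm_bit_set = false;          break;
            case W_INSERT_TID: tid_in_index = true;         break;
            case R_READ_TID:   scan_saw_tid = tid_in_index; break;
            case R_TEST_VM:    scan_saw_bit = vm_bit_set;   break;
        }
    }
    return scan_saw_tid && scan_saw_bit;
}

/*
 * Count unsafe interleavings.  w0 and w1 are the inserter's two steps in
 * program order; the scan always reads the TID and then tests the VM bit.
 */
int
count_unsafe_interleavings(int w0, int w1)
{
    const int reader[2] = {R_READ_TID, R_TEST_VM};
    int unsafe = 0;

    /* Pick slots p0 < p1 for the inserter; the scan fills the other two. */
    for (int p0 = 0; p0 < 4; p0++)
    {
        for (int p1 = p0 + 1; p1 < 4; p1++)
        {
            int order[4];
            int r = 0;

            for (int s = 0; s < 4; s++)
            {
                if (s == p0)
                    order[s] = w0;
                else if (s == p1)
                    order[s] = w1;
                else
                    order[s] = reader[r++];
            }
            if (replay_is_unsafe(order))
                unsafe++;
        }
    }
    return unsafe;
}
```

Calling `count_unsafe_interleavings(W_CLEAR_VM, W_INSERT_TID)` models the order the comment describes (VM bit cleared before the index is updated) and finds no unsafe interleaving. Swapping the arguments models the opposite order and does find one, which is why the ordering matters. The model says nothing about the delete path, where the comment relies on the VM buffer lock release and ProcArrayLock acquisition instead.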