During parallel index scans, if the current page to be read is deleted, we
skip it and try to get the next page for the scan without releasing the
buffer lock on the current page. To get the next page, a worker sometimes
has to wait for another process to finish with its page and advance the
scan to the next page.
Now, it is quite possible that the master backend has errored out before
advancing the scan and has issued a termination signal to all workers. The
workers fail to notice the termination request during the wait because
interrupts are held due to the buffer lock on the previous page. This led
to all workers being stuck.
The fix is to release the buffer lock on the current page before trying to
get the next page. We were already doing the same in backward scans, but
missed it for forward scans.
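The ordering problem described above can be shown with a small toy model.
This is only a sketch under stated assumptions: every toy_* name below is
hypothetical and none of it is PostgreSQL code. It assumes that taking a
buffer lock bumps a hold-off counter, and that a pending termination request
is acted on only while that counter is zero.

```c
#include <stdbool.h>

/* Toy stand-ins for illustration only; none of these are PostgreSQL APIs. */
static int  toy_holdoff_count = 0;          /* > 0 while a buffer lock is held */
static bool toy_terminate_pending = false;  /* set when workers are told to exit */

static void toy_lock_buffer(void)   { toy_holdoff_count++; }
static void toy_unlock_buffer(void) { toy_holdoff_count--; }

/* A pending termination request is acted on only when interrupts are not held. */
static bool toy_check_for_interrupts(void)
{
    return toy_terminate_pending && toy_holdoff_count == 0;
}

/*
 * Wait for another process to advance the scan.  In this model nobody ever
 * does, so the only way out is to notice the termination request.  The loop
 * gives up after max_spins so the demonstration itself cannot hang.
 * Returns true if the termination request was noticed.
 */
static bool toy_wait_for_next_page(int max_spins)
{
    for (int i = 0; i < max_spins; i++)
    {
        if (toy_check_for_interrupts())
            return true;
    }
    return false;
}

/* Run one worker step with either the old or the fixed ordering. */
bool toy_worker_notices_termination(bool release_lock_first)
{
    bool noticed;

    toy_holdoff_count = 0;
    toy_terminate_pending = true;   /* the master has already errored out */

    toy_lock_buffer();              /* lock on the current (deleted) page */
    if (release_lock_first)
    {
        toy_unlock_buffer();        /* fixed order: release, then wait */
        noticed = toy_wait_for_next_page(1000);
    }
    else
    {
        noticed = toy_wait_for_next_page(1000);  /* old order: wait while locked */
        toy_unlock_buffer();
    }
    return noticed;
}
```

In this model, toy_worker_notices_termination(false) returns false (the
waiter never sees the request while the lock is held), and
toy_worker_notices_termination(true) returns true.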
Reported-by: Victor Yegorov
Bug: 15290
Diagnosed-by: Thomas Munro and Amit Kapila
Author: Amit Kapila
Reviewed-by: Thomas Munro
Tested-By: Thomas Munro and Victor Yegorov
Backpatch-through: 10 where parallel index scans were introduced
Discussion: https://postgr.es/m/153228422922.1395.1746424054206154747@wrigleys.postgresql.org
/* nope, keep going */
if (scan->parallel_scan != NULL)
{
+ _bt_relbuf(rel, so->currPos.buf);
status = _bt_parallel_seize(scan, &blkno);
if (!status)
{
- _bt_relbuf(rel, so->currPos.buf);
BTScanPosInvalidate(so->currPos);
return false;
}
}
else
+ {
blkno = opaque->btpo_next;
- _bt_relbuf(rel, so->currPos.buf);
+ _bt_relbuf(rel, so->currPos.buf);
+ }
}
}
else