Read from the same worker repeatedly until it returns no tuple.
The original coding read tuples from workers in round-robin fashion,
but performance testing shows that it works much better to read enough
to empty one queue before moving on to the next. I believe the
reason for this is that, with the old approach, we could easily wake
up a worker repeatedly to write only one new tuple into the shm_mq
each time. With this approach, by the time the process gets scheduled,
it has a decent chance of being able to fill the entire buffer in
one go.
Patch by me. Dilip Kumar helped with performance testing.