From: Tom Lane Date: Mon, 26 Aug 2019 19:59:44 +0000 (-0400) Subject: Fix postmaster state machine to handle dead_end child crashes better. X-Git-Url: https://granicus.if.org/sourcecode?a=commitdiff_plain;h=ee3278239550ff0ec9df72dd798a480c82c2b723;p=postgresql Fix postmaster state machine to handle dead_end child crashes better. A report from Alvaro Herrera shows that if we're in PM_STARTUP state, and we spawn a dead_end child to reject some incoming connection request, and that child dies with an unexpected exit code, the postmaster does not respond well. We correctly send SIGQUIT to the startup process, but then: * if the startup process exits with nonzero exit code, as expected, we thought that that indicated a crash and aborted startup. * if the startup process exits with zero exit code, which is possible due to the inherent race condition, we'd advance to PM_RUN state which is fine --- but the code forgot that AbortStartTime would be nonzero in this situation. We'd either die on the Asserts saying that it was zero, or perhaps misbehave later on. (A quick look suggests that the only misbehavior might be busy-waiting due to DetermineSleepTime doing the wrong thing.) To fix the first point, adjust the state-machine logic to recognize that a nonzero exit code is expected after sending SIGQUIT, and have it transition to a state where we can restart the startup process. To fix the second point, change the Asserts to clear the variable rather than just claiming it should be clear already. Perhaps we could improve this further by not treating a crash of a dead_end child as a reason for panic'ing the database. However, since those child processes are connected to shared memory, that seems a bit risky. There are few good reasons for a dead_end child to report failure anyway (the cause of this in Alvaro's report is quite unclear). On balance, therefore, a minimal fix seems best. This is an oversight in commit 45811be94. While that was back-patched, I'm hesitant to back-patch this change. The lack of reasons for a dead_end child to fail suggests that the case should be very rare in the field, which squares with the lack of reports; so it seems like this might not be worth the risk of introducing new issues. In any case we can let it bake awhile in HEAD before considering a back-patch. Discussion: https://postgr.es/m/20190615160950.GA31378@alvherre.pgsql --- diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c index 3339804be9..62dc93d56b 100644 --- a/src/backend/postmaster/postmaster.c +++ b/src/backend/postmaster/postmaster.c @@ -2920,7 +2920,9 @@ reaper(SIGNAL_ARGS) * during PM_STARTUP is treated as catastrophic. There are no * other processes running yet, so we can just exit. */ - if (pmState == PM_STARTUP && !EXIT_STATUS_0(exitstatus)) + if (pmState == PM_STARTUP && + StartupStatus != STARTUP_SIGNALED && + !EXIT_STATUS_0(exitstatus)) { LogChildExit(LOG, _("startup process"), pid, exitstatus); @@ -2937,11 +2939,24 @@ reaper(SIGNAL_ARGS) * then we previously sent the startup process a SIGQUIT; so * that's probably the reason it died, and we do want to try to * restart in that case. + * + * This stanza also handles the case where we sent a SIGQUIT + * during PM_STARTUP due to some dead_end child crashing: in that + * situation, if the startup process dies on the SIGQUIT, we need + * to transition to PM_WAIT_BACKENDS state which will allow + * PostmasterStateMachine to restart the startup process. (On the + * other hand, the startup process might complete normally, if we + * were too late with the SIGQUIT. In that case we'll fall + * through and commence normal operations.) */ if (!EXIT_STATUS_0(exitstatus)) { if (StartupStatus == STARTUP_SIGNALED) + { StartupStatus = STARTUP_NOT_RUNNING; + if (pmState == PM_STARTUP) + pmState = PM_WAIT_BACKENDS; + } else StartupStatus = STARTUP_CRASHED; HandleChildCrash(pid, exitstatus, @@ -2954,7 +2969,7 @@ reaper(SIGNAL_ARGS) */ StartupStatus = STARTUP_NOT_RUNNING; FatalError = false; - Assert(AbortStartTime == 0); + AbortStartTime = 0; ReachedNormalRunning = true; pmState = PM_RUN; @@ -3504,7 +3519,7 @@ HandleChildCrash(int pid, int exitstatus, const char *procname) if (pid == StartupPID) { StartupPID = 0; - StartupStatus = STARTUP_CRASHED; + /* Caller adjusts StartupStatus, so don't touch it here */ } else if (StartupPID != 0 && take_action) { @@ -5100,7 +5115,7 @@ sigusr1_handler(SIGNAL_ARGS) { /* WAL redo has started. We're out of reinitialization. */ FatalError = false; - Assert(AbortStartTime == 0); + AbortStartTime = 0; /* * Crank up the background tasks. It doesn't matter if this fails,