MEDIUM: proto_tcp: make the pause() more robust in multi-process

In multi-process mode, the TCP pause is very brittle, and we never noticed
it because the error was lost in the upper layers. The problem is that
shutdown() may fail if another process has already done it, which causes
the calling process to fail to pause.

On error, we now double-check the socket's state to verify whether it is
still accepting connections. If it is not, we can conclude that another
process already did the job in parallel.
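
The state check described above can be sketched in isolation as follows.
This is only an illustration, not code from the patch: the helper names
socket_is_listening() and make_bound_socket() are hypothetical, and the
sketch assumes a platform that supports SO_ACCEPTCONN through getsockopt()
(Linux does).

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: returns 1 if <fd> is a listening socket, 0 if it
 * is not, and -1 if its state cannot be queried at all.
 */
static int socket_is_listening(int fd)
{
	int val = 0;
	socklen_t len = sizeof(val);

	if (getsockopt(fd, SOL_SOCKET, SO_ACCEPTCONN, &val, &len) == -1)
		return -1;
	return val != 0;
}

/* Hypothetical helper: creates a TCP socket bound to an ephemeral port on
 * the loopback address. The socket is NOT listening yet. Returns the file
 * descriptor, or -1 on error.
 */
static int make_bound_socket(void)
{
	struct sockaddr_in addr;
	int fd = socket(AF_INET, SOCK_STREAM, 0);

	if (fd < 0)
		return -1;

	memset(&addr, 0, sizeof(addr));
	addr.sin_family = AF_INET;
	addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
	addr.sin_port = 0; /* let the kernel pick a free port */

	if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) != 0) {
		close(fd);
		return -1;
	}
	return fd;
}
```

A socket returned by make_bound_socket() is reported as not listening
until listen() is called on it. A process that finds a shared listener in
the not-listening state after its own shutdown() failed can reasonably
assume that a sibling process paused it first.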

The difficulty here is that we're trying to eliminate false positives:
some OSes silently report success on shutdown() without actually shutting
the socket down, hence this dance of shutw/listen/shutr that only keeps
the compatible ones. A new approach relying on connect(AF_UNSPEC) would
probably provide better results.
Willy Tarreau 2020-10-08 16:51:09 +02:00
parent 1accacbcc3
commit 91c614dd0e


@@ -732,15 +732,34 @@ static void tcpv6_add_listener(struct listener *listener, int port)
  */
 int tcp_pause_listener(struct listener *l)
 {
+	socklen_t opt_val, opt_len;
 	if (shutdown(l->rx.fd, SHUT_WR) != 0)
-		return -1; /* Solaris dies here */
+		goto check_already_done; /* usually Solaris fails here */
 	if (listen(l->rx.fd, listener_backlog(l)) != 0)
-		return -1; /* OpenBSD dies here */
+		goto check_already_done; /* usually OpenBSD fails here */
 	if (shutdown(l->rx.fd, SHUT_RD) != 0)
-		return -1; /* should always be OK */
+		goto check_already_done; /* should always be OK */
 	return 1;
+ check_already_done:
+	/* in case one of the shutdown() above fails, it might be because we're
+	 * dealing with a socket that is shared with other processes doing the
+	 * same. Let's check if it's still accepting connections.
+	 */
+	opt_val = 0;
+	opt_len = sizeof(opt_val);
+	if (getsockopt(l->rx.fd, SOL_SOCKET, SO_ACCEPTCONN, &opt_val, &opt_len) == -1)
+		return 0; /* the socket is really unrecoverable */
+	if (!opt_val)
+		return 1; /* already paused by another process */
+	/* something looks fishy here */
+	return -1;
 }