mirror of
http://git.haproxy.org/git/haproxy.git/
synced 2025-05-15 06:08:09 +00:00
BUG/MAJOR: fd/thread: fix race between updates and closing FD
While running some L7 retries tests, Christopher and I stumbled upon a very strange behavior showing some occasional server timeouts when the server closes keep-alive connections quickly. The issue can be reproduced with the following config: global expose-experimental-directives #tune.fd.edge-triggered on # can speed up the issue defaults mode http timeout client 5s timeout server 10s timeout connect 2s listen f bind :8001 http-reuse always retry-on all-retryable-errors server next 127.0.0.1:8002 frontend b bind :8002 timeout http-keep-alive 1 # one ms redirect location / Sending fast requests without reusing the client connection on port 8001 with a single connection and at least 3 threads on haproxy occasionally shows some glitches pauses (below with timeout server 2s): $ taskset -c 2,3 h1load -e -t 1 -r 1 -c 1 http://127.0.0.1:8001/ # time conns tot_conn tot_req tot_bytes err cps rps bps ttfb 1 1 9794 9793 959714 0 9k79 9k79 7M67 42.94u 2 1 9794 9793 959714 0 0.00 0.00 0.00 - 3 1 9794 9793 959714 0 0.00 0.00 0.00 - 4 0 16015 16015 1569470 0 6k22 6k22 4M87 522.9u 5 0 18657 18656 1828190 2 2k63 2k63 2M06 39.22u If this doesn't happen, limiting to a request rate close to 1/timeout may help. What is happening is that after several migrations, a late report via fd_update_events() may detect that the thread is not welcome, and will want to program an update so that the current thread's poller disables its polling on it. It is allowed to do so because it used fd_grab_tgid(). But what if _fd_delete_orphan() was just starting to be called and already reset the update_mask ? We'll end up with a bit present in the update mask, then _fd_delete_orphan() resets the tgid, which will prevent the poller from consuming that update. The update is not needed anymore since the FD was closed, but in this case nobody will clear this bit until the same FD is reused again and cleared. And as long as the thread's bit remains in the update_mask, no new updates will be programmed for the next use of this FD on the same thread since due to the bit being present, fd_nbupdt will not be changed. This is what is causing this timeout. The fix consists in making sure _fd_delete_orphan() waits for the occasional watchers to leave, and to do this before clearing the update_mask. This will be either fd_update_events() trying to check its thread_mask, or the poller checking its updates, so that's pretty short. But it definitely closes this race. This fix is needed since the introduction of fd_grab_tgid(), hence 2.7. Note that while testing the fix, another related issue concerning the atomicity of running_mask vs thread_mask popped up and will have to be fixed till 2.5 as part of another patch. It may make the tests for this fix occasionally tigger a few BUG_ON() or face a null conn->subs in sock_conn_iocb(), though these ones are much more difficult to trigger. This is not caused by this fix.
This commit is contained in:
parent
315a4f6ae5
commit
237e6a0d65
14
src/fd.c
14
src/fd.c
@ -319,6 +319,19 @@ void _fd_delete_orphan(int fd)
|
||||
if (cur_poller.clo)
|
||||
cur_poller.clo(fd);
|
||||
|
||||
/* now we're about to reset some of this FD's fields. We don't want
|
||||
* anyone to grab it anymore and we need to make sure those which could
|
||||
* possibly have stumbled upon it right now are leaving before we
|
||||
* proceed. This is done in two steps. First we reset the tgid so that
|
||||
* fd_take_tgid() and fd_grab_tgid() fail, then we wait for existing
|
||||
* ref counts to drop. Past this point we're alone dealing with the
|
||||
* FD's thead/running/update/polled masks.
|
||||
*/
|
||||
fd_reset_tgid(fd);
|
||||
|
||||
while (_HA_ATOMIC_LOAD(&fdtab[fd].refc_tgid) != 0) // refc==0 ?
|
||||
__ha_cpu_relax();
|
||||
|
||||
/* we don't want this FD anymore in the global list */
|
||||
fd_rm_from_fd_list(&update_list[tgrp - 1], fd);
|
||||
|
||||
@ -331,7 +344,6 @@ void _fd_delete_orphan(int fd)
|
||||
polled_mask[fd].poll_recv = polled_mask[fd].poll_send = 0;
|
||||
|
||||
fdtab[fd].state = 0;
|
||||
fd_reset_tgid(fd);
|
||||
|
||||
#ifdef DEBUG_FD
|
||||
fdtab[fd].event_count = 0;
|
||||
|
Loading…
Reference in New Issue
Block a user