BUG/MEDIUM: fd/threads: fix a concurrency issue between add and rm on the same fd
There's a very hard-to-trigger bug in the FD list code where the fd_add_to_fd_list() function assumes that if the FD it's trying to add is already locked, it's in the process of being added. Unfortunately, it can also be in the process of being removed.

It is very hard to trigger because it requires that one thread is removing the FD while another one is adding it. First, very few FDs run on multiple threads (listeners and DNS), and second, it does not make sense to add and remove the FD at the same time.

In practice the DNS code, built on the older callback-only model, does perform bursts of fd_want_send() for all resolvers at once when it wants to send a new query (dns_send_query()). This is more likely to happen when there are lots of resolutions in parallel and many resolvers, because the dns_response_recv() callback can also trigger a series of queries on all resolvers for each invalid response it receives. This means that it really is perfectly possible to both stop and start in parallel during short periods of time there.

This issue was not reported before 2.1, but 2.1 had the FD cache, built on the exact same code base. It's very possible that the issue caused exactly the opposite situation, where an event was occasionally lost, causing a DNS retry that worked, with nobody noticing the problem in the end. In 2.1 the lost entries are the updates asking to stop polling for writes, and the effect is that the poller continuously reports writability on the socket when the issue happens.

This patch fixes bug #416 and must be backported as far as 1.8. It absolutely requires that the previous commit "MINOR: fd/threads: make _GET_NEXT()/_GET_PREV() use the volatile attribute" is backported as well, otherwise it will make the issue worse.

Special thanks to Julien Pivotto for setting up a reliable reproducer for this difficult issue.
commit fc51f0f588 (parent 337fb719ee)
src/fd.c | 4
@@ -128,8 +128,10 @@ void fd_add_to_fd_list(volatile struct fdlist *list, int fd, int off)
 redo_next:
 	next = _GET_NEXT(fd, off);
 	/* Check that we're not already in the cache, and if not, lock us. */
-	if (next >= -2)
+	if (next > -2)
 		goto done;
+	if (next == -2)
+		goto redo_next;
 	if (!_HA_ATOMIC_CAS(&_GET_NEXT(fd, off), &next, -2))
 		goto redo_next;
 	__ha_barrier_atomic_store();
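
To make the change in the hunk above easier to follow, here is a minimal standalone sketch of a single pass through the redo_next logic, before and after the fix. It is not HAProxy code: the names add_step_old(), add_step_fixed() and enum add_step are hypothetical, C11 atomics stand in for _HA_ATOMIC_CAS(), and the sketch assumes that -2 means "locked by some thread" and that any value below -2 (here -3) means "not in the list".

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical outcome of one pass through the redo_next logic. */
enum add_step {
	STEP_DONE,   /* treated as already in the list: nothing to add  */
	STEP_RETRY,  /* must loop back to redo_next and re-read the field */
	STEP_LOCKED  /* we locked the entry and may go on inserting it   */
};

/* Assumed encoding for this sketch: -2 = locked, below -2 = not in list. */
enum { ENTRY_LOCKED = -2, ENTRY_NOT_IN_LIST = -3 };

/* Behaviour before the patch: a locked entry (-2) is taken to be
 * "being added by someone else", so the add is silently dropped.
 */
static enum add_step add_step_old(atomic_int *next_field)
{
	int next;

	assert(next_field != NULL);
	next = atomic_load(next_field);
	if (next >= ENTRY_LOCKED)
		return STEP_DONE;
	if (!atomic_compare_exchange_strong(next_field, &next, ENTRY_LOCKED))
		return STEP_RETRY;
	return STEP_LOCKED;
}

/* Behaviour after the patch: a locked entry may be in the middle of
 * being added OR removed, so the caller must wait and look again.
 */
static enum add_step add_step_fixed(atomic_int *next_field)
{
	int next;

	assert(next_field != NULL);
	next = atomic_load(next_field);
	if (next > ENTRY_LOCKED)
		return STEP_DONE;
	if (next == ENTRY_LOCKED)
		return STEP_RETRY;
	if (!atomic_compare_exchange_strong(next_field, &next, ENTRY_LOCKED))
		return STEP_RETRY;
	return STEP_LOCKED;
}
```

With the field holding -2 because another thread is removing the FD, add_step_old() reports STEP_DONE and the add request is lost, while add_step_fixed() reports STEP_RETRY so the caller re-reads the field once the other thread has released it. The real function spins on redo_next instead of returning; the single-step form is used here only so that each case can be inspected in isolation.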