haproxy/doc/internals/fd-migration.txt

2021-07-30 - File descriptor migration between threads

An FD migration may happen on any idle connection that experiences a takeover()
operation by another thread. In this case the acting thread becomes the owner
of the connection (and FD) while the previous one(s) need to forget about it.

File descriptor migration between threads is a fairly complex operation because
it is required to maintain a durable consistency between the pollers' states
and haproxy's desired state. Indeed, very often the FD is registered within one
thread's poller and that thread might be waiting in the system, so there is no
way to synchronously update it. This is where thread_mask, polled_mask and
per-thread updates are used:

  - a thread knows if it's allowed to manipulate an FD by looking at its bit in
    the FD's thread_mask ;

  - each thread knows if it was polling an FD by looking at its bit in the
    polled_mask field ; a recent migration is usually indicated by a bit being
    present in polled_mask and absent from thread_mask.

  - other threads know whether it's safe to take over an FD by looking at the
    running mask: if it contains any other thread's bit, then other threads are
    using it and it's not safe to take it over.

  - sleeping threads are notified about the need to update their polling via
    local or global updates to the FD. Each thread has its own local update
    list and its own bit in the update_mask to know whether there are pending
    updates for it. This allows polling to reconverge with the desired state
    at the last instant before the thread polls.
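
The rules above can be illustrated with a few predicates over the masks. This
is only a minimal sketch: the structure layout, the 64-bit mask width and the
helper names (tid_to_bit, may_manipulate, migrated_away, safe_to_take_over)
are assumptions made for the example and do not reflect haproxy's actual
fdtab definition.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified view of the per-FD masks (one bit per thread). */
struct fd_masks {
    uint64_t thread_mask;  /* threads allowed to manipulate the FD */
    uint64_t polled_mask;  /* threads whose poller still knows the FD */
    uint64_t running_mask; /* threads currently using the FD */
    uint64_t update_mask;  /* threads with a pending polling update */
};

static inline uint64_t tid_to_bit(unsigned int tid)
{
    assert(tid < 64); /* assumption: at most 64 threads in this sketch */
    return (uint64_t)1 << tid;
}

/* A thread may touch the FD only if its bit is in thread_mask. */
static inline bool may_manipulate(const struct fd_masks *m, unsigned int tid)
{
    return (m->thread_mask & tid_to_bit(tid)) != 0;
}

/* Polled by this thread but no longer owned by it: recent migration. */
static inline bool migrated_away(const struct fd_masks *m, unsigned int tid)
{
    uint64_t bit = tid_to_bit(tid);

    return (m->polled_mask & bit) && !(m->thread_mask & bit);
}

/* Safe to take over only if no other thread's bit is in running_mask. */
static inline bool safe_to_take_over(const struct fd_masks *m, unsigned int tid)
{
    return (m->running_mask & ~tid_to_bit(tid)) == 0;
}
```

A thread finding migrated_away() true for an FD would be expected to drop it
from its poller during its next update pass.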

While the description above could be seen as "progressive" (it technically is)
in that there is always a transition and convergence period in a migrated FD's
life, functionally speaking it's perfectly atomic thanks to the running bit and
to the per-thread idle connections lock: no takeover is permitted without
holding the idle_conns lock, and takeover may only happen by atomically picking
a connection from the list that is also protected by this lock. In practice, an
FD is never taken over by itself, but always in the context of a connection,
and by atomically removing a connection from an idle list, it is possible to
guarantee that a connection will not be picked, hence that its FD will not be
taken over.

same thread as list!

The possible entry points to a race to use a file descriptor are the following
ones, with their respective sequences:

  1) takeover: requested by conn_backend_get() on behalf of connect_server()
    - take the idle_conns_lock, protecting against a parallel access from the
      I/O tasklet or timeout task
    - pick the first connection from the list
    - attempt an fd_takeover() on this connection's fd. Usually it works,
      unless a late wakeup of the owning thread shows up in the FD's running
      mask. The operation is performed in fd_takeover() using a DWCAS which
      tries to switch both running and thread_mask to the caller's tid_bit. A
      concurrent bit in running is enough to make it fail. This guarantees
      another thread does not wake up from I/O in the middle of the takeover.
      In case of conflict, this FD is skipped and the attempt is retried with
      the next connection.
    - reset the task/tasklet contexts to NULL, as a signal that they are not
      allowed to run anymore. The tasks retrieve their execution context from
      the scheduler in the arguments, but will check the tasks' context from
      the structure under the lock to detect this possible change, and abort.
    - at this point the takeover succeeded, the idle_conns_lock is released and
      the connection and its FD are now owned by the caller
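
The takeover sequence above can be sketched as follows. This is a simplified
illustration, not haproxy's code: struct conn, struct idle_list,
try_takeover() and takeover_first() are hypothetical names, a pthread mutex
stands in for the idle_conns_lock, the idle list is a plain singly linked
list, and the masks are switched with a simple CAS plus store instead of a
DWCAS. Unlinking the connection from the list at the time of the takeover is
also an assumption of this sketch.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins, for illustration only. */
struct conn {
    struct conn *next;            /* idle list link */
    _Atomic uint64_t running;     /* threads currently using the FD */
    _Atomic uint64_t thread_mask; /* owning thread's bit */
    void *task_ctx;               /* NULL tells the old task to abort */
};

struct idle_list {
    pthread_mutex_t lock;         /* plays the role of idle_conns_lock */
    struct conn *head;
};

/* Fails as soon as any bit is present in the running mask. */
static int try_takeover(struct conn *c, uint64_t tid_bit)
{
    uint64_t expected = 0;

    if (!atomic_compare_exchange_strong(&c->running, &expected, tid_bit))
        return 0;
    atomic_store(&c->thread_mask, tid_bit);
    atomic_fetch_and(&c->running, ~tid_bit);
    return 1;
}

/* Returns the first idle connection that could be taken over, or NULL. */
static struct conn *takeover_first(struct idle_list *l, uint64_t tid_bit)
{
    struct conn **prev, *c = NULL;

    assert(tid_bit != 0);
    pthread_mutex_lock(&l->lock);
    for (prev = &l->head; (c = *prev) != NULL; prev = &c->next) {
        if (!try_takeover(c, tid_bit))
            continue;             /* conflict: skip, try the next one */
        *prev = c->next;          /* unlink from the idle list */
        c->next = NULL;
        c->task_ctx = NULL;       /* old task/tasklet must not run */
        break;
    }
    pthread_mutex_unlock(&l->lock);
    return c;                     /* now owned by the caller, or NULL */
}
```

Resetting task_ctx while the lock is still held is what lets the previous
owner's task detect the change when it later checks its context under the
same lock.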

  2) poll report: happens on late rx, shutdown or error on idle conns
    - fd_set_running() is called to atomically set the running_mask and check
      that the caller's tid_bit is still present in the thread_mask. Upon
      failure the caller arranges to stop reporting that FD (e.g. by
      immediate removal or by an asynchronous update). Upon success, it's
      guaranteed that any concurrent fd_takeover() will fail the DWCAS and that
      another connection will need to be picked instead.
    - FD's state is possibly updated
    - the iocb is called if needed (almost always)
    - if the iocb didn't kill the connection, release the bit from
      running_mask, making the connection possibly available to a subsequent
      fd_takeover().

  3) I/O tasklet, timeout task: timeout or subscribed wakeup
    - start by taking the idle_conns_lock, ensuring no takeover() will pick the
      same connection from this point.
    - check the task/tasklet's context to verify that no recently completed
      takeover() stole the connection. If it's NULL, the connection was lost,
      the lock is released and the task/tasklet killed. Otherwise it is
      guaranteed that no other thread may use that connection (current takeover
      candidates are waiting on the lock, previous owners waking from poll()
      lost their bit in the thread_mask and will not touch the FD).
    - the connection is removed from the idle conns list. From this point on,
      no other thread will find it there nor even try fd_takeover() on it.
    - the idle_conns_lock is now released, the connection is protected and its
      FD is not reachable by other threads anymore.
    - the task does what it has to do
    - if the connection is still usable (i.e. not upon timeout), it's inserted
      again into the idle conns list, meaning it may instantly be taken over
      by a competing thread.
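
The I/O tasklet sequence above may be sketched as below. Everything here is a
hypothetical simplification: struct idle_conn, struct tasklet, the "usable"
flag, the work callback and the two sample callbacks are invented for the
example, and a pthread mutex again stands in for the idle_conns_lock.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins, for illustration only. */
struct idle_conn {
    struct idle_conn *next;
    bool usable;                 /* cleared on timeout or error */
};

struct tasklet {
    void *context;               /* reset to NULL by a completed takeover */
};

struct idle_list {
    pthread_mutex_t lock;        /* plays the role of idle_conns_lock */
    struct idle_conn *head;
};

static void list_remove(struct idle_list *l, struct idle_conn *c)
{
    struct idle_conn **p;

    for (p = &l->head; *p; p = &(*p)->next) {
        if (*p == c) {
            *p = c->next;
            c->next = NULL;
            return;
        }
    }
}

/* Sample work callbacks: one keeps the connection, one expires it. */
static void keep_alive(struct idle_conn *c) { c->usable = true; }
static void expire(struct idle_conn *c)     { c->usable = false; }

/* Returns false when the connection was stolen and the tasklet must die. */
static bool io_tasklet(struct tasklet *t, struct idle_list *l,
                       void (*work)(struct idle_conn *))
{
    struct idle_conn *c;

    assert(t && l && work);
    pthread_mutex_lock(&l->lock);
    c = t->context;              /* re-read under the lock */
    if (!c) {
        pthread_mutex_unlock(&l->lock);
        return false;            /* lost to a takeover: abort */
    }
    list_remove(l, c);           /* no longer reachable by a takeover */
    pthread_mutex_unlock(&l->lock);

    work(c);                     /* the task does what it has to do */

    if (c->usable) {
        pthread_mutex_lock(&l->lock);
        c->next = l->head;       /* back in the list: may be taken over */
        l->head = c;
        pthread_mutex_unlock(&l->lock);
    }
    return true;
}
```

The important point is that the context is read only after the lock is taken,
since the value passed by the scheduler may predate a takeover.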

  4) wake() callback: happens on last user after xfers (may free() the conn)
    - the connection is still owned by the caller, it's still subscribed to
      polling but the connection is idle thus inactive. Errors or shutdowns
      may be reported late, via sock_conn_iocb() and conn_notify_mux(), thus
      the running bit is set (i.e. a concurrent fd_takeover() will fail).
    - if the connection is in the list, the idle_conns_lock is grabbed, the
      connection is removed from the list, and the lock is released.
    - mux->wake() is called
    - if the connection previously was in the list, it's reinserted under the
      idle_conns_lock.


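The DWCAS mentioned in sequences 1 and 2 above was later replaced by a regular
CAS on running_mask followed by an atomic store to thread_mask in
fd_takeover(), and by a CAS around a consistency check between running_mask
and thread_mask in the poller. With the DWCAS removed, the operations become: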

fd_takeover:
     1  if (!CAS(&running_mask, 0, tid_bit))
     2      return fail;
     3  atomic_store(&thread_mask, tid_bit);
     4  atomic_and(&running_mask, ~tid_bit);

poller:
     1  do {
     2      /* read consistent running_mask & thread_mask */
     3      do {
     4          run = atomic_load(&running_mask);
     5          thr = atomic_load(&thread_mask);
     6      } while (run & ~thr);
     7
     8      if (!(thr & tid_bit)) {
     9          /* takeover has started */
    10          goto disable_fd;
    11      }
    12  } while (!CAS(&running_mask, run, run | tid_bit));

fd_delete:
     1  atomic_or(&running_mask, tid_bit);
     2  atomic_store(&thread_mask, 0);
     3  atomic_and(&running_mask, ~tid_bit);

The loop in poller:3-6 makes sure that the thread_mask being read matches the
last updated running_mask: a bit set in running_mask but absent from
thread_mask indicates a transition in progress. If nobody can give up on
fd_takeover(), it might even be possible to spin on thread_mask only. With
this scheme, late pollers no longer set their running bit.
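
For readers who prefer compilable code, the three sequences can be
transcribed to C11 atomics roughly as follows. This is a sketch under
assumptions: struct fd_state, the "_sketch" function names and the default
sequentially consistent ordering are choices made for the example, and the
poller's "goto disable_fd" is rendered as a false return value.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical container for the two masks discussed above. */
struct fd_state {
    _Atomic uint64_t running_mask;
    _Atomic uint64_t thread_mask;
};

/* fd_takeover: only succeeds when nobody is running on the FD. */
static bool fd_takeover_sketch(struct fd_state *fd, uint64_t tid_bit)
{
    uint64_t expected = 0;

    assert(tid_bit != 0);
    if (!atomic_compare_exchange_strong(&fd->running_mask, &expected, tid_bit))
        return false;
    atomic_store(&fd->thread_mask, tid_bit);
    atomic_fetch_and(&fd->running_mask, ~tid_bit);
    return true;
}

/* poller: returns false when the caller must stop polling this FD. */
static bool fd_set_running_sketch(struct fd_state *fd, uint64_t tid_bit)
{
    uint64_t run, thr;

    do {
        /* read consistent running_mask and thread_mask */
        do {
            run = atomic_load(&fd->running_mask);
            thr = atomic_load(&fd->thread_mask);
        } while (run & ~thr);

        if (!(thr & tid_bit))
            return false;        /* takeover has started */
    } while (!atomic_compare_exchange_strong(&fd->running_mask, &run,
                                             run | tid_bit));
    return true;
}

/* fd_delete: running is set before thread_mask is cleared. */
static void fd_delete_sketch(struct fd_state *fd, uint64_t tid_bit)
{
    atomic_fetch_or(&fd->running_mask, tid_bit);
    atomic_store(&fd->thread_mask, 0);
    atomic_fetch_and(&fd->running_mask, ~tid_bit);
}
```

Between steps 1 and 4 of fd_takeover, a poller on another thread sees a
running bit that is absent from thread_mask and keeps spinning in its inner
loop until the transition completes.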