haproxy/doc/design-thoughts/thread-group.txt

Thread groups
#############

2021-07-13 - first draft
==========

Objective
---------
- support multi-socket systems with limited cache-line bouncing between
  physical CPUs and/or L3 caches

- overcome the 64-thread limitation

- Support a reasonable number of groups. I.e. if modern CPUs arrive with
  core complexes made of 8 cores, with 8 CC per chip and 2 chips in a
  system, it makes sense to support 16 groups.


Non-objective
-------------
- no need to optimize to the last possible cycle. I.e. some algos like
  leastconn will remain shared across all threads, servers will keep a
  single queue, etc. Global information remains global.

- no stubborn enforcement of FD sharing. Per-server idle connection lists
  can become per-group; listeners can (and should probably) be per-group.
  Other mechanisms (like SO_REUSEADDR) can already overcome this.

- no need to go beyond 64 threads per group.


Identified tasks
================

General
-------
Everywhere tid_bit is used we absolutely need to find a complement using
either the current group or a specific one. Thread debugging will need to
be extended as masks are extensively used.


Scheduler
---------
The global run queue and global wait queue must become per-group. This
means that a task may only be queued into one of them at a time. It
sounds like tasks may only belong to a given group, but doing so would
bring back the original issue that it's impossible to perform remote wake
ups.

We could probably ignore the group if we don't need to set the thread mask
in the task. The task's thread_mask is never manipulated using atomics so
it's safe to complement it with a group.

The sleeping_thread_mask should become per-group. A wakeup could then
only be performed on the assigned group: either a task is not assigned,
in which case it will be self-assigned (like today), or the tg to be
woken up will be retrieved from the task itself.

Task creation currently takes a thread mask of either tid_bit, a specific
mask, or MAX_THREADS_MASK. How to create a task able to run anywhere
(checks, Lua, ...) ?

Profiling
---------
There should be one task_profiling_mask per thread group. Enabling or
disabling profiling should be made per group (possibly by iterating).

Thread isolation
----------------
Thread isolation is difficult as we solely rely on atomic ops to figure out
who can complete. Such an operation is rare, so maybe we could have a global
read_mostly flag containing a mask of the groups that require isolation.
Then the threads_want_rdv_mask etc can become per-group. However setting
and clearing the bits will become problematic as this will happen in two
steps hence will require careful ordering.

FD
--
tid_bit is used in a number of atomic ops on the running_mask. If we have
one fdtab[] per group, the mask implies that it's within the group.
Theoretically we should never face a situation where an FD is reported or
manipulated for a remote group.

There will still be one poller per thread, except that this time all
operations will be related to the current thread_group. No fd may appear
in two thread_groups at once, but we can probably not prevent that (e.g.
delayed close and reopen). Should we instead have a single shared fdtab[]
(less memory usage also) ? Maybe adding the group in the fdtab entry would
work, but when does a thread know it can leave it ? Currently this is
solved by running_mask and by update_mask. Having two tables could help
with this (each table sees the FD in a different group with a different
mask) but this looks overkill.

There's polled_mask[] which needs to be decided upon. Probably that it
should be doubled as well. Note, polled_mask left fdtab[] for cacheline
alignment reasons in commit cb92f5cae4.

If we have one fdtab[] per group, what *really* prevents the same FD from
being used in multiple groups ? _fd_delete_orphan() and fd_update_events()
need to check for no-thread usage before closing the FD. This could be
a limiting factor. Enabling could require waking every poller.

Shouldn't we remerge fdinfo[] with fdtab[] (one pointer + one int/short,
used only during creation and close) ?

Other problem, if we have one fdtab[] per TG, disabling/enabling an FD
(e.g. pause/resume on listener) can become a problem if it's not necessarily
on the current TG. We'll then need a way to figure that out. It sounds like
FDs from listeners and receivers are very specific and suffer from problems
all other ones under high load do not suffer from. Maybe something specific
ought to be done for them, if we can guarantee there is no risk of accidental
reuse (e.g. locate the TG info in the receiver and have a "MT" bit in the
FD's flags). The risk is always that a close() can result in instant pop-up
of the same FD on any other thread of the same process.

Observations: right now fdtab[].thread_mask more or less corresponds to a
declaration of interest, it's very close to meaning "active per thread". It is
in fact located in the FD while it ought to do nothing there, as it should be
where the FD is used as it rules accesses to a shared resource that is not
the FD but what uses it. Indeed, if neither polled_mask nor running_mask have
a thread's bit, the FD is unknown to that thread and the element using it may
only be reached from above and not from the FD. As such we ought to have a
thread_mask on a listener and another one on connections. These ones will
indicate who uses them. A takeover could then be simplified (atomically set
exclusivity on the FD's running_mask, upon success, takeover the connection,
clear the running mask). Probably that the change ought to be performed on
the connection level first, not the FD level by the way. But running and
polled are the two relevant elements, one indicates userland knowledge,
the other one kernel knowledge. For listeners there's no exclusivity so it's
a bit different but the rule remains the same that we don't have to know
what threads are *interested* in the FD, only its holder.

Not exact in fact, see FD notes below.

activity
--------
There should be one activity array per thread group. The dump should
simply scan them all since the cumulated values are not very important
anyway.

applets
-------
They use tid_bit only for the task. It looks like the appctx's thread_mask
is never used (now removed). Furthermore, it looks like the argument is
*always* tid_bit.

CPU binding
-----------
This is going to be tough. We will need to detect that threads overlap
and are not bound (i.e. all threads on same mask). In this case, if the number
of threads is higher than the number of threads per physical socket, one must
try hard to evenly spread them among physical sockets (e.g. one thread group
per physical socket) and start as many threads as needed on each, bound to
all threads/cores of each socket. If there is a single socket, the same job
may be done based on L3 caches. Maybe it could always be done based on L3
caches. The difficulty behind this is the number of sockets to be bound: it
is not possible to bind several FDs per listener. Maybe with a new bind
keyword we could automatically duplicate listeners ? In any case,
the initially bound cpumap (via taskset) must always be respected, and
everything should probably start from there.

Frontend binding
----------------
We'll have to define a list of threads and thread-groups per frontend.
Having a group mask and the same thread mask for each group would probably
suffice.

Threads should have two numbers:
  - the per-process number (e.g. 1..256)
  - the per-group number (1..64)

The "bind-thread" lines ought to use the following syntax:
  - bind 45      ## bind to process' thread 45
  - bind 1/45    ## bind to group 1's thread 45
  - bind all/45  ## bind to thread 45 in each group
  - bind 1/all   ## bind to all threads in group 1
  - bind all     ## bind to all threads
  - bind all/all ## bind to all threads in all groups (=all)
  - bind 1/65    ## rejected
  - bind 65      ## OK if there are enough
  - bind 35-45   ## depends. Rejected if it crosses a group boundary.
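
A minimal sketch of how such designations could be parsed. The helper
parse_thread_spec() and the thread_spec structure are hypothetical (neither
exists in the code), and 0 stands for "all". Ranges such as 35-45 are
deliberately left out and are rejected here. Only the 64-thread per-group
limit is enforced; checking against the configured number of threads would
have to come later:

```c
#include <stdlib.h>
#include <string.h>

#define MAX_THREADS_PER_GROUP 64 /* assumption: the per-group limit above */

/* Hypothetical parsed form of a thread designation; 0 means "all". */
struct thread_spec {
	int has_group; /* 1 if the "group/thread" form was used */
	int group;     /* group number, 0 = all groups (if has_group) */
	int thread;    /* thread number, 0 = all threads */
};

/* Parses "all" or a positive decimal number of <len> chars.
 * Returns 0 for "all", the number otherwise, or -1 on error.
 */
static int parse_token(const char *s, size_t len)
{
	char buf[16];
	char *end;
	long v;

	if (len == 0 || len >= sizeof(buf))
		return -1;
	memcpy(buf, s, len);
	buf[len] = 0;
	if (strcmp(buf, "all") == 0)
		return 0;
	v = strtol(buf, &end, 10);
	if (*end || v <= 0 || v > 65535)
		return -1;
	return (int)v;
}

/* Returns 0 on success, -1 if the designation is rejected. */
int parse_thread_spec(const char *arg, struct thread_spec *out)
{
	const char *slash = strchr(arg, '/');

	memset(out, 0, sizeof(*out));
	if (!slash) {
		/* process-wide thread number, or "all" */
		out->thread = parse_token(arg, strlen(arg));
		return out->thread < 0 ? -1 : 0;
	}
	out->has_group = 1;
	out->group  = parse_token(arg, (size_t)(slash - arg));
	out->thread = parse_token(slash + 1, strlen(slash + 1));
	if (out->group < 0 || out->thread < 0)
		return -1;
	/* a per-group thread number cannot exceed the group's capacity */
	if (out->thread > MAX_THREADS_PER_GROUP)
		return -1;
	return 0;
}
```

With this reading, "1/65" is rejected at parsing time while "65" is accepted
and left to a later check against nbthread.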

The global directive "nbthread 28" means 28 total threads for the process. The
number of groups will sub-divide this. E.g. 4 groups will very likely imply 7
threads per group. At the beginning, the nbgroup should be manual since it
implies config adjustments to bind lines.

There should be a trivial way to map a global thread to a group and local ID
and to do the opposite.
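
One possible way to do this mapping, sketched under the assumption that each
group only stores the global index of its first thread and its thread count.
The structure and function names are hypothetical, and all indexes are
0-based here, unlike the 1-based numbers used in the configuration:

```c
/* Hypothetical per-group descriptor; indexes are 0-based. */
struct tgroup_sketch {
	unsigned int base;  /* global index of the group's first thread */
	unsigned int count; /* number of threads in the group */
};

/* Spreads <nbthread> threads as evenly as possible over <nbtgroups>. */
void tgroups_split(struct tgroup_sketch *tg, unsigned int nbtgroups,
                   unsigned int nbthread)
{
	unsigned int g, base = 0;

	if (!nbtgroups)
		return;
	for (g = 0; g < nbtgroups; g++) {
		tg[g].base  = base;
		/* the first (nbthread % nbtgroups) groups get one more */
		tg[g].count = nbthread / nbtgroups + (g < nbthread % nbtgroups);
		base += tg[g].count;
	}
}

/* global index -> (group, local index). Returns 0, or -1 if out of range. */
int thread_to_local(const struct tgroup_sketch *tg, unsigned int nbtgroups,
                    unsigned int tid, unsigned int *grp, unsigned int *ltid)
{
	unsigned int g;

	for (g = 0; g < nbtgroups; g++) {
		if (tid >= tg[g].base && tid - tg[g].base < tg[g].count) {
			*grp  = g;
			*ltid = tid - tg[g].base;
			return 0;
		}
	}
	return -1;
}

/* (group, local index) -> global index. Returns 0, or -1 if out of range. */
int thread_to_global(const struct tgroup_sketch *tg, unsigned int nbtgroups,
                     unsigned int grp, unsigned int ltid, unsigned int *tid)
{
	if (grp >= nbtgroups || ltid >= tg[grp].count)
		return -1;
	*tid = tg[grp].base + ltid;
	return 0;
}
```

With "nbthread 28" and 4 groups, each group gets 7 threads, and global thread
index 20 maps to group index 2, local index 6.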


Panic handler + watchdog
------------------------
Will probably depend on what's done for thread_isolate

Per-thread arrays inside structures
-----------------------------------
- listeners have a thr_conn[] array, currently limited to MAX_THREADS. Should
  we simply bump the limit ?
- same for servers with idle connections.
=> doesn't seem very practical.
- another solution might be to point to dynamically allocated arrays of
  arrays (e.g. nbthread * nbgroup) or a first level per group and a second
  per thread.
=> dynamic allocation based on the global number

Other
-----
- what about dynamic thread start/stop (e.g. for containers/VMs) ?
  E.g. if we decide to start $MANY threads in 4 groups, and only use
  one, in the end it will not be possible to use less than one thread
  per group, and at most 64 will be present in each group.


FD Notes
--------
  - updt_fd_polling() uses thread_mask to figure where to send the update,
    the local list or a shared list, and which bits to set in update_mask.
    This could be changed so that it takes the update mask in argument. The
    call from the poller's fork would just have to broadcast everywhere.

  - pollers use it to figure whether they're concerned or not by the activity
    update. This looks important as otherwise we could re-enable polling on
    an FD that changed to another thread.

  - thread_mask being a per-thread active mask looks more exact and is
    precisely used this way by _update_fd(). In this case using it instead
    of running_mask to gauge a change or temporarily lock it during a
    removal could make sense.

  - running should be conditioned by thread. Polled not (since deferred
    or migrated). In this case testing thread_mask can be enough most of
    the time, but this requires synchronization that will have to be
    extended to tgid. But migration seems a different beast that we shouldn't
    care about here: if first performed at the higher level it ought to
    be safe.

In practice the update_mask can be dropped to zero by the first fd_delete()
as the only authority allowed to fd_delete() is *the* owner, and as soon as
all running_mask are gone, the FD will be closed, hence removed from all
pollers. This will be the only way to make sure that update_mask always
refers to the current tgid.

However, it may happen that a takeover within the same group causes a thread
to read the update_mask late, while the FD is being wiped by another thread.
That other thread may close it, causing another thread in another group to
catch it, and change the tgid and start to update the update_mask. This means
that it would be possible for a thread entering do_poll() to see the correct
tgid, then the fd would be closed, reopened and reassigned to another tgid,
and the thread would see its bit in the update_mask, being confused. Right
now this should already happen when the update_mask is not cleared, except
that upon wakeup a migration would be detected and that would be all.

Thus we might need to set the running bit to prevent the FD from migrating
before reading update_mask, which also implies closing on fd_clr_running() == 0 :-(

Also even fd_update_events() leaves a risk of updating update_mask after
clearing running, thus affecting the wrong one. Probably that update_mask
should be updated before clearing running_mask there. Also, how about not
creating an update on a close ? Not trivial if done before running, unless
thread_mask==0.

###########################################################

Current state:


Mux / takeover / fd_delete() code                |||  poller code
-------------------------------------------------|||---------------------------------------------------
                                                 \|/
mux_takeover():                                   | fd_set_running():
   if (fd_takeover()<0)                           |    old = {running, thread};
     return fail;                                 |    new = {tid_bit, tid_bit};
   ...                                            |
fd_takeover():                                    |    do {
   atomic_or(running, tid_bit);                   |       if (!(old.thread & tid_bit))
   old = {running, thread};                       |          return -1;
   new = {tid_bit, tid_bit};                      |       new = { running | tid_bit, old.thread }
   if (owner != expected) {                       |    } while (!dwcas({running, thread}, &old, &new));
      atomic_and(running, ~tid_bit);              |
      return -1; // fail                          | fd_clr_running():
   }                                              |    return atomic_and_fetch(running, ~tid_bit);
                                                  |
   while (old == {tid_bit, !=0 }) {               | poll():
      if (dwcas({running, thread}, &old, &new)) { |    if (!owner)
         atomic_and(running, ~tid_bit);           |       continue;
         return 0; // success                     |
      }                                           |    if (!(thread_mask & tid_bit)) {
   }                                              |       epoll_ctl_del();
                                                  |       continue;
   atomic_and(running, ~tid_bit);                 |    }
   return -1; // fail                             |
                                                  |    // via fd_update_events()
fd_delete():                                      |    if (fd_set_running() != -1) {
   atomic_or(running, tid_bit);                   |       iocb();
   atomic_store(thread, 0);                       |       if (fd_clr_running() == 0 && !thread_mask)
   if (fd_clr_running(fd) == 0)                   |         fd_delete_orphan();
        fd_delete_orphan();                       |    }


The idle_conns_lock prevents the connection from being *picked* and released
while someone else is reading it. What it does is guarantee that on idle
connections, the caller of the IOCB will not dereference the task's context
while the connection is still in the idle list, since it might be picked then
freed at the same instant by another thread. As soon as the IOCB manages to
get that lock, it removes the connection from the list so that it cannot be
taken over anymore. Conversely, the mux's takeover() code runs under that
lock so that if it frees the connection and task, this will appear atomic
to the IOCB. The timeout task (which is another entry point for connection
deletion) does the same. Thus, when coming from the low-level (I/O or timeout):
  - task always exists, but ctx checked under lock validates; conn removal
    from list prevents takeover().
  - t->context is stable, except during changes under takeover lock. So
    h2_timeout_task may well run on a different thread than h2_io_cb().

Coming from the top:
  - takeover() done under lock() clears task's ctx and possibly closes the FD
    (unless some running remains present).

Unlikely but currently possible situations:
  - multiple pollers (up to N) may have an idle connection's FD being
    polled, if the connection was passed from thread to thread. The first
    event on the connection would wake all of them. Most of them would
    see fdtab[].owner set (the late ones might miss it). All but one would
    see that their bit is missing from fdtab[].thread_mask and give up.
    However, just after this test, others might take over the connection,
    so in practice if terribly unlucky, all but 1 could see their bit in
    thread_mask just before it gets removed, all of them set their bit
    in running_mask, and all of them call iocb() (sock_conn_iocb()).
    Thus all of them dereference the connection and touch the subscriber
    with no protection, then end up in conn_notify_mux() that will call
    the mux's wake().

  - multiple pollers (up to N-1) might still be in fd_update_events()
    manipulating fdtab[].state. The cause is that the "locked" variable
    is determined by atleast2(thread_mask) but that thread_mask is read
    at a random instant (i.e. it may be stolen by another one during a
    takeover) since we don't yet hold running to prevent this from being
    done. Thus we can arrive here with thread_mask==something_else (1bit),
    locked==0 and fdtab[].state assigned non-atomically.

  - it looks like nothing prevents h2_release() from being called on a
    thread (e.g. from the top or task timeout) while sock_conn_iocb()
    dereferences the connection on another thread. Those killing the
    connection don't yet consider the fact that it's an FD that others
    might currently be waking up on.

###################

pb with counter:

users count doesn't say who's using the FD and two users can do the same
close in turn. The thread_mask should define who's responsible for closing
the FD, and all those with a bit in it ought to do it.


2021-08-25 - update with minimal locking on tgid value
==========

  - tgid + refcount at once using CAS
  - idle_conns lock during updates
  - update:
    if tgid differs => close happened, thus drop update
    otherwise normal stuff. Lock tgid until running if needed.
  - poll report:
    if tgid differs => closed
    if thread differs => stop polling (migrated)
    keep tgid lock until running
  - test on thread_id:
    if (xadd(&tgid,65536) != my_tgid) {
      // was closed
      sub(&tgid, 65536)
      return -1
    }
    if !(thread_id & tidbit) => migrated/closed
    set_running()
    sub(tgid,65536)
  - note: either fd_insert() or the final close() ought to set
    polled and update to 0.
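
The "test on thread_id" step could look like the sketch below, written with
C11 atomics. It assumes the tgid sits in the low 16 bits of a 32-bit word and
the refcount in the upper 16 bits (hence the 65536 increment). The structure,
its field names and fd_grab_running() are hypothetical. Since the xadd
returns the old word including the refcount bits, only the low 16 bits are
compared against the caller's tgid here:

```c
#include <stdatomic.h>
#include <stdint.h>

#define TGID_MASK 0xFFFFu  /* low 16 bits: tgid */
#define TGID_REF  0x10000u /* one reference, i.e. the 65536 above */

/* Hypothetical, reduced FD entry with only the fields needed here. */
struct fd_sketch {
	_Atomic uint32_t refc_tgid;         /* (refcount << 16) | tgid */
	_Atomic unsigned long thread_mask;
	_Atomic unsigned long running_mask;
};

/* Tries to set the caller's running bit while the tgid is pinned.
 * Returns 0 on success, -1 if the FD was closed or migrated.
 */
int fd_grab_running(struct fd_sketch *fd, uint32_t my_tgid,
                    unsigned long tid_bit)
{
	/* take a reference: the tgid may not change while it is held */
	uint32_t old = atomic_fetch_add(&fd->refc_tgid, TGID_REF);
	int ret = -1;

	if ((old & TGID_MASK) == my_tgid &&
	    (atomic_load(&fd->thread_mask) & tid_bit)) {
		/* still ours: running now protects against migration */
		atomic_fetch_or(&fd->running_mask, tid_bit);
		ret = 0;
	}
	/* drop the reference in all cases */
	atomic_fetch_sub(&fd->refc_tgid, TGID_REF);
	return ret;
}
```

Whoever changes the tgid would have to do it with a CAS that only succeeds
while the refcount part is zero, which this sketch does not show.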

2021-09-13 - tid / tgroups etc.
==========

  * tid currently is the thread's global ID. It's essentially used as an index
    for arrays. It must be clearly stated that it works this way.

  * tasklets use the global thread id, and __tasklet_wakeup_on() must use a
    global ID as well. It's essential that tinfo[] provides instant access to
    local/global bits/indexes/arrays

  - tid_bit makes no sense process-wide, so it must be redefined to represent
    the thread's tid within its group. The name is not very welcome though, but
    there are 286 occurrences of it that are not going to be changed that fast.
    => now we have ltid and ltid_bit in thread_info. thread-local tid_bit still
       not changed though. If renamed we must make sure the older one vanishes.
       Why not rename "ptid, ptid_bit" for the process-wide tid and "gtid,
       gtid_bit" for the group-wide ones ? This removes the ambiguity on "tid"
       which is half the time not the one we expect.

  * just like "ti" is the thread_info, we need to have "tg" pointing to the
    thread_group.

  - other less commonly used elements should be retrieved from ti->xxx. E.g.
    the thread's local ID.

  - lock debugging must reproduce tgid

  - task profiling must be made per-group (annoying), unless we want to add a
    per-thread TH_FL_* flag and have the rare places where the bit is changed
    iterate over all threads if needed. Sounds preferable overall.

  * an offset might be placed in the tgroup so that even with 64 threads max
    we could have completely separate tid_bits over several groups.
    => base and count now
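
A rough sketch of the per-thread flag alternative for task profiling
mentioned above, where the rare writers iterate over all threads and each
thread only reads its own flags. All names here (thr_ctx_sketch,
TH_FL_PROFILING_SKETCH, profiling_set_all()) are made up for illustration:

```c
#include <stdatomic.h>

#define MAX_THREADS_SKETCH     256  /* assumption: process-wide maximum */
#define TH_FL_PROFILING_SKETCH 0x1u /* hypothetical per-thread flag */

/* Hypothetical per-thread context, indexed by the global thread ID. */
struct thr_ctx_sketch {
	_Atomic unsigned int flags;
};

static struct thr_ctx_sketch thr_ctx[MAX_THREADS_SKETCH];

/* Rare path: enables or disables profiling on the first <nbthread>
 * threads by iterating over them.
 */
void profiling_set_all(unsigned int nbthread, int on)
{
	unsigned int thr;

	for (thr = 0; thr < nbthread && thr < MAX_THREADS_SKETCH; thr++) {
		if (on)
			atomic_fetch_or(&thr_ctx[thr].flags,
			                TH_FL_PROFILING_SKETCH);
		else
			atomic_fetch_and(&thr_ctx[thr].flags,
			                 ~TH_FL_PROFILING_SKETCH);
	}
}

/* Hot path: a thread only looks at its own flags, no group mask needed. */
int profiling_enabled(unsigned int thr)
{
	return !!(atomic_load(&thr_ctx[thr].flags) & TH_FL_PROFILING_SKETCH);
}
```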

2021-09-15 - bind + listen() + rx
==========

  - thread_mask (in bind_conf->rx_settings) should become an array of
    MAX_TGROUP longs.
  - when parsing "thread 123" or "thread 2/37", the proper bit is set,
    assuming the array is either a contiguous bitfield or a tgroup array.
    An option RX_O_THR_PER_GRP or RX_O_THR_PER_PROC is set depending on
    how the thread num was parsed, so that we reject mixes.
  - end of parsing: entries translated to the cleanest form (to be determined)
  - binding: for each socket()/bind()/listen()... just perform one extra dup()
    for each tgroup and store the multiple FDs into an FD array indexed on
    MAX_TGROUP. => allows to use one FD per tgroup for the same socket, hence
    to have multiple entries in all tgroup pollers without requiring the user
    to duplicate the bind line.
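
The extra dup() per tgroup could be sketched as follows, assuming the socket
has already gone through socket()/bind()/listen(). The function name and the
MAX_TGROUPS_SKETCH constant are hypothetical; only the POSIX dup() and close()
calls are real:

```c
#include <unistd.h>

#define MAX_TGROUPS_SKETCH 16 /* assumption: stands for MAX_TGROUP */

/* Fills fds[0..nbtgroups-1] so that each tgroup gets its own FD for the
 * same socket: slot 0 keeps <fd> itself, the other slots get dup() copies.
 * Returns 0 on success, or -1 on error, in which case the copies made so
 * far are closed and <fd> is left open.
 */
int rx_dup_per_tgroup(int fd, int *fds, int nbtgroups)
{
	int g;

	if (nbtgroups < 1 || nbtgroups > MAX_TGROUPS_SKETCH)
		return -1;

	fds[0] = fd;
	for (g = 1; g < nbtgroups; g++) {
		fds[g] = dup(fd);
		if (fds[g] < 0) {
			/* roll back the copies, but not the original */
			while (--g > 0)
				close(fds[g]);
			return -1;
		}
	}
	return 0;
}
```

All the resulting FDs refer to the same underlying socket, so closing the
listener means closing every entry of the array.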

2021-09-15 - global thread masks
==========

Some global variables currently expect to know about thread IDs and it's
uncertain what must be done with them:
  - global_tasks_mask  /* Mask of threads with tasks in the global runqueue */
    => touched under the rq lock. Change it per-group ? What exact use is made ?

  - sleeping_thread_mask /* Threads that are about to sleep in poll() */
    => seems that it can be made per group

  - all_threads_mask: a bit complicated, derived from nbthread and used with
    masks and with my_ffsl() to wake threads up. Should probably be per-group
    but we might miss something for global.

  - stopping_thread_mask: used in combination with all_threads_mask, should
    move per-group.

  - threads_harmless_mask: indicates all threads that are currently harmless in
    that they promise not to access a shared resource. Must be made per-group
    but then we'll likely need a second stage to have the harmless groups mask.
    threads_idle_mask, threads_sync_mask, threads_want_rdv_mask go with the one
    above. Maybe the right approach will be to request harmless on a group mask
    so that we can detect collisions and arbitrate them like today, but on top of
    this it becomes possible to request harmless only on the local group if
    desired. The subtlety is that requesting harmless at the group level does
    not mean it's achieved since the requester cannot vouch for the other ones
    in the same group.

In addition, some variables are related to the global runqueue:
  __decl_aligned_spinlock(rq_lock); /* spin lock related to run queue */
  struct eb_root rqueue;      /* tree constituting the global run queue, accessed under rq_lock */
  unsigned int grq_total;     /* total number of entries in the global run queue, atomic */
  static unsigned int global_rqueue_ticks;  /* insertion count in the grq, use rq_lock */

And others to the global wait queue:
  struct eb_root timers;      /* sorted timers tree, global, accessed under wq_lock */
  __decl_aligned_rwlock(wq_lock);   /* RW lock related to the wait queue */


2021-09-29 - group designation and masks
==========

Neither FDs nor tasks will belong to incomplete subsets of threads spanning
over multiple thread groups. In addition there may be a difference between
configuration and operation (for FDs). This allows us to set the following rules:

  group  mask   description
    0     0     bind_conf: groups & thread not set. bind to any/all
                task: it would be nice to mean "run on the same as the caller".

    0    xxx    bind_conf: thread set but not group: thread IDs are global
                FD/task: group 0, mask xxx

    G>0   0     bind_conf: only group is set: bind to all threads of group G
                FD/task: mask 0 not permitted (= not owned). May be used to
                mention "any thread of this group", though already covered by
                G/xxx like today.

    G>0  xxx    bind_conf: Bind to these threads of this group
                FD/task: group G, mask xxx

It looks like keeping groups starting at zero internally complicates everything
though. But forcing it to start at 1 might also require that we rescan all tasks
to replace 0 with 1 upon startup. This would also allow group 0 to be special and
be used as the default group for any new thread creation, so that group0.count
would keep the number of unassigned threads. Let's try:

  group  mask   description
    0     0     bind_conf: groups & thread not set. bind to any/all
                task: "run on the same group & thread as the caller".

    0    xxx    bind_conf: thread set but not group: thread IDs are global
                FD/task: invalid. Or maybe for a task we could use this to
                mean "run on current group, thread XXX", which would cover
                the need for health checks (g/t 0/0 while sleeping, 0/xxx
                while running) and have wake_expired_tasks() detect 0/0 and
                wake them up to a random group.

    G>0   0     bind_conf: only group is set: bind to all threads of group G
                FD/task: mask 0 not permitted (= not owned). May be used to
                mention "any thread of this group", though already covered by
                G/xxx like today.

    G>0  xxx    bind_conf: Bind to these threads of this group
                FD/task: group G, mask xxx

With a single group declared in the config, group 0 would implicitly find the
first one.
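
For the bind_conf column, the four rows of the table boil down to the small
decision below. This is only a sketch: the enum and the function name are
hypothetical, and the FD/task column is intentionally not covered since its
meaning for group 0 is still open:

```c
/* Hypothetical classification of a bind_conf's (group, mask) pair. */
enum bind_scope {
	BIND_ANY,            /* group 0, mask 0: bind to any/all threads */
	BIND_GLOBAL_THREADS, /* group 0, mask set: thread IDs are global */
	BIND_WHOLE_GROUP,    /* group G>0, mask 0: all threads of group G */
	BIND_GROUP_THREADS,  /* group G>0, mask set: these threads of G */
};

/* Maps a (group, mask) pair to the matching row of the table. */
enum bind_scope bind_conf_scope(unsigned int group, unsigned long mask)
{
	if (!group)
		return mask ? BIND_GLOBAL_THREADS : BIND_ANY;
	return mask ? BIND_GROUP_THREADS : BIND_WHOLE_GROUP;
}
```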


The problem with the approach above is that a task queued in one group+thread's
wait queue could very well receive a signal from another thread and/or group,
and that there is no indication about where the task is queued, nor how to
dequeue it. Thus it seems that it's up to the application itself to unbind/
rebind a task. This contradicts the principle of leaving a task waiting in a
wait queue and waking it anywhere.

Another possibility might be to decide that a task having a defined group but
a mask of zero is shared and will always be queued into its group's wait queue.
However, upon expiry, the scheduler would notice the thread-mask 0 and would
broadcast it to any group.

Right now in the code we have:
  - 18 calls of task_new(tid_bit)
  - 18 calls of task_new(MAX_THREADS_MASK)
  - 2 calls with a single bit
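
These three call patterns could be hidden behind three thin wrappers, as in
the sketch below. Everything in it is a stand-in: task_sketch and
task_new_sketch() only mimic a task_new() taking a thread mask, the
signatures of the wrappers are assumptions, and tid_bit_sketch is a plain
variable where the real tid_bit is thread-local:

```c
#include <stdlib.h>

#define MAX_THREADS_MASK_SKETCH (~0UL) /* stand-in for MAX_THREADS_MASK */

/* Stand-in for the real task, reduced to its thread mask. */
struct task_sketch {
	unsigned long thread_mask;
};

/* Stand-in for the calling thread's bit. */
static unsigned long tid_bit_sketch = 1UL;

/* Stand-in for a task_new() taking a thread mask. */
static struct task_sketch *task_new_sketch(unsigned long thread_mask)
{
	struct task_sketch *t = calloc(1, sizeof(*t));

	if (t)
		t->thread_mask = thread_mask;
	return t;
}

/* Task allowed to run on any thread (checks, Lua, ...). */
struct task_sketch *task_new_anywhere(void)
{
	return task_new_sketch(MAX_THREADS_MASK_SKETCH);
}

/* Task bound to thread index <thr> (0-based), or NULL if out of range. */
struct task_sketch *task_new_on(unsigned int thr)
{
	if (thr >= sizeof(unsigned long) * 8)
		return NULL;
	return task_new_sketch(1UL << thr);
}

/* Task bound to the calling thread. */
struct task_sketch *task_new_here(void)
{
	return task_new_sketch(tid_bit_sketch);
}
```

Hiding the mask behind such wrappers would leave the callers untouched if the
argument later becomes a group plus a mask.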

Thus it looks like "task_new_anywhere()", "task_new_on()" and
"task_new_here()" would be sufficient.