Thread groups ############# 2021-07-13 - first draft ========== Objective --------- - support multi-socket systems with limited cache-line bouncing between physical CPUs and/or L3 caches - overcome the 64-thread limitation - Support a reasonable number of groups. I.e. if modern CPUs arrive with core complexes made of 8 cores, with 8 CC per chip and 2 chips in a system, it makes sense to support 16 groups. Non-objective ------------- - no need to optimize to the last possible cycle. I.e. some algos like leastconn will remain shared across all threads, servers will keep a single queue, etc. Global information remains global. - no stubborn enforcement of FD sharing. Per-server idle connection lists can become per-group; listeners can (and should probably) be per-group. Other mechanisms (like SO_REUSEADDR) can already overcome this. - no need to go beyond 64 threads per group. Identified tasks ================ General ------- Everywhere tid_bit is used we absolutely need to find a complement using either the current group or a specific one. Thread debugging will need to be extended as masks are extensively used. Scheduler --------- The global run queue and global wait queue must become per-group. This means that a task may only be queued into one of them at a time. It sounds like tasks may only belong to a given group, but doing so would bring back the original issue that it's impossible to perform remote wake ups. We could probably ignore the group if we don't need to set the thread mask in the task. the task's thread_mask is never manipulated using atomics so it's safe to complement it with a group. The sleeping_thread_mask should become per-group. Thus possibly that a wakeup may only be performed on the assigned group, meaning that either a task is not assigned, in which case it be self-assigned (like today), otherwise the tg to be woken up will be retrieved from the task itself. Task creation currently takes a thread mask of either tid_bit, a specific mask, or MAX_THREADS_MASK. How to create a task able to run anywhere (checks, Lua, ...) ? Profiling -> completed --------- There should be one task_profiling_mask per thread group. Enabling or disabling profiling should be made per group (possibly by iterating). -> not needed anymore, one flag per thread in each thread's context. Thread isolation ---------------- Thread isolation is difficult as we solely rely on atomic ops to figure who can complete. Such operation is rare, maybe we could have a global read_mostly flag containing a mask of the groups that require isolation. Then the threads_want_rdv_mask etc can become per-group. However setting and clearing the bits will become problematic as this will happen in two steps hence will require careful ordering. FD -- Tidbit is used in a number of atomic ops on the running_mask. If we have one fdtab[] per group, the mask implies that it's within the group. Theoretically we should never face a situation where an FD is reported nor manipulated for a remote group. There will still be one poller per thread, except that this time all operations will be related to the current thread_group. No fd may appear in two thread_groups at once, but we can probably not prevent that (e.g. delayed close and reopen). Should we instead have a single shared fdtab[] (less memory usage also) ? Maybe adding the group in the fdtab entry would work, but when does a thread know it can leave it ? Currently this is solved by running_mask and by update_mask. Having two tables could help with this (each table sees the FD in a different group with a different mask) but this looks overkill. There's polled_mask[] which needs to be decided upon. Probably that it should be doubled as well. Note, polled_mask left fdtab[] for cacheline alignment reasons in commit cb92f5cae4. If we have one fdtab[] per group, what *really* prevents from using the same FD in multiple groups ? _fd_delete_orphan() and fd_update_events() need to check for no-thread usage before closing the FD. This could be a limiting factor. Enabling could require to wake every poller. Shouldn't we remerge fdinfo[] with fdtab[] (one pointer + one int/short, used only during creation and close) ? Other problem, if we have one fdtab[] per TG, disabling/enabling an FD (e.g. pause/resume on listener) can become a problem if it's not necessarily on the current TG. We'll then need a way to figure that one. It sounds like FDs from listeners and receivers are very specific and suffer from problems all other ones under high load do not suffer from. Maybe something specific ought to be done for them, if we can guarantee there is no risk of accidental reuse (e.g. locate the TG info in the receiver and have a "MT" bit in the FD's flags). The risk is always that a close() can result in instant pop-up of the same FD on any other thread of the same process. Observations: right now fdtab[].thread_mask more or less corresponds to a declaration of interest, it's very close to meaning "active per thread". It is in fact located in the FD while it ought to do nothing there, as it should be where the FD is used as it rules accesses to a shared resource that is not the FD but what uses it. Indeed, if neither polled_mask nor running_mask have a thread's bit, the FD is unknown to that thread and the element using it may only be reached from above and not from the FD. As such we ought to have a thread_mask on a listener and another one on connections. These ones will indicate who uses them. A takeover could then be simplified (atomically set exclusivity on the FD's running_mask, upon success, takeover the connection, clear the running mask). Probably that the change ought to be performed on the connection level first, not the FD level by the way. But running and polled are the two relevant elements, one indicates userland knowledge, the other one kernel knowledge. For listeners there's no exclusivity so it's a bit different but the rule remains the same that we don't have to know what threads are *interested* in the FD, only its holder. Not exact in fact, see FD notes below. activity -------- There should be one activity array per thread group. The dump should simply scan them all since the cumuled values are not very important anyway. applets ------- They use tid_bit only for the task. It looks like the appctx's thread_mask is never used (now removed). Furthermore, it looks like the argument is *always* tid_bit. CPU binding ----------- This is going to be tough. It will be needed to detect that threads overlap and are not bound (i.e. all threads on same mask). In this case, if the number of threads is higher than the number of threads per physical socket, one must try hard to evenly spread them among physical sockets (e.g. one thread group per physical socket) and start as many threads as needed on each, bound to all threads/cores of each socket. If there is a single socket, the same job may be done based on L3 caches. Maybe it could always be done based on L3 caches. The difficulty behind this is the number of sockets to be bound: it is not possible to bind several FDs per listener. Maybe with a new bind keyword we can imagine to automatically duplicate listeners ? In any case, the initially bound cpumap (via taskset) must always be respected, and everything should probably start from there. Frontend binding ---------------- We'll have to define a list of threads and thread-groups per frontend. Probably that having a group mask and a same thread-mask for each group would suffice. Threads should have two numbers: - the per-process number (e.g. 1..256) - the per-group number (1..64) The "bind-thread" lines ought to use the following syntax: - bind 45 ## bind to process' thread 45 - bind 1/45 ## bind to group 1's thread 45 - bind all/45 ## bind to thread 45 in each group - bind 1/all ## bind to all threads in group 1 - bind all ## bind to all threads - bind all/all ## bind to all threads in all groups (=all) - bind 1/65 ## rejected - bind 65 ## OK if there are enough - bind 35-45 ## depends. Rejected if it crosses a group boundary. The global directive "nbthread 28" means 28 total threads for the process. The number of groups will sub-divide this. E.g. 4 groups will very likely imply 7 threads per group. At the beginning, the nbgroup should be manual since it implies config adjustments to bind lines. There should be a trivial way to map a global thread to a group and local ID and to do the opposite. Panic handler + watchdog ------------------------ Will probably depend on what's done for thread_isolate Per-thread arrays inside structures ----------------------------------- - listeners have a thr_conn[] array, currently limited to MAX_THREADS. Should we simply bump the limit ? - same for servers with idle connections. => doesn't seem very practical. - another solution might be to point to dynamically allocated arrays of arrays (e.g. nbthread * nbgroup) or a first level per group and a second per thread. => dynamic allocation based on the global number Other ----- - what about dynamic thread start/stop (e.g. for containers/VMs) ? E.g. if we decide to start $MANY threads in 4 groups, and only use one, in the end it will not be possible to use less than one thread per group, and at most 64 will be present in each group. FD Notes -------- - updt_fd_polling() uses thread_mask to figure where to send the update, the local list or a shared list, and which bits to set in update_mask. This could be changed so that it takes the update mask in argument. The call from the poller's fork would just have to broadcast everywhere. - pollers use it to figure whether they're concerned or not by the activity update. This looks important as otherwise we could re-enable polling on an FD that changed to another thread. - thread_mask being a per-thread active mask looks more exact and is precisely used this way by _update_fd(). In this case using it instead of running_mask to gauge a change or temporarily lock it during a removal could make sense. - running should be conditioned by thread. Polled not (since deferred or migrated). In this case testing thread_mask can be enough most of the time, but this requires synchronization that will have to be extended to tgid.. But migration seems a different beast that we shouldn't care about here: if first performed at the higher level it ought to be safe. In practice the update_mask can be dropped to zero by the first fd_delete() as the only authority allowed to fd_delete() is *the* owner, and as soon as all running_mask are gone, the FD will be closed, hence removed from all pollers. This will be the only way to make sure that update_mask always refers to the current tgid. However, it may happen that a takeover within the same group causes a thread to read the update_mask late, while the FD is being wiped by another thread. That other thread may close it, causing another thread in another group to catch it, and change the tgid and start to update the update_mask. This means that it would be possible for a thread entering do_poll() to see the correct tgid, then the fd would be closed, reopened and reassigned to another tgid, and the thread would see its bit in the update_mask, being confused. Right now this should already happen when the update_mask is not cleared, except that upon wakeup a migration would be detected and that would be all. Thus we might need to set the running bit to prevent the FD from migrating before reading update_mask, which also implies closing on fd_clr_running() == 0 :-( Also even fd_update_events() leaves a risk of updating update_mask after clearing running, thus affecting the wrong one. Probably that update_mask should be updated before clearing running_mask there. Also, how about not creating an update on a close ? Not trivial if done before running, unless thread_mask==0. ########################################################### Current state: Mux / takeover / fd_delete() code ||| poller code -------------------------------------------------|||--------------------------------------------------- \|/ mux_takeover(): | fd_set_running(): if (fd_takeover()<0) | old = {running, thread}; return fail; | new = {tid_bit, tid_bit}; ... | fd_takeover(): | do { atomic_or(running, tid_bit); | if (!(old.thread & tid_bit)) old = {running, thread}; | return -1; new = {tid_bit, tid_bit}; | new = { running | tid_bit, old.thread } if (owner != expected) { | } while (!dwcas({running, thread}, &old, &new)); atomic_and(runnning, ~tid_bit); | return -1; // fail | fd_clr_running(): } | return atomic_and_fetch(running, ~tid_bit); | while (old == {tid_bit, !=0 }) | poll(): if (dwcas({running, thread}, &old, &new)) { | if (!owner) atomic_and(runnning, ~tid_bit); | continue; return 0; // success | } | if (!(thread_mask & tid_bit)) { } | epoll_ctl_del(); | continue; atomic_and(runnning, ~tid_bit); | } return -1; // fail | | // via fd_update_events() fd_delete(): | if (fd_set_running() != -1) { atomic_or(running, tid_bit); | iocb(); atomic_store(thread, 0); | if (fd_clr_running() == 0 && !thread_mask) if (fd_clr_running(fd) = 0) | fd_delete_orphan(); fd_delete_orphan(); | } The idle_conns_lock prevents the connection from being *picked* and released while someone else is reading it. What it does is guarantee that on idle connections, the caller of the IOCB will not dereference the task's context while the connection is still in the idle list, since it might be picked then freed at the same instant by another thread. As soon as the IOCB manages to get that lock, it removes the connection from the list so that it cannot be taken over anymore. Conversely, the mux's takeover() code runs under that lock so that if it frees the connection and task, this will appear atomic to the IOCB. The timeout task (which is another entry point for connection deletion) does the same. Thus, when coming from the low-level (I/O or timeout): - task always exists, but ctx checked under lock validates; conn removal from list prevents takeover(). - t->context is stable, except during changes under takeover lock. So h2_timeout_task may well run on a different thread than h2_io_cb(). Coming from the top: - takeover() done under lock() clears task's ctx and possibly closes the FD (unless some running remains present). Unlikely but currently possible situations: - multiple pollers (up to N) may have an idle connection's FD being polled, if the connection was passed from thread to thread. The first event on the connection would wake all of them. Most of them would see fdtab[].owner set (the late ones might miss it). All but one would see that their bit is missing from fdtab[].thread_mask and give up. However, just after this test, others might take over the connection, so in practice if terribly unlucky, all but 1 could see their bit in thread_mask just before it gets removed, all of them set their bit in running_mask, and all of them call iocb() (sock_conn_iocb()). Thus all of them dereference the connection and touch the subscriber with no protection, then end up in conn_notify_mux() that will call the mux's wake(). - multiple pollers (up to N-1) might still be in fd_update_events() manipulating fdtab[].state. The cause is that the "locked" variable is determined by atleast2(thread_mask) but that thread_mask is read at a random instant (i.e. it may be stolen by another one during a takeover) since we don't yet hold running to prevent this from being done. Thus we can arrive here with thread_mask==something_else (1bit), locked==0 and fdtab[].state assigned non-atomically. - it looks like nothing prevents h2_release() from being called on a thread (e.g. from the top or task timeout) while sock_conn_iocb() dereferences the connection on another thread. Those killing the connection don't yet consider the fact that it's an FD that others might currently be waking up on. ################### pb with counter: users count doesn't say who's using the FD and two users can do the same close in turn. The thread_mask should define who's responsible for closing the FD, and all those with a bit in it ought to do it. 2021-08-25 - update with minimal locking on tgid value ========== - tgid + refcount at once using CAS - idle_conns lock during updates - update: if tgid differs => close happened, thus drop update otherwise normal stuff. Lock tgid until running if needed. - poll report: if tgid differs => closed if thread differs => stop polling (migrated) keep tgid lock until running - test on thread_id: if (xadd(&tgid,65536) != my_tgid) { // was closed sub(&tgid, 65536) return -1 } if !(thread_id & tidbit) => migrated/closed set_running() sub(tgid,65536) - note: either fd_insert() or the final close() ought to set polled and update to 0. 2021-09-13 - tid / tgroups etc. ========== * tid currently is the thread's global ID. It's essentially used as an index for arrays. It must be clearly stated that it works this way. * tasklets use the global thread id, and __tasklet_wakeup_on() must use a global ID as well. It's capital that tinfo[] provides instant access to local/global bits/indexes/arrays - tid_bit makes no sense process-wide, so it must be redefined to represent the thread's tid within its group. The name is not much welcome though, but there are 286 of it that are not going to be changed that fast. => now we have ltid and ltid_bit in thread_info. thread-local tid_bit still not changed though. If renamed we must make sure the older one vanishes. Why not rename "ptid, ptid_bit" for the process-wide tid and "gtid, gtid_bit" for the group-wide ones ? This removes the ambiguity on "tid" which is half the time not the one we expect. * just like "ti" is the thread_info, we need to have "tg" pointing to the thread_group. - other less commonly used elements should be retrieved from ti->xxx. E.g. the thread's local ID. - lock debugging must reproduce tgid * task profiling must be made per-group (annoying), unless we want to add a per-thread TH_FL_* flag and have the rare places where the bit is changed iterate over all threads if needed. Sounds preferable overall. * an offset might be placed in the tgroup so that even with 64 threads max we could have completely separate tid_bits over several groups. => base and count now 2021-09-15 - bind + listen() + rx ========== - thread_mask (in bind_conf->rx_settings) should become an array of MAX_TGROUP longs. - when parsing "thread 123" or "thread 2/37", the proper bit is set, assuming the array is either a contigous bitfield or a tgroup array. An option RX_O_THR_PER_GRP or RX_O_THR_PER_PROC is set depending on how the thread num was parsed, so that we reject mixes. - end of parsing: entries translated to the cleanest form (to be determined) - binding: for each socket()/bind()/listen()... just perform one extra dup() for each tgroup and store the multiple FDs into an FD array indexed on MAX_TGROUP. => allows to use one FD per tgroup for the same socket, hence to have multiple entries in all tgroup pollers without requiring the user to duplicate the bind line. 2021-09-15 - global thread masks ========== Some global variables currently expect to know about thread IDs and it's uncertain what must be done with them: - global_tasks_mask /* Mask of threads with tasks in the global runqueue */ => touched under the rq lock. Change it per-group ? What exact use is made ? - sleeping_thread_mask /* Threads that are about to sleep in poll() */ => seems that it can be made per group - all_threads_mask: a bit complicated, derived from nbthread and used with masks and with my_ffsl() to wake threads up. Should probably be per-group but we might miss something for global. - stopping_thread_mask: used in combination with all_threads_mask, should move per-group. - threads_harmless_mask: indicates all threads that are currently harmless in that they promise not to access a shared resource. Must be made per-group but then we'll likely need a second stage to have the harmless groups mask. threads_idle_mask, threads_sync_mask, threads_want_rdv_mask go with the one above. Maybe the right approach will be to request harmless on a group mask so that we can detect collisions and arbiter them like today, but on top of this it becomes possible to request harmless only on the local group if desired. The subtlety is that requesting harmless at the group level does not mean it's achieved since the requester cannot vouch for the other ones in the same group. In addition, some variables are related to the global runqueue: __decl_aligned_spinlock(rq_lock); /* spin lock related to run queue */ struct eb_root rqueue; /* tree constituting the global run queue, accessed under rq_lock */ unsigned int grq_total; /* total number of entries in the global run queue, atomic */ static unsigned int global_rqueue_ticks; /* insertion count in the grq, use rq_lock */ And others to the global wait queue: struct eb_root timers; /* sorted timers tree, global, accessed under wq_lock */ __decl_aligned_rwlock(wq_lock); /* RW lock related to the wait queue */ struct eb_root timers; /* sorted timers tree, global, accessed under wq_lock */ 2022-06-14 - progress on task affinity ========== The particularity of the current global run queue is to be usable for remote wakeups because it's protected by a lock. There is no need for a global run queue beyond this, and there could already be a locked queue per thread for remote wakeups, with a random selection at wakeup time. It's just that picking a pending task in a run queue among a number is convenient (though it introduces some excessive locking). A task will either be tied to a single group or will be allowed to run on any group. As such it's pretty clear that we don't need a global run queue. When a run-anywhere task expires, either it runs on the current group's runqueue with any thread, or a target thread is selected during the wakeup and it's directly assigned. A global wait queue seems important for scheduled repetitive tasks however. But maybe it's more a task for a cron-like job and there's no need for the task itself to wake up anywhere, because once the task wakes up, it must be tied to one (or a set of) thread(s). One difficulty if the task is temporarily assigned a thread group is that it's impossible to know where it's running when trying to perform a second wakeup or when trying to kill it. Maybe we'll need to have two tgid for a task (desired, effective). Or maybe we can restrict the ability of such a task to stay in wait queue in case of wakeup, though that sounds difficult. Other approaches would be to set the GID to the current one when waking up the task, and to have a flag (or sign on the GID) indicating that the task is still queued in the global timers queue. We already have TASK_SHARED_WQ so it seems that antoher similar flag such as TASK_WAKE_ANYWHERE could make sense. But when is TASK_SHARED_WQ really used, except for the "anywhere" case ? All calls to task_new() use either 1< we don't need one WQ per group, only a global and N local ones, hence | | the TASK_SHARED_WQ flag can continue to be used for this purpose. | +----------------------------------------------------------------------------+ Having TASK_SHARED_WQ should indicate that a task will always be queued to the shared queue and will always have a temporary gid and thread mask in the run queue. Going further, as we don't have any single case of a task bound to a small set of threads, we could decide to wake up only expired tasks for ourselves by looking them up using eb32sc and adopting them. Thus, there's no more need for a shared runqueue nor a global_runqueue_ticks counter, and we can simply have the ability to wake up a remote task. The task's thread_mask will then change so that it's only a thread ID, except when the task has TASK_SHARED_WQ, in which case it corresponds to the running thread. That's very close to what is already done with tasklets in fact. 2021-09-29 - group designation and masks ========== Neither FDs nor tasks will belong to incomplete subsets of threads spanning over multiple thread groups. In addition there may be a difference between configuration and operation (for FDs). This allows to fix the following rules: group mask description 0 0 bind_conf: groups & thread not set. bind to any/all task: it would be nice to mean "run on the same as the caller". 0 xxx bind_conf: thread set but not group: thread IDs are global FD/task: group 0, mask xxx G>0 0 bind_conf: only group is set: bind to all threads of group G FD/task: mask 0 not permitted (= not owned). May be used to mention "any thread of this group", though already covered by G/xxx like today. G>0 xxx bind_conf: Bind to these threads of this group FD/task: group G, mask xxx It looks like keeping groups starting at zero internally complicates everything though. But forcing it to start at 1 might also require that we rescan all tasks to replace 0 with 1 upon startup. This would also allow group 0 to be special and be used as the default group for any new thread creation, so that group0.count would keep the number of unassigned threads. Let's try: group mask description 0 0 bind_conf: groups & thread not set. bind to any/all task: "run on the same group & thread as the caller". 0 xxx bind_conf: thread set but not group: thread IDs are global FD/task: invalid. Or maybe for a task we could use this to mean "run on current group, thread XXX", which would cover the need for health checks (g/t 0/0 while sleeping, 0/xxx while running) and have wake_expired_tasks() detect 0/0 and wake them up to a random group. G>0 0 bind_conf: only group is set: bind to all threads of group G FD/task: mask 0 not permitted (= not owned). May be used to mention "any thread of this group", though already covered by G/xxx like today. G>0 xxx bind_conf: Bind to these threads of this group FD/task: group G, mask xxx With a single group declared in the config, group 0 would implicitly find the first one. The problem with the approach above is that a task queued in one group+thread's wait queue could very well receive a signal from another thread and/or group, and that there is no indication about where the task is queued, nor how to dequeue it. Thus it seems that it's up to the application itself to unbind/ rebind a task. This contradicts the principle of leaving a task waiting in a wait queue and waking it anywhere. Another possibility might be to decide that a task having a defined group but a mask of zero is shared and will always be queued into its group's wait queue. However, upon expiry, the scheduler would notice the thread-mask 0 and would broadcast it to any group. Right now in the code we have: - 18 calls of task_new(tid_bit) - 17 calls of task_new_anywhere() - 2 calls with a single bit Thus it looks like "task_new_anywhere()", "task_new_on()" and "task_new_here()" would be sufficient.