This flag is not needed anymore as we're already marking the waiting
threads as harmless, thus the thread's bit is already covered by this
information. The variable was unexported.
The debug_handler() function waits for other threads to join, but does
not mark itself as harmless, so if at the same time another thread tries
to isolate, this may deadlock. In practice this does not happen as the
signal is received during epoll_wait() hence under harmless mode, but
it can possibly arrive under other conditions.
In order to improve this, while waiting for other threads to join, we're
now marking the current thread as harmless, as it's doing nothing but
waiting for the other ones. This way another harmless waiter will be able
to proceed. It's valid to do this since we're not doing anything else in
this loop.
One improvement could be to also check for the thread being idle and
marking it idle in addition to harmless, so that it can even release a
full isolation requester. But that really doesn't look worth it.
The harmless status is not re-entrant, so sometimes for signal handling
it can be useful to know if we're already harmless or not. Let's add a
function doing that, and make the debugger use it instead of manipulating
the harmless mask.
thread_isolate() and thread_isolate_full() were relying on a set of thread
masks for all threads in different states (rdv, harmless, idle). This cannot
work anymore when the number of threads increases beyond LONGBITS so we need
to change the mechanism.
What is done here is to have a counter of requesters and the number of the
current isolated thread. Threads which want to isolate themselves increment
the request counter and wait for all threads to be marked harmless (or idle)
by scanning all groups and watching the respective masks. This is possible
because threads cannot escape once they discover this counter, unless they
also want to isolate and possibly pass first. Once all threads are harmless,
the requesting thread tries to self-assign the isolated thread number, and
if it fails it loops back to checking all threads. If it wins it's guaranted
to be alone, and can drop its harmless bit, so that other competing threads
go back to the loop waiting for all threads to be harmless. The benefit of
proceeding this way is that there's very little write contention on the
thread number (none during work), hence no cache line moves between caches,
thus frozen threads do not slow down the isolated one.
Once it's done, the isolated thread resets the thread number (hence lets
another thread take the place) and decrements the requester count, thus
possibly releasing all harmless threads.
With this change there's no more need for any global mask to synchronize
any thread, and we only need to loop over a number of groups to check
64 threads at a time per iteration. As such, tinfo's threads_want_rdv
could be dropped.
This was tested with 64 threads spread into 2 groups, running 64 tasks
(from the debug dev command), 20 "show sess" (thread_isolate()), 20
"add server blah/blah" (thread_isolate()), and 20 "del server blah/blah"
(thread_isolate_full()). The load remained very low (limited by external
socat forks) and no stuck nor starved thread was found.
Stopping threads need a mask to figure who's still there without scanning
everything in the poll loop. This means this will have to be per-group.
And we also need to have a global stopping groups mask to know what groups
were already signaled. This is used both to figure what thread is the first
one to catch the event, and which one is the first one to detect the end of
the last job. The logic isn't changed, though a loop is required in the
slow path to make sure all threads are aware of the end.
Note that for now the soft-stop still takes time for group IDs > 1 as the
poller is not yet started on these threads and needs to expire its timeout
as there's no way to wake it up. But all threads are eventually stopped.
The thread group info is not sufficient to represent a thread group's
current state as it's read-only. We also need something comparable to
the thread context to represent the aggregate state of the threads in
that group. This patch introduces ha_tgroup_ctx[] and tg_ctx for this.
It's indexed on the group id and must be cache-line aligned. The thread
masks that were global and that do not need to remain global were moved
there (want_rdv, harmless, idle).
Given that all the masks placed there now become group-specific, the
associated thread mask (tid_bit) now switches to the thread's local
bit (ltid_bit). Both are the same for nbtgroups 1 but will differ for
other values.
There's also a tg_ctx pointer in the thread so that it can be reached
from other threads.
This function was added in 2.0 when reworking the thread isolation
mechanism to make it more reliable. However it if fundamentally
incompatible with the full isolation mechanism provided by
thread_isolate_full() since that one will wait for all threads to
become idle while the former will wait for all threads to finish
waiting, causing a deadlock.
Given that it's not used, let's just drop it entirely before it gets
used by accident.
In order to kill all_threads_mask we'll need to have an equivalent for
the thread groups. The all_tgroups_mask does just this, it keeps one bit
set per enabled group.
Since commit cc7a11ee3 ("MINOR: threads: set the tid, ltid and their bit
in thread_cfg") we ought not use (1UL << thr) to get the group mask for
thread <thr>, but (ha_thread_info[thr].ltid_bit). ha_tkillall() needs
this.
Since commit cc7a11ee3 ("MINOR: threads: set the tid, ltid and their bit
in thread_cfg") we ought not use (1UL << thr) to get the group mask for
thread <thr>, but (ha_thread_info[thr].ltid_bit). clock_report_idle()
needs this.
This also implies not using all_threads_mask anymore but taking the mask
from the tgroup since it becomes relative now.
Since commit cc7a11ee3 ("MINOR: threads: set the tid, ltid and their bit
in thread_cfg") we ought not use (1UL << thr) to get the group mask for
thread <thr>, but (ha_thread_info[thr].ltid_bit). wdt_handler() needs
this.
Since commit cc7a11ee3 ("MINOR: threads: set the tid, ltid and their bit
in thread_cfg") we ought not use (1UL << thr) to get the group mask for
thread <thr>, but (ha_thread_info[thr].ltid_bit). ha_thread_dump() needs
this.
In order to replace the global "all_threads_mask" we'll need to have an
equivalent per group. Take this opportunity for calling it threads_enabled
and make sure which ones are counted there (in case in the future we allow
to stop some).
Now that the tgid is accessible from the thread, it's pointless to have
it in the group, and it was only set but never used. However we'll soon
frequently need the mask corresponding to the group ID and the risk of
getting it wrong with the +1 or to shift 1 instead of 1UL is important,
so let's store the tgid_bit there.
At several places we're dereferencing the thread group just to catch
the group number, and this will become even more required once we start
to use per-group contexts. Let's just add the tgid in the thread_info
struct to make this easier.
Every single place where sleeping_thread_mask was still used was to test
or set a single thread. We can now add a per-thread flag to indicate a
thread is sleeping, and remove this shared mask.
The wake_thread() function now always performs an atomic fetch-and-or
instead of a first load then an atomic OR. That's cleaner and more
reliable.
This is not easy to test, as broadcast FD events are rare. The good
way to test for this is to run a very low rate-limited frontend with
a listener that listens to the fewest possible threads (2), and to
send it only 1 connection at a time. The listener will periodically
pause and the wakeup task will sometimes wake up on a random thread
and will call wake_thread():
frontend test
bind :8888 maxconn 10 thread 1-2
rate-limit sessions 5
Alternately, disabling/enabling a frontend in loops via the CLI also
broadcasts such events, but they're more difficult to observe since
this is causing connection failures.
Right now when an inter-thread wakeup happens, we preliminary check if the
thread was asleep, and if so we wake the poller up and remove its bit from
the sleeping mask. That's not very clean since the sleeping mask cannot be
entirely trusted since a thread that's about to wake up will already have
its sleeping bit removed.
This patch adds a new per-thread flag (TH_FL_NOTIFIED) to remember that a
thread was notified to wake up. It's cleared before checking the task lists
last, so that new wakeups can be considered again (since wake_thread() is
only used to notify about task wakeups and FD polling changes). This way
we do not need to modify a remote thread's sleeping mask anymore. As such
wake_thread() now only tests and sets the TH_FL_NOTIFIED flag but doesn't
clear sleeping anymore.
When enabling an FD that's only bound to another thread, instead of
always picking the first one, let's pick a random one. This is rarely
used (enabling a frontend, or session rate-limiting period ending),
and has greater chances of avoiding that some obscure corner cases
could degenerate into a poorly distributed load.
Till now, update_fd_polling() used to check if all the target threads were
sleeping, and only then would wake an owning thread up. This causes several
problems among which the need for the sleeping_thread_mask and the fact that
by the time we wake one thread up, it has changed.
This commit changes this by leaving it to wake_thread() to perform this
check on the selected thread, since wake_thread() is already capable of
doing this now. Concretely speaking, for updt_fd_polling() it will mean
performing one computation of an ffsl() before knowing the sleeping status
on a global FD state change (which is very rare and not important here,
as it basically happens after relaxing a rate-limit (i.e. once a second
at beast) or after enabling a frontend from the CLI); thus we don't care.
When returning from the polling syscall, all pollers have a certain
dance to follow, made of wall clock updates, thread harmless updates,
idle time management and sleeping mask updates. Let's have a centralized
function to deal with all of this boring stuff: fd_leaving_poll(), and
make all the pollers use it.
The thread flags are touched a little bit by other threads, e.g. the STUCK
flag may be set by other ones, and they're watched a little bit. As such
we need to use atomic ops only to manipulate them. Most places were already
using them, but here we generalize the practice. Only ha_thread_dump() does
not change because it's run under isolation.
The thread flags were once believed to be local to the thread, but as
it stands, even the STUCK flag is shared since it's looked at by the
watchdog. As such we'll need to use atomic ops to manipulate them, and
likely to move them into the shared area.
This patch only moves the flag into the shared area so that we can later
decide whether it's best to leave them there or to move them back to the
local area. Interestingly, some tests have shown a 3% better performance
on dequeuing with this, while they're not used by other threads yet, so
there are definitely alignment effects that might change over time.
Almost every call place of wake_thread() checks for sleeping threads and
clears the sleeping mask itself, while the function is solely used for
there. Let's move the check and the clearing of the bit inside the function
itself. Note that updt_fd_polling() still performs the check because its
rules are a bit different.
Now that the inter-task wakeups are cheap, there's no point in using
task_instant_wakeup() anymore when dequeueing tasks. The use of the
regular task_wakeup() is sufficient there and will preserve a better
fairness: the test that went from 40k to 570k RPS now gives 580k RPS
(down from 585k RPS with previous commit). This essentially reverts
commit 27fab1dcb ("MEDIUM: queue: use tasklet_instant_wakeup() to
wake tasks").
Since we don't mix tasks from different threads in the run queues
anymore, we don't need to use the eb32sc_ trees and we can switch
to the regular eb32 ones. This uses cheaper lookup and insert code,
and a 16-thread test on the queues shows a performance increase
from 570k RPS to 585k RPS.
This bit field used to be a per-thread cache of the result of the last
lookup of the presence of a task for each thread in the shared cache.
Since we now know that each thread has its own shared cache, a test of
emptiness is now sufficient to decide whether or not the shared tree
has a task for the current thread. Let's just remove this mask.
grq_total was only used to know how many tasks were being queued in the
global runqueue for stats purposes, and that was transferred to the per
thread rq_total counter once assigned. We don't need this anymore since
we know where they are, so let's just directly update rq_total and drop
that one.
Since we only use the shared runqueue to put tasks only assigned to
known threads, let's move that runqueue to each of these threads. The
goal will be to arrange an N*(N-1) mesh instead of a central contention
point.
The global_rqueue_ticks had to be dropped (for good) since we'll now
use the per-thread rqueue_ticks counter for both trees.
A few points to note:
- the rq_lock stlil remains the global one for now so there should not
be any gain in doing this, but should this trigger any regression, it
is important to detect whether it's related to the lock or to the tree.
- there's no more reason for using the scope-based version of the ebtree
now, we could switch back to the regular eb32_tree.
- it's worth checking if we still need TASK_GLOBAL (probably only to
delete a task in one's own shared queue maybe).
The runqueue ticks counter is per-thread and wasn't initially meant to
be shared. We'll soon have to share it so let's make it atomic. It's
only updated when waking up a task, and no performance difference was
observed. It was moved in the thread_ctx struct so that it doesn't
pollute the local cache line when it's later updated by other threads.
This function stopped being used before 2.4 because either the task is
dequeued by the scheduler itself and it knows where to find it, or it's
killed by any thread, and task_kill() must be used for this as only this
one is safe.
It's difficult to say whether task_unlink_rq() is still safe, but once
the lock moves to a thread declared in the task itself, it will be even
more difficult to keep it safe.
Let's just remove it now before someone reuses it and causes trouble.
TASK_SHARED_WQ was set upon task creation and never changed afterwards.
Thus if a task was created to run anywhere (e.g. a check or a Lua task),
all its timers would always pass through the shared timers queue with a
lock. Now we know that tid<0 indicates a shared task, so we can use that
to decide whether or not to use the shared queue. The task might be
migrated using task_set_affinity() but it's always dequeued first so
the check will still be valid.
Not only this removes a flag that's difficult to keep synchronized with
the thread ID, but it should significantly lower the load on systems with
many checks. A quick test with 5000 servers and fast checks that were
saturating the CPU shows that the check rate increased by 20% (hence the
CPU usage dropped by 17%). It's worth noting that run_task_lists() almost
no longer appears in perf top now.
As previously advertised in comments, the mask-based task_new() is now
gone. The low-level function now is task_new_on() which takes a thread
number or a negative value for "any thread", which is turned to zero
for thread-less builds since there's no shared WQ in thiscase. The
task_new_here() and task_new_anywhere() functions were adjusted
accordingly.
This removes the mask-based variant so that from now on the low-level
function becomes appctx_new_on() and it takes either a thread number or
a negative value for "any thread". This way we can use task_new_on() and
task_new_anywhere() instead of task_new() which will soon disappear.
At a few places where the task's thread mask. Now we know that it's always
either one bit or all bits of all_threads_mask, so we can replace it with
either 1<<tid or all_threads_mask depending on what's expected.
It's worth noting that the global_tasks_mask is still set this way and
that it's reaching its limits. Similarly, the task_new() API would deserve
an update to stop using a thread mask and use a thread number instead.
Similarly, task_set_affinity() should be updated to directly take a
thread number.
At this point the task's thread mask is not used anymore.
At several places we need to figure the ID of the first thread allowed
to run a task. Till now this was performed using my_ffsl(t->thread_mask)
but since we now have the thread ID stored into the task, let's use it
instead. This is tagged major because it starts to assume that tid<0 is
strictly equivalent to atleast2(thread_mask), and that as such, among
the allowed threads are the current one.
The tasks currently rely on a mask but do not have an assigned thread ID,
contrary to tasklets. However, in practice they're either running on a
single thread or on any thread, so that it will be worth simplifying all
this in order to ease the transition to the thread groups.
This patch introduces a "tid" field in the task struct, that's either
the number of the thread the task is attached to, or a negative value
if the task is not bound to a thread, (i.e. its mask is all_threads_mask).
The new ID is only set and updated but not used yet.
The thread mask will not be used anymore, instead the thread id only
is used. Interestingly it was already implemented in the parsing but
not used. The single/multi thread argument is not needed anymore since
it's sufficient to pass tid<0 to get a multi-threaded task/tasklet.
This is in preparation for the removal of the thread_mask in tasks as
only this debug code was using it!
For now we still set tid_bit to (1UL << tid) because FDs will not
work with more than one group without this, but once FDs start to
adopt local masks this must change to thr->ltid_bit.
Before 2.3, after an async crypto processing or on session close, the engine
async file's descriptors were removed from the fdtab but not closed because
it is the engine which has created the file descriptor, and it is responsible
for closing it. In 2.3 the fd_remove() call was replaced by fd_stop_both()
which stops the polling but does not remove the fd from the fdtab and the
fd remains indefinitively in the fdtab.
A simple replacement by fd_delete() is not a valid fix because fd_delete()
removes the fd from the fdtab but also closes the fd. And the fd will be
closed twice: by the haproxy's core and by the engine itself.
Instead, let's set FD_DISOWN on the FD before calling fd_delete() which will
take care of not closing it.
This patch must be backported on branches >= 2.3, and it relies on this
previous patch:
MINOR: fd: add a new FD_DISOWN flag to prevent from closing a deleted FD
As mentioned in the patch above, a different flag will be needed in 2.3.
Some FDs might be offered to some external code (external libraries)
which will deal with them until they close them. As such we must not
close them upon fd_delete() but we need to delete them anyway so that
they do not appear anymore in the fdtab. This used to be handled by
fd_remove() before 2.3 but we don't have this anymore.
This patch introduces a new flag FD_DISOWN to let fd_delete() know that
the core doesn't own the fd and it must not be closed upon removal from
the fd_tab. This way it's totally unregistered from the poller but still
open.
This patch must be backported on branches >= 2.3 because it will be
needed to fix a bug affecting SSL async. it should be adapted on 2.3
because state flags were stored in a different way (via bits in the
structure).
Adjust FIN signal on Rx path for the application layer : ensure that the
receive buffer has no gap.
Without this extra condition, FIN was signalled as soon as the STREAM
frame with FIN was received, even if we were still waiting to receive
missing offsets.
This bug could have lead to incomplete requests read from the
application protocol. However, in practice this bug has very little
chance to happen as the application layer ensures that the demuxed frame
length is equivalent to the buffer data size. The only way to happen is
if to receive the FIN STREAM as the H3 demuxer is still processing on a
frame which is not the last one of the stream.
This must be backported up to 2.6. The previous patch on ncbuf is
required for the newly defined function ncb_is_fragmented().
MINOR: ncbuf: implement ncb_is_fragmented()
Implement a new status function for ncbuf. It allows to quickly report
if a buffer contains data in a fragmented way, i.e. with gaps in between
or at start of the buffer.
To summarize, a buffer is considered as non-fragmented in the following
cases :
- a null or empty buffer
- a full buffer
- a buffer containing exactly one data block at the beginning, following
by a gap until the end.
Since a previous refactoring, application protocol layer is not require
anymore to call qcs_consume(). This function is now automatically used
by the MUX itself.