When task profiling is enabled, the scheduler can measure and report
the cumulated time spent in each task and their respective latencies. But
this was wrong for tasks with few wakeups as well as for self-waking ones,
because the call date needed to measure how long it takes to process the
task is retrieved in the task itself (->wake_date was turned to the call
date), and we could face two conditions:
- a new wakeup while the task is executing would reset the ->wake_date
field before returning and cause abnormally low values to be reported;
that was likely the case for task_run_applet for self-waking applets;
- when the task dies, NULL is returned and the call date couldn't be
retrieved, so that CPU time was not being accounted for. This was
particularly visible with process_stream() which is usually called
only twice per request, and whose time was systematically halved.
The cleanest solution here is to keep in mind that the scheduler already
uses quite a bit of local context in th_ctx, and place the intermediary
values there so that they cannot vanish. The wake_date has to be reset
immediately once read, and only its copy is used along the function. Note
that this must be done both for tasks and tasklets, and that until recently
tasklets were also able to report wrong values due to their sole dependency
on TH_FL_TASK_PROFILING between tests.
One nice benefit for future improvements is that such information will now
be available from the thread context without having to be stored into the task
anymore.
Since the tasklet part was computed with wrapping 32-bit arithmetic
while the task one used 64-bit, the values are now consistently handled
on 32 bits, which is already largely sufficient (4s spent in a task is
more than twice what the watchdog would tolerate). Some further cleanups
might be necessary, but the patch aimed at staying minimal.
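To illustrate the copy-and-reset idea, here is a minimal, self-contained
sketch (the structure and function names below are made up for the
example, they are not HAProxy's exact code): the wake date is atomically
read and cleared, and both it and the call date are kept in a
thread-local context so a concurrent wakeup or the death of the task
cannot make them vanish.

  #include <stdatomic.h>
  #include <stdint.h>

  /* Hypothetical per-thread scheduler context mirroring the idea of
   * keeping the intermediary profiling values in th_ctx so they cannot
   * vanish while the handler runs.
   */
  static _Thread_local struct {
      uint32_t sched_wake_date;   /* copy of the task's wake date */
      uint32_t sched_call_date;   /* date at which the handler was called */
  } th_prof;

  struct task_sketch {
      _Atomic uint32_t wake_date; /* set on wakeup, 0 when not profiled */
  };

  /* Just before calling the handler: grab the wake date and reset it in
   * one atomic operation, so a new wakeup during execution cannot lead
   * to an abnormally small latency being reported later.
   */
  static void sched_note_call(struct task_sketch *t, uint32_t now_us)
  {
      th_prof.sched_wake_date = atomic_exchange(&t->wake_date, 0);
      th_prof.sched_call_date = now_us;
  }

  /* Once the handler returns (even if it returned NULL because the task
   * died), the CPU time can still be accounted for from the local copy.
   */
  static uint32_t sched_cpu_time(uint32_t now_us)
  {
      return now_us - th_prof.sched_call_date;
  }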
Task profiling output after 1 million HTTP requests previously looked like
this:
Tasks activity:
  function                      calls   cpu_tot   cpu_avg   lat_tot   lat_avg
  h1_io_cb                    2012338    4.850s   2.410us    12.91s   6.417us
  process_stream              2000136    9.594s   4.796us    34.26s   17.13us
  sc_conn_io_cb               2000135    1.973s   986.0ns    30.24s   15.12us
  h1_timeout_task                 137         -         -   2.649ms   19.34us
  accept_queue_process             49   152.3us   3.107us   321.7yr   6.564yr
  main+0x146430                     7   5.250us   750.0ns   25.92us   3.702us
  srv_cleanup_idle_conns            1   559.0ns   559.0ns   918.0ns   918.0ns
  task_run_applet                   1         -         -   2.162us   2.162us
Now it looks like this:
Tasks activity:
  function                      calls   cpu_tot   cpu_avg   lat_tot   lat_avg
  h1_io_cb                    2014194    4.794s   2.380us    13.75s   6.826us
  process_stream              2000151    20.01s   10.00us    36.04s   18.02us
  sc_conn_io_cb               2000148    2.167s   1.083us    32.27s   16.13us
  h1_timeout_task                 198   54.24us   273.0ns   3.487ms   17.61us
  accept_queue_process             52   158.3us   3.044us   409.9us   7.882us
  main+0x1466e0                    18   16.77us   931.0ns   63.98us   3.554us
  srv_cleanup_toremove_conns        8   282.1us   35.26us   546.8us   68.35us
  srv_cleanup_idle_conns            3   149.2us   49.73us   8.131us   2.710us
  task_run_applet                   3   268.1us   89.38us   11.61us   3.871us
Note the two-fold difference on process_stream().
This feature is essentially used for debugging so it has extremely limited
impact. However it's used quite a bit more in bug reports and it would be
desirable that at least 2.6 gets this fix backported. It depends on at least
these two previous patches which will then also have to be backported:
MINOR: task: permanently enable latency measurement on tasklets
CLEANUP: task: rename ->call_date to ->wake_date
This field is misnamed because its real and important content is the
date the task was woken up, not the date it was called. It temporarily
holds the call date during execution but this remains confusing. In
fact before the latency measurements were possible it was indeed a call
date. Thus it will now be called wake_date.
This change is necessary because a subsequent fix will require the
introduction of the real call date in the thread ctx.
When tasklet latency measurement was enabled in 2.4 with commit b2285de04
("MINOR: tasks: also compute the tasklet latency when DEBUG_TASK is set"),
the feature was conditioned on DEBUG_TASK because the field would add 8
bytes to the struct tasklet.
This approach was not a very good idea because the struct ends on an int
anyway, so it finishes with a 32-bit hole regardless of the presence of
this field. What is true however is that adding it turned a 64-byte
struct into a 72-byte one when caller debugging is enabled.
This patch revisits this with a minor change. Now only the lowest 32
bits of the call date are stored, so they always fit in the remaining
hole, and this makes it possible to remove the dependency on DEBUG_TASK. With
debugging off, we're now seeing a 48-byte struct, and with debugging
on it's exactly 64 bytes, thus still exactly one cache line. 32 bits
allow a latency of 4 seconds on a tasklet, which already indicates a
completely dead process, so there's no point storing the upper bits at
all. And even if it were to happen once in a while, the lost upper
bits would not really add any value to the debug reports. Also, one
tasklet wakeup out of every 4 billion will now not be sampled due to the
test on the value itself, but again we just don't care: these are
statistics and the measurements are not 9-digit accurate anyway.
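As an illustration of why the low 32 bits are enough, here is a small
standalone sketch (the helper is invented for the example): latencies
are obtained by unsigned subtraction of truncated nanosecond timestamps,
which stays correct across the 32-bit wrap as long as the real delay is
below about 4.29 seconds.

  #include <stdint.h>
  #include <stdio.h>
  #include <time.h>

  /* Return only the low 32 bits of a monotonic nanosecond clock: this is
   * what fits in the existing 32-bit hole of the struct.
   */
  static uint32_t mono_ns32(void)
  {
      struct timespec ts;

      clock_gettime(CLOCK_MONOTONIC, &ts);
      return (uint32_t)((uint64_t)ts.tv_sec * 1000000000ULL + ts.tv_nsec);
  }

  int main(void)
  {
      uint32_t wake_date = mono_ns32();          /* stored at wakeup time */

      /* ... the tasklet waits in the queue for a while ... */

      uint32_t lat_ns = mono_ns32() - wake_date; /* wrap-safe unsigned diff */

      /* the rare wakeup landing exactly on 0 simply isn't sampled */
      if (wake_date)
          printf("latency: %u ns\n", lat_ns);
      return 0;
  }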
This one is only used as a hint to improve scheduling latency, so there
is no more point in keeping it global since each thread group handles
its own run queue.
Their migration was postponed for convenience only but now's time for
having the shared wait queues per thread group and not just per process,
otherwise the WQ lock uses a huge amount of CPU alone.
The thread flags are touched a little bit by other threads, e.g. the STUCK
flag may be set by other ones, and they're watched a little bit. As such
we need to use only atomic ops to manipulate them. Most places were already
using them, but here we generalize the practice. Only ha_thread_dump() does
not change because it's run under isolation.
Almost every call place of wake_thread() checks for sleeping threads and
clears the sleeping mask itself, while the function is solely used for
this. Let's move the check and the clearing of the bit inside the function
itself. Note that updt_fd_polling() still performs the check because its
rules are a bit different.
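A possible sketch of the resulting pattern (the mask and helper below
are placeholders, not the actual HAProxy symbols): the sleeping bit is
tested and cleared atomically inside the function, so callers no longer
have to duplicate that logic.

  #include <stdatomic.h>

  static _Atomic unsigned long sleeping_mask;     /* placeholder */

  static void kick_thread(int thr)                /* e.g. write to its pipe */
  {
      (void)thr;
  }

  /* Wake a thread only if it was actually marked sleeping, clearing the
   * bit in the same atomic operation so that two concurrent callers do
   * not both end up kicking it.
   */
  static void wake_thread_sketch(int thr)
  {
      unsigned long bit = 1UL << thr;

      if (atomic_fetch_and(&sleeping_mask, ~bit) & bit)
          kick_thread(thr);
  }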
Since we don't mix tasks from different threads in the run queues
anymore, we don't need to use the eb32sc_ trees and we can switch
to the regular eb32 ones. This uses cheaper lookup and insert code,
and a 16-thread test on the queues shows a performance increase
from 570k RPS to 585k RPS.
This bit field used to be a per-thread cache of the result of the last
lookup of the presence of a task for each thread in the shared cache.
Since we now know that each thread has its own shared cache, a test of
emptiness is now sufficient to decide whether or not the shared tree
has a task for the current thread. Let's just remove this mask.
grq_total was only used to know how many tasks were being queued in the
global runqueue for stats purposes, and that was transferred to the per
thread rq_total counter once assigned. We don't need this anymore since
we know where they are, so let's just directly update rq_total and drop
that one.
Since we only use the shared runqueue to put tasks only assigned to
known threads, let's move that runqueue to each of these threads. The
goal will be to arrange an N*(N-1) mesh instead of a central contention
point.
The global_rqueue_ticks had to be dropped (for good) since we'll now
use the per-thread rqueue_ticks counter for both trees.
A few points to note:
- the rq_lock still remains the global one for now so there should not
be any gain in doing this, but should this trigger any regression, it
is important to detect whether it's related to the lock or to the tree.
- there's no more reason for using the scope-based version of the ebtree
now, we could switch back to the regular eb32_tree.
- it's worth checking if we still need TASK_GLOBAL (probably only to
delete a task in one's own shared queue maybe).
The runqueue ticks counter is per-thread and wasn't initially meant to
be shared. We'll soon have to share it so let's make it atomic. It's
only updated when waking up a task, and no performance difference was
observed. It was moved into the thread_ctx struct so that it doesn't
pollute the local cache line when it's later updated by other threads.
TASK_SHARED_WQ was set upon task creation and never changed afterwards.
Thus if a task was created to run anywhere (e.g. a check or a Lua task),
all its timers would always pass through the shared timers queue with a
lock. Now we know that tid<0 indicates a shared task, so we can use that
to decide whether or not to use the shared queue. The task might be
migrated using task_set_affinity() but it's always dequeued first so
the check will still be valid.
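A minimal sketch of that decision (the types below are simplified
placeholders, not HAProxy's real structures):

  struct wq_root { void *timers; };               /* placeholder timer tree */

  struct thr_ctx_sketch { struct wq_root wq; };

  static struct thr_ctx_sketch threads_sketch[64];
  static struct wq_root shared_wq_sketch;         /* lock-protected, shared */

  struct task_sketch2 {
      int tid;      /* >= 0: bound to one thread; < 0: may run anywhere */
  };

  /* Pick the wait queue for a task: a task bound to a single thread uses
   * that thread's private queue, a "run anywhere" task (tid < 0) keeps
   * using the shared, locked one.
   */
  static struct wq_root *task_wq(struct task_sketch2 *t)
  {
      return (t->tid < 0) ? &shared_wq_sketch : &threads_sketch[t->tid].wq;
  }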
Not only this removes a flag that's difficult to keep synchronized with
the thread ID, but it should significantly lower the load on systems with
many checks. A quick test with 5000 servers and fast checks that were
saturating the CPU shows that the check rate increased by 20% (hence the
CPU usage dropped by 17%). It's worth noting that run_task_lists() almost
no longer appears in perf top now.
At a few places the task's thread mask was used. Now we know that it's always
either one bit or all bits of all_threads_mask, so we can replace it with
either 1<<tid or all_threads_mask depending on what's expected.
It's worth noting that the global_tasks_mask is still set this way and
that it's reaching its limits. Similarly, the task_new() API would deserve
an update to stop using a thread mask and use a thread number instead.
Similarly, task_set_affinity() should be updated to directly take a
thread number.
At this point the task's thread mask is not used anymore.
At several places we need to figure the ID of the first thread allowed
to run a task. Till now this was performed using my_ffsl(t->thread_mask)
but since we now have the thread ID stored into the task, let's use it
instead. This is tagged major because it starts to assume that tid<0 is
strictly equivalent to atleast2(thread_mask), and that as such, the
current thread is among the allowed ones.
We want to be able to schedule a tasklet onto a thread after the current tasklet
is done. What we have to do is to insert this tasklet at the head of the thread
task list. Furthermore, we would like to serialize the tasklets. They must be
run in the same order as the order in which they have been scheduled. This is
implemented by passing a list of tasklets as a parameter (see the <head>
parameter) which must be reused for subsequent calls.
_tasklet_wakeup_after_on() is implemented to accomplish this job.
tasklet_wakeup_after_on() and tasklet_wakeup_after() are only wrapper macros
around _tasklet_wakeup_after_on(). tasklet_wakeup_after_on() does exactly the
same thing as _tasklet_wakeup_after_on() without having to pass the filename
and line number as parameters (useful when DEBUG_TASK is enabled).
tasklet_wakeup_after() also hides the thread parameter, which is the <tl>
tasklet's thread ID.
tasklet_kill() was introduced in 2.5-dev4 with commit 7b368339a
("MEDIUM: task: implement tasklet kill"), but a comparison error
there causes tasklets killed on thread 1 to be assigned to the killing
thread. Fortunately, the function was finally not used so there's
no harm right now, hence the minor tag, but this must be fixed and
backported in case a later fix relies on it.
This should be backported to 2.5.
If we've stopped consulting the local wait queue due to too many tasks
(max_processed <= 0), there's no point in locking the shared WQ, checking
the first task's expiration date, and upgrading the lock just to refrain
from doing the work because of the limit. All this does is
increase contention on an already contended system.
Note that there is still a fairness issue in this WQ dequeuing code. If
each thread is busy with expired tasks, no thread will dequeue the global
ones. In practice this situation doesn't make much sense and should quickly
resolve itself,
but it could be nice to have an alternating flag indicating where to
start from on next call to improve this.
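The shape of the early exit could look like the sketch below (purely
illustrative, with the surrounding processing reduced to comments):

  /* Illustrative only: once the local budget is exhausted, bail out
   * before even touching the shared WQ lock instead of taking the read
   * lock, peeking at the first expiration date and upgrading the lock
   * just to give up because of the limit.
   */
  static void wake_expired_sketch(int max_processed)
  {
      /* ... local wait queue processed first, decrementing max_processed ... */

      if (max_processed <= 0)
          return;                  /* don't add contention for nothing */

      /* ... read-lock the shared WQ, check expirations, upgrade if needed ... */
  }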
This macro was used both for binding and for lookups. When binding tasks
or FDs, using all_threads_mask instead is better as it will later be per
group. For lookups, ~0UL always does the job. Thus in practice the macro
was already almost not used anymore since the rest of the code could run
fine with a constant of all ones there.
Instead of having a global mask of all the profiled threads, let's have
one flag per thread in each thread's flags. They are never accessed more
than one at a time and are better located inside the threads' contexts for
both performance and scalability.
Since the relaxation of the run-queue locks in 2.0 there has been a
very small but existing race between expired tasks and running tasks:
a task might be expiring and being woken up at the same time, on
different threads. This is protected against via the TASK_QUEUED and
TASK_RUNNING flags, but just after the task finishes executing, it
releases its TASK_RUNNING bit and only then may it go to task_queue().
This one will do nothing if the task's ->expire field is zero, but
if the field turns to zero between this test and the call to
__task_queue() then three things may happen:
- the task may remain in the WQ for the next 24 days if its expiration
date is in the future;
- the task may prevent any other task after it from expiring for the
next 24 days once it's queued;
- if DEBUG_STRICT is set on 2.4 and above, an abort may happen
- since 2.2, if the task got killed in between, then we may
even requeue a freed task, causing random behaviour next time
it's found there, or possibly corrupting the tree if it gets
reinserted later.
The peers code is one call path that easily reproduces the case with
the ->expire field being reset, because it starts by setting it to
TICK_ETERNITY as the first thing when entering the task handler. But
other code parts also use multi-threaded tasks and rightfully expect
to be able to touch their expire field without causing trouble. No
trivial code path was found that would destroy such a shared task at
runtime, which already limits the risks.
This must be backported to 2.0.
There were a few casts of list* to mt_list* that were upsetting some
old compilers (not sure about the effect on others). We had created
list_to_mt_list() purposely for this, let's use it instead of applying
this cast.
This application-specific flag was added in 2.4-dev by commit 6fa8bcdc7
("MINOR: task: add an application specific flag to the state: TASK_F_USR1")
to help preserve the idle connections' status across wakeup calls. While
the code to do this was OK for tasklets, it was wrong for tasks, as in an
effort not to lose it when setting the RUNNING flag (that tasklets don't
have), it ended up being unconditionally set. It just happens that for now
no regular tasks use it, only tasklets.
This fix makes sure we always atomically perform (state & flags | running)
there, using a CAS. It also does it for tasklets because it was possible
to lose some such flags if set by another thread, even though this should
not happen with current code. In order to make the code more readable (and
avoid the previous mistake of repeated flags in the bit field), a new
TASK_PERSISTENT aggregate was declared in task.h for this.
In practice the CAS is cheap here because task states are stable or
convergent so the loop will almost never be taken.
This should be backported to 2.4.
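A minimal sketch of such a CAS loop, using C11 atomics and made-up flag
values for illustration (HAProxy has its own HA_ATOMIC_* macros and real
TASK_* definitions):

  #include <stdatomic.h>

  #define TASK_RUNNING_SK    0x0001u
  #define TASK_F_USR1_SK     0x0100u
  /* flags that must survive while the task is being set running */
  #define TASK_PERSISTENT_SK (TASK_F_USR1_SK)

  struct task_state_sk { _Atomic unsigned int state; };

  /* Atomically keep only the persistent flags and set RUNNING, without
   * losing a flag possibly set by another thread in between.
   */
  static unsigned int task_set_running(struct task_state_sk *t)
  {
      unsigned int old = atomic_load(&t->state);

      while (!atomic_compare_exchange_weak(&t->state, &old,
                     (old & TASK_PERSISTENT_SK) | TASK_RUNNING_SK))
          ;   /* 'old' was refreshed by the failed CAS; states converge fast */
      return old;
  }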
The TI_FL_STUCK flag is manipulated by the watchdog and scheduler
and describes the apparent life/death of a thread so it changes
all the time and it makes sense to move it to the thread's context
for an active thread.
The scheduler contains a lot of stuff that is thread-local and not
exclusively tied to the scheduler. Other parts (namely thread_info)
contain similar thread-local context that ought to be merged with
it but that is even less related to the scheduler. However moving
more data into this structure isn't possible since task.h is high
level and cannot be included everywhere (e.g. activity) without
causing include loops.
In the end, it appears that the task_per_thread represents most of
the per-thread context defined with generic types and should simply
move to tinfo.h so that everyone can use them.
The struct was renamed to thread_ctx and the variable "sched" was
renamed to "th_ctx". "sched" used to be initialized manually from
run_thread_poll_loop(), now it's initialized by ha_set_tid() just
like ti, tid, tid_bit.
The memset() in init_task() was removed in favor of a bss initialization
of the array, so that other subsystems can put their stuff in this array.
Since the tasklet array has TL_CLASSES elements, the TL_* definitions
were moved there as well, but it's not a problem.
The vast majority of the change in this patch is caused by the
renaming of the structures.
The entering_poll/leaving_poll/measure_idle functions that were hard
to classify and used to move to various locations have now been placed
into clock.c since it's precisely about time-keeping. The functions
were renamed to clock_*. The samp_time and idle_time values are now
static since there is no reason for them to be read from outside.
There is currently a problem related to time keeping. We're mixing
the functions to perform calculations with the os-dependent code
needed to retrieve and adjust the local time.
This patch extracts from time.{c,h} the parts that are solely dedicated
to time keeping. These are the "now" or "before_poll" variables for
example, as well as the various now_*() functions that make use of
gettimeofday() and clock_gettime() to retrieve the current time.
The "tv_*" functions moved there were also more appropriately renamed
to "clock_*".
Other parts used to compute stolen time are in other files, they will
have to be picked next.
It's pointless to inline this, it's called exactly once per poll loop,
and it depends on time.h which is quite deep. Let's move that to task.c
along with sched_report_idle().
The idle time calculation stuff was moved to task.h by commit 6dfab112e
("REORG: sched: move idle time calculation from time.h to task.h") but
these two variables that are only maintained by task.{c,h} were still
left in time.{c,h}. They have to move as well.
A number of files currently access activity counters but rely on their
definitions to be inherited from other files (task.c, backend.c, hlua.c,
sock.c, pool.c, stats.c, fd.c).
I don't know why I inlined this one, this makes no sense given that it's
only used for stats, and it starts a circular dependency on tinfo.h which
can be problematic in the future. In addition, all the stuff related to
idle time calculation should be with the rest of the scheduler, which
currently is in task.{c,h}, so let's move it there.
Work lists were a mechanism introduced in 1.8 to asynchronously delegate
some work to be performed on another thread via a dedicated task.
The only user was the listeners, to deal with the queue. Nowadays
the tasklets have made this much more convenient, and have replaced
work_lists in the listeners. It seems there will be no valid use case
of work lists anymore, so better get rid of them entirely and keep the
scheduler code cleaner.
__task_queue() must absolutely not be called with TICK_ETERNITY or it
will place a never-expiring node upfront in the timers queue, preventing
any timer from expiring until the process is restarted. Code was found
to cause this using "task_schedule(task, now_ms)" which does this one
millisecond every 49.7 days, so let's add a condition against this. It
must never trigger since any process susceptible to trigger it would
already accumulate tasks until it dies.
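The guard described above could be sketched as follows (TICK_ETERNITY is
0 in HAProxy's tick.h; the BUG_ON stand-in below is just an assert for
the example):

  #include <assert.h>

  #define TICK_ETERNITY   0u                  /* as defined in tick.h */
  #define BUG_ON_SK(cond) assert(!(cond))     /* stand-in for BUG_ON() */

  struct task_exp_sk { unsigned int expire; /* ... tree node ... */ };

  static void __task_queue_sketch(struct task_exp_sk *task)
  {
      /* Queuing a never-expiring task would park it ahead of every other
       * timer until the process restarts, so refuse it outright.
       */
      BUG_ON_SK(task->expire == TICK_ETERNITY);

      /* ... insert the task into the wait queue at ->expire ... */
  }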
An extra test was added in wake_expired_tasks() to detect tasks whose
timeout would have been changed after being queued.
An improvement over this could be in the future to use a non-scalar
type (union/struct) for expiration dates so as to avoid the risk of
using them directly like this. But now_ms is already such a valid
time and this specific construct would still not be caught.
This could even be backported to stable versions to help detect other
occurrences if any.
Implement an equivalent of task_kill for tasklets. This function can be
used to request a tasklet deletion in a thread-safe way.
Currently this function is unused.
This one comes with a very deep dependency hell, only to know that
process_stream() is a function. Dropping it reduces the preprocessed
output from 1.5MB to 640kB.
The current "ADD" vs "ADDQ" is confusing because when thinking in terms
of appending at the end of a list, "ADD" naturally comes to mind, but
here it does the opposite, it inserts. Several times already it's been
incorrectly used where ADDQ was expected, the latest of which was a
fortunate accident explained in 6fa922562 ("CLEANUP: stream: explain
why we queue the stream at the head of the server list").
Let's use more explicit (but slightly longer) names now:
LIST_ADD -> LIST_INSERT
LIST_ADDQ -> LIST_APPEND
LIST_ADDED -> LIST_INLIST
LIST_DEL -> LIST_DELETE
The same is true for MT_LISTs, including their "TRY" variant.
LIST_DEL_INIT keeps its short name to encourage to use it instead of the
lazier LIST_DELETE which is often less safe.
The change is large (~674 non-comment entries) but is mechanical enough
to remain safe. No permutation was performed, so any out-of-tree code
can easily map older names to new ones.
The list doc was updated.
This patch replaces roughly all occurrences of an HA_ATOMIC_ADD(&foo, 1)
or HA_ATOMIC_SUB(&foo, 1) with the equivalent HA_ATOMIC_INC(&foo) and
HA_ATOMIC_DEC(&foo) respectively. These are 507 changes over 45 files.
Currently our atomic ops return a value but it's never known whether
the fetch is done before or after the operation, which causes some
confusion each time the value is desired. Let's create an explicit
variant of these operations suffixed with _FETCH to explicitly mention
that the fetch occurs after the operation, and make use of it at the
few call places.
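For illustration with C11 atomics (which only provide the
fetch-before-operation form), the difference looks like this:

  #include <stdatomic.h>

  static _Atomic unsigned int jobs;

  static unsigned int inc_jobs_and_get(void)
  {
      /* atomic_fetch_add() returns the value *before* the addition... */
      unsigned int before = atomic_fetch_add(&jobs, 1);

      /* ...so the "_FETCH" style result (value *after* the operation) has
       * to be rebuilt explicitly, which is exactly the ambiguity the new
       * suffix is meant to remove at the call place.
       */
      return before + 1;
  }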
Instead of setting a hard-limit on runqueue-depth and keeping it short
to maintain fairness, let's allow the scheduler to automatically cut
the existing one in two equal halves if its size is between the configured
size and its double. This will make it possible to increase the default
value while keeping a low latency.
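A sketch of that computation (names are illustrative only):

  /* Pick how many tasks to run in this batch. When the backlog sits
   * between the configured depth and twice that depth, split it into two
   * equal halves so both batches stay balanced; beyond that, stick to
   * the configured limit to preserve fairness.
   */
  static unsigned int batch_size(unsigned int queued, unsigned int rq_depth)
  {
      if (queued > rq_depth && queued <= 2 * rq_depth)
          return (queued + 1) / 2;     /* two roughly equal halves */
      if (queued > 2 * rq_depth)
          return rq_depth;
      return queued;
  }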
The default proxy was passed as a variable to all parsers instead of a
const, which is not without risk, especially when some timeout parsers used
to make some int pointers point to the default values for comparisons. We
want to be certain that none of these parsers will modify the defaults
sections by accident, so it's important to mark this proxy as const.
This patch touches all occurrences found (89).
grq_total was incremented when picking tasks from the global run queue,
but this variable was not defined with threads disabled, and the code
was optimized away at -O2. No backport is needed.
This flag will be usable by any application. It will be preserved across
wakeups so the application can use it to do various stuff. Some I/O
handlers will soon benefit from this.
It's been too short for quite a while now and is now full. It's still
time to extend it to 32-bits since we have room for this without
wasting any space, so we now gained 16 new bits for future flags.
The values were not reassigned just in case there would be a few
hidden u16 or short somewhere in which these flags are placed (as
it used to be the case with stream->pending_events).
The patch is tagged MEDIUM because this required to update the task's
process() prototype to use an int instead of a short, that's quite a
bunch of places.
It's cleaner to use a flag from the task's state to detect a tasklet
and it's even cheaper. One of the best benefits is that this will
make it possible to get the nice field out of the common part since the tasklet
doesn't need it anymore. This commit uses the last task bit available
but that's temporary as the purpose of the change is to extend this.
The preliminary approach to dealing with heavy tasks forced us to quit
the poller after meeting one. Now instead we process at most one per poll
loop and ignore the next ones, so that we get more bandwidth to process
all other classes.
Doing so further reduced the induced HTTP request latency at 100k req/s
under the stress of 1000 concurrent SSL handshakes in the following
proportions:
            |  default   | low-latency
  ----------+------------+--------------
   before   |  2.75 ms   |   2.0 ms
   after    |  1.38 ms   |   0.98 ms
In both cases, the latency is roughly halved. It's worth noting that
both values are now exactly 10 times better than in 2.4-dev9. Even the
percentiles have much improved. For 16 HTTP connections (1 per thread)
competing with 1000 SSL handshakes, we're seeing these long-tail
latencies (in milliseconds) :
              |  99.5%  |  99.9%  |  100%
   -----------+---------+---------+--------
   2.4-dev9   |   48.4  |   58.1  |  78.5
   previous   |    6.2  |   11.4  |  67.8
   this patch |    2.8  |    2.9  |   6.1
The task latency profiling report now shows this in default mode:
$ socat - /tmp/sock1 <<< "show profiling"
Per-task CPU profiling : on # set profiling tasks {on|auto|off}
Tasks activity:
  function                      calls   cpu_tot   cpu_avg   lat_tot   lat_avg
  si_cs_io_cb                 3061966    2.224s   726.0ns    42.03s   13.72us
  h1_io_cb                    3061960    6.418s   2.096us    18.76m   367.6us
  process_stream              3059982    9.137s   2.985us    15.52m   304.3us
  ssl_sock_io_cb               602657    4.265m   424.7us    4.736h   28.29ms
  h1_timeout_task              202973         -         -    6.254s   30.81us
  accept_queue_process         135547    1.179s   8.699us    16.29s   120.1us
  srv_cleanup_toremove_conns       81   15.64ms   193.1us   30.87ms   381.1us
  task_run_applet                  10   758.7us   75.87us   51.77us   5.176us
  srv_cleanup_idle_conns            4   375.3us   93.83us   54.52us   13.63us
And this in low-latency mode, showing that both si_cs_io_cb() and process_stream()
have significantly benefitted from the improvement, with values 50 to 200 times
smaller than 2.4-dev9:
$ socat - /tmp/sock1 <<< "show profiling"
Per-task CPU profiling : on # set profiling tasks {on|auto|off}
Tasks activity:
  function                      calls   cpu_tot   cpu_avg   lat_tot   lat_avg
  h1_io_cb                    6407006    11.86s   1.851us    31.14m   291.6us
  process_stream              6403890    18.40s   2.873us    2.134m   20.00us
  si_cs_io_cb                 6403866    4.139s   646.0ns    1.773m   16.61us
  ssl_sock_io_cb               894326    6.407m   429.9us    7.326h   29.49ms
  h1_timeout_task              301189         -         -    8.440s   28.02us
  accept_queue_process         211989    1.691s   7.977us    21.48s   101.3us
  srv_cleanup_toremove_conns      220   23.46ms   106.7us   65.61ms   298.2us
  task_run_applet                  16   1.219ms   76.17us   181.7us   11.36us
  srv_cleanup_idle_conns           12   713.3us   59.44us   168.4us   14.03us
The changes are slightly more invasive than previous ones and depend on
recent patches so they are not likely well suited for backporting.
Instead of placing heavy tasklets into the TL_BULK queue, we now place
them into the TL_HEAVY one, which is assigned a default weight of ~1%
load at once. This way heavy tasks will not block TL_BULK anymore.
This class will be used exclusively for heavy processing tasklets. It
will be cleaner than mixing them with the bulk ones. For now it's
allocated ~1% of the CPU bandwidth.
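As a rough illustration of the budget split (the ratios below only
convey the ~1% idea, they are not HAProxy's exact numbers):

  enum { TL_URGENT_SK, TL_NORMAL_SK, TL_BULK_SK, TL_HEAVY_SK, TL_CLASSES_SK };

  /* Distribute a per-loop tasklet budget across the classes, reserving
   * about 1% (at least one slot) for the heavy class so that expensive
   * handlers can no longer starve the bulk class.
   */
  static void split_budget(unsigned int budget, unsigned int out[TL_CLASSES_SK])
  {
      if (budget < TL_CLASSES_SK)
          budget = TL_CLASSES_SK;       /* always give each class a chance */

      out[TL_HEAVY_SK]  = budget / 100 ? budget / 100 : 1;
      budget           -= out[TL_HEAVY_SK];

      out[TL_URGENT_SK] = budget / 2;            /* latency-sensitive first */
      out[TL_NORMAL_SK] = budget / 3;
      out[TL_BULK_SK]   = budget - out[TL_URGENT_SK] - out[TL_NORMAL_SK];
  }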
The largest part of the patch consists in re-arranging the fields in the
task_per_thread structure to preserve a clean alignment with one more
list head. Since we're now forced to increase the struct past a second
cache line, it now uses 4 cache lines (for easy multiplying) with the
first two ones being exclusively used by local operations and the third
one mostly by atomic operations. Interestingly, this better arrangement
causes less stress and reduced the response time by 8 microseconds at
1 million requests per second.