Commit Graph

266 Commits

Author SHA1 Message Date
Willy Tarreau
74dea8caea MINOR: task: limit the number of subsequent heavy tasks with flag TASK_HEAVY
While the scheduler is priority-aware and class-aware, and consistently
tries to maintain fairness between all classes, it doesn't make use of a
fine execution budget to compensate for high-latency tasks such as TLS
handshakes. This can result in many subsequent calls adding multiple
milliseconds of latency between the various steps of other tasklets that
don't even depend on them.

An ideal solution would be to add a 4th queue, have all tasks announce
their estimated cost upfront and let the scheduler maintain an auto-
refilling budget to pick from the most suitable queue.

But it turns out that a very simplified version of this already provides
impressive gains with very tiny changes and could easily be backported.
The principle is to reserve a new task flag "TASK_HEAVY" that indicates
that a task is expected to take a lot of time without yielding (e.g. an
SSL handshake typically takes 700 microseconds of crypto computation).
When the scheduler sees this flag while queuing a tasklet, it will place
it into the bulk queue. And during dequeuing, we accept only one of
these per full round: the first one is accepted and will not prevent
other lower priority tasks from running, but if a new one arrives,
dequeuing stops there and goes back to polling. This allows collecting
more important updates for other tasks, which will be batched before
the next call to a heavy task.

Preliminary tests consisting of placing this flag on the SSL handshake
tasklet show that response times under SSL stress fell from 14 ms
before the patch to 3.0 ms with the patch, and even 1.8 ms if
tune.sched.low-latency is set to "on".
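
For illustration, a minimal C sketch of the dequeuing rule described
above; the helper names and flag value are made up for the example and
do not reflect HAProxy's actual API:

  #define TASK_HEAVY 0x4000                /* hypothetical flag value */

  struct tasklet { unsigned int state; };

  extern struct tasklet *pick_next_tasklet(void);  /* illustrative */
  extern void run_tasklet(struct tasklet *tl);     /* illustrative */

  void run_one_round(int budget)
  {
      struct tasklet *tl;
      int heavy_seen = 0;

      while (budget-- > 0 && (tl = pick_next_tasklet())) {
          if (tl->state & TASK_HEAVY) {
              if (heavy_seen)
                  break;          /* second heavy one: back to polling */
              heavy_seen = 1;     /* the first heavy one is accepted */
          }
          run_tasklet(tl);
      }
  }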
2021-02-26 00:25:51 +01:00
Willy Tarreau
2a54ffbf43 MINOR: task: make tasklet wakeup latency measurements more accurate
First, we don't want to measure wakeup times if the call date had not
been set before profiling was enabled at run time. And second, we may
only collect the value before clearing the TASK_IN_LIST bit, otherwise
another wakeup might happen on another thread and replace the call date
we're about to use, hence artificially lowering the wakeup times.
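
A sketch of the resulting ordering constraint, using C11 atomics and
made-up names (HAProxy uses its own _HA_ATOMIC_* macros internally):

  #include <stdatomic.h>
  #include <stdint.h>

  #define TASK_IN_LIST 0x0100              /* hypothetical flag value */

  extern uint64_t now_mono_ns(void);       /* illustrative clock source */

  /* Read the call date *before* clearing TASK_IN_LIST: once the bit is
   * cleared, another thread may requeue the tasklet and overwrite the
   * call date with a newer timestamp.
   */
  uint64_t collect_latency(_Atomic unsigned int *state, uint64_t call_date)
  {
      uint64_t lat = call_date ? now_mono_ns() - call_date : 0;

      atomic_fetch_and(state, ~TASK_IN_LIST);   /* publication point */
      return lat;
  }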
2021-02-25 09:44:16 +01:00
Willy Tarreau
b2285de049 MINOR: tasks: also compute the tasklet latency when DEBUG_TASK is set
It is extremely useful to be able to observe the wakeup latency of some
important I/O operations, so let's accept inflating the tasklet struct
by 8 extra bytes when DEBUG_TASK is set. With just this we have enough
to get live reports like this:

  $ socat - /tmp/sock1 <<< "show profiling"
  Per-task CPU profiling              : on      # set profiling tasks {on|auto|off}
  Tasks activity:
    function                      calls   cpu_tot   cpu_avg   lat_tot   lat_avg
    si_cs_io_cb                 8099492   4.833s    596.0ns   8.974m    66.48us
    h1_io_cb                    7460365   11.55s    1.548us   2.477m    19.92us
    process_stream              7383828   22.79s    3.086us   18.39m    149.5us
    h1_timeout_task                4157      -         -      348.4ms   83.81us
    srv_cleanup_toremove_connections 751   39.70ms   52.86us   10.54ms   14.04us
    srv_cleanup_idle_connections     21   1.405ms   66.89us   30.82us   1.467us
    task_run_applet                  16   1.058ms   66.13us   446.2us   27.89us
    accept_queue_process              7   34.53us   4.933us   333.1us   47.58us
2021-02-25 09:44:16 +01:00
Willy Tarreau
45499c56d3 MINOR: task: make grq_total atomic to move it outside of the grq_lock
Instead of decrementing grq_total once per task picked from the global
run queue, let's do it at once after the loop like we do for other
counters. This simplifies the code everywhere. It is not expected to
bring noticeable improvements however, since global tasks tend to be
less common nowadays.
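
The pattern, as a small sketch with made-up helpers:

  #include <stdatomic.h>

  extern _Atomic int grq_total;             /* global run queue count */
  extern void *pick_from_global_rq(void);   /* illustrative helper */

  void dequeue_batch(int budget)
  {
      int picked = 0;

      /* no shared-counter traffic inside the loop */
      while (picked < budget && pick_from_global_rq())
          picked++;

      /* a single atomic update once the loop is done */
      if (picked)
          atomic_fetch_sub(&grq_total, picked);
  }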
2021-02-25 09:44:16 +01:00
Willy Tarreau
c9afbb10f5 MINOR: task: don't decrement then increment the local run queue
Now we don't need to decrement rq_total when we pick a task in the tree
to immediately increment it again after installing it into the local
list. Instead, we simply add to the local queue count the number of
globally picked tasks. Avoiding this shows ~0.5% performance gains at
1Mreq/s (2M task switches/s).
2021-02-25 09:44:16 +01:00
Willy Tarreau
2b363ac092 MINOR: task: do not use __task_unlink_rq() from process_runnable_tasks()
As indicated in the previous commit, this function tries to guess which
tree the task is in to figure out which counters to update, while we already have
that info in the caller. Let's just pick the relevant parts to place them
in the caller.
2021-02-25 09:44:16 +01:00
Willy Tarreau
e7923c1d22 MINOR: task: split the counts of local and global tasks picked
In process_runnable_tasks() we're still calling __task_unlink_rq() to
pick a task, and this function tries to guess where to pick the task
from and which counter to update while the caller's context already
has everything. Worse, the number of local tasks is decremented then
recredited, doubling the operations. In order to avoid this we first
need to keep separate counters for local and global tasks that were
picked. This is what this patch does.
2021-02-25 09:44:16 +01:00
Willy Tarreau
9c6dbf0eea CLEANUP: task: split the large tasklet_wakeup_on() function in two
This function has become large with the multi-queue scheduler. We need
to keep the fast path and the debugging parts inlined, but the rest now
moves to task.c just like was done for task_wakeup(). This has reduced
the code size by 6kB due to less inlining of large parts that are always
context-dependent, and as a side effect, has increased the overall
performance by 1%.
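
The shape of such a split, sketched with illustrative names (the real
prototypes in the task headers differ):

  struct tasklet;

  /* large, context-dependent part: lives in task.c, never inlined */
  extern void __tasklet_wakeup_on(struct tasklet *tl, int thr);

  /* cheap, common-case check: hypothetical helper for this sketch */
  extern int tasklet_needs_queuing(struct tasklet *tl);

  static inline void tasklet_wakeup_on(struct tasklet *tl, int thr)
  {
      if (!tasklet_needs_queuing(tl))
          return;                      /* fast path stays inline */
      __tasklet_wakeup_on(tl, thr);    /* slow path: one call out */
  }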
2021-02-24 17:55:58 +01:00
Willy Tarreau
955a11ebfa MINOR: task: move the allocated tasks counter to the per-thread struct
The nb_tasks counter was still global: it was incremented and decremented
for each task_new()/task_free(), and read in process_runnable_tasks().
But it's only used for stats reporting, so updating it this often is
pointless and expensive. Let's move it to the task_per_thread struct and
have the stats sum it when needed.
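
A sketch of the resulting pattern; reading other threads' counters
without synchronization is racy but acceptable here since the sum is
only used for stats reporting:

  #define MAX_THREADS 64                     /* illustrative limit */

  struct task_per_thread { unsigned int nb_tasks; };

  extern struct task_per_thread task_per_thread[MAX_THREADS];
  extern int nbthread;

  unsigned int total_allocated_tasks(void)
  {
      unsigned int sum = 0;

      /* approximate by design: no locking, stats only */
      for (int thr = 0; thr < nbthread; thr++)
          sum += task_per_thread[thr].nb_tasks;
      return sum;
  }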
2021-02-24 17:42:04 +01:00
Willy Tarreau
eeffb3df41 MINOR: task: limit the remote thread wakeup to the global runqueue only
The test in __task_wakeup() to figure out if the remote threads are sleeping
doesn't make sense outside of the global runqueue test, since there are
only two possibilities here: local runqueue or global runqueue. Hence a
sleeping thread is necessarily another thread, and this case can only
happen when sending to the global run queue. Let's move the test inside
the "if" block.
2021-02-24 17:42:04 +01:00
Willy Tarreau
018564eaa2 CLEANUP: task: move the tree root detection from __task_wakeup() to task_wakeup()
Historically we used to call __task_wakeup() with a known tree root, but
this is no longer the case, and the code has remained needlessly
complicated, with the root computed in task_wakeup() and passed as an
argument to __task_wakeup(), which compares it again.

Let's get rid of this and just move the detection code there. This
eliminates some ifdefs and allows simplifying the test conditions quite
a bit.
2021-02-24 17:42:04 +01:00
Willy Tarreau
1f3b1417b8 CLEANUP: tasks: use a less confusing name for task_list_size
This one is systematically misunderstood due to its unclear name. It
is in fact the number of tasks in the local tasklet list. Let's call
it "tasks_in_list" to remove some of the confusion.
2021-02-24 17:42:04 +01:00
Willy Tarreau
2c41d77ebc MINOR: tasks: do not maintain the rqueue_size counter anymore
This one is exclusively used as a boolean nowadays and is non-zero only
when the thread-local run queue is not empty. Better to check the tree
root's pointer and avoid updating this counter all the time.
2021-02-24 17:42:04 +01:00
Willy Tarreau
9c7b8085f4 MEDIUM: task: remove the tasks_run_queue counter and have one per thread
This counter is solely used for reporting in the stats and is the hottest
thread contention point to date. Moving it to the scheduler and having a
separate one for the global run queue dramatically improves the performance,
showing a 12% boost on the request rate on 16 threads!

In addition, the thread debugging output which used to rely on rqueue_size
was not totally accurate as it would only report task counts. Now we can
return the thread's exact run queue length.

It is also interesting to note that there are still a few other task/tasklet
counters in the scheduler that are not efficiently updated because some cover
a single area and others cover multiple areas. It looks like having a distinct
counter for each of the following entries would help and would keep the code
a bit cleaner:
  - global run queue (tree)
  - per-thread run queue (tree)
  - per-thread shared tasklets list
  - per-thread local lists

Maybe even splitting the shared tasklets list between pure tasklets and
tasks, instead of counting the whole list and the tasks separately, would
simplify the code, because a number of places remain where several
counters have to be updated.
2021-02-24 17:42:04 +01:00
Willy Tarreau
c6ba9a0b9b MINOR: sched: have one runqueue ticks counter per thread
The runqueue_ticks counter counts the number of task wakeups and is used to
position new tasks in the run queue, but since we've had per-thread
run queues, the values there are not very relevant anymore and the
nice value doesn't apply well if some threads are more loaded than
others. In addition, letting all threads compete over a shared counter
is not smart as this may cause some excessive contention.

Let's move this index close to the run queues themselves, i.e. one per
thread and a global one. In addition to improving fairness, this has
increased global performance by 2% on 16 threads thanks to the lower
contention on rqueue_ticks.

Fairness issues were not observed, but if any were to be, this patch
could be backported as far as 2.0 to address them.
2021-02-20 13:03:37 +01:00
Willy Tarreau
4e2282f9bf MEDIUM: tasks/activity: collect per-task statistics when profiling is enabled
Now when profiling is enabled, the scheduler will update per-function
task-level statistics on the number of calls, CPU usage and latency, which can
later be checked using "show profiling". This will immediately make it
obvious what functions are responsible for others' high latencies or which
ones are suffering from others, and should help spot issues like undesired
wakeups. For now the stats are only collected but not reported (though they
are readable from sched_activity[] under gdb).
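
The accumulator can be pictured like this (the layout is illustrative;
see sched_activity[] in the code for the real one):

  #include <stdatomic.h>
  #include <stdint.h>

  struct sched_activity {
      const void      *func;        /* the task/tasklet handler */
      _Atomic uint64_t calls;       /* number of invocations */
      _Atomic uint64_t cpu_time;    /* total execution time */
      _Atomic uint64_t lat_time;    /* total wakeup latency */
  };

  /* called after each task execution once profiling is enabled */
  void account_call(struct sched_activity *sa, uint64_t cpu, uint64_t lat)
  {
      atomic_fetch_add(&sa->calls, 1);
      atomic_fetch_add(&sa->cpu_time, cpu);
      atomic_fetch_add(&sa->lat_time, lat);
  }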
2021-01-29 12:10:33 +01:00
Willy Tarreau
4d6c594998 BUG/MEDIUM: task: close a possible data race condition on a tasklet's list link
In issue #958 Ashley Penney reported intermittent crashes on AWS's ARM
nodes which would not happen on x86 nodes. After investigation it turned
out that the Neoverse N1 CPU cores used in the Graviton2 CPU are much
more aggressive than the usual Cortex A53/A72/A55 or any x86 regarding
memory ordering.

The issue that was triggered there is that if a tasklet_wakeup() call
is made on a tasklet scheduled to run on a foreign thread and that
tasklet is just being dequeued to be processed, there can be a race at
two places:
  - if MT_LIST_TRY_ADDQ() happens between MT_LIST_BEHEAD() and
    LIST_SPLICE_END_DETACHED() if the tasklet is alone in the list,
    because the emptiness test matches;

  - if MT_LIST_TRY_ADDQ() happens during LIST_DEL_INIT() in
    run_tasks_from_lists(), then depending on how LIST_DEL_INIT() ends
    up being implemented, it may even corrupt the adjacent nodes while
    they're being reused for the in-tree storage.

This issue was introduced in 2.2 when support for waking up remote
tasklets was added. Initially the attachment of a tasklet to a list
was enough to know its status and this used to be stable information.
Now it's not sufficient to rely on this anymore, thus we need to use
different information.

This patch solves this by adding a new task flag, TASK_IN_LIST, which
is atomically set before attaching a tasklet to a list, and is only
removed after the tasklet is detached from a list. It is checked
by tasklet_wakeup_on() so that it may only be done while the tasklet
is out of any list, and is cleared during the state switch when calling
the tasklet. Note that the flag is not set for pure tasks as it's not
needed.

However this introduces a new special case: the function
tasklet_remove_from_tasklet_list() needs to keep both states in sync
and cannot check both the state and the attachment to a list at the
same time. This function is already limited to being used by the thread
owning the tasklet, so in this case the test remains reliable. However,
just like its predecessors, this function is wrong by design and it
should probably be replaced with a stricter one, a lazy one, or be
totally removed (it's only used in checks to avoid calling a possibly
scheduled event, and when freeing a tasklet). Regardless, for now the
function exists so the flag is removed only if the deletion could be
done, which covers all cases we're interested in regarding the insertion.
This removal is safe against a concurrent tasklet_wakeup_on() since
MT_LIST_DEL() guarantees the atomic test, and will ultimately clear
the flag only if the task could be deleted, so the flag will always
reflect the last state.

This should be carefully backported as far as 2.2 after some
observation period. This patch depends on previous patch
"MINOR: task: remove __tasklet_remove_from_tasklet_list()".
2020-11-30 18:17:59 +01:00
Willy Tarreau
2da4c316c2 MINOR: task: remove __tasklet_remove_from_tasklet_list()
This function is only used at a single place directly within the
scheduler in run_tasks_from_lists() and it really ought not to be called
by anything else, regardless of what its comment says. Let's delete
it, move the two lines directly into the call place, and take this
opportunity to factor the atomic decrement on tasks_run_queue. A comment
was added on the remaining one tasklet_remove_from_tasklet_list() to
mention the risks in using it.
2020-11-30 18:17:44 +01:00
Willy Tarreau
c309dbdd99 MINOR: task: perform atomic counter increments only once per wakeup
In process_runnable_tasks(), we walk the run queue and pick tasks to
insert them into the local list. And for each of these operations we
perform a few increments, some of which are atomic, and they're even
performed under the runqueue's lock. This is useless inside the loop;
better to do them at the end, since we don't use these values inside the
loop and they're not used anywhere else either during this time. The
only exception is task_list_size, which is accessed in parallel by other
threads performing remote tasklet wakeups, but it's already
approximate and is used to decide to get out of the loop when the
limit is reached. So now we compute it first as an initial budget
instead.
2020-11-30 18:17:44 +01:00
Willy Tarreau
a868c2920b MINOR: task: remove tasklet_insert_into_tasklet_list()
This function is only called at a single place and adds more confusion
than it removes. It also makes one think it could be used outside of
the scheduler while it must absolutely not. Let's just move its two
lines to the call place, making the code more readable there. In
addition this clearly shows that the preliminary LIST_INIT() is
useless since the entry is immediately overwritten.
2020-11-30 18:17:44 +01:00
Willy Tarreau
69a7b8fc6c CLEANUP: task: remove the unused and mishandled global_rqueue_size
This counter is only updated and never used, and in addition it's done
without any atomicity so it's very unlikely to be correct on multi-CPU
systems! Let's just remove it since it's not used.
2020-10-19 14:08:13 +02:00
Willy Tarreau
d48ed6643b MEDIUM: task: use an upgradable seek lock when scanning the wait queue
Right now when running a configuration with many global timers (e.g. many
health checks), there is a lot of contention on the global wait queue
lock because all threads queue up in front of it to scan it.

With 2000 servers checked every 10 milliseconds (200k checks per second),
after 23 seconds running on 8 threads, the lock stats were this high:

  Stats about Lock TASK_WQ:
      write lock  : 9872564
      write unlock: 9872564 (0)
      wait time for write     : 9208.409 msec
      wait time for write/lock: 932.727 nsec
      read lock   : 240367
      read unlock : 240367 (0)
      wait time for read      : 149.025 msec
      wait time for read/lock : 619.991 nsec

i.e. ~5% of the total runtime spent waiting on this specific lock.

With upgradable locks we don't need to work like this anymore. We
can just try to upgrade the read lock to a seek lock before scanning
the queue, then upgrade the seek lock to a write lock for each element
we want to delete there and immediately downgrade it to a seek lock.

The benefit is double:
  - all other threads which need to call next_expired_task() before
    polling won't wait anymore since the seek lock is compatible with
    the read lock ;

  - all other threads competing on trying to grab this lock will fail
    on the upgrade attempt from read to seek, and will let the current
    lock owner finish collecting expired entries.

Doing only this has reduced the wake_expired_tasks() CPU usage in a
very large servers test from 2.15% to 1.04% as reported by perf top,
and increased the health check rate by 3% (all threads being saturated).

This is expected to help against (and possibly solve) the problem
described in issue #875.
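
The scanning pattern, sketched with the lock primitives reduced to
illustrative externs (HAProxy implements them as HA_RWLOCK_* operations):

  extern int  try_upgrade_read_to_seek(void); /* fails if someone scans */
  extern void upgrade_seek_to_write(void);
  extern void downgrade_write_to_seek(void);
  extern void unlock_seek(void);
  extern void *next_expired_entry(void);
  extern void unlink_entry(void *task);

  void collect_expired(void)
  {
      void *task;

      if (!try_upgrade_read_to_seek())
          return;              /* another thread is already collecting */

      while ((task = next_expired_entry())) {
          upgrade_seek_to_write();     /* write only around the unlink */
          unlink_entry(task);
          downgrade_write_to_seek();   /* readers may proceed again */
      }
      unlock_seek();
  }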
2020-10-16 17:15:54 +02:00
Willy Tarreau
3cfaa8d1e0 BUG/MEDIUM: task: bound the number of tasks picked from the wait queue at once
There is a theoretical problem in the wait queue, which is that with many
threads, one could spend a lot of time looping on the newly expired tasks,
causing a lot of contention on the global wq_lock and on the global
rq_lock. This initially sounds benign, but if another thread does just
a task_schedule() or task_queue(), it might end up waiting for a long
time on this lock, and this wait time will count against its execution
budget, degrading the end user's experience and possibly triggering
the watchdog if that lasts too long.

The simplest (and backportable) solution here consists in bounding the
number of expired tasks that may be picked from the global wait queue at
once by a thread, given that all other ones will do it as well anyway.

We don't need to pick more than global.tune.runqueue_depth tasks at once
as we won't process more, so this counter is updated for both the local
and the global queues: threads with more local expired tasks will pick
less global tasks and conversely, keeping the load balanced between all
threads. This will guarantee a much lower latency if/when wakeup storms
happen (e.g. hundreds of thousands of synchronized health checks).

Note that some crashes have been witnessed with 1/4 of the threads in
wake_expired_tasks() and, while the issue might or might not be related,
not having reasonable bounds here definitely explains why we can spend
so much time there.

This patch should be backported, probably as far as 2.0 (maybe with
some adaptations).
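
A sketch of the shared budget between the two queues (helper names are
made up):

  extern int runqueue_depth;                  /* tune.runqueue-depth */
  extern int wake_local_expired(int budget);  /* returns tasks woken */
  extern int wake_global_expired(int budget); /* returns tasks woken */

  void wake_expired_bounded(void)
  {
      int budget = runqueue_depth;

      /* a busy local queue automatically shrinks the global share */
      budget -= wake_local_expired(budget);
      if (budget > 0)
          wake_global_expired(budget);
  }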
2020-10-16 15:18:48 +02:00
Willy Tarreau
6ce0232a78 BUILD: task: work around a bogus warning in gcc 4.7/4.8 at -O1
As reported in issue #816, when building task.o at -O1 with gcc 4.7 or
4.8, we get the following warning:

    CC      src/task.o
  In file included from include/haproxy/proxy.h:31:0,
                   from include/haproxy/cfgparse.h:27,
                   from src/task.c:19:
  src/task.c: In function 'next_timer_expiry':
  include/haproxy/ticks.h:121:10: warning: 'key' may be used uninitialized in this function [-Wmaybe-uninitialized]
  src/task.c:349:2: note: 'key' was declared here

The warning is wrong since the condition to use 'key' is exactly the same
as the one used to set it. This warning disappears at -O2 and is gone
from gcc 5 onwards. Let's just initialize 'key' there, it only adds
16 bytes of code and remains cheap enough for this function.

This should be backported to 2.2.
2020-08-21 05:54:00 +02:00
Willy Tarreau
e5d79bccc0 MINOR: tasks/debug: add a few BUG_ON() to detect use of wrong timer queue
This aims at catching calls to task_unlink_wq() performed by the wrong
thread based on the shared status for the task, as well as calls to
__task_queue() with the wrong timer queue being used based on the task's
capabilities. This will at least help eliminate some hypothesis during
debugging sessions when suspecting that a wrong thread has attempted to
queue a task at the wrong place.
2020-07-22 14:42:52 +02:00
Willy Tarreau
783afbe93b BUG/MAJOR: tasks: don't requeue global tasks into the local queue
A bug was introduced by commit 77015abe0 ("MEDIUM: tasks: clean up the
front side of the wait queue in wake_expired_tasks()"): front tasks
that are not yet expired were incorrectly requeued into the local
wait queue instead of the global one. Because of this, the same task
could be found by the same thread on next invocation and be unlinked
without locking, allowing another thread to requeue it in parallel,
and conversely another thread could unlink it while the task was being
walked over, causing all sorts of crashes and endless loops in
wake_expired_tasks() and affiliates.

This bug can easily be triggered by stressing the do_resolve action
in multi-thread (after applying the fixes required to get do_resolve
to work with threads). It certainly is the cause of issue #758.

This must be backported to 2.2 only.
2020-07-22 14:12:45 +02:00
Willy Tarreau
273aea479d BUG/MAJOR: tasks: make sure to always lock the shared wait queue if needed
In run_tasks_from_task_list() we may free some tasks that have been
killed. Before doing so we unlink them from the wait queue. But if such
a task is in the global wait queue, the queue isn't locked so this can
result in corrupting the global task list and causing loops or crashes.

It's very likely one cause of issue #758.

This must be backported to 2.2. For 2.1 there doesn't seem to be any
case where a task could be freed this way while in the global queue,
but it doesn't cost much to apply the same change (the code is in
process_runnable_task there).
2020-07-17 14:37:51 +02:00
Willy Tarreau
950954f5f7 MINOR: tasks: use MT_LIST_ADDQ() when killing tasks.
A bug in task_kill() was fixed by commit 54d31170a ("BUG/MAJOR: sched:
make sure task_kill() always queues the task") which added a list
initialization before adding an element. But in fact an unconditional
addition would have done the same and been simpler than first
initializing the element and then checking it was initialized. Let's use
MT_LIST_ADDQ() there to add the task to kill into the shared queue
and kill the dirty LIST_INIT().
2020-07-10 08:52:13 +02:00
Willy Tarreau
de4db17dee MINOR: lists: rename some MT_LIST operations to clarify them
Initially when mt_lists were added, their purpose was to be used with
the scheduler, where anyone may concurrently add the same tasklet, so
it sounded natural to implement a check in MT_LIST_ADD{,Q}. Later their
usage was extended and MT_LIST_ADD{,Q} started to be used on situations
where the element to be added was exclusively owned by the one performing
the operation so a conflict was impossible. This became more obvious with
the idle connections and the new macro was called MT_LIST_ADDQ_NOCHECK.

But this remains confusing and at many places it's not expected that
an MT_LIST_ADD could possibly fail, and worse, at some places we start
by initializing it before adding (and the test is superfluous), so let's
rename them to something more conventional to denote the presence of the
check or not:

   MT_LIST_ADD{,Q}    : unconditional operation, the caller owns the
                        element, and doesn't care about the element's
                        current state (exactly like LIST_ADD)
   MT_LIST_TRY_ADD{,Q}: only perform the operation if the element is not
                        already added or in the process of being added.

This means that the previously "safe" MT_LIST_ADD{,Q} are not "safe"
anymore. This also means that in case of backport mistakes in the
future causing this to be overlooked, the slower and safer functions
will still be used by default.

Note that the missing unchecked MT_LIST_ADD macro was added.

The rest of the code will have to be reviewed so that a number of
callers of MT_LIST_TRY_ADDQ are changed to MT_LIST_ADDQ to remove
the unneeded test.
2020-07-10 08:50:41 +02:00
Willy Tarreau
4f58926352 BUG/MAJOR: sched: make it work also when not building with DEBUG_STRICT
Sadly, the fix from commit 54d31170a ("BUG/MAJOR: sched: make sure
task_kill() always queues the task") broke the builds without DEBUG_STRICT
as, in order to be careful, it placed a BUG_ON() around the previously
failing condition to check for any new possible failure, but this BUG_ON
strips the condition when DEBUG_STRICT is not set. We don't want BUG_ON
to evaluate any condition either as some debugging code calls possibly
expensive ones (e.g. in htx_get_stline). Let's just drop the useless
BUG_ON().

No backport is needed, this is 2.2-dev.
2020-07-02 17:17:42 +02:00
Willy Tarreau
54d31170a9 BUG/MAJOR: sched: make sure task_kill() always queues the task
task_kill() may fail to queue a task if this task has never ever run,
because its equivalent (tasklet->list) member has never been "emptied"
since it didn't pass through the LIST_DEL_INIT() that's performed by
run_tasks_from_lists(). This results in these tasks never being freed.

It happens during the mux takeover since the target task usually is
the timeout task which, by definition, has never run yet.

This fixes commit eb8c2c69f ("MEDIUM: sched: implement task_kill() to
kill a task") which was introduced after 2.2-dev11 and doesn't need to
be backported.
2020-07-02 14:14:00 +02:00
Willy Tarreau
eb8c2c69fa MEDIUM: sched: implement task_kill() to kill a task
task_kill() may be used by any thread to kill any task with less overhead
than a regular wakeup. In order to achieve this, it bypasses the priority
tree and inserts the task directly into the shared tasklets list, cast as
a tasklet. The task_list_size is updated to make sure it is properly
decremented after execution of this task. The task will thus be picked by
process_runnable_tasks() after checking the tree and sent to the TL_URGENT
list, where it will be processed and killed.

If the task is bound to more than one thread, its first thread will be the
one notified.

If the task was already queued or running, nothing is done, only the flag
is added so that it gets killed before or after execution. Of course it's
the caller's responsibility to make sure any resources allocated by this
task were already cleaned up or taken over.
2020-07-01 16:35:53 +02:00
Willy Tarreau
8a6049c268 MEDIUM: sched: create a new TASK_KILLED task flag
This flag, when set, will be used to indicate that the task must die.
At the moment this may only be placed by the task itself or by the
scheduler when placing it into the TL_NORMAL queue.
2020-07-01 16:35:49 +02:00
Willy Tarreau
d99177f86d MINOR: sched: make sched->task_list_size atomic
We'll need to update it from foreign threads in order to throw killed
tasks and maintain correct accounting, so let's make it atomic.
2020-07-01 16:35:41 +02:00
Willy Tarreau
1553b6657d BUG/MINOR: sched: properly cover for a rare MT_LIST_ADDQ() race
In commit 3ef7a190b ("MEDIUM: tasks: apply a fair CPU distribution
between tasklet classes") we compute a total weight to be used to
split the CPU time between queues. There is a mention that the
total cannot be null, which is based on the fact that we only get
there if thread_has_task() returns non-zero. But there is a very
small race which can break this assumption: if two threads conflict
on MT_LIST_ADDQ() on an empty shared list and both roll back before
trying again, there is the possibility that a first call to
MT_LIST_ISEMPTY() sees the first thread install itself, then the
second call will see the list empty when both roll back. Thus we
could proceed with the queue while it's temporarily empty and
compute max lengths using a divide by zero. This case is very
hard to trigger; it seldom happens even on 16 threads at 400k req/s.

Let's simply test for max_total and leave the loop when we've not
found any work.

No backport is needed, that's 2.2-only.
2020-06-30 14:06:19 +02:00
Willy Tarreau
e7723bddd7 MEDIUM: tasks: add a tune.sched.low-latency option
Now that all tasklet queues are scanned at once by run_tasks_from_lists(),
it becomes possible to always check for lower priority classes and jump
back to them when they exist.

This patch adds tune.sched.low-latency global setting to enable this
behavior. What it does is stick to the lowest ranked priority list in
which tasks are still present with an available budget, and leave the
loop to refill the tasklet lists if the trees got new tasks or if new
work arrived into the shared urgent queue.

Doing so allows cutting the latency in half when running with extremely
deep run queues (10k-100k), thus allowing forwarding of small and large
objects to coexist better. It remains off by default since it does have
a small impact on large traffic (shorter batches).
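
For reference, the option is enabled from the global section:

    global
        tune.sched.low-latency on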
2020-06-24 12:21:26 +02:00
Willy Tarreau
59153fef86 MINOR: tasks: make run_tasks_from_lists() scan the queues itself
Now process_runnable_tasks is responsible for calculating the budgets
for each queue, dequeuing from the tree, and calling run_tasks_from_lists().
The latter scans the queues, picking tasks there while respecting budgets.
Note that its name was updated with a plural "s" for this reason.
2020-06-24 12:21:26 +02:00
Willy Tarreau
ba48d5c8f9 MINOR: tasks: pass the queue index to run_task_from_list()
Instead of passing it a pointer to the queue, pass it the queue's index
so that it can perform all the work around current_queue and tl_class_mask.
2020-06-24 12:21:26 +02:00
Willy Tarreau
49f90bf148 MINOR: tasks: add a mask of the queues with active tasklets
It is neither convenient nor scalable to check each and every tasklet
queue to figure whether it's empty or not while we often need to check
them all at once. This patch introduces a tasklet class mask which has
one bit set for each queue representing a class of service. A single
test on the mask tells whether there's still some work to be
done. It will later be usable to better factor the runqueue code.

Bits are set when tasklets are queued. They're cleared when queues are
emptied. It is possible that a queue is empty but still has its bit set
if a tasklet was added then removed, but this is not a problem as this is
checked for in run_tasks_from_list().
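
The idea in miniature (the class names match the scheduler's TL_*
queues; the mask variable and helpers are illustrative):

  enum { TL_URGENT, TL_NORMAL, TL_BULK, TL_CLASSES };

  static unsigned int tl_class_mask;   /* one bit per tasklet class */

  static inline void mark_queued(int cl) { tl_class_mask |= 1U << cl; }
  static inline void mark_empty(int cl)  { tl_class_mask &= ~(1U << cl); }

  /* a single test replaces scanning every queue */
  static inline int work_pending(void)  { return tl_class_mask != 0; }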
2020-06-24 12:21:26 +02:00
Willy Tarreau
c0a08ba2df MINOR: tasks: make current_queue an index instead of a pointer
It will be convenient to have the tasklet queue number soon, so better to
make current_queue an index rather than a pointer to the queue. When not currently
running (e.g. from I/O), the index is -1.
2020-06-24 12:21:26 +02:00
Willy Tarreau
3ef7a190b0 MEDIUM: tasks: apply a fair CPU distribution between tasklet classes
Till now in process_runnable_tasks() we used to reserve a fixed portion
of max_processed to urgent tasks, then a portion of what remains for
normal tasks, then what remains for bulk tasks. This causes two issues:

  - the current budget for processed tasks could be drained all at once
    by higher level tasks so that lower ones couldn't have enough left
    for the next run. For example, if bulk tasklets cause task wakeups,
    the required share to run them could be eaten by other bulk tasklets.

  - it forces the urgent tasks to be run before scanning the tree so that
    we know how many tasks to pick from the tree, and this isn't very
    efficient cache-wise.

This patch changes this so that we compute upfront how max_processed will
be shared between classes that require so. We can then decide in advance
to pick a certain number of tasks from the tree, then execute all tasklets
in turn. When reaching the end, if there's still some budget, we can go
back and do the same thing again, improving chances to pick new work
before the global budget is depleted.

The default weights have been set to 50% for urgent tasklets, 37% for
normal ones and 13% for the bulk ones. In practice, there are not that
many urgent tasklets but when they appear they are cheap and must be
processed in as large batches as possible. Every time there is nothing
to pick there, the unused budget is shared between normal and bulk and
this allows bulk tasklets to still have quite some CPU to run on.
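
The upfront split can be sketched as follows (50/37/13 as per the
commit; redistribution of unused budget is left out for brevity):

  void split_budget(int max_processed, int *urgent, int *normal, int *bulk)
  {
      *urgent = max_processed * 50 / 100;
      *normal = max_processed * 37 / 100;
      /* the remainder (~13%) also absorbs the rounding losses */
      *bulk   = max_processed - *urgent - *normal;
  }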
2020-06-24 12:21:26 +02:00
Willy Tarreau
116ef223d2 MINOR: task: add a new pointer to current tasklet queue
In task_per_thread[] we now have current_queue which is a pointer to
the current tasklet_list entry being evaluated. This will be used to
know the class under which the current task/tasklet is currently
running.
2020-06-23 16:35:38 +02:00
Willy Tarreau
0c0c85ed9d BUG/MINOR: tasks: make sure never to exceed max_processed
We want to be sure not to exceed max_processed. It can actually go
slightly negative due to the rounding applied to ratios, but we must
refrain from processing too many tasks if it's already low.

This became particularly relevant since recent commit 5c8be272c ("MEDIUM:
tasks: also process late wakeups in process_runnable_tasks()") which was
merged into 2.2-dev10. No backport is needed.
2020-06-23 11:34:40 +02:00
Willy Tarreau
5c8be272c7 MEDIUM: tasks: also process late wakeups in process_runnable_tasks()
Since version 1.8, we've started to use tasks and tasklets more
extensively to defer I/O processing. Originally with the simple
scheduler, a task waking another one up using task_wakeup() would
have caused it to be processed right after the list of runnable ones.

With the introduction of tasklets, we've started to spill running
tasks from the run queues to the tasklet queues, so if a task wakes
another one up, it will only be executed on the next call to
process_runnable_tasks(), which means after yet another round of
polling loop.

This is particularly visible with I/Os hitting muxes: poll() reports
a read event, the connection layer performs a tasklet_wakeup() on the
mux subscribed to this I/O, and this mux in turn signals the upper
layer stream using task_wakeup(). The process goes back to poll() with
a null timeout since there's one active task, then back to checking all
possibly expired events, and finally back to process_runnable_tasks()
again. Worse, when there is high I/O activity, doing so will push the
task's execution further apart from the tasklet's and will both increase
the total processing latency and reduce the cache hit ratio.

This patch goes back to the original spirit of process_runnable_tasks(),
which is to execute runnable tasks as long as the execution budget is not
exhausted. By doing so, we're immediately cutting in half the number of
calls to all functions called by run_poll_loop(), and halving the number
of calls to poll(). Furthermore, calling poll() less often also means
purging FD updates less often and offering more chances to merge them.

This also has the nice effect of making tune.runqueue-depth effective
again, as in the past it used to be quickly bounded by this artificial
event horizon which prevented executing the remaining tasks. On
certain workloads we can see a 2-3% performance increase.
2020-06-19 14:21:46 +02:00
Willy Tarreau
77015abe0b MEDIUM: tasks: clean up the front side of the wait queue in wake_expired_tasks()
Due to the way the wait queue works, some tasks might be postponed but not
requeued. However when we exit wake_expired_tasks() on a not-yet-expired
task and leave it in this situation, the next call to next_timer_expiry()
will use this first task's key in the tree as an expiration date, but this
date might be totally off and cause needless wakeups just to reposition it.

This patch makes sure that we leave wake_expired_tasks with a clean state
of front-side tasks and that their tree key matches their expiration date.
Doing so we can already observe a ~15% reduction of the number of wakeups
when dealing with large numbers of health checks.

The patch looks large because the code was rearranged but the real change
is to take the wakeup/requeue decision on the task's expiration date instead
of the tree node's key; the rest is unchanged.
2020-06-19 14:21:46 +02:00
Willy Tarreau
b2551057af CLEANUP: include: tree-wide alphabetical sort of include files
This patch fixes all the leftovers from the include cleanup campaign. There
were not that many (~400 entries in ~150 files) but it was definitely worth
doing as it revealed a few duplicates.
2020-06-11 10:18:59 +02:00
Willy Tarreau
dfd3de8826 REORG: include: move stream.h to haproxy/stream{,-t}.h
This one was not easy because it carried many includes with it,
which other files would automatically find. At least global.h, arg.h
and tools.h were identified. 93 total locations were identified, 8
additional includes had to be added.

In the rare files where it was possible to finalize the sorting of
includes by adjusting only one or two extra lines, it was done. But
all files would need to be rechecked and cleaned up now.

It was the last set of files in types/ and proto/ and these directories
must not be reused anymore.
2020-06-11 10:18:58 +02:00
Willy Tarreau
a264d960f6 REORG: include: move proxy.h to haproxy/proxy{,-t}.h
This one is particularly difficult to split because it provides all the
functions used to manipulate a proxy state and to retrieve names or IDs
for error reporting, and as such, it was included in 73 files (down to
68 after cleanup). It would deserve a small cleanup though the cut points
are not obvious at the moment given the number of structs involved in
the struct proxy itself.
2020-06-11 10:18:58 +02:00
Willy Tarreau
cea0e1bb19 REORG: include: move task.h to haproxy/task{,-t}.h
The TASK_IS_TASKLET() macro was moved to the proto file instead of the
type one. The proto part was a bit reordered to remove a number of ugly
forward declarations of static inline functions. About ten C and H
files had their dependency dropped since they were not using anything
from task.h.
2020-06-11 10:18:58 +02:00
Willy Tarreau
0f6ffd652e REORG: include: move fd.h to haproxy/fd{,-t}.h
A few includes were missing in each file. A definition of
struct polled_mask was moved to fd-t.h. The MAX_POLLERS macro was
moved to defaults.h.

Stdio used to be silently inherited from whatever path but it's needed
for list_pollers() which takes a FILE* and which can thus not be
forward-declared.
2020-06-11 10:18:57 +02:00
Willy Tarreau
48fbcae07c REORG: tools: split common/standard.h into haproxy/tools{,-t}.h
And also rename standard.c to tools.c. The original split between
tools.h and standard.h dates from version 1.3-dev and was mostly an
accident. This patch moves the files back to what they were expected
to be, and takes care of not changing anything else. However this
time tools.h was split between functions and types, because it contains
a small number of commonly used macros and structures (e.g. name_desc)
which in turn cause the massive list of includes of tools.h to conflict
with the callers.

They remain the ugliest files of the whole project and definitely need
to be cleaned and split apart. A few types are defined there only for
functions provided there, and some parts are even OS-specific and should
move somewhere else, such as the symbol resolution code.
2020-06-11 10:18:57 +02:00
Willy Tarreau
d0ef439699 REORG: include: move common/memory.h to haproxy/pool.h
Now the file is ready to be stored into its final destination. A few
minor reorderings were performed to keep the file properly organized,
making the various sections more visible (cache & lockless).

In addition and to stay consistent, memory.c was renamed to pool.c.
2020-06-11 10:18:57 +02:00
Willy Tarreau
6634794992 REORG: include: move freq_ctr to haproxy/
types/freq_ctr.h was moved to haproxy/freq_ctr-t.h and proto/freq_ctr.h
was moved to haproxy/freq_ctr.h. Files were updated accordingly, no other
change was applied.
2020-06-11 10:18:56 +02:00
Willy Tarreau
92b4f1372e REORG: include: move time.h from common/ to haproxy/
This one is included almost everywhere and used to rely on a few other
.h that are not needed (unistd, stdlib, standard.h). It could possibly
make sense to split it into multiple parts to distinguish operations
performed on timers and the internal time accounting, but at this point
it does not appear very important.
2020-06-11 10:18:56 +02:00
Willy Tarreau
af613e8359 CLEANUP: thread: rename __decl_hathreads() to __decl_thread()
I can never figure out whether it takes an "s" or not, and in the end it's
better if it matches the file's naming, so let's call it "__decl_thread".
2020-06-11 10:18:56 +02:00
Willy Tarreau
853b297c9b REORG: include: split mini-clist into haproxy/list and list-t.h
Half of the users of this include only need the type definitions and
not the manipulation macros nor the inline functions. Moving the various
types into mini-clist-t.h makes the files cleaner. The other one had all
its includes grouped at the top. A few files continued to reference it
without using it and were cleaned.

In addition it was about time that we'd rename that file, it's not
"mini" anymore and contains a bit more than just circular lists.
2020-06-11 10:18:56 +02:00
Willy Tarreau
4c7e4b7738 REORG: include: update all files to use haproxy/api.h or api-t.h if needed
All files that were including one of the following include files have
been updated to only include haproxy/api.h or haproxy/api-t.h once instead:

  - common/config.h
  - common/compat.h
  - common/compiler.h
  - common/defaults.h
  - common/initcall.h
  - common/tools.h

The choice is simple: if the file only requires type definitions, it includes
api-t.h, otherwise it includes the full api.h.

In addition, in these files, explicit includes for inttypes.h and limits.h
were dropped since these are now covered by api.h and api-t.h.

No other change was performed, given that this patch is large and
affects 201 files. At least one (tools.h) was already freestanding and
didn't get the new one added.
2020-06-11 10:18:42 +02:00
Willy Tarreau
8d2b777fe3 REORG: ebtree: move the include files from ebtree to include/import/
This is where other imported components are located. All files which
used to directly include ebtree were touched to update their include
path so that "import/" is now prefixed before the ebtree-related files.

The ebtree.h file was slightly adjusted to read compiler.h from the
common/ subdirectory (this is the only change).

A build issue was encountered when eb32sctree.h is loaded before
eb32tree.h because only the former checks for the latter before
defining type u32. This was addressed by adding the reverse ifdef
in eb32tree.h.

No further cleanup was done yet in order to keep changes minimal.
2020-06-11 09:31:11 +02:00
Ilya Shipitsin
856aabcda5 CLEANUP: assorted typo fixes in the code and comments
This is the 8th iteration of typo fixes.
2020-04-17 09:37:36 +02:00
Olivier Houchard
c62d9ab7cb MINOR: tasks: Provide the tasklet to the callback.
When tasklets were introduced, it was decided not to provide the tasklet
to the callback, but NULL instead. While it may have been reasonable back
then, maybe to be able to differentiate a task from a tasklet within the
callback, it also means that we can't access the tasklet from the handler if
As no handler is shared between a task and a tasklet, and there are now
other means of distinguishing between task and tasklet, just pass the
tasklet pointer too.

This may be backported to 2.1, 2.0 and 1.9 if needed.
2020-03-17 18:52:33 +01:00
Willy Tarreau
27d00c0167 MINOR: task: export run_tasks_from_list
This will help refine debug traces.
2020-03-03 15:26:10 +01:00
Willy Tarreau
952c2640b0 MINOR: task: don't set TASK_RUNNING on tasklets
We can't clear flags on tasklets because we don't know if they're still
present upon return (they all return NULL, maybe that could change in
the future). As a side effect, once TASK_RUNNING is set, it's never
cleared anymore, which is misleading and resulted in some incorrect
flagging of bulk tasks in the recent scheduler changes. And the only
reason for setting TASK_RUNNING on tasklets was to detect self-wakers,
which is now done using a dedicated flag. So instead of setting this
flag with no opportunity to clear it, let's simply not set it.
2020-01-31 18:37:03 +01:00
Willy Tarreau
1dfc9bbdc6 OPTIM: task: readjust CPU bandwidth distribution since last update
Now that we can more accurately watch which connection is really
being woken up from itself, it was desirable to re-adjust the CPU BW
thresholds based on measurements. New tests with 60000 concurrent
connections were run at 100 Gbps with unbounded queues and showed
the following distribution:

     scenario           TC0 TC1 TC2   observation
    -------------------+---+---+----+---------------------------
     TCP conn rate     : 32, 51, 17
     HTTP conn rate    : 34, 41, 25
     TCP byte rate     :  2,  3, 95   (2 MB objects)
     splicing byte rate: 11,  6, 83   (2 MB objects)
     H2 10k object     : 44, 23, 33   client-limited
     mixed traffic     : 18, 10, 72   2*1m+1*0: 11kcps, 36 Gbps

The H2 experienced a huge change since it uses a persistent connection
that was accidentally flagged in the previous test. The splicing test
exhibits a higher need for short tasklets, so does the mixed traffic
test. Given that latency mainly matters for conn rate and H2 here,
the ratios were readjusted as 33% for TC0, 50% for TC1 and 17% for
TC2, keeping in mind that whatever is not consumed by one class is
automatically shared in equal proportions by the next one(s). This
setting immediately provided a nice improvement as with the default
settings (maxpollevents=200, runqueue-depth=200), the same ratios as
above are still reported, while the time to request "show activity"
on the CLI dropped to 30-50ms. The average loop time is around 5.7ms
on the mixed traffic.

In addition, one extra stress test at 90.5 Gbps with 5100 conn/s shows
70-100ms CLI request time, with an average loop time of 17 ms.
2020-01-31 18:37:01 +01:00
Willy Tarreau
d23d413e38 MINOR: task: make sched->current also reflect tasklets
sched->current is used to know the current task/tasklet, and is currently
only used by the panic dump code. However it turns out it was not set for
tasklets, which prevents us from using it for other purposes, despite the
panic handling code already handling this case very well. Let's make sure
it's now set.
2020-01-31 17:45:10 +01:00
Willy Tarreau
bb238834da MINOR: task: permanently flag tasklets waking themselves up
Commit a17664d829 ("MEDIUM: tasks: automatically requeue into the bulk
queue an already running tasklet") tried to inflict a penalty to
self-requeuing tasks/tasklets which correspond to those involved in
large, high-latency data transfers, for the benefit of all other
processing which requires a low latency. However, it turns out that
while it ought to do this on a case-by-case basis, basing itself on
the RUNNING flag isn't accurate because this flag is never cleared for
tasklets, so we'd rather need a distinct flag to tag such tasklets.

This commit introduces TASK_SELF_WAKING to mark tasklets acting like
this. For now it's still set when TASK_RUNNING is present but this
will have to change. The flag is kept across wakeups.
2020-01-31 17:45:10 +01:00
Willy Tarreau
c633607c06 OPTIM: task: refine task classes default CPU bandwidth ratios
Measures with unbounded execution ratios under 40000 concurrent
connections at 100 Gbps showed the following CPU bandwidth
distribution between task classes depending on traffic scenarios:

    scenario           TC0 TC1 TC2   observation
   -------------------+---+---+----+---------------------------
    TCP conn rate     : 29, 48, 23   221 kcps
    HTTP conn rate    : 29, 47, 24   200 kcps
    TCP byte rate     :  3,  5, 92   53 Gbps
    splicing byte rate:  5, 10, 85   70 Gbps
    H2 10k object     : 10, 21, 74   client-limited
    mixed traffic     :  4,  7, 89   2*1m+1*0: 11kcps, 36 Gbps

Thus it seems that we always need a bit of bulk tasks even for short
connections, which seems to imply some suboptimal processing somewhere,
and that there are roughly twice as many tasks (TC1=normal) as regular
tasklets (TC0=urgent). This ratio stands even when data forwarding
increases. So at first glance it looks reasonable to enforce the
following ratio by default:

  - 16% for TL_URGENT
  - 33% for TL_NORMAL
  - 50% for TL_BULK

With this, the TCP conn rate climbs to ~225 kcps, and the mixed traffic
pattern shows a more balanced 17kcps + 35 Gbps with 35ms CLI request
time instead of 11kcps + 36 Gbps and 400 ms response time. The
byte rate tests (1M objects) are not affected at all. This setting
looks "good enough" to allow immediate merging, and could be refined
later.

It's worth noting that it resists very well to massive increase of
run queue depth and maxpollevents: with the run queue depth changed
from 200 to 10000 and maxpollevents to 10000 as well, the CLI's
request time is back to the previous ~400ms, but the mixed traffic
test reaches 52 Gbps + 7500 CPS, which was never met with the previous
scheduling model, while the CLI used to show ~1 minute response time.
The reason is that in the bulk class it becomes possible to perform
multiple rounds of recv+send and eliminate objects at once, increasing
the L3 cache hit ratio, and keeping the connection count low, without
degrading the latency too much.

Another test with mixed traffic involving 2/3 splicing on huge objects
and 1/3 on empty objects without touching any setting reports 51 Gbps +
5300 cps and 35ms CLI request time.
2020-01-31 07:09:10 +01:00
Willy Tarreau
a62917b890 MEDIUM: tasks: implement 3 different tasklet classes with their own queues
We used to mix high latency tasks and low latency tasklets in the same
list, and to even refill bulk tasklets there, causing some unfairness
in certain situations (e.g. poll-less transfers between many connections
saturating the machine with similarly-sized in and out network interfaces).

This patch changes the mechanism to split the load into 3 lists depending
on the task/tasklet's desired classes :
  - URGENT: this is mainly for tasklets used as deferred callbacks
  - NORMAL: this is for regular tasks
  - BULK: this is for bulk tasks/tasklets

Arbitrary ratios of max_processed are picked from each of these lists in
turn, with the ability for one list to use up what was not picked from
the previous one. After some quick tests, the following setup gave
apparently good results both for raw TCP with splicing and for H2-to-H1
request rate:

  - 0 to 75% for urgent
  - 12 to 50% for normal
  - 12 to what remains for bulk

Bulk is not used yet.
2020-01-30 18:59:33 +01:00
Willy Tarreau
4ffa0b526a MINOR: tasks: move the list walking code to its own function
New function run_tasks_from_list() will run over a tasklet list and will
run all the tasks and tasklets it finds there within a limit of <max>
that is passed in argument. This is preliminary work for scheduler QoS
improvements.
2020-01-30 18:13:13 +01:00
Willy Tarreau
dd0e89a084 BUG/MAJOR: task: add a new TASK_SHARED_WQ flag to fix foreign requeuing
Since 1.9 with commit b20aa9eef3 ("MAJOR: tasks: create per-thread wait
queues") a task bound to a single thread will not use locks when being
queued or dequeued because the wait queue is assumed to be the owner
thread's.

But there exists a rare situation where this is not true: the health
check tasks may be running on one thread waiting for a response, and
may in parallel be requeued by another thread calling health_adjust()
after detecting a response error in traffic when "observe l7" is set,
and "fastinter" is lower than "inter", requiring the running check's
timeout to be shortened. In this case, the task being requeued was present in
another thread's wait queue, thus opening a race during task_unlink_wq(),
and gets requeued into the calling thread's wait queue instead of the
running one's, opening a second race here.

This patch aims at protecting against the risk of calling task_unlink_wq()
from one thread while the task is queued on another thread, hence unlocked,
by introducing a new TASK_SHARED_WQ flag.

This new flag indicates that a task's position in the wait queue may be
adjusted by other threads than the one currently executing it. This means
that such WQ manipulations must be performed under a lock. There are two
types of such tasks:
  - the global ones, using the global wait queue (technically speaking,
    those whose thread_mask has at least 2 bits set).
  - some local ones, which for now will be placed into the global wait
    queue as well in order to benefit from its lock.

The flag is automatically set on initialization if the task's thread mask
indicates more than one thread. The caller must also set it if it intends
to let other threads update the task's expiration delay (e.g. delegated
I/Os), or if it intends to change the task's affinity over time as this
could lead to the same situation.

Right now only the situation described above seems to be affected by this
issue, and it is very difficult to trigger, and even then, will often have
no visible effect beyond stopping the checks for example once the race is
met. On my laptop it is feasible with the following config, chained to
httpterm:

    global
        maxconn 400 # provoke FD errors, calling health_adjust()

    defaults
        mode http
        timeout client 10s
        timeout server 10s
        timeout connect 10s

    listen px
        bind :8001
        option httpchk /?t=50
        server sback 127.0.0.1:8000 backup
        server-template s 0-999 127.0.0.1:8000 check port 8001 inter 100 fastinter 10 observe layer7

This patch will automatically address the case for the checks because
check tasks are created with multiple threads bound and will get the
TASK_SHARED_WQ flag set.

If in the future more tasks need to rely on this (multi-threaded muxes
for example) and the use of the global wait queue becomes a bottleneck
again, then it should not be too difficult to place locks on the local
wait queues and queue the task on its bound thread.

This patch needs to be backported to 2.1, 2.0 and 1.9. It depends on
previous patch "MINOR: task: only check TASK_WOKEN_ANY to decide to
requeue a task".

Many thanks to William Dauchy for providing detailed traces that allowed
spotting the problem.
2019-12-19 14:42:22 +01:00
Willy Tarreau
8fe4253bf6 MINOR: task: only check TASK_WOKEN_ANY to decide to requeue a task
After processing a task, its RUNNING bit is cleared and at the same time
we check for other bits to decide whether to requeue the task or not. It
happens that we only want to check the TASK_WOKEN_* bits, because :
  - TASK_RUNNING was just cleared
  - TASK_GLOBAL and TASK_QUEUE cannot be set yet as the task was running,
    preventing it from being requeued

It's important not to catch yet undefined flags there because it would
prevent the addition of new task flags. This also shows more clearly that
waking a task up with flags 0 is not something safe to do as the task
will not be woken up if it's already running.
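
In sketch form, with hypothetical flag values (the real ones live in
the task headers):

  #include <stdatomic.h>

  #define TASK_RUNNING   0x0001      /* hypothetical flag values */
  #define TASK_WOKEN_ANY 0x1F00      /* mask of all TASK_WOKEN_* bits */

  /* Clear RUNNING, then test only the TASK_WOKEN_* bits to decide on
   * requeuing, so that future flags cannot trigger a requeue by accident.
   */
  int should_requeue(_Atomic unsigned int *state)
  {
      unsigned int prev = atomic_fetch_and(state, ~TASK_RUNNING);

      return (prev & TASK_WOKEN_ANY) != 0;
  }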
2019-12-19 14:42:22 +01:00
Willy Tarreau
c49ba52524 MINOR: tasks: split wake_expired_tasks() in two parts to avoid useless wakeups
We used to have wake_expired_tasks() wake up tasks and return the next
expiration delay. The problem this causes is that we have to call it just
before poll() in order to consider the latest timers, but this also means that
we don't wake up all newly expired tasks upon return from poll(), which
thus systematically requires a second poll() round.

This is visible when running any scheduled task like a health check: there
are systematically two poll() calls, one with the interval after which
nothing is done, then another one with a zero delay after which the task
is called:

  listen test
    bind *:8001
    server s1 127.0.0.1:1111 check

  09:37:38.200959 clock_gettime(CLOCK_THREAD_CPUTIME_ID, {tv_sec=0, tv_nsec=8696843}) = 0
  09:37:38.200967 epoll_wait(3, [], 200, 1000) = 0
  09:37:39.202459 clock_gettime(CLOCK_THREAD_CPUTIME_ID, {tv_sec=0, tv_nsec=8712467}) = 0
>> nothing run here, as the expired task was not woken up yet.
  09:37:39.202497 clock_gettime(CLOCK_THREAD_CPUTIME_ID, {tv_sec=0, tv_nsec=8715766}) = 0
  09:37:39.202505 epoll_wait(3, [], 200, 0) = 0
  09:37:39.202513 clock_gettime(CLOCK_THREAD_CPUTIME_ID, {tv_sec=0, tv_nsec=8719064}) = 0
>> now the expired task was woken up
  09:37:39.202522 socket(AF_INET, SOCK_STREAM, IPPROTO_TCP) = 7
  09:37:39.202537 fcntl(7, F_SETFL, O_RDONLY|O_NONBLOCK) = 0
  09:37:39.202565 setsockopt(7, SOL_TCP, TCP_NODELAY, [1], 4) = 0
  09:37:39.202577 setsockopt(7, SOL_TCP, TCP_QUICKACK, [0], 4) = 0
  09:37:39.202585 connect(7, {sa_family=AF_INET, sin_port=htons(1111), sin_addr=inet_addr("127.0.0.1")}, 16) = -1 EINPROGRESS (Operation now in progress)
  09:37:39.202659 epoll_ctl(3, EPOLL_CTL_ADD, 7, {EPOLLOUT, {u32=7, u64=7}}) = 0
  09:37:39.202673 clock_gettime(CLOCK_THREAD_CPUTIME_ID, {tv_sec=0, tv_nsec=8814713}) = 0
  09:37:39.202683 epoll_wait(3, [{EPOLLOUT|EPOLLERR|EPOLLHUP, {u32=7, u64=7}}], 200, 1000) = 1
  09:37:39.202693 clock_gettime(CLOCK_THREAD_CPUTIME_ID, {tv_sec=0, tv_nsec=8818617}) = 0
  09:37:39.202701 getsockopt(7, SOL_SOCKET, SO_ERROR, [111], [4]) = 0
  09:37:39.202715 close(7)                = 0

Let's instead split the function in two parts:
  - the first part, wake_expired_tasks(), called just before
    process_runnable_tasks(), wakes up all expired tasks; it doesn't
    compute any timeout.
  - the second part, next_timer_expiry(), called just before poll(),
    only computes the next timeout for the current thread.
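
The resulting ordering in the polling loop looks roughly like this
(simplified sketch, not the exact code of run_poll_loop()):

    while (running) {
        wake_expired_tasks();       /* part 1: wake all expired tasks */
        process_runnable_tasks();   /* they run in this very iteration */
        /* ... */
        next = next_timer_expiry(); /* part 2: timeout computation only */
        cur_poller.poll(&cur_poller, next);
    }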

Thanks to this, all expired tasks are properly woken up when leaving
poll, and each poll call's timeout remains up to date:

  09:41:16.270449 clock_gettime(CLOCK_THREAD_CPUTIME_ID, {tv_sec=0, tv_nsec=10223556}) = 0
  09:41:16.270457 epoll_wait(3, [], 200, 999) = 0
  09:41:17.270130 clock_gettime(CLOCK_THREAD_CPUTIME_ID, {tv_sec=0, tv_nsec=10238572}) = 0
  09:41:17.270157 socket(AF_INET, SOCK_STREAM, IPPROTO_TCP) = 7
  09:41:17.270194 fcntl(7, F_SETFL, O_RDONLY|O_NONBLOCK) = 0
  09:41:17.270204 setsockopt(7, SOL_TCP, TCP_NODELAY, [1], 4) = 0
  09:41:17.270216 setsockopt(7, SOL_TCP, TCP_QUICKACK, [0], 4) = 0
  09:41:17.270224 connect(7, {sa_family=AF_INET, sin_port=htons(1111), sin_addr=inet_addr("127.0.0.1")}, 16) = -1 EINPROGRESS (Operation now in progress)
  09:41:17.270299 epoll_ctl(3, EPOLL_CTL_ADD, 7, {EPOLLOUT, {u32=7, u64=7}}) = 0
  09:41:17.270314 clock_gettime(CLOCK_THREAD_CPUTIME_ID, {tv_sec=0, tv_nsec=10337841}) = 0
  09:41:17.270323 epoll_wait(3, [{EPOLLOUT|EPOLLERR|EPOLLHUP, {u32=7, u64=7}}], 200, 1000) = 1
  09:41:17.270332 clock_gettime(CLOCK_THREAD_CPUTIME_ID, {tv_sec=0, tv_nsec=10341860}) = 0
  09:41:17.270340 getsockopt(7, SOL_SOCKET, SO_ERROR, [111], [4]) = 0
  09:41:17.270367 close(7)                = 0

This may be backported to 2.1 and 2.0 though it's unlikely to bring any
user-visible improvement except to clarify debugging.
2019-12-11 09:42:58 +01:00
Olivier Houchard
06910464dd MEDIUM: task: Split the tasklet list into two lists.
As using an mt_list for the tasklet list is costly, use a regular list
instead, but add an mt_list for tasklets woken up by other threads, to be
run on the current thread. At the beginning of process_runnable_tasks(),
we just take that new list and merge it into the task_list.
This should give us performance comparable to before we started using an
mt_list, while still allowing tasklet_wakeup() to be used from other threads.
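
A sketch of the idea (field and macro names approximate):

    /* tasklet_wakeup() from a remote thread: only this shared list
     * needs the locked mt_list primitives */
    MT_LIST_ADDQ(&task_per_thread[tl->tid].shared_tasklet_list,
                 (struct mt_list *)&tl->list);

    /* top of process_runnable_tasks(): detach the whole remote list
     * at once and merge it into the regular, lock-free task_list */
    struct mt_list *tmp =
        MT_LIST_BEHEAD(&task_per_thread[tid].shared_tasklet_list);
    if (tmp)
        LIST_SPLICE_END_DETACHED(&task_per_thread[tid].task_list,
                                 (struct list *)tmp);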
2019-10-11 16:37:41 +02:00
Olivier Houchard
07308677dd BUG/MEDIUM: tasks: Don't forget to decrement tasks_run_queue.
When executing tasks, don't forget to decrement tasks_run_queue once we
popped one task from the task_list. tasks_run_queue used to be decremented
by __tasklet_remove_from_tasklet_list(), but we now call MT_LIST_POP().
2019-10-03 14:55:40 +02:00
Willy Tarreau
d022e9c98b MINOR: task: introduce a thread-local "sched" variable for local scheduler stuff
The aim is to gather all scheduler information related to the current
thread. It simply points to task_per_thread[tid] without having to perform
the lookup each time. We save around 1.2 kB of code on performance-sensitive
paths and increase the request rate by almost 1%.
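
In practice this is just a thread-local pointer set once during thread
initialization (sketch):

    THREAD_LOCAL struct task_per_thread *sched;

    /* per-thread init: */
    sched = &task_per_thread[tid];

    /* hot paths then use sched->task_list, sched->rqueue, ... instead
     * of re-evaluating task_per_thread[tid] every time */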
2019-09-24 11:23:30 +02:00
Willy Tarreau
d66d75656e MINOR: task: split the tasklet vs task code in process_runnable_tasks()
There are a number of tests there which are performed on tasklets even
though they can never apply (various handlers, destroyed task or not,
arguments, results, ...). Instead let's have a single TASK_IS_TASKLET()
test and call the tasklet processing function directly, skipping all the
rest.
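
The fast path thus becomes, roughly (sketch, not the exact code):

    if (TASK_IS_TASKLET(t)) {
        /* no handler juggling, no result/profiling checks: just call
         * the processing function and move on to the next entry */
        t->process(t, t->context, t->state);
        continue;
    }
    /* full task handling follows... */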

It now becomes visible that the only remaining unneeded code is the update
to curr_task, which is never used for tasklets except for opportunistic
reporting in the debug handler. That handler can only catch si_cs_io_cb,
which in practice never appears in any report, so the extra cost incurred
there is pointless.

This change alone removes 700 bytes of code, mostly in
process_runnable_tasks() and increases the performance by about
1%.
2019-09-24 11:23:30 +02:00
Willy Tarreau
4c1e1ad6a8 CLEANUP: task: cache the task_per_thread pointer
In process_runnable_tasks() we perform a lot of dereferences to
task_per_thread[tid] but tid is thread_local and the compiler cannot
know that it doesn't change so this results in making lots of thread
local accesses and array dereferences. By just keeping a copy pointer
of this, we let the compiler optimize the code. Just doing this has
reduced process_runnable_tasks() by 124 bytes in the fast path. Doing
the same in wake_expired_tasks() results in 16 extra bytes saved.
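
Concretely (sketch):

    /* take the address once; the compiler may keep it in a register */
    struct task_per_thread * const tt = &task_per_thread[tid];

    /* ... then use tt->task_list, tt->rqueue, etc. in the loop ... */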
2019-09-24 11:23:30 +02:00
Willy Tarreau
9b48c629f2 CLEANUP: task: remove impossible test
In process_runnable_tasks(), after the task's process() function returns,
we used to check that the return value is not NULL and not a tasklet before
updating the profiling measurements. This is pointless since only tasks can
return non-NULL here. Let's remove this test.
2019-09-24 11:23:30 +02:00
Olivier Houchard
ff1e9f39b9 MEDIUM: tasklets: Make the tasklet list a struct mt_list.
Change the tasklet code so that the tasklet list is now an mt_list.
That means that tasklets now have an associated tid, for the thread they
are expected to run on, and any thread can now call tasklet_wakeup() for
that tasklet.
One can change the associated tid with tasklet_set_tid().
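
A hypothetical usage example (the callback and context are invented):

    struct tasklet *tl = tasklet_new();

    if (tl) {
        tl->process = my_io_cb;          /* hypothetical handler */
        tl->context = my_ctx;            /* hypothetical context */
        tasklet_set_tid(tl, target_tid); /* bind it to another thread */
        tasklet_wakeup(tl);              /* now safe from any thread */
    }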
2019-09-23 18:16:08 +02:00
Olivier Houchard
859dc80f94 MEDIUM: list: Separate "locked" list from regular list.
Instead of using the same type for regular linked lists and "autolocked"
linked lists, use a separate type, "struct mt_list", for the autolocked one,
and introduce a set of macros, similar to the LIST_* macros, with the
MT_ prefix.
When we use the same entry for both a regular list and an autolocked list,
as is done for the "list" field in struct connection, we now have to
explicitly cast it to struct mt_list when using the MT_ macros.
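
A small usage sketch (the "item" type and its mt_entry member are
invented; exact macro forms may differ):

    struct mt_list head;
    struct item *it;

    MT_LIST_INIT(&head);

    /* any thread may queue an allocated item without an external lock */
    MT_LIST_ADDQ(&head, &it->mt_entry);

    /* and any thread may pop elements off it */
    while ((it = MT_LIST_POP(&head, struct item *, mt_entry)))
        handle_item(it);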
2019-09-23 18:16:08 +02:00
Willy Tarreau
64e6012eb9 MINOR: task: introduce work lists
Sometimes we need to delegate some list processing to a function running
on another thread. In this case the list element will simply be queued
into a dedicated self-locked list and the task responsible for this list
will be woken up, calling the associated function which will run over the
list.

This is what work_list does. Such lists will be dedicated to a limited
type of work but will significantly ease such remote handling. A function
is provided to create these per-thread lists, their tasks and to properly
bind each task to a distinct thread, so that the caller only has to store
the resulting pointer to the start of the structure.

These structures should not be abused though as each head will consume
4 pointers per thread, hence 32 bytes per thread or 2 kB for 64 threads.
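
A hypothetical usage sketch (the handler name is invented and the
signatures are approximate):

    /* one list and one task per thread; process_items() drains the
     * list whenever its task is woken up on its owner thread */
    struct work_list *wl = work_list_create(global.nbthread,
                                            process_items, NULL);

    /* delegate <item> to thread <t> from anywhere: */
    MT_LIST_ADDQ(&wl[t].head, &item->list);
    task_wakeup(wl[t].task, TASK_WOKEN_OTHER);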
2019-07-12 09:07:48 +02:00
Willy Tarreau
bd20a9dd4e BUG: tasks: fix bug introduced by latest scheduler cleanup
In commit 86eded6c6 ("CLEANUP: tasks: rename task_remove_from_tasklet_list()
to tasklet_remove_*") which consisted in removing the casts between tasks
and tasklet, I was a bit too fast to believe that we only saw tasklets in
this function since process_runnable_tasks() also uses it with tasks under
a cast. So removing the bookkeeping on task_list_size was not appropriate.
Bah, the joy of casts which hide the real thing...

This patch does two things at once to address this mess once for all:
  - it restores the decrement of task_list_size when it's a real task,
    but moves it to process_runnable_tasks() since it's the only place
    where it's allowed to call it with a task

  - it moves the increment there as well and renames
    task_insert_into_tasklet_list() to tasklet_insert_into_tasklet_list()
    for obvious consistency reasons.

This way the increment/decrement of task_list_size is made at the only
places where the cast is enforced, so it has less risks to be missed.
The comments on top of these functions were updated to reflect that they
are only supposed to be used with tasklets and that the caller is responsible
for keeping task_list_size up to date if it decides to enforce a task there.

Now we don't have to worry anymore about how these functions work outside
of the scheduler, which is better in the long term. Thanks to Christopher
for spotting this mistake.

No backport is needed.
2019-06-14 18:16:19 +02:00
Willy Tarreau
86eded6c69 CLEANUP: tasks: rename task_remove_from_tasklet_list() to tasklet_remove_*
The function really only operates on tasklets, its arguments are always
tasklets cast as tasks to match the function's type, to be cast back to
a struct tasklet. Let's rename it to tasklet_remove_from_tasklet_list(),
take a struct tasklet, and get rid of the undesired task casts.
2019-06-14 14:57:03 +02:00
Willy Tarreau
5598d171b3 BUILD: task: fix a build warning when threads are disabled
The __decl_hathreads() macro will leave a lone semi-colon at the end of
the variable declarations, resulting in a warning if threads are disabled.
Let's simply swap it with the last variable. Thanks to Ilya Shipitsin for
reporting this issue.

No backport is needed.
2019-06-04 17:18:40 +02:00
Olivier Houchard
cfbb3e6560 MEDIUM: tasks: Get rid of active_tasks_mask.
Remove the active_tasks_mask variable; we can deduce whether we have work
to do by other means, and it is costly to maintain. Instead, introduce a
new function, thread_has_tasks(), which returns non-zero if there are
tasks scheduled for the thread, zero otherwise.
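
A sketch of what it checks (approximate, details may differ):

    static inline int thread_has_tasks(void)
    {
        return (!!(global_tasks_mask & tid_bit) |
                !eb_is_empty(&task_per_thread[tid].rqueue) |
                !LIST_ISEMPTY(&task_per_thread[tid].task_list));
    }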
2019-05-29 21:53:37 +02:00
Willy Tarreau
1e928c074b MEDIUM: task: don't grab the WR lock just to check the WQ
When profiling locks, it appears that the WQ's lock has become the most
contended one, despite the WQ being split by thread. The reason is that
each thread takes the WQ lock before checking whether it has something
to do. In practice the WQ almost only contains health checks and rare tasks
that can be scheduled anywhere, so this is a real waste of resources.

This patch proceeds differently. Now that the WQ's lock has been turned
into an RW lock, we proceed in 3 phases :
  1) locklessly check for the queue's emptiness

  2) take an R lock to retrieve the first element and check if it is
     expired. This way most visits are performed with an R lock to find
     and return the next expiration date.

  3) if one expiration is found, we perform the WR-locked lookup as
     usual.
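
In pseudo-C the lookup now looks like this (very rough sketch):

    /* 1) lockless emptiness check */
    if (eb_is_empty(&timers))
        return TICK_ETERNITY;

    /* 2) read-locked peek at the first expiration date */
    HA_RWLOCK_RDLOCK(TASK_WQ_LOCK, &wq_lock);
    eb = eb32_first(&timers);
    /* ... compute the next expiry from eb->key, note if expired ... */
    HA_RWLOCK_RDUNLOCK(TASK_WQ_LOCK, &wq_lock);

    /* 3) only if something actually expired, take the write lock
     * to dequeue and wake the expired tasks as before */
    if (expired) {
        HA_RWLOCK_WRLOCK(TASK_WQ_LOCK, &wq_lock);
        /* ... unlink and wake expired tasks ... */
        HA_RWLOCK_WRUNLOCK(TASK_WQ_LOCK, &wq_lock);
    }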

As a result, on a one-minute test involving 8 threads and 64 streams at
1.3 million ctxsw/s, before this patch the lock profiler reported this :

    Stats about Lock TASK_WQ:
         # write lock  : 1125496
         # write unlock: 1125496 (0)
         # wait time for write     : 263.143 msec
         # wait time for write/lock: 233.802 nsec
         # read lock   : 0
         # read unlock : 0 (0)
         # wait time for read      : 0.000 msec
         # wait time for read/lock : 0.000 nsec

And after :

    Stats about Lock TASK_WQ:
         # write lock  : 173
         # write unlock: 173 (0)
         # wait time for write     : 0.018 msec
         # wait time for write/lock: 103.988 nsec
         # read lock   : 1072706
         # read unlock : 1072706 (0)
         # wait time for read      : 60.702 msec
         # wait time for read/lock : 56.588 nsec

Thus the contention was divided by 4.3.
2019-05-28 19:15:44 +02:00
Willy Tarreau
ef28dc11e3 MINOR: task: turn the WQ lock to an RW_LOCK
For now it's exclusively used as a write lock though, thus it remains
100% equivalent to the spinlock it replaces.
2019-05-28 19:15:44 +02:00
Willy Tarreau
e6a02fa65a MINOR: threads: add a "stuck" flag to the thread_info struct
This flag is constantly cleared by the scheduler and will be set by the
watchdog timer to detect stuck threads. It is also set by the "show
threads" command so that it is easy to spot if the situation has evolved
between two subsequent calls : if the first "show threads" shows no stuck
thread and the second one shows such a stuck thread, it indicates that
this thread didn't manage to make any forward progress since the previous
call, which is extremely suspicious.
2019-05-22 11:50:48 +02:00
Willy Tarreau
01f3489752 MINOR: task: put barriers after each write to curr_task
This one may be watched by signal handlers, we don't want the compiler
to optimize its assignment away at the end of the loop and leave some
wandering pointers there.
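
Concretely (sketch):

    curr_task = (struct task *)t;
    __ha_barrier_store(); /* make the store visible before the handler
                           * can possibly observe it */
    /* ... run the task ... */
    curr_task = NULL;
    __ha_barrier_store(); /* and don't leave a stale pointer behind */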
2019-05-17 17:16:20 +02:00
Willy Tarreau
bc13bec548 MINOR: activity: report context switch counts instead of rates
It's not logical to report context switch rates per thread in show activity
because everything else is a counter and it's not even possible to compare
values. Let's only report counts. Further, this simplifies the scheduler's
code.
2019-04-30 14:55:18 +02:00
Willy Tarreau
d9add3acc8 MINOR: activity: make the profiling status per thread and not global
In order to later support automatic profiling turn on/off, we need to
have it per-thread. We're keeping the global option to know whether to
turn it on or off, but the profiling status is now set per thread. We're
updating the status in activity_count_runtime() which is called before
entering poll(). The reason is that we'll extend this with run time
measurement when deciding to automatically turn it on or off.
2019-04-25 17:26:19 +02:00
Willy Tarreau
0212fadd65 MINOR: tasks/activity: report the context switch and task wakeup rates
Seeing this is particularly useful to spot runaway tasks. The context
switch rate covers all tasklet calls (tasks and I/O handlers) while the
task wakeups only cover tasks picked from the run queue to be executed.
High values there will indicate either intense traffic or a bug that
makes a task go wild.
2019-04-24 16:04:23 +02:00
Olivier Houchard
ed1a6a0d8a MEDIUM: tasks: Use __ha_barrier_store after modifying global_tasks_mask.
Now that we no longer use atomic operations to update global_tasks_mask,
as it's always modified while holding the TASK_RQ_LOCK, we have to use
__ha_barrier_store() instead of __ha_barrier_atomic_store() to ensure
any modification of global_tasks_mask is seen before modifying
active_tasks_mask.

This should be backported to 1.9.
2019-04-18 14:14:10 +02:00
Olivier Houchard
1cfac37b65 MEDIUM: tasks: Don't account a destroyed task as a task that ran.
In process_runnable_tasks(), if the task we're about to run has been
destroyed, and should be freed, don't account for it in the number of tasks
we ran. We're only allowed a maximum number of tasks to run per call to
process_runnable_tasks(), and freeing one shouldn't take the slot of a
valid task.
2019-04-18 10:11:13 +02:00
Olivier Houchard
3f795f76e8 MEDIUM: tasks: Merge task_delete() and task_free() into task_destroy().
task_delete() was never used without calling task_free() just after, and
task_free() was only used on error paths to destroy a just-created task,
so merge them into task_destroy(), that will remove the task from the
wait queue, and make sure the task is either destroyed immediately if it's
not in the run queue, or destroyed when it's supposed to run.
2019-04-18 10:10:04 +02:00
Willy Tarreau
03dd029a5b CLEANUP: task: remain consistent when using the task's handler
A pointer "process" is assigned the task's handler in
process_runnable_tasks(), we have no reason to use t->process
right after it is assigned.
2019-04-17 22:32:27 +02:00
Olivier Houchard
0c7a4b6371 MINOR: tasks: Don't set the TASK_RUNNING flag when adding in the tasklet list.
Now that TASK_QUEUED is enforced, there's no need to set TASK_RUNNING when
removing the task from the runqueue to add it to the tasklet list. The flag
will only be set right before we run the task.
2019-04-17 19:28:01 +02:00
Olivier Houchard
de82aeaa26 BUG/MEDIUM: tasks: Make sure we modify global_tasks_mask with the rq_lock.
When modifying global_tasks_mask, make sure we hold the rq_lock, or we might
remove the bit while it has been re-set by somebody else, and we may not
be woken when needed.
2019-04-17 19:28:01 +02:00
Willy Tarreau
b038007ae8 BUG/MEDIUM: tasks: Make sure we set TASK_QUEUED before adding a task to the rq.
Make sure we set TASK_QUEUED in every case before adding the task to the
run queue. task_wakeup() now checks if either TASK_QUEUED or TASK_RUNNING
is set, and if neither is set, add TASK_QUEUED and effectively add the task
to the runqueue.
No longer use __task_wakeup() anywhere except in task_wakeup(), always use
task_wakeup() instead.
With the old code, process_runnable_tasks() may re-add a task in the runqueue
without setting the TASK_QUEUED flag, and there were race conditions that could
lead to a task having the TASK_QUEUED flag but not in the runqueue, thus
being unschedulable.
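
The new task_wakeup() thus looks roughly like this (sketch):

    state = _HA_ATOMIC_OR(&t->state, f);
    while (!(state & (TASK_RUNNING | TASK_QUEUED))) {
        /* neither running nor queued: we're responsible for queuing */
        if (_HA_ATOMIC_CAS(&t->state, &state, state | TASK_QUEUED)) {
            __task_wakeup(t); /* really insert into the run queue */
            break;
        }
        /* the CAS failed: <state> was refreshed, check it again */
    }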

This should be backported to 1.9.
2019-04-17 19:28:01 +02:00
Willy Tarreau
3466e3cdcb BUILD: task/thread: fix single-threaded build of task.c
As expected, commit cde7902ac ("MEDIUM: tasks: improve fairness between
the local and global queues") broke the build with threads disabled,
and I forgot to rerun this test before committing. No backport is
needed.
2019-04-15 18:52:40 +02:00
Willy Tarreau
c8da044b41 MINOR: tasks: restore the lower latency scheduling when niced tasks are present
In the past we used to reduce the number of tasks consulted at once when
some niced tasks were present in the run queue. This was dropped in 1.8
when the scheduler started to take batches. With the recent fixes it now
becomes possible to restore this behaviour which guarantees a better
latency between tasks when niced tasks are present. Thanks to this, with
the default number of 200 for tune.runqueue-depth, with a parasitic load
of 14000 requests per second, nice 0 gives 14000 rps, nice 1024 gives
12000 rps and nice -1024 gives 16000 rps. The amplitude widens if the
runqueue depth is lowered.
2019-04-15 09:50:56 +02:00