MEDIUM: checks: spread the checks load over random threads

CPU usage was found to be high (5%) on a machine with 48 threads and
only 100 servers checked every second. That amounts to only 100
connections per second, which should be very cheap. It was figured
out that because check tasks unbind from any thread when going back
to sleep, they get queued into the shared queue.

Not only does this require manipulating the global queue lock, it
also means that all threads have to check the global queue before
going to sleep (hence take the lock again) to figure out how long to
sleep. They would then all sleep only until the next check's expiry:
one of them would pick it up, and all the other ones would go back to
sleep waiting for the check after that.
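
To make this pattern concrete, here is a minimal, self-contained toy
model (plain pthreads, NOT HAProxy's scheduler; every name in it is
invented for the illustration). All threads derive their sleep time
from a single shared expiry under one lock, so they all time out at
the same instant, the first one wins the check and the other ones
have woken up for nothing:

  /* Build: cc -O2 -pthread herd.c && ./a.out */
  #include <errno.h>
  #include <pthread.h>
  #include <stdio.h>
  #include <time.h>

  #define NB_THREADS 8
  #define NB_CHECKS  5

  static pthread_mutex_t queue_lock = PTHREAD_MUTEX_INITIALIZER;
  static pthread_cond_t  queue_cond = PTHREAD_COND_INITIALIZER;
  static struct timespec next_expiry;  /* head of the shared wait queue */
  static int checks_left = NB_CHECKS;

  static void *worker(void *arg)
  {
      long tid = (long)arg;

      pthread_mutex_lock(&queue_lock);
      while (checks_left > 0) {
          /* take the global lock just to read the queue head and
           * derive our own sleep time from it...
           */
          struct timespec deadline = next_expiry;

          if (pthread_cond_timedwait(&queue_cond, &queue_lock, &deadline) != ETIMEDOUT)
              continue;

          /* ...so all threads wake up here together; the first one to
           * re-grab the lock runs the check and re-arms the expiry,
           * the others only paid a futile wakeup and lock round-trip.
           */
          if (next_expiry.tv_sec == deadline.tv_sec &&
              next_expiry.tv_nsec == deadline.tv_nsec) {
              checks_left--;
              clock_gettime(CLOCK_REALTIME, &next_expiry);
              next_expiry.tv_sec += 1;  /* next check in 1s */
              printf("thread %ld ran the check\n", tid);
          }
          else
              printf("thread %ld woke up for nothing\n", tid);
      }
      pthread_mutex_unlock(&queue_lock);
      return NULL;
  }

  int main(void)
  {
      pthread_t th[NB_THREADS];

      clock_gettime(CLOCK_REALTIME, &next_expiry);
      next_expiry.tv_sec += 1;
      for (long i = 0; i < NB_THREADS; i++)
          pthread_create(&th[i], NULL, worker, (void *)i);
      for (int i = 0; i < NB_THREADS; i++)
          pthread_join(th[i], NULL);
      return 0;
  }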

That's perfectly visible in time-to-first-byte measurements. A quick
test consisting of retrieving the stats page in CSV format from a
48-thread process checking 200 servers every 2 seconds shows the
following latency tail:

  percentile   ttfb(ms)
  99.98        2.43
  99.985       5.72
  99.99       32.96
  99.995     82.176
  99.996     82.944
  99.9965    83.328
  99.997      83.84
  99.9975    84.288
  99.998      85.12
  99.9985    86.592
  99.999         88
  99.9995    89.728
  99.9999   100.352

One solution could consist in forcefully binding checks to threads at
boot time, but that's annoying, would cause trouble for dynamic
servers, and could skew the load depending on server patterns.

Instead here we take a different approach: a check remains bound to
its thread for as long as possible, but upon every wakeup the current
thread's load is compared with that of another randomly picked
thread. If the other thread's load is less than half of the current
one's, the task is bounced to that thread. In order to prevent the
new thread from immediately doing the same, a flag CHK_ST_SLEEPING is
set when the check goes back to sleep, and the task is only bounced
when it wakes up with this flag still set; the flag is cleared before
bouncing, so the recipient thread will not bounce it again.
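
To illustrate why this converges quickly, here is a small standalone
simulation (NOT HAProxy code: the per-thread count of assigned checks
stands in for the run-queue load rq_total, and all names are
invented) applying the same condition as the patch below, i.e. bounce
only when my_load >= 2 and the peer's load is at most my_load / 2:

  /* Build: cc -O2 sim.c && ./a.out */
  #include <stdio.h>
  #include <stdlib.h>

  #define NB_THREADS 48
  #define NB_CHECKS  200
  #define NB_ROUNDS  3

  static int load[NB_THREADS];   /* checks currently assigned per thread */
  static int tid[NB_CHECKS];     /* owning thread of each check */

  int main(void)
  {
      srand(1);
      load[0] = NB_CHECKS;       /* worst case: all checks start on thread 0 */

      for (int round = 0; round < NB_ROUNDS; round++) {
          /* one "inter" period: every check wakes up exactly once */
          for (int c = 0; c < NB_CHECKS; c++) {
              int my_load   = load[tid[c]];
              int peer      = rand() % NB_THREADS;
              int peer_load = load[peer];

              /* bounce a freshly woken check only to a peer holding
               * at most half of our own thread's load.
               */
              if (my_load >= 2 && peer_load <= my_load / 2) {
                  load[tid[c]]--;
                  load[peer]++;
                  tid[c] = peer;
              }
          }
      }

      for (int t = 0; t < NB_THREADS; t++)
          printf("thread %2d: %3d checks\n", t, load[t]);
      return 0;
  }

Starting from everything concentrated on a single thread, a couple of
rounds suffice to flatten the distribution, consistent with the
measurements below showing an almost uniform load after two "inter"
delays.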

Tests have shown that the initial load was very unfair before, with a
few check threads having a load of 15-20 and the vast majority having
zero. With this modification, within two "inter" delays of the checks
starting, the load is either zero or one everywhere. The same test
shows CPU usage dropping significantly, to between 0.5 and 1%. The
same latency tail measurement is much better, roughly 10 times
smaller:

  percentile   ttfb(ms)
  99.98        1.647
  99.985       1.773
  99.99        4.912
  99.995        8.76
  99.996        8.88
  99.9965      8.944
  99.997       9.016
  99.9975      9.104
  99.998       9.224
  99.9985      9.416
  99.999         9.8
  99.9995      10.04
  99.9999     10.432

In fact one difference here is that many threads now perform useful
work, whereas in the past they were all waking up and going back to
sleep after having perturbed the shared lock. Thus it is anticipated
that this will scale much more smoothly than before. Under strace it's
clearly visible that all threads sleep for the whole time it takes to
relaunch a check; there are no more thundering-herd wakeups.

However it is also possible that in some rare cases, such as check
intervals shorter than a scheduler timeslice (e.g. 4ms), some users
might have benefited from the work being concentrated on fewer
threads, and would instead observe a small increase in apparent CPU
usage, because more threads wake up in total even though each one
does less work and the total work is lower. That's visible with 200
servers at a 4ms interval, where "show activity" indicates that a few
threads used to be overloaded while the others did nothing. It's not
a problem though, as in practice checks are not supposed to eat much
CPU nor to wake up often enough to represent a significant load
anyway, and the main issue they could have been causing (aside from
the global lock) was an increase in last-percentile latency.
Author: Willy Tarreau
Date:   2022-10-12 20:58:18 +02:00
Commit: d114f4a68f (parent a840b4a39b)

2 changed files with 47 additions and 1 deletion

diff --git a/include/haproxy/check-t.h b/include/haproxy/check-t.h
--- a/include/haproxy/check-t.h
+++ b/include/haproxy/check-t.h
@@ -56,6 +56,7 @@ enum chk_result {
 #define CHK_ST_OUT_ALLOC    0x0080  /* check blocked waiting for output buffer allocation */
 #define CHK_ST_CLOSE_CONN   0x0100  /* check is waiting that the connection gets closed */
 #define CHK_ST_PURGE        0x0200  /* check must be freed */
+#define CHK_ST_SLEEPING     0x0400  /* check was sleeping */
 
 /* check status */
 enum healthcheck_status {

diff --git a/src/check.c b/src/check.c
--- a/src/check.c
+++ b/src/check.c
@@ -1089,9 +1089,54 @@ struct task *process_chk_conn(struct task *t, void *context, unsigned int state)
 	TRACE_ENTER(CHK_EV_TASK_WAKE, check);
 
+	if (check->state & CHK_ST_SLEEPING) {
+		/* This check just restarted. It's still time to verify if
+		 * we're on an overloaded thread or if a more suitable one is
+		 * available. This helps spread the load over the available
+		 * threads, without migrating too often. For this we'll check
+		 * our load, and pick a random thread, check if it has less
+		 * than half of the current thread's load, and if so we'll
+		 * bounce the task there. It's possible because it's not yet
+		 * tied to the current thread. The other thread will not bounce
+		 * the task again because we're removing CHK_ST_SLEEPING.
+		 */
+		uint my_load = HA_ATOMIC_LOAD(&th_ctx->rq_total);
+
+		check->state &= ~CHK_ST_SLEEPING;
+
+		if (my_load >= 2) {
+			uint new_tid  = statistical_prng_range(global.nbthread);
+			uint new_load = HA_ATOMIC_LOAD(&ha_thread_ctx[new_tid].rq_total);
+
+			if (new_load <= my_load / 2) {
+				/* Found one. Let's migrate the task over there. We have to
+				 * remove it from the WQ first and kill its expire time
+				 * otherwise the scheduler will reinsert it and trigger a
+				 * BUG_ON() as we're not allowed to call task_queue() for a
+				 * foreign thread. The recipient will restore the expiration.
+				 */
+				task_unlink_wq(t);
+				t->expire = TICK_ETERNITY;
+				task_set_thread(t, new_tid);
+				task_wakeup(t, TASK_WOKEN_MSG);
+				TRACE_LEAVE(CHK_EV_TASK_WAKE, check);
+				return t;
+			}
+		}
+	}
+
 	if (check->server)
 		HA_SPIN_LOCK(SERVER_LOCK, &check->server->lock);
 
+	if (!(check->state & (CHK_ST_INPROGRESS|CHK_ST_IN_ALLOC|CHK_ST_OUT_ALLOC))) {
+		/* This task might have bounced from another overloaded thread, it
+		 * needs an expiration timer that was supposed to be now, but that
+		 * was erased during the bounce.
+		 */
+		if (!tick_isset(t->expire))
+			t->expire = now_ms;
+	}
+
 	if (unlikely(check->state & CHK_ST_PURGE)) {
 		TRACE_STATE("health-check state to purge", CHK_EV_TASK_WAKE, check);
 	}
 
@@ -1217,10 +1262,10 @@ struct task *process_chk_conn(struct task *t, void *context, unsigned int state)
 	if (LIST_INLIST(&check->buf_wait.list))
 		LIST_DEL_INIT(&check->buf_wait.list);
 
-	task_set_thread(t, -1);
 	check_release_buf(check, &check->bi);
 	check_release_buf(check, &check->bo);
 	check->state &= ~(CHK_ST_INPROGRESS|CHK_ST_IN_ALLOC|CHK_ST_OUT_ALLOC);
+	check->state |= CHK_ST_SLEEPING;
 
 	if (check->server) {
 		rv = 0;