MEDIUM: checks: spread the checks load over random threads

CPU usage was found to be high (5%) on a machine with 48 threads and
only 100 servers checked every second. That amounts to only 100
connections per second, which should be very cheap. It was figured
out that because check tasks unbind from any thread when going back
to sleep, they get queued into the shared queue.

Not only does this require manipulating the global queue lock, it
also means that all threads have to check the global queue before
going to sleep (hence take the lock again) to figure out how long to
sleep. They would then all sleep only until the next check's expiry:
one of them would pick it up, and all the other ones would go back to
sleep waiting for the check after that.
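
To make this pattern concrete, here is a minimal, self-contained toy
model (plain pthreads, NOT HAProxy's scheduler; every name in it is
invented for the illustration). All threads derive their sleep time
from a single shared expiry under one lock, so they all time out at
the same instant, the first one wins the check and the other ones
have woken up for nothing:

  /* Build: cc -O2 -pthread herd.c && ./a.out */
  #include <errno.h>
  #include <pthread.h>
  #include <stdio.h>
  #include <time.h>

  #define NB_THREADS 8
  #define NB_CHECKS  5

  static pthread_mutex_t queue_lock = PTHREAD_MUTEX_INITIALIZER;
  static pthread_cond_t  queue_cond = PTHREAD_COND_INITIALIZER;
  static struct timespec next_expiry;  /* head of the shared wait queue */
  static int checks_left = NB_CHECKS;

  static void *worker(void *arg)
  {
      long tid = (long)arg;

      pthread_mutex_lock(&queue_lock);
      while (checks_left > 0) {
          /* take the global lock just to read the queue head and
           * derive our own sleep time from it...
           */
          struct timespec deadline = next_expiry;

          if (pthread_cond_timedwait(&queue_cond, &queue_lock, &deadline) != ETIMEDOUT)
              continue;

          /* ...so all threads wake up here together; the first one to
           * re-grab the lock runs the check and re-arms the expiry,
           * the others only paid a futile wakeup and lock round-trip.
           */
          if (next_expiry.tv_sec == deadline.tv_sec &&
              next_expiry.tv_nsec == deadline.tv_nsec) {
              checks_left--;
              clock_gettime(CLOCK_REALTIME, &next_expiry);
              next_expiry.tv_sec += 1;  /* next check in 1s */
              printf("thread %ld ran the check\n", tid);
          }
          else
              printf("thread %ld woke up for nothing\n", tid);
      }
      pthread_mutex_unlock(&queue_lock);
      return NULL;
  }

  int main(void)
  {
      pthread_t th[NB_THREADS];

      clock_gettime(CLOCK_REALTIME, &next_expiry);
      next_expiry.tv_sec += 1;
      for (long i = 0; i < NB_THREADS; i++)
          pthread_create(&th[i], NULL, worker, (void *)i);
      for (int i = 0; i < NB_THREADS; i++)
          pthread_join(th[i], NULL);
      return 0;
  }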

That's perfectly visible in time-to-first-byte measurements. A quick
test consisting of retrieving the stats page in CSV format from a
48-thread process checking 200 servers every 2 seconds shows the
following latency tail:

  percentile   ttfb(ms)
  99.98        2.43
  99.985       5.72
  99.99       32.96
  99.995     82.176
  99.996     82.944
  99.9965    83.328
  99.997      83.84
  99.9975    84.288
  99.998      85.12
  99.9985    86.592
  99.999         88
  99.9995    89.728
  99.9999   100.352

One solution could consist in forcefully binding checks to threads at
boot time, but that's annoying, would cause trouble for dynamic
servers, and could skew the load depending on server patterns.

Instead here we take a different approach: a check remains bound to
its thread for as long as possible, but upon every wakeup the current
thread's load is compared with that of another randomly picked
thread. If the other thread's load is less than half of the current
one's, the task is bounced to that thread. In order to prevent the
new thread from immediately doing the same, a flag CHK_ST_SLEEPING is
set when the check goes back to sleep, and the task is only bounced
when it wakes up with this flag still set; the flag is cleared before
bouncing, so the recipient thread will not bounce it again.
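
To illustrate why this converges quickly, here is a small standalone
simulation (NOT HAProxy code: the per-thread count of assigned checks
stands in for the run-queue load rq_total, and all names are
invented) applying the same condition as the patch below, i.e. bounce
only when my_load >= 2 and the peer's load is at most my_load / 2:

  /* Build: cc -O2 sim.c && ./a.out */
  #include <stdio.h>
  #include <stdlib.h>

  #define NB_THREADS 48
  #define NB_CHECKS  200
  #define NB_ROUNDS  3

  static int load[NB_THREADS];   /* checks currently assigned per thread */
  static int tid[NB_CHECKS];     /* owning thread of each check */

  int main(void)
  {
      srand(1);
      load[0] = NB_CHECKS;       /* worst case: all checks start on thread 0 */

      for (int round = 0; round < NB_ROUNDS; round++) {
          /* one "inter" period: every check wakes up exactly once */
          for (int c = 0; c < NB_CHECKS; c++) {
              int my_load   = load[tid[c]];
              int peer      = rand() % NB_THREADS;
              int peer_load = load[peer];

              /* bounce a freshly woken check only to a peer holding
               * at most half of our own thread's load.
               */
              if (my_load >= 2 && peer_load <= my_load / 2) {
                  load[tid[c]]--;
                  load[peer]++;
                  tid[c] = peer;
              }
          }
      }

      for (int t = 0; t < NB_THREADS; t++)
          printf("thread %2d: %3d checks\n", t, load[t]);
      return 0;
  }

Starting from everything concentrated on a single thread, a couple of
rounds suffice to flatten the distribution, consistent with the
measurements below showing an almost uniform load after two "inter"
delays.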

Tests have shown that the initial load was very unfair before, with a
few check threads having a load of 15-20 and the vast majority having
zero. With this modification, within two "inter" delays of the checks
starting, the load is either zero or one everywhere. The same test
shows CPU usage dropping significantly, to between 0.5 and 1%. The
same latency tail measurement is much better, roughly 10 times
smaller:

  percentile   ttfb(ms)
  99.98        1.647
  99.985       1.773
  99.99        4.912
  99.995        8.76
  99.996        8.88
  99.9965      8.944
  99.997       9.016
  99.9975      9.104
  99.998       9.224
  99.9985      9.416
  99.999         9.8
  99.9995      10.04
  99.9999     10.432

In fact one difference here is that many threads now perform useful
work, whereas in the past they were all waking up and going back to
sleep after having perturbed the shared lock. Thus it is anticipated
that this will scale much more smoothly than before. Under strace it's
clearly visible that all threads sleep for the whole time it takes to
relaunch a check; there are no more thundering-herd wakeups.

However it is also possible that in some rare cases, such as check
intervals shorter than a scheduler timeslice (e.g. 4ms), some users
might have benefited from the work being concentrated on fewer
threads, and would instead observe a small increase in apparent CPU
usage, because more threads wake up in total even though each one
does less work and the total work is lower. That's visible with 200
servers at a 4ms interval, where "show activity" indicates that a few
threads used to be overloaded while the others did nothing. It's not
a problem though, as in practice checks are not supposed to eat much
CPU nor to wake up often enough to represent a significant load
anyway, and the main issue they could have been causing (aside from
the global lock) was an increase in last-percentile latency.
Author: Willy Tarreau
Date:   2022-10-12 20:58:18 +02:00
Commit: d114f4a68f (parent a840b4a39b)

2 changed files with 47 additions and 1 deletion

diff --git a/include/haproxy/check-t.h b/include/haproxy/check-t.h
--- a/include/haproxy/check-t.h
+++ b/include/haproxy/check-t.h
@@ -56,6 +56,7 @@ enum chk_result {
 #define CHK_ST_OUT_ALLOC    0x0080  /* check blocked waiting for output buffer allocation */
 #define CHK_ST_CLOSE_CONN   0x0100  /* check is waiting that the connection gets closed */
 #define CHK_ST_PURGE        0x0200  /* check must be freed */
+#define CHK_ST_SLEEPING     0x0400  /* check was sleeping */
 
 /* check status */
 enum healthcheck_status {

diff --git a/src/check.c b/src/check.c
--- a/src/check.c
+++ b/src/check.c
@@ -1089,9 +1089,54 @@ struct task *process_chk_conn(struct task *t, void *context, unsigned int state)
 	TRACE_ENTER(CHK_EV_TASK_WAKE, check);
 
+	if (check->state & CHK_ST_SLEEPING) {
+		/* This check just restarted. It's still time to verify if
+		 * we're on an overloaded thread or if a more suitable one is
+		 * available. This helps spread the load over the available
+		 * threads, without migrating too often. For this we'll check
+		 * our load, and pick a random thread, check if it has less
+		 * than half of the current thread's load, and if so we'll
+		 * bounce the task there. It's possible because it's not yet
+		 * tied to the current thread. The other thread will not bounce
+		 * the task again because we're removing CHK_ST_SLEEPING.
+		 */
+		uint my_load = HA_ATOMIC_LOAD(&th_ctx->rq_total);
+
+		check->state &= ~CHK_ST_SLEEPING;
+
+		if (my_load >= 2) {
+			uint new_tid  = statistical_prng_range(global.nbthread);
+			uint new_load = HA_ATOMIC_LOAD(&ha_thread_ctx[new_tid].rq_total);
+
+			if (new_load <= my_load / 2) {
+				/* Found one. Let's migrate the task over there. We have to
+				 * remove it from the WQ first and kill its expire time
+				 * otherwise the scheduler will reinsert it and trigger a
+				 * BUG_ON() as we're not allowed to call task_queue() for a
+				 * foreign thread. The recipient will restore the expiration.
+				 */
+				task_unlink_wq(t);
+				t->expire = TICK_ETERNITY;
+				task_set_thread(t, new_tid);
+				task_wakeup(t, TASK_WOKEN_MSG);
+				TRACE_LEAVE(CHK_EV_TASK_WAKE, check);
+				return t;
+			}
+		}
+	}
+
 	if (check->server)
 		HA_SPIN_LOCK(SERVER_LOCK, &check->server->lock);
 
+	if (!(check->state & (CHK_ST_INPROGRESS|CHK_ST_IN_ALLOC|CHK_ST_OUT_ALLOC))) {
+		/* This task might have bounced from another overloaded thread, it
+		 * needs an expiration timer that was supposed to be now, but that
+		 * was erased during the bounce.
+		 */
+		if (!tick_isset(t->expire))
+			t->expire = now_ms;
+	}
+
 	if (unlikely(check->state & CHK_ST_PURGE)) {
 		TRACE_STATE("health-check state to purge", CHK_EV_TASK_WAKE, check);
 	}
 
@@ -1217,10 +1262,10 @@ struct task *process_chk_conn(struct task *t, void *context, unsigned int state)
 	if (LIST_INLIST(&check->buf_wait.list))
 		LIST_DEL_INIT(&check->buf_wait.list);
 
-	task_set_thread(t, -1);
 	check_release_buf(check, &check->bi);
 	check_release_buf(check, &check->bo);
 	check->state &= ~(CHK_ST_INPROGRESS|CHK_ST_IN_ALLOC|CHK_ST_OUT_ALLOC);
+	check->state |= CHK_ST_SLEEPING;
 
 	if (check->server) {
 		rv = 0;