From 8e978a094d24a7835790da6be6c94e38f9888026 Mon Sep 17 00:00:00 2001
From: Willy Tarreau
Date: Wed, 17 May 2023 09:01:22 +0200
Subject: [PATCH] BUG/MINOR: checks: postpone the startup of health checks by
 the boot time

When health checks are started at boot, now_ms could be off by the boot
time. In general it's not even noticeable, but with very large configs
taking up to one or even a few seconds to start, this can result in a
part of the servers' checks being scheduled slightly in the past. As
such all of them will start grouped, partially defeating the purpose of
the spread-checks setting. For example, this can cause a burst of
connections for the network, or an excess of CPU usage during SSL
handshakes, possibly even causing some timeouts to expire early.

Here in order to compensate for this, we simply add the known boot time
to the computed delay when scheduling the startup of checks. That's very
simple and particularly efficient. For example, a config with 5k servers
in 800 backends checked every 5 seconds, that was taking 3.8 seconds to
start used to show this distribution of health checks previously despite
the spread-checks 50:

   3690 08:59:25
    417 08:59:26
    213 08:59:27
     71 08:59:28
    428 08:59:29
    860 08:59:30
    918 08:59:31
    938 08:59:32
   1124 08:59:33
    904 08:59:34
    647 08:59:35
    890 08:59:36
    973 08:59:37
    856 08:59:38
    893 08:59:39
    154 08:59:40

Now with the fix it shows this:

    470 08:59:59
    929 09:00:00
    896 09:00:01
    937 09:00:02
    854 09:00:03
    827 09:00:04
    906 09:00:05
    863 09:00:06
    913 09:00:07
    873 09:00:08
    162 09:00:09

This should be backported to all supported versions. It depends on
this commit:

    MINOR: clock: measure the total boot time

For 2.8 where the internal clock is now totally independent of the human
one, a more generic fix will consist in simply updating now_ms to reflect
the startup time.
---
 src/check.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/src/check.c b/src/check.c
index a440185da..4e681d5db 100644
--- a/src/check.c
+++ b/src/check.c
@@ -1475,6 +1475,7 @@ int start_check_task(struct check *check, int mininter,
 			    int nbcheck, int srvpos)
 {
 	struct task *t;
+	ulong boottime = tv_ms_remain(&start_date, &ready_date);
 
 	/* task for the check. Process-based checks exclusively run on thread 1. */
 	if (check->type == PR_O2_EXT_CHK)
@@ -1504,7 +1505,7 @@ int start_check_task(struct check *check, int mininter,
 		mininter = global.max_spread_checks;
 
 	/* check this every ms */
-	t->expire = tick_add(now_ms, MS_TO_TICKS(mininter * srvpos / nbcheck));
+	t->expire = tick_add(now_ms, MS_TO_TICKS(boottime + mininter * srvpos / nbcheck));
 	check->start = now_ns;
 	task_queue(t);
 
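
The arithmetic behind the change to t->expire can be looked at in
isolation with a small standalone model. This is only a sketch and not
HAProxy code: first_delay() and count_late() are hypothetical helpers,
the clock is assumed to lag by exactly the boot time, and the random
jitter applied by spread-checks is ignored.

```c
#include <assert.h>

/* Hypothetical helper (not part of HAProxy): delay in ms before the first
 * run of check number <srvpos> out of <nbcheck>, spread over <mininter> ms
 * and shifted by <comp_ms>. With comp_ms == 0 this mirrors the old
 * expression, with comp_ms == boot time it mirrors the patched one.
 */
static unsigned long first_delay(unsigned long comp_ms, unsigned long mininter,
                                 unsigned long srvpos, unsigned long nbcheck)
{
	assert(nbcheck > 0);
	return comp_ms + mininter * srvpos / nbcheck;
}

/* Hypothetical helper: number of checks whose first delay is shorter than
 * the boot time. Under the assumption that the clock used for scheduling
 * lags by <boot_ms>, these checks are already expired once startup is
 * complete and would all fire together.
 */
static unsigned long count_late(unsigned long boot_ms, unsigned long comp_ms,
                                unsigned long mininter, unsigned long nbcheck)
{
	unsigned long late = 0;
	unsigned long pos;

	for (pos = 0; pos < nbcheck; pos++)
		if (first_delay(comp_ms, mininter, pos, nbcheck) < boot_ms)
			late++;
	return late;
}
```

With values taken from the commit message (5000 checks, a 5000 ms
interval, a 3800 ms boot), count_late(3800, 0, 5000, 5000) counts 3800
checks already due when startup completes, while
count_late(3800, 3800, 5000, 5000) counts none. The model is too coarse
to reproduce the measured distributions exactly, but it shows why adding
the boot time to the delay is enough to keep every check in the future.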