BUG/MINOR: checks: postpone the startup of health checks by the boot time

When health checks are started at boot, now_ms could be off by the boot
time. In general it's not even noticeable, but with very large configs
taking up to one or even a few seconds to start, this can result in
part of the servers' checks being scheduled slightly in the past. As
such, all of those will start grouped, partially defeating the purpose of
the spread-checks setting. For example, this can cause a burst of
connections for the network, or an excess of CPU usage during SSL
handshakes, possibly even causing some timeouts to expire early.

To compensate for this, we simply add the known boot time to the
computed delay when scheduling the startup of checks. That's very
simple and particularly efficient. For example, a config with 5k servers
in 800 backends checked every 5 seconds, which was taking 3.8 seconds to
start, used to show this distribution of health checks despite the
spread-checks 50:

   3690 08:59:25
    417 08:59:26
    213 08:59:27
     71 08:59:28
    428 08:59:29
    860 08:59:30
    918 08:59:31
    938 08:59:32
   1124 08:59:33
    904 08:59:34
    647 08:59:35
    890 08:59:36
    973 08:59:37
    856 08:59:38
    893 08:59:39
    154 08:59:40

Now with the fix it shows this:

    470 08:59:59
    929 09:00:00
    896 09:00:01
    937 09:00:02
    854 09:00:03
    827 09:00:04
    906 09:00:05
    863 09:00:06
    913 09:00:07
    873 09:00:08
    162 09:00:09
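
The effect can be reproduced with a simplified model of the delay
computation. This is only a sketch and not the HAProxy code: the helper
names first_check_delay() and count_overdue() are hypothetical, and the
model assumes that the delays are computed from a date that lags the real
one by exactly the boot time, ignoring any other adjustment made to the
interval. A check whose delay is shorter than the boot time is already
expired once the process is ready, so all such checks fire together:

```c
/* Hypothetical helper, not part of HAProxy: delay in ms before the first
 * run of check number <srvpos> out of <nbcheck>, spread over <mininter> ms.
 * <boottime> is the startup duration in ms; passing 0 models the old
 * behaviour where the boot time was not taken into account.
 */
unsigned long first_check_delay(unsigned long boottime, unsigned long mininter,
                                unsigned long srvpos, unsigned long nbcheck)
{
	return boottime + mininter * srvpos / nbcheck;
}

/* Count the checks that are already expired when the process becomes
 * ready, i.e. <boottime> ms after the date the delays were computed from.
 * <compensate> selects the fixed (non-zero) or the old (zero) computation.
 */
unsigned long count_overdue(unsigned long boottime, unsigned long mininter,
                            unsigned long nbcheck, int compensate)
{
	unsigned long pos, overdue = 0;

	for (pos = 0; pos < nbcheck; pos++) {
		unsigned long delay;

		delay = first_check_delay(compensate ? boottime : 0,
		                          mininter, pos, nbcheck);
		if (delay < boottime)
			overdue++;
	}
	return overdue;
}
```

With 5000 checks spread over 5000 ms and a 3800 ms boot time, this model
reports 3800 overdue checks without the compensation and none with it,
which is in the same range as the initial burst shown in the first
distribution above.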

This should be backported to all supported versions. It depends on
this commit:

    MINOR: clock: measure the total boot time

For 2.8, where the internal clock is now totally independent of the human
one, a more generic fix will consist in simply updating now_ms to reflect
the startup time.
Willy Tarreau 2023-05-17 09:01:22 +02:00
parent 5723b382ed
commit 8e978a094d

@@ -1475,6 +1475,7 @@ int start_check_task(struct check *check, int mininter,
 			 int nbcheck, int srvpos)
 {
 	struct task *t;
+	ulong boottime = tv_ms_remain(&start_date, &ready_date);
 
 	/* task for the check. Process-based checks exclusively run on thread 1. */
 	if (check->type == PR_O2_EXT_CHK)
@@ -1504,7 +1505,7 @@ int start_check_task(struct check *check, int mininter,
 		mininter = global.max_spread_checks;
 
 	/* check this every ms */
-	t->expire = tick_add(now_ms, MS_TO_TICKS(mininter * srvpos / nbcheck));
+	t->expire = tick_add(now_ms, MS_TO_TICKS(boottime + mininter * srvpos / nbcheck));
 	check->start = now_ns;
 	task_queue(t);