MINOR: fd: don't scan the full fdtab on all threads

During tests, it's pretty visible that with many threads and a large
number of FDs, the process may take time to be ready. The reason for
this is that the full fdtab array is scanned by each and every thread
at boot in fd_reregister_all() in order to make each thread-local
poller adopt the FDs that are relevant to it. The problem is that
when dealing with 1-2M FDs and 64+ threads, this adds up to a large
number of iterations (over 100 million for 2M FDs and 64 threads), and
the fdtab array usually doesn't fit entirely in the CPU's L3 cache,
causing extra memory accesses.

It's particularly visible when issuing debugging commands to the CLI,
because the first one usually fails while the CPU is at 100% for half
a second (which is also socat's default timeout). A quick test with this:

    global
        stats socket /tmp/sock1 level admin mode 666
        stats timeout 1h
        maxconn 2000000

And the following script started in another window:

    while ! time socat -t5 - /tmp/sock1 <<< "show version";do date -Ins;done

shows that the socat instance that succeeds takes 1.58s on an Ampere
Altra with 80 cores. This requires raising socat's timeout (which defaults
to half a second), otherwise it returns nothing. It also means that CPU
spikes will be noticed during reloads.

Adding a prefetch of the current FD + 16 improves the startup time by 30%
but that's far from being sufficient.
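
As an illustration only, a prefetching scan of this kind could look like
the sketch below. It is not the code that was tested: demo_entry,
demo_count_prefetch() and DEMO_PREFETCH_DIST are hypothetical names, and
__builtin_prefetch() is assumed to be available (GCC and Clang provide it).

```c
#include <stddef.h>

#define DEMO_PREFETCH_DIST 16   /* look-ahead distance, as in the text */

/* hypothetical stand-in for an fdtab entry */
struct demo_entry { void *owner; };

/* Count the used entries in tab[0..size-1], asking the CPU to prefetch
 * the entry located DEMO_PREFETCH_DIST slots ahead of the current one.
 */
static size_t demo_count_prefetch(const struct demo_entry *tab, size_t size)
{
	size_t i, used = 0;

	for (i = 0; i < size; i++) {
		/* never form an address beyond the end of the table */
		if (i + DEMO_PREFETCH_DIST < size)
			__builtin_prefetch(&tab[i + DEMO_PREFETCH_DIST]);
		if (tab[i].owner)
			used++;
	}
	return used;
}
```

The prefetch only hides part of the memory latency. Every thread still
visits every slot, which is consistent with the gain remaining limited.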

In practice all of this is performed at boot time, a moment at which we
know that extremely few FDs are registered (basically just the listeners),
so FD numbers are usually very low and the rest of the table is scanned
for no benefit. Ideally, knowing upfront how many FDs we have should be
sufficient.

A first approach would consist in counting the entries on a single thread
before registering pollers. It's not necessarily efficient and would take
time anyway.

This patch takes a different approach. It keeps a thread-local max
("fd_highest") that is updated whenever fd_insert() is called with a
larger number. Of course this is no longer accurate once all threads have
started, but it remains valid during boot: the value set during startup
is cloned for each thread, and no scheduling happens anywhere during this
period. All threads are thus aware of the highest FD they've seen
registered, even if it was registered by some init code, and this without
having to deal with a shared variable.
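
The principle can be sketched in isolation as follows. This is a minimal,
single-threaded illustration and not the patch itself: demo_fdtab,
demo_fd_insert(), demo_scan() and DEMO_MAXSOCK are hypothetical names, and
C11's _Thread_local is assumed in place of the THREAD_LOCAL macro. The scan
bound is inclusive because the variable holds the highest FD itself.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_MAXSOCK 1024   /* hypothetical table size */

/* hypothetical stand-in for an fdtab entry */
struct demo_fd { void *owner; };

static struct demo_fd demo_fdtab[DEMO_MAXSOCK];

/* highest FD inserted so far by this thread, -1 when none */
static _Thread_local int demo_fd_highest = -1;

/* Register <fd> and raise the thread-local high-water mark if needed. */
static void demo_fd_insert(int fd, void *owner)
{
	/* an out-of-range FD would be a bug and would corrupt memory */
	assert(fd >= 0 && fd < DEMO_MAXSOCK);
	demo_fdtab[fd].owner = owner;
	if (fd > demo_fd_highest)
		demo_fd_highest = fd;
}

/* Count the used entries, visiting only slots 0..demo_fd_highest. The
 * number of visited slots is stored into <visited> when it is not NULL.
 */
static int demo_scan(int *visited)
{
	int fd, used = 0;

	for (fd = 0; fd <= demo_fd_highest; fd++) {
		if (demo_fdtab[fd].owner)
			used++;
	}
	if (visited)
		*visited = fd;
	return used;
}
```

With only two entries inserted at FDs 3 and 7, demo_scan() visits 8 slots
instead of DEMO_MAXSOCK, which is where the saving comes from when only a
few listeners are registered at boot.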

Here on the test platform, the script gets its response in 10ms vs 1580ms
before.
Willy Tarreau 2024-07-15 15:09:10 +02:00
parent a5c5a68454
commit 75b335abc7
2 changed files with 17 additions and 2 deletions


@@ -48,6 +48,7 @@ extern struct polled_mask *polled_mask;
 extern THREAD_LOCAL int *fd_updt; // FD updates list
 extern THREAD_LOCAL int fd_nbupdt; // number of updates in the list
+extern THREAD_LOCAL int fd_highest; // highest FD known by the current thread
 extern int poller_wr_pipe[MAX_THREADS];
@@ -466,6 +467,19 @@ static inline void fd_insert(int fd, void *owner, void (*iocb)(int fd), int tgid
 	if ((global.tune.options & GTUNE_FD_ET) && iocb == sock_conn_iocb)
 		newstate |= FD_ET_POSSIBLE;
+	/* We must update fd_highest to reflect the highest known FD for this
+	 * thread. It's important to note that it's not necessarily the highest
+	 * FD the thread will see, it's the highest FD that was inserted by
+	 * this thread or by the main thread. The purpose is essentially to
+	 * let all threads know the highest known FD at boot, that will be
+	 * cloned into each thread, in order to limit the work range for init
+	 * functions such as fork_poller() and fd_reregister_all(). Keeping the
+	 * value thread-local substantially limits the cost, since after a few
+	 * thousand calls the value will just stop changing.
+	 */
+	if (unlikely(fd > fd_highest))
+		fd_highest = fd;
 	/* This must never happen and would definitely indicate a bug, in
 	 * addition to overwriting some unexpected memory areas.
 	 */


@@ -112,6 +112,7 @@ volatile struct fdlist update_list[MAX_TGROUPS]; // Global update list
 THREAD_LOCAL int *fd_updt = NULL; // FD updates list
 THREAD_LOCAL int fd_nbupdt = 0; // number of updates in the list
+THREAD_LOCAL int fd_highest = -1; // highest FD known by the current thread
 THREAD_LOCAL int poller_rd_pipe = -1; // Pipe to wake the thread
 int poller_wr_pipe[MAX_THREADS] __read_mostly; // Pipe to wake the threads
@@ -836,7 +837,7 @@ void fd_reregister_all(int tgrp, ulong mask)
 {
 	int fd;
-	for (fd = 0; fd < global.maxsock; fd++) {
+	for (fd = 0; fd <= fd_highest; fd++) {
 		if (!fdtab[fd].owner)
 			continue;
@@ -1271,7 +1272,7 @@ int list_pollers(FILE *out)
 int fork_poller()
 {
 	int fd;
-	for (fd = 0; fd < global.maxsock; fd++) {
+	for (fd = 0; fd <= fd_highest; fd++) {
 		if (fdtab[fd].owner) {
 			HA_ATOMIC_OR(&fdtab[fd].state, FD_CLONED);
 		}