Commit Graph

23331 Commits

Author SHA1 Message Date
Valentine Krasnobaeva
699be6a55d MEDIUM: cfgparse-global: parse only KWF_DISCOVERY keywords in MODE_DISCOVERY
This commit is a part of the series to add a support of discovery mode in the
configuration parser and in initialization sequence.

Global section parser parses the majority of keywords in its function, so
those keywords don't have any dedicated parsers yet. Only after this parsing
block cfg_parse_global() starts to call dedicated parsers for any other
discovered keywords, which were not found in the block.

As all keywords, which should be parsed in MODE_DISCOVERY have its own parser
funtions, we can skip this block with goto discovery_kw and start directly from
the part, where we call parsers from the keywords list. KWF_DISCOVERY flag helps
to call in MODE_DISCOVERY only the parsers, which we are needed at this mode.

All unknown keywords and garbage will be ignored at this stage.
2024-10-16 22:02:39 +02:00
Valentine Krasnobaeva
48371e6a30 MEDIUM: cfgparse: call some parsers only in MODE_DISCOVERY
This commit is a part of the series to add a support of discovery mode in the
configuration parser and in initialization sequence.

Some keyword parsers tagged with KWF_DISCOVERY (for example those, which parse
runtime modes, poller types, pidfile), should not be called twice when
the configuration will be read the second time after the discovery mode.
It's redundant and could trigger parser's errors in standalone mode. In
master-worker mode the worker process inherits parsed settings from the master.
2024-10-16 22:02:39 +02:00
Valentine Krasnobaeva
f9123e2183 MEDIUM: cfgparse: add KWF_DISCOVERY keyword flag
This commit is a part of the series to add a support of discovery mode in the
configuration parser and in initialization sequence.

So, let's add here KWF_DISCOVERY flag to distinguish the keywords,
which should be parsed in "discovery" mode and which are needed for master
process, from all others. Keywords, that should be parsed in "discovery" mode
have its dedicated parser funtions. Let's tag these functions with
KWF_DISCOVERY flag in keywords list. Like this, only these keyword parsers
might be called during the first configuration read in discovery mode.
2024-10-16 22:02:39 +02:00
Valentine Krasnobaeva
6769745fe5 MINOR: global: add MODE_DISCOVERY flag
This is the first commit from a series to add a support of discovery mode
in the configuration parser and in initialization sequence.

Discovery mode is the mode, when we read the configuration at the first time
and we parse and set runtime modes: daemon, zero-warning, master-worker. In
this mode we also parse some parameters needed for the master process to start,
in case if we are in the master-worker mode. Like this the master process
doesn't allocate any additional resources, which it doesn't use and it quickly
finishes its initialization and enters to its polling loop. The worker process
after its fork reads the rest of the configuration.

So, let's add in this commit MODE_DISCOVERY flag to check it in
configuration parser functions.
2024-10-16 22:02:39 +02:00
Valentine Krasnobaeva
fe75c1e12d MEDIUM: startup: remove MODE_MWORKER_WAIT
MODE_MWORKER_WAIT becames redundant with MODE_MWORKER, due to moving
master-worker fork in init(). This change allows master no longer perform
reexec just after forking in order to free additional memory.

As after the fork in the master process we set 'master' variable, we can
replace now MODE_MWORKER_WAIT in some 'if' statements by simple check of this
'master' variable.

Let's also continue to get rid of HAPROXY_MWORKER_WAIT_ONLY environment
variable, as it's no longer needed as well.

In cfg_program_postparser(), which is used to check if cmdline is defined to
launch a program, we completely remove the check of mode for now, because
the master process does not parse the configuration for the moment. 'program'
section parsing will be reintroduced in master later in the next commits.
2024-10-16 22:02:39 +02:00
Valentine Krasnobaeva
fb7bef781d MINOR: defaults: update MASTER_MAXCONN description
This is a one of the commits to prepare the removal of MODE_MWORKER_WAIT
support, as it became redundant with MODE_MWORKER due to moving master-worker
fork in init().
2024-10-16 22:02:39 +02:00
Valentine Krasnobaeva
3f5f57845b MINOR: startup: rename exit_on_waitmode_failure to exit_on_failure
As we no longer support MODE_MWORKER_WAIT for master (it became redundant with
MODE_MWORKER after moving master-worker fork in init()), let's rename
exit_on_waitmode_failure() callback in just exit_on_failure().
2024-10-16 22:02:39 +02:00
Valentine Krasnobaeva
7795d49ae6 CLEANUP: haproxy: rm no longer used mworker_reexec_waitmode
This a first commit to prepare the removal of MODE_MWORKER_WAIT support. It has
became redundant with MODE_MWORKER, due to moving master-worker fork in init().
Master process does no longer perform reexec to free additional memory after
forking and does no longer changing its mode to MODE_MWORKER_WAIT, where it has
entered to its wait polling loop and has handled signals. Now, master enters in
this loop almost immediately after forking a worker and being always in mode
MODE_MWORKER.

So, we can remove mworker_reexec_waitmode() wrapper, which was used to set
HAPROXY_MWORKER_WAIT_ONLY variable and to call mworker_reexec(). But let's keep
for the moment the logic of reexec_on_failure() atexit callback for master in
order if in the future we will need to support this case again.
2024-10-16 22:02:39 +02:00
Valentine Krasnobaeva
cb0f1f42e1 MINOR: startup: only worker gets capabilities from bin
Due to moving the master-worker fork in init(), we need to protect
prepare_caps_from_permitted_set() call, which is executed after init(). This
call makes sense only for worker, daemon and for foreground mono process modes.

prepare_caps_from_permitted_set() allows to read Linux capabilities from
haproxy binary and to move some of them in process Effective set, if 'setcap'
keyword lists needed capabilities in the global section.
2024-10-16 22:02:39 +02:00
Valentine Krasnobaeva
fe04c2ad37 MEDIUM: startup: do set_identity() if needed in one place
There are two set_identity() calls, both under quite same:

    'if ((global.mode & (MODE_MWORKER|MODE_DAEMON...)...'

The first call serves to change uid/gid and set some needed Linux capabilities
only for process in the foreground mode. The second comes after master-worker
fork and allows to do the same in daemon and in worker modes.

Due to moving the master-worker fork in init() in some previous commit, the
second set_identity() now is no longer under the 'if'. So, it is executed
for all modes, except MODE_MWORKER. Now in MODE_MWORKER process enters in its
wait polling loop just after forking a worker and it terminates almost
immediately, if it exits this loop.

Worker, daemon and process in a foreground mode will perform set_identity() as
before, but now it will be called in a one place at main().

global.last_checks should be verified just after set_identity() call. As it's
stated in comments some configuration options may require full privileges or
some Linux capabilities need to be granted to process. set_identity() via
prepare_caps_for_setuid() may put configured capabilities in process Effective
set and, hence, remove respective flag from global.last_checks.
2024-10-16 22:02:39 +02:00
Valentine Krasnobaeva
02af1fe067 MEDIUM: startup: call chroot() if needed in one place
There are two 'chroot' code blocks, both under quite same:

	'if ((global.mode & (MODE_MWORKER|MODE_DAEMON...)...'

The first block serves to perform chroot only for process in the foreground
mode. The second comes after master-worker fork and allows to do chroot
in daemon and in worker modes.

Due to moving the master-worker fork in init() in some previous commit, the
second 'chroot' code block now is no longer under the 'if'. So, it is executed
for all modes, except MODE_MWORKER. Now in MODE_MWORKER process enters in its
wait polling loop just after forking a worker and it terminates almost
immediately, if it exits this loop.

Worker, daemon and process in a foreground mode will perform the chroot as
before, but now it will be done in a one place at main().
2024-10-16 22:02:39 +02:00
Valentine Krasnobaeva
7a2ee10d71 REORG: mworker: move mworker_create_master_cli in master 'case'
Let's move mworker_create_master_cli() call in 'master' case just above and get
rid of redundant global.mode tests.
2024-10-16 22:02:39 +02:00
Valentine Krasnobaeva
e4c10a704d MINOR: init: check MODE_MWORKER before creating master CLI
mworker_create_master_cli() creates MASTER proxy and allocates listeners,
which are attached to this proxy. It also creates a reload sockpair.

So, it's more appropriate to do the check, that we are in a MODE_MWORKER, if
master CLI settings were provided via command line, just after the config
parsing. And only then, if runtime mode and command line settings are
coherent, try to perform master-worker fork and try to create master CLI.
2024-10-16 22:02:39 +02:00
Valentine Krasnobaeva
26e53e2e8c REORG: mworker: set nbthread=1 for master after fork
After moving master-worker fork into init() and reintroducing it into a
switch-case (see the previous commit), it is more appropriate to set
nbthread=1 and nbtgroups=1 immediately in the 'case' for the parent process.
2024-10-16 22:02:39 +02:00
Valentine Krasnobaeva
ae84f06025 BUG/MINOR: mworker: fix memory leak due to master-worker fork
Before this fix, startup logs ring was duplicated before the fork(), so master
and worker had both the original startup_logs ring and the duplicated one. In
the worker context we freed the original ring and used a duplicated one. In
the master context we did nothing, but we still create a duplicated copy again
and again during the reload.

So, let's duplicate startup logs ring only in the worker context. Master
continues to use the original ring initialized in init() before its fork().
2024-10-16 22:02:39 +02:00
Valentine Krasnobaeva
8dd4efe42f MAJOR: mworker: move master-worker fork in init()
This refactoring allows to simplify 'master-worker' logic. The master process
with this change will fork a worker very early at the initialization stage,
which allows to perform a configuration parsing only for the worker. In reality
only the worker process needs to parse and to apply the whole configuration.

Master process just polls master CLI sockets, watches worker status, catches
its termination state and handles the signals. With this refactoring there is
no longer need for master to perform re-execution after reading the whole
configuration file to free additional memory. And there is no longer need for
worker to register atexit callbacks, in order to free the memory, when it
fails to apply the new configuration. In contrast, we now need to set
proc_self pointer to the new worker entry in processes list just after
the fork in the worker process context. proc_self is dereferenced in
mworker_sockpair_register_per_thread(), which is called when worker enters in
its polling loop.

Following patches will try to gather more 'worker' and 'master' specific' code
in the dedicated cases of this new fork() switch, or in a separate functions.
2024-10-16 22:00:58 +02:00
Valentine Krasnobaeva
4cbfcc60f4 MEDIUM: startup: move PID handling in init()
Let's move PID handling in init() from the main() code. It is more appropriate
to open and to write the PID of the process just after daemonization fork. In
case of daemon monoprocess mode, we will simply write a PID of the process,
which is already in the background. In case of 'master-worker' mode, we keep
the previous behaviour and we write only a PID of the master process.

This allows to remove redundant tests of the process execution mode, tests of
the pidfd value and consequent writes to this pidfd. This patch prepares the
refactoring of master-worker fork by moving it in init() function as well.
2024-10-16 22:00:58 +02:00
Valentine Krasnobaeva
95c19be2ab MINOR: startup: refactor "daemonization" fork
Let's put "daemonization" fork into a switch-case. This is more readable and we
don't need to allocate memory for the fork() return value here.
2024-10-16 22:00:58 +02:00
Valentine Krasnobaeva
90b8181c0a MEDIUM: startup: move daemonization fork in init
Let's move daemonization fork in init(). We need to perform this fork always
before forking a worker process, in order to be able to launch master and then
its worker in daemon, i.e. background mode, if haproxy was started with '-D'
option.

This refactoring is a preparation step, needed for replacing then master-worker
fork in init() as well. This allows the master process not to read the whole
configuration file and not to do re-execution in order to free additional
memory, when worker was forked. In the new refactored design only the worker
process will read and apply a new configuration, while the master will arrive
very fast in its polling loop to wait worker's termination and to handle
signals. See more details in the following commits.
2024-10-16 22:00:58 +02:00
Valentine Krasnobaeva
df12791da3 MINOR: startup: add O_CLOEXEC flag to open /dev/null
As master process performs execvp() syscall to handle USR2 and HUP signals in
mworker_reexec(), let's add O_CLOEXEC flag, when we open '/dev/null' in order
to avoid fd leak.

This a preparation step to refactor master-worker logic. See more details in
the next commits.
2024-10-16 22:00:58 +02:00
Valentine Krasnobaeva
5bbcdc003a REGTESTS: cli: add delay 0.1 before connect to cli
When vtest starts haproxy process, it loops until the moment, when haproxy
pidfile is created. When pidfile is created, vtest considers that haproxy
process is ready and it starts to perform test commands, in particular, it
connects to CLI. It's not very reliable approach to base the check of the
process readiness on the PID file. After master-worker architecture
refactoring pidfile is created in the early init stage, but master and worker
are not yet finished its initialization routines. So, all mcli tests and some
tests where we sent commands to CLI start to fail regularly.

In vtest at the moment there is no any other approach to check that the
process is really ready. So let's add a delay 0.1s before connecting to CLI in
all mcli tests and in acl_cli_spaces test.
2024-10-16 22:00:58 +02:00
Willy Tarreau
2c2dac77aa DEBUG: mux-h2/flags: add H2_CF_DEM_RXBUF & H2_SF_EXPECT_RXDATA for the decoder
Both flags were recently added but missing from the decoders flags, so
they appeared in hex in dev/flags/flags output. No backport needed.
2024-10-16 18:32:52 +02:00
Willy Tarreau
ca275d99ce BUG/MEDIUM: queue: make sure never to queue when there's no more served conns
Since commit 53f52e67a0 ("BUG/MEDIUM: queue: always dequeue the backend when
redistributing the last server"), we've got two reports again still showing
the theoretically impossible condition in pendconn_add(), including a single
threaded one.

Thanks to the traces, the issue could be tracked down to the redispatch part.
In fact, in non-determinist LB algorithms (RR, LC, FAS), we don't perform the
LB if there are pending connections in the backend, since it indicates that
previous attempts already failed, so we directly return SRV_STATUS_FULL. And
contrary to a previous belief, it is possible to meet this condition with
be->served==0 when redispatching (and likely with maxconn not greater than
the number of threads).

The problem is that in this case, the entry is queued and then the
pendconn_must_try_again() function checks if any connections are currently
being served to detect if we missed a race, and tries again, but that
situation is not caused by a concurrent thread and will never fix itself,
resulting in the loop.

All that part around pendconn_must_try_again() is still quite brittle, and
a safer approach would involve a sequence counter to detect new arrivals
and dequeues during the pendconn_add() call. But it's more sensitive work,
probably for a later fix.

This fix must be backported wherever the fix above was backported. Thanks
to Patrick Hemmer, as well as Damien Claisse and Basha Mougamadou from
Criteo for their help on tracking this one!
2024-10-16 18:08:39 +02:00
Aurelien DARRAGON
85298189bf BUG/MEDIUM: server: server stuck in maintenance after FQDN change
Pierre Bonnat reported that SRV-based server-template recently stopped
to work properly.

After reviewing the changes, it was found that the regression was caused
by a4d04c6 ("BUG/MINOR: server: make sure the HMAINT state is part of MAINT")

Indeed, HMAINT is not a regular maintenance flag. It was implemented in
b418c122 a4d04c6 ("BUG/MINOR: server: make sure the HMAINT state is part
of MAINT"). This flag is only set (and never removed) when the server FQDN
is changed from its initial config-time value. This can happen with "set
server fqdn" command as well as SRV records updates from the DNS. This
flag should ideally belong to server flags.. but it was stored under
srv_admin enum because cur_admin is properly exported/imported via server
state-file while regular server's flags are not.

Due to a4d04c6, when a server FQDN changes, the server is considered in
maintenance, and since the HMAINT flag is never removed, the server is
stuck in maintenance.

To fix the issue, we partially revert a4d04c6. But this latter commit is
right on one point: HMAINT flag was way too confusing and mixed-up between
regular MAINT flags, thus there's nothing to blame about a4d04c6 as it was
error-prone anyway.. To prevent such kind of bugs from happening again,
let's rename HMAINT to something more explicit (SRV_ADMF_FQDN_CHANGED) and
make it stand out under srv_admin enum so we're not tempted to mix it with
regular maintenance flags anymore.

Since a4d04c6 was set to be backported in all versions, this patch must
be backported there as well.
2024-10-16 14:26:57 +02:00
Amaury Denoyelle
0918c41ef6 BUG/MEDIUM: quic: support wait-for-handshake
wait-for-handshake http-request action was completely ineffective with
QUIC protocol. This commit implements its support for QUIC.

QUIC MUX layer is extended to support wait-for-handshake. A new function
qcc_handle_wait_for_hs() is executed during qcc_io_process(). It detects
if MUX processing occurs after underlying QUIC handshake completion. If
this is the case, it indicates that early data may be received. As such,
connection is flagged with CO_FL_EARLY_SSL_HS, which is necessary to
block stream processing on wait-for-handshake action.

After this, qcc subscribs on quic_conn layer for RECV notification. This
is used to detect QUIC handshake completion. Thus,
qcc_handle_wait_for_hs() can be reexecuted one last time, to remove
CO_FL_EARLY_SSL_HS and notify every streams flagged as
SE_FL_WAIT_FOR_HS.

This patch must be backported up to 2.6, after a mandatory period of
observation. Note that it relies on the backport of the two previous
patches :
- MINOR: quic: notify connection layer on handshake completion
- BUG/MINOR: stream: unblock stream on wait-for-handshake completion
2024-10-16 11:51:35 +02:00
Amaury Denoyelle
73031e81cd BUG/MINOR: stream: unblock stream on wait-for-handshake completion
wait-for-handshake is an http-request action which permits to delay the
processing of content received as TLS early data. The action yields
as long as connection handshake is in progress. In the meantime, stconn
is flagged with SE_FL_WAIT_FOR_HS.

When the handshake is finished, MUX layer is responsible to woken up
SE_FL_WAIT_FOR_HS flagged stconn instances to restart the stream
processing. On sc_conn_process(), SE_FL_WAIT_FOR_HS flag is removed and
stream layer is woken up.

However, there may be a blocking after MUX notification. sc_conn_recv()
may return 0 due to no new data reception, which prevents
sc_conn_process() execution. The stream is thus blocked until its
timeout.

To fix this, checks in sc_conn_recv() about the handshake termination
condition. If true, explicitely returns 1 to ensure sc_conn_process()
will be executed.

Note that this bug is not reproducible due to various conditions related
to early data implementation in haproxy. Indeed, connection layer
instantiation is always delayed until SSL handshake completion, which
prevents the handling of early data as expected.

This fix will be necessary to implement wait-for-handshake support for
QUIC. As such, it must be backported with the next commit up to 2.6,
after a mandatory period of observation.
2024-10-16 11:44:31 +02:00
Amaury Denoyelle
5a5950e42d MINOR: quic: notify connection layer on handshake completion
Wake up connection layer on QUIC handshake completion via
quic_conn_io_cb. Select SUB_RETRY_RECV as this was previously unused by
QUIC MUX layer.

For the moment, QUIC MUX never subscribes for handshake completion.
However, this will be necessary for features such as the delaying of
early data forwarding via wait-for-handshake.

This patch will be necessary to implement wait-for-handshake support for
QUIC. As such, it must be backported with next commits up to 2.6,
after a mandatory period of observation.
2024-10-16 11:42:06 +02:00
Willy Tarreau
5091f90479 MINOR: activity/memprofile: always return "other" bin on NULL return address
It was found in a large "show profiling memory" output that a few entries
have a NULL return address, which causes confusion because this address
will be reused by the next new allocation caller, possibly resulting in
inconsistencies such as "free() ... pool=trash" which makes no sense. The
cause is in fact that the first caller had an entry->info pointing to the
trash pool from a p_alloc/p_free with a NULL return address, and the second
had a different type and reused that entry.

Let's make sure undecodable stacks causing an apparent NULL return address
all lead to the "other" bin.

While this is not exactly a bug, it would make sense to backport it to the
recent branches where the feature is used (probably at least as far as 2.8).
2024-10-15 08:12:34 +02:00
Willy Tarreau
93c9f19af7 REGTESTS: fix a reload race in abns_socket.vtc
This test issues a reload over the master CLI, but it is totally
possible that the master has not yet finished starting up the master
CLI when the command is issued, resulting in a failure. This was much
more visible on the new master-worker model, but definitely affects the
old one and could be the reason for this test to occasionally fail on
the CI.
2024-10-14 19:15:21 +02:00
William Lallemand
0302adf996 CI: cirrus-ci: bump FreeBSD image to 14-1
FreeBSD CI since to be broken for a while, try to upgrade the image to
the latest 14.1 version.
2024-10-14 14:28:26 +02:00
Willy Tarreau
e4cb0ad632 MINOR: mux-h2/traces: add buffer-related info to h2s and h2c
The traces currently don't contain any info about the amount of data
present in buffers, making it difficult to figure if an empty buffer
is the cause for not demuxing or if a full buffer is the cause for
not reading more data. Let's add them, with the head/tail info as
well.
2024-10-12 18:07:21 +02:00
Willy Tarreau
a8f907a459 MINOR: mux-h2/traces: add missing flags and proxy ID in traces
H2 traces are unusable to detect bugs most of the time because they miss
the h2c and h2s flags, as well as the proxy, which makes it very hard to
figure if the info comes from the client or the server as soon as two
layers are stacked. This commit adds these precious information as well
as the h2s's rx and tx windows.

This could be backported to a few recent branches, but the rx window
calculation will have to be replaced with the static value there.
2024-10-12 17:45:51 +02:00
Willy Tarreau
fcab647613 OPTIM: mux-h2: use tasklet_wakeup_after() in h2s_notify_recv()
This reduces the avg wakeup latency of sc_conn_io_cb() from 1900 to 51us.
The L2 cache misses from from 1.4 to 1.2 billion for 20k req. But the
perf is not better. Also there are situations where we must not perform
such wakeup, these may only be done from h2_io_cb, hence the test on the
next_tasklet pointer and its reset when leaving the function. In practice
all callers to h2s_close() or h2s_destroy() can reach that code, this
includes h2_detach, h2_snd_buf, h2_shut etc.

Another test with 40 concurrent connections, transferring 40k 1MB objects
at different concurrency levels from 1 to 80 also showed a 21% drop in L2
cache misses, and a 2% perf improvement:

Before:
   329,510,887,528  instructions
    50,907,966,181  branches
       843,515,912  branch-misses
     2,753,360,222  cache-misses
    19,306,172,474  L1-icache-load-misses
    17,321,132,742  L1-dcache-load-misses
       951,787,350  LLC-load-misses

      44.660469000 seconds user
      62.459354000 seconds sys

   => avg perf: 373 MB/s

After:
   331,310,219,157  instructions
    51,343,396,257  branches
       851,567,572  branch-misses
     2,183,369,149  cache-misses
    19,129,827,134  L1-icache-load-misses
    17,441,877,512  L1-dcache-load-misses
       906,923,115  LLC-load-misses

      42.795458000 seconds user
      62.277983000 seconds sys

   => avg perf: 380 MB/s

With small requests, it's the L1 and L3 cache misses which reduced by
3% and 7% respectively, and the performance went up by 3%.
2024-10-12 17:17:51 +02:00
Willy Tarreau
04ce6536e1 OPTIM: mux-h2: try to continue reading after demuxing when useful
When we stop demuxing in the middle of a frame, we know that there are
other data following. The demux buffer is small and unique, but now we
have rxbufs, so after h2_process_demux() is left, the dbuf is almost
empty and has room to be delivered into another rxbuf.

Let's implement a short loop with a counter and a few conditions around
the demux call. We limit the number of turns to the number of available
rxbufs and no more than 12, since it shows good performance, and the
wakeup is only called once. This has shown a nice 12-20% bandwidth gain
on backend-side H2 transferring 1MB-large objects, and does not affect
the rest (headers, control etc). The number of wakeup calls was divided
by 5 to 8, which is also a nice improvement. The counter is limited to
make sure we don't add processing latency. Tests were run to find the
optimal limit, and it turns out that 16 is just slightly better, but not
worth the +33% increase in peak processing latency.

The h2_process_demux() function just doens't call the wakeup function
anymore, and solely focuses on transferring from dbuf to rxbuf.

Practical measurement: test with h2load producing 4 concurrent connections
with 10 concurrent streams each, downloading 1MB objects (20k total) via
two layers of haproxy stacked, reaching httpterm over H1 (numbers are total
for the 2 h2 front and 1 h2 back). All on a single thread.

Before: 549-553 MB/s (on h2load)
  function    calls  cpu_tot  cpu_avg
  h2_io_cb  2562340  8.157s   3.183us <- h2c_restart_reading@src/mux_h2.c:957 tasklet_wakeup
  h2_io_cb    30109  840.9ms  27.93us <- sock_conn_iocb@src/sock.c:1007 tasklet_wakeup
  h2_io_cb    16105  106.4ms  6.607us <- ssl_sock_io_cb@src/ssl_sock.c:5721 tasklet_wakeup
  h2_io_cb        1  11.75us  11.75us <- sock_conn_iocb@src/sock.c:986 tasklet_wakeup
  h2_io_cb  2608555  9.104s   3.490us --total--

  perf stat:
   153,117,996,214 instructions                             (71.41%)
    22,919,659,027 branches       # 14.97% of inst          (71.41%)
       384,009,600 branch-misses  #  1.68% of all branches  (71.42%)
        44,052,220 cache-misses          # 1 inst / 3476    (71.44%)
     9,819,232,047 L1-icache-load-misses # 6.4% of inst     (71.45%)
     8,426,410,306 L1-dcache-load-misses # 5.5% of inst     (57.15%)
        10,951,949 LLC-load-misses       # 1 inst / 13982   (57.13%)

      12.372600000 seconds user
      23.629506000 seconds sys

After: 660 MB/s (+20%)
  function    calls  cpu_tot  cpu_avg
  h2_io_cb   244502  4.410s   18.04us <- h2c_restart_reading@src/mux_h2.c:957 tasklet_wakeup
  h2_io_cb    42107  1.062s   25.22us <- sock_conn_iocb@src/sock.c:1007 tasklet_wakeup
  h2_io_cb    13703  106.3ms  7.758us <- ssl_sock_io_cb@src/ssl_sock.c:5721 tasklet_wakeup
  h2_io_cb        1  13.74us  13.74us <- sock_conn_iocb@src/sock.c:986 tasklet_wakeup
  h2_io_cb   300313  5.578s   18.57us --total--

  perf stat:
   126,840,441,876 instructions                             (71.40%)
    17,576,059,236 branches       # 13.86% of inst          (71.40%)
       274,136,753 branch-misses  #  1.56% of all branches  (71.42%)
        30,413,562 cache-misses          # 1 inst / 4170    (71.45%)
     6,665,036,203 L1-icache-load-misses # 5.25% of inst    (71.46%)
     7,519,037,097 L1-dcache-load-misses # 5.9% of inst     (57.15%)
         6,702,411 LLC-load-misses       # 1 inst / 18925   (57.12%)

      10.490097000 seconds user
      19.212515000 seconds sys

It's also interesting to see that less total time is spent in these
functions, clearly indicating that the cost of interrupted processing,
and the extraneous cache misses come into play at some point. Indeed,
after the change, the number of instructions went down by 17.2%, while
the L2 cache misses dropped by 31% and the L3 cache misses by 39%!
2024-10-12 16:38:36 +02:00
Willy Tarreau
9fbc01710a OPTIM: mux-h2: make h2_send() report more accurate wake up conditions
h2_send() used to report non-zero every time any data were sent, and
this was used from h2_snd_buf() or h2_done_ff() to trigger a wakeup,
which possibly can do nothing. Restricting this wakeup to either a
successful send() combined with the ability to demux, or an error.

Doing this makes the number of h2_io_cb() wakeups drop from 422k to
245k for 1000 1MB objects delivered over 100 streams between two H2
proxies, without any behavior change nor performance change. In
practice, most send() calls do not result in a wakeup anymore but
synchronous errors still do.

A local test downloading 10k 1MB objects from an H1 server with a single
connection shows this change:

   before      after    caller
   1547        1467     h2_process_demux()
   2138           0     h2_done_ff()       <---
     38        1453     ssl_sock_io_cb()   <---
     18           0     h2_snd_buf()
      1           1     h2_init()
   3742        2921     -- total --

In practice the ssl_sock_io_cb() wakeups are those notifying about
SUB_RETRY_RECV, which are not accounted for when h2_done_ff() performs
the wakeup because the tasklet is already queued (a counter placed
there shows that it's nonetheless called). So there's no transfer and
h2_done_ff() was only hiding the other one.

Another test involving 4 connections with 10 concurrent streams each
and 20000 1MB objects total shows a total disparition of the wakeups
from h2_snd_buf and h2_done_ff, which used to account together for
50% of the wakeups, resulting in effectively halving the number of
wakeups which, based on their avg process time, were not doing
anything:

Before:
  function   calls     cpu_tot   cpu_avg
  h2_io_cb   2571208   7.406s    2.880us <- h2c_restart_reading@src/mux_h2.c:940 tasklet_wakeup
  h2_io_cb   2536949   251.4ms   99.00ns <- h2_snd_buf@src/mux_h2.c:7573 tasklet_wakeup ###
  h2_io_cb     41100   5.622ms   136.0ns <- h2_done_ff@src/mux_h2.c:7779 tasklet_wakeup ###
  h2_io_cb     38979   852.8ms   21.88us <- sock_conn_iocb@src/sock.c:1007 tasklet_wakeup
  h2_io_cb     12519   90.28ms   7.211us <- ssl_sock_io_cb@src/ssl_sock.c:5721 tasklet_wakeup
  h2_io_cb         1   13.81us   13.81us <- sock_conn_iocb@src/sock.c:986 tasklet_wakeup
  h2_io_cb   5200756   8.606s    1.654us --total--

After:
  h2_io_cb   2562340   8.157s    3.183us <- h2c_restart_reading@src/mux_h2.c:957 tasklet_wakeup
  h2_io_cb     30109   840.9ms   27.93us <- sock_conn_iocb@src/sock.c:1007 tasklet_wakeup
  h2_io_cb     16105   106.4ms   6.607us <- ssl_sock_io_cb@src/ssl_sock.c:5721 tasklet_wakeup
  h2_io_cb         1   11.75us   11.75us <- sock_conn_iocb@src/sock.c:986 tasklet_wakeup
  h2_io_cb   2608555   9.104s    3.490us --total--
2024-10-12 16:38:36 +02:00
Willy Tarreau
633c41c621 MEDIUM: mux-h2: rework h2_restart_reading() to differentiate recv and demux
From the beginning, h2_restart_reading() has always been confusing because
it decides whether or not to wake the tasklet handler up or not. This
tasklet handler does two things, one is receiving from the socket to the
demux buf, and one is demuxing from the demux buf to the streams' rxbufs.

The conditions are governed by h2_recv_allowed(), which is also called at
a few places to decide whether or not to actually receive from the socket.
It starts to be visible that this leaves some difficulties regarding what
to do with possibly pending data.

In 2.0 with commit 3ca18bf0b ("BUG/MEDIUM: h2: Don't attempt to recv from
h2_process_demux if we subscribed."), we even had to address a special
case where it was possibly to endlessly wake up because the conditions
would rely on the demux buffer's contents, though the solution consisted
in passing a flag to decide whether or not to consider the buffer's
contents.

In 2.5 commit b5f7b5296 ("BUG/MEDIUM: mux-h2: Handle remaining read0 cases
on partial frames") introduced a new flag H2_CF_DEM_SHORT_READ which
indicates that the demux had to stop in the middle of a frame and cannot
make progress without more data. More adaptations later came in based on
this but this actually reflected exactly what was needed to solve this
painful situation: a state indicating whether to receive or parse.

Now's about time to definitely address this by reworking h2_restart_reading()
to check two completely independent things:
  - the ability to receive more data into the demux buffer, which is
    based on its allocation/fill state and the socket's errors
  - the ability to demux such data, which is based on the presence of
    enough data (i.e. no stuck short read), and ability to find an rx
    buf to continue the processing.

Now the conditions are much more understandable, and it's also visible
that the consider_buffer argument, whose value was not trivial for
callers, is not used anymore.

Tests stacking two layers of H2 show strictly no change to the wakeup
cause distributions nor counts.
2024-10-12 16:38:36 +02:00
Willy Tarreau
e057f8367c DOC: design-thoughts: add diagrams illustrating an rx win groth
Let's just see on a diagram how the receiver can detect that the
window is large enough for the remote sender to fill the link. Here
it seems that a first criterion is that data are accumulating in
the rxbuf, indicating that the next hop doesn't consume them fast
enough. On the diagram it's visible when blue arrows (incoming data)
are more frequent than the magenta ones on average (outgoing data),
which happens when silence moments are less frequent and don't allow
the reader to catch up. It's also visible that there are two phases
alternating in the transfer:
  - measure round trip time (i.e. how long it takes to restart
    sending after a WU was sent after a long silence)

  - measure the lowest rxbuf size during the previous round trip

It's worth noting that a window size change only has *observable* effect
after two RTT: the first RTT is to restart sending (opening or enlarging
the window), the second RTT to measure the lowest rxbuf size over the
period.

By turning the advertised window into an offset and comparing it to
the received quantity, it's possible to measure the RTT of the whole
chain (including the client possibly producing the data). Note that
when multiple streams compete for BW this can become tricky. Limiting
the window to available buffers and counting the number of sending
streams on a connection could work (i.e. split total buffers into
1+#senders, first one being used for tx).
2024-10-12 16:38:36 +02:00
Willy Tarreau
0fd66703c2 MEDIUM: mux-h2: change the default initial window to 16kB
Now that we're using all available rx buffers for transfers, there's
no point anymore in advertising more than the minimum value we can
safely buffer. Let's be conservative and only rely on the dynamic
buffers to improve speed beyond the configured value, and make sure
than many streams will no longer cause unfairness.

Interestingly, the total number of wakeups has further shrunk down, but
with a different distribution. From 128k for 1000 1M transfers, it went
down to 119k, with 96k from restart_reading, 10k from done_ff and 2.6k
from snd_buf. done_ff went up by 30% and restart_reading went down by
30%.
2024-10-12 16:38:26 +02:00
Willy Tarreau
1ed9d37c88 MINOR: mux-h2: add tune.h2.be.rxbuf and tune.h2.fe.rxbuf global settings
These settings allow to change the total buffer size allocated to the
backend and frontend respectively. This way it's no longer necessary to
play with tune.bufsize nor increase the number of streams to benefit from
more buffers.

Setting tune.h2.fe.rxbuf to 4m to match a sender's max tcp_wmem resulted
in 257 Mbps for a single stream at 103ms vs 121 Mbps default (or 5.1 Mbps
with a single buffer and 64kB window).
2024-10-12 16:29:16 +02:00
Willy Tarreau
e018d9a0cf MAJOR: mux-h2: make the rxbuf allocation algorithm a bit smarter
Without using bandwidth estimates, we can already use up to the number
of allocatable rxbufs and share them evenly between receiving streams.
In practice we reserve one buffer for any non-receiving stream, plus
1 per 8 possible new streams, and divide the rest between the number
of receiving streams.

Finally, for front streams, this is rounded up to the buffer size while
for back streams we round it down. The rationale here is that front to
back is very fast to flush and slow to refill so we want to optimise
upload bandwidth regardless of the number of streams, while it's the
opposite in the other way so we try to minimize HoL.

That shows good results with a single stream being able to send at 121
Mbps at 103ms using 1.4 MB buffer with default settings, or 8 streams
sharing the bandwidth at 180kB each. Previously the limit was approx
5.1 Mbps per stream.

It also enables better sharing of backend connections: a slow (100 Mbps)
and a fast (1 Gbps) clients were both downloading 2 100MB files each over
a shared H2 connection. The fast one used to show 6.86 to 20.74s with an
avg of 11.45s and an stddev of 5.81s before the patch, and went to a
much more respectable 6.82 to 7.73s with 7.08s avg and 0.336s stddev.

We don't try to increase the window past the remaining content length.
First, this is pointless (though harmless), but in addition it causes
needless emission of WINDOW_UPDATE frames on small uploads that are
smaller than a window, and beyond being useless, it upsets vtest which
expects an RST on some tests. The scheduling is not reliable enough to
insert an expect for a window update first, so in the end wich that
extra check we save a few useless frames on small uploads and please
vtest.

A new setting should be added to allow to increase the number of buffers
without having to change the number of streams. At this point it's not
done.
2024-10-12 16:29:16 +02:00
Willy Tarreau
3816c38601 MAJOR: mux-h2: permit a stream to allocate as many buffers as desired
Now we don't enforce allocation limits in h2s_get_rxbuf(), since there
is no benefit in not processing pending data, it would still cause HoL
for no saving. The only reason for not allocating is if there are no
buffers available for the connection.

In theory this should not change anything except that it excerts code
paths that support reallocating multiple buffers, which could possibly
uncover a sleeping bug. This is why it's placed in a separate commit.

And one observation worth noting is that it almost cut in half the number
of iocb wakeups: for 1000 1MB transfers over 100 concurrent streams of a
single connection, we used to observe 208k wakeups (110 from restart_reading,
80 from snd_buf, 11 from done_ff), and now we're observing 128k (113 from
restart_reading, 2.4 from snd_buf, 6.9k from done_ff), which seems to
indicate that pretty often the demuxing was blocked on a buffer full due
to the default advertised window of 64k.
2024-10-12 16:29:16 +02:00
Willy Tarreau
4eb3ff1d3b MAJOR: mux-h2: make streams use the connection's buffers
For now it seems to work as before, and even when artificially inflating
the number of allocatable buffers per stream. The number of allocated
slots is always the same as the max number of streams, which guarantees
that each stream will find one buffer. we only grant one buffer per
stream at this point, since the goal was to replace the existing single
rxbuf.

A new demux blocking flag, H2_CF_DEM_RXBUF, was added to indicate
a failure to get an rxbuf slot from the connection. It was lightly
tested (by forcing bl_init() to a lower number of buffers). It is not
yet certain whether it's more useful to have a new flag or to reuse
the existing H2_CF_DEM_SFULL which indicates the rxbuf is full,
but at least the new flag more accurately translates the condition,
that may make a difference in the future. However, given that when
RXBUF is set, most of the time it results in a failure to find more
room to demux and it sets SFULL, for now we have to always clear
SFULL when clearing RXBUF as well. This means that most of the time
we'll see 3 combinations:
  - none: everything's OK
  - SFULL: the unique rx buffer is full
  - RXBUF || (RXBUF|SFULL): cannot allocate more entries

Note that we need to be super careful in h2_frt_transfer_data() because
the htx_free_data_space() function doesn't guarantee that the room is
usable, so htx_add_data() may still fail despite an apparent room. For
this reason, h2_frt_transfer_data() maintains a "full" flag to indicate
that a transfer attempt failed and that a new buffer is required.
2024-10-12 16:29:16 +02:00
Willy Tarreau
6279cbc9e9 MINOR: mux-h2: clear up H2_CF_DEM_DFULL and H2_CF_DEM_SHORT_READ ambiguity
Since commit 485da0b05 ("BUG/MEDIUM: mux_h2: Handle others remaining
read0 cases on partial frames"), H2_CF_DEM_SHORT_READ is set when there
is no blocking flags. However, it checks H2_CF_DEM_BLOCK_ANY which does
not include H2_CF_DEM_DFULL. This results in many cases where both
H2_CF_DEM_DFULL and H2_CF_DEM_SHORT_READ are set together, which makes
no sense, since one says the demux buffer is full while the other one
says an incomplete read was done. This doesn't permit to properly
decide whether to restart reading or processing.

Let's make sure to clear DFULL in h2_process_demux() whenever we
consume incoming data from the dbuf, and check for DFULL before
setting SHORT_READ.

This could probably be considered as a bug fix but it's hard to say if
it has any impact on the current code, probably at worst it might cause
a few useless wakeups, so until there's any proof that it needs to be
backported, better not do it.
2024-10-12 16:29:16 +02:00
Willy Tarreau
b74bedf157 MINOR: mux-h2: simplify the wake up code in h2_rcv_buf()
The code used to decide when to restart reading is far from being trivial
and will cause trouble after the forthcoming changes: it checks if the
current stream is the same that is being demuxed, and only if so, wakes
the demux to restart reading. Once streams will start to use multiple
buffers, this condition will make no sense anymore. Actually the real
reason is split into two steps:
  - detect if the demux is currently blocked on the current stream, and
    if so remove SFULL
  - detect if any demux blocking flags were removed during the operations,
    and if so, wake demuxing.

For now this doesn't change anything.
2024-10-12 16:29:16 +02:00
Willy Tarreau
a0ed92f3dd MINOR: mux-h2: simplify the exit code in h2_rcv_buf()
The code used to decide what to tell to the upper layer and when to free
the rxbuf is a bit convoluted and difficult to adapt to dynamic rxbufs.
We first need to deal with memory management (b_free) and only then to
decide what to report upwards. Right now it does it the other way around.

This should not change anything.
2024-10-12 16:29:16 +02:00
Willy Tarreau
3b5ac2b553 MINOR: mux-h2: move H2_CF_WAIT_IN_LIST flag away from the demux flags
It's not convenient to have this flag in the middle of the demux flags,
it easily hides other ones that need to be added. Let's move it after
the other ones.
2024-10-12 16:29:16 +02:00
Willy Tarreau
8cf418811d MINOR: mux-h2: add rxbuf head/tail/count management for h2s
Now the h2s get their rx_head, rx_tail and rx_count associated with the
shared rxbufs. A few functions are provided to manipulate all this,
essentially allocate/release a buffer for the stream, return a buffer
pointer to the head/tail, counting allocated buffers for the stream
and reporting if a stream may still allocate.

For now this code is not used.
2024-10-12 16:29:16 +02:00
Willy Tarreau
a891534bfd MINOR: mux-h2: allocate the array of shared rx bufs in the h2c
In preparation for having a shared list of rx bufs, we're now allocating
the array of shared rx bufs in the h2c. The pool is created at the max
size between the front and back max streams for now, and the array is not
used yet.
2024-10-12 16:29:16 +02:00
Willy Tarreau
721ea5b06c MINOR: mux-h2: count within a connection, how many streams are receiving data
A stream is receiving data from after the HEADERS frame missing END_STREAM,
to the end of the stream or HREM (the presence of END_STREAM). We're now
adding a flag to the stream that indicates this state, as well as a counter
in the connection of streams currently receiving data. The purpose will be
to gauge at any instant the number of streams that might have to share the
available bandwidth and buffers count in order not to allocate too much flow
control to any single stream. For now the counter is kept up to date, and is
reported in "show fd".
2024-10-12 16:29:16 +02:00
Willy Tarreau
c9275084bc MEDIUM: mux-h2: start to introduce the window size in the offset calculation
Instead of incrementing the last_max_ofs by the amount of received bytes,
we now start from the new current offset to which we add the static window
size. The result is exactly the same but it prepares the code to use a
window size combined with an offset instead of just refilling the budget
from what was received.

It was even verified that changing h2_fe_settings_initial_window_size in
the middle of a transfer using gdb does indeed allow the transfer speed
to adapt accordingly.
2024-10-12 16:29:16 +02:00