Commit Graph

22309 Commits

Author SHA1 Message Date
William Lallemand
ef943c186d REGTESTS: update the ocsp-update tests
Update the ocsp-update tests for the recent changes:

- "tune.ssl.ocsp-update.mode" was renamed iin "ocsp-update.mode"
2024-05-17 14:50:00 +02:00
William Lallemand
ee58fac1b4 MINOR: ssl: rename tune.ssl.ocsp-update.mode in ocsp-update.mode
Since the ocsp-update is not strictly a tuning of the SSL stack, but a
feature of its own, lets rename the option.

The option was also missing from the index.
2024-05-17 14:50:00 +02:00
Willy Tarreau
ea3b89952d BUILD: stick-tables: better mark the stktable_data as 32-bit aligned
Aurélien reported that clang's build was broken by the recent fix
845fb846c7 ("BUG/MEDIUM: stick-tables: properly mark stktable_data
as packed"), because it now wants to use a helper for some atomic
ops (to increment std_t_uint). While this makes no sense to do
something that slow on modern architectures like x86 and arm64 which
are fine with unaligned accesses, we actually we can simply mark the
struct as aligned to its smallest element which is 32-bit (but still
packed). With this, it was verified that it is enough for clang to
see that its 32-bit operations will always be aligned, while making
64-bit operations safe on 64-bit platforms that do not support unaligned
accesses.

This should be backported wherever the patch above is backported.
2024-05-17 11:00:45 +02:00
Amaury Denoyelle
0d35f8d918 MINOR: h3: report glitch on RFC violation
Increment glitch connection counter on every HTTP/3 or QPACK errors
which is a violation of the specification. This could be useful to get
rid early of bogus clients.
2024-05-16 10:58:54 +02:00
Amaury Denoyelle
216f70f989 MINOR: mux-quic: support glitches
Implement basic support for glitches on QUIC multiplexer. This is mostly
identical too glitches for HTTP/2.

A new configuration option named tune.quic.frontend.glitches-threshold
is defined to limit the number of glitches on a connection before
closing it.

Glitches counter is incremented via qcc_report_glitch(). A new
qcc_app_ops callback <report_susp> is defined. On threshold reaching, it
allows to set an application error code to close the connection. For
HTTP/3, value H3_EXCESSIVE_LOAD is returned. If not defined, default
code INTERNAL_ERROR is used.

For the moment, no glitch are reported for QUIC or HTTP/3 usage. This
will be added in future patches as needed.
2024-05-16 10:58:20 +02:00
Amaury Denoyelle
a6993a669b MINOR: h3: adjust error reporting on receive
This commit is the second step to simplify HTTP/3 error management. This
times it deals with receive side on h3_rcv_buf().

Various internal HTTP/3 to HTX conversion functions does not set
H3_INTERNAL_ERROR on h3c err anymore. Only standard error code are set.
For every errors, both internal and protocol ones, a negative value is
returned. This ensure that h3_rcv_buf() looping is interrupted. This
function will then set H3_INTERNAL_ERROR only if no standard error is
registered via h3c or h3s.

Along the previous commit, this should better reflect internal errors
from protocol ones caused by a faulty client.
2024-05-16 10:31:17 +02:00
Amaury Denoyelle
079d13f73f MINOR: h3: adjust error reporting on sending
It's currently difficult to differentiate HTTP/3 standard protocol
violation from internal issues which use solely H3_INTERNAL_ERROR code.
This patch aims is the first step to simplify this. The objective is to
reduce H3_INTERNAL_ERROR. <err> field of h3c should be reserved
exclusively to other values.

Simplify error management in sending via h3_snd_buf(). Sending side is
straightforward as only internal errors can be encountered. Do not
manually set h3c.err to H3_INTERNAL_ERROR in HTX to HTTP/3 various
conversion function. Instead, just return a negative value which is
enough to break h3_snd_buf() loop. H3_INTERNAL_ERROR is thus positionned
on a single location in this function for all sending operations.
2024-05-16 10:31:17 +02:00
Amaury Denoyelle
e094412337 MINOR: h3/qpack: adjust naming for errors
Rename enum values used for HTTP/3 and QPACK RFC defined codes. First
uses a prefix H3_ERR_* which serves as identifier between them. Also
separate QPACK values in a new dedicated enum qpack_err. This is deemed
cleaner.
2024-05-16 10:31:17 +02:00
Amaury Denoyelle
2dabcf30be MINOR: qpack: prepare error renaming
There is two distinct enums both related to QPACK error management. The
first one is dedicated to RFC defined code. The other one is a set of
internal values returned by qpack_decode_fs(). There has been issues
discovered recently due to the confusion between them.

Rename internal values with the prefix QPACK_RET_*. The older name
QPACK_ERR_* will be used in a future commit for the first enum.
2024-05-16 10:31:17 +02:00
Christopher Faulet
25bcdb1d95 BUG/MAJOR: h1: Be stricter on request target validation during message parsing
As stated in issue #2565, checks on the request target during H1 message
parsing are not good enough. Invalid paths, not starting by a slash are in
fact parsed as authorities. The same error is repeated at the sample fetch
level. This last point is annoying because routing rules may be fooled. It
is also an issue when the URI or the Host header are updated.

Because the error is repeated at different places, it must be fixed. We
cannot be lax by arguing it is the server's job to accept or reject invalid
request targets. With this patch, we strengthen the checks performed on the
request target during H1 parsing. Idea is to reject invalid requests at this
step to be sure it is safe to manipulate the path or the authority at other
places.

So now, the asterisk-form is only allowed for OPTIONS and OTHER methods.
This last point was added to not reject the H2 preface. In addition, we take
care to have only one asterisk and nothing more. For the CONNECT method, we
take care to have a valid authority-form. All other form are rejected. The
authority-form is now only supported for CONNECT method. No specific check
is performed on the origin-form (except for the CONNECT method). For the
absolute-form, we take care to have a scheme and a valid authority.

These checks are not perfect but should be good enough to properly identify
each part of the request target for a relative small cost. But, it is a
breaking change. Some requests are now be rejected while they was not on
older versions. However, nowadays, it is most probably not an issue.  If it
turns out it's really an issue for legitimate use-cases, an option would be
to supports these kinds of requests when the "accept-invalid-http-request"
option is set, with the consequence of seeing some sample fetches having an
unexpected behavior.

This patch should fix the issue #2665. It MUST NOT be backported. First
because it is a breaking change. And then because by avoiding backporting
it, it remains possible to relax the parsing with the
"accept-invalid-http-request" option.
2024-05-15 21:20:37 +02:00
Christopher Faulet
d3d9d83f03 BUG/MEDIUM: h1: Reject CONNECT request if the target has a scheme
The target of a CONNECT request must not have scheme. However, this was not
checked during the message parsing. It is now rejected.

This patch may be backported as far as 2.4.
2024-05-15 21:20:37 +02:00
Christopher Faulet
d724b0d147 BUG/MINOR: h1: Check authority for non-CONNECT methods only if a scheme is found
When a non-CONNECT H1 request is parsed, the authority is compared to the
host header value, to validate that they are the same. However there is an
issue here when a relative path is used (not begining with a '/'). In this
case, the path is considered as the authority and will be erroneously
compared to the host header value. It is observable with this kind of
request:

  GET admin HTTP/1.1
  Host: www.mysite.com

In this case "admin" is parsed as an authority while it is in fact a path.
At this step, it is not a big deal because it just happens on the very first
checks on the message during the parsing. However, the same happens when the
authority is updated. This will be fixed in another commit

Note this kind of request is invalid because the path does not start with a
'/'. But, till now, HAProxy does not reject it.

This patch is related to issue #2565. It must be backported as far as 2.4.
2024-05-15 21:20:37 +02:00
Willy Tarreau
821a04377d BUG/MEDIUM: muxes: enforce buf_wait check in takeover()
The ->takeover() is quite tricky. It didn't take care of the possibility
that the original thread's connection handler had been woken up to handle
an event (e.g. read0), failed to get a buffer, registered against its own
thread's buffer_wait queue and left the connection in an idle state.

A new thread could then come by, perform a takeover(), and when a buffer
was available, the new thread's tasklet would be woken up by the old one
via *_buf_available(), causing all sort of problems. These problems are
easy to reproduce, by running with shared backend connections and few
buffers (tune.buffers.limit=20, 8 threads, 500 connections, transfer
64kB objects and wait 2-5s for a crash to appear).

A first estimated solution consisted in removing the connection from the
idle list but it turns out that it would be worse for the delete stuff
(the connection no longer appearing as idle, making it impossible to find
it in order to close it). Also, idle counts wouldn't match anymore the
list's state, and the special case of private connections could be
difficult to handle as the connection could be forcefully re-added to the
idle list after allocation despite being private.

After multiple attempts to address the problem in various ways, it appears
that the only reliable solution for now (without starting to turn many
lists to mt_lists) is to have the takeover() function handle the buf_wait
detection or unregistration itself:

  - when doing a regular takeover aiming at finding an idle connection
    for a new request, connections that are blocked in a buffer_wait
    queue are quite rare and not interesting at all (since not immediately
    usable), so skipping them is sufficient. For this we detect that the
    desired connection belongs to a buffer_wait list by checking its
    buf_wait.list element. Note that this check is *not* thread-safe! The
    LIST_DEL_INIT() is performed by __offer_buffers() after the callback
    was called. But this is sufficient as it is now because the only way
    for the element to be seen as not in a list is after the element was
    last touched by __offer_buffers(), so the situation for this connection
    will not change in a different way later.

  - when doing a server delete, we're running under thread isolation.
    The connection might get taken over to be killed. The only trick is
    that private connections not belonging to any idle list may also
    experience this, and in this case even the idle_conns lock will not
    offer any protection against anything. But since we're run under
    thread isolation, we're certain not to compete with the other thread,
    so it's safe to directly unregister the connection from its owner
    thread. Normally this is already handled by conn_release() in
    cli_parse_delete_server(), which calls mux->destroy(), but this would
    actually update the current thread's queue instead of the origin
    thread's, thus we do need to perform an explicit dequeue before
    completing the takeover.

With this, the problem now looks solved for HTTP/1, HTTP/2 and FCGI,
though extensive tests were essentially run on HTTP/1 and HTTP/2.

While the problem has been there for a very long time, there should be
no reason to backport it since buffer_wait didn't practically work
before 3.0-dev and the process used to freeze hard very quickly before
we'd even have a chance to meet that race.
2024-05-15 19:37:12 +02:00
Willy Tarreau
b0349cf2de MINOR: dynbuf: provide a b_dequeue() variant for multi-thread
In order to forcefully unregister a buffer waiter during an inter-thread
takeover under isolation, we'll need to that the function works without
th_ctx but the target thread's ctx instead. Let's implement this by
passing the target thread as an argument. Now b_dequeue() simply calls
this one with tid. It's OK it's not on that critical a path, especially
since the list has been checked for existence before performing the call.
2024-05-15 19:37:12 +02:00
Willy Tarreau
edb99e296d BUG/MINOR: ssl_sock: fix xprt_set_used() to properly clear the TASK_F_USR1 bit
In 2.4-dev8 with commit 5c7086f6b0 ("MEDIUM: connection: protect idle
conn lists with locks"), the idle conns list started to be protected
using the lock for takeover, and the SSL layer used to always take
that lock. Later in 2.4-dev11, with commit 4149168255 ("MEDIUM: ssl:
implement xprt_set_used and xprt_set_idle to relax context checks"), we
decided to relax this lock using TASK_F_USR1 just as is done in muxes.

However the xprt_set_used() call, that's supposed to clear the flag,
visibly suffered from a copy-paste and kept the OR operation instead of
the AND, resulting in the flag never being released, so that SSL on the
backend continues to take the lock on each and every I/O access even when
the connection is not idle.

The effect is only a reduced performance. This could be backported, but
given the non-zero risk of triggering another bug somewhere, it would
be prudent to wait for this fix to be sufficiently tested in new
versions first.
2024-05-15 19:37:12 +02:00
Willy Tarreau
b6ed749adc SCRIPTS: run-regtests: fix a few occurrences of extended regexes
Running run-regtests on OpenBSD failed to identify haproxy version and
the various build options because the backslash is not recognized in
grep expressions. One must only use -E for the extended regexes and
not use the slash.
2024-05-15 19:33:45 +02:00
Willy Tarreau
845fb846c7 BUG/MEDIUM: stick-tables: properly mark stktable_data as packed
The stktable_data union is made of types of varying sizes, and depending
on which types are stored in a table, some offsets might not necessarily
be aligned. This results in a bus error for certain regtests (e.g.
lb-services) on MIPS64. This bug may impact MIPS64, SPARC64, armv7 when
accessing a 64-bit counter (e.g. bytes) and depending on how the compiler
emitted the operation, and cause a trap that's emulated by the OS on RISCV
(heavy cost). x86_64 and armv8 are not affected at all.

Let's properly mark the struct with __attribute__((packed)) so that the
compiler emits the suitable unaligned-compatible instructions when
accessing the fields.

This should be backported to all versions where it applies.
2024-05-15 19:03:18 +02:00
Willy Tarreau
276cdc11e8 BUG/MEDIUM: htx: mark htx_sl as packed since it may be realigned
A test on MIPS64 revealed that the following reg tests would all
fail at the same place in htx_replace_stline() when updating
parts of the request line:
  reg-tests/cache/if-modified-since.vtc
  reg-tests/http-rules/h1or2_to_h1c.vtc
  reg-tests/http-rules/http_after_response.vtc
  reg-tests/http-rules/normalize_uri.vtc
  reg-tests/http-rules/path_and_pathq.vtc

While the status line is normally aligned since it's the first block
of the HTX, it may become unaligned once replaced. The problem is, it
is a structure which contains some u16 and u32, and dereferencing them
on machines not natively supporting unaligned accesses makes them crash
or handle crap. Typically, MIPS/MIPS64/SPARC will crash, ARMv5 will
either crash or (more likely) return swapped values and do crap, and
RISCV will trap and turn to slow emulation.

We can assign the htx_sl struct the packed attribute, but then this
also causes the ints to fill the 2-bytes gap before them, always causing
unaligned accesses for this part on such machines. The patch does a bit
better, by explicitly filling this two-bytes hole, and packing the
struct.

This should be backported to all versions.
2024-05-15 19:03:17 +02:00
Amaury Denoyelle
86aafd0236 BUG/MINOR: qpack: fix error code reported on QPACK decoding failure
qpack_decode_fs() is used to decode QPACK field section on HTTP/3
headers parsing. Its return value is incoherent as it returns either
QPACK_DECOMPRESSION_FAILED defined in RFC 9204 or any other internal
values defined in qpack-dec.h. On failure, such return code is reused by
HTTP/3 layer to be reported via a CONNECTION_CLOSE frame. This is
incorrect if an internal error values was reported as it is not defined
by any specification.

Fir return values of qpack_decode_fs() in two ways. Firstly, fix invalid
usages of QPACK_DECOMPRESSION_FAILED when decoded content is too large
for the correct internal error QPACK_ERR_TOO_LARGE.

Secondly, adjust qpack_decode_fs() API to only returns internal code
values. A new internal enum QPACK_ERR_DECOMP is defined to replace
QPACK_DECOMPRESSION_FAILED. Caller is responsible to convert it to a
suitable error value. For other internal values, H3_INTERNAL_ERROR is
used. This is done through a set of convert functions.

This should be backported up to 2.6. Note that trailers are not
supported in 2.6 so chunk related to h3_trailers_to_htx() can be safely
skipped.
2024-05-15 16:07:15 +02:00
Amaury Denoyelle
4295dd21bd BUG/MINOR: mux-quic: fix error code on shutdown for non HTTP/3
qcc_shutdown() is called whenever the connection must be closed. If
application protocol defined its owned shutdown callback, it is invoked
to use the correct error code. Else transport error code NO_ERROR is
used.

A bug occurs in the latter case as NO_ERROR is used with quic_err_app()
which is reserved for application errro codes. This will trigger the
emission of a CONNECTION_CLOSE of type 0x1d (Application) instead of
0x1c (Transport).

This bug is considered minor as it does not impact QUIC with HTTP/3. It
may only be visible when using experimental HTTP/0.9 protocol.

This should be backported up to 2.6. For 2.6, patch must be completed
rewritten due to code differences. Here is the change to apply :

  diff --git a/src/mux_quic.c b/src/mux_quic.c
  index 26fb70ddf..c48f82e27 100644
  --- a/src/mux_quic.c
  +++ b/src/mux_quic.c
  @@ -1918,7 +1918,9 @@ static void qc_release(struct qcc *qcc)
                          qc_send(qcc);
                  }
                  else {
  -                       qcc_emit_cc_app(qcc, QC_ERR_NO_ERROR, 0);
  +                       /* Duplicate from qcc_emit_cc_app() for Transport error code. */
  +                       if (!(qcc->conn->handle.qc->flags & QUIC_FL_CONN_IMMEDIATE_CLOSE))
  +                               qcc->conn->handle.qc->err = quic_err_transport(QC_ERR_NO_ERROR);
                  }
          }
2024-05-15 16:03:01 +02:00
Amaury Denoyelle
412f1eeb89 BUG/MEDIUM: server: clear purgeable conns before server deletion
Since the following commit, idle connections are cleared before a server
is deleted. This is better than blocking server deletion due to inactive
connections :

  6e0afb2e27
  MEDIUM: server: close idle conn on server deletion

A BUG_ON() has been added to ensure that server idle conn counter is nul
after these connections are removed. However, Willy managed to trigger
it easily by repeatedly and randomly delete servers accross a
single-thread haproxy using a server-template with 1000 instances. In
parallel, a h1load client is executed to generate traffic.

This BUG_ON() reflected that it some connections referencing the server
targetted for deletion remained, even though idle server list is empty.
In fact, this is caused by connections scheduled for purging. These
connections are moved from idle server list to a global toremove_list
while still being accounted by the server.

A first approach could be to decrement server idle counter while moving
connection to the purge list. However, this is functionnaly incorrect as
these purgeable connections still reference the server and it could
cause a crash if cleared after it.

The correct fix for this issue is simply to remove every purgeable
connections before a server is deleted. This is implemented by this
patch by extending cli_parse_delete_server(). It could be enough to only
remove connections targetted the deleted server, but as these
connections will be purged anyway it is justified to clear the whole
list.

This must not be backported, unless the above mentionned patch is.
2024-05-15 15:01:55 +02:00
Aurelien DARRAGON
231d3d32be MEDIUM: hlua: take nbthread into account in hlua_get_nb_instruction()
Based on Willy's idea (from 3.0-dev6 announcement message): in this patch
we try to reduce the max latency that can be caused by running lua scripts
with default settings.

Indeed, by default, hlua engine is allowed to process up to 10k
instructions per batch. While this value was found to be the optimal one
for a single thread, it turns out that keeping a thread busy for 10k lua
instructions could increase thread contention. This is especially true
when the script is loaded with 'lua-load', because in that case the
current thread owns the main lua lock and prevent other threads from
making any progress if they're also waiting on the main lock.

Thanks to Thierry Fournier's work, we know that performance-wise we can
reach optimal performance by sticking between 500 and 10k instructions
per batch. Given that, when the script is loaded using 'lua-load', if no
"tune.lua.forced-yield" was set by the user, we automatically divide the
default value (10K) by the number of threads haproxy can use to reduce
thread contention (given that all threads could compete for the main lua
lock), however we make sure not to return a value below 500, because
Thierry's work showed that this would come with a significant performance
loss.

The historical behavior may still be enforced by setting
"tune.lua.forced-yield" to 10000 in the global config section.
2024-05-15 11:59:44 +02:00
Aurelien DARRAGON
e60d9dddf8 MINOR: hlua: add hlua_nb_instruction getter
No functional behavior change, but this will ease the work of dynamically
computing hlua_nb_instruction value depending on various inputs.
2024-05-15 11:59:37 +02:00
Tim Duesterhus
6610f656ea DOC: Update UUID references to RFC 9562
When support for UUIDv7 was added in commit
aab6477b67
the specification still was a draft.

It has since been published as RFC 9562.

This patch updates all UUID references from the obsoleted RFC 4122 and the
draft for RFC 9562 to the published RFC 9562.
2024-05-15 11:40:08 +02:00
William Lallemand
8c6f43d382 REGTESTS: ssl: be more verbose with ocsp_compat_check.vtc
the ocsp_compat_check.vtc reg-test is difficult to debug given than the
haproxy output is piped in `grep -q`.

This patch helps by showing the haproxy output as well as the return
code.
2024-05-15 10:36:02 +02:00
William Manley
366b722f7e MINOR: rhttp: Don't require SSL when attach-srv name parsing
An attach-srv config line usually looks like this:
    tcp-request session attach-srv be/srv name ssl_c_s_dn(CN)

while a rhttp server line usually looks like this:
    server srv rhttp@ sni req.hdr(host)

The server sni argument is used as a key for looking up connection in
the connection pool. The attach-srv name argument is used as a key for
inserting connections into the pool. For it to work correctly they must
match. There was a check that either both the attach-srv and server
provide that key or neither does.

It also checked that SSL and SNI was activated on the server. However,
thanks to current connect_server() implementation, it appears that SNI
is usable even without SSL to identify a connection in the pool. Thus,
it can be diverted from its original intent in reverse HTTP case to
serve even without SSL activated. For example, this could be useful to
use `fc_pp_unique_id` as a name expression (DISCLAIMER: note that for
now PROXY protocol is not compatible with rhttp).

Error is still reported if either SNI or name is used without the other.
This patch adjust the message to a more helpful one.

Arguably it would be easier to understand if instead of using `name` and
`sni` for `attach-srv` and `server` rules it used the same term in both
places - like "conn-pool-key" or something. That would make it clear
that the two must match.
2024-05-14 16:39:07 +02:00
Aurelien DARRAGON
32f0cd3242 BUG/MINOR: log: smp_rgs array issues with inherited global log directives
When a log directive is defined in the global section, each time we use
"log global" in a proxy section, the global log directives are duplicated
for the current proxy. This works by creating a new proxy logger struct
and duplicating every members for each global one.

However, smp_rgs logger member is a special pointer member that is
allocated when "range" is used on a log directive. Currently, we simply
copy the array pointer (from the global one), instead of creating our own
copy. Because of that, range log sampling may not work properly in some
situations prior to 3f1284560 ("MINOR: log: remove the unused curr_idx in
struct smp_log_range") when used in global log directives, for instance:

  global
    log 127.0.0.1:5114 format raw sample 1-2,3:4 local0 info # should receive 75% of all proxy logs
    log 127.0.0.1:5115 format raw sample 4:4 local0 info     # should receive 25% of all proxy logs

  listen proxy1
    log global

  listen proxy2
    log global

May not work as expected, because curr_idx was stored within smp_rgs array
member prior to 3f1284560, and due to this bug, it happens to be shared
between every log directive inherited from a "global" one. The result is
that curr_idx counter will not behave properly because the index will be
increased globally instead of per-log directive, and it could even suffer
from concurrent thread accesses under load since we don't own the global
log directive's lock when manipulating it.

Another issue that was revealed because of this bug is that the smp_rgs
array allocated during config parsing is never freed in free_logger(),
resulting in small memory leak during clean exit.

To fix these issues all at once, let's properly duplicate smp_rgs logger
struct member in dup_logger() like we already do for other special members
so that every log directive have its own sms_rgs copy, and then
systematically free it in free_logger().

While this bug affects all stable versions (including 2.4), it's probably
best to not backport this beyond 2.6 because of 211ea252d
("BUG/MINOR: logs: fix logsrv leaks on clean exit") prerequisite that
first appears in 2.6.

[ada: for versions prior to 2.9, 969e212
 ("MINOR: log: add dup_logsrv() helper function") and 76acde91
 ("BUG/MINOR: log: keep the ref in dup_logger()") must be backported
 first.
 Note: Some ctx adjustments should be performed because 'logger' struct
 used to be named 'logsrv' in the past and 2.9 introduced logger target
 struct member. Thus it's probably easier to manually apply 76acde91 and
 the current bugfix by hand directly on top of 969e212.
]
2024-05-14 12:00:23 +02:00
Aurelien DARRAGON
9d4a44e713 BUG/MINOR: log: fix leak in add_sample_to_logformat_list() error path
If add_sample_to_logformat_list() fails to allocate new logformat_node,
then we directly jump to error_free label to cleanup the node using
free_logformat_node() before returning an error.

However if the node failed to allocate, then the sample expression that
was allocated just before (not yet assigned) isn't released
(free_logformat_node() is a no-op when NULL is provided). Thus if expr
wasn't assigned to the node during early failure, then it must be manually
released.

This bug was introduced by 2462e5bcc ("BUG/MINOR: log: fix potential
lf->name memory leak") which wasn't marked for backports. It only
affects 3.0.
2024-05-13 16:44:27 +02:00
Ilia Shipitsin
cbe78c0281 CI: drop asan.log umbrella completely
asan.log redirection appeared to work poorly, let's cease that practice
for good.

ML: https://www.mail-archive.com/haproxy@formilux.org/msg44844.html
2024-05-13 11:36:36 +02:00
Willy Tarreau
7217a9e9b9 [RELEASE] Released version 3.0-dev11
Released version 3.0-dev11 with the following main changes :
    - BUILD: clock: improve check for pthread_getcpuclockid()
    - CI: add Illumos scheduled workflow
    - CI: netbsd: limit scheduled workflow to parent repo only
    - OPTIM: log: resolve logformat options during postparsing
    - BUG/MINOR: haproxy: only tid 0 must not sleep if got signal
    - REGTEST: add tests for acl() sample fetch
    - BUG/MINOR: acl: support built-in ACLs with acl() sample
    - BUG/MINOR: cfgparse: use curproxy global var from config post validation
    - MEDIUM: stconn/muxes: Add an abort reason for SE shutdowns on muxes
    - MINOR: mux-h2: Set the SE abort reason when a RST_STREAM frame is received
    - MEDIUM: mux-h2: Forward h2 client cancellations to h2 servers
    - MINOR: mux-quic: Set tha SE abort reason when a STOP_SENDING frame is received
    - MINOR: stconn: Add samples to retrieve about stream aborts
    - MINOR: mux-quic: Add .ctl callback function to get info about a mux connection
    - MINOR: muxes: Add ctl commands to get info on streams for a connection
    - MINOR: connection: Add samples to retrieve info on streams for a connection
    - BUG/MEDIUM: log/ring: broken syslog octet counting
    - BUG/MEDIUM: mux-quic: fix crash on STOP_SENDING received without SD
    - DOC: lua: fix filters.txt file location
    - MINOR: dynbuf: pass a criticality argument to b_alloc()
    - MINOR: dynbuf: add functions to help queue/requeue buffer_wait fields
    - MINOR: dynbuf: use the b_queue()/b_requeue() functions everywhere
    - MEDIUM: dynbuf: make the buffer_wq an array of list heads
    - CLEANUP: tinfo: better align fields in thread_ctx
    - MINOR: dynbuf: provide a b_dequeue() function to detach a bw from the queue
    - MEDIUM: dynbuf: generalize the use of b_dequeue() to detach buffer_wait
    - MEDIUM: dynbuf/stream: re-enable queueing upon failed buffer allocation
    - MEDIUM: dynbuf/stream: do not allocate the buffers in the callback
    - MEDIUM: applet: make appctx_buf_available() only wake the applet up, not allocate
    - MINOR: applet: set the blocking flag in the buffer allocation function
    - MINOR: applet: adjust the allocation criticity based on the requested buffer
    - MINOR: dynbuf/mux-h1: use different criticalities for buffer allocations
    - MEDIUM: dynbuf/mux-h1: do not allocate the buffers in the callback
    - MEDIUM: dynbuf: refrain from offering a buffer if more critical ones are waiting
    - MINOR: stconn: report that a buffer allocation succeeded
    - MINOR: stream: report that a buffer allocation succeeded
    - MINOR: applet: report about buffer allocation success
    - MINOR: mux-h1: report that a buffer allocation succeeded
    - MEDIUM: stream: allocate without queuing when retrying
    - MEDIUM: channel: allocate without queuing when retrying
    - MEDIUM: mux-h1: allocate without queuing when retrying
    - MEDIUM: dynbuf: implement emergency buffers
    - MEDIUM: dynbuf: use emergency buffers upon failed memory allocations
2024-05-10 17:39:19 +02:00
Willy Tarreau
fc792694a6 MEDIUM: dynbuf: use emergency buffers upon failed memory allocations
Now, if a pool_alloc() fails for a buffer and if conditions are met
based on the queue number, we'll try to get an emergency buffer.

Thanks to this the situation is way more stable now. With only 4 reserve
buffers and 1 buffer it's possible to reliably serve 500 concurrent end-
to-end H1 connections and consult stats in parallel in loops showing the
growing number of buf_wait events in "show activity" without facing an
instant stall like in the past. Lower values still cause quick stalls
though.

It's also apparent that some subsystems do not seem to detach from the
buffer_wait lists when leaving. For example several crashes in the H1
part showed list elements still present after a free(), so maybe some
operations performed inside h1_release() after the b_dequeue() call
can sometimes result in a new allocation. Same for streams, where
the dequeue is done relatively early.
2024-05-10 17:18:13 +02:00
Willy Tarreau
0ce51dc93b MEDIUM: dynbuf: implement emergency buffers
The buffer reserve set by tune.buffers.reserve has long been unused, and
in order to deal gracefully with failed memory allocations we'll need to
resort to a few emergency buffers that are pre-allocated per thread.

These buffers are only for emergency use, so every time their count is
below the configured number a b_free() will refill them. For this reason
their count can remain pretty low. We changed the default number from 2
to 4 per thread, and the minimum value is now zero (e.g. for low-memory
systems). The tune.buffers.limit setting has always been a problem when
trying to deal with the reserve but now we could simplify it by simply
pushing the limit (if set) to match the reserve. That was already done in
the past with a static value, but now with threads it was a bit trickier,
which is why the per-thread allocators increment the limit on the fly
before allocating their own buffers. This also means that the configured
limit is saner and now corresponds to the regular buffers that can be
allocated on top of emergency buffers.

At the moment these emergency buffers are not used upon allocation
failure. The only reason is to ease bisecting later if needed, since
this commit only has to deal with resource management.
2024-05-10 17:18:13 +02:00
Willy Tarreau
47665be083 MEDIUM: mux-h1: allocate without queuing when retrying
Now when trying to allocate a buffer, we can check if we've been notified
of availability via the callback, in which case we should not consult the
queue, or if we're doing a first allocation and check the queue. At this
point it still doesn't change much since the stream still doesn't make use
of it but some progress is expected.
2024-05-10 17:18:13 +02:00
Willy Tarreau
5b8d27617f MEDIUM: channel: allocate without queuing when retrying
Now when trying to allocate a channel buffer, we can check if we've been
notified of availability via the producer stream connector callback, in
which case we should not consult the queue, or if we're doing a first
allocation and check the queue.
2024-05-10 17:18:13 +02:00
Willy Tarreau
b5714b45e8 MEDIUM: stream: allocate without queuing when retrying
Now when trying to allocate the work buffer, we can check if we've been
notified of availability via the buf_wait callback, in which case we
should not consult the queue, or if we're doing a first allocation and
check the queue.
2024-05-10 17:18:13 +02:00
Willy Tarreau
f552f79ba5 MINOR: mux-h1: report that a buffer allocation succeeded
When the buffer allocation callback is notified of a buffer availability,
it will now set a MAYALLOC flag in addition to clearing the ALLOC one, for
each of the 3 levels where we may fail an allocation. The flag will be
cleared upon a successful allocation. This will soon be used to decide to
re-allocate without waiting again in the queue. For now it has no effect.

There's just a trick, we need to clear the various *_ALLOC flags before
testing h1_recv_allowed() otherwise it will return false!
2024-05-10 17:18:13 +02:00
Willy Tarreau
cb2d758043 MINOR: applet: report about buffer allocation success
When appctx_buf_available() is called, it now sets APPCTX_FL_IN_MAYALLOC
or APPCTX_FL_OUT_MAYALLOC depending on the reportedly permitted buffer
allocation, and these flags are cleared when the said buffers are
allocated. For now they're not used for anything else.
2024-05-10 17:18:13 +02:00
Willy Tarreau
17d8916bb1 MINOR: stream: report that a buffer allocation succeeded
When the buffer allocation callback is notified of a buffer availability,
it will now set a MAYALLOC flag on the stream so that the stream knows it
is allowed to bypass the queue checks. For now this is not used.
2024-05-10 17:18:13 +02:00
Willy Tarreau
7aff64518c MINOR: stconn: report that a buffer allocation succeeded
We used to have two states for the channel's input buffer used by the SC,
NEED_BUFF or not, flipped by sc_need_buff() and sc_have_buff(). We want to
have a 3rd state, indicating that we've just got a desired buffer. Let's
add an HAVE_BUFF flag that is set by sc_have_buff() and that is cleared by
sc_used_buff(). This way by looking at HAVE_BUFF we know that we're coming
back from the allocation callback and that the offered buffer has not yet
been used.
2024-05-10 17:18:13 +02:00
Willy Tarreau
d1eb48a12b MEDIUM: dynbuf: refrain from offering a buffer if more critical ones are waiting
Now b_alloc() will check the queues at the same and higher criticality
levels before allocating a buffer, and will refrain from allocating one
if these are not empty. The purpose is to put some priorities in the
allocation order so that most critical allocators are offered a chance
to complete.

However in order to permit a freshly dequeued task to allocate again while
siblings are still in the queue, there is a special DB_F_NOQUEUE flag to
pass to b_alloc() that will take care of this special situation.
2024-05-10 17:18:13 +02:00
Willy Tarreau
a160b3c50c MEDIUM: dynbuf/mux-h1: do not allocate the buffers in the callback
One of the problematic designs with the buffer_wait mechanism is that
the callbacks pre-allocate the buffers and stay in the run queue for
a while, resulting in all of the few buffers being assigned to waiting
tasks instead of being all available to one task that needs them all at
once.

Here we simply stop doing this, the callback clears the waiting flags
and wakes the task up so that it has a chance of still finding some
buffers.
2024-05-10 17:18:13 +02:00
Willy Tarreau
c510e81a3f MINOR: dynbuf/mux-h1: use different criticalities for buffer allocations
While it could certainly still be improved, this first approach consists
in assigning buffers like this in the H1 mux:
  - h1c->obuf : DB_MUX_TX
  - h1c->ibuf : DB_MUX_RX
  - h1s->rxbuf: DB_SE_RX

That's done via 3 distinct functions for better code clarity, and it
also allowed to move the missing buffer flags assignment there.

Among possible improvements would be to take into consideration the
state of the parser (i.e. no data yet vs data, or headers vs payload)
so that even server beginning of response or pure payload can be lowered
in priority.
2024-05-10 17:18:13 +02:00
Willy Tarreau
4a42af1744 MINOR: applet: adjust the allocation criticity based on the requested buffer
When we want to allocate an in buffer, it's in order to pass data to
the applet, that will consume it, so it must be seen as the same as
a send() from the higher level, i.e. MUX_TX. And for the outbuf, it's
a stream endpoint returning data, i.e. DB_SE_RX.
2024-05-10 17:18:13 +02:00
Willy Tarreau
4ffb3b5ebe MINOR: applet: set the blocking flag in the buffer allocation function
Instead of having each caller of appctx_get_buf() think about setting
the blocking flag, better have the function do it, since it's already
handling the queue anyway. This way we're sure that both are consistent.
2024-05-10 17:18:13 +02:00
Willy Tarreau
ee0d56ac85 MEDIUM: applet: make appctx_buf_available() only wake the applet up, not allocate
Now we don't want bufwait handlers to preallocate the resources they
were expecting since it contributes to the shortage. Let's just wake
the applet up and that's all.
2024-05-10 17:18:13 +02:00
Willy Tarreau
9a27d7aa6f MEDIUM: dynbuf/stream: do not allocate the buffers in the callback
One of the problematic designs with the buffer_wait mechanism is that
the callbacks pre-allocate the buffers and stay in the run queue for
a while, resulting in all of the few buffers being assigned to waiting
tasks instead of being all available to one task that needs them all at
once.

Here we simply stop doing this, the callback clears the waiting flags
and wakes the task up so that it has a chance of still finding some
buffers.
2024-05-10 17:18:13 +02:00
Willy Tarreau
db21062881 MEDIUM: dynbuf/stream: re-enable queueing upon failed buffer allocation
The errors were not working fine anyway since we know that upon low memory
condition everything freezes. However we have a chance to do better now,
so let's start by re-enabling queueing when allocations fail.
2024-05-10 17:18:13 +02:00
Willy Tarreau
f5566afec6 MEDIUM: dynbuf: generalize the use of b_dequeue() to detach buffer_wait
Now thanks to this the bufq_map field is expected to remain accurate.
2024-05-10 17:18:13 +02:00
Willy Tarreau
f70bd5fad1 MINOR: dynbuf: provide a b_dequeue() function to detach a bw from the queue
Now that we need to keep the bitmap in sync with the list heads, we don't
want tasks to leave just doing a LIST_DEL_INIT() without updating the map.
Let's provide a b_dequeue() function for that purpose. The function detects
when it's going to remove the last element and figures the queue number
based on the pointer since it points to the root. It's not used yet.
2024-05-10 17:18:13 +02:00
Willy Tarreau
53461e4d94 CLEANUP: tinfo: better align fields in thread_ctx
The introduction of buffer_wq[] in thread_ctx pushed a few fields around
and the cache line alignment is less satisfying. And more importantly, even
before this, all the lists in the local parts were 8-aligned, with the first
one split across two cache lines.

We can do better:
  - sched_profile_entry is not atomic at all, the data it points to is
    atomic so it doesn't need to be in the atomic-only region, and it can
    fill the 8-hole before the lists
  - the align(2*void) that was only before tasklets[] moves before all
    lists (and it's a nop for now)

This now makes the lists and buffer_wq[] start on a cache line boundary,
leaves 48 bytes after the lists before the atomic-only cache line, and
leaves a full cache line at the end for 128-alignment. This way we still
have plenty of room in both parts with better aligned fields.
2024-05-10 17:18:13 +02:00