Commit Graph

7900 Commits

Author SHA1 Message Date
Christopher Faulet
15a520d474 MINOR: config/trace: Add a 'traces' section to declare debug traces
It is no longer supported to declare debug traces, via 'trace' directive, in
a global section. A 'traces' directive must be used instead. The syntax of
the 'trace' directive in these sections remains the same. But it is no
longer experimental.

The main reason for this change is to avoid to have a ring section defined
before a global one. Indeed, for now, forward declarations of ring sections
are not supported. So to configure traces, you had to add a ring section
before the global one defining the traces. Most of time, that meant to have
two global sections :

  global
    [...] # global settings

  ring <name>
    [...]

  global
    [...] # trace config

In addition, it will be possible to easily extend the traces section by
adding some new directives.
2024-10-02 10:22:51 +02:00
Amaury Denoyelle
cc4384aeb7 MEDIUM: quic: handle out-of-order ACK at streamdesc layer
qc_stream_desc_ack() is the entrypoint for streamdesc layer to handle a
new acknowledgement of previously emitted STREAM data.

Previously, it was only able to deal with in-order ACK offset. The
caller was responsible to buffer out-of-order ACKs. Change this by
dealing with the latter case directly in qc_stream_desc_ack(). This
notably simplify ACK handling in quic_rx module.
2024-10-01 16:22:20 +02:00
Amaury Denoyelle
62558a9285 MINOR: quic: move buffered ACK to streambuf
QUIC streamdesc layer is used to manage QUIC MUX stream txbuf data
storage until acknowledgment. Currently, it only supports in-order
acknowledgment at the stream level. This requires to be able to buffer
out-of-order ACKs until they can be handled.

Previously, these ACKs were stored in a tree to the streamdesc instance.
Move this indexed storage at the streambuf instance.

This commit is purely an architecture change. However, it will allow to
extend ACK management in future patches, such as the ability to merge
overlapping out-of-order ACKs.
2024-10-01 16:19:42 +02:00
Amaury Denoyelle
943e48dadd MINOR: quic: store streambuf in a streamdesc tree
qc_stream_desc layer is used by QUIC MUX to store emitted STREAM data
until their acknowledgement. Each stream with Tx capability can allocate
its own qc_stream_desc. In turn, each stream desc can have one or
multiple data buffers. This is useful when a MUX stream releases a
buffer and allocate a new one, to preserve bandwith without waiting to
receive all acknowledgement of the previous buffer.

Each buffer is encapsulated in a qc_stream_buf structure. Previously, it
was stored as a list into qc_stream_desc. Change this storage to use a
tree instead. Each buffer is indexed by their offset.

This commit does not introduce functional changes. However, this
rearchitecture will be necessary for future commit to extend ACK
management which require fetching individual buffer instance, not just
the first or last element of a streamdesc, by their offset.
2024-10-01 16:19:41 +02:00
Amaury Denoyelle
f4a83fbb14 MINOR: quic: do not remove qc_stream_desc automatically on ACK handling
qc_stream_desc_ack() is used to handle ACK received for STREAM frame. It
removes acknowledged data from their underlying buffer.

If all data were removed after ACK handling, qc_stream_desc instance
would automatically be freed at the end of qc_stream_desc_ack().
However, this renders the function complicated to use. Simplify this by
removing this automatic removal. Now, caller is responsible to check
after ACK handling if qc_stream_desc instance can be removed. This is
easily done using qc_stream_desc_done() helper.
2024-10-01 16:19:25 +02:00
Amaury Denoyelle
db68f8ed86 MINOR: quic: refactor STREAM room notification
qc_stream_desc is an intermediary layer between QUIC MUX and quic_conn.
It is a facility which permits to store data to emit and keep them for
retransmission until acknowledgment. This layer is responsible to notify
QUIC MUX each time a buffer is freed. This is necessary as MUX buffer
allocation is limited by the underlying congestion window size.

Refactor this to use a mechanism similar to send notification. A new
callback notify_room can now be registered to qc_stream_desc instance.
This is set by QUIC MUX to qmux_ctrl_room(). On MUX QUIC free, special
care is now taken to reset notify_room callback to NULL.

Thanks to this refactoring, further adjustment have been made to refine
the architecture. One of them is the removal of qc_stream_desc
QC_SD_FL_OOB_BUF, which is now converted to a MUX layer flag
QC_SF_TXBUF_OOB.
2024-10-01 16:19:25 +02:00
Amaury Denoyelle
d7f4e5abf0 MEDIUM: quic: strengthen MUX send notification
Previous commit implement a refactor of MUX send notification from
quic_conn layer. With this new architecture, a proper callback is
defined for each qc_stream_desc instance.

This architecture change allows to simplify notification from quic_conn
layer. First, ensure the MUX callback to properly ignore retransmission
of an already emitted frame. Luckily, this can be handled easily by
comparing offsets and FIN status. Also, each QCS instance can now be
unregistered from send notification just prior qc_stream_desc releasing.
This ensures a QCS is never manipulated from quic_conn after its
emission ending. Both these changes render the send notification more
robust. As a nice effect, flag QUIC_FL_CONN_TX_MUX_CONTEXT can be
removed as it is now unneeded.
2024-10-01 16:19:25 +02:00
Amaury Denoyelle
6ad99af0a9 MINOR: quic: refactor MUX send notification
For STREAM emission, MUX QUIC generates one or several frames and emit
them via qc_send_mux(). Lower layer may use them as-is, or split them to
lower chunk to fit in a QUIC packet. It is then responsible to notify
the MUX to report the amount of data sent.

Previously, this was done via a direct call from quic_conn to MUX using
qcc_streams_sent_done(). Modify this to have a better isolation accross
layers. Define a send callback handled by the qc_stream_desc instance.
This allows the MUX to register each QCS instance individually to the
renamved qmux_ctrl_send() which replaces qcc_streams_sent_done().

At quic_conn layer, qc_stream_desc_send() can be used now. This is a
wrapper to qc_stream_desc layer to invoke the send callback if
registered.

This mechanism of qc_stream_desc callback should be extended later to
implement other notifications accross the QUIC stack.
2024-10-01 16:19:25 +02:00
Christopher Faulet
273d322b6f MINOR: stream/stats: Expose the total number of streams ever created in stats
A shared counter is added in the thread context to track the total number of
streams created on the thread. This number is then reported in stats. It
will be a useful information to diagnose some bugs.
2024-09-30 16:55:53 +02:00
Christopher Faulet
18ee22ff76 MINOR: stream/stats: Expose the current number of streams in stats
A shared counter is added in the thread context to track the current number
of streams. This number is then reported in stats. It will be a useful
information to diagnose some bugs.
2024-09-30 16:55:53 +02:00
Christopher Faulet
6a94b7419e MINOR: stream: Support dynamic changes of the number of connection retries
Thanks to the previous patch, it is now possible to add an action to
dynamically change the maxumum number of connection retires for a stream.
"set-retries" action may now be used to do so, from a "tcp-request content"
or a "http-request" rule. This action accepts an expression or an integer
between 0 and 100. The integer value is checked during the configuration
parsing and leads to an error if it is not in the expected range. However,
for the expression, the value is retrieve at runtime. So, invalid value are
just ignored.

Too high value is forbidden to avoid any trouble. 100 retries seems already
be an amazingly hight value. In addition, the option is only available on
backend or listen sections.

Because the max retries is limited to 100 at most, it can be stored as a
unsigned short. This save some space in the stream structure.
2024-09-30 16:55:53 +02:00
Christopher Faulet
91e785edc9 MINOR: stream: Rely on a per-stream max connection retries value
Instead of directly relying on the backend parameter to limit the number of
connection retries, we now use a per-stream value. This value is by default
inherited from the backend value when it is set. So for now, there is no
change except the stream value is used instead of the backend value. But
thanks to this change, it will be possible to dynamically change this value.
2024-09-30 16:55:53 +02:00
Christopher Faulet
0d91de2be4 MINOR: action: Export release_expr_int_action() release function
This function was only used by TCP actions and was private to tcp_act.c
file. However, it make sense to make it public to be used by any action
relying on an int-or-expression argument.
2024-09-30 16:55:53 +02:00
Willy Tarreau
7caf073faa MINOR: tools: do not attempt to use backtrace() on linux without glibc
The function is provided by glibc. Nothing prevents us from using our
own outside of glibc there (tested on aarch64 with musl). We still do
not enable it by default as we don't yet know if all archs work well,
but it's sufficient to pass USE_BACKTRACE=1 when building with musl to
verify it's OK.
2024-09-29 09:52:23 +02:00
Willy Tarreau
1c4776dbc3 BUILD: tools: only include execinfo.h for the real backtrace() function
No need to include this possibly non-existing file when using our own
backtrace() implementation, it's only needed for the libc-provided one.
Because of this it's currently not possible to build musl with backtrace
enabled.
2024-09-29 09:52:23 +02:00
Willy Tarreau
a4d04c649a BUG/MINOR: server: make sure the HMAINT state is part of MAINT
In 1.8 when adding "set server fqdn" with commit b418c1228c ("MINOR:
server: cli: Add server FQDNs to server-state file and stats socket."),
the HMAINT flag was not made part of the MAINT ones, so technically
speaking when changing the FQDN, the server is not completely considered
as in maintenance mode.

In its defense, the code location around that was completely messy, with
the aggregator flag being hidden between other values and purposely but
discretely ignoring one of the flags, so the comments were updated to
make the intent clearer (particularly regarding CMAINT which looked like
it was also forgotten while it was on purpose).

This can be backported anywhere.
2024-09-27 18:40:15 +02:00
Willy Tarreau
b8e3b0a18d BUG/MEDIUM: stream: make stream_shutdown() async-safe
The solution found in commit b500e84e24 ("BUG/MINOR: server: shut down
streams under thread isolation") to deal with inter-thread stream
shutdown doesn't work fine because there exists code paths involving
a server lock which can then deadlock on thread_isolate(). A better
solution then consists in deferring the shutdown to the stream itself
and just wake it up for that.

The only thing is that TASK_WOKEN_OTHER is a bit too generic and we
need to pass at least 2 types of events (SF_ERR_DOWN and SF_ERR_KILLED),
so we're now leveraging the new TASK_F_UEVT1 and _UEVT2 flags on the
task's state to convey these info. The caller only needs to wake the
task up with these flags set, and the stream handler will then finish
the job locally using stream_shutdown_self().

This needs to be carefully backported to all branches affected by the
dequeuing issue and containing any of the 5541d4995d ("BUG/MEDIUM:
queue: deal with a rare TOCTOU in assign_server_and_queue()"), and/or
b11495652e ("BUG/MEDIUM: queue: implement a flag to check for the
dequeuing").
2024-09-27 12:15:41 +02:00
Willy Tarreau
b5281283bb MINOR: task: define two new one-shot events for use with WOKEN_OTHER or MSG
TASK_WOKEN_MSG only says "someone sent you a message" but doesn't convey
any info about the message. TASK_WOKEN_OTHER says "you're woken for another
reason" but doesn't tell which one. Most often they're used as-is by the
task handlers to report very specific situations.

For some important control notifications, having the ability to modulate
the message a little bit is useful, so let's define two user event types
UEVT1 and UEVT2 to be used in conjunction with TASK_WOKEN_MSG or _OTHER
so that the application can know that a specific condition was explicitly
requested. It will be used this way:

  task_wakeup(s->task, TASK_WOKEN_MSG | TASK_F_UEVT1);
or:
  task_wakeup(s->task, TASK_WOKEN_OTHER | TASK_F_UEVT2);

Since events are cumulative, keep in mind not to consider a 3rd value
as the combination of EVT1+EVT2; these really mean that the two events
appeared (though in unspecified order).
2024-09-27 11:56:10 +02:00
Aurelien DARRAGON
4189eb7aca MINOR: log: add log_orig_proxy() helper function
Function may be used on proxy where log-steps are used to check if a given
log origin should be handled or not.
2024-09-26 16:53:07 +02:00
Aurelien DARRAGON
c043d5d372 MINOR: log: introduce "log-steps" proxy keyword
For now it is only available for proxies with frontend capability because
log-steps are only evaluated under sess_log() or strm_log() which
essentially focus on the frontend side when it comes to log settings so
it's better to keep it this way for better consistency, at least for now.

For now the setting does nothing (it is not considered during runtime),
it will be implemented and documented in upcoming commits.
2024-09-26 16:53:07 +02:00
Aurelien DARRAGON
9341792baf MINOR: proxy: add log_steps struct member
add proxy->conf.log_steps eb32 root tree which will be used to store the
log origin identifiers that should result in haproxy emitting a log as
configured by the user using upcoming "log-steps" proxy keyword.

It was chosen to use eb32 tree instead of simple bitfield because despite
the slight overhead it is more future-proof given that we already
implemented the prerequisites for seamless custom log origins registration
that will also be usable from "log-steps" proxy keyword.
2024-09-26 16:53:07 +02:00
Aurelien DARRAGON
b882402a29 MINOR: log: support extra log origins for '%OG' alias
Following previous commits, let's improve log_orig_to_str() so that
extra log origins (registered through log_orig_register()) can be
translated to string from origin ID.

For that, it is required to add eb_32 tree node to log_origin struct in
order to enable quick integer lookup during runtime. Slow name lookup
using the list is acceptable for config parsing, but it is not the case
during runtime when log_orig_to_str() is expected to be used. Also, to
prevent duplicated info, get rid of ->id field and use ->tree.key instead
2024-09-26 16:53:07 +02:00
Aurelien DARRAGON
f8bb9d5c57 MINOR: log: explicitly handle extra log origins as error when relevant
Thanks to previous commit, we can know check for log_orig optional flags
in functions taking struct log_orig as parameter. Let's take this
opportunity to add the LOG_ORIG_FL_ERROR flag and check this flag at a
few places to handle the log message differently because if the flag is
set then the caller expects the log to be handled as an error explicitly.

e.g.: in _process_send_log_override(), if the flag is set, use the error
log format instead of the dedicated one.
2024-09-26 16:53:07 +02:00
Aurelien DARRAGON
3c15ee05e9 MINOR: log: introduce log_orig flags
Rename 'enum log_orig' to 'enum log_orig_id', since this enum specifically
contains the log origin ids.

Add 'struct log_orig' which wraps 'enum log_orig' with optional flags
(no flags defined for now).

Add log_orig() helper func that takes id and flags as parameter and
returns log_orig struct initialized with input arguments.

Update functions taking log origin as parameter so they explicitly take
log orig id or log orig wrapper as argument depending on the level of
context expected by the function.
2024-09-26 16:53:07 +02:00
Aurelien DARRAGON
818475c5cc MINOR: log: introduce extra log profile steps
add a way to register additional log origins using log_origin_register()
that may be used as log profile steps from log profile sections.

For now this does nothing as no extra origins are registered and extra log
origins are not yet considered for runtime logging paths.

When specifying an extra logging step for on <step> under log-profile
section, the logging step is stored within a binary tree for efficient
lookup during runtime. No performance impact should be expected if extra
log origins are not being used, and slight performance impact if extra
log origins are used.

Don't forget to update the documentation when new log origins are added
(both %OG log alias and on <step> log-profile keyword are concerned.
2024-09-26 16:53:07 +02:00
Oliver Dala
a889413f5e BUG/MEDIUM: cli: Deadlock when setting frontend maxconn
The proxy lock state isn't passed down to relax_listener
through dequeue_proxy_listeners, which causes a deadlock
in relax_listener when it tries to get that lock.

Backporting: Older versions didn't have relax_listener and directly called
resume_listener in dequeue_proxy_listeners. lpx should just be passed directly
to resume_listener then.

The bug was introduced in commit 001328873c

[cf: This patch should fix the issue #2726. It must be backported as far as
2.4]
2024-09-25 17:12:11 +02:00
Christopher Faulet
96edacc546 DEV: flags/applet: decode appctx flags
Decode APPCTX flags via appctx_show_flags() function.
2024-09-24 18:26:36 +02:00
Willy Tarreau
ccd1ecba1d MEDIUM: cfgparse: drop duplicate named defaults sections after use
It has never been permitted to explicitly reference named defaults
sections for which there are duplicate names. This means that when
a duplicate defaults section is found, there's no point in keeping
it since it will never be used for lookups, so it can be dropped.

However, some such defaults sections might have some rules in them
that are implicitly referenced by proxies placed after them. In this
case they cannot be removed.

What is done here is that upon each new named section creation, if
another one is found with the same name, its config location is stored
into the new proxy's {prev_file,prev_line} pair, and the old section is
either destroyed if its refcount is null, or just unindexed. The dup
check when creating a new proxy now consists in checking the prev_line
instead of performing a dup lookup on the defaults section.

This will guarantee that we can't find duplicate defaults sections in
their tree anymore, while still keeping track of what's allocated and
releasing everything upon exit.

Beyond the consistency gain, there are nice savings for large configs
involving many defaults sections: a test with 300k sections saved
about 1.9 GB of RAM, and started 25% faster likely thanks to spending
less time allocating memory.
2024-09-20 16:35:32 +02:00
Willy Tarreau
c8b813771d MINOR: proxy: add a list of orphaned defaults sections
We'll soon delete unreferenced and duplicated named defaults sections
from the list of proxies. The problem with this is that this list (in
fact a name-based tree) is used to release all of them at the end. Let's
add a list of orphaned defaults sections, typically those containing
"http-check send" statements or various other rules, and that are
implicitly inherited by a proxy hence have a non-zero refcount while
also having a name. These now makes it possible to remove them from
the name index while still keeping their memory around for the lifetime
of the process, and cleaning it at the end.
2024-09-20 15:59:04 +02:00
Willy Tarreau
b325453c36 MINOR: proxy: use the global file names for conf->file
Proxy file names are assigned a bit everywhere (resolvers, peers,
cli, logs, proxy). All these elements were enumerated and now use
copy_file_name(). The only ha_free() call was turned to drop_file_name().

As a bonus side effect, a 300k backend config saved 14 MB of RAM.
2024-09-19 15:38:19 +02:00
Willy Tarreau
9ab21a3c2d CLEANUP: stick-table: make the file location point to a global file name
The file name used to point to the calling function's stack for stick
tables, which was OK during parsing but remained dangling afterwards.
At least it was already marked const so as not to accidentally free it.
Let's make it point to a file_name_node now.
2024-09-19 15:38:19 +02:00
Willy Tarreau
d6c060c5ae MINOR: tools: add minimal file name management
In proxies, stick-tables, servers, etc... at plenty of places we store
a file name and a line number. Some file names are the result of strdup()
(e.g. in proxies), others not (e.g. stick-tables) and leave dangling
pointers at the end of parsing. The risk of double-free is not null
either.

In order to stop this, let's first add a simple tool that allows to
register short strings inside a global list, these strings happening
to be server names. The strings are either duplicated and stored upon
failure to find them, or just added to this storage. Since file names
are not expected to disappear before the end of the process, for now
we don't even implement refcounting, and we free them all at the end.
There's already a drop_file_name() function to reset the pointer like
ha_free() used to do, and even if not strictly needed it's a good
habit to get used to doing it.

The strings are returned as const so that they're stored as-is in
structs, and that nasty free() calls are easily caught. The pointer
points to the char[] storage inside the node itself. This way later
if we want to implement refcounting, it will be trivial to just look
up a string and change its associated node's refcount. If needed,
comparisons can also be made on pointers.

For now they're not used yet and are released on deinit().
2024-09-19 15:36:58 +02:00
Willy Tarreau
8df44eea6d BUILD: cebtree: silence a bogus gcc warning on impossible code paths
gcc-12 and above report a wrong warning about a negative length being
passed to memcmp() on an impossible code path when built at -O0. The
pattern is the same at a few places, basically:

  int foo(int op, const void *a, const void *b, size_t size, size_t arg)
  {
      if (op == 1) // arg is a strict multiple of size
          return memcmp(a, b, arg - size);
      return 0;
  }
  ...
  int bar()
  {
     return foo(0, a, b, sizeof(something), 0);
  }

It *might* be possible to invent dummy values for the "len" argument
above in the real code, but that significantly complexifies it and as
usual can easily result in introducing undesired bugs.

Here we take a different approach consisting in shutting the
-Wstringop-overread warning on gcc>=12 at -O0 since that's the only
condition that triggers it. The issue was reported to and confirmed by
the gcc team here:  https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114622

No backport needed, but this should be upstreamed into cebtree after
checking that all involved macros are available.
2024-09-18 17:42:52 +02:00
Willy Tarreau
f793845f4a MEDIUM: clock: collect the monotonic time in clock_local_update_date()
Now we collect this clock in clock_local_update_date(), the closest from
the poller, which is also used when busy-polling, and the values is set
into the thread's curr_mono_time which did not exist before. Later,
clock_leaving_poll() just sets the prev_mono_time value from the curr_
one instead of retrieving the time at this specific point. It also means
that the monotonic time will now also cover the time needed to update
the global time, which should be negligible. Note that we don't collect
the CPU time in the clock_local_update_date() function even though it's
tempting, because when doing busy-polling, it would be collected on each
round while being useless.

Doing so will make sure that the local time always knows the monotonic
time when it is available.
2024-09-17 09:08:10 +02:00
Christopher Faulet
5fc12b0afd BUG/MEDIUM: sc_strm/applet: Wake applet after a successfull synchronous send
On a synchronous send from the stream to an applet, if some data were sent,
we must take care to wake the applet up. It is important because if
everything was sent at this stage, there is no other chance to wake the
applet up, mainly because SE_FL_WAIT_DATA flag is set on the applet's sedesc
in sc_update_tx() at the end of process_stream(). This flag prevent any
wakeup of the applet for a send event.

It is not necessary for a mux because the mux stream is called when a
syncrhonous send from the stream is performed. So it is reponsible to wake
the mux connection if necessary.

This patch must be backport to 3.0.
2024-09-16 22:55:40 +02:00
Willy Tarreau
5d350d1e50 OPTIM: vars: use multiple name heads in the vars struct
Given that the original list-based version was using a list head as the
root of the variables, while the tree is using a single pointer, it made
sense to reuse that space to place multiple roots, indexed on the lower
bits of the name hash. Two roots slightly increase the performance level,
but the best gain is obtained with 4 roots. The performance is now always
above that of the list, even with small counts, and with 100 vars, it's
21% higher than before, or 67% higher than with the list.

We keep the same lock (it could have made sense to use one lock per head),
because most of the variables in large configs are attached to a stream
or a session, hence are not shared between threads. Thus there's no point
in sharding the pointer.
2024-09-15 23:51:51 +02:00
Willy Tarreau
47ec7c681e OPTIM: vars: use a cebtree instead of a list for variable names
Configs involving many variables can start to eat a lot of CPU in name
lookups. The reason is that the names themselves are dynamic in that
they are relative to dynamic objects (sessions, streams, etc), so
there's no fixed index for example. The current implementation relies
on a standard linked list, and in order to speed up lookups and avoid
comparing strings, only a 64-bit hash of the variable's name is stored
and compared everywhere.

But with just 100 variables and 1000 accesses in a config, it's clearly
visible that variable name lookup can reach 56% CPU with a config
generated this way:

  for i in {0..100}; do
    printf "\thttp-request set-var(txn.var%04d) int(%d)" $i $i;
    for j in {1..10}; do [ $i -lt $j ] || printf ",add(txn.var%04d)" $((i-j)); done;
    echo;
  done

The performance and a 4-core skylake 4.4 GHz reaches 85k RPS with a perf
profile showing:

  Samples: 170K of event 'cycles', Event count (approx.): 142378815419
  Overhead  Shared Object            Symbol
    56.39%  haproxy                  [.] var_to_smp
     6.65%  haproxy                  [.] var_set.part.0
     5.76%  haproxy                  [.] sample_process_cnv
     3.23%  haproxy                  [.] sample_conv_var2smp
     2.88%  haproxy                  [.] sample_conv_arith_add
     2.33%  haproxy                  [.] __pool_alloc
     2.19%  haproxy                  [.] action_store
     2.13%  haproxy                  [.] vars_get_by_desc
     1.87%  haproxy                  [.] smp_dup

[above, var_to_smp() calls var_get() under the read lock].

By switching to a binary tree, the cost is significantly lower, the
performance reaches 117k RPS (+37%) with this profile:

  Samples: 170K of event 'cycles', Event count (approx.): 142323631229
  Overhead  Shared Object            Symbol
    40.22%  haproxy                  [.] cebu64_lookup
     7.12%  haproxy                  [.] sample_process_cnv
     6.15%  haproxy                  [.] var_to_smp
     4.75%  haproxy                  [.] cebu64_insert
     3.79%  haproxy                  [.] sample_conv_var2smp
     3.40%  haproxy                  [.] cebu64_delete
     3.10%  haproxy                  [.] sample_conv_arith_add
     2.36%  haproxy                  [.] action_store
     2.32%  haproxy                  [.] __pool_alloc
     2.08%  haproxy                  [.] vars_get_by_desc
     1.96%  haproxy                  [.] smp_dup
     1.75%  haproxy                  [.] var_set.part.0
     1.74%  haproxy                  [.] cebu64_first
     1.07%  [kernel]                 [k] aq_hw_read_reg
     1.03%  haproxy                  [.] pool_put_to_cache
     1.00%  haproxy                  [.] sample_process

The performance lowers a bit earlier than with the list however. What
can be seen is that the performance maintains a plateau till 25 vars,
starts degrading a little bit for the tree while it remains stable till
28 vars for the list. Then both cross at 42 vars and the list continues
to degrade doing a hyperbole while the tree resists better. The biggest
loss is at around 32 variables where the list stays 10% higher.

Regardless, given the extremely narrow band where the list is better, it
looks relevant to switch to this in order to preserve the almost linear
performance of large setups. For example at 1000 variables and 10k
lookups, the tree is 18 times faster than the list.

In addition this reduces the size of the struct vars by 8 bytes since
there's a single pointer, though it could make sense to re-invest them
into a secondary head for example.
2024-09-15 23:49:01 +02:00
Willy Tarreau
a0205f9de4 IMPORT: import cebtree (compact elastic binary trees)
This is an import of the compact elastic binary trees at commit
a9cd84a ("OPTIM: descent: better prefetch less and for writes when
deleting")

These will be used to replace certain lists (and possibly certain
tree nodes as well). They're as fast (or even faster) than ebtrees
for lookups, as fast for insertion and slower for deletion, and a
node only uses 2 pointers (like a list).

The only changes were cebtree.h where common/tools.h was replaced
with ebtree.h which we already have and already provides the needed
functions and macros, and the addition of a wrapper cebtree-prv.h in
src/ to redirect to import/cebtree-prv.h.
2024-09-15 23:44:59 +02:00
Willy Tarreau
6e92988e20 MINOR: vars: remove the emptiness tests in callers before pruning
All callers of vars_prune_* currently check the list for emptiness.
Let's leave that to vars_prune() itself, it will ease some changes in
the code. Thanks to the previous inlining of the vars_prune() function,
there's no performance loss, and even a very tiny 0.1% gain.
2024-09-15 23:44:16 +02:00
Willy Tarreau
2c1a9c3a43 OPTIM: vars: inline vars_prune() to avoid many calls
Many configs don't have variables and call it for no reason, and even
configs with variables don't necessarily have some in all scopes.
2024-09-15 23:42:09 +02:00
Willy Tarreau
b11495652e BUG/MEDIUM: queue: implement a flag to check for the dequeuing
As unveiled in GH issue #2711, commit 5541d4995d ("BUG/MEDIUM: queue:
deal with a rare TOCTOU in assign_server_and_queue()") does have some
side effects in that it can occasionally cause an endless loop.

As Christopher analysed it, the problem is that process_srv_queue(),
which uses a trylock in order to leave only one thread in charge of
the dequeueing process, can lose the lock race against pendconn_add().
If this happens on the last served request, then there's no more thread
to deal with the dequeuing, and assign_server_and_queue() will loop
forever on a condition that was initially exepected to be extremely
rare (and still is, except that now it can become sticky). Previously
what was happening is that such queued requests would just time out
and since that was very rare, nobody would notice.

The root of the problem really is that trylock. It was added so that
only one thread dequeues at a time but it doesn't offer only that
guarantee since it also prevents a thread from dequeuing if another
one is in the process of queuing. We need a different criterion.

What we're doing now is to set a flag "dequeuing" in the server, which
indicates that one thread is currently in the process of dequeuing
requests. This one is atomically tested, and only if no thread is in
this process, then the thread grabs the queue's lock and dequeues.
This way it will be serialized with pendconn_add() and no request
addition will be missed.

It is not certain whether the original race covered by the fix above
can still happen with this change, so better keep that fix for now.

Thanks to @Yenya (Jan Kasprzak) for the precise and complete report
allowing to spot the problem.

This patch should be backported wherever the patch above was backported.
2024-09-13 08:35:47 +02:00
Aurelien DARRAGON
68cfb222b5 BUG/MEDIUM: pattern: prevent UAF on reused pattern expr
Since c5959fd ("MEDIUM: pattern: merge same pattern"), UAF (leading to
crash) can be experienced if the same pattern file (and match method) is
used in two default sections and the first one is not referenced later in
the config. In this case, the first default section will be cleaned up.
However, due to an unhandled case in the above optimization, the original
expr which the second default section relies on is mistakenly freed.

This issue was discovered while trying to reproduce GH #2708. The issue
was particularly tricky to reproduce given the config and sequence
required to make the UAF happen. Hopefully, Github user @asmnek not only
provided useful informations, but since he was able to consistently
trigger the crash in his environment he was able to nail down the crash to
the use of pattern file involved with 2 named default sections. Big thanks
to him.

To fix the issue, let's push the logic from c5959fd a bit further. Instead
of relying on "do_free" variable to know if the expression should be freed
or not (which proved to be insufficient in our case), let's switch to a
simple refcounting logic. This way, no matter who owns the expression, the
last one attempting to free it will be responsible for freeing it.
Refcount is implemented using a 32bit value which fills a previous 4 bytes
structure gap:

        int                        mflags;               /*    80     4 */

        /* XXX 4 bytes hole, try to pack */

        long unsigned int          lock;                 /*    88     8 */
(output from pahole)

Even though it was not reproduced in 2.6 or below by @asmnek (the bug was
revealed thanks to another bugfix), this issue theorically affects all
stable versions (up to c5959fd), thus it should be backported to all
stable versions.
2024-09-09 16:07:05 +02:00
Aaron Kuehler
50322dff81 MEDIUM: server: add init-state
Allow the user to set the "initial state" of a server.

Context:

Servers are always set in an UP status by default. In
some cases, further checks are required to determine if the server is
ready to receive client traffic.

This introduces the "init-state {up|down}" configuration parameter to
the server.

- when set to 'fully-up', the server is considered immediately available
  and can turn to the DOWN sate when ALL health checks fail.
- when set to 'up' (the default), the server is considered immediately
  available and will initiate a health check that can turn it to the DOWN
  state immediately if it fails.
- when set to 'down', the server initially is considered unavailable and
  will initiate a health check that can turn it to the UP state immediately
  if it succeeds.
- when set to 'fully-down', the server is initially considered unavailable
  and can turn to the UP state when ALL health checks succeed.

The server's init-state is considered when the HAProxy instance
is (re)started, a new server is detected (for example via service
discovery / DNS resolution), a server exits maintenance, etc.

Link: https://github.com/haproxy/haproxy/issues/51
2024-09-05 11:13:10 +02:00
Ilya Shipitsin
1f6e5f7a61 CLEANUP: assorted typo fixes in the code and comments
This is 43rd iteration of typo fixes
2024-09-03 17:49:21 +02:00
Christopher Faulet
a7f6b0ac03 MEDIUM: stick-table: Add support of a factor for IN/OUT bytes rates
Add a factor parameter to stick-tables, called "brates-factor", that is
applied to in/out bytes rates to work around the 32-bits limit of the
frequency counters. Thanks to this factor, it is possible to have bytes
rates beyond the 4GB. Instead of counting each bytes, we count blocks
of bytes. Among other things, it will be useful for the bwlim filter, to be
able to configure shared limit exceeding the 4GB/s.

For now, this parameter must be in the range ]0-1024].
2024-09-02 15:50:25 +02:00
Aperence
20efb856e1 MEDIUM: protocol: add MPTCP per address support
Multipath TCP (MPTCP), standardized in RFC8684 [1], is a TCP extension
that enables a TCP connection to use different paths.

Multipath TCP has been used for several use cases. On smartphones, MPTCP
enables seamless handovers between cellular and Wi-Fi networks while
preserving established connections. This use-case is what pushed Apple
to use MPTCP since 2013 in multiple applications [2]. On dual-stack
hosts, Multipath TCP enables the TCP connection to automatically use the
best performing path, either IPv4 or IPv6. If one path fails, MPTCP
automatically uses the other path.

To benefit from MPTCP, both the client and the server have to support
it. Multipath TCP is a backward-compatible TCP extension that is enabled
by default on recent Linux distributions (Debian, Ubuntu, Redhat, ...).
Multipath TCP is included in the Linux kernel since version 5.6 [3]. To
use it on Linux, an application must explicitly enable it when creating
the socket. No need to change anything else in the application.

This attached patch adds MPTCP per address support, to be used with:

  mptcp{,4,6}@<address>[:port1[-port2]]

MPTCP v4 and v6 protocols have been added: they are mainly a copy of the
TCP ones, with small differences: names, proto, and receivers lists.

These protocols are stored in __protocol_by_family, as an alternative to
TCP, similar to what has been done with QUIC. By doing that, the size of
__protocol_by_family has not been increased, and it behaves like TCP.

MPTCP is both supported for the frontend and backend sides.

Also added an example of configuration using mptcp along with a backend
allowing to experiment with it.

Note that this is a re-implementation of Björn's work from 3 years ago
[4], when haproxy's internals were probably less ready to deal with
this, causing his work to be left pending for a while.

Currently, the TCP_MAXSEG socket option doesn't seem to be supported
with MPTCP [5]. This results in a warning when trying to set the MSS of
sockets in proto_tcp:tcp_bind_listener.

This can be resolved by adding two new variables:
sock_inet(6)_mptcp_maxseg_default that will hold the default
value of the TCP_MAXSEG option. Note that for the moment, this
will always be -1 as the option isn't supported. However, in the
future, when the support for this option will be added, it should
contain the correct value for the MSS, allowing to correctly
set the TCP_MAXSEG option.

Link: https://www.rfc-editor.org/rfc/rfc8684.html [1]
Link: https://www.tessares.net/apples-mptcp-story-so-far/ [2]
Link: https://www.mptcp.dev [3]
Link: https://github.com/haproxy/haproxy/issues/1028 [4]
Link: https://github.com/multipath-tcp/mptcp_net-next/issues/515 [5]

Co-authored-by: Dorian Craps <dorian.craps@student.vinci.be>
Co-authored-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
2024-08-30 18:53:49 +02:00
Aperence
38618822e1 MINOR: server: add a alt_proto field for server
Add a new field alt_proto to the server structures that
specify if an alternate protocol should be used for this server.

This field can be transparently passed to protocol_lookup to get
an appropriate protocol structure.

This change allows thus to create servers with different protocols,
and not only TCP anymore.
2024-08-30 18:53:49 +02:00
Aperence
a7b04e383a MINOR: tools: extend str2sa_range to add an alt parameter
Add a new parameter "alt" that will store wether this configuration
use an alternate protocol.

This alt pointer will contain a value that can be transparently
passed to protocol_lookup to obtain an appropriate protocol structure.

This change is needed to allow for example the servers to know if it
need to use an alternate protocol or not.
2024-08-30 18:53:49 +02:00
Frederic Lecaille
f627b9272b BUG/MEDIUM: quic: always validate sender address on 0-RTT
It has been reported by Wedl Michael, a student at the University of Applied
Sciences St. Poelten, a potential vulnerability into haproxy as described below.

An attacker could have obtained a TLS session ticket after having established
a connection to an haproxy QUIC listener, using its real IP address. The
attacker has not even to send a application level request (HTTP3). Then
the attacker could open a 0-RTT session with a spoofed IP address
trusted by the QUIC listen to bypass IP allow/block list and send HTTP3 requests.

To mitigate this vulnerability, one decided to use a token which can be provided
to the client each time it successfully managed to connect to haproxy. These
tokens may be reused for future connections to validate the address/path of the
remote peer as this is done with the Retry token which is used for the current
connection, not the next one. Such tokens are transported by NEW_TOKEN frames
which was not used at this time by haproxy.

So, each time a client connect to an haproxy QUIC listener with 0-RTT
enabled, it is provided with such a token which can be reused for the
next 0-RTT session. If no such a token is presented by the client,
haproxy checks if the session is a 0-RTT one, so with early-data presented
by the client. Contrary to the Retry token, the decision to refuse the
connection is made only when the TLS stack has been provided with
enough early-data from the Initial ClientHello TLS message and when
these data have been accepted. Hopefully, this event arrives fast enough
to allow haproxy to kill the connection if some early-data have been accepted
without token presented by the client.

quic_build_post_handshake_frames() has been modified to build a NEW_TOKEN
frame with this newly implemented token to be transported inside.

quic_tls_derive_retry_token_secret() was renamed to quic_do_tls_derive_token_secre()
and modified to be reused and derive the secret for the new token implementation.

quic_token_validate() has been implemented to validate both the Retry and
the new token implemented by this patch. When this is a non-retry token
which could not be validated, the datagram received is marked as requiring
a Retry packet to be sent, and no connection is created.

When the Initial packet does not embed any non-retry token and if 0-RTT is enabled
the connection is marked with this new flag: QUIC_FL_CONN_NO_TOKEN_RCVD. As soon
as the TLS stack detects that some early-data have been provided and accepted by
the client, the connection is marked to be killed (QUIC_FL_CONN_TO_KILL) from
ha_quic_add_handshake_data(). This is done calling qc_ssl_eary_data_accepted()
new function. The secret TLS handshake is interrupted as soon as possible returnin
0 from ha_quic_add_handshake_data(). The connection is also marked as
requiring a Retry packet to be sent (QUIC_FL_CONN_SEND_RETRY) from
ha_quic_add_handshake_data(). The the handshake I/O handler (quic_conn_io_cb())
knows how to behave: kill the connection after having sent a Retry packet.

About TLS stack compatibility, this patch is supported by aws-lc. It is
disabled for wolfssl which does not support 0-RTT at this time thanks
to HAVE_SSL_0RTT_QUIC.

This patch depends on these commits:

     MINOR: quic: Add trace for QUIC_EV_CONN_IO_CB event.
     MINOR: quic: Implement qc_ssl_eary_data_accepted().
     MINOR: quic: Modify NEW_TOKEN frame structure (qf_new_token struct)
     BUG/MINOR: quic: Missing incrementation in NEW_TOKEN frame builder
     MINOR: quic: Token for future connections implementation.
     MINOR: quic: Implement quic_tls_derive_token_secret().
     MINOR: tools: Implement ipaddrcpy().

Must be backported as far as 2.6.
2024-08-30 17:04:09 +02:00
Frederic Lecaille
609b124561 MINOR: quic: Implement qc_ssl_eary_data_accepted().
This function is a wrapper around SSL_get_early_data_status() for
OpenSSL derived stack and SSL_early_data_accepted() boringSSL derived
stacks like AWS-LC. It returns true for a TLS server if it has
accepted the early data received from a client.

Also implement quic_ssl_early_data_status_str() which is dedicated to be used
for debugging purposes (traces). This function converts the enum returned
by the two function mentionned above to a human readable string.
2024-08-30 17:04:09 +02:00