22119 Commits

Author SHA1 Message Date
Amaury Denoyelle
250c19032f BUG/MINOR: server: reject enabled for dynamic server
Since their first implementation, dynamic servers are created into
maintenance state. This has been done purposely to avoid immediate
activation of a newly inserted server.

However, this principle is incompatible with the "enabled" keyword on
"add server". The newly created instance will be unreachable as the proxy
load-balancing algorithm is not informed of its presence via
srv_lb_propagate(). The new server could be unblocked by toggling its
state with "disable server" / "enable server" commands, which will
trigger srv_lb_propagate() invocation.

To avoid this unexpected state, simply forbid the "enabled" keyword for
dynamic servers. In the long term, it could be possible to re-authorize
it, but this requires at least calling srv_lb_propagate() on dynamic
server creation.

This should fix github issue #2497.

This patch should not be backported as-is, to avoid breaking dynamic
servers API on stable versions. "enabled" should instead be ignored for
them. This will be implemented in a dedicated patch on top of 2.9.
2024-03-28 11:51:05 +01:00
Remi Tricot-Le Breton
28dcb7bb64 REGTESTS: ssl: Add functional test for global ocsp-update option
Add tests for the 'tune.ssl.ocsp-update.mode' global option that can be
used to enable ocsp auto update on all certificates.
2024-03-27 11:38:28 +01:00
Remi Tricot-Le Breton
c42132b3d5 REGTESTS: ssl: Add OCSP update compatibility tests
Add tests that focus on the incompatibility checks on ocsp-update mode.
This test will only call "haproxy -c" on multiple configurations that
combine the crt-list 'ocsp-update' option and the global
'tune.ssl.ocsp-update.mode'.
2024-03-27 11:38:28 +01:00
Remi Tricot-Le Breton
7359c0c7f4 MEDIUM: ssl: Add 'tune.ssl.ocsp-update.mode' global option
This option can be used to set a default ocsp-update mode for all
certificates of a given conf file. It allows ocsp-update to be activated
on certificates without the need to create separate crt-lists. It can still
be superseded by the crt-list 'ocsp-update' option. It takes either "on"
or "off" as value and defaults to "off".
Since setting this new parameter to "on" would mean that we try to
enable ocsp-update on any certificate, and also certificates that don't
have an OCSP URI, the checks performed in ssl_sock_load_ocsp were
softened. We don't systematically raise an error when trying to enable
ocsp-update on a certificate that does not have an OCSP URI, be it via
the global option or the crt-list one. We will still raise an error when
a user tries to load a certificate that does have an OCSP URI but a
missing issuer certificate (if ocsp-update is enabled).
2024-03-27 11:38:28 +01:00
Remi Tricot-Le Breton
b1d623949c BUG/MINOR: ssl: Detect more 'ocsp-update' incompatibilities
The inconsistencies in 'ocsp-update' parameter were only checked when
parsing a crt-list line so if a certificate was used on a bind line
after being used in a crt-list with 'ocsp-update' set to 'on', then no
error would be raised. This patch helps detect such inconsistencies.

This patch can be backported up to branch 2.8.
2024-03-27 11:38:28 +01:00
Remi Tricot-Le Breton
97c2734f44 BUG/MINOR: ssl: Wrong ocsp-update "incompatibility" error message
In a crt-list such as the following:
    foo.pem [ocsp-update off] foo.com
    foo.pem bar.com
we would get a wrong "Incompatibilities found in OCSP update mode ..."
error message during init when the two lines are actually saying the
same thing since the default for 'ocsp-update' option is 'off'.

This patch can be backported up to branch 2.8.
2024-03-27 11:38:28 +01:00
Willy Tarreau
9cf3d1fcc0 [RELEASE] Released version 3.0-dev6
Released version 3.0-dev6 with the following main changes :
    - MINOR: mux-h2: always use h2c_report_glitch()
    - MEDIUM: mux-h2: allow to set the glitches threshold to kill a connection
    - MINOR: quic: simplify rescheduling for handshake
    - MINOR: quic: remove qc_treat_rx_crypto_frms()
    - DOC: configuration: clarify ciphersuites usage (V2)
    - MINOR: tools: use public interface for FreeBSD get_exec_path()
    - BUG/MINOR: ssl: fix possible ctx memory leak in sample_conv_aes_gcm()
    - BUG/MINOR: ssl: do not set the aead_tag flags in sample_conv_aes_gcm()
    - BUG/MINOR: server: fix first server template not being indexed
    - MEDIUM: ssl: initialize the SSL stack explicitely
    - MEDIUM: ssl: allow to change the OpenSSL security level from global section
    - CLEANUP: ssl: remove useless #ifdef in openssl-compat.h
    - CI: github: add -DDEBUG_LIST to the default builds
    - BUG/MINOR: hlua: segfault when loading the same filter from different contexts
    - BUG/MINOR: hlua: missing lock in hlua_filter_new()
    - BUG/MINOR: hlua: fix missing lock in hlua_filter_delete()
    - DEBUG: lua: precisely identify if stream is stuck inside lua or not
    - MINOR: hlua: use accessors for stream hlua ctx
    - BUG/MEDIUM: hlua: streams don't support mixing lua-load with lua-load-per-thread (2nd try)
    - MINOR: debug: enable insecure fork on the command line
    - CI: github: add -dI to haproxy arguments
    - BUG/MINOR: listener: Wake proxy's mngmt task up if necessary on session release
    - BUG/MINOR: listener: Don't schedule frontend without task in listener_release()
    - MINOR: session: rename private conns elements
    - BUG/MAJOR: server: do not delete srv referenced by session
    - BUG/MEDIUM: spoe: Don't rely on stream's expiration to detect processing timeout
    - BUG/MINOR: spoe: Be sure to be able to quickly close IDLE applets on soft-stop
    - MAJOR: spoe: Deprecate the SPOE filter
    - MINOR: cfgparse: Add a global option to expose deprecated directives
    - MINOR: spoe: Add SPOE filters in the exposed deprecated directives
    - CLEANUP: assorted typo fixes in the code and comments
    - CI: temporarily adjust kernel entropy to work with ASAN/clang
    - BUG/MEDIUM: spoe: Return an invalid frame on recv if size is too small
    - BUG/MINOR: session: ensure conn owner is set after insert into session
    - BUG/MEDIUM: http_ana: ignore NTLM for reuse aggressive/always and no H1
    - BUG/MAJOR: connection: fix server used_conns with H2 + reuse safe
    - BUG/MAJOR: ocsp: Separate refcount per instance and per store
    - REGTESTS: ssl: Add OCSP related tests
    - BUG/MEDIUM: ssl: Fix crash when calling "update ssl ocsp-response" when an update is ongoing
    - BUG/MEDIUM: ssl: Fix crash in ocsp-update log function
    - MEDIUM: ssl: Change output of ocsp-update log
    - MINOR: ssl: Change level of ocsp-update logs
    - CLEANUP: ssl: Remove undocumented ocsp fetches
    - REGTESTS: ssl: Add checks on ocsp-update log format
    - MINOR: connection: implement conn_release()
    - MINOR: connection: extend takeover with release option
    - MEDIUM: server: close idle conn on server deletion
    - MEDIUM: mux: prepare for takeover on private connections
    - MEDIUM: server: close private idle connection before server deletion
    - BUG/MINOR: mux-quic: close all QCS before freeing QCC tasklet
    - BUG/MEDIUM: mux-fcgi: Properly handle EOM flag on end-of-trailers HTX block
    - BUILD: server: fix build regression on old compilers (<= gcc-4.4)
    - OPTIM: http_ext: avoid useless copy in http_7239_extract_{ipv4,ipv6}
    - MINOR: debug: add "debug dev trace" to flood with traces
    - MINOR: atomic: add a read-specific variant of __ha_cpu_relax()
    - MINOR: applet: add new function applet_append_line()
    - MINOR: log/applet: add new function syslog_applet_append_event()
    - MEDIUM: ring/sink: use applet_append_line()/syslog_applet_append_event() for readers
    - REORG: dns/ring: split the ring between the generic one and the DNS one
    - MEDIUM: ring: move the ring reader code to ring_dispatch_messages()
    - MEDIUM: sink: move the generic ring forwarder code use ring_dispatch_messages()
    - MEDIUM: log/sink: make the log forwarder code use ring_dispatch_messages()
    - MINOR: buf: add b_add_ofs() to add a count to an absolute position
    - MINOR: buf: add b_rel_ofs() to turn an absolute offset into a relative one
    - MINOR: buf: add b_putblk_ofs() to copy a block at a specific position
    - MINOR: buf: add b_getblk_ofs() that works relative to area and not head
    - MINOR: ring: make the ring reader use only absolute offsets
    - MINOR: ring: reserve one special value for the readers count
    - MINOR: vecpair: add new vector pair based data manipulation mechanisms
    - MINOR: vecpair: add necessary functions to use vecpairss from/to ring APIs
    - MINOR: ring: rename totlen vs msglen in ring_write()
    - MINOR: ring: add ring_data() to report the amount of data in a ring
    - MINOR: ring: add ring_size() to return the ring's size
    - MINOR: ring: add ring_dup() to copy a ring into another one
    - MINOR: ring: also add ring_area(), ring_head(), ring_tail()
    - MINOR: ring: make callers use ring_data() and ring_size(), not ring->buf
    - MINOR: errors: use ring_dup() to duplicate the startup_logs
    - MINOR: ring: use ring_size(), ring_area(), ring_head() and ring_tail()
    - MINOR: ring: add a flag to indicate a mapped file
    - MAJOR: ring: insert an intermediary ring_storage level
    - MINOR: ring: resize only under thread isolation
    - MINOR: ring: allow to reduce a ring size
    - MEDIUM: ring: replace the buffer API in ring_write() with the vec<->ring API
    - MEDIUM: ring: change the ring reader to use the new vector-based API now
    - MEDIUM: ring: remove the struct buffer from the ring
    - MEDIUM: ring: align the head and tail fields in the ring_storage structure
    - MINOR: ring: make the reader check the readers count before inc/dec
    - MEDIUM: ring: lock the tail's readers counters before proceeding with the changes
    - MEDIUM: ring: protect the reader's positions against writers
    - MEDIUM: ring: use the topmost bit of the tail as a lock
    - MEDIUM: move the ring's lock to only protect the readers list
    - MEDIUM: ring: unlock the ring's tail earlier
    - MINOR: ring: don't take the readers lock if there are no readers
    - MEDIUM: ring/applet: turn the wait_entry list to an mt_list instead
    - MEDIUM: ring: protect the initialization of the initial reader offset
    - MINOR: ring: make sure ring_dispatch waits when facing a changing message
    - MAJOR: ring: drop the now unneeded lock
    - OPTIM: ring: don't even try to update offset when failed to read
    - OPTIM: ring: have only one thread at a time wake up all readers
    - MINOR: ring: keep a few frequently used pointers in the local stack
    - MINOR: ring: add the definition of a ring waiting cell
    - MINOR: ring: make the number of queues configurable
    - MAJOR: ring: implement a waiting queue in front of the ring
    - MEDIUM: ring: significant boost in the loop by checking the ring queue ptr first
    - MEDIUM: ring: improve speed in the queue waiting loop on x86_64
    - MINOR: ring: simplify the write loop a little bit
    - CLEANUP: ring: further simplify the write loop
    - MINOR: ring: it's not x86 but all non-ARMv8.1 which needs the read before OR
    - MINOR: ring: avoid writes to cells during copy
    - OPTIM: ring: use relaxed stores to release the threads
    - CLEANUP: ring: use only curr_cell and not next_cell in the main write loop
    - BUILD: ssl: fix build error on older compilers with openssl-3.2
    - BUG/MINOR: server: 'source' interface ignored from 'default-server' directive
    - BUG/MAJOR: ring: free the ring storage not the ring itself when using maps
2024-03-26 15:36:49 +01:00
Willy Tarreau
40d1c84bf0 BUG/MAJOR: ring: free the ring storage not the ring itself when using maps
A recent issue was uncovered by the CI which started to randomly report
segfaults on a few tests, and more systematically on FreeBSD. It turns
out that it was introduced by recent commit 03816ccfa9 ("MAJOR: ring:
insert an intermediary ring_storage level"), which overlooked the munmap()
path of the sink and startup logs: once the ring and its storage were
split, it was no longer correct to munmap() the ring, only its storage
area needs to be unmapped, and the ring must always be freed separately.

Thanks to Christopher and William for their help in trying to reproduce
it and figure out the circumstances that trigger it.

No backport is needed.
2024-03-26 15:15:59 +01:00
Aurelien DARRAGON
bd98db5078 BUG/MINOR: server: 'source' interface ignored from 'default-server' directive
Sebastien Gross reported that 'interface' keyword ('source' subargument)
is silently ignored when used from 'default-server' directive despite the
documentation implicitly stating that the keyword should be supported
there.

When support for 'source' keyword was added to 'default-server' directive
in dba97077 ("MINOR: server: Make 'default-server' support 'source'
keyword."), we properly duplicated the conn iface_name from the default-
server but we forgot to copy the conn iface_len which must be set as well
since it is used as setsockopt()'s 'optlen' argument in
tcp_connect_server().

It should be backported to all stable versions.
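
The kind of omission described above can be sketched as follows. This is
only an illustration: struct src_sketch and src_sketch_dup_iface() are
hypothetical names, not HAProxy's actual structures or functions. The
point is that duplicating the name alone leaves the length at zero, and
that length is what is later used as the option length.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the source settings; not HAProxy's real struct. */
struct src_sketch {
    char *iface_name;   /* interface name, may be NULL */
    size_t iface_len;   /* length of iface_name, later used as an option length */
};

/* Copies the interface settings from <src> to <dst>.
 * Returns 1 on success, 0 on allocation failure.
 */
int src_sketch_dup_iface(struct src_sketch *dst, const struct src_sketch *src)
{
    assert(dst != NULL && src != NULL);

    dst->iface_name = NULL;
    dst->iface_len = 0;
    if (src->iface_name == NULL)
        return 1;

    dst->iface_name = malloc(src->iface_len + 1);
    if (dst->iface_name == NULL)
        return 0;
    memcpy(dst->iface_name, src->iface_name, src->iface_len);
    dst->iface_name[src->iface_len] = '\0';

    /* The step that is easy to forget: the length must follow the name. */
    dst->iface_len = src->iface_len;
    return 1;
}
```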
2024-03-26 11:09:02 +01:00
Willy Tarreau
2431b20640 BUILD: ssl: fix build error on older compilers with openssl-3.2
OpenSSL 3.2 triggers the code part added by commit 25da217 ("MINOR: ssl:
Update ssl_fc_curve/ssl_bc_curve to use SSL_get0_group_name") which
contains a variable declaration in the for() statement and breaks on
older compilers, as reported in GH issue #2501.

Let's just declare it normally to fix the problem. This must be
backported wherever the commit above is (at least 2.9).
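
For illustration, the portable form looks like the sketch below. The
function find_group() and its arguments are hypothetical and unrelated to
the real ssl code; only the placement of the loop counter matters. A
declaration inside the for() statement requires C99, while declaring the
variable at the top of the block is accepted by older compilers.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: returns the index of <wanted> in <names>, or -1. */
int find_group(const char *const *names, size_t count, const char *wanted)
{
    size_t i; /* declared normally, not as "for (size_t i = 0; ...)" */

    assert(names != NULL && wanted != NULL);
    for (i = 0; i < count; i++) {
        if (strcmp(names[i], wanted) == 0)
            return (int)i;
    }
    return -1;
}
```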
2024-03-25 21:21:47 +01:00
Willy Tarreau
4bc81ec985 CLEANUP: ring: use only curr_cell and not next_cell in the main write loop
It turns out that we can reduce by one variable in the loop and this
clobbers one less register, making it slightly faster on Cortex A72.
2024-03-25 17:34:19 +00:00
Willy Tarreau
0a0a64ef02 OPTIM: ring: use relaxed stores to release the threads
We don't care in what order the threads are released, so we can write
their sent value using relaxed atomic stores. This brings a 3-5% perf
boost on ARM with 80 cores, reaching 7.25M/s, and doesn't change
anything on x86 since it keeps using strict ordering.
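
A minimal sketch of the idea using C11 atomics rather than HAProxy's own
atomic macros. struct waiter_sketch and release_waiters() are hypothetical
names. The sketch only shows the relaxed stores themselves; how the
copied data is made visible to the released threads is not covered here.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical waiting thread descriptor: it spins until <sent> is non-zero. */
struct waiter_sketch {
    atomic_uint sent;
};

/* Releases <count> waiters by publishing <value> to each of them. The order
 * in which they observe it does not matter, hence the relaxed stores.
 */
void release_waiters(struct waiter_sketch *w, size_t count, unsigned int value)
{
    size_t i;

    assert(w != NULL || count == 0);
    for (i = 0; i < count; i++)
        atomic_store_explicit(&w[i].sent, value, memory_order_relaxed);
}
```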
2024-03-25 17:34:19 +00:00
Willy Tarreau
cabe945876 MINOR: ring: avoid writes to cells during copy
It has been found that performing a first pass consisting in copying
all messages, and a second one to notify about releases is more efficient
on AMD than updating all of them on the fly using a CAS, despite making
writers wait longer to be released.

Maybe it's related to the ability for the CPU to prefetch the contents
during a simple load while it wouldn't do it for an XCHG, it's unclear
at this point. This will also later permit the use of relaxed stores to
release threads.

On ARM the performance increased to 7.0M/s. If this patch is applied
before the dropping of the intermediary step, instead it drops to
3.9M/s. This shows the dependency between such changes that strive to
limit the number of writes on the fast path.

On x86_64, the EPYC at 3C6T saw a small drop from 4.57M to 4.45M, but
the 24C48T setup saw a nice 33% boost from 3.33M to 4.44M, i.e. we
get stable perf at 3 and 24 cores, despite having 8 CCX involved and
fighting with each other.

Other possibilities are:
  - use of HA_ATOMIC_XCHG() instead of FETCH_OR()
    => slightly faster (4.62/7.37 vs 4.58/7.34). Pb: requires to
       modify the readers to wait much longer since the tail value
       won't be valid in this case during updates, and it will have
       to wait by looping over it.
  - use other conditions to release a cell
    => to be tested
2024-03-25 17:34:19 +00:00
Willy Tarreau
39df8c903d MINOR: ring: it's not x86 but all non-ARMv8.1 which needs the read before OR
Archs relying on CAS benefit from a read prior to FETCH_OR, so it's
not just x86 that benefits from this. Let's just change the condition
to only exclude __ARM_FEATURE_ATOMICS which is the only one faster
without.
2024-03-25 17:34:19 +00:00
Willy Tarreau
e6fc167aec CLEANUP: ring: further simplify the write loop
The loop was cleaned up a little bit so that the inner loops are more
readable and that the ifdef'd parts are whole blocks and not just an
"if" condition. A few conditions were adjusted to benefit from "break"
and "continue".
2024-03-25 17:34:19 +00:00
Willy Tarreau
4b984c5baa MINOR: ring: simplify the write loop a little bit
This is mostly a cleanup in that it turns the two-level loop into a
single one, but it also simplifies the code a little bit and brings
some performance savings again, which are mostly noticeable on ARM,
but don't change anything for x86.
2024-03-25 17:34:19 +00:00
Willy Tarreau
573bbbe127 MEDIUM: ring: improve speed in the queue waiting loop on x86_64
x86_64 doesn't have a native atomic FETCH_OR(), it's implemented using
a CAS, which will always cause a write cycle. Here we know we can just
wait as long as the lock bit is held so better loop on a load, and only
attempt the CAS on success. This requires a tiny ifdef and brings nice
benefits. This brings the performance back from 3.33M to 3.75M at 24C48T
while doing no change at 3C6T.
2024-03-25 17:34:19 +00:00
Willy Tarreau
30a659c355 MEDIUM: ring: significant boost in the loop by checking the ring queue ptr first
By doing that and placing the cpu_relax at the right places, the ARM
reaches 6.0M/s on 80 threads. On x86_64, at 3C6T the EPYC sees a small
increase from 4.45M to 4.57M but at 24C48T it sees a drop from 3.82M
to 3.33M due to the write contention hidden behind the CAS that
implements the FETCH_OR(), that we'll address next.
2024-03-25 17:34:19 +00:00
Willy Tarreau
1e2311edbc MAJOR: ring: implement a waiting queue in front of the ring
The queue-based approach consists in forcing threads to wait away from
the work area so as not to disturb the current writer, and to prepare
the work by grouping them in a queue. The last arrived takes the head
of the queue by placing its preinitialized ring cell there, becomes the
queue's leader, informs itself about the amount of previously accumulated
bytes so that when its turn comes, it immediately knows how much room is
needed to be released.

It can then take the whole queue with it, leaving an empty one for new
threads to come while it's releasing the room needed to copy everything.

By doing so we're cascading contention areas so that multiple parts can
work in parallel.

Note that we must never leave a write counter set to 0xFF at tail, and
this happens when a message cannot fit and we give up, because in this
case we're writing back tail_ofs, and only later we restore the counter.

The solution here is to make a special case when we're going to drop
the messages, and to write the readers count before restoring tail.

This already shows a tremendous performance gain on ARM (385k -> 4.8M),
thanks to the fact that now all waiting threads wait on the queue's
head instead of polluting the tail lock. On x86_64, the EPYC sees a big
boost at 24C48T (1.88M -> 3.82M) and a slowdown at 3C6T (6.0->4.45)
though this one is much less of a concern as so few threads need less
bandwidth than bigger counts.
2024-03-25 17:34:19 +00:00
Willy Tarreau
6c1b29d06f MINOR: ring: make the number of queues configurable
Now the rings have one wait queue per group. This should limit the
contention on systems such as EPYC CPUs where the performance drops
dramatically when using more than one CCX.

Tests were run with different numbers and it was shown that value
6 outperforms all other ones at 12, 24, 48, 64 and 80 threads on an
EPYC, a Xeon and an Ampere CPU. Value 7 sometimes comes close and
anything around these values degrades quickly. The value has been
left tunable in the global section.

This commit only introduces everything needed to set up the queue count
so that it's easier to adjust it in the forthcoming patches, but it was
initially added after the series, making it harder to compare.

It was also shown that trying to group the threads in queues by their
thread groups is counter-productive and that it was more efficient to
do that by applying a modulo on the thread number. As surprising as it
seems, it does have the benefit of well balancing any number of threads.
2024-03-25 17:34:19 +00:00
Willy Tarreau
e3f101a19a MINOR: ring: add the definition of a ring waiting cell
This is what will be used to describe one waiting thread, its message
in the queues, and the aggregation of pending messages after it.
2024-03-25 17:34:19 +00:00
Willy Tarreau
447189f286 MINOR: ring: keep a few frequently used pointers in the local stack
Code disassembly shows that ring->storage->tail and ring->queue are
accessed a lot and reloaded a lot due to aliasing. Let's just have
variables for them in the local stack. It makes the code smaller and
slightly faster.
2024-03-25 17:34:19 +00:00
Willy Tarreau
c7bd7a68e4 OPTIM: ring: have only one thread at a time wake up all readers
It's inefficient and counter-productive that each ring writer iterates
over all readers to wake them up. Let's just have one in charge of this,
it strongly limits contention. The only thing is that since the thread
is iterating over a list, we want to be sure that if the first readers
have already completed their job, they will be woken up again. For this
we keep a counter of messages delivered after the wakeup started, and
the waking thread will check it before going back to sleep. In order to
avoid looping forever, it will also drop its waking flag soon enough to
possibly let another one take it.

There used to be a few cases of watchdogs on the list iteration before
this on a 24-core AMD EPYC platform; those never appeared anymore.
The perf has dropped a bit on 3C6T on the EPYC, from 6.61 to 6.0M but
remains unchanged at 24C48T.
2024-03-25 17:34:19 +00:00
Willy Tarreau
1f8b14b7be OPTIM: ring: don't even try to update offset when failed to read
If there's nothing to read, it's pointless for a reader to try to update
the offset pointer, that's two atomic ops to replace a value by itself
twice. Let's just stop this.
2024-03-25 17:34:19 +00:00
Willy Tarreau
9e99cfbeb6 MAJOR: ring: drop the now unneeded lock
It was only used to protect the list which is now an mt_list so it
doesn't provide any required protection anymore. It obviously also
used to provide strict ordering between the writer and the reader
when the writer started to update the messages, but that's now
covered by the ordered tail updates and updates to the readers
count to protect the area.

The message rate on small thread counts (up to 12) saw a boost of
roughly 5%, while for large counts it lost about 2% due to some
contention now becoming visible elsewhere.
Typical measures are 6.13M -> 6.61M at 3C6T, and 1.88 -> 1.92M at
24C48T on the EPYC.
2024-03-25 17:34:19 +00:00
Willy Tarreau
cb482f92c4 MINOR: ring: make sure ring_dispatch waits when facing a changing message
The writer is using tags 0xFF instead of readers count at the front of
messages that are undergoing an update, while the tail has already been
updated. The reader needs to take care of this because it can face these
messages and mistakenly parse data that's still being written, leading
to corruption (especially if this happens while the size is changing).

Let's just stop reading when facing reserved codes, since they indicate
that the end of usable messages was reached.
2024-03-25 17:34:19 +00:00
Willy Tarreau
31b93b40b0 MEDIUM: ring: protect the initialization of the initial reader offset
Since we're going to remove the lock, there's no more way to prevent the
ring from being fed while we're attaching a client to it. We need to
freeze the buffer while looking at its head so that we can attach there
and have a trustable one. We could do it by setting the lock bit on the
tail offset but quite frankly we don't need to bother with that, attaching
a client is rare enough to permit a thread_isolate().
2024-03-25 17:34:19 +00:00
Willy Tarreau
a2d2dbf210 MEDIUM: ring/applet: turn the wait_entry list to an mt_list instead
Rings are keeping a lock only for the list, which apparently doesn't
need anything more than an mt_list, so let's first turn it into that
before dropping the lock. There should be no visible effect.
2024-03-25 17:34:19 +00:00
Willy Tarreau
04f1e3f3d9 MINOR: ring: don't take the readers lock if there are no readers
There's no point looking for freshly attached readers if there are none,
taking this lock requires an atomic write to a shared area, something we
clearly want to avoid.

A general test with 213-byte messages on different thread counts shows
how the performance degrades across CCX and how this patch improves the
situation:
                   Before          After
    3C6T/1CCX:    6.39 Mmsg/s     6.35 Mmsg/s
   6C12T/2CCX:    2.90 Mmsg/s     3.16 Mmsg/s
  12C24T/4CCX:    2.14 Mmsg/s     2.33 Mmsg/s
  24C48T/8CCX:    1.75 Mmsg/s     1.92 Mmsg/s

This tends to confirm that the queues will really be needed and that
they'll have to be per-ccx hence per thread-group. They will amortize
the number of updates on head & tail (one per multiple messages).
2024-03-25 17:34:19 +00:00
Willy Tarreau
41d3ea521b MEDIUM: ring: unlock the ring's tail earlier
We know we can continue to protect the message area so we can unlock the
tail as soon as we know its new value. Now we're seeing ~6.4M msg/s vs
5.4M previously on 3C6T of a 3rd gen EPYC, and 1.88M vs 1.54M for 24C48T
threads, which is a significant gain!

This requires carefully writing the new head counter before releasing
the writers, and changing the calculation of the work area from
tail..head to tail..new_tail while writing the message.
2024-03-25 17:34:19 +00:00
Willy Tarreau
3cdd3d27a8 MEDIUM: move the ring's lock to only protect the readers list
Now the lock is only taken around the readers list. With careful
ordering of writes to head/tail, the ring remains protected.

The perf is a bit better, though (1.54M msg/s vs 1.4M at 48T on
a 3rd gen EPYC, and 5.4M vs 5.3M for a 3C6T setup).
2024-03-25 17:34:19 +00:00
Willy Tarreau
eb3d5f464d MEDIUM: ring: use the topmost bit of the tail as a lock
We're now locking the tail while looking for some room in the ring. In
fact it's still while writing to it, but the goal definitely is to get
rid of the lock ASAP. For this we reserve the topmost bit of the tail
as a lock, which may have as a possible visible effect that buffers will
be limited to 2GB instead of 4GB on 32-bit machines (though good luck
allocating more than 2GB contiguous on 32-bit), but in practice, since
the size is read with atol() and some operating systems limit it to
LONG_MAX unless passing negative numbers, the limit is already there.

For now the impact on x86_64 is significant (drop from 2.35 to 1.4M/s
on 48 threads on EPYC 24 cores) but this situation is only temporary
so that changes can be reviewable and bisectable.

Other approaches were attempted, such as using XCHG instead, which is
slightly faster on x86 with low thread counts (but causes more write
contention), and forces readers to stall under heavy traffic because
they can't access a valid value for the queue anymore. A CAS requires
preloading the value and is less good on ARMv8.1. XADD could also be
considered with 12-13 upper bits of the offset dedicated to locking,
but that looks overkill.
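
Reserving the topmost bit of an offset as a lock can be sketched with C11
atomics as below. The names SKETCH_TAIL_LOCK, tail_lock() and
tail_unlock() are hypothetical and the memory orderings are an assumption
of this sketch. The fetch-or returns the previous value, so a caller
knows it owns the lock when that value did not already carry the bit.

```c
#include <assert.h>
#include <limits.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical lock bit: the topmost bit of the offset, which halves the
 * largest offset that can be represented.
 */
#define SKETCH_TAIL_LOCK ((size_t)1 << (sizeof(size_t) * CHAR_BIT - 1))

/* Locks <tail> and returns the offset it held. */
size_t tail_lock(atomic_size_t *tail)
{
    size_t prev;

    assert(tail != NULL);
    do {
        prev = atomic_fetch_or_explicit(tail, SKETCH_TAIL_LOCK,
                                        memory_order_acquire);
    } while (prev & SKETCH_TAIL_LOCK); /* someone else holds it, retry */
    return prev;
}

/* Publishes the new offset, which also clears the lock bit. */
void tail_unlock(atomic_size_t *tail, size_t new_ofs)
{
    assert(!(new_ofs & SKETCH_TAIL_LOCK));
    atomic_store_explicit(tail, new_ofs, memory_order_release);
}
```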
2024-03-25 17:34:19 +00:00
Willy Tarreau
2192983ffd MEDIUM: ring: protect the reader's positions against writers
The reader now needs to protect the positions it's reading. This is
already done via the readers counter at the beginning of messages,
but as long as the lock is present, this counter is decremented
before starting to parse messages, and incremented at the end.

We must now do that in reverse, first protect the end of the messages,
and only then remove ourselves from the already processed messages, so
that at no point could a writer pass over and possibly overwrite data
we're currently watching.
2024-03-25 17:34:19 +00:00
Willy Tarreau
73b2436fe6 MEDIUM: ring: lock the tail's readers counters before proceeding with the changes
The goal here is to start to protect the writing area inside the area
itself so that we'll later be able to release the ring's lock. We're not
there yet, but at least the tail is marked as protected for as long as the
message is not fully written.
2024-03-25 17:34:19 +00:00
Willy Tarreau
d336d71cbb MINOR: ring: make the reader check the readers count before inc/dec
We'll want to reserve some special values for the readers count to
temporarily lock the following message, but for this it will be mandatory
that readers check for them before incrementing/decrementing the counter.
Let's do that using a CAS. The readers performance is not as critical as
the writer's anyway so the slight overhead is not a problem.
2024-03-25 17:34:19 +00:00
Willy Tarreau
dd8ea5d928 MEDIUM: ring: align the head and tail fields in the ring_storage structure
We really want to let the readers and writers act on different areas, so
we want to have the tail and the head on separate cache lines, themselves
separate from the rest of the ring. Doing so improves the performance from
2.15 to 2.35M msg/s at 48 threads on a 24-core EPYC.

This increases the header space from 32 to 192 bytes when threads are
enabled. But since we already have the header size available in the file,
haring remains able to detect the aligned vs unaligned formats and call
dump_v2a() when aligned is detected.
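
The separation can be sketched with C11 alignment specifiers. This is not
the real ring_storage layout: struct storage_sketch, its fields and the
64-byte line size are assumptions made for the illustration. Under these
assumptions the header takes three cache lines.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>

#define SKETCH_CACHELINE 64 /* assumed cache line size */

/* Hypothetical header: the tail and the head each start their own line. */
struct storage_sketch {
    size_t size;
    size_t rsvd;
    alignas(SKETCH_CACHELINE) size_t tail;
    alignas(SKETCH_CACHELINE) size_t head;
};

static_assert(offsetof(struct storage_sketch, tail) == SKETCH_CACHELINE,
              "tail starts the second cache line");
static_assert(offsetof(struct storage_sketch, head) == 2 * SKETCH_CACHELINE,
              "head starts the third cache line");
static_assert(sizeof(struct storage_sketch) == 3 * SKETCH_CACHELINE,
              "the header spans three cache lines");
```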
2024-03-25 17:34:19 +00:00
Willy Tarreau
bf3dead20c MEDIUM: ring: remove the struct buffer from the ring
The purpose is to store a head and a tail that are independent so that
we can further improve the API to update them independently from each
other.

The struct was arranged like the original one so that as long as a ring
has its head set to zero (i.e. no recycling) it will continue to work.
The new format is already detectable thanks to the "rsvd" field which
indicates the number of reserved bytes at the beginning. It's located
where the buffer's area pointer previously was, so that older versions
of haring can continue to open the ring in repair mode, and newer ones
can use the fact that the upper bits of that variable are zero to guess
that it's working with the new format instead of the old one. Also let's
keep in mind that the layout will further change to place some alignment
constraints.

The haring tool was thus updated based on this and it detects that the
rsvd field is smaller than a page and that the sum of it with the size
equals the mapped size, in which case it uses the new dump_v2() function
instead of dump_v1(). The new function also creates a buffer from the
ring's area, size, head and tail and calls the generic one so that no
other code had to be adapted.
2024-03-25 17:34:19 +00:00
Willy Tarreau
01aa0a057c MEDIUM: ring: change the ring reader to use the new vector-based API now
The code now looks cleaner and more easily shows what still needs to be
addressed. There are not that many changes in practice, these are mostly
mechanical, essentially hiding the buffer from the callers.
2024-03-25 17:34:19 +00:00
Willy Tarreau
4e6fadb8a1 MEDIUM: ring: replace the buffer API in ring_write() with the vec<->ring API
This is the start of the replacement of the buffer API calls. Only the
ring_write() function was touched. Instead of manipulating a buffer all
along, we now extract the ring buffer's head and tail upon entry, store
them locally and use them using the vec<->ring API until the last moment
where we can update the buffer with the new values. One subtle point is
that we must never fill the buffer past the last byte otherwise the
vec-to-ring conversion gets lost and there's no more possibility to know
where's the beginning nor the end (just like when dealing with head+tail
in fact), because it then becomes impossible to distinguish between an
empty and a full buffer.
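
The constraint can be sketched with a plain head/tail pair. struct
ring_sketch and its helpers are hypothetical and much simpler than the
vec<->ring API. By always leaving one byte unused, head == tail can only
mean "empty".

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical ring described only by its size and two offsets. */
struct ring_sketch {
    size_t size; /* total area size, at least 1 */
    size_t head; /* offset of the first used byte */
    size_t tail; /* offset of the first free byte */
};

/* Number of bytes currently stored. */
size_t ring_sketch_used(const struct ring_sketch *r)
{
    assert(r != NULL && r->size > 0);
    return (r->tail + r->size - r->head) % r->size;
}

/* Number of bytes that may still be written: one byte always stays free. */
size_t ring_sketch_room(const struct ring_sketch *r)
{
    return r->size - 1 - ring_sketch_used(r);
}

/* Advances the tail by <len> bytes. Returns 1 on success, 0 if it would
 * fill the ring up to the last byte.
 */
int ring_sketch_reserve(struct ring_sketch *r, size_t len)
{
    if (len > ring_sketch_room(r))
        return 0;
    r->tail = (r->tail + len) % r->size;
    return 1;
}
```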
2024-03-25 17:34:19 +00:00
Willy Tarreau
4e6de42b27 MINOR: ring: allow to reduce a ring size
In ring_resize() we used to check if the new ring was at least as large
as the previous one before resizing it, but what counts is that it's as
large as the previous one's contents. Initially it was thought this
would not really matter, but given that rings are initially created as
BUFSIZE, it's currently not possible to shrink them for debugging
purposes. Now with this change it is.
2024-03-25 17:34:19 +00:00
Willy Tarreau
0fa05ce171 MINOR: ring: resize only under thread isolation
The ring resizing was already quite tricky, but when facing atomic
writes it will no longer be possible and we definitely do not want to
have to deal with a lock there. Since it's only done at boot time, and
possibly later from the CLI, let's just do it under thread isolation.
2024-03-25 17:34:19 +00:00
Willy Tarreau
03816ccfa9 MAJOR: ring: insert an intermediary ring_storage level
We'll need to add more complex structures in the ring, such as wait
queues. That's far too much to be stored into the area in case of
file-backed contents, so let's split the ring definition and its
storage once and for all.

This patch introduces a struct ring_storage which is assigned to
ring->storage, which contains minimal information to represent the
storage layout, i.e. for now only the buffer, and all the rest
remains in the ring itself. The storage is appended immediately after
it and the buffer's pointer always points to that area. It has the
benefit of remaining 100% compatible with the existing file-backed
layout. In memory, the allocation loses the size of a struct buffer.
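
As a rough sketch of that layout, assuming simplified stand-ins for the
real types (demo_buffer, demo_storage and demo_storage_new() are
hypothetical names, and the actual definitions may differ), the storage
header can be followed directly by its area within a single allocation:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Simplified stand-in for a buffer descriptor. */
struct demo_buffer {
	size_t size;  /* size of the storage area */
	char *area;   /* points to the storage area */
	size_t data;  /* amount of data present */
	size_t head;  /* offset of the first byte of data */
};

/* Hypothetical storage: the descriptor immediately followed by the area. */
struct demo_storage {
	struct demo_buffer buf;
	char area[];  /* storage area begins immediately here */
};

/* Allocates the descriptor and its area at once and makes the buffer's
 * pointer designate that area. Returns NULL on failure.
 */
static struct demo_storage *demo_storage_new(size_t size)
{
	struct demo_storage *st;

	assert(size > 0);
	if (size > SIZE_MAX - sizeof(*st))
		return NULL;

	st = calloc(1, sizeof(*st) + size);
	if (!st)
		return NULL;

	st->buf.size = size;
	st->buf.area = st->area;
	return st;
}
```

The usable area is what remains after the descriptor, which matches the
remark that the allocation loses the size of one struct buffer.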

It's not even certain it's worth placing the size there, given that it's
constant and that a dump of a ring wouldn't really need it (the file size
is sufficient). But for now everything comes with the struct buffer, and
later this will change once split into head and tail. Also this area may
be completed with more information in the future (e.g. storage version,
format, endianness, word size etc).
2024-03-25 17:34:19 +00:00
Willy Tarreau
01abdcb307 MINOR: ring: add a flag to indicate a mapped file
Till now we used to rely on a heuristic pointer comparison to check if
a ring was mapped or allocated. Better assign a flag to clarify this
because it's going to become difficult otherwise.
2024-03-25 17:34:19 +00:00
Willy Tarreau
80441a6983 MINOR: ring: use ring_size(), ring_area(), ring_head() and ring_tail()
Some open-coded constructs were updated to make use of the ring accessors
instead. This removes a few more direct dependencies on the buffers
API.
2024-03-25 17:34:19 +00:00
Willy Tarreau
a75052d665 MINOR: errors: use ring_dup() to duplicate the startup_logs
In startup_logs_dup() we currently need to reference the ring's buffer.
Better not do this, as it will complicate operations when switching to
other types.
2024-03-25 17:34:19 +00:00
Willy Tarreau
7c9ce715c9 MINOR: ring: make callers use ring_data() and ring_size(), not ring->buf
As we're going to remove the ring's buffer, we don't want callers to access
it directly, so let's use ring_data() and ring_size() instead for this.
2024-03-25 17:34:19 +00:00
Willy Tarreau
b30fd8cc2d MINOR: ring: also add ring_area(), ring_head(), ring_tail()
These will essentially be used to simplify the conversion to a new API.
2024-03-25 17:34:19 +00:00
Willy Tarreau
dc4836c15c MINOR: ring: add ring_dup() to copy a ring into another one
This will mostly be used during reallocation and boot-time duplicates,
the purpose is simply to save the caller from having to know the details
of the internal representation.
2024-03-25 17:34:19 +00:00
Willy Tarreau
a185d3d90d MINOR: ring: add ring_size() to return the ring's size
This is just to ease conversion so that callers stop accessing the ring's
buffer.
2024-03-25 17:34:19 +00:00
Willy Tarreau
4c41fcd0da MINOR: ring: add ring_data() to report the amount of data in a ring
This will be used as an accessor for the few functions that need this
outside of ring.c.
2024-03-25 17:34:19 +00:00