Commit Graph

Amaury Denoyelle 4fb538d4b6 MEDIUM: h2: reverse connection after SETTINGS reception
Reverse the connection after SETTINGS reception if it was set as
reversible. This operation is done in a new function, h2_conn_reverse().
It regroups the common changes needed for both reversal directions:
H2_CF_IS_BACK is set or unset and the timeouts are inverted.

For the moment, only passive reverse is fully implemented. Once done,
the connection instance is directly inserted into its targeted server's
pool. It can then be used immediately for future transfers using this
server.
2023-08-24 14:49:03 +02:00
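
A minimal sketch of the reversal described above, using simplified
stand-in types rather than HAProxy's real h2c struct:

    /* Sketch only: flip an H2 connection's role after SETTINGS reception.
     * The struct and fields below are illustrative stand-ins. */
    #define H2_CF_IS_BACK 0x0001

    struct h2c_sketch {
        unsigned int flags;
        int front_timeout;   /* timeout applied on the frontend side */
        int back_timeout;    /* timeout applied on the backend side */
        int cur_timeout;     /* the one currently armed */
    };

    static void h2_conn_reverse_sketch(struct h2c_sketch *h2c)
    {
        /* common to both directions: invert the side flag */
        h2c->flags ^= H2_CF_IS_BACK;

        /* and invert the timeouts accordingly */
        h2c->cur_timeout = (h2c->flags & H2_CF_IS_BACK) ?
                           h2c->back_timeout : h2c->front_timeout;
    }
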
Amaury Denoyelle 1f76b8ae07 MEDIUM: connection: implement passive reverse
Define a new method conn_reverse(). This method is used to reverse a
connection from frontend to backend or vice-versa depending on its
initial status.

For the moment, only passive reverse is implemented. This covers the
transition from the frontend to the backend side: the connection is
detached from its owner session, which can then be freed, and the
connection is then linked to the server instance.
2023-08-24 14:44:33 +02:00
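
A rough sketch of the passive-reverse sequence described above; the
helper names and fields are hypothetical stand-ins, not HAProxy's
actual API:

    /* Sketch only: detach a frontend connection from its session and
     * hand it to the target server for reuse on the backend side. */
    struct sketch_session;
    struct sketch_server;
    struct sketch_conn {
        struct sketch_session *owner;
        struct sketch_server  *srv;
    };

    /* hypothetical helpers assumed to exist for this illustration */
    void sketch_session_detach(struct sketch_session *s, struct sketch_conn *c);
    void sketch_session_free(struct sketch_session *s);
    void sketch_srv_add_conn(struct sketch_server *srv, struct sketch_conn *c);

    static void conn_reverse_passive(struct sketch_conn *conn,
                                     struct sketch_server *srv)
    {
        struct sketch_session *sess = conn->owner;

        sketch_session_detach(sess, conn);  /* connection leaves the session */
        conn->owner = NULL;
        sketch_session_free(sess);          /* session can then be freed */

        conn->srv = srv;                    /* link to the server instance */
        sketch_srv_add_conn(srv, conn);     /* reusable for next transfers */
    }
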
Amaury Denoyelle d8d9122a02 MINOR: connection: centralize init/deinit of backend elements
A connection contains extra elements which are only used for the backend
side. Regroup their allocation and deallocation in two new functions
named conn_backend_init() and conn_backend_deinit().

No functional change is introduced with this commit. The new functions
are reused in place of manual alloc/dealloc in conn_new() / conn_free().
This patch will be useful for reverse connect support with connection
conversion from backend to frontend side and vice-versa.
2023-08-24 14:44:33 +02:00
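
A small sketch of the regrouping idea, with illustrative backend-only
members in place of the real connection fields:

    /* Sketch only: allocate/release backend-only connection elements in
     * one place instead of inline in conn_new()/conn_free(). */
    #include <stdlib.h>

    struct sketch_conn {
        void *idle_node;   /* backend-only: idle-connection tree node */
        void *dst_addr;    /* backend-only: destination address storage */
    };

    static int conn_backend_init(struct sketch_conn *conn)
    {
        conn->idle_node = calloc(1, 64);
        conn->dst_addr  = calloc(1, 64);
        return conn->idle_node && conn->dst_addr;
    }

    static void conn_backend_deinit(struct sketch_conn *conn)
    {
        free(conn->idle_node); conn->idle_node = NULL;
        free(conn->dst_addr);  conn->dst_addr  = NULL;
    }
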
Amaury Denoyelle fbe35afaa4 MINOR: proxy: simplify parsing 'backend/server'
Several CLI handlers use a server argument specified with the format
'<backend>/<server>'. The parsing of this argument is done in two
steps: first splitting the string on the '/' delimiter, then using
get_backend_server() to retrieve the server instance.

Refactor these code sections with the following changes:
* splitting is reimplemented using the ist API
* get_backend_server() is removed; instead, the already existing
  proxy_be_by_name() and server_find_by_name() are used, which between
  them contained the code duplicated by the now-removed function

No functional change occurs with this commit. However, it will be useful
to add new configuration options reusing the same '<backend>/<server>'
format for reverse connect.
2023-08-24 14:44:33 +02:00
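
The two-step lookup could look roughly like the following; this sketch
splits on '/' with plain C string handling rather than the ist API, and
the signatures of the two lookup functions named above are assumptions:

    /* Sketch only: resolve a "<backend>/<server>" argument. */
    #include <stddef.h>
    #include <string.h>

    struct proxy;
    struct server;
    struct proxy  *proxy_be_by_name(const char *name);        /* assumed */
    struct server *server_find_by_name(struct proxy *be,
                                       const char *name);     /* assumed */

    static struct server *resolve_backend_server(char *arg)
    {
        char *slash = strchr(arg, '/');
        struct proxy *be;

        if (!slash)
            return NULL;       /* missing '/' delimiter */
        *slash = '\0';         /* split into "<backend>" and "<server>" */

        be = proxy_be_by_name(arg);
        if (!be)
            return NULL;       /* unknown backend */
        return server_find_by_name(be, slash + 1);
    }
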
Willy Tarreau 9b47ed1a93 IMPORT: xxhash: update xxHash to version 0.8.2
Peter Varkoly reported a build issue on ppc64le in xxhash.h. Our version
(0.8.1) was the latest one 9 months ago, and since then this specific
issue was addressed in 0.8.2, so let's apply the maintenance update.

This should be backported to 2.8 and 2.7.
2023-08-24 12:01:06 +02:00
Willy Tarreau 821fc95146 MINOR: pattern: do not needlessly lookup the LRU cache for empty lists
If a pattern list is empty, there's no way we can find its elements in
the pattern cache, so let's avoid this expensive lookup. This can happen
for ACLs or maps loaded from files that may optionally be empty, for
example. Doing so improves the request rate by roughly 10% for a single
such match with only 8 threads. That's expected because the LRU cache
pre-creates an entry to be committed in case the list lookup succeeds
after a miss, so skipping the lookup bypasses all of this.
2023-08-22 07:27:01 +02:00
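
The fast path boils down to checking for an empty list before any cache
work; a hedged sketch with simplified stand-in types:

    /* Sketch only: skip the LRU cache when the pattern list is empty. */
    #include <stddef.h>

    struct sketch_pat_expr {
        unsigned int nb_patterns;   /* stand-in for the real list head */
    };

    static const void *pattern_match_sketch(struct sketch_pat_expr *expr)
    {
        if (!expr->nb_patterns)
            return NULL;   /* empty list: no LRU lookup, no pre-created
                            * cache entry to commit, immediate no-match */

        /* ... normal path: LRU cache lookup, then list scan on miss ... */
        return NULL;
    }
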
William Lallemand 3fde27d980 BUG/MINOR: quic: ssl_quic_initial_ctx() uses error count not error code
ssl_quic_initial_ctx() is supposed to use the error count and not the
error code.

Bug was introduced by 557706b3 ("MINOR: quic: Initialize TLS contexts
for QUIC openssl wrapper").

No backport needed.
2023-08-21 15:35:17 +02:00
William Lallemand 8c004153e5 BUG/MINOR: quic: allow-0rtt warning must only be emitted with quic bind
When built with USE_QUIC_OPENSSL_COMPAT, a warning is emitted when using
allow-0rtt. However, this warning is emitted for every allow-0rtt
keyword on the bind line, which is confusing; it must only be done when
the bind is a QUIC one. This also does not handle the case where the
allow-0rtt keyword is in the crt-list.

This patch moves the warning to ssl_quic_initial_ctx() in order to emit
it in all the useful cases.
2023-08-21 15:33:26 +02:00
Frédéric Lécaille 2677dc1c32 MINOR: quic+openssl_compat: Emit an alert for "allow-0rtt" option
QUIC 0-RTT is not supported when haproxy is linked against a TLS stack
with limited QUIC support (OpenSSL).

Modify the "allow-0rtt" option callback to make it emit a warning if set on
a QUIC listener "bind" line.
2023-08-17 15:44:03 +02:00
Frédéric Lécaille 0e13325f23 MINOR: quic+openssl_compat: Do not start without "limited-quic"
Add a check for limited-quic in check_config_validity() when compiled
with USE_QUIC_OPENSSL_COMPAT so that we prevent a config from starting
accidentally with limited QUIC support. If a QUIC listener is found
when using the compatibility mode and limited-quic is not set, an error
message is reported explaining that the SSL library is not compatible
and suggesting that the user enable limited-quic if that's what they
want, and the startup fails.

This partially reverts commit 7c730803d ("MINOR: quic: Warning for
OpenSSL wrapper QUIC bindings without "limited-quic"") since a warning
was not sufficient.
2023-08-17 15:44:03 +02:00
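
A hedged sketch of such a startup check; the flags and the reporting
below are stand-ins for the real globals and ha_alert() machinery:

    /* Sketch only: refuse to start when a QUIC listener exists in
     * compatibility mode but "limited-quic" was not set. */
    #include <stdio.h>

    static int limited_quic_set;     /* stand-in for the global option */
    static int quic_listener_found;  /* stand-in for the config scan result */

    static int check_limited_quic(void)
    {
    #ifdef USE_QUIC_OPENSSL_COMPAT
        if (quic_listener_found && !limited_quic_set) {
            fprintf(stderr, "QUIC listener found, but the SSL library only "
                    "offers limited QUIC support; please enable "
                    "'limited-quic' if this is what you want.\n");
            return 1;   /* fatal error: startup fails */
        }
    #endif
        return 0;
    }
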
Amaury Denoyelle cd97ba147c BUILD/IMPORT: fix compilation with PLOCK_DISABLE_EBO=1
Compilation is broken due to a missing __pl_wait_unlock_long() definition
when building with PLOCK_DISABLE_EBO=1. This has been the case since the
following commit, which activates the inlining version of
pl_wait_unlock_long():
  commit 071d689a51
  MINOR: threads: inline the wait function for pthread_rwlock emulation

Add an extra check on PLOCK_DISABLE_EBO before choosing the inline or
default version of pl_wait_unlock_long() to fix this.
2023-08-17 11:16:54 +02:00
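
The fix amounts to an extra preprocessor condition, roughly as below;
only PLOCK_DISABLE_EBO and PLOCK_LORW_INLINE_WAIT are macro names taken
from these commits, the rest is illustrative:

    /* Sketch only: never select the inlined wait path when EBO is
     * disabled, since __pl_wait_unlock_long() is not defined then. */
    #if !defined(PLOCK_DISABLE_EBO) && defined(PLOCK_LORW_INLINE_WAIT)
    #  define USE_INLINED_WAIT 1   /* inline __pl_wait_unlock_long() */
    #else
    #  define USE_INLINED_WAIT 0   /* default out-of-line version */
    #endif
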
Willy Tarreau 544c2f2d9e MINOR: pools: use EBO to wait for unlock during pool_flush()
pool_flush() could become a source of contention on the pool's free list
if there are many competing threads using that pool. Let's make sure we
use EBO and not just a simple CPU relaxation there, to avoid disturbing
them.
2023-08-17 09:09:20 +02:00
Willy Tarreau 78fa54863d MINOR: atomic: make sure to always relax after a failed CAS
There were a few places left where we forgot to call __ha_cpu_relax()
after a failed CAS, in the HA_ATOMIC_UPDATE_{MIN,MAX} macros, and in
a few sync_* API macros (the same as above plus HA_ATOMIC_CAS and
HA_ATOMIC_XCHG). Let's add them now.

This could have been a cause of contention, particularly with
process_stream() calling stream_update_time_stats() which uses 8 of them
in a call (4 for the server, 4 for the proxy). This may be a possible
explanation for the high CPU consumption reported in GH issue #2251.

This should be backported at least to 2.6 as it's harmless.
2023-08-17 09:09:20 +02:00
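
The fixed pattern looks like the following sketch, here written with
C11 atomics and a pause-based stand-in for __ha_cpu_relax():

    /* Sketch only: an atomic "update max" that relaxes after each
     * failed CAS to reduce cache-line ping-pong between threads. */
    #include <stdatomic.h>

    #if defined(__x86_64__) || defined(__i386__)
    #  define cpu_relax() __builtin_ia32_pause()
    #else
    #  define cpu_relax() do { } while (0)
    #endif

    static inline void atomic_update_max(_Atomic unsigned int *dst,
                                         unsigned int val)
    {
        unsigned int old = atomic_load_explicit(dst, memory_order_relaxed);

        while (old < val &&
               !atomic_compare_exchange_weak(dst, &old, val))
            cpu_relax();   /* lost the race: pause before retrying */
    }
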
Willy Tarreau 071d689a51 MINOR: threads: inline the wait function for pthread_rwlock emulation
When using pthread_rwlock emulation, contention is reported on
pl_wait_unlock_long(), which makes it really inconvenient to analyse
what is happening. Now plock supports inlining the wait call for just the lorw
functions by enabling PLOCK_LORW_INLINE_WAIT. Let's do this so that now
the wait time will be precisely reported as either pthread_rwlock_rdlock()
or pthread_rwlock_wrlock() depending on the contended function, but no
more on pl_wait_unlock_long(), which will still be reported for all
other locks.
2023-08-17 00:09:05 +02:00
Willy Tarreau e56275378f IMPORT: lorw: support inlining the wait call
Now when PLOCK_LORW_INLINE_WAIT is defined, the pl_wait_unlock_long()
calls in pl_lorw_rdlock() and pl_lorw_wrlock() will be inlined so that
all the CPU time is accounted for in the calling function.

This is plock upstream commit c993f81d581732a6eb8fe3033f21970420d21e5e.
2023-08-17 00:09:05 +02:00
Willy Tarreau 66dcc0550e IMPORT: plock: always expose the inline version of the lock wait function
Doing so will make it possible to expose the time spent in certain highly
contended functions, which can be desirable for more accurate CPU
profiling. For example this could be done in locking functions that
are already not inlined so that they are the ones being reported as
those consuming the CPU instead of just pl_wait_unlock_long().

This is plock upstream commit 7505c2e2c8c4aa0ab8f52a2288e1334ae6412be4.
2023-08-17 00:09:05 +02:00
Willy Tarreau c6b98f05d2 IMPORT: plock: also support inlining the int code
Commit 9db830b ("plock: support inlining exponential backoff code")
added an option to support inlining of the wait code for longs but
forgot to do it for ints. Let's do it now.

This is plock upstream commit b1f9f0d252fa40577d11cfb2bc0a809d6960a297.
2023-08-17 00:09:05 +02:00
Aurelien DARRAGON 3b4d2b7975 DEV: makefile: fix POSIX compatibility for "range" target
make "range" which was introduced with 06d34d4 ("DEV: makefile: add a
new "range" target to iteratively build all commits") does not work with
POSIX shells (namely: bourne shell), and will fail with this kind of
errors:

   |/bin/sh: 6: Syntax error: "(" unexpected (expecting ")")
   |make: *** [Makefile:1226: range] Error 2

This is because arrays and arithmetic expressions which are used for the
"range" target are not supported by sh (unlike bash and other "modern"
interpreters).

However the make "all" target already complies with POSIX, so in this
commit we try to make "range" target POSIX compliant to ensure that the
makefile works as expected on systems where make uses /bin/sh as default
intepreter and where /bin/sh points to POSIX shell.
2023-08-17 00:09:05 +02:00
William Lallemand 6ecb7df4e1 BUILD: Makefile: realigned USE_* options in make help
Realigned the USE_* options of `make help` because of the length of
USE_QUIC_OPENSSL_COMPAT.

No backport needed.
2023-08-17 00:03:01 +02:00
William Lallemand 17bfc75974 BUILD: Makefile: add USE_QUIC_OPENSSL_COMPAT to make help
Add the missing USE_QUIC_OPENSSL_COMPAT option to `make help`.

No backport needed.
2023-08-17 00:01:27 +02:00
William Lallemand 1b5f9de1b4 BUILD: Makefile: add the USE_QUIC option to make help
Add the missing "USE_QUIC" option to `make help`.

Must be backported as far as 2.4.
2023-08-16 23:41:15 +02:00
Remi Tricot-Le Breton 672203c26b DOC: jwt: Add explicit list of supported algorithms
Add explicit list of algorithms supported by the jwt_verify converter.
2023-08-16 11:53:42 +02:00
Tim Duesterhus c21b98a6d3 REGTESTS: Do not use REQUIRE_VERSION for HAProxy 2.5+ (3)
Introduced in:

424981cde REGTEST: add ifnone-forwardfor test
b015b3eb1 REGTEST: add RFC7239 forwarded header tests

see also:

fbbbc33df REGTESTS: Do not use REQUIRE_VERSION for HAProxy 2.5+
2023-08-15 11:29:13 +02:00
Willy Tarreau f97db23b6d SCRIPTS: git-show-backports: automatic ref and base detection with -m
When running with -m (check for missing backports) we often have to fill
in lots of information that can be determined automatically the vast
majority of the time:
  - restart point (last cherry-picked ID from one of the last commits)
  - current branch (HEAD)
  - reference branch (the one that contains most of the last commits)

These elements are not that hard to determine, so let's make sure we
can fall back to them when running in missing mode.

The reference branch is guessed by looking at the upstream branch that
most frequently contains some of the last 10 commits. It can be inaccurate
if multiple branches exist with these commits, or when upstream changes
due to a non-LTS branch disappearing in the middle of the series, in which
case passing "-r" will help. But most of the time it works OK. It also gives
precedence to local branches over remote ones for such choices. A test in
2.4 at commit 793a4b520 correctly shows 2.6/master as the upstream despite
2.5 having been used for the early ones of the tag.

For the restart point, we assume that the most recent commit that was
backported serves as a reference (and not the most recently backported
commit). This means that the usual case where an old commit was found
to be missing will not fool the analysis. Commits are inspected from
2 commits before the last tag, and reordered from the parent's tree
to see which one is the last one.

With this, it's sufficient to issue "git-show-backports -q -m" to get
the list of backports from the upstream branch, restarting from the
last backported one.
2023-08-14 13:12:56 +02:00
Johannes Naab d5590ef633 DOC: typo: fix sc-set-gpt references
Only sc-inc-gpc and sc-set-gpt do exist. The mix-up sc-inc-gpt crept in
in 71d189219 (DOC: config: Rework and uniformize how TCP/HTTP rules are
documented, 2021-10-14) and got copied in a92480462 (MINOR: http-rules:
Add missing actions in http-after-response ruleset, 2023-01-05).
2023-08-14 09:04:45 +02:00
Aurelien DARRAGON 7eb05891d8 BUG/MINOR: stktable: allow sc-add-gpc from tcp-request connection
Following the previous commit's logic, we enable the use of sc-add-gpc
from tcp-request connection, since this was probably forgotten in the
first place for sc-set-gpt0, and since sc-add-gpc was inspired by it, it
lacks it as well.

As sc-add-gpc was implemented in 5a72d03a58 ("MINOR: stick-table: implement
the sc-add-gpc() action"), this should only be backported to 2.8.
2023-08-14 09:03:49 +02:00
Aurelien DARRAGON 6c79309fda BUG/MINOR: stktable: allow sc-set-gpt(0) from tcp-request connection
Both the documentation and the original developer's intent seem to
suggest that the sc-set-gpt/sc-set-gpt0 actions should be available from
tcp-request connection.

Yet because it was probably forgotten when expr support was added to
sc-set-gpt0 in 0d7712dff0 ("MINOR: stick-table: allow sc-set-gpt0 to
set value from an expression"), it doesn't work and will report this
kind of error:
 "internal error, unexpected rule->from=0, please report this bug!"

Fix the code to comply with the documentation and the expected
behavior.

This must be backported to all stable versions.

[for < 2.5, as only sc-set-gpt0 existed back then, the patch must be
manually applied to skip irrelevant parts]
2023-08-14 09:03:44 +02:00
Willy Tarreau 67da85fa4c DEV: flags/show-sess-to-flags: properly decode fd.state
fd.state is reported without the "0x" prefix in "show sess"; let's
support this during decoding.

This may be backported to all versions supporting this utility.
2023-08-14 08:48:49 +02:00
Willy Tarreau 75028bcba6 [RELEASE] Released version 2.9-dev3
Released version 2.9-dev3 with the following main changes :
    - BUG/MINOR: ssl: OCSP callback only registered for first SSL_CTX
    - BUG/MEDIUM: h3: Properly report a C-L header was found to the HTX start-line
    - MINOR: sample: add pid sample
    - MINOR: sample: implement act_conn sample fetch
    - MINOR: sample: accept_date / request_date return %Ts / %tr timestamp values
    - MEDIUM: sample: implement us and ms variant of utime and ltime
    - BUG/MINOR: sample: check alloc_trash_chunk() in conv_time_common()
    - DOC: configuration: describe Td in Timing events
    - MINOR: sample: implement the T* timer tags from the log-format as fetches
    - DOC: configuration: add sample fetches for timing events
    - BUG/MINOR: quic: Possible crash when acknowledging Initial v2 packets
    - MINOR: quic: Export QUIC traces code from quic_conn.c
    - MINOR: quic: Export QUIC CLI code from quic_conn.c
    - MINOR: quic: Move TLS related code to quic_tls.c
    - MINOR: quic: Add new "QUIC over SSL" C module.
    - MINOR: quic: Add a new quic_ack.c C module for QUIC acknowledgements
    - CLEANUP: quic: Defined but no more used function (quic_get_tls_enc_levels())
    - MINOR: quic: Split QUIC connection code into three parts
    - CLEANUP: quic: quic_conn struct cleanup
    - MINOR: quic: Move the QUIC frame pool to its proper location
    - BUG/MINOR: chunk: fix chunk_appendf() to not write a zero if buffer is full
    - BUG/MEDIUM: h3: Be sure to handle fin bit on the last DATA frame
    - DOC: configuration: rework the custom log format table
    - BUG/MINOR: quic+openssl_compat: Non initialized TLS encryption levels
    - CLEANUP: acl: remove cache_idx from acl struct
    - REORG: cfgparse: extract curproxy as a global variable
    - MINOR: acl: add acl() sample fetch
    - BUILD: cfgparse: keep a single "curproxy"
    - BUG/MEDIUM: bwlim: Reset analyse expiration date when the channel analyse ends
    - MEDIUM: stream: Reset response analyse expiration date if there is no analyzer
    - BUG/MINOR: htx/mux-h1: Properly handle bodyless responses when splicing is used
    - BUG/MEDIUM: quic: consume contig space on requeue datagram
    - BUG/MINOR: http-client: Don't forget to commit changes on HTX message
    - CLEANUP: stconn: Move comment about sedesc fields on the field line
    - REGTESTS: http: Create a dedicated script to test spliced bodyless responses
    - REGTESTS: Test SPLICE feature is enabled to execute script about splicing
    - BUG/MINOR: quic: reappend rxbuf buffer on fake dgram alloc error
    - BUILD: quic: fix wrong potential NULL dereference
    - MINOR: h3: abort request if not completed before full response
    - BUG/MAJOR: http-ana: Get a fresh trash buffer for each header value replacement
    - CLEANUP: quic: Remove quic_path_room().
    - MINOR: quic: Amplification limit handling sanitization.
    - MINOR: quic: Move some counters from [rt]x quic_conn anonymous struct
    - MEDIUM: quic: Send CONNECTION_CLOSE packets from a dedicated buffer.
    - MINOR: quic: Use a pool for the connection ID tree.
    - MEDIUM: quic: Allow the quic_conn memory to be asap released.
    - MINOR: quic: Release asap quic_conn memory (application level)
    - MINOR: quic: Release asap quic_conn memory from ->close() xprt callback.
    - MINOR: quic: Warning for OpenSSL wrapper QUIC bindings without "limited-quic"
    - REORG: http: move has_forbidden_char() from h2.c to http.h
    - BUG/MAJOR: h3: reject header values containing invalid chars
    - MINOR: mux-h2/traces: also suggest invalid header upon parsing error
    - MINOR: ist: add new function ist_find_range() to find a character range
    - MINOR: http: add new function http_path_has_forbidden_char()
    - MINOR: h2: pass accept-invalid-http-request down the request parser
    - REGTESTS: http-rules: add accept-invalid-http-request for normalize-uri tests
    - BUG/MINOR: h1: do not accept '#' as part of the URI component
    - BUG/MINOR: h2: reject more chars from the :path pseudo header
    - BUG/MINOR: h3: reject more chars from the :path pseudo header
    - REGTESTS: http-rules: verify that we block '#' by default for normalize-uri
    - DOC: clarify the handling of URL fragments in requests
    - BUG/MAJOR: http: reject any empty content-length header value
    - BUG/MINOR: http: skip leading zeroes in content-length values
    - BUG/MEDIUM: mux-h1: fix incorrect state checking in h1_process_mux()
    - BUG/MEDIUM: mux-h1: do not forget EOH even when no header is sent
    - BUILD: mux-h1: shut a build warning on clang from previous commit
    - DEV: makefile: add a new "range" target to iteratively build all commits
    - CI: do not use "groupinstall" for Fedora Rawhide builds
    - CI: get rid of travis-ci wrapper for Coverity scan
    - BUG/MINOR: quic: mux started when releasing quic_conn
    - BUG/MINOR: quic: Possible crash in quic_cc_conn_io_cb() traces.
    - MINOR: quic: Add a trace for QUIC conn fd ready for receive
    - BUG/MINOR: quic: Possible crash when issuing "show fd/sess" CLI commands
    - BUG/MINOR: quic: Missing tasklet (quic_cc_conn_io_cb) memory release (leak)
    - BUG/MEDIUM: quic: fix tasklet_wakeup loop on connection closing
    - BUG/MINOR: hlua: fix invalid use of lua_pop on error paths
    - MINOR: hlua: add hlua_stream_ctx_prepare helper function
    - BUG/MEDIUM: hlua: streams don't support mixing lua-load with lua-load-per-thread
    - MAJOR: threads/plock: update the embedded library again
    - MINOR: stick-table: move the task_queue() call outside of the lock
    - MINOR: stick-table: move the task_wakeup() call outside of the lock
    - MEDIUM: stick-table: change the ref_cnt atomically
    - MINOR: stick-table: better organize the struct stktable
    - MEDIUM: peers: update ->commitupdate out of the lock using a CAS
    - MEDIUM: peers: drop then re-acquire the wrlock in peer_send_teachmsgs()
    - MEDIUM: peers: only read-lock peer_send_teachmsgs()
    - MEDIUM: stick-table: use a distinct lock for the updates tree
    - MEDIUM: stick-table: touch updates under an upgradable read lock
    - MEDIUM: peers: drop the stick-table lock before entering peer_send_teachmsgs()
    - MINOR: stick-table: move the update lock into its own cache line
    - CLEANUP: stick-table: slightly reorder the stktable struct
    - BUILD: defaults: use __WORDSIZE not LONGBITS for MAX_THREADS_PER_GROUP
    - MINOR: tools: make ptr_hash() support 0-bit outputs
    - MINOR: tools: improve ptr hash distribution on 64 bits
    - OPTIM: tools: improve hash distribution using a better prime seed
    - OPTIM: pools: use exponential back-off on shared pool allocation/release
    - OPTIM: pools: make pool_get_from_os() / pool_put_to_os() not update ->allocated
    - MINOR: pools: introduce the use of multiple buckets
    - MEDIUM: pools: spread the allocated counter over a few buckets
    - MEDIUM: pools: move the used counter over a few buckets
    - MEDIUM: pools: move the needed_avg counter over a few buckets
    - MINOR: pools: move the failed allocation counter over a few buckets
    - MAJOR: pools: move the shared pool's free_list over multiple buckets
    - MINOR: pools: make pool_evict_last_items() use pool_put_to_os_no_dec()
    - BUILD: pools: fix build error on clang with inline vs forceinline
2023-08-12 19:59:27 +02:00
Willy Tarreau 2d18717fb8 BUILD: pools: fix build error on clang with inline vs forceinline
clang is more picky than gcc regarding duplicate "inline". The functions
declared with "forceinline" don't need to have "inline" since it's already
in the macro.
2023-08-12 19:58:17 +02:00
Willy Tarreau 29eed99b50 MINOR: pools: make pool_evict_last_items() use pool_put_to_os_no_dec()
The bucket is already known, no need to calculate it again. Let's just
include the lower level functions.
2023-08-12 19:04:34 +02:00
Willy Tarreau 7bf829ace1 MAJOR: pools: move the shared pool's free_list over multiple buckets
This aims at further reducing the contention on the free_list when using
global pools. The free_list pointer now appears for each bucket, and both
the alloc and the release code skip to a next bucket when ending on a
contended entry. The default entry used for allocations and releases
depend on the thread ID so that locality is preserved as much as possible
under low contention.

It would be nice to improve the situation to make sure that releases to
the shared pools don't consider the first entry's pointer but only an
argument that would be passed and that would correspond to the bucket in
the thread's cache. This would reduce computations and make sure that the
shared cache only contains items whose pointers match the same bucket.
This was not yet done. One possibility could be to keep the same splitting
in the local cache.

With this change, an h2load test with 5 * 160 conns & 40 streams on 80
threads that was limited to 368k RPS with the shared cache jumped to
3.5M RPS for 8 buckets, 4M RPS for 16 buckets, 4.7M RPS for 32 buckets
and 5.5M RPS for 64 buckets.
2023-08-12 19:04:34 +02:00
Willy Tarreau 8a0b5f783b MINOR: pools: move the failed allocation counter over a few buckets
The failed allocation counter cannot depend on a pointer, but since it's
a perpetually increasing counter and not a gauge, we don't care where
it's incremented. Thus instead we're hashing on the TID. There's no
contention there anyway, but it's better not to waste the room in
the pool's heads and to move that with the other counters.
2023-08-12 19:04:34 +02:00
Willy Tarreau da6999f839 MEDIUM: pools: move the needed_avg counter over a few buckets
That's the same principle as for ->allocated and ->used. Here we return
the sum of the raw values, so the result still needs to be fed to
swrate_avg(). It also means that we now use the local ->used instead
of the global one for the calculations and do not need to call pool_used()
anymore on fast paths. The number of samples should likely be divided by
the number of buckets, but that's not done yet (better to observe first).

A function pool_needed_avg() was added to report aggregated values for
the "show pools" command.

With this change, an h2load made of 5 * 160 conn * 40 streams on 80
threads raised from 1.5M RPS to 6.7M RPS.
2023-08-12 19:04:34 +02:00
Willy Tarreau 9e5eb586b1 MEDIUM: pools: move the used counter over a few buckets
That's the same principle as for ->allocated. The small difference here
is that it's no longer possible to decrement ->used in batches when
releasing clusters from the cache to the shared cache, so the counter
has to be decremented for each of them. But as it provides less
contention and it's done only during forced eviction, it shouldn't be
a problem.

A function "pool_used()" was added to return the sum of the entries.
It's used by pool_alloc_nocache() and pool_free_nocache() which need
to count the number of used entries. It's not a problem since such
operations are done when picking/releasing objects to/from the OS,
but it is a reminder that the number of buckets should remain small.

With this change, an h2load test made of 5 * 160 conn * 40 streams on
80 threads raised from 812k RPS to 1.5M RPS.
2023-08-12 19:04:34 +02:00
Willy Tarreau cdb711e42b MEDIUM: pools: spread the allocated counter over a few buckets
The ->used counter is one of the most stressed, and it heavily
depends on the ->allocated one, so let's first move ->allocated
to a few buckets.

A function "pool_allocated()" was added to return the sum of the entries.
It's important not to abuse it as it does iterate, so everywhere it's
possible to avoid it by keeping a local counter, it's better. Currently
it's used for limited pools which need to make sure they do not allocate
too many objects. That's an acceptable tradeoff to save CPU on large
machines at the expense of spending a little bit more on small ones which
normally are not under load.
2023-08-12 19:04:34 +02:00
Willy Tarreau 06885aaea7 MINOR: pools: introduce the use of multiple buckets
On many threads and without the shared cache, there can be extreme
contention on the ->allocated counter, the ->free_list pointer, and
the ->used counter. It's possible to limit this contention by spreading
the counters a little bit over multiple entries, that are summed up when
a consultation is needed. The criterion used to spread the values cannot
be related to the thread ID due to migrations, since we need to keep
consistent stats (allocated vs used).

Instead we'll just hash the pointer: it provides an index that does the
job and that is consistent for the object. With just a few entries
(16 here, as that showed almost identical performance between global and
non-global pools), even the iterations needed to sum the counters should
be short enough during measurements not to be a problem.

A pair of functions designed to ease pointer hash bucket calculation were
added, with one of them doing it for thread IDs because allocation failures
will be associated with a thread and not a pointer.

For now this patch only brings in the relevant parts of the infrastructure,
the CONFIG_HAP_POOL_BUCKETS_BITS macro that defaults to 6 bits when 512
threads or more are supported, 5 bits when 128 or more are supported, 4
bits when 16 or more are supported, otherwise 3 bits for small setups.
The array in the pool_head and the two utility functions are already
added. It should have no measurable impact beyond inflating the pool_head
structure.
2023-08-12 19:04:34 +02:00
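
A condensed sketch of the bucket mechanism: counters are spread over a
small array indexed by a pointer hash and summed when consulted. The
hash and the sizes below are simplified stand-ins:

    /* Sketch only: spread a hot counter over 2^BITS buckets. */
    #include <stdatomic.h>
    #include <stdint.h>

    #define BUCKETS_BITS 4
    #define BUCKETS      (1U << BUCKETS_BITS)

    struct pool_sketch {
        _Atomic uint64_t allocated[BUCKETS];
    };

    static inline unsigned int ptr_bucket(const void *ptr)
    {
        /* simplified multiplicative pointer hash (see the hash commits) */
        return (unsigned int)(((uintptr_t)ptr * 0xacd1be85UL)
                              >> (32 - BUCKETS_BITS)) & (BUCKETS - 1);
    }

    static inline void inc_allocated(struct pool_sketch *p, const void *ptr)
    {
        atomic_fetch_add(&p->allocated[ptr_bucket(ptr)], 1);
    }

    /* consultation iterates over all buckets, so avoid it on fast paths */
    static inline uint64_t pool_allocated(struct pool_sketch *p)
    {
        uint64_t sum = 0;
        for (unsigned int i = 0; i < BUCKETS; i++)
            sum += atomic_load(&p->allocated[i]);
        return sum;
    }
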
Willy Tarreau 29ad61fb00 OPTIM: pools: make pool_get_from_os() / pool_put_to_os() not update ->allocated
The pool's allocation counter doesn't strictly need to be updated
from these functions; it may more efficiently be done in the caller
(even out of a loop for pool_flush() and pool_gc()), and doing so will
also help us spread the counters over an array later. The functions
were renamed _noinc and _nodec to make sure we catch any possible
user in an external patch. If needed, the original functions may easily
be reimplemented in an inline function.
2023-08-12 19:04:34 +02:00
Willy Tarreau feeda4132b OPTIM: pools: use exponential back-off on shared pool allocation/release
Running a stick-table stress with -dMglobal under 56 threads shows
extreme contention on the pool's free_list, because it has to be
processed in two phases while the retry path only implements a simple
cpu_relax().

Let's at least implement exponential back-off here to limit the neighbor's
noise and reduce the time needed to successfully acquire the pointer. Just
doing so shows there's still contention but almost doubled the performance,
from 1.1 to 2.1M req/s.
2023-08-12 19:04:34 +02:00
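
Exponential back-off around a contended atomic retry loop can be
sketched as follows (stand-in code, not the plock implementation):

    /* Sketch only: EBO while waiting for a contended list head to become
     * available, doubling the pause between attempts up to a cap. */
    #include <stdatomic.h>

    #if defined(__x86_64__) || defined(__i386__)
    #  define cpu_relax() __builtin_ia32_pause()
    #else
    #  define cpu_relax() do { } while (0)
    #endif

    #define BUSY ((void *)-1)   /* sentinel marking the head as locked */

    static void *grab_head(_Atomic(void *) *head)
    {
        unsigned int wait = 1;
        void *cur = atomic_load(head);

        while (cur == BUSY ||
               !atomic_compare_exchange_weak(head, &cur, BUSY)) {
            for (unsigned int i = 0; i < wait; i++)
                cpu_relax();              /* back off, limit neighbor noise */
            if (wait < 1024)
                wait <<= 1;               /* exponential growth, capped */
            if (cur == BUSY)
                cur = atomic_load(head);  /* re-read once unlocked */
        }
        return cur;   /* caller owns the list and must restore the head */
    }
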
Willy Tarreau f0d188f6ed OPTIM: tools: improve hash distribution using a better prime seed
During tests it was noticed that the current hash is not that good
on 4- and 5- bit hashes. About 7.5% of all the 32-bit primes were tested
as candidates for the hash function, by submitting them 128 arrangements
of N pointers among 40k extracted from haproxy's pools, and the average
fill rates for 1- to 12- bit hashes were measured and compared. It was
clear that some values do not provide great hashes and other ones are
way more resistant.

The current value is not bad at all but delivers 42.6% unique 2-bit
outputs, 41.6% 3-bit, 38.0% 4-bit, 38.2% 5-bit and 37.1% 10-bit. Some
values did perform significantly better, among which 0xacd1be85 which
does 43.2% 2-bit, 42.5% 3-bit, 42.2% 4-bit, 39.2% 5-bit and 37.3% 10-bit.

The reverse value used in ptr2_hash() was really underperforming and
was replaced with 0x9d28e4e9, which does 49.6%, 40.4%, 42.6%, 39.1%, and
37.2% respectively.

This should slightly improve the accuracy of the task and memory
profiling, and will be useful for pools.
2023-08-12 19:04:34 +02:00
Willy Tarreau 58946d44f8 MINOR: tools: improve ptr hash distribution on 64 bits
When testing the pointer hash on 64-bit real pointers (map entries),
it appeared that the shift by 33 bits that was hoped to compensate for
the 3 null LSBs degrades the hash, and that the centering is more optimal
at 31-(bits+1)/2. This makes sense since the topmost bit of the
multiplier is 31, so for an input of 1 bit and 1 bit of output we
would always get zero. With the formula adjusted this way, we can get
up to ~15% more unique entries at 10 bits and ~24% more at 11 bits.
2023-08-12 19:04:34 +02:00
Willy Tarreau ab6cb5dea0 MINOR: tools: make ptr_hash() support 0-bit outputs
When dealing with macro-based size definitions, it is useful to be able
to hash pointers on zero bits so that the macro automatically returns a
constant 0. For now it only supports 1-32. Let's just add this special
case. It's automatically optimized out by the compiler since the function
is inlined.
2023-08-12 19:04:34 +02:00
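
Putting the pieces of these hash commits together, a sketch of the
resulting function could look like this (supporting 0 to 31 output bits
here; the exact arrangement in the real code may differ):

    /* Sketch only: multiplicative pointer hash with the 0xacd1be85 seed
     * and the 31-(bits+1)/2 centering described above; 0 bits returns a
     * constant 0 so macro-sized callers optimize it away. */
    #include <stdint.h>

    static inline unsigned int ptr_hash(const void *p, const unsigned int bits)
    {
        unsigned long long x = (uintptr_t)p;

        if (!bits)
            return 0;   /* constant, folded at compile time when inlined */

        x *= 0xacd1be85ULL;
        return (unsigned int)(x >> (31 - (bits + 1) / 2)) & ((1U << bits) - 1);
    }
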
Willy Tarreau 59c347c15e BUILD: defaults: use __WORDSIZE not LONGBITS for MAX_THREADS_PER_GROUP
LONGBITS was defined long ago with old compilers that didn't provide the
word size. It's still present as being referenced in various places in the
code, but we must not use it to define other macros that may be evaluated
at pre-processing time since it contains sizeof() and casts that are not
compatible with preprocessor conditions. Let's switch MAX_THREADS_PER_GROUP
to __WORDSIZE so that we can condition blocks of code on it if needed.

LONGBITS should really be removed by now, given that we don't support
compilers not providing __WORDSIZE anymore (gcc < 4.2).
2023-08-12 19:04:34 +02:00
Willy Tarreau 9e52c35de4 CLEANUP: stick-table: slightly reorder the stktable struct
By moving the config-time stuff after the updt_lock, we can plug some
holes without interfering with it. This allows us to get back to the
768-bytes struct. The performance was not affected at all.
2023-08-11 19:03:35 +02:00
Willy Tarreau 9c6248560e MINOR: stick-table: move the update lock into its own cache line
The read-lock contention observed on the update lock while turning it
into an upgradable lock was due to false sharing with the nearby
updates. Simply moving the lock alone into its own cache line is
sufficient to almost double the performance again, raising from 2355
to 4480k RPS with very low contention:

  Samples: 1M of event 'cycles', 4000 Hz, Event count (approx.): 743422995452 lost
  Overhead  Shared Object          Symbol
    15.88%  haproxy                [.] stktable_lookup_key
     5.94%  haproxy                [.] ebmb_lookup
     5.69%  haproxy                [.] http_wait_for_request
     3.66%  haproxy                [.] stktable_touch_with_exp
     2.62%  [kernel]               [k] _raw_spin_unlock_irqrestore
     1.86%  haproxy                [.] http_action_return
     1.79%  haproxy                [.] stream_process_counters
     1.78%  [kernel]               [k] skb_release_data
     1.77%  haproxy                [.] process_stream

Unfortunately, trying to move the line anywhere else didn't work,
despite the remaining holes, because this structure is not quite
clean. This adds 64 bytes to a struct that was already 768 long,
so it's now 832. It's possible to repack it a little bit and regain
these bytes by removing the THREAD_ALIGN before "keys" because we
rarely use the config stuff, but that's a bit unsafe.
2023-08-11 19:03:35 +02:00
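
The isolation itself is essentially an alignment change; a hedged sketch
with a generic rwlock and illustrative fields:

    /* Sketch only: give the hot update lock its own cache line so that
     * writes to neighbouring fields no longer invalidate it. */
    #include <pthread.h>

    struct table_sketch {
        /* frequently-written fields sharing their cache lines */
        unsigned int update;
        unsigned int localupdate;

        /* the lock alone on a 64-byte line: no more false sharing */
        pthread_rwlock_t updt_lock __attribute__((aligned(64)));

        /* read-mostly, config-time fields may fill the hole after it */
        unsigned int cfg_size;
    };
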
Willy Tarreau 45eeaad45f MEDIUM: peers: drop the stick-table lock before entering peer_send_teachmsgs()
The function drops the lock very early, and the only operations that
are performed in the entry code are updates of the current peer's
last_local_table, which doesn't need to be protected. Thus it's
easier to drop the lock before entering the function and it further
limits its scope.

This has raised the peak RPS from 2050 to 2355k/s with a peers section on
the 80-core machine.
2023-08-11 19:03:35 +02:00
Willy Tarreau cfeca3a3a3 MEDIUM: stick-table: touch updates under an upgradable read lock
Instead of taking the update's write lock in stktable_touch_with_exp(),
while most of the time under high load there is nothing to update because
the entry was touched before having been synchronized and is thus still
present, let's do the check under a read lock and upgrade it to perform
the update if needed. These updates are rare and the contention is not
expected to be very high, so at the first failure to upgrade we retry
directly with a write lock.

By doing so the performance has almost doubled again, from 1140 to 2050k
with a peers section enabled. The contention is now on taking the read
lock itself, so there's little to be gained beyond this in this function.
2023-08-11 19:03:35 +02:00
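
Since the check almost always concludes that nothing needs updating, the
pattern is roughly the following, sketched here with plain pthread
rwlocks (which cannot upgrade in place, hence the drop-and-relock on the
rare slow path) and assumed helper names:

    /* Sketch only: check under a read lock, take the write lock only
     * when an update is really needed, and re-check after upgrading. */
    #include <pthread.h>

    struct entry;
    int  entry_needs_requeue(const struct entry *e);   /* assumed helpers */
    void requeue_entry(struct entry *e);

    static pthread_rwlock_t updt_lock = PTHREAD_RWLOCK_INITIALIZER;

    void touch_entry(struct entry *e)
    {
        pthread_rwlock_rdlock(&updt_lock);
        if (!entry_needs_requeue(e)) {            /* common case under load */
            pthread_rwlock_unlock(&updt_lock);
            return;
        }
        pthread_rwlock_unlock(&updt_lock);

        pthread_rwlock_wrlock(&updt_lock);        /* rare slow path */
        if (entry_needs_requeue(e))               /* re-check: things may
                                                     have changed unlocked */
            requeue_entry(e);
        pthread_rwlock_unlock(&updt_lock);
    }
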
Willy Tarreau 87e072eea5 MEDIUM: stick-table: use a distinct lock for the updates tree
Updating an entry in the updates tree is currently performed under the
table's write lock, which causes huge contention with other accesses
such as lookups and free calls. Aside from the updates tree, the update,
localupdate and commitupdate variables, nothing is manipulated, so
let's create a distinct lock (updt_lock) to protect these together
to remove this contention. It required to add an extra lock in the
few places where we delete the update (though only if we're really
going to delete it) to protect the tree. This is very convenient
because now peer_send_teachmsgs() only needs to take this read lock,
and there is very little contention left on the stick-table.

With this alone, the performance jumped from 614k to 1140k/s on a
80-thread machine with a peers section! Stick-table updates with
no peers however now have to stand two locks and slightly regressed
from 4.0-4.1M/s to 3.9-4.0. This is fairly minimal compared to the
significant unlocking of the peers updates and considered totally
acceptable.
2023-08-11 19:03:35 +02:00
Willy Tarreau 29982ea769 MEDIUM: peers: only read-lock peer_send_teachmsgs()
This function doesn't need to be write-locked. It performs a lookup
of the next update at its index, atomically updates the ref_cnt on
the stksess, updates some shared_table fields on the local thread,
and updates the table's commitupdate. Now that this update is atomic
we don't need to keep the write lock during that period. In addition
this function's callers do not rely on the write lock to be held
either since it was dropped during peer_send_updatemsg() anyway.

Now, when the function is entered with a write lock, it's downgraded
to a read lock, otherwise a read lock is grabbed. Updates are looked
up under the read lock and the message is sent without the lock. The
commitupdate is still performed under the read lock (so as not to
break the code too much), and the write lock is re-acquired when
leaving if needed. This allows multiple peers to look up updates in
parallel and to avoid stalling stick-table lookups.
2023-08-11 19:03:35 +02:00
Willy Tarreau d4f8286e45 MEDIUM: peers: drop then re-acquire the wrlock in peer_send_teachmsgs()
This function maintains the write lock for a while. In practice it does
not need to hold it that long, and some parts could be performed under a
read lock. This patch first drops then re-acquires the write lock at the
function's entry. The purpose is simply to break the end-to-end atomicity
to prove that it has no impact in case something needs to be bisected
later. In fact the write lock is already dropped while calling
peer_send_updatemsg().
2023-08-11 19:03:35 +02:00