Commit Graph

Amaury Denoyelle 4fb538d4b6 MEDIUM: h2: reverse connection after SETTINGS reception
Reverse the connection after SETTINGS reception if it was set as
reversible. This operation is done in a new function, h2_conn_reverse().
It regroups the common changes needed for both reversal directions:
H2_CF_IS_BACK is set or unset and the timeouts are inverted.

For the moment, only passive reverse is fully implemented. Once done,
the connection instance is directly inserted into its targeted server's
pool. It can then be used immediately for future transfers using this
server.
2023-08-24 14:49:03 +02:00
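
A minimal sketch of the reversal described above, using simplified
stand-in types rather than HAProxy's real h2c struct:

    /* Sketch only: flip an H2 connection's role after SETTINGS reception.
     * The struct and fields below are illustrative stand-ins. */
    #define H2_CF_IS_BACK 0x0001

    struct h2c_sketch {
        unsigned int flags;
        int front_timeout;   /* timeout applied on the frontend side */
        int back_timeout;    /* timeout applied on the backend side */
        int cur_timeout;     /* the one currently armed */
    };

    static void h2_conn_reverse_sketch(struct h2c_sketch *h2c)
    {
        /* common to both directions: invert the side flag */
        h2c->flags ^= H2_CF_IS_BACK;

        /* and invert the timeouts accordingly */
        h2c->cur_timeout = (h2c->flags & H2_CF_IS_BACK) ?
                           h2c->back_timeout : h2c->front_timeout;
    }
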
Amaury Denoyelle 1f76b8ae07 MEDIUM: connection: implement passive reverse
Define a new method conn_reverse(). This method is used to reverse a
connection from frontend to backend or vice-versa depending on its
initial status.

For the moment, only passive reverse is implemented. This covers the
transition from the frontend to the backend side: the connection is
detached from its owner session, which can then be freed, and the
connection is then linked to the server instance.
2023-08-24 14:44:33 +02:00
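
A rough sketch of the passive-reverse sequence described above; the
helper names and fields are hypothetical stand-ins, not HAProxy's
actual API:

    /* Sketch only: detach a frontend connection from its session and
     * hand it to the target server for reuse on the backend side. */
    struct sketch_session;
    struct sketch_server;
    struct sketch_conn {
        struct sketch_session *owner;
        struct sketch_server  *srv;
    };

    /* hypothetical helpers assumed to exist for this illustration */
    void sketch_session_detach(struct sketch_session *s, struct sketch_conn *c);
    void sketch_session_free(struct sketch_session *s);
    void sketch_srv_add_conn(struct sketch_server *srv, struct sketch_conn *c);

    static void conn_reverse_passive(struct sketch_conn *conn,
                                     struct sketch_server *srv)
    {
        struct sketch_session *sess = conn->owner;

        sketch_session_detach(sess, conn);  /* connection leaves the session */
        conn->owner = NULL;
        sketch_session_free(sess);          /* session can then be freed */

        conn->srv = srv;                    /* link to the server instance */
        sketch_srv_add_conn(srv, conn);     /* reusable for next transfers */
    }
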
Amaury Denoyelle d8d9122a02 MINOR: connection: centralize init/deinit of backend elements
A connection contains extra elements which are only used for the backend
side. Regroup their allocation and deallocation in two new functions
named conn_backend_init() and conn_backend_deinit().

No functional change is introduced with this commit. The new functions
are reused in place of manual alloc/dealloc in conn_new() / conn_free().
This patch will be useful for reverse connect support with connection
conversion from backend to frontend side and vice-versa.
2023-08-24 14:44:33 +02:00
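
A small sketch of the regrouping idea, with illustrative backend-only
members in place of the real connection fields:

    /* Sketch only: allocate/release backend-only connection elements in
     * one place instead of inline in conn_new()/conn_free(). */
    #include <stdlib.h>

    struct sketch_conn {
        void *idle_node;   /* backend-only: idle-connection tree node */
        void *dst_addr;    /* backend-only: destination address storage */
    };

    static int conn_backend_init(struct sketch_conn *conn)
    {
        conn->idle_node = calloc(1, 64);
        conn->dst_addr  = calloc(1, 64);
        return conn->idle_node && conn->dst_addr;
    }

    static void conn_backend_deinit(struct sketch_conn *conn)
    {
        free(conn->idle_node); conn->idle_node = NULL;
        free(conn->dst_addr);  conn->dst_addr  = NULL;
    }
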
Amaury Denoyelle fbe35afaa4 MINOR: proxy: simplify parsing 'backend/server'
Several CLI handlers use a server argument specified with the format
'<backend>/<server>'. The parsing of this argument is done in two
steps: first splitting the string on the '/' delimiter, then using
get_backend_server() to retrieve the server instance.

Refactor these code sections with the following changes:
* splitting is reimplemented using the ist API
* get_backend_server() is removed; instead, the already existing
  proxy_be_by_name() and server_find_by_name() are used, which between
  them contained the code duplicated by the now-removed function

No functional change occurs with this commit. However, it will be useful
to add new configuration options reusing the same '<backend>/<server>'
format for reverse connect.
2023-08-24 14:44:33 +02:00
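
The two-step lookup could look roughly like the following; this sketch
splits on '/' with plain C string handling rather than the ist API, and
the signatures of the two lookup functions named above are assumptions:

    /* Sketch only: resolve a "<backend>/<server>" argument. */
    #include <stddef.h>
    #include <string.h>

    struct proxy;
    struct server;
    struct proxy  *proxy_be_by_name(const char *name);        /* assumed */
    struct server *server_find_by_name(struct proxy *be,
                                       const char *name);     /* assumed */

    static struct server *resolve_backend_server(char *arg)
    {
        char *slash = strchr(arg, '/');
        struct proxy *be;

        if (!slash)
            return NULL;       /* missing '/' delimiter */
        *slash = '\0';         /* split into "<backend>" and "<server>" */

        be = proxy_be_by_name(arg);
        if (!be)
            return NULL;       /* unknown backend */
        return server_find_by_name(be, slash + 1);
    }
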
Willy Tarreau 9b47ed1a93 IMPORT: xxhash: update xxHash to version 0.8.2
Peter Varkoly reported a build issue on ppc64le in xxhash.h. Our version
(0.8.1) was the latest one 9 months ago, and since then this specific
issue was addressed in 0.8.2, so let's apply the maintenance update.

This should be backported to 2.8 and 2.7.
2023-08-24 12:01:06 +02:00
Willy Tarreau 821fc95146 MINOR: pattern: do not needlessly lookup the LRU cache for empty lists
If a pattern list is empty, there's no way we can find its elements in
the pattern cache, so let's avoid this expensive lookup. This can happen
for ACLs or maps loaded from files that may optionally be empty, for
example. Doing so improves the request rate by roughly 10% for a single
such match with only 8 threads. That's expected because the LRU cache
pre-creates an entry to be committed in case the list lookup succeeds
after a miss, so skipping the lookup bypasses all of this.
2023-08-22 07:27:01 +02:00
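
The fast path boils down to checking for an empty list before any cache
work; a hedged sketch with simplified stand-in types:

    /* Sketch only: skip the LRU cache when the pattern list is empty. */
    #include <stddef.h>

    struct sketch_pat_expr {
        unsigned int nb_patterns;   /* stand-in for the real list head */
    };

    static const void *pattern_match_sketch(struct sketch_pat_expr *expr)
    {
        if (!expr->nb_patterns)
            return NULL;   /* empty list: no LRU lookup, no pre-created
                            * cache entry to commit, immediate no-match */

        /* ... normal path: LRU cache lookup, then list scan on miss ... */
        return NULL;
    }
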
William Lallemand 3fde27d980 BUG/MINOR: quic: ssl_quic_initial_ctx() uses error count not error code
ssl_quic_initial_ctx() is supposed to use the error count and not the
error code.

Bug was introduced by 557706b3 ("MINOR: quic: Initialize TLS contexts
for QUIC openssl wrapper").

No backport needed.
2023-08-21 15:35:17 +02:00
William Lallemand 8c004153e5 BUG/MINOR: quic: allow-0rtt warning must only be emitted with quic bind
When built with USE_QUIC_OPENSSL_COMPAT, a warning is emitted when using
allow-0rtt. However, this warning is emitted for every allow-0rtt
keyword on the bind line, which is confusing; it must only be done when
the bind is a QUIC one. This also does not handle the case where the
allow-0rtt keyword is in the crt-list.

This patch moves the warning to ssl_quic_initial_ctx() in order to emit
it in all the useful cases.
2023-08-21 15:33:26 +02:00
Frédéric Lécaille 2677dc1c32 MINOR: quic+openssl_compat: Emit an alert for "allow-0rtt" option
QUIC 0-RTT is not supported when haproxy is linked against a TLS stack
with limited QUIC support (OpenSSL).

Modify the "allow-0rtt" option callback to make it emit a warning if set on
a QUIC listener "bind" line.
2023-08-17 15:44:03 +02:00
Frédéric Lécaille 0e13325f23 MINOR: quic+openssl_compat: Do not start without "limited-quic"
Add a check for limited-quic in check_config_validity() when compiled
with USE_QUIC_OPENSSL_COMPAT so that we prevent a config from starting
accidentally with limited QUIC support. If a QUIC listener is found
when using the compatibility mode and limited-quic is not set, an error
message is reported explaining that the SSL library is not compatible
and suggesting that the user enable limited-quic if that's what they
want, and the startup fails.

This partially reverts commit 7c730803d ("MINOR: quic: Warning for
OpenSSL wrapper QUIC bindings without "limited-quic"") since a warning
was not sufficient.
2023-08-17 15:44:03 +02:00
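
A hedged sketch of such a startup check; the flags and the reporting
below are stand-ins for the real globals and ha_alert() machinery:

    /* Sketch only: refuse to start when a QUIC listener exists in
     * compatibility mode but "limited-quic" was not set. */
    #include <stdio.h>

    static int limited_quic_set;     /* stand-in for the global option */
    static int quic_listener_found;  /* stand-in for the config scan result */

    static int check_limited_quic(void)
    {
    #ifdef USE_QUIC_OPENSSL_COMPAT
        if (quic_listener_found && !limited_quic_set) {
            fprintf(stderr, "QUIC listener found, but the SSL library only "
                    "offers limited QUIC support; please enable "
                    "'limited-quic' if this is what you want.\n");
            return 1;   /* fatal error: startup fails */
        }
    #endif
        return 0;
    }
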
Amaury Denoyelle cd97ba147c BUILD/IMPORT: fix compilation with PLOCK_DISABLE_EBO=1
Compilation is broken due to a missing __pl_wait_unlock_long() definition
when building with PLOCK_DISABLE_EBO=1. This has been the case since the
following commit, which activates the inlining version of
pl_wait_unlock_long():
  commit 071d689a51
  MINOR: threads: inline the wait function for pthread_rwlock emulation

Add an extra check on PLOCK_DISABLE_EBO before choosing the inline or
default version of pl_wait_unlock_long() to fix this.
2023-08-17 11:16:54 +02:00
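
The fix amounts to an extra preprocessor condition, roughly as below;
only PLOCK_DISABLE_EBO and PLOCK_LORW_INLINE_WAIT are macro names taken
from these commits, the rest is illustrative:

    /* Sketch only: never select the inlined wait path when EBO is
     * disabled, since __pl_wait_unlock_long() is not defined then. */
    #if !defined(PLOCK_DISABLE_EBO) && defined(PLOCK_LORW_INLINE_WAIT)
    #  define USE_INLINED_WAIT 1   /* inline __pl_wait_unlock_long() */
    #else
    #  define USE_INLINED_WAIT 0   /* default out-of-line version */
    #endif
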
Willy Tarreau 544c2f2d9e MINOR: pools: use EBO to wait for unlock during pool_flush()
pool_flush() could become a source of contention on the pool's free list
if there are many competing threads using that pool. Let's make sure we
use EBO and not just a simple CPU relaxation there, to avoid disturbing
them.
2023-08-17 09:09:20 +02:00
Willy Tarreau 78fa54863d MINOR: atomic: make sure to always relax after a failed CAS
There were a few places left where we forgot to call __ha_cpu_relax()
after a failed CAS, in the HA_ATOMIC_UPDATE_{MIN,MAX} macros, and in
a few sync_* API macros (the same as above plus HA_ATOMIC_CAS and
HA_ATOMIC_XCHG). Let's add them now.

This could have been a cause of contention, particularly with
process_stream() calling stream_update_time_stats() which uses 8 of them
in a call (4 for the server, 4 for the proxy). This may be a possible
explanation for the high CPU consumption reported in GH issue #2251.

This should be backported at least to 2.6 as it's harmless.
2023-08-17 09:09:20 +02:00
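
The fixed pattern looks like the following sketch, here written with
C11 atomics and a pause-based stand-in for __ha_cpu_relax():

    /* Sketch only: an atomic "update max" that relaxes after each
     * failed CAS to reduce cache-line ping-pong between threads. */
    #include <stdatomic.h>

    #if defined(__x86_64__) || defined(__i386__)
    #  define cpu_relax() __builtin_ia32_pause()
    #else
    #  define cpu_relax() do { } while (0)
    #endif

    static inline void atomic_update_max(_Atomic unsigned int *dst,
                                         unsigned int val)
    {
        unsigned int old = atomic_load_explicit(dst, memory_order_relaxed);

        while (old < val &&
               !atomic_compare_exchange_weak(dst, &old, val))
            cpu_relax();   /* lost the race: pause before retrying */
    }
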
Willy Tarreau 071d689a51 MINOR: threads: inline the wait function for pthread_rwlock emulation
When using pthread_rwlock emulation, contention is reported on
pl_wait_unlock_long(), which makes it really inconvenient to analyse
what is happening. Now plock supports inlining the wait call for just the lorw
functions by enabling PLOCK_LORW_INLINE_WAIT. Let's do this so that now
the wait time will be precisely reported as either pthread_rwlock_rdlock()
or pthread_rwlock_wrlock() depending on the contended function, but no
more on pl_wait_unlock_long(), which will still be reported for all
other locks.
2023-08-17 00:09:05 +02:00
Willy Tarreau e56275378f IMPORT: lorw: support inlining the wait call
Now when PLOCK_LORW_INLINE_WAIT is defined, the pl_wait_unlock_long()
calls in pl_lorw_rdlock() and pl_lorw_wrlock() will be inlined so that
all the CPU time is accounted for in the calling function.

This is plock upstream commit c993f81d581732a6eb8fe3033f21970420d21e5e.
2023-08-17 00:09:05 +02:00
Willy Tarreau 66dcc0550e IMPORT: plock: always expose the inline version of the lock wait function
Doing so will make it possible to expose the time spent in certain highly
contended functions, which can be desirable for more accurate CPU
profiling. For example this could be done in locking functions that
are already not inlined so that they are the ones being reported as
those consuming the CPU instead of just pl_wait_unlock_long().

This is plock upstream commit 7505c2e2c8c4aa0ab8f52a2288e1334ae6412be4.
2023-08-17 00:09:05 +02:00
Willy Tarreau c6b98f05d2 IMPORT: plock: also support inlining the int code
Commit 9db830b ("plock: support inlining exponential backoff code")
added an option to support inlining of the wait code for longs but
forgot to do it for ints. Let's do it now.

This is plock upstream commit b1f9f0d252fa40577d11cfb2bc0a809d6960a297.
2023-08-17 00:09:05 +02:00
Aurelien DARRAGON 3b4d2b7975 DEV: makefile: fix POSIX compatibility for "range" target
make "range" which was introduced with 06d34d4 ("DEV: makefile: add a
new "range" target to iteratively build all commits") does not work with
POSIX shells (namely: bourne shell), and will fail with this kind of
errors:

   |/bin/sh: 6: Syntax error: "(" unexpected (expecting ")")
   |make: *** [Makefile:1226: range] Error 2

This is because arrays and arithmetic expressions which are used for the
"range" target are not supported by sh (unlike bash and other "modern"
interpreters).

However the make "all" target already complies with POSIX, so in this
commit we try to make "range" target POSIX compliant to ensure that the
makefile works as expected on systems where make uses /bin/sh as default
intepreter and where /bin/sh points to POSIX shell.
2023-08-17 00:09:05 +02:00
William Lallemand 6ecb7df4e1 BUILD: Makefile: realigned USE_* options in make help
Realigned the USE_* options of `make help` because of the length of
USE_QUIC_OPENSSL_COMPAT.

No backport needed.
2023-08-17 00:03:01 +02:00
William Lallemand 17bfc75974 BUILD: Makefile: add USE_QUIC_OPENSSL_COMPAT to make help
Add the missing USE_QUIC_OPENSSL_COMPAT option to `make help`.

No backport needed.
2023-08-17 00:01:27 +02:00
William Lallemand 1b5f9de1b4 BUILD: Makefile: add the USE_QUIC option to make help
Add the missing "USE_QUIC" option to `make help`.

Must be backported as far as 2.4.
2023-08-16 23:41:15 +02:00
Remi Tricot-Le Breton 672203c26b DOC: jwt: Add explicit list of supported algorithms
Add explicit list of algorithms supported by the jwt_verify converter.
2023-08-16 11:53:42 +02:00
Tim Duesterhus c21b98a6d3 REGTESTS: Do not use REQUIRE_VERSION for HAProxy 2.5+ (3)
Introduced in:

424981cde REGTEST: add ifnone-forwardfor test
b015b3eb1 REGTEST: add RFC7239 forwarded header tests

see also:

fbbbc33df REGTESTS: Do not use REQUIRE_VERSION for HAProxy 2.5+
2023-08-15 11:29:13 +02:00
Willy Tarreau f97db23b6d SCRIPTS: git-show-backports: automatic ref and base detection with -m
When running with -m (check for missing backports) we often have to fill
in lots of information that can be determined automatically the vast
majority of the time:
  - restart point (last cherry-picked ID from one of the last commits)
  - current branch (HEAD)
  - reference branch (the one that contains most of the last commits)

These elements are not that hard to determine, so let's make sure we
can fall back to them when running in missing mode.

The reference branch is guessed by looking at the upstream branch that
most frequently contains some of the last 10 commits. It can be inaccurate
if multiple branches exist with these commits, or when upstream changes
due to a non-LTS branch disappearing in the middle of the series, in which
case passing "-r" will help. But most of the time it works OK. It also gives
precedence to local branches over remote ones for such choices. A test in
2.4 at commit 793a4b520 correctly shows 2.6/master as the upstream despite
2.5 having been used for the early ones of the tag.

For the restart point, we assume that the most recent commit that was
backported serves as a reference (and not the most recently backported
commit). This means that the usual case where an old commit was found
to be missing will not fool the analysis. Commits are inspected from
2 commits before the last tag, and reordered from the parent's tree
to see which one is the last one.

With this, it's sufficient to issue "git-show-backports -q -m" to get
the list of backports from the upstream branch, restarting from the
last backported one.
2023-08-14 13:12:56 +02:00
Johannes Naab d5590ef633 DOC: typo: fix sc-set-gpt references
Only sc-inc-gpc and sc-set-gpt do exist. The mix-up sc-inc-gpt crept in
in 71d189219 (DOC: config: Rework and uniformize how TCP/HTTP rules are
documented, 2021-10-14) and got copied in a92480462 (MINOR: http-rules:
Add missing actions in http-after-response ruleset, 2023-01-05).
2023-08-14 09:04:45 +02:00
Aurelien DARRAGON 7eb05891d8 BUG/MINOR: stktable: allow sc-add-gpc from tcp-request connection
Following the previous commit's logic, we enable the use of sc-add-gpc
from tcp-request connection, since this was probably forgotten in the
first place for sc-set-gpt0, and since sc-add-gpc was inspired by it, it
lacks it as well.

As sc-add-gpc was implemented in 5a72d03a58 ("MINOR: stick-table: implement
the sc-add-gpc() action"), this should only be backported to 2.8.
2023-08-14 09:03:49 +02:00
Aurelien DARRAGON 6c79309fda BUG/MINOR: stktable: allow sc-set-gpt(0) from tcp-request connection
Both the documentation and the original developer's intent seem to
suggest that the sc-set-gpt/sc-set-gpt0 actions should be available from
tcp-request connection.

Yet because it was probably forgotten when expr support was added to
sc-set-gpt0 in 0d7712dff0 ("MINOR: stick-table: allow sc-set-gpt0 to
set value from an expression"), it doesn't work and will report this
kind of error:
 "internal error, unexpected rule->from=0, please report this bug!"

Fix the code to comply with the documentation and the expected
behavior.

This must be backported to all stable versions.

[for < 2.5, as only sc-set-gpt0 existed back then, the patch must be
manually applied to skip irrelevant parts]
2023-08-14 09:03:44 +02:00
Willy Tarreau 67da85fa4c DEV: flags/show-sess-to-flags: properly decode fd.state
fd.state is reported without the "0x" prefix in "show sess"; let's
support this during decoding.

This may be backported to all versions supporting this utility.
2023-08-14 08:48:49 +02:00
Willy Tarreau 75028bcba6 [RELEASE] Released version 2.9-dev3
Released version 2.9-dev3 with the following main changes :
    - BUG/MINOR: ssl: OCSP callback only registered for first SSL_CTX
    - BUG/MEDIUM: h3: Properly report a C-L header was found to the HTX start-line
    - MINOR: sample: add pid sample
    - MINOR: sample: implement act_conn sample fetch
    - MINOR: sample: accept_date / request_date return %Ts / %tr timestamp values
    - MEDIUM: sample: implement us and ms variant of utime and ltime
    - BUG/MINOR: sample: check alloc_trash_chunk() in conv_time_common()
    - DOC: configuration: describe Td in Timing events
    - MINOR: sample: implement the T* timer tags from the log-format as fetches
    - DOC: configuration: add sample fetches for timing events
    - BUG/MINOR: quic: Possible crash when acknowledging Initial v2 packets
    - MINOR: quic: Export QUIC traces code from quic_conn.c
    - MINOR: quic: Export QUIC CLI code from quic_conn.c
    - MINOR: quic: Move TLS related code to quic_tls.c
    - MINOR: quic: Add new "QUIC over SSL" C module.
    - MINOR: quic: Add a new quic_ack.c C module for QUIC acknowledgements
    - CLEANUP: quic: Defined but no more used function (quic_get_tls_enc_levels())
    - MINOR: quic: Split QUIC connection code into three parts
    - CLEANUP: quic: quic_conn struct cleanup
    - MINOR: quic: Move the QUIC frame pool to its proper location
    - BUG/MINOR: chunk: fix chunk_appendf() to not write a zero if buffer is full
    - BUG/MEDIUM: h3: Be sure to handle fin bit on the last DATA frame
    - DOC: configuration: rework the custom log format table
    - BUG/MINOR: quic+openssl_compat: Non initialized TLS encryption levels
    - CLEANUP: acl: remove cache_idx from acl struct
    - REORG: cfgparse: extract curproxy as a global variable
    - MINOR: acl: add acl() sample fetch
    - BUILD: cfgparse: keep a single "curproxy"
    - BUG/MEDIUM: bwlim: Reset analyse expiration date when the channel analyse ends
    - MEDIUM: stream: Reset response analyse expiration date if there is no analyzer
    - BUG/MINOR: htx/mux-h1: Properly handle bodyless responses when splicing is used
    - BUG/MEDIUM: quic: consume contig space on requeue datagram
    - BUG/MINOR: http-client: Don't forget to commit changes on HTX message
    - CLEANUP: stconn: Move comment about sedesc fields on the field line
    - REGTESTS: http: Create a dedicated script to test spliced bodyless responses
    - REGTESTS: Test SPLICE feature is enabled to execute script about splicing
    - BUG/MINOR: quic: reappend rxbuf buffer on fake dgram alloc error
    - BUILD: quic: fix wrong potential NULL dereference
    - MINOR: h3: abort request if not completed before full response
    - BUG/MAJOR: http-ana: Get a fresh trash buffer for each header value replacement
    - CLEANUP: quic: Remove quic_path_room().
    - MINOR: quic: Amplification limit handling sanitization.
    - MINOR: quic: Move some counters from [rt]x quic_conn anonymous struct
    - MEDIUM: quic: Send CONNECTION_CLOSE packets from a dedicated buffer.
    - MINOR: quic: Use a pool for the connection ID tree.
    - MEDIUM: quic: Allow the quic_conn memory to be asap released.
    - MINOR: quic: Release asap quic_conn memory (application level)
    - MINOR: quic: Release asap quic_conn memory from ->close() xprt callback.
    - MINOR: quic: Warning for OpenSSL wrapper QUIC bindings without "limited-quic"
    - REORG: http: move has_forbidden_char() from h2.c to http.h
    - BUG/MAJOR: h3: reject header values containing invalid chars
    - MINOR: mux-h2/traces: also suggest invalid header upon parsing error
    - MINOR: ist: add new function ist_find_range() to find a character range
    - MINOR: http: add new function http_path_has_forbidden_char()
    - MINOR: h2: pass accept-invalid-http-request down the request parser
    - REGTESTS: http-rules: add accept-invalid-http-request for normalize-uri tests
    - BUG/MINOR: h1: do not accept '#' as part of the URI component
    - BUG/MINOR: h2: reject more chars from the :path pseudo header
    - BUG/MINOR: h3: reject more chars from the :path pseudo header
    - REGTESTS: http-rules: verify that we block '#' by default for normalize-uri
    - DOC: clarify the handling of URL fragments in requests
    - BUG/MAJOR: http: reject any empty content-length header value
    - BUG/MINOR: http: skip leading zeroes in content-length values
    - BUG/MEDIUM: mux-h1: fix incorrect state checking in h1_process_mux()
    - BUG/MEDIUM: mux-h1: do not forget EOH even when no header is sent
    - BUILD: mux-h1: shut a build warning on clang from previous commit
    - DEV: makefile: add a new "range" target to iteratively build all commits
    - CI: do not use "groupinstall" for Fedora Rawhide builds
    - CI: get rid of travis-ci wrapper for Coverity scan
    - BUG/MINOR: quic: mux started when releasing quic_conn
    - BUG/MINOR: quic: Possible crash in quic_cc_conn_io_cb() traces.
    - MINOR: quic: Add a trace for QUIC conn fd ready for receive
    - BUG/MINOR: quic: Possible crash when issuing "show fd/sess" CLI commands
    - BUG/MINOR: quic: Missing tasklet (quic_cc_conn_io_cb) memory release (leak)
    - BUG/MEDIUM: quic: fix tasklet_wakeup loop on connection closing
    - BUG/MINOR: hlua: fix invalid use of lua_pop on error paths
    - MINOR: hlua: add hlua_stream_ctx_prepare helper function
    - BUG/MEDIUM: hlua: streams don't support mixing lua-load with lua-load-per-thread
    - MAJOR: threads/plock: update the embedded library again
    - MINOR: stick-table: move the task_queue() call outside of the lock
    - MINOR: stick-table: move the task_wakeup() call outside of the lock
    - MEDIUM: stick-table: change the ref_cnt atomically
    - MINOR: stick-table: better organize the struct stktable
    - MEDIUM: peers: update ->commitupdate out of the lock using a CAS
    - MEDIUM: peers: drop then re-acquire the wrlock in peer_send_teachmsgs()
    - MEDIUM: peers: only read-lock peer_send_teachmsgs()
    - MEDIUM: stick-table: use a distinct lock for the updates tree
    - MEDIUM: stick-table: touch updates under an upgradable read lock
    - MEDIUM: peers: drop the stick-table lock before entering peer_send_teachmsgs()
    - MINOR: stick-table: move the update lock into its own cache line
    - CLEANUP: stick-table: slightly reorder the stktable struct
    - BUILD: defaults: use __WORDSIZE not LONGBITS for MAX_THREADS_PER_GROUP
    - MINOR: tools: make ptr_hash() support 0-bit outputs
    - MINOR: tools: improve ptr hash distribution on 64 bits
    - OPTIM: tools: improve hash distribution using a better prime seed
    - OPTIM: pools: use exponential back-off on shared pool allocation/release
    - OPTIM: pools: make pool_get_from_os() / pool_put_to_os() not update ->allocated
    - MINOR: pools: introduce the use of multiple buckets
    - MEDIUM: pools: spread the allocated counter over a few buckets
    - MEDIUM: pools: move the used counter over a few buckets
    - MEDIUM: pools: move the needed_avg counter over a few buckets
    - MINOR: pools: move the failed allocation counter over a few buckets
    - MAJOR: pools: move the shared pool's free_list over multiple buckets
    - MINOR: pools: make pool_evict_last_items() use pool_put_to_os_no_dec()
    - BUILD: pools: fix build error on clang with inline vs forceinline
2023-08-12 19:59:27 +02:00
Willy Tarreau 2d18717fb8 BUILD: pools: fix build error on clang with inline vs forceinline
clang is more picky than gcc regarding duplicate "inline". The functions
declared with "forceinline" don't need to have "inline" since it's already
in the macro.
2023-08-12 19:58:17 +02:00
Willy Tarreau 29eed99b50 MINOR: pools: make pool_evict_last_items() use pool_put_to_os_no_dec()
The bucket is already known, no need to calculate it again. Let's just
include the lower level functions.
2023-08-12 19:04:34 +02:00
Willy Tarreau 7bf829ace1 MAJOR: pools: move the shared pool's free_list over multiple buckets
This aims at further reducing the contention on the free_list when using
global pools. The free_list pointer now appears for each bucket, and both
the alloc and the release code skip to a next bucket when ending on a
contended entry. The default entry used for allocations and releases
depend on the thread ID so that locality is preserved as much as possible
under low contention.

It would be nice to improve the situation to make sure that releases to
the shared pools don't consider the first entry's pointer but only an
argument that would be passed and that would correspond to the bucket in
the thread's cache. This would reduce computations and make sure that the
shared cache only contains items whose pointers match the same bucket.
This was not yet done. One possibility could be to keep the same splitting
in the local cache.

With this change, an h2load test with 5 * 160 conns & 40 streams on 80
threads that was limited to 368k RPS with the shared cache jumped to
3.5M RPS for 8 buckets, 4M RPS for 16 buckets, 4.7M RPS for 32 buckets
and 5.5M RPS for 64 buckets.
2023-08-12 19:04:34 +02:00
Willy Tarreau 8a0b5f783b MINOR: pools: move the failed allocation counter over a few buckets
The failed allocation counter cannot depend on a pointer, but since it's
a perpetually increasing counter and not a gauge, we don't care where
it's incremented. Thus instead we're hashing on the TID. There's no
contention there anyway, but it's better not to waste the room in
the pool's heads and to move that with the other counters.
2023-08-12 19:04:34 +02:00
Willy Tarreau da6999f839 MEDIUM: pools: move the needed_avg counter over a few buckets
That's the same principle as for ->allocated and ->used. Here we return
the sum of the raw values, so the result still needs to be fed to
swrate_avg(). It also means that we now use the local ->used instead
of the global one for the calculations and do not need to call pool_used()
anymore on fast paths. The number of samples should likely be divided by
the number of buckets, but that's not done yet (better to observe first).

A function pool_needed_avg() was added to report aggregated values for
the "show pools" command.

With this change, an h2load made of 5 * 160 conn * 40 streams on 80
threads raised from 1.5M RPS to 6.7M RPS.
2023-08-12 19:04:34 +02:00
Willy Tarreau 9e5eb586b1 MEDIUM: pools: move the used counter over a few buckets
That's the same principle as for ->allocated. The small difference here
is that it's no longer possible to decrement ->used in batches when
releasing clusters from the cache to the shared cache, so the counter
has to be decremented for each of them. But as it provides less
contention and it's done only during forced eviction, it shouldn't be
a problem.

A function "pool_used()" was added to return the sum of the entries.
It's used by pool_alloc_nocache() and pool_free_nocache() which need
to count the number of used entries. It's not a problem since such
operations are done when picking/releasing objects to/from the OS,
but it is a reminder that the number of buckets should remain small.

With this change, an h2load test made of 5 * 160 conn * 40 streams on
80 threads raised from 812k RPS to 1.5M RPS.
2023-08-12 19:04:34 +02:00
Willy Tarreau cdb711e42b MEDIUM: pools: spread the allocated counter over a few buckets
The ->used counter is one of the most stressed, and it heavily
depends on the ->allocated one, so let's first move ->allocated
to a few buckets.

A function "pool_allocated()" was added to return the sum of the entries.
It's important not to abuse it as it does iterate, so everywhere it's
possible to avoid it by keeping a local counter, it's better. Currently
it's used for limited pools which need to make sure they do not allocate
too many objects. That's an acceptable tradeoff to save CPU on large
machines at the expense of spending a little bit more on small ones which
normally are not under load.
2023-08-12 19:04:34 +02:00
Willy Tarreau 06885aaea7 MINOR: pools: introduce the use of multiple buckets
On many threads and without the shared cache, there can be extreme
contention on the ->allocated counter, the ->free_list pointer, and
the ->used counter. It's possible to limit this contention by spreading
the counters a little bit over multiple entries, that are summed up when
a consultation is needed. The criterion used to spread the values cannot
be related to the thread ID due to migrations, since we need to keep
consistent stats (allocated vs used).

Instead we'll just hash the pointer: it provides an index that does the
job and that is consistent for the object. With just a few entries
(16 here, as that showed almost identical performance between global and
non-global pools), even the iterations needed to sum the counters should
be short enough during measurements not to be a problem.

A pair of functions designed to ease pointer hash bucket calculation were
added, with one of them doing it for thread IDs because allocation failures
will be associated with a thread and not a pointer.

For now this patch only brings in the relevant parts of the infrastructure,
the CONFIG_HAP_POOL_BUCKETS_BITS macro that defaults to 6 bits when 512
threads or more are supported, 5 bits when 128 or more are supported, 4
bits when 16 or more are supported, otherwise 3 bits for small setups.
The array in the pool_head and the two utility functions are already
added. It should have no measurable impact beyond inflating the pool_head
structure.
2023-08-12 19:04:34 +02:00
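
A condensed sketch of the bucket mechanism: counters are spread over a
small array indexed by a pointer hash and summed when consulted. The
hash and the sizes below are simplified stand-ins:

    /* Sketch only: spread a hot counter over 2^BITS buckets. */
    #include <stdatomic.h>
    #include <stdint.h>

    #define BUCKETS_BITS 4
    #define BUCKETS      (1U << BUCKETS_BITS)

    struct pool_sketch {
        _Atomic uint64_t allocated[BUCKETS];
    };

    static inline unsigned int ptr_bucket(const void *ptr)
    {
        /* simplified multiplicative pointer hash (see the hash commits) */
        return (unsigned int)(((uintptr_t)ptr * 0xacd1be85UL)
                              >> (32 - BUCKETS_BITS)) & (BUCKETS - 1);
    }

    static inline void inc_allocated(struct pool_sketch *p, const void *ptr)
    {
        atomic_fetch_add(&p->allocated[ptr_bucket(ptr)], 1);
    }

    /* consultation iterates over all buckets, so avoid it on fast paths */
    static inline uint64_t pool_allocated(struct pool_sketch *p)
    {
        uint64_t sum = 0;
        for (unsigned int i = 0; i < BUCKETS; i++)
            sum += atomic_load(&p->allocated[i]);
        return sum;
    }
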
Willy Tarreau 29ad61fb00 OPTIM: pools: make pool_get_from_os() / pool_put_to_os() not update ->allocated
The pool's allocation counter doesn't strictly need to be updated
from these functions; it may more efficiently be done in the caller
(even out of a loop for pool_flush() and pool_gc()), and doing so will
also help us spread the counters over an array later. The functions
were renamed _noinc and _nodec to make sure we catch any possible
user in an external patch. If needed, the original functions may easily
be reimplemented in an inline function.
2023-08-12 19:04:34 +02:00
Willy Tarreau feeda4132b OPTIM: pools: use exponential back-off on shared pool allocation/release
Running a stick-table stress with -dMglobal under 56 threads shows
extreme contention on the pool's free_list, because it has to be
processed in two phases while the retry path only implements a simple
cpu_relax().

Let's at least implement exponential back-off here to limit the neighbor's
noise and reduce the time needed to successfully acquire the pointer. Just
doing so shows there's still contention but almost doubled the performance,
from 1.1 to 2.1M req/s.
2023-08-12 19:04:34 +02:00
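
Exponential back-off around a contended atomic retry loop can be
sketched as follows (stand-in code, not the plock implementation):

    /* Sketch only: EBO while waiting for a contended list head to become
     * available, doubling the pause between attempts up to a cap. */
    #include <stdatomic.h>

    #if defined(__x86_64__) || defined(__i386__)
    #  define cpu_relax() __builtin_ia32_pause()
    #else
    #  define cpu_relax() do { } while (0)
    #endif

    #define BUSY ((void *)-1)   /* sentinel marking the head as locked */

    static void *grab_head(_Atomic(void *) *head)
    {
        unsigned int wait = 1;
        void *cur = atomic_load(head);

        while (cur == BUSY ||
               !atomic_compare_exchange_weak(head, &cur, BUSY)) {
            for (unsigned int i = 0; i < wait; i++)
                cpu_relax();              /* back off, limit neighbor noise */
            if (wait < 1024)
                wait <<= 1;               /* exponential growth, capped */
            if (cur == BUSY)
                cur = atomic_load(head);  /* re-read once unlocked */
        }
        return cur;   /* caller owns the list and must restore the head */
    }
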
Willy Tarreau f0d188f6ed OPTIM: tools: improve hash distribution using a better prime seed
During tests it was noticed that the current hash is not that good
on 4- and 5- bit hashes. About 7.5% of all the 32-bit primes were tested
as candidates for the hash function, by submitting them 128 arrangements
of N pointers among 40k extracted from haproxy's pools, and the average
fill rates for 1- to 12- bit hashes were measured and compared. It was
clear that some values do not provide great hashes and other ones are
way more resistant.

The current value is not bad at all but delivers 42.6% unique 2-bit
outputs, 41.6% 3-bit, 38.0% 4-bit, 38.2% 5-bit and 37.1% 10-bit. Some
values did perform significantly better, among which 0xacd1be85 which
does 43.2% 2-bit, 42.5% 3-bit, 42.2% 4-bit, 39.2% 5-bit and 37.3% 10-bit.

The reverse value used in ptr2_hash() was really underperforming and
was replaced with 0x9d28e4e9, which does 49.6%, 40.4%, 42.6%, 39.1%, and
37.2% respectively.

This should slightly improve the accuracy of the task and memory
profiling, and will be useful for pools.
2023-08-12 19:04:34 +02:00
Willy Tarreau 58946d44f8 MINOR: tools: improve ptr hash distribution on 64 bits
When testing the pointer hash on 64-bit real pointers (map entries),
it appeared that the shift by 33 bits that was hoped to compensate for
the 3 null LSBs degrades the hash, and that the centering is more optimal
at 31-(bits+1)/2. This makes sense since the topmost bit of the
multiplier is 31, so for an input of 1 bit and 1 bit of output we
would always get zero. With the formula adjusted this way, we can get
up to ~15% more unique entries at 10 bits and ~24% more at 11 bits.
2023-08-12 19:04:34 +02:00
Willy Tarreau ab6cb5dea0 MINOR: tools: make ptr_hash() support 0-bit outputs
When dealing with macro-based size definitions, it is useful to be able
to hash pointers on zero bits so that the macro automatically returns a
constant 0. For now it only supports 1-32. Let's just add this special
case. It's automatically optimized out by the compiler since the function
is inlined.
2023-08-12 19:04:34 +02:00
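
Putting the pieces of these hash commits together, a sketch of the
resulting function could look like this (supporting 0 to 31 output bits
here; the exact arrangement in the real code may differ):

    /* Sketch only: multiplicative pointer hash with the 0xacd1be85 seed
     * and the 31-(bits+1)/2 centering described above; 0 bits returns a
     * constant 0 so macro-sized callers optimize it away. */
    #include <stdint.h>

    static inline unsigned int ptr_hash(const void *p, const unsigned int bits)
    {
        unsigned long long x = (uintptr_t)p;

        if (!bits)
            return 0;   /* constant, folded at compile time when inlined */

        x *= 0xacd1be85ULL;
        return (unsigned int)(x >> (31 - (bits + 1) / 2)) & ((1U << bits) - 1);
    }
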
Willy Tarreau 59c347c15e BUILD: defaults: use __WORDSIZE not LONGBITS for MAX_THREADS_PER_GROUP
LONGBITS was defined long ago with old compilers that didn't provide the
word size. It's still present as being referenced in various places in the
code, but we must not use it to define other macros that may be evaluated
at pre-processing time since it contains sizeof() and casts that are not
compatible with preprocessor conditions. Let's switch MAX_THREADS_PER_GROUP
to __WORDSIZE so that we can condition blocks of code on it if needed.

LONGBITS should really be removed by now, given that we don't support
compilers not providing __WORDSIZE anymore (gcc < 4.2).
2023-08-12 19:04:34 +02:00
Willy Tarreau 9e52c35de4 CLEANUP: stick-table: slightly reorder the stktable struct
By moving the config-time stuff after the updt_lock, we can plug some
holes without interfering with it. This allows us to get back to the
768-bytes struct. The performance was not affected at all.
2023-08-11 19:03:35 +02:00
Willy Tarreau 9c6248560e MINOR: stick-table: move the update lock into its own cache line
The read-lock contention observed on the update lock while turning it
into an upgradable lock was due to false sharing with the nearby
updates. Simply moving the lock alone into its own cache line is
sufficient to almost double the performance again, raising from 2355
to 4480k RPS with very low contention:

  Samples: 1M of event 'cycles', 4000 Hz, Event count (approx.): 743422995452 lost
  Overhead  Shared Object          Symbol
    15.88%  haproxy                [.] stktable_lookup_key
     5.94%  haproxy                [.] ebmb_lookup
     5.69%  haproxy                [.] http_wait_for_request
     3.66%  haproxy                [.] stktable_touch_with_exp
     2.62%  [kernel]               [k] _raw_spin_unlock_irqrestore
     1.86%  haproxy                [.] http_action_return
     1.79%  haproxy                [.] stream_process_counters
     1.78%  [kernel]               [k] skb_release_data
     1.77%  haproxy                [.] process_stream

Unfortunately, trying to move the line anywhere else didn't work,
despite the remaining holes, because this structure is not quite
clean. This adds 64 bytes to a struct that was already 768 long,
so it's now 832. It's possible to repack it a little bit and regain
these bytes by removing the THREAD_ALIGN before "keys" because we
rarely use the config stuff, but that's a bit unsafe.
2023-08-11 19:03:35 +02:00
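
The isolation itself is essentially an alignment change; a hedged sketch
with a generic rwlock and illustrative fields:

    /* Sketch only: give the hot update lock its own cache line so that
     * writes to neighbouring fields no longer invalidate it. */
    #include <pthread.h>

    struct table_sketch {
        /* frequently-written fields sharing their cache lines */
        unsigned int update;
        unsigned int localupdate;

        /* the lock alone on a 64-byte line: no more false sharing */
        pthread_rwlock_t updt_lock __attribute__((aligned(64)));

        /* read-mostly, config-time fields may fill the hole after it */
        unsigned int cfg_size;
    };
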
Willy Tarreau 45eeaad45f MEDIUM: peers: drop the stick-table lock before entering peer_send_teachmsgs()
The function drops the lock very early, and the only operations that
are performed in the entry code are updates of the current peer's
last_local_table, which doesn't need to be protected. Thus it's
easier to drop the lock before entering the function and it further
limits its scope.

This has raised the peak RPS from 2050 to 2355k/s with a peers section on
the 80-core machine.
2023-08-11 19:03:35 +02:00
Willy Tarreau cfeca3a3a3 MEDIUM: stick-table: touch updates under an upgradable read lock
Instead of taking the update's write lock in stktable_touch_with_exp(),
while most of the time under high load there is nothing to update because
the entry was touched before having been synchronized and is thus still
present, let's do the check under a read lock and upgrade it to perform
the update if needed. These updates are rare and the contention is not
expected to be very high, so at the first failure to upgrade we retry
directly with a write lock.

By doing so the performance has almost doubled again, from 1140 to 2050k
with a peers section enabled. The contention is now on taking the read
lock itself, so there's little to be gained beyond this in this function.
2023-08-11 19:03:35 +02:00
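
Since the check almost always concludes that nothing needs updating, the
pattern is roughly the following, sketched here with plain pthread
rwlocks (which cannot upgrade in place, hence the drop-and-relock on the
rare slow path) and assumed helper names:

    /* Sketch only: check under a read lock, take the write lock only
     * when an update is really needed, and re-check after upgrading. */
    #include <pthread.h>

    struct entry;
    int  entry_needs_requeue(const struct entry *e);   /* assumed helpers */
    void requeue_entry(struct entry *e);

    static pthread_rwlock_t updt_lock = PTHREAD_RWLOCK_INITIALIZER;

    void touch_entry(struct entry *e)
    {
        pthread_rwlock_rdlock(&updt_lock);
        if (!entry_needs_requeue(e)) {            /* common case under load */
            pthread_rwlock_unlock(&updt_lock);
            return;
        }
        pthread_rwlock_unlock(&updt_lock);

        pthread_rwlock_wrlock(&updt_lock);        /* rare slow path */
        if (entry_needs_requeue(e))               /* re-check: things may
                                                     have changed unlocked */
            requeue_entry(e);
        pthread_rwlock_unlock(&updt_lock);
    }
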
Willy Tarreau 87e072eea5 MEDIUM: stick-table: use a distinct lock for the updates tree
Updating an entry in the updates tree is currently performed under the
table's write lock, which causes huge contention with other accesses
such as lookups and free calls. Aside from the updates tree, the update,
localupdate and commitupdate variables, nothing is manipulated, so
let's create a distinct lock (updt_lock) to protect these together
to remove this contention. It required to add an extra lock in the
few places where we delete the update (though only if we're really
going to delete it) to protect the tree. This is very convenient
because now peer_send_teachmsgs() only needs to take this read lock,
and there is very little contention left on the stick-table.

With this alone, the performance jumped from 614k to 1140k/s on a
80-thread machine with a peers section! Stick-table updates with
no peers however now have to stand two locks and slightly regressed
from 4.0-4.1M/s to 3.9-4.0. This is fairly minimal compared to the
significant unlocking of the peers updates and considered totally
acceptable.
2023-08-11 19:03:35 +02:00
Willy Tarreau 29982ea769 MEDIUM: peers: only read-lock peer_send_teachmsgs()
This function doesn't need to be write-locked. It performs a lookup
of the next update at its index, atomically updates the ref_cnt on
the stksess, updates some shared_table fields on the local thread,
and updates the table's commitupdate. Now that this update is atomic
we don't need to keep the write lock during that period. In addition
this function's callers do not rely on the write lock to be held
either since it was dropped during peer_send_updatemsg() anyway.

Now, when the function is entered with a write lock, it's downgraded
to a read lock, otherwise a read lock is grabbed. Updates are looked
up under the read lock and the message is sent without the lock. The
commitupdate is still performed under the read lock (so as not to
break the code too much), and the write lock is re-acquired when
leaving if needed. This allows multiple peers to look up updates in
parallel and to avoid stalling stick-table lookups.
2023-08-11 19:03:35 +02:00
Willy Tarreau d4f8286e45 MEDIUM: peers: drop then re-acquire the wrlock in peer_send_teachmsgs()
This function maintains the write lock for a while. In practice it does
not need to hold it that long, and some parts could be performed under a
read lock. This patch first drops then re-acquires the write lock at the
function's entry. The purpose is simply to break the end-to-end atomicity
to prove that it has no impact in case something needs to be bisected
later. In fact the write lock is already dropped while calling
peer_send_updatemsg().
2023-08-11 19:03:35 +02:00