Commit Graph

18617 Commits

Author SHA1 Message Date
William Lallemand
5d1e1317c2 DOC: management: "show startup-logs" for master CLI
"show startup-logs" on the master CLI has a slighly different behavior.

No backport needed.
2022-10-14 15:46:19 +02:00
William Lallemand
f76b3b47ea DOC: management: add forgotten "show startup-logs"
The keyword was never documented.

Must be backported to all maintained versions.
2022-10-14 15:43:28 +02:00
Aurelien DARRAGON
b5684c54d1 DOC/CLEANUP: lua-api: removing duplicate core.proxies attribute
The core.proxies attribute was found twice in the core class.
Removing the second occurrence, which is purely a duplicate and
does not belong here.
2022-10-14 15:18:25 +02:00
Aurelien DARRAGON
846fc7d950 DOC: lua-api: updating toolbox link
The link to the Lua toolbox was dead (the project has been deprecated).
Adding a legacy link to get the old toolbox source code, as well as
a link to luarocks, which seems to have superseded it.
2022-10-14 15:18:25 +02:00
Aurelien DARRAGON
53901f423d DOC/CLEANUP: lua-api: some minor corrections
This is a minor cleanup before 2.7 release to correct some typos
and errors without changing the meaning.
2022-10-14 15:18:25 +02:00
Christopher Faulet
380ae9c3ff MINOR: httpclient/lua: Don't set req_payload callback if body is empty
The HTTP client's req_payload callback is set when a request payload must
be streamed. In Lua, this callback is set when a body is passed as an
argument to one of the httpclient functions (head/get/post/put/delete).
However, there is no reason to set it if the body string is empty.

This patch is related to the issue #1898. It may be backported as far as
2.5.
2022-10-14 15:18:25 +02:00
Christopher Faulet
48005de17c BUG/MEDIUM: httpclient: Don't set EOM flag on an empty HTX message
In the HTTP client, when the request body is streamed, at the end of the
payload, we must be sure to not set the EOM flag on an empty message.
Otherwise, because there is no data, the buffer is reset to be released and
the flag is lost. Thus, the HTTP client is never notified of the end of
payload for the request and the applet is blocked. If the HTTP client is
instantiated from a Lua script, it is even worse because we fall into a
wakeup loop between the Lua script and the HTTP client applet. In the end,
HAProxy is killed because of the watchdog.

This patch should fix the issue #1898. It must be backported to 2.6.
2022-10-14 15:18:25 +02:00
Frédéric Lécaille
48e46f98cc BUILD: ssl_sock: bind_conf uninitialized in ssl_sock_bind_verifycbk()
Even if this cannot happen, ensure <bind_conf> is initialized in this
function to please some compilers.

Take the opportunity of this patch to replace an ABORT_NOW() with
a BUG_ON(), because if the variable values it tests are not initialized,
this really is because there is a bug.

Must be backported to 2.6.
2022-10-14 10:25:11 +02:00
William Lallemand
ef3e5a1b68 DOC: management: update the "reload" command of the master CLI
Update the "reload" command with the new output format.
2022-10-13 18:16:12 +02:00
Willy Tarreau
f5a0c8abf5 MEDIUM: quic: respect the threads assigned to a bind line
Right now the QUIC thread mapping derives the thread ID from the CID
by dividing by global.nbthread. This is a problem because this makes
QUIC work on all threads and ignores the "thread" directive on the
bind lines. In addition, only 8 bits are used, which is no longer
compatible with the up to 4096 threads we may have in a configuration.

Let's modify it this way:
  - the CID now dedicates 12 bits to the thread ID
  - on output we continue to place the TID directly there.
  - on input, the value is extracted. If it corresponds to a valid
    thread number of the bind_conf, it's used as-is.
  - otherwise it's used as a rank within the current bind_conf's
    thread mask so that in the end we still get a valid thread ID
    for this bind_conf.

The extraction function now requires a bind_conf in order to get the
group and thread mask. It was better to use bind_confs now as the goal
is to make them support multiple listeners sooner or later.
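
For illustration, a minimal sketch of the encode/extract logic described
above, assuming a 32-bit hash taken from the CID and a 64-bit thread mask
(names and exact layout are assumptions, not HAProxy's actual code):

    #include <stdint.h>

    #define CID_TID_BITS 12                    /* bits dedicated to the thread ID */
    #define CID_TID_MASK ((1u << CID_TID_BITS) - 1)

    /* on output: place the emitting thread ID in the dedicated CID bits */
    static inline void cid_set_tid(uint32_t *cid_hash, uint32_t tid)
    {
        *cid_hash = (*cid_hash & ~CID_TID_MASK) | (tid & CID_TID_MASK);
    }

    /* on input: extract the value; if it is a thread enabled in the bind_conf's
     * mask, use it as-is, otherwise use it as a rank among the enabled threads
     * so that we still end up on a valid thread for this bind_conf.
     */
    static inline uint32_t cid_to_tid(uint32_t cid_hash, uint64_t bind_mask)
    {
        uint32_t val = cid_hash & CID_TID_MASK;
        uint32_t rank, i;

        if (val < 64 && (bind_mask & (1ULL << val)))
            return val;

        rank = val % (uint32_t)__builtin_popcountll(bind_mask);
        for (i = 0; i < 64; i++) {
            if ((bind_mask & (1ULL << i)) && !rank--)
                return i;
        }
        return 0; /* not reached as long as the mask is non-empty */
    }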
2022-10-13 18:08:05 +02:00
William Lallemand
ec1f8a62ca MINOR: mworker/cli: reload command displays the startup-logs
Change the output of the "reload" command: it now displays "Success=0"
if the reload failed and "Success=1" if it succeeded.

If the startup-logs are available (USE_SHM_OPEN=1), the command will
print a "--\n" line, followed by the content of the startup-logs.

Example:

	$ echo "reload" | socat /tmp/master.sock -
	Success=1
	--
	[NOTICE]   (482713) : haproxy version is 2.7-dev7-4827fb-69
	[NOTICE]   (482713) : path to executable is ./haproxy
	[WARNING]  (482713) : config : 'http-request' rules ignored for proxy 'frt1' as they require HTTP mode.
	[NOTICE]   (482713) : New worker (482720) forked
	[NOTICE]   (482713) : Loading success.

	$ echo "reload" | socat /tmp/master.sock -
	Success=0
	--
	[NOTICE]   (482886) : haproxy version is 2.7-dev7-4827fb-69
	[NOTICE]   (482886) : path to executable is ./haproxy
	[ALERT]    (482886) : config : parsing [test3.cfg:1]: unknown keyword 'Aglobal' out of section.
	[ALERT]    (482886) : config : Fatal errors found in configuration.
	[WARNING]  (482886) : Loading failure!

	$
2022-10-13 17:59:48 +02:00
William Lallemand
eba6a54cd4 MINOR: logs: startup-logs can use a shm for logging the reload
When compiled with USE_SHM_OPEN=1, the startup-logs are now able to use
a shm, which is used to keep the logs when switching to mworker wait
mode. This makes it possible to keep the logs of a failed reload.

When allocating the startup-logs at the first start of the process, haproxy
does a shm_open() with a unique path based on the PID of the process; the
file is unlinked immediately so no stray files are left behind. The fd
resulting from this shm is stored in the HAPROXY_STARTUPLOGS_FD
environment variable so it can be mmapped again when switching to wait
mode.

When forking children, the process copies the mmapped area to a malloc'ed
ring so the master and the workers never share the same memory section.
When switching to wait mode, the shm is not used anymore either, as it is
also copied to a malloc'ed structure.

This allows the "show startup-logs" command to be used over the master CLI
to get the logs of the latest startup or reload. This way the logs of
the latest failed reload are also kept.

This is only activated on the linux-glibc target for now.
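
For illustration, the shm_open()/unlink/fd-in-environment sequence described
above roughly looks like this (minimal sketch; only the variable name
HAPROXY_STARTUPLOGS_FD comes from the commit, the rest is assumed):

    #include <sys/mman.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    #define STARTUP_LOGS_SIZE 65536

    /* first start: create the shm, unlink it at once, export its fd */
    static void *startup_logs_new(void)
    {
        char name[64], fdstr[16];
        void *area;
        int fd;

        snprintf(name, sizeof(name), "/haproxy_startup_logs_%d", getpid());
        fd = shm_open(name, O_RDWR | O_CREAT | O_EXCL, 0600);
        if (fd < 0)
            return NULL;
        shm_unlink(name);          /* no stray file is left behind */
        fcntl(fd, F_SETFD, 0);     /* clear FD_CLOEXEC so the fd survives re-exec */
        if (ftruncate(fd, STARTUP_LOGS_SIZE) < 0)
            return NULL;

        area = mmap(NULL, STARTUP_LOGS_SIZE, PROT_READ | PROT_WRITE,
                    MAP_SHARED, fd, 0);
        if (area == MAP_FAILED)
            return NULL;

        snprintf(fdstr, sizeof(fdstr), "%d", fd);
        setenv("HAPROXY_STARTUPLOGS_FD", fdstr, 1);
        return area;
    }

    /* wait mode: recover the fd from the environment and map it again */
    static void *startup_logs_reattach(void)
    {
        const char *s = getenv("HAPROXY_STARTUPLOGS_FD");

        if (!s)
            return NULL;
        return mmap(NULL, STARTUP_LOGS_SIZE, PROT_READ | PROT_WRITE,
                    MAP_SHARED, atoi(s), 0);
    }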
2022-10-13 16:50:22 +02:00
William Lallemand
35df34223b MINOR: buffers: split b_force_xfer() into b_cpy() and b_force_xfer()
Split the b_force_xfer() into b_ncat() and b_force_xfer().

The previous b_force_xfer() implementation was basically a copy with a
b_del() on the src buffer. Keep this implementation to make b_ncat(), and
simply call b_ncat() + b_del() from b_force_xfer().
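
As a rough sketch of the relationship (simplified buffer type, not the real
struct buffer API):

    #include <stddef.h>
    #include <string.h>

    struct buf {
        char  *area;   /* storage */
        size_t data;   /* amount of data present */
        size_t size;   /* total storage size */
    };

    /* copy at most <count> bytes from <src> to <dst> without touching <src>;
     * returns the number of bytes copied.
     */
    static size_t b_ncat(struct buf *dst, const struct buf *src, size_t count)
    {
        size_t room = dst->size - dst->data;

        if (count > src->data)
            count = src->data;
        if (count > room)
            count = room;
        memcpy(dst->area + dst->data, src->area, count);
        dst->data += count;
        return count;
    }

    /* same as b_ncat() but also deletes the copied bytes from <src> */
    static size_t b_force_xfer(struct buf *dst, struct buf *src, size_t count)
    {
        size_t ret = b_ncat(dst, src, count);

        memmove(src->area, src->area + ret, src->data - ret);   /* the b_del() part */
        src->data -= ret;
        return ret;
    }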
2022-10-13 16:45:28 +02:00
William Lallemand
9e4ead3095 MINOR: ring: ring_cast_from_area() cast from an allocated area
Cast a unified ring + storage area to a ring, without reinitializing the
data buffer. Only the waiters and the lock are reinitialized.

It helps retrieve a previously allocated ring, from an mmap for
example.
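
Conceptually, this works like the sketch below (illustrative layout only, not
HAProxy's struct ring):

    #include <pthread.h>
    #include <stddef.h>

    /* unified header + storage layout assumed for the example */
    struct fake_ring {
        pthread_mutex_t lock;     /* reinitialized by the cast */
        int             waiters;  /* reinitialized by the cast */
        size_t          head;     /* existing data: left untouched */
        size_t          tail;
        char            storage[];
    };

    /* map an already filled area (e.g. coming from an mmap) back to a ring:
     * only the transient fields are reset, the stored data is preserved.
     */
    static struct fake_ring *ring_cast_from_area(void *area)
    {
        struct fake_ring *ring = area;

        pthread_mutex_init(&ring->lock, NULL);
        ring->waiters = 0;
        return ring;
    }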
2022-10-13 16:45:28 +02:00
Amaury Denoyelle
91b2305ad7 MINOR: quic: implement datagram cleanup for quic_receiver_buf
Each time data is read on the QUIC receiver socket, we try to reuse the
first datagram of the currently used quic_receiver_buf instead of
allocating a new one. This algorithm is suboptimal if there are several
unused datagrams, as only the first one is tested and its buffer removed
from the quic_receiver_buf.

If QUIC traffic is substantial, this can lead to a large number of
quic_dgram occurrences allocated from pool_head_quic_dgram and a lack of
free space in the allocated quic_receiver_buf buffers.

To improve this, each time we want to reuse a datagram, we pop elements
until a non-yet released datagram is found or the list is empty. All
intermediary elements are freed and the last found datagram can be
reused. This operation has been extracted in a dedicated function named
quic_rxbuf_purge_dgrams().
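
The purge loop is conceptually the following (simplified list type, not the
actual mt_list code):

    #include <stdlib.h>

    struct dgram {
        struct dgram *next;
        int           consumed;   /* set once the handler has released it */
        /* ... offset/length into the receiver buffer ... */
    };

    /* pop consumed datagrams from the head of the list: intermediary ones are
     * freed and the last consumed one is returned so that its struct can be
     * reused for the next read. Returns NULL if the first datagram is still
     * in use (nothing can be purged).
     */
    static struct dgram *rxbuf_purge_dgrams(struct dgram **head)
    {
        struct dgram *reusable = NULL;

        while (*head && (*head)->consumed) {
            struct dgram *dg = *head;

            *head = dg->next;
            free(reusable);          /* free(NULL) is a no-op */
            reusable = dg;           /* keep the most recent one for reuse */
        }
        return reusable;
    }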

This should improve the memory consumption incurred by quic_dgram instances
under heavy QUIC traffic. Note that there is still room for improvement:
if the first datagram is still in use, it may block several unused datagrams
after it. However, this would require supporting out-of-order removal of
datagrams, which is currently not possible.

This should be backported up to 2.6.
2022-10-13 11:06:48 +02:00
Amaury Denoyelle
1cba8d60f3 CLEANUP: quic: improve naming for rxbuf/datagrams handling
QUIC datagrams are read from a random thread. They are then redispatched
to the connection thread according to the first packet's DCID. These
operations are implemented through a special buffer designed to avoid
locking.

Refactor this code with the following changes:
* <rxbuf> type is renamed <quic_receiver_buf>. Its list element is also
  renamed to highlight its attach point to a receiver.
* <quic_dgram> and <quic_receiver_buf> definitions are moved to
  quic_sock-t.h. This helps to reduce the size of quic_conn-t.h.
* <quic_dgram> list elements are renamed to highlight their attach point
  into a <quic_receiver_buf> and a <quic_dghdlr>.

This should be backported up to 2.6.
2022-10-13 11:06:48 +02:00
Amaury Denoyelle
8c4d062d25 CLEANUP: quic: remove unused rxbufs member in receiver
rxbuf is the structure used to store QUIC datagrams and redispatch them
to the connection thread.

Each receiver manages a list of rxbuf. This was stored both as an array
and an mt_list. Currently, only the mt_list is needed, so the <rxbufs>
member is removed from the receiver structure.

This should be backported up to 2.6.
2022-10-13 11:05:41 +02:00
Frédéric Lécaille
e1a49cfd4d MINOR: quic: Split the secrets key allocation in two parts
Implement quic_tls_secrets_keys_alloc()/quic_tls_secrets_keys_free() to allocate
the memory for only one direction (RX or TX).
Modify ha_quic_set_encryption_secrets() to call these functions for one of
these directions (or both). From now on we can rely on the value of the
secret keys to know whether they were derived.
Remove the QUIC_FL_TLS_SECRETS_SET flag which is no longer useful.
Consequently, the secrets are dumped by the traces only if derived.

Must be backported to 2.6.
2022-10-13 10:12:03 +02:00
Frédéric Lécaille
4aa7d8197a BUG/MINOR: quic: Stalled 0RTT connections with big ClientHello TLS message
This issue was reproduced with the -Q picoquic client option to split a big
ClientHello message into two Initial packets, with haproxy as server without
any knowledge of any previous 0RTT session (restarted after a first 0RTT
session). The received 0RTT packets were removed from their queue when the
second Initial packet was parsed, and the QUIC handshake state never
progressed and remained at the Initial state.

To avoid such situations, after having treated some Initial packets we always
check if there are 0RTT packets to parse and we never remove them from their
queue. This will be done after the handshake is completed or upon idle timeout
expiration.

Also add more traces to be able to analyze the handshake progression.

Tested with ngtcp2 and picoquic

Must be backported to 2.6.
2022-10-13 10:12:03 +02:00
Frédéric Lécaille
9f9263ed13 MINOR: quic: Use a non-contiguous buffer for RX CRYPTO data
Implement quic_get_ncbuf() to dynamically allocate a new ncbuf to be attached
to any quic_cstream struct which needs such a buffer. Note that there is no
quic_cstream for the 0RTT encryption level. quic_free_ncbuf() is added to
release the memory allocated for a non-contiguous buffer.

Modify qc_handle_crypto_frm() to call this function and allocate an ncbuf for
crypto data which are not received in order. The crypto data which are received
in order are not buffered but provided to the TLS stack (calling
qc_provide_cdata()).

Modify qc_treat_rx_crypto_frms(), which is called after having provided the
in-order received crypto data to the TLS stack, to provide again the remaining
crypto data which have been buffered, if possible (if they are in order). Each
time buffered CRYPTO data are consumed, we try to release the memory allocated
for the non-contiguous buffer (ncbuf).
Also move the rx.crypto.offset quic_enc_level struct member to the rx.offset
quic_cstream struct member.

Must be backported to 2.6.
2022-10-13 10:12:03 +02:00
Frédéric Lécaille
a20c93e6e2 MINOR: quic: Extract CRYPTO frame parsing from qc_parse_pkt_frms()
Implement qc_handle_crypto_frm() to parse a CRYPTO frame.

Must be backported to 2.6.
2022-10-13 10:12:03 +02:00
Frédéric Lécaille
7e3f7c47e9 MINOR: quic: New quic_cstream object implementation
Add a new quic_cstream struct definition to implement the CRYPTO data stream.
This is a simplification of the qcs object (QUIC streams) for the CRYPTO data
without any information about the flow control. They are not attached to any
tree, but to a QUIC encryption level, one per encryption level except for
the early data encryption level (for 0RTT). A stream descriptor is also
allocated for each CRYPTO data stream.

Must be backported to 2.6
2022-10-13 10:12:03 +02:00
Ilya Shipitsin
b65fd66666 CI: SSL: temporarily stick to LibreSSL=3.5.3
The recently released 3.6.0 introduced some regression which must be
resolved first; let us use the 3.5.3 notation instead of "latest".
2022-10-13 08:53:27 +02:00
Ilya Shipitsin
14711bdc9a CI: SSL: use proper version generating when "latest" semantic is used
both "OPENSSL_VERSION=latest" and "LIBRESSL_VERSION=latest" processing
introduced errors when build-ssl.sh script was invoked. that error
in turn led to skipping custom openssl build and haproxy was linked against
stock openssl, i.e. openssl-1.1.1
2022-10-13 08:53:11 +02:00
Willy Tarreau
d114f4a68f MEDIUM: checks: spread the checks load over random threads
The CPU usage pattern was found to be high (5%) on a machine with
48 threads and only 100 servers checked every second. That was
supposed to be only 100 connections per second, which should be very
cheap. It was figured out that due to the check tasks unbinding from any
thread when going back to sleep, they're queued into the shared queue.

Not only does this require manipulating the global queue lock, but in
addition it means that all threads have to check the global queue
before going to sleep (hence take a lock again) to figure out how long
to sleep, and that they would all sleep only for the shortest amount
of time to the next check; one would pick it and all other ones would
go back to sleep waiting for the next check.

That's perfectly visible in time-to-first-byte measurements. A quick
test consisting in retrieving the stats page in CSV over a 48-thread
process checking 200 servers every 2 seconds shows the following tail:

  percentile   ttfb(ms)
  99.98        2.43
  99.985       5.72
  99.99       32.96
  99.995     82.176
  99.996     82.944
  99.9965    83.328
  99.997      83.84
  99.9975    84.288
  99.998      85.12
  99.9985    86.592
  99.999         88
  99.9995    89.728
  99.9999   100.352

One solution could consist in forcefully binding checks to threads at
boot time, but that's annoying, will cause trouble for dynamic servers
and may cause some skew in the load depending on some server patterns.

Instead here we take a different approach. A check remains bound to its
thread for as long as possible, but upon every wakeup, the thread's load
is compared with another random thread's load. If it's found that that
other thread's load is less than half of the current one's, the task is
bounced to that thread. In order to prevent that new thread from doing
the same, we set a flag "CHK_ST_SLEEPING" that indicates that it just
woke up and we're bouncing the task only on this condition.
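
In pseudo-C, the bouncing decision looks roughly like this (thread_load() and
the migration step are placeholders; only CHK_ST_SLEEPING and
statistical_prng() come from the commits):

    #define CHK_ST_SLEEPING 0x0800           /* value assumed for the example */

    struct check {
        unsigned int state;
        int tid;                             /* thread currently owning the check */
    };

    /* placeholders for the example */
    extern unsigned long thread_load(int tid);
    extern unsigned int statistical_prng(void);

    /* called when a check task wakes up on its current thread */
    static void check_maybe_bounce(struct check *check, int nbthread)
    {
        int other = statistical_prng() % nbthread;   /* random candidate thread */

        if (!(check->state & CHK_ST_SLEEPING))
            return;                          /* only bounce right after a wakeup */
        check->state &= ~CHK_ST_SLEEPING;    /* the target thread won't bounce it again */

        /* migrate only if the candidate is loaded less than half as much as us */
        if (thread_load(other) < thread_load(check->tid) / 2)
            check->tid = other;              /* i.e. requeue the task on <other> */
    }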

Tests have shown that the initial load was very unfair before, with a few
check threads having a load of 15-20 and the vast majority having zero.
With this modification, after two "inter" delays, the load is either zero
or one everywhere when checks start. The same test shows a CPU usage that
significantly drops, between 0.5 and 1%. The same latency tail measurement
is much better, roughly 10 times smaller:

  percentile   ttfb(ms)
  99.98        1.647
  99.985       1.773
  99.99        4.912
  99.995        8.76
  99.996        8.88
  99.9965      8.944
  99.997       9.016
  99.9975      9.104
  99.998       9.224
  99.9985      9.416
  99.999         9.8
  99.9995      10.04
  99.9999     10.432

In fact one difference here is that many threads work while in the past
they were waking up and going down to sleep after having perturbed the
shared lock. Thus it is anticipated that this will scale way smoother
than before. Under strace it's clearly visible that all threads are
sleeping for the time it takes to relaunch a check, there's no more
thundering herd wakeups.

However it is also possible that in some rare cases such as very short
check intervals smaller than a scheduler's timeslice (such as 4ms),
some users might have benefited from the work being concentrated on
less threads and would instead observe a small increase of apparent
CPU usage due to more total threads waking up even if that's for less
work each and less total work. That's visible with 200 servers at 4ms
where show activity shows that a few threads were overloaded and others
doing nothing. It's not a problem, though, as in practice checks are not
supposed to eat much CPU and to wake up fast enough to represent a
significant load anyway, and the main issue they could have been
causing (aside from the global lock) is an increase in last-percentile latency.
2022-10-12 21:49:30 +02:00
Willy Tarreau
a840b4a39b MINOR: checks: use the lighter PRNG for spread checks
There's no point using ha_random32() which is heavy and uses shared
variables to calculate a random timer when we have statistical_prng()
which does the same and was made exactly for this.
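
The idea behind such a lightweight PRNG is a simple xorshift on a
thread-local state, roughly (sketch, not necessarily the exact code):

    #include <stdint.h>

    /* per-thread xorshift32: no shared variable, no locking, statistically
     * good enough for spreading timers.
     */
    static __thread uint32_t prng_state = 2463534242U;   /* any non-zero seed */

    static inline uint32_t statistical_prng(void)
    {
        uint32_t x = prng_state;

        x ^= x << 13;
        x ^= x >> 17;
        x ^= x << 5;
        return prng_state = x;
    }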
2022-10-12 21:49:30 +02:00
Willy Tarreau
99521abd59 BUG/MINOR: server: make sure "show servers state" hides private bits
In the past we've seen "show servers state" dump some internal bits for
the check states, that were causing regtests to fail. The relevant bits
have been added to the doc to fix the public API and make sure they do
not change by accident, but the output doesn't take care of masking the
undesired ones, causing regtests (and possibly user programs) to fail
when new bits are added. Let's add the mask for the only documented ones
(0x0F for check and 0x1F for agent respectively).

This could be backported wherever the server state is present, though
there's a tiny risk that some undocumented bits might have already
leaked to some user scripts, so it might be wise to wait a bit before
doing that or even not to backport too far.
2022-10-12 21:45:39 +02:00
Christopher Faulet
104985610d BUG/MEDIUM: mux-h1: Handle abort with an incomplete message during parsing
In h1_process_demux(), aborts for incomplete messages were not properly
handled. It was not an issue because the abort was detected later in
h1_process(). But it will be an issue when performing the aborts refactoring.

First, when a read0 was detected, the SE_FL_EOI flag was set for messages in
DONE or TUNNEL state or for messages without known length (so responses in
close mode). The last statement is not accurate. The message must also be in
DATA state. Otherwise, the SE_FL_EOI flag may be set on an incomplete message.

Then, an error was reported, via the SE_FL_ERROR flag, only when an incomplete
message was detected during the payload parsing. It must also be reported if
headers are incomplete. Here again, the error is detected later for now. But
it could be an issue later.

There is no reason to backport this patch.
2022-10-12 17:10:44 +02:00
Christopher Faulet
9009c974c1 BUG/MEDIUM: mux-h1: Add connection error handling when reading/sending on a pipe
There is no error handling when we read or write on a pipe. The error is
caught later, in the mux I/O handler. But there is no reason not to do so
here.

There is no reason to backport it because no issue has been reported so far
because of this "bug". In all cases, it must be evaluated first.
2022-10-12 17:10:44 +02:00
Christopher Faulet
c8db114afc MINOR: flags/mux-fcgi: Decode FCGI connection and stream flags
The new functions fconn_show_flags() and fstrm_show_flags() decode the flags
state into a string, and are used by dev/flags:

$ ./dev/flags/flags fconn 0x3100
fconn->flags = FCGI_CF_GET_VALUES | FCGI_CF_KEEP_CONN | FCGI_CF_MPXS_CONNS

$ ./dev/flags/flags fstrm 0x3300
fstrm->flags = FCGI_SF_WANT_SHUTW | FCGI_SF_WANT_SHUTR | FCGI_SF_OUTGOING_DATA | FCGI_SF_BEGIN_SENT
2022-10-12 17:10:41 +02:00
Christopher Faulet
3965aa7494 REORG: mux-fcgi: Extract flags and enums into mux_fcgi-t.h
The same was performed for the H2 and H1 multiplexers. FCGI connection and
stream flags are moved into a dedicated header file. It will mainly be used
to be able to decode mux-fcgi flags from the flags utility.

In this patch, we move the flags and enums to mux_fcgi-t.h, as well as the
two state decoding inline functions.
2022-10-12 17:10:37 +02:00
Amaury Denoyelle
3e0648837c BUG/MINOR: stick-table: fix build with DEBUG_THREAD
Compilation is broken with DEBUG_THREAD since the following patch
  76642223f0
  MEDIUM: stick-table: switch the table lock to rwlock

Fix this by updating a legacy HA_SPIN_INIT() to HA_RWLOCK_INIT().

No backport needed unless the mentioned patch is backported.
2022-10-12 16:54:59 +02:00
Willy Tarreau
cbdb528a76 MEDIUM: stick-table: requeue the wakeup task out of the write lock
We don't need to call stktable_requeue_exp() with the table's lock
held anymore, so let's move it out. It should slightly reduce the
contention on the write lock, though it is now already quite low.
2022-10-12 14:19:05 +02:00
Willy Tarreau
dbae89e09c MEDIUM: stick-table: always use atomic ops to requeue the table's task
We're generalizing the change performed in previous commit "MEDIUM:
stick-table: requeue the expiration task out of the exclusive lock"
to stktable_requeue_exp() so that it can also be used by callers of
__stktable_store(). At the moment there's still no visible change
since it's still called under the write lock. However, the previous
code in stktable_touch_with_exp() was updated to use this function.
2022-10-12 14:19:05 +02:00
Willy Tarreau
eb23e3e243 MINOR: stick-table: split stktable_store() between key and requeue
__stktable_store() performs two distinct things, one is to insert a key
and the other one is to requeue the task's expiration date. Since the
latter might be done without a lock, let's first split the function in
two halves. For now this has no impact.
2022-10-12 14:19:05 +02:00
Willy Tarreau
e3f5ae895a MEDIUM: stick-table: requeue the expiration task out of the exclusive lock
With 48 threads, a heavily loaded table with plenty of trackers and
rules and a short expiration timer of 10ms saturates the CPU at 232k
rps. By carefully using atomic ops we can make sure that t->exp_next
and t->task->expire converge to the earliest next expiration date and
that all of this can be performed under atomic ops without any lock.
That's what this patch is doing in stktable_touch_with_exp(). This is
sufficient to double the performance and reach 470k rps.
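
The lock-free convergence to the earliest date boils down to a
compare-and-swap loop along these lines (generic C11 sketch, not the
HAProxy atomic macros; a plain '<' stands for the wrapping-aware tick
comparison used in the real code):

    #include <stdatomic.h>

    /* make <*expire> converge to the earliest of its current value and <when>
     * without taking any lock.
     */
    static void requeue_earliest(_Atomic unsigned int *expire, unsigned int when)
    {
        unsigned int old = atomic_load_explicit(expire, memory_order_relaxed);

        while (when < old) {
            /* another thread may update it concurrently: retry until we store
             * <when> or the current value is already earlier than it.
             */
            if (atomic_compare_exchange_weak(expire, &old, when))
                break;
        }
    }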

It's worth noting that __stktable_store() uses a mix of eb32_insert()
and task_queue, and that the second part of it could possibly benefit
from this, even though sometimes it's called under a lock that was
already held.
2022-10-12 14:19:05 +02:00
Willy Tarreau
e62885237c MEDIUM: stick-table: make stktable_set_entry() look up under a read lock
On a 24-core machine having some "stick-store response" rules, a lot of
time is spent in the write lock in stktable_set_entry(). Let's apply the
same mechanism as for the stktable_get_entry() consisting in looking up
the value under the read lock and upgrading it to a write lock only to
perform modifications. Here we even have the luxury of upgrading the
lock since there are no alloc/free in the path. All this increases the
performance by 40% (from 363k to 510k rps).
2022-10-12 14:19:05 +02:00
Willy Tarreau
996f1a5124 MEDIUM: stick-table: do not take a lock to update t->current anymore.
We don't need to be protected by the table's lock when touching t->current
if we do it using atomics, and that's great because it allows us to have
a cleaner stksess_new() that doesn't require a lock either, and to avoid
manipulating pools under a lock.

That's another 1% performance gain from 2.07 to 2.10M req/s under 48
threads.
2022-10-12 14:19:05 +02:00
Willy Tarreau
47f229702e MEDIUM: stick-table: make stktable_get_entry() look up under a read lock
On a 24-core machine doing lots of track-sc, it was found that the lock
in stktable_get_entry() was responsible for 25% of the CPU alone. It's
sad because most of its job is to protect the table during the lookup.

Here we're taking a slightly different approach: the lock is first taken
for reads during the lookup, and only in case of failure we switch it for
a write lock. We don't even perform an upgrade here since an allocation
is needed between the two: it would be wasted to do it under the lock and
is generally not a good idea, so it's better to release the read lock and
try again.
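
The pattern can be summarized as follows (generic pthread sketch with
placeholder helpers, not the actual stick-table code):

    #include <pthread.h>
    #include <stdlib.h>

    struct table {
        pthread_rwlock_t lock;
        /* ... tree of entries ... */
    };

    struct entry;                                                    /* opaque for the sketch */
    extern struct entry *lookup(struct table *t, const void *key);   /* placeholder */
    extern struct entry *insert(struct table *t, struct entry *e);   /* placeholder */
    extern struct entry *entry_new(const void *key);                 /* placeholder */

    static struct entry *get_entry(struct table *t, const void *key)
    {
        struct entry *e, *prev;

        /* fast path: pure lookup under the read lock */
        pthread_rwlock_rdlock(&t->lock);
        e = lookup(t, key);
        pthread_rwlock_unlock(&t->lock);
        if (e)
            return e;

        /* slow path: allocate outside of any lock, then insert under the
         * write lock; the insert may find an entry that appeared meanwhile.
         */
        e = entry_new(key);
        pthread_rwlock_wrlock(&t->lock);
        prev = insert(t, e);
        pthread_rwlock_unlock(&t->lock);

        if (prev != e) {            /* lost the race: keep the existing entry */
            free(e);
            e = prev;
        }
        return e;
    }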

Here the performance under 48 threads with 3 trackers on the same table
jumped from 455k to 2.07M, or 4.55x! Note that the same approach should
be possible for stktable_set_entry().
2022-10-12 14:19:05 +02:00
Willy Tarreau
a7d6a1396e MEDIUM: stick-table: switch to rdlock in stktable_lookup() and lookup_key()
These functions do not modify anything in the table except the refcount
on success. Let's just lock the table for shared accesses and make use of
atomic ops to update the refcount. This brings a nice gain from 425k to
455k under 48 threads (7%), but some contention remains on the exclusive
locks in other parts.

Note that the refcount continues to be updated under the lock because it's
not yet certain whether there are races between it and some of the exclusive
lock on the table. The difference is marginal and we prefer to stay on the
safe side for now.
2022-10-12 14:19:05 +02:00
Willy Tarreau
175aa06232 MEDIUM: stick-table: free newly allocated stkess if it couldn't be inserted
In __stktable_get_entry() now we're planning for the possibility that the
call to __stktable_store() doesn't add the newly allocated entry and instead
finds a previously inserted one. At the moment this doesn't exist because
the lookup + insert passes are made under the same lock. But it will soon
change.
2022-10-12 14:19:05 +02:00
Willy Tarreau
d2d3fd9b5e MEDIUM: stick-table: return inserted entry in __stktable_store()
This function is used to create an entry in the table. But it doesn't
consider the possibility that the entry already exists, because right
now it's only called in situations where it was verified under a lock
that it doesn't exist. Since we'll soon need to break that assumption
we need it to verify that the requested entry was added and to return
a pointer to the one in the tree so that the caller can detect any
possible conflict. At the moment this is not used.
2022-10-12 14:19:05 +02:00
Willy Tarreau
8d3c3336f9 MEDIUM: stick-table: make stksess_kill_if_expired() avoid the exclusive lock
stream_store_counters() calls stksess_kill_if_expired() for each active
counter. And this one takes an exclusive lock on the table before
checking if it has any work to do (hint: it almost never has since it
only wants to delete expired entries). However a lock is still needed for
now to protect the ref_cnt, but we can do it atomically under the read
lock.

Let's change the mechanism. Now what we do is to check out of the lock
if the entry is expired. If it is, we take the write lock, expire it,
and decrement the refcount. Otherwise we just decrement the refcount
under a read lock. With this change alone, the config based on 3
trackers without the previous patches saw a 2.6x improvement, but here
it doesn't yet change anything because some heavy contention remains
on the lookup part.
2022-10-12 14:19:05 +02:00
Willy Tarreau
a7536ef9e1 MEDIUM: stick-table: only take the lock when needed in stktable_touch_with_exp()
As previously mentioned, this function currently holds an exclusive lock
on the table during all the time it takes to check if the entry needs to
be updated and synchronized with peers. The reality is that many setups
do not use peers and that on highly loaded setups, the same entries are
hammered all the time so the key's expiration doesn't change between a
number of consecutive accesses.

With this patch we take a different approach. The function starts
without taking the lock, and will take it only if needed, keeping track
of it. This way we can avoid it most of the time, or even entirely.
Finally if the decrefcnt argument requires that the refcount is
decremented, we either do it using a non-atomic op if the table was
locked (since no other entry may touch it) or via an atomic under the
read lock only.

With this change alone, a 48-thread test with 3 trackers increased
from 193k req/s to 425k req/s, which is a 2.2x factor.
2022-10-12 14:19:05 +02:00
Willy Tarreau
9f5cb435b6 MINOR: stick-table: move the write lock inside stktable_touch_with_exp()
Taking the write lock prior to entering that function is a problem
because this function is full of conditions that most of the time can
lead to eliminating the lock.

This commit first moves the write lock inside the function and passes
the extra argument required to implement stktable_touch_remote() and
stktable_touch_local(). It also renames the function to remove the
underscores since there's no other variant and it's exported under
this name (probably an old rename that was not propagated). The code
was stressed under 48 threads using 3 trackers on the same table. It
already shows a tiny 3% improvement from 187k to 193k rps.
2022-10-12 14:19:05 +02:00
Willy Tarreau
4be073b99b MINOR: stick-table: do not take an exclusive lock when downing ref_cnt
At plenty of places we decrement ts->ref_cnt under the write lock
because it's held. We don't technically need it to be done that way
if there's contention and an atomic could suffice. However until all
places are turned to atomic, we at least need to do that under a
read lock for now, so that we don't mix atomic and non-atomic uses.
Regardless it already brings ~1.5% req rate improvement with 3 trackers
on the same table under 48 threads at 184k->187k rps.
2022-10-12 14:19:05 +02:00
Willy Tarreau
76642223f0 MEDIUM: stick-table: switch the table lock to rwlock
Right now a spinlock is used, but most accesses are for reads, so let's
switch the lock to an rwlock and switch all accesses to exclusive locks
for now. There should be no visible difference at this point.
2022-10-12 14:19:05 +02:00
Willy Tarreau
f6a42c3a37 MINOR: freq_ctr: use the thread's local time whenever possible
Right now when dealing with freq_ctr updates, we're using the process-
wide monotonic time, and accessing it is expensive since every thread
needs to update it, so this adds some contention. However we don't need
it all the time, the thread's local time is most of the time strictly
equal to the global time, and may be off by one millisecond when the
global time is switched to the next one by another thread, and in this
case we don't want to use the local time because it would risk causing
a rotation of the counter. But that's precisely the condition we're
already relying on for the slow path!

What this patch does is to add a check for the period against the
local time prior to anything else, and immediately return after
updating the counter if still within the period, otherwise fall back
to the existing code. Given that the function starts to inflate a bit,
it was split between a very short inline part that does the hot path,
and the slower fallback that's in a cold function. It was measured that
on a 24-CPU machine it was called ~0.003% of the time.
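
The resulting split roughly looks like this (field names and the thread-local
time variable are assumptions for the sketch):

    #include <stdatomic.h>

    struct freq_ctr {
        _Atomic unsigned int curr_tick;   /* start of the current period */
        _Atomic unsigned int curr_ctr;    /* events in the current period */
        _Atomic unsigned int prev_ctr;    /* events in the previous period */
    };

    extern __thread unsigned int now_ms_tl;            /* thread-local time (assumed) */
    unsigned int update_freq_ctr_slow(struct freq_ctr *ctr,
                                      unsigned int period, unsigned int inc);

    /* hot path, inlined: if the thread-local time says we're still within
     * the current period, a single atomic add is enough; otherwise fall
     * back to the slower rotation code in the cold function above.
     */
    static inline unsigned int update_freq_ctr(struct freq_ctr *ctr,
                                               unsigned int period,
                                               unsigned int inc)
    {
        unsigned int curr = atomic_load_explicit(&ctr->curr_tick,
                                                 memory_order_relaxed);

        if (now_ms_tl - curr < period)
            return atomic_fetch_add(&ctr->curr_ctr, inc) + inc;

        return update_freq_ctr_slow(ctr, period, inc);
    }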

The resulting improvement sits between 2 and 3% at 500k req/s tracking
an http_req_rate counter.
2022-10-12 14:19:05 +02:00
Willy Tarreau
b13044cc1a MINOR: plock: support disabling exponential back-off
The new macro PLOCK_DISABLE_EBO may be defined to disable exponential
backoff. This can be useful to more easily spot functions that cause
contention. In this case the CPU will be spent inside the functions
themselves instead of the pl_wait_unlock_{long,int}() functions, making
them easier to spot using "perf top" even if that causes a significant
degradation of the thread scalability.
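
The principle is simply to compile the back-off out, e.g. (illustrative macro,
not the actual plock code):

    #ifndef PLOCK_DISABLE_EBO
    /* wait for the <mask> bits of <lock> to be released, with exponential
     * back-off to reduce cache-line traffic under heavy contention.
     */
    # define pl_wait_unlock(lock, mask) do {                               \
            unsigned int __delay = 1;                                      \
            while (__atomic_load_n((lock), __ATOMIC_RELAXED) & (mask)) {   \
                for (unsigned int __i = 0; __i < __delay; __i++)           \
                    __asm__ volatile("" ::: "memory");                     \
                if (__delay < 65536)                                       \
                    __delay <<= 1;                                         \
            }                                                              \
        } while (0)
    #else
    /* back-off disabled: spin in place so the contended caller shows up
     * directly in "perf top" instead of the wait function.
     */
    # define pl_wait_unlock(lock, mask) do {                               \
            while (__atomic_load_n((lock), __ATOMIC_RELAXED) & (mask))     \
                ;                                                          \
        } while (0)
    #endif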
2022-10-12 14:19:05 +02:00
Willy Tarreau
bc7c207f74 BUG/MAJOR: stick-tables: do not try to index a server name for applets
Since commit 03cdf55e6 ("MINOR: stream: Stickiness server lookup by name.")
in 2.0-dev6, server names may be used instead of their IDs, in order to
perform stickiness. However the commit above may end up trying to insert
an empty server name in the dictionary when the server is an applet
instead, resulting in an immediate segfault. This is typically what
happens when a "stick-store" rule is present in a backend featuring a
"stats" directive. As there doesn't seem to be an easy way around it,
it seems to imply that "stick-store" is not much used anymore.

The solution here is to only try to insert non-null keys into the
dictionary. The patch moves the check of the key type before the
first lock so that the test on the key can be performed under the lock
instead of locking twice (the patch is more readable with diff -b).

Note that before 2.4, there's no <key> variable there as it was
introduced by commit 92149f9a8 ("MEDIUM: stick-tables: Add srvkey
option to stick-table"), but the __objt_server(s->target)->id still
needs to be tested.

This needs to be backported as far as 2.0.
2022-10-12 14:19:05 +02:00