This is a first step towards unifying all the fallback code. Right now
these two functions are the only ones which do not update the needed_avg
rate counter since there's currently no shared pool kept when using them.
But their code is similar to what could be used everywhere except for
this one, so let's make them capable of maintaining usage statistics.
As a side effect the needed field in "show pools" will now be populated.
The mem_should_fail() call enabled by DEBUG_FAIL_ALLOC used to be placed
only in the no-cache version of the allocator. Now we can generalize it
to all modes and remove the exclusive test on CONFIG_HAP_NO_GLOBAL_POOLS.
We're going to make the local pool always present unless pools are
completely disabled. This means that pools are always enabled by
default, regardless of the use of threads. Let's drop this notion
of "local" pools and make it just "pool". The equivalent debug
option becomes DEBUG_NO_POOLS instead of DEBUG_NO_LOCAL_POOLS.
For now this changes nothing except the option and dropping the
dependency on USE_THREAD.
Initially per-thread pool caches were stored into a fixed-size array.
But this was a bit ugly because the last allocated pools were not able
to benefit from the cache at all. As a work around to preserve
performance, a size of 64 cacheable pools was set by default (there
are 51 pools at the moment, excluding any addon and debugging code),
so all in-tree pools were covered, at the expense of higher memory
usage.
In addition an index had to be calculated for each pool, and was used
to acces the pool cache head into that array. The pool index was not
even stored into the pools so it was required to determine it to access
the cache when the pool was already known.
This patch changes this by moving the pool cache head into the pool
head itself. This way it is certain that each pool will have its own
cache. This removes the need for index calculation.
The pool cache head is 32 bytes long so it was aligned to 64B to avoid
false sharing between threads. The extra cost is not huge (~2kB more
per pool than before), and we'll make better use of that space soon.
The pool cache head contains the size, which should probably be removed
since it's already in the pool's head.
In commit fb117e6a8 ("MEDIUM: memory: don't let pool_put_to_cache() free
the objects itself") pool_evict_from_cache() was introduced with no
argument, yet the only call place passes it the pool, the pointer and
the index number!
Let's remove these as they even let the reader think that the function
does something specific to the current pool while it's not the case.
This patch renames all existing uri-normalizers into a more consistent naming
scheme:
1. The part of the URI that is being touched.
2. The modification being performed as an explicit verb.
This normalizer merges `../` path segments with the predecing segment, removing
both the preceding segment and the `../`.
Empty segments do not receive special treatment. The `merge-slashes` normalizer
should be executed first.
See GitHub Issue #714.
When a thread ends its harmeless period, we must only consider running
threads when testing threads_want_rdv_mask mask. To do so, we reintroduce
all_threads_mask mask in the bitwise operation (It was removed to fix a
deadlock).
Note that for now it is useless because there is no way to stop threads or
to have threads reserved for another task. But it is safer this way to avoid
bugs in the future.
A previous patch was pushed to fix a deadlock when an isolated thread ends
its harmless period (a9a9e9aac ["BUG/MEDIUM: thread: Fix a deadlock if an
isolated thread is marked as harmless"]). But, unfortunately, the fix is
incomplete. The same must be done in the outer loop, in
thread_harmless_end() function. The current thread must be ignored when
threads_want_rdv_mask mask is tested.
This patch must also be backported as far as 2.0.
This library is required for the subsequent patch which adds
the JSON query possibility.
It is necessary to change the include statement in "src/mjson.c"
because the imported includes in haproxy are in "include/import"
orig: #include "mjson.h"
new: #include <import/mjson.h>
ub64dec and ub64enc are the base64url equivalent of b64dec and base64
converters. base64url encoding is the "URL and Filename Safe Alphabet"
variant of base64 encoding. It is also used in in JWT (JSON Web Token)
standard.
RFC1421 mention in base64.c file is deprecated so it was replaced with
RFC4648 to which existing converters, base64/b64dec, still apply.
Example:
HAProxy:
http-request return content-type text/plain lf-string %[req.hdr(Authorization),word(2,.),ub64dec]
Client:
Token=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJ1c2VyIjoiZm9vIiwia2V5IjoiY2hhZTZBaFhhaTZlIn0.5VsVj7mdxVvo1wP5c0dVHnr-S_khnIdFkThqvwukmdg
$ curl -H "Authorization: Bearer ${TOKEN}" http://haproxy.local
{"user":"foo","key":"chae6AhXai6e"}
Allocation error are now handled in bind_conf_alloc() functions. Thus
callers, when not already done, are also updated to catch NULL return value.
This patch may be backported (at least partially) to all stable
versions. However, it only fix errors durung configuration parsing. Thus it
is not mandatory.
Add the trace support for the checks. Only tcp-check based health-checks are
supported, including the agent-check.
In traces, the first argument is always a check object. So it is easy to get
all info related to the check. The tcp-check ruleset, the conn-stream and
the connection, the server state...
Olivier spotted that I messed up during a rebase of commit 92c059c2a
("MINOR: atomic: implement native BTS/BTR for x86"), losing the x86
version of the BTS/BTR and leaving the generic version for it instead
of having this block in the #else. Since this variant is not used for
now it was easy to overlook it. Let's re-implement it here.
The time initialization was made a bit complex because we rely on a
dummy negative argument to reset all fields, leaving no distinction
between process-level initialization and thread-level initialization.
This patch changes this by introducing two functions, one for the
process and the second one for the threads. This removes ambigous
test and makes sure that the relevant fields are always initialized
exactly once. This also offers a better solution to the bug fixed in
commit b48e7c001 ("BUG/MEDIUM: time: make sure to always initialize
the global tick") as there is no more special values for global_now_ms.
It's simple enough to be backported if any other time-related issues
are encountered in stable versions in the future.
It was only used by freq_ctr and is not used anymore. In addition the
local curr_sec_ms was removed, as well as the equivalent extern
definitions which did not exist anymore either.
update_freq_ctr_period() was still not very clean and didn't wait for
the rotation lock to be dropped before trying again, thus maintaining
the contention at a high level. In addition, the rotation update was
made in three steps, which are not very efficient in terms of bus
cycles.
Here the wait loop was reworked so that the fast path remains short
and that the contended path waits for the lock to be dropped before
attempting another write, but it only waits a relax cycle before
attempting a read. The rotation block was simplified to remove a
test that was already validated by the first loop, and so that the
retrieval of the current period, its reset and its increment are all
performed in a single atomic op and the store to the previous period
is performed immediately after.
All this results in significantly smaller code for the inline function
(~1kB total) and a shorter critical path.
When counters are rotated, there is contention between the threads which
can slow down the operation of the thread performing the rotation. Let's
apply a cpu_relax there to let the first thread finish faster.
It remains cumbersome to preserve two versions of the freq counters and
two different internal clocks just for this. In addition, the savings
from using two different mechanisms are not that important as the only
saving is a divide that is replaced by a multiply, but now thanks to
the freq_ctr_total() unificaiton the code could also be simplified to
optimize it in case of constants.
This patch turns all non-period freq_ctr functions to static inlines
which call the period-based ones with a period of 1 second. A direct
benefit is that a single internal clock is now needed for any counter
and that they now all rely on ticks.
These 1-second counters are essentially used to report request rates
and to enforce a connection rate limitation in listeners. It was
verified that these continue to work like before.
Both structures are identical except the name of the field starting
the period and its description. Let's call them all freq_ctr and the
period's start "curr_tick" which is generic.
This is only a temporary change and fields are expected to remain
the same with no code change (verified).
There was still no function to compute a wait time for periods, let's
implement it on top of freq_ctr_total() as we'll soon need it for the
per-second one. The divide here is applied on the frequency so that it
will be replaced with a reciprocal multiply when constant.
This one is the easiest to implement, it just requires a call and a
divide of the result. Anti-flapping correction for low-rates was
preserved.
Now calls using a constant period will be able to use a reciprocal
multiply for the period instead of a divide.
Most of the functions designed to read a counter over a period go through
the same complex loop and only differ in the way they use the returned
values, so it was worth implementing all this into freq_ctr_total() which
returns the total number of events over a period so that the caller can
finish its operation using a divide or a remaining time calculation. As
a special case, read_freq_ctr_period() doesn't take pending events but
requires to enable an anti-flapping correction at very low frequencies.
Thus the function implements it when pend<0.
Thanks to this function it will be possible to reimplement the other ones
as inline and merge the per-second ones with the arbitrary period ones
without always adding the cost of a 64 bit divide.
Some variables are mostly read (mostly pointers) but they tend to be
merged with other ones in the same cache line, slowing their access down
in multi-thread setups. This patch declares an empty, aligned variable
in a section called "read_mostly". This will force a cache-line alignment
on this section so that any variable declared in it will be certain to
avoid false sharing with other ones. The section will be eliminated at
link time if not used.
A __read_mostly attribute was added to compiler.h to ease use of this
section.
HA_SECTION() is used as an attribute to force a section name. This is
required because OSX prepends "__DATA, " in front of the declaration.
HA_SECTION_START() and HA_SECTION_STOP() are used as post-attribute on
variable declaration to designate the section start/end (needed only on
OSX, empty on others).
For platforms with an obsolete linker, all macros are left empty. It would
possibly still work on some of them but this will not be needed anyway.
Due to length restrictions on OSX the initcall sections are called "i_"
there while they're called "init_" on other OSes. However the start and
end of sections are still called "__start_init_" and "__stop_init_",
which forces to have distinct code between the OSes. Let's switch everyone
to "i_" and rename the symbols accordingly.
The trace() function is convenient to avoid calling trace() when traces
are not enabled, but there starts to be some callers which place complex
expressions in their trace calls, which results in all of them to be
evaluated before being passed as arguments to the trace() function. This
needlessly wastes precious CPU cycles.
Let's change the function for a macro, so that the arguments are now only
evaluated when the surce has traces enabled. However having a generic
macro being called "trace()" can easily cause conflicts with innocent
code so we rename it "_trace".
Just doing this has resulted in a 2.5% increase of the HTTP/1 request rate.
Interestingly, all arrays used to declare patterns were read-write while
only hard-coded. Let's mark them const so that they move from data to
rodata and don't risk to experience false sharing.
This argument is not being used inside the function (and the functions
themselves are unused as well) and not documented. Its purpose is not clear.
Just remove it.
With latest commit f50906519 ("MEDIUM: fd: merge fdtab[].ev and state
for FD_EV_* and FD_POLL_* into state") one occurrence of a pair of
chars was missed in fd_stop_both(), resulting in the operation to
fail if the upper flags were set. Interestingly it managed to fail
2 tests in all setups in the CI while all used to work fine on my
local machines. Probably that the reason is that the chars had enough
room above them for the CAS to fail then refill "old" overwriting the
upper parts of the stack, and that thanks to this the subsequent tests
worked. With ASAN being used on lots of tests, it very likely caught
it but used to only report failed tests with no more info.
No backport is needed, as this was never released nor backported.
The current BTS/BTR operations on x86 are ugly because they rely on a
CAS, so they may be unfair and take time to converge. Fortunately,
where they are currently used (mostly FDs) the contention is expected
to be rare (mostly listeners). But this also limits their use to such
few low-load cases.
On x86 there is a set of BTS/BTR instructions which help for this,
but before the FD's state migrated to 32 bits there was little use of
them since they do not exist in 8 bits.
Now at least it makes sense to use them, at the very least in order
to significantly reduce the code size (one BTS instead of a CMPXCHG
loop). The implementation relies on modern gcc's ability to return
condition flags and limit code inflation and register spilling. The
fall back is retained on the old implementation for all other situations
(inappropriate target size or non-capable compiler). The code shrank
by 1.6 kB on the fast path.
As expected, for now on up to 4 threads there is no measurable difference
of performance.
Probably due to the result of an old copy-paste, HA_ATOMIC_BTS/BTR were
still implemented using the __sync_* builtins instead of the more
modern __atomic_* which allow to specify the memory model. Let's update
this to use the newer there and also implement the relaxed variants
(which are not used for now).
This patch replaces roughly all occurrences of an HA_ATOMIC_ADD(&foo, 1)
or HA_ATOMIC_SUB(&foo, 1) with the equivalent HA_ATOMIC_INC(&foo) and
HA_ATOMIC_DEC(&foo) respectively. These are 507 changes over 45 files.
Most ADD/SUB callers use them for a single unit (e.g. refcounts) and
it's a pain to always pass ",1". Let's add them to simplify the API.
However we currently don't add any return value. If needed in the future
better report zero/non-zero than a real value for the sake of efficiency
at the instruction level.
The fetch_and_xxx variant is often missing for add/sub/and/or. In fact
it was only provided for ADD under the name XADD which corresponds to
the x86 instruction name. But for destructive operations like AND and
OR it's missing even more as it's not possible to know the value before
modifying it.
This patch explicitly adds HA_ATOMIC_FETCH_{OR,AND,ADD,SUB} which
cover these standard operations, and renames XADD to FETCH_ADD (there
were only 6 call places).
In the future, backport of fixes involving such operations could simply
remap FETCH_ADD(x) to XADD(x), FETCH_SUB(x) to XADD(-x), and for the
OR/AND if needed, these could possibly be done using BTS/BTR.
It's worth noting that xchg could have been renamed to fetch_and_store()
but xchg already has well understood semantics and it wasn't needed to
go further.
In order to make sure these ones will not be used anymore in an expression,
let's make them always void. New callers will now be forced to use the
explicit _FETCH variant if required.
Currently our atomic ops return a value but it's never known whether
the fetch is done before or after the operation, which causes some
confusion each time the value is desired. Let's create an explicit
variant of these operations suffixed with _FETCH to explicitly mention
that the fetch occurs after the operation, and make use of it at the
few call places.
Gcc 10.2 implements outline atomics on aarch64. The replace all inline
atomic ops with a function call that checks if the machine supports LSE
atomics. This comes with a small cost but allows modern machines to scale
much better than with the old LL/SC ones even when built for full 8.0
compatibility.
This patch enables the use of the __atomic_compare_exchange() builtin
for the double-word CAS when detected as available instead of using the
hand-written LL/SC version. The extra cost is negligible because we do
very few DWCAS operations (essentially FD migrations and shared pools)
so the cost is low but under high contention it can still be beneficial.
As expected no performance difference was measured in either direction
on 4-core machines with this change.
This could be backported to 2.3 if it was shown that FD migrations were
representing a significant source of contention, but for now it does
not appear to be needed.
There is a function called fd_write_frag_line() that's essentially used
by loggers and that is used to write an atomic message line over a file
descriptor using writev(). However a lock is required around the writev()
call to prevent messages from multiple threads from being interleaved.
Till now a SPIN_TRYLOCK was used on a dedicated lock that was common to
all FDs. This is quite not pretty as if there are multiple output pipes
to collect logs, there will be quite some contention. Now that there
are empty flags left in the FD state and that we can finally use atomic
ops on them, let's add a flag to indicate the FD is locked for exclusive
access by a syscall. At least the locking will now be on an FD basis and
not the whole process, so we can remove the log_lock.
No need to keep this flag apart any more, let's merge it into the global
state. The bit was not cleared in fd_insert() because the only user is
the function used to create and atomically send a log message to a pipe
FD, which never registers the fd. Here we clear it nevertheless for the
sake of clarity.
Note that with an extra cleaning pass we could have a bit number
here and simply use a BTS to test and set it.
No need to keep this flag apart any more, let's merge it into the global
state. The CLI's output state was extended to 6 digits and the linger/cloned
flags moved inside the parenthesis.
For a long time we've had fdtab[].ev and fdtab[].state which contain two
arbitrary sets of information, one is mostly the configuration plus some
shutdown reports and the other one is the latest polling status report
which also contains some sticky error and shutdown reports.
These ones used to be stored into distinct chars, complicating certain
operations and not even allowing to clearly see concurrent accesses (e.g.
fd_delete_orphan() would set the state to zero while fd_insert() would
only set the event to zero).
This patch creates a single uint with the two sets in it, still delimited
at the byte level for better readability. The original FD_EV_* values
remained at the lowest bit levels as they are also known by their bit
value. The next step will consist in merging the remaining bits into it.
The whole bits are now cleared both in fd_insert() and _fd_delete_orphan()
because after a complete check, it is certain that in both cases these
functions are the only ones touching these areas. Indeed, for
_fd_delete_orphan(), the thread_mask has already been zeroed before a
poller can call fd_update_event() which would touch the state, so it
is certain that _fd_delete_orphan() is alone. Regarding fd_insert(),
only one thread will get an FD at any moment, and it as this FD has
already been released by _fd_delete_orphan() by definition it is certain
that previous users have definitely stopped touching it.
Strictly speaking there's no need for clearing the state again in
fd_insert() but it's cheap and will remove some doubts during some
troubleshooting sessions.
In preparation of merging FD_POLL* and FD_EV*, this only changes the
value of FD_POLL_* to use bits 8-15 (the second byte). The size of the
field has been temporarily extended to 32 bits already, as well as
the temporary variables that carry the new composite value inside
fd_update_events(). The resulting fdtab entry becomes temporarily
unaligned. All places making access to .ev or FD_POLL_* were carefully
inspected to make sure they were safe regarding this change. Only one
temporary update was needed for the "show fd" code. The code was only
slightly inflated at this step.
The former was not used and the second was used only as a positive mask
of the flags to keep instead of having the flags that are updated. Both
were removed in favor of a new FD_POLL_UPDT_MASK that only mentions the
updated flags. This will ease merging of state and ev later.
This patch registers the parsed file and the line where a log server
is declared to make those information available in configuration
post check.
Those new informations were added on error messages probed resolving
ring names on post configuration check.
Define MODE_DIAG which is used to run haproxy in diagnostic mode. This
mode is used to output extra warnings about possible configuration
blunder or sub-optimal usage. It can be activated with argument '-dD'.
A new output function ha_diag_warning is implemented reserved for
diagnostic output. It serves to standardize the format of diagnostic
messages.
A macro HA_DIAG_WARN_COND is also available to automatically check if
diagnostic mode is on before executing the diagnostic check.
Historically, an option was added to wait for the request payload (option
http-buffer-request). This option has 2 drawbacks. First, it is an ON/OFF
option for the whole proxy. It cannot be enabled on demand depending on the
message. Then, as its name suggests, it only works on the request side. The
only option to wait for the response payload was to write a dedicated
filter. While it is an acceptable solution for complex applications, it is a
bit overkill to simply match strings in the body.
To make everyone happy, this patch adds a dedicated HTTP action to wait for
the message payload, for the request or the response depending it is used in
an http-request or an http-response ruleset. The time to wait is
configurable and, optionally, the minimum payload size to have before stop
to wait.
Both the http action and the old http analyzer rely on the same internal
function.
L6 sample fetches are now ignored when called from an HTTP proxy. Thus, a
warning is emitted during the startup if such usage is detected. It is true
for most ACLs and for log-format strings. Unfortunately, it is a bit painful
to do so for sample expressions.
This patch relies on the commit "MINOR: action: Use a generic function to
check validity of an action rule list".
The check_action_rules() function is now used to check the validity of an
action rule list. It is used from check_config_validity() function to check
L5/6/7 rulesets.
It is now possible to perform HTTP upgrades on a TCP stream from the
frontend side. To do so, a tcp-request content rule must be defined with the
switch-mode action, specifying the mode (for now, only http is supported)
and optionnaly the proto (h1 or h2).
This way it could be possible to set HTTP directives on a TCP frontend which
will only be evaluated if an upgrade is performed. This new way to perform
HTTP upgrades should replace progressively the old way, consisting to route
the request to an HTTP backend. And it should be also a good start to remove
all HTTP processing from tcp-request content rules.
This action is terminal, it stops the ruleset evaluation. It is only
available on proxy with the frontend capability.
The configuration manual has been updated accordingly.
The code responsible to perform an HTTP upgrade from a TCP stream is moved
in a dedicated function, stream_set_http_mode().
The stream_set_backend() function is slightly updated, especially to
correctly set the request analysers.
Now allocation and initialization of HTTP transactions are performed in a
unique function. Historically, there were two functions because the same TXN
was reset for K/A connections in the legacy HTTP mode. Now, in HTX, K/A
connections are handled at the mux level. A new stream, and thus a new TXN,
is created for each request. In addition, the function responsible to end
the TXN is now also reponsible to release it.
So, now, http_create_txn() and http_destroy_txn() must be used to create and
destroy an HTTP transaction.
MX_FL_NO_UPG flag may now be set on a multiplexer to explicitly disable
upgrades from this mux. For now, it is set on the FCGI multiplexer because
it is not supported and there is no upgrade on backend-only multiplexers. It
is also set on the H2 multiplexer because it is clearly not supported.
Historically we've used SOL_IP/SOL_IPV6/SOL_TCP everywhere as the socket
level value in getsockopt() and setsockopt() but as we've seen over time
it regularly broke the build and required to have them defined to their
IPPROTO_* equivalent. The Linux ip(7) man page says:
Using the SOL_IP socket options level isn't portable; BSD-based
stacks use the IPPROTO_IP level.
And it indeed looks like a pure linuxism inherited from old examples and
documentation. strace also reports SOL_* instead of IPPROTO_*, which does
not help... A check to linux/in.h shows they have the same values. Only
SOL_SOCKET and other non-IP values make sense since there is no IPPROTO
equivalent.
Let's get rid of this annoying confusion by removing all redefinitions of
SOL_IP/IPV6/TCP and using IPPROTO_* instead, just like any other operating
system. This also removes duplicated tests for the same value.
Note that this should not result in exposing syscalls to other OSes
as the only ones that were still conditionned to SOL_IPV6 were for
IPV6_UNICAST_HOPS which already had an IPPROTO_IPV6 equivalent, and
IPV6_TRANSPARENT which is Linux-specific.
Very often we use "int" where negative numbers are not needed (and can
further cause trouble) just because it's painful to type "unsigned int"
or "unsigned", or ugly to use in function arguments. Similarly sometimes
chars would absolutely need to be signed but nobody types "signed char".
Let's add a few aliases for such types and make them part of the standard
internal API so that over time we can get used to them and get rid of
horrible definitions. A comment also reminds some commonly available
types and their properties regarding other types.
In order to process samples from the command line interface we'll need
rules as well, and these rules will have to be marked as coming from
the CLI parser. This new origin is used for this.
In order to prepare for supporting calling sample expressions from the
CLI, let's create a new CLI_PARSER parsing context. This one supports
constants and internal samples only.
In order to process samples from the config file we'll need rules as
well, and these rules will have to be marked as coming from the
config parser. This new origin is used for this.
We'd sometimes like to be able to process samples while parsing
the configuration based on purely internal thing but that's not
possible right now. Let's add a new CFG_PARSER context for samples
which only permits constant samples (i.e. those which do not change
in the process' life and which are stable during config parsing).
This level indicates that everything it constant in the expression during
the whole process' life and that it may safely be used at config parsing
time.
For now smp_resolve_args() complains on stderr via ha_alert(), but if we
want to make it a bit more dynamic, we need it to return errors in an
allocated message. Let's pass it an error pointer and have it fill it.
On return we indent the output if it contains more than one line.
Define a new cap PR_CAP_LUA. It can be used to allocate the internal
proxy for lua Socket class. This cap overrides default settings for
preferable values in the lua context.
Move all liberation code related to a proxy in a dedicated function
free_proxy in proxy.c. For now, this function is only called in
haproxy.c. In the future, it will be used to free the lua proxy.
This helps to clean up haproxy.c.
Create a new function parse_new_proxy specifically designed to allocate
a new proxy from the configuration file and copy settings from the
default proxy.
The function alloc_new_proxy is reduced to a minimal allocation. It is
used for default proxy allocation and could also be used for internal
proxies such as the lua Socket proxy.
Move deinit_acl_cond and deinit_act_rules from haproxy.c respectively in
acl.c and action.c. The name of the functions has been slightly altered,
replacing the prefix deinit_* by free_* to reflect their purpose more
clearly.
This change has been made in preparation to the implementation of a free
proxy function. As a side-effect, it helps to clean up haproxy.c.
Create a new module init which contains code related to REGISTER_*
macros for initcalls. init.h is included in api.h to make init code
available to all modules.
It's a step to clean up a bit haproxy.c/global.h.
The default SSL_CTX used by a specific frontend is the one of the first
ckch instance created for this frontend. If this instance has SNIs, then
the SSL context is linked to the instance through the list of SNIs
contained in it. If the instance does not have any SNIs though, then the
SSL_CTX is only referenced by the bind_conf structure and the instance
itself has no link to it.
When trying to update a certificate used by the default instance through
a cli command, a new version of the default instance was rebuilt but the
default SSL context referenced in the bind_conf structure would not be
changed, resulting in a buggy behavior in which depending on the SNI
used by the client, he could either use the new version of the updated
certificate or the original one.
This patch adds a reference to the default SSL context in the default
ckch instances so that it can be hot swapped during a certificate
update.
This should fix GitHub issue #1143.
It can be backported as far as 2.2.
Christopher discovered an issue mostly affecting 2.2 and to a less extent
2.3 and above, which is that it's possible to deadlock a soft-stop when
several threads are using a same listener:
thread1 thread2
unbind_listener() fd_set_running()
lock(listener) listener_accept()
fd_delete() lock(listener)
while (running_mask); -----> deadlock
unlock(listener)
This simple case disappeared from 2.3 due to the removal of some locked
operations at the end of listener_accept() on the regular path, but the
architectural problem is still here and caused by a lock inversion built
around the loop on running_mask in fd_clr_running_excl(), because there
are situations where the caller of fd_delete() may hold a lock that is
preventing other threads from dropping their bit in running_mask.
The real need here is to make sure the last user deletes the FD. We have
all we need to know the last one, it's the one calling fd_clr_running()
last, or entering fd_delete() last, both of which can be summed up as
the last one calling fd_clr_running() if fd_delete() calls fd_clr_running()
at the end. And we can prevent new threads from appearing in running_mask
by removing their bits in thread_mask.
So what this patch does is that it sets the running_mask for the thread
in fd_delete(), clears the thread_mask, thus marking the FD as orphaned,
then clears the running mask again, and completes the deletion if it was
the last one. If it was not, another thread will pass through fd_clr_running
and will complete the deletion of the FD.
The bug is easily reproducible in 2.2 under high connection rates during
soft close. When the old process stops its listener, occasionally two
threads will deadlock and the old process will then be killed by the
watchdog. It's strongly believed that similar situations do exist in 2.3
and 2.4 (e.g. if the removal attempt happens during resume_listener()
called from listener_accept()) but if so, they should be much harder to
trigger.
This should be backported to 2.2 as the issue appeared with the FD
migration. It requires previous patches "fd: make fd_clr_running() return
the remaining running mask" and "MINOR: fd: remove the unneeded running
bit from fd_insert()".
Notes for backport: in 2.2, the fd_dodelete() function requires an extra
argument "do_close" indicating whether we want to remove and close the FD
(fd_delete) or just delete it (fd_remove). While this information is not
conveyed along the chain, we know that late calls always imply do_close=1
become do_close=0 exclusively results from fd_remove() which is only used
by the config parser and the master, both of which are single-threaded,
hence are always the last ones in the running_mask. Thus it is safe to
assume that a postponed FD deletion always implies do_close=1.
Thanks to Olivier for his help in designing this optimal solution.
There's no point taking the running bit in fd_insert() since by
definition there will never be more than one thread inserting the FD,
and that fd_insert() may only be done after the fd was allocated by
the system, indicating the end of use by any other thread.
This will need to be backported to 2.2 to fix an issue.
We'll need to know that a thread is the last one to use an fd, so let's
make fd_clr_running() return the remaining bits after removal. Note that
in practice we're only interested in knowing if it's zero but the compiler
doesn't make use of the clags after the AND and emits a CMPXCHG anyway :-/
This will need to be backported to 2.2 to fix an issue.
The commit reverts following commits:
* 83926a04 BUG/MEDIUM: debug/lua: Don't dump the lua stack if not dumpable
* a61789a1 MEDIUM: lua: Use a per-thread counter to track some non-reentrant parts of lua
Instead of relying on a Lua function to print the lua traceback into the
debugger, we are now using our own internal function (hlua_traceback()).
This one does not allocate memory and use a chunk instead. This avoids any
issue with a possible deadlock in the memory allocator because the thread
processing was interrupted during a memory allocation.
This patch relies on the commit "BUG/MEDIUM: debug/lua: Use internal hlua
function to dump the lua traceback". Both must be backported wherever the
patches above are backported, thus as far as 2.0
The separator string is now configurable, passing it as parameter when the
function is called. In addition, the message have been slightly changed to
be a bit more readable.
If an unknown CA file was first mentioned in an "add ssl crt-list" CLI
command, it would result in a call to X509_STORE_load_locations which
performs a disk access which is forbidden during runtime. The same would
happen if a "ca-verify-file" or "crl-file" was specified. This was due
to the fact that the crt-list file parsing and the crt-list related CLI
commands parsing use the same functions.
The patch simply adds a new parameter to all the ssl_bind parsing
functions so that they know if the call is made during init or by the
CLI, and the ssl_store_load_locations function can then reject any new
cafile_entry creation coming from a CLI call.
It can be backported as far as 2.2.
str2sa_range function options PA_O_DGRAM and PA_O_STREAM are used to
define the supported address types but also to set the default type
if it is not explicit. If the used address support both STREAM and DGRAM,
the default was always set to STREAM.
This patch introduce a new option PA_O_DEFAULT_DGRAM to force the
default to DGRAM type if it is not explicit in the address field
and both STREAM and DGRAM are supported. If only DGRAM or only STREAM
is supported, it continues to be considered as the default.
In commit a1ecbca0a ("BUG/MINOR: freq_ctr/threads: make use of the last
updated global time"), for period-based counters, the millisecond part
of the global_now variable was used as the date for the new period. But
it's wrong, it only works with sub-second periods as it wraps every
second, and for other periods the counters never rotate anymore.
Let's make use of the newly introduced global_now_ms variable instead,
which contains the global monotonic time expressed in milliseconds.
This patch needs to be backported wherever the patch above is backported.
It depends on previous commit "MINOR: time: also provide a global,
monotonic global_now_ms timer".
The period-based freq counters need the global date in milliseconds,
so better calculate it and expose it rather than letting all call
places incorrectly retrieve it.
Here what we do is that we maintain a new globally monotonic timer,
global_now_ms, which ought to be very close to the global_now one,
but maintains the monotonic approach of now_ms between all threads
in that global_now_ms is always ahead of any now_ms.
This patch is made simple to ease backporting (it will be needed for
a subsequent fix), but it also opens the way to some simplifications
on the time handling: instead of computing the local time and trying
to force it to the global one, we should soon be able to proceed in
the opposite way, that is computing the new global time an making the
local one just the latest snapshot of it. This will bring the benefit
of making sure that the global time is always ahead of the local one.
The pool_alloc_dirty() function was renamed to __pool_alloc() and now
takes a set of flags indicating whether poisonning is permitted or not
and whether zeroing the area is needed or not. The pool_alloc() function
is now just a wrapper calling __pool_alloc(pool, 0).
This one used to maintain a shortcut in the pools allocation path that
was only justified by b_alloc_fast() which was not used! Let's get rid
of it as well so that the allocator becomes a bit more straight forward.
It is never used anymore since 1.7 where it was used by b_alloc_margin()
then replaced by direct calls to the pools function, and it maintains a
dependency on the exposed pools functions. It's time to get rid of it,
as it's not even certain it still works.
The channel's buffer allocator, channel_alloc_buffer(), was still relying
on the principle of a margin for the request and not for the response.
But this margin stopped working around 1.7 with the introduction of the
content filters such as SPOE, and was completely anihilated with the
local pools that came with threads. Let's simplify this and just use
b_alloc().
Right now there is a discrepancy beteween b_alloc() and b_allow_margin():
the former forcefully overwrites the target pointer while the latter tests
it and returns it as-is if already allocated.
As a matter of fact, all callers of b_alloc() either preliminary test the
buffer, or assume it's already null.
Let's remove this pain and make the function test the buffer's allocation
before doing it again, and match call places' expectations.
Some parts of the Lua are non-reentrant. We must be sure to carefully track
these parts to not dump the lua stack when it is interrupted inside such
parts. For now, we only identified the custom lua allocator. If the thread
is interrupted during the memory allocation, we must not try to print the
lua stack wich also allocate memory. Indeed, realloc() is not
async-signal-safe.
In this patch we introduce a thread-local counter. It is incremented before
entering in a non-reentrant part and decremented when exiting. It is only
performed in hlua_alloc() for now.
Now that connections aren't being reused when they failed, remove the
reset() method. It was unimplemented anywhere, except for H1 where it did
nothing, anyway.
Introduce a new XPRT method, start(). The init() method will now only
initialize whatever is needed for the XPRT to run, but any action the XPRT
has to do before being ready, such as handshakes, will be done in the new
start() method. That way, we will be sure the full stack of xprt will be
initialized before attempting to do anything.
The init() call is also moved to conn_prepare(). There's no longer any reason
to wait for the ctrl to be ready, any action will be deferred until start(),
anyway. This means conn_xprt_init() is no longer needed.
Remove static qualifier on stats_allocate_proxy_counters_internal. This
function will be used to allocate extra counters at runtime for dynamic
servers.
Prepare the server parsing API to support dynamic servers.
- define a new parsing flag to be used for dynamic servers
- each keyword contains a new field dynamic_ok to indicate if it can be
used for a dynamic server. For now, no keyword are supported.
- do not copy settings from the default server for a new dynamic server.
- a dynamic server is created in a maintenance mode and requires an
explicit 'enable server' command.
- a new server flag named SRV_F_DYNAMIC is created. This flag is set for
all servers created at runtime. It might be useful later, for example
to know if a server can be purged.
Modify the API of parse_server function. Use flags to describe the type
of the parsed server instead of discrete arguments. These flags can be
used to specify if a server/default-server/server-template is parsed.
Additional parameters are also specified (parsing of the address
required, resolve of a name must be done immediately).
It is now unneeded to use strcmp on args[0] in parse_server. Also, the
calls to parse_server are more explicit thanks to the flags.
The freq counters were using the thread's own time as the start of the
current period. The problem is that in case of contention, it was
occasionally possible to perform non-monotonic updates on the edge of
the next second, because if the upfront thread updates a counter first,
it causes a rotation, then the second thread loses the race from its
older time, and tries again, and detects a different time again, but
in the past so it only updates the counter, then a third thread on the
new date would detect a change again, thus provoking a rotation again.
The effect was triple:
- rare loss of stored values during certain transitions from one
period to the next one, causing counters to report 0
- half of the threads forced to go through the slow path every second
- difficult convergence when using many threads where the CAS can fail
a lot and we can observe N(N-1) attempts for N threads to complete
This patch fixes this issue in two ways:
- first, it now makes use og the monotonic global_now value which also
happens to be volatile and to carry the latest known time; this way
time will never jump backwards anymore and only the first thread
updates it on transition, the other ones do not need to.
- second, re-read the time in the loop after each failure, because
if the date changed in the counter, it means that one thread knows
a more recent one and we need to update. In this case if it matches
the new current second, the fast path is usable.
This patch relies on previous patch "MINOR: time: export the global_now
variable" and must be backported as far as 1.8.
This is the process-wide monotonic time that is used to update each
thread's own time. It may be required at a few places where a strictly
monotonic clock is required such as freq_ctr. It will be have to be
backported as a dependency of a forthcoming fix.
Some are not always easy to spot with "chk" vs "check" or hyphens at
some places and not at others. Now entering "option http-close" properly
suggests "httpclose" and "option tcp-chk" suggests "tcp-check". There's
no need to consider the proxy's capabilities, what matters is to figure
what related word the user tried to spell, and there are not that many
options anyway.
The distance between two words can be high due to a sub-word being missing
and in this case it happens that other totally unrealted words are proposed
because their average score looks lower thanks to being shorter. Here we're
introducing the notion of presence of each character so that word sequences
that contain existing sub-words are favored against the shorter ones having
nothing in common. In addition we do not distinguish being/end from a
regular delimitor anymore. That made it harder to spot inverted words.
When tasklets were derived from tasks, there was no immediate need for
the scheduler to know their status after execution, and in a spirit of
simplicity they just started to always return NULL. The problem is that
it simply prevents the scheduler from 1) accounting their execution time,
and 2) keeping track of their current execution status. Indeed, a remote
wake-up could very well end up manipulating a tasklet that's currently
being executed. And this is the reason why those handlers have to take
the idle lock before checking their context.
In 2.5 we'll take care of making tasklets and tasks work more similarly,
but trouble is to be expected if we continue to propagate the trend of
returning NULL everywhere, especially if some fixes relying on a stricter
model later need to be backported. For this reason this patch updates all
known tasklet handlers to make them return NULL only when the tasklet was
freed. It has no effect for now and isn't even guaranteed to always be
100% safe but it puts the code into the right direction for this.
There were still a very small list of functions, variables and fields
called "stats_" while they were really purely CLI-centric. There's the
frontend called "stats_fe" in the global section, which instantiates a
"cli_applet" called "<CLI>" so it was renamed "cli_fe".
The "alloc_stats_fe" function cas renamed to "cli_alloc_fe" which also
better matches the naming convention of all cli-specific functions.
Finally the "stats_permission_denied_msg" used to return an error on
the CLI was renamed "cli_permission_denied_msg".
Now there's no more "stats_something" that designates the CLI.
This is the number of args accepted on a command received on the CLI,
is has long been totally independent of stats and should not carry
this misleading "stats" name anymore.
Instead of making a new one from scratch, let's support not wiping the
existing fingerprint and updating it, and to do the same char by char.
The word-by-word one will still result in multiple beginnings and ends,
but that will accurately translate word boundaries. The char-based one
has more flexibility and requires that the caller maintains the previous
char to indicate the transition, which also allows to insert delimiters
for example.
Entering "show tls" would still emit 35 entries. By measuring the distance
between all unknown words and the candidates, we can sort them and pick the
10 most likely candidates. This works reasonably well, as now "show tls"
only proposes "show tls-keys", "show threads", "show pools" and "show tasks".
If the distance is still too high or if a word is missing, the whole
prefix list continues to be dumped, thus "show" alone will still report
the entire list of commands beginning with "show".
It's still impossible to skip a word, for example "show conn" will not
propose "show servers conn" because the distance is calculated for each
word individually. Some changes to the distance calculation to support
updating an existing map could easily address this. But this is already
a great improvement.
It was mentioned that ACCESS_MASTER_ONLY as for workers only instead of
master-only. And it wasn't clear that all ACCESS_* would belong to the
same thing.
The last time when an item was seen in a resolver responses is now stored in
milliseconds instead of seconds. This avoid some corner-cases at the
edges. This also simplifies time comparisons.
Another way to say it: "Safely unlink requester from a requester callbacks".
Requester callbacks must never try to unlink a requester from a resolution, for
the current requester or another one. First, these callback functions are called
in a loop on a request list, not necessarily safe. Thus unlink resolution at
this place, may be unsafe. And it is useless to try to make these loops safe
because, all this stuff is placed in a loop on a resolution list. Unlink a
requester may lead to release a resolution if it is the last requester.
However, the unkink is necessary because we cannot reset the server state
(hostname and IP) with some pending DNS resolution on it. So, to workaround
this issue, we introduce the "safe" unlink. It is only performed from a
requester callback. In this case, the unlink function never releases the
resolution, it only reset it if necessary. And when a resolution is found
with an empty requester list, it is released.
This patch depends on the following commits :
* MINOR: resolvers: Purge answer items when a SRV resolution triggers an error
* MINOR: resolvers: Use a function to remove answers attached to a resolution
* MINOR: resolvers: Directly call srvrq_update_srv_state() when possible
* MINOR: resolvers: Add function to change the srv status based on SRV resolution
All the series must be backported as far as 2.2. It fixes a regression
introduced by the commit b4badf720 ("BUG/MINOR: resolvers: new callback to
properly handle SRV record errors").
don't release resolution from requester cb
srvrq_update_srv_status() update the server status based on result of SRV
resolution. For now, it is only used from snr_update_srv_status() when
appropriate.
resolv_purge_resolution_answer_records() must be used to removed all answers
attached to a resolution. For now, it is only used when a resolution is
released.
This function search for a SRV answer item associated to a requester
whose type is server.
This is mainly useful to "link" a server to its SRV record when no
additional record were found to configure the IP address.
This patch is required by a bug fix.
action_suggest() will return a pointer to an action whose keyword more or
less ressembles the passed argument. It also accepts to be more tolerant
against prefixes (since actions taking arguments are handled as prefixes).
This will be used to suggest approaching words.
Just like with the server keywords, now's the turn of "bind" keywords.
The difference is that 100% of the bind keywords are registered, thus
we do not need the list of extra keywords.
There are multiple bind line parsers today, all were updated:
- peers
- log
- dgram-bind
- cli
$ printf "listen f\nbind :8000 tcut\n" | ./haproxy -c -f /dev/stdin
[NOTICE] 070/101358 (25146) : haproxy version is 2.4-dev11-7b8787-26
[NOTICE] 070/101358 (25146) : path to executable is ./haproxy
[ALERT] 070/101358 (25146) : parsing [/dev/stdin:2] : 'bind :8000' unknown keyword 'tcut'; did you mean 'tcp-ut' maybe ?
[ALERT] 070/101358 (25146) : Error(s) found in configuration file : /dev/stdin
[ALERT] 070/101358 (25146) : Fatal errors found in configuration.
Instead of just reporting "unknown keyword", let's provide a function which
will look through a list of registered keywords for a similar-looking word
to the one that wasn't matched. This will help callers suggest correct
spelling. Also, given that a large part of the config parser still relies
on a long chain of strcmp(), we'll need to be able to pass extra candidates.
Thus the function supports an optional extra list for this purpose.
This introduces two functions, one which creates a fingerprint of a word,
and one which computes a distance between two words fingerprints. The
fingerprint is made by counting the transitions between one character and
another one. Here we consider the 26 alphabetic letters regardless of
their case, then any digit as a digit, and anything else as "other". We
also consider the first and last locations as transitions from begin to
first char, and last char to end. The distance is simply the sum of the
squares of the differences between two fingerprints. This way, doubling/
missing a letter has the same cost, however some repeated transitions
such as "e"->"r" like in "server" are very unlikely to match against
situations where they do not exist. This is a naive approach but it seems
to work sufficiently well for now. It may be refined in the future if
needed.
It is possible to have a session without a listener. It happens for applets
on the client side. Thus all accesses to the listener info from the session
must be guarded. It was the purpose of the commit 36119de18 ("BUG/MEDIUM:
session: NULL dereference possible when accessing the listener"). However,
some tests on the session's listener existence are missing in proxy_inc_*
functions.
This patch should fix the issues #1171, #1172, #1173, #1174 and #1175. It
must be backported with the above commit as far as 1.8.
Since commit f8fb4f75f ("MINOR: atomic: implement a more efficient arm64
__ha_cas_dw() using pairs"), on some modern arm64 (armv8.1+) compiled
with -march=armv8.1-a under gcc-7.5.0, a build error may appear on ev_poll.o :
/tmp/ccHD2lN8.s:1771: Error: reg pair must start from even reg at operand 1 -- `casp x27,x28,x22,x23,[x12]'
Makefile:927: recipe for target 'src/ev_poll.o' failed
It appears that the compiler cannot always assign register pairs there
for a structure made of two u64. It was possibly later addressed since
gcc-9.3 never caused this, but there's no trivially available info on
the subject in the changelogs. Unsuprizingly, using a u128 instead does
fix this, but it significantly inflates the code (+4kB for just 6 places,
very likely that it loaded some extra stubs) and the comparison is ugly,
involving two slower conditional jumps instead of a single one and a
conditional comparison. For example, ha_random64() grew from 144 bytes
to 232.
However, simply forcing the base register does work pretty well, and
makes the code even cleaner and more efficient by further reducing it
by about 4.5kB, possibly because it helps the compiler to pick suitable
registers for the pair there. And the perf on 64-cores looks steadily
0.5% above the previous one, so let's do this.
Note that the commit above was backported to 2.3 to fix scalability
issues on AWS Graviton2 platform, so this one will need to be as well.
The QUIC connection struct connection member was not initialized. This may
make randomly haproxy handle TLS connections as QUIC ones only when QUIC support
is enabled leading to such OpenSSL errors (captured from a reg test output, TLS
Client-Hello callback failed):
OpenSSL error[0x10000085] OPENSSL_internal: CONNECTION_REJECTED
OpenSSL error[0x10000410] OPENSSL_internal: SSLV3_ALERT_HANDSHAKE_FAILURE
OpenSSL error[0x1000009a] OPENSSL_internal: HANDSHAKE_FAILURE_ON_CLIENT_HELLO
This patch should fix#1168 github issue.
The recent default runqueue size reduction appeared to have significantly
lowered performance on low-thread count configs. Testing various values
runqueue values on different workloads under thread counts ranging from
1 to 64, it appeared that lower values are more optimal for high thread
counts and conversely. It could even be drawn that the optimal value for
various workloads sits around 280/sqrt(nbthread), and probably has to do
with both the L3 cache usage and how to optimally interlace the threads'
activity to minimize contention. This is much easier to optimally
configure, so let's do this by default now.
The recently introduced Financial Information eXchange (FIX)
converters have some hard coded tags based on the specification that
were misspelled. Specifically, SenderComID and TargetComID should
be SenderCompID and TargetCompID according to the specification [1][2].
This patch updates all references, which includes the converters
themselves, the regression test, and the documentation.
[1] https://fiximate.fixtrading.org/en/FIX.5.0SP2_EP264/tag49.html
[2] https://fiximate.fixtrading.org/en/FIX.5.0SP2_EP264/tag56.html
Parameter "accepted_payload_size" is currently considered regardless
the used nameserver is using TCP or UDP. It remains mandatory to annouce
such capability to support e-dns, so a value have to be announced also
in TCP. Maximum DNS message size in TCP is limited by protocol to 65535
and so for UDP (65507) if system supports such UDP messages. But
the maximum value for this option was arbitrary forced to 8192.
This patch change this maximum to 65535 to allow user to set bigger value
for UDP if its system supports. It also sets accepted_payload_size
in TCP allowing to retrieve huge responses if the configuration uses
TCP nameservers.
The request announcing the accepted_payload_size capability is currently
built at resolvers level and is common to all used nameservers of the
section regardess transport protocol used. A further patch should be
made to at least specify a different payload size depending of the
transport, and perhaps could be forced to 65535 in case of TCP and
maximum would be forced back to 65507 matching UDP max.
This patch is appliable since 2.4 version