Olivier spotted that I messed up during a rebase of commit 92c059c2a
("MINOR: atomic: implement native BTS/BTR for x86"), losing the x86
version of the BTS/BTR and leaving the generic version for it instead
of having this block in the #else. Since this variant is not used for
now it was easy to overlook it. Let's re-implement it here.
The time initialization was made a bit complex because we rely on a
dummy negative argument to reset all fields, leaving no distinction
between process-level initialization and thread-level initialization.
This patch changes this by introducing two functions, one for the
process and the second one for the threads. This removes ambigous
test and makes sure that the relevant fields are always initialized
exactly once. This also offers a better solution to the bug fixed in
commit b48e7c001 ("BUG/MEDIUM: time: make sure to always initialize
the global tick") as there is no more special values for global_now_ms.
It's simple enough to be backported if any other time-related issues
are encountered in stable versions in the future.
It was only used by freq_ctr and is not used anymore. In addition the
local curr_sec_ms was removed, as well as the equivalent extern
definitions which did not exist anymore either.
update_freq_ctr_period() was still not very clean and didn't wait for
the rotation lock to be dropped before trying again, thus maintaining
the contention at a high level. In addition, the rotation update was
made in three steps, which are not very efficient in terms of bus
cycles.
Here the wait loop was reworked so that the fast path remains short
and that the contended path waits for the lock to be dropped before
attempting another write, but it only waits a relax cycle before
attempting a read. The rotation block was simplified to remove a
test that was already validated by the first loop, and so that the
retrieval of the current period, its reset and its increment are all
performed in a single atomic op and the store to the previous period
is performed immediately after.
All this results in significantly smaller code for the inline function
(~1kB total) and a shorter critical path.
When counters are rotated, there is contention between the threads which
can slow down the operation of the thread performing the rotation. Let's
apply a cpu_relax there to let the first thread finish faster.
It remains cumbersome to preserve two versions of the freq counters and
two different internal clocks just for this. In addition, the savings
from using two different mechanisms are not that important as the only
saving is a divide that is replaced by a multiply, but now thanks to
the freq_ctr_total() unificaiton the code could also be simplified to
optimize it in case of constants.
This patch turns all non-period freq_ctr functions to static inlines
which call the period-based ones with a period of 1 second. A direct
benefit is that a single internal clock is now needed for any counter
and that they now all rely on ticks.
These 1-second counters are essentially used to report request rates
and to enforce a connection rate limitation in listeners. It was
verified that these continue to work like before.
Both structures are identical except the name of the field starting
the period and its description. Let's call them all freq_ctr and the
period's start "curr_tick" which is generic.
This is only a temporary change and fields are expected to remain
the same with no code change (verified).
There was still no function to compute a wait time for periods, let's
implement it on top of freq_ctr_total() as we'll soon need it for the
per-second one. The divide here is applied on the frequency so that it
will be replaced with a reciprocal multiply when constant.
This one is the easiest to implement, it just requires a call and a
divide of the result. Anti-flapping correction for low-rates was
preserved.
Now calls using a constant period will be able to use a reciprocal
multiply for the period instead of a divide.
Most of the functions designed to read a counter over a period go through
the same complex loop and only differ in the way they use the returned
values, so it was worth implementing all this into freq_ctr_total() which
returns the total number of events over a period so that the caller can
finish its operation using a divide or a remaining time calculation. As
a special case, read_freq_ctr_period() doesn't take pending events but
requires to enable an anti-flapping correction at very low frequencies.
Thus the function implements it when pend<0.
Thanks to this function it will be possible to reimplement the other ones
as inline and merge the per-second ones with the arbitrary period ones
without always adding the cost of a 64 bit divide.
Some variables are mostly read (mostly pointers) but they tend to be
merged with other ones in the same cache line, slowing their access down
in multi-thread setups. This patch declares an empty, aligned variable
in a section called "read_mostly". This will force a cache-line alignment
on this section so that any variable declared in it will be certain to
avoid false sharing with other ones. The section will be eliminated at
link time if not used.
A __read_mostly attribute was added to compiler.h to ease use of this
section.
HA_SECTION() is used as an attribute to force a section name. This is
required because OSX prepends "__DATA, " in front of the declaration.
HA_SECTION_START() and HA_SECTION_STOP() are used as post-attribute on
variable declaration to designate the section start/end (needed only on
OSX, empty on others).
For platforms with an obsolete linker, all macros are left empty. It would
possibly still work on some of them but this will not be needed anyway.
Due to length restrictions on OSX the initcall sections are called "i_"
there while they're called "init_" on other OSes. However the start and
end of sections are still called "__start_init_" and "__stop_init_",
which forces to have distinct code between the OSes. Let's switch everyone
to "i_" and rename the symbols accordingly.
The trace() function is convenient to avoid calling trace() when traces
are not enabled, but there starts to be some callers which place complex
expressions in their trace calls, which results in all of them to be
evaluated before being passed as arguments to the trace() function. This
needlessly wastes precious CPU cycles.
Let's change the function for a macro, so that the arguments are now only
evaluated when the surce has traces enabled. However having a generic
macro being called "trace()" can easily cause conflicts with innocent
code so we rename it "_trace".
Just doing this has resulted in a 2.5% increase of the HTTP/1 request rate.
Interestingly, all arrays used to declare patterns were read-write while
only hard-coded. Let's mark them const so that they move from data to
rodata and don't risk to experience false sharing.
This argument is not being used inside the function (and the functions
themselves are unused as well) and not documented. Its purpose is not clear.
Just remove it.
With latest commit f50906519 ("MEDIUM: fd: merge fdtab[].ev and state
for FD_EV_* and FD_POLL_* into state") one occurrence of a pair of
chars was missed in fd_stop_both(), resulting in the operation to
fail if the upper flags were set. Interestingly it managed to fail
2 tests in all setups in the CI while all used to work fine on my
local machines. Probably that the reason is that the chars had enough
room above them for the CAS to fail then refill "old" overwriting the
upper parts of the stack, and that thanks to this the subsequent tests
worked. With ASAN being used on lots of tests, it very likely caught
it but used to only report failed tests with no more info.
No backport is needed, as this was never released nor backported.
The current BTS/BTR operations on x86 are ugly because they rely on a
CAS, so they may be unfair and take time to converge. Fortunately,
where they are currently used (mostly FDs) the contention is expected
to be rare (mostly listeners). But this also limits their use to such
few low-load cases.
On x86 there is a set of BTS/BTR instructions which help for this,
but before the FD's state migrated to 32 bits there was little use of
them since they do not exist in 8 bits.
Now at least it makes sense to use them, at the very least in order
to significantly reduce the code size (one BTS instead of a CMPXCHG
loop). The implementation relies on modern gcc's ability to return
condition flags and limit code inflation and register spilling. The
fall back is retained on the old implementation for all other situations
(inappropriate target size or non-capable compiler). The code shrank
by 1.6 kB on the fast path.
As expected, for now on up to 4 threads there is no measurable difference
of performance.
Probably due to the result of an old copy-paste, HA_ATOMIC_BTS/BTR were
still implemented using the __sync_* builtins instead of the more
modern __atomic_* which allow to specify the memory model. Let's update
this to use the newer there and also implement the relaxed variants
(which are not used for now).
This patch replaces roughly all occurrences of an HA_ATOMIC_ADD(&foo, 1)
or HA_ATOMIC_SUB(&foo, 1) with the equivalent HA_ATOMIC_INC(&foo) and
HA_ATOMIC_DEC(&foo) respectively. These are 507 changes over 45 files.
Most ADD/SUB callers use them for a single unit (e.g. refcounts) and
it's a pain to always pass ",1". Let's add them to simplify the API.
However we currently don't add any return value. If needed in the future
better report zero/non-zero than a real value for the sake of efficiency
at the instruction level.
The fetch_and_xxx variant is often missing for add/sub/and/or. In fact
it was only provided for ADD under the name XADD which corresponds to
the x86 instruction name. But for destructive operations like AND and
OR it's missing even more as it's not possible to know the value before
modifying it.
This patch explicitly adds HA_ATOMIC_FETCH_{OR,AND,ADD,SUB} which
cover these standard operations, and renames XADD to FETCH_ADD (there
were only 6 call places).
In the future, backport of fixes involving such operations could simply
remap FETCH_ADD(x) to XADD(x), FETCH_SUB(x) to XADD(-x), and for the
OR/AND if needed, these could possibly be done using BTS/BTR.
It's worth noting that xchg could have been renamed to fetch_and_store()
but xchg already has well understood semantics and it wasn't needed to
go further.
In order to make sure these ones will not be used anymore in an expression,
let's make them always void. New callers will now be forced to use the
explicit _FETCH variant if required.
Currently our atomic ops return a value but it's never known whether
the fetch is done before or after the operation, which causes some
confusion each time the value is desired. Let's create an explicit
variant of these operations suffixed with _FETCH to explicitly mention
that the fetch occurs after the operation, and make use of it at the
few call places.
Gcc 10.2 implements outline atomics on aarch64. The replace all inline
atomic ops with a function call that checks if the machine supports LSE
atomics. This comes with a small cost but allows modern machines to scale
much better than with the old LL/SC ones even when built for full 8.0
compatibility.
This patch enables the use of the __atomic_compare_exchange() builtin
for the double-word CAS when detected as available instead of using the
hand-written LL/SC version. The extra cost is negligible because we do
very few DWCAS operations (essentially FD migrations and shared pools)
so the cost is low but under high contention it can still be beneficial.
As expected no performance difference was measured in either direction
on 4-core machines with this change.
This could be backported to 2.3 if it was shown that FD migrations were
representing a significant source of contention, but for now it does
not appear to be needed.
There is a function called fd_write_frag_line() that's essentially used
by loggers and that is used to write an atomic message line over a file
descriptor using writev(). However a lock is required around the writev()
call to prevent messages from multiple threads from being interleaved.
Till now a SPIN_TRYLOCK was used on a dedicated lock that was common to
all FDs. This is quite not pretty as if there are multiple output pipes
to collect logs, there will be quite some contention. Now that there
are empty flags left in the FD state and that we can finally use atomic
ops on them, let's add a flag to indicate the FD is locked for exclusive
access by a syscall. At least the locking will now be on an FD basis and
not the whole process, so we can remove the log_lock.
No need to keep this flag apart any more, let's merge it into the global
state. The bit was not cleared in fd_insert() because the only user is
the function used to create and atomically send a log message to a pipe
FD, which never registers the fd. Here we clear it nevertheless for the
sake of clarity.
Note that with an extra cleaning pass we could have a bit number
here and simply use a BTS to test and set it.
No need to keep this flag apart any more, let's merge it into the global
state. The CLI's output state was extended to 6 digits and the linger/cloned
flags moved inside the parenthesis.
For a long time we've had fdtab[].ev and fdtab[].state which contain two
arbitrary sets of information, one is mostly the configuration plus some
shutdown reports and the other one is the latest polling status report
which also contains some sticky error and shutdown reports.
These ones used to be stored into distinct chars, complicating certain
operations and not even allowing to clearly see concurrent accesses (e.g.
fd_delete_orphan() would set the state to zero while fd_insert() would
only set the event to zero).
This patch creates a single uint with the two sets in it, still delimited
at the byte level for better readability. The original FD_EV_* values
remained at the lowest bit levels as they are also known by their bit
value. The next step will consist in merging the remaining bits into it.
The whole bits are now cleared both in fd_insert() and _fd_delete_orphan()
because after a complete check, it is certain that in both cases these
functions are the only ones touching these areas. Indeed, for
_fd_delete_orphan(), the thread_mask has already been zeroed before a
poller can call fd_update_event() which would touch the state, so it
is certain that _fd_delete_orphan() is alone. Regarding fd_insert(),
only one thread will get an FD at any moment, and it as this FD has
already been released by _fd_delete_orphan() by definition it is certain
that previous users have definitely stopped touching it.
Strictly speaking there's no need for clearing the state again in
fd_insert() but it's cheap and will remove some doubts during some
troubleshooting sessions.
In preparation of merging FD_POLL* and FD_EV*, this only changes the
value of FD_POLL_* to use bits 8-15 (the second byte). The size of the
field has been temporarily extended to 32 bits already, as well as
the temporary variables that carry the new composite value inside
fd_update_events(). The resulting fdtab entry becomes temporarily
unaligned. All places making access to .ev or FD_POLL_* were carefully
inspected to make sure they were safe regarding this change. Only one
temporary update was needed for the "show fd" code. The code was only
slightly inflated at this step.
The former was not used and the second was used only as a positive mask
of the flags to keep instead of having the flags that are updated. Both
were removed in favor of a new FD_POLL_UPDT_MASK that only mentions the
updated flags. This will ease merging of state and ev later.
This patch registers the parsed file and the line where a log server
is declared to make those information available in configuration
post check.
Those new informations were added on error messages probed resolving
ring names on post configuration check.
Define MODE_DIAG which is used to run haproxy in diagnostic mode. This
mode is used to output extra warnings about possible configuration
blunder or sub-optimal usage. It can be activated with argument '-dD'.
A new output function ha_diag_warning is implemented reserved for
diagnostic output. It serves to standardize the format of diagnostic
messages.
A macro HA_DIAG_WARN_COND is also available to automatically check if
diagnostic mode is on before executing the diagnostic check.
Historically, an option was added to wait for the request payload (option
http-buffer-request). This option has 2 drawbacks. First, it is an ON/OFF
option for the whole proxy. It cannot be enabled on demand depending on the
message. Then, as its name suggests, it only works on the request side. The
only option to wait for the response payload was to write a dedicated
filter. While it is an acceptable solution for complex applications, it is a
bit overkill to simply match strings in the body.
To make everyone happy, this patch adds a dedicated HTTP action to wait for
the message payload, for the request or the response depending it is used in
an http-request or an http-response ruleset. The time to wait is
configurable and, optionally, the minimum payload size to have before stop
to wait.
Both the http action and the old http analyzer rely on the same internal
function.
L6 sample fetches are now ignored when called from an HTTP proxy. Thus, a
warning is emitted during the startup if such usage is detected. It is true
for most ACLs and for log-format strings. Unfortunately, it is a bit painful
to do so for sample expressions.
This patch relies on the commit "MINOR: action: Use a generic function to
check validity of an action rule list".