We used to have 3 thread-based arrays for toremove_lock, idle_cleanup,
and toremove_connections. The problem is that these items are small,
and that this creates false sharing between threads since it's possible
to pack up to 8-16 of these values into a single cache line. This can
cause real damage where there is contention on the lock.
This patch creates a new array of struct "idle_conns" that is aligned
on a cache line and which contains all three members above. This way
each thread has access to its variables without hindering the other
ones. Just doing this increased the HTTP/1 request rate by 5% on a
16-thread machine.
The definition was moved to connection.{c,h} since it appeared a more
natural evolution of the ongoing changes given that there was already
one of them declared in connection.h previously.
It looked strange to see pool_evict_from_cache() always very present
on "perf top", but there was actually a reason to this: while b_free()
uses pool_free() which properly disposes the buffer into the local cache
and b_alloc_fast() allocates using pool_get_first() which considers the
local cache, b_alloc_margin() does not consider the local cache as it
only uses __pool_get_first() which only allocates from the shared pools.
The impact is that basically everywhere a buffer is allocated (muxes,
streams, applets), it's always picked from the shared pool (hence
involves locking) and is released to the local one and makes it grow
until it's required to trigger a flush using pool_evict_from_cache().
Buffers usage are thus not thread-local at all, and cause eviction of
a lot of possibly useful objects from the local caches.
Just fixing this results in a 10% request rate increase in an HTTP/1 test
on a 16-thread machine.
This bug was caused by recent commit ed891fd ("MEDIUM: memory: make local
pools independent on lockless pools") merged into 2.2-dev9, so not backport
is needed.
BoringSSL does not have X509_get_X509_PUBKEY
let our emulation level define that for BoringSSL as well
Build log:
src/ssl_sample.o: In function `smp_fetch_ssl_x_key_alg':
/home/travis/build/haproxy/haproxy/src/ssl_sample.c:592: undefined reference to `X509_get_X509_PUBKEY'
clang-7: error: linker command failed with exit code 1 (use -v to see invocation)
Makefile:860: recipe for target 'haproxy' failed
make: *** [haproxy] Error 1
travis-ci: https://travis-ci.com/github/haproxy/haproxy/jobs/351670996
With the rework of the config line parser, we've started to emit a dump
of the initial line underlined by a caret character indicating the error
location. But with extremely large lines it starts to take time and can
even cause trouble to slow terminals (e.g. over ssh), and this becomes
useless. In addition, control characters could be dumped as-is which is
bad, especially when the input file is accidently wrong (an executable).
This patch adds a string sanitization function which isolates an area
around the error position in order to report only that area if the string
is too large. The limit was set to 80 characters, which will result in
roughly 40 chars around the error being reported only, prefixed and suffixed
with "..." as needed. In addition, non-printable characters in the line are
now replaced with '?' so as not to corrupt the terminal. This way invalid
variable names, unmatched quotes etc will be easier to spot.
A typical output is now:
[ALERT] 176/092336 (23852) : parsing [bad.cfg:8]: forbidden first char in environment variable name at position 811957:
...c$PATH$PATH$d(xlc`%?$PATH$PATH$dgc?T$%$P?AH?$PATH$PATH$d(?$PATH$PATH$dgc?%...
^
Now that all tasklet queues are scanned at once by run_tasks_from_lists(),
it becomes possible to always check for lower priority classes and jump
back to them when they exist.
This patch adds tune.sched.low-latency global setting to enable this
behavior. What it does is stick to the lowest ranked priority list in
which tasks are still present with an available budget, and leave the
loop to refill the tasklet lists if the trees got new tasks or if new
work arrived into the shared urgent queue.
Doing so allows to cut the latency in half when running with extremely
deep run queues (10k-100k), thus allowing forwarding of small and large
objects to coexist better. It remains off by default since it does have
a small impact on large traffic by default (shorter batches).
Now process_runnable_tasks is responsible for calculating the budgets
for each queue, dequeuing from the tree, and calling run_tasks_from_lists().
This latter one scans the queues, picking tasks there and respecting budgets.
Note that its name was updated with a plural "s" for this reason.
It is neither convenient nor scalable to check each and every tasklet
queue to figure whether it's empty or not while we often need to check
them all at once. This patch introduces a tasklet class mask which gets
a bit 1 set for each queue representing one class of service. A single
test on the mask allows to figure whether there's still some work to be
done. It will later be usable to better factor the runqueue code.
Bits are set when tasklets are queued. They're cleared when queues are
emptied. It is possible that a queue is empty but has a bit if a tasklet
was added then removed, but this is not a problem as this is properly
checked for in run_tasks_from_list().
It will be convenient to have the tasklet queue number soon, better make
current_queue an index rather than a pointer to the queue. When not currently
running (e.g. from I/O), the index is -1.
Add some functions to deinit the whole crtlist and ckch architecture.
It will free all crtlist, crtlist_entry, ckch_store, ckch_inst and their
associated SNI, ssl_conf and SSL_CTX.
The SSL_CTX in the default_ctx and initial_ctx still needs to be free'd
separately.
Since commit 2954c47 ("MEDIUM: ssl: allow crt-list caching"), the
ssl_bind_conf is allocated directly in the crt-list, and the crt-list
can be shared between several bind_conf. The deinit() code wasn't
changed to handle that.
This patch fixes the issue by removing the free of the ssl_conf in
ssl_sock_free_all_ctx().
It should be completed with a patch that free the ssl_conf and the
crt-list.
Fix issue #700.
A test on large objects revealed a big performance loss from 2.1. The
cause was found to be related to cache locality between scheduled
operations that are batched using tasklets. It happens that we now
have several layers of tasklets and that queuing all these operations
leaves time to let memory objects cool down in the CPU cache, effectively
resulting in halving the performance.
A quick test consisting in putting most unknown tasklets into the BULK
queue almost fixed the performance regression, but this is a wrong
approach as it can also slow down some low-latency transfers or access
to applets like the CLI.
What this patch does instead is to queue unknown tasklets into the same
queue as the current one when tasklet_wakeup() is itself called from a
task/tasklet, otherwise it uses urgent for real I/O (when sched->current
is NULL). This results in the called tasklet being woken up much sooner,
often at the end of the current batch of tasklets.
By doing so, a test on 2 cores 4 threads with 256 concurrent H1 conns
transferring 16m objects with 256kB buffers jumped from 55 to 88 Gbps.
It's even possible to go as high as 101 Gbps by evaluating the URGENT
queue after the BULK one, though this was not done as considered
dangerous for latency sensitive operations.
This reinforces the importance of getting back the CPU transfer
mechanisms based on tasklet_wakeup_after() to work at the tasklet level
by supporting an immediate wakeup in certain cases.
No backport is needed, this is strictly 2.2.
In task_per_thread[] we now have current_queue which is a pointer to
the current tasklet_list entry being evaluated. This will be used to
know the class under which the current task/tasklet is currently
running.
When DEBUG_FD is set at build time, we'll keep a counter of per-FD events
in the fdtab. This counter is reported in "show fd" even for closed FDs if
not zero. The purpose is to help spot situations where an apparently closed
FD continues to be reported in loops, or where some events are dismissed.
As reported in issue #419, a "clear map" operation on a very large map
can take a lot of time and freeze the entire process for several seconds.
This patch makes sure that pat_ref_prune() can regularly yield after
clearing some entries so that the rest of the process continues to work.
The first part, the removal of the patterns, can take quite some time
by itself in one run but it's still relatively fast. It may block for
up to 100ms for 16M IP addresses in a tree typically. This change needed
to declare an I/O handler for the clear operation so that we can get
back to it after yielding.
The second part can be much slower because it deconstructs the elements
and its users, but it iterates progressively so we can yield less often
here.
The patch was tested with traffic in parallel sollicitating the map being
released and showed no problem. Some traffic will definitely notice an
incomplete map but the filling is already not atomic anyway thus this is
not different.
It may be backported to stable versions once sufficiently tested for side
effects, at least as far as 2.0 in order to avoid the watchdog triggering
when the process is frozen there. For a better behaviour, all these
prune_* functions should support yielding so that the callers have a
chance to continue also yield in turn.
Some of the recent optimizations around the polling to save a few
epoll_ctl() calls have shown that they could also cause some trouble.
However, over time our code base has become totally asynchronous with
I/Os always attempted from the upper layers and only retried at the
bottom, making it look like we're getting closer to EPOLLET support.
There are showstoppers there such as the listeners which cannot support
this. But given that most of the epoll_ctl() dance comes from the
connections, we can try to enable edge-triggered polling on connections.
What this patch does is to add a new global tunable "tune.fd.edge-triggered",
that makes fd_insert() automatically set an et_possible bit on the fd if
the I/O callback is conn_fd_handler. When the epoll code sees an update
for such an FD, it immediately registers it in both directions the first
time and doesn't update it anymore.
On a few tests it proved quite useful with a 14% request rate increase in
a H2->H1 scenario, reducing the epoll_ctl() calls from 2 per request to
2 per connection.
The option is obviously disabled by default as bugs are still expected,
particularly around the subscribe() code where it is possible that some
layers do not always re-attempt reading data after being woken up.
localpeer <name>
Sets the local instance's peer name. It will be ignored if the "-L"
command line argument is specified or if used after "peers" section
definitions. In such cases, a warning message will be emitted during
the configuration parsing.
This option will also set the HAPROXY_LOCALPEER environment variable.
See also "-L" in the management guide and "peers" section in the
configuration manual.
This one was confusingly called, I thought it was the cumulated number
of streams but it's the number of calls to process_stream(). Let's make
this clearer.
We have poll_drop, poll_dead and poll_skip which are confusingly named
like their poll_io and poll_exp counterparts except that they are not
per poll() call but per-fd. This patch renames them to poll_drop_fd(),
poll_dead_fd() and poll_skip_fd() for this reason.
The "show activity" output mentions a number of indicators to explain
wake up reasons but doesn't have the number of times poll() sees some
I/O. And given that multiple events can happen simultaneously, it's
not always possible to deduce this metric by subtracting.
This patch adds a new "poll_io" counter that allows one to see how
often poll() returns with at least one active FD. This should help
detect stuck events and measure various ratios of poll sub-metrics.
fd_set_running() and fd_takeover() may both use a double-word CAS on the
(running_mask, thread_mask) couple and as such they expect the fields to
be exactly arranged like this. It's critical not to reorder them, so add
a comment to avoid such a potential mistake later.
Since 2.1-dev2, with commit 305d5ab46 ("MAJOR: fd: Get rid of the fd cache.")
we don't have the fd_lock anymore and as such its acitvity counter is always
zero. Let's remove it from the struct and from "show activity" output, as
there are already plenty of indicators to look at.
The cache line comment in the struct activity was updated to reflect
reality as it looks like another one already got removed in the past.
This macro is provided by clang but gcc lacks it. Not having it makes
it painful to test features on both compilers. Better define it to zero
when not available so that __has_feature(foo) never errors.
This function takes on input a string to tokenize, an output storage
(which may be the same) and a number of options indicating how to handle
certain characters (single & double quote support, backslash support,
end of line on '#', environment variables etc). On output it will provide
a list of pointers to individual words after having possibly unescaped
some character sequences, handled quotes and resolved environment
variables, and it will also indicate a status made of:
- a list of failures (overlap between src/dst, wrong quote etc)
- the pointer to the first sequence in error
- the required output length (a-la snprintf()).
This allows a caller to freely unescape/unquote a string by using a
pre-allocated temporary buffer and expand it as necessary. It takes
extreme care at avoiding expensive operations and intentionally does
not use memmove() when removing escapes, hence the reason for the
different input and output buffers. The goal is to use it as the basis
for the config parser.
As reported in issue #689, there is a subtle bug in the ebtree code used
to compared memory blocks. It stems from the platform-dependent memcmp()
implementation. Original implementations used to perform a byte-per-byte
comparison and to stop at the first non-matching byte, as in this old
example:
https://www.retro11.de/ouxr/211bsd/usr/src/lib/libc/compat-sys5/memcmp.c.html
The ebtree code has been relying on this to detect the first non-matching
byte when comparing keys. This is made so that a zero-terminated string
can fail to match against a longer string.
Over time, especially with large busses and SIMD instruction sets,
multi-byte comparisons have appeared, making the processor fetch bytes
past the first different byte, which could possibly be a trailing zero.
This means that it's possible to read past the allocated area for a
string if it was allocated by strdup().
This is not correct and definitely confuses address sanitizers. In real
life the problem doesn't have visible consequences. Indeed, multi-byte
comparisons are implemented so that aligned words are loaded (e.g. 512
bits at once to process a cache line at a time). So there is no way such
a multi-byte access will cross a page boundary and end up reading from
an unallocated zone. This is why it was never noticed before.
This patch addresses this by implementing a one-byte-at-a-time memcmp()
variant for ebtree, called eb_memcmp(). It's optimized for both small and
long strings and guarantees to stop after the first non-matching byte. It
only needs 5 instructions in the loop and was measured to be 3.2 times
faster than the glibc's AVX2-optimized memcmp() on short strings (1 to
257 bytes), since that latter one comes with a significant setup cost.
The break-even seems to be at 512 bytes where both version perform
equally, which is way longer than what's used in general here.
This fix should be backported to stable versions and reintegrated into
the ebtree code.
Commit 0a3b43d9c ("MINOR: haproxy: Make use of deinit_and_exit() for
clean exits") introduced this build warning:
src/haproxy.c: In function 'main':
src/haproxy.c:3775:1: warning: control reaches end of non-void function [-Wreturn-type]
}
^
This is because the new deinit_and_exit() is not marked as "noreturn"
so depending on the optimizations, the noreturn attribute of exit() will
either leak through it and silence the warning or not and confuse the
compiler. Let's just add the attribute to fix this.
No backport is needed, this is purely 2.2.
As reported in issue #686, ARM64 build fails since the include files
reorganization. This is caused by the lack of string.h while a memcpy()
is present in __ha_cas_dw().
clang just failed on fd.c with this error:
src/fd.c:491:9: error: logical not is only applied to the left hand side of this comparison [-Werror,-Wlogical-not-parentheses]
while (HA_SPIN_TRYLOCK(OTHER_LOCK, &log_lock) != 0) {
^ ~~
That's because this expands to this:
while (!pl_try_s(&log_lock) != 0) {
Let's just add parenthesis in the TRYLOCK macros to avoid this.
This may need to be backported if commit df187875d ("BUG/MEDIUM: log:
don't hold the log lock during writev() on a file descriptor") is
backported as well as it seems to be the first one to trigger it.
The set of files proto_udp.{c,h} were misleadingly named, as they do not
provide anything related to the UDP protocol but to datagram handling
instead, since currently all UDP processing is hard-coded where it's used
(dns, logs). They are to UDP what connection.{c,h} are to proto_tcp. This
was causing confusion about how to insert UDP socket management code,
so let's rename them right now to dgram.{c,h} which more accurately
matches what's inside since every function and type is already prefixed
with "dgram_".
There are list definitions everywhere in the code, let's drop the need
for including list-t.h to declare them. The rest of the list manipulation
is huge however and not needed everywhere so using the list walking macros
still requires to include list.h.
This patch fixes all the leftovers from the include cleanup campaign. There
were not that many (~400 entries in ~150 files) but it was definitely worth
doing it as it revealed a few duplicates.
Since these are used as type attributes or conditional clauses, they
are used about everywhere and should not require a dependency on
thread.h. Moving them to compiler.h along with other similar statements
like ALIGN() etc looks more logical; this way they become part of the
base API. This allowed to remove thread-t.h from ~12 files, one was
found to only require thread-t and not thread and dict.c was found to
require thread.h.
That's already where MAX_PROCS is set, and we already handle the case of
the default value so there is no reason for placing it in thread.h given
that most call places don't need the rest of the threads definitions. The
include was removed from global-t.h and activity.c.
Sometimes we need to align a struct member or a struct's size only when
threads are enabled. This is the case on fdtab for example. Instead of
using ugly ifdefs in the code itself, let's have a THREAD_ALIGNED() macro
performing the alignment only when threads are enabled. For now this was
only applied to fd-t.h as it was the only place found.
Most of the files dealing with error reports have to include log.h in order
to access ha_alert(), ha_warning() etc. But while these functions don't
depend on anything, log.h depends on a lot of stuff because it deals with
log-formats and samples. As a result it's impossible not to embark long
dependencies when using ha_warning() or qfprintf().
This patch moves these low-level functions to errors.h, which already
defines the error codes used at the same places. About half of the users
of log.h could be adjusted, sometimes revealing other issues such as
missing tools.h. Interestingly the total preprocessed size shrunk by
4%.
The struct sample_data is used by pattern, map and vars, and currently
requires to include sample-t which comes with many other dependencies.
Let's move sample_data into its own file to shorten the dependency tree.
This revealed a number of issues in adjacent files which were hidden by
the fact that sample-t.h brought everything that was missing.
Directly including stddef.h in many files results in it being processed
multiple times while it can be centralized in api-t.h and be guarded
against multiple inclusions. Doing so reduces the number of preprocessed
lines by 1200!
Checks.c remains one of the largest file of the project and it contains
too many things. The tcpchecks code represents half of this file, and
both parts are relatively isolated, so let's move it away into its own
file. We now have tcpcheck.c, tcpcheck{,-t}.h.
Doing so required to export quite a number of functions because check.c
has almost everything made static, which really doesn't help to split!
check.c is one of the largest file and contains too many things. The
e-mail alerting code is stored there while nothing is in mailers.c.
Let's move this code out. That's only 4% of the code but a good start.
In order to do so, a few tcp-check functions had to be exported.
When building contrib/hpack there is a warning about an unused static
function. Actually it makes no sense to make it static, instead it must
be regularly exported. Similarly there is hpack_dht_get_tail() which is
inlined in the C file and which would make more sense with all other ones
in the H file.
There's no point splitting the file in two since only cfgparse uses the
types defined there. A few call places were updated and cleaned up. All
of them were in C files which register keywords.
There is nothing left in common/ now so this directory must not be used
anymore.