When threads are disabled, some variables such as tid and tid_bit are
still checked everywhere, the MAX_THREADS_MASK macro is ~0UL while
MAX_THREADS is 1, and the all_threads_mask variable is replaced with a
macro forced to zero. The compiler cannot optimize away all this code
involving checks on tid and tid_bit, and we end up in special cases
where all_threads_mask has to be specifically tested for being zero or
not. It is not even certain the code paths are always equivalent when
testing without threads and with nbthread 1.
Let's change this to make sure we always present a single thread when
threads are disabled, and have the relevant values declared as constants
so that the compiler can optimize all the tests away. Now we have
MAX_THREADS_MASK set to 1, all_threads_mask set to 1, tid set to zero
and tid_bit set to 1. Doing just this has removed 4 kB of code in the
no-thread case.
A few checks for all_threads_mask==0 have been removed since it never
happens anymore.
An offsetof() macro was introduced with commit 928fbfa ("MINOR: compiler:
introduce offsetoff().") with a fallback for older compilers. But this
breaks gcc 3.4 because __size_t and __uintptr_t are not defined there.
However size_t and uintptr_t are, so let's fix it this way. No backport
needed.
The purpose is to make sure that all variables which directly depend
on this nbthread argument are set at the right moment. For now only
all_threads_mask needs to be set. It used to be set while calling
thread_sync_init() which is called too late for certain checks. The
same function handles threads and non-threads, which removes the need
for some thread-specific knowledge from cfgparse.c.
If nbthread is MAX_THREADS, the shift operation needed to compute
all_threads_mask fails in thread_sync_init(). Instead pass a number
of threads to this function and let it compute the mask without
overflowing.
This should be backported to 1.8.
This lock was necessary to manipulate the pendconn element between
concurrent places, but was causing great difficulties in the list walk
by having to iterate over multiple entries instead of being able to
safely pick the first one (in fact the first element was always the
right one but the locking model was hard to prove).
Here since we know we can always rely on the queue's locks, we take
the queue's lock every time we need to modify the element. In practice
it was already the case everywhere except in pendconn_dequeue() which
only works on an element that was already detached. This function had
to be protected against the risk of meeting an incompletely detached
element (which could be unlinked but not yet assigned). By taking the
queue lock around the LIST_ISEMPTY test, it's enough to ensure that a
concurrent thread either didn't begin or had completed the operation.
The true benefit really is in pendconn_process_next_strm() where we
can again safely work with the first element of each queue. This will
significantly simplify next updates to this code.
Whenever it's possible to avoid a copy, b_xfer() will simply swap the
buffer's heads without touching the data. This has brought the performance
back from 140 kH/s to 202 kH/s on the test case.
The latter function is more suited to operations that don't require any
check because the check has already been performed. It will be used by
other b_* functions.
This function is used a lot in block copies and is needlessly
complicated since it still uses pointer arithmetic. Let's fall
back to regular offsets and simplify it. This removed around
23 bytes from b_putblk() and it removed any conditional jump.
In thread_sync_barrier, we exit when all threads have set their own bit in the
barrier mask. It is done by comparing it to all_threads_mask. But we must not
use a simple equality to do so, becaue all_threads_mask may change. Since commit
ba86c6c25 ("MINOR: threads: Be sure to remove threads from all_threads_mask on
exit"), when a thread exit, its bit is removed from all_threads_mask. Instead,
we must use a bitwise AND to test is all bits of all_threads_mask are set.
This also requires that all_threads_mask is set to volatile if we want to
catch changes.
This patch must be backported in 1.8.
Now all the code used to manipulate chunks uses a struct buffer instead.
The functions are still called "chunk*", and some of them will progressively
move to the generic buffer handling code as they are cleaned up.
Chunks are only a subset of a buffer (a non-wrapping version with no head
offset). Despite this we still carry a lot of duplicated code between
buffers and chunks. Replacing chunks with buffers would significantly
reduce the maintenance efforts. This first patch renames the chunk's
fields to match the name and types used by struct buffers, with the goal
of isolating the code changes from the declaration changes.
Most of the changes were made with spatch using this coccinelle script :
@rule_d1@
typedef chunk;
struct chunk chunk;
@@
- chunk.str
+ chunk.area
@rule_d2@
typedef chunk;
struct chunk chunk;
@@
- chunk.len
+ chunk.data
@rule_i1@
typedef chunk;
struct chunk *chunk;
@@
- chunk->str
+ chunk->area
@rule_i2@
typedef chunk;
struct chunk *chunk;
@@
- chunk->len
+ chunk->data
Some minor updates to 3 http functions had to be performed to take size_t
ints instead of ints in order to match the unsigned length here.
Now the buffers only contain the header and a pointer to the storage
area which can be anywhere. This will significantly simplify buffer
swapping and will make it possible to map chunks on buffers as well.
The buf_empty variable was removed, as now it's enough to have size==0
and area==NULL to designate the empty buffer (thus a non-allocated head
is the empty buffer by default). buf_wanted for now is indicated by
size==0 and area==(void *)1.
The channels and the checks now embed the buffer's head, and the only
pointer is to the storage area. This slightly increases the unallocated
buffer size (3 extra ints for the empty buffer) but considerably
simplifies dynamic buffer management. It will also later permit to
detach unused checks.
The way the struct buffer is arranged has proven quite efficient on a
number of tests, which makes sense given that size is always accessed
and often first, followed by the othe ones.
It used to be called 'len' during the reorganisation but strictly speaking
it's not a length since it wraps. Also we already use '_data' as the suffix
to count available data, and data is also what we use to indicate the amount
of data in a pipe so let's improve consistency here. It was important to do
this in two operations because data used to be the name of the pointer to
the storage area.
This one is more generic and designed to work on a random block. It
may later get a b_rep_ist() variant since many strings are already
available as (ptr,len).
There was no point keeping that function in the buffer part since it's
exclusively used by HTTP at the channel level, since it also automatically
appends the CRLF. This further cleans up the buffer code.
The new file istbuf.h links the indirect strings (ist) with the buffers.
The purpose is to encourage addition of more standard buffer manipulation
functions that rely on this in order to improve the overall ease of use
along all the code. Just like ist.h and buf.h, this new file is not
expected to depend on anything beyond these two files.
A few functions were added and/or converted from buffer.h :
- b_isteq() : indicates if a buffer and a string match
- b_isteat() : consumes a string from the buffer if it matches
- b_istput() : appends a small string to a buffer (all or none)
- b_putist() : appends part of a large string to a buffer
The equivalent functions were removed from buffer.h and changed at the
various call places.
The two variants now do exactly the same (appending at the tail of the
buffer) so let's not keep the distinction between these classes of
functions and have generic ones for this. It's also worth noting that
b{i,o}_putchk() wasn't used at all and was removed.
There's no distinction between in and out data now. The latter covers
the needs of the former and supports wrapping. The extra cost is
negligible given the locations where it's used.
Since we never access this field directly anymore, but only through the
channel's wrappers, it can now move to the channel. The buffers are now
completely free from the distinction between input and output data.
Since we use "_data" for the amount of data at many places, as opposed to
"_space" for the amount of space, let's rename the "data" field to "area"
so that we can reuse "data" later for the amount of data in the buffer
(currently called "len" despite not being contigous).
b_set_data() is used :
- in proto_http and hlua to trim input data (b_set_data(co_data()))
- in SPOE to append data to a buffer while building a message
In no case will this truncate a buffer so we can safely remove the
test for len < b->output.
b_del() is used in :
- mux_h2 with the demux buffer : always processes input data
- checks with output data though output is not considered at all there
- b_eat() which is not used anywhere
- co_skip() where the len is always <= output
Thus the distinction for output data is not needed anymore and the
decrement can be made inconditionally in co_skip().
This is intentionally the minimal and safest set of changes, some cleanups
area still required. These changes are quite tricky and cannot be
independantly tested, so it's important to keep this patch as bisectable
as possible.
buf_empty and buf_wanted were changed and are now exactly similar since
there's no <p> member in the structure anymore. Given that no test is
ever made in the code to check that buf == &buf_wanted, it may be possible
that we don't need to have two anymore, unless some buf_empty tests have
precedence. This will have to be investigated.
A significant part of this commit affects the HTTP compression code,
which used to deeply manipulate the input and output buffers without
any reasonable solution for a better abstraction. For this reason, if
any regression is met and designates this patch as the culprit, it is
important to run tests which specifically involve compression or which
definitely don't use it in order to spot the issue.
Cc: Olivier Houchard <ohouchard@haproxy.com>
For the same consistency reasons, let's use b_empty() at the few places
where an empty buffer is expected, or c_empty() if it's done on a channel.
Some of these places were there to realign the buffer so
{b,c}_realign_if_empty() was used instead.
We used to have variations around buffer_total_space() and
size-buffer_len() or size-b_data(). Let's simplify all this. buffer_len()
was also removed as not used anymore.
Now the new API functions are being used everywhere, we can get rid
of b_ptr(). A few last users like bi_istput() and bo_istput() appear
to only differ by what part of the buffer they're increasing, but
that should quickly be merged.
Now that there are no more users requiring to modify the buffer anymore,
switch these ones to const char and const buffer. This will make it more
obvious next time send functions are tempted to modify the buffer's output
count. Minor adaptations were necessary at a few call places which were
using char due to the function's previous prototype.
Till now the callers had to know which one to call for specific use cases.
Let's fuse them now since a single one will remain after the API migration.
Given that bi_del() may only be used where o==0, just combine the two tests
by first removing output data then only input.
This will be important so that we can parse a buffer without touching it.
Now we indicate where from the buffer's head we plan to start to copy, and
for how many bytes. This will be used by send functions to loop at the end
of the buffer without having to update the buffer's output byte count.
This new functoin limits itself to the amount of data available in the
buffer and doesn't care about the direction anymore. It's only called
from co_getblk() which already checks that no more than the available
output bytes is requested.
These ones were merged into a single b_contig_space() that covers both
(the bo_ case was a simplified version of the other one). The function
doesn't use ->i nor ->o anymore.