This function now dumps info about the HTX message into a buffer, passed as
argument. In addition, it is possible to only dump meta information, without the
message content.
We often need ISO time + microseconds in traces and ring buffers, thus
function does this by calling gettimeofday() and keeping a cached value
of the part representing the tv_sec value, and only rewrites the microsecond
part. The cache is per-thread so it's lockless and safe to use as-is.
Some tests already show that it's easy to see 3-4 events in a single
microsecond, thus it's likely that the nanosecond version will have to
be implemented as well. But certain comments on the net suggest that
some parsers are having trouble beyond microsecond, thus for now let's
stick to the microsecond only.
Make it so MT_LIST_ADD and MT_LIST_ADDQ return 1 if it managed to add the
item, 0 (because it was already in a list) otherwise.
Make it so MT_LIST_DEL returns 1 if it managed to remove the item from a
list, or 0 otherwise (because it was in no list).
Add a few new macroes to the mt_lists.
MT_LIST_LOCK_ELT()/MT_LIST_UNLOCK_ELT() helps locking/unlocking an element.
This should only be used if you know for sure nobody else will remove the
element from the list in the meanwhile.
mt_list_for_each_entry_safe() is an iterator, similar to
list_for_each_entry_safe().
It takes 5 arguments, item, list_head, member are similar to those of
the non-mt variant, tmpelt is a temporary pointer to a struct mt_list, while
tmpelt2 is a struct mt_list itself.
MT_LIST_DEL_SELF() can be used to delete an item while parsing the list with
mt_list_for_each_entry_safe(). It shouldn't be used outside, and you
shouldn't use MT_LIST_DEL() while using mt_list_for_each_entry_safe().
Instead of using the same type for regular linked lists and "autolocked"
linked lists, use a separate type, "struct mt_list", for the autolocked one,
and introduce a set of macros, similar to the LIST_* macros, with the
MT_ prefix.
When we use the same entry for both regular list and autolocked list, as
is done for the "list" field in struct connection, we know have to explicitely
cast it to struct mt_list when using MT_ macros.
To avoid code duplication in the futur mux FCGI, functions parsing H1 messages
and converting them into HTX have been moved in the file h1_htx.c. Some
specific parts remain in the mux H1. But most of the parsing is now generic.
This new flag may be used to report unexpected error because of not well
formatted HTX messages (not related to a parsing error) or our incapactity to
handle the processing because we reach a limit (ressource exhaustion, too big
headers...). It should result to an error 500 returned to the client when
applicable.
In prompts on the CLI we now commonly need to propose a keyword name
and a description and it doesn't make sense to define a new struct for
each such pairs. Let's simply have a generic "name_desc" for this.
The new functions are :
__b_put_varint() : inserts a varint when it's known that it fits
b_put_varint() : tries to insert a varint at the tail
b_get_varint() : tries to get a varint from the head
b_peek_varint() : tries to peek a varint at a specific offset
Wrapping is supported so that they are expected to be safe to use to
manipulate varints with buffers anywhere.
It will sometimes be useful to encode varints to know the output size in
advance. Two versions are provided, one inline using a switch/case construct
which will be trivial for use with constants (and will be very fast albeit
huge) and one function iterating on the number which is 5 times smaller,
for use with variables.
The function call tracing code is a quite old and was never ported to
support threads. It's not even sure whether it still works well, but
at least its presence creates confusion for future work so let's rename
it to calltrace.c and add a comment about its lack of thread-safety.
It's sometimes convenient for debugging macros not to be forced to
explicitly pass NULL in an unused argument. This macro does this, it
replaces a missing arg with NULL.
This is somewhat related to indent_msg() except that this one places a
known prefix at the beginning of each line, allows to replace the EOL
character, and not to insert a prefix on the first line if not desired.
It works with a normal output buffer/chunk so it doesn't need to allocate
anything nor to modify the input string. It is suitable for use in multi-
line backtraces.
Since last commit there's no point anymore in having two variants of the
same function, let's switch to b_free() only. __b_drop() was renamed to
__b_free() for obvious consistency reasons.
A small race exists in buffers with "show sess all". This one wants to show
some information grabbed from the buffer (especially in HTX mode). But the
thread owning this buffer might just be releasing its area, right after a
free() or munmap() call, resulting in a head that is not seen as empty yet
though the area was released. It may then be dereferenced by "show sess all"
causing a crash. Note that in practice it only happens in debug mode with
UAF enabled, but it's tricky enough to fix it right now.
This should be backported to stable versions which support threads and a
store barrier. It's worth noting that by performing the clearing first,
b_free() and b_drop() now become two exact equivalent.
Commit 85b2cae63 ("MINOR: pools: make the thread harmless during the
mmap/munmap syscalls") was used to relax the pressure experienced by
other threads when running in debug mode with UAF enabled. It places
a pair of thread_harmless_now()/thread_harmless_end() around the call
to mmap(), assuming callers are not sensitive to parallel activity.
But there are a few cases like "show sess all" where this happens in
isolated threads, and marking the thread as harmless there is a very
bad idea, even worse when arriving to thread_harmless_end() which loops
forever.
Let's only do that when the thread is not isolated. No backport is
needed as the patch above was only in 2.1-dev.
It happens that upon looping threads the watchdog fires, starts a dump,
and other threads expire their budget while waiting for the other threads
to get dumped and trigger a watchdog event again, adding some confusion
to the traces. With this patch the situation becomes clearer as we export
the list of threads being dumped so that the watchdog can check it before
deciding to trigger. This way such threads in queue for being dumped are
not attempted to be reported in turn.
This should be backported to 2.0 as it helps understand stack traces.
This one was added by commit daacf3664 ("BUG/MEDIUM: protocols: add a
global lock for the init/deinit stuff") but I forgot to add it to the
include file, breaking DEBUG_THREAD.
Default HTTP error messages are stored in an array of chunks. And since the HTX
was added, these messages are also converted in HTX and stored in another
array. But now, the first array is not used anymore because the legacy HTTP mode
was removed.
So now, only the array with the HTX messages are kept. The other one was
removed.
Because the h2 multiplexer only uses the HTX mode, following H2 functions were
removed :
* h2_prepare_h1_reqline
* h2_make_h1_request()
* h2_make_h1_trailers()
Instead of using a array of (struct block), it is more natural and intuitive to
use an array of char. Indeed, not only (struct block) are stored in this array,
but also their payload.
<head> and <tail> fields are now signed 32-bits integers. For an empty HTX
message, these fields are set to -1. So the field <used> is now useless and can
safely be removed. To know if an HTX message is empty or not, we just compare
<head> against -1 (it also works with <tail>). The function htx_nbblks() has
been added to get the number of used blocks.
These calls can take quite some time and leave the thread harmless so
it's better to mark it as such. This makes "show sess" respond way
faster during high loads running on processes build with DEBUG_UAF
since these calls are stressed a lot.
When calling mmap(), in general the system gives us a page but does not
really allocate it until we first dereference it. And it turns out that
this time is much longer than the time to perform the mmap() syscall.
Unfortunately, when running with memory debugging enabled, we mmap/munmap()
each object resulting in lots of such calls and a high contention on the
allocator. And the first accesses to the page being done under the pool
lock is extremely damaging to other threads.
The simple fact of writing a 0 at the beginning of the page after
allocating it and placing the POOL_LINK pointer outside of the lock is
enough to boost the performance by 8x in debug mode and to save the
watchdog from triggering on lock contention. This is what this patch
does.
The malloc and free calls and especially the underlying mmap/munmap()
can occasionally take a huge amount of time and even cause the thread
to sleep. This is visible when haproxy is compiled with DEBUG_UAF which
causes every single pool allocation/free to allocate and release pages.
In this case, when using the locked pools, the watchdog can occasionally
fire under high contention (typically requesting 40000 1M objects in
parallel over 8 threads). Then, "perf top" shows that 50% of the CPU
time is spent in mmap() and munmap(). The reason the watchdog fires is
because some threads spin on the pool lock which is held by other threads
waiting on mmap() or munmap().
This patch modifies this so that the pool lock is released during these
syscalls. Not only this allows other threads to request try to allocate
their data in parallel, but it also considerably reduces the lock
contention.
Note that the locked pools are only used on small architectures where
high thread counts would not make sense, so this will not provide any
benefit in the general case. However it makes the debugging versions
way more stable, which is always appreciated.
The receive limit of an HTX channel must be calculated against the total size of
the HTX message. Otherwise, the buffer may never be seen as full whereas the
receive limit is 0. Indeed, the function channel_htx_full() already takes care
to add a block size to the buffer's reserve (8 bytes). So if the function
channel_htx_recv_limit() also keep a block size free in addition to the buffer's
reserve, it means that at least 2 block size will be kept free but only one will
be taken into account, freezing the stream if the option http-buffer-request is
enabled.
This patch fixes the Github issue #136. It should be backported to 2.0 and
1.9. Thanks jaroslawr (Jarosław Rzeszótko) for his help.
As its name suggest, this function change the value length of a block. But it
also update the HTX message accordingly. It simplifies the HTX API. The function
htx_set_blk_value_len() is still available and must be used with caution because
this one does not update the HTX message. It just updates the HTX block. It
should be considered as an internal function. When possible,
htx_change_blk_value_len() should be used instead.
This function is used to fix a bug affecting the 2.0. So, this patch must be
backported to 2.0.
The first one, HTX_SL_F_HAS_SCHM, will be used to know the request has an
explicit scheme. So, in H2, it is always true because the pseudo-header
":scheme" is mandatory. In H1, it is only true when an absolute URI is found on
the start-line. The other flags, HTX_SL_F_SCHM_HTTP and HTX_SL_F_SCHM_HTTPS,
will be used to know which scheme the request have. For now, other protocols are
not handled.
The aim of these flags is to pass this information to the backend side in
general, and to the H2 mux in particular. So the multiplexer will have a chance
to use this information to send the right scheme to the server.
The function htx_add_data_before() was removed because it was buggy. The
function htx_move_blk_before() may be used if necessary to do something
equivalent, except it just moves blocks. It doesn't handle the adding.
In an HTX message, it may have 2 available rooms to store a new block. The first
one is between the blocks and their payload. Blocks are added starting from the
end of the buffer and their payloads are added starting from the begining. So
the first free room is between these 2 edges. The second one is at the begining
of the buffer, when we start to wrap to add new payloads. Once we start to use
this one, the other one is ignored until the next defragmentation of the HTX
message.
In theory, there is no problem. But in practice, some lacks in the HTX structure
force us to defragment too often HTX messages to always be in a known state. The
second free room is not tracked as it should do and the first one may be easily
corrupted when rewrites happen.
So to fix the problem and avoid unecessary defragmentation, the HTX structure
has been refactored. The front (the block's position of the first payload before
the blocks) is no more stored. Instead we keep the relative addresses of 3 edges:
* tail_addr : The start address of the free space in front of the the blocks
table
* head_addr : The start address of the free space at the beginning
* end_addr : The end address of the free space at the beginning
Here is the general view of the HTX message now:
head_addr end_addr tail_addr
| | |
V V V
+------------+------------+------------+------------+------------------+
| | | | | |
| PAYLOAD | Free space | PAYLOAD | Free space | Blocks area |
| ==> | 1 | ==> | 2 | <== |
+------------+------------+------------+------------+------------------+
<head_addr> is always lower or equal to <end_addr> and <tail_addr>. <end_addr>
is always lower or equal to <tail_addr>.
In addition;, to simplify everything, the blocks area are now contiguous. It
doesn't wrap anymore. So the head is always the block with the lowest position,
and the tail is always the one with the highest position.
The function htx_add_data_before() is buggy and cannot work. It first add a data
block and then move it before another one, passed in argument. The problem
happens when a defragmentation is done to add the new block. In this case, the
reference is no longer valid, because the blocks are rearranged. So, instead of
moving the new block before the reference, it is moved at the head of the HTX
message.
So this function has been removed. It was only used by the compression filter to
add a last data block before a TLR, EOT or EOM block. Now, the new function
htx_add_last_data() is used. It adds a last data block, after all others and
before any TLR, EOT or EOM block. Then, the next bock is get. It is the first
non-data block after data in the HTX message. The compression loop continues
with it.
This patch must be backported to 1.9.
This function provides an alternate way to leave a critical section run
under thread_isolate(). Currently, a thread may remain in thread_release()
without having the time to notice that the rdv mask was released and taken
again by another thread entering thread_isolate() (often the same that just
released it). This is because threads wait in harmless mode in the loop,
which is compatible with the conditions to enter thread_isolate(). It's
not possible to make them wait with the harmless bit off or we cannot know
when the job is finished for the next thread to start in thread_isolate(),
and if we don't clear the rdv bit when going there, we create another
race on the start point of thread_isolate().
This new synchronous variant of thread_release() makes use of an extra
mask to indicate the threads that want to be synchronously released. In
this case, they will be marked harmless before releasing their sync bit,
and will wait for others to release their bit as well, guaranteeing that
thread_isolate() cannot be started by any of them before they all left
thread_sync_release(). This allows to construct synchronized blocks like
this :
thread_isolate()
/* optionally do something alone here */
thread_sync_release()
/* do something together here */
thread_isolate()
/* optionally do something alone here */
thread_sync_release()
And so on. This is particularly useful during initialization where several
steps have to be respected and no thread must start a step before the
previous one is completed by other threads.
This one must not be placed after any call to thread_release() or it would
risk to block an earlier call to thread_isolate() which the current thread
managed to leave without waiting for others to complete, and end up here
with the thread's harmless bit cleared, blocking others. This might be
improved in the future.
As reported in GH issue #109 and in discourse issue
https://discourse.haproxy.org/t/haproxy-returns-408-or-504-error-when-timeout-client-value-is-every-25d
the time parser doesn't error on overflows nor underflows. This is a
recurring problem which additionally has the bad taste of taking a long
time before hitting the user.
This patch makes parse_time_err() return special error codes for overflows
and underflows, and adds the control in the call places to report suitable
errors depending on the requested unit. In practice, underflows are almost
never returned as the parsing function takes care of rounding values up,
so this might possibly happen on 64-bit overflows returning exactly zero
after rounding though. It is not really possible to cut the patch into
pieces as it changes the function's API, hence all callers.
Tests were run on about every relevant part (cookie maxlife/maxidle,
server inter, stats timeout, timeout*, cli's set timeout command,
tcp-request/response inspect-delay).
The fd_sets we've been using in the log encoding functions are not portable
and were shown to break at least under Cygwin. This patch gets rid of them
in favor of the new bitmap functions. It was verified with the config below
that the log output was exactly the same before and after the change :
defaults
mode http
option httplog
log stdout local0
timeout client 1s
timeout server 1s
timeout connect 1s
frontend foo
bind :8001
capture request header chars len 255
backend bar
option httpchk "GET" "/" "HTTP/1.0\r\nchars: \x01\x02\x03\x04\x05\x06\x07\x08\x09\x0b\x0c\x0e\x0f\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f\x20\x21\x22\x23\x24\x25\x26\x27\x28\x29\x2a\x2b\x2c\x2d\x2e\x2f\x30\x31\x32\x33\x34\x35\x36\x37\x38\x39\x3a\x3b\x3c\x3d\x3e\x3f\x40\x41\x42\x43\x44\x45\x46\x47\x48\x49\x4a\x4b\x4c\x4d\x4e\x4f\x50\x51\x52\x53\x54\x55\x56\x57\x58\x59\x5a\x5b\x5c\x5d\x5e\x5f\x60\x61\x62\x63\x64\x65\x66\x67\x68\x69\x6a\x6b\x6c\x6d\x6e\x6f\x70\x71\x72\x73\x74\x75\x76\x77\x78\x79\x7a\x7b\x7c\x7d\x7e\x7f\x80\x81\x82\x83\x84\x85\x86\x87\x88\x89\x8a\x8b\x8c\x8d\x8e\x8f\x90\x91\x92\x93\x94\x95\x96\x97\x98\x99\x9a\x9b\x9c\x9d\x9e\x9f\xa0\xa1\xa2\xa3\xa4\xa5\xa6\xa7\xa8\xa9\xaa\xab\xac\xad\xae\xaf\xb0\xb1\xb2\xb3\xb4\xb5\xb6\xb7\xb8\xb9\xba\xbb\xbc\xbd\xbe\xbf\xc0\xc1\xc2\xc3\xc4\xc5\xc6\xc7\xc8\xc9\xca\xcb\xcc\xcd\xce\xcf\xd0\xd1\xd2\xd3\xd4\xd5\xd6\xd7\xd8\xd9\xda\xdb\xdc\xdd\xde\xdf\xe0\xe1\xe2\xe3\xe4\xe5\xe6\xe7\xe8\xe9\xea\xeb\xec\xed\xee\xef\xf0\xf1\xf2\xf3\xf4\xf5\xf6\xf7\xf8\xf9\xfa\xfb\xfc\xfd\xfe\xff"
server foo 127.0.0.1:8001 check
We now have ha_bit_{set,clr,flip,test} to manipulate bitfields made
of arrays of longs. The goal is to get rid of the remaining non-portable
FD_{SET,CLR,ISSET} that still exist at a few places.
This type of blocks is useless because transition between data and trailers is
obvious. And when there is no trailers, the end-of-message is still there to
know when data end for chunked messages.
HTTP trailers are now parsed in the same way headers are. It means trailers are
converted to K/V blocks followed by an end-of-trailer marker. For now, to make
things simple, the type for trailer blocks are not the same than for header
blocks. But the aim is to make no difference between headers and trailers by
using the same type. Probably for the end-of marker too.
This patch adds minimalistic definitions to implement dictionary new data structure
which is an ebtree of ebpt_node structs with strings as keys. Note that this has nothing
to see with real dictionary data structure (maps of keys in association with values).
In lock profiles it's visible that there is a huge contention on the
buffer lock. The reason is that when offer_buffers() is called, it
systematically takes the lock before verifying if there is any
waiter. However doing so doesn't protect against races since a
waiter can happen just after we release the lock as well. Similarly
in h2 we take the lock every time an h2c is going to be released,
even without checking that the h2c belongs to a wait list. These
two have now been addressed by verifying non-emptiness of the list
prior to taking the lock.
Haproxy is designed to be able to continue to run even under very low
memory conditions. However this can sometimes have a serious impact on
performance that it hard to diagnose. Let's report counters of failed
pool and buffer allocations per thread in show activity.