Now, the SI calls h2_rcv_buf() with the right count value. So we can rely on
it. Unlike the H1 multiplexer, it is much easier for the H2 multiplexer
because the HTX message already exists: we only transfer blocks from the H2S to
the channel. And this part is handled by htx_xfer_blks().
Now, the SI calls h1_rcv_buf() with the right count value. So we can rely on
it. During the parsing, we now really respect this value to be sure to never
exceed it. To do so, once headers are parsed, we should estimate the size of the
HTX message before copying data.
This patch makes the function more accurate. Thanks to the function
htx_get_max_blksz(), the transfer of data has been simplified. Note that now the
total number of bytes copied (metadata + payload) is returned. This slightly
changes how the function is used in the H2 multiplexer.
This function should be used to get the maximum size for a block, not exceeding
the max amount of bytes passed in argument. max may be set to -1 to have no
limit.
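As an illustration only, the computation described above can be sketched as
follows. This is not the HAProxy implementation: max_blksz(), BLK_META_SIZE and
the assumption that the per-block metadata is counted against the limit are all
hypothetical.

```c
#include <stdint.h>

/* Assumed size of the per-block metadata. The real value depends on the
 * layout of the HTX block descriptor. */
#define BLK_META_SIZE 8u

/* Returns the largest payload that fits in <free_space> bytes without
 * exceeding <max> bytes in total (metadata included). A <max> of -1 means
 * "no limit". Returns 0 when not even the metadata fits. */
static uint32_t max_blksz(uint32_t free_space, int32_t max)
{
	uint32_t room = free_space;

	if (max != -1 && room > (uint32_t)max)
		room = (uint32_t)max;
	if (room < BLK_META_SIZE)
		return 0;
	return room - BLK_META_SIZE;
}
```

With 100 free bytes, max_blksz(100, -1) leaves room for a 92-byte payload while
max_blksz(100, 50) caps it at 42 bytes.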
When channel_recv_max() is called for an HTX stream, we fall back on the HTX
version. This function is called from si_cs_recv(). This will let us pass the
max amount of bytes to read to HTX multiplexers.
The first block is the start-line, if defined. Otherwise it is the head of the
HTX message. So now, during HTTP analysis, lookups are all done using the first
block instead of the head. Concretely, for now, it is the same because only one
HTTP message is stored at a time in an HTX message. 1xx informational messages
are handled separately from the final response and from each other. But it will
make sense when the 1xx informational messages and the associated final response
are stored in the same HTX message.
Since the HTX start-line is now referenced by position instead of by its payload
address, it is much easier to replace it. No need to search for the right block
to find the start-line by comparing the payload addresses. It is just enough to
get the block at the position sl_pos.
Now, we only return the start-line. If not found, NULL is returned. No lookup is
performed and the HTX message is no longer updated. It is now the caller's
responsibility to update the position of the start-line to the right value. So
when it is not found, i.e. sl_pos is set to -1, it means the last start-line has
already been processed and the next one has not been inserted yet.
It is mandatory to rely on this kind of guarantee to store 1xx informational
responses and the final response in the same HTX message.
In the H2 multiplexer, when a HEADERS frame is built before sending it, we have
the guarantee that the start-line is the head of the HTX message. It is safer to rely
on this fact than on the sl_pos value. For now, it's safe to use sl_pos in muxes
because HTTP 1xx messages are considered as full messages in HTX and only one
HTTP message can be stored at a time in HTX. But we are trying to handle 1xx
messages as a part of the response message. In this way, an HTTP response will be
the sum of all 1xx informational messages followed by the final response. So it
will be possible to have several start-lines in the same HTX message. And the
sl_pos will point to the first unprocessed start-line from the analyzers' point
of view.
It is the first block relative to the start-line. So it is the start-line if
its position is set (sl_pos != -1), otherwise it is the head. The functions
htx_get_first() and htx_get_first_blk() can be used to get it. This change is
mandatory to consider 1xx informational messages as part of a response.
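The selection rule described above is small enough to be sketched directly.
The struct and function below are hypothetical stand-ins, not the real HTX
types nor htx_get_first() itself; only the rule (start-line if set, head
otherwise) comes from the description.

```c
#include <stdint.h>

/* Hypothetical, reduced view of a message: only the two positions that
 * matter for this rule are kept. */
struct msg {
	int32_t head;   /* position of the oldest block */
	int32_t sl_pos; /* position of the start-line, -1 if not set */
};

/* Returns the position of the first block: the start-line when its
 * position is set, the head otherwise. */
static int32_t msg_get_first(const struct msg *m)
{
	return (m->sl_pos != -1) ? m->sl_pos : m->head;
}
```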
The head of an HTX message is heavily used whereas the wrap position is only
used when a block is added or removed. So it is more logical to store the head
position in the HTX message instead of the wrap one. The wrap position can be
easily deduced. To get it, the new function htx_get_wrap() may be used.
We've been emitting warnings for over 5 years (since 1.5-dev22) about
configs accidentally carrying multiple servers with the same name in the
same backend, and this starts to cause some real trouble in dynamic
environments since it's still very difficult to accurately process
a state-file and we still can't transport a server's name over the
peers protocol because of this.
It's about time to force users to fix their configs if they still
haven't, given that there is zero technical justification for doing this,
beyond the "yyp" (or copy-paste accident) when editing the config.
The message remains as clear as before, indicating the file and lines
of the conflict so that the user can easily fix it.
On armv7 haproxy doesn't work because of the fixes on the double-word
CAS. There are two issues. The first one is that the last argument in
case of dwcas is a pointer to the set of values and not a value; the
second is that it's not enough to cast the data as (void*) since it will
be a single word. Let's fix this by using the pointers as an array of
long. This was tested on i386, armv7, x86_64 and aarch64 and it is now
fine. An alternate approach using a struct was attempted as well but it
used to produce less optimal code.
This fix must be backported to 1.9. This fixes github issue #105.
Cc: Olivier Houchard <ohouchard@haproxy.com>
In pendconn_redistribute() we scan the queue using eb32_next() on the
node we've just deleted, which is wrong since the node is not in the
tree anymore, and it could dereference one node that has already been
released by another thread. Note that we cannot use eb32_first() in the
loop here instead because we need to skip pendconns having SF_FORCE_PRST.
Instead, let's keep a copy of the next node before deleting it.
In addition, the pendconn retrieved there is wrong: it uses &node as
the pointer instead of node, resulting in very quick crashes when the
server list is scanned.
Fortunately this only happens when "option redispatch" is used in
conjunction with "maxconn" on server lines, "cookie" for the stickiness,
and when a server goes down with entries in its queue.
This bug was introduced by commit 0355dabd7 ("MINOR: queue: replace
the linked list with a tree") so the fix must be backported to 1.9.
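The "keep a copy of the next node before deleting it" pattern can be shown on a
much simpler structure. The sketch below uses a plain singly linked list
instead of an eb32 tree, and the <keep> flag is only a stand-in for
SF_FORCE_PRST; all names are hypothetical.

```c
#include <stdlib.h>

struct node {
	struct node *next;
	int keep;   /* non-zero: must stay queued (stand-in for SF_FORCE_PRST) */
	int value;
};

/* Prepends a new node to the list and returns the new list head. */
static struct node *push(struct node *head, int value, int keep)
{
	struct node *n = malloc(sizeof(*n));

	if (!n)
		abort();
	n->next = head;
	n->keep = keep;
	n->value = value;
	return n;
}

/* Removes and frees every node whose <keep> flag is clear, and returns the
 * number of nodes removed. <*head> is updated in place. */
static int drain(struct node **head)
{
	struct node **prev = head;
	struct node *cur = *head;
	struct node *next;
	int removed = 0;

	while (cur) {
		next = cur->next;   /* saved first: <cur> may be freed below */
		if (cur->keep) {
			prev = &cur->next;
		} else {
			*prev = next;
			free(cur);
			removed++;
		}
		cur = next;         /* never reads the freed node */
	}
	return removed;
}
```

Reading cur->next after free(cur) would be the list equivalent of calling
eb32_next() on a node that is no longer in the tree.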
In fwrr_get_next_server(), we optionally pass a server to avoid. It
usually points to the current server during a redispatch operation. If
this server is usable, an "avoided" pointer is set and we continue to
look for another server. If in the end no other server is found, then
we fall back to this avoided one, which is still better than nothing.
The problem that may arise with threads is that in the meantime, this
avoided server might have received extra connections and might not be
usable anymore. This causes it to be queued a second time in the "full"
list and the loop to search for a server again, ending up on this one
again and so on.
This patch makes sure that we break out of the loop when we have to
pick the avoided server. It's probably what the code intended to do
as the current break statement causes fwrr_update_position() and
fwrr_dequeue_srv() to be called again on the avoided server.
It must be backported to 1.9 and 1.8, and seems appropriate for older
versions though it's unclear what the impact of this bug might be
there since the race doesn't exist and we're left with the double
update of the server's position.
The unused fd_del and fd_skip were being abused during debugging sessions
as general purpose event counters. With their removal, let's officially
have dedicated counters for such use cases. These counters are called
"ctr0".."ctr2" and are listed at the end when DEBUG_DEV is set.
Starting with OpenSSL 1.0.0, the recommended way to disable compression is
to use SSL_OP_NO_COMPRESSION when creating the context.
Manipulations with SSL_COMP_get_compression_methods and sk_SSL_COMP_num
are only required for OpenSSL < 1.0.0.
Since commit 88698d9 ("MEDIUM: connections: Add a way to control the
number of idling connections.") when building without threads, gcc
complains that the operations made on the idle_orphan_conns[] list are
out of bounds, which is always false since 1) <i> can only equal zero,
and 2) given it's equal to <tid> we never even enter the loop. But as
usual it thinks it knows better, so let's mask the origin of this <i>
value to shut it up. Another solution consists in making <i> unsigned
and adding an explicit range check.
Now when we fail to send because the mux buffer is full, before giving
up and marking MFULL, we try to allocate another buffer in the mux's
ring to try again. Thanks to this (and provided there are enough buffers
allocated to the mux's ring), a single stream picked in the send_list
cannot steal all the mux's room at once. For this, we expand the ring
size to 31 buffers as it seems to be optimal on benchmarks since it
divides the number of context switches by 3. It will inflate each H2
conn's memory by 1 kB.
The bandwidth is now much more stable. Prior to this, a test on
h2->h1 with very large objects (1 GB), a few tens of connections and
a few tens of streams per connection would show a varying performance
between 34 and 95 Gbps on 2 cores/4 threads, with h2_snd_buf() stopped
on a buffer full condition between 300000 and 600000 times per second.
Now the performance is constantly between 88 and 96 Gbps. Measures show
that buffer full conditions are met around only 159 times per second
in this case, or roughly 2000 to 4000 times less often.
This makes the code more readable and reduces the calls to br_tail().
In addition, all calls to h2_get_buf() are now made via this local
variable, which should significantly help for retries.
Now send() uses a loop to iterate over all buffers to be sent. These
buffers are released and deleted from the vector once completely sent.
If any buffer gets released, offer_buffers() is called to wake up some
waiters.
For now it's only one buffer long so the head and tail are always the
same, thus it doesn't change what used to work. In short, br_tail(h2c->mbuf)
was inserted everywhere we used to have h2c->mbuf.
The purpose is to manipulate rings made of series of buffers so that
it is possible to continue to work on a next buffer once one is full.
This will be used by muxes to deal with contention between multiple
streams and a single output buffer. No data is expected to span over
multiple buffers, all of them will be used like a regular buffer. This
will significantly limit the amount of changes and the code complexity
while still supporting larger output buffering.
The ring is made of head and tail indexes, both of which point to a
buffer descriptor. At least one descriptor is always valid, so it could
be seen as a form of pagination always presenting one buffer. The root
of the ring is itself stored into a buffer descriptor so that the user
only has to declare a buffer array and to call br_init() on it in order
to use it.
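To make the layout more concrete, here is a minimal sketch of such a ring. It
is not the real implementation: struct desc and the ring_*() helpers are
hypothetical names, and the way the root reuses the descriptor fields (head
index, tail index, number of slots) is an assumption derived from the
description above.

```c
#include <assert.h>
#include <stddef.h>

struct desc {
	size_t head;  /* root: index of the oldest slot;  slot: start offset */
	size_t data;  /* root: index of the newest slot;  slot: bytes stored */
	size_t size;  /* root: number of descriptors;     slot: buffer size  */
};

/* Slot 0 is the root, so usable indexes wrap from size-1 back to 1. */
static size_t ring_next(const struct desc *r, size_t idx)
{
	return (idx + 1 < r->size) ? idx + 1 : 1;
}

/* <r> is an array of <n> descriptors: the root plus n-1 usable slots. */
static void ring_init(struct desc *r, size_t n)
{
	assert(n >= 2);
	r->head = r->data = 1;
	r->size = n;
	r[1].head = r[1].data = r[1].size = 0;
}

static struct desc *ring_head(struct desc *r) { return &r[r->head]; }
static struct desc *ring_tail(struct desc *r) { return &r[r->data]; }

/* Appends an empty slot after the tail; returns NULL if the ring is full. */
static struct desc *ring_tail_add(struct desc *r)
{
	size_t next = ring_next(r, r->data);

	if (next == r->head)
		return NULL;
	r->data = next;
	r[next].head = r[next].data = r[next].size = 0;
	return &r[next];
}

/* Releases the head slot. When it is the only one it is just reset, so
 * that one descriptor always remains valid. Returns the new head. */
static struct desc *ring_del_head(struct desc *r)
{
	struct desc *h = ring_head(r);

	h->head = h->data = h->size = 0;
	if (r->head != r->data)
		r->head = ring_next(r, r->head);
	return ring_head(r);
}
```

A user would declare "struct desc ring[4];", call ring_init(ring, 4), write
into ring_tail(ring) and call ring_tail_add() once that buffer is full.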
It has not been used for many years, is unlikely to be reused and
conflicts with the similarly named macro in flt_trace, causing warnings
at build time when including debug.h in low-level files. Let's simply
remove it.
Transferring large objects over H2 sometimes shows unexplained performance
variations. A long analysis resulted in the following discovery. Often the
mux buffer looks like this :
[ empty_head | data | empty_tail ]
Typical numbers are (very common) :
- empty_head = 31
- empty_tail = 16 (total free=47)
- data = 16337
- size = 16384
- data to copy: 43
The reason for these holes is the blocking factors that are not always
the same in and out (due to keeping 9 bytes for the frame size, or the
56 bytes corresponding to the HTX header). This can easily happen 10000
times a second if the network bandwidth permits it!
In this case, while copying a DATA frame we find that the buffer has its
free space wrapped so we decide to realign it to optimize the copy. It's
possible that this practice stems from the code used to emit headers,
which do not support fragmentation and which had no other option left.
But it comes with two problems :
- we don't check if the data fits, which results in a memcpy for nothing
- we can move huge amounts of data to just copy a small block.
This patch addresses this in two ways :
- first, by not forcing a data realignment if what we have to copy does
not fit, as this is totally pointless ;
- second, by refusing to move too large data blocks. The threshold was
set to 1 kB, because it may make sense to move 1 kB of data to copy
a 15 kB one at once, which will leave a single 16 kB block, but
it doesn't make sense to move 15 kB to copy just 1 kB. In all cases
the data would fit and would just be split into two blocks, which is
not very expensive, hence the low limit of 1 kB.
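The two rules above can be summarized by a small decision helper. This is only
a sketch under assumptions: should_realign() and its parameters are
hypothetical, and the first test (nothing to do when the block already fits in
the contiguous tail room) is a simplification of the "free space wrapped"
condition.

```c
#include <assert.h>
#include <stddef.h>

#define REALIGN_MAX_MOVE 1024  /* the 1 kB threshold quoted above */

/* <size>:   total buffer size
 * <data>:   bytes already stored in the buffer
 * <contig>: contiguous free room after the data
 * <len>:    bytes we want to append in one piece
 * Returns non-zero if moving the existing data is worth it. */
static int should_realign(size_t size, size_t data, size_t contig, size_t len)
{
	size_t room;

	assert(data <= size);
	room = size - data;

	if (len <= contig)
		return 0;  /* already fits in one piece, nothing to move */
	if (len > room)
		return 0;  /* would not fit even once realigned: pointless */
	if (data > REALIGN_MAX_MOVE)
		return 0;  /* too much to move, split the copy instead */
	return 1;
}
```

With the typical numbers listed earlier (size=16384, data=16337, 16 bytes of
tail room, 43 bytes to copy) the helper refuses to realign, since it would move
about 16 kB to place 43 bytes.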
With such changes, realignments are very rare, they show up around once
every 15 seconds at 60 Gbps, and look like this, resulting in a much more
stable bit rate :
buf=0x7fe6ec0c3510,h=16333,d=35,s=16384 room=16349 in=16337
This patch should be safe for backporting to 1.9 if some performance
issues are reported there.
It's amazing that the value was still incremented under the date lock,
let's first use an atomic increment for the counter and move it out of
the date lock to reduce contention. These are just counters, we don't
need to take locks if we're not rotating, atomic ops are enough. This
patch does this, and leaves the lock for when the period is over. It's
important to note that some values might be added just before or just
after a rotation but this is not a problem since we don't care if a
value is counted in the previous or next period when it's exactly on
the edge. Great care was taken to ensure that the current counter is
always atomically updated.
Other minor cleanups were performed, such as avoiding reloading the
value from memory after a CAS, or using &~1 instead of two shifts to
remove the lowest bit.
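For illustration, here is what a lock-free increment and the &~1 cleanup look
like with C11 atomics. struct ctr and the helper names are hypothetical, and
HAProxy uses its own atomic macros rather than <stdatomic.h>, so this is a
sketch of the idea rather than the patched code.

```c
#include <stdatomic.h>
#include <stdint.h>

struct ctr {
	atomic_uint curr;   /* events counted in the current period */
};

/* Adds <inc> events without taking any lock and returns the new total.
 * A lock would only be needed to rotate the period, not to count. */
static inline unsigned int ctr_inc(struct ctr *c, unsigned int inc)
{
	return atomic_fetch_add(&c->curr, inc) + inc;
}

/* Two equivalent ways of clearing the lowest bit of a value. */
static inline uint32_t clear_low_bit_shift(uint32_t v)
{
	return (v >> 1) << 1;
}

static inline uint32_t clear_low_bit_mask(uint32_t v)
{
	return v & ~1u;
}
```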
According to the manpage:
sk_TYPE_zero() sets the number of elements in sk to zero. It
does not free sk so after this call sk is still valid.
So we need to free all elements.
[wt: seems like it has been there forever and should be backported
to all stable branches]
s/accidently/accidentally/
s/any ot these messages/any of theses messages/
s/catched/caught/
s/completly/completely/
s/convertor/converter/
s/desribing/describing/
s/developper/developer/
s/eventhough/even though/
s/exectution/execution/
s/functionnality/functionality/
s/If it receive a/If it receives a/
s/In can even/It can even/
s/informations/information/
s/it will be remove /it will be removed /
s/langage/language/
s/mentionned/mentioned/
s/negociated/negotiated/
s/Optionnaly/Optionally/
s/ouputs/outputs/
s/outweights/outweighs/
s/ressources/resources/
Fortunately, this loop does nothing. Otherwise it would have led to an infinite
loop. It was probably forgotten during a refactoring, in the early stage of the
HTX.
This patch must be backported to 1.9.
When a 1xx response is processed, we forward it immediately. But another message
may already be in the channel's buffer, waiting to be processed. This may be
another 1xx response or the final one. So instead of forwarding everything, we
must take care to only forward the processed 1xx response.
This patch must be backported to 1.9.
When a parsing error occurs in the H1 multiplexer, we stop copying HTX
blocks. So the error may be reported with an empty HTX message, for instance if
the headers parsing failed. When it happens, the flag CS_FL_EOS is also set on
the conn_stream. But this is incorrect. Most of the time, it is set on
established connections, so it is not really an issue. But if it happens when
the server connection is not fully established, the connection is shut down
immediately and the stream-interface is switched from SI_ST_CON to
SI_ST_DIS/CLO. So HTX analyzers have no chance to catch the error.
Instead of setting CS_FL_EOS, it is much better to set CS_FL_EOI, which is the
right flag to use. The same is also done on H2 upgrade. As a side effect of this
fix, in the stream-interface code, we must now set the flag CF_READ_PARTIAL on
the channel when the flag CF_EOI is set. It guarantees that the stream is woken
up when EOI is reported to the channel while no data are received.
This patch must be backported to 1.9.
In HTX, when a HEADERS frame is formatted before sending it to the client or the
server, if an EOM is found because there is no body, we must count it in the
number of bytes sent.
This patch must be backported to 1.9.
When a LUA HTTP object is created using the current TXN object, it is important
to also set the right direction and flags, using ones from the TXN object.
This patch may be backported to all supported branches with Lua
support. But it seems to have no impact for now.
In spoe_release_appctx(), the SPOE applet may be used after it was released to
get its exit status code. Of course, HAProxy crashes when this happens.
This patch must be backported to 1.9 and 1.8.
As far as possible, we try to keep the connections alive on redirect. It's
possible when the request has no body or when the request parsing is finished.
No backport is needed.
The stats page now reports the per-process output bit rate and applies
the usual conversions needed to turn the TCP payload rate to an Ethernet
bit rate in order to give a reasonably accurate estimate of how far from
interface saturation we are.
Many times we've been missing per-process traffic statistics. While it
didn't make sense in multi-process mode, with threads it does. Thus we
now have a counter of bytes emitted by raw_sock, and a freq counter for
these as well. However, freq_ctr are limited to 32 bits, and given that
loads of 300 Gbps have already been reached over a loopback using
splicing, we need to downscale this a bit. Here we're storing 1/32 of
the byte rate, which gives a theoretical limit of 128 GB/s or ~1 Tbps,
which is more than enough. Let's have fun re-reading this sentence in
2029 :-) The values can be read in "show info" output on the CLI.
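The limit quoted above follows directly from the scaling factor. The helpers
below are hypothetical and only redo the arithmetic: a 32-bit value holding
1/32 of the byte rate tops out just below 2^37 bytes per second, i.e. 128 GB/s
with binary prefixes, or roughly 1.1 Tbps.

```c
#include <stdint.h>

#define RATE_SHIFT 5   /* the counter stores bytes / 32 */

/* Converts a byte count into the downscaled value fed to the counter. */
static uint32_t scale_down(uint64_t bytes)
{
	return (uint32_t)(bytes >> RATE_SHIFT);
}

/* Converts a stored value back into bytes. */
static uint64_t scale_up(uint32_t stored)
{
	return (uint64_t)stored << RATE_SHIFT;
}

/* Highest byte rate that can be represented: (2^32 - 1) * 32. */
static uint64_t max_byte_rate(void)
{
	return scale_up(UINT32_MAX);
}
```

The price of the downscaling is granularity: anything below 32 bytes is lost
when converting a single sample.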
It's needed on Linux to have access to timerfd_*, and on FreeBSD this
lib is needed as well, though not enabled in our default build. We can
see later if it's OK to enable it, for now let's fix the build issues.
SI_TKILL is for Linux. We're again in the non-portable area. Both OSes
use macros to define these values so we can #ifdef them. Let's make
SI_TKILL defined based on SI_LWP when only the latter is defined.
Bah, the Linux manpage suggests to use si_int but it's a fake, it's only
a define on sigval.sival_int where sigval is defined as si_value. Let's
use si_value.sival_int, at least it builds on both Linux and FreeBSD. It's
likely that this code will have to be limited to a small subset of OSes
if it causes difficulties like this.
Released version 2.0-dev4 with the following main changes :
- BUILD: enable freebsd builds on cirrus-ci
- BUG/MINOR: http_fetch: Rely on the smp direction for "cookie()" and "hdr()"
- MEDIUM: Make 'option forceclose' actually warn
- MEDIUM: Make 'resolution_pool_size' directive fatal
- DOC: management: place "show activity" at the right place
- MINOR: cli/activity: show the dumping thread ID starting at 1
- MINOR: task: export global_task_mask
- MINOR: cli/debug: add a thread dump function
- BUG/MEDIUM: streams: Don't use CF_EOI to decide if the request is complete.
- BUG/MEDIUM: streams: Try to L7 retry before aborting the connection.
- BUG/MINOR: debug: make ha_task_dump() always check the task before dumping it
- BUG/MINOR: debug: make ha_task_dump() actually dump the requested task
- MINOR: debug: make ha_thread_dump() and ha_task_dump() take a buffer
- BUG/MINOR: debug: don't check the call date on tasklets
- MINOR: thread: implement ha_thread_relax()
- MINOR: task: put barriers after each write to curr_task
- MINOR: task: always reset curr_task when freeing a task or tasklet
- MINOR: stream: detach the stream from its own task on stream_free()
- MEDIUM: debug/threads: implement an advanced thread dump system
- REGTEST: extend the check duration on tls_health_checks and mark it slow
- DOC: fix "successful" typo
- MINOR: init: setenv HAPROXY_CFGFILES
- MINOR: threads/init: synchronize the threads startup
- MEDIUM: init/mworker: make the pipe register function a regular initcall
- CLEANUP: memory: make the fault injection code use the OTHER_LOCK label
- CLEANUP: threads: remove the now unused START_LOCK label
- MINOR: init/threads: make the global threads an array of structs
- MINOR: threads: add each thread's clockid into the global thread_info
- CLEANUP: stream: remove an obsolete debugging test
- MINOR: tools: add dump_hex()
- MINOR: debug: implement ha_panic()
- MINOR: debug/cli: add some debugging commands for developers
- MINOR: tools: provide a may_access() function and make dump_hex() use it
- MINOR: debug: make ha_panic() report threads starting at 1
- REORG: compat: move some integer limit definitions from standard.h to compat.h
- REORG: threads: move the struct thread_info from global.h to hathreads.h
- MINOR: compat: make sure to always define clockid_t
- MINOR: threads: always place the clockid in the struct thread_info
- MINOR: threads: add a thread-local thread_info pointer "ti"
- MINOR: time: move the cpu, mono, and idle time to thread_info
- MINOR: time: add a function to retrieve another thread's cputime
- MINOR: debug: report each thread's cpu usage in "show thread"
- BUILD: threads: only assign the clock_id when supported
- BUILD: makefile: use USE_OBSOLETE_LINKER for solaris
- BUILD: makefile: remove -fomit-frame-pointer optimisation (solaris)
- MAJOR: polling: add event ports support (Solaris)
- BUG/MEDIUM: streams: Don't switch from SI_ST_CON to SI_ST_DIS on read0.
- CLEANUP: time: refine the test on _POSIX_TIMERS
- MINOR: compat: define a new empty type empty_t for non-implemented fields
- CLEANUP: time: switch clockid_t to empty_t when not available
- BUG/MINOR: mworker: Fix memory leak of mworker_proc members
- CLEANUP: objtype: make obj_type() and obj_type_name() take consts
- MINOR: debug: switch to SIGURG for thread dumps
- CLEANUP: threads: really move thread_info to hathreads.c
- MINOR: threads: make threads_{harmless|want_rdv}_mask constant 0 without threads
- CLEANUP: debug: always report harmless/want_rdv even without threads
- MINOR: threads: implement ha_tkill() and ha_tkillall()
- CLEANUP: debug: make use of ha_tkill() and remove ifdefs
- MINOR: stream: introduce a stream_dump() function and use it in stream_dump_and_crash()
- MINOR: debug: dump streams when an applet, iocb or stream is known
- MINOR: threads: add a "stuck" flag to the thread_info struct
- MINOR: threads: add a timer_t per thread in thread_info
- MAJOR: watchdog: implement a thread lockup detection mechanism
- MINOR: stream: remove the cpu time detection from process_stream()
- MINOR: connection: report the mux names in "haproxy -vv"
- CLEANUP: mux-h1: use "H1" and not "h1" as the mux's name
- BUG/MEDIUM: WURFL: segfault in wurfl-get() with missing info.
- MINOR: WURFL: call header_retireve_callback() in dummy library
- MINOR: WURFL: fixed Engine load failed error when wurfl-information-list contains wurfl_root_id
- MINOR: WURFL: shows log messages during module initialization
- MINOR: WURFL: removes heading wurfl-information-separator from wurfl-get-all() and wurfl-get() results
- MINOR: WURFL: wurfl_get() and wurfl_get_all() now return an empty string if device detection fails
- MEDIUM: WURFL: HTX awareness.
- MINOR: WURFL: module version bump to 2.0
- MINOR: WURFL: do not emit warnings when not configured
- CONTRIB: wurfl: address 3 build issues in the wurfl dummy library
- BUG/MEDIUM: init/threads: provide per-thread alloc/free function callbacks
- BUILD: travis: add sanitizers to travis-ci builds
- BUILD: time: remove the test on _POSIX_C_SOURCE
- CLEANUP: build: rename some build macros to use the USE_* ones
- CLEANUP: raw_sock: remove support for very old linux splice bug workaround
- BUG/MEDIUM: dns: make the port numbers unsigned
- MEDIUM: config: deprecate the antique req* and rsp* commands