This trick is deprecated since the health-check refactoring, It is now
invalid. It means the following line will trigger an error during the
configuration parsing:
option httpchk OPTIONS * HTTP/1.1\r\nHost:\ www
It must be replaced by:
option httpchk OPTIONS * HTTP/1.1
http-check send hdr Host www
ssl_tlsext_ticket_key_cb() is called when "tls-ticket-keys" option is used on a
"bind" line. It needs to have an access to the TLS ticket keys which have been
stored into the listener bind_conf struct. The fix consists in nitializing the
<ref> variable (references to TLS secret keys) the correct way when this callback
is called for a QUIC connection. The bind_conf struct is store into the quic_conn
object (QUIC connection).
This issue may be in relation with GH #1851. Thank you for @tasavis for the report.
Must be backported to 2.6.
Obviously, frames which are duplicated from others must not be retransmitted if
the original frame they were duplicated from was already acknowledged.
This should have been detected by qc_build_frms() which skips such frames,
except if the QUIC xprt does really bad things which are not supported by
the upper layer. This will have to be checked with Amaury.
To prevent the retransmision of these frames which leads to crashes as reported by
hpn0t0ad this gdb backtrace in GH #1809 where the frame builder tries to copy a huge
number of bytes to the packet buffer:
Thread 7 (Thread 0x7fddf373a700 (LWP 13)):
#0 __memmove_sse2_unaligned_erms () at ../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:520
No locals.
#1 0x000055b17435705e in quic_build_stream_frame (buf=0x7fddf372ef78, end=<optimized out>, frm=0x7fdde08d3470, conn=<optimized out>) at src/quic_frame.c:515
to_copy = 18446697703428890384
stream = 0x7fdde08d3490
wrap = <optimized out>
which matches this part of quic_frame.c code:
wrap = (const unsigned char *)b_wrap(stream->buf);
if (stream->data + stream->len > wrap) {
size_t to_copy = wrap - stream->data;
memcpy(*buf, stream->data, to_copy);
*buf += to_copy;
we release as soon as possible the impacted frames as there is really no need
to retransmit such frames.
Thank you to @hpn0t0ad for having provided us with useful traces in github
issue #1809.
Must be backported in 2.6.
Commit 5c83e3a156 made some adjustments
to clarify which TCP_INFO information is supported by each respective
OS.
There was a comment like so..
Note that fc_rtt and fc_rttvar are supported on any OS that has TCP_INFO,
not just linux/freebsd/netbsd, so we continue to expose them unconditionally.
But the diff didn't do so in a consistent manner.
Released version 2.7-dev5 with the following main changes :
- BUG/MINOR: mux-quic: Fix memleak on QUIC stream buffer for unacknowledged data
- BUG/MEDIUM: cpu-map: fix thread 1's affinity affecting all threads
- MINOR: cpu-map: remove obsolete diag warning about combined ranges
- BUG/MAJOR: mworker: fix infinite loop on master with no proxies.
- REGTESTS: launch http_reuse_always in mworker mode
- BUG/MINOR: quix: Memleak for non in flight TX packets
- BUG/MINOR: quic: Wrong list_for_each_entry() use when building packets from qc_do_build_pkt()
- BUG/MINOR: quic: Safer QUIC frame builders
- MINOR: quic: Replace MT_LISTs by LISTs for RX packets.
- BUG/MEDIUM: applet: fix incorrect check for abnormal return condition from handler
- BUG/MINOR: applet: make the call_rate only count the no-progress calls
- MEDIUM: peers: limit the number of updates sent at once
- BUILD: tcp_sample: fix build of get_tcp_info() on OpenBSD
- BUG/MINOR: resolvers: return the correct value in resolvers_finalize_config()
- BUG/MINOR: mworker: does not create the "default" resolvers in wait mode
- BUG/MINOR: tcpcheck: Disable QUICKACK only if data should be sent after connect
- REGTESTS: Fix prometheus script to perform HTTP health-checks
- MINOR: resolvers: shut the warning when "default" resolvers is implicit
- Revert "BUG/MINOR: quix: Memleak for non in flight TX packets"
- BUG/MINOR: quic: Leak in qc_release_lost_pkts() for non in flight TX packets
- BUG/MINOR: quic: Stalled connections (missing I/O handler wakeup)
- CLEANUP: quic: No more use ->rx_list MT_LIST entry point (quic_rx_packet)
- CLEANUP: quic: Remove a useless check in qc_lstnr_pkt_rcv()
- MINOR: quic: Remove useless traces about references to TX packets
- Revert "MINOR: quic: Remove useless traces about references to TX packets"
- DOC: configuration: do-resolve doesn't work with a port in the string
- MINOR: sample: add the host_only and port_only converters
- BUG/MINOR: httpclient: fix resolution with port
- DOC: configuration.txt: do-resolve must use host_only to remove its port.
- BUG/MINOR: quic: Null packet dereferencing from qc_dup_pkt_frms() trace
- BUG/MINOR: quic: Frames added to packets even if not built.
- BUG/MEDIUM: spoe: Properly update streams waiting for a ACK in async mode
- BUG/MEDIUM: peers: Add connect and server timeut to peers proxy
- BUG/MEDIUM: peers: Don't use resync timer when local resync is in progress
- BUG/MEDIUM: peers: Don't start resync on reload if local peer is not up-to-date
- BUG/MINOR: hlua: Rely on CF_EOI to detect end of message in HTTP applets
- BUG/MEDIUM: mux-h1: do not refrain from signaling errors after end of input
- BUG/MINOR: epoll: do not actively poll for Rx after an error
- MINOR: raw-sock: don't try to send if an error was already reported
- BUG/MINOR: quic: Missing header protection AES cipher context initialisations (draft-v2)
- MINOR: quic: Add a trace to distinguish the datagram from the packets inside
- BUG/MINOR: ssl: fix deinit of the ca-file tree
- BUG/MINOR: ssl: leak of ckch_inst_link in ckch_inst_free()
- BUG/MINOR: tcpcheck: Disable QUICKACK for default tcp-check (with no rule)
- BUG/MEDIUM: ssl: Fix a UAF when old ckch instances are released
- BUG/MINOR: ssl: revert two wrong fixes with ckhi_link
- BUG/MINOR: dev/udp: properly preset the rx address size
- BUILD: debug: make sure debug macros are never empty
- MINOR: quic: Move traces about RX/TX bytes from QUIC_EV_CONN_PRSAFRM event
- BUG/MINOR: quic: TX frames memleak
- BUG/MINOR: ssl: leak of ckch_inst_link in ckch_inst_free() v2
- MINOR: sink/ring: rotate non-empty file-backed contents only
- BUG/MINOR: regex: Properly handle PCRE2 lib compiled without JIT support
- REGTESTS: http_request_buffer: Add a barrier to not mix up log messages
- BUG/MEDIUM: mux-h1: always use RST to kill idle connections in pools
- MINOR: backend: always satisfy the first req reuse rule with l7 retries
- BUG/MINOR: quic: Do not ack when probing
- MINOR: quic: Add TX frames addresses to traces to several trace events
- MINOR: quic: Trace typo fix in qc_release_frm()
- BUG/MINOR: quic: Frames leak during retransmissions
- BUG/MINOR: h2: properly set the direction flag on HTX response
- BUG/MEDIUM: httpclient: always detach the caller before self-killing
- BUG/MINOR: httpclient: only ask for more room on failed writes
- BUG/MINOR: httpclient: keep-alive was accidentely disabled
- MEDIUM: httpclient: enable ALPN support on outgoing https connections
- BUG/MINOR: mux-h2: fix the "show fd" dest buffer for the subscriber
- BUG/MINOR: mux-h1: fix the "show fd" dest buffer for the subscriber
- BUG/MINOR: mux-fcgi: fix the "show fd" dest buffer for the subscriber
- DEBUG: stream: minor rearrangement of a few fields in struct stream.
- MINOR: debug: report applet pointer and handler in crashes when known
- MINOR: mux-h2: extract the stream dump function out of h2_show_fd()
- MINOR: mux-h2: extract the connection dump function out of h2_show_fd()
- MINOR: muxes: add a "show_sd" helper to complete "show sess" dumps
- MINOR: mux-h2: provide a "show_sd" helper to output stream debugging info
- MINOR: mux-h2: insert line breaks in "show sess all" output for legibility
- MINOR: mux-quic: provide a "show_sd" helper to output stream debugging info
- MINOR: mux-h1: split "show_fd" into connection and stream
- MINOR: mux-h1: provide a "show_sd" helper to output stream debugging info
- BUG/MINOR: http-act: initialize http fmt head earlier
In github issue #1850, Christian Ruppert reported a case of crash in
2.6 when failing to parse some http rules. This started to happen
with 2.6 commit dd7e6c6 ("BUG/MINOR: http-rules: completely free
incorrect TCP rules on error") but has some of its roots in 2.2
commit 2eb539687 ("MINOR: http-rules: Add release functions for
existing HTTP actions").
The cause is that when the release function is set for HTTP actions,
the rule->arg.http.fmt list head is not yet initialized, hence is
NULL, thus the release function crashes when it tries to iterate over
it. In fact this code was initially not written with the perspective
of releasing such elements upon error, so the arg list initialization
happened after error checking.
This patch just moves the list initialization just after setting the
release pointer and that's OK.
This patch must be backported to 2.6 since the problem is visible
there. It could be backported to 2.5 but the issue is not triggered
there without the first mentioned patch above that landed in 2.6, so
it will not bring any obvious benefit.
We now have two functions, one for dumping connections and the other
one for dumping the streams. This will permit to use it from show_sd.
A few optional line breaks were inserted where relevant to keep lines
homogenous when a prefix is passed.
It's very limited but at least provides the very basic info about QCS and
QCC when issuing "show sess all":
scf=0x7fa9642394a0 flags=0x00000080 state=EST endp=CONN,0x7fa9642351f0,0x02001001 sub=3
> qcs=0x7fa9642351f0 .flg=0x5 .id=396 .st=HCR .ctx=0x7fa9642353f0, .err=0
> qcc=0x7fa96405ce20 .flg=0 .nbsc=100 .nbhreq=100, .task=0x7fa964054260
co0=0x7fa96405cd50 ctrl=quic4 xprt=QUIC mux=QUIC data=STRM target=LISTENER:0x328c530
flags=0x00200300 fd=-1 fd.state=00 updt=0 fd.tmask=0x0
It will need to be improved but it's better than nothing already. This
should be backported to 2.6 if the other dumps are backported.
With this, it now becomes possible to see the state of each H2 stream from
"show sess all". Lines are still too long and need to be split, but that's
for another patch.
This helper will be called for muxes that provide it and will be used
to let the mux provide extra information about the stream attached to
a stream descriptor. A line prefix is passed in argument so that the
mux is free to break long lines without breaking indent. No prefix
means no line breaks should be produced (e.g. for short dumps).
The function will be reusable to dump streams, so let's extract it.
Note that due to "last_h2s" being originally printed as a prefix for
the stream dump, now the pointer is displayed by the caller instead.
When an appctx is found looping over itself, we report a number of info
but not the pointers to the definition nor the handler, which can be quite
handy in some cases. Let's add them and try to decode the symbol.
Some recent traces started to show confusing stream pointers ending with
0xe. The reason was that the stream's obj_type was almost unused in the
past and was stuffed in a hole in the structure. But now it's present in
all "show sess all" outputs and having to mentally match this value against
another one that's 0x17e lower is painful. The solution here is to move the
obj_type at the top, like in almost every other structure, but without
breaking the efficient layout.
This patch moves a few fields around and manages to both plug some holes
(16 bytes saved, 976 to 960) and avoid channels needlessly crossing cache
boundaries (res was spread over 3 lines vs 2 now).
Nothing else was changed. It would be desirable to backport this to 2.6
since it's where dumps are currently being processed the most.
Commit 1776ffb97 ("MINOR: mux-fcgi: make the "show fd" helper also decode
the fstrm subscriber when known") improved the output of "show fd" for the
FCGI mux, but the output is sent to the trash buffer instead of the msg
argument. It turns out that this has no effect right now as the caller
passes the trash but this is risky.
This should be backported to 2.4.
Commit 150c4f8b7 ("MINOR: mux-h1: make the "show fd" helper also decode
the h1s subscriber when known") improved the output of "show fd" for the
H1 mux, but the output is sent to the trash buffer instead of the msg
argument. It turns out that this has no effect right now as the caller
passes the trash but this is risky.
This should be backported to 2.4.
Commit 98e40b981 ("MINOR: mux-h2: make the "show fd" helper also decode
the h2s subscriber when known") improved the output of "show fd" for the
H2 mux, but the output is sent to the trash buffer instead of the msg
argument. It turns out that this has no effect right now as the caller
passes the trash but this is risky.
This should be backported to 2.4.
Since everything is available for this, let's enable ALPN with the
usual "h2,http/1.1" on the https server. This will allow HTTPS requests
to use HTTP/2 when available.
It may be needed to permit to disable this (or to set the string) in
case some client code explicitly checks for the "HTTP/1.1" string, but
since httpclient is quite young it's unlikely that such code already
exists.
The servers were not set with default settings, meaning that a few
settings including the pool_max_delay were not set, thus disabling
connection pools, which is the cause of the fact that keep-alive was
disabled as reported in issue #1831. There might possibly be other
issues pending since all these fields were left to zero.
Note that this patch alone will not fix keep-alive because the applet
does not enforce SE_FL_NOT_FIRST and relies on the default http-reuse
safe, thus if servers are not shared, all requests are considered
first ones and do not reuse existing connections.
In 2.7, commit ecb40b2c3 ("MINOR: backend: always satisfy the first
req reuse rule with l7 retries") addressed this in a more elegant way
by fixing http-reuse to take into account the fact that properly
configured l7 retries provide exactly the capability that reuse safe
was trying to cover, and this patch is suitable for backporting.
This patch should be backported to 2.6 only.
There's a tiny issue in the I/O handler by which both a failed request
emission and missing response data will want to subscribe for more room
on output. That's not correct in that only the case where the request
buffer is full should cause this, the other one should just wait for
incoming data. This could theoretically cause spurious wakeups at
certain key points (e.g. connect() time maybe) though this could not
be reproduced but better fix this while it's easy enough.
It doesn't seem necessary to backport it right now, though this may
have to in case a concrete reproducible case is discovered.
If the caller dies before the server responds, the httpclient can crash
in hc_cli_res_end_cb() when unregistering because it dereferences
hc->caller which was already freed during the caller's unregistration.
The easiest way to reproduce it is by sending twice the following
request on the same CLI connection in expert mode, with httpterm
running on local port 8000:
httpclient GET http://127.0.0.1:8000/?t=600
Note the 600ms delay that's larger than socat's default 500.
The code checks for a NULL everywhere hc->caller is used, but the NULL
was forgotten in this specific case. It must be placed in the second
half of httpclient_stop_and_destroy() which is responsible for signaling
the client that the caller leaves.
This must be backported to 2.6.
In 1.9-dev, a new flag was introduced on the start line with commit
f1ba18d7b ("MEDIUM: htx: Don't rely on h1_sl anymore except during H1
header parsing") to designate a response message: HTX_SL_F_IS_RESP.
Unfortunately as it was done in parallel to the mux_h2 support for
the backend, it was never integrated there. It was not used by then
so this remained unnoticed for a while.
However the http_client now uses it, and missing that flag prevents
it from using the H2 mux, so let's properly add it.
There's no point in backporting this far away, but since the http_client
is fully operational in 2.6 it would make sense to backport this fix at
least there to secure the code.
The frame which are retransmitted by qc_dgrams_retransmit() are duplicated
from sent but not acknowledged packets and added to local frames lists.
Some may not have been sent. If not replaced somewhere (linked to the
connection) they are lost for ever (leak). We splice the list remaining
contents to the packets number space frame list to avoid such a situation.
Must be backported to 2.6.
<force_ack> boolean variable passed to qc_do_build_pkt() which builds a clear
packet is there to force this function to build an ACK frame regardless of
others conditions. This is used during handshake, when we acknowledge every
handshake packets received.
This variable was already taken into an account by the local variable <must_ack>
which is there at least to ignore any other conditions than this one: "are
we building a probing packet?". Indeed we do not want to add ACK frames when
we probe the peers. This is to have more chances to embed the new duplicated frames
into another packets without splitting them. So, the test on <force_ack> boolean
value is useless, silly and brakes the rule which consists in not acknowledging
when probing.
Must be backported to 2.6.
The "first req" rule consists in not delivering a connection's first
request to a connection that's not known for being safe so that we
don't deliver a broken page to a client if the server didn't intend to
keep it alive. That's what's used by "http-reuse safe" particularly.
But the reason this rule was created was precisely because haproxy was
not able to re-emit the request to the server in case of connection
breakage, which is precisely what l7 retries later brought. As such,
there's no reason for enforcing this rule when l7 retries are properly
enabled because such a blank page will trigger a retry and will not be
delivered to the client.
This patch simply checks that the l7 retries are enabled for the 3 cases
that can be triggered on a dead or dying connection (failure, empty, and
timeout), and if all 3 are enabled, then regular idle connections can be
reused.
This could almost be marked as a bug fix because a lot of users relying
on l7 retries do not necessarily think about using http-reuse always due
to the recommendation against it in the doc, while the protection that
the safe mode offers is never used in that mode, and it forces the http
client not to reuse existing persistent connections since it never sets
the "not first" flag.
It could also be decided that the protection is not used either when
the origin is an applet, as in this case this is internal code that
we can decide to let handle the retry by itself (all info are still
present). But at least the httpclient will be happy with this alone.
It would make sense to backport this at least to 2.6 in order to let
the httpclient reuse connections, maybe to older releases if some
users report low reuse counts.
When idle H1 connections cannot be stored into a server pool or are later
evicted, they're often seen closed with a FIN then an RST. The problem is
that this is sufficient to leave them in TIME_WAIT in the local sockets
table and port exhaustion may happen.
The reason is that in h1_release() we rely on h1_shutw_conn() which itself
decides whether to close in silent or normal mode only based on the
H1C_F_ST_SILENT_SHUT flag. This flag is only set by h1_shutw() based on
the requested mode. But when the connection is in the idle list, the mode
ought to always be silent.
What this patch does is to set the flag before trying to add to the idle
list, and remove it after removing from the idle list. This way if the
connection fails to be added or has to be killed, it's closed with an
RST.
This must be backported as far as 2.4. It's not sure whether older
versions need an equivalent.
Depending on the timing, time to time, the log messages can be mixed. A
client can start and be fully handled by HAProxy (including its log message)
before the log message of the previous client was emitted or received. To
fix the issue, a barrier was added to be sure to eval the "expect" rule on
logs before starting the next client.
This patch should fix the issue #1847. It may be backported to all branches
containing this reg-tests.
The PCRE2 JIT support is buggy. If HAProxy is compiled with USE_PCRE2_JIT
option while the PCRE2 library is compiled without the JIT support, any
matching will fail because pcre2_jit_compile() return value is not properly
handled. We must fall back on pcre2_match() if PCRE2_ERROR_JIT_BADOPTION
error is returned.
This patch should fix the issue #1848. It must be backported as far as 2.4.
If the service is rechecked before a reload, that may cause the config
to be parsed twice and file-backed rings to be lost.
Here we make sure that such a ring does contain information before
deciding to rotate it. This way the first process starting after some
writes will cause a rotate but not subsequent ones until new writes
are applied.
An attempt was also made to disable rotations on checks but this was a
bad idea, as the ring is still initialized and this causes the contents
to be lost. The choice of initializing the ring during parsing is
questionable but the config check ought to be as close as possible to a
real start, and we could imagine that the ring is used by some code
during startup (e.g. lua). So this approach was abandonned and config
checks also cause a rotation, as the purpose of this rotation is to
preserve latest information against accidental removal.
ckch_inst_free() unlink the ckch_inst_link structure but never free it.
It can't be fixed simply because cli_io_handler_commit_cafile_crlfile()
is using a cafile_entry list to iterate a list of ckch_inst entries
to free. So both cli_io_handler_commit_cafile_crlfile() and
ckch_inst_free() would modify the list at the same time.
In order to let the caller manipulate the ckch_inst_link,
ckch_inst_free() now checks if the element is still attached before
trying to detach and free it.
For this trick to work, the caller need to do a LIST_DEL_INIT() during
the iteration over the ckch_inst_link.
list_for_each_entry was also replace by a while (!LIST_ISEMPTY()) on the
head list in cli_io_handler_commit_cafile_crlfile() so the iteration
works correctly, because it could have been stuck on the first detached
element. list_for_each_entry_safe() is not enough to fix the issue since
multiple element could have been removed.
Must be backported as far as 2.5.
Move these traces to QUIC_EV_CONN_SPPKTS trace event. They were displayed
at a useless location. Make them displayed just after having sent a packet
and when checking the anti-amplication limit.
Useful to diagnose issues in relation with the recovery.
As outlined in commit f7ebe584d7 ("BUILD: debug: Add braces to if
statement calling only CHECK_IF()"), the BUG_ON() family of macros
is incorrectly defined to be empty when debugging is disabled, and
that can lead to trouble. Make sure they always fall back to the
usual "do { } while (0)". This may be backported to 2.6 if needed,
though no such issue was met there to date.
addrlen was not preset to sizeof(addr) on rx, resulting in the address
often not being filled and response packets not always flowing back.
Let's also consistently use "addr" in the bind call (it points to
frt_addr there but it's a bit confusing).
This reverts commit 056ad01d55.
This reverts commit ddd480cbdc.
The architecture is ambiguous here: ckch_inst_free() is detaching and
freeing the "ckch_inst_link" linked list which must be free'd only from
the cafile_entry side.
The problem was also hidden by the fix ddd480c ("BUG/MEDIUM: ssl: Fix a
UAF when old ckch instances are released") which change the ckchi_link
inner loop by a safe one. However this can't fix entirely the problem
since both __ckch_inst_free_locked() could remove several nodes in the
ckchi_link linked list.
This revert is voluntary reintroducing a memory leak before really fixing
the problem.
Must be backported in 2.5 + 2.6.
When old chck instances is released at the end of "commit ssl ca-file" or
"commit ssl crl-file" commands, the link is released. But we walk through
the list using the unsafe macro. list_for_each_entry_safe() must be used.
This bug was introduced by commit 056ad01d5 ("BUG/MINOR: ssl: leak of
ckch_inst_link in ckch_inst_free()"). Thus this patch must be backported as
far as 2.5.
The commit 871dd8211 ("BUG/MINOR: tcpcheck: Disable QUICKACK only if data
should be sent after connect") introduced a regression. It removes the test
on the next rule to be able to disable TCP_QUICKACK when only a connect is
performed (so no next rule).
This patch must be backported as far as 2.2.
ckch_inst_free() unlink the ckch_inst_link structure but never free it.
It can cause a memory leak upon a ckch_inst_free() done with CLI
operation.
Bug introduced by commit 4458b97 ("MEDIUM: ssl: Chain ckch instances in
ca-file entries").
Must be backported as far as 2.5.
Commit b0c4827 ("BUG/MINOR: ssl: free the cafile entries on deinit")
introduced a double free.
The node was never removed from the tree before its free.
Fix issue #1836.
Must be backported where b0c4827 was backported. (2.6 for now).
Without such a trace, we do not know when a datagram is sent. Only trace for
the packets inside the datagrams were displayed.
Must be backported to 2.6.
This bug arrived with this commit:
"MINOR: quic: Add reusable cipher contexts for header protection"
haproxy could crash because of missing cipher contexts initializations for
the header protection and draft-v2 Initial secrets. This was due to the fact
that these initialization both for RX and TX secrets were done outside of
qc_new_isecs(). The role of this function is definitively to initialize these
cipher contexts in addition to the derived secrets. Indeed this function is called
by qc_new_conn() which initializes the connection but also by qc_conn_finalize()
which also calls qc_new_isecs() in case of a different QUIC version was negotiated
by the peers from the one used by the client for its first Initial packet.
This was reported by "v2" QUIC interop test with at least picoquic as client.
Must be backported to 2.6.
There's no point trying to send() on a socket on which an error was already
reported. This wastes syscalls. Till now it was possible to occasionally
see an attempt to sendto() after epoll_wait() had reported EPOLLERR.
In 2.2, commit 5d7dcc2a8 ("OPTIM: epoll: always poll for recv if neither
active nor ready") was added to compensate for the fact that our iocbs
are almost always asynchronous now and do not have the opportunity to
update the FD correctly. As such, they just perform a wakeup, the FD is
turned to inactive, the tasklet wakes up, performs the I/O, updates the
FD, most of the time this is done withing the same polling loop, and the
update cancels itself in the poller without having to switch the FD off
then on.
The issue was that when deciding to claim an FD was active for reads
if it was active for writes, we forgot one situation that unfortunately
causes excessive wakeups: dealing with errors. Indeed, errors are
reported and keep ringing as long as the FD is active for sending even
if the consumer disabled the FD for receiving. Usually this only causes
one extra wakeup for the time it takes to consider a potential write
subscriber and to call it, though with many tasks in a run queue, it
can last a bit longer and be reported more often.
The fix consists in checking that we really want to get more receive
events on this FD, that is:
- that no prevous EPOLLERR was reported
- that the FD doesn't carry a sticky error
- that the FD is not shut for reads
With this, after the last epoll_wait() reports EPOLLERR, one last recv()
is performed to flush pending data and the FD is immediately unregistered.
It's probably not needed to backport this as its effects are not much
visible, though it should not harm.
Before, EPOLLERR was seen twice:
accept4(4, {sa_family=AF_INET, sin_port=htons(22314), sin_addr=inet_addr("127.0.0.1")}, [128 => 16], SOCK_NONBLOCK) = 8
accept4(4, 0x261b160, [128], SOCK_NONBLOCK) = -1 EAGAIN (Resource temporarily unavailable)
recvfrom(8, "POST / HTTP/1.1\r\nConnection: close\r\nTransfer-encoding: chunk"..., 16320, 0, NULL, NULL) = 66
socket(AF_INET, SOCK_STREAM, IPPROTO_IP) = 9
connect(9, {sa_family=AF_INET, sin_port=htons(8002), sin_addr=inet_addr("127.0.0.1")}, 16) = -1 EINPROGRESS (Operation now in progress)
epoll_ctl(3, EPOLL_CTL_ADD, 8, {events=EPOLLIN|EPOLLRDHUP, data={u32=8, u64=8}}) = 0
epoll_ctl(3, EPOLL_CTL_ADD, 9, {events=EPOLLIN|EPOLLOUT|EPOLLRDHUP, data={u32=9, u64=9}}) = 0
epoll_wait(3, [{events=EPOLLOUT, data={u32=9, u64=9}}], 200, 355) = 1
recvfrom(9, 0x25cfb30, 16320, 0, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
sendto(9, "POST / HTTP/1.1\r\ntransfer-encoding: chunked\r\n\r\n", 47, MSG_DONTWAIT|MSG_NOSIGNAL, NULL, 0) = 47
epoll_ctl(3, EPOLL_CTL_MOD, 9, {events=EPOLLIN|EPOLLRDHUP, data={u32=9, u64=9}}) = 0
epoll_wait(3, [{events=EPOLLIN|EPOLLERR|EPOLLHUP|EPOLLRDHUP, data={u32=9, u64=9}}], 200, 354) = 1
recvfrom(9, "HTTP/1.1 200 OK\r\ncontent-length: 0\r\nconnection: close\r\n\r\n", 16320, 0, NULL, NULL) = 57
sendto(8, "HTTP/1.1 200 OK\r\ncontent-length: 0\r\nconnection: close\r\n\r\n", 57, MSG_DONTWAIT|MSG_NOSIGNAL, NULL, 0) = 57
->epoll_wait(3, [{events=EPOLLIN|EPOLLERR|EPOLLHUP|EPOLLRDHUP, data={u32=9, u64=9}}], 200, 354) = 1
epoll_ctl(3, EPOLL_CTL_DEL, 9, 0x7ffe0b65fb24) = 0
epoll_wait(3, [{events=EPOLLIN, data={u32=8, u64=8}}], 200, 354) = 1
recvfrom(8, "A\n0123456789\r\n0\r\n\r\n", 16320, 0, NULL, NULL) = 19
close(9) = 0
close(8) = 0
After, EPOLLERR is only seen only once, with one less call to epoll_wait():
accept4(4, {sa_family=AF_INET, sin_port=htons(22362), sin_addr=inet_addr("127.0.0.1")}, [128 => 16], SOCK_NONBLOCK) = 8
accept4(4, 0x20d0160, [128], SOCK_NONBLOCK) = -1 EAGAIN (Resource temporarily unavailable)
recvfrom(8, "POST / HTTP/1.1\r\nConnection: close\r\nTransfer-encoding: chunk"..., 16320, 0, NULL, NULL) = 66
socket(AF_INET, SOCK_STREAM, IPPROTO_IP) = 9
connect(9, {sa_family=AF_INET, sin_port=htons(8002), sin_addr=inet_addr("127.0.0.1")}, 16) = -1 EINPROGRESS (Operation now in progress)
epoll_ctl(3, EPOLL_CTL_ADD, 8, {events=EPOLLIN|EPOLLRDHUP, data={u32=8, u64=8}}) = 0
epoll_ctl(3, EPOLL_CTL_ADD, 9, {events=EPOLLIN|EPOLLOUT|EPOLLRDHUP, data={u32=9, u64=9}}) = 0
epoll_wait(3, [{events=EPOLLOUT, data={u32=9, u64=9}}], 200, 411) = 1
recvfrom(9, 0x2084b30, 16320, 0, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
sendto(9, "POST / HTTP/1.1\r\ntransfer-encoding: chunked\r\n\r\n", 47, MSG_DONTWAIT|MSG_NOSIGNAL, NULL, 0) = 47
epoll_ctl(3, EPOLL_CTL_MOD, 9, {events=EPOLLIN|EPOLLRDHUP, data={u32=9, u64=9}}) = 0
epoll_wait(3, [{events=EPOLLIN|EPOLLERR|EPOLLHUP|EPOLLRDHUP, data={u32=9, u64=9}}], 200, 411) = 1
recvfrom(9, "HTTP/1.1 200 OK\r\ncontent-length: 0\r\nconnection: close\r\n\r\n", 16320, 0, NULL, NULL) = 57
sendto(8, "HTTP/1.1 200 OK\r\ncontent-length: 0\r\nconnection: close\r\n\r\n", 57, MSG_DONTWAIT|MSG_NOSIGNAL, NULL, 0) = 57
epoll_ctl(3, EPOLL_CTL_DEL, 9, 0x7ffc95d46f04) = 0
epoll_wait(3, [{events=EPOLLIN, data={u32=8, u64=8}}], 200, 411) = 1
recvfrom(8, "A\n0123456789\r\n0\r\n\r\n", 16320, 0, NULL, NULL) = 19
close(9) = 0
close(8) = 0
In 2.6-dev4, a fix for truncated response was brought with commit 99bbdbcc2
("BUG/MEDIUM: mux-h1: only turn CO_FL_ERROR to CS_FL_ERROR with empty ibuf"),
trying to address the situation where an error is present at the connection
level but some data are still pending to be read by the stream. However,
this patch did not consider the case where the stream was no longer willing
to read the pending data, resulting in a situation where some aborted
transfers could lead to excessive CPU usage by causing constant stream
wakeups for which no error was reported. This perfectly matches what was
observed and reported in github issue #1842. It's not trivial to reproduce,
but aborting HTTP/1 pipelining in the middle of transfers seems to give
good results (using h2load and Ctrl-C in the middle).
The fix was incorrct as the error should be held only if there were data
that the stream was able to read. This is the approach taken by this patch,
which also checks via SE_FL_EOI | SE_FL_EOS that the stream will be able
to consume the pending data.
Note that the loop was provoked by the attempt by sc_conn_io_cb() itself
to call sc_conn_send() which resulted in a write subscription in
h1_subscribe() which immediately calls a tasklet_wakeup() since the
event is ready, and that it is now stopped by the presence of SE_FL_ERROR
that is checked in sc_conn_io_cb(). It seems that an extra check down the
send() path to refrain from subscribing when the connection is in error
could speed up error detection or at least avoid a risk of loops in this
case, but this is tricky. In addition, there's already SE_FL_ERR_PENDING
that seems more suitable for reporting when there are pending data, but
similarly, it probably isn't checked well enough to be suitable for
backports.
FWIW the issue may (unreliably) be reproduced by chaining haproxy to
httpterm and issuing:
(printf "GET /?s=10g HTTP/1.1\r\n\r\n"; sleep 0.1; printf "\r\n") | \
nc6 --half-close 0 8001 | head -c1000000000 >/dev/null
It's necessary to play with the size of the head command that's supposed
to trigger the error at some point. A variant involving h2load in h1 mode
and multiple pipelined streams, that is stopped with Ctrl-C also tends to
work.
As the fix above was backported as far as 2.0, it would be tempting to
backport this one as far. However tests have shown that the oldest
version that can trigger this issue is 2.5, maybe due to subtle
differences in older ones, so it's probably not worth going further
until an issue is reported. Note that in 2.5 and older, the SE_FL_*
flags are applied on the conn_stream instead, as CS_FL_*.
Special thanks go to Felipe W Damasio for providing lots of detailed data
allowing to quickly spot the root cause of the problem.
applet:getline() and applet:receive() functions for HTTP applets must rely
on the channel flags to detect the end of the message and not on HTX
flags. It means CF_EOI must be used instead of HTX_FL_EOM.
It is important because the HTX flag is transient. Because there is no flag
on HTTP applets to save the info, it is not reliable. However CF_EOI once
set is never removed. So it is safer to rely on it. Otherwise, the call to
these functions hang.
This patch must be backported as far as 2.4.