The xprt layer is reponsible to notify the mux of a CONNECTION_CLOSE
reception. In this case the flag QC_CF_CC_RECV is positionned on the
qcc and the mux tasklet is waken up.
One of the notable effect of the QC_CF_CC_RECV is that each qcs will be
released even if they have remaining data in their send buffers.
A qcs is not freed if there is remaining data in its buffer. In this
case, the flag QC_SF_DETACH is positionned.
The qcc io handler is responsible to remove the qcs if the QC_SF_DETACH
is set and their buffers are empty.
Replace bug.h by api.h in mux_quic header.
This is required because bug.h uses atomic operations when compiled with
DEBUG_MEM_STATS. api.h takes care of including atomic.h before bug.h.
Set the HTX EOM flag on RX the app layer. This is required to notify
about the end of the request for the stream analyzers, else the request
channel never goes to MSG_DONE state.
Remove qc_eval_pkt() which has come with the multithreading support. It
was there to evaluate the length of a TX packet before building. We could
build from several thread TX packets without consuming a packet number for nothing (when
the building failed). But as the TX packet building functions are always
executed by the same thread, the one attached to the connection, this does
not make sense to continue to use such a function. Furthermore it is buggy
since we had to recently pad the TX packet under certain circumstances.
After the handshake has succeeded, we must delete any remaining
Initial or Handshake packets from the RX buffer. This cannot be
done depending on the state the connection (->st quic_conn struct
member value) as the packet are not received/treated in order.
Add a null byte to the end of the RX buffer to notify the consumer there is no
more data to treat.
Modify quic_rx_packet_pool_purge() which is the function which remove the
RX packet from the buffer.
Also rename this function to quic_rx_pkts_del().
As the RX packets may be accessed by the QUIC connection handler (quic_conn_io_cb())
the function responsible of decrementing their reference counters must not
access other information than these reference counters! It was a very bad idea
to try to purge the RX buffer asap when executing this function.
Handle the case when the app layer sending buffer is full. A new flag
QC_SF_BLK_MROOM is set in this case and the transfer is interrupted. It
is expected that then the conn-stream layer will subscribe to SEND.
The MROOM flag is reset each time the muxer transfer data from the app
layer to its own buffer. If the app layer has been subscribed on SEND it
is woken up.
Implement the subscription in the mux on the qcs instance.
Subscribe is now used by the h3 layer when receiving an incomplete frame
on the H3 control stream. It is also used when attaching the remote
uni-directional streams on the h3 layer.
In the qc_send, the mux wakes up the qcs for each new transfer executed.
This is done via the method qcs_notify_send().
The xprt wakes up the qcs when receiving data on unidirectional streams.
This is done via the method qcs_notify_recv().
Set the QC_SF_FIN_STREAM on the app layers (h3 / hq-interop) when
reaching the HTX EOM. This is used to warn the mux layer to set the FIN
on the QUIC stream.
Re-implement the QUIC mux. It will reuse the mechanics from the previous
mux without all untested/unsupported features. This should ease the
maintenance.
Note that a lot of features are broken for the moment. They will be
re-implemented on the following commits to have a clean commit history.
ha_backtrace_to_stderr() must be declared in CRASH_NOW() macro whe HAProxy
is compiled with DEBUG_STRICT_NOCRASH. Otherwise an error is reported during
compilation:
include/haproxy/bug.h:58:26: error: implicit declaration of function ‘ha_backtrace_to_stderr’ [-Werror=implicit-function-declaration]
58 | #define CRASH_NOW() do { ha_backtrace_to_stderr(); } while (0)
This patch must be backported as far as 2.4.
If H1 headers are not fully received at once, the parsing is restarted a
last time when all headers are finally received. When this happens, the h1m
flags are sanitized to remove all value set during parsing.
But some flags where erroneously preserved. Among others, H1_MF_TE_CHUNKED
flag was not removed, what could lead to parsing error.
To fix the bug and make things easy, a mask has been added with all flags
that must be preserved. It will be more stable. This mask is used to
sanitize h1m flags.
This patch should fix the issue #1469. It must be backported to 2.5.
When the response is parsed, query items are stored in a list, attached to
the parsed response (resolve_response).
First, there is one and only one query sent at a time. Thus, there is no
reason to use a list. There is a test to be sure there is only one query
item in the response. Then, the reference on this query item is only used to
validate the domain name is the one requested. So the query list can be
removed. We only expect one query item, no reason to loop on query records.
In addition, the query domain name is now immediately checked against the
resolution domain name. This way, the query item is only manipulated during
the response parsing.
During post-parsing stage, the SSL context of a server is initialized if SSL
is configured on the server or its default-server. It is required to be able
to enable SSL at runtime. However a regression was introduced, because the
last parsed default-server is used. But it is not necessarily the
default-server line used to configure the server. This may lead to
erroneously initialize the SSL context for a server without SSL parameter or
the skip it while it should be done.
The problem is the default-server used to configure a server is not saved
during configuration parsing. So, the information is lost during the
post-parsing. To fix the bug, the SRV_F_DEFSRV_USE_SSL flag is
introduced. It is used to know when a server was initialized with a
default-server using SSL.
For the record, the commit f63704488e ("MEDIUM: cli/ssl: configure ssl on
server at runtime") has introduced the bug.
This patch must be backported as far as 2.4.
Apple libmalloc has its own notion of memory arenas as malloc_zone with
rich API having various callbacks for various allocations strategies but
here we just use the defaults.
In trim_all_pools, we advise to purge each zone as much as possible, called "greedy" mode.
As soon as the connection ID (the one choosen by the QUIC server) has been used
by the client, we can delete its original destination connection ID from its tree.
When running Key Update process, we must maintain much information
especially when the key phase bit has been toggled by the peer as
it is possible that it is due to late packets. This patch adds
quic_tls_kp new structure to do so. They are used to store
previous and next secrets, keys and IVs associated to the previous
and next RX key phase. We also need the next TX key phase information
to be able to encrypt packets for the next key phase.
When sending a CONNECTION_CLOSE frame to immediately close the connection,
do not provide CRYPTO data to the TLS stack. Do not built anything else than a
CONNECTION_CLOSE and do not derive any secret when in immediately close state.
Seize the opportunity of this patch to rename ->err quic_conn struct member
to ->error_code.
We set this TLS error when no application protocol could be negotiated
via the TLS callback concerned. It is converted as a QUIC CRYPTO_ERROR
error (0x178).
Change the way the CIDs are organized to rattach received packets DCID
to QUIC connection. This is necessary to be able to handle multiple DCID
to one connection.
For this, the quic_connection_id structure has been extended. When
allocated, they are inserted in the receiver CID tree instead of the
quic_conn directly. When receiving a packet, the receiver tree is
inspected to retrieve the quic_connection_id. The quic_connection_id
contains now contains a reference to the QUIC connection.
Add ->err member to quic_conn struct to store the connection errors.
This is the responsability of ->send_alert callback of SSL_QUIC_METHOD
struct to handle the TLS alert and consequently update ->err value.
At this time, when entering qc_build_pkt() we build a CONNECTION_CLOSE
frame close the connection when ->err value is not null.
If we want to run quic-tracker against haproxy, we must at least
support the draft version of the TLS extension for the QUIC transport
parameters (0xffa5). quic-tracker QUIC version is draft-29 at this time.
We select this depending on the QUIC version. If draft, we select the
draft TLS extension.
When establishing an outboud connection, haproxy checks if the cached
TLS session has the same SNI as the connection we are trying to
resume.
This test was done by calling SSL_get_servername() which in TLSv1.2
returned the SNI. With TLSv1.3 this is not the case anymore and this
function returns NULL, which invalidates any outboud connection we are
trying to resume if it uses the sni keyword on its server line.
This patch fixes the problem by storing the SNI in the "reused_sess"
structure beside the session itself.
The ssl_sock_set_servername() now has a RWLOCK because this session
cache entry could be accessed by the CLI when trying to update a
certificate on the backend.
This fix must be backported in every maintained version, however the
RWLOCK only exists since version 2.4.
Implement a new app_ops layer for quic interop. This layer uses HTTP/0.9
on top of QUIC. Implementation is minimal, with the intent to be able to
pass interoperability test suite from
https://github.com/marten-seemann/quic-interop-runner.
It is instantiated if the negotiated ALPN is "hq-interop".
Remove the hardcoded initialization of h3 layer on mux init. Now the
ALPN is looked just after the SSL handshake. The app layer is then
installed if the ALPN negotiation returned a supported protocol.
This required to add a get_alpn on the ssl_quic layer which is just a
call to ssl_sock_get_alpn() from ssl_sock. This is mandatory to be able
to use conn_get_alpn().
This change is required to be able to use multiple app_ops layer on top
of QUIC. The stream-interface will now call the mux snd_buf which is
just a proxy to the app_ops snd_buf function.
The architecture may be simplified in the structure to install the
app_ops on the stream_interface and avoid the detour via the mux layer
on the sending path.
In 1.8 when muxes and conn_streams were introduced, the call to
conn_full_close() was replaced with a call to cs_close() which only
relied on shutr/shutw (commits 6978db35e ("MINOR: connection:
add cs_close() to close a conn_stream") and a553ae96f ("MEDIUM:
connection: replace conn_full_close() with cs_close()")). By then
this was fine, and the rare risk of non-idempotent calls was addressed
by the muxes implementing the functions (e.g. mux_pt).
Later with commit 325607397 ("MEDIUM: stream: do not forcefully close
the client connection anymore"), stream_free() started to call cs_close()
instead of forcibly closing the connection via conn_full_close(). At this
point this started to break idempotence because it was possible to emit
a shutw() (e.g. when option httpclose was set), then to have it called
agian upon stream_free() via cs_close(). By then it was not a problem
since only mux_pt would implement this and did check for idempotence.
When HTX was implemented and mux-h1/h2 offered support for shutw/shutr,
the idempotence changed a little bit because the last shutdown mode
(normal/silent) was recorded and used at the moment of closing. The
call to cs_close() uses the silent mode and will replace the current
one. This has an effect on data pending in the buffer if the FIN could
not be sent before cs_close(), because lingering may be disabled and
final data lost in the network stack.
Interestingly, during 2.4-dev3, this was addressed as the side effect
of an improvement by commit 3c82d8b32 ("MINOR: mux-h1: Rework how
shutdowns are handled"), where the H1 mux's shutdown function becomes
explicitly idempotent. However older versions (2.3 to 2.0) do not have
it.
This patch addresses the issue globally by making sure that cs_shutr()
and cs_shutw() are idempotent and cannot have their data truncated by
a late cs_close(). It fixes the truncation that is observed in 2.3 and
2.2 as described in issue #1450.
This must be backported as far as 2.0, along with commit f14d750bf
("BUG/MEDIUM: conn-stream: Don't reset CS flags on close") which it
depends on.
Implement a reload failure counter which counts the number of failure
since the last success. This counter is available in 'show proc' over
the master CLI.
cs_close() and cs_drain_and_close() are called to close a conn-stream.
cs_shutr() and cs_shutw() are called with the appropriate modes. But the
conn-stream is not released at this stage. However the flags are
reset. Thus, after a cs_close(), we loose shutdown flags. If cs_close() is
performed several times, it is a problem. And indeed, it is possible. On one
hand, the stream-interface may close the conn-stream. On the other end, the
stream always closes it when it is released.
It is a problem for the H1 multiplexer. Because the conn-stream flags are
reset, the shutr and shutw are performed twice. For a delayed shutdown, the
dirty mode is used instead of the normal one because the last call to
h1_shutw() overwrite H1C flags set by the first call. This leads to dirty
shutdowns while normal ones are required. At the end, it is possible to
truncate the messages.
This bug was revealed by the commit a85c522d4 ("BUG/MINOR: mux-h1: Save
shutdown mode if the shutdown is delayed").
This patch is related to the issue #1450. It must be backported as far as
2.0.
- add new metric: `haproxy_backend_agg_server_check_status`
it counts the number of servers matching a specific check status
this permits to exclude per server check status as the usage is often
to rely on the total. Indeed in large setup having thousands of
servers per backend the memory impact is not neglible to store the per
server metric.
- realign promex_str_metrics array
quite simple implementation - we could improve it later by adding an
internal state to the prometheus exporter, thus to avoid counting at
every dump.
this patch is an attempt to close github issue #1312. It may bebackported
to 2.4 if requested.
Signed-off-by: William Dauchy <wdauchy@gmail.com>
`info_field_names` and `stat_field_names` no longer exist and have been
moved in stats.c
To avoid changing this comment, just mention the name of the new table
`info_fields` and `stat_fields`
Signed-off-by: William Dauchy <wdauchy@gmail.com>
This function claims to perform an strncat()-like operation but it does
not, it always copies the indicated number of bytes, regardless of the
presence of a NUL character (what is currently done by chunk_memcat()).
Let's remove it and explicitly replace it with chunk_memcat().
Commit 3d2093af9 ("MINOR: connection: Add a connection error code sample
fetch") added these convenient sample-fetch functions but it appears that
due to a misunderstanding the redundant "conn" part was kept in their
name, causing confusion, since "fc" already stands for "front connection".
Let's simply call them "fc_err" and "bc_err" to match all other related
ones before they appear in a final release. The VTC they appeared in were
also updated, and the alpha sort in the keywords table updated.
Cc: William Lallemand <wlallemand@haproxy.org>
->frms_rwlock is an old lock supposed to be used when several threads
could handle the same connection. This is no more the case since this
commit:
"MINOR: quic: Attach the QUIC connection to a thread."
Add a buffer per QUIC connection. At this time the listener which receives
the UDP datagram is responsible of identifying the underlying QUIC connection
and must copy the QUIC packets to its buffer.
->pkt_list member has been added to quic_conn struct to enlist the packets
in the order they have been copied to the connection buffer so that to be
able to consume this buffer when the packets are freed. This list is locked
thanks to a R/W lock to protect it from concurent accesses.
quic_rx_packet struct does not use a static buffer anymore to store the QUIC
packets contents.
At this time we allocate an RX buffer by thread.
Also take the opportunity offered by this patch to rename TX related variable
names to distinguish them from the RX part.
This patch renames all dns extra counters and stats functions, types and
enums using the 'resolv' prefix/suffixes.
The dns extra counter domain id used on cli was replaced by "resolvers"
instead of "dns".
The typed extra counter prefix dumping resolvers domain "D." was
also renamed "N." because it points counters on a Nameserver.
This was done to finish the split between "resolver" and "dns" layers
and to avoid further misunderstanding when haproxy will handle dns
load balancing.
This should not be backported.
This patch add a union and struct into dns_counter struct to split
application specific counters.
The only current existing application is the resolver.c layer but
in futur we could handle different application such as dns load
balancing with others specific counters.
This patch should not be backported.
Handle properly websocket streams if the server uses an ALPN with both
h1 and h2. Add a new field h2_ws in the server structure. If set to off,
reuse is automatically disable on backend and ALPN is forced to http1.x
if possible. Nothing is done if on.
Implement a mechanism to be able to use a different http version for
websocket streams. A new server member <ws> represents the algorithm to
select the protocol. This can overrides the server <proto>
configuration. If the connection uses ALPN for proto selection, it is
updated for websocket streams to select the right protocol.
Three mode of selection are implemented :
- auto : use the same protocol between non-ws and ws streams. If ALPN is
use, try to update it to "http/1.1"; this is only done if the server
ALPN contains "http/1.1".
- h1 : use http/1.1
- h2 : use http/2.0; this requires the server to support RFC8441 or an
error will be returned by haproxy.
Add a new parameter force_mux_ops. This will be useful to specify an
alternative to the srv->mux_proto field. If non-NULL, it will be use to
force the mux protocol wether srv->mux_proto is set or not.
This argument will become useful to install a mux for non-standard
streams, most notably websocket streams.
Implement a new function to update the ALPN on an existing connection.
on an existing connection. The ALPN from the ssl context can be checked
to update the ALPN only if it is a subset of the context value.
This method will be useful to change a connection ALPN for websocket,
must notably if the server does not support h2 websocket through the
rfc8441 Extended Connect.
Define a new stream flag SF_WEBSOCKET and a new cs flag CS_FL_WEBSOCKET.
The conn-stream flag is first set by h1/h2 muxes if the request is a
valid websocket upgrade. The flag is then converted to SF_WEBSOCKET on
the stream creation.
This will be useful to properly manage websocket streams in
connect_server().
With this feature the lua implementation of the httpclient is now able
to stream a payload larger than an haproxy buffer.
The hlua_httpclient_send() function is now split into:
hlua_httpclient_send() which initiate the httpclient and parse the lua
parameters
hlua_httpclient_snd_yield() which will send the request and be called
again to stream the request if the body is larger than an haproxy buffer
hlua_httpclient_rcv_yield() which will receive the response and store it
in the lua buffer.
This patch add a way to handle HTTP requests streaming using a
callback.
The end of the data must be specified by using the "end" parameter in
httpclient_req_xfer().
The memcpy() call in the aarch64 version of __ha_cas_dw() is sometimes
inlined and sometimes not, depending on the gcc version. It's only used
to copy two void*, so let's use direct assignment instead of memcpy().
It would also be possible to change the asm code to directly write there,
but it's not worth it.
With this change the code is 8kB smaller with gcc-5.4.
__atomic_compare_exchange() is incorrectly documented in the gcc builtins
doc, it says the desired value is "type *desired" while in reality it is
"const type *desired" as expected since that value must in no way be
modified by the operation. However it seems that clang has implemented
it as documented, and reports build warnings when fed a const.
This is quite problematic because it means we have to betry the callers,
pretending we won't touch their constants but not knowing what the
compiler would do with them, and possibly hiding future bugs.
Instead of forcing a cast, let's just switch to the better
__atomic_compare_exchange_n() that takes a value instead of a pointer.
At least with this one there is no doubt about how the input will be
used.
It was verified that the output object code is the same both in clang
and gcc with this change.
At a few places we were still using protocol_by_family() instead of
the richer protocol_lookup(). The former is limited as it enforces
SOCK_STREAM and a stream protocol at the control layer. At least with
protocol_lookup() we don't have this limitationn. The values were still
set for now but later we can imagine making them configurable on the
fly.
Instead of using sock_type and ctrl_type to select a protocol, let's
make use of the new protocol type. For now they always match so there
is no change. This is applied to address parsing and to socket retrieval
from older processes.
The protocol selection is currently performed based on the family,
control type and socket type. But this is often not enough, as both
only provide DGRAM or STREAM, leaving few variants. Protocols like
SCTP for example might be indistinguishable from TCP here. Same goes
for TCP extensions like MPTCP.
This commit introduces a new enum proto_type that is placed in each
and every protocol definition, that will usually more or less match
the sock_type, but being an enum, will support additional values.
For now, these addresses are never set. But the idea is to be able to set, at
least first, the client source and destination addresses at the stream level
without updating the session or connection ones.
Of course, because these addresses are carried by the strream-interface, it
would be possible to set server source and destination addresses at this level
too.
Functions to fill these addresses have been added: si_get_src() and
si_get_dst(). If not already set, these functions relies on underlying
layers to fill stream-interface addresses. On the frontend side, the session
addresses are used if set, otherwise the client connection ones are used. On
the backend side, the server connection addresses are used.
And just like for sessions and conncetions, si_src() and si_dst() may be used to
get source and destination addresses or the stream-interface. And, if not set,
same mechanism as above is used.
For now, these addresses are never set. But the idea is to be able to set
client source and destination addresses at the session level without
updating the connection ones.
Functions to fill these addresses have been added: sess_get_src() and
sess_get_dst(). If not already set, these functions relies on
conn_get_src() and conn_get_dst() to fill session addresses.
And just like for conncetions, sess_src() and sess_dst() may be used to get
source and destination addresses. However, if not set, the corresponding
address from the underlying client connection is returned. When this
happens, the addresses is filled in the connection object.
conn_get_src() and conn_get_dst() functions are used to fill the source and
destination addresses of a connection. On success, ->src and ->dst
connection fields can be safely used.
For convenience, 2 new functions are added here: conn_src() and conn_dst().
These functions return the corresponding address, as a const and only if it
is already set. Otherwise NULL is returned.
Flags used to set the execution context of a lua txn are used as an enum. It is
not uncommon but there are few flags otherwise. So to remove ambiguities, a
comment and a _NONE value are added to have a clear definition of supported
values.
This patch should fix the issue #1429. No backport needed.
httpclient_req_gen() takes a payload argument which can be use to put a
payload in the request. This payload can only fit a request buffer.
This payload can also be specified by the "body" named parameter within
the lua. httpclient.
It is also used within the CLI httpclient when specified as a CLI
payload with "<<".
Use include file <sys/time.h> to fix compilation error with timeval in
some files. This is as reported as 'man 7 system_data_types'. The build
error is reported on NetBSD 9.2.
This should be backported up to 2.2.
Some browsers may send Initial packets with sizes greater than 1252 bytes
(QUIC_INITIAL_IPV4_MTU). Let us increase this size limit up to 2048 bytes.
Also use this size for "max_udp_payload_size" transport parameter to limit
the size of the datagrams we want to receive.
Sometimes we'd like to do our best to drain pending data before closing
in order to save the peer from risking to receive an RST on close.
This adds a new connection flag CO_FL_WANT_DRAIN that is used to
trigger a call to conn_ctrl_drain() from conn_ctrl_close(), and the
sock_drain() function ignores fd_recv_ready() if this flag is set,
in order to catch latest data. It's not used for now.
This applicationn specific flag was added in 2.4-dev by commit 6fa8bcdc7
("MINOR: task: add an application specific flag to the state: TASK_F_USR1")
to help preserve a the idle connections status across wakeup calls. While
the code to do this was OK for tasklets, it was wrong for tasks, as in an
effort not to lose it when setting the RUNNING flag (that tasklets don't
have), it ended up being inconditionally set. It just happens that for now
no regular tasks use it, only tasklets.
This fix makes sure we always atomically perform (state & flags | running)
there, using a CAS. It also does it for tasklets because it was possible
to lose some such flags if set by another thread, even though this should
not happen with current code. In order to make the code more readable (and
avoid the previous mistake of repeated flags in the bit field), a new
TASK_PERSISTENT aggregate was declared in task.h for this.
In practice the CAS is cheap here because task states are stable or
convergent so the loop will almost never be taken.
This should be backported to 2.4.
This macro is similar to LIST_INLIST() except that it is guaranteed to
perform the test atomically, so that even if LIST_INLIST() is intrumented
with debugging code to perform extra consistency checks, it will not fail
when used in the context of barriers and atomic ops.
perf top shows that we spend a lot of time trying to read item->type in
the lookup loop, because the node is placed after the very long name,
so when the node is found, no data is in the cache yet. Let's simply
move the node upper in the struct. This results in the CPU usage of
resolv_validate_dns_response() to drop by 4 points.
A bogus test in b_get_varint(), b_put_varint(), b_peek_varint() shifts
the end of the buffer by one byte. Since the bug is the same in the read
and write functions, the buffer contents remain compatible, which explains
why this bug was not detected earlier. But if the buffer ends on an
aligned address or page, it can result in a one-byte overflow which will
typically cause a crash or an inconsistent behavior.
This API is only used by rings (e.g. for traces and boot messages) and
by DNS responses, so the probability to hit it is extremely low, but a
crash on boot was observed.
This must be backported to 2.2.
With SRV records, a huge amount of time is spent looking for records
by walking long lists. It is possible to reduce this by indexing values
in trees instead. However the whole code relies a lot on the list
ordering, and even implements some round-robin on it to distribute IP
addresses to servers.
This patch starts carefully by replacing the list with a an eb32 tree
that is still used like a list, with a constant key 0. Since ebtrees
preserve insertion order for duplicates, the tree walk visits the nodes
in the exact same order it did with the lists. This allows to implement
the required infrastructure without changing the behavior.
This one was used to indicate whether the callee had to follow particularly
safe code path when removing resolutions. Since the code now uses a kill
list, this is not needed anymore.
This code is dangerous enough that we certainly don't want external code
to ever approach it, let's not export unnecessary functions like this one.
It was made static and a comment was added about its purpose.
In order for all the error return values to be distributed on the same
side (instead of surrounding the success error code), the return values
for errors other than a simple verification failure are switched to
negative values. This way the result of the jwt_verify converter can be
compared strictly to 1 as well relative to 0 (any <= 0 return value is
an error).
The documentation was also modified to discourage conversion of the
return value into a boolean (which would definitely not work).
The PR_FL_READY flags must now be set on a proxy at the end of the
configuration validity check to notify it is fully configured and may be
safely used.
For now there is no real usage of this flag. But it will be usefull for
referenced default proxies to finish their configuration only once.
This patch is mandatory to support TCP/HTTP rules in defaults sections.
A proxy may now references the defaults section it is used. To do so, a
pointer on the default proxy was added in the proxy structure. And a
refcount must be used to track proxies using a default proxy. A default
proxy is destroyed iff its refcount is equal to zero and when it drops to
zero.
All this stuff must be performed during init/deinit staged for now. All
unreferenced default proxies are removed after the configuration parsing.
This patch is mandatory to support TCP/HTTP rules in defaults sections.
It is now possible to designate the defaults section to use by adding a name
of the corresponding defaults section and referencing it in the desired
proxy section. However, this introduces an ambiguity. This named defaults
section may still be implicitly used by other proxies if it is the last one
defined. In this case for instance:
default common
...
default frt from common
...
default bck from common
...
frontend fe from frt
...
backend be from bck
...
listen stats
...
Here, it is not really obvious the last section will use the 'bck' defaults
section. And it is probably not the expected behaviour. To help users to
properly configure their haproxy, a warning is now emitted if a defaults
section is explicitly AND implicitly used. The configuration manual was
updated accordingly.
Because this patch adds a warning, it should probably not be backported to
2.4. However, if is is backported, it depends on commit "MINOR: proxy:
Introduce proxy flags to replace disabled bitfield".
This change is required to support TCP/HTTP rules in defaults sections. The
'disabled' bitfield in the proxy structure, used to know if a proxy is
disabled or stopped, is replaced a generic bitfield named 'flags'.
PR_DISABLED and PR_STOPPED flags are renamed to PR_FL_DISABLED and
PR_FL_STOPPED respectively. In addition, everywhere there is a test to know
if a proxy is disabled or stopped, there is now a bitwise AND operation on
PR_FL_DISABLED and/or PR_FL_STOPPED flags.
These two fields are exclusive as they depend on the data type.
Let's move them into a union to save some precious bytes. This
reduces the struct resolv_answer_item size from 600 to 576 bytes.
The struct resolv_answer_item contains an address field of type
"sockaddr" which is only 16 bytes long, but which is used to store
either IPv4 or IPv6. Fortunately, the contents only overlap with
the "target" field that follows it and that is large enough to
absorb the extra bytes needed to store AAAA records. But this is
dangerous as just moving fields around could result in memory
corruption.
The fix uses a union and removes the casts that were used to hide
the problem.
Older versions need to be checked and possibly fixed. This needs
to be backported anyway.
In multi-threaded mode, on operating systems supporting multiple listeners on
the same IP:port, this will automatically create this number of multiple
identical listeners for the same line, all bound to a fair share of the number
of the threads attached to this listener. This can sometimes be useful when
using very large thread counts where the in-kernel locking on a single socket
starts to cause a significant overhead. In this case the incoming traffic is
distributed over multiple sockets and the contention is reduced. Note that
doing this can easily increase the CPU usage by making more threads work a
little bit.
If the number of shards is higher than the number of available threads, it
will automatically be trimmed to the number of threads. A special value
"by-thread" will automatically assign one shard per thread.
This function's purpose will be to duplicate a listener in INIT state.
This will be used to ease declaration of listeners spanning multiple
groups, which will thus require multiple FDs hence multiple receivers.
With groups at some point we'll have to have distinct masks/groups in the
receiver and the bind_conf, because a single bind_conf might require to
instantiate multiple receivers (one per group).
Let's split the thread mask and group to have one for the bind_conf and
another one for the receiver while it remains easy to do. This will later
allow to use different storage for the bind_conf if needed (e.g. support
multiple groups).
In file included from include/haproxy/jwt.h:25:
include/haproxy/jwt-t.h:66:2: error: unknown type name 'EVP_PKEY'
EVP_PKEY *pkey;
^
1 error generated.
Fix this compilation issue by inserting openssl-compat.h in jwt-t.h
This new converter takes a JSON Web Token, an algorithm (among the ones
specified for JWS tokens in RFC 7518) and a public key or a secret, and
it returns a verdict about the signature contained in the token. It does
not simply return a boolean because some specific error cases cas be
specified by returning an integer instead, such as unmanaged algorithms
or invalid tokens. This enables to distinguich malformed tokens from
tampered ones, that would be valid format-wise but would have a bad
signature.
This converter does not perform a full JWT validation as decribed in
section 7.2 of RFC 7519. For instance it does not ensure that the header
and payload parts of the token are completely valid JSON objects because
it would need a complete JSON parser. It only focuses on the signature
and checks that it matches the token's contents.
A JWT signed with the RSXXX or ESXXX algorithm (RSA or ECDSA) requires a
public certificate to be verified and to ensure it is valid. Those
certificates must not be read on disk at runtime so we need a caching
mechanism into which those certificates will be loaded during init.
This is done through a dedicated ebtree that is filled during
configuration parsing. The path to the public certificates will need to
be explicitely mentioned in the configuration so that certificates can
be loaded as early as possible.
This tree is different from the ckch one because ckch entries are much
bigger than the public certificates used in JWT validation process.
This helper function splits a JWT under Compact Serialization format
(dot-separated base64-url encoded strings) into its different sub
strings. Since we do not want to manage more than JWS for now, which can
only have at most three subparts, any JWT that has strictly more than
two dots is considered invalid.
The full list of possible algorithms used to create a JWS signature is
defined in section 3.1 of RFC7518. This patch adds a helper function
that converts the "alg" strings into an enum member.
This fetch can be used to retrieve the data contained in an HTTP
Authorization header when the Bearer scheme is used. This is used when
transmitting JSON Web Tokens for instance.
On receiving CONNECTION_CLOSE frame, the mux is flagged for immediate
connection close. A stream is closed even if there is data not ACKed
left if CONNECTION_CLOSE has been received.
The mux tx buffers have been rewritten with buffers attached to qcs
instances. qc_buf_available and qc_get_buf functions are updated to
manipulates qcs. All occurences of the unused qcc ring buffer are
removed to ease the code maintenance.
Defer the shutting of a qcs if there is still data in its tx buffers. In
this case, the conn_stream is closed but the qcs is kept with a new flag
QC_SF_DETACH.
On ACK reception, the xprt wake up the shut_tl tasklet if the stream is
flagged with QC_SF_DETACH. This tasklet is responsible to free the qcs
and possibly the qcc when all bidirectional streams are removed.
Remove the tx mux ring buffers in qcs, which should be in the qcc. For
the moment, use a simple architecture with 2 simple tx buffers in the
qcs.
The first buffer is used by the h3 layer to prepare the data. The mux
send operation transfer these into the 2nd buffer named xprt_buf. This
buffer is only freed when an ACK has been received.
This architecture is functional but not optimal for two reasons :
- it won't limit the buffer usage by connection
- each transfer on a new stream requires an allocation
This new ssllib_name_startswith precondition check can be used to
distinguish application linked with OpenSSL from the ones linked with
other SSL libraries (LibreSSL or BoringSSL namely). This check takes a
string as input and returns 1 when the SSL library's name starts with
the given string. It is based on the OpenSSL_version function which
returns the same output as the "openssl version" command.
These ones are passed on rule creation for the sole purpose of being
reported in "show sess", which is not done yet. For now the entries
are allocated upon rule creation and freed in free_act_rules().
Rules are currently allocated using calloc() by their caller, which does
not make it very convenient to pass more information such as the file
name and line number.
This patch introduces new_act_rule() which performs the malloc() and
already takes in argument the ruleset (ACT_F_*), the file name and the
line number. This saves the caller from having to assing ->from, and
will allow to improve the internal storage with more info.
Rename __GLOBL and __GLOBL1 to __HA_GLOBL and __HA_GLOBL1, as the former are
already defined on FreeBSD.
This should be backported to 2.4, 2.3 and 2.2.
There have been a large number of issues reported with conn_cur
synchronization because the concept is wrong. In an active-passive
setup, pushing the local connections count from the active node to
the passive one will result in the passive node to have a higher
counter than the real number of connections. Due to this, after a
switchover, it will never be able to close enough connections to
go down to zero. The same commonly happens on reloads since the new
process preloads its values from the old process, and if no connection
happens for a key after the value is learned, it is impossible to reset
the previous ones. In active-active setups it's a bit different, as the
number of connections reflects the number on the peer that pushed last.
This patch solves this by marking the "conn_cur" local and preventing
it from being learned from peers. It is still pushed, however, so that
any monitoring system that collects values from the peers will still
see it.
The patch is tiny and trivially backportable. While a change of behavior
in stable branches is never welcome, it remains possible to fix issues
if reports become frequent.
In the configuration sometimes we'll omit a thread group number to designate
a global thread number range, and sometimes we'll mention the group and
designate IDs within that group. The operation is more complex than it
seems due to the need to check for ranges spanning between multiple groups
and determining groups from threads from bit masks and remapping bit masks
between local/global.
This patch adds a function to perform this operation, it takes a group and
mask on input and updates them on output. It's designed to be used by "bind"
lines but will likely be usable at other places if needed.
For situations where specified threads do not exist in the group, we have
the choice in the code between silently fixing the thread set or failing
with a message. For now the better option seems to return an error, but if
it turns out to be an issue we can easily change that in the future. Note
that it should only happen with "x/even" when group x only has one thread.
This extends the "thread" statement of bind lines to support an optional
thread group number. When unspecified (0) it's an absolute thread range,
and when specified it's one relative to the thread group. Masks are still
used so no more than 64 threads may be specified at once, and a single
group is possible. The directive is not used for now.
This is the equivalent of "tid" for ease of access. In the future if we
make th_cfg a pure thread-local array (not a pointer), it may make sense
to move it there.
ha_set_tid() was randomly used either to explicitly set thread 0 or to
set any possibly incomplete thread during boot. Let's replace it with
a pointer to a valid thread or NULL for any thread. This allows us to
check that the designated threads are always valid, and to ignore the
thread 0's mapping when setting it to NULL, and always use group 0 with
it during boot.
The initialization code is also cleaner, as we don't pass ugly casts
of a thread ID to a pointer anymore.
This will be a convenient way to communicate the thread ID and its
local ID in the group, as well as their respective bits when creating
the threads or when only a pointer is given.
This will ease the reporting of the current thread group ID when coming
from the thread itself, especially since it returns the visible ID,
starting at 1.
This takes care of unassigned threads groups and places unassigned
threads there, in a more or less balanced way. Too sparse allocations
may still fail though. For now with a maximum group number fixed to 1
nothing can really fail.
A the "tg" thread-local variable now always points to the current
thread group. It's pre-initializd to the first one during boot and is
set to point to the thread's one by ha_set_tid(). This last one takes
care of checking whether the thread group was assigned or not because
it may be called during boot before threads are initialized.
This registers a mapping of threads to groups by enumerating for each thread
what group it belongs to, and marking the group as assigned. It takes care of
checking for redefinitions, overlaps, and holes. It supports both individual
numbers and ranges. The thread group is referenced from the thread config.
This creates a struct tgroup_info which knows the thread ID of the first
thread in a group, and the number of threads in it. For now there's only
one thread group supported in the configuration, but it may be forced to
other values for development purposes by defining MAX_TGROUPS, and it's
enabled even when threads are disabled and will need to remain accessible
during boot to keep a simple enough internal API.
For the purpose of easing the configurations which do not specify a thread
group, we're starting group numbering at 1 so that thread group 0 can be
"undefined" (i.e. for "bind" lines or when binding tasks).
The goal will be to later move there some global items that must be
made per-group.
We want to make sure that the current thread_info accessed via "ti" will
remain constant, so that we don't accidentally place new variable parts
there and so that the compiler knows that info retrieved from there is
not expected to have changed between two function calls.
Only a few init locations had to be adjusted to use the array and the
rest is unaffected.
The last 3 fields were 3 list heads that are per-thread, and which are:
- the pool's LRU head
- the buffer_wq
- the streams list head
Moving them into thread_ctx completes the removal of dynamic elements
from the struct thread_info. Now all these dynamic elements are packed
together at a single place for a thread.
The TI_FL_STUCK flag is manipulated by the watchdog and scheduler
and describes the apparent life/death of a thread so it changes
all the time and it makes sense to move it to the thread's context
for an active thread.
The "thread_info" name was initially chosen to store all info about
threads but since we now have a separate per-thread context, there is
no point keeping some of its elements in the thread_info struct.
As such, this patch moves prev_cpu_time, prev_mono_time and idle_pct to
thread_ctx, into the thread context, with the scheduler parts. Instead
of accessing them via "ti->" we now access them via "th_ctx->", which
makes more sense as they're totally dynamic, and will be required for
future evolutions. There's no room problem for now, the structure still
has 84 bytes available at the end.
The scheduler contains a lot of stuff that is thread-local and not
exclusively tied to the scheduler. Other parts (namely thread_info)
contain similar thread-local context that ought to be merged with
it but that is even less related to the scheduler. However moving
more data into this structure isn't possible since task.h is high
level and cannot be included everywhere (e.g. activity) without
causing include loops.
In the end, it appears that the task_per_thread represents most of
the per-thread context defined with generic types and should simply
move to tinfo.h so that everyone can use them.
The struct was renamed to thread_ctx and the variable "sched" was
renamed to "th_ctx". "sched" used to be initialized manually from
run_thread_poll_loop(), now it's initialized by ha_set_tid() just
like ti, tid, tid_bit.
The memset() in init_task() was removed in favor of a bss initialization
of the array, so that other subsystems can put their stuff in this array.
Since the tasklet array has TL_CLASSES elements, the TL_* definitions
was moved there as well, but it's not a problem.
The vast majority of the change in this patch is caused by the
renaming of the structures.
We used to remap SI_TKILL to SI_LWP when SI_TKILL was not available
(e.g. FreeBSD) but that's ugly and since we need this only in a single
switch/case block in wdt.c it's even simpler and cleaner to perform the
two tests there, so let's do this.
The watchdog timer had no more reason for being shared with the struct
thread_info since the watchdog is the only user now. Let's remove it
from the struct and move it to a static array in wdt.c. This removes
some ifdefs and the need for the ugly mapping to empty_t that might be
subject to a cast to a long when compared to TIMER_INVALID. Now timer_t
is not known outside of wdt.c and clock.c anymore.
This removes the knowledge of clockid_t from anywhere but clock.c, thus
eliminating a source of includes burden. The unused clock_id field was
removed from thread_info, and the definition setting of clockid_t was
removed from compat.h. The most visible change is that the function
now_cpu_time_thread() now takes the thread number instead of a tinfo
pointer.
The code that deals with timer creation for the WDT was moved to clock.c
and is called with the few relevant arguments. This removes the need for
awareness of clock_id from wdt.c and as such saves us from having to
share it outside. The timer_t is also known only from both ends but not
from the public API so that we don't have to create a fake timer_t
anymore on systems which do not support it (e.g. macos).
This was previously open-coded in run_thread_poll_loop(). Now that
we have clock.c dedicated to such stuff, let's move the code there
so that we don't need to keep such ifdefs nor to depend on the
clock_id.
Instead of fiddling with before_poll and after_poll in
activity_count_runtime(), the function is now called by
clock_entering_poll() which passes it the number of microseconds
spent working. This allows to remove all calls to
activity_count_runtime() from the pollers.
The entering_poll/leaving_poll/measure_idle functions that were hard
to classify and used to move to various locations have now been placed
into clock.c since it's precisely about time-keeping. The functions
were renamed to clock_*. The samp_time and idle_time values are now
static since there is no reason for them to be read from outside.
There is currently a problem related to time keeping. We're mixing
the functions to perform calculations with the os-dependent code
needed to retrieve and adjust the local time.
This patch extracts from time.{c,h} the parts that are solely dedicated
to time keeping. These are the "now" or "before_poll" variables for
example, as well as the various now_*() functions that make use of
gettimeofday() and clock_gettime() to retrieve the current time.
The "tv_*" functions moved there were also more appropriately renamed
to "clock_*".
Other parts used to compute stolen time are in other files, they will
have to be picked next.
It was brough by an unneeded addition of a local variable after a test
in commit f7f53afcf ("BUILD/MEDIUM: tcp: set-mark setting support for
FreeBSD."). No backport needed.
Remove the quic_conn from the receiver connection_ids tree on
quic_conn_free. This fixes a crash due to dangling references in the
tree after a quic connection release.
This operation must be conducted under the listener lock. For this
reason, the quic_conn now contains a reference to its attached listener.
Following include reorganzation, there is some missing include files for
task.h when compiling with DEBUG_TASK :
- activity.h for task_profiling_mask
- time.h for now_mono_time()
This is present since the following commit
d8b325c748
REORG: task: uninline the loop time measurement code
No need to backport this.
These ones are rarely used or only to waste CPU cycles waiting, and are
the last ones requiring system includes in thread.h. Let's uninline them
and move them to thread.c.
This removes the thread identifiers from struct thread_info and moves
them only in static array in thread.c since it's now the only file that
needs to touch it. It's also the only file that needs to include
pthread.h, beyond haproxy.c which needs it to start the poll loop. As
a result, much less system includes are needed and the LoC reduced by
around 3%.
haproxy.c still has to deal with pthread-specific low-level stuff that
is OS-dependent. We should not have to deal with this there, and we do
not need to access pthread anywhere else.
Let's move these 3 functions to thread.c and keep empty inline ones for
when threads are disabled.
It's not needed to inline it at all (one call per loop) and it introduces
dependencies, let's move it to fd.c.
Removing the few remaining includes that came with it further reduced
by ~0.2% the LoC and the build time is now below 6s.
TV_ETERNITY, TV_ETERNITY_MS and MAX_DELAY_MS may be configured and
ought to be in defaults.h so that they can be inherited from everywhere
without including time.h and could also be redefined if neede
(particularly for MAX_DELAY_MS).
It's pointless to inline this, it's called exactly once per poll loop,
and it depends on time.h which is quite deep. Let's move that to task.c
along with sched_report_idle().
The remaining large functions are those allocating/initializing and
occasionally freeing connections, conn_streams and sockaddr. Let's
move them to connection.c. In fact, cs_free() is the only one-liner
but let's move it along with the other ones since a call will be
small compared to the rest of the work done there.
The following inlined functions are particularly large (and probably not
inlined at all by the compiler), and together represent roughly half of
the file, while they're used at most once per connection. They were moved
to connection.c.
conn_upgrade_mux_fe, conn_install_mux_fe, conn_install_mux_be,
conn_install_mux_chk, conn_delete_from_tree, conn_init, conn_new,
conn_free
No need to include the full tree management code, type files only
need the definitions. Doing so reduces the whole code size by around
3.6% and the build time is down to just 6s.
ebtree is one piece using a lot of inlines and each tree root or node
definition needed by many of our structures requires to parse and
compile all these includes, which is large and painfully slow. Let's
move the very basic definitions to their own file and include it from
ebtree.h.
The following functions are quite heavy and have no reason to be kept
inlined:
srv_release_conn, srv_lookup_conn, srv_lookup_conn_next,
srv_add_to_idle_list
They were moved to server.c. It's worth noting that they're a bit
at the edge between server and connection and that maybe we could
create an idle-conn file for these in the near future.
We do not really need to have them inlined, and having xxhash.h included
by connection.h results in this 4700-lines file being processed 101 times
over the whole project, which accounts for 13.5% of the total size!
Additionally, half of the functions are only needed from connection.c.
Let's move the functions there and get rid of the painful include.
The build time is now down to 6.2s just due to this.
The hash type stored everywhere is XXH64_hash_t, which annoyingly forces
everyone to include the huge xxhash file. We know it's an uint64_t because
that's its purpose and the type is only made to abstract it on machines
where uint64_t is not availble. Let's switch the type to uint64_t
everywhere and avoid including xxhash from the type file.
Plenty of includes were present there only for struct pointers resulting
in them being used from many other places. The LoC reduced again by more
than 1% by cleaning this.
This one is expensive in code size because it comes with xxhash.h at a
low level of dependency that's inherited at plenty of places, and for
a function does doesn't benefit from inlining and could possibly even
benefit from not being inline given that it's large and called from the
scheduler.
Moving it to activity.c reduces the LoC by 1.2% and the binary size by
~1kB.
This function has no reason for being inlined, it's called from non
critical places (once in pollers), is quite large and comes with
dependencies (time and freq_ctr). Let's move it to acitvity.c. That's
another 0.4% less LoC to build.
These are ticks, not timeval, and they're a cause for plenty of files
including time.h just to access now_ms that's only used with ticks
functions. Let's move them over there.
The idle time calculation stuff was moved to task.h by commit 6dfab112e
("REORG: sched: move idle time calculation from time.h to task.h") but
these two variables that are only maintained by task.{c,h} were still
left in time.{c,h}. They have to move as well.
These two counters were the only ones not in the global struct, while
the SSL freq counters or the req counts are already in it, this forces
stats.c to include ssl_sock just to know about them. Let's move them
over there with their friends. This reduces from 408 to 384 the number
of includes of opensslconf.h.
This one has nothing to do with ssl_sock as it manipulates the struct
server only. Let's move it to server.c and remove unneeded dependencies
on ssl_sock.h. This further reduces by 10% the number of includes of
opensslconf.h and by 0.5% the number of compiled lines.
This one doesn't use anything from an SSL context, it only checks the
type of the transport layer of a connection, thus it belongs to
connection.h. This is particularly visible due to all the ifdefs
around it in various call places.
This is exactly the same as for listeners, servers only include
openssl-compat to provide the SSL_CTX type to use as two pointers to
contexts, and to detect if NPN, ALPN, and cipher suites are supported,
and save up to 5 pointers in the ssl_ctx struct if not supported. This
is pointless, as these ones have all been supported for about a decade,
and including this file comes with a long dependency chain that impacts
lots of other files. The ctx was made a void*.
Now the build time was significantly reduced, from 9.2 to 8.1 seconds,
thanks to opensslconf.h being included "only" 456 times instead of 2424
previously!
The total number of lines of code compiled was reduced by 15%.
Listeners only include openssl-compat to provide the SSL_CTX type to
use as two pointers to contexts, and to detect if NPN, ALPN, and cipher
suites are supported, and save up to 5 pointers in the ssl_bind_conf
struct if not supported. This is pointless, as these ones have all been
supported for about a decade, and including this file comes with a long
dependency chain that impacts lots of other files. The initial_ctx and
default_ctx can perfectly remain void* instead of SSL_CTX*.
These functions have no reason for being inlined, and they require some
includes with long dependencies. Let's move them to listener.c and trim
unused includes in listener.h.
This file includes streams, proxies, Lua just for some definitions of
structures for which we only have a pointer. Let's drop this. That's
responsible for 0.2% of all the lines of code.
The lock-debugging code in thread.h has no reason to be inlined. the
functions are quite fat and perform a lot of operations so there's no
saving keeping them inlined. Worse, most of them are in fact not
inlined, resulting in a significantly bigger executable.
This patch moves all this part from thread.h to thread.c. The functions
are still exported in thread.h of course. This results in ~166kB less
code:
text data bss dec hex filename
3165938 99424 897376 4162738 3f84b2 haproxy-before
2991987 99424 897376 3988787 3cdd33 haproxy-after
In addition the build time with thread debugging enabled has shrunk
from 19.2 to 17.7s thanks to much less code to be parsed in thread.h
that is included virtually everywhere.
pool-os.h relies on a number of includes solely because the
pool_alloc_area() function was inlined, and this only because we want
the normal version to be inlined so that we can track the calling
places for the memory profiler. It's worth noting that it already
does not work at -O0, and that when UAF is enabled we don't care a
dime about profiling.
This patch does two things at once:
- force-inline the functions so that pool_alloc_area() is still
inlined at -O0 to help track malloc() users ;
- uninline the UAF version of these (that rely on mmap/munmap)
and move them to pools.c so that we can remove all unneeded
includes.
Doing so reduces by ~270kB or 0.15% the total build size.
These ones are called from a few places in the code and are only provided
by ebtree.h, which is not normal given that some callers do not even use
ebtree.
channel, stream_interface, appctx, buffer, proxy and htx ones are used
in function arguments and most of them are not defined but were inherited
from intermediary inclues. Let's define them here and drop the unneeded
includes.
Some structures are inherited via intermediary includes (e.g. dns_counters
comes from a long path). Let's define the missing ones and includes vars-t
that is needed in the structure.
We're using variable-to-sample conversion at least 4 times in the code,
two of which are bogus. Let's introduce a generic conversion function
that performs the required checks.
This one only handles integers, contrary to its sibling with the suffix
_str that only handles strings. Let's rename it and uninline it since it
may well be used from outside.
The SSL stuff in struct server takes less than 3% of it and requires
lots of annoying ifdefs in the code just to take care of the cases
where the field is absent. Let's get rid of this and stop including
openssl-compat from server.c to detect NPN and ALPN capabilities.
This reduces the total LoC by another 0.4%.
During httpclient_destroy, add a condition in the BUG_ON which checks
that the client was started before it has ended. A httpclient structure
could have been created without being started.
httpclient_stop_and_destroy() tries to destroy the httpclient structure
if the client was stopped.
In the case the client wasn't stopped, it ask the client to stop itself
and to destroy the httpclient structure itself during the release of the
applet.
That's where that code initially was but it had been moved to
activity_count_runtime() for pure reasons of dependency loops. These
ones are no longer true so we can move that code back to the scheduler
and keep it where the information are updated and checked.
time.h is a horrible place to put activity calculation, it's a
historical mistake because the functions were there. We already have
most of the parts in sched.{c,h} and these ones make an exception in
the middle, forcing time.h to include some thread stuff and to access
the before/after_poll and idle_pct values.
Let's move these 3 functions to task.h with the other ones. They were
prefixed with "sched_" instead of the historical "tv_" which already
made no sense anymore.
I don't know why I inlined this one, this makes no sense given that it's
only used for stats, and it starts a circular dependency on tinfo.h which
can be problematic in the future. In addition, all the stuff related to
idle time calculation should be with the rest of the scheduler, which
currently is in task.{c,h}, so let's move it there.
We'll need to improve the API to pass other arguments in the future, so
let's start to adapt better to the current use cases. task_new() is used:
- 18 times as task_new(tid_bit)
- 18 times as task_new(MAX_THREADS_MASK)
- 2 times with a single bit (in a loop)
- 1 in the debug code that uses a mask
This patch provides 3 new functions to achieve this:
- task_new_here() to create a task on the calling thread
- task_new_anywhere() to create a task to be run anywhere
- task_new_on() to create a task to run on a specific thread
The change is trivial and will allow us to later concentrate the
required adaptations to these 3 functions only. It's still possible
to call task_new() if needed but a comment was added to encourage the
use of the new ones instead. The debug code was not changed and still
uses it.
Work lists were a mechanism introduced in 1.8 to asynchronously delegate
some work to be performed on another thread via a dedicated task.
The only user was the listeners, to deal with the queue. Nowadays
the tasklets have made this much more convenient, and have replaced
work_lists in the listeners. It seems there will be no valid use case
of work lists anymore, so better get rid of them entirely and keep the
scheduler code cleaner.
__task_queue() must absolutely not be called with TICK_ETERNITY or it
will place a never-expiring node upfront in the timers queue, preventing
any timer from expiring until the process is restarted. Code was found
to cause this using "task_schedule(task, now_ms)" which does this one
millisecond every 49.7 days, so let's add a condition against this. It
must never trigger since any process susceptible to trigger it would
already accumulate tasks until it dies.
An extra test was added in wake_expired_tasks() to detect tasks whose
timeout would have been changed after being queued.
An improvement over this could be in the future to use a non-scalar
type (union/struct) for expiration dates so as to avoid the risk of
using them directly like this. But now_ms is already such a valid
time and this specific construct would still not be caught.
This could even be backported to stable versions to help detect other
occurrences if any.
The ssl_bc_hsk_err sample fetch will need to raise more errors than only
handshake related ones hence its renaming to a more generic ssl_bc_err.
This patch is required because some handshake failures that should have
been caught by this fetch (verify error on the server side for instance)
were missed. This is caused by a change in TLS1.3 in which the
'Finished' state on the client is reached before its certificate is sent
(and verified) on the server side (see the "Protocol Overview" part of
RFC 8446).
This means that the SSL_do_handshake call is finished long before the
server can verify and potentially reject the client certificate.
The ssl_bc_hsk_err will then need to be expanded to catch other types of
errors.
This change is also applied to the frontend fetches (ssl_fc_hsk_err
becomes ssl_fc_err) and to their string counterparts.
In case of a connection error happening after the SSL handshake is
completed, the error code stored in the connection structure would not
always be set, hence having some connection failures being described as
successful in the fc_conn_err or bc_conn_err sample fetches.
The most common case in which it could happen is when the SSL server
rejects the client's certificate. The SSL_do_handshake call on the
client side would be sucessful because the client effectively sent its
client hello and certificate information to the server, but the next
call to SSL_read on the client side would raise an SSL_ERROR_SSL code
(through the SSL_get_error function) which is decribed in OpenSSL
documentation as a non-recoverable and fatal SSL error.
This patch ensures that in such a case, the connection's error code is
set to a special CO_ERR_SSL_FATAL value.
There's no reason CONFIG_HAP_POOLS and its opposite are located into
pools-t.h, it forces those that depend on them to inlcude the file.
Other similar options are normally dealt with in defaults.h, which is
part of the default API, so let's do that.
According to the RFC7230, "chunked" encoding must not be applied more than
once to a message body. To handle this case, h1_parse_xfer_enc_header() is
now responsible to fail when a parsing error is found. It also fails if the
"chunked" encoding is not the last one for a request.
To help the parsing, two H1 parser flags have been added: H1_MF_TE_CHUNKED
and H1_MF_TE_OTHER. These flags are set, respectively, when "chunked"
encoding and any other encoding are found. H1_MF_CHNK flag is used when
"chunked" encoding is the last one.
This commit provides an hlua_httpclient object which is a bridge between
the httpclient and the lua API.
The HTTPClient is callable in lua this way:
local httpclient = core.httpclient()
local response = httpclient:get("http://127.0.0.1:9000/?s=9999")
core.Debug("Status: ".. res.status .. ", Reason : " .. res.reason .. ", Len:" .. string.len(res.body) .. "\n")
The resulting response object will provide a "status" field which
contains the status code, a "reason" string which contains the reason
string, and a "body" field which contains the response body.
The implementation uses the httpclient callback to wake up the lua task
which yield each time it pushes some data. The httpclient works in the
same thread as the lua task.
The transient flag CO_RFL_BUF_NOT_STUCK should now be set when the mux's
rcv_buf() function is called, in si_cs_recv(), to be sure the mux is able to
perform some optimisation during data copy. This flag is set when we are
sure the channel buffer is not stuck. Concretely, it happens when there are
data scheduled to be sent.
It is not a fix and this flag is not used for now. But it makes sense to have
this info to be sure to be able to do some optimisations if necessary.
This patch is related to the issue #1362. It may be backported to 2.4 to
ease future backports.
HTX_FL_FRAGMENTED flag is now set on an HTX message when it is
fragmented. It happens when an HTX block is removed in the middle of the
message and flagged as unused. HTX_FL_FRAGMENTED flag is removed when all
data are removed from the message or when the message is defragmented.
Note that some optimisations are still possible because the flag can be
avoided in other situations. For instance when the last header of a bodyless
message is removed.
If the stream-interface is waiting for more buffer room to store incoming
data, it is important at the stream level to stop to wait for more data to
continue. Thanks to the previous patch ("BUG/MEDIUM: stream-int: Notify
stream that the mux wants more room to xfer data"), the stream is woken up
when this happens. In this patch, we take care to interrupt the
corresponding tcp-content ruleset or to stop waiting for the HTTP message
payload.
To ease detection of the state, si_rx_blocked_room() helper function has
been added. It returns non-zero if the stream interface's Rx path is blocked
because of lack of room in the input buffer.
This patch is part of a series related to the issue #1362. It should be
backported as ar as 2.0, probably with some adaptations. So be careful
during backports.
When the mux failed to transfer data to the upper layer because of a lack of
room, it is important to wake the stream up to let it handle this
event. Otherwise, if the stream is waiting for more data, both the stream
and the mux reamin blocked waiting for each other.
When this happens, the mux set the CS_FL_WANT_ROOM flag on the
conn-stream. Thus, in si_cs_recv() we are able to detect this event. Today,
the stream-interface is blocked. But, it is not enough to wake the stream
up. To fix the bug, CF_READ_PARTIAL flag is extended to also handle cases
where a read exception occurred. This flag should idealy be renamed. But for
now, it is good enough. By setting this flag, we are sure the stream will be
woken up.
This patch is part of a series related to the issue #1362. It should be
backported as far as 2.0, probably with some adaptations. So be careful
during backports.
When a message is parsed and copied into the channel buffer, in
h1_process_demux(), more space is requested if some pending data remain
after the parsing while the channel buffer is not empty. To do so,
CS_FL_WANT_ROOM flag is set. It means the H1 parser needs more space in the
channel buffer to continue. In the stream-interface, when this flag is set,
the SI is considered as blocked on the RX path. It is only unblocked when
some data are sent.
However, it is not accurrate because the parsing may be stopped because
there is not enough data to continue. For instance in the middle of a chunk
size. In this case, some data may have been already copied but the parser is
blocked because it must receive more data to continue. If the calling SI is
blocked on RX at this stage when the stream is waiting for the payload
(because http-buffer-request is set for instance), the stream remains stuck
infinitely.
To fix the bug, we must request more space to the app layer only when it is
not possible to copied more data. Actually, this happens when data remain in
the input buffer while the H1 parser is in states MSG_DATA or MSG_TUNNEL, or
when we are unable to copy headers or trailers into a non-empty buffer.
The first condition is quite easy to handle. The second one requires an API
refactoring. h1_parse_msg_hdrs() and h1_parse_msg_tlrs() fnuctions have been
updated. Now it is possible to know when we need more space in the buffer to
copy headers or trailers (-2 is returned). In the H1 mux, a new H1S flag
(H1S_F_RX_CONGESTED) is used to track this state inside h1_process_demux().
This patch is part of a series related to the issue #1362. It should be
backported as far as 2.0, probably with some adaptations. So be careful
during backports.
During the packet loss detection we must treat the paquet number
in this order Initial -> Handshake -> O1RTT. This was not the case
due to the chosen order to implement the array of packet number space
which was there before the packet loss detection implementation.
The STREAM data to send coming from the upper layer must be stored until
having being acked by the peer. To do so, we store them in buffer structs,
one by stream (see qcs.tx.buf). Each time a STREAM is built by quic_push_frame(),
its offset must match the offset of the first byte added to the buffer (modulo
the size of the buffer) by the frame. As they are not always acknowledged in
order, they may be stored in eb_trees ordered by their offset to be sure
to sequentially delete the STREAM data from their buffer, in the order they
have been added to it.
This function does exactly the same thing as b_xfer() which transfers
data from a struct buffer to another one but without zero copy when
the destination buffer is empty. This is at least useful to transfer
h3 data to the QUIC mux from buffer with garbage medata which have
been used to build h3 frames without too much memcopy()/memmove().
The peer transport parameter values were not initialized with
the default ones (when absent), especially the
"active_connection_id_limit" parameter with 2 as default value
when absent from received remote transport parameters. This
had as side effect to send too much NEW_CONNECTION_ID frames.
This was the case for curl which does not announce any
"active_connection_id_limit" parameter.
Also rename ->idle_timeout to ->max_idle_timeout to reflect the RFC9000.
These salts are used to derive initial secrets to decrypt the first Initial packet.
We support draft-29 and v1 QUIC version initial salts.
Add parameters to our QUIC-TLS API functions used to derive these secret for
these salts.
Make our xprt_quic use the correct initial salt upon QUIC version field found in
the first paquet. Useful to support connections with curl which use draft-29
QUIC version.
Move the "ACK required" bit from the packet number space to the connection level.
Force the "ACK required" option when acknowlegding Handshake or Initial packet.
A client may send three packets with a different encryption level for each. So,
this patch modifies qc_treat_rx_pkts() to consider two encryption level passed
as parameters, in place of only one.
Make qc_conn_io_cb() restart its process after the handshake has succeeded
so that to process any Application level packets which have already been received
in the same datagram as the last CRYPTO frames in Handshake packets.
Make qc_prep_hdshk_pkts() and qui_conn_io_cb() handle the case
where we enter them with QUIC_HS_ST_COMPLETE or QUIC_HS_ST_CONFIRMED
as connection state with QUIC_TLS_ENC_LEVEL_APP and QUIC_TLS_ENC_LEVEL_NONE
to consider to prepare packets.
quic_get_tls_enc_levels() is modified to return QUIC_TLS_ENC_LEVEL_APP
and QUIC_TLS_ENC_LEVEL_NONE as levels to consider when coalescing
packets in the same datagram.
qc_build_pkt() has recently been modified to support any type of
supported frame at any encryption level (assuming that an encryption level does
not support any type of frame) but quic_tls_level_pkt_type()
prevented it from building application level packet type because it was written
only for the handshake.
This patch simply adds the remaining encryption level QUIC_TLS_ENC_LEVEL_APP
which must be supported by quic_tls_level_pkt_type().
This should be used by the function which build packets to prevent
it from failing. This is important when the packet numbers are consumed
by several threads. The packet number is used to build and encrypt packets
and must be incremented only and only if the packet it refers to has been
successfully built.
These structures are similar. quic_tx_frm was there to try to reduce the
size of such objects which embed a union for all the QUIC frames.
Furtheremore this patch fixes the issue where quic_tx_frm objects were freed
from the pool for quic_frame.
Make quic_rx_packet_ref(inc|dec)() functions be thread safe.
Make use of ->rx.crypto.frms_rwlock RW lock when manipulating RX frames
from qc_treat_rx_crypto_frms().
Modify atomically several variables attached to RX part of quic_enc_level struct.
->rx.crypto member of quic_enc_level struct was not initialized as
this was done for all other members of this structure. This patch
fixes this.
Also adds a RW lock for the frame of this member.
Add two functions to encrement or decrement a referenc counter
attached to TX packet structure (struct quic_tx_packet). The packet are freed
when their counters reach the null value.
We use only ring buffers (struct qring) to prepare and send QUIC datagrams.
We can safely remove the old buffering implementation which was not thread safe.
We modify the functions responsible of building packets to put these latters
in ring buffers (qc_build_hdshk_pkt() during the handshake step, and
qc_build_phdshk_apkt() during the post-handshake step). These functions
remove a ring buffer from its list to build as much as possible datagrams.
Eache datagram is prepended of two field: the datagram length and the
first packet in the datagram. We chain the packets belonging to the same datagram
in a singly linked list to reach them from the first one: indeed we must
modify some members of each packet when we really send them from send_ppkts().
This function is also modified to retrieved the datagram from ring buffers.
We initialize the pointer to the listener TX ring buffer list.
Note that this is not done for QUIC clients as we do not fully support them:
we only have to allocate the list and attach it to server struct I guess.
We allocate an array of QUIC ring buffer, one by thread, and arranges them in a
MT_LIST. Everything is allocated or nothing: we do not want to usse an incomplete
array of ring buffers to ensure that each thread may safely acquire one of these
buffers.
This implementation is inspired from Linux kernel circular buffer implementation
(see include/linux/circ-buf.h). Such buffers may be used at the same time both
by writer and reader (lock-free).
Modify the I/O dgram handler principal function used to parse QUIC packets
be thread safe. Its role is at least to create new incoming connections
add to two trees protected by the same RW lock. The packets are for now on
fully parsed before possibly creating new connections.
Move the connection state from quic_conn_ctx struct to quic_conn struct which
is the structure which is used to store the QUIC connection part information.
This structure is initialized by the I/O dgram handler for each new connection
to QUIC listeners. This is needed for the multithread support so that to not
to have to depend on the connection context potentially initialized by another
thread.
We must protect from concurrent the tree which stores the QUIC packets received
by the dgram I/O handler, these packets being also parsed by the xprt task.
Make depends qc_new_isecs() only on quic_conn struct initialization only (no more
dependency on connection struct initialization) to be able to run it as soon as
the quic_conn struct is initialized (from the I/O handler) before running ->accept()
quic proto callback.
Move the QUIC conn (struct quic_conn) initialization from quic_sock_accept_conn()
to qc_lstnr_pkt_rcv() as this is done for the server part.
Move the timer initialization to ->start xprt callback to ensure the connection
context is done : it is initialized by the ->accept callback which may be run
by another thread than the one for the I/O handler which also run ->start.
The name the maximum packet size transport parameter was ambiguous and replaced
by maximum UDP payload size. Our code would be also ambiguous if it does not
reflect this change.
This function calls quic_mux_transport_params_update() to update the related
streams transport parameter of the mux. It is there only so that not to have
to include mux_quic.h to update these parameters.
Add a new structure to store enough information about STREAM frames which
must be stored before being delivered to the application layer, for any
reason.
The flow control at stream level is organized by types (client bidi, server bidi,
client uni, server uni). Adds at least callback to retrieve the number
of available streams by direction.
This file has been derived from mux_h2.c removing all h2 parts. At
QUIC mux layer, there must not be any reference to http. This will be the
responsability of the application layer (h3) to open streams handled by the mux.
We move ->params transport parameters to ->rx.params. They are the
transport parameters which will be sent to the peer, and used for
the endpoint flow control. So, they will be used to received packets
from the peer (RX part).
Also move ->rx_tps transport parameters to ->tx.params. They are the
transport parameter which are sent by the peer, and used to respect
its flow control limits. So, they will be used when sending packets
to the peer (TX part).
appctx_new() is exclusively called with tid_bit and it only uses the
mask to pass it to the accompanying task. There is no point requiring
the caller to know about a mask there, nor is there any point in
creating an applet outside of the context of its own thread anyway.
Let's drop this and pass tid_bit to task_new() directly.
A new warning is reported by gcc11 when using a pointer to uninitialized
memory block for a function with a const pointer argument. The warning
is triggered for istalloc, used by http_client.c / proxy.c / tcpcheck.c.
This warning is reported because the uninitialized memory block
allocated by malloc should not be passed to a const argument as in ist2.
See https://gcc.gnu.org/onlinedocs/gcc-11.1.0/gcc/Warning-Options.html#index-Wmaybe-uninitialized
This should be backported up to 2.2.
Ilya reported in issue #1391 a build warning on Fedora about mallinfo()
being deprecated in favor of mallinfo2() since glibc-2.33. Let's add
support for it. This should be backported where the following commit is
also backported: 157e39303 ("MINOR: pools: automatically disable
malloc_trim() with external allocators").
"f(void)" is the correct and preferred form for a function taking no
argument, while some places use the older "f()". These were reported
by clang's -Wmissing-prototypes, for example:
src/cpuset.c:111:5: warning: no previous prototype for function 'ha_cpuset_size' [-Wmissing-prototypes]
int ha_cpuset_size()
include/haproxy/cpuset.h:42:5: note: this declaration is not a prototype; add 'void' to make it a prototype for a zero-parameter function
int ha_cpuset_size();
^
void
This aggregate patch fixes this for the following functions:
ha_backtrace_to_stderr(), ha_cpuset_size(), ha_panic(), ha_random64(),
ha_thread_dump_all_to_trash(), get_exec_path(), check_config_validity(),
mworker_child_nb(), mworker_cli_proxy_(create|stop)(),
mworker_cleantasks(), mworker_cleanlisteners(), mworker_ext_launch_all(),
mworker_reload(), mworker_(env|proc_list)_to_(proc_list|env)(),
mworker_(un|)block_signals(), proxy_adjust_all_maxconn(),
proxy_destroy_all_defaults(), get_tainted(),
pool_total_(allocated|used)(), thread_isolate(_full|)(),
thread(_sync|)_release(), thread_harmless_till_end(),
thread_cpu_mask_forced(), dequeue_all_listeners(), next_timer_expiry(),
wake_expired_tasks(), process_runnable_tasks(), init_acl(),
init_buffer(), (de|)init_log_buffers(), (de|)init_pollers(),
fork_poller(), pool_destroy_all(), pool_evict_from_local_caches(),
pool_total_failures(), dump_pools_to_trash(), cfg_run_diagnostics(),
tv_init_(process|thread)_date(), __signal_process_queue(),
deinit_signals(), haproxy_unblock_signals()
-Wundef triggered on a MIPS-based musl build on __WORDSIZE that's used
in ultoa_o() and some Lua initialization. The former will fail to convert
integers larger to 1 billion to proper string in this case. Let's make
sure this macro is defined and fall back to values determined from
__SIZEOF_LONG__ otherwise. A cleaner long-term approach would consist
in removing all remaining occurrences of this macro.
This can be backported to all versions.
Building with an old musl-based toolchain reported this warning:
include/haproxy/thread.h: In function 'ha_thread_relax':
include/haproxy/thread.h:256:5: warning: "_POSIX_PRIORITY_SCHEDULING" is not defined [-Wundef]
#if _POSIX_PRIORITY_SCHEDULING
^
There were indeed two "#if" insteadd of #ifdef" for this macro, let's
fix them.
When the header list is added, after the message parsing, headers with no
value are now ignored. It is not the same than headers with empty value
fields. Only headers with a NULL pointer as value are skipped. This only
happens if the header value is removed during the message
parsing. Concretly, such headers are now ignored when htx_add_all_headers()
is called. However, htx_add_header() is not affected by this change.
Symetrically, the same is true for trailers. It may be backported to 2.4
because of the previous fix ("BUG/MEDIUM: mux-h1: Remove "Upgrade:" header
for requests with payload").
There's no point taking the variables locks for sess/txn/req/res
contexts since these ones always run inside the same thread anyway.
This patch conditions the lock on the variable's scope to avoid
flushing cache lines when not needed.
This showed an improvement of ~5% on a 16-thread machine with 12
variables.
The global table of known variables names can only grow and was designed
for static names that are registered at boot. Nowadays it's possible to
set dynamic variable names from Lua or from the CLI, which causes a real
problem that was partially addressed in 2.2 with commit 4e172c93f
("MEDIUM: lua: Add `ifexist` parameter to `set_var`"). Please see github
issue #624 for more context.
This patch simplifies all this by removing the need for a central
registry of known names, and storing 64-bit hashes instead. This is
highly sufficient given the low number of variables in each context.
The hash is calculated using XXH64() which is bijective over the 64-bit
space thus is guaranteed collision-free for 1..8 chars. Above that the
risk remains around 1/2^64 per extra 8 chars so in practice this is
highly sufficient for our usage. A random seed is used at boot to seed
the hash so that it's not attackable from Lua for example.
There's one particular nit though. The "ifexist" hack mentioned above
is now limited to variables of scope "proc" only, and will only match
variables that were already created or declared, but will now verify
the scope as well. This may affect some bogus Lua scripts and SPOE
agents which used to accidentally work because a similarly named
variable used to exist in a different scope. These ones may need to be
fixed to comply with the doc.
Now we can sum up the situation as this one:
- ephemeral variables (scopes sess, txn, req, res) will always be
usable, regardless of any prior declaration. This effectively
addresses the most problematic change from the commit above that
in order to work well could have required some script auditing ;
- process-wide variables (scope proc) that are mentioned in the
configuration, referenced in a "register-var-names" SPOE directive,
or created via "set-var" in the global section or the CLI, are
permanent and will always accept to be set, with or without the
"ifexist" restriction (SPOE uses this internally as well).
- process-wide variables (scope proc) that are only created via a
set-var() tcp/http action, via Lua's set_var() calls, or via an
SPOE with the "force-set-var" directive), will not be permanent
but will always accept to be replaced once they are created, even
if "ifexist" is present
- process-wide variables (scope proc) that do not exist will only
support being created via the set-var() tcp/http action, Lua's
set_var() calls without "ifexist", or an SPOE declared with
"force-set-var".
This means that non-proc variables do not care about "ifexist" nor
prior declaration, and that using "ifexist" should most often be
reliable in Lua and that SPOE should most often work without any
prior declaration. It may be doable to turn "ifexist" to 1 by default
in Lua to further ease the transition. Note: regtests were adjusted.
Cc: Tim Dsterhus <tim@bastelstu.be>
We certainly do not want that a permanent variable (one that is listed
in the configuration) be erased by accident by an "unset-var" action.
Let's make sure these ones are only reset to an empty sample, like at
the moment of their initial registration. One trick is that the same
function is used to purge the memory at the end and to delete, so we
need to add an extra "force" argument to make the choice.
In order to continue to honor the ifexist Lua option and prevent rogue
SPOA agents from creating too many variables, we'll need to keep the
ability to mark certain proc.* variables as permanent when they're
known from the config file.
Let's add a flag there for this. It's added to the variable when the
variable is created with this flag set by the caller.
Another approach could have been to use a distinct list or distinct
scope but that sounds complicated and bug-prone.
Passing this flag to var_set() will result in the variable to only be
created if it did not exist, otherwise nothing is done (it's not even
updated). This will be used for pre-registering names.
When setting variables, there are currently two variants, one which will
always create the variable, and another one, "ifexist", which will only
create or update a variable if a similarly named variable in any scope
already existed before.
The goal was to limit the risk of injecting random names in the proc
scope, but it was achieved by making use of the somewhat limited name
indexing model, which explains the scope-agnostic restriction.
With this change, we're moving the check downwards in the chain, at the
variable level, and only variables under the scope "proc" will be subject
to the restriction. A new set of VF_* flags was added to adjust how
variables are set, and VF_UPDATEONLY is used to mention this restriction.
In this exact state of affairs, this is not completely exact, as if a
similar name was not known in any scope, the variable will continue to
be rejected like before, but this will change soon.
The vars_init() name is particularly confusing as it does not initialize
the variables code but the head of a list of variables passed in
arguments. And we'll soon need to have proper initialization code, so
let's rename it now.
In ticket #1348 some users expressed some concerns regarding the removal
of the "grace" directive from the proxies. Their use case very closely
mimmicks the original intent of the grace keyword, which is, let haproxy
accept traffic for some time when stopping, while indicating an external
LB that it's stopping.
This is implemented here by starting a task whose expiration triggers
the soft-stop for real. The global "stopping" variable is immediately
set however. For example, this below will be sufficient to instantly
notify an external check on port 9999 that the service is going down,
while other services remain active for 10s:
global
grace 10s
frontend ext-check
bind :9999
monitor-uri /ext-check
monitor fail if { stopping }
Ori Hollander of JFrog Security reported that htx_add_header() and
htx_add_trailer() were missing a length check on the header name. While
this does not allow to overwrite any memory area, it results in bits of
the header name length to slip into the header value length and may
result in forging certain header names on the input. The sad thing here
is that a FIXME comment was present suggesting to add the required length
checks :-(
The injected headers are visible to the HTTP internals and to the config
rules, so haproxy will generally stay synchronized with the server. But
there is one exception which is the content-length header field, because
it is already deduplicated on the input, but before being indexed. As
such, injecting a content-length header after the deduplication stage
may be abused to present a different, shorter one on the other side and
help build a request smuggling attack, or even maybe a response splitting
attack. CVE-2021-40346 was assigned to this problem.
As a mitigation measure, it is sufficient to verify that no more than
one such header is present in any message, which is normally the case
thanks to the duplicate checks:
http-request deny if { req.hdr_cnt(content-length) gt 1 }
http-response deny if { res.hdr_cnt(content-length) gt 1 }
This must be backported to all HTX-enabled versions, hence as far as 2.0.
In 2.3 and earlier, the functions are in src/htx.c instead.
Many thanks to Ori for his work and his responsible report!
Since commit "BUG/MINOR: config: reject configs using HTTP with bufsize
>= 256 MB" we are now sure that it's not possible anymore to have an HTX
block of a size 256 MB or more, even after concatenation thanks to the
tests for len >= htx_free_data_space(). Let's remove these now obsolete
comments.
A BUG_ON() was added in htx_add_blk() to track any such exception if
the conditions would change later, to complete the one that is performed
on the start address that must remain within the buffer.
In preparation for support default values when fetching variables, we
need to update the internal API to pass an extra argument to functions
vars_get_by_{name,desc} to provide an optional default value. This
patch does this and always passes NULL in this argument. var_to_smp()
was extended to fall back to this value when available.
The set-var() action is convenient because it preserves the input type
but it's a pain to deal with when trying to concatenate values. The
most recurring example is when it's needed to build a variable composed
of the source address and the source port. Usually it ends up like this:
tcp-request session set-var(sess.port) src_port
tcp-request session set-var(sess.addr) src,concat(":",sess.port)
This is even worse when trying to aggregate multiple fields from stick-table
data for example. Due to this a lot of users instead abuse headers from HTTP
rules:
http-request set-header(x-addr) %[src]:%[src_port]
But this requires some careful cleanups to make sure they won't leak, and
it's significantly more expensive to deal with. And generally speaking it's
not clean. Plus it must be performed for each and every request, which is
expensive for this common case of ip+port that doesn't change for the whole
session.
This patch addresses this limitation by implementing a new "set-var-fmt"
action which performs the same work as "set-var" but takes a format string
in argument instead of an expression. This way it becomes pretty simple to
just write:
tcp-request session set-var-fmt(sess.addr) %[src]:%[src_port]
It is usable in all rulesets that already support the "set-var" action.
It is not yet implemented for the global "set-var" directive (which already
takes a string) and the CLI's "set var" command, which would definitely
benefit from it but currently uses its own parser and engine, thus it
must be reworked.
The doc and regtests were updated.
For a long time we couldn't have arguments in expressions used in
tcp-request, tcp-response etc rules. But now due to the variables
it's possible, and their context in case of failure to resolve an
argument (e.g. backend name not found) is not properly reported
because there is no arg context values in ARGC_* to report them.
Let's add a number of missing ones for tcp-request {connection,
session,content}, tcp-response content, tcp-check, the config
parser (for "set-var" in the global section) and the CLI parser
(for "set-var" on the CLI).