haproxy

mirror of http://git.haproxy.org/git/haproxy.git/ synced 2025-02-20 20:57:00 +00:00

Author	SHA1	Message	Date
Willy Tarreau	2c1a9c3a43	OPTIM: vars: inline vars_prune() to avoid many calls Many configs don't have variables and call it for no reason, and even configs with variables don't necessarily have some in all scopes.	2024-09-15 23:42:09 +02:00
Willy Tarreau	b11495652e	BUG/MEDIUM: queue: implement a flag to check for the dequeuing As unveiled in GH issue #2711, commit `5541d4995d` ("BUG/MEDIUM: queue: deal with a rare TOCTOU in assign_server_and_queue()") does have some side effects in that it can occasionally cause an endless loop. As Christopher analysed it, the problem is that process_srv_queue(), which uses a trylock in order to leave only one thread in charge of the dequeuing process, can lose the lock race against pendconn_add(). If this happens on the last served request, then there's no more thread to deal with the dequeuing, and assign_server_and_queue() will loop forever on a condition that was initially expected to be extremely rare (and still is, except that now it can become sticky). Previously what was happening is that such queued requests would just time out and since that was very rare, nobody would notice. The root of the problem really is that trylock. It was added so that only one thread dequeues at a time but it doesn't offer only that guarantee since it also prevents a thread from dequeuing if another one is in the process of queuing. We need a different criterion. What we're doing now is to set a flag "dequeuing" in the server, which indicates that one thread is currently in the process of dequeuing requests. This one is atomically tested, and only if no thread is in this process, then the thread grabs the queue's lock and dequeues. This way it will be serialized with pendconn_add() and no request addition will be missed. It is not certain whether the original race covered by the fix above can still happen with this change, so better keep that fix for now. Thanks to @Yenya (Jan Kasprzak) for the precise and complete report allowing to spot the problem. This patch should be backported wherever the patch above was backported.	2024-09-13 08:35:47 +02:00
Aurelien DARRAGON	68cfb222b5	BUG/MEDIUM: pattern: prevent UAF on reused pattern expr Since `c5959fd` ("MEDIUM: pattern: merge same pattern"), UAF (leading to crash) can be experienced if the same pattern file (and match method) is used in two default sections and the first one is not referenced later in the config. In this case, the first default section will be cleaned up. However, due to an unhandled case in the above optimization, the original expr which the second default section relies on is mistakenly freed. This issue was discovered while trying to reproduce GH #2708. The issue was particularly tricky to reproduce given the config and sequence required to make the UAF happen. Fortunately, GitHub user @asmnek not only provided useful information, but since he was able to consistently trigger the crash in his environment he was able to nail down the crash to the use of pattern file involved with 2 named default sections. Big thanks to him. To fix the issue, let's push the logic from `c5959fd` a bit further. Instead of relying on "do_free" variable to know if the expression should be freed or not (which proved to be insufficient in our case), let's switch to a simple refcounting logic. This way, no matter who owns the expression, the last one attempting to free it will be responsible for freeing it. Refcount is implemented using a 32bit value which fills a previous 4 bytes structure gap: int mflags; /* 80 4 */ /* XXX 4 bytes hole, try to pack */ long unsigned int lock; /* 88 8 */ (output from pahole) Even though it was not reproduced in 2.6 or below by @asmnek (the bug was revealed thanks to another bugfix), this issue theoretically affects all stable versions (up to `c5959fd`), thus it should be backported to all stable versions.	2024-09-09 16:07:05 +02:00
Aaron Kuehler	50322dff81	MEDIUM: server: add init-state Allow the user to set the "initial state" of a server. Context: Servers are always set in an UP status by default. In some cases, further checks are required to determine if the server is ready to receive client traffic. This introduces the "init-state {up|down}" configuration parameter to the server. - when set to 'fully-up', the server is considered immediately available and can turn to the DOWN state when ALL health checks fail. - when set to 'up' (the default), the server is considered immediately available and will initiate a health check that can turn it to the DOWN state immediately if it fails. - when set to 'down', the server initially is considered unavailable and will initiate a health check that can turn it to the UP state immediately if it succeeds. - when set to 'fully-down', the server is initially considered unavailable and can turn to the UP state when ALL health checks succeed. The server's init-state is considered when the HAProxy instance is (re)started, a new server is detected (for example via service discovery / DNS resolution), a server exits maintenance, etc. Link: https://github.com/haproxy/haproxy/issues/51	2024-09-05 11:13:10 +02:00
Ilya Shipitsin	1f6e5f7a61	CLEANUP: assorted typo fixes in the code and comments This is the 43rd iteration of typo fixes	2024-09-03 17:49:21 +02:00
Christopher Faulet	a7f6b0ac03	MEDIUM: stick-table: Add support of a factor for IN/OUT bytes rates Add a factor parameter to stick-tables, called "brates-factor", that is applied to in/out bytes rates to work around the 32-bit limit of the frequency counters. Thanks to this factor, it is possible to have bytes rates beyond 4GB. Instead of counting each byte, we count blocks of bytes. Among other things, it will be useful for the bwlim filter, to be able to configure a shared limit exceeding 4GB/s. For now, this parameter must be in the range ]0-1024].	2024-09-02 15:50:25 +02:00
Aperence	20efb856e1	MEDIUM: protocol: add MPTCP per address support Multipath TCP (MPTCP), standardized in RFC8684 [1], is a TCP extension that enables a TCP connection to use different paths. Multipath TCP has been used for several use cases. On smartphones, MPTCP enables seamless handovers between cellular and Wi-Fi networks while preserving established connections. This use-case is what pushed Apple to use MPTCP since 2013 in multiple applications [2]. On dual-stack hosts, Multipath TCP enables the TCP connection to automatically use the best performing path, either IPv4 or IPv6. If one path fails, MPTCP automatically uses the other path. To benefit from MPTCP, both the client and the server have to support it. Multipath TCP is a backward-compatible TCP extension that is enabled by default on recent Linux distributions (Debian, Ubuntu, Redhat, ...). Multipath TCP is included in the Linux kernel since version 5.6 [3]. To use it on Linux, an application must explicitly enable it when creating the socket. No need to change anything else in the application. This attached patch adds MPTCP per address support, to be used with: mptcp{,4,6}@<address>[:port1[-port2]] MPTCP v4 and v6 protocols have been added: they are mainly a copy of the TCP ones, with small differences: names, proto, and receivers lists. These protocols are stored in __protocol_by_family, as an alternative to TCP, similar to what has been done with QUIC. By doing that, the size of __protocol_by_family has not been increased, and it behaves like TCP. MPTCP is both supported for the frontend and backend sides. Also added an example of configuration using mptcp along with a backend allowing to experiment with it. Note that this is a re-implementation of Björn's work from 3 years ago [4], when haproxy's internals were probably less ready to deal with this, causing his work to be left pending for a while. Currently, the TCP_MAXSEG socket option doesn't seem to be supported with MPTCP [5]. 
This results in a warning when trying to set the MSS of sockets in proto_tcp:tcp_bind_listener. This can be resolved by adding two new variables: sock_inet(6)_mptcp_maxseg_default that will hold the default value of the TCP_MAXSEG option. Note that for the moment, this will always be -1 as the option isn't supported. However, in the future, when the support for this option will be added, it should contain the correct value for the MSS, allowing to correctly set the TCP_MAXSEG option. Link: https://www.rfc-editor.org/rfc/rfc8684.html [1] Link: https://www.tessares.net/apples-mptcp-story-so-far/ [2] Link: https://www.mptcp.dev [3] Link: https://github.com/haproxy/haproxy/issues/1028 [4] Link: https://github.com/multipath-tcp/mptcp_net-next/issues/515 [5] Co-authored-by: Dorian Craps <dorian.craps@student.vinci.be> Co-authored-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>	2024-08-30 18:53:49 +02:00
Aperence	38618822e1	MINOR: server: add an alt_proto field for server Add a new field alt_proto to the server structures that specifies if an alternate protocol should be used for this server. This field can be transparently passed to protocol_lookup to get an appropriate protocol structure. This change thus allows creating servers with different protocols, and not only TCP anymore.	2024-08-30 18:53:49 +02:00
Aperence	a7b04e383a	MINOR: tools: extend str2sa_range to add an alt parameter Add a new parameter "alt" that will store whether this configuration uses an alternate protocol. This alt pointer will contain a value that can be transparently passed to protocol_lookup to obtain an appropriate protocol structure. This change is needed to allow, for example, the servers to know if they need to use an alternate protocol or not.	2024-08-30 18:53:49 +02:00
Frederic Lecaille	f627b9272b	BUG/MEDIUM: quic: always validate sender address on 0-RTT Wedl Michael, a student at the University of Applied Sciences St. Poelten, reported a potential vulnerability in haproxy as described below. An attacker could have obtained a TLS session ticket after having established a connection to an haproxy QUIC listener, using its real IP address. The attacker does not even have to send an application level request (HTTP3). Then the attacker could open a 0-RTT session with a spoofed IP address trusted by the QUIC listener to bypass IP allow/block lists and send HTTP3 requests. To mitigate this vulnerability, it was decided to use a token which can be provided to the client each time it successfully manages to connect to haproxy. These tokens may be reused for future connections to validate the address/path of the remote peer, as is done with the Retry token which is used for the current connection, not the next one. Such tokens are transported by NEW_TOKEN frames which were not used at this time by haproxy. So, each time a client connects to an haproxy QUIC listener with 0-RTT enabled, it is provided with such a token which can be reused for the next 0-RTT session. If no such token is presented by the client, haproxy checks if the session is a 0-RTT one, so with early-data presented by the client. Contrary to the Retry token, the decision to refuse the connection is made only when the TLS stack has been provided with enough early-data from the Initial ClientHello TLS message and when these data have been accepted. Hopefully, this event arrives fast enough to allow haproxy to kill the connection if some early-data have been accepted without token presented by the client. quic_build_post_handshake_frames() has been modified to build a NEW_TOKEN frame with this newly implemented token to be transported inside. 
quic_tls_derive_retry_token_secret() was renamed to quic_do_tls_derive_token_secret() and modified to be reused and derive the secret for the new token implementation. quic_token_validate() has been implemented to validate both the Retry and the new token implemented by this patch. When this is a non-retry token which could not be validated, the datagram received is marked as requiring a Retry packet to be sent, and no connection is created. When the Initial packet does not embed any non-retry token and if 0-RTT is enabled the connection is marked with this new flag: QUIC_FL_CONN_NO_TOKEN_RCVD. As soon as the TLS stack detects that some early-data have been provided and accepted by the client, the connection is marked to be killed (QUIC_FL_CONN_TO_KILL) from ha_quic_add_handshake_data(). This is done by calling the new qc_ssl_eary_data_accepted() function. The secret TLS handshake is interrupted as soon as possible by returning 0 from ha_quic_add_handshake_data(). The connection is also marked as requiring a Retry packet to be sent (QUIC_FL_CONN_SEND_RETRY) from ha_quic_add_handshake_data(). Then the handshake I/O handler (quic_conn_io_cb()) knows how to behave: kill the connection after having sent a Retry packet. About TLS stack compatibility, this patch is supported by aws-lc. It is disabled for wolfssl which does not support 0-RTT at this time thanks to HAVE_SSL_0RTT_QUIC. This patch depends on these commits: MINOR: quic: Add trace for QUIC_EV_CONN_IO_CB event. MINOR: quic: Implement qc_ssl_eary_data_accepted(). MINOR: quic: Modify NEW_TOKEN frame structure (qf_new_token struct) BUG/MINOR: quic: Missing incrementation in NEW_TOKEN frame builder MINOR: quic: Token for future connections implementation. MINOR: quic: Implement quic_tls_derive_token_secret(). MINOR: tools: Implement ipaddrcpy(). Must be backported as far as 2.6.	2024-08-30 17:04:09 +02:00
Frederic Lecaille	609b124561	MINOR: quic: Implement qc_ssl_eary_data_accepted(). This function is a wrapper around SSL_get_early_data_status() for OpenSSL-derived stacks and SSL_early_data_accepted() for BoringSSL-derived stacks like AWS-LC. It returns true for a TLS server if it has accepted the early data received from a client. Also implement quic_ssl_early_data_status_str() which is dedicated to be used for debugging purposes (traces). This function converts the enum returned by the two functions mentioned above to a human readable string.	2024-08-30 17:04:09 +02:00
Frederic Lecaille	e926378375	MINOR: quic: Modify NEW_TOKEN frame structure (qf_new_token struct) Modify qf_new_token structure to use a static buffer with QUIC_TOKEN_LEN as size as defined by the token for future connections (quic_token.c). Modify consequently the NEW_TOKEN frame parser (see quic_parse_new_token_frame()). Also add comments to denote that the NEW_TOKEN parser function is used only by clients and that its builder is used only by servers.	2024-08-30 17:04:09 +02:00
Frederic Lecaille	f5b09dc452	MINOR: quic: Token for future connections implementation. There exist two sorts of tokens used by QUIC. They are both used to validate the peer address (path validation). Retry tokens are used for the current connection the client wants to open. This patch implements the other sort of token which, after having been received from a connection, may be provided for the next connection from the same IP address to validate it (or validate the network path between the client and the server). The token generation is implemented by quic_generate_token(), and the token validation by quic_token_chek(). The same method is used as for Retry tokens to build such tokens to be reused for future connections. The format is very simple: one byte for the format identifier to distinguish these new tokens from the Retry token, followed by a 32-bit timestamp. As this part is ciphered with AEAD as cryptographic algorithm, 16 bytes are needed for the AEAD tag. 16 more random bytes are added to this token as a salt to derive the AEAD secret used to cipher the token. In addition to this salt, the client IP address is also used as AAD to derive the AEAD secret. So, the length of the token is fixed: 37 bytes.	2024-08-30 17:04:09 +02:00
Frederic Lecaille	74caa0eece	MINOR: quic: Implement quic_tls_derive_token_secret(). This function is similar to quic_tls_derive_retry_token_secret(). Its aim is to derive the secret used to cipher the token to be used for future connections. This patch reuses the code of quic_tls_derive_retry_token_secret() to produce a more generic function: quic_do_tls_derive_token_secret(). Two arguments are added to the latter so that both quic_tls_derive_retry_token_secret() and the new quic_tls_derive_token_secret() function can call quic_do_tls_derive_token_secret().	2024-08-30 17:04:09 +02:00
Frederic Lecaille	fb7a092203	MINOR: tools: Implement ipaddrcpy(). Implement ipaddrcpy() new function to copy only the IP address from a sockaddr_storage struct object into a buffer.	2024-08-30 17:04:09 +02:00
Nicolas CARPi	a33407b499	CLEANUP: mqtt: fix typo in MQTT_REMAINING_LENGHT_MAX_SIZE There was a typo in the macro name, where LENGTH was incorrectly written. This didn't cause any issue because the typo appeared in all occurrences in the codebase.	2024-08-30 14:58:59 +02:00
Christopher Faulet	62c9d51ca4	BUG/MINOR: proxy: Match on 429 status when trying to perform a L7 retry Support for 429 was recently added to L7 retries (`0d142e075` "MINOR: proxy: Add support of 429-Too-Many-Requests in retry-on status"). But the l7_status_match() function was not properly updated. The switch statement must match the 429 status to be able to perform a L7 retry. This patch must be backported if the commit above is backported. It is related to #2687.	2024-08-30 12:13:32 +02:00
Christopher Faulet	0d142e0756	MINOR: proxy: Add support of 429-Too-Many-Requests in retry-on status The "429" status can now be specified on retry-on directives. PR_RE_* flags were updated to remain sorted. This patch should fix the issue #2687. It is quite simple so it may safely be backported to 3.0 if necessary.	2024-08-28 10:05:34 +02:00
William Lallemand	e8fecef0ff	MEDIUM: ssl: capture the signature_algorithms extension from Client Hello Activate the capture of the TLS signature_algorithms extension from the Client Hello. This list is stored in the ssl_capture buffer when the global option "tune.ssl.capture-cipherlist-size" is enabled.	2024-08-26 15:17:40 +02:00
William Lallemand	ce7fb6628e	MEDIUM: ssl: capture the supported_versions extension from Client Hello Activate the capture of the TLS supported_versions extension from the Client Hello. This list is stored in the ssl_capture buffer when the global option "tune.ssl.capture-cipherlist-size" is enabled.	2024-08-26 15:12:42 +02:00
Valentine Krasnobaeva	7b78e1571b	MINOR: mworker: restore initial env before wait mode This patch is the follow-up of `1811d2a6ba` (MINOR: tools: add helpers to backup/clean/restore env). In order to avoid unexpected behaviour in master-worker mode during the process reload with a new configuration, when the old one contained '*env' keywords, let's backup its initial environment before calling parse_cfg() and let's clean and restore it in the context of the master process, just before it enters a wait polling loop. This will guarantee that new workers will have a new updated environment and not the previous one inherited from the master, which does not read the configuration when it's in wait mode.	2024-08-23 17:06:59 +02:00
Valentine Krasnobaeva	1811d2a6ba	MINOR: tools: add helpers to backup/clean/restore env 'setenv', 'presetenv', 'unsetenv', 'resetenv' keywords in configuration could modify the process runtime environment. In case of master-worker mode this creates a problem, as the configuration is read only once before forking a worker and then the master process does the reexec without reading any config files, just to free the memory. So, during the reload a new worker process will be created, but it will inherit the previous unchanged environment from the master in wait mode, thus it won't benefit from the changes in configuration related to '*env' keywords. This may cause unexpected behavior or some parser errors in master-worker mode. So, let's add a helper to backup all process env variables just before it reads its configuration. And let's also add helpers to clean up the current runtime environment and to restore it to its initial state (as it was before parsing the config).	2024-08-23 17:06:33 +02:00
Willy Tarreau	2a799b64b0	MINOR: protocol: add the real address family to the protocol For custom families, there's sometimes an underlying real address and it would be nice to be able to directly use the real family in calls to bind() and connect() without having to add explicit checks for exceptions everywhere. Let's add a .real_family field to struct proto_fam for this. For now it's always equal to the family except for non-transferable ones such as rhttp where it's equal to the custom one (anything else could fit).	2024-08-21 17:37:46 +02:00
Willy Tarreau	ba4a416c66	MINOR: protocol: add a family lookup At plenty of places we have access to an address family which may include some custom addresses but we cannot simply convert them to the real families without performing some random protocol lookups. Let's simply add a proto_fam table like we have for the protocols. The protocols could even be indexed there, but for now it's not worth it.	2024-08-21 16:46:15 +02:00
Willy Tarreau	732913f848	MINOR: protocol: properly assign the sock_domain and sock_family When we finally split sock_domain from sock_family in 2.3, something was not cleanly finished. The family is what should be stored in the address while the domain is what is supposed to be passed to socket(). But for the custom addresses, we did the opposite, just because the protocol_lookup() function was acting on the domain, not the family (both of which are equal for non-custom addresses). This is an API bug but there's no point backporting it since it does not have visible effects. It was visible in the code since a few places were using PF_UNIX while others were comparing the domain against AF_MAX instead of comparing the family. This patch clarifies this in the comments on top of proto_fam, addresses the indexing issue and properly reconfigures the two custom families.	2024-08-21 16:46:15 +02:00
Willy Tarreau	67bf1d6c9e	MINOR: quic: support a tolerance for spurious losses Tests performed between a 1 Gbps connected server and a 100 mbps client, distant by 95ms showed that: - we need 1.1 MB in flight to fill the link - rare but inevitable losses are sufficient to make cubic's window collapse fast and long to recover - a 100 MB object takes 69s to download - tolerance for 1 loss between two ACKs suffices to shrink the download time to 20-22s - 2 losses go to 17-20s - 4 losses reach 14-17s At 100 concurrent connections that fill the server's link: - 0 loss tolerance shows 2-3% losses - 1 loss tolerance shows 3-5% losses - 2 loss tolerance shows 10-13% losses - 4 loss tolerance shows 23-29% losses As such while there can be a significant gain sometimes in setting this tolerance above zero, it can also significantly waste bandwidth by sending far more than can be received. While it's probably not a solution to real world problems, it repeatedly proved to be a very effective troubleshooting tool helping to figure different root causes of low transfer speeds. In spirit it is comparable to the no-cc congestion algorithm, i.e. it must not be used except for experimentation.	2024-08-21 08:34:30 +02:00
Willy Tarreau	fab0e99aa1	MINOR: quic: store the lost packets counter in the quic_cc_event element Upon loss detection, qc_release_lost_pkts() notifies congestion controllers about the event and its final time. However it does not pass the number of lost packets, that can provide useful hints for some controllers. Let's just pass this option.	2024-08-21 08:02:44 +02:00
Amaury Denoyelle	0d6112b40b	MINOR: mux-quic: retry after small buf alloc failure Previous commit switched to small buffers for HTTP/3 HEADERS emission. This ensures that several parallel streams can allocate their own buffer without hitting the connection buffer limit based now on the congestion window size. However, this prevents the transmission of responses with uncommonly large headers. Indeed, if all headers cannot be encoded in a single buffer, an error is reported which causes the whole connection closure. Adjust this by implementing a realloc API exposed by QUIC MUX. This allows the application layer to switch from a small to a default buffer and restart its processing. This guarantees again that headers not longer than bufsize can be properly transferred.	2024-08-20 18:12:27 +02:00
Amaury Denoyelle	885e4c5cf8	MINOR: quic: support sbuf allocation in quic_stream This patch extends qc_stream_desc API to be able to allocate small buffers. QUIC MUX API is similarly updated as ultimately each application protocol is responsible to choose between a default or a smaller buffer. Internally, the type of allocated buffer is remembered via qc_stream_buf instance. This is mandatory to ensure that the buffer is released in the correct pool, in particular as small and standard buffers can be configured with the same size. This commit is purely an API change. For the moment, small buffers are not used. This will be changed in a dedicated patch.	2024-08-20 18:12:27 +02:00
Amaury Denoyelle	d0d8e57d47	MINOR: quic: define sbuf pool Define a new buffer pool reserved to allocate smaller memory area. For the moment, its usage will be restricted to QUIC, as such it is declared in quic_stream module. Add a new config option "tune.bufsize.small" to specify the size of the allocated objects. A special check ensures that it is not greater than the default bufsize to avoid unexpected effects.	2024-08-20 18:12:27 +02:00
Amaury Denoyelle	1de5f718cf	MINOR: quic/config: adapt settings to new conn buffer limit QUIC MUX buffer allocation limit is now directly based on the underlying congestion window size. The previous static limit based on conn-tx-buffers is now unused. As such, this commit adds a warning to inform users that it is now obsolete. Secondly, update max-window-size setting. It is now the main entrypoint to limit both the maximum congestion window size and the number of QUIC MUX allocated buffers on emission. Remove its special value '0' which was used to automatically adjust it on now unused conn-tx-buffers.	2024-08-20 17:59:35 +02:00
Amaury Denoyelle	aeb8c1ddc3	MAJOR: mux-quic: allocate Tx buffers based on congestion window Each QUIC MUX may allocate buffers for MUX stream emission. These buffers are then shared with quic_conn to handle ACK reception and retransmission. A limit on the number of concurrent buffers used per connection has been defined statically and can be updated via a configuration option. This commit replaces the limit to instead use the current underlying congestion window size. The purpose of this change is to remove the artificial static buffer count limit, which may be difficult to choose. Indeed, if a connection performs with minimal loss rate, the buffer count would severely limit its throughput. It could be increased to fix this, but it also impacts other connections, even with less optimal performance, causing too much extra data buffering on the MUX layer. By using the dynamic congestion window size, haproxy ensures that MUX buffering corresponds roughly to the network conditions. Using QCC <buf_in_flight>, a new buffer can be allocated if it is less than the current window size. If not, QCS emission is interrupted and haproxy stream layer will subscribe until a new buffer is ready. One of the critical parts is to ensure that MUX layer previously blocked on buffer allocation is properly woken up when sending can be retried. This occurs on two occasions: * after an already used Tx buffer is cleared on ACK reception. This case is already handled by qcc_notify_buf() via quic_stream layer. * on congestion window increase. A new qcc_notify_buf() invocation is added into qc_notify_send(). Finally, remove <avail_bufs> QCC field which is now unused. This commit is labelled MAJOR as it may have unexpected effect and could cause significant behavior change. For example, in previous implementation QUIC MUX would be able to buffer more data even if the congestion window is small. 
With this patch, data cannot be transferred from the stream layer which may cause more streams to be shut down on client timeout. Another effect may be more CPU consumption as the connection limit would be hit more often, causing more streams to be interrupted and woken up in cycle.	2024-08-20 17:17:17 +02:00
Amaury Denoyelle	000976af58	MINOR: mux-quic: define buf_in_flight Define a new QCC counter named <buf_in_flight>. Its purpose is to account the current sum of all allocated stream buffer sizes used on emission. For the moment, this counter is updated on buffer allocation and deallocation. It will be used to replace <avail_bufs> once congestion window is used as limit for buffer allocation in a future commit.	2024-08-20 17:17:17 +02:00
Amaury Denoyelle	4c4bf26f44	MEDIUM: mux-quic: implement API to ignore txbuf limit for some streams Define a new qc_stream_desc flag QC_SD_FL_OOB_BUF. This is to mark streams which are not subject to the connection limit on allocated MUX stream buffer. The purpose is to simplify handling of QUIC MUX streams which do not transfer data and as such are not driven by haproxy layer, for example HTTP/3 control stream. These streams interact synchronously with QUIC MUX and cannot retry emission in case of temporary failure. This commit will be useful once connection buffer allocation limit is reimplemented to directly rely on the congestion window size. This will probably cause the buffer limit to be reached more frequently, maybe even on QUIC MUX initialization. As such, it will be possible to mark control streams and prevent them from being subject to the buffer limit. QUIC MUX exposes a new function qcs_send_metadata(). It can be used by an application protocol to specify which streams are used for control exchanges. For the moment, no such stream uses this mechanism.	2024-08-20 17:17:17 +02:00
Amaury Denoyelle	f4d1bd0b76	MINOR: mux-quic: account stream txbuf in QCC A limit per connection is put on the number of buffers allocated by QUIC MUX for emission across all its streams. This ensures memory consumption remains under control. This limit is simply explained as a count of buffers which can be concurrently allocated for each connection. As such, quic_conn structure was used to account currently allocated buffers. However, a quic_conn never allocates new stream buffers. This is only done at QUIC MUX layer. As such, this commit moves buffer accounting inside QCC structure. This simplifies the API, most notably qc_stream_buf_alloc() usage. Note that this commit inverts the accounting. Previously, it was initially set to 0 and incremented for each allocated buffer. Now, it is set to the maximum value and decremented for each buf usage. This is considered as clearer to use.	2024-08-20 17:17:17 +02:00
Amaury Denoyelle	c24c8667b2	MINOR: quic: define max-window-size config setting Define a new global keyword tune.quic.frontend.max-window-size. This allows to set globally the maximum congestion window size for each QUIC frontend connection. The default value is 0. It is a special value which automatically derives the size from the configured QUIC connection buffer limit. This is similar to the previous "quic-cc-algo" behavior, which can be used to override the maximum window size per bind line.	2024-08-20 17:02:29 +02:00
Valentine Krasnobaeva	8b1dfa9def	MINOR: cfgparse: limit file size loaded via /dev/stdin load_cfg_in_mem() can continuously reallocate memory in order to load an extremely large input from /dev/stdin, until it fails with ENOMEM, which means that process has consumed all available RAM. In case of containers and virtualized environments it's not very good. So, in order to prevent this, let's introduce MAX_CFG_SIZE as 10MB, which will limit the size of input supplied via /dev/stdin.	2024-08-20 14:28:34 +02:00
Nathan Wehrman	fd48b28315	MINOR: Implements new log format of option tcplog clf Some systems require log formats in the CLF format and that meant that I could not send my logs for proxies in mode tcp to those servers. This implements a format that uses log variables that are compatible with TCP mode frontends and replaces traditional HTTP values in the CLF format to make them stand out. Instead of logging method and URI like this "GET /example HTTP/1.1" it will log "TCP " and for a response code I used "000" so it would be easy to separate from legitimate HTTP traffic. Now your log servers that require a CLF format can see the timings for TCP traffic as well as HTTP.	2024-08-20 07:46:34 +02:00
Aurelien DARRAGON	f8299bc5ea	MINOR: log: "drop" support for log-profile steps It is now possible to use "drop" keyword for "on" lines under a log-profile section to specify that no log at all should be emitted for the specified step (setting an empty format was not sufficient to do so because only the log payload would be empty, not the log header, thus the log would still be emitted). It may be useful to selectively disable logging at specific steps for a given log target (since the log profile may be set on log directives): log-profile myprof on request format "blabla" sd "custom sd" on response drop New testcase was added to reg-tests/log/log_profiles.vtc	2024-08-19 18:53:01 +02:00
William Lallemand	b2a8e8731d	MINOR: channel: implement ci_insert() function ci_insert() is a function which allows to insert a string <str> of size <len> at <pos> of the input buffer. This is the equivalent of ci_insert_line2() but without inserting '\r\n'	2024-08-08 17:29:37 +02:00
Valentine Krasnobaeva	c6cfa7cb4a	MINOR: startup: rename readcfgfile in parse_cfg As readcfgfile no longer opens configuration files and reads them with fgets, but performs only the parsing of provided data, let's rename it to parse_cfg by analogy with read_cfg in haproxy.c.	2024-08-07 18:41:41 +02:00
Valentine Krasnobaeva	5b52df4c4d	MEDIUM: startup: load and parse configs from memory Let's call load_cfg_in_ram() helper for each configuration file to load its content in some area in memory. Adapt readcfgfile() parser function respectively. In order to limit changes in its scope we give as an argument a cfgfile structure, already filled in init_args() and in load_cfg_in_ram() with file metadata and content. Parser function (readcfgfile()) uses now fgets_from_mem() instead of standard fgets from libc implementations. SPOE filter parses its own configuration file, pointed by 'config' keyword in the configuration already loaded in memory. So, let's allocate and fill for this a supplementary cfgfile structure, which is not referenced in cfg_cfgfiles list. This structure and the memory with content of SPOE filter configuration are freed immediately in parse_spoe_flt(), when readcfgfile() returns. HAProxy OpenTracing filter also uses its own configuration file. So, let's follow the same logic as we do for SPOE filter.	2024-08-07 18:41:41 +02:00
Valentine Krasnobaeva	007f7f2f02	MINOR: tools: add fgets_from_mem Add fgets_from_mem() helper to read lines from configuration files, stored now as memory chunks. In order to limit changes in the first-level parser code (readcfgfile()), it is better to reimplement the standard fgets, i.e. to have a fgets, which can read the serialized data line by line from some memory area, instead of file stream, and can keep the same behaviour as libc implementations fgets.	2024-08-07 18:41:41 +02:00
Valentine Krasnobaeva	5b9ed6e4be	MINOR: cfgparse: add load_cfg_in_mem Add load_cfg_in_mem() helper, which allows to store the content of a given file in memory.	2024-08-07 18:41:41 +02:00
Valentine Krasnobaeva	bafb0ce272	MINOR: startup: adapt list_append_word to use cfgfile list_append_word() helper was used before only to chain configuration file names in a list. As now we start to use cfgfile structure which represents entire file in memory and its metadata, let's adapt this helper to use this structure and let's rename it to list_append_cfgfile(). Adapt functions, which process configuration files and directories to use cfgfile structure and list_append_cfgfile() instead of wordlist.	2024-08-07 18:41:41 +02:00
Valentine Krasnobaeva	39f2a19620	REORG: tools: move list_append_word to cfgparse Let's move list_append_word to cfgparse.c as it is used only to fill cfg_cfgfiles list with configuration file names.	2024-08-07 18:41:41 +02:00
Valentine Krasnobaeva	70b842e847	MINOR: cfgparse: add struct cfgfile to represent config in memory This and following commits serve to prepare loading configuration files in memory, before parsing them, as we may need to parse some parts of configuration in different moments of the startup sequence. This is a case of the new master-worker initialization process. Here we need to read at first only the global and the program sections and only after some steps (forking worker, etc) the rest of the configuration. Add a new structure cfgfile to keep configuration files metadata and content, loaded somewhere in a memory. Instances of filled cfgfile structures could be chained in a list, as the order in which they were loaded is important.	2024-08-07 18:41:41 +02:00
Willy Tarreau	10c8baca44	MINOR: trace: add a per-source helper to pre-fill the context Now sources which want to do it can provide a helper that can pre-fill some fields in the context based on their knowledge (e.g. mux streams).	2024-08-07 16:02:59 +02:00
Willy Tarreau	7d55a70f5a	MINOR: trace: move the known trace context into a dedicated struct We now have a trace_ctx to hold the sess, conn, qc, stream and so on. This will allow us to pass it across layers so that other helpers can help fill them. Ideally it should be passed as an argument to __trace_enabled() by __trace() so that it can be passed back to the trace callback. But it seems that trace callbacks are smart enough to figure all their info when they need them.	2024-08-07 16:02:59 +02:00
Willy Tarreau	d465610ec3	MEDIUM: trace: implement a "follow" mechanism With "follow" from one source to another, it becomes possible for a source to automatically follow another source's tracked pointer. The best example is the session: - the "session" source is enabled and has a "lockon session" -> its lockon_ptr is equal to the session when valid - other sources (h1,h2,h3 etc) are configured for "follow session" and will then automatically check if session's lockon_ptr matches its own session, in which case tracing will be enabled for that trace (no state change). It's not necessary to start/pause/stop traces when using this, only "follow" followed by a source with lockon enabled is needed. Some combinations might work better than others. At the moment the session is almost never known from the backend, but this may improve. The meta-source "all" is supported for the follower so that all sources will follow the tracked one.	2024-08-07 16:02:59 +02:00
Amaury Denoyelle	9f829ea3f3	MINOR: mux-quic: measure QCS lifetime and its blocking state Reuse newly defined tot_time structure to measure various values related to a QCS lifetime. First, a timer is used to account for the total QCS lifetime. Then, two other timers are used to account the total time during which Tx from stream layer to MUX is blocked, either on lack of buffer or due to flow-control. These three timers are reported in qmux_dump_qcs_info(). Thus, they are available in traces and for QUIC MUX debug string sample.	2024-08-07 15:40:52 +02:00
Amaury Denoyelle	a6e2523ca1	MINOR: time: define tot_time structure Define a new utility type tot_time. Its purpose is to be able to account elapsed time across multiple periods. Functions are defined to easily start and stop measures, and return the current value.	2024-08-07 15:40:52 +02:00
Amaury Denoyelle	663416b4ef	MINOR: quic: dump quic_conn debug string for logs Define a new xprt_ops callback named dump_info. This can be used to extend MUX debug string with infos from the lower layer. Implement dump_info for QUIC stack. For now, only minimal info are reported : bytes in flight and size of the sending window. This should allow to detect if the congestion controller is fine. These info are reported via QUIC MUX debug string sample.	2024-08-07 15:40:52 +02:00
Amaury Denoyelle	eb4dfa3b36	MINOR: mux-quic: define dump functions for QCC and QCS Extract trace code to dump QCC and QCS instances into dedicated functions named qmux_dump_qc{c,s}_info(). This will allow to easily print QCC/QCS infos outside of traces.	2024-08-07 15:40:52 +02:00
Willy Tarreau	921e04bf87	MINOR: stconn: add a new pair of sf functions {bs,fs}.debug_str These are passed to the underlying mux to retrieve debug information at the mux level (stream/connection) as a string that's meant to be added to logs. The API is quite complex just because we can't pass any info to the bottom function. So we construct a union and pass the argument as an int, and expect the callee to fill that with its buffer in return. Most likely the mux->ctl and ->sctl API should be reworked before the release to simplify this. The functions take an optional argument that is a bit mask of the layers to dump: muxs=1 muxc=2 xprt=4 conn=8 sock=16 The default (0) logs everything available.	2024-08-07 14:07:41 +02:00
Amaury Denoyelle	e177cf341c	BUG/MEDIUM: quic: handle retransmit for standalone FIN STREAM STREAM frames have dedicated handling on retransmission. A special check is done to remove data already acked in case of duplicated frames, thus only unacked data are retransmitted. This handling is faulty in case of an empty STREAM frame with FIN set. On retransmission, this frame does not cover any unacked range as it is empty and is thus discarded. This may cause the transfer to freeze with the client waiting indefinitely for the FIN notification. To handle retransmission of empty FIN STREAM frame, qc_stream_desc layer has been extended. A new flag QC_SD_FL_WAIT_FOR_FIN is set by MUX QUIC when FIN has been transmitted. If set, it prevents qc_stream_desc from being freed until FIN is acknowledged. On retransmission side, qc_stream_frm_is_acked() has been updated. It now reports false if FIN bit is set on the frame and qc_stream_desc has QC_SD_FL_WAIT_FOR_FIN set. This must be backported up to 2.6. However, this modifies heavily critical section for ACK handling and retransmission. As such, it must be backported only after a period of observation. This issue can be reproduced by using the following socat command as server to add delay between the response and connection closure : $ socat TCP-LISTEN:<port>,fork,reuseaddr,crlf SYSTEM:'echo "HTTP/1.1 200 OK"; echo ""; sleep 1;' On the client side, ngtcp2 can be used to simulate packet drop. Without this patch, connection will be interrupted on QUIC idle timeout or haproxy client timeout with ERR_DRAINING on ngtcp2 : $ ngtcp2-client --exit-on-all-streams-close -r 0.3 <host> <port> "http://<host>:<port>/?s=32o" Alternatively to ngtcp2 random loss, an extra haproxy patch can also be used to force skipping the emission of the empty STREAM frame : diff --git a/include/haproxy/quic_tx-t.h b/include/haproxy/quic_tx-t.h index efbdfe687..1ff899acd 100644 --- a/include/haproxy/quic_tx-t.h +++ b/include/haproxy/quic_tx-t.h @@ -26,6 +26,8 @@ extern struct pool_head pool_head_quic_cc_buf; /* Flag a sent packet as being probing with old data */ #define QUIC_FL_TX_PACKET_PROBE_WITH_OLD_DATA (1UL << 5) +#define QUIC_FL_TX_PACKET_SKIP_SENDTO (1UL << 6) + /* Structure to store enough information about TX QUIC packets. */ struct quic_tx_packet { /* List entry point. */ diff --git a/src/quic_tx.c b/src/quic_tx.c index 2f199ac3c..2702fc9b9 100644 --- a/src/quic_tx.c +++ b/src/quic_tx.c @@ -318,7 +318,7 @@ static int qc_send_ppkts(struct buffer *buf, struct ssl_sock_ctx *ctx) tmpbuf.size = tmpbuf.data = dglen; TRACE_PROTO("TX dgram", QUIC_EV_CONN_SPPKTS, qc); - if (!skip_sendto) { + if (!skip_sendto && !(first_pkt->flags & QUIC_FL_TX_PACKET_SKIP_SENDTO)) { int ret = qc_snd_buf(qc, &tmpbuf, tmpbuf.data, 0, gso); if (ret < 0) { if (gso && ret == -EIO) { @@ -354,6 +354,7 @@ static int qc_send_ppkts(struct buffer *buf, struct ssl_sock_ctx *ctx) qc->cntrs.sent_bytes_gso += ret; } } + first_pkt->flags &= ~QUIC_FL_TX_PACKET_SKIP_SENDTO; b_del(buf, dglen + QUIC_DGRAM_HEADLEN); qc->bytes.tx += tmpbuf.data; @@ -2066,6 +2067,17 @@ static int qc_do_build_pkt(unsigned char *pos, const unsigned char *end, continue; } + switch (cf->type) { + case QUIC_FT_STREAM_8 ... QUIC_FT_STREAM_F: + if (!cf->stream.len && (qc->flags & QUIC_FL_CONN_TX_MUX_CONTEXT)) { + TRACE_USER("artificially drop packet with empty STREAM frame", QUIC_EV_CONN_TXPKT, qc); + pkt->flags |= QUIC_FL_TX_PACKET_SKIP_SENDTO; + } + break; + default: + break; + } + quic_tx_packet_refinc(pkt); cf->pkt = pkt; }	2024-08-07 11:03:32 +02:00
Amaury Denoyelle	714009b7bc	MINOR: quic: implement function to check if STREAM is fully acked When a STREAM frame is retransmitted, a check is performed to remove range of data already acked from it. This is useful when STREAM frames are duplicated and split to cover different data ranges. The newly retransmitted frame contains only unacked data. This process is performed similarly in qc_dup_pkt_frms() and qc_build_frms(). Refactor the code into a new function named qc_stream_frm_is_acked(). It returns true if frame data are already fully acked and retransmission can be avoided. If only a partial range of data is acknowledged, frame content is updated to only cover the unacked data. This patch does not have any functional change. However, it simplifies retransmission for STREAM frames. Also, it will be reused to fix retransmission for empty STREAM frames with FIN set from the following patch : BUG/MEDIUM: quic: handle retransmit for standalone FIN STREAM As such, it must be backported prior to it.	2024-08-07 10:57:10 +02:00
Amaury Denoyelle	bb9ac256a1	MINOR: quic: convert qc_stream_desc release field to flags qc_stream_desc had a field <release> used as a boolean. Convert it with a new <flags> field and QC_SD_FL_RELEASE value as equivalent. The purpose of this patch is to be able to extend qc_stream_desc by adding newer flags values. This patch is required for the following patch BUG/MEDIUM: quic: handle retransmit for standalone FIN STREAM As such, it must be backported prior to it.	2024-08-06 18:00:17 +02:00
Amaury Denoyelle	7b89aa5b19	BUG/MINOR: h1: do not forward h2c upgrade header token haproxy supports tunnel establishment through HTTP Upgrade mechanism. Since the following commit, extended CONNECT is also supported for HTTP/2 both on frontend and backend side. commit `9bf957335e` MEDIUM: mux_h2: generate Extended CONNECT from htx upgrade As specified by HTTP/2 rfc, "h2c" can be used by an HTTP/1.1 client to request an upgrade to HTTP/2. In haproxy, this is not supported so it silently ignores this. However, Connection and Upgrade headers are forwarded as-is on the backend side. If using HTTP/1 on the backend side and the server supports this upgrade mechanism, haproxy won't be able to parse the HTTP response. If using HTTP/2, mux backend tries to incorrectly convert the request to an Extended CONNECT with h2c protocol, which may also prevent the response from being transmitted. To fix this, flag HTTP/1 request with "h2c" or "h2" token in an upgrade header. On converting the header list to HTX, the upgrade header is skipped if any of these tokens is present and the H1_MF_CONN_UPG flag is removed. This issue can easily be reproduced using curl --http2 argument to connect to an HTTP/1 frontend. This must be backported up to 2.4 after a period of observation.	2024-08-01 18:23:32 +02:00
Amaury Denoyelle	4b0bda42f7	MINOR: flags/mux-quic: decode qcc and qcs flags Decode QUIC MUX connection and stream elements via qcc_show_flags() and qcs_show_flags(). Flag definitions have been moved outside of USE_QUIC to ease compilation of flags binary.	2024-07-31 17:59:35 +02:00
Frederic Lecaille	1733dff42a	MINOR: tcp_sample: Move TCP low level sample fetch function to control layer Add ->get_info() new control layer callback definition to protocol struct to retrieve statistical counters information at transport layer (TCPv4/TCPv6) identified by an integer into a long long int. Move the TCP specific code from get_tcp_info() to the tcp_get_info() control layer function (src/proto_tcp.c) and define it as the ->get_info() callback for TCPv4 and TCPv6. Note that get_tcp_info() is called for several TCP sample fetches. This patch is useful to support some of these sample fetches for QUIC and to keep the code simple and easy to maintain.	2024-07-31 10:29:42 +02:00
William Lallemand	f76e8e50f4	BUILD: ssl: replace USE_OPENSSL_AWSLC by OPENSSL_IS_AWSLC Replace USE_OPENSSL_AWSLC by OPENSSL_IS_AWSLC in the source code, so we won't need to set USE_OPENSSL_AWSLC in the Makefile on the long term.	2024-07-30 18:53:08 +02:00
William Lallemand	56eefd6827	BUG/MEDIUM: ssl: reactivate 0-RTT for AWS-LC Then reactivate HAVE_SSL_0RTT and HAVE_SSL_0RTT_QUIC for AWS-LC, which were wrongly deactivated in `f5353f2c` ("MINOR: ssl: add HAVE_SSL_0RTT constant"). Must be backported to 3.0.	2024-07-30 18:53:08 +02:00
Willy Tarreau	1a8f3a368f	MINOR: queue: add a function to check for TOCTOU after queueing There's a rare TOCTOU case that happens from time to time with maxconn 1 and multiple threads. Between the moment we see the queue full and the moment we queue a request, it's possible that the last request on the server or proxy ended and that no other one is left to offer it its place. Given that all this code path is performance-critical and we cannot afford to increase the lock duration, better recheck for the condition after queueing. For this we need to be able to check for the condition and cleanly dequeue a request. That's what this patch provides via the new function pendconn_must_try_again(). It will catch more requests than absolutely needed though it will catch them all. It may find that around 1/1000 of requests are at risk, though testing shows that in practice, it's around 1 per million that really gets stuck (other ones benefit from timing and finishing late requests). Maybe in the future some conditions might be refined but it's harmless. What happens to such requests is that they're dequeued and their pendconn freed, so that the caller can decide to try to LB or queue them again. For now the function is not used, it's just added separately for easier tracking.	2024-07-29 09:27:01 +02:00
Frederic Lecaille	76ff8afa2d	MINOR: quic: Add information to "show quic" for CUBIC cc. Add ->state_cli() new callback to quic_cc_algo struct to define a function called by the "show quic (cc|full)" commands to dump some information about the congestion algorithm internal state currently in use by the QUIC connections. Implement this callback for CUBIC algorithm to dump its internal variables: - K: (the time to reach the cubic curve inflexion point), - last_w_max: the last maximum window value reached before entering the last recovery period. This is also the window value at the inflexion point of the cubic curve, - wdiff: the difference between the current window value and last_w_max. So negative before the inflexion point, and positive after.	2024-07-26 16:42:44 +02:00
Willy Tarreau	2dab1ba84b	MEDIUM: h1: allow to preserve keep-alive on T-E + C-L In 2.5-dev9, commit `631c7e866` ("MEDIUM: h1: Force close mode for invalid uses of T-E header") enforced a recently arrived new security rule in the HTTP specification aiming at preventing a class of content-smuggling attacks involving HTTP/1.0 agents. It consists in handling the very rare T-E + C-L requests or responses in close mode. It happens it does have an impact on a rare few and very old clients (probably running insecure TLS stacks by the way) that continue to send both with their POST requests. The impact is that for each and every request they'll have to reconnect, possibly negotiating a full TLS handshake that becomes harmful to the machine in terms of CPU computation. This commit adds a new option "h1-do-not-close-on-insecure-transfer-encoding" that does exactly what it says, it just asks not to close on such messages, even though the message continues to be sanitized and C-L dropped. It means that the risk is only between the sender and haproxy, which is limited, and might be the only acceptable solution for such environments having to deal with broken implementations. The cases are so rare that it should not need to be backported, or in the worst case, to the latest LTS if there is any demand.	2024-07-26 15:59:35 +02:00
Amaury Denoyelle	08515af9df	MINOR: quic: implement send-retry quic-initial rules Define a new quic-initial "send-retry" rule. This allows to force the emission of a Retry packet on an initial without token instead of instantiating a new QUIC connection.	2024-07-25 15:39:39 +02:00
Amaury Denoyelle	69d7e9f3b7	MINOR: quic: implement reject quic-initial action Define a new quic-initial action named "reject". Contrary to dgram-drop, the client is notified of the rejection by a CONNECTION_CLOSE with CONNECTION_REFUSED error code. To be able to emit the necessary CONNECTION_CLOSE frame, quic_conn is instantiated, contrary to dgram-drop action. quic_set_connection_close() is called immediately after qc_new_conn() which prevents the handshake startup.	2024-07-25 15:39:39 +02:00
Amaury Denoyelle	f91be2657e	MINOR: quic: pass quic_dgram as obj_type for quic-initial rules To extend quic-initial rules, pass quic_dgram instance to argument for the various actions. As such, quic_dgram is now supported as an obj_type and can be used in session origin field.	2024-07-25 15:39:39 +02:00
Amaury Denoyelle	1259700763	MINOR: quic: support ACL for quic-initial rules Add ACL condition support for quic-initial rules. This requires the extension of quic_parse_quic_initial() to parse an extra if/unless block. Only layer4 client samples are allowed to be used with quic-initial rules. However, due to the early execution of quic-initial rules prior to any connection instantiation, some samples are not supported. To be able to use the 4 described samples, a dummy session is instantiated before quic-initial rules execution. Its src and dst fields are set from the received datagram values.	2024-07-25 15:39:39 +02:00
Amaury Denoyelle	cafe596608	MEDIUM: quic: implement quic-initial rules Implement a new set of rules labelled as quic-initial. These rules are specific to QUIC. They are scheduled to be executed early on Initial packet parsing, prior to a new QUIC connection instantiation. Contrary to tcp-request connection, this allows to reject traffic earlier, most notably by avoiding unnecessary QUIC SSL handshake processing. A new module quic_rules is created. Its main function quic_init_exec_rules() is called on Initial packet parsing in function quic_rx_pkt_retrieve_conn(). For the moment, only "accept" and "dgram-drop" are valid actions. Both are final. The latter drops silently the Initial packet instead of allocating a new QUIC connection.	2024-07-25 15:39:39 +02:00
William Lallemand	28cb01f8e8	MEDIUM: quic: implement CHACHA20_POLY1305 for AWS-LC With AWS-LC, the aead part is covered by the EVP_AEAD API which provides the correct EVP_aead_chacha20_poly1305(), however for header protection it does not provide an EVP_CIPHER for chacha20. This patch implements exceptions in the header protection code and uses EVP_CIPHER_CHACHA20 and EVP_CIPHER_CTX_CHACHA20 placeholders so we can use the CRYPTO_chacha_20() primitive manually instead of the EVP_CIPHER API. This requires to check if we are using EVP_CIPHER_CTX_CHACHA20 when doing EVP_CIPHER_CTX_free().	2024-07-25 13:45:39 +02:00
William Lallemand	177c84808c	MEDIUM: quic: add key argument to header protection crypto functions In order to prepare the code for using Chacha20 with the EVP_AEAD API, both quic_tls_hp_decrypt() and quic_tls_hp_encrypt() need an extra key argument. Indeed Chacha20 does not exist as an EVP_CIPHER in AWS-LC, so the key won't be embedded into the EVP_CIPHER_CTX, so we need an extra parameter to use it.	2024-07-25 13:45:39 +02:00
William Lallemand	d55a297b85	MINOR: quic: rename confusing wording aes to hp Some of the crypto functions used for headers protection in QUIC are named with an "aes" name even though they are not used for AES encryption only. This patch renames these "aes" to "hp" so it is clearer.	2024-07-25 13:45:38 +02:00
William Lallemand	31c831e29b	MEDIUM: ssl/quic: implement quic crypto with EVP_AEAD The QUIC crypto is using the EVP_CIPHER API in order to achieve authenticated encryption, this was the API which was used with OpenSSL. With libraries that are inspired by BoringSSL (libreSSL and AWS-LC), the AEAD algorithms are implemented using the EVP_AEAD API. This patch converts the call to the EVP_CIPHER API when called in the context of AEAD cryptography for QUIC. The patch defines some QUIC_AEAD macros that can be either EVP_CIPHER or EVP_AEAD depending on the library. This was mainly done for AWS-LC but this could be useful for other libraries. This should finally allow to use CHACHA20_POLY1305 with AWS-LC. This patch allows to use the following ciphers with the EVP_AEAD API: - TLS1_3_CK_AES_128_GCM_SHA256 - TLS1_3_CK_AES_256_GCM_SHA384 AWS-LC does not implement TLS1_3_CK_AES_128_CCM_SHA256 and TLS1_3_CK_CHACHA20_POLY1305_SHA256 requires some hack for headers protection which will come in another patch.	2024-07-25 13:45:38 +02:00
Aurelien DARRAGON	709b3db941	MINOR: sink: add processed events counter in sft Add a new struct member to sft structure named e_processed in order to track the total number of events processed by sft applets. sink_forward_oc_io_handler() and sink_forward_io_handler() now make use of ring_dispatch_messages() optional value added in the previous commit in order to increase the number of processed events.	2024-07-24 17:59:08 +02:00
Aurelien DARRAGON	47323e64ad	MINOR: ring: count processed messages in ring_dispatch_messages() ring_dispatch_messages() now takes an optional argument <processed> which must point to a size_t counter when provided. When provided, the value is updated to the number of messages processed by the function.	2024-07-24 17:59:03 +02:00
Christopher Faulet	2f3c4d1b6c	MINOR: spoe: export the list of SPOP error reasons The strings representing the human-readable version for SPOP errors are now exported. It is now an array of IST to ease manipulation.	2024-07-24 14:19:10 +02:00
Christopher Faulet	f8fed07d3a	MINOR: spoe: Add a function to validate a version is supported spoe_check_vsn() function can now be used to check if a version, converted to an integer, via spoe_str_to_vsn() for instance, is supported. To do so, the list of all supported versions is now exported.	2024-07-24 14:19:10 +02:00
Christopher Faulet	f93828f229	MEDIUM: vars: Be able to parse parent scopes for variables Add session/stream scopes related to the parent. To do so, "psess", "ptxn", "preq" or "pres" must be used instead of traditional scopes (without the first "p"). The "proc" scope is not concerned by this change because it is not linked to a stream. When such scopes are used, a specific flag is added on the variable description during the variable parsing. For now, these scopes are parsed and the variable description is updated accordingly. But at the end, any operation on the variable value fails.	2024-07-18 16:39:39 +02:00
Christopher Faulet	d430edcda3	MINOR: vars: Use a description to set/unset a variable instead of its hash and scope Now a variable description is retrieved when a variable is parsed, we can use it to set or unset the variable value. It is mandatory to be able to know whether the parent stream, if any, must be used, instead of the current one.	2024-07-18 16:39:38 +02:00
Christopher Faulet	eb2d71614f	MINOR: vars: Fill a description instead of hash and scope when a name is parsed A variable description is now used to parse a variable and extract its name and its scope. It is mandatory to be able to add some flags on the variable when it is evaluated (set or get). Among other things, this will be used to know whether the parent stream, if any, must be used, instead of the current one.	2024-07-18 16:39:38 +02:00
Christopher Faulet	b020bb73a0	MINOR: stream: Add a pointer to set the parent stream A pointer to a parent stream was added in the stream structure. For now, this pointer is never set, but the idea is to have an access to a stream environment from another one from the moment there is a parent/child relationship between these streams. Concretely, for now, there is nothing to formalize this relationship.	2024-07-18 16:39:38 +02:00
Aurelien DARRAGON	d3d35f0fc6	BUILD: tree-wide: cast arguments to tolower/toupper to unsigned char (2) Fix build warning on NetBSD by reapplying `f278eec37a` ("BUILD: tree-wide: cast arguments to tolower/toupper to unsigned char"). This should fix issue #2551.	2024-07-18 13:29:52 +02:00
William Lallemand	344c3ce8fc	MEDIUM: ssl: add extra_chain to ckch_data The extra_chain member is a pointer to the 'issuers-chain-path' file that completed the chain. This is useful to get what chain file was used.	2024-07-17 16:52:06 +02:00
Valentine Krasnobaeva	665dde6481	MINOR: debug: use LIM2A to show limits It is more handy to use LIM2A in debug_parse_cli_show_dev(), as it allows to show a custom string ("unlimited"), if a given limit value equals to 0. normalize_rlim() handler is needed to convert properly RLIM_INFINITY to zero, with the respect of type sizes, as rlim_t is always 4 bytes on 32bit and 64bit arch.	2024-07-16 14:04:41 +02:00
Willy Tarreau	75b335abc7	MINOR: fd: don't scan the full fdtab on all threads During tests, it's pretty visible that with many threads and a large number of FDs, the process may take time to be ready. The reason for this is that the full fdtab array is scanned by each and every thread at boot in fd_reregister_all() in order to make each thread-local poller adopt the FDs that are relevant to it. The problem is that when dealing with 1-2M FDs and 64+ threads, it starts to represent quite a number of loops, and usually the fdtab array doesn't entirely fit in the CPU's L3 cache, causing extra memory accesses. It's particularly visible when issuing debugging commands to the CLI because usually the first one fails while the CPU is at 100% for half a second (which also is socat's timeout). A quick test with this: global stats socket /tmp/sock1 level admin mode 666 stats timeout 1h maxconn 2000000 And the following script started in another window: while ! time socat -t5 - /tmp/sock1 <<< "show version";do date -Ins;done shows that it takes 1.58s for the socat instance that succeeds on an Ampere Altra with 80 cores, this requires to change the timeout (defaults to half a second) otherwise it returns nothing. In addition it also means that during reloads, some CPU spikes will be noticed. Adding a prefetch of the current FD + 16 improves the startup time by 30% but that's far from being sufficient. In practice all of this is performed at boot time, a moment at which we know that extremely few FDs are registered (basically just the listeners), so FD numbers are usually very low and the rest of the table is scanned for no benefit. Ideally, knowing upfront how many FDs we have should be sufficient. A first approach would consist in counting the entries on a single thread before registering pollers. It's not necessarily efficient and would take time anyway. This patch takes a different approach. It consists in keeping a thread-local max ("fd_highest") that is updated whenever fd_insert() is called with a larger number. Of course this is not correct once all threads have started, but it will remain valid during boot since the same value is used during startup and is cloned for each thread, and no scheduling happens anywhere during this period, so that all threads are aware of the highest FD they've seen registered, even if it had been done in some init code, and this without having to deal with a shared variable. Here on the test platform, the script gets its response in 10ms vs 1580 before.	2024-07-15 19:19:13 +02:00
Christopher Faulet	a492e08e62	CLEANUP: spoe: Uniformize function definitions SPOE functions definitions were split on 2 or more lines, with the return type alone on the first line. It is unusual in the HAProxy code. The related issue is #2502.	2024-07-12 15:27:05 +02:00
Christopher Faulet	cab98784d8	MAJOR: spoe: Rewrite SPOE applet to use the SPOP mux It is the huge part of the series. The patch is not so huge, it removes functions to produce or consume frames. The SPOE applet is pretty light now. But since this patch, the SPOP multiplexer is now used. The SPOP mode is now automatically used for SPOP backends. So if there are bugs in the SPOP multiplexer, they will be visible now. The related issue is #2502.	2024-07-12 15:27:04 +02:00
Christopher Faulet	1bea73612a	MEDIUM: check/spoe: Use SPOP multiplexer to perform SPOP health-checks The SPOP health-checks are now performed using the SPOP multiplexer. This will be fixed later, but for now, it is considered as a L4 health-check and no specific status code is reported. It means the corresponding vtest script is marked as broken for now. Functionally speaking, the same is performed. A connection is opened, a HELLO frame is sent to the agent and we wait for the HELLO frame from the agent in reply. But only L4OK, L4KO or L4TOUT will be reported. The related issue is #2502.	2024-07-12 15:27:04 +02:00
Christopher Faulet	7e1bb7283b	MEDIUM: mux-spop: Introduce the SPOP multiplexer It is not possible to use it yet. Idle connections and pipelining mode are not supported for now. But it should be possible to open a SPOP connection, perform the HELLO handshake, send a NOTIFY frame based on data produced by the client side and receive the corresponding ACK frame to transfer its content to the client side. The related issue is #2502.	2024-07-12 15:27:04 +02:00
Christopher Faulet	d0d23a7a66	MINOR: spoe: Move spoe_str_to_vsn() into the header file The function used to convert the SPOE version from a string to an integer is now located in spoe-t.h header file. The related issue is #2502.	2024-07-12 15:27:04 +02:00
Christopher Faulet	08b522d6ac	MINOR: spoe: Move all stuff regarding the filter/applet in the C file Structures describing the SPOE applet context, the SPOE filter configuration and context and the SPOE messages and groups are moved in the C file. In spoe-t.h file, it remains the structure describing an SPOE agent and flags used by both sides. In addition, the SPOE frontend, created for a given SPOE engine, is moved from the SPOE filter configuration to the SPOE agent structure. The related issue is #2502.	2024-07-12 15:27:04 +02:00
Christopher Faulet	e6145a0ea1	MINOR: spoe: Dynamically alloc the message list per event of an agent The inline array used to store the configured messages per event in the SPOE agent structure is replaced by a dynamic array, allocated during the configuration parsing. The main purpose of this change is to be able to move all stuff regarding the SPOE filter and applet in the C file. The related issue is #2502.	2024-07-12 15:27:04 +02:00
Christopher Faulet	ce53bb6284	MINOR: spoe: Rename some flags and constant to use SPOP prefix A SPOP multiplexer will be added. Many flags, constants and structures will be removed from the applet scope. So the "SPOP" prefix is used instead of "SPOE", to be consistent. The related issue is #2502.	2024-07-12 15:27:04 +02:00
Christopher Faulet	51ebf644e5	MINOR: stconn: Use a dedicated function to get the opposite sedesc se_opposite() function is added to let an endpoint retrieve the opposite endpoint descriptor. Muxes supporting the zero-copy forwarding can now use it. The se_shutdown() function too. This will be used by the SPOP multiplexer to be able to retrieve the SPOE agent configuration attached to the applet on client side. The related issue is #2502.	2024-07-12 15:27:04 +02:00
Christopher Faulet	4b8098bf48	MINOR: connection: No longer include stconn type header in connection-t.h It is a small change, but it is cleaner not to include the stconn-t.h header in connection-t.h, mainly to avoid circular definitions. The related issue is #2502.	2024-07-12 15:27:04 +02:00
Christopher Faulet	33ac3dabcb	MEDIUM: applet: Add a .shut callback function for applets Applets can now define a shutdown callback function, just like the multiplexer. It is especially useful to get the abort reason. This will be pretty useful to get the status code from the SPOP stream to report it at the SPOE filter level. The related issue is #2502.	2024-07-12 15:27:04 +02:00
Christopher Faulet	1538c4aa82	MEDIUM: proxy/spoe: Add a SPOP mode The SPOE was significantly lightened. It is now possible to refactor it to use a dedicated multiplexer. The first step is to add a SPOP mode for proxies. The corresponding multiplexer mode is also added. For now, there is no SPOP multiplexer, so it is only declarative. But at the end, the SPOP multiplexer will be automatically selected for servers inside a SPOP backend. The related issue is #2502.	2024-07-12 15:27:04 +02:00
Christopher Faulet	b986952a75	MINOR: spoe: Remove the dedicated SPOE applet task The dedicated task per SPOE applet is no longer used. So it is removed. The related issue is #2502.	2024-07-12 15:27:04 +02:00
Christopher Faulet	4e589095d9	MAJOR: spoe: Remove idle applets and pipelining support Management of idle applets is removed. Consequently, the pipelining support is also removed. It is a huge change but it should be transparent for the agents, except regarding the performances. Of course, being able to reuse already opened connections and being able to multiplex frames on a given connection is a must have. These features will be restored later. The hello and idle timeouts are no longer used. Because an applet is spawned to process a NOTIFY frame and closed after receiving the ACK reply, the processing timeout is the only one required. In addition, the parameters to limit the SPOE applet creation are no longer used either. The related issue is #2502.	2024-07-12 15:27:04 +02:00
Christopher Faulet	2405881ab0	MINOR: spoe: Remove debugging All the SPOE debugging is removed. The code will be easier to rework this way and the debugging will be mainly moved in the SPOP multiplexer via the trace API. The related issue is #2502.	2024-07-12 15:27:04 +02:00
Christopher Faulet	d37489abef	MINOR: spoe: Use only a global engine-id per agent Because the async mode was removed, it is no longer mandatory to announce a different engine identifier per thread for a given SPOE agent. This was used to be sure requests and the corresponding responses are stuck on the same thread. So, now, a SPOE agent only announces one engine identifier on all connections. No changes should be expected for agents. The related issue is #2502.	2024-07-12 15:27:04 +02:00
Christopher Faulet	52ad7eb79e	MEDIUM: spoe: Remove async mode support The support for asynchronous mode, the ability to send messages on a connection and receive the responses on any other connections, is removed. It appears this feature was a bit overkill. And it is a problem for this refactoring. This feature is removed and will not be restored at the end. It is not a big deal for agent supporting the async mode because it is usable if it is announced on both sides. HAProxy stops to announce it. This should be transparent for agents. The related issue is #2502.	2024-07-12 15:27:04 +02:00
Christopher Faulet	e3c92209f7	MEDIUM: spoe: Remove fragmentation support It is the first patch of a long series to refactor the SPOE filter. The idea is to rely on a dedicated multiplexer instead of hacking HAProxy with a list of applets processing a message queue. First of all, optional features will be removed. Some will be restored at the end, some others will just be removed. It is the case here. The frame fragmentation support is removed. The only purpose of this feature is to be able to support the streaming. Because it is out of the scope of this refactoring, the fragmentation is removed. The related issue is #2502.	2024-07-12 15:27:04 +02:00
Christopher Faulet	249a547f37	CLEANUP: stconn: Fix a typo in comments for SE_ABRT_SRC_* Just a little typo: s/set bu/ set by/	2024-07-12 15:27:04 +02:00
Valentine Krasnobaeva	9302869c95	BUG/MINOR: limits: fix license type in limits.h Need to use LGPL-2.1-or-later in headers since our headers default to LGPL.	2024-07-11 18:15:48 +02:00
Amaury Denoyelle	3be58fc720	CLEANUP: quic: rename TID affinity elements This commit is the renaming counterpart of the previous one, this time for quic_conn module. Several elements related to TID affinity update from quic_conn have been renamed: public functions, but also the flag renamed to QUIC_FL_CONN_TID_REBIND and the trace event to QUIC_EV_CONN_BIND_TID. This should be backported with the same instruction as the previous commit.	2024-07-11 15:14:06 +02:00
Amaury Denoyelle	9fbe8b0334	CLEANUP: proto: rename TID affinity callbacks Since the following patch, protocol API to update a connection TID affinity has been extended. commit `1a43b9f32c` MINOR: proto: extend connection thread rebind API The single callback set_affinity has been split into 3 different functions which are called at different stages during listener_accept(), depending on accept queue push success or not. However, the naming was rendered confusing by the usage of function suffixes 1 and 2. Rename proto callback related to TID affinity update and use the following names : * bind_tid_prep * bind_tid_commit * bind_tid_reset This commit should probably be backported at least up to 3.0 with the above patch. This is because the fix was recently backported and it would allow to keep changes minimal between the two versions. It could even be backported up to 2.8 if there is no major conflict.	2024-07-11 15:14:06 +02:00
Amaury Denoyelle	b0990b38f8	MINOR: quic: add counters of sent bytes with and without GSO Add a sent bytes counter for each quic_conn instance. A secondary field only accounts for bytes sent via GSO, which is useful to check whether GSO is actually in use. For the moment, these counters are reported on "show quic" but not aggregated on proxy quic module stats.	2024-07-11 11:02:44 +02:00
Amaury Denoyelle	d0ea173e35	MEDIUM: quic: implement GSO fallback mechanism UDP GSO on Linux is not implemented in every network device. For example, this is not available for veth devices frequently used in container environment. In such case, EIO is reported on send() invocation. It is impossible to test at startup for proper GSO support in this case as a listener may be bound on multiple network interfaces. Furthermore, network interfaces may change during haproxy lifetime. As such, the only option is to react on send syscall error when GSO is used. The purpose of this patch is to implement a fallback when encountering such conditions. Emission can be retried immediately by trying to send each prepared datagram individually. To support this, qc_send_ppkts() is able to iterate over each datagram in a so-called non-GSO fallback mode. Between each emission, a datagram header is rewritten in front of the buffer which allows the sending loop to proceed until last datagram is emitted. To complement this, quic_conn listener is flagged on first GSO send error with value LI_F_UDP_GSO_NOTSUPP. This completely disables GSO for all future emission with QUIC connections using this listener. For the moment, non-GSO fallback mode is activated when EIO is reported after GSO has been set. This is the error reported for the veth usage described above.	2024-07-11 11:02:44 +02:00
Amaury Denoyelle	448d3d388a	MINOR: quic: add GSO parameter on quic_sock send API Add <gso_size> parameter to qc_snd_buf(). When non-null, this specifies the value for socket option SOL_UDP/UDP_SEGMENT. This allows to send several datagrams in a single call by splitting data multiple times at <gso_size> boundary. For now, <gso_size> remains set to 0 by caller, as such there should not be any functional change.	2024-07-11 11:02:44 +02:00
Amaury Denoyelle	96a34d79d9	MINOR: quic: define quic_cc_path MTU as constant Future commits will implement GSO support to be able to emit multiple datagrams in a single syscall invocation. This will be used every time there is more data to send than the UDP network MTU. No change will be done for Tx buffer encoding, in particular when using extra metadata datagram header. When GSO will be used, length field will contain the total length of all datagrams to emit in a single GSO syscall send. As such, QUIC send functions will detect that GSO is in use if total length is greater than MTU. This last assumption forces to ensure that MTU is constant. Indeed, in case qc_send() is interrupted, Tx buffer will be left with prepared datagrams. These datagrams will be emitted at the next qc_send() invocation. If MTU would change during these two calls, it would be impossible to know if GSO was used or not. To prevent this, mark <mtu> field of quic_cc_path as constant.	2024-07-11 11:02:44 +02:00
Amaury Denoyelle	35470d5185	MINOR: quic: activate UDP GSO for QUIC if supported Add a startup test for GSO support in quic_test_socketopts() and automatically activate it in qc_prep_pkts() when building datagrams as big as MTU. Also define a new config option tune.quic.disable-udp-gso. This is useful to prevent warning on older platform or to debug an issue which may be related to GSO.	2024-07-11 11:02:44 +02:00
Valentine Krasnobaeva	22db643648	MINOR: haproxy: prepare to move limits-related code This patch is done in order to prepare the move of handlers to compute and to check process related limits as maxconn, maxsock, maxpipes. So, these handlers become no longer static due to the future move. We add the handlers declarations in limits.h in this patch as well, in order to keep the next patch, dedicated to code replacement, without any additional modifications. Such split also assures that this patch can be compiled separately from the next one, where we move the handlers. This is important in case of git-bisect.	2024-07-10 18:05:48 +02:00
Valentine Krasnobaeva	b8dc783eb9	REORG: global: move rlim_fd_*_at_boot in limits Let's move in 'limits' compilation unit global variables to keep the initial process fd limits.	2024-07-10 18:05:48 +02:00
Valentine Krasnobaeva	47f2afb436	CLEANUP: fd: rm struct rlimit definition As raise_rlim_nofile() was moved to limits compilation unit, limits.h includes the system <sys/resource.h>. So, this definition of rlimit system type structure is no longer needed for compilation of fd unit.	2024-07-10 18:05:48 +02:00
Valentine Krasnobaeva	3759674047	REORG: fd: move raise_rlim_nofile to limits Let's move raise_rlim_nofile() from 'fd' compilation unit to 'limits', as it wraps setrlimit to change process RLIMIT_NOFILE.	2024-07-10 18:05:48 +02:00
Valentine Krasnobaeva	1517bcb5e3	MINOR: limits: prepare to keep limits in one place The code which gets, sets and checks initial and current fd limits and process related limits (maxconn, maxsock, ulimit-n, fd-hard-limit) is spread around different functions in haproxy.c and in fd.c. Let's group it together in dedicated limits.c and limits.h. This patch is done in order to prepare the moving of limits-related functions from different places to the new 'limits' compilation unit. It helps to keep clean the next patch, which will do only the move without any additional modifications. Such detailed split is needed in order to be sure not to break accidentally limits logic and in order to be able to compile each commit separately in case of git-bisect.	2024-07-10 18:05:48 +02:00
Willy Tarreau	4e65fc66f6	MAJOR: import: update mt_list to support exponential back-off (try #2 ) This is the second attempt at importing the updated mt_list code (commit 59459ea3). The previous one was attempted with commit `c618ed5ff4` ("MAJOR: import: update mt_list to support exponential back-off") but revealed problems with QUIC connections and was reverted. The problem that was faced was that elements deleted inside an iterator were no longer reset, and that if they were to be recycled in this form, they could appear as busy to the next user. This was trivially reproduced with this: $ cat quic-repro.cfg global stats socket /tmp/sock1 level admin stats timeout 1h limited-quic frontend stats mode http bind quic4@:8443 ssl crt rsa+dh2048.pem alpn h3 timeout client 5s stats uri / $ ./haproxy -db -f quic-repro.cfg & $ h2load -c 10 -n 100000 --npn h3 https://127.0.0.1:8443/ => hang This was purely an API issue caused by the simplified usage of the macros for the iterator. The original version had two backups (one full element and one pointer) that the user had to take care of, while the new one only uses one that is transparent for the user. But during removal, the element still has to be unlocked if it's going to be reused. All of this sparked discussions with Fred and Aurélien regarding the still unclear state of locking. It was found that the lock API does too much at once and is lacking granularity. The new version offers a much more fine-grained control allowing to selectively lock/unlock an element, a link, the rest of the list etc. It was also found that plenty of places just want to free the current element, or delete it to do anything with it, hence don't need to reset its pointers (e.g. event_hdl).
Finally it appeared obvious that the root cause of the problem was the unclear usage of the list iterators themselves because one does not necessarily expect the element to be presented locked when not needed, which makes the unlock easy to overlook during reviews. The updated version of the list presents explicit lock status in the macro name (_LOCKED or _UNLOCKED suffixes). When using the _LOCKED suffix, the caller is expected to unlock the element if it intends to reuse it. At least the status is advertised. The _UNLOCKED variant, instead, always unlocks it before starting the loop block. This means it's not necessary to think about unlocking it, though it's obviously not usable with everything. A few _UNLOCKED were used at obvious places (i.e. where the element is deleted and freed without any prior check). Interestingly, the tests performed last year on QUIC forwarding, that resulted in limited traffic for the original version and higher bit rate for the new one couldn't be reproduced because since then the QUIC stack has gained in efficiency, and the 100 Gbps barrier is now reached with or without the mt_list update. However the unit tests definitely show a huge difference, particularly on EPYC platforms where the EBO provides tremendous CPU savings. Overall, the following changes are visible from the application code: - mt_list_for_each_entry_safe() + 1 back elem + 1 back ptr => MT_LIST_FOR_EACH_ENTRY_LOCKED() or MT_LIST_FOR_EACH_ENTRY_UNLOCKED() + 1 back elem - MT_LIST_DELETE_SAFE() no longer needed in MT_LIST_FOR_EACH_ENTRY_UNLOCKED() => just manually set iterator to NULL however. For MT_LIST_FOR_EACH_ENTRY_LOCKED() => mt_list_unlock_self() (if element going to be reused) + NULL - MT_LIST_LOCK_ELT => mt_list_lock_full() - MT_LIST_UNLOCK_ELT => mt_list_unlock_full() - l = MT_LIST_APPEND_LOCKED(h, e); MT_LIST_UNLOCK_ELT(); => l=mt_list_lock_prev(h); mt_list_lock_elem(e); mt_list_unlock_full(e, l)	2024-07-09 16:46:38 +02:00
Amaury Denoyelle	19b8c1b7cd	DEV: flags/quic: decode quic_conn flags Decode quic_conn flags via qc_show_flags() function. To support this, quic flags definitions have been put outside of USE_QUIC directive.	2024-07-08 09:38:35 +02:00
Amaury Denoyelle	95f624540b	BUG/MEDIUM: quic: prevent crash on accept queue full Handshake for quic_conn instances runs on a single non-chosen thread. On completion, listener_accept() is performed to select the less loaded thread before initializing connection instance. As such, quic_conn instance is migrated to the thread with its upper connection. In case accept queue is full, listener_accept() falls back to local accept mode, which causes the connection to be assigned to the current thread. However, this is not supported by QUIC as quic_conn instance is left on the previously selected thread. In most cases, this will cause a BUG_ON() due to a task manipulation from an outside thread. To fix this, handle quic_conn thread rebind in multiple steps using the new extended protocol API. Several operations have been moved from qc_set_tid_affinity1() to newly defined qc_set_tid_affinity2(), in particular CID TID update. This ensures that quic_conn instance is not prematurely accessed on the new thread until accept queue push is guaranteed to succeed. qc_reset_tid_affinity() is also newly defined to reassign the newly created tasks and tasklets to the current thread. This is necessary to prevent the BUG_ON() crash described above. This must be backported up to 2.8 after a period of observation. Note that it depends on previous patch : MINOR: proto: extend connection thread rebind API	2024-07-04 17:28:56 +02:00
Amaury Denoyelle	1a43b9f32c	MINOR: proto: extend connection thread rebind API MINOR: listener: define callback for accept queue push Extend API for connection thread rebind API by replacing single callback set_affinity by three different ones. Each one of them is used at a different stage of the operation : * set_affinity1 is used similarly to previous set_affinity * set_affinity2 is called directly from accept_queue_push_mp() when an entry has been found in accept ring. This operation cannot fail. * reset_affinity is called after set_affinity1 in case of failure from accept_queue_push_mp() due to no space left in accept ring. This is necessary for protocols which must reconfigure resources before fallback on the current tid. This patch does not have any functional changes. However, it will be required to fix crashes for QUIC connections when accept queue ring is full. As such, it must be backported with it.	2024-07-04 16:33:21 +02:00
Valentine Krasnobaeva	41275a6918	MEDIUM: init: set default for fd_hard_limit via DEFAULT_MAXFD Let's provide a default value for fd_hard_limit, if it's not set in the configuration. With this patch we could set some specific default via compile-time variable DEFAULT_MAXFD as well. Hopefully, this will be helpful for haproxy package maintainers. make -j 8 TARGET=linux-glibc DEBUG=-DDEFAULT_MAXFD=50000 If haproxy is compiled without DEFAULT_MAXFD defined, the default will be set to 1048576. This is done to avoid the process being killed by its watchdog when it is started without any limitations in its configuration or on the command line and the hard RLIMIT_NOFILE is extremely huge (~1000000000). In this case we use compute_ideal_maxconn() to calculate maxconn and maxsock; maxsock defines the size of the internal fdtab, which becomes very large as well. When the process starts to simply loop over this fdtab (O(n)), this takes a lot of time, so the watchdog does its job. To avoid this, maxconn now is always reduced to some reasonable value either by explicit global.fd-hard-limit from configuration, or by its default. The default may be changed at build-time and overwritten then by global.fd-hard-limit at runtime. Explicit global.fd-hard-limit from the configuration always has precedence over DEFAULT_MAXFD, if set. Must be backported in all stable versions until v2.6.0, including v2.6.0.	2024-07-04 07:52:42 +02:00
Amaury Denoyelle	8550549cca	REORG: quic: remove quic_cid_trees reference from proto_quic Previous commit removed access/manipulation to QUIC CID global tree outside of quic_cid module. This ensures that proper locking is always performed. This commit finalizes this cleanup by marking CID global tree as static only to quic_cid source file. Initialization of this tree is removed from proto_quic and now performed using dedicated initcalls quic_alloc_global_cid_tree(). As a side change, complete CID global tree documentation, in particular to explain CID global tree artificial splitting and ODCID handling. Overall, the code is now clearer and safer.	2024-07-03 15:02:40 +02:00
Amaury Denoyelle	0a352ef08e	MINOR: quic: remove access to CID global tree outside of quic_cid module haproxy generates for each QUIC connection a set of CID. The peer must reuse them as DCID for its emitted packet. On datagram reception, DCID field serves as identifier to dispatch them on their correct thread. These CIDs are stored in a global CID tree. Access to this data structure must always be protected with CID_LOCK. This commit is a refactoring to regroup all CID tree access in quic_cid module. Several code parts are adjusted : * quic_cid_insert() is extended to check for insertion race-condition. This is useful on quic_conn instantiation. Code where such race cannot happen can use unsafe _quic_cid_insert() instead. * on RETIRE_CONNECTION_ID frame reception, existing quic_cid_delete() function is used. * remove tree lookup from qc_check_dcid(), extracted in the new quic_cmp_cid_conn() function. Ultimately, the latter should be removed as CID lookup could be conducted on quic_conn owned tree without locking.	2024-07-03 15:02:40 +02:00
Amaury Denoyelle	a05fefe74d	CLEANUP: quic: cleanup prototypes related to CIDs handling Remove duplicated prototypes from quic_conn.h also present in quic_cid.h. Also remove quic_derive_cid() prototype and mark it as static.	2024-07-03 15:02:40 +02:00
Amaury Denoyelle	789d4abd73	BUG/MEDIUM: h3: ensure the ":method" pseudo header is totally valid Ensure the pseudo-header method only consists of valid characters according to RFC 9110. If an invalid value is found, the request is rejected and the stream is reset. Previously only characters forbidden in headers were rejected (NUL/CR/LF), but this is insufficient for :method, where some other forbidden chars might be used to trick a non-compliant backend server into seeing a different path from the one seen by haproxy. Note that header injection is not possible though. This must be backported up to 2.6. Many thanks to Yuki Mogi of FFRI Security Inc for the detailed report that allowed to quickly spot, confirm and fix the problem.	2024-06-28 14:36:30 +02:00
Willy Tarreau	290659ffd3	MINOR: activity: make the memory profiling hash size configurable at build time The MEMPROF_HASH_BITS variable was set to 10 without a possibility to change it (beyond patching the code). After seeing a few reports already with "other" being listed and a list with close to 1024 entries, it looks like it's about time to either increase the hash size, or at least make it configurable for special cases. As a reminder, in order to remain fast, the algorithm searches no more than 16 places after the hash, so when a table is almost full, searches are long and new places are rare. The present patch just makes it possible to redefine it by passing "-DMEMPROF_HASH_BITS=11" or "-DMEMPROF_HASH_BITS=12" in CFLAGS, and moves the definition to defaults.h to make it easier to find. Such values should be way sufficient for the vast majority of use cases. Maybe in the future we'd change the default. At least this version should be backported to ease rebuilds, say, till 2.8 or so.	2024-06-27 18:01:27 +02:00
Valentine Krasnobaeva	5e06d45df7	REORG: init: encapsulate 'reload' sockpair and master CLI listeners creation Let's encapsulate the logic of 'reload' sockpair and master CLI listeners creation, used by master CLI into a separate function, as we needed this only in master-worker runtime mode. This makes the code of init() more readable.	2024-06-27 16:08:42 +02:00
Christopher Faulet	ad946a704d	MINOR: stick-table: Always decrement ref count before killing a session Guarded functions to kill a sticky session, stksess_kill() and stksess_kill_if_expired(), may or may not decrement and test its reference counter before really killing it. This depends on a parameter. If it is set to non-zero value, the ref count is decremented and if it falls to zero, the session is killed. Otherwise, if this parameter is equal to zero, the session is killed, regardless of the ref count value. In the code, these functions are always called with a non-zero parameter and the ref count is always decremented and tested. So, there is no reason to still have a special case. Especially because it is not really easy to say if it is supported or not. Does it mean it is possible to kill a sticky session while it is still referenced somewhere ? probably not. So, does it mean it is possible to kill an unreferenced session ? This case may be problematic because the session is accessed outside of any lock and thus may be released by another thread because it is unreferenced. Enlarging scope of the lock to avoid any issue is possible but it is a bit of a shame to do so because there is no usage for now. The best is to simplify the API and remove this case. Now, stksess_kill() and stksess_kill_if_expired() functions always decrement and test the ref count before killing a sticky session.	2024-06-26 15:05:06 +02:00
Christopher Faulet	9357873641	BUG/MEDIUM: stick-table: Decrement the ref count inside lock to kill a session When we try to kill a session, the shard must be locked before decrementing the ref count on the session. Otherwise, the ref count can fall to 0 and a purge task (stktable_trash_oldest or process_table_expire) may release the session before we have the opportunity to acquire the lock on the shard to effectively kill the session. This could lead to a double free. Here is the scenario: Thread 1 calls stksess_kill(ts): if (ATOMIC_DEC(&ts->ref_cnt) != 0) return /* here the ref count is 0 */. Thread 2 then runs stktable_trash_oldest(): LOCK(&sh_lock); if (!ATOMIC_LOAD(&ts->ref_cnt)) __stksess_free(ts); UNLOCK(&sh_lock) /* here the session was released */. Thread 1 resumes: LOCK(&sh_lock); __stksess_free(ts) <--- double free; UNLOCK(&sh_lock). The bug was introduced in 2.9 by the commit `7968fe3889` ("MEDIUM: stick-table: change the ref_cnt atomically"). The ref count must be decremented inside the lock for the stksess_kill() and stksess_kill_if_expired() functions. This patch should fix the issue #2611. It must be backported as far as 2.9. On the 2.9, there is no sharding. All the table is locked. The patch will have to be adapted.	2024-06-26 12:05:37 +02:00
Frederic Lecaille	bc9821fd26	BUILD: Missing inclusion header for ssize_t type Compilation issue detected as follows by gcc: In file included from src/ncbuf.c:19: src/ncbuf.c: In function 'ncb_write_off': include/haproxy/bug.h:144:10: error: unknown type name 'ssize_t' 144 \| extern ssize_t write(int, const void *, size_t); \	2024-06-26 10:17:09 +02:00
Willy Tarreau	2d27c80288	BUILD: debug: also declare strlen() in __ABORT_NOW() Previous commit `8f204fa8ae` ("MINOR: debug: print gdb hints when crashing") broke on the CI where strlen() isn't known. Let's forward-declare it in the __ABORT_NOW() functions, just like write(). No backport is needed.	2024-06-26 08:04:40 +02:00
Willy Tarreau	8f204fa8ae	MINOR: debug: print gdb hints when crashing To make bug reporting easier for users, when crashing, let's suggest what to do. Typically when a BUG_ON() matches, only the current thread is useful the vast majority of the time, while when the watchdog triggers, all threads are interesting. The messages are printed at the end after the dump. We may adjust these with wiki links in the future if more detailed instructions are relevant.	2024-06-26 07:43:00 +02:00
Valentine Krasnobaeva	2cd52a88be	MINOR: cli/debug: show dev: show capabilities If haproxy is compiled with Linux capabilities support, let's show process capabilities before applying the configuration and at runtime in 'show dev' command output. This may be useful for debugging purposes, especially in cases when the process changes its UID and GID to non-privileged ones, or when it has started and runs under a non-privileged UID and the needed capabilities are set by the admin on the haproxy binary.	2024-06-26 07:38:21 +02:00
Valentine Krasnobaeva	0d79c9bedf	MINOR: cli/debug: show dev: add cmdline and version 'show dev' command is very convenient to obtain haproxy debugging information, while process is run in container. Let's extend its output with version and cmdline. cmdline is useful as it shows the absolute binary path and its arguments, because sometimes the person who is debugging a failing container is not the same one who created and deployed it. argc and argv are stored in the exported global structure, because feed_post_mortem() is added as a post check function callback in the post_check_list. So we can't simply change the signature of feed_post_mortem(), without breaking other post check callbacks APIs. Parsers are not supposed to modify argv, so we can safely pass its pointer to debug_parse_cli_show_dev(), without copying all argument strings somewhere in the heap or on stack.	2024-06-26 07:38:21 +02:00
Valentine Krasnobaeva	fcf1a0bcf5	MINOR: capabilities: export capget and __user_cap_header_struct To be able to show process capabilities before applying its configuration and also at runtime in 'show dev' command output, we need to export the wrapper around capget() syscall. It also seems more handy to place __user_cap_header_struct in .data section and declare it as globally accessible, as we always fill it with the same values. This avoids allocating and filling these 8 bytes each time on the stack frame, when capget() or capset() wrappers are called.	2024-06-26 07:38:21 +02:00
Aurelien DARRAGON	9d312212df	BUG/MINOR: proxy: fix email-alert leak on deinit() (2nd try) As shown in GH #2608 and ("BUG/MEDIUM: proxy: fix email-alert invalid free"), simply calling free_email_alert() from free_proxy() is not the right thing to do. In this patch, we reuse proxy->email_alert.set memory space to introduce proxy->email_alert.flags in order to support 2 flags: PR_EMAIL_ALERT_SET (to mimic proxy->email_alert.set) and PR_EMAIL_ALERT_RESOLVED (set once init_email_alert() was called on the proxy to resolve email_alert.mailer pointer). Thanks to PR_EMAIL_ALERT_RESOLVED flag, free_email_alert() may now properly handle the freeing of proxy email_alert settings: if the RESOLVED flag is set, then it means the .email_alert.mailers.name parsing hint was replaced by the actual mailers pointer, thus no free should be attempted. No backport needed: as described in ("BUG/MEDIUM: proxy: fix email-alert invalid free"), this historical leak is not sensitive as it cannot be triggered during runtime. Thus, given that the fix is not backport-friendly, it's not worth the trouble.	2024-06-17 19:37:29 +02:00
Aurelien DARRAGON	ee8be55942	REORG: mailers: move free_email_alert() to mailers.c free_email_alert() was declared in cfgparse.c, but it should belong to mailers.c instead.	2024-06-17 19:37:29 +02:00
William Lallemand	30a432d198	MINOR: ssl: activate sigalgs feature for AWS-LC AWS-LC lacks the SSL_CTX_set1_sigalgs_list define even though the function exists. The missing define disables the feature in HAProxy, even though we could have built with it. SSL_CTX_set1_client_sigalgs_list() is not available, though. This patch introduces the define so the feature is enabled.	2024-06-17 17:40:49 +02:00
Aurelien DARRAGON	983513d901	DEBUG: hlua: distinguish burst timeout errors from exec timeout errors hlua burst timeout was introduced in `58e36e5b1` ("MEDIUM: hlua: introduce tune.lua.burst-timeout"). It is a safety measure that detects when too much time is spent on a single lua execution (between 2 interruptions/yields), meaning that the current thread is not able to perform other tasks. Such a scenario should be avoided because it will cause thread contention, which may have a negative performance impact and could cause the watchdog to trigger. When the burst timeout is exceeded, the current Lua execution is aborted and a timeout error is reported to the user. Unfortunately, the same error is currently being reported for cumulative (AKA execution) timeout and for burst timeout, which may be confusing to the user. Indeed, the "execution timeout" error historically results from the current hlua context exceeding the total (cumulative) time it's allowed to run. It is set per lua context using the dedicated tunables: tune.lua.session-timeout, tune.lua.task-timeout and tune.lua.service-timeout. We've already faced a user report where the user was able to trigger the burst timeout and got the "Lua task: execution timeout." error while the user didn't set a cumulative timeout. Thus the error was actually confusing because it was indeed the burst timeout which was causing it, due to the use of a cpu-intensive call from within the task without sufficient manual "yield" keypoints around the cpu-intensive call to ensure it runs on a dedicated scheduler cycle. In this patch we make it so burst timeout related errors are reported as "burst timeout" errors instead of "execution timeout" errors (which in fact became the generic timeout errors catchall with `58e36e5b1`). 
To do this, hlua_timer_check() now returns a different value depending on whether the exceeded timeout is the burst one or the cumulative one, which allows us to return either HLUA_E_ETMOUT or HLUA_E_BTMOUT in hlua_ctx_resume(). It should improve the situation described in GH #2356 and may possibly be backported with `58e36e5b1` to improve error reporting if it applies without resistance.	2024-06-14 18:25:58 +02:00
William Lallemand	ee5aa4e5e6	BUILD: ssl: disable deprecated functions for AWS-LC 1.29.0 AWS-LC has a lot of functions that do nothing, which are now deprecated and emit warnings. This patch disables the following useless functions that emit a warning: SSL_CTX_get_security_level(), SSL_CTX_set_tmp_dh_callback(), ERR_load_SSL_strings(), RAND_keep_random_devices_open() The list of deprecated functions is here: https://github.com/aws/aws-lc/blob/main/docs/porting/functionality-differences.md	2024-06-14 10:41:36 +02:00
William Lallemand	7120c77b14	MEDIUM: ssl: support for ECDA+RSA certificate selection with AWS-LC AWS-LC does not support the SSL_CTX_set_client_hello_cb() function from OpenSSL, which allows analyzing the ciphers and signature algorithms of the ClientHello. However, it supports SSL_CTX_set_select_certificate_cb(), which allows the same thing and is the implementation from the BoringSSL side. This patch uses SSL_CTX_set_select_certificate_cb() as well as the SSL_early_callback_ctx_extension_get() function to get the signature algorithms. This was successfully tested with openssl s_client as well as testssl.sh. This should allow enabling more reg-tests that depend on certificate selection. Requires at least AWS-LC 1.22.0.	2024-06-13 19:36:40 +02:00
William Lallemand	5149cc4990	BUILD: ssl: fix build with wolfSSL fix build with wolfSSL, broken since the reorg in src/ssl_clienthello.c	2024-06-13 17:01:45 +02:00
William Lallemand	4ced880d22	REORG: ssl: move the SNI selection code in ssl_clienthello.c Move the code which is used to select the final certificate with the clienthello callback. ssl_sock_client_sni_pool needs to be exposed outside ssl_sock.c	2024-06-13 16:48:17 +02:00
William Lallemand	fc7c5d892b	MINOR: ssl: add ssl_sock_bind_verifycbk() in ssl_sock.h Add missing ssl_sock_bind_verifycbk() in ssl_sock.h	2024-06-13 16:48:17 +02:00
Aurelien DARRAGON	15e9c7da6b	MINOR: log: add log-profile parsing logic This patch implements prerequisite log-profile struct and parser logic. It has no effect during runtime for now. Logformat expressions provided in log-profile "steps" are postchecked during postparsing for each proxy "log" directive that makes use of a given profile. (this allows to ensure that the logformat expressions used in the profile are compatible with proxy using them)	2024-06-13 15:43:09 +02:00
Aurelien DARRAGON	33f3bec7ee	MINOR: log: add logger flags Logger struct may benefit from having a "flags" struct member to set or remove different logger states. For that, we reuse an existing 4 bytes hole in the logger struct to store a 2 bytes flags integer, leaving the struct with a 2-bytes hole now.	2024-06-13 15:43:09 +02:00
Aurelien DARRAGON	3102c89dde	MINOR: log: provide proxy context to resolve_logger() Prerequisite work for log-profiles, we need to know under which proxy context the logger is being used. When the info is not available, (ie: global section or log-forward section, <px> is set to NULL)	2024-06-13 15:43:09 +02:00
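
The burst versus cumulative timeout distinction described in commit 983513d901 can be illustrated with a minimal standalone sketch in C. It is not HAProxy's code: the `sketch_timer` structure, the `timer_status` values and both helper functions are hypothetical names invented for this example, and the only thing taken from the commit message is the idea that the timer check returns a different value depending on which limit was exceeded, so the caller can report a distinct error.

```c
/* Hypothetical result codes for this sketch; these are NOT HAProxy's
 * HLUA_E_ETMOUT / HLUA_E_BTMOUT definitions.
 */
enum timer_status {
    TIMER_OK = 0,         /* no limit exceeded */
    TIMER_BURST_EXPIRED,  /* too much time spent since the last yield */
    TIMER_CUMUL_EXPIRED   /* total allowed run time exceeded */
};

/* Hypothetical timer state, all values in milliseconds. */
struct sketch_timer {
    unsigned int burst_ms; /* time spent in the current burst */
    unsigned int cumul_ms; /* time spent in previous bursts */
    unsigned int max_ms;   /* cumulative limit, 0 means unlimited */
};

/* Report which limit, if any, was exceeded. The burst limit is checked
 * first, so a long single execution is never reported as a cumulative
 * timeout when no cumulative limit is configured.
 */
static enum timer_status sketch_timer_check(const struct sketch_timer *t,
                                            unsigned int burst_limit_ms)
{
    if (burst_limit_ms && t->burst_ms > burst_limit_ms)
        return TIMER_BURST_EXPIRED;
    if (t->max_ms && t->cumul_ms + t->burst_ms > t->max_ms)
        return TIMER_CUMUL_EXPIRED;
    return TIMER_OK;
}

/* Map each status to a distinct user-facing message. */
static const char *sketch_timer_msg(enum timer_status st)
{
    switch (st) {
    case TIMER_BURST_EXPIRED:
        return "burst timeout";
    case TIMER_CUMUL_EXPIRED:
        return "execution timeout";
    default:
        return "ok";
    }
}
```

With this shape, a task that has no cumulative limit (`max_ms` set to 0) but spends too long between two yields gets the "burst timeout" message rather than the generic "execution timeout" one, which is the confusion the commit message describes.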


7961 Commits