It's more generic and versatile than the previous shut_your_big_mouth_gcc()
that was used to silence annoying warnings as it's not limited to ignoring
syscalls returns only. This allows us to get rid of the aforementioned
function and the shut_your_big_mouth_gcc_int variable, that started to
look ugly in multi-threaded environments.
These are mostly comments in the code. A few error messages were fixed
and are of low enough importance not to deserve a backport. Some regtests
were also fixed.
This patch adds the `unique-id` option to `proxy-v2-options`. If this
option is set a unique ID will be generated based on the `unique-id-format`
while sending the proxy protocol v2 header and stored as the unique id for
the first stream of the connection.
This feature is meant to be used in `tcp` mode. It works on HTTP mode, but
might result in inconsistent unique IDs for the first request on a keep-alive
connection, because the unique ID for the first stream is generated earlier
than the others.
Now that we can send unique IDs in `tcp` mode the `%ID` log variable is made
available in TCP mode.
The stream-int code doesn't need to load server.h as it doesn't use
servers at all. However removing this one reveals that proxy.h was
lacking types/checks.h that used to be silently inherited from
types/server.h loaded before in stream_interface.h.
pattern_finalize_config() uses an inefficient algorithm which is a
problem with very large configuration files. This affects startup, and
therefore reload time. When haproxy is deployed as a router in a
Kubernetes cluster the generated configuration file may be large and
reloads are frequently occuring, which makes this a significant issue.
The old algorithm is O(n^2)
* allocate missing uids - O(n^2)
* sort linked list - O(n^2)
The new algorithm is O(n log n):
* find the user allocated uids - O(n)
* store them for efficient lookup - O(n log n)
* allocate missing uids - n times O(log n)
* sort all uids - O(n log n)
* convert back to linked list - O(n)
Performance examples, startup time in seconds:
pat_refs old new
1000 0.02 0.01
10000 2.1 0.04
20000 12.3 0.07
30000 27.9 0.10
40000 52.5 0.14
50000 77.5 0.17
Please backport to 1.8, 2.0 and 2.1.
We always set them both, which makes sense since errors at the FD level
indicate a terminal condition for the socket that cannot be recovered.
Usually this is detected via a write error, but sometimes such an error
may asynchronously be reported on the read side. Let's simplify this
using only the write bit and calling it RW since it's used like this
everywhere, and leave the R bit spare for future use.
This was used only by fd_recv_state() and fd_send_state(), both of
which are unused. This will not work anymore once recv and send flags
start to differ, so let's remove this.
This lock was only needed to protect the buffer_wq list, but now we have
the mt_list for this. This patch simply turns the buffer_wq list to an
mt_list and gets rid of the lock.
It's worth noting that the whole buffer_wait thing still looks totally
wrong especially in a threaded context: the wakeup_cb() callback is
called synchronously from any thread and may end up calling some
connection code that was not expected to run on a given thread. The
whole thing should probably be reworked to use tasklets instead and be
a bit more centralized.
Move the `!` inside the likely and negate it to unlikely.
The previous version should not have caused issues, because it is converted
to a boolean / integral value before being passed to __builtin_expect(), but
it's certainly unusual.
[wt: this was not a bug but purposely written like this to improve code
generation on older compilers but not needed anymore as described here:
https://www.mail-archive.com/haproxy@formilux.org/msg36392.html ]
This marks the end of the transition from the connection polling states
introduced in 1.5-dev12 and the subscriptions in that arrived in 1.9.
The socket layer can now safely use its FD while all upper layers rely
exclusively on subscriptions. These old functions were removed. Some may
deserve some renaming to improved clarty though. The single call to
conn_xprt_stop_both() was dropped in favor of conn_cond_update_polling()
which already does the same.
The last few calls to conn_xprt_{want,stop}_{recv,send} in the central
connection code were replaced with their strictly exact equivalent fd_*,
adding the call to conn_ctrl_ready() when it was missing.
Historically we used to require that the connections held the desired
polling states for the data layer and the socket layer. Then with muxes
these were more or less merged into the transport layer, and now it
happens that with all transport layers having their own state, the
"transport layer state" as we have it in the connection (XPRT_RD_ENA,
XPRT_WR_ENA) is only an exact copy of the undelying file descriptor
state, but with a delay. All of this is causing some difficulties at
many places in the code because there are still some locations which
use the conn_want_* API to remain clean and only rely on connection,
and count on a later collection call to conn_cond_update_polling(),
while others need an immediate action and directly use the FD updates.
Since our updates are now much cheaper, most of them being only an
atomic test-and-set operation, and since our I/O callbacks are deferred,
there's no benefit anymore in trying to "cache" the transient state
change in the connection flags hoping to cancel them before they
become an FD event. Better make such calls transparent indirections
to the FD layer instead and get rid of the deferred operations which
needlessly complicate the logic inside.
This removes flags CO_FL_XPRT_{RD,WR}_ENA and CO_FL_WILL_UPDATE.
A number of functions related to polling updates were either greatly
simplified or removed.
Two places were using CO_FL_XPRT_WR_ENA as a hint to know if more data
were expected to be sent after a PROXY protocol or SOCKSv4 header. These
ones were simply replaced with a check on the subscription which is
where we ought to get the autoritative information from.
Now the __conn_xprt_want_* and their conn_xprt_want_* counterparts
are the same. conn_stop_polling() and conn_xprt_stop_both() are the
same as well. conn_cond_update_polling() only causes errors to stop
polling. It also becomes way more obvious that muxes should not at
all employ conn_xprt_{want|stop}_{recv,send}(), and that the call
to __conn_xprt_stop_recv() in case a mux failed to allocate a buffer
is inappropriate, it ought to unsubscribe from reads instead. All of
this definitely requires a serious cleanup.
http_get_hdrs_size() function may now be used to get the bytes held by headers
in an HTX message. It only works if the headers were not already
forwarded. Metadata are not counted here.
When an end pointer is passed, instead of complaining that a comma is
missing after a keyword, sample_parse_expr() will silently return the
pointer to the current location into this return pointer so that the
caller can continue its parsing. This will be used by more complex
expressions which embed sample expressions, and may even permit to
embed sample expressions into arguments of other expressions.
The main problem we're having with argument parsing is that at the
moment the caller looks for the first character looking like an end
of arguments (')') and calls make_arg_list() on the sub-string inside
the parenthesis.
Let's first change the way it works so that make_arg_list() also
consumes the parenthesis and returns the pointer to the first char not
consumed. This will later permit to refine each argument parsing.
For now there is no functional change.
This patch introduces the 'http-after-response' rules. These rules are evaluated
at the end of the response analysis, just before the data forwarding, on ALL
HTTP responses, the server ones but also all responses generated by
HAProxy. Thanks to this ruleset, it is now possible for instance to add some
headers to the responses generated by the stats applet. Following actions are
supported :
* allow
* add-header
* del-header
* replace-header
* replace-value
* set-header
* set-status
* set-var
* strict-mode
* unset-var
Operations performed when internal responses (redirect/deny/auth/errors) are
returned are always the same. The http_forward_proxy_resp() function is added to
group all of them under a unique function.
The http_server_error() function now relies on http_reply_and_close(). Both do
almost the same actions. In addtion, http_server_error() sets the error flag and
the final state flag on the stream.
The channel_htx_copy_msg() function can now be used to copy an HTX message in a
channel's buffer. This function takes care to not overwrite existing data.
This patch depends on the commit "MINOR: htx: Add a function to append an HTX
message to another one". Both are mandatory to fix a bug in
http_reply_and_close() function. Be careful to backport both first.
Commit a17664d829 ("MEDIUM: tasks: automatically requeue into the bulk
queue an already running tasklet") tried to inflict a penalty to
self-requeuing tasks/tasklets which correspond to those involved in
large, high-latency data transfers, for the benefit of all other
processing which requires a low latency. However, it turns out that
while it ought to do this on a case-by-case basis, basing itself on
the RUNNING flag isn't accurate because this flag doesn't leave for
tasklets, so we'd rather need a distinct flag to tag such tasklets.
This commit introduces TASK_SELF_WAKING to mark tasklets acting like
this. For now it's still set when TASK_RUNNING is present but this
will have to change. The flag is kept across wakeups.
When a tasklet re-runs itself such as in this chain:
si_cs_io_cb -> si_cs_process -> si_notify -> si_chk_rcv
then we know it can easily clobber the run queue and harm latency. Now
what the scheduler does when it detects this is that such a tasklet is
automatically placed into the bulk list so that it's processed with the
remaining CPU bandwidth only. Thanks to this the CLI becomes instantly
responsive again even under heavy stress at 50 Gbps over 40kcon and
100% CPU on 16 threads.
We used to mix high latency tasks and low latency tasklets in the same
list, and to even refill bulk tasklets there, causing some unfairness
in certain situations (e.g. poll-less transfers between many connections
saturating the machine with similarly-sized in and out network interfaces).
This patch changes the mechanism to split the load into 3 lists depending
on the task/tasklet's desired classes :
- URGENT: this is mainly for tasklets used as deferred callbacks
- NORMAL: this is for regular tasks
- BULK: this is for bulk tasks/tasklets
Arbitrary ratios of max_processed are picked from each of these lists in
turn, with the ability to complete in one list from what was not picked
in the previous one. After some quick tests, the following setup gave
apparently good results both for raw TCP with splicing and for H2-to-H1
request rate:
- 0 to 75% for urgent
- 12 to 50% for normal
- 12 to what remains for bulk
Bulk is not used yet.
The xprt_done_cb callback was used to defer some connection initialization
until we're connected and the handshake are done. As it mostly consists of
creating the mux, instead of using the callback, introduce a conn_create_mux()
function, that will just call conn_complete_session() for frontend, and
create the mux for backend.
In h2_wake(), make sure we call the wake method of the stream_interface,
as we no longer wakeup the stream task.
In connect_server(), when creating a new connection for which we don't yet
know the mux (because it'll be decided by the ALPN), instead of associating
the connection to the stream_interface, always create a conn_stream. This way,
we have less special-casing needed. Store the conn_stream in conn->ctx,
so that we can reach the upper layers if needed.
It is now possible to set the error message to use when a deny rule is
executed. It may be a specific error file, adding "errorfile <file>" :
http-request deny deny_status 400 errorfile /etc/haproxy/errorfiles/400badreq.http
It may also be an error file from an http-errors section, adding "errorfiles
<name>" :
http-request deny errorfiles my-errors # use 403 error from "my-errors" section
When defined, this error message is set in the HTTP transaction. The tarpit rule
is also concerned by this change.
It is now possible to import in a proxy, fully or partially, error files
declared in an http-errors section. It may be done using the "errorfiles"
directive, followed by a name and optionally a list of status code. If there is
no status code specified, all error files of the http-errors section are
imported. Otherwise, only error files associated to the listed status code are
imported. For instance :
http-errors my-errors
errorfile 400 ...
errorfile 403 ...
errorfile 404 ...
frontend frt
errorfiles my-errors 403 404 # ==> error 400 not imported
All custom HTTP errors are now stored in a global tree. Proxies use a references
on these messages. The key used for errorfile directives is the file name as
specified in the configuration. For errorloc directives, a key is created using
the redirect code and the url. This means that the same custom error message is
now stored only once. It may be used in several proxies or for several status
code, it is only parsed and stored once.
http_parse_errorloc() may now be used to create an HTTP 302 or 303 redirect
message with a specific url passed as parameter. A parameter is used to known if
it is a 302 or a 303 redirect. A status code is passed as parameter. It must be
one of the supported HTTP error codes to be valid. Otherwise an error is
returned. It aims to be used to parse "errorloc" directives. It relies on
http_load_errormsg() to do most of the job, ie converting it in HTX.
http_parse_errorfile() may now be used to parse a raw HTTP message from a
file. A status code is passed as parameter. It must be one of the supported HTTP
error codes to be valid. Otherwise an error is returned. It aims to be used to
parse "errorfile" directives. It relies on http_load_errorfile() to do most of
the job, ie reading the file content and converting it in HTX.
Now, this action is use its own dedicated function and is no longer handled "in
place" during the TCP rules evaluation. Thus the action name ACT_TCP_CAPTURE is
removed. The action type is set to ACT_CUSTOM and a check function is used to
know if the rule depends on request contents while there is no inspect-delay.
Now, these actions use their own dedicated function and are no longer handled
"in place" during the TCP/HTTP rules evaluation. Thus the action names
ACT_ACTION_TRK_SC0 and ACT_ACTION_TRK_SCMAX are removed. The action type is now
the tracking index. Thus the function trk_idx() is no longer needed.
Now, these actions use their own dedicated function and are no longer handled
"in place" during the HTTP rules evaluation. Thus the action names
ACT_HTTP_REPLACE_HDR and ACT_HTTP_REPLACE_VAL are removed. The action type is
now set to 0 to evaluate the whole header or to 1 to evaluate every
comma-delimited values.
The function http_transform_header_str() is renamed to http_replace_hdrs() to be
more explicit and the function http_transform_header() is removed. In fact, this
last one is now more or less the new action function.
The lua code has been updated accordingly to use http_replace_hdrs().
Info used by HTTP rules manipulating the message itself are splitted in several
structures in the arg union. But it is possible to group all of them in a unique
struct. Now, <arg.http> is used by most of these rules, which contains:
* <arg.http.i> : an integer used as status code, nice/tos/mark/loglevel or
action id.
* <arg.http.str> : an IST used as header name, reason string or auth realm.
* <arg.http.fmt> : a log-format compatible expression
* <arg.http.re> : a regular expression used by replace rules
In HTTP rules, error handling during a rewrite is now handle the same way for
all rules. First, allocation errors are reported as internal errors. Then, if
soft rewrites are allowed, rewrite errors are ignored and only the
failed_rewrites counter is incremented. Otherwise, when strict rewrites are
mandatory, interanl errors are returned.
For now, only soft rewrites are supported. Note also that the warning sent to
notify a rewrite failure was removed. It will be useless once the strict
rewrites will be possible.
Functions to deinitialize the HTTP rules are buggy. These functions does not
check the action name to release the right part in the arg union. Only few info
are released. For auth rules, the realm is released and there is no problem
here. But the regex <arg.hdr_add.re> is always unconditionally released. So it
is easy to make these functions crash. For instance, with the following rule
HAProxy crashes during the deinit :
http-request set-map(/path/to/map) %[src] %[req.hdr(X-Value)]
For now, These functions are simply removed and we rely on the deinit function
used for TCP rules (renamed as deinit_act_rules()). This patch fixes the
bug. But arguments used by actions are not released at all, this part will be
addressed later.
This patch must be backported to all stable versions.
The subscriber used to be passed as a "void *param" that was systematically
cast to a struct wait_event*. By now it appears clear that the subscribe()
call at every layer is well defined and always takes a pointer to an event
subscriber of type wait_event, so let's enforce this in the functions'
prototypes, remove the intermediary variables used to cast it and clean up
the comments to clarify what all these functions do in their context.
In practice all callers use the same wait_event notification for any I/O
so instead of keeping specific code to handle them separately, let's merge
them and it will allow us to create new events later.
For more than a decade we've kept all the sess_update_st_*() functions
in stream.c while they're only there to work in relation with what is
currently being done in backend.c (srv_redispatch_connect, connect_server,
etc). Let's move all this pollution over there and take this opportunity
to try to find slightly less confusing names for these old functions
whose role is only to handle transitions from one specific stream-int
state:
sess_update_st_rdy_tcp() -> back_handle_st_rdy()
sess_update_st_con_tcp() -> back_handle_st_con()
sess_update_st_cer() -> back_handle_st_cer()
sess_update_stream_int() -> back_try_conn_req()
sess_prepare_conn_req() -> back_handle_st_req()
sess_establish() -> back_establish()
The last one remained in stream.c because it's more or less a completion
function which does all the initialization expected on a connection
success or failure, can set analysers and emit logs.
The other ones could possibly slightly benefit from being modified to
take a stream-int instead since it's really what they're working with,
but it's unimportant here.
These ones used to serve as a set of switches between CO_FL_SOCK_* and
CO_FL_XPRT_*, and now that the SOCK layer is gone, they're always a
copy of the last know CO_FL_XPRT_* ones that is resynchronized before
I/O events by calling conn_refresh_polling_flags(), and that are pushed
back to FDs when detecting changes with conn_xprt_polling_changes().
While these functions are not particularly heavy, what they do is
totally redundant by now because the fd_want_*/fd_stop_*() actions
already perform test-and-set operations to decide to create an entry
or not, so they do the exact same thing that is done by
conn_xprt_polling_changes(). As such it is pointless to call that
one, and given that the only reason to keep CO_FL_CURR_* is to detect
changes there, we can now remove them.
Even if this does only save very few cycles, this removes a significant
complexity that has been responsible for many bugs in the past, including
the last one affecting FreeBSD.
All tests look good, and no performance regressions were observed.
CO_FL_WAIT_ROOM is set by the splicing function in raw_sock, and cleared
by the stream-int when splicing is disabled, as well as in
conn_refresh_polling_flags() so that a new call to ->rcv_pipe() could
be attempted by the I/O callbacks called from conn_fd_handler(). This
clearing in conn_refresh_polling_flags() makes no sense anymore and is
in no way related to the polling at all.
Since we don't call them from there anymore it's better to clear it
before attempting to receive, and to set it again later. So let's move
this operation where it should be, in raw_sock_to_pipe() so that it's
now symmetric. It was also placed in raw_sock_to_buf() so that we're
certain that it gets cleared if an attempt to splice is replaced with
a subsequent attempt to recv(). And these were currently already achieved
by the call to conn_refresh_polling_flags(). Now it could theorically be
removed from the stream-int.
In tasklet_free(), to attempt to remove ourself, use MT_LIST_DEL, we can't
just use LIST_DEL(), as we theorically could be in the shared tasklet list.
This should be backported to 2.1.
In order to reduce the number of poller updates, we can benefit from
the fact that modern pollers use sampling to report readiness and that
under load they rarely report the same FD multiple times in a row. As
such it's not always necessary to disable such FDs especially when we're
almost certain they'll be re-enabled again and will require another set
of syscalls.
Now instead of creating an update for a (possibly temporary) removal,
we only perform this removal if the FD is reported again as ready while
inactive. In addition this is performed via another update so that
alternating workloads like transfers have a chance to re-enable the
FD without any syscall during the loop (typically after the data that
filled a buffer have been sent). However we only do that for single-
threaded FDs as the other ones require a more complex setup and are not
on the critical path.
This does cause a few spurious wakeups but almost totally eliminates the
calls to epoll_ctl() on connections seeing intermitent traffic like HTTP/1
to a server or client.
A typical example with 100k requests for 4 kB objects over 200 connections
shows that the number of epoll_ctl() calls doesn't depend on the number
of requests anymore but most exclusively on the number of established
connections:
Before:
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
57.09 0.499964 0 654361 321190 recvfrom
38.33 0.335741 0 369097 1 epoll_wait
4.56 0.039898 0 44643 epoll_ctl
0.02 0.000211 1 200 200 connect
------ ----------- ----------- --------- --------- ----------------
100.00 0.875814 1068301 321391 total
After:
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
59.25 0.504676 0 657600 323630 recvfrom
40.68 0.346560 0 374289 1 epoll_wait
0.04 0.000370 0 620 epoll_ctl
0.03 0.000228 1 200 200 connect
------ ----------- ----------- --------- --------- ----------------
100.00 0.851834 1032709 323831 total
As expected there is also a slight increase of epoll_wait() calls since
delaying de-activation of events can occasionally cause one spurious
wakeup.
For now this almost never happens but with subsequent patches it will
become more important not to uselessly call the I/O handlers if the FD
is not active.
The function is not TCP-specific at all, it covers all FD-based sockets
so let's move this where other similar functions are, in connection.c,
and rename it conn_fd_check().