When mux->init() fails, session_free() will call it again to unregister
it while it was already done, resulting in null derefs or use-after-free.
This typically happens on out-of-memory conditions during H1 or H2 connection
or stream allocation.
This fix must be backported to 1.9.
The function channel_htx_truncate() can now be used on HTX buffer to truncate
all incoming data, keeping outgoing one intact. This function relies on the
function channel_htx_erase() and htx_truncate().
This patch may be backported to 1.9. If so, the patch "MINOR: channel/htx: Add
the HTX version of channel_truncate()" must also be backported.
HTX versions for functions to test the free space in input against the reserve
have been added. Now, on HTX streams, following functions can be used:
* channel_htx_may_recv
* channel_htx_recv_limit
* channel_htx_recv_max
* channel_htx_full
This patch must be backported in 1.9 because it will be used by a futher patch
to fix a bug.
This function must be called when new incoming data are pushed in the channel's
buffer. It updates the channel state and take care of the fast forwarding by
consuming right amount of data and decrementing "->to_forward" accordingly when
necessary. In fact, this patch just moves a part of ci_putblk in a dedicated
function.
This patch must be backported to 1.9.
Instead of keeping track of the number of connections we're responsible for,
keep track of the number of connections we're responsible for that we are
currently considering idling (ie that we are not using, they may be in use
by other sessions), that way we can actually reuse connections when we have
more connections than the max configured.
When a session adds a connection to its connection list, we used to remove
connections for an another server if there were not enough room for our
server. This can't work, because those lists are now the list of connections
we're responsible for, not just the idle connections.
To fix this, allow for an unlimited number of servers, instead of using
an array, we're now using a linked list.
In si_release_endpoint(), if the end point is a connection, because we don't
know which mux to use it, make sure we close the connection before freeing it,
or else, we'd have a fd left for polling, which would point to a now free'd
connection.
This should be backported to 1.9.
As long-time changes have accumulated over time, the exported functions
of the stream-interface were almost all prefixed "si_<something>" while
most private ones (mostly callbacks) were called "stream_int_<something>".
There were still a few confusing exceptions, which were addressed to
follow this shcme :
- stream_sock_read0(), only used internally, was renamed stream_int_read0()
and made static
- stream_int_notify() is only private and was made static
- stream_int_{check_timeouts,report_error,retnclose,register_handler,update}
were renamed si_<something>.
Now it is clearer when checking one of these if it risks to be used outside
or not.
There was a reference to struct stream in conn_free() for the case
where we're freeing a connection that doesn't have a mux attached.
For now we know it's always a stream, and we only need to do it to
put a NULL in s->si[1].end.
Let's do it better by storing the pointer to si[1].end in the context
and specifying that this pointer is always nulled if the mux is null.
This way it allows a connection to detach itself from wherever it's
being used. Maybe we could even get rid of the condition on the mux.
We most often store the mux context there but it can also be something
else while setting up the connection. Better call it "ctx" and know
that it's the owner's context than misleadingly call it mux_ctx and
get caught doing suspicious tricks.
The SUB_CAN_SEND/SUB_CAN_RECV enum values have been confusing a few
times, especially when checking them on reading. After some discussion,
it appears that calling them SUB_RETRY_SEND/SUB_RETRY_RECV more
accurately reflects their purpose since these events may only appear
after a first attempt to perform the I/O operation has failed or was
not completed.
In addition the wait_reason field in struct wait_event which carries
them makes one think that a single reason may happen at once while
it is in fact a set of events. Since the struct is called wait_event
it makes sense that this field is called "events" to indicate it's the
list of events we're subscribed to.
Last, the values for SUB_RETRY_RECV/SEND were swapped so that value
1 corresponds to recv and 2 to send, as is done almost everywhere else
in the code an in the shutdown() call.
Types DNS_SRVRQ and CS were not referenced in the type to string
conversions, causing possibly misleading outputs in session dumps.
Now instead of showing "NONE" for unknown invalid types names, we
display "!INVAL!" to clear the confusion that may exist in case of
memory corruption for example.
In session, don't keep an infinite number of connection that can idle.
Add a new frontend parameter, "max-session-srv-conns" to set a max number,
with a default value of 5.
Instead of trying to get the session from the connection, which is not
always there, and of course there could be multiple sessions per connection,
provide it with the init() and attach() methods, so that we know the
session for each outgoing stream.
Add a new command, "pool-max-conn" that sets the maximum number of connections
waiting in the orphan idling connections list (as activated with idle-timeout).
Using "-1" means unlimited. Using pools is now dependant on this.
Now that h1 and legacy HTTP are two distinct things, there's no need
to keep the legacy HTTP parsers in h1.c since they're only used by
the legacy code in proto_http.c, and h1.h doesn't need to include
hdr_idx anymore. This concerns the following functions :
- http_parse_reqline();
- http_parse_stsline();
- http_msg_analyzer();
- http_forward_trailers();
All of these were moved to http_msg.c.
Lots of HTTP code still uses struct http_msg. Not only this code is
still huge, but it's part of the legacy interface. Let's move most
of these functions to a separate file http_msg.c to make it more
visible which file relies on what. It's mostly symmetrical with
what is present in http_htx.c.
The function http_transform_header_str() which used to rely on two
function pointers to look up a header was simplified to rely on
two variants http_legacy_replace_{,full_}header(), making both
sides of the function much simpler.
No code was changed beyond these moves.
All the HTX definition is self-contained and doesn't really depend on
anything external since it's a mostly protocol. In addition, some
external similar files (like h2) also placed in common used to rely
on it, making it a bit awkward.
This patch moves the two htx.h files into a single self-contained one.
The historical dependency on sample.h could be also removed since it
used to be there only for http_meth_t which is now in http.h.
There were a number of ugly setsockopt() calls spread all over
proto_http.c, proto_htx.c and hlua.c just to manipulate the front
connection's TOS, mark or TCP quick-ack. These ones entirely relied
on the connection, its existence, its control layer's presence, and
its addresses. Worse, inet_set_tos() was placed in proto_http.c,
exported and used from the two other ones, surrounded in #ifdefs.
This patch moves this code to connection.h and makes the other ones
rely on it without ifdefs.
If we try to receive before the connection is established, we lose the
send event and are not woken up anymore once the connection is established.
This was diagnosed by Olivier.
No backport is needed.
There are some situations where we need to wait for the other side to
be connected. None of the current blocking flags support this. It used
to work more or less by accident using the old flags. Let's add a new
flag to mention we're blocking on this, it's removed by si_chk_rcv()
when a connection is established. It should be enough for now.
The master is not supposed to run (at the moment) any task before the
polling loop, the created tasks should be run only in the workers but in
the master they should be disabled or removed.
No backport needed.
To ease the fast forwarding and the infinte forwarding on HTX proxies, 2
functions have been added to let the channel be almost aware of the way data are
stored in its buffer. By calling these functions instead of legacy ones, we are
sure to forward the right amount of data.
Now, the function htx_from_buf() will set the buffer's length to its size
automatically. In return, the caller should call htx_to_buf() at the end to be
sure to leave the buffer hosting the HTX message in the right state. When the
caller can use the function htxbuf() to get the HTX message without any update
on the underlying buffer.
The small HTX overhead is enough to make the system perform multiple
reads and unaligned memory copies. Here we provide a function whose
purpose is to reduce the apparent room in a buffer by the size of the
overhead for DATA blocks, which is the struct htx plus 2 blocks (one
for DATA, one for the end of message so that small blocks can fit at
once). The muxes using HTX will be encouraged to use this one instead
of b_room() to compute the available buffer room and avoid filling
their demux buf with more data than can fit at once into the HTX
buffer.
This one is used a lot during transfers, let's avoid resetting its
size when there are already data in the buffer since it implies the
size is correct.
We currently have conn_get_best_mux() to return the best mux for a
given protocol name, side and proxy mode. But we need the mux entry
as well in order to fix the bind_conf and servers at the end of the
config parsing. Let's split the function in two parts. It's worth
noting that the <conn> argument is never used anymore so this part
is eligible to some cleanup.
First, to be called on HTX streams, a filter must explicitly be declared as
compatible by setting the flag STRM_FLT_FL_HAS_FILTERS on the filter's config at
HAProxy startup. This flag is checked when a filter implementation is attached
to a stream.
Then, some changes have been made on HTTP callbacks. The callback http_payload
has been added to filter HTX data. It will be called on HTX streams only. It
replaces the callbacks http_data, http_chunk_trailers and http_forward_data,
called on legacy HTTP streams only and marked as deprecated. The documention
(once updated)) will give all information to implement this new callback. Other
HTTP callbacks will be called for HTX and HTTP legacy streams. So it is the
filter's responsibility to known which kind of data it handles. The macro
IS_HTX_STRM should be used in such cases.
There is at least a noticeable changes in the way data are forwarded. In HTX,
after the call to the callback http_headers, all the headers are considered as
forwarded. So, in http_payload, only the body and eventually the trailers will
be filtered.
During startup, after the configuration parsing, all HTTP error messages
(errorloc, errorfile or default messages) are converted into HTX messages and
stored in dedicated buffers. We use it to return errors in the HTX analyzers
instead of using ugly OOB blocks.
Instead, we now use the htx_sl coming from the HTX message. It avoids to have
too H1 specific code in version-agnostic parts. Of course, the concept of the
start-line is higly influenced by the H1, but the structure htx_sl can be
adapted, if necessary. And many things depend on a start-line during HTTP
analyzis. Using the structure htx_sl also avoid boring conversions between HTX
version and H1 version.
If there is no start-line, this offset is set to -1. Otherwise, it is the
relative address where the start-line is stored in the data block. When the
start-line is added, replaced or removed, this offset is updated accordingly. On
remove, if the start-line is no set and if the next block is a start-line, the
offset is updated. Finally, when an HTX structure is defragmented, the offset is
also updated accordingly.
The HTX start-line is now a struct. It will be easier to extend, if needed. Same
info can be found, of course. In addition it is now possible to set flags on
it. It will be used to set some infos about the message.
Some macros and functions have been added in proto/htx.h to help accessing
different parts of the start-line.
The function htx_find_blk() returns the HTX block containing data with a given
offset, relatively to the beginning of the HTX message. It is a good way to skip
outgoing data and find the first HTX block not already processed.
the functions htx_get_next() and htx_get_prev() are used to iterate on an HTX
message using blocks position. With htx_get_next_blk() and htx_get_prev_blk(),
it is possible to do the same, but with HTX blocks. Of course, internally, we
rely on position's versions to do so. But it is handy for callers to not take
care of the blocks position.
The function htx_add_data_before() can be used to add an HTX block before
another one. For instance, it could be used to add some data before the
end-of-message marker.
Time to time, the need arises to get some info owned by the multiplexer about a
connection stream from the upper layer. Today we really need to get some dates
and durations specific to the conn_stream. It is only true for the mux H1 and
H2. Otherwise it will be impossible to have correct times reported in the logs.
To do so, the structure cs_info has been defined to provide all info we ever
need on a conn_stream from the upper layer. Of course, it is the first step. So
this structure will certainly envloved. But for now, only the bare minimum is
referenced. On the mux side, the callback get_cs_info() has been added in the
structure mux_ops. Multiplexers can now implement it, if necessary, to return a
pointer on a structure cs_info. And finally, the function si_get_cs_info()
should be used from the upper layer. If the stream interface is not attached to
a connection stream, this function returns NULL, likewise if the callback
get_cs_info() is not defined for the corresponding mux.
htx_cut_data_blk() is used to cut the beginning of a DATA block after a
part of it was tranferred. It simply advances the address, reduces the
advertised length and updates the htx's total data count.
It looks like we forgot to report HTX when listing the muxes and their
respective protocols, leading to "NONE" being displayed. Let's report
"HTX" and "HTTP|HTX" since both will exist. Also fix a minor typo in
the output message.
Instead of just storing the last connection in the session, store all of
the connections, for at most MAX_SRV_LIST (currently 5) targets.
That way we can do keepalive on more than 1 outgoing connection when the
client uses HTTP/2.
signal_init(), init_log(), init_stream(), and init_task() all used to
only preset some values and lists. This needs to be done very early to
provide a reliable interface to all other users. The calls used to be
explicit in haproxy.c:init(). Now they're placed in initcalls at the
STG_PREPARE stage. The functions are not exported anymore.
Instead of exporting a number of pools and having to manually delete
them in deinit() or to have dedicated destructors to remove them, let's
simply kill all pools on deinit().
For this a new function pool_destroy_all() was introduced. As its name
implies, it destroys and frees all pools (provided they don't have any
user anymore of course).
This allowed to remove 4 implicit destructors, 2 explicit ones, and 11
individual calls to pool_destroy(). In addition it properly removes
the mux_pt_ctx pool which was not cleared on exit (no backport needed
here since it's 1.9 only). The sig_handler pool doesn't need to be
exported anymore and became static now.
This commit replaces the explicit pool creation that are made in
constructors with a pool registration. Not only this simplifies the
pools declaration (it can be done on a single line after the head is
declared), but it also removes references to pools from within
constructors. The only remaining create_pool() calls are those
performed in init functions after the config is parsed, so there
is no more user of potentially uninitialized pool now.
It has been the opportunity to remove no less than 12 constructors
and 6 init functions.
Building with musl and gcc-5.3 for MIPS returns this :
include/common/buf.h: In function 'b_dist':
include/common/buf.h:252:2: error: unknown type name 'ssize_t'
ssize_t dist = to - from;
^
Including stdint or stddef is not sufficient there to get ssize_t,
unistd is needed as well. It's likely that other platforms will have
the same issue. This patch also addresses it in ist.h and memory.h.
Building on 32 bits gives this :
include/proto/htx.h: In function 'htx_dump':
include/proto/htx.h:443:25: warning: format '%lu' expects argument of type 'long unsigned int', but argument 8 has type 'uint64_t {aka long long unsigned int}' [-Wformat=]
fprintf(stderr, "htx:%p [ size=%u - data=%u - used=%u - wrap=%s - extra=%lu]\n",
^
In htx_dump(), fprintf() uses %lu but the value is an uint64_t so it
doesn't match on 32-bit. Let's cast this to unsigned long long and use
%llu instead.
When we create a connection, if we have to defer the conn_stream and the
mux creation until we can decide it (ie until the SSL handshake is done, and
the ALPN is decided), store the connection in the stream_interface, so that
we're sure we can destroy it if needed.
If an ALPN (or a NPN) was chosen for a server, defer choosing the mux until
after the SSL handshake is done, and the ALPN/NPN has been negociated, so
that we know which mux to pick.
Right now we measure for each task the cumulated time spent waiting for
the CPU and using it. The timestamp uses a 64-bit integer to report a
nanosecond-level date. This is only enabled when "profiling.tasks" is
enabled, and consumes less than 1% extra CPU on x86_64 when enabled.
The cumulated processing time and wait time are reported in "show sess".
The task's counters are also reset when an HTTP transaction is reset
since the HTTP part pretends to restart on a fresh new stream. This
will make sure we always report correct numbers for each request in
the logs.
This is a new global setting which enables or disables CPU profiling
per task. For now it only sets/resets the variable based on the global
option "profiling.tasks" and supports showing it as well as setting it
from the CLI using "show profiling" and "set profiling". The option will
be used by a future commit. It was done in a way which should ease future
addition of profiling options.
Since we know the time it takes to process everything between two poll()
calls, we can use this as the max latency measurement any task will
experience and average it.
This code does this, and reports in "show activity" the average of this
loop time over the last 1024 poll() loops, for each thread. It will vary
quickly at high loads and slowly under low to moderate loads, depending
on the rate at which poll() is called. The latency a task experiences
is expected to be half of this on average.
At the moment the situation with activity measurement is quite tricky
because the struct activity is defined in global.h and declared in
haproxy.c, with operations made in time.h and relying on freq_ctr
which are defined in freq_ctr.h which itself includes time.h. It's
barely possible to touch any of these files without breaking all the
circular dependency.
Let's move all this stuff to activity.{c,h} and be done with it. The
measurement of active and stolen time is now done in a dedicated
function called just after tv_before_poll() instead of mixing the two,
which used to be a lazy (but convenient) decision.
No code was changed, stuff was just moved around.
Just found that proto/cli.h doesn't build if types/cli.h is not also
included by the caller, as it uses cli_kw_list is used in arguments.
But it's also true for a few other ones like mworker_proc, stream,
and channel, so let's fix this.
The new function signal_unregister() removes every handlers assigned to
a signal. Once the handler list of the signal is empty, the signal is
ignored with SIG_IGN.
This was the largest function of the whole file, taking a rough second
to build alone. Let's move it to a distinct file along with a few
dependencies. Doing so saved about 2 seconds on the total build time.
It does the same than smp_prefetch_http but for HTX messages. It can be called
from an HTTP proxy or a TCP proxy. For HTTP proxies, the parsing is handled by
the mux, so it does nothing but wait. For TCP proxies, it tries to parse an HTTP
message and to convert it in a temporary HTX message. Sample fetches will use
this temporary variable to do their job.
It is more or less the same than legacy version but adapted to be called from
HTX analyzers. In the legacy version of this function, we switch on the HTX code
when applicable.
It is more or less the same than legacy version but adapted to be called from
HTX analyzers. In the legacy version of this function, we switch on the HTX code
when applicable.
It is more or less the same than legacy versions but adapted to be called from
HTX analyzers. In the legacy versions of these functions, we switch on the HTX
code when applicable.
It is more or less the same than legacy versions but adapted to be called from
HTX analyzers. In the legacy versions of these functions, we switch on the HTX
code when applicable.
This file will host all functions to manipulate HTTP messages using the HTX
representation. Functions in this file will be able to be called from anywhere
and are mainly related to the HTTP semantics.
The internal representation of an HTTP message, called HTX, is a structured
representation, unlike the old one which is a raw representation of
messages. Idea is to have a version-agnostic representation of the HTTP
messages, which can be easily used by to handle HTTP/1, HTTP/2 and hopefully
QUIC messages, and communication from one of them to another.
In this patch, we add types to define the internal representation itself and the
main functions to manipulate them.
Now, the connection mode is detected in the mux and not in HTX analyzers
anymore. Keep-alive connections are now managed by the mux. A new stream is
created for each transaction. This removes the most important part of the
synchronization between channels and the HTTP transaction cleanup. These changes
only affect the HTX part (proto_htx.c). Legacy HTTP analyzers remain untouched
for now.
On the client-side, the mux is responsible to create new streams when a new
request starts. It is also responsible to parse and update the "Connection:"
header of the response. On the server-side, the mux is responsible to parse and
update the "Connection:" header of the request. Muxes on each side are
independent. For now, there is no connection pool on the server-side, so it
always close the server connection.
For now, these analyzers are just copies of the legacy HTTP analyzers. But,
during the HTTP refactoring, it will be the main place where it will be
visible. And in legacy analyzers, the macro IS_HTX_STRM is used to know if the
HTX version should be called or not.
Note: the following commits were applied to proto_http.c after this patch
was developed and need to be studied to see if an adaptation to htx
is required :
fd9b68c BUG/MINOR: only mark connections private if NTLM is detected
To prepare the refactoring of the code handling HTTP messages, these macros will
help to use HTX functions instead of legacy ones when the new HTX internal
representation is in use. To do so, for a given stream, we will check if its
frontend has the option PR_O2_USE_HTX. It is useless to test backend options
because it is not possible to mix the HTX representation and the legacy one
(i.e, having an HTX frontend and a legacy backend or vice versa).
Do not destroy the connection when we're about to destroy a stream. This
prevents us from doing keepalive on server connections when the client is
using HTTP/2, as a new stream is created for each request.
Instead, the session is now responsible for destroying connections.
When reusing connections, the attach() mux method is now used to create a new
conn_stream.
Introduce a new field in session, "srv_conn", and a linked list of sessions
in the connection. It will be used later when we'll switch connections
from being managed by the stream, to being managed by the session.
Remaining calls to si_cant_put() were all for lack of room and were
turned to si_rx_room_blk(). A few places where SI_FL_RXBLK_ROOM was
cleared by hand were converted to si_rx_room_rdy().
The now unused si_cant_put() function was removed.
The channel can disable reading from the stream-interface using various
methods, such as :
- CF_DONT_READ
- !channel_may_recv()
- and possibly others
Till now this was done by mangling SI_FL_RX_WAIT_EP which is not
appropriate at all since it's not the stream interface which decides
whether it wants to deliver data or not. Some places were also wrongly
relying on SI_FL_RXBLK_ROOM since it was the only other alternative,
but it's not suitable for CF_DONT_READ.
Let's use the SI_FL_RXBLK_CHAN flag for this instead. It will properly
prevent the stream interface from being woken up and reads from
subscribing to more receipt without being accidently removed. It is
automatically reset if CF_DONT_READ is not set in stream_int_notify().
The code is not trivial because it splits the logic between everything
related to buffer contents (channel_is_empty(), CF_WRITE_PARTIAL, etc)
and buffer policy (CF_DONT_READ). Also it now needs to decide timeouts
based on any blocking flag and not just SI_FL_RXBLK_ROOM anymore.
It looks like this patch has caused a minor performance degradation on
connection rate, which possibly deserves being investigated deeper as
the test conditions are uncertain (e.g. slightly more subscribe calls?).
Till now we were using si_done_put() upon shutr, but these flags could
be reset upon next activity. Now let's switch to SI_FL_RXBLK_SHUT which
doesn't go away. It's also set in stream_int_update() in case a shutr
condition is detected.
The now unused si_done_put() was removed.
Instead of checking complex conditions to call si_cs_recv() upon first
call, let's simply use si_rx_endp_ready() now that si_cs_recv() reports
it accurately, and add si_rx_blocked() to cover any blocking situation.
The stream interface used to conflate a missing buffer and lack of
buffer space into SI_FL_WAIT_ROOM but this causes difficulties as
these cannot be checked at the same moment and are not resolved at
the same moment either. Now we instead mark the buffer as presumably
available using si_rx_buff_rdy() and mark it as unavailable+requested
using si_rx_buff_blk().
The call to si_alloc_buf() was moved after si_stop_put(). This makes
sure that the SI_FL_RX_WAIT_EP flag is cleared on allocation failure so
that the function is called again if the callee fails to do its work.
The SI_FL_WANT_PUT flag is used in an awkward way, sometimes it's
set by the stream-interface to mean "I have something to deliver",
sometimes it's cleared by the channel to say "I don't want you to
send what you have", and it has to be set back once CF_DONT_READ
is cleared. This will have to be split between SI_FL_RX_WAIT_EP
and SI_FL_RXBLK_CHAN. This patch only replaces all uses of the
flag with its natural (but negated) replacement SI_FL_RX_WAIT_EP.
The code is expected to be strictly equivalent. The now unused flag
was completely removed.
The first ones are used to figure if a direction is blocked on the
stream interface for anything but the end point. The second ones are
used to detect if the end point is ready to receive/transmit. They
should be used instead of directly fiddling with the existing bits.
This flag is not enough to describe all blocking situations, as can be
seen in each case we remove it. The muxes has taught us that using multiple
blocking flags in parallel will be much easier, so let's start to do this
now. This patch only renames this flags in order to make next changes more
readable.
This method is used to retrieve the first known good conn_stream from
the mux. It will be used to find the other end of a connection when
dealing with the proxy protocol for example.
There are still some unwelcome synchronous calls to si_cs_recv() in
process_stream(). Let's have a new function si_sync_recv() to perform
a synchronous receive call on a stream interface regardless of the type
of its endpoint, and move these calls there. For now it only implements
conn_streams since it doesn't seem useful to support applets there. The
function implements an extra check for the stream interface to be in an
established state before attempting anything.
It's easy to detect when logs on some paths are lost as sendmsg() will
return EAGAIN. This is particularly true when sending to /dev/log, which
often doesn't support a big logging capacity. Let's keep track of these
and report the total number of dropped messages in "show info".
We exclusively use stream_int_update() now, the lower layers are not
called anymore so let's remove them, as well as si_update() which used
to be their wrapper.
The function used to be called in turn for each side of the stream, but
since it's called exclusively from process_stream(), it prevents us from
making use of the knowledge we have of the operations in progress for
each side, resulting in having to go all the way through functions like
stream_int_notify() which are not appropriate there.
That patch creates a new function, si_update_both() which takes two
stream interfaces expected to belong to the same stream, and processes
their flags in a more suitable order, but for now doesn't change the
logic at all.
The next step will consist in trying to reinsert the rest of the socket
layer-specific update code to ultimately update the flags correctly at
the end of the operation.
After careful inspection, it now seems OK to call si_chk_rcv() only when
SI_FL_WAIT_ROOM is cleared and SI_FL_WANT_PUT is set, since all identified
call places have already taken care of this.
Instead of clearing the SI_FL_WAIT_ROOM flag and losing the information
about the need from the producer to be woken up, we now call si_chk_rcv()
immediately. This is cheap to do and it could possibly be further improved
by only doing it when SI_FL_WAIT_ROOM was still set, though this will
require some extra auditing of the code paths.
The only remaining place where the flag was cleared without a call to
si_chk_rcv() is si_alloc_ibuf(), but since this one is called from a
receive path woken up from si_chk_rcv() or not having failed, the
clearing was not necessary anymore either.
And there was one place in stream_int_notify() where si_chk_rcv() was
called with SI_FL_WAIT_ROOM still explicitly set so this place was
adjusted in order to clear the flag prior to calling si_chk_rcv().
Now we don't have any situation where we randomly clear SI_FL_WAIT_ROOM
without trying to wake the other side up, nor where we call si_chk_rcv()
with the flag set, so this flag should accurately represent a failed
attempt at putting data into the buffer.
When CF_DONT_READ is set, till now we used to set SI_FL_WAIT_ROOM, which
is not appropriate since it would lose the subscribe status. Instead let's
clear SI_FL_WANT_PUT (just like applets do), and set the flag only when
CF_DONT_READ is cleared.
We have to do this in stream_int_update(), and in si_cs_io_cb() after
returning from si_cs_recv() since it would be a bit invasive to hack
this one for now. It must not be done in stream_int_notify() otherwise
it would re-enable blocked applets.
Last, when si_chk_rcv() is called, it immediately clears the flag before
calling ->chk_rcv() so that we are not tempted to uselessly loop on the
same call until the receive function is called. This is the same principle
as what is done with the applet scheduler.
This flag should already be cleared before calling the *chk_rcv() functions.
Before adapting all call places, let's first make sure si_chk_rcv() clears
it before calling them so that these functions do not have to check it again
and so that they do not adjust it. This function will only call the lower
layers if the SI_FL_WANT_PUT flag is present so that the endpoint can decide
not to be called (as done with applets).
There was an ambiguity in which functions of the si_ops struct could be
null or not. only ->update doesn't exist in one of the si_ops (the
embedded one), all others are always defined. ->shutr and ->shutw were
never tested. However ->chk_rcv() and ->chk_snd() were tested, causing
confusion about the proper way to wake the other side up if undefined
(which never happens).
Let's update the comments to state these functions are mandatory and
remove the offending checks.
We now do this on the si_cs_recv() path so that we always have
SI_FL_WANT_PUT properly set when there's a need to receive and
SI_FL_WAIT_ROOM upon failure.
It doesn't make sense to limit this code to applets, as any stream
interface can use it. Let's rename it by simply dropping the "applet_"
part of the name. No other change was made except updating the comments.
The buffer allocation callback appctx_res_wakeup() used to rely on old
tricks to detect if a buffer was already granted to an appctx, namely
by checking the task's state. Not only this test is not valid anymore,
but it's inaccurate.
Let's solely on SI_FL_WAIT_ROOM that is now set on allocation failure by
the functions trying to allocate a buffer. The buffer is now allocated on
the fly and the flag removed so that the consistency between the two
remains granted. The patch also fixes minor issues such as the function
being improperly declared inline(!) and the fact that using appctx_wakeup()
sets the wakeup reason to TASK_WOKEN_OTHER while we try to use TASK_WOKEN_RES
when waking up consecutive to a ressource allocation such as a buffer.
This function replaces stream_res_available(), which is used as a callback
for the buffer allocator. It now carefully checks which stream interface
was blocked on a buffer allocation, tries to allocate the input buffer to
this stream interface, and wakes the task up once such a buffer was found.
It will automatically remove the SI_FL_WAIT_ROOM flag upon success since
the info this flag indicates becomes wrong as soon as the buffer is
allocated.
The code is still far from being perfect because if a call to si_cs_recv()
fails to allocate a buffer, we'll still end up passing via process_stream()
again, but this could be improved in the future by using finer-grained
wake-up notifications.
This patch implements analysers for parsing the CLI and extra features
for the master's CLI.
For each command (sent alone, or separated by ; or \n) the request
analyser will determine to which server it should send the request.
The 'mode cli' proxy is able to parse a prefix for each command which is
used to select the apropriate server. The prefix start by @ and is
followed by "master", the PID preceded by ! or the relative PID. (e.g.
@master, @1, @!1234). The servers are not round-robined anymore.
The command is sent with a SHUTW which force the server to close the
connection after sending its response. However the proxy allows a
keepalive connection on the client side and does not close.
The response analyser does not do much stuff, it only reinits the
connection when it received a close from the server, and forward the
response. It does not analyze the response data.
The only guarantee of the end of the response is the close of the
server, we can't rely on the double \n since it's not send by every
command.
This could be reimplemented later as a filter.
This patch introduces mworker_cli_proxy_new_listener() which allows the
creation of new listeners for the CLI proxy.
Using this function it is possible to create new listeners from the
program arguments with -Sa <unix_socket>. It is allowed to create
multiple listeners with several -Sa.
This patch implements a listen proxy within the master. It uses the
sockpair of all the workers as servers.
In the current state of the code, the proxy is only doing round robin on
the CLI of the workers. A CLI mode will be needed to know to which CLI
send the requests.
The init code of the mworker_proc structs has been moved before the
init of the listeners.
Each socketpair is now connected to a CLI within the workers, which
allows the master to access their CLI.
The inherited flag of the worker side socketpair is removed so the
socket can be closed in the master.
With the new synchronous si_cs_send() at the end of process_stream(),
we're seeing re-appear the I/O layer specific part of the stream interface
which is supposed to deal with I/O event subscription. The only difference
is that now we subscribe to I/Os only after having attempted (and failed)
them.
This patch brings a cleanup in this by reintroducing stream_int_update_conn()
with the send code from process_stream(). However this alone would not be
enough because the flags which are cleared afterwards would result in the
loss of the possible events (write events only at the moment). So the flags
clearing and stream-int state updates are also performed inside si_update()
between the generic code and the I/O specific code. This definitely makes
sense as after this call we can simply check again for channel and SI flag
changes and decide to loop once again or not.
This will supersed channel_alloc_buffer() while relying on it. It will
automatically adjust SI_FL_WAIT_ROOM on the stream-int depending on
success or failure to allocate this buffer.
It's worth noting that it could make sense to also set SI_FL_WANT_PUT
each time we do this to further simplify the code at user places such
as applets, but it would possibly not be easy to clean this flag
everywhere an rx operation stops.
The behaviour of the flag CF_WRITE_PARTIAL was modified by commit
95fad5ba4 ("BUG/MAJOR: stream-int: don't re-arm recv if send fails") due
to a situation where it could trigger an immediate wake up of the other
side, both acting in loops via the FD cache. This loss has caused the
need to introduce CF_WRITE_EVENT as commit c5a9d5bf, to replace it, but
both flags express more or less the same thing and this distinction
creates a lot of confusion and complexity in the code.
Since the FD cache now acts via tasklets, the issue worked around in the
first patch no longer exists, so it's more than time to kill this hack
and to restore CF_WRITE_PARTIAL's semantics (i.e.: there has been some
write activity since we last left process_stream).
This patch mostly reverts the two commits above. Only the part making
use of CF_WROTE_DATA instead of CF_WRITE_PARTIAL to detect the loss of
data upon connection setup was kept because it's more accurate and
better suited.
This patch makes shctx capable of storing objects in several parts,
each parts being made of several blocks. There is no more need to
walk through until reaching the end of a row to append new blocks.
A new pointer to a struct shared_block member, named last_reserved,
has been added to struct shared_block so that to memorize the last block which was
reserved by shctx_row_reserve_hot(). Same thing about "last_append" pointer which
is used to memorize the last block used by shctx_row_data_append() to store the data.
This option makes a proxy use only HTX-compatible muxes instead of the
HTTP-compatible ones for HTTP modes. It must be set on both ends, this
is checked at parsing time.
Some samples representing time will cover more than one sample at once
if they are units of time per time. For this we'd need to have the
ability to loop over swrate_add() multiple times but that would be
inefficient. By developing the function elevated to power N, it's
visible that some coefficients quickly disappear and that those which
remain at the first order more or less compensate each other.
Thus a simplified version of this function was added to provide a single
value for a given number of samples. Tests with multiple values, window
sizes and sample sizes have shown that it is possible to make it remain
surprisingly accurate (typical error < 0.2% over various large window
and sample sizes, even samples representing up to 1/4 of the window).
Avoid using conn_xprt_want_send/recv, and totally nuke cs_want_send/recv,
from the upper layers. The polling is now directly handled by the connection
layer, it is activated on subscribe(), and unactivated once we got the event
and we woke the related task.
Make sure we don't have any subscription when the connection is going in
idle mode, otherwise there's a race condition when the connection is
reused, if there are still old subscriptions, new ones won't be done.
No backport is needed.
The 4 pollers all contain the same code used to compute the poll timeout.
This is pointless, let's centralize this into fd.h. This also gets rid of
the useless SCHEDULER_RESOLUTION macro which used to work arond a very old
linux 2.2 bug causing select() to wake up slightly before the timeout.
Currently we have per-thread arrays of trees and counts, but these
ones unfortunately share cache lines and are accessed very often. This
patch moves the task-specific stuff into a structure taking a multiple
of a cache line, and has one such per thread. Just doing this has
reduced the cache miss ratio from 19.2% to 18.7% and increased the
12-thread test performance by 3%.
It starts to become visible that we really need a process-wide per-thread
storage area that would cover more than just these parts of the tasks.
The code was arranged so that it's easy to move the pieces elsewhere if
needed.
Now we still have a main contention point with the timers in the main
wait queue, but the vast majority of the tasks are pinned to a single
thread. This patch creates a per-thread wait queue and queues a task
to the local wait queue without any locking if the task is bound to a
single thread (the current one) otherwise to the shared queue using
locking. This significantly reduces contention on the wait queue. A
test with 12 threads showed 11 ms spent in the WQ lock compared to
4.7 seconds in the same test without this change. The cache miss ratio
decreased from 19.7% to 19.2% on the 12-thread test, and its performance
increased by 1.5%.
Another indirect benefit is that the average queue size is divided
by the number of threads, which roughly removes log(nbthreads) levels
in the tree and further speeds up lookups.
The vast majority of FDs are only seen by one thread. Currently the lock
on FDs costs a lot because it's touched often, though there should be very
little contention. This patch ensures that the lock is only grabbed if the
FD is shared by more than one thread, since otherwise the situation is safe.
Doing so resulted in a 15% performance boost on a 12-threads test.
peers_init_sync() doesn't check task_new()'s return value and doesn't
return any result to indicate success or failure. Let's make it return
an int and check it from the caller.
This can be backported as far as 1.6.
To ease the refactoring, the function "http_header_add_tail" have been
remove. Now, "http_header_add_tail2" is always used. And the function
"capture_headers" have been renamed into "http_capture_headers". Finally, some
functions have been exported.
HTTP_FLG_* and HTTP_IS_* were moved from "proto/proto_http.h" to "common/http.h"
but the associated comment was forgotten during the move.
This is 1.9-specific and should not be backported.
Make sure we unsubscribe from events before si_release_endpoint destroys
the conn_stream, or it will be never called. To do so, move the call to
unsubscribe to si_release_endpoint() directly.
This is 1.9-specific and shouldn't be backported.
When subscribing, we don't need to provide a list element, only the h2 mux
needs it. So instead, Add a list element to struct h2s, and use it when a
list is needed.
This forces us to use the unsubscribe method, since we can't just unsubscribe
by using LIST_DEL anymore.
This patch is larger than it should be because it includes some renaming.
As we don't know how subscriptions are handled, we can't just assume we can
use LIST_DEL() to unsubscribe, so introduce a new method to mux and connections
to do so.
The prototypes of functions find_hdr_value_end(), extract_cookie_value()
and http_header_match2() were still in proto_http.h while some of them
don't exist anymore and the others were just moved. Let's remove them.
In addition, da.c was updated to use http_extract_cookie_value() which
is the correct one.
These ones are mostly called from cfgparse.c for the parsing and do
not depend on the HTTP representation. The functions's prototypes
were moved to proto/http_rules.h, making this file work exactly like
tcp_rules. Ideally we should stop calling these functions directly
from cfgparse and register keywords, but there are a few cases where
that wouldn't work (stats http-request) so it's probably not worth
trying to go this far.
The current proto_http.c file is huge and contains different processing
domains making it very difficult to work on an alternative representation.
This commit moves some parts to other files :
- ACL registration code => http_acl.c
This code only creates some ACL mappings and doesn't know anything
about HTTP nor about the representation. This code could even have
moved to acl.c but it was not worth polluting it again.
- HTTP sample conversion => http_conv.c
This code doesn't depend on the internal representation but definitely
manipulates some HTTP elements, such as dates. It also has access to
captures.
- HTTP sample fetching => http_fetch.c
This code does depend entirely on the internal representation but is
totally independent on the analysers. Placing it into a different
file will ease the transition to the new representation and the
creation of a wrapper if required. An include file was created due
to CHECK_HTTP_MESSAGE_FIRST() being used at various places.
- HTTP action registration => http_act.c
This code doesn't directly interact with the messages nor the
transaction but it does so via some exported http functions like
http_replace_req_line() or http_set_status() so it will be easier
to change only this after the conversion.
- a few very generic parts were found and moved to http.{c,h} as
relevant.
It is worth noting that the functions moved to these new files are not
referenced anywhere outside of the files and are only called as registered
callbacks, so these files do not even require associated include files.
Instead of using si_cs_io_cb() in process_stream() use si_cs_send/si_cs_recv
instead, as si_cs_io_cb() may lead to process_stream being woken up when it
shouldn't be, and thus timeout would never get triggered.
Callers of si_appctx() always use the result without checking it because
they know by construction that it's valid. This results in unchecked null
pointer warnings at -Wextra, so let's remove this test and make it clear
that it's up to the caller to check validity first.
stktable_data_ptr() currently performs null pointer checks but most
callers don't check the result since they know by construction that
it cannot be null. This causes valid warnings when building with
-Wextra which are worth addressing since it will result in better
code. Let's provide an unguarded version of this function for use
where the check is known to be useless and untested.
Till now it was very difficult for a mux to know what proxy it was
working for. Let's pass the proxy when the mux is instanciated at
init() time. It's not yet used but the H1 mux will definitely need
it, just like the H2 mux when dealing with backend connections.
This state was only a delimiter between headers and body but it now
causes more harm than good because it requires someone to change it.
Since the H1 parser knows if we're in DATA or CHUNK_SIZE, simply let
it set the right next state so that h1m->state constantly matches
what is expected afterwards.
This will allow the parser to fill some extra fields like the method or
status without having to store them permanently in the HTTP message. At
this point however the parser cannot restart from an interrupted read.
This way we maintain the old mechanism stating that -2 means we block
on errors, -1 means we only capture them, and a positive value indicates
the position of the first error.
Currently the only user of struct h1m is the h2 mux when it has to parse
an H1 message coming from the channel. Unfortunately this is not enough
to efficiently parse HTTP/1 messages like those coming from the network
as we don't want to restart from scratch at every byte received.
This patch reintroduces the "next" offset into the H1 message so that any
H1 parser can use it to restart when called with a state that is not the
initial state.
This is the *parsing* state of an HTTP/1 message. Currently the h1_state
is composite as it's made both of parsing and control (100SENT, BODY,
DONE, TUNNEL, ENDING etc). The purpose here is to have a purely H1 state
that can be used by H1 parsers. For now it's equivalent to h1_state.
For struct connection, struct conn_stream, and for the h2 mux, add 2 new
lists, one that handles waiters for recv, and one that handles waiters for
recv and send. That way we can ask to subscribe for either recv or send.
In tasklet_free(), if we're currently in the runnable task list, don't
forget to decrement taks_list_size, or it'll end up being to big, and we may
not process tasks in the global runqueue.
This protocol is based on the uxst one, but it uses socketpair and FD
passing insteads of a connect()/accept().
The "sockpair@" prefix has been implemented for both bind and server
keywords.
When HAProxy wants to connect through a sockpair@, it creates 2 new
sockets using the socketpair() syscall and pass one of the socket
through the FD specified on the server line.
On the bind side, haproxy will receive the FD, and will use it like it
was the FD of an accept() syscall.
This protocol was designed for internal communication within HAProxy
between the master and the workers, but it's possible to use it
externaly with a wrapper and pass the FD through environment variabls.
It's possible to have several protocols per family which is a problem
with the current way the protocols are stored.
This allows to register a new protocol in HAProxy which is not a
protocol in the strict socket definition. It will be used to register a
SOCK_STREAM protocol using socketpair().
These error codes and messages are agnostic to the version, even if
they are represented as HTTP/1.0 messages. Ultimately they will have
to be transformed into internal HTTP messages to be used everywhere.
The HTTP/1.1 100 Continue message was turned to an IST and the local
copy in the Lua code was removed.
This function is purely HTTP once http_txn is put aside. So the original
one was renamed to http_txn_get_path() and it extracts the relevant offsets
from the txn to pass them to http_get_path(). One benefit of the new version
is that it returns the length at the same time so that allowed to slightly
simplify http_get_path_from_string() which had to look up the end pointer
previously and which is not needed anymore.
It's a bit painful to have to deal with HTTP semantics for each protocol
version (H1 and H2), and working on the version-agnostic code further
emphasizes the problem.
This patch creates http.h and http.c which are agnostic to the version
in use, and which borrow a few parts from proto_http and from h1. For
example the once thought h1-specific h1_char_classes array is in fact
dictated by RFC7231 and is used to parse HTTP headers. A few changes
were made to a few files which were including proto_http.h while they
only needed http.h.
Certain string definitions pre-dated the introduction of indirect
strings (ist) so some were used to simplify the definition of the known
HTTP methods. The current lookup code saves 2 kB of a heavily used table
and is faster than the previous table based lookup (typ. 14 ns vs 16
before).
This function now captures an error regardless of its side and protocol.
The caller must pass a number of elements and may pass a protocol-specific
structure and a callback to display it. Later this function may deal with
more advanced allocation techniques to avoid allocating as many buffers
as proxies.
This function returns the proxy associated to a connection. For front
connections it returns the frontend, and for back connections it
returns the backend. This will be used to retrieve some configuration
parameters from within a mux.
Sometimes a connection is prepared before the target is set, sometimes
after. There's no real rule since the few functions involved operate on
different and independent fields. Soon we'll benefit from knowing the
target at the connection layer, in order to figure the associated proxy
and retrieve the various parameters (timeouts etc). This patch slightly
reorders a few calls to conn_prepare() so that we can make sure that the
target is always known to the mux.
The new function sess_log() only needs a session to emit a log. It will
ignore the parts that depend on the stream. It is usable to emit a log
to report early errors in muxes. These ones will typically mention
"<BADREQ>" for the request and 0 for the HTTP status code.
The current build_logline() can only be used with valid streams, which
means it is not suitable for use from muxes. We start by moving it into
another more generic function which takes the session as an argument,
to avoid complexifying all the internal API for jsut a few use cases.
This new function is not supposed to be called directly from outside so
we'll be able to instrument it to support several calling conventions.
For now the behaviour and conditions remain unchanged.
This patch improves the previous fix by implementing the socket draining
code directly in conn_sock_drain() so that it always applies regardless
of the protocol's family. Thus it gets rid of tcp_drain().
Since commit 843b7cb ("MEDIUM: chunks: make the chunk struct's fields
match the buffer struct") a chunk length is unsigned so we can remove
negative size checks.
During a test it happened that a connection was deleted before the
stream it's attached to, resulting in a crash related to the fix
18a85fe ("BUG/MEDIUM: streams: Don't forget to remove the si from
the wait list.") during the LIST_DEL(). Make sure to always delete
the list's head in this case so that other elements can safely
detach later.
This is purely 1.9, no backport is needed.
Set the flag for the current thread in active_threads_mask when waking a
tasklet, or we will never run it if no tasks are available.
This is 1.9-specific, no backport is needed.
When we choose to insert a fd in either the global or the local fd update list,
and the thread_mask against all_threads_mask before checking if it's tid_bit,
that way, if we run with nbthreads==1, we will always use the local list,
which is cheaper than the global one.
Instead of just using the conn_stream wait_list, give the stream_interface
its own. When the conn_stream will have its own buffers, the stream_interface
may have to wait on it.
Instead of using si_cs_send() as a task handler, define a new function,
si_cs_io_cb(), and give si_cs_send() its original prototype. Right now
si_cs_io_cb() just handles send, but later it'll handle recv() too.
Modify tasklet_wakeup() so that it handles a task as well, and inserts it
directly into the tasklet list, making it effectively a tasklet.
This should make future developments easier.
This adds the set-priority-class and set-priority-offset actions to
http-request and tcp-request content. At this point they are not used
yet, which is the purpose of the next commit, but all the logic to
set and clear the values is there.
We'll need trees to manage the queues by priorities. This change replaces
the list with a tree based on a single key. It's effectively a list but
allows us to get rid of the list management right now.
Commit 7ce0c89 ("MEDIUM: mux: Use the mux protocol specified on
bind/server lines") assumed a bit too strongly that we could only have
servers on the connect side :-) It segfaults under this config :
defaults
contimeout 5s
clitimeout 5s
srvtimeout 5s
mode http
listen test1
bind :8001
dispatch 127.0.0.1:8002
frontend test2
mode http
bind :8002
redirect location /
No backport needed.
To do so, mux choices are split to handle incoming and outgoing connections in a
different way. The protocol specified on the bind/server line is used in
priority. Then, for frontend connections, the ALPN is retrieved and used to
choose the best mux. For backend connection, there is no ALPN. Finaly, if no
protocol is specified and no protocol matches the ALPN, we fall back on a
default mux, choosing in priority the first mux with exactly the same mode.
Because there can be several default multiplexers (without name), they are now
reported with the name "<default>". And a message warns they cannot be
referenced with the "proto" keyword on a bind line or a server line.
Now we try to synchronously push updates as they come using the new rdv
point, so that the call to the server update function from the main poll
loop is not needed anymore.
It further reduces the apparent latency in the health checks as the response
time almost always appears as 0 ms, resulting in a slightly higher check rate
of ~1960 conn/s. Despite this, the CPU consumption has slightly dropped again
to ~32% for the same test.
The only trick is that the checks code is built with a bit of recursivity
because srv_update_status() calls server_recalc_eweight(), and the latter
needs to signal srv_update_status() in case of updates. Thus we added an
extra argument to this function to indicate whether or not it must
propagate updates (no if it comes from srv_update_status).
Multiplexers are not necessarily associated to an ALPN. ALPN is a TLS extension,
so it is not always defined or used. Instead, we now rather speak of
multiplexer's protocols. So in this patch, there are no significative changes,
some structures and functions are just renamed.
This function is generic and is able to automatically transfer data from a
buffer to the conn_stream's tx buffer. It does this automatically if the mux
doesn't define another snd_buf() function.
It cannot yet be used as-is with the conn_stream's txbuf without risking to
lose data on close since conn_streams need to be orphaned for this.
To be symmetrical with the recv() part, we no handle retryable and partial
transmission using a intermediary buffer in the conn_stream. For now it's only
set to BUF_NULL and never allocated nor used.
It cannot yet be used as-is without risking to lose data on close since
conn_streams need to be orphaned for this.
This is a partial revert of the commit deccd1116 ("MEDIUM: mux: make
mux->snd_buf() take the byte count in argument"). It is a requirement to do
zero-copy transfers. This will be mandatory when the TX buffer of the
conn_stream will be used.
So, now, data are consumed by mux->snd_buf() and not only sent. So it needs to
update the buffer state. On its side, the caller must be aware the buffer can be
replaced y an empty or unallocated one.
As a side effet of this change, the function co_set_data() is now only responsible
to update the channel set, by update ->output field.
Since BoringSSL 3b2ff028, API now correctly match OpenSSL 1.1.0.
The patch revert part of haproxy 019f9b10: "Fix BoringSSL call and
openssl-compat.h/#define occordingly.".
This will not break openssl/libressl compat.
Add a new pipe, one per thread, so that we can write on it to wake a thread
sleeping in a poller, and use it to wake threads supposed to take care of a
task, if they are all sleeping.
Now pendconn_free() takes a stream, checks that pend_pos is set, clears
it, and uses pendconn_unlink() to complete the job. It's cleaner and
centralizes all the bookkeeping work in pendconn_unlink() only and
ensures that there's a single place where the stream's position in the
queue is manipulated.
For now the pendconns may be dequeued at two places :
- pendconn_unlink(), which operates on a locked queue
- pendconn_free(), which operates on an unlocked queue and frees
everything.
Some changes are coming to the queue and we'll need to be able to be a
bit stricter regarding the places where we dequeue to keep the accounting
accurate. This first step renames the locked function __pendconn_unlink()
as it's for use by those aware of it, and introduces a new general purpose
pendconn_unlink() function which automatically grabs the necessary locks
before calling the former, and pendconn_cond_unlink() which additionally
checks the pointer and the presence in the queue.
As __task_wakeup() is responsible for increasing
rqueue_local[tid]/global_rqueue_size, make __task_unlink_rq responsible for
decreasing it, as process_runnable_tasks() isn't the only one that removes
tasks from runqueues.
This function is generic and is able to automatically transfer data
from a conn_stream's rx buffer to the destination buffer. It does this
automatically if the mux doesn't define another rcv_buf() function.
In order to reorganize the connection layers, recv() operations will
need to be retryable and to support partial transfers. This requires
an intermediary buffer to hold the data coming from the mux. After a
few attempts, it turns out that this buffer is best placed inside the
conn_stream itself. For now it's only set to buf_empty and it will be
up to the caller to allocate it if required.
This new function wl_set_waitcb() prepopulates a wait_list with a tasklet
and a context and returns it so that it can be passed to ->subscribe() to
be added to a connection or conn_stream's wait_list. The caller doesn't
need to know all the insiders details anymore this way.
Totally nuke the "send" method, instead, the upper layer decides when it's
time to send data, and if it's not possible, uses the new subscribe() method
to be called when it can send data again.
Add a new "subscribe" method for connection, conn_stream and mux, so that
upper layer can subscribe to them, to be called when the event happens.
Right now, the only event implemented is "SUB_CAN_SEND", where the upper
layer can register to be called back when it is possible to send data.
The connection and conn_stream got a new "send_wait_list" entry, which
required to move a few struct members around to maintain an efficient
cache alignment (and actually this slightly improved performance).
Now all the code used to manipulate chunks uses a struct buffer instead.
The functions are still called "chunk*", and some of them will progressively
move to the generic buffer handling code as they are cleaned up.
Chunks are only a subset of a buffer (a non-wrapping version with no head
offset). Despite this we still carry a lot of duplicated code between
buffers and chunks. Replacing chunks with buffers would significantly
reduce the maintenance efforts. This first patch renames the chunk's
fields to match the name and types used by struct buffers, with the goal
of isolating the code changes from the declaration changes.
Most of the changes were made with spatch using this coccinelle script :
@rule_d1@
typedef chunk;
struct chunk chunk;
@@
- chunk.str
+ chunk.area
@rule_d2@
typedef chunk;
struct chunk chunk;
@@
- chunk.len
+ chunk.data
@rule_i1@
typedef chunk;
struct chunk *chunk;
@@
- chunk->str
+ chunk->area
@rule_i2@
typedef chunk;
struct chunk *chunk;
@@
- chunk->len
+ chunk->data
Some minor updates to 3 http functions had to be performed to take size_t
ints instead of ints in order to match the unsigned length here.
Now the buffers only contain the header and a pointer to the storage
area which can be anywhere. This will significantly simplify buffer
swapping and will make it possible to map chunks on buffers as well.
The buf_empty variable was removed, as now it's enough to have size==0
and area==NULL to designate the empty buffer (thus a non-allocated head
is the empty buffer by default). buf_wanted for now is indicated by
size==0 and area==(void *)1.
The channels and the checks now embed the buffer's head, and the only
pointer is to the storage area. This slightly increases the unallocated
buffer size (3 extra ints for the empty buffer) but considerably
simplifies dynamic buffer management. It will also later permit to
detach unused checks.
The way the struct buffer is arranged has proven quite efficient on a
number of tests, which makes sense given that size is always accessed
and often first, followed by the othe ones.
It used to be called 'len' during the reorganisation but strictly speaking
it's not a length since it wraps. Also we already use '_data' as the suffix
to count available data, and data is also what we use to indicate the amount
of data in a pipe so let's improve consistency here. It was important to do
this in two operations because data used to be the name of the pointer to
the storage area.
There was no point keeping that function in the buffer part since it's
exclusively used by HTTP at the channel level, since it also automatically
appends the CRLF. This further cleans up the buffer code.
Since we never access this field directly anymore, but only through the
channel's wrappers, it can now move to the channel. The buffers are now
completely free from the distinction between input and output data.
b_del() is used in :
- mux_h2 with the demux buffer : always processes input data
- checks with output data though output is not considered at all there
- b_eat() which is not used anywhere
- co_skip() where the len is always <= output
Thus the distinction for output data is not needed anymore and the
decrement can be made inconditionally in co_skip().
This is intentionally the minimal and safest set of changes, some cleanups
area still required. These changes are quite tricky and cannot be
independantly tested, so it's important to keep this patch as bisectable
as possible.
buf_empty and buf_wanted were changed and are now exactly similar since
there's no <p> member in the structure anymore. Given that no test is
ever made in the code to check that buf == &buf_wanted, it may be possible
that we don't need to have two anymore, unless some buf_empty tests have
precedence. This will have to be investigated.
A significant part of this commit affects the HTTP compression code,
which used to deeply manipulate the input and output buffers without
any reasonable solution for a better abstraction. For this reason, if
any regression is met and designates this patch as the culprit, it is
important to run tests which specifically involve compression or which
definitely don't use it in order to spot the issue.
Cc: Olivier Houchard <ohouchard@haproxy.com>
For the same consistency reasons, let's use b_empty() at the few places
where an empty buffer is expected, or c_empty() if it's done on a channel.
Some of these places were there to realign the buffer so
{b,c}_realign_if_empty() was used instead.
Now that there are no more users requiring to modify the buffer anymore,
switch these ones to const char and const buffer. This will make it more
obvious next time send functions are tempted to modify the buffer's output
count. Minor adaptations were necessary at a few call places which were
using char due to the function's previous prototype.
Till now the callers had to know which one to call for specific use cases.
Let's fuse them now since a single one will remain after the API migration.
Given that bi_del() may only be used where o==0, just combine the two tests
by first removing output data then only input.
This function was sometimes used from a channel and sometimes from a buffer.
In both cases it requires knowledge of the size of the output data (to skip
them). Here the split ensures the channel can deal with this point, and that
other places not having output data can continue to work.
These ones manipulate the output data count which will be specific to
the channel soon, so prepare the call points to use the channel only.
The b_* functions are now unused and were removed.
The few call places where it's used can use the trash as a swap buffer,
which is made for this exact purpose. This way we can rely on the
generic b_slow_realign() call.
Where relevant, the channel version is used instead. The buffer version
was ported to be more generic and now takes a swap buffer and the output
byte count to know where to set the alignment point. The H2 mux still
uses buffer_slow_realign() with buf->o but it will change later.
This adds :
- c_orig() : channel buffer's origin
- c_size() : channel buffer's size
- c_wrap() : channel buffer's wrapping location
- c_data() : channel buffer's total data count
- c_room() : room left in channel buffer's
- c_empty() : true if channel buffer is empty
- c_full() : true if channel buffer is full
- c_ptr() : pointer to an offset relative to input data in the buffer
- c_adv() : advances the channel's buffer (bytes become part of output)
- c_rew() : rewinds the channel's buffer (output bytes not output anymore)
- c_realign_if_empty() : realigns the buffer if it's empty
- co_data() : # of output data
- co_head() : beginning of output data
- co_tail() : end of output data
- ci_data() : # of input data
- ci_head() : beginning of input data
- ci_tail() : end of input data
- ci_stop() : location after ci_tail()
- ci_next() : pointer to next input byte
And for the ci_* / co_* functions above, the "__*" variants which disable
wrapping checks, and the "_ofs" variants which return an offset relative to
the buffer's origin instead.
Up until now, a tasklet couldn't be free'd while it was in the list, it is
no longer the case, so make sure we remove it from the list before freeing it.
To do so, we have to make sure we correctly initialize it, so use LIST_INIT,
instead of setting the pointers to NULL.
To make sure we don't inadvertently insert task in the global runqueue,
while only the local runqueue is used without threads, make its definition
and usage conditional on USE_THREAD.
When building without threads enabled, instead of just using the global
runqueue, just use the local runqueue associated with the only thread, as
that's what is now expected for a single thread in prcoess_runnable_tasks().
This should fix haproxy when built without threads.
When an applet is created, let's assign it the same nice value as the task
of the stream which owns it. It ensures that fairness is properly propagated
to applets, and that the CLI can regain a low latency behaviour again. Huge
differences have been seen under extreme loads, with the CLI being called
every 200 microseconds instead of 11 milliseconds.
This function returns true is some notifications are registered.
This function is usefull for the following patch
BUG/MEDIUM: lua/socket: Sheduling error on write: may dead-lock
It should be backported in 1.6, 1.7 and 1.8
Don't forget to increase tasks_run_queue when we're adding a task to the
tasklet list, and to decrease it when we remove a task from a runqueue,
or its value won't be accurate, and could lead to tasks not being executed
when put in the global run queue.
1.9-dev only, no backport is needed.
There's no real reason to have a specific scheduler for applets anymore, so
nuke it and just use tasks. This comes with some benefits, the first one
being that applets cannot induce high latencies anymore since they share
nice values with other tasks. Later it will be possible to configure the
applets' nice value. The second benefit is that the applet scheduler was
not very thread-friendly, having a big lock around it in prevision of this
change. Thus applet-intensive workloads should now scale much better with
threads.
Some more improvement is possible now : some applets also use a task to
handle timers and timeouts. These ones could now be simplified to use only
one task.
Introduce tasklets, lightweight tasks. They have no notion of priority,
they are just run as soon as possible, and will probably be used for I/O
later.
For the moment they're used to replace the temporary thread-local list
that was used in the scheduler. The first part of the struct is common
with tasks so that tasks can be cast to tasklets and queued in this list.
Once a task is in the tasklet list, it has its leaf_p set to 0x1 so that
it cannot accidently be confused as not in the queue.
Pure tasklets are identifiable by their nice value of -32768 (which is
normally not possible).
A lot of tasks are run on one thread only, so instead of having them all
in the global runqueue, create a per-thread runqueue which doesn't require
any locking, and add all tasks belonging to only one thread to the
corresponding runqueue.
The global runqueue is still used for non-local tasks, and is visited
by each thread when checking its own runqueue. The nice parameter is
thus used both in the global runqueue and in the local ones. The rare
tasks that are bound to multiple threads will have their nice value
used twice (once for the global queue, once for the thread-local one).
In preparation for thread-specific runqueues, change the task API so that
the callback takes 3 arguments, the task itself, the context, and the state,
those were retrieved from the task before. This will allow these elements to
change atomically in the scheduler while the application uses the copied
value, and even to have NULL tasks later.
The polled_mask is only used in the pollers, and removing it from the
struct fdtab makes it fit in one 64B cacheline again, on a 64bits machine,
so make it a separate array.
With the old model, any fd shared by multiple threads, such as listeners
or dns sockets, would only be updated on one threads, so that could lead
to missed event, or spurious wakeups.
To avoid this, add a global list for fd that are shared, using the same
implementation as the fd cache, and only remove entries from this list
when every thread as updated its poller.
[wt: this will need to be backported to 1.8 but differently so this patch
must not be backported as-is]
Modify fd_add_to_fd_list() and fd_rm_from_fd_list() so that they take an
offset in the fdtab to the list entry, instead of hardcoding the fd cache,
so we can use them with other lists.
While running a task, we may try to delete and free a task that is about to
be run, because it's part of the local tasks list, or because rq_next points
to it.
So flag any task that is in the local tasks list to be deleted, instead of
run, by setting t->process to NULL, and re-make rq_next a global,
thread-local variable, that is modified if we attempt to delete that task.
Many thanks to PiBa-NL for reporting this and analysing the problem.
This should be backported to 1.8.
In order to use arbitrary data in the CLI (multiple lines or group of words
that must be considered as a whole, for example), it is now possible to add a
payload to the commands. To do so, the first line needs to end with a special
pattern: <<\n. Everything that follows will be left untouched by the CLI parser
and will be passed to the commands parsers.
Per-command support will need to be added to take advantage of this
feature.
Signed-off-by: Aurlien Nephtali <aurelien.nephtali@corp.ovh.com>
In some cases, we call cs_destroy() very early, so early the connection
doesn't yet have a mux, so we can't call mux->detach(). In this case,
just destroy the associated connection.
This should be backported to 1.8.
Now, the function parse_logsrv should be used to parse a "log" line. This
function will update the list of loggers passed in argument. It can release all
log servers when "no log" line was parsed (by the caller) or it can parse "log
global" or "log <address> ... " lines. It takes care of checking the caller
context (global or not) to prohibit "log global" usage in the global section.
Clearing the update_mask bit in fd_insert may lead to duplicate insertion
of fd in fd_updt, that could lead to a write past the end of the array.
Instead, make sure the update_mask bit is cleared by the pollers no matter
what.
This should be backported to 1.8.
[wt: warning: 1.8 doesn't have the lockless fdcache changes and will
require some careful changes in the pollers]
Commit 4815c8c ("MAJOR: fd/threads: Make the fdcache mostly lockless.")
made the fd cache lockless, but after a few iterations, a subtle part was
lost, consisting in setting the bit on the fd_cache_mask immediately when
adding an event. Now it was done only when the cache started to process
events, but the problem it causes is that fd_cache_mask isn't reliable
anymore as an indicator of presence of events to be processed with no
delay outside of fd_process_cached_events(). This results in some spurious
delays when processing inter-thread wakeups between tasks. Just restoring
the flag when the event is added is enough to fix the problem.
Kudos to Christopher for spotting this one!
No backport is needed as this is only in the development version.
The management of the servers and the proxies queues was not thread-safe at
all. First, the accesses to <strm>->pend_pos were not protected. So it was
possible to release it on a thread (for instance because the stream is released)
and to use it in same time on another one (because we redispatch pending
connections for a server). Then, the accesses to stream's information (flags and
target) from anywhere is forbidden. To be safe, The stream's state must always
be updated in the context of process_stream.
So to fix these issues, the queue module has been refactored. A lock has been
added in the pendconn structure. And now, when we try to dequeue a pending
connection, we start by unlinking it from the server/proxy queue and we wake up
the stream. Then, it is the stream reponsibility to really dequeue it (or
release it). This way, we are sure that only the stream can create and release
its <pend_pos> field.
However, be careful. This new implementation should be thread-safe
(hopefully...). But it is not optimal and in some situations, it could be really
slower in multi-threaded mode than in single-threaded one. The problem is that,
when we try to dequeue pending connections, we process it from the older one to
the newer one independently to the thread's affinity. So we need to wait the
other threads' wakeup to really process them. If threads are blocked in the
poller, this will add a significant latency. This problem happens when maxconn
values are very low.
This patch must be backported in 1.8.
When a listener is temporarily disabled, we start by locking it and then we call
.pause callback of the underlying protocol (tcp/unix). For TCP listeners, this
is not a problem. But listeners bound on an unix socket are in fact closed
instead. So .pause callback relies on unbind_listener function to do its job.
Unfortunatly, unbind_listener hold the listener's lock and then call an internal
function to unbind it. So, there is a deadlock here. This happens during a
reload. To fix the problemn, the function do_unbind_listener, which is lockless,
is now exported and is called when a listener bound on an unix socket is
temporarily disabled.
This patch must be backported in 1.8.
ssl_sock_get_pkey_algo can be used to report pkey algorithm to log
and ppv2 (RSA2048, EC256,...).
Extract pkey information is not free in ssl api (lock/alloc/free):
haproxy can use the pkey information computed in load_certificate.
Store and use this information in a SSL ex_data when available,
compute it if not (SSL multicert bundled and generated cert).
A TLS ticket keys file can be updated on the CLI and used in same time. So we
need to protect it to be sure all accesses are thread-safe. Because updates are
infrequent, a R/W lock has been used.
This patch must be backported in 1.8
Each fd_{may|cant|stop|want}_{recv|send} function sets or resets a
single bit at once, then recomputes the need for updates, and then
the new cache state. Later, pollers will compute the new polling
state based on the resulting operations here. In fact the conditions
are so simple that they can be performed by a single "if", or sometimes
even optimized away.
This means that in practice a simple compare-and-swap operation if often
enough to set the new value inluding the new polling state, and that only
the cache and fdupdt have to be performed under the lock. Better, for the
most common operations (fd_may_{recv,send}, used by the pollers), a simple
atomic OR is needed.
This patch does this for the fd_* functions above and it doesn't yet
remove the now useless fd_compute_new_polling_status() because it's still
used by other pollers. A pure connection rate test shows a 1% performance
increase.
An fd cache entry might be removed and added at the end of the list, while
another thread is parsing it, if that happens, we may miss fd cache entries,
to avoid that, add a new field in the struct fdtab, "added_mask", which
contains a mask for potentially affected threads, if it is set, the
corresponding thread will set its bit in fd_cache_mask, to avoid waiting in
poll while it may have more work to do.
Create a local, per-thread, fdcache, for file descriptors that only belongs
to one thread, and make the global fd cache mostly lockless, as we can get
a lot of contention on the fd cache lock.
fd_insert() is currently called just after setting the owner and iocb,
but proceeding like this prevents the operation from being atomic and
requires a lock to protect the maxfd computation in another thread from
meeting an incompletely initialized FD and computing a wrong maxfd.
Fortunately for now all fdtab[].owner are set before calling fd_insert(),
and the first lock in fd_insert() enforces a memory barrier so the code
is safe.
This patch moves the initialization of the owner and iocb to fd_insert()
so that the function will be able to properly arrange its operations and
remain safe even when modified to become lockless. There's no other change
beyond the internal API.
These functions were created for poll() in 1.5-dev18 (commit 80da05a4) to
replace the previous FD_{CLR,SET,ISSET} that were shared with select()
because some libcs enforce a limit on FD_SET. But FD_SET doesn't seem
to be universally MT-safe, requiring locks in the select() code that
are not needed in the poll code. So let's move back to the initial
situation where we used to only use bit fields, since that has been in
use since day one without a problem, and let's use these hap_fd_*
functions instead of FD_*.
This patch only moves the functions to fd.h and revives hap_fd_isset()
that was recently removed to kill an "unused" warning.
Since only select() and poll() still make use of maxfd, let's move
its computation right there in the pollers themselves, and only
during each fd update pass. The computation doesn't need a lock
anymore, only a few atomic ops. It will be accurate, be done much
less often and will not be required anymore in the FD's fast patch.
This provides a small performance increase of about 1% in connection
rate when using epoll since we get rid of this computation which was
performed under a lock.
Some pollers like epoll() need to know if the fd is already known or
not in order to compute the operation to perform (add, mod, del). For
now this is performed based on the difference between the previous FD
state and the new state but this will not be usable anymore once threads
become responsible for their own polling.
Here we come with a different approach : a bitmask is stored with the
fd to indicate which pollers already know it, and the pollers will be
able to simply perform the add/mod/del operations based on this bit
combined with the new state.
This patch only adds the bitmask declaration and initialization, it
is it not yet used. It will be needed by the next two fixes and will
need to be backported to 1.8.
Since the fd update tables are per-thread, we need to have a bit per
thread to indicate whether an update exists, otherwise this can lead
to lost update events every time multiple threads want to update the
same FD. In practice *for now*, it only happens at start time when
listeners are enabled and ask for polling after facing their first
EAGAIN. But since the pollers are still shared, a lost event is still
recovered by a neighbor thread. This will not reliably work anymore
with per-thread pollers, where it has been observed a few times on
startup that a single-threaded listener would not always accept
incoming connections upon startup.
It's worth noting that during this code review it appeared that the
"new" flag in the fdtab isn't used anymore.
This fix should be backported to 1.8.
A bitfield has been added to know if there are some FDs processable by a
specific thread in the FD cache. When a FD is inserted in the FD cache, the bits
corresponding to its thread_mask are set. On each thread, the bitfield is
updated when the FD cache is processed. If there is no FD processed, the thread
is removed from the bitfield by unsetting its tid_bit.
Note that this bitfield is updated but not checked in
fd_process_cached_events. So, when this function is called, the FDs cache is
always processed.
[wt: should be backported to 1.8 as it will help fix a design limitation]
Since commit f9ce57e ("MEDIUM: connection: make conn_sock_shutw() aware
of lingering"), we refrain from performing the shutw() on the socket if
there is no lingering risk. But there is a problem with this in tunnel
and in TCP modes where a client is explicitly allowed to send a shutw
to the server, eventhough it it risky.
Not doing it creates this situation reported by Ricardo Fraile and
diagnosed by Christopher : a typical HTTP client (eg: curl) connecting
via the config below to an HTTP server would receive its response,
immediately close while the server remains in keep-alive mode. The
shutr() received by haproxy from the client is "propagated" to the
server side but not acted upon because fdtab[fd].linger_risk is set,
so we expect that the next close will immediately complete this
operation.
listen proxy-tcp
bind 127.0.0.1:8888
mode tcp
timeout connect 5s
timeout server 10s
timeout client 10s
server server1 127.0.0.1:8000
But since the whole stream will not end until the server closes in
turn, the server doesn't close and haproxy expires on server timeout.
This problem has already struck by waking up an older bug and was
partially fixed with commit 8059351 ("BUG/MEDIUM: http: don't disable
lingering on requests with tunnelled responses") though it was not
enough.
The problem is that linger_risk is not suited here. In fact we need to
know whether or not it is desired to close normally or silently, and
whether or not a shutr() has already been received on this connection.
This is the approach this patch takes, and it solves the problem for
the various difficult modes (tcp, http-server-close, pretend-keepalive).
This fix needs to be backported to 1.8. Many thanks to Ricardo for
providing very detailed traces and configurations.
The new function check_request_for_cacheability() is used to check if
a request may be served from the cache, and/or allows the response to
be stored into the cache. For this it checks the cache-control and
pragma header fields, and adjusts the existing TX_CACHEABLE and a new
TX_CACHE_IGNORE flags.
For now, just like its response side counterpart, it only checks the
first value of the header field. These functions should be reworked to
improve their parsers and validate all elements.
The thread patches adds refcount for notifications. The notifications are
used with the Lua cosocket. These refcount free the notifications when
the session is cleared. In the Lua task case, it not have sessions, so
the nofications are never cleraed.
This patch adds a garbage collector for signals. The garbage collector
just clean the notifications for which the end point is disconnected.
This patch should be backported in 1.8
This BUG was introduced with:
'MEDIUM: threads/stick-tables: handle multithreads on stick tables'
The API was reviewed to handle stick table entry updates
asynchronously and the caller must now call a 'stkable_touch_*'
function each time the content of an entry is modified to
register the entry to be synced.
There was missing call to stktable_touch_* resulting in
not propagated entries to remote peers (or local one during reload)
pendconn_get_next_strm() is called from process_srv_queue() under the
server lock, and calls stream_add_srv_conn() with this lock held, while
the latter tries to take it again. This results in a deadlock when
a server's maxconn is reached and haproxy is built with thread support.
Commit 9dcf9b6 ("MINOR: threads: Use __decl_hathreads to declare locks")
accidently lost a few "extern" in certain lock declarations, possibly
causing certain entries to be declared at multiple places. Apparently
it hasn't caused any harm though.
The offending ones were :
- fdtab_lock
- fdcache_lock
- poll_lock
- buffer_wq_lock
During the migration to the second version of the pools, the new
functions and pool pointers were all called "pool_something2()" and
"pool2_something". Now there's no more pool v1 code and it's a real
pain to still have to deal with this. Let's clean this up now by
removing the "2" everywhere, and by renaming the pool heads
"pool_head_something".
Rename the global variable "proxy" to "proxies_list".
There's been multiple proxies in haproxy for quite some time, and "proxy"
is a potential source of bugs, a number of functions have a "proxy" argument,
and some code used "proxy" when it really meant "px" or "curproxy". It worked
by pure luck, because it usually happened while parsing the config, and thus
"proxy" pointed to the currently parsed proxy, but we should probably not
rely on this.
[wt: some of these are definitely fixes that are worth backporting]
It can happen that we want to read early data, write some, and then continue
reading them.
To do so, we can't reuse tmp_early_data to store the amount of data sent,
so introduce a new member.
If we read early data, then ssl_sock_to_buf() is now the only responsible
for getting back to the handshake, to make sure we don't miss any early data.