The principle is that when emitting a message, if some dropped events
were logged, we first attempt to report this counter before going
further. This is done under an exclusive lock while all logs are
produced under a shared lock. This ensures that the dropped line is
accurately reported and doesn't accidently arrive after a later
event.
This now provides sink_new_buf() which allocates a ring buffer. One such
ring ("buf0") of 1 MB is created already, and may be used by sink_write().
The sink's creation should probably be moved somewhere else later.
Our circular buffers are well suited for being used as ring buffers for
not-so-structured data. The machanism here consists in making room in a
buffer before inserting a new record which is prefixed by its size, and
looking up next record based on the previous one's offset and size. We
can have up to 255 consumers watching for data (dump in progress, tail)
which guarantee that entrees are not recycled while they're being dumped.
The complete representation is described in the header file. For now only
ring_new(), ring_resize() and ring_free() are created.
Currently both logs and event sinks may use a file descriptor to
atomically emit some output contents. The two may use the same FD though
nothing is done to make sure they use the same lock. Also there is quite
some redundancy between the two. Better make a specific function to send
a fragmented message to a file descriptor which will take care of the
locking via the fd's lock. The function is also able to truncate a
message and to enforce addition of a trailing LF when building the
output message.
I forgot to fix this one before pushing, despite my tests. lockon_ptr is
only used to compare pointers, it doesn't need to point to a writable
location. Without threads the atomic store is turned into an assignment
and rightfully complains.
Given that we can pass typed arguments to the trace() function, let's
add provisions for tracking them. They are source-specific so we need
to let the source fill their name and description. Only those with a
non-null name will be proposed.
With a few macros it's possible for a trace source to commit to only
using a certain type for a given argument (or set of). This will be
particularly useful to let the trace subsystem retrieve some precious
information such as a connection, session, listener, source address or
so, and enable/disable filtering and/or locking.
The principle of this subsystem will be to support taking live traces
at various places in the code with conditional triggers, filters, and
ability to lock on some elements. The traces will support typed events
and will be sent into sinks made of ring buffers, file descriptors or
remote servers.
This is the most basic type of sink. It pre-registers "stdout" and
"stderr", and is able to use writev() on them. The writev() operation
is locked to avoid mixing outputs. It's likely that the registration
should move somewhere else to take into account the fact that stdout
and stderr are still opened or are closed.
The principle will be to be able to dispatch events to various destinations
called "sinks". This is already done in part in logs where log servers can
be either a UDP socket or a file descriptor. This will be needed with the
new trace subsystem where we may also want to add ring buffers. And it turns
out that all such destinations make sense at all places. Logs may need to be
sent to a TCP server via a ring buffer, or consulted from the CLI. Trace
events may need to be sent to stdout/stderr as well as to remote log servers.
This patch creates a new structure "sink" aiming at addressing these similar
needs. The goal is to merge together what is common to all of them, such as
the output format, the dropped events count, etc, and also keep separately
the target identification (network address, file descriptor). Provisions
were made to have a "waiter" on the sink. For a TCP log server it will be
the task to wake up after writing to the log buffer. For a ring buffer, it
could be the list of watchers on the CLI running a "tail" operation and
waiting for new events. A lock was also placed in the struct since many
operations will require some locking, including the FD ones. The output
formats covers those in use by logs and two extra ones prepending the ISO
time in front of the message (convenient for stdio/buffer).
For now only the generic infrastructure is present, no type-specific
output is implemented. There's the sink_write() function which prepares
and formats a message to be sent, trying hard to avoid copies and only
using pointer manipulation, where the type-specific code just has to be
added. Dropped messages are already counted (for now 100% drop). The
message is put into an iovec array as it will be trivial to use with
file descriptors and sockets.
Right now we used to have extremely inconsistent states to report output,
one is CLI_ST_PRINT which prints constant message cli->msg with the
assigned severity, and CLI_ST_PRINT_FREE which prints dynamically
allocated cli->err with severity LOG_ERR, and nothing in between,
eventhough it's useful to be able to report dynamically allocated
messages as well as constant error messages.
This patch adds two extra states, which are not particularly well named
given the constraints imposed by existing ones. One is CLI_ST_PRINT_ERR
which prints a constant error message. The other one is CLI_ST_PRINT_DYN
which prints a dynamically allocated message. By doing so we maintain
the compatibility with current code.
It is important to keep in mind that we cannot pre-initialize pointers
and automatically detect what message type it is based on the assigned
fields, because the CLI's context is in a union shared with all other
users, thus unused fields contain anything upon return. This is why we
have no choice but using 4 states. Keeping the two fields <msg> and
<err> remains useful because one is const and not the other one, and
this catches may copy-paste mistakes. It's just that <err> is pretty
confusing here, it should be renamed.
The CPU time accounting field called "cpu_time" is used only by tasks
and not tasklets, yet it used to be stored into the TASK_COMMON part,
which doesn't make sense and wastes tasklet memory. In addition, moving
it to tasks also helps better group the various parts in cache lines.
In stream_set_backend(), if we have a TCP stream, and we want to upgrade it
to H2 instead of attempting ot reuse the stream, just destroy the
conn_stream, make sure we don't log anything about the stream, and pretend
we failed setting the backend, so that the stream will get destroyed.
New streams will then be created by the mux, as if the connection just
happened.
This fixes a crash when upgrading from TCP to H2, as the H2 mux totally
ignored the conn_stream provided by the upgrade, as reported in github
issue #196.
This should be backported to 2.0.
Now that the architecture was changed so that attempts to receive/send data
always come from the upper layers, instead of them only trying to do so when
the lower layer let them know they could try, we can finally get rid of the
fd cache. We don't really need it anymore, and removing it gives us a small
performance boost.
A problem involving server slowstart was reported by @max2k1 in issue #197.
The problem is that pendconn_grab_from_px() takes the proxy lock while
already under the server's lock while process_srv_queue() first takes the
proxy's lock then the server's lock.
While the latter seems more natural, it is fundamentally incompatible with
mayn other operations performed on servers, namely state change propagation,
where the proxy is only known after the server and cannot be locked around
the servers. Howwever reversing the lock in process_srv_queue() is trivial
and only the few functions related to dynamic cookies need to be adjusted
for this so that the proxy's lock is taken for each server operation. This
is possible because the proxy's server list is built once at boot time and
remains stable. So this is what this patch does.
The comments in the proxy and server structs were updated to mention this
rule that the server's lock may not be taken under the proxy's lock but
may enclose it.
Another approach could consist in using a second lock for the proxy's queue
which would be different from the regular proxy's lock, but given that the
operations above are rare and operate on small servers list, there is no
reason for overdesigning a solution.
This fix was successfully tested with 10000 servers in a backend where
adjusting the dyncookies in loops over the CLI didn't have a measurable
impact on the traffic.
The only workaround without the fix is to disable any occurrence of
"slowstart" on server lines, or to disable threads using "nbthread 1".
This must be backported as far as 1.8.
When a lua action or a lua sample fetch is called, a lua transaction is
created. It is an entry in the stack containing the class TXN. Thanks to it, we
can know the direction (request or response) of the call. But, for some
functions, it is also necessary to know if the buffer is "HTTP ready" for the
given direction. "HTTP ready" means there is a valid HTTP message in the
channel's buffer. So, when a lua action or a lua sample fetch is called, the
flag HLUA_TXN_HTTP_RDY is set if it is appropriate.
There is no standard case for HTTP header names because, as stated in the
RFC7230, they are case-insensitive. So applications must handle them in a
case-insensitive manner. But some bogus applications erroneously rely on the
case used by most browsers. This problem becomes critical with HTTP/2
because all header names must be exchanged in lowercase. And HAProxy uses the
same convention. All header names are sent in lowercase to clients and servers,
regardless of the HTTP version.
This design choice is linked to the HTX implementation. So, for previous
versions (2.0 and 1.9), a workaround is to disable the HTX mode to fall
back to the legacy HTTP mode.
Since the legacy HTTP mode was removed, some users reported interoperability
issues because their application was not able anymore to handle HTTP/1 message
received from HAProxy. So, we've decided to add a way to change the case of some
headers before sending them. It is now possible to define a "mapping" between a
lowercase header name and a version supported by the bogus application. To do
so, you must use the global directives "h1-case-adjust" and
"h1-case-adjust-file". Then options "h1-case-adjust-bogus-client" and
"h1-case-adjust-bogus-server" may be used in proxy sections to enable the
conversion. See the configuration manual for more info.
Of course, our advice is to urgently upgrade these applications for
interoperability concerns and because they may be vulnerable to various types of
content smuggling attacks. But, if your are really forced to use an unmaintained
bogus application, you may use these directive, at your own risks.
If it is relevant, this feature may be backported to 2.0.
Dragan Dosen found that the listeners lock is not sufficient to protect
the listeners list when proxies are stopping because the listeners are
also unlinked from the protocol list, and under certain situations like
bombing with soft-stop signals or shutting down many frontends in parallel
from multiple CLI connections, it could be possible to provoke multiple
instances of delete_listener() to be called in parallel for different
listeners, thus corrupting the protocol lists.
Such operations are pretty rare, they are performed once per proxy upon
startup and once per proxy on shut down. Thus there is no point trying
to optimize anything and we can use a global lock to protect the protocol
lists during these manipulations.
This fix (or a variant) will have to be backported as far as 1.8.
The purpose will be to store the target address there and not to
allocate a connection just for this anymore. For now it's only placed
in the struct, a few fields were moved to plug some holes, and the
entry is freed on release (never allocated yet for now). This must
have no impact. Note that in order to fit, the store_count which
previously was an int was turned into a short, which is way more
than enough given that the hard-coded limit is 8.
Now addresses are dynamically allocated when needed. Each connection is
created with src=dst=NULL, these entries are allocated on the fly, and
released when the connection is released.
At the moment we're facing difficulties with connection reuse based on
the fact that connections may be allocated very early only to set a
target address in transparent mode. With the imminent removal of the
legacy mode, the connection reuse by a same stream will not exist
anymore and all this awful complexity is not justified anymore. However
we still need to be able to assign addresses somewhere.
Thus instead of allocating a connection, we'll only place addresses where
needed in the stream during operations. But this takes quite some room
(typically 128 bytes). This is a nice opportunity for cleaning all this
up and dynamically allocatating the addresses fields, which will result
in actually saving memory from connection structs since most of the time
the client's "to" address is not used and the server's "from" is not used
either, thus saving ~256 bytes per end-to-end connection.
For now these new "src" and "dst" pointers point to addr.from and addr.to.
This will allow us to smoothly update the whole code to use these pointers
prior to going further and switching them to pools.
The keywords req* and rsp* are now unsupported. So the corresponding lists are
now unused. It is safe to remove them from the structure proxy.
As a result, the code dealing with these rules in HTTP analyzers was also
removed.
It was announced for the 2.1. Following keywords are now unsupported:
* reqadd, reqallow, reqiallow, reqdel, reqidel, reqdeny, reqideny, reqpass,
reqipass, reqrep, reqirep reqtarpit, reqitarpit
* rspadd, rspdel, rspidel, rspdeny, rspideny, rsprep, rspirep
a fatal error is emitted if one of these keyword is found during the
configuraion parsing.
The option 'http-tunnel' is deprecated and it was only used in the legacy HTTP
mode. So this option is now totally ignored and a warning is emitted during
HAProxy startup if it is found in a configuration file.
The old module proto_http does not exist anymore. All code dedicated to the HTTP
analysis is now grouped in the file proto_htx.c. So, to finish the polishing
after removing the legacy HTTP code, proto_htx.{c,h} files have been moved in
http_ana.{c,h} files.
In addition, all HTX analyzers and related functions prefixed with "htx_" have
been renamed to start with "http_" instead.
Many flags of the HTTP transction (TX_*) are now unused and useless. So the
flags TX_WAIT_CLEANUP, TX_HDR_CONN_*, TX_CON_CLO_SET and TX_CON_KAL_SET were
removed. Most of TX_CON_WANT_* were also removed. Only TX_CON_WANT_TUN has been
kept.
First of all, all legacy HTTP analyzers and all functions exclusively used by
them were removed. So the most of the functions in proto_http.{c,h} were
removed. Only functions to deal with the HTTP transaction have been kept. Then,
http_msg and hdr_idx modules were entirely removed. And finally the structure
http_msg was lightened of all its useless information about the legacy HTTP. The
structure hdr_ctx was also removed because unused now, just like unused states
in the enum h1_state. Note that the memory pool "hdr_idx" was removed and
"http_txn" is now smaller.
This commit breaks the compatibility with filters still relying on the legacy
HTTP code. The legacy callbacks were removed (http_data, http_chunk_trailers and
http_forward_data).
For now, the filters must still set the flag FLT_CFG_FL_HTX to be used on HTX
streams.
Since the legacy HTTP mode is disabled and no multiplexer relies on it anymore,
there is no reason to have 2 multiplexer protocols for the HTTP. So the protocol
PROTO_MODE_HTX was removed and all HTTP multiplexers use now PROTO_MODE_HTTP.
Since commit 81492c989 ("MINOR: threads: flatten the per-thread cpu-map"),
we don't keep the proc*thread matrix anymore to represent the full binding
possibilities, but only the proc and thread ones. The problem is that the
per-process binding is not the same for each thread and for the process,
and the proc[] array was assumed to store the per-proc first thread value
when doing this change. Worse, the logic present there tries to deal with
thread ranges and process ranges in a way which automatically exclused the
other possibility (since ranges cannot be used on both) but as such fails
to apply changes if neither the process nor the thread is expressed as a
range.
The real problem comes from the fact that specifying cpu-map 1/1 doesn't
yet reveal if the per-process mask or the per-thread mask needs to be
updated. In practice it's the thread one but then the current storage
doesn't allow to store the binding of the first thread of each other
process in nbproc>1 configurations.
When removing the proc*thread matrix, what ought to have been kept was
both the thread column for process 1 and the process line for threads 1,
but instead only the thread column was kept. This patch reintroduces the
storage of the configuration for the first thread of each process so that
it is again possible to store either the per-thread or per-process
configuration.
As a partial workaround for existing configurations, it is possible to
systematically indicate at least two processes or two threads at once
and map them by pairs or more so that at least two values are present
in the range. E.g :
# set processes 1-4 to cpus 0-3 :
cpu-map auto:1-4/1 0 1 2 3
# or:
cpu-map 1-2/1 0 1
cpu-map 2-3/1 2 3
# set threads 1-4 to cpus 0-3 :
cpu-map auto:1/1-4 0 1 2 3
# or :
cpu-map 1/1-2 0 1
cpu-map 3/3-4 2 3
This fix must be backported to 2.0.
Sometimes we need to delegate some list processing to a function running
on another thread. In this case the list element will simply be queued
into a dedicated self-locked list and the task responsible for this list
will be woken up, calling the associated function which will run over the
list.
This is what work_list does. Such lists will be dedicated to a limited
type of work but will significantly ease such remote handling. A function
is provided to create these per-thread lists, their tasks and to properly
bind each task to a distinct thread, so that the caller only has to store
the resulting pointer to the start of the structure.
These structures should not be abused though as each head will consume
4 pointers per thread, hence 32 bytes per thread or 2 kB for 64 threads.
While experimenting with potentially improved fairness and latency using
ticket locks on a Ryzen 16-thread/8-core, a very strange situation happened
a lot for some levels of traffic. Around 300k connections per second, no
more connections would be accepted on the multi-threaded listener but all
others would continue to work fine. All attempts to trace showed that the
threads were all in the trylock in the fd cache, or in the spinlock of
fd_update_events(), or in the one of fd_may_recv(). But as indicated this
was not a deadlock since the process continues to work fine.
After quite some investigation it appeared that the issue is caused by a
lack of fairness between the fdcache's trylock and these functions' spin
locks above. In fact, regardless of the success or failure of the fdcache's
attempt at grabbing the lock, the poller was calling fd_update_events()
which locks the FD once for something that can be done with a CAS, and
then calls fd_may_recv() with another lock for something that most often
didn't change. The high contention on these spinlocks leaves no chance to
any other thread to grab the lock using trylock(), and once this happens,
there is no thread left to process incoming connection events nor to stop
polling on the FD, leaving all threads at 100% CPU but partially operational.
This patch addresses the issue by using bit-test-and-set instead of the OR
in fd_may_recv() / fd_may_send() so that nothing is done if the FD was
already configured as expected. It does the same in fd_update_events()
using a CAS to check if the FD's events need to be changed at all or not.
With this patch applied, it became impossible to reproduce the issue, and
now there's no way to saturate all 16 CPUs with the load used for testing,
as no more than 1350-1400 were noticed at 300+kcps vs 1600.
Ideally this patch should go further and try to remove the remaining
incarnations of the fdlock as this seems possible, but it's difficult
enough to be done in a distinct patch that will not have to be backported.
It is possible that workloads involving a high connection rate may slightly
benefit from this patch and observe a slightly lower CPU usage even when
the service doesn't misbehave.
This patch must be backported to 2.0 and 1.9.
This code should be now used by action to stop at the same time the rules
processing and the possible following processings. And from its side, the return
code ACT_RET_STOP should be used to only stop rules processing.
So concretely, for TCP rules, there is no changes. ACT_RET_STOP and ACT_RET_DONE
are handled the same way. However, for HTTP rules, ACT_RET_STOP should now be
mapped on HTTP_RULE_RES_STOP and ACT_RET_DONE on HTTP_RULE_RES_DONE. So this
way, a action will have the possibilty to stop all processing or only rules
processing.
Note that changes about the TCP is done in this commit but changes about the
HTTP will be done in another one because it will fix a bug in the same time.
This patch must be backported to 2.0 because a bugfix depends on it.
When using a level lower than admin on the master CLI, a \n is output
before the response, this is caused by the response of the "operator" or
"user" that are sent before the actual command.
To fix this problem we introduce the flag APPCTX_CLI_ST1_NOLF which ask
a command response to not be followed by the final \n.
This patch made a special case with the command operator and user
followed by a - so they are not followed by \n.
This patch must be backported to 2.0 and 1.9.
Server states can be recovered from either a "global" file (all backends)
or a "local" file (per backend).
The way the algorithm to parse the state file was first implemented was good
enough for a low number of backends and servers per backend.
Basically, for each backend the state file (global or local) is opened,
parsed entirely and for each line we check if it contains data related to
a server from the backend we're currently processing.
We must read the file entirely, just in case some lines for the current
backend are stored at the end of the file.
This does not scale at all!
This patch changes the behavior above for the "global" file only. Now,
the global file is read and parsed once and all lines it contains are
stored in a tree, for faster discovery.
This result in way much less fopen, fgets, and strcmp calls, which make
loading of very big state files very quick now.
It's really confusing to call it a task because it's a tasklet and used
in places where tasks and tasklets are used together. Let's rename it
to tasklet to remove this confusion.
When we look up an dictionary entry in the cache used upon transmission
we store the last result in ->prev_lookup of struct dcache_tx so that
to compare it with the subsequent entries to look up and save performances.
When allocating new dictionary entries we store the length of the strings.
May be useful so that not to have to call strlen() too much often at runing
time.
We store pointers to server names dictionary entries in a pre-allocated array of
ebpt_node's (->entries member of struct dcache_tx) to cache those sent to remote
peers. Consequently the ID used to identify the server name dictionary entry is
also used as index for this array. There is no need to implement a lookup by key
for this dictionary cache.