William reported that since commit 6b3089856f ("MEDIUM: fd: do not use
the FD_POLL_* flags in the pollers anymore") the master's CLI often
fails to access sub-processes. There are two causes to this. One is
that we did report FD_POLL_ERR on an FD as soon as FD_EV_SHUT_W was
seen, which is automatically inherited from POLLHUP. And since we do
not store the current shutdown state of an FD we can't know if the
poller reports a sudden close resulting from an error or just a
byproduct of a previous shutdown(WR) followed by a read0. The current
patch addresses this by only considering this when the FD was active,
since a shutdown FD is not active. The second issue is that *somewhere*
down the chain, channel data are ignored if an error is reported on a
channel. This results in content truncation, but this cause was not
figured yet.
No backport is needed.
It is now possible to format stats counters as floats. But the stats applet does
not use it.
This patch is required by the Prometheus exporter to send the time averages in
seconds. If the promex change is backported, this patch must be backported
first.
A different engine-id is now generated for each thread. So, it is possible to
enable the async mode with several threads.
This patch may be backported to older versions.
We often need ISO time + microseconds in traces and ring buffers, thus
function does this by calling gettimeofday() and keeping a cached value
of the part representing the tv_sec value, and only rewrites the microsecond
part. The cache is per-thread so it's lockless and safe to use as-is.
Some tests already show that it's easy to see 3-4 events in a single
microsecond, thus it's likely that the nanosecond version will have to
be implemented as well. But certain comments on the net suggest that
some parsers are having trouble beyond microsecond, thus for now let's
stick to the microsecond only.
Now that we can wake tasklet for other threads, make sure that if the thread
is sleeping, we wake it up, or the tasklet won't be executed until it's
done sleeping.
That also means that, before going to sleep, and after we put our bit
in sleeping_thread_mask, we have to check that nobody added a tasklet for
us, just checking for global_tasks_mask isn't enough anymore.
The aim is to rassemble all scheduler information related to the current
thread. It simply points to task_per_thread[tid] without having to perform
the operation at each time. We save around 1.2 kB of code on performance
sensitive paths and increase the request rate by almost 1%.
There are a number of tests there which are enforced on tasklets while
they will never apply (various handlers, destroyed task or not, arguments,
results, ...). Instead let's have a single TASK_IS_TASKLET() test and call
the tasklet processing function directly, skipping all the rest.
It now appears visible that the only unneeded code is the update to
curr_task that is never used for tasklets, except for opportunistic
reporting in the debug handler, which can only catch si_cs_io_cb,
which in practice doesn't appear in any report so the extra cost
incurred there is pointless.
This change alone removes 700 bytes of code, mostly in
process_runnable_tasks() and increases the performance by about
1%.
Now that we can wake up a remote thread's tasklet, it's way more
interesting to use a tasklet than a task in the accept queue, as it
will avoid passing through all the scheduler. Just doing this increases
the accept rate by about 4%, overall recovering the slight loss
introduced by the tasklet change. In addition it makes sure that
even a heavily loaded scheduler (e.g. many very fast checks) will
not delay a connection accept.
Change the tasklet code so that the tasklet list is now a mt_list.
That means that tasklet now do have an associated tid, for the thread it
is expected to run on, and any thread can now call tasklet_wakeup() for
that tasklet.
One can change the associated tid with tasklet_set_tid().
Make it so MT_LIST_ADD and MT_LIST_ADDQ return 1 if it managed to add the
item, 0 (because it was already in a list) otherwise.
Make it so MT_LIST_DEL returns 1 if it managed to remove the item from a
list, or 0 otherwise (because it was in no list).
In srv_add_to_idle_list(), use LIST_DEL_INIT instead of just LIST_DEL.
We're about to add the connection to a mt_list, and MT_LIST_ADD/MT_LIST_ADDQ
will be modified to make sure we're not adding the element if it's already
in a list.
Add a few new macroes to the mt_lists.
MT_LIST_LOCK_ELT()/MT_LIST_UNLOCK_ELT() helps locking/unlocking an element.
This should only be used if you know for sure nobody else will remove the
element from the list in the meanwhile.
mt_list_for_each_entry_safe() is an iterator, similar to
list_for_each_entry_safe().
It takes 5 arguments, item, list_head, member are similar to those of
the non-mt variant, tmpelt is a temporary pointer to a struct mt_list, while
tmpelt2 is a struct mt_list itself.
MT_LIST_DEL_SELF() can be used to delete an item while parsing the list with
mt_list_for_each_entry_safe(). It shouldn't be used outside, and you
shouldn't use MT_LIST_DEL() while using mt_list_for_each_entry_safe().
Instead of using the same type for regular linked lists and "autolocked"
linked lists, use a separate type, "struct mt_list", for the autolocked one,
and introduce a set of macros, similar to the LIST_* macros, with the
MT_ prefix.
When we use the same entry for both regular list and autolocked list, as
is done for the "list" field in struct connection, we know have to explicitely
cast it to struct mt_list when using MT_ macros.
The FCGI application handles all the configuration parameters used to format
requests sent to an application. The configuration of an application is grouped
in a dedicated section (fcgi-app <name>) and referenced in a backend to be used
(use-fcgi-app <name>). To be valid, a FCGI application must at least define a
document root. But it is also possible to set the default index, a regex to
split the script name and the path-info from the request URI, parameters to set
or unset... In addition, this patch also adds a FCGI filter, responsible for
all processing on a stream.
To avoid code duplication in the futur mux FCGI, functions parsing H1 messages
and converting them into HTX have been moved in the file h1_htx.c. Some
specific parts remain in the mux H1. But most of the parsing is now generic.
Application is a generic term here. It is a modules which handle its own log
server list, with no dependency on a proxy. Such applications can now call the
function app_log() to log messages, passing a log server list and a tag as
parameters. Internally, the function __send_log() has been adapted accordingly.
Most of times, when a keyword is added in proxy section or on the server line,
we need to have a post-parser callback to check the config validity for the
proxy or the server which uses this keyword.
It is possible to register a global post-parser callback. But all these
callbacks need to loop on the proxies and servers to do their job. It is neither
handy nor efficient. Instead, it is now possible to register per-proxy and
per-server post-check callbacks.
Most of times, when any allocation is done during configuration parsing because
of a new keyword in proxy section or on the server line, we must add a call in
the deinit() function to release allocated ressources. It is now possible to
register a post-deinit callback because, at this stage, the proxies and the
servers are already releases.
Now, it is possible to register deinit callbacks per-proxy or per-server. These
callbacks will be called for each proxy and server before releasing them.
This new flag may be used to report unexpected error because of not well
formatted HTX messages (not related to a parsing error) or our incapactity to
handle the processing because we reach a limit (ressource exhaustion, too big
headers...). It should result to an error 500 returned to the client when
applicable.
It is now possible to export stats using the JSON format from the HTTP stats
page. Like for the CSV export, to export stats in JSON, you must add the option
";json" on the stats URL. It is also possible to dump the JSON schema with the
option ";json-schema". Corresponding Links have been added on the HTML page.
This patch fixes the issue #263.
This adds two extra fields to the stats, one for the current number of idle
connections and one for the configured limit. A tooltip link now appears on
the HTML page to show these values in front of the active connection values.
This should be backported to 2.0 and 1.9 as it's the only way to monitor
the idle connections behaviour.
When using "http-reuse safe", which is the default, a new incoming connection
does not automatically reuse an existing connection for the first request, as
we don't want to risk to lose the contents if we know the client will not be
able to replay the request. A side effect to this is that when dealing with
mostly http-close traffic, the reuse rate is extremely low and we keep
accumulating server-side connections that may even never be reused. At some
point we're limited to a ratio of file descriptors, but when the system is
configured with very high FD limits, we can still reach the limit of outgoing
source ports and make the system significantly slow down trying to find an
available port for outgoing connections. A simple test on my laptop with
ulimit 100000 and with the following config results in the load immediately
dropping after a few seconds :
listen l1
bind :4445
mode http
server s1 127.0.0.1:8000
As can be seen, the load falls from 38k cps to 400 cps during the first 200ms
(in fact when the source port table is full and connect() takes ages to find
a spare port for a new connection):
$ injectl464 -p 4 -o 1 -u 10 -G 127.0.0.1:4445/ -F -c -w 100
hits ^hits hits/s ^h/s bytes kB/s last errs tout htime sdht ptime
2439 2439 39338 39338 356094 5743 5743 0 0 0.4 0.5 0.4
7637 5198 38185 37666 1115002 5575 5499 0 0 0.7 0.5 0.7
7719 82 25730 820 1127002 3756 120 0 0 21.8 18.8 21.8
7797 78 19492 780 1138446 2846 114 0 0 61.4 2.5 61.4
7877 80 15754 800 1150182 2300 117 0 0 58.6 0.5 58.6
7920 43 13200 430 1156488 1927 63 0 0 58.9 0.3 58.9
At this point, lots of connections are indeed in use, for only 10 connections
on the frontend side:
$ ss -ant state established | wc -l
39022
This patch makes sure we never keep more idle connections than we've ever
had outstanding requests on a server. This way the total number of idle
connections will never exceed the sum of maximum connections. Thus highly
loaded servers will be able to get many connections and slightly loaded
servers will keep less. Ideally we should apply similar limits per process
and the per backend, but in practice this already addresses the issues
pretty well:
$ injectl464 -p 4 -o 1 -u 10 -G 127.0.0.1:4445/ -F -c -w 100
hits ^hits hits/s ^h/s bytes kB/s last errs tout htime sdht ptime
4423 4423 40209 40209 645758 5870 5870 0 0 0.2 0.4 0.2
8020 3597 40100 39966 1170920 5854 5835 0 0 0.2 0.4 0.2
12037 4017 40123 40170 1757402 5858 5864 0 0 0.2 0.4 0.2
16069 4032 40172 40320 2346074 5865 5886 0 0 0.2 0.4 0.2
20047 3978 40013 39386 2926862 5842 5750 0 0 0.3 0.4 0.3
24005 3958 40008 39979 3504730 5841 5837 0 0 0.2 0.4 0.2
$ ss -ant state established | wc -l
234
This patch must be backported to 2.0. It could be useful in 1.9 as well
eventhough pools and reuse are not enabled by default there.
As mentioned in previous commit, these flags do not map well to
modern poller capabilities. Let's use the FD_EV_*_{R,W} flags instead.
This first patch only performs a 1-to-1 mapping making sure that the
previously reported flags are still reported identically while using
the closest possible semantics in the pollers.
It's worth noting that kqueue will now support improvements such as
returning distinctions between shut and errors on each direction,
though this is not exploited for now.
There's currently a big ambiguity on our use of POLLHUP because we
currently map POLLHUP and POLLRDHUP to FD_POLL_HUP. The first one
indicates a close in *both* directions while the second one indicates
a unidirectional close. Since we don't know from the resulting flag
we always have to read when reported. Furthermore kqueue only reports
unidirectional responses which are mapped to FD_POLL_HUP as well, and
their write closes are mapped to a general error.
We could add a new FD_POLL_RDHUP flag to improve the mapping, or
switch only to the POLL* flags, but that further complicates the
portability for operating systems like FreeBSD which do not have
POLLRDHUP but have its semantics.
Let's instead directly use the per-direction flag values we already
have, and it will be a first step in the direction of finer states.
Thus we introduce an ERR and a SHUT status for each direction, that
the pollers will be able to compute and pass to fd_update_events().
It's worth noting that FD_EV_STATUS already sees the two new flags,
but they are harmless since used only by fd_{recv,send}_state() which
are never called. Thus in its current state this patch must be totally
transparent.
These two functions are used to enable recv/send but only if the FD is
not marked as active yet. The purpose is to conditionally mark them as
tentatively usable without interfering with the polling if polling was
already enabled, when it's supposed to be likely true.
Given that all our I/Os are now directed from top to bottom and not the
opposite way around, and the FD cache was removed, it doesn't make sense
anymore to create FDs that are marked not ready since this would prevent
the first accesses unless the caller explicitly does an fd_may_recv()
which is not expected to be its job (which conn_ctrl_init() has to do
by the way). Let's move this into fd_insert() instead, and have a single
atomic operation for both directions via fd_may_both().
Now that we don't have to update FD_EV_POLLED_* at the same time as
FD_EV_ACTIVE_*, we don't need to use a CAS anymore, a bit-test-and-set
operation is enough. Doing so reduces the code size by a bit more than
1 kB. One function was special, fd_done_recv(), whose comments and doc
were inaccurate for the part related to the lack of polling.
Since commit 7ac0e35f2 in 1.9-dev1 ("MAJOR: fd: compute the new fd polling
state out of the fd lock") we've started to update the FD POLLED bit a
bit more aggressively. Lately with the removal of the FD cache, this bit
is always equal to the ACTIVE bit. There's no point continuing to watch
it and update it anymore, all it does is create confusion and complicate
the code. One interesting side effect is that it now becomes visible that
all fd_*_{send,recv}() operations systematically call updt_fd_polling(),
except fd_cant_recv()/fd_cant_send() which never saw it change.
Now by prefixing a log server with "ring@<name>" it's possible to send
the logs to a ring buffer. One nice thing is that it allows multiple
sessions to consult the logs in real time in parallel over the CLI, and
without requiring file system access. At the moment, ring0 is created as
a default sink for tracing purposes and is available. No option is
provided to create new rings though this is trivial to add to the global
section.
Instead of detecting an AF_UNSPEC address family for a log server and
to deduce a file descriptor, let's create a target type field and
explicitly mention that the socket is of type FD.
The purpose is to be able to remember that initialization was already
done for a file descriptor. This will allow to get rid of some dirty
hacks performed in the logs or fd sinks where the init state of the
fd has to be guessed.
The "cache" entry was still present in the fdtab struct and it was
reported in "show sess". Removing it broke the cache-line alignment
on 64-bit machines which is important for threads, so it was fixed
by adding an attribute(aligned()) when threads are in use. Doing it
only in this case allows 32-bit thread-less platforms to see the
struct fit into 32 bytes.
Now it is possible for a reader to subscribe and wait for new events
sent to a ring buffer. When new events are written to a ring buffer,
the applets that are subscribed are woken up to display new events.
For now we only support this with the CLI applet called by "show events"
since the I/O handler is indeed a CLI I/O handler. But it's not
complicated to add other mechanisms to consume events and forward them
to external log servers for example. The wait mode is enabled by adding
"-w" after "show events <sink>". An extra "-n" was added to directly
seek to new events only.
Some CLI parsers are currently abusing the CLI context types such as
pointers to stuff longs into them by lack of room. But the context is
80 bytes while cli is only 48, thus there's some room left. This patch
adds a list element and two size_t usable as various offsets. The list
element is initialized.
The detail level initially based on syslog levels is not used, while
something related is missing, trace verbosity, to indicate whether or
not we want to call the decoding callback and what level of decoding
we want (raw captures etc). Let's change the field to "verbosity" for
this. A verbosity of zero means that the decoding callback is not
called, and all other levels are handled by this callback and are
source-specific. The source is now prompted to list the levels that
are proposed to the user. When the source doesn't define anything,
"quiet" and "default" are available.
Working on adding traces to mux-h2 revealed that the function names are
manually copied a lot in developer traces. The reason is that they are
not preprocessor macros and as such cannot be concatenated. Let's
slightly adjust the trace() function call to take a function name just
after the file:line argument. This argument is only added for the
TRACE_DEVEL and 3 new TRACE_ENTER, TRACE_LEAVE, and TRACE_POINT macros
and left NULL for others. This way the function name is only reported
for traces aimed at the developers. The pretty-print callback was also
extended to benefit from this. This will also significantly shrink the
data segment as the "entering" and "leaving" strings will now be merged.
One technical point worth mentioning is that the function name is *not*
passed as an ist to the inline function because it's not considered as
a builtin constant by the compiler, and would lead to strlen() being
run on it from all call places before calling the inline function. Thus
instead we pass the const char * (that the compiler knows where to find)
and it's the __trace() function that converts it to an ist for internal
consumption and for the pretty-print callback. Doing this avoids losing
5-10% peak performance.
The "payload" trace level was ambigous because its initial purpose was
to be able to dump received data. But it doesn't make sense to force to
report data transfers just to be able to report state changes. For
example, all snd_buf()/rcv_buf() operations coming from the application
layer should be tagged at this level. So here we move this payload level
above the state transitions and rename it to avoid the ambiguity making
one think it's only about request/response payload. Now it clearly is
about any data transfer and is thus just below the developer level. The
help messages on the CLI and the doc were slightly reworded to help
remove this ambiguity.
In prompts on the CLI we now commonly need to propose a keyword name
and a description and it doesn't make sense to define a new struct for
each such pairs. Let's simply have a generic "name_desc" for this.
Save the authority TLV in a PROXYv2 header from the client connection,
if present, and make it available as fc_pp_authority.
The fetch can be used, for example, to set the SNI for a backend TLS
connection.
Previously the callback was almost mandatory so it made sense to have it
before the message. Now that it can default to the one declared in the
trace source, most TRACE() calls contain series of empty args and callbacks,
which make them suitable for being at the end and being totally omitted.
This patch thus reverses the TRACE arguments so that the message appears
first, then the mask, then arg1..arg4, then the callback. In practice
we'll mostly see 1 arg, or 2 args and nothing else, and it will not be
needed anymore to pass long series of commas in the middle of the
arguments. However if a source is enforced, the empty commas will still
be needed for all omitted arguments.
It becomes apparent that most traces will use a single trace pretty
print callback, so let's allow the trace source to declare a default
one so that it can be omitted from trace calls, and will be used if
no other one is specified.