This patch implements a new global parameter for the master-worker mode.
When setting the mworker-max-reloads value, a worker receive a SIGTERM
if its number of reloads is greater than this value.
Since previous commit it's not needed anymore to test a task pointer
before calling task_destory() so let's just remove these tests from
the various callers before they become confusing. The function's
arguments were also documented. The same should probably be done
with tasklet_free() which involves a test in roughly half of the
call places.
Commit 3f795f7 ("MEDIUM: tasks: Merge task_delete() and task_free() into
task_destroy().") replaced task_delete() and task_free() with a single
function named task_destroy().
This patch adds a check for struct task* argument in function
task_destroy() to prevent a possible segfault on NULL and also to make
the function safer for use in other cases.
Now we atomically allocate the my_regex struct within function
regex_comp() and compile the regex or free both in case of failure. The
pointer to the allocated my_regex struct is returned directly. The
my_regex* argument to regex_comp() is removed.
Function regex_free() was modified so that it systematically frees the
my_regex entry. The function does nothing when called with a NULL as
argument (like free()). It will avoid existing risk of not properly
freeing the initialized area.
Other structures are also updated in order to be compatible (the ones
related to Lua and action rules).
With this patch we add a prefix to stick-table names declared in "peers" sections
concatenating the "peers" section name followed by a '/' character with
the stick-table name. Consequently, "peers" sections have their own
namespace for their stick-tables. Obviously, these stick-table names are not the
ones which should be sent over the network. So these configurations must be
compatible and should make A and B peers communicate with peers protocol:
# haproxy A config, old way stick-table declerations
peers mypeers
peer A ...
peer B ...
backend t1
stick-table type string size 10m store gpc0 peers mypeers
# haproxy B config, new way stick-table declerations
peers mypeers
peer A ...
peer B ...
table t1 type string size store gpc0 10m
This "network" name is stored in ->nid new field of stktable struct. The "local"
stktable-name is still stored in ->id.
Add a list of proxies for all the stick-tables (->proxies_list struct stktable
member) so that to be able to compute the process bindings of the peers after having
parsed the configuration file.
The proxies are added to the stick-tables they reference when parsing
stick-tables lines in proxy sections, when checking the actions in
check_trk_action() and when resolving samples args for stick-tables
without checking is they are duplicates. We check only there is no loop.
Then, after having parsed everything, we add the proxy bindings to the
peers frontend bindings with stick-tables they reference.
This patch adds the support for the "table" line parsing in "peers" sections
to declare stick-table in such sections. This also prevents the user from having
to declare dummy backends sections with a unique stick-table inside.
Even if still supported, this usage will become deprecated.
To do so, the ->table member of proxy struct which is a stktable struct is replaced
by a pointer to a stktable struct allocated at parsing time in src/cfgparse-listen.c
for the dummy stick-table backends and in src/cfgparse.c for "peers" sections.
This has an impact on the code for stick-table sample converters and on the stickiness
rules parsers which first store the name of the dummy before resolving the rules.
This patch replaces proxy_tbl_by_name() calls by stktable_find_by_name() calls
to lookup for stick-tables stored in "stktable_by_name" ebtree at parsing time.
There is only one remaining place where proxy_tbl_by_name() is used: src/hlua.c.
At several places in the code we relied on the fact that ->size member of stick-table
was equal to zero to consider the stick-table was present by not configured,
this do not make sense anymore as ->table member of struct proxyis fow now on a pointer.
These tests are replaced by a test on ->table value itself.
In "peers" section we do not have to temporary store the name of the section the
stick-table are attached to because this name is obviously already known just after
having entered this "peers" section.
About the CLI stick-table I/O handler, the pointer to proxy struct is replaced by
a pointer to a stktable struct.
With this patch we move the code responsible of parsing "stick-table"
lines to implement parse_stick_table() function in src/stick-tabble.c
so that to be able to parse "stick-table" elsewhere than in proxy sections.
We have have also added a conf struct to stktable struct to store the filename
and the line in the file the stick-table has been parsed to help in
diagnosing and displaying any configuration issue.
This implements support for the new API which relies on a call to
setsockopt().
On systems that support it (currently, only Linux >= 4.11), this enables
using TCP fast open when connecting to server.
Please note that you should use the retry-on "conn-failure", "empty-response"
and "response-timeout" keywords, or the request won't be able to be retried
on failure.
Co-authored-by: Olivier Houchard <ohouchard@haproxy.com>
The connect() method had 2 arguments, "data", that tells if there's pending
data to be sent, and "delack" that tells if we have to use a delayed ack
inconditionally, or if the backend is configured with tcp-smart-connect.
Turn that into one argument, "flags".
That way it'll be easier to provide more informations to connect() without
adding extra arguments.
SSL_SESSION_get0_id_context is introduced in LibreSSL-2.7.0
async operations are not supported by LibreSSL
early data is not supported by LibreSSL
packet_length is removed from SSL struct in LibreSSL
Add a way to retry requests if we got a junk response from the server, ie
an incomplete response, or something that is not valid HTTP.
To do so, one can use the new "junk-response" keyword for retry-on.
Add a new keyword for retry-on, 0rtt-rejected. If set, we will try to
replay requests for which we sent early data that got rejected by the
server.
If that option is set, we will attempt to use 0rtt if "allow-0rtt" is set
on the server line even if the client didn't send early data.
When running in HTX mode, if we sent the request, but failed to get the
answer, either because the server just closed its socket, we hit a server
timeout, or we get a 404, 408, 425, 500, 501, 502, 503 or 504 error,
attempt to retry the request, exactly as if we just failed to connect to
the server.
To do so, add a new backend keyword, "retry-on".
It accepts a list of keywords, which can be "none" (never retry),
"conn-failure" (we failed to connect, or to do the SSL handshake),
"empty-response" (the server closed the connection without answering),
"response-timeout" (we timed out while waiting for the server response),
or "404", "408", "425", "500", "501", "502", "503" and "504".
The default is "conn-failure".
Currently the thread array is a local variable inside a function block
and there is no access to it from outside, which often complicates
debugging. Let's make it global and export it. Also the allocation
return is now checked.
When we initially experimented with threads and processes support, we
needed to implement arrays of threads per process for cpu-map, but this
is not needed anymore since we support either threads or processes.
Let's simply make the thread-based cpu-map per thread and not per
thread and per process since that's not used anymore. Doing so reduces
the global struct from 33kB to 1.5kB.
When for some reason the session is not the owner of the connection anymore,
make sure we remove CO_FL_SESS_IDLE, even if we're about to call
conn->mux->destroy(), as the destroy may not destroy the connection
immediately if it's still in use.
This should be backported to 1.9.
u
In channel_erase(), don't forget to set output to 0, otherwise the
channel won't seem empty, when it really is, and that could lead to
stream never closing properly.
This should be backported to 1.9.
There is a bug when global.tune.maxaccept is set to -1 (no limit). It is pretty
visible with one process (nbproc sets to 1). The functions listener_accept() and
accept_queue_process() don't expect to handle negative maxaccept values. So
instead of accepting incoming connections without any limit, none are never
accepted and HAProxy loop infinitly in the scheduler.
When there are 2 or more processes, the bug is a bit more subtile. The limit for
a listener is set to 1. So only one connection is accepted at a time by a given
listener. This happens because the listener's maxaccept value is an unsigned
integer. In check_config_validity(), it is first set to UINT_MAX (-1 casted in
an unsigned integer), and then some calculations on it leads to an integer
overflow.
To fix the bug, the listener's maxaccept value is now a signed integer. So, if a
negative value is set for global.tune.maxaccept, we keep it untouched for the
listener and no calculation is made on it. Then, in the listener code, this
signed value is casted to a unsigned one. It simplifies all tests instead of
dealing with negative values. So, it limits the number of connections accepted
at a time to UINT_MAX at most. But, honestly, it not an issue.
This patch must be backported to 1.9 and 1.8.
Port range uses a ring buffer, and unfortunately, when making haproxy
multithreaded, it's been overlooked, and the ring buffer is not thread-safe.
When specifying a source range, 2 or more threads could pick the same
port, and of course only one of them could use the port, the others would
always fail the connection.
To fix this, make it a lock-free ring buffer. This is easier than usual
because we know the ring buffer can never be full.
This should be backported to 1.8 and 1.9.
The same way we have HA_ATOMIC_STORE(), implement HA_ATOMIC_LOAD().
This should be backported to 1.8 and 1.9, as we need it for a bug fix
in port ranges.
It's not logical to report context switch rates per thread in show activity
because everything else is a counter and it's not even possible to compare
values. Let's only report counts. Further, this simplifies the scheduler's
code.
This patch implements the sampling and load-balancing of log servers configured
with "sample" new keyword implemented by this commit:
'MINOR: log: Add "sample" new keyword to "log" lines'.
As the list of ranges used to sample the log to balance is ordered, we only
have to maintain ->curr_idx member of smp_info struct which is the index of
the sample and check if it belongs or not to the current range to decide if we
must send it to the log server or not.
This patch implements the parsing of "sample" new optional keyword for "log" lines
to be able to sample and balance the load of log messages between serveral log
destinations declared by "log" lines. This keyword must be followed by a list of
comma seperated ranges of indexes numbered from 1 to define the samples to be used
to balance the load of logs to send. This "sample" keyword must be used on "log" lines
obviously before the remaining optional ones without keyword. The list of ranges
must be followed by a colon character to separate it from the log sampling size.
With such following configuration declarations:
log stderr local0
log 127.0.0.1:10001 sample 2-3,8-11:11 local0
log 127.0.0.2:10002 sample 5:5 local0
in addition to being sent to stderr, about the second "log" line, every 11 logs
the logs #2 up to #3 would be sent to 127.0.0.1:10001, then #8 up tp #11 four
logs would be sent to the same log server and so on periodically. Logs would be
sent to 127.0.0.2:100002 every 5 logs.
It is also possible to define the size of the sample with a value different of
the maximum of the high limits of the ranges, for instance as follows:
log 127.0.0.1:10001 sample 2-3,8-11:15 local0
as before the two logs #2 and #3 would be sent to 127.0.0.1:10001, then #8
up tp #11 logs, but in this case here, this would be done periodically every 15
messages.
Also note that the ranges must not overlap each others. This is to ease the
way the logs are periodically sent.
This simplifies the API and hide the details in the sample. This way, only
string and binary are aware of these info, because other types cannot be
partially encoded.
This patch may be backported to 1.9 and 1.8.
Fragmented arg will do fetch at every encode time, each fetch may get
different result if SMP_F_MAY_CHANGE, for example res.payload, but
the length already encoded in first fragment of the frame, that will
cause SPOA decode failed and waste resources.
This patch must be backported to 1.9 and 1.8.
During 1.9 development (and even a bit after) we've started to face a
significant number of situations where streams were abusively spinning
due to an uncaught error flag or complex conditions that couldn't be
correctly identified. Sometimes streams wake appctx up and conversely
as well. More importantly when this happens the only fix is to restart.
This patch adds a new function to report a serious error, some relevant
info and to crash the process using abort() so that a core dump is
available. The purpose will be for this function to be called in various
situations where the process is unfixable. It will help detect these
issues much earlier during development and may even help fixing test
platforms which are able to automatically restart when such a condition
happens, though this is not the primary purpose.
This patch only provides the function and doesn't use it yet.
We have various functions like conn_get_ctrl_name() to retrieve
some information reported in "show sess" for debugging, which
assume that the connection is valid. This is really not convenient
in code aimed at debugging and is error-prone. Let's add a validity
test first.
Instead of having to manually turn task profiling on/off in the
configuration, by default it will work in "auto" mode, which
automatically turns on on any thread experiencing sustained loop
latencies over one millisecond averaged over the last 1024 samples.
This may happen with configs using lots of regex (thing map_reg for
example, which is the lazy way to convert Apache's rewrite rules but
must not be abused), and such high latencies affect all the process
and the problem is most often intermittent (e.g. hitting a map which
is only used for certain host names).
Thus now by default, with profiling set to "auto", it remains off all
the time until something bad happens. This also helps better focus on
the issues when looking at the logs as well as in "show sess" output.
It automatically turns off when the average loop latency over the last
1024 calls goes below 990 microseconds (which typically takes a while
when in idle).
This patch could be backported to stable versions after a bit more
exposure, as it definitely improves observability and the ability to
quickly spot the culprit. In this case, previous patch ("MINOR:
activity: make the profiling status per thread and not global") must
also be taken.
In order to later support automatic profiling turn on/off, we need to
have it per-thread. We're keeping the global option to know whether to
turn it or on off, but the profiling status is now set per thread. We're
updating the status in activity_count_runtime() which is called before
entering poll(). The reason is that we'll extend this with run time
measurement when deciding to automatically turn it on or off.
Very similarly to previous commit doing the same for streams, we now
measure and report an appctx's call rate. This will help catch applets
which do not consume all their data and/or which do not properly report
that they're waiting for something else. Some of them like peers might
theorically be able to exhibit some occasional peeks when teaching a
full table to a nearby peer (e.g. the new replacement process), but
nothing close to what a bogus service can do so there is no risk of
confusion.
Quite a few times some bugs have made a stream task incorrectly
handle a complex combination of events, which was often reported as
"100% CPU", and was usually caused by the event not being properly
identified and flushed, and the stream's handler called in loops.
This patch adds a call rate counter to the stream struct. It's not
huge, it's really inexpensive (especially compared to the rest of the
processing function) and will easily help spot such tasks in "show sess"
output, possibly even allowing to kill them.
A future patch should probably consist in alerting when they're above a
certain threshold, possibly sending a dump and killing them. Some options
could also consist in aborting in order to get an analyzable core dump
and let a service manager restart a fresh new process.
It's particularly useful to spot runaway tasks to see this. The context
switch rate covers all tasklet calls (tasks and I/O handlers) while the
task wakeups only covers tasks picked from the run queue to be executed.
High values there will indicate either an intense traffic or a bug that
mades a task go wild.
Since the commit 1d2b586cd ("MAJOR: htx: Enable the HTX mode by default for all
proxies"), the HTX is enabled by default for all proxies, HTTP and TCP, but also
CLI and HEALTH proxies. But when the best mux is retrieved, only HTTP and TCP
modes are checked. If the TCP mode is not explicitly set, it is considered as an
HTTP proxy. It is an hidden bug introduced when the option "http-use-htx" was
added. It has no effect until the commit 1d2b586cd. But now, when a stats socket
is created for the master process, the mux h1 is installed on all incoming
connections to the CLI proxy, leading to segfaults because HTX operations are
performed on raw buffers.
So to fix the buf, when a mux is installed, all proxies are considered as TCP
proxies, except HTTP ones. This way, CLI and HEALTH proxies will be handled as
TCP proxies.
This patch must be backported to 1.9 although it has no effect. It is safer to
not keep hidden bugs.
The 'do-resolve' action is an http-request or tcp-request content action
which allows to run DNS resolution at run time in HAProxy.
The name to be resolved can be picked up in the request sent by the
client and the result of the resolution is stored in a variable.
The time the resolution is being performed, the request is on pause.
If the resolution can't provide a suitable result, then the variable
will be empty. It's up to the admin to take decisions based on this
statement (return 503 to prevent loops).
Read carefully the documentation concerning this feature, to ensure your
setup is secure and safe to be used in production.
This patch creates a global counter to track various errors reported by
the action 'do-resolve'.
dns_requester structure can be allocated at run time when servers get
associated to DNS resolution (this happens when SRV records are used in
conjunction with service discovery).
Well, this memory allocation is safer if managed in an HAProxy pool,
furthermore with upcoming HTTP action which can perform DNS resolution
at runtime.
This patch moves the memory management of the dns_requester structure
into its own pool.
As specified in the function comment, the function h1_skip_chunk_crlf() must not
change anything and return zero if not enough data are available. This must
include the case where there is no data at all. On this point, it must do the
same that other h1 parsing functions. This bug is made visible since the commit
91f77d599 ("BUG/MINOR: mux-h1: Process input even if the input buffer is
empty").
This patch must be backported to 1.9.
As by default we add all keepalive connections to the idle pool, if we run
into a pathological case, where all client don't do keepalive, but the server
does, and haproxy is configured to only reuse "safe" connections, we will
soon find ourself having lots of idling, unusable for new sessions, connections,
while we won't have any file descriptors available to create new connections.
To fix this, add 2 new global settings, "pool_low_ratio" and "pool_high_ratio".
pool-low-fd-ratio is the % of fds we're allowed to use (against the maximum
number of fds available to haproxy) before we stop adding connections to the
idle pool, and destroy them instead. The default is 20. pool-high-fd-ratio is
the % of fds we're allowed to use (against the maximum number of fds available
to haproxy) before we start killing idling connection in the event we have to
create a new outgoing connection, and no reuse is possible. The default is 25.
For most of the xprt methods, provide a xprt_ctx. This will be useful later
when we'll want to be able to stack xprts.
The init() method now has to create and provide the said xprt_ctx if needed.
task_delete() was never used without calling task_free() just after, and
task_free() was only used on error pathes to destroy a just-created task,
so merge them into task_destroy(), that will remove the task from the
wait queue, and make sure the task is either destroyed immediately if it's
not in the run queue, or destroyed when it's supposed to run.
The wakeup condition in task_wakeup() is redundant as it is already
validated by the CAS. Better move the __task_wakeup() call there, it
also has the merit of being easier to audit this way. This also reduces
the code size by around 1.8 kB :
$ size haproxy-?
text data bss dec hex filename
2153806 100208 1307676 3561690 3658da haproxy-1
2152094 100208 1307676 3559978 36522a haproxy-2
Commit 0c7a4b6 ("MINOR: tasks: Don't set the TASK_RUNNING flag when
adding in the tasklet list.") revealed a hole in the way tasks may
be freed : they could be removed while in the run queue when the
TASK_QUEUED flag was present but not the TASK_RUNNING one. But it
seems the issue was emphasized by commit cde7902 ("MEDIUM: tasks:
improve fairness between the local and global queues") though the
code it replaces was already affected given how late the TASK_RUNNING
flag was set after removal from the global queue.
At the moment the task is picked from the global run queue, if it
is the last one, the global run queue lock is dropped, and then
the TASK_RUNNING flag was added. In the mean time another thread
might have performed a task_free(), and immediately after, the
TASK_RUNNING flag was re-added to the task, which was then added
to the tasklet list. The unprotected window was extremely faint
but does definitely exist and inconsistent task lists have been
observed a few times during very intensive tests over the last few
days. From this point various options are possible, the task might
have been re-allocated while running, and assigned state 0 and/or
state QUEUED while it was still running, resulting in the tast not
being put back into the tree.
This commit simply makes sure that tests on TASK_RUNNING before removing
the task also cover TASK_QUEUED.
It must be backported to 1.9 along with the previous ones touching
that area.
Now that we have the warranty that a task won't be added in the runqueue
while the TASK_QUEUED or the TASK_RUNNING flag is set, don't bother trying
to lock the task by setting leaf_p to 0x1 while inserting it in the runqueue
or having it in the tasklet_list, as nobody else will attempt to add it.