[wt: the new version is 2.1 but it's useful to document the different
versions since they're found in field. There's some overlap with the
new one and they complement on certain areas. Most likely they'll
ultimately be merged.]
Tim Düsterhus noticed that the create-release script had mangled the
version in the peers protocol doc, forcing it to 1.8 due to its syntax
matching the format of an haproxy version. Let's just slightly readjust
the header not to match this by removing the word "version" and placing
it on the same line as the title.
During the migration to the second version of the pools, the new
functions and pool pointers were all called "pool_something2()" and
"pool2_something". Now there's no more pool v1 code and it's a real
pain to still have to deal with this. Let's clean this up now by
removing the "2" everywhere, and by renaming the pool heads
"pool_head_something".
Rename the global variable "proxy" to "proxies_list".
There's been multiple proxies in haproxy for quite some time, and "proxy"
is a potential source of bugs, a number of functions have a "proxy" argument,
and some code used "proxy" when it really meant "px" or "curproxy". It worked
by pure luck, because it usually happened while parsing the config, and thus
"proxy" pointed to the currently parsed proxy, but we should probably not
rely on this.
[wt: some of these are definitely fixes that are worth backporting]
It is now possible on a "bind" line (or a "stats socket" line) to specify the
thread set allowed to process listener's connections. For instance:
# HTTPS connections will be processed by all threads but the first and HTTP
# connection will be processed on the first thread.
bind *:80 process 1/1
bind *:443 ssl crt mycert.pem process 1/2-
Now, it is possible to bind CPU at the thread level instead of the process level
by defining a thread set in "cpu-map" directives. Thus, its format is now:
cpu-map [auto:]<process-set>[/<thread-set>] <cpu-set>...
where <process-set> and <thread-set> must follow the format:
all | odd | even | number[-[number]]
Having a process range and a thread range in same time with the "auto:" prefix
is not supported. Only one range is supported, the other one must be a fixed
number. But it is allowed when there is no "auto:" prefix.
Because it is possible to define a mapping for a process and another for a
thread on this process, threads will be bound on the intersection of their
mapping and the one of the process on which they are attached. If the
intersection is null, no specific binding will be set for the threads.
It was a temporary directive used for development purpose. Now, CPU mapping for
at the thread level should be done using the cpu-map directive. This feature
will be added in a next commit.
Now, processa and CPU ranges can be partially defined. The higher bound can be
omitted. In such case, it is replaced by the corresponding maximum value, 32 or
64 depending on the machine's word size.
By extension, It is also true for the "bind-process" directive and "process"
parameter on a "bind" or a "stats socket" line.
The prefix "auto:" can be added before the process set to let HAProxy
automatically bind a process to a CPU by incrementing process and CPU sets. To
be valid, both sets must have the same size. No matter the declaration order of
the CPU sets, it will be bound from the lower to the higher bound.
Examples:
# all these lines bind the process 1 to the cpu 0, the process 2 to cpu 1
# and so on.
cpu-map auto:1-4 0-3
cpu-map auto:1-4 0-1 2-3
cpu-map auto:1-4 3 2 1 0
# bind each process to exaclty one CPU using all/odd/even keyword
cpu-map auto:all 0-63
cpu-map auto:even 0-31
cpu-map auto:odd 32-63
# invalid cpu-map because process and CPU sets have different sizes.
cpu-map auto:1-4 0 # invalid
cpu-map auto:1 0-3 # invalid
Now, this function returns a status code to indicate a success (0) or a failure
(1) and the error message in set in <err> parameter. And the result of the parsing
is set in <proc> parameter.
Now, you can define processes concerned by a cpu-map line using a range. For
instance, the following line binds the first 32 processes on CPUs 0 to 3:
cpu-map 1-32 0-3
The documentation specifies that you can have several "process" options to
define several ranges on "bind" lines (or "stats socket" lines). It is uncommon,
but it should be possible. So the bind_proc bitmask in bind_conf structure must
not be overwritten at each new "process" option parsed.
This bug also exists in 1.7, 1.6 and 1.5. So it may be backported. But no one
seems to have noticed it, so it was probably never hitted.
The cache exhibited a but in process_stream() where upon abort it is
possible to switch the stream-int's state to SI_ST_CLO without calling
si_release_endpoint(), resulting in a possibly missing ->release() for
the applet.
It should affect all other applets as well (eg: lua, spoe, peers) and
should carefully be backported to stable branches after some observation
period.
BoringSSL early data differ from OpenSSL 1.1.1 implementation. When early
handshake is done, SSL_in_early_data report if SSL_read will be done on early
data. CO_FL_EARLY_SSL_HS and CO_FL_EARLY_DATA can be adjust accordingly.
HTTP/2 mandates the support of 16384 bytes frames by default, so we need
a large enough buffer to process them. Till now if tune.bufsize was too
small, H2 connections were simply rejected during their establishment,
making it quite hard to troubleshoot the issue.
Now we detect when HTTP/2 is enabled on an HTTP frontend and emit an
error if tune.bufsize is not large enough, with the appropriate
recommendation.
At the moment, the "client" timeout is used on an HTTP/2 connection once
it's idle with no active stream. With this patch, this timeout is replaced
by client-fin once a GOAWAY frame is sent. This closely matches what is
done on HTTP/1 since the principle is the same, as it indicates a willing
ness to quickly close a connection on which we don't expect to see anything
anymore.
As reported by Lukas, it causes more harm than good, for example on
prompt for authentication. Now we have an "http-request reject" rule
to use instead of "http-request deny" if we absolutely want to close
the connection.
Apparently the h2c client has trouble reading the RST_STREAM frame after
a GOAWAY was sent, so it's likely that other clients may face the same
difficulty. Curl and Firefox don't care about this ordering, so let's
send it first.
This one acts similarly to its tcp-request counterpart. It immediately
closes the request without emitting any response. It can be suitable in
certain DoS conditions, as well as to close an HTTP/2 connection.
The cache was relying on the txn->uri for creating its key, which was a
big problem when there was no log activated.
This patch does a sha1 of the host + uri, and stores it in the txn.
When a object is stored, the eb32node uses the first 32 bits of the hash
as a key, and the whole hash is stored in the cache entry.
During a lookup, the truncated hash is used, and when it matches an
entry we check the real sha1.
In case any stream was waiting for the handshake after receiving early data,
we have to wake all of them. Do so by making the mux responsible for
removing the CO_FL_EARLY_DATA flag after all of them are woken up, instead
of doing it in si_cs_wake_cb(), which would then only work for the first one.
This makes wait_for_handshake work with HTTP/2.
It can happen that we want to read early data, write some, and then continue
reading them.
To do so, we can't reuse tmp_early_data to store the amount of data sent,
so introduce a new member.
If we read early data, then ssl_sock_to_buf() is now the only responsible
for getting back to the handshake, to make sure we don't miss any early data.
There is a small unprotected window for a task between the wait queue
and the run queue where a task could be woken up and destroyed at the
same time. What typically happens is that a timeout is reached at the
same time an I/O completes and wakes it up, and the I/O terminates the
task, causing a use after free in wake_expired_tasks() possibly causing
a crash and/or memory corruption :
thread 1 thread 2
(wake_expired_tasks) (stream_int_notify)
HA_SPIN_UNLOCK(TASK_WQ_LOCK, &wq_lock);
task_wakeup(task, TASK_WOKEN_IO);
...
process_stream()
stream_free()
task_free()
pool_free(task)
task_wakeup(task, TASK_WOKEN_TIMER);
This case is reasonably easy to reproduce with a config using very short
server timeouts (100ms) and client timeouts (10ms), while injecting on
httpterm requesting medium sized objects (5kB) over SSL. All this is
easier done with more threads than allocated CPUs so that pauses can
happen anywhere and last long enough for process_stream() to kill the
task.
This patch inverts the lock and the wakeup(), but requires some changes
in process_runnable_tasks() to ensure we never try to grab the WQ lock
while having the RQ lock held. This means we have to release the RQ lock
before calling task_queue(), so we can't hold the RQ lock during the
loop and must take and drop it.
It seems that a different approach with the scope-aware trees could be
easier, but it would possibly not cover situations where a task is
allowed to run on multiple threads. The current solution covers it and
doesn't seem to have any measurable performance impact.
When a stream is aborted on timeout or any reason initiated by the stream,
and this stream was subscribed to the send list, we forgot to detach it
when freeing it, resulting in a dead node remaining present in the send
list with all usual funny consequences (memory corruption, crashes, etc).
Let's simply unconditionally delete the stream.
When the stats code was moved to an applet, it wasn't completely
cleaned of its usage of the HTTP transaction and it used to store
the HTTP status in txn->status and to set the HTTP request date to
<now> from within the applet. This is totally wrong because the
applet is seen as a server from the HTTP engine, which parses its
response, so the http_txn must not be touched there.
This was made visible by the cache which would always exhibit a
negative TR log, indicating that nowhere in the code we took care of
setting s->logs.tv_request while the code above used to continue to
hide this. Another side effect of this issue is that under load, if
the stats applet call risks to be delayed, the reported t_queue can
appear negative by being below tv_request-tv_accept.
This patch removes the assignment of tv_request and txn->status from
the applet code and instead sets the tv_request if still unset when
connecting to the applet. This ensures that all applets report correct
request timers now.
During high loads it becomes visible that the time drifts between threads,
sometimes showing tens of seconds after several minutes. The root cause is
the per-thread correction which is performed based on a local offset and
local time. But we can't use a unique global time either as we need the
thread-local time to be stable between two poll() calls.
This commit takes a stab at this problem by proceeding this way :
- a global "global_now" date is monotonous and common between all threads.
- each thread has its own local <now> which is resynced with <global_now>
on each invocation of tv_update_date()
- each thread detects its own drift based on its poll() timeout and its
local <now>, and recalculates its adjusted local time
- each thread then ensures its new local time is no older than the current
global time, otherwise it readjusts its local time to match this one
- finally threads do atomically update the global time to match its own
local one
This guarantees a monotonous global time and a monotonous+stable local time.
It is still possible by definition for two threads to report a minor time
variation on subsequent events but that variation will only be caused by
the moment they watched the time and are very small. When a common global
time is needed between all threads, global_now could be used as a reference
(with care). The wallclock time used in logs is still <date> anyway.
With threads, it became mandatory to implement a thread-local time with
its own correction. However, it was noticed that during high thread
contention, the time correction could occasionally be wrong, reporting
huge negative or positive timers in logs. This was caused by the
conversion between struct timeval and a single 64-bit offset, due to
an erroneous shift and due to a loss of sign during the conversion.
Given that time_t is not always signed, and that timeval is not really
needed here, better avoid playing dangerous games with these operations
and use a single 64-bit offset representing a signed 32-bit offset, for
the seconds part and an unsigned offset for the microsecond part.
It still supports atomic updates and doesn't cause issues anymore.
This code has been used successfully a few times in the past to detect
that a pool was used after being freed. Its main goal is to allocate a
full page for each object so that they are always released individually
and unmapped from memory. This way if any part of the code reference the
object after is was freed and before it is reallocated, a segv occurs at
the exact offending location. It does a few extra things such as writing
to the memory area before freeing to detect double-frees and free of
read-only areas, and placing the data at the end of the page instead of
the beginning so that out of bounds accesses are easier to spot. The
amount of memory used with this is huge (about 10 times the regular
usage) but it can be useful sometimes.
If we can't write early data, for some reason, don't give up on reading them,
they may still be early data to be read, and if we don't do so, openssl
internal states might be inconsistent, and the handshake will fail.
The current code only tries to do the handshake in case we can't send early
data if we're acting as a client, which is wrong, it has to be done on the
server side too, or we end up in an infinite loop.
While using mmap() to allocate pools for debugging purposes, kill -USR1 caused
libc aborts in deinit() on two calls to free() on proxies' tasks and the global
listener task. The issue comes from the fact that we're using free() to release
a task instead of task_free(), so the task was allocated from a pool and released
using a different method.
This bug has been there since at least 1.5, so a backport is desirable to all
maintained versions.
The cache was trying to remove objects from the tree while they were
already removed from it. We set the key to 0 as a check for not trying
to remove the object from the tree when we are still using the object.
The cli command "show cache" displays the status of the cache, the first
displayed line is the shctx informations with how much blocks available
blocks it contains (blocks are 1k by default).
The next lines are the objects stored in the cache tree, the pointer,
the size of the object and how much blocks it uses, a refcount for the
number of users of the object, and the remaining expiration time (which
can be negative if expired)
Example:
$ echo "show cache" | socat - /run/haproxy.sock
0x7fa54e9ab03a: foobar (shctx:0x7fa54e9ab000, available blocks:3921)
0x7fa54ed65b8c (size: 43190 (43 blocks), refcount:2, expire: 2)
0x7fa54ecf1b4c (size: 45238 (45 blocks), refcount:0, expire: 2)
0x7fa54ed70cec (size: 61622 (61 blocks), refcount:0, expire: 2)
0x7fa54ecdbcac (size: 42166 (42 blocks), refcount:1, expire: 2)
0x7fa54ec9736c (size: 44214 (44 blocks), refcount:2, expire: 2)
0x7fa54eca28ec (size: 46262 (46 blocks), refcount:2, expire: -2)
Allows bigger objects to be cached in the shctx, the first
implementation was only storing small ssl session, but we want to store
bigger HTTP response.