Instead of exporting a number of pools and having to manually delete
them in deinit() or to have dedicated destructors to remove them, let's
simply kill all pools on deinit().
For this a new function pool_destroy_all() was introduced. As its name
implies, it destroys and frees all pools (provided they don't have any
user anymore of course).
This allowed to remove 4 implicit destructors, 2 explicit ones, and 11
individual calls to pool_destroy(). In addition it properly removes
the mux_pt_ctx pool which was not cleared on exit (no backport needed
here since it's 1.9 only). The sig_handler pool doesn't need to be
exported anymore and became static now.
This patch replaces a number of __decl_hathread() followed by HA_SPIN_INIT
or HA_RWLOCK_INIT by the new __decl_spinlock() or __decl_rwlock() which
automatically registers the lock for initialization in during the STG_LOCK
init stage. A few static modifiers were lost in the process, but since they
were not essential at all it was not worth extending the API to provide such
a variant.
Now the buffers only contain the header and a pointer to the storage
area which can be anywhere. This will significantly simplify buffer
swapping and will make it possible to map chunks on buffers as well.
The buf_empty variable was removed, as now it's enough to have size==0
and area==NULL to designate the empty buffer (thus a non-allocated head
is the empty buffer by default). buf_wanted for now is indicated by
size==0 and area==(void *)1.
The channels and the checks now embed the buffer's head, and the only
pointer is to the storage area. This slightly increases the unallocated
buffer size (3 extra ints for the empty buffer) but considerably
simplifies dynamic buffer management. It will also later permit to
detach unused checks.
The way the struct buffer is arranged has proven quite efficient on a
number of tests, which makes sense given that size is always accessed
and often first, followed by the othe ones.
This one is more generic and designed to work on a random block. It
may later get a b_rep_ist() variant since many strings are already
available as (ptr,len).
There was no point keeping that function in the buffer part since it's
exclusively used by HTTP at the channel level, since it also automatically
appends the CRLF. This further cleans up the buffer code.
This is intentionally the minimal and safest set of changes, some cleanups
area still required. These changes are quite tricky and cannot be
independantly tested, so it's important to keep this patch as bisectable
as possible.
buf_empty and buf_wanted were changed and are now exactly similar since
there's no <p> member in the structure anymore. Given that no test is
ever made in the code to check that buf == &buf_wanted, it may be possible
that we don't need to have two anymore, unless some buf_empty tests have
precedence. This will have to be investigated.
A significant part of this commit affects the HTTP compression code,
which used to deeply manipulate the input and output buffers without
any reasonable solution for a better abstraction. For this reason, if
any regression is met and designates this patch as the culprit, it is
important to run tests which specifically involve compression or which
definitely don't use it in order to spot the issue.
Cc: Olivier Houchard <ohouchard@haproxy.com>
For the same consistency reasons, let's use b_empty() at the few places
where an empty buffer is expected, or c_empty() if it's done on a channel.
Some of these places were there to realign the buffer so
{b,c}_realign_if_empty() was used instead.
Where relevant, the channel version is used instead. The buffer version
was ported to be more generic and now takes a swap buffer and the output
byte count to know where to set the alignment point. The H2 mux still
uses buffer_slow_realign() with buf->o but it will change later.
Passing unsigned ints everywhere is painful, and will cause some headache
later when we'll want to integrate better with struct ist which already
uses size_t. Let's switch buffers to use size_t instead.
During the migration to the second version of the pools, the new
functions and pool pointers were all called "pool_something2()" and
"pool2_something". Now there's no more pool v1 code and it's a real
pain to still have to deal with this. Let's clean this up now by
removing the "2" everywhere, and by renaming the pool heads
"pool_head_something".
This macro should be used to declare variables or struct members depending on
the USE_THREAD compile option. It avoids the encapsulation of such declarations
between #ifdef/#endif. It is used to declare all lock variables.
There was a flaw in the way the threads was created. the main one was just used
to create all the others and just wait to exit. Now, it is used to run a poll
loop. So we only create nbthread-1 threads.
This also fixes a bug about the compression filter when there is only 1 thread
(nbthread == 1 or no threads support). The bug was in the way thread-local
resources was initialized. per-thread init/deinit callbacks were never called
for the main process. So, with nthread set to 1, some buffers remained
uninitialized.
swap_buffer is a global variable only used by buffer_slow_realign. So it has
been moved from global.h to buffer.c and it is allocated by init_buffer
function. deinit_buffer function has been added to release it. It is also used
to destroy the buffers' pool.
When an entity tries to get a buffer, if it cannot be allocted, for example
because the number of buffers which may be allocated per process is limited,
this entity is added in a list (called <buffer_wq>) and wait for an available
buffer.
Historically, the <buffer_wq> list was logically attached to streams because it
were the only entities likely to be added in it. Now, applets can also be
waiting for a free buffer. And with filters, we could imagine to have more other
entities waiting for a buffer. So it make sense to have a generic list.
Anyway, with the current design there is a bug. When an applet failed to get a
buffer, it will wait. But we add the stream attached to the applet in
<buffer_wq>, instead of the applet itself. So when a buffer is available, we
wake up the stream and not the waiting applet. So, it is possible to have
waiting applets and never awakened.
So, now, <buffer_wq> is independant from streams. And we really add the waiting
entity in <buffer_wq>. To be generic, the entity is responsible to define the
callback used to awaken it.
In addition, applets will still request an input buffer when they become
active. But they will not be sleeped anymore if no buffer are available. So this
is the responsibility to the applet I/O handler to check if this buffer is
allocated or not. This way, an applet can decide if this buffer is required or
not and can do additional processing if not.
[wt: backport to 1.7 and 1.6]
When users request 16384 bytes for a buffer, they get 16392 after
rounding up. This is problematic for SSL as it systematically
causes a small 8-bytes message to be appended after the first 16kB
message and costs about 15% of performance.
Let's add MEM_F_EXACT to use exactly the size we need. This requires
previous patch (MEDIUM: pools: add a new flag to avoid rounding pool
size up).
This issue was introduced in 1.6 and causes trouble there, so this
fix must be backported.
This is issue was reported by Gary Barrueto and diagnosed by Cyril Bont.
The function buffer_slow_realign() was initially designed for requests
only and did not consider pending outgoing data. This causes a problem
when called on responses where data remain in the buffer, which may
happen with pipelined requests when the client is slow to read data.
The user-visible effect is that if less than <maxrewrite> bytes are
present in the buffer from a previous response and these bytes cross
the <maxrewrite> boundary close to the end of the buffer, then a new
response will cause a realign and will destroy these pending data and
move the pointer to what's believed to contain pending output data.
Thus the client receives the crap that lies in the buffer instead of
the original output bytes.
This new implementation now properly realigns everything including the
outgoing data which are moved to the end of the buffer while the input
data are moved to the beginning.
This implementation still uses a buffer-to-buffer copy which is not
optimal in terms of performance and which should be replaced by a
buffer switch later.
Prior to this patch, the following script would return different hashes
on each round when run from a 100 Mbps-connected machine :
i=0
while usleep 100000; do
echo round $((i++))
set -- $(nc6 0 8001 < 1kreq5k.txt | grep -v '^[0-9A-Z]' | md5sum)
if [ "$1" != "3861afbb6566cd48740ce01edc426020" ]; then echo $1;break;fi
done
The file contains 1000 times this request with "Connection: close" on the
last one :
GET /?s=5k&R=1 HTTP/1.1
The config is very simple :
global
tune.bufsize 16384
tune.maxrewrite 8192
defaults
mode http
timeout client 10s
timeout server 5s
timeout connect 3s
listen px
bind :8001
option http-server-close
server s1 127.0.0.1:8000
And httpterm-1.7.2 is used as the server on port 8000.
After the fix, 1 million requests were sent and all returned the same
contents.
Many thanks to Charlie Smurthwaite of atechmedia.com for his precious
help on this issue, which would not have been diagnosed without his
very detailed traces and numerous tests.
The patch must be backported to 1.5 which is where the bug was introduced.
Space is not avalaible only if the end of the data inserted
is strictly greater than the end of buffer. If these two value
are equal, the space is avamaible.
This setting is used to limit memory usage without causing the alloc
failures caused by "-m". Unexpectedly, tests have shown a performance
boost of up to about 18% on HTTP traffic when limiting the number of
buffers to about 10% of the amount of concurrent connections.
tune.buffers.limit <number>
Sets a hard limit on the number of buffers which may be allocated per process.
The default value is zero which means unlimited. The minimum non-zero value
will always be greater than "tune.buffers.reserve" and should ideally always
be about twice as large. Forcing this value can be particularly useful to
limit the amount of memory a process may take, while retaining a sane
behaviour. When this limit is reached, sessions which need a buffer wait for
another one to be released by another session. Since buffers are dynamically
allocated and released, the waiting time is very short and not perceptible
provided that limits remain reasonable. In fact sometimes reducing the limit
may even increase performance by increasing the CPU cache's efficiency. Tests
have shown good results on average HTTP traffic with a limit to 1/10 of the
expected global maxconn setting, which also significantly reduces memory
usage. The memory savings come from the fact that a number of connections
will not allocate 2*tune.bufsize. It is best not to touch this value unless
advised to do so by an haproxy core developer.
We've already experimented with three wake up algorithms when releasing
buffers : the first naive one used to wake up far too many sessions,
causing many of them not to get any buffer. The second approach which
was still in use prior to this patch consisted in waking up either 1
or 2 sessions depending on the number of FDs we had released. And this
was still inaccurate. The third one tried to cover the accuracy issues
of the second and took into consideration the number of FDs the sessions
would be willing to use, but most of the time we ended up waking up too
many of them for nothing, or deadlocking by lack of buffers.
This patch completely removes the need to allocate two buffers at once.
Instead it splits allocations into critical and non-critical ones and
implements a reserve in the pool for this. The deadlock situation happens
when all buffers are be allocated for requests pending in a maxconn-limited
server queue, because then there's no more way to allocate buffers for
responses, and these responses are critical to release the servers's
connection in order to release the pending requests. In fact maxconn on
a server creates a dependence between sessions and particularly between
oldest session's responses and latest session's requests. Thus, it is
mandatory to get a free buffer for a response in order to release a
server connection which will permit to release a request buffer.
Since we definitely have non-symmetrical buffers, we need to implement
this logic in the buffer allocation mechanism. What this commit does is
implement a reserve of buffers which can only be allocated for responses
and that will never be allocated for requests. This is made possible by
the requester indicating how much margin it wants to leave after the
allocation succeeds. Thus it is a cooperative allocation mechanism : the
requester (process_session() in general) prefers not to get a buffer in
order to respect other's need for response buffers. The session management
code always knows if a buffer will be used for requests or responses, so
that is not difficult :
- either there's an applet on the initiator side and we really need
the request buffer (since currently the applet is called in the
context of the session)
- or we have a connection and we really need the response buffer (in
order to support building and sending an error message back)
This reserve ensures that we don't take all allocatable buffers for
requests waiting in a queue. The downside is that all the extra buffers
are really allocated to ensure they can be allocated. But with small
values it is not an issue.
With this change, we don't observe any more deadlocks even when running
with maxconn 1 on a server under severely constrained memory conditions.
The code becomes a bit tricky, it relies on the scheduler's run queue to
estimate how many sessions are already expected to run so that it doesn't
wake up everyone with too few resources. A better solution would probably
consist in having two queues, one for urgent requests and one for normal
requests. A failed allocation for a session dealing with an error, a
connection event, or the need for a response (or request when there's an
applet on the left) would go to the urgent request queue, while other
requests would go to the other queue. Urgent requests would be served
from 1 entry in the pool, while the regular ones would be served only
according to the reserve. Despite not yet having this, it works
remarkably well.
This mechanism is quite efficient, we don't perform too many wake up calls
anymore. For 1 million sessions elapsed during massive memory contention,
we observe about 4.5M calls to process_session() compared to 4.0M without
memory constraints. Previously we used to observe up to 16M calls, which
rougly means 12M failures.
During a test run under high memory constraints (limit enforced to 27 MB
instead of the 58 MB normally needed), performance used to drop by 53% prior
to this patch. Now with this patch instead it *increases* by about 1.5%.
The best effect of this change is that by limiting the memory usage to about
2/3 to 3/4 of what is needed by default, it's possible to increase performance
by up to about 18% mainly due to the fact that pools are reused more often
and remain hot in the CPU cache (observed on regular HTTP traffic with 20k
objects, buffers.limit = maxconn/10, buffers.reserve = limit/2).
Below is an example of scenario which used to cause a deadlock previously :
- connection is received
- two buffers are allocated in process_session() then released
- one is allocated when receiving an HTTP request
- the second buffer is allocated then released in process_session()
for request parsing then connection establishment.
- poll() says we can send, so the request buffer is sent and released
- process session gets notified that the connection is now established
and allocates two buffers then releases them
- all other sessions do the same till one cannot get the request buffer
without hitting the margin
- and now the server responds. stream_interface allocates the response
buffer and manages to get it since it's higher priority being for a
response.
- but process_session() cannot allocate the request buffer anymore
=> We could end up with all buffers used by responses so that none may
be allocated for a request in process_session().
When the applet processing leaves the session context, the test will have
to be changed so that we always allocate a response buffer regardless of
the left side (eg: H2->H1 gateway). A final improvement would consists in
being able to only retry the failed I/O operation without waking up a
task, but to date all experiments to achieve this have proven not to be
reliable enough.
Doing so ensures that even when no memory is available, we leave the
channel in a sane condition. There's a special case in proto_http.c
regarding the compression, we simply pre-allocate the tmpbuf to point
to the dummy buffer. Not reusing &buf_empty for this allows the rest
of the code to differenciate an empty buffer that's not used from an
empty buffer that results from a failed allocation which has the same
semantics as a buffer full.
Channels are now created with a valid pointer to a buffer before the
buffer is allocated. This buffer is a global one called "buf_empty" and
of size zero. Thus it prevents any activity from being performed on
the buffer and still ensures that chn->buf may always be dereferenced.
b_free() also resets the buffer to &buf_empty, and was split into
b_drop() which does not reset the buffer.
HAProxy will crash with the following configuration:
global
...
tune.bufsize 1024
tune.maxrewrite 0
frontend xxx
...
backend yyy
...
cookie cookie insert maxidle 300s
If client sends a request of which object size is more than tune.bufsize (1024
bytes), HAProxy will crash.
After doing some debugging, the crash was caused by http_header_add_tail2() ->
buffer_insert_line2() while inserting cookie at the end of response header.
Part codes of buffer_insert_line2() are as below:
int buffer_insert_line2(struct buffer *b, char *pos, const char *str, int len)
{
int delta;
delta = len + 2;
if (bi_end(b) + delta >= b->data + b->size)
return 0; /* no space left */
/* first, protect the end of the buffer */
memmove(pos + delta, pos, bi_end(b) - pos);
...
}
Since tune.maxrewrite is 0, HAProxy can receive 1024 bytes once which is equals
to full buffer size. Under such condition, the buffer is full and bi_end(b)
will be wrapped to the start of buffer which pointed to b->data. As a result,
though there is no space left in buffer, the check condition
if (bi_end(b) + delta >= b->data + b->size)
will be true, then memmove() is called, and (pos + delta) will exceed the end
of buffer (b->data + b->size), HAProxy crashes
Just take buffer_replace2() as a reference, the other check when input data in
a buffer is wrapped should be also added into buffer_insert_line2().
This fix must be backported to 1.5.
Signed-off-by: Godbach <nylzhaowei@gmail.com>
Commit c08057c does the align job for buffer_dump(), but it has not fixed the
issue that less than 8 characters are left in the last line as below:
Dumping contents from byte 0 to byte 119
0 1 2 3 4 5 6 7 8 9 a b c d e f
0000: 47 45 54 20 2f 69 6e 64 - 65 78 2e 68 74 6d 20 48 GET /index.htm H
0010: 54 54 50 2f 31 2e 30 0d - 0a 55 73 65 72 2d 41 67 TTP/1.0..User-Ag
...
0060: 6e 65 63 74 69 6f 6e 3a - 20 4b 65 65 70 2d 41 6c nection: Keep-Al
0070: 69 76 65 0d 0a 0d 0a ive....
The last line of the hex column is still overlapped by the text column. Since
there will be additional "- " for the output line which has no less than 8
characters, two additional spaces should be present when there is less than 8
characters in order to do alignment. The result after being fixed is as below:
Dumping contents from byte 0 to byte 119
0 1 2 3 4 5 6 7 8 9 a b c d e f
0000: 47 45 54 20 2f 69 6e 64 - 65 78 2e 68 74 6d 20 48 GET /index.htm H
0010: 54 54 50 2f 31 2e 30 0d - 0a 55 73 65 72 2d 41 67 TTP/1.0..User-Ag
...
0060: 6e 65 63 74 69 6f 6e 3a - 20 4b 65 65 70 2d 41 6c nection: Keep-Al
0070: 69 76 65 0d 0a 0d 0a ive....
Signed-off-by: Godbach <nylzhaowei@gmail.com>
If the dumped length of buffer is not multiple of 16, the last output line can
be seen as below:
Dumping contents from byte 0 to byte 125
0 1 2 3 4 5 6 7 8 9 a b c d e f
0000: 47 45 54 20 2f 69 6e 64 - 65 78 2e 68 74 6d 20 48 GET /index.htm H
0010: 54 54 50 2f 31 2e 30 0d - 0a 55 73 65 72 2d 41 67 TTP/1.0..User-Ag
...
0060: 30 0d 0a 43 6f 6e 6e 65 - 63 74 69 6f 6e 3a 20 4b 0..Connection: K
0070: 65 65 70 2d 41 6c 69 76 - 65 0d 0a 0d 0a eep-Alive....
Yes, the hex column will be overlapped by the text column. Both the hex and
text column should be aligned at their own area as below:
Dumping contents from byte 0 to byte 125
0 1 2 3 4 5 6 7 8 9 a b c d e f
0000: 47 45 54 20 2f 69 6e 64 - 65 78 2e 68 74 6d 20 48 GET /index.htm H
0010: 54 54 50 2f 31 2e 30 0d - 0a 55 73 65 72 2d 41 67 TTP/1.0..User-Ag
...
0060: 30 0d 0a 43 6f 6e 6e 65 - 63 74 69 6f 6e 3a 20 4b 0..Connection: K
0070: 65 65 70 2d 41 6c 69 76 - 65 0d 0a 0d 0a eep-Alive....
Signed-off-by: Godbach <nylzhaowei@gmail.com>
With this commit, we now separate the channel from the buffer. This will
allow us to replace buffers on the fly without touching the channel. Since
nobody is supposed to keep a reference to a buffer anymore, doing so is not
a problem and will also permit some copy-less data manipulation.
Interestingly, these changes have shown a 2% performance increase on some
workloads, probably due to a better cache placement of data.
These functions do not depend on the channel flags anymore thus they're
much better suited to be used on plain buffers. Move them from channel
to buffer.