Currently we have per-thread arrays of trees and counts, but these
ones unfortunately share cache lines and are accessed very often. This
patch moves the task-specific stuff into a structure taking a multiple
of a cache line, and has one such per thread. Just doing this has
reduced the cache miss ratio from 19.2% to 18.7% and increased the
12-thread test performance by 3%.
It starts to become visible that we really need a process-wide per-thread
storage area that would cover more than just these parts of the tasks.
The code was arranged so that it's easy to move the pieces elsewhere if
needed.
Now we still have a main contention point with the timers in the main
wait queue, but the vast majority of the tasks are pinned to a single
thread. This patch creates a per-thread wait queue and queues a task
to the local wait queue without any locking if the task is bound to a
single thread (the current one) otherwise to the shared queue using
locking. This significantly reduces contention on the wait queue. A
test with 12 threads showed 11 ms spent in the WQ lock compared to
4.7 seconds in the same test without this change. The cache miss ratio
decreased from 19.7% to 19.2% on the 12-thread test, and its performance
increased by 1.5%.
Another indirect benefit is that the average queue size is divided
by the number of threads, which roughly removes log(nbthreads) levels
in the tree and further speeds up lookups.
The run queue is designed to perform a single tree lookup and to
use multiple passes to eb32sc_next(). The scheduler rework took a
conservative approach first but this is not needed anymore and it
increases the processing cost of process_runnable_tasks() and even
the time during which the RQ lock is held if the global queue is
heavily loaded. Let's simply move the initial lookup to the entry
of the loop like the previous scheduler used to do. This has reduced
by a factor of 5.5 the number of calls to eb32sc_lookup_get() there.
Instead of checking if nbthreads == 1, just and thread_mask with
all_threads_mask to know if we're supposed to add the task to the local or
the global runqueue.
Depending on the optimization level, gcc may complain that wake_thread()
uses an invalid array index for poller_wr_pipe[] when called from
__task_wakeup(). Normally the condition to get there never happens,
but it's simpler to ifdef out this part of the code which is only
used to wake other threads up. No backport is needed, this was brought
by the recent introduction of the ability to wake a sleeping thread.
Add a new pipe, one per thread, so that we can write on it to wake a thread
sleeping in a poller, and use it to wake threads supposed to take care of a
task, if they are all sleeping.
As __task_wakeup() is responsible for increasing
rqueue_local[tid]/global_rqueue_size, make __task_unlink_rq responsible for
decreasing it, as process_runnable_tasks() isn't the only one that removes
tasks from runqueues.
We may remove the thread's bit in active_tasks_mask despite tasks for that
thread still being present in the global runqueue. To fix that, introduce
global_tasks_mask, and set the correspnding bits when we add a task to the
runqueue.
We need to decrement requeue_size when we remove a task form rqueue_local,
not when we remove if from the task list, or we'd also decrement it for any
tasklet, that was never in the rqueue in the first place.
Commit 09eeb76 ("BUG/MEDIUM: tasks: Don't forget to increase/decrease
tasks_run_queue.") addressed a count issue in the run queue and uncovered
another issue with the way the tasks are dequeued from the global run
queue. The number of tasks to pick is computed using an integral divide,
which results in up to nbthread-1 tasks never being run. The fix simply
consists in getting rid of the divide and checking the task count in the
loop.
No backport is needed, this is 1.9-specific.
We can't just set t to NULL if it's a tasklet, or we'd have a hard time
accessing to t->process, so just make sure we pass NULL as the first parameter
of t->process if it's a tasklet.
This should be a non-issue at this point, as tasklets aren't used yet.
To make sure we don't inadvertently insert task in the global runqueue,
while only the local runqueue is used without threads, make its definition
and usage conditional on USE_THREAD.
We're taking tasks from the global runqueue based on the number of tasks
the thread already have in its local runqueue, but now that we have a task
list, we also have to take that into account.
When the task list was introduced, we bogusly lost max_processed--, that means
we would execute as much tasks as present in the list, and we would never
set active_tasks_mask, so the thread would go to sleep even if more tasks were
to be executed.
1.9-dev only, no backport is needed.
Introduce tasklets, lightweight tasks. They have no notion of priority,
they are just run as soon as possible, and will probably be used for I/O
later.
For the moment they're used to replace the temporary thread-local list
that was used in the scheduler. The first part of the struct is common
with tasks so that tasks can be cast to tasklets and queued in this list.
Once a task is in the tasklet list, it has its leaf_p set to 0x1 so that
it cannot accidently be confused as not in the queue.
Pure tasklets are identifiable by their nice value of -32768 (which is
normally not possible).
A lot of tasks are run on one thread only, so instead of having them all
in the global runqueue, create a per-thread runqueue which doesn't require
any locking, and add all tasks belonging to only one thread to the
corresponding runqueue.
The global runqueue is still used for non-local tasks, and is visited
by each thread when checking its own runqueue. The nice parameter is
thus used both in the global runqueue and in the local ones. The rare
tasks that are bound to multiple threads will have their nice value
used twice (once for the global queue, once for the thread-local one).
In preparation for thread-specific runqueues, change the task API so that
the callback takes 3 arguments, the task itself, the context, and the state,
those were retrieved from the task before. This will allow these elements to
change atomically in the scheduler while the application uses the copied
value, and even to have NULL tasks later.
While running a task, we may try to delete and free a task that is about to
be run, because it's part of the local tasks list, or because rq_next points
to it.
So flag any task that is in the local tasks list to be deleted, instead of
run, by setting t->process to NULL, and re-make rq_next a global,
thread-local variable, that is modified if we attempt to delete that task.
Many thanks to PiBa-NL for reporting this and analysing the problem.
This should be backported to 1.8.
A number of counters have been added at special places helping better
understanding certain bug reports. These counters are maintained per
thread and are shown using "show activity" on the CLI. The "clear
counters" commands also reset these counters. The output is sent as a
single write(), which currently produces up to about 7 kB of data for
64 threads. If more counters are added, it may be necessary to write
into multiple buffers, or to reset the counters.
To backport to 1.8 to help collect more detailed bug reports.
We really don't want them to share the same cache line as they are
expected to be used in parallel. Adding a 64-byte alignment here shows
a performance increase of about 4.5% on task-intensive workloads with
2 to 4 threads.
Very often when debugging, the current task's pointer isn't easy to
recover (eg: from a core file). Let's keep a copy of it, it will
likely help, especially with threads.
During the migration to the second version of the pools, the new
functions and pool pointers were all called "pool_something2()" and
"pool2_something". Now there's no more pool v1 code and it's a real
pain to still have to deal with this. Let's clean this up now by
removing the "2" everywhere, and by renaming the pool heads
"pool_head_something".
There is a small unprotected window for a task between the wait queue
and the run queue where a task could be woken up and destroyed at the
same time. What typically happens is that a timeout is reached at the
same time an I/O completes and wakes it up, and the I/O terminates the
task, causing a use after free in wake_expired_tasks() possibly causing
a crash and/or memory corruption :
thread 1 thread 2
(wake_expired_tasks) (stream_int_notify)
HA_SPIN_UNLOCK(TASK_WQ_LOCK, &wq_lock);
task_wakeup(task, TASK_WOKEN_IO);
...
process_stream()
stream_free()
task_free()
pool_free(task)
task_wakeup(task, TASK_WOKEN_TIMER);
This case is reasonably easy to reproduce with a config using very short
server timeouts (100ms) and client timeouts (10ms), while injecting on
httpterm requesting medium sized objects (5kB) over SSL. All this is
easier done with more threads than allocated CPUs so that pauses can
happen anywhere and last long enough for process_stream() to kill the
task.
This patch inverts the lock and the wakeup(), but requires some changes
in process_runnable_tasks() to ensure we never try to grab the WQ lock
while having the RQ lock held. This means we have to release the RQ lock
before calling task_queue(), so we can't hold the RQ lock during the
loop and must take and drop it.
It seems that a different approach with the scope-aware trees could be
easier, but it would possibly not cover situations where a task is
allowed to run on multiple threads. The current solution covers it and
doesn't seem to have any measurable performance impact.
tasks_run_queue is the run queue size. It is a global variable. So it is
underoptimized because we may be lead to consider there are active tasks for a
thread while in fact all active tasks are assigned to the other threads. So, in
such cases, the polling loop will be evaluated many more times than necessary.
Instead, we now check if the thread id is set in the bitfield active_tasks_mask.
Another change has been made in process_runnable_tasks. Now, we always limit the
number of tasks processed to 200.
This is specific to threads, no backport is needed.
a bitfield has been added to know if there are runnable tasks for a thread. When
a task is woken up, the bits corresponding to its thread_mask are set. When all
tasks for a thread have been evaluated without any wakeup, the thread is removed
from active ones by unsetting its tid_bit from the bitfield.
This macro should be used to declare variables or struct members depending on
the USE_THREAD compile option. It avoids the encapsulation of such declarations
between #ifdef/#endif. It is used to declare all lock variables.
My recent change in commit ce4e0aa ("MEDIUM: task: change the construction
of the loop in process_runnable_tasks()") was bogus as it used to keep the
rq_next across an unlock/lock sequence, occasionally leading to crashes for
tasks that are eligible to any thread. We must use the lookup call for each
new batch instead. The problem is easily triggered with such a configuration :
global
nbthread 4
listen check
mode http
bind 0.0.0.0:8080
redirect location /
option httpchk GET /
server s1 127.0.0.1:8080 check inter 1
server s2 127.0.0.1:8080 check inter 1
Thanks to Olivier for diagnosing this one. No backport is needed.
The scheduler is complex and uses local queues to amortize the cost of
locks. But all this comes with a cost that is quite observable with
single-thread workloads.
The purpose of this patch is to reimplement the much simpler scheduler
for the case where threads are not used. The code is very small and
simple. It doesn't impact the multi-threaded performance at all, and
provides a nice 10% performance increase in single-thread by reaching
606kreq/s on the tests that showed 550kreq/s before.
process_runnable_tasks() needs to requeue or wake up tasks after
processing them in batches. By only refilling the existing ones, we
avoid revisiting all the queue. The performance gain is measurable
starting with two threads, where the request rate climbs to 657k/s
compared to 644k.
This patch slightly rearranges the loop to pack the locked code a little
bit, and to try to concentrate accesses to the tree together to benefit
more from the cache.
It also fixes how the loop handles the right margin : now that is guaranteed
that the retrieved nodes are filtered to only match the current thread, we
don't need to rewind every 16 entries. Instead we can rewind each time we
reach the right margin again.
With this change, we now achieve the following performance for 10 H2 conns
each containing 100 streams :
1 thread : 550kreq/s
2 thread : 644kreq/s
3 thread : 598kreq/s
This function is sensitive, let's make it shorter by factoring out the
unlock and leave code. This reduced the function's size by a few tens
of bytes and increased the overall performance by about 1%.
Currently the task scheduler suffers from an O(n) lookup when
skipping tasks that are not for the current thread. The reason
is that eb32_lookup_ge() has no information about the current
thread so it always revisits many tasks for other threads before
finding its own tasks.
This is particularly visible with HTTP/2 since the number of
concurrent streams created at once causes long series of tasks
for the same stream in the scheduler. With only 10 connections
and 100 streams each, by running on two threads, the performance
drops from 640kreq/s to 11.2kreq/s! Lookup metrics show that for
only 200000 task lookups, 430 million skips had to be performed,
which means that on average, each lookup leads to 2150 nodes to
be visited.
This commit backports the principle of scope lookups for ebtrees
from the ebtree_v7 development tree. The idea is that each node
contains a mask indicating the union of the scopes for the nodes
below it, which is fed during insertion, and used during lookups.
Then during lookups, branches that do not contain any leaf matching
the requested scope are simply ignored. This perfectly matches a
thread mask, allowing a thread to only extract the tasks it cares
about from the run queue, and to always find them in O(log(n))
instead of O(n). Thus the scheduler uses tid_bit and
task->thread_mask as the ebtree scope here.
Doing this has recovered most of the performance, as can be seen on
the test below with two threads, 10 connections, 100 streams each,
and 1 million requests total :
Before After Gain
test duration : 89.6s 4.73s x19
HTTP requests/s (DEBUG) : 11200 211300 x19
HTTP requests/s (PROD) : 15900 447000 x28
spin_lock time : 85.2s 0.46s /185
time per lookup : 13us 40ns /325
Even when going to 6 threads (on 3 hyperthreaded CPU cores), the
performance stays around 284000 req/s, showing that the contention
is much lower.
A test showed that there's no benefit in using this for the wait queue
though.
It was a leftover from the last cleaning session; this mask applies
to threads and calling it process_mask is a bit confusing. It's the
same in fd, task and applets.
2 global locks have been added to protect, respectively, the run queue and the
wait queue. And a process mask has been added on each task. Like for FDs, this
mask is used to know which threads are allowed to process a task.
For many tasks, all threads are granted. And this must be your first intension
when you create a new task, else you have a good reason to make a task sticky on
some threads. This is then the responsibility to the process callback to lock
what have to be locked in the task context.
Nevertheless, all tasks linked to a session must be sticky on the thread
creating the session. It is important that I/O handlers processing session FDs
and these tasks run on the same thread to avoid conflicts.
These notification management function and structs are generic and
it will be better to move in common parts.
The notification management functions and structs have names
containing some "lua" references because it was written for
the Lua. This patch removes also these references.
In order to authorize call of task_wakeup on running task:
- from within the task handler itself.
- in futur, from another thread.
The lookups on runqueue and waitqueue are re-worked
to prepare multithread stuff.
If task_wakeup is called on a running task, the woken
message flags are savec in the 'pending_state' attribute of
the state. The real wakeup is postponed at the end of the handler
process and the woken messages are copied from pending_state
to the state attribute of the task.
It's important to note that this change will cause a very minor
(though measurable) performance loss but it is necessary to make
forward progress on a multi-threaded scheduler. Most users won't
ever notice.
<run_queue> is used to track the number of task in the run queue and
<run_queue_cur> is a copy used for the reporting purpose. These counters has
been renamed, respectively, <tasks_run_queue> and <tasks_run_queue_cur>. So the
naming is consistent between tasks and applets.
[wt: needed for next fixes, backport to 1.7 and 1.6]
With HTTP/2, we'll have to support multiplexed streams. A stream is in
fact the largest part of what we currently call a session, it has buffers,
logs, etc.
In order to catch any error, this commit removes any reference to the
struct session and tries to rename most "session" occurrences in function
names to "stream" and "sess" to "strm" when that's related to a session.
The files stream.{c,h} were added and session.{c,h} removed.
The session will be reintroduced later and a few parts of the stream
will progressively be moved overthere. It will more or less contain
only what we need in an embryonic session.
Sample fetch functions and converters will have to change a bit so
that they'll use an L5 (session) instead of what's currently called
"L4" which is in fact L6 for now.
Once all changes are completed, we should see approximately this :
L7 - http_txn
L6 - stream
L5 - session
L4 - connection | applet
There will be at most one http_txn per stream, and a same session will
possibly be referenced by multiple streams. A connection will point to
a session and to a stream. The session will hold all the information
we need to keep even when we don't yet have a stream.
Some more cleanup is needed because some code was already far from
being clean. The server queue management still refers to sessions at
many places while comments talk about connections. This will have to
be cleaned up once we have a server-side connection pool manager.
Stream flags "SN_*" still need to be renamed, it doesn't seem like
any of them will need to move to the session.
Commit 501260b ("MEDIUM: task: always ensure that the run queue is
consistent") introduced a skew in the scheduler : if a negatively niced
task is woken up, it can be inserted prior to the current index and will
be skipped as long as there is some activity with less prioritary tasks.
The immediate effect is that it's not possible to get access to the stats
under full load until the load goes down.
This is because the rq_next constantly evolves within more recent
positions. The fix is simple, __task_wakeup() must empty rq_next. The
sad thing is that this issue was fixed during development and missed
during the commit. No backport is needed, this is purely 1.6 stuff.
Actually, HAProxy uses the function "process_runnable_tasks" and
"wake_expired_tasks" to get the next task which can expires.
If a task is added with "task_schedule" or other method during
the execution of an other task, the expiration of this new task
is not taken into account, and the execution of this task can be
too late.
Actualy, HAProxy seems to be no sensitive to this bug.
This fix moves the call to process_runnable_tasks() before the timeout
calculation and ensures that all wakeups are processed together. Only
wake_expired_tasks() needs to return a timeout now.
As found by Thierry Fournier, if a task manages to kill another one and
if this other task is the next one in the run queue, we can do whatever
including crashing, because the scheduler restarts from the saved next
task. For now, there is no such concept of a task killing another one,
but with Lua it will come.
A solution consists in always performing the lookup of the first task in
the scheduler's loop, but it's expensive and costs around 2% of the
performance.
Another solution consists in keeping a global next run queue node and
ensuring that when this task gets removed, it updates this pointer to
the next one. This allows to simplify the code a bit and in the end to
slightly increase the performance (0.3-0.5%). The mechanism might still
be usable if we later migrate to a multi-threaded scheduler.
All files referencing the previous ebtree code were changed to point
to the new one in the ebtree directory. A makefile variable (EBTREE_DIR)
is also available to use files from another directory.
The ability to build the libebtree library temporarily remains disabled
because it can have an impact on some existing toolchains and does not
appear worth it in the medium term if we add support for multi-criteria
stickiness for instance.
I noticed that in __eb32_insert , if the tree is empty
(root->b[EB_LEFT] == NULL) , the node.bit is not defined.
However in __task_queue there are checks:
- if (last_timer->node.bit < 0)
- if (task->wq.node.bit < last_timer->node.bit)
which might rely upon an undefined value.
This is how I see it:
1. We insert eb32_node in an empty wait queue tree for a task (called by
process_runnable_tasks() ):
Inserting into empty wait queue &task->wq = 0x72a87c8, last_timer
pointer: (nil)
2. Then, we set the last timer to the same address:
Setting last_timer: (nil) to: 0x72a87c8
3. We get a new task to be inserted in the queue (again called by
process_runnable_tasks()) , before the __task_unlink_wq() is called for
the previous task.
4. At this point, we still have last_timer set to 0x72a87c8 , but since
it was inserted in an empty tree, it doesn't have node.bit and the
values above get dereferenced with undefined value.
The bug has no effect right now because the check for equality is still
made, so the next timer will still be queued at the right place anyway,
without any possible side-effect. But it's a pending bug waiting for a
small change somewhere to strike.
Iliya Polihronov
Cristian Ditoiu reported a major regression when testing 1.3.19 at
transfer.ro. It would crash within a few minutes while 1.3.15.10
was OK. He offered to help so we could run gdb and debug the crash
live. We finally found that the crash was the result of a regression
introduced by recent fix 814c978fb6
(task: fix possible timer drift after update) which makes it possible
for a tree walk to start from a detached task if this task has got
its timeout disabled due to a missing timeout.
The trivial fix below has been extensively tested and confirmed not
to crash anymore.
Special thanks to Cristian who spontaneously provided a lot of help
and trust to debug this issue which at first glance looked impossible
after reading the code and traces, but took less than an hour to spot
and fix when caught live in gdb ! That's really appreciated !
When the scheduler detected that a task was misplaced in the timer
queue, it used to place it right again. Unfortunately, it did not
check whether it would still call the new task from its new place.
This resulted in some tasks not getting called on timeout once in
a while, causing a minor drift for repetitive timers. This effect
was only observable with slow health checks and without any activity
because no other task would cause the scheduler to be immediately
called again.
In practice, it does not affect any real-world configuration, but
it's still better to fix it.
It's sometimes useful at least for statistics to keep a task count.
It's easy to do by forcing the rare task creators to always use the
same functions to create/destroy a task.
If a task wants to stay in the run queue, it is possible. It just
needs to wake itself up. We just want to ensure that a reniced
task will be processed at the right instant.
The top of a duplicate tree is not where bit == -1 but at the most
negative bit. This was causing tasks to be queued in reverse order
within duplicates. While this is not dramatic, it's incorrect and
might lead to longer than expected duplicate depths under some
circumstances.
When there are niced tasks, we would only process #tasks/4 per
turn, without taking care of running #tasks when #tasks was below
4, leaving those tasks waiting for a few other tasks to push them.
The fix simply consists in checking (#tasks+3)/4.
Since we're now able to search from a precise expiration date in
the timer tree using ebtree 4.1, we don't need to maintain 4 trees
anymore. Not only does this simplify the code a lot, but it also
ensures that we can always look 24 days back and ahead, which
doubles the ability of the previous scheduler. Indeed, while based
on absolute values, the timer tree is now relative to <now> as we
can always search from <now>-31 bits.
The run queue uses the exact same principle now, and is now simpler
and a bit faster to process. With these changes alone, an overall
0.5% performance gain was observed.
Tests were performed on the few wrapping cases and everything works
as expected.
Most of the time, task_queue() will immediately return. By extracting
the preliminary checks and putting them in an inline function, we can
significantly reduce the number of calls to the function itself, and
most of the tests can be optimized away due to the caller's context.
Another minor improvement in process_runnable_tasks() consisted in
taking benefit from the processor's branch prediction unit by making
a special case of the process_session() callback which is by far the
most common one.
All this improved performance by about 1%, mainly during the call
from process_runnable_tasks().
Timers are unsigned and used as tree positions. Ticks are signed and
used as absolute date within current time frame. While the two are
normally equal (except zero), it's important not to confuse them in
the code as they are not interchangeable.
We add two inline functions to turn each one into the other.
The comments have also been moved to the proper location, as it was
not easy to understand what was a tick and what was a timer unit.
All the tasks callbacks had to requeue the task themselves, and update
a global timeout. This was not convenient at all. Now the API has been
simplified. The tasks callbacks only have to update their expire timer,
and return either a pointer to the task or NULL if the task has been
deleted. The scheduler will take care of requeuing the task at the
proper place in the wait queue.
We don't need to remove then add tasks in the wait queue every time we
update a timeout. We only need to do that when the new timeout is earlier
than previous one. We can rely on wake_expired_tasks() to perform the
proper checks and bounce the misplaced tasks in the rare case where this
happens. The motivation behind this is that we very rarely hit timeouts,
so we save a lot of CPU cycles by moving the tasks very rarely. This now
means we can also find tasks with expiration date set to eternity in the
queue, and that is not a problem.
In many situations, we wake a task on an I/O event, then queue it
exactly where it was. This is a real waste because we delete/insert
tasks into the wait queue for nothing. The only reason for this is
that there was only one tree node in the task struct.
By adding another tree node, we can have one tree for the timers
(wait queue) and one tree for the priority (run queue). That way,
we can have a task both in the run queue and wait queue at the
same time. The wait queue now really holds timers, which is what
it was designed for.
The net gain is at least 1 delete/insert cycle per session, and up
to 2-3 depending on the workload, since we save one cycle each time
the expiration date is not changed during a wake up.
A bug was introduced with the ebtree-based scheduler. It seldom causes
some timeouts to last longer than required if they hit an expiration
date which is the same as the last queued date, is also part of a
duplicate tree without being the top of the tree. In this case, the
task will not be expired until after the duplicate tree has been
flushed.
It is easier to reproduce by setting a very short client timeout (1s)
and sending connections and waiting for them to expire with the 408
status. Then in parallel, inject at about 1kh/s. The bug causes the
connections to sometimes wait longer than 1s before timing out.
The cause was the use of eb_insert_dup() on wrong nodes, as this
function is designed to work only on the top of the dup tree. The
solution consists in updating last_timer only when its bit is -1,
and using it only if its bit is still -1 (top of a dup tree).
The fix has not reduced performance because it only fixes the case
where this bug could fire, which is extremely rare.
It's very frequent to require some information about the
reason why a task is running. Some flags have been added
so that a task now knows if it got woken up due to I/O
completion, timeout, etc...
A test has shown that more than 16% of the calls to task_wakeup()
could be avoided because the task is already woken up. So make it
inline and move the test to the inline part.
It should be stated as a rule that a C file should never
include types/xxx.h when proto/xxx.h exists, as it gives
less exposure to declaration conflicts (one of which was
caught and fixed here) and it complicates the file headers
for nothing.
Only types/global.h, types/capture.h and types/polling.h
have been found to be valid includes from C files.
This is the first attempt at moving all internal parts from
using struct timeval to integer ticks. Those provides simpler
and faster code due to simplified operations, and this change
also saved about 64 bytes per session.
A new header file has been added : include/common/ticks.h.
It is possible that some functions should finally not be inlined
because they're used quite a lot (eg: tick_first, tick_add_ifset
and tick_is_expired). More measurements are required in order to
decide whether this is interesting or not.
Some function and variable names are still subject to change for
a better overall logics.
When queuing a timer, it's very likely that an expiration date is
equal to that of the previously queued timer, due to time rounding
to the millisecond. Optimizing for this case provides a noticeable
1% performance boost.
The run queue scheduler now considers task->nice to queue a task and
to pick a task out of the queue. This makes it possible to boost the
access to statistics (both via HTTP and UNIX socket). The UNIX socket
receives twice as much a boost as the HTTP socket because it is more
sensible.
We now insert tasks in a certain sequence in the run queue.
The sorting key currently is the arrival order. It will now
be possible to apply a "nice" value to any task so that it
goes forwards or backwards in the run queue.
The calls to wake_expired_tasks() and maintain_proxies()
have been moved to the main run_poll_loop(), because they
had nothing to do in process_runnable_tasks().
The task_wakeup() function is not inlined anymore, as it was
only used at one place.
The qlist member of the task structure has been removed now.
The run_queue list has been replaced for an integer indicating
the number of tasks in the run queue.
The wait queues now rely on 4 trees for past, present and future
timers. The computations are cleaner and more reliable. The
wake_expired_tasks function has become simpler. Also, a bug
previously introduced in task_queue() by the first introduction
of eb_trees has been fixed (the eb->key was never updated).
The ultree code has been removed in favor of a simpler and
cleaner ebtree implementation. The eternity queue does not
need to exist anymore, and the pool_tree64 has been removed.
The ebtree node is stored in the task itself. The qlist list
header is still used by the run-queue, but will be able to
disappear once the run-queue uses ebtree too.
GCC4 is stupid (unbelievable news!).
When some code uses __builtin_expect(x != 0, 1), it really performs
the check of x != 0 then tests that the result is not zero! This is
a double check when only one was expected. Some performance drops
of 10% in the HTTP parser code have been observed due to this bug.
GCC 3.4 is fine though.
A solution consists in expecting that the tested value is 1. In
this case, it emits the correct code, but it's still not optimal
it seems. Finally the best solution is to ignore likely() and to
pray for the compiler to emit correct code. However, we still have
to fix unlikely() to remove the test there too, and to fix all
code which passed pointers overthere to pass integers instead.
wake_expired_tasks() used a hint to avoid scanning the tree in most cases,
but it looks like the hint is more expensive than reaching the first node
in the tree. Disable it for now.
The timeout functions were difficult to manipulate because they were
rounding results to the millisecond. Thus, it was difficult to compare
and to check what expired and what did not. Also, the comparison
functions were heavy with multiplies and divides by 1000. Now, all
timeouts are stored in timevals, reducing the number of operations
for updates and leading to cleaner and more efficient code.
The time subsystem really needs fixing. It was still possible
that some tasks with expiration date below the millisecond in
the future caused busy loop around poll() waiting for the
timeout to happen.
The fact that TV_ETERNITY was 0 was very awkward because it
required that comparison functions handled the special case.
Now it is ~0 and all comparisons are performed on unsigned
values, so that it is naturally greater than any other value.
A performance gain of about 2-5% has been noticed.
The rbtree-based wait queue consumes a lot of CPU. Use the ul2tree
instead. Lots of cleanups and code reorganizations made it possible
to reduce the task struct and simplify the code a bit.
The new rbtree-based scheduler makes heavy use of tv_cmp2(), and
this function becomes a huge CPU eater. Refine it a little bit in
order to slightly reduce CPU usage.
This patch from Sin Yu makes use of an rbtree for the wait queue,
which will solve the slowdown problem encountered when timeouts
are heterogenous in the configuration. The next step will be to
turn maintain_proxies() into a per-proxy task so that we won't
have to scan them all after each poll() loop.
The files are now stored under :
- include/haproxy for the generic includes
- include/types.h for the structures needed within prototypes
- include/proto.h for function prototypes and inline functions
- src/*.c for the C files
Most include files are now covered by LGPL. A last move still needs
to be done to put inline functions under GPL and not LGPL.
Version has been set to 1.3.0 in the code but some control still
needs to be done before releasing.