Commit Graph

198 Commits

Author SHA1 Message Date
Willy Tarreau
8d8747abe0 OPTIM: tasks: group all tree roots per cache line
Currently we have per-thread arrays of trees and counts, but these
ones unfortunately share cache lines and are accessed very often. This
patch moves the task-specific stuff into a structure taking a multiple
of a cache line, and has one such per thread. Just doing this has
reduced the cache miss ratio from 19.2% to 18.7% and increased the
12-thread test performance by 3%.

It starts to become visible that we really need a process-wide per-thread
storage area that would cover more than just these parts of the tasks.
The code was arranged so that it's easy to move the pieces elsewhere if
needed.
2018-10-15 19:06:13 +02:00
Willy Tarreau
b20aa9eef3 MAJOR: tasks: create per-thread wait queues
Now we still have a main contention point with the timers in the main
wait queue, but the vast majority of the tasks are pinned to a single
thread. This patch creates a per-thread wait queue and queues a task
to the local wait queue without any locking if the task is bound to a
single thread (the current one) otherwise to the shared queue using
locking. This significantly reduces contention on the wait queue. A
test with 12 threads showed 11 ms spent in the WQ lock compared to
4.7 seconds in the same test without this change. The cache miss ratio
decreased from 19.7% to 19.2% on the 12-thread test, and its performance
increased by 1.5%.

Another indirect benefit is that the average queue size is divided
by the number of threads, which roughly removes log(nbthreads) levels
in the tree and further speeds up lookups.
2018-10-15 19:04:40 +02:00
Willy Tarreau
0b25d5e99f MEDIUM: task: perform a single tree lookup per run queue batch
The run queue is designed to perform a single tree lookup and to
use multiple passes to eb32sc_next(). The scheduler rework took a
conservative approach first but this is not needed anymore and it
increases the processing cost of process_runnable_tasks() and even
the time during which the RQ lock is held if the global queue is
heavily loaded. Let's simply move the initial lookup to the entry
of the loop like the previous scheduler used to do. This has reduced
by a factor of 5.5 the number of calls to eb32sc_lookup_get() there.
2018-10-10 16:42:46 +02:00
Olivier Houchard
19bdf2428d MINOR: tasks: Don't special-case when nbthreads == 1
Instead of checking if nbthreads == 1, just and thread_mask with
all_threads_mask to know if we're supposed to add the task to the local or
the global runqueue.
2018-08-17 14:50:37 +02:00
Olivier Houchard
d8b7a4701d BUG/MEDIUM: tasks: Don't insert in the global rqueue if nbthread == 1
Make sure we don't insert a task in the global run queue if nbthread == 1,
as, as an optimisation, we avoid reading from it if nbthread == 1.
2018-08-16 19:25:46 +02:00
Willy Tarreau
85d9b84eb1 BUILD/MINOR: threads: unbreak build with threads disabled
Depending on the optimization level, gcc may complain that wake_thread()
uses an invalid array index for poller_wr_pipe[] when called from
__task_wakeup(). Normally the condition to get there never happens,
but it's simpler to ifdef out this part of the code which is only
used to wake other threads up. No backport is needed, this was brought
by the recent introduction of the ability to wake a sleeping thread.
2018-07-27 17:18:22 +02:00
Olivier Houchard
79321b95a8 MINOR: pollers: Add a way to wake a thread sleeping in the poller.
Add a new pipe, one per thread, so that we can write on it to wake a thread
sleeping in a poller, and use it to wake threads supposed to take care of a
task, if they are all sleeping.
2018-07-26 19:09:50 +02:00
Olivier Houchard
eba0c0b51d MINOR: tasks: Make global_tasks_mask volatile.
In order to make sure modifications are noticed by other threads when needed,
make global_tasks_mask volatile.
2018-07-26 19:09:50 +02:00
Olivier Houchard
9b03c0c9a7 MINOR: tasks: Make active_tasks_mask volatile.
To be sure we have the relevant informations, make active_tasks_mask volatile
2018-07-26 19:09:50 +02:00
Olivier Houchard
77551ee8a7 BUG/MEDIUM: tasks: make __task_unlink_rq responsible for the rqueue size.
As __task_wakeup() is responsible for increasing
rqueue_local[tid]/global_rqueue_size, make __task_unlink_rq responsible for
decreasing it, as process_runnable_tasks() isn't the only one that removes
tasks from runqueues.
2018-07-26 16:33:29 +02:00
Olivier Houchard
76e45181b2 MINOR: tasks: Add a flag that tells if we're in the global runqueue.
How that we have bits available in task->state, add a flag that tells if we're
in the global runqueue or not.
2018-07-26 16:33:10 +02:00
Olivier Houchard
c4aac9effe BUG/MEDIUM: tasks: Make sure there's no task left before considering inactive.
We may remove the thread's bit in active_tasks_mask despite tasks for that
thread still being present in the global runqueue. To fix that, introduce
global_tasks_mask, and set the correspnding bits when we add a task to the
runqueue.
2018-07-26 15:40:22 +02:00
Willy Tarreau
189ea856a7 BUG/MEDIUM: tasks: use atomic ops for active_tasks_mask
We don't have the lock anymore so we need to protect it.
2018-07-26 15:16:43 +02:00
Olivier Houchard
e85ee7b663 BUG/MEDIUM: tasks: Decrement rqueue_size at the right time.
We need to decrement requeue_size when we remove a task form rqueue_local,
not when we remove if from the task list, or we'd also decrement it for any
tasklet, that was never in the rqueue in the first place.
2018-07-26 15:00:58 +02:00
Willy Tarreau
9a77186cb0 BUG/MEDIUM: tasks: make sure we pick all tasks in the run queue
Commit 09eeb76 ("BUG/MEDIUM: tasks: Don't forget to increase/decrease
tasks_run_queue.") addressed a count issue in the run queue and uncovered
another issue with the way the tasks are dequeued from the global run
queue. The number of tasks to pick is computed using an integral divide,
which results in up to nbthread-1 tasks never being run. The fix simply
consists in getting rid of the divide and checking the task count in the
loop.

No backport is needed, this is 1.9-specific.
2018-07-26 14:24:46 +02:00
Olivier Houchard
9db0fedb59 BUG/MINOR: tasklets: Just make sure we don't pass a tasklet to the handler.
We can't just set t to NULL if it's a tasklet, or we'd have a hard time
accessing to t->process, so just make sure we pass NULL as the first parameter
of t->process if it's a tasklet.
This should be a non-issue at this point, as tasklets aren't used yet.
2018-06-14 18:57:26 +02:00
Olivier Houchard
b1ca58b245 MINOR: tasks: Don't define rqueue if we're building without threads.
To make sure we don't inadvertently insert task in the global runqueue,
while only the local runqueue is used without threads, make its definition
and usage conditional on USE_THREAD.
2018-06-06 16:35:12 +02:00
David Carlier
cc0a957a50 MINOR: task: Fix compiler warning.
Waking up task, when checking if it is a valid entry.
Similarly to commit caa8a37ffe,
casting explicitally to void pointer as HA_ATOMIC_CAS needs.
2018-06-05 13:55:57 +02:00
Olivier Houchard
082627af77 MINOR: task: Also consider the task list size when getting global tasks.
We're taking tasks from the global runqueue based on the number of tasks
the thread already have in its local runqueue, but now that we have a task
list, we also have to take that into account.
2018-05-28 15:20:59 +02:00
Olivier Houchard
736ea41c6c BUG/MEDIUM: task: Don't forget to decrement max_processed after each task.
When the task list was introduced, we bogusly lost max_processed--, that means
we would execute as much tasks as present in the list, and we would never
set active_tasks_mask, so the thread would go to sleep even if more tasks were
to be executed.

1.9-dev only, no backport is needed.
2018-05-28 15:20:57 +02:00
Olivier Houchard
1599b80360 MINOR: tasks: Make the number of tasks to run at once configurable.
Instead of hardcoding 200, make the number of tasks to be run configurable
using tune.runqueue-depth. 200 is still the default.
2018-05-26 20:03:24 +02:00
Olivier Houchard
b0bdae7b88 MAJOR: tasks: Introduce tasklets.
Introduce tasklets, lightweight tasks. They have no notion of priority,
they are just run as soon as possible, and will probably be used for I/O
later.

For the moment they're used to replace the temporary thread-local list
that was used in the scheduler. The first part of the struct is common
with tasks so that tasks can be cast to tasklets and queued in this list.
Once a task is in the tasklet list, it has its leaf_p set to 0x1 so that
it cannot accidently be confused as not in the queue.

Pure tasklets are identifiable by their nice value of -32768 (which is
normally not possible).
2018-05-26 20:03:19 +02:00
Olivier Houchard
f6e6dc12cd MAJOR: tasks: Create a per-thread runqueue.
A lot of tasks are run on one thread only, so instead of having them all
in the global runqueue, create a per-thread runqueue which doesn't require
any locking, and add all tasks belonging to only one thread to the
corresponding runqueue.

The global runqueue is still used for non-local tasks, and is visited
by each thread when checking its own runqueue. The nice parameter is
thus used both in the global runqueue and in the local ones. The rare
tasks that are bound to multiple threads will have their nice value
used twice (once for the global queue, once for the thread-local one).
2018-05-26 19:27:29 +02:00
Olivier Houchard
9f6af33222 MINOR: tasks: Change the task API so that the callback takes 3 arguments.
In preparation for thread-specific runqueues, change the task API so that
the callback takes 3 arguments, the task itself, the context, and the state,
those were retrieved from the task before. This will allow these elements to
change atomically in the scheduler while the application uses the copied
value, and even to have NULL tasks later.
2018-05-26 19:23:57 +02:00
Olivier Houchard
9b36cb4a41 BUG/MEDIUM: task: Don't free a task that is about to be run.
While running a task, we may try to delete and free a task that is about to
be run, because it's part of the local tasks list, or because rq_next points
to it.
So flag any task that is in the local tasks list to be deleted, instead of
run, by setting t->process to NULL, and re-make rq_next a global,
thread-local variable, that is modified if we attempt to delete that task.

Many thanks to PiBa-NL for reporting this and analysing the problem.

This should be backported to 1.8.
2018-05-04 20:11:04 +02:00
Willy Tarreau
d80cb4ee13 MINOR: global: add some global activity counters to help debugging
A number of counters have been added at special places helping better
understanding certain bug reports. These counters are maintained per
thread and are shown using "show activity" on the CLI. The "clear
counters" commands also reset these counters. The output is sent as a
single write(), which currently produces up to about 7 kB of data for
64 threads. If more counters are added, it may be necessary to write
into multiple buffers, or to reset the counters.

To backport to 1.8 to help collect more detailed bug reports.
2018-01-23 15:38:33 +01:00
Willy Tarreau
a24d1d0be4 MINOR: task: align the rq and wq locks
We really don't want them to share the same cache line as they are
expected to be used in parallel. Adding a 64-byte alignment here shows
a performance increase of about 4.5% on task-intensive workloads with
2 to 4 threads.
2017-11-26 11:10:51 +01:00
Willy Tarreau
6d1222ce73 MINOR: task: keep a pointer to the currently running task
Very often when debugging, the current task's pointer isn't easy to
recover (eg: from a core file). Let's keep a copy of it, it will
likely help, especially with threads.
2017-11-26 11:10:50 +01:00
Willy Tarreau
bafbe01028 CLEANUP: pools: rename all pool functions and pointers to remove this "2"
During the migration to the second version of the pools, the new
functions and pool pointers were all called "pool_something2()" and
"pool2_something". Now there's no more pool v1 code and it's a real
pain to still have to deal with this. Let's clean this up now by
removing the "2" everywhere, and by renaming the pool heads
"pool_head_something".
2017-11-24 17:49:53 +01:00
Willy Tarreau
51753458c4 BUG/MAJOR: threads/task: dequeue expired tasks under the WQ lock
There is a small unprotected window for a task between the wait queue
and the run queue where a task could be woken up and destroyed at the
same time. What typically happens is that a timeout is reached at the
same time an I/O completes and wakes it up, and the I/O terminates the
task, causing a use after free in wake_expired_tasks() possibly causing
a crash and/or memory corruption :

       thread 1                             thread 2
  (wake_expired_tasks)                (stream_int_notify)

 HA_SPIN_UNLOCK(TASK_WQ_LOCK, &wq_lock);
                              task_wakeup(task, TASK_WOKEN_IO);
                              ...
                              process_stream()
                                stream_free()
                                   task_free()
                                      pool_free(task)
 task_wakeup(task, TASK_WOKEN_TIMER);

This case is reasonably easy to reproduce with a config using very short
server timeouts (100ms) and client timeouts (10ms), while injecting on
httpterm requesting medium sized objects (5kB) over SSL. All this is
easier done with more threads than allocated CPUs so that pauses can
happen anywhere and last long enough for process_stream() to kill the
task.

This patch inverts the lock and the wakeup(), but requires some changes
in process_runnable_tasks() to ensure we never try to grab the WQ lock
while having the RQ lock held. This means we have to release the RQ lock
before calling task_queue(), so we can't hold the RQ lock during the
loop and must take and drop it.

It seems that a different approach with the scope-aware trees could be
easier, but it would possibly not cover situations where a task is
allowed to run on multiple threads. The current solution covers it and
doesn't seem to have any measurable performance impact.
2017-11-23 18:47:04 +01:00
Christopher Faulet
8a48f67526 MAJOR: polling: Use active_tasks_mask instead of tasks_run_queue
tasks_run_queue is the run queue size. It is a global variable. So it is
underoptimized because we may be lead to consider there are active tasks for a
thread while in fact all active tasks are assigned to the other threads. So, in
such cases, the polling loop will be evaluated many more times than necessary.

Instead, we now check if the thread id is set in the bitfield active_tasks_mask.

Another change has been made in process_runnable_tasks. Now, we always limit the
number of tasks processed to 200.

This is specific to threads, no backport is needed.
2017-11-16 11:19:46 +01:00
Christopher Faulet
3911ee85df MINOR: tasks: Use a bitfield to track tasks activity per-thread
a bitfield has been added to know if there are runnable tasks for a thread. When
a task is woken up, the bits corresponding to its thread_mask are set. When all
tasks for a thread have been evaluated without any wakeup, the thread is removed
from active ones by unsetting its tid_bit from the bitfield.
2017-11-16 11:19:46 +01:00
Christopher Faulet
919b739862 CLEANUP: tasks: Remove useless double test on rq_next
No backport is needed, this is purely 1.8-specific.
2017-11-14 18:11:34 +01:00
Christopher Faulet
9dcf9b6f03 MINOR: threads: Use __decl_hathreads to declare locks
This macro should be used to declare variables or struct members depending on
the USE_THREAD compile option. It avoids the encapsulation of such declarations
between #ifdef/#endif. It is used to declare all lock variables.
2017-11-13 11:38:17 +01:00
Willy Tarreau
9e45b33f7e BUG/MAJOR: threads/tasks: fix the scheduler again
My recent change in commit ce4e0aa ("MEDIUM: task: change the construction
of the loop in process_runnable_tasks()") was bogus as it used to keep the
rq_next across an unlock/lock sequence, occasionally leading to crashes for
tasks that are eligible to any thread. We must use the lookup call for each
new batch instead. The problem is easily triggered with such a configuration :

    global
        nbthread 4

    listen check
        mode http
        bind 0.0.0.0:8080
        redirect location /
        option httpchk GET /
        server s1 127.0.0.1:8080 check inter 1
        server s2 127.0.0.1:8080 check inter 1

Thanks to Olivier for diagnosing this one. No backport is needed.
2017-11-08 14:05:19 +01:00
Christopher Faulet
2a944ee16b BUILD: threads: Rename SPIN/RWLOCK macros using HA_ prefix
This remove any name conflicts, especially on Solaris.
2017-11-07 11:10:24 +01:00
Willy Tarreau
f0c531ab55 MEDIUM: tasks: implement a lockless scheduler for single-thread usage
The scheduler is complex and uses local queues to amortize the cost of
locks. But all this comes with a cost that is quite observable with
single-thread workloads.

The purpose of this patch is to reimplement the much simpler scheduler
for the case where threads are not used. The code is very small and
simple. It doesn't impact the multi-threaded performance at all, and
provides a nice 10% performance increase in single-thread by reaching
606kreq/s on the tests that showed 550kreq/s before.
2017-11-06 11:20:11 +01:00
Willy Tarreau
9d4b56b88e MINOR: tasks: only visit filled task slots after processing them
process_runnable_tasks() needs to requeue or wake up tasks after
processing them in batches. By only refilling the existing ones, we
avoid revisiting all the queue. The performance gain is measurable
starting with two threads, where the request rate climbs to 657k/s
compared to 644k.
2017-11-06 11:20:11 +01:00
Willy Tarreau
ce4e0aa7f3 MEDIUM: task: change the construction of the loop in process_runnable_tasks()
This patch slightly rearranges the loop to pack the locked code a little
bit, and to try to concentrate accesses to the tree together to benefit
more from the cache.

It also fixes how the loop handles the right margin : now that is guaranteed
that the retrieved nodes are filtered to only match the current thread, we
don't need to rewind every 16 entries. Instead we can rewind each time we
reach the right margin again.

With this change, we now achieve the following performance for 10 H2 conns
each containing 100 streams :

   1 thread : 550kreq/s
   2 thread : 644kreq/s
   3 thread : 598kreq/s
2017-11-06 11:20:11 +01:00
Willy Tarreau
b992ba16ef MINOR: task: simplify wake_expired_tasks() to avoid unlocking in the loop
This function is sensitive, let's make it shorter by factoring out the
unlock and leave code. This reduced the function's size by a few tens
of bytes and increased the overall performance by about 1%.
2017-11-06 11:20:11 +01:00
Willy Tarreau
8d38805d3d MAJOR: task: make use of the scope-aware ebtree functions
Currently the task scheduler suffers from an O(n) lookup when
skipping tasks that are not for the current thread. The reason
is that eb32_lookup_ge() has no information about the current
thread so it always revisits many tasks for other threads before
finding its own tasks.

This is particularly visible with HTTP/2 since the number of
concurrent streams created at once causes long series of tasks
for the same stream in the scheduler. With only 10 connections
and 100 streams each, by running on two threads, the performance
drops from 640kreq/s to 11.2kreq/s! Lookup metrics show that for
only 200000 task lookups, 430 million skips had to be performed,
which means that on average, each lookup leads to 2150 nodes to
be visited.

This commit backports the principle of scope lookups for ebtrees
from the ebtree_v7 development tree. The idea is that each node
contains a mask indicating the union of the scopes for the nodes
below it, which is fed during insertion, and used during lookups.

Then during lookups, branches that do not contain any leaf matching
the requested scope are simply ignored. This perfectly matches a
thread mask, allowing a thread to only extract the tasks it cares
about from the run queue, and to always find them in O(log(n))
instead of O(n). Thus the scheduler uses tid_bit and
task->thread_mask as the ebtree scope here.

Doing this has recovered most of the performance, as can be seen on
the test below with two threads, 10 connections, 100 streams each,
and 1 million requests total :

                              Before     After    Gain
              test duration : 89.6s      4.73s     x19
    HTTP requests/s (DEBUG) : 11200     211300     x19
     HTTP requests/s (PROD) : 15900     447000     x28
             spin_lock time : 85.2s      0.46s    /185
            time per lookup : 13us       40ns     /325

Even when going to 6 threads (on 3 hyperthreaded CPU cores), the
performance stays around 284000 req/s, showing that the contention
is much lower.

A test showed that there's no benefit in using this for the wait queue
though.
2017-11-06 11:20:11 +01:00
Willy Tarreau
f65610a83d CLEANUP: threads: rename process_mask to thread_mask
It was a leftover from the last cleaning session; this mask applies
to threads and calling it process_mask is a bit confusing. It's the
same in fd, task and applets.
2017-10-31 16:06:06 +01:00
Willy Tarreau
5f4a47b701 CLEANUP: threads: replace the last few 1UL<<tid with tid_bit
There were a few occurences left, better replace them now.
2017-10-31 15:59:32 +01:00
Emeric Brun
c60def8368 MAJOR: threads/task: handle multithread on task scheduler
2 global locks have been added to protect, respectively, the run queue and the
wait queue. And a process mask has been added on each task. Like for FDs, this
mask is used to know which threads are allowed to process a task.

For many tasks, all threads are granted. And this must be your first intension
when you create a new task, else you have a good reason to make a task sticky on
some threads. This is then the responsibility to the process callback to lock
what have to be locked in the task context.

Nevertheless, all tasks linked to a session must be sticky on the thread
creating the session. It is important that I/O handlers processing session FDs
and these tasks run on the same thread to avoid conflicts.
2017-10-31 13:58:30 +01:00
Thierry FOURNIER
d697596c6c MINOR: tasks: Move Lua notification from Lua to tasks
These notification management function and structs are generic and
it will be better to move in common parts.

The notification management functions and structs have names
containing some "lua" references because it was written for
the Lua. This patch removes also these references.
2017-09-11 18:59:40 +02:00
Emeric Brun
0194897e54 MAJOR: task: task scheduler rework.
In order to authorize call of task_wakeup on running task:
- from within the task handler itself.
- in futur, from another thread.

The lookups on runqueue and waitqueue are re-worked
to prepare multithread stuff.

If task_wakeup is called on a running task, the woken
message flags are savec in the 'pending_state' attribute of
the state. The real wakeup is postponed at the end of the handler
process and the woken messages are copied from pending_state
to the state attribute of the task.

It's important to note that this change will cause a very minor
(though measurable) performance loss but it is necessary to make
forward progress on a multi-threaded scheduler. Most users won't
ever notice.
2017-06-27 14:38:02 +02:00
Christopher Faulet
34c5cc98da MINOR: task: Rename run_queue and run_queue_cur counters
<run_queue> is used to track the number of task in the run queue and
<run_queue_cur> is a copy used for the reporting purpose. These counters has
been renamed, respectively, <tasks_run_queue> and <tasks_run_queue_cur>. So the
naming is consistent between tasks and applets.

[wt: needed for next fixes, backport to 1.7 and 1.6]
2016-12-12 19:10:54 +01:00
Willy Tarreau
87b09668be REORG/MAJOR: session: rename the "session" entity to "stream"
With HTTP/2, we'll have to support multiplexed streams. A stream is in
fact the largest part of what we currently call a session, it has buffers,
logs, etc.

In order to catch any error, this commit removes any reference to the
struct session and tries to rename most "session" occurrences in function
names to "stream" and "sess" to "strm" when that's related to a session.

The files stream.{c,h} were added and session.{c,h} removed.

The session will be reintroduced later and a few parts of the stream
will progressively be moved overthere. It will more or less contain
only what we need in an embryonic session.

Sample fetch functions and converters will have to change a bit so
that they'll use an L5 (session) instead of what's currently called
"L4" which is in fact L6 for now.

Once all changes are completed, we should see approximately this :

   L7 - http_txn
   L6 - stream
   L5 - session
   L4 - connection | applet

There will be at most one http_txn per stream, and a same session will
possibly be referenced by multiple streams. A connection will point to
a session and to a stream. The session will hold all the information
we need to keep even when we don't yet have a stream.

Some more cleanup is needed because some code was already far from
being clean. The server queue management still refers to sessions at
many places while comments talk about connections. This will have to
be cleaned up once we have a server-side connection pool manager.
Stream flags "SN_*" still need to be renamed, it doesn't seem like
any of them will need to move to the session.
2015-04-06 11:23:56 +02:00
Willy Tarreau
c46c965540 BUG/MEDIUM: task: fix recently introduced scheduler skew
Commit 501260b ("MEDIUM: task: always ensure that the run queue is
consistent") introduced a skew in the scheduler : if a negatively niced
task is woken up, it can be inserted prior to the current index and will
be skipped as long as there is some activity with less prioritary tasks.
The immediate effect is that it's not possible to get access to the stats
under full load until the load goes down.

This is because the rq_next constantly evolves within more recent
positions. The fix is simple, __task_wakeup() must empty rq_next. The
sad thing is that this issue was fixed during development and missed
during the commit. No backport is needed, this is purely 1.6 stuff.
2015-03-05 11:49:17 +01:00
Thierry FOURNIER
9cf7c4b9df MAJOR: poll: only rely on wake_expired_tasks() to compute the wait delay
Actually, HAProxy uses the function "process_runnable_tasks" and
"wake_expired_tasks" to get the next task which can expires.

If a task is added with "task_schedule" or other method during
the execution of an other task, the expiration of this new task
is not taken into account, and the execution of this task can be
too late.

Actualy, HAProxy seems to be no sensitive to this bug.

This fix moves the call to process_runnable_tasks() before the timeout
calculation and ensures that all wakeups are processed together. Only
wake_expired_tasks() needs to return a timeout now.
2015-02-28 23:12:30 +01:00
Willy Tarreau
501260bf67 MEDIUM: task: always ensure that the run queue is consistent
As found by Thierry Fournier, if a task manages to kill another one and
if this other task is the next one in the run queue, we can do whatever
including crashing, because the scheduler restarts from the saved next
task. For now, there is no such concept of a task killing another one,
but with Lua it will come.

A solution consists in always performing the lookup of the first task in
the scheduler's loop, but it's expensive and costs around 2% of the
performance.

Another solution consists in keeping a global next run queue node and
ensuring that when this task gets removed, it updates this pointer to
the next one. This allows to simplify the code a bit and in the end to
slightly increase the performance (0.3-0.5%). The mechanism might still
be usable if we later migrate to a multi-threaded scheduler.
2015-02-23 16:07:01 +01:00
Willy Tarreau
98c6121ee5 [OPTIM] task: don't scan the run queue if we know it's empty
It happens quite often in fact, so let's save those precious cycles.
2011-09-10 20:08:49 +02:00
Willy Tarreau
45cb4fb640 [MEDIUM] build: switch ebtree users to use new ebtree version
All files referencing the previous ebtree code were changed to point
to the new one in the ebtree directory. A makefile variable (EBTREE_DIR)
is also available to use files from another directory.

The ability to build the libebtree library temporarily remains disabled
because it can have an impact on some existing toolchains and does not
appear worth it in the medium term if we add support for multi-criteria
stickiness for instance.
2009-10-26 21:10:04 +01:00
SaVaGe
1d7a420c84 [BUG] task.c: don't assing last_timer to node-less entries
I noticed that in __eb32_insert , if the tree is empty
(root->b[EB_LEFT] == NULL) , the node.bit is not defined.
However in __task_queue there are checks:

- if (last_timer->node.bit < 0)
- if (task->wq.node.bit < last_timer->node.bit)

which might rely upon an undefined value.

This is how I see it:

1. We insert eb32_node in an empty wait queue tree for a task (called by
process_runnable_tasks() ):
Inserting into empty wait queue  &task->wq = 0x72a87c8, last_timer
pointer: (nil)

2. Then, we set the last timer to the same address:
Setting last_timer: (nil) to: 0x72a87c8

3. We get a new task to be inserted in the queue (again called by
process_runnable_tasks()) , before the __task_unlink_wq() is called for
the previous task.

4. At this point, we still have last_timer set to 0x72a87c8 , but since
it was inserted in an empty tree, it doesn't have node.bit and the
values above get dereferenced with undefined value.

The bug has no effect right now because the check for equality is still
made, so the next timer will still be queued at the right place anyway,
without any possible side-effect. But it's a pending bug waiting for a
small change somewhere to strike.

Iliya Polihronov
2009-10-10 15:15:07 +02:00
Willy Tarreau
34e98ea70d [BUG] task: fix possible crash when some timeouts are not configured
Cristian Ditoiu reported a major regression when testing 1.3.19 at
transfer.ro. It would crash within a few minutes while 1.3.15.10
was OK. He offered to help so we could run gdb and debug the crash
live. We finally found that the crash was the result of a regression
introduced by recent fix 814c978fb6
(task: fix possible timer drift after update) which makes it possible
for a tree walk to start from a detached task if this task has got
its timeout disabled due to a missing timeout.

The trivial fix below has been extensively tested and confirmed not
to crash anymore.

Special thanks to Cristian who spontaneously provided a lot of help
and trust to debug this issue which at first glance looked impossible
after reading the code and traces, but took less than an hour to spot
and fix when caught live in gdb ! That's really appreciated !
2009-08-09 09:09:54 +02:00
Willy Tarreau
814c978fb6 [BUG] task: fix possible timer drift after update
When the scheduler detected that a task was misplaced in the timer
queue, it used to place it right again. Unfortunately, it did not
check whether it would still call the new task from its new place.
This resulted in some tasks not getting called on timeout once in
a while, causing a minor drift for repetitive timers. This effect
was only observable with slow health checks and without any activity
because no other task would cause the scheduler to be immediately
called again.

In practice, it does not affect any real-world configuration, but
it's still better to fix it.
2009-07-14 23:48:55 +02:00
Willy Tarreau
3884cbaae6 [MINOR] show sess: report number of calls to each task
For debugging purposes, it can be useful to know how many times each
task has been called.
2009-03-28 17:54:35 +01:00
Willy Tarreau
c7bdf09f9f [MINOR] stats: report number of tasks (active and running)
It may be useful for statistics purposes to report the number of
tasks.
2009-03-21 18:33:52 +01:00
Willy Tarreau
a461318f97 [MINOR] task: keep a task count and clean up task creators
It's sometimes useful at least for statistics to keep a task count.
It's easy to do by forcing the rare task creators to always use the
same functions to create/destroy a task.
2009-03-21 18:13:21 +01:00
Willy Tarreau
135a113e36 [MINOR] sched: permit a task to stay up between calls
If a task wants to stay in the run queue, it is possible. It just
needs to wake itself up. We just want to ensure that a reniced
task will be processed at the right instant.
2009-03-21 13:26:05 +01:00
Willy Tarreau
26ca34e66e [BUG] scheduler: fix improper handling of duplicates __task_queue()
The top of a duplicate tree is not where bit == -1 but at the most
negative bit. This was causing tasks to be queued in reverse order
within duplicates. While this is not dramatic, it's incorrect and
might lead to longer than expected duplicate depths under some
circumstances.
2009-03-21 12:57:06 +01:00
Willy Tarreau
218859ad6c [BUG] sched: don't leave 3 lasts tasks unprocessed when niced tasks are present
When there are niced tasks, we would only process #tasks/4 per
turn, without taking care of running #tasks when #tasks was below
4, leaving those tasks waiting for a few other tasks to push them.

The fix simply consists in checking (#tasks+3)/4.
2009-03-21 11:53:09 +01:00
Willy Tarreau
e35c94a748 [MEDIUM] scheduler: get rid of the 4 trees thanks and use ebtree v4.1
Since we're now able to search from a precise expiration date in
the timer tree using ebtree 4.1, we don't need to maintain 4 trees
anymore. Not only does this simplify the code a lot, but it also
ensures that we can always look 24 days back and ahead, which
doubles the ability of the previous scheduler. Indeed, while based
on absolute values, the timer tree is now relative to <now> as we
can always search from <now>-31 bits.

The run queue uses the exact same principle now, and is now simpler
and a bit faster to process. With these changes alone, an overall
0.5% performance gain was observed.

Tests were performed on the few wrapping cases and everything works
as expected.
2009-03-21 10:25:14 +01:00
Willy Tarreau
87bed62a92 [BUILD] build fixes for Solaris
One build error in stream_sock.c when MSG_NOSIGNAL is not defined,
and a warning in task.c.
2009-03-08 22:25:28 +01:00
Willy Tarreau
531cf0cf8d [OPTIM] task: reduce the number of calls to task_queue()
Most of the time, task_queue() will immediately return. By extracting
the preliminary checks and putting them in an inline function, we can
significantly reduce the number of calls to the function itself, and
most of the tests can be optimized away due to the caller's context.

Another minor improvement in process_runnable_tasks() consisted in
taking benefit from the processor's branch prediction unit by making
a special case of the process_session() callback which is by far the
most common one.

All this improved performance by about 1%, mainly during the call
from process_runnable_tasks().
2009-03-08 16:35:27 +01:00
Willy Tarreau
d0a201b35c [CLEANUP] task: distinguish between clock ticks and timers
Timers are unsigned and used as tree positions. Ticks are signed and
used as absolute date within current time frame. While the two are
normally equal (except zero), it's important not to confuse them in
the code as they are not interchangeable.

We add two inline functions to turn each one into the other.

The comments have also been moved to the proper location, as it was
not easy to understand what was a tick and what was a timer unit.
2009-03-08 15:58:07 +01:00
Willy Tarreau
26c250683f [MEDIUM] minor update to the task api: let the scheduler queue itself
All the tasks callbacks had to requeue the task themselves, and update
a global timeout. This was not convenient at all. Now the API has been
simplified. The tasks callbacks only have to update their expire timer,
and return either a pointer to the task or NULL if the task has been
deleted. The scheduler will take care of requeuing the task at the
proper place in the wait queue.
2009-03-08 09:38:41 +01:00
Willy Tarreau
4136522527 [OPTIM] displace tasks in the wait queue only if absolutely needed
We don't need to remove then add tasks in the wait queue every time we
update a timeout. We only need to do that when the new timeout is earlier
than previous one. We can rely on wake_expired_tasks() to perform the
proper checks and bounce the misplaced tasks in the rare case where this
happens. The motivation behind this is that we very rarely hit timeouts,
so we save a lot of CPU cycles by moving the tasks very rarely. This now
means we can also find tasks with expiration date set to eternity in the
queue, and that is not a problem.
2009-03-08 07:59:27 +01:00
Willy Tarreau
4726f53794 [OPTIM] task: don't unlink a task from a wait queue when waking it up
In many situations, we wake a task on an I/O event, then queue it
exactly where it was. This is a real waste because we delete/insert
tasks into the wait queue for nothing. The only reason for this is
that there was only one tree node in the task struct.

By adding another tree node, we can have one tree for the timers
(wait queue) and one tree for the priority (run queue). That way,
we can have a task both in the run queue and wait queue at the
same time. The wait queue now really holds timers, which is what
it was designed for.

The net gain is at least 1 delete/insert cycle per session, and up
to 2-3 depending on the workload, since we save one cycle each time
the expiration date is not changed during a wake up.
2009-03-08 07:59:18 +01:00
Willy Tarreau
1b8ca663a4 [BUG] task: fix handling of duplicate keys
A bug was introduced with the ebtree-based scheduler. It seldom causes
some timeouts to last longer than required if they hit an expiration
date which is the same as the last queued date, is also part of a
duplicate tree without being the top of the tree. In this case, the
task will not be expired until after the duplicate tree has been
flushed.

It is easier to reproduce by setting a very short client timeout (1s)
and sending connections and waiting for them to expire with the 408
status. Then in parallel, inject at about 1kh/s. The bug causes the
connections to sometimes wait longer than 1s before timing out.

The cause was the use of eb_insert_dup() on wrong nodes, as this
function is designed to work only on the top of the dup tree. The
solution consists in updating last_timer only when its bit is -1,
and using it only if its bit is still -1 (top of a dup tree).

The fix has not reduced performance because it only fixes the case
where this bug could fire, which is extremely rare.
2009-03-08 07:57:47 +01:00
Willy Tarreau
fdccded0e8 [MEDIUM] indicate a reason for a task wakeup
It's very frequent to require some information about the
reason why a task is running. Some flags have been added
so that a task now knows if it got woken up due to I/O
completion, timeout, etc...
2008-11-02 10:19:08 +01:00
Willy Tarreau
4df8206832 [OPTIM] reduce the number of calls to task_wakeup()
A test has shown that more than 16% of the calls to task_wakeup()
could be avoided because the task is already woken up. So make it
inline and move the test to the inline part.
2008-11-02 10:19:07 +01:00
Willy Tarreau
ec6c5df018 [CLEANUP] remove many #include <types/xxx> from C files
It should be stated as a rule that a C file should never
include types/xxx.h when proto/xxx.h exists, as it gives
less exposure to declaration conflicts (one of which was
caught and fixed here) and it complicates the file headers
for nothing.

Only types/global.h, types/capture.h and types/polling.h
have been found to be valid includes from C files.
2008-07-16 10:30:42 +02:00
Willy Tarreau
0c303eec87 [MAJOR] convert all expiration timers from timeval to ticks
This is the first attempt at moving all internal parts from
using struct timeval to integer ticks. Those provides simpler
and faster code due to simplified operations, and this change
also saved about 64 bytes per session.

A new header file has been added : include/common/ticks.h.

It is possible that some functions should finally not be inlined
because they're used quite a lot (eg: tick_first, tick_add_ifset
and tick_is_expired). More measurements are required in order to
decide whether this is interesting or not.

Some function and variable names are still subject to change for
a better overall logics.
2008-07-07 00:09:58 +02:00
Willy Tarreau
ce44f12c1e [OPTIM] task_queue: assume most consecutive timers are equal
When queuing a timer, it's very likely that an expiration date is
equal to that of the previously queued timer, due to time rounding
to the millisecond. Optimizing for this case provides a noticeable
1% performance boost.
2008-07-05 18:16:19 +02:00
Willy Tarreau
91e99931b7 [MEDIUM] introduce task->nice and boot access to statistics
The run queue scheduler now considers task->nice to queue a task and
to pick a task out of the queue. This makes it possible to boost the
access to statistics (both via HTTP and UNIX socket). The UNIX socket
receives twice as much a boost as the HTTP socket because it is more
sensible.
2008-06-30 07:51:00 +02:00
Willy Tarreau
58b458d8ba [MAJOR] use an ebtree instead of a list for the run queue
We now insert tasks in a certain sequence in the run queue.
The sorting key currently is the arrival order. It will now
be possible to apply a "nice" value to any task so that it
goes forwards or backwards in the run queue.

The calls to wake_expired_tasks() and maintain_proxies()
have been moved to the main run_poll_loop(), because they
had nothing to do in process_runnable_tasks().

The task_wakeup() function is not inlined anymore, as it was
only used at one place.

The qlist member of the task structure has been removed now.
The run_queue list has been replaced for an integer indicating
the number of tasks in the run queue.
2008-06-29 22:40:23 +02:00
Willy Tarreau
af754fc88f [OPTIM] shrink wake_expired_tasks() by using task_wakeup()
It's not worth duplicating task_wakeup() in wake_expired_tasks().
Calling it reduces code size and slightly improves performance.
2008-06-29 19:25:52 +02:00
Willy Tarreau
28c41a4041 [MEDIUM] rework the wait queue mechanism
The wait queues now rely on 4 trees for past, present and future
timers. The computations are cleaner and more reliable. The
wake_expired_tasks function has become simpler. Also, a bug
previously introduced in task_queue() by the first introduction
of eb_trees has been fixed (the eb->key was never updated).
2008-06-29 17:00:59 +02:00
Willy Tarreau
e62bdd4026 [BUG] wqueue: perform proper timeout comparisons with wrapping values
With wrapping keys, we cannot simply do "if (key > now)", but
we must at least do "if ((signed)(key-now) > 0)".
2008-06-29 10:32:02 +02:00
Willy Tarreau
9789f7bd68 [MAJOR] replace ultree with ebtree in wait-queues
The ultree code has been removed in favor of a simpler and
cleaner ebtree implementation. The eternity queue does not
need to exist anymore, and the pool_tree64 has been removed.

The ebtree node is stored in the task itself. The qlist list
header is still used by the run-queue, but will be able to
disappear once the run-queue uses ebtree too.
2008-06-24 08:17:16 +02:00
Willy Tarreau
70bcfb77a7 [OPTIM] GCC4's builtin_expect() is suboptimal
GCC4 is stupid (unbelievable news!).

When some code uses __builtin_expect(x != 0, 1), it really performs
the check of x != 0 then tests that the result is not zero! This is
a double check when only one was expected. Some performance drops
of 10% in the HTTP parser code have been observed due to this bug.

GCC 3.4 is fine though.

A solution consists in expecting that the tested value is 1. In
this case, it emits the correct code, but it's still not optimal
it seems. Finally the best solution is to ignore likely() and to
pray for the compiler to emit correct code. However, we still have
to fix unlikely() to remove the test there too, and to fix all
code which passed pointers overthere to pass integers instead.
2008-02-14 23:14:33 +01:00
Willy Tarreau
315bff5183 Merge branch 'pools' into merge-pools 2007-05-14 02:11:56 +02:00
Willy Tarreau
1209033e46 [MINOR] disable useless hint in wake_expired_tasks
wake_expired_tasks() used a hint to avoid scanning the tree in most cases,
but it looks like the hint is more expensive than reaching the first node
in the tree. Disable it for now.
2007-05-14 02:11:39 +02:00
Willy Tarreau
fbfc053e34 [BUG] fix buggy timeout computation in wake_expired_tasks
Wake_expired_tasks is supposed to return a date, not an interval. It
was causing busy loops in pollers.
2007-05-14 02:03:47 +02:00
Willy Tarreau
c6ca1a02aa [MAJOR] migrated task, tree64 and session to pool2
task and tree64 are already very close in size and are merged together.
Overall performance gained slightly by this simple change.
2007-05-13 19:43:47 +02:00
Willy Tarreau
c64e5397f6 [MINOR] avoid inlining in task.c
The task management functions used to call __tv_* which is not really
optimal given the size of the functions.
2007-05-13 16:07:06 +02:00
Willy Tarreau
d825eef9c5 [MAJOR] replaced all timeouts with struct timeval
The timeout functions were difficult to manipulate because they were
rounding results to the millisecond. Thus, it was difficult to compare
and to check what expired and what did not. Also, the comparison
functions were heavy with multiplies and divides by 1000. Now, all
timeouts are stored in timevals, reducing the number of operations
for updates and leading to cleaner and more efficient code.
2007-05-12 22:35:00 +02:00
Willy Tarreau
7317eb5a1d [MAJOR] fixed some expiration dates on tasks
The time subsystem really needs fixing. It was still possible
that some tasks with expiration date below the millisecond in
the future caused busy loop around poll() waiting for the
timeout to happen.
2007-05-09 00:54:10 +02:00
Willy Tarreau
e33aecefa6 [MINOR] uninline task_wakeup
task_wakup has become bigger since we used the trees. Let's not
inline it anymore.
2007-04-30 14:38:03 +02:00
Willy Tarreau
42aae5c7cf [MEDIUM] many cleanups in the time functions
Now, functions whose name begins with '__tv_' are inlined. Also,
'tv_ms' is used as a prefix for functions using milliseconds.
2007-04-29 17:43:56 +02:00
Willy Tarreau
a6a6a93e56 [MAJOR] changed TV_ETERNITY to ~0 instead of 0
The fact that TV_ETERNITY was 0 was very awkward because it
required that comparison functions handled the special case.
Now it is ~0 and all comparisons are performed on unsigned
values, so that it is naturally greater than any other value.

A performance gain of about 2-5% has been noticed.
2007-04-29 13:44:24 +02:00
Willy Tarreau
96bcfd75aa [MAJOR] replaced rbtree with ul2tree.
The rbtree-based wait queue consumes a lot of CPU. Use the ul2tree
instead. Lots of cleanups and code reorganizations made it possible
to reduce the task struct and simplify the code a bit.
2007-04-29 13:43:53 +02:00
Willy Tarreau
5e8f066961 [MINOR] slightly optimize time calculation for rbtree
The new rbtree-based scheduler makes heavy use of tv_cmp2(), and
this function becomes a huge CPU eater. Refine it a little bit in
order to slightly reduce CPU usage.
2007-02-12 00:59:08 +01:00
Willy Tarreau
b1b8272a54 [MINOR] uninline rb_insert_task_queue()
rb_insert_task_queue() was inlined and is quite large. Uninlining
it reduces code size by about 2 kB and slightly improves performance.
2007-02-11 13:52:16 +01:00
Willy Tarreau
964c936b04 [MAJOR] replace the wait-queue linked list with an rbtree.
This patch from Sin Yu makes use of an rbtree for the wait queue,
which will solve the slowdown problem encountered when timeouts
are heterogenous in the configuration. The next step will be to
turn maintain_proxies() into a per-proxy task so that we won't
have to scan them all after each poll() loop.
2007-01-07 02:14:23 +01:00
Willy Tarreau
2dd0d4799e [CLEANUP] renamed include/haproxy to include/common 2006-06-29 17:53:05 +02:00
Willy Tarreau
baaee00406 [BIGMOVE] exploded the monolithic haproxy.c file into multiple files.
The files are now stored under :
  - include/haproxy for the generic includes
  - include/types.h for the structures needed within prototypes
  - include/proto.h for function prototypes and inline functions
  - src/*.c for the C files

Most include files are now covered by LGPL. A last move still needs
to be done to put inline functions under GPL and not LGPL.

Version has been set to 1.3.0 in the code but some control still
needs to be done before releasing.
2006-06-26 02:48:02 +02:00