haproxy

mirror of http://git.haproxy.org/git/haproxy.git/ synced 2025-04-23 23:45:37 +00:00

Author	SHA1	Message	Date
Willy Tarreau	74dea8caea	MINOR: task: limit the number of subsequent heavy tasks with flag TASK_HEAVY While the scheduler is priority-aware and class-aware, and consistently tries to maintain fairness between all classes, it doesn't make use of a fine execution budget to compensate for high-latency tasks such as TLS handshakes. This can result in many subsequent calls adding multiple milliseconds of latency between the various steps of other tasklets that don't even depend on this. An ideal solution would be to add a 4th queue, have all tasks announce their estimated cost upfront and let the scheduler maintain an auto-refilling budget to pick from the most suitable queue. But it turns out that a very simplified version of this already provides impressive gains with very tiny changes and could easily be backported. The principle is to reserve a new task flag "TASK_HEAVY" that indicates that a task is expected to take a lot of time without yielding (e.g. an SSL handshake typically takes 700 microseconds of crypto computation). When the scheduler sees this flag when queuing a tasklet, it will place it into the bulk queue. And during dequeuing, we accept only one of these in a full round. This means that the first one will be accepted, will not prevent other lower priority tasks from running, but if a new one arrives, then the queue stops here and goes back to the polling. This will allow to collect more important updates for other tasks that will be batched before the next call of a heavy task. Preliminary tests consisting in placing this flag on the SSL handshake tasklet show that response times under SSL stress fell from 14 ms before the patch to 3.0 ms with the patch, and even 1.8 ms if tune.sched.low-latency is set to "on".	2021-02-26 00:25:51 +01:00
Willy Tarreau	2a54ffbf43	MINOR: task: make tasklet wakeup latency measurements more accurate First, we don't want to measure wakeup times if the call date had not been set before profiling was enabled at run time. And second, we may only collect the value before clearing the TASK_IN_LIST bit, otherwise another wakeup might happen on another thread and replace the call date we're about to use, hence artificially lower the wakeup times.	2021-02-25 09:44:16 +01:00
Willy Tarreau	b2285de049	MINOR: tasks: also compute the tasklet latency when DEBUG_TASK is set It is extremely useful to be able to observe the wakeup latency of some important I/O operations, so let's accept to inflate the tasklet struct by 8 extra bytes when DEBUG_TASK is set. With just this we have enough to get live reports like this: $ socat - /tmp/sock1 <<< "show profiling" Per-task CPU profiling : on # set profiling tasks {on|auto|off} Tasks activity: function calls cpu_tot cpu_avg lat_tot lat_avg si_cs_io_cb 8099492 4.833s 596.0ns 8.974m 66.48us h1_io_cb 7460365 11.55s 1.548us 2.477m 19.92us process_stream 7383828 22.79s 3.086us 18.39m 149.5us h1_timeout_task 4157 - - 348.4ms 83.81us srv_cleanup_toremove_connections 751 39.70ms 52.86us 10.54ms 14.04us srv_cleanup_idle_connections 21 1.405ms 66.89us 30.82us 1.467us task_run_applet 16 1.058ms 66.13us 446.2us 27.89us accept_queue_process 7 34.53us 4.933us 333.1us 47.58us	2021-02-25 09:44:16 +01:00
Willy Tarreau	45499c56d3	MINOR: task: make grq_total atomic to move it outside of the grq_lock Instead of decrementing grq_total once per task picked from the global run queue, let's do it at once after the loop like we do for other counters. This simplifies the code everywhere. It is not expected to bring noticeable improvements however, since global tasks tend to be less common nowadays.	2021-02-25 09:44:16 +01:00
Willy Tarreau	c9afbb10f5	MINOR: task: don't decrement then increment the local run queue Now we don't need to decrement rq_total when we pick a task in the tree to immediately increment it again after installing it into the local list. Instead, we simply add to the local queue count the number of globally picked tasks. Avoiding this shows ~0.5% performance gains at 1Mreq/s (2M task switches/s).	2021-02-25 09:44:16 +01:00
Willy Tarreau	2b363ac092	MINOR: task: do not use __task_unlink_rq() from process_runnable_tasks() As indicated in previous commit, this function tries to guess which tree the task is in to figure what counters to update, while we already have that info in the caller. Let's just pick the relevant parts to place them in the caller.	2021-02-25 09:44:16 +01:00
Willy Tarreau	e7923c1d22	MINOR: task: split the counts of local and global tasks picked In process_runnable_tasks() we're still calling __task_unlink_rq() to pick a task, and this function tries to guess where to pick the task from and which counter to update while the caller's context already has everything. Worse, the number of local tasks is decremented then recredited, doubling the operations. In order to avoid this we first need to keep separate counters for local and global tasks that were picked. This is what this patch does.	2021-02-25 09:44:16 +01:00
Willy Tarreau	9c6dbf0eea	CLEANUP: task: split the large tasklet_wakeup_on() function in two This function has become large with the multi-queue scheduler. We need to keep the fast path and the debugging parts inlined, but the rest now moves to task.c just like was done for task_wakeup(). This has reduced the code size by 6kB due to less inlining of large parts that are always context-dependent, and as a side effect, has increased the overall performance by 1%.	2021-02-24 17:55:58 +01:00
Willy Tarreau	955a11ebfa	MINOR: task: move the allocated tasks counter to the per-thread struct The nb_tasks counter was still global and gets incremented and decremented for each task_new()/task_free(), and was read in process_runnable_tasks(). But it's only used for stats reporting, so doing this so often is pointless and expensive. Let's move it to the task_per_thread struct and have the stats sum it when needed.	2021-02-24 17:42:04 +01:00
Willy Tarreau	eeffb3df41	MINOR: task: limit the remote thread wakeup to the global runqueue only The test in __task_wakeup() to figure if the remote threads are sleeping doesn't make sense outside of the global runqueue test, since there are only two possibilities here: local runqueue or global runqueue, hence a sleeping thread is another one and can only happen when sending to the global run queue. Let's move the test inside the "if" block.	2021-02-24 17:42:04 +01:00
Willy Tarreau	018564eaa2	CLEANUP: task: move the tree root detection from __task_wakeup() to task_wakeup() Historically we used to call __task_wakeup() with a known tree root but this is not the case and the code has remained needlessly complicated with the root calculation in task_wakeup() passed in argument to __task_wakeup() which compares it again. Let's get rid of this and just move the detection code there. This eliminates some ifdefs and allows to simplify the test conditions quite a bit.	2021-02-24 17:42:04 +01:00
Willy Tarreau	1f3b1417b8	CLEANUP: tasks: use a less confusing name for task_list_size This one is systematically misunderstood due to its unclear name. It is in fact the number of tasks in the local tasklet list. Let's call it "tasks_in_list" to remove some of the confusion.	2021-02-24 17:42:04 +01:00
Willy Tarreau	2c41d77ebc	MINOR: tasks: do not maintain the rqueue_size counter anymore This one is exclusively used as a boolean nowadays and is non-zero only when the thread-local run queue is not empty. Better check the root tree's pointer and avoid updating this counter all the time.	2021-02-24 17:42:04 +01:00
Willy Tarreau	9c7b8085f4	MEDIUM: task: remove the tasks_run_queue counter and have one per thread This counter is solely used for reporting in the stats and is the hottest thread contention point to date. Moving it to the scheduler and having a separate one for the global run queue dramatically improves the performance, showing a 12% boost on the request rate on 16 threads! In addition, the thread debugging output which used to rely on rqueue_size was not totally accurate as it would only report task counts. Now we can return the exact thread's run queue length. It is also interesting to note that there are still a few other task/tasklet counters in the scheduler that are not efficiently updated because some cover a single area and others cover multiple areas. It looks like having a distinct counter for each of the following entries would help and would keep the code a bit cleaner: - global run queue (tree) - per-thread run queue (tree) - per-thread shared tasklets list - per-thread local lists Maybe even splitting the shared tasklets lists between pure tasklets and tasks instead of having the whole and tasks would simplify the code because there remain a number of places where several counters have to be updated.	2021-02-24 17:42:04 +01:00
Willy Tarreau	c6ba9a0b9b	MINOR: sched: have one runqueue ticks counter per thread The runqueue_ticks counts the number of task wakeups and is used to position new tasks in the run queue, but since we've had per-thread run queues, the values there are not very relevant anymore and the nice value doesn't apply well if some threads are more loaded than others. In addition, letting all threads compete over a shared counter is not smart as this may cause some excessive contention. Let's move this index close to the run queues themselves, i.e. one per thread and a global one. In addition to improving fairness, this has increased global performance by 2% on 16 threads thanks to the lower contention on rqueue_ticks. Fairness issues were not observed, but if any were to be, this patch could be backported as far as 2.0 to address them.	2021-02-20 13:03:37 +01:00
Willy Tarreau	4e2282f9bf	MEDIUM: tasks/activity: collect per-task statistics when profiling is enabled Now when the profiling is enabled, the scheduler will update per-function task-level statistics on number of calls, cpu usage and latency, that could later be checked using "show profiling". This will immediately make it obvious what functions are responsible for others' high latencies or which ones are suffering from others, and should help spot issues like undesired wakeups. For now the stats are only collected but not reported (though they are readable from sched_activity[] under gdb).	2021-01-29 12:10:33 +01:00
Willy Tarreau	4d6c594998	BUG/MEDIUM: task: close a possible data race condition on a tasklet's list link In issue #958 Ashley Penney reported intermittent crashes on AWS's ARM nodes which would not happen on x86 nodes. After investigation it turned out that the Neoverse N1 CPU cores used in the Graviton2 CPU are much more aggressive than the usual Cortex A53/A72/A55 or any x86 regarding memory ordering. The issue that was triggered there is that if a tasklet_wakeup() call is made on a tasklet scheduled to run on a foreign thread and that tasklet is just being dequeued to be processed, there can be a race at two places: - if MT_LIST_TRY_ADDQ() happens between MT_LIST_BEHEAD() and LIST_SPLICE_END_DETACHED() if the tasklet is alone in the list, because the emptiness test matches ; - if MT_LIST_TRY_ADDQ() happens during LIST_DEL_INIT() in run_tasks_from_lists(), then depending on how LIST_DEL_INIT() ends up being implemented, it may even corrupt the adjacent nodes while they're being reused for the in-tree storage. This issue was introduced in 2.2 when support for waking up remote tasklets was added. Initially the attachment of a tasklet to a list was enough to know its status and this used to be stable information. Now it's not sufficient to rely on this anymore, thus we need to use a different information. This patch solves this by adding a new task flag, TASK_IN_LIST, which is atomically set before attaching a tasklet to a list, and is only removed after the tasklet is detached from a list. It is checked by tasklet_wakeup_on() so that it may only be done while the tasklet is out of any list, and is cleared during the state switch when calling the tasklet. Note that the flag is not set for pure tasks as it's not needed. However this introduces a new special case: the function tasklet_remove_from_tasklet_list() needs to keep both states in sync and cannot check both the state and the attachment to a list at the same time. This function is already limited to being used by the thread owning the tasklet, so in this case the test remains reliable. However, just like its predecessors, this function is wrong by design and it should probably be replaced with a stricter one, a lazy one, or be totally removed (it's only used in checks to avoid calling a possibly scheduled event, and when freeing a tasklet). Regardless, for now the function exists so the flag is removed only if the deletion could be done, which covers all cases we're interested in regarding the insertion. This removal is safe against a concurrent tasklet_wakeup_on() since MT_LIST_DEL() guarantees the atomic test, and will ultimately clear the flag only if the task could be deleted, so the flag will always reflect the last state. This should be carefully backported as far as 2.2 after some observation period. This patch depends on previous patch "MINOR: task: remove __tasklet_remove_from_tasklet_list()".	2020-11-30 18:17:59 +01:00
Willy Tarreau	2da4c316c2	MINOR: task: remove __tasklet_remove_from_tasklet_list() This function is only used at a single place directly within the scheduler in run_tasks_from_lists() and it really ought not be called by anything else, regardless of what its comment says. Let's delete it, move the two lines directly into the call place, and take this opportunity to factor the atomic decrement on tasks_run_queue. A comment was added on the remaining one tasklet_remove_from_tasklet_list() to mention the risks in using it.	2020-11-30 18:17:44 +01:00
Willy Tarreau	c309dbdd99	MINOR: task: perform atomic counter increments only once per wakeup In process_runnable_tasks(), we walk the run queue and pick tasks to insert them into the local list. And for each of these operations we perform a few increments, some of which are atomic, and they're even performed under the runqueue's lock. This is useless inside the loop, better do them at the end, since we don't use these values inside the loop and they're not used anywhere else either during this time. The only one is task_list_size which is accessed in parallel by other threads performing remote tasklet wakeups, but it's already approximative and is used to decide to get out of the loop when the limit is reached. So now we compute it first as an initial budget instead.	2020-11-30 18:17:44 +01:00
Willy Tarreau	a868c2920b	MINOR: task: remove tasklet_insert_into_tasklet_list() This function is only called at a single place and adds more confusion than it removes. It also makes one think it could be used outside of the scheduler while it must absolutely not. Let's just move its two lines to the call place, making the code more readable there. In addition this clearly shows that the preliminary LIST_INIT() is useless since the entry is immediately overwritten.	2020-11-30 18:17:44 +01:00
Willy Tarreau	69a7b8fc6c	CLEANUP: task: remove the unused and mishandled global_rqueue_size This counter is only updated and never used, and in addition it's done without any atomicity so it's very unlikely to be correct on multi-CPU systems! Let's just remove it since it's not used.	2020-10-19 14:08:13 +02:00
Willy Tarreau	d48ed6643b	MEDIUM: task: use an upgradable seek lock when scanning the wait queue Right now when running a configuration with many global timers (e.g. many health checks), there is a lot of contention on the global wait queue lock because all threads queue up in front of it to scan it. With 2000 servers checked every 10 milliseconds (200k checks per second), after 23 seconds running on 8 threads, the lock stats were this high: Stats about Lock TASK_WQ: write lock : 9872564 write unlock: 9872564 (0) wait time for write : 9208.409 msec wait time for write/lock: 932.727 nsec read lock : 240367 read unlock : 240367 (0) wait time for read : 149.025 msec wait time for read/lock : 619.991 nsec i.e. ~5% of the total runtime spent waiting on this specific lock. With upgradable locks we don't need to work like this anymore. We can just try to upgrade the read lock to a seek lock before scanning the queue, then upgrade the seek lock to a write lock for each element we want to delete there and immediately downgrade it to a seek lock. The benefit is double: - all other threads which need to call next_expired_task() before polling won't wait anymore since the seek lock is compatible with the read lock ; - all other threads competing on trying to grab this lock will fail on the upgrade attempt from read to seek, and will let the current lock owner finish collecting expired entries. Doing only this has reduced the wake_expired_tasks() CPU usage in a very large servers test from 2.15% to 1.04% as reported by perf top, and increased by 3% the health check rate (all threads being saturated). This is expected to help against (and possibly solve) the problem described in issue #875.	2020-10-16 17:15:54 +02:00
Willy Tarreau	3cfaa8d1e0	BUG/MEDIUM: task: bound the number of tasks picked from the wait queue at once There is a theoretical problem in the wait queue, which is that with many threads, one could spend a lot of time looping on the newly expired tasks, causing a lot of contention on the global wq_lock and on the global rq_lock. This initially sounds benign, but if another thread does just a task_schedule() or task_queue(), it might end up waiting for a long time on this lock, and this wait time will count on its execution budget, degrading the end user's experience and possibly risking to trigger the watchdog if that lasts too long. The simplest (and backportable) solution here consists in bounding the number of expired tasks that may be picked from the global wait queue at once by a thread, given that all other ones will do it as well anyway. We don't need to pick more than global.tune.runqueue_depth tasks at once as we won't process more, so this counter is updated for both the local and the global queues: threads with more local expired tasks will pick less global tasks and conversely, keeping the load balanced between all threads. This will guarantee a much lower latency if/when wakeup storms happen (e.g. hundreds of thousands of synchronized health checks). Note that some crashes have been witnessed with 1/4 of the threads in wake_expired_tasks() and, while the issue might or might not be related, not having reasonable bounds here definitely justifies why we can spend so much time there. This patch should be backported, probably as far as 2.0 (maybe with some adaptations).	2020-10-16 15:18:48 +02:00
Willy Tarreau	6ce0232a78	BUILD: task: work around a bogus warning in gcc 4.7/4.8 at -O1 As reported in issue #816, when building task.o at -O1 with gcc 4.7 or 4.8, we get the following warning: CC src/task.o In file included from include/haproxy/proxy.h:31:0, from include/haproxy/cfgparse.h:27, from src/task.c:19: src/task.c: In function 'next_timer_expiry': include/haproxy/ticks.h:121:10: warning: 'key' may be used uninitialized in this function [-Wmaybe-uninitialized] src/task.c:349:2: note: 'key' was declared here It is wrong since the condition to use 'key' is exactly the same as the one used to set it. This warning disappears at -O2 and disappeared from gcc 5 and above. Let's just initialize 'key' there, it only adds 16 bytes of code and remains cheap enough for this function. This should be backported to 2.2.	2020-08-21 05:54:00 +02:00
Willy Tarreau	e5d79bccc0	MINOR: tasks/debug: add a few BUG_ON() to detect use of wrong timer queue This aims at catching calls to task_unlink_wq() performed by the wrong thread based on the shared status for the task, as well as calls to __task_queue() with the wrong timer queue being used based on the task's capabilities. This will at least help eliminate some hypothesis during debugging sessions when suspecting that a wrong thread has attempted to queue a task at the wrong place.	2020-07-22 14:42:52 +02:00
Willy Tarreau	783afbe93b	BUG/MAJOR: tasks: don't requeue global tasks into the local queue A bug was introduced by commit `77015abe0` ("MEDIUM: tasks: clean up the front side of the wait queue in wake_expired_tasks()"): front tasks that are not yet expired were incorrectly requeued into the local wait queue instead of the global one. Because of this, the same task could be found by the same thread on next invocation and be unlinked without locking, allowing another thread to requeue it in parallel, and conversely another thread could unlink it while the task was being walked over, causing all sorts of crashes and endless loops in wake_expired_tasks() and affiliates. This bug can easily be triggered by stressing the do_resolve action in multi-thread (after applying the fixes required to get do_resolve to work with threads). It certainly is the cause of issue #758. This must be backported to 2.2 only.	2020-07-22 14:12:45 +02:00
Willy Tarreau	273aea479d	BUG/MAJOR: tasks: make sure to always lock the shared wait queue if needed In run_tasks_from_task_list() we may free some tasks that have been killed. Before doing so we unlink them from the wait queue. But if such a task is in the global wait queue, the queue isn't locked so this can result in corrupting the global task list and causing loops or crashes. It's very likely one cause of issue #758. This must be backported to 2.2. For 2.1 there doesn't seem to be any case where a task could be freed this way while in the global queue, but it doesn't cost much to apply the same change (the code is in process_runnable_task there).	2020-07-17 14:37:51 +02:00
Willy Tarreau	950954f5f7	MINOR: tasks: use MT_LIST_ADDQ() when killing tasks. A bug in task_kill() was fixed by commit `54d31170a` ("BUG/MAJOR: sched: make sure task_kill() always queues the task") which added a list initialization before adding an element. But in fact an unconditional addition would have done the same and been simpler than first initializing then checking the element was initialized. Let's use MT_LIST_ADDQ() there to add the task to kill into the shared queue and kill the dirty LIST_INIT().	2020-07-10 08:52:13 +02:00
Willy Tarreau	de4db17dee	MINOR: lists: rename some MT_LIST operations to clarify them Initially when mt_lists were added, their purpose was to be used with the scheduler, where anyone may concurrently add the same tasklet, so it sounded natural to implement a check in MT_LIST_ADD{,Q}. Later their usage was extended and MT_LIST_ADD{,Q} started to be used in situations where the element to be added was exclusively owned by the one performing the operation so a conflict was impossible. This became more obvious with the idle connections and the new macro was called MT_LIST_ADDQ_NOCHECK. But this remains confusing and at many places it's not expected that an MT_LIST_ADD could possibly fail, and worse, at some places we start by initializing it before adding (and the test is superfluous) so let's rename them to something more conventional to denote the presence of the check or not: MT_LIST_ADD{,Q} : unconditional operation, the caller owns the element, and doesn't care about the element's current state (exactly like LIST_ADD) MT_LIST_TRY_ADD{,Q}: only perform the operation if the element is not already added or in the process of being added. This means that the previously "safe" MT_LIST_ADD{,Q} are not "safe" anymore. This also means that in case of backport mistakes in the future causing this to be overlooked, the slower and safer functions will still be used by default. Note that the missing unchecked MT_LIST_ADD macro was added. The rest of the code will have to be reviewed so that a number of callers of MT_LIST_TRY_ADDQ are changed to MT_LIST_ADDQ to remove the unneeded test.	2020-07-10 08:50:41 +02:00
Willy Tarreau	4f58926352	BUG/MAJOR: sched: make it work also when not building with DEBUG_STRICT Sadly, the fix from commit `54d31170a` ("BUG/MAJOR: sched: make sure task_kill() always queues the task") broke the builds without DEBUG_STRICT as, in order to be careful, it placed a BUG_ON() around the previously failing condition to check for any new possible failure, but this BUG_ON strips the condition when DEBUG_STRICT is not set. We don't want BUG_ON to evaluate any condition either as some debugging code calls possibly expensive ones (e.g. in htx_get_stline). Let's just drop the useless BUG_ON(). No backport is needed, this is 2.2-dev.	2020-07-02 17:17:42 +02:00
Willy Tarreau	54d31170a9	BUG/MAJOR: sched: make sure task_kill() always queues the task task_kill() may fail to queue a task if this task has never ever run, because its equivalent (tasklet->list) member has never been "emptied" since it didn't pass through the LIST_DEL_INIT() that's performed by run_tasks_from_lists(). This results in these tasks to never be freed. It happens during the mux takeover since the target task usually is the timeout task which, by definition, has never run yet. This fixes commit `eb8c2c69f` ("MEDIUM: sched: implement task_kill() to kill a task") which was introduced after 2.2-dev11 and doesn't need to be backported.	2020-07-02 14:14:00 +02:00
Willy Tarreau	eb8c2c69fa	MEDIUM: sched: implement task_kill() to kill a task task_kill() may be used by any thread to kill any task with less overhead than a regular wakeup. In order to achieve this, it bypasses the priority tree and inserts the task directly into the shared tasklets list, cast as a tasklet. The task_list_size is updated to make sure it is properly decremented after execution of this task. The task will thus be picked by process_runnable_tasks() after checking the tree and sent to the TL_URGENT list, where it will be processed and killed. If the task is bound to more than one thread, its first thread will be the one notified. If the task was already queued or running, nothing is done, only the flag is added so that it gets killed before or after execution. Of course it's the caller's responsibility to make sure any resources allocated by this task were already cleaned up or taken over.	2020-07-01 16:35:53 +02:00
Willy Tarreau	8a6049c268	MEDIUM: sched: create a new TASK_KILLED task flag This flag, when set, will be used to indicate that the task must die. At the moment this may only be placed by the task itself or by the scheduler when placing it into the TL_NORMAL queue.	2020-07-01 16:35:49 +02:00
Willy Tarreau	d99177f86d	MINOR: sched: make sched->task_list_size atomic We'll need to update it from foreign threads in order to throw killed tasks and maintain correct accounting, so let's make it atomic.	2020-07-01 16:35:41 +02:00
Willy Tarreau	1553b6657d	BUG/MINOR: sched: properly cover for a rare MT_LIST_ADDQ() race In commit `3ef7a190b` ("MEDIUM: tasks: apply a fair CPU distribution between tasklet classes") we compute a total weight to be used to split the CPU time between queues. There is a mention that the total cannot be null, which is based on the fact that we only get there if thread_has_task() returns non-zero. But there is a very small race which can break this assumption: if two threads conflict on MT_LIST_ADDQ() on an empty shared list and both roll back before trying again, there is the possibility that a first call to MT_LIST_ISEMPTY() sees the first thread install itself, then the second call will see the list empty when both roll back. Thus we could proceed with the queue while it's temporarily empty and compute max lengths using a divide by zero. This case is very hard to trigger, it seldom happens on 16 threads at 400k req/s. Let's simply test for max_total and leave the loop when we've not found any work. No backport is needed, that's 2.2-only.	2020-06-30 14:06:19 +02:00
Willy Tarreau	e7723bddd7	MEDIUM: tasks: add a tune.sched.low-latency option Now that all tasklet queues are scanned at once by run_tasks_from_lists(), it becomes possible to always check for lower priority classes and jump back to them when they exist. This patch adds tune.sched.low-latency global setting to enable this behavior. What it does is stick to the lowest ranked priority list in which tasks are still present with an available budget, and leave the loop to refill the tasklet lists if the trees got new tasks or if new work arrived into the shared urgent queue. Doing so allows to cut the latency in half when running with extremely deep run queues (10k-100k), thus allowing forwarding of small and large objects to coexist better. It remains off by default since it does have a small impact on large traffic (shorter batches).	2020-06-24 12:21:26 +02:00
Willy Tarreau	59153fef86	MINOR: tasks: make run_tasks_from_lists() scan the queues itself Now process_runnable_tasks is responsible for calculating the budgets for each queue, dequeuing from the tree, and calling run_tasks_from_lists(). This latter one scans the queues, picking tasks there and respecting budgets. Note that its name was updated with a plural "s" for this reason.	2020-06-24 12:21:26 +02:00
Willy Tarreau	ba48d5c8f9	MINOR: tasks: pass the queue index to run_task_from_list() Instead of passing it a pointer to the queue, pass it the queue's index so that it can perform all the work around current_queue and tl_class_mask.	2020-06-24 12:21:26 +02:00
Willy Tarreau	49f90bf148	MINOR: tasks: add a mask of the queues with active tasklets It is neither convenient nor scalable to check each and every tasklet queue to figure whether it's empty or not while we often need to check them all at once. This patch introduces a tasklet class mask which gets a bit 1 set for each queue representing one class of service. A single test on the mask allows to figure whether there's still some work to be done. It will later be usable to better factor the runqueue code. Bits are set when tasklets are queued. They're cleared when queues are emptied. It is possible that a queue is empty but has a bit if a tasklet was added then removed, but this is not a problem as this is properly checked for in run_tasks_from_list().	2020-06-24 12:21:26 +02:00
Willy Tarreau	c0a08ba2df	MINOR: tasks: make current_queue an index instead of a pointer It will be convenient to have the tasklet queue number soon, better make current_queue an index rather than a pointer to the queue. When not currently running (e.g. from I/O), the index is -1.	2020-06-24 12:21:26 +02:00
Willy Tarreau	3ef7a190b0	MEDIUM: tasks: apply a fair CPU distribution between tasklet classes Till now in process_runnable_tasks() we used to reserve a fixed portion of max_processed to urgent tasks, then a portion of what remains for normal tasks, then what remains for bulk tasks. This causes two issues: - the current budget for processed tasks could be drained once for all by higher level tasks so that they couldn't have enough left for the next run. For example, if bulk tasklets cause task wakeups, the required share to run them could be eaten by other bulk tasklets. - it forces the urgent tasks to be run before scanning the tree so that we know how many tasks to pick from the tree, and this isn't very efficient cache-wise. This patch changes this so that we compute upfront how max_processed will be shared between classes that require so. We can then decide in advance to pick a certain number of tasks from the tree, then execute all tasklets in turn. When reaching the end, if there's still some budget, we can go back and do the same thing again, improving chances to pick new work before the global budget is depleted. The default weights have been set to 50% for urgent tasklets, 37% for normal ones and 13% for the bulk ones. In practice, there are not that many urgent tasklets but when they appear they are cheap and must be processed in as large batches as possible. Every time there is nothing to pick there, the unused budget is shared between normal and bulk and this allows bulk tasklets to still have quite some CPU to run on.	2020-06-24 12:21:26 +02:00
Willy Tarreau	116ef223d2	MINOR: task: add a new pointer to current tasklet queue In task_per_thread[] we now have current_queue which is a pointer to the current tasklet_list entry being evaluated. This will be used to know the class under which the current task/tasklet is currently running.	2020-06-23 16:35:38 +02:00
Willy Tarreau	0c0c85ed9d	BUG/MINOR: tasks: make sure never to exceed max_processed We want to be sure not to exceed max_processed. It can actually go slightly negative due to the rounding applied to ratios, but we must refrain from processing too many tasks if it's already low. This became particularly relevant since recent commit `5c8be272c` ("MEDIUM: tasks: also process late wakeups in process_runnable_tasks()") which was merged into 2.2-dev10. No backport is needed.	2020-06-23 11:34:40 +02:00
Willy Tarreau	5c8be272c7	MEDIUM: tasks: also process late wakeups in process_runnable_tasks() Since version 1.8, we've started to use tasks and tasklets more extensively to defer I/O processing. Originally with the simple scheduler, a task waking another one up using task_wakeup() would have caused it to be processed right after the list of runnable ones. With the introduction of tasklets, we've started to spill running tasks from the run queues to the tasklet queues, so if a task wakes another one up, it will only be executed on the next call to process_runnable_task(), which means after yet another round of polling loop. This is particularly visible with I/Os hitting muxes: poll() reports a read event, the connection layer performs a tasklet_wakeup() on the mux subscribed to this I/O, and this mux in turn signals the upper layer stream using task_wakeup(). The process goes back to poll() with a null timeout since there's one active task, then back to checking all possibly expired events, and finally back to process_runnable_tasks() again. Worse, when there is high I/O activity, doing so will make the task's execution further apart from the tasklet and will both increase the total processing latency and reduce the cache hit ratio. This patch brings back to the original spirit of process_runnable_tasks() which is to execute runnable tasks as long as the execution budget is not exhausted. By doing so, we're immediately cutting in half the number of calls to all functions called by run_poll_loop(), and halving the number of calls to poll(). Furthermore, calling poll() less often also means purging FD updates less often and offering more chances to merge them. This also has the nice effect of making tune.runqueue-depth effective again, as in the past it used to be quickly bounded by this artificial event horizon which was preventing from executing remaining tasks. On certain workloads we can see a 2-3% performance increase.	2020-06-19 14:21:46 +02:00
Willy Tarreau	77015abe0b	MEDIUM: tasks: clean up the front side of the wait queue in wake_expired_tasks() Due to the way the wait queue works, some tasks might be postponed but not requeued. However when we exit wake_expired_tasks() on a not-yet-expired task and leave it in this situation, the next call to next_timer_expiry() will use this first task's key in the tree as an expiration date, but this date might be totally off and cause needless wakeups just to reposition it. This patch makes sure that we leave wake_expired_tasks with a clean state of frontside tasks and that their tree's key matches their expiration date. Doing so we can already observe a ~15% reduction of the number of wakeups when dealing with large numbers of health checks. The patch looks large because the code was rearranged but the real change is to take the wakeup/requeue decision on the task's expiration date instead of the tree node's key, the rest is unchanged.	2020-06-19 14:21:46 +02:00
Willy Tarreau	b2551057af	CLEANUP: include: tree-wide alphabetical sort of include files This patch fixes all the leftovers from the include cleanup campaign. There were not that many (~400 entries in ~150 files) but it was definitely worth doing it as it revealed a few duplicates.	2020-06-11 10:18:59 +02:00
Willy Tarreau	dfd3de8826	REORG: include: move stream.h to haproxy/stream{,-t}.h This one was not easy because it was embarking many includes with it, which other files would automatically find. At least global.h, arg.h and tools.h were identified. 93 total locations were identified, 8 additional includes had to be added. In the rare files where it was possible to finalize the sorting of includes by adjusting only one or two extra lines, it was done. But all files would need to be rechecked and cleaned up now. It was the last set of files in types/ and proto/ and these directories must not be reused anymore.	2020-06-11 10:18:58 +02:00
Willy Tarreau	a264d960f6	REORG: include: move proxy.h to haproxy/proxy{,-t}.h This one is particularly difficult to split because it provides all the functions used to manipulate a proxy state and to retrieve names or IDs for error reporting, and as such, it was included in 73 files (down to 68 after cleanup). It would deserve a small cleanup though the cut points are not obvious at the moment given the number of structs involved in the struct proxy itself.	2020-06-11 10:18:58 +02:00
Willy Tarreau	cea0e1bb19	REORG: include: move task.h to haproxy/task{,-t}.h The TASK_IS_TASKLET() macro was moved to the proto file instead of the type one. The proto part was a bit reordered to remove a number of ugly forward declarations of static inline functions. About ten C and H files had their dependency dropped since they were not using anything from task.h.	2020-06-11 10:18:58 +02:00
Willy Tarreau	0f6ffd652e	REORG: include: move fd.h to haproxy/fd{,-t}.h A few includes were missing in each file. A definition of struct polled_mask was moved to fd-t.h. The MAX_POLLERS macro was moved to defaults.h. Stdio used to be silently inherited from whatever path but it's needed for list_pollers() which takes a FILE* and which can thus not be forward-declared.	2020-06-11 10:18:57 +02:00
Willy Tarreau	48fbcae07c	REORG: tools: split common/standard.h into haproxy/tools{,-t}.h And also rename standard.c to tools.c. The original split between tools.h and standard.h dates from version 1.3-dev and was mostly an accident. This patch moves the files back to what they were expected to be, and takes care of not changing anything else. However this time tools.h was split between functions and types, because it contains a small number of commonly used macros and structures (e.g. name_desc) which in turn cause the massive list of includes of tools.h to conflict with the callers. They remain the ugliest files of the whole project and definitely need to be cleaned and split apart. A few types are defined there only for functions provided there, and some parts are even OS-specific and should move somewhere else, such as the symbol resolution code.	2020-06-11 10:18:57 +02:00
Willy Tarreau	d0ef439699	REORG: include: move common/memory.h to haproxy/pool.h Now the file is ready to be stored into its final destination. A few minor reorderings were performed to keep the file properly organized, making the various sections more visible (cache & lockless). In addition and to stay consistent, memory.c was renamed to pool.c.	2020-06-11 10:18:57 +02:00
Willy Tarreau	6634794992	REORG: include: move freq_ctr to haproxy/ types/freq_ctr.h was moved to haproxy/freq_ctr-t.h and proto/freq_ctr.h was moved to haproxy/freq_ctr.h. Files were updated accordingly, no other change was applied.	2020-06-11 10:18:56 +02:00
Willy Tarreau	92b4f1372e	REORG: include: move time.h from common/ to haproxy/ This one is included almost everywhere and used to rely on a few other .h that are not needed (unistd, stdlib, standard.h). It could possibly make sense to split it into multiple parts to distinguish operations performed on timers and the internal time accounting, but at this point it does not appear much important.	2020-06-11 10:18:56 +02:00
Willy Tarreau	af613e8359	CLEANUP: thread: rename __decl_hathreads() to __decl_thread() I can never figure whether it takes an "s" or not, and in the end it's better if it matches the file's naming, so let's call it "__decl_thread".	2020-06-11 10:18:56 +02:00
Willy Tarreau	853b297c9b	REORG: include: split mini-clist into haproxy/list and list-t.h Half of the users of this include only need the type definitions and not the manipulation macros nor the inline functions. Moving the various types into mini-clist-t.h makes the files cleaner. The other one had all its includes grouped at the top. A few files continued to reference it without using it and were cleaned. In addition it was about time that we'd rename that file, it's not "mini" anymore and contains a bit more than just circular lists.	2020-06-11 10:18:56 +02:00
Willy Tarreau	4c7e4b7738	REORG: include: update all files to use haproxy/api.h or api-t.h if needed All files that were including one of the following include files have been updated to only include haproxy/api.h or haproxy/api-t.h once instead: - common/config.h - common/compat.h - common/compiler.h - common/defaults.h - common/initcall.h - common/tools.h The choice is simple: if the file only requires type definitions, it includes api-t.h, otherwise it includes the full api.h. In addition, in these files, explicit includes for inttypes.h and limits.h were dropped since these are now covered by api.h and api-t.h. No other change was performed, given that this patch is large and affects 201 files. At least one (tools.h) was already freestanding and didn't get the new one added.	2020-06-11 10:18:42 +02:00
Willy Tarreau	8d2b777fe3	REORG: ebtree: move the include files from ebtree to include/import/ This is where other imported components are located. All files which used to directly include ebtree were touched to update their include path so that "import/" is now prefixed before the ebtree-related files. The ebtree.h file was slightly adjusted to read compiler.h from the common/ subdirectory (this is the only change). A build issue was encountered when eb32sctree.h is loaded before eb32tree.h because only the former checks for the latter before defining type u32. This was addressed by adding the reverse ifdef in eb32tree.h. No further cleanup was done yet in order to keep changes minimal.	2020-06-11 09:31:11 +02:00
Ilya Shipitsin	856aabcda5	CLEANUP: assorted typo fixes in the code and comments This is the 8th iteration of typo fixes.	2020-04-17 09:37:36 +02:00
Olivier Houchard	c62d9ab7cb	MINOR: tasks: Provide the tasklet to the callback. When tasklets were introduced, it was decided not to provide the tasklet to the callback, but NULL instead. While it may have been reasonable back then, maybe to be able to differentiate a task from a tasklet from the callback, it also means that we can't access the tasklet from the handler if the context provided can't be trusted. As no handler is shared between a task and a tasklet, and there are now other means of distinguishing between task and tasklet, just pass the tasklet pointer too. This may be backported to 2.1, 2.0 and 1.9 if needed.	2020-03-17 18:52:33 +01:00
Willy Tarreau	27d00c0167	MINOR: task: export run_tasks_from_list This will help refine debug traces.	2020-03-03 15:26:10 +01:00
Willy Tarreau	952c2640b0	MINOR: task: don't set TASK_RUNNING on tasklets We can't clear flags on tasklets because we don't know if they're still present upon return (they all return NULL, maybe that could change in the future). As a side effect, once TASK_RUNNING is set, it's never cleared anymore, which is misleading and resulted in some incorrect flagging of bulk tasks in the recent scheduler changes. And the only reason for setting TASK_RUNNING on tasklets was to detect self-wakers, which is now done using a dedicated flag. So instead of setting this flag with no opportunity to clear it, let's simply not set it.	2020-01-31 18:37:03 +01:00
Willy Tarreau	1dfc9bbdc6	OPTIM: task: readjust CPU bandwidth distribution since last update Now that we can more accurately watch which connection is really being woken up from itself, it was desirable to re-adjust the CPU BW thresholds based on measurements. New tests with 60000 concurrent connections were run at 100 Gbps with unbounded queues and showed the following distribution:
    scenario            TC0 TC1 TC2   observation
    -------------------+---+---+----+---------------------------
    TCP conn rate     : 32, 51, 17
    HTTP conn rate    : 34, 41, 25
    TCP byte rate     :  2,  3, 95   (2 MB objects)
    splicing byte rate: 11,  6, 83   (2 MB objects)
    H2 10k object     : 44, 23, 33   client-limited
    mixed traffic     : 18, 10, 72   21m+10: 11kcps, 36 Gbps
The H2 experienced a huge change since it uses a persistent connection that was accidentally flagged in the previous test. The splicing test exhibits a higher need for short tasklets, so does the mixed traffic test. Given that latency mainly matters for conn rate and H2 here, the ratios were readjusted as 33% for TC0, 50% for TC1 and 17% for TC2, keeping in mind that whatever is not consumed by one class is automatically shared in equal proportions by the next one(s). This setting immediately provided a nice improvement as with the default settings (maxpollevents=200, runqueue-depth=200), the same ratios as above are still reported, while the time to request "show activity" on the CLI dropped to 30-50ms. The average loop time is around 5.7ms on the mixed traffic. In addition, one extra stress test at 90.5 Gbps with 5100 conn/s shows 70-100ms CLI request time, with an average loop time of 17 ms.	2020-01-31 18:37:01 +01:00
Willy Tarreau	d23d413e38	MINOR: task: make sched->current also reflect tasklets sched->current is used to know the current task/tasklet, and is currently only used by the panic dump code. However it turns out it was not set for tasklets, which prevents us from using it for more usages, despite the panic handling code already handling this case very well. Let's make sure it's now set.	2020-01-31 17:45:10 +01:00
Willy Tarreau	bb238834da	MINOR: task: permanently flag tasklets waking themselves up Commit `a17664d829` ("MEDIUM: tasks: automatically requeue into the bulk queue an already running tasklet") tried to inflict a penalty to self-requeuing tasks/tasklets which correspond to those involved in large, high-latency data transfers, for the benefit of all other processing which requires a low latency. However, it turns out that while it ought to do this on a case-by-case basis, basing itself on the RUNNING flag isn't accurate because this flag doesn't leave for tasklets, so we'd rather need a distinct flag to tag such tasklets. This commit introduces TASK_SELF_WAKING to mark tasklets acting like this. For now it's still set when TASK_RUNNING is present but this will have to change. The flag is kept across wakeups.	2020-01-31 17:45:10 +01:00
Willy Tarreau	c633607c06	OPTIM: task: refine task classes default CPU bandwidth ratios Measurements with unbounded execution ratios under 40000 concurrent connections at 100 Gbps showed the following CPU bandwidth distribution between task classes depending on traffic scenarios:
    scenario            TC0 TC1 TC2   observation
    -------------------+---+---+----+---------------------------
    TCP conn rate     : 29, 48, 23   221 kcps
    HTTP conn rate    : 29, 47, 24   200 kcps
    TCP byte rate     :  3,  5, 92   53 Gbps
    splicing byte rate:  5, 10, 85   70 Gbps
    H2 10k object     : 10, 21, 74   client-limited
    mixed traffic     :  4,  7, 89   21m+10: 11kcps, 36 Gbps
Thus it seems that we always need a bit of bulk tasks even for short connections, which seems to imply a suboptimal processing somewhere, and that there are roughly twice as many tasks (TC1=normal) as regular tasklets (TC0=urgent). This ratio stands even when data forwarding increases. So at first glance it looks reasonable to enforce the following ratio by default:
    - 16% for TL_URGENT
    - 33% for TL_NORMAL
    - 50% for TL_BULK
With this, the TCP conn rate climbs to ~225 kcps, and the mixed traffic pattern shows a more balanced 17kcps + 35 Gbps with 35ms CLI request time instead of 11kcps + 36 Gbps and 400 ms response time. The byte rate tests (1M objects) are not affected at all. This setting looks "good enough" to allow immediate merging, and could be refined later. It's worth noting that it holds up very well under a massive increase of run queue depth and maxpollevents: with the run queue depth changed from 200 to 10000 and maxpollevents to 10000 as well, the CLI's request time is back to the previous ~400ms, but the mixed traffic test reaches 52 Gbps + 7500 CPS, which was never met with the previous scheduling model, while the CLI used to show ~1 minute response time.
The reason is that in the bulk class it becomes possible to perform multiple rounds of recv+send and eliminate objects at once, increasing the L3 cache hit ratio, and keeping the connection count low, without degrading too much the latency. Another test with mixed traffic involving 2/3 splicing on huge objects and 1/3 on empty objects without touching any setting reports 51 Gbps + 5300 cps and 35ms CLI request time.	2020-01-31 07:09:10 +01:00
Willy Tarreau	a62917b890	MEDIUM: tasks: implement 3 different tasklet classes with their own queues We used to mix high latency tasks and low latency tasklets in the same list, and to even refill bulk tasklets there, causing some unfairness in certain situations (e.g. poll-less transfers between many connections saturating the machine with similarly-sized in and out network interfaces). This patch changes the mechanism to split the load into 3 lists depending on the task/tasklet's desired classes : - URGENT: this is mainly for tasklets used as deferred callbacks - NORMAL: this is for regular tasks - BULK: this is for bulk tasks/tasklets Arbitrary ratios of max_processed are picked from each of these lists in turn, with the ability to complete in one list from what was not picked in the previous one. After some quick tests, the following setup gave apparently good results both for raw TCP with splicing and for H2-to-H1 request rate: - 0 to 75% for urgent - 12 to 50% for normal - 12 to what remains for bulk Bulk is not used yet.	2020-01-30 18:59:33 +01:00
Willy Tarreau	4ffa0b526a	MINOR: tasks: move the list walking code to its own function New function run_tasks_from_list() will run over a tasklet list and will run all the tasks and tasklets it finds there within a limit of <max> that is passed in argument. This is preliminary work for scheduler QoS improvements.	2020-01-30 18:13:13 +01:00
Willy Tarreau	dd0e89a084	BUG/MAJOR: task: add a new TASK_SHARED_WQ flag to fix foreign requeuing Since 1.9 with commit `b20aa9eef3` ("MAJOR: tasks: create per-thread wait queues") a task bound to a single thread will not use locks when being queued or dequeued because the wait queue is assumed to be the owner thread's. But there exists a rare situation where this is not true: the health check tasks may be running on one thread waiting for a response, and may in parallel be requeued by another thread calling health_adjust() after detecting a response error in traffic when "observe l7" is set, and "fastinter" is lower than "inter", requiring to shorten the running check's timeout. In this case, the task being requeued was present in another thread's wait queue, thus opening a race during task_unlink_wq(), and gets requeued into the calling thread's wait queue instead of the running one's, opening a second race here. This patch aims at protecting against the risk of calling task_unlink_wq() from one thread while the task is queued on another thread, hence unlocked, by introducing a new TASK_SHARED_WQ flag. This new flag indicates that a task's position in the wait queue may be adjusted by other threads than the one currently executing it. This means that such WQ manipulations must be performed under a lock. There are two types of such tasks:
    - the global ones, using the global wait queue (technically speaking, those whose thread_mask has at least 2 bits set).
    - some local ones, which for now will be placed into the global wait queue as well in order to benefit from its lock.
The flag is automatically set on initialization if the task's thread mask indicates more than one thread. The caller must also set it if it intends to let other threads update the task's expiration delay (e.g. delegated I/Os), or if it intends to change the task's affinity over time as this could lead to the same situation.
Right now only the situation described above seems to be affected by this issue, and it is very difficult to trigger, and even then, will often have no visible effect beyond stopping the checks for example once the race is met. On my laptop it is feasible with the following config, chained to httpterm:
    global
        maxconn 400 # provoke FD errors, calling health_adjust()

    defaults
        mode http
        timeout client 10s
        timeout server 10s
        timeout connect 10s

    listen px
        bind :8001
        option httpchk /?t=50
        server sback 127.0.0.1:8000 backup
        server-template s 0-999 127.0.0.1:8000 check port 8001 inter 100 fastinter 10 observe layer7
This patch will automatically address the case for the checks because check tasks are created with multiple threads bound and will get the TASK_SHARED_WQ flag set. If in the future more tasks need to rely on this (multi-threaded muxes for example) and the use of the global wait queue becomes a bottleneck again, then it should not be too difficult to place locks on the local wait queues and queue the task on its bound thread. This patch needs to be backported to 2.1, 2.0 and 1.9. It depends on previous patch "MINOR: task: only check TASK_WOKEN_ANY to decide to requeue a task". Many thanks to William Dauchy for providing detailed traces allowing to spot the problem.	2019-12-19 14:42:22 +01:00
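The rule "set the flag when the thread mask has more than one bit" from the commit above can be illustrated with a short sketch. The flag value, the `demo_task` structure and the function names are hypothetical; only the decision logic is taken from the commit message.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_SHARED_WQ 0x0001u /* hypothetical flag value, not HAProxy's */

struct demo_task {
    unsigned long thread_mask; /* one bit per thread allowed to run the task */
    unsigned int state;        /* task flags */
};

/* Non-zero when at least two bits are set in <mask>: clearing the lowest
 * set bit with (mask - 1) leaves something behind only in that case.
 */
static int demo_multi_thread(unsigned long mask)
{
    return (mask & (mask - 1)) != 0;
}

/* Initialize a task; the shared-WQ flag is set automatically when the
 * mask designates more than one thread.
 */
static void demo_task_init(struct demo_task *t, unsigned long thread_mask)
{
    assert(t != NULL);
    t->thread_mask = thread_mask;
    t->state = demo_multi_thread(thread_mask) ? DEMO_SHARED_WQ : 0;
}

/* Tells whether wait-queue updates for this task must be done under a lock. */
static int demo_wq_needs_lock(const struct demo_task *t)
{
    return (t->state & DEMO_SHARED_WQ) != 0;
}
```

A caller that lets other threads adjust a single-thread task's expiration would set `DEMO_SHARED_WQ` explicitly after initialization, which is the second case the commit message describes.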
Willy Tarreau	8fe4253bf6	MINOR: task: only check TASK_WOKEN_ANY to decide to requeue a task After processing a task, its RUNNING bit is cleared and at the same time we check for other bits to decide whether to requeue the task or not. It happens that we only want to check the TASK_WOKEN_* bits, because : - TASK_RUNNING was just cleared - TASK_GLOBAL and TASK_QUEUE cannot be set yet as the task was running, preventing it from being requeued It's important not to catch yet undefined flags there because it would prevent addition of new task flags. This also shows more clearly that waking a task up with flags 0 is not something safe to do as the task will not be woken up if it's already running.	2019-12-19 14:42:22 +01:00
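A minimal sketch of the requeue decision described in the commit above, assuming made-up flag values and a single-threaded context (the commit describes the clearing and the check as happening at the same time, which a real scheduler would do with an atomic operation). The hypothetical `DEMO_FUTURE` bit stands for any flag added later and shows why testing only the woken bits matters.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical flag values, chosen only for the example. */
#define DEMO_RUNNING   0x0001u
#define DEMO_WOKEN_IO  0x0100u
#define DEMO_WOKEN_MSG 0x0200u
#define DEMO_WOKEN_ANY (DEMO_WOKEN_IO | DEMO_WOKEN_MSG)
#define DEMO_FUTURE    0x1000u /* stands for a flag introduced later */

/* Called after a task has run: clears RUNNING and reports whether the
 * task was woken up in the meantime and must be requeued. Only the
 * woken bits are tested, so unrelated flags never cause a requeue.
 */
static int demo_finish_run(unsigned int *state)
{
    assert(state != NULL);
    *state &= ~DEMO_RUNNING;
    return (*state & DEMO_WOKEN_ANY) != 0;
}
```

Had the test been "any bit other than RUNNING", the `DEMO_FUTURE` flag alone would have triggered a spurious requeue.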
Willy Tarreau	c49ba52524	MINOR: tasks: split wake_expired_tasks() in two parts to avoid useless wakeups We used to have wake_expired_tasks() wake up tasks and return the next expiration delay. The problem this causes is that we have to call it just before poll() in order to consider latest timers, but this also means that we don't wake up all newly expired tasks upon return from poll(), which thus systematically requires a second poll() round. This is visible when running any scheduled task like a health check, as there are systematically two poll() calls, one with the interval, nothing is done after it, and another one with a zero delay, and the task is called:
    listen test
        bind *:8001
        server s1 127.0.0.1:1111 check

    09:37:38.200959 clock_gettime(CLOCK_THREAD_CPUTIME_ID, {tv_sec=0, tv_nsec=8696843}) = 0
    09:37:38.200967 epoll_wait(3, [], 200, 1000) = 0
    09:37:39.202459 clock_gettime(CLOCK_THREAD_CPUTIME_ID, {tv_sec=0, tv_nsec=8712467}) = 0
    >> nothing run here, as the expired task was not woken up yet.
    09:37:39.202497 clock_gettime(CLOCK_THREAD_CPUTIME_ID, {tv_sec=0, tv_nsec=8715766}) = 0
    09:37:39.202505 epoll_wait(3, [], 200, 0) = 0
    09:37:39.202513 clock_gettime(CLOCK_THREAD_CPUTIME_ID, {tv_sec=0, tv_nsec=8719064}) = 0
    >> now the expired task was woken up
    09:37:39.202522 socket(AF_INET, SOCK_STREAM, IPPROTO_TCP) = 7
    09:37:39.202537 fcntl(7, F_SETFL, O_RDONLY|O_NONBLOCK) = 0
    09:37:39.202565 setsockopt(7, SOL_TCP, TCP_NODELAY, [1], 4) = 0
    09:37:39.202577 setsockopt(7, SOL_TCP, TCP_QUICKACK, [0], 4) = 0
    09:37:39.202585 connect(7, {sa_family=AF_INET, sin_port=htons(1111), sin_addr=inet_addr("127.0.0.1")}, 16) = -1 EINPROGRESS (Operation now in progress)
    09:37:39.202659 epoll_ctl(3, EPOLL_CTL_ADD, 7, {EPOLLOUT, {u32=7, u64=7}}) = 0
    09:37:39.202673 clock_gettime(CLOCK_THREAD_CPUTIME_ID, {tv_sec=0, tv_nsec=8814713}) = 0
    09:37:39.202683 epoll_wait(3, [{EPOLLOUT|EPOLLERR|EPOLLHUP, {u32=7, u64=7}}], 200, 1000) = 1
    09:37:39.202693 clock_gettime(CLOCK_THREAD_CPUTIME_ID, {tv_sec=0, tv_nsec=8818617}) = 0
    09:37:39.202701 getsockopt(7, SOL_SOCKET, SO_ERROR, [111], [4]) = 0
    09:37:39.202715 close(7) = 0
Let's instead split the function in two parts:
    - the first part, wake_expired_tasks(), called just before process_runnable_tasks(), wakes up all expired tasks; it doesn't compute any timeout.
    - the second part, next_timer_expiry(), called just before poll(), only computes the next timeout for the current thread.
Thanks to this, all expired tasks are properly woken up when leaving poll, and each poll call's timeout remains up to date:
    09:41:16.270449 clock_gettime(CLOCK_THREAD_CPUTIME_ID, {tv_sec=0, tv_nsec=10223556}) = 0
    09:41:16.270457 epoll_wait(3, [], 200, 999) = 0
    09:41:17.270130 clock_gettime(CLOCK_THREAD_CPUTIME_ID, {tv_sec=0, tv_nsec=10238572}) = 0
    09:41:17.270157 socket(AF_INET, SOCK_STREAM, IPPROTO_TCP) = 7
    09:41:17.270194 fcntl(7, F_SETFL, O_RDONLY|O_NONBLOCK) = 0
    09:41:17.270204 setsockopt(7, SOL_TCP, TCP_NODELAY, [1], 4) = 0
    09:41:17.270216 setsockopt(7, SOL_TCP, TCP_QUICKACK, [0], 4) = 0
    09:41:17.270224 connect(7, {sa_family=AF_INET, sin_port=htons(1111), sin_addr=inet_addr("127.0.0.1")}, 16) = -1 EINPROGRESS (Operation now in progress)
    09:41:17.270299 epoll_ctl(3, EPOLL_CTL_ADD, 7, {EPOLLOUT, {u32=7, u64=7}}) = 0
    09:41:17.270314 clock_gettime(CLOCK_THREAD_CPUTIME_ID, {tv_sec=0, tv_nsec=10337841}) = 0
    09:41:17.270323 epoll_wait(3, [{EPOLLOUT|EPOLLERR|EPOLLHUP, {u32=7, u64=7}}], 200, 1000) = 1
    09:41:17.270332 clock_gettime(CLOCK_THREAD_CPUTIME_ID, {tv_sec=0, tv_nsec=10341860}) = 0
    09:41:17.270340 getsockopt(7, SOL_SOCKET, SO_ERROR, [111], [4]) = 0
    09:41:17.270367 close(7) = 0
This may be backported to 2.1 and 2.0 though it's unlikely to bring any user-visible improvement except to clarify debugging.	2019-12-11 09:42:58 +01:00
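The two-part split described in the commit above can be sketched on a toy timer array. This is an illustration only: the real wait queue is a tree, and the `demo_*` names, the millisecond integer clock and the `DEMO_ETERNITY` sentinel are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_ETERNITY (-1) /* hypothetical sentinel: no timer pending */

struct demo_timer {
    int expire_ms; /* absolute expiration date */
    int woken;     /* set once the owning task has been woken up */
};

/* Part one: wake every timer whose date is reached. It computes no
 * timeout. Returns the number of timers woken up.
 */
static int demo_wake_expired(struct demo_timer *t, size_t n, int now_ms)
{
    size_t i;
    int woken = 0;

    assert(t != NULL || n == 0);
    for (i = 0; i < n; i++) {
        if (!t[i].woken && t[i].expire_ms <= now_ms) {
            t[i].woken = 1;
            woken++;
        }
    }
    return woken;
}

/* Part two: only compute the delay until the next pending timer, to be
 * used as the poll timeout. It wakes nothing up.
 */
static int demo_next_expiry(const struct demo_timer *t, size_t n, int now_ms)
{
    int best = DEMO_ETERNITY;
    size_t i;

    for (i = 0; i < n; i++) {
        int delay;

        if (t[i].woken)
            continue;
        delay = t[i].expire_ms > now_ms ? t[i].expire_ms - now_ms : 0;
        if (best == DEMO_ETERNITY || delay < best)
            best = delay;
    }
    return best;
}
```

In a loop following the order given in the commit message, `demo_wake_expired()` would run right after returning from the poller and before running tasks, and `demo_next_expiry()` right before the next poll, so a timer that expired during the sleep is handled without a second zero-delay poll.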
Olivier Houchard	06910464dd	MEDIUM: task: Split the tasklet list into two lists. As using an mt_list for the tasklet list is costly, instead use a regular list, but add an mt_list for tasklet woken up by other threads, to be run on the current thread. At the beginning of process_runnable_tasks(), we just take the new list, and merge it into the task_list. This should give us performances comparable to before we started using a mt_list, but allow us to use tasklet_wakeup() from other threads.	2019-10-11 16:37:41 +02:00
Olivier Houchard	07308677dd	BUG/MEDIUM: tasks: Don't forget to decrement tasks_run_queue. When executing tasks, don't forget to decrement tasks_run_queue once we popped one task from the task_list. tasks_run_queue used to be decremented by __tasklet_remove_from_tasklet_list(), but we now call MT_LIST_POP().	2019-10-03 14:55:40 +02:00
Willy Tarreau	d022e9c98b	MINOR: task: introduce a thread-local "sched" variable for local scheduler stuff The aim is to gather all scheduler information related to the current thread. It simply points to task_per_thread[tid] without having to perform the operation at each time. We save around 1.2 kB of code on performance sensitive paths and increase the request rate by almost 1%.	2019-09-24 11:23:30 +02:00
Willy Tarreau	d66d75656e	MINOR: task: split the tasklet vs task code in process_runnable_tasks() There are a number of tests there which are enforced on tasklets while they will never apply (various handlers, destroyed task or not, arguments, results, ...). Instead let's have a single TASK_IS_TASKLET() test and call the tasklet processing function directly, skipping all the rest. It now appears visible that the only unneeded code is the update to curr_task that is never used for tasklets, except for opportunistic reporting in the debug handler, which can only catch si_cs_io_cb, which in practice doesn't appear in any report so the extra cost incurred there is pointless. This change alone removes 700 bytes of code, mostly in process_runnable_tasks() and increases the performance by about 1%.	2019-09-24 11:23:30 +02:00
Willy Tarreau	4c1e1ad6a8	CLEANUP: task: cache the task_per_thread pointer In process_runnable_tasks() we perform a lot of dereferences to task_per_thread[tid] but tid is thread_local and the compiler cannot know that it doesn't change so this results in making lots of thread local accesses and array dereferences. By just keeping a copy pointer of this, we let the compiler optimize the code. Just doing this has reduced process_runnable_tasks() by 124 bytes in the fast path. Doing the same in wake_expired_tasks() results in 16 extra bytes saved.	2019-09-24 11:23:30 +02:00
Willy Tarreau	9b48c629f2	CLEANUP: task: remove impossible test In process_runnable_task(), after the task's process() function returns, we used to check if the return is not NULL and is not a tasklet, to update profiling measurements. This is useless since only tasks can return non-null here. Let's remove this useless test.	2019-09-24 11:23:30 +02:00
Olivier Houchard	ff1e9f39b9	MEDIUM: tasklets: Make the tasklet list a struct mt_list. Change the tasklet code so that the tasklet list is now a mt_list. That means that a tasklet now has an associated tid, for the thread it is expected to run on, and any thread can now call tasklet_wakeup() for that tasklet. One can change the associated tid with tasklet_set_tid().	2019-09-23 18:16:08 +02:00
Olivier Houchard	859dc80f94	MEDIUM: list: Separate "locked" list from regular list. Instead of using the same type for regular linked lists and "autolocked" linked lists, use a separate type, "struct mt_list", for the autolocked one, and introduce a set of macros, similar to the LIST_* macros, with the MT_ prefix. When we use the same entry for both regular list and autolocked list, as is done for the "list" field in struct connection, we now have to explicitly cast it to struct mt_list when using MT_ macros.	2019-09-23 18:16:08 +02:00
Willy Tarreau	64e6012eb9	MINOR: task: introduce work lists Sometimes we need to delegate some list processing to a function running on another thread. In this case the list element will simply be queued into a dedicated self-locked list and the task responsible for this list will be woken up, calling the associated function which will run over the list. This is what work_list does. Such lists will be dedicated to a limited type of work but will significantly ease such remote handling. A function is provided to create these per-thread lists, their tasks and to properly bind each task to a distinct thread, so that the caller only has to store the resulting pointer to the start of the structure. These structures should not be abused though as each head will consume 4 pointers per thread, hence 32 bytes per thread or 2 kB for 64 threads.	2019-07-12 09:07:48 +02:00
Willy Tarreau	bd20a9dd4e	BUG: tasks: fix bug introduced by latest scheduler cleanup In commit `86eded6c6` ("CLEANUP: tasks: rename task_remove_from_tasklet_list() to tasklet_remove_*") which consisted in removing the casts between tasks and tasklet, I was a bit too fast to believe that we only saw tasklets in this function since process_runnable_tasks() also uses it with tasks under a cast. So removing the bookkeeping on task_list_size was not appropriate. Bah, the joy of casts which hide the real thing... This patch does two things at once to address this mess once and for all: - it restores the decrement of task_list_size when it's a real task, but moves it to process_runnable_tasks() since it's the only place where it's allowed to call it with a task - it moves the increment there as well and renames task_insert_into_tasklet_list() to tasklet_insert_into_tasklet_list() for obvious consistency reasons. This way the increment/decrement of task_list_size is made at the only places where the cast is enforced, so it has less risk of being missed. The comments on top of these functions were updated to reflect that they are only supposed to be used with tasklets and that the caller is responsible for keeping task_list_size up to date if it decides to enforce a task there. Now we don't have to worry anymore about how these functions work outside of the scheduler, which is better longterm-wise. Thanks to Christopher for spotting this mistake. No backport is needed.	2019-06-14 18:16:19 +02:00
Willy Tarreau	86eded6c69	CLEANUP: tasks: rename task_remove_from_tasklet_list() to tasklet_remove_* The function really only operates on tasklets, its arguments are always tasklets cast as tasks to match the function's type, to be cast back to a struct tasklet. Let's rename it to tasklet_remove_from_tasklet_list(), take a struct tasklet, and get rid of the undesired task casts.	2019-06-14 14:57:03 +02:00
Willy Tarreau	5598d171b3	BUILD: task: fix a build warning when threads are disabled The __decl_hathreads() macro will leave a lone semi-colon making the end of variables declarations, resulting in a warning if threads are disabled. Let's simply swap it with the last variable. Thanks to Ilya Shipitsin for reporting this issue. No backport is needed.	2019-06-04 17:18:40 +02:00
Olivier Houchard	cfbb3e6560	MEDIUM: tasks: Get rid of active_tasks_mask. Remove the active_tasks_mask variable, we can deduce if we have work to do by other means, and it is costly to maintain. Instead, introduce a new function, thread_has_tasks(), that returns non-zero if there are tasks scheduled for the thread, zero otherwise.	2019-05-29 21:53:37 +02:00
Willy Tarreau	1e928c074b	MEDIUM: task: don't grab the WR lock just to check the WQ When profiling locks, it appears that the WQ's lock has become the most contended one, despite the WQ being split by thread. The reason is that each thread takes the WQ lock before checking if it does have something to do. In practice the WQ almost only contains health checks and rare tasks that can be scheduled anywhere, so this is a real waste of resources. This patch proceeds differently. Now that the WQ's lock was turned to RW lock, we proceed in 3 phases : 1) locklessly check for the queue's emptiness 2) take an R lock to retrieve the first element and check if it is expired. This way most visits are performed with an R lock to find and return the next expiration date. 3) if one expiration is found, we perform the WR-locked lookup as usual. As a result, on a one-minute test involving 8 threads and 64 streams at 1.3 million ctxsw/s, before this patch the lock profiler reported this : Stats about Lock TASK_WQ: # write lock : 1125496 # write unlock: 1125496 (0) # wait time for write : 263.143 msec # wait time for write/lock: 233.802 nsec # read lock : 0 # read unlock : 0 (0) # wait time for read : 0.000 msec # wait time for read/lock : 0.000 nsec And after : Stats about Lock TASK_WQ: # write lock : 173 # write unlock: 173 (0) # wait time for write : 0.018 msec # wait time for write/lock: 103.988 nsec # read lock : 1072706 # read unlock : 1072706 (0) # wait time for read : 60.702 msec # wait time for read/lock : 56.588 nsec Thus the contention was divided by 4.3.	2019-05-28 19:15:44 +02:00
Willy Tarreau	ef28dc11e3	MINOR: task: turn the WQ lock to an RW_LOCK For now it's exclusively used as a write lock though, thus it remains 100% equivalent to the spinlock it replaces.	2019-05-28 19:15:44 +02:00
Willy Tarreau	e6a02fa65a	MINOR: threads: add a "stuck" flag to the thread_info struct This flag is constantly cleared by the scheduler and will be set by the watchdog timer to detect stuck threads. It is also set by the "show threads" command so that it is easy to spot if the situation has evolved between two subsequent calls : if the first "show threads" shows no stuck thread and the second one shows such a stuck thread, it indicates that this thread didn't manage to make any forward progress since the previous call, which is extremely suspicious.	2019-05-22 11:50:48 +02:00
Willy Tarreau	01f3489752	MINOR: task: put barriers after each write to curr_task This one may be watched by signal handlers, we don't want the compiler to optimize its assignment away at the end of the loop and leave some wandering pointers there.	2019-05-17 17:16:20 +02:00
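The idea behind commit 01f3489752 can be illustrated with a standard C11 fence. This sketch uses `atomic_signal_fence()` as a stand-in for HAProxy's own barrier macros; `struct task`, `run_task()` and `probe()` are simplified, hypothetical names.

```c
#include <stdatomic.h>
#include <stddef.h>

/* Simplified stand-in for a task. */
struct task { int id; };

/* Pointer that a signal handler might inspect to see what is running. */
static struct task *curr_task;

/* Demo helper: records what curr_task was while the handler ran. */
static struct task *seen_by_probe;

static void probe(struct task *t)
{
    (void)t;
    seen_by_probe = curr_task;
}

/* Runs one task. The fence after each write keeps the compiler from
 * delaying or dropping the store to curr_task, so code interrupting this
 * thread sees an up-to-date value.
 */
static void run_task(struct task *t, void (*process)(struct task *))
{
    curr_task = t;
    atomic_signal_fence(memory_order_seq_cst);

    process(t);

    curr_task = NULL;
    atomic_signal_fence(memory_order_seq_cst);
}
```

Without the final fence, a compiler could treat the last assignment as dead when it sits at the end of a loop, which is the situation the commit message describes.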
Willy Tarreau	bc13bec548	MINOR: activity: report context switch counts instead of rates It's not logical to report context switch rates per thread in show activity because everything else is a counter and it's not even possible to compare values. Let's only report counts. Further, this simplifies the scheduler's code.	2019-04-30 14:55:18 +02:00
Willy Tarreau	d9add3acc8	MINOR: activity: make the profiling status per thread and not global In order to later support automatic profiling turn on/off, we need to have it per-thread. We're keeping the global option to know whether to turn it on or off, but the profiling status is now set per thread. We're updating the status in activity_count_runtime() which is called before entering poll(). The reason is that we'll extend this with run time measurement when deciding to automatically turn it on or off.	2019-04-25 17:26:19 +02:00
Willy Tarreau	0212fadd65	MINOR: tasks/activity: report the context switch and task wakeup rates It's particularly useful to spot runaway tasks to see this. The context switch rate covers all tasklet calls (tasks and I/O handlers) while the task wakeups only covers tasks picked from the run queue to be executed. High values there will indicate either intense traffic or a bug that makes a task go wild.	2019-04-24 16:04:23 +02:00
Olivier Houchard	ed1a6a0d8a	MEDIUM: tasks: Use __ha_barrier_store after modifying global_tasks_mask. Now that we no longer use atomic operations to update global_tasks_mask, as it's always modified while holding the TASK_RQ_LOCK, we have to use __ha_barrier_store() instead of __ha_barrier_atomic_store() to ensure any modification of global_tasks_mask is seen before modifying active_tasks_mask. This should be backported to 1.9.	2019-04-18 14:14:10 +02:00
Olivier Houchard	1cfac37b65	MEDIUM: tasks: Don't account a destroyed task as a runned task. In process_runnable_tasks(), if the task we're about to run has been destroyed, and should be freed, don't account for it in the number of tasks we ran. We're only allowed a maximum number of tasks to run per call to process_runnable_tasks(), and freeing one shouldn't take the slot of a valid task.	2019-04-18 10:11:13 +02:00
Olivier Houchard	3f795f76e8	MEDIUM: tasks: Merge task_delete() and task_free() into task_destroy(). task_delete() was never used without calling task_free() just after, and task_free() was only used on error paths to destroy a just-created task, so merge them into task_destroy(), that will remove the task from the wait queue, and make sure the task is either destroyed immediately if it's not in the run queue, or destroyed when it's supposed to run.	2019-04-18 10:10:04 +02:00
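The accounting rule from commit 1cfac37b65 can be sketched as a budgeted loop. The names `struct task_sketch` and `run_batch()` are hypothetical, and the real scheduler frees the destroyed task where this sketch only skips it.

```c
#include <stddef.h>

/* Hypothetical task: <killed> marks a task destroyed while still queued. */
struct task_sketch {
    int killed;
    int runs;
};

/* Runs at most <max_processed> live tasks from <queue>. Destroyed tasks
 * are skipped without consuming budget. Returns the number of queue
 * entries consumed; <done> receives the number of tasks actually run.
 */
static size_t run_batch(struct task_sketch **queue, size_t len,
                        unsigned int max_processed, unsigned int *done)
{
    size_t i;

    *done = 0;
    for (i = 0; i < len && *done < max_processed; i++) {
        struct task_sketch *t = queue[i];

        if (t->killed)
            continue;   /* would be freed here; not counted as a run */

        t->runs++;      /* stands in for calling the task's handler */
        (*done)++;
    }
    return i;
}
```

With a budget of 2 and a queue holding destroyed, live, destroyed, live, live, the loop consumes four entries and runs two tasks, so the destroyed entries do not take the slot of a valid task.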
Willy Tarreau	03dd029a5b	CLEANUP: task: remain consistent when using the task's handler A pointer "process" is assigned the task's handler in process_runnable_tasks(), we have no reason to use t->process right after it is assigned.	2019-04-17 22:32:27 +02:00
Olivier Houchard	0c7a4b6371	MINOR: tasks: Don't set the TASK_RUNNING flag when adding in the tasklet list. Now that TASK_QUEUED is enforced, there's no need to set TASK_RUNNING when removing the task from the runqueue to add it to the tasklet list. The flag will only be set right before we run the task.	2019-04-17 19:28:01 +02:00
Olivier Houchard	de82aeaa26	BUG/MEDIUM: tasks: Make sure we modify global_tasks_mask with the rq_lock. When modifying global_tasks_mask, make sure we hold the rq_lock, or we might remove the bit while it has been re-set by somebody else, and we may not be woken up when needed.	2019-04-17 19:28:01 +02:00
Willy Tarreau	b038007ae8	BUG/MEDIUM: tasks: Make sure we set TASK_QUEUED before adding a task to the rq. Make sure we set TASK_QUEUED in every case before adding the task to the run queue. task_wakeup() now checks if either TASK_QUEUED or TASK_RUNNING is set, and if neither is set, adds TASK_QUEUED and effectively adds the task to the runqueue. No longer use __task_wakeup() anywhere except in task_wakeup(), always use task_wakeup() instead. With the old code, process_runnable_tasks() may re-add a task in the runqueue without setting the TASK_QUEUED flag, and there were race conditions that could lead to a task having the TASK_QUEUED flag but not in the runqueue, thus being unschedulable. This should be backported to 1.9.	2019-04-17 19:28:01 +02:00
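The check described in commit b038007ae8 can be sketched with a compare-and-swap loop. This is an illustration only: the flag values, `struct task` and `task_try_queue()` are assumptions, and the real task_wakeup() also records wake-up reasons, which this sketch leaves out.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Illustrative flag values; not taken from HAProxy's headers. */
#define TASK_RUNNING 0x0001u
#define TASK_QUEUED  0x0002u

/* Simplified task holding only its state word. */
struct task {
    _Atomic unsigned int state;
};

/* Atomically sets TASK_QUEUED if neither TASK_QUEUED nor TASK_RUNNING is
 * set. Returns true when the caller won the right to insert the task into
 * the run queue, false when the task is already queued or running.
 */
static bool task_try_queue(struct task *t)
{
    unsigned int old = atomic_load(&t->state);

    do {
        if (old & (TASK_RUNNING | TASK_QUEUED))
            return false;
        /* on failure <old> is reloaded and the check above is redone */
    } while (!atomic_compare_exchange_weak(&t->state, &old,
                                           old | TASK_QUEUED));
    return true;
}
```

Because the flag is set before the insertion and by exactly one caller, a task cannot end up in the run queue without TASK_QUEUED, nor be inserted twice by concurrent wake-ups.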
Willy Tarreau	3466e3cdcb	BUILD: task/thread: fix single-threaded build of task.c As expected, commit `cde7902ac` ("MEDIUM: tasks: improve fairness between the local and global queues") broke the build with threads disabled, and I forgot to rerun this test before committing. No backport is needed.	2019-04-15 18:52:40 +02:00
Willy Tarreau	c8da044b41	MINOR: tasks: restore the lower latency scheduling when niced tasks are present In the past we used to reduce the number of tasks consulted at once when some niced tasks were present in the run queue. This was dropped in 1.8 when the scheduler started to take batches. With the recent fixes it now becomes possible to restore this behaviour which guarantees a better latency between tasks when niced tasks are present. Thanks to this, with the default number of 200 for tune.runqueue-depth, with a parasitic load of 14000 requests per second, nice 0 gives 14000 rps, nice 1024 gives 12000 rps and nice -1024 gives 16000 rps. The amplitude widens if the runqueue depth is lowered.	2019-04-15 09:50:56 +02:00


266 Commits