haproxy

mirror of http://git.haproxy.org/git/haproxy.git/ synced 2024-12-18 01:14:38 +00:00

Author	SHA1	Message	Date
Willy Tarreau	0bae075928	MEDIUM: pools: add CONFIG_HAP_NO_GLOBAL_POOLS and CONFIG_HAP_GLOBAL_POOLS We've reached a point where the global pools represent a significant bottleneck with threads. On a 64-core machine, the performance was divided by 8 between 32 and 64 H2 connections only because there were not enough entries in the local caches to avoid picking from the global pools, and the contention on the list there was very high. It becomes obvious that we need to have an array of lists, but that will require more changes. In parallel, standard memory allocators have improved, with tcmalloc and jemalloc finding their ways through mainstream systems, and glibc having upgraded to a thread-aware ptmalloc variant, keeping this level of contention here isn't justified anymore when we have both the local per-thread pool caches and a fast process-wide allocator. For these reasons, this patch introduces a new compile time setting CONFIG_HAP_NO_GLOBAL_POOLS which is set by default when threads are enabled with thread local pool caches, and we know we have a fast thread-aware memory allocator (currently set for glibc>=2.26). In this case we entirely bypass the global pool and directly use the standard memory allocator when missing objects from the local pools. It is also possible to force it at compile time when a good allocator is used with another setup. It is still possible to re-enable the global pools using CONFIG_HAP_GLOBAL_POOLS, if a corner case is discovered regarding the operating system's default allocator, or when building with a recent libc but a different allocator which provides other benefits but does not scale well with threads.	2021-03-05 08:30:08 +01:00
Ubuntu	f8fb4f75f1	MINOR: atomic: implement a more efficient arm64 __ha_cas_dw() using pairs There finally is a way to support register pairs on aarch64 assembly under gcc, it's just undocumented, like many of the options there :-( As indicated below, it's possible to pass "%H" to mention the high part of a register pair (e.g. "%H0" to go with "%0"): https://patchwork.ozlabs.org/project/gcc/patch/59368A74.2060908@foss.arm.com/ By making local variables from pairs of registers via a struct (as is used in IST for example), we can let gcc choose the correct register pairs and avoid a few moves in certain situations. The code is now slightly more efficient than the previous one on AWS' Graviton2 platform, and noticeably smaller (by 4.5kB approx). A few tests on older releases show that even Linaro's gcc-4.7 used to support such register pairs and %H, and by then ATOMICS were not supported so this should not cause build issues, and as such this patch replaces the earlier implementation.	2021-03-05 08:30:08 +01:00
Willy Tarreau	46cca86900	MINOR: atomic: add armv8.1-a atomics variant for cas-dw This variant uses the CASP instruction available on armv8.1-a CPU cores, which is detected when __ARM_FEATURE_ATOMICS is set (gcc-linaro >= 7, mainline >= 9). This one was tested on cortex-A55 (S905D3) and on AWS' Graviton2 CPUs. The instruction performs way better on high thread counts since it guarantees some forward progress when facing extreme contention while the original LL/SC approach is light on low-thread counts but doesn't guarantee progress. The implementation is not the most optimal possible. In particular since the instruction requires to work on register pairs and there doesn't seem to be a way to force gcc to emit register pairs, we have to decide to force to use the pair (x0,x1) to store the old value, and (x2,x3) to store the new one, and this necessarily involves some extra moves. But at least it does improve the situation with 16 threads and more. See issue #958 for more context. Note, a first implementation of this function was making use of an input/output constraint passed using "+Q"((void*)target), which was resulting in smaller overall code than passing "target" as an input register only. It turned out that the cause was directly related to whether the function was inlined or not, hence the "forceinline" attribute. Any changes to this code should still pay attention to this important factor.	2021-03-05 08:30:08 +01:00
Willy Tarreau	168fc5332c	BUG/MINOR: mt-list: always perform a cpu_relax call on failure On highly threaded machines it is possible to occasionally trigger the watchdog on certain contended areas like the server's connection list, because while the mechanism inherently cannot guarantee a constant progress, it lacks CPU relax calls which are absolutely necessary in this situation to let a thread finish its job. The loop's "while (1)" was changed to use a "for" statement calling __ha_cpu_relax() as its continuation expression. This way the "continue" statements jump to the unique place containing the pause without excessively inflating the code. This was sufficient to definitely fix the problem on 64-core ARM Graviton2 machines. This patch should probably be backported once it's confirmed it also helps on many-cores x86 machines since some people are facing contention in these environments. This patch depends on previous commit "REORG: atomic: reimplement pl_cpu_relax() from atomic-ops.h". An attempt was made to first read the value before exchanging, and it significantly degraded the performance. It's very likely that this caused other cores to lose exclusive ownership on their line and slow down their next xchg operation. In addition it was found that MT_LIST_ADD is significantly faster than MT_LIST_ADDQ under high contention, because it fails one step earlier when conflicting with an adjacent MT_LIST_DEL(). It might be worth switching some operations' order to favor MT_LIST_ADDQ() instead.	2021-03-05 08:30:08 +01:00
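The loop restructuring described in this commit can be sketched as follows. This is a minimal illustration, not the mt-list code itself: `relax_cpu()`, `spin_lock()` and `spin_unlock()` are hypothetical names, and the per-architecture instructions are assumptions standing in for whatever `__ha_cpu_relax()` really emits. The point is that the relax call sits in the `for` continuation expression, so every `continue` passes through it.

```c
#include <stdatomic.h>

/* Hypothetical stand-in for __ha_cpu_relax(): an expression that always
 * evaluates to 1 so it can be used as a loop condition or continuation.
 * The instructions chosen below are assumptions made for this sketch.
 */
static inline int relax_cpu(void)
{
#if defined(__x86_64__) || defined(__i386__)
    __asm__ volatile("rep; nop" ::: "memory");   /* PAUSE */
#elif defined(__aarch64__)
    __asm__ volatile("isb" ::: "memory");
#else
    atomic_signal_fence(memory_order_seq_cst);   /* compiler barrier only */
#endif
    return 1;
}

/* Take a trivial spin lock (0 = free, 1 = held). The exchange is attempted
 * directly, without reading the value first.
 */
static inline void spin_lock(atomic_int *lock)
{
    for (;; relax_cpu()) {
        if (atomic_exchange(lock, 1) != 0)
            continue;   /* busy: jumps to relax_cpu(), then retries */
        break;          /* we saw 0 and stored 1: lock acquired */
    }
}

static inline void spin_unlock(atomic_int *lock)
{
    atomic_store(lock, 0);
}
```

Because the pause lives in a single place, adding more failure paths to the loop body only costs one `continue` each.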
Willy Tarreau	958ae26c35	REORG: atomic: reimplement pl_cpu_relax() from atomic-ops.h There is some confusion here as we need to place some cpu_relax statements in some loops where it's not easily possible to condition them on the use of threads. That's what atomic.h already does. So let's take the various pl_cpu_relax() implementations from there and place them in atomic.h under the name __ha_cpu_relax() and let them adapt to the presence or absence of threads and to the architecture (currently only x86 and aarch64 use a barrier instruction, though it's very likely that arm would work well with a cache flushing ISB instruction as well). This time they were implemented as expressions returning 1 rather than statements, in order to ease their placement as the loop condition or the continuation expression inside "for" loops. We should probably do the same with barriers and a few other such ones.	2021-03-05 08:30:08 +01:00
Amaury Denoyelle	8ede3db080	MINOR: backend: handle reuse for conns with no server as target If dispatch mode or transparent backend is used, the backend connection target is a proxy instead of a server. In these cases, the reuse of backend connections is not consistent. With the default behavior, no reuse is done and every new request uses a new connection. However, if http-reuse is set to never, the connections are stored by the mux in the session and can be reused for future requests in the same session. As no server is used for these connections, no reuse can be made outside of the session, similarly to http-reuse never mode. A different http-reuse config value should not have an impact. To achieve this, mark these connections as private to have a defined behavior. For this feature to work properly, the connection hash has been slightly adjusted. The server pointer used as an input has been replaced by a generic target pointer to refer to the server or proxy instance. The hash is always calculated on connect_server even if the connection target is not a server. This also requires allocating the connection hash node for every backend connection, not just the ones with a server target.	2021-03-03 11:31:19 +01:00
Frédéric Lécaille	b28812af7a	BUILD: quic: Implicit conversion between SSL related enums. Fix such compilation issues: include/haproxy/quic_tls.h:157:10: error: implicit conversion from 'enum ssl_encryption_level_t' to 'enum quic_tls_enc_level' [-Werror=enum-conversion] 157 \| return ssl_encryption_application; \| ^~~~~~~~~~~~~~~~~~~~~~~~~~ src/xprt_quic.c: In function 'quic_conn_enc_level_init': src/xprt_quic.c:2358:13: error: implicit conversion from 'enum quic_tls_enc_level' to 'enum ssl_encryption_level_t' [-Werror=enum-conversion] 2358 \| qel->level = quic_to_ssl_enc_level(level); \| ^ Not detected by all compilers.	2021-03-02 10:34:18 +01:00
Willy Tarreau	61cfdf4fd8	CLEANUP: tree-wide: replace free(x);x=NULL with ha_free(&x) This makes the code more readable and less prone to copy-paste errors. In addition, it allows to place some __builtin_constant_p() predicates to trigger a link-time error in case the compiler knows that the freed area is constant. It will also produce compile-time error if trying to free something that is not a regular pointer (e.g. a function). The DEBUG_MEM_STATS macro now also defines an instance for ha_free() so that all these calls can be checked. 178 occurrences were converted. The vast majority of them were handled by the following Coccinelle script, some slightly refined to better deal with "&*x" or with long lines: @ rule @ expression E; @@ - free(E); - E = NULL; + ha_free(&E); It was verified that the resulting code is the same, more or less a handful of cases where the compiler optimized slightly differently the temporary variable that holds the copy of the pointer. A non-negligible amount of {free(str);str=NULL;str_len=0;} are still present in the config part (mostly header names in proxies). These ones should also be cleaned for the same reasons, and probably be turned into ist strings.	2021-02-26 21:21:09 +01:00
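The free-and-reset pattern this commit introduces can be sketched as below. This is a simplified illustration only: `free_and_null()` and `struct example_cfg` are hypothetical names, and the compile-time and link-time checks mentioned in the commit message are deliberately left out.

```c
#include <stdlib.h>

/* Hypothetical helper modeled on the idea behind ha_free(): it takes the
 * ADDRESS of a pointer, frees the pointed-to area, then resets the pointer
 * to NULL so that it cannot be freed or dereferenced again by accident.
 */
#define free_and_null(pp)      \
    do {                       \
        free(*(pp));           \
        *(pp) = NULL;          \
    } while (0)

/* Example owner of two heap-allocated strings. */
struct example_cfg {
    char *name;
    char *path;
};

/* Releases all strings; safe to call more than once since free(NULL)
 * is a no-op.
 */
static void example_cfg_release(struct example_cfg *cfg)
{
    free_and_null(&cfg->name);
    free_and_null(&cfg->path);
}
```

Taking the pointer's address means the variable name appears only once at each call site, which is what removes the copy-paste risk of `free(a); b = NULL;`.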
Christopher Faulet	29e9326f2f	CLEANUP: hlua: Use net_addr structure internally to parse and compare addresses hlua_addr structure may be replaced by net_addr structure to parse and compare addresses. Both structures are similar.	2021-02-26 13:53:26 +01:00
Christopher Faulet	5d1def623a	MEDIUM: http-ana: Add IPv6 support for forwardfor and originalto options A network may be specified to avoid header addition for "forwardfor" and "originalto" options via the "except" parameter. However, only IPv4 networks/addresses are supported. This patch adds the support of IPv6. To do so, the net_addr structure is used to store the parameter value in the proxy structure. And ipcmp2net() function is used to perform the comparison. This patch should fix the issue #1145. It depends on the following commits: * c6ce0ab MINOR: tools: Add function to compare an address to a network address * 5587287 MINOR: tools: Add net_addr structure describing a network address	2021-02-26 13:52:48 +01:00
Christopher Faulet	9553de7fec	MINOR: tools: Add function to compare an address to a network address ipcmp2net() function may be used to compare an address (struct sockaddr_storage) to a network address (struct net_addr). Among other things, this function will be used to add support of IPv6 for "except" parameter of "forwardfor" and "originalto" options.	2021-02-26 13:52:06 +01:00
Christopher Faulet	01f02a4d84	MINOR: tools: Add net_addr structure describing a network address The net_addr structure describes an IPv4 or IPv6 address. Its ip and mask are represented. Among other things, this structure will be used to add support of IPv6 for "except" parameter of "forwardfor" and "originalto" options.	2021-02-26 13:32:17 +01:00
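The two helper commits above can be illustrated with a small sketch of an address-to-network comparison. The structure layout, the `example_` names and the return convention (0 on match) are all assumptions made for illustration; the real `net_addr` and `ipcmp2net()` may differ. The sketch also assumes the stored network address has already been masked.

```c
#include <sys/socket.h>
#include <netinet/in.h>

/* Hypothetical layout: an address family plus an ip/mask pair per family. */
struct example_net_addr {
    int family;                              /* AF_INET or AF_INET6 */
    union {
        struct { struct in_addr  ip, mask; } v4;
        struct { struct in6_addr ip, mask; } v6;
    } addr;
};

/* Returns 0 if <addr> belongs to network <net>, non-zero otherwise.
 * Assumes net->addr.*.ip was masked when the network was parsed.
 */
static int example_ipcmp2net(const struct sockaddr_storage *addr,
                             const struct example_net_addr *net)
{
    if (addr->ss_family != net->family)
        return 1;

    if (net->family == AF_INET) {
        const struct sockaddr_in *sin = (const struct sockaddr_in *)addr;

        return ((sin->sin_addr.s_addr & net->addr.v4.mask.s_addr) ==
                net->addr.v4.ip.s_addr) ? 0 : 1;
    }

    if (net->family == AF_INET6) {
        const struct sockaddr_in6 *sin6 = (const struct sockaddr_in6 *)addr;

        /* compare the 128-bit address byte by byte under the mask */
        for (int i = 0; i < 16; i++) {
            if ((sin6->sin6_addr.s6_addr[i] & net->addr.v6.mask.s6_addr[i]) !=
                net->addr.v6.ip.s6_addr[i])
                return 1;
        }
        return 0;
    }

    return 1;   /* unsupported family never matches */
}
```

With a single structure covering both families, an "except" network can be stored once in the proxy and checked with one call regardless of whether the client address is IPv4 or IPv6.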
Willy Tarreau	401135cee6	MINOR: task: add one extra tasklet class: TL_HEAVY This class will be used exclusively for heavy processing tasklets. It will be cleaner than mixing them with the bulk ones. For now it's allocated ~1% of the CPU bandwidth. The largest part of the patch consists in re-arranging the fields in the task_per_thread structure to preserve a clean alignment with one more list head. Since we're now forced to increase the struct past a second cache line, it now uses 4 cache lines (for easy multiplying) with the first two ones being exclusively used by local operations and the third one mostly by atomic operations. Interestingly, this better arrangement causes less stress and reduced the response time by 8 microseconds at 1 million requests per second.	2021-02-26 12:00:53 +01:00
Willy Tarreau	d8aa21a611	CLEANUP: server: rename srv_cleanup_{idle,toremove}_connections() These function names are unbearably long, they don't even fit into the screen in "show profiling", let's trim the "_connections" to "_conns", which happens to match the name of the lists there.	2021-02-26 00:30:22 +01:00
Willy Tarreau	74dea8caea	MINOR: task: limit the number of subsequent heavy tasks with flag TASK_HEAVY While the scheduler is priority-aware and class-aware, and consistently tries to maintain fairness between all classes, it doesn't make use of a fine execution budget to compensate for high-latency tasks such as TLS handshakes. This can result in many subsequent calls adding multiple milliseconds of latency between the various steps of other tasklets that don't even depend on this. An ideal solution would be to add a 4th queue, have all tasks announce their estimated cost upfront and let the scheduler maintain an auto- refilling budget to pick from the most suitable queue. But it turns out that a very simplified version of this already provides impressive gains with very tiny changes and could easily be backported. The principle is to reserve a new task flag "TASK_HEAVY" that indicates that a task is expected to take a lot of time without yielding (e.g. an SSL handshake typically takes 700 microseconds of crypto computation). When the scheduler sees this flag when queuing a tasklet, it will place it into the bulk queue. And during dequeuing, we accept only one of these in a full round. This means that the first one will be accepted, will not prevent other lower priority tasks from running, but if a new one arrives, then the queue stops here and goes back to the polling. This will allow to collect more important updates for other tasks that will be batched before the next call of a heavy task. Preliminary tests consisting in placing this flag on the SSL handshake tasklet show that response times under SSL stress fell from 14 ms before the patch to 3.0 ms with the patch, and even 1.8 ms if tune.sched.low-latency is set to "on".	2021-02-26 00:25:51 +01:00
Christopher Faulet	69beaa91d5	REORG: server: Export and rename some functions updating server info Some static functions are now exported and renamed to follow the same pattern as other exported functions. Here is the list: * update_server_fqdn: renamed to srv_update_fqdn and exported * update_server_check_addr_port: renamed to srv_update_check_addr_port and exported * update_server_agent_addr_port: renamed to srv_update_agent_addr_port and exported * update_server_addr: renamed to srv_update_addr * update_server_addr_port: renamed to srv_update_addr_port * srv_prepare_for_resolution: exported This change is mandatory to move all functions dealing with the server-state files in a separate file.	2021-02-25 10:02:39 +01:00
Christopher Faulet	ecfb9b9109	MEDIUM: server: Store parsed params of a server-state line in the tree Parsed parameters are now stored in the tree of server-state lines. This way, a line from the global server-state file is only parsed once. Before, it was parsed a first time to store it in the tree and one more time to load the server state. To do so, the server-state line object must be allocated before parsing a line. This means its size must no longer depend on the length of first parsed parameters (backend and server names). Thus the node type was changed to use a hashed key instead of a string.	2021-02-25 10:02:39 +01:00
Christopher Faulet	6d87c58fb4	CLEANUP: server: Rename state_line structure into server_state_line The structure used to store a server-state line in an eb-tree has a too generic name. Instead of state_line, the structure is renamed as server_state_line.	2021-02-25 10:02:39 +01:00
Christopher Faulet	fcb53fbb58	CLEANUP: server: Rename state_line node to node instead of name_name <state_line.name_name> field is a node in an eb-tree. Thus, instead of "name_name", we now use "node" to name this field. It is a more explicit name and not too strange.	2021-02-25 10:02:39 +01:00
Willy Tarreau	b2285de049	MINOR: tasks: also compute the tasklet latency when DEBUG_TASK is set It is extremely useful to be able to observe the wakeup latency of some important I/O operations, so let's accept to inflate the tasklet struct by 8 extra bytes when DEBUG_TASK is set. With just this we have enough to get live reports like this: $ socat - /tmp/sock1 <<< "show profiling" Per-task CPU profiling : on # set profiling tasks {on\|auto\|off} Tasks activity: function calls cpu_tot cpu_avg lat_tot lat_avg si_cs_io_cb 8099492 4.833s 596.0ns 8.974m 66.48us h1_io_cb 7460365 11.55s 1.548us 2.477m 19.92us process_stream 7383828 22.79s 3.086us 18.39m 149.5us h1_timeout_task 4157 - - 348.4ms 83.81us srv_cleanup_toremove_connections 751 39.70ms 52.86us 10.54ms 14.04us srv_cleanup_idle_connections 21 1.405ms 66.89us 30.82us 1.467us task_run_applet 16 1.058ms 66.13us 446.2us 27.89us accept_queue_process 7 34.53us 4.933us 333.1us 47.58us	2021-02-25 09:44:16 +01:00
Willy Tarreau	45499c56d3	MINOR: task: make grq_total atomic to move it outside of the grq_lock Instead of decrementing grq_total once per task picked from the global run queue, let's do it at once after the loop like we do for other counters. This simplifies the code everywhere. It is not expected to bring noticeable improvements however, since global tasks tend to be less common nowadays.	2021-02-25 09:44:16 +01:00
Willy Tarreau	c03fbeb358	CLEANUP: task: re-merge __task_unlink_rq() with task_unlink_rq() There's no point keeping the two separate anymore, some tests are duplicated for no reason.	2021-02-25 09:44:16 +01:00
Christopher Faulet	e071f0e6a4	MINOR: htx: Add function to reserve the max possible size for an HTX DATA block The function htx_reserve_max_data() should be used to get an HTX DATA block with the max possible size. A current block may be extended or a new one created, depending on the HTX message state. But the idea is to let the caller copy a bunch of data without requesting many new blocks. It is then the caller's responsibility to resize the block at the end, to set the final block size. This function will be used to parse messages with small chunks. Indeed, we can have more than 2700 1-byte chunks in 16kB of input data. So it is easy to understand how this function may help to improve the parsing of chunked messages.	2021-02-24 22:10:01 +01:00
Baptiste Assmann	b4badf720c	BUG/MINOR: resolvers: new callback to properly handle SRV record errors When a SRV record was created, it used to register the regular server name resolution callbacks. That said, SRV records and regular server name resolution don't work the same way, especially regarding error management. This patch introduces a new callback to manage DNS errors related to the SRV queries. This fixes GitHub issue #50. Backport status: 2.3, 2.2, 2.1, 2.0	2021-02-24 21:58:45 +01:00
Willy Tarreau	5926e384e6	BUG/MINOR: fd: properly wait for !running_mask in fd_set_running_excl() In fd_set_running_excl() we don't reset the old mask in the CAS loop, so if we fail on the first round, we'll forcefully take the FD on the next one. In practice it's used by fd_insert() and fd_delete() only, none of which is supposed to be passed an FD which is still in use, given that for now only listeners may be enabled on multiple threads at once. This can be backported to 2.2 but shouldn't result in fixing any user visible bug for now.	2021-02-24 19:40:49 +01:00
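The class of bug fixed here can be sketched with C11 atomics, where a failed compare-and-swap overwrites the "expected" variable with the value it observed. This is not HAProxy's code: `example_set_running_excl()` is a hypothetical, bounded variant (it gives up after `max_tries` rounds instead of spinning forever) written only to show why the expected value must be reset on every round.

```c
#include <stdatomic.h>

/* Tries to become the exclusive owner of <running_mask>, i.e. to switch it
 * from 0 to <bit>. Returns 1 on success, 0 if the mask was still non-zero
 * after <max_tries> attempts (hypothetical bound, for illustration only).
 */
static int example_set_running_excl(atomic_ulong *running_mask,
                                    unsigned long bit, int max_tries)
{
    unsigned long old;

    while (max_tries-- > 0) {
        /* Must be reset on every round: a failed CAS stored the current
         * mask into <old>, so without this line the next attempt would
         * compare against that value and succeed even though other
         * owners are still present.
         */
        old = 0;
        if (atomic_compare_exchange_strong(running_mask, &old, bit))
            return 1;
    }
    return 0;
}
```

Dropping the `old = 0;` line turns the second iteration into an unconditional takeover, which matches the "forcefully take the FD on the next one" behavior described in the commit message.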
Willy Tarreau	9c6dbf0eea	CLEANUP: task: split the large tasklet_wakeup_on() function in two This function has become large with the multi-queue scheduler. We need to keep the fast path and the debugging parts inlined, but the rest now moves to task.c just like was done for task_wakeup(). This has reduced the code size by 6kB due to less inlining of large parts that are always context-dependent, and as a side effect, has increased the overall performance by 1%.	2021-02-24 17:55:58 +01:00
Willy Tarreau	955a11ebfa	MINOR: task: move the allocated tasks counter to the per-thread struct The nb_tasks counter was still global and gets incremented and decremented for each task_new()/task_free(), and was read in process_runnable_tasks(). But it's only used for stats reporting, so doing this this often is pointless and expensive. Let's move it to the task_per_thread struct and have the stats sum it when needed.	2021-02-24 17:42:04 +01:00
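The idea of keeping the counter per thread and summing it only when stats are requested can be sketched as follows. All names (`example_ctx`, `example_task_new()`, etc.), the 64-thread limit and the 64-byte alignment are assumptions made for this illustration, not the actual `task_per_thread` layout.

```c
#include <stdatomic.h>

#define EXAMPLE_MAX_THREADS 64

/* One counter per thread, each on its own (assumed 64-byte) cache line so
 * that updates from different threads do not contend.
 */
struct example_per_thread {
    _Alignas(64) atomic_uint nb_tasks;
};

static struct example_per_thread example_ctx[EXAMPLE_MAX_THREADS];

/* Single writer per slot: a relaxed load/store pair is enough, no locked
 * read-modify-write instruction is needed.
 */
static inline void example_count_add(atomic_uint *c, unsigned int delta)
{
    unsigned int v = atomic_load_explicit(c, memory_order_relaxed);

    atomic_store_explicit(c, v + delta, memory_order_relaxed);
}

/* Called by thread <tid> when it allocates or releases a task. */
static inline void example_task_new(int tid)
{
    example_count_add(&example_ctx[tid].nb_tasks, 1);
}

static inline void example_task_free(int tid)
{
    example_count_add(&example_ctx[tid].nb_tasks, (unsigned int)-1);
}

/* Stats path: sum all per-thread values. Unsigned arithmetic wraps, so the
 * total stays right even if a task is released by another thread than the
 * one that allocated it.
 */
static unsigned int example_total_tasks(int nbthread)
{
    unsigned int total = 0;

    for (int i = 0; i < nbthread; i++)
        total += atomic_load_explicit(&example_ctx[i].nb_tasks,
                                      memory_order_relaxed);
    return total;
}
```

The hot path touches only the caller's own cache line, and the cost of walking all threads is paid only by the rare stats request.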
Willy Tarreau	018564eaa2	CLEANUP: task: move the tree root detection from __task_wakeup() to task_wakeup() Historically we used to call __task_wakeup() with a known tree root but this is not the case and the code has remained needlessly complicated with the root calculation in task_wakeup() passed in argument to __task_wakeup() which compares it again. Let's get rid of this and just move the detection code there. This eliminates some ifdefs and allows to simplify the test conditions quite a bit.	2021-02-24 17:42:04 +01:00
Willy Tarreau	1f3b1417b8	CLEANUP: tasks: use a less confusing name for task_list_size This one is systematically misunderstood due to its unclear name. It is in fact the number of tasks in the local tasklet list. Let's call it "tasks_in_list" to remove some of the confusion.	2021-02-24 17:42:04 +01:00
Willy Tarreau	2c41d77ebc	MINOR: tasks: do not maintain the rqueue_size counter anymore This one is exclusively used as a boolean nowadays and is non-zero only when the thread-local run queue is not empty. Better check the root tree's pointer and avoid updating this counter all the time.	2021-02-24 17:42:04 +01:00
Willy Tarreau	9c7b8085f4	MEDIUM: task: remove the tasks_run_queue counter and have one per thread This counter is solely used for reporting in the stats and is the hottest thread contention point to date. Moving it to the scheduler and having a separate one for the global run queue dramatically improves the performance, showing a 12% boost on the request rate on 16 threads! In addition, the thread debugging output which used to rely on rqueue_size was not totally accurate as it would only report task counts. Now we can return the exact thread's run queue length. It is also interesting to note that there are still a few other task/tasklet counters in the scheduler that are not efficiently updated because some cover a single area and others cover multiple areas. It looks like having a distinct counter for each of the following entries would help and would keep the code a bit cleaner: - global run queue (tree) - per-thread run queue (tree) - per-thread shared tasklets list - per-thread local lists Maybe even splitting the shared tasklets lists between pure tasklets and tasks instead of having the whole and tasks would simplify the code because there remain a number of places where several counters have to be updated.	2021-02-24 17:42:04 +01:00
Willy Tarreau	49de68520e	MEDIUM: streams: do not use the streams lock anymore The lock was still used exclusively to deal with the concurrency between the "show sess" release handler and a stream_new() or stream_free() on another thread. All other accesses made by "show sess" are already done under thread isolation. The release handler only requires to unlink its node when stopping in the middle of a dump (error, timeout etc). Let's just isolate the thread to deal with this case so that it's compatible with the dump conditions, and remove all remaining locking on the streams. This effectively kills the streams lock. The measured gain here is around 1.6% with 4 threads (374krps -> 380k).	2021-02-24 13:54:50 +01:00
Willy Tarreau	a698eb6739	MINOR: streams: use one list per thread instead of a global one The global streams list is exclusively used for "show sess", to look up a stream to shut down, and for the hard-stop. Having all of them in a single list is extremely expensive in terms of locking when using threads, with performance losses as high as 7% having been observed just due to this. This patch makes the list per-thread, since there's no need to have a global one in this situation. All call places just iterate over all threads. The most "invasive" change was in "show sess" where the end of list needs to go back to the beginning of next thread's list until the last thread is seen. For now the lock was maintained to keep the code auditable but a next commit should get rid of it. The observed performance gain here with only 4 threads is already 7% (350krps -> 374krps).	2021-02-24 13:53:20 +01:00
Willy Tarreau	b981318c11	MINOR: stream: add an "epoch" to figure which streams appeared when The "show sess" CLI command currently lists all streams and needs to stop at a given position to avoid dumping forever. Since 2.2 with commit `c6e7a1b8e` ("MINOR: cli: make "show sess" stop at the last known session"), a hack consists in unlinking the stream running the applet and linking it again at the current end of the list, in order to serve as a delimiter. But this forces the stream list to be global, which affects scalability. This patch introduces an epoch, which is a global 32-bit counter that is incremented by the "show sess" command, and which is copied by newly created streams. This way any stream can know whether any other one is newer or older than itself. For now it's only stored and not exploited.	2021-02-24 12:12:51 +01:00
Ilya Shipitsin	98a9e1b873	BUILD: SSL: introduce fine guard for RAND_keep_random_devices_open RAND_keep_random_devices_open is an OpenSSL-specific function, not implemented in LibreSSL and BoringSSL. Let us define the guard HAVE_SSL_RAND_KEEP_RANDOM_DEVICES_OPEN in include/haproxy/openssl-compat.h. That guard no longer depends on HA_OPENSSL_VERSION.	2021-02-22 10:35:23 +01:00
Willy Tarreau	c6ba9a0b9b	MINOR: sched: have one runqueue ticks counter per thread The runqueue_ticks counts the number of task wakeups and is used to position new tasks in the run queue, but since we've had per-thread run queues, the values there are not very relevant anymore and the nice value doesn't apply well if some threads are more loaded than others. In addition, letting all threads compete over a shared counter is not smart as this may cause some excessive contention. Let's move this index close to the run queues themselves, i.e. one per thread and a global one. In addition to improving fairness, this has increased global performance by 2% on 16 threads thanks to the lower contention on rqueue_ticks. Fairness issues were not observed, but if any were to be, this patch could be backported as far as 2.0 to address them.	2021-02-20 13:03:37 +01:00
Willy Tarreau	4d77bbf856	MINOR: dynbuf: pass offer_buffers() the number of buffers instead of a threshold Historically this function would try to wake the most accurate number of process_stream() waiters. But since the introduction of filters which could also require buffers (e.g. for compression), things started not to be as accurate anymore. Nowadays muxes and transport layers also use buffers, so the runqueue size has nothing to do anymore with the number of supposed users to come. In addition to this, the threshold was compared to the number of free buffers calculated as allocated minus used, but this didn't work anymore with local pools since these counts are not updated upon alloc/free! Let's clean this up and pass the number of released buffers instead, and consider that each waiter successfully called counts as one buffer. This is not rocket science and will not suddenly fix everything, but at least it cannot be as wrong as it is today. This could have been marked as a bug given that the current situation is totally broken regarding this, but this probably doesn't completely fix it, it only goes in a better direction. It is possible however that it makes sense in the future to backport this as part of a larger series if the situation significantly improves.	2021-02-20 12:38:18 +01:00
Willy Tarreau	90f366b595	MINOR: dynbuf: use regular lists instead of mt_lists for buffer_wait There's no point anymore in keeping mt_lists for the buffer_wait and buffer_wq since it's thread-local now.	2021-02-20 12:38:18 +01:00
Willy Tarreau	e8e5091510	MINOR: dynbuf: make the buffer wait queue per thread The buffer wait queue used to be global historically but this does not make any sense anymore given that the most common use case is to have thread-local pools. Thus there's no point waking up waiters of other threads after releasing an entry, as they won't benefit from it. Let's move the queue head to the thread_info structure and use ti->buffer_wq from now on.	2021-02-20 12:38:18 +01:00
Christopher Faulet	ea2cdf55e3	MEDIUM: server: Don't introduce a new server-state file version This reverts the commit `63e6cba12` ("MEDIUM: server: add server-states version 2"), but keeps all recent features added to the server-state file. Instead of adding a 2nd version for the server-state file format to handle the 5 new fields added during the 2.4 development, these fields are considered as optional during the parsing. So it is possible to load a server-state file from HAProxy 2.3. However, from 2.4, these new fields are always dumped in the server-state file. But it should not be a problem to load it on the 2.3. This patch seems a bit huge but the diff ignoring the space is much smaller. The version 2 of the server-state file format is reserved for a real refactoring to address all issues of the current format.	2021-02-19 18:03:59 +01:00
Amaury Denoyelle	8990b010a0	MINOR: connection: allocate dynamically hash node for backend conns Remove ebmb_node entry from struct connection and create a dedicated struct conn_hash_node. struct connection now contains only a pointer to a conn_hash_node, allocated only for connections where target is of type OBJ_TYPE_SERVER. This will reduce the memory footprint of every connection that does not need http-reuse, such as frontend connections.	2021-02-19 16:59:18 +01:00
Olivier Houchard	5567f41d0a	BUG/MEDIUM: lists: Avoid an infinite loop in MT_LIST_TRY_ADDQ(). In MT_LIST_TRY_ADDQ(), deal with the "prev" field of the element before the "next". If the element is the first in the list, then its next will already have been locked when we locked list->prev->next, so locking it again will fail, and we'll start over and over. This should be backported to 2.3.	2021-02-19 16:47:20 +01:00
Willy Tarreau	66161326fd	MINOR: listener: refine the default MAX_ACCEPT from 64 to 4 The maximum number of connections accepted at once by a thread for a single listener used to default to 64 divided by the number of processes but the tasklet-based model is much more scalable and benefits from smaller values. Experimentation has shown that 4 gives the highest accept rate for all thread values, and that 3 and 5 come very close, as shown below (HTTP/1 connections forwarded per second at multi-accept 4 and 64): ac\thr\| 1 2 4 8 16 ------+------------------------------ 4\| 80k 106k 168k 270k 336k 64\| 63k 89k 145k 230k 274k Some tests were also conducted on SSL and absolutely no change was observed. The value was placed into a define because it used to be spread all over the code. It might be useful at some point to backport this to 2.3 and 2.2 to help those who observed some performance regressions from 1.6.	2021-02-19 16:02:04 +01:00
Willy Tarreau	4327d0ac00	MINOR: tasks: refine the default run queue depth Since a lot of internal callbacks were turned to tasklets, the runqueue depth had not been readjusted from the default 200 which was initially used to favor batched processing. But nowadays it appears too large already based on the following tests conducted on a 8c16t machine with a simple config involving "balance leastconn" and one server. The setup always involved the two threads of a same CPU core except for 1 thread, and the client was running over 1000 concurrent H1 connections. The number of requests per second is reported for each (runqueue-depth, nbthread) couple: rq\thr\| 1 2 4 8 16 ------+------------------------------ 32\| 120k 159k 276k 477k 698k 40\| 122k 160k 276k 478k 722k 48\| 121k 159k 274k 482k 720k 64\| 121k 160k 274k 469k 710k 200\| 114k 150k 247k 415k 613k <-- default It's possible to save up to about 18% performance by lowering the default value to 40. One possible explanation to this is that checking I/Os more frequently allows to flush buffers faster and to smooth the I/O wait time over multiple operations instead of alternating phases of processing, waiting for locks and waiting for new I/Os. The total round trip time also fell from 1.62ms to 1.40ms on average, among which at least 0.5ms is attributed to the testing tools since this is the minimum attainable on the loopback. After some observation it would be nice to backport this to 2.3 and 2.2 which observe similar improvements, since some users have already observed some perf regressions between 1.6 and 2.2.	2021-02-19 16:01:55 +01:00
Ilya Shipitsin	c47d676bd7	BUILD: ssl: introduce fine guard for OpenSSL specific SCTL functions SCTL (signed certificate timestamp list) specified in RFC6962 was implemented in c74ce24cd22e8c683ba0e5353c0762f8616e597d, let us introduce macro HAVE_SSL_SCTL for the HAVE_SSL_SCTL sake, which in turn is based on SN_ct_cert_scts, which comes in the same commit	2021-02-18 15:55:50 +01:00
Christopher Faulet	8dd40fbde9	BUG/MINOR: sample: Always consider zero size string samples as unsafe smp_is_safe() function is used to be sure a sample may be safely modified. For string samples, a test is performed to verify if there is a null-terminated byte. If not, one is added, if possible. It means if the sample is not const and if there is some free space in the buffer, after data. However, we must not try to read the null-terminated byte if the string sample is too long (data >= size) or if the size is equal to zero. This last test was not performed. Thus it was possible to consider a string sample as safe by testing a byte outside the buffer. Now, a zero size string sample is always considered as unsafe and is duplicated when smp_make_safe() is called. This patch must be backported in all stable versions.	2021-02-18 14:58:43 +01:00
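The check described in the commit above can be sketched as follows. This is a minimal illustration only, not the code from the patch: `struct str_sample`, its fields and `str_sample_is_safe()` are hypothetical names that merely mimic the behaviour the message attributes to smp_is_safe() for string samples.

```c
#include <stddef.h>

/* Hypothetical, simplified stand-in for a string sample's storage. The
 * field names are illustrative and do not mirror HAProxy's structures.
 */
struct str_sample {
	char  *area;     /* start of the storage */
	size_t data;     /* number of bytes currently used */
	size_t size;     /* allocated size, 0 when unknown */
	int    is_const; /* non-zero if the storage must not be written */
};

/* Returns 1 if the string may be modified in place, 0 if the caller
 * has to duplicate it first.
 */
static int str_sample_is_safe(struct str_sample *s)
{
	/* Zero size, or no room after the data: area[data] would lie
	 * outside the buffer, so it must not even be read.
	 */
	if (s->size == 0 || s->data >= s->size)
		return 0;

	/* Already terminated; the read is within bounds at this point. */
	if (s->area[s->data] == '\0')
		return 1;

	/* Not terminated: add the terminator only if writing is allowed. */
	if (s->is_const)
		return 0;

	s->area[s->data] = '\0';
	return 1;
}
```

The size test comes first on purpose: every later access to `area[data]` relies on it having passed.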
Willy Tarreau	ca9f60c1ac	MINOR: tasks/debug: add some extra controls of use-after-free in DEBUG_TASK It's pretty easy to pre-initialize the index, change it on free() and check it during the wakeup, so let's do this to ease detection of any accidental task_wakeup() after a task_free() or tasklet_wakeup() after a tasklet_free(). If this would ever happen we'd then get a backtrace and a core now. The index's parity is respected so that the call history remains exploitable.	2021-02-18 14:38:49 +01:00
Willy Tarreau	b23f04260b	MINOR: tasks: add DEBUG_TASK to report caller info in a task The idea is to know who woke a task up, by recording the last two callers in a rotating mode. For now it's trivial with task_wakeup() but tasklet_wakeup_on() will require quite some more changes. This typically gives this from the debugger: (gdb) p t->debug $2 = { caller_file = {0x0, 0x8c0d80 "src/task.c"}, caller_line = {0, 260}, caller_idx = 1 } or this: (gdb) p t->debug $6 = { caller_file = {0x7fffe40329e0 "", 0x885feb "src/stream.c"}, caller_line = {284, 284}, caller_idx = 1 } But it also provides a trivial macro allowing to simply place a call in a task/tasklet handler that needs to be observed: DEBUG_TASK_PRINT_CALLER(t); Then starting haproxy this way would trivially yield such info: $ ./haproxy -db -f test.cfg | sort | uniq -c | sort -nr 199992 h1_io_cb woken up from src/sock.c:797 51764 h1_io_cb woken up from src/mux_h1.c:3634 65 h1_io_cb woken up from src/connection.c:169 45 h1_io_cb woken up from src/sock.c:777	2021-02-18 10:42:07 +01:00
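A rough sketch of how "the last two callers in a rotating mode" could be recorded is given below. The field names are taken from the gdb output in the commit above; everything else (`struct task_debug`, `debug_record_caller()`, `DEBUG_RECORD_CALLER()`) is a hypothetical illustration and may differ from what the patch actually does.

```c
#include <stddef.h>

/* Hypothetical debug record, modeled on the fields visible in the gdb
 * output (caller_file, caller_line, caller_idx).
 */
struct task_debug {
	const char  *caller_file[2];
	int          caller_line[2];
	unsigned int caller_idx;
};

/* Store the wakeup site in the slot that was not written last, so that
 * the two most recent callers are always available.
 */
static void debug_record_caller(struct task_debug *d, const char *file, int line)
{
	d->caller_idx ^= 1;
	d->caller_file[d->caller_idx] = file;
	d->caller_line[d->caller_idx] = line;
}

/* Hypothetical wrapper capturing the call site, the way a wakeup macro
 * could do it.
 */
#define DEBUG_RECORD_CALLER(d) debug_record_caller((d), __FILE__, __LINE__)
```

Starting from a zeroed record, the first call fills slot 1 and leaves slot 0 empty, which is consistent with the first gdb dump (`caller_file = {0x0, ...}`, `caller_idx = 1`).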
Willy Tarreau	59b0fecfd9	MINOR: lb/api: let callers of take_conn/drop_conn tell if they have the lock The two algos defining these functions (first and leastconn) do not need the server's lock. However it's already present in pendconn_process_next_strm() so the API must be updated so that the functions may take it if needed and that the callers indicate whether they already own it. As such, the call places (backend.c and stream.c) now do not take it anymore, queue.c was unchanged since it's already held, and both "first" and "leastconn" were updated to take it if not already held. A quick test on the "first" algo showed a jump from 432 to 565k rps by just dropping the lock in stream.c!	2021-02-18 10:06:45 +01:00
Willy Tarreau	b9ad30a8ad	Revert "MINOR: threads: change lock_t to an unsigned int" This reverts commit `8f1f177ed0`. Repeated tests have shown a small performance degradation of ~1.8% caused by this patch at high request rates on 16 threads. The exact cause is not yet perfectly known but it probably stems from slower accesses for non-64-bit aligned atomic accesses.	2021-02-18 10:06:45 +01:00


4859 Commits