haproxy public development tree
Go to file
Willy Tarreau 1a088da7c2 MAJOR: stktable: split the keys across multiple shards to reduce contention
In order to reduce the contention on the table when keys expire quickly,
we're spreading the load over multiple trees. That counts for keys and
expiration dates. The shard number is calculated from the key value
itself, both when looking up and when setting it.

The "show table" dump on the CLI iterates over all shards so that the
output is not fully sorted, it's only sorted within each shard. The Lua
table dump just does the same. It was verified with a Lua program to
count stick-table entries that it works as intended (the test case is
reproduced here as it's clearly not easy to automate as a vtc):

  function dump_stk()
    local dmp = core.proxies['tbl'].stktable:dump({});
    local count = 0
    for _, __ in pairs(dmp) do
        count = count + 1
    end
    core.Info('Total entries: ' .. count)
  end

  core.register_action("dump_stk", {'tcp-req', 'http-req'}, dump_stk, 0);

  ##
  global
    tune.lua.log.stderr on
    lua-load-per-thread lua-cnttbl.lua

  listen front
    bind :8001
    http-request lua.dump_stk if { path_beg /stk }
    http-request track-sc1 rand(),upper,hex table tbl
    http-request redirect location /

  backend tbl
    stick-table size 100k type string len 12 store http_req_cnt

  ##
  $ h2load -c 16 -n 10000 0:8001/
  $ curl 0:8001/stk

  ## A count close to 100k appears on haproxy's stderr
  ## On the CLI, "show table tbl" | wc will show the same.

Some large parts were reindented only to add a top-level loop to iterate
over shards (e.g. process_table_expire()). Better check the diff using
git show -b.

The number of shards is decided just like for the pools, at build time
based on the max number of threads, so that we can keep a constant. Maybe
this should be done differently. For now CONFIG_HAP_TBL_BUCKETS is used,
and defaults to CONFIG_HAP_POOL_BUCKETS to keep the benefits of all the
measurements made for the pools. It turns out that this value seems to
be the most reasonable one without inflating the struct stktable too
much. By default for 1024 threads the value is 32 and delivers 980k RPS
in a test involving 80 threads, while adding 1kB to the struct stktable
(roughly doubling it). The same test at 64 gives 1008 kRPS and at 128
it gives 1040 kRPS for 8 times the initial size. 16 would be too low
however, with 675k RPS.

The stksess already have a shard number, it's the one used to decide which
peer connection to send the entry. Maybe we should also store the one
associated with the entry itself instead of recalculating it, though it
does not happen that often. The operation is done by hashing the key using
XXH32().

The peers also take and release the table's lock but the way it's used
it not very clear yet, so at this point it's sure this will not work.

At this point, this allowed to completely unlock the performance on a
80-thread setup:

 before: 5.4 Gbps, 150k RPS, 80 cores
  52.71%  haproxy    [.] stktable_lookup_key
  36.90%  haproxy    [.] stktable_get_entry.part.0
   0.86%  haproxy    [.] ebmb_lookup
   0.18%  haproxy    [.] process_stream
   0.12%  haproxy    [.] process_table_expire
   0.11%  haproxy    [.] fwrr_get_next_server
   0.10%  haproxy    [.] eb32_insert
   0.10%  haproxy    [.] run_tasks_from_lists

 after: 36 Gbps, 980k RPS, 80 cores
  44.92%  haproxy    [.] stktable_get_entry
   5.47%  haproxy    [.] ebmb_lookup
   2.50%  haproxy    [.] fwrr_get_next_server
   0.97%  haproxy    [.] eb32_insert
   0.92%  haproxy    [.] process_stream
   0.52%  haproxy    [.] run_tasks_from_lists
   0.45%  haproxy    [.] conn_backend_get
   0.44%  haproxy    [.] __pool_alloc
   0.35%  haproxy    [.] process_table_expire
   0.35%  haproxy    [.] connect_server
   0.35%  haproxy    [.] h1_headers_to_hdr_list
   0.34%  haproxy    [.] eb_delete
   0.31%  haproxy    [.] srv_add_to_idle_list
   0.30%  haproxy    [.] h1_snd_buf

WIP: uint64_t -> long

WIP: ulong -> uint

code is much smaller
2024-04-03 17:34:47 +02:00
.github CI: vtest: show coredumps if any 2024-04-02 07:29:08 +02:00
addons CLEANUP: assorted typo fixes in the code and comments 2024-03-05 11:50:34 +01:00
admin BUILD: address a few remaining calloc(size, n) cases 2024-02-10 11:37:27 +01:00
dev MEDIUM: ring: use the topmost bit of the tail as a lock 2024-03-25 17:34:19 +00:00
doc MINOR: quic: HyStart++ implementation (RFC 9406) 2024-04-02 18:47:19 +02:00
examples CLEANUP: assorted typo fixes in the code and comments 2023-11-23 16:23:14 +01:00
include MAJOR: stktable: split the keys across multiple shards to reduce contention 2024-04-03 17:34:47 +02:00
reg-tests REGTESTS: ssl: disable ssl/ocsp_auto_update.vtc 2024-04-02 14:20:39 +02:00
scripts CI: ssl: add yet another OpenSSL download fallback 2024-02-07 11:05:45 +01:00
src MAJOR: stktable: split the keys across multiple shards to reduce contention 2024-04-03 17:34:47 +02:00
tests Revert "MAJOR: import: update mt_list to support exponential back-off" 2023-09-15 17:13:43 +02:00
.cirrus.yml CI: cirrus-ci: display gdb bt if any 2023-09-22 08:28:30 +02:00
.gitattributes MINOR: Configure the cpp userdiff driver for *.[ch] in .gitattributes 2021-02-22 18:17:57 +01:00
.gitignore CONTRIB: Add vi file extensions to .gitignore 2023-06-02 18:14:34 +02:00
.mailmap DOC: update Tim's address in .mailmap 2021-09-16 09:14:14 +02:00
.travis.yml CI: travis-ci: temporarily disable arm64 builds 2021-08-07 07:28:15 +02:00
BRANCHES DOC: fix some spelling issues over multiple files 2021-01-08 14:53:47 +01:00
BSDmakefile BUILD: makefile: commit the tiny FreeBSD makefile stub 2023-05-24 17:17:36 +02:00
CHANGELOG [RELEASE] Released version 3.0-dev6 2024-03-26 15:36:49 +01:00
CONTRIBUTING CLEANUP: assorted typo fixes in the code and comments 2021-08-16 12:37:59 +02:00
INSTALL DOC: install: recommend pcre2 2024-02-08 08:55:57 +01:00
LICENSE LICENSE: add licence exception for OpenSSL 2012-09-07 13:52:26 +02:00
MAINTAINERS MAJOR: spoe: Deprecate the SPOE filter 2024-03-15 11:29:39 +01:00
Makefile MEDIUM: mworker: get rid of libsystemd 2024-04-03 15:53:18 +02:00
README DOC: create a BRANCHES file to explain the life cycle 2019-06-15 22:00:14 +02:00
SUBVERS BUILD: use format tags in VERDATE and SUBVERS files 2013-12-10 11:22:49 +01:00
VERDATE [RELEASE] Released version 3.0-dev6 2024-03-26 15:36:49 +01:00
VERSION [RELEASE] Released version 3.0-dev6 2024-03-26 15:36:49 +01:00

The HAProxy documentation has been split into a number of different files for
ease of use.

Please refer to the following files depending on what you're looking for :

  - INSTALL for instructions on how to build and install HAProxy
  - BRANCHES to understand the project's life cycle and what version to use
  - LICENSE for the project's license
  - CONTRIBUTING for the process to follow to submit contributions

The more detailed documentation is located into the doc/ directory :

  - doc/intro.txt for a quick introduction on HAProxy
  - doc/configuration.txt for the configuration's reference manual
  - doc/lua.txt for the Lua's reference manual
  - doc/SPOE.txt for how to use the SPOE engine
  - doc/network-namespaces.txt for how to use network namespaces under Linux
  - doc/management.txt for the management guide
  - doc/regression-testing.txt for how to use the regression testing suite
  - doc/peers.txt for the peers protocol reference
  - doc/coding-style.txt for how to adopt HAProxy's coding style
  - doc/internals for developer-specific documentation (not all up to date)