Add an early return to qcc_decode_qcs() if QCC instance is flagged on
error and connection is scheduled for immediate closure.
The main objective is to ensure to not trigger BUG_ON() from
qcc_set_error() : if a stream decoding has set the connection error, do
not try to process decoding on other streams as they may also encounter
an error. Thus, the connection is closed asap with the first encountered
error case.
This should be backported up to 2.6, after a period of observation.
With the support of multiple Rx buffers per QCS instance, stream
decoding in qcc_io_recv() has been reworked for the next haproxy
release. An issue appears in a double while loop : a break statement is
used in the inner loop, which is not sufficient as it should instead
exit from the outer one.
Fix this by replacing break with a goto statement.
No need to backport this.
Remove return statement in h3_rcv_buf() in case of stream/connection
error. Instead, reuse already existing label err. This simplifies the
code path. It also fixes the missing leave trace for these cases.
Use a customized proxy for the ACME client.
The proxy is initialized at the first acme section parsed.
The proxy uses the httpsclient log format as ACME CA use HTTPS.
Add an experimental "https" log-format for the httpclient, it is not
used by the httpclient by default, but could be define in a customized
proxy.
The string is basically a httpslog, with some of the fields replaced by
their backend equivalent or - when not available:
"%ci:%cp [%tr] %ft -/- %TR/%Tw/%Tc/%Tr/%Ta %ST %B %CC %CS %tsc %ac/%fc/%bc/%sc/%rc %sq/%bq %hr %hs %{+Q}r %[bc_err]/%[ssl_bc_err,hex]/-/-/%[ssl_bc_is_resumed] -/-/-"
The 'pause' HTTP action can now be used to suspend for a moment the message
analysis. A timeout, expressed in milliseconds using a time-format
parameter, or an expression can be used. If an expression is used, errors
and invalid values are ignored.
Internally, the action will set the analysis expiration date on the
corresponding channel to the configured value and it will yield while it is
not expired.
The 'pause' action is available for 'http-request' and 'http-response'
rules.
When a SPOP connection is opened, the maximum size for frames is negociated.
This negociated size is properly used when a frame is received and if a too
big frame is detected, an error is triggered. However, the same was not
performed on the sending path. No check was performed on frames sent to the
agent. So it was possible to send frames bigger than the maximum size
supported by the the SPOE agent.
Now, the size of NOTIFY and DISCONNECT frames is checked before sending them
to the agent.
Thanks to Miroslav to have reported the issue.
This patch must be backported to 3.1.
Since the commit "MINOR: hlua/h1: Use http_parse_cont_len_header() to parse
content-length value", this function is no longer used. So it can be safely
removed.
Till now, h1_parse_cont_len_header() was used during the H1 message parsing and
by the lua HTTP applets to parse the content-length header value. But a more
generic function was added some years ago doing exactly the same operations. So
let's use it instead.
Thanks to the commit "MINOR: mux-h1: Don't remove custom "Content-Length: 0"
header in 1xx and 204 messages", we are now sure that 1xx and 204 responses
were sanitized during the parsing. So, if one of these headers are found in
such responses when sent to the client, it means it was added by hand, via a
"set-header" action for instance. In this context, we are able to make an
exception for the "Content-Length: 0" header, and only this one with this
value, to not break leagacy applications.
So now, a user can force the "Content-Length: 0" header to appear in 1xx and
204 responses by adding the right action in hist configuration.
"Transfer-Encoding" headers are still dropped as "Content-Length" headers
with another value than 0. Note, that in practice, only 101 and 204 are
concerned because other 1xx message are not subject to HTTP analysis.
This patch should fix the issue #2888. There is no reason to backport
it. But if we do so, the patch above must be backported too.
According to the RFC9110 and RFC9112, a server must not add 'Content-Length'
or 'Transfer-Encoding' headers into 1xx and 204 responses. So till now,
these headers were dropped from the response when it is sent to the client.
However, it seems more logical to remove it during the message parsing. In
addition to sanitize messages as early as possible, this will allow us to
apply some exception in some cases (This will be the subject of another
patch).
In this patch, 'Content-Length' and 'Transfer-Encoding' headers are removed
from 1xx and 204 responses during the parsing but the same is still
performed during the formatting stage.
In RFC9110, it is stated that trailers could be merged with the
headers. While it should be performed with a speicial care, it may be a
problem for some applications. To avoid any trouble with such applications,
two new options were added to drop trailers during the message forwarding.
On the backend, "http-drop-request-trailers" option can be enabled to drop
trailers from the requests before sending them to the server. And on the
frontend, "http-drop-response-trailers" option can be enabled to drop
trailers from the responses before sending them to the client. The options
can be defined in defaults sections and disabled with "no" keyword.
This patch should fix the issue #2930.
This changes commit d2a9149f0 ("BUG/MINOR: proxy: always detach a proxy
from the names tree on free()") to be cleaner. Aurlien spotted that
the free(p->id) was indeed already done in proxy_free_common(), which is
called before we delete the node. That's still a bit ugly and it only
works because ebpt_delete() does not dereference the key during the
operation. Better play safe and delete the entry before freeing it,
that's more future-proof.
Stephen Farrell reported in issue #2942 that recent haproxy versions
crash if there's no resolv.conf. A quick bisect with his reproducer
showed that it started with commit 4194f75 ("MEDIUM: tree-wide: avoid
manually initializing proxies") which reorders the proxies initialization
sequence a bit. The crash shows a corrupted tree, typically indicating a
use-after-free. With the help of ASAN it was possible to find that a
resolver proxy had been destroyed and freed before the name insertion
that causes the crash, very likely caused by the absence of the needed
resolv.conf:
#0 0x7ffff72a82f7 in free (/usr/local/lib64/libasan.so.5+0x1062f7)
#1 0x94c1fd in free_proxy src/proxy.c:436
#2 0x9355d1 in resolvers_destroy src/resolvers.c:2604
#3 0x93e899 in resolvers_create_default src/resolvers.c:3892
#4 0xc6ed29 in httpclient_resolve_init src/http_client.c:1170
#5 0xc6fbcf in httpclient_create_proxy src/http_client.c:1310
#6 0x4ae9da in ssl_ocsp_update_precheck src/ssl_ocsp.c:1452
#7 0xa1b03f in step_init_2 src/haproxy.c:2050
But free_proxy() doesn't delete the ebpt_node that carries the name,
which perfectly explains the situation. This patch simply deletes the
name node and Stephen confirmed that it fixed the problem for him as
well. Let's also free it since the key points to p->id which is never
freed either in this function!
No backport is needed since the patch above was first merged into
3.2-dev10.
To handle out-of-order received CRYPTO frames, a ncbuf instance is
allocated. This is done via the helper quic_get_ncbuf().
Buffer allocation was improperly checked. In case b_alloc() fails, it
crashes due to a BUG_ON(). Fix this by removing it. The function now
returns NULL on allocation failure, which is already properly handled in
its caller qc_handle_crypto_frm().
This should fix the last reported crash from github issue #2935.
This must be backported up to 2.6.
Released version 3.2-dev11 with the following main changes :
- CI: enable weekly QuicTLS build
- DOC: management: slightly clarify the prefix role of the '@' command
- DOC: management: add a paragraph about the limitations of the '@' prefix
- MINOR: master/cli: support bidirectional communications with workers
- MEDIUM: ssl/ckch: add filename and linenum argument to crt-store parsing
- MINOR: acme: add the acme section in the configuration parser
- MINOR: acme: add configuration for the crt-store
- MINOR: acme: add private key configuration
- MINOR: acme/cli: add the 'acme renew' command
- MINOR: acme: the acme section is experimental
- MINOR: acme: get the ACME directory
- MINOR: acme: handle the nonce
- MINOR: acme: check if the account exist
- MINOR: acme: generate new account
- MINOR: acme: newOrder request retrieve authorizations URLs
- MINOR: acme: allow empty payload in acme_jws_payload()
- MINOR: acme: get the challenges object from the Auth URL
- MINOR: acme: send the request for challenge ready
- MINOR: acme: implement a check on the challenge status
- MINOR: acme: generate the CSR in a X509_REQ
- MINOR: acme: finalize by sending the CSR
- MINOR: acme: verify the order status once finalized
- MINOR: acme: implement retrieval of the certificate
- BUG/MINOR: acme: ckch_conf_acme_init() when no filename
- MINOR: ssl/ckch: handle ckch_conf in ckchs_dup() and ckch_conf_clean()
- MINOR: acme: copy the original ckch_store
- MEDIUM: acme: replace the previous ckch instance with new ones
- MINOR: acme: schedule retries with a timer
- BUILD: acme: enable the ACME feature when JWS is present
- BUG/MINOR: cpu-topo: check the correct variable for NULL after malloc()
- BUG/MINOR: acme: key not restored upon error in acme_res_certificate()
- BUG/MINOR: thread: protect thread_cpus_enabled_at_boot with USE_THREAD
- MINOR: acme: default to 2048bits for RSA
- DOC: acme: explain how to configure and run ACME
- BUG/MINOR: debug: remove the trailing \n from BUG_ON() statements
- DOC: config: add the missing "profiling.memory" to the global kw index
- DOC: config: add the missing "force-cfg-parser-pause" to the global kw index
- DEBUG: init: report invalid characters in debug description strings
- DEBUG: rename DEBUG_GLITCHES to DEBUG_COUNTERS and enable it by default
- DEBUG: counters: make COUNT_IF() only appear at DEBUG_COUNTERS>=1
- DEBUG: counters: add the ability to enable/disable updating the COUNT_IF counters
- MINOR: tools: let dump_addr_and_bytes() support dumping before the offset
- MINOR: debug: in call traces, dump the 8 bytes before the return address, not after
- MINOR: debug: detect call instructions and show the branch target in backtraces
- BUG/MINOR: acme: fix possible NULL deref
- CLEANUP: acme: stored value is overwritten before it can be used
- BUILD: incompatible pointer type suspected with -DDEBUG_UNIT
- BUG/MINOR: http-ana: Properly detect client abort when forwarding the response
- BUG/MEDIUM: http-ana: Report 502 from req analyzer only during rsp forwarding
- CI: fedora rawhide: enable unit tests
- DOC: configuration: fix a typo in ACME documentation
- MEDIUM: sink: add a new dpapi ring buffer
- Revert "BUG/MINOR: acme: key not restored upon error in acme_res_certificate()"
- BUG/MINOR: acme: key not restored upon error in acme_res_certificate() V2
- BUG/MINOR: acme: fix the exponential backoff of retries
- DOC: configuration: specify limitations of ACME for 3.2
- MINOR: acme: emit logs instead of ha_notice
- MINOR: acme: add a success message to the logs
- BUG/MINOR: acme/cli: fix certificate name in error message
- MINOR: acme: register the task in the ckch_store
- MINOR: acme: free acme_ctx once the task is done
- BUG/MEDIUM: h3: trim whitespaces when parsing headers value
- BUG/MEDIUM: h3: trim whitespaces in header value prior to QPACK encoding
- BUG/MINOR: h3: filter upgrade connection header
- BUG/MINOR: h3: reject invalid :path in request
- BUG/MINOR: h3: reject request URI with invalid characters
- MEDIUM: h3: use absolute URI form with :authority
- BUG/MEDIUM: hlua: fix hlua_applet_{http,tcp}_fct() yield regression (lost data)
- BUG/MINOR: mux-h2: prevent past scheduling with idle connections
- BUG/MINOR: rhttp: fix reconnect if timeout connect unset
- BUG/MINOR: rhttp: ensure GOAWAY can be emitted after reversal
- BUG/MINOR: mux-h2: do not apply timer on idle backend connection
- MINOR: mux-h2: refactor idle timeout calculation
- MINOR: mux-h2: prepare to support PING emission
- MEDIUM: server/mux-h2: implement idle-ping on backend side
- MEDIUM: listener/mux-h2: implement idle-ping on frontend side
- MINOR: mux-h2: do not emit GOAWAY on idle ping expiration
- MINOR: mux-h2: handle idle-ping on conn reverse
- BUILD: makefile: enable backtrace by default on musl
- BUG/MINOR: threads: set threads_idle and threads_harmless even with no threads
- BUG/MINOR debug: fix !USE_THREAD_DUMP in ha_thread_dump_fill()
- BUG/MINOR: wdt/debug: avoid signal re-entrance between debugger and watchdog
- BUG/MINOR: debug: detect and prevent re-entrance in ha_thread_dump_fill()
- MINOR: debug: do not statify a few debugging functions often used with wdt/dbg
- MINOR: tools: also protect the library name resolution against concurrent accesses
- MINOR: tools: protect dladdr() against reentrant calls from the debug handler
- MINOR: debug: protect ha_dump_backtrace() against risks of re-entrance
- MINOR: tinfo: keep a copy of the pointer to the thread dump buffer
- MINOR: debug: always reset the dump pointer when done
- MINOR: debug: remove unused case of thr!=tid in ha_thread_dump_one()
- MINOR: pass a valid buffer pointer to ha_thread_dump_one()
- MEDIUM: wdt: always make the faulty thread report its own warnings
- MINOR: debug: make ha_stuck_warning() only work for the current thread
- MINOR: debug: make ha_stuck_warning() print the whole message at once
- CLEANUP: debug: no longer set nor use TH_FL_DUMPING_OTHERS
- MINOR: sched: add a new function is_sched_alive() to report scheduler's health
- MINOR: wdt: use is_sched_alive() instead of keeping a local ctxsw copy
- MINOR: sample: add 4 new sample fetches for clienthello parsing
- REGTEST: add new reg-test for the 4 new clienthello fetches
- MINOR: servers: Move the per-thread server initialization earlier
- MINOR: proxies: Initialize the per-thread structure earlier.
- MINOR: servers: Provide a pointer to the server in srv_per_tgroup.
- MINOR: lb_fwrr: Move the next weight out of fwrr_group.
- MINOR: proxies: Add a per-thread group lbprm struct.
- MEDIUM: lb_fwrr: Use one ebtree per thread group.
- MEDIUM: lb_fwrr: Don't start all thread groups on the same server.
- MINOR: proxies: Do stage2 initialization for sinks too
In check_config_validity(), we initialize the proxy in several stages.
We do so for the sink list for stage1, but not for stage2. It may not be
needed right now, but it may become needed in the future, so do it
anyway.
Now that all there is one tree per thread group, all thread groups will
start on the same server. To prevent that, just insert the servers in a
different order for each thread group.
When using the round-robin load balancer, the major source of contention
is the lbprm lock, that has to be held every time we pick a server.
To mitigate that, make it so there are one tree per thread-group, and
one lock per thread-group. That means we now have a lb_fwrr_per_tgrp
structure that will contain the two lb_fwrr_groups (active and backup) as well
as the lock to protect them in the per-thread lbprm struct, and all
fields in the struct server are now moved to the per-thread structure
too.
Those changes are mostly mechanical, and brings good performances
improvment, on a 64-cores AMD CPU, with 64 servers configured, we could
process about 620000 requests par second, and we now can process around
1400000 requests per second.
Add a new structure in the per-thread groups proxy structure, that will
contain whatever is per-thread group in lbprm.
It will be accessed as p->per_tgrp[tgid].lbprm.
Move the "next_weight" outside of fwrr_group, and inside struct lb_fwrr
directly, one for the active servers, one for the backup servers.
We will soon have one fwrr_group per thread group, but next_weight will
be global to all of them.
Add a pointer to the server into the struct srv_per_tgroup, so that if
we only have access to that srv_per_tgroup, we can come back to the
corresponding server.
Move the call to initialize the proxy's per-thread structure earlier
than currently done, so that they are usable when we're initializing the
load balancers.
Move the code responsible for calling per-thread server initialization
earlier than it was done, so that per-thread structures are available a
bit later, when we initialize load-balancing.
This patch contains this 4 new fetches and doc changes for the new fetches:
- req.ssl_cipherlist
- req.ssl_sigalgs
- req.ssl_keyshare_groups
- req.ssl_supported_groups
Towards:#2532
Now we can simply call is_sched_alive() on the local thread to verify
that the scheduler is still ticking instead of having to keep a copy of
the ctxsw and comparing it. It's cleaner, doesn't require to maintain
a local copy, doesn't rely on activity[] (whose purpose is mainly for
observation and debugging), and shows how this could be extended later
to cover other use cases. Practically speaking this doesn't change
anything however, the algorithm is still the same.
This verifies that the scheduler is still ticking without having to
access the activity[] array nor keeping local copies of the ctxsw
counter. It just tests and sets a flag that is reset after each
return from a ->process() function.
TH_FL_DUMPING_OTHERS was being used to try to perform exclusion between
threads running "show threads" and those producing warnings. Now that it
is much more cleanly handled, we don't need that type of protection
anymore, which was adding to the complexity of the solution. Let's just
get rid of it.
It has been noticed quite a few times during troubleshooting and even
testing that warnings can happen in avalanches from multiple threads
at the same time, and that their reporting it interleaved bacause the
output is produced in small chunks. Originally, this code inspired by
the panic code aimed at making sure to log whatever could be emitted
in case it would crash later. But this approach was wrong since writes
are atomic, and performing 5 writes in sequence in each dumping thread
also means that the outputs can be mixed up at 5 different locations
between multiple threads. The output of warnings is never very long,
and the stack-based buffer is 4kB so let's just concatenate everything
in the buffer and emit it at once using a single write(). Now there's
no longer this confusion on the output.
Since we no longer call it with a foreign thread, let's simplify its code
and get rid of the special cases that were relying on ha_thread_dump_fill()
and synchronization with a remote thread. We're not only dumping the
current thread so ha_thread_dump_one() is sufficient.
Warnings remain tricky to deal with, especially for other threads as
they require some inter-thread synchronization that doesn't cope very
well with other parallel activities such as "show threads" for example.
However there is nothing that forces us to handle them this way. The
panic for example is already handled by bouncing the WDT signal to the
faulty thread.
This commit rearranges the WDT handler to make a better used of this
existing signal bouncing feature of the WDT handler so that it's no
longer limited to panics but can also deal with warnings. In order not
to bounce on all wakeups, we only bounce when there is a suspicion,
that is, when the warning timer has been crossed. We'll let the target
thread verify the stuck flag and context switch count by itself to
decide whether or not to panic, warn, or just do nothing and update
the counters.
As a bonus, now all warning traces look the same regardless of the
reporting thread:
call trace(16):
| 0x6bc733 <01 00 00 e8 6d e6 de ff]: ha_dump_backtrace+0x73/0x309 > main-0x2570
| 0x6bd37a <00 00 00 e8 d6 fb ff ff]: ha_thread_dump_fill+0xda/0x104 > ha_thread_dump_one
| 0x6bd625 <00 00 00 e8 7b fc ff ff]: ha_stuck_warning+0xc5/0x19e > ha_thread_dump_fill
| 0x7b2b60 <64 8b 3b e8 00 aa f0 ff]: wdt_handler+0x1f0/0x212 > ha_stuck_warning
| 0x7fd7e2cef3a0 <00 00 00 00 0f 1f 40 00]: libpthread:+0x123a0
| 0x7ffc6af9e634 <85 a6 00 00 00 0f 01 f9]: linux-vdso:__vdso_gettimeofday+0x34/0x2b0
| 0x6bad74 <7c 24 10 e8 9c 01 df ff]: sc_conn_io_cb+0x9fa4 > main-0x2400
| 0x67c457 <89 f2 4c 89 e6 41 ff d0]: main+0x1cf147
| 0x67d401 <48 89 df e8 8f ed ff ff]: cli_io_handler+0x191/0xb38 > main+0x1cee80
| 0x6dd605 <40 48 8b 45 60 ff 50 18]: task_process_applet+0x275/0xce9
The goal is to let the caller deal with the pointer so that the function
only has to fill that buffer without worrying about locking. This way,
synchronous dumps from "show threads" are produced and emitted directly
without causing undesired locking of the buffer nor risking causing
confusion about thread_dump_buffer containing bits from an interrupted
dump in progress.
It's only the caller that's responsible for notifying the requester of
the end of the dump by setting bit 0 of the pointer if needed (i.e. it's
only done in the debug handler).
This function was initially designed to dump any threadd into the presented
buffer, but the way it currently works is that it's always called for the
current thread, and uses the distinction between coming from a sighandler
or being called directly to detect which thread is the caller.
Let's simplify all this by replacing thr with tid everywhere, and using
the thread-local pointers where it makes sense (e.g. th_ctx, th_ctx etc).
The confusing "from_signal" argument is now replaced with "is_caller"
which clearly states whether or not the caller declares being the one
asking for the dump (the logic is inverted, but there are only two call
places with a constant).
We don't need to copy the old dump pointer to the thread_dump_pointer
area anymore to indicate a dump is collected. It used to be done as an
artificial way to keep the pointer for the post-mortem analysis but
since we now have this pointer stored separately, that's no longer
needed and it simplifies the mechanim to reset it.
Instead of using the thread dump buffer for post-mortem analysis, we'll
keep a copy of the assigned pointer whenever it's used, even for warnings
or "show threads". This will offer more opportunities to figure from a
core what happened, and will give us more freedom regarding the value of
the thread_dump_buffer itself. For example, even at the end of the dump
when the pointer is reset, the last used buffer is now preserved.
If a thread is dumping itself (warning, show thread etc) and another one
wants to dump the state of all threads (e.g. panic), it may interrupt the
first one during backtrace() and re-enter it from the signal handler,
possibly triggering a deadlock in the underlying libc. Let's postpone
the debug signal delivery at this point until the call ends in order to
avoid this.
If a thread is currently resolving a symbol while another thread triggers
a thread dump, the current thread may enter the debug handler and call
resolve_sym_addr() again, possibly deadlocking if the underlying libc
uses locking. Let's postpone the debug signal delivery in this area
during the call. This will slow the resolution a little bit but we don't
care, it's not supposed to happen often and it must remain rock-solid.
This is an extension of eb41d768f ("MINOR: tools: use only opportunistic
symbols resolution"). It also makes sure we're not calling dladddr() in
parallel to dladdr_and_size(), as a preventive measure against some
potential deadlocks in the inner layers of the libc.
A few functions are used when debugging debug signals and watchdog, but
being static, they're not resolved and are hard to spot in dumps, and
they appear as any random other function plus an offset. Let's just not
mark them static anymore, it only hurts:
- cli_io_handler_show_threads()
- debug_run_cli_deadlock()
- debug_parse_cli_loop()
- debug_parse_cli_panic()
In the following trace trying to abuse the watchdog from the CLI's
"debug dev loop" command running in parallel to "show threads" loops,
it's clear that some re-entrance may happen in ha_thread_dump_fill().
A first minimal fix consists in using a test-and-set on the flag
indicating that the function is currently dumping threads, so that
the one from the signal just returns. However the caller should be
made more reliable to serialize all of this, that's for future
work.
Here's an example capture of 7 threads stuck waiting for each other:
(gdb) bt
#0 0x00007fe78d78e147 in sched_yield () from /lib64/libc.so.6
#1 0x0000000000674a05 in ha_thread_relax () at src/thread.c:356
#2 0x00000000005ba4f5 in ha_thread_dump_fill (thr=2, buf=0x7ffdd8e08ab0) at src/debug.c:402
#3 ha_thread_dump_fill (buf=0x7ffdd8e08ab0, thr=<optimized out>) at src/debug.c:384
#4 0x00000000005baac4 in ha_stuck_warning (thr=thr@entry=2) at src/debug.c:840
#5 0x00000000006a360d in wdt_handler (sig=<optimized out>, si=<optimized out>, arg=<optimized out>) at src/wdt.c:156
#6 <signal handler called>
#7 0x00007fe78d78e147 in sched_yield () from /lib64/libc.so.6
#8 0x0000000000674a05 in ha_thread_relax () at src/thread.c:356
#9 0x00000000005ba4c2 in ha_thread_dump_fill (thr=2, buf=0x7fe78f2d6420) at src/debug.c:426
#10 ha_thread_dump_fill (buf=0x7fe78f2d6420, thr=2) at src/debug.c:384
#11 0x00000000005ba7c6 in cli_io_handler_show_threads (appctx=0x2a89ab0) at src/debug.c:548
#12 0x000000000057ea43 in cli_io_handler (appctx=0x2a89ab0) at src/cli.c:1176
#13 0x00000000005d7885 in task_process_applet (t=0x2a82730, context=0x2a89ab0, state=<optimized out>) at src/applet.c:920
#14 0x0000000000659002 in run_tasks_from_lists (budgets=budgets@entry=0x7ffdd8e0a5c0) at src/task.c:644
#15 0x0000000000659bd7 in process_runnable_tasks () at src/task.c:886
#16 0x00000000005cdcc9 in run_poll_loop () at src/haproxy.c:2858
#17 0x00000000005ce457 in run_thread_poll_loop (data=<optimized out>) at src/haproxy.c:3075
#18 0x0000000000430628 in main (argc=<optimized out>, argv=<optimized out>) at src/haproxy.c:3665
As seen in issue #2860, there are some situations where a watchdog could
trigger during the debug signal handler, and where similarly the debug
signal handler may trigger during the wdt handler. This is really bad
because it could trigger some deadlocks inside inner libc code such as
dladdr() or backtrace() since the code will not protect against re-
entrance but only against concurrent accesses.
A first attempt was made using ha_sigmask() but that's not always very
convenient because the second handler is called immediately after
unblocking the signal and before returning, leaving signal cascades in
backtrace. Instead, let's mark which signals to block at registration
time. Here we're blocking wdt/dbg for both signals, and optionally
SIGRTMAX if DEBUG_DEV is used as that one may also be used in this case.
This should be backported at least to 3.1.
The function must make sure to return NULL for foreign threads and
the local buffer for the current thread in this case, otherwise panics
(and sometimes even warnings) will segfault when USE_THREAD_DUMP is
disabled. Let's slightly re-arrange the function to reduce the #if/else
since we have to specifically handle the case of !USE_THREAD_DUMP anyway.
This needs to be backported wherever b8adef065d ("MEDIUM: debug: on
panic, make the target thread automatically allocate its buf") was
backported (at least 2.8).
Some signal handlers rely on these to decide about the level of detail to
provide in dumps, so let's properly fill the info about entering/leaving
idle. Note that for consistency with other tests we're using bitops with
t->ltid_bit, while we could simply assign 0/1 to the fields. But it makes
the code more readable and the whole difference is only 88 bytes on a 3MB
executable.
This bug is not important, and while older versions are likely affected
as well, it's not worth taking the risk to backport this in case it would
wake up an obscure bug.
The reason musl builds was not producing exploitable backtraces was
that the toolchain used appears to automatically omit the frame pointer
at -O2 but leaves it at -O0. This patch just makes sure to always append
-fno-omit-frame-pointer to the BACKTRACE cflags and enables the option
with musl where it now works. This will allow us to finally get
exploitable traces from docker images where core dumps are not always
available.