BUG/MEDIUM: debug: don't set the STUCK flag from debug_handler()

Since 2.0 with commit e6a02fa65a ("MINOR: threads: add a "stuck" flag
to the thread_info struct"), the TH_FL_STUCK flag was set by the
debugger to flag that a thread was stuck and report it in the output.

However, two commits later (2bfefdbaef "MAJOR: watchdog: implement a
thread lockup detection mechanism"), this flag was used to detect that
a thread had already been reported as stuck. The problem is that it
seldom happens that a "show threads" command instantly crashes because
it calls debug_handler(), which sets the flag, and if the watchdog timer
was about to trigger before going back to the scheduler, the watchdog
believes that the thread has been stuck for a while and will kill the
process.

The issue was magnified in 3.1 with the lower-delay warning, because
it's possible for a thread to die on the next wakeup after the first
warning (which calls debug_handler() hence sets the STUCK flag).

One good approach would have been to use two distinct flags, one for
"stuck" as reported by the debug handler, and one for "stuck" as seen
by the watchdog. However, one could also argue that since the second
commit, given that the wdt monitors the threads, there's no point any
more for the debug handler to set the flag itself. Removing this code
means that two consecutive "show threads" will not report "stuck" until
the watchdog sets it, which aligns better with expectations.

This can be backported to all stable releases. This code has changed a
bit over time, the "if" block and the harmless variables just need to
be removed.
This commit is contained in:
Willy Tarreau 2024-11-21 19:19:46 +01:00
parent 332839eb9d
commit 1151fe6818

View File

@ -2347,7 +2347,6 @@ static int debug_iohandler_counters(struct appctx *appctx)
void debug_handler(int sig, siginfo_t *si, void *arg)
{
struct buffer *buf = HA_ATOMIC_LOAD(&th_ctx->thread_dump_buffer);
int harmless = is_thread_harmless();
int no_return = 0;
/* first, let's check it's really for us and that we didn't just get
@ -2371,13 +2370,6 @@ void debug_handler(int sig, siginfo_t *si, void *arg)
*/
ha_thread_dump_one(tid, 1);
/* mark the current thread as stuck to detect it upon next invocation
* if it didn't move.
*/
if (!harmless &&
!(_HA_ATOMIC_LOAD(&th_ctx->flags) & TH_FL_SLEEPING))
_HA_ATOMIC_OR(&th_ctx->flags, TH_FL_STUCK);
/* in case of panic, no return is planned so that we don't destroy
* the buffer's contents and we make sure not to trigger in loops.
*/