BUG/MINOR: stream: make the call_rate only count the no-progress calls

We have an anti-looping protection in process_stream() that detects bugs
that used to affect a few filters like compression in the past which
sometimes forgot to handle a read0 or a particular error, leaving a
thread looping at 100% CPU forever. When such a condition is detected,
an alert it emitted and the process is killed so that it can be replaced
by a sane one:

  [ALERT]    (19061) : A bogus STREAM [0x274abe0] is spinning at 2057156
             calls per second and refuses to die, aborting now! Please
             report this error to developers [strm=0x274abe0,3 src=unix
             fe=MASTER be=MASTER dst=<MCLI> txn=(nil),0 txn.req=-,0
             txn.rsp=-,0 rqf=c02000 rqa=10000 rpf=88000021 rpa=8000000
             sif=EST,40008 sib=DIS,84018 af=(nil),0 csf=0x274ab90,8600
             ab=0x272fd40,1 csb=(nil),0
             cof=0x25d5d80,1300:PASS(0x274aaf0)/RAW((nil))/unix_stream(9)
             cob=(nil),0:NONE((nil))/NONE((nil))/NONE(0) filters={}]
    call trace(11):
    |       0x4dbaab [c7 04 25 01 00 00 00 00]: stream_dump_and_crash+0x17b/0x1b4
    |       0x4df31f [e9 bd c8 ff ff 49 83 7c]: process_stream+0x382f/0x53a3
    (...)

One problem with this detection is that it used to only count the call
rate because we weren't sure how to make it more accurate, but the
threshold was high enough to prevent accidental false positives.

There is actually one case that manages to trigger it, which is when
sending huge amounts of requests pipelined on the master CLI. Some
short requests such as "show version" are sufficient to be handled
extremely fast and to cause a wake up of an analyser to parse the
next request, then an applet to handle it, back and forth. But this
condition is not an error, since some data are being forwarded by
the stream, and it's easy to detect it.

This patch modifies the detection so that update_freq_ctr() only
applies to calls made without CF_READ_PARTIAL nor CF_WRITE_PARTIAL
set on any of the channels, which really indicates that nothing is
happening at all.

This is greatly sufficient and extremely effective, as the call above
is still caught (shutr being ignored by an analyser) while a loop on
the master CLI now has no effect. The "call_rate" field in the detailed
"show sess" output will now be much lower, except for bogus streams,
which may help spot them. This field is only there for developers
anyway so it's pretty fine to slightly adjust its meaning.

This patch could be backported to stable versions in case of reports
of such an issue, but as that's unlikely, it's not really needed.
This commit is contained in:
Willy Tarreau 2022-01-20 18:42:16 +01:00
parent a4e4d66f70
commit 6c539c4b8c
2 changed files with 9 additions and 4 deletions

View File

@ -143,7 +143,7 @@ struct stream {
struct list back_refs; /* list of users tracking this stream */
struct buffer_wait buffer_wait; /* position in the list of objects waiting for a buffer */
struct freq_ctr call_rate; /* stream task call rate */
struct freq_ctr call_rate; /* stream task call rate without making progress */
short store_count;
enum obj_type obj_type; /* object type == OBJ_TYPE_STREAM */

View File

@ -1628,9 +1628,14 @@ struct task *process_stream(struct task *t, void *context, unsigned int state)
si_sync_recv(si_f);
si_sync_recv(si_b);
rate = update_freq_ctr(&s->call_rate, 1);
if (rate >= 100000 && s->call_rate.prev_ctr) { // make sure to wait at least a full second
stream_dump_and_crash(&s->obj_type, read_freq_ctr(&s->call_rate));
/* Let's check if we're looping without making any progress, e.g. due
* to a bogus analyser or the fact that we're ignoring a read0. The
* call_rate counter only counts calls with no progress made.
*/
if (!((req->flags | res->flags) & (CF_READ_PARTIAL|CF_WRITE_PARTIAL))) {
rate = update_freq_ctr(&s->call_rate, 1);
if (rate >= 100000 && s->call_rate.prev_ctr) // make sure to wait at least a full second
stream_dump_and_crash(&s->obj_type, read_freq_ctr(&s->call_rate));
}
/* this data may be no longer valid, clear it */