Replace load average with PSI metric

The load average metric is misleading as a representation of CPU saturation. Normal CPU utilization is a better real representation of saturation. On newer Linux, there is a new Pressure Stall Information[0] metric that better represents CPU over saturation. This is also useful as it can make single-core saturation more visible. [0]: https://www.kernel.org/doc/html/latest/accounting/psi.html Signed-off-by: Ben Kochie <superq@gmail.com>
2024-12-17 20:05:10 +00:00 · 2021-08-19 12:19:26 +02:00 · 2021-08-19 12:19:26 +02:00 · d90b2d83d7
commit d90b2d83d7
parent aea88e4dc5
1 changed files with 8 additions and 11 deletions
--- a/docs/node-mixin/rules/rules.libsonnet
+++ b/docs/node-mixin/rules/rules.libsonnet
@ -16,7 +16,7 @@
            ||| % $._config,
          },
          {
-            // CPU utilisation is % CPU is not idle.
+            // CPU utilisation is % CPU is not idle. This represents CPU saturation.
            record: 'instance:node_cpu_utilisation:rate%(rateInterval)s' % $._config,
            expr: |||
              1 - avg without (cpu, mode) (
@ -25,17 +25,14 @@
            ||| % $._config,
          },
          {
-            // This is CPU saturation: 1min avg run queue length / number of CPUs.
+            // CPU pressure represents over-saturation. This is the amount of CPU seconds
-            // Can go over 1.
+            // requested, but the kernel was not able to schedule.
-            // TODO: There are situation where a run queue >1/core is just normal and fine.
+            // NOTE: This is only availalbe on Linux >= 4.19 and `CONFIG_PSI` is enabled.
-            //       We need to clarify how to read this metric and if its usage is helpful at all.
+            // See also:
-            record: 'instance:node_load1_per_cpu:ratio',
+            // - https://www.kernel.org/doc/html/latest/accounting/psi.html
            // - https://facebookmicrosites.github.io/psi/docs/overview
            expr: |||
-              (
+              rate(node_pressure_cpu_waiting_seconds_total{%(nodeExporterSelector)s}[%(rateInterval)s])
                node_load1{%(nodeExporterSelector)s}
              /
                instance:node_num_cpu:sum{%(nodeExporterSelector)s}
              )
            ||| % $._config,
          },
          {