The aggregation group is already responsible for removing the resolved
alerts. Running the garbage collection in parallel introduces a race, and
resolved notifications may eventually be dropped.
Signed-off-by: Simon Pasquier <spasquie@redhat.com>
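For illustration, a minimal sketch (with stand-in types and names, not the actual Alertmanager code) of the ownership this commit relies on: the group removes resolved alerts itself, and only after a successful flush, so no parallel GC can drop them before the notification goes out.

```go
package dispatch

import "sync"

// Alert is a stand-in for the real alert type.
type Alert struct {
	Fingerprint string
	Resolved    bool
}

// aggrGroup owns its alerts map; nothing else is allowed to mutate it.
type aggrGroup struct {
	mtx    sync.Mutex
	alerts map[string]*Alert
}

// flush notifies on a snapshot and, only on success, removes the alerts
// that were resolved in that snapshot. A concurrent GC could delete a
// resolved alert before its notification went out, which is the race the
// commit avoids.
func (ag *aggrGroup) flush(notify func([]*Alert) bool) {
	ag.mtx.Lock()
	snapshot := make([]*Alert, 0, len(ag.alerts))
	for _, a := range ag.alerts {
		snapshot = append(snapshot, a)
	}
	ag.mtx.Unlock()

	if !notify(snapshot) {
		return
	}
	ag.mtx.Lock()
	defer ag.mtx.Unlock()
	for _, a := range snapshot {
		// Keep the alert if it was updated since the snapshot was taken.
		if cur, ok := ag.alerts[a.Fingerprint]; ok && cur == a && a.Resolved {
			delete(ag.alerts, a.Fingerprint)
		}
	}
}
```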
If the original EndsAt is left in place, then as time moves forward past
the EndsAt, firing alerts will be rendered and treated as resolved alerts,
which can cause confusion and races. This is most likely to happen on
retries of a notification.
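A hedged sketch of the idea with a hypothetical helper: before handing alerts to the notifiers, clear EndsAt on alerts that are still firing so a delayed retry cannot render them as resolved.

```go
package notify

import "time"

// Alert is a stand-in for the real alert type.
type Alert struct {
	StartsAt, EndsAt time.Time
}

// firing reports whether the alert has not ended at time now.
func (a *Alert) firing(now time.Time) bool {
	return a.EndsAt.IsZero() || a.EndsAt.After(now)
}

// snapshotForNotify copies the alerts and clears EndsAt on those that are
// still firing, so a notification retried minutes later cannot render them
// as resolved just because the wall clock moved past the original EndsAt.
func snapshotForNotify(alerts []*Alert, now time.Time) []*Alert {
	out := make([]*Alert, 0, len(alerts))
	for _, a := range alerts {
		c := *a
		if c.firing(now) {
			c.EndsAt = time.Time{}
		}
		out = append(out, &c)
	}
	return out
}
```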
Mitigate race and fix data races in TestAggrGroup.
Signed-off-by: Brian Brazil <brian.brazil@robustperception.io>
To aggregate by all possible labels, use '...' as the sole label name.
This effectively disables aggregation entirely, passing through all
alerts as-is. This is unlikely to be what you want, unless you have
a very low alert volume or your upstream notification system performs
its own grouping. Example: group_by: ['...']
Signed-off-by: Kyryl Sablin <kyryl.sablin@schibsted.com>
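Roughly, grouping by '...' means the group's label set is the alert's entire label set; a small sketch with plain maps standing in for the real label types:

```go
package dispatch

// groupLabels returns the labels an alert is grouped on. With groupByAll
// (config: group_by: ['...']) every label is kept, so each distinct label
// set lands in its own group and aggregation is effectively disabled.
func groupLabels(alertLabels map[string]string, groupBy map[string]bool, groupByAll bool) map[string]string {
	out := map[string]string{}
	for name, value := range alertLabels {
		if groupByAll || groupBy[name] {
			out[name] = value
		}
	}
	return out
}
```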
Move the code for storing and GC'ing alerts out of the several packages
that each re-implemented it and into its own package
Signed-off-by: stuart nelson <stuartnelson3@gmail.com>
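A compact sketch of what such a store package could look like (illustrative API, not the real one): alerts keyed by fingerprint, with an explicit GC pass for resolved ones.

```go
package store

import (
	"sync"
	"time"
)

// Alert is a stand-in for the real alert type.
type Alert struct {
	Fingerprint string
	EndsAt      time.Time
}

// Alerts is a concurrency-safe in-memory alert store with explicit GC.
type Alerts struct {
	mtx    sync.Mutex
	alerts map[string]*Alert
}

func NewAlerts() *Alerts {
	return &Alerts{alerts: map[string]*Alert{}}
}

// Set stores or replaces an alert keyed by its fingerprint.
func (s *Alerts) Set(a *Alert) {
	s.mtx.Lock()
	defer s.mtx.Unlock()
	s.alerts[a.Fingerprint] = a
}

// GC drops alerts whose EndsAt lies before now and reports how many were removed.
func (s *Alerts) GC(now time.Time) int {
	s.mtx.Lock()
	defer s.mtx.Unlock()
	n := 0
	for fp, a := range s.alerts {
		if !a.EndsAt.IsZero() && a.EndsAt.Before(now) {
			delete(s.alerts, fp)
			n++
		}
	}
	return n
}
```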
* fix concurrent read and write of the group
Signed-off-by: denghuan <denghuan@actionsky.com>
* make the locking more elegant
Signed-off-by: denghuan <denghuan@actionsky.com>
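The usual shape of such a fix is an RWMutex around the shared group map; a minimal sketch with hypothetical names:

```go
package dispatch

import "sync"

// groupSet guards the shared group map: the API reads it while the
// dispatcher inserts into it, so every access has to hold the lock.
type groupSet struct {
	mtx sync.RWMutex
	m   map[string][]string // value type is only illustrative
}

func (g *groupSet) get(key string) ([]string, bool) {
	g.mtx.RLock()
	defer g.mtx.RUnlock()
	v, ok := g.m[key]
	return v, ok
}

func (g *groupSet) set(key string, v []string) {
	g.mtx.Lock()
	defer g.mtx.Unlock()
	if g.m == nil {
		g.m = map[string][]string{}
	}
	g.m[key] = v
}
```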
* Sort dispatched alerts by job+instance in the correct order (#1178)
Signed-off-by: Ted Zlatanov <tzz@lifelogs.com>
* dispatch: add unit test for alerts sorting
Signed-off-by: Simon Pasquier <spasquie@redhat.com>
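For illustration only (stand-in types, not the code from #1178), sorting a slice of alerts by their job and then instance labels with sort.Slice:

```go
package dispatch

import "sort"

// Alert is a stand-in carrying only its labels.
type Alert struct {
	Labels map[string]string
}

// sortByJobInstance orders alerts by the job label first and the instance
// label second, giving dispatched alerts a stable, predictable order.
func sortByJobInstance(alerts []*Alert) {
	sort.Slice(alerts, func(i, j int) bool {
		if alerts[i].Labels["job"] != alerts[j].Labels["job"] {
			return alerts[i].Labels["job"] < alerts[j].Labels["job"]
		}
		return alerts[i].Labels["instance"] < alerts[j].Labels["instance"]
	})
}
```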
When the aggregation group receives an alert that is past the initial
group_wait value, it should reset its timer only if the timer has never
expired. Otherwise it means that the flush is already in progress.
This change decreases the repeat_interval parameter from 5s to 4.9s to
make sure that the alerts are effectively sent after 5 seconds.
The workflow is:
- The dispatcher flushes the alerts at t0, sends the notification and
marks the notification log at t0+epsilon.
- The dispatcher flushes the alerts at t1, t2, t3 and t4 and, as expected,
  doesn't send any notification.
- At t5, the dispatcher flushes the alerts and sends the notification again
  because current_time - (t0+epsilon) is greater than repeat_interval.
If repeat_interval were exactly 5s, there would be a small chance that it is
greater than current_time - (t0+epsilon), in which case the notification
wouldn't be sent.
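The same timing argument expressed in Go (a sketch of the repeat check with the test's numbers in the comment, not the actual notification-log code):

```go
package notify

import "time"

// needsRepeat decides whether a notification has to be re-sent. In the
// test above, lastNotify is t0+epsilon and now is t5 = t0+5s, so the
// elapsed time is 5s-epsilon: with repeatInterval = 5s that is just below
// the threshold and the send could be skipped; with 4.9s it always passes.
func needsRepeat(now, lastNotify time.Time, repeatInterval time.Duration) bool {
	return now.Sub(lastNotify) >= repeatInterval
}
```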
* Expose alert fingerprint in the API
The alert fingerprint is already provided as the value of the status.inhibitedBy[] attribute on inhibited alerts, but there is no way to get back to the alert that's inhibiting them, as its fingerprint is not exposed.
* Expose alert fingerprint as ID in the list endpoint
* Rename ID to Fingerprint
* Use Fingerprint().String() in the API
AlertStatus doesn't have a JSON tag with the field name, so it's serialized as 'Status', making it the only uppercase field in the alert object. Tag it with the 'status' name for consistency.
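An illustrative before/after of the struct tag (the field's type is simplified to a string here):

```go
package types

// Before: without a tag, encoding/json uses the Go field name "Status".
type alertBefore struct {
	Status string
}

// After: the tag serializes the field as lowercase "status" like the rest.
type alertAfter struct {
	Status string `json:"status"`
}
```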
Turn the GroupKey into a string that is composed of the matchers of the
path in the routing tree and the grouping labels.
Only hash it at the very end to ensure we don't exceed size limits of
integration APIs.
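A rough sketch of the approach (illustrative key format and a SHA-256 fallback, not the exact scheme): keep the key human-readable and hash it only when it would exceed an assumed integration limit.

```go
package notify

import (
	"crypto/sha256"
	"fmt"
	"sort"
	"strings"
)

// groupKey joins the matchers on the route's path with the group's labels
// into a readable string and falls back to a hash only when the result
// would be too long for the receiving integration.
func groupKey(routeMatchers []string, groupLabels map[string]string, maxLen int) string {
	pairs := make([]string, 0, len(groupLabels))
	for k, v := range groupLabels {
		pairs = append(pairs, fmt.Sprintf("%s=%q", k, v))
	}
	sort.Strings(pairs)

	key := fmt.Sprintf("{%s}:{%s}", strings.Join(routeMatchers, ","), strings.Join(pairs, ","))
	if len(key) <= maxLen {
		return key
	}
	// Hash only at the very end so the readable form is kept whenever possible.
	return fmt.Sprintf("%x", sha256.Sum256([]byte(key)))
}
```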
* Vendor dependencies.
This updates several old dependencies, removes
some that are no longer needed, and adds
`pkg/labels` from the Prometheus `dev-2.0` branch.
* Add metrics selector parsing code
This is a temporary, simplified re-implementation
of PromQL's metric selector parsing.
* Add alerts filtering
Filter alerts through `?filter=` query string.
* Add silences filtering
Filter silences through `?filter=` query string.
* Move `parse` to `pkg/parse`
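As a hypothetical usage example (illustrative endpoint paths and matcher values), a client builds the filter query string like this:

```go
package main

import (
	"fmt"
	"net/url"
)

func main() {
	// The filter value is a set of label matchers using the same syntax as a
	// PromQL metric selector, e.g. {severity="critical",job=~"node.*"}.
	params := url.Values{}
	params.Set("filter", `{severity="critical",job=~"node.*"}`)

	fmt.Println("/api/v1/alerts?" + params.Encode())
	fmt.Println("/api/v1/silences?" + params.Encode())
}
```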
This string value is initially used to store a receiver name. It is
later overloaded with a unique string identifier of <name, integration,
index>.
This renaming is in preparation for separating the two and using the
Receiver object from the nflogpb package.
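To make the overloading concrete, a hypothetical sketch of the two representations: the flat string identifier versus a structured value like the nflogpb Receiver (field names here are illustrative):

```go
package notify

import "fmt"

// receiverKey is the overloaded flat form: one string encoding the receiver
// name, the integration, and the integration's index.
func receiverKey(name, integration string, idx int) string {
	return fmt.Sprintf("%s/%s/%d", name, integration, idx)
}

// receiver mirrors the shape of the <name, integration, index> triple that a
// structured value such as nflogpb's Receiver would carry instead.
type receiver struct {
	GroupName   string
	Integration string
	Idx         uint32
}
```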