As noted in #2867, there is an unnecessary require.Eventually in a
silence test. This PR addresses that by using a channel to signal that
the maintenance loop has completed.
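
A minimal sketch of the idea, not the actual test code: the loop closes a
channel when it exits, so the caller blocks on that channel instead of
polling. The names runMaintenance, stopc, done and runOnce are
hypothetical and chosen only for this illustration.

```go
package main

import (
	"fmt"
	"time"
)

// runMaintenance is a hypothetical stand-in for a maintenance loop. It calls
// doMaintenance on every tick until stopc is closed, then closes done.
func runMaintenance(interval time.Duration, stopc <-chan struct{}, done chan<- struct{}, doMaintenance func()) {
	defer close(done)
	t := time.NewTicker(interval)
	defer t.Stop()
	for {
		select {
		case <-stopc:
			return
		case <-t.C:
			doMaintenance()
		}
	}
}

// runOnce starts the loop, waits for the first maintenance run, stops the
// loop and blocks until it has really exited. It returns the number of runs.
func runOnce() int {
	stopc := make(chan struct{})
	done := make(chan struct{})
	ran := make(chan struct{}, 1)
	runs := 0
	go runMaintenance(time.Millisecond, stopc, done, func() {
		runs++
		select {
		case ran <- struct{}{}: // signal the first run without blocking later ones
		default:
		}
	})
	<-ran
	close(stopc)
	<-done // deterministic wait: no polling with a timeout
	return runs
}

func main() {
	fmt.Println("maintenance runs before shutdown:", runOnce())
}
```

Reading runs after receiving from done is safe because closing the channel
orders the loop's writes before the read.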
Signed-off-by: Joe Blubaugh <joe.blubaugh@grafana.com>
github.com/benbjohnson/clock provides a time interface to programs
rather than using the stdlib time package. This allows mocking time in
programs and tests. In this commit, the clock is used to speed up and
simplify testing of the silences package.
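
To illustrate why an injectable clock simplifies such tests, here is a
hand-rolled sketch. The Clock interface, fakeClock and expired below are
assumptions made for this example and are not the API of
github.com/benbjohnson/clock; they only show the general pattern.

```go
package main

import (
	"fmt"
	"time"
)

// Clock is a hypothetical minimal time source used instead of time.Now.
type Clock interface {
	Now() time.Time
}

// realClock delegates to the standard library.
type realClock struct{}

func (realClock) Now() time.Time { return time.Now() }

// fakeClock only moves when the test tells it to.
type fakeClock struct{ now time.Time }

func (f *fakeClock) Now() time.Time          { return f.now }
func (f *fakeClock) Advance(d time.Duration) { f.now = f.now.Add(d) }

var _ Clock = realClock{} // production code would use this implementation

type silence struct{ endsAt time.Time }

// expired reports whether the silence has ended according to the given clock.
func expired(c Clock, s silence) bool { return !s.endsAt.After(c.Now()) }

// demo checks a one-hour silence before and after advancing the fake clock
// by two hours, without sleeping.
func demo() (before, after bool) {
	fc := &fakeClock{now: time.Date(2022, 1, 1, 0, 0, 0, 0, time.UTC)}
	s := silence{endsAt: fc.Now().Add(time.Hour)}
	before = expired(fc, s)
	fc.Advance(2 * time.Hour)
	after = expired(fc, s)
	return before, after
}

func main() {
	before, after := demo()
	fmt.Println(before, after) // false true
}
```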
Signed-off-by: Joe Blubaugh <joe.blubaugh@grafana.com>
so third parties, Grafana in particular, can override the validation.
Grafana wants to do this because other data sources have label keys containing spaces, periods, or other characters, and Grafana is looking for a better integration with Alertmanager.
goes with grafana/grafana#38629
replaces https://github.com/prometheus/alertmanager/pull/2694
Signed-off-by: Kyle Brandt <kyle@grafana.com>
https://github.com/prometheus/alertmanager/pull/2689 introduced a
regression where the default maintenance function was no longer
called when no override was specified. Without this fix, Alertmanager
crashes on any silence maintenance run.
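
The intended behaviour can be sketched as follows. The store type, the
MaintenanceFunc signature and the maintain method are assumptions made for
illustration and may differ from the real code: a nil override must fall
back to the default rather than be called.

```go
package main

import "fmt"

// MaintenanceFunc is a hypothetical callback type for a maintenance run.
type MaintenanceFunc func() (int64, error)

// store counts how often its default maintenance has run.
type store struct{ snapshots int }

// defaultMaintenance stands in for the built-in snapshot and GC step.
func (s *store) defaultMaintenance() (int64, error) {
	s.snapshots++
	return 0, nil
}

// maintain runs the override if one is given and the default otherwise.
// Calling a nil override directly would panic, which is the kind of crash
// the fix above guards against.
func (s *store) maintain(override MaintenanceFunc) (int64, error) {
	var doMaintenance MaintenanceFunc = s.defaultMaintenance
	if override != nil {
		doMaintenance = override
	}
	return doMaintenance()
}

func main() {
	s := &store{}
	s.maintain(nil)
	fmt.Println(s.snapshots) // 1: the default ran

	size, _ := s.maintain(func() (int64, error) { return 42, nil })
	fmt.Println(size, s.snapshots) // 42 1: the override ran instead
}
```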
Signed-off-by: Julius Volz <julius.volz@gmail.com>
* Enable support for custom callbacks as part of maintenance
This enables support for custom maintenance callbacks as part of the periodic maintenance of silences and notification logs.
This is effectively a no-op for the Alertmanager, but it allows downstream implementations to inject custom logic into the maintenance run.
Signed-off-by: gotjosh <josue.abreu@gmail.com>
* Add tests
Signed-off-by: gotjosh <josue.abreu@gmail.com>
* Fix tests and remove whitespace
Signed-off-by: gotjosh <josue.abreu@gmail.com>
* Address review comments
Signed-off-by: gotjosh <josue.abreu@gmail.com>
* run go fmt
Signed-off-by: gotjosh <josue.abreu@gmail.com>
* Fix import ordering
Signed-off-by: gotjosh <josue.abreu@gmail.com>
Previously, if a pending silence existed for an alert, and it later
became active without any silences getting added in the meantime, we
would miss the existence of that newly active silence.
Signed-off-by: beorn7 <beorn@grafana.com>
* check if at least one silence matcher doesn't match empty strings
Signed-off-by: qoops <ilya.v.gladyshev@gmail.com>
* fixed grammar
Signed-off-by: qoops <ilya.v.gladyshev@gmail.com>
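
A sketch of the matcher check described above, under stated assumptions:
the matcher type, its fields and validateMatchers are hypothetical names,
and regex matchers are assumed to be fully anchored. A silence whose
matchers all match the empty string would match every alert, so it is
rejected.

```go
package main

import (
	"errors"
	"fmt"
	"regexp"
)

// matcher is a simplified, hypothetical label matcher.
type matcher struct {
	name    string
	pattern string
	isRegex bool
}

// matchesEmpty reports whether the matcher accepts an empty label value.
func (m matcher) matchesEmpty() (bool, error) {
	if !m.isRegex {
		return m.pattern == "", nil
	}
	re, err := regexp.Compile("^(?:" + m.pattern + ")$")
	if err != nil {
		return false, err
	}
	return re.MatchString(""), nil
}

// validateMatchers requires at least one matcher that rejects the empty string.
func validateMatchers(ms []matcher) error {
	allEmpty := true
	for _, m := range ms {
		empty, err := m.matchesEmpty()
		if err != nil {
			return fmt.Errorf("invalid matcher for %q: %w", m.name, err)
		}
		if !empty {
			allEmpty = false
		}
	}
	if allEmpty {
		return errors.New("at least one matcher must not match the empty string")
	}
	return nil
}

func main() {
	catchAll := []matcher{{name: "job", pattern: ".*", isRegex: true}}
	fmt.Println(validateMatchers(catchAll) != nil) // true: rejected

	ok := []matcher{
		{name: "job", pattern: ".*", isRegex: true},
		{name: "instance", pattern: "db-1"},
	}
	fmt.Println(validateMatchers(ok) == nil) // true: accepted
}
```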
Essentially, the Silences.Expire() will in that case have no effect
because the affected silence is immediately seen as expired from the
storage and thus not updated. The silence will stay around in its old
state.
This fix makes sure to use the same “now” throughout the expiration
process.
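
The effect of sharing one timestamp can be sketched like this. The types
and the functions stateAt, store and expire are hypothetical
simplifications, not the package's real code: the store step refuses to
write a silence that is already expired at the time it is given, so it
has to see the same "now" that was used as the new end time.

```go
package main

import (
	"fmt"
	"time"
)

type silence struct{ startsAt, endsAt time.Time }

// stateAt classifies a silence at ts; a silence ending exactly at ts still
// counts as active.
func stateAt(s silence, ts time.Time) string {
	if ts.Before(s.startsAt) {
		return "pending"
	}
	if ts.After(s.endsAt) {
		return "expired"
	}
	return "active"
}

// store writes s unless it is already expired at now.
func store(db map[string]silence, id string, s silence, now time.Time) bool {
	if stateAt(s, now) == "expired" {
		return false
	}
	db[id] = s
	return true
}

// expire ends the silence at now and stores it using that same now.
func expire(db map[string]silence, id string, now time.Time) bool {
	s, ok := db[id]
	if !ok {
		return false
	}
	s.endsAt = now
	return store(db, id, s, now)
}

// demo returns whether the update is accepted with a shared timestamp and
// with a second, slightly later timestamp.
func demo() (sameNow, laterNow bool) {
	t0 := time.Date(2022, 1, 1, 0, 0, 0, 0, time.UTC)
	orig := silence{startsAt: t0.Add(-time.Hour), endsAt: t0.Add(time.Hour)}

	db := map[string]silence{"a": orig}
	sameNow = expire(db, "a", t0)

	// Simulate the old behaviour: the store step takes a fresh timestamp.
	ended := orig
	ended.endsAt = t0
	laterNow = store(map[string]silence{"a": orig}, "a", ended, t0.Add(time.Nanosecond))
	return sameNow, laterNow
}

func main() {
	fmt.Println(demo()) // true false
}
```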
Signed-off-by: beorn7 <beorn@soundcloud.com>
Add version tracking of silences states. Adding a silence to the state
increments the version. If the version hasn't changed since the last
time an alert was checked for being silenced, we only have to verify
that the relevant silences are still active rather than checking the
alert against all silences.
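
A rough sketch of the version check, with hypothetical types and names
(silencer, cacheEntry, mutes) and a single-label matcher for brevity. The
cache keeps every matching silence that is not yet expired, including
pending ones, so a silence that becomes active later is still noticed
without a full scan.

```go
package main

import (
	"fmt"
	"time"
)

type silence struct {
	id, label, value string
	startsAt, endsAt time.Time
}

func (s silence) activeAt(t time.Time) bool  { return !t.Before(s.startsAt) && !t.After(s.endsAt) }
func (s silence) expiredAt(t time.Time) bool { return t.After(s.endsAt) }

// cacheEntry remembers which silences matched an alert at a given version.
type cacheEntry struct {
	version int
	ids     []string
}

type silencer struct {
	version   int
	silences  map[string]silence
	cache     map[string]cacheEntry // keyed by alert fingerprint
	fullScans int                   // counts scans over all silences
}

func newSilencer() *silencer {
	return &silencer{silences: map[string]silence{}, cache: map[string]cacheEntry{}}
}

// add stores a silence and bumps the version, invalidating cached entries.
func (s *silencer) add(sil silence) {
	s.silences[sil.id] = sil
	s.version++
}

// mutes reports whether the alert is silenced at now.
func (s *silencer) mutes(fp string, labels map[string]string, now time.Time) bool {
	entry, ok := s.cache[fp]
	if !ok || entry.version != s.version {
		// Version changed (or first check): match against all silences.
		s.fullScans++
		entry = cacheEntry{version: s.version}
		for id, sil := range s.silences {
			if labels[sil.label] == sil.value && !sil.expiredAt(now) {
				entry.ids = append(entry.ids, id)
			}
		}
		s.cache[fp] = entry
	}
	// Cheap path: only verify the previously matched silences.
	for _, id := range entry.ids {
		if s.silences[id].activeAt(now) {
			return true
		}
	}
	return false
}

// demo checks a pending silence before and after it becomes active.
func demo() (mutedWhilePending, mutedWhenActive bool, scans int) {
	t0 := time.Date(2022, 1, 1, 0, 0, 0, 0, time.UTC)
	s := newSilencer()
	s.add(silence{id: "a", label: "job", value: "api", startsAt: t0.Add(time.Hour), endsAt: t0.Add(3 * time.Hour)})
	alert := map[string]string{"job": "api"}
	mutedWhilePending = s.mutes("fp1", alert, t0)
	mutedWhenActive = s.mutes("fp1", alert, t0.Add(2*time.Hour))
	return mutedWhilePending, mutedWhenActive, s.fullScans
}

func main() {
	fmt.Println(demo()) // false true 1
}
```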
Signed-off-by: beorn7 <beorn@soundcloud.com>
* [silences] Don't merge expired silences
If they're expired, they should be cleaned up on
the next GC cycle, but merging them in means that
they'll probably be gossip'd continually between
the cluster members.
Signed-off-by: stuart nelson <stuartnelson3@gmail.com>
* Add analogous behavior+test for nflog
The code for nflog was also constantly re-adding
nflogs to the internal memory store, the same as
the silence code was.
Signed-off-by: stuart nelson <stuartnelson3@gmail.com>
* Add retention to TestQuery
With the default 0 retention, the alerts would not
be merged.
Signed-off-by: Stuart Nelson <stuartnelson3@gmail.com>
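
The merge rule from the first item above can be sketched as follows. The
entry type and merge function are hypothetical simplifications: an entry
whose expiry (end time plus retention) is already in the past is dropped,
so it is neither stored nor gossiped again.

```go
package main

import (
	"fmt"
	"time"
)

// entry is a simplified state entry; expiresAt is when GC may remove it.
type entry struct {
	id        string
	updatedAt time.Time
	expiresAt time.Time
}

// merge stores e if it is still live and newer than what is already known.
// The return value says whether the entry was merged and is worth
// forwarding to other peers.
func merge(st map[string]entry, e entry, now time.Time) bool {
	if e.expiresAt.Before(now) {
		return false // already due for GC: do not re-add or re-gossip
	}
	prev, ok := st[e.id]
	if ok && !prev.updatedAt.Before(e.updatedAt) {
		return false // we already hold this version or a newer one
	}
	st[e.id] = e
	return true
}

// demo merges a fresh entry, an expired entry, and a stale update.
func demo() (fresh, expired, stale bool) {
	now := time.Date(2022, 1, 1, 0, 0, 0, 0, time.UTC)
	st := map[string]entry{}
	fresh = merge(st, entry{id: "a", updatedAt: now, expiresAt: now.Add(time.Hour)}, now)
	expired = merge(st, entry{id: "b", updatedAt: now, expiresAt: now.Add(-time.Minute)}, now)
	stale = merge(st, entry{id: "a", updatedAt: now.Add(-time.Second), expiresAt: now.Add(time.Hour)}, now)
	return fresh, expired, stale
}

func main() {
	fmt.Println(demo()) // true false false
}
```

With a retention of zero, an entry's expiry equals its end time, which is
why a test that merges already-ended entries needs a non-zero retention.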
* cluster: add alertmanager_cluster_messages_queued metric
* cluster: add metrics for sent messages
This change adds 2 new metrics:
- alertmanager_cluster_messages_sent_total
- alertmanager_cluster_messages_sent_size_total
* Fix marshaling for entries being broadcast
Individual notification logs and silences being broadcast to the other
peers need to be encoded using the same length-delimited format as when
doing full-state synchronization.
* main: fix argument order for cluster.Join()
cluster.Join() was called with the push/pull and gossip interval
parameters swapped.
TestStateMerge() was skipped because of a typo. Fixing the name revealed
that the test itself needed to be updated following the switch to the
memberlist library.
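
The length-delimited framing mentioned in the marshaling fix above can be
sketched with the standard library alone. This is an illustration of the
general technique (a varint length prefix before each message), not the
encoder Alertmanager actually uses; writeDelimited, readAllDelimited and
roundTrip are hypothetical helpers, and the payloads are plain strings
rather than protobuf messages.

```go
package main

import (
	"bufio"
	"bytes"
	"encoding/binary"
	"fmt"
	"io"
)

// writeDelimited writes a varint length prefix followed by the message.
func writeDelimited(w io.Writer, msg []byte) error {
	var hdr [binary.MaxVarintLen64]byte
	n := binary.PutUvarint(hdr[:], uint64(len(msg)))
	if _, err := w.Write(hdr[:n]); err != nil {
		return err
	}
	_, err := w.Write(msg)
	return err
}

// readAllDelimited reads length-prefixed messages until a clean EOF.
func readAllDelimited(r io.Reader) ([][]byte, error) {
	br := bufio.NewReader(r)
	var out [][]byte
	for {
		size, err := binary.ReadUvarint(br)
		if err == io.EOF {
			return out, nil
		}
		if err != nil {
			return nil, err
		}
		msg := make([]byte, size)
		if _, err := io.ReadFull(br, msg); err != nil {
			return nil, err
		}
		out = append(out, msg)
	}
}

// roundTrip encodes the messages into one buffer and decodes them again.
// A single broadcast entry and a full-state dump share the same framing,
// so one decoder handles both.
func roundTrip(msgs ...string) ([]string, error) {
	var buf bytes.Buffer
	for _, m := range msgs {
		if err := writeDelimited(&buf, []byte(m)); err != nil {
			return nil, err
		}
	}
	raw, err := readAllDelimited(&buf)
	if err != nil {
		return nil, err
	}
	out := make([]string, 0, len(raw))
	for _, b := range raw {
		out = append(out, string(b))
	}
	return out, nil
}

func main() {
	out, err := roundTrip("silence-1", "nflog-entry")
	fmt.Println(len(out), out, err)
}
```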
This adds metrics that look like this:
```
alertmanager_alerts{state="active"} 6
alertmanager_alerts{state="suppressed"} 0
alertmanager_silences{state="active"} 1
alertmanager_silences{state="expired"} 1
alertmanager_silences{state="pending"} 0
```
This can be used to monitor alertmanager's usage and validate that
alertmanagers in a mesh have a similar number of silences and alerts.
* silences: avoid deadlock
Calling gossip.GossipBroadcast() will cause a deadlock if
there is a currently executing OnBroadcast* function.
See #982
* silence_test: better unit test to detect deadlocks
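
The general hazard behind the deadlock fix above can be sketched as
follows. This does not reproduce the mesh library's locking or the actual
fix; the store type, its broadcast hook and setCompletes are assumptions
for illustration. The point is that a lock must be released before
calling out to code that may call back in, and that a test can detect a
hang with a timeout.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// store is a hypothetical mutex-protected state with a broadcast hook.
type store struct {
	mu        sync.Mutex
	data      map[string]string
	broadcast func(k, v string)
}

// set updates local state and then broadcasts. The lock is released before
// the broadcast; holding it here would deadlock if the broadcast re-enters
// onBroadcast on the same store.
func (s *store) set(k, v string) {
	s.mu.Lock()
	s.data[k] = v
	s.mu.Unlock()
	s.broadcast(k, v)
}

// onBroadcast applies an update received from a peer.
func (s *store) onBroadcast(k, v string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.data[k] = v
}

// setCompletes wires the broadcast straight back into the same store and
// reports whether set returns within one second.
func setCompletes() bool {
	s := &store{data: map[string]string{}}
	s.broadcast = s.onBroadcast // loopback, the worst case for re-entrancy
	done := make(chan struct{})
	go func() {
		s.set("silence-1", "active")
		close(done)
	}()
	select {
	case <-done:
		return true
	case <-time.After(time.Second):
		return false // still blocked: treat as a deadlock
	}
}

func main() {
	fmt.Println(setCompletes()) // true
}
```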
This commit adds an implementation of a silence storage that can
store and modify silences, share state via a mesh network,
write and load snapshots, and be dynamically queried.
All data formats are based on protocol buffers.