* cluster: add alertmanager_cluster_messages_queued metric
* cluster: add metrics for sent messages
This change adds 2 new metrics:
- alertmanager_cluster_messages_sent_total
- alertmanager_cluster_messages_sent_size_total
* Fix marshaling for entries being broadcast
Individual notifications logs and silences being broadcast to the other
peers need to be encoded using the same length-delimited format as when
doing full-state synchronization.
* main: fix argument order for cluster.Join()
cluster.Join() was called with the push/pull and gossip interval
parameters being swapped one for another.
TestStateMerge() was skipped because of a typo. Fixing the name revealed
that the test itself needed to be updated following the switch to the
memberlist library.
* Log snapshot sizes on maintenance
* Add metrics for snapshot sizes
This change adds 2 new gauges for tracking the last snapshots' sizes:
- alertmanager_nflog_snapshot_size_bytes
- alertmanager_silences_snapshot_size_bytes
This adds metrics that look like this:
```
alertmanager_alerts{state="active"} 6
alertmanager_alerts{state="suppressed"} 0
alertmanager_silences{state="active"} 1
alertmanager_silences{state="expired"} 1
alertmanager_silences{state="pending"} 0
```
This can be used to monitor alertmanager's usage and validate that
alertmanagers in a mesh have a similar number of silences and alerts.
* silences: avoid deadlock
Calling gossip.GossipBroadcast() will cause a deadlock if
there is a currently executing OnBroadcast* function.
See #982
* silence_test: better unit test to detect deadlocks
Building a hash over an entire set of alerts causes problems, because
the hash differs, on any change, whereas we only want to send
notifications if the alert and it's state have changed. Therefore this
introduces a list of alerts that are active and a list of alerts that
are resolved. If the currently active alerts of a group are a subset of
the ones that have been notified about before then they are
deduplicated. The resolved notifications work the same way, with a
separate list of resolved notifications that have already been sent.
This fixes#559 by removing concurrent map writes to the matcher cache.
The cache was guarded by the Silence's main lock, which only used a
read-lock on queries.
The cache's get methods lazily loads data into the cache and thus
causing concurrent writes.
We just change the main lock to always write-lock, as we don't expect
high lock contention at this point and would have it in a dedicated
cache lock anyway.
This commit adds an implementation of a silence storage that can
share store and modify silences, share state via a mesh network,
write and load snapshots, and be dynamically queried.
All data formats are based on protocol buffers.