* Metrics: Notification log maintenance success and failure
For various reasons, we've observed different kinds of errors in this area, from read-only disks to silly code bugs. Errors during maintenance are effectively data loss, so we should encourage proper monitoring of this area.
Similar to #3285
---------
Signed-off-by: gotjosh <josue.abreu@gmail.com>
* Refactor nflog configuration options to make it similar to Silences.
The Notification Log is a similar component to Silences. They're the only two things that are shared between nodes when running in HA and they both hold some sort of internal state that needs to be cleaned up on an interval.
To simplify the code and make it a bit more understandable (among other benefits such as improved testability), I've refactored the notification log configuration and `run` to be similar to the silences.
It seems useless to keep the notifications in the nflog for longer than
twice the repeat interval. This should help reduce memory usage of
clustered alertmanagers.
Signed-off-by: Julien Pivotto <roidelapluie@o11y.eu>
* Enable support for custom callbacks as part of maintenance
This enables support for custom Maintenance callbacks as part of the periodic maintenance of silences and notification logs.
Effectively a no-op for the Alertmanager, but it allows downstream implementations to inject custom logic as part of it.
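
A minimal sketch of the callback pattern, assuming a function type that returns the snapshot size and an error. `MaintenanceFunc`, `defaultMaintenance` and `runOnce` are illustrative names and may not match the actual signatures.

```go
package main

import "fmt"

// MaintenanceFunc performs one maintenance cycle and reports the size of the
// snapshot it wrote. The name and signature are assumptions for this sketch.
type MaintenanceFunc func() (int64, error)

// defaultMaintenance is a placeholder for the built-in GC-and-snapshot step.
func defaultMaintenance() (int64, error) {
	return 0, nil
}

// runOnce runs the override when one is supplied and falls back to the
// default otherwise, so passing nil leaves the behaviour unchanged.
func runOnce(override MaintenanceFunc) (int64, error) {
	var do MaintenanceFunc = defaultMaintenance
	if override != nil {
		do = override
	}
	return do()
}

func main() {
	size, err := runOnce(nil)
	fmt.Println(size, err)

	size, err = runOnce(func() (int64, error) { return 42, nil })
	fmt.Println(size, err)
}
```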
Signed-off-by: gotjosh <josue.abreu@gmail.com>
* Add tests
Signed-off-by: gotjosh <josue.abreu@gmail.com>
* Fix tests and remove whitespace
Signed-off-by: gotjosh <josue.abreu@gmail.com>
* Address review comments
Signed-off-by: gotjosh <josue.abreu@gmail.com>
* run go fmt
Signed-off-by: gotjosh <josue.abreu@gmail.com>
* Fix import ordering
Signed-off-by: gotjosh <josue.abreu@gmail.com>
With the next release of client_golang, Summaries will not have
objectives by default. Interestingly, this will do the right thing for
the Summaries affected by this commit. However, right now those
summaries do get the old default objectives. They don't really make
sense because the affected Summaries receive Observations quite
infrequently (far less than once in the 10m max age currently
used). To not get surprising changes when moving on to client_golang
v1, let's explicitly set the Summaries as objective-less now.
Signed-off-by: beorn7 <beorn@grafana.com>
* [silences] Don't merge expired silences
If they're expired, they should be cleaned up on
the next GC cycle, but merging them in means that
they'll probably be gossip'd continually between
the cluster members.
Signed-off-by: stuart nelson <stuartnelson3@gmail.com>
* Add analogous behavior+test for nflog
The code for nflog was also constantly re-adding
nflogs to the internal memory store, the same as
the silence code was.
Signed-off-by: stuart nelson <stuartnelson3@gmail.com>
* Add retention to TestQuery
With the default 0 retention, the alerts would not
be merged.
Signed-off-by: Stuart Nelson <stuartnelson3@gmail.com>
* Gossip large messages via SendReliable
For messages beyond half of the maximum gossip
packet size, send the message to all peer nodes
via TCP.
The choice of "larger than half the max gossip
size" is relatively arbitrary. From brief testing,
the overhead from memberlist on a packet seemed to
only use ~3 of the available 1400 bytes, and most
gossip messages seem to be <<500 bytes.
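
The size check described above amounts to something like the following sketch. The constant takes the 1400-byte figure from this message, and `oversized` is an illustrative name.

```go
package main

import "fmt"

// maxGossipPacketSize mirrors the 1400 bytes mentioned above.
const maxGossipPacketSize = 1400

// oversized reports whether b is larger than half the maximum gossip packet
// size and should therefore be sent over TCP instead of being gossiped.
func oversized(b []byte) bool {
	return len(b) > maxGossipPacketSize/2
}

func main() {
	fmt.Println(oversized(make([]byte, 500))) // false
	fmt.Println(oversized(make([]byte, 701))) // true
}
```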
* Add tests for oversized/normal message gossiping
* Make oversize metric names consistent
* Remove errant printf in test
* Correctly increment WaitGroup
* Add comment for OversizedMessage func
* Add metric for oversized messages dropped
Code was added to drop oversized messages if the
buffered channel they are sent on is full. This
is a good thing to surface as a metric.
* Add counter for total oversized messages sent
* Change full queue log level to debug
Was previously a warning, which isn't necessary
now that there is a metric tracking it.
Signed-off-by: stuart nelson <stuartnelson3@gmail.com>
* Peers further propagate newly received nflogs
If a peer receives an nflog that it hasn't seen
before, queue the message and propagate it further
to other peers. This should ensure that all
peers within a cluster receive all gossip
messages.
Signed-off-by: stuart nelson <stuartnelson3@gmail.com>
* Set Retransmit value based on number of members
For alertmanagers that are brought up with a list
of peers, set the number of message retransmits to
be half of that number. If there are no peers on
start, or there are few, continue to use the
default value of 3.
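
A small sketch of that calculation, with hypothetical names:

```go
package main

import "fmt"

// defaultRetransmit is the fallback value of 3 mentioned above.
const defaultRetransmit = 3

// retransmitFor returns half the number of initial peers, but never less
// than the default.
func retransmitFor(numPeers int) int {
	r := numPeers / 2
	if r < defaultRetransmit {
		return defaultRetransmit
	}
	return r
}

func main() {
	for _, n := range []int{0, 4, 10} {
		fmt.Printf("peers=%d retransmit=%d\n", n, retransmitFor(n))
	}
}
```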
Signed-off-by: stuart nelson <stuartnelson3@gmail.com>
* [nflog] Move retransmit calculation
Signed-off-by: stuart nelson <stuartnelson3@gmail.com>
* [silence] further gossip silence messages
Signed-off-by: stuart nelson <stuartnelson3@gmail.com>
* Set GossipNodes to equal RetransmitMulti
During a gossip, we send messages to at most
GossipNodes nodes. If possible, we want a message
to reach all nodes as soon as possible.
Signed-off-by: stuart nelson <stuartnelson3@gmail.com>
* Fix rebase
Signed-off-by: stuart nelson <stuartnelson3@gmail.com>
* cluster: add alertmanager_cluster_messages_queued metric
* cluster: add metrics for sent messages
This change adds 2 new metrics:
- alertmanager_cluster_messages_sent_total
- alertmanager_cluster_messages_sent_size_total
* Fix marshaling for entries being broadcast
Individual notification logs and silences being broadcast to the other
peers need to be encoded using the same length-delimited format as when
doing full-state synchronization.
* main: fix argument order for cluster.Join()
cluster.Join() was called with the push/pull and gossip interval
parameters swapped.
* Log snapshot sizes on maintenance
* Add metrics for snapshot sizes
This change adds 2 new gauges for tracking the last snapshots' sizes:
- alertmanager_nflog_snapshot_size_bytes
- alertmanager_silences_snapshot_size_bytes
Turn the GroupKey into a string that is composed of the matchers of the
path in the routing tree and the grouping labels.
Only hash it at the very end to ensure we don't exceed size limits of
integration APIs.
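
A sketch of the idea: keep a readable key and hash it only when it is handed to an integration. The key format and the names `groupKey` and `hashKey` are invented for illustration, and SHA-256 is an arbitrary choice of hash.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sort"
	"strings"
)

// groupKey joins the matchers along the route path with the sorted grouping
// labels into a readable string. The exact format here is hypothetical.
func groupKey(routeMatchers []string, groupLabels map[string]string) string {
	names := make([]string, 0, len(groupLabels))
	for n := range groupLabels {
		names = append(names, n)
	}
	sort.Strings(names) // map iteration order is random, so sort for stability

	pairs := make([]string, 0, len(names))
	for _, n := range names {
		pairs = append(pairs, fmt.Sprintf("%s=%q", n, groupLabels[n]))
	}
	return strings.Join(routeMatchers, "/") + ":{" + strings.Join(pairs, ",") + "}"
}

// hashKey shortens the key to a fixed length right before it is passed to an
// integration API with size limits.
func hashKey(key string) string {
	sum := sha256.Sum256([]byte(key))
	return hex.EncodeToString(sum[:])
}

func main() {
	key := groupKey(
		[]string{`{}`, `{team="db"}`},
		map[string]string{"cluster": "eu", "alertname": "X"},
	)
	fmt.Println(key)
	fmt.Println(len(hashKey(key)))
}
```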
Building a hash over an entire set of alerts causes problems, because
the hash differs on any change, whereas we only want to send
notifications if the alert and its state have changed. Therefore this
introduces a list of alerts that are active and a list of alerts that
are resolved. If the currently active alerts of a group are a subset of
the ones that have been notified about before then they are
deduplicated. The resolved notifications work the same way, with a
separate list of resolved notifications that have already been sent.
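
The subset rule can be sketched as follows, with alerts identified by a hash. The function names are hypothetical, and the sketch leaves out other factors such as repeat intervals.

```go
package main

import "fmt"

// isSubset reports whether every element of sub is present in set.
func isSubset(sub []uint64, set map[uint64]struct{}) bool {
	for _, h := range sub {
		if _, ok := set[h]; !ok {
			return false
		}
	}
	return true
}

// needsUpdate reports whether a notification should be sent: it is
// deduplicated only when both the firing and the resolved alerts are subsets
// of what was already notified about.
func needsUpdate(firing, resolved []uint64, sentFiring, sentResolved map[uint64]struct{}) bool {
	return !isSubset(firing, sentFiring) || !isSubset(resolved, sentResolved)
}

func main() {
	sentFiring := map[uint64]struct{}{1: {}, 2: {}}
	sentResolved := map[uint64]struct{}{}

	fmt.Println(needsUpdate([]uint64{1}, nil, sentFiring, sentResolved))    // false
	fmt.Println(needsUpdate([]uint64{1, 3}, nil, sentFiring, sentResolved)) // true
	fmt.Println(needsUpdate([]uint64{1}, []uint64{2}, sentFiring, sentResolved)) // true
}
```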
This commit replaces the previous NotifyInfo provider with the new
nflog package. It needs adjustments in the behavior of the deduping
stage.
The nflog stores notification digests per receiver per alert aggregation
group rather than one entry per alert per receiver. This drastically
reduces the number of entries and removes interference
across aggregation groups.
This removes the Delete function from the interface, as the log
should be append-only and should only shrink when entries expire.
This also adds an argument to configure a background processing routine,
which periodically garbage collects and snapshots.
This adds a new nflog package meant to replace provider.Notifies. It
has a central protobuf type package, which is also meant for use by
other packages and the API.
The generated Go types are also the in-memory representation.