Commit Graph

36 Commits

Author SHA1 Message Date
gotjosh 28925efbd8
Metrics: Notification log maintenance success and failure (#3286)
* Metrics: Notification log maintenance success and failure

Due to various reasons, we've observed different kind of errors on this area. From read-only disks to silly code bugs. Errors during maintenance are effectively a data loss and therefore, we should encourage proper monitoring of this area.

Similar to #3285
---------

Signed-off-by: gotjosh <josue.abreu@gmail.com>
2023-03-08 10:29:05 +00:00
gotjosh f59460bfd4
Refactor nflog configuration options to make it similar to Silences. (#3220)
* Refactor nflog configuration options to make it similar to Silences.

The Notification Log is a similar component to Silences. They're the only two things that are shared between nodes when running in HA and they both hold some sort of internal state that needs to be cleaned up on an interval.

To simplify the code and make it a bit more understandable (among other benefits such as improved testability) - I've refactor the notification log configuration and `run` to be similar to the silences.
2023-01-19 16:39:03 +00:00
Julien Pivotto b0443021dc Expires notify log sooner when possible
It seems useless to keep the notifications in the nflog for longer than
twice the repeat interval. This should help reduce memory usage of
clustered alertmanagers.

Signed-off-by: Julien Pivotto <roidelapluie@o11y.eu>
2022-10-14 10:03:17 +02:00
Matthias Loibl a6d10bd5bc
Update golangci-lint and fix complaints (#2853)
* Copy latest golangci-lint files from Prometheus

Signed-off-by: Matthias Loibl <mail@matthiasloibl.com>

* Use grafana/regexp over stdlib regexp

Signed-off-by: Matthias Loibl <mail@matthiasloibl.com>

* Fix typos in comments

Signed-off-by: Matthias Loibl <mail@matthiasloibl.com>

* Fix goimports complains in import sorting

Signed-off-by: Matthias Loibl <mail@matthiasloibl.com>

* gofumpt all Go files

Signed-off-by: Matthias Loibl <mail@matthiasloibl.com>

* Update naming to comply with revive linter

Signed-off-by: Matthias Loibl <mail@matthiasloibl.com>

* config: Fix error messages to be lower case

Signed-off-by: Matthias Loibl <mail@matthiasloibl.com>

* test/cli: Fix error messages to be lower case

Signed-off-by: Matthias Loibl <mail@matthiasloibl.com>

* .golangci.yaml: Remove obsolete space

Signed-off-by: Matthias Loibl <mail@matthiasloibl.com>

* config: Fix expected victorOps error

Signed-off-by: Matthias Loibl <mail@matthiasloibl.com>

* Use stdlib regexp

Signed-off-by: Matthias Loibl <mail@matthiasloibl.com>

* Clean up Go modules

Signed-off-by: Matthias Loibl <mail@matthiasloibl.com>
2022-03-25 17:59:51 +01:00
gotjosh 8da517524a
Enable support for custom callbacks as part of maintenance (#2689)
* Enable support for custom callbacks as part of maintenance

This enables support for custom Maintenance callbacks as part of the periodic maintenance of silences and notification logs.
Effectively a no-op for the Alertmanager but allows downstream implementation to inject custom logic as part of it.

Signed-off-by: gotjosh <josue.abreu@gmail.com>

* Add tests

Signed-off-by: gotjosh <josue.abreu@gmail.com>

* Fix tests and remove whitespace

Signed-off-by: gotjosh <josue.abreu@gmail.com>

* Address review comments

Signed-off-by: gotjosh <josue.abreu@gmail.com>

* run go fmt

Signed-off-by: gotjosh <josue.abreu@gmail.com>

* Fix import ordering

Signed-off-by: gotjosh <josue.abreu@gmail.com>
2021-09-06 16:19:39 +05:30
Julien Pivotto b2a4cacb95 Update go dependencies & switch to go-kit/log
Signed-off-by: Julien Pivotto <roidelapluie@inuits.eu>
2021-08-02 12:43:23 +02:00
johncming 311650658a nflog: use errors.New instead of fmt.Errorf for no custom error msg. (#2045)
Signed-off-by: johncming <johncming@yahoo.com>
2019-09-25 10:31:01 +02:00
beorn7 318e006065 Mark some Summaries explicitly as having no objectives
With the next release of client_golang, Summaries will not have
objectives by default. Interestingly, this will do the right thing for
the Summaries affected by this commit. However, right now those
summaries do get the old default objectives. They don't really make
sense because the affected Summaries receive Observations quite
infrequently (far less than once in the 10m max age currently
used). To not get surprising changes when moving on to client_golang
v1, let's explicitly set the Summaries as objective-less now.

Signed-off-by: beorn7 <beorn@grafana.com>
2019-06-12 15:47:56 +02:00
stuart nelson 2026e4a01f
[gossip] Don't merge expired gossip messages (#1631)
* [silences] Don't merge expired silences

If they're expired, they should be cleaned up on
the next GC cycle, but merging them in means that
they'll probably be gossip'd continually between
the cluster members.

Signed-off-by: stuart nelson <stuartnelson3@gmail.com>

* Add analogous behavior+test for nflog

The code for nflog was also constantly re-adding
nflogs to the internal memory store, the same as
the silence code was.

Signed-off-by: stuart nelson <stuartnelson3@gmail.com>

* Add retention to TestQuery

With the default 0 retention, the alerts would not
be merged.

Signed-off-by: Stuart Nelson <stuartnelson3@gmail.com>
2018-11-21 11:40:57 +01:00
stuart nelson 445fbdf1a8
gossip large messages via SendReliable (#1415)
* Gossip large messages via SendReliable

For messages beyond half of the maximum gossip
packet size, send the message to all peer nodes
via TCP.

The choice of "larger than half the max gossip
size" is relatively arbitrary. From brief testing,
the overhead from memberlist on a packet seemed to
only use ~3 of the available 1400 bytes, and most
gossip messages seem to be <<500 bytes.

* Add tests for oversized/normal message gossiping

* Make oversize metric names consistent

* Remove errant printf in test

* Correctly increment WaitGroup

* Add comment for OversizedMessage func

* Add metric for oversized messages dropped

Code was added to drop oversized messages if the
buffered channel they are sent on is full. This
is a good thing to surface as a metric.

* Add counter for total oversized messages sent

* Change full queue log level to debug

Was previously a warning, which isn't necessary
now that there is a metric tracking it.

Signed-off-by: stuart nelson <stuartnelson3@gmail.com>
2018-06-15 13:40:21 +02:00
stuart nelson 77cc718a81 [nflog] register snapshotSize
This metric was never registered.
2018-06-12 13:59:48 +02:00
stuart nelson 36588c3865
memberlist gossip (#1389)
* Peers further propagate newly received nflogs

If a peer receives an nflog that it hasn't seen
before, queue the message and propagate it further
to other peers. This should ensure that all
peers within a cluster receive all gossip
messages.

Signed-off-by: stuart nelson <stuartnelson3@gmail.com>

* Set Retransmit value based on number of members

For alertmanagers that are brought up with a list
of peers, set the number of message retransmits to
be half of that number. If there are no peers on
start, or there are few, continue to use the
default value of 3.

Signed-off-by: stuart nelson <stuartnelson3@gmail.com>

* [nflog] Move retransmit calculation

Signed-off-by: stuart nelson <stuartnelson3@gmail.com>

* [silence] further gossip silence messages

Signed-off-by: stuart nelson <stuartnelson3@gmail.com>

* Set GossipNodes to equal RetransmitMulti

During a gossip, we send messages to at most
GossipNodes nodes. If possible, we only a message
to reach all nodes as soon as possible.

Signed-off-by: stuart nelson <stuartnelson3@gmail.com>

* Fix rebase

Signed-off-by: stuart nelson <stuartnelson3@gmail.com>
2018-06-08 11:48:42 +02:00
Ted Zlatanov b04e9ad19b #1346: move maintenance messages to DEBUG log level (#1347)
Signed-off-by: Ted Zlatanov <tzz@lifelogs.com>
2018-04-30 11:56:17 +02:00
Simon Pasquier a8c995f77c nflog: fix potential panic in decodeState()
Signed-off-by: Simon Pasquier <spasquie@redhat.com>
2018-04-10 10:11:40 +02:00
Simon Pasquier 1531aa66f3 Fix for #1282 (#1286)
* cluster: add alertmanager_cluster_messages_queued metric

* cluster: add metrics for sent messages

This change adds 2 new metrics:

- alertmanager_cluster_messages_sent_total
- alertmanager_cluster_messages_sent_size_total

* Fix marshaling for entries being broadcast

Individual notifications logs and silences being broadcast to the other
peers need to be encoded using the same length-delimited format as when
doing full-state synchronization.

* main: fix argument order for cluster.Join()

cluster.Join() was called with the push/pull and gossip interval
parameters being swapped one for another.
2018-03-22 13:53:00 +01:00
Fabian Reinartz 247bfff606 cluster: remove MergeSingle 2018-02-09 11:06:51 +01:00
Fabian Reinartz fd49dbb477 *: move to memberlist for clustering 2018-02-08 12:18:44 +01:00
pasquier-s 9b10acae68 Don't notify resolved alerts if none were firing (#1198)
* Don't notify resolved alerts if none were firing

* Fix comments
2018-01-18 11:12:17 +01:00
pasquier-s a7d4e4ea7c Log snapshot sizes on maintenance (#1155)
* Log snapshot sizes on maintenance

* Add metrics for snapshot sizes

This change adds 2 new gauges for tracking the last snapshots' sizes:

  - alertmanager_nflog_snapshot_size_bytes
  - alertmanager_silences_snapshot_size_bytes
2018-01-10 14:53:57 +01:00
Frederic Branczyk bfdff67138 nflog: Copy and replace gossipData instead of modifying it in place (#1121) 2017-12-09 15:22:07 +01:00
Julius Volz fc984941ee nflog: Fix Log() crash when gossip is nil (#1064) 2017-11-01 10:34:40 +01:00
Julius Volz 947970af44 Convert Alertmanager to use non-global go-kit loggers
Fixes https://github.com/prometheus/alertmanager/issues/1040
2017-10-22 00:20:40 -07:00
Fabian Reinartz 3269bc39e1 *: switch group key to matcher serialization
Turn the GroupKey into a string that is composed of the matchers if the
path in the routing tree and the grouping labels.
Only hash it at the very end to ensure we don't exceed size limits of
integration APIs.
2017-04-21 12:06:23 +02:00
Fabian Reinartz 4258b028d6 nflog: switch to gogoproto
This switches the nflog to generate Go code via gogoproto and thereby
use standard library timestamp types.
2017-04-18 10:03:57 +02:00
Fabian Reinartz 309c6af4b2
nflog: use alert set instead of hash for deduplication
Building a hash over an entire set of alerts causes problems, because
the hash differs, on any change, whereas we only want to send
notifications if the alert and it's state have changed. Therefore this
introduces a list of alerts that are active and a list of alerts that
are resolved. If the currently active alerts of a group are a subset of
the ones that have been notified about before then they are
deduplicated. The resolved notifications work the same way, with a
separate list of resolved notifications that have already been sent.
2017-04-13 15:13:47 +02:00
Fabian Reinartz 1e01b2bdba nflog: add metrics (#518) 2016-11-21 15:22:35 +01:00
Fabian Reinartz b2461bb2d4 *: remove go-kit logging 2016-09-06 11:56:57 +02:00
Fabian Reinartz d6713c8eeb nflog: enable sharing log via gossip 2016-08-19 12:20:04 +02:00
Fabian Reinartz 5dc8286942 nflog: fix maintenance termination 2016-08-19 12:01:16 +02:00
Fabian Reinartz 72fdf3d3ab *: integrate nflog
This commit replaces the previous NotifyInfo provider with the new
nflog package. It needs adjustments in the behavior of the deduping
stage.
The nflog stores notification digests per receiver per alert aggregation
group rather than one entry for alert per receiver. This drastically
reduces the number of entries and removes interference
across aggregation groups.
2016-08-18 15:52:28 +02:00
Fabian Reinartz a42a473213 nflog: add doc comments, license headers 2016-08-18 14:08:01 +02:00
Fabian Reinartz 4a5df40539 nflog: add logging 2016-08-16 16:33:17 +02:00
Fabian Reinartz 48b6c8ff70 nflog: add initial tests 2016-08-16 11:11:48 +02:00
Fabian Reinartz 086d581cf8 nflog: add gc/snapshotting maintenance, remove delete
This removes the Delete function from the interface as the log
should be append-only and only be reduced by expired entries.
This also adds an argument to configure a background processing routine,
which periodically garbage collects and snapshots.
2016-08-16 11:11:48 +02:00
Fabian Reinartz 80afd502d5 nflog: add mesh gossip support 2016-08-16 11:11:48 +02:00
Fabian Reinartz 3d8e60ded7 nflog: add notification log package
This adds a new nflog package meant to replace provider.Notifies. It
has a central protobuf type package, which is also meant for usage for
other packages and the API.
The generated Go types are also the in-memory representation.
2016-08-16 11:11:48 +02:00