alertmanager

Commit Graph

Author	SHA1	Message	Date
gotjosh	3ee2cd0f12	Metrics: Silence maintenance success and failure (#3285) * Metrics: Silence maintenance success and failure Due to various reasons, we've observed different kinds of errors in this area. From read-only disks to silly code bugs. Errors during maintenance are effectively a _data loss_ and therefore we should encourage proper monitoring of this area. This PR introduces a total and failure metric for silence maintenance. If agreed, I'll do the same for the nflog and fix the flaky test like I did for silences while I'm there. Signed-off-by: gotjosh <josue.abreu@gmail.com>	2023-03-08 12:32:59 +00:00
gotjosh	f59460bfd4	Refactor nflog configuration options to make it similar to Silences. (#3220) * Refactor nflog configuration options to make it similar to Silences. The Notification Log is a similar component to Silences. They're the only two things that are shared between nodes when running in HA and they both hold some sort of internal state that needs to be cleaned up on an interval. To simplify the code and make it a bit more understandable (among other benefits such as improved testability) - I've refactored the notification log configuration and `run` to be similar to the silences.	2023-01-19 16:39:03 +00:00
Joe Blubaugh	505f944c6a	Apply suggestions from code review. Signed-off-by: Joe Blubaugh <joe.blubaugh@grafana.com>	2022-07-05 11:22:46 +08:00
Joe Blubaugh	c9249a02bc	Remove a stray line that was breaking the linter. Signed-off-by: Joe Blubaugh <joe.blubaugh@grafana.com>	2022-07-05 11:22:46 +08:00
Joe Blubaugh	bedd3c4175	Clean up linter warnings about unused code and atomic package Signed-off-by: Joe Blubaugh <joe.blubaugh@grafana.com>	2022-07-05 11:22:46 +08:00
Joe Blubaugh	cb00d9259b	Issue #2850: Add benbjohnson/clock to the silences package. github.com/benbjohnson/clock provides a time interface to programs rather than using the stdlib time package. This allows mocking time in programs and tests. In this commit, the clock is used to speed up and simplify testing of the silences package. Signed-off-by: Joe Blubaugh <joe.blubaugh@grafana.com>	2022-07-05 11:22:46 +08:00
gotjosh	cfb909f419	Marker: Rename `SetSilenced` to `SetActiveOrSilenced` This accurately reflects what the function _actually_ does. If no active silence IDs are provided and the list of inhibitions we have is already empty the alert is actually set to Active. Took me a while to realise this as I was understanding how we populate the alert list. Signed-off-by: gotjosh <josue.abreu@gmail.com>	2022-06-17 12:51:23 +01:00
Matthias Loibl	a6d10bd5bc	Update golangci-lint and fix complaints (#2853) * Copy latest golangci-lint files from Prometheus Signed-off-by: Matthias Loibl <mail@matthiasloibl.com> * Use grafana/regexp over stdlib regexp Signed-off-by: Matthias Loibl <mail@matthiasloibl.com> * Fix typos in comments Signed-off-by: Matthias Loibl <mail@matthiasloibl.com> * Fix goimports complaints in import sorting Signed-off-by: Matthias Loibl <mail@matthiasloibl.com> * gofumpt all Go files Signed-off-by: Matthias Loibl <mail@matthiasloibl.com> * Update naming to comply with revive linter Signed-off-by: Matthias Loibl <mail@matthiasloibl.com> * config: Fix error messages to be lower case Signed-off-by: Matthias Loibl <mail@matthiasloibl.com> * test/cli: Fix error messages to be lower case Signed-off-by: Matthias Loibl <mail@matthiasloibl.com> * .golangci.yaml: Remove obsolete space Signed-off-by: Matthias Loibl <mail@matthiasloibl.com> * config: Fix expected victorOps error Signed-off-by: Matthias Loibl <mail@matthiasloibl.com> * Use stdlib regexp Signed-off-by: Matthias Loibl <mail@matthiasloibl.com> * Clean up Go modules Signed-off-by: Matthias Loibl <mail@matthiasloibl.com>	2022-03-25 17:59:51 +01:00
Simon Pasquier	3f42c5e813	Merge pull request #2816 from prashbnair/update_check Correcting the condition for updating a silence. Earlier was checking…	2022-03-04 15:17:12 +01:00
Soon-Ping	a2d18c93de	Return no error when deleting expired silence (#2817) * Changed Silences.expire(id) to not return error for already expired silence Signed-off-by: Soon-Ping Phang <soonping@amazon.com> * Added comment explaining idempotency change for Silences.expire() Signed-off-by: Soon-Ping Phang <soonping@amazon.com> * Trigger build Signed-off-by: Soon-Ping Phang <soonping@amazon.com> * Trigger build Signed-off-by: Soon-Ping Phang <soonping@amazon.com> * Fixed typo in comment Signed-off-by: Soon-Ping Phang <soonping@amazon.com> * Trigger build Signed-off-by: Soon-Ping Phang <soonping@amazon.com> * Trigger build Signed-off-by: Soon-Ping Phang <soonping@amazon.com> * Fixed another typo in comment Signed-off-by: Soon-Ping Phang <soonping@amazon.com> * Promoted comment to function-level Signed-off-by: Soon-Ping Phang <soonping@amazon.com> * Added API v2 test for DeleteSilence, PostSilence Signed-off-by: Soon-Ping Phang <soonping@amazon.com> * Fixed lint errors Signed-off-by: Soon-Ping Phang <soonping@amazon.com> * Trigger build Signed-off-by: Soon-Ping Phang <soonping@amazon.com> * Trigger build Signed-off-by: Soon-Ping Phang <soonping@amazon.com> * Trigger build Signed-off-by: Soon-Ping Phang <soonping@amazon.com>	2022-02-22 13:34:21 +01:00
Prashant Balachandran	66182178d0	Correcting the condition for updating a silence. Earlier was checking up to nanosecond precision but reduced to second as the UI only sends up to millisecond Signed-off-by: Prashant Balachandran <pnair@redhat.com>	2022-01-31 11:32:48 +05:30
Kyle Brandt	1b8afe7cb5	export ValidateMatcher for DI (#2) (#2716) so third parties, Grafana in particular, can override the validation. Grafana wants to do this because other data sources will have label keys with things like spaces, periods, or other characters - and looking for a better integration with alert manager. goes with grafana/grafana#38629 replaces https://github.com/prometheus/alertmanager/pull/2694 Signed-off-by: Kyle Brandt <kyle@grafana.com>	2021-10-21 09:29:55 +02:00
Yuriy Tseretyan	15f44f4a61	Close file descriptor after snapshot file was read (#2710) * close file if it is opened Signed-off-by: Yuriy Tseretyan <yuriy.tseretyan@grafana.com>	2021-10-19 01:12:02 +02:00
Julius Volz	5195460c95	Correctly call default silence maintenance function (#2701) https://github.com/prometheus/alertmanager/pull/2689 introduced a regression where the default maintenance function would no longer be called even if no override was specified. The Alertmanager now crashes on any silence maintenance run without this fix. Signed-off-by: Julius Volz <julius.volz@gmail.com>	2021-09-13 19:42:48 +05:30
gotjosh	8da517524a	Enable support for custom callbacks as part of maintenance (#2689) * Enable support for custom callbacks as part of maintenance This enables support for custom Maintenance callbacks as part of the periodic maintenance of silences and notification logs. Effectively a no-op for the Alertmanager but allows downstream implementation to inject custom logic as part of it. Signed-off-by: gotjosh <josue.abreu@gmail.com> * Add tests Signed-off-by: gotjosh <josue.abreu@gmail.com> * Fix tests and remove whitespace Signed-off-by: gotjosh <josue.abreu@gmail.com> * Address review comments Signed-off-by: gotjosh <josue.abreu@gmail.com> * run go fmt Signed-off-by: gotjosh <josue.abreu@gmail.com> * Fix import ordering Signed-off-by: gotjosh <josue.abreu@gmail.com>	2021-09-06 16:19:39 +05:30
Julien Pivotto	b2a4cacb95	Update go dependencies & switch to go-kit/log Signed-off-by: Julien Pivotto <roidelapluie@inuits.eu>	2021-08-02 12:43:23 +02:00
beorn7	e84c265196	Include pending silences for future muting decisions Previously, if a pending silence existed for an alert, and it later became active without any silences getting added in the meantime, we would miss the existence of that newly active silence. Signed-off-by: beorn7 <beorn@grafana.com>	2021-05-27 22:15:57 +02:00
Ganesh Vernekar	1f946f8a7d	Replace satori/go.uuid with gofrs/uuid Signed-off-by: Ganesh Vernekar <cs15btech11018@iith.ac.in>	2021-03-15 19:39:15 +05:30
Ganesh Vernekar	406ddd200a	Upgrade github.com/satori/go.uuid Signed-off-by: Ganesh Vernekar <cs15btech11018@iith.ac.in>	2021-03-10 14:49:07 +05:30
Björn Rabenstein	023937679f	Catch unknown matcher types in gossipped silences (#2484) This has been discussed in #2479. Even if the conclusion there was that we don't need this in a bugfix release, it's still better to have this kind of robustness. So this introduces the same check into the main branch. Signed-off-by: beorn7 <beorn@grafana.com>	2021-02-10 12:02:56 +01:00
Kiril Vladimirov	f5382af591	silence: Add tests for Not(Equal|Regexp) matchers ... and fix a bug in validating silences with such matchers, caught while writing them. Signed-off-by: Kiril Vladimirov <kiril@vladimiroff.org>	2021-01-22 17:02:50 +02:00
Kiril Vladimirov	7320d83cbc	Replace types.Matcher(s)? with labels.Matcher(s)? Signed-off-by: Kiril Vladimirov <kiril@vladimiroff.org>	2021-01-22 17:02:48 +02:00
Josh Soref	0f2c65d265	Spelling (#2167) * spelling: inhibition Signed-off-by: Josh Soref <jsoref@users.noreply.github.com> * spelling: matchers Signed-off-by: Josh Soref <jsoref@users.noreply.github.com> * spelling: notification Signed-off-by: Josh Soref <jsoref@users.noreply.github.com> * spelling: nonexistent Signed-off-by: Josh Soref <jsoref@users.noreply.github.com> * spelling: obfuscated Signed-off-by: Josh Soref <jsoref@users.noreply.github.com> * spelling: occurred Signed-off-by: Josh Soref <jsoref@users.noreply.github.com> * spelling: relevant Signed-off-by: Josh Soref <jsoref@users.noreply.github.com> * spelling: unexpected Signed-off-by: Josh Soref <jsoref@users.noreply.github.com> * spelling: marshaled Signed-off-by: Josh Soref <jsoref@users.noreply.github.com> * spelling: marshaling Signed-off-by: Josh Soref <jsoref@users.noreply.github.com>	2020-01-23 17:06:16 +01:00
Ilya Gladyshev	196c62f488	At least one non-empty silence matcher (#2081) * check if at least one silence matcher doesn't match empty strings Signed-off-by: qoops <ilya.v.gladyshev@gmail.com> * fixed grammar Signed-off-by: qoops <ilya.v.gladyshev@gmail.com>	2019-10-31 15:42:03 +01:00
beorn7	318e006065	Mark some Summaries explicitly as having no objectives With the next release of client_golang, Summaries will not have objectives by default. Interestingly, this will do the right thing for the Summaries affected by this commit. However, right now those summaries do get the old default objectives. They don't really make sense because the affected Summaries receive Observations quite infrequently (far less than once in the 10m max age currently used). To not get surprising changes when moving on to client_golang v1, let's explicitly set the Summaries as objective-less now. Signed-off-by: beorn7 <beorn@grafana.com>	2019-06-12 15:47:56 +02:00
stuart nelson	b3972f3adc	Merge pull request #1672 from aixeshunter/master Unused function 'QTimeRange' and empty slice declaration via literal	2019-05-03 14:13:04 +02:00
beorn7	46b61a38cd	Remove a confusing closure Signed-off-by: beorn7 <beorn@soundcloud.com>	2019-02-28 13:04:05 +01:00
beorn7	0ab3b724cc	Fix bug with zero retention time Essentially, the Silences.Expire() will in that case have no effect because the affected silence is immediately seen as expired from the storage and thus not updated. The silence will stay around in its old state. This fix makes sure to use the same “now” throughout the expiration process. Signed-off-by: beorn7 <beorn@soundcloud.com>	2019-02-28 12:51:40 +01:00
beorn7	3c981a92f7	Improve `Mutes` performance for silences Add version tracking of silences states. Adding a silence to the state increments the version. If the version hasn't changed since the last time an alert was checked for being silenced, we only have to verify that the relevant silences are still active rather than checking the alert against all silences. Signed-off-by: beorn7 <beorn@soundcloud.com>	2019-02-28 12:34:41 +01:00
beorn7	f3d9c89bbc	Create a `Muter` implementation for silences This encapsulates the logic of querying and marking silenced alerts. It removes the code duplication flagged earlier. I removed the error returned by the setAlertStatus function as we were only logging it, and that's already done anyway when the error is received from the `silence.Query` call (now in the `Mutes` method). Signed-off-by: beorn7 <beorn@soundcloud.com>	2019-02-26 16:42:59 +01:00
aixeshunter	4deb083823	Unused function 'QTimeRange' Signed-off-by: aixeshunter <aixeshunter@gmail.com>	2018-12-19 09:47:52 +08:00
stuart nelson	2026e4a01f	[gossip] Don't merge expired gossip messages (#1631) * [silences] Don't merge expired silences If they're expired, they should be cleaned up on the next GC cycle, but merging them in means that they'll probably be gossip'd continually between the cluster members. Signed-off-by: stuart nelson <stuartnelson3@gmail.com> * Add analogous behavior+test for nflog The code for nflog was also constantly re-adding nflogs to the internal memory store, the same as the silence code was. Signed-off-by: stuart nelson <stuartnelson3@gmail.com> * Add retention to TestQuery With the default 0 retention, the alerts would not be merged. Signed-off-by: Stuart Nelson <stuartnelson3@gmail.com>	2018-11-21 11:40:57 +01:00
Max Leonard Inden	f1b920bcc9	api: Implement OpenAPI generated Alertmanager API V2 The current Alertmanager API v1 is undocumented and written by hand. This patch introduces a new Alertmanager API - v2. The API is fully generated via an OpenAPI 2.0 [1] specification (see `api/v2/openapi.yaml`) with the exception of the http handlers itself. Pros: - Generated server code - Ability to generate clients in all major languages (Go, Java, JS, Python, Ruby, Haskell, elm [3] ...) - Strict contract (OpenAPI spec) between server and clients. - Instant feedback on frontend-breaking changes, due to strictly typed frontend language elm. - Generated documentation (See Alertmanager online Swagger UI [4]) Cons: - Dependency on open api ecosystem including go-swagger [2] In addition this patch includes the following changes. - README.md: Add API section - test: Duplicate acceptance test to API v1 & API v2 version The Alertmanager acceptance test framework has a decent test coverage on the Alertmanager API. Introducing the Alertmanager API v2 does not go hand in hand with deprecating API v1. They should live alongside each other for a couple of minor Alertmanager versions. Instead of porting the acceptance test framework to use the new API v2, this patch duplicates the acceptance tests, one using the API v1, the other API v2. Once API v1 is removed we can simply remove `test/with_api_v1` and bring `test/with_api_v2` to `test/`. [1] https://github.com/OAI/OpenAPI-Specification/blob/master/versions/2.0.md [2] https://github.com/go-swagger/go-swagger/ [3] https://github.com/ahultgren/swagger-elm [4] http://petstore.swagger.io/?url=https://raw.githubusercontent.com/mxinden/alertmanager/apiv2/api/v2/openapi.yaml Signed-off-by: Max Leonard Inden <IndenML@gmail.com>	2018-09-04 13:38:34 +02:00
stuart nelson	445fbdf1a8	gossip large messages via SendReliable (#1415) * Gossip large messages via SendReliable For messages beyond half of the maximum gossip packet size, send the message to all peer nodes via TCP. The choice of "larger than half the max gossip size" is relatively arbitrary. From brief testing, the overhead from memberlist on a packet seemed to only use ~3 of the available 1400 bytes, and most gossip messages seem to be <<500 bytes. * Add tests for oversized/normal message gossiping * Make oversize metric names consistent * Remove errant printf in test * Correctly increment WaitGroup * Add comment for OversizedMessage func * Add metric for oversized messages dropped Code was added to drop oversized messages if the buffered channel they are sent on is full. This is a good thing to surface as a metric. * Add counter for total oversized messages sent * Change full queue log level to debug Was previously a warning, which isn't necessary now that there is a metric tracking it. Signed-off-by: stuart nelson <stuartnelson3@gmail.com>	2018-06-15 13:40:21 +02:00
stuart nelson	36588c3865	memberlist gossip (#1389) * Peers further propagate newly received nflogs If a peer receives an nflog that it hasn't seen before, queue the message and propagate it further to other peers. This should ensure that all peers within a cluster receive all gossip messages. Signed-off-by: stuart nelson <stuartnelson3@gmail.com> * Set Retransmit value based on number of members For alertmanagers that are brought up with a list of peers, set the number of message retransmits to be half of that number. If there are no peers on start, or there are few, continue to use the default value of 3. Signed-off-by: stuart nelson <stuartnelson3@gmail.com> * [nflog] Move retransmit calculation Signed-off-by: stuart nelson <stuartnelson3@gmail.com> * [silence] further gossip silence messages Signed-off-by: stuart nelson <stuartnelson3@gmail.com> * Set GossipNodes to equal RetransmitMulti During a gossip, we send messages to at most GossipNodes nodes. If possible, we only a message to reach all nodes as soon as possible. Signed-off-by: stuart nelson <stuartnelson3@gmail.com> * Fix rebase Signed-off-by: stuart nelson <stuartnelson3@gmail.com>	2018-06-08 11:48:42 +02:00
Ted Zlatanov	b04e9ad19b	#1346: move maintenance messages to DEBUG log level (#1347) Signed-off-by: Ted Zlatanov <tzz@lifelogs.com>	2018-04-30 11:56:17 +02:00
Simon Pasquier	2d68b4d318	silence: fix potential panic in decodeState() Signed-off-by: Simon Pasquier <spasquie@redhat.com>	2018-04-10 10:12:05 +02:00
Simon Pasquier	1531aa66f3	Fix for #1282 (#1286) * cluster: add alertmanager_cluster_messages_queued metric * cluster: add metrics for sent messages This change adds 2 new metrics: - alertmanager_cluster_messages_sent_total - alertmanager_cluster_messages_sent_size_total * Fix marshaling for entries being broadcast Individual notifications logs and silences being broadcast to the other peers need to be encoded using the same length-delimited format as when doing full-state synchronization. * main: fix argument order for cluster.Join() cluster.Join() was called with the push/pull and gossip interval parameters being swapped one for another.	2018-03-22 13:53:00 +01:00
pasquier-s	c2dac90434	silence: fix skipped test (#1258) TestStateMerge() was skipped because of a typo. Fixing the name revealed that the test itself needed to be updated following the switch to the memberlist library.	2018-02-27 10:17:48 +01:00
Fabian Reinartz	fd49dbb477	*: move to memberlist for clustering	2018-02-08 12:18:44 +01:00
pasquier-s	a7d4e4ea7c	Log snapshot sizes on maintenance (#1155) * Log snapshot sizes on maintenance * Add metrics for snapshot sizes This change adds 2 new gauges for tracking the last snapshots' sizes: - alertmanager_nflog_snapshot_size_bytes - alertmanager_silences_snapshot_size_bytes	2018-01-10 14:53:57 +01:00
Jose Donizetti	74808e40f3	Refactor silence constants (#1076) * Refactor remove dups silence state constants * Refactor to use const instead of string	2017-11-07 11:36:30 +01:00
Julius Volz	947970af44	Convert Alertmanager to use non-global go-kit loggers Fixes https://github.com/prometheus/alertmanager/issues/1040	2017-10-22 00:20:40 -07:00
Corentin Chary	bff889b490	silence|alerts: add metrics about current silences and alerts This adds metrics that look like this: ``` alertmanager_alerts{state="active"} 6 alertmanager_alerts{state="suppressed"} 0 alertmanager_silences{state="active"} 1 alertmanager_silences{state="expired"} 1 alertmanager_silences{state="pending"} 0 ``` This can be used to monitor alertmanager's usage and validate that alertmanagers in a mesh have a similar number of silences and alerts.	2017-10-02 13:33:29 +02:00
Jose Donizetti	9449bd1fa9	Ignore expired silences OnGossip (#999) This will fix the bug of resync deleted silences due to the state of other peers.	2017-09-28 10:25:35 +02:00
Corentin Chary	34d9524ab9	silences: avoid deadlock (#995) * silences: avoid deadlock Calling gossip.GossipBroadcast() will cause a deadlock if there is a currently executing OnBroadcast* function. See #982 * silence_test: better unit test to detect deadlocks	2017-09-27 11:48:28 +02:00
Corentin Chary	869a038a2b	Add a mutex to silences.go:gossipData (#984) This should fix silence/silence.go #982	2017-09-13 11:18:01 +02:00
Max Leonard Inden	08be6a4149	Expire pending silence and move to expired state Instead of setting endsAt to startsAt we can set both to now. Thereby the Silence will get the expired state by default.	2017-05-29 18:44:58 +02:00
Fabian Reinartz	f53974d5e6	silence: fix and test expiration behavior	2017-05-22 09:27:57 +02:00
Fabian Reinartz	d73a655bf4	Simplify silence modifications, add update endpoint (#796) * Simplify silence modifications, add update endpoint * vendor: add pkg/errors * ui: Handle upserting of silences. * Regenerate bindata	2017-05-16 16:48:25 +02:00

59 Commits