alertmanager

mirror of https://github.com/prometheus/alertmanager synced 2024-12-27 08:32:15 +00:00

Author	SHA1	Message	Date
Max Neverov	c39b787800	Add metrics for notification requests (#2361 ) (#2383 ) Signed-off-by: Max Neverov <neverov.max@gmail.com>	2020-11-06 15:24:18 +01:00
Simon Pasquier	56f09a62b2	notify: always retry with a back-off (#2290 ) By default the library implementing the back-off timer stops the timer after 15 minutes. Since the code never checked the value returned by the ticker, notification retries were executed without delay after the 15 minutes had elapsed (e.g. for `group_interval` greater than 15m). This change ensures that the back-off timer never expires. Signed-off-by: Simon Pasquier <spasquie@redhat.com>	2020-06-16 09:50:35 +02:00
Julien Pivotto	1cba0c7a37	Remove HipChat (#2281 ) Signed-off-by: Julien Pivotto <roidelapluie@inuits.eu>	2020-06-11 15:51:10 +02:00
Simon Pasquier	f7c595c168	notify: improve logs on notification errors (#2273 ) * notify: improve logs on notification errors Alertmanager can experience occasional failures when sending notifications to an external service. If the operation succeeds after some retry, the 'alertmanager_notifications_failed_total' metric increases but nothing is logged (unless running with log.level=debug). Hence an operator might receive an alert about notification failures but wouldn't know which integration was failing. With this change, notification failures are logged at the warning level. To avoid log flooding, similar failures on retries aren't logged. Additional information on the failing integration has also been added. Signed-off-by: Simon Pasquier <spasquie@redhat.com> * Log notify success at info level if it's a retry Signed-off-by: Simon Pasquier <spasquie@redhat.com>	2020-06-04 10:38:48 +02:00
Julien Pivotto	013177e2d0	Update dependencies (#2257 ) Update membership Update common (support HTTP/2 client) Signed-off-by: Julien Pivotto <roidelapluie@inuits.eu>	2020-05-18 15:00:36 +02:00
Simon Pasquier	44af3201fe	notify: add retry field to debug log (#2188 ) Signed-off-by: Simon Pasquier <spasquie@redhat.com>	2020-03-13 15:39:16 +01:00
johncming	d965ac6393	notify: optimize length check. (#2106 ) Signed-off-by: johncming <johncming@yahoo.com>	2019-11-19 09:00:06 +01:00
Simon Pasquier	9f7f4ead46	notify: don't use the global metrics registry (#1977 ) * notify: don't use the global metrics registry Signed-off-by: Simon Pasquier <spasquie@redhat.com> * Address Max's comment Signed-off-by: Simon Pasquier <spasquie@redhat.com>	2019-08-26 16:37:13 +02:00
Simon Pasquier	9b0ecaa0fe	notify/email: wrap all errors for easier debugging (#1953 ) * notify/email: wrap all errors for easier debugging In addition, this commit passes the current context to the TCP dialer and it doesn't log any QUIT errors if the email delivery wasn't successful. Signed-off-by: Simon Pasquier <spasquie@redhat.com> * Fix typo Signed-off-by: Simon Pasquier <spasquie@redhat.com>	2019-07-10 11:24:51 +02:00
Simon Pasquier	0c3120efac	*: split notify package Instead of keeping all notifiers in the notify package, it splits them into individual sub-packages. This improves readability and maintainability of the code. Signed-off-by: Simon Pasquier <spasquie@redhat.com>	2019-06-18 15:36:19 +02:00
Simon Pasquier	2abd78cbb7	*: use persistent HTTP clients (#1904 ) Signed-off-by: Simon Pasquier <spasquie@redhat.com>	2019-06-07 10:37:49 +02:00
Simon Pasquier	a5e26cc721	*: log at debug level when context is canceled Signed-off-by: Simon Pasquier <spasquie@redhat.com>	2019-04-03 16:41:03 +02:00
beorn7	f3d9c89bbc	Create a `Muter` implementation for silences This encapsulates the logic of querying and marking silenced alerts. It removes the code duplication flagged earlier. I removed the error returned by the setAlertStatus function as we were only logging it, and that's already done anyway when the error is received from the `silence.Query` call (now in the `Mutes` method). Signed-off-by: beorn7 <beorn@soundcloud.com>	2019-02-26 16:42:59 +01:00
JoeWrightss	b926c6935e	Fix some typos in comment (#1750 ) Signed-off-by: zhoulin xie <zhoulin.xie@daocloud.io>	2019-02-08 14:57:08 +01:00
Simon Pasquier	b676fa79c0	*: update Makefile.common with new staticcheck (#1692 ) Signed-off-by: Simon Pasquier <spasquie@redhat.com>	2019-01-04 15:37:33 +01:00
Simon Pasquier	306fd73e32	*: remove use of golang.org/x/net/context Signed-off-by: Simon Pasquier <spasquie@redhat.com>	2018-11-09 10:00:23 +01:00
Chris Mark	6eaacfe243	Changes alertmanager_notifications to count attempted notifications (#1578 ) alertmanager_notifications_total is increased only on successful notifications, and not on any attempted notification as the current description points Signed-off-by: Chris Mark <chrismarkou92@gmail.com>	2018-10-09 13:23:12 +01:00
Julius Volz	6d0edbe630	Fix a bunch of unhandled errors (#1501 ) ...as discovered by "gosec" (many other ones reported, but not all make a lot of sense to fix). Signed-off-by: Julius Volz <julius.volz@gmail.com>	2018-08-05 15:38:25 +02:00
Simon Pasquier	b7d891cf39	notify: notify resolved alerts properly (#1408 ) * notify: notify resolved alerts properly The PR #1205 while fixing an existing issue introduced another bug when the send_resolved flag of the integration is set to true. With send_resolved set to false, the semantics remain the same: AlertManager generates a notification when new firing alerts are added to the alert group. The notification only carries firing alerts. With send_resolved set to true, AlertManager generates a notification when new firing or resolved alerts are added to the alert group. The notification carries both the firing and resolved notifications. Signed-off-by: Simon Pasquier <spasquie@redhat.com> * Fix comments Signed-off-by: Simon Pasquier <spasquie@redhat.com>	2018-06-08 11:37:38 +02:00
stuart nelson	bc263d3e61	Improve notification instrumentation (#1335 ) * Improve notification instrumentation - Add notificationLatencySeconds histogram to debug duplicate messages. This can help rule out if duplicate messages are being caused by excessive latency when sending a notification. Signed-off-by: stuart nelson <stuartnelson3@gmail.com>	2018-04-23 14:23:01 +02:00
pasquier-s	7b80919b36	Remove unused code (#1272 )	2018-03-03 11:07:47 +01:00
Corentin Chary	dd75201f1c	Add /-/ready based on mesh status (#1209 ) * Wait for the gossip to settle before sending notifications See #1209 for details. As a heuristic for mesh readiness, try to see if the mesh looks stable (the number of peers isn't changing too much). This implementation always marks the Alertmanager as ready after a maximum of 60s. This adds one new flag to control this behavior: ``` --cluster.settle-timeout=60s mesh settling timeout. Do not wait more than this duration on startup. ``` It also adds `/-/ready`, which always returns 200 (in order to make it clear that we are ready as soon as we can receive requests). The mesh status is exposed in `/api/v1/status` and visible on `/#/status`. * cluster: fix typos and base interval on gossipInterval	2018-03-02 15:45:21 +01:00
Fabian Reinartz	fd49dbb477	*: move to memberlist for clustering	2018-02-08 12:18:44 +01:00
Carlos Alexandro Becker	23f31d7d5a	improved error when victorops fails (#1207 ) * improved error when victorops fails * moved to debug * allocate mem only once * joining strings * logging receiver name * passing only group name	2018-01-29 16:00:04 +01:00
pasquier-s	62b957cc14	Notify only when new firing alerts are added (#1205 ) After the initial notification has been sent, AlertManager shouldn't notify the receiver again when no new alerts have been added to the group during group_interval. This change also modifies the acceptance test framework to assert that no notification has been received in a given interval.	2018-01-23 16:52:03 +01:00
pasquier-s	9b10acae68	Don't notify resolved alerts if none were firing (#1198 ) * Don't notify resolved alerts if none were firing * Fix comments	2018-01-18 11:12:17 +01:00
Jose Donizetti	d75ff37a38	Refactor inhibit stage (#1105 ) * Refactor BuildPipeline to receive a muter * Remove marker not used by InhibitStage	2017-12-14 16:22:31 +01:00
berlinsaint	6bab629590	Add notify support for Chinese User wechat (#1059 ) * WECHAT support by ybyang2/berlinsaint * correct the whitespace * add some TestFile and modify some naming errors by ybyang2/berlinsaint * modify wechat retry test expect * template error * add newline Signed-off-by: yb_home <berlinsaint@126.com> * fmt some pr code * use the @stuartnelson3 the test-ci-wechat bingdata.go * notify go add wechat	2017-12-09 16:20:22 +01:00
Jose Donizetti	74808e40f3	Refactor silence constants (#1076 ) * Refactor remove dups silence state constants * Refactor to use const instead of string	2017-11-07 11:36:30 +01:00
Julius Volz	b0aab04906	Fix notifications for flapping alerts (#1071 ) Fixes https://github.com/prometheus/alertmanager/issues/1063	2017-11-02 11:12:12 +01:00
Julius Volz	9b72c10134	Minor code cleanups	2017-11-01 23:08:34 +01:00
Jose Donizetti	f8dc12c317	Remove not used code (#1069 )	2017-11-01 16:40:46 +01:00
Julius Volz	947970af44	Convert Alertmanager to use non-global go-kit loggers Fixes https://github.com/prometheus/alertmanager/issues/1040	2017-10-22 00:20:40 -07:00
Frederic Branczyk	ff9e5270c7	Merge pull request #1026 from brancz/marker-race Remove .WasInhibited and .WasSilenced fields of Alert type	2017-10-10 16:49:55 +02:00
Frederic Branczyk	0ef6695055	*: Remove .WasInhibited and .WasSilenced fields of Alert type	2017-10-10 15:50:15 +02:00
Conor Broderick	10b9d34f80	Initialise notifications_total and notifications_failed_total (#1011 )	2017-10-07 11:57:53 +02:00
Max Inden	a217e162a8	Do not expose resolved alerts & do not send resolved if never firing (#820 ) Do not expose resolved alerts on the /alerts endpoint. Do not send resolved alerts to receivers if the alerts have never been fired before.	2017-05-29 14:07:05 +02:00
stuart nelson	6a909abf17	Add processing status field to alert	2017-04-27 14:18:52 +02:00
Fabian Reinartz	3269bc39e1	*: switch group key to matcher serialization Turn the GroupKey into a string that is composed of the matchers of the path in the routing tree and the grouping labels. Only hash it at the very end to ensure we don't exceed size limits of integration APIs.	2017-04-21 12:06:23 +02:00
Fabian Reinartz	4258b028d6	nflog: switch to gogoproto This switches the nflog to generate Go code via gogoproto and thereby use standard library timestamp types.	2017-04-18 10:03:57 +02:00
Fabian Reinartz	8820ce7827	Merge pull request #703 from prometheus/fix-resolve Fix resolve notifications	2017-04-17 14:19:04 +02:00
Fabian Reinartz	309c6af4b2	nflog: use alert set instead of hash for deduplication Building a hash over an entire set of alerts causes problems, because the hash differs on any change, whereas we only want to send notifications if the alert and its state have changed. Therefore this introduces a list of alerts that are active and a list of alerts that are resolved. If the currently active alerts of a group are a subset of the ones that have been notified about before then they are deduplicated. The resolved notifications work the same way, with a separate list of resolved notifications that have already been sent.	2017-04-13 15:13:47 +02:00
Julius Volz	7f1d111324	Include notifier type in retry logs and errors	2017-04-11 00:55:14 +02:00
Frederic Branczyk	dcf2b3afcb	notify: move resolved alert filtering to integration Resolved alerts, even when filtered, have to end up in the SetNotifiesStage, otherwise when an alert fires again it is ambiguous whether it was resolved in between or not. fixes #523	2016-10-05 17:45:35 +02:00
Frederic Branczyk	e72e45c8f1	silence: add cache for silence matchers compiling regex silence matchers on every query is expensive, therefore caching them as soon as they are gossiped through the mesh	2016-09-09 11:41:39 +02:00
Frederic Branczyk	92acfbd449	add retry flag for notify providers The retry flag allows an integration to specify whether a retry can potentially be solved or if the error is likely not going to recover. For example invalid authentication is likely a wrong configuration and therefore a retry would not make sense, while a server error is likely a temporary problem and can potentially be solved on the next retry.	2016-09-06 16:21:56 +02:00
Fabian Reinartz	a4e8703567	*: integrate new silence package	2016-08-30 12:15:23 +02:00
Fabian Reinartz	72fdf3d3ab	*: integrate nflog This commit replaces the previous NotifyInfo provider with the new nflog package. It needs adjustments in the behavior of the deduping stage. The nflog stores notification digests per receiver per alert aggregation group rather than one entry for alert per receiver. This drastically reduces the number of entries and removes interference across aggregation groups.	2016-08-18 15:52:28 +02:00
Fabian Reinartz	d2a556b269	notify: include context in Stage interface This adds context.Context to the return arguments of a Stage. This is necessary to propagate modified contexts.	2016-08-18 11:42:37 +02:00
Fabian Reinartz	ed4f295c70	notify: embed nflogpb.Receiver in stage This commit directly adds the nflogpb.Receiver object to stage objects at stage creation time. Hence, we no longer rely on a value from within the context.	2016-08-16 16:40:42 +02:00

93 Commits