Commit Graph

93 Commits

Author SHA1 Message Date
Max Neverov
c39b787800
Add metrics for notification requests (#2361) (#2383)
Signed-off-by: Max Neverov <neverov.max@gmail.com>
2020-11-06 15:24:18 +01:00
Simon Pasquier
56f09a62b2
notify: always retry with a back-off (#2290)
By default the library implementing the back-off timer stops the timer
after 15 minutes. Since the code never checked the value returned by the
ticker, notification retries were executed without delay after the 15
minutes had elapsed (e.g. for `group_interval` greater than 15m).

This change ensures that the back-off timer never expires.

Signed-off-by: Simon Pasquier <spasquie@redhat.com>
2020-06-16 09:50:35 +02:00
Julien Pivotto
1cba0c7a37
Remove HipChat (#2281)
Signed-off-by: Julien Pivotto <roidelapluie@inuits.eu>
2020-06-11 15:51:10 +02:00
Simon Pasquier
f7c595c168
notify: improve logs on notification errors (#2273)
* notify: improve logs on notification errors

Alertmanager can experience occasional failures when sending
notifications to an external service. If the operation succeeds after
some retry, the 'alertmanager_notifications_failed_total' metric
increases but nothing is logged (unless running with log.level=debug).
Hence an operator might receive an alert about notification failures but
wouldn't know which integration was failing.

With this change, notification failures are logged at the warning level.
To avoid log flooding, similar failures on retries aren't logged.
Additional information on the failing integration has also been added.

Signed-off-by: Simon Pasquier <spasquie@redhat.com>

* Log notify success at info level if it's a retry

Signed-off-by: Simon Pasquier <spasquie@redhat.com>
2020-06-04 10:38:48 +02:00
Julien Pivotto
013177e2d0
Update dependencies (#2257)
Update membership

Update common (support HTTP/2 client)

Signed-off-by: Julien Pivotto <roidelapluie@inuits.eu>
2020-05-18 15:00:36 +02:00
Simon Pasquier
44af3201fe
notify: add retry field to debug log (#2188)
Signed-off-by: Simon Pasquier <spasquie@redhat.com>
2020-03-13 15:39:16 +01:00
johncming
d965ac6393 notify: optimize length check. (#2106)
Signed-off-by: johncming <johncming@yahoo.com>
2019-11-19 09:00:06 +01:00
Simon Pasquier
9f7f4ead46
notify: don't use the global metrics registry (#1977)
* notify: don't use the global metrics registry

Signed-off-by: Simon Pasquier <spasquie@redhat.com>

* Address Max's comment

Signed-off-by: Simon Pasquier <spasquie@redhat.com>
2019-08-26 16:37:13 +02:00
Simon Pasquier
9b0ecaa0fe
notify/email: wrap all errors for easier debugging (#1953)
* notify/email: wrap all errors for easier debugging

In addition, this commit passes the current context to the TCP dialer
and it doesn't log any QUIT errors if the email delivery wasn't
successful.

Signed-off-by: Simon Pasquier <spasquie@redhat.com>

* Fix typo

Signed-off-by: Simon Pasquier <spasquie@redhat.com>
2019-07-10 11:24:51 +02:00
Simon Pasquier
0c3120efac *: split notify package
Instead of keeping all notifiers in the notify package, it splits them
into individual sub-packages. This improves readability and
maintainability of the code.

Signed-off-by: Simon Pasquier <spasquie@redhat.com>
2019-06-18 15:36:19 +02:00
Simon Pasquier
2abd78cbb7
*: use persistent HTTP clients (#1904)
Signed-off-by: Simon Pasquier <spasquie@redhat.com>
2019-06-07 10:37:49 +02:00
Simon Pasquier
a5e26cc721 *: log at debug level when context is canceled
Signed-off-by: Simon Pasquier <spasquie@redhat.com>
2019-04-03 16:41:03 +02:00
beorn7
f3d9c89bbc Create a Muter implementation for silences
This encapsulates the logic of querying and marking silenced
alerts. It removes the code duplication flagged earlier.

I removed the error returned by the setAlertStatus function as we were
only logging it, and that's already done anyway when the error is
received from the `silence.Query` call (now in the `Mutes` method).

Signed-off-by: beorn7 <beorn@soundcloud.com>
2019-02-26 16:42:59 +01:00
JoeWrightss
b926c6935e Fix some typos in comment (#1750)
Signed-off-by: zhoulin xie <zhoulin.xie@daocloud.io>
2019-02-08 14:57:08 +01:00
Simon Pasquier
b676fa79c0 *: update Makefile.common with new staticcheck (#1692)
Signed-off-by: Simon Pasquier <spasquie@redhat.com>
2019-01-04 15:37:33 +01:00
Simon Pasquier
306fd73e32 *: remove use of golang.org/x/net/context
Signed-off-by: Simon Pasquier <spasquie@redhat.com>
2018-11-09 10:00:23 +01:00
Chris Mark
6eaacfe243 Changes alertmanager_notifications to count attempted notifications (#1578)
alertmanager_notifications_total is increased only on
successful notifications, and not on any attempted
notification as the current description points

Signed-off-by: Chris Mark <chrismarkou92@gmail.com>
2018-10-09 13:23:12 +01:00
Julius Volz
6d0edbe630 Fix a bunch of unhandled errors (#1501)
...as discovered by "gosec" (many other ones reported, but not all make
a lot of sense to fix).

Signed-off-by: Julius Volz <julius.volz@gmail.com>
2018-08-05 15:38:25 +02:00
Simon Pasquier
b7d891cf39 notify: notify resolved alerts properly (#1408)
* notify: notify resolved alerts properly

The PR #1205 while fixing an existing issue introduced another bug when
the send_resolved flag of the integration is set to true.

With send_resolved set to false, the semantics remain the same:
AlertManager generates a notification when new firing alerts are added
to the alert group. The notification only carries firing alerts.

With send_resolved set to true, AlertManager generates a notification
when new firing or resolved alerts are added to the alert group. The
notification carries both the firing and resolved notifications.

Signed-off-by: Simon Pasquier <spasquie@redhat.com>

* Fix comments

Signed-off-by: Simon Pasquier <spasquie@redhat.com>
2018-06-08 11:37:38 +02:00
stuart nelson
bc263d3e61
Improve notification instrumentation (#1335)
* Improve notification instrumentation

- Add notificationLatencySeconds histogram to
debug duplicate messages. This can help rule out
if duplicate messages are being caused by
excessive latency when sending a notification.

Signed-off-by: stuart nelson <stuartnelson3@gmail.com>
2018-04-23 14:23:01 +02:00
pasquier-s
7b80919b36 Remove unused code (#1272) 2018-03-03 11:07:47 +01:00
Corentin Chary
dd75201f1c Add /-/ready based on mesh status (#1209)
* Wait for the gossip to settle before sending notifications

See #1209 for details.

As an heuristic for mesh readyness, try to see if
the mesh looks stable (the number of peers isn't changing too much).
This implementation always mark the altermanager as ready after a maximum of 60s.

This adds one new flags to control this behavior:
```
      --cluster.settle-timeout=60s  mesh settling timeout. Do not wait more than this duration on startup.
```

It also adds `/-/ready` which always return 200 (in order to make it clear
that we are ready as soon as we can receive requests).

The mesh status is exposed in `/api/v1/status` and visible on `/#/status`.

* cluster: fix typos and base interval on gossipInterval
2018-03-02 15:45:21 +01:00
Fabian Reinartz
fd49dbb477 *: move to memberlist for clustering 2018-02-08 12:18:44 +01:00
Carlos Alexandro Becker
23f31d7d5a improved error when victorops fails (#1207)
* improved error when victorops fails

* moved to debug

* allocate mem only once

* joining strings

* logging receiver name

* passing only group name
2018-01-29 16:00:04 +01:00
pasquier-s
62b957cc14 Notify only when new firing alerts are added (#1205)
After the initial notification has been sent, AlertManager shouldn't notify the
receiver again when no new alerts have been added to the group during
group_interval.

This change also modifies the acceptance test framework to assert that no
notification has been received in a given interval.
2018-01-23 16:52:03 +01:00
pasquier-s
9b10acae68 Don't notify resolved alerts if none were firing (#1198)
* Don't notify resolved alerts if none were firing

* Fix comments
2018-01-18 11:12:17 +01:00
Jose Donizetti
d75ff37a38 Refactor inhibit stage (#1105)
* Refactor BuildPipeline to receive a muter

* Remove marker not used by InhibitStage
2017-12-14 16:22:31 +01:00
berlinsaint
6bab629590 Add notify support for Chinese User wechat (#1059)
* WECHAT support by ybyang2/berlinsaint

* correct the whitespace

* add some TestFile and modify some naming errors by ybyang2/berlinsaint

* modify wechat retry test expect

* template error

* add newline
Signed-off-by: yb_home <berlinsaint@126.com>

* fmt some pr code

* use the @stuartnelson3 the test-ci-wechat bingdata.go

* notify go add wechat
2017-12-09 16:20:22 +01:00
Jose Donizetti
74808e40f3 Refactor silence constants (#1076)
* Refactor remove dups silence state constants

* Refactor to use const instead of string
2017-11-07 11:36:30 +01:00
Julius Volz
b0aab04906 Fix notifications for flapping alerts (#1071)
Fixes https://github.com/prometheus/alertmanager/issues/1063
2017-11-02 11:12:12 +01:00
Julius Volz
9b72c10134 Minor code cleanups 2017-11-01 23:08:34 +01:00
Jose Donizetti
f8dc12c317 Remove not used code (#1069) 2017-11-01 16:40:46 +01:00
Julius Volz
947970af44 Convert Alertmanager to use non-global go-kit loggers
Fixes https://github.com/prometheus/alertmanager/issues/1040
2017-10-22 00:20:40 -07:00
Frederic Branczyk
ff9e5270c7 Merge pull request #1026 from brancz/marker-race
Remove .WasInhibited and .WasSilenced fields of Alert type
2017-10-10 16:49:55 +02:00
Frederic Branczyk
0ef6695055
*: Remove .WasInhibited and .WasSilenced fields of Alert type 2017-10-10 15:50:15 +02:00
Conor Broderick
10b9d34f80 Initialise notifications_total and notifications_failed_total (#1011) 2017-10-07 11:57:53 +02:00
Max Inden
a217e162a8 Do not expose resolved alerts & do not send resolved if never firing (#820)
Do not expose resolved alerts on the /alerts endpoint. Do not send
resolved alerts to receivers if the alerts have never been fired before.
2017-05-29 14:07:05 +02:00
stuart nelson
6a909abf17 Add processing status field to alert 2017-04-27 14:18:52 +02:00
Fabian Reinartz
3269bc39e1 *: switch group key to matcher serialization
Turn the GroupKey into a string that is composed of the matchers if the
path in the routing tree and the grouping labels.
Only hash it at the very end to ensure we don't exceed size limits of
integration APIs.
2017-04-21 12:06:23 +02:00
Fabian Reinartz
4258b028d6 nflog: switch to gogoproto
This switches the nflog to generate Go code via gogoproto and thereby
use standard library timestamp types.
2017-04-18 10:03:57 +02:00
Fabian Reinartz
8820ce7827 Merge pull request #703 from prometheus/fix-resolve
Fix resolve notifications
2017-04-17 14:19:04 +02:00
Fabian Reinartz
309c6af4b2
nflog: use alert set instead of hash for deduplication
Building a hash over an entire set of alerts causes problems, because
the hash differs, on any change, whereas we only want to send
notifications if the alert and it's state have changed. Therefore this
introduces a list of alerts that are active and a list of alerts that
are resolved. If the currently active alerts of a group are a subset of
the ones that have been notified about before then they are
deduplicated. The resolved notifications work the same way, with a
separate list of resolved notifications that have already been sent.
2017-04-13 15:13:47 +02:00
Julius Volz
7f1d111324 Include notifier type in retry logs and errors 2017-04-11 00:55:14 +02:00
Frederic Branczyk
dcf2b3afcb
notify: move resolved alert filtering to integration
Resolved alerts, even when filtered, have to end up in the
SetNotifiesStage, otherwise when an alert fires again it is ambiguous
whether it was resolved in between or not.

fixes #523
2016-10-05 17:45:35 +02:00
Frederic Branczyk
e72e45c8f1 silence: add cache for silence matchers
compiling regex silence matchers on every query is expensive, therefore
caching them as soon as they are gossiped through the mesh
2016-09-09 11:41:39 +02:00
Frederic Branczyk
92acfbd449 add retry flag for notify providers
The retry flag allows an integration to specify whether a retry can
potentially be solved or if the error is likely not going to recover.
For example invalid authentication is likely a wrong configuration and
therefore a retry would not make sense, while a server error is likely
a temporary problem and can potentially be solved on the next retry.
2016-09-06 16:21:56 +02:00
Fabian Reinartz
a4e8703567 *: integrate new silence package 2016-08-30 12:15:23 +02:00
Fabian Reinartz
72fdf3d3ab *: integrate nflog
This commit replaces the previous NotifyInfo provider with the new
nflog package. It needs adjustments in the behavior of the deduping
stage.
The nflog stores notification digests per receiver per alert aggregation
group rather than one entry for alert per receiver. This drastically
reduces the number of entries and removes interference
across aggregation groups.
2016-08-18 15:52:28 +02:00
Fabian Reinartz
d2a556b269 notify: include context in Stage interface
This adds context.Context to the return arguments of a Stage.
This is necessary to propagate modified contexts.
2016-08-18 11:42:37 +02:00
Fabian Reinartz
ed4f295c70 notify: embed nflogpb.Receiver in stage
This commit directly adds the nflogpb.Receiver object to stage
objects at stage creation time. Hence, we no longer rely on a value from
within the context.
2016-08-16 16:40:42 +02:00