* notify: improve logs on notification errors
Alertmanager can experience occasional failures when sending
notifications to an external service. If the operation succeeds after
a retry, the 'alertmanager_notifications_failed_total' metric
increases but nothing is logged (unless running with log.level=debug).
Hence an operator might receive an alert about notification failures but
wouldn't know which integration was failing.
With this change, notification failures are logged at the warning level.
To avoid log flooding, similar failures on retries aren't logged.
Additional information on the failing integration has also been added.
Signed-off-by: Simon Pasquier <spasquie@redhat.com>
* Log notify success at info level if it's a retry
Signed-off-by: Simon Pasquier <spasquie@redhat.com>
* Allow limiting maximum number of alerts in webhook
The webhook notifier is the only notifier that does not allow templating
on the Alertmanager side. Users who encounter occasional alert storms
(10ks of alerts going off at once for the same group) have reported
webhook receiver systems not being able to cope with the load caused by
the resulting large webhook notifier messages (the alerting rules also
contained large annotations that can't be stripped away due to lack of
templating). Reducing group size also wasn't an option, so this change
proposes to allow truncating the list of alerts sent in the webhook body
to a provided maximum length. This assumes that e.g. if a group receives
20k alerts, you really are fine only receiving 10k because you wouldn't
be able to check them all anyway.
Signed-off-by: Julius Volz <julius.volz@gmail.com>
* Change max_alerts to uint32
Signed-off-by: Julius Volz <julius.volz@gmail.com>
* Add truncatedAlerts field to webhook message
Signed-off-by: Julius Volz <julius.volz@gmail.com>
* Fix JSON struct tag
Signed-off-by: Julius Volz <julius.volz@gmail.com>
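A rough sketch of the truncation described in the webhook commits above.
The alert and message types are simplified stand-ins, and treating a
max_alerts of 0 as "no limit" is an assumption of this sketch; only the
uint32 type and the truncatedAlerts JSON field name come from the commit
titles.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// alert is a hypothetical, heavily simplified alert.
type alert struct {
	Name string `json:"name"`
}

// message is a hypothetical webhook body carrying the kept alerts and the
// number of alerts that were dropped.
type message struct {
	Alerts          []alert `json:"alerts"`
	TruncatedAlerts uint32  `json:"truncatedAlerts"`
}

// truncateAlerts keeps at most maxAlerts alerts and reports how many were
// dropped. A maxAlerts of 0 is treated as "no limit" (an assumption here).
func truncateAlerts(maxAlerts uint32, alerts []alert) ([]alert, uint32) {
	n := uint64(len(alerts))
	if maxAlerts != 0 && n > uint64(maxAlerts) {
		return alerts[:maxAlerts], uint32(n - uint64(maxAlerts))
	}
	return alerts, 0
}

func main() {
	alerts := make([]alert, 5)
	for i := range alerts {
		alerts[i] = alert{Name: fmt.Sprintf("alert-%d", i)}
	}
	kept, dropped := truncateAlerts(3, alerts)
	body, err := json.Marshal(message{Alerts: kept, TruncatedAlerts: dropped})
	if err != nil {
		panic(err)
	}
	fmt.Println(string(body))
}
```

Reporting the dropped count lets the receiver tell a truncated group apart
from a genuinely small one.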
* .circleci/config.yml: collect test metadata
Signed-off-by: Simon Pasquier <spasquie@redhat.com>
* Store frontend test results too
Signed-off-by: Simon Pasquier <spasquie@redhat.com>
* api/v2: add path and method to API v2 logs
When an API v2 handler logged a message, the log wouldn't include the
path and method. Since different handlers perform the same validations
(e.g. matchers for alerts and silences), it isn't easy to know which
handler was invoked (though the logged filename + line number provides a
hint).
Signed-off-by: Simon Pasquier <spasquie@redhat.com>
* Capitalize messages + improve logs
Signed-off-by: Simon Pasquier <spasquie@redhat.com>
* Improve remark on UDP/TCP for high availability
Signed-off-by: Pascal Hofmann <mail+github@pascalhofmann.de>
* Update README.md
Co-Authored-By: Max Inden <mail@max-inden.de>
Signed-off-by: Pascal Hofmann <mail+github@pascalhofmann.de>
* Update README.md
Signed-off-by: Pascal Hofmann <mail+github@pascalhofmann.de>
Co-authored-by: Max Inden <mail@max-inden.de>
* fix dispatcher race condition
Signed-off-by: Jacob Lisi <jacob.t.lisi@gmail.com>
* add test to check for race condition in dispatcher
Signed-off-by: Jacob Lisi <jacob.t.lisi@gmail.com>
* return when dispatcher Stop has nil receiver
Signed-off-by: Jacob Lisi <jacob.t.lisi@gmail.com>
* remove unneeded check
Signed-off-by: Jacob Lisi <jacob.t.lisi@gmail.com>
During the dev Summit 2019/2, there was a consensus to mark PRs as stale
after 60 days.
This change adds the stale bot configuration required for this.
The stale bot already has access to the Prometheus organization. It
does _not_ comment and does _not_ close the stale pull request. It just
adds a label 'stale'.
This is already done in the collectd_exporter repository and there it
works as expected.
https://docs.google.com/document/d/1VVxx9DzpJPDgOZpZ5TtSHBRPuG5Fr3Vr6EFh8XuUpgs/edit
Signed-off-by: Julien Pivotto <roidelapluie@inuits.eu>
* Fix an error message about start and end time validation
Signed-off-by: Célian Garcia <celian.garcia@amadeus.com>
* Modified start and end time validation message to be affirmative
Signed-off-by: Célian Garcia <celian.garcia@amadeus.com>