alertmanager

mirror of https://github.com/prometheus/alertmanager synced 2024-12-28 17:12:13 +00:00

Author	SHA1	Message	Date
Simon Pasquier	b7d891cf39	notify: notify resolved alerts properly (#1408 ) * notify: notify resolved alerts properly The PR #1205 while fixing an existing issue introduced another bug when the send_resolved flag of the integration is set to true. With send_resolved set to false, the semantics remain the same: AlertManager generates a notification when new firing alerts are added to the alert group. The notification only carries firing alerts. With send_resolved set to true, AlertManager generates a notification when new firing or resolved alerts are added to the alert group. The notification carries both the firing and resolved notifications. Signed-off-by: Simon Pasquier <spasquie@redhat.com> * Fix comments Signed-off-by: Simon Pasquier <spasquie@redhat.com>	2018-06-08 11:37:38 +02:00
stuart nelson	80f2eeb2ca	Fix resolved alerts still inhibiting (#1331 ) * inhibit: update inhibition cache when alerts resolve Signed-off-by: Simon Pasquier <spasquie@redhat.com> * inhibit: remove unnecessary fmt.Sprintf Signed-off-by: Simon Pasquier <spasquie@redhat.com> * inhibit: add unit tests Signed-off-by: Simon Pasquier <spasquie@redhat.com> * inhibit: use NopLogger in tests Signed-off-by: Simon Pasquier <spasquie@redhat.com> * Update old alert with result of merge with new On ingest, alerts with matching fingerprints are merged if the new alert's start and end times overlap with the old alert's. The merge creates a new alert, which is then updated in the internal alert store. The original alert is not updated (because merge creates a copy), so it is never marked as resolved in the inhibitor's reference to it. The code within the inhibitor relies on skipping over resolved alerts, but because the old alert is never updated it is never marked as resolved. Thus it continues to inhibit other alerts until it is cleaned up by the internal GC. This commit updates the struct of the old alert with the result of the merge with the new alert. An alternative would be to always update the inhibitor's internal cache of alerts regardless of an alert's resolve status. Signed-off-by: stuart nelson <stuartnelson3@gmail.com> * Update inhibitor cache even if alert is resolved This seems like a better choice than the previous commit. I think it is more sane to have the inhibitor update its own cache, rather than having one of its pointers updated externally. Signed-off-by: stuart nelson <stuartnelson3@gmail.com>	2018-04-18 16:26:04 +02:00
Simon Pasquier	c92ed69ce8	Split cli package (#1314 ) * cli: move commands to cli/cmd * cli: use StatusAPI interface for config command * cli: use SilenceAPI interface for silence commands * cli: use AlertAPI for alert command * cli: move back commands to cli package And move API client code to its own package. * cli: remove unused structs	2018-04-11 11:17:41 +02:00
Simon Pasquier	4cba49155d	dispatch: don't reset timer if flush is in-progress (#1301 ) When the aggregation group receives an alert that is past the initial group_wait value, it should reset its timer only if the timer has ever expired. Otherwise it means that the flush is already in-progress.	2018-03-29 12:22:49 +02:00
Simon Pasquier	0c086e3b12	cli: extract client bindings of the v1 API (#1278 ) * cli: extract client bindings of the v1 API from amtool This is a continuation of [1] but the code is kept in the alertmanage repository rather than having it in client_golang. [1] https://github.com/prometheus/client_golang/pull/333 Co-Authored-By: Fabian Reinartz <fab.reinartz@gmail.com> Co-Authored-By: Tristan Colgate <tcolgate@gmail.com> Co-Authored-By: Corin Lawson <corin@responsight.com> Co-Authored-By: stuart nelson <stuartnelson3@gmail.com> * cli: fix httpSilenceAPI.Set() method * vendor: remove github.com/prometheus/client_golang/api/alertmanager * cli: don't use the model.Alert type	2018-03-28 19:19:04 +02:00
Brian Brazil	aa950668bf	The default group_by is meant to be no labels. (#1287 ) This is what the intended default is, and what the documentation says.	2018-03-16 18:39:23 +01:00
Corentin Chary	dd75201f1c	Add /-/ready based on mesh status (#1209 ) * Wait for the gossip to settle before sending notifications See #1209 for details. As an heuristic for mesh readyness, try to see if the mesh looks stable (the number of peers isn't changing too much). This implementation always mark the altermanager as ready after a maximum of 60s. This adds one new flags to control this behavior: ``` --cluster.settle-timeout=60s mesh settling timeout. Do not wait more than this duration on startup. ``` It also adds `/-/ready` which always return 200 (in order to make it clear that we are ready as soon as we can receive requests). The mesh status is exposed in `/api/v1/status` and visible on `/#/status`. * cluster: fix typos and base interval on gossipInterval	2018-03-02 15:45:21 +01:00
pasquier-s	c39a913f8a	test: enable race detection (#1262 ) This change enables race detection when running the tests. It also fixes a couple of existing race conditions.	2018-02-27 18:18:53 +01:00
Stuart Nelson	a552afd998	Merge branch 'master' into memberlist	2018-02-13 10:47:17 +01:00
Fabian Reinartz	e6df2d8751	Adapt cluster listen address flag in tests	2018-02-12 11:31:55 +01:00
pasquier-s	76ee5388e7	Forbid 0 value for group_interval and repeat_interval (#1230 ) Setting one of these parameters to a zero value doesn't make sense semantically and can cause high CPU usage.	2018-02-09 10:53:46 +01:00
Fabian Reinartz	fd49dbb477	*: move to memberlist for clustering	2018-02-08 12:18:44 +01:00
pasquier-s	62b957cc14	Notify only when new firing alerts are added (#1205 ) After the initial notification has been sent, AlertManager shouldn't notify the receiver again when no new alerts have been added to the group during group_interval. This change also modifies the acceptance test framework to assert that no notification has been received in a given interval.	2018-01-23 16:52:03 +01:00
pasquier-s	907ac510f8	Fix flaky TestBatching acceptance test (#1193 ) This change decreases the repeat_interval parameter from 5s to 4.9s to make sure that the alerts are effectively sent after 5 seconds. The workflow is: - The dispatcher flushes the alerts at t0, sends the notification and marks the notification log at t0+epsilon. - The dispatcher flushes the alerts at t1, t2, t3 and t4 and doesn't send the notifications as expected. - At t5, the dispatcher flushes the alerts because current_time - (t0+epsilon) is less then repeat_interval. If repeat_interval is exactly 5s, there is a little chance that it is greater than current_time - (t0+epsilon).	2018-01-11 22:45:59 +01:00
Calle Pettersson	b7da058efb	Switch cmd/alertmanager to kingpin (#974 )	2018-01-06 11:22:26 +01:00
Julius Volz	b0aab04906	Fix notifications for flapping alerts (#1071 ) Fixes https://github.com/prometheus/alertmanager/issues/1063	2017-11-02 11:12:12 +01:00
Julius Volz	9b72c10134	Minor code cleanups	2017-11-01 23:08:34 +01:00
Fabian Reinartz	ff5ecfff51	test: add reload test This test reloads the Alertmanager to verify, that it properly keeps state and sends notifications correctly across reloads.	2017-04-18 12:44:38 +02:00
Fabian Reinartz	309c6af4b2	nflog: use alert set instead of hash for deduplication Building a hash over an entire set of alerts causes problems, because the hash differs, on any change, whereas we only want to send notifications if the alert and it's state have changed. Therefore this introduces a list of alerts that are active and a list of alerts that are resolved. If the currently active alerts of a group are a subset of the ones that have been notified about before then they are deduplicated. The resolved notifications work the same way, with a separate list of resolved notifications that have already been sent.	2017-04-13 15:13:47 +02:00
stuart nelson	24a9a64bdf	Only find MAC address if no command-line flag value given (#638 ) * Find MAC address if mesh.hardware-addr not given Defaulting to the machine's MAC address fails sometimes fails and causes a panic. Allow the user to specify custom address to skip this so they can run AlertManager. * -mesh.hardware-address -> -mesh.peer-id * Fix command-line invocation	2017-02-28 14:57:45 +01:00
Martín Ferrari	5489644cbe	Wait for test server to be ready before running tests (#605 ) * Wait for test server to be ready before running tests This fixes problems when running the acceptance tests in slow or CPU-starved machines, as mentioned in #472.	2017-01-16 12:32:27 +00:00
Frederic Branczyk	c392ace697	notify: replace unfiltered with filtered alerts	2017-01-04 13:50:40 +01:00
Frederic Branczyk	dcf2b3afcb	notify: move resolved alert filtering to integration Resolved alerts, even when filtered, have to end up in the SetNotifiesStage, otherwise when an alert fires again it is ambiguous whether it was resolved in between or not. fixes #523	2016-10-05 17:45:35 +02:00
Fabian Reinartz	a4e8703567	*: integrate new silence package	2016-08-30 12:15:23 +02:00
Fabian Reinartz	72fdf3d3ab	*: integrate nflog This commit replaces the previous NotifyInfo provider with the new nflog package. It needs adjustments in the behavior of the deduping stage. The nflog stores notification digests per receiver per alert aggregation group rather than one entry for alert per receiver. This drastically reduces the number of entries and removes interference across aggregation groups.	2016-08-18 15:52:28 +02:00
Fabian Reinartz	e51770ce21	main: use mesh providers	2016-08-09 12:00:28 +02:00
Fabian Reinartz	81cbf3cda7	*: refactor Silence type, use UUID This commit removes the dependency on model.Silence for the internal Silence type, uses UUIDs instead of uint64s and clarifies invariants around timestamp handling. The created_at timestamp is removed for the time being.	2016-08-09 11:59:35 +02:00
Fabian Reinartz	d6e64dccc5	provider/boltmem: make alerts purely in-memory. Initial testing has shown BoltDB in plain usage to be a bottleneck at a few thousand alerts or more (especially JSON decoding. This commit actually makes them purely memory as a temporary solution.	2016-07-07 09:45:12 +02:00
Matt Bostock	68a1e51ffb	Use localhost for tests Previously, the tests would listen on all available interfaces. Instead, have the tests use localhost only; using all available interfaces is unnecessary. On Mac OS X with the builtin firewall enabled, it triggers annoying prompts to allow the tests to listen on all interfaces.	2016-06-04 08:19:23 +01:00
Fabian Reinartz	04f60c5a50	Deal with changed webhook format in tests	2016-02-12 11:00:51 +01:00
Fabian Reinartz	11fae2a719	Simplify and fix notification grouping. This commit changes the notification grouping behavior to simply send all alerts of a group as soon as a single one of them needs updating. This fixes a critical bug which caused erroneous resolved notifications to be sent.	2016-01-08 15:17:54 +01:00
Fabian Reinartz	d21d29ee58	Correctly parse send_resolved config field Fixes #198	2015-12-23 08:31:50 +01:00
Fabian Reinartz	a2b8d35733	Validate API input	2015-12-09 18:21:06 +01:00
Fabian Reinartz	2e5b9e5194	Improve acceptance test logging	2015-12-08 11:55:29 +01:00
Fabian Reinartz	cec04341f7	Add resolved test	2015-11-30 11:20:28 +01:00
beorn7	93ffa534a5	PR with changes after code review Now to be reverse-reveiewed.	2015-11-23 18:24:57 +01:00
Brian Brazil	faa88831f4	First-pass at improving template system. - Cut back to bare minimum to make the rest simpler - Consistency in config naming - Have one data strucutre that's the same for all templates - Pass in common labels to templates - Support templates almost everywhere - Support multiple SMTP recipients - Support non-ASCII SMTP headers - Handle colour logic via templates - Make $subjects have consistent output, go maps aren't sorted. - Make tests pass when v6 is disabled	2015-11-18 14:59:05 +00:00
Fabian Reinartz	d80fd26902	Add Dockerfile and target, change flag	2015-11-12 15:03:09 +01:00
Fabian Reinartz	dc656a44ea	Adjust config fields to 'receiver'	2015-11-10 14:08:20 +01:00
Fabian Reinartz	e4e594d826	Unify receiver naming	2015-11-10 13:47:04 +01:00
Fabian Reinartz	ba4c3d31b5	Extend merging test to cover more scenarios	2015-10-20 11:59:40 +02:00
Fabian Reinartz	6cbd7f5511	Inherit grouping labels, default grouping labels	2015-10-19 17:35:59 +02:00
Fabian Reinartz	cb0ecd9416	Alter config to have a root route	2015-10-19 16:52:54 +02:00
Fabian Reinartz	fa7955c9bc	Show logs of crashed testing instances	2015-10-15 16:17:04 +02:00
Fabian Reinartz	8148e82358	Terminate tests on Alertmanager crash	2015-10-15 16:15:37 +02:00
Fabian Reinartz	ff7eddc453	Add acceptance test for alert merging	2015-10-15 12:46:51 +02:00
Fabian Reinartz	2d3f0ecd84	Add test for silence deletion	2015-10-12 07:40:55 +02:00
Fabian Reinartz	0073647981	Restructure acceptance test files	2015-10-12 07:35:22 +02:00
Fabian Reinartz	16e693dd4f	Add simple test for retry logic	2015-10-12 07:28:43 +02:00
Fabian Reinartz	aca2089216	Add injection function to webhook	2015-10-12 07:28:25 +02:00

1 2

86 Commits