alertmanager

Commit Graph

Author	SHA1	Message	Date
stuart nelson	77cc718a81	[nflog] register snapshotSize This metric was never registered.	2018-06-12 13:59:48 +02:00
stuart nelson	d259bf9d09	Check for advertise host when setting failed peers (#1411 ) When setting initially failing peers, if we don't have a value for the advertise address, use the bindAddr. Signed-off-by: stuart nelson <stuartnelson3@gmail.com>	2018-06-11 14:18:15 +02:00
stuart nelson	ec2cc57d28	0.15.0-rc.2 (#1410 ) Signed-off-by: stuart nelson <stuartnelson3@gmail.com>	2018-06-08 14:38:37 +02:00
stuart nelson	6305229fcc	fix set initial failed peers (#1407 ) * Correctly add Node to initially failed peer Reconnect attempts to failed peers were panicking because peer.Address() would attempt to access the nil Node struct member. Signed-off-by: stuart nelson <stuartnelson3@gmail.com> * Correctly remove old peers Again, since we aren't assigning a name (this is generated) we rely on the node's Address for removing the initially joining (and potentially later re-joining) peers Signed-off-by: stuart nelson <stuartnelson3@gmail.com> * Test the peerJoin removes initial peers Signed-off-by: stuart nelson <stuartnelson3@gmail.com> * Don't add self to failing peers list The initially failing peers list shouldn't include the bindAddr for the alertmanager itself, as this connection is never made, and consequently only removed from the failedPeers list after the failed peer timeout. Signed-off-by: stuart nelson <stuartnelson3@gmail.com> * Filter initialFailed with advertise addr This may differ from bindAddr, and is the value we want to not attempt to connect to. Signed-off-by: stuart nelson <stuartnelson3@gmail.com>	2018-06-08 12:34:52 +02:00
stuart nelson	36588c3865	memberlist gossip (#1389 ) * Peers further propagate newly received nflogs If a peer receives an nflog that it hasn't seen before, queue the message and propagate it further to other peers. This should ensure that all peers within a cluster receive all gossip messages. Signed-off-by: stuart nelson <stuartnelson3@gmail.com> * Set Retransmit value based on number of members For alertmanagers that are brought up with a list of peers, set the number of message retransmits to be half of that number. If there are no peers on start, or there are few, continue to use the default value of 3. Signed-off-by: stuart nelson <stuartnelson3@gmail.com> * [nflog] Move retransmit calculation Signed-off-by: stuart nelson <stuartnelson3@gmail.com> * [silence] further gossip silence messages Signed-off-by: stuart nelson <stuartnelson3@gmail.com> * Set GossipNodes to equal RetransmitMulti During a gossip, we send messages to at most GossipNodes nodes. If possible, we want a message to reach all nodes as soon as possible. Signed-off-by: stuart nelson <stuartnelson3@gmail.com> * Fix rebase Signed-off-by: stuart nelson <stuartnelson3@gmail.com>	2018-06-08 11:48:42 +02:00
Simon Pasquier	b7d891cf39	notify: notify resolved alerts properly (#1408 ) * notify: notify resolved alerts properly The PR #1205 while fixing an existing issue introduced another bug when the send_resolved flag of the integration is set to true. With send_resolved set to false, the semantics remain the same: AlertManager generates a notification when new firing alerts are added to the alert group. The notification only carries firing alerts. With send_resolved set to true, AlertManager generates a notification when new firing or resolved alerts are added to the alert group. The notification carries both the firing and resolved notifications. Signed-off-by: Simon Pasquier <spasquie@redhat.com> * Fix comments Signed-off-by: Simon Pasquier <spasquie@redhat.com>	2018-06-08 11:37:38 +02:00
Simon Pasquier	9f87f9d6e7	cluster: advertise explicitly for empty addresses (#1386 ) memberlist doesn't advertise a valid IP address when the bind address is empty (":8001") or the unspecified IPv6 address ("[::]:8001"). Signed-off-by: Simon Pasquier <spasquie@redhat.com>	2018-06-07 17:57:01 +02:00
Kellen Fox	b949f0dc19	Amtool: Implement filter by receiver fixes:#937 (#1402 ) * Amtool: Implement filter by receiver * Adds receiver flag to amtool alert query * Adds receiver argument to alert http client * Updates http client tests for added argument Also works: specifying `receiver: "receiver-123"` in config file automatically filters all alerts shown * Include receiver in amtool config docs Now that I've implemented the receiver in amtool I should add the new feature to the documentation. * #937 Add mention of supporting regex syntax to receiver flag	2018-06-07 09:21:12 +02:00
stuart nelson	db4af95ea0	memberlist reconnect (#1384 ) * initial impl Signed-off-by: stuart nelson <stuartnelson3@gmail.com> * Add reconnectTimeout Signed-off-by: stuart nelson <stuartnelson3@gmail.com> * Fix locking Signed-off-by: stuart nelson <stuartnelson3@gmail.com> * Remove unused PeerStatuses Signed-off-by: stuart nelson <stuartnelson3@gmail.com> * Add metrics Signed-off-by: stuart nelson <stuartnelson3@gmail.com> * Actually use peerJoinCounter Signed-off-by: stuart nelson <stuartnelson3@gmail.com> * Cleanup peers map on peer timeout Signed-off-by: stuart nelson <stuartnelson3@gmail.com> * Add reconnect test Signed-off-by: stuart nelson <stuartnelson3@gmail.com> * test removing failed peers Signed-off-by: stuart nelson <stuartnelson3@gmail.com> * Use peer address as map key If a peer is restarted, it will rejoin with the same IP but different ULID. So the node will rejoin the cluster, but its peers will never remove it from their internal list of failed nodes because its ULID has changed. Signed-off-by: stuart nelson <stuartnelson3@gmail.com> * Add failed peers from creation Signed-off-by: stuart nelson <stuartnelson3@gmail.com> * Remove warnIfAlone() Signed-off-by: stuart nelson <stuartnelson3@gmail.com> * Update metric names Signed-off-by: stuart nelson <stuartnelson3@gmail.com> * Address comments Signed-off-by: stuart nelson <stuartnelson3@gmail.com>	2018-06-05 14:28:49 +02:00
Silvio Gissi	402564055b	Update Architecture diagram (#1394 ) * Update Architecture diagram Update diagram from sketch to vector. Add draw.io XML source file. Update README.md to display master doc/arch.jpg Signed-off-by: Silvio Gissi <silvio@gissilabs.com> * Updated README.md with relative link to architecture doc. * Updated Architecture document from JPG to SVG Signed-off-by: Silvio Gissi <silvio@gissilabs.com> * Small fix in graph. * Updated font to align with Prometheus architecture. Signed-off-by: Silvio Gissi <silvio@gissilabs.com> * Embedded images at arch.svg * Removed images from SVG, update source XML	2018-05-31 15:34:52 +02:00
Simon Pasquier	49717d91b0	parse: fix parsing for label values with commas (#1395 ) Signed-off-by: Simon Pasquier <spasquie@redhat.com>	2018-05-28 11:36:47 +02:00
Max Inden	a9e584be75	Merge pull request #1391 from simonpasquier/fix-circleci Fix CircleCI config for releases	2018-05-28 11:33:58 +02:00
Max Inden	402a4aa4a2	Merge pull request #1381 from simonpasquier/add-missing-header *: add missing license headers	2018-05-28 10:45:39 +02:00
Simon Pasquier	44e30e6a41	Fix CircleCI config for releases Signed-off-by: Simon Pasquier <spasquie@redhat.com>	2018-05-23 10:20:06 +02:00
Adam Shannon	bf0db5b989	cli: print more config details (#1376 ) Example output: $ amtool check-config alertmanager.yaml Checking 'alertmanager.yaml' SUCCESS Found: - global config - route - 0 inhibit rules - 13 receivers - 0 templates Signed-off-by: Adam Shannon <adamkshannon@gmail.com>	2018-05-15 09:17:51 +02:00
Simon Pasquier	0ebaeccd4b	*: add missing license headers Signed-off-by: Simon Pasquier <spasquie@redhat.com>	2018-05-14 17:37:13 +02:00
Alex Lardschneider	1f9a7b6182	[Request] Add Slack actions to notifications (#1355 ) * Added slack actions to notifications Signed-off-by: Alex Lardschneider <alex.lardschneider@gmail.com>	2018-05-14 17:26:11 +02:00
Simon Pasquier	292256ca7f	vendor: remove unused packages (#1380 ) Signed-off-by: Simon Pasquier <spasquie@redhat.com>	2018-05-14 16:23:48 +02:00
rhysm	e4416bd612	Add additional cluster configuration flags (#1379 ) The cluster configuration uses DefaultLANConfig which seems to be quite sensitive to WAN conditions. Allowing the tuning of these 3 parameters (TCP Timeout, Probe Interval and Probe Timeout) makes clustering more robust across WAN connections. Signed-off-by: Rhys Meaclem <rhysmeaclem@gmail.com>	2018-05-14 09:22:04 +02:00
stuart nelson	942be9d993	cli alert query: Expose --active and --unprocessed (#1370 ) * cli alert query: Expose --active and --unprocessed Support the new filter options in the alerts api endpoint introduced by https://github.com/prometheus/alertmanager/pull/1366 Signed-off-by: stuart nelson <stuartnelson3@gmail.com> * Update comment and client_test Signed-off-by: stuart nelson <stuartnelson3@gmail.com>	2018-05-09 10:57:01 +02:00
Simon Pasquier	02f10f204f	circleci: fix docker push command (#1371 ) Signed-off-by: Simon Pasquier <spasquie@redhat.com>	2018-05-08 11:41:25 +02:00
Simon Pasquier	28967e394e	config: fix Go formatting (#1368 ) Signed-off-by: Simon Pasquier <spasquie@redhat.com>	2018-05-07 18:12:14 +02:00
Simon Pasquier	75900ea62a	api: remove dead code (#1367 ) This is a follow-up of `f825d97de4`. Signed-off-by: Simon Pasquier <spasquie@redhat.com>	2018-05-07 18:11:36 +02:00
Simon Pasquier	383024e63d	api: support more query filters (#1366 ) * api: support more query filters This change adds 2 new query filters to the /api/v1/alerts endpoint. - active, filter out active alerts when set to 'false' (default: 'true'). - unprocessed, filter out unprocessed alerts when set to 'false' (default: 'true'). The default values ensure that the API behavior remains the same as before when the query filters aren't provided. Signed-off-by: Simon Pasquier <spasquie@redhat.com> * api: address comments Signed-off-by: Simon Pasquier <spasquie@redhat.com>	2018-05-07 18:07:19 +02:00
Max Inden	05fb09aebd	Merge pull request #1362 from mxinden/deprecate-v0-alerts api: Deprecate `api/alerts` endpoint	2018-05-05 13:43:54 +02:00
stuart nelson	1c0c24b300	Update alerts argument order, rename expired to inhibited (#1360 ) Signed-off-by: stuart nelson <stuartnelson3@gmail.com>	2018-05-04 10:43:38 +02:00
Max Leonard Inden	f825d97de4	api: Deprecate `api/alerts` endpoint With prometheus/prometheus commit e114ce0ff7a1ae06b24fdc479ffc7422074c1ebe [1] Prometheus switches from using `api/alerts` to `api/v1/alerts`. This commit is included starting from Prometheus v0.17.0. As discussed on the prometheus-developers mailing list [2] the deprecation period is long over. [1] github.com/prometheus/prometheus/commit/e114ce0ff7a1ae06b24fdc479ffc7422074c1ebe [2] https://groups.google.com/d/msg/prometheus-developers/2CCuFTMbmAg/Qg58rvyzAQAJ Signed-off-by: Max Leonard Inden <IndenML@gmail.com>	2018-05-04 09:59:14 +02:00
Simon Pasquier	998984d8d6	Update CircleCI build (#1354 ) This change upgrades the build configuration to CircleCI 2.0. Signed-off-by: Simon Pasquier <spasquie@redhat.com>	2018-05-03 09:36:23 +02:00
RogerYuQian	8a0faa9946	fix wechat issue (#1353 ) (#1356 )	2018-05-03 09:32:09 +02:00
Simon Pasquier	b3cc6229a2	notify: remove wechat unit test (#1350 ) The unit test was making a request to the public Wechat endpoint which caused flaky results. Signed-off-by: Simon Pasquier <spasquie@redhat.com>	2018-04-30 20:01:38 +02:00
Ted Zlatanov	b04e9ad19b	#1346 : move maintenance messages to DEBUG log level (#1347 ) Signed-off-by: Ted Zlatanov <tzz@lifelogs.com>	2018-04-30 11:56:17 +02:00
Trevor Wood	cecfe5b2f5	Validate Slack field config and only allow the necessary input (#1334 ) Signed-off-by: Trevor Wood <Trevor.G.Wood@gmail.com>	2018-04-25 18:58:11 +02:00
stuart nelson	cfde256913	[amtool] fix silence import --help format	2018-04-24 11:46:24 +02:00
Simon Pasquier	dc5fc02d22	[amtool] use kingpin.v2 (#1330 ) * Use default values to store values from config * fix typo and reserved keyword * move to long help texts * add one more unit test for resolver * update comments Signed-off-by: Simon Pasquier <spasquie@redhat.com>	2018-04-24 09:35:15 +02:00
Ted Zlatanov	af5dd74264	tell users opening issues to use alertmanager --version (#1327 ) `-version` doesn't work as of 0.15-rc.1 so users should run `alertmanager --version` Signed-off-by: Ted Zlatanov <tzz@lifelogs.com>	2018-04-23 18:02:06 +02:00
stuart nelson	bc263d3e61	Improve notification instrumentation (#1335 ) * Improve notification instrumentation - Add notificationLatencySeconds histogram to debug duplicate messages. This can help rule out if duplicate messages are being caused by excessive latency when sending a notification. Signed-off-by: stuart nelson <stuartnelson3@gmail.com>	2018-04-23 14:23:01 +02:00
stuart nelson	80f2eeb2ca	Fix resolved alerts still inhibiting (#1331 ) * inhibit: update inhibition cache when alerts resolve Signed-off-by: Simon Pasquier <spasquie@redhat.com> * inhibit: remove unnecessary fmt.Sprintf Signed-off-by: Simon Pasquier <spasquie@redhat.com> * inhibit: add unit tests Signed-off-by: Simon Pasquier <spasquie@redhat.com> * inhibit: use NopLogger in tests Signed-off-by: Simon Pasquier <spasquie@redhat.com> * Update old alert with result of merge with new On ingest, alerts with matching fingerprints are merged if the new alert's start and end times overlap with the old alert's. The merge creates a new alert, which is then updated in the internal alert store. The original alert is not updated (because merge creates a copy), so it is never marked as resolved in the inhibitor's reference to it. The code within the inhibitor relies on skipping over resolved alerts, but because the old alert is never updated it is never marked as resolved. Thus it continues to inhibit other alerts until it is cleaned up by the internal GC. This commit updates the struct of the old alert with the result of the merge with the new alert. An alternative would be to always update the inhibitor's internal cache of alerts regardless of an alert's resolve status. Signed-off-by: stuart nelson <stuartnelson3@gmail.com> * Update inhibitor cache even if alert is resolved This seems like a better choice than the previous commit. I think it is more sane to have the inhibitor update its own cache, rather than having one of its pointers updated externally. Signed-off-by: stuart nelson <stuartnelson3@gmail.com>	2018-04-18 16:26:04 +02:00
Manos Fokas	300a87e85b	Removed file changes to resolve conflict. (#1318 ) Signed-off-by: manosf <manosf@protonmail.com>	2018-04-17 16:22:46 +02:00
stuart nelson	e7bc6e2935	Move amtool to modular structure (#1321 ) * Move amtool to modular structure Signed-off-by: Stuart Nelson <stuartnelson3@gmail.com> * Move toplevel setup back into root.go Signed-off-by: Stuart Nelson <stuartnelson3@gmail.com> * Remove confusing alert struct name overwriting A local variable within the alert subcommand was using the name of the struct within that file. Signed-off-by: Stuart Nelson <stuartnelson3@gmail.com> * change local var name shadowing struct name Signed-off-by: Stuart Nelson <stuartnelson3@gmail.com>	2018-04-13 13:34:16 +02:00
stuart nelson	360dba6d9a	Rename silence API Delete() -> Expire() (#1319 ) Within alertmanager, expire is the term used, since silences still "exist" but aren't in effect. Signed-off-by: Stuart Nelson <stuartnelson3@gmail.com>	2018-04-11 12:30:18 +02:00
Simon Pasquier	c92ed69ce8	Split cli package (#1314 ) * cli: move commands to cli/cmd * cli: use StatusAPI interface for config command * cli: use SilenceAPI interface for silence commands * cli: use AlertAPI for alert command * cli: move back commands to cli package And move API client code to its own package. * cli: remove unused structs	2018-04-11 11:17:41 +02:00
Max Inden	510e67ef18	Merge pull request #1316 from simonpasquier/fix-decode-state Fix potential panic in decodeState()	2018-04-10 18:51:57 +02:00
Max Inden	a9b7026bc2	Merge pull request #1317 from simonpasquier/go-fmt gofmt code	2018-04-10 13:04:17 +02:00
Simon Pasquier	d0b664b618	cluster: gofmt code Signed-off-by: Simon Pasquier <spasquie@redhat.com>	2018-04-10 12:06:23 +02:00
Simon Pasquier	2d68b4d318	silence: fix potential panic in decodeState() Signed-off-by: Simon Pasquier <spasquie@redhat.com>	2018-04-10 10:12:05 +02:00
Simon Pasquier	a8c995f77c	nflog: fix potential panic in decodeState() Signed-off-by: Simon Pasquier <spasquie@redhat.com>	2018-04-10 10:11:40 +02:00
stuart nelson	b1625a08a0	Provide default working config for artifacts (#1313 )	2018-04-05 16:25:45 +02:00
Simon Pasquier	f53b24765d	api: initialize alerts_received_total labels (#1310 )	2018-04-04 10:38:17 +02:00
Simon Pasquier	cb169a5ec6	parse: fix missing argument to fmt.Errorf (#1311 )	2018-04-04 10:37:35 +02:00
Simon Pasquier	4cba49155d	dispatch: don't reset timer if flush is in-progress (#1301 ) When the aggregation group receives an alert that is past the initial group_wait value, it should reset its timer only if the timer has ever expired. Otherwise it means that the flush is already in-progress.	2018-03-29 12:22:49 +02:00
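
Commit 383024e63d above describes two query filters on the /api/v1/alerts endpoint, `active` and `unprocessed`, both defaulting to 'true'. The following is a minimal sketch of how a client might build such a request URL. The helper name `alertsURL` and the base address `http://localhost:9093` are assumptions made for illustration and do not come from the commit log; only the endpoint path and the two filter names are taken from the commit message.

```go
package main

import (
	"fmt"
	"net/url"
	"path"
	"strconv"
)

// alertsURL is a hypothetical helper (not part of alertmanager) that builds
// a URL for the /api/v1/alerts endpoint with the two filters described in
// commit 383024e63d. Setting a filter to false filters out that class of
// alerts; both filters default to true on the server side.
func alertsURL(base string, active, unprocessed bool) (string, error) {
	u, err := url.Parse(base)
	if err != nil {
		return "", err
	}
	u.Path = path.Join(u.Path, "/api/v1/alerts")

	q := url.Values{}
	q.Set("active", strconv.FormatBool(active))
	q.Set("unprocessed", strconv.FormatBool(unprocessed))
	u.RawQuery = q.Encode() // Encode sorts the keys alphabetically

	return u.String(), nil
}

func main() {
	// The base address is an assumption; substitute your own instance.
	s, err := alertsURL("http://localhost:9093", false, true)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(s)
}
```

Passing both values explicitly keeps the request self-describing; omitting them should yield the same behavior as before the change, since the commit states that the defaults preserve the previous API behavior.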


1608 Commits
