alertmanager

mirror of https://github.com/prometheus/alertmanager synced 2024-12-26 16:12:20 +00:00

Author	SHA1	Message	Date
Simon Pasquier	37884c8460	alertmanager: fix Settle() interval (#1478 ) Signed-off-by: Simon Pasquier <spasquie@redhat.com>	2018-07-24 22:42:09 +02:00
Ben Chess	235944cc5f	Email is green if only none firing (#1475 ) Signed-off-by: Benjamin Chess <bchess@gmail.com>	2018-07-23 14:06:46 +02:00
Max Inden	81b9a83f06	notify: Improve error handling (#1474 ) - `tmplText` and `tmplHTML` are using a monad-style error handling [1]. This reduces the verbosity of the error logic, but introduces the risk of forgetting the final error check. This patch does not remove this coding-style, but ensures proper error checking in the Email and PagerDuty notifier. - Ensure to handle errors returned by `multipartWriter.Close()` and `wc.Write(buffer.Bytes())` in `Email.Notify()`. [1] https://www.innoq.com/en/blog/golang-errors-monads/ Signed-off-by: Max Leonard Inden <IndenML@gmail.com>	2018-07-23 14:04:40 +02:00
Mark Van De Weert	7f86d613b6	enable templating of hipchat room_id (#1463 ) Signed-off-by: Mark Van De Weert <mark.vandeweert@wpengine.com>	2018-07-19 18:35:53 +02:00
stuart nelson	bd6100793f	Add timeout support to amtool commands (#1471 ) Signed-off-by: stuart nelson <stuartnelson3@gmail.com>	2018-07-17 09:50:48 +02:00
Bob Shannon	50e271678d	Add support for adding alerts using amtool (#1461 ) * Add support for adding alerts using amtool Signed-off-by: Bob Shannon <bshannon@palantir.com> * comment: Simplify return in addAlert Signed-off-by: Bob Shannon <bshannon@palantir.com>	2018-07-16 16:29:04 +02:00
Max Inden	81cc0ffa12	*: Cut 0.15.1 (#1467 ) Signed-off-by: Max Leonard Inden <IndenML@gmail.com>	2018-07-14 14:23:31 +02:00
Max Inden	3735df3ac7	cluster: Do not exit when failing to join cluster (#1465 ) Alertmanager is exiting with a non-zero exit code if the initial cluster join fails. This behavior could be not wanted because: - As Alertmanager is a critical component with an at-least-once guarantee, failing on joining the cluster is unnecessary as Alertmanager still functions by itself. - In an environment like Kubernetes discovering peers via DNS, peers might roll out one-by-one, leaving the DNS entries unpopulated for the first peer of a set. Failing on initial join prevents a roll-out. Instead of failing on the initial join this patch only logs the failure. The cluster can be later joined via the `handleReconnect`. This is a regression introduced in PR #1456 [1]. [1] https://github.com/prometheus/alertmanager/pull/1456 Signed-off-by: Max Leonard Inden <IndenML@gmail.com>	2018-07-11 17:19:33 +02:00
bigMacro	f3bc41d256	fix concurrent read and write group error (#1447 ) * fix concurrent read and write group Signed-off-by: denghuan <denghuan@actionsky.com> * make lock more elegant Signed-off-by: denghuan <denghuan@actionsky.com>	2018-07-10 17:13:41 +02:00
Simon Pasquier	5aac7c840b	amtool: add support for stdin to check-config (#1431 ) * amtool: add support for stdin to check-config Signed-off-by: Simon Pasquier <spasquie@redhat.com> * Address Stuart's comment Signed-off-by: Simon Pasquier <spasquie@redhat.com>	2018-07-09 19:27:04 +02:00
Corentin Chary	42ea9a565b	cluster: make sure we don't miss the first pushPull (#1456 ) * cluster: make sure we don't miss the first pushPull During the join, memberlist initiates a pushPull to get initial data. Unfortunately, at this point the nflog and silence listener have not been registered yet, so the first data arrives only after one pushPull cycle (1min by default !). Signed-off-by: Corentin Chary <c.chary@criteo.com>	2018-07-09 11:16:04 +02:00
Simon Pasquier	f5a258dd1d	cluster: fail when no private address can be found (#1437 ) The memberlist library fails when it can't find a private address and no advertise address is given. To return a helpful message to the user, AlertManager mimics the logic from memberlist. However the code had a bug that swallowed the error message and made it difficult for the user to understand how to fix the problem. Signed-off-by: Simon Pasquier <spasquie@redhat.com>	2018-07-05 22:59:56 +02:00
Max Inden	67d3d9e85a	Merge pull request #1458 from mxinden/add-next-release-changelog CHANGELOG.md: Add 'Next release' section with docker working dir change	2018-07-05 17:07:19 +02:00
Max Inden	a736a90dd0	Merge pull request #1436 from simonpasquier/fix-wechat-templ notify: catch templating errors for Wechat	2018-07-05 14:52:16 +02:00
Max Leonard Inden	4a6496c964	CHANGELOG.md: Add 'Next release' section with docker working dir change To ensure we include the breaking change notice in the next release notes, this patch adds a 'Next release' section mentioning the breaking change of the working directory of the Alertmanager Dockerfile. Signed-off-by: Max Leonard Inden <IndenML@gmail.com>	2018-07-05 13:47:40 +02:00
Simon Pasquier	2d3c4065e8	config: fix regression with Pager Duty (#1455 ) The YAML strict mode doesn't allow mapping keys that are duplicates. If someone wants to override one of the default keys in the Details hash, the unmarshal function returns an error because the key is already defined by DefaultPagerdutyConfig. Signed-off-by: Simon Pasquier <spasquie@redhat.com>	2018-07-05 09:54:28 +02:00
Martin Chodur	2cd2bd3644	fix: reverted change of dockerfile entrypoint (#1435 )	2018-07-04 10:53:51 +02:00
Max Inden	7d70fd9031	Merge pull request #1421 from palmerabollo/patch-1 fix: email template typo in alert-warning style	2018-07-04 10:00:39 +02:00
Waldemar Biller	4e8a910b9d	Lookup parts in strings using regexp.MatchString (#1452 ) Signed-off-by: Waldemar Biller <wbiller@gmail.com>	2018-07-03 10:55:47 +01:00
Max Inden	98105b8360	Merge pull request #1438 from mxinden/master CHANGELOG.md: Improve [CHANGE] section of v0.15.0 release	2018-06-30 02:55:14 +08:00
Max Leonard Inden	b8333e11fa	CHANGELOG.md: Improve [CHANGE] section of v0.15.0 release - Add entry for working dir change in Alertmanager Docker image - Indicate cluster flag changes Signed-off-by: Max Leonard Inden <IndenML@gmail.com>	2018-06-26 22:23:10 +08:00
Simon Pasquier	d188c21fb0	notify: catch templating errors for Wechat Signed-off-by: Simon Pasquier <spasquie@redhat.com>	2018-06-24 14:02:21 +02:00
Max Inden	dac673a6aa	Merge pull request #1430 from mxinden/release-0.15 *: Cut 0.15.0 merge to master	2018-06-23 00:56:13 +08:00
Max Leonard Inden	462c969d85	*: Cut 0.15.0 Signed-off-by: Max Leonard Inden <IndenML@gmail.com>	2018-06-22 19:01:25 +08:00
Guido García	fd97b969c8	fix: email template typo in alert-warning style Signed-off-by: Guido García <guido.garciabernardo@telefonica.com>	2018-06-18 17:39:26 +02:00
Max Inden	5e86f61bd7	Merge pull request #1419 from mxinden/cut-0.15.0-rc.3 *: Cut 0.15.0-rc.3	2018-06-18 12:17:42 +02:00
Max Leonard Inden	17e2fc7c2b	*: Cut 0.15.0-rc.3 Signed-off-by: Max Leonard Inden <IndenML@gmail.com>	2018-06-16 10:09:30 +02:00
Simon Pasquier	7a272416de	cluster: prune the queue if it contains too many items (#1418 ) * cluster: prune the queue if too large Signed-off-by: Simon Pasquier <spasquie@redhat.com> * Address review comments Also increases the pruning interval to 15 minutes and the max queue size to 4096 items (same value as used by Serf). Signed-off-by: Simon Pasquier <spasquie@redhat.com>	2018-06-15 18:08:12 +02:00
stuart nelson	445fbdf1a8	gossip large messages via SendReliable (#1415 ) * Gossip large messages via SendReliable For messages beyond half of the maximum gossip packet size, send the message to all peer nodes via TCP. The choice of "larger than half the max gossip size" is relatively arbitrary. From brief testing, the overhead from memberlist on a packet seemed to only use ~3 of the available 1400 bytes, and most gossip messages seem to be <<500 bytes. * Add tests for oversized/normal message gossiping * Make oversize metric names consistent * Remove errant printf in test * Correctly increment WaitGroup * Add comment for OversizedMessage func * Add metric for oversized messages dropped Code was added to drop oversized messages if the buffered channel they are sent on is full. This is a good thing to surface as a metric. * Add counter for total oversized messages sent * Change full queue log level to debug Was previously a warning, which isn't necessary now that there is a metric tracking it. Signed-off-by: stuart nelson <stuartnelson3@gmail.com>	2018-06-15 13:40:21 +02:00
Simon Pasquier	8034f137e1	cluster: don't track FQDN addresses as initial peers (#1416 ) Signed-off-by: Simon Pasquier <spasquie@redhat.com>	2018-06-15 12:34:50 +02:00
Simon Pasquier	6a7c912559	Sort alerts in correct order (#1349 ) * Sort dispatched alerts by job+instance in the correct order (#1178) Signed-off-by: Ted Zlatanov <tzz@lifelogs.com> * dispatch: add unit test for alerts sorting Signed-off-by: Simon Pasquier <spasquie@redhat.com>	2018-06-14 15:54:33 +02:00
Simon Pasquier	387e684faa	vendor: update prometheus/common packages (#1414 ) Signed-off-by: Simon Pasquier <spasquie@redhat.com>	2018-06-13 16:11:22 +02:00
Simon Pasquier	0c512998ee	Use Makefile.common from Prometheus (#1396 ) * Include Makefile.common * Fix the bindata.go files to make the style target happy * Inline `.PHONY` statements Signed-off-by: Simon Pasquier <spasquie@redhat.com>	2018-06-13 14:41:52 +02:00
stuart nelson	77cc718a81	[nflog] register snapshotSize This metric was never registered.	2018-06-12 13:59:48 +02:00
stuart nelson	d259bf9d09	Check for advertise host when setting failed peers (#1411 ) When setting initially failing peers, if we don't have a value for the advertise address, use the bindAddr. Signed-off-by: stuart nelson <stuartnelson3@gmail.com>	2018-06-11 14:18:15 +02:00
stuart nelson	ec2cc57d28	0.15.0-rc.2 (#1410 ) Signed-off-by: stuart nelson <stuartnelson3@gmail.com>	2018-06-08 14:38:37 +02:00
stuart nelson	6305229fcc	fix set initial failed peers (#1407 ) * Correctly add Node to initially failed peer Reconnect attempts to failed peers were panicking because peer.Address() would attempt to access the nil Node struct member. Signed-off-by: stuart nelson <stuartnelson3@gmail.com> * Correctly remove old peers Again, since we aren't assigning a name (this is generated) we rely on the node's Address for removing the initially joining (and potentially later re-joining) peers Signed-off-by: stuart nelson <stuartnelson3@gmail.com> * Test the peerJoin removes initial peers Signed-off-by: stuart nelson <stuartnelson3@gmail.com> * Don't add self to failing peers list The initially failing peers list shouldn't include the bindAddr for the alertmanager itself, as this connection is never made, and consequently only removed from the failedPeers list after the failed peer timeout. Signed-off-by: stuart nelson <stuartnelson3@gmail.com> * Filter initialFailed with advertise addr This may differ from bindAddr, and is the value we want to not attempt to connect to. Signed-off-by: stuart nelson <stuartnelson3@gmail.com>	2018-06-08 12:34:52 +02:00
stuart nelson	36588c3865	memberlist gossip (#1389 ) * Peers further propagate newly received nflogs If a peer receives an nflog that it hasn't seen before, queue the message and propagate it further to other peers. This should ensure that all peers within a cluster receive all gossip messages. Signed-off-by: stuart nelson <stuartnelson3@gmail.com> * Set Retransmit value based on number of members For alertmanagers that are brought up with a list of peers, set the number of message retransmits to be half of that number. If there are no peers on start, or there are few, continue to use the default value of 3. Signed-off-by: stuart nelson <stuartnelson3@gmail.com> * [nflog] Move retransmit calculation Signed-off-by: stuart nelson <stuartnelson3@gmail.com> * [silence] further gossip silence messages Signed-off-by: stuart nelson <stuartnelson3@gmail.com> * Set GossipNodes to equal RetransmitMulti During a gossip, we send messages to at most GossipNodes nodes. If possible, we want a message to reach all nodes as soon as possible. Signed-off-by: stuart nelson <stuartnelson3@gmail.com> * Fix rebase Signed-off-by: stuart nelson <stuartnelson3@gmail.com>	2018-06-08 11:48:42 +02:00
Simon Pasquier	b7d891cf39	notify: notify resolved alerts properly (#1408 ) * notify: notify resolved alerts properly The PR #1205 while fixing an existing issue introduced another bug when the send_resolved flag of the integration is set to true. With send_resolved set to false, the semantics remain the same: AlertManager generates a notification when new firing alerts are added to the alert group. The notification only carries firing alerts. With send_resolved set to true, AlertManager generates a notification when new firing or resolved alerts are added to the alert group. The notification carries both the firing and resolved notifications. Signed-off-by: Simon Pasquier <spasquie@redhat.com> * Fix comments Signed-off-by: Simon Pasquier <spasquie@redhat.com>	2018-06-08 11:37:38 +02:00
Simon Pasquier	9f87f9d6e7	cluster: advertise explicitly for empty addresses (#1386 ) memberlist doesn't advertise a valid IP address when the bind address is empty (":8001") or the unspecified IPv6 address ("[::]:8001"). Signed-off-by: Simon Pasquier <spasquie@redhat.com>	2018-06-07 17:57:01 +02:00
Kellen Fox	b949f0dc19	Amtool: Implement filter by receiver fixes:#937 (#1402 ) * Amtool: Implement filter by receiver * Adds receiver flag to amtool alert query * Adds receiver argument to alert http client * Updates http client tests for added argument Also works: specifying `receiver: "receiver-123"` in config file automatically filters all alerts shown * Include receiver in amtool config docs Now that I've implemented the receiver in amtool I should add the new feature to the documentation. * #937 Add mention of supporting regex syntax to receiver flag	2018-06-07 09:21:12 +02:00
stuart nelson	db4af95ea0	memberlist reconnect (#1384 ) * initial impl Signed-off-by: stuart nelson <stuartnelson3@gmail.com> * Add reconnectTimeout Signed-off-by: stuart nelson <stuartnelson3@gmail.com> * Fix locking Signed-off-by: stuart nelson <stuartnelson3@gmail.com> * Remove unused PeerStatuses Signed-off-by: stuart nelson <stuartnelson3@gmail.com> * Add metrics Signed-off-by: stuart nelson <stuartnelson3@gmail.com> * Actually use peerJoinCounter Signed-off-by: stuart nelson <stuartnelson3@gmail.com> * Cleanup peers map on peer timeout Signed-off-by: stuart nelson <stuartnelson3@gmail.com> * Add reconnect test Signed-off-by: stuart nelson <stuartnelson3@gmail.com> * test removing failed peers Signed-off-by: stuart nelson <stuartnelson3@gmail.com> * Use peer address as map key If a peer is restarted, it will rejoin with the same IP but different ULID. So the node will rejoin the cluster, but its peers will never remove it from their internal list of failed nodes because its ULID has changed. Signed-off-by: stuart nelson <stuartnelson3@gmail.com> * Add failed peers from creation Signed-off-by: stuart nelson <stuartnelson3@gmail.com> * Remove warnIfAlone() Signed-off-by: stuart nelson <stuartnelson3@gmail.com> * Update metric names Signed-off-by: stuart nelson <stuartnelson3@gmail.com> * Address comments Signed-off-by: stuart nelson <stuartnelson3@gmail.com>	2018-06-05 14:28:49 +02:00
Silvio Gissi	402564055b	Update Architecture diagram (#1394 ) * Update Architecture diagram Update diagram from sketch to vector. Add draw.io XML source file. Update README.md to display master doc/arch.jpg Signed-off-by: Silvio Gissi <silvio@gissilabs.com> * Updated README.md with relative link to architecture doc. * Updated Architecture document from JPG to SVG Signed-off-by: Silvio Gissi <silvio@gissilabs.com> * Small fix in graph. * Updated font to align with Prometheus architecture. Signed-off-by: Silvio Gissi <silvio@gissilabs.com> * Embedded images at arch.svg * Removed images from SVG, update source XML	2018-05-31 15:34:52 +02:00
Simon Pasquier	49717d91b0	parse: fix parsing for label values with commas (#1395 ) Signed-off-by: Simon Pasquier <spasquie@redhat.com>	2018-05-28 11:36:47 +02:00
Max Inden	a9e584be75	Merge pull request #1391 from simonpasquier/fix-circleci Fix CircleCI config for releases	2018-05-28 11:33:58 +02:00
Max Inden	402a4aa4a2	Merge pull request #1381 from simonpasquier/add-missing-header *: add missing license headers	2018-05-28 10:45:39 +02:00
Simon Pasquier	44e30e6a41	Fix CircleCI config for releases Signed-off-by: Simon Pasquier <spasquie@redhat.com>	2018-05-23 10:20:06 +02:00
Adam Shannon	bf0db5b989	cli: print more config details (#1376 ) Example output: $ amtool check-config alertmanager.yaml Checking 'alertmanager.yaml' SUCCESS Found: - global config - route - 0 inhibit rules - 13 receivers - 0 templates Signed-off-by: Adam Shannon <adamkshannon@gmail.com>	2018-05-15 09:17:51 +02:00
Simon Pasquier	0ebaeccd4b	*: add missing license headers Signed-off-by: Simon Pasquier <spasquie@redhat.com>	2018-05-14 17:37:13 +02:00
Alex Lardschneider	1f9a7b6182	[Request] Add Slack actions to notifications (#1355 ) * Added slack actions to notifications Signed-off-by: Alex Lardschneider <alex.lardschneider@gmail.com>	2018-05-14 17:26:11 +02:00
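
Two of the gossip rules described in the commit messages above (445fbdf1a8 and 36588c3865) are easier to follow as code. The following is only a minimal sketch of those two rules as the messages word them, not the code from the repository. The names `maxGossipPacketSize`, `defaultRetransmit`, `isOversized` and `retransmitFor` are hypothetical, the 1400-byte figure is taken from the "available 1400 bytes" remark in the commit message, and reading "few peers" as "half the peer count does not exceed the default" is an assumption.

```go
package main

import "fmt"

// maxGossipPacketSize is an assumed limit, borrowed from the "1400 bytes"
// mentioned in commit 445fbdf1a8. The real limit comes from memberlist.
const maxGossipPacketSize = 1400

// defaultRetransmit is the default value of 3 named in commit 36588c3865.
const defaultRetransmit = 3

// isOversized reports whether a message is larger than half of the maximum
// gossip packet size. Per the commit message, such messages would be sent
// to peers over TCP instead of being gossiped.
func isOversized(msg []byte) bool {
	return len(msg) > maxGossipPacketSize/2
}

// retransmitFor returns half the number of configured peers, falling back
// to the default when that value would not exceed it (assumed meaning of
// "no peers on start, or there are few").
func retransmitFor(peers int) int {
	if r := peers / 2; r > defaultRetransmit {
		return r
	}
	return defaultRetransmit
}

func main() {
	for _, size := range []int{200, 700, 701} {
		fmt.Printf("size=%d oversized=%t\n", size, isOversized(make([]byte, size)))
	}
	for _, peers := range []int{0, 4, 20} {
		fmt.Printf("peers=%d retransmit=%d\n", peers, retransmitFor(peers))
	}
}
```

The threshold is a strict "greater than", matching the wording "beyond half of the maximum gossip packet size"; the commit message itself calls that choice relatively arbitrary.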

1691 Commits