alertmanager

Commit Graph

Author	SHA1	Message	Date
Simon Pasquier	c7de536129	*: use stdlib context (#1768) This change removes all usage of golang.org/x/net/context in the code base. It also bumps a few dependencies for the same reason: - github.com/gogo/protobuf - go-openapi/ Signed-off-by: Simon Pasquier <spasquie@redhat.com>	2019-02-26 12:18:57 +01:00
Ye Ben	5f8eaf9560	cluster/delegate: Replace labels to const to reduce hardcode (#1724) Signed-off-by: yeya24 <ben.ye@daocloud.io>	2019-01-28 10:17:55 +01:00
JoeWrightss	9ccbeb585b	cluster: Fix typo in comment (#1668) Signed-off-by: JoeWrightss <zhoulin.xie@daocloud.io>	2018-12-16 14:03:55 +01:00
Povilas Versockas	7f34cb4716	cluster: Add cluster peers DNS refresh job (#1428) Adds a job which runs periodically and refreshes cluster.peer DNS records. The problem is that when you restart all of the alertmanager instances in an environment like Kubernetes, DNS may contain old alertmanager instance IPs, but on startup (when Join() happens) none of the new instance IPs. Because DNS is not empty at startup, resolvePeers with waitIfEmpty=true will return, and "islands" of a single alertmanager instance will form. Signed-off-by: Povilas Versockas <p.versockas@gmail.com>	2018-11-23 09:47:13 +01:00
Simon Pasquier	13d71e58fa	cluster: skip tests when no private ip address exists (#1470) The memberlist library will fail to set up the cluster when the machine has no private IP address. Signed-off-by: Simon Pasquier <spasquie@redhat.com>	2018-08-22 17:40:07 +02:00
Max Inden	3735df3ac7	cluster: Do not exit when failing to join cluster (#1465) Alertmanager is exiting with a non-zero exit code if the initial cluster join fails. This behavior may be unwanted because: - As Alertmanager is a critical component with an at-least-once guarantee, failing on joining the cluster is unnecessary as Alertmanager still functions by itself. - In an environment like Kubernetes discovering peers via DNS, peers might roll out one-by-one, leaving the DNS entries unpopulated for the first peer of a set. Failing on initial join prevents a roll-out. Instead of failing on the initial join this patch only logs the failure. The cluster can be later joined via the `handleReconnect`. This is a regression introduced in PR #1456 [1]. [1] https://github.com/prometheus/alertmanager/pull/1456 Signed-off-by: Max Leonard Inden <IndenML@gmail.com>	2018-07-11 17:19:33 +02:00
Corentin Chary	42ea9a565b	cluster: make sure we don't miss the first pushPull (#1456) * cluster: make sure we don't miss the first pushPull During the join, memberlist initiates a pushPull to get initial data. Unfortunately, at this point the nflog and silence listener have not been registered yet, so the first data arrives only after one pushPull cycle (1min by default!). Signed-off-by: Corentin Chary <c.chary@criteo.com>	2018-07-09 11:16:04 +02:00
Simon Pasquier	f5a258dd1d	cluster: fail when no private address can be found (#1437) The memberlist library fails when it can't find a private address and no advertise address is given. To return a helpful message to the user, Alertmanager mimics the logic from memberlist. However the code had a bug that swallowed the error message and made it difficult for the user to understand how to fix the problem. Signed-off-by: Simon Pasquier <spasquie@redhat.com>	2018-07-05 22:59:56 +02:00
Simon Pasquier	7a272416de	cluster: prune the queue if it contains too many items (#1418) * cluster: prune the queue if too large Signed-off-by: Simon Pasquier <spasquie@redhat.com> * Address review comments Also increases the pruning interval to 15 minutes and the max queue size to 4096 items (same value as used by Serf). Signed-off-by: Simon Pasquier <spasquie@redhat.com>	2018-06-15 18:08:12 +02:00
stuart nelson	445fbdf1a8	gossip large messages via SendReliable (#1415) * Gossip large messages via SendReliable For messages beyond half of the maximum gossip packet size, send the message to all peer nodes via TCP. The choice of "larger than half the max gossip size" is relatively arbitrary. From brief testing, the overhead from memberlist on a packet seemed to only use ~3 of the available 1400 bytes, and most gossip messages seem to be <<500 bytes. * Add tests for oversized/normal message gossiping * Make oversize metric names consistent * Remove errant printf in test * Correctly increment WaitGroup * Add comment for OversizedMessage func * Add metric for oversized messages dropped Code was added to drop oversized messages if the buffered channel they are sent on is full. This is a good thing to surface as a metric. * Add counter for total oversized messages sent * Change full queue log level to debug Was previously a warning, which isn't necessary now that there is a metric tracking it. Signed-off-by: stuart nelson <stuartnelson3@gmail.com>	2018-06-15 13:40:21 +02:00
Simon Pasquier	8034f137e1	cluster: don't track FQDN addresses as initial peers (#1416) Signed-off-by: Simon Pasquier <spasquie@redhat.com>	2018-06-15 12:34:50 +02:00
stuart nelson	d259bf9d09	Check for advertise host when setting failed peers (#1411) When setting initially failing peers, if we don't have a value for the advertise address, use the bindAddr. Signed-off-by: stuart nelson <stuartnelson3@gmail.com>	2018-06-11 14:18:15 +02:00
stuart nelson	6305229fcc	fix set initial failed peers (#1407) * Correctly add Node to initially failed peer Reconnect attempts to failed peers were panicking because peer.Address() would attempt to access the nil Node struct member. Signed-off-by: stuart nelson <stuartnelson3@gmail.com> * Correctly remove old peers Again, since we aren't assigning a name (this is generated) we rely on the node's Address for removing the initially joining (and potentially later re-joining) peers Signed-off-by: stuart nelson <stuartnelson3@gmail.com> * Test the peerJoin removes initial peers Signed-off-by: stuart nelson <stuartnelson3@gmail.com> * Don't add self to failing peers list The initially failing peers list shouldn't include the bindAddr for the alertmanager itself, as this connection is never made, and consequently only removed from the failedPeers list after the failed peer timeout. Signed-off-by: stuart nelson <stuartnelson3@gmail.com> * Filter initialFailed with advertise addr This may differ from bindAddr, and is the value we want to not attempt to connect to. Signed-off-by: stuart nelson <stuartnelson3@gmail.com>	2018-06-08 12:34:52 +02:00
stuart nelson	36588c3865	memberlist gossip (#1389) * Peers further propagate newly received nflogs If a peer receives an nflog that it hasn't seen before, queue the message and propagate it further to other peers. This should ensure that all peers within a cluster receive all gossip messages. Signed-off-by: stuart nelson <stuartnelson3@gmail.com> * Set Retransmit value based on number of members For alertmanagers that are brought up with a list of peers, set the number of message retransmits to be half of that number. If there are no peers on start, or there are few, continue to use the default value of 3. Signed-off-by: stuart nelson <stuartnelson3@gmail.com> * [nflog] Move retransmit calculation Signed-off-by: stuart nelson <stuartnelson3@gmail.com> * [silence] further gossip silence messages Signed-off-by: stuart nelson <stuartnelson3@gmail.com> * Set GossipNodes to equal RetransmitMulti During a gossip, we send messages to at most GossipNodes nodes. If possible, we want a message to reach all nodes as soon as possible. Signed-off-by: stuart nelson <stuartnelson3@gmail.com> * Fix rebase Signed-off-by: stuart nelson <stuartnelson3@gmail.com>	2018-06-08 11:48:42 +02:00
Simon Pasquier	9f87f9d6e7	cluster: advertise explicitly for empty addresses (#1386) memberlist doesn't advertise a valid IP address when the bind address is empty (":8001") or the unspecified IPv6 address ("[::]:8001"). Signed-off-by: Simon Pasquier <spasquie@redhat.com>	2018-06-07 17:57:01 +02:00
stuart nelson	db4af95ea0	memberlist reconnect (#1384) * initial impl Signed-off-by: stuart nelson <stuartnelson3@gmail.com> * Add reconnectTimeout Signed-off-by: stuart nelson <stuartnelson3@gmail.com> * Fix locking Signed-off-by: stuart nelson <stuartnelson3@gmail.com> * Remove unused PeerStatuses Signed-off-by: stuart nelson <stuartnelson3@gmail.com> * Add metrics Signed-off-by: stuart nelson <stuartnelson3@gmail.com> * Actually use peerJoinCounter Signed-off-by: stuart nelson <stuartnelson3@gmail.com> * Cleanup peers map on peer timeout Signed-off-by: stuart nelson <stuartnelson3@gmail.com> * Add reconnect test Signed-off-by: stuart nelson <stuartnelson3@gmail.com> * test removing failed peers Signed-off-by: stuart nelson <stuartnelson3@gmail.com> * Use peer address as map key If a peer is restarted, it will rejoin with the same IP but different ULID. So the node will rejoin the cluster, but its peers will never remove it from their internal list of failed nodes because its ULID has changed. Signed-off-by: stuart nelson <stuartnelson3@gmail.com> * Add failed peers from creation Signed-off-by: stuart nelson <stuartnelson3@gmail.com> * Remove warnIfAlone() Signed-off-by: stuart nelson <stuartnelson3@gmail.com> * Update metric names Signed-off-by: stuart nelson <stuartnelson3@gmail.com> * Address comments Signed-off-by: stuart nelson <stuartnelson3@gmail.com>	2018-06-05 14:28:49 +02:00
Simon Pasquier	0ebaeccd4b	*: add missing license headers Signed-off-by: Simon Pasquier <spasquie@redhat.com>	2018-05-14 17:37:13 +02:00
rhysm	e4416bd612	Add additional cluster configuration flags (#1379) The cluster configuration uses DefaultLANConfig which seems to be quite sensitive to WAN conditions. Allowing the tuning of these 3 parameters (TCP Timeout, Probe Interval and Probe Timeout) makes clustering more robust across WAN connections. Signed-off-by: Rhys Meaclem <rhysmeaclem@gmail.com>	2018-05-14 09:22:04 +02:00
Simon Pasquier	d0b664b618	cluster: gofmt code Signed-off-by: Simon Pasquier <spasquie@redhat.com>	2018-04-10 12:06:23 +02:00
Simon Pasquier	1531aa66f3	Fix for #1282 (#1286) * cluster: add alertmanager_cluster_messages_queued metric * cluster: add metrics for sent messages This change adds 2 new metrics: - alertmanager_cluster_messages_sent_total - alertmanager_cluster_messages_sent_size_total * Fix marshaling for entries being broadcast Individual notifications logs and silences being broadcast to the other peers need to be encoded using the same length-delimited format as when doing full-state synchronization. * main: fix argument order for cluster.Join() cluster.Join() was called with the push/pull and gossip interval parameters being swapped one for another.	2018-03-22 13:53:00 +01:00
Corentin Chary	dd75201f1c	Add /-/ready based on mesh status (#1209) * Wait for the gossip to settle before sending notifications See #1209 for details. As a heuristic for mesh readiness, try to see if the mesh looks stable (the number of peers isn't changing too much). This implementation always marks the alertmanager as ready after a maximum of 60s. This adds one new flag to control this behavior: ``` --cluster.settle-timeout=60s mesh settling timeout. Do not wait more than this duration on startup. ``` It also adds `/-/ready` which always returns 200 (in order to make it clear that we are ready as soon as we can receive requests). The mesh status is exposed in `/api/v1/status` and visible on `/#/status`. * cluster: fix typos and base interval on gossipInterval	2018-03-02 15:45:21 +01:00
pasquier-s	3df093968c	cluster: gather alertmanager_peer_position all the time (#1247) * cluster: gather alertmanager_peer_position all the time This change moves the gathering of the alertmanager_peer_position metric outside of the clusterWait() function so that the metric is computed accurately even when no alerting group fires. * cluster: add alertmanager_cluster_health_score metric This metric is retrieved from the memberlist library.	2018-02-27 10:37:56 +01:00
Simon Pasquier	f4c81c43e9	cluster: pass resolved peers to Join()	2018-02-13 16:53:09 +01:00
Fabian Reinartz	3f2e00fbea	cluster/api: improve metrics and cluster status	2018-02-09 11:16:00 +01:00
Fabian Reinartz	247bfff606	cluster: remove MergeSingle	2018-02-09 11:06:51 +01:00
Fabian Reinartz	fd49dbb477	*: move to memberlist for clustering	2018-02-08 12:18:44 +01:00

26 Commits