alertmanager

Commit Graph

Author	SHA1	Message	Date
Matthias Loibl	a6d10bd5bc	Update golangci-lint and fix complaints (#2853) * Copy latest golangci-lint files from Prometheus Signed-off-by: Matthias Loibl <mail@matthiasloibl.com> * Use grafana/regexp over stdlib regexp Signed-off-by: Matthias Loibl <mail@matthiasloibl.com> * Fix typos in comments Signed-off-by: Matthias Loibl <mail@matthiasloibl.com> * Fix goimports complains in import sorting Signed-off-by: Matthias Loibl <mail@matthiasloibl.com> * gofumpt all Go files Signed-off-by: Matthias Loibl <mail@matthiasloibl.com> * Update naming to comply with revive linter Signed-off-by: Matthias Loibl <mail@matthiasloibl.com> * config: Fix error messages to be lower case Signed-off-by: Matthias Loibl <mail@matthiasloibl.com> * test/cli: Fix error messages to be lower case Signed-off-by: Matthias Loibl <mail@matthiasloibl.com> * .golangci.yaml: Remove obsolete space Signed-off-by: Matthias Loibl <mail@matthiasloibl.com> * config: Fix expected victorOps error Signed-off-by: Matthias Loibl <mail@matthiasloibl.com> * Use stdlib regexp Signed-off-by: Matthias Loibl <mail@matthiasloibl.com> * Clean up Go modules Signed-off-by: Matthias Loibl <mail@matthiasloibl.com>	2022-03-25 17:59:51 +01:00
Devin Trejo	fad796931b	Add feature flag to enable discovery and use of public IPaddr for clustering. (#2719) * Add feature flag to enable discovery and use of public IPaddr for clustering. Before this change, Alertmanager would refuse to start up if using an advertise address binding to any address (0.0.0.0), and the host only had an interface with a public IP address. After this change we feature flag permitting the use of a discovered public address for cluster gossiping. Signed-off-by: Devin Trejo <dtrejo@palantir.com>	2021-11-10 17:40:48 +01:00
Dustin Hooten	ff85bec45b	Secure cluster traffic via mutual TLS (#2237) * Add TLS option to gossip cluster Co-authored-by: Sharad Gaur <sharadgaur@gmail.com> Signed-off-by: Dustin Hooten <dustinhooten@gmail.com> * generate new certs that expire in 100 years Signed-off-by: Dustin Hooten <dustinhooten@gmail.com> * Fix tls_connection attributes Signed-off-by: Dustin Hooten <dustinhooten@gmail.com> * Improve error message Signed-off-by: Dustin Hooten <dustinhooten@gmail.com> * Fix tls client config docs Signed-off-by: Dustin Hooten <dustinhooten@gmail.com> * Add capacity arg to message buffer Signed-off-by: Dustin Hooten <dustinhooten@gmail.com> * fix formatting Signed-off-by: Dustin Hooten <dustinhooten@gmail.com> * Update version; add version validation Signed-off-by: Dustin Hooten <dustinhooten@gmail.com> * use lru cache for connection pool Signed-off-by: Dustin Hooten <dustinhooten@gmail.com> * lock reading from the connection Signed-off-by: Dustin Hooten <dustinhooten@gmail.com> * when extracting net.Conn from tlsConn, lock and throw away wrapper Signed-off-by: Dustin Hooten <dustinhooten@gmail.com> * Add mutex to connection pool to protect cache Signed-off-by: Dustin Hooten <dustinhooten@gmail.com> * fix linting Signed-off-by: Dustin Hooten <dustinhooten@gmail.com> Co-authored-by: Sharad Gaur <sharadgaur@gmail.com>	2021-08-09 14:58:06 -06:00
Julien Pivotto	3a9808c3f7	Fix main tests (#2670) Signed-off-by: Julien Pivotto <roidelapluie@inuits.eu>	2021-08-04 16:13:51 +02:00
Julien Pivotto	20a1f8fd3f	Merge pull request #2433 from sylr/fix-test Fix test not waiting for cluster member to be ready	2021-08-04 13:57:26 +02:00
Julien Pivotto	b2a4cacb95	Update go dependencies & switch to go-kit/log Signed-off-by: Julien Pivotto <roidelapluie@inuits.eu>	2021-08-02 12:43:23 +02:00
Steve Simpson	1711e72d1b	Clustering: Change WaitReady to accept a Context. WaitReady is a blocking call and so should accept a Context in order to be responsive to cancellation of the notification pipeline for any reason. Signed-off-by: Steve Simpson <steve.simpson@grafana.com>	2021-03-10 09:18:39 +01:00
Sylvain Rabot	f4c7eb54aa	Fix test not waiting for cluster member to be ready Signed-off-by: Sylvain Rabot <sylvain@abstraction.fr>	2020-12-10 16:16:54 +01:00
Sylvain Rabot	21e99dcb63	Fix TestClusterJoinAndReconnect on macos (#2110) Signed-off-by: Sylvain Rabot <s.rabot@lectra.com>	2019-11-21 14:17:24 +01:00
Simon Pasquier	13d71e58fa	cluster: skip tests when no private ip address exists (#1470) The memberlist library will fail to set up the cluster when the machine has no private IP address. Signed-off-by: Simon Pasquier <spasquie@redhat.com>	2018-08-22 17:40:07 +02:00
Corentin Chary	42ea9a565b	cluster: make sure we don't miss the first pushPull (#1456) * cluster: make sure we don't miss the first pushPull During the join, memberlist initiates a pushPull to get initial data. Unfortunately, at this point the nflog and silence listener have not been registered yet, so the first data arrives only after one pushPull cycle (1min by default!). Signed-off-by: Corentin Chary <c.chary@criteo.com>	2018-07-09 11:16:04 +02:00
Simon Pasquier	8034f137e1	cluster: don't track FQDN addresses as initial peers (#1416) Signed-off-by: Simon Pasquier <spasquie@redhat.com>	2018-06-15 12:34:50 +02:00
stuart nelson	6305229fcc	fix set initial failed peers (#1407) * Correctly add Node to initially failed peer Reconnect attempts to failed peers were panicking because peer.Address() would attempt to access the nil Node struct member. Signed-off-by: stuart nelson <stuartnelson3@gmail.com> * Correctly remove old peers Again, since we aren't assigning a name (this is generated) we rely on the node's Address for removing the initially joining (and potentially later re-joining) peers Signed-off-by: stuart nelson <stuartnelson3@gmail.com> * Test the peerJoin removes initial peers Signed-off-by: stuart nelson <stuartnelson3@gmail.com> * Don't add self to failing peers list The initially failing peers list shouldn't include the bindAddr for the alertmanager itself, as this connection is never made, and consequently only removed from the failedPeers list after the failed peer timeout. Signed-off-by: stuart nelson <stuartnelson3@gmail.com> * Filter initialFailed with advertise addr This may differ from bindAddr, and is the value we want to not attempt to connect to. Signed-off-by: stuart nelson <stuartnelson3@gmail.com>	2018-06-08 12:34:52 +02:00
stuart nelson	db4af95ea0	memberlist reconnect (#1384) * initial impl Signed-off-by: stuart nelson <stuartnelson3@gmail.com> * Add reconnectTimeout Signed-off-by: stuart nelson <stuartnelson3@gmail.com> * Fix locking Signed-off-by: stuart nelson <stuartnelson3@gmail.com> * Remove unused PeerStatuses Signed-off-by: stuart nelson <stuartnelson3@gmail.com> * Add metrics Signed-off-by: stuart nelson <stuartnelson3@gmail.com> * Actually use peerJoinCounter Signed-off-by: stuart nelson <stuartnelson3@gmail.com> * Cleanup peers map on peer timeout Signed-off-by: stuart nelson <stuartnelson3@gmail.com> * Add reconnect test Signed-off-by: stuart nelson <stuartnelson3@gmail.com> * test removing failed peers Signed-off-by: stuart nelson <stuartnelson3@gmail.com> * Use peer address as map key If a peer is restarted, it will rejoin with the same IP but different ULID. So the node will rejoin the cluster, but its peers will never remove it from their internal list of failed nodes because its ULID has changed. Signed-off-by: stuart nelson <stuartnelson3@gmail.com> * Add failed peers from creation Signed-off-by: stuart nelson <stuartnelson3@gmail.com> * Remove warnIfAlone() Signed-off-by: stuart nelson <stuartnelson3@gmail.com> * Update metric names Signed-off-by: stuart nelson <stuartnelson3@gmail.com> * Address comments Signed-off-by: stuart nelson <stuartnelson3@gmail.com>	2018-06-05 14:28:49 +02:00
rhysm	e4416bd612	Add additional cluster configuration flags (#1379) The cluster configuration uses DefaultLANConfig which seems to be quite sensitive to WAN conditions. Allowing the tuning of these 3 parameters (TCP Timeout, Probe Interval and Probe Timeout) makes clustering more robust across WAN connections. Signed-off-by: Rhys Meaclem <rhysmeaclem@gmail.com>	2018-05-14 09:22:04 +02:00
Simon Pasquier	d0b664b618	cluster: gofmt code Signed-off-by: Simon Pasquier <spasquie@redhat.com>	2018-04-10 12:06:23 +02:00
Corentin Chary	dd75201f1c	Add /-/ready based on mesh status (#1209) * Wait for the gossip to settle before sending notifications See #1209 for details. As a heuristic for mesh readiness, try to see if the mesh looks stable (the number of peers isn't changing too much). This implementation always marks the Alertmanager as ready after a maximum of 60s. This adds one new flag to control this behavior: ``` --cluster.settle-timeout=60s mesh settling timeout. Do not wait more than this duration on startup. ``` It also adds `/-/ready` which always returns 200 (in order to make it clear that we are ready as soon as we can receive requests). The mesh status is exposed in `/api/v1/status` and visible on `/#/status`. * cluster: fix typos and base interval on gossipInterval	2018-03-02 15:45:21 +01:00

17 Commits