alertmanager

Commit Graph

Author	SHA1	Message	Date
Simon Pasquier	13d71e58fa	cluster: skip tests when no private ip address exists (#1470) The memberlist library will fail to set up the cluster when the machine has no private IP address. Signed-off-by: Simon Pasquier <spasquie@redhat.com>	2018-08-22 17:40:07 +02:00
Corentin Chary	42ea9a565b	cluster: make sure we don't miss the first pushPull (#1456) * cluster: make sure we don't miss the first pushPull During the join, memberlist initiates a pushPull to get initial data. Unfortunately, at this point the nflog and silence listeners have not been registered yet, so the first data arrives only after one pushPull cycle (1 min by default!). Signed-off-by: Corentin Chary <c.chary@criteo.com>	2018-07-09 11:16:04 +02:00
Simon Pasquier	8034f137e1	cluster: don't track FQDN addresses as initial peers (#1416) Signed-off-by: Simon Pasquier <spasquie@redhat.com>	2018-06-15 12:34:50 +02:00
stuart nelson	6305229fcc	fix set initial failed peers (#1407) * Correctly add Node to initially failed peer Reconnect attempts to failed peers were panicking because peer.Address() would attempt to access the nil Node struct member. Signed-off-by: stuart nelson <stuartnelson3@gmail.com> * Correctly remove old peers Again, since we aren't assigning a name (this is generated) we rely on the node's Address for removing the initially joining (and potentially later re-joining) peers Signed-off-by: stuart nelson <stuartnelson3@gmail.com> * Test the peerJoin removes initial peers Signed-off-by: stuart nelson <stuartnelson3@gmail.com> * Don't add self to failing peers list The initially failing peers list shouldn't include the bindAddr for the alertmanager itself, as this connection is never made, and consequently only removed from the failedPeers list after the failed peer timeout. Signed-off-by: stuart nelson <stuartnelson3@gmail.com> * Filter initialFailed with advertise addr This may differ from bindAddr, and is the value we want to not attempt to connect to. Signed-off-by: stuart nelson <stuartnelson3@gmail.com>	2018-06-08 12:34:52 +02:00
stuart nelson	db4af95ea0	memberlist reconnect (#1384) * initial impl Signed-off-by: stuart nelson <stuartnelson3@gmail.com> * Add reconnectTimeout Signed-off-by: stuart nelson <stuartnelson3@gmail.com> * Fix locking Signed-off-by: stuart nelson <stuartnelson3@gmail.com> * Remove unused PeerStatuses Signed-off-by: stuart nelson <stuartnelson3@gmail.com> * Add metrics Signed-off-by: stuart nelson <stuartnelson3@gmail.com> * Actually use peerJoinCounter Signed-off-by: stuart nelson <stuartnelson3@gmail.com> * Cleanup peers map on peer timeout Signed-off-by: stuart nelson <stuartnelson3@gmail.com> * Add reconnect test Signed-off-by: stuart nelson <stuartnelson3@gmail.com> * test removing failed peers Signed-off-by: stuart nelson <stuartnelson3@gmail.com> * Use peer address as map key If a peer is restarted, it will rejoin with the same IP but different ULID. So the node will rejoin the cluster, but its peers will never remove it from their internal list of failed nodes because its ULID has changed. Signed-off-by: stuart nelson <stuartnelson3@gmail.com> * Add failed peers from creation Signed-off-by: stuart nelson <stuartnelson3@gmail.com> * Remove warnIfAlone() Signed-off-by: stuart nelson <stuartnelson3@gmail.com> * Update metric names Signed-off-by: stuart nelson <stuartnelson3@gmail.com> * Address comments Signed-off-by: stuart nelson <stuartnelson3@gmail.com>	2018-06-05 14:28:49 +02:00
rhysm	e4416bd612	Add additional cluster configuration flags (#1379) The cluster configuration uses DefaultLANConfig, which seems to be quite sensitive to WAN conditions. Allowing the tuning of these 3 parameters (TCP Timeout, Probe Interval and Probe Timeout) makes clustering more robust across WAN connections. Signed-off-by: Rhys Meaclem <rhysmeaclem@gmail.com>	2018-05-14 09:22:04 +02:00
Simon Pasquier	d0b664b618	cluster: gofmt code Signed-off-by: Simon Pasquier <spasquie@redhat.com>	2018-04-10 12:06:23 +02:00
Corentin Chary	dd75201f1c	Add /-/ready based on mesh status (#1209) * Wait for the gossip to settle before sending notifications See #1209 for details. As a heuristic for mesh readiness, try to see if the mesh looks stable (the number of peers isn't changing too much). This implementation always marks the Alertmanager as ready after a maximum of 60s. This adds one new flag to control this behavior: ``` --cluster.settle-timeout=60s mesh settling timeout. Do not wait more than this duration on startup. ``` It also adds `/-/ready`, which always returns 200 (in order to make it clear that we are ready as soon as we can receive requests). The mesh status is exposed in `/api/v1/status` and visible on `/#/status`. * cluster: fix typos and base interval on gossipInterval	2018-03-02 15:45:21 +01:00

8 Commits
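
The settling heuristic described in dd75201f1c (treat the mesh as ready once the number of peers stops changing, and stop waiting after a maximum duration) can be sketched roughly as below. This is an illustrative sketch only, not the code from that commit: the function name `settlePoll`, the `stablePolls` threshold, and the use of a pre-recorded slice of peer counts in place of timed polling are all assumptions made for the example.

```go
package main

import "fmt"

// settlePoll is a hypothetical helper. peerCounts holds the number of peers
// observed at each poll. It returns the index of the first poll at which the
// count has been unchanged for stablePolls consecutive polls. If that does
// not happen within the first maxPolls observations, it returns maxPolls and
// true, modelling the "do not wait more than this duration" timeout.
func settlePoll(peerCounts []int, stablePolls, maxPolls int) (poll int, timedOut bool) {
	unchanged := 0
	for i := 0; i < len(peerCounts) && i < maxPolls; i++ {
		if i > 0 && peerCounts[i] == peerCounts[i-1] {
			unchanged++
		} else {
			// The peer count moved (or this is the first poll): start over.
			unchanged = 0
		}
		if unchanged >= stablePolls {
			return i, false
		}
	}
	return maxPolls, true
}

func main() {
	// Peers join during the first polls, then the count stays at 3.
	poll, timedOut := settlePoll([]int{1, 2, 3, 3, 3, 3}, 3, 10)
	fmt.Println(poll, timedOut) // 5 false

	// The count keeps flapping, so the timeout is what ends the wait.
	poll, timedOut = settlePoll([]int{1, 2, 1, 2, 1, 2}, 3, 4)
	fmt.Println(poll, timedOut) // 4 true
}
```

In a real process the slice would be replaced by periodic reads of the live peer count, and the timeout case would correspond to the `--cluster.settle-timeout` value from the commit message.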