WaitReady is a blocking call and so should accept a Context in order to
be responsive to cancellation of the notification pipeline for any reason.
Signed-off-by: Steve Simpson <steve.simpson@grafana.com>
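A minimal sketch (not the actual Alertmanager code; the Peer type and readyc channel are assumptions) of a blocking WaitReady that accepts a Context, so cancellation of the notification pipeline unblocks the caller instead of leaving it stuck:
```go
package main

import (
	"context"
	"fmt"
	"time"
)

// Peer is a stand-in for the cluster peer; readyc is assumed to be closed
// once the peer considers itself ready.
type Peer struct {
	readyc chan struct{}
}

// WaitReady blocks until the peer is ready or the context is cancelled.
func (p *Peer) WaitReady(ctx context.Context) error {
	select {
	case <-ctx.Done():
		return ctx.Err() // pipeline cancelled: propagate the reason instead of blocking forever
	case <-p.readyc:
		return nil
	}
}

func main() {
	p := &Peer{readyc: make(chan struct{})}
	ctx, cancel := context.WithTimeout(context.Background(), 100*time.Millisecond)
	defer cancel()
	fmt.Println(p.WaitReady(ctx)) // prints the context error once the timeout fires
}
```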
* cluster: make sure we don't miss the first pushPull
During the join, memberlist initiates a pushPull to get initial data.
Unfortunately, at this point the nflog and silence listeners have not
been registered yet, so the first data arrives only after one pushPull
cycle (one minute by default).
Signed-off-by: Corentin Chary <c.chary@criteo.com>
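A minimal sketch, using hashicorp/memberlist directly rather than the Alertmanager cluster package, of why the ordering matters: the delegate that merges remote state must be wired up before Join, otherwise the data carried by the initial pushPull has nowhere to go and only shows up on the next cycle.
```go
package main

import (
	"log"

	"github.com/hashicorp/memberlist"
)

// state implements memberlist.Delegate so remote state pushed during a
// pushPull can actually be merged somewhere.
type state struct{ data []byte }

func (s *state) NodeMeta(limit int) []byte                  { return nil }
func (s *state) NotifyMsg(b []byte)                         {}
func (s *state) GetBroadcasts(overhead, limit int) [][]byte { return nil }
func (s *state) LocalState(join bool) []byte                { return s.data }
func (s *state) MergeRemoteState(buf []byte, join bool)     { s.data = buf }

func main() {
	cfg := memberlist.DefaultLANConfig()
	cfg.Delegate = &state{} // register state handling *before* joining

	ml, err := memberlist.Create(cfg)
	if err != nil {
		log.Fatal(err)
	}
	// Join triggers the initial pushPull; with the delegate already in
	// place, MergeRemoteState sees that first batch of data.
	if _, err := ml.Join([]string{"peer-1:7946"}); err != nil {
		log.Println("initial join failed:", err)
	}
}
```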
* Correctly add Node to initially failed peer
Reconnect attempts to failed peers were panicking
because peer.Address() would attempt to access the
nil Node struct member.
Signed-off-by: stuart nelson <stuartnelson3@gmail.com>
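A minimal sketch (the peer wrapper and newFailedPeer helper are hypothetical) of the panic and the fix: populate the embedded memberlist Node when recording an initially failed peer, so Address() never dereferences nil during reconnect attempts.
```go
package main

import (
	"fmt"
	"net"
	"strconv"

	"github.com/hashicorp/memberlist"
)

// peer wraps the memberlist node; for initially failed peers the Node used
// to be left nil, so calling Address() on them panicked.
type peer struct {
	*memberlist.Node
}

// Address derives host:port from the embedded Node.
func (p peer) Address() string {
	return net.JoinHostPort(p.Addr.String(), strconv.Itoa(int(p.Port)))
}

// newFailedPeer records a failed peer with its Node populated from the
// known address instead of leaving it nil.
func newFailedPeer(addr string) (peer, error) {
	host, portStr, err := net.SplitHostPort(addr)
	if err != nil {
		return peer{}, err
	}
	port, err := strconv.Atoi(portStr)
	if err != nil {
		return peer{}, err
	}
	return peer{Node: &memberlist.Node{Addr: net.ParseIP(host), Port: uint16(port)}}, nil
}

func main() {
	p, _ := newFailedPeer("10.0.0.2:9094")
	fmt.Println(p.Address()) // safe: the Node is never nil
}
```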
* Correctly remove old peers
Again, since we aren't assigning a name (it is generated), we rely on
the node's Address for removing the initially joining (and potentially
later re-joining) peers.
Signed-off-by: stuart nelson <stuartnelson3@gmail.com>
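A minimal sketch (hypothetical failedPeer type and helper) of removing a (re)joining peer from the failed list by its address rather than its generated name:
```go
package main

import "fmt"

// failedPeer is a stand-in for a tracked failed peer; names are generated,
// so only the address is a stable identifier.
type failedPeer struct {
	address string
}

// removeByAddress drops the joining peer from the failed list by matching
// its address.
func removeByAddress(failed []failedPeer, addr string) []failedPeer {
	kept := failed[:0]
	for _, p := range failed {
		if p.address != addr {
			kept = append(kept, p)
		}
	}
	return kept
}

func main() {
	failed := []failedPeer{{"10.0.0.2:9094"}, {"10.0.0.3:9094"}}
	fmt.Println(removeByAddress(failed, "10.0.0.2:9094")) // [{10.0.0.3:9094}]
}
```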
* Test that peerJoin removes initial peers
Signed-off-by: stuart nelson <stuartnelson3@gmail.com>
* Don't add self to failing peers list
The initially failing peers list shouldn't include the bindAddr of the
alertmanager itself: that connection is never made, so its entry would
only be removed from the failedPeers list after the failed-peer timeout.
Signed-off-by: stuart nelson <stuartnelson3@gmail.com>
* Filter initialFailed with advertise addr
This may differ from bindAddr, and it is the address we must avoid
attempting to connect to.
Signed-off-by: stuart nelson <stuartnelson3@gmail.com>
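A minimal sketch (hypothetical filterSelf helper) covering the last two fixes: drop the local instance from the initial peer list by comparing against the advertise address, which is what other peers actually dial and may differ from bindAddr:
```go
package main

import "fmt"

// filterSelf drops our own advertise address from the list of peers to
// join, so it never ends up in the initially-failed set.
func filterSelf(peers []string, advertiseAddr string) []string {
	out := peers[:0]
	for _, p := range peers {
		if p == advertiseAddr {
			continue // never attempt to connect to ourselves
		}
		out = append(out, p)
	}
	return out
}

func main() {
	peers := []string{"10.0.0.1:9094", "10.0.0.2:9094"}
	// bindAddr might be 0.0.0.0:9094, but 10.0.0.1:9094 is what peers dial.
	fmt.Println(filterSelf(peers, "10.0.0.1:9094")) // [10.0.0.2:9094]
}
```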
* Initial implementation
Signed-off-by: stuart nelson <stuartnelson3@gmail.com>
* Add reconnectTimeout
Signed-off-by: stuart nelson <stuartnelson3@gmail.com>
* Fix locking
Signed-off-by: stuart nelson <stuartnelson3@gmail.com>
* Remove unused PeerStatuses
Signed-off-by: stuart nelson <stuartnelson3@gmail.com>
* Add metrics
Signed-off-by: stuart nelson <stuartnelson3@gmail.com>
* Actually use peerJoinCounter
Signed-off-by: stuart nelson <stuartnelson3@gmail.com>
* Cleanup peers map on peer timeout
Signed-off-by: stuart nelson <stuartnelson3@gmail.com>
* Add reconnect test
Signed-off-by: stuart nelson <stuartnelson3@gmail.com>
* Test removing failed peers
Signed-off-by: stuart nelson <stuartnelson3@gmail.com>
* Use peer address as map key
If a peer is restarted, it will rejoin with the same IP but a different
ULID. So the node will rejoin the cluster, but its peers will never
remove it from their internal list of failed nodes, because its ULID
has changed.
Signed-off-by: stuart nelson <stuartnelson3@gmail.com>
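A minimal sketch (hypothetical peers map and placeholder ULIDs) of keying cluster peers by address instead of ULID, so a peer that restarts with the same IP but a new ULID still matches its old entry and can be cleared from the failed set:
```go
package main

import "fmt"

type peerState struct {
	ulid   string
	failed bool
}

func main() {
	// Keyed by host:port rather than by the peer's ULID, which is
	// regenerated on every restart.
	peers := map[string]peerState{
		"10.0.0.2:9094": {ulid: "ulid-a", failed: true},
	}

	// The peer restarts and rejoins from the same address with a new ULID;
	// the address key still finds (and clears) the old failed entry.
	rejoinAddr, newULID := "10.0.0.2:9094", "ulid-b"
	if _, ok := peers[rejoinAddr]; ok {
		peers[rejoinAddr] = peerState{ulid: newULID, failed: false}
	}
	fmt.Println(peers[rejoinAddr]) // {ulid-b false}
}
```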
* Add failed peers from creation
Signed-off-by: stuart nelson <stuartnelson3@gmail.com>
* Remove warnIfAlone()
Signed-off-by: stuart nelson <stuartnelson3@gmail.com>
* Update metric names
Signed-off-by: stuart nelson <stuartnelson3@gmail.com>
* Address comments
Signed-off-by: stuart nelson <stuartnelson3@gmail.com>
The cluster configuration uses DefaultLANConfig, which seems to be
quite sensitive to WAN conditions. Allowing these three parameters
(TCP Timeout, Probe Interval, and Probe Timeout) to be tuned makes
clustering more robust across WAN connections.
Signed-off-by: Rhys Meaclem <rhysmeaclem@gmail.com>
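A minimal sketch, using hashicorp/memberlist, of starting from DefaultLANConfig and relaxing those three parameters; the durations are illustrative, not recommended values:
```go
package main

import (
	"fmt"
	"time"

	"github.com/hashicorp/memberlist"
)

// wanFriendlyConfig starts from DefaultLANConfig and relaxes the three
// timing parameters so slow or lossy WAN links don't get peers marked dead.
func wanFriendlyConfig() *memberlist.Config {
	cfg := memberlist.DefaultLANConfig()
	cfg.TCPTimeout = 30 * time.Second   // full state push/pull over slow links
	cfg.ProbeInterval = 5 * time.Second // probe peers less aggressively
	cfg.ProbeTimeout = 3 * time.Second  // tolerate higher round-trip times
	return cfg
}

func main() {
	cfg := wanFriendlyConfig()
	fmt.Println(cfg.TCPTimeout, cfg.ProbeInterval, cfg.ProbeTimeout)
}
```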
* Wait for the gossip to settle before sending notifications
See #1209 for details.
As a heuristic for mesh readiness, try to see whether the mesh looks
stable (the number of peers isn't changing too much).
This implementation always marks the alertmanager as ready after a maximum of 60s.
This adds one new flag to control this behavior:
```
--cluster.settle-timeout=60s mesh settling timeout. Do not wait more than this duration on startup.
```
It also adds `/-/ready`, which always returns 200 (to make it clear
that we are ready as soon as we can receive requests).
The mesh status is exposed in `/api/v1/status` and visible on `/#/status`.
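A minimal sketch (not the actual implementation; names and intervals are assumptions) of the settling heuristic: poll the peer count and report ready once it stops changing for a few consecutive checks, or when the settle timeout expires.
```go
package main

import (
	"context"
	"time"
)

// waitSettled polls the cluster's peer count and returns once it has been
// stable for stableChecks consecutive intervals, or when ctx expires.
func waitSettled(ctx context.Context, numPeers func() int, interval time.Duration, stableChecks int) {
	prev, stable := -1, 0
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return // settle timeout or cancellation: declare ready anyway
		case <-ticker.C:
			if n := numPeers(); n == prev {
				stable++
			} else {
				prev, stable = n, 0
			}
			if stable >= stableChecks {
				return // peer count stopped changing: consider the mesh settled
			}
		}
	}
}

func main() {
	// Never block startup for more than the settle timeout (60s here).
	ctx, cancel := context.WithTimeout(context.Background(), 60*time.Second)
	defer cancel()
	waitSettled(ctx, func() int { return 1 }, 200*time.Millisecond, 3)
}
```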
* cluster: fix typos and base interval on gossipInterval