8 Commits

Author SHA1 Message Date
rhysm
e4416bd612 Add additional cluster configuration flags ()
The cluster configuration uses DefaultLANConfig which seems
to be quite sensitive to WAN conditions. Allowing the tuning of these 3
parameters (TCP Timeout, Probe Interval and Probe Timeout) makes
clustering more robust across WAN connections.

Signed-off-by: Rhys Meaclem <rhysmeaclem@gmail.com>
2018-05-14 09:22:04 +02:00
Simon Pasquier
1531aa66f3 Fix for ()
* cluster: add alertmanager_cluster_messages_queued metric

* cluster: add metrics for sent messages

This change adds 2 new metrics:

- alertmanager_cluster_messages_sent_total
- alertmanager_cluster_messages_sent_size_total

* Fix marshaling for entries being broadcast

Individual notifications logs and silences being broadcast to the other
peers need to be encoded using the same length-delimited format as when
doing full-state synchronization.

* main: fix argument order for cluster.Join()

cluster.Join() was called with the push/pull and gossip interval
parameters swapped.
2018-03-22 13:53:00 +01:00
Corentin Chary
dd75201f1c Add /-/ready based on mesh status ()
* Wait for the gossip to settle before sending notifications

See  for details.

As a heuristic for mesh readiness, try to see if
the mesh looks stable (the number of peers isn't changing too much).
This implementation always marks the Alertmanager as ready after a maximum of 60s.

This adds one new flag to control this behavior:
```
      --cluster.settle-timeout=60s  mesh settling timeout. Do not wait more than this duration on startup.
```

It also adds `/-/ready`, which always returns 200 (in order to make it clear
that we are ready as soon as we can receive requests).

The mesh status is exposed in `/api/v1/status` and visible on `/#/status`.

* cluster: fix typos and base interval on gossipInterval
2018-03-02 15:45:21 +01:00
pasquier-s
3df093968c cluster: gather alertmanager_peer_position all the time ()
* cluster: gather alertmanager_peer_position all the time

This change moves the gathering of the alertmanager_peer_position metric
outside of the clusterWait() function so that the metric is computed
accurately even when no alerting group fires.

* cluster: add alertmanager_cluster_health_score metric

This metric is retrieved from the memberlist library.
2018-02-27 10:37:56 +01:00
Simon Pasquier
f4c81c43e9 cluster: pass resolved peers to Join() 2018-02-13 16:53:09 +01:00
Fabian Reinartz
3f2e00fbea cluster/api: improve metrics and cluster status 2018-02-09 11:16:00 +01:00
Fabian Reinartz
247bfff606 cluster: remove MergeSingle 2018-02-09 11:06:51 +01:00
Fabian Reinartz
fd49dbb477 *: move to memberlist for clustering 2018-02-08 12:18:44 +01:00