* Wait for the gossip to settle before sending notifications
See #1209 for details.
As an heuristic for mesh readyness, try to see if
the mesh looks stable (the number of peers isn't changing too much).
This implementation always mark the altermanager as ready after a maximum of 60s.
This adds one new flags to control this behavior:
```
--cluster.settle-timeout=60s mesh settling timeout. Do not wait more than this duration on startup.
```
It also adds `/-/ready` which always return 200 (in order to make it clear
that we are ready as soon as we can receive requests).
The mesh status is exposed in `/api/v1/status` and visible on `/#/status`.
* cluster: fix typos and base interval on gossipInterval
* cluster: gather alertmanager_peer_position all the time
This change moves the gathering of the alertmanager_peer_position metric
outside of the clusterWait() function so that the metric is computed
accurately even when no alerting group fires.
* cluster: add alertmanager_cluster_health_score metric
This metric is retrieved from the memberlist library.
TestStateMerge() was skipped because of a typo. Fixing the name revealed
that the test itself needed to be updated following the switch to the
memberlist library.
When the API receives alerts where StartsAt is zero, it updates the
value to EndsAt (if not zero itself) or "now". This ensures that the
alert validation will not fail since StartsAt has to be less than or
equal to EndsAt.
See #1223, looks like OpsGenie now sometimes returns a 422 when you
don't specify a team. This change cleans up the JSON output and
add a few unit tests.
* Add mesh metrics
This change adds 2 new metrics for the mesh:
* alertmanager_peer_connection, state of the connection between the
Alertmanager instance and a peer.
* alertmanager_peer_terminations_total, total number of terminated
connection.
It also moves the gathering of the alertmanager_peer_position metric
outside of the meshWait() function so that the metric is computed
accurately even when no alerting group fires.
* Remove 'nick' label from alertmanager_peer_connection metric
After the initial notification has been sent, AlertManager shouldn't notify the
receiver again when no new alerts have been added to the group during
group_interval.
This change also modifies the acceptance test framework to assert that no
notification has been received in a given interval.