* Add feature flag to enable discovery and use of public IPaddr for clustering.
Before this change, Alertmanager would refuse to startup if using a
advertise address binding to any address (0.0.0.0), and the host only
had an interface with a public IP address. After this change we feature
flag permitting the use of a discovered public address for cluster
gossiping.
Signed-off-by: Devin Trejo <dtrejo@palantir.com>
WaitReady is a blocking call and so should accept a Context in order to
be responsive to cancellation of the notification pipeline for any reason.
Signed-off-by: Steve Simpson <steve.simpson@grafana.com>
A Peer as defined by the `cluster` package represents the node in the
cluster. It is used in other packages to know the status of all of the
members or how long should we wait to know if a notification has already fired.
In Cortex, we'd like to implement a slightly different way of
clustering (using gRPC for communication and a
hash ring for node discovery).
This is a small change to support that by changing the consumer of other
packages to an interface.
Silences and Notification channels don't need an interface as they take
a `func([]byte) error` as a parameter.
Signed-off-by: gotjosh <josue@grafana.com>
- alertmanager_cluster_alive_messages_total, total number of alive
messages received.
- alertmanager_cluster_peer_info, a constant metric labeled by peer name.
- alertmanager_cluster_pings_seconds, histogram of latencies for ping
messages.
Signed-off-by: Simon Pasquier <spasquie@redhat.com>
This changes removes all usage of golang.org/x/net/context in the code
base. It also bumps a few dependencies for the same reason:
- github.com/gogo/protobuf
- go-openapi/*
Signed-off-by: Simon Pasquier <spasquie@redhat.com>
Adds a job which runs periodically and refreshes cluster.peer dns records.
The problem is that when you restart all of the alertmanager instances in an environment like Kubernetes, DNS may contain old alertmanager instance IPs, but on startup (when Join() happens) none of the new instance IPs. As at the start DNS is not empty resolvePeers waitIfEmpty=true, will return and "islands" of 1 alertmanager instances will form.
Signed-off-by: Povilas Versockas <p.versockas@gmail.com>
Alertmanager is exiting with a non-zero exit code if the initial cluster
join fails. This behavior could be not wanted because:
- As Alertmanager is a critical component with an at-least-once
guarantee, failing on joining the cluster is unnecessary as
Alertmanager still functions by itself.
- In an environment like Kubernetes discovering peers via DNS, peers
might roll out one-by-one, leaving the DNS entries unpopulated for the
first peer of a set. Failing on initial join prevents a roll-out.
Instead of failing on the initial join this patch only logs the failure.
The cluster can be later joined via the `handleReconnect`.
This is a regression introduced in PR #1456 [1].
[1] https://github.com/prometheus/alertmanager/pull/1456
Signed-off-by: Max Leonard Inden <IndenML@gmail.com>
* cluster: make sure we don't miss the first pushPull
During the join, memberlist initiates a pushPull to get initial data.
Unfortunately, at this point the nflog and silence listener have not
been registered yet, so the first data arrives only after one pushPull
cycle (1min by default !).
Signed-off-by: Corentin Chary <c.chary@criteo.com>
The memberlist library fails when it can't find a private address and no
advertise address is given. To return a helpful message to the user,
AlertManager mimics the logic from memberlist. However the code had a
bug that swallowed the error message and made it difficult for the user
to understand how to fix the problem.
Signed-off-by: Simon Pasquier <spasquie@redhat.com>
* cluster: prune the queue if too large
Signed-off-by: Simon Pasquier <spasquie@redhat.com>
* Address review comments
Also increases the pruning interval to 15 minutes and the max queue size
to 4096 items (same value as used by Serf).
Signed-off-by: Simon Pasquier <spasquie@redhat.com>
* Gossip large messages via SendReliable
For messages beyond half of the maximum gossip
packet size, send the message to all peer nodes
via TCP.
The choice of "larger than half the max gossip
size" is relatively arbitrary. From brief testing,
the overhead from memberlist on a packet seemed to
only use ~3 of the available 1400 bytes, and most
gossip messages seem to be <<500 bytes.
* Add tests for oversized/normal message gossiping
* Make oversize metric names consistent
* Remove errant printf in test
* Correctly increment WaitGroup
* Add comment for OversizedMessage func
* Add metric for oversized messages dropped
Code was added to drop oversized messages if the
buffered channel they are sent on is full. This
is a good thing to surface as a metric.
* Add counter for total oversized messages sent
* Change full queue log level to debug
Was previously a warning, which isn't necessary
now that there is a metric tracking it.
Signed-off-by: stuart nelson <stuartnelson3@gmail.com>