Note that this does not stop exposing the classic
metrics; for now, it is up to the scrape config to
decide whether to keep those instead, or both.
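A rough sketch of the mechanism, assuming a client_golang version with
native-histogram support; the metric name here is purely illustrative.
With both Buckets and NativeHistogramBucketFactor set, the histogram is
exposed in both representations at once:

    package metrics

    import "github.com/prometheus/client_golang/prometheus"

    // Illustrative metric: the classic buckets and the native histogram
    // are exposed side by side; the scrape config decides which to keep.
    var requestDuration = prometheus.NewHistogram(prometheus.HistogramOpts{
        Name:                        "alertmanager_http_request_duration_seconds",
        Help:                        "Histogram of HTTP request latencies.",
        Buckets:                     prometheus.DefBuckets, // classic buckets
        NativeHistogramBucketFactor: 1.1,                   // enables the native histogram
    })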
Signed-off-by: György Krajcsovits <gyorgy.krajcsovits@grafana.com>
- alertmanager_cluster_alive_messages_total, total number of alive
messages received.
- alertmanager_cluster_peer_info, a constant metric labeled by peer name.
- alertmanager_cluster_pings_seconds, histogram of latencies for ping
messages.
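A sketch of how these could be registered with client_golang; the help
strings and ping buckets are assumptions:

    package cluster

    import "github.com/prometheus/client_golang/prometheus"

    func registerMetrics(reg prometheus.Registerer) {
        aliveMsgs := prometheus.NewCounter(prometheus.CounterOpts{
            Name: "alertmanager_cluster_alive_messages_total",
            Help: "Total number of received alive messages.",
        })
        peerInfo := prometheus.NewGaugeVec(prometheus.GaugeOpts{
            Name: "alertmanager_cluster_peer_info",
            Help: "A metric with a constant '1' value labeled by peer name.",
        }, []string{"peer"})
        pingDuration := prometheus.NewHistogram(prometheus.HistogramOpts{
            Name:    "alertmanager_cluster_pings_seconds",
            Help:    "Histogram of latencies for ping messages.",
            Buckets: []float64{.005, .01, .025, .05, .1, .25, .5},
        })
        reg.MustRegister(aliveMsgs, peerInfo, pingDuration)
    }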
Signed-off-by: Simon Pasquier <spasquie@redhat.com>
* cluster: make sure we don't miss the first pushPull
During the join, memberlist initiates a pushPull to get initial data.
Unfortunately, at this point the nflog and silence
listeners have not been registered yet, so the first
data arrives only after one full pushPull cycle
(1 minute by default!).
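A sketch of the ordering this implies inside the cluster package; the
AddState helper, the State type and the field names are assumptions:

    // Register the nflog and silence state listeners *before* joining, so
    // that the remote state exchanged in the join's initial pushPull is
    // merged immediately instead of one full cycle later.
    func (p *Peer) joinCluster(knownPeers []string, nflogState, silenceState State) error {
        p.AddState("nfl", nflogState)
        p.AddState("sil", silenceState)
        // memberlist initiates a pushPull during Join; the listeners above
        // are now in place to receive that first batch of data.
        _, err := p.mlist.Join(knownPeers)
        return err
    }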
Signed-off-by: Corentin Chary <c.chary@criteo.com>
* cluster: prune the queue if too large
Signed-off-by: Simon Pasquier <spasquie@redhat.com>
* Address review comments
Also increases the pruning interval to 15 minutes and the max queue size
to 4096 items (same value as used by Serf).
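A sketch of the pruning loop, assuming the broadcasts sit in a
memberlist.TransmitLimitedQueue held by the delegate:

    const (
        pruneInterval = 15 * time.Minute
        maxQueueSize  = 4096 // same value as used by Serf
    )

    func (d *delegate) pruneLoop() {
        t := time.NewTicker(pruneInterval)
        defer t.Stop()
        for range t.C {
            if d.bcast.NumQueued() > maxQueueSize {
                // Retain only the newest maxQueueSize broadcasts.
                d.bcast.Prune(maxQueueSize)
            }
        }
    }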
Signed-off-by: Simon Pasquier <spasquie@redhat.com>
* Gossip large messages via SendReliable
For messages larger than half of the maximum gossip
packet size, send the message to all peer nodes
via TCP.
The choice of "larger than half the max gossip
size" is relatively arbitrary. From brief testing,
the memberlist overhead on a packet seemed to use
only ~3 of the available 1400 bytes, and most
gossip messages seem to be well below 500 bytes.
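A sketch of the size check; the delegate fields and the broadcast type
are illustrative, while Members, LocalNode and SendReliable are
memberlist APIs:

    const maxGossipPacketSize = 1400

    func (d *delegate) send(b []byte) {
        if len(b) > maxGossipPacketSize/2 {
            // Too big for UDP gossip: push to every peer over TCP.
            for _, n := range d.mlist.Members() {
                if n.Name == d.mlist.LocalNode().Name {
                    continue // skip ourselves
                }
                if err := d.mlist.SendReliable(n, b); err != nil {
                    // Log and continue; one unreachable peer must not
                    // block the others.
                }
            }
            return
        }
        d.bcast.QueueBroadcast(simpleBroadcast(b)) // normal gossip path
    }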
* Add tests for oversized/normal message gossiping
* Make oversize metric names consistent
* Remove errant printf in test
* Correctly increment WaitGroup
* Add comment for OversizedMessage func
* Add metric for oversized messages dropped
Code was added to drop oversized messages when the
buffered channel they are sent on is full. This is
worth surfacing as a metric.
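The drop itself is a non-blocking channel send; a sketch, with the
channel and metric names as assumptions:

    select {
    case d.oversizedCh <- b: // hand off to the sender goroutine
    default:
        // Buffer is full: drop the message and count the drop rather
        // than blocking the gossip path.
        d.oversizedMessagesDropped.Inc()
    }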
* Add counter for total oversized messages sent
* Change full queue log level to debug
This was previously a warning, which isn't necessary
now that there is a metric tracking it.
Signed-off-by: stuart nelson <stuartnelson3@gmail.com>
* Peers further propagate newly received nflogs
If a peer receives an nflog that it hasn't seen
before, it queues the message and propagates it
further to other peers. This should ensure that
all peers within a cluster receive all gossip
messages.
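A sketch of the merge path; the types and helpers are illustrative
stand-ins for the nflog package:

    func (l *Log) Merge(b []byte) error {
        st, err := decodeState(bytes.NewReader(b))
        if err != nil {
            return err
        }
        l.mtx.Lock()
        defer l.mtx.Unlock()
        for _, e := range st {
            if merged := l.st.merge(e); merged {
                // The entry was new to this peer, so other peers may not
                // have seen it either: queue it for further gossip.
                l.broadcast(b)
            }
        }
        return nil
    }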
Signed-off-by: stuart nelson <stuartnelson3@gmail.com>
* Set Retransmit value based on number of members
For Alertmanagers that are brought up with a list
of peers, set the number of message retransmits to
half of that number. If there are no peers at
startup, or only a few, continue to use the
default value of 3.
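A sketch of the calculation and where it feeds into the memberlist
config; retransmitLimit is a hypothetical helper:

    // retransmitLimit is half the number of configured peers, but never
    // below the memberlist default of 3.
    func retransmitLimit(numPeers int) int {
        if limit := numPeers / 2; limit > 3 {
            return limit
        }
        return 3
    }

    // cfg := memberlist.DefaultLANConfig()
    // cfg.RetransmitMult = retransmitLimit(len(knownPeers))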
Signed-off-by: stuart nelson <stuartnelson3@gmail.com>
* [nflog] Move retransmit calculation
Signed-off-by: stuart nelson <stuartnelson3@gmail.com>
* [silence] further gossip silence messages
Signed-off-by: stuart nelson <stuartnelson3@gmail.com>
* Set GossipNodes to equal RetransmitMulti
During a gossip, we send messages to at most
GossipNodes nodes. Ideally, we want a message to
reach all nodes as soon as possible.
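Sketch of the config change, reusing the hypothetical retransmitLimit
helper from the earlier sketch:

    cfg := memberlist.DefaultLANConfig()
    cfg.RetransmitMult = retransmitLimit(len(knownPeers))
    cfg.GossipNodes = cfg.RetransmitMult // gossip to as many nodes as we retransmit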
Signed-off-by: stuart nelson <stuartnelson3@gmail.com>
* Fix rebase
Signed-off-by: stuart nelson <stuartnelson3@gmail.com>
* Initial implementation
Signed-off-by: stuart nelson <stuartnelson3@gmail.com>
* Add reconnectTimeout
Signed-off-by: stuart nelson <stuartnelson3@gmail.com>
* Fix locking
Signed-off-by: stuart nelson <stuartnelson3@gmail.com>
* Remove unused PeerStatuses
Signed-off-by: stuart nelson <stuartnelson3@gmail.com>
* Add metrics
Signed-off-by: stuart nelson <stuartnelson3@gmail.com>
* Actually use peerJoinCounter
Signed-off-by: stuart nelson <stuartnelson3@gmail.com>
* Cleanup peers map on peer timeout
Signed-off-by: stuart nelson <stuartnelson3@gmail.com>
* Add reconnect test
Signed-off-by: stuart nelson <stuartnelson3@gmail.com>
* test removing failed peers
Signed-off-by: stuart nelson <stuartnelson3@gmail.com>
* Use peer address as map key
If a peer is restarted, it rejoins with the same
IP but a different ULID. The node therefore rejoins
the cluster, but its peers never remove it from
their internal list of failed nodes because its
ULID has changed.
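A sketch of the bookkeeping keyed by address; the map and mutex fields
are assumptions:

    // failedPeers maps "host:port" -> the time the peer left. When a
    // restarted peer rejoins with the same address but a fresh ULID, the
    // address lookup still matches and the stale entry is cleared.
    func (p *Peer) peerJoin(n *memberlist.Node) {
        p.peersMtx.Lock()
        defer p.peersMtx.Unlock()
        delete(p.failedPeers, n.Address())
    }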
Signed-off-by: stuart nelson <stuartnelson3@gmail.com>
* Add failed peers from creation
Signed-off-by: stuart nelson <stuartnelson3@gmail.com>
* Remove warnIfAlone()
Signed-off-by: stuart nelson <stuartnelson3@gmail.com>
* Update metric names
Signed-off-by: stuart nelson <stuartnelson3@gmail.com>
* Address comments
Signed-off-by: stuart nelson <stuartnelson3@gmail.com>