Commit Graph

17 Commits

Author SHA1 Message Date
Matthias Loibl a6d10bd5bc
Update golangci-lint and fix complaints (#2853)
* Copy latest golangci-lint files from Prometheus

Signed-off-by: Matthias Loibl <mail@matthiasloibl.com>

* Use grafana/regexp over stdlib regexp

Signed-off-by: Matthias Loibl <mail@matthiasloibl.com>

* Fix typos in comments

Signed-off-by: Matthias Loibl <mail@matthiasloibl.com>

* Fix goimports complains in import sorting

Signed-off-by: Matthias Loibl <mail@matthiasloibl.com>

* gofumpt all Go files

Signed-off-by: Matthias Loibl <mail@matthiasloibl.com>

* Update naming to comply with revive linter

Signed-off-by: Matthias Loibl <mail@matthiasloibl.com>

* config: Fix error messages to be lower case

Signed-off-by: Matthias Loibl <mail@matthiasloibl.com>

* test/cli: Fix error messages to be lower case

Signed-off-by: Matthias Loibl <mail@matthiasloibl.com>

* .golangci.yaml: Remove obsolete space

Signed-off-by: Matthias Loibl <mail@matthiasloibl.com>

* config: Fix expected victorOps error

Signed-off-by: Matthias Loibl <mail@matthiasloibl.com>

* Use stdlib regexp

Signed-off-by: Matthias Loibl <mail@matthiasloibl.com>

* Clean up Go modules

Signed-off-by: Matthias Loibl <mail@matthiasloibl.com>
2022-03-25 17:59:51 +01:00
Devin Trejo fad796931b
Add feature flag to enable discovery and use of public IPaddr for clustering. (#2719)
* Add feature flag to enable discovery and use of public IPaddr for clustering.

Before this change, Alertmanager would refuse to startup if using a
advertise address binding to any address (0.0.0.0), and the host only
had an interface with a public IP address. After this change we feature
flag permitting the use of a discovered public address for cluster
gossiping.

Signed-off-by: Devin Trejo <dtrejo@palantir.com>
2021-11-10 17:40:48 +01:00
Dustin Hooten ff85bec45b
Secure cluster traffic via mutual TLS (#2237)
* Add TLS option to gossip cluster

Co-authored-by: Sharad Gaur <sharadgaur@gmail.com>
Signed-off-by: Dustin Hooten <dustinhooten@gmail.com>

* generate new certs that expire in 100 years

Signed-off-by: Dustin Hooten <dustinhooten@gmail.com>

* Fix tls_connection attributes

Signed-off-by: Dustin Hooten <dustinhooten@gmail.com>

* Improve error message

Signed-off-by: Dustin Hooten <dustinhooten@gmail.com>

* Fix tls client config docs

Signed-off-by: Dustin Hooten <dustinhooten@gmail.com>

* Add capacity arg to message buffer

Signed-off-by: Dustin Hooten <dustinhooten@gmail.com>

* fix formatting

Signed-off-by: Dustin Hooten <dustinhooten@gmail.com>

* Update version; add version validation

Signed-off-by: Dustin Hooten <dustinhooten@gmail.com>

* use lru cache for connection pool

Signed-off-by: Dustin Hooten <dustinhooten@gmail.com>

* lock reading from the connection

Signed-off-by: Dustin Hooten <dustinhooten@gmail.com>

* when extracting net.Conn from tlsConn, lock and throw away wrapper

Signed-off-by: Dustin Hooten <dustinhooten@gmail.com>

* Add mutex to connection pool to protect cache

Signed-off-by: Dustin Hooten <dustinhooten@gmail.com>

* fix linting

Signed-off-by: Dustin Hooten <dustinhooten@gmail.com>

Co-authored-by: Sharad Gaur <sharadgaur@gmail.com>
2021-08-09 14:58:06 -06:00
Julien Pivotto 3a9808c3f7
Fix main tests (#2670)
Signed-off-by: Julien Pivotto <roidelapluie@inuits.eu>
2021-08-04 16:13:51 +02:00
Julien Pivotto 20a1f8fd3f
Merge pull request #2433 from sylr/fix-test
Fix test not waiting for cluster member to be ready
2021-08-04 13:57:26 +02:00
Julien Pivotto b2a4cacb95 Update go dependencies & switch to go-kit/log
Signed-off-by: Julien Pivotto <roidelapluie@inuits.eu>
2021-08-02 12:43:23 +02:00
Steve Simpson 1711e72d1b Clustering: Change WaitReady to accept a Context.
WaitReady is a blocking call and so should accept a Context in order to
be responsive to cancellation of the notification pipeline for any reason.

Signed-off-by: Steve Simpson <steve.simpson@grafana.com>
2021-03-10 09:18:39 +01:00
Sylvain Rabot f4c7eb54aa
Fix test not waiting for cluster member to be ready
Signed-off-by: Sylvain Rabot <sylvain@abstraction.fr>
2020-12-10 16:16:54 +01:00
Sylvain Rabot 21e99dcb63 Fix TestClusterJoinAndReconnect on macos (#2110)
Signed-off-by: Sylvain Rabot <s.rabot@lectra.com>
2019-11-21 14:17:24 +01:00
Simon Pasquier 13d71e58fa cluster: skip tests when no private ip address exists (#1470)
The memberlist library will fail to setup the cluster when the machine
has no private IP address.

Signed-off-by: Simon Pasquier <spasquie@redhat.com>
2018-08-22 17:40:07 +02:00
Corentin Chary 42ea9a565b cluster: make sure we don't miss the first pushPull (#1456)
* cluster: make sure we don't miss the first pushPull

During the join, memberlist initiates a pushPull to get initial data.
Unfortunately, at this point the nflog and silence listener have not
been registered yet, so the first data arrives only after one pushPull
cycle (1min by default !).

Signed-off-by: Corentin Chary <c.chary@criteo.com>
2018-07-09 11:16:04 +02:00
Simon Pasquier 8034f137e1 cluster: don't track FQDN addresses as inital peers (#1416)
Signed-off-by: Simon Pasquier <spasquie@redhat.com>
2018-06-15 12:34:50 +02:00
stuart nelson 6305229fcc
fix set initial failed peers (#1407)
* Correctly add Node to initially failed peer

Reconnect attempts to failed peers were panicking
because peer.Address() would attempt to access the
nil Node struct member.

Signed-off-by: stuart nelson <stuartnelson3@gmail.com>

* Correctly remove old peers

Again, since we aren't assigning a name (this is
generated) we rely on the node's Address for
removing the initially joining (and potentially
later re-joining) peers

Signed-off-by: stuart nelson <stuartnelson3@gmail.com>

* Test the peerJoin removes initial peers

Signed-off-by: stuart nelson <stuartnelson3@gmail.com>

* Don't add self to failing peers list

The initially failing peers list shouldn't include
the bindAddr for the alertmanager itself, as this
connection is never made, and consequently only
removed from the failedPeers list after the failed
peer timeout.

Signed-off-by: stuart nelson <stuartnelson3@gmail.com>

* Filter initialFailed with advertise addr

This may differ from bindAddr, and is the value we
want to not attempt to connect to.

Signed-off-by: stuart nelson <stuartnelson3@gmail.com>
2018-06-08 12:34:52 +02:00
stuart nelson db4af95ea0
memberlist reconnect (#1384)
* initial impl

Signed-off-by: stuart nelson <stuartnelson3@gmail.com>

* Add reconnectTimeout

Signed-off-by: stuart nelson <stuartnelson3@gmail.com>

* Fix locking

Signed-off-by: stuart nelson <stuartnelson3@gmail.com>

* Remove unused PeerStatuses

Signed-off-by: stuart nelson <stuartnelson3@gmail.com>

* Add metrics

Signed-off-by: stuart nelson <stuartnelson3@gmail.com>

* Actually use peerJoinCounter

Signed-off-by: stuart nelson <stuartnelson3@gmail.com>

* Cleanup peers map on peer timeout

Signed-off-by: stuart nelson <stuartnelson3@gmail.com>

* Add reconnect test

Signed-off-by: stuart nelson <stuartnelson3@gmail.com>

* test removing failed peers

Signed-off-by: stuart nelson <stuartnelson3@gmail.com>

* Use peer address as map key

If a peer is restarted, it will rejoin with the
same IP but different ULID. So the node will
rejoin the cluster, but its peers will never
remove it from their internal list of failed nodes
because its ULID has changed.

Signed-off-by: stuart nelson <stuartnelson3@gmail.com>

* Add failed peers from creation

Signed-off-by: stuart nelson <stuartnelson3@gmail.com>

* Remove warnIfAlone()

Signed-off-by: stuart nelson <stuartnelson3@gmail.com>

* Update metric names

Signed-off-by: stuart nelson <stuartnelson3@gmail.com>

* Address comments

Signed-off-by: stuart nelson <stuartnelson3@gmail.com>
2018-06-05 14:28:49 +02:00
rhysm e4416bd612 Add additional cluster configuration flags (#1379)
The cluster configuration uses DefaultLANConfig which seems
to be quite sensitive to WAN conditions. Allowing the tuning of these 3
parameters (TCP Timeout, Probe Interval and Probe Timeout) makes
clustering more robust across WAN connections.

Signed-off-by: Rhys Meaclem <rhysmeaclem@gmail.com>
2018-05-14 09:22:04 +02:00
Simon Pasquier d0b664b618 cluster: gofmt code
Signed-off-by: Simon Pasquier <spasquie@redhat.com>
2018-04-10 12:06:23 +02:00
Corentin Chary dd75201f1c Add /-/ready based on mesh status (#1209)
* Wait for the gossip to settle before sending notifications

See #1209 for details.

As an heuristic for mesh readyness, try to see if
the mesh looks stable (the number of peers isn't changing too much).
This implementation always mark the altermanager as ready after a maximum of 60s.

This adds one new flags to control this behavior:
```
      --cluster.settle-timeout=60s  mesh settling timeout. Do not wait more than this duration on startup.
```

It also adds `/-/ready` which always return 200 (in order to make it clear
that we are ready as soon as we can receive requests).

The mesh status is exposed in `/api/v1/status` and visible on `/#/status`.

* cluster: fix typos and base interval on gossipInterval
2018-03-02 15:45:21 +01:00