* [silences] Don't merge expired silences
If they're expired, they should be cleaned up on
the next GC cycle, but merging them in means that
they'll probably be gossip'd continually between
the cluster members.
Signed-off-by: stuart nelson <stuartnelson3@gmail.com>
* Add analogous behavior+test for nflog
The code for nflog was also constantly re-adding
nflogs to the internal memory store, the same as
the silence code was.
Signed-off-by: stuart nelson <stuartnelson3@gmail.com>
* Add retention to TestQuery
With the default 0 retention, the alerts would not
be merged.
Signed-off-by: Stuart Nelson <stuartnelson3@gmail.com>
The current Alertmanager API v1 is undocumented and written by hand.
This patch introduces a new Alertmanager API - v2. The API is fully
generated via an OpenAPI 2.0 [1] specification (see
`api/v2/openapi.yaml`) with the exception of the http handlers itself.
Pros:
- Generated server code
- Ability to generate clients in all major languages
(Go, Java, JS, Python, Ruby, Haskell, *elm* [3] ...)
- Strict contract (OpenAPI spec) between server and clients.
- Instant feedback on frontend-breaking changes, due to strictly
typed frontend language elm.
- Generated documentation (See Alertmanager online Swagger UI [4])
Cons:
- Dependency on open api ecosystem including go-swagger [2]
In addition this patch includes the following changes.
- README.md: Add API section
- test: Duplicate acceptance test to API v1 & API v2 version
The Alertmanager acceptance test framework has a decent test coverage
on the Alertmanager API. Introducing the Alertmanager API v2 does not go
hand in hand with deprecating API v1. They should live alongside each
other for a couple of minor Alertmanager versions.
Instead of porting the acceptance test framework to use the new API v2,
this patch duplicates the acceptance tests, one using the API v1, the
other API v2.
Once API v1 is removed we can simply remove `test/with_api_v1` and bring
`test/with_api_v2` to `test/`.
[1]
https://github.com/OAI/OpenAPI-Specification/blob/master/versions/2.0.md
[2] https://github.com/go-swagger/go-swagger/
[3] https://github.com/ahultgren/swagger-elm
[4]
http://petstore.swagger.io/?url=https://raw.githubusercontent.com/mxinden/alertmanager/apiv2/api/v2/openapi.yaml
Signed-off-by: Max Leonard Inden <IndenML@gmail.com>
* Gossip large messages via SendReliable
For messages beyond half of the maximum gossip
packet size, send the message to all peer nodes
via TCP.
The choice of "larger than half the max gossip
size" is relatively arbitrary. From brief testing,
the overhead from memberlist on a packet seemed to
only use ~3 of the available 1400 bytes, and most
gossip messages seem to be <<500 bytes.
* Add tests for oversized/normal message gossiping
* Make oversize metric names consistent
* Remove errant printf in test
* Correctly increment WaitGroup
* Add comment for OversizedMessage func
* Add metric for oversized messages dropped
Code was added to drop oversized messages if the
buffered channel they are sent on is full. This
is a good thing to surface as a metric.
* Add counter for total oversized messages sent
* Change full queue log level to debug
Was previously a warning, which isn't necessary
now that there is a metric tracking it.
Signed-off-by: stuart nelson <stuartnelson3@gmail.com>
* Peers further propagate newly received nflogs
If a peer receives an nflog that it hasn't seen
before, queue the message and propagate it further
to other peers. This should ensure that all
peers within a cluster receive all gossip
messages.
Signed-off-by: stuart nelson <stuartnelson3@gmail.com>
* Set Retransmit value based on number of members
For alertmanagers that are brought up with a list
of peers, set the number of message retransmits to
be half of that number. If there are no peers on
start, or there are few, continue to use the
default value of 3.
Signed-off-by: stuart nelson <stuartnelson3@gmail.com>
* [nflog] Move retransmit calculation
Signed-off-by: stuart nelson <stuartnelson3@gmail.com>
* [silence] further gossip silence messages
Signed-off-by: stuart nelson <stuartnelson3@gmail.com>
* Set GossipNodes to equal RetransmitMulti
During a gossip, we send messages to at most
GossipNodes nodes. If possible, we only a message
to reach all nodes as soon as possible.
Signed-off-by: stuart nelson <stuartnelson3@gmail.com>
* Fix rebase
Signed-off-by: stuart nelson <stuartnelson3@gmail.com>
* cluster: add alertmanager_cluster_messages_queued metric
* cluster: add metrics for sent messages
This change adds 2 new metrics:
- alertmanager_cluster_messages_sent_total
- alertmanager_cluster_messages_sent_size_total
* Fix marshaling for entries being broadcast
Individual notifications logs and silences being broadcast to the other
peers need to be encoded using the same length-delimited format as when
doing full-state synchronization.
* main: fix argument order for cluster.Join()
cluster.Join() was called with the push/pull and gossip interval
parameters being swapped one for another.
TestStateMerge() was skipped because of a typo. Fixing the name revealed
that the test itself needed to be updated following the switch to the
memberlist library.
* Log snapshot sizes on maintenance
* Add metrics for snapshot sizes
This change adds 2 new gauges for tracking the last snapshots' sizes:
- alertmanager_nflog_snapshot_size_bytes
- alertmanager_silences_snapshot_size_bytes
This adds metrics that look like this:
```
alertmanager_alerts{state="active"} 6
alertmanager_alerts{state="suppressed"} 0
alertmanager_silences{state="active"} 1
alertmanager_silences{state="expired"} 1
alertmanager_silences{state="pending"} 0
```
This can be used to monitor alertmanager's usage and validate that
alertmanagers in a mesh have a similar number of silences and alerts.
* silences: avoid deadlock
Calling gossip.GossipBroadcast() will cause a deadlock if
there is a currently executing OnBroadcast* function.
See #982
* silence_test: better unit test to detect deadlocks
Building a hash over an entire set of alerts causes problems, because
the hash differs, on any change, whereas we only want to send
notifications if the alert and it's state have changed. Therefore this
introduces a list of alerts that are active and a list of alerts that
are resolved. If the currently active alerts of a group are a subset of
the ones that have been notified about before then they are
deduplicated. The resolved notifications work the same way, with a
separate list of resolved notifications that have already been sent.
This fixes#559 by removing concurrent map writes to the matcher cache.
The cache was guarded by the Silence's main lock, which only used a
read-lock on queries.
The cache's get methods lazily loads data into the cache and thus
causing concurrent writes.
We just change the main lock to always write-lock, as we don't expect
high lock contention at this point and would have it in a dedicated
cache lock anyway.
This commit adds an implementation of a silence storage that can
share store and modify silences, share state via a mesh network,
write and load snapshots, and be dynamically queried.
All data formats are based on protocol buffers.