alertmanager

mirror of https://github.com/prometheus/alertmanager synced 2024-12-28 00:52:13 +00:00

Author	SHA1	Message	Date
stuart nelson	2026e4a01f	[gossip] Don't merge expired gossip messages (#1631) * [silences] Don't merge expired silences If they're expired, they should be cleaned up on the next GC cycle, but merging them in means that they'll probably be gossip'd continually between the cluster members. Signed-off-by: stuart nelson <stuartnelson3@gmail.com> * Add analogous behavior+test for nflog The code for nflog was also constantly re-adding nflogs to the internal memory store, the same as the silence code was. Signed-off-by: stuart nelson <stuartnelson3@gmail.com> * Add retention to TestQuery With the default 0 retention, the alerts would not be merged. Signed-off-by: Stuart Nelson <stuartnelson3@gmail.com>	2018-11-21 11:40:57 +01:00
Max Leonard Inden	f1b920bcc9	api: Implement OpenAPI generated Alertmanager API V2 The current Alertmanager API v1 is undocumented and written by hand. This patch introduces a new Alertmanager API - v2. The API is fully generated via an OpenAPI 2.0 [1] specification (see `api/v2/openapi.yaml`) with the exception of the http handlers itself. Pros: - Generated server code - Ability to generate clients in all major languages (Go, Java, JS, Python, Ruby, Haskell, elm [3] ...) - Strict contract (OpenAPI spec) between server and clients. - Instant feedback on frontend-breaking changes, due to strictly typed frontend language elm. - Generated documentation (See Alertmanager online Swagger UI [4]) Cons: - Dependency on open api ecosystem including go-swagger [2] In addition this patch includes the following changes. - README.md: Add API section - test: Duplicate acceptance test to API v1 & API v2 version The Alertmanager acceptance test framework has a decent test coverage on the Alertmanager API. Introducing the Alertmanager API v2 does not go hand in hand with deprecating API v1. They should live alongside each other for a couple of minor Alertmanager versions. Instead of porting the acceptance test framework to use the new API v2, this patch duplicates the acceptance tests, one using the API v1, the other API v2. Once API v1 is removed we can simply remove `test/with_api_v1` and bring `test/with_api_v2` to `test/`. [1] https://github.com/OAI/OpenAPI-Specification/blob/master/versions/2.0.md [2] https://github.com/go-swagger/go-swagger/ [3] https://github.com/ahultgren/swagger-elm [4] http://petstore.swagger.io/?url=https://raw.githubusercontent.com/mxinden/alertmanager/apiv2/api/v2/openapi.yaml Signed-off-by: Max Leonard Inden <IndenML@gmail.com>	2018-09-04 13:38:34 +02:00
stuart nelson	445fbdf1a8	gossip large messages via SendReliable (#1415) * Gossip large messages via SendReliable For messages beyond half of the maximum gossip packet size, send the message to all peer nodes via TCP. The choice of "larger than half the max gossip size" is relatively arbitrary. From brief testing, the overhead from memberlist on a packet seemed to only use ~3 of the available 1400 bytes, and most gossip messages seem to be <<500 bytes. * Add tests for oversized/normal message gossiping * Make oversize metric names consistent * Remove errant printf in test * Correctly increment WaitGroup * Add comment for OversizedMessage func * Add metric for oversized messages dropped Code was added to drop oversized messages if the buffered channel they are sent on is full. This is a good thing to surface as a metric. * Add counter for total oversized messages sent * Change full queue log level to debug Was previously a warning, which isn't necessary now that there is a metric tracking it. Signed-off-by: stuart nelson <stuartnelson3@gmail.com>	2018-06-15 13:40:21 +02:00
stuart nelson	36588c3865	memberlist gossip (#1389) * Peers further propagate newly received nflogs If a peer receives an nflog that it hasn't seen before, queue the message and propagate it further to other peers. This should ensure that all peers within a cluster receive all gossip messages. Signed-off-by: stuart nelson <stuartnelson3@gmail.com> * Set Retransmit value based on number of members For alertmanagers that are brought up with a list of peers, set the number of message retransmits to be half of that number. If there are no peers on start, or there are few, continue to use the default value of 3. Signed-off-by: stuart nelson <stuartnelson3@gmail.com> * [nflog] Move retransmit calculation Signed-off-by: stuart nelson <stuartnelson3@gmail.com> * [silence] further gossip silence messages Signed-off-by: stuart nelson <stuartnelson3@gmail.com> * Set GossipNodes to equal RetransmitMulti During a gossip, we send messages to at most GossipNodes nodes. If possible, we only a message to reach all nodes as soon as possible. Signed-off-by: stuart nelson <stuartnelson3@gmail.com> * Fix rebase Signed-off-by: stuart nelson <stuartnelson3@gmail.com>	2018-06-08 11:48:42 +02:00
Ted Zlatanov	b04e9ad19b	#1346: move maintenance messages to DEBUG log level (#1347) Signed-off-by: Ted Zlatanov <tzz@lifelogs.com>	2018-04-30 11:56:17 +02:00
Simon Pasquier	2d68b4d318	silence: fix potential panic in decodeState() Signed-off-by: Simon Pasquier <spasquie@redhat.com>	2018-04-10 10:12:05 +02:00
Simon Pasquier	1531aa66f3	Fix for #1282 (#1286) * cluster: add alertmanager_cluster_messages_queued metric * cluster: add metrics for sent messages This change adds 2 new metrics: - alertmanager_cluster_messages_sent_total - alertmanager_cluster_messages_sent_size_total * Fix marshaling for entries being broadcast Individual notifications logs and silences being broadcast to the other peers need to be encoded using the same length-delimited format as when doing full-state synchronization. * main: fix argument order for cluster.Join() cluster.Join() was called with the push/pull and gossip interval parameters being swapped one for another.	2018-03-22 13:53:00 +01:00
pasquier-s	c2dac90434	silence: fix skipped test (#1258) TestStateMerge() was skipped because of a typo. Fixing the name revealed that the test itself needed to be updated following the switch to the memberlist library.	2018-02-27 10:17:48 +01:00
Fabian Reinartz	fd49dbb477	*: move to memberlist for clustering	2018-02-08 12:18:44 +01:00
pasquier-s	a7d4e4ea7c	Log snapshot sizes on maintenance (#1155) * Log snapshot sizes on maintenance * Add metrics for snapshot sizes This change adds 2 new gauges for tracking the last snapshots' sizes: - alertmanager_nflog_snapshot_size_bytes - alertmanager_silences_snapshot_size_bytes	2018-01-10 14:53:57 +01:00
Jose Donizetti	74808e40f3	Refactor silence constants (#1076) * Refactor remove dups silence state constants * Refactor to use const instead of string	2017-11-07 11:36:30 +01:00
Julius Volz	947970af44	Convert Alertmanager to use non-global go-kit loggers Fixes https://github.com/prometheus/alertmanager/issues/1040	2017-10-22 00:20:40 -07:00
Corentin Chary	bff889b490	silence|alerts: add metrics about current silences and alerts This adds metrics that look like this: ``` alertmanager_alerts{state="active"} 6 alertmanager_alerts{state="suppressed"} 0 alertmanager_silences{state="active"} 1 alertmanager_silences{state="expired"} 1 alertmanager_silences{state="pending"} 0 ``` This can be used to monitor alertmanager's usage and validate that alertmanagers in a mesh have a similar number of silences and alerts.	2017-10-02 13:33:29 +02:00
Jose Donizetti	9449bd1fa9	Ignore expired silences OnGossip (#999) This will fix the bug of resync deleted silences due to the state of other peers.	2017-09-28 10:25:35 +02:00
Corentin Chary	34d9524ab9	silences: avoid deadlock (#995) * silences: avoid deadlock Calling gossip.GossipBroadcast() will cause a deadlock if there is a currently executing OnBroadcast* function. See #982 * silence_test: better unit test to detect deadlocks	2017-09-27 11:48:28 +02:00
Corentin Chary	869a038a2b	Add a mutex to silences.go:gossipData (#984) This should fix silence/silence.go #982	2017-09-13 11:18:01 +02:00
Max Leonard Inden	08be6a4149	Expire pending silence and move to expired state Instead of setting endsAt to startsAt we can set both to now. Thereby the Silence will get the expired state by default.	2017-05-29 18:44:58 +02:00
Fabian Reinartz	f53974d5e6	silence: fix and test expiration behavior	2017-05-22 09:27:57 +02:00
Fabian Reinartz	d73a655bf4	Simplify silence modifications, add update endpoint (#796) * Simplify silence modifications, add update endpoint * vendor: add pkg/errors * ui: Handle upserting of silences . * Regenerate bindata	2017-05-16 16:48:25 +02:00
Fabian Reinartz	b1486ca546	silence: move to gogoproto This generates the protobuf Go code with gogoproto and switches to standard library time types.	2017-04-18 12:47:42 +02:00
Fabian Reinartz	309c6af4b2	nflog: use alert set instead of hash for deduplication Building a hash over an entire set of alerts causes problems, because the hash differs on any change, whereas we only want to send notifications if the alert and its state have changed. Therefore this introduces a list of alerts that are active and a list of alerts that are resolved. If the currently active alerts of a group are a subset of the ones that have been notified about before then they are deduplicated. The resolved notifications work the same way, with a separate list of resolved notifications that have already been sent.	2017-04-13 15:13:47 +02:00
Fabian Reinartz	b6851a5421	silences: fix concurrent cache writes (#561) This fixes #559 by removing concurrent map writes to the matcher cache. The cache was guarded by the Silence's main lock, which only used a read-lock on queries. The cache's get methods lazily load data into the cache, thus causing concurrent writes. We just change the main lock to always write-lock, as we don't expect high lock contention at this point and would have it in a dedicated cache lock anyway.	2016-11-21 11:09:49 +01:00
Fabian Reinartz	7517453c68	silence: add metrics	2016-09-29 09:54:34 +02:00
Frederic Branczyk	e72e45c8f1	silence: add cache for silence matchers compiling regex silence matchers on every query is expensive, therefore caching them as soon as they are gossiped through the mesh	2016-09-09 11:41:39 +02:00
Fabian Reinartz	b2461bb2d4	*: remove go-kit logging	2016-09-06 11:56:57 +02:00
Fabian Reinartz	8d88d9e05b	Merge pull request #481 from prometheus/fabxc-meshsil *: integrate new silence package	2016-08-30 16:53:34 +02:00
Fabian Reinartz	98101f3868	silence: fix doc strings	2016-08-30 14:19:22 +02:00
Fabian Reinartz	a4e8703567	*: integrate new silence package	2016-08-30 12:15:23 +02:00
Fabian Reinartz	ed3fdc747d	silence: add protobuf-based silence package. This commit adds an implementation of a silence storage that can share store and modify silences, share state via a mesh network, write and load snapshots, and be dynamically queried. All data formats are based on protocol buffers.	2016-08-24 17:48:31 +02:00

29 Commits
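
The deduplication rule described in commit 309c6af4b2 (a group is deduplicated when its currently active alerts are a subset of the alerts already notified about) can be illustrated with a minimal Go sketch. This is not the nflog implementation: the function names `isSubset` and `needsNotify` are hypothetical, and identifying alerts by `uint64` values is an assumption made only for illustration.

```go
package main

import "fmt"

// isSubset reports whether every element of sub is present in set.
// Hypothetical helper, not part of the Alertmanager code base.
func isSubset(sub, set []uint64) bool {
	seen := make(map[uint64]struct{}, len(set))
	for _, v := range set {
		seen[v] = struct{}{}
	}
	for _, v := range sub {
		if _, ok := seen[v]; !ok {
			return false
		}
	}
	return true
}

// needsNotify returns true when at least one currently active alert
// has not been notified about before. Alerts are assumed to be
// identified by uint64 values (an assumption for this sketch).
func needsNotify(active, notified []uint64) bool {
	return !isSubset(active, notified)
}

func main() {
	notified := []uint64{1, 2, 3}

	// Active alerts {1, 3} were all notified before: deduplicated.
	fmt.Println(needsNotify([]uint64{1, 3}, notified)) // false

	// Alert 4 is new, so a notification is needed.
	fmt.Println(needsNotify([]uint64{1, 4}, notified)) // true
}
```

Per the commit message, resolved alerts would be checked the same way against a separate list of resolved alerts that have already been sent.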