alertmanager

Commit Graph

Author	SHA1	Message	Date
Peter Štibraný	d5ed7bfb15	Only register limit metrics when they are used. Limits are not used in standalone alertmanager. Signed-off-by: Peter Štibraný <pstibrany@gmail.com> Signed-off-by: Peter Štibraný <peter.stibrany@grafana.com>	2021-06-02 12:00:31 +02:00
Peter Štibraný	390474ffbe	Added group limit to dispatcher. Signed-off-by: Peter Štibraný <peter.stibrany@grafana.com>	2021-06-02 12:00:31 +02:00
Peter Štibraný	cc0b08fd7c	Added possibility to pass callback to *mem.NewAlerts, useful for implementing limits on alerts. Update provider/mem/mem.go Co-authored-by: Julien Pivotto <roidelapluie@gmail.com> Signed-off-by: Peter Štibraný <peter.stibrany@grafana.com>	2021-05-31 09:56:57 +02:00
Julien Pivotto	4d2aea63c1	API: Only pass cluster peer if empty Fixes #2580 Signed-off-by: Julien Pivotto <roidelapluie@inuits.eu>	2021-05-14 18:12:46 +02:00
Julien Pivotto	64e108c3d4	Fix panic when HA is disabled Fix #2549 Signed-off-by: Julien Pivotto <roidelapluie@inuits.eu>	2021-05-07 00:26:41 +02:00
ArthurSens	74d388e3f4	Amtool and Alertmanager binaries print to stdout Signed-off-by: ArthurSens <arthursens2005@gmail.com>	2021-03-04 15:31:17 +00:00
Ben Ridley	ae116cfc26	Fix comment formatting Signed-off-by: Ben Ridley <benridley29@gmail.com>	2021-03-01 08:30:02 +11:00
Ben Ridley	c34003ffdb	Pre-allocate mute time config slice Signed-off-by: Ben Ridley <benridley29@gmail.com>	2021-03-01 08:30:02 +11:00
Ben Ridley	a3cb125e5c	Move timeinterval library into locally maintained package Signed-off-by: Ben Ridley <benridley29@gmail.com>	2021-03-01 08:30:01 +11:00
ben	d1f5e07909	Add mute time stage and pipeline Signed-off-by: Ben Ridley <benridley29@gmail.com>	2021-03-01 08:30:01 +11:00
Julien Pivotto	8ebd888488	Support https (#2446) Signed-off-by: Julien Pivotto <roidelapluie@inuits.eu>	2021-01-27 10:52:08 +01:00
Julien Pivotto	1cba0c7a37	Remove HipChat (#2281) Signed-off-by: Julien Pivotto <roidelapluie@inuits.eu>	2020-06-11 15:51:10 +02:00
LucasBoisserie	97bd078441	Add redirect on / to routePrefix (#2235) Signed-off-by: LucasBoisserie <lucas.boisserie@gmail.com>	2020-05-28 17:07:55 +02:00
johncming	134c3c0ed9	move walkRoute to dispatch package. (#2136) Signed-off-by: johncming <johncming@yahoo.com>	2019-12-20 15:27:58 +01:00
Simon Pasquier	2a1204e667	cmd/alertmanager: add alertmanager_integrations metric (#2117) Signed-off-by: Simon Pasquier <spasquie@redhat.com>	2019-11-27 09:33:08 +01:00
Simon Pasquier	324c44ccb7	cmd/alertmanager: log unused receivers + add alertmanager_receivers metric (#2114) Signed-off-by: Simon Pasquier <spasquie@redhat.com>	2019-11-26 10:01:56 +01:00
Simon Pasquier	4f45457b9c	dispatch: add metrics (#2113) Signed-off-by: Simon Pasquier <spasquie@redhat.com>	2019-11-26 09:04:56 +01:00
Simon Pasquier	9f7f4ead46	notify: don't use the global metrics registry (#1977) * notify: don't use the global metrics registry Signed-off-by: Simon Pasquier <spasquie@redhat.com> * Address Max's comment Signed-off-by: Simon Pasquier <spasquie@redhat.com>	2019-08-26 16:37:13 +02:00
Simon Pasquier	f8428bfc7b	cmd/alermanager: log when repeat_interval > retention (#1993) Signed-off-by: Simon Pasquier <spasquie@redhat.com>	2019-08-07 14:58:43 +02:00
Simon Pasquier	2bbfd4acb6	cmd/alertmanager: add alertmanager_cluster_enabled metric (#1973) Signed-off-by: Simon Pasquier <spasquie@redhat.com>	2019-07-29 10:48:44 +02:00
Bartek Plotka	f7f8c47d55	docs/flags: Make it explicit that HA is enabled by default and how to disable it. Signed-off-by: Bartek Plotka <bwplotka@gmail.com>	2019-07-24 10:48:28 +01:00
Simon Pasquier	78c9ebc621	cmd/alertmanager: reject invalid external URLs (#1960) * cmd/alertmanager: reject invalid external URLs Signed-off-by: Simon Pasquier <spasquie@redhat.com> * Address Brian's comments Signed-off-by: Simon Pasquier <spasquie@redhat.com> * Simplify the code according to Max's feedback Signed-off-by: Simon Pasquier <spasquie@redhat.com>	2019-07-22 14:35:26 +02:00
Simon Pasquier	0c3120efac	*: split notify package Instead of keeping all notifiers in the notify package, it splits them into individual sub-packages. This improves readability and maintainability of the code. Signed-off-by: Simon Pasquier <spasquie@redhat.com>	2019-06-18 15:36:19 +02:00
Simon Pasquier	2abd78cbb7	*: use persistent HTTP clients (#1904) Signed-off-by: Simon Pasquier <spasquie@redhat.com>	2019-06-07 10:37:49 +02:00
stuart nelson	2fa210d0e3	add groups endpoint to v2 api Signed-off-by: stuart nelson <stuartnelson3@gmail.com>	2019-04-17 11:32:21 +02:00
beorn7	f3d9c89bbc	Create a `Muter` implementation for silences This encapsulates the logic of querying and marking silenced alerts. It removes the code duplication flagged earlier. I removed the error returned by the setAlertStatus function as we were only logging it, and that's already done anyway when the error is received from the `silence.Query` call (now in the `Mutes` method). Signed-off-by: beorn7 <beorn@soundcloud.com>	2019-02-26 16:42:59 +01:00
Max Leonard Inden	d0cd5a0f08	*: Introduce config coordinator bundling config specific logic Instead of handling all config specific logic inside Alertmangaer.main(), this patch introduces the config coordinator component. Tasks of the config coordinator: - Load and parse configuration - Notify subscribers on configuration changes - Register and manage configuration specific metrics Signed-off-by: Max Leonard Inden <IndenML@gmail.com>	2019-02-25 11:26:30 +01:00
stuart nelson	51eebbef85	Stn/correctly mark api silences (#1733) * Update alert status on every GET to alerts Signed-off-by: stuart nelson <stuartnelson3@gmail.com>	2019-02-18 17:06:51 +01:00
beorn7	21de9ff88c	Various improvements after code review Most importantly, `api.New` now takes an `Options` struct as an argument, which allows some other things done here as well: - Timout and concurrency limit are now in the options, streamlining the registration and the implementation of the limiting middleware. - A local registry is used for metrics, and the metrics used so far inside any of the api packages are using it now. The 'in flight' metric now contains the 'get' as a method label. I have also added a TODO to instrument other methods in the same way (otherwise, the label doesn't reall make sense, semantically). I have also added an explicit error counter for requests rejected because of the concurrency limit. (They also show up as 503s in the generic HTTP instrumentation (or they would, if v2 were instrumented, too), but those 503s might have a number of reasons, while users might want to alert on concurrency limit problems explicitly). Signed-off-by: beorn7 <beorn@soundcloud.com>	2019-02-12 18:42:08 +01:00
beorn7	3382a0e949	Add HTTP instrumentation for GET requests in flight While the newly added in-flight instrumentation works for all GET requests, the existing HTTP instrumentation omits api/v2 calls. This commit adds a TODO note about that. Signed-off-by: beorn7 <beorn@soundcloud.com>	2019-02-11 19:34:06 +01:00
beorn7	fc4b67ce80	Introduce a timeout and concurrency limit for HTTP requests The default concurrency limit is max(GOMAXPROCS, 8). That should not imply that each GET requests eats a whole CPU. It's more to get some reasonable heuristics for the processing power of the hosting machine (while allowing at least 8 concurrent requests even on the smallest machines). As GET requests can easily overload the Alertmanager, rendering it incapable of doing its main task, namely sending alert notifications, we need to limit GET requests by default. In contrast, no timeout is set by default. The http.TimeoutHandler inovkes quite a bit of machinery behind the scenes, in particular an additional layer of buffering. Thus, we should first get a bit of experience with it before we consider enforcing a timeout by default, even if setting a timeout is in general the safer setting for resiliency. Signed-off-by: beorn7 <beorn@soundcloud.com>	2019-02-11 19:34:06 +01:00
Max Leonard Inden	09a7370572	main.go: Move marker metric registering into types/types.go Instead of registering marker metrics inside of cmd/alertmanager/main.go, register them in types/types.go, encapsulating marker specific logic in its module, not in main.go. In addition it paves the path for removing the usage of the global metric registry in the future, by taking a local metric registerer. Signed-off-by: Max Leonard Inden <IndenML@gmail.com>	2019-02-05 14:59:22 +01:00
Max Leonard Inden	c57542127d	api: Combine v1 and v2 into generic api Instead of cmd/alertmanager/main.go instantiating and starting both api v1 and v2, delegate that work to a generic api combining the two. Signed-off-by: Max Leonard Inden <IndenML@gmail.com>	2019-02-04 14:31:33 +01:00
Paul Traylor	cd4a524848	Update prometheus/common and add support for --log.format (#1658) Signed-off-by: Paul Traylor <paul.traylor@linecorp.com>	2018-12-13 12:58:43 +01:00
Simon Pasquier	ae66c4f31f	cmd/alertmanager: fix route prefix for the API v2 Signed-off-by: Simon Pasquier <spasquie@redhat.com>	2018-11-21 15:32:29 +01:00
Simon Pasquier	dae389f058	cmd/alertmanager: use buffered channel for signal Signed-off-by: Simon Pasquier <spasquie@redhat.com>	2018-11-14 16:41:32 +01:00
Max Leonard Inden	f1b920bcc9	api: Implement OpenAPI generated Alertmanager API V2 The current Alertmanager API v1 is undocumented and written by hand. This patch introduces a new Alertmanager API - v2. The API is fully generated via an OpenAPI 2.0 [1] specification (see `api/v2/openapi.yaml`) with the exception of the http handlers itself. Pros: - Generated server code - Ability to generate clients in all major languages (Go, Java, JS, Python, Ruby, Haskell, elm [3] ...) - Strict contract (OpenAPI spec) between server and clients. - Instant feedback on frontend-breaking changes, due to strictly typed frontend language elm. - Generated documentation (See Alertmanager online Swagger UI [4]) Cons: - Dependency on open api ecosystem including go-swagger [2] In addition this patch includes the following changes. - README.md: Add API section - test: Duplicate acceptance test to API v1 & API v2 version The Alertmanager acceptance test framework has a decent test coverage on the Alertmanager API. Introducing the Alertmanager API v2 does not go hand in hand with deprecating API v1. They should live alongside each other for a couple of minor Alertmanager versions. Instead of porting the acceptance test framework to use the new API v2, this patch duplicates the acceptance tests, one using the API v1, the other API v2. Once API v1 is removed we can simply remove `test/with_api_v1` and bring `test/with_api_v2` to `test/`. [1] https://github.com/OAI/OpenAPI-Specification/blob/master/versions/2.0.md [2] https://github.com/go-swagger/go-swagger/ [3] https://github.com/ahultgren/swagger-elm [4] http://petstore.swagger.io/?url=https://raw.githubusercontent.com/mxinden/alertmanager/apiv2/api/v2/openapi.yaml Signed-off-by: Max Leonard Inden <IndenML@gmail.com>	2018-09-04 13:38:34 +02:00
stuart nelson	e883ccb9de	pull out shared code for storing alerts (#1507) Move the code for storing and GC'ing alerts from being re-implemented in several packages to existing in its own package Signed-off-by: stuart nelson <stuartnelson3@gmail.com>	2018-09-03 14:52:53 +02:00
Max Inden	b1a8fdd169	Merge pull request #1521 from mxinden/errcheck *.go: Introduce errcheck enforcing error handling	2018-08-30 17:53:49 +02:00
Max Leonard Inden	1219541184	*.go: Introduce errcheck enforcing error handling Errcheck [1] enforces error handling accross all go files. Functions can be excluded via `scripts/errcheck_excludes.txt`. This patch adds errcheck to the `test` Make target. [1] https://github.com/kisielk/errcheck Signed-off-by: Max Leonard Inden <IndenML@gmail.com>	2018-08-30 15:47:13 +02:00
Simon Pasquier	899226f3ac	*: remove v1/alerts/groups API endpoint (#1525) Signed-off-by: Simon Pasquier <spasquie@redhat.com>	2018-08-23 16:03:49 +02:00
Julius Volz	6d0edbe630	Fix a bunch of unhandled errors (#1501) ...as discovered by "gosec" (many other ones reported, but not all make a lot of sense to fix). Signed-off-by: Julius Volz <julius.volz@gmail.com>	2018-08-05 15:38:25 +02:00
Simon Pasquier	37884c8460	alertmanager: fix Settle() interval (#1478) Signed-off-by: Simon Pasquier <spasquie@redhat.com>	2018-07-24 22:42:09 +02:00
Max Inden	3735df3ac7	cluster: Do not exit when failing to join cluster (#1465 ) Alertmanager is exiting with a non-zero exit code if the initial cluster join fails. This behavior could be not wanted because: - As Alertmanager is a critical component with an at-least-once guarantee, failing on joining the cluster is unnecessary as Alertmanager still functions by itself. - In an environment like Kubernetes discovering peers via DNS, peers might roll out one-by-one, leaving the DNS entries unpopulated for the first peer of a set. Failing on initial join prevents a roll-out. Instead of failing on the initial join this patch only logs the failure. The cluster can be later joined via the `handleReconnect`. This is a regression introduced in PR #1456 [1]. [1] https://github.com/prometheus/alertmanager/pull/1456 Signed-off-by: Max Leonard Inden <IndenML@gmail.com>	2018-07-11 17:19:33 +02:00
Corentin Chary	42ea9a565b	cluster: make sure we don't miss the first pushPull (#1456 ) * cluster: make sure we don't miss the first pushPull During the join, memberlist initiates a pushPull to get initial data. Unfortunately, at this point the nflog and silence listener have not been registered yet, so the first data arrives only after one pushPull cycle (1min by default !). Signed-off-by: Corentin Chary <c.chary@criteo.com>	2018-07-09 11:16:04 +02:00
stuart nelson	445fbdf1a8	gossip large messages via SendReliable (#1415 ) * Gossip large messages via SendReliable For messages beyond half of the maximum gossip packet size, send the message to all peer nodes via TCP. The choice of "larger than half the max gossip size" is relatively arbitrary. From brief testing, the overhead from memberlist on a packet seemed to only use ~3 of the available 1400 bytes, and most gossip messages seem to be <<500 bytes. * Add tests for oversized/normal message gossiping * Make oversize metric names consistent * Remove errant printf in test * Correctly increment WaitGroup * Add comment for OversizedMessage func * Add metric for oversized messages dropped Code was added to drop oversized messages if the buffered channel they are sent on is full. This is a good thing to surface as a metric. * Add counter for total oversized messages sent * Change full queue log level to debug Was previously a warning, which isn't necessary now that there is a metric tracking it. Signed-off-by: stuart nelson <stuartnelson3@gmail.com>	2018-06-15 13:40:21 +02:00
stuart nelson	db4af95ea0	memberlist reconnect (#1384 ) * initial impl Signed-off-by: stuart nelson <stuartnelson3@gmail.com> * Add reconnectTimeout Signed-off-by: stuart nelson <stuartnelson3@gmail.com> * Fix locking Signed-off-by: stuart nelson <stuartnelson3@gmail.com> * Remove unused PeerStatuses Signed-off-by: stuart nelson <stuartnelson3@gmail.com> * Add metrics Signed-off-by: stuart nelson <stuartnelson3@gmail.com> * Actually use peerJoinCounter Signed-off-by: stuart nelson <stuartnelson3@gmail.com> * Cleanup peers map on peer timeout Signed-off-by: stuart nelson <stuartnelson3@gmail.com> * Add reconnect test Signed-off-by: stuart nelson <stuartnelson3@gmail.com> * test removing failed peers Signed-off-by: stuart nelson <stuartnelson3@gmail.com> * Use peer address as map key If a peer is restarted, it will rejoin with the same IP but different ULID. So the node will rejoin the cluster, but its peers will never remove it from their internal list of failed nodes because its ULID has changed. Signed-off-by: stuart nelson <stuartnelson3@gmail.com> * Add failed peers from creation Signed-off-by: stuart nelson <stuartnelson3@gmail.com> * Remove warnIfAlone() Signed-off-by: stuart nelson <stuartnelson3@gmail.com> * Update metric names Signed-off-by: stuart nelson <stuartnelson3@gmail.com> * Address comments Signed-off-by: stuart nelson <stuartnelson3@gmail.com>	2018-06-05 14:28:49 +02:00
Simon Pasquier	0ebaeccd4b	*: add missing license headers Signed-off-by: Simon Pasquier <spasquie@redhat.com>	2018-05-14 17:37:13 +02:00
rhysm	e4416bd612	Add additional cluster configuration flags (#1379) The cluster configuration uses DefaultLANConfig which seems to be quite sensitive to WAN conditions. Allowing the tuning of these 3 parameters (TCP Timeout, Probe Interval and Probe Timeout) makes clustering more robust across WAN connections. Signed-off-by: Rhys Meaclem <rhysmeaclem@gmail.com>	2018-05-14 09:22:04 +02:00
Max Leonard Inden	f825d97de4	api: Deprecate `api/alerts` endpoint With prometheus/prometheus commit e114ce0ff7a1ae06b24fdc479ffc7422074c1ebe [1] Prometheus switches from using `api/alerts` to `api/v1/alerts`. This commit is included starting from Prometheus v0.17.0. As discussed on the prometheus-developers mailing list [2] the deprecation period is long over. [1] github.com/prometheus/prometheus/commit/e114ce0ff7a1ae06b24fdc479ffc7422074c1ebe [2] https://groups.google.com/d/msg/prometheus-developers/2CCuFTMbmAg/Qg58rvyzAQAJ Signed-off-by: Max Leonard Inden <IndenML@gmail.com>	2018-05-04 09:59:14 +02:00

109 Commits