alertmanager

mirror of https://github.com/prometheus/alertmanager synced 2025-02-21 13:17:01 +00:00

Author	SHA1	Message	Date
Max Inden	b1a8fdd169	Merge pull request #1521 from mxinden/errcheck *.go: Introduce errcheck enforcing error handling	2018-08-30 17:53:49 +02:00
Max Leonard Inden	1219541184	*.go: Introduce errcheck enforcing error handling Errcheck [1] enforces error handling across all Go files. Functions can be excluded via `scripts/errcheck_excludes.txt`. This patch adds errcheck to the `test` Make target. [1] https://github.com/kisielk/errcheck Signed-off-by: Max Leonard Inden <IndenML@gmail.com>	2018-08-30 15:47:13 +02:00
Simon Pasquier	899226f3ac	*: remove v1/alerts/groups API endpoint (#1525 ) Signed-off-by: Simon Pasquier <spasquie@redhat.com>	2018-08-23 16:03:49 +02:00
Julius Volz	6d0edbe630	Fix a bunch of unhandled errors (#1501 ) ...as discovered by "gosec" (many other ones reported, but not all make a lot of sense to fix). Signed-off-by: Julius Volz <julius.volz@gmail.com>	2018-08-05 15:38:25 +02:00
Simon Pasquier	37884c8460	alertmanager: fix Settle() interval (#1478 ) Signed-off-by: Simon Pasquier <spasquie@redhat.com>	2018-07-24 22:42:09 +02:00
Max Inden	3735df3ac7	cluster: Do not exit when failing to join cluster (#1465 ) Alertmanager is exiting with a non-zero exit code if the initial cluster join fails. This behavior may be unwanted because: - As Alertmanager is a critical component with an at-least-once guarantee, failing on joining the cluster is unnecessary as Alertmanager still functions by itself. - In an environment like Kubernetes discovering peers via DNS, peers might roll out one-by-one, leaving the DNS entries unpopulated for the first peer of a set. Failing on initial join prevents a roll-out. Instead of failing on the initial join, this patch only logs the failure. The cluster can later be joined via the `handleReconnect`. This is a regression introduced in PR #1456 [1]. [1] https://github.com/prometheus/alertmanager/pull/1456 Signed-off-by: Max Leonard Inden <IndenML@gmail.com>	2018-07-11 17:19:33 +02:00
Corentin Chary	42ea9a565b	cluster: make sure we don't miss the first pushPull (#1456 ) * cluster: make sure we don't miss the first pushPull During the join, memberlist initiates a pushPull to get initial data. Unfortunately, at this point the nflog and silence listener have not been registered yet, so the first data arrives only after one pushPull cycle (1min by default !). Signed-off-by: Corentin Chary <c.chary@criteo.com>	2018-07-09 11:16:04 +02:00
stuart nelson	445fbdf1a8	gossip large messages via SendReliable (#1415 ) * Gossip large messages via SendReliable For messages beyond half of the maximum gossip packet size, send the message to all peer nodes via TCP. The choice of "larger than half the max gossip size" is relatively arbitrary. From brief testing, the overhead from memberlist on a packet seemed to only use ~3 of the available 1400 bytes, and most gossip messages seem to be <<500 bytes. * Add tests for oversized/normal message gossiping * Make oversize metric names consistent * Remove errant printf in test * Correctly increment WaitGroup * Add comment for OversizedMessage func * Add metric for oversized messages dropped Code was added to drop oversized messages if the buffered channel they are sent on is full. This is a good thing to surface as a metric. * Add counter for total oversized messages sent * Change full queue log level to debug Was previously a warning, which isn't necessary now that there is a metric tracking it. Signed-off-by: stuart nelson <stuartnelson3@gmail.com>	2018-06-15 13:40:21 +02:00
stuart nelson	db4af95ea0	memberlist reconnect (#1384 ) * initial impl Signed-off-by: stuart nelson <stuartnelson3@gmail.com> * Add reconnectTimeout Signed-off-by: stuart nelson <stuartnelson3@gmail.com> * Fix locking Signed-off-by: stuart nelson <stuartnelson3@gmail.com> * Remove unused PeerStatuses Signed-off-by: stuart nelson <stuartnelson3@gmail.com> * Add metrics Signed-off-by: stuart nelson <stuartnelson3@gmail.com> * Actually use peerJoinCounter Signed-off-by: stuart nelson <stuartnelson3@gmail.com> * Cleanup peers map on peer timeout Signed-off-by: stuart nelson <stuartnelson3@gmail.com> * Add reconnect test Signed-off-by: stuart nelson <stuartnelson3@gmail.com> * test removing failed peers Signed-off-by: stuart nelson <stuartnelson3@gmail.com> * Use peer address as map key If a peer is restarted, it will rejoin with the same IP but different ULID. So the node will rejoin the cluster, but its peers will never remove it from their internal list of failed nodes because its ULID has changed. Signed-off-by: stuart nelson <stuartnelson3@gmail.com> * Add failed peers from creation Signed-off-by: stuart nelson <stuartnelson3@gmail.com> * Remove warnIfAlone() Signed-off-by: stuart nelson <stuartnelson3@gmail.com> * Update metric names Signed-off-by: stuart nelson <stuartnelson3@gmail.com> * Address comments Signed-off-by: stuart nelson <stuartnelson3@gmail.com>	2018-06-05 14:28:49 +02:00
Simon Pasquier	0ebaeccd4b	*: add missing license headers Signed-off-by: Simon Pasquier <spasquie@redhat.com>	2018-05-14 17:37:13 +02:00
rhysm	e4416bd612	Add additional cluster configuration flags (#1379 ) The cluster configuration uses DefaultLANConfig which seems to be quite sensitive to WAN conditions. Allowing the tuning of these 3 parameters (TCP Timeout, Probe Interval and Probe Timeout) makes clustering more robust across WAN connections. Signed-off-by: Rhys Meaclem <rhysmeaclem@gmail.com>	2018-05-14 09:22:04 +02:00
Max Leonard Inden	f825d97de4	api: Deprecate `api/alerts` endpoint With prometheus/prometheus commit e114ce0ff7a1ae06b24fdc479ffc7422074c1ebe [1] Prometheus switches from using `api/alerts` to `api/v1/alerts`. This commit is included starting from Prometheus v0.17.0. As discussed on the prometheus-developers mailing list [2] the deprecation period is long over. [1] github.com/prometheus/prometheus/commit/e114ce0ff7a1ae06b24fdc479ffc7422074c1ebe [2] https://groups.google.com/d/msg/prometheus-developers/2CCuFTMbmAg/Qg58rvyzAQAJ Signed-off-by: Max Leonard Inden <IndenML@gmail.com>	2018-05-04 09:59:14 +02:00
Simon Pasquier	dc5fc02d22	[amtool] use kingpin.v2 (#1330 ) * Use default values to store values from config * fix typo and reserved keyword * move to long help texts * add one more unit test for resolver * update comments Signed-off-by: Simon Pasquier <spasquie@redhat.com>	2018-04-24 09:35:15 +02:00
Simon Pasquier	b95b32821f	ui: replace deprecated InstrumentHandler() (#1302 ) This change replaces the deprecated InstrumentHandler function by the equivalent functions from the promhttp package. The following metrics are removed: * http_request_duration_microseconds (Summary). * http_request_size_bytes (Summary). * http_requests_total (Counter). And the following metrics are added instead: * alertmanager_http_request_duration_seconds (Histogram). * alertmanager_http_response_size_bytes (Histogram). * promhttp_metric_handler_requests_in_flight (Gauge). * promhttp_metric_handler_requests_total (Counter).	2018-03-28 15:28:38 +02:00
Simon Pasquier	1531aa66f3	Fix for #1282 (#1286 ) * cluster: add alertmanager_cluster_messages_queued metric * cluster: add metrics for sent messages This change adds 2 new metrics: - alertmanager_cluster_messages_sent_total - alertmanager_cluster_messages_sent_size_total * Fix marshaling for entries being broadcast Individual notifications logs and silences being broadcast to the other peers need to be encoded using the same length-delimited format as when doing full-state synchronization. * main: fix argument order for cluster.Join() cluster.Join() was called with the push/pull and gossip interval parameters being swapped one for another.	2018-03-22 13:53:00 +01:00
Brian Brazil	bd04da5480	Remove debugging code (#1291 )	2018-03-18 12:42:24 +01:00
Stuart Nelson	8f1c16eaa9	Update flag help text Start all help texts with a capital letter, end with a period. There were some additional things that got caught by gofmt/goimports.	2018-03-07 10:04:30 +01:00
Corentin Chary	dd75201f1c	Add /-/ready based on mesh status (#1209 ) * Wait for the gossip to settle before sending notifications See #1209 for details. As a heuristic for mesh readiness, try to see if the mesh looks stable (the number of peers isn't changing too much). This implementation always marks the Alertmanager as ready after a maximum of 60s. This adds one new flag to control this behavior: ``` --cluster.settle-timeout=60s mesh settling timeout. Do not wait more than this duration on startup. ``` It also adds `/-/ready`, which always returns 200 (in order to make it clear that we are ready as soon as we can receive requests). The mesh status is exposed in `/api/v1/status` and visible on `/#/status`. * cluster: fix typos and base interval on gossipInterval	2018-03-02 15:45:21 +01:00
pasquier-s	e8a92f65ef	Run staticcheck as part of the build process (#1264 ) This change also fixes potential issues highlighted by running staticcheck.	2018-02-28 17:42:32 +01:00
pasquier-s	3df093968c	cluster: gather alertmanager_peer_position all the time (#1247 ) * cluster: gather alertmanager_peer_position all the time This change moves the gathering of the alertmanager_peer_position metric outside of the clusterWait() function so that the metric is computed accurately even when no alerting group fires. * cluster: add alertmanager_cluster_health_score metric This metric is retrieved from the memberlist library.	2018-02-27 10:37:56 +01:00
stuart nelson	0f9c9a0bb0	Remove unused functions for mesh (#1251 ) These functions were used with weaveworks/mesh, but are no longer needed with memberlist.	2018-02-16 18:16:06 +01:00
Julien Pivotto	dc293439ca	cluster: Make peer timeout configurable Signed-off-by: Julien Pivotto <roidelapluie@inuits.eu>	2018-02-13 16:31:33 +01:00
Simon Pasquier	9d16fe8266	cmd: remove unused code	2018-02-13 14:20:54 +01:00
Fabian Reinartz	6cfbe6e8b4	update cluster listen address flag	2018-02-12 10:22:49 +01:00
Fabian Reinartz	fd49dbb477	*: move to memberlist for clustering	2018-02-08 12:18:44 +01:00
pasquier-s	17bd637c97	Add mesh metrics (#1225 ) * Add mesh metrics This change adds 2 new metrics for the mesh: * alertmanager_peer_connection, state of the connection between the Alertmanager instance and a peer. * alertmanager_peer_terminations_total, total number of terminated connection. It also moves the gathering of the alertmanager_peer_position metric outside of the meshWait() function so that the metric is computed accurately even when no alerting group fires. * Remove 'nick' label from alertmanager_peer_connection metric	2018-02-06 12:13:52 +01:00
stuart nelson	3c61fe3fef	Return reload status from http endpoint (#1152 ) (#1180 ) * Return reload status from http endpoint (#1152) * Use same reload messaging as prometheus	2018-01-08 11:51:05 +01:00
Calle Pettersson	b7da058efb	Switch cmd/alertmanager to kingpin (#974 )	2018-01-06 11:22:26 +01:00
Calle Pettersson	608848390f	Switch amtool to kingpin (#976 ) * Switch cmd/amtool to kingpin * Touch-ups * Implement long help * Add missing short-form of --output * Fix backwards compatibility for config file options * Fix vendoring * Review fixes * Fix flag word order	2017-12-22 11:17:13 +01:00
stuart nelson	481eab7b83	Make alertGC interval configurable	2017-12-19 15:36:38 +01:00
pasquier-s	06f9a4ad1d	Fix logging for the mesh component (#1145 )	2017-12-14 16:05:59 +01:00
Julius Volz	f64a419853	Fix shutdown crash with nil mesh router	2017-11-03 23:44:05 +01:00
Frederic Branczyk	029c70d6fe	Allow enabling mutex and block profiles (#1073 )	2017-11-02 12:18:45 +01:00
Julius Volz	947970af44	Convert Alertmanager to use non-global go-kit loggers Fixes https://github.com/prometheus/alertmanager/issues/1040	2017-10-22 00:20:40 -07:00
Frederic Branczyk	620fff4e4f	add metric of alertmanager position in mesh (#1024 )	2017-10-06 18:37:44 +02:00
Corentin Chary	bff889b490	silence|alerts: add metrics about current silences and alerts This adds metrics that look like this: ``` alertmanager_alerts{state="active"} 6 alertmanager_alerts{state="suppressed"} 0 alertmanager_silences{state="active"} 1 alertmanager_silences{state="expired"} 1 alertmanager_silences{state="pending"} 0 ``` This can be used to monitor alertmanager's usage and validate that alertmanagers in a mesh have a similar number of silences and alerts.	2017-10-02 13:33:29 +02:00
Jack Neely	0dfdda3074	Use logging options consistently for all components #967 (#968 )	2017-09-02 11:24:11 +02:00
Julius Volz	b78869e749	Fix crash when no mesh router is configured (#919 ) * Fix crash when no mesh router is configured This adds a check for `meshListen != ""` around the waitFunc code as we have around the other mesh-related code parts above. Fixes https://github.com/prometheus/alertmanager/issues/914 * Update bindata	2017-07-22 10:56:55 +02:00
Frederic Branczyk	c4c0875ba3	fix config JSON marshaling	2017-06-08 13:37:57 +02:00
stuart nelson	2cf38e4c2e	Fix external web url (#836 ) Infer path from Navigation.Location Build uses template, local dev uses elm-reactor Remove unneeded local dev go server Add script.js make target Compiles and uglifies script.js Before: ~570kb After: ~170kb Bootstrap loading state Add trailing slash via JS & add routePrefix console param Add Javascript script tag to `index.html` which adds a trailing slash to the url pathname if none is present. This is done to ensure assets like `script.js` are loaded properly. Example without patch: If the pathname is "mxinden.com/alertmanager" the browser will try to download the `script.js` asset from "mxinden.com/script.js". This request will fail. Example with patch: If the pathname is "mxinden.com/alertmanager", Javascript redirects the browser to "mxinden.com/alertmanager/" and then the `script.js` asset will be downloaded from "mxinden.com/alertmanager/script.js". This request will succeed. Add `-web.route-prefix` as a console parameter. This configures a Prefix for the internal routes of web endpoints. Defaults to path of -web.external-url like in Prometheus. Trim slashes off of route prefix and add one slash at the beginning. Make sure route prefix is not empty or just a slash before prefixing router.	2017-06-07 22:38:39 +02:00
Conor Broderick	02484d18b0	Added option to disable AM listening on mesh port (#764 )	2017-05-31 11:05:18 +01:00
Max Leonard Inden	7e6de53f3a	Merge branch 'master' into ui-rewrite	2017-05-15 09:58:14 +02:00
Fabian Reinartz	8ccb95c9f5	Log mesh messages at debug level	2017-05-09 16:07:27 +02:00
Fabian Reinartz	672e9b205f	vendor: update mesh	2017-05-09 15:03:27 +02:00
Max Leonard Inden	ef3cc7b001	Return alert status on /alerts endpoint and enable filtering This makes the /alerts endpoint return the alert status introduced by PR #717. In addition, it enables alert filtering via matchers like on the /alerts/groups endpoint.	2017-05-02 09:46:20 +02:00
Frederic Branczyk	3eb81b4243	Add config hash metric (#751 )	2017-04-27 20:58:15 +02:00
stuart nelson	6a909abf17	Add processing status field to alert	2017-04-27 14:18:52 +02:00
Tom Wilkie	ba4fc17307	Remove use of route.Context	2017-04-25 17:43:01 -07:00
Fabian Reinartz	7e31a58868	cmd: fix panic on empty peer string	2017-04-21 14:40:46 +02:00
Kellen Fox	3aab66ec3a	Amtool implementation (#636 ) * Implement alertmanager cli tool 'amtool' The primary goal of an alertmanager tool is to provide a cli interface for the prometheus alertmanager. My vision for this tool has two parts: - Silence management (query, add, delete) - Alert management (query, maybe more in future?) Resolves: #567	2017-04-20 11:04:17 +02:00
stuart nelson	1e34f29532	Filter alerts (#633 ) * Vendor dependencies. This updates several old dependencies, removes some that are no longer needed, and adds `pkg/labels` from prometheus `dev-2.0` branch. * Add metrics selector parsing code This is a temporary simplified re-implementation of promQL's metric selector parsing. * Add alerts filtering Filter alerts through `?filter=` query string. * Add silences filtering Filter silences through `?filter=` query string. * Move `parse` to `pkg/parse`	2017-03-16 11:16:10 +01:00
Max Inden	fced9e126b	api: Expose mesh status (#644 ) * api: Expose mesh status The weaveworks mesh package reveals information about the current status of the mesh network between alertmanager instances. This commit exposes the current address and connection status of each instance connected to the targeted alertmanager instance via the /status API endpoint. * api: Replace LocalConnectionStatus with PeerStatus Now meshStatus contains all peers (Name, NickName, UID) of the network. Additionally adding Name and Nickname of current target to toplevel of meshNetwork object.	2017-03-07 18:17:03 +01:00
stuart nelson	24a9a64bdf	Only find MAC address if no command-line flag value given (#638 ) * Find MAC address if mesh.hardware-addr not given Defaulting to the machine's MAC address sometimes fails and causes a panic. Allow the user to specify a custom address to skip this so they can run AlertManager. * -mesh.hardware-address -> -mesh.peer-id * Fix command-line invocation	2017-02-28 14:57:45 +01:00
Matt Bostock	890f148a27	Prevent panic on failed config load Prevent Alertmanager from panicking when the configuration cannot be loaded. Spotted in version 0.4.2: INFO[0000] Starting alertmanager (version=0.4.2, branch=HEAD, revision=9a5ab2fa63dd7951f4f202b0846d4f4d8e9615b0) source=main.go:84 INFO[0000] Build context (go=go1.7.3, user=root@45f28166fed1, date=20170117-13:50:50) source=main.go:85 INFO[0000] Loading configuration file file=alertmanager.yml source=main.go:156 ERRO[0000] error: yaml: line 64: could not find expected ':' source=api.go:115 panic: runtime error: invalid memory address or nil pointer dereference [signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x401adb] goroutine 1 [running]: panic(0x83e240, 0xc42000e020) /usr/local/go/src/runtime/panic.go:500 +0x1a1 main.(*API).Update(0xc4200b20f0, 0xc42019f800, 0x17e7, 0x45d964b800) /go/src/github.com/prometheus/alertmanager/api.go:117 +0xcb main.main.func3(0x0, 0x0) /go/src/github.com/prometheus/alertmanager/main.go:172 +0x226 main.main() /go/src/github.com/prometheus/alertmanager/main.go:192 +0x97a make: *** [run] Error 2	2017-01-17 14:59:11 +00:00
Fabian Reinartz	a2666e6b31	vendoring: update	2016-12-01 13:49:28 +01:00
Fabian Reinartz	1e01b2bdba	nflog: add metrics (#518 )	2016-11-21 15:22:35 +01:00
Frederic Branczyk	8a2f93a102	*: allow use of mesh encryption through password parameter	2016-10-07 16:19:42 +02:00
Fabian Reinartz	7517453c68	silence: add metrics	2016-09-29 09:54:34 +02:00
Fabian Reinartz	bf6d47934e	Merge pull request #489 from prometheus/timeout *: consider mesh wait in notification timeouts	2016-09-06 14:17:53 +02:00
Fabian Reinartz	b2461bb2d4	*: remove go-kit logging	2016-09-06 11:56:57 +02:00
Fabian Reinartz	e9fbe62e0f	*: consider mesh wait in notification timeouts This adds the peer wait duration to the standard timeout to avoid terminating a notification prematurely while being in failover wait status.	2016-09-05 13:21:28 +02:00
Fabian Reinartz	8d88d9e05b	Merge pull request #481 from prometheus/fabxc-meshsil *: integrate new silence package	2016-08-30 16:53:34 +02:00
Fabian Reinartz	a4e8703567	*: integrate new silence package	2016-08-30 12:15:23 +02:00
Fabian Reinartz	fcda6fede3	Merge pull request #458 from pdbogen/pdbogen-error-on-nonflag-args Pdbogen error on nonflag args	2016-08-23 14:25:53 +02:00
Fabian Reinartz	5dc8286942	nflog: fix maintenance termination	2016-08-19 12:01:16 +02:00
Fabian Reinartz	72fdf3d3ab	*: integrate nflog This commit replaces the previous NotifyInfo provider with the new nflog package. It needs adjustments in the behavior of the deduping stage. The nflog stores notification digests per receiver per alert aggregation group rather than one entry for alert per receiver. This drastically reduces the number of entries and removes interference across aggregation groups.	2016-08-18 15:52:28 +02:00
Patrick Bogen	ca844915b3	alertmanager should raise an error if unexpected arguments are present on the command line	2016-08-16 11:15:45 -07:00
Frederic Branczyk	7bc851e894	rework building of stage pipelines	2016-08-16 10:56:46 +02:00
Frederic Branczyk	840dd7d2f5	introduce Stage interface	2016-08-12 16:01:40 +02:00
Frederic Branczyk	3dfb17e601	refactor notification pipeline move hard to read backwards declared approach to more transparent pipeline approach with more detailed interfaces	2016-08-11 15:04:03 +02:00
Fabian Reinartz	3931d4e64b	*: restructure package tree This commit packages up individual modules and removes the top-level main package.	2016-08-09 14:24:52 +02:00

121 Commits