Commit Graph

121 Commits

Author SHA1 Message Date
Max Inden
b1a8fdd169
Merge pull request #1521 from mxinden/errcheck
*.go: Introduce errcheck enforcing error handling
2018-08-30 17:53:49 +02:00
Max Leonard Inden
1219541184
*.go: Introduce errcheck enforcing error handling
Errcheck [1] enforces error handling accross all go files. Functions can
be excluded via `scripts/errcheck_excludes.txt`.

This patch adds errcheck to the `test` Make target.

[1] https://github.com/kisielk/errcheck

Signed-off-by: Max Leonard Inden <IndenML@gmail.com>
2018-08-30 15:47:13 +02:00
Simon Pasquier
899226f3ac *: remove v1/alerts/groups API endpoint (#1525)
Signed-off-by: Simon Pasquier <spasquie@redhat.com>
2018-08-23 16:03:49 +02:00
Julius Volz
6d0edbe630 Fix a bunch of unhandled errors (#1501)
...as discovered by "gosec" (many other ones reported, but not all make
a lot of sense to fix).

Signed-off-by: Julius Volz <julius.volz@gmail.com>
2018-08-05 15:38:25 +02:00
Simon Pasquier
37884c8460 alertmanager: fix Settle() interval (#1478)
Signed-off-by: Simon Pasquier <spasquie@redhat.com>
2018-07-24 22:42:09 +02:00
Max Inden
3735df3ac7
cluster: Do not exit when failing to join cluster (#1465)
Alertmanager is exiting with a non-zero exit code if the initial cluster
join fails. This behavior could be not wanted because:

- As Alertmanager is a critical component with an at-least-once
guarantee, failing on joining the cluster is unnecessary as
Alertmanager still functions by itself.

- In an environment like Kubernetes discovering peers via DNS, peers
might roll out one-by-one, leaving the DNS entries unpopulated for the
first peer of a set. Failing on initial join prevents a roll-out.

Instead of failing on the initial join this patch only logs the failure.
The cluster can be later joined via the `handleReconnect`.

This is a regression introduced in PR #1456 [1].

[1] https://github.com/prometheus/alertmanager/pull/1456

Signed-off-by: Max Leonard Inden <IndenML@gmail.com>
2018-07-11 17:19:33 +02:00
Corentin Chary
42ea9a565b cluster: make sure we don't miss the first pushPull (#1456)
* cluster: make sure we don't miss the first pushPull

During the join, memberlist initiates a pushPull to get initial data.
Unfortunately, at this point the nflog and silence listener have not
been registered yet, so the first data arrives only after one pushPull
cycle (1min by default !).

Signed-off-by: Corentin Chary <c.chary@criteo.com>
2018-07-09 11:16:04 +02:00
stuart nelson
445fbdf1a8
gossip large messages via SendReliable (#1415)
* Gossip large messages via SendReliable

For messages beyond half of the maximum gossip
packet size, send the message to all peer nodes
via TCP.

The choice of "larger than half the max gossip
size" is relatively arbitrary. From brief testing,
the overhead from memberlist on a packet seemed to
only use ~3 of the available 1400 bytes, and most
gossip messages seem to be <<500 bytes.

* Add tests for oversized/normal message gossiping

* Make oversize metric names consistent

* Remove errant printf in test

* Correctly increment WaitGroup

* Add comment for OversizedMessage func

* Add metric for oversized messages dropped

Code was added to drop oversized messages if the
buffered channel they are sent on is full. This
is a good thing to surface as a metric.

* Add counter for total oversized messages sent

* Change full queue log level to debug

Was previously a warning, which isn't necessary
now that there is a metric tracking it.

Signed-off-by: stuart nelson <stuartnelson3@gmail.com>
2018-06-15 13:40:21 +02:00
stuart nelson
db4af95ea0
memberlist reconnect (#1384)
* initial impl

Signed-off-by: stuart nelson <stuartnelson3@gmail.com>

* Add reconnectTimeout

Signed-off-by: stuart nelson <stuartnelson3@gmail.com>

* Fix locking

Signed-off-by: stuart nelson <stuartnelson3@gmail.com>

* Remove unused PeerStatuses

Signed-off-by: stuart nelson <stuartnelson3@gmail.com>

* Add metrics

Signed-off-by: stuart nelson <stuartnelson3@gmail.com>

* Actually use peerJoinCounter

Signed-off-by: stuart nelson <stuartnelson3@gmail.com>

* Cleanup peers map on peer timeout

Signed-off-by: stuart nelson <stuartnelson3@gmail.com>

* Add reconnect test

Signed-off-by: stuart nelson <stuartnelson3@gmail.com>

* test removing failed peers

Signed-off-by: stuart nelson <stuartnelson3@gmail.com>

* Use peer address as map key

If a peer is restarted, it will rejoin with the
same IP but different ULID. So the node will
rejoin the cluster, but its peers will never
remove it from their internal list of failed nodes
because its ULID has changed.

Signed-off-by: stuart nelson <stuartnelson3@gmail.com>

* Add failed peers from creation

Signed-off-by: stuart nelson <stuartnelson3@gmail.com>

* Remove warnIfAlone()

Signed-off-by: stuart nelson <stuartnelson3@gmail.com>

* Update metric names

Signed-off-by: stuart nelson <stuartnelson3@gmail.com>

* Address comments

Signed-off-by: stuart nelson <stuartnelson3@gmail.com>
2018-06-05 14:28:49 +02:00
Simon Pasquier
0ebaeccd4b *: add missing license headers
Signed-off-by: Simon Pasquier <spasquie@redhat.com>
2018-05-14 17:37:13 +02:00
rhysm
e4416bd612 Add additional cluster configuration flags (#1379)
The cluster configuration uses DefaultLANConfig which seems
to be quite sensitive to WAN conditions. Allowing the tuning of these 3
parameters (TCP Timeout, Probe Interval and Probe Timeout) makes
clustering more robust across WAN connections.

Signed-off-by: Rhys Meaclem <rhysmeaclem@gmail.com>
2018-05-14 09:22:04 +02:00
Max Leonard Inden
f825d97de4
api: Deprecate api/alerts endpoint
With prometheus/prometheus commit
e114ce0ff7a1ae06b24fdc479ffc7422074c1ebe [1] Prometheus switches from
using `api/alerts` to `api/v1/alerts`. This commit is included starting
from Prometheus v0.17.0. As discussed on the prometheus-developers
mailing list [2] the deprecation period is long over.

[1] github.com/prometheus/prometheus/commit/e114ce0ff7a1ae06b24fdc479ffc7422074c1ebe
[2]
https://groups.google.com/d/msg/prometheus-developers/2CCuFTMbmAg/Qg58rvyzAQAJ

Signed-off-by: Max Leonard Inden <IndenML@gmail.com>
2018-05-04 09:59:14 +02:00
Simon Pasquier
dc5fc02d22 [amtool] use kingpin.v2 (#1330)
* Use default values to store values from config
* fix typo and reserved keywork
* move to long help texts
* add one more unit test for resolver
* update comments

Signed-off-by: Simon Pasquier <spasquie@redhat.com>
2018-04-24 09:35:15 +02:00
Simon Pasquier
b95b32821f ui: replace deprecated InstrumentHandler() (#1302)
This change replaces the deprecated InstrumentHandler function by
the equivalent functions from the promhttp package.

The following metrics are removed:

* http_request_duration_microseconds (Summary).
* http_request_size_bytes (Summary).
* http_requests_total (Counter).

And the following metrics are added instead:

* alertmanager_http_request_duration_seconds (Histogram).
* alertmanager_http_response_size_bytes (Histogram).
* promhttp_metric_handler_requests_in_flight (Gauge).
* promhttp_metric_handler_requests_total (Counter).
2018-03-28 15:28:38 +02:00
Simon Pasquier
1531aa66f3 Fix for #1282 (#1286)
* cluster: add alertmanager_cluster_messages_queued metric

* cluster: add metrics for sent messages

This change adds 2 new metrics:

- alertmanager_cluster_messages_sent_total
- alertmanager_cluster_messages_sent_size_total

* Fix marshaling for entries being broadcast

Individual notifications logs and silences being broadcast to the other
peers need to be encoded using the same length-delimited format as when
doing full-state synchronization.

* main: fix argument order for cluster.Join()

cluster.Join() was called with the push/pull and gossip interval
parameters being swapped one for another.
2018-03-22 13:53:00 +01:00
Brian Brazil
bd04da5480 Remove debugging code (#1291) 2018-03-18 12:42:24 +01:00
Stuart Nelson
8f1c16eaa9 Update flag help text
Start all help texts with a capital letter, end
with a period.

There were some additional things that got caught
by gofmt/goimports.
2018-03-07 10:04:30 +01:00
Corentin Chary
dd75201f1c Add /-/ready based on mesh status (#1209)
* Wait for the gossip to settle before sending notifications

See #1209 for details.

As an heuristic for mesh readyness, try to see if
the mesh looks stable (the number of peers isn't changing too much).
This implementation always mark the altermanager as ready after a maximum of 60s.

This adds one new flags to control this behavior:
```
      --cluster.settle-timeout=60s  mesh settling timeout. Do not wait more than this duration on startup.
```

It also adds `/-/ready` which always return 200 (in order to make it clear
that we are ready as soon as we can receive requests).

The mesh status is exposed in `/api/v1/status` and visible on `/#/status`.

* cluster: fix typos and base interval on gossipInterval
2018-03-02 15:45:21 +01:00
pasquier-s
e8a92f65ef Run staticcheck as part of the build process (#1264)
This change also fixes potential issues highlighted by running
staticcheck.
2018-02-28 17:42:32 +01:00
pasquier-s
3df093968c cluster: gather alertmanager_peer_position all the time (#1247)
* cluster: gather alertmanager_peer_position all the time

This change moves the gathering of the alertmanager_peer_position metric
outside of the clusterWait() function so that the metric is computed
accurately even when no alerting group fires.

* cluster: add alertmanager_cluster_health_score metric

This metric is retrieved from the memberlist library.
2018-02-27 10:37:56 +01:00
stuart nelson
0f9c9a0bb0
Remove unused functions for mesh (#1251)
These functions were used with weaveworks/mesh,
but are no longer needed with memberlist.
2018-02-16 18:16:06 +01:00
Julien Pivotto
dc293439ca cluster: Make peer timeout configurable
Signed-off-by: Julien Pivotto <roidelapluie@inuits.eu>
2018-02-13 16:31:33 +01:00
Simon Pasquier
9d16fe8266 cmd: remove unused code 2018-02-13 14:20:54 +01:00
Fabian Reinartz
6cfbe6e8b4 update cluster listen address flag 2018-02-12 10:22:49 +01:00
Fabian Reinartz
fd49dbb477 *: move to memberlist for clustering 2018-02-08 12:18:44 +01:00
pasquier-s
17bd637c97 Add mesh metrics (#1225)
* Add mesh metrics

This change adds 2 new metrics for the mesh:

* alertmanager_peer_connection, state of the connection between the
  Alertmanager instance and a peer.
* alertmanager_peer_terminations_total, total number of terminated
  connection.

It also moves the gathering of the alertmanager_peer_position metric
outside of the meshWait() function so that the metric is computed
accurately even when no alerting group fires.

* Remove 'nick' label from alertmanager_peer_connection metric
2018-02-06 12:13:52 +01:00
stuart nelson
3c61fe3fef
Return reload status from http endpoint (#1152) (#1180)
* Return reload status from http endpoint (#1152)

* Use same reload messaging as prometheus
2018-01-08 11:51:05 +01:00
Calle Pettersson
b7da058efb Switch cmd/alertmanager to kingpin (#974) 2018-01-06 11:22:26 +01:00
Calle Pettersson
608848390f Switch amtool to kingpin (#976)
* Switch cmd/amtool to kingpin

* Touch-ups

* Implement long help

* Add missing short-form of --output

* Fix backwards compatibility for config file options

* Fix vendoring

* Review fixes

* Fix flag word order
2017-12-22 11:17:13 +01:00
stuart nelson
481eab7b83 Make alertGC interval configurable 2017-12-19 15:36:38 +01:00
pasquier-s
06f9a4ad1d Fix logging for the mesh component (#1145) 2017-12-14 16:05:59 +01:00
Julius Volz
f64a419853 Fix shutdown crash with nil mesh router 2017-11-03 23:44:05 +01:00
Frederic Branczyk
029c70d6fe Allow enabling mutex and block profiles (#1073) 2017-11-02 12:18:45 +01:00
Julius Volz
947970af44 Convert Alertmanager to use non-global go-kit loggers
Fixes https://github.com/prometheus/alertmanager/issues/1040
2017-10-22 00:20:40 -07:00
Frederic Branczyk
620fff4e4f add metric of alertmanager position in mesh (#1024) 2017-10-06 18:37:44 +02:00
Corentin Chary
bff889b490 silence|alerts: add metrics about current silences and alerts
This adds metrics that look like this:
```
alertmanager_alerts{state="active"} 6
alertmanager_alerts{state="suppressed"} 0
alertmanager_silences{state="active"} 1
alertmanager_silences{state="expired"} 1
alertmanager_silences{state="pending"} 0
```

This can be used to monitor alertmanager's usage and validate that
alertmanagers in a mesh have a similar number of silences and alerts.
2017-10-02 13:33:29 +02:00
Jack Neely
0dfdda3074 Use logging options consistently for all components #967 (#968) 2017-09-02 11:24:11 +02:00
Julius Volz
b78869e749 Fix crash when no mesh router is configured (#919)
* Fix crash when no mesh router is configured

This adds a check for `meshListen != ""` around the waitFunc code as we
have around the other mesh-related code parts above.

Fixes https://github.com/prometheus/alertmanager/issues/914

* Update bindata
2017-07-22 10:56:55 +02:00
Frederic Branczyk
c4c0875ba3
fix config JSON marshaling 2017-06-08 13:37:57 +02:00
stuart nelson
2cf38e4c2e Fix external web url (#836)
Infer path from Navigation.Location

Build uses template, local dev uses elm-reactor

Remove unneeded local dev go server

Add script.js make target

Compiles and uglifies script.js

Before:
~570kb

After:
~170kb

Bootstrap loading state

Add trailing slash via JS & add routePrefix console param

Add Javascript script tag to `index.html` which adds a trailing slash to
the url pathname if none is present. This is done to ensure assets like
`script.js` are loaded properly.

Example without patch:
If the pathname is "mxinden.com/alertmanager" the browser will try to
download the `script.js` asset from "mxinden.com/script.js". This
request will fail.

Example with patch:
If the pathname is "mxinden.com/alertmanager", Javascript redirects the
browser to "mxinden.com/alertmanager/" and then the `script.js` asset
will be downloaded from "mxinden.com/alertmanager/script.js". This
request will succeed.

Add `-web.route-prefix` as a console parameter. This configures a
Prefix for the internal routes of web endpoints. Defaults to path of
-web.external-url like in *Prometheus*.

Trim slashes off of route prefix and add one slash at the beginning.
Make sure route prefix is not empty or just a slash before prefixing
router.
2017-06-07 22:38:39 +02:00
Conor Broderick
02484d18b0 Added option to disable AM listening on mesh port (#764) 2017-05-31 11:05:18 +01:00
Max Leonard Inden
7e6de53f3a
Merge branch 'master' into ui-rewrite 2017-05-15 09:58:14 +02:00
Fabian Reinartz
8ccb95c9f5 Log mesh messages at debug level 2017-05-09 16:07:27 +02:00
Fabian Reinartz
672e9b205f vendor: update mesh 2017-05-09 15:03:27 +02:00
Max Leonard Inden
ef3cc7b001
Return alert status on /alerts endpoint and enable filtering
This makes the /alerts endpoint return the alert status introduced by PR
 # 717. In addition it enables alert filtering via matchers like on the
/alerts/groups endpoint.
2017-05-02 09:46:20 +02:00
Frederic Branczyk
3eb81b4243 Add config hash metric (#751) 2017-04-27 20:58:15 +02:00
stuart nelson
6a909abf17 Add processing status field to alert 2017-04-27 14:18:52 +02:00
Tom Wilkie
ba4fc17307 Remove use of route.Context 2017-04-25 17:43:01 -07:00
Fabian Reinartz
7e31a58868 cmd: fix panic on empty peer string 2017-04-21 14:40:46 +02:00
Kellen Fox
3aab66ec3a Amtool implementation (#636)
* Implement alertmanager cli tool 'amtool'

The primary goal of an alertmanager tool is to provide a cli interface
for the prometheus alertmanager.

My vision for this tool has two parts:
  - Silence management (query, add, delete)

  - Alert management (query, maybe more in future?)

Resolves: #567
2017-04-20 11:04:17 +02:00
stuart nelson
1e34f29532 Filter alerts (#633)
* Vendor dependencies.

This updates several old dependencies, removes
some that are no longer needed, and adds
`pkg/labels` from prometheus `dev-2.0` branch.

* Add metrics selector parsing code

This is a temporary simplified re-implementation
of promQL's metric selector parsing.

* Add alerts filtering

Filter alerts through `?filter=` query string.

* Add silences filtering

Filter silences through `?filter=` query string.

* Move `parse` to `pkg/parse`
2017-03-16 11:16:10 +01:00
Max Inden
fced9e126b api: Expose mesh status (#644)
* api: Expose mesh status

The weaveworks mesh package reveals information about the current status
of the mesh network between alertmanager instances. This commit exposes
the current address and connection status of each instance connected to
the targeted alertmanager instance via the /status API endpoint.

* api: Replace LocalConnectionStatus with PeerStatus

Now meshStatus contains all peers (Name, NickName, UID) of the network.
Additionally adding Name and Nickname of current target to toplevel of
meshNetwork object.
2017-03-07 18:17:03 +01:00
stuart nelson
24a9a64bdf Only find MAC address if no command-line flag value given (#638)
* Find MAC address if mesh.hardware-addr not given

Defaulting to the machine's MAC address fails
sometimes fails and causes a panic. Allow the user
to specify custom address to skip this so they can
run AlertManager.

* -mesh.hardware-address -> -mesh.peer-id

* Fix command-line invocation
2017-02-28 14:57:45 +01:00
Matt Bostock
890f148a27 Prevent panic on failed config load
Prevent Alertmanager from panicking when the configuration cannot be
loaded.

Spotted in version 0.4.2:

    INFO[0000] Starting alertmanager (version=0.4.2, branch=HEAD, revision=9a5ab2fa63dd7951f4f202b0846d4f4d8e9615b0)  source=main.go:84
    INFO[0000] Build context (go=go1.7.3, user=root@45f28166fed1, date=20170117-13:50:50)  source=main.go:85
    INFO[0000] Loading configuration file                    file=alertmanager.yml source=main.go:156
    ERRO[0000] error: yaml: line 64: could not find expected ':'  source=api.go:115
    panic: runtime error: invalid memory address or nil pointer dereference
    [signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x401adb]

    goroutine 1 [running]:
    panic(0x83e240, 0xc42000e020)
            /usr/local/go/src/runtime/panic.go:500 +0x1a1
    main.(*API).Update(0xc4200b20f0, 0xc42019f800, 0x17e7, 0x45d964b800)
            /go/src/github.com/prometheus/alertmanager/api.go:117 +0xcb
    main.main.func3(0x0, 0x0)
            /go/src/github.com/prometheus/alertmanager/main.go:172 +0x226
    main.main()
            /go/src/github.com/prometheus/alertmanager/main.go:192 +0x97a
    make: *** [run] Error 2
2017-01-17 14:59:11 +00:00
Fabian Reinartz
a2666e6b31 vendoring: update 2016-12-01 13:49:28 +01:00
Fabian Reinartz
1e01b2bdba nflog: add metrics (#518) 2016-11-21 15:22:35 +01:00
Frederic Branczyk
8a2f93a102
*: allow use of mesh encryption through password parameter 2016-10-07 16:19:42 +02:00
Fabian Reinartz
7517453c68 silence: add metrics 2016-09-29 09:54:34 +02:00
Fabian Reinartz
bf6d47934e Merge pull request #489 from prometheus/timeout
*: consider mesh wait in notification timeouts
2016-09-06 14:17:53 +02:00
Fabian Reinartz
b2461bb2d4 *: remove go-kit logging 2016-09-06 11:56:57 +02:00
Fabian Reinartz
e9fbe62e0f *: consider mesh wait in notification timeouts
This adds the peer wait duration to the standard timeout to avoid
terminating a notification prematurely while being in failover
wait status.
2016-09-05 13:21:28 +02:00
Fabian Reinartz
8d88d9e05b Merge pull request #481 from prometheus/fabxc-meshsil
*: integrate new silence package
2016-08-30 16:53:34 +02:00
Fabian Reinartz
a4e8703567 *: integrate new silence package 2016-08-30 12:15:23 +02:00
Fabian Reinartz
fcda6fede3 Merge pull request #458 from pdbogen/pdbogen-error-on-nonflag-args
Pdbogen error on nonflag args
2016-08-23 14:25:53 +02:00
Fabian Reinartz
5dc8286942 nflog: fix maintenance termination 2016-08-19 12:01:16 +02:00
Fabian Reinartz
72fdf3d3ab *: integrate nflog
This commit replaces the previous NotifyInfo provider with the new
nflog package. It needs adjustments in the behavior of the deduping
stage.
The nflog stores notification digests per receiver per alert aggregation
group rather than one entry for alert per receiver. This drastically
reduces the number of entries and removes interference
across aggregation groups.
2016-08-18 15:52:28 +02:00
Patrick Bogen
ca844915b3
alertmanager should raise an error if unexpected arguments are present on the command line 2016-08-16 11:15:45 -07:00
Frederic Branczyk
7bc851e894 rework building of stage pipelines 2016-08-16 10:56:46 +02:00
Frederic Branczyk
840dd7d2f5 introduce Stage interface 2016-08-12 16:01:40 +02:00
Frederic Branczyk
3dfb17e601 refactor notification pipeline
move hard to read backwards declared approach to more transparent
pipeline approach with more detailed interfaces
2016-08-11 15:04:03 +02:00
Fabian Reinartz
3931d4e64b *: restructure package tree
This commit packages up individual modules and removes the top-level
main package.
2016-08-09 14:24:52 +02:00