1631 Commits

Author SHA1 Message Date
Ben Chess
f10ebba89d
Email is green if only none firing (#1475)
Signed-off-by: Benjamin Chess <bchess@gmail.com>
2018-08-14 10:36:51 +02:00
Max Inden
2d3385f9dd
notify: Improve error handling (#1474)
- `tmplText` and `tmplHTML` are using a monad-style error handling [1].
This reduces the verbosity of the error logic, but introduces the risk
of forgetting the final error check. This patch does not remove this
coding-style, but ensures proper error checking in the Email and
PagerDuty notifier.

- Ensure to handle errors returned by `multipartWriter.Close()` and
`wc.Write(buffer.Bytes())` in `Email.Notify()`.

[1] https://www.innoq.com/en/blog/golang-errors-monads/

Signed-off-by: Max Leonard Inden <IndenML@gmail.com>
2018-08-14 10:36:07 +02:00
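The monad-style error handling mentioned in 2d3385f9dd can be illustrated with a minimal Go sketch. The `renderer` type and `buildMessage` function are hypothetical stand-ins, not Alertmanager's actual `tmplText`/`tmplHTML` helpers; the point is only that the first error is remembered, later calls become no-ops, and the caller must not forget the final check.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// renderer is a hypothetical helper: it remembers the first error and
// turns every later call into a no-op.
type renderer struct {
	err error
}

// render upper-cases its input and fails on an empty string.
// The transformation itself is purely illustrative.
func (r *renderer) render(s string) string {
	if r.err != nil {
		return ""
	}
	if s == "" {
		r.err = errors.New("empty template")
		return ""
	}
	return strings.ToUpper(s)
}

// buildMessage renders two fields without per-call error checks and
// performs the single final check that is easy to forget.
func buildMessage(subject, body string) (string, error) {
	r := &renderer{}
	s := r.render(subject)
	b := r.render(body)
	if r.err != nil { // the final error check
		return "", r.err
	}
	return s + ": " + b, nil
}

func main() {
	msg, err := buildMessage("alert", "disk full")
	fmt.Println(msg, err)
	_, err = buildMessage("alert", "")
	fmt.Println(err)
}
```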
Mark Van De Weert
080cbe4e23
enable templating of hipchat room_id (#1463)
Signed-off-by: Mark Van De Weert <mark.vandeweert@wpengine.com>
2018-08-14 10:35:14 +02:00
Max Inden
8397de1830
Merge pull request #1462 from mxinden/cut-0.15.1
*: Cut 0.15.1
2018-07-12 20:21:50 +02:00
Max Leonard Inden
b0cb197aa1
*: Cut 0.15.1
Signed-off-by: Max Leonard Inden <IndenML@gmail.com>
2018-07-11 17:21:00 +02:00
Max Inden
2c7c5b6f4e
cluster: Do not exit when failing to join cluster (#1465)
Alertmanager is exiting with a non-zero exit code if the initial cluster
join fails. This behavior may be undesirable because:

- As Alertmanager is a critical component with an at-least-once
guarantee, failing on joining the cluster is unnecessary as
Alertmanager still functions by itself.

- In an environment like Kubernetes discovering peers via DNS, peers
might roll out one-by-one, leaving the DNS entries unpopulated for the
first peer of a set. Failing on initial join prevents a roll-out.

Instead of failing on the initial join this patch only logs the failure.
The cluster can be later joined via the `handleReconnect`.

This is a regression introduced in PR #1456 [1].

[1] https://github.com/prometheus/alertmanager/pull/1456

Signed-off-by: Max Leonard Inden <IndenML@gmail.com>
2018-07-11 17:20:58 +02:00
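The behavior described in 2c7c5b6f4e (log the failed initial join instead of exiting) could look roughly like the Go sketch below. `joinFunc`, `joinOrLog` and `unreachable` are hypothetical names for illustration; the real code path and the `handleReconnect` logic are not reproduced here.

```go
package main

import (
	"errors"
	"fmt"
)

// joinFunc is a hypothetical stand-in for the call that joins the cluster.
type joinFunc func(peers []string) (int, error)

// joinOrLog attempts the initial join. On failure it only logs and
// returns false, so the process keeps running and a later reconnect
// loop can retry.
func joinOrLog(join joinFunc, peers []string, logf func(format string, args ...interface{})) bool {
	n, err := join(peers)
	if err != nil {
		logf("failed to join cluster: %v", err)
		return false
	}
	logf("joined cluster with %d peers", n)
	return true
}

// unreachable simulates DNS entries that are not populated yet.
func unreachable(peers []string) (int, error) {
	return 0, errors.New("no such host")
}

func main() {
	logf := func(format string, args ...interface{}) {
		fmt.Printf(format+"\n", args...)
	}
	if !joinOrLog(unreachable, []string{"am-0.example:9094"}, logf) {
		logf("continuing as a single node; a reconnect loop may join later")
	}
}
```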
bigMacro
bbc2da9294
fix concurrent read and write group error (#1447)
* fix concurrent read and write group

Signed-off-by: denghuan <denghuan@actionsky.com>

* make lock more elegant

Signed-off-by: denghuan <denghuan@actionsky.com>
2018-07-10 22:04:15 +02:00
Corentin Chary
9988e14b0f
cluster: make sure we don't miss the first pushPull (#1456)
* cluster: make sure we don't miss the first pushPull

During the join, memberlist initiates a pushPull to get initial data.
Unfortunately, at this point the nflog and silence listener have not
been registered yet, so the first data arrives only after one pushPull
cycle (1 min by default!).

Signed-off-by: Corentin Chary <c.chary@criteo.com>
2018-07-10 22:02:57 +02:00
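The ordering problem described in 9988e14b0f can be shown with a toy model in Go. The `cluster` type below is a hypothetical simplification, not memberlist's API: `join` delivers the initial state only to listeners that are already registered, which is why registration has to happen first.

```go
package main

import "fmt"

// cluster is a toy model: join delivers the initial state to whatever
// listeners exist at that moment, like the first pushPull.
type cluster struct {
	listeners []func(data string)
}

func (c *cluster) register(l func(data string)) {
	c.listeners = append(c.listeners, l)
}

// join simulates the initial pushPull triggered by joining.
func (c *cluster) join(initial string) {
	for _, l := range c.listeners {
		l(initial)
	}
}

// receivedOnJoin returns what a listener saw during the join,
// depending on whether it was registered before or after joining.
func receivedOnJoin(registerFirst bool) []string {
	var got []string
	c := &cluster{}
	listener := func(data string) { got = append(got, data) }
	if registerFirst {
		c.register(listener)
	}
	c.join("initial state")
	if !registerFirst {
		c.register(listener) // too late: the first pushPull is gone
	}
	return got
}

func main() {
	fmt.Println("registered before join:", len(receivedOnJoin(true)))
	fmt.Println("registered after join: ", len(receivedOnJoin(false)))
}
```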
Simon Pasquier
b15f99533b
cluster: fail when no private address can be found (#1437)
The memberlist library fails when it can't find a private address and no
advertise address is given. To return a helpful message to the user,
AlertManager mimics the logic from memberlist. However the code had a
bug that swallowed the error message and made it difficult for the user
to understand how to fix the problem.

Signed-off-by: Simon Pasquier <spasquie@redhat.com>
2018-07-10 22:01:40 +02:00
Simon Pasquier
dc33b6a155
notify: catch templating errors for Wechat
Signed-off-by: Simon Pasquier <spasquie@redhat.com>
2018-07-10 21:59:26 +02:00
Simon Pasquier
8091a19c7f
config: fix regression with Pager Duty (#1455)
The YAML strict mode doesn't allow mapping keys that are duplicates. If
someone wants to override one of the default keys in the Details hash,
the unmarshal function returns an error because the key is already
defined by DefaultPagerdutyConfig.

Signed-off-by: Simon Pasquier <spasquie@redhat.com>
2018-07-10 21:56:40 +02:00
Guido García
8b78591fff
fix: email template typo in alert-warning style
Signed-off-by: Guido García <guido.garciabernardo@telefonica.com>
2018-07-10 21:55:51 +02:00
Frederic Branczyk
898cfbe3f2
Merge pull request #1432 from mxinden/improve-change-changelog
CHANGELOG.md: Improve [CHANGE] section of v0.15.0 release
2018-06-25 10:20:02 +02:00
Max Leonard Inden
6d9ccfa624
CHANGELOG.md: Improve [CHANGE] section of v0.15.0 release
- Add entry for working dir change in Alertmanager Docker image
- Indicate cluster flag changes

Signed-off-by: Max Leonard Inden <IndenML@gmail.com>
2018-06-23 00:56:30 +08:00
Max Inden
1322a5a6e4
*: Cut 0.15.0 (#1429)
Signed-off-by: Max Leonard Inden <IndenML@gmail.com>
2018-06-22 19:54:50 +08:00
Max Inden
5e86f61bd7
Merge pull request #1419 from mxinden/cut-0.15.0-rc.3
*: Cut 0.15.0-rc.3
2018-06-18 12:17:42 +02:00
Max Leonard Inden
17e2fc7c2b
*: Cut 0.15.0-rc.3
Signed-off-by: Max Leonard Inden <IndenML@gmail.com>
2018-06-16 10:09:30 +02:00
Simon Pasquier
7a272416de
cluster: prune the queue if it contains too many items (#1418)
* cluster: prune the queue if too large

Signed-off-by: Simon Pasquier <spasquie@redhat.com>

* Address review comments

Also increases the pruning interval to 15 minutes and the max queue size
to 4096 items (same value as used by Serf).

Signed-off-by: Simon Pasquier <spasquie@redhat.com>
2018-06-15 18:08:12 +02:00
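As a rough illustration of the pruning in 7a272416de, the Go sketch below caps a queue at 4096 items (the value from the commit message). `pruneQueue` is a hypothetical helper, and keeping the newest entries is an assumption; in a real service it would presumably be called from a ticker at the 15-minute interval mentioned above.

```go
package main

import "fmt"

// maxQueueSize is the limit named in the commit message.
const maxQueueSize = 4096

// pruneQueue keeps at most max entries. Dropping the oldest entries
// (the front of the slice) is an assumption made for this sketch.
func pruneQueue(queue []string, max int) []string {
	if len(queue) <= max {
		return queue
	}
	return queue[len(queue)-max:]
}

func main() {
	queue := make([]string, 0, 5000)
	for i := 0; i < 5000; i++ {
		queue = append(queue, fmt.Sprintf("msg-%d", i))
	}
	queue = pruneQueue(queue, maxQueueSize)
	fmt.Println(len(queue), queue[0])
}
```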
stuart nelson
445fbdf1a8
gossip large messages via SendReliable (#1415)
* Gossip large messages via SendReliable

For messages beyond half of the maximum gossip
packet size, send the message to all peer nodes
via TCP.

The choice of "larger than half the max gossip
size" is relatively arbitrary. From brief testing,
the overhead from memberlist on a packet seemed to
only use ~3 of the available 1400 bytes, and most
gossip messages seem to be <<500 bytes.

* Add tests for oversized/normal message gossiping

* Make oversize metric names consistent

* Remove errant printf in test

* Correctly increment WaitGroup

* Add comment for OversizedMessage func

* Add metric for oversized messages dropped

Code was added to drop oversized messages if the
buffered channel they are sent on is full. This
is a good thing to surface as a metric.

* Add counter for total oversized messages sent

* Change full queue log level to debug

Was previously a warning, which isn't necessary
now that there is a metric tracking it.

Signed-off-by: stuart nelson <stuartnelson3@gmail.com>
2018-06-15 13:40:21 +02:00
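The size threshold and the drop-when-full behavior from 445fbdf1a8 can be sketched in Go as follows. The 1400-byte packet size comes from the commit message; `oversized`, `sender` and its plain integer counter are hypothetical simplifications (a real metric would be a concurrency-safe counter, and the actual send over TCP is omitted).

```go
package main

import "fmt"

// maxGossipPacketSize is the packet size quoted in the commit message.
const maxGossipPacketSize = 1400

// oversized reports whether a message exceeds half the gossip packet
// size and should therefore be sent reliably instead of gossiped.
func oversized(b []byte) bool {
	return len(b) > maxGossipPacketSize/2
}

// sender queues oversized messages on a buffered channel and counts
// the ones it had to drop because the channel was full.
type sender struct {
	queue   chan []byte
	dropped int
}

// enqueue performs a non-blocking send; it returns false on a drop.
func (s *sender) enqueue(b []byte) bool {
	select {
	case s.queue <- b:
		return true
	default:
		s.dropped++
		return false
	}
}

func main() {
	s := &sender{queue: make(chan []byte, 1)}
	for _, size := range []int{100, 800, 900} {
		msg := make([]byte, size)
		if !oversized(msg) {
			fmt.Println(size, "bytes: regular gossip")
			continue
		}
		fmt.Println(size, "bytes: queued for reliable send:", s.enqueue(msg))
	}
	fmt.Println("dropped:", s.dropped)
}
```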
Simon Pasquier
8034f137e1
cluster: don't track FQDN addresses as initial peers (#1416)
Signed-off-by: Simon Pasquier <spasquie@redhat.com>
2018-06-15 12:34:50 +02:00
Simon Pasquier
6a7c912559
Sort alerts in correct order (#1349)
* Sort dispatched alerts by job+instance in the correct order (#1178)

Signed-off-by: Ted Zlatanov <tzz@lifelogs.com>

* dispatch: add unit test for alerts sorting

Signed-off-by: Simon Pasquier <spasquie@redhat.com>
2018-06-14 15:54:33 +02:00
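Sorting by job and then instance, as in 6a7c912559, might be sketched in Go like this. The `alert` struct is hypothetical, and ascending order on both labels is an assumption, since the commit message does not spell out which order is the correct one.

```go
package main

import (
	"fmt"
	"sort"
)

// alert is a hypothetical, stripped-down alert carrying only labels.
type alert struct {
	labels map[string]string
}

// sortAlerts orders alerts by the "job" label and breaks ties with the
// "instance" label. Ascending order is assumed for this sketch.
func sortAlerts(alerts []alert) {
	sort.SliceStable(alerts, func(i, j int) bool {
		a, b := alerts[i].labels, alerts[j].labels
		if a["job"] != b["job"] {
			return a["job"] < b["job"]
		}
		return a["instance"] < b["instance"]
	})
}

func main() {
	alerts := []alert{
		{labels: map[string]string{"job": "node", "instance": "b"}},
		{labels: map[string]string{"job": "api", "instance": "z"}},
		{labels: map[string]string{"job": "node", "instance": "a"}},
	}
	sortAlerts(alerts)
	for _, a := range alerts {
		fmt.Println(a.labels["job"], a.labels["instance"])
	}
}
```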
Simon Pasquier
387e684faa
vendor: update prometheus/common packages (#1414)
Signed-off-by: Simon Pasquier <spasquie@redhat.com>
2018-06-13 16:11:22 +02:00
Simon Pasquier
0c512998ee
Use Makefile.common from Prometheus (#1396)
* Include Makefile.common
* Fix the bindata.go files to make the style target happy
* Inline `.PHONY` statements

Signed-off-by: Simon Pasquier <spasquie@redhat.com>
2018-06-13 14:41:52 +02:00
stuart nelson
77cc718a81
[nflog] register snapshotSize
This metric was never registered.
2018-06-12 13:59:48 +02:00
stuart nelson
d259bf9d09
Check for advertise host when setting failed peers (#1411)
When setting initially failing peers, if we don't
have a value for the advertise address, use the
bindAddr.

Signed-off-by: stuart nelson <stuartnelson3@gmail.com>
2018-06-11 14:18:15 +02:00
stuart nelson
ec2cc57d28
0.15.0-rc.2 (#1410)
Signed-off-by: stuart nelson <stuartnelson3@gmail.com>
2018-06-08 14:38:37 +02:00
stuart nelson
6305229fcc
fix set initial failed peers (#1407)
* Correctly add Node to initially failed peer

Reconnect attempts to failed peers were panicking
because peer.Address() would attempt to access the
nil Node struct member.

Signed-off-by: stuart nelson <stuartnelson3@gmail.com>

* Correctly remove old peers

Again, since we aren't assigning a name (this is
generated) we rely on the node's Address for
removing the initially joining (and potentially
later re-joining) peers

Signed-off-by: stuart nelson <stuartnelson3@gmail.com>

* Test the peerJoin removes initial peers

Signed-off-by: stuart nelson <stuartnelson3@gmail.com>

* Don't add self to failing peers list

The initially failing peers list shouldn't include
the bindAddr for the alertmanager itself, as this
connection is never made, and consequently only
removed from the failedPeers list after the failed
peer timeout.

Signed-off-by: stuart nelson <stuartnelson3@gmail.com>

* Filter initialFailed with advertise addr

This may differ from bindAddr, and is the value we
want to not attempt to connect to.

Signed-off-by: stuart nelson <stuartnelson3@gmail.com>
2018-06-08 12:34:52 +02:00
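The filtering described in the last two bullets of 6305229fcc can be sketched in Go as below. `initialFailed` is a hypothetical helper; `self` stands for the instance's advertise address, which may differ from the bind address.

```go
package main

import "fmt"

// initialFailed returns the configured peers that should start out in
// the failed-peers list, leaving out this instance's own address so it
// never tries to reconnect to itself.
func initialFailed(peers []string, self string) []string {
	failed := make([]string, 0, len(peers))
	for _, p := range peers {
		if p == self {
			continue
		}
		failed = append(failed, p)
	}
	return failed
}

func main() {
	peers := []string{"10.0.0.1:9094", "10.0.0.2:9094", "10.0.0.3:9094"}
	fmt.Println(initialFailed(peers, "10.0.0.2:9094"))
}
```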
stuart nelson
36588c3865
memberlist gossip (#1389)
* Peers further propagate newly received nflogs

If a peer receives an nflog that it hasn't seen
before, queue the message and propagate it further
to other peers. This should ensure that all
peers within a cluster receive all gossip
messages.

Signed-off-by: stuart nelson <stuartnelson3@gmail.com>

* Set Retransmit value based on number of members

For alertmanagers that are brought up with a list
of peers, set the number of message retransmits to
be half of that number. If there are no peers on
start, or there are few, continue to use the
default value of 3.

Signed-off-by: stuart nelson <stuartnelson3@gmail.com>

* [nflog] Move retransmit calculation

Signed-off-by: stuart nelson <stuartnelson3@gmail.com>

* [silence] further gossip silence messages

Signed-off-by: stuart nelson <stuartnelson3@gmail.com>

* Set GossipNodes to equal RetransmitMulti

During a gossip, we send messages to at most
GossipNodes nodes. If possible, we want a message
to reach all nodes as soon as possible.

Signed-off-by: stuart nelson <stuartnelson3@gmail.com>

* Fix rebase

Signed-off-by: stuart nelson <stuartnelson3@gmail.com>
2018-06-08 11:48:42 +02:00
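The retransmit rule from 36588c3865 (half the number of known peers, but never below the default of 3) is small enough to sketch directly in Go. The function name `retransmit` is chosen for illustration only.

```go
package main

import "fmt"

// defaultRetransmit is the default value named in the commit message.
const defaultRetransmit = 3

// retransmit returns half the number of known peers, falling back to
// the default when there are no peers or only a few.
func retransmit(peers int) int {
	r := peers / 2
	if r < defaultRetransmit {
		return defaultRetransmit
	}
	return r
}

func main() {
	for _, n := range []int{0, 4, 10, 25} {
		fmt.Println(n, "peers ->", retransmit(n), "retransmits")
	}
}
```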
Simon Pasquier
b7d891cf39
notify: notify resolved alerts properly (#1408)
* notify: notify resolved alerts properly

PR #1205, while fixing an existing issue, introduced another bug that occurs when
the send_resolved flag of the integration is set to true.

With send_resolved set to false, the semantics remain the same:
AlertManager generates a notification when new firing alerts are added
to the alert group. The notification only carries firing alerts.

With send_resolved set to true, AlertManager generates a notification
when new firing or resolved alerts are added to the alert group. The
notification carries both the firing and resolved alerts.

Signed-off-by: Simon Pasquier <spasquie@redhat.com>

* Fix comments

Signed-off-by: Simon Pasquier <spasquie@redhat.com>
2018-06-08 11:37:38 +02:00
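The send_resolved semantics described in b7d891cf39 can be approximated with the Go sketch below. It is a simplification under stated assumptions: alerts are plain strings, previously notified alerts are kept in sets, and repeat intervals are ignored. `needsNotification` and `payload` are hypothetical names.

```go
package main

import "fmt"

// hasNew reports whether current contains an alert that is not in sent.
func hasNew(current []string, sent map[string]bool) bool {
	for _, a := range current {
		if !sent[a] {
			return true
		}
	}
	return false
}

// needsNotification: new firing alerts always trigger a notification;
// new resolved alerts trigger one only when sendResolved is true.
func needsNotification(firing, resolved []string, sentFiring, sentResolved map[string]bool, sendResolved bool) bool {
	if hasNew(firing, sentFiring) {
		return true
	}
	return sendResolved && hasNew(resolved, sentResolved)
}

// payload returns the alerts carried by the notification: firing only,
// or firing plus resolved when sendResolved is true.
func payload(firing, resolved []string, sendResolved bool) []string {
	if !sendResolved {
		return firing
	}
	return append(append([]string{}, firing...), resolved...)
}

func main() {
	firing := []string{"DiskFull"}
	resolved := []string{"HighLoad"}
	sentFiring := map[string]bool{"DiskFull": true}
	sentResolved := map[string]bool{}

	fmt.Println(needsNotification(firing, resolved, sentFiring, sentResolved, false))
	fmt.Println(needsNotification(firing, resolved, sentFiring, sentResolved, true))
	fmt.Println(payload(firing, resolved, true))
}
```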
Simon Pasquier
9f87f9d6e7
cluster: advertise explicitly for empty addresses (#1386)
memberlist doesn't advertise a valid IP address when the bind address is
empty (":8001") or the unspecified IPv6 address ("[::]:8001").

Signed-off-by: Simon Pasquier <spasquie@redhat.com>
2018-06-07 17:57:01 +02:00
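Detecting the two bind-address cases named in 9f87f9d6e7 might look like the following Go sketch, built on the standard `net` package. `needsExplicitAdvertise` is a hypothetical helper; note that it also treats the IPv4 unspecified address `0.0.0.0` as needing an explicit advertise address, which goes beyond the two cases listed in the commit message.

```go
package main

import (
	"fmt"
	"net"
)

// needsExplicitAdvertise reports whether the host part of bindAddr is
// empty or an unspecified IP, in which case an advertise address has
// to be chosen explicitly.
func needsExplicitAdvertise(bindAddr string) (bool, error) {
	host, _, err := net.SplitHostPort(bindAddr)
	if err != nil {
		return false, err
	}
	if host == "" {
		return true, nil
	}
	ip := net.ParseIP(host)
	return ip != nil && ip.IsUnspecified(), nil
}

func main() {
	for _, addr := range []string{":8001", "[::]:8001", "10.0.0.1:8001"} {
		need, err := needsExplicitAdvertise(addr)
		fmt.Println(addr, need, err)
	}
}
```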
Kellen Fox
b949f0dc19
Amtool: Implement filter by receiver fixes:#937 (#1402)
* Amtool: Implement filter by receiver

* Adds receiver flag to amtool alert query
* Adds receiver argument to alert http client
* Updates http client tests for added argument

Also works: specifying `receiver: "receiver-123"` in config file
automatically filters all alerts shown

* Include receiver in amtool config docs

Now that I've implemented the receiver in amtool I should add the new
feature to the documentation.

*  #937 Add mention of supporting regex syntax to receiver flag
2018-06-07 09:21:12 +02:00
stuart nelson
db4af95ea0
memberlist reconnect (#1384)
* initial impl

Signed-off-by: stuart nelson <stuartnelson3@gmail.com>

* Add reconnectTimeout

Signed-off-by: stuart nelson <stuartnelson3@gmail.com>

* Fix locking

Signed-off-by: stuart nelson <stuartnelson3@gmail.com>

* Remove unused PeerStatuses

Signed-off-by: stuart nelson <stuartnelson3@gmail.com>

* Add metrics

Signed-off-by: stuart nelson <stuartnelson3@gmail.com>

* Actually use peerJoinCounter

Signed-off-by: stuart nelson <stuartnelson3@gmail.com>

* Cleanup peers map on peer timeout

Signed-off-by: stuart nelson <stuartnelson3@gmail.com>

* Add reconnect test

Signed-off-by: stuart nelson <stuartnelson3@gmail.com>

* test removing failed peers

Signed-off-by: stuart nelson <stuartnelson3@gmail.com>

* Use peer address as map key

If a peer is restarted, it will rejoin with the
same IP but different ULID. So the node will
rejoin the cluster, but its peers will never
remove it from their internal list of failed nodes
because its ULID has changed.

Signed-off-by: stuart nelson <stuartnelson3@gmail.com>

* Add failed peers from creation

Signed-off-by: stuart nelson <stuartnelson3@gmail.com>

* Remove warnIfAlone()

Signed-off-by: stuart nelson <stuartnelson3@gmail.com>

* Update metric names

Signed-off-by: stuart nelson <stuartnelson3@gmail.com>

* Address comments

Signed-off-by: stuart nelson <stuartnelson3@gmail.com>
2018-06-05 14:28:49 +02:00
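The "Use peer address as map key" step in db4af95ea0 can be illustrated with a short Go sketch. The `peer` and `failedPeers` types are hypothetical; the example shows why keying by address lets a restarted peer (same IP, new ULID) clear its old entry.

```go
package main

import "fmt"

// peer is a hypothetical peer record: the ID changes on restart,
// the address stays the same.
type peer struct {
	id   string
	addr string
}

// failedPeers tracks failed peers keyed by address rather than by ID.
type failedPeers map[string]peer

func (f failedPeers) markFailed(p peer) { f[p.addr] = p }

// markJoined removes the entry for the peer's address, regardless of
// which ID the peer had when it failed.
func (f failedPeers) markJoined(p peer) { delete(f, p.addr) }

func main() {
	failed := failedPeers{}
	failed.markFailed(peer{id: "ulid-before-restart", addr: "10.0.0.1:9094"})
	// The peer restarts: same address, different ID.
	failed.markJoined(peer{id: "ulid-after-restart", addr: "10.0.0.1:9094"})
	fmt.Println("failed peers left:", len(failed))
}
```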
Silvio Gissi
402564055b
Update Architecture diagram (#1394)
* Update Architecture diagram

Update diagram from sketch to vector.
Add draw.io XML source file.
Update README.md to display master doc/arch.jpg

Signed-off-by: Silvio Gissi <silvio@gissilabs.com>

* Updated README.md with relative link to architecture doc.

* Updated Architecture document from JPG to SVG

Signed-off-by: Silvio Gissi <silvio@gissilabs.com>

* Small fix in graph.

* Updated font to align with Prometheus architecture.

Signed-off-by: Silvio Gissi <silvio@gissilabs.com>

* Embedded images at arch.svg

* Removed images from SVG, update source XML
2018-05-31 15:34:52 +02:00
Simon Pasquier
49717d91b0
parse: fix parsing for label values with commas (#1395)
Signed-off-by: Simon Pasquier <spasquie@redhat.com>
2018-05-28 11:36:47 +02:00
Max Inden
a9e584be75
Merge pull request #1391 from simonpasquier/fix-circleci
Fix CircleCI config for releases
2018-05-28 11:33:58 +02:00
Max Inden
402a4aa4a2
Merge pull request #1381 from simonpasquier/add-missing-header
*: add missing license headers
2018-05-28 10:45:39 +02:00
Simon Pasquier
44e30e6a41
Fix CircleCI config for releases
Signed-off-by: Simon Pasquier <spasquie@redhat.com>
2018-05-23 10:20:06 +02:00
Adam Shannon
bf0db5b989
cli: print more config details (#1376)
Example output:

$ amtool check-config alertmanager.yaml
Checking 'alertmanager.yaml'  SUCCESS
Found:
 - global config
 - route
 - 0 inhibit rules
 - 13 receivers
 - 0 templates

Signed-off-by: Adam Shannon <adamkshannon@gmail.com>
2018-05-15 09:17:51 +02:00
Simon Pasquier
0ebaeccd4b
*: add missing license headers
Signed-off-by: Simon Pasquier <spasquie@redhat.com>
2018-05-14 17:37:13 +02:00
Alex Lardschneider
1f9a7b6182
[Request] Add Slack actions to notifications (#1355)
* Added slack actions to notifications

Signed-off-by: Alex Lardschneider <alex.lardschneider@gmail.com>
2018-05-14 17:26:11 +02:00
Simon Pasquier
292256ca7f
vendor: remove unused packages (#1380)
Signed-off-by: Simon Pasquier <spasquie@redhat.com>
2018-05-14 16:23:48 +02:00
rhysm
e4416bd612
Add additional cluster configuration flags (#1379)
The cluster configuration uses DefaultLANConfig which seems
to be quite sensitive to WAN conditions. Allowing the tuning of these 3
parameters (TCP Timeout, Probe Interval and Probe Timeout) makes
clustering more robust across WAN connections.

Signed-off-by: Rhys Meaclem <rhysmeaclem@gmail.com>
2018-05-14 09:22:04 +02:00
stuart nelson
942be9d993
cli alert query: Expose --active and --unprocessed (#1370)
* cli alert query: Expose --active and --unprocessed

Support the new filter options in the alerts api
endpoint introduced by https://github.com/prometheus/alertmanager/pull/1366

Signed-off-by: stuart nelson <stuartnelson3@gmail.com>

* Update comment and client_test

Signed-off-by: stuart nelson <stuartnelson3@gmail.com>
2018-05-09 10:57:01 +02:00
Simon Pasquier
02f10f204f
circleci: fix docker push command (#1371)
Signed-off-by: Simon Pasquier <spasquie@redhat.com>
2018-05-08 11:41:25 +02:00
Simon Pasquier
28967e394e
config: fix Go formatting (#1368)
Signed-off-by: Simon Pasquier <spasquie@redhat.com>
2018-05-07 18:12:14 +02:00
Simon Pasquier
75900ea62a
api: remove dead code (#1367)
This is a follow-up of f825d97de4.

Signed-off-by: Simon Pasquier <spasquie@redhat.com>
2018-05-07 18:11:36 +02:00
Simon Pasquier
383024e63d
api: support more query filters (#1366)
* api: support more query filters

This change adds 2 new query filters to the /api/v1/alerts endpoint.

- active, filter out active alerts when set to 'false' (default: 'true').
- unprocessed, filter out unprocessed alerts when set to 'false'
 (default: 'true').

The default values ensure that the API behavior remains the same as
before when the query filters aren't provided.

Signed-off-by: Simon Pasquier <spasquie@redhat.com>

* api: address comments

Signed-off-by: Simon Pasquier <spasquie@redhat.com>
2018-05-07 18:07:19 +02:00
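The default handling of the two query filters in 383024e63d could be sketched in Go as follows. `boolParam` and `parseFilters` are hypothetical helpers, and the use of `strconv.ParseBool` is an assumption (it accepts more spellings than just 'true' and 'false'); the key point is that a missing parameter defaults to true, which keeps the previous API behavior.

```go
package main

import (
	"fmt"
	"net/url"
	"strconv"
)

// boolParam returns the boolean value of a query parameter, or def
// when the parameter is absent.
func boolParam(q url.Values, name string, def bool) (bool, error) {
	v := q.Get(name)
	if v == "" {
		return def, nil
	}
	return strconv.ParseBool(v)
}

// parseFilters extracts the active and unprocessed filters from a raw
// query string; both default to true.
func parseFilters(rawQuery string) (active, unprocessed bool, err error) {
	q, err := url.ParseQuery(rawQuery)
	if err != nil {
		return false, false, err
	}
	if active, err = boolParam(q, "active", true); err != nil {
		return false, false, err
	}
	if unprocessed, err = boolParam(q, "unprocessed", true); err != nil {
		return false, false, err
	}
	return active, unprocessed, nil
}

func main() {
	for _, raw := range []string{"", "active=false", "active=false&unprocessed=false"} {
		active, unprocessed, err := parseFilters(raw)
		fmt.Printf("%q -> active=%v unprocessed=%v err=%v\n", raw, active, unprocessed, err)
	}
}
```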
Max Inden
05fb09aebd
Merge pull request #1362 from mxinden/deprecate-v0-alerts
api: Deprecate `api/alerts` endpoint
2018-05-05 13:43:54 +02:00
stuart nelson
1c0c24b300
Update alerts argument order, rename expired to inhibited (#1360)
Signed-off-by: stuart nelson <stuartnelson3@gmail.com>
2018-05-04 10:43:38 +02:00
Max Leonard Inden
f825d97de4
api: Deprecate api/alerts endpoint
With prometheus/prometheus commit
e114ce0ff7a1ae06b24fdc479ffc7422074c1ebe [1] Prometheus switches from
using `api/alerts` to `api/v1/alerts`. This commit is included starting
from Prometheus v0.17.0. As discussed on the prometheus-developers
mailing list [2] the deprecation period is long over.

[1] github.com/prometheus/prometheus/commit/e114ce0ff7a1ae06b24fdc479ffc7422074c1ebe
[2] https://groups.google.com/d/msg/prometheus-developers/2CCuFTMbmAg/Qg58rvyzAQAJ

Signed-off-by: Max Leonard Inden <IndenML@gmail.com>
2018-05-04 09:59:14 +02:00