- `tmplText` and `tmplHTML` are using a monad-style error handling [1].
This reduces the verbosity of the error logic, but introduces the risk
of forgetting the final error check. This patch does not remove this
coding-style, but ensures proper error checking in the Email and
PagerDuty notifier.
- Ensure to handle errors returned by `multipartWriter.Close()` and
`wc.Write(buffer.Bytes())` in `Email.Notify()`.
[1] https://www.innoq.com/en/blog/golang-errors-monads/
Signed-off-by: Max Leonard Inden <IndenML@gmail.com>
* Add support for adding alerts using amtool
Signed-off-by: Bob Shannon <bshannon@palantir.com>
* comment: Simplify return in addAlert
Signed-off-by: Bob Shannon <bshannon@palantir.com>
Alertmanager is exiting with a non-zero exit code if the initial cluster
join fails. This behavior could be not wanted because:
- As Alertmanager is a critical component with an at-least-once
guarantee, failing on joining the cluster is unnecessary as
Alertmanager still functions by itself.
- In an environment like Kubernetes discovering peers via DNS, peers
might roll out one-by-one, leaving the DNS entries unpopulated for the
first peer of a set. Failing on initial join prevents a roll-out.
Instead of failing on the initial join this patch only logs the failure.
The cluster can be later joined via the `handleReconnect`.
This is a regression introduced in PR #1456 [1].
[1] https://github.com/prometheus/alertmanager/pull/1456
Signed-off-by: Max Leonard Inden <IndenML@gmail.com>
* fix concurrent read and wirte group
Signed-off-by: denghuan <denghuan@actionsky.com>
* make lock more elegant
Signed-off-by: denghuan <denghuan@actionsky.com>
* amtool: add support for stdin to check-config
Signed-off-by: Simon Pasquier <spasquie@redhat.com>
* Address Stuart's comment
Signed-off-by: Simon Pasquier <spasquie@redhat.com>
* cluster: make sure we don't miss the first pushPull
During the join, memberlist initiates a pushPull to get initial data.
Unfortunately, at this point the nflog and silence listener have not
been registered yet, so the first data arrives only after one pushPull
cycle (1min by default !).
Signed-off-by: Corentin Chary <c.chary@criteo.com>
The memberlist library fails when it can't find a private address and no
advertise address is given. To return a helpful message to the user,
AlertManager mimics the logic from memberlist. However the code had a
bug that swallowed the error message and made it difficult for the user
to understand how to fix the problem.
Signed-off-by: Simon Pasquier <spasquie@redhat.com>
To ensure we include the breaking change notice in the next release
notes, this patch adds a 'Next release' section mentioning the breaking
change of the working directory of the Alertmanager Dockerfile.
Signed-off-by: Max Leonard Inden <IndenML@gmail.com>
The YAML strict mode doesn't allow mapping keys that are duplicates. If
someone wants to override one of the default keys in the Details hash,
the unmarshal function returns an error because the key is already
defined by DefaultPagerdutyConfig.
Signed-off-by: Simon Pasquier <spasquie@redhat.com>
* cluster: prune the queue if too large
Signed-off-by: Simon Pasquier <spasquie@redhat.com>
* Address review comments
Also increases the pruning interval to 15 minutes and the max queue size
to 4096 items (same value as used by Serf).
Signed-off-by: Simon Pasquier <spasquie@redhat.com>
* Gossip large messages via SendReliable
For messages beyond half of the maximum gossip
packet size, send the message to all peer nodes
via TCP.
The choice of "larger than half the max gossip
size" is relatively arbitrary. From brief testing,
the overhead from memberlist on a packet seemed to
only use ~3 of the available 1400 bytes, and most
gossip messages seem to be <<500 bytes.
* Add tests for oversized/normal message gossiping
* Make oversize metric names consistent
* Remove errant printf in test
* Correctly increment WaitGroup
* Add comment for OversizedMessage func
* Add metric for oversized messages dropped
Code was added to drop oversized messages if the
buffered channel they are sent on is full. This
is a good thing to surface as a metric.
* Add counter for total oversized messages sent
* Change full queue log level to debug
Was previously a warning, which isn't necessary
now that there is a metric tracking it.
Signed-off-by: stuart nelson <stuartnelson3@gmail.com>
* Sort dispatched alerts by job+instance in the correct order (#1178)
Signed-off-by: Ted Zlatanov <tzz@lifelogs.com>
* dispatch: add unit test for alerts sorting
Signed-off-by: Simon Pasquier <spasquie@redhat.com>
* Include Makefile.common
* Fix the bindata.go files to make the style target happy
* Inline `.PHONY` statements
Signed-off-by: Simon Pasquier <spasquie@redhat.com>
When setting initially failing peers, if we don't
have a value for the advertise address, use the
bindAddr.
Signed-off-by: stuart nelson <stuartnelson3@gmail.com>
* Correctly add Node to initially failed peer
Reconnect attempts to failed peers were panicking
because peer.Address() would attempt to access the
nil Node struct member.
Signed-off-by: stuart nelson <stuartnelson3@gmail.com>
* Correctly remove old peers
Again, since we aren't assigning a name (this is
generated) we rely on the node's Address for
removing the initially joining (and potentially
later re-joining) peers
Signed-off-by: stuart nelson <stuartnelson3@gmail.com>
* Test the peerJoin removes initial peers
Signed-off-by: stuart nelson <stuartnelson3@gmail.com>
* Don't add self to failing peers list
The initially failing peers list shouldn't include
the bindAddr for the alertmanager itself, as this
connection is never made, and consequently only
removed from the failedPeers list after the failed
peer timeout.
Signed-off-by: stuart nelson <stuartnelson3@gmail.com>
* Filter initialFailed with advertise addr
This may differ from bindAddr, and is the value we
want to not attempt to connect to.
Signed-off-by: stuart nelson <stuartnelson3@gmail.com>
* Peers further propagate newly received nflogs
If a peer receives an nflog that it hasn't seen
before, queue the message and propagate it further
to other peers. This should ensure that all
peers within a cluster receive all gossip
messages.
Signed-off-by: stuart nelson <stuartnelson3@gmail.com>
* Set Retransmit value based on number of members
For alertmanagers that are brought up with a list
of peers, set the number of message retransmits to
be half of that number. If there are no peers on
start, or there are few, continue to use the
default value of 3.
Signed-off-by: stuart nelson <stuartnelson3@gmail.com>
* [nflog] Move retransmit calculation
Signed-off-by: stuart nelson <stuartnelson3@gmail.com>
* [silence] further gossip silence messages
Signed-off-by: stuart nelson <stuartnelson3@gmail.com>
* Set GossipNodes to equal RetransmitMulti
During a gossip, we send messages to at most
GossipNodes nodes. If possible, we only a message
to reach all nodes as soon as possible.
Signed-off-by: stuart nelson <stuartnelson3@gmail.com>
* Fix rebase
Signed-off-by: stuart nelson <stuartnelson3@gmail.com>
* notify: notify resolved alerts properly
The PR #1205 while fixing an existing issue introduced another bug when
the send_resolved flag of the integration is set to true.
With send_resolved set to false, the semantics remain the same:
AlertManager generates a notification when new firing alerts are added
to the alert group. The notification only carries firing alerts.
With send_resolved set to true, AlertManager generates a notification
when new firing or resolved alerts are added to the alert group. The
notification carries both the firing and resolved notifications.
Signed-off-by: Simon Pasquier <spasquie@redhat.com>
* Fix comments
Signed-off-by: Simon Pasquier <spasquie@redhat.com>
memberlist doesn't advertise a valid IP address when the bind address is
empty (":8001") or the unspecified IPv6 address ("[::]:8001).
Signed-off-by: Simon Pasquier <spasquie@redhat.com>
* Amtool: Implement filter by receiver
* Adds receiver flag to amtool alert query
* Adds receiver argument to alert http client
* Updates http client tests for added argument
Also works: scpecifying `receiver: "receiver-123"` in config file
automaticly filters all alerts shown
* Include receiver in amtool config docs
Now that I've implemented the receiver in amtool I should add the new
feature to the documentation.
* #937 Add mention of supporting regex syntax to receiver flag
* initial impl
Signed-off-by: stuart nelson <stuartnelson3@gmail.com>
* Add reconnectTimeout
Signed-off-by: stuart nelson <stuartnelson3@gmail.com>
* Fix locking
Signed-off-by: stuart nelson <stuartnelson3@gmail.com>
* Remove unused PeerStatuses
Signed-off-by: stuart nelson <stuartnelson3@gmail.com>
* Add metrics
Signed-off-by: stuart nelson <stuartnelson3@gmail.com>
* Actually use peerJoinCounter
Signed-off-by: stuart nelson <stuartnelson3@gmail.com>
* Cleanup peers map on peer timeout
Signed-off-by: stuart nelson <stuartnelson3@gmail.com>
* Add reconnect test
Signed-off-by: stuart nelson <stuartnelson3@gmail.com>
* test removing failed peers
Signed-off-by: stuart nelson <stuartnelson3@gmail.com>
* Use peer address as map key
If a peer is restarted, it will rejoin with the
same IP but different ULID. So the node will
rejoin the cluster, but its peers will never
remove it from their internal list of failed nodes
because its ULID has changed.
Signed-off-by: stuart nelson <stuartnelson3@gmail.com>
* Add failed peers from creation
Signed-off-by: stuart nelson <stuartnelson3@gmail.com>
* Remove warnIfAlone()
Signed-off-by: stuart nelson <stuartnelson3@gmail.com>
* Update metric names
Signed-off-by: stuart nelson <stuartnelson3@gmail.com>
* Address comments
Signed-off-by: stuart nelson <stuartnelson3@gmail.com>
* Update Architecture diagram
Update diagram from sketch to vector.
Add draw.io XML source file.
Update README.md to display master doc/arch.jpg
Signed-off-by: Silvio Gissi <silvio@gissilabs.com>
* Updated README.md with relative link to architecture doc.
* Updated Architecture document from JPG to SVG
Signed-off-by: Silvio Gissi <silvio@gissilabs.com>
* Small fix in graph.
* Updated font to align with Prometheus architecture.
Signed-off-by: Silvio Gissi <silvio@gissilabs.com>
* Embedded images at arch.svg
* Removed images from SVG, update source XML