When setting initially failing peers, if we don't
have a value for the advertise address, use the
bindAddr.
Signed-off-by: stuart nelson <stuartnelson3@gmail.com>
* Correctly add Node to initially failed peer
Reconnect attempts to failed peers were panicking
because peer.Address() would attempt to access the
nil Node struct member.
Signed-off-by: stuart nelson <stuartnelson3@gmail.com>
* Correctly remove old peers
Again, since we aren't assigning a name (this is
generated), we rely on the node's Address for
removing the initially joining (and potentially
later re-joining) peers.
Signed-off-by: stuart nelson <stuartnelson3@gmail.com>
* Test that peerJoin removes initial peers
Signed-off-by: stuart nelson <stuartnelson3@gmail.com>
* Don't add self to failing peers list
The initially failing peers list shouldn't include
the bindAddr for the alertmanager itself, as this
connection is never made, and the entry is
consequently only removed from the failedPeers list
after the failed peer timeout.
Signed-off-by: stuart nelson <stuartnelson3@gmail.com>
* Filter initialFailed with advertise addr
This may differ from bindAddr, and is the value we
want to not attempt to connect to.
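
A minimal sketch of the filtering described above, assuming peers are
compared as plain "host:port" strings. The function name
initialFailedPeers and the sample addresses are illustrative and not
taken from the actual change.

```go
package main

import "fmt"

// initialFailedPeers returns the peers that should start out on the
// failed-peers list: every configured peer except this instance itself.
// selfAddr is assumed to be the advertise address of this instance.
func initialFailedPeers(peers []string, selfAddr string) []string {
	failed := make([]string, 0, len(peers))
	for _, p := range peers {
		if p == selfAddr {
			// Never try to reconnect to ourselves.
			continue
		}
		failed = append(failed, p)
	}
	return failed
}

func main() {
	peers := []string{"10.0.0.1:9094", "10.0.0.2:9094", "10.0.0.3:9094"}
	fmt.Println(initialFailedPeers(peers, "10.0.0.2:9094"))
}
```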
Signed-off-by: stuart nelson <stuartnelson3@gmail.com>
* Peers further propagate newly received nflogs
If a peer receives an nflog that it hasn't seen
before, queue the message and propagate it further
to other peers. This should ensure that all
peers within a cluster receive all gossip
messages.
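
The propagation rule can be sketched roughly as follows. The peerState
type, its fields and the string keys are hypothetical stand-ins for
the real nflog state, chosen only to show the "queue it if unseen"
behaviour.

```go
package main

import "fmt"

// peerState is a hypothetical stand-in for a peer's notification-log state.
type peerState struct {
	seen  map[string]struct{} // keys of entries already known locally
	queue []string            // entries waiting to be gossiped onward
}

func newPeerState() *peerState {
	return &peerState{seen: make(map[string]struct{})}
}

// receive merges an incoming entry. Only entries that were not seen before
// are queued for further propagation. It reports whether the entry was new.
func (p *peerState) receive(key string) bool {
	if _, ok := p.seen[key]; ok {
		return false
	}
	p.seen[key] = struct{}{}
	p.queue = append(p.queue, key)
	return true
}

func main() {
	p := newPeerState()
	fmt.Println(p.receive("group-a/receiver-1")) // first sighting
	fmt.Println(p.receive("group-a/receiver-1")) // duplicate, not re-queued
	fmt.Println(len(p.queue))
}
```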
Signed-off-by: stuart nelson <stuartnelson3@gmail.com>
* Set Retransmit value based on number of members
For alertmanagers that are brought up with a list
of peers, set the number of message retransmits to
be half of that number. If there are no peers on
start, or there are few, continue to use the
default value of 3.
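
As a rough illustration of the rule, assuming integer division and
that the default acts as a lower bound (the helper name
retransmitCount is made up for this sketch):

```go
package main

import "fmt"

// defaultRetransmit is the default value mentioned in the message above.
const defaultRetransmit = 3

// retransmitCount returns half the number of configured peers, but never
// less than the default. Integer division is an assumption of this sketch.
func retransmitCount(peers int) int {
	if r := peers / 2; r > defaultRetransmit {
		return r
	}
	return defaultRetransmit
}

func main() {
	for _, n := range []int{0, 4, 10} {
		fmt.Printf("peers=%d retransmit=%d\n", n, retransmitCount(n))
	}
}
```

With 10 peers this yields 5 retransmits. With 0 or 4 peers it stays at 3.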
Signed-off-by: stuart nelson <stuartnelson3@gmail.com>
* [nflog] Move retransmit calculation
Signed-off-by: stuart nelson <stuartnelson3@gmail.com>
* [silence] further gossip silence messages
Signed-off-by: stuart nelson <stuartnelson3@gmail.com>
* Set GossipNodes to equal RetransmitMulti
During a gossip, we send messages to at most
GossipNodes nodes. If possible, we want a message
to reach all nodes as soon as possible.
Signed-off-by: stuart nelson <stuartnelson3@gmail.com>
* Fix rebase
Signed-off-by: stuart nelson <stuartnelson3@gmail.com>
* notify: notify resolved alerts properly
PR #1205, while fixing an existing issue, introduced another bug when
the send_resolved flag of the integration is set to true.
With send_resolved set to false, the semantics remain the same:
AlertManager generates a notification when new firing alerts are added
to the alert group. The notification only carries firing alerts.
With send_resolved set to true, AlertManager generates a notification
when new firing or resolved alerts are added to the alert group. The
notification carries both the firing and resolved alerts.
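
The two cases can be sketched as below. This is not the notify
package's code: the alert type, the "firing:"/"resolved:" keys and the
helpers payload and shouldNotify are assumptions made to illustrate
the semantics.

```go
package main

import "fmt"

type alert struct {
	name     string
	resolved bool
}

// key identifies an alert together with its state, so that a firing alert
// that becomes resolved counts as something new.
func key(a alert) string {
	if a.resolved {
		return "resolved:" + a.name
	}
	return "firing:" + a.name
}

// payload returns the alerts a notification would carry: firing alerts
// only, plus resolved alerts when sendResolved is true.
func payload(group []alert, sendResolved bool) []alert {
	var out []alert
	for _, a := range group {
		if a.resolved && !sendResolved {
			continue
		}
		out = append(out, a)
	}
	return out
}

// shouldNotify reports whether the payload contains anything that was not
// part of the previous notification.
func shouldNotify(notified map[string]bool, group []alert, sendResolved bool) bool {
	for _, a := range payload(group, sendResolved) {
		if !notified[key(a)] {
			return true
		}
	}
	return false
}

func main() {
	group := []alert{{"HighLatency", false}, {"DiskFull", true}}
	notified := map[string]bool{"firing:HighLatency": true}
	fmt.Println(shouldNotify(notified, group, false), len(payload(group, false)))
	fmt.Println(shouldNotify(notified, group, true), len(payload(group, true)))
}
```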
Signed-off-by: Simon Pasquier <spasquie@redhat.com>
* Fix comments
Signed-off-by: Simon Pasquier <spasquie@redhat.com>
memberlist doesn't advertise a valid IP address when the bind address is
empty (":8001") or the unspecified IPv6 address ("[::]:8001").
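
One way to detect such bind addresses, sketched with the standard net
package. The helper name needsExplicitAdvertise is hypothetical. Note
that net.IP.IsUnspecified also matches 0.0.0.0, which goes slightly
beyond the two cases named above.

```go
package main

import (
	"fmt"
	"net"
)

// needsExplicitAdvertise reports whether bindAddr has an empty host or an
// unspecified IP, in which case it cannot serve as an advertise address.
func needsExplicitAdvertise(bindAddr string) (bool, error) {
	host, _, err := net.SplitHostPort(bindAddr)
	if err != nil {
		return false, err
	}
	if host == "" {
		return true, nil
	}
	ip := net.ParseIP(host)
	return ip != nil && ip.IsUnspecified(), nil
}

func main() {
	for _, addr := range []string{":8001", "[::]:8001", "192.168.1.5:8001"} {
		ok, err := needsExplicitAdvertise(addr)
		fmt.Println(addr, ok, err)
	}
}
```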
Signed-off-by: Simon Pasquier <spasquie@redhat.com>
* Amtool: Implement filter by receiver
* Adds receiver flag to amtool alert query
* Adds receiver argument to alert http client
* Updates http client tests for added argument
Also works: specifying `receiver: "receiver-123"` in the config file
automatically filters all alerts shown
* Include receiver in amtool config docs
Now that I've implemented the receiver in amtool I should add the new
feature to the documentation.
* #937 Add mention of supporting regex syntax to receiver flag
* initial impl
Signed-off-by: stuart nelson <stuartnelson3@gmail.com>
* Add reconnectTimeout
Signed-off-by: stuart nelson <stuartnelson3@gmail.com>
* Fix locking
Signed-off-by: stuart nelson <stuartnelson3@gmail.com>
* Remove unused PeerStatuses
Signed-off-by: stuart nelson <stuartnelson3@gmail.com>
* Add metrics
Signed-off-by: stuart nelson <stuartnelson3@gmail.com>
* Actually use peerJoinCounter
Signed-off-by: stuart nelson <stuartnelson3@gmail.com>
* Cleanup peers map on peer timeout
Signed-off-by: stuart nelson <stuartnelson3@gmail.com>
* Add reconnect test
Signed-off-by: stuart nelson <stuartnelson3@gmail.com>
* test removing failed peers
Signed-off-by: stuart nelson <stuartnelson3@gmail.com>
* Use peer address as map key
If a peer is restarted, it will rejoin with the
same IP but different ULID. So the node will
rejoin the cluster, but its peers will never
remove it from their internal list of failed nodes
because its ULID has changed.
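
A small sketch of why the address works as a key where the ULID does
not. The peer and failedPeers types and the placeholder names are
illustrative only.

```go
package main

import "fmt"

// peer is a simplified, hypothetical view of a cluster member.
type peer struct {
	name string // generated per process start (a ULID in practice)
	addr string // "host:port", stable across restarts
}

// failedPeers is keyed by address rather than by name, so a restarted peer
// that rejoins under a new name still clears its old entry.
type failedPeers map[string]peer

func (f failedPeers) markFailed(p peer) { f[p.addr] = p }

func (f failedPeers) onJoin(p peer) { delete(f, p.addr) }

func main() {
	failed := failedPeers{}
	failed.markFailed(peer{name: "ulid-before-restart", addr: "10.0.0.2:9094"})

	// The peer restarts: same address, freshly generated name.
	failed.onJoin(peer{name: "ulid-after-restart", addr: "10.0.0.2:9094"})
	fmt.Println(len(failed)) // 0
}
```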
Signed-off-by: stuart nelson <stuartnelson3@gmail.com>
* Add failed peers from creation
Signed-off-by: stuart nelson <stuartnelson3@gmail.com>
* Remove warnIfAlone()
Signed-off-by: stuart nelson <stuartnelson3@gmail.com>
* Update metric names
Signed-off-by: stuart nelson <stuartnelson3@gmail.com>
* Address comments
Signed-off-by: stuart nelson <stuartnelson3@gmail.com>
* Update Architecture diagram
Update diagram from sketch to vector.
Add draw.io XML source file.
Update README.md to display master doc/arch.jpg
Signed-off-by: Silvio Gissi <silvio@gissilabs.com>
* Updated README.md with relative link to architecture doc.
* Updated Architecture document from JPG to SVG
Signed-off-by: Silvio Gissi <silvio@gissilabs.com>
* Small fix in graph.
* Updated font to align with Prometheus architecture.
Signed-off-by: Silvio Gissi <silvio@gissilabs.com>
* Embedded images at arch.svg
* Removed images from SVG, update source XML
The cluster configuration uses DefaultLANConfig which seems
to be quite sensitive to WAN conditions. Allowing the tuning of these 3
parameters (TCP Timeout, Probe Interval and Probe Timeout) makes
clustering more robust across WAN connections.
Signed-off-by: Rhys Meaclem <rhysmeaclem@gmail.com>
* cli alert query: Expose --active and --unprocessed
Support the new filter options in the alerts api
endpoint introduced by https://github.com/prometheus/alertmanager/pull/1366
Signed-off-by: stuart nelson <stuartnelson3@gmail.com>
* Update comment and client_test
Signed-off-by: stuart nelson <stuartnelson3@gmail.com>
* api: support more query filters
This change adds 2 new query filters to the /api/v1/alerts endpoint.
- active, filter out active alerts when set to 'false' (default: 'true').
- unprocessed, filter out unprocessed alerts when set to 'false'
(default: 'true').
The default values ensure that the API behavior remains the same as
before when the query filters aren't provided.
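
The defaulting behaviour could look roughly like this. The filters
type, parseFilters and the state strings are assumptions of this
sketch, and strconv.ParseBool accepts more spellings than the API
itself may allow.

```go
package main

import (
	"fmt"
	"net/url"
	"strconv"
)

// filters holds the two query filters. Both default to true.
type filters struct {
	active      bool
	unprocessed bool
}

// boolParam reads a boolean query parameter, falling back to def when the
// parameter is absent.
func boolParam(q url.Values, name string, def bool) (bool, error) {
	v := q.Get(name)
	if v == "" {
		return def, nil
	}
	return strconv.ParseBool(v)
}

func parseFilters(rawQuery string) (filters, error) {
	f := filters{active: true, unprocessed: true}
	q, err := url.ParseQuery(rawQuery)
	if err != nil {
		return f, err
	}
	if f.active, err = boolParam(q, "active", true); err != nil {
		return f, err
	}
	f.unprocessed, err = boolParam(q, "unprocessed", true)
	return f, err
}

// keep reports whether an alert in the given state passes the filters.
func (f filters) keep(state string) bool {
	switch state {
	case "active":
		return f.active
	case "unprocessed":
		return f.unprocessed
	}
	return true
}

func main() {
	f, err := parseFilters("active=false")
	fmt.Println(f.keep("active"), f.keep("unprocessed"), err)
}
```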
Signed-off-by: Simon Pasquier <spasquie@redhat.com>
* api: address comments
Signed-off-by: Simon Pasquier <spasquie@redhat.com>
With prometheus/prometheus commit
e114ce0ff7a1ae06b24fdc479ffc7422074c1ebe [1] Prometheus switches from
using `api/alerts` to `api/v1/alerts`. This commit is included starting
from Prometheus v0.17.0. As discussed on the prometheus-developers
mailing list [2] the deprecation period is long over.
[1] github.com/prometheus/prometheus/commit/e114ce0ff7a1ae06b24fdc479ffc7422074c1ebe
[2]
https://groups.google.com/d/msg/prometheus-developers/2CCuFTMbmAg/Qg58rvyzAQAJ
Signed-off-by: Max Leonard Inden <IndenML@gmail.com>
* Use default values to store values from config
* fix typo and reserved keyword
* move to long help texts
* add one more unit test for resolver
* update comments
Signed-off-by: Simon Pasquier <spasquie@redhat.com>
* Improve notification instrumentation
- Add notificationLatencySeconds histogram to
debug duplicate messages. This can help rule out
if duplicate messages are being caused by
excessive latency when sending a notification.
Signed-off-by: stuart nelson <stuartnelson3@gmail.com>
* inhibit: update inhibition cache when alerts resolve
Signed-off-by: Simon Pasquier <spasquie@redhat.com>
* inhibit: remove unnecessary fmt.Sprintf
Signed-off-by: Simon Pasquier <spasquie@redhat.com>
* inhibit: add unit tests
Signed-off-by: Simon Pasquier <spasquie@redhat.com>
* inhibit: use NopLogger in tests
Signed-off-by: Simon Pasquier <spasquie@redhat.com>
* Update old alert with result of merge with new
On ingest, alerts with matching fingerprints are
merged if the new alert's start and end times
overlap with the old alert's.
The merge creates a new alert, which is then
updated in the internal alert store.
The original alert is not updated (because merge
creates a copy), so it is never marked as resolved
in the inhibitor's reference to it.
The code within the inhibitor relies on skipping
over resolved alerts, but because the old alert is
never updated it is never marked as resolved. Thus
it continues to inhibit other alerts until it is
cleaned up by the internal GC.
This commit updates the struct of the old alert
with the result of the merge with the new alert.
An alternative would be to always update the
inhibitor's internal cache of alerts regardless of
an alert's resolve status.
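
The stale-pointer problem can be reproduced in a few lines. This is a
simplified model and not the actual types: timestamps are plain
integers, a zero endsAt means "not resolved yet", and merge is a
stand-in for the real merge logic.

```go
package main

import "fmt"

// alert is a simplified model. Times are Unix seconds, endsAt == 0 means
// the alert has no end time yet.
type alert struct {
	startsAt, endsAt int64
}

func (a *alert) resolved(now int64) bool {
	return a.endsAt != 0 && a.endsAt <= now
}

// merge returns a new alert and leaves both inputs untouched, which is what
// makes any existing pointer to old go stale.
func merge(old, update *alert) *alert {
	res := *update
	if old.startsAt < res.startsAt {
		res.startsAt = old.startsAt
	}
	return &res
}

func main() {
	const now = 1000
	old := &alert{startsAt: 100}
	cached := old // the inhibitor's reference to the alert

	update := &alert{startsAt: 500, endsAt: 900}
	merged := merge(old, update)
	fmt.Println(merged.resolved(now), cached.resolved(now)) // true false

	// Copy the merge result into the old struct so that existing
	// references observe the resolved state.
	*old = *merged
	fmt.Println(cached.resolved(now)) // true
}
```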
Signed-off-by: stuart nelson <stuartnelson3@gmail.com>
* Update inhibitor cache even if alert is resolved
This seems like a better choice than the previous
commit. I think it is more sane to have the
inhibitor update its own cache, rather than having
one of its pointers updated externally.
Signed-off-by: stuart nelson <stuartnelson3@gmail.com>
* Move amtool to modular structure
Signed-off-by: Stuart Nelson <stuartnelson3@gmail.com>
* Move toplevel setup back into root.go
Signed-off-by: Stuart Nelson <stuartnelson3@gmail.com>
* Remove confusing alert struct name overwriting
A local variable within the alert subcommand was
using the name of the struct within that file.
Signed-off-by: Stuart Nelson <stuartnelson3@gmail.com>
* change local var name shadowing struct name
Signed-off-by: Stuart Nelson <stuartnelson3@gmail.com>
Within alertmanager, expire is the term used,
since silences still "exist" but aren't in effect.
Signed-off-by: Stuart Nelson <stuartnelson3@gmail.com>
* cli: move commands to cli/cmd
* cli: use StatusAPI interface for config command
* cli: use SilenceAPI interface for silence commands
* cli: use AlertAPI for alert command
* cli: move back commands to cli package
And move API client code to its own package.
* cli: remove unused structs
When the aggregation group receives an alert that is past the initial
group_wait value, it should reset its timer only if the timer has never
expired. Otherwise it means that the flush is already in-progress.
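
A sketch of such a guard, reading the rule as: an alert that is
already past group_wait forces an immediate flush only while the
group's timer has not yet expired for the first time. The function
name, the hasFlushed flag and the use of an alert "age" are
assumptions of this example.

```go
package main

import (
	"fmt"
	"time"
)

// shouldResetTimer reports whether the group timer should be reset to fire
// immediately. alertAge is how long ago the alert started. hasFlushed
// records whether the group's timer has expired at least once.
func shouldResetTimer(hasFlushed bool, alertAge, groupWait time.Duration) bool {
	return !hasFlushed && alertAge > groupWait
}

func main() {
	groupWait := 30 * time.Second
	fmt.Println(shouldResetTimer(false, time.Minute, groupWait))   // overdue, not flushed yet
	fmt.Println(shouldResetTimer(true, time.Minute, groupWait))    // flush already happened
	fmt.Println(shouldResetTimer(false, 5*time.Second, groupWait)) // still within group_wait
}
```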