The cluster configuration uses DefaultLANConfig which seems
to be quite sensitive to WAN conditions. Allowing the tuning of these 3
parameters (TCP Timeout, Probe Interval and Probe Timeout) makes
clustering more robust across WAN connections.
Signed-off-by: Rhys Meaclem <rhysmeaclem@gmail.com>
* cli alert query: Expose --active and --unprocessed
Support the new filter options in the alerts api
endpoint introduced by https://github.com/prometheus/alertmanager/pull/1366
Signed-off-by: stuart nelson <stuartnelson3@gmail.com>
* Update comment and client_test
Signed-off-by: stuart nelson <stuartnelson3@gmail.com>
* api: support more query filters
This change adds 2 new query filters to the /api/v1/alerts endpoint.
- active, filter out active alerts when set to 'false' (default: 'true').
- unprocessed, filter out unprocessed alerts when set to 'false'
(default: 'true').
The default values ensure that the API behavior remains the same as
before when the query filters aren't provided.
Signed-off-by: Simon Pasquier <spasquie@redhat.com>
* api: address comments
Signed-off-by: Simon Pasquier <spasquie@redhat.com>
With prometheus/prometheus commit
e114ce0ff7a1ae06b24fdc479ffc7422074c1ebe [1] Prometheus switches from
using `api/alerts` to `api/v1/alerts`. This commit is included starting
from Prometheus v0.17.0. As discussed on the prometheus-developers
mailing list [2] the deprecation period is long over.
[1] github.com/prometheus/prometheus/commit/e114ce0ff7a1ae06b24fdc479ffc7422074c1ebe
[2]
https://groups.google.com/d/msg/prometheus-developers/2CCuFTMbmAg/Qg58rvyzAQAJ
Signed-off-by: Max Leonard Inden <IndenML@gmail.com>
* Use default values to store values from config
* fix typo and reserved keywork
* move to long help texts
* add one more unit test for resolver
* update comments
Signed-off-by: Simon Pasquier <spasquie@redhat.com>
* Improve notification instrumentation
- Add notificationLatencySeconds histogram to
debug duplicate messages. This can help rule out
if duplicate messages are being caused by
excessive latency when sending a notification.
Signed-off-by: stuart nelson <stuartnelson3@gmail.com>
* inhibit: update inhibition cache when alerts resolve
Signed-off-by: Simon Pasquier <spasquie@redhat.com>
* inhibit: remove unnecessary fmt.Sprintf
Signed-off-by: Simon Pasquier <spasquie@redhat.com>
* inhibit: add unit tests
Signed-off-by: Simon Pasquier <spasquie@redhat.com>
* inhibit: use NopLogger in tests
Signed-off-by: Simon Pasquier <spasquie@redhat.com>
* Update old alert with result of merge with new
On ingest, alerts with matching fingerprints are
merged if the new alert's start and end times
overlap with the old alert's.
The merge creates a new alert, which is then
updated in the internal alert store.
The original alert is not updated (because merge
creates a copy), so it is never marked as resolved
in the inhibitor's reference to it.
The code within the inhibitor relies on skipping
over resolved alerts, but because the old alert is
never updated it is never marked as resolved. Thus
it continues to inhibit other alerts until it is
cleaned up by the internal GC.
This commit updates the struct of the old alert
with the result of the merge with the new alert.
An alternative would be to always update the
inhibitor's internal cache of alerts regardless of
an alert's resolve status.
Signed-off-by: stuart nelson <stuartnelson3@gmail.com>
* Update inhibitor cache even if alert is resolved
This seems like a better choice than the previous
commit. I think it is more sane to have the
inhibitor update its own cache, rather than having
one of its pointers updated externally.
Signed-off-by: stuart nelson <stuartnelson3@gmail.com>
* Move amtool to modular structure
Signed-off-by: Stuart Nelson <stuartnelson3@gmail.com>
* Move toplevel setup back into root.go
Signed-off-by: Stuart Nelson <stuartnelson3@gmail.com>
* Remove confusing alert struct name overwriting
A local variable within the alert subcommand was
using the name of the struct within that file.
Signed-off-by: Stuart Nelson <stuartnelson3@gmail.com>
* change local var name shadowing struct name
Signed-off-by: Stuart Nelson <stuartnelson3@gmail.com>
Within alertmanager, expire is the term used,
since silences still "exist" but aren't in effect.
Signed-off-by: Stuart Nelson <stuartnelson3@gmail.com>
* cli: move commands to cli/cmd
* cli: use StatusAPI interface for config command
* cli: use SilenceAPI interface for silence commands
* cli: use AlertAPI for alert command
* cli: move back commands to cli package
And move API client code to its own package.
* cli: remove unused structs
When the aggregation group receives an alert that is past the initial
group_wait value, it should reset its timer only if the timer has ever
expired. Otherwise it means that the flush is already in-progress.
* Update silence add/update flags
- Change --expires/-e to --duration/-d
- Change --expires-on to --end
- Add --start
* update subcommand returns ID of new silence
The silences printed before were accurate, except
they had the old ID. Now the new ID is returned.
* Duration is added to silence.StartsAt
When a user supplies a duration to update a
silence, it is applied to silence.StartsAt after
any potential changes to the silence's start time.
* cli: extract client bindings of the v1 API from amtool
This is a continuation of [1] but the code is kept in the alertmanage
repository rather than having it in client_golang.
[1] https://github.com/prometheus/client_golang/pull/333
Co-Authored-By: Fabian Reinartz <fab.reinartz@gmail.com>
Co-Authored-By: Tristan Colgate <tcolgate@gmail.com>
Co-Authored-By: Corin Lawson <corin@responsight.com>
Co-Authored-By: stuart nelson <stuartnelson3@gmail.com>
* cli: fix httpSilenceAPI.Set() method
* vendor: remove github.com/prometheus/client_golang/api/alertmanager
* cli: don't use the model.Alert type
This change replaces the deprecated InstrumentHandler function by
the equivalent functions from the promhttp package.
The following metrics are removed:
* http_request_duration_microseconds (Summary).
* http_request_size_bytes (Summary).
* http_requests_total (Counter).
And the following metrics are added instead:
* alertmanager_http_request_duration_seconds (Histogram).
* alertmanager_http_response_size_bytes (Histogram).
* promhttp_metric_handler_requests_in_flight (Gauge).
* promhttp_metric_handler_requests_total (Counter).
* cluster: add alertmanager_cluster_messages_queued metric
* cluster: add metrics for sent messages
This change adds 2 new metrics:
- alertmanager_cluster_messages_sent_total
- alertmanager_cluster_messages_sent_size_total
* Fix marshaling for entries being broadcast
Individual notifications logs and silences being broadcast to the other
peers need to be encoded using the same length-delimited format as when
doing full-state synchronization.
* main: fix argument order for cluster.Join()
cluster.Join() was called with the push/pull and gossip interval
parameters being swapped one for another.
The same behavior exists in prometheus. This is a
bit superfluous, but in the event people are using
old versions of prometheus or a different metric
gathering system, it's still valid to check.