If a users chooses to disable the Alertmanager cluster feature, there is
no cluster name nor cluster peers. Hence these should be optional. Only
cluster status is set to "disabled".
Signed-off-by: Max Leonard Inden <IndenML@gmail.com>
When users start Alertmanager with `--cluster.listen-address=`, the
cluster will not be initialized, hence api.peer will be `nil`. So far
this would result in a nil pointer dereference by the API v2 accessing
the api.peer field.
With this patch, api v2 skips populating the peers array, sets the name
to an empty string and the status to "disabled" in case `api.peer` is
nil.
Signed-off-by: Max Leonard Inden <IndenML@gmail.com>
With issue 1465 on openapi-generator [1] being fixed, we can not extract
shared properties of the gettable and postable alert definition into a
shared object (`alert`) like we do for silence, gettable silence and
postable silence.
In addition this patch does the following changes to the UI:
- Use `List GettableAlert` instead of plural type definition like
`GettableAlerts` because the plural definitions are not generated.
- Fix openapi-generator-cli docker image to specific hash.
[1] https://github.com/OpenAPITools/openapi-generator/issues/1465
Signed-off-by: Max Leonard Inden <IndenML@gmail.com>
Instead of having one general silence, differentiate between postable
and gettable silence, hence making more fields required.
Signed-off-by: Max Leonard Inden <IndenML@gmail.com>
This patch makes the Alertmanager UI (/status & /silences) use the
api/v2 endpoint. In addition it adds logic to generate the elm side data
model based on the OpenAPI specification.
Signed-off-by: Max Leonard Inden <IndenML@gmail.com>
Instead of only testing single instance Alertmanagers, this patch
enables individual tests to spin up Alertmanager clusters.
In addition it adds two tests:
1. A test firing alerts against a cluster, expecting to only receive a a
notification by one of the Alertmanager instances in the cluster.
2. A test firing alerts both against a single instance as well as a
cluster, making sure the output equals.
Signed-off-by: Max Leonard Inden <IndenML@gmail.com>
The current Alertmanager API v1 is undocumented and written by hand.
This patch introduces a new Alertmanager API - v2. The API is fully
generated via an OpenAPI 2.0 [1] specification (see
`api/v2/openapi.yaml`) with the exception of the http handlers itself.
Pros:
- Generated server code
- Ability to generate clients in all major languages
(Go, Java, JS, Python, Ruby, Haskell, *elm* [3] ...)
- Strict contract (OpenAPI spec) between server and clients.
- Instant feedback on frontend-breaking changes, due to strictly
typed frontend language elm.
- Generated documentation (See Alertmanager online Swagger UI [4])
Cons:
- Dependency on open api ecosystem including go-swagger [2]
In addition this patch includes the following changes.
- README.md: Add API section
- test: Duplicate acceptance test to API v1 & API v2 version
The Alertmanager acceptance test framework has a decent test coverage
on the Alertmanager API. Introducing the Alertmanager API v2 does not go
hand in hand with deprecating API v1. They should live alongside each
other for a couple of minor Alertmanager versions.
Instead of porting the acceptance test framework to use the new API v2,
this patch duplicates the acceptance tests, one using the API v1, the
other API v2.
Once API v1 is removed we can simply remove `test/with_api_v1` and bring
`test/with_api_v2` to `test/`.
[1]
https://github.com/OAI/OpenAPI-Specification/blob/master/versions/2.0.md
[2] https://github.com/go-swagger/go-swagger/
[3] https://github.com/ahultgren/swagger-elm
[4]
http://petstore.swagger.io/?url=https://raw.githubusercontent.com/mxinden/alertmanager/apiv2/api/v2/openapi.yaml
Signed-off-by: Max Leonard Inden <IndenML@gmail.com>
Errcheck [1] enforces error handling accross all go files. Functions can
be excluded via `scripts/errcheck_excludes.txt`.
This patch adds errcheck to the `test` Make target.
[1] https://github.com/kisielk/errcheck
Signed-off-by: Max Leonard Inden <IndenML@gmail.com>
* notify: notify resolved alerts properly
The PR #1205 while fixing an existing issue introduced another bug when
the send_resolved flag of the integration is set to true.
With send_resolved set to false, the semantics remain the same:
AlertManager generates a notification when new firing alerts are added
to the alert group. The notification only carries firing alerts.
With send_resolved set to true, AlertManager generates a notification
when new firing or resolved alerts are added to the alert group. The
notification carries both the firing and resolved notifications.
Signed-off-by: Simon Pasquier <spasquie@redhat.com>
* Fix comments
Signed-off-by: Simon Pasquier <spasquie@redhat.com>
* inhibit: update inhibition cache when alerts resolve
Signed-off-by: Simon Pasquier <spasquie@redhat.com>
* inhibit: remove unnecessary fmt.Sprintf
Signed-off-by: Simon Pasquier <spasquie@redhat.com>
* inhibit: add unit tests
Signed-off-by: Simon Pasquier <spasquie@redhat.com>
* inhibit: use NopLogger in tests
Signed-off-by: Simon Pasquier <spasquie@redhat.com>
* Update old alert with result of merge with new
On ingest, alerts with matching fingerprints are
merged if the new alert's start and end times
overlap with the old alert's.
The merge creates a new alert, which is then
updated in the internal alert store.
The original alert is not updated (because merge
creates a copy), so it is never marked as resolved
in the inhibitor's reference to it.
The code within the inhibitor relies on skipping
over resolved alerts, but because the old alert is
never updated it is never marked as resolved. Thus
it continues to inhibit other alerts until it is
cleaned up by the internal GC.
This commit updates the struct of the old alert
with the result of the merge with the new alert.
An alternative would be to always update the
inhibitor's internal cache of alerts regardless of
an alert's resolve status.
Signed-off-by: stuart nelson <stuartnelson3@gmail.com>
* Update inhibitor cache even if alert is resolved
This seems like a better choice than the previous
commit. I think it is more sane to have the
inhibitor update its own cache, rather than having
one of its pointers updated externally.
Signed-off-by: stuart nelson <stuartnelson3@gmail.com>
* cli: move commands to cli/cmd
* cli: use StatusAPI interface for config command
* cli: use SilenceAPI interface for silence commands
* cli: use AlertAPI for alert command
* cli: move back commands to cli package
And move API client code to its own package.
* cli: remove unused structs
When the aggregation group receives an alert that is past the initial
group_wait value, it should reset its timer only if the timer has ever
expired. Otherwise it means that the flush is already in-progress.
* cli: extract client bindings of the v1 API from amtool
This is a continuation of [1] but the code is kept in the alertmanage
repository rather than having it in client_golang.
[1] https://github.com/prometheus/client_golang/pull/333
Co-Authored-By: Fabian Reinartz <fab.reinartz@gmail.com>
Co-Authored-By: Tristan Colgate <tcolgate@gmail.com>
Co-Authored-By: Corin Lawson <corin@responsight.com>
Co-Authored-By: stuart nelson <stuartnelson3@gmail.com>
* cli: fix httpSilenceAPI.Set() method
* vendor: remove github.com/prometheus/client_golang/api/alertmanager
* cli: don't use the model.Alert type
* Wait for the gossip to settle before sending notifications
See #1209 for details.
As an heuristic for mesh readyness, try to see if
the mesh looks stable (the number of peers isn't changing too much).
This implementation always mark the altermanager as ready after a maximum of 60s.
This adds one new flags to control this behavior:
```
--cluster.settle-timeout=60s mesh settling timeout. Do not wait more than this duration on startup.
```
It also adds `/-/ready` which always return 200 (in order to make it clear
that we are ready as soon as we can receive requests).
The mesh status is exposed in `/api/v1/status` and visible on `/#/status`.
* cluster: fix typos and base interval on gossipInterval
After the initial notification has been sent, AlertManager shouldn't notify the
receiver again when no new alerts have been added to the group during
group_interval.
This change also modifies the acceptance test framework to assert that no
notification has been received in a given interval.
This change decreases the repeat_interval parameter from 5s to 4.9s to
make sure that the alerts are effectively sent after 5 seconds.
The workflow is:
- The dispatcher flushes the alerts at t0, sends the notification and
marks the notification log at t0+epsilon.
- The dispatcher flushes the alerts at t1, t2, t3 and t4 and doesn't
send the notifications as expected.
- At t5, the dispatcher flushes the alerts because current_time - (t0+epsilon)
is less then repeat_interval.
If repeat_interval is exactly 5s, there is a little chance that it is
greater than current_time - (t0+epsilon).
Building a hash over an entire set of alerts causes problems, because
the hash differs, on any change, whereas we only want to send
notifications if the alert and it's state have changed. Therefore this
introduces a list of alerts that are active and a list of alerts that
are resolved. If the currently active alerts of a group are a subset of
the ones that have been notified about before then they are
deduplicated. The resolved notifications work the same way, with a
separate list of resolved notifications that have already been sent.
* Find MAC address if mesh.hardware-addr not given
Defaulting to the machine's MAC address fails
sometimes fails and causes a panic. Allow the user
to specify custom address to skip this so they can
run AlertManager.
* -mesh.hardware-address -> -mesh.peer-id
* Fix command-line invocation
* Wait for test server to be ready before running tests
This fixes problems when running the acceptance tests in slow or CPU-starved
machines, as mentioned in #472.
Resolved alerts, even when filtered, have to end up in the
SetNotifiesStage, otherwise when an alert fires again it is ambiguous
whether it was resolved in between or not.
fixes#523
This commit replaces the previous NotifyInfo provider with the new
nflog package. It needs adjustments in the behavior of the deduping
stage.
The nflog stores notification digests per receiver per alert aggregation
group rather than one entry for alert per receiver. This drastically
reduces the number of entries and removes interference
across aggregation groups.
This commit removes the dependency on model.Silence for the internal
Silence type, uses UUIDs instead of uint64s and clarifies invariants
around timestamp handling.
The created_at timestamp is removed for the time being.
Initial testing has shown BoltDB in plain usage to be a bottleneck
at a few thousand alerts or more (especially JSON decoding.
This commit actually makes them purely memory as a temporary solution.
Previously, the tests would listen on all available interfaces.
Instead, have the tests use localhost only; using all available
interfaces is unnecessary.
On Mac OS X with the builtin firewall enabled, it triggers annoying
prompts to allow the tests to listen on all interfaces.
This commit changes the notification grouping behavior
to simply send all alerts of a group as soon as a single
one of them needs updating.
This fixes a critical bug which caused erroneous resolved
notifications to be sent.