Commit Graph

63 Commits

Author SHA1 Message Date
George Robinson f92a08d073
Remove unused feature flags (#3676)
This commit removes some code that should have been removed in #3668.
The FeatureFlags in silence.Options are no longer used but were
still initialized. These had a no-op effect.

Signed-off-by: George Robinson <george.robinson@grafana.com>
2024-01-19 10:43:50 +00:00
George Robinson fa6a7e6dd6
Fix inconsistent defaults in UTF-8 behavior (#3668)
This commit fixes inconsistent UTF-8 behavior if the compat package is
not initialized and feature flags are not passed to the API. This can
happen when Alertmanager is used as a package in software such
as Cortex or Mimir.

The inconsistent behavior is that Alertmanager will accept UTF-8 alerts
but reject UTF-8 configurations.

Since feature flags are optional via api.Options, we cannot force them
to be passed to api.New at compile time. Instead, it's better to defer
back to the compat package which is consistent even when not initialized.

Signed-off-by: George Robinson <george.robinson@grafana.com>
2024-01-15 10:03:51 +00:00
Matthieu MOREL b81bad8711 use Go standard errors
Signed-off-by: Matthieu MOREL <matthieu.morel35@gmail.com>
2023-12-08 16:44:13 +01:00
George Robinson 70bd5dad98
Support UTF-8 label matchers: Use compat package in Alertmanager server (#3567)
* Support UTF-8 label matchers: Use compat package in Alertmanager server

This pull request adds use of the compat package in Alertmanager server that will allow users to switch between the new matchers/parse parser and the old pkg/labels parser. The new matchers/parse parser uses a fallback mechanism where if the input cannot be parsed in the new parser it then attempts to use the old parser. If an input is parsed in the old parser but not the new parser then a warning log is emitted.

Signed-off-by: George Robinson <george.robinson@grafana.com>

---------

Signed-off-by: George Robinson <george.robinson@grafana.com>
2023-11-24 10:01:40 +00:00
gotjosh 3ee2cd0f12
Metrics: Silence maintenance success and failure (#3285)
* Metrics: Silence maintenance success and failure

Due to various reasons, we've observed different kind of errors on this area. From read-only disks to silly code bugs.
Errors during maintenance are effectively a _data loss_ and therefore we should encourage proper monitoring of this area.

This PR Introduces a total and failure metric for silence maintenance. If agreed, I'll do the same for the nflog and fix the flaky test like I did for silences while I'm there.

Signed-off-by: gotjosh <josue.abreu@gmail.com>
2023-03-08 12:32:59 +00:00
gotjosh f59460bfd4
Refactor nflog configuration options to make it similar to Silences. (#3220)
* Refactor nflog configuration options to make it similar to Silences.

The Notification Log is a similar component to Silences. They're the only two things that are shared between nodes when running in HA and they both hold some sort of internal state that needs to be cleaned up on an interval.

To simplify the code and make it a bit more understandable (among other benefits such as improved testability) - I've refactor the notification log configuration and `run` to be similar to the silences.
2023-01-19 16:39:03 +00:00
Joe Blubaugh 505f944c6a Apply suggestions from code review.
Signed-off-by: Joe Blubaugh <joe.blubaugh@grafana.com>
2022-07-05 11:22:46 +08:00
Joe Blubaugh c9249a02bc Remove a stray line that was breaking the linter.
Signed-off-by: Joe Blubaugh <joe.blubaugh@grafana.com>
2022-07-05 11:22:46 +08:00
Joe Blubaugh bedd3c4175 Clean up linter warnings about unused code and atomic package
Signed-off-by: Joe Blubaugh <joe.blubaugh@grafana.com>
2022-07-05 11:22:46 +08:00
Joe Blubaugh cb00d9259b Issue #2850: Add benbjohnson/clock to the silences package.
github.com/benbjohnson/clock provides a time interface to programs
rather than using the stdlib time package. This allows mocking time in
programs and tests. In this commit, the clock is used to speed up and
simplify testing of the silences package.

Signed-off-by: Joe Blubaugh <joe.blubaugh@grafana.com>
2022-07-05 11:22:46 +08:00
gotjosh cfb909f419
Marker: Rename `SetSilenced` to `SetActiveOrSilenced`
This accurately reflects what the function _actually_ does. If no active silences IDs are provided and the list of inhibitions we have is already empty the alert is actually set to Active. Took me a while to realise this as I was understanding how do we populate the alert list.

Signed-off-by: gotjosh <josue.abreu@gmail.com>
2022-06-17 12:51:23 +01:00
Matthias Loibl a6d10bd5bc
Update golangci-lint and fix complaints (#2853)
* Copy latest golangci-lint files from Prometheus

Signed-off-by: Matthias Loibl <mail@matthiasloibl.com>

* Use grafana/regexp over stdlib regexp

Signed-off-by: Matthias Loibl <mail@matthiasloibl.com>

* Fix typos in comments

Signed-off-by: Matthias Loibl <mail@matthiasloibl.com>

* Fix goimports complains in import sorting

Signed-off-by: Matthias Loibl <mail@matthiasloibl.com>

* gofumpt all Go files

Signed-off-by: Matthias Loibl <mail@matthiasloibl.com>

* Update naming to comply with revive linter

Signed-off-by: Matthias Loibl <mail@matthiasloibl.com>

* config: Fix error messages to be lower case

Signed-off-by: Matthias Loibl <mail@matthiasloibl.com>

* test/cli: Fix error messages to be lower case

Signed-off-by: Matthias Loibl <mail@matthiasloibl.com>

* .golangci.yaml: Remove obsolete space

Signed-off-by: Matthias Loibl <mail@matthiasloibl.com>

* config: Fix expected victorOps error

Signed-off-by: Matthias Loibl <mail@matthiasloibl.com>

* Use stdlib regexp

Signed-off-by: Matthias Loibl <mail@matthiasloibl.com>

* Clean up Go modules

Signed-off-by: Matthias Loibl <mail@matthiasloibl.com>
2022-03-25 17:59:51 +01:00
Simon Pasquier 3f42c5e813
Merge pull request #2816 from prashbnair/update_check
Correcting the condition for updating a silence. Earlier was checking…
2022-03-04 15:17:12 +01:00
Soon-Ping a2d18c93de
Return no error when deleting expired silence (#2817)
* Changed Silences.expire(id) to not return error for already expired silence

Signed-off-by: Soon-Ping Phang <soonping@amazon.com>

* Added comment explaining idempotency change for Silences.expire()

Signed-off-by: Soon-Ping Phang <soonping@amazon.com>

* Trigger build

Signed-off-by: Soon-Ping Phang <soonping@amazon.com>

* Trigger build

Signed-off-by: Soon-Ping Phang <soonping@amazon.com>

* Fixed typo in comment

Signed-off-by: Soon-Ping Phang <soonping@amazon.com>

* Trigger build

Signed-off-by: Soon-Ping Phang <soonping@amazon.com>

* Trigger build

Signed-off-by: Soon-Ping Phang <soonping@amazon.com>

* Fixed another typo in comment

Signed-off-by: Soon-Ping Phang <soonping@amazon.com>

* Promoted comment to function-level

Signed-off-by: Soon-Ping Phang <soonping@amazon.com>

* Added API v2 test for DeleteSilence, PostSilence

Signed-off-by: Soon-Ping Phang <soonping@amazon.com>

* Fixed lint errors

Signed-off-by: Soon-Ping Phang <soonping@amazon.com>

* Trigger build

Signed-off-by: Soon-Ping Phang <soonping@amazon.com>

* Trigger build

Signed-off-by: Soon-Ping Phang <soonping@amazon.com>

* Trigger build

Signed-off-by: Soon-Ping Phang <soonping@amazon.com>
2022-02-22 13:34:21 +01:00
Prashant Balachandran 66182178d0 Correcting the condition for updating a silence. Earlier was checking upto
nanosecond precision but reduced to second as the UI only sends upto millisecond

Signed-off-by: Prashant Balachandran <pnair@redhat.com>
2022-01-31 11:32:48 +05:30
Kyle Brandt 1b8afe7cb5
export ValidateMatcher for DI (#2) (#2716)
so third parties, Grafana in particular, can over ride the validation.

Grafana wants to do this because other data sources will have label keys with things like spaces, periods, or other characters - and looking for a better integration with alert manager.

goes with grafana/grafana#38629
replaces https://github.com/prometheus/alertmanager/pull/2694

Signed-off-by: Kyle Brandt <kyle@grafana.com>
2021-10-21 09:29:55 +02:00
Yuriy Tseretyan 15f44f4a61
Close file descriptor after snapshot file was read (#2710)
* close file if it is opened

Signed-off-by: Yuriy Tseretyan <yuriy.tseretyan@grafana.com>
2021-10-19 01:12:02 +02:00
Julius Volz 5195460c95
Correctly call default silence maintenance function (#2701)
https://github.com/prometheus/alertmanager/pull/2689 introduced a
regression where the default maintenance function would no longer be
called even if no override was specified. The Alertmanager now crashes
on any silence maintenance run without this fix.

Signed-off-by: Julius Volz <julius.volz@gmail.com>
2021-09-13 19:42:48 +05:30
gotjosh 8da517524a
Enable support for custom callbacks as part of maintenance (#2689)
* Enable support for custom callbacks as part of maintenance

This enables support for custom Maintenance callbacks as part of the periodic maintenance of silences and notification logs.
Effectively a no-op for the Alertmanager but allows downstream implementation to inject custom logic as part of it.

Signed-off-by: gotjosh <josue.abreu@gmail.com>

* Add tests

Signed-off-by: gotjosh <josue.abreu@gmail.com>

* Fix tests and remove whitespace

Signed-off-by: gotjosh <josue.abreu@gmail.com>

* Address review comments

Signed-off-by: gotjosh <josue.abreu@gmail.com>

* run go fmt

Signed-off-by: gotjosh <josue.abreu@gmail.com>

* Fix import ordering

Signed-off-by: gotjosh <josue.abreu@gmail.com>
2021-09-06 16:19:39 +05:30
Julien Pivotto b2a4cacb95 Update go dependencies & switch to go-kit/log
Signed-off-by: Julien Pivotto <roidelapluie@inuits.eu>
2021-08-02 12:43:23 +02:00
beorn7 e84c265196 Include pending silences for future muting decisions
Previously, if a pending silence existed for an alert, and it later
became active without any silences getting added in the meantime, we
would miss the existence of that newly active silence.

Signed-off-by: beorn7 <beorn@grafana.com>
2021-05-27 22:15:57 +02:00
Ganesh Vernekar 1f946f8a7d
Replace satori/go.uuid with gofrs/uuid
Signed-off-by: Ganesh Vernekar <cs15btech11018@iith.ac.in>
2021-03-15 19:39:15 +05:30
Ganesh Vernekar 406ddd200a
Upgrade github.com/satori/go.uuid
Signed-off-by: Ganesh Vernekar <cs15btech11018@iith.ac.in>
2021-03-10 14:49:07 +05:30
Björn Rabenstein 023937679f
Catch unknown matcher types in gossipped silences (#2484)
This has been discussed in #2479. Even if the conclusion there was
that we don't need this in a bugfix release, it's still better to have
this kind of robustness. So this introduces the same check into the
main branch.

Signed-off-by: beorn7 <beorn@grafana.com>
2021-02-10 12:02:56 +01:00
Kiril Vladimirov f5382af591 silence: Add tests for Not(Equal|Regexp) matchers
... and fix a bug in validating silences with such matchers, caught
while writing them.

Signed-off-by: Kiril Vladimirov <kiril@vladimiroff.org>
2021-01-22 17:02:50 +02:00
Kiril Vladimirov 7320d83cbc Replace types.Matcher(s)? with labels.Matcher(s)?
Signed-off-by: Kiril Vladimirov <kiril@vladimiroff.org>
2021-01-22 17:02:48 +02:00
Josh Soref 0f2c65d265 Spelling (#2167)
* spelling: inhibition

Signed-off-by: Josh Soref <jsoref@users.noreply.github.com>

* spelling: matchers

Signed-off-by: Josh Soref <jsoref@users.noreply.github.com>

* spelling: notification

Signed-off-by: Josh Soref <jsoref@users.noreply.github.com>

* spelling: nonexistent

Signed-off-by: Josh Soref <jsoref@users.noreply.github.com>

* spelling: obfuscated

Signed-off-by: Josh Soref <jsoref@users.noreply.github.com>

* spelling: occurred

Signed-off-by: Josh Soref <jsoref@users.noreply.github.com>

* spelling: relevant

Signed-off-by: Josh Soref <jsoref@users.noreply.github.com>

* spelling: unexpected

Signed-off-by: Josh Soref <jsoref@users.noreply.github.com>

* spelling: marshaled

Signed-off-by: Josh Soref <jsoref@users.noreply.github.com>

* spelling: marshaling

Signed-off-by: Josh Soref <jsoref@users.noreply.github.com>
2020-01-23 17:06:16 +01:00
Ilya Gladyshev 196c62f488 At least one non-empty silence matcher (#2081)
* check if at least one silence matcher doesn't match empty strings

Signed-off-by: qoops <ilya.v.gladyshev@gmail.com>

* fixed grammar

Signed-off-by: qoops <ilya.v.gladyshev@gmail.com>
2019-10-31 15:42:03 +01:00
beorn7 318e006065 Mark some Summaries explicitly as having no objectives
With the next release of client_golang, Summaries will not have
objectives by default. Interestingly, this will do the right thing for
the Summaries affected by this commit. However, right now those
summaries do get the old default objectives. They don't really make
sense because the affected Summaries receive Observations quite
infrequently (far less than once in the 10m max age currently
used). To not get surprising changes when moving on to client_golang
v1, let's explicitly set the Summaries as objective-less now.

Signed-off-by: beorn7 <beorn@grafana.com>
2019-06-12 15:47:56 +02:00
stuart nelson b3972f3adc
Merge pull request #1672 from aixeshunter/master
Unused function 'QTimeRange' and empty slice declaration via literal
2019-05-03 14:13:04 +02:00
beorn7 46b61a38cd Remove a confusing closure
Signed-off-by: beorn7 <beorn@soundcloud.com>
2019-02-28 13:04:05 +01:00
beorn7 0ab3b724cc Fix bug with zero retention time
Essentially, the Silences.Expire() will in that case have no effect
because the affected silence is immediately seen as expired from the
storage and thus not updated. The silence will stay around in its old
state.

This fix makes sure to use the same “now” throughout the expiration
process.

Signed-off-by: beorn7 <beorn@soundcloud.com>
2019-02-28 12:51:40 +01:00
beorn7 3c981a92f7 Improve `Mutes` performance for silences
Add version tracking of silences states. Adding a silence to the state
increments the version. If the version hasn't changed since the last
time an alert was checked for being silenced, we only have to verify
that the relevant silences are still active rather than checking the
alert against all silences.

Signed-off-by: beorn7 <beorn@soundcloud.com>
2019-02-28 12:34:41 +01:00
beorn7 f3d9c89bbc Create a `Muter` implementation for silences
This encapsulates the logic of querying and marking silenced
alerts. It removes the code duplication flagged earlier.

I removed the error returned by the setAlertStatus function as we were
only logging it, and that's already done anyway when the error is
received from the `silence.Query` call (now in the `Mutes` method).

Signed-off-by: beorn7 <beorn@soundcloud.com>
2019-02-26 16:42:59 +01:00
aixeshunter 4deb083823 Unused function 'QTimeRange'
Signed-off-by: aixeshunter <aixeshunter@gmail.com>
2018-12-19 09:47:52 +08:00
stuart nelson 2026e4a01f
[gossip] Don't merge expired gossip messages (#1631)
* [silences] Don't merge expired silences

If they're expired, they should be cleaned up on
the next GC cycle, but merging them in means that
they'll probably be gossip'd continually between
the cluster members.

Signed-off-by: stuart nelson <stuartnelson3@gmail.com>

* Add analogous behavior+test for nflog

The code for nflog was also constantly re-adding
nflogs to the internal memory store, the same as
the silence code was.

Signed-off-by: stuart nelson <stuartnelson3@gmail.com>

* Add retention to TestQuery

With the default 0 retention, the alerts would not
be merged.

Signed-off-by: Stuart Nelson <stuartnelson3@gmail.com>
2018-11-21 11:40:57 +01:00
Max Leonard Inden f1b920bcc9
api: Implement OpenAPI generated Alertmanager API V2
The current Alertmanager API v1 is undocumented and written by hand.
This patch introduces a new Alertmanager API - v2. The API is fully
generated via an OpenAPI 2.0 [1] specification (see
`api/v2/openapi.yaml`) with the exception of the http handlers itself.

Pros:
- Generated server code
- Ability to generate clients in all major languages
  (Go, Java, JS, Python, Ruby, Haskell, *elm* [3] ...)
    - Strict contract (OpenAPI spec) between server and clients.
    - Instant feedback on frontend-breaking changes, due to strictly
      typed frontend language elm.
- Generated documentation (See Alertmanager online Swagger UI [4])

Cons:
- Dependency on open api ecosystem including go-swagger [2]

In addition this patch includes the following changes.

- README.md: Add API section

- test: Duplicate acceptance test to API v1 & API v2 version

  The Alertmanager acceptance test framework has a decent test coverage
  on the Alertmanager API. Introducing the Alertmanager API v2 does not go
  hand in hand with deprecating API v1. They should live alongside each
  other for a couple of minor Alertmanager versions.

  Instead of porting the acceptance test framework to use the new API v2,
  this patch duplicates the acceptance tests, one using the API v1, the
  other API v2.

  Once API v1 is removed we can simply remove `test/with_api_v1` and bring
  `test/with_api_v2` to `test/`.

[1]
https://github.com/OAI/OpenAPI-Specification/blob/master/versions/2.0.md

[2] https://github.com/go-swagger/go-swagger/

[3] https://github.com/ahultgren/swagger-elm

[4]
http://petstore.swagger.io/?url=https://raw.githubusercontent.com/mxinden/alertmanager/apiv2/api/v2/openapi.yaml

Signed-off-by: Max Leonard Inden <IndenML@gmail.com>
2018-09-04 13:38:34 +02:00
stuart nelson 445fbdf1a8
gossip large messages via SendReliable (#1415)
* Gossip large messages via SendReliable

For messages beyond half of the maximum gossip
packet size, send the message to all peer nodes
via TCP.

The choice of "larger than half the max gossip
size" is relatively arbitrary. From brief testing,
the overhead from memberlist on a packet seemed to
only use ~3 of the available 1400 bytes, and most
gossip messages seem to be <<500 bytes.

* Add tests for oversized/normal message gossiping

* Make oversize metric names consistent

* Remove errant printf in test

* Correctly increment WaitGroup

* Add comment for OversizedMessage func

* Add metric for oversized messages dropped

Code was added to drop oversized messages if the
buffered channel they are sent on is full. This
is a good thing to surface as a metric.

* Add counter for total oversized messages sent

* Change full queue log level to debug

Was previously a warning, which isn't necessary
now that there is a metric tracking it.

Signed-off-by: stuart nelson <stuartnelson3@gmail.com>
2018-06-15 13:40:21 +02:00
stuart nelson 36588c3865
memberlist gossip (#1389)
* Peers further propagate newly received nflogs

If a peer receives an nflog that it hasn't seen
before, queue the message and propagate it further
to other peers. This should ensure that all
peers within a cluster receive all gossip
messages.

Signed-off-by: stuart nelson <stuartnelson3@gmail.com>

* Set Retransmit value based on number of members

For alertmanagers that are brought up with a list
of peers, set the number of message retransmits to
be half of that number. If there are no peers on
start, or there are few, continue to use the
default value of 3.

Signed-off-by: stuart nelson <stuartnelson3@gmail.com>

* [nflog] Move retransmit calculation

Signed-off-by: stuart nelson <stuartnelson3@gmail.com>

* [silence] further gossip silence messages

Signed-off-by: stuart nelson <stuartnelson3@gmail.com>

* Set GossipNodes to equal RetransmitMulti

During a gossip, we send messages to at most
GossipNodes nodes. If possible, we only a message
to reach all nodes as soon as possible.

Signed-off-by: stuart nelson <stuartnelson3@gmail.com>

* Fix rebase

Signed-off-by: stuart nelson <stuartnelson3@gmail.com>
2018-06-08 11:48:42 +02:00
Ted Zlatanov b04e9ad19b #1346: move maintenance messages to DEBUG log level (#1347)
Signed-off-by: Ted Zlatanov <tzz@lifelogs.com>
2018-04-30 11:56:17 +02:00
Simon Pasquier 2d68b4d318 silence: fix potential panic in decodeState()
Signed-off-by: Simon Pasquier <spasquie@redhat.com>
2018-04-10 10:12:05 +02:00
Simon Pasquier 1531aa66f3 Fix for #1282 (#1286)
* cluster: add alertmanager_cluster_messages_queued metric

* cluster: add metrics for sent messages

This change adds 2 new metrics:

- alertmanager_cluster_messages_sent_total
- alertmanager_cluster_messages_sent_size_total

* Fix marshaling for entries being broadcast

Individual notifications logs and silences being broadcast to the other
peers need to be encoded using the same length-delimited format as when
doing full-state synchronization.

* main: fix argument order for cluster.Join()

cluster.Join() was called with the push/pull and gossip interval
parameters being swapped one for another.
2018-03-22 13:53:00 +01:00
pasquier-s c2dac90434 silence: fix skipped test (#1258)
TestStateMerge() was skipped because of a typo. Fixing the name revealed
that the test itself needed to be updated following the switch to the
memberlist library.
2018-02-27 10:17:48 +01:00
Fabian Reinartz fd49dbb477 *: move to memberlist for clustering 2018-02-08 12:18:44 +01:00
pasquier-s a7d4e4ea7c Log snapshot sizes on maintenance (#1155)
* Log snapshot sizes on maintenance

* Add metrics for snapshot sizes

This change adds 2 new gauges for tracking the last snapshots' sizes:

  - alertmanager_nflog_snapshot_size_bytes
  - alertmanager_silences_snapshot_size_bytes
2018-01-10 14:53:57 +01:00
Jose Donizetti 74808e40f3 Refactor silence constants (#1076)
* Refactor remove dups silence state constants

* Refactor to use const instead of string
2017-11-07 11:36:30 +01:00
Julius Volz 947970af44 Convert Alertmanager to use non-global go-kit loggers
Fixes https://github.com/prometheus/alertmanager/issues/1040
2017-10-22 00:20:40 -07:00
Corentin Chary bff889b490 silence|alerts: add metrics about current silences and alerts
This adds metrics that look like this:
```
alertmanager_alerts{state="active"} 6
alertmanager_alerts{state="suppressed"} 0
alertmanager_silences{state="active"} 1
alertmanager_silences{state="expired"} 1
alertmanager_silences{state="pending"} 0
```

This can be used to monitor alertmanager's usage and validate that
alertmanagers in a mesh have a similar number of silences and alerts.
2017-10-02 13:33:29 +02:00
Jose Donizetti 9449bd1fa9 Ignore expired silences OnGossip (#999)
This will fix the bug of resync deleted silences
due to the state of other peers.
2017-09-28 10:25:35 +02:00
Corentin Chary 34d9524ab9 silences: avoid deadlock (#995)
* silences: avoid deadlock

Calling gossip.GossipBroadcast() will cause a deadlock if
there is a currently executing OnBroadcast* function.

See #982

* silence_test: better unit test to detect deadlocks
2017-09-27 11:48:28 +02:00