Commit Graph

61 Commits

Author SHA1 Message Date
Julien Duchesne
8855c2e626
Add prometheus_tsdb_clean_start metric (#8824)
Add cleanup of the lockfile when the db is cleanly closed

The metric describes the status of the lockfile on startup
0: Already existed
1: Did not exist
-1: Disabled

Therefore, if the min value over time of this metric is 0, that means that executions have exited uncleanly
We can then use that metric to have a much lower threshold on the crashlooping alert:

If the metric exists and it has been zero, two restarts is enough to trigger the alarm
If it does not exist (old prom version for example), the current five restarts threshold remains
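
The threshold rule above can be sketched as follows. This is a minimal illustration, not the alert expression from the mixin: the function names are hypothetical, the use of `>=` for both thresholds is an assumption, and treating the values 1 and -1 like a missing metric (five-restart threshold) is also an assumption, since the message only spells out the zero and missing cases.

```python
from typing import Optional

# Values of the clean-start metric as listed in the commit message.
ALREADY_EXISTED = 0   # lockfile was present on startup: previous run exited uncleanly
DID_NOT_EXIST = 1     # lockfile was absent on startup: previous run closed cleanly
DISABLED = -1         # lockfile handling is disabled


def restart_threshold(clean_start: Optional[float]) -> int:
    """Return the number of restarts that should trigger the crashloop alert.

    clean_start is the last observed metric value, or None when the metric
    does not exist (for example on an older Prometheus version).
    """
    if clean_start is not None and clean_start == ALREADY_EXISTED:
        return 2  # an unclean exit was observed, so alert much earlier
    # Assumption: missing, clean (1), and disabled (-1) all keep the old threshold.
    return 5


def should_alert(restarts: int, clean_start: Optional[float]) -> bool:
    """Hypothetical helper: compare the restart count against the threshold."""
    return restarts >= restart_threshold(clean_start)
```

For example, `should_alert(2, 0)` is true, while `should_alert(2, None)` is false because the five-restart threshold still applies.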

Signed-off-by: Julien Duchesne <julien.duchesne@grafana.com>

* Change metric name + set unset value to -1

Signed-off-by: Julien Duchesne <julien.duchesne@grafana.com>

* Only check the last value of the clean start alert

Signed-off-by: Julien Duchesne <julien.duchesne@grafana.com>

* Fix test + nit

Signed-off-by: Julien Duchesne <julien.duchesne@grafana.com>
2021-06-16 15:03:02 +05:30
hanjm
1df05bfd49 Add body_size_limit to prevent large response bodies from bad targets causing Prometheus server OOM (#8827)
Signed-off-by: hanjm <hanjinming@outlook.com>
2021-05-29 07:05:42 +08:00
Levi Harrison
2826fbeeb7
SD: Add target creation failure counter and change failure handling (#8786)
* Added metric and changed failure/drop strategy

Signed-off-by: Levi Harrison <git@leviharrison.dev>
2021-05-28 23:50:59 +02:00
Damien Grisonnet
b50f9c1c84
Add label scrape limits (#8777)
* scrape: add label limits per scrape

Add three new limits to the scrape configuration to provide some
mechanism to defend against an unbounded number of labels and excessive
label lengths. If any of these limits are broken by a sample from a
scrape, the whole scrape will fail. For all of these configuration
options, a zero value means no limit.

The `label_limit` configuration will provide a mechanism to bound the
number of labels of each sample in a scrape to a user-defined limit.
This limit will be tested against the sample labels plus the discovery
labels, but it will exclude the __name__ from the count since it is a
mandatory Prometheus label to which applying constraints isn't
meaningful.

The `label_name_length_limit` and `label_value_length_limit` will
prevent having labels of excessive lengths. These limits also skip the
__name__ label for the same reasons as the `label_limit` option and will
also make the scrape fail if any sample has a label name/value length
that exceeds the predefined limits.
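
A rough sketch of the three checks described above, written in Python rather than the Go of the actual scrape code. The function name, its parameters, and the error strings are hypothetical. The `count_name` flag is an illustrative addition: it defaults to skipping `__name__` as this bullet describes, and setting it to true mirrors the later bullet in this commit that applies the limits to `__name__` as well. Lengths are measured with Python's `len`, which counts characters, whereas a byte-based count may differ for non-ASCII labels.

```python
from typing import Dict, Optional


def check_label_limits(
    labels: Dict[str, str],
    label_limit: int = 0,
    name_length_limit: int = 0,
    value_length_limit: int = 0,
    count_name: bool = False,
) -> Optional[str]:
    """Return an error string if a limit is broken, else None.

    A limit of zero means "no limit", as in the commit message.
    """
    # Optionally leave the mandatory __name__ label out of every check.
    considered = {
        name: value
        for name, value in labels.items()
        if count_name or name != "__name__"
    }

    if label_limit > 0 and len(considered) > label_limit:
        return f"label_limit exceeded: {len(considered)} > {label_limit}"

    for name, value in considered.items():
        if name_length_limit > 0 and len(name) > name_length_limit:
            return f"label_name_length_limit exceeded by label {name!r}"
        if value_length_limit > 0 and len(value) > value_length_limit:
            return f"label_value_length_limit exceeded by label {name!r}"

    return None
```

In the behaviour described above, any non-`None` result for a single sample would fail the whole scrape.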

Signed-off-by: Damien Grisonnet <dgrisonn@redhat.com>

* scrape: add metrics and alert to label limits

Add three gauges, one for each label limit, to easily access the
limit set by a certain scrape target.
Also add a counter to count the number of targets that exceeded the
label limits and thus were dropped. This is useful for the
`PrometheusLabelLimitHit` alert that will notify the users that scraping
some targets failed because they had samples exceeding the label limits
defined in the scrape configuration.

Signed-off-by: Damien Grisonnet <dgrisonn@redhat.com>

* scrape: apply label limits to __name__ label

Apply limits to the __name__ label that was previously skipped and
truncate the label names and values in the error messages as they can be
very long.

Signed-off-by: Damien Grisonnet <dgrisonn@redhat.com>

* scrape: remove label limits gauges and refactor

Remove `prometheus_target_scrape_pool_label_limit`,
`prometheus_target_scrape_pool_label_name_length_limit`, and
`prometheus_target_scrape_pool_label_value_length_limit` as they are not
really useful since we don't have the information on the labels in it.

Signed-off-by: Damien Grisonnet <dgrisonn@redhat.com>
2021-05-06 09:56:21 +01:00
ravilr
adc8807851
Update remote-write alert rules mixin (#8423)
Signed-off-by: ravilr <raviprasad_lr@yahoo.com>
2021-01-31 20:07:49 +00:00
Frederic Branczyk
62bc755733
mixin: Scope grafana config
In its current form this configuration clashes in one of the most widely
used configurations (kube-prometheus). This patch scopes the
configuration to prevent this.

Signed-off-by: Frederic Branczyk <fbranczyk@gmail.com>
2020-12-30 17:50:34 +01:00
Nicolas Lamirault
aa1ca13025
Add: Custom tags and prefix in Prometheus Mixin (#8287)
* Add: custom tags and prefix

Signed-off-by: Nicolas Lamirault <nicolas.lamirault@gmail.com>

* Fix: fmt

Signed-off-by: Nicolas Lamirault <nicolas.lamirault@gmail.com>
2020-12-16 18:49:06 +01:00
Björn Rabenstein
511511324a
Merge pull request #8235 from Allex1/master
Update remote-write grafana mixin
2020-12-08 14:50:47 +01:00
beorn7
553f904f2d mixin: Add a capability to exclude non-prod AM instances
Signed-off-by: beorn7 <beorn@grafana.com>
2020-12-03 20:59:53 +01:00
birca
3ec4161575 Update remote-write grafana mixin
Signed-off-by: birca <birca@adobe.com>
2020-12-02 09:50:15 +02:00
beorn7
638e99c814 prometheus-mixin: Make PrometheusRemoteWriteBehind more generic
Currently, it relies on `job, instance` being the labels completely
identifying a Prometheus instance. However, what's intended is to
simply not match on `remote_name, url`.
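
To illustrate the difference, here is a small Python model of how a match key could be built when matching on `job, instance` versus ignoring `remote_name, url`. It is a simplification of PromQL vector matching, not the alert expression itself; the helper names and the `cluster` label are hypothetical.

```python
from typing import Dict, Iterable, Tuple

Labels = Dict[str, str]
Key = Tuple[Tuple[str, str], ...]


def key_on(labels: Labels, names: Iterable[str]) -> Key:
    """Match key that uses only the listed labels (missing ones count as empty)."""
    return tuple(sorted((n, labels.get(n, "")) for n in names))


def key_ignoring(labels: Labels, names: Iterable[str]) -> Key:
    """Match key that uses every label except the listed ones and the metric name."""
    skipped = set(names) | {"__name__"}
    return tuple(sorted((k, v) for k, v in labels.items() if k not in skipped))


# Two Prometheus instances that differ only in a hypothetical "cluster" label.
prom_eu = {"cluster": "eu", "job": "prometheus", "instance": "prom-0:9090"}
prom_us = {"cluster": "us", "job": "prometheus", "instance": "prom-0:9090"}
# A per-queue series from the first instance carries remote_name and url as well.
queue_eu = {**prom_eu, "remote_name": "r1", "url": "http://example.invalid/write"}

# Matching on job/instance cannot tell the two instances apart.
collides = key_on(prom_eu, ("job", "instance")) == key_on(prom_us, ("job", "instance"))
# Ignoring remote_name/url keeps every other identifying label in the key.
distinct = key_ignoring(prom_eu, ("remote_name", "url")) != key_ignoring(prom_us, ("remote_name", "url"))
pairs_up = key_ignoring(queue_eu, ("remote_name", "url")) == key_ignoring(prom_eu, ("remote_name", "url"))
```

With these inputs all three of `collides`, `distinct`, and `pairs_up` are true: the `on` form conflates the two instances, while the `ignoring` form keeps them apart and still pairs the queue series with its own instance.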

Signed-off-by: beorn7 <beorn@grafana.com>
2020-11-17 13:29:49 +01:00
beorn7
371ca9ff46 prometheus-mixin: add HA-group aware alerts
There is certainly a potential to add more of these. This is mostly
meant to introduce the concept and cover a few critical parts.

Signed-off-by: beorn7 <beorn@grafana.com>
2020-11-11 19:45:34 +01:00
Matthias Loibl
13ba013a24
Use absolute jsonnet import paths
This should be the way forward when importing libraries in jsonnet. It's
closer to how Go imports look and makes it more obvious where packages
live.

This is not breaking anything, as the old imports were already symlinks
to the now directly used directories.

Signed-off-by: Matthias Loibl <mail@matthiasloibl.com>
2020-10-20 11:42:30 +02:00
Björn Rabenstein
d49f267f76
Merge pull request #8054 from simonpasquier/improve-not-ingesting-samples-alert
documentation/prometheus-mixin: improve PrometheusNotIngestingSamples
2020-10-15 12:29:39 +02:00
Simon Pasquier
f381d8a9bd documentation/prometheus-mixin: improve PrometheusNotIngestingSamples
The alert shouldn't fire when there's no target and no rule configured.

Signed-off-by: Simon Pasquier <spasquie@redhat.com>
2020-10-15 11:13:17 +02:00
Julien Pivotto
4596abee4d
Mixin: Ignore unset remote write timestamp (#8046)
* Mixin: Ignore unset remote write timestamp

This pull request ignores the zero value of highest_sent_timestamp_seconds
in Highest Timestamp In vs. Highest Timestamp Sent, which just shows that
remote write has not been successful yet.
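
The idea can be sketched in a few lines of Python. This is only a model of the filtering step, not the dashboard query; the function name is hypothetical, and returning `None` to mean "no data point" is an assumption made for the example.

```python
from typing import Optional


def remote_write_lag_seconds(highest_in: float, highest_sent: float) -> Optional[float]:
    """Return how far the sent timestamp trails the ingested one, in seconds.

    A sent timestamp of zero means remote write has not succeeded yet, so the
    value is dropped instead of being reported as an enormous lag.
    """
    if highest_sent == 0:
        return None  # unset: ignore rather than plot
    return highest_in - highest_sent
```

For instance, `remote_write_lag_seconds(1000.0, 990.0)` gives `10.0`, while a zero sent timestamp yields no value at all.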

Signed-off-by: Julien Pivotto <roidelapluie@inuits.eu>
2020-10-15 09:15:59 +02:00
Simon Pasquier
e693af6c01
.circleci/config.yml: check mixins (#6895)
* .circleci/config.yml: check mixins

Signed-off-by: Simon Pasquier <spasquie@redhat.com>

* Run jsonnetfmt

Signed-off-by: Simon Pasquier <spasquie@redhat.com>

* Install tools in the image instead of using coreos/jsonnet-ci

The latter is deprecated

Signed-off-by: Simon Pasquier <spasquie@redhat.com>

* Update jsonnetfile.json

Signed-off-by: Simon Pasquier <spasquie@redhat.com>
2020-08-25 15:59:41 +02:00
Julien Pivotto
f482c7bdd7
Add per scrape-config targets limit (#7554)
* Add per scrape-config targets limit

Signed-off-by: Julien Pivotto <roidelapluie@inuits.eu>
2020-07-30 14:20:24 +02:00
Tom Wilkie
27b1009acd
Rename the dashboard in the mixin to 'Prometheus Overview'. (#7489)
Due to https://github.com/grafana/grafana/issues/15642, this prevents users putting this dashboard in a Grafana folder called 'Prometheus'.

Signed-off-by: Tom Wilkie <tom.wilkie@gmail.com>
2020-06-30 15:45:44 +01:00
Manuel Fontan
6e7554639b Update Readme since jsonnetfmt is available in the jsonnet go implementation since v0.16.0
Signed-off-by: Manuel Fontan <mfontangarcia@slack-corp.com>
2020-06-16 10:41:58 +01:00
Callum Styan
5400e71b91 Update mixin dashboards and alerts for new remote write label names.
Signed-off-by: Callum Styan <callumstyan@gmail.com>
2020-04-08 12:56:00 -07:00
Marco Pracucci
1e1785690a
Fix queue in alerts annotation
Signed-off-by: Marco Pracucci <marco@pracucci.com>
2020-02-12 12:48:13 +01:00
paulfantom
7321f1d227
documentation/prometheus-mixin: add dependency on grafonnet
Signed-off-by: paulfantom <pawel@krupa.net.pl>
2020-01-11 23:18:04 +01:00
Callum Styan
f4fb6dc208 Simplify remote write dashboard in mixin.
Signed-off-by: Callum Styan <callumstyan@gmail.com>
2019-11-18 19:58:07 -08:00
beorn7
9c8f9bfa63 Fix the description template for PrometheusRemoteWriteDesiredShards
Signed-off-by: beorn7 <beorn@grafana.com>
2019-10-30 13:27:37 +01:00
beorn7
61617eb2d9 Fix PrometheusRemoteWriteDesiredShards
This rule has the same labels on both sides. We don't want
`group_right` and `on`, we want nothing.

Signed-off-by: beorn7 <beorn@grafana.com>
2019-10-29 00:23:39 +01:00
Callum Styan
da6d46625f Repeat shards panels on the queue label.
Signed-off-by: Callum Styan <callumstyan@gmail.com>
2019-10-21 11:03:50 -07:00
Callum Styan
818974ff8f Rewrite remote write dashboard using base grafonnet.
Signed-off-by: Callum Styan <callumstyan@gmail.com>
2019-10-17 15:40:58 -07:00
Callum Styan
81fa63006c Add additional shards/segment graphs to remote write dashboard.
Signed-off-by: Callum Styan <callumstyan@gmail.com>
2019-10-09 09:59:02 -07:00
Simon Pasquier
e36ab7e192
prometheus-mixin: improve description of sample alerts (#6050)
Signed-off-by: Simon Pasquier <spasquie@redhat.com>
2019-09-24 17:44:27 +02:00
Björn Rabenstein
3b3eaf3496
Merge pull request #5787 from cstyan/reshard-max-logging
Add metrics for max/min/desired shards to queue manager.
2019-09-09 22:32:54 +02:00
Callum Styan
a98599bea8 Update remote write max shards alert; properly template/query for max
shards in description.

Signed-off-by: Callum Styan <callumstyan@gmail.com>
2019-09-09 12:01:11 -07:00
Callum Styan
3b75614892 Add a warning alert, since the remote write behind alert will probably
already be going off, about desired shards being higher than max shards.

Signed-off-by: Callum Styan <callumstyan@gmail.com>
2019-08-08 06:45:46 -07:00
Simon Pasquier
dd174963a2 prometheus-mixin: remove PrometheusTSDBWALCorruptions
The counter is only increased when tsdb.Open() is called which
Prometheus does only once in its lifetime (when it initializes). If the
corruption can't be recovered, tsdb.Open() returns an error and
Prometheus exits. Hence the metric is either 0 (no corruption) or 1
(corruption detected and repaired). If the latter, the alert isn't
actionable and the only way to resolve it is to restart Prometheus which
would reset the counter.

Signed-off-by: Simon Pasquier <spasquie@redhat.com>
2019-08-06 14:36:56 +02:00
Matthias Loibl
20d12ff1c7
Fix prometheus-mixin dashboards to use grafanaDashboards
Signed-off-by: Matthias Loibl <mail@matthiasloibl.com>
2019-07-11 15:40:26 +02:00
beorn7
4825585834 Tweak tenses
Signed-off-by: beorn7 <beorn@grafana.com>
2019-06-28 17:37:49 +02:00
beorn7
9a2177949d Protect gauge-based alerts against failed scrapes
Signed-off-by: beorn7 <beorn@grafana.com>
2019-06-28 16:46:19 +02:00
beorn7
52707535b8 Remove/improve unused variables and weird doc comments
Signed-off-by: beorn7 <beorn@grafana.com>
2019-06-28 15:41:31 +02:00
beorn7
7a25a2586d Sync with alerts from kube-prometheus
While doing so, re-introduce the summary/description
annotations. Also, add a few more rules and tweak a few of the
existing ones.

Signed-off-by: beorn7 <beorn@grafana.com>
2019-06-27 23:50:26 +02:00
beorn7
ded0705bdc Update remote repo for grafana-builder dependency
Signed-off-by: beorn7 <beorn@grafana.com>
2019-06-27 14:39:38 +02:00
beorn7
1336a28848 Use a config variable for the Prometheus name
Signed-off-by: beorn7 <beorn@grafana.com>
2019-06-27 14:34:11 +02:00
beorn7
613cb5430c Add a "work in progress" disclaimer.
Signed-off-by: beorn7 <beorn@grafana.com>
2019-06-26 23:24:22 +02:00
beorn7
e34af6d4d3 Address various comments from the review
Signed-off-by: beorn7 <beorn@grafana.com>
2019-06-26 23:22:16 +02:00
beorn7
23c03207e9 Fixed indentation
Signed-off-by: beorn7 <beorn@grafana.com>
2019-06-26 20:31:05 +02:00
beorn7
d5845ad05b Fix formatting
This is the outcome of `make fmt`.

Signed-off-by: beorn7 <beorn@grafana.com>
2019-06-26 16:23:25 +02:00
beorn7
d45e8a0f61 Adjust to jsonnet v0.13
Signed-off-by: beorn7 <beorn@grafana.com>
2019-06-26 16:22:21 +02:00
beorn7
5c04ef3935 Make README.md immediately useful
Signed-off-by: beorn7 <beorn@grafana.com>
2019-06-26 16:12:59 +02:00
beorn7
ddfabda152 Add Makefile and suitable jsonnet files
This makes the mixins usable as advertised.

Signed-off-by: beorn7 <beorn@grafana.com>
2019-06-26 15:30:55 +02:00
beorn7
e943803a3c Add .gitignore file
Signed-off-by: beorn7 <beorn@grafana.com>
2019-06-26 15:22:23 +02:00
Callum Styan
a5762f3681 Add dashboard for remote write to prometheus-mixin.
Signed-off-by: Callum Styan <callumstyan@gmail.com>
2019-06-17 15:02:42 -07:00