Commit Graph

120 Commits

Author SHA1 Message Date
Ryota Arai
135d580ab2 Introduce min_shards for remote write to set minimum number of shards. (#4924)
Signed-off-by: Ryota Arai <ryota.arai@gmail.com>
2018-12-04 17:32:14 +00:00
mknapphrt
f0e9196dca Return warnings on a remote read fail (#4832)
Signed-off-by: Mark Knapp <mknapp@hudson-trading.com>
2018-11-30 14:27:12 +00:00
Tariq Ibrahim
61cf4365d6 add logic to check if an azure VM is deallocated or not (#4908)
* add logic to check if an azure VM is deallocated or not
* update documentation  with the new azure power state label

Signed-off-by: tariqibrahim <tariq.ibrahim@microsoft.com>
2018-11-30 11:32:40 +00:00
Serghei Anicheev
8e659a5109 Adding private_dns_name to the list of ec2 labels which can be used i… (#4693)
* Adding private_dns_name to the list of ec2 labels which can be used in node naming for dynamic environments

Signed-off-by: Serghei Anicheev <serghei@rentalcover.com>
2018-11-30 11:11:06 +00:00
Ben Kochie
c6399296dc
Fix spelling/typos (#4921)
* Fix spelling/typos

Fix spelling/typos reported by codespell/misspell.
* UK -> US spelling changes.

Signed-off-by: Ben Kochie <superq@gmail.com>
2018-11-27 17:44:29 +01:00
Ganesh Vernekar
ca93fd544b /api/v1/labels endpoint for getting all label names (#4835)
* vendor: update tsdb

Signed-off-by: Ganesh Vernekar <cs15btech11018@iith.ac.in>

* /api/v1/labels endpoint

Signed-off-by: Ganesh Vernekar <cs15btech11018@iith.ac.in>

* regex matchers for API

Signed-off-by: Ganesh Vernekar <cs15btech11018@iith.ac.in>

* Add docs

Signed-off-by: Ganesh Vernekar <cs15btech11018@iith.ac.in>

* Matchers behaving as OR

Signed-off-by: Ganesh Vernekar <cs15btech11018@iith.ac.in>

* Removed the matchers

Signed-off-by: Ganesh Vernekar <cs15btech11018@iith.ac.in>

* vendor: update tsdb using go mod

Signed-off-by: Ganesh Vernekar <cs15btech11018@iith.ac.in>

* vendor update: tsdb

Signed-off-by: Ganesh Vernekar <cs15btech11018@iith.ac.in>

* Added LabelNames() to storage.Querier

Signed-off-by: Ganesh Vernekar <cs15btech11018@iith.ac.in>

* Test for api.labelNames

Signed-off-by: Ganesh Vernekar <cs15btech11018@iith.ac.in>

* Nits

Signed-off-by: Ganesh Vernekar <cs15btech11018@iith.ac.in>
2018-11-19 15:51:14 +05:30
Benji Visser
11b336e3ca Migrate all Docker image references to Docker Hub (#4864)
Signed-off-by: noqcks <benny@noqcks.io>
2018-11-16 11:26:10 +00:00
Bryan Boreham
cf37e1feb4 Add __meta_kubernetes_pod_phase label in discovery (#4824)
This lets you add a relabel rule to drop scrapes for pods which are
not running.

Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
2018-11-06 14:40:24 +00:00
Silvio Gissi
6100f160ad EC2 Platform meta label (#4663)
Set __meta_ec2_platform label with the instance platform string. Set to 'windows' on Windows servers and absent otherwise.


Signed-off-by: Silvio Gissi <silvio@gissilabs.com>
2018-11-06 14:39:48 +00:00
nilsocket
fe0f0da6b3 docs: add missing word (time) (#4797)
Signed-off-by: nilsocket <nilsocket@gmail.com>
2018-10-28 07:36:09 +00:00
Timo Beckers
36143be234
docs - refer to documentation/examples/prometheus-marathon.yml
Signed-off-by: Timo Beckers <timo@incline.eu>
2018-10-25 18:02:59 +02:00
Brian Brazil
9c03e11c2c Hook OpenMetrics parser into scraping.
Extend metadata api to support units.

Signed-off-by: Brian Brazil <brian.brazil@robustperception.io>
2018-10-18 13:58:00 +01:00
Kien Nguyen-Tuan
9c5370fdfe Support discover instances from all projects (#4682)
By default, OpenStack SD only queries for instances
from specified project. To discover instances from other
projects, users have to add more openstack_sd_configs for
each project.

This patch adds `all_tenants` <bool> options to
openstack_sd_configs. For example:

- job_name: 'openstack_all_instances'
  openstack_sd_configs:
    - role: instance
      region: RegionOne
      identity_endpoint: http://<identity_server>/identity/v3
      username: <username>
      password: <super_secret_password>
      domain_name: Default
      all_tenants: true

Co-authored-by: Kien Nguyen <kiennt2609@gmail.com>
Signed-off-by: dmatosl <danielmatos.lima@gmail.com>
2018-10-17 13:01:33 +01:00
Brian Brazil
468e49417c Update remote_write queue docs to present defaults. (#4715)
Signed-off-by: Brian Brazil <brian.brazil@robustperception.io>
2018-10-10 18:51:27 +01:00
Richard Kiene
b537f6047a Add ability to filter triton_sd targets by pre-defined groups (#4701)
Additionally, add triton groups metadata to the discovery reponse
and correct a documentation error regarding the triton server id
metadata.

Signed-off-by: Richard Kiene <richard.kiene@joyent.com>
2018-10-10 10:03:34 +01:00
Simon Pasquier
a2a78d0a09 discovery/openstack: discover all interfaces (#4649)
* discovery/openstack: discover all interfaces
* Add address pool label

Signed-off-by: Simon Pasquier <spasquie@redhat.com>
2018-10-09 16:17:08 +01:00
Simon Pasquier
e1e2821cca
Merge pull request #4654 from simonpasquier/openstack-tls
discovery/openstack: support tls_config
2018-10-05 18:11:55 +02:00
Ganesh Vernekar
420c0f5e46 Fix docs (#4690)
Signed-off-by: Ganesh Vernekar <cs15btech11018@iith.ac.in>
2018-10-03 12:45:09 +01:00
Ganesh Vernekar
5790d23fd8 Unit testing for rules (#4350)
* Unit testing for rules
* Specifying order of group evaluation in unit tests

Signed-off-by: Ganesh Vernekar <cs15btech11018@iith.ac.in>
2018-09-25 17:06:26 +01:00
Simon Pasquier
ff08c40091 discovery/openstack: support tls_config
Signed-off-by: Simon Pasquier <spasquie@redhat.com>
2018-09-25 14:31:32 +02:00
Tariq Ibrahim
f708fd5c99 Adding support for multiple azure environments (#4569)
Signed-off-by: Tariq Ibrahim <tariq.ibrahim@microsoft.com>
2018-09-04 17:55:40 +02:00
Max Inden
ecf676cf97 web/api: Expose rule health and last error (#4501)
Expose rule health and last evaluation error on `/api/v1/rules`.

Signed-off-by: Max Leonard Inden <IndenML@gmail.com>
2018-08-23 18:30:10 +05:30
Fabian Reinartz
f571b69010
Merge pull request #4514 from jkohen/ec2-targets
Expose EC2 instance owner as a discovery label.
2018-08-20 08:43:44 +02:00
Javier Kohen
1c89984778 Expose EC2 instance owner as a discovery label.
This exposes the OwnerID field of the DescribeInstances respons as .

Signed-off-by: Javier Kohen <jkohen@google.com>
2018-08-17 11:30:18 -04:00
Brian Brazil
cd54add5b8
Clarify that {a="b",a!="c"} is possible. (#4492)
Signed-off-by: Brian Brazil <brian.brazil@robustperception.io>
2018-08-13 11:38:57 +01:00
Javier Kohen
2d4bcb3ee1 Document the new __meta_gce_instance_id discovery label.
Signed-off-by: Javier Kohen <jkohen@google.com>
2018-08-10 11:59:22 -04:00
Julius Volz
0c54cf489b
Document "<bool>" placeholder in API (#4465)
Signed-off-by: Julius Volz <julius.volz@gmail.com>
2018-08-04 21:30:53 +02:00
Johannes Scheuermann
7608ee87d0 Inital support for Azure VMSS (#4202)
* Inital support for Azure VMSS

Signed-off-by: Johannes Scheuermann <johannes.scheuermann@inovex.de>

* Add documentation for the newly introduced label

Signed-off-by: Johannes M. Scheuermann <joh.scheuer@gmail.com>
2018-08-01 12:52:21 +01:00
Max Inden
41b0580e7e
Merge pull request #4318 from mxinden/expose-alerts-and-rules
api/v1: Expose rules and alerts
2018-07-31 13:50:55 +02:00
Max Leonard Inden
71fafad099
api/v1: Coninue work exposing rules and alerts
Signed-off-by: Max Leonard Inden <IndenML@gmail.com>
2018-07-30 15:31:51 +02:00
Catalin Patulea
50850c0ad9 Add link to TSDB format page. (#4402)
Signed-off-by: Catalin Patulea <catalinp@google.com>
2018-07-28 08:02:03 +01:00
José Martínez
791c13b142 discovery/ec2: Add primary_subnet_id label
Signed-off-by: José Martínez <xosemp@gmail.com>
2018-07-25 09:20:58 +01:00
Jannick Fahlbusch ฏ๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎
0be25f92e2 EC2 Discovery: Allow to set a custom endpoint (#4333)
Allowing to set a custom endpoint makes it easy to monitor targets on non AWS providers with EC2 compliant APIs.

Signed-off-by: Jannick Fahlbusch <git@jf-projects.de>
2018-07-18 10:48:14 +01:00
Romain Baugue
b41be4ef52 Discovery consul service meta (#4280)
* Upgrade Consul client
* Add ServiceMeta to the labels in ConsulSD

Signed-off-by: Romain Baugue <romain.baugue@elwinar.com>
2018-07-18 05:06:56 +01:00
Simon Pasquier
ed99af0b05 docs: fix OpenStack SD for the hypervisor role
Signed-off-by: Simon Pasquier <spasquie@redhat.com>
2018-07-15 12:37:57 +01:00
Martin Chodur
504acf4a0a docs: added undocumented step api parameter format (#4360) 2018-07-07 09:20:18 +01:00
Marcin Owsiany
9fe8bcf4be Fix markup in example. (#4351)
Signed-off-by: Marcin Owsiany <marcin@owsiany.pl>
2018-07-05 09:13:00 +01:00
Fabian Reinartz
057a5ae2b1 Address comments
Signed-off-by: Fabian Reinartz <freinartz@google.com>
2018-06-06 11:21:17 -04:00
Fabian Reinartz
ad4c33c1ff scrape,api: provide per-target metric metadata
This adds a per-target cache of scraped metadata. The metadata is only
available for the lifecycle of the attached target. An API endpoint allows
to select metadata by metric name and a label selection of targets.

Signed-off-by: Fabian Reinartz <freinartz@google.com>
2018-06-06 05:56:10 -04:00
Damien Lespiau
e64037053d Expose controller kind and name to labelling rules
Relabelling rules can use this information to attach the name of the controller
that has created a pod.

In turn, this can be used to slice metrics by workload at query time, ie.
"Give me all metrics that have been created by the $name Deployment"

Signed-off-by: Damien Lespiau <damien@weave.works>
2018-05-09 11:51:37 +02:00
Nathan Graves
5b27996cb3 Include GCE labels during service discovery. Updated vendor files for Google API. (#4150)
Signed-off-by: Nathan Graves <nathan.graves@kofile.us>
2018-05-08 17:37:47 +01:00
Ben Kochie
390e260bd9 Improve wording of remote write documentation. (#3817)
Reduce the use of the term `long-term`, when what we're really talking
about is remote clustered storage for increased capacity and durability.

Signed-off-by: Ben Kochie <superq@gmail.com>
2018-05-05 16:38:45 +01:00
Daisy T
b424eb42e3 document remote write queue parameters (#4126) 2018-04-30 20:08:45 +02:00
Brian Brazil
fbe66819c5
Update ALERTS docs for 2.0 staleness changes. (#4116)
Signed-off-by: Brian Brazil <brian.brazil@robustperception.io>
2018-04-26 12:44:11 +01:00
Adam Shannon
809881d7f5 support reading basic_auth password_file for HTTP basic auth (#4077)
Issue: https://github.com/prometheus/prometheus/issues/4076

Signed-off-by: Adam Shannon <adamkshannon@gmail.com>
2018-04-25 18:19:06 +01:00
Julius Volz
fe10b36b30 Fix curl example for deleting series (#4046) 2018-04-05 13:06:18 +01:00
Philippe Laflamme
2aba238f31 Use common HTTPClientConfig for marathon_sd configuration (#4009)
This adds support for basic authentication which closes #3090

The support for specifying the client timeout was removed as discussed in https://github.com/prometheus/common/pull/123. Marathon was the only sd mechanism doing this and configuring the timeout is done through `Context`.

DC/OS uses a custom `Authorization` header for authenticating. This adds 2 new configuration properties to reflect this.

Existing configuration files that use the bearer token will no longer work. More work is required to make this backwards compatible.
2018-04-05 09:08:18 +01:00
albatross0
0245fd55bf Add a machine type label to GCE SD (#4032) 2018-03-31 09:20:19 +01:00
Kristiyan Nikolov
be85ba3842 discovery/ec2: Support filtering instances in discovery (#4011) 2018-03-31 07:51:11 +01:00
Corentin Chary
60dafd425c consul: improve consul service discovery (#3814)
* consul: improve consul service discovery

Related to #3711

- Add the ability to filter by tag and node-meta in an efficient way (`/catalog/services`
  allow filtering by node-meta, and returns a `map[string]string` or `service`->`tags`).
  Tags and nore-meta are also used in `/catalog/service` requests.
- Do not require a call to the catalog if services are specified by name. This is important
  because on large cluster `/catalog/services` changes all the time.
- Add `allow_stale` configuration option to do stale reads. Non-stale
  reads can be costly, even more when you are doing them to a remote
  datacenter with 10k+ targets over WAN (which is common for federation).
- Add `refresh_interval` to minimize the strain on the catalog and on the
  service endpoint. This is needed because of that kind of behavior from
  consul: https://github.com/hashicorp/consul/issues/3712 and because a catalog
  on a large cluster would basically change *all* the time. No need to discover
  targets in 1sec if we scrape them every minute.
- Added plenty of unit tests.

Benchmarks
----------

```yaml
scrape_configs:

- job_name: prometheus
  scrape_interval: 60s
  static_configs:
    - targets: ["127.0.0.1:9090"]

- job_name: "observability-by-tag"
  scrape_interval: "60s"
  metrics_path: "/metrics"
  consul_sd_configs:
    - server: consul.service.par.consul.prod.crto.in:8500
      tag: marathon-user-observability  # Used in After
      refresh_interval: 30s             # Used in After+delay
  relabel_configs:
    - source_labels: [__meta_consul_tags]
      regex: ^(.*,)?marathon-user-observability(,.*)?$
      action: keep

- job_name: "observability-by-name"
  scrape_interval: "60s"
  metrics_path: "/metrics"
  consul_sd_configs:
    - server: consul.service.par.consul.prod.crto.in:8500
      services:
        - observability-cerebro
        - observability-portal-web

- job_name: "fake-fake-fake"
  scrape_interval: "15s"
  metrics_path: "/metrics"
  consul_sd_configs:
    - server: consul.service.par.consul.prod.crto.in:8500
      services:
        - fake-fake-fake
```

Note: tested with ~1200 services, ~5000 nodes.

| Resource | Empty | Before | After | After + delay |
| -------- |:-----:|:------:|:-----:|:-------------:|
|/service-discovery size|5K|85MiB|27k|27k|27k|
|`go_memstats_heap_objects`|100k|1M|120k|110k|
|`go_memstats_heap_alloc_bytes`|24MB|150MB|28MB|27MB|
|`rate(go_memstats_alloc_bytes_total[5m])`|0.2MB/s|28MB/s|2MB/s|0.3MB/s|
|`rate(process_cpu_seconds_total[5m])`|0.1%|15%|2%|0.01%|
|`process_open_fds`|16|*1236*|22|22|
|`rate(prometheus_sd_consul_rpc_duration_seconds_count{call="services"}[5m])`|~0|1|1|*0.03*|
|`rate(prometheus_sd_consul_rpc_duration_seconds_count{call="service"}[5m])`|0.1|*80*|0.5|0.5|
|`prometheus_target_sync_length_seconds{quantile="0.9",scrape_job="observability-by-tag"}`|N/A|200ms|0.2ms|0.2ms|
|Network bandwidth|~10kbps|~2.8Mbps|~1.6Mbps|~10kbps|

Filtering by tag using relabel_configs uses **100kiB and 23kiB/s per service per job** and quite a lot of CPU. Also sends and additional *1Mbps* of traffic to consul.
Being a little bit smarter about this reduces the overhead quite a lot.
Limiting the number of `/catalog/services` queries per second almost removes the overhead of service discovery.

* consul: tweak `refresh_interval` behavior

`refresh_interval` now does what is advertised in the documentation,
there won't be more that one update per `refresh_interval`. It now
defaults to 30s (which was also the current waitTime in the consul query).

This also make sure we don't wait another 30s if we already waited 29s
in the blocking call by substracting the number of elapsed seconds.

Hopefully this will do what people expect it does and will be safer
for existing consul infrastructures.
2018-03-23 14:48:43 +00:00