4869 Commits

Author SHA1 Message Date
albatross0
0245fd55bf Add a machine type label to GCE SD (#4032) 2018-03-31 09:20:19 +01:00
Kristiyan Nikolov
be85ba3842 discovery/ec2: Support filtering instances in discovery (#4011) 2018-03-31 07:51:11 +01:00
Bryan Boreham
93494d8b7e Add an OpenTracing span for each rule (#4027)
* Add an OpenTracing span for each rule

So that tags and child spans can be traced back to the rule that they
refer to.
2018-03-30 21:29:19 +01:00
Solomon Van
68e394a56e notifier: update use testutil for testing (#3695) 2018-03-29 16:07:26 +01:00
Elif T. Kuş
daebf68ea2 Rewrote tests for relabel and template (#3754)
* relabel: use testutil for testing

* template: use testutil for testing
2018-03-29 16:02:28 +01:00
Fabian Reinartz
184b6e3767
Merge pull request #3968 from zjwzte/fix-magic-number
Fix magic number.
2018-03-28 14:09:43 +02:00
Krasi Georgiev
dfd6709a44 update common package (#4015) 2018-03-27 10:21:56 +05:30
Krasi Georgiev
5fec98d0a7 simplify server error handling (#4006) 2018-03-25 10:05:59 +01:00
Corentin Chary
60dafd425c consul: improve consul service discovery (#3814)
* consul: improve consul service discovery

Related to #3711

- Add the ability to filter by tag and node-meta in an efficient way (`/catalog/services`
  allows filtering by node-meta, and returns a `map[string]string` of `service`->`tags`).
  Tags and node-meta are also used in `/catalog/service` requests.
- Do not require a call to the catalog if services are specified by name. This is important
  because on a large cluster `/catalog/services` changes all the time.
- Add `allow_stale` configuration option to do stale reads. Non-stale
  reads can be costly, even more when you are doing them to a remote
  datacenter with 10k+ targets over WAN (which is common for federation).
- Add `refresh_interval` to minimize the strain on the catalog and on the
  service endpoint. This is needed because of that kind of behavior from
  consul: https://github.com/hashicorp/consul/issues/3712 and because a catalog
  on a large cluster would basically change *all* the time. No need to discover
  targets in 1sec if we scrape them every minute.
- Added plenty of unit tests.

Benchmarks
----------

```yaml
scrape_configs:

- job_name: prometheus
  scrape_interval: 60s
  static_configs:
    - targets: ["127.0.0.1:9090"]

- job_name: "observability-by-tag"
  scrape_interval: "60s"
  metrics_path: "/metrics"
  consul_sd_configs:
    - server: consul.service.par.consul.prod.crto.in:8500
      tag: marathon-user-observability  # Used in After
      refresh_interval: 30s             # Used in After+delay
  relabel_configs:
    - source_labels: [__meta_consul_tags]
      regex: ^(.*,)?marathon-user-observability(,.*)?$
      action: keep

- job_name: "observability-by-name"
  scrape_interval: "60s"
  metrics_path: "/metrics"
  consul_sd_configs:
    - server: consul.service.par.consul.prod.crto.in:8500
      services:
        - observability-cerebro
        - observability-portal-web

- job_name: "fake-fake-fake"
  scrape_interval: "15s"
  metrics_path: "/metrics"
  consul_sd_configs:
    - server: consul.service.par.consul.prod.crto.in:8500
      services:
        - fake-fake-fake
```

Note: tested with ~1200 services, ~5000 nodes.

| Resource | Empty | Before | After | After + delay |
| -------- |:-----:|:------:|:-----:|:-------------:|
|/service-discovery size|5K|85MiB|27k|27k|
|`go_memstats_heap_objects`|100k|1M|120k|110k|
|`go_memstats_heap_alloc_bytes`|24MB|150MB|28MB|27MB|
|`rate(go_memstats_alloc_bytes_total[5m])`|0.2MB/s|28MB/s|2MB/s|0.3MB/s|
|`rate(process_cpu_seconds_total[5m])`|0.1%|15%|2%|0.01%|
|`process_open_fds`|16|*1236*|22|22|
|`rate(prometheus_sd_consul_rpc_duration_seconds_count{call="services"}[5m])`|~0|1|1|*0.03*|
|`rate(prometheus_sd_consul_rpc_duration_seconds_count{call="service"}[5m])`|0.1|*80*|0.5|0.5|
|`prometheus_target_sync_length_seconds{quantile="0.9",scrape_job="observability-by-tag"}`|N/A|200ms|0.2ms|0.2ms|
|Network bandwidth|~10kbps|~2.8Mbps|~1.6Mbps|~10kbps|

Filtering by tag using relabel_configs uses **100kiB and 23kiB/s per service per job** and quite a lot of CPU. It also sends an additional *1Mbps* of traffic to consul.
Being a little bit smarter about this reduces the overhead quite a lot.
Limiting the number of `/catalog/services` queries per second almost removes the overhead of service discovery.

* consul: tweak `refresh_interval` behavior

`refresh_interval` now does what is advertised in the documentation:
there won't be more than one update per `refresh_interval`. It now
defaults to 30s (which was also the current waitTime in the consul query).

This also makes sure we don't wait another 30s if we already waited 29s
in the blocking call, by subtracting the number of elapsed seconds.

Hopefully this will do what people expect and will be safer
for existing consul infrastructures.
2018-03-23 14:48:43 +00:00
Ben Kochie
0d9fe18f5e Fix nil context staticcheck error. 2018-03-22 07:59:39 +00:00
Ben Kochie
0f37c02343 Update vendor golang.org/x/...
Update vendor golang.org/x/sys/unix
Update vendor golang.org/x/net/...
2018-03-22 07:59:39 +00:00
Ben Kochie
2b02fcb0cb Update vendor github.com/miekg/dns@v1.0.4
Update vendor `github.com/miekg/dns` to `v1.0.4` release.
* Add dependent vendor `golang.org/x/crypto/ed25519`.
* Add dependent vendor `golang.org/x/crypto/ed25519/internal/edwards25519`.
* Add dependent vendor `golang.org/x/net/bpf`.
* Add dependent vendor `golang.org/x/net/internal/iana`.
* Add dependent vendor `golang.org/x/net/internal/socket`.
* Add dependent vendor `golang.org/x/net/ipv4`.
* Add dependent vendor `golang.org/x/net/ipv6`.
2018-03-22 07:59:39 +00:00
Marek Siarkowicz
bb86c3f62b Report internal runtime information on status page (#3921)
Add information about tsdb, wal and config reload
2018-03-21 16:08:37 +00:00
Aaron Kirkbride
c47fbcb626 Fix moved fsnotify dependency (#3995) 2018-03-21 15:46:31 +00:00
Brian Brazil
cc39021b2b Provide custom marshalling for Point
Point has a non-standard marshalling, and is also
where the vast majority of CPU time is spent so
it is worth optimising.
2018-03-21 15:02:01 +00:00
Brian Brazil
f35fca1c3f Vendor github.com/json-iterator/go 2018-03-21 15:02:01 +00:00
Brian Brazil
299b78a887 Switch to json-iterator for v1 api.
This makes queries ~15% faster and cuts cpu
time spent on json encoding by ~40%.
2018-03-21 15:02:01 +00:00
Brian Brazil
8ede14b24c Add unittests for Point json output 2018-03-21 15:02:01 +00:00
Brian Brazil
ecd0a9c6ba web: Add benchmark for respond() 2018-03-21 15:02:01 +00:00
Anton Tereshchenkov
4cb8f6c260 web: remove unused MetricsPath option (#3964) 2018-03-21 09:29:40 +00:00
ferhat elmas
ec8e4d8a7c all: remove unnecessary type conversions (#3992)
Except promql, to avoid creating a conflict with #3966.
2018-03-21 09:25:22 +00:00
Simon Pasquier
83325c8d82 web: replace deprecated InstrumentHandler() (#3862)
* web: replace deprecated InstrumentHandler()

This change replaces the deprecated InstrumentHandler function by the
equivalent functions from the promhttp package.

The following metrics are removed:

* http_request_duration_microseconds (Summary).
* http_request_size_bytes (Summary).
* http_requests_total (Counter).

And the following metrics are added instead:

* prometheus_http_request_duration_seconds (Histogram).
* prometheus_http_response_size_bytes (Histogram).
* promhttp_metric_handler_requests_in_flight (Gauge).
* promhttp_metric_handler_requests_total (Counter).

* Update github.com/prometheus/common/route package

* web: refactor using the new prometheus/common/route package
2018-03-21 08:16:16 +00:00
James Turnbull
ba5273a0ab Minor edits to help text (#3990) 2018-03-20 16:54:36 +00:00
Simon Pasquier
e1fd96db25 cmd: fix help text (#3989) 2018-03-20 15:58:19 +00:00
Warren Fernandes
d49a3df55b Parser test cleanup (#3977)
* parser test cleanup

- Test against the exported package functions instead of the private functions.

* Improves readability of TestParseSeries

- Moves package function closer to parser function
2018-03-20 14:30:52 +00:00
Jeeyoung Kim
5b962c5748 Revert "Feature: Allow getting credentials via EC2 role (#3343)" (#3985)
This reverts commit 808f79f00a.
2018-03-20 12:34:54 +00:00
Warren Fernandes
58e2a31db8 Cleans up test by removing unused function (#3969) 2018-03-15 08:59:19 +00:00
zjwzte
b7a37a1604 Fix magic number. 2018-03-15 10:15:35 +08:00
Fabian Reinartz
e87c6c8b28
Merge pull request #3963 from mz-techops/fix-query-err-scope
promql: propagate storage errors
2018-03-14 11:04:02 -04:00
Anton Tereshchenkov
18bbec050c promql: propagate storage errors 2018-03-14 15:19:22 +01:00
Tom Wilkie
02a154ced6
Merge pull request #3941 from prometheus/3809-correctly-stop-timer
Correctly stop the timer used in the remote write path.
2018-03-13 09:05:52 +00:00
Tom Wilkie
dc860e7d0e Fix nit. 2018-03-12 16:48:51 +00:00
Tom Wilkie
390b018c90 Test sample timeout delivery. 2018-03-12 15:35:43 +00:00
Tom Wilkie
22d820ef8e Review feedback. 2018-03-12 14:27:48 +00:00
Brian Brazil
a8c22c85cc
Correctly handle pruning wraparound after ring expansion (#3942)
Fixes #3939
2018-03-12 13:16:59 +00:00
Paul Gier
85a3c974b7 minor yaml indentation consistency fix in example configs (#3946) 2018-03-11 23:06:13 +00:00
James Turnbull
4486ef013b Make show annotations checkbox match query history checkbox (#3936)
After removing the checkbox in #3913, the only remaining element that
looked like it was the new Show Annotations checkbox on the Alerts page,
which in turn didn't look like the Enable query history checkbox on the
graph page. So:

1. This takes the Enable query history button as canonical.
2. Updates the show annotations button code to match it.
3. Simplifies the JS for the checkbox.
2018-03-09 14:39:28 +01:00
James Turnbull
50e6aff3fd Make job heading on service discovery consistent (#3937)
The new Service Discovery page uses the CSS/JS from the Targets page but
uses it slightly differently. This makes the job header match in the
Service Discovery page for a more consistent look-n-feel.
2018-03-09 14:33:53 +01:00
Tom Wilkie
f8c9d375b6 Correctly stop the timer used in the remote write path. 2018-03-09 12:00:26 +00:00
Matt Palmer
042090a6d3 [dns_sd] Send an EDNS0 query by default (#3586)
Based on https://groups.google.com/d/topic/prometheus-users/02kezHbuea4/discussion

Does not attempt to handle a situation where the server does not understand
EDNS0, however that is an unlikely case, and the behaviour of such ancient
systems is hard to predict in advance, so if it does come up, it will need
to be handled on a case-by-case basis.
2018-03-09 10:21:58 +00:00
James Turnbull
c3f4f2204f Refactor/redesign Unhealthy checkbox on Targets page (#3913)
* Added only healthy to Targets

This adds an "Only healthy" button to supplement the "Only unhealthy"
button. The two are mutually exclusive.

I've also added a red/green text color to the buttons.

Arguably this could be a toggle instead if folks think this is
worthwhile... Happy to modify it.

* Moved functions above init

* Simplified code and made prettier

* Appeased Codacy

* Made buttons square
2018-03-09 11:19:09 +01:00
Yecheng Fu
56ed29fbf7 Map target infos of endpoints to prometheus meta labels. (#3770) 2018-03-09 10:07:00 +00:00
Brian Brazil
bf7d87aed2 Cleanup storage from all tests.
Fixed #3299
2018-03-09 07:53:35 +00:00
Brian Brazil
c0ce35d2d3 Only show debug output on test failure 2018-03-09 07:53:35 +00:00
Brian Brazil
e6ea146c81 Make benchmark tests pass
A new query object is needed for each evaluation,
as the iterators would otherwise be shared across evaluations.
2018-03-09 07:53:35 +00:00
Nikunj Aggarwal
998dfcbac6 Expose itemtype outside the package (#3933) 2018-03-08 16:52:44 +00:00
Fabian Reinartz
f63e7db4cb
Merge pull request #3931 from prometheus/cut200
*: cut v2.2.0
2018-03-08 17:37:57 +01:00
Fabian Reinartz
6b9cbacf52 *: cut v2.2.0 2018-03-08 15:37:46 +01:00
Fabian Reinartz
60edc2b6d5
Merge pull request #3928 from prometheus/snaphead
Add option to skip head on snapshots
2018-03-08 13:25:22 +01:00
Fabian Reinartz
3e6c890aea api: add flag to skip head on snapshots 2018-03-08 13:07:12 +01:00