Commit Graph

3608 Commits

Author SHA1 Message Date
Scott Larkin
5319e1da09 Update .codeclimate.yml
Changed the vendor/ path in the exclude paths node.
2017-01-23 14:58:53 -05:00
Frederic Branczyk
d840f2c400 Merge pull request #2359 from brancz/cut-1.5.0
*: cut 1.5.0
2017-01-23 14:05:51 +01:00
Frederic Branczyk
fb17493f66
*: cut 1.5.0 2017-01-23 12:59:01 +01:00
Björn Rabenstein
9688a312ed Merge pull request #2355 from prometheus/beorn7/lint
Remove auto-generated protobuf code from codeclimate
2017-01-20 11:31:51 +01:00
beorn7
4392aa43d4 Remove auto-generated protobuf code from codeclimate 2017-01-20 11:07:20 +01:00
Björn Rabenstein
d717175104 Merge pull request #2354 from prometheus/beorn7/lint
Documentation: Add Code Climate badges to README.md
2017-01-20 10:51:05 +01:00
beorn7
0c8b753f6e Documentation: Add Code Climate badges to README.md 2017-01-19 23:22:22 +01:00
Scott Larkin
e5a75b2b30 Code Climate config (#2351)
Created a Code Climate config with gofmt, golint, and govet enabled
2017-01-19 22:19:32 +01:00
Alex Somesan
b22eb65d0f Cleaner separation between ServiceAccount and custom authentication in K8S SD (#2348)
* Canonical usage of cluster service-account in K8S SD

* Early validation for opt-in custom auth in K8S SD

* Fix typo in condition
2017-01-19 10:52:52 +01:00
Fabian Reinartz
7eb849e6a8 Merge pull request #2307 from joyent/triton_discovery
Add Joyent Triton discovery
2017-01-18 05:08:11 +01:00
Richard Kiene
f3d9692d09 Add Joyent Triton discovery 2017-01-17 20:34:32 +00:00
Brian Brazil
c1b547a90e Only checkpoint chunkdescs and series that need persisting. (#2340)
This decreases checkpoint size by not checkpointing things
that don't actually need checkpointing.

This is fully compatible with the v2 checkpoint format,
as it makes series appear as though the only chunksdescs
in memory are those that need persisting.
2017-01-17 00:59:38 +00:00
Fabian Reinartz
5418a42965 Merge pull request #2345 from Bplotka/fixed-alertmanager-flag-auth
Fixed regression in `-alertmanager.url flag`. Basic auth was ignored.
2017-01-16 18:29:51 +01:00
Bartek Plotka
579e33f19a Fixed style issues. 2017-01-16 16:45:58 +00:00
Bartek Plotka
d7febe97fa Fixed regression in -alertmanager.url flag. Basic auth was ignored.
- Included basic auth parsing while parsing to AlertmanagerConfig
- Added test case

Signed-off-by: Bartek Plotka <bwplotka@gmail.com>
2017-01-16 16:39:20 +00:00
Fabian Reinartz
990e40c959 Merge pull request #2338 from brancz/alertmanager-api
web/api: add alertmanager api
2017-01-16 12:08:14 +01:00
Frederic Branczyk
bd92571bdd
web/api: make target and alertmanager api responses consistent 2017-01-16 11:53:00 +01:00
Fabian Reinartz
022714b60a Merge pull request #2341 from mattbostock/patch-1
Correct notifications_dropped description
2017-01-16 09:23:46 +01:00
Matt Bostock
4160892109 Correct notifications_dropped description
The current description does not accurately describe when the metric is incremented.

Aside from Alertmanger missing from the configuration, `prometheus_notifications_dropped_total` is incremented when errors occur while sending alert notifications to Alertmanager, or because the notifications queue is full, or because the number of notifications to be sent exceeds the queue capacity.

I think calling these cases 'errors' in a generic sense is more useful than the current description.
2017-01-13 23:36:00 +00:00
Brian Brazil
f64c231dad Allow checkpoints and maintenance to happen concurrently. (#2321)
This is essential on larger Prometheus servers, as otherwise
checkpoints prevent sufficient persisting of chunks to disk.
2017-01-13 17:24:19 +00:00
Frederic Branczyk
389c6d0043
web/api: add alertmanager api 2017-01-13 15:30:20 +01:00
Brian Brazil
1dcb7637f5 Add various persistence related metrics (#2333)
Add metrics around checkpointing and persistence

* Add a metric to say if checkpointing is happening,
and another to track total checkpoint time and count.

This breaks the existing prometheus_local_storage_checkpoint_duration_seconds
by renaming it to prometheus_local_storage_checkpoint_last_duration_seconds
as the former name is more appropriate for a summary.

* Add metric for last checkpoint size.

* Add metric for series/chunks processed by checkpoints.

For long checkpoints it'd be useful to see how they're progressing.

* Add metric for dirty series

* Add metric for number of chunks persisted per series.

You can get the number of chunks from chunk_ops,
but not the matching number of series. This helps determine
the size of the writes being made.

* Add metric for chunks queued for persistence

Chunks created includes both chunks that'll need persistence
and chunks read in for queries. This only includes chunks created
for persistence.

* Code review comments on new persistence metrics.
2017-01-11 15:11:19 +00:00
Björn Rabenstein
6ce97837ab Merge pull request #2327 from prometheus/beorn7/vendoring
vendoring: Update prometheus/common to pull in bug fixes
2017-01-09 13:28:36 +01:00
beorn7
86ec87b78f vendoring: Update prometheus/common to pull in bug fixes
In particular the one for https://github.com/prometheus/common/issues/72.
2017-01-09 12:25:17 +01:00
Fabian Reinartz
3302bb1eb1 Merge pull request #2323 from prometheus/beorn7/retrieval
Retrieval: Avoid copying Target
2017-01-08 06:49:47 +01:00
Björn Rabenstein
ad40d0abbc Merge pull request #2288 from prometheus/limit-scrape
Add ability to limit scrape samples, and related metrics
2017-01-08 01:34:06 +01:00
beorn7
5dc01202d7 Retrieval: Remove some test lines that fail on Travis only
These lines exercise an append in
TestScrapeLoopWrapSampleAppender. Arguably, append shouldn't be tested
there in the first place.

Still it's weird why this fails on Travis:

```
--- FAIL: TestScrapeLoopWrapSampleAppender (0.00s)
    scrape_test.go:259: Expected count of 1, got 0
    scrape_test.go:290: Expected count of 1, got 0
2017/01/07 22:48:26 http: TLS handshake error from 127.0.0.1:50716: read tcp 127.0.0.1:40265->127.0.0.1:50716: read: connection reset by peer
FAIL
FAIL	github.com/prometheus/prometheus/retrieval	3.603s
```

Should anybody ever find out why, please revert this commit accordingly.
2017-01-08 00:01:46 +01:00
beorn7
3610331eeb Retrieval: Do not buffer the samples if no sample limit configured
Also, simplify and streamline the code a bit.
2017-01-07 18:18:54 +01:00
André Carvalho
c43dfaba1c Add max concurrent and current queries engine metrics (#2326)
* Add max concurrent and current queries engine metrics

This commit adds two metrics to the promql/engine: the
number of max concurrent queries, as configured by the flag, and
the number of current queries being served+blocked in the engine.
2017-01-07 14:41:25 +00:00
beorn7
767c0709b1 Retrieval: Avoid copying Target
retreival.Target contains a mutex. It was copied in the Targets()
call. This potentially can wreak a lot of havoc.

It might even have caused the issues reported as #2266 and #2262 .
2017-01-06 18:43:41 +01:00
Brian Brazil
f9e581907a Make index queue bigger. (#2322)
When a large Prometheus starts up fresh it can take many minutes
to warmup and clear out the index queue. A larger queue means less
blocking, bigger batches and cuts down startup time by ~50%.
2017-01-05 17:57:42 +00:00
Fabian Reinartz
c9f4aea8e2 Merge pull request #2305 from alicebob/favicon
Add a favicon to the web GUI
2017-01-04 10:15:27 +01:00
Martin Lehmann
78fae3155f Make relative links in README.md absolute (#2316)
The relative links don't work in other pages that render the README (for example https://hub.docker.com/r/prom/prometheus/). As they are (hopefully) not due to change any time soon, I think using absolute links is better.
2017-01-03 20:07:33 +00:00
Julius Volz
90dd216646 Merge pull request #2306 from EdSchouten/sorted-alerts
Use lexicographic order to sort alerts by name.
2016-12-31 13:12:30 +01:00
Mitsuhiro Tanda
7e369b9318 expose max memory chunks metrics (#2303)
* expose max memory chunks metrics
2016-12-27 18:34:07 +00:00
Ed Schouten
b3a39ccd8a Use lexicographic order to sort alerts by name.
Right now the /alerts page of Prometheus sorts alerts by severity
(firing, pending, inactive). Once multiple alerts have the same
severity, their order seems to correlate to how they are placed in the
configuration files, but not always. Looking at the code, we make use of
sort.Sort(), which is documented not to provide a stable sort. The
Less() function also only takes the alert state into account.

This change extends the Less() function to provide a lexicographic order
on both the alert state and the name. This means I can finally find the
alerts I'm looking for without using my browser's search feature.
2016-12-27 14:28:44 +01:00
Harmen
135d32ea22 make assets 2016-12-27 13:59:20 +01:00
Harmen
dfa4f79bcd add favicon 2016-12-27 13:58:51 +01:00
Brian Brazil
93b70ee4ea Evict chunk descs of all unloaded chunks during maintenance. (#2297)
Keeping these around has two problems:
1) Each desc takes 64 bytes, 10 of them is 640B. This is a lot of
overhead on a 1024 byte chunk.
2) It can take well over a week to reach a point where this and thus
Prometheus memory usage as a whole enters steady state. This makes RAM
estimation very hard for users, and makes it difficult to investigate
things like memory fragmentation.

Instead we'll wipe them during each memory series maintenance cycle, and
if a query pulls them in they'll hang around as cache until the next
cycle.
2016-12-22 13:49:03 +00:00
Brian Brazil
bed4635802 Use irate consistently in console template examples. (#2296)
I must have forgotten my 'g' when switching these.
2016-12-21 13:19:23 +00:00
Fabian Reinartz
d6d03a966f Merge pull request #2295 from prometheus/fast-path-remote
Don't clone the metric if there's no remote writes.
2016-12-21 12:36:41 +01:00
Brian Brazil
1b8a474612 Don't clone the metric if there's no remote writes.
The metric clone can't be further optimised, and is a
non-trivial memory allocation cost so fast path it
if there's no remote writes configured.
2016-12-21 11:34:48 +00:00
Brian Brazil
6c07453ec1 Only clone the metric in the one place relabelling needs it. (#2292)
This cuts ~17% off memory allocations related to ingesting data
in a basic setup.
2016-12-21 10:00:33 +00:00
Brian Brazil
2e3b42ad6c Correctly handle the end time being 0 in the URL. (#2290) 2016-12-18 19:30:52 +00:00
Brian Brazil
f421ce0636 Remove label from prometheus_target_skipped_scrapes_total (#2289)
This avoids it not being intialised, and breaking out by
interval wasn't partiuclarly useful.

Fixes #2269
2016-12-16 18:00:52 +00:00
Brian Brazil
30448286c7 Add sample_limit to scrape config.
This imposes a hard limit on the number of samples ingested from the
target. This is counted after metric relabelling, to allow dropping of
problemtic metrics.

This is intended as a very blunt tool to prevent overload due to
misbehaving targets that suddenly jump in sample count (e.g. adding
a label containing email addresses).

Add metric to track how often this happens.

Fixes #2137
2016-12-16 15:10:09 +00:00
Björn Rabenstein
f3f798fbcf Merge pull request #2283 from tcolgate/ignoredots
ignore dotfiles in data directory
2016-12-15 13:32:03 +01:00
Tristan Colgate
30be8e0b8a ignore dotfiles in data directory 2016-12-15 11:48:23 +00:00
Tristan Colgate-McFarlane
4d9134e6d8 Add labeldrop and labelkeep actions. (#2279)
Introduce two new relabel actions. labeldrop, and labelkeep.
These can be used to filter the set of labels by matching regex

- labeldrop: drops all labels that match the regex
- labelkeep: drops all labels that do not match the regex
2016-12-14 10:17:42 +00:00
Björn Rabenstein
45570e5972 Merge pull request #2277 from prometheus/beorn7/storage2
storage: Sanity-check number of loaded chunk descs
2016-12-14 02:59:10 +01:00