prometheus

mirror of https://github.com/prometheus/prometheus synced 2024-12-29 10:12:26 +00:00

Author	SHA1	Message	Date
beorn7	9b6a1dad05	storage: Fix `go vet` error	2017-04-04 19:14:09 +02:00
Julius Volz	5f3327f620	Merge pull request #2568 from AlekSi/patch-1 Use latest released Go 1.8.x	2017-04-04 15:54:30 +02:00
Alexey Palazhchenko	535a18e978	Use latest released Go 1.8.x	2017-04-04 13:52:18 +03:00
Björn Rabenstein	50e4f49b7e	Merge pull request #2561 from prometheus/beorn7/storage2 storage: Evict unused chunk.Descs in crash recovery	2017-04-04 00:05:03 +02:00
beorn7	08fc6cbd39	storage: Evict unused chunk.Descs in crash recovery This is in line with the v1.5 change in paradigm to not keep chunk.Descs without chunks around after a series maintenance. It's mainly motivated by avoiding excessive amounts of RAM usage during crash recovery. The code avoids creating memory time series with zero chunk.Descs, as that is prone to trigger weird effects. (Series maintenance would archive series with zero chunk.Descs, but we cannot do that here because the archive indices still have to be checked.)	2017-04-04 00:04:22 +02:00
Julius Volz	eda4286484	Merge pull request #2557 from prometheus/influxdb-read Add InfluxDB read-back support to remote storage bridge	2017-04-03 18:29:22 +02:00
Björn Rabenstein	1c6240fc40	Merge pull request #2559 from prometheus/beorn7/storage storage: Replace fpIter by sortedFPs	2017-04-03 16:56:21 +02:00
beorn7	d284ffab03	storage: Replace fpIter by sortedFPs The fpIter was kind of cumbersome to use and required a lock for each iteration (which wasn't even needed for the iteration at startup after loading the checkpoint). The new implementation here has an obvious penalty in memory, but it's only 8 bytes per series, so 80MiB for a beefy server with 10M memory time series (which would probably need ~100GiB RAM, so the memory penalty is only 0.1% of the total memory need). The big advantage is that now series maintenance happens in order, which leads to the time between two maintenances of the same series being less random. Ideally, after each maintenance, the next maintenance would tackle the series with the largest number of non-persisted chunks. That would be quite an effort to find out or track, but with the approach here, the next maintenance will tackle the series whose previous maintenance is longest ago, which is a good approximation. While this commit won't change the _average_ number of chunks persisted per maintenance, it will reduce the mean time a given chunk has to wait for its persistence and thus reduce the steady-state number of chunks waiting for persistence. Also, the map iteration in Go is non-deterministic but not truly random. In practice, the iteration appears to be somewhat "bucketed". You can often observe a bunch of series with similar duration since their last maintenance, i.e. you see batches of series with a similar number of chunks persisted per maintenance. If that batch is relatively young, a whole lot of series are maintained with very few chunks to persist. (See screenshot in PR for a better explanation.)	2017-04-03 15:34:46 +02:00
Tobias Schmidt	eac36d123e	Fix unstable fanin test (#2558)	2017-04-03 13:02:15 +02:00
Conor Broderick	dafae52efa	Display total number of returned elements on console (#2532)	2017-04-03 11:52:25 +01:00
Julius Volz	111841a230	Vendor new InfluxDB client library	2017-04-03 12:38:05 +02:00
Fabian Reinartz	e18be8d1a5	Merge pull request #2556 from prometheus/grobie/count-missed-group-executions Export number of missed rule evaluations	2017-04-03 10:09:12 +02:00
Julius Volz	3581057ea4	Update remote storage bridge README.md	2017-04-03 01:42:49 +02:00
Julius Volz	b391cbb808	Add InfluxDB read-back support to remote storage bridge	2017-04-03 01:42:43 +02:00
Tobias Schmidt	eaf33759fb	Register forgotten prometheus_evaluator_iterations_total metric	2017-04-02 20:32:56 -03:00
Tobias Schmidt	aaaba57184	Export number of missed rule evaluations In case the execution of all rules takes longer than the configured rule evaluation interval, one or more iterations will be skipped. This needs to be visible to the operator.	2017-04-02 20:03:28 -03:00
Julius Volz	5a896033e3	Add remote read external label handling (#2555) * Add remote read external label handling This implements rules 1 and 2 from https://docs.google.com/document/d/188YauRgfF0J4CYMigLsVNN34V_kUwKnApBs2dQMfBbs/edit * Use more descriptive example labels in read test * Add comment for querier.addExternalLabels() * Make argument naming in removeLabels() more generic	2017-04-02 17:48:15 +02:00
Julius Volz	9cc7b393c5	Merge pull request #2548 from prometheus/sort-targets Sort targets by instance within a job	2017-04-01 00:07:31 +02:00
Julius Volz	589061919a	Merge pull request #2465 from Gouthamve/alert-metrics-2429 Better Metrics For Alerts	2017-03-31 21:45:05 +02:00
Goutham Veeramachaneni	f27ce34a13	Use Registerer to Register All Metrics * Made Metric a Gauge so that it can be registered.	2017-04-01 00:14:30 +05:30
Goutham Veeramachaneni	7ba0a9e81a	Add Comment About Initialising Counters	2017-03-31 23:39:02 +05:30
Goutham Veeramachaneni	0d0c9d5440	Move Registerer to Config Struct in Notifier	2017-03-31 21:20:12 +05:30
Julius Volz	947c83be3b	Sort targets by instance within a job Fixes https://github.com/prometheus/prometheus/issues/2536	2017-03-31 13:14:20 +02:00
Julius Volz	336c7870ea	Merge pull request #2550 from prometheus/update-go-version ci: Update Go version to 1.8	2017-03-31 13:12:03 +02:00
Julius Volz	a44aadf4a1	ci: Update Go version to 1.8	2017-03-31 00:29:04 +02:00
Brian Brazil	8cd5aff8fe	Send instance="" with federation if instance not set. This is needed for federating non-instance level metrics, so they don't end up with the instance label of the prometheus target. Also sort external labels, so label output order is consistent.	2017-03-30 06:48:48 +01:00
Brian Brazil	d42e01b07c	Sort labelnames for federation. This makes unittests with multiple labels possible, and may be needed for performance with the new ingestion text parser.	2017-03-30 06:48:48 +01:00
Brian Brazil	dbb65846f1	Add unittest for federation external_labels behaviour	2017-03-30 06:48:48 +01:00
Goutham Veeramachaneni	5856f87be3	Update Issue Template (#2541) This is a comment in markdown and won't be shown while creating the issue.	2017-03-29 15:39:38 +01:00
Björn Rabenstein	29f05680a2	Merge pull request #2528 from prometheus/beorn7/storage2 main.go: Set GOGC to 40 by default	2017-03-27 15:00:37 +02:00
Björn Rabenstein	e63d079b59	Merge pull request #2527 from prometheus/beorn7/storage storage: Evict chunks and calculate persistence pressure...	2017-03-27 14:49:42 +02:00
Julius Volz	b5b0e00923	Merge pull request #2499 from prometheus/remote-read Remote Read	2017-03-27 14:43:44 +02:00
beorn7	434ab2a6a3	storage: Evict chunks and calculate persistence pressure based on target heap size This is a fairly easy attempt to dynamically evict chunks based on the heap size. A target heap size has to be set as a command line flag, so that users can essentially say "utilize 4GiB of RAM, and please don't OOM". The -storage.local.max-chunks-to-persist and -storage.local.memory-chunks flags are deprecated by this change. Backwards compatibility is provided by ignoring -storage.local.max-chunks-to-persist and using -storage.local.memory-chunks to set the new -storage.local.target-heap-size to a reasonable (and conservative) value (both with a warning). This also makes the metrics instrumentation more consistent (in naming and implementation) and cleans up a few quirks in the tests. Answers to anticipated comments: There is a chance that Go 1.9 will allow programs better control over the Go memory management. I don't expect those changes to be in contradiction with the approach here, but I do expect them to complement it and allow it to be more precise and controlled. In any case, once those Go changes are available, this code has to be revisited. One might be tempted to let the user specify an estimated value for the RSS usage, and then internally set a target heap size of a certain fraction of that. (In my experience, 2/3 is a fairly safe bet.) However, investigations have shown that RSS size and its relation to the heap size is really really complicated. It depends on so many factors that I wouldn't even start listing them in a commit description. It depends on many circumstances and not least on the risk trade-off of each individual user between RAM utilization and probability of OOMing during a RAM usage peak. To not add even more to the confusion, we need to stick to the well-defined number we also use in the targeting here, the sum of the sizes of heap objects.	2017-03-27 14:33:50 +02:00
Björn Rabenstein	e1a84b6256	Merge pull request #2529 from prometheus/beorn7/storage3 storage: Use staleness delta as head chunk timeout	2017-03-27 14:25:08 +02:00
beorn7	96a303b348	storage: Use staleness delta as head chunk timeout Currently, if a series ceases to exist, its head chunk will be kept open for an hour. That prevents it from being persisted, which prevents it from being evicted, which prevents the series from being archived. Most of the time, once no sample has been added to a series within the staleness limit, we can be pretty confident that this series will not receive samples anymore. The whole chain as described above can be started after 5m instead of 1h. In the relaxed case, this doesn't change a lot as the head chunk timeout is only checked during series maintenance, and usually, a series is only maintained every six hours. However, there is the typical scenario where a large service is deployed, the deploy turns out to be bad, and then it is deployed again within minutes, and quite quickly the number of time series has tripled. That's the point where the Prometheus server is stressed and switches (rightfully) into rushed mode. In that mode, time series are processed as quickly as possible, but all of that is in vain if all of those recently ended time series cannot be persisted yet for another hour. In that scenario, this change will help most, and it's exactly the scenario where help is most desperately needed.	2017-03-26 23:44:50 +02:00
beorn7	04ccf84559	main.go: Set GOGC to 40 by default Rationale: The default value for GOGC is 100, i.e. a garbage collection is initiated once as much heap space has been allocated as was in use after the last GC was done. This ratio doesn't make a lot of sense in Prometheus, as typically about 60% of the heap is allocated for long-lived memory chunks (most of which are around for many hours if not days). Thus, short-lived heap objects are accumulated for quite some time until they finally match the large amount of memory used by bulk memory chunks and a gigantic GC cycle is invoked. With GOGC=40, we are essentially reinstating "normal" GC behavior by acknowledging that about 60% of the heap is used for long-term bulk storage. The median Prometheus production server at SoundCloud runs a GC cycle every 90 seconds. With GOGC=40, a GC cycle is run every 35 seconds (which is still not very often). However, the effective RAM usage is now reduced by about 30%. If settings are updated to utilize more RAM, the time between GC cycles goes up again (as the heap size is larger with more long-lived memory chunks, but the frequency of creating short-lived heap objects does not change). On a quite busy large Prometheus server, the timing changed from one GC run every 20s to one GC run every 12s. In the former case (just changing GOGC, leave everything else as it is), the CPU usage increases by about 10% (on a mid-size reference server from 8.1 to 8.9). If settings are adjusted, the CPU consumption increases more drastically (from 8 cores to 13 cores on a large reference server), despite GCs happening more rarely, presumably because a 50% larger set of memory chunks is managed now. Having more memory chunks is good in many regards, and most servers are running out of memory long before they run out of CPU cycles, so the tradeoff is overwhelmingly positive in most cases. Power users can still set the GOGC environment variable as usual, as the implementation in this commit honors an explicitly set variable.	2017-03-26 21:55:37 +02:00
Julius Volz	3f23aa2cc7	Add headers to indicate remote read/write version Also add Content-Type header.	2017-03-24 17:39:51 +01:00
Tobias Schmidt	6dbd779099	Merge pull request #2519 from prometheus/update-arch-diag-link Update architecture diagram link	2017-03-23 14:18:38 +02:00
Julius Volz	a20105ddb0	Update architecture diagram link	2017-03-23 13:16:54 +01:00
Julius Volz	c34257d069	Merge pull request #2518 from prometheus/update-arch-diag Remove PromDash from architecture diagram	2017-03-23 13:13:14 +01:00
Julius Volz	428e1ad42c	Remove PromDash from architecture diagram	2017-03-23 13:11:05 +01:00
Björn Rabenstein	ddcf04a768	Merge pull request #2515 from leitzler/leitzler-patch-1 Use go env to fetch GOPATH to support Go 1.8	2017-03-23 11:58:30 +01:00
Pontus Leitzler	4774d6736a	Use go env to fetch GOPATH to support Go 1.8 Go 1.8 does not require the GOPATH environment variable to be set, and make will fail if it isn't set.	2017-03-22 19:04:20 +01:00
Julius Volz	8fda83ea12	Make rules only read local data	2017-03-21 00:50:04 +01:00
Julius Volz	94acd3f1d8	Add fanin tests and fix uncovered bugs	2017-03-21 00:08:17 +01:00
Julius Volz	9b33cfc457	Fix/unify context-based remote storage timeouts	2017-03-20 14:17:06 +01:00
Julius Volz	815762a4ad	Move retrieval.NewHTTPClient -> httputil.NewClientFromConfig	2017-03-20 14:17:04 +01:00
Julius Volz	eb14678a25	Make remote read/write use config.HTTPClientConfig	2017-03-20 13:37:50 +01:00
Julius Volz	406b65d0dc	Rename remote.Storage to remote.Writer	2017-03-20 13:15:28 +01:00
Julius Volz	02395a224d	[WIP] Remote Read	2017-03-20 13:13:44 +01:00
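
Note on commit 04ccf84559 (GOGC default): the behavior described in that message, lowering the GC target to 40 unless GOGC is explicitly set, might look roughly like the Go sketch below. This is an illustration only, not the code from the commit. The names `applyDefaultGOGC` and `defaultGCPercent` and the injected `lookup` parameter are assumptions made for the example; `os.LookupEnv` and `debug.SetGCPercent` are standard library calls.

```go
package main

import (
	"fmt"
	"os"
	"runtime/debug"
)

// defaultGCPercent is the value argued for in the commit message.
// (Hypothetical constant name, chosen for this sketch.)
const defaultGCPercent = 40

// applyDefaultGOGC sets the GC target percentage to defaultGCPercent unless
// the GOGC environment variable is set to a non-empty value, in which case
// the Go runtime's own handling of GOGC is left untouched.
// The lookup function is injected so the decision can be exercised without
// modifying the real process environment. It reports whether the default
// was applied.
func applyDefaultGOGC(lookup func(string) (string, bool)) bool {
	if v, ok := lookup("GOGC"); ok && v != "" {
		return false // explicit user setting wins
	}
	debug.SetGCPercent(defaultGCPercent)
	return true
}

func main() {
	if applyDefaultGOGC(os.LookupEnv) {
		fmt.Println("GOGC not set, using default of", defaultGCPercent)
	} else {
		fmt.Println("honoring explicitly set GOGC =", os.Getenv("GOGC"))
	}
}
```

Running the program with `GOGC=100` in the environment takes the second branch and leaves the runtime setting alone; running it without GOGC applies the lower default.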


3764 Commits