Commit Graph

105 Commits

Author SHA1 Message Date
Chris Marchbanks dfad1da296
Remove duplicate metrics in QueueManager
Right now any new metrics added for remote write need to be added to
both the QueueManager struct, and the queueManagerMetrics struct.
Instead, use the queueManagerMetrics struct directly from QueueManager.

The newQueueManagerMetrics constructor will now create the metrics for a
specific queue with name and endpoint pre-populated, and a new copy of
the struct will be created specifically for each queue.

This also fixes a bug where prometheus_remote_storage_sent_bytes_total
is not being unregistered after a queue is changed.

Signed-off-by: Chris Marchbanks <csmarchbanks@gmail.com>
2020-05-05 14:13:59 -06:00
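
A minimal sketch of the constructor pattern described above, assuming client_golang's ConstLabels carry the pre-populated queue name and endpoint (the field set, help text, and unregister helper are illustrative, not the full implementation):

    package remote

    import "github.com/prometheus/client_golang/prometheus"

    // queueManagerMetrics holds the per-queue metrics; only one field is shown here.
    type queueManagerMetrics struct {
        reg       prometheus.Registerer
        sentBytes prometheus.Counter
    }

    // newQueueManagerMetrics builds a fresh copy of the metrics for one queue,
    // with the remote name and endpoint baked in as constant labels.
    func newQueueManagerMetrics(r prometheus.Registerer, name, endpoint string) *queueManagerMetrics {
        m := &queueManagerMetrics{reg: r}
        constLabels := prometheus.Labels{"remote_name": name, "url": endpoint}
        m.sentBytes = prometheus.NewCounter(prometheus.CounterOpts{
            Namespace:   "prometheus",
            Subsystem:   "remote_storage",
            Name:        "sent_bytes_total",
            Help:        "The total number of bytes sent by the queue.",
            ConstLabels: constLabels,
        })
        if r != nil {
            r.MustRegister(m.sentBytes)
        }
        return m
    }

    // unregister drops the per-queue metrics when the queue is torn down, which
    // is what keeps sent_bytes_total from leaking across config changes.
    func (m *queueManagerMetrics) unregister() {
        if m.reg != nil {
            m.reg.Unregister(m.sentBytes)
        }
    }
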
qinng f36ae1c21c
[remote-storage] use warn log level when sending samples to remote fails (#7184)
[remote] increase sendbatch error log level

Signed-off-by: guoruyi1 <guoruyi1@xiaomi.com>
Co-authored-by: guoruyi1 <guoruyi1@xiaomi.com>
2020-04-30 17:06:22 -06:00
Marek Slabicki 4b5e7d4984
Adding a shouldReshard function to modularize the QueueManager's logic for deciding whether it should reshard (#7143)
Signed-off-by: Marek Slabicki <thaniri@gmail.com>
2020-04-20 16:20:39 -06:00
Chris Marchbanks cd12f0873c
Merge pull request #7073 from csmarchbanks/fix-md5-remote-write
Fix remote write not updating when relabel configs or secrets change
2020-04-16 16:36:25 -06:00
Chris Marchbanks 5ab6b043c1
Always update lastSendTimestamp after a request (#7122)
If the server is returning non-recoverable errors, such as if we are
trying to push samples that are too old, remote write will never
reshard. Non-recoverable errors should be treated the same as success
for the purpose of resharding, just as we do with sample rates and
durations.

Signed-off-by: Chris Marchbanks <csmarchbanks@gmail.com>
2020-04-15 09:03:28 -06:00
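
A tiny sketch of the intent, with assumed field and method names: the send timestamp is recorded after every attempt, so a stream of non-recoverable errors still counts as "recently sent" for resharding purposes.

    package remote

    import (
        "sync/atomic"
        "time"
    )

    // queueManager is a stripped-down stand-in; the real struct has many more fields.
    type queueManager struct {
        lastSendTimestamp int64 // seconds since epoch, read and written atomically
    }

    // markSendAttempt is called after every request, whether it succeeded or
    // failed with a non-recoverable error (e.g. samples that are too old), so
    // the resharding calculation never sees a permanently stale timestamp.
    func (q *queueManager) markSendAttempt() {
        atomic.StoreInt64(&q.lastSendTimestamp, time.Now().Unix())
    }
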
Chris Marchbanks d88a2b0261 Handle secret changes in remote write ApplyConfig
Remake the http client whenever ApplyConfig is called. This allows
secrets to be updated without needing to restart an otherwise unchanged
queue.

Signed-off-by: Chris Marchbanks <csmarchbanks@gmail.com>
2020-04-13 23:14:15 +00:00
Marek Slabicki 8224ddec23
Capitalizing first letter of all log lines (#7043)
Signed-off-by: Marek Slabicki <thaniri@gmail.com>
2020-04-11 09:22:18 +01:00
Callum Styan f802f1e8ca
Fix bug with WAL watcher and Live Reader metrics usage. (#6998)
* Fix bug with WAL watcher and Live Reader metrics usage.

Calling NewXMetrics when creating a Watcher or LiveReader results in a
registration error that we were ignoring, so every Watcher/Reader after the
first had no metrics at all. As a result, we only had metrics like Watcher
Records Read for the first remote write config in a user's config file.

Signed-off-by: Callum Styan <callumstyan@gmail.com>
2020-03-20 17:34:15 +01:00
helenxu1221 7df4fe3faa reset counter after collecting metric (#6798)
Signed-off-by: HelenXu <helenxu@Helens-MacBook-Pro.local>
2020-02-09 20:51:21 -07:00
Robert Fratto a53e00f9fd
pass registerer from storage to queue manager for its metrics (#6728)
* pass registerer from storage to queue manager for its metrics

Signed-off-by: Robert Fratto <robert.fratto@grafana.com>
2020-02-03 13:47:03 -08:00
Chris Marchbanks 7f3aca62c4 Only reduce the number of shards when caught up.
Signed-off-by: Chris Marchbanks <csmarchbanks@gmail.com>
2020-01-06 14:53:23 -07:00
Chris Marchbanks 9e24e1f9e8 Use samplesPending rather than integral
The integral accumulator in the remote write sharding code is just a
second way of keeping track of the number of samples pending. Remove
integralAccumulator and use the samplesPending value we already
calculate to determine the number of shards.

This has the added benefit of fixing a bug where the integralAccumulator
was not being initialized correctly due to not taking into account the
number of ticks being counted, causing the integralAccumulator initial
value to be off by an order of magnitude in some cases.

Signed-off-by: Chris Marchbanks <csmarchbanks@gmail.com>
2020-01-06 14:53:23 -07:00
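
A simplified sketch of the arithmetic this enables (not the exact Prometheus formula; the drain horizon is an assumption): the backlog is expressed directly as samplesPending rather than via a separately maintained integral accumulator.

    package remote

    // calculateDesiredShards is an illustrative simplification: estimate how many
    // shards are needed from the observed rates and the current backlog.
    func calculateDesiredShards(samplesInRate, samplesOutRate, samplesOutDuration, samplesPending float64) float64 {
        if samplesOutRate <= 0 {
            return 1
        }
        // Average wall-clock seconds spent sending a single sample.
        timePerSample := samplesOutDuration / samplesOutRate
        // Enough shards to keep up with the ingest rate plus drain the backlog
        // over an assumed ten-second horizon.
        const drainSeconds = 10
        return timePerSample * (samplesInRate + samplesPending/drainSeconds)
    }
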
Callum Styan 67838643ee
Add config option for remote job name (#6043)
* Track remote write queues via a map so we don't care about index.

Signed-off-by: Callum Styan <callumstyan@gmail.com>

* Support a job name for remote write/read so we can differentiate between
them using the name.

Signed-off-by: Callum Styan <callumstyan@gmail.com>

* Remote write/read has Name to not confuse the meaning of the field with
scrape job names.

Signed-off-by: Callum Styan <callumstyan@gmail.com>

* Split queue/client label into remote_name and url labels.

Signed-off-by: Callum Styan <callumstyan@gmail.com>

* Don't allow for duplicate remote write/read configs.

Signed-off-by: Callum Styan <callumstyan@gmail.com>

* Ensure we restart remote write queues if the hash of their config has
not changed, but the remote name has changed.

Signed-off-by: Callum Styan <callumstyan@gmail.com>

* Include name in remote read/write config hashes, simplify duplicates
check, update test accordingly.

Signed-off-by: Callum Styan <callumstyan@gmail.com>
2019-12-12 12:47:23 -08:00
Chris Marchbanks 6f34e35b3e
Record the exact value of desired shards in metric
It is possible for desired shards to sit persistently a little above the
number of running shards (by less than 30%); exporting desired shards as the
raw value makes it easy to tell when a Prometheus is in that situation.

Signed-off-by: Chris Marchbanks <csmarchbanks@gmail.com>
2019-11-26 06:26:03 -07:00
Chris Marchbanks 0e684ca205
Fix unknown type in sharding up log
Signed-off-by: Chris Marchbanks <csmarchbanks@gmail.com>
2019-11-26 06:22:56 -07:00
Callum Styan c2cb1e4103 Add a metric to track total bytes sent per remote write queue. (#6344)
Signed-off-by: Callum Styan <callumstyan@gmail.com>
2019-11-25 13:25:18 -07:00
Tom Wilkie de0a772b8e Port tsdb to use pkg/labels. (#6326)
* Port tsdb to use pkg/labels.

Signed-off-by: Tom Wilkie <tom.wilkie@gmail.com>

* Get tests passing.

Signed-off-by: Tom Wilkie <tom.wilkie@gmail.com>

* Remove useless cast.

Signed-off-by: Tom Wilkie <tom.wilkie@gmail.com>

* Appease linters.

Signed-off-by: Tom Wilkie <tom.wilkie@gmail.com>

* Fix review comments

Signed-off-by: Ganesh Vernekar <cs15btech11018@iith.ac.in>
2019-11-18 11:53:33 -08:00
Callum Styan 5f1be2cf45 Refactor calculateDesiredShards + don't reshard if we're having issues sending samples. (#6111)
* Refactor calculateDesiredShards + don't reshard if we're having issues
sending samples.
* Track lastSendTimestamp via an int64 with atomic add/load, add a test
for reshard calculation.
* Simplify conditional for skipping resharding, add samplesIn/Out to shard
testcase struct.

Signed-off-by: Callum Styan <callumstyan@gmail.com>
2019-10-21 15:54:25 -06:00
Chris Marchbanks 8df4bca470
Garbage collect asynchronously in the WAL Watcher
The WAL Watcher replays a checkpoint after it is created in order to
garbage collect series that no longer exist in the WAL. Currently the
garbage collection process is done serially with reading from the tip of
the WAL which can cause large delays in writing samples to remote
storage just after compaction occurs.

This also fixes a memory leak where dropped series are not cleaned up as
part of the SeriesReset process.

Signed-off-by: Chris Marchbanks <csmarchbanks@gmail.com>
2019-10-07 14:36:10 -06:00
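
One way to picture the change (names are illustrative): checkpoint replay for garbage collection runs in its own goroutine, gated so only one runs at a time, instead of blocking the loop that tails the WAL.

    package remote

    import "log"

    // watcher is a stand-in for the WAL watcher; only the GC gate is shown.
    type watcher struct {
        gcSem chan struct{} // capacity 1: at most one garbage collection at a time
    }

    func newWatcher() *watcher {
        return &watcher{gcSem: make(chan struct{}, 1)}
    }

    // maybeGarbageCollect starts series garbage collection in the background so
    // reading the tip of the WAL is not delayed right after a compaction.
    func (w *watcher) maybeGarbageCollect(checkpointIndex int) {
        select {
        case w.gcSem <- struct{}{}:
            go func() {
                defer func() { <-w.gcSem }()
                if err := w.garbageCollectSeries(checkpointIndex); err != nil {
                    log.Println("error garbage collecting series:", err)
                }
            }()
        default:
            // A garbage collection is already in flight; skip this one.
        }
    }

    // garbageCollectSeries replays the checkpoint and drops series that no
    // longer exist in the WAL (body elided in this sketch).
    func (w *watcher) garbageCollectSeries(checkpointIndex int) error {
        return nil
    }
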
Callum Styan 3344bb5c33 Move WAL watcher code to tsdb/wal package. (#5999)
* Move WAL watcher code to tsdb/wal package.

Signed-off-by: Callum Styan <callumstyan@gmail.com>

* Fix tests after moving WAL watcher code.

Signed-off-by: Callum Styan <callumstyan@gmail.com>

* Lint fixes.

Signed-off-by: Callum Styan <callumstyan@gmail.com>
2019-09-19 14:45:41 +05:30
Björn Rabenstein 3b3eaf3496
Merge pull request #5787 from cstyan/reshard-max-logging
Add metrics for max/min/desired shards to queue manager.
2019-09-09 22:32:54 +02:00
Chris Marchbanks 791a2409a2
Pre-allocate pendingSamples to reduce allocations
Signed-off-by: Chris Marchbanks <csmarchbanks@gmail.com>
2019-09-03 15:41:47 -06:00
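
For illustration (the helper below is hypothetical), pre-allocating with the configured max-samples-per-send capacity avoids repeated slice growth on the send loop's hot path:

    package remote

    import "github.com/prometheus/prometheus/prompb"

    // newPendingSamples allocates the per-shard batch once, at full capacity,
    // so appends in the send loop never reallocate.
    func newPendingSamples(maxSamplesPerSend int) []prompb.TimeSeries {
        return make([]prompb.TimeSeries, 0, maxSamplesPerSend)
    }
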
Chris Marchbanks 160186da18
Store labels.Labels instead of []prompb.Label
This uses half the steady-state memory that []prompb.Label requires.

Signed-off-by: Chris Marchbanks <csmarchbanks@gmail.com>
2019-09-03 15:41:46 -06:00
Bartek Płotka 32be514845
Merge pull request #5805 from codesome/merge-tsdb
Merge tsdb into prometheus
2019-08-13 11:39:41 +01:00
Chris Marchbanks a6a55c433c Improve desired shards calculation (#5763)
The desired shards calculation now properly keeps track of the rate of
pending samples, and uses the previously unused integralAccumulator to adjust
for information missing from the calculation.

Also, configure more capacity for each shard. The default capacity of 10
causes shards to block on each other while sending remote requests. Default
to a capacity of 500 samples and explain in the documentation that more
capacity helps throughput.

Signed-off-by: Chris Marchbanks <csmarchbanks@gmail.com>
2019-08-13 10:10:21 +01:00
Ganesh Vernekar 5ecef3542d
Cleanup after merging tsdb into prometheus
Signed-off-by: Ganesh Vernekar <cs15btech11018@iith.ac.in>
2019-08-13 14:04:14 +05:30
ethan 38ccf0157e cleanup: correct func name in log message (#5852)
Signed-off-by: Guangming Wang <guangming.wang@daocloud.io>
2019-08-10 16:24:58 +01:00
Callum Styan c40a83f386 Add metrics for max shards, min shards, and desired shards.
Signed-off-by: Callum Styan <callumstyan@gmail.com>
2019-08-04 20:04:19 -07:00
Xigang Wang 445bcd1251 Update the runShard method and change len(pendingSamples) to n=len(pendingSamples) (#5708)
Signed-off-by: xigang <wangxigang2014@gmail.com>
2019-07-09 19:09:11 +01:00
Chris Marchbanks 06bdaf076f Remote Write Allocs Improvements (#5614)
* Add benchmark for sample delivery
* Simplify StoreSeries to have only one loop
* Reduce allocations for pending samples in runShard
* Only allocate one send slice per segment
* Cache a buffer in each shard for snappy to use
* Remove queue manager seriesMtx

  None of the call sites protected by the seriesMtx can run concurrently, so
  the mutex is safe to remove. Removing it lets the Append code be simplified
  to a single loop.

Signed-off-by: Chris Marchbanks <csmarchbanks@gmail.com>
2019-06-27 19:48:21 +01:00
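
A sketch of the cached snappy buffer mentioned above (struct and field names are assumptions): each shard keeps a scratch buffer and hands it to snappy.Encode, which only allocates when the buffer is too small.

    package remote

    import "github.com/golang/snappy"

    // shardBuf is illustrative: one scratch buffer per shard, reused across sends.
    type shardBuf struct {
        snappyBuf []byte
    }

    // compress reuses the shard's buffer; snappy.Encode returns the encoded
    // bytes, growing the buffer only when necessary.
    func (s *shardBuf) compress(raw []byte) []byte {
        s.snappyBuf = snappy.Encode(s.snappyBuf, raw)
        return s.snappyBuf
    }
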
Callum Styan 3639d51eb6 Remote Storage: string interner should not panic in release (#5487)
* Don't panic if we try to release a string that is not in the interner.

* Move seriesMtx locking in QueueManager's StoreSeries function.

This stops us from calling release for strings that aren't interned if
there's a race between reading a checkpoint and storing new series
labels, which could happen during checkpointing or reloading config.

Signed-off-by: Callum Styan <callumstyan@gmail.com>
2019-04-24 10:46:31 +01:00
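
A compact sketch of the interner behaviour (types and names are illustrative; the real implementation differs in detail): each string carries a reference count, and releasing a string that was never interned logs instead of panicking.

    package remote

    import (
        "log"
        "sync"
    )

    // pool is an illustrative interner: one canonical copy per string plus a
    // reference count, guarded by a mutex.
    type pool struct {
        mtx  sync.Mutex
        refs map[string]*entry
    }

    type entry struct {
        s    string
        refs int
    }

    func newPool() *pool {
        return &pool{refs: map[string]*entry{}}
    }

    func (p *pool) intern(s string) string {
        p.mtx.Lock()
        defer p.mtx.Unlock()
        if e, ok := p.refs[s]; ok {
            e.refs++
            return e.s
        }
        p.refs[s] = &entry{s: s, refs: 1}
        return s
    }

    // release decrements the count and deletes the entry when it reaches zero.
    // Unknown strings are logged rather than panicking, so a race between
    // checkpoint replay and a config reload cannot crash the process.
    func (p *pool) release(s string) {
        p.mtx.Lock()
        defer p.mtx.Unlock()
        e, ok := p.refs[s]
        if !ok {
            log.Println("released a string that was never interned:", s)
            return
        }
        if e.refs--; e.refs == 0 {
            delete(p.refs, s)
        }
    }
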
Callum Styan e87449b59d Remote Write: Queue Manager specific metrics shouldn't exist if the queue no longer exists (#5445)
* Unregister remote write queue manager specific metrics when stopping the
queue manager.

* Use DeleteLabelValues instead of Unregister to remove queue and watcher
related metrics when we stop them. Create those metrics in the structs' Start
functions rather than in their constructors because of the ordering of
creation, start, and stop in remote storage ApplyConfig.

* Add a setMetrics function to the WAL watcher so we can set the watcher's
metrics in its Start function, but not have to call Start in some tests
(which causes a data race).

Signed-off-by: Callum Styan <callumstyan@gmail.com>
2019-04-23 09:49:17 +01:00
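
A hedged sketch of the DeleteLabelValues approach using client_golang's vector API (the metric and label shown are illustrative): stopping a queue removes only that queue's child series instead of unregistering the shared collector.

    package remote

    import "github.com/prometheus/client_golang/prometheus"

    // droppedSamplesTotal is shared across queues; each queue only owns its own
    // child series, identified by the queue label.
    var droppedSamplesTotal = prometheus.NewCounterVec(
        prometheus.CounterOpts{
            Namespace: "prometheus",
            Subsystem: "remote_storage",
            Name:      "dropped_samples_total",
            Help:      "Samples dropped per queue (illustrative).",
        },
        []string{"queue"},
    )

    // stopQueueMetrics removes only this queue's children from the shared
    // vector instead of unregistering the whole collector.
    func stopQueueMetrics(queueName string) {
        droppedSamplesTotal.DeleteLabelValues(queueName)
    }
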
Vasily Sliouniaev 5be9a1426f Prevent reshard concurrent with calling stop (#5460)
* Prevent reshard concurrent with calling stop

Signed-off-by: Vasily <v.sliouniaev@gmail.com>
2019-04-16 11:25:19 +01:00
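
One possible shape of the fix, sketched with assumed names: a mutex and a stopped flag make a late reshard a no-op once Stop has begun.

    package remote

    import "sync"

    // shardRunner stands in for the shard collection owned by the queue manager.
    type shardRunner struct {
        mtx     sync.Mutex
        stopped bool
    }

    // reshard refuses to restart shards once Stop has begun, so a late reshard
    // cannot race with shutdown.
    func (s *shardRunner) reshard(n int) {
        s.mtx.Lock()
        defer s.mtx.Unlock()
        if s.stopped {
            return
        }
        // Stop the existing shards and start n new ones (elided).
    }

    func (s *shardRunner) Stop() {
        s.mtx.Lock()
        s.stopped = true
        s.mtx.Unlock()
        // Drain and wait for in-flight sends (elided).
    }
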
Tom Wilkie 807fd33ecc Review feedback.
- Update read path to use labels.Labels.
- Fix the tests.
- Remove pack.
- Remove unused function.
- Fix race in tests.

Signed-off-by: Tom Wilkie <tom.wilkie@gmail.com>
2019-03-18 20:31:12 +00:00
Callum Styan 1a7923dde3 Add ref counting to string interning so we can remove
a string when there are no longer any refs. Add tests for interning.

Co-authored-by: Tom Wilkie <tom.wilkie@gmail.com>

Signed-off-by: Callum Styan <callumstyan@gmail.com>
2019-03-18 20:31:12 +00:00
Tom Wilkie c7b3535997 Use pkg/relabelling in remote write.
- Unmarshal external_labels config as labels.Labels, add tests.
- Convert some more uses of model.LabelSet to labels.Labels.
- Remove old relabel pkg (fixes #3647).
- Validate external label names.

Signed-off-by: Tom Wilkie <tom.wilkie@gmail.com>
2019-03-18 20:31:12 +00:00
Tom Wilkie 2fa93595d6
More WAL remote_write tweaks. (#5300)
* Consistently pre-lookup the metrics for a given queue in queue manager.
* Don't open the WAL (for writing) in the remote_write code.
* Add some more logging.

Signed-off-by: Tom Wilkie <tom.wilkie@gmail.com>
2019-03-05 12:21:11 +00:00
Tariq Ibrahim ab8e9b7423 fix typo in queue_manager.go comment (#5294)
Signed-off-by: tariqibrahim <tariq181290@gmail.com>
2019-03-03 11:35:29 +00:00
Tom Wilkie 67da8e7b46
Refactor and fix queue resharding (#5286)
- Remove prometheus_remote_queue_last_send_timestamp_seconds metric. It's not particularly useful; we have highest_timestamp_seconds.
- Factor out maxGauge, a gauge that only increases.
- Change sharding calculations to use max samples in timestamp - max samples out timestamp (not rates).
- Also include the ratio of samples dropped to correctly predict the number of pending samples.

Signed-off-by: Tom Wilkie <tom.wilkie@gmail.com>
2019-03-01 11:04:26 -08:00
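
The maxGauge mentioned above can be pictured as a thin wrapper over a regular gauge (this sketch embeds client_golang's Gauge; the real type may differ):

    package remote

    import (
        "sync"

        "github.com/prometheus/client_golang/prometheus"
    )

    // maxGauge only ever records the largest value it has seen, which suits
    // highest-timestamp style metrics.
    type maxGauge struct {
        mtx   sync.Mutex
        value float64
        prometheus.Gauge
    }

    // Set updates the underlying gauge only when the new value is larger.
    func (m *maxGauge) Set(v float64) {
        m.mtx.Lock()
        defer m.mtx.Unlock()
        if v > m.value {
            m.value = v
            m.Gauge.Set(v)
        }
    }
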
Callum Styan b8106dd459 Review feedback:
- Add a dropped samples EWMA and use it in calculating desired shards.
- Update metric names and log messages.
- Limit number of entries in the dedupe logging middleware to prevent potential OOM.

Signed-off-by: Callum Styan <callumstyan@gmail.com>
Signed-off-by: Tom Wilkie <tom.wilkie@gmail.com>
2019-02-28 08:38:39 -08:00
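
The dropped-samples rate can be tracked with a small exponentially weighted moving average; a generic sketch (smoothing factor and locking details are simplified, not the exact Prometheus type):

    package remote

    import "sync/atomic"

    // ewmaRate is an illustrative rate tracker: counts are accumulated
    // atomically between ticks and folded into a moving average on each tick.
    type ewmaRate struct {
        newEvents int64   // updated atomically between ticks
        alpha     float64 // smoothing factor, e.g. 0.2
        interval  float64 // tick interval in seconds
        lastRate  float64
        init      bool
    }

    func (r *ewmaRate) incr(n int64) {
        atomic.AddInt64(&r.newEvents, n)
    }

    // tick runs once per interval and blends the instantaneous rate into the
    // average. Concurrent access to lastRate would need a lock in real code.
    func (r *ewmaRate) tick() {
        count := atomic.SwapInt64(&r.newEvents, 0)
        instant := float64(count) / r.interval
        if r.init {
            r.lastRate += r.alpha * (instant - r.lastRate)
        } else {
            r.init = true
            r.lastRate = instant
        }
    }

    func (r *ewmaRate) rate() float64 { return r.lastRate }
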
Tom Wilkie f795942572 Decrement pending sample when queue exits.
Signed-off-by: Tom Wilkie <tom.wilkie@gmail.com>
2019-02-28 08:38:39 -08:00
Tom Wilkie efbd9559f4 Deal with corruptions in the WAL:
- If we're replaying the WAL to get series records, skip that segment when we hit corruptions.
- If we're tailing the WAL for samples, fail the watcher.
- When the watcher fails, restart from the latest checkpoint - and only send new samples by updating startTime.
- Tidy up log lines and error handling, don't return so many errors on quitting.
- Expect EOF when processing checkpoints.

Signed-off-by: Tom Wilkie <tom.wilkie@gmail.com>
2019-02-28 08:38:39 -08:00
Tom Wilkie d6f911b511 Factor out logging ratelimit & dedupe middleware.
Signed-off-by: Tom Wilkie <tom.wilkie@gmail.com>
2019-02-28 08:38:39 -08:00
Tom Wilkie 37ad4db485 Export timestamps in seconds since epoch.
Signed-off-by: Tom Wilkie <tom.wilkie@gmail.com>
2019-02-28 08:38:39 -08:00
JoeWrightss 362873f72b Fix .Log() error message (#5257)
Signed-off-by: zhoulin xie <zhoulin.xie@daocloud.io>
2019-02-22 14:39:37 +00:00
Callum Styan 37e35f9e0c Various improvements to WAL based remote write.
- Use the queue name in WAL watcher logging.
- Don't return from watch if the reader error was EOF.
- Fix sample timestamp check logic regarding what samples we send.
- Refactor so we don't need readToEnd/readSeriesRecords.
- Fix wal_watcher tests since readToEnd no longer exists.

Signed-off-by: Callum Styan <callumstyan@gmail.com>
2019-02-12 11:39:13 +00:00
Tom Wilkie b93bafeee1 Various fixes to locking & shutdown for WAL-based remote write.
- Remove data race in the exported highest scrape timestamp.
- Backoff on enqueue should be per-sample - reset the result for each sample.
- Remove diffKeys, unused ctx and cancelfunc in WALWatcher, 'name' from writeTo interface, and pass it to constructor.
- Reorder functions in WALWatcher depth-first according to call graph.
- Fix vendor/modules.txt.
- Split out the various timer periods into consts at the top of the file.
- Move w.currentSegmentMetric.Set close to where we set the currentSegment.
- Combine r.Next() and isClosed(w.quit) into a single loop.
- Unnest some ifs in WALWatcher.watch, propagate errors in decodeRecord, add some new lines to make it easier to read.
- Reorganise checkpoint handling to reduce nesting and make it easier to follow.

Signed-off-by: Tom Wilkie <tom.wilkie@gmail.com>
2019-02-12 11:39:13 +00:00
Callum Styan 6f69e31398 Tail the TSDB WAL for remote_write
This change switches the remote_write API to use the TSDB WAL.  This should reduce memory usage and prevent sample loss when the remote endpoint is down.

We use the new LiveReader from TSDB to tail WAL segments.  Logic for finding the tracking segment is included in this PR.  The WAL is tailed once for each remote_write endpoint specified. Reading from the segment is based on a ticker rather than relying on fsnotify write events, which were found to be complicated and unreliable in early prototypes.

Enqueuing a sample for sending via remote_write can now block, to provide back pressure.  Queues are still required to achieve parallelism and batching.  We have updated the queue config based on new defaults for queue capacity and pending samples values - much smaller values are now possible.  The remote_write resharding code has been updated to prevent deadlocks, and extra tests have been added for these cases.

As part of this change, we attempt to guarantee that samples are not lost; however, this initial version doesn't guarantee this across Prometheus restarts or non-retryable errors from the remote end (e.g. 400s).

This change also includes the following optimisations:
- only marshal the proto request once, not once per retry
- maintain a single copy of the labels for given series to reduce GC pressure

Other minor tweaks:
- only reshard if we've also successfully sent recently
- add pending samples, latest sent timestamp, WAL events processed metrics

Co-authored-by: Chris Marchbanks <csmarchbanks.com> (initial prototype)
Co-authored-by: Tom Wilkie <tom.wilkie@gmail.com> (sharding changes)
Signed-off-by: Callum Styan <callumstyan@gmail.com>
2019-02-12 11:39:13 +00:00
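
A rough sketch of the ticker-driven tailing loop described above; segmentReader is a simplified stand-in for the TSDB LiveReader, and the tick interval is an assumption.

    package remote

    import (
        "context"
        "time"
    )

    // segmentReader is a simplified stand-in for the TSDB LiveReader.
    type segmentReader interface {
        Next() bool     // true when a complete record is available
        Record() []byte // the current record
    }

    // tailSegment polls the open segment on a ticker instead of relying on
    // fsnotify write events, draining every available record on each tick.
    func tailSegment(ctx context.Context, r segmentReader, process func([]byte)) {
        ticker := time.NewTicker(10 * time.Millisecond)
        defer ticker.Stop()
        for {
            select {
            case <-ctx.Done():
                return
            case <-ticker.C:
                for r.Next() {
                    process(r.Record())
                }
            }
        }
    }
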
Simon Pasquier f678e27eb6
*: use latest release of staticcheck (#5057)
* *: use latest release of staticcheck

It also fixes a couple of things in the code flagged by the additional
checks.

Signed-off-by: Simon Pasquier <spasquie@redhat.com>

* Use official release of staticcheck

Also run 'go list' before staticcheck to avoid failures when downloading packages.

Signed-off-by: Simon Pasquier <spasquie@redhat.com>
2019-01-04 14:47:38 +01:00
Bartek Płotka 62c8337e77 Moved configuration into `relabel` package. (#4955)
Adapted top dir relabel to use pkg relabel structs.

Removal of this is tracked separately here: https://github.com/prometheus/prometheus/issues/3647

Signed-off-by: Bartek Plotka <bwplotka@gmail.com>
2018-12-18 11:26:36 +00:00