prometheus

Commit Graph

Author	SHA1	Message	Date
Joe Elliott	04b028f1e6	Exports recoverable error (#7689 ) Signed-off-by: Joe Elliott <number101010@gmail.com>	2020-07-29 11:08:25 -06:00
Ganesh Vernekar	a4c2ea1ca3	Merge remote-tracking branch 'upstream/master' into merge-release-2.19 Signed-off-by: Ganesh Vernekar <cs15btech11018@iith.ac.in>	2020-06-26 14:33:50 +05:30
Chris Marchbanks	b299aba6cf	Fix panic when updating a remote write queue (#7452 ) Right now Queue Manager metrics are registered when the metrics struct is created, which happens before a changed queue is shut down and the old metrics are unregistered. In the case of named queues or updates to external labels the apply config will panic due to duplicate metrics. Instead, register the metrics as part of starting the queue as we always guarantee that Stop will be called before a new Start. Signed-off-by: Chris Marchbanks <csmarchbanks@gmail.com>	2020-06-26 12:03:52 +05:30
Chris Marchbanks	d78656c244	Pending Samples metric includes samples in channel (#7335 ) * Pending Samples metric includes samples in channel The pending samples metric should also include samples waiting in the channels to be sent to provide a more accurate measure. In addition, make sure that the pending samples is reset to 0 anytime a queue is started as we remake all of the shards at that time. Signed-off-by: Chris Marchbanks <csmarchbanks@gmail.com> * Log the number of dropped samples on hard shutdown Signed-off-by: Chris Marchbanks <csmarchbanks@gmail.com>	2020-06-25 14:48:30 -06:00
Bartlomiej Plotka	b788986717	storage: Adjusted fully storage layer support for chunk iterators: Remote read client, readyStorage, fanout. (#7059 ) * Fixed nits introduced by https://github.com/prometheus/prometheus/pull/7334 * Added ChunkQueryable implementation to fanout and readyStorage. * Added more comments. * Changed NewVerticalChunkSeriesMerger to CompactingChunkSeriesMerger, removed tiny interface by reusing VerticalSeriesMergeFunc for overlapping algorithm for both chunks and series, for both querying and compacting (!) + made sure duplicates are merged. * Added ErrChunkSeriesSet * Added Samples interface for seamless []prompb.Sample to []tsdbutil.Sample conversion. * Deprecating non chunks series set based StreamChunkedReadResponses, added chunk one. * Improved tests. * Split remote client into Write (old storage) and read. * Queryable client is now SampleAndChunkQueryable. Since we cannot use nice QueryableFunc I moved all config based options to sampleAndChunkQueryableClient to avoid boilerplate. In next commit: Changes for TSDB. Signed-off-by: Bartlomiej Plotka <bwplotka@gmail.com>	2020-06-24 14:41:52 +01:00
Bert Hartmann	82c7cd320a	increase the remote write bucket range (#7323 ) * increase the remote write bucket range Increase the range of remote write buckets to capture times above 10s for laggy scenarios Buckets had been: {.005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5, 10} Buckets are now: {0.03125, 0.0625, 0.125, 0.25, 0.5, 1, 2, 4, 8, 16, 32, 64, 128, 256, 512} Signed-off-by: Bert Hartmann <berthartm@gmail.com> * revert back to DefBuckets with addons to be backwards compatible Signed-off-by: Bert Hartmann <berthartm@gmail.com> * shuffle the buckets to maintain 2-2.5x increases Signed-off-by: Bert Hartmann <berthartm@gmail.com>	2020-06-04 13:54:47 -06:00
Cody Boggs	3268eac2dd	Trace Remote Write requests (#7206 ) * Trace Remote Write requests Signed-off-by: Cody Boggs <cboggs@splunk.com> * Refactor store attempts to keep code flow clearer, and avoid so many places to deal with span finishing Signed-off-by: Cody Boggs <cboggs@splunk.com>	2020-06-01 09:21:13 -06:00
Chris Marchbanks	dfad1da296	Remove duplicate metrics in QueueManager Right now any new metrics added for remote write need to be added to both the QueueManager struct, and the queueManagerMetrics struct. Instead, use the queueManagerMetrics struct directly from QueueManager. The newQueueManagerMetrics constructor will now create the metrics for a specific queue with name and endpoint pre-populated, and a new copy of the struct will be created specifically for each queue. This also fixes a bug where prometheus_remote_storage_sent_bytes_total is not being unregistered after a queue is changed. Signed-off-by: Chris Marchbanks <csmarchbanks@gmail.com>	2020-05-05 14:13:59 -06:00
qinng	f36ae1c21c	[remote-storage] use warn log level when send samples to remote failed (#7184 ) [remote] increasing sendbatch error log level Signed-off-by: guoruyi1 <guoruyi1@xiaomi.com> Co-authored-by: guoruyi1 <guoruyi1@xiaomi.com>	2020-04-30 17:06:22 -06:00
Marek Slabicki	4b5e7d4984	Adding a shouldReshard function to modularize logic for the QueueManager deciding if it should shard or not (#7143 ) Signed-off-by: Marek Slabicki <thaniri@gmail.com>	2020-04-20 16:20:39 -06:00
Chris Marchbanks	cd12f0873c	Merge pull request #7073 from csmarchbanks/fix-md5-remote-write Fix remote write not updating when relabel configs or secrets change	2020-04-16 16:36:25 -06:00
Chris Marchbanks	5ab6b043c1	Always update lastSendTimestamp after a request (#7122 ) If the server is returning non-recoverable errors, such as if we are trying to push samples that are too old, remote write will never reshard. Non-recoverable errors should be treated the same as success for the purpose of resharding, just as we do with sample rates and durations. Signed-off-by: Chris Marchbanks <csmarchbanks@gmail.com>	2020-04-15 09:03:28 -06:00
Chris Marchbanks	d88a2b0261	Handle secret changes in remote write ApplyConfig Remake the http client whenever ApplyConfig is called. This allows secrets to be updated without needing to restart an otherwise unchanged queue. Signed-off-by: Chris Marchbanks <csmarchbanks@gmail.com>	2020-04-13 23:14:15 +00:00
Marek Slabicki	8224ddec23	Capitalizing first letter of all log lines (#7043 ) Signed-off-by: Marek Slabicki <thaniri@gmail.com>	2020-04-11 09:22:18 +01:00
Callum Styan	f802f1e8ca	Fix bug with WAL watcher and Live Reader metrics usage. (#6998 ) * Fix bug with WAL watcher and Live Reader metrics usage. Calling NewXMetrics when creating a Watcher or LiveReader results in a registration error, which we're ignoring, and as a result other than the first Watcher/Reader created, we had no metrics for either. So we would only have metrics like Watcher Records Read for the first remote write config in a user's config file. Signed-off-by: Callum Styan <callumstyan@gmail.com>	2020-03-20 17:34:15 +01:00
helenxu1221	7df4fe3faa	reset counter after collecting metric (#6798 ) Signed-off-by: HelenXu <helenxu@Helens-MacBook-Pro.local>	2020-02-09 20:51:21 -07:00
Robert Fratto	a53e00f9fd	pass registerer from storage to queue manager for its metrics (#6728 ) * pass registerer from storage to queue manager for its metrics Signed-off-by: Robert Fratto <robert.fratto@grafana.com>	2020-02-03 13:47:03 -08:00
Chris Marchbanks	7f3aca62c4	Only reduce the number of shards when caught up. Signed-off-by: Chris Marchbanks <csmarchbanks@gmail.com>	2020-01-06 14:53:23 -07:00
Chris Marchbanks	9e24e1f9e8	Use samplesPending rather than integral The integral accumulator in the remote write sharding code is just a second way of keeping track of the number of samples pending. Remove integralAccumulator and use the samplesPending value we already calculate to calculate the number of shards. This has the added benefit of fixing a bug where the integralAccumulator was not being initialized correctly due to not taking into account the number of ticks being counted, causing the integralAccumulator initial value to be off by an order of magnitude in some cases. Signed-off-by: Chris Marchbanks <csmarchbanks@gmail.com>	2020-01-06 14:53:23 -07:00
Callum Styan	67838643ee	Add config option for remote job name (#6043 ) * Track remote write queues via a map so we don't care about index. Signed-off-by: Callum Styan <callumstyan@gmail.com> * Support a job name for remote write/read so we can differentiate between them using the name. Signed-off-by: Callum Styan <callumstyan@gmail.com> * Remote write/read has Name to not confuse the meaning of the field with scrape job names. Signed-off-by: Callum Styan <callumstyan@gmail.com> * Split queue/client label into remote_name and url labels. Signed-off-by: Callum Styan <callumstyan@gmail.com> * Don't allow for duplicate remote write/read configs. Signed-off-by: Callum Styan <callumstyan@gmail.com> * Ensure we restart remote write queues if the hash of their config has not changed, but the remote name has changed. Signed-off-by: Callum Styan <callumstyan@gmail.com> * Include name in remote read/write config hashes, simplify duplicates check, update test accordingly. Signed-off-by: Callum Styan <callumstyan@gmail.com>	2019-12-12 12:47:23 -08:00
Chris Marchbanks	6f34e35b3e	Record the exact value of desired shards in metric It is possible that desired shards is always a bit higher than the number of shards (less than 30%) and by exporting desired shards as the raw number it will be easy to tell if a Prometheus is in that situation. Signed-off-by: Chris Marchbanks <csmarchbanks@gmail.com>	2019-11-26 06:26:03 -07:00
Chris Marchbanks	0e684ca205	Fix unknown type in sharding up log Signed-off-by: Chris Marchbanks <csmarchbanks@gmail.com>	2019-11-26 06:22:56 -07:00
Callum Styan	c2cb1e4103	Add a metric to track total bytes sent per remote write queue. (#6344 ) Signed-off-by: Callum Styan <callumstyan@gmail.com>	2019-11-25 13:25:18 -07:00
Tom Wilkie	de0a772b8e	Port tsdb to use pkg/labels. (#6326 ) * Port tsdb to use pkg/labels. Signed-off-by: Tom Wilkie <tom.wilkie@gmail.com> * Get tests passing. Signed-off-by: Tom Wilkie <tom.wilkie@gmail.com> * Remove useless cast. Signed-off-by: Tom Wilkie <tom.wilkie@gmail.com> * Appease linters. Signed-off-by: Tom Wilkie <tom.wilkie@gmail.com> * Fix review comments Signed-off-by: Ganesh Vernekar <cs15btech11018@iith.ac.in>	2019-11-18 11:53:33 -08:00
Callum Styan	5f1be2cf45	Refactor calculateDesiredShards + don't reshard if we're having issues sending samples. (#6111 ) * Refactor calculateDesiredShards + don't reshard if we're having issues sending samples. * Track lastSendTimestamp via an int64 with atomic add/load, add a test for reshard calculation. * Simplify conditional for skipping resharding, add samplesIn/Out to shard testcase struct. Signed-off-by: Callum Styan <callumstyan@gmail.com>	2019-10-21 15:54:25 -06:00
Chris Marchbanks	8df4bca470	Garbage collect asynchronously in the WAL Watcher The WAL Watcher replays a checkpoint after it is created in order to garbage collect series that no longer exist in the WAL. Currently the garbage collection process is done serially with reading from the tip of the WAL which can cause large delays in writing samples to remote storage just after compaction occurs. This also fixes a memory leak where dropped series are not cleaned up as part of the SeriesReset process. Signed-off-by: Chris Marchbanks <csmarchbanks@gmail.com>	2019-10-07 14:36:10 -06:00
Callum Styan	3344bb5c33	Move WAL watcher code to tsdb/wal package. (#5999 ) * Move WAL watcher code to tsdb/wal package. Signed-off-by: Callum Styan <callumstyan@gmail.com> * Fix tests after moving WAL watcher code. Signed-off-by: Callum Styan <callumstyan@gmail.com> * Lint fixes. Signed-off-by: Callum Styan <callumstyan@gmail.com>	2019-09-19 14:45:41 +05:30
Björn Rabenstein	3b3eaf3496	Merge pull request #5787 from cstyan/reshard-max-logging Add metrics for max/min/desired shards to queue manager.	2019-09-09 22:32:54 +02:00
Chris Marchbanks	791a2409a2	Pre-allocate pendingSamples to reduce allocations Signed-off-by: Chris Marchbanks <csmarchbanks@gmail.com>	2019-09-03 15:41:47 -06:00
Chris Marchbanks	160186da18	Store labels.Labels instead of []prompb.Label This will use half the steady state memory as required by prompb.Label. Signed-off-by: Chris Marchbanks <csmarchbanks@gmail.com>	2019-09-03 15:41:46 -06:00
Bartek Płotka	32be514845	Merge pull request #5805 from codesome/merge-tsdb Merge tsdb into prometheus	2019-08-13 11:39:41 +01:00
Chris Marchbanks	a6a55c433c	Improve desired shards calculation (#5763 ) The desired shards calculation now properly keeps track of the rate of pending samples, and uses the previously unused integralAccumulator to adjust for missing information in the desired shards calculation. Also, configure more capacity for each shard. The default 10 capacity causes shards to block on each other while sending remote requests. Default to a 500 sample capacity and explain in the documentation that having more capacity will help throughput. Signed-off-by: Chris Marchbanks <csmarchbanks@gmail.com>	2019-08-13 10:10:21 +01:00
Ganesh Vernekar	5ecef3542d	Cleanup after merging tsdb into prometheus Signed-off-by: Ganesh Vernekar <cs15btech11018@iith.ac.in>	2019-08-13 14:04:14 +05:30
ethan	38ccf0157e	cleanup: correct func name in log message (#5852 ) Signed-off-by: Guangming Wang <guangming.wang@daocloud.io>	2019-08-10 16:24:58 +01:00
Callum Styan	c40a83f386	Add metrics for max shards, min shards, and desired shards. Signed-off-by: Callum Styan <callumstyan@gmail.com>	2019-08-04 20:04:19 -07:00
Xigang Wang	445bcd1251	Update the runShard method and change len(pendingSamples) to n=len(pendingSamples) (#5708 ) Signed-off-by: xigang <wangxigang2014@gmail.com>	2019-07-09 19:09:11 +01:00
Chris Marchbanks	06bdaf076f	Remote Write Allocs Improvements (#5614 ) * Add benchmark for sample delivery * Simplify StoreSeries to have only one loop * Reduce allocations for pending samples in runShard * Only allocate one send slice per segment * Cache a buffer in each shard for snappy to use * Remove queue manager seriesMtx It is not possible for any of the places protected by the seriesMtx to be called concurrently so it is safe to remove. By removing the mutex we can simplify the Append code to one loop. Signed-off-by: Chris Marchbanks <csmarchbanks@gmail.com>	2019-06-27 19:48:21 +01:00
Callum Styan	3639d51eb6	Remote Storage: string interner should not panic in release (#5487 ) * Don't panic if we try to release a string that is not in the interner. * Move seriesMtx locking in QueueManager's StoreSeries function. This stops us from calling release for strings that aren't interned if there's a race between reading a checkpoint and storing new series labels, which could happen during checkpointing or reloading config. Signed-off-by: Callum Styan <callumstyan@gmail.com>	2019-04-24 10:46:31 +01:00
Callum Styan	e87449b59d	Remote Write: Queue Manager specific metrics shouldn't exist if the queue no longer exists (#5445 ) * Unregister remote write queue manager specific metrics when stopping the queue manager. * Use DeleteLabelValues instead of Unregister to remove queue and watcher related metrics when we stop them. Create those metrics in the structs' start functions rather than in their constructors because of the ordering of creation, start, and stop in remote storage ApplyConfig. * Add setMetrics function to WAL watcher so we can set the watcher's metrics in its Start function, but not have to call Start in some tests (causes data race). Signed-off-by: Callum Styan <callumstyan@gmail.com>	2019-04-23 09:49:17 +01:00
Vasily Sliouniaev	5be9a1426f	Prevent reshard concurrent with calling stop (#5460 ) * Prevent reshard concurrent with calling stop Signed-off-by: Vasily <v.sliouniaev@gmail.com>	2019-04-16 11:25:19 +01:00
Tom Wilkie	807fd33ecc	Review feedback. - Update read path to use labels.Labels. - Fix the tests. - Remove pack. - Remove unused function. - Fix race in tests. Signed-off-by: Tom Wilkie <tom.wilkie@gmail.com>	2019-03-18 20:31:12 +00:00
Callum Styan	1a7923dde3	Add ref counting to string interning so we can remove a string when there are no longer any refs. Add tests for interning. Co-authored-by: Tom Wilkie <tom.wilkie@gmail.com> Signed-off-by: Callum Styan <callumstyan@gmail.com>	2019-03-18 20:31:12 +00:00
Tom Wilkie	c7b3535997	Use pkg/relabelling in remote write. - Unmarshall external_labels config as labels.Labels, add tests. - Convert some more uses of model.LabelSet to labels.Labels. - Remove old relabel pkg (fixes #3647). - Validate external label names. Signed-off-by: Tom Wilkie <tom.wilkie@gmail.com>	2019-03-18 20:31:12 +00:00
Tom Wilkie	2fa93595d6	More WAL remote_write tweaks. (#5300 ) * Consistently pre-lookup the metrics for a given queue in queue manager. * Don't open the WAL (for writing) in the remote_write code. * Add some more logging. Signed-off-by: Tom Wilkie <tom.wilkie@gmail.com>	2019-03-05 12:21:11 +00:00
Tariq Ibrahim	ab8e9b7423	fix typo in queue_manager.go comment (#5294 ) Signed-off-by: tariqibrahim <tariq181290@gmail.com>	2019-03-03 11:35:29 +00:00
Tom Wilkie	67da8e7b46	Refactor and fix queue resharding (#5286 ) - Remove prometheus_remote_queue_last_send_timestamp_seconds metric. It's not particularly useful, we have highest_timestamp_seconds. - Factor out maxGauge, a gauge that only increases. - Change sharding calculations to use max samples in timestamp - max samples out timestamp (not rates). - Also include the ratio of samples dropped to correctly predict number of pending samples. Signed-off-by: Tom Wilkie <tom.wilkie@gmail.com>	2019-03-01 11:04:26 -08:00
Callum Styan	b8106dd459	Review feedback: - Add a dropped samples EWMA and use it in calculating desired shards. - Update metric names and log messages. - Limit number of entries in the dedupe logging middleware to prevent potential OOM. Signed-off-by: Callum Styan <callumstyan@gmail.com> Signed-off-by: Tom Wilkie <tom.wilkie@gmail.com>	2019-02-28 08:38:39 -08:00
Tom Wilkie	f795942572	Decrement pending sample when queue exits. Signed-off-by: Tom Wilkie <tom.wilkie@gmail.com>	2019-02-28 08:38:39 -08:00
Tom Wilkie	efbd9559f4	Deal with corruptions in the WAL: - If we're replaying the WAL to get series records, skip that segment when we hit corruptions. - If we're tailing the WAL for samples, fail the watcher. - When the watcher fails, restart from the latest checkpoint - and only send new samples by updating startTime. - Tidy up log lines and error handling, don't return so many errors on quitting. - Expect EOF when processing checkpoints. Signed-off-by: Tom Wilkie <tom.wilkie@gmail.com>	2019-02-28 08:38:39 -08:00
Tom Wilkie	d6f911b511	Factor out logging ratelimit & dedupe middleware. Signed-off-by: Tom Wilkie <tom.wilkie@gmail.com>	2019-02-28 08:38:39 -08:00


112 Commits
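
Two of the commits listed above (1a7923dde3 and 3639d51eb6) describe a string interner that counts references, removes a string once no references remain, and does not panic when asked to release a string it does not hold. The following is a minimal sketch of that idea, not the code from those commits: the names `pool`, `entry`, `intern`, `release`, and `size` are hypothetical, and returning a boolean from `release` is an assumption made here so the no-op case is observable.

```go
package main

import "sync"

// entry pairs an interned string with its outstanding reference count.
// (Hypothetical type, for illustration only.)
type entry struct {
	s    string
	refs int
}

// pool is a hypothetical reference-counted string interner.
type pool struct {
	mtx     sync.Mutex
	entries map[string]*entry
}

// newPool returns an empty interner.
func newPool() *pool {
	return &pool{entries: map[string]*entry{}}
}

// intern returns the pooled copy of s and increments its reference count,
// adding s to the pool on first use.
func (p *pool) intern(s string) string {
	p.mtx.Lock()
	defer p.mtx.Unlock()
	if e, ok := p.entries[s]; ok {
		e.refs++
		return e.s
	}
	p.entries[s] = &entry{s: s, refs: 1}
	return s
}

// release drops one reference to s and deletes the entry when the count
// reaches zero. Releasing a string that is not pooled is a no-op rather
// than a panic; the return value reports whether s was found.
func (p *pool) release(s string) bool {
	p.mtx.Lock()
	defer p.mtx.Unlock()
	e, ok := p.entries[s]
	if !ok {
		return false
	}
	e.refs--
	if e.refs == 0 {
		delete(p.entries, s)
	}
	return true
}

// size reports how many distinct strings are currently pooled.
func (p *pool) size() int {
	p.mtx.Lock()
	defer p.mtx.Unlock()
	return len(p.entries)
}
```

In this sketch, interning the same label value twice keeps a single pooled entry, and the entry disappears only after the second `release`. A caller that races with a checkpoint read and releases an unknown string simply gets `false` back.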