Commit Graph

26 Commits

Author SHA1 Message Date
Ganesh Vernekar 5ecef3542d
Cleanup after merging tsdb into prometheus
Signed-off-by: Ganesh Vernekar <cs15btech11018@iith.ac.in>
2019-08-13 14:04:14 +05:30
AllenZMC 758c71b980 fix word `encourter` to `encounter`
Signed-off-by: czm <zhongming.chang@daocloud.io>
2019-07-29 22:16:23 +08:00
Devin Trejo d77f2aa29c Only check last directory when discovering checkpoint number (#5756)
* Only check last directory when discovering checkpoint number

Signed-off-by: Devin Trejo <dtrejo@palantir.com>

* Comments for checkpointNum

Signed-off-by: Devin Trejo <dtrejo@palantir.com>
2019-07-15 17:53:58 +01:00
Yao Zengzeng 3cde8a9941 pass error up if WALWathcer.segments() return err (#5741)
Signed-off-by: YaoZengzeng <yaozengzeng@zju.edu.cn>
2019-07-15 17:52:03 +01:00
Chris Marchbanks 475ca2ecd0
Update to tsdb 0.9.1
Signed-off-by: Chris Marchbanks <csmarchbanks@gmail.com>
2019-07-03 07:51:11 -06:00
Chris Marchbanks 06bdaf076f Remote Write Allocs Improvements (#5614)
* Add benchmark for sample delivery
* Simplify StoreSeries to have only one loop
* Reduce allocations for pending samples in runShard
* Only allocate one send slice per segment
* Cache a buffer in each shard for snappy to use
* Remove queue manager seriesMtx

  It is not possible for any of the places protected by the seriesMtx to
  be called concurrently so it is safe to remove. By removing the mutex we
  can simplify the Append code to one loop.

Signed-off-by: Chris Marchbanks <csmarchbanks@gmail.com>
2019-06-27 19:48:21 +01:00
Callum Styan e87449b59d Remote Write: Queue Manager specific metrics shouldn't exist if the queue no longer exists (#5445)
* Unregister remote write queue manager specific metrics when stopping the
queue manager.

* Use DeleteLabelValues instead of Unregister to remove queue and watcher
related metrics when we stop them. Create those metrics in the structs
start functions rather than in their constructors because of the
ordering of creation, start, and stop in remote storage ApplyConfig.

* Add setMetrics function to WAL watcher so we can set
the watchers metrics in it's Start function, but not
have to call Start in some tests (causes data race).

Signed-off-by: Callum Styan <callumstyan@gmail.com>
2019-04-23 09:49:17 +01:00
Callum Styan c2b88992a3 Remote Write: fix checkpoint reading (#5429)
* Fix ReadCheckpoint to ensure that it actually reads all the contents of
each segment in a checkpoint dir, or returns an error.

Signed-off-by: Callum Styan <callumstyan@gmail.com>
2019-04-09 10:52:44 +01:00
Tariq Ibrahim 8fdfa8abea refine error handling in prometheus (#5388)
i) Uses the more idiomatic Wrap and Wrapf methods for creating nested errors.
ii) Fixes some incorrect usages of fmt.Errorf where the error messages don't have any formatting directives.
iii) Does away with the use of fmt package for errors in favour of pkg/errors

Signed-off-by: tariqibrahim <tariq181290@gmail.com>
2019-03-26 00:01:12 +01:00
Tom Wilkie 2fa93595d6
More WAL remote_write tweaks. (#5300)
* Consistently pre-lookup the metrics for a given queue in queue manager.
* Don't open the WAL (for writing) in the remote_write code.
* Add some more logging.

Signed-off-by: Tom Wilkie <tom.wilkie@gmail.com>
2019-03-05 12:21:11 +00:00
Tariq Ibrahim 1adb91738d fix typo in recordType method of wal_watcher.go (#5297)
Signed-off-by: tariqibrahim <tariq181290@gmail.com>
2019-03-04 17:33:35 +01:00
Callum Styan b8106dd459 Review feedback:
- Add a dropped samples EWMA and use it in calculating desired shards.
- Update metric names and a log messages.
- Limit number of entries in the dedupe logging middleware to prevent potential OOM.

Signed-off-by: Callum Styan <callumstyan@gmail.com>
Signed-off-by: Tom Wilkie <tom.wilkie@gmail.com>
2019-02-28 08:38:39 -08:00
Callum Styan 512f549064 Refactor: inline decodeRecord in readSegment and don't bother decoding samples records if we're not tailing the segment, add a benchmark test and fix some other tests
Co-authored-by: Tom Wilkie <tom.wilkie@gmail.com>
Signed-off-by: Callum Styan <callumstyan@gmail.com>
2019-02-28 08:38:39 -08:00
Tom Wilkie ee7efa93fe Fix some tests.
Signed-off-by: Tom Wilkie <tom.wilkie@gmail.com>
2019-02-28 08:38:39 -08:00
Callum Styan b69bdfb4d1 Store the checkpoint we read last, so that we don't keep reading the same checkpoint on each tick.
Signed-off-by: Callum Styan <callumstyan@gmail.com>
2019-02-28 08:38:39 -08:00
Tom Wilkie efbd9559f4 Deal with corruptions in the WAL:
- If we're replaying the WAL to get series records, skip that segment when we hit corruptions.
- If we're tailing the WAL for samples, fail the watcher.
- When the watcher fails, restart from the latest checkpoint - and only send new samples by updating startTime.
- Tidy up log lines and error handling, don't return so many errors on quiting.
- Expect EOF when processing checkpoints.

Signed-off-by: Tom Wilkie <tom.wilkie@gmail.com>
2019-02-28 08:38:39 -08:00
Tom Wilkie adf5307470 Update wal LiveReader to ensure EOF is correctly propagated.
Signed-off-by: Tom Wilkie <tom.wilkie@gmail.com>
2019-02-28 08:38:39 -08:00
Callum Styan d6258aea8f Fix up remote write tests:
- Tests that created a QueueManager were leaving behind files at the end of tests.
- WAL replaying (readToEnd)tests seem to require extra time to finish now.
- Some fixes to make staticcheck happy

Signed-off-by: Callum Styan <callumstyan@gmail.com>
2019-02-28 08:38:39 -08:00
Tom Wilkie 184f06a981 Combine the record decoding metrics into one; break out garbage collection into a separate function.
Signed-off-by: Tom Wilkie <tom.wilkie@gmail.com>
2019-02-28 08:38:39 -08:00
Tom Wilkie 859cda27ff Remove some 'global' state, moving segment numbers to parameters.
Signed-off-by: Tom Wilkie <tom.wilkie@gmail.com>
2019-02-28 08:38:39 -08:00
Tom Wilkie bdc6b764b0 If reading the WAL fails, try again. Also, read from the segment containing the index for the last checkpoint, not the first segment.
Signed-off-by: Tom Wilkie <tom.wilkie@gmail.com>
2019-02-28 08:38:39 -08:00
Tom Wilkie a5c20642b3 Refactor WAL watcher to remove some duplication.
Signed-off-by: Tom Wilkie <tom.wilkie@gmail.com>
2019-02-28 08:38:39 -08:00
Simon Pasquier b41d6d54f2
storage/remote: increase timeouts for Travis CI (#5224)
* storage/remote: adapt tests for Travis CI

Signed-off-by: Simon Pasquier <spasquie@redhat.com>

* Check filesystems on Travis environment

Signed-off-by: Simon Pasquier <spasquie@redhat.com>

* Run remote/storage tests on CircleCI for troubleshooting

Signed-off-by: Simon Pasquier <spasquie@redhat.com>

* Try using tmpfs partition

Signed-off-by: Simon Pasquier <spasquie@redhat.com>

* Revert "Try using tmpfs partition"

This reverts commit 85a30deb72.

Signed-off-by: Simon Pasquier <spasquie@redhat.com>

* Don't store labels in writeToMock

Signed-off-by: Simon Pasquier <spasquie@redhat.com>

* Fix data race

Signed-off-by: Simon Pasquier <spasquie@redhat.com>

* Bump retries to 100 meaning that the total timeout is 10s

Signed-off-by: Simon Pasquier <spasquie@redhat.com>

* clean up .travis.yml

Signed-off-by: Simon Pasquier <spasquie@redhat.com>

* code fixup

Signed-off-by: Simon Pasquier <spasquie@redhat.com>

* Remove unneeded empty line

Signed-off-by: Simon Pasquier <spasquie@redhat.com>
2019-02-15 16:47:41 +01:00
Callum Styan 37e35f9e0c Various improvements to WAL based remote write.
- Use the queue name in WAL watcher logging.
- Don't return from watch if the reader error was EOF.
- Fix sample timestamp check logic regarding what samples we send.
- Refactor so we don't need readToEnd/readSeriesRecords
- Fix wal_watcher tests since readToEnd no longer exists

Signed-off-by: Callum Styan <callumstyan@gmail.com>
2019-02-12 11:39:13 +00:00
Tom Wilkie b93bafeee1 Various fixes to locking & shutdown for WAL-based remote write.
- Remove datarace in the exported highest scrape timestamp.
- Backoff on enqueue should be per-sample - reset the result for each sample.
- Remove diffKeys, unused ctx and cancelfunc in WALWatcher, 'name' from writeTo interface, and pass it to constructor.
- Reorder functions in WALWatcher depth-first according to call graph.
- Fix vendor/modules.txt.
- Split out the various timer periods into consts at the top of the file.
- Move w.currentSegmentMetric.Set close to where we set the currentSegment.
- Combine r.Next() and isClosed(w.quit) into a single loop.
- Unnest some ifs in WALWatcher.watch, propagate erros in decodeRecord, add some new lines to make it easier to read.
- Reorganise checkpoint handling to reduce nesting and make it easier to follow.

Signed-off-by: Tom Wilkie <tom.wilkie@gmail.com>
2019-02-12 11:39:13 +00:00
Callum Styan 6f69e31398 Tail the TSDB WAL for remote_write
This change switches the remote_write API to use the TSDB WAL.  This should reduce memory usage and prevent sample loss when the remote end point is down.

We use the new LiveReader from TSDB to tail WAL segments.  Logic for finding the tracking segment is included in this PR.  The WAL is tailed once for each remote_write endpoint specified. Reading from the segment is based on a ticker rather than relying on fsnotify write events, which were found to be complicated and unreliable in early prototypes.

Enqueuing a sample for sending via remote_write can now block, to provide back pressure.  Queues are still required to acheive parallelism and batching.  We have updated the queue config based on new defaults for queue capacity and pending samples values - much smaller values are now possible.  The remote_write resharding code has been updated to prevent deadlocks, and extra tests have been added for these cases.

As part of this change, we attempt to guarantee that samples are not lost; however this initial version doesn't guarantee this across Prometheus restarts or non-retryable errors from the remote end (eg 400s).

This changes also includes the following optimisations:
- only marshal the proto request once, not once per retry
- maintain a single copy of the labels for given series to reduce GC pressure

Other minor tweaks:
- only reshard if we've also successfully sent recently
- add pending samples, latest sent timestamp, WAL events processed metrics

Co-authored-by: Chris Marchbanks <csmarchbanks.com> (initial prototype)
Co-authored-by: Tom Wilkie <tom.wilkie@gmail.com> (sharding changes)
Signed-off-by: Callum Styan <callumstyan@gmail.com>
2019-02-12 11:39:13 +00:00