This change switches the remote_write API to use the TSDB WAL. This should reduce memory usage and prevent sample loss when the remote end point is down.
We use the new LiveReader from TSDB to tail WAL segments. Logic for finding the tracking segment is included in this PR. The WAL is tailed once for each remote_write endpoint specified. Reading from the segment is based on a ticker rather than relying on fsnotify write events, which were found to be complicated and unreliable in early prototypes.
Enqueuing a sample for sending via remote_write can now block, to provide back pressure. Queues are still required to acheive parallelism and batching. We have updated the queue config based on new defaults for queue capacity and pending samples values - much smaller values are now possible. The remote_write resharding code has been updated to prevent deadlocks, and extra tests have been added for these cases.
As part of this change, we attempt to guarantee that samples are not lost; however this initial version doesn't guarantee this across Prometheus restarts or non-retryable errors from the remote end (eg 400s).
This changes also includes the following optimisations:
- only marshal the proto request once, not once per retry
- maintain a single copy of the labels for given series to reduce GC pressure
Other minor tweaks:
- only reshard if we've also successfully sent recently
- add pending samples, latest sent timestamp, WAL events processed metrics
Co-authored-by: Chris Marchbanks <csmarchbanks.com> (initial prototype)
Co-authored-by: Tom Wilkie <tom.wilkie@gmail.com> (sharding changes)
Signed-off-by: Callum Styan <callumstyan@gmail.com>
This no longer waits for all of the scrape reload to complete
before getting a list of AMs again.
Signed-off-by: Brian Brazil <brian.brazil@robustperception.io>
* Add flag for size based retention
Signed-off-by: Goutham Veeramachaneni <gouthamve@gmail.com>
* Deprecate the old retention flag for a new one.
Signed-off-by: Goutham Veeramachaneni <gouthamve@gmail.com>
* Add ability to take a suffix for size flag
Signed-off-by: Goutham Veeramachaneni <gouthamve@gmail.com>
* Address feedback
Signed-off-by: Goutham Veeramachaneni <gouthamve@gmail.com>
* *: use latest release of staticcheck
It also fixes a couple of things in the code flagged by the additional
checks.
Signed-off-by: Simon Pasquier <spasquie@redhat.com>
* Use official release of staticcheck
Also run 'go list' before staticcheck to avoid failures when downloading packages.
Signed-off-by: Simon Pasquier <spasquie@redhat.com>
Add WALSegmentSize as an option, and the corresponding flag "storage.tsdb.wal-segment-size" to tune the max size of wal segment files.
The addressed base problem is to reduce the disk space used by wal segment files : on a raspberry pi, for instance, we often want to reduce write load of the sd card, then, the wal directory is mounted on a memory (space limited) partition.
the default value of the segment max file size, pushed the size of directory to 128 MB for each segment , which is too much ram consumption on a rasp.
the initial discussion is at https://github.com/prometheus/tsdb/pull/450
* update promlog to latest version
Signed-off-by: Alex Yu <yu.alex96@gmail.com>
* Update api tests, fix main setup
Signed-off-by: Alex Yu <yu.alex96@gmail.com>
* tidy go.sum
Signed-off-by: Alex Yu <yu.alex96@gmail.com>
* revendor prometheus/common
Signed-off-by: Alex Yu <yu.alex96@gmail.com>
* only initialize config; use kingpin for remote_storage_adapter
Signed-off-by: Alex Yu <yu.alex96@gmail.com>
* actually parse the flags
Signed-off-by: Alex Yu <yu.alex96@gmail.com>
* clean up imports
Signed-off-by: Alex Yu <yu.alex96@gmail.com>
* web: added ability to set page title through flag.
Signed-off-by: Andrew Chiu <andrew.chiu2@baesystems.com>
* Reformatted variable names and Flag description for readability.
Signed-off-by: Andrew Chiu <andrew.chiu2@baesystems.com>
* assets_vfsdata.go
Signed-off-by: Andrew Chiu <andrew.chiu2@baesystems.com>
* Flag name changed from web.ui-title to web.page-title
Signed-off-by: Andrew Chiu <andrew.chiu2@baesystems.com>
* make assets
Signed-off-by: Andrew Chiu <andrew.chiu2@baesystems.com>
* Support writing output as json
Oftentimes I'll want to execute something based on
the output from promtool, and supporting json
makes it easy to pull out values with a supporting
tool such as jq.
Signed-off-by: stuart nelson <stuartnelson3@gmail.com>
According to the GoDoc for os.Signal [0]:
> Package signal will not block sending to c: the caller must ensure that
> c has sufficient buffer space to keep up with the expected signal rate.
> For a channel used for notification of just one signal value, a buffer
> of size 1 is sufficient.
[0] https://golang.org/pkg/os/signal/#Notify
Signed-off-by: Lucas Serven <lserven@gmail.com>
* Limit the number of samples remote read can return.
- Return 413 entity too large.
- Limit can be set be a flag. Allow 0 to mean no limit.
- Include limit in error message.
- Set default limit to 50M (* 16 bytes = 800MB).
Signed-off-by: Tom Wilkie <tom.wilkie@gmail.com>
To make local debuging with `go run` easyer moved all files into a
dedicate package `runtime`.
This allows running prometheus just by using `go run main.go` instead of
passing mani files like `go run main.go limits_default.go ...`
Signed-off-by: Krasi Georgiev <kgeorgie@redhat.com>
Sorry, I used GitHub's web-based merge-conflict-resolution editor on
https://github.com/prometheus/prometheus/pull/4308 and it didn't show me
test errors afterwards, but maybe they didn't run again or I should have
waited or something.
Signed-off-by: Julius Volz <julius.volz@gmail.com>
`debug all` - all information
`debug metrics` - metrics information
`debug pprof` - profiling information
the final result is compressed in a `tar.gz` file
Signed-off-by: chyeh <chyeh.taiwan@gmail.com>
Start rule manager only after tsdb and config is loaded.
Stop rule manager before tsdb to avoid writing to closed storage.
Wait for any in-progress reloads to complete before shutting
down rule manager, so that rule manager doesn't get updated after
being shut down.
Remove incorrect comment around shutting down query enginge.
Log when config reload is completed.
Fixes#4133Fixes#4262
Signed-off-by: Brian Brazil <brian.brazil@robustperception.io>