* Track remote write queues via a map so we don't care about index.
Signed-off-by: Callum Styan <callumstyan@gmail.com>
* Support a job name for remote write/read so we can differentiate between
them using the name.
Signed-off-by: Callum Styan <callumstyan@gmail.com>
* Remote write/read has Name to not confuse the meaning of the field with
scrape job names.
Signed-off-by: Callum Styan <callumstyan@gmail.com>
* Split queue/client label into remote_name and url labels.
Signed-off-by: Callum Styan <callumstyan@gmail.com>
* Don't allow for duplicate remote write/read configs.
Signed-off-by: Callum Styan <callumstyan@gmail.com>
* Ensure we restart remote write queues if the hash of their config has
not changed, but the remote name has changed.
Signed-off-by: Callum Styan <callumstyan@gmail.com>
* Include name in remote read/write config hashes, simplify duplicates
check, update test accordingly.
Signed-off-by: Callum Styan <callumstyan@gmail.com>
It is possible that desired shards is always a bit higher than the
number of shards (less than 30%) and by exporting desired shards as the
raw number it will be easy to tell if a Prometheus is in that situation.
Signed-off-by: Chris Marchbanks <csmarchbanks@gmail.com>
* Refactor calculateDesiredShards + don't reshard if we're having issues
sending samples.
* Track lastSendTimestamp via an int64 with atomic add/load, add a test
for reshard calculation.
* Simplify conditional for skipping resharding, add samplesIn/Out to shard
testcase struct.
Signed-off-by: Callum Styan <callumstyan@gmail.com>
The WAL Watcher replays a checkpoint after it is created in order to
garbage collect series that no longer exist in the WAL. Currently the
garbage collection process is done serially with reading from the tip of
the WAL which can cause large delays in writing samples to remote
storage just after compaction occurs.
This also fixes a memory leak where dropped series are not cleaned up as
part of the SeriesReset process.
Signed-off-by: Chris Marchbanks <csmarchbanks@gmail.com>
* Replaced test validations with testutils on storage/remote/codec_test.go
Signed-off-by: George Felix <george.felix@ubeeqo.com>
* gofmt
Signed-off-by: George Felix <george.felix@ubeeqo.com>
* Removed shouldPass assertion
Signed-off-by: George Felix <gfelixc@gmail.com>
* Fixes to improve readability
Signed-off-by: George Felix <george.felix@ubeeqo.com>
* Fixes based on code review comments
Signed-off-by: George Felix <george.felix@ubeeqo.com>
* Added Fatal method and used it in buffer_test
Signed-off-by: Joe Elliott <number101010@gmail.com>
* Added period to meet contributing guidelines
Signed-off-by: Joe Elliott <number101010@gmail.com>
* Removed fatal testutil method. Refactored test cases to use testutil.Assert
Signed-off-by: Joe Elliott <number101010@gmail.com>
* Added if found condition for clarity
Signed-off-by: Joe Elliott <number101010@gmail.com>
* Show warnings in UI if query have returned some warnings
+ improve warning (error) text if query to remote was finished with error
* Add prefixes for remote_read errors
Signed-off-by: Stan Putrya <root.vagner@gmail.com>
* Update go.mod dependencies before release
Signed-off-by: Julius Volz <julius.volz@gmail.com>
* Add issue for showing query warnings in promtool
Signed-off-by: Julius Volz <julius.volz@gmail.com>
* Revert json-iterator back to 1.1.6
It produced errors when marshaling Point values with special float
values.
Signed-off-by: Julius Volz <julius.volz@gmail.com>
* Fix expected step values in promtool tests after client_golang update
Signed-off-by: Julius Volz <julius.volz@gmail.com>
* Update generated protobuf code after proto dep updates
Signed-off-by: Julius Volz <julius.volz@gmail.com>
The desired shards calculation now properly keeps track of the rate of
pending samples, and uses the previously unused integralAccumulator to
adjust for missing information in the desired shards calculation.
Also, configure more capacity for each shard. The default 10 capacity
causes shards to block on each other while
sending remote requests. Default to a 500 sample capacity and explain in
the documentation that having more capacity will help throughput.
Signed-off-by: Chris Marchbanks <csmarchbanks@gmail.com>
* Check for duplicate label names in remote read
Also add test to confirm that #5731 is fixed
* Use subtests in TestValidateLabelsAndMetricName
* Really check that expectedErr matches err
Signed-off-by: Vadym Martsynovskyy <vmartsynovskyy@gmail.com>
* Only check last directory when discovering checkpoint number
Signed-off-by: Devin Trejo <dtrejo@palantir.com>
* Comments for checkpointNum
Signed-off-by: Devin Trejo <dtrejo@palantir.com>
* Add benchmark for sample delivery
* Simplify StoreSeries to have only one loop
* Reduce allocations for pending samples in runShard
* Only allocate one send slice per segment
* Cache a buffer in each shard for snappy to use
* Remove queue manager seriesMtx
It is not possible for any of the places protected by the seriesMtx to
be called concurrently so it is safe to remove. By removing the mutex we
can simplify the Append code to one loop.
Signed-off-by: Chris Marchbanks <csmarchbanks@gmail.com>
This allows other processes to reuse just the remote write code without
having to use the remote read code as well. This will be used to create
a sidecar capable of sending remote write payloads.
Signed-off-by: Chris Marchbanks <csmarchbanks@gmail.com>
* Update remote write and remote read separately
* Add external labels to the remote write conf hash
* Add unit tests for remote storage lifecycle
Signed-off-by: Chris Marchbanks <csmarchbanks@gmail.com>
* Don't panic if we try to release a string that is not in the interner.
* Move seriesMtx locking in QueueManager's StoreSeries function.
This stops us from calling release for strings that aren't interned if
there's a race between reading a checkpoint and storing new series
labels, which could happen during checkpointing or reloading config.
Signed-off-by: Callum Styan <callumstyan@gmail.com>
* Unregister remote write queue manager specific metrics when stopping the
queue manager.
* Use DeleteLabelValues instead of Unregister to remove queue and watcher
related metrics when we stop them. Create those metrics in the structs
start functions rather than in their constructors because of the
ordering of creation, start, and stop in remote storage ApplyConfig.
* Add setMetrics function to WAL watcher so we can set
the watchers metrics in it's Start function, but not
have to call Start in some tests (causes data race).
Signed-off-by: Callum Styan <callumstyan@gmail.com>
From the documentation:
> The default HTTP client's Transport may not
> reuse HTTP/1.x "keep-alive" TCP connections if the Body is
> not read to completion and closed.
This effectively enable keep-alive for the fixed requests.
Signed-off-by: Romain Baugue <romain.baugue@elwinar.com>