prometheus

Author	SHA1	Message	Date
Chris Marchbanks	b40cc43958	Provide option to compress WAL records (#609) In running Prometheus instances, compressing the records was shown to reduce disk usage by half while incurring a negligible CPU cost. Signed-off-by: Chris Marchbanks <csmarchbanks@gmail.com>	2019-06-19 16:46:24 +03:00
beorn7	90a7612df3	Make objectives of Summaries explicit With the next release of client_golang, Summaries will not have objectives by default. As it turns out, for prometheus_tsdb_head_gc_duration_seconds and prometheus_tsdb_wal_truncate_duration_seconds, the objective-less default makes more sense than the current default. To make sure we do the right thing before and after the upcoming release of client_golang, I have set the objectives explicitly wherever that was not the case so far: - prometheus_tsdb_head_gc_duration_seconds and prometheus_tsdb_wal_truncate_duration_seconds now explicitly have no objectives. - prometheus_tsdb_wal_fsync_duration_seconds now explicitly uses the previous default objectives. Signed-off-by: beorn7 <beorn@grafana.com>	2019-06-14 14:17:24 +02:00
Brian Brazil	be4edbe174	Start a new WAL segment on head truncation. (#605) This reduces disk space usage, which in small setups was previously a minimum of three 128MB files. It may also help debug WAL data issues by making things a bit more deterministic. Signed-off-by: Brian Brazil <brian.brazil@robustperception.io>	2019-06-07 11:35:02 +01:00
Callum Styan	562e93e8e6	Always create a new clean segment when starting the WAL. (#608) * Always create a new clean segment when starting the WAL. * Ensure we flush the last page after repairing and before recreating the new segment in Repair. Signed-off-by: Callum Styan <callumstyan@gmail.com>	2019-05-24 19:33:27 +01:00
Callum Styan	bce663e1d9	Export the current segment index as a metric. (#601) * Export the current segment index as a metric. Signed-off-by: Callum Styan <callumstyan@gmail.com>	2019-05-17 11:47:42 +03:00
Krasi Georgiev	96a87845cc	fix wal panic when page flush fails. (#582) * fix wal panic when page flush fails. New records should be added to the page only when the last flush succeeded. Otherwise the page would be full and would panic when trying to add a new record. Signed-off-by: Krasi Georgiev <kgeorgie@redhat.com>	2019-05-16 16:40:43 +03:00
Krasi Georgiev	5512826f13	make Close methods for the querier safe to call more than once. (#581) * make Close methods for the querier safe to call more than once. Signed-off-by: Krasi Georgiev <kgeorgie@redhat.com>	2019-04-30 10:17:07 +03:00
Krasi Georgiev	8eeb70fee1	remove Fsync workaround for macos. (#574) Since Go 1.12, no special handling is required for file.Sync(). @pborzenkov, thanks for the pointer. Signed-off-by: Krasi Georgiev <kgeorgie@redhat.com>	2019-04-03 11:16:54 +03:00
Goutham Veeramachaneni	10d395259b	Avoid creation of 0-sized segments. (#527) If the corrupt segment is full, then we set donePages on open: `c59ed492b2/wal/wal.go (L235-L243)`. Then, when we try to repair, we set the segment to be a new segment but we don't update donePages: `c59ed492b2/wal/wal.go (L334)`. When we try to log to this segment, because donePages is full, we never log anything to it and create a new one instead: `c59ed492b2/wal/wal.go (L486)`. This does not cause issues because we simply concatenate the segments on read, thereby transparently skipping this `0b` segment.	2019-02-25 12:10:27 +02:00
Tom Wilkie	77d5a7d47a	LiveReader can get into an infinite loop on corrupt WALs. (#524) Make the WAL live tailer return EOF when there is a half-written record at the end of the file. Previously, this would cause an infinite loop as we ignored EOFs when filling the buffer. We now differentiate between EOFs that read >0 bytes and EOFs that didn't. Add some more unit tests for tailing a corrupt WAL, and unify interfaces Reader and LiveReader for the purposes of testing. Signed-off-by: Tom Wilkie <tom.wilkie@gmail.com>	2019-02-19 14:33:57 +00:00
Tom Wilkie	bc3b0bd429	Test to corrupt segments mid-WAL, repair and check we can read the correct number of records. (#528) Test to corrupt segments mid-WAL, repair and check we can read the correct number of records. Make segmentBufReader pad short segments with zeros, and only advance the current segment index after fully reading a segment.	2019-02-18 19:05:07 +00:00
Callum Styan	89ee5aaed4	clarify which segments are deleted when we find a corrupted segment (#522) Signed-off-by: Callum Styan <callumstyan@gmail.com>	2019-02-14 05:44:19 +02:00
Callum Styan	3929359302	add live reader for WAL (#481) * add live reader for WAL Signed-off-by: Callum Styan <callumstyan@gmail.com>	2019-01-16 10:09:08 -08:00
Krasi Georgiev	8d991bdc1e	Delete temp checkpoint folder on error. (#415)	2019-01-07 11:43:33 +03:00
glutamatt	22e3aeb107	Add WALSegmentSize as an option of tsdb creation (#450) Expose `WALSegmentSize` option to allow overriding the `DefaultOptions.WALSegmentSize`.	2018-12-18 21:56:51 +03:00
Krasi Georgiev	2962202ed3	fix windows tests (#469) Signed-off-by: Krasi Georgiev <kgeorgie@redhat.com>	2018-12-13 16:29:29 +03:00
Krasi Georgiev	48efdf8b81	refactor NewSegmentsRangeReader to take multi WAL ranges (#449) * refactor NewSegmentsRangeReader to take multi WAL ranges In case of an error when checkpointing the WAL, the error doesn't show the exact WAL index that is corrupted. This is because it uses MultiReader to read multiple WAL files. This refactoring allows NewSegmentsRangeReader to take more than a single WAL range, and it reads all of the ranges by iterating over each one. This changes the logs from `create checkpoint: read segments: corruption after 4841144384 bytes:...` to `create checkpoint: read segments: corruption in segment data/wal/00017351 at 123142208: ...` Signed-off-by: Krasi Georgiev <kgeorgie@redhat.com>	2018-11-30 16:46:16 +02:00
Krasi Georgiev	0493efb7c5	repair wal when the record cannot be decoded (#453) * repair wal when the record cannot be decoded Currently, repair is run only when the error happens in the reader. A corruption can also occur after the record is read, when it is decoded. This change wraps the decoding error as a CorruptionErr, since this error is expected to trigger a repair. Signed-off-by: Krasi Georgiev <kgeorgie@redhat.com>	2018-11-30 13:37:04 +02:00
Krasi Georgiev	24520727a4	return an error when the last wal segment record is torn. (#451) * return an error when the last wal segment record is torn. This ensures that a repair will be run when the last record in a segment is torn. Signed-off-by: Krasi Georgiev <kgeorgie@redhat.com>	2018-11-28 15:15:11 +02:00
Krasi Georgiev	3385571ddf	buffer-panic when reading a record after recPageTerm (#429) Signed-off-by: Krasi Georgiev <kgeorgie@redhat.com>	2018-11-14 18:43:33 +02:00
Krasi Georgiev	a9470dd8d5	few more comments to explain the WAL workflow (#430) More comments for the WAL package. Signed-off-by: Krasi Georgiev <kgeorgie@redhat.com>	2018-11-08 10:27:16 +02:00
Krasi Georgiev	d7492b9350	more descriptive var names and some more logging. (#405) * more descriptive checkpoint var names and some more logging. Signed-off-by: Krasi Georgiev <kgeorgie@redhat.com>	2018-10-11 18:23:52 +03:00
Ganesh Vernekar	61b000ee0e	Fix review comments Signed-off-by: Ganesh Vernekar <cs15btech11018@iith.ac.in>	2018-09-28 15:00:51 +05:30
Ganesh Vernekar	632dfb349e	Add new metrics. 1. 'prometheus_tsdb_wal_truncate_fail' for failed WAL truncation. 2. 'prometheus_tsdb_checkpoint_delete_fail' for failed old checkpoint delete. Signed-off-by: Ganesh Vernekar <cs15btech11018@iith.ac.in>	2018-09-25 18:50:57 +05:30
Goutham Veeramachaneni	9c8ca47399	Fix filehandling for windows (#392) * Fix filehandling for windows Signed-off-by: Goutham Veeramachaneni <gouthamve@gmail.com> * Fix more windows filehandling issues Windows: Close files before deleting Checkpoints. Signed-off-by: Goutham Veeramachaneni <gouthamve@gmail.com> Windows: Close writers in case of errors so they can be deleted Signed-off-by: Goutham Veeramachaneni <gouthamve@gmail.com> Windows: Close block so that it can be deleted. Signed-off-by: Goutham Veeramachaneni <gouthamve@gmail.com> Windows: Close file to delete it Signed-off-by: Goutham Veeramachaneni <gouthamve@gmail.com> Windows: Close dir so that it can be deleted. Signed-off-by: Goutham Veeramachaneni <gouthamve@gmail.com> Windows: close files so that they can be deleted. Signed-off-by: Goutham Veeramachaneni <gouthamve@gmail.com> * Review feedback Signed-off-by: Goutham Veeramachaneni <gouthamve@gmail.com>	2018-09-21 11:01:22 +05:30
Goutham Veeramachaneni	c7d0d10da4	Make sure WAL Repair can handle wrapped errors Signed-off-by: Goutham Veeramachaneni <gouthamve@gmail.com>	2018-09-19 12:12:25 +05:30
beorn7	3bc6c670fa	Revert "Remove `prometheus_` prefix from metrics" This reverts commit `98fe30438c`. After some discussion, it was concluded that we want the full `prometheus_tsdb_...` prefix hardcoded in the library. Signed-off-by: beorn7 <beorn@soundcloud.com>	2018-09-18 19:19:19 +02:00
beorn7	98fe30438c	Remove `prometheus_` prefix from metrics This can now be added by users of the library as needed with the new https://godoc.org/github.com/prometheus/client_golang/prometheus#WrapRegistererWithPrefix Signed-off-by: beorn7 <beorn@soundcloud.com>	2018-09-17 14:54:28 +02:00
Fabian Reinartz	22cae653d8	Fixes for 32bit archs Signed-off-by: Fabian Reinartz <freinartz@google.com>	2018-08-07 06:52:16 -04:00
Fabian Reinartz	f8ec0074e7	Add Replace function Signed-off-by: Fabian Reinartz <freinartz@google.com>	2018-08-02 17:51:49 -04:00
Fabian Reinartz	b81e0fbf2a	Address comments Signed-off-by: Fabian Reinartz <freinartz@google.com>	2018-07-20 02:26:12 -04:00
Fabian Reinartz	3e76f0163e	Address comments Signed-off-by: Fabian Reinartz <freinartz@google.com>	2018-07-19 07:25:30 -04:00
Fabian Reinartz	3f538817f8	move WAL lock Signed-off-by: Fabian Reinartz <freinartz@google.com>	2018-07-19 07:25:30 -04:00
Fabian Reinartz	d951140ab8	wal: avoid heap allocation in WAL reader The buffers we allocated were escaping to the heap, resulting in large memory usage spikes during startup and checkpointing in Prometheus. This attaches the buffer to the reader object to prevent this. Signed-off-by: Fabian Reinartz <freinartz@google.com>	2018-07-19 07:25:30 -04:00
Fabian Reinartz	449a2d0db7	wal: add segment type and repair procedure Allow repairing the WAL based on the error returned by a reader during a full scan over all records. Signed-off-by: Fabian Reinartz <freinartz@google.com>	2018-07-19 07:24:40 -04:00
Fabian Reinartz	8e1f97fad4	wal: add write ahead log package This adds a new WAL that's agnostic to the actual record contents. It's much simpler and should be more resilient than the existing one. Signed-off-by: Fabian Reinartz <freinartz@google.com>	2018-07-19 07:24:40 -04:00

36 Commits