Currently BlueStore keeps its allocation info inside RocksDB.
BlueStore commits all allocation information (alloc/release) to RocksDB (column-family B) before the client write is performed, which delays the write path and adds significant load on CPU, memory and disk.
Committing all state into RocksDB allows Ceph to survive failures without losing the allocation state.
The new code skips the RocksDB updates at allocation time and instead performs a full destage of the allocator object, with all the OSD allocation state, in a single step during umount().
This results in a 25% increase in IOPS and reduced latency in small random-write workloads, but exposes the system to losing allocation info in failure cases where umount() is not called.
For those failure cases we added code that performs a full allocation-map rebuild from the information stored inside the ONodes.
After a graceful shutdown there is no need for recovery: we simply read the allocation-map from the flat file where it was stored during umount(). In fact this mode is faster and shaves a few seconds off boot time, since reading a flat file is faster than iterating over RocksDB.
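To illustrate the rebuild idea only, here is a minimal Python sketch. It is not the BlueStore code: the function name and the flat (offset, length) extent tuples are assumptions made for the example. Given the extents found to be in use while scanning on-disk metadata, everything else on the device is treated as free.

```python
def rebuild_free_extents(device_size, used_extents):
    """Return free (offset, length) ranges, given the used (offset, length) ranges.

    Hypothetical helper for illustration; extents may be unsorted or overlapping.
    """
    free = []
    cursor = 0  # first byte not yet known to be used or free
    for offset, length in sorted(used_extents):
        if offset > cursor:
            # Gap between the previous used range and this one is free.
            free.append((cursor, offset - cursor))
        cursor = max(cursor, offset + length)
    if cursor < device_size:
        # Tail of the device after the last used extent.
        free.append((cursor, device_size - cursor))
    return free
```

For example, `rebuild_free_extents(100, [(10, 20), (50, 10)])` yields three free ranges: the first 10 bytes, the 20 bytes starting at offset 30, and the 40 bytes starting at offset 60.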
Open Issues:
There is a bug in the src/stop.sh script: it kills Ceph without invoking umount(), which means anyone using it will always trigger the recovery path.
Adam Kupczyk is fixing this issue in a separate PR.
A simple workaround is to add a call to 'killall -15 ceph-osd' before calling src/stop.sh.
Fast-Shutdown and Ceph Suicide (done when the system underperforms) stop the system without a proper drain and a call to umount.
This will trigger a full recovery, which can be long (3 minutes in my testing, but your mileage may vary).
We plan on adding a follow-up PR doing the following in Fast-Shutdown and Ceph Suicide:
Block the OSD queues from accepting any new requests
Delete all items in the queue which we haven't started yet
Drain all in-flight tasks
Call umount() (and destage the allocation-map)
If the drain didn't complete within a predefined time-limit (say 3 minutes) -> kill the OSD
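The planned sequence above can be sketched roughly as follows. This is a toy Python model, not the OSD code: ToyOSD, poll() and fast_shutdown() are hypothetical names, and poll() merely stands in for real I/O completion.

```python
import time
from collections import deque


class ToyOSD:
    """Hypothetical stand-in for an OSD; not Ceph's real classes."""

    def __init__(self):
        self.accepting = True
        self.pending = deque()   # queued requests that have not started
        self.in_flight = 0       # requests already being processed
        self.destaged = False    # set once umount() has run

    def submit(self, request):
        if not self.accepting:
            return False
        self.pending.append(request)
        return True

    def poll(self):
        # Placeholder for real I/O completion: finish one in-flight task.
        if self.in_flight > 0:
            self.in_flight -= 1

    def umount(self):
        # In the real system this is where the allocation-map is destaged.
        self.destaged = True


def fast_shutdown(osd, time_limit_s=180.0, poll_interval_s=0.0):
    osd.accepting = False                 # block new requests
    osd.pending.clear()                   # drop items that never started
    deadline = time.monotonic() + time_limit_s
    while osd.in_flight > 0:              # drain in-flight tasks
        if time.monotonic() >= deadline:
            return "killed"               # time-limit hit: no umount, recovery on next boot
        osd.poll()
        time.sleep(poll_interval_s)
    osd.umount()                          # clean path: destage the allocation-map
    return "clean"
```

In this model a "killed" result leaves destaged unset, which corresponds to the case where the next boot has to take the full recovery path.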
Signed-off-by: Gabriel Benhanokh <gbenhano@redhat.com>
create allocator from on-disk onodes and BlueFS inodes
change allocator + add stat counters + report illegal physical-extents
compare allocator after rebuild from ONodes
prevent collection from being open twice
removed FSCK repo check for null-fm
Bug-Fix: don't add BlueFS allocation to shared allocator
add configuration option to commit to No-Column-B
Only invalidate allocation file after opening rocksdb in read-write mode
fix tests not to expect failure in cases inapplicable to null-allocator
accept non-existing allocation file and don't fail the invalidation as it could happen legally
don't commit to null-fm when db is opened in repair-mode
add a reverse mechanism from null_fm to real_fm (using RocksDB)
Using Ceph encode/decode, adding more info to header/trailer, add crc protection
Code cleanup
some changes requested by Adam (cleanup and style changes)
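As a rough illustration of the header/trailer and CRC protection item in the list above, here is a Python sketch of a self-checking extent file. The layout, the magic value and the function names are assumptions for the example; the actual change uses Ceph encode/decode and its own format, and zlib.crc32 is used here only as a stand-in checksum.

```python
import struct
import zlib

MAGIC = b"ALOC"  # hypothetical magic value, not the real on-disk format


def encode_extents(extents):
    """Serialize (offset, length) pairs as header + payload + trailer."""
    payload = b"".join(struct.pack("<QQ", off, length) for off, length in extents)
    header = MAGIC + struct.pack("<I", len(extents))
    crc = zlib.crc32(header + payload)
    # Trailer repeats the extent count and carries the checksum.
    trailer = struct.pack("<II", len(extents), crc)
    return header + payload + trailer


def decode_extents(blob):
    """Parse a blob from encode_extents(); raise ValueError if it is damaged."""
    if len(blob) < 16 or blob[:4] != MAGIC:
        raise ValueError("bad header")
    (count,) = struct.unpack_from("<I", blob, 4)
    body_len = 8 + 16 * count          # header plus payload
    if len(blob) != body_len + 8:      # plus the 8-byte trailer
        raise ValueError("unexpected file length")
    trailer_count, crc = struct.unpack_from("<II", blob, body_len)
    if trailer_count != count or zlib.crc32(blob[:body_len]) != crc:
        raise ValueError("trailer mismatch or CRC failure")
    return [struct.unpack_from("<QQ", blob, 8 + 16 * i) for i in range(count)]
```

A reader that gets a ValueError here would treat the file as unusable and fall back to the rebuild path rather than trust a partial allocation-map.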
Signed-off-by: Gabriel Benhanokh <gbenhano@redhat.com>
Added option -i, which allows the tool to operate as a specific OSD.
It reads configuration options from the monitor or ceph.conf.
In addition, providing a configuration option that is not accepted by the OSD or ceph-bluestore-tool is now an error.
Signed-off-by: Adam Kupczyk <akupczyk@redhat.com>
Adds a paragraph to the ceph-bluestore-tool documentation
describing how to use the *special* options --bluefs_replay_recovery
and --bluefs_replay_recovery_disable_compact to recover a large
bluefs log.
Fixes: https://tracker.ceph.com/issues/46552
Signed-off-by: Adam Kupczyk <akupczyk@redhat.com>
Added the ability to control batch size and iterator refresh time for the resharding process.
Replaced getenv() with the new controls in the resharding unittests.
Signed-off-by: Adam Kupczyk <akupczyk@redhat.com>
Older versions of Sphinx, such as the one in CentOS 7, do not render "..
option::" lines correctly if the option contains a hyphen but does not start
with a hyphen. And ceph-bluestore-tool appears to be the only Ceph manpage
affected by this bug.
Fixes: http://tracker.ceph.com/issues/24800
Signed-off-by: Nathan Cutler <ncutler@suse.com>