:orphan:

======================================================
 ceph-bluestore-tool -- bluestore administrative tool
======================================================

.. program:: ceph-bluestore-tool

Synopsis
========

| **ceph-bluestore-tool** *command*
  [ --dev *device* ... ]
  [ -i *osd_id* ]
  [ --path *osd path* ]
  [ --out-dir *dir* ]
  [ --log-file | -l *filename* ]
  [ --deep ]
| **ceph-bluestore-tool** fsck|repair --path *osd path* [ --deep ]
| **ceph-bluestore-tool** qfsck --path *osd path*
| **ceph-bluestore-tool** allocmap --path *osd path*
| **ceph-bluestore-tool** restore_cfb --path *osd path*
| **ceph-bluestore-tool** show-label --dev *device* ...
| **ceph-bluestore-tool** prime-osd-dir --dev *device* --path *osd path*
| **ceph-bluestore-tool** bluefs-export --path *osd path* --out-dir *dir*
| **ceph-bluestore-tool** bluefs-bdev-new-wal --path *osd path* --dev-target *new-device*
| **ceph-bluestore-tool** bluefs-bdev-new-db --path *osd path* --dev-target *new-device*
| **ceph-bluestore-tool** bluefs-bdev-migrate --path *osd path* --dev-target *new-device* --devs-source *device1* [--devs-source *device2*]
| **ceph-bluestore-tool** free-dump|free-score --path *osd path* [ --allocator block/bluefs-wal/bluefs-db/bluefs-slow ]
| **ceph-bluestore-tool** reshard --path *osd path* --sharding *new sharding* [ --resharding-ctrl *control string* ]
| **ceph-bluestore-tool** show-sharding --path *osd path*

Description
===========

**ceph-bluestore-tool** is a utility to perform low-level administrative
operations on a BlueStore instance.

Commands
========

:command:`help`

   Show help.

:command:`fsck` [ --deep ]

   Run a consistency check on BlueStore metadata. If *--deep* is specified, also read all object data and verify checksums.

:command:`repair`

   Run a consistency check *and* repair any errors we can.

:command:`qfsck`

   Run a consistency check on BlueStore metadata, comparing allocator data (from the RocksDB B column family when it exists, otherwise from the allocation file) with the ONode state.

:command:`allocmap`

   Perform the same check as qfsck and then store a new allocation file. This command is disabled by default and requires a special build.

:command:`restore_cfb`

   Reverse the changes made by the new NCB code (either through a Ceph restart or by running the allocmap command) and restore the RocksDB B column family (allocator map).

:command:`bluefs-export`

   Export the contents of BlueFS (i.e., RocksDB files) to an output directory.

:command:`bluefs-bdev-sizes` --path *osd path*

   Print the device sizes, as understood by BlueFS, to stdout.

:command:`bluefs-bdev-expand` --path *osd path*

   Instruct BlueFS to check the size of its block devices and, if they have
   expanded, make use of the additional space. Please note that only the new
   files created by BlueFS will be allocated on the preferred block device if
   it has enough free space, and the existing files that have spilled over to
   the slow device will be gradually removed when RocksDB performs compaction.
   In other words, if there is any data spilled over to the slow device, it
   will be moved to the fast device over time.

:command:`bluefs-bdev-new-wal` --path *osd path* --dev-target *new-device*

   Adds WAL device to BlueFS, fails if WAL device already exists.

:command:`bluefs-bdev-new-db` --path *osd path* --dev-target *new-device*

   Adds DB device to BlueFS, fails if DB device already exists.

:command:`bluefs-bdev-migrate` --dev-target *new-device* --devs-source *device1* [--devs-source *device2*]

   Moves BlueFS data from the source device(s) to the target device. Source devices
   (except the main one) are removed on success. The target device can be either
   an already attached device or a new one. In the latter case it is added to the OSD,
   replacing one of the source devices. The following replacement rules apply
   (in order of precedence, stopping at the first match):

   - if the source list has a DB volume, the target device replaces it.
   - if the source list has a WAL volume, the target device replaces it.
   - if the source list has only the slow volume, the operation isn't permitted and requires explicit allocation via the new-db/new-wal command.

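The precedence of the replacement rules for a new target device can be sketched in shell. This is only an illustration of the rules as stated in this page, not code taken from the tool. The function name ``migrate_replaces`` and the volume labels ``db``, ``wal`` and ``slow`` are hypothetical.

```shell
# Hypothetical helper, not part of ceph-bluestore-tool: given the kinds of
# volume named via --devs-source (db, wal, slow), print which one a NEW
# target device would replace, following the documented precedence.
migrate_replaces() {
    if [ "$#" -eq 0 ]; then
        echo "usage: migrate_replaces db|wal|slow ..." >&2
        return 2
    fi
    has_db=0
    has_wal=0
    for kind in "$@"; do
        case "$kind" in
            db)   has_db=1 ;;
            wal)  has_wal=1 ;;
            slow) ;;
            *)    echo "unknown volume kind: $kind" >&2; return 2 ;;
        esac
    done
    if [ "$has_db" -eq 1 ]; then
        echo db       # rule 1: a DB volume in the source list is replaced
    elif [ "$has_wal" -eq 1 ]; then
        echo wal      # rule 2: otherwise a WAL volume is replaced
    else
        # rule 3: only the slow volume was listed
        echo "not permitted: use bluefs-bdev-new-db or bluefs-bdev-new-wal" >&2
        return 1
    fi
}
```

For example, ``migrate_replaces slow wal`` prints ``wal``, while ``migrate_replaces slow`` returns a non-zero status.
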
:command:`show-label` --dev *device* [...]

   Show device label(s).

:command:`free-dump` --path *osd path* [ --allocator block/bluefs-wal/bluefs-db/bluefs-slow ]

   Dump all free regions in the allocator.

:command:`free-score` --path *osd path* [ --allocator block/bluefs-wal/bluefs-db/bluefs-slow ]

   Print a number in the range [0, 1] that represents the degree of fragmentation in the allocator.
   0 means that all free space is in one chunk; 1 represents the worst possible fragmentation.

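As a usage sketch, the loop below asks for the fragmentation score of each allocator name listed in the synopsis, one at a time. The function name ``free_scores`` and the ``CBT`` variable (an override for the binary name, useful for dry runs) are assumptions made for this example. What the tool does for an allocator that is not present on a given OSD is not covered here.

```shell
# Sketch: query free-score once per allocator name documented in this page.
# CBT is a hypothetical override for the binary; it defaults to the real tool.
free_scores() {
    osd_path="$1"
    for alloc in block bluefs-wal bluefs-db bluefs-slow; do
        echo "== $alloc =="
        "${CBT:-ceph-bluestore-tool}" free-score --path "$osd_path" --allocator "$alloc"
    done
}
```

Running ``CBT=echo free_scores /var/lib/ceph/osd/ceph-0`` prints the commands that would be executed instead of running them.
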
:command:`reshard` --path *osd path* --sharding *new sharding* [ --resharding-ctrl *control string* ]

   Changes sharding of BlueStore's RocksDB. Sharding is built on top of RocksDB column families.
   This option allows testing the performance of *new sharding* without the need to redeploy the OSD.
   Resharding is usually a long process, which involves walking through the entire RocksDB key space
   and moving some keys to different column families.
   Option --resharding-ctrl provides performance control over the resharding process.
   An interrupted resharding will prevent the OSD from running.
   An interrupted resharding does not corrupt data. It is always possible to continue a previous resharding,
   or select any other sharding scheme, including reverting to the original one.

:command:`show-sharding` --path *osd path*

   Show sharding that is currently applied to BlueStore's RocksDB.

Options
=======

.. option:: --dev *device*

   Add *device* to the list of devices to consider

.. option:: -i *osd_id*

   Operate as OSD *osd_id*. Connect to monitor for OSD specific options.
   If monitor is unavailable, add --no-mon-config to read from ceph.conf instead.

.. option:: --devs-source *device*

   Add *device* to the list of devices to consider as sources for migrate operation

.. option:: --dev-target *device*

   Specify the target *device* for a migrate operation, or the device to add when adding a new DB/WAL.

.. option:: --path *osd path*

   Specify an osd path. In most cases, the device list is inferred from the symlinks present in *osd path*. This is usually simpler than explicitly specifying the device(s) with --dev. Not necessary if -i *osd_id* is provided.

.. option:: --out-dir *dir*

   Output directory for bluefs-export

.. option:: -l, --log-file *log file*

   file to log to

.. option:: --log-level *num*

   debug log level. Default is 30 (extremely verbose), 20 is very
   verbose, 10 is verbose, and 1 is not very verbose.

.. option:: --deep

   deep scrub/repair (read and validate object data, not just metadata)

.. option:: --allocator *name*

   Useful for *free-dump* and *free-score* actions. Selects allocator(s).

.. option:: --resharding-ctrl *control string*

   Provides control over the resharding process. Specifies how often to refresh the RocksDB iterator,
   and how large the commit batch should be before committing to RocksDB. Option format is:
   <iterator_refresh_bytes>/<iterator_refresh_keys>/<batch_commit_bytes>/<batch_commit_keys>
   Default: 10000000/10000/1000000/1000

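To make the four-field format of the resharding control string concrete, the sketch below splits a value on ``/`` and labels each field with the name used in the format description. The function name ``parse_resharding_ctrl`` is hypothetical, and the digits-only check is an assumption based on the documented default rather than on the tool's own parser.

```shell
# Hypothetical helper: split a --resharding-ctrl value into its four
# documented fields. Falls back to the documented default when called
# without an argument.
parse_resharding_ctrl() {
    ctrl="${1:-10000000/10000/1000000/1000}"
    # Assumption: every field is a plain non-negative integer.
    case "$ctrl" in
        *[!0-9/]*) echo "only digits and '/' are expected: $ctrl" >&2; return 1 ;;
    esac
    old_ifs=$IFS
    IFS=/
    set -- $ctrl          # word-split on '/' into $1..$N
    IFS=$old_ifs
    if [ "$#" -ne 4 ]; then
        echo "expected 4 '/'-separated fields, got $#" >&2
        return 1
    fi
    for field in "$@"; do
        [ -n "$field" ] || { echo "empty field in: $ctrl" >&2; return 1; }
    done
    echo "iterator_refresh_bytes=$1"
    echo "iterator_refresh_keys=$2"
    echo "batch_commit_bytes=$3"
    echo "batch_commit_keys=$4"
}
```

Called without arguments, the function prints the four fields of the documented default value.
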
Additional ceph.conf options
============================

Any configuration option that is accepted by the OSD can also be passed to **ceph-bluestore-tool**.
This is useful for providing necessary configuration options when access to the monitor or ceph.conf is impossible and the -i option cannot be used.

Device labels
=============

Every BlueStore block device has a single block label at the beginning of the
device. You can dump the contents of the label with::

  ceph-bluestore-tool show-label --dev *device*

The main device will have a lot of metadata, including information
that used to be stored in small files in the OSD data directory. The
auxiliary devices (db and wal) will only have the minimum required
fields (OSD UUID, size, device type, birth time).

OSD directory priming
=====================

You can generate the content for an OSD data directory that can start up a
BlueStore OSD with the *prime-osd-dir* command::

  ceph-bluestore-tool prime-osd-dir --dev *main device* --path /var/lib/ceph/osd/ceph-*id*

BlueFS log rescue
=================

Some versions of BlueStore were susceptible to the BlueFS log growing extremely large,
to the point of making it impossible to boot the OSD. This state is indicated by
booting that takes a very long time and fails in the _replay function.

This can be fixed by::

  ceph-bluestore-tool fsck --path *osd path* --bluefs_replay_recovery=true

It is advised to first check if the rescue process would be successful::

  ceph-bluestore-tool fsck --path *osd path* \
  --bluefs_replay_recovery=true --bluefs_replay_recovery_disable_compact=true

If the above fsck is successful, the fix procedure can be applied.

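The check-then-fix order described above can be wrapped in a small shell function. This is a minimal sketch that assumes fsck signals failure through a non-zero exit status. The function name ``rescue_bluefs_log`` and the ``CBT`` variable (an override for the binary name, useful for dry runs) are hypothetical.

```shell
# Sketch of the two-step BlueFS log rescue. CBT is a hypothetical override
# for the binary name; it defaults to the real tool.
rescue_bluefs_log() {
    osd_path="$1"
    cbt="${CBT:-ceph-bluestore-tool}"
    # Step 1: check whether the rescue would succeed (compaction disabled).
    # Assumption: a failed fsck returns a non-zero exit status.
    if ! "$cbt" fsck --path "$osd_path" \
            --bluefs_replay_recovery=true \
            --bluefs_replay_recovery_disable_compact=true; then
        echo "check run failed; not applying the fix" >&2
        return 1
    fi
    # Step 2: apply the fix.
    "$cbt" fsck --path "$osd_path" --bluefs_replay_recovery=true
}
```

Running ``CBT=echo rescue_bluefs_log /var/lib/ceph/osd/ceph-0`` prints both commands in order without touching the OSD.
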
Availability
============

**ceph-bluestore-tool** is part of Ceph, a massively scalable,
open-source, distributed storage system. Please refer to the Ceph
documentation at https://docs.ceph.com for more information.


See also
========

:doc:`ceph-osd <ceph-osd>`\(8)