The `ceph-volume lvm migrate/new-db/new-wal` commands don't support
running on non-systemd systems or within containers.
Like other ceph-volume subcommands (lvm activate/batch/zap and raw
activate), we also need to be able to use the --no-systemd flag.
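For example (a sketch only; the osd id, fsid and LV name below are
placeholders):

    ceph-volume lvm new-db --osd-id 0 --osd-fsid $OSD_FSID \
        --target vg/db-lv --no-systemd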
Fixes: https://tracker.ceph.com/issues/51854
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
The man page did not make it clear that multiple objects could be
specified, nor did it describe the use of "--force-full".
The info displayed for "rm" by `rados --help` was poorly formatted, so
the formatting and wording were adjusted.
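For example (pool and object names are illustrative), several objects
can be removed in one invocation, and --force-full allows the removal
to proceed even when the cluster is marked full:

    rados -p mypool rm obj1 obj2 obj3 --force-full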
Signed-off-by: J. Eric Ivancich <ivancich@redhat.com>
Currently BlueStore keeps its allocation info inside RocksDB.
BlueStore commits all allocation information (alloc/release) into
RocksDB (column-family B) before the client write completes, delaying
the write path and adding significant CPU/memory/disk load.
Committing all state into RocksDB allows Ceph to survive failures
without losing the allocation state.
The new code skips the RocksDB updates at allocation time and instead
performs a full destage of the allocator object, with all of the OSD's
allocation state, in a single step during umount().
This results in a 25% increase in IOPS and reduced latency in small
random-write workloads, but exposes the system to losing allocation
info in failure cases where umount() is never called.
We added code to perform a full allocation-map rebuild from the
information stored inside the ONodes; this is used in the failure
cases.
When we perform a graceful shutdown there is no need for recovery: we
simply read the allocation map from the flat file where it was stored
during umount(). In fact this mode is faster and shaves a few seconds
off boot time, since reading a flat file is faster than iterating over
RocksDB.
Open Issues:
There is a bug in the src/stop.sh script: it kills ceph-osd without
invoking umount(), which means anyone using it will always take the
recovery path.
Adam Kupczyk is fixing this issue in a separate PR.
A simple workaround is to call 'killall -15 ceph-osd' before running
src/stop.sh.
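i.e. something like:

    killall -15 ceph-osd   # SIGTERM -> graceful shutdown, so umount() destages the allocation map
    src/stop.sh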
Fast-shutdown and Ceph suicide (done when the system underperforms)
stop the system without a proper drain and without a call to umount().
This triggers a full recovery, which can be long (3 minutes in my
testing, but your mileage may vary).
We plan on adding a follow-up PR that does the following in
fast-shutdown and Ceph suicide:
- Block the OSD queues from accepting any new requests
- Delete all items in the queue which we didn't start yet
- Drain all in-flight tasks
- Call umount() (and destage the allocation map)
- If the drain doesn't complete within a predefined time limit (say 3
  minutes), kill the OSD
Signed-off-by: Gabriel Benhanokh <gbenhano@redhat.com>
create allocator from on-disk onodes and BlueFS inodes
change allocator + add stat counters + report illegal physical-extents
compare allocator after rebuild from ONodes
prevent collection from being open twice
removed FSCK repo check for null-fm
Bug-Fix: don't add BlueFS allocation to shared allocator
add configuration option to commit to No-Column-B
Only invalidate allocation file after opening rocksdb in read-write mode
fix tests not to expect failure in cases inapplicable to null-allocator
accept a non-existing allocation file and don't fail the invalidation as it can legally happen
don't commit to null-fm when db is opened in repair-mode
add a reverse mechanism from null_fm to real_fm (using RocksDB)
Using Ceph encode/decode, adding more info to header/trailer, add crc protection
Code cleanup
some changes requested by Adam (cleanup and style changes)
Signed-off-by: Gabriel Benhanokh <gbenhano@redhat.com>
Added option -i that allows operating as a specific OSD.
It reads configuration options from the monitor or ceph.conf.
In addition, providing a configuration option not accepted by the OSD
or ceph-bluestore-tool is now an error.
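For example (the osd id and path are placeholders; fsck stands in for
any command the tool supports):

    ceph-bluestore-tool -i 0 --path /var/lib/ceph/osd/ceph-0 fsck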
Signed-off-by: Adam Kupczyk <akupczyk@redhat.com>
so the user does not have to use the virtualenv Python package for
creating a virtualenv; the "venv" module in Python 3 suffices.
see also https://docs.python.org/3/library/venv.html
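e.g.:

    python3 -m venv venv
    . venv/bin/activate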
Signed-off-by: Kefu Chai <kchai@redhat.com>
as per
https://www.sphinx-doc.org/en/master/usage/restructuredtext/domains.html
> Like py:currentmodule, this directive produces no output. Instead, it
> serves to notify Sphinx that all following option directives document
> options for the program called name.
> ...
> The program name may contain spaces (in case you want to document
> subcommands like svn add and svn commit separately).
and to avoid the warnings like:
doc/man/8/ceph-volume.rst:424: WARNING: Duplicate explicit target name:
"cmdoption-ceph-volume-h".
we should specify a different "program" for each set of options.
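A sketch of the resulting structure (the directives are standard
Sphinx; the option descriptions here are illustrative):

    .. program:: ceph-volume lvm activate

    .. option:: --no-systemd

       Skip creating and enabling systemd units.

    .. program:: ceph-volume lvm zap

    .. option:: --destroy

       Destroy the logical volume after zapping.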
Signed-off-by: Kefu Chai <kchai@redhat.com>
ceph-deploy is not actively maintained anymore, and it has been
replaced by ceph-volume and other higher-level tools,
so there is no point in packaging its manpage anymore.
Signed-off-by: Kefu Chai <kchai@redhat.com>
I found that the difference between "rbd cp" and "rbd deep cp",
i.e. what "deep" means in this context, is documented only in
the mailing list archive and in the Mimic release notes.
Let's make the difference explicit in the manpage and in rbd --help.
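For example (pool/image names are placeholders): a deep copy also
carries over the image's snapshots, which a plain copy does not:

    rbd cp rbd/src rbd/copy-without-snaps
    rbd deep cp rbd/src rbd/copy-with-snaps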
Signed-off-by: Jan "Yenya" Kasprzak <kas@fi.muni.cz>
This is a wrapper over ceph-bluestore-tool's bluefs-bdev-migrate command.
Primarily intended to introduce LVM tags manipulation which
ceph-bluestore-tool is lacking.
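A usage sketch (osd id, fsid and LV name are placeholders):

    ceph-volume lvm migrate --osd-id 0 --osd-fsid $OSD_FSID \
        --from db wal --target vg/data-lv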
Signed-off-by: Igor Fedotov <ifedotov@suse.com>
When adding more metrics, the top line becomes too long and may wrap
across several lines, which makes it hard to read.
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Add descriptions for the options --id and --client_fs to the ceph-fuse
manual, and move the description for -d closer to -f since both options
are similar.
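For example (client name, file system name and mountpoint are
placeholders; --client_fs selects which file system to mount when the
cluster has more than one):

    ceph-fuse --id admin --client_fs mycephfs /mnt/mycephfs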
Signed-off-by: Rishabh Dave <ridave@redhat.com>
This reverts commit f635555fe7, reversing
changes made to d4d3d17b23.
This PR seems to be (indirectly?) responsible for
https://tracker.ceph.com/issues/49237
Also, it was causing the rados.py task's follow-up step, which waits
for snap trimming, to fail: a 'ceph osd dump --format=json' command
would time out. :/
Signed-off-by: Sage Weil <sage@newdream.net>
generally, all Ceph containers need an init process to both reap any
zombie pids and perform signal handling (e.g. coredumps, etc.)
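e.g. (illustrative; docker has the same flag, and the image name is a
placeholder):

    podman run --init --rm <ceph-container-image> ceph --version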
Signed-off-by: Michael Fritch <mfritch@suse.com>
Recognize ms_mode map option and filter initial monitor addresses
accordingly: if ms_mode is not given or ms_mode=legacy, discard v2
addresses, otherwise discard v1 addresses.
Note that nothing was discarded (i.e. v2 addresses were passed to
the kernel) previously. The intent was to preserve that behaviour
in case ms_mode is not given, allowing the kernel default to be
changed in the future. However, it turns out that the mount.ceph helper
has
been misguidedly discarding v2 addresses since commit eae0127513
("mount.ceph: fork a child to get info from local configuration"),
so that ship has sailed.
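For example (pool/image are placeholders), mapping with ms_mode=crc
now keeps only v2 addresses:

    rbd map mypool/myimage -o ms_mode=crc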
Fixes: https://tracker.ceph.com/issues/48976
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
* refs/pull/37876/head:
systemd: cephfs-mirror systemd unit files
doc, man: man page for `cephfs-mirror` tool
doc: document mirror daemon internals
cephfs-mirror: switch to using PeerReplayer class
cephfs-mirror: cancel ongoing snapshot syncs on dir removal
cephfs-mirror: display peer snapshot sync stats
cephfs-mirror: carve out (and implement) mirroring snapshots to peers
cephfs-mirror: remove `cephfs_mirror_directory_choose_policy` option
cephfs-mirror: include helper routines to separate source
cephfs-mirror: remove peer only when peer is tracked
cephfs-mirror: typedef ceph_mount_info as MountRef shared pointer
cephfs-mirror: enclose json dump in object section
cephfs-mirror: note current peer set
cephfs-mirror: fix option typo and document certain options
cephfs-mirror: remove unnecessary command line options
cephfs-mirror: default log level 0/5
Reviewed-by: Patrick Donnelly <pdonnell@redhat.com>