CreatePrimaryRequest::unlink_peer() invoked via "rbd mirror image
snapshot" command or via rbd_support mgr module when creating a new
scheduled mirror snapshot at rbd_mirroring_max_mirroring_snapshots
capacity on the primary cluster can race with Replayer::unlink_peer()
invoked by rbd-mirror when finishing syncing an older snapshot on the
secondary cluster. Consider the following:
[ primary: primary-snap1, primary-snap2, primary-snap3
secondary: non-primary-snap1 (complete), non-primary-snap2 (syncing) ]
0. rbd-mirror is syncing snap1..snap2 delta
1. rbd_support creates primary-snap4
2. due to rbd_mirroring_max_mirroring_snapshots == 3, rbd_support picks
primary-snap3 for unlinking
3. rbd-mirror finishes syncing snap1..snap2 delta and marks
non-primary-snap2 complete
[ snap1 (the old base) is no longer needed on either cluster ]
4. rbd-mirror unlinks and removes primary-snap1
5. rbd-mirror removes non-primary-snap1
6. rbd-mirror picks snap2 as the new base
7. rbd-mirror creates non-primary-snap3 and starts syncing snap2..snap3
delta
[ primary: primary-snap2, primary-snap3, primary-snap4
secondary: non-primary-snap2 (complete), non-primary-snap3 (syncing) ]
8. rbd_support unlinks and removes primary-snap3 which is in-use by
rbd-mirror
If snap trimming on the primary cluster kicks in soon enough, the
secondary image becomes corrupted: rbd-mirror would eventually finish
"syncing" non-primary-snap3 and mark it complete in spite of bogus data
in the HEAD -- the primary cluster OSDs would start returning ENOENT
for snap trimmed objects. Luckily, rbd-mirror's attempt to pick snap3
as the new base would wedge the replayer with "split-brain detected:
failed to find matching non-primary snapshot in remote image" error.
Before commit a888bff8d0 ("librbd/mirror: tweak which snapshot is
unlinked when at capacity") this could happen pretty much all the time
as it was the second oldest snapshot that was unlinked. This commit
changed it to be the third oldest snapshot, turning this into a more
narrow but still very much possible to hit race.
Unfortunately this race condition appears to be inherent to the way
snapshot-based mirroring is currently implemented:
a. when mirror snapshots are created on the producer side of the
snapshot queue, they are already linked
b. mirror snapshots can be concurrently unlinked/removed on both
sides of the snapshot queue by non-cooperating clients (local
rbd_mirror_image_create_snapshot() vs remote rbd-mirror)
c. with mirror peer links off the list due to (a), there is no
existing way for rbd-mirror to persistently mark a snapshot as
in-use
As a workaround, bump rbd_mirroring_max_mirroring_snapshots to 5 and
always unlink the newest snapshot (i.e. slot 4) instead of the third
oldest snapshot (i.e. slot 2). Hopefully this gives enough leeway,
as rbd-mirror would need to sync two snapshots (i.e. transition from
syncing 0-1 to 1-2 and then to 2-3) before potentially colliding with
rbd_mirror_image_create_snapshot() on slot 4.
Fixes: https://tracker.ceph.com/issues/55803
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Add basic metrics to ImageCacheState and persist them, including
allocated_bytes, cached_bytes, dirty_bytes, free_bytes and hit/miss
info.
Leverage periodic_stats() timer to call update_image_cache_state.
In order to avoid outputting too much debug information, the original
statistics output log level is changed to 5.
Switch to json_spirit for encoding because encode_json encodes bool as
"true"/"false" string.
Remove rbd_persistent_cache_log_periodic_stats option because we need
to always update cache state.
[ idryomov: add cached_bytes and hits_partial; report misses and
miss_bytes instead of respective totals; naming ]
Fixes: https://tracker.ceph.com/issues/50614
Signed-off-by: Yin Congmin <congmin.yin@intel.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
The full name of PWL is persistent write log, and the full name of
RWL is replica write log. Change the title to make it consistent
with the name of the feature and better reflect its design.
Signed-off-by: Yin Congmin <congmin.yin@intel.com>
Adds documentation how to change default rbd object size. With the
previous option `--order` it was easy to guess the config name for the
default value, with the current option name `--object-size` thats hard
to guess.
Also extends the documentation for rbd_default_order to include
* how object-size is derived from the configured value
* allowed range of the value
In the first version of this commit I also added min and max for this
parameter (12/25, matching the object size range in `man 8
rbd`/Striping/object-size), but this made some tests fail, since some
seem to set values outside this range (and probably are fine since
included for some time already). To have this a doc-change only, I
removed the range.
Signed-off-by: Mara Sophie Grosch <littlefox@lf-net.org>
Hyper-V identifies passthrough VM disks by number instead of SCSI ID, although
the disk number can change across host reboots. This means that the VMs can end
up using incorrect disks after rebooting the host, which is an important
security concern. This issue also affects iSCSI and Fibre Channel disks.
We're going to document this Hyper-V limitation along with possible
workarounds.
Signed-off-by: Lucian Petrut <lpetrut@cloudbasesolutions.com>
On top of "profile rbd" permissions, "profile rbd-mirror-peer" also
allows getting rbd/mirror and setting rbd/mirror/peer/* config keys.
This is what "rbd mirror pool peer bootstrap create" does.
Fixes: https://tracker.ceph.com/issues/50970
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
This PR rewrites the sections
- Configure Ceph-CSI
- Configure Nomad
in the rbd-nomad.rst Chapter of the RBD
Guide.
Signed-off-by: Zac Dover <zac.dover@gmail.com>
This PR improves the English in the "Create
a Pool" section of the "RBD & Nomad Integration"
chapter of the RBD Guide.
Signed-off-by: Zac Dover <zac.dover@gmail.com>
This PR improves the English syntax and
spelling in the "Block Devices and Nomad"
section of the rbd-nomad.rst page.
This PR rewrites the top-level matter.
Signed-off-by: Zac Dover <zac.dover@gmail.com>
less repeating this way.
also fix a typo of "rbd_qos_writ_bps_limit", it should be
"rbd_qos_write_bps_limit".
Signed-off-by: Kefu Chai <kchai@redhat.com>
Amend the example output and drop the word "transient" -- currently it
is the exact opposite, as the status is updated only when the cache is
opened and closed.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>