* refs/pull/58419/head:
mds: generate correct path for unlinked snapped files
qa: add test for cephx path check on unlinked snapped dir tree
mds: add debugging for stray_prior_path
Reviewed-by: Milind Changire <mchangir@redhat.com>
The group was made a required parameter, i.e.
`ceph orch apply nvmeof <pool> <group>`, in
https://github.com/ceph/ceph/pull/58860.
That broke the `nvmeof` suite, so this PR fixes that.
Right now, all gateways are deployed in a single group.
Later, this can be changed to use multiple groups for better test coverage.
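A minimal sketch of the resulting deployment command as the qa task might issue it (pool/group names and the remote-run helper here are assumptions, not the suite's actual code):
```python
# Illustrative only: deploy all nvmeof gateways into a single group for now.
# 'remote' stands in for a teuthology Remote handle.
POOL = 'mypool'       # assumed pool name
GROUP = 'group1'      # assumed single group name

remote.run(args=[
    'ceph', 'orch', 'apply', 'nvmeof', POOL, GROUP,
    '--placement', 'label:nvmeof-gw',   # placement spec is an assumption
])
```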
Signed-off-by: Vallari Agrawal <val.agl002@gmail.com>
The cephadm_from_container option allows one to do a single container build
and then point teuthology at that image as the "single source of truth".
I find this extremely convenient when running teuthology locally, and
I keep carrying this patch around - I figure having it upstream will
simplify my workflow. Maybe someday it'll benefit others too.
To use it I set up a yaml overrides file with the following content:
```yaml
overrides:
  cephadm:
    image: "quay.io/phlogistonjohn/ceph:dev"
    cephadm_from_container: true
  verify_ceph_hash: false
verify_ceph_hash: false
```
This lets me test my custom builds fairly easily!
Signed-off-by: John Mulligan <phlogistonjohn@asynchrono.us>
* refs/pull/56816/head:
doc: mention the peer status failed when snapshot created on the remote filesystem.
qa: add test_cephfs_mirror_remote_snap_corrupt_fails_synced_snapshot
cephfs_mirror: update peer status for invalid metadata in remote snapshot
Reviewed-by: Venky Shankar <vshankar@redhat.com>
Reviewed-by: Anthony D Atri <anthony.datri@gmail.com>
The journal reset effectively cleared the cache so the rank may not have the
dirfrag in memory when we verify alternate name recovery.
Fixes: https://tracker.ceph.com/issues/67511
Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
* Make all replayer threads busy and then query for the 'syncing' state
instead of just fetching the current status (see the sketch after this list).
* Dropped the 'current_syncing_snap' check, as it's not compulsory for
this test. The actual intention is to get the threads into 'syncing' status,
and the 'current_syncing_snap' check is not necessary for that.
* Drop the 'snaps_deleted' metrics check in test_cephfs_mirror_cancel_mirroring_and_readd.
test_cephfs_mirror_cancel_mirroring_and_readd primarily focuses
on the synchronization of the newly added directory paths after removal
of the previously added/syncing directory paths, so checking the 'snaps_deleted'
metric is unnecessary here.
* Wait longer for the new snapshot creations and the sync backoff to finish.
We need to wait longer in test_cephfs_mirror_cancel_mirroring_and_readd,
as the test makes all replayer threads busy.
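A minimal polling sketch of that "query for the state" approach (a generic helper, not the actual test code):
```python
import time

def wait_until(predicate, timeout=120, interval=2):
    # Poll until the predicate (e.g. "a replayer reports 'syncing'") holds,
    # instead of sampling the mirror daemon status just once.
    elapsed = 0
    while elapsed < timeout:
        if predicate():
            return True
        time.sleep(interval)
        elapsed += interval
    return False
```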
Fixes: https://tracker.ceph.com/issues/64711
Signed-off-by: Jos Collin <jcollin@redhat.com>
We install barbican by doing a pip install directly on the
cloned git repository, but we don't honor the upper-constraints
from the OpenStack Requirements project that defines which
versions are supported.
This changes the pip install command that we issue when
installing barbican to honor the constraints for the
version (derived from the branch) that we use, in
this case the 2023.1 release upper-constraints [1].
This prevents us from pulling in untested Python packages.
This only updates Barbican because for the Keystone job
we don't invoke pip directly but install via tox using the
`venv` environment, which already sets the
constraints by default, as you can see in [2].
[1] https://releases.openstack.org/constraints/upper/2023.1
[2] https://github.com/openstack/keystone/blob/stable/2023.1/tox.ini#L12
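A rough sketch of the resulting pip invocation (the branch/URL handling shown here is illustrative, not the exact task code):
```python
# Derive the constraints URL from the barbican branch we check out and
# pass it to pip so only tested package versions get installed.
BARBICAN_BRANCH = 'stable/2023.1'                     # assumed branch name
release = BARBICAN_BRANCH.split('/')[-1]              # -> '2023.1'
constraints = f'https://releases.openstack.org/constraints/upper/{release}'

pip_install = [
    'python3', '-m', 'pip', 'install',
    '-c', constraints,   # honor the OpenStack upper-constraints
    '.',                 # the cloned barbican checkout
]
# this argument list would then be run inside the cloned repository
```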
Fixes: https://tracker.ceph.com/issues/67444
Signed-off-by: Tobias Urdin <tobias.urdin@binero.com>
Reviewed-By: Casey Bodley <cbodley@ibm.com>
test/rgw/notification: use real ip address instead of localhost
Based on this comment:
https://tracker.ceph.com/issues/67206#note-6
the address used by the endpoint is taken as the real IP address of the
host where the test script is running, not localhost.
We also changed the rabbitmq-server conf to allow the "guest"
user to connect over a non-localhost address.
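One common way to discover the host's outward-facing IP address (a sketch; not necessarily the exact code used by the test script):
```python
import socket

def get_host_ip():
    # "Connect" a UDP socket to a public address; no packets are sent,
    # but the kernel selects the outward-facing interface for us.
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        s.connect(('8.8.8.8', 80))
        return s.getsockname()[0]
    finally:
        s.close()
```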
Fixes: https://tracker.ceph.com/issues/67206
Signed-off-by: Yuval Lifshitz <ylifshit@ibm.com>
This commit allows running the qemu task on an arbitrary cluster name.
Signed-off-by: Or Ozeri <oro@il.ibm.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
This commit allows running the rbd task on an arbitrary cluster name.
Signed-off-by: Or Ozeri <oro@il.ibm.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
The test name is test_subvolume_snapshot_info_if_clone_pending_for_no_group,
located in class TestSubvolumeSnapshotClones in test_volumes.py.
5 seconds can (sometimes) be insufficient as the value of the config option
"snapshot_clone_delay" in this test. Increase it to avoid unnecessary race
conditions which lead to spurious failures.
Following is an example where 5 seconds was insufficient as the waiting
period, since it instead took 8 seconds -
2024-07-28T18:16:10.088 DEBUG:teuthology.orchestra.run.smithi064:> sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 120 ceph --cluster ceph config set mgr mgr/volumes/snapshot_clone_no_wait False
...
2024-07-28T18:16:18.694 DEBUG:teuthology.orchestra.run.smithi064:> sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 120 ceph --cluster ceph fs subvolume snapshot info cephfs subvol79370 subvol_snap40980
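In test code, the change boils down to something like this (a sketch assuming the config_set() helper; the exact new value is illustrative):
```python
# disable the no-wait shortcut and give the clone enough time to be
# observed while still in the 'pending' state
self.config_set('mgr', 'mgr/volumes/snapshot_clone_no_wait', 'false')
self.config_set('mgr', 'mgr/volumes/snapshot_clone_delay', '15')   # was 5
```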
This issue was seen while testing the PR to which this commit belongs.
This commit has been separated from the commit that adds tests for clone
progress reporting so that it's easy to document the need for this code
change and also to track it.
This commit is being kept on this PR rather than moved to a different one
since the issue can't be reproduced otherwise. This also ensures that the
commit is backported to older releases along with the code that caused the
issue, so no one has to hunt for this commit during the backporting
effort.
Signed-off-by: Rishabh Dave <ridave@redhat.com>
Clone progress is shown to the user through the "ceph fs clone status" output
and through the "ceph status" output. Test both of these features.
Signed-off-by: Rishabh Dave <ridave@redhat.com>
TestVolumesHelper._do_subvolume_io() is a helper method that allows
users to generate data for testing. The mgr/vol code that reports progress
made by clone jobs depends on the value of the rbytes xattr, and it takes
a bit of time for rbytes to be set.
Therefore, all tests in TestCloneProgressReporter need to wait for the
subvolume's rbytes xattr to reflect the actual amount of data
present on the subvolume before proceeding to the actual testing.
To make this possible, make _do_subvolume_io() return the size of the
data it has generated.
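A sketch of how the returned size might be used (the rbytes helper named here is hypothetical):
```python
# _do_subvolume_io() now returns the size of the data it generated
size = self._do_subvolume_io(subvolume, number_of_files=10, file_size=1)

# wait until the subvolume's rbytes xattr has caught up before exercising
# clone progress reporting; _get_subvol_rbytes() is a hypothetical helper
self.wait_until_true(lambda: self._get_subvol_rbytes(subvolume) >= size,
                     timeout=120)
```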
Signed-off-by: Rishabh Dave <ridave@redhat.com>
Add a helper method that accepts command arguments (along with the rest of
the parameters accepted by run_shell()) and returns the stdout of
the command.
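A minimal sketch of such a helper (the method name is an assumption):
```python
def run_shell_and_get_stdout(self, args, **kwargs):
    # forward everything to run_shell() and hand back the command's
    # stdout as a stripped string
    return self.run_shell(args, **kwargs).stdout.getvalue().strip()
```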
Signed-off-by: Rishabh Dave <ridave@redhat.com>
1. Let the caller check for multiple states. It might happen that a clone
   finishes while it is being cancelled; in such cases the user might want
   to check for both.
2. Add a helper method to check if a clone is in the pending state and a
   separate method to check if a clone is in the cancelled state.
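A sketch of the first point (the helper and attribute names here are assumptions):
```python
def _check_clone_state(self, states, clone, clone_group=None, timo=120):
    # accept either a single state name or an iterable of acceptable states,
    # e.g. ('complete', 'canceled') when a clone may finish during cancel
    if isinstance(states, str):
        states = (states,)
    self.wait_until_true(
        lambda: self._get_clone_state(clone, clone_group) in states,
        timeout=timo)
```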
Signed-off-by: Rishabh Dave <ridave@redhat.com>
We should either bring the MDS back up and wait for it to be up, since that
is needed for the unmounting of CephFS in CephFSTestCase.tearDown() to
succeed, or just unmount the mountpoints before failing the filesystem.
Since the mountpoint won't be used in later tests, we just unmount
it.
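Roughly, the fix amounts to the following (which mount object is involved is illustrative):
```python
# unmount the client first so tearDown() doesn't hang waiting for an MDS,
# then fail the file system
self.mount_a.umount_wait()
self.fs.fail()
```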
Fixes: https://tracker.ceph.com/issues/66946
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Thrashers that do not inherit from ThrasherGreenlet previously used a
method called do_join, which combined stop and join functionality. To
ensure consistency and clarity, we want all thrashers to use separate
stop, join, and stop_and_join methods.
This commit renames methods and implements missing stop and stop_and_join
methods in thrashers that did not inherit from ThrasherGreenlet.
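A sketch of the separated methods for a non-ThrasherGreenlet thrasher (the stop flag shown is an assumption):
```python
def stop(self):
    # signal the thrashing loop to wind down
    self.stopping.set()          # assumes a threading.Event-style flag

def stop_and_join(self):
    # convenience wrapper: request a stop, then wait for the thread
    self.stop()
    return self.join()
```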
Fixes: https://tracker.ceph.com/issues/66698
Signed-off-by: Nitzan Mordechai <nmordech@redhat.com>
If a thrasher exception occurs, the do_dump_ops thread will continue
looping until the Teuthology timeout is reached.
The watchdog should terminate the thrasher to free up resources.
Fixes: https://tracker.ceph.com/issues/66698
Signed-off-by: Nitzan Mordechai <nmordech@redhat.com>
This was running ceph-volume through the
cephadm shell previously, but as we are trying
to remove mount points from cephadm shell, this
no longer works (specifically without the /dev mount)
Signed-off-by: Adam King <adking@redhat.com>
Basically, when we deploy 3 MONs,
check if the connection scores are clean,
with a 60-second grace period.
Fixes: https://tracker.ceph.com/issues/65695
Signed-off-by: Kamoltat <ksirivad@redhat.com>
Test the case where 2 DCs lose connection with each other
in a 3-AZ stretch cluster with a stretch pool enabled.
Check if the cluster is accessible and PGs are active+clean
after they reconnect.
Signed-off-by: Kamoltat <ksirivad@redhat.com>
Test the following new Ceph CLI commands:
`ceph osd pool stretch set`
`ceph osd pool stretch unset`
`ceph osd pool stretch show`
`qa/workunits/mon/mon-stretch-pool.sh`
will create the stretch cluster
while performing input validation for the CLI
commands mentioned above.
`qa/tasks/stretch_cluster.py`
is in charge of
setting a pool to stretch
and checks whether that prevents PGs
from going active when there are not
enough buckets available in the acting
set of the PGs.
Also, test different MON failover scenarios
after setting the pool as stretch.
`qa/suites/rados/singleton/all/mon-stretch-pool.yaml`
brings the scripts together.
Fixes: https://tracker.ceph.com/issues/64802
Signed-off-by: Kamoltat <ksirivad@redhat.com>
The comment was unclear without looking at the previous version of the code.
Therefore, improve the comment a bit and add a link to the commit that
introduced it, to give future readers context.
Also, place the comment before the command arguments. This too adds
context.
Signed-off-by: Rishabh Dave <ridave@redhat.com>
Removing root_squash from MDS auth caps through the "fs authorize" command
should not be allowed, as this command is not meant for
removing caps.
Fixes: https://tracker.ceph.com/issues/65808
Signed-off-by: Rishabh Dave <ridave@redhat.com>
This test deletes the CephFS already present on the cluster at the very
beginning and unmounts the first client beforehand. But it leaves the
second client mounted on this deleted CephFS, which doesn't exist for the
rest of the test. Then, at the very end of the test, it attempts to
remount the second client (during tearDown()), which hangs and causes the
test runner to crash.
Unmount the second client beforehand to prevent the bug, and delete the
mount_b object to avoid confusing future readers about
whether or not the second mountpoint exists.
Fixes: https://tracker.ceph.com/issues/66077
Signed-off-by: Rishabh Dave <ridave@redhat.com>
qa: account for rbd_trash object in krbd_data_pool.sh + related ceph{,adm} task fixes
Reviewed-by: Ramana Raja <rraja@redhat.com>
Reviewed-by: Adam King <adking@redhat.com>
Reviewed-by: N Balachandran <nibalach@redhat.com>
The cluster (name) is already specified in the arguments passed to
_shell() and this command doesn't need privileges.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
This fails because teuthology.wait_until_osds_up() wants to use the
adjust-ulimits wrapper, which isn't available in the "cephadm shell"
environment. The whole thing is also redundant because the cephadm task
is supposed to wait for OSDs to come up earlier, in ceph_osds().
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
This was typoed in commit 9d485ae1f4 ("qa/tasks/ceph: provide
configuration for setting configs via mon") and went unnoticed likely
because 3-snaps/yes.yaml in fs:workload is the only user of the new
mgr-modules stanza so far and fs:workload suite runs exclusively on
cephadm.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
After being blocklisted/failed, wait for the mirror daemon to restart
(which happens after a 30-second timeout) and only then check for the new rados_inst.
Fixes: https://tracker.ceph.com/issues/64927
Signed-off-by: Jos Collin <jcollin@redhat.com>
Introduce a rename endpoint in the CephFS REST API controller, so we can
rename an existing file or directory through it.
Fixes: https://tracker.ceph.com/issues/66797
Signed-off-by: Yite Gu <yitegu0@gmail.com>
The current ignore-list has \(SLOW_OPS\) but is missing SLOW_OPS.
Fixes: https://tracker.ceph.com/issues/66604
Signed-off-by: Nitzan Mordechai <nmordech@redhat.com>
Initialize 'monitoring_profiles' to an empty Python dictionary instead of
'None' to prevent the cbt task from failing with a TypeError exception
when attempting to iterate a 'NoneType'.
The bug was introduced as part of https://github.com/ceph/ceph/pull/51438
and commit e174c6e2cf.
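A minimal sketch of the defaulting pattern (the surrounding config handling is illustrative):
```python
# 'config' is the task's configuration dict; default to an empty dict so
# later iteration over the profiles is always safe
monitoring_profiles = config.get('monitoring_profiles', {}) or {}
for role, profile in monitoring_profiles.items():
    # ... set up monitoring for this role ...
    pass
```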
Signed-off-by: Sridhar Seshasayee <sseshasa@redhat.com>
Fixes: https://tracker.ceph.com/issues/66799
* refs/pull/57619/head:
qa/cephfs: use wait_for_daemon() instead of sleep()-ing
qa/cephfs: mark file system joinable for fs rename tests before unmounting clients
Reviewed-by: Rishabh Dave <ridave@redhat.com>
Otherwise jobs end up with the following failure:
```
2024-06-25T14:22:18.659 INFO:teuthology.orchestra.run.smithi098.stderr:Failed to write to /dev/nvme-fabrics: Invalid argument
```
Also, the output of nvme list has changed so we have to update
qa/tasks/nvme_loop.py accordingly.
Fixes: https://tracker.ceph.com/issues/66707
Signed-off-by: Guillaume Abrioux <gabrioux@ibm.com>
Test netsplit between 2 datacenters
in a stretch mode cluster.
Observe if:
- PGs are active
- Cluster is accessible
- Writes and Reads went through
Signed-off-by: Kamoltat <ksirivad@redhat.com>
We need to check whether taking the host out would cause the total number
of 'in' OSDs to drop below min_in.
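A sketch of the guard (attribute names follow ceph_manager conventions but are assumptions here):
```python
# skip taking this host out if doing so would leave fewer than
# min_in OSDs "in" the cluster; osds_on_host is the set of OSDs hosted there
if len(self.in_osds) - len(osds_on_host) < self.min_in:
    self.log('not enough in osds would remain, skipping host out')
    return
```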
Fixes: https://tracker.ceph.com/issues/66657
Signed-off-by: Nitzan Mordechai <nmordech@redhat.com>
* refs/pull/53503/head:
qa: add tests for `mds last-seen` command
doc/cephfs: add documentation for `mds last-seen`
PendingReleaseNotes: add note on last-seen command
mon/MDSMonitor: add command to lookup when mds was last seen
mon/MDSMonitor: set birth time on FSMap during encode
pybind/mgr/dashboard: show context diff for openapi check
Reviewed-by: Venky Shankar <vshankar@redhat.com>
If multisite is configured, the default daemon needs to be selected
based on the default zonegroup; otherwise the dashboard gives you incorrect
details when doing the period commit.
The issue occurs when you do a period update --commit and reload one
of the block pages: the API assigns the zonegroup of the second gateway
because, for a moment, the first gateway reflects the period changes.
This is not correct, because the default zonegroup belongs to the previously
active gateway; even though the back-end reports the active
zonegroup correctly, the dashboard API reports it incorrectly.
Fixes: https://tracker.ceph.com/issues/66394
Signed-off-by: Nizamudeen A <nia@redhat.com>
So we can begin to answer questions like: when did we last see an MDS?
Fixes: https://tracker.ceph.com/issues/62849
Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
* refs/pull/55792/head:
tools/cephfs: recover alternate_name of dentries from journal
qa: add test to verify recovery of alternate_name from journal
tools/cephfs/JournalTool: add some more debugging
tools/cephfs/JournalTool: remove extraneous 0x in debug output
mds: dump alternate_name to formatter
mds: add warning about encoding new fields
Reviewed-by: Christopher Hoffman <choffman@redhat.com>
Reviewed-by: Xiubo Li <xiubli@redhat.com>