Some older Arm servers run quite slowly, so make check jobs like
`check-generated.sh` are killed when they hit the job timeout.
Increase CEPH_TEST_TIMEOUT to give these machines more headroom.
Signed-off-by: luo rixin <luorixin@huawei.com>
This is expected for cephadm deployments where join_fs is configured, causing
affinity replacements.
Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
Some messages are duplicated to the cluster log, looking like:
2024-02-15T22:54:31.244 INFO:teuthology.orchestra.run.smithi033.stdout:2024-02-15T22:50:00.000263+0000 mon.smithi033 (mon.0) 558 : cluster 4 [ERR] MDS_ALL_DOWN: 1 filesystem is offline
Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
Commit e134c890 adds the bal_rank_mask at encoding version (ev) 17. This was
merged into main in Oct 2022 and made it into the reef release normally.
Commit 7b8def5c adds the max_xattr_size, also at encoding version (ev) 17, but
places it before bal_rank_mask. This is problematic: there were no plans to
backport e134c890 to quincy or pacific, so piggybacking on the ev 17 bump would
not work, and it would otherwise require the backports to be done as a set to
ensure consistency (including with the kernel client).
However, the real issue is that 7b8def5c was not merged until after reef was
already cut. This required 7b8def5c to be backported separately in [1] which
was not merged until after v18.2.1 (current reef HEAD as of this commit).
Ultimately, this means that there are reef versions (v18.2.[01]) in the wild
which expect bal_rank_mask to be encoded at ev17 and not (max_xattr_size,
bal_rank_mask). Adding to the complications, the kernel client has already
merged code [2] expecting max_xattr_size for ev17.
It was decided in a github discussion [3] to move bal_rank_mask to ev18 to
avoid updating the kernel client; this was done in the main branch via
36ee8e7e, and the reef max_xattr_size backport was updated with the same
change (d8cebd67).
Unfortunately, this breaks upgrades from v18.2.[01] to newer reef versions or
to main. The reason is that monitors will encode ev17 with bal_rank_mask
(max_xattr_size is not merged yet) and send that to the mgrs (which are
upgraded first). The mgr will attempt to decode bal_rank_mask as a uint64_t
(max_xattr_size) but fail, because an empty (by default) bal_rank_mask is
simply encoded as a signed 32-bit integer. Consequently, the mgr will fail
decoding with:
failed to decode message of type 45 v1: End of buffer [buffer:2]
Of course, the problem does not stop there: even if the mgr were able to
handle this, the monitors/mds/clients would fail in similar fashion.
So the only choice left is to fix max_xattr_size to be encoded at ev18.
Fortunately, v18.2.2 has not been released nor has any max_xattr_size backport
to quincy/pacific been merged. The main downside will be that kernels will
wrongly decode ev17 (which is already true for ceph clusters running
v18.2.[01]). A follow-up kernel fix will be required.
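In sketch form, the resulting versioned encoding looks like the following.
This is illustrative only, using Ceph's ENCODE_START/DECODE_START macros;
the real MDSMap encoder carries many more fields and feature-conditional
paths:

    // Illustrative sketch, not the actual MDSMap code. ev17 carries
    // bal_rank_mask (matching what v18.2.[01] monitors send on the wire);
    // max_xattr_size is only encoded from ev18 onward.
    void encode(ceph::buffer::list& bl) const {
      ENCODE_START(18, 6, bl);      // ev bumped 17 -> 18 (compat illustrative)
      // ... earlier fields ...
      encode(bal_rank_mask, bl);    // introduced at ev17
      encode(max_xattr_size, bl);   // now encoded at ev18
      ENCODE_FINISH(bl);
    }

    void decode(ceph::buffer::list::const_iterator& p) {
      DECODE_START(18, p);
      // ... earlier fields ...
      if (struct_v >= 17)
        decode(bal_rank_mask, p);   // a v18.2.[01] peer's payload ends here
      if (struct_v >= 18)
        decode(max_xattr_size, p);  // absent in what old monitors send
      DECODE_FINISH(p);
    }

With this layout, an upgraded mgr decoding an ev17 payload from a v18.2.[01]
monitor stops after bal_rank_mask instead of misreading it as max_xattr_size.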
[1] https://tracker.ceph.com/issues/59405
[2] linux.git d93231a6bc8a452323d5fef16cca7107ce483a27
[3] https://github.com/ceph/ceph/pull/53340#discussion_r1399255031
Fixes: https://tracker.ceph.com/issues/64440
Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
Importantly: do this before any locks are acquired.
Fixes: https://tracker.ceph.com/issues/64503
Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
rgw: code to display the complete user id that includes tenant, names…
Reviewed-by: Nizamudeen A <nia@redhat.com>
Reviewed-by: Casey Bodley <cbodley@redhat.com>
Disambiguate a note in doc/cephfs/add-remove-mds.rst to help readers
distinguish between cases in which they might want to use an automated
tool such as cephadm to deploy MDSes and cases in which they might want
to manually deploy MDSes.
See: https://github.com/ceph/ceph/pull/45639
Tracker: https://tracker.ceph.com/issues/54551
Co-authored-by: Anthony D'Atri <anthony.datri@gmail.com>
Signed-off-by: Zac Dover <zac.dover@proton.me>
Invalidating the cache before the librados delete means that a racing call
to `RGWSI_SysObj_Cache::read()` may succeed and repopulate the cache. In
that case, subsequent reads will continue to return cached data even after
the librados delete succeeds.
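In sketch form, the fix is to reorder the two steps; the helper names below
are hypothetical, not the actual RGWSI_SysObj_Cache code:

    // Illustrative ordering only; helper names are made up.
    int remove(const rgw_raw_obj& obj)
    {
      int r = do_librados_remove(obj);  // perform the rados delete first
      if (r < 0) {
        return r;
      }
      // Invalidate only after the delete has succeeded: a read racing
      // with us can no longer repopulate the cache with the old object.
      invalidate_cache_entry(obj);
      return 0;
    }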
Fixes: https://tracker.ceph.com/issues/64480
Signed-off-by: Casey Bodley <cbodley@redhat.com>
Add a definition of Placement Groups to
doc/rados/operations/placement-groups.rst.
Co-authored-by: Anthony D'Atri <anthony.datri@gmail.com>
Signed-off-by: Zac Dover <zac.dover@proton.me>
test_multi.py:test_object_sync is updated to reproduce the issue.
Without the fix, objects "." and ".." are not replicated and the test
fails (times out).
Fixes: https://tracker.ceph.com/issues/64366
Signed-off-by: Oguzhan Ozmen <oozmen@bloomberg.net>
We aren't currently using jaeger tracing on Windows. The issue is
that Windows hosts (or any other host that doesn't use jaeger)
are experiencing message decoding failures after a recent change [1].
This change updates the tracer encoding so that messages from
non-jaeger hosts may be decoded by services that use jaeger.
[1] https://github.com/ceph/ceph/pull/47457
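The general shape of the fix is sketched below. This is illustrative only,
not the actual src/common/tracer.h code; the point is that jaeger and
non-jaeger builds must agree on the byte layout:

    // Illustrative sketch: both build flavours encode the same layout,
    // so a jaeger-enabled service can decode a non-jaeger sender.
    inline void encode(const jspan_context& ctx, ceph::buffer::list& bl) {
      ENCODE_START(1, 1, bl);
    #ifdef HAVE_JAEGER
      const bool have_span = ctx.IsValid(); // real span context available
    #else
      const bool have_span = false;         // non-jaeger build: no span
    #endif
      encode(have_span, bl);
      if (have_span) {
        // ... encode the trace/span identifier bytes ...
      }
      ENCODE_FINISH(bl);
    }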
Signed-off-by: Lucian Petrut <lpetrut@cloudbasesolutions.com>
This commit brings back 3701ffa673, which was reverted due to an implicit
dependency on another revert. Please see
https://github.com/ceph/ceph/pull/52114#issuecomment-1950288188.
Conflicts:
src/common/tracer.h
formatting conflict with 7179ac0037
rgw/putobj: RadosWriter uses part head object for multipart parts
Reviewed-by: Mark Kogan <mkogan@ibm.com>
Reviewed-by: J. Eric Ivancich <ivancich@redhat.com>
When doing a PG dump with 'ceph pg dump --format json-pretty', the output is
so big that the command hangs; the ceph-mgr also hangs and eventually fails
over. The exact size depends on the number of OSDs in the cluster and the
number of peers for each OSD. In tests, the network ping times were
identified as the largest component in terms of size, so they are now
removed from the output to limit the overall size.
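Schematically (the Formatter calls are Ceph's real API, everything else here
is illustrative), the dump now skips the per-peer ping-time section:

    // Illustrative only: the per-peer network ping times are no longer
    // dumped, since they scale with the number of peers per OSD and
    // dominated the output size.
    void dump_osd_stat(ceph::Formatter *f, const osd_stat_t& s)
    {
      f->open_object_section("osd_stat");
      // ... the usual utilization and statfs counters ...
      // (network ping times were dumped here and are now omitted)
      f->close_section();
    }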
Fixes: https://tracker.ceph.com/issues/57460
Signed-off-by: Ponnuvel Palaniyappan <pponnuvel@gmail.com>
Replacing PgScrubber::determine_scrub_time() with a local copy,
as a stop-gap measure to keep the test running.
The scrub scheduling refactoring will remove the need for
this function, and the test will be updated accordingly.
Signed-off-by: Ronen Friedman <rfriedma@redhat.com>