Commit Graph

140037 Commits

Author SHA1 Message Date
Ronen Friedman
210dbd4ff1 osd: cleaning stop_for_fast_shutdown()
Removed unused variables to prevent compiler warnings.
Protected the shard lock.

Signed-off-by: Ronen Friedman <rfriedma@redhat.com>
2023-09-12 10:55:48 -05:00
Roy Sahar
b690343128
Merge pull request #53408 from roysahar/nvmeof_fix_omap_state_object_contains_None
nvmeof-gw: omap object name contains 'None' as a string due to the generated ceph-nvmeof.conf template
2023-09-12 10:43:24 +03:00
Liu-Chunmei
17df35e9a7
Merge pull request #53232 from myoungwon/wip-enable-rbm-tests
test/crimson/seastore/rbm: add sub-tests regarding RBM to the existing tests

Reviewed-by: Yingxin Cheng <yingxin.cheng@intel.com>
Reviewed-by: Liu-Chunmei <chunmei.liu@intel.com>
2023-09-11 23:25:24 -07:00
Myoungwon Oh
73de8937f6 crimson/os/seastore/object_data_handler: consider a RBM case when checking if write can be merged
RBM's paddr always indicates a physical address, which means it never has the delayed state.
So this commit adds a condition that checks whether the given paddr is used by an ongoing write.
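
A minimal sketch of the added check (hypothetical names; the actual
seastore types and helpers differ):

  // Segment-backed paddrs can be "delayed" (no final physical address
  // assigned yet), in which case the pending write is mergeable.  RBM
  // paddrs are always physical, so instead ask the transaction whether
  // it still has an ongoing (not yet committed) write to that address.
  bool can_merge_write(const paddr_t &addr, const Transaction &t) {
    if (addr.is_delayed()) {
      return true;                    // segment backend: in-flight write
    }
    return t.is_ongoing_write(addr);  // hypothetical helper for the RBM case
  }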

Signed-off-by: Myoungwon Oh <myoungwon.oh@samsung.com>
2023-09-12 12:17:47 +09:00
Yingxin
eb42e21d79
Merge pull request #53305 from xxhdx1985126/wip-seastore-onode-erase-replay
crimson/os/seastore/onode_manager: populate value recorders of onodes to be erased

Reviewed-by: Yingxin Cheng <yingxin.cheng@intel.com>
2023-09-12 09:21:29 +08:00
Roy Sahar
ab27f934a1 nvmeof-gw: omap object name contains 'None' because the generated template contains the string 'None'
Signed-off-by: Roy Sahar <royswi@gmail.com>
2023-09-12 01:52:53 +03:00
Laura Flores
c0183c76d7
Merge pull request #53344 from ljflores/wip-tracker-62761
common: add CephContext parameter to tracing::Tracer::init() in !HAVE_JAEGER branch
2023-09-11 18:22:03 -04:00
Laura Flores
7179ac0037 src/common: add context to tracing::Tracer::init
Follow-up to https://github.com/ceph/ceph/pull/50948. This wasn't originally
caught because the CentOS 8 default build passed, but it did fail the
crimson build: b5259484db/

Fixes: https://tracker.ceph.com/issues/62761
Signed-off-by: Laura Flores <lflores@ibm.com>
2023-09-11 18:26:13 +00:00
Rishabh Dave
5b1f679ff4
Merge pull request #53390 from rishabh-d-dave/cephfs-doc-admin
doc/cephfs: write cephfs commands fully in docs

Reviewed-by: Anthony D'Atri <anthony.datri@gmail.com>
Reviewed-by: Zac Dover <zac.dover@proton.me>
Reviewed-by: Dhairya Parmar <dparmar@redhat.com>
2023-09-11 22:19:37 +05:30
Ronen Friedman
e53ec60820
Merge pull request #53361 from ronen-fr/wip-rf-moveit
rgw/test: qualifying 'move'

Reviewed-by: Mark Kogan <mkogan@redhat.com>
Reviewed-by: Yuval Lifshitz <ylifshit@redhat.com>
2023-09-11 19:40:43 +03:00
Yuval Lifshitz
6cbc873b9a
Merge pull request #53369 from yuvalif/wip-yuval-62784
rgw/notifications: allow cross tenant notification management

reviewed-by: mattbenjamin
2023-09-11 19:08:15 +03:00
Adam King
13ea7b0807
Merge pull request #52043 from adk3798/tcmu-entrypoint
cephadm: run tcmu-runner through script to do restart on failure

Reviewed-by: John Mulligan <jmulligan@redhat.com>
2023-09-11 11:32:55 -04:00
Adam King
2b839838f7
Merge pull request #52881 from adk3798/upgrade-test-start-squid
qa/cephadm: start upgrade tests from quincy

Reviewed-by: Patrick Donnelly <pdonnell@redhat.com>
2023-09-11 11:29:38 -04:00
Adam King
a45ade8c15
Merge pull request #52708 from vshankar/wip-62236
qa: move nfs (mgr/nfs) related tests to fs suite

Reviewed-by: Adam King <adking@redhat.com>
2023-09-11 10:56:20 -04:00
Mark Nelson
923a4d7df4
Merge pull request #53343 from markhpc/wip-rocksdb-compression
common/options: Set LZ4 compression for bluestore RocksDB.
2023-09-11 09:55:20 -05:00
Adam King
55e13ffde7
Merge pull request #53072 from adk3798/tcmu-custom-configs
cephadm: make custom_configs work for tcmu-runner container

Reviewed-by: John Mulligan <jmulligan@redhat.com>
2023-09-11 10:30:06 -04:00
Adam King
a79a5ce22f
Merge pull request #53252 from adk3798/no-tags-upgrade-ls
mgr/cephadm: don't use image tag in orch upgrade ls

Reviewed-by: John Mulligan <jmulligan@redhat.com>
2023-09-11 10:28:38 -04:00
Adam King
ec05c55c56
Merge pull request #53297 from phlogistonjohn/jjm-logging
cephadm: move logging configuration into cephadmlib

Reviewed-by: Adam King <adking@redhat.com>
2023-09-11 10:27:05 -04:00
Yuval Lifshitz
150c490e2d
Merge pull request #53081 from kchheda3/wip-add-log
rgw/notification: enrich the expired notification log line for persistent notification

reviewed-by: thotz, yuvalif
2023-09-11 17:22:51 +03:00
Yuval Lifshitz
1b954e9507
Merge pull request #53009 from kchheda3/wip-bucket-stats-notification
rgw/notification: Publish notification information in bucket stats

reviewed-by: yuvalif
2023-09-11 17:22:17 +03:00
Yuval Lifshitz
f53f3aeba9
Merge pull request #53073 from kchheda3/wip-filter-noti-uid
rgw/notification: Filter topic list based on uid

reviewed-by: yuvalif
2023-09-11 17:21:47 +03:00
Anthony D'Atri
04e5b5083a
Merge pull request #53383 from zdover23/wip-doc-2023-09-11-glossary-ceph-client-link
doc/glossary: link to "ceph clients" from entry
2023-09-11 09:53:53 -04:00
Nizamudeen A
63871e38f2
Merge pull request #53246 from rhcs-dashboard/subvolumes-in-subvolumegroups
mgr/dashboard: display the groups in cephfs subvolume tab

Reviewed-by: Aashish Sharma <aasharma@redhat.com>
Reviewed-by: Nizamudeen A <nia@redhat.com>
2023-09-11 18:23:22 +05:30
Nizamudeen A
46ff2b50a1
Merge pull request #53323 from rhcs-dashboard/rgw-port-error
mgr/dashboard: fix rgw port manipulation error in dashboard

Reviewed-by: Pedro Gonzalez Gomez <pegonzal@redhat.com>
Reviewed-by: Avan Thakkar <athakkar@redhat.com>
2023-09-11 17:26:02 +05:30
zdover23
81cc774ccc
Merge pull request #53389 from alvinowyong/wip-doc-remove-ec-profile-warning
doc/rados/operations: Add manual CRUSH rule remove warning to erasure-code-profile.rst

Reviewed-by: Zac Dover <zac.dover@proton.me>
2023-09-11 20:05:20 +10:00
Rishabh Dave
e63b573d3e doc/cephfs: write cephfs commands fully in docs
We write CephFS commands incompletely in docs. For example, "ceph tell
mds.a help" is simply written as "tell mds.a help". This might confuse
the reader, and it does no harm to write the command in full.

Fixes: https://tracker.ceph.com/issues/62791
Signed-off-by: Rishabh Dave <ridave@redhat.com>
2023-09-11 15:25:46 +05:30
Alvin Owyong
f944fa8ddb
doc: Add warning on manual CRUSH rule removal
Add a warning to the "osd erasure-code-profile rm" section under rados/operations.

Signed-off-by: Alvin Owyong <70066269+alvinowyong@users.noreply.github.com>
2023-09-11 17:15:15 +08:00
Nizamudeen A
6b878906ed
Merge pull request #53071 from rhcs-dashboard/disable-features-rbd-edit
mgr/dashboard: images -> edit -> disable checkboxes for layering and deep-flatten

Reviewed-by: Pedro Gonzalez Gomez <pegonzal@redhat.com>
Reviewed-by: Ankush Behl <cloudbehl@gmail.com>
Reviewed-by: Ilya Dryomov <idryomov@redhat.com>
Reviewed-by: Nizamudeen A <nia@redhat.com>
2023-09-11 13:57:28 +05:30
Pedro Gonzalez Gomez
041bc0c362 mgr/dashboard: CephFS add groups in subvolume tab
Adds subvolume groups to the subvolume tab so that subvolumes can be selected from the appropriate group.
Also adds the ability to manage a subvolume's group in the create, edit, and remove actions.

Fixes: https://tracker.ceph.com/issues/62675
Signed-off-by: Pedro Gonzalez Gomez <pegonzal@redhat.com>
2023-09-11 08:44:37 +02:00
Zac Dover
fd9c8c9e2f doc/glossary: link to "ceph clients" from entry
Link to the "Ceph Clients" section of doc/architecture.rst from the
"Ceph Clients" entry in the glossary. A glossary entry should be a short
summary of the topic with which it deals, and it should direct the
reader to further and more detailed reading if the reader is interested.
This does that.

Signed-off-by: Zac Dover <zac.dover@proton.me>
2023-09-11 15:18:40 +10:00
Nizamudeen A
dfa63ee549
Merge pull request #53236 from rhcs-dashboard/subvolume-size-validator-fix
mgr/dashboard: add validator for size field in the forms 

Reviewed-by: Pedro Gonzalez Gomez <pegonzal@redhat.com>
Reviewed-by: Aashish Sharma <aasharma@redhat.com>
Reviewed-by: Avan Thakkar <athakkar@redhat.com>
2023-09-11 10:05:52 +05:30
Xuehan Xu
6c236bb63a crimson/os/seastore/onode_manager: populate value recorders of onodes to
be erased

Otherwise, the following modification sequence within the same transaction
might lead to onode extents' crc inconsistency during journal replay:

1. modify the last mapping in an onode extent;
2. erase the last mapping in that onode extent.

During journal replay, if the first modification is not recorded in the
delta, the onode extent's content would be inconsistent with its content
before the system reboot.
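
The hazard, as a short sketch (hypothetical names, not the actual
seastore API):

  // within a single transaction t:
  onode_extent->update_value(last_key, new_value);  // 1. modify last mapping
  onode_extent->erase(last_key);                    // 2. erase that mapping
  // if replay applies the erase without a value recorder having captured
  // step 1, the replayed extent content (and its crc) diverges from the
  // extent as it existed before the reboot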

Signed-off-by: Xuehan Xu <xuxuehan@qianxin.com>
2023-09-11 11:19:25 +08:00
Myoungwon Oh
670a19f277 test/crimson/seastore/rbm: add sub-tests regarding RBM to the existing tests
Signed-off-by: Myoungwon Oh <myoungwon.oh@samsung.com>
2023-09-11 11:30:07 +09:00
Anthony D'Atri
9c233914dd
Merge pull request #53371 from zdover23/wip-doc-2023-09-11-architecture-2-of-x
doc/architecture.rst - edit a sentence
2023-09-10 13:38:53 -04:00
Zac Dover
436fbf7a3e doc/architecture.rst - edit a sentence
Change the structure of a sentence because the verb "experience" looked
like the abstract noun "experience" when I read it with fresh eyes. I
chose the perhaps TESOL-unfriendly verb "incur", but I believe it is
right.

Signed-off-by: Zac Dover <zac.dover@proton.me>
2023-09-11 02:38:26 +10:00
Ronen Friedman
b46a12e277 rgw/test: adding a qualifier to 'move'
Clang now requires fully qualifying std::move, as per
https://reviews.llvm.org/D119670?id=408276
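
For illustration, the change amounts to:

  // before: unqualified call, now flagged by Clang
  auto moved = move(value);
  // after: fully qualified
  auto moved = std::move(value);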

Signed-off-by: Ronen Friedman <rfriedma@redhat.com>
2023-09-10 19:31:48 +03:00
Yuval Lifshitz
25f82210ab rgw/notifications: allow cross tenant notification management
testing instructions:
https://gist.github.com/yuvalif/60063dc67d981b387b382ff0f7f88d91

Fixes: https://tracker.ceph.com/issues/62784

Signed-off-by: Yuval Lifshitz <ylifshit@redhat.com>
2023-09-10 14:40:37 +00:00
zdover23
e900cf5f32
Merge pull request #53353 from zdover23/wip-doc-2023-09-10-architecture-1-of-x
doc/architecture.rst - edit up to "Cluster Map"

Reviewed-by: Anthony D'Atri <anthony.datri@gmail.com>
2023-09-10 21:43:44 +10:00
Zac Dover
b3538f8ade doc/architecture.rst - edit up to "Cluster Map"
Edit doc/architecture.rst up to "Cluster Map", but not including
"Cluster Map".

Co-authored-by: Anthony D'Atri <anthony.datri@gmail.com>
Signed-off-by: Zac Dover <zac.dover@proton.me>
2023-09-10 21:16:33 +10:00
Anthony D'Atri
1de6ea7c8f
Merge pull request #53341 from alvinowyong/wip-doc-create-osd-warning
doc/cephadm/services: Add package conflict warning to osd.rst
2023-09-09 16:08:59 -04:00
Patrick Donnelly
7264120b26
Merge PR #53331 into main
* refs/pull/53331/head:
	Revert "Revert "Merge PR #53077 into main""

Reviewed-by: Ilya Dryomov <idryomov@redhat.com>
Reviewed-by: Patrick Donnelly <pdonnell@redhat.com>
2023-09-09 10:50:22 -04:00
zdover23
c51628ae66
Merge pull request #53334 from zdover23/wip-doc-2023-09-08-README-md-test-cluster-commands
doc: update test cluster commands in README.md

Reviewed-by: Anthony D'Atri <anthony.datri@gmail.com>
2023-09-09 21:14:31 +10:00
zdover23
d787dd7e94
Merge pull request #53335 from zdover23/wip-doc-2023-09-08-rados-config-mon-config-ref-background
doc/configuration: edit "bg" in mon-config-ref.rst

Reviewed-by: Anthony D'Atri <anthony.datri@gmail.com>
2023-09-09 11:45:35 +10:00
Laura Flores
2810d1a5f8
Merge pull request #52539 from Matan-B/wip-matanb-61820
ceph_mon: Fix MonitorDBStore usage
2023-09-08 18:12:53 -04:00
Laura Flores
9c671cdbf9
Merge pull request #52485 from ifed01/wip-ifed-fix-btree-allocator
os/bluestore: fix btree allocator
2023-09-08 18:11:13 -04:00
Laura Flores
81c209d92b
Merge pull request #52211 from rzarzynski/wip-crimson-ec-classical-refactor
osd: dissect and abstract RMWPipeline from ECBackend for sharing it with crimson
2023-09-08 18:10:54 -04:00
Laura Flores
2ac39816dd
Merge pull request #52121 from cbodley/wip-blkdev-nodiscard
blkdev: fix nodiscard warning in get_device_metadata()
2023-09-08 18:10:33 -04:00
Alvin Owyong
38ec8e0692
docs: update warning message
Truncated the warning message according to review comments.

Signed-off-by: Alvin Owyong <70066269+alvinowyong@users.noreply.github.com>
2023-09-09 01:21:42 +08:00
Alvin Owyong
5e803567c7
docs: add warning about potential package conflict
Add a warning to the "creating new OSDs" section under cephadm services.

Signed-off-by: Alvin Owyong <70066269+alvinowyong@users.noreply.github.com>
2023-09-09 00:25:04 +08:00
Mark Nelson
17840dbfb1 common/options: Set LZ4 compression for bluestore RocksDB.
In the fall of 2022, we tested LZ4 RocksDB compression in bluestore on
NVMe-backed OSDs here:

https://ceph.io/en/news/blog/2022/rocksdb-tuning-deep-dive/

Since then, we've gotten feedback from users in the field who tested
compression, with extremely positive results.  Clyso has also worked with
a customer running a large RGW deployment that has seen similarly
positive results.

Advantages of using compression
===============================

1) Significantly lower write amplification and space amplification.

In the article above, we saw a 4X reduction in RocksDB space usage when
writing very small (4KB) objects to RGW.  On a real production cluster with
1.3 billion objects, Clyso observed a space usage reduction closer to 2.2X,
which was still a substantial improvement.  This win is important in
multiple cluster configurations:

1A) Pure HDD

Pure HDD clusters are often seek-limited under load.  This directly impacts
how quickly RocksDB can write data out, which can increase compaction times.

1B) Hybrid Clusters (HDD Block + Flash DB/WAL)

In this configuration, spillover to the HDD can become a concern when
there isn't enough space on the flash devices to hold the RocksDB
SST files of all the associated OSDs.  Compression has a dramatic
effect on being able to store all SST files in flash and avoid
spillover.

1C) Pure flash-based clusters

A primary concern for pure flash-based clusters is write amplification
and eventual wear-out of the flash under write-intensive scenarios.
RocksDB compression not only reduces space amplification but also
write amplification.  That means lower wear on the flash cells and
longer flash life.

2) Reduced Compaction Times

The customer cluster that Clyso worked with utilized an HDD-only
configuration.  Prior to utilizing RocksDB compression, this cluster
could take up to several days to complete a manual compaction of a given
OSD during live operation.  Enabling LZ4 compression in RocksDB reduced
manual compaction time to roughly 25-30 minutes, with ~2 hours being
the longest manual compaction time observed.

Potential Disadvantages of RocksDB Compression
==============================================

1) Increased CPU usage

While there is CPU overhead associated with utilizing compression,
the effect appeared to be negligible, even on an NVMe-backed cluster.
Despite restricting NVMe OSDs to 2 cores so that they were extremely
CPU bound during PUT operations, enabling compression had no notable
effect on PUT performance.

2) Lower GET throughput on NVMe

We noticed a very slight performance hit during GET operations on
NVMe-backed clusters, though the effect was primarily observed when
using Snappy compression rather than LZ4.  LZ4 GET performance was
very close to that of uncompressed RocksDB.

3) Other performance impact

Other potential concerns include lower performance during iteration
or other actions; however, I expect this to be unlikely.  RocksDB
typically performs best when it can read data from SST files in large
chunks and then work from the block cache.  Large readahead values
tend to be a win, either to read data into the block cache or so that
data can be read quickly from the kernel page cache.  As far as I can
tell, compression is not having a negative impact here and in fact may
be helping in cases where the disk is already quite busy.  In general,
we are already completely dependent on our own in-memory caches for
things like bluestore onodes to achieve high performance on
NVMe-backed OSDs.

More importantly, the goal on 16.2.13+ should be to reduce the overhead
of iterating over tombstones, and our primary method to do this right
now is to issue compactions on iteration when too many tombstones are
encountered.  Reducing the impact of compaction directly benefits this
goal.

Why LZ4 Compression?
====================

Snappy and LZ4 compression are both potential default options.  Ceph
previously had a bug related to LZ4 compression that could corrupt data,
so on the surface it might be tempting to default to Snappy
compression.  However, there are several reasons why I believe we
should use LZ4 compression by default.

1) The LZ4 bug is fixed, and there have been no reports of issues since
the fix was put in place.

2) The Google developers have made changes to Snappy's build system that
impact Ceph.  Many distributions are working around these changes, but
the Google developers have explicitly stated that they plan to support
only Google-specific use cases:

"We are unlikely to accept contributions to the build configuration
files, such as CMakeLists.txt. We are focused on maintaining a build
configuration that allows us to test that the project works in a few
supported configurations inside Google. We are not currently interested
in supporting other requirements, such as different operating systems,
compilers, or build systems."

https://github.com/google/snappy/blob/main/README.md#contributing-to-the-snappy-project

3) LZ4 compression showed less of a performance impact than Snappy
during RGW 4KB object GETs.  Snappy showed no performance gains over
LZ4 in any of the other tests, nor did it show a meaningful
compression advantage.

Impact on existing clusters
===========================

Enabling or disabling compression in RocksDB requires an OSD restart,
but otherwise does not require user action.  After enabling, SST files
will gradually be compressed over time as part of the compaction
process, and a manual compaction can be issued to accelerate this.
The same applies when disabling compression: new uncompressed SST files
will be written over time as part of the compaction process, and a
manual compaction can be issued to accelerate it.
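
For example (a hedged sketch: overriding bluestore_rocksdb_options
replaces the entire option string, so '...' below stands in for the
remainder of your release's default string):

  # change the bluestore RocksDB compression setting, then restart the OSDs
  ceph config set osd bluestore_rocksdb_options 'compression=kLZ4Compression,...'
  # optionally accelerate re-compression of existing SST files
  ceph tell osd.0 compact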

Conclusion
==========

In general, enabling RocksDB compression in bluestore appears to be a
dramatic win.  I would like to make this our default behavior for Squid
going forward, assuming no issues are uncovered during teuthology testing.

Signed-off-by: Mark Nelson <mark.nelson@clyso.com>
2023-09-08 11:21:30 -05:00