RBM's paddr always indicates a physical address, which means it never
takes the delayed form.
So, this commit adds a check for whether the given paddr is used by an
ongoing write.
Signed-off-by: Myoungwon Oh <myoungwon.oh@samsung.com>
We write CephFS commands incompletely in docs. For example, "ceph tell
mds.a help" is written simply as "tell mds.a help". This might confuse
readers, and it does no harm to write the commands in full.
Fixes: https://tracker.ceph.com/issues/62791
Signed-off-by: Rishabh Dave <ridave@redhat.com>
Adds subvolume groups to the subvolume tabs so that subvolumes can be
selected from the appropriate group.
Also adds the ability to manage a subvolume's subvolume group in the
different actions: create, edit, and remove.
Fixes: https://tracker.ceph.com/issues/62675
Signed-off-by: Pedro Gonzalez Gomez <pegonzal@redhat.com>
Link to the "Ceph Clients" section of doc/architecture.rst from the
"Ceph Clients" entry in the glossary. A glossary entry should be a short
summary of the topic with which it deals, and it should direct the
reader to further and more detailed reading if the reader is interested.
This does that.
Signed-off-by: Zac Dover <zac.dover@proton.me>
mgr/dashboard: add validator for size field in the forms
Reviewed-by: Pedro Gonzalez Gomez <pegonzal@redhat.com>
Reviewed-by: Aashish Sharma <aasharma@redhat.com>
Reviewed-by: Avan Thakkar <athakkar@redhat.com>
be erased
Otherwise, the following modification sequence within the same
transaction might lead to onode extents' CRC inconsistency during
journal replay:
1. modify the last mapping in an onode extent;
2. erase the last mapping in that onode extent.
During journal replay, if the first modification is not recorded in the
delta, the onode extent's content would be inconsistent with its content
before the system reboot, as the sketch below illustrates.
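A toy model (hypothetical, not seastore's actual structures) of why the
first modification must still be recorded: the checksum stored at commit
time covers the whole extent buffer, so a replay that skips the
modification delta rebuilds content that no longer matches it:

    #include <cassert>
    #include <cstdint>
    #include <numeric>
    #include <vector>

    // Toy model only: an "extent" is raw bytes plus a checksum taken
    // over the whole buffer at commit time.
    struct Extent {
        std::vector<uint8_t> bytes;
        uint32_t checksum() const {
            return std::accumulate(bytes.begin(), bytes.end(), 0u);
        }
    };

    int main() {
        Extent committed{{1, 2, 3, 4}};  // on-disk image before the txn
        Extent inmem = committed;

        // Transaction: (1) modify the last mapping's bytes, then
        // (2) erase that mapping. In this model the erase only drops
        // the mapping; the byte stays in the buffer, so the commit-time
        // checksum still covers the modified value.
        inmem.bytes[3] = 9;
        uint32_t crc_at_commit = inmem.checksum();

        // A replay that skips the modification delta starts from the
        // pre-reboot image and never reproduces the committed content.
        Extent replayed = committed;
        assert(replayed.checksum() != crc_at_commit);  // CRC mismatch
    }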
Signed-off-by: Xuehan Xu <xuxuehan@qianxin.com>
Restructure a sentence because the verb
"experience" looked like the abstract noun "experience" when I read it
with fresh eyes. I chose the perhaps TESOL-unfriendly verb "incur", but
I believe it is right.
Signed-off-by: Zac Dover <zac.dover@proton.me>
rgw: add a qualifier to 'move'
Clang now requires fully qualifying std::move, as per
https://reviews.llvm.org/D119670?id=408276
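As a minimal sketch (not the actual rgw code): when the argument's type
lives in namespace std, an unqualified call to move() still resolves to
std::move through argument-dependent lookup, but recent Clang flags it
(-Wunqualified-std-cast-call):

    #include <string>
    #include <utility>
    #include <vector>

    void append(std::vector<std::string>& v, std::string s) {
        // Before: v.push_back(move(s));  // unqualified; Clang now warns
        v.push_back(std::move(s));        // After: fully qualified
    }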
Signed-off-by: Ronen Friedman <rfriedma@redhat.com>
Edit doc/architecture.rst up to "Cluster Map", but not including
"Cluster Map".
Co-authored-by: Anthony D'Atri <anthony.datri@gmail.com>
Signed-off-by: Zac Dover <zac.dover@proton.me>
In the fall of 2022, we tested LZ4 RocksDB compression in bluestore on
NVMe backed OSDs here:
https://ceph.io/en/news/blog/2022/rocksdb-tuning-deep-dive/
Since then, we've gotten feedback from users in the field who have
tested compression with extremely positive results. Clyso has also
worked with a customer running a large RGW deployment that has seen
similarly positive results.
Advantages of using compression
===============================
1) Significantly lower write amplification and space amplification.
In the article above, we saw a 4X reduction in space usage in RocksDB when
writing very small (4KB) objects to RGW. On a real production cluster with
1.3 billion objects, Clyso observed a space usage reduction closer to
2.2X, which was still a substantial improvement. This win is important in
multiple cluster configurations:
1A) Pure HDD
Pure HDD clusters are often seek limited under load. This directly impacts
how quickly RocksDB can write data out, which can increase compaction times.
1B) Hybrid Clusters (HDD Block + Flash DB/WAL)
In this configuration, spillover to the HDD can become a concern when
there isn't enough space on the flash devices to hold all RocksDB
SST files for all of the associated OSDs. Compression has a
dramatic effect on being able to store all SST files in flash and avoid
spillover.
1C) Pure Flash based clusters
A primary concern for pure flash based clusters is write amplification
and eventual wear-out of the flash under write-intensive scenarios.
RocksDB compression not only reduces space-amplification but also
write-amplification. That means lower wear on the flash cells and
longer flash life.
2) Reduced Compaction Times
The customer cluster that Clyso worked with utilized an HDD-only
configuration. Prior to enabling RocksDB compression, this cluster
could take up to several days to complete a manual compaction of a given
OSD during live operation. Enabling LZ4 compression in RocksDB reduced
manual compaction time to roughly 25-30 minutes, with ~2 hours being
the longest manual compaction time observed.
Potential Disadvantages of RocksDB Compression
==============================================
1) Increased CPU usage
While there is CPU usage overhead associated with utilizing compression,
the effect appeared to be negligible, even on an NVMe backed cluster.
Despite restricting NVMe OSDs to 2 cores so that they were extremely
CPU bound during PUT operations, enabling compression had no notable
effect on PUT performance.
2) Lower GET throughput on NVMe
We noticed a very slight performance hit during GET operations on NVMe
backed clusters, though the effect was primarily observed when using
Snappy compression rather than LZ4. LZ4 GET performance was very close
to that of uncompressed RocksDB.
3) Other performance impact
Potential other concerns might include lower performance during
iteration or other actions; however, I expect this to be unlikely.
RocksDB typically performs best when it can read data from SST files in
large chunks and then work from the block cache. Large readahead values
tend to be a win, either to read data into the block cache or so that
data can be read quickly from the kernel page cache. As far as I can
tell, compression is not having a negative impact here and in fact may be
helping in cases where the disk is already quite busy. In general, we
are already completely dependent on our own in-memory caches for things like
bluestore onodes to achieve high performance on NVMe backed OSDs.
More importantly, the goal on 16.2.13+ should be to reduce the overhead
of iterating over tombstones, and our primary method to do this right
now is to issue compactions on iteration when too many tombstones are
encountered. Reducing the impact of compaction directly benefits this
goal.
Why LZ4 Compression?
====================
Snappy and LZ4 compression are both potential default options. Ceph
previously had a bug related to LZ4 compression that could corrupt data,
so on the surface it might be tempting to default to using Snappy
compression. However, there are several reasons why I believe we should
use LZ4 compression by default.
1) The LZ4 bug is fixed, and there have been no reports of issues since
the fix was put in place.
2) The Google developers have made changes to Snappy's build system that
impact Ceph. Many distributions are working around these changes, but
the Google developers have explicitly stated that they plan to support
only Google-specific use cases:
"We are unlikely to accept contributions to the build configuration
files, such as CMakeLists.txt. We are focused on maintaining a build
configuration that allows us to test that the project works in a few
supported configurations inside Google. We are not currently interested
in supporting other requirements, such as different operating systems,
compilers, or build systems."
https://github.com/google/snappy/blob/main/README.md#contributing-to-the-snappy-project
3) LZ4 compression showed less of a performance impact than Snappy
during RGW 4KB object GETs. Snappy showed no performance gains over LZ4
in any of the other tests, nor did it appear to show a meaningful
compression advantage.
Impact on existing clusters
===========================
Enabling/Disabling compression in RocksDB will require an OSD restart,
but otherwise does not require user action. SST files will gradually be
compressed over time as part of the compaction process. A manual
compaction can be issued to help accelerate this process. The same
applies if users would like to disable compression: new uncompressed SST
files will be written over time as part of the compaction process, and a
manual compaction can be issued to accelerate it.
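As a hypothetical sketch (the exact option string differs per cluster,
and these commands are illustrative rather than prescriptive):
compression is controlled through the compression token of the
bluestore_rocksdb_options option string, and a manual compaction can be
requested per OSD:

    # bluestore_rocksdb_options is a full RocksDB options string, so
    # preserve the existing tokens when adding compression=kLZ4Compression.
    ceph config get osd bluestore_rocksdb_options
    ceph config set osd bluestore_rocksdb_options \
        "compression=kLZ4Compression,<existing options>"

    # After restarting the OSD, optionally trigger a manual compaction
    # to accelerate rewriting the SST files:
    ceph tell osd.0 compact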
Conclusion
==========
In general, enabling RocksDB compression in bluestore appears to be a
dramatic win. I would like to make this our default behavior for Squid
going forward, assuming no issues are uncovered during teuthology
testing.
Signed-off-by: Mark Nelson <mark.nelson@clyso.com>