mirror of
https://github.com/ceph/ceph
synced 2024-12-12 14:39:05 +00:00
54a75a0e40
doc: improve pending release notes and CephFS Reviewed-by: Zac Dover <zac.dover@proton.me> Reviewed-by: Venky Shankar <vshankar@redhat.com> Reviewed-by: Patrick Donnelly <pdonnell@ibm.com>
578 lines
36 KiB
Plaintext
>=20.0.0

* RBD: All Python APIs that produce timestamps now return "aware" `datetime`
  objects (those that include time zone information) instead of "naive" ones
  (those that do not). All timestamps remain in UTC, but including
  `timezone.utc` makes this explicit and avoids the risk of the returned
  timestamp being misinterpreted: in Python 3, many `datetime` methods treat
  "naive" `datetime` objects as local times.
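A minimal sketch of what the naive/aware distinction means for calling code, using only the standard library. The helper name `ensure_aware_utc` and the sample values are illustrative assumptions and are not part of the RBD bindings.

```python
from datetime import datetime, timezone


def ensure_aware_utc(ts: datetime) -> datetime:
    """Return `ts` as an aware UTC datetime.

    Naive values are assumed to already be in UTC (as with older releases);
    aware values are converted to UTC.
    """
    if ts.tzinfo is None:
        return ts.replace(tzinfo=timezone.utc)
    return ts.astimezone(timezone.utc)


# Hypothetical sample values standing in for timestamps returned by the bindings.
naive = datetime(2024, 1, 1, 12, 0, 0)                        # older style
aware = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)   # newer style

# An aware timestamp maps to one fixed instant regardless of the local zone,
# whereas naive.timestamp() would be interpreted as local time.
print(aware.timestamp())

# Ordering comparisons between naive and aware values raise TypeError,
# so normalize both sides before comparing.
print(ensure_aware_utc(naive) == ensure_aware_utc(aware))
```

Code that must work against both older and newer bindings could pass every returned timestamp through such a helper before comparing or serializing it.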
* RBD: `rbd group info` and `rbd group snap info` commands are introduced to
  show information about a group and a group snapshot respectively.
* RBD: `rbd group snap ls` output now includes the group snapshot IDs. The header
  of the column showing the state of a group snapshot in the unformatted CLI
  output is changed from 'STATUS' to 'STATE'. The state of a group snapshot
  that was shown as 'ok' is now shown as 'complete', which is more descriptive.
* Based on tests performed at scale on an HDD-based Ceph cluster, it was found
  that scheduling with mClock was not optimal with multiple OSD shards. For
  example, in the test cluster with multiple OSD node failures, the client
  throughput was found to be inconsistent across test runs, coupled with multiple
  reported slow requests. However, the same test with a single OSD shard and
  with multiple worker threads yielded significantly better results in terms of
  consistency of client and recovery throughput across multiple test runs.
  Therefore, as an interim measure until the issue with multiple OSD shards
  (or multiple mClock queues per OSD) is investigated and fixed, the following
  changes to the default option values have been made:
  - osd_op_num_shards_hdd = 1 (was 5)
  - osd_op_num_threads_per_shard_hdd = 5 (was 1)
  For more details see https://tracker.ceph.com/issues/66289.
* MGR: The Ceph Manager's always-on modules/plugins can now be force-disabled.
  This can be necessary in cases where we wish to prevent the manager from being
  flooded by module commands when Ceph services are down or degraded.

* CephFS: Modifying the setting "max_mds" when a cluster is
  unhealthy now requires users to pass the confirmation flag
  (--yes-i-really-mean-it). This has been added as a precaution to warn
  users that modifying "max_mds" may not help with troubleshooting or recovery
  efforts. Instead, it might further destabilize the cluster.

* mgr/restful, mgr/zabbix: both modules, deprecated since 2020, have
  finally been removed. They have not been actively maintained in recent years
  and had started to suffer from vulnerabilities in their dependency chain (e.g.
  CVE-2023-46136). As an alternative to the `restful` module, the `dashboard` module
  provides a richer and better maintained RESTful API. As for the `zabbix` module,
  there are alternative monitoring solutions, like `prometheus`, which is the most
  widely adopted among the Ceph user community.

* CephFS: EOPNOTSUPP (Operation not supported) is now returned by the CephFS
  FUSE client for `fallocate` in the default case (i.e. mode == 0) since
  CephFS does not support disk space reservation. The only flags supported are
  `FALLOC_FL_KEEP_SIZE` and `FALLOC_FL_PUNCH_HOLE`.

>=19.0.0

* cephx: key rotation is now possible using `ceph auth rotate`. Previously,
  this was only possible by deleting and then recreating the key.
* Ceph: a new --daemon-output-file switch is available for `ceph tell` commands
  to dump output to a file local to the daemon. For commands which produce
  large amounts of output, this avoids a potential spike in memory usage on the
  daemon, allows for faster streaming writes to a file local to the daemon, and
  reduces time holding any locks required to execute the command. For analysis,
  it is necessary to retrieve the file from the host running the daemon
  manually. Currently, only --format=json|json-pretty are supported.
* RGW: GetObject and HeadObject requests now return an x-rgw-replicated-at
  header for replicated objects. This timestamp can be compared against the
  Last-Modified header to determine how long the object took to replicate.
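A rough sketch of the comparison described for the replication header, using only the standard library. It assumes that `x-rgw-replicated-at` uses the same HTTP-date format as `Last-Modified`, which the note does not state; the sample header values are invented for illustration.

```python
from datetime import timedelta
from email.utils import parsedate_to_datetime


def replication_delay(headers: dict) -> timedelta:
    """Return how long after Last-Modified the object was replicated.

    Assumption: both headers carry an HTTP-date such as
    'Tue, 02 Jan 2024 10:00:00 GMT'.
    """
    last_modified = parsedate_to_datetime(headers["Last-Modified"])
    replicated_at = parsedate_to_datetime(headers["x-rgw-replicated-at"])
    return replicated_at - last_modified


# Hypothetical response headers from a HeadObject request on a replica.
sample_headers = {
    "Last-Modified": "Tue, 02 Jan 2024 10:00:00 GMT",
    "x-rgw-replicated-at": "Tue, 02 Jan 2024 10:00:07 GMT",
}

print(replication_delay(sample_headers))
```

Real HTTP clients usually expose headers through a case-insensitive mapping, so the exact key spelling used here may need adjusting.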
* The cephfs-shell utility is now packaged for RHEL / CentOS / Rocky 9 because the
  required Python dependencies are now available in EPEL9.
* RGW: S3 multipart uploads using Server-Side Encryption now replicate correctly in
  multi-site deployments. Previously, replicas of such objects were corrupted on decryption.
  A new tool, ``radosgw-admin bucket resync encrypted multipart``, can be used to
  identify these original multipart uploads. The ``LastModified`` timestamp of any
  identified object is incremented by one nanosecond to cause peer zones to replicate it again.
  For multi-site deployments that make use of Server-Side Encryption, we
  recommend running this command against every bucket in every zone after all
  zones have upgraded.
* Tracing: The blkin tracing feature (see https://docs.ceph.com/en/reef/dev/blkin/)
  is now deprecated in favor of Opentracing (https://docs.ceph.com/en/reef/dev/developer_guide/jaegertracing/)
  and will be removed in a later release.
* RGW: Introducing a new data layout for the Topic metadata associated with S3
  Bucket Notifications, where each Topic is stored as a separate RADOS object
  and the bucket notification configuration is stored in a bucket attribute.
  This new representation supports multisite replication via metadata sync and
  can scale to many topics. This is on by default for new deployments, but
  is not enabled by default on upgrade. Once all radosgws have upgraded (on all
  zones in a multisite configuration), the ``notification_v2`` zone feature can
  be enabled to migrate to the new format. See
  https://docs.ceph.com/en/squid/radosgw/zone-features for details. The "v1"
  format is now considered deprecated and may be removed after 2 major releases.
* CephFS: The MDS now evicts clients that are not advancing their request tids. Such
  clients cause a large buildup of session metadata, which in turn results in the MDS
  going read-only due to RADOS operations exceeding the size threshold. The
  `mds_session_metadata_threshold` config option controls the maximum size to which
  (encoded) session metadata can grow.
* CephFS: A new "mds last-seen" command is available for querying the last time
  an MDS was in the FSMap, subject to a pruning threshold.
* CephFS: For clusters with multiple CephFS file systems, all snap-schedule
  commands now expect the '--fs' argument.
* CephFS: The period specifier ``m`` now implies minutes and the period specifier
  ``M`` now implies months. This is consistent with the rest of the system.
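The case sensitivity of the period specifier can be illustrated with a small parser sketch. It assumes a period is written as a number followed by a one-letter unit (for example `15m`), and it maps only the two specifiers named above; the function name and table are hypothetical and do not reflect the module's actual code.

```python
import re
from typing import Tuple

# Only the two specifiers named in the note are mapped; other units are omitted.
PERIOD_UNITS = {"m": "minutes", "M": "months"}


def parse_period(spec: str) -> Tuple[int, str]:
    """Split a period such as '15m' into (count, unit name), case-sensitively."""
    match = re.fullmatch(r"(\d+)([A-Za-z])", spec)
    if match is None or match.group(2) not in PERIOD_UNITS:
        raise ValueError(f"unsupported period specifier: {spec!r}")
    return int(match.group(1)), PERIOD_UNITS[match.group(2)]


print(parse_period("15m"))
print(parse_period("1M"))
```

Scripts that generate schedules should therefore avoid normalizing the case of the unit letter.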
* RGW: New tools have been added to radosgw-admin for identifying and
  correcting issues with versioned bucket indexes. Historical bugs with the
  versioned bucket index transaction workflow made it possible for the index
  to accumulate extraneous "book-keeping" olh entries and plain placeholder
  entries. In some specific scenarios where clients made concurrent requests
  referencing the same object key, it was likely that extra index
  entries would accumulate. When a significant number of these entries are
  present in a single bucket index shard, they can cause high bucket listing
  latency and lifecycle processing failures. To check whether a versioned
  bucket has unnecessary olh entries, users can now run ``radosgw-admin
  bucket check olh``. If the ``--fix`` flag is used, the extra entries will
  be safely removed. An additional issue is that some versioned buckets
  may maintain extra unlinked objects that are not listable via the S3/Swift
  APIs. These extra objects are typically a result of PUT requests that
  exited abnormally in the middle of a bucket index transaction, and thus
  the client would not have received a successful response. Bugs in prior
  releases made these unlinked objects easy to reproduce with any PUT
  request made on a bucket that was actively resharding. In certain
  scenarios, a client of a bucket that was a victim of this bug may find
  the object associated with the key to be in an inconsistent state. To check
  whether a versioned bucket has unlinked entries, users can now run
  ``radosgw-admin bucket check unlinked``. If the ``--fix`` flag is used,
  the unlinked objects will be safely removed. Finally, a third issue made
  it possible for versioned bucket index stats to be accounted inaccurately.
  The tooling for recalculating versioned bucket stats also had a bug, and
  was not previously capable of fixing these inaccuracies. This release
  resolves those issues and users can now expect that the existing
  ``radosgw-admin bucket check`` command will produce correct results.
  We recommend that users with versioned buckets, especially those that
  existed on prior releases, use these new tools to check whether their
  buckets are affected and to clean them up accordingly.
* RGW: The "user accounts" feature unlocks several new AWS-compatible IAM APIs
  for self-service management of users, keys, groups, roles, policy and
  more. Existing users can be adopted into new accounts. This process is optional
  but irreversible. See https://docs.ceph.com/en/squid/radosgw/account and
  https://docs.ceph.com/en/squid/radosgw/iam for details.
* RGW: On startup, radosgw and radosgw-admin now validate the ``rgw_realm``
  config option. Previously, they would ignore invalid or missing realms and
  go on to load a zone/zonegroup in a different realm. If startup fails with
  a "failed to load realm" error, fix or remove the ``rgw_realm`` option.
* RGW: The radosgw-admin commands ``realm create`` and ``realm pull`` no
  longer set the default realm without ``--default``.
* CephFS: Running the command "ceph fs authorize" for an existing entity now
  upgrades the entity's capabilities instead of printing an error. It can now
  also change read/write permissions in a capability that the entity already
  holds. If the capability passed by the user is the same as one of the capabilities
  that the entity already holds, idempotency is maintained.
* CephFS: Two FS names can now be swapped, optionally along with their IDs,
  using the "ceph fs swap" command. The function of this API is to facilitate
  file system swaps for disaster recovery. In particular, it avoids situations
  where a named file system is temporarily missing, which would prompt a higher
  level storage operator (like Rook) to recreate the missing file system.
  See https://docs.ceph.com/en/latest/cephfs/administration/#file-systems
  for more information.
* CephFS: Before running the command "ceph fs rename", the file system to be
  renamed must be offline and the config "refuse_client_session" must be set
  for it. The config "refuse_client_session" can be removed/unset and the
  file system can be brought back online after the rename operation is complete.
* RADOS: A POOL_APP_NOT_ENABLED health warning will now be reported if
  the application is not enabled for the pool irrespective of whether
  the pool is in use or not. Always tag a pool with an application
  using the ``ceph osd pool application enable`` command to avoid reporting
  of the POOL_APP_NOT_ENABLED health warning for that pool.
  The user might temporarily mute this warning using
  ``ceph health mute POOL_APP_NOT_ENABLED``.
* The `mon_cluster_log_file_level` and `mon_cluster_log_to_syslog_level` options
  have been removed. Henceforth, users should use the new generic option
  `mon_cluster_log_level` to control the cluster log level verbosity for the cluster
  log file as well as for all external entities.
* CephFS: Disallow delegating preallocated inode ranges to clients. Config
  `mds_client_delegate_inos_pct` defaults to 0 which disables async dirops
  in the kclient.
* S3 Get/HeadObject now support query parameter `partNumber` to read a specific
  part of a completed multipart upload.
* RGW: Fixed an S3 Object Lock bug with PutObjectRetention requests that specify
  a RetainUntilDate after the year 2106. This date was truncated to 32 bits when
  stored, so a much earlier date was used for object lock enforcement. This does
  not affect PutBucketObjectLockConfiguration where a duration is given in Days.
  The RetainUntilDate encoding is fixed for new PutObjectRetention requests, but
  cannot repair the dates of existing object locks. Such objects can be identified
  with a HeadObject request based on the x-amz-object-lock-retain-until-date
  response header.
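The year 2106 boundary follows from 32-bit arithmetic, which the sketch below illustrates. It assumes the stored value was an unsigned 32-bit count of seconds since the Unix epoch; the note only says the date was "truncated to 32 bits", so the exact encoding is an assumption, and the helper names are illustrative.

```python
from datetime import datetime, timezone

UINT32_MASK = 0xFFFFFFFF  # largest value an unsigned 32-bit field can hold


def truncate_to_32_bits(epoch_seconds: int) -> int:
    """Keep only the low 32 bits, as a too-narrow storage field would."""
    return epoch_seconds & UINT32_MASK


def to_utc(epoch_seconds: int) -> datetime:
    return datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)


# A hypothetical RetainUntilDate just past the 32-bit limit.
requested = datetime(2107, 1, 1, tzinfo=timezone.utc)
requested_seconds = int(requested.timestamp())
stored_seconds = truncate_to_32_bits(requested_seconds)

print(to_utc(UINT32_MASK))     # last instant representable in 32 bits
print(to_utc(stored_seconds))  # the much earlier date that would be enforced
```

Under this assumption, any retain-until date that reads back as implausibly early on an object locked by an older release is a candidate for re-applying retention after upgrading.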
* RADOS: `get_pool_is_selfmanaged_snaps_mode` C++ API has been deprecated
  due to being prone to false negative results. Its safer replacement is
  `pool_is_in_selfmanaged_snaps_mode`.
* RADOS: For bug 62338 (https://tracker.ceph.com/issues/62338), in order to simplify
  backporting, we chose not to condition the fix on a server flag. As
  a result, in rare cases it may be possible for a PG to flip between two acting
  sets while an upgrade to a version with the fix is in progress. If you observe
  this behavior, you should be able to work around it by completing the upgrade or
  by disabling async recovery by setting osd_async_recovery_min_cost to a very
  large value on all OSDs until the upgrade is complete:
  ``ceph config set osd osd_async_recovery_min_cost 1099511627776``
* RADOS: A detailed version of the `balancer status` CLI command in the balancer
  module is now available. Users may run `ceph balancer status detail` to see more
  details about which PGs were updated in the balancer's last optimization.
  See https://docs.ceph.com/en/latest/rados/operations/balancer/ for more information.
* CephFS: Full support for subvolumes and subvolume groups is now available
  for the snap_schedule Manager module.
* RGW: The SNS CreateTopic API now enforces the same topic naming requirements as AWS:
  Topic names must be made up of only uppercase and lowercase ASCII letters, numbers,
  underscores, and hyphens, and must be between 1 and 256 characters long.
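Clients that create topics may want to check names before calling CreateTopic. The following is a sketch of the naming rule as stated above; the function and constant names are illustrative and this is not RGW's own validation code.

```python
import re

# Uppercase and lowercase ASCII letters, digits, underscores and hyphens,
# between 1 and 256 characters long.
TOPIC_NAME_RE = re.compile(r"[A-Za-z0-9_-]{1,256}")


def is_valid_topic_name(name: str) -> bool:
    """Return True if `name` satisfies the stated naming requirements."""
    return TOPIC_NAME_RE.fullmatch(name) is not None


for candidate in ("my-topic_1", "bad.name", "", "a" * 257):
    print(repr(candidate[:20]), is_valid_topic_name(candidate))
```

`fullmatch` is used instead of `match` with a `$` anchor so that a trailing newline in the input is rejected too.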
* RBD: When diffing against the beginning of time (`fromsnapname == NULL`) in
  fast-diff mode (`whole_object == true` with `fast-diff` image feature enabled
  and valid), diff-iterate is now guaranteed to execute locally if exclusive
  lock is available. This brings a dramatic performance improvement for QEMU
  live disk synchronization and backup use cases.
* RBD: The ``try-netlink`` mapping option for rbd-nbd has become the default
  and is now deprecated. If the NBD netlink interface is not supported by the
  kernel, then the mapping is retried using the legacy ioctl interface.
* RADOS: Read balancing may now be managed automatically via the balancer
  manager module. Users may choose between two new modes: ``upmap-read``, which
  offers upmap and read optimization simultaneously, or ``read``, which may be used
  to only optimize reads. For more detailed information see https://docs.ceph.com/en/latest/rados/operations/read-balancer/#online-optimization.
* CephFS: MDS log trimming is now driven by a separate thread which tries to
  trim the log every second (`mds_log_trim_upkeep_interval` config). Also,
  a couple of configs govern how much time the MDS spends in trimming its
  logs. These configs are `mds_log_trim_threshold` and `mds_log_trim_decay_rate`.
* RGW: Notification topics are now owned by the user that created them.
  By default, only the owner can read/write their topics. Topic policy documents
  are now supported to grant these permissions to other users. Preexisting topics
  are treated as if they have no owner, and any user can read/write them using the SNS API.
  If such a topic is recreated with CreateTopic, the issuing user becomes the new owner.
  For backward compatibility, all users still have permission to publish bucket
  notifications to topics owned by other users. A new configuration parameter,
  ``rgw_topic_require_publish_policy``, can be enabled to deny ``sns:Publish``
  permissions unless explicitly granted by topic policy.
* RGW: Fixed an issue with persistent notifications: changes made to topic parameters
  while persistent notifications were in the queue are now reflected in those notifications.
  If a user sets up a topic with an incorrect configuration (password/ssl) that causes
  delivery of notifications to the broker to fail, the user can now modify the incorrect
  topic attribute, and the new configuration will be used on the next delivery retry.
* RBD: The option ``--image-id`` has been added to `rbd children` CLI command,
  so it can be run for images in the trash.
* PG dump: The default output of `ceph pg dump --format json` has changed. The
  default json format produces a rather massive output in large clusters and
  isn't scalable. So we have removed the 'network_ping_times' section from
  the output. Details in the tracker: https://tracker.ceph.com/issues/57460
* mgr/REST: The REST manager module will trim requests based on the 'max_requests' option.
  Without this feature, and in the absence of manual deletion of old requests,
  the accumulation of requests in the array can lead to Out Of Memory (OOM) issues,
  resulting in the Manager crashing.

* CephFS: The `subvolume snapshot clone` command now depends on the config option
  `snapshot_clone_no_wait` which is used to reject the clone operation when
  all the cloner threads are busy. This config option is enabled by default which means
  that if no cloner threads are free, the clone request errors out with EAGAIN.
  The value of the config option can be fetched by using:
  `ceph config get mgr mgr/volumes/snapshot_clone_no_wait`
  and it can be disabled by using:
  `ceph config set mgr mgr/volumes/snapshot_clone_no_wait false`
* RBD: `RBD_IMAGE_OPTION_CLONE_FORMAT` option has been exposed in Python
  bindings via `clone_format` optional parameter to `clone`, `deep_copy` and
  `migration_prepare` methods.
* RBD: `RBD_IMAGE_OPTION_FLATTEN` option has been exposed in Python bindings via
  `flatten` optional parameter to `deep_copy` and `migration_prepare` methods.

* CephFS: The commands "ceph mds fail" and "ceph fs fail" now require a
  confirmation flag when some MDSs exhibit health warning MDS_TRIM or
  MDS_CACHE_OVERSIZED. This is to prevent accidental MDS failover causing
  further delays in recovery.
* CephFS: fixes to the implementation of the ``root_squash`` mechanism enabled
  via cephx ``mds`` caps on a client credential require a new client feature
  bit, ``client_mds_auth_caps``. Clients using credentials with ``root_squash``
  without this feature will trigger the MDS to raise a HEALTH_ERR on the
  cluster, MDS_CLIENTS_BROKEN_ROOTSQUASH. See the documentation on this warning
  and the new feature bit for more information.
* CephFS: Expanded removexattr support for CephFS virtual extended attributes.
  Previously one had to use setxattr to restore the default in order to "remove".
  You may now properly use removexattr to remove. You can also now remove the layout
  on the root inode, which then restores the default layout.

* cls_cxx_gather is marked as deprecated.
* CephFS: cephfs-journal-tool is guarded against running on an online file system.
  The 'cephfs-journal-tool --rank <fs_name>:<mds_rank> journal reset' and
  'cephfs-journal-tool --rank <fs_name>:<mds_rank> journal reset --force'
  commands require '--yes-i-really-really-mean-it'.

* Dashboard: Rearranged Navigation Layout: The navigation layout has been reorganized
  for improved usability and easier access to key features.
* Dashboard: CephFS Improvements
  * Support for managing CephFS snapshots and clones, as well as snapshot schedule
    management
  * Manage authorization capabilities for CephFS resources
  * Helpers on mounting a CephFS volume
* Dashboard: RGW Improvements
  * Support for managing bucket policies
  * Add/Remove bucket tags
  * ACL Management
  * Several UI/UX Improvements to the bucket form
* Monitoring: Grafana dashboards are now loaded into the container at runtime rather than
  building a grafana image with the grafana dashboards. Official Ceph grafana images
  can be found in quay.io/ceph/grafana
* Monitoring: RGW S3 Analytics: A new Grafana dashboard is now available, enabling you to
  visualize per bucket and user analytics data, including total GETs, PUTs, Deletes,
  Copies, and list metrics.
* RBD: `Image::access_timestamp` and `Image::modify_timestamp` Python APIs now
  return timestamps in UTC.
* RBD: Support for cloning from non-user type snapshots is added. This is
  intended primarily as a building block for cloning new groups from group
  snapshots created with `rbd group snap create` command, but has also been
  exposed via the new `--snap-id` option for `rbd clone` command.
* RBD: The output of `rbd snap ls --all` command now includes the original
  type for trashed snapshots.
* CephFS: "ceph fs clone status" command will now print statistics about clone
  progress in terms of how much data has been cloned (in both percentage as
  well as bytes) and how many files have been cloned.
* CephFS: "ceph status" command will now print a progress bar when cloning is
  ongoing. If there are more clone jobs than cloner threads, it will print one
  more progress bar that shows the total amount of progress made by both ongoing
  as well as pending clones. Both progress bars are accompanied by messages that
  show the number of clone jobs in the respective categories and the amount of
  progress made by each of them.
* RGW: in bucket notifications, the `principalId` inside `ownerIdentity` now contains
  the complete user ID, prefixed with the tenant ID.

* NFS: The export create/apply of CephFS-based exports will now have an additional
  parameter `cmount_path` under the FSAL block, which specifies the path within the
  CephFS to mount this export on. If this and the other `EXPORT { FSAL {} }` options
  are the same between multiple exports, those exports will share a single CephFS
  client. If not specified, the default is `/`.

>=18.0.0

* The RGW policy parser now rejects unknown principals by default. If you are
  mirroring policies between RGW and AWS, you may wish to set
  "rgw policy reject invalid principals" to "false". This affects only newly set
  policies, not policies that are already in place.
* The CephFS automatic metadata load (sometimes called "default") balancer is
  now disabled by default. The new file system flag `balance_automate`
  can be used to toggle it on or off. It can be enabled or disabled via
  `ceph fs set <fs_name> balance_automate <bool>`.
* RGW's default backend for `rgw_enable_ops_log` changed from RADOS to file.
  The default value of `rgw_ops_log_rados` is now false, and `rgw_ops_log_file_path`
  defaults to "/var/log/ceph/ops-log-$cluster-$name.log".
* The SPDK backend for BlueStore is now able to connect to an NVMeoF target.
  Please note that this is not an officially supported feature.
* RGW's pubsub interface now returns boolean fields using bool. Before this change,
  `/topics/<topic-name>` returned "stored_secret" and "persistent" using a string
  of "true" or "false" with quotes around them. After this change, these fields
  are returned without quotes so they can be decoded as boolean values in JSON.
  The same applies to the `is_truncated` field returned by `/subscriptions/<sub-name>`.
* RGW's response of `Action=GetTopicAttributes&TopicArn=<topic-arn>` REST API now
  returns `HasStoredSecret` and `Persistent` as boolean in the JSON string
  encoded in `Attributes/EndPoint`.
* All boolean fields previously rendered as string by `rgw-admin` command when
  the JSON format is used are now rendered as boolean. If your scripts/tools
  rely on this behavior, please update them accordingly. The impacted field names
  are:
  * absolute
  * add
  * admin
  * appendable
  * bucket_key_enabled
  * delete_marker
  * exists
  * has_bucket_info
  * high_precision_time
  * index
  * is_master
  * is_prefix
  * is_truncated
  * linked
  * log_meta
  * log_op
  * pending_removal
  * read_only
  * retain_head_object
  * rule_exist
  * start_with_full_sync
  * sync_from_all
  * syncstopped
  * system
  * truncated
  * user_stats_sync
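Scripts that have to handle JSON from both older and newer releases can normalize these fields. The sketch below shows one way to do so; the helper `as_bool` and the sample documents are illustrative assumptions and are not taken from Ceph tooling.

```python
import json


def as_bool(value) -> bool:
    """Accept a JSON boolean or the legacy quoted "true"/"false" string."""
    if isinstance(value, bool):
        return value
    if isinstance(value, str) and value.lower() in ("true", "false"):
        return value.lower() == "true"
    raise ValueError(f"not a boolean value: {value!r}")


# Hypothetical output fragments: quoted strings (older) and real booleans (newer).
old_style = json.loads('{"is_truncated": "false", "exists": "true"}')
new_style = json.loads('{"is_truncated": false, "exists": true}')

for doc in (old_style, new_style):
    print(as_bool(doc["is_truncated"]), as_bool(doc["exists"]))
```

Relying on plain truthiness is the pitfall to avoid: the legacy string "false" is a non-empty string and therefore evaluates as true in Python.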
* RGW: The beast frontend's HTTP access log line uses a new debug_rgw_access
  configurable. This has the same defaults as debug_rgw, but can now be controlled
  independently.
* RBD: The semantics of compare-and-write C++ API (`Image::compare_and_write`
  and `Image::aio_compare_and_write` methods) now match those of C API. Both
  compare and write steps operate only on `len` bytes even if the respective
  buffers are larger. The previous behavior of comparing up to the size of
  the compare buffer was prone to subtle breakage upon straddling a stripe
  unit boundary.
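The `len`-bounded behavior can be modeled in a few lines. This is a pure-Python model over an in-memory buffer, written for illustration only; it does not call librbd, and the function signature and mismatch reporting are assumptions.

```python
from typing import Optional


def compare_and_write(image: bytearray, offset: int, length: int,
                      cmp_buf: bytes, write_buf: bytes) -> Optional[int]:
    """Model of a compare-and-write that honors `length` for both steps.

    Returns None on success, or the index (relative to `offset`) of the
    first mismatching byte. Bytes beyond `length` in either buffer are ignored.
    """
    if len(cmp_buf) < length or len(write_buf) < length:
        raise ValueError("buffers must hold at least `length` bytes")
    if offset < 0 or offset + length > len(image):
        raise ValueError("range falls outside the image")
    for i in range(length):
        if image[offset + i] != cmp_buf[i]:
            return i
    image[offset:offset + length] = write_buf[:length]
    return None


image = bytearray(b"AAAABBBB")
# Both buffers are 8 bytes long, but only the first 4 bytes take part.
result = compare_and_write(image, 0, 4, b"AAAAXXXX", b"CCCCYYYY")
print(result, bytes(image))
```

In this model the trailing `XXXX` of the compare buffer never participates, which is the difference from the previous C++ behavior described above.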
* RBD: compare-and-write operation is no longer limited to 512-byte sectors.
  Assuming proper alignment, it now allows operating on stripe units (4M by
  default).
* RBD: New `rbd_aio_compare_and_writev` API method to support scatter/gather
  on both compare and write buffers. This complements existing `rbd_aio_readv`
  and `rbd_aio_writev` methods.
* The 'AT_NO_ATTR_SYNC' macro is deprecated, please use the standard 'AT_STATX_DONT_SYNC'
  macro. The 'AT_NO_ATTR_SYNC' macro will be removed in the future.
* Trimming of PGLog dups is now controlled by the size instead of the version.
  This fixes the PGLog inflation issue that was happening when the on-line
  (in OSD) trimming got jammed after a PG split operation. Also, a new off-line
  mechanism has been added: `ceph-objectstore-tool` got `trim-pg-log-dups` op
  that targets situations where OSD is unable to boot due to those inflated dups.
  If that is the case, in OSD logs the "You can be hit by THE DUPS BUG" warning
  will be visible.
  Relevant tracker: https://tracker.ceph.com/issues/53729
* RBD: `rbd device unmap` command gained `--namespace` option. Support for
  namespaces was added to RBD in Nautilus 14.2.0 and it has been possible to
  map and unmap images in namespaces using the `image-spec` syntax since then
  but the corresponding option available in most other commands was missing.
* RGW: Compression is now supported for objects uploaded with Server-Side Encryption.
  When both are enabled, compression is applied before encryption. Earlier releases
  of multisite do not replicate such objects correctly, so all zones must upgrade to
  Reef before enabling the `compress-encrypted` zonegroup feature: see
  https://docs.ceph.com/en/reef/radosgw/multisite/#zone-features and note the
  security considerations.
* RGW: the "pubsub" functionality for storing bucket notifications inside Ceph
  is removed. Together with it, the "pubsub" zone should not be used anymore.
  The REST operations, as well as radosgw-admin commands for manipulating
  subscriptions, as well as fetching and acking the notifications are removed
  as well.
  In case the endpoint to which the notifications are sent may be down or
  disconnected, it is recommended to use persistent notifications to guarantee
  the delivery of the notifications. In case the system that consumes the
  notifications needs to pull them (instead of the notifications being pushed
  to it), an external message bus (e.g. RabbitMQ, Kafka) should be used for
  that purpose.
* RGW: The serialized format of notification and topics has changed, so that
  new/updated topics will be unreadable by old RGWs. We recommend completing
  the RGW upgrades before creating or modifying any notification topics.
* RBD: Trailing newline in passphrase files (`<passphrase-file>` argument in
  `rbd encryption format` command and `--encryption-passphrase-file` option
  in other commands) is no longer stripped.
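Because a trailing newline is now treated as part of the passphrase, tools that generate passphrase files may want to write the bytes exactly. The sketch below shows one way to do that and to check an existing file; the helper names and the sample passphrase are illustrative assumptions.

```python
import os
import tempfile


def write_passphrase_file(path: str, passphrase: str) -> None:
    """Write exactly the passphrase bytes, with no trailing newline added."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "wb") as handle:
        handle.write(passphrase.encode("utf-8"))


def has_trailing_newline(path: str) -> bool:
    """Report whether the file content ends with a newline byte."""
    with open(path, "rb") as handle:
        return handle.read().endswith(b"\n")


with tempfile.TemporaryDirectory() as tmpdir:
    passphrase_path = os.path.join(tmpdir, "passphrase.bin")
    write_passphrase_file(passphrase_path, "example-passphrase")
    print(has_trailing_newline(passphrase_path))
```

Files created with a text editor or with a plain `echo` typically do end in a newline, so a passphrase file prepared for an older release may need to be checked before reuse.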
* RBD: Support for layered client-side encryption is added. Cloned images
  can now be encrypted each with its own encryption format and passphrase,
  potentially different from that of the parent image. The efficient
  copy-on-write semantics intrinsic to unformatted (regular) cloned images
  are retained.
* CEPHFS: Rename the `mds_max_retries_on_remount_failure` option to
  `client_max_retries_on_remount_failure` and move it from mds.yaml.in to
  mds-client.yaml.in because this option has only ever been used by the MDS
  client.
* The `perf dump` and `perf schema` commands are deprecated in favor of new
  `counter dump` and `counter schema` commands. These new commands add support
  for labeled perf counters and also emit existing unlabeled perf counters. Some
  unlabeled perf counters became labeled in this release, with more to follow in
  future releases; such converted perf counters are no longer emitted by the
  `perf dump` and `perf schema` commands.
* `ceph mgr dump` command now outputs `last_failure_osd_epoch` and
  `active_clients` fields at the top level. Previously, these fields were
  output under `always_on_modules` field.
* `ceph mgr dump` command now displays the name of the mgr module that
  registered a RADOS client in the `name` field added to elements of the
  `active_clients` array. Previously, only the address of a module's RADOS
  client was shown in the `active_clients` array.
* RBD: All rbd-mirror daemon perf counters became labeled and as such are now
  emitted only by the new `counter dump` and `counter schema` commands. As part
  of the conversion, many also got renamed to better disambiguate journal-based
  and snapshot-based mirroring.
* RBD: list-watchers C++ API (`Image::list_watchers`) now clears the passed
  `std::list` before potentially appending to it, aligning with the semantics
  of the corresponding C API (`rbd_watchers_list`).
* The rados python binding is now able to process (opt-in) omap keys as bytes
  objects. This enables interacting with RADOS omap keys that are not decodable as
  UTF-8 strings.
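To illustrate why a bytes representation is needed for some omap keys, the sketch below shows a key that cannot be decoded as UTF-8. It uses only the standard library and does not call the rados binding; the helper name and sample keys are hypothetical.

```python
from typing import Union


def decode_key_or_keep_bytes(key: bytes) -> Union[str, bytes]:
    """Return the key as str when it is valid UTF-8, else the raw bytes."""
    try:
        return key.decode("utf-8")
    except UnicodeDecodeError:
        return key


text_key = b"object-name"     # decodes cleanly
binary_key = b"\xff\x00\x01"  # 0xff can never appear in valid UTF-8

print(repr(decode_key_or_keep_bytes(text_key)))
print(repr(decode_key_or_keep_bytes(binary_key)))
```

An API that only returns `str` keys has no faithful way to represent the second key, which is the case the opt-in bytes handling addresses.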
* Telemetry: Users who have opted in to telemetry can also opt in to
  participating in a leaderboard on the public telemetry
  dashboards (https://telemetry-public.ceph.com/). Users can now also add a
  description of the cluster to publicly appear in the leaderboard.
  For more details, see:
  https://docs.ceph.com/en/latest/mgr/telemetry/#leaderboard
  See a sample report with `ceph telemetry preview`.
  Opt in to telemetry with `ceph telemetry on`.
  Opt in to the leaderboard with
  `ceph config set mgr mgr/telemetry/leaderboard true`.
  Add a leaderboard description with:
  `ceph config set mgr mgr/telemetry/leaderboard_description 'Cluster description'`.
* CEPHFS: After recovering a Ceph File System by following the disaster recovery
  procedure, the recovered files under the `lost+found` directory can now be
  deleted.
* core: cache-tiering is now deprecated.
* mClock Scheduler: The mClock scheduler (default scheduler in Quincy) has
  undergone significant usability and design improvements to address the slow
  backfill issue. Some important changes are:
  * The 'balanced' profile is set as the default mClock profile because it
    represents a compromise between prioritizing client IO and recovery IO.
    Users can then choose either the 'high_client_ops' profile to prioritize
    client IO or the 'high_recovery_ops' profile to prioritize recovery IO.
  * QoS parameters like reservation and limit are now specified in terms of a
    fraction (range: 0.0 to 1.0) of the OSD's IOPS capacity.
  * The cost parameters (osd_mclock_cost_per_io_usec_* and
    osd_mclock_cost_per_byte_usec_*) have been removed. The cost of an operation
    is now determined using the random IOPS and maximum sequential bandwidth
    capability of the OSD's underlying device.
  * Degraded object recovery is given higher priority when compared to misplaced
    object recovery because degraded objects present a data safety issue not
    present with objects that are merely misplaced. Therefore, backfilling
    operations with the 'balanced' and 'high_client_ops' mClock profiles may
    progress more slowly than what was seen with the 'WeightedPriorityQueue'
    (WPQ) scheduler.
  * The QoS allocations in all the mClock profiles are optimized based on the
    above fixes and enhancements.
  * For more detailed information see:
    https://docs.ceph.com/en/latest/rados/configuration/mclock-config-ref/
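The fractional QoS parameters described above can be read as a share of the OSD's IOPS capacity. The arithmetic is sketched below; the function name `fraction_to_iops` and the capacity figure are hypothetical, and the sketch does not reproduce how the scheduler itself applies these values.

```python
def fraction_to_iops(fraction: float, capacity_iops: float) -> float:
    """Translate a QoS fraction (0.0 to 1.0) into absolute IOPS."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be within 0.0 and 1.0")
    return fraction * capacity_iops


# Hypothetical OSD capacity, chosen only to make the arithmetic easy to follow.
osd_capacity_iops = 300.0

reservation_iops = fraction_to_iops(0.5, osd_capacity_iops)
limit_iops = fraction_to_iops(1.0, osd_capacity_iops)

print(reservation_iops, limit_iops)
```

Expressing the values as fractions means the same setting scales with whatever capacity is measured or configured for each OSD.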
* mgr/snap_schedule: The snap-schedule mgr module now retains one snapshot
  fewer than the number set by the config tunable `mds_max_snaps_per_dir`
  so that a new snapshot can be created and retained during the next schedule
  run.
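The retention rule amounts to simple arithmetic, sketched here. The function name `retained_by_snap_schedule` and the example value 100 are illustrative assumptions; the actual module logic may handle further cases.

```python
def retained_by_snap_schedule(mds_max_snaps_per_dir: int) -> int:
    """Snapshots kept by the module: one fewer than the per-directory limit."""
    return max(mds_max_snaps_per_dir - 1, 0)


# Example value only; use the value configured on your cluster.
limit = 100
retained = retained_by_snap_schedule(limit)
headroom = limit - retained  # room left for the next scheduled snapshot

print(retained, headroom)
```

Leaving one slot free is what allows the next scheduled snapshot to be created without first hitting the per-directory limit.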
* The output of `ceph config dump --format <json|xml>` displays localized
  option names instead of their normalized versions. For example,
  "mgr/prometheus/x/server_port" is displayed instead of
  "mgr/prometheus/server_port". This matches the output of the non pretty-print
  formatted version of the command.
* CEPHFS: The MDS config option name "mds_kill_skip_replaying_inotable" was
  easily confused with "mds_inject_skip_replaying_inotable", so it has been
  renamed to "mds_kill_after_journal_logs_flushed".


>=17.2.1

* The "BlueStore zero block detection" feature (first introduced to Quincy in
  https://github.com/ceph/ceph/pull/43337) has been turned off by default with a
  new global configuration option called `bluestore_zero_block_detection`. This
  feature, intended for large-scale synthetic testing, does not interact well
  with some RBD and CephFS features. Any side effects experienced in previous
  Quincy versions no longer occur, provided that the option remains set to
  false.
  Relevant tracker: https://tracker.ceph.com/issues/55521

* telemetry: Added new Rook metrics to the 'basic' channel to report Rook's
  version, Kubernetes version, node metrics, etc.
  See a sample report with `ceph telemetry preview`.
  Opt-in with `ceph telemetry on`.

  For more details, see:

  https://docs.ceph.com/en/latest/mgr/telemetry/

* OSD: The issue of high CPU utilization during recovery/backfill operations
  has been fixed. For more details, see: https://tracker.ceph.com/issues/56530.

>=15.2.17

* OSD: Octopus modified the SnapMapper key format from
  <LEGACY_MAPPING_PREFIX><snapid>_<shardid>_<hobject_t::to_str()>
  to
  <MAPPING_PREFIX><pool>_<snapid>_<shardid>_<hobject_t::to_str()>
  When this change was introduced, 94ebe0e also introduced a conversion
  with a crucial bug which essentially destroyed legacy keys by mapping them
  to
  <MAPPING_PREFIX><poolid>_<snapid>_
  without the object-unique suffix. The conversion is fixed in this release.
  Relevant tracker: https://tracker.ceph.com/issues/56147

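Why the missing suffix is destructive can be shown with a small sketch: once the object-unique part is dropped, distinct legacy keys collapse onto the same converted key. The prefix strings, the number formatting, and the object names below are placeholders for illustration and do not reflect the real on-disk encoding.

```python
# Placeholder prefixes; the real values are defined in the Ceph source.
LEGACY_MAPPING_PREFIX = "LEGACY_"
MAPPING_PREFIX = "NEW_"


def legacy_key(snapid, shardid, obj):
    return f"{LEGACY_MAPPING_PREFIX}{snapid}_{shardid}_{obj}"


def new_key(pool, snapid, shardid, obj):
    return f"{MAPPING_PREFIX}{pool}_{snapid}_{shardid}_{obj}"


def buggy_converted_key(pool, snapid):
    # The faulty conversion dropped the shard and object suffix.
    return f"{MAPPING_PREFIX}{pool}_{snapid}_"


objects = ["obj_a", "obj_b"]
correct = {new_key(1, 4, 0, obj) for obj in objects}
buggy = {buggy_converted_key(1, 4) for _ in objects}

print(len(correct), len(buggy))
```

With the suffix intact each object keeps its own key; with the truncated form, every object in the same pool and snapshot maps to one key.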
* Cephadm may now be configured to carry out CephFS MDS upgrades without
  reducing ``max_mds`` to 1. Previously, Cephadm would reduce ``max_mds`` to 1 to
  avoid having two active MDS daemons modifying on-disk structures with new
  versions, communicating cross-version-incompatible messages, or other
  potential incompatibilities. This could be disruptive for large-scale CephFS
  deployments because the cluster cannot easily reduce active MDS daemons to 1.
  NOTE: A staggered upgrade of the mons/mgrs may be necessary to take advantage
  of the feature. For how to perform it, see:
  https://docs.ceph.com/en/quincy/cephadm/upgrade/#staggered-upgrade
  Relevant tracker: https://tracker.ceph.com/issues/55715

* Introduced a new file system flag `refuse_client_session` that can be set
  using the `fs set` command. This flag blocks any incoming session
  requests from clients. This can be useful during some recovery situations
  where it's desirable to bring the MDS up but have no client workload.
  Relevant tracker: https://tracker.ceph.com/issues/57090

* New MDSMap field `max_xattr_size` which can be set using the `fs set` command.
  This MDSMap field configures the maximum size allowed for the full
  key/value set of a file system's extended attributes. It effectively replaces
  the old per-MDS `max_xattr_pairs_size` setting, which is now dropped.
  Relevant tracker: https://tracker.ceph.com/issues/55725

* Introduced a new file system flag `refuse_standby_for_another_fs` that can be
  set using the `fs set` command. This flag prevents using a standby for another
  file system (join_fs = X) when a standby for the current file system is not
  available.
  Relevant tracker: https://tracker.ceph.com/issues/61599
* mon: Added NVMe-oF gateway monitor and HA.
  This change adds high availability support for the nvmeof Ceph service. High
  availability means that even if a certain gateway (GW) is down, another path
  remains available so that the initiator can continue IO through another GW.
  It also adds two new mon commands to notify the monitor about gateway
  creation and deletion:
  - nvme-gw create
  - nvme-gw delete
  Relevant tracker: https://tracker.ceph.com/issues/64777