a34fc8875b
Based on tests performed at scale on an HDD-based cluster, it was found
that scheduling with mClock was not optimal with multiple OSD shards. For
example, in the scaled cluster with multiple OSD node failures, the client
throughput was found to be inconsistent across test runs, coupled with
multiple reported slow requests.
However, the same test with a single OSD shard and with multiple worker
threads yielded significantly better results in terms of consistency of
client and recovery throughput across multiple test runs.
For more details see https://tracker.ceph.com/issues/66289.
Therefore, as an interim measure until the issue with multiple OSD shards
(or multiple mClock queues per OSD) is investigated and fixed, the
following change to the default HDD OSD shard configuration is made:
- osd_op_num_shards_hdd = 1 (was 5)
- osd_op_num_threads_per_shard_hdd = 5 (was 1)
The other changes in this commit include:
- Doc change to the OSD and mClock config reference describing
this change.
- OSD troubleshooting entry on the procedure to change the shard
configuration for clusters affected by this issue running on older
releases.
- Add release note for this change.
Fixes: https://tracker.ceph.com/issues/66289
Signed-off-by: Sridhar Seshasayee <sseshasa@redhat.com>
(cherry picked from commit 0d81e72137)
Conflicts:
doc/rados/troubleshooting/troubleshooting-osd.rst
- Included the troubleshooting entry before the "Flapping OSDs" section.
PendingReleaseNotes
- Moved the release note under 18.2.4 section and removed unrelated entries

>=18.2.4
--------

* RBD: `RBD_IMAGE_OPTION_CLONE_FORMAT` option has been exposed in Python
  bindings via `clone_format` optional parameter to `clone`, `deep_copy` and
  `migration_prepare` methods.
* RBD: `RBD_IMAGE_OPTION_FLATTEN` option has been exposed in Python bindings via
  `flatten` optional parameter to `deep_copy` and `migration_prepare` methods.
* Based on tests performed at scale on an HDD-based Ceph cluster, it was found
  that scheduling with mClock was not optimal with multiple OSD shards. For
  example, in the test cluster with multiple OSD node failures, the client
  throughput was found to be inconsistent across test runs, coupled with multiple
  reported slow requests. However, the same test with a single OSD shard and
  with multiple worker threads yielded significantly better results in terms of
  consistency of client and recovery throughput across multiple test runs.
  Therefore, as an interim measure until the issue with multiple OSD shards
  (or multiple mClock queues per OSD) is investigated and fixed, the following
  change to the default HDD OSD shard configuration is made:

  - osd_op_num_shards_hdd = 1 (was 5)
  - osd_op_num_threads_per_shard_hdd = 5 (was 1)

  For more details see https://tracker.ceph.com/issues/66289.
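
  For clusters still on an older release, a minimal sketch of checking and
  applying the same values with `ceph config` (option names are from this note;
  an OSD restart may be needed for the new values to take effect):

    ceph config get osd osd_op_num_shards_hdd              # 1 on >=18.2.4
    ceph config get osd osd_op_num_threads_per_shard_hdd   # 5 on >=18.2.4
    ceph config set osd osd_op_num_shards_hdd 1
    ceph config set osd osd_op_num_threads_per_shard_hdd 5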

>=18.2.2
--------

* RBD: When diffing against the beginning of time (`fromsnapname == NULL`) in
  fast-diff mode (`whole_object == true` with `fast-diff` image feature enabled
  and valid), diff-iterate is now guaranteed to execute locally if exclusive
  lock is available. This brings a dramatic performance improvement for QEMU
  live disk synchronization and backup use cases.
* RADOS: `get_pool_is_selfmanaged_snaps_mode` C++ API has been deprecated
  due to being prone to false negative results. Its safer replacement is
  `pool_is_in_selfmanaged_snaps_mode`.
* RBD: The option ``--image-id`` has been added to the `rbd children` CLI command,
  so it can be run for images in the trash.
* RADOS: For bug 62338 (https://tracker.ceph.com/issues/62338), we did not choose
  to condition the fix on a server flag in order to simplify backporting. As
  a result, in rare cases it may be possible for a PG to flip between two acting
  sets while an upgrade to a version with the fix is in progress. If you observe
  this behavior, you should be able to work around it by completing the upgrade or
  by disabling async recovery by setting osd_async_recovery_min_cost to a very
  large value on all OSDs until the upgrade is complete:
  ``ceph config set osd osd_async_recovery_min_cost 1099511627776``

>=19.0.0
--------

* The cephfs-shell utility is now packaged for RHEL 9 / CentOS 9 as the required
  Python dependencies are now available in EPEL9.
* RGW: S3 multipart uploads using Server-Side Encryption now replicate correctly in
  multi-site. Previously, the replicas of such objects were corrupted on decryption.
  A new tool, ``radosgw-admin bucket resync encrypted multipart``, can be used to
  identify these original multipart uploads. The ``LastModified`` timestamp of any
  identified object is incremented by 1ns to cause peer zones to replicate it again.
  For multi-site deployments that make any use of Server-Side Encryption, we
  recommend running this command against every bucket in every zone after all
  zones have upgraded.
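
  As an illustrative sketch only (bucket names and the exact flag spelling are
  assumptions, not taken from this note), the resync could be driven per bucket
  like this:

    # list bucket names and resync each one
    for b in $(radosgw-admin bucket list | jq -r '.[]'); do
        radosgw-admin bucket resync encrypted multipart --bucket="$b"
    done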
* CEPHFS: The MDS now evicts clients which are not advancing their request tids,
  since such clients cause a large buildup of session metadata, resulting in the
  MDS going read-only once the RADOS operation exceeds the size threshold. The
  `mds_session_metadata_threshold` config option controls the maximum size that
  the (encoded) session metadata can grow to.
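
  A minimal sketch for inspecting or adjusting the threshold (the value shown is
  an arbitrary example, not a recommendation from this note):

    ceph config get mds mds_session_metadata_threshold
    ceph config set mds mds_session_metadata_threshold 16777216   # 16 MiB, example only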
* RGW: New tools have been added to radosgw-admin for identifying and
  correcting issues with versioned bucket indexes. Historical bugs with the
  versioned bucket index transaction workflow made it possible for the index
  to accumulate extraneous "book-keeping" olh entries and plain placeholder
  entries. In some specific scenarios where clients made concurrent requests
  referencing the same object key, it was likely that a lot of extra index
  entries would accumulate. When a significant number of these entries are
  present in a single bucket index shard, they can cause high bucket listing
  latencies and lifecycle processing failures. To check whether a versioned
  bucket has unnecessary olh entries, users can now run ``radosgw-admin
  bucket check olh``. If the ``--fix`` flag is used, the extra entries will
  be safely removed. Distinct from the issue described thus far, it is
  also possible that some versioned buckets are maintaining extra unlinked
  objects that are not listable from the S3/Swift APIs. These extra objects
  are typically a result of PUT requests that exited abnormally, in the middle
  of a bucket index transaction, so the client would not have received a
  successful response. Bugs in prior releases made these unlinked objects easy
  to reproduce with any PUT request that was made on a bucket that was actively
  resharding. Besides the extra space that these hidden, unlinked objects
  consume, there can be another side effect in certain scenarios, caused by
  the nature of the failure mode that produced them, where a client of a bucket
  that was a victim of this bug may find the object associated with the key to
  be in an inconsistent state. To check whether a versioned bucket has unlinked
  entries, users can now run ``radosgw-admin bucket check unlinked``. If the
  ``--fix`` flag is used, the unlinked objects will be safely removed. Finally,
  a third issue made it possible for versioned bucket index stats to be
  accounted inaccurately. The tooling for recalculating versioned bucket stats
  also had a bug, and was not previously capable of fixing these inaccuracies.
  This release resolves those issues, and users can now expect that the existing
  ``radosgw-admin bucket check`` command will produce correct results. We
  recommend that users with versioned buckets, especially those that existed
  on prior releases, use these new tools to check whether their buckets are
  affected and to clean them up accordingly.
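
  A minimal per-bucket sketch (the ``--bucket`` flag is assumed here; running
  without ``--fix`` first only reports what would be removed):

    radosgw-admin bucket check olh --bucket=mybucket            # report only
    radosgw-admin bucket check olh --bucket=mybucket --fix      # remove extra olh entries
    radosgw-admin bucket check unlinked --bucket=mybucket --fix # remove unlinked objects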
* mgr/snap-schedule: For clusters with multiple CephFS file systems, all the
  snap-schedule commands now expect the '--fs' argument.
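
  For example (file system name, path and schedule are placeholders; the
  subcommands shown are assumed from the snap_schedule module):

    ceph fs snap-schedule add /some/dir 1h --fs cephfs
    ceph fs snap-schedule status /some/dir --fs cephfs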
* The `mon_cluster_log_file_level` and `mon_cluster_log_to_syslog_level` options
  have been removed. Henceforth, users should use the new generic option
  `mon_cluster_log_level` to control the cluster log level verbosity for the cluster
  log file as well as for all external entities.
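
  A minimal example of the new option (the level value is illustrative):

    ceph config set mon mon_cluster_log_level info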
* RGW: Fixed an S3 Object Lock bug with PutObjectRetention requests that specify
  a RetainUntilDate after the year 2106. This date was truncated to 32 bits when
  stored, so a much earlier date was used for object lock enforcement. This does
  not affect PutBucketObjectLockConfiguration, where a duration is given in Days.
  The RetainUntilDate encoding is fixed for new PutObjectRetention requests, but
  cannot repair the dates of existing object locks. Such objects can be identified
  with a HeadObject request based on the x-amz-object-lock-retain-until-date
  response header.
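
  A sketch of such a check with the AWS CLI (endpoint, bucket and key are
  placeholders):

    aws --endpoint-url http://rgw.example.com:8000 s3api head-object \
        --bucket mybucket --key mykey --query ObjectLockRetainUntilDate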
* RADOS: `get_pool_is_selfmanaged_snaps_mode` C++ API has been deprecated
  due to being prone to false negative results. Its safer replacement is
  `pool_is_in_selfmanaged_snaps_mode`.
* RADOS: For bug 62338 (https://tracker.ceph.com/issues/62338), we did not choose
  to condition the fix on a server flag in order to simplify backporting. As
  a result, in rare cases it may be possible for a PG to flip between two acting
  sets while an upgrade to a version with the fix is in progress. If you observe
  this behavior, you should be able to work around it by completing the upgrade or
  by disabling async recovery by setting osd_async_recovery_min_cost to a very
  large value on all OSDs until the upgrade is complete:
  ``ceph config set osd osd_async_recovery_min_cost 1099511627776``
* RADOS: A detailed version of the `balancer status` CLI command in the balancer
  module is now available. Users may run `ceph balancer status detail` to see more
  details about which PGs were updated in the balancer's last optimization.
  See https://docs.ceph.com/en/latest/rados/operations/balancer/ for more information.
* CephFS: For clusters with multiple CephFS file systems, all the snap-schedule
  commands now expect the '--fs' argument.
* CephFS: The period specifier ``m`` now implies minutes and the period specifier
  ``M`` now implies months. This has been made consistent with the rest
  of the system.
* CephFS: Full support for subvolumes and subvolume groups is now available
  for the snap_schedule Manager module.
* CephFS: Two FS names can now be swapped, optionally along with their IDs,
  using the "ceph fs swap" command. The function of this API is to facilitate
  file system swaps for disaster recovery. In particular, it avoids situations
  where a named file system is temporarily missing, which would prompt a higher
  level storage operator (like Rook) to recreate the missing file system.
  See https://docs.ceph.com/en/latest/cephfs/administration/#file-systems
  docs for more information.
* CephFS: The `subvolume snapshot clone` command now depends on the config option
  `snapshot_clone_no_wait` which is used to reject the clone operation when
  all the cloner threads are busy. This config option is enabled by default, which
  means that if no cloner threads are free, the clone request errors out with EAGAIN.
  The value of the config option can be fetched by using:
  `ceph config get mgr mgr/volumes/snapshot_clone_no_wait`
  and it can be disabled by using:
  `ceph config set mgr mgr/volumes/snapshot_clone_no_wait false`
* CephFS: Fixes to the implementation of the ``root_squash`` mechanism enabled
  via cephx ``mds`` caps on a client credential require a new client feature
  bit, ``client_mds_auth_caps``. Clients using credentials with ``root_squash``
  without this feature will trigger the MDS to raise a HEALTH_ERR on the
  cluster, MDS_CLIENTS_BROKEN_ROOTSQUASH. See the documentation on this warning
  and the new feature bit for more information.
* CephFS: Expanded removexattr support for CephFS virtual extended attributes.
  Previously one had to use setxattr to restore the default in order to "remove".
  You may now properly use removexattr to remove. You can also now remove the
  layout on the root inode, which will then restore the layout to the default.
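
  For instance (mount point and directory are placeholders; standard xattr tools
  are assumed):

    setfattr -x ceph.dir.layout /mnt/cephfs/mydir   # remove a directory layout vxattr
    setfattr -x ceph.dir.layout /mnt/cephfs         # remove layout on the root inode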
* cls_cxx_gather is marked as deprecated.
* CephFS: cephfs-journal-tool is guarded against running on an online file system.
  The 'cephfs-journal-tool --rank <fs_name>:<mds_rank> journal reset' and
  'cephfs-journal-tool --rank <fs_name>:<mds_rank> journal reset --force'
  commands require '--yes-i-really-really-mean-it'.
* CephFS: The commands "ceph mds fail" and "ceph fs fail" now require a
  confirmation flag when some MDSs exhibit the health warning MDS_TRIM or
  MDS_CACHE_OVERSIZED. This is to prevent accidental MDS failover from causing
  further delays in recovery.

>=18.0.0
--------

* The RGW policy parser now rejects unknown principals by default. If you are
  mirroring policies between RGW and AWS, you may wish to set
  "rgw policy reject invalid principals" to "false". This affects only newly set
  policies, not policies that are already in place.
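
  For example (the underscored option name is assumed from the spaced form quoted
  above):

    ceph config set client.rgw rgw_policy_reject_invalid_principals false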
* The CephFS automatic metadata load (sometimes called "default") balancer is
  now disabled by default. The new file system flag `balance_automate`
  can be used to toggle it on or off. It can be enabled or disabled via
  `ceph fs set <fs_name> balance_automate <bool>`.
* RGW's default backend for `rgw_enable_ops_log` changed from RADOS to file.
  The default value of `rgw_ops_log_rados` is now false, and `rgw_ops_log_file_path`
  defaults to "/var/log/ceph/ops-log-$cluster-$name.log".
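
  To restore the previous RADOS-backed behavior, something like the following
  could be used (illustrative only):

    ceph config set client.rgw rgw_ops_log_rados true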
* The SPDK backend for BlueStore is now able to connect to an NVMeoF target.
  Please note that this is not an officially supported feature.
* RGW's pubsub interface now returns boolean fields using bool. Before this change,
  `/topics/<topic-name>` returned "stored_secret" and "persistent" as a string
  of "true" or "false" with quotes around them. After this change, these fields
  are returned without quotes so they can be decoded as boolean values in JSON.
  The same applies to the `is_truncated` field returned by `/subscriptions/<sub-name>`.
* RGW's response of `Action=GetTopicAttributes&TopicArn=<topic-arn>` REST API now
  returns `HasStoredSecret` and `Persistent` as boolean in the JSON string
  encoded in `Attributes/EndPoint`.
* All boolean fields previously rendered as string by the `rgw-admin` command when
  the JSON format is used are now rendered as boolean. If your scripts/tools
  rely on this behavior, please update them accordingly. The impacted field names
  are:

  * absolute
  * add
  * admin
  * appendable
  * bucket_key_enabled
  * delete_marker
  * exists
  * has_bucket_info
  * high_precision_time
  * index
  * is_master
  * is_prefix
  * is_truncated
  * linked
  * log_meta
  * log_op
  * pending_removal
  * read_only
  * retain_head_object
  * rule_exist
  * start_with_full_sync
  * sync_from_all
  * syncstopped
  * system
  * truncated
  * user_stats_sync

* RGW: The beast frontend's HTTP access log line uses a new debug_rgw_access
  configurable. This has the same defaults as debug_rgw, but can now be controlled
  independently.
* RBD: The semantics of the compare-and-write C++ API (`Image::compare_and_write`
  and `Image::aio_compare_and_write` methods) now match those of the C API. Both
  compare and write steps operate only on `len` bytes even if the respective
  buffers are larger. The previous behavior of comparing up to the size of
  the compare buffer was prone to subtle breakage upon straddling a stripe
  unit boundary.
* RBD: The compare-and-write operation is no longer limited to 512-byte sectors.
  Assuming proper alignment, it now allows operating on stripe units (4M by
  default).
* RBD: New `rbd_aio_compare_and_writev` API method to support scatter/gather
  on both compare and write buffers. This complements the existing `rbd_aio_readv`
  and `rbd_aio_writev` methods.
* The 'AT_NO_ATTR_SYNC' macro is deprecated; please use the standard
  'AT_STATX_DONT_SYNC' macro instead. The 'AT_NO_ATTR_SYNC' macro will be removed
  in the future.
* Trimming of PGLog dups is now controlled by the size instead of the version.
  This fixes the PGLog inflation issue that was happening when the on-line
  (in OSD) trimming got jammed after a PG split operation. Also, a new off-line
  mechanism has been added: `ceph-objectstore-tool` got a `trim-pg-log-dups` op
  that targets situations where an OSD is unable to boot due to those inflated dups.
  If that is the case, the "You can be hit by THE DUPS BUG" warning will be
  visible in the OSD logs.
  Relevant tracker: https://tracker.ceph.com/issues/53729
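
  An offline-repair sketch (data path and PG id are placeholders; run with the
  affected OSD stopped):

    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-0 \
        --pgid 2.7 --op trim-pg-log-dups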
* RBD: The `rbd device unmap` command gained a `--namespace` option. Support for
  namespaces was added to RBD in Nautilus 14.2.0 and it has been possible to
  map and unmap images in namespaces using the `image-spec` syntax since then,
  but the corresponding option available in most other commands was missing.
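
  Both forms below should therefore unmap the same image (pool, namespace and
  image names are placeholders, and the exact option spelling is an assumption):

    rbd device unmap mypool/mynamespace/myimage
    rbd device unmap --pool mypool --namespace mynamespace --image myimage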
* RGW: Compression is now supported for objects uploaded with Server-Side Encryption.
  When both are enabled, compression is applied before encryption. Earlier releases
  of multisite do not replicate such objects correctly, so all zones must upgrade to
  Reef before enabling the `compress-encrypted` zonegroup feature: see
  https://docs.ceph.com/en/reef/radosgw/multisite/#zone-features and note the
  security considerations.
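
  Once every zone runs Reef, the feature could be enabled roughly as follows
  (the zonegroup name is a placeholder):

    radosgw-admin zonegroup modify --rgw-zonegroup=default --enable-feature=compress-encrypted
    radosgw-admin period update --commit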
* RGW: The "pubsub" functionality for storing bucket notifications inside Ceph
  has been removed, and with it the "pubsub" zone should no longer be used.
  The REST operations and the radosgw-admin commands for manipulating
  subscriptions, as well as those for fetching and acking notifications, have
  been removed as well.
  In case the endpoint to which the notifications are sent may be down or
  disconnected, it is recommended to use persistent notifications to guarantee
  their delivery. If the system that consumes the notifications needs to pull
  them (instead of having them pushed to it), an external message bus (e.g.
  RabbitMQ, Kafka) should be used for that purpose.
* RGW: The serialized format of notification and topics has changed, so that
  new/updated topics will be unreadable by old RGWs. We recommend completing
  the RGW upgrades before creating or modifying any notification topics.
* RBD: Trailing newline in passphrase files (`<passphrase-file>` argument in
  the `rbd encryption format` command and the `--encryption-passphrase-file`
  option in other commands) is no longer stripped.
* RBD: Support for layered client-side encryption is added. Cloned images
  can now be encrypted each with its own encryption format and passphrase,
  potentially different from that of the parent image. The efficient
  copy-on-write semantics intrinsic to unformatted (regular) cloned images
  are retained.
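
  A rough sketch (pool, image names, format and passphrase file are placeholders):

    rbd clone mypool/parent@snap mypool/child
    rbd encryption format mypool/child luks2 /tmp/child-passphrase.bin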
* CEPHFS: Renamed the `mds_max_retries_on_remount_failure` option to
  `client_max_retries_on_remount_failure` and moved it from mds.yaml.in to
  mds-client.yaml.in, because this option has only ever been used by the MDS
  client.
* The `perf dump` and `perf schema` commands are deprecated in favor of new
  `counter dump` and `counter schema` commands. These new commands add support
  for labeled perf counters and also emit existing unlabeled perf counters. Some
  unlabeled perf counters became labeled in this release, with more to follow in
  future releases; such converted perf counters are no longer emitted by the
  `perf dump` and `perf schema` commands.
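
  For example, against a daemon's admin interface (the daemon id is a placeholder):

    ceph tell osd.0 counter dump
    ceph tell osd.0 counter schema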
* The `ceph mgr dump` command now outputs `last_failure_osd_epoch` and
  `active_clients` fields at the top level. Previously, these fields were
  output under the `always_on_modules` field.
* The `ceph mgr dump` command now displays the name of the mgr module that
  registered a RADOS client in the `name` field added to elements of the
  `active_clients` array. Previously, only the address of a module's RADOS
  client was shown in the `active_clients` array.
* RBD: All rbd-mirror daemon perf counters became labeled and as such are now
  emitted only by the new `counter dump` and `counter schema` commands. As part
  of the conversion, many also got renamed to better disambiguate journal-based
  and snapshot-based mirroring.
* RBD: The list-watchers C++ API (`Image::list_watchers`) now clears the passed
  `std::list` before potentially appending to it, aligning with the semantics
  of the corresponding C API (`rbd_watchers_list`).
* The rados python binding is now able to process (opt-in) omap keys as bytes
  objects. This enables interacting with RADOS omap keys that are not decodable as
  UTF-8 strings.
* Telemetry: Users who are opted-in to telemetry can also opt-in to
  participating in a leaderboard in the telemetry public
  dashboards (https://telemetry-public.ceph.com/). Users can now also add a
  description of the cluster to publicly appear in the leaderboard.
  For more details, see:
  https://docs.ceph.com/en/latest/mgr/telemetry/#leaderboard
  See a sample report with `ceph telemetry preview`.
  Opt-in to telemetry with `ceph telemetry on`.
  Opt-in to the leaderboard with
  `ceph config set mgr mgr/telemetry/leaderboard true`.
  Add a leaderboard description with:
  `ceph config set mgr mgr/telemetry/leaderboard_description 'Cluster description'`.
* CEPHFS: After recovering a Ceph File System following the disaster recovery
  procedure, the recovered files under the `lost+found` directory can now be
  deleted.
* core: cache-tiering is now deprecated.
* mClock Scheduler: The mClock scheduler (the default scheduler in Quincy) has
  undergone significant usability and design improvements to address the slow
  backfill issue. Some important changes are:

  * The 'balanced' profile is set as the default mClock profile because it
    represents a compromise between prioritizing client IO and recovery IO. Users
    can then choose either the 'high_client_ops' profile to prioritize client IO
    or the 'high_recovery_ops' profile to prioritize recovery IO.
  * QoS parameters like reservation and limit are now specified in terms of a
    fraction (range: 0.0 to 1.0) of the OSD's IOPS capacity.
  * The cost parameters (osd_mclock_cost_per_io_usec_* and
    osd_mclock_cost_per_byte_usec_*) have been removed. The cost of an operation
    is now determined using the random IOPS and maximum sequential bandwidth
    capability of the OSD's underlying device.
  * Degraded object recovery is given higher priority when compared to misplaced
    object recovery because degraded objects present a data safety issue not
    present with objects that are merely misplaced. Therefore, backfilling
    operations with the 'balanced' and 'high_client_ops' mClock profiles may
    progress slower than what was seen with the 'WeightedPriorityQueue' (WPQ)
    scheduler.
  * The QoS allocations in all the mClock profiles are optimized based on the above
    fixes and enhancements.
  * For more detailed information see:
    https://docs.ceph.com/en/reef/rados/configuration/mclock-config-ref/
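
  For instance, to switch profiles (values other than the default are shown only
  as examples):

    ceph config set osd osd_mclock_profile high_recovery_ops   # prioritize recovery IO
    ceph config set osd osd_mclock_profile balanced            # back to the default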
* mgr/snap_schedule: The snap-schedule mgr module now retains one snapshot fewer
  than the number specified by the config tunable `mds_max_snaps_per_dir`,
  so that a new snapshot can be created and retained during the next schedule
  run.
* CephFS: Running the command "ceph fs authorize" for an existing entity now
  upgrades the entity's capabilities instead of printing an error. It can now
  also change read/write permissions in a capability that the entity already
  holds. If the capability passed by the user is the same as one of the
  capabilities that the entity already holds, idempotency is maintained.
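
  For example (entity, file system name and path are placeholders), re-running
  authorize with broader caps now upgrades the existing credential:

    ceph fs authorize cephfs client.foo / r    # initial read-only capability
    ceph fs authorize cephfs client.foo / rw   # upgrades the same entity to read-write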
* `ceph config dump --format <json|xml>` output will display the localized
  option names instead of their normalized versions. For example,
  "mgr/prometheus/x/server_port" will be displayed instead of
  "mgr/prometheus/server_port". This matches the output of the non pretty-print
  formatted version of the command.
* CEPHFS: The MDS config option "mds_kill_skip_replaying_inotable" was easily
  confused with "mds_inject_skip_replaying_inotable", and has therefore been
  renamed to "mds_kill_after_journal_logs_flushed".

>=17.2.1
--------

* The "BlueStore zero block detection" feature (first introduced to Quincy in
  https://github.com/ceph/ceph/pull/43337) has been turned off by default with a
  new global configuration called `bluestore_zero_block_detection`. This feature,
  intended for large-scale synthetic testing, does not interact well with some RBD
  and CephFS features. Any side effects experienced in previous Quincy versions
  would no longer occur, provided that the configuration remains set to false.
  Relevant tracker: https://tracker.ceph.com/issues/55521
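
  For instance, to confirm or explicitly pin the default (illustrative only):

    ceph config get osd bluestore_zero_block_detection
    ceph config set osd bluestore_zero_block_detection false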
* telemetry: Added new Rook metrics to the 'basic' channel to report Rook's
  version, Kubernetes version, node metrics, etc.
  See a sample report with `ceph telemetry preview`.
  Opt-in with `ceph telemetry on`.
  For more details, see:
  https://docs.ceph.com/en/latest/mgr/telemetry/
* OSD: The issue of high CPU utilization during recovery/backfill operations
  has been fixed. For more details, see: https://tracker.ceph.com/issues/56530.

>=15.2.17
---------

* OSD: Octopus modified the SnapMapper key format from

  <LEGACY_MAPPING_PREFIX><snapid>_<shardid>_<hobject_t::to_str()>

  to

  <MAPPING_PREFIX><pool>_<snapid>_<shardid>_<hobject_t::to_str()>

  When this change was introduced, 94ebe0e also introduced a conversion
  with a crucial bug which essentially destroyed legacy keys by mapping them
  to

  <MAPPING_PREFIX><poolid>_<snapid>_

  without the object-unique suffix. The conversion is fixed in this release.
  Relevant tracker: https://tracker.ceph.com/issues/56147

* Cephadm may now be configured to carry out CephFS MDS upgrades without
  reducing ``max_mds`` to 1. Previously, Cephadm would reduce ``max_mds`` to 1 to
  avoid having two active MDS modifying on-disk structures with new versions,
  communicating cross-version-incompatible messages, or other potential
  incompatibilities. This could be disruptive for large-scale CephFS deployments
  because the cluster cannot easily reduce active MDS daemons to 1.
  NOTE: A staggered upgrade of the mons/mgrs may be necessary to take advantage
  of the feature; refer to this link on how to perform it:
  https://docs.ceph.com/en/quincy/cephadm/upgrade/#staggered-upgrade
  Relevant tracker: https://tracker.ceph.com/issues/55715
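
  A staggered-upgrade sketch (the container image is a placeholder; the
  --daemon-types option is from the cephadm staggered upgrade feature linked
  above):

    ceph orch upgrade start --image quay.io/ceph/ceph:vX.Y.Z --daemon-types mgr,mon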
* Introduced a new file system flag `refuse_client_session` that can be set using
  the `fs set` command. This flag allows blocking any incoming session
  request from client(s). This can be useful during some recovery situations
  where it's desirable to bring the MDS up but have no client workload.
  Relevant tracker: https://tracker.ceph.com/issues/57090
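
  For example (the file system name is a placeholder):

    ceph fs set cephfs refuse_client_session true    # block new client sessions
    ceph fs set cephfs refuse_client_session false   # allow them again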
* New MDSMap field `max_xattr_size` which can be set using the `fs set` command.
  This MDSMap field allows configuring the maximum size allowed for the full
  key/value set of a file system's extended attributes. It effectively replaces
  the old per-MDS `max_xattr_pairs_size` setting, which is now dropped.
  Relevant tracker: https://tracker.ceph.com/issues/55725
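
  A sketch (the size value is an arbitrary example, in bytes):

    ceph fs set cephfs max_xattr_size 65536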
* Introduced a new file system flag `refuse_standby_for_another_fs` that can be
  set using the `fs set` command. This flag prevents using a standby for another
  file system (join_fs = X) when a standby for the current file system is not
  available.
  Relevant tracker: https://tracker.ceph.com/issues/61599