mirror of
https://github.com/ceph/ceph
synced 2025-01-22 19:15:41 +00:00
1fd4309fbb
the default encoding of ceph::real_time truncates seconds to uint32_t, so stores the wrong timestamp for object lock enforcement Fixes: https://tracker.ceph.com/issues/63537 Signed-off-by: Casey Bodley <cbodley@redhat.com>
329 lines
19 KiB
Plaintext
329 lines
19 KiB
Plaintext
>=19.0.0
|
||
|
||
* RGW: S3 multipart uploads using Server-Side Encryption now replicate correctly in
|
||
multi-site. Previously, the replicas of such objects were corrupted on decryption.
|
||
A new tool, ``radosgw-admin bucket resync encrypted multipart``, can be used to
|
||
identify these original multipart uploads. The ``LastModified`` timestamp of any
|
||
identified object is incremented by 1ns to cause peer zones to replicate it again.
|
||
For multi-site deployments that make any use of Server-Side Encryption, we
|
||
recommended running this command against every bucket in every zone after all
|
||
zones have upgraded.
|
||
* CEPHFS: MDS evicts clients which are not advancing their request tids which causes
|
||
a large buildup of session metadata resulting in the MDS going read-only due to
|
||
the RADOS operation exceeding the size threshold. `mds_session_metadata_threshold`
|
||
config controls the maximum size that a (encoded) session metadata can grow.
|
||
* CephFS: For clusters with multiple CephFS file systems, all the snap-schedule
|
||
commands now expect the '--fs' argument.
|
||
* CephFS: The period specifier ``m`` now implies minutes and the period specifier
|
||
``M`` now implies months. This has been made consistent with the rest
|
||
of the system.
|
||
* RGW: New tools have been added to radosgw-admin for identifying and
|
||
correcting issues with versioned bucket indexes. Historical bugs with the
|
||
versioned bucket index transaction workflow made it possible for the index
|
||
to accumulate extraneous "book-keeping" olh entries and plain placeholder
|
||
entries. In some specific scenarios where clients made concurrent requests
|
||
referencing the same object key, it was likely that a lot of extra index
|
||
entries would accumulate. When a significant number of these entries are
|
||
present in a single bucket index shard, they can cause high bucket listing
|
||
latencies and lifecycle processing failures. To check whether a versioned
|
||
bucket has unnecessary olh entries, users can now run ``radosgw-admin
|
||
bucket check olh``. If the ``--fix`` flag is used, the extra entries will
|
||
be safely removed. A distinct issue from the one described thus far, it is
|
||
also possible that some versioned buckets are maintaining extra unlinked
|
||
objects that are not listable from the S3/ Swift APIs. These extra objects
|
||
are typically a result of PUT requests that exited abnormally, in the middle
|
||
of a bucket index transaction - so the client would not have received a
|
||
successful response. Bugs in prior releases made these unlinked objects easy
|
||
to reproduce with any PUT request that was made on a bucket that was actively
|
||
resharding. Besides the extra space that these hidden, unlinked objects
|
||
consume, there can be another side effect in certain scenarios, caused by
|
||
the nature of the failure mode that produced them, where a client of a bucket
|
||
that was a victim of this bug may find the object associated with the key to
|
||
be in an inconsistent state. To check whether a versioned bucket has unlinked
|
||
entries, users can now run ``radosgw-admin bucket check unlinked``. If the
|
||
``--fix`` flag is used, the unlinked objects will be safely removed. Finally,
|
||
a third issue made it possible for versioned bucket index stats to be
|
||
accounted inaccurately. The tooling for recalculating versioned bucket stats
|
||
also had a bug, and was not previously capable of fixing these inaccuracies.
|
||
This release resolves those issues and users can now expect that the existing
|
||
``radosgw-admin bucket check`` command will produce correct results. We
|
||
recommend that users with versioned buckets, especially those that existed
|
||
on prior releases, use these new tools to check whether their buckets are
|
||
affected and to clean them up accordingly.
|
||
* CephFS: Two FS names can now be swapped, optionally along with their IDs,
|
||
using "ceph fs swap" command. The function of this API is to facilitate
|
||
file system swaps for disaster recovery. In particular, it avoids situations
|
||
where a named file system is temporarily missing which would prompt a higher
|
||
level storage operator (like Rook) to recreate the missing file system.
|
||
See https://docs.ceph.com/en/latest/cephfs/administration/#file-systems
|
||
docs for more information.
|
||
* RADOS: A POOL_APP_NOT_ENABLED health warning will now be reported if
|
||
the application is not enabled for the pool irrespective of whether
|
||
the pool is in use or not. Always tag a pool with an application
|
||
using ``ceph osd pool application enable`` command to avoid reporting
|
||
of POOL_APP_NOT_ENABLED health warning for that pool.
|
||
The user might temporarily mute this warning using
|
||
``ceph health mute POOL_APP_NOT_ENABLED``.
|
||
CephFS: Disallow delegating preallocated inode ranges to clients. Config
|
||
`mds_client_delegate_inos_pct` defaults to 0 which disables async dirops
|
||
in the kclient.
|
||
* S3 Get/HeadObject now support query parameter `partNumber` to read a specific
|
||
part of a completed multipart upload.
|
||
* RGW: Fixed a S3 Object Lock bug with PutObjectRetention requests that specify
|
||
a RetainUntilDate after the year 2106. This date was truncated to 32 bits when
|
||
stored, so a much earlier date was used for object lock enforcement. This does
|
||
not effect PutBucketObjectLockConfiguration where a duration is given in Days.
|
||
The RetainUntilDate encoding is fixed for new PutObjectRetention requests, but
|
||
cannot repair the dates of existing object locks. Such objects can be identified
|
||
with a HeadObject request based on the x-amz-object-lock-retain-until-date
|
||
response header.
|
||
|
||
>=18.0.0
|
||
|
||
* The RGW policy parser now rejects unknown principals by default. If you are
|
||
mirroring policies between RGW and AWS, you may wish to set
|
||
"rgw policy reject invalid principals" to "false". This affects only newly set
|
||
policies, not policies that are already in place.
|
||
* RGW's default backend for `rgw_enable_ops_log` changed from RADOS to file.
|
||
The default value of `rgw_ops_log_rados` is now false, and `rgw_ops_log_file_path`
|
||
defaults to "/var/log/ceph/ops-log-$cluster-$name.log".
|
||
* The SPDK backend for BlueStore is now able to connect to an NVMeoF target.
|
||
Please note that this is not an officially supported feature.
|
||
* RGW's pubsub interface now returns boolean fields using bool. Before this change,
|
||
`/topics/<topic-name>` returns "stored_secret" and "persistent" using a string
|
||
of "true" or "false" with quotes around them. After this change, these fields
|
||
are returned without quotes so they can be decoded as boolean values in JSON.
|
||
The same applies to the `is_truncated` field returned by `/subscriptions/<sub-name>`.
|
||
* RGW's response of `Action=GetTopicAttributes&TopicArn=<topic-arn>` REST API now
|
||
returns `HasStoredSecret` and `Persistent` as boolean in the JSON string
|
||
encoded in `Attributes/EndPoint`.
|
||
* All boolean fields previously rendered as string by `rgw-admin` command when
|
||
the JSON format is used are now rendered as boolean. If your scripts/tools
|
||
relies on this behavior, please update them accordingly. The impacted field names
|
||
are:
|
||
* absolute
|
||
* add
|
||
* admin
|
||
* appendable
|
||
* bucket_key_enabled
|
||
* delete_marker
|
||
* exists
|
||
* has_bucket_info
|
||
* high_precision_time
|
||
* index
|
||
* is_master
|
||
* is_prefix
|
||
* is_truncated
|
||
* linked
|
||
* log_meta
|
||
* log_op
|
||
* pending_removal
|
||
* read_only
|
||
* retain_head_object
|
||
* rule_exist
|
||
* start_with_full_sync
|
||
* sync_from_all
|
||
* syncstopped
|
||
* system
|
||
* truncated
|
||
* user_stats_sync
|
||
* RGW: The beast frontend's HTTP access log line uses a new debug_rgw_access
|
||
configurable. This has the same defaults as debug_rgw, but can now be controlled
|
||
independently.
|
||
* RBD: The semantics of compare-and-write C++ API (`Image::compare_and_write`
|
||
and `Image::aio_compare_and_write` methods) now match those of C API. Both
|
||
compare and write steps operate only on `len` bytes even if the respective
|
||
buffers are larger. The previous behavior of comparing up to the size of
|
||
the compare buffer was prone to subtle breakage upon straddling a stripe
|
||
unit boundary.
|
||
* RBD: compare-and-write operation is no longer limited to 512-byte sectors.
|
||
Assuming proper alignment, it now allows operating on stripe units (4M by
|
||
default).
|
||
* RBD: New `rbd_aio_compare_and_writev` API method to support scatter/gather
|
||
on both compare and write buffers. This compliments existing `rbd_aio_readv`
|
||
and `rbd_aio_writev` methods.
|
||
* The 'AT_NO_ATTR_SYNC' macro is deprecated, please use the standard 'AT_STATX_DONT_SYNC'
|
||
macro. The 'AT_NO_ATTR_SYNC' macro will be removed in the future.
|
||
* Trimming of PGLog dups is now controlled by the size instead of the version.
|
||
This fixes the PGLog inflation issue that was happening when the on-line
|
||
(in OSD) trimming got jammed after a PG split operation. Also, a new off-line
|
||
mechanism has been added: `ceph-objectstore-tool` got `trim-pg-log-dups` op
|
||
that targets situations where OSD is unable to boot due to those inflated dups.
|
||
If that is the case, in OSD logs the "You can be hit by THE DUPS BUG" warning
|
||
will be visible.
|
||
Relevant tracker: https://tracker.ceph.com/issues/53729
|
||
* RBD: `rbd device unmap` command gained `--namespace` option. Support for
|
||
namespaces was added to RBD in Nautilus 14.2.0 and it has been possible to
|
||
map and unmap images in namespaces using the `image-spec` syntax since then
|
||
but the corresponding option available in most other commands was missing.
|
||
* RGW: Compression is now supported for objects uploaded with Server-Side Encryption.
|
||
When both are enabled, compression is applied before encryption. Earlier releases
|
||
of multisite do not replicate such objects correctly, so all zones must upgrade to
|
||
Reef before enabling the `compress-encrypted` zonegroup feature: see
|
||
https://docs.ceph.com/en/reef/radosgw/multisite/#zone-features and note the
|
||
security considerations.
|
||
* RGW: the "pubsub" functionality for storing bucket notifications inside Ceph
|
||
is removed. Together with it, the "pubsub" zone should not be used anymore.
|
||
The REST operations, as well as radosgw-admin commands for manipulating
|
||
subscriptions, as well as fetching and acking the notifications are removed
|
||
as well.
|
||
In case that the endpoint to which the notifications are sent maybe down or
|
||
disconnected, it is recommended to use persistent notifications to guarantee
|
||
the delivery of the notifications. In case the system that consumes the
|
||
notifications needs to pull them (instead of the notifications be pushed
|
||
to it), an external message bus (e.g. rabbitmq, Kafka) should be used for
|
||
that purpose.
|
||
* RGW: The serialized format of notification and topics has changed, so that
|
||
new/updated topics will be unreadable by old RGWs. We recommend completing
|
||
the RGW upgrades before creating or modifying any notification topics.
|
||
* RBD: Trailing newline in passphrase files (`<passphrase-file>` argument in
|
||
`rbd encryption format` command and `--encryption-passphrase-file` option
|
||
in other commands) is no longer stripped.
|
||
* RBD: Support for layered client-side encryption is added. Cloned images
|
||
can now be encrypted each with its own encryption format and passphrase,
|
||
potentially different from that of the parent image. The efficient
|
||
copy-on-write semantics intrinsic to unformatted (regular) cloned images
|
||
are retained.
|
||
* CEPHFS: Rename the `mds_max_retries_on_remount_failure` option to
|
||
`client_max_retries_on_remount_failure` and move it from mds.yaml.in to
|
||
mds-client.yaml.in because this option was only used by MDS client from its
|
||
birth.
|
||
* The `perf dump` and `perf schema` commands are deprecated in favor of new
|
||
`counter dump` and `counter schema` commands. These new commands add support
|
||
for labeled perf counters and also emit existing unlabeled perf counters. Some
|
||
unlabeled perf counters became labeled in this release, with more to follow in
|
||
future releases; such converted perf counters are no longer emitted by the
|
||
`perf dump` and `perf schema` commands.
|
||
* `ceph mgr dump` command now outputs `last_failure_osd_epoch` and
|
||
`active_clients` fields at the top level. Previously, these fields were
|
||
output under `always_on_modules` field.
|
||
* `ceph mgr dump` command now displays the name of the mgr module that
|
||
registered a RADOS client in the `name` field added to elements of the
|
||
`active_clients` array. Previously, only the address of a module's RADOS
|
||
client was shown in the `active_clients` array.
|
||
* RBD: All rbd-mirror daemon perf counters became labeled and as such are now
|
||
emitted only by the new `counter dump` and `counter schema` commands. As part
|
||
of the conversion, many also got renamed to better disambiguate journal-based
|
||
and snapshot-based mirroring.
|
||
* RBD: list-watchers C++ API (`Image::list_watchers`) now clears the passed
|
||
`std::list` before potentially appending to it, aligning with the semantics
|
||
of the corresponding C API (`rbd_watchers_list`).
|
||
* The rados python binding is now able to process (opt-in) omap keys as bytes
|
||
objects. This enables interacting with RADOS omap keys that are not decodeable as
|
||
UTF-8 strings.
|
||
* Telemetry: Users who are opted-in to telemetry can also opt-in to
|
||
participating in a leaderboard in the telemetry public
|
||
dashboards (https://telemetry-public.ceph.com/). Users can now also add a
|
||
description of the cluster to publicly appear in the leaderboard.
|
||
For more details, see:
|
||
https://docs.ceph.com/en/latest/mgr/telemetry/#leaderboard
|
||
See a sample report with `ceph telemetry preview`.
|
||
Opt-in to telemetry with `ceph telemetry on`.
|
||
Opt-in to the leaderboard with
|
||
`ceph config set mgr mgr/telemetry/leaderboard true`.
|
||
Add leaderboard description with:
|
||
`ceph config set mgr mgr/telemetry/leaderboard_description ‘Cluster description’`.
|
||
* CEPHFS: After recovering a Ceph File System post following the disaster recovery
|
||
procedure, the recovered files under `lost+found` directory can now be deleted.
|
||
* core: cache-tiering is now deprecated.
|
||
* mClock Scheduler: The mClock scheduler (default scheduler in Quincy) has
|
||
undergone significant usability and design improvements to address the slow
|
||
backfill issue. Some important changes are:
|
||
* The 'balanced' profile is set as the default mClock profile because it
|
||
represents a compromise between prioritizing client IO or recovery IO. Users
|
||
can then choose either the 'high_client_ops' profile to prioritize client IO
|
||
or the 'high_recovery_ops' profile to prioritize recovery IO.
|
||
* QoS parameters like reservation and limit are now specified in terms of a
|
||
fraction (range: 0.0 to 1.0) of the OSD's IOPS capacity.
|
||
* The cost parameters (osd_mclock_cost_per_io_usec_* and
|
||
osd_mclock_cost_per_byte_usec_*) have been removed. The cost of an operation
|
||
is now determined using the random IOPS and maximum sequential bandwidth
|
||
capability of the OSD's underlying device.
|
||
* Degraded object recovery is given higher priority when compared to misplaced
|
||
object recovery because degraded objects present a data safety issue not
|
||
present with objects that are merely misplaced. Therefore, backfilling
|
||
operations with the 'balanced' and 'high_client_ops' mClock profiles may
|
||
progress slower than what was seen with the 'WeightedPriorityQueue' (WPQ)
|
||
scheduler.
|
||
* The QoS allocations in all the mClock profiles are optimized based on the above
|
||
fixes and enhancements.
|
||
* For more detailed information see:
|
||
https://docs.ceph.com/en/latest/rados/configuration/mclock-config-ref/
|
||
* mgr/snap_schedule: The snap-schedule mgr module now retains one less snapshot
|
||
than the number mentioned against the config tunable `mds_max_snaps_per_dir`
|
||
so that a new snapshot can be created and retained during the next schedule
|
||
run.
|
||
* cephfs: Running the command "ceph fs authorize" for an existing entity now
|
||
upgrades the entity's capabilities instead of printing an error. It can now
|
||
also change read/write permissions in a capability that the entity already
|
||
holds. If the capability passed by user is same as one of the capabilities
|
||
that the entity already holds, idempotency is maintained.
|
||
* `ceph config dump --format <json|xml>` output will display the localized
|
||
option names instead of its normalized version. For e.g.,
|
||
"mgr/prometheus/x/server_port" will be displayed instead of
|
||
"mgr/prometheus/server_port". This matches the output of the non pretty-print
|
||
formatted version of the command.
|
||
|
||
>=17.2.1
|
||
|
||
* The "BlueStore zero block detection" feature (first introduced to Quincy in
|
||
https://github.com/ceph/ceph/pull/43337) has been turned off by default with a
|
||
new global configuration called `bluestore_zero_block_detection`. This feature,
|
||
intended for large-scale synthetic testing, does not interact well with some RBD
|
||
and CephFS features. Any side effects experienced in previous Quincy versions
|
||
would no longer occur, provided that the configuration remains set to false.
|
||
Relevant tracker: https://tracker.ceph.com/issues/55521
|
||
|
||
* telemetry: Added new Rook metrics to the 'basic' channel to report Rook's
|
||
version, Kubernetes version, node metrics, etc.
|
||
See a sample report with `ceph telemetry preview`.
|
||
Opt-in with `ceph telemetry on`.
|
||
|
||
For more details, see:
|
||
|
||
https://docs.ceph.com/en/latest/mgr/telemetry/
|
||
|
||
* OSD: The issue of high CPU utilization during recovery/backfill operations
|
||
has been fixed. For more details, see: https://tracker.ceph.com/issues/56530.
|
||
|
||
>=15.2.17
|
||
|
||
* OSD: Octopus modified the SnapMapper key format from
|
||
<LEGACY_MAPPING_PREFIX><snapid>_<shardid>_<hobject_t::to_str()>
|
||
to
|
||
<MAPPING_PREFIX><pool>_<snapid>_<shardid>_<hobject_t::to_str()>
|
||
When this change was introduced, 94ebe0e also introduced a conversion
|
||
with a crucial bug which essentially destroyed legacy keys by mapping them
|
||
to
|
||
<MAPPING_PREFIX><poolid>_<snapid>_
|
||
without the object-unique suffix. The conversion is fixed in this release.
|
||
Relevant tracker: https://tracker.ceph.com/issues/56147
|
||
|
||
* Cephadm may now be configured to carry out CephFS MDS upgrades without
|
||
reducing ``max_mds`` to 1. Previously, Cephadm would reduce ``max_mds`` to 1 to
|
||
avoid having two active MDS modifying on-disk structures with new versions,
|
||
communicating cross-version-incompatible messages, or other potential
|
||
incompatibilities. This could be disruptive for large-scale CephFS deployments
|
||
because the cluster cannot easily reduce active MDS daemons to 1.
|
||
NOTE: Staggered upgrade of the mons/mgrs may be necessary to take advantage
|
||
of the feature, refer this link on how to perform it:
|
||
https://docs.ceph.com/en/quincy/cephadm/upgrade/#staggered-upgrade
|
||
Relevant tracker: https://tracker.ceph.com/issues/55715
|
||
|
||
* Introduced a new file system flag `refuse_client_session` that can be set using the
|
||
`fs set` command. This flag allows blocking any incoming session
|
||
request from client(s). This can be useful during some recovery situations
|
||
where it's desirable to bring MDS up but have no client workload.
|
||
Relevant tracker: https://tracker.ceph.com/issues/57090
|
||
|
||
* New MDSMap field `max_xattr_size` which can be set using the `fs set` command.
|
||
This MDSMap field allows to configure the maximum size allowed for the full
|
||
key/value set for a filesystem extended attributes. It effectively replaces
|
||
the old per-MDS `max_xattr_pairs_size` setting, which is now dropped.
|
||
Relevant tracker: https://tracker.ceph.com/issues/55725
|
||
|
||
* Introduced a new file system flag `refuse_standby_for_another_fs` that can be
|
||
set using the `fs set` command. This flag prevents using a standby for another
|
||
file system (join_fs = X) when standby for the current filesystem is not available.
|
||
Relevant tracker: https://tracker.ceph.com/issues/61599
|