mirror of
https://github.com/ceph/ceph
synced 2025-01-25 20:45:06 +00:00
3fac3b1236
Before this, "mds_join_fs" config enforced a preference for the standby the monitors would select. Now the monitors actively enforce this by purposefully removing an MDS wither lower "affinity". An MDS standby has highest affinity if its mds_join_fs is the file system in question or a vanilla standby (no mds_join_fs). Fixes: https://tracker.ceph.com/issues/43392 Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
309 lines
16 KiB
Plaintext
309 lines
16 KiB
Plaintext
>=15.0.0
|
|
--------
|
|
|
|
* The RGW "num_rados_handles" has been removed.
|
|
* If you were using a value of "num_rados_handles" greater than 1
|
|
multiply your current "objecter_inflight_ops" and
|
|
"objecter_inflight_op_bytes" paramaeters by the old
|
|
"num_rados_handles" to get the same throttle behavior.
|
|
|
|
* Ceph now packages python bindings for python3.6 instead of
|
|
python3.4, because python3 in EL7/EL8 is now using python3.6
|
|
as the native python3. see the `announcement <https://lists.fedoraproject.org/archives/list/epel-announce@lists.fedoraproject.org/message/EGUMKAIMPK2UD5VSHXM53BH2MBDGDWMO/>_`
|
|
for more details on the background of this change.
|
|
|
|
* librbd now uses a write-around cache policy be default,
|
|
replacing the previous write-back cache policy default.
|
|
This cache policy allows librbd to immediately complete
|
|
write IOs while they are still in-flight to the OSDs.
|
|
Subsequent flush requests will ensure all in-flight
|
|
write IOs are completed prior to completing. The
|
|
librbd cache policy can be controlled via a new
|
|
"rbd_cache_policy" configuration option.
|
|
|
|
* librbd now includes a simple IO scheduler which attempts to
|
|
batch together multiple IOs against the same backing RBD
|
|
data block object. The librbd IO scheduler policy can be
|
|
controlled via a new "rbd_io_scheduler" configuration
|
|
option.
|
|
|
|
* RGW: radosgw-admin introduces two subcommands that allow the
|
|
managing of expire-stale objects that might be left behind after a
|
|
bucket reshard in earlier versions of RGW. One subcommand lists such
|
|
objects and the other deletes them. Read the troubleshooting section
|
|
of the dynamic resharding docs for details.
|
|
|
|
* RGW: Bucket naming restrictions have changed and likely to cause
|
|
InvalidBucketName errors. We recommend to set ``rgw_relaxed_s3_bucket_names``
|
|
option to true as a workaround.
|
|
|
|
* In the Zabbix Mgr Module there was a typo in the key being send
|
|
to Zabbix for PGs in backfill_wait state. The key that was sent
|
|
was 'wait_backfill' and the correct name is 'backfill_wait'.
|
|
Update your Zabbix template accordingly so that it accepts the
|
|
new key being send to Zabbix.
|
|
|
|
* zabbix plugin for ceph manager now includes osd and pool
|
|
discovery. Update of zabbix_template.xml is needed
|
|
to receive per-pool (read/write throughput, diskspace usage)
|
|
and per-osd (latency, status, pgs) statistics
|
|
|
|
* The format of all date + time stamps has been modified to fully
|
|
conform to ISO 8601. The old format (``YYYY-MM-DD
|
|
HH:MM:SS.ssssss``) excluded the ``T`` separator between the date and
|
|
time and was rendered using the local time zone without any explicit
|
|
indication. The new format includes the separator as well as a
|
|
``+nnnn`` or ``-nnnn`` suffix to indicate the time zone, or a ``Z``
|
|
suffix if the time is UTC. For example,
|
|
``2019-04-26T18:40:06.225953+0100``.
|
|
|
|
Any code or scripts that was previously parsing date and/or time
|
|
values from the JSON or XML structure CLI output should be checked
|
|
to ensure it can handle ISO 8601 conformant values. Any code
|
|
parsing date or time values from the unstructured human-readable
|
|
output should be modified to parse the structured output instead, as
|
|
the human-readable output may change without notice.
|
|
|
|
* The ``bluestore_no_per_pool_stats_tolerance`` config option has been
|
|
replaced with ``bluestore_fsck_error_on_no_per_pool_stats``
|
|
(default: false). The overall default behavior has not changed:
|
|
fsck will warn but not fail on legacy stores, and repair will
|
|
convert to per-pool stats.
|
|
|
|
* The disaster-recovery related 'ceph mon sync force' command has been
|
|
replaced with 'ceph daemon <...> sync_force'.
|
|
|
|
* The ``osd_recovery_max_active`` option now has
|
|
``osd_recovery_max_active_hdd`` and ``osd_recovery_max_active_ssd``
|
|
variants, each with different default values for HDD and SSD-backed
|
|
OSDs, respectively. By default ``osd_recovery_max_active`` now
|
|
defaults to zero, which means that the OSD will conditionally use
|
|
the HDD or SSD option values. Administrators who have customized
|
|
this value may want to consider whether they have set this to a
|
|
value similar to the new defaults (3 for HDDs and 10 for SSDs) and,
|
|
if so, remove the option from their configuration entirely.
|
|
|
|
* monitors now have a `ceph osd info` command that will provide information
|
|
on all osds, or provided osds, thus simplifying the process of having to
|
|
parse `osd dump` for the same information.
|
|
|
|
* The structured output of ``ceph status`` or ``ceph -s`` is now more
|
|
concise, particularly the `mgrmap` and `monmap` sections, and the
|
|
structure of the `osdmap` section has been cleaned up.
|
|
|
|
* A health warning is now generated if the average osd heartbeat ping
|
|
time exceeds a configurable threshold for any of the intervals
|
|
computed. The OSD computes 1 minute, 5 minute and 15 minute
|
|
intervals with average, minimum and maximum values. New configuration
|
|
option ``mon_warn_on_slow_ping_ratio`` specifies a percentage of
|
|
``osd_heartbeat_grace`` to determine the threshold. A value of zero
|
|
disables the warning. New configuration option
|
|
``mon_warn_on_slow_ping_time`` specified in milliseconds over-rides the
|
|
computed value, causes a warning
|
|
when OSD heartbeat pings take longer than the specified amount.
|
|
New admin command ``ceph daemon mgr.# dump_osd_network [threshold]`` command will
|
|
list all connections with a ping time longer than the specified threshold or
|
|
value determined by the config options, for the average for any of the 3 intervals.
|
|
New admin command ``ceph daemon osd.# dump_osd_network [threshold]`` will
|
|
do the same but only including heartbeats initiated by the specified OSD.
|
|
|
|
* Inline data support for CephFS has been deprecated. When setting the flag,
|
|
users will see a warning to that effect, and enabling it now requires the
|
|
``--yes-i-really-really-mean-it`` flag. If the MDS is started on a
|
|
filesystem that has it enabled, a health warning is generated. Support for
|
|
this feature will be removed in a future release.
|
|
|
|
* ``ceph {set,unset} full`` is not supported anymore. We have been using
|
|
``full`` and ``nearfull`` flags in OSD map for tracking the fullness status
|
|
of a cluster back since the Hammer release, if the OSD map is marked ``full``
|
|
all write operations will be blocked until this flag is removed. In the
|
|
Infernalis release and Linux kernel 4.7 client, we introduced the per-pool
|
|
full/nearfull flags to track the status for a finer-grained control, so the
|
|
clients will hold the write operations if either the cluster-wide ``full``
|
|
flag or the per-pool ``full`` flag is set. This was a compromise, as we
|
|
needed to support the cluster with and without per-pool ``full`` flags
|
|
support. But this practically defeated the purpose of introducing the
|
|
per-pool flags. So, in the Mimic release, the new flags finally took the
|
|
place of their cluster-wide counterparts, as the monitor started removing
|
|
these two flags from OSD map. So the clients of Infernalis and up can benefit
|
|
from this change, as they won't be blocked by the full pools which they are
|
|
not writing to. In this release, ``ceph {set,unset} full`` is now considered
|
|
as an invalid command. And the clients will continue honoring both the
|
|
cluster-wide and per-pool flags to be backward comaptible with pre-infernalis
|
|
clusters.
|
|
|
|
* The telemetry module now reports more information.
|
|
|
|
First, there is a new 'device' channel, enabled by default, that
|
|
will report anonymized hard disk and SSD health metrics to
|
|
telemetry.ceph.com in order to build and improve device failure
|
|
prediction algorithms. If you are not comfortable sharing device
|
|
metrics, you can disable that channel first before re-opting-in::
|
|
|
|
ceph config set mgr mgr/telemetry/channel_device false
|
|
|
|
Second, we now report more information about CephFS file systems,
|
|
including:
|
|
|
|
- how many MDS daemons (in total and per file system)
|
|
- which features are (or have been) enabled
|
|
- how many data pools
|
|
- approximate file system age (year + month of creation)
|
|
- how many files, bytes, and snapshots
|
|
- how much metadata is being cached
|
|
|
|
We have also added:
|
|
|
|
- which Ceph release the monitors are running
|
|
- whether msgr v1 or v2 addresses are used for the monitors
|
|
- whether IPv4 or IPv6 addresses are used for the monitors
|
|
- whether RADOS cache tiering is enabled (and which mode)
|
|
- whether pools are replicated or erasure coded, and
|
|
which erasure code profile plugin and parameters are in use
|
|
- how many hosts are in the cluster, and how many hosts have each type of daemon
|
|
- whether a separate OSD cluster network is being used
|
|
- how many RBD pools and images are in the cluster, and how many pools have RBD mirroring enabled
|
|
- how many RGW daemons, zones, and zonegroups are present; which RGW frontends are in use
|
|
- aggregate stats about the CRUSH map, like which algorithms are used, how
|
|
big buckets are, how many rules are defined, and what tunables are in
|
|
use
|
|
|
|
If you had telemetry enabled, you will need to re-opt-in with::
|
|
|
|
ceph telemetry on
|
|
|
|
You can view exactly what information will be reported first with::
|
|
|
|
ceph telemetry show # see everything
|
|
ceph telemetry show basic # basic cluster info (including all of the new info)
|
|
|
|
* Following invalid settings now are not tolerated anymore
|
|
for the command `ceph osd erasure-code-profile set xxx`.
|
|
* invalid `m` for "reed_sol_r6_op" erasure technique
|
|
* invalid `m` and invalid `w` for "liber8tion" erasure technique
|
|
|
|
* New OSD daemon command dump_recovery_reservations which reveals the
|
|
recovery locks held (in_progress) and waiting in priority queues.
|
|
|
|
* New OSD daemon command dump_scrub_reservations which reveals the
|
|
scrub reservations that are held for local (primary) and remote (replica) PGs.
|
|
|
|
* Previously, ``ceph tell mgr ...`` could be used to call commands
|
|
implemented by mgr modules. This is no longer supported. Since
|
|
luminous, using ``tell`` has not been necessary: those same commands
|
|
are also accessible without the ``tell mgr`` portion (e.g., ``ceph
|
|
tell mgr influx foo`` is the same as ``ceph influx foo``. ``ceph
|
|
tell mgr ...`` will now call admin commands--the same set of
|
|
commands accessible via ``ceph daemon ...`` when you are logged into
|
|
the appropriate host.
|
|
|
|
* The ``ceph tell`` and ``ceph daemon`` commands have been unified,
|
|
such that all such commands are accessible via either interface.
|
|
Note that ceph-mgr tell commands are accessible via either ``ceph
|
|
tell mgr ...`` or ``ceph tell mgr.<id> ...``, and it is only
|
|
possible to send tell commands to the active daemon (the standbys do
|
|
not accept incoming connections over the network).
|
|
|
|
* Ceph will now issue a health warning if a RADOS pool as a ``pg_num``
|
|
value that is not a power of two. This can be fixed by adjusting
|
|
the pool to a nearby power of two::
|
|
|
|
ceph osd pool set <pool-name> pg_num <new-pg-num>
|
|
|
|
Alternatively, the warning can be silenced with::
|
|
|
|
ceph config set global mon_warn_on_pool_pg_num_not_power_of_two false
|
|
|
|
* The format of MDSs in `ceph fs dump` has changed.
|
|
|
|
* The ``mds_cache_size`` config option is completely removed. Since luminous,
|
|
the ``mds_cache_memory_limit`` config option has been preferred to configure
|
|
the MDS's cache limits.
|
|
|
|
* The ``pg_autoscale_mode`` is now set to ``on`` by default for newly
|
|
created pools, which means that Ceph will automatically manage the
|
|
number of PGs. To change this behavior, or to learn more about PG
|
|
autoscaling, see :ref:`pg-autoscaler`. Note that existing pools in
|
|
upgraded clusters will still be set to ``warn`` by default.
|
|
|
|
* The pool parameter ``target_size_ratio``, used by the pg autoscaler,
|
|
has changed meaning. It is now normalized across pools, rather than
|
|
specifying an absolute ratio. For details, see :ref:`pg-autoscaler`.
|
|
If you have set target size ratios on any pools, you may want to set
|
|
these pools to autoscale ``warn`` mode to avoid data movement during
|
|
the upgrade::
|
|
|
|
ceph osd pool set <pool-name> pg_autoscale_mode warn
|
|
|
|
* The ``upmap_max_iterations`` config option of mgr/balancer has been
|
|
renamed to ``upmap_max_optimizations`` to better match its behaviour.
|
|
|
|
* ``mClockClientQueue`` and ``mClockClassQueue`` OpQueue
|
|
implementations have been removed in favor of of a single
|
|
``mClockScheduler`` implementation of a simpler OSD interface.
|
|
Accordingly, the ``osd_op_queue_mclock*`` family of config options
|
|
has been removed in favor of the ``osd_mclock_scheduler*`` family
|
|
of options.
|
|
|
|
* The config subsystem now searches dot ('.') delineated prefixes for
|
|
options. That means for an entity like ``client.foo.bar``, it's
|
|
overall configuration will be a combination of the global options,
|
|
``client``, ``client.foo``, and ``client.foo.bar``. Previously,
|
|
only global, ``client``, and ``client.foo.bar`` options would apply.
|
|
This change may affect the configuration for clients that include a
|
|
``.`` in their name.
|
|
|
|
Note that this only applies to configuration options in the
|
|
monitor's database--config file parsing is not affected.
|
|
|
|
* RGW: bucket listing performance on sharded bucket indexes has been
|
|
notably improved by heuristically -- and significantly, in many
|
|
cases -- reducing the number of entries requested from each bucket
|
|
index shard.
|
|
|
|
* MDS default cache memory limit is now 4GB.
|
|
|
|
* The behaviour of the ``-o`` argument to the rados tool has been reverted to
|
|
its orignal behaviour of indicating an output file. This reverts it to a more
|
|
consisten behaviour when compared to other tools. Specifying obect size is now
|
|
accomplished by using an upper case O ``-O``.
|
|
|
|
* In certain rare cases, OSDs would self-classify themselves as type
|
|
'nvme' instead of 'hdd' or 'ssd'. This appears to be limited to
|
|
cases where BlueStore was deployed with older versions of ceph-disk,
|
|
or manually without ceph-volume and LVM. Going forward, the OSD
|
|
will limit itself to only 'hdd' and 'ssd' (or whatever device class the user
|
|
manually specifies).
|
|
|
|
* RGW: a mismatch between the bucket notification documentation and the actual
|
|
message format was fixed. This means that any endpoints receiving bucket
|
|
notification, will now receive the same notifications inside an JSON array
|
|
named 'Records'. Note that this does not affect pulling bucket notification
|
|
from a subscription in a 'pubsub' zone, as these are already wrapped inside
|
|
that array.
|
|
|
|
* The configuration value ``osd_calc_pg_upmaps_max_stddev`` used for upmap
|
|
balancing has been removed. Instead use the mgr balancer config
|
|
``upmap_max_deviation`` which now is an integer number of PGs of deviation
|
|
from the target PGs per OSD. This can be set with a command like
|
|
``ceph config set mgr mgr/balancer/upmap_max_deviation 2``. The default
|
|
``upmap_max_deviation`` is 1. There are situations where crush rules
|
|
would not allow a pool to ever have completely balanced PGs. For example, if
|
|
crush requires 1 replica on each of 3 racks, but there are fewer OSDs in 1 of
|
|
the racks. In those cases, the configuration value can be increased.
|
|
|
|
* MDS daemons can now be assigned to manage a particular file system via the
|
|
new ``mds_join_fs`` option. The monitors will try to use only MDS for a file
|
|
system with mds_join_fs equal to the file system name (strong affinity).
|
|
Monitors may also deliberately failover an active MDS to a standby when the
|
|
cluster is otherwise healthy if the standby has stronger affinity.
|
|
|
|
* RGW Multisite: A new fine grained bucket-granularity policy configuration
|
|
system has been introduced and it supersedes the previous coarse zone sync
|
|
configuration (specifically the ``sync_from`` and ``sync_from_all`` fields
|
|
in the zonegroup configuration. New configuration should only be configured
|
|
after all relevant zones in the zonegroup have been upgraded.
|
|
|
|
* RGW S3: Support has been added for BlockPublicAccess set of APIs at a bucket
|
|
level, currently blocking/ignoring public acls & policies are supported.
|
|
User/Account level apis are planned to be added in the future |