This is a follow-up to PR: https://github.com/ceph/ceph/pull/48703.
This commit also considers changes made ephemerally using either the
'daemon' or the 'tell' interface to override the built-in mClock
QoS parameters. In such a scenario, the ephemeral changes are removed
using the rm_val() method exposed by the config subsystem, and the
removal is logged.
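For illustration, either of the following ephemeral overrides (the
option name and value are only examples) is now detected and removed
while a built-in profile is active:

  ceph tell osd.0 config set osd_mclock_scheduler_client_res 0.5
  ceph daemon osd.0 config set osd_mclock_scheduler_client_res 0.5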
Other changes:
1. Add a standalone test to exercise the fix.
2. Add a documentation note on the outcome of an attempt to modify
built-in profile defaults.
Fixes: https://tracker.ceph.com/issues/61155
Signed-off-by: Sridhar Seshasayee <sseshasa@redhat.com>
Let's use the middle profile as the default.
Modify the standalone tests accordingly.
Signed-off-by: Samuel Just <sjust@redhat.com>
Signed-off-by: Sridhar Seshasayee <sseshasa@redhat.com>
Fix an issue where an overridden mClock recovery setting (set prior to
an osd restart) could be lost after an osd restart.
For example, consider that prior to an osd restart, the option
'osd_max_backfills' was successfully set to a value different from the
mClock default. If the osd was restarted for some reason, the
boot-up sequence incorrectly reset the backfill value to the
mClock default within the async local/remote reservers. This fix
ensures that no change is made if the currently overridden value is
different from the mClock default.
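An illustrative sequence (values are examples; overrides of the
recovery/backfill limits under mClock are assumed to be permitted via
osd_mclock_override_recovery_settings):

  ceph config set osd osd_mclock_override_recovery_settings true
  ceph config set osd osd_max_backfills 8
  # restart an osd; afterwards the reservers should still honor:
  ceph config show osd.0 osd_max_backfills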
Modify an existing standalone test to verify that the local and remote
async reservers are updated to the desired number of backfills under
normal conditions and also across osd restarts.
Signed-off-by: Sridhar Seshasayee <sseshasa@redhat.com>
The mClock scheduler's cost model for HDDs/SSDs is modified and now
represents the cost of an IO in terms of bytes.
The cost parameters, namely osd_mclock_cost_per_io_usec_[hdd|ssd]
and osd_mclock_cost_per_byte_usec_[hdd|ssd], which represent the cost
of an IO in microseconds, are inaccurate and therefore removed.
The new model considers the following aspects of an osd to calculate
the cost of an IO:
- osd_mclock_max_capacity_iops_[hdd|ssd] (existing option)
The measured random write IOPS at a 4 KiB block size. This is
measured during OSD boot-up using the OSD bench tool.
- osd_mclock_max_sequential_bandwidth_[hdd|ssd] (new config option)
The maximum sequential bandwidth of the underlying device.
For HDDs, 150 MiB/s is considered, and for SSDs, 750 MiB/s is
considered in the cost calculation.
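For reference, these inputs can be inspected on a running cluster
(a sketch; reported values will vary):

  ceph config show osd.0 osd_mclock_max_capacity_iops_hdd
  ceph config show osd.0 osd_mclock_max_sequential_bandwidth_hdd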
The following important changes are made to arrive at the overall
cost of an IO:
1. Represent the QoS reservation and limit config parameters as proportions:
The reservation and limit parameters are now set in terms of a
proportion of the OSD's max IOPS capacity. The earlier representation
was in terms of IOPS per OSD shard, which required the user to perform
calculations before setting the parameter. Representing the
reservation and limit as proportions is much more intuitive and
simpler for a user.
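For example (illustrative value; a profile that honors user-set QoS
params, e.g. 'custom', is assumed to be in effect):

  # reserve 40% of the OSD's max IOPS capacity for client ops
  ceph config set osd.0 osd_mclock_scheduler_client_res 0.4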
2. Cost per IO Calculation:
Using the above config options, osd_bandwidth_cost_per_io for the osd
is calculated and set. It is the ratio of the max sequential bandwidth
to the max random write IOPS of the osd. It is a constant and
represents the base cost of an IO in terms of bytes. This is added to
the actual size of the IO (in bytes) to represent the overall cost of
the IO operation. See mClockScheduler::calc_scaled_cost().
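As a worked example with assumed numbers (the 150 MiB/s HDD bandwidth
default and a measured capacity of 315 IOPS):

  osd_bandwidth_cost_per_io = 157286400 / 315 ~= 499322 bytes
  overall cost of a 4 KiB IO ~= 499322 + 4096 = 503418 bytes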
3. Cost calculation in Bytes:
The settings for reservation and limit in terms of a fraction of the
OSD's maximum IOPS capacity are converted to bytes/sec before updating
the mClock server's ClientInfo structure. This is done for each OSD op
shard using osd_bandwidth_capacity_per_shard as shown below:
   (res|lim)  =  (IOPS proportion) * osd_bandwidth_capacity_per_shard
  (bytes/sec)       (unitless)               (bytes/sec)
The above result is updated within the mClock server's ClientInfo
structure for different op_scheduler_class operations. See
mClockScheduler::ClientRegistry::update_from_config().
The overall cost of an IO operation (in secs) is finally determined
during the tag calculations performed in the mClock server. See
crimson::dmclock::RequestTag::tag_calc() for more details.
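For instance, assuming the 150 MiB/s HDD bandwidth default and 5 op
shards (an assumed osd_op_num_shards_hdd value):

  osd_bandwidth_capacity_per_shard = 157286400 / 5 = 31457280 bytes/sec
  reservation of 0.4 => 0.4 * 31457280 = 12582912 bytes/sec per shard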
4. Profile Allocations:
Optimize mClock profile allocations due to the change in the cost model
and lower recovery cost.
5. Modify standalone tests to reflect the change in the QoS config
parameter representation of reservation and limit options.
Fixes: https://tracker.ceph.com/issues/58529
Fixes: https://tracker.ceph.com/issues/59080
Signed-off-by: Samuel Just <sjust@redhat.com>
Signed-off-by: Sridhar Seshasayee <sseshasa@redhat.com>
The QoS parameters (res, wgt, lim) of mClock profiles are not allowed to
be modified by users using commands like "config set" or via admin socket.
handle_conf_change() does not allow changes to any built-in mClock
profile at the mClock scheduler. But the config subsystem still showed
the change for the built-in mClock profile QoS parameters. This misled
users into thinking that the change had been made at the mClock server
when that was not the case.
The above issue is the result of the config "levels" used by the
config subsystem. The initial built-in QoS params are set at the
CONF_DEFAULT level. This allows the user to modify the built-in QoS
params using the "config set" command, which sets values at the
CONF_MON level, which has a higher priority than the CONF_DEFAULT
level. The new value is persisted in the mon store, and therefore the
config subsystem shows the change when the "config show" command is
issued.
To prevent the above, this commit adds changes to restore the defaults
set for the built-in profiles by removing the new config changes from
the MON store. This causes the original defaults to come back into
effect and maintains a consistent view of the built-in profile across
all levels.
To accomplish this, the mClock scheduler is provided with additional
information like the OSD id, shard id and a pointer to the MonClient
using which the Mon store command to remove the option is executed.
A standalone test is added to verify that built-in params cannot be
modified and the original profile params are retained.
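With the fix, an attempt such as the following (option and value are
examples) is reverted by the OSD, and "config show" reports the
built-in default again:

  ceph config set osd.0 osd_mclock_scheduler_client_wgt 6
  ceph config show osd.0 osd_mclock_scheduler_client_wgt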
Fixes: https://tracker.ceph.com/issues/57533
Signed-off-by: Sridhar Seshasayee <sseshasa@redhat.com>
- Consolidate all mclock standalone tests under
qa/standalone/misc/mclock-config.sh.
- Revert existing tests in ceph-helpers.sh that verified the earlier hard
override of recovery/backfill limits.
- Add new tests to verify the procedure to change the recovery/backfill
limits with mclock scheduler.
Fixes: https://tracker.ceph.com/issues/57529
Signed-off-by: Sridhar Seshasayee <sseshasa@redhat.com>
Create the initial mClock QoS params at CONF_DEFAULT level using
set_val_default(). This allows switching to a custom profile on a
running OSD and making the necessary changes to the desired QoS params.
Note that switching to the 'custom' profile and then subsequently
changing the QoS params using "config set osd.n ..." will be at a
higher level, i.e. at CONF_MON.
But when switching back to a built-in profile, the new values won't
take effect since CONF_DEFAULT < CONF_MON. For the values to take
effect, the config keys created as part of the 'custom' profile must
be removed from the ConfigMonitor store after switching back to a
built-in profile.
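The expected sequence is therefore along these lines (option names
and values are examples):

  ceph config set osd.0 osd_mclock_profile custom
  ceph config set osd.0 osd_mclock_scheduler_client_res 0.5
  # switch back to a built-in profile...
  ceph config set osd.0 osd_mclock_profile high_client_ops
  # ...and remove the 'custom' keys for the built-in defaults to apply
  ceph config rm osd.0 osd_mclock_scheduler_client_res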
- Added a couple of standalone tests to exercise the scenario.
- Fixed a couple of typos relating to the best-effort weights in the
  mClock configuration document and the mClock internal documentation.
- Added new sections to the mClock configuration document outlining
  the steps to switch between the built-in and custom profiles and
  vice-versa.
Fixes: https://tracker.ceph.com/issues/55153
Signed-off-by: Sridhar Seshasayee <sseshasa@redhat.com>
Add the snaptrim duration to the json formatted output of the pg dump
stats. Define methods for a PG to set the snaptrim begin time and then
to calculate the total time spent trimming all the objects for the
snaps in the PG's snap_trimq.
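The new field can be inspected, for example, with (assuming the json
key is named snaptrim_duration):

  ceph pg dump pgs --format=json-pretty | grep snaptrim_duration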
Tests:
- Librados C and C++ API tests to verify the time spent for a snaptrim
operation on a PG. These tests use the self-managed snaps APIs.
- Standalone tests to verify snaptrim duration using rados pool snaps.
Signed-off-by: Sridhar Seshasayee <sseshasa@redhat.com>
Add a new column, OBJECTS_TRIMMED, to the pg dump stats that shows the
number of objects trimmed when a snap is removed.
When a pg splits, the stats from the parent pg are copied to the
child pg. In such a case, reset objects_trimmed to 0 for the child pg
(see PeeringState::split_into()). Otherwise, incorrect stats would be
shown for a child pg after the split operation.
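Similarly, the new stat can be checked with, for example:

  ceph pg dump pgs --format=json-pretty | grep objects_trimmed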
Tests:
- Librados C and C++ API tests to verify the number of objects trimmed
during snaptrim operation. These tests use the self-managed snaps APIs.
- Standalone tests to verify objects trimmed using rados pool snaps.
Signed-off-by: Sridhar Seshasayee <sseshasa@redhat.com>
Modified test cases:
1. ver-health.sh:
a. TEST_check_version_health_1():
To avoid intermittent timeouts observed in wait_for_health_string(),
increase the wait time to 20 secs.
Signed-off-by: Sridhar Seshasayee <sseshasa@redhat.com>
Given an initial (set of) osd(s), provide up to N OSDs that can be
stopped together without making PGs unavailable.
This can be used to quickly identify large(r) batches of OSDs that can be
stopped together to (for example) upgrade.
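A hedged usage sketch (the exact flag name may differ):

  # starting from osd.1, report up to 20 osds safe to stop together
  ceph osd ok-to-stop 1 --max 20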
Signed-off-by: Sage Weil <sage@newdream.net>
in my test bed, it takes 11 seconds to boot the 3 OSDs and to restart
one of them; this fails the test.
so we need to take that time into consideration. in this change, the
delay is added to the total "warn_older_version_delay", so the monitor
does not start sending warnings earlier than expected.
Signed-off-by: Kefu Chai <kchai@redhat.com>
in e5b1ae5554, a new option named
"debug_version_for_testing" is introduced to override the version so
we can test the version check.
in crimson, we have two families of shared functions:
- one of them is used by alien store. they are compiled with
-DWITH_SEASTAR and -DWITH_ALIEN, to enable the shim code between
seastar and POSIX threads.
- another is used by crimson in general, where no lock is allowed.
currently, we use the "crimson" and "ceph" namespace to differentiate
these two families of functions, so they can colocate in the same
executable without violating the ODR. see src/include/common_fwd.h for
more details.
the functions defined in src/common/version.cc are also shared by
alien store and crimson code. and because we have different
implementations of `CephContext` in crimson and in classic OSD (i.e.
alienstore), we would have to have different implementations of these
functions as well if we followed the same approach. but since these
functions are very simple and non-blocking, there is not much value in
differentiating them; it is better to inject the test settings using
an environment variable instead of the ceph option subsystem.
in this change, the "ceph_debug_version_for_testing" environment
variable is checked instead, so that crimson and alienstore can share
the same compilation unit of version.cc. and the
"debug_version_for_testing" option is removed.
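A hedged usage sketch (the value is arbitrary; a binary only reflects
the override if it reports the version via ceph_version_to_str()):

  env ceph_debug_version_for_testing=v0.0.1 ceph-osd --version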
Signed-off-by: Kefu Chai <kchai@redhat.com>
Make sure PGs peer (simply flushing state to mon isn't enough).
Fixes: https://tracker.ceph.com/issues/43721
Signed-off-by: Sage Weil <sage@redhat.com>
We need to account for CRUSH_ITEM_NONE entries in the EC PG acting
set.
Fixes: https://tracker.ceph.com/issues/43151
Signed-off-by: Sage Weil <sage@redhat.com>
Helpers to decide when it is safe to stop a mon, add a mon that is
not started, or remove a mon. (Adding and starting a mon would always
be safe, but it takes time to sync, so it's not really possible to do
quickly.)
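These helpers surface as mon commands along the lines of (mon ids are
examples):

  ceph mon ok-to-stop a
  ceph mon ok-to-add-offline
  ceph mon ok-to-rm b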
Signed-off-by: Sage Weil <sage@redhat.com>
/bin/bash is a Linuxism. Other operating systems install bash to
different paths. Use /usr/bin/env in shebangs to find bash.
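For example:

  -#!/bin/bash
  +#!/usr/bin/env bash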
Signed-off-by: Alan Somers <asomers@gmail.com>
- stop running via make check
- add teuthology yamls to run them
- disable ceph_objectstore_tool.py for now (too slow for make check, and
  we can't use vstart in teuthology via a package install)
- drop cephtool tests since those are already covered by other teuthology
tests
- leave a handful of (fast!) ceph-helpers tests for make check for minimal
integration tests.
Signed-off-by: Sage Weil <sage@redhat.com>