Merge pull request #41308 from sseshasa/wip-osd-benchmark-for-mclock

osd: Run osd bench test to override default max osd capacity for mclock

Reviewed-by: Neha Ojha <nojha@redhat.com>
Reviewed-by: Josh Durgin <jdurgin@redhat.com>
Committed by Neha Ojha via GitHub, 2021-06-03 08:39:22 -07:00
commit 11252f6117
10 changed files with 399 additions and 261 deletions


@ -7,11 +7,12 @@
Mclock profiles mask the low level details from users, making it
easier for them to configure mclock.
To use mclock, you must provide the following input parameters:
The following input parameters are required for a mclock profile to configure
the QoS related parameters:
* total capacity of each OSD
* total capacity (IOPS) of each OSD (determined automatically)
* an mclock profile to enable
* an mclock profile type to enable
Using the settings in the specified profile, the OSD determines and applies the
lower-level mclock and Ceph parameters. The parameters applied by the mclock
@ -31,11 +32,11 @@ Ceph cluster enables the throttling of the operations(IOPS) belonging to
different client classes (background recovery, scrub, snaptrim, client op,
osd subop)”*.
The mclock profile uses the capacity limits and the mclock profile selected by
the user to determine the low-level mclock resource control parameters.
The mclock profile uses the capacity limits and the mclock profile type selected
by the user to determine the low-level mclock resource control parameters.
Depending on the profile, lower-level mclock resource-control parameters and
some Ceph-configuration parameters are transparently applied.
Depending on the profile type, lower-level mclock resource-control parameters
and some Ceph-configuration parameters are transparently applied.
The low-level mclock resource control parameters are the *reservation*,
*limit*, and *weight* that provide control of the resource shares, as
@ -56,7 +57,7 @@ mclock profiles can be broadly classified into two types,
as compared to background recoveries and other internal clients within
Ceph. This profile is enabled by default.
- **high_recovery_ops**:
This profile allocates more reservation to background recoveries as
compared to external clients and other internal clients within Ceph. For
example, an admin may enable this profile temporarily to speed-up background
recoveries during non-peak hours.
@ -109,7 +110,8 @@ chunk of the bandwidth allocation goes to client ops. Background recovery ops
are given lower allocation (and therefore take a longer time to complete). But
there might be instances that necessitate giving higher allocations to either
client ops or recovery ops. In order to deal with such a situation, you can
enable one of the alternate built-in profiles mentioned above.
enable one of the alternate built-in profiles by following the steps mentioned
in the next section.
If any mClock profile (including "custom") is active, the following Ceph config
sleep options will be disabled,
@ -139,20 +141,64 @@ all its clients.
Steps to Enable mClock Profile
==============================
The following sections outline the steps required to enable a mclock profile.
As already mentioned, the default mclock profile is set to *high_client_ops*.
The other values for the built-in profiles include *balanced* and
*high_recovery_ops*.
Determining OSD Capacity Using Benchmark Tests
----------------------------------------------
If there is a requirement to change the default profile, then the option
:confval:`osd_mclock_profile` may be set during runtime by using the following
command:
To allow mclock to fulfill its QoS goals across its clients, it is most
important to have a good understanding of each OSD's capacity in terms of its
baseline throughputs (IOPS) across the Ceph nodes. To determine this capacity,
you must perform appropriate benchmarking tests. The steps for performing these
benchmarking tests are broadly outlined below.
.. prompt:: bash #
Any existing benchmarking tool can be used for this purpose. The following
steps use the *Ceph Benchmarking Tool* (cbt_). Regardless of the tool
used, the steps described below remain the same.
ceph config set [global,osd] osd_mclock_profile <value>
For example, to change the profile to allow faster recoveries, the following
command can be used to switch to the *high_recovery_ops* profile:
.. prompt:: bash #
ceph config set osd osd_mclock_profile high_recovery_ops
.. note:: The *custom* profile is not recommended unless you are an advanced
user.
And that's it! You are ready to run workloads on the cluster and check if the
QoS requirements are being met.
OSD Capacity Determination (Automated)
======================================
The OSD capacity in terms of total IOPS is determined automatically during OSD
initialization. This is achieved by running the OSD bench tool and overriding
the default value of ``osd_mclock_max_capacity_iops_[hdd, ssd]`` option
depending on the device type. No other action/input is expected from the user
to set the OSD capacity. You may verify the capacity of an OSD after the
cluster is brought up by using the following command:
.. prompt:: bash #
ceph config show osd.x osd_mclock_max_capacity_iops_[hdd, ssd]
For example, the following command shows the max capacity for osd.0 on a Ceph
node whose underlying device type is SSD:
.. prompt:: bash #
ceph config show osd.0 osd_mclock_max_capacity_iops_ssd
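
If the reported value needs to be read programmatically (for example from a QA
or monitoring script), a minimal sketch along the following lines can wrap the
same command. It assumes the ``ceph`` CLI is on the path, that ``osd.0`` is
SSD-backed, and that ``ceph config show`` prints the bare numeric value:

.. code-block:: python

   import subprocess

   # Convenience sketch: shell out to the same command shown above and parse
   # the reported max capacity as a float.
   out = subprocess.check_output(
       ["ceph", "config", "show", "osd.0", "osd_mclock_max_capacity_iops_ssd"]
   )
   max_capacity_iops = float(out.decode().strip())
   print(f"osd.0 max capacity: {max_capacity_iops:.2f} IOPS")
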
Steps to Manually Benchmark an OSD (Optional)
=============================================
.. note:: These steps are only necessary if you want to override the OSD
capacity already determined automatically during OSD initialization.
Otherwise, you may skip this section entirely.
Any existing benchmarking tool can be used for this purpose. In this case, the
steps use the *Ceph OSD Bench* command described in the next section. Regardless
of the tool/command used, the steps outlined further below remain the same.
As already described in the :ref:`dmclock-qos` section, the number of
shards and the bluestore's throttle parameters have an impact on the mclock op
@ -167,68 +213,85 @@ maximize the impact of the mclock scheduler.
:Bluestore Throttle Parameters:
We recommend using the default values as defined by
:confval:`bluestore_throttle_bytes` and :confval:`bluestore_throttle_deferred_bytes`. But
these parameters may also be determined during the benchmarking phase as
described below.
:confval:`bluestore_throttle_bytes` and
:confval:`bluestore_throttle_deferred_bytes`. But these parameters may also be
determined during the benchmarking phase as described below.
Benchmarking Test Steps Using CBT
`````````````````````````````````
OSD Bench Command Syntax
````````````````````````
The steps below use the default shards and detail the steps used to determine the
correct bluestore throttle values.
The :ref:`osd-subsystem` section describes the OSD bench command. The syntax
used for benchmarking is shown below :
.. note:: These steps, although manual in April 2021, will be automated in the future.
.. prompt:: bash #
1. On the Ceph node hosting the OSDs, download cbt_ from git.
2. Install cbt and all the dependencies mentioned on the cbt github page.
3. Construct the Ceph configuration file and the cbt yaml file.
4. Ensure that the bluestore throttle options ( i.e.
:confval:`bluestore_throttle_bytes` and :confval:`bluestore_throttle_deferred_bytes`) are
set to the default values.
5. Ensure that the test is performed on similar device types to get reliable
OSD capacity data.
6. The OSDs can be grouped together with the desired replication factor for the
test to ensure reliability of OSD capacity data.
7. After ensuring that the OSDs nodes are in the desired configuration, run a
simple 4KiB random write workload on the OSD(s) for 300 secs.
8. Note the overall throughput(IOPS) obtained from the cbt output file. This
value is the baseline throughput(IOPS) when the default bluestore
throttle options are in effect.
9. If the intent is to determine the bluestore throttle values for your
environment, then set the two options, :confval:`bluestore_throttle_bytes` and
:confval:`bluestore_throttle_deferred_bytes` to 32 KiB(32768 Bytes) each to begin
with. Otherwise, you may skip to the next section.
10. Run the 4KiB random write workload as before on the OSD(s) for 300 secs.
11. Note the overall throughput from the cbt log files and compare the value
against the baseline throughput in step 8.
12. If the throughput doesn't match with the baseline, increment the bluestore
throttle options by 2x and repeat steps 9 through 11 until the obtained
throughput is very close to the baseline value.
ceph tell osd.N bench [TOTAL_BYTES] [BYTES_PER_WRITE] [OBJ_SIZE] [NUM_OBJS]
For example, during benchmarking on a machine with NVMe SSDs, a value of 256 KiB for
both bluestore throttle and deferred bytes was determined to maximize the impact
of mclock. For HDDs, the corresponding value was 40 MiB, where the overall
throughput was roughly equal to the baseline throughput. Note that in general
for HDDs, the bluestore throttle values are expected to be higher when compared
to SSDs.
where,
.. _cbt: https://github.com/ceph/cbt
* ``TOTAL_BYTES``: Total number of bytes to write
* ``BYTES_PER_WRITE``: Block size per write
* ``OBJ_SIZE``: Bytes per object
* ``NUM_OBJS``: Number of objects to write
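
As a rough illustration of how these parameters relate to the reported result,
the arithmetic below mirrors the calculation the OSD itself performs (bytes per
second is the total bytes written divided by the elapsed time, and IOPS is that
rate divided by the block size, as in the OSD code later in this change). The
parameter values are those of the example invocation shown further below; the
elapsed time is a made-up figure used purely for illustration:

.. code-block:: python

   TOTAL_BYTES = 12288000      # total bytes to write
   BYTES_PER_WRITE = 4096      # block size per write
   OBJ_SIZE = 4194304          # bytes per object
   NUM_OBJS = 100              # number of objects to write

   # OBJ_SIZE and NUM_OBJS only control which objects the writes land in;
   # they do not enter the throughput calculation.
   num_writes = TOTAL_BYTES // BYTES_PER_WRITE   # 3000 individual 4KiB writes
   elapsed_sec = 0.25                            # assumed elapsed time (example only)

   rate = TOTAL_BYTES / elapsed_sec              # bytes per second
   iops = rate / BYTES_PER_WRITE                 # 4KiB writes per second
   print(f"bandwidth: {rate / (1024 * 1024):.3f} MiB/s, iops: {iops:.3f}")
   # -> bandwidth: 46.875 MiB/s, iops: 12000.000
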
Benchmarking Test Steps Using OSD Bench
```````````````````````````````````````
The steps below use the default shards and detail the steps used to determine
the correct bluestore throttle values (optional).
#. Bring up your Ceph cluster and login to the Ceph node hosting the OSDs that
you wish to benchmark.
#. Run a simple 4KiB random write workload on an OSD using the following
commands:
.. note:: Note that before running the test, caches must be cleared to get an
accurate measurement.
For example, if you are running the benchmark test on osd.0, run the following
commands:
.. prompt:: bash #
ceph tell osd.0 cache drop
.. prompt:: bash #
ceph tell osd.0 bench 12288000 4096 4194304 100
#. Note the overall throughput(IOPS) obtained from the output of the osd bench
command. This value is the baseline throughput(IOPS) when the default
bluestore throttle options are in effect.
#. If the intent is to determine the bluestore throttle values for your
environment, then set the two options, :confval:`bluestore_throttle_bytes`
and :confval:`bluestore_throttle_deferred_bytes` to 32 KiB(32768 Bytes) each
to begin with. Otherwise, you may skip to the next section.
#. Run the 4KiB random write test as before using OSD bench.
#. Note the overall throughput from the output and compare the value
against the baseline throughput recorded in step 3.
#. If the throughput doesn't match with the baseline, increment the bluestore
throttle options by 2x and repeat steps 5 through 7 until the obtained
throughput is very close to the baseline value.
For example, during benchmarking on a machine with NVMe SSDs, a value of 256 KiB
for both bluestore throttle and deferred bytes was determined to maximize the
impact of mclock. For HDDs, the corresponding value was 40 MiB, where the
overall throughput was roughly equal to the baseline throughput. Note that in
general for HDDs, the bluestore throttle values are expected to be higher when
compared to SSDs.
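
As a rough aid for the doubling procedure in the steps above, the sketch below
lists the candidate throttle values produced by starting at 32 KiB and doubling
until the 40 MiB HDD figure quoted above is reached. It is illustrative only;
the ``ceph config set`` commands in the comments are one way the two options
could be applied for each candidate:

.. code-block:: python

   start = 32 * 1024          # 32 KiB starting value from the steps above
   stop = 40 * 1024 * 1024    # 40 MiB, the HDD value quoted above

   value = start
   while value <= stop:
       # For each candidate, both options would be set to `value`, e.g.:
       #   ceph config set osd bluestore_throttle_bytes <value>
       #   ceph config set osd bluestore_throttle_deferred_bytes <value>
       # and the 4KiB random write test repeated until the measured throughput
       # is close to the baseline recorded earlier.
       print(value)
       value *= 2
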
Specifying Max OSD Capacity
----------------------------
````````````````````````````
The steps in this section may be performed only if the max osd capacity is
different from the default values (SSDs: 21500 IOPS and HDDs: 315 IOPS). The
option ``osd_mclock_max_capacity_iops_[hdd, ssd]`` can be set by specifying it
in either the **[global]** section or in a specific OSD section (**[osd.x]** of
your Ceph configuration file).
Alternatively, commands of the following form may be used:
The steps in this section may be performed only if you want to override the
max osd capacity automatically determined during OSD initialization. The option
``osd_mclock_max_capacity_iops_[hdd, ssd]`` can be set by running the
following command:
.. prompt:: bash #
ceph config set [global, osd] osd_mclock_max_capacity_iops_[hdd,ssd] <value>
ceph config set [global,osd] osd_mclock_max_capacity_iops_[hdd,ssd] <value>
For example, the following command sets the max capacity for all the OSDs in a
Ceph node whose underlying device type is SSDs:
@ -245,43 +308,12 @@ device type is HDD, use a command like this:
ceph config set osd.0 osd_mclock_max_capacity_iops_hdd 350
Specifying Which mClock Profile to Enable
-----------------------------------------
As already mentioned, the default mclock profile is set to *high_client_ops*.
The other values for the built-in profiles include *balanced* and
*high_recovery_ops*.
If there is a requirement to change the default profile, then the option
:confval:`osd_mclock_profile` may be set in the **[global]** or **[osd]** section of
your Ceph configuration file before bringing up your cluster.
Alternatively, to change the profile during runtime, use the following command:
.. prompt:: bash #
ceph config set [global,osd] osd_mclock_profile <value>
For example, to change the profile to allow faster recoveries, the following
command can be used to switch to the *high_recovery_ops* profile:
.. prompt:: bash #
ceph config set osd osd_mclock_profile high_recovery_ops
.. note:: The *custom* profile is not recommended unless you are an advanced user.
And that's it! You are ready to run workloads on the cluster and check if the
QoS requirements are being met.
.. index:: mclock; config settings
mClock Config Options
=====================
.. confval:: osd_mclock_profile
.. confval:: osd_mclock_max_capacity_iops
.. confval:: osd_mclock_max_capacity_iops_hdd
.. confval:: osd_mclock_max_capacity_iops_ssd
.. confval:: osd_mclock_cost_per_io_usec


@ -95,6 +95,8 @@ or delete them if they were just created. ::
ceph pg {pgid} mark_unfound_lost revert|delete
.. _osd-subsystem:
OSD Subsystem
=============


@ -192,16 +192,22 @@ class CephTestCase(unittest.TestCase):
log.debug("wait_until_equal: success")
@classmethod
def wait_until_true(cls, condition, timeout, period=5):
def wait_until_true(cls, condition, timeout, check_fn=None, period=5):
elapsed = 0
retry_count = 0
while True:
if condition():
log.debug("wait_until_true: success in {0}s".format(elapsed))
log.debug("wait_until_true: success in {0}s and {1} retries".format(elapsed, retry_count))
return
else:
if elapsed >= timeout:
raise TestTimeoutError("Timed out after {0}s".format(elapsed))
if check_fn and check_fn() and retry_count < 5:
elapsed = 0
retry_count += 1
log.debug("wait_until_true: making progress, waiting (timeout={0} retry_count={1})...".format(timeout, retry_count))
else:
raise TestTimeoutError("Timed out after {0}s and {1} retries".format(elapsed, retry_count))
else:
log.debug("wait_until_true: waiting (timeout={0})...".format(timeout))
log.debug("wait_until_true: waiting (timeout={0} retry_count={1})...".format(timeout, retry_count))
time.sleep(period)
elapsed += period
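
The change above lets a caller pass a progress check alongside the completion
condition: when the condition is still false at the timeout but check_fn()
reports progress, the elapsed counter is reset (at most 5 times) instead of
raising TestTimeoutError. The real caller added by this change is in the
progress-module test below; the snippet that follows is only a usage sketch
with hypothetical stand-in callables:

    polls = {"n": 0}

    def work_is_done():
        # Stand-in completion check: pretend the work finishes on the 4th poll.
        polls["n"] += 1
        return polls["n"] > 3

    def work_made_progress():
        # Stand-in progress check: pretend every poll observes some progress.
        return True

    # Inside a CephTestCase-derived test this would be invoked as:
    #   self.wait_until_true(work_is_done, timeout=10, check_fn=work_made_progress)
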


@ -243,6 +243,13 @@ class TestProgress(MgrTestCase):
assert ev_id in live_ids
return False
def _is_inprogress_or_complete(self, ev_id):
for ev in self._events_in_progress():
if ev['id'] == ev_id:
return ev['progress'] > 0
# check if the event completed
return self._is_complete(ev_id)
def tearDown(self):
if self.POOL in self.mgr_cluster.mon_manager.pools:
self.mgr_cluster.mon_manager.remove_pool(self.POOL)
@ -396,5 +403,6 @@ class TestProgress(MgrTestCase):
log.info(json.dumps(ev1, indent=1))
self.wait_until_true(lambda: self._is_complete(ev1['id']),
check_fn=lambda: self._is_inprogress_or_complete(ev1['id']),
timeout=self.RECOVERY_PERIOD)
self.assertTrue(self._is_quiet())


@ -1019,19 +1019,6 @@ options:
default: 0.011
flags:
- runtime
- name: osd_mclock_max_capacity_iops
type: float
level: basic
desc: Max IOPs capacity (at 4KiB block size) to consider per OSD (overrides _ssd
and _hdd if non-zero)
long_desc: This option specifies the max osd capacity in iops per OSD. Helps in
QoS calculations when enabling a dmclock profile. Only considered for osd_op_queue
= mclock_scheduler
fmt_desc: Max IOPS capacity (at 4KiB block size) to consider per OSD
(overrides _ssd and _hdd if non-zero)
default: 0
flags:
- runtime
- name: osd_mclock_max_capacity_iops_hdd
type: float
level: basic


@ -2320,9 +2320,6 @@ OSD::OSD(CephContext *cct_,
this);
shards.push_back(one_shard);
}
// override some config options if mclock is enabled on all the shards
maybe_override_options_for_qos();
}
OSD::~OSD()
@ -2826,136 +2823,13 @@ will start to track new ops received afterwards.";
int64_t bsize = cmd_getval_or<int64_t>(cmdmap, "size", 4LL << 20);
int64_t osize = cmd_getval_or<int64_t>(cmdmap, "object_size", 0);
int64_t onum = cmd_getval_or<int64_t>(cmdmap, "object_num", 0);
uint32_t duration = cct->_conf->osd_bench_duration;
double elapsed = 0.0;
if (bsize > (int64_t) cct->_conf->osd_bench_max_block_size) {
// let us limit the block size because the next checks rely on it
// having a sane value. If we allow any block size to be set things
// can still go sideways.
ss << "block 'size' values are capped at "
<< byte_u_t(cct->_conf->osd_bench_max_block_size) << ". If you wish to use"
<< " a higher value, please adjust 'osd_bench_max_block_size'";
ret = -EINVAL;
ret = run_osd_bench_test(count, bsize, osize, onum, &elapsed, ss);
if (ret != 0) {
goto out;
} else if (bsize < (int64_t) (1 << 20)) {
// entering the realm of small block sizes.
// limit the count to a sane value, assuming a configurable amount of
// IOPS and duration, so that the OSD doesn't get hung up on this,
// preventing timeouts from going off
int64_t max_count =
bsize * duration * cct->_conf->osd_bench_small_size_max_iops;
if (count > max_count) {
ss << "'count' values greater than " << max_count
<< " for a block size of " << byte_u_t(bsize) << ", assuming "
<< cct->_conf->osd_bench_small_size_max_iops << " IOPS,"
<< " for " << duration << " seconds,"
<< " can cause ill effects on osd. "
<< " Please adjust 'osd_bench_small_size_max_iops' with a higher"
<< " value if you wish to use a higher 'count'.";
ret = -EINVAL;
goto out;
}
} else {
// 1MB block sizes are big enough so that we get more stuff done.
// However, to avoid the osd from getting hung on this and having
// timers being triggered, we are going to limit the count assuming
// a configurable throughput and duration.
// NOTE: max_count is the total amount of bytes that we believe we
// will be able to write during 'duration' for the given
// throughput. The block size hardly impacts this unless it's
// way too big. Given we already check how big the block size
// is, it's safe to assume everything will check out.
int64_t max_count =
cct->_conf->osd_bench_large_size_max_throughput * duration;
if (count > max_count) {
ss << "'count' values greater than " << max_count
<< " for a block size of " << byte_u_t(bsize) << ", assuming "
<< byte_u_t(cct->_conf->osd_bench_large_size_max_throughput) << "/s,"
<< " for " << duration << " seconds,"
<< " can cause ill effects on osd. "
<< " Please adjust 'osd_bench_large_size_max_throughput'"
<< " with a higher value if you wish to use a higher 'count'.";
ret = -EINVAL;
goto out;
}
}
if (osize && bsize > osize)
bsize = osize;
dout(1) << " bench count " << count
<< " bsize " << byte_u_t(bsize) << dendl;
ObjectStore::Transaction cleanupt;
if (osize && onum) {
bufferlist bl;
bufferptr bp(osize);
bp.zero();
bl.push_back(std::move(bp));
bl.rebuild_page_aligned();
for (int i=0; i<onum; ++i) {
char nm[30];
snprintf(nm, sizeof(nm), "disk_bw_test_%d", i);
object_t oid(nm);
hobject_t soid(sobject_t(oid, 0));
ObjectStore::Transaction t;
t.write(coll_t(), ghobject_t(soid), 0, osize, bl);
store->queue_transaction(service.meta_ch, std::move(t), NULL);
cleanupt.remove(coll_t(), ghobject_t(soid));
}
}
bufferlist bl;
bufferptr bp(bsize);
bp.zero();
bl.push_back(std::move(bp));
bl.rebuild_page_aligned();
{
C_SaferCond waiter;
if (!service.meta_ch->flush_commit(&waiter)) {
waiter.wait();
}
}
utime_t start = ceph_clock_now();
for (int64_t pos = 0; pos < count; pos += bsize) {
char nm[30];
unsigned offset = 0;
if (onum && osize) {
snprintf(nm, sizeof(nm), "disk_bw_test_%d", (int)(rand() % onum));
offset = rand() % (osize / bsize) * bsize;
} else {
snprintf(nm, sizeof(nm), "disk_bw_test_%lld", (long long)pos);
}
object_t oid(nm);
hobject_t soid(sobject_t(oid, 0));
ObjectStore::Transaction t;
t.write(coll_t::meta(), ghobject_t(soid), offset, bsize, bl);
store->queue_transaction(service.meta_ch, std::move(t), NULL);
if (!onum || !osize)
cleanupt.remove(coll_t::meta(), ghobject_t(soid));
}
{
C_SaferCond waiter;
if (!service.meta_ch->flush_commit(&waiter)) {
waiter.wait();
}
}
utime_t end = ceph_clock_now();
// clean up
store->queue_transaction(service.meta_ch, std::move(cleanupt), NULL);
{
C_SaferCond waiter;
if (!service.meta_ch->flush_commit(&waiter)) {
waiter.wait();
}
}
double elapsed = end - start;
double rate = count / elapsed;
double iops = rate / bsize;
f->open_object_section("osd_bench_results");
@ -3234,6 +3108,150 @@ will start to track new ops received afterwards.";
on_finish(ret, ss.str(), outbl);
}
int OSD::run_osd_bench_test(
int64_t count,
int64_t bsize,
int64_t osize,
int64_t onum,
double *elapsed,
ostream &ss)
{
int ret = 0;
uint32_t duration = cct->_conf->osd_bench_duration;
if (bsize > (int64_t) cct->_conf->osd_bench_max_block_size) {
// let us limit the block size because the next checks rely on it
// having a sane value. If we allow any block size to be set things
// can still go sideways.
ss << "block 'size' values are capped at "
<< byte_u_t(cct->_conf->osd_bench_max_block_size) << ". If you wish to use"
<< " a higher value, please adjust 'osd_bench_max_block_size'";
ret = -EINVAL;
return ret;
} else if (bsize < (int64_t) (1 << 20)) {
// entering the realm of small block sizes.
// limit the count to a sane value, assuming a configurable amount of
// IOPS and duration, so that the OSD doesn't get hung up on this,
// preventing timeouts from going off
int64_t max_count =
bsize * duration * cct->_conf->osd_bench_small_size_max_iops;
if (count > max_count) {
ss << "'count' values greater than " << max_count
<< " for a block size of " << byte_u_t(bsize) << ", assuming "
<< cct->_conf->osd_bench_small_size_max_iops << " IOPS,"
<< " for " << duration << " seconds,"
<< " can cause ill effects on osd. "
<< " Please adjust 'osd_bench_small_size_max_iops' with a higher"
<< " value if you wish to use a higher 'count'.";
ret = -EINVAL;
return ret;
}
} else {
// 1MB block sizes are big enough so that we get more stuff done.
// However, to avoid the osd from getting hung on this and having
// timers being triggered, we are going to limit the count assuming
// a configurable throughput and duration.
// NOTE: max_count is the total amount of bytes that we believe we
// will be able to write during 'duration' for the given
// throughput. The block size hardly impacts this unless it's
// way too big. Given we already check how big the block size
// is, it's safe to assume everything will check out.
int64_t max_count =
cct->_conf->osd_bench_large_size_max_throughput * duration;
if (count > max_count) {
ss << "'count' values greater than " << max_count
<< " for a block size of " << byte_u_t(bsize) << ", assuming "
<< byte_u_t(cct->_conf->osd_bench_large_size_max_throughput) << "/s,"
<< " for " << duration << " seconds,"
<< " can cause ill effects on osd. "
<< " Please adjust 'osd_bench_large_size_max_throughput'"
<< " with a higher value if you wish to use a higher 'count'.";
ret = -EINVAL;
return ret;
}
}
if (osize && bsize > osize) {
bsize = osize;
}
dout(1) << " bench count " << count
<< " bsize " << byte_u_t(bsize) << dendl;
ObjectStore::Transaction cleanupt;
if (osize && onum) {
bufferlist bl;
bufferptr bp(osize);
bp.zero();
bl.push_back(std::move(bp));
bl.rebuild_page_aligned();
for (int i=0; i<onum; ++i) {
char nm[30];
snprintf(nm, sizeof(nm), "disk_bw_test_%d", i);
object_t oid(nm);
hobject_t soid(sobject_t(oid, 0));
ObjectStore::Transaction t;
t.write(coll_t(), ghobject_t(soid), 0, osize, bl);
store->queue_transaction(service.meta_ch, std::move(t), nullptr);
cleanupt.remove(coll_t(), ghobject_t(soid));
}
}
bufferlist bl;
bufferptr bp(bsize);
bp.zero();
bl.push_back(std::move(bp));
bl.rebuild_page_aligned();
{
C_SaferCond waiter;
if (!service.meta_ch->flush_commit(&waiter)) {
waiter.wait();
}
}
utime_t start = ceph_clock_now();
for (int64_t pos = 0; pos < count; pos += bsize) {
char nm[30];
unsigned offset = 0;
if (onum && osize) {
snprintf(nm, sizeof(nm), "disk_bw_test_%d", (int)(rand() % onum));
offset = rand() % (osize / bsize) * bsize;
} else {
snprintf(nm, sizeof(nm), "disk_bw_test_%lld", (long long)pos);
}
object_t oid(nm);
hobject_t soid(sobject_t(oid, 0));
ObjectStore::Transaction t;
t.write(coll_t::meta(), ghobject_t(soid), offset, bsize, bl);
store->queue_transaction(service.meta_ch, std::move(t), nullptr);
if (!onum || !osize) {
cleanupt.remove(coll_t::meta(), ghobject_t(soid));
}
}
{
C_SaferCond waiter;
if (!service.meta_ch->flush_commit(&waiter)) {
waiter.wait();
}
}
utime_t end = ceph_clock_now();
*elapsed = end - start;
// clean up
store->queue_transaction(service.meta_ch, std::move(cleanupt), nullptr);
{
C_SaferCond waiter;
if (!service.meta_ch->flush_commit(&waiter)) {
waiter.wait();
}
}
return ret;
}
class TestOpsSocketHook : public AdminSocketHook {
OSDService *service;
ObjectStore *store;
@ -3783,6 +3801,10 @@ int OSD::init()
start_boot();
// Override a few options if mclock scheduler is enabled.
maybe_override_max_osd_capacity_for_qos();
maybe_override_options_for_qos();
return 0;
out:
@ -10062,6 +10084,53 @@ void OSD::handle_conf_change(const ConfigProxy& conf,
}
}
void OSD::maybe_override_max_osd_capacity_for_qos()
{
// If the scheduler enabled is mclock, override the default
// osd capacity with the value obtained from running the
// osd bench test. This is later used to setup mclock.
if (cct->_conf.get_val<std::string>("osd_op_queue") == "mclock_scheduler") {
// Write 200 4MiB objects with blocksize 4KiB
int64_t count = 12288000; // Count of bytes to write
int64_t bsize = 4096; // Block size
int64_t osize = 4194304; // Object size
int64_t onum = 100; // Count of objects to write
double elapsed = 0.0; // Time taken to complete the test
stringstream ss;
int ret = run_osd_bench_test(count, bsize, osize, onum, &elapsed, ss);
if (ret != 0) {
derr << __func__
<< " osd bench err: " << ret
<< " osd bench errstr: " << ss.str()
<< dendl;
} else {
double rate = count / elapsed;
double iops = rate / bsize;
dout(1) << __func__
<< " osd bench result -"
<< std::fixed << std::setprecision(3)
<< " bandwidth (MiB/sec): " << rate / (1024 * 1024)
<< " iops: " << iops
<< " elapsed_sec: " << elapsed
<< dendl;
// Override the appropriate config option
if (store_is_rotational) {
cct->_conf.set_val(
"osd_mclock_max_capacity_iops_hdd", std::to_string(iops));
} else {
cct->_conf.set_val(
"osd_mclock_max_capacity_iops_ssd", std::to_string(iops));
}
// Override the max osd capacity for all shards
for (auto& shard : shards) {
shard->update_scheduler_config();
}
}
}
}
bool OSD::maybe_override_options_for_qos()
{
// If the scheduler enabled is mclock, override the recovery, backfill
@ -10610,6 +10679,12 @@ void OSDShard::unprime_split_children(spg_t parent, unsigned old_pg_num)
}
}
void OSDShard::update_scheduler_config()
{
std::lock_guard l(shard_lock);
scheduler->update_configuration();
}
OSDShard::OSDShard(
int id,
CephContext *cct,
@ -10746,12 +10821,17 @@ void OSD::ShardedOpWQ::_process(uint32_t thread_index, heartbeat_handle_d *hb)
std::unique_lock wait_lock{sdata->sdata_wait_lock};
auto future_time = ceph::real_clock::from_double(*when_ready);
dout(10) << __func__ << " dequeue future request at " << future_time << dendl;
// Disable heartbeat timeout until we find a non-future work item to process.
osd->cct->get_heartbeat_map()->clear_timeout(hb);
sdata->shard_lock.unlock();
++sdata->waiting_threads;
sdata->sdata_cond.wait_until(wait_lock, future_time);
--sdata->waiting_threads;
wait_lock.unlock();
sdata->shard_lock.lock();
// Reapply default wq timeouts
osd->cct->get_heartbeat_map()->reset_timeout(hb,
timeout_interval, suicide_interval);
}
} // while


@ -1078,6 +1078,7 @@ struct OSDShard {
std::set<std::pair<spg_t,epoch_t>> *merge_pgs);
void register_and_wake_split_child(PG *pg);
void unprime_split_children(spg_t parent, unsigned old_pg_num);
void update_scheduler_config();
OSDShard(
int id,
@ -2063,7 +2064,14 @@ private:
float get_osd_snap_trim_sleep();
int get_recovery_max_active();
void maybe_override_max_osd_capacity_for_qos();
bool maybe_override_options_for_qos();
int run_osd_bench_test(int64_t count,
int64_t bsize,
int64_t osize,
int64_t onum,
double *elapsed,
std::ostream& ss);
void scrub_purged_snaps();
void probe_smart(const std::string& devid, std::ostream& ss);


@ -50,6 +50,9 @@ public:
// Print human readable brief description with relevant parameters
virtual void print(std::ostream &out) const = 0;
// Apply config changes to the scheduler (if any)
virtual void update_configuration() = 0;
// Destructor
virtual ~OpScheduler() {};
};
@ -134,6 +137,10 @@ public:
out << ", cutoff=" << cutoff << ")";
}
void update_configuration() final {
// no-op
}
~ClassedOpQueueScheduler() final {};
};


@ -99,22 +99,19 @@ const dmc::ClientInfo *mClockScheduler::ClientRegistry::get_info(
void mClockScheduler::set_max_osd_capacity()
{
if (cct->_conf.get_val<double>("osd_mclock_max_capacity_iops")) {
if (is_rotational) {
max_osd_capacity =
cct->_conf.get_val<double>("osd_mclock_max_capacity_iops");
cct->_conf.get_val<double>("osd_mclock_max_capacity_iops_hdd");
} else {
if (is_rotational) {
max_osd_capacity =
cct->_conf.get_val<double>("osd_mclock_max_capacity_iops_hdd");
} else {
max_osd_capacity =
cct->_conf.get_val<double>("osd_mclock_max_capacity_iops_ssd");
}
max_osd_capacity =
cct->_conf.get_val<double>("osd_mclock_max_capacity_iops_ssd");
}
// Set per op-shard iops limit
max_osd_capacity /= num_shards;
dout(1) << __func__ << " #op shards: " << num_shards
<< " max osd capacity(iops) per shard: " << max_osd_capacity << dendl;
<< std::fixed << std::setprecision(2)
<< " max osd capacity(iops) per shard: " << max_osd_capacity
<< dendl;
}
void mClockScheduler::set_osd_mclock_cost_per_io()
@ -137,7 +134,8 @@ void mClockScheduler::set_osd_mclock_cost_per_io()
}
}
dout(1) << __func__ << " osd_mclock_cost_per_io: "
<< std::fixed << osd_mclock_cost_per_io << dendl;
<< std::fixed << std::setprecision(7) << osd_mclock_cost_per_io
<< dendl;
}
void mClockScheduler::set_osd_mclock_cost_per_byte()
@ -160,7 +158,8 @@ void mClockScheduler::set_osd_mclock_cost_per_byte()
}
}
dout(1) << __func__ << " osd_mclock_cost_per_byte: "
<< std::fixed << osd_mclock_cost_per_byte << dendl;
<< std::fixed << std::setprecision(7) << osd_mclock_cost_per_byte
<< dendl;
}
void mClockScheduler::set_mclock_profile()
@ -378,6 +377,14 @@ int mClockScheduler::calc_scaled_cost(int item_cost)
return std::max(scaled_cost, 1);
}
void mClockScheduler::update_configuration()
{
// Apply configuration change. The expectation is that
// at least one of the tracked mclock config option keys
// is modified before calling this method.
cct->_conf.apply_changes(nullptr);
}
void mClockScheduler::dump(ceph::Formatter &f) const
{
}
@ -447,7 +454,6 @@ const char** mClockScheduler::get_tracked_conf_keys() const
"osd_mclock_cost_per_byte_usec",
"osd_mclock_cost_per_byte_usec_hdd",
"osd_mclock_cost_per_byte_usec_ssd",
"osd_mclock_max_capacity_iops",
"osd_mclock_max_capacity_iops_hdd",
"osd_mclock_max_capacity_iops_ssd",
"osd_mclock_profile",
@ -470,8 +476,7 @@ void mClockScheduler::handle_conf_change(
changed.count("osd_mclock_cost_per_byte_usec_ssd")) {
set_osd_mclock_cost_per_byte();
}
if (changed.count("osd_mclock_max_capacity_iops") ||
changed.count("osd_mclock_max_capacity_iops_hdd") ||
if (changed.count("osd_mclock_max_capacity_iops_hdd") ||
changed.count("osd_mclock_max_capacity_iops_ssd")) {
set_max_osd_capacity();
if (mclock_profile != "custom") {


@ -193,6 +193,9 @@ public:
ostream << "mClockScheduler";
}
// Update data associated with the modified mclock config key(s)
void update_configuration() final;
const char** get_tracked_conf_keys() const final;
void handle_conf_change(const ConfigProxy& conf,
const std::set<std::string> &changed) final;