Merge pull request #41308 from sseshasa/wip-osd-benchmark-for-mclock

osd: Run osd bench test to override default max osd capacity for mclock

Reviewed-by: Neha Ojha <nojha@redhat.com>
Reviewed-by: Josh Durgin <jdurgin@redhat.com>

Commit 11252f6117
@@ -7,11 +7,12 @@
 Mclock profiles mask the low level details from users, making it
 easier for them to configure mclock.

-To use mclock, you must provide the following input parameters:
+The following input parameters are required for a mclock profile to configure
+the QoS related parameters:

-* total capacity of each OSD
+* total capacity (IOPS) of each OSD (determined automatically)

-* an mclock profile to enable
+* an mclock profile type to enable

 Using the settings in the specified profile, the OSD determines and applies the
 lower-level mclock and Ceph parameters. The parameters applied by the mclock
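As an illustrative aside (not part of the patch itself; the commands appear
later in this document, and ``osd.0`` plus the ``_ssd`` suffix are placeholders
for your own OSD id and device type), both inputs can be inspected on a running
cluster::

   # IOPS capacity determined automatically at OSD startup
   ceph config show osd.0 osd_mclock_max_capacity_iops_ssd
   # currently active mclock profile type
   ceph config show osd.0 osd_mclock_profile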
@@ -31,11 +32,11 @@ Ceph cluster enables the throttling of the operations(IOPS) belonging to
 different client classes (background recovery, scrub, snaptrim, client op,
 osd subop)”*.

-The mclock profile uses the capacity limits and the mclock profile selected by
-the user to determine the low-level mclock resource control parameters.
+The mclock profile uses the capacity limits and the mclock profile type selected
+by the user to determine the low-level mclock resource control parameters.

-Depending on the profile, lower-level mclock resource-control parameters and
-some Ceph-configuration parameters are transparently applied.
+Depending on the profile type, lower-level mclock resource-control parameters
+and some Ceph-configuration parameters are transparently applied.

 The low-level mclock resource control parameters are the *reservation*,
 *limit*, and *weight* that provide control of the resource shares, as
@@ -56,7 +57,7 @@ mclock profiles can be broadly classified into two types,
   as compared to background recoveries and other internal clients within
   Ceph. This profile is enabled by default.
 - **high_recovery_ops**:
-  This profile allocates more reservation to background recoveries as
+  This profile allocates more reservation to background recoveries as
   compared to external clients and other internal clients within Ceph. For
   example, an admin may enable this profile temporarily to speed-up background
   recoveries during non-peak hours.
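A sketch of how that temporary switch might look, using the command form
documented later in this file (returning to *high_client_ops*, the stated
default, once recoveries have caught up)::

   ceph config set osd osd_mclock_profile high_recovery_ops
   # later, once recoveries have caught up
   ceph config set osd osd_mclock_profile high_client_ops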
@@ -109,7 +110,8 @@ chunk of the bandwidth allocation goes to client ops. Background recovery ops
 are given lower allocation (and therefore take a longer time to complete). But
 there might be instances that necessitate giving higher allocations to either
 client ops or recovery ops. In order to deal with such a situation, you can
-enable one of the alternate built-in profiles mentioned above.
+enable one of the alternate built-in profiles by following the steps mentioned
+in the next section.

 If any mClock profile (including "custom") is active, the following Ceph config
 sleep options will be disabled,
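The list of affected sleep options continues in the unchanged lines that follow
this hunk. As an illustration, assuming ``osd_recovery_sleep`` is among the
listed options, its effective value on a running OSD can be checked with::

   ceph config show osd.0 osd_recovery_sleep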
@@ -139,20 +141,64 @@ all its clients.
 Steps to Enable mClock Profile
 ==============================

-The following sections outline the steps required to enable a mclock profile.
+As already mentioned, the default mclock profile is set to *high_client_ops*.
+The other values for the built-in profiles include *balanced* and
+*high_recovery_ops*.

-Determining OSD Capacity Using Benchmark Tests
-----------------------------------------------
+If there is a requirement to change the default profile, then the option
+:confval:`osd_mclock_profile` may be set during runtime by using the following
+command:

-To allow mclock to fulfill its QoS goals across its clients, it is most
-important to have a good understanding of each OSD's capacity in terms of its
-baseline throughputs (IOPS) across the Ceph nodes. To determine this capacity,
-you must perform appropriate benchmarking tests. The steps for performing these
-benchmarking tests are broadly outlined below.
+.. prompt:: bash #

-Any existing benchmarking tool can be used for this purpose. The following
-steps use the *Ceph Benchmarking Tool* (cbt_). Regardless of the tool
-used, the steps described below remain the same.
+   ceph config set [global,osd] osd_mclock_profile <value>

+For example, to change the profile to allow faster recoveries, the following
+command can be used to switch to the *high_recovery_ops* profile:
+
+.. prompt:: bash #
+
+   ceph config set osd osd_mclock_profile high_recovery_ops
+
+.. note:: The *custom* profile is not recommended unless you are an advanced
+          user.
+
+And that's it! You are ready to run workloads on the cluster and check if the
+QoS requirements are being met.
+
+
+OSD Capacity Determination (Automated)
+======================================
+
+The OSD capacity in terms of total IOPS is determined automatically during OSD
+initialization. This is achieved by running the OSD bench tool and overriding
+the default value of ``osd_mclock_max_capacity_iops_[hdd, ssd]`` option
+depending on the device type. No other action/input is expected from the user
+to set the OSD capacity. You may verify the capacity of an OSD after the
+cluster is brought up by using the following command:
+
+.. prompt:: bash #
+
+   ceph config show osd.x osd_mclock_max_capacity_iops_[hdd, ssd]
+
+For example, the following command shows the max capacity for osd.0 on a Ceph
+node whose underlying device type is SSD:
+
+.. prompt:: bash #
+
+   ceph config show osd.0 osd_mclock_max_capacity_iops_ssd
+
+
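To sketch the same check across several OSDs (the osd ids 0 to 2 and the
``_ssd`` suffix are placeholders), a small shell loop works::

   for i in 0 1 2; do
       ceph config show osd.$i osd_mclock_max_capacity_iops_ssd
   done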
+Steps to Manually Benchmark an OSD (Optional)
+=============================================
+
+.. note:: These steps are only necessary if you want to override the OSD
+          capacity already determined automatically during OSD initialization.
+          Otherwise, you may skip this section entirely.
+
+Any existing benchmarking tool can be used for this purpose. In this case, the
+steps use the *Ceph OSD Bench* command described in the next section. Regardless
+of the tool/command used, the steps outlined further below remain the same.
+
 As already described in the :ref:`dmclock-qos` section, the number of
 shards and the bluestore's throttle parameters have an impact on the mclock op
@@ -167,68 +213,85 @@ maximize the impact of the mclock scheduler.

 :Bluestore Throttle Parameters:
   We recommend using the default values as defined by
-  :confval:`bluestore_throttle_bytes` and :confval:`bluestore_throttle_deferred_bytes`. But
-  these parameters may also be determined during the benchmarking phase as
-  described below.
+  :confval:`bluestore_throttle_bytes` and
+  :confval:`bluestore_throttle_deferred_bytes`. But these parameters may also be
+  determined during the benchmarking phase as described below.

-Benchmarking Test Steps Using CBT
-`````````````````````````````````
+OSD Bench Command Syntax
+````````````````````````

-The steps below use the default shards and detail the steps used to determine the
-correct bluestore throttle values.
+The :ref:`osd-subsystem` section describes the OSD bench command. The syntax
+used for benchmarking is shown below :

-.. note:: These steps, although manual in April 2021, will be automated in the future.
+.. prompt:: bash #

-1. On the Ceph node hosting the OSDs, download cbt_ from git.
-2. Install cbt and all the dependencies mentioned on the cbt github page.
-3. Construct the Ceph configuration file and the cbt yaml file.
-4. Ensure that the bluestore throttle options ( i.e.
-   :confval:`bluestore_throttle_bytes` and :confval:`bluestore_throttle_deferred_bytes`) are
-   set to the default values.
-5. Ensure that the test is performed on similar device types to get reliable
-   OSD capacity data.
-6. The OSDs can be grouped together with the desired replication factor for the
-   test to ensure reliability of OSD capacity data.
-7. After ensuring that the OSDs nodes are in the desired configuration, run a
-   simple 4KiB random write workload on the OSD(s) for 300 secs.
-8. Note the overall throughput(IOPS) obtained from the cbt output file. This
-   value is the baseline throughput(IOPS) when the default bluestore
-   throttle options are in effect.
-9. If the intent is to determine the bluestore throttle values for your
-   environment, then set the two options, :confval:`bluestore_throttle_bytes` and
-   :confval:`bluestore_throttle_deferred_bytes` to 32 KiB(32768 Bytes) each to begin
-   with. Otherwise, you may skip to the next section.
-10. Run the 4KiB random write workload as before on the OSD(s) for 300 secs.
-11. Note the overall throughput from the cbt log files and compare the value
-    against the baseline throughput in step 8.
-12. If the throughput doesn't match with the baseline, increment the bluestore
-    throttle options by 2x and repeat steps 9 through 11 until the obtained
-    throughput is very close to the baseline value.
+   ceph tell osd.N bench [TOTAL_BYTES] [BYTES_PER_WRITE] [OBJ_SIZE] [NUM_OBJS]

-For example, during benchmarking on a machine with NVMe SSDs, a value of 256 KiB for
-both bluestore throttle and deferred bytes was determined to maximize the impact
-of mclock. For HDDs, the corresponding value was 40 MiB, where the overall
-throughput was roughly equal to the baseline throughput. Note that in general
-for HDDs, the bluestore throttle values are expected to be higher when compared
-to SSDs.
+where,

-.. _cbt: https://github.com/ceph/cbt
+* ``TOTAL_BYTES``: Total number of bytes to write
+* ``BYTES_PER_WRITE``: Block size per write
+* ``OBJ_SIZE``: Bytes per object
+* ``NUM_OBJS``: Number of objects to write

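As a worked example of the syntax above (the same values are used verbatim in
the next subsection), writing 12288000 total bytes in 4096-byte writes spread
across 100 objects of 4 MiB each amounts to 3000 individual writes::

   ceph tell osd.0 bench 12288000 4096 4194304 100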
+Benchmarking Test Steps Using OSD Bench
+```````````````````````````````````````
+
+The steps below use the default shards and detail the steps used to determine
+the correct bluestore throttle values (optional).
+
+#. Bring up your Ceph cluster and login to the Ceph node hosting the OSDs that
+   you wish to benchmark.
+#. Run a simple 4KiB random write workload on an OSD using the following
+   commands:
+
+   .. note:: Note that before running the test, caches must be cleared to get an
+      accurate measurement.
+
+   For example, if you are running the benchmark test on osd.0, run the following
+   commands:
+
+   .. prompt:: bash #
+
+      ceph tell osd.0 cache drop
+
+   .. prompt:: bash #
+
+      ceph tell osd.0 bench 12288000 4096 4194304 100
+
+#. Note the overall throughput(IOPS) obtained from the output of the osd bench
+   command. This value is the baseline throughput(IOPS) when the default
+   bluestore throttle options are in effect.
+#. If the intent is to determine the bluestore throttle values for your
+   environment, then set the two options, :confval:`bluestore_throttle_bytes`
+   and :confval:`bluestore_throttle_deferred_bytes` to 32 KiB(32768 Bytes) each
+   to begin with. Otherwise, you may skip to the next section.
+#. Run the 4KiB random write test as before using OSD bench.
+#. Note the overall throughput from the output and compare the value
+   against the baseline throughput recorded in step 3.
+#. If the throughput doesn't match with the baseline, increment the bluestore
+   throttle options by 2x and repeat steps 5 through 7 until the obtained
+   throughput is very close to the baseline value.
+
+For example, during benchmarking on a machine with NVMe SSDs, a value of 256 KiB
+for both bluestore throttle and deferred bytes was determined to maximize the
+impact of mclock. For HDDs, the corresponding value was 40 MiB, where the
+overall throughput was roughly equal to the baseline throughput. Note that in
+general for HDDs, the bluestore throttle values are expected to be higher when
+compared to SSDs.
+
+
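Put together, one pass of the manual procedure above might look like the
following sketch. The osd.0 target, the bench arguments, and the 32 KiB throttle
starting point are taken from the steps; adjust them for your own environment::

   # clear caches so the measurement is accurate
   ceph tell osd.0 cache drop
   # baseline run with the default bluestore throttles
   ceph tell osd.0 bench 12288000 4096 4194304 100
   # optional: start the bluestore throttle search at 32 KiB and re-run
   ceph config set osd.0 bluestore_throttle_bytes 32768
   ceph config set osd.0 bluestore_throttle_deferred_bytes 32768
   ceph tell osd.0 cache drop
   ceph tell osd.0 bench 12288000 4096 4194304 100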
 Specifying Max OSD Capacity
-----------------------------
+````````````````````````````

-The steps in this section may be performed only if the max osd capacity is
-different from the default values (SSDs: 21500 IOPS and HDDs: 315 IOPS). The
-option ``osd_mclock_max_capacity_iops_[hdd, ssd]`` can be set by specifying it
-in either the **[global]** section or in a specific OSD section (**[osd.x]** of
-your Ceph configuration file).
-
-Alternatively, commands of the following form may be used:
+The steps in this section may be performed only if you want to override the
+max osd capacity automatically determined during OSD initialization. The option
+``osd_mclock_max_capacity_iops_[hdd, ssd]`` can be set by running the
+following command:

 .. prompt:: bash #

-   ceph config set [global, osd] osd_mclock_max_capacity_iops_[hdd,ssd] <value>
+   ceph config set [global,osd] osd_mclock_max_capacity_iops_[hdd,ssd] <value>

 For example, the following command sets the max capacity for all the OSDs in a
 Ceph node whose underlying device type is SSDs:
@@ -245,43 +308,12 @@ device type is HDD, use a command like this:
    ceph config set osd.0 osd_mclock_max_capacity_iops_hdd 350


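Following the same command form, a sketch for overriding the capacity of a
single SSD-backed OSD; the 25000 IOPS figure is purely illustrative and should
come from your own benchmark results::

   ceph config set osd.0 osd_mclock_max_capacity_iops_ssd 25000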
-Specifying Which mClock Profile to Enable
------------------------------------------
-
-As already mentioned, the default mclock profile is set to *high_client_ops*.
-The other values for the built-in profiles include *balanced* and
-*high_recovery_ops*.
-
-If there is a requirement to change the default profile, then the option
-:confval:`osd_mclock_profile` may be set in the **[global]** or **[osd]** section of
-your Ceph configuration file before bringing up your cluster.
-
-Alternatively, to change the profile during runtime, use the following command:
-
-.. prompt:: bash #
-
-   ceph config set [global,osd] osd_mclock_profile <value>
-
-For example, to change the profile to allow faster recoveries, the following
-command can be used to switch to the *high_recovery_ops* profile:
-
-.. prompt:: bash #
-
-   ceph config set osd osd_mclock_profile high_recovery_ops
-
-.. note:: The *custom* profile is not recommended unless you are an advanced user.
-
-And that's it! You are ready to run workloads on the cluster and check if the
-QoS requirements are being met.
-
-
 .. index:: mclock; config settings

 mClock Config Options
 =====================

 .. confval:: osd_mclock_profile
-.. confval:: osd_mclock_max_capacity_iops
 .. confval:: osd_mclock_max_capacity_iops_hdd
 .. confval:: osd_mclock_max_capacity_iops_ssd
 .. confval:: osd_mclock_cost_per_io_usec
@@ -95,6 +95,8 @@ or delete them if they were just created. ::
        ceph pg {pgid} mark_unfound_lost revert|delete


+.. _osd-subsystem:
+
 OSD Subsystem
 =============

@@ -192,16 +192,22 @@ class CephTestCase(unittest.TestCase):
         log.debug("wait_until_equal: success")

     @classmethod
-    def wait_until_true(cls, condition, timeout, period=5):
+    def wait_until_true(cls, condition, timeout, check_fn=None, period=5):
         elapsed = 0
+        retry_count = 0
         while True:
             if condition():
-                log.debug("wait_until_true: success in {0}s".format(elapsed))
+                log.debug("wait_until_true: success in {0}s and {1} retries".format(elapsed, retry_count))
                 return
             else:
                 if elapsed >= timeout:
-                    raise TestTimeoutError("Timed out after {0}s".format(elapsed))
+                    if check_fn and check_fn() and retry_count < 5:
+                        elapsed = 0
+                        retry_count += 1
+                        log.debug("wait_until_true: making progress, waiting (timeout={0} retry_count={1})...".format(timeout, retry_count))
+                    else:
+                        raise TestTimeoutError("Timed out after {0}s and {1} retries".format(elapsed, retry_count))
                 else:
-                    log.debug("wait_until_true: waiting (timeout={0})...".format(timeout))
+                    log.debug("wait_until_true: waiting (timeout={0} retry_count={1})...".format(timeout, retry_count))
                 time.sleep(period)
                 elapsed += period
@@ -243,6 +243,13 @@ class TestProgress(MgrTestCase):
             assert ev_id in live_ids
             return False

+    def _is_inprogress_or_complete(self, ev_id):
+        for ev in self._events_in_progress():
+            if ev['id'] == ev_id:
+                return ev['progress'] > 0
+        # check if the event completed
+        return self._is_complete(ev_id)
+
     def tearDown(self):
         if self.POOL in self.mgr_cluster.mon_manager.pools:
             self.mgr_cluster.mon_manager.remove_pool(self.POOL)
@@ -396,5 +403,6 @@ class TestProgress(MgrTestCase):
         log.info(json.dumps(ev1, indent=1))

         self.wait_until_true(lambda: self._is_complete(ev1['id']),
+                             check_fn=lambda: self._is_inprogress_or_complete(ev1['id']),
                              timeout=self.RECOVERY_PERIOD)
         self.assertTrue(self._is_quiet())
@@ -1019,19 +1019,6 @@ options:
   default: 0.011
   flags:
   - runtime
-- name: osd_mclock_max_capacity_iops
-  type: float
-  level: basic
-  desc: Max IOPs capacity (at 4KiB block size) to consider per OSD (overrides _ssd
-    and _hdd if non-zero)
-  long_desc: This option specifies the max osd capacity in iops per OSD. Helps in
-    QoS calculations when enabling a dmclock profile. Only considered for osd_op_queue
-    = mclock_scheduler
-  fmt_desc: Max IOPS capacity (at 4KiB block size) to consider per OSD
-    (overrides _ssd and _hdd if non-zero)
-  default: 0
-  flags:
-  - runtime
 - name: osd_mclock_max_capacity_iops_hdd
   type: float
   level: basic
src/osd/OSD.cc (338 lines changed)
@@ -2320,9 +2320,6 @@ OSD::OSD(CephContext *cct_,
       this);
     shards.push_back(one_shard);
   }
-
-  // override some config options if mclock is enabled on all the shards
-  maybe_override_options_for_qos();
 }

 OSD::~OSD()
@@ -2826,136 +2823,13 @@ will start to track new ops received afterwards.";
     int64_t bsize = cmd_getval_or<int64_t>(cmdmap, "size", 4LL << 20);
     int64_t osize = cmd_getval_or<int64_t>(cmdmap, "object_size", 0);
     int64_t onum = cmd_getval_or<int64_t>(cmdmap, "object_num", 0);
-    uint32_t duration = cct->_conf->osd_bench_duration;
+    double elapsed = 0.0;

-    if (bsize > (int64_t) cct->_conf->osd_bench_max_block_size) {
-      // let us limit the block size because the next checks rely on it
-      // having a sane value. If we allow any block size to be set things
-      // can still go sideways.
-      ss << "block 'size' values are capped at "
-         << byte_u_t(cct->_conf->osd_bench_max_block_size) << ". If you wish to use"
-         << " a higher value, please adjust 'osd_bench_max_block_size'";
-      ret = -EINVAL;
+    ret = run_osd_bench_test(count, bsize, osize, onum, &elapsed, ss);
+    if (ret != 0) {
       goto out;
-    } else if (bsize < (int64_t) (1 << 20)) {
-      // entering the realm of small block sizes.
-      // limit the count to a sane value, assuming a configurable amount of
-      // IOPS and duration, so that the OSD doesn't get hung up on this,
-      // preventing timeouts from going off
-      int64_t max_count =
-        bsize * duration * cct->_conf->osd_bench_small_size_max_iops;
-      if (count > max_count) {
-        ss << "'count' values greater than " << max_count
-           << " for a block size of " << byte_u_t(bsize) << ", assuming "
-           << cct->_conf->osd_bench_small_size_max_iops << " IOPS,"
-           << " for " << duration << " seconds,"
-           << " can cause ill effects on osd. "
-           << " Please adjust 'osd_bench_small_size_max_iops' with a higher"
-           << " value if you wish to use a higher 'count'.";
-        ret = -EINVAL;
-        goto out;
-      }
-    } else {
-      // 1MB block sizes are big enough so that we get more stuff done.
-      // However, to avoid the osd from getting hung on this and having
-      // timers being triggered, we are going to limit the count assuming
-      // a configurable throughput and duration.
-      // NOTE: max_count is the total amount of bytes that we believe we
-      // will be able to write during 'duration' for the given
-      // throughput. The block size hardly impacts this unless it's
-      // way too big. Given we already check how big the block size
-      // is, it's safe to assume everything will check out.
-      int64_t max_count =
-        cct->_conf->osd_bench_large_size_max_throughput * duration;
-      if (count > max_count) {
-        ss << "'count' values greater than " << max_count
-           << " for a block size of " << byte_u_t(bsize) << ", assuming "
-           << byte_u_t(cct->_conf->osd_bench_large_size_max_throughput) << "/s,"
-           << " for " << duration << " seconds,"
-           << " can cause ill effects on osd. "
-           << " Please adjust 'osd_bench_large_size_max_throughput'"
-           << " with a higher value if you wish to use a higher 'count'.";
-        ret = -EINVAL;
-        goto out;
-      }
     }

-    if (osize && bsize > osize)
-      bsize = osize;
-
-    dout(1) << " bench count " << count
-            << " bsize " << byte_u_t(bsize) << dendl;
-
-    ObjectStore::Transaction cleanupt;
-
-    if (osize && onum) {
-      bufferlist bl;
-      bufferptr bp(osize);
-      bp.zero();
-      bl.push_back(std::move(bp));
-      bl.rebuild_page_aligned();
-      for (int i=0; i<onum; ++i) {
-        char nm[30];
-        snprintf(nm, sizeof(nm), "disk_bw_test_%d", i);
-        object_t oid(nm);
-        hobject_t soid(sobject_t(oid, 0));
-        ObjectStore::Transaction t;
-        t.write(coll_t(), ghobject_t(soid), 0, osize, bl);
-        store->queue_transaction(service.meta_ch, std::move(t), NULL);
-        cleanupt.remove(coll_t(), ghobject_t(soid));
-      }
-    }
-
-    bufferlist bl;
-    bufferptr bp(bsize);
-    bp.zero();
-    bl.push_back(std::move(bp));
-    bl.rebuild_page_aligned();
-
-    {
-      C_SaferCond waiter;
-      if (!service.meta_ch->flush_commit(&waiter)) {
-        waiter.wait();
-      }
-    }
-
-    utime_t start = ceph_clock_now();
-    for (int64_t pos = 0; pos < count; pos += bsize) {
-      char nm[30];
-      unsigned offset = 0;
-      if (onum && osize) {
-        snprintf(nm, sizeof(nm), "disk_bw_test_%d", (int)(rand() % onum));
-        offset = rand() % (osize / bsize) * bsize;
-      } else {
-        snprintf(nm, sizeof(nm), "disk_bw_test_%lld", (long long)pos);
-      }
-      object_t oid(nm);
-      hobject_t soid(sobject_t(oid, 0));
-      ObjectStore::Transaction t;
-      t.write(coll_t::meta(), ghobject_t(soid), offset, bsize, bl);
-      store->queue_transaction(service.meta_ch, std::move(t), NULL);
-      if (!onum || !osize)
-        cleanupt.remove(coll_t::meta(), ghobject_t(soid));
-    }
-
-    {
-      C_SaferCond waiter;
-      if (!service.meta_ch->flush_commit(&waiter)) {
-        waiter.wait();
-      }
-    }
-    utime_t end = ceph_clock_now();
-
-    // clean up
-    store->queue_transaction(service.meta_ch, std::move(cleanupt), NULL);
-    {
-      C_SaferCond waiter;
-      if (!service.meta_ch->flush_commit(&waiter)) {
-        waiter.wait();
-      }
-    }
-
-    double elapsed = end - start;
     double rate = count / elapsed;
     double iops = rate / bsize;
     f->open_object_section("osd_bench_results");
@@ -3234,6 +3108,150 @@ will start to track new ops received afterwards.";
   on_finish(ret, ss.str(), outbl);
 }

+int OSD::run_osd_bench_test(
+  int64_t count,
+  int64_t bsize,
+  int64_t osize,
+  int64_t onum,
+  double *elapsed,
+  ostream &ss)
+{
+  int ret = 0;
+  uint32_t duration = cct->_conf->osd_bench_duration;
+
+  if (bsize > (int64_t) cct->_conf->osd_bench_max_block_size) {
+    // let us limit the block size because the next checks rely on it
+    // having a sane value. If we allow any block size to be set things
+    // can still go sideways.
+    ss << "block 'size' values are capped at "
+       << byte_u_t(cct->_conf->osd_bench_max_block_size) << ". If you wish to use"
+       << " a higher value, please adjust 'osd_bench_max_block_size'";
+    ret = -EINVAL;
+    return ret;
+  } else if (bsize < (int64_t) (1 << 20)) {
+    // entering the realm of small block sizes.
+    // limit the count to a sane value, assuming a configurable amount of
+    // IOPS and duration, so that the OSD doesn't get hung up on this,
+    // preventing timeouts from going off
+    int64_t max_count =
+      bsize * duration * cct->_conf->osd_bench_small_size_max_iops;
+    if (count > max_count) {
+      ss << "'count' values greater than " << max_count
+         << " for a block size of " << byte_u_t(bsize) << ", assuming "
+         << cct->_conf->osd_bench_small_size_max_iops << " IOPS,"
+         << " for " << duration << " seconds,"
+         << " can cause ill effects on osd. "
+         << " Please adjust 'osd_bench_small_size_max_iops' with a higher"
+         << " value if you wish to use a higher 'count'.";
+      ret = -EINVAL;
+      return ret;
+    }
+  } else {
+    // 1MB block sizes are big enough so that we get more stuff done.
+    // However, to avoid the osd from getting hung on this and having
+    // timers being triggered, we are going to limit the count assuming
+    // a configurable throughput and duration.
+    // NOTE: max_count is the total amount of bytes that we believe we
+    // will be able to write during 'duration' for the given
+    // throughput. The block size hardly impacts this unless it's
+    // way too big. Given we already check how big the block size
+    // is, it's safe to assume everything will check out.
+    int64_t max_count =
+      cct->_conf->osd_bench_large_size_max_throughput * duration;
+    if (count > max_count) {
+      ss << "'count' values greater than " << max_count
+         << " for a block size of " << byte_u_t(bsize) << ", assuming "
+         << byte_u_t(cct->_conf->osd_bench_large_size_max_throughput) << "/s,"
+         << " for " << duration << " seconds,"
+         << " can cause ill effects on osd. "
+         << " Please adjust 'osd_bench_large_size_max_throughput'"
+         << " with a higher value if you wish to use a higher 'count'.";
+      ret = -EINVAL;
+      return ret;
+    }
+  }
+
+  if (osize && bsize > osize) {
+    bsize = osize;
+  }
+
+  dout(1) << " bench count " << count
+          << " bsize " << byte_u_t(bsize) << dendl;
+
+  ObjectStore::Transaction cleanupt;
+
+  if (osize && onum) {
+    bufferlist bl;
+    bufferptr bp(osize);
+    bp.zero();
+    bl.push_back(std::move(bp));
+    bl.rebuild_page_aligned();
+    for (int i=0; i<onum; ++i) {
+      char nm[30];
+      snprintf(nm, sizeof(nm), "disk_bw_test_%d", i);
+      object_t oid(nm);
+      hobject_t soid(sobject_t(oid, 0));
+      ObjectStore::Transaction t;
+      t.write(coll_t(), ghobject_t(soid), 0, osize, bl);
+      store->queue_transaction(service.meta_ch, std::move(t), nullptr);
+      cleanupt.remove(coll_t(), ghobject_t(soid));
+    }
+  }
+
+  bufferlist bl;
+  bufferptr bp(bsize);
+  bp.zero();
+  bl.push_back(std::move(bp));
+  bl.rebuild_page_aligned();
+
+  {
+    C_SaferCond waiter;
+    if (!service.meta_ch->flush_commit(&waiter)) {
+      waiter.wait();
+    }
+  }
+
+  utime_t start = ceph_clock_now();
+  for (int64_t pos = 0; pos < count; pos += bsize) {
+    char nm[30];
+    unsigned offset = 0;
+    if (onum && osize) {
+      snprintf(nm, sizeof(nm), "disk_bw_test_%d", (int)(rand() % onum));
+      offset = rand() % (osize / bsize) * bsize;
+    } else {
+      snprintf(nm, sizeof(nm), "disk_bw_test_%lld", (long long)pos);
+    }
+    object_t oid(nm);
+    hobject_t soid(sobject_t(oid, 0));
+    ObjectStore::Transaction t;
+    t.write(coll_t::meta(), ghobject_t(soid), offset, bsize, bl);
+    store->queue_transaction(service.meta_ch, std::move(t), nullptr);
+    if (!onum || !osize) {
+      cleanupt.remove(coll_t::meta(), ghobject_t(soid));
+    }
+  }
+
+  {
+    C_SaferCond waiter;
+    if (!service.meta_ch->flush_commit(&waiter)) {
+      waiter.wait();
+    }
+  }
+  utime_t end = ceph_clock_now();
+  *elapsed = end - start;
+
+  // clean up
+  store->queue_transaction(service.meta_ch, std::move(cleanupt), nullptr);
+  {
+    C_SaferCond waiter;
+    if (!service.meta_ch->flush_commit(&waiter)) {
+      waiter.wait();
+    }
+  }
+
+  return ret;
+}
+
 class TestOpsSocketHook : public AdminSocketHook {
   OSDService *service;
   ObjectStore *store;
@@ -3783,6 +3801,10 @@ int OSD::init()

   start_boot();

+  // Override a few options if mclock scheduler is enabled.
+  maybe_override_max_osd_capacity_for_qos();
+  maybe_override_options_for_qos();
+
   return 0;

 out:
@@ -10062,6 +10084,53 @@ void OSD::handle_conf_change(const ConfigProxy& conf,
   }
 }

+void OSD::maybe_override_max_osd_capacity_for_qos()
+{
+  // If the scheduler enabled is mclock, override the default
+  // osd capacity with the value obtained from running the
+  // osd bench test. This is later used to setup mclock.
+  if (cct->_conf.get_val<std::string>("osd_op_queue") == "mclock_scheduler") {
+    // Write 200 4MiB objects with blocksize 4KiB
+    int64_t count = 12288000; // Count of bytes to write
+    int64_t bsize = 4096;     // Block size
+    int64_t osize = 4194304;  // Object size
+    int64_t onum = 100;       // Count of objects to write
+    double elapsed = 0.0;     // Time taken to complete the test
+    stringstream ss;
+    int ret = run_osd_bench_test(count, bsize, osize, onum, &elapsed, ss);
+    if (ret != 0) {
+      derr << __func__
+           << " osd bench err: " << ret
+           << " osd bench errstr: " << ss.str()
+           << dendl;
+    } else {
+      double rate = count / elapsed;
+      double iops = rate / bsize;
+      dout(1) << __func__
+              << " osd bench result -"
+              << std::fixed << std::setprecision(3)
+              << " bandwidth (MiB/sec): " << rate / (1024 * 1024)
+              << " iops: " << iops
+              << " elapsed_sec: " << elapsed
+              << dendl;
+
+      // Override the appropriate config option
+      if (store_is_rotational) {
+        cct->_conf.set_val(
+          "osd_mclock_max_capacity_iops_hdd", std::to_string(iops));
+      } else {
+        cct->_conf.set_val(
+          "osd_mclock_max_capacity_iops_ssd", std::to_string(iops));
+      }
+
+      // Override the max osd capacity for all shards
+      for (auto& shard : shards) {
+        shard->update_scheduler_config();
+      }
+    }
+  }
+}
+
 bool OSD::maybe_override_options_for_qos()
 {
   // If the scheduler enabled is mclock, override the recovery, backfill
@@ -10610,6 +10679,12 @@ void OSDShard::unprime_split_children(spg_t parent, unsigned old_pg_num)
   }
 }

+void OSDShard::update_scheduler_config()
+{
+  std::lock_guard l(shard_lock);
+  scheduler->update_configuration();
+}
+
 OSDShard::OSDShard(
   int id,
   CephContext *cct,
@@ -10746,12 +10821,17 @@ void OSD::ShardedOpWQ::_process(uint32_t thread_index, heartbeat_handle_d *hb)
       std::unique_lock wait_lock{sdata->sdata_wait_lock};
       auto future_time = ceph::real_clock::from_double(*when_ready);
       dout(10) << __func__ << " dequeue future request at " << future_time << dendl;
+      // Disable heartbeat timeout until we find a non-future work item to process.
+      osd->cct->get_heartbeat_map()->clear_timeout(hb);
       sdata->shard_lock.unlock();
       ++sdata->waiting_threads;
       sdata->sdata_cond.wait_until(wait_lock, future_time);
       --sdata->waiting_threads;
       wait_lock.unlock();
       sdata->shard_lock.lock();
+      // Reapply default wq timeouts
+      osd->cct->get_heartbeat_map()->reset_timeout(hb,
+        timeout_interval, suicide_interval);
     }
   } // while

@@ -1078,6 +1078,7 @@ struct OSDShard {
     std::set<std::pair<spg_t,epoch_t>> *merge_pgs);
   void register_and_wake_split_child(PG *pg);
   void unprime_split_children(spg_t parent, unsigned old_pg_num);
+  void update_scheduler_config();

   OSDShard(
     int id,
@@ -2063,7 +2064,14 @@ private:
   float get_osd_snap_trim_sleep();

   int get_recovery_max_active();
+  void maybe_override_max_osd_capacity_for_qos();
   bool maybe_override_options_for_qos();
+  int run_osd_bench_test(int64_t count,
+                         int64_t bsize,
+                         int64_t osize,
+                         int64_t onum,
+                         double *elapsed,
+                         std::ostream& ss);

   void scrub_purged_snaps();
   void probe_smart(const std::string& devid, std::ostream& ss);
@@ -50,6 +50,9 @@ public:
   // Print human readable brief description with relevant parameters
   virtual void print(std::ostream &out) const = 0;

+  // Apply config changes to the scheduler (if any)
+  virtual void update_configuration() = 0;
+
   // Destructor
   virtual ~OpScheduler() {};
 };
@@ -134,6 +137,10 @@ public:
     out << ", cutoff=" << cutoff << ")";
   }

+  void update_configuration() final {
+    // no-op
+  }
+
   ~ClassedOpQueueScheduler() final {};
 };

@@ -99,22 +99,19 @@ const dmc::ClientInfo *mClockScheduler::ClientRegistry::get_info(

 void mClockScheduler::set_max_osd_capacity()
 {
-  if (cct->_conf.get_val<double>("osd_mclock_max_capacity_iops")) {
+  if (is_rotational) {
     max_osd_capacity =
-      cct->_conf.get_val<double>("osd_mclock_max_capacity_iops");
+      cct->_conf.get_val<double>("osd_mclock_max_capacity_iops_hdd");
   } else {
-    if (is_rotational) {
-      max_osd_capacity =
-        cct->_conf.get_val<double>("osd_mclock_max_capacity_iops_hdd");
-    } else {
-      max_osd_capacity =
-        cct->_conf.get_val<double>("osd_mclock_max_capacity_iops_ssd");
-    }
+    max_osd_capacity =
+      cct->_conf.get_val<double>("osd_mclock_max_capacity_iops_ssd");
   }
   // Set per op-shard iops limit
   max_osd_capacity /= num_shards;
   dout(1) << __func__ << " #op shards: " << num_shards
-          << " max osd capacity(iops) per shard: " << max_osd_capacity << dendl;
+          << std::fixed << std::setprecision(2)
+          << " max osd capacity(iops) per shard: " << max_osd_capacity
+          << dendl;
 }

 void mClockScheduler::set_osd_mclock_cost_per_io()
@@ -137,7 +134,8 @@ void mClockScheduler::set_osd_mclock_cost_per_io()
     }
   }
   dout(1) << __func__ << " osd_mclock_cost_per_io: "
-          << std::fixed << osd_mclock_cost_per_io << dendl;
+          << std::fixed << std::setprecision(7) << osd_mclock_cost_per_io
+          << dendl;
 }

 void mClockScheduler::set_osd_mclock_cost_per_byte()
@@ -160,7 +158,8 @@ void mClockScheduler::set_osd_mclock_cost_per_byte()
     }
   }
   dout(1) << __func__ << " osd_mclock_cost_per_byte: "
-          << std::fixed << osd_mclock_cost_per_byte << dendl;
+          << std::fixed << std::setprecision(7) << osd_mclock_cost_per_byte
+          << dendl;
 }

 void mClockScheduler::set_mclock_profile()
@@ -378,6 +377,14 @@ int mClockScheduler::calc_scaled_cost(int item_cost)
   return std::max(scaled_cost, 1);
 }

+void mClockScheduler::update_configuration()
+{
+  // Apply configuration change. The expectation is that
+  // at least one of the tracked mclock config option keys
+  // is modified before calling this method.
+  cct->_conf.apply_changes(nullptr);
+}
+
 void mClockScheduler::dump(ceph::Formatter &f) const
 {
 }
@@ -447,7 +454,6 @@ const char** mClockScheduler::get_tracked_conf_keys() const
     "osd_mclock_cost_per_byte_usec",
     "osd_mclock_cost_per_byte_usec_hdd",
     "osd_mclock_cost_per_byte_usec_ssd",
-    "osd_mclock_max_capacity_iops",
     "osd_mclock_max_capacity_iops_hdd",
     "osd_mclock_max_capacity_iops_ssd",
     "osd_mclock_profile",
@@ -470,8 +476,7 @@ void mClockScheduler::handle_conf_change(
       changed.count("osd_mclock_cost_per_byte_usec_ssd")) {
     set_osd_mclock_cost_per_byte();
   }
-  if (changed.count("osd_mclock_max_capacity_iops") ||
-      changed.count("osd_mclock_max_capacity_iops_hdd") ||
+  if (changed.count("osd_mclock_max_capacity_iops_hdd") ||
       changed.count("osd_mclock_max_capacity_iops_ssd")) {
     set_max_osd_capacity();
     if (mclock_profile != "custom") {
@@ -193,6 +193,9 @@ public:
     ostream << "mClockScheduler";
   }

+  // Update data associated with the modified mclock config key(s)
+  void update_configuration() final;
+
   const char** get_tracked_conf_keys() const final;
   void handle_conf_change(const ConfigProxy& conf,
                           const std::set<std::string> &changed) final;
