=========================================
 QoS Study with mClock and WPQ Schedulers
=========================================

Introduction
============

The mClock scheduler provides three controls for each service using it. In Ceph,
the services using mClock include client I/O, background recovery, scrub, snap
trim and PG deletes. The three controls, namely *weight*, *reservation* and
*limit*, are used for predictable allocation of resources to each service in
proportion to its weight, subject to the constraint that the service receives at
least its reservation and no more than its limit. In Ceph, these controls are
used to allocate IOPS for each service type provided the IOPS capacity of each
OSD is known. The mClock scheduler is based on `the dmClock algorithm`_. See
the :ref:`dmclock-qos` section for more details.
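
Stated compactly (this is only a restatement of the constraint above; here
:math:`R_i`, :math:`w_i` and :math:`L_i` denote the reservation, weight and
limit of service :math:`i`, and :math:`A_i` the share of the OSD's IOPS
capacity that it ends up receiving):

.. math::

   R_i \le A_i \le L_i

with any capacity left over after all reservations are met divided among the
competing services in proportion to their weights :math:`w_i`.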

Ceph's use of mClock was primarily experimental and approached with an
exploratory mindset. This is still true with other organizations and individuals
who continue to either use the codebase or modify it according to their needs.

DmClock exists in its own repository_. Before the Ceph *Pacific* release,
mClock could be enabled by setting the :confval:`osd_op_queue` Ceph option to
"mclock_scheduler". Additional mClock parameters like *reservation*, *weight*
and *limit* for each service type could be set using Ceph options.
For example, ``osd_mclock_scheduler_client_[res,wgt,lim]`` is one such option
(a sketch of this style of configuration follows the list below). See the
:ref:`dmclock-qos` section for more details. Even with all the mClock
options set, the full capability of mClock could not be realized due to:

- Unknown OSD capacity in terms of throughput (IOPS).
- No limit enforcement. In other words, services using mClock were allowed to
  exceed their limits resulting in the desired QoS goals not being met.
- Share of each service type not distributed across the number of operational
  shards.
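
The sketch below illustrates the pre-Pacific style of manual configuration
referred to above. It is only an illustration: the OSD id (``osd.0``) and the
values are arbitrary placeholders, not recommendations.

.. code-block:: bash

   # Switch the operation queue to the mClock scheduler.
   # (Changing osd_op_queue typically requires an OSD restart to take effect.)
   ceph config set osd osd_op_queue mclock_scheduler

   # Manually set the client QoS parameters on one OSD.
   # The values below are arbitrary and purely illustrative.
   ceph config set osd.0 osd_mclock_scheduler_client_res 1
   ceph config set osd.0 osd_mclock_scheduler_client_wgt 4
   ceph config set osd.0 osd_mclock_scheduler_client_lim 999999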

To resolve the above, refinements were made to the mClock scheduler in the Ceph
code base. See :doc:`/rados/configuration/mclock-config-ref`. With the
refinements, the usage of mClock is a bit more user-friendly and intuitive. This
is one step of many to refine and optimize the way mClock is used in Ceph.

Overview
========

A comparison study was performed as part of efforts to refine the mClock
scheduler. The study involved running tests with client ops and background
recovery operations in parallel with the two schedulers. The results were
collated and then compared. The following statistics were compared between the
schedulers from the test results for each service type:

- External client

  - Average throughput (IOPS),
  - Average and percentile (95th, 99th, 99.5th) latency,

- Background recovery

  - Average recovery throughput,
  - Number of misplaced objects recovered per second

Test Environment
================

1. **Software Configuration**: CentOS 8.1.1911, Linux Kernel 4.18.0-193.6.3.el8_2.x86_64
2. **CPU**: 2 x Intel® Xeon® CPU E5-2650 v3 @ 2.30GHz
3. **nproc**: 40
4. **System Memory**: 64 GiB
5. **Tuned-adm Profile**: network-latency
6. **Ceph Version**: 17.0.0-2125-g94f550a87f (94f550a87fcbda799afe9f85e40386e6d90b232e) quincy (dev)
7. **Storage**:

   - Intel® NVMe SSD DC P3700 Series (SSDPE2MD800G4) [4 x 800GB]
   - Seagate Constellation 7200 RPM 64MB Cache SATA 6.0Gb/s HDD (ST91000640NS) [4 x 1TB]

Test Methodology
================

Ceph cbt_ was used to test the recovery scenarios. A new recovery test to
generate background recoveries with client I/Os in parallel was created.
See the next section for the detailed test steps. The test was executed 3 times
with the default *Weighted Priority Queue (WPQ)* scheduler for comparison
purposes. This was done to establish a credible mean value against which the
mClock scheduler results could be compared at a later point.

After this, the same test was executed with the mClock scheduler and with the
different mClock profiles, i.e., *high_client_ops*, *balanced* and
*high_recovery_ops*, and the results were collated for comparison. With each
profile, the test was executed 3 times, and the average of those runs is
reported in this study.

.. note:: Tests with HDDs were performed with and without the bluestore WAL and
   DB configured. The charts discussed further below help bring out the
   comparison across the schedulers and their configurations.

Establish Baseline Client Throughput (IOPS)
===========================================

Before the actual recovery tests, the baseline throughput was established for
both the SSDs and the HDDs on the test machine by following the steps mentioned
in the :doc:`/rados/configuration/mclock-config-ref` document under
the "Benchmarking Test Steps Using CBT" section. For this study, the following
baseline throughput for each device type was determined:

+--------------------------------------+-------------------------------------------+
| Device Type                          | Baseline Throughput (@4KiB Random Writes) |
+======================================+===========================================+
| **NVMe SSD**                         | 21500 IOPS (84 MiB/s)                     |
+--------------------------------------+-------------------------------------------+
| **HDD (with bluestore WAL & DB)**    | 340 IOPS (1.33 MiB/s)                     |
+--------------------------------------+-------------------------------------------+
| **HDD (without bluestore WAL & DB)** | 315 IOPS (1.23 MiB/s)                     |
+--------------------------------------+-------------------------------------------+

.. note:: The :confval:`bluestore_throttle_bytes` and
   :confval:`bluestore_throttle_deferred_bytes` for SSDs were determined to be
   256 KiB. For HDDs, the value was 40 MiB. The above throughput was obtained
   by running 4 KiB random writes at a queue depth of 64 for 300 secs.
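
The following is a rough sketch of the kind of benchmark used to arrive at the
numbers above; it is not the exact cbt configuration from the referenced
document, and the pool name, RBD image name and throttle targets are
placeholders.

.. code-block:: bash

   # Bluestore throttle values determined for the SSDs (256 KiB).
   ceph config set osd bluestore_throttle_bytes 262144
   ceph config set osd bluestore_throttle_deferred_bytes 262144

   # 4 KiB random writes at a queue depth of 64 for 300 secs against an
   # RBD image carved out of the pool being measured.
   fio --name=baseline --ioengine=rbd --clientname=admin \
       --pool=bench_pool --rbdname=bench_img \
       --rw=randwrite --bs=4k --iodepth=64 \
       --time_based --runtime=300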

Factoring I/O Cost in mClock
============================

The services using mClock have a cost associated with them. The cost can be
different for each service type. The mClock scheduler factors in the cost
during calculations for parameters like *reservation*, *weight* and *limit*.
The calculations determine when the next op for the service type can be
dequeued from the operation queue. In general, the higher the cost, the longer
an op remains in the operation queue.

A cost modeling study was performed to determine the cost per I/O and the cost
per byte for SSD and HDD device types. The following cost specific options are
used under the hood by mClock:

- :confval:`osd_mclock_cost_per_io_usec`
- :confval:`osd_mclock_cost_per_io_usec_hdd`
- :confval:`osd_mclock_cost_per_io_usec_ssd`
- :confval:`osd_mclock_cost_per_byte_usec`
- :confval:`osd_mclock_cost_per_byte_usec_hdd`
- :confval:`osd_mclock_cost_per_byte_usec_ssd`

See :doc:`/rados/configuration/mclock-config-ref` for more details.
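
As a rough illustration (the exact cost calculation is internal to the
scheduler; the option names suggest a fixed per-I/O component plus a
size-dependent per-byte component), the values in effect on an OSD can be
inspected as shown below. ``osd.0`` is a placeholder.

.. code-block:: bash

   # Inspect the HDD cost parameters in effect on an OSD.
   ceph config get osd.0 osd_mclock_cost_per_io_usec_hdd
   ceph config get osd.0 osd_mclock_cost_per_byte_usec_hdd

   # Loosely: cost of an op ~ per-I/O cost + (op size in bytes x per-byte cost),
   # so a 4 MiB recovery op is far costlier than a 4 KiB client write.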

MClock Profile Allocations
==========================

The low-level mClock shares per profile are shown in the tables below. For
parameters like *reservation* and *limit*, the shares are represented as a
percentage of the total OSD capacity. For the *high_client_ops* profile, the
*reservation* parameter is set to 50% of the total OSD capacity. Therefore, for
the NVMe (baseline 21500 IOPS) device, a minimum of 10750 IOPS is reserved for
client operations. These allocations are made under the hood once
a profile is enabled.

The *weight* parameter is unitless. See :ref:`dmclock-qos`.
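
Switching between the built-in profiles is a single option change; the
allocations shown in the tables below are then applied internally. For example
(``osd.0`` is a placeholder, and the option can also be set for all OSDs at
once):

.. code-block:: bash

   # Switch a single OSD to another built-in profile ...
   ceph config set osd.0 osd_mclock_profile high_recovery_ops

   # ... or switch all OSDs at once.
   ceph config set osd osd_mclock_profile high_recovery_ops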

high_client_ops (default)
`````````````````````````

This profile allocates more reservation and limit to external client ops
when compared to background recoveries and other internal clients within
Ceph. This profile is enabled by default.

+------------------------+-------------+--------+-------+
| Service Type           | Reservation | Weight | Limit |
+========================+=============+========+=======+
| client                 | 50%         | 2      | MAX   |
+------------------------+-------------+--------+-------+
| background recovery    | 25%         | 1      | 100%  |
+------------------------+-------------+--------+-------+
| background best effort | 25%         | 2      | MAX   |
+------------------------+-------------+--------+-------+

balanced
`````````

This profile allocates equal reservations to client ops and background
recovery ops. The internal best effort clients get a lower reservation
but a very high limit so that they can complete quickly when there are
no competing services.

+------------------------+-------------+--------+-------+
| Service Type           | Reservation | Weight | Limit |
+========================+=============+========+=======+
| client                 | 40%         | 1      | 100%  |
+------------------------+-------------+--------+-------+
| background recovery    | 40%         | 1      | 150%  |
+------------------------+-------------+--------+-------+
| background best effort | 20%         | 2      | MAX   |
+------------------------+-------------+--------+-------+

high_recovery_ops
`````````````````

This profile allocates more reservation to background recoveries when
compared to external clients and other internal clients within Ceph. For
example, an admin may enable this profile temporarily to speed up background
recoveries during non-peak hours.

+------------------------+-------------+--------+-------+
| Service Type           | Reservation | Weight | Limit |
+========================+=============+========+=======+
| client                 | 30%         | 1      | 80%   |
+------------------------+-------------+--------+-------+
| background recovery    | 60%         | 2      | 200%  |
+------------------------+-------------+--------+-------+
| background best effort | 1 (MIN)     | 2      | MAX   |
+------------------------+-------------+--------+-------+

custom
```````

The custom profile allows the user to have complete control of the mClock
and Ceph config parameters. To use this profile, the user must have a deep
understanding of the workings of Ceph and the mClock scheduler. All the
*reservation*, *weight* and *limit* parameters of the different service types
must be set manually along with any Ceph option(s). This profile may be used
for experimental and exploratory purposes or if the built-in profiles do not
meet the requirements. In such cases, adequate testing must be performed prior
to enabling this profile.
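
A hedged sketch of what enabling the custom profile might look like is shown
below; the OSD id and the QoS values are placeholders, and the steps in
:doc:`/rados/configuration/mclock-config-ref` remain the authoritative
reference.

.. code-block:: bash

   # Switch the OSD to the custom profile.
   ceph config set osd.0 osd_mclock_profile custom

   # Then set each QoS parameter explicitly (placeholder values).
   ceph config set osd.0 osd_mclock_scheduler_client_res 3000
   ceph config set osd.0 osd_mclock_scheduler_client_wgt 2
   ceph config set osd.0 osd_mclock_scheduler_client_lim 10000
   ceph config set osd.0 osd_mclock_scheduler_background_recovery_res 1000
   ceph config set osd.0 osd_mclock_scheduler_background_recovery_wgt 1
   ceph config set osd.0 osd_mclock_scheduler_background_recovery_lim 8000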


Recovery Test Steps
===================

Before bringing up the Ceph cluster, the following mClock configuration
parameters were set appropriately based on the baseline throughput obtained
in the previous section:

- :confval:`osd_mclock_max_capacity_iops_hdd`
- :confval:`osd_mclock_max_capacity_iops_ssd`
- :confval:`osd_mclock_profile`

See :doc:`/rados/configuration/mclock-config-ref` for more details.
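
For example, using the baselines measured above, the capacity and profile
could be seeded along these lines (in this study they were set via the test
configuration before the cluster was brought up; the commands below are the
runtime equivalent, and the profile shown is just one of the three exercised):

.. code-block:: bash

   # Capacity values taken from the baseline measurements above.
   ceph config set osd osd_mclock_max_capacity_iops_ssd 21500
   ceph config set osd osd_mclock_max_capacity_iops_hdd 340

   # Profile under test: high_client_ops, balanced or high_recovery_ops.
   ceph config set osd osd_mclock_profile high_client_ops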

Test Steps (Using cbt)
``````````````````````

1. Bring up the Ceph cluster with 4 osds.
2. Configure the OSDs with replication factor 3.
3. Create a recovery pool to populate recovery data.
4. Create a client pool and prefill some objects in it.
5. Create the recovery thread and mark an OSD down and out.
6. After the cluster handles the OSD down event, recovery data is
   prefilled into the recovery pool. For the tests involving SSDs, prefill 100K
   4 MiB objects into the recovery pool. For the tests involving HDDs, prefill
   5K 4 MiB objects into the recovery pool.
7. After the prefill stage is completed, the downed OSD is brought up and in.
   The backfill phase starts at this point.
8. As soon as the backfill/recovery starts, the test proceeds to initiate client
   I/O on the client pool on another thread using a single client.
9. During step 8 above, statistics related to the client latency and
   bandwidth are captured by cbt. The test also captures the total number of
   misplaced objects and the number of misplaced objects recovered per second.

To summarize, the steps above create 2 pools during the test. Recovery is
triggered on one pool and client I/O is triggered on the other simultaneously.
Statistics captured during the tests are discussed below.
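
The outline below is a much simplified manual approximation of what the cbt
recovery test automates; the pool names, PG counts and durations are
placeholders, and the actual test is driven by a cbt configuration rather than
these commands.

.. code-block:: bash

   # Pools for recovery traffic and client traffic (replication factor 3).
   ceph osd pool create recovery_pool 64
   ceph osd pool create client_pool 64
   ceph osd pool set recovery_pool size 3
   ceph osd pool set client_pool size 3

   # Mark one OSD down and out, then prefill the recovery pool with 4 MiB objects.
   ceph osd down 0
   ceph osd out 0
   rados -p recovery_pool bench 60 write -b 4194304 --no-cleanup

   # Bring the OSD back in; backfill starts and client I/O runs in parallel.
   ceph osd in 0
   rados -p client_pool bench 300 write -b 4096 -t 64 --no-cleanup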


Non-Default Ceph Recovery Options
`````````````````````````````````

Apart from the non-default bluestore throttle already mentioned above, the
following Ceph recovery-related options were modified for tests with both
the WPQ and mClock schedulers.

- :confval:`osd_max_backfills` = 1000
- :confval:`osd_recovery_max_active` = 1000
- :confval:`osd_async_recovery_min_cost` = 1

The above options set a high limit on the number of concurrent local and
remote backfill operations per OSD. Under these conditions the capability of the
mClock scheduler was tested and the results are discussed below.
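
These were applied along the following lines for all OSDs (a sketch of the
equivalent runtime commands, not the exact cbt overrides used):

.. code-block:: bash

   ceph config set osd osd_max_backfills 1000
   ceph config set osd osd_recovery_max_active 1000
   ceph config set osd osd_async_recovery_min_cost 1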

Test Results
============

Test Results With NVMe SSDs
```````````````````````````

Client Throughput Comparison
----------------------------

The chart below shows the average client throughput comparison across the
schedulers and their respective configurations.

.. image:: ../../images/mclock_wpq_study/Avg_Client_Throughput_NVMe_SSD_WPQ_vs_mClock.png

WPQ(def) in the chart shows the average client throughput obtained
using the WPQ scheduler with all other Ceph configuration settings set to
default values. The default setting for :confval:`osd_max_backfills` limits the
number of concurrent local and remote backfills or recoveries per OSD to 1. As a
result, the average client throughput obtained is impressive at just over 18000
IOPS when compared to the baseline value of 21500 IOPS.

However, with the WPQ scheduler along with the non-default options mentioned in
section `Non-Default Ceph Recovery Options`_, things are quite different as
shown in the chart for WPQ(BST). In this case, the average client throughput
obtained drops dramatically to only 2544 IOPS. The non-default recovery options
clearly had a significant impact on the client throughput. In other words,
recovery operations overwhelm the client operations. Sections further below
discuss the recovery rates under these conditions.

With the non-default options, the same test was executed with mClock and with
the default profile (*high_client_ops*) enabled. As per the profile allocation,
the reservation goal of 50% (10750 IOPS) is being met with an average throughput
of 11209 IOPS during the course of recovery operations. This is more than 4x
the throughput obtained with WPQ(BST).

Similar throughput was obtained with the *balanced* (11017 IOPS) and
*high_recovery_ops* (11153 IOPS) profiles as seen in the chart above. This
clearly demonstrates that mClock is able to provide the desired QoS for the
client with multiple concurrent backfill/recovery operations in progress.

Client Latency Comparison
-------------------------

The chart below shows the average completion latency (*clat*) along with the
average 95th, 99th and 99.5th percentiles across the schedulers and their
respective configurations.

.. image:: ../../images/mclock_wpq_study/Avg_Client_Latency_Percentiles_NVMe_SSD_WPQ_vs_mClock.png

The average *clat* latency obtained with WPQ(Def) was 3.535 msec. But in this
case the number of concurrent recoveries was very much limited, at an average of
around 97 objects/sec or ~388 MiB/s, which was a major contributing factor to
the low latency seen by the client.

With WPQ(BST) and with the non-default recovery options, things are very
different with the average *clat* latency shooting up to an average of almost
25 msec which is 7x worse! This is due to the high number of concurrent
recoveries which was measured to be ~350 objects/sec or ~1.4 GiB/s which is
close to the maximum OSD bandwidth.

With mClock enabled and with the default *high_client_ops* profile, the average
*clat* latency was 5.688 msec which is impressive considering the high number
of concurrent active background backfills/recoveries. The recovery rate was
throttled down by mClock to an average of 80 objects/sec or ~320 MiB/s according
to the minimum profile allocation of 25% of the maximum OSD bandwidth, thus
allowing the client operations to meet the QoS goal.

With the other profiles like *balanced* and *high_recovery_ops*, the average
client *clat* latency didn't change much and stayed between 5.7 - 5.8 msec with
variations in the average percentile latency as observed from the chart above.

.. image:: ../../images/mclock_wpq_study/Clat_Latency_Comparison_NVMe_SSD_WPQ_vs_mClock.png

Perhaps a more interesting chart is the comparison chart shown above that
tracks the average *clat* latency variations through the duration of the test.
The chart shows the differences in the average latency between the WPQ scheduler
and the mClock profiles. During the initial phase of the test, for about 150
secs, the differences in the average latency between the WPQ scheduler and
the mClock profiles are quite evident and self-explanatory.
The *high_client_ops* profile shows the lowest latency followed by the
*balanced* and *high_recovery_ops* profiles. The WPQ(BST) run had the highest
average latency through the course of the test.

Recovery Statistics Comparison
------------------------------

Another important aspect to consider is how the recovery bandwidth and recovery
time are affected by mClock profile settings. The chart below outlines the
recovery rates and times for each mClock profile and how they differ from those
of the WPQ scheduler. The total number of objects to be recovered in all the
cases was around 75000 objects as observed in the chart below.

.. image:: ../../images/mclock_wpq_study/Recovery_Rate_Comparison_NVMe_SSD_WPQ_vs_mClock.png

Intuitively, the *high_client_ops* profile should impact recovery operations the
most, and this is indeed the case as it took an average of 966 secs for the
recovery to complete at 80 Objects/sec. The recovery bandwidth as expected was
the lowest at an average of ~320 MiB/s.
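
Since the recovery pool is prefilled with 4 MiB objects (see the test steps
above), the object recovery rate and the recovery bandwidth quoted throughout
this section are related simply as:

.. math::

   80 \; \text{objects/sec} \times 4 \; \text{MiB/object} \approx 320 \; \text{MiB/s}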

.. image:: ../../images/mclock_wpq_study/Avg_Obj_Rec_Throughput_NVMe_SSD_WPQ_vs_mClock.png

The *balanced* profile provides a good middle ground by allocating the same
reservation and weight to client and recovery operations. The recovery rate
curve falls between the *high_recovery_ops* and *high_client_ops* curves with
an average bandwidth of ~480 MiB/s, taking an average of ~647 secs at ~120
Objects/sec to complete the recovery.

The *high_recovery_ops* profile provides the fastest way to complete recovery
operations at the expense of other operations. The recovery bandwidth, at
~635 MiB/s, was nearly 2x the bandwidth observed with the *high_client_ops*
profile. The average object recovery rate was ~159 objects/sec and the recovery
completed the fastest, in approximately 488 secs.

Test Results With HDDs (WAL and DB configured)
``````````````````````````````````````````````

The recovery tests were performed on HDDs with the bluestore WAL and DB
configured on faster NVMe SSDs. The baseline throughput measured was 340 IOPS.

Client Throughput & Latency Comparison
--------------------------------------

The average client throughput comparison for WPQ and mClock and its profiles
is shown in the chart below.

.. image:: ../../images/mclock_wpq_study/Avg_Client_Throughput_HDD_WALdB_WPQ_vs_mClock.png

With WPQ(Def), the average client throughput obtained was ~308 IOPS since the
number of concurrent recoveries was very much limited. The average *clat*
latency was ~208 msec.

However, for WPQ(BST), due to the concurrent recoveries the client throughput
was affected significantly, dropping to 146 IOPS with an average *clat* latency
of 433 msec.

.. image:: ../../images/mclock_wpq_study/Avg_Client_Latency_Percentiles_HDD_WALdB_WPQ_vs_mClock.png

With the *high_client_ops* profile, mClock was able to meet the QoS requirement
for client operations with an average throughput of 271 IOPS, which is nearly
80% of the baseline throughput, at an average *clat* latency of 235 msec.

For the *balanced* and *high_recovery_ops* profiles, the average client
throughput came down marginally to ~248 IOPS and ~240 IOPS respectively. The
average *clat* latency as expected increased to ~258 msec and ~265 msec
respectively.

.. image:: ../../images/mclock_wpq_study/Clat_Latency_Comparison_HDD_WALdB_WPQ_vs_mClock.png

The *clat* latency comparison chart above provides a more comprehensive insight
into the differences in latency through the course of the test. As observed
in the NVMe SSD case, the *high_client_ops* profile shows the lowest latency
in the HDD case as well, followed by the *balanced* and *high_recovery_ops*
profiles. It's fairly easy to discern this between the profiles during the
first 200 secs of the test.

Recovery Statistics Comparison
------------------------------

The charts below compare the recovery rates and times. The total number of
objects to be recovered in all the cases using HDDs with WAL and DB was around
4000 objects as observed in the chart below.

.. image:: ../../images/mclock_wpq_study/Recovery_Rate_Comparison_HDD_WALdB_WPQ_vs_mClock.png

As expected, the *high_client_ops* profile impacts recovery operations the most
as it took an average of ~1409 secs for the recovery to complete at ~3
Objects/sec. The recovery bandwidth as expected was the lowest at ~11 MiB/s.

.. image:: ../../images/mclock_wpq_study/Avg_Obj_Rec_Throughput_HDD_WALdB_WPQ_vs_mClock.png

The *balanced* profile as expected provides a decent compromise with an
average bandwidth of ~16.5 MiB/s, taking an average of ~966 secs at ~4
Objects/sec to complete the recovery.

The *high_recovery_ops* profile is the fastest, with nearly 2x the bandwidth at
~21 MiB/s when compared to the *high_client_ops* profile. The average object
recovery rate was ~5 objects/sec and the recovery completed in approximately
747 secs. This is somewhat similar to the recovery time observed with WPQ(Def)
at 647 secs with a bandwidth of 23 MiB/s and at a rate of 5.8 objects/sec.

Test Results With HDDs (No WAL and DB configured)
`````````````````````````````````````````````````

The recovery tests were also performed on HDDs without the bluestore WAL and DB
configured. The baseline throughput measured was 315 IOPS.

This type of configuration, without the WAL and DB, is probably rare, but
testing was nevertheless performed to get a sense of how mClock performs in a
very restrictive environment where the OSD capacity is at the lower end.
The sections and charts below are very similar to the ones presented above and
are provided here for reference.

Client Throughput & Latency Comparison
--------------------------------------

The average client throughput, latency and percentiles are compared as before
in the set of charts shown below.

.. image:: ../../images/mclock_wpq_study/Avg_Client_Throughput_HDD_NoWALdB_WPQ_vs_mClock.png

.. image:: ../../images/mclock_wpq_study/Avg_Client_Latency_Percentiles_HDD_NoWALdB_WPQ_vs_mClock.png

.. image:: ../../images/mclock_wpq_study/Clat_Latency_Comparison_HDD_NoWALdB_WPQ_vs_mClock.png

Recovery Statistics Comparison
------------------------------

The recovery rates and times are shown in the charts below.

.. image:: ../../images/mclock_wpq_study/Avg_Obj_Rec_Throughput_HDD_NoWALdB_WPQ_vs_mClock.png

.. image:: ../../images/mclock_wpq_study/Recovery_Rate_Comparison_HDD_NoWALdB_WPQ_vs_mClock.png

Key Takeaways and Conclusion
============================

- mClock is able to provide the desired QoS using profiles to allocate proper
  *reservation*, *weight* and *limit* to the service types.
- By using the cost per I/O and the cost per byte parameters, mClock can
  schedule operations appropriately for the different device types (SSD/HDD).

The study so far shows promising results with the refinements made to the mClock
scheduler. Further refinements to mClock and profile tuning are planned. Further
improvements will also be based on feedback from broader testing on larger
clusters and with different workloads.

.. _the dmClock algorithm: https://www.usenix.org/legacy/event/osdi10/tech/full_papers/Gulati.pdf
.. _repository: https://github.com/ceph/dmclock
.. _cbt: https://github.com/ceph/cbt