Merge pull request #33035 from jdurgin/wip-target-ratio

mgr/pg_autoscaler: treat target ratios as weights

Reviewed-by: Sage Weil <sage@redhat.com>
Kefu Chai 2020-02-10 18:41:07 +08:00 committed by GitHub
commit d352315e52
9 changed files with 207 additions and 155 deletions

View File

@@ -226,6 +226,15 @@
autoscaling, see :ref:`pg-autoscaler`. Note that existing pools in
upgraded clusters will still be set to ``warn`` by default.
* The pool parameter ``target_size_ratio``, used by the pg autoscaler,
has changed meaning. It is now normalized across pools, rather than
specifying an absolute ratio. For details, see :ref:`pg-autoscaler`.
If you have set target size ratios on any pools, you may want to set
these pools to autoscale ``warn`` mode to avoid data movement during
the upgrade::
ceph osd pool set <pool-name> pg_autoscale_mode warn
* The ``upmap_max_iterations`` config option of mgr/balancer has been
renamed to ``upmap_max_optimizations`` to better match its behaviour.

View File

@@ -833,21 +833,6 @@ recommended amount with::
Please refer to :ref:`choosing-number-of-placement-groups` and
:ref:`pg-autoscaler` for more information.
POOL_TARGET_SIZE_RATIO_OVERCOMMITTED
____________________________________
One or more pools have a ``target_size_ratio`` property set to
estimate the expected size of the pool as a fraction of total storage,
but the value(s) exceed the total available storage (either by
themselves or in combination with other pools' actual usage).
This is usually an indication that the ``target_size_ratio`` value for
the pool is too large and should be reduced or set to zero with::
ceph osd pool set <pool-name> target_size_ratio 0
For more information, see :ref:`specifying_pool_target_size`.
POOL_TARGET_SIZE_BYTES_OVERCOMMITTED
____________________________________
@@ -863,6 +848,21 @@ the pool is too large and should be reduced or set to zero with::
For more information, see :ref:`specifying_pool_target_size`.
POOL_HAS_TARGET_SIZE_BYTES_AND_RATIO
____________________________________
One or more pools have both ``target_size_bytes`` and
``target_size_ratio`` set to estimate the expected size of the pool.
Only one of these properties should be non-zero. If both are set,
``target_size_ratio`` takes precedence and ``target_size_bytes`` is
ignored.
To reset ``target_size_bytes`` to zero::
ceph osd pool set <pool-name> target_size_bytes 0
For more information, see :ref:`specifying_pool_target_size`.
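A minimal sketch of the precedence rule in Python (illustrative only; it mirrors the option handling added to the autoscaler module later in this diff, and the helper name here is ours)::

    def effective_target_bytes(pool_options):
        # target_size_ratio wins: a non-zero ratio causes
        # target_size_bytes to be ignored by the autoscaler.
        if pool_options.get('target_size_ratio', 0.0) > 0.0:
            return 0
        return pool_options.get('target_size_bytes', 0)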
TOO_FEW_OSDS
____________

View File

@@ -41,10 +41,10 @@ the PG count with this command::
Output will be something like::
POOL SIZE TARGET SIZE RATE RAW CAPACITY RATIO TARGET RATIO PG_NUM NEW PG_NUM AUTOSCALE
a 12900M 3.0 82431M 0.4695 8 128 warn
c 0 3.0 82431M 0.0000 0.2000 1 64 warn
b 0 953.6M 3.0 82431M 0.0347 8 warn
POOL SIZE TARGET SIZE RATE RAW CAPACITY RATIO TARGET RATIO EFFECTIVE RATIO PG_NUM NEW PG_NUM AUTOSCALE
a 12900M 3.0 82431M 0.4695 8 128 warn
c 0 3.0 82431M 0.0000 0.2000 0.9884 1 64 warn
b 0 953.6M 3.0 82431M 0.0347 8 warn
**SIZE** is the amount of data stored in the pool. **TARGET SIZE**, if
present, is the amount of data the administrator has specified that
@@ -62,11 +62,21 @@ pools') data. **RATIO** is the ratio of that total capacity that
this pool is consuming (i.e., ratio = size * rate / raw capacity).
**TARGET RATIO**, if present, is the ratio of storage that the
administrator has specified that they expect this pool to consume.
The system uses the larger of the actual ratio and the target ratio
for its calculation. If both target size bytes and ratio are specified, the
administrator has specified that they expect this pool to consume
relative to other pools with target ratios set.
If both target size bytes and ratio are specified, the
ratio takes precedence.
**EFFECTIVE RATIO** is the target ratio after adjusting in two ways:
1. subtracting any capacity expected to be used by pools with target size set
2. normalizing the target ratios among pools with target ratio set so
they collectively target the rest of the space. For example, 4
pools with target_ratio 1.0 would have an effective ratio of 0.25.
The system uses the larger of the actual ratio and the effective ratio
for its calculation.
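A worked example of these two adjustments, using the same arithmetic as the ``effective_target_ratio()`` helper added by this change (the pool counts and capacities below are illustrative)::

    from math import isclose

    def effective_target_ratio(target_ratio, total_target_ratio,
                               total_target_bytes, capacity):
        # adjustment (2): normalize this pool's ratio against the sum of
        # all target ratios under the same CRUSH root
        target_ratio = float(target_ratio)
        if total_target_ratio:
            target_ratio = target_ratio / total_target_ratio
        # adjustment (1): scale down by whatever fraction of raw capacity
        # is already reserved by pools with target_size_bytes set
        if total_target_bytes and capacity:
            target_ratio *= 1.0 - min(1.0, float(total_target_bytes) / capacity)
        return target_ratio

    # Four pools, each with target_size_ratio 1.0 and no target_size_bytes
    # anywhere: each pool's effective ratio is 0.25.
    assert effective_target_ratio(1.0, 4.0, 0, 100) == 0.25

    # If other pools reserve 20% of raw capacity via target_size_bytes,
    # the same four pools share only the remaining 80% (0.25 * 0.8 = 0.2).
    assert isclose(effective_target_ratio(1.0, 4.0, 20, 100), 0.2)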
**PG_NUM** is the current number of PGs for the pool (or the current
number of PGs that the pool is working towards, if a ``pg_num``
change is in progress). **NEW PG_NUM**, if present, is what the
@@ -119,9 +129,9 @@ PGs can be used from the beginning, preventing subsequent changes in
``pg_num`` and the overhead associated with moving data around when
those adjustments are made.
The *target size** of a pool can be specified in two ways: either in
terms of the absolute size of the pool (i.e., bytes), or as a ratio of
the total cluster capacity.
The *target size* of a pool can be specified in two ways: either in
terms of the absolute size of the pool (i.e., bytes), or as a weight
relative to other pools with a ``target_size_ratio`` set.
For example,::
@@ -130,18 +140,23 @@ For example,::
will tell the system that `mypool` is expected to consume 100 TiB of
space. Alternatively,::
ceph osd pool set mypool target_size_ratio .9
ceph osd pool set mypool target_size_ratio 1.0
will tell the system that `mypool` is expected to consume 90% of the
total cluster capacity.
will tell the system that `mypool` is expected to consume 1.0 relative
to the other pools with ``target_size_ratio`` set. If `mypool` is the
only pool in the cluster, this means an expected use of 100% of the
total capacity. If there is a second pool with ``target_size_ratio``
1.0, both pools would expect to use 50% of the cluster capacity.
You can also set the target size of a pool at creation time with the optional ``--target-size-bytes <bytes>`` or ``--target-size-ratio <ratio>`` arguments to the ``ceph osd pool create`` command.
Note that if impossible target size values are specified (for example,
a capacity larger than the total cluster, or ratio(s) that sum to more
than 1.0) then a health warning
(``POOL_TARET_SIZE_RATIO_OVERCOMMITTED`` or
``POOL_TARGET_SIZE_BYTES_OVERCOMMITTED``) will be raised.
a capacity larger than the total cluster) then a health warning
(``POOL_TARGET_SIZE_BYTES_OVERCOMMITTED``) will be raised.
If both ``target_size_ratio`` and ``target_size_bytes`` are specified
for a pool, only the ratio will be considered, and a health warning
(``POOL_HAS_TARGET_SIZE_BYTES_AND_RATIO``) will be issued.
Specifying bounds on a pool's PGs
---------------------------------

View File

@@ -44,35 +44,34 @@ wait_for 120 "ceph osd pool get a pg_num | grep 4"
wait_for 120 "ceph osd pool get b pg_num | grep 2"
# target ratio
ceph osd pool set a target_size_ratio .5
ceph osd pool set b target_size_ratio .1
sleep 30
APGS=$(ceph osd dump -f json-pretty | jq '.pools[0].pg_num')
BPGS=$(ceph osd dump -f json-pretty | jq '.pools[1].pg_num')
ceph osd pool set a target_size_ratio 5
ceph osd pool set b target_size_ratio 1
sleep 10
APGS=$(ceph osd dump -f json-pretty | jq '.pools[0].pg_num_target')
BPGS=$(ceph osd dump -f json-pretty | jq '.pools[1].pg_num_target')
test $APGS -gt 100
test $BPGS -gt 10
# small ratio change does not change pg_num
ceph osd pool set a target_size_ratio .7
ceph osd pool set b target_size_ratio .2
ceph osd pool set a target_size_ratio 7
ceph osd pool set b target_size_ratio 2
sleep 10
ceph osd pool get a pg_num | grep $APGS
ceph osd pool get b pg_num | grep $BPGS
# too much ratio
ceph osd pool set a target_size_ratio .9
ceph osd pool set b target_size_ratio .9
wait_for 60 "ceph health detail | grep POOL_TARGET_SIZE_RATIO_OVERCOMMITTED"
wait_for 60 "ceph health detail | grep 1.8"
ceph osd pool set a target_size_ratio 0
ceph osd pool set b target_size_ratio 0
APGS2=$(ceph osd dump -f json-pretty | jq '.pools[0].pg_num_target')
BPGS2=$(ceph osd dump -f json-pretty | jq '.pools[1].pg_num_target')
test $APGS -eq $APGS2
test $BPGS -eq $BPGS2
# target_size
ceph osd pool set a target_size_bytes 1000000000000000
ceph osd pool set b target_size_bytes 1000000000000000
wait_for 60 "ceph health detail | grep POOL_TARGET_SIZE_BYTES_OVERCOMMITTED"
ceph osd pool set a target_size_bytes 0
ceph osd pool set a target_size_ratio 0
ceph osd pool set b target_size_ratio 0
wait_for 60 "ceph health detail | grep POOL_TARGET_SIZE_BYTES_OVERCOMMITTED"
ceph osd pool set a target_size_bytes 1000
ceph osd pool set b target_size_bytes 1000
ceph osd pool set a target_size_ratio 1
wait_for 60 "ceph health detail | grep POOL_HAS_TARGET_SIZE_BYTES_AND_RATIO"
ceph osd pool rm a a --yes-i-really-really-mean-it
ceph osd pool rm b b --yes-i-really-really-mean-it

View File

@@ -1 +1 @@
from .module import PgAutoscaler
from .module import PgAutoscaler, effective_target_ratio

View File

@@ -2,14 +2,12 @@
Automatically scale pg_num based on how much data is stored in each pool.
"""
import errno
import json
import mgr_util
import threading
import uuid
from six import itervalues, iteritems
from collections import defaultdict
from prettytable import PrettyTable, PLAIN_COLUMNS
from prettytable import PrettyTable
from mgr_module import MgrModule
"""
@@ -44,16 +42,43 @@ def nearest_power_of_two(n):
return x if (v - n) > (n - x) else v
def effective_target_ratio(target_ratio, total_target_ratio, total_target_bytes, capacity):
"""
Returns the target ratio after normalizing for ratios across pools and
adjusting for capacity reserved by pools that have target_size_bytes set.
"""
target_ratio = float(target_ratio)
if total_target_ratio:
target_ratio = target_ratio / total_target_ratio
if total_target_bytes and capacity:
fraction_available = 1.0 - min(1.0, float(total_target_bytes) / capacity)
target_ratio *= fraction_available
return target_ratio
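# Worked example (added for illustration, not part of the original change):
# two pools under one root have target_size_ratio values summing to 2.0, and
# other pools reserve 20% of raw capacity via target_size_bytes. A pool with
# ratio 1.0 then gets effective_target_ratio(1.0, 2.0, 20, 100)
# = (1.0 / 2.0) * (1.0 - 0.2) -> 0.4.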
class PgAdjustmentProgress(object):
"""
Keeps the initial and target pg_num values
"""
def __init__(self, pg_num, pg_num_target, ev_id, increase_decrease):
self._ev_id = ev_id
self._pg_num = pg_num
self._pg_num_target = pg_num_target
self._increase_decrease = increase_decrease
def __init__(self, pool_id, pg_num, pg_num_target):
self.ev_id = str(uuid.uuid4())
self.pool_id = pool_id
self.reset(pg_num, pg_num_target)
def reset(self, pg_num, pg_num_target):
self.pg_num = pg_num
self.pg_num_target = pg_num_target
def update(self, module, progress):
desc = 'increasing' if self.pg_num < self.pg_num_target else 'decreasing'
module.remote('progress', 'update', self.ev_id,
ev_msg="PG autoscaler %s pool %d PGs from %d to %d" %
(desc, self.pool_id, self.pg_num, self.pg_num_target),
ev_progress=progress,
refs=[("pool", self.pool_id)])
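# Note (added for clarity): a PgAdjustmentProgress is created or reset in
# _maybe_adjust() whenever a pool's pg_num_target changes, and
# _update_progress_events() calls update() on it each time the module
# re-evaluates, completing the event once pg_num reaches pg_num_target.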
class PgAutoscaler(MgrModule):
"""
@@ -120,6 +145,7 @@ class PgAutoscaler(MgrModule):
table = PrettyTable(['POOL', 'SIZE', 'TARGET SIZE',
'RATE', 'RAW CAPACITY',
'RATIO', 'TARGET RATIO',
'EFFECTIVE RATIO',
'BIAS',
'PG_NUM',
# 'IDEAL',
@@ -134,6 +160,7 @@ class PgAutoscaler(MgrModule):
table.align['RAW CAPACITY'] = 'r'
table.align['RATIO'] = 'r'
table.align['TARGET RATIO'] = 'r'
table.align['EFFECTIVE RATIO'] = 'r'
table.align['BIAS'] = 'r'
table.align['PG_NUM'] = 'r'
# table.align['IDEAL'] = 'r'
@@ -152,6 +179,10 @@ class PgAutoscaler(MgrModule):
tr = '%.4f' % p['target_ratio']
else:
tr = ''
if p['effective_target_ratio'] > 0.0:
etr = '%.4f' % p['effective_target_ratio']
else:
etr = ''
table.add_row([
p['pool_name'],
mgr_util.format_bytes(p['logical_used'], 6),
@@ -160,6 +191,7 @@ class PgAutoscaler(MgrModule):
mgr_util.format_bytes(p['subtree_capacity'], 6),
'%.4f' % p['capacity_ratio'],
tr,
etr,
p['bias'],
p['pg_num_target'],
# p['pg_num_ideal'],
@@ -200,6 +232,8 @@ class PgAutoscaler(MgrModule):
self.capacity = None # Total capacity of OSDs in subtree
self.pool_ids = []
self.pool_names = []
self.total_target_ratio = 0.0
self.total_target_bytes = 0 # including replication / EC overhead
# identify subtrees (note that they may overlap!)
for pool_id, pool in osdmap.get_pools().items():
@@ -220,10 +254,16 @@ class PgAutoscaler(MgrModule):
result[root_id] = s
s.root_ids.append(root_id)
s.osds |= osds
s.pool_ids.append(int(pool_id))
s.pool_ids.append(pool_id)
s.pool_names.append(pool['pool_name'])
s.pg_current += pool['pg_num_target'] * pool['size']
target_ratio = pool['options'].get('target_size_ratio', 0.0)
if target_ratio:
s.total_target_ratio += target_ratio
else:
target_bytes = pool['options'].get('target_size_bytes', 0)
if target_bytes:
s.total_target_bytes += target_bytes * osdmap.pool_raw_used_rate(pool_id)
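# (Comment added for clarity: these per-root totals are later passed to
# effective_target_ratio(), so ratios are normalized per CRUSH root after
# subtracting the capacity reserved via target_size_bytes.)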
# finish subtrees
all_stats = self.get('osd_stats')
@@ -250,7 +290,6 @@ class PgAutoscaler(MgrModule):
return result, pool_root
def _get_pool_status(
self,
osdmap,
@@ -290,7 +329,10 @@ class PgAutoscaler(MgrModule):
pool_logical_used = pool_stats[pool_id]['stored']
bias = p['options'].get('pg_autoscale_bias', 1.0)
target_bytes = p['options'].get('target_size_bytes', 0)
target_bytes = 0
# ratio takes precedence if both are set
if p['options'].get('target_size_ratio', 0.0) == 0.0:
target_bytes = p['options'].get('target_size_bytes', 0)
# What proportion of space are we using?
actual_raw_used = pool_logical_used * raw_used_rate
@@ -299,7 +341,16 @@ class PgAutoscaler(MgrModule):
pool_raw_used = max(pool_logical_used, target_bytes) * raw_used_rate
capacity_ratio = float(pool_raw_used) / capacity
target_ratio = p['options'].get('target_size_ratio', 0.0)
self.log.info("effective_target_ratio {0} {1} {2} {3}".format(
p['options'].get('target_size_ratio', 0.0),
root_map[root_id].total_target_ratio,
root_map[root_id].total_target_bytes,
capacity))
target_ratio = effective_target_ratio(p['options'].get('target_size_ratio', 0.0),
root_map[root_id].total_target_ratio,
root_map[root_id].total_target_bytes,
capacity)
final_ratio = max(capacity_ratio, target_ratio)
# So what proportion of pg allowance should we be using?
@@ -340,7 +391,8 class PgAutoscaler(MgrModule):
'raw_used': pool_raw_used,
'actual_capacity_ratio': actual_capacity_ratio,
'capacity_ratio': capacity_ratio,
'target_ratio': target_ratio,
'target_ratio': p['options'].get('target_size_ratio', 0.0),
'effective_target_ratio': target_ratio,
'pg_num_ideal': int(pool_pg_target),
'pg_num_final': final_pg_target,
'would_adjust': adjust,
@@ -348,37 +400,19 @@ class PgAutoscaler(MgrModule):
});
return (ret, root_map, pool_root)
def _update_progress_events(self):
osdmap = self.get_osdmap()
pools = osdmap.get_pools()
for pool_id in list(self._event):
ev = self._event[pool_id]
if int(pool_id) not in pools:
# pool is gone
self.remote('progress', 'complete', ev._ev_id)
pool_data = pools.get(pool_id)
if pool_data is None or pool_data['pg_num'] == pool_data['pg_num_target']:
# pool is gone or we've reached our target
self.remote('progress', 'complete', ev.ev_id)
del self._event[pool_id]
continue
pool_data = pools[int(pool_id)]
pg_num = pool_data['pg_num']
pg_num_target = pool_data['pg_num_target']
initial_pg_num = ev._pg_num
initial_pg_num_target = ev._pg_num_target
progress = (pg_num - initial_pg_num) / (pg_num_target - initial_pg_num)
if pg_num == pg_num_target:
self.remote('progress', 'complete', ev._ev_id)
del self._event[pool_id]
continue
elif pg_num == initial_pg_num:
# Means no change
continue
else:
self.remote('progress', 'update', ev._ev_id,
ev_msg="PG autoscaler %s pool %s PGs from %d to %d" %
(ev._increase_decrease, pool_id, pg_num, pg_num_target),
ev_progress=progress,
refs=[("pool", int(pool_id))])
ev.update(self, (ev.pg_num - pool_data['pg_num']) / (ev.pg_num - ev.pg_num_target))
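# Example of the progress fraction above (illustrative numbers, not part of
# the original change): a pool that started this adjustment at pg_num=4 with
# pg_num_target=64 and is currently at pg_num=34 reports
# (4 - 34) / (4 - 64) = 0.5, i.e. the event shows 50% complete.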
def _maybe_adjust(self):
self.log.info('_maybe_adjust')
@@ -392,23 +426,18 @@ class PgAutoscaler(MgrModule):
# drop them from consideration.
too_few = []
too_many = []
bytes_and_ratio = []
health_checks = {}
total_ratio = dict([(r, 0.0) for r in iter(root_map)])
total_target_ratio = dict([(r, 0.0) for r in iter(root_map)])
target_ratio_pools = dict([(r, []) for r in iter(root_map)])
total_bytes = dict([(r, 0) for r in iter(root_map)])
total_target_bytes = dict([(r, 0.0) for r in iter(root_map)])
target_bytes_pools = dict([(r, []) for r in iter(root_map)])
for p in ps:
pool_id = str(p['pool_id'])
total_ratio[p['crush_root_id']] += max(p['actual_capacity_ratio'],
p['target_ratio'])
if p['target_ratio'] > 0:
total_target_ratio[p['crush_root_id']] += p['target_ratio']
target_ratio_pools[p['crush_root_id']].append(p['pool_name'])
pool_id = p['pool_id']
pool_opts = pools[p['pool_name']]['options']
if pool_opts.get('target_size_ratio', 0) > 0 and pool_opts.get('target_size_bytes', 0) > 0:
bytes_and_ratio.append('Pool %s has target_size_bytes and target_size_ratio set' % p['pool_name'])
total_bytes[p['crush_root_id']] += max(
p['actual_raw_used'],
p['target_bytes'] * p['raw_used_rate'])
@@ -436,31 +465,17 @@ class PgAutoscaler(MgrModule):
'var': 'pg_num',
'val': str(p['pg_num_final'])
})
# Create new event for each pool
# and update existing events
# Call Progress Module to create progress event
if pool_id not in self._event:
osdmap = self.get_osdmap()
pools = osdmap.get_pools()
pool_data = pools[int(pool_id)]
pg_num = pool_data['pg_num']
pg_num_target = pool_data['pg_num_target']
ev_id = str(uuid.uuid4())
pg_adj_obj = None
if pg_num < pg_num_target:
pg_adj_obj = PgAdjustmentProgress(pg_num, pg_num_target, ev_id, 'increasing')
self._event[pool_id] = pg_adj_obj
else:
pg_adj_obj = PgAdjustmentProgress(pg_num, pg_num_target, ev_id, 'decreasing')
self._event[pool_id] = pg_adj_obj
self.remote('progress', 'update', ev_id,
ev_msg="PG autoscaler %s pool %s PGs from %d to %d" %
(pg_adj_obj._increase_decrease, pool_id, pg_num, pg_num_target),
ev_progress=0.0,
refs=[("pool", int(pool_id))])
# create new event or update existing one to reflect
# progress from current state to the new pg_num_target
pool_data = pools[p['pool_name']]
pg_num = pool_data['pg_num']
new_target = p['pg_num_final']
if pool_id in self._event:
self._event[pool_id].reset(pg_num, new_target)
else:
self._event[pool_id] = PgAdjustmentProgress(pool_id, pg_num, new_target)
self._event[pool_id].update(self, 0.0)
if r[0] != 0:
# FIXME: this is a serious and unexpected thing,
@@ -490,34 +505,6 @@ class PgAutoscaler(MgrModule):
'detail': too_many
}
too_much_target_ratio = []
for root_id, total in iteritems(total_ratio):
total_target = total_target_ratio[root_id]
if total_target > 0 and total > 1.0:
too_much_target_ratio.append(
'Pools %s overcommit available storage by %.03fx due to '
'target_size_ratio %.03f on pools %s' % (
root_map[root_id].pool_names,
total,
total_target,
target_ratio_pools[root_id]
)
)
elif total_target > 1.0:
too_much_target_ratio.append(
'Pools %s have collective target_size_ratio %.03f > 1.0' % (
root_map[root_id].pool_names,
total_target
)
)
if too_much_target_ratio:
health_checks['POOL_TARGET_SIZE_RATIO_OVERCOMMITTED'] = {
'severity': 'warning',
'summary': "%d subtrees have overcommitted pool target_size_ratio" % len(too_much_target_ratio),
'count': len(too_much_target_ratio),
'detail': too_much_target_ratio,
}
too_much_target_bytes = []
for root_id, total in iteritems(total_bytes):
total_target = total_target_bytes[root_id]
@@ -548,5 +535,12 @@ class PgAutoscaler(MgrModule):
'detail': too_much_target_bytes,
}
if bytes_and_ratio:
health_checks['POOL_HAS_TARGET_SIZE_BYTES_AND_RATIO'] = {
'severity': 'warning',
'summary': "%d pools have both target_size_bytes and target_size_ratio set" % len(bytes_and_ratio),
'count': len(bytes_and_ratio),
'detail': bytes_and_ratio,
}
self.set_health_checks(health_checks)

View File

@@ -0,0 +1,34 @@
from pg_autoscaler import effective_target_ratio
from pytest import approx
def check_simple_ratio(target_ratio, tot_ratio):
etr = effective_target_ratio(target_ratio, tot_ratio, 0, 0)
assert (target_ratio / tot_ratio) == approx(etr)
return etr
def test_simple():
etr1 = check_simple_ratio(0.2, 0.9)
etr2 = check_simple_ratio(2, 9)
etr3 = check_simple_ratio(20, 90)
assert etr1 == approx(etr2)
assert etr1 == approx(etr3)
etr = check_simple_ratio(0.9, 0.9)
assert etr == approx(1.0)
etr1 = check_simple_ratio(1, 2)
etr2 = check_simple_ratio(0.5, 1.0)
assert etr1 == approx(etr2)
def test_total_bytes():
etr = effective_target_ratio(1, 10, 5, 10)
assert etr == approx(0.05)
etr = effective_target_ratio(0.1, 1, 5, 10)
assert etr == approx(0.05)
etr = effective_target_ratio(1, 1, 5, 10)
assert etr == approx(0.5)
etr = effective_target_ratio(1, 1, 0, 10)
assert etr == approx(1.0)
etr = effective_target_ratio(0, 1, 5, 10)
assert etr == approx(0.0)
etr = effective_target_ratio(1, 1, 10, 10)
assert etr == approx(0.0)

View File

@@ -4,4 +4,5 @@ ipaddress; python_version < '3.3'
../../python-common
kubernetes
requests-mock
pyyaml
pyyaml
prettytable

View File

@@ -5,7 +5,7 @@ skipsdist = true
[testenv]
setenv = UNITTEST = true
deps = -r requirements.txt
commands = pytest -v --cov --cov-append --cov-report=term --doctest-modules {posargs:mgr_util.py tests/ cephadm/ progress/}
commands = pytest -v --cov --cov-append --cov-report=term --doctest-modules {posargs:mgr_util.py tests/ cephadm/ pg_autoscaler/ progress/}
[testenv:mypy]
basepython = python3