Merge pull request #19331 from jecluis/wip-mon-osdmap-prune

mon: osdmap prune

Reviewed-by: Sage Weil <sage@redhat.com>
Reviewed-by: Kefu Chai <kchai@redhat.com>
Joao Eduardo Luis 2018-04-06 15:22:28 +01:00 committed by GitHub
commit 940dd941ef
18 changed files with 1468 additions and 25 deletions

View File

@ -0,0 +1,415 @@
===========================
FULL OSDMAP VERSION PRUNING
===========================
For each incremental osdmap epoch, the monitor will keep a full osdmap
epoch in the store.
While this is great when serving osdmap requests from clients, allowing
us to fulfill their requests without having to recompute the full osdmap
from a myriad of incrementals, it can also become a burden once we start
keeping an unbounded number of osdmaps.
The monitors will attempt to keep a bounded number of osdmaps in the store.
This number is defined (and configurable) via ``mon_min_osdmap_epochs``, and
defaults to 500 epochs. Generally speaking, we will remove older osdmap
epochs once we go over this limit.
However, there are a few constraints to removing osdmaps. These are all
defined in ``OSDMonitor::get_trim_to()``.
In the event one of these constraints is not met, we may go over the bounds
defined by ``mon_min_osdmap_epochs``. And if the cluster does not meet the
trim criteria for some time (e.g., unclean pgs), the monitor may start
keeping a lot of osdmaps. This can start putting pressure on the underlying
key/value store, as well as on the available disk space.
One way to mitigate this problem would be to stop keeping full osdmap
epochs on disk. We would have to rebuild osdmaps on-demand, or grab them
from cache if they had been recently served. We would still have to keep
at least one osdmap, and apply all incrementals on top of either this
oldest map epoch kept in the store or a more recent map grabbed from cache.
While this would be feasible, it seems like a lot of CPU (and potentially
I/O) would go into rebuilding osdmaps.
Additionally, while this would prevent the aforementioned problem going
forward, it would do nothing for stores currently in a state that would
truly benefit from not keeping osdmaps.
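Conceptually, rebuilding an arbitrary epoch means loading the closest older
full map still in the store (or in cache) and replaying incrementals on top
of it. A minimal C++ sketch of that idea, where ``load_full()`` and
``load_inc()`` are hypothetical store accessors (the actual patch does this
in ``OSDMonitor::get_full_from_pinned_map()``)::

    // Rebuild the full osdmap for `ver` from an older full map at `base`
    // by applying every incremental in between.
    OSDMap rebuild_full_map(version_t base, version_t ver)
    {
      OSDMap m = load_full(base);                // hypothetical accessor
      for (version_t v = base + 1; v <= ver; ++v) {
        OSDMap::Incremental inc = load_inc(v);   // hypothetical accessor
        int err = m.apply_incremental(inc);
        ceph_assert(err == 0);
      }
      return m;
    }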
This brings us to full osdmap pruning.
Instead of not keeping full osdmap epochs, we are going to prune some of
them when we have too many.
Whether we have too many is dictated by a configurable option,
``mon_osdmap_full_prune_min`` (default: 10000). The pruning algorithm will be
engaged once we go over this threshold.
We will not remove all ``mon_osdmap_full_prune_min`` full osdmap epochs
though. Instead, we are going to poke some holes in the sequence of full
maps. By default, we will keep one full osdmap per 10 maps since the last
map kept; i.e., if we keep epoch 1, we will also keep epoch 11 and remove
full map epochs 2 to 10. The size of this interval is configurable with
``mon_osdmap_full_prune_interval``.
Essentially, we are proposing to keep ~10% of the full maps, but we will
always honour the minimum number of osdmap epochs, as defined by
``mon_min_osdmap_epochs``, and these will not count towards the minimum
number of versions to prune. For instance, if we have on-disk versions
[1..50000], we would allow the pruning algorithm to operate only over
osdmap epochs [1..49500); but if we have on-disk versions [1..10200], we
won't be pruning, because the algorithm would only operate on versions
[1..9700), and this interval contains fewer versions than the minimum
required by ``mon_osdmap_full_prune_min``.
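To make the arithmetic above concrete, here is a minimal standalone sketch
of the engagement check (illustrative names only, not the actual
``OSDMonitor::should_prune()`` code)::

    #include <cstdint>
    #include <iostream>

    // Returns true if the prunable range [first..last - min_epochs) holds
    // at least prune_min versions.
    bool should_engage_prune(uint64_t first, uint64_t last,
                             uint64_t min_epochs,  // mon_min_osdmap_epochs
                             uint64_t prune_min)   // mon_osdmap_full_prune_min
    {
      uint64_t prune_to = last - min_epochs;  // newest epochs are off-limits
      return (last - first) > min_epochs && (prune_to - first) >= prune_min;
    }

    int main()
    {
      // [1..50000]: prunable range [1..49500) is large enough; prints 1.
      std::cout << should_engage_prune(1, 50000, 500, 10000) << "\n";
      // [1..10200]: prunable range [1..9700) is below the minimum; prints 0.
      std::cout << should_engage_prune(1, 10200, 500, 10000) << "\n";
    }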
ALGORITHM
=========
Say we have 50,000 osdmap epochs in the store, and we're using the
defaults for all configurable options.
::
-----------------------------------------------------------
|1|2|..|10|11|..|100|..|1000|..|10000|10001|..|49999|50000|
-----------------------------------------------------------
^ first last ^
We will prune when all the following constraints are met:
1. number of versions is greater than ``mon_min_osdmap_epochs``;
2. the number of versions between ``first`` and ``prune_to`` is greater than
(or equal to) ``mon_osdmap_full_prune_min``, with ``prune_to`` being equal to
``last`` minus ``mon_min_osdmap_epochs``.
If any of these conditions fails, we will *not* prune any maps.
Furthermore, if we have been pruning but no longer satisfy at least one of
the above constraints, we will not continue to prune. In essence, we only
prune full osdmaps if the number of epochs in the store warrants it.
As pruning will create gaps in the sequence of full maps, we need to keep
track of the intervals of missing maps. We do this by keeping a manifest of
pinned maps -- i.e., a list of maps that, by being pinned, are not to be
pruned.
While pinned maps are not removed from the store, maps between two consecutive
pinned maps will be; the number of maps to be removed is dictated by the
configurable option ``mon_osdmap_full_prune_interval``. The algorithm makes an
effort to keep pinned maps apart by as many maps as defined by this option,
but corner cases may result in smaller intervals. Additionally, because this
option is read on every prune iteration, the interval may change mid-prune if
the user changes the option.
Pinning maps is performed lazily: we pin maps as we remove maps. This grants
us more flexibility to change the prune interval while pruning is happening,
but also considerably simplifies the algorithm, as well as the information we
need to keep in the manifest. Below we show a simplified version of the
algorithm::
manifest.pin(first)
last_to_prune = last - mon_min_osdmap_epochs
while manifest.get_last_pinned() + prune_interval < last_to_prune AND
last_to_prune - first > mon_min_osdmap_epochs AND
last_to_prune - first > mon_osdmap_full_prune_min AND
num_pruned < mon_osdmap_full_prune_txsize:
last_pinned = manifest.get_last_pinned()
new_pinned = last_pinned + prune_interval
manifest.pin(new_pinned)
# exclusive bounds: erase only the maps strictly between the two pins
for e in (last_pinned+1 .. new_pinned-1):
store.erase(e)
++num_pruned
In essence, the algorithm ensures that the first version in the store is
*always* pinned. After all, we need a starting point when rebuilding maps, and
we can't simply remove the earliest map we have; otherwise we would be unable
to rebuild maps for the very first pruned interval.
Once we have at least one pinned map, each iteration of the algorithm can
simply base itself on the manifest's last pinned map (which we can obtain by
reading the element at the tail of the manifest's pinned maps list).
We'll next need to determine the interval of maps to be removed: all the maps
from ``last_pinned`` up to ``new_pinned``, which in turn is nothing more than
``last_pinned`` plus ``mon_osdmap_full_prune_interval``. We know that all maps
strictly between these two values, ``last_pinned`` and ``new_pinned``, can be
removed, considering ``new_pinned`` has been pinned.
The algorithm ceases to execute as soon as one of the two initial
preconditions is not met, or if we do not meet two additional conditions that
have no weight on the algorithm's correctness:
1. We will stop if we are not able to create a new pruning interval properly
aligned with ``mon_osdmap_full_prune_interval`` that is lower than
``last_to_prune``. There is no particular technical reason to enforce this
requirement, besides allowing us to keep the intervals at an expected size,
and preventing the small, irregular intervals that would otherwise be bound
to happen eventually (e.g., pruning continuing over the course of several
iterations, removing only one or two or three maps each time).
2. We will stop once we know that we have pruned more than a certain number of
maps. This value is defined by ``mon_osdmap_full_prune_txsize``, and
ensures we don't spend an unbounded number of cycles pruning maps. We don't
enforce this value strictly (deletes do not cost much), but we make an
effort to honor it.
We could do the removal in one go, but we have no idea how long that would
take. Therefore, we will perform several iterations, removing at most
``mon_osdmap_full_prune_txsize`` osdmaps per iteration. For example, with the
default interval of 10 and txsize of 100, a single prune transaction pins up
to 11 new maps and removes up to 99 full maps.
In the end, our on-disk map sequence will look similar to::
------------------------------------------
|1|10|20|30|..|49500|49501|..|49999|50000|
------------------------------------------
^ first last ^
Because we are not pruning all versions in one go, we need to keep state
about how far along on our pruning we are. With that in mind, we have
created a data structure, ``osdmap_manifest_t``, that holds the set of pinned
maps::
struct osdmap_manifest_t:
set<version_t> pinned;
Given we are only pinning maps while we are pruning, we don't need to keep
track of additional state about the last pruned version. We know as a matter
of fact that we have pruned all the intermediate maps between any two
consecutive pinned maps.
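For example, with ``pinned = {1, 11, 21}`` we know maps 2..10 and 12..20
have been pruned, and rebuilding epoch 15 means starting from pinned epoch
11 and applying incrementals 12 through 15. A standalone sketch of locating
that starting point, mirroring the behaviour of
``osdmap_manifest_t::get_lower_closest_pinned()`` from this patch::

    #include <cassert>
    #include <cstdint>
    #include <set>

    using version_t = uint64_t;

    // Greatest pinned epoch <= v, or 0 if v falls outside the pinned range.
    version_t lower_closest_pinned(const std::set<version_t>& pinned,
                                   version_t v)
    {
      auto p = pinned.lower_bound(v);  // first pinned epoch >= v
      if (p == pinned.end()) {
        return 0;  // v is past the last pin: those maps were never pruned
      }
      if (*p > v) {
        if (p == pinned.begin()) {
          return 0;  // v precedes every pinned epoch
        }
        --p;  // step back to the closest pinned epoch below v
      }
      return *p;
    }

    int main()
    {
      std::set<version_t> pinned = {1, 11, 21};
      assert(lower_closest_pinned(pinned, 15) == 11);
      assert(lower_closest_pinned(pinned, 11) == 11);
    }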
The question one could ask, though, is how we can be sure we pruned all the
intermediate maps if, for instance, the monitor crashes. To protect against
such an event, we always write the osdmap manifest to disk in the same
transaction that deletes the maps. This guarantees that, if the monitor
crashes, we will read the latest version of the manifest: either it contains
the newly pinned maps, meaning we also pruned the in-between maps, or we will
find the previous version of the osdmap manifest, which does not contain the
maps we were pinning at the time we crashed, given that the transaction that
would have written the updated osdmap manifest was not applied (alongside the
map removals).
The osdmap manifest will be written to the store each time we prune, with an
updated list of pinned maps. It is written in the transaction effectively
pruning the maps, so we guarantee the manifest is always up to date. As a
consequence of this criterion, the first time we write the osdmap manifest
is the first time we prune. If an osdmap manifest does not exist, we can be
certain we do not hold pruned map intervals.
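A simplified sketch of that ordering guarantee, following the shape of
``OSDMonitor::do_prune()`` in this patch (``full_key()`` stands in for the
store's actual key naming)::

    // Erase the pruned epochs and persist the updated manifest in the
    // *same* transaction, so either both take effect or neither does.
    void prune_one_interval(MonitorDBStore::TransactionRef tx,
                            osdmap_manifest_t& manifest,
                            version_t last_pinned,
                            version_t next_pinned)
    {
      manifest.pin(next_pinned);
      for (version_t v = last_pinned + 1; v < next_pinned; ++v) {
        tx->erase("osdmap", full_key(v));        // drop the full map
      }
      bufferlist bl;
      manifest.encode(bl);
      tx->put("osdmap", "osdmap_manifest", bl);  // atomic with the erasures
    }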
We will rely on the manifest to ascertain whether we have pruned map
intervals. In theory, this will always be the on-disk osdmap manifest, but we
make sure to read the on-disk osdmap manifest each time we update from paxos;
this way we always have an up-to-date in-memory osdmap manifest.
Once we finish pruning maps, we will keep the manifest in the store, to
allow us to easily find which maps have been pinned (instead of checking
the store until we find a map). This has the added benefit of allowing us to
quickly figure out which is the next interval we need to prune (i.e., last
pinned plus the prune interval). This doesn't however mean we will forever
keep the osdmap manifest: the osdmap manifest will no longer be required once
the monitor trims osdmaps and the earliest available epoch in the store is
greater than the last map we pruned.
The same conditions from ``OSDMonitor::get_trim_to()`` that force the monitor
to keep a lot of osdmaps, thus requiring us to prune, may eventually change
and allow the monitor to remove some of its oldest maps.
MAP TRIMMING
------------
If the monitor trims maps, we must then adjust the osdmap manifest to
reflect our pruning status, or remove the manifest entirely if it no longer
makes sense to keep it. For instance, take the map sequence from before, but
let us assume we did not finish pruning all the maps.::
-------------------------------------------------------------
|1|10|20|30|..|490|500|501|502|..|49500|49501|..|49999|50000|
-------------------------------------------------------------
^ first ^ pinned.last() last ^
pinned = {1, 10, 20, ..., 490, 500}
Now let us assume that the monitor will trim up to epoch 501. This means
removing all maps prior to epoch 501, and updating the ``first_committed``
pointer to ``501``. Given removing all those maps would invalidate our
existing pruning efforts, we can consider our pruning finished and drop
our osdmap manifest. Doing so also simplifies starting a new prune, if all
the starting conditions are met once we have refreshed our state from the
store.
We would then have the following map sequence: ::
---------------------------------------
|501|502|..|49500|49501|..|49999|50000|
---------------------------------------
^ first last ^
However, imagine a slightly more convoluted scenario: the monitor will trim
up to epoch 491. In this case, epoch 491 has been previously pruned from the
store.
Given we will always need to have the oldest known map in the store, before
we trim we will have to check whether that map is in the prune interval
(i.e., whether said map epoch belongs to ``[ pinned.first()..pinned.last() )``).
If so, we need to check if this is a pinned map, in which case we don't have
much to be concerned about aside from removing lower epochs from the
manifest's pinned list. On the other hand, if the map being trimmed to is not
a pinned map, we will need to rebuild said map and pin it, and only then will
we remove the pinned maps prior to the map's epoch.
In this case, we would end up with the following sequence::
-----------------------------------------------
|491|500|501|502|..|49500|49501|..|49999|50000|
-----------------------------------------------
^ ^- pinned.last() last ^
`- first
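A minimal sketch of this manifest adjustment, mirroring
``OSDMonitor::_prune_update_trimmed()`` from this patch::

    // After trimming the store up to `first`, make `first` the new first
    // pin (the trim path has already rebuilt and written that map if it
    // had been pruned) and drop every lower pinned epoch.
    void on_trimmed_to(osdmap_manifest_t& m, version_t first)
    {
      if (!m.is_pinned(first)) {
        m.pin(first);
      }
      m.pinned.erase(m.pinned.begin(), m.pinned.find(first));
      ceph_assert(m.get_first_pinned() == first);
    }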
There is still an edge case that we should mention. Consider that we are
going to trim up to epoch 499, which is the very last pruned epoch.
Much like the scenario above, we would end up writing osdmap epoch 499 to
the store; but what should we do about pinned maps and pruning?
The simplest solution is to drop the osdmap manifest. After all, given we
are trimming to the last pruned map, and we are rebuilding this map, we can
guarantee that all maps greater than epoch 499 are sequential (because we
have not pruned any of them). Dropping the osdmap manifest in this case is
essentially the same as if we were trimming past the last pruned epoch: we
can prune again later if we meet the required conditions.
And, with this, we have fully covered full osdmap pruning. Later in this
document one can find detailed `REQUIREMENTS, CONDITIONS & INVARIANTS` for the
whole algorithm, from pruning to trimming. Additionally, the next section
details several additional checks to guarantee the sanity of our configuration
options. Enjoy.
CONFIGURATION OPTIONS SANITY CHECKS
-----------------------------------
We perform additional checks before pruning to ensure all configuration
options involved are sane (a sample configuration satisfying all of the
checks is shown after the list):
1. If ``mon_osdmap_full_prune_interval`` is zero we will not prune; we
require a positive number, greater than one, to be able to prune maps. If
the interval were one, consecutive pinned maps would be adjacent epochs,
leaving zero maps in between; hence, no maps would ever be pruned.
2. If ``mon_osdmap_full_prune_min`` is zero we will not prune; we require a
positive value so we know the threshold over which we should prune. We
don't want to guess.
3. If ``mon_osdmap_full_prune_interval`` is greater than
``mon_osdmap_full_prune_min`` we will not prune, as it is impossible to
ascertain a proper prune interval.
4. If ``mon_osdmap_full_prune_txsize`` is not greater than
``mon_osdmap_full_prune_interval`` we will not prune; we require a
``txsize`` greater than ``interval`` and, depending on the value of the
latter, ideally much higher.
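For instance, the shipped defaults (from ``src/common/options.cc`` in this
patch) satisfy all four checks::

    [mon]
    mon osdmap full prune enabled = true
    mon osdmap full prune min = 10000    # check 2: greater than zero
    mon osdmap full prune interval = 10  # checks 1 and 3: > 1 and <= min
    mon osdmap full prune txsize = 100   # check 4: greater than interval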
REQUIREMENTS, CONDITIONS & INVARIANTS
-------------------------------------
REQUIREMENTS
~~~~~~~~~~~~
* All monitors in the quorum need to support pruning.
* Once pruning has been enabled, monitors not supporting pruning will not be
allowed in the quorum, nor will they be allowed to synchronize.
* Removing the osdmap manifest results in disabling the pruning feature quorum
requirement. This means that monitors not supporting pruning will be allowed
to synchronize and join the quorum, provided they support all other required
features.
CONDITIONS & INVARIANTS
~~~~~~~~~~~~~~~~~~~~~~~
* Pruning has never happened, or we have trimmed past its previous
intervals::
invariant: first_committed > 1
condition: pinned.empty() AND !store.exists(manifest)
* Pruning has happened at least once::
invariant: first_committed > 0
invariant: !pinned.empty()
invariant: pinned.first() == first_committed
invariant: pinned.last() < last_committed
precond: pinned.last() < prune_to AND
pinned.last() + prune_interval < prune_to
postcond: pinned.size() > old_pinned.size() AND
(for each v in [pinned.first()..pinned.last()]:
if pinned.count(v) > 0: store.exists_full(v)
else: !store.exists_full(v)
)
* Pruning has finished::
invariant: first_committed > 0
invariant: !pinned.empty()
invariant: pinned.first() == first_committed
invariant: pinned.last() < last_committed
condition: pinned.last() == prune_to OR
pinned.last() + prune_interval < prune_to
* Pruning intervals can be trimmed::
precond: OSDMonitor::get_trim_to() > 0
condition: !pinned.empty()
invariant: pinned.first() == first_committed
invariant: pinned.last() < last_committed
invariant: pinned.first() <= OSDMonitor::get_trim_to()
invariant: pinned.last() >= OSDMonitor::get_trim_to()
* Trim pruned intervals::
invariant: !pinned.empty()
invariant: pinned.first() == first_committed
invariant: pinned.last() < last_committed
invariant: pinned.first() <= OSDMonitor::get_trim_to()
invariant: pinned.last() >= OSDMonitor::get_trim_to()
postcond: pinned.empty() OR
(pinned.first() == OSDMonitor::get_trim_to() AND
pinned.last() > pinned.first() AND
(for each v in [0..pinned.first()]:
!store.exists(v) AND
!store.exists_full(v)
) AND
(for each m in [pinned.first()..pinned.last()]:
if pinned.count(m) > 0: store.exists_full(m)
else: !store.exists_full(m) AND store.exists(m)
)
)
postcond: !pinned.empty() OR
(!store.exists(manifest) AND
(for each v in [pinned.first()..pinned.last()]:
!store.exists(v) AND
!store.exists_full(v)
)
)

View File

@ -44,10 +44,14 @@ else
COREPATTERN="core.%e.%p.%t"
fi
function finish() {
function cleanup() {
if [ -n "$precore" ]; then
sudo sysctl -w ${KERNCORE}=${precore}
fi
}
function finish() {
cleanup
exit 0
}
@ -55,6 +59,10 @@ trap finish TERM HUP INT
PATH=$(pwd)/bin:$PATH
# add /sbin and /usr/sbin to PATH to find sysctl in those cases where the
# user's PATH does not get these directories by default (e.g., tumbleweed)
PATH=$PATH:/sbin:/usr/sbin
# TODO: use getopts
dryrun=false
if [[ "$1" = "--dry-run" ]]; then
@ -75,6 +83,11 @@ count=0
errors=0
userargs=""
precore="$(sysctl -n $KERNCORE)"
if [[ "${precore:0:1}" = "|" ]]; then
precore="${precore:1}"
fi
# If corepattern already set, avoid having to use sudo
if [ "$precore" = "$COREPATTERN" ]; then
precore=""
@ -130,9 +143,7 @@ do
fi
fi
done
if [ -n "$precore" ]; then
sudo sysctl -w ${KERNCORE}=${precore}
fi
cleanup
if [ "$errors" != "0" ]; then
echo "$errors TESTS FAILED, $count TOTAL TESTS"

View File

@ -0,0 +1,62 @@
#!/bin/bash
source $CEPH_ROOT/qa/standalone/ceph-helpers.sh
base_test=$CEPH_ROOT/qa/workunits/mon/test_mon_osdmap_prune.sh
# We are going to open and close a lot of files, and generate a lot of maps
# that the osds will need to process. If we don't increase the fd ulimit, we
# risk having the osds asserting when handling filestore transactions.
ulimit -n 4096
function run() {
local dir=$1
shift
export CEPH_MON="127.0.0.1:7115"
export CEPH_ARGS
CEPH_ARGS+="--fsid=$(uuidgen) --auth-supported=none --mon-host=$CEPH_MON "
local funcs=${@:-$(set | sed -n -e 's/^\(TEST_[0-9a-z_]*\) .*/\1/p')}
for func in $funcs; do
setup $dir || return 1
$func $dir || return 1
teardown $dir || return 1
done
}
function TEST_osdmap_prune() {
local dir=$1
run_mon $dir a || return 1
run_mgr $dir x || return 1
run_osd $dir 0 || return 1
run_osd $dir 1 || return 1
run_osd $dir 2 || return 1
sleep 5
# we are getting OSD_OUT_OF_ORDER_FULL health errors, and it's not clear
# why. So, to make the health checks happy, mask those errors.
ceph osd set-full-ratio 0.97
ceph osd set-backfillfull-ratio 0.97
ceph config set osd osd_beacon_report_interval 10 || return 1
ceph config set mon mon_debug_extra_checks true || return 1
ceph config set mon mon_min_osdmap_epochs 100 || return 1
ceph config set mon mon_osdmap_full_prune_enabled true || return 1
ceph config set mon mon_osdmap_full_prune_min 200 || return 1
ceph config set mon mon_osdmap_full_prune_interval 10 || return 1
ceph config set mon mon_osdmap_full_prune_txsize 100 || return 1
bash -x $base_test || return 1
return 0
}
main mon-osdmap-prune "$@"

View File

@ -1,3 +1,13 @@
overrides:
ceph:
conf:
mon:
mon min osdmap epochs: 50
paxos service trim min: 10
# prune full osdmaps regularly
mon osdmap full prune min: 15
mon osdmap full prune interval: 2
mon osdmap full prune txsize: 2
tasks:
- install:
- ceph:

View File

@ -4,6 +4,10 @@ overrides:
mon:
mon min osdmap epochs: 25
paxos service trim min: 5
# prune full osdmaps regularly
mon osdmap full prune min: 15
mon osdmap full prune interval: 2
mon osdmap full prune txsize: 2
# thrashing monitors may make mgr have trouble w/ its keepalive
log-whitelist:
- daemon x is unresponsive

View File

@ -0,0 +1,22 @@
overrides:
ceph:
conf:
mon:
mon debug extra checks: true
mon min osdmap epochs: 100
mon osdmap full prune enabled: true
mon osdmap full prune min: 200
mon osdmap full prune interval: 10
mon osdmap full prune txsize: 100
osd:
osd beacon report interval: 10
log-whitelist:
# setting/unsetting noup will trigger health warns,
# causing tests to fail due to health warns, even if
# the tests themselves are successful.
- \(OSDMAP_FLAGS\)
tasks:
- workunit:
clients:
client.0:
- mon/test_mon_osdmap_prune.sh

View File

@ -10,6 +10,13 @@ overrides:
osd scrub max interval: 120
osd max backfills: 3
osd snap trim sleep: 2
mon:
mon min osdmap epochs: 50
paxos service trim min: 10
# prune full osdmaps regularly
mon osdmap full prune min: 15
mon osdmap full prune interval: 2
mon osdmap full prune txsize: 2
tasks:
- thrashosds:
timeout: 1200

View File

@ -6,7 +6,12 @@ overrides:
- osd_map_cache_size
conf:
mon:
mon min osdmap epochs: 2
mon min osdmap epochs: 50
paxos service trim min: 10
# prune full osdmaps regularly
mon osdmap full prune min: 15
mon osdmap full prune interval: 2
mon osdmap full prune txsize: 2
osd:
osd map cache size: 1
osd scrub min interval: 60

View File

@ -10,6 +10,13 @@ overrides:
filestore odsync write: true
osd max backfills: 2
osd snap trim sleep: .5
mon:
mon min osdmap epochs: 50
paxos service trim min: 10
# prune full osdmaps regularly
mon osdmap full prune min: 15
mon osdmap full prune interval: 2
mon osdmap full prune txsize: 2
tasks:
- thrashosds:
timeout: 1200

View File

@ -1,3 +1,13 @@
overrides:
ceph:
conf:
mon:
mon min osdmap epochs: 50
paxos service trim min: 10
# prune full osdmaps regularly
mon osdmap full prune min: 15
mon osdmap full prune interval: 2
mon osdmap full prune txsize: 2
tasks:
- install:
- ceph:

View File

@ -0,0 +1,205 @@
#!/bin/bash
. $(dirname $0)/../../standalone/ceph-helpers.sh
set -x
function wait_for_osdmap_manifest() {
local what=${1:-"true"}
local -a delays=($(get_timeout_delays $TIMEOUT .1))
local -i loop=0
for ((i=0; i < ${#delays[*]}; ++i)); do
has_manifest=$(ceph report | jq 'has("osdmap_manifest")')
if [[ "$has_manifest" == "$what" ]]; then
return 0
fi
sleep ${delays[$i]}
done
echo "osdmap_manifest never outputted on report"
ceph report
return 1
}
function wait_for_trim() {
local -i epoch=$1
local -a delays=($(get_timeout_delays $TIMEOUT .1))
local -i loop=0
for ((i=0; i < ${#delays[*]}; ++i)); do
fc=$(ceph report | jq '.osdmap_first_committed')
if [[ $fc -eq $epoch ]]; then
return 0
fi
sleep ${delays[$i]}
done
echo "never trimmed up to epoch $epoch"
ceph report
return 1
}
function test_osdmap() {
local epoch=$1
local ret=0
tmp_map=$(mktemp)
ceph osd getmap $epoch -o $tmp_map || return 1
if ! osdmaptool --print $tmp_map | grep "epoch $epoch" ; then
echo "ERROR: failed processing osdmap epoch $epoch"
ret=1
fi
rm $tmp_map
return $ret
}
function generate_osdmaps() {
local -i num=$1
cmds=( set unset )
for ((i=0; i < num; ++i)); do
ceph osd ${cmds[$((i%2))]} noup || return 1
done
return 0
}
function test_mon_osdmap_prune() {
create_pool foo 32
wait_for_clean || return 1
ceph config set mon mon_debug_block_osdmap_trim true || return 1
generate_osdmaps 500 || return 1
report="$(ceph report)"
fc=$(jq '.osdmap_first_committed' <<< $report)
lc=$(jq '.osdmap_last_committed' <<< $report)
[[ $((lc-fc)) -ge 500 ]] || return 1
wait_for_osdmap_manifest || return 1
manifest="$(ceph report | jq '.osdmap_manifest')"
first_pinned=$(jq '.first_pinned' <<< $manifest)
last_pinned=$(jq '.last_pinned' <<< $manifest)
pinned_maps=( $(jq '.pinned_maps[]' <<< $manifest) )
# validate pinned maps list
[[ $first_pinned -eq ${pinned_maps[0]} ]] || return 1
[[ $last_pinned -eq ${pinned_maps[-1]} ]] || return 1
# validate pinned maps range
[[ $first_pinned -lt $last_pinned ]] || return 1
[[ $last_pinned -lt $lc ]] || return 1
[[ $first_pinned -eq $fc ]] || return 1
# ensure all the maps are available, and work as expected
# this can take a while...
for ((i=$first_pinned; i <= $last_pinned; ++i)); do
test_osdmap $i || return 1
done
# update pinned maps state:
# the monitor may have pruned & pinned additional maps since we last
# assessed state, given it's an iterative process.
#
manifest="$(ceph report | jq '.osdmap_manifest')"
first_pinned=$(jq '.first_pinned' <<< $manifest)
last_pinned=$(jq '.last_pinned' <<< $manifest)
pinned_maps=( $(jq '.pinned_maps[]' <<< $manifest) )
# test trimming maps
#
# we're going to perform the following tests:
#
# 1. force trim to a pinned map
# 2. force trim to a pinned map's previous epoch
# 3. trim all maps except the last 200 or so.
#
# 1. force trim to a pinned map
#
[[ ${#pinned_maps[@]} -gt 10 ]] || return 1
trim_to=${pinned_maps[1]}
ceph config set mon mon_osd_force_trim_to $trim_to
ceph config set mon mon_min_osdmap_epochs 100
ceph config set mon paxos_service_trim_min 1
ceph config set mon mon_debug_block_osdmap_trim false
# generate an epoch so we get to trim maps
ceph osd set noup
ceph osd unset noup
wait_for_trim $trim_to || return 1
report="$(ceph report)"
fc=$(jq '.osdmap_first_committed' <<< $report)
[[ $fc -eq $trim_to ]] || return 1
old_first_pinned=$first_pinned
old_last_pinned=$last_pinned
first_pinned=$(jq '.osdmap_manifest.first_pinned' <<< $report)
last_pinned=$(jq '.osdmap_manifest.last_pinned' <<< $report)
[[ $first_pinned -eq $trim_to ]] || return 1
[[ $first_pinned -gt $old_first_pinned ]] || return 1
[[ $last_pinned -gt $old_first_pinned ]] || return 1
test_osdmap $trim_to || return 1
test_osdmap $(( trim_to+1 )) || return 1
pinned_maps=( $(jq '.osdmap_manifest.pinned_maps[]' <<< $report) )
# 2. force trim to a pinned map's previous epoch
#
[[ ${#pinned_maps[@]} -gt 2 ]] || return 1
trim_to=$(( ${pinned_maps[1]} - 1))
ceph config set mon mon_osd_force_trim_to $trim_to
# generate an epoch so we get to trim maps
ceph osd set noup
ceph osd unset noup
wait_for_trim $trim_to || return 1
report="$(ceph report)"
fc=$(jq '.osdmap_first_committed' <<< $report)
[[ $fc -eq $trim_to ]] || return 1
old_first_pinned=$first_pinned
old_last_pinned=$last_pinned
first_pinned=$(jq '.osdmap_manifest.first_pinned' <<< $report)
last_pinned=$(jq '.osdmap_manifest.last_pinned' <<< $report)
pinned_maps=( $(jq '.osdmap_manifest.pinned_maps[]' <<< $report) )
[[ $first_pinned -eq $trim_to ]] || return 1
[[ ${pinned_maps[1]} -eq $(( trim_to+1)) ]] || return 1
test_osdmap $first_pinned || return 1
test_osdmap $(( first_pinned + 1 )) || return 1
# 3. trim everything
#
ceph config set mon mon_osd_force_trim_to 0
# generate an epoch so we get to trim maps
ceph osd set noup
ceph osd unset noup
wait_for_osdmap_manifest "false" || return 1
return 0
}
test_mon_osdmap_prune || exit 1
echo "OK"

View File

@ -1152,6 +1152,36 @@ std::vector<Option> get_global_options() {
.set_default(true)
.set_description(""),
/* -- mon: osdmap prune (begin) -- */
Option("mon_osdmap_full_prune_enabled", Option::TYPE_BOOL, Option::LEVEL_ADVANCED)
.set_default(true)
.set_description("Enables pruning full osdmap versions when we go over a given number of maps")
.add_see_also("mon_osdmap_full_prune_min")
.add_see_also("mon_osdmap_full_prune_interval")
.add_see_also("mon_osdmap_full_prune_txsize"),
Option("mon_osdmap_full_prune_min", Option::TYPE_UINT, Option::LEVEL_ADVANCED)
.set_default(10000)
.set_description("Minimum number of versions in the store to trigger full map pruning")
.add_see_also("mon_osdmap_full_prune_enabled")
.add_see_also("mon_osdmap_full_prune_interval")
.add_see_also("mon_osdmap_full_prune_txsize"),
Option("mon_osdmap_full_prune_interval", Option::TYPE_UINT, Option::LEVEL_ADVANCED)
.set_default(10)
.set_description("Interval between maps that will not be pruned; maps in the middle will be pruned.")
.add_see_also("mon_osdmap_full_prune_enabled")
.add_see_also("mon_osdmap_full_prune_interval")
.add_see_also("mon_osdmap_full_prune_txsize"),
Option("mon_osdmap_full_prune_txsize", Option::TYPE_UINT, Option::LEVEL_ADVANCED)
.set_default(100)
.set_description("Number of maps we will prune per iteration")
.add_see_also("mon_osdmap_full_prune_enabled")
.add_see_also("mon_osdmap_full_prune_interval")
.add_see_also("mon_osdmap_full_prune_txsize"),
/* -- mon: osdmap prune (end) -- */
Option("mon_osd_cache_size", Option::TYPE_INT, Option::LEVEL_ADVANCED)
.set_default(10)
.set_description(""),
@ -1606,6 +1636,22 @@ std::vector<Option> get_global_options() {
.set_default(false)
.set_description(""),
Option("mon_debug_extra_checks", Option::TYPE_BOOL, Option::LEVEL_DEV)
.set_default(false)
.set_description("Enable some additional monitor checks")
.set_long_description(
"Enable some additional monitor checks that would be too expensive "
"to run on production systems, or would only be relevant while "
"testing or debugging."),
Option("mon_debug_block_osdmap_trim", Option::TYPE_BOOL, Option::LEVEL_DEV)
.set_default(false)
.set_description("Block OSDMap trimming while the option is enabled.")
.set_long_description(
"Blocking OSDMap trimming may be quite helpful to easily reproduce "
"states in which the monitor keeps (hundreds of) thousands of "
"osdmaps."),
Option("mon_debug_deprecated_as_obsolete", Option::TYPE_BOOL, Option::LEVEL_DEV)
.set_default(false)
.set_description(""),

View File

@ -485,6 +485,14 @@ const char** Monitor::get_tracked_conf_keys() const
// scrub interval
"mon_scrub_interval",
"mon_allow_pool_delete",
// osdmap pruning - observed, not handled.
"mon_osdmap_full_prune_enabled",
"mon_osdmap_full_prune_min",
"mon_osdmap_full_prune_interval",
"mon_osdmap_full_prune_txsize",
// debug options - observed, not handled
"mon_debug_extra_checks",
"mon_debug_block_osdmap_trim",
NULL
};
return KEYS;

View File

@ -188,6 +188,7 @@ OSDMonitor::OSDMonitor(
cct(cct),
inc_osd_cache(g_conf->mon_osd_cache_size),
full_osd_cache(g_conf->mon_osd_cache_size),
has_osdmap_manifest(false),
last_attempted_minwait_time(utime_t()),
mapper(mn->cct, &mn->cpu_tp)
{}
@ -276,6 +277,11 @@ void OSDMonitor::get_store_prefixes(std::set<string>& s) const
void OSDMonitor::update_from_paxos(bool *need_bootstrap)
{
// we really don't care if the version has been updated, because we may
// have trimmed without having increased the last committed; yet, we may
// need to update the in-memory manifest.
load_osdmap_manifest();
version_t version = get_last_committed();
if (version == osdmap.epoch)
return;
@ -903,6 +909,11 @@ void OSDMonitor::encode_pending(MonitorDBStore::TransactionRef t)
dout(10) << "encode_pending e " << pending_inc.epoch
<< dendl;
if (do_prune(t)) {
dout(1) << __func__ << " osdmap full prune encoded e"
<< pending_inc.epoch << dendl;
}
// finalize up pending_inc
pending_inc.modified = ceph_clock_now();
@ -1499,6 +1510,15 @@ version_t OSDMonitor::get_trim_to() const
return 0;
}
}
if (g_conf->get_val<bool>("mon_debug_block_osdmap_trim")) {
dout(0) << __func__
<< " blocking osdmap trim"
" ('mon_debug_block_osdmap_trim' set to 'true')"
<< dendl;
return 0;
}
{
epoch_t floor = get_min_last_epoch_clean();
dout(10) << " min_last_epoch_clean " << floor << dendl;
@ -1540,8 +1560,368 @@ void OSDMonitor::encode_trim_extra(MonitorDBStore::TransactionRef tx,
bufferlist bl;
get_version_full(first, bl);
put_version_full(tx, first, bl);
if (has_osdmap_manifest &&
first > osdmap_manifest.get_first_pinned()) {
_prune_update_trimmed(tx, first);
}
}
/* full osdmap prune
*
* for more information, please refer to doc/dev/mon-osdmap-prune.rst
*/
void OSDMonitor::load_osdmap_manifest()
{
bool store_has_manifest =
mon->store->exists(get_service_name(), "osdmap_manifest");
if (!store_has_manifest) {
if (!has_osdmap_manifest) {
return;
}
dout(20) << __func__
<< " dropping osdmap manifest from memory." << dendl;
osdmap_manifest = osdmap_manifest_t();
has_osdmap_manifest = false;
return;
}
dout(20) << __func__
<< " osdmap manifest detected in store; reload." << dendl;
bufferlist manifest_bl;
int r = get_value("osdmap_manifest", manifest_bl);
if (r < 0) {
derr << __func__ << " unable to read osdmap version manifest" << dendl;
ceph_assert(0 == "error reading manifest");
}
osdmap_manifest.decode(manifest_bl);
has_osdmap_manifest = true;
dout(10) << __func__ << " store osdmap manifest pinned ("
<< osdmap_manifest.get_first_pinned()
<< " .. "
<< osdmap_manifest.get_last_pinned()
<< ")"
<< dendl;
}
bool OSDMonitor::should_prune() const
{
version_t first = get_first_committed();
version_t last = get_last_committed();
version_t min_osdmap_epochs =
g_conf->get_val<int64_t>("mon_min_osdmap_epochs");
version_t prune_min =
g_conf->get_val<uint64_t>("mon_osdmap_full_prune_min");
version_t prune_interval =
g_conf->get_val<uint64_t>("mon_osdmap_full_prune_interval");
version_t last_pinned = osdmap_manifest.get_last_pinned();
version_t last_to_pin = last - min_osdmap_epochs;
// Make it or break it constraints.
//
// If any of these conditions fails, we will not prune, regardless of
// whether we have an on-disk manifest with an on-going pruning state.
//
if ((last - first) <= min_osdmap_epochs) {
// between the first and last committed epochs, we don't have
// enough epochs to trim, much less to prune.
dout(10) << __func__
<< " currently holding only " << (last - first)
<< " epochs (min osdmap epochs: " << min_osdmap_epochs
<< "); do not prune."
<< dendl;
return false;
} else if ((last_to_pin - first) < prune_min) {
// between the first committed epoch and the last epoch we would prune,
// we simply don't have enough versions over the minimum to prune maps.
dout(10) << __func__
<< " could only prune " << (last_to_pin - first)
<< " epochs (" << first << ".." << last_to_pin << "), which"
" is less than the required minimum (" << prune_min << ")"
<< dendl;
return false;
} else if (has_osdmap_manifest && last_pinned >= last_to_pin) {
dout(10) << __func__
<< " we have pruned as far as we can; do not prune."
<< dendl;
return false;
} else if (last_pinned + prune_interval > last_to_pin) {
dout(10) << __func__
<< " not enough epochs to form an interval (last pinned: "
<< last_pinned << ", last to pin: "
<< last_to_pin << ", interval: " << prune_interval << ")"
<< dendl;
return false;
}
dout(15) << __func__
<< " should prune (" << last_pinned << ".." << last_to_pin << ")"
<< " lc (" << first << ".." << last << ")"
<< dendl;
return true;
}
void OSDMonitor::_prune_update_trimmed(
MonitorDBStore::TransactionRef tx,
version_t first)
{
dout(10) << __func__
<< " first " << first
<< " last_pinned " << osdmap_manifest.get_last_pinned()
<< " last_pinned " << osdmap_manifest.get_last_pinned()
<< dendl;
if (!osdmap_manifest.is_pinned(first)) {
osdmap_manifest.pin(first);
}
set<version_t>::iterator p_end = osdmap_manifest.pinned.find(first);
set<version_t>::iterator p = osdmap_manifest.pinned.begin();
osdmap_manifest.pinned.erase(p, p_end);
ceph_assert(osdmap_manifest.get_first_pinned() == first);
if (osdmap_manifest.get_last_pinned() == first+1 ||
osdmap_manifest.pinned.size() == 1) {
// we reached the end of the line, as pinned maps go; clean up our
// manifest, and let `should_prune()` decide whether we should prune
// again.
tx->erase(get_service_name(), "osdmap_manifest");
return;
}
bufferlist bl;
osdmap_manifest.encode(bl);
tx->put(get_service_name(), "osdmap_manifest", bl);
}
void OSDMonitor::prune_init()
{
dout(1) << __func__ << dendl;
version_t pin_first;
if (!has_osdmap_manifest) {
// we must have never pruned, OR if we pruned the state must no longer
// be relevant (i.e., the state must have been removed alongside with
// the trim that *must* have removed past the last pinned map in a
// previous prune).
ceph_assert(osdmap_manifest.pinned.empty());
ceph_assert(!mon->store->exists(get_service_name(), "osdmap_manifest"));
pin_first = get_first_committed();
} else {
// we must have pruned in the past AND its state is still relevant
// (i.e., even if we trimmed, we still hold pinned maps in the manifest,
// and thus we still hold a manifest in the store).
ceph_assert(!osdmap_manifest.pinned.empty());
ceph_assert(osdmap_manifest.get_first_pinned() == get_first_committed());
ceph_assert(osdmap_manifest.get_last_pinned() < get_last_committed());
dout(10) << __func__
<< " first_pinned " << osdmap_manifest.get_first_pinned()
<< " last_pinned " << osdmap_manifest.get_last_pinned()
<< dendl;
pin_first = osdmap_manifest.get_last_pinned();
}
osdmap_manifest.pin(pin_first);
}
bool OSDMonitor::_prune_sanitize_options() const
{
uint64_t prune_interval =
g_conf->get_val<uint64_t>("mon_osdmap_full_prune_interval");
uint64_t prune_min =
g_conf->get_val<uint64_t>("mon_osdmap_full_prune_min");
uint64_t txsize =
g_conf->get_val<uint64_t>("mon_osdmap_full_prune_txsize");
bool r = true;
if (prune_interval == 0) {
derr << __func__
<< " prune is enabled BUT prune interval is zero; abort."
<< dendl;
r = false;
} else if (prune_interval == 1) {
derr << __func__
<< " prune interval is equal to one, which essentially means"
" no pruning; abort."
<< dendl;
r = false;
}
if (prune_min == 0) {
derr << __func__
<< " prune is enabled BUT prune min is zero; abort."
<< dendl;
r = false;
}
if (prune_interval > prune_min) {
derr << __func__
<< " impossible to ascertain proper prune interval because"
<< " it is greater than the minimum prune epochs"
<< " (min: " << prune_min << ", interval: " << prune_interval << ")"
<< dendl;
r = false;
}
if (txsize <= prune_interval) {
derr << __func__
<< "'mon_osdmap_full_prune_txsize' (" << txsize
<< ") <= 'mon_osdmap_full_prune_interval' (" << prune_interval
<< "); abort." << dendl;
r = false;
}
return r;
}
bool OSDMonitor::is_prune_enabled() const {
return g_conf->get_val<bool>("mon_osdmap_full_prune_enabled");
}
bool OSDMonitor::is_prune_supported() const {
return mon->get_required_mon_features().contains_any(
ceph::features::mon::FEATURE_OSDMAP_PRUNE);
}
/** do_prune
*
* @returns true if has side-effects; false otherwise.
*/
bool OSDMonitor::do_prune(MonitorDBStore::TransactionRef tx)
{
bool enabled = is_prune_enabled();
dout(1) << __func__ << " osdmap full prune "
<< ( enabled ? "enabled" : "disabled")
<< dendl;
if (!enabled || !_prune_sanitize_options() || !should_prune()) {
return false;
}
// we are beyond the minimum prune versions, we need to remove maps because
// otherwise the store will grow unbounded and we may end up having issues
// with available disk space or store hangs.
// we will not pin all versions; we will leave a buffer number of versions.
// this allows the monitor to trim maps without caring too much about
// pinned maps, and later allows us to use another ceph-mon without these
// capabilities, without having to repair the store.
version_t first = get_first_committed();
version_t last = get_last_committed();
version_t last_to_pin = last - g_conf->mon_min_osdmap_epochs;
version_t last_pinned = osdmap_manifest.get_last_pinned();
uint64_t prune_interval =
g_conf->get_val<uint64_t>("mon_osdmap_full_prune_interval");
uint64_t txsize =
g_conf->get_val<uint64_t>("mon_osdmap_full_prune_txsize");
prune_init();
// we need to get rid of some osdmaps
dout(5) << __func__
<< " lc (" << first << " .. " << last << ")"
<< " last_pinned " << last_pinned
<< " interval " << prune_interval
<< " last_to_pin " << last_to_pin
<< dendl;
// We will be erasing maps as we go.
//
// We will erase all maps between `last_pinned` and the `next_to_pin`.
//
// If `next_to_pin` happens to be greater than `last_to_pin`, then
// we stop pruning. We could prune the maps between `next_to_pin` and
// `last_to_pin`, but by not doing it we end up with neater pruned
// intervals, aligned with `prune_interval`. Besides, this should not be a
// problem as long as `prune_interval` is set to a sane value, instead of
// hundreds or thousands of maps.
auto map_exists = [this](version_t v) {
string k = mon->store->combine_strings("full", v);
return mon->store->exists(get_service_name(), k);
};
// 'interval' represents the number of maps from the last pinned
// i.e., if we pinned version 1 and have an interval of 10, we're pinning
// version 11 next; all intermediate versions will be removed.
//
// 'txsize' represents the maximum number of versions we'll be removing in
// this iteration. If 'txsize' is large enough to perform multiple passes
// pinning and removing maps, we will do so; if not, we'll do at least one
// pass. We are quite relaxed about honouring 'txsize', but we'll always
// ensure that we never go *over* the maximum.
// e.g., if we pin 1 and 11, we're removing versions [2..10]; i.e., 9 maps.
uint64_t removal_interval = prune_interval - 1;
if (txsize < removal_interval) {
dout(5) << __func__
<< " setting txsize to removal interval size ("
<< removal_interval << " versions"
<< dendl;
txsize = removal_interval;
}
ceph_assert(removal_interval > 0);
uint64_t num_pruned = 0;
while (num_pruned + removal_interval <= txsize) {
last_pinned = osdmap_manifest.get_last_pinned();
if (last_pinned + prune_interval > last_to_pin) {
break;
}
ceph_assert(last_pinned < last_to_pin);
version_t next_pinned = last_pinned + prune_interval;
ceph_assert(next_pinned <= last_to_pin);
osdmap_manifest.pin(next_pinned);
dout(20) << __func__
<< " last_pinned " << last_pinned
<< " next_pinned " << next_pinned
<< " num_pruned " << num_pruned
<< " removal interval (" << (last_pinned+1)
<< ".." << (next_pinned-1) << ")"
<< " txsize " << txsize << dendl;
ceph_assert(map_exists(last_pinned));
ceph_assert(map_exists(next_pinned));
for (version_t v = last_pinned+1; v < next_pinned; ++v) {
ceph_assert(!osdmap_manifest.is_pinned(v));
dout(20) << __func__ << " pruning full osdmap e" << v << dendl;
string full_key = mon->store->combine_strings("full", v);
tx->erase(get_service_name(), full_key);
++num_pruned;
}
}
ceph_assert(num_pruned > 0);
bufferlist bl;
osdmap_manifest.encode(bl);
tx->put(get_service_name(), "osdmap_manifest", bl);
return true;
}
// -------------
bool OSDMonitor::preprocess_query(MonOpRequestRef op)
@ -3125,16 +3505,138 @@ int OSDMonitor::get_version(version_t ver, bufferlist& bl)
return ret;
}
int OSDMonitor::get_inc(version_t ver, OSDMap::Incremental& inc)
{
bufferlist inc_bl;
int err = get_version(ver, inc_bl);
ceph_assert(err == 0);
ceph_assert(inc_bl.length());
bufferlist::iterator p = inc_bl.begin();
inc.decode(p);
dout(10) << __func__ << " "
<< " epoch " << inc.epoch
<< " inc_crc " << inc.inc_crc
<< " full_crc " << inc.full_crc
<< " encode_features " << inc.encode_features << dendl;
return 0;
}
int OSDMonitor::get_full_from_pinned_map(version_t ver, bufferlist& bl)
{
dout(10) << __func__ << " ver " << ver << dendl;
version_t closest_pinned = osdmap_manifest.get_lower_closest_pinned(ver);
if (closest_pinned == 0) {
return -ENOENT;
}
if (closest_pinned > ver) {
dout(0) << __func__ << " pinned: " << osdmap_manifest.pinned << dendl;
}
ceph_assert(closest_pinned <= ver);
dout(10) << __func__ << " closest pinned ver " << closest_pinned << dendl;
// get osdmap incremental maps and apply on top of this one.
bufferlist osdm_bl;
bool has_cached_osdmap = false;
for (version_t v = ver-1; v >= closest_pinned; --v) {
if (full_osd_cache.lookup(v, &osdm_bl)) {
dout(10) << __func__ << " found map in cache ver " << v << dendl;
closest_pinned = v;
has_cached_osdmap = true;
break;
}
}
if (!has_cached_osdmap) {
int err = PaxosService::get_version_full(closest_pinned, osdm_bl);
if (err != 0) {
derr << __func__ << " closest pinned map ver " << closest_pinned
<< " not available! error: " << cpp_strerror(err) << dendl;
}
ceph_assert(err == 0);
}
ceph_assert(osdm_bl.length());
OSDMap osdm;
osdm.decode(osdm_bl);
dout(10) << __func__ << " loaded osdmap epoch " << closest_pinned
<< " e" << osdm.epoch
<< " crc " << osdm.get_crc()
<< " -- applying incremental maps." << dendl;
uint64_t encode_features = 0;
for (version_t v = closest_pinned + 1; v <= ver; ++v) {
dout(20) << __func__ << " applying inc epoch " << v << dendl;
OSDMap::Incremental inc;
int err = get_inc(v, inc);
ceph_assert(err == 0);
encode_features = inc.encode_features;
err = osdm.apply_incremental(inc);
ceph_assert(err == 0);
// this block performs paranoid checks on map retrieval
if (g_conf->get_val<bool>("mon_debug_extra_checks") &&
inc.full_crc != 0) {
uint64_t f = encode_features;
if (!f) {
f = (mon->quorum_con_features ? mon->quorum_con_features : -1);
}
// encode osdmap to force calculating crcs
bufferlist tbl;
osdm.encode(tbl, f | CEPH_FEATURE_RESERVED);
// decode osdmap to compare crcs with what's expected by incremental
OSDMap tosdm;
tosdm.decode(tbl);
if (tosdm.get_crc() != inc.full_crc) {
derr << __func__
<< " osdmap crc mismatch! (osdmap crc " << tosdm.get_crc()
<< ", expected " << inc.full_crc << ")" << dendl;
ceph_assert(0 == "osdmap crc mismatch");
}
}
// note: we cannot add the recently computed map to the cache, as is,
// because we have not encoded the map into a bl.
}
if (!encode_features) {
dout(10) << __func__
<< " last incremental map didn't have features;"
<< " defaulting to quorum's or all" << dendl;
encode_features =
(mon->quorum_con_features ? mon->quorum_con_features : -1);
}
osdm.encode(bl, encode_features | CEPH_FEATURE_RESERVED);
return 0;
}
int OSDMonitor::get_version_full(version_t ver, bufferlist& bl)
{
if (full_osd_cache.lookup(ver, &bl)) {
return 0;
}
int ret = PaxosService::get_version_full(ver, bl);
if (!ret) {
full_osd_cache.add(ver, bl);
if (ret == -ENOENT) {
// build map?
ret = get_full_from_pinned_map(ver, bl);
}
return ret;
if (ret != 0) {
return ret;
}
full_osd_cache.add(ver, bl);
return 0;
}
epoch_t OSDMonitor::blacklist(const entity_addr_t& a, utime_t until)
@ -3380,6 +3882,9 @@ void OSDMonitor::tick()
dout(10) << osdmap << dendl;
// always update osdmap manifest, regardless of being the leader.
load_osdmap_manifest();
if (!mon->is_leader()) return;
bool do_propose = false;
@ -3390,8 +3895,16 @@ void OSDMonitor::tick()
}
// mark osds down?
if (check_failures(now))
if (check_failures(now)) {
do_propose = true;
}
// Force a proposal if we need to prune; pruning is performed on
// ``encode_pending()``, hence why we need to regularly trigger a proposal
// even if there's nothing going on.
if (is_prune_enabled() && should_prune()) {
do_propose = true;
}
// mark down osds out?
@ -3565,6 +4078,12 @@ void OSDMonitor::dump_info(Formatter *f)
f->open_object_section("crushmap");
osdmap.crush->dump(f);
f->close_section();
if (has_osdmap_manifest) {
f->open_object_section("osdmap_manifest");
osdmap_manifest.dump(f);
f->close_section();
}
}
namespace {

View File

@ -25,6 +25,7 @@
#include <set>
#include "include/types.h"
#include "include/encoding.h"
#include "common/simple_cache.hpp"
#include "msg/Messenger.h"
@ -124,6 +125,85 @@ public:
};
struct osdmap_manifest_t {
// all the maps we have pinned -- i.e., won't be removed unless
// they are inside a trim interval.
set<version_t> pinned;
osdmap_manifest_t() {}
version_t get_last_pinned() const
{
set<version_t>::const_reverse_iterator it = pinned.crbegin();
if (it == pinned.crend()) {
return 0;
}
return *it;
}
version_t get_first_pinned() const
{
set<version_t>::const_iterator it = pinned.cbegin();
if (it == pinned.cend()) {
return 0;
}
return *it;
}
bool is_pinned(version_t v) const
{
return pinned.find(v) != pinned.end();
}
void pin(version_t v)
{
pinned.insert(v);
}
version_t get_lower_closest_pinned(version_t v) const {
set<version_t>::const_iterator p = pinned.lower_bound(v);
if (p == pinned.cend()) {
return 0;
} else if (*p > v) {
if (p == pinned.cbegin()) {
return 0;
}
--p;
}
return *p;
}
void encode(bufferlist& bl) const
{
ENCODE_START(1, 1, bl);
encode(pinned, bl);
ENCODE_FINISH(bl);
}
void decode(bufferlist::iterator& bl)
{
DECODE_START(1, bl);
decode(pinned, bl);
DECODE_FINISH(bl);
}
void decode(bufferlist& bl) {
bufferlist::iterator p = bl.begin();
decode(p);
}
void dump(Formatter *f) {
f->dump_unsigned("first_pinned", get_first_pinned());
f->dump_unsigned("last_pinned", get_last_pinned());
f->open_array_section("pinned_maps");
for (auto& i : pinned) {
f->dump_unsigned("epoch", i);
}
f->close_section();
}
};
WRITE_CLASS_ENCODER(osdmap_manifest_t);
class OSDMonitor : public PaxosService {
CephContext *cct;
@ -142,6 +222,9 @@ public:
SimpleLRU<version_t, bufferlist> inc_osd_cache;
SimpleLRU<version_t, bufferlist> full_osd_cache;
bool has_osdmap_manifest;
osdmap_manifest_t osdmap_manifest;
bool check_failures(utime_t now);
bool check_failure(utime_t now, int target_osd, failure_info_t& fi);
void force_failure(int target_osd, int by);
@ -160,7 +243,7 @@ public:
};
// svc
public:
public:
void create_initial() override;
void get_store_prefixes(std::set<string>& s) const override;
@ -171,6 +254,19 @@ private:
void on_active() override;
void on_restart() override;
void on_shutdown() override;
/* osdmap full map prune */
void load_osdmap_manifest();
bool should_prune() const;
void _prune_update_trimmed(
MonitorDBStore::TransactionRef tx,
version_t first);
void prune_init();
bool _prune_sanitize_options() const;
bool is_prune_enabled() const;
bool is_prune_supported() const;
bool do_prune(MonitorDBStore::TransactionRef tx);
/**
* we haven't delegated full version stashing to paxosservice for some time
* now, making this function useless in current context.
@ -542,6 +638,8 @@ public:
int get_version(version_t ver, bufferlist& bl) override;
int get_version_full(version_t ver, bufferlist& bl) override;
int get_inc(version_t ver, OSDMap::Incremental& inc);
int get_full_from_pinned_map(version_t ver, bufferlist& bl);
epoch_t blacklist(const entity_addr_t& a, utime_t until);

View File

@ -434,7 +434,6 @@ public:
}
void load_health();
private:
/**
* @defgroup PaxosService_h_store_keys Set of keys that are usually used on
* all the services implementing this
@ -451,6 +450,7 @@ public:
* @}
*/
private:
/**
* @defgroup PaxosService_h_version_cache Variables holding cached values
* for the most used versions (first

View File

@ -493,6 +493,7 @@ namespace ceph {
constexpr mon_feature_t FEATURE_KRAKEN( (1ULL << 0));
constexpr mon_feature_t FEATURE_LUMINOUS( (1ULL << 1));
constexpr mon_feature_t FEATURE_MIMIC( (1ULL << 2));
constexpr mon_feature_t FEATURE_OSDMAP_PRUNE( (1ULL << 3));
constexpr mon_feature_t FEATURE_RESERVED( (1ULL << 63));
constexpr mon_feature_t FEATURE_NONE( (0ULL));
@ -507,6 +508,7 @@ namespace ceph {
FEATURE_KRAKEN |
FEATURE_LUMINOUS |
FEATURE_MIMIC |
FEATURE_OSDMAP_PRUNE |
FEATURE_NONE
);
}
@ -525,10 +527,18 @@ namespace ceph {
FEATURE_KRAKEN |
FEATURE_LUMINOUS |
FEATURE_MIMIC |
FEATURE_OSDMAP_PRUNE |
FEATURE_NONE
);
}
constexpr mon_feature_t get_optional() {
return (
FEATURE_OSDMAP_PRUNE |
FEATURE_NONE
);
}
static inline mon_feature_t get_feature_by_name(std::string n);
}
}
@ -543,6 +553,8 @@ static inline const char *ceph::features::mon::get_feature_name(uint64_t b) {
return "luminous";
} else if (f == FEATURE_MIMIC) {
return "mimic";
} else if (f == FEATURE_OSDMAP_PRUNE) {
return "osdmap-prune";
} else if (f == FEATURE_RESERVED) {
return "reserved";
}
@ -557,6 +569,8 @@ inline mon_feature_t ceph::features::mon::get_feature_by_name(std::string n) {
return FEATURE_LUMINOUS;
} else if (n == "mimic") {
return FEATURE_MIMIC;
} else if (n == "osdmap-prune") {
return FEATURE_OSDMAP_PRUNE;
} else if (n == "reserved") {
return FEATURE_RESERVED;
}

View File

@ -11,21 +11,21 @@
required: [none]
AVAILABLE FEATURES:
supported: [kraken(1),luminous(2),mimic(4)]
persistent: [kraken(1),luminous(2),mimic(4)]
supported: [kraken(1),luminous(2),mimic(4),osdmap-prune(8)]
persistent: [kraken(1),luminous(2),mimic(4),osdmap-prune(8)]
MONMAP FEATURES:
persistent: [none]
optional: [none]
required: [none]
AVAILABLE FEATURES:
supported: [kraken(1),luminous(2),mimic(4)]
persistent: [kraken(1),luminous(2),mimic(4)]
supported: [kraken(1),luminous(2),mimic(4),osdmap-prune(8)]
persistent: [kraken(1),luminous(2),mimic(4),osdmap-prune(8)]
monmap:persistent:[none]
monmap:optional:[none]
monmap:required:[none]
available:supported:[kraken(1),luminous(2),mimic(4)]
available:persistent:[kraken(1),luminous(2),mimic(4)]
available:supported:[kraken(1),luminous(2),mimic(4),osdmap-prune(8)]
available:persistent:[kraken(1),luminous(2),mimic(4),osdmap-prune(8)]
$ monmaptool --feature-set foo /tmp/test.monmap.1234
unknown features name 'foo' or unable to parse value: Expected option value to be integer, got 'foo'
@ -49,8 +49,8 @@
required: [kraken(1),unknown(16),unknown(32)]
AVAILABLE FEATURES:
supported: [kraken(1),luminous(2),mimic(4)]
persistent: [kraken(1),luminous(2),mimic(4)]
supported: [kraken(1),luminous(2),mimic(4),osdmap-prune(8)]
persistent: [kraken(1),luminous(2),mimic(4),osdmap-prune(8)]
$ monmaptool --feature-unset 32 --optional --feature-list /tmp/test.monmap.1234
monmaptool: monmap file /tmp/test.monmap.1234
@ -60,8 +60,8 @@
required: [kraken(1),unknown(16),unknown(32)]
AVAILABLE FEATURES:
supported: [kraken(1),luminous(2),mimic(4)]
persistent: [kraken(1),luminous(2),mimic(4)]
supported: [kraken(1),luminous(2),mimic(4),osdmap-prune(8)]
persistent: [kraken(1),luminous(2),mimic(4),osdmap-prune(8)]
monmaptool: writing epoch 0 to /tmp/test.monmap.1234 (1 monitors)
$ monmaptool --feature-unset 32 --persistent --feature-unset 16 --optional --feature-list /tmp/test.monmap.1234
@ -72,8 +72,8 @@
required: [kraken(1)]
AVAILABLE FEATURES:
supported: [kraken(1),luminous(2),mimic(4)]
persistent: [kraken(1),luminous(2),mimic(4)]
supported: [kraken(1),luminous(2),mimic(4),osdmap-prune(8)]
persistent: [kraken(1),luminous(2),mimic(4),osdmap-prune(8)]
monmaptool: writing epoch 0 to /tmp/test.monmap.1234 (1 monitors)
$ monmaptool --feature-unset kraken --feature-list /tmp/test.monmap.1234
@ -84,8 +84,8 @@
required: [none]
AVAILABLE FEATURES:
supported: [kraken(1),luminous(2),mimic(4)]
persistent: [kraken(1),luminous(2),mimic(4)]
supported: [kraken(1),luminous(2),mimic(4),osdmap-prune(8)]
persistent: [kraken(1),luminous(2),mimic(4),osdmap-prune(8)]
monmaptool: writing epoch 0 to /tmp/test.monmap.1234 (1 monitors)
$ rm /tmp/test.monmap.1234