===========================
FULL OSDMAP VERSION PRUNING
===========================

For each incremental osdmap epoch, the monitor will keep a full osdmap
epoch in the store.

While this is great when serving osdmap requests from clients, allowing
us to fulfill their request without having to recompute the full osdmap
from a myriad of incrementals, it can also become a burden once we start
keeping an unbounded number of osdmaps.

The monitors will attempt to keep a bounded number of osdmaps in the store.
This number is defined (and configurable) via ``mon_min_osdmap_epochs``, and
defaults to 500 epochs. Generally speaking, we will remove older osdmap
epochs once we go over this limit.

However, there are a few constraints to removing osdmaps. These are all
defined in ``OSDMonitor::get_trim_to()``.

In the event one of these conditions is not met, we may go over the bounds
defined by ``mon_min_osdmap_epochs``. And if the cluster does not meet the
trim criteria for some time (e.g., unclean pgs), the monitor may start
keeping a lot of osdmaps. This can start putting pressure on the underlying
key/value store, as well as on the available disk space.

One way to mitigate this problem would be to stop keeping full osdmap
epochs on disk. We would have to rebuild osdmaps on-demand, or grab them
from cache if they had been recently served. We would still have to keep
at least one osdmap, and apply all incrementals on top of either this
oldest map epoch kept in the store or a more recent map grabbed from cache.
While this would be feasible, it seems like a lot of cpu (and potentially
IO) would be going into rebuilding osdmaps.

Additionally, this would prevent the aforementioned problem going forward,
but would do nothing for stores currently in a state that would truly
benefit from not keeping osdmaps.

This brings us to full osdmap pruning.

Instead of not keeping full osdmap epochs, we are going to prune some of
them when we have too many.

Deciding whether we have too many will be dictated by a configurable option
``mon_osdmap_full_prune_min`` (default: 10000). The pruning algorithm will be
engaged once we go over this threshold.

We will not remove all ``mon_osdmap_full_prune_min`` full osdmap epochs
though. Instead, we are going to poke some holes in the sequence of full
maps. By default, we will keep one full osdmap per 10 maps since the last
map kept; i.e., if we keep epoch 1, we will also keep epoch 10 and remove
full map epochs 2 to 9. The size of this interval is configurable with
``mon_osdmap_full_prune_interval``.

Essentially, we are proposing to keep ~10% of the full maps, but we will
always honour the minimum number of osdmap epochs, as defined by
``mon_min_osdmap_epochs``, and these epochs do not count towards the
minimum number of versions required to prune. For instance, if we have
on-disk versions [1..50000], we would allow the pruning algorithm to operate
only over osdmap epochs [1..49500); but if we have on-disk versions
[1..10200], we won't be pruning because the algorithm would only operate on
versions [1..9700), and this interval contains fewer versions than the
minimum required by ``mon_osdmap_full_prune_min``.


ALGORITHM
=========

Say we have 50,000 osdmap epochs in the store, and we're using the
defaults for all configurable options.

::

    -----------------------------------------------------------
    |1|2|..|10|11|..|100|..|1000|..|10000|10001|..|49999|50000|
    -----------------------------------------------------------
     ^ first                                            last ^

We will prune when all the following constraints are met:

1. the number of versions is greater than ``mon_min_osdmap_epochs``;

2. the number of versions between ``first`` and ``prune_to`` is greater than
   or equal to ``mon_osdmap_full_prune_min``, with ``prune_to`` being equal to
   ``last`` minus ``mon_min_osdmap_epochs``.

If any of these conditions fails, we will *not* prune any maps.

Furthermore, if it is known that we have been pruning, but since then we
are no longer satisfying at least one of the above constraints, we will
not continue to prune. In essence, we only prune full osdmaps if the
number of epochs in the store so warrants it.

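As a rough illustration, the two constraints can be folded into a single
predicate. This is a minimal Python sketch and not the monitor's code: the
function name ``should_prune`` and its parameters are hypothetical, and we
assume the number of versions is ``last - first + 1`` and that
``prune_to - first`` counts the versions in ``[first..prune_to)``.

```python
def should_prune(first, last, min_epochs=500, prune_min=10000):
    """Return True if full-map pruning would be engaged (illustrative only)."""
    num_versions = last - first + 1      # assumed: epochs first..last inclusive
    if num_versions <= min_epochs:       # constraint 1
        return False
    prune_to = last - min_epochs         # the newest epochs are never pruned
    # constraint 2: versions in [first..prune_to) must reach the threshold
    return prune_to - first >= prune_min


print(should_prune(1, 50000))   # prunable window is [1..49500)
print(should_prune(1, 10200))   # prunable window is [1..9700)
```

With the defaults, the first call reports that pruning is warranted and the
second that it is not, matching the two examples given earlier.
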
As pruning will create gaps in the sequence of full maps, we need to keep
track of the intervals of missing maps. We do this by keeping a manifest of
pinned maps -- i.e., a list of maps that, by being pinned, are not to be
pruned.

While pinned maps are not removed from the store, maps between two consecutive
pinned maps will be; and the number of maps to be removed will be dictated by
the configurable option ``mon_osdmap_full_prune_interval``. The algorithm makes
an effort to keep pinned maps apart by as many maps as defined by this option,
but in the event of corner cases it may allow smaller intervals. Additionally,
as this is a configurable option that is read any time a prune iteration
occurs, there is the possibility this interval will change if the user changes
this config option.

Pinning maps is performed lazily: we will be pinning maps as we are removing
maps. This grants us more flexibility to change the prune interval while
pruning is happening, but also considerably simplifies the algorithm, as well
as the information we need to keep in the manifest. Below we show a simplified
version of the algorithm::

    manifest.pin(first)
    last_to_prune = last - mon_min_osdmap_epochs

    while manifest.get_last_pinned() + prune_interval < last_to_prune AND
          last_to_prune - first > mon_min_osdmap_epochs AND
          last_to_prune - first > mon_osdmap_full_prune_min AND
          num_pruned < mon_osdmap_full_prune_txsize:

      last_pinned = manifest.get_last_pinned()
      new_pinned = last_pinned + prune_interval
      manifest.pin(new_pinned)
      for e in (last_pinned .. new_pinned):
        store.erase(e)
        ++num_pruned

In essence, the algorithm ensures that the first version in the store is
*always* pinned. After all, we need a starting point when rebuilding maps, and
we can't simply remove the earliest map we have; otherwise we would be unable
to rebuild maps for the very first pruned interval.

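To make the need for a pinned starting point concrete, here is a hedged
sketch of how a pruned full map could be rebuilt from the closest pinned map
at or below it. The function ``rebuild_full_map`` and the toy dictionaries
standing in for the store are assumptions made for illustration; real osdmaps
and incrementals are far richer than the ``dict`` updates used here.

```python
import bisect


def rebuild_full_map(epoch, pinned, full_maps, incrementals):
    """Rebuild the full map for `epoch` from the closest pinned map at or below it.

    `pinned` is a sorted list of pinned epochs; `full_maps` and `incrementals`
    are toy dicts keyed by epoch (stand-ins for the monitor's store).
    """
    idx = bisect.bisect_right(pinned, epoch) - 1
    if idx < 0:
        raise KeyError("no pinned map at or below epoch %d" % epoch)
    base = pinned[idx]
    osdmap = dict(full_maps[base])           # start from the pinned full map
    for e in range(base + 1, epoch + 1):     # replay incrementals in order
        osdmap.update(incrementals[e])
    return osdmap


# Toy data: each incremental marks one more OSD as up.
incrementals = {e: {"osd.%d" % e: "up"} for e in range(2, 21)}
full_maps = {1: {"osd.1": "up"}}
full_maps[11] = rebuild_full_map(11, [1], full_maps, incrementals)
pinned = [1, 11]
m = rebuild_full_map(15, pinned, full_maps, incrementals)
print(len(m))
```

Without a pinned map at the start of the sequence, the lookup has nothing to
replay incrementals on top of, which is why the sketch raises for epochs below
the first pinned map.
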
Once we have at least one pinned map, each iteration of the algorithm can
simply base itself on the manifest's last pinned map (which we can obtain by
reading the element at the tail of the manifest's pinned maps list).

We'll next need to determine the interval of maps to be removed: all the maps
from ``last_pinned`` up to ``new_pinned``, which in turn is nothing more than
``last_pinned`` plus ``mon_osdmap_full_prune_interval``. We know that all maps
between these two values, ``last_pinned`` and ``new_pinned``, can be removed,
considering ``new_pinned`` has been pinned.

The algorithm ceases to execute as soon as one of the two initial
preconditions is not met, or if we do not meet two additional conditions that
have no weight on the algorithm's correctness:

1. We will stop if we are not able to create a new pruning interval properly
   aligned with ``mon_osdmap_full_prune_interval`` that is lower than
   ``last_to_prune``. There is no particular technical reason why we enforce
   this requirement, besides allowing us to keep the intervals with an
   expected size, and preventing small, irregular intervals that would be
   bound to happen eventually (e.g., pruning continues over the course of
   several iterations, removing one or two or three maps each time).

2. We will stop once we know that we have pruned more than a certain number of
   maps. This value is defined by ``mon_osdmap_full_prune_txsize``, and
   ensures we don't spend an unbounded number of cycles pruning maps. We don't
   enforce this value religiously (deletes do not cost much), but we make an
   effort to honor it.

We could do the removal in one go, but we have no idea how long that would
take. Therefore, we will perform several iterations, removing at most
``mon_osdmap_full_prune_txsize`` osdmaps per iteration.

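The simplified algorithm can be turned into a small runnable model. The
sketch below is an assumption-laden illustration rather than the monitor's
implementation: ``prune_iteration`` is a hypothetical name, the manifest is a
plain sorted list, the store is a set of epochs that still hold a full map,
and the ``txsize`` value of 100 is chosen only for the example. It follows the
pseudocode literally, so with an interval of 10 the pins land on epochs 1, 11,
21, and so on, rather than the 1, 10, 20 used in the diagrams.

```python
def prune_iteration(pinned, full_maps, first, last, *, min_epochs=500,
                    prune_min=10000, interval=10, txsize=100):
    """Run one bounded prune iteration; returns the number of maps erased.

    `pinned` (sorted list) and `full_maps` (set of epochs) are modified in place.
    """
    if not pinned:
        pinned.append(first)                 # the first map is always pinned
    last_to_prune = last - min_epochs
    num_pruned = 0
    while (pinned[-1] + interval < last_to_prune
           and last_to_prune - first > min_epochs
           and last_to_prune - first > prune_min
           and num_pruned < txsize):
        last_pinned = pinned[-1]
        new_pinned = last_pinned + interval
        pinned.append(new_pinned)
        # erase the maps strictly between the two pinned epochs
        for e in range(last_pinned + 1, new_pinned):
            full_maps.discard(e)
            num_pruned += 1
    return num_pruned


full_maps = set(range(1, 50001))
pinned = []
iterations = 0
while prune_iteration(pinned, full_maps, first=1, last=50000) > 0:
    iterations += 1
print(pinned[:4], pinned[-1], len(full_maps), iterations)
```

Because ``txsize`` is only checked at the top of the loop, an iteration may
erase slightly more than ``txsize`` maps, which mirrors the soft limit
described in condition 2.
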
In the end, our on-disk map sequence will look similar to::

    ------------------------------------------
    |1|10|20|30|..|49500|49501|..|49999|50000|
    ------------------------------------------
     ^ first                           last ^


Because we are not pruning all versions in one go, we need to keep state
about how far along we are in our pruning. With that in mind, we have
created a data structure, ``osdmap_manifest_t``, that holds the set of pinned
maps::

    struct osdmap_manifest_t {
      std::set<version_t> pinned;
    };

Given we are only pinning maps while we are pruning, we don't need to keep
track of additional state about the last pruned version. We know as a matter
of fact that we have pruned all the intermediate maps between any two
consecutive pinned maps.

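Since the pinned set alone describes which full maps are missing, the pruned
intervals can be derived from it directly. The helpers below are a
hypothetical sketch (``pruned_intervals`` and ``has_full_map`` are names
introduced here, not Ceph functions) and assume the manifest invariant that
every epoch between two consecutive pinned maps has been pruned.

```python
def pruned_intervals(pinned):
    """Yield (lo, hi) inclusive ranges of epochs whose full maps were pruned."""
    ordered = sorted(pinned)
    for a, b in zip(ordered, ordered[1:]):
        if b - a > 1:
            yield (a + 1, b - 1)


def has_full_map(epoch, pinned, first, last):
    """True if a full map for `epoch` is expected to be on disk."""
    if not first <= epoch <= last:
        return False
    if not pinned or epoch > max(pinned):
        return True                      # beyond the pruned region
    return epoch in pinned


print(list(pruned_intervals({1, 11, 21})))
print(has_full_map(5, {1, 11, 21}, first=1, last=100))
```
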
The question one could ask, though, is how can we be sure we pruned all the
intermediate maps if, for instance, the monitor crashes. To ensure we are
protected against such an event, we always write the osdmap manifest to disk
in the same transaction that is deleting the maps. This way we have the
guarantee that, if the monitor crashes, we will read the latest version of the
manifest: either it contains the newly pinned maps, meaning we also pruned the
in-between maps; or we will find the previous version of the osdmap manifest,
which will not contain the maps we were pinning at the time we crashed, given
that the transaction in which we would be writing the updated osdmap manifest
was not applied (and neither was the map removal).

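The all-or-nothing property can be illustrated with a toy key/value store.
Everything in this sketch is hypothetical: ``ToyStore``, ``prune_txn``, the
``full_<epoch>`` key names and the ``manifest`` key are stand-ins invented for
the example, and the ``crash`` flag merely simulates a transaction that was
never applied.

```python
class ToyStore:
    """In-memory stand-in for the monitor's key/value store (hypothetical)."""

    def __init__(self, data):
        self.data = dict(data)

    def apply(self, ops, crash=False):
        """Apply all ops atomically, or none of them if `crash` is set."""
        if crash:
            return                       # the transaction was never committed
        staged = dict(self.data)
        for op, key, value in ops:
            if op == "put":
                staged[key] = value
            else:                        # "erase"
                staged.pop(key, None)
        self.data = staged               # single commit point


def prune_txn(pinned, new_pinned):
    """Build one transaction: erase the in-between maps and update the manifest."""
    ops = [("erase", "full_%d" % e, None)
           for e in range(pinned[-1] + 1, new_pinned)]
    ops.append(("put", "manifest", pinned + [new_pinned]))
    return ops


store = ToyStore({"full_%d" % e: "map" for e in range(1, 21)})
store.data["manifest"] = [1]
before = dict(store.data)
store.apply(prune_txn([1], 11), crash=True)   # crash: nothing changes
unchanged = store.data == before
store.apply(prune_txn([1], 11))               # commit: both effects land together
print(unchanged, store.data["manifest"], "full_5" in store.data)
```

Either outcome leaves the manifest consistent with the maps on disk, which is
the property the paragraph above relies on.
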
The osdmap manifest will be written to the store each time we prune, with an
updated list of pinned maps. It is written in the transaction effectively
pruning the maps, so we guarantee the manifest is always up to date. As a
consequence of this criterion, the first time we will write the osdmap
manifest is the first time we prune. If an osdmap manifest does not exist, we
can be certain we do not hold pruned map intervals.

We will rely on the manifest to ascertain whether we have pruned map
intervals. In theory, this will always be the on-disk osdmap manifest, but we
make sure to read the on-disk osdmap manifest each time we update from paxos;
this way we always ensure having an up to date in-memory osdmap manifest.

Once we finish pruning maps, we will keep the manifest in the store, to
allow us to easily find which maps have been pinned (instead of checking
the store until we find a map). This has the added benefit of allowing us to
quickly figure out which is the next interval we need to prune (i.e., last
pinned plus the prune interval). This doesn't however mean we will forever
keep the osdmap manifest: the osdmap manifest will no longer be required once
the monitor trims osdmaps and the earliest available epoch in the store is
greater than the last map we pruned.

The same conditions from ``OSDMonitor::get_trim_to()`` that force the monitor
to keep a lot of osdmaps, thus requiring us to prune, may eventually change
and allow the monitor to remove some of its oldest maps.

MAP TRIMMING
------------

If the monitor trims maps, we must then adjust the osdmap manifest to
reflect our pruning status, or remove the manifest entirely if it no longer
makes sense to keep it. For instance, take the map sequence from before, but
let us assume we did not finish pruning all the maps::

    -------------------------------------------------------------
    |1|10|20|30|..|490|500|501|502|..|49500|49501|..|49999|50000|
    -------------------------------------------------------------
     ^ first           ^ pinned.last()                    last ^

    pinned = {1, 10, 20, ..., 490, 500}

Now let us assume that the monitor will trim up to epoch 501. This means
removing all maps prior to epoch 501, and updating the ``first_committed``
pointer to ``501``. Given that removing all those maps would invalidate our
existing pruning efforts, we can consider our pruning finished and drop
our osdmap manifest. Doing so also simplifies starting a new prune, if all
the starting conditions are met once we have refreshed our state from the
store.

We would then have the following map sequence::

    ---------------------------------------
    |501|502|..|49500|49501|..|49999|50000|
    ---------------------------------------
     ^ first                        last ^

However, imagine a slightly more convoluted scenario: the monitor will trim
up to epoch 491. In this case, epoch 491 has been previously pruned from the
store.

Given we will always need to have the oldest known map in the store, before
we trim we will have to check whether that map is in the prune interval
(i.e., if said map epoch belongs to ``[ pinned.first()..pinned.last() )``).
If so, we need to check if this is a pinned map, in which case we don't have
much to be concerned about aside from removing lower epochs from the
manifest's pinned list. On the other hand, if the map being trimmed to is not
a pinned map, we will need to rebuild said map and pin it, and only then will
we remove the pinned maps prior to the map's epoch.

In this case, we would end up with the following sequence::

    -----------------------------------------------
    |491|500|501|502|..|49500|49501|..|49999|50000|
    -----------------------------------------------
     ^   ^- pinned.last()                   last ^
     `- first

There is still an edge case that we should mention. Consider that we are
going to trim up to epoch 499, which is the very last pruned epoch.

Much like the scenario above, we would end up writing osdmap epoch 499 to
the store; but what should we do about pinned maps and pruning?

The simplest solution is to drop the osdmap manifest. After all, given we
are trimming to the last pruned map, and we are rebuilding this map, we can
guarantee that all maps greater than epoch 499 are sequential (because we
have not pruned any of them). Dropping the osdmap manifest in this case is
essentially the same as if we were trimming over the last pruned epoch: we
can prune again later if we meet the required conditions.

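The trimming scenarios above can be summarised in a short sketch. The
function ``trim_manifest`` is hypothetical, and the rule it uses to decide
when to drop the manifest (no gaps remain between the pinned maps) is our
generalisation of the epoch 499 case, not something the monitor is confirmed
to do in this exact form.

```python
def trim_manifest(pinned, trim_to):
    """Adjust the manifest when the monitor trims to `trim_to`.

    Returns (new_pinned, needs_rebuild). An empty `new_pinned` means the
    manifest is dropped. `pinned` is a sorted list of pinned epochs.
    """
    if not pinned or trim_to < pinned[0]:
        return list(pinned), False           # nothing to adjust
    if trim_to >= pinned[-1]:
        return [], False                     # trimmed past every pruned interval
    needs_rebuild = trim_to not in pinned    # pruned epoch: rebuild and pin it
    new_pinned = [trim_to] + [p for p in pinned if p > trim_to]
    # Assumption: once no gaps remain between pinned maps, the manifest
    # carries no information and can be dropped (the epoch 499 case).
    if all(b - a == 1 for a, b in zip(new_pinned, new_pinned[1:])):
        return [], needs_rebuild
    return new_pinned, needs_rebuild


pinned = [1] + list(range(10, 501, 10))      # {1, 10, 20, ..., 490, 500}
for epoch in (501, 491, 499):
    print(epoch, trim_manifest(pinned, epoch))
```

Trimming to 501 drops the manifest outright, trimming to 491 rebuilds and pins
that epoch ahead of 500, and trimming to 499 rebuilds the map but leaves
nothing worth tracking.
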
And, with this, we have fully delved into full osdmap pruning. Later in this
document one can find detailed `REQUIREMENTS, CONDITIONS & INVARIANTS` for the
whole algorithm, from pruning to trimming. Additionally, the next section
details several additional checks to guarantee the sanity of our configuration
options. Enjoy.


CONFIGURATION OPTIONS SANITY CHECKS
-----------------------------------

We perform additional checks before pruning to ensure all configuration
options involved are sane:

1. If ``mon_osdmap_full_prune_interval`` is zero we will not prune; we
   require an actual positive number, greater than one, to be able to prune
   maps. If the interval is one, we would not actually be pruning any maps, as
   the interval between pinned maps would essentially be a single epoch. This
   means we would have zero maps in-between pinned maps, hence no maps would
   ever be pruned.

2. If ``mon_osdmap_full_prune_min`` is zero we will not prune; we require a
   value greater than zero so we know the threshold over which we should
   prune. We don't want to guess.

3. If ``mon_osdmap_full_prune_interval`` is greater than
   ``mon_osdmap_full_prune_min`` we will not prune, as it is impossible to
   ascertain a proper prune interval.

4. If ``mon_osdmap_full_prune_txsize`` is lower than
   ``mon_osdmap_full_prune_interval`` we will not prune; we require a
   ``txsize`` with a value at least equal to ``interval``, and (depending on
   the value of the latter) ideally higher.


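The four checks above can be sketched as a single validation function. The
name ``prune_config_is_sane`` and its argument names are hypothetical; the
sketch also assumes that an interval of one is rejected outright, since the
text notes it would never prune anything.

```python
def prune_config_is_sane(interval, prune_min, txsize):
    """Return True if the three options allow pruning (checks 1 to 4 above)."""
    if interval <= 1:          # 1. zero or one would never prune anything
        return False
    if prune_min <= 0:         # 2. a threshold is required
        return False
    if interval > prune_min:   # 3. the interval must fit within the threshold
        return False
    if txsize < interval:      # 4. a transaction must cover at least one interval
        return False
    return True


print(prune_config_is_sane(interval=10, prune_min=10000, txsize=100))
print(prune_config_is_sane(interval=10, prune_min=10000, txsize=5))
```
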
REQUIREMENTS, CONDITIONS & INVARIANTS
-------------------------------------

REQUIREMENTS
~~~~~~~~~~~~

* All monitors in the quorum need to support pruning.

* Once pruning has been enabled, monitors not supporting pruning will not be
  allowed in the quorum, nor will they be allowed to synchronize.

* Removing the osdmap manifest results in disabling the pruning feature quorum
  requirement. This means that monitors not supporting pruning will be allowed
  to synchronize and join the quorum, granted they support any other features
  required.


CONDITIONS & INVARIANTS
~~~~~~~~~~~~~~~~~~~~~~~

* Pruning has never happened, or we have trimmed past its previous
  intervals::

    invariant: first_committed > 1

    condition: pinned.empty() AND !store.exists(manifest)


* Pruning has happened at least once::

    invariant: first_committed > 0
    invariant: !pinned.empty()
    invariant: pinned.first() == first_committed
    invariant: pinned.last() < last_committed

      precond: pinned.last() < prune_to AND
               pinned.last() + prune_interval < prune_to

     postcond: pinned.size() > old_pinned.size() AND
               (for each v in [pinned.first()..pinned.last()]:
                 if pinned.count(v) > 0: store.exists_full(v)
                 else: !store.exists_full(v)
               )


* Pruning has finished::

    invariant: first_committed > 0
    invariant: !pinned.empty()
    invariant: pinned.first() == first_committed
    invariant: pinned.last() < last_committed

    condition: pinned.last() == prune_to OR
               pinned.last() + prune_interval >= prune_to


* Pruning intervals can be trimmed::

      precond: OSDMonitor::get_trim_to() > 0

    condition: !pinned.empty()

    invariant: pinned.first() == first_committed
    invariant: pinned.last() < last_committed
    invariant: pinned.first() <= OSDMonitor::get_trim_to()
    invariant: pinned.last() >= OSDMonitor::get_trim_to()

* Trim pruned intervals::

    invariant: !pinned.empty()
    invariant: pinned.first() == first_committed
    invariant: pinned.last() < last_committed
    invariant: pinned.first() <= OSDMonitor::get_trim_to()
    invariant: pinned.last() >= OSDMonitor::get_trim_to()

     postcond: pinned.empty() OR
               (pinned.first() == OSDMonitor::get_trim_to() AND
                pinned.last() > pinned.first() AND
                (for each v in [0..pinned.first()):
                  !store.exists(v) AND
                  !store.exists_full(v)
                ) AND
                (for each m in [pinned.first()..pinned.last()]:
                  if pinned.count(m) > 0: store.exists_full(m)
                  else: !store.exists_full(m) AND store.exists(m)
                )
               )
     postcond: !pinned.empty() OR
               (!store.exists(manifest) AND
                (for each v in [pinned.first()..pinned.last()]:
                  !store.exists(v) AND
                  !store.exists_full(v)
                )
               )

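As a closing illustration, the invariants and postcondition for the "pruning
has happened at least once" state can be checked mechanically against a model
of the store. The function ``check_pruned_state`` is a hypothetical test
helper; ``full_maps`` is assumed to be the set of epochs that still have a
full map on disk.

```python
def check_pruned_state(pinned, full_maps, first_committed, last_committed):
    """Assert the 'pruning has happened at least once' invariants and postcondition."""
    assert first_committed > 0
    assert pinned, "manifest must not be empty"
    ordered = sorted(pinned)
    assert ordered[0] == first_committed
    assert ordered[-1] < last_committed
    for v in range(ordered[0], ordered[-1] + 1):
        # within the pruned region, a full map exists iff the epoch is pinned
        assert (v in full_maps) == (v in pinned), v
    return True


pinned = {1, 11, 21}
full_maps = pinned | set(range(22, 101))
print(check_pruned_state(pinned, full_maps, first_committed=1, last_committed=100))
```
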