If the next oldest clone is dirty, we cannot flush. That is, we must
always flush starting with the oldest dirty clone.
Note that we can never have a sequence like dirty -> clean -> dirty,
because clones are only dirty on creation, are created in order, and cannot
be flushed (cleaned) out of order. Thus checking the previous clone is
sufficient (and thankfully cheap).
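A minimal sketch of that check, assuming clones are kept oldest first. The
Clone struct and can_flush() helper are hypothetical stand-ins for
illustration, not the actual OSD code:

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a clone's state; clones are ordered oldest first.
struct Clone {
  unsigned snapid;
  bool dirty;
};

// A clone may be flushed only if the next-oldest clone (index i - 1) is
// clean. Because dirty -> clean -> dirty cannot occur, looking at that one
// neighbour is enough.
bool can_flush(const std::vector<Clone>& clones, std::size_t i) {
  if (i >= clones.size())
    return false;
  return i == 0 || !clones[i - 1].dirty;
}
```

With clones {clean, dirty, dirty}, the first dirty clone may be flushed and
the second may not until the first has been cleaned.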
Signed-off-by: Sage Weil <sage@inktank.com>
We do three things here:
- make cache-evict a CACHE op instead of a WR op, allowing us to submit it
on snaps (not just head)
- allow eviction of a snap
- verify that all snaps are missing before evicting a head
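The last check could look roughly like this. This is only a sketch: it reads
"missing" as "no longer present in the cache pool", and SnapId and
can_evict_head() are hypothetical names, not the OSD's types:

```cpp
#include <set>
#include <vector>

using SnapId = unsigned long;  // hypothetical stand-in for a snap id type

// The head may be evicted only if none of its clones is still present
// in the cache pool.
bool can_evict_head(const std::vector<SnapId>& clones,
                    const std::set<SnapId>& present_in_cache) {
  for (SnapId c : clones) {
    if (present_in_cache.count(c))
      return false;  // a clone is still cached, so the head must stay
  }
  return true;
}
```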
Signed-off-by: Sage Weil <sage@inktank.com>
It is useful to distinguish cache operations from read and modify
operations. Specifically, we will allow cache ops to be sent for
snaps and also allow those ops to result in a write.
Signed-off-by: Sage Weil <sage@inktank.com>
A clone that comes into existence via promotion takes an entirely
different path than a typical clone (which comes into existence via a
CLONE op in make_writeable()). Make sure snap_mapper is updated
accordingly.
Signed-off-by: Sage Weil <sage@inktank.com>
On promote we use finish_ctx to build the final log entries, and need to
encode the snaps vector in that case. (Normally this is done by
make_writeable or explicitly by the snap trimmer.)
Signed-off-by: Sage Weil <sage@inktank.com>
When we promote the head for an object, get the list of snaps from the
backend pool and construct an appropriate SnapSet. Note that this is
always placed on the head in the cache pool, since we will have a
whiteout object in this case.
Also note that the SnapSet's list of snapids will not include any snaps
for which there were no clones. This is fine, since it is only used for
creating clones, and we've already done that.
Signed-off-by: Sage Weil <sage@inktank.com>
This is an alternative to MODIFY that indicates the object was just
promoted from another tier. Thankfully, is_modify() is used in very
few places!
Signed-off-by: Sage Weil <sage@inktank.com>
When promoting a snapped object, we need to also get the set of snaps over
which the clone is defined. This is not strictly available except via the
list-snaps rados call, but that is only used on the snapdir object much
earlier when the head (whiteout) is promoted, and is not conveniently
available now. The addition to the internal copy-get is not exposed via
librados (copy-get is not exposed at all), so I don't think this is a
problem.
Signed-off-by: Sage Weil <sage@inktank.com>
find_object_context() now tells us which object it could use if it
doesn't find it on disk. Promote that one.
Signed-off-by: Sage Weil <sage@inktank.com>
Previously we would return a snapid that we are blocked on if it is
missing. This is necessary because the missing clone does not always
match the logical snap we are trying to read.
Extend this to return a full hobject_t that is the missing object we want.
For the missing clone case, this cleans things up slightly. More
importantly, it lets find_object_context also tell us which on-disk
object is missing that, if it could be promoted, would help.
Signed-off-by: Sage Weil <sage@inktank.com>
If we call
bl.append(some_istream);
do not include a \n if the istream is empty (which apparently is not
the same thing as eof). This was causing 'ceph pg getmap' to include a
trailing newline.
Probably we don't want this newline at all! But all callers need to be
fixed for that change.
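A sketch of the behaviour described above, with std::string standing in for
the bufferlist. append_stream() is a hypothetical name and this is not the
actual bufferlist code:

```cpp
#include <istream>
#include <sstream>
#include <string>

// Append every line of `in` to `out`. A "\n" is added only after a
// non-empty line, so an empty stream, or the empty read that follows the
// final newline, contributes nothing. Side effect of this sketch: blank
// lines inside the input are dropped as well.
void append_stream(std::string& out, std::istream& in) {
  while (!in.eof()) {
    std::string line;
    std::getline(in, line);
    out.append(line);
    if (!line.empty())
      out.append("\n");
  }
}
```

The loop runs once even for an empty stream, because eof() only becomes true
after a read has hit the end. That is why the emptiness test is needed.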
Signed-off-by: Sage Weil <sage@inktank.com>
Make peer_backfill_info a map which holds a
BackfillInterval for all backfill targets.
Initially see if recover_backfill() can just backfill
the first one and mark them all finished.
Signed-off-by: David Zafman <david.zafman@inktank.com>
Checking the pointer alignment using a cast to long long raises a
warning when -Wpointer-to-int-cast is given.
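This message does not say which cast replaces long long. One common way to
test alignment without that warning is to go through uintptr_t, as in this
sketch (is_aligned() is a hypothetical helper):

```cpp
#include <cstddef>
#include <cstdint>

// Returns true if p is aligned to `alignment` bytes. `alignment` must be
// non-zero. uintptr_t is an integer type wide enough to hold a pointer,
// so the cast does not change size.
inline bool is_aligned(const void* p, std::size_t alignment) {
  return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}
```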
Signed-off-by: Loic Dachary <loic@dachary.org>
* add information about CEPH_ARGS
* rework the --build documentation and example
* add an Author section
* replace vi with emacs for no good reason
* cleanup whitespace
Signed-off-by: Loic Dachary <loic@dachary.org>
* dump the crush tree created by --build at debug level 1.
* display a warning at debug level 1 if there is more than one root. In
most cases it is not what the user wants and it may be confusing
because the ruleset will only apply to the first root and have fewer
devices under it than expected.
Signed-off-by: Loic Dachary <loic@dachary.org>
Instead of creating a ruleset from scratch, use the
OSDMap::build_simple_crush_rulesets helper. It is more likely to match
the user's expectations.
Signed-off-by: Loic Dachary <loic@dachary.org>
When the number of args provided to --build is not a multiple of 3,
display the arguments which do not comply.
For instance the --debug_crush 0 option is not consumed by global_init
in crushtool because, unlike most ceph tools, the arguments are not
passed to global_init. As a result --debug_crush 0 becomes part of the
arguments and triggers the failure.
crushtool --debug_crush 0 --build --num_osds 320 node straw 4
remaining args: [--debug_crush,0,node,straw,4]
layers must be specified with 3-tuples of (name, buckettype, size)
Signed-off-by: Loic Dachary <loic@dachary.org>
The arguments are not given to global_init because the -c option would
conflict. Reading arguments from CEPH_ARGS the way other ceph tools do
is the only way to control verbosity (via --debug_crush 0, for instance).
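For illustration only, reading extra arguments from an environment variable
can be sketched as below. args_from_env() is a hypothetical whitespace
splitter with no quoting support, not the helper the ceph tools use:

```cpp
#include <cstdlib>
#include <sstream>
#include <string>
#include <vector>

// Split the value of environment variable `name` on whitespace.
// Returns an empty vector if the variable is unset.
std::vector<std::string> args_from_env(const char* name) {
  std::vector<std::string> out;
  const char* value = std::getenv(name);
  if (!value)
    return out;
  std::istringstream iss(value);
  std::string word;
  while (iss >> word)
    out.push_back(word);
  return out;
}
```

With CEPH_ARGS="--debug_crush 0" in the environment,
args_from_env("CEPH_ARGS") returns the two words --debug_crush and 0.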
Signed-off-by: Loic Dachary <loic@dachary.org>
There is no need to specialize the argument into stringstream. It is
replaced by an ostream, which makes it convenient to display errors
directly to cerr if appropriate.
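The point in a sketch, with a hypothetical check_size() validator: a
function that takes std::ostream& accepts both cerr and a stringstream.

```cpp
#include <iostream>
#include <ostream>
#include <sstream>

// Hypothetical validator. Errors go to whatever ostream the caller passes.
bool check_size(int size, std::ostream& err) {
  if (size < 0) {
    err << "invalid size " << size << "\n";
    return false;
  }
  return true;
}
```

A caller can pass std::cerr to report directly, or a std::ostringstream to
capture the message for later use.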
Signed-off-by: Loic Dachary <loic@dachary.org>
Require that all OSDs support TMAP2OMAP before starting the MDS. This
avoids doing some work and then crashing with EOPNOTSUPP, and gives us
a more informative message in the logs.
Signed-off-by: Sage Weil <sage@inktank.com>