If we have epoch X and find out we died as of epoch Y, we still want to
request X+1. Among other things, this fixes a 'stall' when Y happens to
be the most recent map published and no new maps are being generated: in
that case we would never get anything back from our subscription.
This makes this osdmap_subscribe() caller match every other caller by
passing in current epoch + 1.
Fixes: #8002
Signed-off-by: Sage Weil <sage@inktank.com>
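The stall can be seen with a toy model of the subscription semantics (the
names below are illustrative, not the actual Objecter/Monitor code):

```python
def maps_returned(published_epochs, start):
    # Monitor side of a map subscription: send every published map
    # with epoch >= start; nothing if start is past the newest map.
    return [e for e in published_epochs if e >= start]

published = [8, 9, 10]   # Y = 10 is the most recent map published
current = 7              # we hold epoch X = 7

# Subscribing past Y while no new maps are generated yields nothing,
# so the subscription never fires -- the stall:
assert maps_returned(published, 10 + 1) == []

# Subscribing at X+1 immediately returns everything we are missing:
assert maps_returned(published, current + 1) == [8, 9, 10]
```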
This is useful only for debugging: the encoded contents of a message are
dumped to the log on message send. This helps when valgrind triggers
warnings about uninitialized memory in messages, because the call chain
then indicates which message type is to blame, whereas the usual writer
thread context does not tell us anything useful.
Signed-off-by: Sage Weil <sage@inktank.com>
We should not respond to checks for map versions when we are in the
probing or electing states or else clients will get incorrect results when
they ask what the latest map version is.
Fixes: #7997
Signed-off-by: Sage Weil <sage@inktank.com>
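A minimal sketch of the intended guard (state names are illustrative,
not the monitor's actual state machine):

```python
PROBING, ELECTING, LEADER, PEON = "probing", "electing", "leader", "peon"

def handle_map_version_check(state, latest_epoch):
    # Outside of quorum (probing/electing) our idea of the latest map
    # may be stale, so stay silent rather than answer incorrectly.
    if state in (PROBING, ELECTING):
        return None
    return latest_epoch

assert handle_map_version_check(PROBING, 42) is None
assert handle_map_version_check(LEADER, 42) == 42
```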
Notify the kernel to invalidate top-level directory entries. As a side
effect, the kernel inode cache gets shrunk.
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
After a split the pg stats are approximate but not precisely correct. Any
inaccuracy can be problematic for the agent because it determines the
level of effort and potentially full/blocking behavior based on that.
We could conceivably do some estimation here that is "safe" in that we
don't commit to too much effort (or back off later if it isn't paying off)
and never block, but that is error-prone.
Instead, just disable the agent until a scrub makes the stats reliable
again.
We should document that a scrub after split is recommended (in any case)
and especially important on cache tiers, but there are currently *no*
user docs about PG splitting.
Fixes: #7975
Signed-off-by: Sage Weil <sage@inktank.com>
This ensures that they get new maps before an op which requires them
(maps they would otherwise have to request from the monitor).
Signed-off-by: Greg Farnum <greg@inktank.com>
The hit_set transactions may include both a modify of the new hit_set and
deletion of an old one, spanning the backfill boundary, and we may end up
sending a backfill target a blank transaction that does not correctly
remove the old object. Later, it will notice the stray object and
fail an assert.
Fix this by skipping hit_set_persist() if any of the backfill targets are
still working on the very first hash value in the PG (which is where all
of the hit_set objects live). This is coarse but simple.
Another solution would be to send separate ops for the trim/deletion and
new hit_set update, but that is a bit more complex and adds a bit more
runtime overhead (twice the messages).
Fixes: #7983
Signed-off-by: Sage Weil <sage@inktank.com>
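The guard can be sketched as follows, modeling each backfill target's
last_backfill position as an integer with 0 standing in for the first
hash value in the PG; this is a hypothetical simplification, not the
hobject_t comparison the real code performs:

```python
def can_persist_hit_set(backfill_last_positions, first_hash_pos=0):
    # Skip hit_set_persist() while any backfill target is still
    # working on the first hash value (where all hit_set objects
    # live); otherwise a transaction could span the backfill boundary.
    return all(pos > first_hash_pos for pos in backfill_last_positions)

assert can_persist_hit_set([5, 7]) is True   # all targets are past it
assert can_persist_hit_set([0, 7]) is False  # one target still on it
```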
This reintroduces the same semantics that were in place in dumpling prior
to the refactoring of the cap/command matching code.
We haven't added this requirement to auth read-write operations as that
would have the potential to break a lot of well-configured keyrings once
the users upgraded, without any significant gain -- we assume that if
they have set 'rw' caps on a given entity, they are indeed expecting said
entity to be sort-of-privileged with regard to monitor access.
Fixes: #7919
Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
While we're here, remove the non-const get_xlock_by() (because
we don't need it). Also note we return a full MutationRef
(instead of a ref to the stored one); this is necessary in case we
don't have a set-up more() object.
Signed-off-by: Greg Farnum <greg@inktank.com>
We keep an MDRequestImpl::set_self_ref(MDRequestRef&) function so
that we don't need to do the pointer conversion elsewhere.
Signed-off-by: Greg Farnum <greg@inktank.com>
We're switching the MDRequest to be used as a shared pointer. This is the
first step on the path to inserting an OpTracker into the MDS.
Give the MDRequestImpl a weak_ptr self_ref so that we can keep
using the elist for now.
Signed-off-by: Greg Farnum <greg@inktank.com>
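The self_ref idea mirrors C++'s weak_ptr: the back-pointer must not
extend the request's lifetime. A rough Python analogue using weakref
(illustrative only, not the MDS code):

```python
import weakref

class MDRequestImpl:
    def set_self_ref(self, obj):
        # Store only a weak reference so the elist back-pointer does
        # not keep the request alive past its last shared reference.
        self._self_ref = weakref.ref(obj)

    def get_self_ref(self):
        return self._self_ref()  # None once the request is destroyed

req = MDRequestImpl()
req.set_self_ref(req)
assert req.get_self_ref() is req
```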
When the MDS receives the getattr request, the corresponding inode's
filelock can be in an unstable state, waiting for the client's Fr cap.
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Properly pass 'retain' to Client::send_cap(), because it is used to
adjust cap->issued.
Also make Client::encode_inode_release() not release used/dirty caps.
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
The following

  ./ceph osd pool create data-cache 8 8
  ./ceph osd tier add data data-cache
  ./ceph osd tier cache-mode data-cache writeback
  ./ceph osd tier set-overlay data data-cache
  ./rados -p data create foo
  ./rados -p data stat foo

results in

  error stat-ing data/foo: No such file or directory
even though foo exists in the data-cache pool, as it should. STAT
checks for (exists && !is_whiteout()), but the whiteout flag isn't
cleared on CREATE as it is on WRITE and WRITEFULL. The problem is
that, for newly created 0-sized cache pool objects, the CREATE handler in
do_osd_ops() doesn't get a chance to queue OP_TOUCH, and so the logic
in prepare_transaction() considers CREATE to be a read and therefore
doesn't clear the whiteout. Fix it by allowing the CREATE handler to
queue OP_TOUCH at all times, mimicking WRITE and WRITEFULL behaviour.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
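The read-vs-write decision can be modeled with a tiny sketch (the op
names and WRITE_OPS set are illustrative; the real logic keys off the
ops queued into the transaction in do_osd_ops()):

```python
WRITE_OPS = {"touch", "write", "writefull"}

def prepare_transaction(queued_ops, whiteout):
    # Any write-class op in the transaction makes it a write, which
    # clears the whiteout flag; a pure read leaves the flag set.
    if any(op in WRITE_OPS for op in queued_ops):
        whiteout = False
    return whiteout

# Before the fix, CREATE of a 0-sized cache object queued nothing:
assert prepare_transaction([], True) is True      # still a whiteout
# After the fix, CREATE always queues "touch", like WRITE/WRITEFULL:
assert prepare_transaction(["touch"], True) is False
```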
When getting a REJECT from a backfill target, tell the already-GRANTed
targets to go back to the RepNotRecovering state by sending them a REJECT.
Fixes: #7922
Signed-off-by: David Zafman <david.zafman@inktank.com>
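In sketch form (state names simplified; the real states belong to the
recovery state machine):

```python
def on_backfill_reject(target_states):
    # A REJECT from one target means every already-GRANTed target
    # must be bounced back to RepNotRecovering; return the list of
    # targets we need to send a REJECT to.
    bounced = []
    for target, state in target_states.items():
        if state == "GRANTED":
            target_states[target] = "RepNotRecovering"
            bounced.append(target)
    return bounced

states = {"osd.1": "GRANTED", "osd.2": "WAITING"}
assert on_backfill_reject(states) == ["osd.1"]
assert states["osd.1"] == "RepNotRecovering"
```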
Fixes: #7978
We tried to move to the next placement rule, but we were already at the
last one, so we ended up looping forever.
Signed-off-by: Yehuda Sadeh <yehuda@inktank.com>
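The missing termination condition can be sketched as (function and
parameter names are hypothetical, not the actual RGW code):

```python
def next_placement_rule(rules, cur_idx):
    # Return the index of the next rule, or None when we are already
    # at the last one -- the termination condition whose absence
    # caused the infinite loop.
    if cur_idx + 1 >= len(rules):
        return None
    return cur_idx + 1

assert next_placement_rule(["default", "ssd"], 0) == 1
assert next_placement_rule(["default", "ssd"], 1) is None
```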