In several places, a change in the up_primary triggers a new peering
interval, but the places that actually generate the new past intervals,
including check_new_interval(), did not enforce that. This becomes
somewhat obvious when you see that those callers are ignoring the
up_primary output argument for pg_to_up_acting_osds().
Fix this by adding arguments to check_new_interval and fixing the callers
to pass them in properly. Add a unit test case to verify this.
Note that the past interval struct itself does not record who the
up_primary was; possibly it should.
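A minimal sketch of the rule described above, not the actual Ceph code. The
struct MappingSketch and the function is_new_interval_sketch are hypothetical
names, and the real check_new_interval() takes more arguments than shown. It
only illustrates that a change in up_primary alone has to count as a new
interval, even when the up and acting sets are unchanged.

```cpp
#include <vector>

// Hypothetical, simplified view of a PG mapping at one epoch.
struct MappingSketch {
  std::vector<int> up;      // up set
  std::vector<int> acting;  // acting set
  int up_primary = -1;      // -1 means "no primary" in this sketch
};

// Returns true when the change from old_m to new_m should start a new
// peering interval. The up_primary comparison is the part that the callers
// were effectively skipping.
inline bool is_new_interval_sketch(const MappingSketch& old_m,
                                   const MappingSketch& new_m) {
  return old_m.up != new_m.up ||
         old_m.acting != new_m.acting ||
         old_m.up_primary != new_m.up_primary;
}
```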
Fixes: #8139
Signed-off-by: Sage Weil <sage@inktank.com>
Feed in the ancestor pg_t (if any) when we are looking at intervals for
previous maps that may have preceded a recent split.
Fixes: #8139
Signed-off-by: Sage Weil <sage@inktank.com>
We check whether the head is degraded, and we check whether a clone is
unreadable, but in the case where we have a cache op on a degraded object,
we don't check. That leads to an assert when the repop hits the replica
and the object is in the peer's missing set.
Fix this by adding a check on the clone when write_ordered is true. Note
that checking write_ordered is better than whether it is a cache op because
we want to preserve write ordering even for reads that are flagged by the
client.
Fixes: #8048
Signed-off-by: Sage Weil <sage@inktank.com>
If we recalculate the mapping and find that there is no primary, we need
to set the 'osd' field to -1. Otherwise, the caller will try to resend
to a dead session with bad results.
This was introduced in the refactor 860d72770c.
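A rough illustration of the fix, under simplifying assumptions: TargetSketch
and recalc_target_sketch are hypothetical names, and treating the first
acting OSD as the primary is an assumption made only for this sketch. The
point is that the stale 'osd' value must be overwritten with -1 when the
recalculated mapping has no primary.

```cpp
#include <vector>

// Hypothetical stand-in for the op target; only the 'osd' field matters here.
struct TargetSketch {
  int osd = -1;
};

// Recompute the target from a freshly calculated acting set.
inline void recalc_target_sketch(TargetSketch& t,
                                 const std::vector<int>& acting) {
  if (acting.empty()) {
    // No primary: clear the old value so the caller does not try to resend
    // to a session that no longer exists.
    t.osd = -1;
  } else {
    // Assumption for this sketch: the first acting OSD is the primary.
    t.osd = acting.front();
  }
}
```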
Fixes: #8130
Signed-off-by: Sage Weil <sage@inktank.com>
If we have just started and receive a command, we currently will reply with
EINVAL because the leader commands are empty. Note that this race is very
difficult to reach because the (old) peon needs to forward a command to
the mon while it still thinks it has quorum, and the message needs to get
sent after the leader mon has restarted and reset its connection but before
it has declared a new election.
To fix this, we should assume at startup time that our commands are
valid. If it is an internal command that does not require quorum, that
is fine. If it does require quorum, we will retry the command after the
election completes and we will revalidate the command then.
Fixes: #8132
Signed-off-by: Sage Weil <sage@inktank.com>
In 69321bf, the handling of EAGAIN changed to block indefinitely
rather than returning to the user. Change the return for
`osd pool set` operations that are blocked by creating PGs
to return EBUSY instead of EAGAIN, so that they are excepted
from this blocking behaviour.
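To illustrate why the error code matters, here is a hedged sketch of a caller
that retries on EAGAIN. The helper run_with_retry_sketch is hypothetical and
is not the real client code; it also bounds the number of attempts so that it
terminates, whereas the behaviour described above blocks indefinitely.

```cpp
#include <cerrno>
#include <functional>

// Hypothetical helper: keeps re-running cmd while it returns -EAGAIN and
// hands any other result (including -EBUSY) straight back to the caller.
inline int run_with_retry_sketch(const std::function<int()>& cmd,
                                 int max_attempts) {
  int r = -EAGAIN;
  for (int i = 0; i < max_attempts && r == -EAGAIN; ++i) {
    r = cmd();
  }
  return r;
}
```

With this shape, a command that fails with -EBUSY is reported after a single
attempt, while one that fails with -EAGAIN keeps being retried.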
Signed-off-by: John Spray <john.spray@inktank.com>
There are several perils when splitting a cache pool:
- split invalidates pg stats, which disables the agent
- a scrub must be manually triggered post-split to rebuild stats
- the pool may fill the OSDs during that period
- or, the pool may end up beyond the 'full' mark, and once the scrub does
complete and the agent activates, we may block IO for a long time while
we catch up with flush/evict
Make it a bit harder for users to shoot themselves in the foot.
Fixes: #8043
Signed-off-by: Sage Weil <sage@inktank.com>
Previously we assumed that the ceph-mds executable was in
PWD; now use /proc/self/exe to find the
executable wherever it may be. Leave in the old version
as a fallback for non-Linux environments.
Also add a 'respawn' command so that it's easy to test
respawn with `ceph mds tell <id> respawn`
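The /proc/self/exe lookup can be sketched as follows. This is an
illustration, not the code from the commit: the function name
self_exe_path_sketch is hypothetical, and returning an empty string to signal
"use the fallback" is an assumption of this sketch.

```cpp
#include <climits>
#include <string>
#include <unistd.h>

// Returns the absolute path of the running executable by resolving the
// /proc/self/exe symlink, or an empty string when that is not available
// (for example on non-Linux systems), so the caller can fall back.
inline std::string self_exe_path_sketch() {
  char buf[PATH_MAX];
  // readlink() does not NUL-terminate, so keep the returned length.
  ssize_t n = ::readlink("/proc/self/exe", buf, sizeof(buf) - 1);
  if (n <= 0) {
    return std::string();
  }
  return std::string(buf, static_cast<size_t>(n));
}
```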
Fixes: #7966
strerror_r is not portable; on GNU libc it returns char * and sometimes
does not fill in the supplied buffer. Use autoconf to test which
version this platform uses and adapt.
Clean up the random calls to strerror and strerror_r (along with all
their private little one-use buffers) and regularize the code to use
cpp_strerror almost everywhere. Where changed, any negation of the
error code is also removed, since cpp_strerror() will do that.
Note: some tools were using their own calls to strerror/strerror_r, so
will now get a (%d) in their output that wasn't there before; hence
the change to test/cli/monmaptool/print-nonexistent.t
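For illustration only, a sketch of a wrapper that copes with both strerror_r
variants. Note the swapped-in technique: the commit uses an autoconf test,
while this sketch picks the variant through C++ overload resolution on the
return type. The names strerror_result and cpp_strerror_sketch are
hypothetical, and the exact output format is assumed from the "(%d)" note
above.

```cpp
#include <cerrno>
#include <cstring>
#include <sstream>
#include <string>

// XSI variant: strerror_r returns int and writes the message into buf.
inline const char* strerror_result(int rc, const char* buf) {
  return rc == 0 ? buf : "Unknown error";
}

// GNU variant: strerror_r returns the message pointer, which is not
// necessarily the supplied buffer.
inline const char* strerror_result(const char* msg, const char* /*buf*/) {
  return msg;
}

// Formats an error code as "(N) message". Negative codes are negated here,
// so callers do not need to do that themselves.
inline std::string cpp_strerror_sketch(int err) {
  char buf[128];
  buf[0] = '\0';
  if (err < 0) {
    err = -err;
  }
  const char* msg = strerror_result(strerror_r(err, buf, sizeof(buf)), buf);
  std::ostringstream oss;
  oss << "(" << err << ") " << msg;
  return oss.str();
}
```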
Fixes: #8041
Signed-off-by: Dan Mick <dan.mick@inktank.com>
Currently the dup op checks happen in execute_ctx, long after we handle
cache ops or get the obc and (potentially) return ENOENT. That means that
neither object deletions nor cache ops are properly idempotent.
This is easy to fix by moving the check earlier in do_op.
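The ordering problem can be shown with a toy model. Everything here is
hypothetical (PgSketch, do_delete_sketch, and the integer request ids) and is
far simpler than the real do_op path; it only shows why the dup check has to
come before the lookup that can return ENOENT.

```cpp
#include <cerrno>
#include <map>
#include <set>
#include <string>

// Hypothetical toy state: the objects that exist, plus the results of
// requests that have already completed, keyed by request id.
struct PgSketch {
  std::set<std::string> objects;
  std::map<unsigned long, int> completed;
};

inline int do_delete_sketch(PgSketch& pg, unsigned long reqid,
                            const std::string& oid) {
  // Dup check first: a resent request gets its original result back.
  std::map<unsigned long, int>::const_iterator it = pg.completed.find(reqid);
  if (it != pg.completed.end()) {
    return it->second;
  }
  // Only now look the object up, which may yield -ENOENT.
  if (pg.objects.erase(oid) == 0) {
    return -ENOENT;
  }
  pg.completed[reqid] = 0;
  return 0;
}
```

If the lookup ran before the dup check, a resent delete would see the object
already gone and return -ENOENT instead of the original success.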
Fixes: #8089
Signed-off-by: Sage Weil <sage@inktank.com>
If early reply is not allowed, the MDS does not send a reply to the client
immediately after Locker::issue_new_caps adds new caps. So the MDS can revoke
the caps before sending the reply to the client.
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
MDCache::do_file_recover may call Locker::eval_gather, which may change
the filelock to a stable state. So we should authpin the inode (for the
unstable lock state) first.
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>