The active MDS calls MDCache::rejoin_scour_survivor_replicas() when it
receives the cache rejoin message. The function removes the objects
replicated by MDentry{Link,Unlink} from the replica map.
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
An active MDS may receive a resolve/rejoin message before receiving
the mdsmap message that declares the MDS cluster to be in the
resolving/rejoining state. So instead of setting the gather MDS sets
when receiving the mdsmap, set them in advance when an MDS failure is
detected.
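
A minimal sketch of the idea, not the actual MDCache code. The names
GatherTracker, on_mds_failure() and on_resolve() are hypothetical. The
set of ranks to wait for is filled in when the failure is detected, so
a resolve message that arrives before the mdsmap is still counted.

```cpp
#include <set>

// Hypothetical stand-in for an MDS rank.
using mds_rank_t = int;

struct GatherTracker {
  // Ranks we still expect a resolve message from.
  std::set<mds_rank_t> resolve_gather;

  // Called as soon as a failure is detected, before any mdsmap arrives.
  void on_mds_failure(mds_rank_t who) { resolve_gather.insert(who); }

  // Called when a resolve message arrives.
  // Returns true once nothing is pending any more.
  bool on_resolve(mds_rank_t from) {
    resolve_gather.erase(from);
    return resolve_gather.empty();
  }
};
```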
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
When the MDS cluster is resolving, the current behavior is to send a
subtree resolve message to all other MDSs and wait for a resolve
message from each of them. The problem is that active MDSs can have
different subtree maps due to rename. Besides, gathering the active
MDSs' resolve messages is also racy. The only function of these
messages is to disambiguate other MDSs' imports. We can replace them
with import finish notifications.
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
Resolve messages for all MDSs are the same, so we can compose and
send them in a batch.
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
Replicated objects need to be added into the cache immediately
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
When requesting a remote xlock or remote wrlock, the master request is
put into the lock object's REMOTEXLOCK waiting queue. The problem is
that a remote wrlock's target can be different from the lock's auth
MDS. When the lock's auth MDS recovers, MDCache::handle_mds_recovery()
may wake the wrong request. So just unify slave request waiting:
dispatch the master request when receiving the slave request reply.
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Locks' states should not change between composing the cache rejoin ack
messages and sending the message. If Locker::eval_gather() is called
in MDCache::{inode,dentry}_remove_replica(), it may wake requests and
change locks' states.
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
When an MDS becomes active, the table server re-sends 'agree' messages
for old prepared requests. If the recovered MDS starts a new table
request at the same time, the new request's ID can happen to be the
same as an old prepared request's ID, because the current table client
code assigns request IDs from zero after the MDS restarts.
This patch makes the table server send 'ready' messages when table
clients become active or when it becomes active itself. The 'ready'
message updates the table client's last_reqid to avoid request ID
collisions. The message also takes over the roles of the
finish_recovery() and handle_mds_recovery() callbacks for the table
client.
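
The effect of the 'ready' message on request IDs can be sketched
roughly as follows. This is a simplified illustration, not the real
table client. The struct name, the server_ready flag and the idea that
the message carries the highest ID the server still remembers are
assumptions.

```cpp
#include <cstdint>

// Hypothetical, simplified table client.
struct TableClientSketch {
  uint64_t last_reqid = 0;    // starts from zero after an MDS restart
  bool server_ready = false;  // set once a 'ready' message was handled

  // Assumed: 'ready' carries the highest request ID the server still
  // remembers for this client, so new IDs start above it.
  void handle_ready(uint64_t server_last_reqid) {
    if (server_last_reqid > last_reqid)
      last_reqid = server_last_reqid;
    server_ready = true;
  }

  // Allocate the next request ID. Callers are expected to wait
  // until server_ready is true before starting new requests.
  uint64_t next_reqid() { return ++last_reqid; }
};
```

A client that has not yet seen 'ready' would hand out IDs starting at
1 again, which is exactly the collision described above.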
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
An MDS in the clientreplay state is already serving requests. This
also makes MDS::handle_mds_recovery() and MDS::recovery_done() match.
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
The signal method removes conds from the list after it signals. That's
not okay if the cond triggers for some other reason; an invalid Cond*
will remain on the list and get signaled later.
Make the wait_on_list() helper remove it; use that in several callers;
explicitly do the removal in the remaining callers.
Change signal_cond_list() to not clear the list; rely on the signalees to
do that. Audit all users and make sure they either use the
wait_on_list() helper (which removes its Cond) or do the removal explicitly.
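
The ownership rule can be illustrated with a small sketch. It uses
std::condition_variable in place of the project's Cond type, and the
names signal_all() and ListWaiter are hypothetical. The caller is
assumed to hold whatever lock protects the list.

```cpp
#include <condition_variable>
#include <list>

using CondList = std::list<std::condition_variable*>;

// Signal every waiter but leave the list alone.
// Each waiter is responsible for removing its own entry.
inline void signal_all(CondList& ls) {
  for (auto* c : ls)
    c->notify_one();
}

// RAII registration: the cond is on the list exactly as long as the
// waiter exists, no matter why the wait ended.
class ListWaiter {
public:
  explicit ListWaiter(CondList& ls)
    : list_(ls), it_(ls.insert(ls.end(), &cond_)) {}
  ~ListWaiter() { list_.erase(it_); }
  ListWaiter(const ListWaiter&) = delete;
  ListWaiter& operator=(const ListWaiter&) = delete;

  std::condition_variable& cond() { return cond_; }

private:
  CondList& list_;
  std::condition_variable cond_;  // declared before it_, so it exists first
  CondList::iterator it_;
};
```

Because the waiter removes itself, a wait that ends for any other
reason cannot leave a dangling pointer behind for a later signal.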
Backport some form of this: bobtail
Signed-off-by: Sage Weil <sage@inktank.com>
discard, flush, and striping info slipped through the cracks before,
but are useful and trivial to add.
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
The python interface is a bit awkward since it maps directly
to the C interface, but it'll work well enough and not use
tons of memory.
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
All the other commands that display information have this.
For consistency, add it to this command too.
Also switch the plain output to use a TextTable for better readability.
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
Without this, the same seed is used each time, so multiple runs
of bench-write with the same parameters have the same I/O pattern.
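
A small sketch of why the seed matters. The helpers pick_offsets() and
fresh_seed() are hypothetical, and std::random_device is only one
possible seed source; the actual change may seed differently.

```cpp
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Pick n pseudo-random write offsets in [0, max_off).
// max_off must be greater than zero.
std::vector<uint64_t> pick_offsets(uint64_t seed, size_t n, uint64_t max_off) {
  std::mt19937_64 rng(seed);
  std::uniform_int_distribution<uint64_t> dist(0, max_off - 1);
  std::vector<uint64_t> out;
  out.reserve(n);
  for (size_t i = 0; i < n; ++i)
    out.push_back(dist(rng));
  return out;
}

// Produce a seed that differs from run to run.
uint64_t fresh_seed() {
  std::random_device rd;
  return (static_cast<uint64_t>(rd()) << 32) | rd();
}
```

With a fixed seed every run walks the same offsets; passing
fresh_seed() instead gives each run its own I/O pattern.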
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
Use int instead of bool for the callback, and make it represent
whether the data exists, rather than the opposite, since callers
are likely to test for whether it's data instead of whether it's zeroes.
Change the return value to 0, since an int64_t will wrap around
for large reads, and there's no value in reporting the length
read when it will always be the length requested clipped to the
size of the image.
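
A sketch of the callback shape described above. The Extent struct and
the names extent_cb, iterate_extents() and count_data() are
hypothetical and only illustrate the convention, not the library's
actual signatures.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical extent record used to drive the example.
struct Extent {
  uint64_t off;
  size_t len;
  bool has_data;
};

// exists is an int: non-zero means the extent holds data.
using extent_cb = int (*)(uint64_t off, size_t len, int exists, void* arg);

// Invoke cb for each extent. Returns 0 on success or the first
// negative callback result. The length is deliberately not returned.
int iterate_extents(const std::vector<Extent>& extents, extent_cb cb, void* arg) {
  for (const auto& e : extents) {
    int r = cb(e.off, e.len, e.has_data ? 1 : 0, arg);
    if (r < 0)
      return r;
  }
  return 0;
}

// Example callback: add up the bytes that actually hold data.
int count_data(uint64_t, size_t len, int exists, void* arg) {
  if (exists)
    *static_cast<uint64_t*>(arg) += len;
  return 0;
}
```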
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
safe_read() just protects against EINTR, and may return less data than
requested if it reaches the end of the file. Use safe_read_exact() to
make sure we get the right amount of data.
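
The difference between the two behaviors can be sketched with plain
POSIX read(). The functions read_retry() and read_exact() below are
illustrative stand-ins, not the project's helpers, and the -EDOM
result for a short read is a choice made for this sketch.

```cpp
#include <cerrno>
#include <cstddef>
#include <unistd.h>

// Read up to count bytes, retrying on EINTR. Stops early at end of
// file. Returns the number of bytes read, or -errno on error.
ssize_t read_retry(int fd, void* buf, size_t count) {
  size_t done = 0;
  while (done < count) {
    ssize_t r = ::read(fd, static_cast<char*>(buf) + done, count - done);
    if (r < 0) {
      if (errno == EINTR)
        continue;  // interrupted, try again
      return -errno;
    }
    if (r == 0)
      break;  // end of file
    done += static_cast<size_t>(r);
  }
  return static_cast<ssize_t>(done);
}

// Like read_retry(), but a short read counts as an error.
// Returns 0 on success and a negative value otherwise.
int read_exact(int fd, void* buf, size_t count) {
  ssize_t r = read_retry(fd, buf, count);
  if (r < 0)
    return static_cast<int>(r);
  return static_cast<size_t>(r) == count ? 0 : -EDOM;
}
```

A caller that needs a fixed-size header should use the exact variant,
so that a truncated file is reported instead of silently parsed.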
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
This will be jumpy since changed extents probably aren't evenly
distributed, but it's better than nothing.
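
One way to derive such a progress value, assuming it is computed from
the end offset of the extent just processed relative to the image
size. The helper progress_percent() is hypothetical.

```cpp
#include <cstdint>

// Percentage of the image covered once the extent ending at end_off
// has been processed. image_size must be non-zero.
int progress_percent(uint64_t end_off, uint64_t image_size) {
  if (end_off >= image_size)
    return 100;
  // Use floating point so end_off * 100 cannot overflow.
  return static_cast<int>(static_cast<long double>(end_off) * 100 / image_size);
}
```

If the changed extents cluster near the start or the end of the image,
the reported value jumps accordingly.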
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
We were using the internal CEPH_NOSNAP and CEPH_SNAPDIR constants, and
defining a clone_info_t::HEAD (with a different value). The docs were
referring to the internal constant names.
Instead, define librados constants (C and C++) with the same values as the
internal types.
Note that this changes the clone_info_t::HEAD value from -1 to -2 so that
it now matches the internal type.
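
A sketch of the resulting values. The names SNAP_HEAD and SNAP_DIR are
placeholders for illustration, and the -1 value for the snapdir
constant is an assumption not stated above.

```cpp
#include <cstdint>

// Placeholder names; HEAD is -2 as described above.
constexpr uint64_t SNAP_HEAD = static_cast<uint64_t>(-2);
// Assumed value for the snapdir constant.
constexpr uint64_t SNAP_DIR = static_cast<uint64_t>(-1);

static_assert(SNAP_HEAD == UINT64_MAX - 1,
              "HEAD is -2 as an unsigned 64-bit value");
static_assert(SNAP_HEAD != SNAP_DIR,
              "HEAD and the snapdir must stay distinct");
```

Under that assumption, a HEAD of -1 would have shared its value with
the snapdir constant.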
Signed-off-by: Sage Weil <sage@inktank.com>