This should only occur with the root inode, but caused a segfault for
anybody running more than one MDS who restarted.
Signed-off-by: Greg Farnum <gregf@hq.newdream.net>
Avoid confusing other code (e.g. kick_flushing_caps) by not staying on the
mds flushing_caps list when we no longer have an auth_cap with that mds.
We'll need to re-flush to a new MDS later.
Signed-off-by: Sage Weil <sage@newdream.net>
This should only return true when recovery is done, i.e., no more missing
objects. Nothing to do with unfound.
Signed-off-by: Sage Weil <sage@newdream.net>
In PG::is_all_uptodate, don't try to look for peer_missing[osd->whoami].
The primary keeps that in PG::missing!
Signed-off-by: Colin McCabe <colinm@hq.newdream.net>
Formerly, we had a special case in read_log for dealing with objects
whose data was present on the disk, but not their attributes. This
conflicts with our plans to mark objects as lost by putting a bit in the
object attributes, since without those attributes, we'll never know if
the objects were formerly marked as lost.
This should almost never happen, and if it does, we just handle the
objects as missing in the normal way.
Signed-off-by: Colin McCabe <colinm@hq.newdream.net>
The might_have_unfound set is used by the primary OSD during recovery.
This set tracks the OSDs which might have unfound objects that the
primary OSD needs. As we receive Missing from each OSD in
might_have_unfound, we will remove the OSD from the set.
When might_have_unfound is empty, we will mark objects as LOST if the
latest version of the object resided on an OSD marked as lost.
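
The bookkeeping described above can be sketched roughly as follows. This
is a simplified model, not the actual PG code: the type UnfoundTracker
and its method names are hypothetical and exist only for illustration.

```cpp
#include <cassert>
#include <set>

// Hypothetical, simplified model of the primary's bookkeeping.
// UnfoundTracker, on_missing_received and ready_to_mark_lost are
// illustrative names, not identifiers from the Ceph source.
struct UnfoundTracker {
  // OSD ids from which the primary still expects a Missing message.
  std::set<int> might_have_unfound;

  // Called when a Missing message arrives from OSD `from`.
  void on_missing_received(int from) {
    might_have_unfound.erase(from);
  }

  // Objects may be marked LOST only after every candidate OSD has reported.
  bool ready_to_mark_lost() const {
    return might_have_unfound.empty();
  }
};
```

The point of the set is ordering: no object is declared LOST while some
OSD that might still hold it has not yet been heard from.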
Signed-off-by: Colin McCabe <colinm@hq.newdream.net>
No need to specify destination in send_reply, as we always have the request
for reference.
Simplify MRoute constructors (keep the ones we use) for tid and bcast
best-effort case.
Do NOT do a best-effort forward of a reply with a tid specified if the tid
is not in the routed-request map.
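
The forwarding rule above might look roughly like this. It is a sketch
under assumptions: RouterSketch, its map layout, and the convention that
tid 0 means "no tid specified" are all hypothetical, not the monitor's
real interface.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>

// Hypothetical model of reply routing; none of these names come from
// the Ceph source.
struct RouterSketch {
  // tid -> name of the client that originally sent the routed request.
  std::map<uint64_t, std::string> routed_requests;

  // Returns the destination for a reply, or an empty string if the
  // reply must be dropped. In this sketch, tid == 0 means "no tid".
  std::string route_reply(uint64_t tid, const std::string& best_effort_dest) const {
    if (tid == 0)
      return best_effort_dest;           // best-effort forward is allowed
    auto it = routed_requests.find(tid);
    if (it == routed_requests.end())
      return std::string();              // unknown tid: do NOT forward
    return it->second;
  }
};
```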
Signed-off-by: Sage Weil <sage@newdream.net>
When activating an inactive replica, assert that we are doing so based
on a message from the primary.
Signed-off-by: Colin McCabe <colinm@hq.newdream.net>
We want to remove replicas that we don't ack, but those don't appear in
the strong_inode map; they're appended to the base_inode bufferlist. Make
a (temporary) set to track who those are so that we know who to get rid of.
Signed-off-by: Sage Weil <sage@newdream.net>
This removes a compiler warning that appeared after a gcc upgrade and is
apparently erroneous. The warning claims that its usage violates
strict-aliasing rules when the + operator is used.
This actually is initialized before all uses, but compilers tend to
have trouble with assignment in if-else branches, and -1 is considered
invalid so there's no danger of refactoring breaking anything.
switching to a new journal segment.
MDSCache:
The stray member has been replaced with strays, an array of inodes
representing the set of available stray directories, as well as
stray_index indicating the index of the current stray directory.
get_stray() now returns a pointer to the current stray directory
inode.
advance_stray() advances stray_index to the next stray directory.
migrate_stray no longer takes a source argument; the source mds
is inferred from the parent of the dir entry.
stray dir entries are now stray<index> rather than stray.
scan_stray_dir now scans all stray directories.
MDSLog:
start_new_segment now calls advance_stray() on MDSCache to force a new
stray directory.
mdstypes:
NUM_STRAY indicates the number of stray directories to use per MDS
MDS_INO_STRAY now takes an index argument as well as the mds number
MDS_INO_STRAY_OWNER(i) returns the mds owner of the stray directory i
MDS_INO_STRAY_INDEX(i) returns the index of the stray directory i
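
One way to picture the inode numbering and the rotation is the sketch
below. The constants and function names are assumptions chosen for
illustration (the real definitions live in mdstypes and may use
different values and a different layout).

```cpp
#include <cassert>
#include <cstdint>

// Assumed values for illustration only.
constexpr uint64_t STRAY_INO_OFFSET = 0x600;  // hypothetical base inode number
constexpr unsigned NUM_STRAY_SKETCH = 10;     // hypothetical strays per MDS

// Inode number of stray directory `index` owned by MDS `mds`.
constexpr uint64_t stray_ino(unsigned mds, unsigned index) {
  return STRAY_INO_OFFSET + static_cast<uint64_t>(mds) * NUM_STRAY_SKETCH + index;
}

// Inverse mappings: recover the owning MDS and the index from an inode number.
constexpr unsigned stray_owner_of(uint64_t ino) {
  return static_cast<unsigned>((ino - STRAY_INO_OFFSET) / NUM_STRAY_SKETCH);
}
constexpr unsigned stray_index_of(uint64_t ino) {
  return static_cast<unsigned>((ino - STRAY_INO_OFFSET) % NUM_STRAY_SKETCH);
}

// Round-robin choice of the current stray directory, advanced once per
// new journal segment in this model.
struct StrayRing {
  unsigned current = 0;
  void advance() { current = (current + 1) % NUM_STRAY_SKETCH; }
};
```

Because owner and index are both recoverable from the inode number,
callers such as migrate_stray do not need a separate source argument in
this model.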
Signed-off-by: Samuel Just <samuelj@hq.newdream.net>
PG::generate_past_intervals needs to generate all the intervals back to
history.last_epoch_clean, rather than just to
history.last_epoch_started. This is required by
PG::build_might_have_unfound, which needs to examine these intervals
when building the might_have_unfound set.
Move the check for whether past_intervals is up-to-date into
generate_past_intervals itself. Fix the check.
Signed-off-by: Colin McCabe <colinm@hq.newdream.net>
We need to set the subtree bounds before trimming it away, or else we may
throw out things we're still auth for.
Signed-off-by: Sage Weil <sage@newdream.net>
If we come back up on the same address, there is a possible race. Other
nodes will mark_down when they see us go down. If we go up first, queue
some messages, and _then_ they see that we're down and mark_down, the
messages we queued will get lost. Since it's stateful on the cluster
backend, we need to introduce an ordering so that closing out the _old_
session doesn't break the new session. We do this by binding to a new
address (just a different port, actually) before marking ourselves back
up.
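
The race can be illustrated with a toy model of a peer that keys its
sessions by address. PeerSketch and the address strings are hypothetical;
the real messenger is considerably more involved.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Toy model of a peer that tracks one session per remote address.
// All names here are illustrative, not from the Ceph messenger.
struct PeerSketch {
  std::map<std::string, std::vector<std::string>> sessions;  // addr -> queued msgs

  void deliver(const std::string& from, const std::string& msg) {
    sessions[from].push_back(msg);
  }

  // Tear down the session for `addr`, discarding anything queued on it.
  void mark_down(const std::string& addr) {
    sessions.erase(addr);
  }

  std::size_t queued(const std::string& addr) const {
    auto it = sessions.find(addr);
    return it == sessions.end() ? 0 : it->second.size();
  }
};
```

If the restarted daemon reuses its old address, a late mark_down on that
address discards the newly queued messages. If it binds to a different
port first, the late mark_down only touches the old session.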
Fixes #592.
Signed-off-by: Sage Weil <sage@newdream.net>
Closes out all old connections and binds to a _different_ port. This
ensures that someone doing mark_down on our old address won't get us.
Signed-off-by: Sage Weil <sage@newdream.net>
Accomplish this by making a list of cap releases in the (permanent)
MetaRequest, and then copying that into the (potentially-temporary)
MClientRequest.
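
A rough sketch of that copy step, assuming simplified stand-in types
(CapReleaseSketch, MetaRequestSketch and ClientRequestMsgSketch are
hypothetical, as are their fields):

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-ins for the real structures.
struct CapReleaseSketch {
  uint64_t ino;   // inode whose caps are being released
  unsigned caps;  // bitmask of released caps
};

// Long-lived request state: survives resends.
struct MetaRequestSketch {
  std::vector<CapReleaseSketch> cap_releases;
};

// Wire message: may be rebuilt each time the request is (re)sent.
struct ClientRequestMsgSketch {
  std::vector<CapReleaseSketch> releases;
};

// Build a fresh message, copying the releases from the permanent request.
inline ClientRequestMsgSketch build_message(const MetaRequestSketch& req) {
  ClientRequestMsgSketch m;
  m.releases = req.cap_releases;
  return m;
}
```

Keeping the list on the permanent request means a rebuilt message still
carries the same releases.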
Always set up cluster_messenger (previously we would only do so if there
was an explicit address configured for it). The overhead to do so is minimal,
it simplifies the code, and will allow us to fix down->up transitions
(later).
Signed-off-by: Sage Weil <sage@newdream.net>