Previously, we could get into a state where, although up[0] had been
fully backfilled, acting[0] could still be selected as primary if it was
able to pull another peer into the acting set. This also collects the
logic for choosing the best info into a helper function.
Signed-off-by: Samuel Just <samuel.just@dreamhost.com>
Previously, we called disk_tp.pause_new(). This can cause a race
where snap_trimmer queues more transactions after we flush the
store. Calling disk_tp.pause() under the osd_lock causes a
deadlock with pg removal.
Signed-off-by: Samuel Just <samuel.just@dreamhost.com>
If the target object is before last_backfill, then the backfill_target
will be asked to apply the operation. If one of the src objects is past
last_backfill, that will fail, so we need to wait until the src object
is no longer degraded.
Signed-off-by: Sage Weil <sage@newdream.net>
We need to preserve the order of write operations on each object. If we
have a write on X that needs to read from Y, and Y is degraded, then we
need to wait for Y to repair. Doing so blindly will allow other writes
to X to proceed while our clone op is still waiting, violating the
ordering.
Fix this by adding blocked_by and blocking vars to the ObjectContext. If
we wait on a src_oid, the oid is "blocked" by that object, and any
subsequent writes should also wait on the same queue.
Use a helper to do the cleanup when we complete recovery, or when the
pg resets.
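
A minimal sketch of the blocked_by/blocking idea, as a toy Python model
rather than the actual OSD code. Only blocked_by, blocking and
ObjectContext come from the text above; the waiters queue, submit_write()
and finish_recovery() are hypothetical names used for illustration.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ObjectContext:
    oid: str
    degraded: bool = False
    # Source object this object is waiting on, if any.
    blocked_by: Optional["ObjectContext"] = None
    # oids of objects that are waiting on this one.
    blocking: set = field(default_factory=set)
    # Hypothetical FIFO of (op, target) pairs parked on this object.
    waiters: deque = field(default_factory=deque)


def submit_write(op, target, applied, src=None):
    """Apply op to target, or park it if ordering requires a wait."""
    if target.blocked_by is not None:
        # An earlier write on target is still waiting on a src object;
        # queue behind it so per-object write order is preserved.
        target.blocked_by.waiters.append((op, target))
        return False
    if src is not None and src.degraded:
        # Mark the relationship in both directions, then wait.
        target.blocked_by = src
        src.blocking.add(target.oid)
        src.waiters.append((op, target))
        return False
    applied.append(op)
    return True


def finish_recovery(src, applied):
    """Cleanup helper: unblock everything parked on src, in order."""
    src.degraded = False
    waiters, src.waiters = src.waiters, deque()
    for _, target in waiters:
        target.blocked_by = None
    src.blocking.clear()
    for op, _ in waiters:
        applied.append(op)
```

With X blocked by a degraded Y, a clone from Y followed by a plain write
to X are both parked and later replayed in submission order.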
Signed-off-by: Sage Weil <sage@newdream.net>
Maintain the backfill target's pg stats as the sum over the objects to
the left of last_backfill. Reflect this in the degraded stats we report
to the monitor.
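
A rough illustration of the accounting, in Python. The (sort_key, size)
tuples, the stat field names, and treating last_backfill as an inclusive
bound are all assumptions made for the sketch, not the real pg stat
structures.

```python
def backfill_target_stats(objects, last_backfill):
    """Sum stats over objects at or to the left of last_backfill.

    objects: iterable of (sort_key, size_bytes) pairs (hypothetical shape).
    """
    done = [(k, size) for k, size in objects if k <= last_backfill]
    return {
        "num_objects": len(done),
        "num_bytes": sum(size for _, size in done),
    }


def degraded_object_count(objects, last_backfill):
    """Objects the backfill target does not have yet."""
    objects = list(objects)
    target = backfill_target_stats(objects, last_backfill)
    return len(objects) - target["num_objects"]
```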
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
Objects yet to be backfilled do not show up in the missing set. Thus,
we cannot use an object past last_backfill to clone into the object we
are pushing/pulling.
Signed-off-by: Samuel Just <samuel.just@dreamhost.com>
Set/clear states in the peering state machine's state ctors/dtors where possible.
Set degraded if the number of non-backfilling replicas is lower than the
target replication factor.
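
The degraded rule can be sketched as a small predicate. This is a Python
model; the parameter names acting, backfill_targets and pool_size are
assumptions for illustration.

```python
def is_degraded(acting, backfill_targets, pool_size):
    """True if fewer than pool_size replicas are not being backfilled."""
    complete = [osd for osd in acting if osd not in backfill_targets]
    return len(complete) < pool_size
```

For example, with acting = [0, 1, 2] and osd 2 still backfilling, a pool of
size 3 counts as degraded, while the same pg is not degraded once the
backfill set is empty.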
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
Always ship log for updates to backfill targets to preserve the repgather
ordering.
Fix up recover_backfill() bounds. Re-scan the local collection on every
pass in case there were concurrent modifications. (This could be optimized.)
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
This means we should set it to a hash boundary or the last item of our
result set (not the next item we didn't include).
It means that during backfill we can set last_backfill to the last
object we actually recovered and be sure that any new files created
locally will be included in the next result set. We can bound that result
set by the last object recovered without including it in the resulting
range.
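
A small Python sketch of listing with an exclusive lower bound, assuming a
plain sorted list of names and ignoring hash boundaries. list_range() and
its parameters are hypothetical, not the object store API.

```python
import bisect


def list_range(sorted_objects, after, max_items):
    """Return up to max_items objects strictly greater than `after`.

    The returned bound is the last item included in the result, not the
    first item that was left out.
    """
    start = bisect.bisect_right(sorted_objects, after)
    result = sorted_objects[start:start + max_items]
    bound = result[-1] if result else after
    return result, bound
```

Because the bound is the last item returned, an object created between
passes that sorts just after it still shows up in the next result set.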
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
Use get_effective_key() to return key (if explicit) or object name. Sort
by that within each hash value.
Clean up operator<< so that it prints things in sort order.
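
The ordering can be modelled roughly as below. This is a simplified Python
stand-in for hobject_t: only the hash value, the optional key and the
object name are considered, and the field names and print format are
assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HObject:
    hash_value: int
    name: str
    key: str = ""  # explicit locator key; empty means "not set"

    def effective_key(self):
        """Key if explicit, otherwise the object name."""
        return self.key if self.key else self.name

    def sort_key(self):
        # Hash first, then effective key; name only breaks ties.
        return (self.hash_value, self.effective_key(), self.name)

    def __str__(self):
        # Print fields in the same order they are compared.
        return f"{self.hash_value:08x}/{self.effective_key()}/{self.name}"
```

Sorting with sorted(objs, key=HObject.sort_key) then groups objects by
hash and orders them by effective key within each hash value.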
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
We always fill from the bottom up anyway. Using an hobject_t also gives us
a precise bound. It also makes things conceptually simpler: last_complete
and last_backfill each bound one of the two dimensions of updatedness.
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>