Filestore now properly fails to clone a non-existent object, so we need to
create the object before cloning it.
Fixes: #2062
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
These are also defined internally in ceph_fs.h, so use a guard. Annoying,
but gives us consistent naming (ceph_*/CEPH_*, not LIBCEPHFS_SETATTR_*).
Signed-off-by: Sage Weil <sage@newdream.net>
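The guard has roughly the shape sketched below; the constant names follow the
CEPH_* convention mentioned above, but the values shown are illustrative, not
authoritative.

    /* Sketch only: guard the constants so that including both libcephfs.h
     * and the internal ceph_fs.h does not redefine them. Values illustrative. */
    #ifndef CEPH_SETATTR_MODE
    #define CEPH_SETATTR_MODE  1
    #define CEPH_SETATTR_UID   2
    #define CEPH_SETATTR_GID   4
    #endif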
For now, until we have a better handle on the ext4 bug, and demonstrate
that it is a clear performance win with the full stack.
Signed-off-by: Sage Weil <sage@newdream.net>
Push progress is now represented by ObjectRecoveryProgress. In
particular, rather than tracking data_subset_*ing, we track the furthest
offset before which the data will be consistent once cloning is complete.
sub_op_push now separates the pull response implementation from the
replica push implementation.
Signed-off-by: Samuel Just <samuel.just@dreamhost.com>
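For orientation, a hedged sketch of the kind of progress record this
describes; the fields below are illustrative and are not the exact
ObjectRecoveryProgress definition.

    // Illustrative sketch only, not the real ObjectRecoveryProgress.
    #include <cstdint>

    struct object_recovery_progress_sketch {
      uint64_t data_recovered_to = 0; // data before this offset will be
                                      // consistent once cloning is complete
      bool first = true;              // no push sent yet for this object
      bool data_complete = false;     // all object data has been pushed
    };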
Require it for osd <-> osd and osd <-> mon communication.
This covers all the new encoding changes, except hobject_t, which is used
between the rados command line tool and the OSD for an object listing
position marker. We can't distinguish between specific types of clients,
though, and we don't want to introduce any incompatibility with other
clients, so we'll just have to make do here. :(
Signed-off-by: Sage Weil <sage@newdream.net>
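Requiring the feature amounts to a mask check on a connection's negotiated
feature bits; the bit name and value below are made up purely for this sketch.

    // Sketch only: FEATURE_NEW_ENCODING is an illustrative name/bit.
    #include <cstdint>

    static const uint64_t FEATURE_NEW_ENCODING = 1ULL << 30;

    // osd <-> osd and osd <-> mon connections must advertise the feature.
    inline bool peer_supports_new_encoding(uint64_t peer_features) {
      return (peer_features & FEATURE_NEW_ENCODING) != 0;
    }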
A write may, via make_writeable, trigger the creation of a clone that
sorts before the object being written.
Signed-off-by: Samuel Just <samuel.just@dreamhost.com>
Signed-off-by: Sage Weil <sage@newdream.net>
If is_degraded returns true for backfill, the object may not be
in any replica's missing set. Only call start_recovery_op if
we actually started an op. This bug could cause the PG to get
stuck in backfill.
Signed-off-by: Samuel Just <samuel.just@dreamhost.com>
Signed-off-by: Sage Weil <sage@newdream.net>
We haven't used this feature for years and years, and don't plan to. It
was there to facilitate "read shedding", where the primary OSD would
forward a read request to a replica. However, replicas can't reply to the
client in that case because OSDs no longer initiate connections (they once
did).
Rip this out for now, especially since osd_peer_stat_t just changed.
Signed-off-by: Sage Weil <sage@newdream.net>
We weren't using this, and it had broken (raw) encoding. The constructor
also didn't initialize fields properly.
Clear out the struct and use the new encoding scheme, so we can cleanly
add fields moving forward.
Signed-off-by: Sage Weil <sage@newdream.net>
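A generic, self-contained sketch of the versioned-encoding idea (the real
code uses Ceph's encoding helpers, not this hand-rolled form): the struct
version is written first, so fields appended in later versions don't break
older decoders.

    // Sketch only: hand-rolled stand-in for a versioned encode/decode pair.
    #include <cstdint>
    #include <cstring>
    #include <vector>

    struct peer_stat_sketch {
      uint8_t struct_v = 1;       // bump when appending new fields
      double  rd_latency = 0.0;   // illustrative field

      void encode(std::vector<uint8_t>& bl) const {
        bl.push_back(struct_v);   // version first...
        const uint8_t *p = reinterpret_cast<const uint8_t*>(&rd_latency);
        bl.insert(bl.end(), p, p + sizeof(rd_latency));  // ...then the fields
      }

      void decode(const std::vector<uint8_t>& bl) {
        struct_v = bl.at(0);
        std::memcpy(&rd_latency, bl.data() + 1, sizeof(rd_latency));
        // fields appended by a newer encoder follow here and are ignored
      }
    };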
Take a bool so that we initialize the last_epoch_started properly on
newly created PGs. This gives us a single code path for all new PGs.
We drop the clear_primary_state(), which has no effect, given that this is
a newly constructed PG.
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
Use the helper for PGs that are newly instantiated on the local OSD.
This fixes the initialization of pg->info.stats.{up,acting,mapping_epoch}.
We also get rid of a premature (and useless) write_info/log, which writes
bad information and is soon followed by the real, correct one.
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
Move initialization of misc elements of the new pg from OSD.cc to a PG
method. No change in functionality.
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
Currently we update the overall heartbeat peers by looking directly at
per-pg state. This is potentially problematic now (#2033), and definitely
so in the future when we push more peering operations into the work queues.
Create a per-pg set of peers, protected by an inner lock, and update it
using PG::update_heartbeat_peers() when appropriate under pg->lock. Then
aggregate it into the osd peer list in OSD::update_heartbeat_peers() under
osd_lock and the inner lock.
We could probably have re-used osd->heartbeat_lock instead of adding a
new pg->heartbeat_peer_lock, but the finer-grained locking can't hurt.
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
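A minimal sketch of the locking scheme described above, with illustrative
bodies (the real PG/OSD classes are considerably more involved):

    // Sketch of the two-level locking: the per-pg set is guarded by an inner
    // lock and published under pg->lock; the OSD aggregates it under osd_lock
    // plus the inner lock, without needing pg->lock.
    #include <mutex>
    #include <set>

    struct pg_sketch {
      std::mutex lock;                 // "pg->lock"
      std::mutex heartbeat_peer_lock;  // inner lock
      std::set<int> heartbeat_peers;

      void update_heartbeat_peers(const std::set<int>& peers) {
        // caller holds pg->lock; take only the inner lock to publish
        std::lock_guard<std::mutex> l(heartbeat_peer_lock);
        heartbeat_peers = peers;
      }
    };

    struct osd_sketch {
      std::mutex osd_lock;
      std::set<int> heartbeat_peers;

      void update_heartbeat_peers(pg_sketch& pg) {
        std::lock_guard<std::mutex> l(osd_lock);
        std::lock_guard<std::mutex> h(pg.heartbeat_peer_lock);
        heartbeat_peers.insert(pg.heartbeat_peers.begin(),
                               pg.heartbeat_peers.end());
      }
    };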
We recently added a flush on activate, but we are still building the
transaction (the caller queues it), so calling osr.flush() here is totally
useless.
Instead, set a flag 'need_flush', and do the flush the next time we receive
some work.
This has the added benefit of doing the flush in the worker thread, outside
of osd_lock.
Signed-off-by: Sage Weil <sage@newdream.net>
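A hedged sketch of the deferred-flush flag; the names are illustrative and
the real code hangs this off the PG and its OpSequencer.

    // Sketch: record that a flush is needed at activate time, and perform it
    // from the worker thread on the next piece of work, outside osd_lock.
    struct pg_flush_sketch {
      bool need_flush = false;

      void on_activate() {
        // the activation transaction is still being built here, so flushing
        // now would be useless; just remember to do it later
        need_flush = true;
      }

      void do_next_work() {
        if (need_flush) {
          // osr.flush() would go here: wait for queued transactions to apply
          need_flush = false;
        }
        // ... handle the incoming work ...
      }
    };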
If we are blackholing the disk, we need to make flush() wait forever, or
else the flush() logic will return (the IO wasn't queued!) and higher
layers will continue and (eventually) misbehave.
Signed-off-by: Sage Weil <sage@newdream.net>
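The intent, in a tiny illustrative sketch (assuming a blackhole debug option
like the one described; this is not the actual FileStore code):

    // Sketch: blackholed IOs were never queued, so a normal flush() would
    // return immediately; block forever instead.
    #include <chrono>
    #include <thread>

    void flush_sketch(bool blackhole) {
      if (blackhole) {
        for (;;)  // never return: callers must not assume the IO completed
          std::this_thread::sleep_for(std::chrono::hours(1));
      }
      // ... normal path: wait for queued transactions to reach disk ...
    }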
We can receive an op with an old SnapContext that includes snaps that we've
already trimmed or are in the process of trimming. Filter them out!
Otherwise we will recreate and add links into collections we've already
marked as removed, and we'll get things like ENOTEMPTY when we try to
remove them. Or just leave them lying around.
Fixes: #1949
Signed-off-by: Sage Weil <sage@newdream.net>
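A self-contained sketch of the filtering step with illustrative names; the
real check consults the pool's trimmed/trimming snap state.

    // Sketch: drop snap ids that have already been trimmed (or are being
    // trimmed) from an incoming op's SnapContext before acting on it.
    #include <cstdint>
    #include <set>
    #include <vector>

    using snapid_sketch_t = uint64_t;

    std::vector<snapid_sketch_t>
    filter_snapc_sketch(const std::vector<snapid_sketch_t>& snaps,
                        const std::set<snapid_sketch_t>& trimmed_or_trimming) {
      std::vector<snapid_sketch_t> out;
      out.reserve(snaps.size());
      for (snapid_sketch_t s : snaps) {
        if (!trimmed_or_trimming.count(s))  // keep only still-valid snaps
          out.push_back(s);
      }
      return out;
    }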