There are two problems:
1) We choose the min last_update amoung peers with the max local-les
value as an upper bound on requests which could have been reported to
the client as committed. We then, for ec pools, roll back to that point
to ensure that we don't inadvertently commit to an update which fewer
than K replicas actually saw. If the primary sets local-les, accepts an
update from a client, and there is a new interval before any of the
replicas have been activated, we will end up being forced to use that
update which no other replica has seen as the new last_update. This
will cause the object to become unfound. We don't have this problem as
long as all active replicas agree on last_update before we accept IO.
2) Even for replicated pools, we would then immediately respond to the
request which created the primary-only update with a commit since it is
in the log and we have no outstanding repops. If we then lose that
primary before any of the replicas in the new interval record the new
log, we will not only lose the object, but also the log entry recording
it, which will result in a lost write.
For these reasons, it seems like we need to wait for the replicas to
activate before we can process new requests essentially because whatever
update we select as last_update is essentially regarded as committed as
soon as we accept IO.
Fixes: #7649
Signed-off-by: Samuel Just <sam.just@inktank.com>
Otherwise, we see a different object_info_t depending on whether the
transaction deleting the object clears before another op recreating it appears.
In particular, we use oi.version to set the prior_version on the log entries in
finish_ctx. If the oi is allowed to stick around the recreation log event will
have a prior version of the deletion event when it should have a prior version
of eversion_t().
Fixes: #7655
Signed-off-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
osd: Add hit_set_flushing to track current flushes and prevent races
Reviewed-by: Sage Weil <sage@inktank.com>
Reviewed-by: Samuel Just <sam.just@inktank.com>
Add debug_ prefix also for 'ceph --admin-daemon *.asok config show'
as already done e.g. by 'ceph-osd --show-config'.
Fixes: #7602
Signed-off-by: Danny Al-Gaaf <danny.al-gaaf@bisect.de>
Reviewed-by: Sage Weil <sage@inktank.com>
- fix the wait check for osds to come back up
- make sure they get marked back in, too
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Dan Mick <dan.mick@inktank.com>
mon: fix check for primary-affinity feature bit, and fix a race in similar checks
Reviewed-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: Joao Eduardo Luis <joao.luis@inktank.com>
For erasure pools, these may not match.
In the case of #7652, this caused pg_create messages to be send
indefinitely. register_pg() added it to the list for acting_primary, and
when we got the (non-creating) pg stat update we removed it from the list
for acting[0].
Fixes: #7652
Signed-off-by: Sage Weil <sage@inktank.com>
The check for OSD features may race with the boot of an OSD that does not
have the necessary features. Check the pending info too, and if there is
a missing feature, return -EAGAIN. In the callers, wait on -EAGAIN.
Signed-off-by: Sage Weil <sage@inktank.com>
Make sure all running OSDs support the feature before we start using it
(even if the config option is on!).
Fixes: #7642
Signed-off-by: Sage Weil <sage@inktank.com>
'rados cppool' copies the contents but that doesn't make the destination
pool an unmanaged snaps pool. Therefore, we must get an ENOTSUP when
we try to remove an unmanaged snap from a not-unmanaged pool.
Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
Although we should allow creating unmanaged snaps on not-unamanaged pools,
as long as those pools don't have any managed snapshots in them, we cannot
allow removal -- because the pool will not have any unmanaged snapshots.
Fixes: 7210
Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
When flushing a HitSet track in hit_set_flushing map so that
agent_load_hit_sets() doesn't try to read it too soon.
Fixes: #7575
Signed-off-by: David Zafman <david.zafman@inktank.com>
We had an old invariant that agent_queue would have at least 1 entry in
it to simplify some other code paths, but it turns out that it is simpler
not to do that.
In particular, this was triggering a failed assertion on shutdown when we
assert that the queue is empty.
Dump offending items on shutdown if they are there, tho, to catch any
future bugs.
Fixes: #7637
Signed-off-by: Sage Weil <sage@inktank.com>
Each upstart/*-all-starter.conf use the same script to find the list of
daemons and their ids. Copy it over to the corresponding logrotate.conf
script instead of using a less reliable script based on initctl list
output.
If logrotate fails to run initctl reload on a daemon, it will keep
writing to the rotated log file, even after it is deleted and until it
fills the disk. By using the exact same shell snippet as the upstart
scripts used to start the daemon, all of them will be sent the HUP
signal and reopen the log file that was just rotated.
http://tracker.ceph.com/issues/7072fixes#7072
Signed-off-by: Loic Dachary <loic@dachary.org>
Otherwise, two objects with different namespaces but
the same object_t will end up clobbering each other's
contexts.
Fixes: #7634
Signed-off-by: Samuel Just <sam.just@inktank.com>
This wreaks havoc on our QA because it marks osds up and down and then
immediately after that we try to scrub and some osds are still down.
Adjust the CLI test to wait for all OSDs to come back up after thrashing.
Signed-off-by: Sage Weil <sage@inktank.com>
Previously, a _delete_head() followed by a recreation on an object in
the same transaction would result in num_dirty being decremented in
_delete_head() without the flag being cleared. make_writeable() would
then see exists and was_dirty and therefore not increment num_dirty
resulting in a mismatch. Rather than trying to maintain the num_dirty
number in _delete_head(), rollback_to(), and make_writeable(), it seems
simpler to do the adjustment once in make_writeable based on undirty,
ctx->obc->obs.oi, and ctx->new_obs->oi.
Fixes: 7393
Signed-off-by: Samuel Just <sam.just@inktank.com>
Otherwise, our attempt to sanitize object_size bytes of
data.object_contents will be doomed to memory corruption.
Fixes: #7610
Signed-off-by: Samuel Just <sam.just@inktank.com>