Caused by 3bd48cbbad
feature 4207 implementation
Signed-off-by: David Zafman <david.zafman@inktank.com>
Reviewed-by: Gary Lowell <gary.lowell@inktank.com>
(cherry picked from commit e1e2d5d217)
This ensures that the paxos state is not active when the PaxosService
restart() methods run right afterwards, and that EAGAIN waiters will get
requeued appropriately.
Signed-off-by: Sage Weil <sage@inktank.com>
In 7aec13f749 we started passing non-zero
return values to these completions; now we have to deal with them
accordingly.
RetryMessage behaves just like the Monitor variant.
Propose and Committed update state but otherwise ignore non-zero
return values.
Signed-off-by: Sage Weil <sage@inktank.com>
- Cancel the proposal waiters with EAGAIN on election, etc.
- Drop the wakeup helper and open-code the one caller.
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Joao Eduardo Luis <joao.luis@inktank.com>
If queue_pos == header.max_size when we create the entry
header magic, the entry will be rejected at get_top() on
replay.
Fixes: #4436
Backport: bobtail
Signed-off-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Otherwise:
1) expand_pg_num removes a splitting pg entry
2) peering thread grabs pg lock and starts split
3) OSD::consume_map grabs pg lock and starts removal
At step 2), we run afoul of the assert(is_splitting)
check in split_pgs. This way, the would-be splitting
pg is marked as removed prior to the splitting state
being updated.
Backport: bobtail
Fixes: #4449
Signed-off-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
1) Replica sends notify
2) Prior to processing notify, primary queues query to replica
3) Primary processes notify and activates sending MOSDPGLog
to replica.
4) Primary does do_notifies at end of process_peering_events
and sends the Query.
5) Replica sees MOSDPGLog and activates
6) Replica sees Query and asserts.
In the above case, the Replica should simply ignore the old
Query.
Fixes: #4050
Backport: bobtail
Signed-off-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
I broke this in 4637752db6 when I
restructured this function. Only try to increase the max if we are
the leader.
Signed-off-by: Sage Weil <sage@inktank.com>
The ceph-mds.conf file moved from the ceph package to the
ceph-mds package. Add replaces/breaks statements to the
control file to handle this on upgrade.
Signed-off-by: Gary Lowell <gary.lowell@inktank.com>
This ensures that when we then start individual mds instances, we can
stop ceph-mds-all and they will get stopped. We do the same already for
ceph-all.
Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 41897fcba1)
In get_local_daemon_list() the sed expression trimming the cluster
name from the host name was trimming too much if the host name
contained hyphens.
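
For illustration only, the intended behavior can be sketched in Python; the
real fix is in the script's sed expression, which is not reproduced here. The
function name and the "<cluster>-<name>" layout are assumptions made for this
sketch. Only the leading cluster name is stripped, so hyphens later in the
name survive.

```python
def strip_cluster_prefix(entry: str, cluster: str = "ceph") -> str:
    """Remove a leading '<cluster>-' from entry; keep all later hyphens.

    Hypothetical helper: 'entry' and 'cluster' are illustrative names,
    not identifiers taken from the init script.
    """
    prefix = cluster + "-"
    if entry.startswith(prefix):
        return entry[len(prefix):]
    # No cluster prefix present: return the entry unchanged.
    return entry
```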
Signed-off-by: Gary Lowell <gary.lowell@inktank.com>
Move declarations above error conditions so we can goto done almost
everywhere. Remove cpp_strerror printing, since it will be done by the
caller.
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
On some kernels and filesystems fiemap can be racy and provide
incorrect data even after an fsync. Later we can use SEEK_HOLE and
SEEK_DATA, but for now just detect zero runs like we do with stdin.
Basically this adapts import from stdin to work in the case of a file
or block device, and gets rid of other cruft in the import that used
fiemap.
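
A minimal sketch of zero-run detection, in Python rather than the tool's own
code: read fixed-size blocks and skip the write for any block that is all
zeros. The function name, the write_at callback, and the 4096-byte block size
are assumptions for this example, not the rbd implementation.

```python
import io


def import_sparse(src, write_at, block_size=4096):
    """Copy from src, calling write_at(offset, data) only for non-zero blocks.

    Returns the total number of bytes read. All names here are
    hypothetical and chosen for illustration.
    """
    zeros = bytes(block_size)
    offset = 0
    while True:
        buf = src.read(block_size)
        if not buf:
            break
        # Compare against a zero buffer of the same length; a short
        # final read is handled by slicing.
        if buf != zeros[:len(buf)]:
            write_at(offset, buf)
        offset += len(buf)
    return offset
```

Because the offset advances even when a block is skipped, the destination
stays sparse wherever the source held zeros.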
Fixes: #4388
Backport: bobtail
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
Simplify the logic a bit so it is easier to follow.
Small behavior change: we will successfully allocate and return a gid that
== the max when we can't bump it.
Signed-off-by: Sage Weil <sage@inktank.com>
This only happens on the Leader and leads to duplicate global_ids.
Fixes: #4285
Signed-off-by: Joao Luis <joao.luis@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
When proposing an older value learned during recovery, we don't create
a queued proposal -- we go straight through Paxos. Therefore, when
finishing a proposal, we must be sure that we have a proposal in the queue
before dereferencing it, otherwise we will segfault.
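
The guard can be sketched as follows, in Python and with hypothetical names
(the monitor's actual queue type and function are not shown in this message):
check that the queue is non-empty before touching its front element.

```python
from collections import deque


def finish_proposal(queue):
    """Pop and return the queued proposal, or None if nothing was queued.

    Hypothetical sketch: a value proposed directly (for example one
    learned during recovery) never entered the queue, so the queue may
    legitimately be empty here.
    """
    if not queue:
        return None
    return queue.popleft()
```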
Fixes: #4250
Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
Fixes: #4425
Backport: bobtail
Apparently, libcurl needs that in order to be thread safe. Side
effect is that if libcurl is not compiled with c-ares support,
domain name lookups are not going to time out.
Issue affected keystone.
Signed-off-by: Yehuda Sadeh <yehuda@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
The cache stores which objects don't exist. Flatten bypasses the cache
when doing its copyups, so when it is done the -ENOENT from the cache
is treated as zeroes instead of 'need to read from parent'.
Clients that have the image open need to forget about the cached
non-existent objects as well. Do this during ictx_refresh, while the
parent_lock is held exclusively, so no new reads from the parent can
happen until the updated parent metadata is visible.
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
Clear the exists and complete flags for any objects that have exists
set to false, and force any in-flight reads to retry if they get
-ENOENT instead of generating zeros.
This is useful for getting the cache into a consistent state for rbd
after an image has been flattened, since many objects which previously
did not exist and went up to the parent to retrieve data may now exist
in the child.
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
Reads always use C_ReadFinish as a callback (and they are the only
user of this callback). Keep an xlist of these for each object, so
they can remove themselves as they finish. To prevent races between
in-flight requests and discard removing objects from the cache, clear the xlist in
the object destructor, so if the Object is still valid the set_item
will still be on the list.
Make the ObjectCacher constructor take an Object* instead of the pool
and object id, which are derived from the Object* anyway.
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
This reverts commit a99ed038ec.
On second thought, this will require a bit more care to ensure that all
of the paths radosgw needs to read/write from have the correct permissions
in the packages and so forth.
Signed-off-by: Sage Weil <sage@inktank.com>
This increase only means that we'll keep more versions around before we
trim. It doesn't change the number of versions we'll keep around after
trimming (that's still as much as 'paxos_max_join_drift', i.e. 10), nor
does it change the criteria used to consider a monitor as having drifted
(same rule applies, 'paxos_max_join_drift').
This change however will enable the leader to put off trimming for a longer
period of time, giving a better chance for a monitor to join the cluster.
See, after going through the probing phase, at which point a monitor may
only be, say, 5 versions off, the same monitor may end up getting into the
quorum only to find that in-between probing and finally triggering an
election some 6 versions might have come into existence. Before this patch,
by then the state had been trimmed and the monitor would have to bootstrap
to perform a full store sync. With this patch in place, the monitor would
be able to sync the remaining 11 versions.
Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
The --signal argument to Debian's start-stop-daemon doesn't
make it send a signal, but defines which signal should be sent
when --stop is specified.
Signed-off-by: Jan Harkes <jaharkes@cs.cmu.edu>
This is a common error and there's no reason the script can't
at least tell you it's a really bad idea. One might argue it
could even successfully proactively truncate the host parameter
at the first dot, but that's a little controlling, perhaps.
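
A sketch of such a check, in Python and with a hypothetical function name and
message text (the script's actual wording is not given here): warn when the
host value contains a dot, without modifying it.

```python
def check_host(host):
    """Return a warning string if host contains a dot, else None.

    Hypothetical helper for illustration; it only reports the problem
    and leaves the value untouched.
    """
    if "." in host:
        return "warning: host '%s' contains a dot; use the short hostname" % host
    return None
```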
Signed-off-by: Dan Mick <dan.mick@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Close out any connection with an old peer. This avoids a race like:
- peer marked down
- we get map, mark down the con
- they reconnect and try to send us some stuff
- we share our map to tell them they are old and dead, but leave the con
open
...
- peer marks itself up a few times, eventually reuses the same port
- sends messages on their fresh con
- we discard because of our old con
This could cause a tight reconnect loop, but it is better than wrong
behavior.
Other possible fixes:
- make addr nonce truly unique (augment pid in nonce)
- make a smarter 'disposable' msgr state (bleh)
Signed-off-by: Sage Weil <sage@inktank.com>
This avoids confusion with the OSD method of the same name, and better
matches what the function tests (and does not do).
Signed-off-by: Sage Weil <sage@inktank.com>
Fixes: #4247
The list buckets operation was missing some attrs on the different
xml result entities. This fixes it.
Signed-off-by: Yehuda Sadeh <yehuda@inktank.com>
xml entities may have attrs assigned to them. Add the ability
to set them. A usage example:
formatter->open_array_section_with_attrs("container",
FormatterAttrs("name", "foo", NULL));
This will generate the following xml entity:
<container name="foo">
Signed-off-by: Yehuda Sadeh <yehuda@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
Fixes: #4363
Backport: argonaut, bobtail
When listing objects in a namespace, don't iterate through all the
objects; only go through the ones that start with the namespace
prefix.
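
The idea can be sketched in Python, assuming the keys are available in sorted
order (an assumption of this example, as are the function and key names): seek
to the first key at or after the prefix, then stop at the first key that no
longer matches.

```python
from bisect import bisect_left


def list_with_prefix(sorted_keys, prefix):
    """Return the keys that start with prefix, scanning only that range.

    sorted_keys must already be sorted; names are hypothetical.
    """
    out = []
    # Jump straight to the first candidate instead of scanning from 0.
    i = bisect_left(sorted_keys, prefix)
    while i < len(sorted_keys) and sorted_keys[i].startswith(prefix):
        out.append(sorted_keys[i])
        i += 1
    return out
```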
Signed-off-by: Yehuda Sadeh <yehuda@inktank.com>
The default of 100000 can result in hundreds of MBs of extra memory
used. This was most obvious when using librbd with caching enabled,
since there was a dout(0) accidentally left in the ObjectCacher.
refs: #4352
backport: bobtail
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
Level 0 should never be used for this kind of debugging.
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
Reviewed-by: Dan Mick <dan.mick@inktank.com>