Below is a patch which makes the ceph-rbdnamer script more robust and
fixes a problem with the rbd udev rules.
On our setup we encountered a symlink which was linked to the wrong rbd:
/dev/rbd/mypool/myrbd -> /dev/rbd1
While that link should have gone to /dev/rbd3 (on which a
partition /dev/rbd3p1 was present).
Now the old udev rule passes %n to the ceph-rbdnamer script, the problem
with %n is that %n results in a value of 3 (for rbd3), but in a value of
1 (for rbd3p1), so it seems it can't be depended upon for rbdnaming.
In the patch below the ceph-rbdnamer script is made more robust and it
now it can be called in various ways:
/usr/bin/ceph-rbdnamer /dev/rbd3
/usr/bin/ceph-rbdnamer /dev/rbd3p1
/usr/bin/ceph-rbdnamer rbd3
/usr/bin/ceph-rbdnamer rbd3p1
/usr/bin/ceph-rbdnamer 3
Even with all these different styles of calling the modified script, it
should now return the same rbdname. This change "has" to be combined
with calling it from udev with %k though.
With that fixed, we hit the second problem. We ended up with:
/dev/rbd/mypool/myrbd -> /dev/rbd3p1
So the rbdname was symlinked to the partition on the rbd instead of the
rbd itself. So what probably went wrong is udev discovering the disk and
running ceph-rbdnamer which resolved it to myrbd so the following
symlink was created:
/dev/rbd/mypool/myrbd -> /dev/rbd3
However partitions would be discovered next and ceph-rbdnamer would be
run with rbd3p1 (%k) as parameter, resulting in the name myrbd too, with
the previous correct symlink being overwritten with a faulty one:
/dev/rbd/mypool/myrbd -> /dev/rbd3p1
The solution to the problem is in differentiating between disks and
partitions in udev and handling them slightly differently. So with the
patch below partitions now get their own symlinks in the following style
(which is fairly consistent with other udev rules):
/dev/rbd/mypool/myrbd-part1 -> /dev/rbd3p1
Please let me know any feedback you have on this patch or the approach
used.
Regards,
Pascal de Bruijn
Unilogic B.V.
Signed-off-by: Pascal de Bruijn <pascal@unilogicnetworks.net>
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
In non-crash situations, we want to make sure the message is both below the
syslog/stderr threshold and also below the normal log threshold. Otherwise
we get anything we gather on those channels, even when the log level is
low.
Signed-off-by: Sage Weil <sage@inktank.com>
We should gather an event if it is below the log or gather threshold.
Previously we were only gathering if we were going to print it, which makes
the dump no more useful than what was already logged.
Signed-off-by: Sage Weil <sage@inktank.com>
We want to look at the acting set here, nothing else. This was causing us
to erroneously queue ops for later (wasting memory) and to erroneously
print out a 'misdrected op' message in the cluster log (confusion and
incorrect [but ignored] -ENXIO reply).
Fixes: #2022
Signed-off-by: Sage Weil <sage@inktank.com>
It is possible for a .new file to already exist, potentially with a
larger size. This would happen if:
- we were proposing a different value
- we crashed (or were stopped) before it got renamed into place
- after restarting, a different value was proposed and accepted.
This isn't so unlikely for the log state machine, where we're
aggregating random messages. O_TRUNC ensure we avoid getting the tail
end of some previous junk.
I observed #2593 and found that a logm state value had a larger size on
one mon (after slurping) than the others, pointing to put_bl_sn_map().
While we are at it, O_TRUNC put_int() too; the same type of bug is
possible there, too.
Fixes: #2593
Signed-off-by: Sage Weil <sage@inktank.com>
Bug #2650. We were overriding subuser perm mask whenever subuser
was modified, even if perm mask was not passed.
Signed-off-by: Yehuda Sadeh <yehuda@inktank.com>
Bad linebreaks, wrapping, stringification, missing doc for bench args
Signed-off-by: Dan Mick <dan.mick@inktank.com>
Reviewed-by: Samuel Just <sam.just@inktank.com>
Bug #2772. This fixes an issue that was introduced when we
added the 'rados cp' command. The -t param was already used
for rados bench. With this change the only way to specify
a target pool is using --target-pool.
Though this problem is post argonaut, the 'rados cp' command
has been backported, so we need this fix there too.
Backport: argonaut
Signed-off-by: Yehuda Sadeh <yehuda@inktank.com>
This is leftover from when we built a libcrush.so. We can re-add when we
start doing that again.
Reported-by: Laszlo Boszormenyi <gcs@debian.hu>
Signed-off-by: Sage Weil <sage@inktank.com>
This was creating a new cluster connection/session per iteration, and
along with it a few service threads and sockets and so forth.
Unfortunately, librados leaks like a sieve, starting with CephContext
and ceph::crypto::init(). See #845 and #2067.
Signed-off-by: Sage Weil <sage@inktank.com>
pinfo.stats might be wrong if we did log-based recovery on the
backfilled portion in addition to continuing backfill.
bug #2750
Signed-off-by: Samuel Just <sam.just@inktank.com>
When we are signaling the cond to indicate that a notify is complete,
take the appropriate lock. This removes the possibility of a race
that loses our signal. (That would be very difficult given that there
are network round trips involved, but this makes the lock/cond usage
"correct.")
Signed-off-by: Sage Weil <sage@inktank.com>
Wait for all replicas to construct the base scrub map before finalizing
the scrub and locking out writes.
Signed-off-by: Mike Ryan <mike.ryan@inktank.com>
Issue #2701. This info wasn't really used anywhere and we weren't
removing it. It was also sharing the same pool namespace as the
info indexed by bucket name, which is bad.
Signed-off-by: Yehuda Sadeh <yehuda@inktank.com>
Issue #2701. This info wasn't really used anywhere and we weren't
removing it. It was also sharing the same pool namespace as the
info indexed by bucket name, which is bad.
Signed-off-by: Yehuda Sadeh <yehuda@inktank.com>
New rados command: rados cp <src-obj> [dest-obj]
Requires specifying source pool. Target pool and locator can be specified.
The new command preserves object xattrs and omap data.
Signed-off-by: Yehuda Sadeh <yehuda@inktank.com>
In these cases + and / are replaced by - and _ to prevent problems when using
the base64 strings in URLs.
Signed-off-by: Wido den Hollander <wido@widodh.nl>
Signed-off-by: Sage Weil <sage@inktank.com>