When called with a "map" parameter, the rbdmap script iterates the list
of images present in RBDMAPFILE (/etc/ceph/rbdmap), and maps each entry.
When called with "unmap", rbdmap currently iterates *all* mapped RBD
images and unmaps each one, regardless of whether it's listed in the
RBDMAPFILE or not.
This commit adds functionality such that only RBD images listed in the
configuration file are unmapped. This behaviour is the new default for
"rbdmap unmap". A new "unmap-all" parameter is added to offer the old
unmap-all-rbd-images behaviour, which is used by the systemd service.
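A rough shell sketch of the new "unmap" behaviour (not the actual rbdmap code;
it assumes RBDMAPFILE entries of the form "pool/image [options]" and the
/dev/rbd/&lt;pool&gt;/&lt;image&gt; device links created by the default udev rules):
    while read DEV PARAMS; do
        case "$DEV" in
          ""|\#*) continue ;;   # skip blank lines and comments
        esac
        # only unmap images that are both listed in the file and currently mapped
        if [ -b "/dev/rbd/$DEV" ]; then
            rbd unmap "/dev/rbd/$DEV"
        fi
    done < "$RBDMAPFILE"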
Fixes: http://tracker.ceph.com/issues/18884
Signed-off-by: David Disseldorp <ddiss@suse.de>
When booting a server with 20+ HDDs, udev has to process a *lot* of
events (especially if dm-crypt is used), and 2 minutes might not be
enough for that. Make it possible to override the timeout (via systemd
drop-in files), and use a longer timeout (5 minutes) by default.
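For example, an administrator could raise the timeout further with a drop-in
like the following (the CEPH_DISK_TIMEOUT variable name and the value are
assumptions used for illustration):
    # /etc/systemd/system/ceph-disk@.service.d/timeout.conf
    [Service]
    Environment=CEPH_DISK_TIMEOUT=600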
Fixes: http://tracker.ceph.com/issues/18740
Signed-off-by: Alexey Sheplyakov <asheplyakov@mirantis.com>
Currently, we start/stop OSDs and MONs simultaneously. This may cause
problems especially when we are shutting down the system. Once the mon
goes down it causes a re-election and the MONs can miss the message
from the OSD that is going down.
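A sketch of the kind of ordering that addresses this, relying on systemd
stopping units in the reverse of their start order (which unit the directive
is added to is an assumption):
    # ceph-osd@.service
    [Unit]
    After=ceph-mon.target
    # OSDs now start after the mons and, on shutdown, stop before them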
Resolves: http://tracker.ceph.com/issues/18516
Signed-off-by: Boris Ranto <branto@redhat.com>
In some situations the IP address the Monitor wants to bind to
might not be available yet.
This might, for example, be an IPv6 address that is still performing
DAD or waiting for a Router Advertisement to be sent by the router(s).
Have systemd wait 10s before restarting the Mon and increase the number
of times it does so to 5.
This allows the system to bring up IP addresses in the meantime while
systemd waits to restart the Mon.
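In unit-file terms this is roughly the following for ceph-mon@.service (a
sketch; in older systemd StartLimitBurst= lives in [Service], in newer
releases in [Unit]):
    [Service]
    Restart=on-failure
    RestartSec=10
    StartLimitBurst=5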
Fixes: #18635
Signed-off-by: Wido den Hollander <wido@42on.com>
Instead of the default 100ms pause before trying to restart an OSD, wait
20 seconds, and retry 30 times instead of 3. There is no scenario
in which restarting an OSD almost immediately after it failed would get
a better result.
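The corresponding directives for ceph-osd@.service would look roughly like
this (a sketch, not the exact diff):
    [Service]
    Restart=on-failure
    RestartSec=20s
    StartLimitBurst=30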
It is possible that a failure to start is due to a race with another
systemd unit at boot time. For instance if ceph-disk@.service is
delayed, it may start after the OSD that needs it. A long pause may give
the racing service enough time to complete and the next attempt to start
the OSD may succeed.
This is not a sound way to resolve a race; it only makes the OSD
boot process less sensitive to one. In the example above, the proper fix is to
enable --runtime ceph-osd@.service so that it cannot race at boot time.
The wait delay should not be on the order of minutes, to preserve the
current runtime behavior. For instance, if an OSD is killed or fails and
only restarts after 10 minutes, it will be marked down by the Ceph
cluster. Being marked down would not break anything, but it is a
significant change in behavior and should be avoided.
Refs: http://tracker.ceph.com/issues/17889
Signed-off-by: Loic Dachary <loic@dachary.org>
"ceph-disk trigger" invocation is currently performed in a mutually
exclusive fashion, with each call first taking an exclusive lock (flock) on the path
/var/lock/ceph-disk. On systems with a lot of osds, this leads to a
large amount of lock contention during boot-up, and can cause some
service instances to trip the 120 second timeout.
Take the lock on a device-specific path instead of /var/lock/ceph-disk,
so that concurrent "ceph-disk trigger" invocations are permitted for
independent osds. This greatly reduces lock contention and consequently
the chance of service timeout. Per-device concurrency restrictions
required for http://tracker.ceph.com/issues/13160 are maintained.
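A sketch of what the per-device lock might look like in the ceph-disk@.service
ExecStart= line (paths and options are illustrative; %f is the unescaped
instance name, i.e. the device path):
    ExecStart=/bin/sh -c 'flock /var/lock/ceph-disk-$(basename %f) /usr/sbin/ceph-disk trigger --sync %f'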
Fixes: http://tracker.ceph.com/issues/18049
Signed-off-by: David Disseldorp <ddiss@suse.de>
A ceph udev action may be triggered before the local file systems are
mounted because there is no ordering in udev. The ceph udev action
delegates asynchronously to systemd via ceph-disk@.service which will
fail if (for instance) the LVM partition required to mount /var/lib/ceph
is not available yet. The systemd unit will retry a few times but will
eventually fail permanently. The sysadmin can run "systemctl reset-failed"
at a later time and the unit will then succeed.
Add a dependency to ceph-disk@.service so that it waits until the local
file systems are mounted:
After=local-fs.target
Since local-fs.target depends on lvm, it will wait until the lvm
partition (as well as any dm devices) is ready and mounted before
attempting to activate the OSD. It may still fail because the
corresponding journal/data partition is not ready yet (which is
expected) but it will no longer fail because the lvm/filesystems/dm are
not ready.
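For reference, an administrator could add the same ordering as a drop-in
without rebuilding the package (a sketch):
    # /etc/systemd/system/ceph-disk@.service.d/local-fs.conf
    [Unit]
    After=local-fs.target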
Fixes: http://tracker.ceph.com/issues/17889
Signed-off-by: Loic Dachary <loic@dachary.org>
ceph-create-keys should not be started when the mons boot under systemd, so it
should not be listed in the 'After' or 'Wants' directives of the ceph-mon service.
Signed-off-by: Owen Synge <osynge@suse.com>
ceph-create-keys should not be started when the mons boot under systemd, so it
should not be listed in the 'After' or 'Wants' directives of the ceph-mon service.
Signed-off-by: Owen Synge <osynge@suse.com>
ceph-create-keys should not be started when the mons boot under systemd, so it
should not be referenced in the systemd unit files.
Signed-off-by: Owen Synge <osynge@suse.com>
This is a hack to inject a key for the mgr daemon, using whatever
key already exists on the mon on this node to gain sufficient
permissions to create the mgr key. Failure is ignored at every
step (the '-' prefix) in case someone has already used some other
trick to set everything up manually.
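Roughly, such an injection is a chain of ExecStartPre= lines whose '-' prefix
tells systemd to ignore a non-zero exit status; the command, caps and keyring
path below are illustrative only, not the exact hack:
    [Service]
    # '-' prefix: a failure here does not prevent the unit from starting
    ExecStartPre=-/usr/bin/ceph auth get-or-create mgr.%i mon 'allow profile mgr' osd 'allow *' mds 'allow *' -o /var/lib/ceph/mgr/ceph-%i/keyring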
Signed-off-by: Tim Serong <tserong@suse.com>
This change introduces the following behaviour:
- When ceph-mon starts, it will try to start ceph-mgr with the same
instance id (Wants=), but will *not* fail to start if ceph-mgr
doesn't start (i.e. the mon still works as it always did).
- ceph-mgr will start After= ceph-mon, and will stop and start when
ceph-mon stops and starts, because it's PartOf= ceph-mon.
If you don't want ceph-mgr to run on the mons, you need to mask the
service, i.e. `systemctl mask ceph-mgr@INSTANCE`. Instance names are
typically hostnames, so `systemctl mask ceph-mgr@$(hostname)`
should suffice if you wish to disable ceph-mgr on the mons.
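A sketch of the directives this describes, assuming the mon and mgr share the
same template instance %i:
    # ceph-mon@.service
    [Unit]
    Wants=ceph-mgr@%i.service
    # ceph-mgr@.service
    [Unit]
    After=ceph-mon@%i.service
    PartOf=ceph-mon@%i.service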
Signed-off-by: Tim Serong <tserong@suse.com>
When ceph-disk runs from udev or an init script, it runs in the background
and, should it block for any reason, it may hold a lock forever. All
calls to ceph-disk in these contexts are changed to run under a timeout.
The TimeoutStartSec= and TimeoutStopSec= settings, which are both set via
TimeoutSec=, do not apply to Type=oneshot services:
https://www.freedesktop.org/software/systemd/man/systemd.service.html
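A sketch of such a wrapped call, using the coreutils timeout(1) command (the
120-second value and the device path are illustrative):
    timeout 120 ceph-disk --verbose --log-stdout trigger --sync /dev/sdb1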
Fixes: http://tracker.ceph.com/issues/16580
Signed-off-by: Loic Dachary <loic@dachary.org>
Some distros, like Fedora and openSUSE, have a policy that all services are
disabled by default.
This patch changes that default for the ceph.target and
ceph-{mds,mon,osd,radosgw}.target services.
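One common way distros express such a default is a systemd preset file; a
hedged example (the file name is illustrative and this may not be the exact
mechanism the patch uses):
    # /usr/lib/systemd/system-preset/50-ceph.preset
    enable ceph.target
    enable ceph-mds.target
    enable ceph-mon.target
    enable ceph-osd.target
    enable ceph-radosgw.target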
Signed-off-by: Nathan Cutler <ncutler@suse.com>
Signed-off-by: Boris Ranto <branto@redhat.com>
Currently, the systemd daemons are not restarted on failure. This patch
adds this functionality and sets the defaults to those defined in
upstart. This translates to 3 failures per 30 minutes for osd, mon and mds,
and 5 failures per 30 seconds for radosgw.
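In unit-file terms that maps to something like the following (a sketch; note
that StartLimitInterval=/StartLimitBurst= moved to the [Unit] section in newer
systemd):
    # osd, mon, mds
    [Service]
    Restart=on-failure
    StartLimitInterval=30min
    StartLimitBurst=3
    # radosgw
    [Service]
    Restart=on-failure
    StartLimitInterval=30s
    StartLimitBurst=5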
Signed-off-by: Boris Ranto <branto@redhat.com>
If systemd has task accounting enabled, a default of 512 tasks
will be applied to all systemd units.
For ceph, this is way too low even for a modest cluster, so stop
this restriction from being applied and allow administrators to apply
limits using sysctl.
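The relevant directive, assuming the units lift the limit explicitly rather
than relying on a changed global default:
    [Service]
    TasksMax=infinity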
Signed-off-by: James Page <james.page@ubuntu.com>
These are not supported by /usr/lib/ceph/ceph-osd-prestart.sh,
resulting in warnings:
ceph-osd-prestart.sh[23367]: getopt: unrecognized option '--setuser'
ceph-osd-prestart.sh[23367]: getopt: unrecognized option '--setgroup'
--setuser and --setgroup are only needed for the ceph-osd process.
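A sketch of the resulting split in ceph-osd@.service (lines are illustrative,
not the exact diff; ${CLUSTER} is assumed to come from an Environment= line):
    ExecStartPre=/usr/lib/ceph/ceph-osd-prestart.sh --cluster ${CLUSTER} --id %i
    ExecStart=/usr/bin/ceph-osd -f --cluster ${CLUSTER} --id %i --setuser ceph --setgroup ceph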
Signed-off-by: James Page <james.page@ubuntu.com>
First, it makes sense for both ceph_common.sh and ceph-osd-prestart.sh to
reside in the same directory: make it so.
Second, /usr/lib exists on both RHEL/Fedora and SLE/openSUSE, whereas
the latter lacks /usr/libexec. To make this less painful, package
ceph_common.sh and ceph-osd-prestart.sh in /usr/lib/ceph.
Third, allow e.g. FreeBSD to do its own thing by using the $(libexecdir)
Autoconf variable (but set it to /usr/lib in the spec file).
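For example, a packager can pin the location at configure time (a sketch):
    ./configure --libexecdir=/usr/lib    # the scripts then land in /usr/lib/ceph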
Fixes: http://tracker.ceph.com/issues/14687
Signed-off-by: Nathan Cutler <ncutler@suse.com>
This change makes it so the mon/osd/mds/radosgw daemons:
o Cannot write to /usr, /etc, and /boot.
o Cannot access /home, /root, or /run/user.
o Each daemon gets its own private /tmp and /var/tmp.
o All daemons get a private /dev without physical devices (exception: osd)
I'm not sure if the osd daemon needs access to a full /dev, so I left
PrivateDevices= out for ceph-osd@.service.
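The hardening above corresponds to these systemd directives (a sketch;
PrivateDevices= is the one left out of ceph-osd@.service):
    [Service]
    ProtectSystem=full
    ProtectHome=true
    PrivateTmp=true
    PrivateDevices=true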
Signed-off-by: Patrick Donnelly <batrick@batbytes.com>
The flock command may be installed elsewhere, depending on the
system. Let the PATH search figure that out.
Fixes: http://tracker.ceph.com/issues/13975
Signed-off-by: Loic Dachary <loic@dachary.org>
systemd: start/stop/restart ceph services by daemon type
Reviewed-by: Nathan Cutler <ncutler@suse.com>
Reviewed-by: Sage Weil <sage@redhat.com>
Reviewed-by: Boris Ranto <branto@redhat.com>
Reviewed-by: Ken Dreyer <kdreyer@redhat.com>
When activating a device, ceph-disk trigger restarts the ceph-disk
systemd service. Two consecutive udev add on the same device will
restart the ceph-disk systemd service and the second one may kill the
first one, leaving the device half activated.
The ceph-disk systemd service is instructed not to kill an existing
process when restarting. The second run waits (via flock) for the first
one to complete before proceeding, so that the two do not overlap.
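In unit terms that combination looks roughly like the following (illustrative,
not the exact unit file):
    [Service]
    # do not kill the already-running activation when the unit is restarted
    KillMode=none
    ExecStart=/bin/sh -c 'flock /var/lock/ceph-disk /usr/sbin/ceph-disk trigger --sync %f'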
Fixes: http://tracker.ceph.com/issues/13160
Signed-off-by: Loic Dachary <ldachary@redhat.com>
We were observed to be hitting the limit on centos7
(triggering pthread_create failures) on a ~2000 OSD cluster.
Increasing this resolves it!
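Assuming the limit in question is the per-unit process/thread limit (an
assumption on my part, not stated in this message), raising it might look
like this in ceph-osd@.service:
    [Service]
    LimitNPROC=1048576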
Reported-by: Dan van der Ster <daniel.vanderster@cern.ch>
Signed-off-by: Sage Weil <sage@redhat.com>