==================================
Recovering from ceph-mon failure
==================================

Any single ceph-mon failure should not take down the entire monitor
cluster as long as a majority of the nodes are available.  If that
is the case--the remaining nodes are able to form a quorum--the ``ceph
health`` command will report any problems::

 $ ceph health
 HEALTH_WARN 1 mons down, quorum 0,2

and::

 $ ceph health detail
 HEALTH_WARN 1 mons down, quorum 0,2
 mon.b (rank 1) addr 192.168.106.220:6790/0 is down (out of quorum)

Generally speaking, simply restarting the affected node will repair things.
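
How the monitor is restarted depends on how it was deployed.  For
example, on a systemd-based host the daemon for a monitor with the id
``b`` (substitute your monitor's id) can usually be restarted with::

 $ sudo systemctl restart ceph-mon@b

while a sysvinit-based installation typically uses::

 $ sudo /etc/init.d/ceph restart mon.b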

If there are not enough monitors to form a quorum, the ``ceph``
command will block trying to reach the cluster.  In this situation,
you need to get enough ``ceph-mon`` daemons running to form a quorum
before doing anything else with the cluster.
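
Because the ``ceph`` tool needs a quorum to answer, check each
monitor's own view of the cluster through its admin socket instead;
this works even without a quorum.  The socket path shown is the usual
default for a cluster named ``ceph`` and a monitor id of ``b``; adjust
both for your installation::

 $ ceph --admin-daemon /var/run/ceph/ceph-mon.b.asok mon_status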

Replacing a monitor
===================

If, for some reason, a monitor data store becomes corrupt, the monitor
can be recreated and allowed to rejoin the cluster, much like a normal
monitor cluster expansion. See :ref:`adding-mon`.
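
Very roughly, and assuming the default data path, a systemd-managed
daemon, and a monitor id of ``b``, rebuilding a monitor whose store is
corrupt looks like this (see :ref:`adding-mon` for the authoritative
steps)::

 $ sudo systemctl stop ceph-mon@b                    # stop the broken monitor
 $ sudo mv /var/lib/ceph/mon/ceph-b /var/lib/ceph/mon/ceph-b.corrupt
 $ ceph mon getmap -o /tmp/monmap                    # current monmap from the quorum
 $ ceph auth get mon. -o /tmp/keyring                # monitor keyring
 $ sudo ceph-mon -i b --mkfs --monmap /tmp/monmap --keyring /tmp/keyring
 $ sudo systemctl start ceph-mon@b                   # let it rejoin and sync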