ceph/doc/cluster-ops/troubleshooting-mon.rst

==================================
 Recovering from Monitor Failures
==================================

In production clusters, we recommend running the cluster with a minimum
of three monitors. The failure of a single monitor should not take down
the entire monitor cluster, provided a majority of the monitors remain
available. If the majority of nodes are available, the remaining nodes
will be able to form a quorum.

When you check your cluster's health, you may notice that a monitor
has failed. For example:: 

	ceph health
	HEALTH_WARN 1 mons down, quorum 0,2

For additional detail, you may check the cluster status::

	ceph status
	HEALTH_WARN 1 mons down, quorum 0,2
	mon.b (rank 1) addr 192.168.106.220:6790/0 is down (out of quorum)

In most cases, you can simply restart the affected node. 
For example:: 

	service ceph -a restart {failed-mon}

If there are not enough monitors to form a quorum, the ``ceph``
command will block trying to reach the cluster.  In this situation,
you need to get enough ``ceph-mon`` daemons running to form a quorum
before doing anything else with the cluster.
doc: Added monitor failure recovery. Will be re-factored again soon. Signed-off-by: John Wilkins <john.wilkins@inktank.com> 2012-09-04 23:33:47 +00:00			`==================================`
			`Recovering from Monitor Failures`
			`==================================`

			`In production clusters, we recommend running the cluster with a minimum`
			`of three monitors. The failure of a single monitor should not take down`
			`the entire monitor cluster, provided a majority of the monitors remain`
			`available. If the majority of nodes are available, the remaining nodes`
			`will be able to form a quorum.`

			`When you check your cluster's health, you may notice that a monitor`
			`has failed. For example::`

			`ceph health`
			`HEALTH_WARN 1 mons down, quorum 0,2`

			`For additional detail, you may check the cluster status::`

			`ceph status`
			`HEALTH_WARN 1 mons down, quorum 0,2`
			`mon.b (rank 1) addr 192.168.106.220:6790/0 is down (out of quorum)`

			`In most cases, you can simply restart the affected node.`
			`For example::`

			`service ceph -a restart {failed-mon}`

			If there are not enough monitors to form a quorum, the ``ceph``
			`command will block trying to reach the cluster. In this situation,`
			you need to get enough ``ceph-mon`` daemons running to form a quorum
			`before doing anything else with the cluster.`