==================================
 Recovering from Monitor Failures
==================================

.. index:: monitor, high availability

In production clusters, we recommend running the cluster with a minimum
of three monitors. The failure of a single monitor should not take down
the entire monitor cluster, provided a majority of the monitors remain
available. As long as a majority of the monitors are available, the
remaining monitors can form a quorum.

When you check your cluster's health, you may notice that a monitor
has failed. For example::

    ceph health
    HEALTH_WARN 1 mons down, quorum 0,2

For additional detail, you may check the cluster status::

    ceph status
    HEALTH_WARN 1 mons down, quorum 0,2
    mon.b (rank 1) addr 192.168.106.220:6790/0 is down (out of quorum)

In most cases, you can simply restart the affected monitor. For example::

    service ceph -a restart {failed-mon}

If there are not enough monitors to form a quorum, the ``ceph``
command will block trying to reach the cluster. In this situation,
you need to get enough ``ceph-mon`` daemons running to form a quorum
before doing anything else with the cluster.


Client Can't Connect/Mount
==========================

Check your ``iptables`` rules. Some OS install utilities add a ``REJECT``
rule to ``iptables``. The rule rejects all clients trying to connect to the
host except for ``ssh``. If your monitor host has such a ``REJECT`` rule in
place, clients connecting from a separate node will fail to mount with a
timeout error. You must remove such a rule, or place rules that accept Ceph
traffic ahead of it, because ``iptables`` evaluates rules in order. A rule of
this kind looks like this::

    REJECT all -- anywhere anywhere reject-with icmp-host-prohibited

You may also need to add ``iptables`` rules on your Ceph hosts to ensure
that clients can access the ports associated with your Ceph monitors (i.e.,
port 6789 by default) and Ceph OSDs (i.e., 6800 et seq. by default).
For example::

    iptables -A INPUT -m multiport -p tcp -s {ip-address}/{netmask} --dports 6789,6800:6810 -j ACCEPT


Latency with Down Monitors
==========================

When a monitor is down, you may experience some latency, because clients
will still try to connect to any monitor listed in the configuration, even
one that is down. If the client fails to connect to the monitor within a
timeout window, the client will try another monitor in the cluster.

To avoid this latency, you can specify the ``-m`` option to point to a
monitor that is up and in the quorum.
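
As a minimal sketch of the ``-m`` approach, the shell function below wraps
the ``ceph`` command so that it always contacts one monitor that you know is
up. The function name ``ceph_via_mon`` is hypothetical and not part of Ceph,
and the sketch assumes the monitor listens on the default port 6789.

```shell
# Hypothetical helper (not part of Ceph): run a ceph command against one
# specific monitor instead of letting the client pick one from the config.
# Usage: ceph_via_mon <monitor-host-or-ip> [ceph arguments...]
ceph_via_mon() {
    mon_host="$1"
    shift
    # Assumes the default monitor port 6789; change it if yours differs.
    ceph -m "${mon_host}:6789" "$@"
}
```

For example, ``ceph_via_mon {monitor-ip} health`` runs ``ceph health``
against the monitor at ``{monitor-ip}``, skipping the timeout a client would
otherwise spend on a monitor that is down.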