diff --git a/doc/hardware-monitoring/index.rst b/doc/hardware-monitoring/index.rst
new file mode 100644
index 00000000000..dcafa82303f
--- /dev/null
+++ b/doc/hardware-monitoring/index.rst
@@ -0,0 +1,183 @@
+.. _hardware-monitoring:
+
+Hardware monitoring
+===================
+
+`node-proxy` is the internal name of the agent that inventories a machine's hardware, reports the status of the hardware components and enables the operator to perform some actions.
+It gathers details from the Redfish API, then processes and pushes the data to the agent endpoint in the Ceph manager daemon.
+
+.. graphviz::
+
+   digraph G {
+       node [shape=record];
+       mgr [label="{<mgr> ceph manager}"];
+       dashboard [label="<dashboard> ceph dashboard"];
+       agent [label="<agent> agent"];
+       redfish [label="<redfish> redfish"];
+
+       agent -> redfish [label=" 1." color=green];
+       agent -> mgr [label=" 2." color=orange];
+       dashboard:dashboard -> mgr [label=" 3." color=lightgreen];
+       node [shape=plaintext];
+       legend [label=<
+         <table border="0" cellborder="1" cellspacing="0">
+           <tr><td bgcolor="lightgrey">Legend</td></tr>
+           <tr><td align="left">1. Collects data from the Redfish API</td></tr>
+           <tr><td align="left">2. Pushes data to the ceph mgr</td></tr>
+           <tr><td align="left">3. Queries the ceph mgr</td></tr>
+         </table>
+       >];
+   }
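+
+As an illustration of step 1 in the diagram above, the kind of data the agent works with can be inspected manually with any HTTP client pointed at the Redfish API of the Out-Of-Band controller.
+The address and credentials below are the ones used in the deployment example later on this page, and the query targets the standard `/redfish/v1/Systems` collection; the exact payloads returned depend on the hardware vendor:
+
+.. code-block:: bash
+
+    # query the Redfish systems collection on the Out-Of-Band controller
+    # (self-signed certificates are common on BMCs, hence -k)
+    curl -k -u admin:p@ssword https://20.20.20.10/redfish/v1/Systems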
+
+
+Limitations
+-----------
+
+For the time being, the `node-proxy` agent relies on the Redfish API.
+This implies that both the `node-proxy` agent and the `ceph-mgr` daemon need access to the Out-Of-Band network in order to work.
+
+
+Deploying the agent
+-------------------
+
+| The first step is to provide the credentials of the out-of-band management tool.
+| This can be done when adding the host with a service spec file:
+
+.. code-block:: bash
+
+    # cat host.yml
+    ---
+    service_type: host
+    hostname: node-10
+    addr: 10.10.10.10
+    oob:
+      addr: 20.20.20.10
+      username: admin
+      password: p@ssword
+
+Apply the spec:
+
+.. code-block:: bash
+
+    # ceph orch apply -i host.yml
+    Added host 'node-10' with addr '10.10.10.10'
+
+Deploy the agent:
+
+.. code-block:: bash
+
+    # ceph config set mgr mgr/cephadm/hw_monitoring true
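+
+The option can be read back to confirm that hardware monitoring is enabled:
+
+.. code-block:: bash
+
+    # ceph config get mgr mgr/cephadm/hw_monitoring
+    true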
+
+
+CLI
+---
+
+| **orch** **hardware** **status** [hostname] [--category CATEGORY] [--format plain | json]
+
+Supported categories are:
+
+* summary (default)
+* memory
+* storage
+* processors
+* network
+* power
+* fans
+* firmwares
+* criticals
+
+Examples
+********
+
+
+hardware health status summary
+++++++++++++++++++++++++++++++
+
+.. code-block:: bash
+
+    # ceph orch hardware status
+    +------------+---------+-----+-----+--------+-------+------+
+    | HOST       | STORAGE | CPU | NET | MEMORY | POWER | FANS |
+    +------------+---------+-----+-----+--------+-------+------+
+    | node-10    | ok      | ok  | ok  | ok     | ok    | ok   |
+    +------------+---------+-----+-----+--------+-------+------+
+
+
+storage devices report
+++++++++++++++++++++++
+
+.. code-block:: bash
+
+    # ceph orch hardware status node-10 --category storage
+    +------------+---------------------------------------------------------+------------------+----------------+----------+----------------+--------+---------+
+    | HOST       | NAME                                                    | MODEL            | SIZE           | PROTOCOL | SN             | STATUS | STATE   |
+    +------------+---------------------------------------------------------+------------------+----------------+----------+----------------+--------+---------+
+    | node-10    | Disk 8 in Backplane 1 of Storage Controller in Slot 2   | ST20000NM008D-3D | 20000588955136 | SATA     | ZVT99QLL       | OK     | Enabled |
+    | node-10    | Disk 10 in Backplane 1 of Storage Controller in Slot 2  | ST20000NM008D-3D | 20000588955136 | SATA     | ZVT98ZYX       | OK     | Enabled |
+    | node-10    | Disk 11 in Backplane 1 of Storage Controller in Slot 2  | ST20000NM008D-3D | 20000588955136 | SATA     | ZVT98ZWB       | OK     | Enabled |
+    | node-10    | Disk 9 in Backplane 1 of Storage Controller in Slot 2   | ST20000NM008D-3D | 20000588955136 | SATA     | ZVT98ZC9       | OK     | Enabled |
+    | node-10    | Disk 3 in Backplane 1 of Storage Controller in Slot 2   | ST20000NM008D-3D | 20000588955136 | SATA     | ZVT9903Y       | OK     | Enabled |
+    | node-10    | Disk 1 in Backplane 1 of Storage Controller in Slot 2   | ST20000NM008D-3D | 20000588955136 | SATA     | ZVT9901E       | OK     | Enabled |
+    | node-10    | Disk 7 in Backplane 1 of Storage Controller in Slot 2   | ST20000NM008D-3D | 20000588955136 | SATA     | ZVT98ZQJ       | OK     | Enabled |
+    | node-10    | Disk 2 in Backplane 1 of Storage Controller in Slot 2   | ST20000NM008D-3D | 20000588955136 | SATA     | ZVT99PA2       | OK     | Enabled |
+    | node-10    | Disk 4 in Backplane 1 of Storage Controller in Slot 2   | ST20000NM008D-3D | 20000588955136 | SATA     | ZVT99PFG       | OK     | Enabled |
+    | node-10    | Disk 0 in Backplane 0 of Storage Controller in Slot 2   | MZ7L33T8HBNAAD3  | 3840755981824  | SATA     | S6M5NE0T800539 | OK     | Enabled |
+    | node-10    | Disk 1 in Backplane 0 of Storage Controller in Slot 2   | MZ7L33T8HBNAAD3  | 3840755981824  | SATA     | S6M5NE0T800554 | OK     | Enabled |
+    | node-10    | Disk 6 in Backplane 1 of Storage Controller in Slot 2   | ST20000NM008D-3D | 20000588955136 | SATA     | ZVT98ZER       | OK     | Enabled |
+    | node-10    | Disk 0 in Backplane 1 of Storage Controller in Slot 2   | ST20000NM008D-3D | 20000588955136 | SATA     | ZVT98ZEJ       | OK     | Enabled |
+    | node-10    | Disk 5 in Backplane 1 of Storage Controller in Slot 2   | ST20000NM008D-3D | 20000588955136 | SATA     | ZVT99QMH       | OK     | Enabled |
+    | node-10    | Disk 0 on AHCI Controller in SL 6                       | MTFDDAV240TDU    | 240057409536   | SATA     | 22373BB1E0F8   | OK     | Enabled |
+    | node-10    | Disk 1 on AHCI Controller in SL 6                       | MTFDDAV240TDU    | 240057409536   | SATA     | 22373BB1E0D5   | OK     | Enabled |
+    +------------+---------------------------------------------------------+------------------+----------------+----------+----------------+--------+---------+
+
+
+firmwares details
++++++++++++++++++
+
+.. code-block:: bash
+
+    # ceph orch hardware status node-10 --category firmwares
+    +------------+------------------------------------------------+-------------------------------------+----------------------+-------------+--------+
+    | HOST       | COMPONENT                                      | NAME                                | DATE                 | VERSION     | STATUS |
+    +------------+------------------------------------------------+-------------------------------------+----------------------+-------------+--------+
+    | node-10    | current-107649-7.03__raid.backplane.firmware.0 | Backplane 0                         | 2022-12-05T00:00:00Z | 7.03        | OK     |
+
+    ... omitted output ...
+
+    | node-10    | previous-25227-6.10.30.20__idrac.embedded.1-1  | Integrated Remote Access Controller | 00:00:00Z            | 6.10.30.20  | OK     |
+    +------------+------------------------------------------------+-------------------------------------+----------------------+-------------+--------+
+
+
+hardware critical warnings report
++++++++++++++++++++++++++++++++++
+
+.. code-block:: bash
+
+    # ceph orch hardware status --category criticals
+    +------------+-----------+------------+----------+-----------+
+    | HOST       | COMPONENT | NAME       | STATUS   | STATE     |
+    +------------+-----------+------------+----------+-----------+
+    | node-10    | power     | PS2 Status | critical | unplugged |
+    +------------+-----------+------------+----------+-----------+
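+
+
+json output
++++++++++++
+
+Any of the reports above can also be produced as JSON, as per the ``--format plain | json`` option in the synopsis, which is more convenient for scripting (output omitted here):
+
+.. code-block:: bash
+
+    # ceph orch hardware status node-10 --category summary --format json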
+
+
+Developers
+----------
+
+.. py:currentmodule:: cephadm.agent
+.. autoclass:: NodeProxyEndpoint
+.. automethod:: NodeProxyEndpoint.__init__
+.. automethod:: NodeProxyEndpoint.oob
+.. automethod:: NodeProxyEndpoint.data
+.. automethod:: NodeProxyEndpoint.fullreport
+.. automethod:: NodeProxyEndpoint.summary
+.. automethod:: NodeProxyEndpoint.criticals
+.. automethod:: NodeProxyEndpoint.memory
+.. automethod:: NodeProxyEndpoint.storage
+.. automethod:: NodeProxyEndpoint.network
+.. automethod:: NodeProxyEndpoint.power
+.. automethod:: NodeProxyEndpoint.processors
+.. automethod:: NodeProxyEndpoint.fans
+.. automethod:: NodeProxyEndpoint.firmwares
+.. automethod:: NodeProxyEndpoint.led
diff --git a/doc/index.rst b/doc/index.rst
index 8edc2cb09e0..df8652dc065
--- a/doc/index.rst
+++ b/doc/index.rst
@@ -121,5 +121,6 @@ about Ceph, see our `Architecture`_ section.
    releases/general
    releases/index
    security/index
+   hardware-monitoring/index
    Glossary
    Tracing
diff --git a/doc/monitoring/index.rst b/doc/monitoring/index.rst
index 2bf2aca90f2..3f0bb29194b
--- a/doc/monitoring/index.rst
+++ b/doc/monitoring/index.rst
@@ -470,5 +470,8 @@ Useful queries
    rate(ceph_rbd_read_latency_sum[30s]) / rate(ceph_rbd_read_latency_count[30s]) *
    on (instance) group_left (ceph_daemon) ceph_rgw_metadata
 
+Hardware monitoring
+===================
+See :ref:`hardware-monitoring`.