doc: add node-proxy documentation

This commit adds some documentation about the
'hardware inventory / monitoring' feature (node-proxy agent).

Signed-off-by: Guillaume Abrioux <gabrioux@ibm.com>
Guillaume Abrioux 2024-01-31 15:23:44 +01:00
parent 9a949f1ad7
commit b7c0a6a5b0
3 changed files with 187 additions and 0 deletions


@ -0,0 +1,183 @@

.. _hardware-monitoring:

Hardware monitoring
===================

`node-proxy` is the internal name of the agent which inventories a machine's
hardware, reports the various hardware statuses, and enables the operator to
perform some actions. It gathers details from the Redfish API, then processes
and pushes the data to the agent endpoint in the Ceph manager daemon.

.. graphviz::

   digraph G {
       node [shape=record];
       mgr [label="{<mgr> ceph manager}"];
       dashboard [label="<dashboard> ceph dashboard"];
       agent [label="<agent> agent"];
       redfish [label="<redfish> redfish"];

       agent -> redfish [label=" 1." color=green];
       agent -> mgr [label=" 2." color=orange];
       dashboard:dashboard -> mgr [label=" 3." color=lightgreen];

       node [shape=plaintext];
       legend [label=<<table border="0" cellborder="1" cellspacing="0">
           <tr><td bgcolor="lightgrey">Legend</td></tr>
           <tr><td align="left">1. Collects data from the Redfish API</td></tr>
           <tr><td align="left">2. Pushes data to ceph mgr</td></tr>
           <tr><td align="left">3. Queries ceph mgr</td></tr>
       </table>>];
   }
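
Step 2 in the diagram (the push from the agent to the manager) can be sketched
in Python. The endpoint URL and the payload layout below are illustrative
assumptions, not the actual node-proxy wire format; the request object is only
built here, never sent.

```python
import json
import urllib.request

# Hypothetical mgr agent endpoint -- an assumption for this sketch,
# not the real node-proxy URL.
MGR_ENDPOINT = "https://mgr.example.local:7150/node-proxy/data"

# Minimal made-up inventory payload resembling what the agent could
# collect from Redfish.
inventory = {
    "host": "node-10",
    "status": {"storage": "ok", "power": "ok"},
}

req = urllib.request.Request(
    MGR_ENDPOINT,
    data=json.dumps(inventory).encode(),
    headers={"Content-Type": "application/json"},
    method="POST",
)
# No cluster in this sketch: the request is constructed but not executed.
```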

Limitations
-----------

For the time being, the `node-proxy` agent relies on the Redfish API.
This implies that both the `node-proxy` agent and the `ceph-mgr` daemon
must be able to reach the out-of-band (OOB) management network in order
to work.

Deploying the agent
-------------------

The first step is to provide the out-of-band management tool credentials.
This can be done when adding the host, with a service spec file:

.. code-block:: bash

   # cat host.yml
   ---
   service_type: host
   hostname: node-10
   addr: 10.10.10.10
   oob:
     addr: 20.20.20.10
     username: admin
     password: p@ssword

Apply the spec:

.. code-block:: bash

   # ceph orch apply -i host.yml
   Added host 'node-10' with addr '10.10.10.10'

Deploy the agent:

.. code-block:: bash

   # ceph config set mgr mgr/cephadm/hw_monitoring true
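
To confirm that the flag took effect, the value can be read back with the
standard ``ceph config get`` command:

.. code-block:: bash

   # ceph config get mgr mgr/cephadm/hw_monitoring
   true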

CLI
---

| **orch** **hardware** **status** [hostname] [--category CATEGORY] [--format plain | json]

Supported categories are:

* summary (default)
* memory
* storage
* processors
* network
* power
* fans
* firmwares
* criticals
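
With ``--format json`` the report can be consumed programmatically. The
snippet below sketches rolling a summary up into a list of unhealthy
categories per host; the JSON layout shown is an assumption for illustration,
not the exact schema emitted by the command.

```python
import json

# Example payload loosely resembling `ceph orch hardware status
# --format json` output; the exact schema is an assumption here.
raw = """
{
  "node-10": {"storage": "ok", "cpu": "ok", "net": "ok",
              "memory": "ok", "power": "error", "fans": "ok"}
}
"""

report = json.loads(raw)

# Keep only the categories that are not reported as "ok".
unhealthy = {
    host: [cat for cat, status in cats.items() if status != "ok"]
    for host, cats in report.items()
}
print(unhealthy)  # {'node-10': ['power']}
```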

Examples
********

hardware health statuses summary
++++++++++++++++++++++++++++++++

.. code-block:: bash

   # ceph orch hardware status
   +------------+---------+-----+-----+--------+-------+------+
   | HOST       | STORAGE | CPU | NET | MEMORY | POWER | FANS |
   +------------+---------+-----+-----+--------+-------+------+
   | node-10    | ok      | ok  | ok  | ok     | ok    | ok   |
   +------------+---------+-----+-----+--------+-------+------+

storage devices report
++++++++++++++++++++++

.. code-block:: bash

   # ceph orch hardware status node-10 --category storage
   +------------+--------------------------------------------------------+------------------+----------------+----------+----------------+--------+---------+
   | HOST       | NAME                                                   | MODEL            | SIZE           | PROTOCOL | SN             | STATUS | STATE   |
   +------------+--------------------------------------------------------+------------------+----------------+----------+----------------+--------+---------+
   | node-10    | Disk 8 in Backplane 1 of Storage Controller in Slot 2  | ST20000NM008D-3D | 20000588955136 | SATA     | ZVT99QLL       | OK     | Enabled |
   | node-10    | Disk 10 in Backplane 1 of Storage Controller in Slot 2 | ST20000NM008D-3D | 20000588955136 | SATA     | ZVT98ZYX       | OK     | Enabled |
   | node-10    | Disk 11 in Backplane 1 of Storage Controller in Slot 2 | ST20000NM008D-3D | 20000588955136 | SATA     | ZVT98ZWB       | OK     | Enabled |
   | node-10    | Disk 9 in Backplane 1 of Storage Controller in Slot 2  | ST20000NM008D-3D | 20000588955136 | SATA     | ZVT98ZC9       | OK     | Enabled |
   | node-10    | Disk 3 in Backplane 1 of Storage Controller in Slot 2  | ST20000NM008D-3D | 20000588955136 | SATA     | ZVT9903Y       | OK     | Enabled |
   | node-10    | Disk 1 in Backplane 1 of Storage Controller in Slot 2  | ST20000NM008D-3D | 20000588955136 | SATA     | ZVT9901E       | OK     | Enabled |
   | node-10    | Disk 7 in Backplane 1 of Storage Controller in Slot 2  | ST20000NM008D-3D | 20000588955136 | SATA     | ZVT98ZQJ       | OK     | Enabled |
   | node-10    | Disk 2 in Backplane 1 of Storage Controller in Slot 2  | ST20000NM008D-3D | 20000588955136 | SATA     | ZVT99PA2       | OK     | Enabled |
   | node-10    | Disk 4 in Backplane 1 of Storage Controller in Slot 2  | ST20000NM008D-3D | 20000588955136 | SATA     | ZVT99PFG       | OK     | Enabled |
   | node-10    | Disk 0 in Backplane 0 of Storage Controller in Slot 2  | MZ7L33T8HBNAAD3  | 3840755981824  | SATA     | S6M5NE0T800539 | OK     | Enabled |
   | node-10    | Disk 1 in Backplane 0 of Storage Controller in Slot 2  | MZ7L33T8HBNAAD3  | 3840755981824  | SATA     | S6M5NE0T800554 | OK     | Enabled |
   | node-10    | Disk 6 in Backplane 1 of Storage Controller in Slot 2  | ST20000NM008D-3D | 20000588955136 | SATA     | ZVT98ZER       | OK     | Enabled |
   | node-10    | Disk 0 in Backplane 1 of Storage Controller in Slot 2  | ST20000NM008D-3D | 20000588955136 | SATA     | ZVT98ZEJ       | OK     | Enabled |
   | node-10    | Disk 5 in Backplane 1 of Storage Controller in Slot 2  | ST20000NM008D-3D | 20000588955136 | SATA     | ZVT99QMH       | OK     | Enabled |
   | node-10    | Disk 0 on AHCI Controller in SL 6                      | MTFDDAV240TDU    | 240057409536   | SATA     | 22373BB1E0F8   | OK     | Enabled |
   | node-10    | Disk 1 on AHCI Controller in SL 6                      | MTFDDAV240TDU    | 240057409536   | SATA     | 22373BB1E0D5   | OK     | Enabled |
   +------------+--------------------------------------------------------+------------------+----------------+----------+----------------+--------+---------+

firmwares details
+++++++++++++++++

.. code-block:: bash

   # ceph orch hardware status node-10 --category firmwares
   +------------+----------------------------------------------------------------------------+--------------------------------------------------------------+----------------------+-------------+--------+
   | HOST       | COMPONENT                                                                  | NAME                                                         | DATE                 | VERSION     | STATUS |
   +------------+----------------------------------------------------------------------------+--------------------------------------------------------------+----------------------+-------------+--------+
   | node-10    | current-107649-7.03__raid.backplane.firmware.0                             | Backplane 0                                                  | 2022-12-05T00:00:00Z | 7.03        | OK     |
   ... omitted output ...
   | node-10    | previous-25227-6.10.30.20__idrac.embedded.1-1                              | Integrated Remote Access Controller                          | 00:00:00Z            | 6.10.30.20  | OK     |
   +------------+----------------------------------------------------------------------------+--------------------------------------------------------------+----------------------+-------------+--------+

hardware critical warnings report
+++++++++++++++++++++++++++++++++

.. code-block:: bash

   # ceph orch hardware status --category criticals
   +------------+-----------+------------+----------+-----------------+
   | HOST       | COMPONENT | NAME       | STATUS   | STATE           |
   +------------+-----------+------------+----------+-----------------+
   | node-10    | power     | PS2 Status | critical | unplugged       |
   +------------+-----------+------------+----------+-----------------+
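
A critical warning is simply an entry whose status is not OK; filtering a full
report down to such entries could look like the following sketch (the record
layout mirrors the table columns above and is an assumption, not the actual
data structure used by node-proxy).

```python
# Made-up rows mirroring the HOST/COMPONENT/NAME/STATUS/STATE columns
# of the criticals report; the record layout is an assumption.
rows = [
    {"host": "node-10", "component": "power", "name": "PS1 Status",
     "status": "ok", "state": "plugged"},
    {"host": "node-10", "component": "power", "name": "PS2 Status",
     "status": "critical", "state": "unplugged"},
]

# Keep only the entries whose status is not OK.
criticals = [r for r in rows if r["status"].lower() != "ok"]
for r in criticals:
    print(f"{r['host']}: {r['component']}/{r['name']} "
          f"is {r['status']} ({r['state']})")
```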

Developers
----------

.. py:currentmodule:: cephadm.agent

.. autoclass:: NodeProxyEndpoint

.. automethod:: NodeProxyEndpoint.__init__
.. automethod:: NodeProxyEndpoint.oob
.. automethod:: NodeProxyEndpoint.data
.. automethod:: NodeProxyEndpoint.fullreport
.. automethod:: NodeProxyEndpoint.summary
.. automethod:: NodeProxyEndpoint.criticals
.. automethod:: NodeProxyEndpoint.memory
.. automethod:: NodeProxyEndpoint.storage
.. automethod:: NodeProxyEndpoint.network
.. automethod:: NodeProxyEndpoint.power
.. automethod:: NodeProxyEndpoint.processors
.. automethod:: NodeProxyEndpoint.fans
.. automethod:: NodeProxyEndpoint.firmwares
.. automethod:: NodeProxyEndpoint.led


@ -121,5 +121,6 @@ about Ceph, see our `Architecture`_ section.
releases/general
releases/index
security/index
hardware-monitoring/index
Glossary <glossary>
Tracing <jaegertracing/index>


@ -470,5 +470,8 @@ Useful queries
rate(ceph_rbd_read_latency_sum[30s]) / rate(ceph_rbd_read_latency_count[30s]) * on (instance) group_left (ceph_daemon) ceph_rgw_metadata
Hardware monitoring
===================
See :ref:`hardware-monitoring`.