ceph/doc/dev/crimson/crimson.rst

568 lines
20 KiB
ReStructuredText

=======
crimson
=======
Crimson is the code name of ``crimson-osd``, which is the next
generation ``ceph-osd``. It improves performance when using fast network
and storage devices, employing state-of-the-art technologies including
DPDK and SPDK. BlueStore continues to support HDDs and slower SSDs.
Crimson aims to be backward compatible with the classic ``ceph-osd``.
.. highlight:: console
Building Crimson
================
Crimson is not enabled by default. Enable it at build time by running::
$ WITH_SEASTAR=true ./install-deps.sh
$ ./do_cmake.sh -DWITH_SEASTAR=ON
Please note, `ASan`_ is enabled by default if Crimson is built from a source
cloned using ``git``.
.. _ASan: https://github.com/google/sanitizers/wiki/AddressSanitizer
Testing crimson with cephadm
===============================
The Ceph CI/CD pipeline builds containers with
``crimson-osd`` substituted for ``ceph-osd``.
Once a branch at commit <sha1> has been built and is available in
``shaman``, you can deploy it using the cephadm instructions outlined
in :ref:`cephadm` with the following adaptations.
First, while performing the initial bootstrap, use the ``--image`` flag to
use a Crimson build:
.. prompt:: bash #
cephadm --image quay.ceph.io/ceph-ci/ceph:<sha1>-crimson --allow-mismatched-release bootstrap ...
You'll likely need to supply the ``--allow-mismatched-release`` flag to
use a non-release branch.
Configure Crimson with Bluestore
================================
As Bluestore is not a Crimson native `object store backend`_,
deploying Crimson with Bluestore as the back end requires setting
one of the two following configuration options:
.. note::
#. These two options, along with ``crimson_alien_op_num_threads``,
can't be changed after deployment.
#. `vstart.sh`_ sets these options using the ``--crimson-smp`` flag.
1) ``crimson_seastar_num_threads``
In order to allow easier cluster deployments, this option can be used
instead of setting the CPU mask manually for each OSD.
It's recommended to let the **number of OSDs on each host** multiplied by
``crimson_seastar_num_threads`` to be less than the node's number of CPU
cores (``nproc``).
For example, for deploying two nodes with eight CPU cores and two OSDs each:
.. code-block:: yaml
conf:
# Global to all OSDs
osd:
crimson seastar num threads: 3
.. note::
#. For optimal performance ``crimson_seastar_cpu_cores`` should be set instead.
2) ``crimson_seastar_cpu_cores`` and ``crimson_alien_thread_cpu_cores``.
Explicitly set the CPU core allocation for each ``crimson-osd``
and for the BlueStore back end. It's recommended for each set to be mutually exclusive.
For example, for deploying two nodes with eight CPU cores and two OSDs each:
.. code-block:: yaml
conf:
# Both nodes
osd:
crimson alien thread cpu cores: 6-7
# First node
osd.0:
crimson seastar cpu cores: 0-2
osd.1:
crimson seastar cpu cores: 3-5
# Second node
osd.2:
crimson seastar cpu cores: 0-2
osd.3:
crimson seastar cpu cores: 3-5
For a single node with eight node and three OSDs:
.. code-block:: yaml
conf:
osd:
crimson alien thread cpu cores: 6-7
osd.0:
crimson seastar cpu cores: 0-1
osd.1:
crimson seastar cpu cores: 2-3
osd.2:
crimson seastar cpu cores: 4-5
Running Crimson
===============
.. note::
Crimson is in a tech preview stage.
As you might expect, Crimson does not yet have as extensive a feature set as does ceph-osd.
Malfunctions including crashes and data loss are to be expected.
Enabling Crimson
================
After building Crimson and starting your cluster, but prior to deploying OSDs, you'll need to
`Configure Crimson with Bluestore`_ and enable Crimson to
direct the default pools to be created as Crimson pools. You can proceed by running the following after you have a running cluster:
.. note::
`vstart.sh`_ enables crimson automatically when `--crimson` is used.
.. prompt:: bash #
ceph config set global 'enable_experimental_unrecoverable_data_corrupting_features' crimson
ceph osd set-allow-crimson --yes-i-really-mean-it
ceph config set mon osd_pool_default_crimson true
The first command enables the ``crimson`` experimental feature.
The second enables the ``allow_crimson`` OSDMap flag. The monitor will
not allow ``crimson-osd`` to boot without that flag.
The last causes pools to be created by default with the ``crimson`` flag.
Crimson pools are restricted to operations supported by Crimson.
``Crimson-osd`` won't instantiate PGs from non-Crimson pools.
vstart.sh
=========
The following options can be used with ``vstart.sh``.
``--crimson``
Start ``crimson-osd`` instead of ``ceph-osd``.
``--nodaemon``
Do not daemonize the service.
``--redirect-output``
Redirect the ``stdout`` and ``stderr`` to ``out/$type.$num.stdout``.
``--osd-args``
Pass extra command line options to ``crimson-osd`` or ``ceph-osd``.
This is useful for passing Seastar options to ``crimson-osd``. For
example, one can supply ``--osd-args "--memory 2G"`` to set the amount of
memory to use. Please refer to the output of::
crimson-osd --help-seastar
for additional Seastar-specific command line options.
``--crimson-smp``
The number of cores to use for each OSD.
If BlueStore is used, the balance of available cores
(as determined by `nproc`) will be assigned to the object store.
``--bluestore``
Use the alienized BlueStore as the object store backend. This is the default (see below section on the `object store backend`_ for more details)
``--cyanstore``
Use CyanStore as the object store backend.
``--memstore``
Use the alienized MemStore as the object store backend.
``--seastore``
Use SeaStore as the back end object store.
``--seastore-devs``
Specify the block device used by SeaStore.
``--seastore-secondary-devs``
Optional. SeaStore supports multiple devices. Enable this feature by
passing the block device to this option.
``--seastore-secondary-devs-type``
Optional. Specify the type of secondary devices. When the secondary
device is slower than main device passed to ``--seastore-devs``, the cold
data in faster device will be evicted to the slower devices over time.
Valid types include ``HDD``, ``SSD``(default), ``ZNS``, and ``RANDOM_BLOCK_SSD``
Note secondary devices should not be faster than the main device.
To start a cluster with a single Crimson node, run::
$ MGR=1 MON=1 OSD=1 MDS=0 RGW=0 ../src/vstart.sh \
--without-dashboard --bluestore --crimson \
--redirect-output
Another SeaStore example::
$ MGR=1 MON=1 OSD=1 MDS=0 RGW=0 ../src/vstart.sh -n -x \
--without-dashboard --seastore \
--crimson --redirect-output \
--seastore-devs /dev/sda \
--seastore-secondary-devs /dev/sdb \
--seastore-secondary-devs-type HDD
Stop this ``vstart`` cluster by running::
$ ../src/stop.sh --crimson
Object Store Backend
====================
At the moment, ``crimson-osd`` offers both native and alienized object store
backends. The native object store backends perform IO using the SeaStar reactor.
They are:
.. describe:: cyanstore
CyanStore is modeled after memstore in the classic OSD.
.. describe:: seastore
Seastore is still under active development.
The alienized object store backends are backed by a thread pool, which
is a proxy of the alienstore adaptor running in Seastar. The proxy issues
requests to object stores running in alien threads, i.e., worker threads not
managed by the Seastar framework. They are:
.. describe:: memstore
The memory backend object store
.. describe:: bluestore
The object store used by the classic ``ceph-osd``
daemonize
---------
Unlike ``ceph-osd``, ``crimson-osd`` does not daemonize itself even if the
``daemonize`` option is enabled. In order to read this option, ``crimson-osd``
needs to ready its config sharded service, but this sharded service lives
in the Seastar reactor. If we fork a child process and exit the parent after
starting the Seastar engine, that will leave us with a single thread which is
a replica of the thread that called `fork()`_. Tackling this problem in Crimson
would unnecessarily complicate the code.
Since supported GNU/Linux distributions use ``systemd``, which is able to
daemonize processes, there is no need to daemonize ourselves.
Those using sysvinit can use ``start-stop-daemon`` to daemonize ``crimson-osd``.
If this is does not work out, a helper utility may be devised.
.. _fork(): http://pubs.opengroup.org/onlinepubs/9699919799/functions/fork.html
logging
-------
``Crimson-osd`` currently uses the logging utility offered by Seastar. See
``src/common/dout.h`` for the mapping between Ceph logging levels to
the severity levels in Seastar. For instance, messages sent to ``derr``
will be issued using ``logger::error()``, and the messages with a debug level
greater than ``20`` will be issued using ``logger::trace()``.
+---------+---------+
| ceph | seastar |
+---------+---------+
| < 0 | error |
+---------+---------+
| 0 | warn |
+---------+---------+
| [1, 6) | info |
+---------+---------+
| [6, 20] | debug |
+---------+---------+
| > 20 | trace |
+---------+---------+
Note that ``crimson-osd``
does not send log messages directly to a specified ``log_file``. It writes
the logging messages to stdout and/or syslog. This behavior can be
changed using ``--log-to-stdout`` and ``--log-to-syslog`` command line
options. By default, ``log-to-stdout`` is enabled, and ``--log-to-syslog`` is disabled.
Metrics and Tracing
===================
Crimson offers three ways to report stats and metrics.
PG stats reported to mgr
------------------------
Crimson collects the per-pg, per-pool, and per-osd stats in a `MPGStats`
message which is sent to the Ceph Managers. Manager modules can query
them using the `MgrModule.get()` method.
Asock command
-------------
An admin socket command is offered for dumping metrics::
$ ceph tell osd.0 dump_metrics
$ ceph tell osd.0 dump_metrics reactor_utilization
Here `reactor_utilization` is an optional string allowing us to filter
the dumped metrics by prefix.
Prometheus text protocol
------------------------
The listening port and address can be configured using the command line options of
`--prometheus_port`
see `Prometheus`_ for more details.
.. _Prometheus: https://github.com/scylladb/seastar/blob/master/doc/prometheus.md
Profiling Crimson
=================
Fio
---
``crimson-store-nbd`` exposes configurable ``FuturizedStore`` internals as an
NBD server for use with ``fio``.
In order to use ``fio`` to test ``crimson-store-nbd``, perform the below steps.
#. You will need to install ``libnbd``, and compile it into ``fio``
.. prompt:: bash $
apt-get install libnbd-dev
git clone git://git.kernel.dk/fio.git
cd fio
./configure --enable-libnbd
make
#. Build ``crimson-store-nbd``
.. prompt:: bash $
cd build
ninja crimson-store-nbd
#. Run the ``crimson-store-nbd`` server with a block device. Specify
the path to the raw device, for example ``/dev/nvme1n1``, in place of the created
file for testing with a block device.
.. prompt:: bash $
export disk_img=/tmp/disk.img
export unix_socket=/tmp/store_nbd_socket.sock
rm -f $disk_img $unix_socket
truncate -s 512M $disk_img
./bin/crimson-store-nbd \
--device-path $disk_img \
--smp 1 \
--mkfs true \
--type transaction_manager \
--uds-path ${unix_socket} &
Below are descriptions of these command line arguments:
``--smp``
The number of CPU cores to use (Symmetric MultiProcessor)
``--mkfs``
Initialize the device first.
``--type``
The back end to use. If ``transaction_manager`` is specified, SeaStore's
``TransactionManager`` and ``BlockSegmentManager`` are used to emulate a
block device. Otherwise, this option is used to choose a backend of
``FuturizedStore``, where the whole "device" is divided into multiple
fixed-size objects whose size is specified by ``--object-size``. So, if
you are only interested in testing the lower-level implementation of
SeaStore like logical address translation layer and garbage collection
without the object store semantics, ``transaction_manager`` would be a
better choice.
#. Create a ``fio`` job file named ``nbd.fio``
.. code:: ini
[global]
ioengine=nbd
uri=nbd+unix:///?socket=${unix_socket}
rw=randrw
time_based
runtime=120
group_reporting
iodepth=1
size=512M
[job0]
offset=0
#. Test the Crimson object store, using the custom ``fio`` built just now
.. prompt:: bash $
./fio nbd.fio
CBT
---
We can use `cbt`_ for performance tests::
$ git checkout main
$ make crimson-osd
$ ../src/script/run-cbt.sh --cbt ~/dev/cbt -a /tmp/baseline ../src/test/crimson/cbt/radosbench_4K_read.yaml
$ git checkout yet-another-pr
$ make crimson-osd
$ ../src/script/run-cbt.sh --cbt ~/dev/cbt -a /tmp/yap ../src/test/crimson/cbt/radosbench_4K_read.yaml
$ ~/dev/cbt/compare.py -b /tmp/baseline -a /tmp/yap -v
19:48:23 - INFO - cbt - prefill/gen8/0: bandwidth: (or (greater) (near 0.05)):: 0.183165/0.186155 => accepted
19:48:23 - INFO - cbt - prefill/gen8/0: iops_avg: (or (greater) (near 0.05)):: 46.0/47.0 => accepted
19:48:23 - WARNING - cbt - prefill/gen8/0: iops_stddev: (or (less) (near 0.05)):: 10.4403/6.65833 => rejected
19:48:23 - INFO - cbt - prefill/gen8/0: latency_avg: (or (less) (near 0.05)):: 0.340868/0.333712 => accepted
19:48:23 - INFO - cbt - prefill/gen8/1: bandwidth: (or (greater) (near 0.05)):: 0.190447/0.177619 => accepted
19:48:23 - INFO - cbt - prefill/gen8/1: iops_avg: (or (greater) (near 0.05)):: 48.0/45.0 => accepted
19:48:23 - INFO - cbt - prefill/gen8/1: iops_stddev: (or (less) (near 0.05)):: 6.1101/9.81495 => accepted
19:48:23 - INFO - cbt - prefill/gen8/1: latency_avg: (or (less) (near 0.05)):: 0.325163/0.350251 => accepted
19:48:23 - INFO - cbt - seq/gen8/0: bandwidth: (or (greater) (near 0.05)):: 1.24654/1.22336 => accepted
19:48:23 - INFO - cbt - seq/gen8/0: iops_avg: (or (greater) (near 0.05)):: 319.0/313.0 => accepted
19:48:23 - INFO - cbt - seq/gen8/0: iops_stddev: (or (less) (near 0.05)):: 0.0/0.0 => accepted
19:48:23 - INFO - cbt - seq/gen8/0: latency_avg: (or (less) (near 0.05)):: 0.0497733/0.0509029 => accepted
19:48:23 - INFO - cbt - seq/gen8/1: bandwidth: (or (greater) (near 0.05)):: 1.22717/1.11372 => accepted
19:48:23 - INFO - cbt - seq/gen8/1: iops_avg: (or (greater) (near 0.05)):: 314.0/285.0 => accepted
19:48:23 - INFO - cbt - seq/gen8/1: iops_stddev: (or (less) (near 0.05)):: 0.0/0.0 => accepted
19:48:23 - INFO - cbt - seq/gen8/1: latency_avg: (or (less) (near 0.05)):: 0.0508262/0.0557337 => accepted
19:48:23 - WARNING - cbt - 1 tests failed out of 16
Here we compile and run the same test against two branches: ``main`` and ``yet-another-pr``.
We then compare the results. Along with every test case, a set of rules is defined to check for
performance regressions when comparing the sets of test results. If a possible regression is found, the rule and
corresponding test results are highlighted.
.. _cbt: https://github.com/ceph/cbt
Hacking Crimson
===============
Seastar Documents
-----------------
See `Seastar Tutorial <https://github.com/scylladb/seastar/blob/master/doc/tutorial.md>`_ .
Or build a browsable version and start an HTTP server::
$ cd seastar
$ ./configure.py --mode debug
$ ninja -C build/debug docs
$ python3 -m http.server -d build/debug/doc/html
You might want to install ``pandoc`` and other dependencies beforehand.
Debugging Crimson
=================
Debugging with GDB
------------------
The `tips`_ for debugging Scylla also apply to Crimson.
.. _tips: https://github.com/scylladb/scylla/blob/master/docs/dev/debugging.md#tips-and-tricks
Human-readable backtraces with addr2line
----------------------------------------
When a Seastar application crashes, it leaves us with a backtrace of addresses, like::
Segmentation fault.
Backtrace:
0x00000000108254aa
0x00000000107f74b9
0x00000000105366cc
0x000000001053682c
0x00000000105d2c2e
0x0000000010629b96
0x0000000010629c31
0x00002a02ebd8272f
0x00000000105d93ee
0x00000000103eff59
0x000000000d9c1d0a
/lib/x86_64-linux-gnu/libc.so.6+0x000000000002409a
0x000000000d833ac9
Segmentation fault
The ``seastar-addr2line`` utility provided by Seastar can be used to map these
addresses to functions. The script expects input on ``stdin``,
so we need to copy and paste the above addresses, then send EOF by inputting
``control-D`` in the terminal. One might use ``echo`` or ``cat`` instead::
$ ../src/seastar/scripts/seastar-addr2line -e bin/crimson-osd
0x00000000108254aa
0x00000000107f74b9
0x00000000105366cc
0x000000001053682c
0x00000000105d2c2e
0x0000000010629b96
0x0000000010629c31
0x00002a02ebd8272f
0x00000000105d93ee
0x00000000103eff59
0x000000000d9c1d0a
0x00000000108254aa
[Backtrace #0]
seastar::backtrace_buffer::append_backtrace() at /home/kefu/dev/ceph/build/../src/seastar/src/core/reactor.cc:1136
seastar::print_with_backtrace(seastar::backtrace_buffer&) at /home/kefu/dev/ceph/build/../src/seastar/src/core/reactor.cc:1157
seastar::print_with_backtrace(char const*) at /home/kefu/dev/ceph/build/../src/seastar/src/core/reactor.cc:1164
seastar::sigsegv_action() at /home/kefu/dev/ceph/build/../src/seastar/src/core/reactor.cc:5119
seastar::install_oneshot_signal_handler<11, &seastar::sigsegv_action>()::{lambda(int, siginfo_t*, void*)#1}::operator()(int, siginfo_t*, void*) const at /home/kefu/dev/ceph/build/../src/seastar/src/core/reactor.cc:5105
seastar::install_oneshot_signal_handler<11, &seastar::sigsegv_action>()::{lambda(int, siginfo_t*, void*)#1}::_FUN(int, siginfo_t*, void*) at /home/kefu/dev/ceph/build/../src/seastar/src/core/reactor.cc:5101
?? ??:0
seastar::smp::configure(boost::program_options::variables_map, seastar::reactor_config) at /home/kefu/dev/ceph/build/../src/seastar/src/core/reactor.cc:5418
seastar::app_template::run_deprecated(int, char**, std::function<void ()>&&) at /home/kefu/dev/ceph/build/../src/seastar/src/core/app-template.cc:173 (discriminator 5)
main at /home/kefu/dev/ceph/build/../src/crimson/osd/main.cc:131 (discriminator 1)
Note that ``seastar-addr2line`` is able to extract addresses from
its input, so you can also paste the log messages as below::
2020-07-22T11:37:04.500 INFO:teuthology.orchestra.run.smithi061.stderr:Backtrace:
2020-07-22T11:37:04.500 INFO:teuthology.orchestra.run.smithi061.stderr: 0x0000000000e78dbc
2020-07-22T11:37:04.501 INFO:teuthology.orchestra.run.smithi061.stderr: 0x0000000000e3e7f0
2020-07-22T11:37:04.501 INFO:teuthology.orchestra.run.smithi061.stderr: 0x0000000000e3e8b8
2020-07-22T11:37:04.501 INFO:teuthology.orchestra.run.smithi061.stderr: 0x0000000000e3e985
2020-07-22T11:37:04.501 INFO:teuthology.orchestra.run.smithi061.stderr: /lib64/libpthread.so.0+0x0000000000012dbf
Unlike the classic ``ceph-osd``, Crimson does not print a human-readable backtrace when it
handles fatal signals like `SIGSEGV` or `SIGABRT`. It is also more complicated
with a stripped binary. So instead of planting a signal handler for
those signals into Crimson, we can use `script/ceph-debug-docker.sh` to map
addresses in the backtrace::
# assuming you are under the source tree of ceph
$ ./src/script/ceph-debug-docker.sh --flavor crimson master:27e237c137c330ebb82627166927b7681b20d0aa centos:8
....
[root@3deb50a8ad51 ~]# wget -q https://raw.githubusercontent.com/scylladb/seastar/master/scripts/seastar-addr2line
[root@3deb50a8ad51 ~]# dnf install -q -y file
[root@3deb50a8ad51 ~]# python3 seastar-addr2line -e /usr/bin/crimson-osd
# paste the backtrace here
Code Walkthroughs
=================
* `Ceph Code Walkthroughs: Crimson <https://www.youtube.com/watch?v=rtkrHk6grsg>`_
* `Ceph Code Walkthroughs: SeaStore <https://www.youtube.com/watch?v=0rr5oWDE2Ck>`_