2019-09-16 09:26:22 +00:00
|
|
|
MDS Journaling
|
|
|
|
==============
|
|
|
|
|
|
|
|
CephFS Metadata Pool
|
|
|
|
--------------------
|
|
|
|
|
|
|
|
CephFS uses a separate (metadata) pool for managing file metadata (inodes and
|
|
|
|
dentries) in a Ceph File System. The metadata pool has all the information about
|
|
|
|
files in a Ceph File System including the File System hierarchy. Additionally,
|
|
|
|
CephFS maintains meta information related to other entities in a file system
|
|
|
|
such as file system journals, open file table, session map, etc.
|
|
|
|
|
|
|
|
This document describes how Ceph Metadata Servers use and rely on journaling.
|
|
|
|
|
|
|
|
CephFS MDS Journaling
|
|
|
|
---------------------
|
|
|
|
|
|
|
|
CephFS metadata servers stream a journal of metadata events into RADOS in the metadata
|
|
|
|
pool prior to executing a file system operation. Active MDS daemon(s) manage metadata
|
|
|
|
for files and directories in CephFS.
|
|
|
|
|
|
|
|
CephFS uses journaling for couple of reasons:
|
|
|
|
|
|
|
|
#. Consistency: On an MDS failover, the journal events can be replayed to reach a
|
|
|
|
consistent file system state. Also, metadata operations that require multiple
|
|
|
|
updates to the backing store need to be journaled for crash consistency (along
|
|
|
|
with other consistency mechanisms such as locking, etc..).
|
|
|
|
|
|
|
|
#. Performance: Journal updates are (mostly) sequential, hence updates to journals
|
|
|
|
are fast. Furthermore, updates can be batched into single write, thereby saving
|
|
|
|
disk seek time involved in updates to different parts of a file. Having a large
|
|
|
|
journal also helps a standby MDS to warm its cache which helps indirectly during
|
|
|
|
MDS failover.
|
|
|
|
|
|
|
|
Each active metadata server maintains its own journal in the metadata pool. Journals
|
|
|
|
are striped over multiple objects. Journal entries which are not required (deemed as
|
|
|
|
old) are trimmed by the metadata server.
|
|
|
|
|
|
|
|
Journal Events
|
|
|
|
--------------
|
|
|
|
|
|
|
|
Apart from journaling file system metadata updates, CephFS journals various other events
|
|
|
|
such as client session info and directory import/export state to name a few. These events
|
|
|
|
are used by the metadata sever to reestablish correct state as required, e.g., Ceph MDS
|
|
|
|
tries to reconnect clients on restart when journal events get replayed and a specific
|
|
|
|
event type in the journal specifies that a client entity type has a session with the MDS
|
|
|
|
before it was restarted.
|
|
|
|
|
|
|
|
To examine the list of such events recorded in the journal, CephFS provides a command
|
|
|
|
line utility `cephfs-journal-tool` which can be used as follows:
|
|
|
|
|
|
|
|
::
|
|
|
|
|
|
|
|
cephfs-journal-tool --rank=<fs>:<rank> event get list
|
|
|
|
|
|
|
|
`cephfs-journal-tool` is also used to discover and repair a damaged Ceph File System.
|
|
|
|
(See :doc:`/cephfs/cephfs-journal-tool` for more details)
|
2019-10-07 11:08:48 +00:00
|
|
|
|
|
|
|
Journal Event Types
|
|
|
|
-------------------
|
|
|
|
|
|
|
|
Following are various event types that are journaled by the MDS.
|
|
|
|
|
|
|
|
#. `EVENT_COMMITTED`: Mark a request (id) as committed.
|
|
|
|
|
|
|
|
#. `EVENT_EXPORT`: Maps directories to an MDS rank.
|
|
|
|
|
|
|
|
#. `EVENT_FRAGMENT`: Tracks various stages of directory fragmentation (split/merge).
|
|
|
|
|
|
|
|
#. `EVENT_IMPORTSTART`: Logged when an MDS rank starts importing directory fragments.
|
|
|
|
|
|
|
|
#. `EVENT_IMPORTFINISH`: Logged when an MDS rank finishes importing directory fragments.
|
|
|
|
|
|
|
|
#. `EVENT_NOOP`: No operation event type for skipping over a journal region.
|
|
|
|
|
|
|
|
#. `EVENT_OPEN`: Tracks which inodes have open file handles.
|
|
|
|
|
|
|
|
#. `EVENT_RESETJOURNAL`: Used to mark a journal as `reset` post truncation.
|
|
|
|
|
|
|
|
#. `EVENT_SESSION`: Tracks open client sessions.
|
|
|
|
|
|
|
|
#. `EVENT_SLAVEUPDATE`: Logs various stages of an operation that has been forwarded to a (slave) mds.
|
|
|
|
|
|
|
|
#. `EVENT_SUBTREEMAP`: Map of directory inodes to directory contents (subtree partition).
|
|
|
|
|
|
|
|
#. `EVENT_TABLECLIENT`: Log transition states of MDSs view of client tables (snap/anchor).
|
|
|
|
|
|
|
|
#. `EVENT_TABLESERVER`: Log transition states of MDSs view of server tables (snap/anchor).
|
|
|
|
|
|
|
|
#. `EVENT_UPDATE`: Log file operations on an inode.
|
2022-11-04 12:56:49 +00:00
|
|
|
|
|
|
|
#. `EVENT_SEGMENT`: Log a new journal segment boundary.
|
|
|
|
|
|
|
|
#. `EVENT_LID`: Mark the beginning of a journal without a logical subtree map.
|
|
|
|
|
|
|
|
Journal Segments
|
|
|
|
----------------
|
|
|
|
|
|
|
|
The MDS journal is composed of logical segments, called LogSegments in the
|
|
|
|
code. These segments are used to collect metadata updates by multiple events
|
|
|
|
into one logical unit for the purposes of trimming. Whenever the journal tries
|
|
|
|
to commit metadata operations (e.g. flush a file create out as an omap update
|
|
|
|
to a dirfrag object), it does so in a replayable batch of updates from the
|
|
|
|
LogSegment. The updates must be replayable in case the MDS fails during the
|
|
|
|
series of updates to various metadata objects. The reason the updates are
|
|
|
|
performed in batch is to group updates to the same metadata object (a dirfrag)
|
|
|
|
where multiple omap entries are probably updated in the same time period.
|
|
|
|
|
|
|
|
Once a segment is trimmed, it is considered "expired". An expired segment is
|
|
|
|
eligible for deletion by the journaler as all of its updates are flushed to the
|
|
|
|
backing RADOS objects. This is done by updating the "expire position" of the
|
|
|
|
journaler to advance past the end of the expired segment. Some expired segments
|
|
|
|
may be kept in the journal to improve cache locality when the MDS restarts.
|
|
|
|
|
|
|
|
For most of CephFS's history (up to 2023), the journal segments were delineated
|
|
|
|
by subtree maps, the ``ESubtreeMap`` event. The major reason for this is that
|
|
|
|
journal recovery must start with a copy of the subtree map before replaying any
|
|
|
|
other events.
|
|
|
|
|
|
|
|
Now, log segments can be delineated by events which are a ``SegmentBoundary``.
|
|
|
|
These include, ``ESubtreeMap``, ``EResetJournal``, ``ESegment`` (2023), or
|
|
|
|
``ELid`` (2023). For ``ESegment``, this light-weight segment boundary allows
|
|
|
|
the MDS to journal the subtree map less frequently while also keeping the
|
|
|
|
journal segments small to keep trimming events short. In order to maintain the
|
|
|
|
constraint that the first event journal replay sees is the ``ESubtreeMap``,
|
|
|
|
those segments beginning with that event are considered "major segments" and a
|
|
|
|
new constraint was added to the deletion of expired segments: the first segment
|
|
|
|
of the journal must always be a major segment.
|
|
|
|
|
|
|
|
The ``ELid`` event exists to mark the MDS journal as "new" where a logical
|
|
|
|
``LogSegment`` and log sequence number is required for other operations to
|
|
|
|
proceed, in particular the MDSTable operations. The MDS uses this event when
|
|
|
|
creating a rank or shutting it down. No subtree map is required when replaying
|
|
|
|
the rank from this initial state.
|
|
|
|
|
|
|
|
|
|
|
|
Configurations
|
|
|
|
--------------
|
|
|
|
|
|
|
|
The targetted size of a log segment in terms of number of events is controlled by:
|
|
|
|
|
|
|
|
.. confval:: mds_log_events_per_segment
|
|
|
|
|
|
|
|
The frequency of major segments (noted by the journaling of the latest ``ESubtreeMap``) is controlled by:
|
|
|
|
|
|
|
|
.. confval:: mds_log_major_segment_event_ratio
|
|
|
|
|
|
|
|
When ``mds_log_events_per_segment * mds_log_major_segment_event_ratio``
|
|
|
|
non-``ESubtreeMap`` events are logged, the MDS will journal a new
|
|
|
|
``ESubtreeMap``. This is necessary to allow the journal to shrink in size
|
|
|
|
during the trimming of expired segments.
|
|
|
|
|
|
|
|
The target maximum number of segments is controlled by:
|
|
|
|
|
|
|
|
.. confval:: mds_log_max_segments
|
|
|
|
|
|
|
|
The MDS will often sit a little above this number due to non-major segments
|
|
|
|
awaiting trimming up to the next major segment.
|