Disaster recovery
=================

.. danger::

    The notes in this section are aimed at experts making a best-effort
    attempt to recover what they can from damaged filesystems. These steps
    have the potential to make things worse as well as better. If you
    are unsure, do not proceed.

Journal export
--------------

Before attempting dangerous operations, make a copy of the journal like so:

::

    cephfs-journal-tool journal export backup.bin

Note that this command may not always work if the journal is badly corrupted,
in which case a RADOS-level copy should be made (http://tracker.ceph.com/issues/9902).
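
As a minimal sketch of the backup step above, the following shell function wraps the export and refuses to report success if the resulting file is missing or empty. The function name ``backup_journal`` and the ``JOURNAL_TOOL`` override variable are hypothetical names introduced for illustration. The sketch also assumes that the tool exits non-zero when the export fails, which may not hold for every kind of corruption.

```shell
# Sketch only: backup_journal and JOURNAL_TOOL are hypothetical names.
# Assumes the export command exits non-zero on failure.
backup_journal() {
    out="$1"
    tool="${JOURNAL_TOOL:-cephfs-journal-tool}"

    # Same command as shown above, with a caller-chosen output file.
    if ! "$tool" journal export "$out"; then
        echo "journal export failed; consider a RADOS-level copy" >&2
        return 1
    fi

    # A missing or empty file gives nothing to fall back on later.
    if [ ! -s "$out" ]; then
        echo "backup file $out is missing or empty" >&2
        return 1
    fi

    echo "journal backed up to $out"
}
```

Calling ``backup_journal backup.bin`` before any of the destructive steps below gives an early signal that the backup is unusable, although a non-empty file is not proof of a complete journal copy.
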

Dentry recovery from journal
----------------------------

If a journal is damaged or for any reason an MDS is incapable of replaying it,
attempt to recover what file metadata you can like so:

::

    cephfs-journal-tool event recover_dentries summary

By default this command acts on MDS rank 0. Pass ``--rank=<n>`` to operate on other ranks.

This command writes any inodes/dentries recoverable from the journal
into the backing store, provided these inodes/dentries are higher-versioned
than the existing contents of the backing store. Any regions of the journal
that are missing or damaged are skipped.

Note that in addition to writing out dentries and inodes, this command updates
the InoTables of each 'in' MDS rank to indicate that the inode numbers it
has written are now in use. In simple cases, this results in an entirely valid
backing store state.

.. warning::

    The resulting state of the backing store is not guaranteed to be self-consistent,
    and an online MDS scrub will be required afterwards. This command does not
    modify the journal contents, so you should truncate the journal
    separately after recovering what you can.
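
When several ranks need recovery, the per-rank invocation described above can be written out ahead of time. The sketch below only prints the commands so that they can be reviewed before anything touches the backing store. The function name ``plan_recover_dentries`` is hypothetical, and placing ``--rank`` before the subcommand follows the ``--rank=0 journal reset`` form used elsewhere in this document.

```shell
# Sketch only: plan_recover_dentries is a hypothetical helper.
# It prints one recover_dentries command per rank and runs nothing.
plan_recover_dentries() {
    for rank in "$@"; do
        echo "cephfs-journal-tool --rank=$rank event recover_dentries summary"
    done
}
```

For example, ``plan_recover_dentries 0 1`` prints the command for rank 0 followed by the command for rank 1.
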

Journal truncation
------------------

If the journal is corrupt or MDSs cannot replay it for any reason, you can
truncate it like so:

::

    cephfs-journal-tool journal reset

.. warning::

    Resetting the journal *will* lose metadata unless you have extracted
    it by other means such as ``recover_dentries``. It is likely to leave
    some orphaned objects in the data pool. It may result in re-allocation
    of already-written inodes, such that permission rules could be violated.

MDS table wipes
---------------

After the journal has been reset, it may no longer be consistent with respect
to the contents of the MDS tables (InoTable, SessionMap, SnapServer).

To reset the SessionMap (erase all sessions), use:

::

    cephfs-table-tool all reset session

This command acts on the tables of all 'in' MDS ranks. Replace 'all' with an MDS
rank to operate on that rank only.

The session table is the table most likely to need resetting, but if you know you
also need to reset the other tables then replace 'session' with 'snap' or 'inode'.
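
The substitution rules above (a rank or 'all', and one of the three table names) can be captured in a small helper that rejects anything else. This is a sketch under the assumption that the three table names listed in this section are the only valid ones. The function name ``table_reset_cmd`` is hypothetical, and it only prints the command rather than running it.

```shell
# Sketch only: table_reset_cmd is a hypothetical helper.
# It prints the reset command for a rank (or "all") and a table name.
table_reset_cmd() {
    rank="$1"
    table="$2"

    # Only the table names mentioned in this section are accepted.
    case "$table" in
        session|snap|inode) ;;
        *)
            echo "unknown table: $table" >&2
            return 1
            ;;
    esac

    echo "cephfs-table-tool $rank reset $table"
}
```
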

MDS map reset
-------------

Once the in-RADOS state of the filesystem (i.e. contents of the metadata pool)
is somewhat recovered, it may be necessary to update the MDS map to reflect
the contents of the metadata pool. Use the following command to reset the MDS
map to a single MDS:

::

    ceph fs reset <fs name> --yes-i-really-mean-it

Once this is run, any in-RADOS state for MDS ranks other than 0 will be ignored,
so running this command can result in data loss.

One might wonder what the difference is between 'fs reset' and 'fs remove; fs new'. The
key distinction is that doing a remove/new will leave rank 0 in 'creating' state, such
that it would overwrite any existing root inode on disk and orphan any existing files. In
contrast, the 'reset' command will leave rank 0 in 'active' state, such that the next MDS
daemon to claim the rank will go ahead and use the existing in-RADOS metadata.

Recovery from missing metadata objects
--------------------------------------

Depending on what objects are missing or corrupt, you may need to
run various commands to regenerate default versions of the
objects.

::

    # Session table
    cephfs-table-tool 0 reset session
    # SnapServer
    cephfs-table-tool 0 reset snap
    # InoTable
    cephfs-table-tool 0 reset inode
    # Journal
    cephfs-journal-tool --rank=0 journal reset
    # Root inodes ("/" and MDS directory)
    cephfs-data-scan init

Finally, you can regenerate metadata objects for missing files
and directories based on the contents of a data pool. This is
a two-phase process. The first phase scans *all* objects to calculate
size and mtime metadata for inodes. The second phase scans the first
object from every file to collect this metadata and inject
it into the metadata pool.

::

    cephfs-data-scan scan_extents <data pool>
    cephfs-data-scan scan_inodes <data pool>

These commands may take a very long time if there are many
files or very large files in the data pool. To accelerate
the process, run multiple instances of the tool. Decide on
a number of workers, and pass each worker a number within
the range 0-(N_workers - 1), like so:

::

    # Worker 0
    cephfs-data-scan scan_extents <data pool> 0 1
    # Worker 1
    cephfs-data-scan scan_extents <data pool> 1 1

    # Worker 0
    cephfs-data-scan scan_inodes <data pool> 0 1
    # Worker 1
    cephfs-data-scan scan_inodes <data pool> 1 1

It is important to ensure that all workers have completed the
scan_extents phase before any workers enter the scan_inodes phase.
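
The ordering requirement above amounts to a barrier between the two phases. The sketch below shows one way to enforce it in a single shell script: start every worker of a phase in the background, wait for all of them, and only then start the next phase. The names ``run_phase`` and ``run_two_phase_scan`` are hypothetical, and ``run_worker`` is a placeholder that the operator must define to invoke ``cephfs-data-scan`` with whatever worker arguments their version of the tool expects.

```shell
# Sketch only: run_phase, run_two_phase_scan and run_worker are
# hypothetical names. run_worker PHASE INDEX must be defined by the
# operator, for example as a wrapper around
#   cephfs-data-scan PHASE <data pool> <worker arguments>

# Start all workers for one phase and wait for every one of them.
run_phase() {
    phase="$1"
    n_workers="$2"
    pids=""
    i=0
    while [ "$i" -lt "$n_workers" ]; do
        run_worker "$phase" "$i" &
        pids="$pids $!"
        i=$((i + 1))
    done

    # Barrier: collect each worker's exit status before returning.
    rc=0
    for pid in $pids; do
        wait "$pid" || rc=1
    done
    return "$rc"
}

# scan_inodes starts only after every scan_extents worker succeeded.
run_two_phase_scan() {
    run_phase scan_extents "$1" && run_phase scan_inodes "$1"
}
```

If the workers run on different hosts, the same barrier has to be enforced by whatever coordinates those hosts, because ``wait`` only covers processes started by the local shell.
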

Finding files affected by lost data PGs
---------------------------------------

Losing a data PG may affect many files. Files are split into many objects,
so identifying which files are affected by loss of particular PGs requires
a full scan over all object IDs that may exist within the size of a file.
This type of scan may be useful for identifying which files require
restoring from a backup.

.. danger::

    This command does not repair any metadata, so when restoring files in
    this case you must *remove* the damaged file, and replace it in order
    to have a fresh inode. Do not overwrite damaged files in place.

If you know that objects have been lost from PGs, use the ``pg_files``
subcommand to scan for files that may have been damaged as a result:

::

    cephfs-data-scan pg_files <path> <pg id> [<pg id>...]

For example, if you have lost data from PGs 1.4 and 4.5, and you would like
to know which files under /home/bob might have been damaged:

::

    cephfs-data-scan pg_files /home/bob 1.4 4.5

The output will be a list of paths to potentially damaged files, one
per line.

Note that this command acts as a normal CephFS client to find all the
files in the filesystem and read their layouts, so the MDS must be
up and running.
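
Because the output is one path per line, it can be piped into ordinary text tools to plan a restore. The sketch below counts potentially damaged files per parent directory, which can help decide whether to restore individual files or whole directories from backup. The function name ``count_by_directory`` is hypothetical, and the sketch assumes that paths do not contain newline characters.

```shell
# Sketch only: count_by_directory is a hypothetical helper.
# Reads one path per line on stdin and prints "<count><TAB><directory>".
count_by_directory() {
    while IFS= read -r path; do
        # Skip blank lines so they are not counted as a directory.
        if [ -n "$path" ]; then
            dirname "$path"
        fi
    done | sort | uniq -c | awk '{ n = $1; sub(/^ *[0-9]+ /, ""); print n "\t" $0 }'
}
```

A typical use would be to pipe the command from the example above into it, as in ``cephfs-data-scan pg_files /home/bob 1.4 4.5 | count_by_directory``.
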