As Chris reported at the following mail, although
btrfs property has its own manpage, man 8 btrfs-property,
there is no explanation about it in man 8 btrfs.
https://www.mail-archive.com/linux-btrfs@vger.kernel.org/msg40134.html
Signed-off-by: Satoru Takeuchi <takeuchi_satoru@jp.fujitsu.com>
Reported-by: Chris Murphy <lists@colorremedies.com>
Signed-off-by: David Sterba <dsterba@suse.cz>
There are options to specify if the subvolume deletion should wait for
commit after each subvol or at the end. This is reported at the
beginning and considered as a noise. We'd like to report the mode for
each subvolume instead.
http://www.mail-archive.com/linux-btrfs%40vger.kernel.org/msg34617.html
Reported-by: Marc MERLIN <marc@merlins.org
Signed-off-by: David Sterba <dsterba@suse.cz>
If some critical roots are corrupt, reapr_root_items() will fail,
this is detected by fsck_tests.sh's extent rebuilding tests.
Signed-off-by: Wang Shilong <wangshilong1991@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.cz>
This patch makes us to rebuild a really corrupt extent tree with snapshots.
To implement this, we have to verify whether a block is FULL BACKREF.
This idea come from Josef Bacik:
1) We walk down the original tree, every eb we encounter has
btrfs_header_owner(eb) == root->objectid. We add normal references
for this root (BTRFS_TREE_BLOCK_REF_KEY) for this root. World peace
is achieved.
2) We walk down the snapshotted tree. Say we didn't change anything
at all, it was just a clean snapshot and then boom. So the
btrfs_header_owner(root->node) == root->objectid, so normal backref.
We walk down to the next level, where btrfs_header_owner(eb) !=
root->objectid, but the level above did, so we add normal refs for all
of these blocks. We go down the next level, now our
btrfs_header_owner(parent) != root->objectid and
btrfs_header_owner(eb) != root->objectid. This is where we need to
now go back and see if btrfs_header_owner(eb) currently has a ref on
eb. If it does we are done, move on to the next block in this same
level, we don't have to go further down.
3) Harder case, we snapshotted and then changed things in the original
root. Do the same thing as in step 2, but now we get down to
btrfs_header_owner(eb) != root->objectid && btrfs_header_owner(parent)
!= root->objectid. We lookup the references we have for eb and notice
that btrfs_header_owner(eb) no longer refers to eb. So now we must
set FULL_BACKREF on this extent reference and add a
SHARED_BLOCK_REF_KEY for this eb using the parent->start as the
offset. And we need to keep walking down and doing the same thing
until we either hit level 0 or btrfs_header_owner(eb) has a ref on the
block.
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Wang Shilong <wangshilong1991@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.cz>
Previously, we deal with node block firstly and then leaf block which can
maximize readahead. However, to rebuild extent tree, we need deal with snapshot
one by one.
This patch makes us deal with snapshot one by one if we need rebuild extent
tree otherwise we drop into previous way.
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Wang Shilong <wangshilong1991@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.cz>
According to public poll, this is desired and deemed to be safe. Feature
introduced in kernel 3.10 (Jun 2013).
Signed-off-by: David Sterba <dsterba@suse.cz>
Add a basic inode item rebuild function for I_ERR_NO_INODE_ITEM.
The main use case is to repair btrfs which fs root has corrupted leaf,
but it is already working for case if the corrupteed fs root leaf/node
contains no inode extent_data.
The repair needs 3 elements for inode rebuild:
1. inode number
This is quite easy, existing inode_record codes will detect it quite
well.
2. inode type
This is the trick part. The only reliable method is to recovery it from
parent's dir_index/item.
The remaining method will search for regular file extent for FILE
type or child's backref for DIR(todo).
Fallback will be FILE.
Inode name(inode_ref) will be recoverd by nlink repair function.
This is just a fundamental implement, some advanced recovery can be
improved later with btrfs-progs infrastructure change.
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.cz>
Current btrfsck can skip corrupted leaf and even repair some corrupted
one if their bytenr or key order is wrong.
However when it comes to csum error leaf, btrfsck can only skip them,
which is OK for read-only iteration, but is bad for write.
This patch introduce the new repair_btree() function to recover the
btree, deleting all the corrupted leaf/node including corresponding
extent, allowing later write into the btree.
This patch provides the basis for later recovery with corrupted leaves.
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.cz>
When leaf/node is corrupted in fs/subvolume root, btrfsck can ignore it
without much pain except some stderr messages complaining about it.
But this works fine doing read-only works, if we want to do deeper
recovery like rebuild missing inodes in the b+tree, it will cause
problem.
At least, info user that there is something wrong in the btree,
and this patch provides the base for later btree repair.
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.cz>
[BUG]
At least two users have already hit a bug in btrfs causing file
missing(Chromium config file).
The missing file's nlink is still 1 but its backref points to non-exist
parent inode.
This should be a kernel bug, but btrfsck fix is needed anyway.
[FIX]
For such nlink mismatch inode, we will reset all the inode_ref with its
dir_index/item (including valid one), and re-add the valids.
If there is no valid backref for it, create 'lost+found' under the root
of the subvolume and add link to the directory.
Reported-by: Mike Gavrilov <mikhail.v.gavrilov@gmail.com>
Reported-by: Ed Tomlinson <edt@aei.ca>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.cz>
Add find_file_name() and find_file_type() function for later nlink and
inode_item repair.
Later nlink repair will use both function and and inode_item repair will
use find_file_type().
They are done by searching the backref list, dir_item/index for type
search and dir_item/index or inode_ref for name search.
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.cz>
Add count_digits() function in utils.h to help calculate filename with
ino suffix.
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.cz>
With the previous btrfs inode operations patches, now we can use
btrfs_mkdir() to create the 'lost+found' dir to do some data salvage in
btrfsck.
This patch along with previous ones will make data salvage easier.
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.cz>
Add btrfs_unlink() and btrfs_add_link() functions in inode.c,
for the incoming btrfs_mkdir() and later inode operations functions.
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.cz>
Record highest inode number before inode repair.
This is especially important for corrupted leaf case.
Under that case, if use btrfs_find_free_objectid, it may find a ino
existing in corrupted leaf but dropped by btree_recover.
If that happens, created dir will be referenced incorrectly since there
may be inode_ref or dir_index/item refers to it.
So we must record the highest inode number according to the inode_cache.
Inode_cache is OK since when a inode_ref or dir_index/item is found even
the referenced source is not found, it will be created.
If we record the highest inode number of inode_cache, and use
highest_inode + 1 as 'lost+found' dir, it will ensure the newly created
dir not conflicting with any possible inode.
This provides the basis for nlink or inode rebuild for repairing btrfs
with leaf/node corruption.
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.cz>
Allow direct search for the last cache extent.
Provide the basis for finding the last ino in inode_cache.
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.cz>
Import lookup/del_inode_ref() function in inode-item.c, as base functions
for the incoming btrfs_add_link() and btrfs_unlink() functions.
Also modify btrfs_insert_inode_ref() and split_leaf() making them able
to deal with EXTENT_IREF incompat flag.
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.cz>
Import btrfs_insert/del/lookup_extref() functions form kernel for the
incoming btrfs_add_link() and btrfs_unlink() functions.
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.cz>
Run fstests: btrfs/012 will fail with message:
unable to do rollback
It is because the rollback function checks sequentially each piece of space
to map to a certain block group. If some piece doesn't, rollback refuses to continue.
After kernel commit:
commit 47ab2a6c689913db23ccae38349714edf8365e0a
Btrfs: remove empty block groups automatically
Empty block groups are removed, so there are possible gaps:
|--block group 1--| |--block group 2--|
^
|
gap
So the piece of space of the gap belongs to a removed empty block group,
and rollback should detect this case, and feel free to continue.
Signed-off-by: Gui Hecheng <guihc.fnst@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.cz>
The @search_cache_extent() only returns the next cache_extent or NULL,
it will never return the previous cache_extent.
So just remove the dead condition for previous cache_extent handle.
Signed-off-by: Gui Hecheng <guihc.fnst@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.cz>
The value of variable leaf in while loop don't have to be set
for every round. Just move it outside.
Signed-off-by: Gui Hecheng <guihc.fnst@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.cz>
With the task-utils it's in the default LIBS flags now. We want to use
-pthread as it also sets flags for the preprocessor.
Signed-off-by: David Sterba <dsterba@suse.cz>
Support for monitoring progress of running tasks, based on timerfd and
pthreads.
Signed-off-by: Silvio Fricke <silvio.fricke@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.cz>
Sometimes we have a pretty corrupted fs but have an old tree bytenr that we
could use, add the ability to specify the tree root bytenr. Thanks,
Signed-off-by: Josef Bacik <jbacik@fb.com>
Tested-by: Ansgar Hockmann-Stolle <ansgar.hockmann-stolle@uni-osnabrueck.de>
Signed-off-by: David Sterba <dsterba@suse.cz>
This is a starting point for a debugfs style python interface using
the search ioctl. For now it can only do one thing, which is to
print out all the extents in a file and calculate the compression ratio.
Over time it will grow more features, especially for the kinds of things
we might run btrfs-debug-tree to find out. Expect the usage and output
to change dramatically over time (don't hard code to it).
Signed-off-by: Chris Mason <clm@fb.com>
Signed-off-by: David Sterba <dsterba@suse.cz>
The @fi_args->num_devices in @get_fs_info() does not include seed devices.
We could just correct it by searching the chunk tree and count how
many dev_items there are in total which includes seed devices.
Signed-off-by: Gui Hecheng <guihc.fnst@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.cz>
The following BUG_ON:
BUG_ON(ndevs >= fi_args->num_devices)
is not needed, because it always fails with seed devices present.
Signed-off-by: Gui Hecheng <guihc.fnst@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.cz>
There is no need to try to build seed/sprout mapping for those btrfs
without seed devices, so just skip such fs.
We could get the total number of devices from the disk super block, if it
equals the number of items in list @fs_devices->devices, then there shouldn't
be any seed devices.
Signed-off-by: Gui Hecheng <guihc.fnst@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.cz>
Before the patch, chunk will be considered bad if the corresponding
block group is missing, even the only uncertain data is the 'used'
member of the block group.
This patch will try to recalculate the 'used' value of the block group
and rebuild it.
So even only chunk item and dev extent item is found, the chunk can be
recovered.
Although if extent tree is damanged and needed extent item can't be
read, the block group's 'used' value will be the block group length, to
prevent any later write/block reserve damaging the block group.
In that case, we will prompt user and recommend them to use
'--init-extent-tree' to rebuild extent tree if possible.
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.cz>
Extract the procedure of searching for a target device for fi show
from the @map_seed_devices() function to make it more clear.
Signed-off-by: Gui Hecheng <guihc.fnst@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.cz>
This patch reworks the basic calculations of 'fi usage'. It does not address
all problems but should make the code more prepared to do so.
The original code tries to estimate the free space that could lead to negative
numbers for some raid profiles:
Data, RAID1: total=147.00GiB, used=141.92GiB
System, RAID1: total=32.00MiB, used=36.00KiB
Metadata, RAID1: total=2.00GiB, used=1.17GiB
GlobalReserve, single: total=404.00MiB, used=0.00B
Overall:
Device size: 279.46GiB
Device allocated: 298.06GiB
Device unallocated: 16.00EiB
Used: 286.18GiB
Free (estimated): 8.00EiB (min: 8.00EiB)
Data ratio: 2.00
Metadata ratio: 2.00
Global reserve: 404.00MiB (used: 0.00B)
Eg. "Device size" - "Device allocated" = negative number or a very large
positive, hence the EiB values.
There are logical and raw numbers multiplied by ratios mixed together,
so the new code makes it explicit which kind is being used. The data and
metadata ratios are calculated separately.
Output after this patch will look like:
Overall:
Device size: 558.92GiB
Device allocated: 298.06GiB
Device unallocated: 260.86GiB
Used: 286.18GiB
Free (estimated): 135.51GiB (min: 135.51GiB)
Data ratio: 2.00
Metadata ratio: 2.00
Global reserve: 404.00MiB (used: 0.00B)
Data,RAID1: Size:147.00GiB, Used:141.92GiB
/dev/sdc 147.00GiB
/dev/sdd 147.00GiB
Metadata,RAID1: Size:2.00GiB, Used:1.17GiB
/dev/sdc 2.00GiB
/dev/sdd 2.00GiB
System,RAID1: Size:32.00MiB, Used:36.00KiB
/dev/sdc 32.00MiB
/dev/sdd 32.00MiB
Unallocated:
/dev/sdc 130.43GiB
/dev/sdd 130.43GiB
Changes:
* Device size is now the raw size, same for the following three
* Free is the logical size
* Max/min were reduced to just min
Filesystem Size Used Avail Use% Mounted on
/dev/sdc 280G 144G 141G 51% /mnt/sdc
The difference between Avail and Free is there because userspace tool does a
different guesswork than kernel.
Issues not addressed by this patch:
* RAID56 profiles are not handled
* mixed profiles are not handled
Signed-off-by: David Sterba <dsterba@suse.cz>
Even if run as root:
# su
# btrfs file usage <path> <== path exits outside the mnt point
We get the output:
WARNING: ..., run as root
WARNING: ..., run as root
ERROR:...
It is because in load_chunk_info, the errno of ioctl is not judged
but rather the ret value of ioctl is judged. And the ret value of
ioctl is -1 which happens to match -EPERM exactly.
So the outer warning is printed.
Just judge the errno of ioctl and prevent the ret value of load_chunk_info
to be -1 in other error conditions.
For load_device_info, the problem and fix is the same.
After the fix, the 'run as root' WARNINGs will not show up in this condition.
Signed-off-by: Gui Hecheng <guihc.fnst@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.cz>
When we exec the following cmd:
# btrfs file usage -t <path> <-- an invalid path
output:
# ERROR: can't access '-t'
should be:
# ERROR: can't access 'path'
Just replace the static 'argv[1]' with 'argv[i]'.
Signed-off-by: Gui Hecheng <guihc.fnst@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.cz>
When run btrfs-file-usage on a btrfs with data profile raid5/6,
the output message for "Free" & "Data to device ratio" seems wrong
as follows:
...
Device size: 100.00GiB
Device allocated: 2.04GiB
Device unallocated: 97.96GiB
Used: 1.12MiB
Free (Estimated): 197.89GiB <== Free > Device size
Data to device ratio: 198 % <== > 100%
Global reserve: 0.00B
...
It is because the function get_raid56_used() is not iterating the
chunk_info array correctly, it is just repeating adding the first
chunk_info statistics.
Just add a ptr to iterate over the array.
Signed-off-by: Gui Hecheng <guihc.fnst@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.cz>
Current error messages are like following:
Error: unable to create FS with metadata profile 32 (have 2 devices)
Error: unable to create FS with metadata profile 256 (have 2 devices)
Obviously it is hard for users to interpret "profile XX" to proper
meaning, such as "raidN". So use recongizable string instead of
internal numerical value. In case of "DUP", use an explicit message.
Plus this patch fix a bug that message mistake metadata profile
for data profile.
After applying this patch, messages will be like:
Error: DUP is not allowed when FS have multiple devices
Error: unable to create FS with metadata profile RAID6 (have 2
devices but 3 devices are required)
Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.cz>
The term has not seen an agreement and we don't want to change it once
it's in non-development branches or even released.
Discussion under the patch:
http://thread.gmane.org/gmane.comp.file-systems.btrfs/34627
Signed-off-by: David Sterba <dsterba@suse.cz>
It looks confusing among the chunks, it is not in fact a chunk type.
Sample:
Overall:
Device size: 35.00GiB
Device allocated: 8.07GiB
Device unallocated: 26.93GiB
Used: 1.12MiB
Free (Estimated): 17.57GiB (Max: 30.98GiB, min: 17.52GiB)
Data to device ratio: 50 %
Global reserve: 16.00MiB (used: 0.00B)
...
Signed-off-by: David Sterba <dsterba@suse.cz>
The main point of this is to load the device and chunk infos at one
place and pass down to the printers. The EPERM is handled separately, in
case kernel does not give us all the information about chunks or
devices, but we want to warn and print at least something.
For non-root users, 'filesystem usage' prints only the overall stats and
warns about RAID5/6.
The sole cleanup changes affect mostly the modified code and the related
functions, should be reasonably small.
Signed-off-by: David Sterba <dsterba@suse.cz>
The 'fi usage' lacks an overall report, this used to be in the enhanced
df command. Add it back.
Sample:
Overall:
Device size: 35.00GiB
Device allocated: 8.07GiB
Device unallocated: 26.93GiB
Used: 1.12MiB
Free (Estimated): 17.57GiB (Max: 30.98GiB, min: 17.52GiB)
Data to device ratio: 50 %
...
Signed-off-by: David Sterba <dsterba@suse.cz>
The device may not be fully occupied by the filesystem, the value of
Unallocated should not be calculated against the device size but the
size provided by DEV_INFO.
Signed-off-by: David Sterba <dsterba@suse.cz>