We are going to start creating multiple sets of global trees, which at
the moment are the free space tree, csum tree, and extent tree.
Generally we will assign these at block group creation time, but Dave
would like to be able to have them per-subvolume at some point, so
reserve a slot for that as well.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Add the on disk definitions for the block group tree. This will be part
of the super block so we need to add the appropriate helpers to the
super block, as well as adding it to the backup roots.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We want to enable developers to test the extent tree v2 features as they
are added, add the ability to mkfs an extent tree v2 fs if we have
experimental enabled.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We could have multiple extent roots, so add a helper to mark all the
used space in the FS based on any extent roots we find, and then use
this extent io tree to fixup the block group accounting.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We may have multiple extent roots, so cycle through all of the extent
roots and populate the csum tree based on the content of every extent
root we have in the file system.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Use the global roots tree to find all of the csum roots in the system
and check all of them as appropriate.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Instead of checking the csum and extent tree individually, walk through
the global roots and validate them all. This will work properly if we
have extent tree v1 or extent tree v2.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Instead of looking for the first extent root or csum root in memory,
scan through the tree root and re-init any root items that match the
given objectid. This will allow reinit to work with both extent tree v1
and extent tree v2 global roots when using the --reinit option.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We are going to have multiples of these trees with extent tree v2, so
add a rb tree to track them based on their root key value. This works
for both v1 and v2, so we can remove the direct pointers to these roots
in our fs_info.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We're going to have multiple free space roots in the future, so access
it via a helper in most cases. We will address the remaining direct
accesses in future patches.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When we switch to multiple global trees we'll need to access the
appropriate extent root depending on the block group or possibly root.
To handle this, use a helper in most places and then the actual root in
places where it is required. We will whittle down the direct accessors
with future patches, but this does the bulk of the preparatory work.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Filesystem du command fails and exits when it access file that has
permission denied. But it can continue the command except the files.
This patch prints error message just like /bin/du does and it continues
if it can.
Issue: #421
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Sidong Yang <realwakka@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
At this point it's not clear what exactly needs the indices and they
appear in the left column with list of index pages which is a bit
confusing.
Signed-off-by: David Sterba <dsterba@suse.com>
There was a regression in version 5.15 due to changes in how the ratio
for the various raid profiles got calculated and this in turn had a
cascading effect on unallocated/allocated space reported.
Add a test to ensure this regression doesn't occur again.
Issue: #422
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_mark_used_tree_blocks skips the reloc roots for some reason, which
causes problems because these blocks are in use, and we use this helper
to determine if the block accounting is correct with extent tree v2.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This is going to be used for the extent tree v2 stuff more commonly, so
move it out so that it is accessible from everywhere that we need it.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We will walk all referenced tree blocks during check in order to avoid
writing over any referenced blocks during fsck. However in the future
we're going to need to do this for things like fixing block group
accounting with extent tree v2. This is because extent tree v2 will not
refer to all of the allocated blocks in the extent tree. Refactor the
code some to allow us to send down an arbitrary extent_io_tree so we can
use this helper for any case where we need to figure out where all the
used space is on an extent tree v2 file system.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We have this helper sitting in extent-tree.c, but it's a repair
function. I'm going to need to make changes to this for extent-tree-v2
and would rather this live outside of the code we need to share with the
kernel.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Extent tree v2 no longer tracks all allocated blocks on the file system,
so we'll have to default to walking trees to generate metadata images.
There's an annoying drawback with walking trees with btrfs-image where
we'll happily copy multiple blocks over and over again if there are
snapshots. Fix this by keeping track of blocks we've seen and simply
skipping blocks that we've already queued up for copying.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
With extent tree v2 we will have per-block group checksums, so add a
helper to access the csum root and rename the fs_info csum_root to
_csum_root to catch all the places that are accessing it directly.
Convert everybody to use the helper except for internal things.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We pass the csum root from way high in the call chain in check down to
where we actually need it. However we can just get it from the fs_info
in these places, so clean up the functions to skip passing around the
csum root needlessly.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We are going to need to start looking up the csum root based on the
bytenr with extent tree v2. To that end stop passing the root to the
csum related functions so that can be done in the helper functions
themselves.
There's an unrelated deletion of a function prototype that no longer
exists.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We use this pattern in a few places, and will use it more with different
roots in the future. Extract out this helper to read the root nodes.
There is a behavior change here in that we're now checking the root
levels, whereas before we were not.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Running with ASAN we won't pass the self tests because we leak the whole
fs_info with btrfs filesystem show. Fix this by making sure we close
out the fs_info and clean up all of the memory and such.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We will lookup implied refs for every root id for every block that we
find when verifying qgroups, but we don't need to worry about non
fstrees, so skip them here.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This is doing the same work as insert_block_group_item, rework it to
call the helper instead.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
I screwed up a fix where we're setting the bytenr range as dirty when
marking all tree blocks used, I was looking at btrfs_pin_extent and put
->nodesize for end instead of the actual end, which is bytenr +
->nodesize - 1. Fix this up so it's correct.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
There is a bug report that a corrupted key type (expected
UUID_KEY_SUBVOL, has EXTENT_ITEM) causing newer kernel to reject a
mount.
Although the root cause is not determined yet, with roll out of v5.11
kernel to various distros, such problem should be prevented by
tree-checker, no matter if it's hardware problem or not.
And older kernel with "-o uuid_rescan" mount option won't help, as
uuid_rescan will only delete items with
UUID_KEY_SUBVOL/UUID_KEY_RECEIVED_SUBVOL key types, not deleting such
corrupted key.
[FIX]
To fix such problem we have to rely on offline tool, thus there we
introduce a new rescue tool, clear-uuid-tree, to empty and then remove
uuid tree.
Kernel will re-generate the correct uuid tree at next mount.
Reported-by: S. <sb56637@gmail.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There is a bug in raid56_recov() which doesn't properly repair data and
P case corruption:
/* Data and P*/
if (dest2 == nr_devs - 1)
return raid6_recov_datap(nr_devs, stripe_len, dest1, data);
Note that, dest1/2 is to indicate which slot has corruption.
For RAID6 cases:
[0, nr_devs - 2) is for data stripes,
@data_devs - 2 is for P,
@data_devs - 1 is for Q.
For above code, the comment is correct, but the check condition is
wrong, and leads to the only project, btrfs-fuse, to report raid6
recovery error for 2 devices missing case.
Fix it by using correct condition.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Current formula calculates the stripe size, however that's not what we
want in the case of RAID1/DUP profiles. In those cases since chunk are
mirrored across devices we want the full size of the chunk. Without this
patch the 'btrfs fi usage' output from an fs which is using RAID1 is:
Data,RAID1: Size:2.00GiB, Used:1.00GiB (50.03%)
/dev/vdc 1.00GiB
/dev/vdf 1.00GiB
Metadata,RAID1: Size:256.00MiB, Used:1.34MiB (0.52%)
/dev/vdc 128.00MiB
/dev/vdf 128.00MiB
System,RAID1: Size:8.00MiB, Used:16.00KiB (0.20%)
/dev/vdc 4.00MiB
/dev/vdf 4.00MiB
Unallocated:
/dev/vdc 8.87GiB
/dev/vdf 8.87GiB
So a 2 gigabyte RAID1 chunk actually will take up 4 gigabytes on the
actual disks 2 each. In this case this is being miscalculated as taking
up 1GiB on each device.
This also leads to erroneously calculated unallocated space. The correct
output in this case is:
Data,RAID1: Size:2.00GiB, Used:1.00GiB (50.03%)
/dev/vdc 2.00GiB
/dev/vdf 2.00GiB
Metadata,RAID1: Size:256.00MiB, Used:1.34MiB (0.52%)
/dev/vdc 256.00MiB
/dev/vdf 256.00MiB
System,RAID1: Size:8.00MiB, Used:16.00KiB (0.20%)
/dev/vdc 8.00MiB
/dev/vdf 8.00MiB
Unallocated:
/dev/vdc 7.74GiB
/dev/vdf 7.74GiB
Fix it by only utilising the chunk formula for profiles which are not
RAID1/DUP.
Issue: #422
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Commit 80714610f3 ("btrfs-progs: use raid table for ncopies")
slightly broke how raid ratio are being calculated since the resulting
code would always reset ratio to be 1 in case we didn't have RAID56
profile. The correct behavior is to simply set it to 0 if we have RAID56
as the calculation is different in this case and leave it intact
otherwise.
This bug manifests by doing all size-related calculation for 'btrfs
filesystem usage' command as if all block groups are of type SINGLE. Fix
this by only resetting ratio 0 in case of RAID56.
Issue: #422
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
mkfs.btrfs v5.15 outputs a message even if the disk is a HDD without
TRIM/DISCARD support:
Performing full device TRIM /dev/sdc2 (326.03GiB) ...
[CAUSE]
mkfs.btrfs check TRIM/DISCARD support through the content of
queue/discard_granularity, but compare it against a wrong value.
When HDD without TRIM/DISCARD support, the content of
queue/discard_granularity is '0' '\n' '\0', rather than '0' '\0'.
[FIX]
- compare the value based on atoi() to provide more robustness
- delete unnecessary '\n' in pr_verbose()
Fixes: c50c448518 ("btrfs-progs: do sysfs detection of device discard capability")
Signed-off-by: Wang Yugui <wangyugui@e16-tech.com>
Signed-off-by: David Sterba <dsterba@suse.com>
My tool ntfs2btrfs has been creating btrfs volumes in a way that the
kernel doesn't like, but which isn't picked up by btrfs check - see
maharmstone/ntfs2btrfs#23 for the details, including a backtrace. This
patch adds a check for when a csum item contains too many entries -
effectively it ensures that there's always at least sizeof(struct
btrfs_item) bytes free in the tree, otherwise btrfs_del_csums can throw
an error.
max_entries is the value of the __MAX_CSUM_ITEMS macro in
fs/btrfs/file-item.c.
Pull-request: #401
Signed-off-by: Mark Harmstone <mark@harmstone.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Make it possible to select sphinx doc generator instead of asciidoc so
we don't have two makefiles for that. It's still a bit crude and does
not support installing the files.
The required package is python-Sphinx (or similar name), built by
'sphinx-build'.
Configure:
$ ASCIIDOC_TOOL=sphinx ./configure
...
doc generator: sphinx
...
Generate:
$ cd Documentation/
$ make man
$ make html
$ make info # yes we can have info pages too
There are several more targets provided by sphinx, run 'make' to list
them.
Signed-off-by: David Sterba <dsterba@suse.com>
All long command line options in the html target render as a single
character which is confusing and wrong as we expect that to be '--'.
Disable it.
Signed-off-by: David Sterba <dsterba@suse.com>
Cell spanning is not supported for manual page target, so add separate
columns for the redundancy numbers.
Signed-off-by: David Sterba <dsterba@suse.com>
Just like kernel commit 22b6331d9617 ("btrfs: store precalculated
csum_size in fs_info"), we can cache csum_size and csum_type in
btrfs_fs_info.
Furthermore, there is already a 32 bits hole in btrfs_fs_info, and we
can fit csum_type and csum_size into the hole without increase the size
of btrfs_fs_info.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There are a lot of call sites where we use the following code snippet:
u8 super_block_data[BTRFS_SUPER_INFO_SIZE];
struct btrfs_super_block *sb;
u64 ret;
sb = (struct btrfs_super_block *)super_block_data;
The reason for this is, structure btrfs_super_block was smaller than
BTRFS_SUPER_INFO_SIZE.
Thus for anything with csum involved, we have to use a proper 4K buffer.
Since the recent unification of sizeof(struct btrfs_super_block), we no
longer need such workaround, and can use struct btrfs_super_block
directly to do any operation.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Just like kernel change, pad struct btrfs_super_block to 4096 bytes. As
ctree.h is part of public headers, use raw number for the superblock
offset.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In commit a138daac17 ("btrfs-progs: mkfs: set super_cache_generation
to 0 if we're using free space tree") the space cache (v1) generation
was reset to 0 to let kernel know it's not used when the free-space-tree
is enabled.
This got broken again in 5.14 when the free space tree code got
refactored in 4b6cf2a3eb ("btrfs-progs: mkfs: generate free space tree
at make_btrfs() time").
Reset the space cache generation to 0.
Issue: #414
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
We noticed 'btrfs check' outputs something like
leaf 30408704 flags 0x0(P1逅?) backref revision 1
but we expected:
leaf 30408704 flags 0x0() backref revision 1
[CAUSE]
Some XX_flags_to_str() failed to make sure the result string always ends
with '\0' in some case.
[FIX]
Reset the buffer at the beginnig.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Wang Yugui (wangyugui@e16-tech.com)
Signed-off-by: David Sterba <dsterba@suse.com>
For consistency with older versions switch the case of 'single' to be
lower case again even if it's inconsistent. This could be revisited in
the future.
Signed-off-by: David Sterba <dsterba@suse.com>