The block group tree is a completely standalone feature, and it has
been over 5 years since its initial introduction as a fix for the long
mount time.
I don't really want to waste another 5 years waiting for a feature which
may or may not work, and whose preparation patches were definitely not
properly reviewed.
So this patch will separate the block group tree feature into a
standalone compat RO feature.
There is a catch in mkfs create_block_group_tree(): the current
tree-checker only accepts block group items with a valid chunk_objectid,
but the existing code from extent-tree-v2 didn't properly initialize it.
This patch also fixes the above problem so the kernel can mount the
filesystem correctly.
Now mkfs/fsck should be able to handle the fs with block group tree.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The extent tree v2 (thankfully not yet fully materialized) needs a
new root for storing all block group items.
My initial proposal years ago just added a new tree rootid and loaded it
from the tree root, just like what we do for the quota/free space
tree/uuid/extent roots.
But the extent tree v2 patches introduced a completely new (and to me,
wasteful) way to store the block group tree root in the super block.
Currently there are only 3 trees stored in super blocks, and they all
have their valid reasons:
- Chunk root
Needed for bootstrap.
- Tree root
Really the entrance of all trees.
- Log root
This is special as log root has to be updated out of existing
transaction mechanism.
There is no reason at all to put the block group root into the super
block: the block group tree is updated at the same time as the old
extent tree, with no need for an extra bootstrap or out-of-transaction
update.
So just move the block group root from the super block into the tree
root.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In mkfs_btrfs(), we have a btrfs_mkfs_block array to store how many tree
blocks we need to reserve for the initial btrfs image.
Currently we have two very similar arrays, extent_tree_v1_blocks and
extent_tree_v2_blocks.
The only difference is that v2 has an extra block for the block group
tree.
This patch will add two helpers, mkfs_blocks_add() and
mkfs_blocks_remove() to properly add/remove one block dynamically from
the array.
This allows 3 things:
- Merge extent_tree_v1_blocks and extent_tree_v2_blocks into one array
The new array will be the same as extent_tree_v1_blocks.
For extent-tree-v2, we just dynamically add MKFS_BLOCK_GROUP_TREE.
- Remove free space tree block on-demand
This only works for extent-tree-v1 case, as v2 has a hard requirement
on free space tree.
But this still makes the code much cleaner, without any special hacks.
- Allow future expansion without introducing a new array
I strongly question why this was not done properly in the
extent-tree-v2 preparation patches.
We should not allow bad practice to sneak in just because it's some
preparation patches for a larger feature.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In read_tree_block(), an extent buffer flagged EXTENT_BAD_TRANSID is
added to fs_info->recow_ebs with its refs incremented.
The corresponding free_extent_buffer() should be called after we
fix the transid error by cowing the extent buffer and then remove it
from fs_info->recow_ebs.
Otherwise, extent buffers will be leaked as fsck-tests/002 reports:
===================================================================
====== RUN CHECK /root/btrfs-progs/btrfs check --repair --force ./default_case.img.restored
parent transid verify failed on 29360128 wanted 9 found 755944791
parent transid verify failed on 29360128 wanted 9 found 755944791
parent transid verify failed on 29360128 wanted 9 found 755944791
Ignoring transid failure
[1/7] checking root items
Fixed 0 roots.
[2/7] checking extents
[3/7] checking free space cache
[4/7] checking fs roots
[5/7] checking only csums items (without verifying data)
[6/7] checking root refs
[7/7] checking quota groups skipped (not enabled on this FS)
extent buffer leak: start 29360128 len 4096
enabling repair mode
===================================================================
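The reference pairing described above can be modelled with a toy list.
The structs and function names below are stand-ins invented for this
sketch, not the real extent buffer code; the point is only that the
reference taken when queuing must be dropped when the entry is removed.

```c
#include <stddef.h>

/* Toy stand-ins for an extent buffer and the recow list. */
struct toy_eb {
	int refs;
	struct toy_eb *next;
};

struct toy_recow_list {
	struct toy_eb *head;
};

/* Queue a buffer for later re-COW; the list holds its own reference. */
static void recow_queue(struct toy_recow_list *list, struct toy_eb *eb)
{
	eb->refs++;
	eb->next = list->head;
	list->head = eb;
}

/* Fix every queued buffer, then drop the reference the list held. */
static int recow_fix_all(struct toy_recow_list *list)
{
	int fixed = 0;

	while (list->head) {
		struct toy_eb *eb = list->head;

		/* ... the real code would COW the buffer here ... */
		list->head = eb->next;
		eb->next = NULL;
		eb->refs--;	/* without this drop the buffer leaks */
		fixed++;
	}
	return fixed;
}
```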
Fixes: c64485544b ("Btrfs-progs: keep track of transid failures and fix them if possible")
Signed-off-by: Su Yue <glass@fydeos.io>
Signed-off-by: David Sterba <dsterba@suse.com>
If the underlying block device size is smaller than total_bytes in the
dev item, the kernel will reject the mount, and there is no progs tool
to fix it.
In most cases it's just a small mismatch, and there is no dev extent
in the shrunk range.
In that case, we can let "btrfs rescue fix-device-size" reset
total_bytes in the dev item to fix it.
We add some extra checks, such as making sure there is no dev extent in
the shrunk device range, so that we won't lose data during the device
item shrink.
Also update the test case to verify the repaired fs can pass the
check.
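A minimal sketch of the safety check described above, assuming a
simplified in-memory list of dev extents. The struct and function names
are hypothetical; the real code walks the device tree instead.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for a dev extent: a byte range on the device. */
struct sketch_dev_extent {
	uint64_t start;
	uint64_t len;
};

/* True if no dev extent reaches past @new_size. */
static bool shrink_is_safe(const struct sketch_dev_extent *exts, size_t nr,
			   uint64_t new_size)
{
	for (size_t i = 0; i < nr; i++)
		if (exts[i].start + exts[i].len > new_size)
			return false;
	return true;
}

/*
 * Reset *total_bytes to the real device size, but only when nothing is
 * allocated in the range that goes away. Returns 0 on success (or when
 * there is nothing to fix) and -1 when the shrink would lose data.
 */
static int fix_total_bytes(uint64_t *total_bytes, uint64_t dev_size,
			   const struct sketch_dev_extent *exts, size_t nr)
{
	if (dev_size >= *total_bytes)
		return 0;
	if (!shrink_is_safe(exts, nr, dev_size))
		return -1;
	*total_bytes = dev_size;
	return 0;
}
```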
Issue: #504
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Create a filesystem on a file-backed loop block device, then shrink the
file (and its loop block device), then make sure btrfs check can detect
such a shrunk device.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
There is a bug report that a btrfs filesystem got its underlying device
shrunk accidentally.
Fortunately the user had no data in the truncated range. However, the
kernel will reject such a filesystem, while btrfs-check reports nothing
wrong with it.
This can be easily reproduced by:
# truncate -s 1G test.img
# mkfs.btrfs test.img
# truncate -s 996M test.img
# btrfs check test.img
Opening filesystem to check...
Checking filesystem on test.img
UUID: dbf0a16d-f158-4383-9025-29d7f4c43f17
[1/7] checking root items
[2/7] checking extents
[3/7] checking free space tree
[4/7] checking fs roots
[5/7] checking only csums items (without verifying data)
[6/7] checking root refs
[7/7] checking quota groups skipped (not enabled on this FS)
found 16527360 bytes used, no error found
^^^^^^^^^^^^^^
total csum bytes: 13836
total tree bytes: 2359296
total fs tree bytes: 2162688
total extent tree bytes: 65536
btree space waste bytes: 503569
file data blocks allocated: 14168064
referenced 14168064
[CAUSE]
Btrfs check really only checks the metadata cross references, without
checking whether the underlying device has the correct size. Thus we
completely ignored such a size mismatch.
[FIX]
For both regular and lowmem mode, add an extra check against the
underlying block device size.
If the block device size is smaller than its total_bytes, give an error
message and error out.
Now the check looks like this for both modes:
...
[2/7] checking extents
ERROR: block device size is smaller than total_bytes in device item, has 1046478848 expect >= 1073741824
ERROR: errors found in extent allocation tree or chunk allocation
[3/7] checking free space tree
...
found 16527360 bytes used, error(s) found
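The comparison behind that error message can be sketched as below. The
function name and its two parameters are assumptions made for this
illustration; only the message text is taken from the output above.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Compare the size reported by the block device with total_bytes from
 * the device item. Returns 0 when the device is large enough and -1
 * (after printing an error) when it is too small.
 */
static int check_device_size(uint64_t block_dev_size, uint64_t total_bytes)
{
	if (block_dev_size < total_bytes) {
		fprintf(stderr,
"ERROR: block device size is smaller than total_bytes in device item, has %"
			PRIu64 " expect >= %" PRIu64 "\n",
			block_dev_size, total_bytes);
		return -1;
	}
	return 0;
}
```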
Issue: #504
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The extent leaks are detected in debug builds but tests/scan-build.sh
does not look for them, so add the match expression.
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
Commit 06b6ad5e01 ("btrfs-progs: check: check for invalid free space
tree entries") makes btrfs check report eb leakage even on a newly
created btrfs:
# mkfs.btrfs -f test.img
# btrfs check test.img
Opening filesystem to check...
Checking filesystem on test.img
UUID: 13c26b6a-3b2c-49b3-94c7-80bcfa4e494b
[1/7] checking root items
[2/7] checking extents
[3/7] checking free space tree
[4/7] checking fs roots
[5/7] checking only csums items (without verifying data)
[6/7] checking root refs
[7/7] checking quota groups skipped (not enabled on this FS)
found 147456 bytes used, no error found
total csum bytes: 0
total tree bytes: 147456
total fs tree bytes: 32768
total extent tree bytes: 16384
btree space waste bytes: 140595
file data blocks allocated: 0
referenced 0
extent buffer leak: start 30572544 len 16384 <<< Extent buffer leakage
[CAUSE]
The patch on the mailing list uses a dynamically allocated path while
the committed one has been converted to an on-stack path, which is
preferred.
However, the cleanup was not done properly. We only release the path
inside the while loop, not at the out label. This means that if we hit
an error, or even just exhaust the free space tree as expected, we will
leak the path to the free space tree root.
This leads to the above leak report.
[FIX]
Fix the bug by calling btrfs_release_path() at out: label too.
This should make the code behave the same as the patch submitted to the
mailing list.
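The cleanup pattern can be illustrated with a toy model. The struct and
function names are invented for this sketch and only mimic the shape of
the loop; "held" stands for the path still referencing an extent buffer.

```c
#include <stdbool.h>

/* Toy stand-in for an on-stack path. */
struct toy_path {
	bool held;
};

static void toy_release_path(struct toy_path *path)
{
	path->held = false;
}

/*
 * Walk @nr entries; a negative entry models an invalid item. Both the
 * error exit and the normal "tree exhausted" exit leave the loop with
 * the path still held, so the release at the out label is required.
 */
static int walk_entries(struct toy_path *path, const int *entries, int nr)
{
	int ret = 0;
	int i = 0;

	while (1) {
		path->held = true;	/* each lookup holds a buffer */
		if (i >= nr)
			break;		/* exhausted, path still held */
		if (entries[i] < 0) {
			ret = -1;	/* error, path still held */
			goto out;
		}
		toy_release_path(path);
		i++;
	}
out:
	toy_release_path(path);
	return ret;
}
```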
Fixes: 06b6ad5e01 ("btrfs-progs: check: check for invalid free space tree entries")
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Compile warning:
./kerncompat.h:142: warning: "__bitwise__" redefined
#define __bitwise__
In file included from ./kerncompat.h:35,
from check/qgroup-verify.c:24:
/usr/include/linux/types.h:25: note: this is the location of the previous definition
#define __bitwise__ __bitwise
__bitwise__ is already defined in newer kernel-headers
(/usr/include/linux/types.h), so add #ifndef to avoid this warning.
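The guard has roughly this shape. The typedef name sketch_le32 is
hypothetical and only shows the annotation in use.

```c
/*
 * Define the annotation as empty only when no previously included
 * header has defined it, so there is no redefinition warning.
 */
#ifndef __bitwise__
#define __bitwise__
#endif

/* The annotation expands to nothing outside of sparse checking. */
typedef unsigned int __bitwise__ sketch_le32;
```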
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Wang Yugui <wangyugui@e16-tech.com>
Signed-off-by: David Sterba <dsterba@suse.com>
While testing some changes to how we reclaim block groups I started
hitting failures with my TEST_DEV. This occurred because I had a bug
and failed to properly remove a block group's free space tree entries.
However this wasn't caught in testing when it happened because
btrfs check only checks that the free space cache for the existing block
groups is valid; it doesn't check for free space entries that don't have
a corresponding block group.
Fix this by checking for free space entries that don't have a
corresponding block group. Additionally add a test image to validate
this fix.
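The added check amounts to the reverse lookup sketched here, using plain
arrays of byte ranges. The struct and function names are assumptions for
illustration; the real check iterates the free space tree.

```c
#include <stddef.h>
#include <stdint.h>

/* A byte range, used for both block groups and free space entries. */
struct sketch_range {
	uint64_t start;
	uint64_t len;
};

/* Count free space entries that are not inside any block group. */
static size_t count_orphan_free_space(const struct sketch_range *fs,
				      size_t nr_fs,
				      const struct sketch_range *bgs,
				      size_t nr_bgs)
{
	size_t orphans = 0;

	for (size_t i = 0; i < nr_fs; i++) {
		int found = 0;

		for (size_t j = 0; j < nr_bgs; j++) {
			if (fs[i].start >= bgs[j].start &&
			    fs[i].start + fs[i].len <=
			    bgs[j].start + bgs[j].len) {
				found = 1;
				break;
			}
		}
		if (!found)
			orphans++;
	}
	return orphans;
}
```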
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Function write_data_to_disk() can handle RAID56 writes without any
problem.
So just call write_data_to_disk() inside write_and_map_eb() instead of
manually doing the RAID56 write.
Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The BUG_ON() condition in write_data_to_disk() is no longer correct.
Now write_raid56_with_parity() returns the number of bytes written to
the last stripe.
Thus a successful writeback can trigger the BUG_ON(ret).
Fix the condition to (ret < 0).
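A small model of why the condition matters. Both functions are
hypothetical stand-ins; the writer only mimics the "negative on error,
byte count on success" return convention described above.

```c
#include <stddef.h>

/*
 * Stand-in writer: returns a negative value on failure, otherwise the
 * number of bytes written to the last stripe.
 */
static int sketch_write_stripes(size_t stripe_len, int fail)
{
	if (fail)
		return -5;	/* -EIO, hard-coded for the sketch */
	return (int)stripe_len;
}

/* Return 0 on success and the negative error otherwise. */
static int sketch_write_checked(size_t stripe_len, int fail)
{
	int ret = sketch_write_stripes(stripe_len, fail);

	/* Testing "ret" alone would also treat a byte count as failure. */
	if (ret < 0)
		return ret;
	return 0;
}
```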
Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
Shinichiro reported that "mkfs.btrfs -m DUP" is doing repeated writes
to the device.
For a non-zoned device this is not a big deal, but for a zoned device
this is critical, as zoned devices don't support overwrite at all.
[CAUSE]
The problem is related to write_and_map_eb() call, since commit
2a93728391 ("btrfs-progs: use write_data_to_disk() to replace
write_extent_to_disk()"), we call write_data_to_disk() for metadata
write back.
But the problem is, write_data_to_disk() will call btrfs_map_block()
with rw = WRITE.
With that, btrfs_map_block() will always return all stripes, while in
write_data_to_disk() we also iterate through each mirror of the range.
This results in the repeated writeback above.
[FIX]
Fix this problem by completely removing the @mirror argument
from write_data_to_disk(), with extra comments to explicitly show that
the function will write to all mirrors.
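The doubling can be seen in a simple counting model. The function names
are made up for this sketch; it only counts how many device writes each
loop structure would issue for a given number of copies.

```c
/*
 * Old shape: an outer loop over mirrors, while the write mapping
 * already returns every stripe, so each copy is written once per
 * mirror iteration.
 */
static int writes_with_mirror_loop(int num_copies)
{
	int writes = 0;

	for (int mirror = 1; mirror <= num_copies; mirror++)
		for (int stripe = 0; stripe < num_copies; stripe++)
			writes++;
	return writes;
}

/* New shape: map once for write and write each returned stripe once. */
static int writes_without_mirror_loop(int num_copies)
{
	int writes = 0;

	for (int stripe = 0; stripe < num_copies; stripe++)
		writes++;
	return writes;
}
```

For DUP (two copies) the first form issues four writes where two are
enough, which is exactly the overwrite a zoned device cannot accept.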
Reported-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Fixes: 2a93728391 ("btrfs-progs: use write_data_to_disk() to replace write_extent_to_disk()")
Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The helper `check_min_kernel_version` is duplicated and can be removed.
Signed-off-by: Chung-Chiang Cheng <cccheng@synology.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Some older compilers do not support overflow builtins introduced in
5ad2aacd24 ("btrfs-progs: kernel-lib: sync include/overflow.h"). Add
stubs to make it compile. This fixes the CI build on CentOS 7.
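For illustration, portable fallbacks for unsigned 64-bit values could
look like the following. These are not the stubs from the patch; the
function names are hypothetical and the sketch covers only uint64_t.

```c
#include <stdbool.h>
#include <stdint.h>

/* Each helper stores the wrapped result and returns true on overflow. */
static bool sketch_add_overflow_u64(uint64_t a, uint64_t b, uint64_t *res)
{
	*res = a + b;
	return *res < a;
}

static bool sketch_sub_overflow_u64(uint64_t a, uint64_t b, uint64_t *res)
{
	*res = a - b;
	return a < b;
}

static bool sketch_mul_overflow_u64(uint64_t a, uint64_t b, uint64_t *res)
{
	*res = a * b;
	return a != 0 && *res / a != b;
}
```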
Signed-off-by: David Sterba <dsterba@suse.com>
Use the autoconf archive macros for gcc builtin detection and add the
overflow builtins recently added from the kernel.
New:
__builtin_add_overflow
__builtin_sub_overflow
__builtin_mul_overflow
Signed-off-by: David Sterba <dsterba@suse.com>
The help text is out of sync with many options, lacking the long
options, required arguments or mistakenly requiring arguments when the
value is read from another one.
Signed-off-by: David Sterba <dsterba@suse.com>
This file includes linux/fs.h, which includes linux/mount.h, and with
glibc 2.36 linux/mount.h and glibc mount.h are not compatible [1].
Therefore try to avoid including both headers.
[1] https://sourceware.org/glibc/wiki/Release/2.36
Signed-off-by: Khem Raj <raj.khem@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Attempting to dump a bad btrfs superblock returns successful exit status
zero. According to the manual page non-zero should be returned on
failure. Fix this.
$ btrfs inspect-internal dump-super /dev/zero
superblock: bytenr=65536, device=/dev/zero
---------------------------------------------------------
ERROR: bad magic on superblock on /dev/zero at 65536
$ echo $?
0
Signed-off-by: Mike Fleetwood <mike.fleetwood@googlemail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This compat flag is missing, but is being checked by mount, and could
well be present legitimately.
Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
To corrupt holes/prealloc/inline extents, we need to mess with
extent data items. This patch makes it possible to modify
disk_bytenr with a specific value (useful for hole corruptions)
and to modify the type field (useful for prealloc corruptions).
Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs-corrupt-block already has a mix of generic and specific corruption
options, but currently lacks the capacity for totally arbitrary
corruption in item data.
There is already a flag for corruption size (bytes/-b), so add a flag
for an offset and a value to memset the item with. Exercise the new
flags with a new variant for -I (item) corruption. Look up the item as
before, but instead of corrupting a field in the item struct, corrupt an
offset/size in the item data.
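The offset/size/value corruption boils down to a bounds-checked memset
over the item data. The function below is a hypothetical sketch working
on a plain byte buffer, not the code added to btrfs-corrupt-block.

```c
#include <stdint.h>
#include <string.h>

/*
 * Overwrite @len bytes starting at @offset inside an item's data with
 * @value. Returns -1 without touching the buffer when the range does
 * not fit inside the item.
 */
static int sketch_corrupt_item_data(uint8_t *data, size_t data_size,
				    size_t offset, size_t len, uint8_t value)
{
	if (offset > data_size || len > data_size - offset)
		return -1;
	memset(data + offset, value, len);
	return 0;
}
```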
The motivating example for this is that in testing fsverity with btrfs,
we need to corrupt the generated Merkle tree--metadata item data which
is an opaque blob to btrfs.
Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
The command group of 'replace' belongs to device and could be seen as
confusing. At minimum we can add an alias so now these are equivalent:
# btrfs replace start
# btrfs device replace start
Both commands will exist for backward compatibility, though we might
revisit which one is the primary one.
Issue: #484
Signed-off-by: David Sterba <dsterba@suse.com>
This is in preparation for introducing tabular output for device stats. Simply
factor out string-specific output lines in a separate function.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Similar to the kernel's check_leaf(), calling btrfs_item_end_nr() may
return a reasonable-looking value even if an item has an invalid
offset/size, due to u32 overflow.
Fix it by using a u64 variable to store the item data end in
btrfs_check_leaf() to avoid the u32 overflow.
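The wraparound is easy to reproduce in isolation. The two functions are
hypothetical and only contrast a 32-bit and a 64-bit end computation
against an arbitrary limit.

```c
#include <stdbool.h>
#include <stdint.h>

/* 32-bit end: offset + size can wrap and look valid. */
static bool item_end_ok_u32(uint32_t offset, uint32_t size, uint32_t limit)
{
	uint32_t end = offset + size;

	return end <= limit;
}

/* 64-bit end: the sum of two u32 values can never wrap. */
static bool item_end_ok_u64(uint32_t offset, uint32_t size, uint32_t limit)
{
	uint64_t end = (uint64_t)offset + size;

	return end <= limit;
}
```

With offset 0xFFFFFF00 and size 0x200 the 32-bit sum wraps to 0x100 and
passes the check, while the 64-bit version rejects the item.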
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=215299
Reported-by: Wenqing Liu <wenqingliu0120@gmail.com>
Signed-off-by: Su Yue <l@damenly.su>
Signed-off-by: David Sterba <dsterba@suse.com>
Add a test to ensure that 'btrfs fi show' on a mounted filesystem that
has a missing device explicitly prints which device is missing.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently when a device is missing for a mounted filesystem the output
that is produced is unhelpful:
Label: none uuid: 139ef309-021f-4b98-a3a8-ce230a83b1e2
Total devices 2 FS bytes used 128.00KiB
devid 1 size 5.00GiB used 1.26GiB path /dev/loop0
*** Some devices missing
While the context which prints this is perfectly capable of showing
which device exactly is missing, like so:
Label: none uuid: 4a85a40b-9b79-4bde-8e52-c65a550a176b
Total devices 2 FS bytes used 128.00KiB
devid 1 size 5.00GiB used 1.26GiB path /dev/loop0
devid 2 size 0 used 0 path /dev/loop1 MISSING
This is a lot more usable output as it presents the user with the id
of the missing device and its path.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This is the same on-disk format update synchronized from the kernel
code.
Unlike the kernel, there are two callers reading this member:
- btrfs inspect dump-super
It's just printing the value, add a notice about deprecation.
- btrfs-find-root
In that case, since we always get 0, the root search for the log root
would never find a perfect match.
Use btrfs_super_generation() + 1 to provide a better result.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
On large (block count > 32 bits) filesystems, reading
super_block->s_blocks_count directly is not sufficient, as the block
count is held in 2 separate 32 bit variables. Instead always use the
provided ext2fs_blocks_count to read the value. Reading the field
directly can result in an assertion failure when the block count is
only held in the high 32 bits: in this case s_blocks_count would be
zero, which would result in
btrfs_convert_context::block_count/total_bytes also being 0 and hitting
an assertion failure:
convert/main.c:1162: do_convert: Assertion `cctx.total_bytes != 0` failed, value 0
btrfs-convert(+0xffb0)[0x557defdabfb0]
btrfs-convert(main+0x6c5)[0x557defdaa125]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xea)[0x7f66e1f8bd0a]
btrfs-convert(_start+0x2a)[0x557defdab52a]
Aborted
What's worse, it can also result in btrfs-convert mistakenly thinking
that a filesystem is smaller than it actually is (ignoring the top 32
bits).
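A sketch of how the two 32-bit halves combine into the full count. The
function and parameter names are hypothetical; real code should call
ext2fs_blocks_count rather than combining the fields by hand.

```c
#include <stdint.h>

/* Combine the low and high 32-bit halves of a block count. */
static uint64_t sketch_blocks_count(uint32_t lo, uint32_t hi)
{
	return ((uint64_t)hi << 32) | lo;
}
```

A count of exactly 2^32 blocks has a low half of 0, which is the case
that trips the assertion when only the low field is read.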
Link: https://lore.kernel.org/linux-btrfs/023b5ca9-0610-231b-fc4e-a72fe1377a5a@jansson.tech/
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The initial proposal for file attributes was built on simply doing
SETFLAGS but this builds on an old and non-extensible interface that has
no direct mapping for all inode flags. There's a unified interface
fileattr that covers file attributes and xflags, and it should be
possible to add new bits.
On the protocol level the value is copied as-is in the original inode
but this does not provide enough information about how to apply the
bits on the receiving side. E.g. the IMMUTABLE flag prevents any changes
to the file and has to be handled manually.
The receiving side does not apply the bits yet, only parses it from the
stream.
Signed-off-by: David Sterba <dsterba@suse.com>
Add constant for initial value to avoid unexpected clashes with user
defined getopt values and shift the common size getopt values.
Signed-off-by: David Sterba <dsterba@suse.com>
Now that LZO and ZSTD are optional not just for restore, rename the
build variables to a more generic name and update configure summary.
Signed-off-by: David Sterba <dsterba@suse.com>
There are build-time options for LZO and ZSTD support, and the stream
v2+ supports compression. The help text lists what has been compiled in,
similar to what 'restore' does, with a similar limitation that a stream
with compressed data cannot be processed if any of the extents is
compressed.
Signed-off-by: David Sterba <dsterba@suse.com>
Copy contents from https://btrfs.wiki.kernel.org/index.php/Changelog#By_feature
The formatting is done by a definition and list, instead of a table.
Unfortunately RST does not wrap long text in table cells, so the width
exceeds the visible area.
Signed-off-by: David Sterba <dsterba@suse.com>