btrfs-progs

Commit Graph

Author	SHA1	Message	Date
Qu Wenruo	2cdc8dddbf	btrfs-progs: mkfs: offset inode numbers of the source filesystem [BUG] When running mkfs tests on a newly rebooted minimal system, it can cause mkfs/009 to fail. The reproduce steps requires /tmp to has minimal files in the first place. # mkdir /tmp/rootdir # xfs_io -f -c "pwrite 0 16k" /tmp/rootdir # mkfs.btrfs --rootdir /tmp/rootdir -f $dev # btrfs check $dev Opening filesystem to check... Checking filesystem on /dev/test/scratch1 UUID: 6821b3db-f056-4c18-b797-32679dcd4272 [1/7] checking root items [2/7] checking extents data backref 13631488 root 5 owner 170 offset 0 num_refs 0 not found in extent tree incorrect local backref count on 13631488 root 5 owner 170 offset 0 found 1 wanted 0 back 0x55ff6cd72260 backref 13631488 root 5 not referenced back 0x55ff6cd4c1f0 incorrect global backref count on 13631488 found 2 wanted 1 backpointer mismatch on [13631488 16384] ERROR: errors found in extent allocation tree or chunk allocation [CAUSE] The extent tree has the following weird item: item 0 key (13631488 EXTENT_ITEM 16384) itemoff 16250 itemsize 33 refs 1 gen 0 flags DATA tree block backref root FS_TREE This is an extent item for data, thus it should not have an inline tree backref. Then checking the fs tree: item 0 key (170 INODE_ITEM 0) itemoff 16123 itemsize 160 generation 7 transid 0 size 16384 nbytes 16384 block group 0 mode 100600 links 1 uid 1000 gid 1000 rdev 0 sequence 0 flags 0x0(none) atime 1664866393.0 (2022-10-04 14:53:13) ctime 1664863510.0 (2022-10-04 14:05:10) mtime 1664863455.0 (2022-10-04 14:04:15) otime 0.0 (1970-01-01 08:00:00) There is an inode item before the root dir inode. And that inode number 170 is causing the problem. In traverse_directory(), we use the inode number reported from stat() directly as btrfs inode number, and pass it to btrfs_record_file_extent(), which finally calls btrfs_inc_extent_ref(), with above 170 passed as @owner parameter. 
But inside btrfs_inc_extent_ref() we use that @owner value to determine if it's a data backref. Since we got a smaller than BTRFS_FIRST_FREE_OBJECTID, btrfs treats it as tree block, and cause the above problem. [FIX] As a quick fix, always add BTRFS_FIRST_FREE_OBJECTID to all inode number directly grabbed from stat(). And add an ASSERT() in __btrfs_record_file_extent() to catch unexpected objectid. This is not a perfect solution, as the resulted fs will has a huge gap in its inodes: item 0 key (256 INODE_ITEM 0) itemoff 16123 itemsize 160 item 4 key (426 INODE_ITEM 0) itemoff 15883 itemsize 160 For a proper fix, we should allocate new btrfs inode numbers in a sequential order, but that would be another series of patches. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-10-11 09:08:10 +02:00
Qu Wenruo	dad9db45bb	btrfs-progs: properly initialize extent generation in __btrfs_record_file_extent() [BUG] When using mkfs.btrfs --rootdir option, the data extents generated will have 0 as their generation in extent tree: # mkdir /tmp/rootdir # xfs_io -f -c "pwrite 0 16k" /tmp/rootdir/foobar # mkfs.btrfs -f --rootdir /tmp/rootdir $dev # btrfs ins dump-tree -t extent $dev btrfs-progs v5.19.1 extent tree key (EXTENT_TREE ROOT_ITEM 0) leaf 30474240 items 13 free space 15536 generation 7 owner EXTENT_TREE leaf 30474240 flags 0x1(WRITTEN) backref revision 1 fs uuid c1f05988-49f9-4dd4-8489-b90d60f522ee chunk uuid 40f81603-fe75-4f58-aa9e-e74e28df8523 item 0 key (13631488 EXTENT_ITEM 16384) itemoff 16230 itemsize 53 refs 1 gen 0 flags DATA <<< Generation is 0 ... [CAUSE] In __btrfs_record_file_extent() we just set the extent generation to 0. [FIX] Use trans->transid to properly fill extent generation. Now after mkfs, the first data extent backref looks like this: item 0 key (13631488 EXTENT_ITEM 16384) itemoff 16230 itemsize 53 refs 1 gen 7 flags DATA ... Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-10-11 09:08:10 +02:00
David Sterba	ccb2d4aa45	btrfs-progs: device-utils: rename btrfs_device_size There's a group of helpers to read device size, and btrfs_device_size should be one of them. Rename it and do minor cleanup. Signed-off-by: David Sterba <dsterba@suse.com>	2022-10-11 09:08:10 +02:00
David Sterba	a827bb2db8	btrfs-progs: use template for transaction commit error messages Signed-off-by: David Sterba <dsterba@suse.com>	2022-10-11 09:08:10 +02:00
David Sterba	8fcafae04a	btrfs-progs: use template for transaction start error messages Signed-off-by: David Sterba <dsterba@suse.com>	2022-10-11 09:08:10 +02:00
David Sterba	c2be0e2ce0	btrfs-progs: use template for out of memory error messages Signed-off-by: David Sterba <dsterba@suse.com>	2022-10-11 09:08:09 +02:00
David Sterba	2267708bfe	btrfs-progs: move repair.c from common/ to check/ Signed-off-by: David Sterba <dsterba@suse.com>	2022-10-11 09:08:09 +02:00
Qu Wenruo	08bb354a1c	btrfs-progs: properly handle write error when writing back tree blocks [BUG] If we emulate a write error during commit transaction, by setting the block device read-only, then we can easily have the following crash using "btrfs check --clear-space-cache v2": Opening filesystem to check... Checking filesystem on /dev/test/scratch1 UUID: 5945915b-37f1-4bfa-9f64-684b318b8f73 Clear free space cache v2 Error writing to device 1 kernel-shared/transaction.c:156: __commit_transaction: BUG_ON `ret` triggered, value 1 ./btrfs(+0x570c9)[0x562ec894f0c9] ./btrfs(+0x57167)[0x562ec894f167] ./btrfs(__commit_transaction+0x13b)[0x562ec894f7f2] ./btrfs(btrfs_commit_transaction+0x214)[0x562ec894fa64] ./btrfs(btrfs_clear_free_space_tree+0x177)[0x562ec8941ae6] ./btrfs(+0xc8958)[0x562ec89c0958] ./btrfs(+0xc9d53)[0x562ec89c1d53] ./btrfs(+0x17ec7)[0x562ec890fec7] ./btrfs(main+0x12f)[0x562ec8910908] /usr/lib/libc.so.6(+0x232d0)[0x7ff917ee82d0] /usr/lib/libc.so.6(__libc_start_main+0x8a)[0x7ff917ee838a] ./btrfs(_start+0x25)[0x562ec890fdc5] Aborted (core dumped) [CAUSE] The call trace has shown it's a BUG_ON(), and it's from __commit_transaction(), which is writing tree blocks back. [FIX] The fix is pretty simple, just return error. In fact we even have an error value check in btrfs_commit_transaction() just after __commit_transaction() call (although not catching the return value from it). And since we're here, also call btrfs_abort_transaction() to prevent newer transactions from being started. Now we won't have a full crash: Opening filesystem to check... 
Checking filesystem on /dev/test/scratch1 UUID: 5945915b-37f1-4bfa-9f64-684b318b8f73 Clear free space cache v2 Error writing to device 1 ERROR: failed to write bytenr 30425088 length 16384: Operation not permitted ERROR: failed to write tree block 30425088: Operation not permitted ERROR: failed to clear free space cache v2: -1 extent buffer leak: start 30720000 len 16384 Reported-by: Christoph Anton Mitterer <calestyo@scientia.org> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-10-11 09:08:08 +02:00
Qu Wenruo	75800c2fee	btrfs-progs: remove duplicated leaked extent buffer report [BUG] When transaction is aborted halfway, we can have extent buffer leaked, and in that case, the same leaked extent buffer can be reported for multiple times: ERROR: failed to clear free space cache v2: -1 extent buffer leak: start 30441472 len 16384 WARNING: dirty eb leak (aborted trans): start 30441472 len 16384 extent buffer leak: start 30720000 len 16384 extent buffer leak: start 30425088 len 16384 extent buffer leak: start 30425088 len 16384 << Duplicated WARNING: dirty eb leak (aborted trans): start 30425088 len 16384 Note that 30425088 line is reported twice (not accounting the "dirty eb leak" line). [CAUSE] When we detected a leaked eb, we call free_extent_buffer_nocache(), but free_extent_buffer_nocache() can only remove the eb when its reduced refs is 0. If the eb has refs 2, it will need two free_extent_buffer_nocache() calls to remove it from the cache. [FIX] Just reset the eb->refs to 1 so that free_extent_buffer_nocache() can remove it from cache for sure. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-10-11 09:08:08 +02:00
Qu Wenruo	811ae819e3	btrfs-progs: remove unused function extent_io_tree_init_cache_max() The function was introduced by commit `a5ce5d2198` ("btrfs-progs: extent-cache: actually cache extent buffers") but never got utilized. Thus we can just remove it. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-10-11 09:08:08 +02:00
Boris Burkov	980ba4e842	btrfs-progs: receive: add support for fs-verity Process an enable_verity cmd by running the enable verity ioctl on the file. Since enabling verity denies write access to the file, it is important that we don't have any open write file descriptors. This also revs the send stream format to version 3 with no format changes besides the new commands and attributes. This version is not finalized and commands may change, also this needs to be synchronized with any kernel changes. Note: the build is conditional on the header linux/fsverity.h Signed-off-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>	2022-10-11 09:08:08 +02:00
David Sterba	feef6aaaf6	btrfs-progs: kernel-lib: remove radix-tree The radix-tree is not used in userspace code. In kernel it's for tracking unpersisted and in-memory structures and has been replaced by the xarray. Signed-off-by: David Sterba <dsterba@suse.com>	2022-10-11 09:08:07 +02:00
Qu Wenruo	d8f3355734	btrfs-progs: unexport csum_tree_block() The function csum_tree_block() is not really utilized by anyone, all current callers just use csum_tree_block_size(). Furthermore there is a stale definition in common/utils.h which is using the old "struct btrfs_root" as the first argument, while we have already migrated to "struct btrfs_fs_info". So just unexport csum_tree_block() and remove the stale definition. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-10-11 09:06:11 +02:00
David Sterba	d5e15ba825	btrfs-progs: fix may be unused warning in load_free_space_extents Some compilers warn about potentially unused variable, however the value validity is guarded by have_prev so this can't happen and it's probably insufficient analysis on the compiler side. Let's initialize the prev_key to zeros that would also work as the condition. In file included from /usr/include/stdio.h:894, from ./kerncompat.h:27, from ./kernel-lib/list.h:23, from ./kernel-shared/ctree.h:24, from kernel-shared/free-space-tree.c:19: In function ‘fprintf’, inlined from ‘load_free_space_extents’ at kernel-shared/free-space-tree.c:1446:5, inlined from ‘load_free_space_tree’ at kernel-shared/free-space-tree.c:1577:9: /usr/include/bits/stdio2.h:105:10: warning: ‘prev_key.objectid’ may be used uninitialized [-Wmaybe-uninitialized] 105 \| return __fprintf_chk (__stream, __USE_FORTIFY_LEVEL - 1, __fmt, \| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 106 \| __va_arg_pack ()); \| ~~~~~~~~~~~~~~~~~ kernel-shared/free-space-tree.c: In function ‘load_free_space_tree’: kernel-shared/free-space-tree.c:1398:31: note: ‘prev_key.objectid’ was declared here 1398 \| struct btrfs_key key, prev_key; Signed-off-by: David Sterba <dsterba@suse.com>	2022-10-11 09:06:11 +02:00
Qu Wenruo	2f2f6bfe17	btrfs-progs: btrfstune: add the ability to convert to block group tree feature The new '-b' option will be responsible for converting to block group tree compat ro feature. The workflow looks like this for new convert: - Setting CHANGING_BG_TREE flag And initialize fs_info->last_converted_bg_bytenr value to (u64)-1. Any bg with bytenr >= last_converted_bg_bytenr will have its bg item update go to the new root (bg tree). - Iterate each block group by their bytenr in descending order This involves: * Delete the old bg item from the old tree (extent tree) * Update last_converted_bg_bytenr to the bytenr of the bg * Add the new bg item into the new tree (bg tree) * If we have converted a bunch of bgs, commit current transaction - Clear CHANGING_BG_TREE flag And set the new BLOCK_GROUP_TREE compat ro flag and commit. And since we're doing the convert in multiple transactions, we also need to resume from last interrupted convert. In that case, we just grab the last unconverted bg, and start from it. And to co-operate with the new kernel requirement for both no-holes and free-space-tree features, the convert tool will check for free-space-tree feature. If not enabled, will error out with an error message to how to continue (by mounting with "-o space_cache=v2"). For missing no-holes feature, we just need to set the flag during convert. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-12 18:25:32 +02:00
Qu Wenruo	1430b41427	btrfs-progs: separate block group tree from extent tree v2 Block group tree feature is completely a standalone feature, and it has been over 5 years before the initial introduction to solve the long mount time. I don't really want to waste another 5 years waiting for a feature which may or may not work, but definitely not properly reviewed for its preparation patches. So this patch will separate the block group tree feature into a standalone compat RO feature. There is a catch, in mkfs create_block_group_tree(), current tree-checker only accepts block group item with valid chunk_objectid, but the existing code from extent-tree-v2 didn't properly initialize it. This patch will also fix above mentioned problem so kernel can mount it correctly. Now mkfs/fsck should be able to handle the fs with block group tree. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-12 18:25:32 +02:00
Qu Wenruo	c5a21a7814	btrfs-progs: don't save block group root into super block The extent tree v2 (thankfully not yet fully materialized) needs a new root for storing all block group items. My initial proposal years ago just added a new tree rootid, and load it from tree root, just like what we did for quota/free space tree/uuid/extent roots. But the extent tree v2 patches introduced a completely new (and to me, wasteful) way to store block group tree root into super block. Currently there are only 3 trees stored in super blocks, and they all have their valid reasons: - Chunk root Needed for bootstrap. - Tree root Really the entrance of all trees. - Log root This is special as log root has to be updated out of existing transaction mechanism. There is not even any reason to put block group root into super blocks, the block group tree is updated at the same timing as old extent tree, no need for extra bootstrap/out-of-transaction update. So just move block group root from super block into tree root. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-12 15:31:27 +02:00
Qu Wenruo	e47c34821f	btrfs-progs: rescue: allow fix-device-size to shrink device item If we found that the underlying block device size is smaller than total_bytes in dev item, kernel will reject the mount, and there is no progs tool to fix it. In most cases it's just a small mismatch, and there is no dev extent in the shrunk range. In that case, we can let "btrfs rescue fix-device-size" reset the total_bytes in dev items to fix it. We add some extra checks, like making sure there is no dev extent in the shrunk device range, to make sure we won't lose data during the device item shrink. And also update the test case to verify the repaired fs can pass the check. Issue: #504 Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-12 15:31:21 +02:00
Qu Wenruo	75fea7496c	btrfs-progs: use write_data_to_disk() to handle RAID56 in write_and_map_eb() Function write_data_to_disk() can handle RAID56 writes without any problem. So just call write_data_to_disk() inside write_and_map_eb() instead of manually doing the RAID56 write. Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-08-16 15:18:12 +02:00
Qu Wenruo	2060120201	btrfs-progs: fix a BUG_ON() condition for write_data_to_disk() The BUG_ON() condition in write_data_to_disk() is no longer correct. Now write_raid56_with_parity() returns the number of bytes written for the last stripe. Thus a successful writeback can trigger the BUG_ON(ret). Fix the condition to (ret < 0). Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-08-16 15:18:12 +02:00
Qu Wenruo	fc6925bfd3	btrfs-progs: avoid repeated data write for metadata [BUG] Shinichiro reported that "mkfs.btrfs -m DUP" is doing repeated write into the device. For non-zoned device this is not a big deal, but for zoned device this is critical, as zoned device doesn't support overwrite at all. [CAUSE] The problem is related to write_and_map_eb() call, since commit `2a93728391` ("btrfs-progs: use write_data_to_disk() to replace write_extent_to_disk()"), we call write_data_to_disk() for metadata write back. But the problem is, write_data_to_disk() will call btrfs_map_block() with rw = WRITE. By that btrfs_map_block() will always return all stripes, while in write_data_to_disk() we also iterate through each mirror of the range. This results above repeated writeback. [FIX] Fix this problem by completely remove @mirror argument from write_data_to_disk(). With extra comments to explicitly show that function will write to all mirrors. Reported-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com> Fixes: `2a93728391` ("btrfs-progs: use write_data_to_disk() to replace write_extent_to_disk()") Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-08-16 15:18:12 +02:00
Boris Burkov	ba7b281049	btrfs-progs: add VERITY ro compat flag This compat flag is missing, but is being checked by mount, and could well be present legitimately. Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>	2022-08-16 15:18:11 +02:00
Su Yue	d0a99313e5	btrfs-progs: save item data end in u64 to avoid overflow in btrfs_check_leaf() As in the kernel's check_leaf(), calling btrfs_item_end_nr() may return a reasonable-looking value even if an item has an invalid offset/size, due to u32 overflow. Fix it by using a u64 variable to store the item data end in btrfs_check_leaf(). Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=215299 Reported-by: Wenqing Liu <wenqingliu0120@gmail.com> Signed-off-by: Su Yue <l@damenly.su> Signed-off-by: David Sterba <dsterba@suse.com>	2022-08-16 15:18:11 +02:00
Qu Wenruo	963188943f	btrfs-progs: make btrfs_super_block::log_root_transid deprecated This is the same on-disk format update synchronized from the kernel code. Unlike kernel, there are two callers reading this member: - btrfs inspect dump-super It's just printing the value, add a notice about deprecation. - btrfs-find-root In that case, since we always got 0, the root search for log root should never find a perfect match. Use btrfs_super_generation() + 1 to provide a better result. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-08-16 15:18:11 +02:00
David Sterba	8356c423e6	btrfs-progs: receive: implement FILEATTR command The initial proposal for file attributes was built on simply doing SETFLAGS but this builds on an old and non-extensible interface that has no direct mapping for all inode flags. There's a unified interface fileattr that covers file attributes and xflags, it should be possible to add new bits. On the protocol level the value is copied as-is in the original inode but this does not provide enough information how to apply the bits on the receiving side. Eg. IMMUTABLE flag prevents any changes to the file and has to be handled manually. The receiving side does not apply the bits yet, only parses it from the stream. Signed-off-by: David Sterba <dsterba@suse.com>	2022-08-16 15:18:11 +02:00
Omar Sandoval	0ee5b22345	btrfs-progs: send: stream v2 ioctl flags First, add a --proto option to allow specifying the desired send protocol version. It defaults to one, the original version. In a couple of releases once people are aware that protocol revisions are happening, we can change it to default to zero, which means the latest version supported by the kernel. This is based on Dave Sterba's patch. Also add a --compressed-data flag to instruct the kernel to use encoded_write commands for compressed extents. This requires an explicit opt in separate from the protocol version because: 1. The user may not want compression on the receiving side, or may want a different compression algorithm/level on the receiving side. 2. It has a soft requirement for kernel support on the receiving side (btrfs-progs can fall back to decompressing and writing if the kernel doesn't support BTRFS_IOC_ENCODED_WRITE, but the user may not be prepared to pay that CPU cost). Going forward, since it's easier to update progs than the kernel, I think we'll want to make new send features that require kernel support opt-in, whereas anything that only requires a progs update can happen automatically. Signed-off-by: Boris Burkov <boris@bur.io> Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-06-07 13:59:33 +02:00
Omar Sandoval	1c05b10008	btrfs-progs: receive: add send stream v2 commands and attributes Update our copy of send.h from the kernel. This adds the new commands and attributes for v2 as well as explicit enum numbering. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Boris Burkov <boris@bur.io> Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-06-07 13:59:32 +02:00
Boris Burkov	a82996e1b6	btrfs-progs: receive: dynamically allocate sctx->read_buf In send stream v2, write commands can now be an arbitrary size. For that reason, we can no longer allocate a fixed array in sctx for read_cmd. Instead, read_cmd dynamically allocates sctx->read_buf. To avoid needless reallocations, we reuse read_buf between read_cmd calls by also keeping track of the size of the allocated buffer in sctx->read_buf_sz. We do the first allocation of the old default size at the start of processing the stream, and we only reallocate if we encounter a command that needs a larger buffer. Signed-off-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>	2022-06-07 13:59:31 +02:00
David Sterba	0f65bf66be	btrfs-progs: libbtrfs: drop ifdef BTRFS_FLAT_INCLUDES where not necessary Headers that are only exported and not used for build do not need the BTRFS_FLAT_INCLUDES switch (between local and installed headers). Now that there are local copies of the shared headers drop the respective part from local headers. Signed-off-by: David Sterba <dsterba@suse.com>	2022-06-06 15:48:52 +02:00
Johannes Thumshirn	a7ae6d5948	btrfs-progs: zoned: add upper and lower zone size boundaries As we're not supporting arbitrarily big or small zone sizes in the kernel, reject devices that don't fit in progs as well. Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-06-06 15:47:50 +02:00
Qu Wenruo	38f90e906e	btrfs-progs: properly initialize block group thresholds [BUG] When creating btrfs with new v2 cache (the default behavior), mkfs.btrfs always create the free space tree using bitmap. It's fine for small fs, but will be a disaster if the device is large and the data profile is something like RAID0: $ mkfs.btrfs -f -m raid1 -d raid0 /dev/test/scratch[1234] btrfs-progs v5.17 [...] Block group profiles: Data: RAID0 4.00GiB Metadata: RAID1 256.00MiB System: RAID1 8.00MiB [..] $ btrfs ins dump-tree -t free-space /dev/test/scratch1 btrfs-progs v5.17 free space tree key (FREE_SPACE_TREE ROOT_ITEM 0) node 30441472 level 1 items 10 free space 483 generation 6 owner FREE_SPACE_TREE node 30441472 flags 0x1(WRITTEN) backref revision 1 fs uuid deddccae-afd0-4160-9a12-48fe7b526fb1 chunk uuid 68f6cf98-afe3-4f47-9797-37fd9c610219 key (1048576 FREE_SPACE_INFO 4194304) block 30457856 gen 6 key (475004928 FREE_SPACE_BITMAP 8388608) block 30703616 gen 5 key (953155584 FREE_SPACE_BITMAP 8388608) block 30720000 gen 5 key (1431306240 FREE_SPACE_BITMAP 8388608) block 30736384 gen 5 key (1909456896 FREE_SPACE_BITMAP 8388608) block 30752768 gen 5 key (2387607552 FREE_SPACE_BITMAP 8388608) block 30769152 gen 5 key (2865758208 FREE_SPACE_BITMAP 8388608) block 30785536 gen 5 key (3343908864 FREE_SPACE_BITMAP 8388608) block 30801920 gen 5 key (3822059520 FREE_SPACE_BITMAP 8388608) block 30818304 gen 5 key (4300210176 FREE_SPACE_BITMAP 8388608) block 30834688 gen 5 [...] ^^^ So many bitmaps that an empty fs will have two levels for free space tree already [CAUSE] Member btrfs_block_group::bitmap_high_thresh is never properly set to any value other than 0, thus in function update_free_space_extent_count(), the following check is always true: if (!(flags & BTRFS_FREE_SPACE_USING_BITMAPS) && extent_count > block_group->bitmap_high_thresh) { ret = convert_free_space_to_bitmaps(trans, block_group, path); Thus we always got converted to bitmaps. 
[FIX] Cross-port the function set_free_space_tree_thresholds() from kernel, and call that function in btrfs_make_block_group() and read_one_block_group() so that every block group has bitmap_high_thresh properly set. Now even for that 4GiB large data chunk, we still only have one free extent: btrfs-progs v5.17 free space tree key (FREE_SPACE_TREE ROOT_ITEM 0) leaf 30572544 items 15 free space 15860 generation 6 owner FREE_SPACE_TREE leaf 30572544 flags 0x1(WRITTEN) backref revision 1 fs uuid b24e52ea-6580-4a88-aa70-cb173090bfe3 chunk uuid d85f3905-fc61-4084-b335-2b6b97814b8e [...] item 13 key (298844160 FREE_SPACE_INFO 4294967296) itemoff 16235 itemsize 8 free space info extent count 1 flags 0 item 14 key (298844160 FREE_SPACE_EXTENT 4294967296) itemoff 16235 itemsize 0 free space extent Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-05-20 15:54:20 +02:00
Qu Wenruo	9bded24a46	btrfs-progs: do not use btrfs_commit_transaction() just to update super blocks There are several call sites utilizing btrfs_commit_transaction() just to update members in super blocks, without any metadata update. This can be problematic for some simple call sites, like zero_log_tree() or check_and_repair_super_num_devs(). If we have big problems preventing the fs to be mounted in the first place, and need to clear the log or super block size, but by some other problems in extent tree, we're unable to allocate new blocks. Then we fall into a deadlock that, we need to mount (even ro,rescue=all) to collect extra info, but btrfs-progs can not do any super block updates. Fix the problem by allowing the following super blocks only operations to be done without using btrfs_commit_transaction(): - btrfs_fix_super_size() - check_and_repair_super_num_devs() - zero_log_tree(). There are some exceptions in btrfstune.c, related to the csum type conversion and seed flags. In those btrfstune cases, we in fact wants to proper error report in btrfs_commit_transaction(), as those operations are not mount critical, and any early error can be helpful to expose any problems in the fs. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-05-20 15:54:16 +02:00
David Sterba	f1178950d3	btrfs-progs: btrfstune: fix build-time detection of experimental features Qu noticed that the full checksums are still printed even if the experimental build is not enabled. This is caused by wrong use of #ifdef (as the macro is always defined), this must be "#if". Fixes: `1bb6fb896d` ("btrfs-progs: btrfstune: experimental, new option to switch csums") Reported-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-05-10 15:42:13 +02:00
Qu Wenruo	50a5dfde6d	btrfs-progs: print-tree: print the checksum of header without tailing zeros For the default CRC32C checksum, print-tree now prints tons of unnecessary padding zeros: btrfs-progs v5.17 chunk tree leaf 22036480 items 7 free space 15430 generation 6 owner CHUNK_TREE leaf 22036480 flags 0x1(WRITTEN) backref revision 1 checksum stored 0ac1b9fa00000000000000000000000000000000000000000000000000000000 checksum calced 0ac1b9fa00000000000000000000000000000000000000000000000000000000 fs uuid 3d95b7e3-3ab6-4927-af56-c58aa634342e This is caused by commit `1bb6fb896d` ("btrfs-progs: btrfstune: experimental, new option to switch csums"), and it looks like most distros just enable EXPERIMENTAL features by default. (Which is a good thing to provide much better coverage). So here we just limit the csum print to the utilized csum size. Now the output looks like: btrfs-progs v5.17 chunk tree leaf 22036480 items 4 free space 15781 generation 6 owner CHUNK_TREE leaf 22036480 flags 0x1(WRITTEN) backref revision 1 checksum stored 676b812f checksum calced 676b812f fs uuid d11f8799-b6dc-415d-b1ed-cebe6da5f0b7 Fixes: `1bb6fb896d` ("btrfs-progs: btrfstune: experimental, new option to switch csums") Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-05-10 13:44:37 +02:00
Qu Wenruo	851ef59b2c	btrfs-progs: remove the unused btrfs_fs_info::seeding member This member is not used by anyone, just remove it. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-04-29 22:13:22 +02:00
Qu Wenruo	4e9e978783	btrfs-progs: allow read_data_from_disk() to rebuild RAID56 using P/Q This new ability is added by: - Allow btrfs_map_block() to return the chunk type This makes later work much easier - Only reset stripe offset inside btrfs_map_block() when needed Currently if @raid_map is not NULL, btrfs_map_block() will consider this call is for WRITE and will reset stripe offset. This is no longer the case, as for RAID56 read with mirror_num 1/0, we will still call btrfs_map_block() with non-NULL raid_map. Add a small check to make sure we won't reset stripe offset for mirror 1/0 read. - Add new helper read_raid56() to handle rebuild We will read the full stripe (including all data and P/Q stripes), do the rebuild, then only copy the referenced part to the caller. There is a catch for RAID6, we have no way to exhaust all combinations, so the current repair will assume the mirror = 0 data is corrupted, then try to find a missing device. But if no missing device can be found, it will assume P is corrupted. This is just a guess, and can be totally wrong, but we have no better idea. Now btrfs-progs have full read ability for RAID56. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-04-25 19:08:30 +02:00
Qu Wenruo	a99bece1cd	btrfs-progs: remove extent_buffer::fd and extent_buffer::dev_bytes Those two members are a shortcut for non-RAID56 profiles. But we should not use such a shortcut, and should move all our logical address read/write to the unified read_data_from_disk()/write_data_to_disk(). With previous refactors, now we're safe to remove them. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-04-25 19:08:30 +02:00
Qu Wenruo	3ff9d35257	btrfs-progs: use read_data_from_disk() to replace read_extent_from_disk() and replace read_extent_data() The function read_extent_from_disk() is only a wrapper to read tree block. And read_extent_data() is just a while loop to eliminate short read caused by stripe boundary. In fact, a lot of call sites of read_extent_data() are either reading metadata (thus no possible short read) or doing extra loop by themselves. This patch will replace those two functions with read_data_from_disk(), making it the only entrance for data/metadata read. And update read_data_from_disk() to return the read bytes, so caller can do a simple while loop. For the few callers of read_extent_data(), open-code a small while loop for them. This will allow later RAID56 read repair using P/Q much easier. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-04-25 19:08:30 +02:00
Qu Wenruo	2a93728391	btrfs-progs: use write_data_to_disk() to replace write_extent_to_disk() Function write_extent_to_disk() is just writing the content of a tree block to disk. It can not handle RAID56, and its work is the same as write_data_to_disk(). Thus we can replace write_extent_to_disk() with write_data_to_disk() easily. There is only one special call site in write_raid56_with_parity(), which can easily be replaced with btrfs_pwrite() directly. This reduces the write entrance, and makes later eb::fd removal easier. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-04-25 19:08:29 +02:00
Qu Wenruo	01c25d73f1	btrfs-progs: extract metadata restore read code into its own helper For metadata restore, our logical address is mapped to a single device with logical address 1:1 mapped to device physical address. Move this part of code into a helper, this will make later extent buffer read path refactor much easier. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-04-25 19:07:09 +02:00
Qu Wenruo	7a0c4b5dc1	btrfs-progs: remove the unnecessary BTRFS_SUPER_INFO_OFFSET path for tree block read We used to use read_whole_eb() to read super block, but it's no longer the case (so long that I can not even find out which commit did the conversion). Thus there is no need for read_whole_eb() to handle super block read anymore. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-04-25 19:07:08 +02:00
Qu Wenruo	f9659c7235	btrfs-progs: fix an error path which can lead to empty device list [BUG] With the incoming delayed chunk item insertion feature, there is a super weird failure at mkfs/022: ====== RUN CHECK ./mkfs.btrfs -f --rootdir tmp.KnKpP5 -d dup -b 350M tests/test.img ... Checksum: crc32c Number of devices: 0 Devices: ID SIZE PATH Note the "Number of devices: 0" line, this means our fs_info->fs_devices->devices list is empty. And since our rw device list is empty, we won't finish the mkfs with proper superblock magic, and cause later btrfs check to fail. [CAUSE] Although the failure is only triggered by the incoming delayed chunk item insertion feature, the bug itself has been here for a while. In btrfs_alloc_chunk(), we move rw devices to our @private_devs list first, then in create_chunk(), we move it back to our rw devices list. This dance is pretty dangerous, especially if btrfs_alloc_dev_extent() failed inside create_chunk(), and the current profile has multiple stripes (including DUP), we will exit create_chunk() directly, without moving the remaining devices in @private_devs list back to @dev_list. Furthermore, btrfs_alloc_chunk() is expected to return -ENOSPC, as we call btrfs_alloc_chunk() to pre-allocate chunks, and ignore the -ENOSPC error if it's just a pre-allocation failure. This existing error path can lead to the empty rw list seen above. [FIX] After create_chunk(), unconditionally move all devices in @private_devs back to rw device list. And add extra check to make sure our rw device list is never empty after a chunk allocation attempt. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-04-25 18:33:29 +02:00
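The shape of the fix (always splice the private list back, then check that the rw list is not empty) can be sketched with a tiny circular doubly linked list. Everything here is a hypothetical model written for illustration: struct node, the list helpers, and alloc_chunk_demo() are not the btrfs-progs list API or btrfs_alloc_chunk(), and the fail_at parameter only simulates btrfs_alloc_dev_extent() failing partway through.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct node { struct node *prev, *next; };

static void list_init(struct node *h) { h->prev = h->next = h; }
static int list_is_empty(const struct node *h) { return h->next == h; }

static void list_del_demo(struct node *n)
{
	n->prev->next = n->next;
	n->next->prev = n->prev;
	list_init(n);
}

static void list_add_tail_demo(struct node *n, struct node *h)
{
	n->prev = h->prev;
	n->next = h;
	h->prev->next = n;
	h->prev = n;
}

/* Move every entry of 'from' to the tail of 'to', leaving 'from' empty. */
static void list_move_all_tail(struct node *from, struct node *to)
{
	if (list_is_empty(from))
		return;
	from->next->prev = to->prev;
	to->prev->next = from->next;
	from->prev->next = to;
	to->prev = from->prev;
	list_init(from);
}

static int list_count(const struct node *h)
{
	int n = 0;

	for (const struct node *p = h->next; p != h; p = p->next)
		n++;
	return n;
}

/* fail_at < 0 means success; otherwise stripe 'fail_at' fails with -ENOSPC. */
static int alloc_chunk_demo(struct node *rw_devs, int nr_stripes, int fail_at)
{
	struct node private_devs;
	int taken = 0;
	int ret = 0;

	list_init(&private_devs);
	/* Take devices off the rw list, one per stripe. */
	while (taken < nr_stripes && !list_is_empty(rw_devs)) {
		struct node *dev = rw_devs->next;

		list_del_demo(dev);
		list_add_tail_demo(dev, &private_devs);
		taken++;
	}
	/* Per-stripe work; an early exit leaves devices on private_devs. */
	for (int i = 0; i < taken; i++) {
		struct node *dev = private_devs.next;

		if (i == fail_at) {
			ret = -ENOSPC;
			break;
		}
		list_del_demo(dev);
		list_add_tail_demo(dev, rw_devs);
	}
	/* The fix: return leftovers unconditionally, success or failure. */
	list_move_all_tail(&private_devs, rw_devs);
	assert(!list_is_empty(rw_devs));
	return ret;
}
```

Without the unconditional splice, a failure on the second stripe of a two-device list would return only one device to the rw list, and repeated failures could eventually empty it.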
Qu Wenruo	4a940ab2c0	btrfs-progs: fix a memory leak when starting a transaction on fs with error Function btrfs_start_transaction() will allocate the memory unconditionally, but if the fs has an aborted transaction we don't free the allocated memory but return error directly. Fix it by only allocating the new memory after all the checks. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-04-25 18:32:17 +02:00
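The pattern behind this fix is to run every early-return check before allocating, so no error path has anything to free. A minimal sketch, assuming hypothetical demo_fs and demo_trans types and an illustrative -EROFS return value; this is not the real btrfs_start_transaction() code.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the filesystem and transaction handle. */
struct demo_fs {
	int transaction_aborted;
};

struct demo_trans {
	struct demo_fs *fs;
};

static int start_transaction_demo(struct demo_fs *fs, struct demo_trans **out)
{
	struct demo_trans *trans;

	assert(fs != NULL && out != NULL);
	*out = NULL;

	/* All checks come first: nothing is allocated yet, so nothing leaks. */
	if (fs->transaction_aborted)
		return -EROFS;

	/* Allocate only once we know the transaction can start. */
	trans = calloc(1, sizeof(*trans));
	if (!trans)
		return -ENOMEM;
	trans->fs = fs;
	*out = trans;
	return 0;
}
```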
Naohiro Aota	fd4bab06a4	btrfs-progs: zoned: fix and simplify dev_extent_hole_check_zoned() The previous patch revealed a bug in dev_extent_hole_check_zoned(). If the given hole is OK to use as is, it should have just returned the hole. But on the contrary, it shifts the hole start position by one zone. That results in refusing any hole. We don't use btrfs_ensure_empty_zones() in the btrfs-progs version of dev_extent_hole_check_zoned() unlike the kernel side, because btrfs_find_allocatable_zones() itself is doing the necessary checks. So, we can just "return changed" if the "pos" is unchanged. That also makes the loop and "changed" variable unnecessary. So, fix and simplify the code in one shot. Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-04-08 23:17:35 +02:00
Naohiro Aota	38670212dd	btrfs-progs: fix ordering of hole_size setting and dev_extent_hole_check() The hole_size is used by dev_extent_hole_check() to check the hole is OK as a device extent. However, commit `b031fe84fd` ("btrfs-progs: zoned: implement zoned chunk allocator") mis-ported the kernel code and placed dev_extent_hole_check() before setting hole_size. That made the dev_extent_hole_check() call here essentially pass through as we have hole_size == 0 at mkfs time. As a result, mkfs.btrfs creates data BG at 64 MB where the regular superblock exists, when zone size is 16 MB. Fix the ordering of hole_size setting and calling dev_extent_hole_check(). Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-04-08 23:17:35 +02:00
Naohiro Aota	32c43d0c68	btrfs-progs: zoned: export sb_zone_number() and related constants Move sb_zone_number() and related constants from zoned.c to the corresponding header for later use. Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-04-08 23:17:35 +02:00
Sweet Tea Dorminy	c494724858	btrfs-progs: dump-tree: add print support for verity items 'btrfs inspect-internal dump-tree' doesn't currently know about the two types of verity items and prints them as 'UNKNOWN.36' or 'UNKNOWN.37'. So add them to the known item types. Suggested-by: Boris Burkov <boris@bur.io> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me> Signed-off-by: David Sterba <dsterba@suse.com>	2022-03-24 00:49:19 +01:00
Josef Bacik	02fb308bdc	btrfs-progs: make btrfs_create_tree take a key for the root key We're going to start creating global roots from mkfs, and we need to have an offset set for the root key. Make the btrfs_create_tree() take a key for the root_key instead of just the objectid so we can set up these new style roots properly. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-03-09 18:07:22 +01:00
Josef Bacik	5fb27deaf1	btrfs-progs: make btrfs_clear_free_space_tree extent tree v2 aware With extent tree v2 we'll have multiple free space trees, and we can't just unset the feature flags for the free space tree. Fix this to loop through all of the free space trees and clear them out properly. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-03-09 18:07:21 +01:00
Josef Bacik	c4164edeb5	btrfs-progs: add a btrfs_delete_and_free_root helper The free space tree code already does this, but we need it for cleaning up per block group roots. Abstract this code out into a helper so that we can use it in multiple places in the future. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-03-09 18:07:19 +01:00

232 Commits