We want to enable developers to test the extent tree v2 features as they
are added, add the ability to mkfs an extent tree v2 fs if we have
experimental enabled.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We are going to have multiples of these trees with extent tree v2, so
add a rb tree to track them based on their root key value. This works
for both v1 and v2, so we can remove the direct pointers to these roots
in our fs_info.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We're going to have multiple free space roots in the future, so access
it via a helper in most cases. We will address the remaining direct
accesses in future patches.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When we switch to multiple global trees we'll need to access the
appropriate extent root depending on the block group or possibly root.
To handle this, use a helper in most places and then the actual root in
places where it is required. We will whittle down the direct accessors
with future patches, but this does the bulk of the preparatory work.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
With extent tree v2 we will have per-block group checksums, so add a
helper to access the csum root and rename the fs_info csum_root to
_csum_root to catch all the places that are accessing it directly.
Convert everybody to use the helper except for internal things.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We are going to need to start looking up the csum root based on the
bytenr with extent tree v2. To that end stop passing the root to the
csum related functions so that can be done in the helper functions
themselves.
There's an unrelated deletion of a function prototype that no longer
exists.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Just like kernel commit 22b6331d9617 ("btrfs: store precalculated
csum_size in fs_info"), we can cache csum_size and csum_type in
btrfs_fs_info.
Furthermore, there is already a 32 bits hole in btrfs_fs_info, and we
can fit csum_type and csum_size into the hole without increase the size
of btrfs_fs_info.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Just like kernel change, pad struct btrfs_super_block to 4096 bytes. As
ctree.h is part of public headers, use raw number for the superblock
offset.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When compiling btrfs-progs on 32bit x86 using GCC 11.1.0, there are
several warnings:
In file included from ./common/utils.h:30,
from check/main.c:36:
check/main.c: In function 'run_next_block':
./common/messages.h:42:31: warning: format '%lu' expects argument of type 'long unsigned int', but argument 4 has type 'u32' {aka 'unsigned int'} [-Wformat=]
42 | __btrfs_error((fmt), ##__VA_ARGS__); \
| ^~~~~
check/main.c:6496:33: note: in expansion of macro 'error'
6496 | error(
| ^~~~~
In file included from ./common/utils.h:30,
from kernel-shared/volumes.c:32:
kernel-shared/volumes.c: In function 'btrfs_check_chunk_valid':
./common/messages.h:42:31: warning: format '%lu' expects argument of type 'long unsigned int', but argument 4 has type 'u32' {aka 'unsigned int'} [-Wformat=]
42 | __btrfs_error((fmt), ##__VA_ARGS__); \
| ^~~~~
kernel-shared/volumes.c:2052:17: note: in expansion of macro 'error'
2052 | error("invalid chunk item size, have %u expect [%zu, %lu)",
| ^~~~~
image/main.c: In function 'search_for_chunk_blocks':
./common/messages.h:42:31: warning: format '%lu' expects argument of type 'long unsigned int', but argument 3 has type 'size_t' {aka 'unsigned int'} [-Wformat=]
42 | __btrfs_error((fmt), ##__VA_ARGS__); \
| ^~~~~
image/main.c:2122:33: note: in expansion of macro 'error'
2122 | error(
| ^~~~~
There are two types of problems:
- __BTRFS_LEAF_DATA_SIZE()
This macro has no type definition, making it behaves differently on
different arches.
Fix this by following kernel to use inline function to make its return
value fixed to u32.
- size_t related output
For x86_64 %lu is OK but not for x86.
Fix this by using %zu.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Commit ("btrfs-progs: switch btrfs_group_profile_str to use raid table")
introduced a regression that raid profile of GlobalReserve will be
printed as 'unknown'.
$ btrfs filesystem df /mnt/test
Data, single: total=5.02TiB, used=4.98TiB
System, single: total=4.00MiB, used=624.00KiB
Metadata, single: total=11.01GiB, used=6.94GiB
GlobalReserve, unknown: total=512.00MiB, used=0.00B
Fix it by:
- take BTRFS_BLOCK_GROUP_RESERVED into account when masking the block
group flags
- update the define of BTRFS_BLOCK_GROUP_RESERVED too so it's same as in
kernel
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Wang Yugui <wangyugui@e16-tech.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
When a special image (diverted from fsck/012) has its unused slots (slot
number >= nritems) with garbage, lowmem mode btrfs check can crash:
(gdb) run check --mode=lowmem ~/downloads/good.img.restored
Starting program: /home/adam/btrfs/btrfs-progs/btrfs check --mode=lowmem ~/downloads/good.img.restored
...
ERROR: root 5 INODE[5044031582654955520] nlink(257228800) not equal to inode_refs(0)
ERROR: root 5 INODE[5044031582654955520] nbytes 474624 not equal to extent_size 0
Program received signal SIGSEGV, Segmentation fault.
0x0000555555639b11 in btrfs_inode_size (eb=0x5555558a7540, s=0x642e6cd1) at ./kernel-shared/ctree.h:1703
1703 BTRFS_SETGET_FUNCS(inode_size, struct btrfs_inode_item, size, 64);
(gdb) bt
#0 0x0000555555639b11 in btrfs_inode_size (eb=0x5555558a7540, s=0x642e6cd1) at ./kernel-shared/ctree.h:1703
#1 0x0000555555641544 in check_inode_item (root=0x5555556c2290, path=0x7fffffffd960) at check/mode-lowmem.c:2628
[CAUSE]
At check_inode_item() we have path->slot[0] at 29, while the tree block
only has 26 items.
This happens because two reasons:
- btrfs_next_item() never reverts its slots
Even if we failed to read next leaf.
- check_inode_item() doesn't inform the caller that a fatal error
happened
In check_inode_item(), if btrfs_next_item() failed, it goes to out
label, which doesn't really set @err properly.
This means, when check_inode_item() fails at btrfs_next_item(), it will
increase path->slots[0], while it's already beyond current tree block
nritems.
When the slot increases furthermore, and if the unused item slots have
some garbage, we will get invalid btrfs_item_ptr() result, and causing
above segfault.
[FIX]
Fix the problems by two ways:
- Make btrfs_next_item() to revert its path->slots[0] on failure
- Properly detect fatal error from check_inode_item()
By this, we will no longer crash on the crafted image.
Reported-by: Wang Yugui <wangyugui@e16-tech.com>
Issue: #412
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Make the helpers using crc32c not inline so the crc32c.h can be removed
from the public headers exported by libbtrfs.
Signed-off-by: David Sterba <dsterba@suse.com>
There's an ancient macro btrfs_crc32c which is just wrapping crc32c and
not doing anything else, so we can use the crc helper directly.
Signed-off-by: David Sterba <dsterba@suse.com>
To drop sizes.h from exported headers, replace the few SZ_ constants
from the existing exported headers (ctree.h, send.h). It would be nice
to use them in the long run but right now it would prevent unexporting
the sizes.h file.
Signed-off-by: David Sterba <dsterba@suse.com>
It will be used to clear received data on RW snapshots that were
received. The function is copied from kernel sources.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This function lies in the kernel-shared directory and is supposed to be
close to 1:1 copy with its kernel counterpart, yet it takes one extra
argument - root. But this is now unused to simply remove it.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It's not used, so just remove it.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
max_zone_append_size is unused and can as well be removed just like we
did on the kernel side.
Keep one sanity check though, so we're not adding devices to a zoned FS
that aren't supporting zone append.
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
There is a report that, btrfstune can even work while the fs has transid
mismatch problems.
$ btrfstune -f -u /dev/sdb1
Current fsid: b2b5ae8d-4c49-45f0-b42e-46fe7dcfcb07
New fsid: b2b5ae8d-4c49-45f0-b42e-46fe7dcfcb07
Set superblock flag CHANGING_FSID
Change fsid in extents
parent transid verify failed on 792854528 wanted 20103 found 20091
parent transid verify failed on 792854528 wanted 20103 found 20091
parent transid verify failed on 792854528 wanted 20103 found 20091
Ignoring transid failure
parent transid verify failed on 792870912 wanted 20103 found 20091
parent transid verify failed on 792870912 wanted 20103 found 20091
parent transid verify failed on 792870912 wanted 20103 found 20091
Ignoring transid failure
parent transid verify failed on 792887296 wanted 20103 found 20091
parent transid verify failed on 792887296 wanted 20103 found 20091
parent transid verify failed on 792887296 wanted 20103 found 20091
Ignoring transid failure
ERROR: child eb corrupted: parent bytenr=38010880 item=69 parent level=1 child level=1
ERROR: failed to change UUID of metadata: -5
ERROR: btrfstune failed
This leaves a corrupted fs even more corrupted, and due to the extra
CHANGING_FSID flag, btrfs check will not even try to run on it:
Opening filesystem to check...
ERROR: Filesystem UUID change in progress
ERROR: cannot open file system
[CAUSE]
Unlike kernel, btrfs-progs has a less strict check on transid mismatch.
In read_tree_block() we will fall back to use the tree block even its
transid mismatch if we can't find any better copy.
However not all commands in btrfs-progs needs this feature, only
btrfs-check (which may fix the problem) and btrfs-restore (it just tries
to ignore any problems) really utilize this feature.
[FIX]
Introduce a new open ctree flag, OPEN_CTREE_ALLOW_TRANSID_MISMATCH, to
be explicit about whether we really want to ignore transid error.
Currently only btrfs-check and btrfs-restore will utilize this new flag.
Also add btrfs-image to allow opening such fs with transid error.
Link: https://www.reddit.com/r/btrfs/comments/pivpqk/failure_during_btrfstune_u/
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The btrfs_check_node() has far less meaningful error message compared to
kernel counterpart, and it even lacks certain checks like level check.
Backport btrfs_check_node() to btrfs-progs to not only unify the code
but greatly improve the readability of the error messages.
Extra modification includes:
- No fs_info needed
As we don't need to output fsid.
- Remove unlikely() macro
- Extra BTRFS_TREE_BLOCK_* error type
- Btrfs-progs specific error handling
To record the corrupted tree blocks.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In kernel space we hardly use btrfs_disk_key, unless for very lowlevel
code.
There is no need to intentionally use btrfs_disk_key in btrfs-progs
either.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
I will have a lot of preparatory patches to reduce the review pain of
this large feature. In order to enable that work define the incompat
flag. Once all of the work lands to support the feature there will be a
patch to actually enable us to select it and manipulate file systems
with that incompat flag set.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Tree manipulating operations like merging nodes often release
once-allocated tree nodes. Btrfs cleans such nodes so that pages in the
node are not uselessly written out. On ZONED drives, however, such
optimization blocks the following IOs as the cancellation of the write
out of the freed blocks breaks the sequential write sequence expected by
the device.
Check if next dirty extent buffer is continuous to a previously written
one. If not, it redirty extent buffers between the previous one and the
next one, so that all dirty buffers are written sequentially.
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
A zoned filesystem must allocate blocks at the zones' write pointer. The
device's write pointer position can be mapped to a logical address
within a block group. To facilitate this, add an "alloc_offset" to the
block group to track the logical addresses of the write pointer.
This logical address is populated in btrfs_load_block_group_zone_info()
from the write pointers of corresponding zones.
For now, zoned filesystems the single profile. Supporting non-single
profile with zone append writing is not trivial. For example, in the DUP
profile, we send a zone append writing IO to two zones on a device. The
device reply with written LBAs for the IOs. If the offsets of the
returned addresses from the beginning of the zone are different, then it
results in different logical addresses.
We need fine-grained logical to physical mapping to support such
separated physical address issue. Since it should require additional
metadata type, disable non-single profiles for now.
This commit supports the case all the zones in a block group are
sequential. The next patch will handle the case having a conventional
zone.
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The zone append write command has a maximum IO size restriction it
accepts. This is because a zone append write command cannot be split, as
we ask the device to place the data into a specific target zone and the
device responds with the actual written location of the data.
Introduce max_zone_append_size to zone_info and fs_info to track the
value, so we can limit all I/O to a zoned block device that we want to
write using the zone append command to the device's limits.
Zone append command is mandatory for zoned btrfs. So, reject a device
with max_zone_append_size == 0.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Introduce function btrfs_check_zoned_mode() to check if ZONED flag is
enabled on the file system and if the file system consists of zoned
devices with equal zone size.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
With the zoned feature enabled, a zoned block device-aware btrfs
allocates block groups aligned to the device zones and always written in
sequential zones at the zone write pointer position.
It also supports "emulated" zoned mode on a non-zoned device. In the
emulated mode, btrfs emulates conventional zones by slicing the device
into fixed-size zones.
We don't support conversion from the ext4 volume with the zoned feature
because we can't be sure all the converted block groups are aligned to
zone boundaries.
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This function exists in kernel side but using the _item suffix, and
objectid argument is placed before the name argument. Change the
function to reflect the kernel version.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>