[BUG]
When running mkfs tests on a newly rebooted minimal system, mkfs/009
can fail.
Reproducing the failure requires /tmp to contain very few files to
begin with.
# mkdir /tmp/rootdir
# xfs_io -f -c "pwrite 0 16k" /tmp/rootdir
# mkfs.btrfs --rootdir /tmp/rootdir -f $dev
# btrfs check $dev
Opening filesystem to check...
Checking filesystem on /dev/test/scratch1
UUID: 6821b3db-f056-4c18-b797-32679dcd4272
[1/7] checking root items
[2/7] checking extents
data backref 13631488 root 5 owner 170 offset 0 num_refs 0 not found in extent tree
incorrect local backref count on 13631488 root 5 owner 170 offset 0 found 1 wanted 0 back 0x55ff6cd72260
backref 13631488 root 5 not referenced back 0x55ff6cd4c1f0
incorrect global backref count on 13631488 found 2 wanted 1
backpointer mismatch on [13631488 16384]
ERROR: errors found in extent allocation tree or chunk allocation
[CAUSE]
The extent tree has the following weird item:
item 0 key (13631488 EXTENT_ITEM 16384) itemoff 16250 itemsize 33
refs 1 gen 0 flags DATA
tree block backref root FS_TREE
This is an extent item for data, thus it should not have an inline tree
backref.
Then checking the fs tree:
item 0 key (170 INODE_ITEM 0) itemoff 16123 itemsize 160
generation 7 transid 0 size 16384 nbytes 16384
block group 0 mode 100600 links 1 uid 1000 gid 1000 rdev 0
sequence 0 flags 0x0(none)
atime 1664866393.0 (2022-10-04 14:53:13)
ctime 1664863510.0 (2022-10-04 14:05:10)
mtime 1664863455.0 (2022-10-04 14:04:15)
otime 0.0 (1970-01-01 08:00:00)
There is an inode item before the root dir inode.
And that inode number 170 is causing the problem.
In traverse_directory(), we use the inode number reported from stat()
directly as btrfs inode number, and pass it to
btrfs_record_file_extent(), which finally calls btrfs_inc_extent_ref(),
with above 170 passed as @owner parameter.
But inside btrfs_inc_extent_ref() we use that @owner value to determine
if it's a data backref.
Since the value is smaller than BTRFS_FIRST_FREE_OBJECTID, btrfs treats
the extent as a tree block, which causes the above problem.
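The check described above can be illustrated with a minimal sketch. It is
not the btrfs-progs code: owner_is_tree_block() is a hypothetical helper
used only for illustration, and the only value taken from the on-disk
format is BTRFS_FIRST_FREE_OBJECTID (256).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* First objectid available for regular inodes in a btrfs tree. */
#define BTRFS_FIRST_FREE_OBJECTID 256ULL

/*
 * Hypothetical helper: an @owner below BTRFS_FIRST_FREE_OBJECTID is not
 * taken as an inode number, so the backref is recorded as a tree block
 * backref instead of a data backref.
 */
static bool owner_is_tree_block(uint64_t owner)
{
	return owner < BTRFS_FIRST_FREE_OBJECTID;
}
```

With the inode number 170 from the dump above, the helper reports a tree
block, which matches the bogus "tree block backref" on a DATA extent.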
[FIX]
As a quick fix, always add BTRFS_FIRST_FREE_OBJECTID to every inode
number grabbed directly from stat().
And add an ASSERT() in __btrfs_record_file_extent() to catch unexpected
objectid.
This is not a perfect solution, as the resulting fs will have a huge gap
in its inode numbers:
item 0 key (256 INODE_ITEM 0) itemoff 16123 itemsize 160
item 4 key (426 INODE_ITEM 0) itemoff 15883 itemsize 160
For a proper fix, we should allocate new btrfs inode numbers in a
sequential order, but that would be another series of patches.
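A minimal sketch of the quick fix, under stated assumptions:
stat_ino_to_btrfs_ino() is a hypothetical name, and the assert() only
mirrors the intent of the ASSERT() mentioned above, not its exact
placement in __btrfs_record_file_extent().

```c
#include <assert.h>
#include <stdint.h>

#define BTRFS_FIRST_FREE_OBJECTID 256ULL

/*
 * Hypothetical helper: derive a btrfs inode number from the inode number
 * reported by stat() by shifting it past the reserved objectid range.
 */
static uint64_t stat_ino_to_btrfs_ino(uint64_t st_ino)
{
	uint64_t ino = st_ino + BTRFS_FIRST_FREE_OBJECTID;

	/* Catch an unexpected objectid, e.g. after an integer wrap. */
	assert(ino >= BTRFS_FIRST_FREE_OBJECTID);
	return ino;
}
```

For the host inode 170 this yields 426, the second key in the dump above,
which also shows where the gap after the root dir inode 256 comes from.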
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
When using the mkfs.btrfs --rootdir option, the generated data extents
have 0 as their generation in the extent tree:
# mkdir /tmp/rootdir
# xfs_io -f -c "pwrite 0 16k" /tmp/rootdir/foobar
# mkfs.btrfs -f --rootdir /tmp/rootdir $dev
# btrfs ins dump-tree -t extent $dev
btrfs-progs v5.19.1
extent tree key (EXTENT_TREE ROOT_ITEM 0)
leaf 30474240 items 13 free space 15536 generation 7 owner EXTENT_TREE
leaf 30474240 flags 0x1(WRITTEN) backref revision 1
fs uuid c1f05988-49f9-4dd4-8489-b90d60f522ee
chunk uuid 40f81603-fe75-4f58-aa9e-e74e28df8523
item 0 key (13631488 EXTENT_ITEM 16384) itemoff 16230 itemsize 53
refs 1 gen 0 flags DATA <<< Generation is 0
...
[CAUSE]
In __btrfs_record_file_extent() we just set the extent generation to 0.
[FIX]
Use trans->transid to properly fill extent generation.
Now after mkfs, the first data extent backref looks like this:
item 0 key (13631488 EXTENT_ITEM 16384) itemoff 16230 itemsize 53
refs 1 gen 7 flags DATA
...
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The radix-tree is not used in userspace code. In kernel it's for
tracking unpersisted and in-memory structures and has been replaced by
the xarray.
Signed-off-by: David Sterba <dsterba@suse.com>
The new '-b' option is responsible for converting a filesystem to the
block group tree compat ro feature.
The workflow for the new conversion looks like this:
- Set the CHANGING_BG_TREE flag
  And initialize fs_info->last_converted_bg_bytenr to (u64)-1.
  Any bg with bytenr >= last_converted_bg_bytenr will have its bg item
  updates go to the new root (bg tree).
- Iterate over the block groups by bytenr in descending order
  This involves:
  * Delete the old bg item from the old tree (extent tree)
  * Update last_converted_bg_bytenr to the bytenr of the bg
  * Add the new bg item into the new tree (bg tree)
  * If we have converted a bunch of bgs, commit the current transaction
- Clear the CHANGING_BG_TREE flag
  And set the new BLOCK_GROUP_TREE compat ro flag and commit.
Since the conversion spans multiple transactions, we also need to be
able to resume an interrupted conversion.
In that case, we just grab the last unconverted bg and start from it.
To co-operate with the new kernel requirement for both the no-holes and
free-space-tree features, the convert tool checks for the
free-space-tree feature. If it is not enabled, the tool errors out with
a message on how to continue (by mounting with "-o space_cache=v2").
For a missing no-holes feature, we just need to set the flag during the
conversion.
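The workflow above can be sketched as a small in-memory model. Everything
here is hypothetical (struct bg_model, struct convert_state, the batch
parameter and the commit counter); it only models the ordering of the
steps and the resume rule, not the actual tree operations.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical in-memory model; not the btrfs-progs structures. */
struct bg_model {
	uint64_t bytenr;
	bool in_bg_tree;	/* false: the item still lives in the extent tree */
};

struct convert_state {
	bool changing_bg_tree;		/* models the CHANGING_BG_TREE flag */
	uint64_t last_converted_bg_bytenr;
	unsigned int commits;		/* number of simulated commits */
};

/* Block groups at or above the marker are updated in the new bg tree. */
static bool bg_uses_new_tree(const struct convert_state *st, uint64_t bytenr)
{
	return bytenr >= st->last_converted_bg_bytenr;
}

/*
 * Convert all block groups. @bgs must be sorted by bytenr in ascending
 * order; the loop walks it backwards to get the descending order.
 * @batch is an assumed commit interval.
 */
static void convert_to_bg_tree(struct convert_state *st, struct bg_model *bgs,
			       size_t nr, unsigned int batch)
{
	unsigned int converted = 0;

	assert(batch > 0);
	st->changing_bg_tree = true;
	st->last_converted_bg_bytenr = UINT64_MAX;	/* (u64)-1 */

	for (size_t i = nr; i-- > 0; ) {
		/* Resume: skip what an interrupted run already moved. */
		if (bgs[i].in_bg_tree) {
			st->last_converted_bg_bytenr = bgs[i].bytenr;
			continue;
		}
		/* Delete the old item, move the marker, add the new item. */
		st->last_converted_bg_bytenr = bgs[i].bytenr;
		bgs[i].in_bg_tree = true;
		if (++converted % batch == 0)
			st->commits++;
	}

	/* The final commit clears the flag and sets the compat ro bit. */
	st->changing_bg_tree = false;
	st->commits++;
}
```

Walking in descending order keeps the invariant simple: at any commit
point, every bg at or above last_converted_bg_bytenr is in the new tree
and every bg below it is still in the old one.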
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
When creating btrfs with the new v2 cache (the default behavior),
mkfs.btrfs always creates the free space tree using bitmaps.
It's fine for small fs, but will be a disaster if the device is large
and the data profile is something like RAID0:
$ mkfs.btrfs -f -m raid1 -d raid0 /dev/test/scratch[1234]
btrfs-progs v5.17
[...]
Block group profiles:
Data: RAID0 4.00GiB
Metadata: RAID1 256.00MiB
System: RAID1 8.00MiB
[..]
$ btrfs ins dump-tree -t free-space /dev/test/scratch1
btrfs-progs v5.17
free space tree key (FREE_SPACE_TREE ROOT_ITEM 0)
node 30441472 level 1 items 10 free space 483 generation 6 owner FREE_SPACE_TREE
node 30441472 flags 0x1(WRITTEN) backref revision 1
fs uuid deddccae-afd0-4160-9a12-48fe7b526fb1
chunk uuid 68f6cf98-afe3-4f47-9797-37fd9c610219
key (1048576 FREE_SPACE_INFO 4194304) block 30457856 gen 6
key (475004928 FREE_SPACE_BITMAP 8388608) block 30703616 gen 5
key (953155584 FREE_SPACE_BITMAP 8388608) block 30720000 gen 5
key (1431306240 FREE_SPACE_BITMAP 8388608) block 30736384 gen 5
key (1909456896 FREE_SPACE_BITMAP 8388608) block 30752768 gen 5
key (2387607552 FREE_SPACE_BITMAP 8388608) block 30769152 gen 5
key (2865758208 FREE_SPACE_BITMAP 8388608) block 30785536 gen 5
key (3343908864 FREE_SPACE_BITMAP 8388608) block 30801920 gen 5
key (3822059520 FREE_SPACE_BITMAP 8388608) block 30818304 gen 5
key (4300210176 FREE_SPACE_BITMAP 8388608) block 30834688 gen 5
[...]
^^^ So many bitmaps that an empty fs will have two levels for free
space tree already
[CAUSE]
Member btrfs_block_group::bitmap_high_thresh is never properly set to
any value other than 0, thus in function
update_free_space_extent_count(), the following check is always true:
if (!(flags & BTRFS_FREE_SPACE_USING_BITMAPS) &&
extent_count > block_group->bitmap_high_thresh) {
ret = convert_free_space_to_bitmaps(trans, block_group, path);
Thus the free space tree is always converted to bitmaps.
[FIX]
Cross-port the function set_free_space_tree_thresholds() from kernel,
and call that function in btrfs_make_block_group() and
read_one_block_group() so that every block group has bitmap_high_thresh
properly set.
Now even for that 4GiB large data chunk, we still only have one free extent:
btrfs-progs v5.17
free space tree key (FREE_SPACE_TREE ROOT_ITEM 0)
leaf 30572544 items 15 free space 15860 generation 6 owner FREE_SPACE_TREE
leaf 30572544 flags 0x1(WRITTEN) backref revision 1
fs uuid b24e52ea-6580-4a88-aa70-cb173090bfe3
chunk uuid d85f3905-fc61-4084-b335-2b6b97814b8e
[...]
item 13 key (298844160 FREE_SPACE_INFO 4294967296) itemoff 16235 itemsize 8
free space info extent count 1 flags 0
item 14 key (298844160 FREE_SPACE_EXTENT 4294967296) itemoff 16235 itemsize 0
free space extent
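To illustrate why a zero bitmap_high_thresh always triggers the
conversion, here is a minimal sketch of the threshold calculation. It is
modeled on the kernel's approach as I understand it and is not a copy of
set_free_space_tree_thresholds(): the struct, the helper names and the
constants (256 byte bitmaps, 25 byte item headers) are assumptions made
for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed values, modeled on the kernel's free space tree definitions. */
#define FREE_SPACE_BITMAP_SIZE	256U	/* bytes of bitmap per item */
#define FREE_SPACE_BITMAP_BITS	(FREE_SPACE_BITMAP_SIZE * 8U)
#define ITEM_HEADER_SIZE	25U	/* assumed sizeof(struct btrfs_item) */

/* Hypothetical container for the two per block group thresholds. */
struct bg_thresholds {
	uint32_t bitmap_high_thresh;
	uint32_t bitmap_low_thresh;
};

/*
 * Switch to bitmaps only once the space needed for extent items would
 * exceed the space needed for bitmap items covering the block group.
 */
static struct bg_thresholds calc_thresholds(uint64_t bg_length,
					    uint32_t sectorsize)
{
	struct bg_thresholds t;

	assert(bg_length > 0 && sectorsize > 0);

	uint64_t bitmap_range = (uint64_t)sectorsize * FREE_SPACE_BITMAP_BITS;
	uint64_t num_bitmaps = (bg_length + bitmap_range - 1) / bitmap_range;
	uint64_t total_bitmap_size =
		num_bitmaps * (ITEM_HEADER_SIZE + FREE_SPACE_BITMAP_SIZE);

	t.bitmap_high_thresh = (uint32_t)(total_bitmap_size / ITEM_HEADER_SIZE);
	/* Keep a small buffer so the format does not flip back and forth. */
	t.bitmap_low_thresh = t.bitmap_high_thresh > 100 ?
			      t.bitmap_high_thresh - 100 : 0;
	return t;
}

/* Same shape as the check quoted in the [CAUSE] section. */
static bool should_convert_to_bitmaps(bool using_bitmaps, uint32_t extent_count,
				      uint32_t high_thresh)
{
	return !using_bitmaps && extent_count > high_thresh;
}
```

With a 4096 byte sector size, one bitmap covers 8388608 bytes, which
matches the FREE_SPACE_BITMAP key offsets in the first dump. A threshold
of 0 converts as soon as there is a single free extent, while the
computed threshold for a 4GiB block group leaves that single extent alone.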
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We will now be using block_group->chunk_objectid to point at the global
root id for this particular block group. For now we'll assign this
based on mod'ing the offset of the block group against the number of
global root id's and handle the block_group_item updating appropriately.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This adds the ability to load the block group root, as well as make sure
the various backup super block and super block updates are made
appropriately.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Now that all callers are using the _nr variations we can simply rename
these helpers to btrfs_item_##member/btrfs_set_item_##member and change
the actual item SETGET funcs to raw_item_##member/set_raw_item_##member
and then change all callers to drop the _nr part.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When we switch to multiple global trees we'll need to access the
appropriate extent root depending on the block group or possibly root.
To handle this, use a helper in most places and then the actual root in
places where it is required. We will whittle down the direct accessors
with future patches, but this does the bulk of the preparatory work.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We have this helper sitting in extent-tree.c, but it's a repair
function. I'm going to need to make changes to this for extent-tree-v2
and would rather this live outside of the code we need to share with the
kernel.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This is doing the same work as insert_block_group_item, rework it to
call the helper instead.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In function btrfs_reserve_extent(), we call find_free_extent() passing
"u64 profile" into "int data".
This is definitely a width reduction, but looking further into the
code, it's more serious than that: the "int data" parameter does not
indicate whether it's a data extent at all, it is really a block group
profile (with block group type).
So this is not only a width reduction, but also confusing.
Thankfully, so far we don't have any BLOCK_GROUP bits beyond 32 bits, so
the width reduction is not causing a real problem.
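The effect of the narrowing can be shown with a small sketch. The flag
above bit 31 is purely hypothetical (no such BLOCK_GROUP bit exists), the
helper names are made up, and the sketch narrows to uint32_t instead of
int so that the truncation is well defined in C.

```c
#include <assert.h>
#include <stdint.h>

/* Example values: a low type bit and a hypothetical bit above bit 31. */
#define EXAMPLE_BG_DATA		(1ULL << 0)
#define EXAMPLE_BG_HIGH_FLAG	(1ULL << 36)	/* does not exist in btrfs */

/* Old-style callee: a 32-bit parameter silently drops the upper half. */
static uint64_t profile_seen_through_32bit(uint64_t profile)
{
	uint32_t narrowed = (uint32_t)profile;

	return narrowed;
}

/* New-style callee: the full u64 profile survives the call. */
static uint64_t profile_seen_through_64bit(uint64_t profile)
{
	return profile;
}
```

Passing EXAMPLE_BG_DATA | EXAMPLE_BG_HIGH_FLAG through the 32-bit variant
returns only EXAMPLE_BG_DATA, which is the kind of silent loss the rename
guards against.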
This patch will rename the "int data" parameter to a more proper one,
"u64 profile" in all involved call paths.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There's an ancient macro btrfs_crc32c which is just wrapping crc32c and
not doing anything else, so we can use the crc helper directly.
Signed-off-by: David Sterba <dsterba@suse.com>
This function lies in the kernel-shared directory and is supposed to be
close to 1:1 copy with its kernel counterpart, yet it takes one extra
argument - root. But this is now unused, so simply remove it.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Dave reported a failure of mkfs-test 009 with the free space tree
enabled by default. This is because 009 pre-populates the file system
with a given directory, and for some reason our data allocation path
isn't the same as in the kernel. Fix this by making sure when we
allocate a data extent we remove the space from the free space tree, and
with this our mkfs tests now pass.
Issue: #410
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This exists in the kernel free-space-tree.c but not in progs. We need
it to generate the free space items for new block groups, which is
needed when we start creating the free space tree in make_btrfs().
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Add new options to dump checksums in node headers and in the checksum
items:
$ btrfs inspect dump-tree --csum-headers image
root tree
leaf 471515136 items 19 free space 12186 generation 15 owner ROOT_TREE
leaf 471515136 flags 0x1(WRITTEN) backref revision 1 csum 0x756b2d54
fs uuid df0348df-5773-47dd-81e9-a18221461239
For nodes/leaves it's appended on the 2nd line of the header.
Checksum items are stored in leaves as EXTENT_CSUM key type, with the
key offset being the starting logical offset. As the raw array would be
hard to parse or match, each offset value is printed with its checksum.
For crc32c it's 4 values on a line, for xxhash it's 2 and for the long
256bit checksums it's one checksum per line.
$ btrfs inspect dump-tree --csum-items image
leaf 5423104 items 1 free space 30 generation 6 owner CSUM_TREE
leaf 5423104 flags 0x1(WRITTEN) backref revision 1
fs uuid bd7c981e-16ff-4081-a734-3ef5d50cafc1
chunk uuid 13f4c76c-7845-4984-88ed-f01b52e05cf8
item 0 key (EXTENT_CSUM EXTENT_CSUM 22020096) itemoff 55 itemsize 16228
range start 22020096 end 38637568 length 16617472
[22020096] 0x8941f998 [22024192] 0x8941f998 [22028288] 0x8941f998 [22032384] 0x8941f998
[22036480] 0x8941f998 [22040576] 0x8941f998 [22044672] 0x8941f998 [22048768] 0x8941f998
...
$ btrfs inspect dump-tree --csum-items image
leaf 5718016 items 1 free space 7746 generation 6 owner CSUM_TREE
leaf 5718016 flags 0x1(WRITTEN) backref revision 1
fs uuid f453a5b4-8b4a-4fbf-90a2-2925e4fe2335
chunk uuid eb1da63b-248b-44c2-82da-71b2564bf50e
item 0 key (EXTENT_CSUM EXTENT_CSUM 52387840) itemoff 7771 itemsize 8512
range start 52387840 end 53477376 length 1089536
[52387840] 0x686ede9288c391e7e05026e56f2f91bfd879987a040ea98445dabc76f55b8e5f
[52391936] 0x686ede9288c391e7e05026e56f2f91bfd879987a040ea98445dabc76f55b8e5f
...
The options are not on by default. The header checksum is not important
for the structures, and data checksums can be quite big, which would
make the dump long without any actual data to match against.
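The layout of a checksum item described above can be sketched as follows.
The helper names are hypothetical, and the 4096 byte sector size is an
assumption inferred from the spacing of the offsets in the dumps.

```c
#include <assert.h>
#include <stdint.h>

/* Checksums printed per line: 4 for crc32c, 2 for xxhash, else 1. */
static unsigned int csums_per_line(uint32_t csum_size)
{
	switch (csum_size) {
	case 4:
		return 4;
	case 8:
		return 2;
	default:
		return 1;
	}
}

/* Logical offset covered by the @index-th checksum in the item. */
static uint64_t csum_logical(uint64_t key_offset, uint32_t index,
			     uint32_t sectorsize)
{
	return key_offset + (uint64_t)index * sectorsize;
}

/* Length of the data range covered by one checksum item. */
static uint64_t csum_range_length(uint32_t item_size, uint32_t csum_size,
				  uint32_t sectorsize)
{
	assert(csum_size > 0);
	return (uint64_t)(item_size / csum_size) * sectorsize;
}
```

Checked against the dumps: an item of 16228 bytes with 4 byte checksums
covers 16617472 bytes, and an item of 8512 bytes with 32 byte checksums
covers 1089536 bytes, matching the printed "length" values.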
Signed-off-by: David Sterba <dsterba@suse.com>
When freeing a chunk, we can/should reset the underlying device zones
for the chunk. Introduce btrfs_reset_chunk_zones() and reset the zones.
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Implement a sequential extent allocator for zoned filesystems. This
allocator only needs to check if there is enough space in the block group
after the allocation pointer to satisfy the extent allocation request.
Since the allocator is really simple, we implement it directly in
find_search_start().
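A minimal sketch of such a sequential allocator, assuming a simplified
block group model: struct zoned_bg_model and zoned_alloc() are
hypothetical, and advancing the allocation pointer inside the same helper
is a simplification of this example, not a claim about
find_search_start().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of a block group on a zoned filesystem. */
struct zoned_bg_model {
	uint64_t start;		/* logical start of the block group */
	uint64_t length;	/* size of the block group in bytes */
	uint64_t alloc_offset;	/* allocation pointer, relative to start */
};

/*
 * Allocate @num_bytes right after the allocation pointer. Returns false
 * when the remaining space is too small; there is no search for holes.
 */
static bool zoned_alloc(struct zoned_bg_model *bg, uint64_t num_bytes,
			uint64_t *logical_ret)
{
	assert(bg->alloc_offset <= bg->length);

	if (num_bytes > bg->length - bg->alloc_offset)
		return false;

	*logical_ret = bg->start + bg->alloc_offset;
	bg->alloc_offset += num_bytes;
	return true;
}
```

Two consecutive 16KiB requests return back-to-back addresses, and a
request larger than the remaining space fails without touching the
pointer.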
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
A zoned filesystem must allocate blocks at the zones' write pointer. The
device's write pointer position can be mapped to a logical address
within a block group. To facilitate this, add an "alloc_offset" to the
block group to track the logical addresses of the write pointer.
This logical address is populated in btrfs_load_block_group_zone_info()
from the write pointers of corresponding zones.
For now, zoned filesystems only support the single profile. Supporting
non-single profiles with zone append writing is not trivial. For
example, in the DUP profile, we send a zone append write IO to two zones
on a device. The device replies with the written LBAs for the IOs. If
the offsets of the returned addresses from the beginning of their zones
differ, the two copies end up at different logical addresses.
We need a fine-grained logical to physical mapping to handle such
diverging physical addresses. Since that would require an additional
metadata type, disable non-single profiles for now.
This commit supports the case where all the zones in a block group are
sequential. The next patch will handle the case where a block group
contains a conventional zone.
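The mapping from a zone write pointer to a logical address, and the DUP
problem described above, can be sketched like this. All helper names are
hypothetical and all values are assumed to be byte addresses; the real
btrfs_load_block_group_zone_info() works on zone reports from the device.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Offset of the write pointer from the beginning of its zone. */
static uint64_t zone_alloc_offset(uint64_t zone_start, uint64_t write_pointer)
{
	assert(write_pointer >= zone_start);
	return write_pointer - zone_start;
}

/* Logical address that corresponds to @alloc_offset in a block group. */
static uint64_t alloc_logical(uint64_t bg_start, uint64_t alloc_offset)
{
	return bg_start + alloc_offset;
}

/*
 * DUP with zone append: both copies map to one logical address only if
 * the device wrote them at the same offset within their zones.
 */
static bool dup_copies_agree(uint64_t offset_in_zone_a,
			     uint64_t offset_in_zone_b)
{
	return offset_in_zone_a == offset_in_zone_b;
}
```

When the two zone offsets differ, a single alloc_offset per block group
cannot describe both copies, which is why the single profile is the only
one handled here.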
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>