The enumeration of profiles not available for zoned mode in
btrfs_load_block_group_zone_info was lacking the 3 and 4 copy raid1, add
them.
Signed-off-by: David Sterba <dsterba@suse.com>
Dave reported a failure of mkfs-test 009 with the free space tree
enabled by default. This is because 009 pre-populates the file system
with a given directory, and for some reason our data allocation path
isn't the same as in the kernel. Fix this by making sure when we
allocate a data extent we remove the space from the free space tree, and
with this our mkfs tests now pass.
Issue: #410
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Enabling quota in zoned mored hits the following assertion:
$ mkfs.btrfs -f -d single -m single -R quota /dev/nullb0
btrfs-progs v5.11
See http://btrfs.wiki.kernel.org for more information.
Zoned: /dev/nullb0: host-managed device detected, setting zoned feature
Resetting device zones /dev/nullb0 (1600 zones) ...
bad tree block 25395200, bytenr mismatch, want=25395200, have=0
kernel-shared/disk-io.c:549: write_tree_block: BUG_ON `1` triggered, value 1
./mkfs.btrfs(+0x26aaa)[0x564d1a7ccaaa]
./mkfs.btrfs(write_tree_block+0xb8)[0x564d1a7cee29]
./mkfs.btrfs(__commit_transaction+0x91)[0x564d1a7e3740]
./mkfs.btrfs(btrfs_commit_transaction+0x135)[0x564d1a7e39aa]
./mkfs.btrfs(main+0x1fe9)[0x564d1a7b442a]
/lib64/libc.so.6(__libc_start_main+0xcd)[0x7f36377d37fd]
./mkfs.btrfs(_start+0x2a)[0x564d1a7b1fda]
zsh: IOT instruction sudo ./mkfs.btrfs -f -d single -m single -R quota /dev/nullb0
The issue occurs because btrfs_create_root() is not formatting the root
node properly. This is fine in regular mode, because it's fortunately
reusing an once freed buffer. As the previous tree node allocation
kindly formatted the header, it will see the proper bytenr and pass the
checks.
However, we never reuse a once freed buffer on zoned filesystem. As a
result, we have zero-filled bytenr, FSID, and chunk-tree UUID, hitting
the asserts in check_tree_block().
Reported-by: Johannes Thumshirn <Johannes.Thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
max_zone_append_size is unused and can as well be removed just like we
did on the kernel side.
Keep one sanity check though, so we're not adding devices to a zoned FS
that aren't supporting zone append.
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We cannot zone reset a regular file with emulated zones. So, mkfs.btrfs
on such a file causes the following error.
ERROR: zoned: failed to reset device '/home/naota/tmp/btrfs.img' zones: Inappropriate ioctl for device
Introduce btrfs_zoned_device_info->emulated to distinguish the zones are
emulated or not. And, use it to decide it needs zone reset or not.
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
There is a report that, btrfstune can even work while the fs has transid
mismatch problems.
$ btrfstune -f -u /dev/sdb1
Current fsid: b2b5ae8d-4c49-45f0-b42e-46fe7dcfcb07
New fsid: b2b5ae8d-4c49-45f0-b42e-46fe7dcfcb07
Set superblock flag CHANGING_FSID
Change fsid in extents
parent transid verify failed on 792854528 wanted 20103 found 20091
parent transid verify failed on 792854528 wanted 20103 found 20091
parent transid verify failed on 792854528 wanted 20103 found 20091
Ignoring transid failure
parent transid verify failed on 792870912 wanted 20103 found 20091
parent transid verify failed on 792870912 wanted 20103 found 20091
parent transid verify failed on 792870912 wanted 20103 found 20091
Ignoring transid failure
parent transid verify failed on 792887296 wanted 20103 found 20091
parent transid verify failed on 792887296 wanted 20103 found 20091
parent transid verify failed on 792887296 wanted 20103 found 20091
Ignoring transid failure
ERROR: child eb corrupted: parent bytenr=38010880 item=69 parent level=1 child level=1
ERROR: failed to change UUID of metadata: -5
ERROR: btrfstune failed
This leaves a corrupted fs even more corrupted, and due to the extra
CHANGING_FSID flag, btrfs check will not even try to run on it:
Opening filesystem to check...
ERROR: Filesystem UUID change in progress
ERROR: cannot open file system
[CAUSE]
Unlike kernel, btrfs-progs has a less strict check on transid mismatch.
In read_tree_block() we will fall back to use the tree block even its
transid mismatch if we can't find any better copy.
However not all commands in btrfs-progs needs this feature, only
btrfs-check (which may fix the problem) and btrfs-restore (it just tries
to ignore any problems) really utilize this feature.
[FIX]
Introduce a new open ctree flag, OPEN_CTREE_ALLOW_TRANSID_MISMATCH, to
be explicit about whether we really want to ignore transid error.
Currently only btrfs-check and btrfs-restore will utilize this new flag.
Also add btrfs-image to allow opening such fs with transid error.
Link: https://www.reddit.com/r/btrfs/comments/pivpqk/failure_during_btrfstune_u/
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
A combination of new progs and old kernel may lead to problems with
detecting zone size by ioctl. Fixed by #376 but still incomplete because
old kernels may return EINVAL for unsupported ioctl. This should be
ENOTTY but hasn't been like that until kernel 5.11.
As we always pass valid arguments to the ioctl we can't conflate the two
and can EINVAL the same way as ENOTTY.
Issue: #399
Signed-off-by: David Sterba <dsterba@suse.com>
The header contains the protocol definitions and is almost exactly the
same as the kernel version, move it to the proper directory.
Signed-off-by: David Sterba <dsterba@suse.com>
The btrfs_check_node() has far less meaningful error message compared to
kernel counterpart, and it even lacks certain checks like level check.
Backport btrfs_check_node() to btrfs-progs to not only unify the code
but greatly improve the readability of the error messages.
Extra modification includes:
- No fs_info needed
As we don't need to output fsid.
- Remove unlikely() macro
- Extra BTRFS_TREE_BLOCK_* error type
- Btrfs-progs specific error handling
To record the corrupted tree blocks.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently btrfs_check_leaf() provides almost meaningless messages for
things like invalid item offset:
incorrect offsets 8492 3707786077
While kernel tree-checker is doing a way better job, so it's wise to
backport btrfs_check_leaf() from kernel.
There are some modification needed:
- New generic_err() helper
- Remove unlikely() macro
- Remove empty essential tree check
Mkfs still needs to create empty essential trees.
- Using BTRFS_TREE_BLOCK_* return value
Original mode check still relies on them to do certain repair.
- No need for btrfs_fs_info
We no longer need fsid output, thus no need for btrfs_fs_info.
- No item contents check
- Still using the fail: label for btrfs-progs specific error handling
The new output looks like:
corrupt leaf: root=2 block=72164753408 slot=109, unexpected item end, have 3707786077 expect 8492
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In kernel space we hardly use btrfs_disk_key, unless for very lowlevel
code.
There is no need to intentionally use btrfs_disk_key in btrfs-progs
either.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Add the GPL v2 header to files where it was missing and is not from an
external source, update to the most recent version with the address.
Signed-off-by: David Sterba <dsterba@suse.com>
I will have a lot of preparatory patches to reduce the review pain of
this large feature. In order to enable that work define the incompat
flag. Once all of the work lands to support the feature there will be a
patch to actually enable us to select it and manipulate file systems
with that incompat flag set.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This exists in the kernel free-space-tree.c but not in progs. We need
it to generate the free space items for new block groups, which is
needed when we start creating the free space tree in make_btrfs().
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Adding support for the per-block group roots means we will be reading
the roots directly in different places. Make sure we set ->track_dirty
and ->ref_cows properly in the helper so we don't have to do this
everywhere.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Kernel patch b2f78e88052bc0bee ("btrfs: allow degenerate raid0/raid10")
in
5.15 will allow mounting and converting to single device raid0 or two
device raid10. Let mkfs create such filesystem.
"The motivation is to allow to preserve the profile type as long as it
possible for some intermediate state (device removal, conversion), or
when there are disks of different size, with raid0 the otherwise
unusable space of the last device will be used too. Similarly for
raid10, though the two largest devices would need to be the same."
Signed-off-by: David Sterba <dsterba@suse.com>
Function btrfs_format_csum() is a special helper only used in
btrfs-progs.
Move it to common/utils.[ch] other than leaving it in
kernel-shared/disk-io.c.
Since we're moving the code, also introduce a macro,
BTRFS_CSUM_STRING_LEN, to replace open-coded string length calculation.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
By enabling the lowmem checks properly I uncovered the case where test
fsck/007 will infinite loop at the detection stage. This is because
when checking the inode item we will just btrfs_next_item(), and because
we ignore check tree block failures at read time we don't get an -EIO
from btrfs_next_leaf.
This occurs because we allow fsck to raw-read blocks even if they fail
basic sanity checks, because we want the opportunity to repair the
blocks. However this means corrupt blocks are sitting in cache marked
as uptodate. btrfs_search_slot() handles this by doing a check_block()
on every block we add to the path, so that anything that is doing a
search gets a proper -EIO.
btrfs_next_sibling_block() needs a similar check. With this fix we now
return -EIO on btrfs_next_leaf() properly and we no longer infinite loop
on fsck/007 with lowmem.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[HICCUP]
There is a bug report that mkfs.btrfs -R free-space-tree still makes
kernel to try to cleanup the v1 space cache:
# mkfs.btrfs -R free-space-tree -f /dev/test/scratch1
# mount /dev/test/scratch1 /mnt/btrfs
# dmesg | grep cleaning
BTRFS info (device dm-6): cleaning free space cache v1
[CAUSE]
By default, mkfs.btrfs will set super cache generation to (u64)-1, which
will inform kernel that the v1 space cache is invalid, needs to
regenerate it.
But for free space cache tree, kernel will set super cache generation to
0, to indicate v1 space cache is not in use.
This means, even we enabled free space tree with all the RO compatible
bits and new tree, as long as super cache generation is not 0, kernel
still consider the fs has some invalid v1 space cache, and will try to
remove them.
[FIX]
This is not a big deal, but to make the "-R free-space-tree" to really
work as kernel, we also need to set super cache generation to 0.
Reported-by: Chris Murphy <lists@colorremedies.com>
Link: https://lore.kernel.org/linux-btrfs/CAJCQCtSvgzyOnxtrqQZZirSycEHp+g0eDH5c+Kw9mW=PgxuXmw@mail.gmail.com/
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently v1 space cache clearing will delete one cache inode just in
one transaction, and then start a new transaction to delete the next
inode.
This is far from efficient and can make the already slow v1 space cache
deleting even slower, as large fs has tons of cache inodes to delete.
This patch will speed up the process by batching up to 16 inode deletion
into one transaction.
A quick benchmark of deleting 702 v1 space cache inodes would look like
this:
Unpatched: 4.898s
Patched: 0.087s
Which is obviously a big win.
Reported-by: Joshua <joshua@mailmag.net>
Link: https://lore.kernel.org/linux-btrfs/0b4cf70fc883e28c97d893a3b2f81b11@mailmag.net/
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In btrfs_sb_io(), blk_zone_report is used for getting information about
zones. But it is not freed if code goes in usual path. This patch frees
the variable just after it used.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: Sidong Yang <realwakka@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Add new options to dumps checksums in node headers and in the checksum
items:
$ btrfs inspect dump-tree --csum-headers image
root tree
leaf 471515136 items 19 free space 12186 generation 15 owner ROOT_TREE
leaf 471515136 flags 0x1(WRITTEN) backref revision 1 csum 0x756b2d54
fs uuid df0348df-5773-47dd-81e9-a18221461239
For nodes/leaves it's appended on the 2nd line of the header.
Checksum items are stored in leaves as EXTENT_CSUM key type, with offset
value as the logical offset starting. As the array would be hard to
parse or match, each offset value is printed with the checksum. For
crc32c it's 4 values on a line, for xxhash it's 2 and for the long
256bit checksums it's one checksum per line.
$ btrfs inspect dump-tree --csum-items image
leaf 5423104 items 1 free space 30 generation 6 owner CSUM_TREE
leaf 5423104 flags 0x1(WRITTEN) backref revision 1
fs uuid bd7c981e-16ff-4081-a734-3ef5d50cafc1
chunk uuid 13f4c76c-7845-4984-88ed-f01b52e05cf8
item 0 key (EXTENT_CSUM EXTENT_CSUM 22020096) itemoff 55 itemsize 16228
range start 22020096 end 38637568 length 16617472
[22020096] 0x8941f998 [22024192] 0x8941f998 [22028288] 0x8941f998 [22032384] 0x8941f998
[22036480] 0x8941f998 [22040576] 0x8941f998 [22044672] 0x8941f998 [22048768] 0x8941f998
...
$ btrfs inspect dump-tree --csum-items image
leaf 5718016 items 1 free space 7746 generation 6 owner CSUM_TREE
leaf 5718016 flags 0x1(WRITTEN) backref revision 1
fs uuid f453a5b4-8b4a-4fbf-90a2-2925e4fe2335
chunk uuid eb1da63b-248b-44c2-82da-71b2564bf50e
item 0 key (EXTENT_CSUM EXTENT_CSUM 52387840) itemoff 7771 itemsize 8512
range start 52387840 end 53477376 length 1089536
[52387840] 0x686ede9288c391e7e05026e56f2f91bfd879987a040ea98445dabc76f55b8e5f
[52391936] 0x686ede9288c391e7e05026e56f2f91bfd879987a040ea98445dabc76f55b8e5f
...
The options are not on by default, the header checksum is not important
for the structures. Data checksums can be quite big so that would make
the dump long and without any actual data to match against.
Signed-off-by: David Sterba <dsterba@suse.com>
Replace follow and traverse by one parameter that takes bits to affect
the behaviour. This allows to extend btrfs_print_tree output with more
modes from one place.
Signed-off-by: David Sterba <dsterba@suse.com>
There's a report that a system with 4.19 kernel fails boot because
device scan exits with error. This is because zoned support is compiled
in btrfs-progs but not in kernel.
To make new progs and old kernels work, do a fallback when the zoned
ioctl is not available, as if it were a non-zoned device. There is no
other option, but this is safe at least for the device scan that would
not error out. Any unaligned writes to a zoned device will fail as
expected.
Issue: #376
Signed-off-by: David Sterba <dsterba@suse.com>
Commit 8ef9313cf2 ("btrfs-progs: zoned: implement log-structured
superblock") changed to write BTRFS_SUPER_INFO_SIZE bytes to device.
The before num of bytes to be written is sectorsize.
It causes mkfs.btrfs failed on my 16k pagesize kvm:
$ /usr/bin/mkfs.btrfs -s 16k -f -mraid0 /dev/vdb2 /dev/vdb3
btrfs-progs v5.12
See http://btrfs.wiki.kernel.org for more information.
ERROR: superblock magic doesn't match
ERROR: superblock magic doesn't match
common/device-scan.c:195: btrfs_add_to_fsid: BUG_ON `ret != sectorsize`
triggered, value 1
/usr/bin/mkfs.btrfs(btrfs_add_to_fsid+0x274)[0xaaab4fe8a5fc]
/usr/bin/mkfs.btrfs(main+0x1188)[0xaaab4fe4dc8c]
/usr/lib/libc.so.6(__libc_start_main+0xe8)[0xffff7223c538]
/usr/bin/mkfs.btrfs(+0xc558)[0xaaab4fe4c558]
[1] 225842 abort (core dumped) /usr/bin/mkfs.btrfs -s 16k -f -mraid0
/dev/vdb2 /dev/vdb3
btrfs_add_to_fsid() now always calls sbwrite() to write
BTRFS_SUPER_INFO_SIZE bytes to device, so change condition of
the BUG_ON().
Also add comments for sbread() and sbwrite().
Signed-off-by: Su Yue <l@damenly.su>
Signed-off-by: David Sterba <dsterba@suse.com>
The check condition (csum_result == 0) does not make sense anymore as
it's not the buffer and not the crc32c result as it used to be. The
message does not bring any value and looks like it's some debugging aid
from the old times (added in 2008 as bb7055ec21 ("Add some extra
debugging around file data checksum failures")).
Signed-off-by: David Sterba <dsterba@suse.com>
Move the file to common as it's used by several parts, while still
keeping the name 'repair' although the only thing it does is adding a
corrupted extent.
Signed-off-by: David Sterba <dsterba@suse.com>
Decrease dependency on system headers, remove where they're not needed
or became stale after code moved. The path-utils.h encapsulate path
operations so include linux/limits.h here, that's where PATH_MAX is
defined.
Signed-off-by: David Sterba <dsterba@suse.com>
The newly added zoned mode constants can utilize the const ilog2
version. Copy it from kernel include/linux/log2.h.
Signed-off-by: David Sterba <dsterba@suse.com>
mkfs.btrfs uses a temporary superblock during the initialization process.
The temporary superblock uses BTRFS_MAGIC_TEMPORARY as its magic which is
different from a regular superblock. As a result, libblkid, which only
supports the usual magic, cannot recognize the volume as btrfs. So, let's
wipe the temporary magic before writing out the usual superblock.
Technically, we can add the temporary magic to the libblkid's table. But,
it will result in recognizing a half-baked filesystem as btrfs, which is
not ideal.
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If we zero out a region in a sequential write required zone, we cannot
write to the region until we reset the zone. Thus, we must prohibit zeroing
out to a sequential write required zone.
zero_dev_clamped() is modified to take the zone information and it calls
zero_zone_blocks() if the device is host managed to avoid writing to
sequential write required zones.
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
All zones of zoned block devices should be reset before writing. Support
this by introducing PREP_DEVICE_ZONED.
btrfs_reset_all_zones() walk all the zones on a device, and reset a zone if
it is sequential required zone, or discard the zone range otherwise.
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When freeing a chunk, we can/should reset the underlying device zones
for the chunk. Introduce btrfs_reset_chunk_zones() and reset the zones.
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Tree manipulating operations like merging nodes often release
once-allocated tree nodes. Btrfs cleans such nodes so that pages in the
node are not uselessly written out. On ZONED drives, however, such
optimization blocks the following IOs as the cancellation of the write
out of the freed blocks breaks the sequential write sequence expected by
the device.
Check if next dirty extent buffer is continuous to a previously written
one. If not, it redirty extent buffers between the previous one and the
next one, so that all dirty buffers are written sequentially.
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Conventional zones do not have a write pointer, so we cannot use it to
determine the allocation offset for sequential allocation if a block
group contains a conventional zone.
But instead, we can consider the end of the highest addressed extent in
the block group for the allocation offset.
For new block group, we cannot calculate the allocation offset by
consulting the extent tree, because it can cause deadlock by taking
extent buffer lock after chunk mutex, which is already taken in
btrfs_make_block_group(). Since it is a new block group anyways, we can
simply set the allocation offset to 0.
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Implement a sequential extent allocator for zoned filesystems. This
allocator only needs to check if there is enough space in the block group
after the allocation pointer to satisfy the extent allocation request.
Since the allocator is really simple, we implement it directly in
find_search_start().
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
A zoned filesystem must allocate blocks at the zones' write pointer. The
device's write pointer position can be mapped to a logical address
within a block group. To facilitate this, add an "alloc_offset" to the
block group to track the logical addresses of the write pointer.
This logical address is populated in btrfs_load_block_group_zone_info()
from the write pointers of corresponding zones.
For now, zoned filesystems the single profile. Supporting non-single
profile with zone append writing is not trivial. For example, in the DUP
profile, we send a zone append writing IO to two zones on a device. The
device reply with written LBAs for the IOs. If the offsets of the
returned addresses from the beginning of the zone are different, then it
results in different logical addresses.
We need fine-grained logical to physical mapping to support such
separated physical address issue. Since it should require additional
metadata type, disable non-single profiles for now.
This commit supports the case all the zones in a block group are
sequential. The next patch will handle the case having a conventional
zone.
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Implement a zoned chunk and device extent allocator. One device zone
becomes a device extent so that a zone reset affects only this device
extent and does not change the state of blocks in the neighbor device
extents.
To implement the allocator, we need to extend the following functions for
a zoned filesystem:
- init_alloc_chunk_ctl
- dev_extent_search_start
- dev_extent_hole_check
- decide_stripe_size
Here, dev_extent_hole_check() is newly introduced to check the validity of
a hole found.
init_alloc_chunk_ctl_zoned() is mostly the same as regular one. It always
set the stripe_size to the zone size and aligns the parameters to the zone
size.
dev_extent_search_start() only aligns the start offset to zone boundaries.
We don't care about the first 1MB like in regular filesystem because we
anyway reserve the first two zones for superblock logging.
dev_extent_hole_check_zoned() checks if zones in given hole are either
conventional or empty sequential zones. Also, it skips zones reserved for
superblock logging.
With the change to the hole, the new hole may now contain pending extents.
So, in this case, loop again to check that.
Finally, decide_stripe_size_zoned() should shrink the number of devices
instead of stripe size because we need to honor stripe_size == zone_size.
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>