Clarify that tree-stats can print inaccurate results or warnings when
the filesystem is mounted. Inspired by
https://bugzilla.kernel.org/show_bug.cgi?id=97481 .
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
When running mkfs tests on a newly rebooted minimal system, mkfs/009
can fail.
The reproduction steps require /tmp to contain very few files in the
first place.
# mkdir /tmp/rootdir
# xfs_io -f -c "pwrite 0 16k" /tmp/rootdir
# mkfs.btrfs --rootdir /tmp/rootdir -f $dev
# btrfs check $dev
Opening filesystem to check...
Checking filesystem on /dev/test/scratch1
UUID: 6821b3db-f056-4c18-b797-32679dcd4272
[1/7] checking root items
[2/7] checking extents
data backref 13631488 root 5 owner 170 offset 0 num_refs 0 not found in extent tree
incorrect local backref count on 13631488 root 5 owner 170 offset 0 found 1 wanted 0 back 0x55ff6cd72260
backref 13631488 root 5 not referenced back 0x55ff6cd4c1f0
incorrect global backref count on 13631488 found 2 wanted 1
backpointer mismatch on [13631488 16384]
ERROR: errors found in extent allocation tree or chunk allocation
[CAUSE]
The extent tree has the following weird item:
item 0 key (13631488 EXTENT_ITEM 16384) itemoff 16250 itemsize 33
refs 1 gen 0 flags DATA
tree block backref root FS_TREE
This is an extent item for data, thus it should not have an inline tree
backref.
Then checking the fs tree:
item 0 key (170 INODE_ITEM 0) itemoff 16123 itemsize 160
generation 7 transid 0 size 16384 nbytes 16384
block group 0 mode 100600 links 1 uid 1000 gid 1000 rdev 0
sequence 0 flags 0x0(none)
atime 1664866393.0 (2022-10-04 14:53:13)
ctime 1664863510.0 (2022-10-04 14:05:10)
mtime 1664863455.0 (2022-10-04 14:04:15)
otime 0.0 (1970-01-01 08:00:00)
There is an inode item before the root dir inode.
And that inode number 170 is causing the problem.
In traverse_directory(), we use the inode number reported from stat()
directly as btrfs inode number, and pass it to
btrfs_record_file_extent(), which finally calls btrfs_inc_extent_ref(),
with above 170 passed as @owner parameter.
But inside btrfs_inc_extent_ref() we use that @owner value to determine
if it's a data backref.
Since the value is smaller than BTRFS_FIRST_FREE_OBJECTID, btrfs treats
it as a tree block, which causes the above problem.
[FIX]
As a quick fix, always add BTRFS_FIRST_FREE_OBJECTID to all inode numbers
directly grabbed from stat().
And add an ASSERT() in __btrfs_record_file_extent() to catch unexpected
objectid.
This is not a perfect solution, as the resulting fs will have a huge gap
in its inodes:
item 0 key (256 INODE_ITEM 0) itemoff 16123 itemsize 160
item 4 key (426 INODE_ITEM 0) itemoff 15883 itemsize 160
For a proper fix, we should allocate new btrfs inode numbers in a
sequential order, but that would be another series of patches.
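The owner check and the quick fix can be sketched as follows. This is a
minimal illustration, not the btrfs-progs code: owner_is_data() and
stat_ino_to_btrfs_ino() are hypothetical helpers, and only the value of
BTRFS_FIRST_FREE_OBJECTID (256) comes from the on-disk format.

```c
#include <assert.h>
#include <stdint.h>

/* Value as defined by the btrfs on-disk format. */
#define BTRFS_FIRST_FREE_OBJECTID 256ULL

/*
 * Hypothetical helper modelling the check described above: an @owner
 * below BTRFS_FIRST_FREE_OBJECTID is taken as a tree block backref,
 * anything else as a data backref.
 */
static int owner_is_data(uint64_t owner)
{
	return owner >= BTRFS_FIRST_FREE_OBJECTID;
}

/*
 * Hypothetical helper modelling the quick fix: shift the inode number
 * reported by stat() into the range used for regular btrfs inodes.
 */
static uint64_t stat_ino_to_btrfs_ino(uint64_t st_ino)
{
	uint64_t objectid = st_ino + BTRFS_FIRST_FREE_OBJECTID;

	/* Same idea as the ASSERT() added to __btrfs_record_file_extent(). */
	assert(objectid >= BTRFS_FIRST_FREE_OBJECTID);
	return objectid;
}
```

With the inode number from the report, owner_is_data(170) is false, while
stat_ino_to_btrfs_ino(170) yields 426, which is classified as data and
matches the second inode item shown above.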
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
When using the mkfs.btrfs --rootdir option, the data extents generated
will have 0 as their generation in the extent tree:
# mkdir /tmp/rootdir
# xfs_io -f -c "pwrite 0 16k" /tmp/rootdir/foobar
# mkfs.btrfs -f --rootdir /tmp/rootdir $dev
# btrfs ins dump-tree -t extent $dev
btrfs-progs v5.19.1
extent tree key (EXTENT_TREE ROOT_ITEM 0)
leaf 30474240 items 13 free space 15536 generation 7 owner EXTENT_TREE
leaf 30474240 flags 0x1(WRITTEN) backref revision 1
fs uuid c1f05988-49f9-4dd4-8489-b90d60f522ee
chunk uuid 40f81603-fe75-4f58-aa9e-e74e28df8523
item 0 key (13631488 EXTENT_ITEM 16384) itemoff 16230 itemsize 53
refs 1 gen 0 flags DATA <<< Generation is 0
...
[CAUSE]
In __btrfs_record_file_extent() we just set the extent generation to 0.
[FIX]
Use trans->transid to properly fill extent generation.
Now after mkfs, the first data extent backref looks like this:
item 0 key (13631488 EXTENT_ITEM 16384) itemoff 16230 itemsize 53
refs 1 gen 7 flags DATA
...
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There's a group of helpers to read device size, the btrfs_device_size
should be one of them. Rename it and do minor cleanup.
Signed-off-by: David Sterba <dsterba@suse.com>
Switch the remaining uses of assert() to ASSERT(), as assert() lacks the
verbose output that we have for ASSERT (but otherwise is equivalent).
Signed-off-by: David Sterba <dsterba@suse.com>
There are cases where the BUG_ON should be replaced by error
handling as it's validating the data from the source filesystem or
the possibility to convert. The unconverted cases are asserts and will
be replaced later.
Signed-off-by: David Sterba <dsterba@suse.com>
Replace BUG_ON after transaction start failures, all the functions
already handle errors and return them to the caller. The other error
handling is for impossible conditions.
Signed-off-by: David Sterba <dsterba@suse.com>
There are several generic errors that repeat the same message. Define a
template for such messages, with optional text.
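One possible shape for such a template helper is sketched below. All
names here (toy_error, toy_templates, toy_error_msg) and the message
strings are hypothetical and only illustrate the idea of a fixed message
with optional extra text; they are not the interface added by this patch.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical set of generic errors that share a fixed message. */
enum toy_error {
	TOY_ERROR_MEMORY,
	TOY_ERROR_START_TRANS,
	TOY_ERROR_COUNT
};

/* Hypothetical message templates, indexed by enum toy_error. */
static const char *const toy_templates[TOY_ERROR_COUNT] = {
	"not enough memory",
	"failed to start transaction",
};

/*
 * Format the template for @err into @buf and append @extra when it is
 * non-empty. Returns the snprintf() result.
 */
static int toy_error_msg(char *buf, size_t size, enum toy_error err,
			 const char *extra)
{
	assert(err < TOY_ERROR_COUNT);
	if (extra && *extra)
		return snprintf(buf, size, "ERROR: %s: %s",
				toy_templates[err], extra);
	return snprintf(buf, size, "ERROR: %s", toy_templates[err]);
}
```

Callers then pass only the error identifier and, where it helps, a short
context string instead of repeating the whole message.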
Signed-off-by: David Sterba <dsterba@suse.com>
The leafsize has never been different from nodesize and since 4.0 (2015)
it's been an alias for nodesize. This should be enough time for everybody
to update so the support is removed.
Signed-off-by: David Sterba <dsterba@suse.com>
The meaning of the -b/--byte-count option is different than what the
help text says. Historically it was used to set the filesystem size but
with multiple devices it sets the size on each device:
$ mkfs.btrfs /dev/sdx[1234]
...
Number of devices: 4
Devices:
ID SIZE PATH
1 2.00GiB /dev/sdx1
2 2.00GiB /dev/sdx2
3 2.00GiB /dev/sdx3
4 2.00GiB /dev/sdx4
And when set to 1G:
$ mkfs.btrfs -b 1G /dev/sdx[1234]
...
Number of devices: 4
Devices:
ID SIZE PATH
1 1.00GiB /dev/sdx1
2 1.00GiB /dev/sdx2
3 1.00GiB /dev/sdx3
4 1.00GiB /dev/sdx4
Signed-off-by: David Sterba <dsterba@suse.com>
Add more granularity to verbose levels and describe when they should be
used. Lots of pr_verbose still hardcode the value or compare level to
bconf.verbose but the individual messages have to be revisited
separately.
Signed-off-by: David Sterba <dsterba@suse.com>
Rename MUST_LOG. Use a prefix LOG_ so we can add more levels, and use it
where it was hardcoded as an argument to pr_verbose.
Signed-off-by: David Sterba <dsterba@suse.com>
The (unsigned long long) type casts can be dropped, printf understands
%llu and u64 and does not warn. In cases where the type is not u64 keep
the cast.
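A small sketch of both cases, assuming u64 is a typedef of unsigned long
long (an assumption made here for illustration). format_range() is a
hypothetical helper, not part of the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Assumption for this sketch: u64 is unsigned long long. */
typedef unsigned long long u64;

/*
 * Hypothetical helper: @bytenr is u64 and matches %llu without a cast,
 * @len is a size_t and therefore keeps the cast.
 */
static int format_range(char *buf, size_t size, u64 bytenr, size_t len)
{
	assert(buf != NULL);
	return snprintf(buf, size, "bytenr %llu len %llu",
			bytenr, (unsigned long long)len);
}
```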
Signed-off-by: David Sterba <dsterba@suse.com>
On a few occasions there's an internal report; make a common helper so
the prefix message is not necessary and the stack trace can be printed
if enabled.
Signed-off-by: David Sterba <dsterba@suse.com>
Add declarations for global fs_info and task context so they can be
accessed from any .c file once main.c is split. Add prefix "g_"
for the task.
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
If we emulate a write error during commit transaction, by setting the
block device read-only, then we can easily have the following crash
using "btrfs check --clear-space-cache v2":
Opening filesystem to check...
Checking filesystem on /dev/test/scratch1
UUID: 5945915b-37f1-4bfa-9f64-684b318b8f73
Clear free space cache v2
Error writing to device 1
kernel-shared/transaction.c:156: __commit_transaction: BUG_ON `ret` triggered, value 1
./btrfs(+0x570c9)[0x562ec894f0c9]
./btrfs(+0x57167)[0x562ec894f167]
./btrfs(__commit_transaction+0x13b)[0x562ec894f7f2]
./btrfs(btrfs_commit_transaction+0x214)[0x562ec894fa64]
./btrfs(btrfs_clear_free_space_tree+0x177)[0x562ec8941ae6]
./btrfs(+0xc8958)[0x562ec89c0958]
./btrfs(+0xc9d53)[0x562ec89c1d53]
./btrfs(+0x17ec7)[0x562ec890fec7]
./btrfs(main+0x12f)[0x562ec8910908]
/usr/lib/libc.so.6(+0x232d0)[0x7ff917ee82d0]
/usr/lib/libc.so.6(__libc_start_main+0x8a)[0x7ff917ee838a]
./btrfs(_start+0x25)[0x562ec890fdc5]
Aborted (core dumped)
[CAUSE]
The call trace has shown it's a BUG_ON(), and it's from
__commit_transaction(), which is writing tree blocks back.
[FIX]
The fix is pretty simple, just return error.
In fact we even have an error value check in btrfs_commit_transaction()
just after __commit_transaction() call (although not catching the return
value from it).
And since we're here, also call btrfs_abort_transaction() to prevent
newer transactions from being started.
Now we won't have a full crash:
Opening filesystem to check...
Checking filesystem on /dev/test/scratch1
UUID: 5945915b-37f1-4bfa-9f64-684b318b8f73
Clear free space cache v2
Error writing to device 1
ERROR: failed to write bytenr 30425088 length 16384: Operation not permitted
ERROR: failed to write tree block 30425088: Operation not permitted
ERROR: failed to clear free space cache v2: -1
extent buffer leak: start 30720000 len 16384
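The change in control flow can be modelled roughly as below. This is a
toy sketch, not the code in kernel-shared/transaction.c: toy_fs,
toy_write_tree_blocks(), toy_commit() and toy_start() are hypothetical,
and the error codes are assumptions chosen for the example.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for the filesystem/transaction state. */
struct toy_fs {
	int readonly_device;	/* emulates the read-only block device */
	int aborted;		/* set once a commit has failed */
};

/* Hypothetical write-back step that fails on a read-only device. */
static int toy_write_tree_blocks(const struct toy_fs *fs)
{
	return fs->readonly_device ? -EPERM : 0;
}

/* Instead of BUG_ON(ret), mark the fs aborted and return the error. */
static int toy_commit(struct toy_fs *fs)
{
	int ret;

	assert(fs != NULL);
	ret = toy_write_tree_blocks(fs);
	if (ret < 0) {
		fs->aborted = 1;
		return ret;
	}
	return 0;
}

/* After an abort no newer transaction may start (errno is an assumption). */
static int toy_start(const struct toy_fs *fs)
{
	return fs->aborted ? -EROFS : 0;
}
```

The caller sees a negative return value and can print an error and exit
cleanly, rather than the process aborting inside the commit path.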
Reported-by: Christoph Anton Mitterer <calestyo@scientia.org>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
When a transaction is aborted halfway, we can have extent buffers leaked,
and in that case, the same leaked extent buffer can be reported
multiple times:
ERROR: failed to clear free space cache v2: -1
extent buffer leak: start 30441472 len 16384
WARNING: dirty eb leak (aborted trans): start 30441472 len 16384
extent buffer leak: start 30720000 len 16384
extent buffer leak: start 30425088 len 16384
extent buffer leak: start 30425088 len 16384 << Duplicated
WARNING: dirty eb leak (aborted trans): start 30425088 len 16384
Note that the 30425088 line is reported twice (not counting the "dirty eb
leak" line).
[CAUSE]
When we detect a leaked eb, we call free_extent_buffer_nocache(), but
free_extent_buffer_nocache() can only remove the eb when its refs drop
to 0 after the decrement.
If the eb has refs 2, it will need two free_extent_buffer_nocache()
calls to remove it from the cache.
[FIX]
Just reset the eb->refs to 1 so that free_extent_buffer_nocache() can
remove it from cache for sure.
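A toy model of the reference counting shows why one call is not enough
when refs is 2. The struct and both functions are hypothetical
simplifications and do not reproduce the real extent buffer code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified extent buffer: a refcount and a cache flag. */
struct toy_eb {
	int refs;
	int cached;
};

/* Models the behaviour described above: evict only when refs hits 0. */
static void toy_free_nocache(struct toy_eb *eb)
{
	assert(eb != NULL && eb->refs > 0);
	eb->refs--;
	if (eb->refs == 0)
		eb->cached = 0;
}

/* Models the fix: force refs to 1 so a single call always evicts. */
static void toy_drop_leaked(struct toy_eb *eb)
{
	eb->refs = 1;
	toy_free_nocache(eb);
}
```

With refs set to 2, toy_free_nocache() leaves the buffer cached, so the
leak would be found and reported again, while toy_drop_leaked() removes
it in one step.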
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The function was introduced by commit a5ce5d2198 ("btrfs-progs:
extent-cache: actually cache extent buffers") but never got utilized.
Thus we can just remove it.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The RST format provides a cross-reference feature that lets users
navigate between manual pages with a click. This patch was generated by
a macro that replaces the old references with the doc role in RST format.
Issue: #495
Signed-off-by: Sidong Yang <realwakka@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The logic at the beginning of this function to handle reserved ranges
was pretty complex and hard to follow. By refactoring it to use the
existing intersect_with_reserved() function, we can remove most of the
comparisons and boolean operators while preserving the exact same logic.
This change is only for readability. It does not change the logic itself
at all.
Author: Thomas Hebb <tommyhebb@gmail.com>
Pull-request: #494
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We currently open code a similar operation in create_image_file_range().
By exposing intersect_with_reserved() outside of source-fs.c and
slightly changing its semantics to return the entire range instead of
just the end address, we can reuse it in create_image_file_range().
Author: Thomas Hebb <tommyhebb@gmail.com>
Pull-request: #494
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When checking if the requested range starts in a valid region but later
hits a reserved range, we require the reserved range to end before the
requested one does.
This is incorrect. Since we're going to truncate the requested range
anyway, we want this check to pass even if the requested range ends
partway through a reserved range.
Fix the issue by checking against the reserved range's start address
instead of its end.
Luckily, I don't believe this bug makes a difference in the current code
path, since the range we pass to this function never ends before the end
of the filesystem.
Issue: #297
Issue: #349
Author: Thomas Hebb <tommyhebb@gmail.com>
Pull-request: #494
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
intersect_with_reserved() currently succeeds if (bytenr + num_bytes) is
greater than or equal to the first address in the range, assuming that
bytenr is also not past the end of the range.
This is wrong. (bytenr + num_bytes) is one byte past the last address in
the range we're checking, meaning that our range only overlaps the
reserved range if it's strictly greater than the reserved range's start
address.
For example, imagine a range at 0x3000 with length 0x1000 that we're
checking against a reserved range that starts at 0x4000. The addresses
in our range are 0x3000-0x3fff: it doesn't overlap. But the current
check, (0x3000 + 0x1000 >= 0x4000), will erroneously pass.
Fix the issue by changing >= to >.
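The difference between the two comparisons can be checked with a small
standalone sketch. The function names and the (start, length) parameters
for the reserved range are assumptions made for illustration; they model
the condition described above, not the exact code of
intersect_with_reserved().

```c
#include <assert.h>
#include <stdint.h>

/* Old condition as described above: uses >= on the exclusive end. */
static int overlaps_old(uint64_t bytenr, uint64_t num_bytes,
			uint64_t rstart, uint64_t rlen)
{
	return bytenr < rstart + rlen && bytenr + num_bytes >= rstart;
}

/*
 * Fixed condition: [bytenr, bytenr + num_bytes) overlaps
 * [rstart, rstart + rlen) only if the exclusive end is strictly
 * greater than the reserved range's start.
 */
static int overlaps_fixed(uint64_t bytenr, uint64_t num_bytes,
			  uint64_t rstart, uint64_t rlen)
{
	assert(num_bytes > 0 && rlen > 0);
	return bytenr < rstart + rlen && bytenr + num_bytes > rstart;
}
```

For the example above (0x3000 with length 0x1000 against a reserved range
starting at 0x4000), overlaps_old() reports an overlap and
overlaps_fixed() does not.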
Issue: #297
Issue: #349
Author: Thomas Hebb <tommyhebb@gmail.com>
Pull-request: #494
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This is currently defined in source-fs.h, but main.c uses it far more
than source-fs.c does. Put it in common.h instead, since it's a useful
standalone type.
Author: Thomas Hebb <tommyhebb@gmail.com>
Pull-request: #494
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Create a few emulated zoned devices and run mkfs, the zone reset is
expected to be run in parallel. It's using memory-backed devices so it's
too fast to measure the differences and we can't expect availability of
slow zoned devices so this test is very simplistic.
Signed-off-by: David Sterba <dsterba@suse.com>
I've written a simple shell wrapper for null_blk configuration
(https://github.com/kdave/nullb). Make a local copy of version 0.1 to
avoid external dependency for our tests.
Signed-off-by: David Sterba <dsterba@suse.com>
When devices are formatted as btrfs, btrfs_prepare_device is called
sequentially for each device, which takes too much time.
Put each btrfs_prepare_device call into a thread, wait for the first
thread to complete before creating the filesystem, and wait for the
other threads to complete before adding the other devices to the
filesystem.
During the preparation it's either trim/discard or zone reset.
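The threading scheme can be sketched with POSIX threads as below. This is
a simplified illustration under stated assumptions: toy_prepare_ctx,
toy_prepare_one() and toy_prepare_all() are hypothetical, the per-device
work is a placeholder, and the real patch may structure this differently.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define TOY_MAX_DEVICES 16

/* Hypothetical per-device context passed to each worker thread. */
struct toy_prepare_ctx {
	int devid;
	int ret;
};

/* Placeholder for the per-device work (trim/discard or zone reset). */
static void *toy_prepare_one(void *arg)
{
	struct toy_prepare_ctx *ctx = arg;

	assert(ctx != NULL && ctx->devid > 0);
	ctx->ret = 0;
	return NULL;
}

/* Start one thread per device, then join them in device order. */
static int toy_prepare_all(int count)
{
	pthread_t threads[TOY_MAX_DEVICES];
	struct toy_prepare_ctx ctx[TOY_MAX_DEVICES];
	int started = 0;
	int ret = 0;
	int i;

	if (count < 1 || count > TOY_MAX_DEVICES)
		return -1;

	for (i = 0; i < count; i++) {
		ctx[i].devid = i + 1;
		ctx[i].ret = -1;
		if (pthread_create(&threads[i], NULL, toy_prepare_one,
				   &ctx[i]) != 0) {
			ret = -1;
			break;
		}
		started++;
	}

	for (i = 0; i < started; i++) {
		pthread_join(threads[i], NULL);
		if (ctx[i].ret != 0 && ret == 0)
			ret = ctx[i].ret;
		/*
		 * After the join for i == 0 the filesystem would be created
		 * on the first device; the remaining joins finish before the
		 * other devices are added.
		 */
	}
	return ret;
}
```

Every thread that was started is always joined, even if creating a later
thread fails, so no worker outlives the contexts on the stack.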
This was tested with TCMU emulation with two zoned devices. Each device
is 20000G (about 19.53 TiB) and the zone size is 4MB. Use the following
parameters for targetcli:
create name=zbc0 size=20000G cfgstring=model-HM/zsize-4/conv-100@~/zbc0.raw
Call difftime to calculate the running time of the function
btrfs_prepare_device. Calculate the time from thread creation to
completion of all threads after patching:
$ lsscsi -p
[10:0:1:0] (0x14) LIO-ORG TCMU ZBC device 0002 /dev/sdb - none
[11:0:1:0] (0x14) LIO-ORG TCMU ZBC device 0002 /dev/sdc - none
$ sudo mkfs.btrfs -d single -m single -O zoned /dev/sdc /dev/sdb -f
....
time for prepare devices:4.000000.
....
$ sudo mkfs.btrfs -d single -m single -O zoned /dev/sdc /dev/sdb -f
...
time for prepare devices:2.000000.
...
Issue: #496
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Li Zhang <zhanglikernel@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The egrep command has been deprecated (per the manual page of grep) for a
long time and will probably be removed; the replacement is 'grep -E'.
Signed-off-by: David Sterba <dsterba@suse.com>
Process an enable_verity cmd by running the enable verity ioctl on the
file. Since enabling verity denies write access to the file, it is
important that we don't have any open write file descriptors.
This also revs the send stream format to version 3 with no format
changes besides the new commands and attributes. This version is not
finalized and commands may change, also this needs to be synchronized
with any kernel changes.
Note: the build is conditional on the header linux/fsverity.h
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
The block group tree doesn't yet have full bi-directional conversion
support from btrfstune, and it seems we may want one or two release
cycles to rule out some extra bugs before really releasing the progs
support.
This patch will hide the block group tree feature behind experimental
flag for the following tools:
- btrfstune
"-b" option to convert to bg tree.
- mkfs.btrfs
hide "block-group-tree" feature from both -O (the new default position
for all features) and -R (the old, soon to be deprecated one).
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The online manual pages of the btrfs utilities seem to have been moved to
`readthedocs.io`; update references in the README accordingly.
Author: Guillaume Legrand
Pull-request: #500
Signed-off-by: David Sterba <dsterba@suse.com>
swapon fails with an unclear error message, add some hints where to look
for more information.
Author: Torstein Eide
Pull-request: #491
Signed-off-by: David Sterba <dsterba@suse.com>
Mention the version that added cross-mount support, 5.18.
Author: AtticFinder65536
Pull-request: #480
Signed-off-by: David Sterba <dsterba@suse.com>
The radix-tree is not used in userspace code. In kernel it's for
tracking unpersisted and in-memory structures and has been replaced by
the xarray.
Signed-off-by: David Sterba <dsterba@suse.com>
The random-test exercises the b-tree operations but hasn't been in use
for a long time and we probably won't resurrect it. Also it's the only
user of the radix_tree structures, which are otherwise used in the kernel
code, and it needs the kernel-lib radix-tree implementation. Let's remove
it as it's basically dead code.
Signed-off-by: David Sterba <dsterba@suse.com>