Currently, clearing the v1 space cache deletes one cache inode per
transaction, then starts a new transaction to delete the next inode.
This is far from efficient and makes the already slow v1 space cache
deletion even slower, as a large filesystem has tons of cache inodes to
delete.
This patch speeds up the process by batching up to 16 inode deletions
into one transaction.
A quick benchmark of deleting 702 v1 space cache inodes would look like
this:
Unpatched: 4.898s
Patched: 0.087s
Which is obviously a big win.
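The effect of batching on the number of transactions can be pictured with
a minimal sketch. The names clear_stats and clear_cache_batched() are
hypothetical, and plain counters stand in for starting and committing a
real transaction; this is not the btrfs-progs implementation.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical counters standing in for real transactions and inode deletes. */
struct clear_stats {
	size_t transactions;
	size_t deleted;
};

/* Delete nr_inodes cache inodes, committing once per batch of at most `batch`. */
size_t clear_cache_batched(struct clear_stats *st, size_t nr_inodes, size_t batch)
{
	assert(st && batch > 0);

	while (nr_inodes > 0) {
		size_t n = nr_inodes < batch ? nr_inodes : batch;

		st->transactions++;	/* one start/commit pair covers n deletions */
		st->deleted += n;
		nr_inodes -= n;
	}
	return st->transactions;
}
```

For the 702 inodes of the benchmark, a batch of 16 needs 44 transactions
in this model instead of 702.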
Reported-by: Joshua <joshua@mailmag.net>
Link: https://lore.kernel.org/linux-btrfs/0b4cf70fc883e28c97d893a3b2f81b11@mailmag.net/
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Since v4.19, btrfs-progs has had full write support for the free space
tree, and the out-of-date warning in btrfs(5) has already confused some
end users.
Update the content to avoid further confusion.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Make the libbtrfsutil library and license more visible in the overview.
Drop link to travis-ci.org CI as it's not used anymore.
Signed-off-by: David Sterba <dsterba@suse.com>
In btrfs_sb_io(), blk_zone_report is used for getting information about
zones, but it is not freed when the code takes the usual (non-error)
path. This patch frees the variable right after it is used.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: Sidong Yang <realwakka@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
check_running_fs_exclop() can return 1 when exclop is changed to "none".
The ret is set by the return value of the select() operation. Checking
the exclusive op changes just the exclop variable while ret is still
set to 1.
Set ret = 0 if exclop is set to BTRFS_EXCL_NONE or BTRFS_EXCL_UNKNOWN.
Remove unnecessary continue statement at the end of the block.
Because of the stale return value, the command appears to have executed
but has not. This was found when balance, which typically reports the
chunks relocated, did not print anything on screen.
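The intended return value can be sketched as below. The enum is a
hypothetical stand-in for the BTRFS_EXCL_* values and exclop_wait_result()
is an illustrative name, not the function from the patch.

```c
#include <assert.h>

/* Hypothetical stand-ins for the BTRFS_EXCL_* values, for illustration only. */
enum exclop {
	EXCL_UNKNOWN = -1,
	EXCL_NONE = 0,
	EXCL_BALANCE = 1,
};

/*
 * select_ret is what select() returned (1 when the watched file changed).
 * Once no exclusive operation is running the caller may proceed, so the
 * stale 1 must not leak out as the result.
 */
int exclop_wait_result(int select_ret, enum exclop op)
{
	assert(select_ret >= -1);

	if (op == EXCL_NONE || op == EXCL_UNKNOWN)
		return 0;
	return select_ret;
}
```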
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The Used and Free should be together, while all the device information
is in the first section.
Example:
Overall:
Device size: 128.00GiB
Device allocated: 24.00GiB
Device unallocated: 104.00GiB
Device missing: 0.00B
Device zone unusable: 5.13MiB
Device zone size: 256.00MiB
Used: 213.33MiB
Free (estimated): 111.79GiB (min: 111.79GiB)
Free (statfs, df): 111.79GiB
Data ratio: 1.00
Metadata ratio: 1.00
Global reserve: 25.58MiB (used: 16.00KiB)
Multiple profiles: no
Signed-off-by: David Sterba <dsterba@suse.com>
Read the device zone size and print it in the overall overview in zoned
mode. The total unusable size is already there, so the zone size
complements it. It's read from the first device, assuming that the
kernel mandates that all devices have the same zone size.
Example:
Overall:
Device size: 128.00GiB
Device allocated: 24.00GiB
Device unallocated: 104.00GiB
Device missing: 0.00B
Used: 213.33MiB
Device zone unusable: 5.13MiB
Device zone size: 256.00MiB
Free (estimated): 111.79GiB (min: 111.79GiB)
Free (statfs, df): 111.79GiB
Data ratio: 1.00
Metadata ratio: 1.00
Global reserve: 25.58MiB (used: 16.00KiB)
Multiple profiles: no
Signed-off-by: David Sterba <dsterba@suse.com>
Sysfs hides the zone size of a block device in the queue/chunk_sectors
file, so add a helper that will read it for us when given the short
device name (that can be found in FSID/devices).
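A minimal sketch of such a helper follows. The name read_zone_size() and
the sysroot parameter (normally "/sys", passed in only to keep the sketch
testable) are assumptions, as is the /sys/block/<name> layout used for
whole devices; the value in chunk_sectors is in 512-byte sectors.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Read <sysroot>/block/<name>/queue/chunk_sectors and convert the value,
 * given in 512-byte sectors, to bytes. Returns 0 on success, -1 on error.
 */
int read_zone_size(const char *sysroot, const char *name, unsigned long long *bytes)
{
	char path[4096];
	unsigned long long sectors;
	FILE *f;
	int n;

	assert(sysroot && name && bytes);

	n = snprintf(path, sizeof(path), "%s/block/%s/queue/chunk_sectors",
		     sysroot, name);
	if (n < 0 || (size_t)n >= sizeof(path))
		return -1;
	f = fopen(path, "r");
	if (!f)
		return -1;
	n = fscanf(f, "%llu", &sectors);
	fclose(f);
	if (n != 1)
		return -1;
	*bytes = sectors << 9;	/* sectors * 512 */
	return 0;
}
```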
Signed-off-by: David Sterba <dsterba@suse.com>
There are several directories in /sys/fs/btrfs/FSID that contain more
than one file/directory. Add a helper to open the directory so that the
file descriptor can be used for fdopendir.
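The open-then-fdopendir pattern can be sketched as follows. The function
count_dir_entries() is a hypothetical example that takes any directory
path; it is not the helper added by the patch.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <dirent.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Count entries of the directory at path, skipping "." and "..". -1 on error. */
int count_dir_entries(const char *path)
{
	struct dirent *de;
	DIR *dir;
	int count = 0;
	int fd;

	assert(path);

	fd = open(path, O_RDONLY | O_DIRECTORY);
	if (fd < 0)
		return -1;
	dir = fdopendir(fd);	/* on success the DIR stream owns fd */
	if (!dir) {
		close(fd);
		return -1;
	}
	while ((de = readdir(dir)) != NULL) {
		if (strcmp(de->d_name, ".") && strcmp(de->d_name, ".."))
			count++;
	}
	closedir(dir);		/* also closes fd */
	return count;
}
```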
Signed-off-by: David Sterba <dsterba@suse.com>
If there's CONFIG_CRYPTO_SHA256=y in /proc/config.gz and no line with
'sha256' in /proc/modules, then the mount will use the generic
implementation.
After 'modprobe sha256' there's 'sha256_ssse3' in /proc/modules and the
sysfs checksum file would show e.g. 'sha256-avx2'.
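The /proc/modules check can be sketched on an in-memory copy of the file.
module_loaded() is a hypothetical helper, and matching the start of the
module name (rather than anywhere in the line) is an assumption of this
sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/*
 * Return true if any line of `modules` (text in /proc/modules format, one
 * module per line with the name as the first field) starts with `prefix`.
 */
bool module_loaded(const char *modules, const char *prefix)
{
	const char *line = modules;
	size_t len;

	assert(modules && prefix);
	len = strlen(prefix);

	while (*line) {
		const char *nl = strchr(line, '\n');

		if (strncmp(line, prefix, len) == 0)
			return true;
		if (!nl)
			break;
		line = nl + 1;
	}
	return false;
}
```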
Signed-off-by: David Sterba <dsterba@suse.com>
Print the number of stripes for striped profiles in device usage
commands. This makes the profiles easier to tell apart. The output looks
like this:
/dev/vdc, ID: 1
Device size: 1.00GiB
Device slack: 0.00B
Data,RAID0/2: 912.62MiB
Data,RAID0/3: 912.62MiB
Metadata,RAID1: 102.38MiB
System,RAID1: 8.00MiB
Unallocated: 1.00MiB
Multiple lines can appear in case a balance conversion process was
interrupted or if there's been a new device added and new data written
to the full stripe.
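The label format from the output above can be sketched as below.
format_chunk_label() is a hypothetical name, and using num_stripes == 0
to mean "not striped" is a convention of this sketch only.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Build the label shown in front of the size, e.g. "Data,RAID0/2".
 * num_stripes == 0 means the profile is not striped and no suffix is added.
 * Returns the snprintf() result, i.e. the length of the full label.
 */
int format_chunk_label(char *buf, size_t size, const char *type,
		       const char *profile, unsigned int num_stripes)
{
	assert(buf && size > 0 && type && profile);

	if (num_stripes > 0)
		return snprintf(buf, size, "%s,%s/%u", type, profile, num_stripes);
	return snprintf(buf, size, "%s,%s", type, profile);
}
```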
Issue: #372
Signed-off-by: Sidong Yang <realwakka@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The user transaction ioctls were removed in kernel 4.17 by commit
7a5a07a81062 ("btrfs: Remove userspace transaction ioctls"), so the
definitions are no longer relevant and can be removed.
The numbers could be reused in the future, eg. when there are no
maintained LTS kernels older than 4.19.
Signed-off-by: David Sterba <dsterba@suse.com>
There's another loop protection during scan of directory items. This can
fire under invalid conditions, ie. when there's no real endless loop.
The layout of b-tree items could trigger that and has been observed in
practice. This prevents automated restoration as it requires user
attention.
The number of loops is 1024, unjustified and without explanation. Errors
during traversing the leaves are checked so most errors would be caught.
A real loop in the directory items would require some crafting and would
not happen on a normal filesystem.
Issue: #59
Issue: #164
Issue: #237
Signed-off-by: David Sterba <dsterba@suse.com>
There's some kind of looping protection during copying file extents,
most likely to avoid endless loops on severely damaged filesystems.
This has been bothering users and makes restoring hard to automate as
it requires user attention to press 'y' or 'a'. This has not been well
documented either.
The number of loops is 1024, which looks arbitrary and hard to justify.
This means, e.g., that a file with many fragments hits the interactive
question more than once.
There are other checks when iterating the leaves that would catch
corruptions or other errors, so the looping would happen in some rare
and rather artificial case when some kind of loop exists inside the
extent items. This is not easily possible, if possible at all, as the
items do not directly reference each other.
In case there's some genuine error found that would require a looping
protection, we'll add it or extend the checks to identify the loop.
Issue: #59
Issue: #164
Issue: #237
Signed-off-by: David Sterba <dsterba@suse.com>
The test image is manually crafted with 1MiB offset in the device item
of devid 1.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There is a report from the mailing list about a filesystem whose device
item has a bytes_used mismatch.
This problem leaves the device item with some ghost bytes_used, meaning
even if we delete all device extents of that device, the bytes_used
still won't be 0.
This itself is not a big deal, but once the user has used up all the
unallocated space, the write-time tree-checker can be triggered and turn
the fs RO, as the new device::bytes_used can be larger than
device::total_bytes.
Thus we need to fix the problem in btrfs-check to avoid the above
write-time tree-checker warning.
This patch will add the ability to reset a device's bytes_used to both
original mode and lowmem mode.
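The repair can be pictured as recomputing bytes_used from the device
extents. This is a sketch under the assumption that the correct value is
the sum of the device extent lengths, which the "ghost bytes_used"
description implies; both function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sum of the device extent lengths, the value bytes_used is assumed to match. */
uint64_t expected_bytes_used(const uint64_t *extent_lens, size_t nr)
{
	uint64_t sum = 0;

	assert(extent_lens || nr == 0);
	for (size_t i = 0; i < nr; i++)
		sum += extent_lens[i];
	return sum;
}

/* Returns 1 if the stored value had to be reset, 0 if it already matched. */
int reset_bytes_used(uint64_t *stored, const uint64_t *extent_lens, size_t nr)
{
	uint64_t expected = expected_bytes_used(extent_lens, nr);

	assert(stored);
	if (*stored == expected)
		return 0;
	*stored = expected;
	return 1;
}
```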
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Add new options to dump checksums in node headers and in the checksum
items:
$ btrfs inspect dump-tree --csum-headers image
root tree
leaf 471515136 items 19 free space 12186 generation 15 owner ROOT_TREE
leaf 471515136 flags 0x1(WRITTEN) backref revision 1 csum 0x756b2d54
fs uuid df0348df-5773-47dd-81e9-a18221461239
For nodes/leaves it's appended on the 2nd line of the header.
Checksum items are stored in leaves as the EXTENT_CSUM key type, with
the key offset being the starting logical offset. As the raw array would
be hard to parse or match, each offset value is printed with its
checksum. For crc32c it's 4 values on a line, for xxhash it's 2 and for
the long 256bit checksums it's one checksum per line.
$ btrfs inspect dump-tree --csum-items image
leaf 5423104 items 1 free space 30 generation 6 owner CSUM_TREE
leaf 5423104 flags 0x1(WRITTEN) backref revision 1
fs uuid bd7c981e-16ff-4081-a734-3ef5d50cafc1
chunk uuid 13f4c76c-7845-4984-88ed-f01b52e05cf8
item 0 key (EXTENT_CSUM EXTENT_CSUM 22020096) itemoff 55 itemsize 16228
range start 22020096 end 38637568 length 16617472
[22020096] 0x8941f998 [22024192] 0x8941f998 [22028288] 0x8941f998 [22032384] 0x8941f998
[22036480] 0x8941f998 [22040576] 0x8941f998 [22044672] 0x8941f998 [22048768] 0x8941f998
...
$ btrfs inspect dump-tree --csum-items image
leaf 5718016 items 1 free space 7746 generation 6 owner CSUM_TREE
leaf 5718016 flags 0x1(WRITTEN) backref revision 1
fs uuid f453a5b4-8b4a-4fbf-90a2-2925e4fe2335
chunk uuid eb1da63b-248b-44c2-82da-71b2564bf50e
item 0 key (EXTENT_CSUM EXTENT_CSUM 52387840) itemoff 7771 itemsize 8512
range start 52387840 end 53477376 length 1089536
[52387840] 0x686ede9288c391e7e05026e56f2f91bfd879987a040ea98445dabc76f55b8e5f
[52391936] 0x686ede9288c391e7e05026e56f2f91bfd879987a040ea98445dabc76f55b8e5f
...
The options are not on by default, as the header checksum is not
important for the structures. Data checksums can be quite big, so that
would make the dump long and without any actual data to match against.
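The per-line layout and the "range ... length" values in the dumps above
can be cross-checked with a small sketch. Both function names are
hypothetical, and the 4096-byte sector size is an assumption read off the
distance between consecutive offsets in the examples.

```c
#include <assert.h>
#include <stdint.h>

/* Checksums printed per output line: 4 for 4-byte, 2 for 8-byte, else 1. */
unsigned int csums_per_line(unsigned int csum_size)
{
	assert(csum_size > 0);

	if (csum_size == 4)
		return 4;
	if (csum_size == 8)
		return 2;
	return 1;
}

/* Bytes of data covered by one EXTENT_CSUM item. */
uint64_t csum_item_length(uint32_t item_size, unsigned int csum_size,
			  uint32_t sectorsize)
{
	assert(csum_size > 0);
	return (uint64_t)(item_size / csum_size) * sectorsize;
}
```

With itemsize 16228 and 4-byte checksums this gives 16617472, and with
itemsize 8512 and 32-byte checksums it gives 1089536, matching the
lengths printed in the two examples.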
Signed-off-by: David Sterba <dsterba@suse.com>
Replace follow and traverse by one parameter that takes bits to affect
the behaviour. This allows extending the btrfs_print_tree output with
more modes from one place.
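A bit-flag parameter of this kind can be sketched as below. The bit names
and both helpers are hypothetical, since the message does not spell out
the actual definitions.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical bit names, for illustration only. */
enum print_mode {
	PRINT_FOLLOW	= 1 << 0,	/* descend into child nodes */
	PRINT_BFS	= 1 << 1,	/* breadth-first traversal */
	PRINT_DFS	= 1 << 2,	/* depth-first traversal */
};

/* BFS and DFS are mutually exclusive; other bits combine freely. */
bool print_mode_valid(unsigned int mode)
{
	return !((mode & PRINT_BFS) && (mode & PRINT_DFS));
}

bool print_mode_follows(unsigned int mode)
{
	assert(print_mode_valid(mode));
	return (mode & PRINT_FOLLOW) != 0;
}
```

A new output mode then only needs another bit, not another function
parameter.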
Signed-off-by: David Sterba <dsterba@suse.com>
There's a report that a system with 4.19 kernel fails boot because
device scan exits with error. This is because zoned support is compiled
in btrfs-progs but not in kernel.
To make new progs and old kernels work, do a fallback when the zoned
ioctl is not available, as if it were a non-zoned device. There is no
other option, but this is safe at least for the device scan that would
not error out. Any unaligned writes to a zoned device will fail as
expected.
Issue: #376
Signed-off-by: David Sterba <dsterba@suse.com>
Internally it's blake2b but for the user facing output or other command
line interfaces let's call it just BLAKE2.
Signed-off-by: David Sterba <dsterba@suse.com>
With an explicit width the default alignment is to the right; using the
space flag with these conversions is a GNU extension. Fix the following
warnings:
crypto/hash-speedtest.c: In function ‘main’:
crypto/hash-speedtest.c:152:15: warning: ' ' flag used with ‘%s’ gnu_printf format [-Wformat=]
152 | printf("% 12s: ", c->name);
| ^
crypto/hash-speedtest.c:172:21: warning: ' ' flag used with ‘%u’ gnu_printf format [-Wformat=]
172 | printf("%s: % 12llu, %s/i % 8llu",
| ^
crypto/hash-speedtest.c:172:34: warning: ' ' flag used with ‘%u’ gnu_printf format [-Wformat=]
172 | printf("%s: % 12llu, %s/i % 8llu",
| ^
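The portable form simply drops the space flag, as in this sketch. The
wrapper functions are hypothetical and only exist to make the formats
easy to check; the format strings are the ones from the warnings with the
flag removed.

```c
#include <assert.h>
#include <stdio.h>

/* Right-align name in a 12 character column using only standard flags. */
int format_name(char *buf, size_t size, const char *name)
{
	assert(buf && name);
	return snprintf(buf, size, "%12s: ", name);
}

/* Right-align a counter in an 8 character column. */
int format_count(char *buf, size_t size, unsigned long long count)
{
	assert(buf);
	return snprintf(buf, size, "%8llu", count);
}
```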
Signed-off-by: David Sterba <dsterba@suse.com>
Recognize special resize amount 'cancel' for resize operation. This
will request kernel to stop running any resize operation (most likely
shrinking resize). This needs support in kernel, otherwise this will
fail due to another exclusive operation running (though could be the
same one).
The command returns after kernel finishes any work that got interrupted,
but this should not take long in kernels 5.10+ that allow interruptible
relocation. The waiting inside kernel is interruptible so this command
(and the waiting stage) can be interrupted.
The resize operation could relocate block groups, but the nominal
filesystem size will be restored if the resize does not finish. It's
recommended to review the filesystem state.
Note: in kernels 5.10+ sending a fatal signal (TERM, KILL, Ctrl-C) to
the process running the resize will cancel it too.
Example:
$ btrfs fi resize -10G /mnt
...
$ btrfs fi resize cancel /mnt
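Recognizing the special amount can be sketched as a small classifier. The
enum and classify_resize_amount() are hypothetical, and the sketch
ignores an optional devid prefix and does not parse the size itself.

```c
#include <assert.h>
#include <string.h>

enum resize_kind {
	RESIZE_CANCEL,	/* the special amount "cancel" */
	RESIZE_GROW,	/* "+<size>" */
	RESIZE_SHRINK,	/* "-<size>" */
	RESIZE_SET,	/* anything else, e.g. an absolute size */
};

enum resize_kind classify_resize_amount(const char *amount)
{
	assert(amount);

	if (strcmp(amount, "cancel") == 0)
		return RESIZE_CANCEL;
	if (amount[0] == '+')
		return RESIZE_GROW;
	if (amount[0] == '-')
		return RESIZE_SHRINK;
	return RESIZE_SET;
}
```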
Signed-off-by: David Sterba <dsterba@suse.com>
Recognize special name 'cancel' for device deletion, that will request
kernel to stop running device deletion. This needs support in kernel,
otherwise this will fail due to another exclusive operation running
(though could be the same one).
The command returns after kernel finishes any work that got interrupted,
but this should not take long in kernels 5.10+ that allow interruptible
relocation. The waiting inside kernel is interruptible so this command
(and the waiting stage) can be interrupted.
The device size is restored when deletion does not finish but it's
recommended to review the filesystem state.
Note: in kernels 5.10+ sending a fatal signal (TERM, KILL, Ctrl-C) to
the process running the device deletion will cancel it too.
Example:
$ btrfs device delete /dev/sdx /mnt
...
$ btrfs device delete cancel /mnt
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs inspect-internal --help shows an incomplete sentence, as shown
below:
btrfs inspect-internal --help
<snip>
btrfs inspect-internal min-dev-size [options] <path>
Get the minimum size the device can be shrunk to. The
btrfs inspect-internal dump-tree [options] <device> [<device> ..]
<snip>
The short help string can be multi-line but must be in one string. This
patch fixes it.
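How a multi-line help text stays one string can be sketched with adjacent
string literals. The table layout is a hypothetical example, and the
second literal is a placeholder, not the real help text.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical usage table: entry 0 is the synopsis, entry 1 the short help. */
const char * const min_dev_size_usage[] = {
	"btrfs inspect-internal min-dev-size [options] <path>",
	/* No comma between the literals: the compiler joins them into one string. */
	"Get the minimum size the device can be shrunk to.\n"
	"<placeholder for the rest of the help text>",
	NULL
};

/* Count the entries in front of the NULL terminator. */
size_t usage_entries(const char * const *usage)
{
	size_t n = 0;

	assert(usage);
	while (usage[n])
		n++;
	return n;
}
```

A comma after the first literal would instead create a third array
entry, which is the kind of split that truncates the short help.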
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
User reported that test fsck-tests/037-freespacetree-repair fails:
# TEST=037\* ./fsck-tests.sh
[TEST/fsck] 037-freespacetree-repair
btrfs check should have detected corruption
test failed for case 037-freespacetree-repair
The test tries to corrupt FST, call btrfs check readonly then repair FST
using btrfs check. Above case failed at the second readonly check step.
Test log said "cache and super generation don't match, space cache will
be invalidated" which is printed by validate_free_space_cache().
If cache_generation of the superblock is not -1ULL,
validate_free_space_cache() requires that cache_generation be equal to
the superblock's generation. Otherwise, it skips the check of the space
cache (v1, v2), as in the above case where the sb cache_generation is 0.
Since kernel commit 948462294577 ("btrfs: keep sb cache_generation
consistent with space_cache"), sb cache_generation is set to 0 once
space cache v1 is disabled (nospace_cache/space_cache=v2). But support
for the 0 case was never added to progs check.
Fix it by adding a condition for sb cache_generation being 0 in
validate_free_space_cache(), as the 0 case is valid now since the
kernel commit mentioned above.
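The resulting condition can be sketched as a predicate. This is one
reading of the description above, with a hypothetical function name; it
is not the code from validate_free_space_cache().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Decide whether the space cache check should run: (u64)-1 and, since the
 * kernel change, 0 are both accepted; any other cache_generation has to
 * match the superblock generation.
 */
bool space_cache_checkable(uint64_t cache_generation, uint64_t generation)
{
	assert(generation > 0);

	if (cache_generation == UINT64_MAX || cache_generation == 0)
		return true;
	return cache_generation == generation;
}
```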
Issue: #338
Signed-off-by: Su Yue <l@damenly.su>
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
Commit 8ef9313cf2 ("btrfs-progs: zoned: implement log-structured
superblock") changed the superblock write to BTRFS_SUPER_INFO_SIZE
bytes. Previously the number of bytes written was sectorsize.
This makes mkfs.btrfs fail on my 16K page size KVM guest:
$ /usr/bin/mkfs.btrfs -s 16k -f -mraid0 /dev/vdb2 /dev/vdb3
btrfs-progs v5.12
See http://btrfs.wiki.kernel.org for more information.
ERROR: superblock magic doesn't match
ERROR: superblock magic doesn't match
common/device-scan.c:195: btrfs_add_to_fsid: BUG_ON `ret != sectorsize`
triggered, value 1
/usr/bin/mkfs.btrfs(btrfs_add_to_fsid+0x274)[0xaaab4fe8a5fc]
/usr/bin/mkfs.btrfs(main+0x1188)[0xaaab4fe4dc8c]
/usr/lib/libc.so.6(__libc_start_main+0xe8)[0xffff7223c538]
/usr/bin/mkfs.btrfs(+0xc558)[0xaaab4fe4c558]
[1] 225842 abort (core dumped) /usr/bin/mkfs.btrfs -s 16k -f -mraid0
/dev/vdb2 /dev/vdb3
btrfs_add_to_fsid() now always calls sbwrite() to write
BTRFS_SUPER_INFO_SIZE bytes to device, so change condition of
the BUG_ON().
Also add comments for sbread() and sbwrite().
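The difference between the old and new check can be sketched as below.
SUPER_INFO_SIZE mirrors BTRFS_SUPER_INFO_SIZE (4096), and both predicate
names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <sys/types.h>

#define SUPER_INFO_SIZE 4096	/* mirrors BTRFS_SUPER_INFO_SIZE */

/* New check: the write succeeded only if the whole superblock went out. */
bool sb_write_complete(ssize_t ret)
{
	return ret == SUPER_INFO_SIZE;
}

/* Old check: compared against sectorsize, which fails for 16K sectors. */
bool old_check_passes(ssize_t ret, unsigned int sectorsize)
{
	assert(sectorsize > 0);
	return ret == (ssize_t)sectorsize;
}
```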
Signed-off-by: Su Yue <l@damenly.su>
Signed-off-by: David Sterba <dsterba@suse.com>
The configure phase can detect the following:
- no /usr/include/blkzoned.h - no zoned support possible
- usable /usr/include/blkzoned.h - on by default
- present /usr/include/blkzoned.h but unusable due to missing struct
members or ioctl definition
The third case could be confusing, so document the requirements.
Issue: #370
Signed-off-by: David Sterba <dsterba@suse.com>