In lowmem mode, we check fs roots and free space cache by iterating
each root item and inode item, using btrfs_next_item() and a path
pointing to the root tree.
However in repair mode, check_fs_root() can modify the fs root, thus
CoWs the tree root, and the old path in check_fs
It could lead to strange behavior, e.g. after repairing a fs tree, the
path can point to a fs tree.
Since no ROOT_ITEM exists in fs tree, all remaining trees are skipped in
repair mode.
This bug has existed since the early days of lowmem mode repair, and is
only exposed by the recent free space inode check code. (Fs tree inodes are
passed to the free space inode check, causing false alerts and repair
failure.)
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
BTRFS_COMPAT_EXTENT_TREE_V0 was introduced for a short time in the kernel,
over 10 years ago.
Nowadays there should be no users of that feature, and the kernel removed
this support in June 2018. There is no need for btrfs-progs to support
it.
This patch removes the EXTENT_TREE_V0 related code and replaces those
BUG_ON()s with a more graceful error message.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
As the link reported, btrfs fi sh may crash while a device is being removed.
valgrind reported:
======================================================================
...
==883== Invalid write of size 8
==883== at 0x13C99A: get_device_info (in /usr/bin/btrfs)
==883== by 0x13D715: get_fs_info (in /usr/bin/btrfs)
==883== by 0x153B5F: ??? (in /usr/bin/btrfs)
==883== by 0x11B0C1: main (in /usr/bin/btrfs)
==883== Address 0x4d8c7a0 is 0 bytes after a block of size 12,288 alloc'd
==883== at 0x483877F: malloc (vg_replace_malloc.c:299)
==883== by 0x13D861: get_fs_info (in /usr/bin/btrfs)
==883== by 0x153B5F: ??? (in /usr/bin/btrfs)
==883== by 0x11B0C1: main (in /usr/bin/btrfs)
==883==
==883== Invalid write of size 8
==883== at 0x13C99D: get_device_info (in /usr/bin/btrfs)
==883== by 0x13D715: get_fs_info (in /usr/bin/btrfs)
==883== by 0x153B5F: ??? (in /usr/bin/btrfs)
==883== by 0x11B0C1: main (in /usr/bin/btrfs)
==883== Address 0x4d8c7a8 is 8 bytes after a block of size 12,288 alloc'd
==883== at 0x483877F: malloc (vg_replace_malloc.c:299)
==883== by 0x13D861: get_fs_info (in /usr/bin/btrfs)
==883== by 0x153B5F: ??? (in /usr/bin/btrfs)
==883== by 0x11B0C1: main (in /usr/bin/btrfs)
==883==
==883== Syscall param ioctl(generic) points to unaddressable byte(s)
==883== at 0x4CA9CBB: ioctl (in /usr/lib/libc-2.29.so)
==883== by 0x13C9AB: get_device_info (in /usr/bin/btrfs)
==883== by 0x13D715: get_fs_info (in /usr/bin/btrfs)
==883== by 0x153B5F: ??? (in /usr/bin/btrfs)
==883== by 0x11B0C1: main (in /usr/bin/btrfs)
==883== Address 0x4d8c7a0 is 0 bytes after a block of size 12,288 alloc'd
==883== at 0x483877F: malloc (vg_replace_malloc.c:299)
==883== by 0x13D861: get_fs_info (in /usr/bin/btrfs)
==883== by 0x153B5F: ??? (in /usr/bin/btrfs)
==883== by 0x11B0C1: main (in /usr/bin/btrfs)
==883==
--883-- VALGRIND INTERNAL ERROR: Valgrind received a signal 11 (SIGSEGV) - exiting
--883-- si_code=1; Faulting address: 0x284D8C7B8; sp: 0x1002eb5e50
valgrind: the 'impossible' happened:
Killed by fatal signal
host stacktrace:
==883== at 0x5805261C: get_bszB_as_is (m_mallocfree.c:303)
==883== by 0x5805261C: get_bszB (m_mallocfree.c:315)
==883== by 0x5805261C: vgPlain_arena_malloc (m_mallocfree.c:1799)
==883== by 0x58005AD2: vgMemCheck_new_block (mc_malloc_wrappers.c:372)
==883== by 0x58005AD2: vgMemCheck_malloc (mc_malloc_wrappers.c:407)
==883== by 0x580A7373: do_client_request (scheduler.c:1925)
==883== by 0x580A7373: vgPlain_scheduler (scheduler.c:1488)
==883== by 0x580F57A0: thread_wrapper (syswrap-linux.c:103)
==883== by 0x580F57A0: run_a_thread_NORETURN (syswrap-linux.c:156)
sched status:
running_tid=1
Thread 1: status = VgTs_Runnable (lwpid 883)
==883== at 0x483877F: malloc (vg_replace_malloc.c:299)
==883== by 0x1534AA: ??? (in /usr/bin/btrfs)
==883== by 0x153C49: ??? (in /usr/bin/btrfs)
==883== by 0x11B0C1: main (in /usr/bin/btrfs)
client stack range: [0x1FFEFFA000 0x1FFF000FFF] client SP: 0x1FFEFFDCE0
valgrind stack range: [0x1002DB6000 0x1002EB5FFF] top usage: 7520 of 1048576
======================================================================
The above log says that an invalid write to the allocated @di_args happened
in get_device_info(), called from get_fs_info().
The size of @di_args is allocated according to fi_args->num_devices.
And fi_args->num_devices is *the number of dev_items in chunk_tree*.
However, in the loop to get device info, btrfs-progs calls the ioctl
BTRFS_IOC_DEV_INFO, which just finds the device in
fs_info->fs_devices->devices.
Let's look at the kernel side.
In btrfs_rm_device(), btrfs_rm_dev_item() removes the related
dev_items from chunk_tree. *Do something*.
Then the device is deleted from device->fs_devices.
So the case is:
Userspace                                       kernel
get_fs_info()                                   btrfs_rm_device()
                                                ...
                                                btrfs_rm_dev_item()
determine fi_args->num_devices and
fi_args->max_id by searching chunk_tree.
malloc()                                        ...
Loop(Crashed): call get_device_info() by devid
from 1 to fi_args->max_id.
                                                mutex_lock(&fs_devices->device_list_mutex);
                                                list_del_rcu(&device->dev_list);
                                                ...
In the loop, get_device_info() can still get info for the device being
removed since it's still in fs_info->fs_devices->devices.
The iterator @ndev then increments past the allocated size, causing an
out-of-bounds access.
Solve it by checking @ndev while looping.
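The bounded loop can be sketched as below. This is only an illustration under
assumptions: struct dev_info, lookup_device() and collect_devices() are
hypothetical stand-ins for the real structures and for get_device_info(), and
the list of present devices is simulated with an array.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the per-device info returned by the ioctl. */
struct dev_info {
	uint64_t devid;
};

/*
 * Hypothetical stand-in for get_device_info(): returns 0 and fills *out
 * if @devid is in the simulated device list, -1 otherwise.
 */
static int lookup_device(const uint64_t *present, size_t nr_present,
			 uint64_t devid, struct dev_info *out)
{
	for (size_t i = 0; i < nr_present; i++) {
		if (present[i] == devid) {
			out->devid = devid;
			return 0;
		}
	}
	return -1;
}

/*
 * Fill @di_args, which has room for @num_devices entries only.
 * The loop stops as soon as @ndev reaches @num_devices, so a device
 * that is still listed but no longer counted cannot overflow the array.
 * Returns the number of entries filled.
 */
static size_t collect_devices(const uint64_t *present, size_t nr_present,
			      uint64_t max_id, struct dev_info *di_args,
			      size_t num_devices)
{
	size_t ndev = 0;

	assert(di_args != NULL);
	for (uint64_t devid = 1; devid <= max_id && ndev < num_devices;
	     devid++) {
		if (lookup_device(present, nr_present, devid,
				  &di_args[ndev]) == 0)
			ndev++;
	}
	return ndev;
}
```

With three devices still listed but only two slots allocated, the loop fills
two entries and never touches memory past the array.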
Reported-by: Peter Hjalmarsson <kanelxake@gmail.com>
Link: https://bugzilla.redhat.com/show_bug.cgi?id=1711787
Signed-off-by: Su Yue <Damenly_Su@gmx.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Although modern hardware is fast enough and crc32c calculation is not a
hotspot, doing such optimization won't hurt anyway.
Issue: #175
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There is a bug report of a BUG_ON() caused by __free_extent()
failing to look up a backref extent:
Failed to find [1429288337408, 168, 16384]
btrfs unable to find ref byte nr 1429288583168 parent 0 root 2 owner 0 offset 0
convert/source-ext2.c:834: ext2_copy_inodes: BUG_ON ret triggered, value -5
./btrfs-convert[0x410941]
./btrfs-convert(main+0x1fdc)[0x40d3b8]
/lib64/libc.so.6(__libc_start_main+0xf3)[0x7f93bb7d2f33]
./btrfs-convert(_start+0x2e)[0x40a96e]
It's still unclear how this bug can be triggered, but adding such debug
output will provide more info for us to debug.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The test misc-tests/035-receive-common-mount-point-prefix does another
mount inside TEST_MNT but current 'make test-clean' will not properly
undo the nested mount and this will break subsequent tests. The
recursive unmount can handle that.
Signed-off-by: David Sterba <dsterba@suse.com>
In convert we use trans->block_reserved >= 4096 as a threshold to commit
transaction, where block_reserved is the number of new tree blocks
allocated inside a transaction.
The problem is, we still have a hidden bug in the delayed ref implementation
in btrfs-progs: when we have a large enough transaction, delayed ref may
fail to find certain tree blocks in the extent tree and cause a transaction
abort.
This fix works around it by committing the transaction at a much lower
threshold.
The old 4096 means 4096 new tree blocks; when using the default (16K)
nodesize, that's 64M, which can contain over 12k inlined data extents or
csum for around 60G, or over 800K file extents.
The new threshold will limit the size of new tree blocks to 2M, aligning
with the chunk preallocator threshold, and reducing the possibility to
hit that delayed ref bug.
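The numbers above can be checked with a small calculation. This is a sketch
only; the helper names are made up, and it assumes the new threshold is still
compared against a count of tree blocks.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes of new tree blocks accumulated when a block-count threshold is hit. */
static uint64_t threshold_bytes(uint64_t nr_blocks, uint32_t nodesize)
{
	assert(nodesize != 0);
	return nr_blocks * nodesize;
}

/* Number of tree blocks that fit into a byte budget. */
static uint64_t threshold_blocks(uint64_t bytes, uint32_t nodesize)
{
	assert(nodesize != 0);
	return bytes / nodesize;
}
```

For the default 16K nodesize, threshold_bytes(4096, 16384) gives 64M for the
old threshold, and threshold_blocks(2M, 16384) gives 128 tree blocks for the
new one.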
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
libbtrfs.so already has user's LDFLAGS applied. The change also applies
those to libbtrfsutil.so. A separate variable is used for that though it
currently only copies LDFLAGS. This is to make it obvious that
libbtrfsutil is a standalone library.
Reported-by: Michał Górny
Bug: https://bugs.gentoo.org/686284
Pull-request: #172
Signed-off-by: Sergei Trofimovich <slyfox@gentoo.org>
Signed-off-by: David Sterba <dsterba@suse.com>
Adds Make variables EXTRA_PYTHON_CFLAGS and EXTRA_PYTHON_LDFLAGS which
can be used to pass CFLAGS and LDFLAGS respectively when building the
Python library.
This is required to support reproducible builds, as there are often
compiler and linker flags that must be passed in order to generate
reproducible output (e.g. -fdebug-prefix-map).
Pull-request: #176
Signed-off-by: Joshua Watt <JPEWhacker@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There is one report of a compressed extent in btrfs that has no csum,
which can lead to a decompression error corrupting kernel memory.
Although it's a kernel bug, and won't cause problems until the compressed
data gets corrupted, let's catch such problems in advance.
This patch will catch any unexpected compressed extent with:
1) 0 or less than expected csum
2) nodatasum flag set in the inode item
This is for original mode.
Reported-by: James Harvey <jamespharvey20@gmail.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There is one report of a compressed extent in btrfs that has no csum,
which can lead to a decompression error corrupting kernel memory.
Although it's a kernel bug, and won't cause problems until the compressed
data gets corrupted, let's catch such problems in advance.
This patch will catch any unexpected compressed extent with:
1) missing csum
2) nodatasum flag set in the inode item
This is for lowmem mode.
Reported-by: James Harvey <jamespharvey20@gmail.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The command
$ printf 'btrfs-stream\0\0\0\0\0' | btrfs receive --dump
can loop as the stream is not valid, but the maximum error limit is not
set properly for --dump. The command line parameter -E applies here too,
so it's still possible to dump a partially damaged stream.
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=200085
Author: Alexander Kovtunenko <akovtunenko@slice.com>
Signed-off-by: David Sterba <dsterba@suse.com>
blkid_get_cache() returns an error code that is already -errno, so we can
use it directly.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently, when operating in a more verbose mode (-vv), the receive
command does not mention any write or clone commands, unlike other
commands.
This change adds debug messages for the write and clone operations, that
do not include data but only offsets and lengths, as this is actually
very useful to debug a send stream and I use it frequently.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When repairing a file system created by a very old kernel, I ran into
issues fixing up the extent flags since fixup_extent_flags assumed
that a METADATA_ITEM would be present if the record was for metadata.
Since METADATA_ITEMs don't exist without skinny metadata, we need to
fall back to EXTENT_ITEMs. This also falls back to EXTENT_ITEMs even
with skinny metadata enabled as other parts of the tools do.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
When running fuzz-tests/003 and fuzz-tests/009, btrfs-progs will crash
due to BUG_ON().
[CAUSE]
We abused BUG_ON() in btrfs_commit_transaction(), which is one of the
most error-prone functions for fuzzed images.
Currently, to clean up an aborted transaction, we only need to clean up
the only per-transaction data: delayed refs.
This patch introduces a new function, btrfs_destroy_delayed_refs(),
to clean up delayed refs when we fail to commit a transaction.
With that function, we gently destroy the per-trans delayed refs, and
remove the BUG_ON()s in btrfs_commit_transaction().
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This patch will refactor btrfs_finish_extent_commit():
- Make it return void
There is no failure pattern for btrfs_finish_extent_commit(), thus it
always returns 0. And the caller doesn't care about the return value.
So no need to return int.
- Remove @root and @unpin parameters
@root is only used to extract fs_info, which can be extracted from
the transaction handle already.
@unpin is always fs_info->pinned_extents.
All these parameters can be extracted from @trans, no need to pass
them.
The function signature now matches the kernel counterpart.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
cleanup_ref_head() will only return 0 or 1, no way to return a negative
value. So remove the dead branch.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The temporary files are not accessible if the testsuite is hosted on
NFS; pre-create them and allow writes.
Signed-off-by: David Sterba <dsterba@suse.com>
The caller owns the fd passed to btrfs_util_subvolume_id_fd(), so we
shouldn't close it on error. Fix it, add a regression test, and bump the
library patch version.
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The branch passes all selftests, except:
- misc/035
Known bug as the fix is reverted.
- fuzz/003
- fuzz/009
Not a regression, as stable tags also trigger them.
BUG_ON() in commit_transaction gets triggered due to ENOSPC.
These two bugs will be addressed soon, but not in this pull.
This pull request includes the following features:
Core change:
- check --repair
* Flush/FUA support to avoid breaking metadata CoW
Now a btrfs-progs crash or an aborted transaction won't cause
new transid errors.
Fixes and Enhancement:
- generic
* Try best copy when reading tree blocks.
* Skip unnecessary retry when one tree block copy fails.
* Avoid populating the tree block cache with bad tree blocks.
* Don't BUG_ON() when failed to flush/write super blocks
- check
* File extents repair no longer relies on data in extent tree.
* New ability to check and repair free space cache invalid inode mode.
* Update backup roots when commit transaction.
- Misc
* fs_info <-> root parameters cleanup for btrfs_check_leaf/node()
Commit 7a12d8470e ("btrfs-progs: Do metadata preallocation as long as
we're not modifying extent tree") tries to fix #123, however due to the
fact that chunk tree also has root->ref_cows set, we will call
do_chunk_alloc() until call stack explodes.
So revert that offending patch until we have a much better comment on
root->ref_cows and find a better solution to this problem.
Signed-off-by: Qu Wenruo <wqu@suse.com>
[BUG]
Since commit "btrfs-progs: disk-io: Flush to ensure super block write is
FUA" mkfs-tests/017 will fail like:
====== RUN MUSTFAIL /home/adam/btrfs-progs/mkfs.btrfs -K -f /dev/mapper/btrfs-progs-thin-vol
ERROR: failed to write super block for devid 1: flush error: Input/output error
disk-io.c:1810: write_all_supers: BUG_ON `ret` triggered, value -5
/home/adam/btrfs-progs/mkfs.btrfs(+0x1e5c1)[0x557a2c83e5c1]
/home/adam/btrfs-progs/mkfs.btrfs(+0x1e65f)[0x557a2c83e65f]
/home/adam/btrfs-progs/mkfs.btrfs(write_all_supers+0x1ce)[0x557a2c843a8a]
/home/adam/btrfs-progs/mkfs.btrfs(write_ctree_super+0x12d)[0x557a2c843be2]
/home/adam/btrfs-progs/mkfs.btrfs(btrfs_commit_transaction+0x250)[0x557a2c887c56]
/home/adam/btrfs-progs/mkfs.btrfs(+0xc0b1)[0x557a2c82c0b1]
/home/adam/btrfs-progs/mkfs.btrfs(main+0x1049)[0x557a2c82e929]
/usr/lib/libc.so.6(__libc_start_main+0xf3)[0x7f6689e99223]
/home/adam/btrfs-progs/mkfs.btrfs(_start+0x2e)[0x557a2c82b86e]
failed (expected): /home/adam/btrfs-progs/mkfs.btrfs -K -f /dev/mapper/btrfs-progs-thin-vol
[CAUSE]
Just one BUG_ON() in write_all_supers().
[FIX]
Just remove the BUG_ON(). Callers of write_all_supers() are already
checking the return value.
Also, since write_all_supers() can return an error, make the callers of
write_ctree_super(), btrfs_commit_transaction() and close_ctree_fs_info(),
handle the error correctly.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
[BUG]
There are tons of reports of btrfs-progs screwing up the fs, the most
recent one is "btrfs check --clear-space-cache v1" triggered BUG_ON()
and then leaving the fs with transid mismatch problem.
[CAUSE]
In the kernel, the block layer handles the flush work. Even on devices
without FUA support (like most SATA devices using default libata
settings), the kernel handles a FUA write by flushing the device, doing a
normal write, and finishing with another flush.
The pre-flush, write, post-flush sequence works pretty well to implement
FUA writes.
However in btrfs-progs we just use pwrite(), and there is nothing keeping
the write order.
So even for basic v1 free space cache clearing, the write sequence seen
by the kernel bio layer (via dm-log-writes) differs from the one issued
by user space pwrite() calls.
In btrfs-progs, with extra debug output in write_tree_block() and
write_dev_supers(), we can see btrfs-progs follows the right write
sequence:
Opening filesystem to check...
Checking filesystem on /dev/mapper/log
UUID: 3feb3c8b-4eb3-42f3-8e9c-0af22dd58ecf
write tree block start=1708130304 gen=39
write tree block start=1708146688 gen=39
write tree block start=1708163072 gen=39
write super devid=1 gen=39
write tree block start=1708179456 gen=40
write tree block start=1708195840 gen=40
write super devid=1 gen=40
write tree block start=1708130304 gen=41
write tree block start=1708146688 gen=41
write tree block start=1708228608 gen=41
write super devid=1 gen=41
write tree block start=1708163072 gen=42
write tree block start=1708179456 gen=42
write super devid=1 gen=42
write tree block start=1708130304 gen=43
write tree block start=1708146688 gen=43
write super devid=1 gen=43
Free space cache cleared
But from dm-log-writes, the bio sequence is a different story:
replaying 1742: sector 131072, size 4096, flags 0(NONE)
replaying 1743: sector 128, size 4096, flags 0(NONE) <<< Only one sb write
replaying 1744: sector 2828480, size 4096, flags 0(NONE)
replaying 1745: sector 2828488, size 4096, flags 0(NONE)
replaying 1746: sector 2828496, size 4096, flags 0(NONE)
replaying 1787: sector 2304120, size 4096, flags 0(NONE)
......
replaying 1790: sector 2304144, size 4096, flags 0(NONE)
replaying 1791: sector 2304152, size 4096, flags 0(NONE)
replaying 1792: sector 0, size 0, flags 8(MARK)
During the free space cache clearing, we committed 3 transactions but
dm-log-writes only caught one super block write.
This means all 3 writes were merged into the last super block write.
And the super block write was the 2nd write, before all tree block
writes, completely screwing up the metadata CoW protection.
No wonder a crashed btrfs-progs can make things worse.
[FIX]
Fix this super serious problem by implementing pre and post flush for
the primary super block in btrfs-progs.
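A minimal user space sketch of the pre-flush, write, post-flush sequence is
shown below. It assumes fsync() is an acceptable flush primitive for the
device fd, and write_with_flush() is a hypothetical helper name, not the
function added by this patch.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <errno.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Emulate a FUA write from user space: flush everything written before,
 * write the block, then flush again so the block itself is durable.
 * Returns 0 on success or -errno on failure.
 */
static int write_with_flush(int fd, const void *buf, size_t len, off_t offset)
{
	ssize_t ret;

	assert(buf != NULL);

	if (fsync(fd) < 0)		/* pre-flush */
		return -errno;

	ret = pwrite(fd, buf, len, offset);
	if (ret < 0)
		return -errno;
	if ((size_t)ret != len)
		return -EIO;		/* treat a short write as an error */

	if (fsync(fd) < 0)		/* post-flush */
		return -errno;
	return 0;
}
```

The pre-flush guarantees that all tree blocks reach the disk before the super
block that points to them, which is what metadata CoW relies on.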
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
When we failed to write super blocks, we just output something like:
WARNING: failed to write sb: I/O error
Or
WARNING: failed to write all sb data
There is no info about which device failed and there are two different
error messages for the same write error.
This patch will change it to something more detailed:
ERROR: failed to write super block for devid 1: write error: I/O error
This provides the basis for later super block flush error handling.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
The image has one free space cache inode with invalid mode (0).
item 9 key (256 INODE_ITEM 0) itemoff 13702 itemsize 160
generation 30 transid 30 size 65536 nbytes 1507328
block group 0 mode 0 links 1 uid 0 gid 0 rdev 0
sequence 23 flags 0x1b(NODATASUM|NODATACOW|NOCOMPRESS|PREALLOC)
atime 0.0 (1970-01-01 08:00:00)
ctime 1553491158.189771625 (2019-03-25 13:19:18)
mtime 0.0 (1970-01-01 08:00:00)
otime 0.0 (1970-01-01 08:00:00)
Both lowmem and original mode should be able to detect and fix it.
The extracted test image is pretty big (1G extracted), as kernel won't
cache small chunks.
Even with SSD, such test may still take some seconds just extracting the
image.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Just like lowmem mode, also check and repair free space cache inode
item.
And since we don't really have a good time or function to check free
space cache inodes, we use the same mode-common function,
check_repair_free_space_inode(), when iterating the root tree.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Unlike inodes in fs roots, we don't really check the inode items in root
tree, in fact we just skip everything other than ROOT_ITEM and ROOT_REF.
This makes invalid inode items sneak into root tree.
For example:
item 9 key (256 INODE_ITEM 0) itemoff 13702 itemsize 160
generation 30 transid 30 size 65536 nbytes 1507328
block group 0 mode 0 links 1 uid 0 gid 0 rdev 0
^ Should be 100600
sequence 23 flags 0x1b(NODATASUM|NODATACOW|NOCOMPRESS|PREALLOC)
atime 0.0 (1970-01-01 08:00:00)
ctime 1553491158.189771625 (2019-03-25 13:19:18)
mtime 0.0 (1970-01-01 08:00:00)
otime 0.0 (1970-01-01 08:00:00)
There is a report of such a problem on the mailing list.
This patch will check and repair inode items of free space cache inodes in
lowmem mode.
Since free space cache inodes don't have INODE_REF but still have 1
link, we can't use check_inode_item() directly.
Instead we only check the inode mode, as that's the important part.
The check and repair function: check_repair_free_space_inode() is also
exported for original mode.
Signed-off-by: Qu Wenruo <wqu@suse.com>
In root tree, we only have 2 types of inodes:
- ROOT_TREE_DIR inode
Its mode is fixed to 40755
- free space cache inodes
Its mode is fixed to 100600
This patch will add the ability to repair such inodes to lowmem mode.
For fs/subvolume tree errors, at least we haven't seen such corruption
yet, so we don't need to rush to fix corruption in fs trees yet.
The repair function, reset_imode() and repair_imode_common() can be
reused by later original mode patch, so it's placed in check/mode-common.c.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Just like lowmem mode, check inode mode, especially for S_IFMT bits and
beyond.
Please note that, this check only applies to inodes in fs/subvol trees.
It doesn't apply to free space cache inodes.
Reported-by: Thorsten Hirsch <t.hirsch@web.de>
Signed-off-by: Qu Wenruo <wqu@suse.com>
There is one report about invalid free space cache inode mode.
Normally free space cache inode should have mode 100600 (regular file,
no uid/gid/sticky bit, rw------ bit).
But in that report, we have free space cache inode mode as 0.
So at least btrfs check should report invalid inode mode.
This patch will at least make btrfs check lowmem mode detect this
problem.
Please note that, this check only applies to inodes in fs/subvol trees.
It doesn't apply to free space cache inodes.
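A check on the S_IFMT bits and beyond might look like the sketch below. The
function name imode_looks_valid() is hypothetical and the exact rules used by
btrfs check may differ; the sketch assumes that only the standard file types
and the low 12 permission bits are allowed.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <sys/stat.h>

/* Return true if @imode has a known file type and no stray high bits. */
static bool imode_looks_valid(uint32_t imode)
{
	/* No bits may be set beyond the file type and permission bits. */
	if (imode & ~(uint32_t)(S_IFMT | 07777))
		return false;

	/* The file type must be one of the standard types. */
	switch (imode & S_IFMT) {
	case S_IFREG:
	case S_IFDIR:
	case S_IFLNK:
	case S_IFCHR:
	case S_IFBLK:
	case S_IFIFO:
	case S_IFSOCK:
		return true;
	default:
		return false;
	}
}
```

A mode of 0, as in the report, has no file type at all and is rejected, while
100600 is accepted as a regular file.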
Reported-by: Thorsten Hirsch <t.hirsch@web.de>
Signed-off-by: Qu Wenruo <wqu@suse.com>
[BUG]
If the first copy of a tree block has a bad key order, but the second
copy is completely good, then "btrfs ins dump-tree -b <bytenr>" fails to
print anything past the bad key:
leaf 29786112 items 47 free space 983 generation 20 owner EXTENT_TREE
leaf 29786112 flags 0x1(WRITTEN) backref revision 1
fs uuid 3381d111-94a3-4ac7-8f39-611bbbdab7e6
chunk uuid 9af1c3c7-2af5-488b-8553-530bd515f14c
[snip]
item 9 key (20975616 METADATA_ITEM 0) itemoff 3543 itemsize 33
refs 1 gen 16 flags TREE_BLOCK
tree block skinny level 0
tree block backref root CHUNK_TREE
item 10 key (29360128 BLOCK_GROUP_ITEM 33554432) itemoff 3519 itemsize 24
block group used 94208 chunk_objectid 256 flags METADATA|DUP
ERROR: leaf 29786112 slot 11 pointer invalid, offset 1245184 size 0 leaf data limit 3995
ERROR: skip remaining slots
The kernel, in contrast, can locate the good copy and act as if nothing
happened.
[CAUSE]
btrfs-progs uses read_tree_block() to try each copy. But it only uses
the less strict check_tree_block(), which has fewer sanity checks than
btrfs_check_node/leaf().
Some errors like bad key order are ignored to allow btrfs check to fix them.
This leads to the above problem.
[FIX]
Introduce a new member, @candidate_mirror, in read_tree_block(), which
records the copy that passes check_tree_block() but fails
btrfs_check_leaf/node(), as a last chance.
Only if no better copy is found do we use @candidate_mirror.
So btrfs-progs can act just like the kernel and use the best copy.
Link: https://bugzilla.kernel.org/show_bug.cgi?id=202691
Reported-by: Yoon Jungyeon <jungyeon@gatech.edu>
[Inspired by that image, not to fix any bug of that bugzilla]
Signed-off-by: Qu Wenruo <wqu@suse.com>
btrfs_num_copies really only needs to be called once, so move it out of
the verification loop in read_tree_block().
Signed-off-by: Qu Wenruo <wqu@suse.com>
[BUG]
If the first copy of a tree block is corrupted but the other copy is
good, btrfs-progs will report the error twice:
checksum verify failed on 30556160 found 42A2DA71 wanted 00000000
checksum verify failed on 30556160 found 42A2DA71 wanted 00000000
While the kernel only reports it once, just as expected:
BTRFS warning (device dm-3): dm-3 checksum verify failed on 30556160 wanted 0 found 42A2DA71 level 0
[CAUSE]
We use mirror_num = 0 in read_tree_block() of btrfs-progs.
At first glance it's pretty OK, but mirror num 0 in btrfs means ANY
good copy. Real mirror num starts from 1.
In the context of read_tree_block(), since it's read_tree_block() to do
all the checks, mirror num 0 just means the first copy.
So if the first copy is corrupted, btrfs-progs will try mirror num 1
next, which is just the same as mirror num 0.
After reporting the same error on the same copy, btrfs-progs will
finally try mirror num 2, and get the good copy.
[FIX]
The fix is far simpler than the analysis above: just start from
mirror num 1.
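The effect of the starting value can be modelled in a few lines. The model is
an assumption made for illustration: copy_for_mirror() and reads_until_good()
are hypothetical, and mirror num 0 is treated as an alias of the first copy,
as the analysis above describes.

```c
#include <assert.h>

/*
 * Hypothetical model: when the reader does all verification itself,
 * mirror num 0 behaves like the first copy and mirror num N >= 1 is
 * copy N. Returned copies are 1-based.
 */
static int copy_for_mirror(int mirror_num)
{
	assert(mirror_num >= 0);
	return mirror_num == 0 ? 1 : mirror_num;
}

/*
 * Count the reads needed to reach @good_copy when trying mirror numbers
 * first_mirror, first_mirror + 1, ... up to @num_copies.
 * Returns -1 if the good copy is never reached.
 */
static int reads_until_good(int first_mirror, int good_copy, int num_copies)
{
	int reads = 0;

	for (int mirror = first_mirror; mirror <= num_copies; mirror++) {
		reads++;
		if (copy_for_mirror(mirror) == good_copy)
			return reads;
	}
	return -1;
}
```

With two copies and only the second one good, starting from 0 takes three
reads (two of them on the same bad copy), while starting from 1 takes two.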
Signed-off-by: Qu Wenruo <wqu@suse.com>
[BUG]
With the new support for multiple -b parameters, we could hit this bug on
a btrfs with 16K nodesize:
$ ./btrfs inspect dump-tree -b 1024 -b 2048 -b 4096 -b 8192 zimg
btrfs-progs v4.20.2
ERROR: tree block bytenr 1024 is not aligned to sectorsize 4096
ERROR: tree block bytenr 2048 is not aligned to sectorsize 4096
Couldn't map the block 4096
Invalid mapping for 4096-20480, got 13631488-22020096
Couldn't map the block 4096
bad tree block 4096, bytenr mismatch, want=4096, have=0
ERROR: failed to read tree block 4096
extent_io.c:665: free_extent_buffer_internal: BUG_ON `eb->refs < 0` triggered, value 1
./btrfs[0x426e57]
./btrfs(free_extent_buffer+0xe)[0x427701]
./btrfs(alloc_extent_buffer+0x3f)[0x427872]
./btrfs(btrfs_find_create_tree_block+0xf)[0x415b3c]
./btrfs(read_tree_block+0x5c)[0x4171b5]
./btrfs(cmd_inspect_dump_tree+0x587)[0x46fb75]
./btrfs(handle_command_group+0x44)[0x40df89]
./btrfs(cmd_inspect+0x15)[0x44b569]
./btrfs(main+0x8b)[0x40e032]
/lib64/libc.so.6(__libc_start_main+0xeb)[0x7f2001a54b7b]
./btrfs(_start+0x2a)[0x40dd1a]
Aborted (core dumped)
This is not limited to the multiple ins dump-tree -b parameter support;
it also applies to possible overlapping bad tree blocks.
[CAUSE]
Btrfs delays freeing of extent buffers to improve performance.
However for the "-b 4096 -b 8192" case, the first -b 4096 will leave an
extent buffer start=4096 len=16384 refs=0 in the cached extent tree.
Then the incoming -b 8192 will hit the cache and reuse the cached extent
buffer.
And since the cached extent buffer doesn't match the bytenr, its refs
won't get increased, and we're going to free that eb again.
Since the bad cached eb already has a ref count of 0, calling
free_extent_buffer() on it again will trigger the assert.
[FIX]
So for bad extent buffers we failed to read, just delete them
immediately.
This will free them from extent buffer cache, so later extent buffer
allocation will not hit the stale one, and prevent the bug from
happening.
Reported-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Qu Wenruo <wqu@suse.com>
The code is mostly ported from kernel with minimal change.
Since btrfs-progs doesn't support replaying log, some of the code is
unnecessary for btrfs-progs, but to keep the code the same, that
unnecessary code is kept as is.
Now "btrfs check --repair" will update backup roots correctly.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Lowmem can repair after commit
'btrfs-progs: lowmem: move nbytes check before isize check',
so add the beacon file.
Signed-off-by: Su Yue <suy.fnst@cn.fujitsu.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
A missing extent leaves a gap between adjacent extents. fsck should
detect the gap correctly and repair it by punching a hole.
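Gap detection between adjacent extents can be sketched as below. The
structures and find_gaps() are hypothetical, the extents are assumed to be
sorted by file offset, and the sketch assumes every byte range must be covered
by some extent (so it ignores features that allow implicit holes).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct file_extent {
	uint64_t offset;	/* file offset where the extent starts */
	uint64_t len;		/* number of bytes covered */
};

struct hole {
	uint64_t start;
	uint64_t len;
};

/*
 * Walk extents sorted by offset and record the gaps between them.
 * Returns the number of gaps found; at most @max_holes are stored.
 */
static size_t find_gaps(const struct file_extent *ext, size_t nr,
			struct hole *holes, size_t max_holes)
{
	uint64_t expected = 0;	/* where the next extent should start */
	size_t found = 0;

	assert(ext != NULL || nr == 0);
	for (size_t i = 0; i < nr; i++) {
		if (ext[i].offset > expected) {
			if (found < max_holes) {
				holes[found].start = expected;
				holes[found].len = ext[i].offset - expected;
			}
			found++;
		}
		if (ext[i].offset + ext[i].len > expected)
			expected = ext[i].offset + ext[i].len;
	}
	return found;
}
```

Each recorded gap is exactly the range a repair would cover with a hole
extent.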
Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
For test case fsck-tests/001-bad-file-extent-bytenr, we have an
obviously hand crafted image with unaligned file extent:
item 7 key (257 EXTENT_DATA 0) itemoff 3453 itemsize 53
generation 6 type 1 (regular)
extent data disk byte 755944791 nr 1048576
extent data offset 0 nr 1048576 ram 1048576
extent compression 0 (none)
disk bytenr 755944791 is obviously unaligned (not even).
For such obviously corrupted file extent, we should just delete the file
extent.
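The alignment test itself is a one-liner. The sketch below assumes a
power-of-two sectorsize such as 4096; is_sector_aligned() is a hypothetical
helper name.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Return true if @bytenr is a multiple of @sectorsize. */
static bool is_sector_aligned(uint64_t bytenr, uint32_t sectorsize)
{
	/* sectorsize must be a power of two for the mask trick to work */
	assert(sectorsize != 0 && (sectorsize & (sectorsize - 1)) == 0);
	return (bytenr & ((uint64_t)sectorsize - 1)) == 0;
}
```

The disk bytenr 755944791 from the image is odd, so it fails this check for
any sectorsize of 2 or more.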
Signed-off-by: Su Yanjun <suyj.fnst@cn.fujitsu.com>
[Update commit message and comment]
Signed-off-by: Qu Wenruo <wqu@suse.com>
Function find_possible_backrefs() is used to locate the file extents
referring to a data extent.
For data extent backref, its btrfs_extent_data_ref structure has
the following members:
- root
Which root refers to this data extent
- objectid
Which inode refers to this data extent
- offset
Search *hint*.
Its value is @file_offset - @extent_offset.
While for @file_offset, it's directly recorded in (INO EXTENT_DATA
FILE_OFFSET) key.
So when searching for the file extents that refer to this data extent, we
can't use btrfs_extent_data_ref::offset as search key::offset.
We must search from file offset 0, and iterate all file extents until we
hit a file extent that matches the data backref.
Thankfully such time consuming behavior is not triggered frequently,
it only gets called for repair, so it shouldn't affect normal check
routine.
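The search can be illustrated with simplified in-memory structures. This is a
sketch under assumptions: struct file_extent, struct data_ref and
find_matching_extent() are hypothetical, and an array stands in for iterating
EXTENT_DATA items of one inode from file offset 0.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct file_extent {
	uint64_t file_offset;	/* key offset: (INO EXTENT_DATA FILE_OFFSET) */
	uint64_t disk_bytenr;	/* start of the data extent on disk */
	uint64_t extent_offset;	/* offset into the data extent */
};

struct data_ref {
	uint64_t bytenr;	/* data extent being referenced */
	uint64_t offset;	/* file_offset - extent_offset */
};

/*
 * Scan all file extents of one inode, starting at file offset 0, and
 * return the index of the first one matching @ref, or -1 if none does.
 */
static long find_matching_extent(const struct file_extent *fe, size_t nr,
				 const struct data_ref *ref)
{
	assert(ref != NULL);
	for (size_t i = 0; i < nr; i++) {
		if (fe[i].disk_bytenr != ref->bytenr)
			continue;
		if (fe[i].file_offset - fe[i].extent_offset == ref->offset)
			return (long)i;
	}
	return -1;
}
```

In the test data used with this sketch, the matching extent sits at file
offset 8192 with extent offset 4096, so its backref offset is 4096. Using 4096
directly as the key offset would land on a different, unrelated file extent.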
Signed-off-by: Su Yanjun <suyj.fnst@cn.fujitsu.com>
[Update commit message]
Signed-off-by: Qu Wenruo <wqu@suse.com>
Commit 0ddf63c09f ("btrfs-progs: Record orphan data extent ref to
corresponding root.") introduces the ability to record a file extent
even if all other related info is lost (data backref, inode item).
However this patch only records such info without doing any proper
repair. Furthermore, it could even record invalid file extents, and the
report part only happens after all checks are done.
Since we will later introduce proper file extent repair functionality,
we could revert that patch.
Signed-off-by: Su Yanjun <suyj.fnst@cn.fujitsu.com>
[Update commit message, solve merge conflicts]
Signed-off-by: Qu Wenruo <wqu@suse.com>