With the introduction of xxhash64 to btrfs-progs we created a crypto/
directory for all the hashes used in btrfs (although no
cryptographically secure hash is there yet).
Move the crc32c implementation from kernel-lib/ to crypto/ as well so we
have all hashes consolidated.
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: David Sterba <dsterba@suse.com>
Discovered with cppcheck. Fix signed/unsigned int mismatches, sizeof and
long formats.
Pull-request: #197
Signed-off-by: Rosen Penev <rosenp@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Adding this table will make extending btrfs-progs with new checksum types
easier.
Also add accessor functions to access the table fields.
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: David Sterba <dsterba@suse.com>
Update the checksumming API to be able to cope with more checksum types
than just CRC32C. The finalization call is merged into btrfs_csum_data.
There are some fixme's and asserts added that need to be resolved.
Co-developed-by: David Sterba <dsterba@suse.com>
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: David Sterba <dsterba@suse.com>
In preparation to supporting new checksum algorithm pass the checksum type
to btrfs_csum_data/btrfs_csum_final, this allows us to encapsulate any
differences in processing into the respective functions
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: David Sterba <dsterba@suse.com>
Pass pointer to a generic buffer instead of fixed size that crc32c
currently uses.
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: David Sterba <dsterba@suse.com>
Add the checksum type to csum_tree_block_size(), __csum_tree_block_size()
and verify_tree_block_csum_silent().
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: David Sterba <dsterba@suse.com>
Use the btrfs_csum_data() wrapper in __csum_tree_block_size() instead of
directly calling crc32c().
This helps us when plumbing new checksum algorithms into the FS.
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: David Sterba <dsterba@suse.com>
This patch will export disk-io.c::check_super() as btrfs_check_super()
and use it in btrfs-image for extra verification.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
Since commit "btrfs-progs: disk-io: Flush to ensure super block write is
FUA" mkfs-tests/017 will fail like:
====== RUN MUSTFAIL /home/adam/btrfs-progs/mkfs.btrfs -K -f /dev/mapper/btrfs-progs-thin-vol
ERROR: failed to write super block for devid 1: flush error: Input/output error
disk-io.c:1810: write_all_supers: BUG_ON `ret` triggered, value -5
/home/adam/btrfs-progs/mkfs.btrfs(+0x1e5c1)[0x557a2c83e5c1]
/home/adam/btrfs-progs/mkfs.btrfs(+0x1e65f)[0x557a2c83e65f]
/home/adam/btrfs-progs/mkfs.btrfs(write_all_supers+0x1ce)[0x557a2c843a8a]
/home/adam/btrfs-progs/mkfs.btrfs(write_ctree_super+0x12d)[0x557a2c843be2]
/home/adam/btrfs-progs/mkfs.btrfs(btrfs_commit_transaction+0x250)[0x557a2c887c56]
/home/adam/btrfs-progs/mkfs.btrfs(+0xc0b1)[0x557a2c82c0b1]
/home/adam/btrfs-progs/mkfs.btrfs(main+0x1049)[0x557a2c82e929]
/usr/lib/libc.so.6(__libc_start_main+0xf3)[0x7f6689e99223]
/home/adam/btrfs-progs/mkfs.btrfs(_start+0x2e)[0x557a2c82b86e]
failed (expected): /home/adam/btrfs-progs/mkfs.btrfs -K -f /dev/mapper/btrfs-progs-thin-vol
[CAUSE]
Just one BUG_ON() in write_all_supers().
[FIX]
Just remove the BUG_ON(). Callers of write_all_supers() are already
checking the return value.
Also since write_all_supers() can return error, make write_ctree_super()
callers, btrfs_commit_transaction() and close_ctree_fs_info() to
handle the error correctly.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
[BUG]
There are tons of reports of btrfs-progs screwing up the fs, the most
recent one is "btrfs check --clear-space-cache v1" triggered BUG_ON()
and then leaving the fs with transid mismatch problem.
[CAUSE]
In kernel, we have block layer handing the flush work, even on devices
without FUA support (like most SATA device using default libata
settings), kernel handles FUA write by flushing the device, then normal
write, and finish it with another flush.
The pre-flush, write, post-flush works pretty well to implement FUA
write.
However in btrfs-progs we just use pwrite(), there is nothing keeping
the write order.
So even for basic v1 free space cache clearing, we have different vision
on the write sequence from kernel bio layer (by dm-log-writes) and user
space pwrite() calls.
In btrfs-progs, with extra debug output in write_tree_block() and
write_dev_supers(), we can see btrfs-progs follows the right write
sequence:
Opening filesystem to check...
Checking filesystem on /dev/mapper/log
UUID: 3feb3c8b-4eb3-42f3-8e9c-0af22dd58ecf
write tree block start=1708130304 gen=39
write tree block start=1708146688 gen=39
write tree block start=1708163072 gen=39
write super devid=1 gen=39
write tree block start=1708179456 gen=40
write tree block start=1708195840 gen=40
write super devid=1 gen=40
write tree block start=1708130304 gen=41
write tree block start=1708146688 gen=41
write tree block start=1708228608 gen=41
write super devid=1 gen=41
write tree block start=1708163072 gen=42
write tree block start=1708179456 gen=42
write super devid=1 gen=42
write tree block start=1708130304 gen=43
write tree block start=1708146688 gen=43
write super devid=1 gen=43
Free space cache cleared
But from dm-log-writes, the bio sequence is a different story:
replaying 1742: sector 131072, size 4096, flags 0(NONE)
replaying 1743: sector 128, size 4096, flags 0(NONE) <<< Only one sb write
replaying 1744: sector 2828480, size 4096, flags 0(NONE)
replaying 1745: sector 2828488, size 4096, flags 0(NONE)
replaying 1746: sector 2828496, size 4096, flags 0(NONE)
replaying 1787: sector 2304120, size 4096, flags 0(NONE)
......
replaying 1790: sector 2304144, size 4096, flags 0(NONE)
replaying 1791: sector 2304152, size 4096, flags 0(NONE)
replaying 1792: sector 0, size 0, flags 8(MARK)
During the free space cache clearing, we committed 3 transaction but
dm-log-write only caught one super block write.
This means all the 3 writes were merged into the last super block write.
And the super block write was the 2nd write, before all tree block
writes, completely screwing up the metadata CoW protection.
No wonder crashed btrfs-progs can make things worse.
[FIX]
Fix this super serious problem by implementing pre and post flush for
the primary super block in btrfs-progs.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
When we failed to write super blocks, we just output something like:
WARNING: failed to write sb: I/O error
Or
WARNING: failed to write all sb data
There is no info about which device failed and there are two different
error message for the same write error.
This patch will change it to something more detailed:
ERROR: failed to write super block for devid 1: write error: I/O error
This provides the basis for later super block flush error handling.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
[BUG]
If the first copy of a tree block has a bad key order, but the second
copy is completely good, then "btrfs ins dump-tree -b <bytenr>" fails to
print anything past the bad key:
leaf 29786112 items 47 free space 983 generation 20 owner EXTENT_TREE
leaf 29786112 flags 0x1(WRITTEN) backref revision 1
fs uuid 3381d111-94a3-4ac7-8f39-611bbbdab7e6
chunk uuid 9af1c3c7-2af5-488b-8553-530bd515f14c
[snip]
item 9 key (20975616 METADATA_ITEM 0) itemoff 3543 itemsize 33
refs 1 gen 16 flags TREE_BLOCK
tree block skinny level 0
tree block backref root CHUNK_TREE
item 10 key (29360128 BLOCK_GROUP_ITEM 33554432) itemoff 3519 itemsize 24
block group used 94208 chunk_objectid 256 flags METADATA|DUP
ERROR: leaf 29786112 slot 11 pointer invalid, offset 1245184 size 0 leaf data limit 3995
ERROR: skip remaining slots
While kernel can locate the good copy and acts just like nothing
happened.
[CAUSE]
btrfs-progs uses read_tree_block() to try each copy. But it only uses
less strict check_tree_block(), which has less sanity check than
btrfs_check_node/leaf().
Some error like bad key order is ignored to allow btrfs check to fix it.
This leads to above problem.
[FIX]
Introduce a new member, @candidate_mirror in read_tree_block(), which
records the copy passes check_tree_block() but fails
btrfs_check_leaf/node() as last chance.
Only if no better copy found, then use @candidate_mirror.
So btrfs-progs can act just like kernel to use best copy.
Link: https://bugzilla.kernel.org/show_bug.cgi?id=202691
Reported-by: Yoon Jungyeon <jungyeon@gatech.edu>
[Inspired by that image, not to fix any bug of that bugzilla]
Signed-off-by: Qu Wenruo <wqu@suse.com>
btrfs_num_copies really only needs to be called once, so move it out of
the verification loop in read_tree_block().
Signed-off-by: Qu Wenruo <wqu@suse.com>
[BUG]
If the first copy of a tree block is corrupted but the other copy is
good, btrfs-progs will report the error twice:
checksum verify failed on 30556160 found 42A2DA71 wanted 00000000
checksum verify failed on 30556160 found 42A2DA71 wanted 00000000
While kernel only report it once, just as expected:
BTRFS warning (device dm-3): dm-3 checksum verify failed on 30556160 wanted 0 found 42A2DA71 level 0
[CAUSE]
We use mirror_num = 0 in read_tree_block() of btrfs-progs.
At first glance it's pretty OK, but mirror num 0 in btrfs means ANY
good copy. Real mirror num starts from 1.
In the context of read_tree_block(), since it's read_tree_block() to do
all the checks, mirror num 0 just means the first copy.
So if the first copy is corrupted, btrfs-progs will try mirror num 1
next, which is just the same as mirror num 0.
After reporting the same error on the same copy, btrfs-progs will
finally try mirror num 2, and get the good copy.
[FIX]
The fix is way simpler than all the above analyse, just starts from
mirror num 1.
Signed-off-by: Qu Wenruo <wqu@suse.com>
[BUG]
For the new multiple -b parameter supporting, we could hit this bug on a
16K node sized btrfs:
$ ./btrfs inspect dump-tree -b 1024 -b 2048 -b 4096 -b 8192 zimg
btrfs-progs v4.20.2
ERROR: tree block bytenr 1024 is not aligned to sectorsize 4096
ERROR: tree block bytenr 2048 is not aligned to sectorsize 4096
Couldn't map the block 4096
Invalid mapping for 4096-20480, got 13631488-22020096
Couldn't map the block 4096
bad tree block 4096, bytenr mismatch, want=4096, have=0
ERROR: failed to read tree block 4096
extent_io.c:665: free_extent_buffer_internal: BUG_ON `eb->refs < 0`
triggered, value 1
./btrfs[0x426e57]
./btrfs(free_extent_buffer+0xe)[0x427701]
./btrfs(alloc_extent_buffer+0x3f)[0x427872]
./btrfs(btrfs_find_create_tree_block+0xf)[0x415b3c]
./btrfs(read_tree_block+0x5c)[0x4171b5]
./btrfs(cmd_inspect_dump_tree+0x587)[0x46fb75]
./btrfs(handle_command_group+0x44)[0x40df89]
./btrfs(cmd_inspect+0x15)[0x44b569]
./btrfs(main+0x8b)[0x40e032]
/lib64/libc.so.6(__libc_start_main+0xeb)[0x7f2001a54b7b]
./btrfs(_start+0x2a)[0x40dd1a]
Aborted (core dumped)
This is not only limited to multiple ins dump-tree -b parameter support,
but also to possible overlapping bad tree blocks.
[CAUSE]
Btrfs delay extent freeing to improve performance.
However for the "-b 4096 -b 8192" case, the first -b 4096 will cause an
extent buffer start=4096 len=16384 refs=0 in the cached extent tree.
Then the incoming -b 8192 will hit the cache and reuse the cached extent
buffer.
And since the cached extent buffer doesn't match the bytenr, its refs
won't get increased, and we're going to free that eb again.
Since the bad cached eb already has a ref number 0, calling
free_extent_buffer() on it again will trigger the assert.
[FIX]
So for bad extent buffer we failed to read, just delete them
immediately.
This will free them from extent buffer cache, so later extent buffer
allocation will not hit the stale one, and prevent the bug from
happening.
Reported-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Qu Wenruo <wqu@suse.com>
The code is mostly ported from kernel with minimal change.
Since btrfs-progs doesn't support replaying log, there is some code
unnecessary for btrfs-progs, but to keep the code the same, that
unnecessary code is kept as it.
Now "btrfs check --repair" will update backup roots correctly.
Signed-off-by: Qu Wenruo <wqu@suse.com>
For test case fsck-tests/001-bad-file-extent-bytenr, we have an
obviously hand crafted image with unaligned file extent:
item 7 key (257 EXTENT_DATA 0) itemoff 3453 itemsize 53
generation 6 type 1 (regular)
extent data disk byte 755944791 nr 1048576
extent data offset 0 nr 1048576 ram 1048576
extent compression 0 (none)
disk bytenr 755944791 is obviously unaligned (not even).
For such obviously corrupted file extent, we should just delete the file
extent.
Signed-off-by: Su Yanjun <suyj.fnst@cn.fujitsu.com>
[Update commit message and comment]
Signed-off-by: Qu Wenruo <wqu@suse.com>
Commit 0ddf63c09f ("btrfs-progs: Record orphan data extent ref to
corresponding root.") introduces the ability to record a file extent
even all other related info is lost (data backref, inode item).
However this patch only records such info without doing any proper
repair, further more, it could even record invalid file extents, and the
report part only happens after all check is done.
Since we will later introduce proper file extent repair functionality,
we could revert that patch.
Signed-off-by: Su Yanjun <suyj.fnst@cn.fujitsu.com>
[Update commit message, solve merge conflicts]
Signed-off-by: Qu Wenruo <wqu@suse.com>
This function provides the offline functionality to add new uuid tree
entry. Also port fs_info->uuid and its initialization and cleanup code
to support uuid tree.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
For data reloc tree creation, we copy its contents from the fs tree just
for its INODE_ITEM, INODE_REF and dirid. This hides the detail and is
not obvious for why we're copying from fs root.
This patch will create data reloc tree from scratch:
- Create root, including root item and new tree root
- Change dirid to BTRFS_FIRST_FREE_OBJECTID
- Insert root INODE_ITEM and INODE_REF
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Just as how kernel uses it.
This provides the basis for later uuid creation.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Add support for a new metadata_uuid field. This is just a preparatory
commit which switches all users of the fsid field for metdata comparison
purposes to utilize the new field. This more or less mirrors the
kernel patch, additionally:
* Update 'btrfs inspect-internal dump-super' to account for the new
field. This involes introducing the 'metadata_uuid' line to the
output and updating the logic for comparing the fs uuid to the
dev_item uuid.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
For completeness sake add code to btrfs_read_fs_root so that it can
handle the freespace tree.
Reviewed-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Su Yue <suy.fnst@cn.fujitsu.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Given that the new delayed refs infrastructure is implemented and wired
up, there is no point in keeping the old code. So just remove it.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When the 'btrfsune -u' command is interrupted, the final filesystem fsid
is not written to the superblock and it cannot be mounted. Too bad that
'btrfstune' cannot continue to finish the UUID change as it should.
This patch fixes that and passes the relaxed flags for superblock and
only warns when it detects the fsid mismatch. As this is something that
should be noted in case it would be needed for further debugging, it's
not just silent.
Signed-off-by: David Sterba <dsterba@suse.com>
For easier debugging, let print_tree_block_error() print bytenr of tree
block.
Signed-off-by: Su Yue <suy.fnst@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
New flag that mimics OPEN_CTREE_IGNORE_FSID_MISMATCH but only for
reading the superblock. It should be passed around to various helpers
like scan or mount checks as they'd fail before we'd get to the final
caller that can do something useful with the filesystem.
This will be used for an interrupted 'btrfstune -u'.
Note to __open_ctree_fd: the RECOVERY mode is not compatible with that
flag
Signed-off-by: David Sterba <dsterba@suse.com>
This function already takes a transaction handle which has a reference
to the fs_info, so use that to obtain it.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Su Yue <suy.fnst@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This function actually uses only the extent_buffer arg but takes 3
arguments. Furthermore, it's current interface doesn't even mirror
the kernel counterpart. Just remove the extra arguments.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Su Yue <suy.fnst@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The old flag OPEN_CTREE_FS_PARTIAL is in fact quite easy to be confused
with OPEN_CTREE_PARTIAL, which allow btrfs-progs to open damaged
filesystem (like corrupted extent/csum tree).
However OPEN_CTREE_FS_PARTIAL, unlike its name, is just allowing
btrfs-progs to open fs with temporary superblocks (which only has 6
basic trees on SINGLE meta/sys chunks).
The usage of FS_PARTIAL is really confusing here.
So rename OPEN_CTREE_FS_PARTIAL to OPEN_CTREE_TEMPORARY_SUPER, and add
extra comment for its behavior.
Also rename BTRFS_MAGIC_PARTIAL to BTRFS_MAGIC_TEMPORARY to keep the
naming consistent.
And with above comment, the usage of FS_PARTIAL in dump-tree is
obviously incorrect, fix it.
Fixes: 8698a2b9ba ("btrfs-progs: Allow inspect dump-tree to show specified tree block even some tree roots are corrupted")
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Instead of using the internal struct extent_io_tree, use struct fs_info.
This does not only unify the interface between kernel and btrfs-progs,
but also makes later btrfs_print_tree() use fewer parameters.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When device is missing, read_extent_data() (function exported from old
btrfs check code) has the following problems:
1) Modifies @len parameter if device is missing
If device returned in @multi is missing, @len can be larger than
@max_len (originl length).
This could confuse caller and underflow in the read loop.
2) Still returns 0 for missing device
It only handles read error, missing device is not handled and 0 is
returned.
3) Wrong check for device->fd
In fact, 0 is also a valid fd.
Although not possible under most cases, but still needs fix.
Fix them all.
Fixes: 1bad2f2f2d ("Btrfs-progs: fsck: add an option to check data csums")
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
As btrfs is specific to Linux, %m can be used instead of strerror(errno)
in format strings. This has some size reduction benefits for embedded
systems.
glibc, musl, and uclibc-ng all support %m as a modifier to printf.
A quick glance at the BIONIC libc source indicates that it has
support for %m as well. BSDs and Windows do not but I do believe
them to be beyond the scope of btrfs-progs.
Compiled sizes on Ubuntu 16.04:
Before:
3916512 btrfs
233688 libbtrfs.so.0.1
4899 bcp
2367672 btrfs-convert
2208488 btrfs-corrupt-block
13302 btrfs-debugfs
2152160 btrfs-debug-tree
2136024 btrfs-find-root
2287592 btrfs-image
2144600 btrfs-map-logical
2130760 btrfs-select-super
2152608 btrfstune
2131760 btrfs-zero-log
2277752 mkfs.btrfs
9166 show-blocks
After:
3908744 btrfs
233256 libbtrfs.so.0.1
4899 bcp
2366560 btrfs-convert
2207432 btrfs-corrupt-block
13302 btrfs-debugfs
2151104 btrfs-debug-tree
2134968 btrfs-find-root
2281864 btrfs-image
2143536 btrfs-map-logical
2129704 btrfs-select-super
2151552 btrfstune
2130696 btrfs-zero-log
2276272 mkfs.btrfs
9166 show-blocks
Total savings: 23928 (24 kilo)bytes
Signed-off-by: Rosen Penev <rosenp@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There are a couple of places where instead of the more succinct
list_for_each_entry the code uses list_for_each. This results in
slightly more code with no additional benefit as well as no
coherent pattern. This patch makes the code uniform. No functional
changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
[ remove unused variable in uuid_search ]
Signed-off-by: David Sterba <dsterba@suse.com>
Closing the fs will try to commit a pending transaction, but may fail to
do so if the filesystem state is not well defined. This will eg. fail
for some fuzz tests. The data structures are freed but no furhter
attempt to commit is made.
Signed-off-by: David Sterba <dsterba@suse.com>
Currently transaction bugs out insided btrfs_start_transaction in case
of error, we want to lift the error handling to the callers. This patch
adds the BUG_ON anywhere it's been missing so far. This is not the best
way of course. Transforming BUG_ON to a proper error handling highly
depends on the caller and should be dealt with case by case.
Signed-off-by: David Sterba <dsterba@suse.com>
The call to btrfs_read_block_groups could loop if the metadata are
damaged (reported eg. for an unaligned block), due to lack of error
handling. We have to check for restored images or currently created
filesystems, that do not contain the blockgroups.
Can be reproduced by fuzzed image bko-155551.
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=155551
Reported-by: Lukas Lueg <lukas.lueg@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Tree blocks are always nodesize. As readahead is only an optimization,
exact size is not required and is only advisory.
Signed-off-by: David Sterba <dsterba@suse.com>
Metadata blocks are always nodesize. When reading the
superblock::sys_array, the actual size of data is fixed to 4k and
smaller than nodesize, but otherwise everything works as before.
Signed-off-by: David Sterba <dsterba@suse.com>
The tree blocks are supposed to be always of nodesize. Before the
parameter has been dropped, it was unlikely but possible to pass a
misaligned value.
Signed-off-by: David Sterba <dsterba@suse.com>