Previously 'btrfs check --check-data-csum' will report tons of false
alerts for RAID56.
Add a test case to make sure with the new RAID56 rebuild ability, there
should be no false alerts.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This new ability is added by:
- Allow btrfs_map_block() to return the chunk type
This makes later work much easier
- Only reset stripe offset inside btrfs_map_block() when needed
Currently if @raid_map is not NULL, btrfs_map_block() will consider
this call is for WRITE and will reset stripe offset.
This is no longer the case, as for RAID56 read with mirror_num 1/0,
we will still call btrfs_map_block() with non-NULL raid_map.
Add a small check to make sure we won't reset stripe offset for
mirror 1/0 read.
- Add new helper read_raid56() to handle rebuild
We will read the full stripe (including all data and P/Q stripes)
do the rebuild, then only copy the refered part to the caller.
There is a catch for RAID6, we have no way to exhaust all combination,
so the current repair will assume the mirror = 0 data is corrupted,
then try to find a missing device.
But if no missing device can be found, it will assume P is corrupted.
This is just a guess, and can to totally wrong, but we have no better
idea.
Now btrfs-progs have full read ability for RAID56.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Those two members are a shortcut for non-RAID56 profiles.
But we should not use such shortcut, and move all our logical address
read/write to the unified read_data_from_disk()/write_data_to_disk().
With previous refactors, now we're safe to remove them.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The function read_extent_from_disk() is only a wrapper to read tree
block.
And read_extent_data() is just a while loop to eliminate short read
caused by stripe boundary.
In fact, a lot of call sites of read_extent_data() are either reading
metadata (thus no possible short read) or doing extra loop by
themselves.
This patch will replace those two functions with read_data_from_disk(),
making it the only entrance for data/metadata read.
And update read_data_from_disk() to return the read bytes, so caller can
do a simple while loop.
For the few callers of read_extent_data(), open-code a small while loop
for them.
This will allow later RAID56 read repair using P/Q much easier.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Function write_extent_to_disk() is just writing the content of a tree
block to disk.
It can not handle RAID56, and its work is the same as
write_data_to_disk().
Thus we can replace write_extent_to_disk() with write_data_to_disk()
easily.
There is only one special call site in write_raid56_with_parity(), which
can easily be replace with btrfs_pwrite() directly.
This reduce the write entrance, and make later eb::fd removal easier.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There are two call sites using write_extent_to_disk() directly:
- debug_corrupt_block() in btrfs-corrupt-block.c
- corrupt_keys() in btrfs-corrupt-block.c
The problem of write_extent_to_disk() is, it can only handle plain
profiles (All profiles except P/Q stripes of RAID56).
Calling it directly can corrupted RAID56 P/Q, and in the future we're
going to remove eb::fd/eb::dev_bytes, so remove such call sites with
write_and_map_eb().
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
For metadata restore, our logical address is mapped to a single device
with logical address 1:1 mapped to device physical address.
Move this part of code into a helper, this will make later extent buffer
read path refactoer much easier.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We used to use read_whole_eb() to read super block, but it's no longer
the case (so long that I can not even find out which commit did the
conversion).
Thus there is no need for read_whole_eb() to handle super block read
anymore.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Previously we had a bug that btrfs check would report false warning for
a sprouted filesystem.
So this patch will add a new test case to make sure neither seed nor
and sprouted filesystem will cause such false warning.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
The following script can lead to false positive from btrfs check:
mkfs.btrfs -f $dev1
mount $dev1 $mnt
btrfstune -S1 $dev1
mount $dev1 $mnt
btrfs dev add -f $dev2 $mnt
umount $mnt
# Now dev1 is seed, and dev2 is the rw fs.
btrfs check $dev2
...
[2/7] checking extents
WARNING: minor unaligned/mismatch device size detected
WARNING: recommended to use 'btrfs rescue fix-device-size' to fix it
...
This false positive only happens on $dev2, $dev1 is completely fine.
[CAUSE]
The warning is from is_super_size_valid(), in that function we verify
the super block total bytes (@super_bytes) is correct against the total
device bytes (@total_bytes).
However the when calculating @total_bytes, we only use devices in
current fs_devices, which only contains RW devices.
Thus all bytes from seed device are not taken into consideration, and
trigger the false positive.
[FIX]
Fix it by also iterating seed devices.
Since we're here, also output @total_bytes and @super_bytes when
outputting the warning message, to allow end users have a better idea on
what's going wrong.
Reviewed-by: Su Yue <l@damenly.su>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
While working on my Windows driver, I found that it was inadvertently
allowing users to create xattrs with names longer than 255 bytes, which
wasn't being picked up by btrfs-check.
If the Linux driver encounters a file with an invalid xattr like this,
it makes the whole directory it's in inaccessible. If it's the root
directory, it'll refuse to mount the filesystem entirely.
Pull-request: #456
Signed-off-by: Mark Harmstone <mark@harmstone.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
With the incoming delayed chunk item insertion feature, there is a super
weird failure at mkfs/022:
====== RUN CHECK ./mkfs.btrfs -f --rootdir tmp.KnKpP5 -d dup -b 350M tests/test.img
...
Checksum: crc32c
Number of devices: 0
Devices:
ID SIZE PATH
Note the "Number of devices: 0" line, this means our
fs_info->fs_devices->devices list is empty.
And since our rw device list is empty, we won't finish the mkfs with
proper superblock magic, and cause later btrfs check to fail.
[CAUSE]
Although the failure is only triggered by the incoming delayed chunk
item insertion feature, the bug itself is here for a while.
In btrfs_alloc_chunk(), we move rw devices to our @private_devs list
first, then in create_chunk(), we move it back to our rw devices list.
This dance is pretty dangerous, especially if btrfs_alloc_dev_extent()
failed inside create_chunk(), and current profile have multiple stripes
(including DUP), we will exit create_chunk() directly, without moving the
remaining devices in @private_devs list back to @dev_list.
Furthermore, btrfs_alloc_chunk() is expected to return -ENOSPC, as we
call btrfs_alloc_chunk() to pre-allocate chunks, and ignore the -ENOSPC
error if it's just a pre-allocation failure.
This existing error path can lead to the empty rw list seen above.
[FIX]
After create_chunk(), unconditionally move all devices in @private_devs
back to rw device list.
And add extra check to make sure our rw device list is never empty after
a chunk allocation attempt.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Function btrfs_start_transaction() will allocate the memory
unconditionally, but if the fs has an aborted transaction we don't free
the allocated memory but return error directly.
Fix it by only allocate the new memory after all the checks.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The new test case will have a image file which has dirty log
(btrfs-image supports dumping log tree).
So we can easily check if "btrfstune -S" will reject fs with dirty log.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
The following sequence operation can lead to a seed fs rejected by
kernel:
# Generate a fs with dirty log
mkfs.btrfs -f $file
mount $dev $mnt
xfs_io -f -c "pwrite 0 16k" -c fsync $mnt/file
cp $file $file.backup
umount $mnt
mv $file.backup $file
# now $file has dirty log, set seed flag on it
btrfstune -S1 $file
# mount will fail
mount $file $mnt
The mount failure with the following dmesg:
[ 980.363667] loop0: detected capacity change from 0 to 262144
[ 980.371177] BTRFS info (device loop0): flagging fs with big metadata feature
[ 980.372229] BTRFS info (device loop0): using free space tree
[ 980.372639] BTRFS info (device loop0): has skinny extents
[ 980.375075] BTRFS info (device loop0): start tree-log replay
[ 980.375513] BTRFS warning (device loop0): log replay required on RO media
[ 980.381652] BTRFS error (device loop0): open_ctree failed
[CAUSE]
Although btrfs will replay its dirty log even with RO mount, but kernel
will treat seed device as RO device, and dirty log can not be replayed
on RO device.
This rejection is already the better end, just imagine if we don't treat
seed device as RO, and replayed the dirty log.
The filesystem relying on the seed device will be completely screwed up.
[FIX]
Just add extra check on log tree in btrfstune to reject setting seed
flag on filesystems with dirty log.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently, if user specifies value 'no' or 'none' on the command line,
it gets translated to an empty value that is passed to kernel. There was
a change in kernel 5.14 done by commit 5548c8c6f55b ("btrfs: props:
change how empty value is interpreted") that changes the behaviour
in that case.
The empty value is supposed to mean 'the default value' for any
property. For compression there is a need to distinguish resetting the
value and also setting the NOCOMPRESS property. The translation to empty
value makes that impossible.
The explanation and behaviour copied from the kernel patch:
Old behaviour:
$ lsattr file
---------------------- file
# the NOCOMPRESS bit is set
$ btrfs prop set file compression ''
$ lsattr file
---------------------m file
This is equivalent to 'btrfs prop set file compression no' in current
btrfs-progs as the 'no' or 'none' values are translated to an empty
string.
This is where the new behaviour is different: empty string drops the
compression flag (-c) and nocompress (-m):
$ lsattr file
---------------------- file
# No change
$ btrfs prop set file compression ''
$ lsattr file
---------------------- file
$ btrfs prop set file compression lzo
$ lsattr file
--------c------------- file
$ btrfs prop get file compression
compression=lzo
$ btrfs prop set file compression ''
# Reset to the initial state
$ lsattr file
---------------------- file
# Set NOCOMPRESS bit
$ btrfs prop set file compression no
$ lsattr file
---------------------m file
This obviously brings problems with backward compatibility, so this
patch should not be backported without making sure the updated
btrfs-progs are also used and that scripts have been updated to use the
new semantics.
Summary:
- old kernel:
no, none, "" - set NOCOMPRESS bit
- new kernel:
no, none - set NOCOMPRESS bit
"" - drop all compression flags, ie. COMPRESS and NOCOMPRESS
Signed-off-by: Li Zhang <zhanglikernel@gmail.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
The previous patch revealed a bug in dev_extent_hole_check_zoned(). If the
given hole is OK to use as is, it should have just returned the hole. But
on the contrary, it shifts the hole start position by one zone. That
results in refusing any hole.
We don't use btrfs_ensure_empty_zones() in the btrfs-progs version of
dev_extent_hole_check_zoned() unlike the kernel side, because
btrfs_find_allocatable_zones() itself is doing the necessary checks. So, we
can just "return changed" if the "pos" is unchanged. That also makes the
loop and "changed" variable unnecessary.
So, fix and simplify the code in one shot.
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The hole_size is used by dev_extent_hole_check() to check the hole is OK as
a device extent. However, commit b031fe84fd ("btrfs-progs: zoned:
implement zoned chunk allocator") mis-ported the kernel code and placed
dev_extent_hole_check() before setting hole_check. That made the
dev_extent_hole_check() call here essentially pass through as we have
hole_size == 0 on mkfs time.
As a result, mkfs.btrfs creates data BG at 64 MB where the regular
superblock exists, when zone size is 16 MB.
Fix the ordering of hole_size setting and calling dev_extent_hole_check().
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently, we create the initial system block group in the zone 2. That
will create the BG at 64MB when the zone size is 32 MB, which collides with
the regular superblock location. It results in mount failure with:
BTRFS info (device nullb0): zoned mode enabled with zone size 33554432
BTRFS error (device nullb0): zoned: block group 67108864 must not contain super block
BTRFS error (device nullb0): failed to read block groups: -117
BTRFS error (device nullb0): open_ctree failed
Fix that by calculating the proper location of the initial system BG. It
avoids using zones reserved for zoned superblock logging and the zones
where a regular superblock resides.
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Move sb_zone_number() and related constants from zoned.c to the
corresponding header for later use.
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
'btrfs inspect-internals dump-tree' doesn't currently know about the two
types of verity items and prints them as 'UNKNOWN.36' or 'UNKNOWN.37'.
So add them to the known item types.
Suggested-by: Boris Burkov <boris@bur.io>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Signed-off-by: David Sterba <dsterba@suse.com>
The new image has incorrect super num devices (have 8 expect 1).
The image has to be raw format, as btrfs-image restore will reset super
num devices to correct value.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
There is a bug report of kernel rejecting fs which has a mismatch in
super num devices and num devices found in chunk tree.
But btrfs-check reports no problem about the fs.
[CAUSE]
We just didn't verify super num devices against the result found in
chunk tree.
[FIX]
Add such check and repair ability for btrfs-check.
The ability is mode independent.
Reported-by: Luca Béla Palkovics <luca.bela.palkovics@gmail.com>
Link: https://lore.kernel.org/linux-btrfs/CA+8xDSpvdm_U0QLBAnrH=zqDq_cWCOH5TiV46CKmp3igr44okQ@mail.gmail.com/
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
There is a weird error message when running btrfs check on a specific
image:
[7/7] checking quota groups
ERROR: out of memory
ERROR: Loading qgroups from disk: -2
ERROR: failed to check quota groups
[CAUSE]
The "Out of memory" one is in load_quota_info(), which is output in two
cases:
- No memory can be allocated for btrfs_qgroup_list
AKA, real -ENOMEM.
- No qgroup can be found for either the child or the parent qgroup
This returnes -ENOENT.
Obvious the image has hit -ENOENT case, but the error message is fixed
to ENOMEM case.
[FIX]
Fix it by using %m to output the real reason of failure.
Reported-by: Andrei Borzenkov <arvidjaar@gmail.com>
Link: https://forums.opensuse.org/showthread.php/567851-btrfs-fails-to-load-qgroups-from-disk-with-error-2-(out-of-memory)
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
With the current set of changes we could probably do this check, but it
would involve changing the code quite a bit, and in the future we're not
going to track the metadata in the extent tree at all. Since this check
was for a very old kernel just skip it for extent tree v2.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Now that we have all of the supporting code, add the ability to create
all of the global roots for an extent tree v2 fs. This will default to
nr_cpu's, but also allow the user to specify how many global roots they
would like.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Our initial block group will use global root id 0 with extent tree v2,
so adjust the helper to take the chunk_objectid as an argument, as we'll
set this to 0 for extent tree v2 and then
BTRFS_FIRST_CHUNK_TREE_OBJECTID for extent tree v1.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We're going to start create global roots from mkfs, and we need to have
a offset set for the root key. Make the btrfs_create_tree() take a key
for the root_key instead of just the objectid so we can setup these new
style roots properly.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
With extent tree v2 we'll have multiple free space trees, and we can't
just unset the feature flags for the free space tree. Fix this to loop
through all of the free space trees and clear them out properly.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The free space tree code already does this, but we need it for cleaning
up per block group roots. Abstract this code out into a helper so that
we can use it in multiple places in the future.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We will now be using block_group->chunk_objectid to point at the global
root id for this particular block group. For now we'll assign this
based on mod'ing the offset of the block group against the number of
global root id's and handle the block_group_item updating appropriately.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In order to make sure the file system is consistent we need to record
the number of global roots we should have in the super block. We could
infer this from the number of global roots we find, however this could
lead to interesting fuzzing problems, so add a source of truth to the
super block in order to make it easier to verify the file system is
consistent.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We need to make sure we process the block group root, and mark its
blocks as used for the free space tree checking.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In the case of per-bg roots we may be missing the root items. To
re-initialize them we want to add the root item as well as allocate the
empty block. To achieve this extract out the reinit root logic to a
helper that just takes the root key and then does the appropriate work
to allocate an empty root and update the root item. Fix the normal
reinit root helper to use this new helper.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The free space tree needs to be validated against all referenced blocks
in the file system, so use the btrfs_mark_used_blocks() helper to check
the free space tree and free space cache against. This will do the
right thing for both extent tree v1 and extent tree v2.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When we switch to per-block group extent roots we'll need to scan each
individual extent root. To make this easier in the future go ahead and
use the range of the block groups to scan the extents.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This makes the appropriate changes to enable the block group tree
checking for both lowmem and normal check modes. This is relatively
straightforward, simply need to use the helper to get the right root for
dealing with block groups.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Add the extent tree v2 table with the block group tree as a root, and
then create the empty root and use the proper root for cleanup up the
temporary block groups.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We have an ASSERT(ret == 0) when populating the free space tree as we
should at least find the block group item with extent tree v1. However
with v2 we no longer have the block group item in the extent tree, so
fix the population logic to handle an empty block group (which occurs
during mkfs) and only assert if ret != 0 and we don't have extent tree
v2 turned on.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When we're messing with block group items use the
btrfs_block_group_root() helper to get the correct root to search, and
this will do the right thing based on the file system flags.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Instead of accessing the extent root directory for modifying block
groups, use the helper which will do the correct thing based on the
flags of the file system.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Add the appropriate support to the print tree and dump tree code to spit
out the block group tree.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This adds the ability to load the block group root, as well as make sure
the various backup super block and super block updates are made
appropriately.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When we change the size of the btrfs_header we're going to need to
change how these helpers calculate where to find the start of items or
block ptrs. To prepare for that make these helpers take the
extent_buffer as an argument so we can do the appropriate math based on
the version type.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We are duplicating the offsetof(btrfs_node, key_ptr) logic everywhere,
instead use the helper to do this work for us, and make all the node
accessors use the helper.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_item_nr_offset(0) is simply offsetof(struct btrfs_leaf, items),
which is the same thing as btrfs_leaf_data(), so replace all calls of
btrfs_item_nr_offset(0) with btrfs_leaf_data().
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Now that all callers are using the _nr variations we can simply rename
these helpers to btrfs_item_##member/btrfs_set_item_##member and change
the actual item SETGET funcs to raw_item_##member/set_raw_item_##member
and then change all callers to drop the _nr part.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
All callers use the btrfs_item_end_nr() variation, simply drop
btrfs_item_end() and make btrfs_item_end_nr() use the _nr() variations
of the item get helpers.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This matches how the kernel does it, simply pass in the slot and fix up
btrfs_file_extent_inline_item_len to use the btrfs_item_nr() helper and
the correct define. Fixup all the callers to use the slot now instead
of passing in the btrfs_item.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>