Drop the devid argument, it can be fetched from the disk_super argument.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This syncs tree-checker.c from the kernel. The main modification was to
add a open ctree flag to skip the deeper leaf checks, and plumbing this
through tree-checker.c. We need this for things like fsck or
btrfs-image that need to work with slightly corrupted file systems, and
these checks simply make us unable to look at the corrupted blocks.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In btrfs-progs we check the actual leaf pointers as well as the chunk
itself in btrfs_check_chunk_valid. However in the kernel the leaf stuff
is handled separately as part of the read, and then we have the chunk
checker itself. Change the btrfs-progs version to match the in-kernel
version temporarily so it makes syncing the in-kernel code easier.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[FALSE ALERT]
Clang 15.0.7 gives the following false alert in get_dev_extent_len():
kernel-shared/extent-tree.c:3328:2: warning: variable 'div' is used uninitialized whenever switch default is taken [-Wsometimes-uninitialized]
default:
^~~~~~~
kernel-shared/extent-tree.c:3332:24: note: uninitialized use occurs here
return map->ce.size / div;
^~~
kernel-shared/extent-tree.c:3311:9: note: initialize the variable 'div' to silence this warning
int div;
^
= 0
And one in btrfs_stripe_length() too:
kernel-shared/volumes.c:2781:2: warning: variable 'stripe_len' is used uninitialized whenever switch default is taken [-Wsometimes-uninitialized]
default:
^~~~~~~
kernel-shared/volumes.c:2785:9: note: uninitialized use occurs here
return stripe_len;
^~~~~~~~~~
kernel-shared/volumes.c:2754:16: note: initialize the variable 'stripe_len' to silence this warning
u64 stripe_len;
^
= 0
[CAUSE]
Clang doesn't really understand what BUG_ON() means, thus in that
default case, we won't get uninitialized value but crash directly.
[FIX]
Silent the errors by assigning the default value properly using the
value of SINGLE profile.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There's a group of helpers to read device size, the btrfs_device_size
should be one of them. Rename it and so minor cleanup.
Signed-off-by: David Sterba <dsterba@suse.com>
If we found that the underlying block device size is smaller than
total_bytes in dev item, kernel will reject the mount, and there is no
progs tool to fix it.
Under most case it's just a small mismatch, and there is no dev extent
in the shrunk range.
In that case, we can let "btrfs rescue fix-device-size" to reset the
total_bytes in dev items to fix.
We add some extra checks, like to make sure there is no dev extent in
the shrunk device range, to make sure we won't lose data during the
device item shrink.
And also update the test case to verify the repaired fs can pass the
check.
Issue: #504
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There are several call sites utilizing btrfs_commit_transaction() just
to update members in super blocks, without any metadata update.
This can be problematic for some simple call sites, like zero_log_tree()
or check_and_repair_super_num_devs().
If we have big problems preventing the fs to be mounted in the first
place, and need to clear the log or super block size, but by some other
problems in extent tree, we're unable to allocate new blocks.
Then we fall into a deadlock that, we need to mount (even
ro,rescue=all) to collect extra info, but btrfs-progs can not do any
super block updates.
Fix the problem by allowing the following super blocks only operations
to be done without using btrfs_commit_transaction():
- btrfs_fix_super_size()
- check_and_repair_super_num_devs()
- zero_log_tree().
There are some exceptions in btrfstune.c, related to the csum type
conversion and seed flags.
In those btrfstune cases, we in fact wants to proper error report in
btrfs_commit_transaction(), as those operations are not mount critical,
and any early error can be helpful to expose any problems in the fs.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This new ability is added by:
- Allow btrfs_map_block() to return the chunk type
This makes later work much easier
- Only reset stripe offset inside btrfs_map_block() when needed
Currently if @raid_map is not NULL, btrfs_map_block() will consider
this call is for WRITE and will reset stripe offset.
This is no longer the case, as for RAID56 read with mirror_num 1/0,
we will still call btrfs_map_block() with non-NULL raid_map.
Add a small check to make sure we won't reset stripe offset for
mirror 1/0 read.
- Add new helper read_raid56() to handle rebuild
We will read the full stripe (including all data and P/Q stripes)
do the rebuild, then only copy the refered part to the caller.
There is a catch for RAID6, we have no way to exhaust all combination,
so the current repair will assume the mirror = 0 data is corrupted,
then try to find a missing device.
But if no missing device can be found, it will assume P is corrupted.
This is just a guess, and can to totally wrong, but we have no better
idea.
Now btrfs-progs have full read ability for RAID56.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Those two members are a shortcut for non-RAID56 profiles.
But we should not use such shortcut, and move all our logical address
read/write to the unified read_data_from_disk()/write_data_to_disk().
With previous refactors, now we're safe to remove them.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Function write_extent_to_disk() is just writing the content of a tree
block to disk.
It can not handle RAID56, and its work is the same as
write_data_to_disk().
Thus we can replace write_extent_to_disk() with write_data_to_disk()
easily.
There is only one special call site in write_raid56_with_parity(), which
can easily be replace with btrfs_pwrite() directly.
This reduce the write entrance, and make later eb::fd removal easier.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
With the incoming delayed chunk item insertion feature, there is a super
weird failure at mkfs/022:
====== RUN CHECK ./mkfs.btrfs -f --rootdir tmp.KnKpP5 -d dup -b 350M tests/test.img
...
Checksum: crc32c
Number of devices: 0
Devices:
ID SIZE PATH
Note the "Number of devices: 0" line, this means our
fs_info->fs_devices->devices list is empty.
And since our rw device list is empty, we won't finish the mkfs with
proper superblock magic, and cause later btrfs check to fail.
[CAUSE]
Although the failure is only triggered by the incoming delayed chunk
item insertion feature, the bug itself is here for a while.
In btrfs_alloc_chunk(), we move rw devices to our @private_devs list
first, then in create_chunk(), we move it back to our rw devices list.
This dance is pretty dangerous, especially if btrfs_alloc_dev_extent()
failed inside create_chunk(), and current profile have multiple stripes
(including DUP), we will exit create_chunk() directly, without moving the
remaining devices in @private_devs list back to @dev_list.
Furthermore, btrfs_alloc_chunk() is expected to return -ENOSPC, as we
call btrfs_alloc_chunk() to pre-allocate chunks, and ignore the -ENOSPC
error if it's just a pre-allocation failure.
This existing error path can lead to the empty rw list seen above.
[FIX]
After create_chunk(), unconditionally move all devices in @private_devs
back to rw device list.
And add extra check to make sure our rw device list is never empty after
a chunk allocation attempt.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The previous patch revealed a bug in dev_extent_hole_check_zoned(). If the
given hole is OK to use as is, it should have just returned the hole. But
on the contrary, it shifts the hole start position by one zone. That
results in refusing any hole.
We don't use btrfs_ensure_empty_zones() in the btrfs-progs version of
dev_extent_hole_check_zoned() unlike the kernel side, because
btrfs_find_allocatable_zones() itself is doing the necessary checks. So, we
can just "return changed" if the "pos" is unchanged. That also makes the
loop and "changed" variable unnecessary.
So, fix and simplify the code in one shot.
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The hole_size is used by dev_extent_hole_check() to check the hole is OK as
a device extent. However, commit b031fe84fd ("btrfs-progs: zoned:
implement zoned chunk allocator") mis-ported the kernel code and placed
dev_extent_hole_check() before setting hole_check. That made the
dev_extent_hole_check() call here essentially pass through as we have
hole_size == 0 on mkfs time.
As a result, mkfs.btrfs creates data BG at 64 MB where the regular
superblock exists, when zone size is 16 MB.
Fix the ordering of hole_size setting and calling dev_extent_hole_check().
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Now that all callers are using the _nr variations we can simply rename
these helpers to btrfs_item_##member/btrfs_set_item_##member and change
the actual item SETGET funcs to raw_item_##member/set_raw_item_##member
and then change all callers to drop the _nr part.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When we switch to multiple global trees we'll need to access the
appropriate extent root depending on the block group or possibly root.
To handle this, use a helper in most places and then the actual root in
places where it is required. We will whittle down the direct accessors
with future patches, but this does the bulk of the preparatory work.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There are a lot of call sites where we use the following code snippet:
u8 super_block_data[BTRFS_SUPER_INFO_SIZE];
struct btrfs_super_block *sb;
u64 ret;
sb = (struct btrfs_super_block *)super_block_data;
The reason for this is, structure btrfs_super_block was smaller than
BTRFS_SUPER_INFO_SIZE.
Thus for anything with csum involved, we have to use a proper 4K buffer.
Since the recent unification of sizeof(struct btrfs_super_block), we no
longer need such workaround, and can use struct btrfs_super_block
directly to do any operation.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
For consistency with older versions switch the case of 'single' to be
lower case again even if it's inconsistent. This could be revisited in
the future.
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
Since commit dad03fac3b ("btrfs-progs: switch btrfs_group_profile_str
to use raid table"), fstests/btrfs/023 and btrfs/151 will always fail.
The failure of btrfs/151 explains the reason pretty well:
btrfs/151 1s ... - output mismatch
--- tests/btrfs/151.out 2019-10-22 15:18:14.068965341 +0800
+++ ~/xfstests-dev/results//btrfs/151.out.bad 2021-11-02 17:13:43.879999994 +0800
@@ -1,2 +1,2 @@
QA output created by 151
-Data, RAID1
+Data, raid1
...
(Run 'diff -u ~/xfstests-dev/tests/btrfs/151.out ~/xfstests-dev/results//btrfs/151.out.bad' to see the entire diff)
[CAUSE]
Commit dad03fac3b ("btrfs-progs: switch btrfs_group_profile_str to use
raid table") will use btrfs_raid_array[index].raid_name, which is all
lower case.
[FIX]
There is no need to bring such output format change.
So here we split the btrfs_raid_attr::raid_name[] into upper_name[] and
lower_name[], and make upper and lower case helpers for callers to use.
Now there are several types of callers referring to lower_name and
upper_name:
- parse_bg_profile()
It uses strcasecmp(), either case would be fine.
- btrfs_group_profile_str()
Originally it uses upper case for all profiles except "single".
Now unified to upper case.
- sprint_profiles()
It uses lower case.
- bg_flags_to_str()
It uses upper case.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When compiling btrfs-progs on 32bit x86 using GCC 11.1.0, there are
several warnings:
In file included from ./common/utils.h:30,
from check/main.c:36:
check/main.c: In function 'run_next_block':
./common/messages.h:42:31: warning: format '%lu' expects argument of type 'long unsigned int', but argument 4 has type 'u32' {aka 'unsigned int'} [-Wformat=]
42 | __btrfs_error((fmt), ##__VA_ARGS__); \
| ^~~~~
check/main.c:6496:33: note: in expansion of macro 'error'
6496 | error(
| ^~~~~
In file included from ./common/utils.h:30,
from kernel-shared/volumes.c:32:
kernel-shared/volumes.c: In function 'btrfs_check_chunk_valid':
./common/messages.h:42:31: warning: format '%lu' expects argument of type 'long unsigned int', but argument 4 has type 'u32' {aka 'unsigned int'} [-Wformat=]
42 | __btrfs_error((fmt), ##__VA_ARGS__); \
| ^~~~~
kernel-shared/volumes.c:2052:17: note: in expansion of macro 'error'
2052 | error("invalid chunk item size, have %u expect [%zu, %lu)",
| ^~~~~
image/main.c: In function 'search_for_chunk_blocks':
./common/messages.h:42:31: warning: format '%lu' expects argument of type 'long unsigned int', but argument 3 has type 'size_t' {aka 'unsigned int'} [-Wformat=]
42 | __btrfs_error((fmt), ##__VA_ARGS__); \
| ^~~~~
image/main.c:2122:33: note: in expansion of macro 'error'
2122 | error(
| ^~~~~
There are two types of problems:
- __BTRFS_LEAF_DATA_SIZE()
This macro has no type definition, making it behaves differently on
different arches.
Fix this by following kernel to use inline function to make its return
value fixed to u32.
- size_t related output
For x86_64 %lu is OK but not for x86.
Fix this by using %zu.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There are several profiles like raid0, raid10, raid5 and raid6 that can
span as many devices as possible and need special handling for the
stripe calculations. Provide a helper to identify the profiles in a
simple way.
Signed-off-by: David Sterba <dsterba@suse.com>
Use the raid table helper to avoid hard coding profiles for the given
number of devices in test_num_disk_vs_raid.
Signed-off-by: David Sterba <dsterba@suse.com>
There's a private helper for parity and there are many open coded
calculations of parity for the RAID56 profiles. The helper will be used
to remove that and use the raid table values.
Signed-off-by: David Sterba <dsterba@suse.com>
There's opencoded value of raid table ncopies in
print_filesystem_usage_overall, add a helper and use it.
Signed-off-by: David Sterba <dsterba@suse.com>
Another duplication of the raid table, in this case missing the changes
to raid10 and raid0 minimum devices changed in a177ef7dd4
("btrfs-progs: mkfs: allow degenerate raid0/raid10").
Define and use a helper using the table value.
Signed-off-by: David Sterba <dsterba@suse.com>
We need to use direct-IO for zoned devices to preserve the write
ordering. Instead of detecting if the device is zoned or not, we simply
use direct-IO for any kind of device (even if emulated zoned mode on a
regular device).
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Several extent_buffer initializations miss fs_info initialization. This
is OK before the following patch ("btrfs-progs: use direct-io for zoned
device") as eb->fs_info is not always necessary. But, after that patch,
we will use fs_info to determine it is zoned or not and that causes
segfault in such cases.
Properly set fs_info when initializing extent_buffers to fix the issue.
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Add the GPL v2 header to files where it was missing and is not from an
external source, update to the most recent version with the address.
Signed-off-by: David Sterba <dsterba@suse.com>
Kernel patch b2f78e88052bc0bee ("btrfs: allow degenerate raid0/raid10")
in
5.15 will allow mounting and converting to single device raid0 or two
device raid10. Let mkfs create such filesystem.
"The motivation is to allow to preserve the profile type as long as it
possible for some intermediate state (device removal, conversion), or
when there are disks of different size, with raid0 the otherwise
unusable space of the last device will be used too. Similarly for
raid10, though the two largest devices would need to be the same."
Signed-off-by: David Sterba <dsterba@suse.com>
Implement a zoned chunk and device extent allocator. One device zone
becomes a device extent so that a zone reset affects only this device
extent and does not change the state of blocks in the neighbor device
extents.
To implement the allocator, we need to extend the following functions for
a zoned filesystem:
- init_alloc_chunk_ctl
- dev_extent_search_start
- dev_extent_hole_check
- decide_stripe_size
Here, dev_extent_hole_check() is newly introduced to check the validity of
a hole found.
init_alloc_chunk_ctl_zoned() is mostly the same as regular one. It always
set the stripe_size to the zone size and aligns the parameters to the zone
size.
dev_extent_search_start() only aligns the start offset to zone boundaries.
We don't care about the first 1MB like in regular filesystem because we
anyway reserve the first two zones for superblock logging.
dev_extent_hole_check_zoned() checks if zones in given hole are either
conventional or empty sequential zones. Also, it skips zones reserved for
superblock logging.
With the change to the hole, the new hole may now contain pending extents.
So, in this case, loop again to check that.
Finally, decide_stripe_size_zoned() should shrink the number of devices
instead of stripe size because we need to honor stripe_size == zone_size.
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Get the zone information (number of zones and zone size) from all the
devices, if the volume contains a zoned block device. To avoid costly
run-time zone report commands to test the device zones type during block
allocation, it also records all the zone status (zone type, write
pointer position, etc.).
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Likewise in the kernel code, provide fs_info access from struct
btrfs_device. This will help to unify the code between the kernel and
the userland.
Since fs_info can be NULL at the time of btrfs_add_to_fsid(), let's use
btrfs_open_devices() to set fs_info to the devices.
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
alloc_chunk_ctl::calc_size is actually the stripe_size in the kernel
side code. Let's rename it to clarify what the "calc" is.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Chunk_bytes_by_type() takes type, calc_size, and ctl as arguments. But
the first two can be obtained from the ctl. Let's drop these arguments
for simplicity.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Since commit b9444efb66 ("btrfs-progs: don't pretend RAID56 has a
different stripe length"), alloc_chunk_ctl::stripe_len is always fixed
to BTRFS_STRIPE_LEN. Let's replace alloc_chunk_ctl::stripe_len with
BTRFS_STRIPE_LEN, like in the kernel code.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Several calculations in the chunk allocation process use this pattern.
x /= y;
x *= y;
Replace this pattern with round_down().
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In the DUP profile, we can use only half of the space available in a
device extent. Fix the calculation of calc_size for it.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_alloc_data_chunk() and create_chunk() have the most part in common.
Let's rewrite btrfs_alloc_data_chunk() using create_chunk().
There are two differences between btrfs_alloc_data_chunk() and
create_chunk(). create_chunk() uses find_next_chunk() to decide the
logical address of the chunk, and it uses btrfs_alloc_dev_extent() to
decide the physical address of a device extent. On the other hand,
btrfs_alloc_data_chunk() uses *start for both logical and physical
addresses.
To support the btrfs_alloc_data_chunk()'s use case, we use ctl->start
and ctl->dev_offset. If these values are set (non-zero), use the
specified values as the address. It is safe to use 0 to indicate the
value is not set here. Because both lower addresses of logical
(0..BTRFS_FIRST_CHUNK_TREE_OBJECT_ID) and physical
(0..BTRFS_BLOCK_RESERVED_1M_FOR_SUPER) are reserved.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Factor out create_chunk() from btrfs_alloc_chunk(). This new function
creates a chunk.
There is no functional changes.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Factor out decide_stripe_size() from btrfs_alloc_chunk(). This new
function calculates the actual stripe size to allocate and decides the
size of a stripe (ctl->calc_size).
This commit has no functional changes.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>