This is still work in progress but can survive some stress testing.
There are still some sanity checks missing, do not user this on valuable
data. To enables this, configure must be run with the experimental
features enabled.
$ mkfs.btrfs --csum crc32c /dev/sdx
$ <mount, fill with data, unmount>
$ btrfstune --csum sha256
Will change the checksum to sha256.
Implementation:
- set bit on superblock when the checksums are being changed (similar to
the uuid rewrite)
- metadata checksums are overwritten in place
- data checksums:
- the checksum tree is completely deleted and no checksums are
verified
- data blocks are enumerated and all checksums generated (same as
check --init-csum-tree)
To make it usable, it should be restartable and track the current
progress somehow. Also the previous data checksums should be verified
any time they're available.
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
When calling iterate_extent_inodes() on data extents with indirect ref
(with inline or keyed EXTENT_DATA_REF_KEY), it will fail to execute the
call back function at all.
[CAUSE]
In function find_parent_nodes(), we only add the target tree block if a
backref has @parent populated.
For indirect backref like EXTENT_DATA_REF_KEY, we rely on
__resolve_indirect_ref() to get the parent leaves.
However __resolve_indirect_ref() only grabs backrefs from
&prefstate->pending_indirect_refs.
Meaning callers should queue any indirect backref to
pending_indirect_refs.
But unfortunately in __add_prelim_ref() and __add_missing_keys(), none
of them properly queue the indirect backrefs to pending_indirect_refs,
but directly to pending.
Making all indirect backrefs never got resolved, thus no callback
function executed
[FIX]
Fix __add_prelim_ref() and __add_missing_keys() to properly queue
indirect backrefs to the correct list.
Currently there is no such direct user in btrfs-progs, but later csum
tree re-initialization code will rely this to do proper csum
re-calculate (to avoid preallocated/nodatasum extents).
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We are going to start creating multiple sets of global trees, which at
the moment are the free space tree, csum tree, and extent tree.
Generally we will assign these at block group creation time, but Dave
would like to be able to have them per-subvolume at some point, so
reserve a slot for that as well.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Add the on disk definitions for the block group tree. This will be part
of the super block so we need to add the appropriate helpers to the
super block, as well as adding it to the backup roots.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We want to enable developers to test the extent tree v2 features as they
are added, add the ability to mkfs an extent tree v2 fs if we have
experimental enabled.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We are going to have multiples of these trees with extent tree v2, so
add a rb tree to track them based on their root key value. This works
for both v1 and v2, so we can remove the direct pointers to these roots
in our fs_info.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We're going to have multiple free space roots in the future, so access
it via a helper in most cases. We will address the remaining direct
accesses in future patches.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When we switch to multiple global trees we'll need to access the
appropriate extent root depending on the block group or possibly root.
To handle this, use a helper in most places and then the actual root in
places where it is required. We will whittle down the direct accessors
with future patches, but this does the bulk of the preparatory work.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We have this helper sitting in extent-tree.c, but it's a repair
function. I'm going to need to make changes to this for extent-tree-v2
and would rather this live outside of the code we need to share with the
kernel.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
With extent tree v2 we will have per-block group checksums, so add a
helper to access the csum root and rename the fs_info csum_root to
_csum_root to catch all the places that are accessing it directly.
Convert everybody to use the helper except for internal things.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We are going to need to start looking up the csum root based on the
bytenr with extent tree v2. To that end stop passing the root to the
csum related functions so that can be done in the helper functions
themselves.
There's an unrelated deletion of a function prototype that no longer
exists.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We use this pattern in a few places, and will use it more with different
roots in the future. Extract out this helper to read the root nodes.
There is a behavior change here in that we're now checking the root
levels, whereas before we were not.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This is doing the same work as insert_block_group_item, rework it to
call the helper instead.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Just like kernel commit 22b6331d9617 ("btrfs: store precalculated
csum_size in fs_info"), we can cache csum_size and csum_type in
btrfs_fs_info.
Furthermore, there is already a 32 bits hole in btrfs_fs_info, and we
can fit csum_type and csum_size into the hole without increase the size
of btrfs_fs_info.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There are a lot of call sites where we use the following code snippet:
u8 super_block_data[BTRFS_SUPER_INFO_SIZE];
struct btrfs_super_block *sb;
u64 ret;
sb = (struct btrfs_super_block *)super_block_data;
The reason for this is, structure btrfs_super_block was smaller than
BTRFS_SUPER_INFO_SIZE.
Thus for anything with csum involved, we have to use a proper 4K buffer.
Since the recent unification of sizeof(struct btrfs_super_block), we no
longer need such workaround, and can use struct btrfs_super_block
directly to do any operation.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Just like kernel change, pad struct btrfs_super_block to 4096 bytes. As
ctree.h is part of public headers, use raw number for the superblock
offset.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
We noticed 'btrfs check' outputs something like
leaf 30408704 flags 0x0(P1逅?) backref revision 1
but we expected:
leaf 30408704 flags 0x0() backref revision 1
[CAUSE]
Some XX_flags_to_str() failed to make sure the result string always ends
with '\0' in some case.
[FIX]
Reset the buffer at the beginnig.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Wang Yugui (wangyugui@e16-tech.com)
Signed-off-by: David Sterba <dsterba@suse.com>
For consistency with older versions switch the case of 'single' to be
lower case again even if it's inconsistent. This could be revisited in
the future.
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
Since commit dad03fac3b ("btrfs-progs: switch btrfs_group_profile_str
to use raid table"), fstests/btrfs/023 and btrfs/151 will always fail.
The failure of btrfs/151 explains the reason pretty well:
btrfs/151 1s ... - output mismatch
--- tests/btrfs/151.out 2019-10-22 15:18:14.068965341 +0800
+++ ~/xfstests-dev/results//btrfs/151.out.bad 2021-11-02 17:13:43.879999994 +0800
@@ -1,2 +1,2 @@
QA output created by 151
-Data, RAID1
+Data, raid1
...
(Run 'diff -u ~/xfstests-dev/tests/btrfs/151.out ~/xfstests-dev/results//btrfs/151.out.bad' to see the entire diff)
[CAUSE]
Commit dad03fac3b ("btrfs-progs: switch btrfs_group_profile_str to use
raid table") will use btrfs_raid_array[index].raid_name, which is all
lower case.
[FIX]
There is no need to bring such output format change.
So here we split the btrfs_raid_attr::raid_name[] into upper_name[] and
lower_name[], and make upper and lower case helpers for callers to use.
Now there are several types of callers referring to lower_name and
upper_name:
- parse_bg_profile()
It uses strcasecmp(), either case would be fine.
- btrfs_group_profile_str()
Originally it uses upper case for all profiles except "single".
Now unified to upper case.
- sprint_profiles()
It uses lower case.
- bg_flags_to_str()
It uses upper case.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When compiling btrfs-progs on 32bit x86 using GCC 11.1.0, there are
several warnings:
In file included from ./common/utils.h:30,
from check/main.c:36:
check/main.c: In function 'run_next_block':
./common/messages.h:42:31: warning: format '%lu' expects argument of type 'long unsigned int', but argument 4 has type 'u32' {aka 'unsigned int'} [-Wformat=]
42 | __btrfs_error((fmt), ##__VA_ARGS__); \
| ^~~~~
check/main.c:6496:33: note: in expansion of macro 'error'
6496 | error(
| ^~~~~
In file included from ./common/utils.h:30,
from kernel-shared/volumes.c:32:
kernel-shared/volumes.c: In function 'btrfs_check_chunk_valid':
./common/messages.h:42:31: warning: format '%lu' expects argument of type 'long unsigned int', but argument 4 has type 'u32' {aka 'unsigned int'} [-Wformat=]
42 | __btrfs_error((fmt), ##__VA_ARGS__); \
| ^~~~~
kernel-shared/volumes.c:2052:17: note: in expansion of macro 'error'
2052 | error("invalid chunk item size, have %u expect [%zu, %lu)",
| ^~~~~
image/main.c: In function 'search_for_chunk_blocks':
./common/messages.h:42:31: warning: format '%lu' expects argument of type 'long unsigned int', but argument 3 has type 'size_t' {aka 'unsigned int'} [-Wformat=]
42 | __btrfs_error((fmt), ##__VA_ARGS__); \
| ^~~~~
image/main.c:2122:33: note: in expansion of macro 'error'
2122 | error(
| ^~~~~
There are two types of problems:
- __BTRFS_LEAF_DATA_SIZE()
This macro has no type definition, making it behaves differently on
different arches.
Fix this by following kernel to use inline function to make its return
value fixed to u32.
- size_t related output
For x86_64 %lu is OK but not for x86.
Fix this by using %zu.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Commit ("btrfs-progs: switch btrfs_group_profile_str to use raid table")
introduced a regression that raid profile of GlobalReserve will be
printed as 'unknown'.
$ btrfs filesystem df /mnt/test
Data, single: total=5.02TiB, used=4.98TiB
System, single: total=4.00MiB, used=624.00KiB
Metadata, single: total=11.01GiB, used=6.94GiB
GlobalReserve, unknown: total=512.00MiB, used=0.00B
Fix it by:
- take BTRFS_BLOCK_GROUP_RESERVED into account when masking the block
group flags
- update the define of BTRFS_BLOCK_GROUP_RESERVED too so it's same as in
kernel
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Wang Yugui <wangyugui@e16-tech.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
When a special image (diverted from fsck/012) has its unused slots (slot
number >= nritems) with garbage, lowmem mode btrfs check can crash:
(gdb) run check --mode=lowmem ~/downloads/good.img.restored
Starting program: /home/adam/btrfs/btrfs-progs/btrfs check --mode=lowmem ~/downloads/good.img.restored
...
ERROR: root 5 INODE[5044031582654955520] nlink(257228800) not equal to inode_refs(0)
ERROR: root 5 INODE[5044031582654955520] nbytes 474624 not equal to extent_size 0
Program received signal SIGSEGV, Segmentation fault.
0x0000555555639b11 in btrfs_inode_size (eb=0x5555558a7540, s=0x642e6cd1) at ./kernel-shared/ctree.h:1703
1703 BTRFS_SETGET_FUNCS(inode_size, struct btrfs_inode_item, size, 64);
(gdb) bt
#0 0x0000555555639b11 in btrfs_inode_size (eb=0x5555558a7540, s=0x642e6cd1) at ./kernel-shared/ctree.h:1703
#1 0x0000555555641544 in check_inode_item (root=0x5555556c2290, path=0x7fffffffd960) at check/mode-lowmem.c:2628
[CAUSE]
At check_inode_item() we have path->slot[0] at 29, while the tree block
only has 26 items.
This happens because two reasons:
- btrfs_next_item() never reverts its slots
Even if we failed to read next leaf.
- check_inode_item() doesn't inform the caller that a fatal error
happened
In check_inode_item(), if btrfs_next_item() failed, it goes to out
label, which doesn't really set @err properly.
This means, when check_inode_item() fails at btrfs_next_item(), it will
increase path->slots[0], while it's already beyond current tree block
nritems.
When the slot increases furthermore, and if the unused item slots have
some garbage, we will get invalid btrfs_item_ptr() result, and causing
above segfault.
[FIX]
Fix the problems by two ways:
- Make btrfs_next_item() to revert its path->slots[0] on failure
- Properly detect fatal error from check_inode_item()
By this, we will no longer crash on the crafted image.
Reported-by: Wang Yugui <wangyugui@e16-tech.com>
Issue: #412
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
Commit ("btrfs-progs: use raid table for profile names in
print-tree.c") introduced one bug in block group and chunk flags output
and changed the behavior:
item 1 key (FIRST_CHUNK_TREE CHUNK_ITEM 13631488) itemoff 16105 itemsize 80
length 8388608 owner 2 stripe_len 65536 type SINGLE
...
item 2 key (FIRST_CHUNK_TREE CHUNK_ITEM 22020096) itemoff 15993 itemsize 112
length 8388608 owner 2 stripe_len 65536 type DUP
...
item 3 key (FIRST_CHUNK_TREE CHUNK_ITEM 30408704) itemoff 15881 itemsize 112
length 268435456 owner 2 stripe_len 65536 type DUP
...
Note that, the flag string only contains the profile (SINGLE/DUP/etc...)
no type (DATA/METADATA/SYSTEM).
And we have new "SINGLE" string, even that profile has no extra bit to
indicate that.
[CAUSE]
The "SINGLE" part is caused by the raid array which has a name for
SINGLE profile, even it doesn't have the corresponding bit.
The missing type string is caused by a code bug:
strcpy(buf, name);
while (*tmp) {
*tmp = toupper(*tmp);
tmp++;
}
strcpy(ret, buf);
The last strcpy() call overrides the existing string in @ret.
[FIX]
- Enhance string handling using strn*()/snprintf()
- Add extra "UKNOWN.0x%llx" output for unknown profiles
- Call proper strncat() to merge type and profile
- Add extra handling for "SINGLE" to keep the old output
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In function btrfs_reserve_extent(), we call find_free_extent() passing
"u64 profile" into "int data".
This is definitely a width reduction, but when looking further into the
code, it's more serious than that, in fact the "int data" parameter is
not really to indicate whether it's data extent, but really a block
group profile (with block group type).
This is not only width reduction, but also confusing.
Thankfully so for we don't have any BLOCK_GROUP bits beyond 32 bits, so
the width reduction is not causing a big problem.
This patch will rename the "int data" parameter to a more proper one,
"u64 profile" in all involved call paths.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There are several profiles like raid0, raid10, raid5 and raid6 that can
span as many devices as possible and need special handling for the
stripe calculations. Provide a helper to identify the profiles in a
simple way.
Signed-off-by: David Sterba <dsterba@suse.com>
Use the raid table helper to avoid hard coding profiles for the given
number of devices in test_num_disk_vs_raid.
Signed-off-by: David Sterba <dsterba@suse.com>
There's a private helper for parity and there are many open coded
calculations of parity for the RAID56 profiles. The helper will be used
to remove that and use the raid table values.
Signed-off-by: David Sterba <dsterba@suse.com>
The enumeration could get out of date, like fixed in previous commit.
Create a helper that will hide the implementation details.
Signed-off-by: David Sterba <dsterba@suse.com>
There's opencoded value of raid table ncopies in
print_filesystem_usage_overall, add a helper and use it.
Signed-off-by: David Sterba <dsterba@suse.com>
Another duplication of the raid table, in this case missing the changes
to raid10 and raid0 minimum devices changed in a177ef7dd4
("btrfs-progs: mkfs: allow degenerate raid0/raid10").
Define and use a helper using the table value.
Signed-off-by: David Sterba <dsterba@suse.com>
We need to use direct-IO for zoned devices to preserve the write
ordering. Instead of detecting if the device is zoned or not, we simply
use direct-IO for any kind of device (even if emulated zoned mode on a
regular device).
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Functions to read data/metadata e.g. read_extent_from_disk() now depend on
the fs_info->zoned flag to determine if they do direct-IO or not.
The flag (and zone_size) is not known before reading the chunk tree and it
set to 0 while in the initial chunk tree setup process. That will cause
btrfs_pread() to fail because it does not align the buffer.
Use fcntl() to find out the file descriptor is opened with O_DIRECT or not,
and if it is, set the zoned flag to 1 temporally for this initial process.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Wrap pread with btrfs_pread as well.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Wrap pwrite with btrfs_pwrite(). It simply calls pwrite() on non-zoned
btrfs (opened without O_DIRECT). On zoned mode (opened with O_DIRECT),
it allocates an aligned bounce buffer, copies the contents and uses it
for direct-IO writing.
Writes in device_zero_blocks() and btrfs_wipe_existing_sb() are a little
tricky. We don't have fs_info on our hands, so use zinfo to determine it
is a zoned device or not.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Several extent_buffer initializations miss fs_info initialization. This
is OK before the following patch ("btrfs-progs: use direct-io for zoned
device") as eb->fs_info is not always necessary. But, after that patch,
we will use fs_info to determine it is zoned or not and that causes
segfault in such cases.
Properly set fs_info when initializing extent_buffers to fix the issue.
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We don't need to include this besides btrfs-list.c itself and
subvolume.c that does use the btrfs_list_* API.
Signed-off-by: David Sterba <dsterba@suse.com>
device_get_partition_size_fd() fails if we pass a regular file. This can
happen when trying to create an emulated zoned filesystem on a regular file.
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Make the helpers using crc32c not inline so the crc32c.h can be removed
from the public headers exported by libbtrfs.
Signed-off-by: David Sterba <dsterba@suse.com>
There's an ancient macro btrfs_crc32c which is just wrapping crc32c and
not doing anything else, so we can use the crc helper directly.
Signed-off-by: David Sterba <dsterba@suse.com>
To drop sizes.h from exported headers, replace the few SZ_ constants
from the existing exported headers (ctree.h, send.h). It would be nice
to use them in the long run but right now it would prevent unexporting
the sizes.h file.
Signed-off-by: David Sterba <dsterba@suse.com>