Commit Graph

196 Commits

Author SHA1 Message Date
Josef Bacik
7119dc3d79 btrfs-progs: simplify btrfs_make_block_group
This is doing the same work as insert_block_group_item, rework it to
call the helper instead.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-11-22 21:45:37 +01:00
Qu Wenruo
c4ff87c3d1 btrfs-progs: cache csum_size and csum_type in btrfs_fs_info
Just like kernel commit 22b6331d9617 ("btrfs: store precalculated
csum_size in fs_info"), we can cache csum_size and csum_type in
btrfs_fs_info.

Furthermore, there is already a 32 bits hole in btrfs_fs_info, and we
can fit csum_type and csum_size into the hole without increase the size
of btrfs_fs_info.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-11-05 12:50:03 +01:00
Qu Wenruo
636b2e6027 btrfs-progs: remove temporary buffer for super block
There are a lot of call sites where we use the following code snippet:

	u8 super_block_data[BTRFS_SUPER_INFO_SIZE];
	struct btrfs_super_block *sb;
	u64 ret;

	sb = (struct btrfs_super_block *)super_block_data;

The reason for this is, structure btrfs_super_block was smaller than
BTRFS_SUPER_INFO_SIZE.

Thus for anything with csum involved, we have to use a proper 4K buffer.

Since the recent unification of sizeof(struct btrfs_super_block), we no
longer need such workaround, and can use struct btrfs_super_block
directly to do any operation.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-11-05 12:50:03 +01:00
Qu Wenruo
76f1a2ed57 btrfs-progs: unify size of btrfs_super_block and BTRFS_SUPER_INFO_SIZE
Just like kernel change, pad struct btrfs_super_block to 4096 bytes. As
ctree.h is part of public headers, use raw number for the superblock
offset.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-11-05 12:50:03 +01:00
Wang Yugui
e9a7efb1b6 btrfs-progs: fix XX_flags_to_str() to always end with '\0'
[BUG]
We noticed 'btrfs check' outputs something like

  leaf 30408704 flags 0x0(P1逅?) backref revision 1

but we expected:

  leaf 30408704 flags 0x0() backref revision 1

[CAUSE]
Some XX_flags_to_str() failed to make sure the result string always ends
with '\0' in some case.

[FIX]
Reset the buffer at the beginnig.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Wang Yugui (wangyugui@e16-tech.com)
Signed-off-by: David Sterba <dsterba@suse.com>
2021-11-05 12:50:03 +01:00
David Sterba
4882a4f5fa btrfs-porgs: add exception for upper case single profile name
For consistency with older versions switch the case of 'single' to be
lower case again even if it's inconsistent. This could be revisited in
the future.

Signed-off-by: David Sterba <dsterba@suse.com>
2021-11-05 12:50:03 +01:00
Qu Wenruo
b1c944657c btrfs-progs: make "btrfs filesystem df" command show upper case profile
[BUG]
Since commit dad03fac3b ("btrfs-progs: switch btrfs_group_profile_str
to use raid table"), fstests/btrfs/023 and btrfs/151 will always fail.

The failure of btrfs/151 explains the reason pretty well:

btrfs/151 1s ... - output mismatch
    --- tests/btrfs/151.out	2019-10-22 15:18:14.068965341 +0800
    +++ ~/xfstests-dev/results//btrfs/151.out.bad	2021-11-02 17:13:43.879999994 +0800
    @@ -1,2 +1,2 @@
     QA output created by 151
    -Data, RAID1
    +Data, raid1
    ...
    (Run 'diff -u ~/xfstests-dev/tests/btrfs/151.out ~/xfstests-dev/results//btrfs/151.out.bad'  to see the entire diff)

[CAUSE]
Commit dad03fac3b ("btrfs-progs: switch btrfs_group_profile_str to use
raid table") will use btrfs_raid_array[index].raid_name, which is all
lower case.

[FIX]
There is no need to bring such output format change.

So here we split the btrfs_raid_attr::raid_name[] into upper_name[] and
lower_name[], and make upper and lower case helpers for callers to use.

Now there are several types of callers referring to lower_name and
upper_name:

- parse_bg_profile()
  It uses strcasecmp(), either case would be fine.

- btrfs_group_profile_str()
  Originally it uses upper case for all profiles except "single".
  Now unified to upper case.

- sprint_profiles()
  It uses lower case.

- bg_flags_to_str()
  It uses upper case.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-11-05 12:50:03 +01:00
Qu Wenruo
5bee5c99bf btrfs-progs: fix printf formats on 32bit x86
When compiling btrfs-progs on 32bit x86 using GCC 11.1.0, there are
several warnings:

  In file included from ./common/utils.h:30,
                   from check/main.c:36:
  check/main.c: In function 'run_next_block':
  ./common/messages.h:42:31: warning: format '%lu' expects argument of type 'long unsigned int', but argument 4 has type 'u32' {aka 'unsigned int'} [-Wformat=]
     42 |                 __btrfs_error((fmt), ##__VA_ARGS__);                    \
        |                               ^~~~~
  check/main.c:6496:33: note: in expansion of macro 'error'
   6496 |                                 error(
        |                                 ^~~~~

  In file included from ./common/utils.h:30,
                   from kernel-shared/volumes.c:32:
  kernel-shared/volumes.c: In function 'btrfs_check_chunk_valid':
  ./common/messages.h:42:31: warning: format '%lu' expects argument of type 'long unsigned int', but argument 4 has type 'u32' {aka 'unsigned int'} [-Wformat=]
     42 |                 __btrfs_error((fmt), ##__VA_ARGS__);                    \
        |                               ^~~~~
  kernel-shared/volumes.c:2052:17: note: in expansion of macro 'error'
   2052 |                 error("invalid chunk item size, have %u expect [%zu, %lu)",
        |                 ^~~~~

  image/main.c: In function 'search_for_chunk_blocks':
  ./common/messages.h:42:31: warning: format '%lu' expects argument of type 'long unsigned int', but argument 3 has type 'size_t' {aka 'unsigned int'} [-Wformat=]
     42 |                 __btrfs_error((fmt), ##__VA_ARGS__);                    \
        |                               ^~~~~
  image/main.c:2122:33: note: in expansion of macro 'error'
   2122 |                                 error(
        |                                 ^~~~~

There are two types of problems:

- __BTRFS_LEAF_DATA_SIZE()
  This macro has no type definition, making it behaves differently on
  different arches.

  Fix this by following kernel to use inline function to make its return
  value fixed to u32.

- size_t related output
  For x86_64 %lu is OK but not for x86.

  Fix this by using %zu.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-11-05 12:50:03 +01:00
Wang Yugui
b1d8f945c9 btrfs-progs: mask out all unwanted profiles in btrfs_group_profile_str
Commit ("btrfs-progs: switch btrfs_group_profile_str to use raid table")
introduced a regression that raid profile of GlobalReserve will be
printed as 'unknown'.

  $ btrfs filesystem df /mnt/test
  Data, single: total=5.02TiB, used=4.98TiB
  System, single: total=4.00MiB, used=624.00KiB
  Metadata, single: total=11.01GiB, used=6.94GiB
  GlobalReserve, unknown: total=512.00MiB, used=0.00B

Fix it by:

- take BTRFS_BLOCK_GROUP_RESERVED into account when masking the block
  group flags
- update the define of BTRFS_BLOCK_GROUP_RESERVED too so it's same as in
  kernel

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Wang Yugui <wangyugui@e16-tech.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-11-05 12:50:03 +01:00
Qu Wenruo
8f81113021 btrfs-progs: check: fix a lowmem mode crash where fatal error is not properly handled
[BUG]
When a special image (diverted from fsck/012) has its unused slots (slot
number >= nritems) with garbage, lowmem mode btrfs check can crash:

  (gdb) run check --mode=lowmem ~/downloads/good.img.restored
  Starting program: /home/adam/btrfs/btrfs-progs/btrfs check --mode=lowmem ~/downloads/good.img.restored
  ...
  ERROR: root 5 INODE[5044031582654955520] nlink(257228800) not equal to inode_refs(0)
  ERROR: root 5 INODE[5044031582654955520] nbytes 474624 not equal to extent_size 0

  Program received signal SIGSEGV, Segmentation fault.
  0x0000555555639b11 in btrfs_inode_size (eb=0x5555558a7540, s=0x642e6cd1) at ./kernel-shared/ctree.h:1703
  1703	BTRFS_SETGET_FUNCS(inode_size, struct btrfs_inode_item, size, 64);
  (gdb) bt
  #0  0x0000555555639b11 in btrfs_inode_size (eb=0x5555558a7540, s=0x642e6cd1) at ./kernel-shared/ctree.h:1703
  #1  0x0000555555641544 in check_inode_item (root=0x5555556c2290, path=0x7fffffffd960) at check/mode-lowmem.c:2628

[CAUSE]
At check_inode_item() we have path->slot[0] at 29, while the tree block
only has 26 items.

This happens because two reasons:

- btrfs_next_item() never reverts its slots
  Even if we failed to read next leaf.

- check_inode_item() doesn't inform the caller that a fatal error
  happened
  In check_inode_item(), if btrfs_next_item() failed, it goes to out
  label, which doesn't really set @err properly.

This means, when check_inode_item() fails at btrfs_next_item(), it will
increase path->slots[0], while it's already beyond current tree block
nritems.

When the slot increases furthermore, and if the unused item slots have
some garbage, we will get invalid btrfs_item_ptr() result, and causing
above segfault.

[FIX]
Fix the problems by two ways:

- Make btrfs_next_item() to revert its path->slots[0] on failure

- Properly detect fatal error from check_inode_item()

By this, we will no longer crash on the crafted image.

Reported-by: Wang Yugui <wangyugui@e16-tech.com>
Issue: #412
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-11-04 20:56:42 +01:00
Qu Wenruo
eacdd1606c btrfs-progs: print-tree: fix chunk/block group flags output
[BUG]
Commit ("btrfs-progs: use raid table for profile names in
print-tree.c") introduced one bug in block group and chunk flags output
and changed the behavior:

	item 1 key (FIRST_CHUNK_TREE CHUNK_ITEM 13631488) itemoff 16105 itemsize 80
		length 8388608 owner 2 stripe_len 65536 type SINGLE
		...
	item 2 key (FIRST_CHUNK_TREE CHUNK_ITEM 22020096) itemoff 15993 itemsize 112
		length 8388608 owner 2 stripe_len 65536 type DUP
		...
	item 3 key (FIRST_CHUNK_TREE CHUNK_ITEM 30408704) itemoff 15881 itemsize 112
		length 268435456 owner 2 stripe_len 65536 type DUP
		...

Note that, the flag string only contains the profile (SINGLE/DUP/etc...)
no type (DATA/METADATA/SYSTEM).

And we have new "SINGLE" string, even that profile has no extra bit to
indicate that.

[CAUSE]
The "SINGLE" part is caused by the raid array which has a name for
SINGLE profile, even it doesn't have the corresponding bit.

The missing type string is caused by a code bug:

		strcpy(buf, name);
		while (*tmp) {
			*tmp = toupper(*tmp);
			tmp++;
		}
		strcpy(ret, buf);

The last strcpy() call overrides the existing string in @ret.

[FIX]
- Enhance string handling using strn*()/snprintf()

- Add extra "UKNOWN.0x%llx" output for unknown profiles

- Call proper strncat() to merge type and profile

- Add extra handling for "SINGLE" to keep the old output

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-20 18:59:24 +02:00
Qu Wenruo
f4c712e024 btrfs-progs: rename data parameter to profile in extent allocation path
In function btrfs_reserve_extent(), we call find_free_extent() passing
"u64 profile" into "int data".

This is definitely a width reduction, but when looking further into the
code, it's more serious than that, in fact the "int data" parameter is
not really to indicate whether it's data extent, but really a block
group profile (with block group type).

This is not only width reduction, but also confusing.

Thankfully so for we don't have any BLOCK_GROUP bits beyond 32 bits, so
the width reduction is not causing a big problem.

This patch will rename the "int data" parameter to a more proper one,
"u64 profile" in all involved call paths.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-20 18:59:24 +02:00
David Sterba
bc28dc6bea btrfs-progs: introduce helper for striped profiles
There are several profiles like raid0, raid10, raid5 and raid6 that can
span as many devices as possible and need special handling for the
stripe calculations. Provide a helper to identify the profiles in a
simple way.

Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-20 18:59:24 +02:00
David Sterba
a25a5cc2c0 btrfs-progs: use btrfs_bg_type_to_nparity in btrfs_stripe_length
Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-20 18:59:24 +02:00
David Sterba
11fcdbc35e btrfs-progs: introduce helper to get allowed profiles for a given device number
Use the raid table helper to avoid hard coding profiles for the given
number of devices in test_num_disk_vs_raid.

Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-20 18:59:24 +02:00
David Sterba
cb0b63cd90 btrfs-progs: use raid table value for sub_stripes in btrfs_check_chunk_valid
Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-20 18:59:24 +02:00
David Sterba
4f662d74fd btrfs-progs: export raid table helper for sub_stripes
Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-20 18:59:24 +02:00
David Sterba
833ce53872 btrfs-progs: use btrfs_bg_type_to_nparity in calc_stripe_length
Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-20 18:59:24 +02:00
David Sterba
e3355b43b4 btrfs-progs: use raid table for min devs in btrfs_check_chunk_valid
Replace the hard coded values with the raid table reference.

Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-20 18:59:24 +02:00
David Sterba
3d05b20435 btrfs-progs: use btrfs_bg_type_to_nparity in chunk_bytes_by_type
Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-20 18:59:23 +02:00
David Sterba
359710d4dd btrfs-progs: use btrfs_bg_type_to_nparity in get_dev_extent_len
Stripe calculation with hard coded parity, use the helper.

Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-20 18:59:23 +02:00
David Sterba
f759e9bbad btrfs-progs: introduce a public helper for raid parity
There's a private helper for parity and there are many open coded
calculations of parity for the RAID56 profiles. The helper will be used
to remove that and use the raid table values.

Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-20 18:59:23 +02:00
David Sterba
447bf2fb37 btrfs-progs: zoned: factor out supported profiles to a helper
The enumeration could get out of date, like fixed in previous commit.
Create a helper that will hide the implementation details.

Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-20 18:59:23 +02:00
David Sterba
15eb03ca1d btrfs-progs: use raid table for profile names in print-tree.c
Pick the names from the raid table and do the uppercase conversion.

Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-20 18:59:23 +02:00
David Sterba
80714610f3 btrfs-progs: use raid table for ncopies
There's opencoded value of raid table ncopies in
print_filesystem_usage_overall, add a helper and use it.

Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-20 18:59:23 +02:00
David Sterba
b29f1603b0 btrfs-progs: use raid table for devs_min and replace local helper
Another duplication of the raid table, in this case missing the changes
to raid10 and raid0 minimum devices changed in a177ef7dd4
("btrfs-progs: mkfs: allow degenerate raid0/raid10").

Define and use a helper using the table value.

Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-20 18:59:23 +02:00
Naohiro Aota
e9696b06f0 btrfs-progs: use direct-io for zoned device
We need to use direct-IO for zoned devices to preserve the write
ordering.  Instead of detecting if the device is zoned or not, we simply
use direct-IO for any kind of device (even if emulated zoned mode on a
regular device).

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-20 18:59:23 +02:00
Naohiro Aota
4a8d85f730 btrfs-progs: temporarily set zoned flag for initial tree reading
Functions to read data/metadata e.g. read_extent_from_disk() now depend on
the fs_info->zoned flag to determine if they do direct-IO or not.

The flag (and zone_size) is not known before reading the chunk tree and it
set to 0 while in the initial chunk tree setup process. That will cause
btrfs_pread() to fail because it does not align the buffer.

Use fcntl() to find out the file descriptor is opened with O_DIRECT or not,
and if it is, set the zoned flag to 1 temporally for this initial process.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-20 18:59:23 +02:00
Naohiro Aota
ae0dfb246d btrfs-progs: introduce btrfs_pread wrapper for pread
Wrap pread with btrfs_pread as well.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-20 18:59:23 +02:00
Naohiro Aota
c821e5545f btrfs-progs: introduce btrfs_pwrite wrapper for pwrite
Wrap pwrite with btrfs_pwrite(). It simply calls pwrite() on non-zoned
btrfs (opened without O_DIRECT). On zoned mode (opened with O_DIRECT),
it allocates an aligned bounce buffer, copies the contents and uses it
for direct-IO writing.

Writes in device_zero_blocks() and btrfs_wipe_existing_sb() are a little
tricky. We don't have fs_info on our hands, so use zinfo to determine it
is a zoned device or not.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-20 18:59:23 +02:00
Naohiro Aota
40ab7530df btrfs-progs: set eb::fs_info properly everywhere
Several extent_buffer initializations miss fs_info initialization. This
is OK before the following patch ("btrfs-progs: use direct-io for zoned
device") as eb->fs_info is not always necessary. But, after that patch,
we will use fs_info to determine it is zoned or not and that causes
segfault in such cases.

Properly set fs_info when initializing extent_buffers to fix the issue.

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-08 20:47:04 +02:00
David Sterba
8bb13015bd btrfs-progs: don't include btrfs-list.h unless necessary
We don't need to include this besides btrfs-list.c itself and
subvolume.c that does use the btrfs_list_* API.

Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-08 20:47:03 +02:00
Naohiro Aota
585ac14d1a btrfs-progs: use btrfs_device_size() instead of device_get_partition_size_fd()
device_get_partition_size_fd() fails if we pass a regular file. This can
happen when trying to create an emulated zoned filesystem on a regular file.

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-08 20:46:35 +02:00
David Sterba
785218efb1 btrfs-progs: remove direct calls to crc32c from ctree.h
Make the helpers using crc32c not inline so the crc32c.h can be removed
from the public headers exported by libbtrfs.

Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-08 20:46:35 +02:00
David Sterba
732d73dc1f btrfs-progs: remove btrfs_crc32c alias
There's an ancient macro btrfs_crc32c which is just wrapping crc32c and
not doing anything else, so we can use the crc helper directly.

Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-08 20:46:35 +02:00
David Sterba
979bda6fb5 btrfs-progs: libbtrfs: replace SZ_ constants and drop sizes.h
To drop sizes.h from exported headers, replace the few SZ_ constants
from the existing exported headers (ctree.h, send.h). It would be nice
to use them in the long run but right now it would prevent unexporting
the sizes.h file.

Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-08 20:46:35 +02:00
David Sterba
38356d456b btrfs-progs: libbtrfs: drop radix-tree.h from exported headers
The header is only included from ctree.h but not actually used, we can
drop it from the exported files.

Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-08 20:46:35 +02:00
Nikolay Borisov
39c6e0b79c btrfs-progs: add btrfs_uuid_tree_remove
It will be used to clear received data on RW snapshots that were
received. The function is copied from kernel sources.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-08 20:46:34 +02:00
Nikolay Borisov
97640a5b81 btrfs-progs: remove root argument from btrfs_truncate_item
This function lies in the kernel-shared directory and is supposed to be
close to 1:1 copy with its kernel counterpart, yet it takes one extra
argument - root. But this is now unused to simply remove it.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-08 20:46:34 +02:00
Nikolay Borisov
c3584b4fc0 btrfs-progs: remove fs_info argument from leaf_data_end
The function already takes an extent_buffer which has a reference to
the owning filesystem's fs_info. This also brings the function in line
with the kernel's signature.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-08 20:46:34 +02:00
Nikolay Borisov
7c58b09548 btrfs-progs: remove root argument from btrfs_fixup_low_keys
It's not used, so just remove it.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-08 20:46:34 +02:00
David Sterba
d0ea2b2af4 btrfs-progs: zoned: also exclude raid1c3 and raid1c4 from supported profiles
The enumeration of profiles not available for zoned mode in
btrfs_load_block_group_zone_info was lacking the 3 and 4 copy raid1, add
them.

Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-07 18:39:58 +02:00
David Sterba
27207d651a btrfs-progs: dump-tree: print complete root_item
The output of root_item in the 'inspect dump-tree' command lacks some
items and some of them are printed conditionally. As the dump utility
is for debugging, it's better to print all the items, with names
matching the structure members and order.

Some values will inevitably be all zeros like uuids or various
timestamps, but that's a minor issue and affecting only a few trees.

Example:

  item 0 key (EXTENT_TREE ROOT_ITEM 0) itemoff 15844 itemsize 439
	  generation 5 root_dirid 0 bytenr 30523392 byte_limit 0 bytes_used 16384
	  last_snapshot 0 flags 0x0(none) refs 1
	  drop_progress key (0 UNKNOWN.0 0) drop_level 0
	  level 0 generation_v2 5
	  uuid 00000000-0000-0000-0000-000000000000
	  parent_uuid 00000000-0000-0000-0000-000000000000
	  received_uuid 00000000-0000-0000-0000-000000000000
	  ctransid 0 otransid 0 stransid 0 rtransid 0
	  ctime 0.0 (1970-01-01 01:00:00)
	  otime 0.0 (1970-01-01 01:00:00)
	  stime 0.0 (1970-01-01 01:00:00)
	  rtime 0.0 (1970-01-01 01:00:00)

  item 3 key (FS_TREE ROOT_ITEM 0) itemoff 14949 itemsize 439
	  generation 4 root_dirid 256 bytenr 30408704 byte_limit 0 bytes_used 16384
	  last_snapshot 0 flags 0x0(none) refs 1
	  drop_progress key (0 UNKNOWN.0 0) drop_level 0
	  level 0 generation_v2 4
	  uuid ec4669b6-6d21-46ab-857e-d60cafde45b3
	  parent_uuid 00000000-0000-0000-0000-000000000000
	  received_uuid 00000000-0000-0000-0000-000000000000
	  ctransid 0 otransid 0 stransid 0 rtransid 0
	  ctime 1633021823.0 (2021-09-30 19:10:23)
	  otime 1633021823.0 (2021-09-30 19:10:23)
	  stime 0.0 (1970-01-01 01:00:00)
	  rtime 0.0 (1970-01-01 01:00:00)

Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-07 18:39:44 +02:00
Josef Bacik
dd8e7477f7 btrfs-progs: remove data extents from the free space tree
Dave reported a failure of mkfs-test 009 with the free space tree
enabled by default.  This is because 009 pre-populates the file system
with a given directory, and for some reason our data allocation path
isn't the same as in the kernel.  Fix this by making sure when we
allocate a data extent we remove the space from the free space tree, and
with this our mkfs tests now pass.

Issue: #410
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-06 16:49:52 +02:00
Naohiro Aota
85e102f212 btrfs-progs: properly format btrfs_header in btrfs_create_root()
Enabling quota in zoned mored hits the following assertion:

    $ mkfs.btrfs -f -d single -m single -R quota /dev/nullb0
    btrfs-progs v5.11
    See http://btrfs.wiki.kernel.org for more information.

    Zoned: /dev/nullb0: host-managed device detected, setting zoned feature
    Resetting device zones /dev/nullb0 (1600 zones) ...
    bad tree block 25395200, bytenr mismatch, want=25395200, have=0
    kernel-shared/disk-io.c:549: write_tree_block: BUG_ON `1` triggered, value 1
    ./mkfs.btrfs(+0x26aaa)[0x564d1a7ccaaa]
    ./mkfs.btrfs(write_tree_block+0xb8)[0x564d1a7cee29]
    ./mkfs.btrfs(__commit_transaction+0x91)[0x564d1a7e3740]
    ./mkfs.btrfs(btrfs_commit_transaction+0x135)[0x564d1a7e39aa]
    ./mkfs.btrfs(main+0x1fe9)[0x564d1a7b442a]
    /lib64/libc.so.6(__libc_start_main+0xcd)[0x7f36377d37fd]
    ./mkfs.btrfs(_start+0x2a)[0x564d1a7b1fda]
    zsh: IOT instruction  sudo ./mkfs.btrfs -f -d single -m single -R quota /dev/nullb0

The issue occurs because btrfs_create_root() is not formatting the root
node properly. This is fine in regular mode, because it's fortunately
reusing an once freed buffer. As the previous tree node allocation
kindly formatted the header, it will see the proper bytenr and pass the
checks.

However, we never reuse a once freed buffer on zoned filesystem. As a
result, we have zero-filled bytenr, FSID, and chunk-tree UUID, hitting
the asserts in check_tree_block().

Reported-by: Johannes Thumshirn <Johannes.Thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-06 16:49:11 +02:00
Johannes Thumshirn
c22e9487a7 btrfs-progs: remove max_zone_append_size logic
max_zone_append_size is unused and can as well be removed just like we
did on the kernel side.

Keep one sanity check though, so we're not adding devices to a zoned FS
that aren't supporting zone append.

Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-06 16:49:07 +02:00
Naohiro Aota
53ec59ead0 btrfs-progs: do not zone reset on emulated zoned mode
We cannot zone reset a regular file with emulated zones. So, mkfs.btrfs
on such a file causes the following error.

  ERROR: zoned: failed to reset device '/home/naota/tmp/btrfs.img' zones: Inappropriate ioctl for device

Introduce btrfs_zoned_device_info->emulated to distinguish the zones are
emulated or not. And, use it to decide it needs zone reset or not.

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-06 16:48:56 +02:00
Qu Wenruo
60651ad9da btrfs-progs: introduce OPEN_CTREE_ALLOW_TRANSID_MISMATCH flag
[BUG]
There is a report that, btrfstune can even work while the fs has transid
mismatch problems.

  $ btrfstune -f -u /dev/sdb1
  Current fsid: b2b5ae8d-4c49-45f0-b42e-46fe7dcfcb07
  New fsid: b2b5ae8d-4c49-45f0-b42e-46fe7dcfcb07
  Set superblock flag CHANGING_FSID
  Change fsid in extents
  parent transid verify failed on 792854528 wanted 20103 found 20091
  parent transid verify failed on 792854528 wanted 20103 found 20091
  parent transid verify failed on 792854528 wanted 20103 found 20091
  Ignoring transid failure
  parent transid verify failed on 792870912 wanted 20103 found 20091
  parent transid verify failed on 792870912 wanted 20103 found 20091
  parent transid verify failed on 792870912 wanted 20103 found 20091
  Ignoring transid failure
  parent transid verify failed on 792887296 wanted 20103 found 20091
  parent transid verify failed on 792887296 wanted 20103 found 20091
  parent transid verify failed on 792887296 wanted 20103 found 20091
  Ignoring transid failure
  ERROR: child eb corrupted: parent bytenr=38010880 item=69 parent level=1 child level=1
  ERROR: failed to change UUID of metadata: -5
  ERROR: btrfstune failed

This leaves a corrupted fs even more corrupted, and due to the extra
CHANGING_FSID flag, btrfs check will not even try to run on it:

  Opening filesystem to check...
  ERROR: Filesystem UUID change in progress
  ERROR: cannot open file system

[CAUSE]
Unlike kernel, btrfs-progs has a less strict check on transid mismatch.

In read_tree_block() we will fall back to use the tree block even its
transid mismatch if we can't find any better copy.

However not all commands in btrfs-progs needs this feature, only
btrfs-check (which may fix the problem) and btrfs-restore (it just tries
to ignore any problems) really utilize this feature.

[FIX]
Introduce a new open ctree flag, OPEN_CTREE_ALLOW_TRANSID_MISMATCH, to
be explicit about whether we really want to ignore transid error.

Currently only btrfs-check and btrfs-restore will utilize this new flag.

Also add btrfs-image to allow opening such fs with transid error.

Link: https://www.reddit.com/r/btrfs/comments/pivpqk/failure_during_btrfstune_u/
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-09-20 12:17:29 +02:00
David Sterba
96a5cf0719 btrfs-progs: handle EINVAL when reading zone size on older kernels
A combination of new progs and old kernel may lead to problems with
detecting zone size by ioctl. Fixed by #376 but still incomplete because
old kernels may return EINVAL for unsupported ioctl. This should be
ENOTTY but hasn't been like that until kernel 5.11.

As we always pass valid arguments to the ioctl we can't conflate the two
and can EINVAL the same way as ENOTTY.

Issue: #399
Signed-off-by: David Sterba <dsterba@suse.com>
2021-09-20 11:31:09 +02:00
David Sterba
ee17bcec33 btrfs-progs: remove stale declaration from send.h
We don't use this header for kernel compilation so the guarded
declaration is pointless.

Signed-off-by: David Sterba <dsterba@suse.com>
2021-09-07 19:27:59 +02:00
David Sterba
e86425242f btrfs-progs: move send.h to kernel-shared/
The header contains the protocol definitions and is almost exactly the
same as the kernel version, move it to the proper directory.

Signed-off-by: David Sterba <dsterba@suse.com>
2021-09-07 19:26:46 +02:00
David Sterba
76ab1fa364 btrfs-progs: rename and move group_profile_max_safe_loss
The helper belongs to the others that translate bg flags to the raid
attr table member.

Signed-off-by: David Sterba <dsterba@suse.com>
2021-09-07 16:38:56 +02:00
Qu Wenruo
9a11b1b792 btrfs-progs: backport btrfs_check_node() from kernel
The btrfs_check_node() has far less meaningful error message compared to
kernel counterpart, and it even lacks certain checks like level check.

Backport btrfs_check_node() to btrfs-progs to not only unify the code
but greatly improve the readability of the error messages.

Extra modification includes:

- No fs_info needed
  As we don't need to output fsid.

- Remove unlikely() macro

- Extra BTRFS_TREE_BLOCK_* error type

- Btrfs-progs specific error handling
  To record the corrupted tree blocks.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-09-07 14:20:41 +02:00
Qu Wenruo
8f8cafa2ce btrfs-progs: backport btrfs_check_leaf() from kernel
Currently btrfs_check_leaf() provides almost meaningless messages for
things like invalid item offset:

  incorrect offsets 8492 3707786077

While kernel tree-checker is doing a way better job, so it's wise to
backport btrfs_check_leaf() from kernel.

There are some modification needed:

- New generic_err() helper

- Remove unlikely() macro

- Remove empty essential tree check
  Mkfs still needs to create empty essential trees.

- Using BTRFS_TREE_BLOCK_* return value
  Original mode check still relies on them to do certain repair.

- No need for btrfs_fs_info
  We no longer need fsid output, thus no need for btrfs_fs_info.

- No item contents check

- Still using the fail: label for btrfs-progs specific error handling

The new output looks like:

  corrupt leaf: root=2 block=72164753408 slot=109, unexpected item end, have 3707786077 expect 8492

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-09-07 14:19:54 +02:00
Qu Wenruo
1f8dfe681f btrfs-progs: use btrfs_key for btrfs_check_node() and btrfs_check_leaf()
In kernel space we hardly use btrfs_disk_key, unless for very lowlevel
code.

There is no need to intentionally use btrfs_disk_key in btrfs-progs
either.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-09-07 13:58:44 +02:00
David Sterba
c3ee6a8a09 btrfs-progs: unify GPL header comments
Add the GPL v2 header to files where it was missing and is not from an
external source, update to the most recent version with the address.

Signed-off-by: David Sterba <dsterba@suse.com>
2021-09-07 13:58:44 +02:00
David Sterba
7572839a74 btrfs-progs: add and use bit masks for RAID1 and RAID56 profiles
Many test conditions can be simplified in case they check all the
related profiles.

Signed-off-by: David Sterba <dsterba@suse.com>
2021-09-06 16:36:18 +02:00
David Sterba
7fe4396467 btrfs-progs: copy some raid_attr helpers from kernel
There are convenience helpers for the raid attr table, copy them from
kernel for further cleanups.

Signed-off-by: David Sterba <dsterba@suse.com>
2021-09-06 16:36:17 +02:00
Josef Bacik
79e534def9 btrfs-progs: add the incompat flag for extent tree v2
I will have a lot of preparatory patches to reduce the review pain of
this large feature.  In order to enable that work define the incompat
flag.  Once all of the work lands to support the feature there will be a
patch to actually enable us to select it and manipulate file systems
with that incompat flag set.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-09-06 16:36:17 +02:00
Josef Bacik
826e466028 btrfs-progs: add add_block_group_free_space helper
This exists in the kernel free-space-tree.c but not in progs.  We need
it to generate the free space items for new block groups, which is
needed when we start creating the free space tree in make_btrfs().

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-09-06 16:36:17 +02:00
Josef Bacik
3d870a491f btrfs-progs: make sure track_dirty and ref_cows is set properly
Adding support for the per-block group roots means we will be reading
the roots directly in different places.  Make sure we set ->track_dirty
and ->ref_cows properly in the helper so we don't have to do this
everywhere.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-09-03 15:33:53 +02:00
David Sterba
a177ef7dd4 btrfs-progs: mkfs: allow degenerate raid0/raid10
Kernel patch b2f78e88052bc0bee ("btrfs: allow degenerate raid0/raid10")
in
5.15 will allow mounting and converting to single device raid0 or two
device raid10.  Let mkfs create such filesystem.

"The motivation is to allow to preserve the profile type as long as it
 possible for some intermediate state (device removal, conversion), or
 when there are disks of different size, with raid0 the otherwise
 unusable space of the last device will be used too.  Similarly for
 raid10, though the two largest devices would need to be the same."

Signed-off-by: David Sterba <dsterba@suse.com>
2021-08-27 15:40:53 +02:00
Qu Wenruo
991a598f53 btrfs-progs: move btrfs_format_csum() to common/utils.[ch]
Function btrfs_format_csum() is a special helper only used in
btrfs-progs.

Move it to common/utils.[ch] other than leaving it in
kernel-shared/disk-io.c.

Since we're moving the code, also introduce a macro,
BTRFS_CSUM_STRING_LEN, to replace open-coded string length calculation.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-08-26 14:26:13 +02:00
Josef Bacik
8c3c13bb45 btrfs-progs: check blocks in btrfs_next_sibling_block
By enabling the lowmem checks properly I uncovered the case where test
fsck/007 will infinite loop at the detection stage.  This is because
when checking the inode item we will just btrfs_next_item(), and because
we ignore check tree block failures at read time we don't get an -EIO
from btrfs_next_leaf.

This occurs because we allow fsck to raw-read blocks even if they fail
basic sanity checks, because we want the opportunity to repair the
blocks.  However this means corrupt blocks are sitting in cache marked
as uptodate.  btrfs_search_slot() handles this by doing a check_block()
on every block we add to the path, so that anything that is doing a
search gets a proper -EIO.

btrfs_next_sibling_block() needs a similar check.  With this fix we now
return -EIO on btrfs_next_leaf() properly and we no longer infinite loop
on fsck/007 with lowmem.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-08-25 15:38:54 +02:00
Qu Wenruo
a138daac17 btrfs-progs: mkfs: set super_cache_generation to 0 if we're using free space tree
[HICCUP]
There is a bug report that mkfs.btrfs -R free-space-tree still makes
kernel to try to cleanup the v1 space cache:

  # mkfs.btrfs -R free-space-tree -f /dev/test/scratch1
  # mount /dev/test/scratch1 /mnt/btrfs
  # dmesg | grep cleaning
  BTRFS info (device dm-6): cleaning free space cache v1

[CAUSE]
By default, mkfs.btrfs will set super cache generation to (u64)-1, which
will inform kernel that the v1 space cache is invalid, needs to
regenerate it.

But for free space cache tree, kernel will set super cache generation to
0, to indicate v1 space cache is not in use.

This means, even we enabled free space tree with all the RO compatible
bits and new tree, as long as super cache generation is not 0, kernel
still consider the fs has some invalid v1 space cache, and will try to
remove them.

[FIX]
This is not a big deal, but to make the "-R free-space-tree" to really
work as kernel, we also need to set super cache generation to 0.

Reported-by: Chris Murphy <lists@colorremedies.com>
Link: https://lore.kernel.org/linux-btrfs/CAJCQCtSvgzyOnxtrqQZZirSycEHp+g0eDH5c+Kw9mW=PgxuXmw@mail.gmail.com/
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-08-20 14:24:55 +02:00
David Sterba
6527771668 btrfs-progs: add nparity for raid1c34 definitions
The values of .ncopies was not explicitly set.

Signed-off-by: David Sterba <dsterba@suse.com>
2021-07-23 00:59:27 +02:00
Qu Wenruo
07ecf878c1 btrfs-progs: check: batch v1 space cache inodes when clearing
Currently v1 space cache clearing will delete one cache inode just in
one transaction, and then start a new transaction to delete the next
inode.

This is far from efficient and can make the already slow v1 space cache
deleting even slower, as large fs has tons of cache inodes to delete.

This patch will speed up the process by batching up to 16 inode deletion
into one transaction.

A quick benchmark of deleting 702 v1 space cache inodes would look like
this:

Unpatched:		4.898s
Patched:		0.087s

Which is obviously a big win.

Reported-by: Joshua <joshua@mailmag.net>
Link: https://lore.kernel.org/linux-btrfs/0b4cf70fc883e28c97d893a3b2f81b11@mailmag.net/
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-07-22 16:26:05 +02:00
Sidong Yang
94f3b75c00 btrfs-progs: zoned: fix memory leak in btrfs_sb_io()
In btrfs_sb_io(), blk_zone_report is used for getting information about
zones. But it is not freed if code goes in usual path. This patch frees
the variable just after it used.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: Sidong Yang <realwakka@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-07-02 17:27:53 +02:00
David Sterba
1dc6f33c28 btrfs-progs: zoned: use fixed width type when reading zone size
The ioctl BLKGETZONESZ expects 32bit integer, declare the target
variable as such.

Signed-off-by: David Sterba <dsterba@suse.com>
2021-07-02 17:27:53 +02:00
David Sterba
b1f374dd1d btrfs-progs: switch %Lu to %llu format
The %Lu format is not standard and we use %llu everywhere else, so
switch the remaining cases.

Signed-off-by: David Sterba <dsterba@suse.com>
2021-06-19 22:07:49 +02:00
David Sterba
9f6c055e38 btrfs-progs: dump-tree: add options to dump checksums
Add new options to dumps checksums in node headers and in the checksum
items:

  $ btrfs inspect dump-tree --csum-headers image
  root tree
  leaf 471515136 items 19 free space 12186 generation 15 owner ROOT_TREE
  leaf 471515136 flags 0x1(WRITTEN) backref revision 1 csum 0x756b2d54
  fs uuid df0348df-5773-47dd-81e9-a18221461239

For nodes/leaves it's appended on the 2nd line of the header.

Checksum items are stored in leaves as EXTENT_CSUM key type, with offset
value as the logical offset starting. As the array would be hard to
parse or match, each offset value is printed with the checksum. For
crc32c it's 4 values on a line, for xxhash it's 2 and for the long
256bit checksums it's one checksum per line.

  $ btrfs inspect dump-tree --csum-items image
  leaf 5423104 items 1 free space 30 generation 6 owner CSUM_TREE
  leaf 5423104 flags 0x1(WRITTEN) backref revision 1
  fs uuid bd7c981e-16ff-4081-a734-3ef5d50cafc1
  chunk uuid 13f4c76c-7845-4984-88ed-f01b52e05cf8
	  item 0 key (EXTENT_CSUM EXTENT_CSUM 22020096) itemoff 55 itemsize 16228
		  range start 22020096 end 38637568 length 16617472
		  [22020096] 0x8941f998 [22024192] 0x8941f998 [22028288] 0x8941f998 [22032384] 0x8941f998
		  [22036480] 0x8941f998 [22040576] 0x8941f998 [22044672] 0x8941f998 [22048768] 0x8941f998
		  ...

  $ btrfs inspect dump-tree --csum-items image
  leaf 5718016 items 1 free space 7746 generation 6 owner CSUM_TREE
  leaf 5718016 flags 0x1(WRITTEN) backref revision 1
  fs uuid f453a5b4-8b4a-4fbf-90a2-2925e4fe2335
  chunk uuid eb1da63b-248b-44c2-82da-71b2564bf50e
	  item 0 key (EXTENT_CSUM EXTENT_CSUM 52387840) itemoff 7771 itemsize 8512
		  range start 52387840 end 53477376 length 1089536
		  [52387840] 0x686ede9288c391e7e05026e56f2f91bfd879987a040ea98445dabc76f55b8e5f
		  [52391936] 0x686ede9288c391e7e05026e56f2f91bfd879987a040ea98445dabc76f55b8e5f
		  ...

The options are not on by default, the header checksum is not important
for the structures. Data checksums can be quite big so that would make
the dump long and without any actual data to match against.

Signed-off-by: David Sterba <dsterba@suse.com>
2021-06-19 22:07:49 +02:00
David Sterba
72d710637c btrfs-progs: print-tree: convert mode to bitmask
Replace follow and traverse by one parameter that takes bits to affect
the behaviour. This allows to extend btrfs_print_tree output with more
modes from one place.

Signed-off-by: David Sterba <dsterba@suse.com>
2021-06-09 20:31:49 +02:00
David Sterba
6134973527 btrfs-progs: zoned: make it work without kernel support
There's a report that a system with 4.19 kernel fails boot because
device scan exits with error. This is because zoned support is compiled
in btrfs-progs but not in kernel.

To make new progs and old kernels work, do a fallback when the zoned
ioctl is not available, as if it were a non-zoned device. There is no
other option, but this is safe at least for the device scan that would
not error out. Any unaligned writes to a zoned device will fail as
expected.

Issue: #376
Signed-off-by: David Sterba <dsterba@suse.com>
2021-06-07 17:38:46 +02:00
Su Yue
80a86f1b47 btrfs-progs: do not BUG_ON if btrfs_add_to_fsid succeeded to write superblock
Commit 8ef9313cf2 ("btrfs-progs: zoned: implement log-structured
superblock") changed to write BTRFS_SUPER_INFO_SIZE bytes to device.
The before num of bytes to be written is sectorsize.
It causes mkfs.btrfs failed on my 16k pagesize kvm:

  $ /usr/bin/mkfs.btrfs -s 16k -f -mraid0 /dev/vdb2 /dev/vdb3
  btrfs-progs v5.12
  See http://btrfs.wiki.kernel.org for more information.

  ERROR: superblock magic doesn't match
  ERROR: superblock magic doesn't match
  common/device-scan.c:195: btrfs_add_to_fsid: BUG_ON `ret != sectorsize`
  triggered, value 1
  /usr/bin/mkfs.btrfs(btrfs_add_to_fsid+0x274)[0xaaab4fe8a5fc]
  /usr/bin/mkfs.btrfs(main+0x1188)[0xaaab4fe4dc8c]
  /usr/lib/libc.so.6(__libc_start_main+0xe8)[0xffff7223c538]
  /usr/bin/mkfs.btrfs(+0xc558)[0xaaab4fe4c558]

  [1]    225842 abort (core dumped)  /usr/bin/mkfs.btrfs -s 16k -f -mraid0
  /dev/vdb2 /dev/vdb3

btrfs_add_to_fsid() now always calls sbwrite() to write
BTRFS_SUPER_INFO_SIZE bytes to device, so change condition of
the BUG_ON().
Also add comments for sbread() and sbwrite().

Signed-off-by: Su Yue <l@damenly.su>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-05-12 16:00:14 +02:00
David Sterba
6c53222add btrfs-progs: delete bogus zero checksum check
The check condition (csum_result == 0) does not make sense anymore as
it's not the buffer and not the crc32c result as it used to be. The
message does not bring any value and looks like it's some debugging aid
from the old times (added in 2008 as bb7055ec21 ("Add some extra
debugging around file data checksum failures")).

Signed-off-by: David Sterba <dsterba@suse.com>
2021-05-08 00:58:51 +02:00
David Sterba
c19ac510a7 btrfs-progs: move repair.[ch] to common/
Move the file to common as it's used by several parts, while still
keeping the name 'repair' although the only thing it does is adding a
corrupted extent.

Signed-off-by: David Sterba <dsterba@suse.com>
2021-05-06 16:41:47 +02:00
David Sterba
b19a603d62 btrfs-progs: remove unnecessary linux/*.h includes
Decrease dependency on system headers, remove where they're not needed
or became stale after code moved. The path-utils.h encapsulate path
operations so include linux/limits.h here, that's where PATH_MAX is
defined.

Signed-off-by: David Sterba <dsterba@suse.com>
2021-05-06 16:41:47 +02:00
David Sterba
aa56bf3a31 btrfs-progs: zoned: replace raw ioctl with a helper for device size
Signed-off-by: David Sterba <dsterba@suse.com>
2021-05-06 16:41:46 +02:00
David Sterba
c7b5f884e0 btrfs-progs: add prefix to zero_blocks
This is a public helper for devices, add the prefix to make it clear.

Signed-off-by: David Sterba <dsterba@suse.com>
2021-05-06 16:41:46 +02:00
David Sterba
2b5d4f2e6f btrfs-progs: add prefix to discard_blocks
This is a helper for devices, make it clear in the function name.

Signed-off-by: David Sterba <dsterba@suse.com>
2021-05-06 16:41:46 +02:00
David Sterba
bc6864967b btrfs-progs: add prefix to exported queue_param
As this is a public helper, add a prefix that makes it clear what is the
queue related to.

Signed-off-by: David Sterba <dsterba@suse.com>
2021-05-06 16:41:46 +02:00
David Sterba
38254c4934 btrfs-progs: kerncompat: add const_ilog2
The newly added zoned mode constants can utilize the const ilog2
version. Copy it from kernel include/linux/log2.h.

Signed-off-by: David Sterba <dsterba@suse.com>
2021-05-06 16:41:46 +02:00
Naohiro Aota
8c2dfa6387 btrfs-progs: zoned: wipe temporary superblocks in superblock log zone
mkfs.btrfs uses a temporary superblock during the initialization process.
The temporary superblock uses BTRFS_MAGIC_TEMPORARY as its magic which is
different from a regular superblock. As a result, libblkid, which only
supports the usual magic, cannot recognize the volume as btrfs. So, let's
wipe the temporary magic before writing out the usual superblock.

Technically, we can add the temporary magic to the libblkid's table. But,
it will result in recognizing a half-baked filesystem as btrfs, which is
not ideal.

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-05-06 16:41:46 +02:00
Naohiro Aota
8bbb0c5744 btrfs-progs: zoned: support zero out on zoned block device
If we zero out a region in a sequential write required zone, we cannot
write to the region until we reset the zone. Thus, we must prohibit zeroing
out to a sequential write required zone.

zero_dev_clamped() is modified to take the zone information and it calls
zero_zone_blocks() if the device is host managed to avoid writing to
sequential write required zones.

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-05-06 16:41:46 +02:00
Naohiro Aota
58ec593892 btrfs-progs: zoned: support resetting zoned device
All zones of zoned block devices should be reset before writing. Support
this by introducing PREP_DEVICE_ZONED.

btrfs_reset_all_zones() walk all the zones on a device, and reset a zone if
it is sequential required zone, or discard the zone range otherwise.

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-05-06 16:41:46 +02:00
Naohiro Aota
bfdb3ae237 btrfs-progs: zoned: reset zone of freed block group
When freeing a chunk, we can/should reset the underlying device zones
for the chunk. Introduce btrfs_reset_chunk_zones() and reset the zones.

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-05-06 16:41:45 +02:00
Naohiro Aota
bfd34b7876 btrfs-progs: zoned: redirty clean extent buffers
Tree manipulating operations like merging nodes often release
once-allocated tree nodes. Btrfs cleans such nodes so that pages in the
node are not uselessly written out. On ZONED drives, however, such
optimization blocks the following IOs as the cancellation of the write
out of the freed blocks breaks the sequential write sequence expected by
the device.

Check if next dirty extent buffer is continuous to a previously written
one. If not, it redirty extent buffers between the previous one and the
next one, so that all dirty buffers are written sequentially.

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-05-06 16:41:45 +02:00
Naohiro Aota
feff533e34 btrfs-progs: zoned: calculate allocation offset for conventional zones
Conventional zones do not have a write pointer, so we cannot use it to
determine the allocation offset for sequential allocation if a block
group contains a conventional zone.

But instead, we can consider the end of the highest addressed extent in
the block group for the allocation offset.

For new block group, we cannot calculate the allocation offset by
consulting the extent tree, because it can cause deadlock by taking
extent buffer lock after chunk mutex, which is already taken in
btrfs_make_block_group(). Since it is a new block group anyways, we can
simply set the allocation offset to 0.

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-05-06 16:41:45 +02:00
Naohiro Aota
50ae9f62c7 btrfs-progs: zoned: implement sequential extent allocation
Implement a sequential extent allocator for zoned filesystems. This
allocator only needs to check if there is enough space in the block group
after the allocation pointer to satisfy the extent allocation request.

Since the allocator is really simple, we implement it directly in
find_search_start().

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-05-06 16:41:45 +02:00
Naohiro Aota
f08410f078 btrfs-progs: zoned: load zone's allocation offset
A zoned filesystem must allocate blocks at the zones' write pointer. The
device's write pointer position can be mapped to a logical address
within a block group. To facilitate this, add an "alloc_offset" to the
block group to track the logical addresses of the write pointer.

This logical address is populated in btrfs_load_block_group_zone_info()
from the write pointers of corresponding zones.

For now, zoned filesystems the single profile. Supporting non-single
profile with zone append writing is not trivial. For example, in the DUP
profile, we send a zone append writing IO to two zones on a device. The
device reply with written LBAs for the IOs. If the offsets of the
returned addresses from the beginning of the zone are different, then it
results in different logical addresses.

We need fine-grained logical to physical mapping to support such
separated physical address issue. Since it should require additional
metadata type, disable non-single profiles for now.

This commit supports the case all the zones in a block group are
sequential. The next patch will handle the case having a conventional
zone.

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-05-06 16:41:45 +02:00
Naohiro Aota
b031fe84fd btrfs-progs: zoned: implement zoned chunk allocator
Implement a zoned chunk and device extent allocator. One device zone
becomes a device extent so that a zone reset affects only this device
extent and does not change the state of blocks in the neighbor device
extents.

To implement the allocator, we need to extend the following functions for
a zoned filesystem:

  - init_alloc_chunk_ctl
  - dev_extent_search_start
  - dev_extent_hole_check
  - decide_stripe_size

Here, dev_extent_hole_check() is newly introduced to check the validity of
a hole found.

init_alloc_chunk_ctl_zoned() is mostly the same as regular one. It always
set the stripe_size to the zone size and aligns the parameters to the zone
size.

dev_extent_search_start() only aligns the start offset to zone boundaries.
We don't care about the first 1MB like in regular filesystem because we
anyway reserve the first two zones for superblock logging.

dev_extent_hole_check_zoned() checks if zones in given hole are either
conventional or empty sequential zones. Also, it skips zones reserved for
superblock logging.

With the change to the hole, the new hole may now contain pending extents.
So, in this case, loop again to check that.

Finally, decide_stripe_size_zoned() should shrink the number of devices
instead of stripe size because we need to honor stripe_size == zone_size.

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-05-06 16:41:45 +02:00
Naohiro Aota
8ef9313cf2 btrfs-progs: zoned: implement log-structured superblock
Superblock (and its copies) is the only data structure in btrfs which has a
fixed location on a device. Since we cannot overwrite in a sequential write
required zone, we cannot place superblock in the zone.  One easy solution
is limiting superblock and copies to be placed only in conventional zones.

However, this method has two downsides: one is reduced number of superblock
copies. The location of the second copy of superblock is 256GB, which is in
a sequential write required zone on typical devices in the market today.
So, the number of superblock and copies is limited to be two.  Second
downside is that we cannot support devices which have no conventional zones
at all.

To solve these two problems, we employ superblock log writing. It uses two
adjacent zones as a circular buffer to write updated superblocks.  Once the
first zone is filled up, start writing into the second one.  Then, when
both zones are filled up and before starting to write to the first zone
again, reset the first zone.

We can determine the position of the latest superblock by reading write
pointer information from a device. One corner case is when both zones are
full. For this situation, we read out the last superblock of each zone, and
compare them to determine which zone is older.

The following zones are reserved as the circular buffer on ZONED btrfs.

- primary superblock: offset   0B (and the following zone)
- first copy:         offset 512G (and the following zone)
- Second copy:        offset   4T (4096G, and the following zone)

If these reserved zones are conventional, superblock is written fixed at
the start of the zone without logging.

Currently, superblock reading/writing is done by pread/pwrite. This
commit replace the call sites with sbread/sbwrite to wrap the functions.
For zoned btrfs, btrfs_sb_io which is called from sbread/sbwrite
reverses the IO position back to a mirror number, maps the mirror number
into the superblock logging position, and do the IO.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-05-06 16:41:45 +02:00
Naohiro Aota
49d5ce4d0f btrfs-progs: zoned: allow zoned filesystems on non-zoned block devices
Run a zoned filesystem on non-zoned devices. This is done by "slicing
up" the block device into fixed-sized chunks and emulate a conventional
zone on each of them. The emulated zone size is determined from the size
of device extent.

This is mainly aimed at testing of zoned filesystems, i.e. the zoned
chunk allocator, on regular block devices.

Currently, we always use EMULATED_ZONE_SIZE (256MiB) for the emulated
zone size. In the future, this will be customized by mkfs option.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-05-06 16:41:45 +02:00
Naohiro Aota
707f0716e0 btrfs-progs: zoned: disallow mixed-bg in ZONED mode
Placing both data and metadata in a block group is impossible in ZONED
mode. For data, we can allocate a space for it and write it immediately
after the allocation. For metadata, however, we cannot do that, because the
logical addresses are recorded in other metadata buffers to build up the
trees. As a result, a data buffer can be placed after a metadata buffer,
which is not written yet. Writing out the data buffer will break the
sequential write rule.

Check and disallow MIXED_BG with ZONED mode.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-05-06 16:41:45 +02:00
Naohiro Aota
3c0f83e541 btrfs-progs: zoned: introduce max_zone_append_size
The zone append write command has a maximum IO size restriction it
accepts. This is because a zone append write command cannot be split, as
we ask the device to place the data into a specific target zone and the
device responds with the actual written location of the data.

Introduce max_zone_append_size to zone_info and fs_info to track the
value, so we can limit all I/O to a zoned block device that we want to
write using the zone append command to the device's limits.

Zone append command is mandatory for zoned btrfs. So, reject a device
with max_zone_append_size == 0.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-05-06 16:41:45 +02:00
Naohiro Aota
7e520022ff btrfs-progs: zoned: check and enable ZONED mode
Introduce function btrfs_check_zoned_mode() to check if ZONED flag is
enabled on the file system and if the file system consists of zoned
devices with equal zone size.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-05-06 16:41:45 +02:00
Naohiro Aota
384840b9c0 btrfs-progs: zoned: get zone information of zoned block devices
Get the zone information (number of zones and zone size) from all the
devices, if the volume contains a zoned block device. To avoid costly
run-time zone report commands to test the device zones type during block
allocation, it also records all the zone status (zone type, write
pointer position, etc.).

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-05-06 16:41:45 +02:00
Naohiro Aota
242c8328bc btrfs-progs: zoned: add new ZONED feature flag
With the zoned feature enabled, a zoned block device-aware btrfs
allocates block groups aligned to the device zones and always written in
sequential zones at the zone write pointer position.

It also supports "emulated" zoned mode on a non-zoned device. In the
emulated mode, btrfs emulates conventional zones by slicing the device
into fixed-size zones.

We don't support conversion from the ext4 volume with the zoned feature
because we can't be sure all the converted block groups are aligned to
zone boundaries.

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-05-06 16:41:45 +02:00
Naohiro Aota
acdd22ab68 btrfs-progs: provide fs_info from btrfs_device
Likewise in the kernel code, provide fs_info access from struct
btrfs_device. This will help to unify the code between the kernel and
the userland.

Since fs_info can be NULL at the time of btrfs_add_to_fsid(), let's use
btrfs_open_devices() to set fs_info to the devices.

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-05-06 16:41:45 +02:00
Naohiro Aota
cf67267d33 btrfs-progs: rename calc_size to stripe_size
alloc_chunk_ctl::calc_size is actually the stripe_size in the kernel
side code.  Let's rename it to clarify what the "calc" is.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-05-06 16:41:45 +02:00