Print a warning when one of the following is requested by a command
line option:
- btrfstune -b: conversion to block-group-tree
- mkfs.btrfs --num-global-roots: extent-tree-v2
- btrfs-image -d: dump image with data
Issue: #523
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
Even with the chunk_objectid bug fixed, mkfs.btrfs can still cause a
stack overflow when the extent-tree-v2 feature is enabled (requires a
build with experimental features):
# ./mkfs.btrfs -f -O extent-tree-v2 ~/test.img
btrfs-progs v5.19.1
See http://btrfs.wiki.kernel.org for more information.
ERROR: superblock magic doesn't match
NOTE: several default settings have changed in version 5.15, please make sure
this does not affect your deployments:
- DUP for metadata (-m dup)
- enabled no-holes (-O no-holes)
- enabled free-space-tree (-R free-space-tree)
Label: (null)
UUID: 205c61e7-f58e-4e8f-9dc2-38724f5c554b
Node size: 16384
Sector size: 4096
Filesystem size: 512.00MiB
Block group profiles:
Data: single 8.00MiB
Metadata: DUP 32.00MiB
System: DUP 8.00MiB
SSD detected: no
Zoned device: no
=================================================================
[... Skip full ASAN output ...]
==65655==ABORTING
[CAUSE]
For experimental builds, we have unified feature output, but the old
buffer size is only 64 bytes, which is too small to hold the new full
feature string:
extref, skinny-metadata, no-holes, free-space-tree, block-group-tree, extent-tree-v2
The above feature string is already 84 bytes long, more than the 64 byte
on-stack buffer.
This is also confirmed by the ASAN output:
==65655==ERROR: AddressSanitizer: stack-buffer-overflow on address 0x7ffc4e03b1d0 at pc 0x7ff0fc05fafe bp 0x7ffc4e03ac60 sp 0x7ffc4e03a408
WRITE of size 17 at 0x7ffc4e03b1d0 thread T0
#0 0x7ff0fc05fafd in __interceptor_strcat /usr/src/debug/gcc/libsanitizer/asan/asan_interceptors.cpp:377
#1 0x55cdb7b06ca5 in parse_features_to_string common/fsfeatures.c:316
#2 0x55cdb7b06ce1 in btrfs_parse_fs_features_to_string common/fsfeatures.c:324
#3 0x55cdb7a37226 in main mkfs/main.c:1783
#4 0x7ff0fbe3c28f (/usr/lib/libc.so.6+0x2328f)
#5 0x7ff0fbe3c349 in __libc_start_main (/usr/lib/libc.so.6+0x23349)
#6 0x55cdb7a2cb34 in _start ../sysdeps/x86_64/start.S:115
[FIX]
Introduce a new macro, BTRFS_FEATURE_STRING_BUF_SIZE, along with a new
sanity check helper, btrfs_assert_feature_buf_size().
The problem is that I cannot find a build time method to verify that
BTRFS_FEATURE_STRING_BUF_SIZE is large enough to contain all feature
names, so the check has to be done at runtime, with a BUG_ON() that
verifies the macro size.
Now the minimal buffer size for experimental builds is 138 bytes, so
just bump it to 160 to leave room for future expansion.
And if further features go beyond that number, mkfs.btrfs/btrfs-convert
will immediately crash at that BUG_ON(), so we can definitely detect it.
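The runtime check described above can be sketched roughly as follows.
This is only an illustration, not the actual btrfs-progs code: the
feature table, SKETCH_FEATURE_BUF_SIZE and both function names are
hypothetical, and the table holds only the six names quoted in [CAUSE].

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for BTRFS_FEATURE_STRING_BUF_SIZE. */
#define SKETCH_FEATURE_BUF_SIZE 160

/* Illustrative subset of feature names (the list quoted in [CAUSE]). */
static const char *const sketch_features[] = {
	"extref", "skinny-metadata", "no-holes",
	"free-space-tree", "block-group-tree", "extent-tree-v2",
};

/* Bytes needed to join all names with ", " plus the trailing NUL. */
static size_t required_feature_buf_size(void)
{
	size_t n = sizeof(sketch_features) / sizeof(sketch_features[0]);
	size_t total = 1;	/* trailing NUL */

	for (size_t i = 0; i < n; i++) {
		total += strlen(sketch_features[i]);
		if (i + 1 < n)
			total += 2;	/* ", " separator */
	}
	return total;
}

/* Runtime sanity check, playing the role of the BUG_ON() in the fix. */
static void sketch_assert_feature_buf_size(void)
{
	assert(required_feature_buf_size() <= SKETCH_FEATURE_BUF_SIZE);
}
```

With this table the required size exceeds the old 64 byte buffer, which
matches the overflow reported by ASAN.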
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Tested-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
Commit "btrfs-progs: prepare merging compat feature lists" tries to
merge the "-O" and "-R" options, as they don't correctly represent
btrfs features.
But that commit caused the following bug during mkfs for experimental
builds:
$ mkfs.btrfs -f -O block-group-tree /dev/nvme0n1
btrfs-progs v5.19.1
See http://btrfs.wiki.kernel.org for more information.
ERROR: superblock magic doesn't match
ERROR: illegal nodesize 16384 (not equal to 4096 for mixed block group)
[CAUSE]
Currently btrfs_parse_fs_features() returns a u64, and on the
experimental branch the same u64 is reused for both incompat and compat
RO flags.
This can easily lead to conflicts, as
BTRFS_FEATURE_INCOMPAT_MIXED_BLOCK_GROUP and
BTRFS_FEATURE_COMPAT_RO_BLOCK_GROUP_TREE both share the same bit
(1 << 2).
Thus in the above case, mkfs.btrfs believes it has set the
MIXED_BLOCK_GROUP feature, but what we really want is BLOCK_GROUP_TREE.
[FIX]
Instead of incorrectly re-using the same bits in btrfs_feature, split
the old flags into 3 flags:
- incompat_flag
- compat_ro_flag
- runtime_flag
The first two flags are easy to understand: they hold the corresponding
on-disk flag of each feature.
The last one, runtime_flag, covers features that don't have any on-disk
flag set, like QUOTA and LIST_ALL.
And since we're no longer using a single u64 as features, we have to
introduce a new structure, btrfs_mkfs_features, to contain the above 3
flags.
This also means that things like default mkfs features must be converted
to use the new structure, thus those old macros are all converted to
const static structures:
- BTRFS_MKFS_DEFAULT_FEATURES + BTRFS_MKFS_DEFAULT_RUNTIME_FEATURES
-> btrfs_mkfs_default_features
- BTRFS_CONVERT_ALLOWED_FEATURES -> btrfs_convert_allowed_features
And since we're using a structure, it's no longer as easy to implement
a disallowed mask.
Thus functions with @mask_disallowed are all changed to use
an @allowed structure pointer (which can be NULL).
Finally if we have experimental features enabled, all features can be
specified by -O options, and we can output a unified feature list,
instead of the old split ones.
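The bit collision and the split into separate flag words can be
illustrated with a small sketch. The structure layout follows the three
flags named above and the bit values are the ones stated in [CAUSE];
all SKETCH_/sketch_ names and both helper functions are hypothetical
and are not the btrfs-progs definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t u64;

/* Bit values as stated in the commit message; names are sketch-local. */
#define SKETCH_INCOMPAT_MIXED_BLOCK_GROUP	(1ULL << 2)
#define SKETCH_COMPAT_RO_BLOCK_GROUP_TREE	(1ULL << 2)

/* Three separate flag words, mirroring the structure described above. */
struct sketch_mkfs_features {
	u64 incompat_flag;
	u64 compat_ro_flag;
	u64 runtime_flag;
};

/* Old scheme: one u64 shared by both namespaces, so the bits collide. */
static bool old_scheme_sees_mixed(u64 features)
{
	return (features & SKETCH_INCOMPAT_MIXED_BLOCK_GROUP) != 0;
}

/* New scheme: each namespace is checked in its own field. */
static bool new_scheme_sees_mixed(const struct sketch_mkfs_features *f)
{
	return (f->incompat_flag & SKETCH_INCOMPAT_MIXED_BLOCK_GROUP) != 0;
}
```

Setting only the block group tree bit makes the old check report mixed
block groups, while the new check does not.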
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There's a group of helpers to read device size, the btrfs_device_size
should be one of them. Rename it and do minor cleanup.
Signed-off-by: David Sterba <dsterba@suse.com>
Replace BUG_ON after transaction start failures, all the functions
already handle errors and return them to the caller. The other error
handling is for impossible conditions.
Signed-off-by: David Sterba <dsterba@suse.com>
The leafsize has never been different from nodesize and since 4.0 (2015)
it's been an alias for nodesize. This should be enough time for
everybody to update, so the support is removed.
Signed-off-by: David Sterba <dsterba@suse.com>
The meaning of the -b/--byte-count option is different than what the
help text says. Historically it was used to set the filesystem size but
with multiple devices it sets the size on each device:
$ mkfs.btrfs /dev/sdx[1234]
...
Number of devices: 4
Devices:
ID SIZE PATH
1 2.00GiB /dev/sdx1
2 2.00GiB /dev/sdx2
3 2.00GiB /dev/sdx3
4 2.00GiB /dev/sdx4
And when set to 1G:
$ mkfs.btrfs -b 1G /dev/sdx[1234]
...
Number of devices: 4
Devices:
ID SIZE PATH
1 1.00GiB /dev/sdx1
2 1.00GiB /dev/sdx2
3 1.00GiB /dev/sdx3
4 1.00GiB /dev/sdx4
Signed-off-by: David Sterba <dsterba@suse.com>
The (unsigned long long) type casts can be dropped, printf understands
%llu and u64 and does not warn. In cases where the type is not u64 keep
the cast.
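A minimal sketch of the two cases, assuming u64 is a typedef of
unsigned long long (that typedef and both function names are
assumptions made for this example):

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Assumption: u64 is a typedef of unsigned long long, so %llu matches. */
typedef unsigned long long u64;

/* Format a byte count without any cast; returns snprintf()'s result. */
static int format_size(char *buf, size_t len, u64 bytes)
{
	return snprintf(buf, len, "%llu", bytes);
}

/* A type that is not u64 still needs the cast to match %llu. */
static int format_count(char *buf, size_t len, unsigned int count)
{
	return snprintf(buf, len, "%llu", (unsigned long long)count);
}
```

On platforms where a 64-bit type is defined as unsigned long instead,
%llu would not match and the cast (or PRIu64) would still be needed.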
Signed-off-by: David Sterba <dsterba@suse.com>
When devices are formatted as btrfs, btrfs_prepare_device is called
sequentially for each device, which takes too much time.
Put each btrfs_prepare_device call into a thread, wait for the first
thread to complete before creating the filesystem on the first device,
and wait for the other threads to complete before adding the other
devices to the filesystem.
During the preparation it's either trim/discard or zone reset.
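The threading pattern can be sketched as below. This is a simplified
illustration and not the patch itself: it joins all threads at once
instead of handling the first device separately, the per-device work is
a stand-in, and every sketch_ name is hypothetical.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define SKETCH_MAX_DEVICES 16

/* Hypothetical per-device work item; not the real btrfs-progs structure. */
struct sketch_prepare_info {
	const char *path;
	int ret;	/* result written by the worker thread */
	pthread_t tid;
};

/* Stand-in for the slow per-device step (discard or zone reset). */
static void *sketch_prepare_one(void *arg)
{
	struct sketch_prepare_info *info = arg;

	info->ret = (info->path && info->path[0]) ? 0 : -1;
	return NULL;
}

/*
 * Start one thread per device, then join them all. Returns 0 when every
 * device was prepared, or -1 on any failure.
 */
static int sketch_prepare_all(const char **paths, int count)
{
	struct sketch_prepare_info infos[SKETCH_MAX_DEVICES];
	int started = 0;
	int ret = 0;

	if (count < 0 || count > SKETCH_MAX_DEVICES)
		return -1;

	for (int i = 0; i < count; i++) {
		infos[i].path = paths[i];
		infos[i].ret = -1;
		if (pthread_create(&infos[i].tid, NULL, sketch_prepare_one,
				   &infos[i]) != 0) {
			ret = -1;
			break;
		}
		started++;
	}
	/* Join every thread that was started, even on failure. */
	for (int i = 0; i < started; i++) {
		pthread_join(infos[i].tid, NULL);
		if (infos[i].ret != 0)
			ret = -1;
	}
	return ret;
}
```

Each worker only writes to its own slot, so no locking is needed; the
results are read only after pthread_join() returns.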
This was tested with TCMU emulation with two zoned devices. Each device
is 20000G (about 19.53 TiB), the zone size is 4MB. Use the following
parameters for targetcli:
create name=zbc0 size=20000G cfgstring=model-HM/zsize-4/conv-100@~/zbc0.raw
Call difftime to calculate the running time of the function
btrfs_prepare_device. Calculate the time from thread creation to
completion of all threads after patching:
$ lsscsi -p
[10:0:1:0] (0x14) LIO-ORG TCMU ZBC device 0002 /dev/sdb - none
[11:0:1:0] (0x14) LIO-ORG TCMU ZBC device 0002 /dev/sdc - none
$ sudo mkfs.btrfs -d single -m single -O zoned /dev/sdc /dev/sdb -f
....
time for prepare devices:4.000000.
....
$ sudo mkfs.btrfs -d single -m single -O zoned /dev/sdc /dev/sdb -f
...
time for prepare devices:2.000000.
...
Issue: #496
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Li Zhang <zhanglikernel@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The preferred order:
- system headers
- standard headers
- libraries
- kernel library
- kernel shared
- common headers
- other tools
- own headers
Signed-off-by: David Sterba <dsterba@suse.com>
The source dir points to the argv data, we should make a copy to be sure
it won't change due to further processing.
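A minimal sketch of taking such a copy; sketch_copy_arg() is a
hypothetical helper written for this example, not the function used in
the patch:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Return a heap copy of an argv-style string so later argument
 * processing cannot change it. The caller frees the result.
 * Returns NULL on allocation failure or when arg is NULL.
 */
static char *sketch_copy_arg(const char *arg)
{
	size_t len;
	char *copy;

	if (!arg)
		return NULL;
	len = strlen(arg) + 1;	/* include the trailing NUL */
	copy = malloc(len);
	if (copy)
		memcpy(copy, arg, len);
	return copy;
}
```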
Signed-off-by: David Sterba <dsterba@suse.com>
The helper parse_label is used only once and is trivial. Open code it in
the argument parsing, also to make the exit() more visible.
Signed-off-by: David Sterba <dsterba@suse.com>
There's a helper that parses the profile name and exits on error. As
this is a trivial helper we can open code it and adapt the error message
to be more specific about what failed.
Signed-off-by: David Sterba <dsterba@suse.com>
To reduce the test matrix and to follow the kernel behavior, make sure
that when the block-group-tree feature is requested, the no-holes and
free-space-tree features are enabled as well.
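The dependency can be expressed as a simple mask check. In this sketch
the bit values are arbitrary and do not correspond to on-disk flags,
and sketch_features_valid() is a hypothetical name:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Sketch-local feature bits; the values are arbitrary, not on-disk flags. */
#define SKETCH_NO_HOLES		(1ULL << 0)
#define SKETCH_FREE_SPACE_TREE	(1ULL << 1)
#define SKETCH_BLOCK_GROUP_TREE	(1ULL << 2)

/* block-group-tree is only accepted with both prerequisites enabled. */
static bool sketch_features_valid(uint64_t features)
{
	const uint64_t required = SKETCH_NO_HOLES | SKETCH_FREE_SPACE_TREE;

	if (!(features & SKETCH_BLOCK_GROUP_TREE))
		return true;
	return (features & required) == required;
}
```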
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Block group tree is a completely standalone feature, and it has been
over 5 years since it was initially proposed to solve the long mount
time.
I don't really want to waste another 5 years waiting for a feature which
may or may not work, and whose preparation patches have definitely not
been properly reviewed.
So this patch will separate the block group tree feature into a
standalone compat RO feature.
There is a catch in mkfs create_block_group_tree(): the current
tree-checker only accepts block group items with a valid chunk_objectid,
but the existing code from extent-tree-v2 didn't properly initialize it.
This patch also fixes the above mentioned problem so the kernel can
mount the filesystem correctly.
Now mkfs/fsck should be able to handle a filesystem with block group
tree.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Add constant for initial value to avoid unexpected clashes with user
defined getopt values and shift the common size getopt values.
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
When running btrfs/011 in the subpage case, even with RAID56 support, it
still fails with the following error:
QA output created by 011
*** test btrfs replace
mkfs failed
(see /home/adam/xfstests-dev/results//btrfs/011.full for details)
The full log shows:
---------workout "-m single -d single -M" 1 no 64-----------
ERROR: illegal nodesize 65536 (not equal to 4096 for mixed block group)
mkfs failed
This is a critical error that causes the test case to abort without
checking the remaining profiles.
[CAUSE]
mkfs.btrfs always uses the maximum of sectorsize and page size for its
mixed profile nodesize.
For the subpage case, this means we always use PAGE_SIZE, no matter what
sectorsize is passed in.
[FIX]
Just get rid of the direct PAGE_SIZE usage when determining nodesize for
mixed profiles.
And use sectorsize directly (either passed in by the user, or
determined from page size).
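The difference between the old and new selection can be sketched as two
small functions. Both function names are hypothetical, and the page
size is passed in as a parameter to keep the example self-contained:

```c
#include <assert.h>
#include <stdint.h>

/* Old behavior: the larger of sectorsize and page size. */
static uint32_t sketch_mixed_nodesize_old(uint32_t sectorsize,
					  uint32_t page_size)
{
	return sectorsize > page_size ? sectorsize : page_size;
}

/* New behavior: mixed block groups use nodesize == sectorsize. */
static uint32_t sketch_mixed_nodesize_new(uint32_t sectorsize)
{
	return sectorsize;
}
```

With a 64K page size and a 4K sectorsize, the old function yields the
nodesize 65536 seen in the error message, while the new one yields 4096.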
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Now that we have all of the supporting code, add the ability to create
all of the global roots for an extent tree v2 fs. This will default to
nr_cpu's, but also allow the user to specify how many global roots they
would like.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We're going to start creating global roots from mkfs, and we need to
have an offset set for the root key. Make btrfs_create_tree() take a key
for the root_key instead of just the objectid so we can set up these new
style roots properly.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Add the extent tree v2 table with the block group tree as a root, and
then create the empty root and use the proper root for cleaning up the
temporary block groups.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Instead of accessing the extent root directly for modifying block
groups, use the helper which will do the correct thing based on the
flags of the file system.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We use __BTRFS_LEAF_DATA_SIZE() in a few places for mkfs. With extent
tree v2 we'll be increasing the size of btrfs_header, so it'll be kind
of annoying to add flags to all callers of __BTRFS_LEAF_DATA_SIZE, so
simply calculate it once and put it in the mkfs_config and use that.
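The idea of computing the value once can be sketched as follows. The
structure and function are hypothetical and trimmed down to what the
example needs; the header size is a parameter because it depends on the
on-disk format in use.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, trimmed-down config; only the fields this sketch needs. */
struct sketch_mkfs_config {
	uint32_t nodesize;
	uint32_t header_size;		/* depends on the on-disk format */
	uint32_t leaf_data_size;	/* computed once, reused everywhere */
};

/* Fill in leaf_data_size so callers no longer recompute it. */
static void sketch_init_leaf_data_size(struct sketch_mkfs_config *cfg)
{
	cfg->leaf_data_size = cfg->nodesize - cfg->header_size;
}
```

Callers then read cfg->leaf_data_size instead of passing format flags
to a size macro at every call site.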
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Pass BTRFS_BLOCK_GROUP_DATA and BTRFS_BLOCK_GROUP_METADATA to
zoned_profile_supported(), so we can actually distinguish whether it is
a data or a metadata block group.
Fixes: 8f914d518a46 ("btrfs-progs: zoned support DUP on metadata block groups")
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently we have two places checking if a block-group profile is
supported on a zoned device, one in mkfs/main.c and one in
kernel-shared/zoned.c.
Use the one from kernel-shared/zoned.c in mkfs as well, unifying all
checks.
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We want to enable developers to test the extent tree v2 features as they
are added, so add the ability to mkfs an extent tree v2 fs if we have
experimental enabled.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We are going to have multiples of these trees with extent tree v2, so
add a rb tree to track them based on their root key value. This works
for both v1 and v2, so we can remove the direct pointers to these roots
in our fs_info.
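Tracking roots in a tree keyed by the root key needs a total order on
keys. A minimal comparison sketch, assuming the usual ordering by
objectid, then type, then offset (the structure and function names are
hypothetical and the rb tree itself is omitted):

```c
#include <assert.h>
#include <stdint.h>

/* Sketch-local key with the three fields of a root key. */
struct sketch_key {
	uint64_t objectid;
	uint8_t type;
	uint64_t offset;
};

/* Order by objectid, then type, then offset; returns -1, 0 or 1. */
static int sketch_key_cmp(const struct sketch_key *a,
			  const struct sketch_key *b)
{
	if (a->objectid != b->objectid)
		return a->objectid < b->objectid ? -1 : 1;
	if (a->type != b->type)
		return a->type < b->type ? -1 : 1;
	if (a->offset != b->offset)
		return a->offset < b->offset ? -1 : 1;
	return 0;
}
```

Two roots that share an objectid but differ in offset compare as
different keys, which is what allows multiple trees of the same kind.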
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We're going to have multiple free space roots in the future, so access
it via a helper in most cases. We will address the remaining direct
accesses in future patches.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When we switch to multiple global trees we'll need to access the
appropriate extent root depending on the block group or possibly root.
To handle this, use a helper in most places and then the actual root in
places where it is required. We will whittle down the direct accessors
with future patches, but this does the bulk of the preparatory work.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
With extent tree v2 we will have per-block group checksums, so add a
helper to access the csum root and rename the fs_info csum_root to
_csum_root to catch all the places that are accessing it directly.
Convert everybody to use the helper except for internal things.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
Since btrfs-progs v5.14, mkfs.btrfs no longer cleans up the temporary
SINGLE metadata chunks if "-R free-space-tree" is specified:
$ mkfs.btrfs -f -R free-space-tree -m dup -d dup /dev/test/test
$ btrfs ins dump-tree -t chunk /dev/test/test | grep "type METADATA"
length 8388608 owner 2 stripe_len 65536 type METADATA
length 268435456 owner 2 stripe_len 65536 type METADATA|DUP
[CAUSE]
Since commit 4b6cf2a3eb ("btrfs-progs: mkfs: generate free space tree
at make_btrfs() time"), free space tree is created when the temporary
btrfs image is created.
This behavior itself is not a problem. The problem happens when
"-m DUP -d DUP" (or other profiles) is specified.
This makes btrfs create extra chunks, enlarging the free space tree so
that it can be as high as level 1.
During mkfs, we rely on recow_roots() to re-COW all tree blocks to the
newly allocated chunks.
But __recow_root() can only handle a tree root at level 0, as it only
forces the root node to be COWed and does not touch the child
leaves/nodes.
This leaves part of the free space tree on the old temporary chunks, so
the later cleanup_temp_chunks() is unable to delete the temporary SINGLE
chunks.
[FIX]
Rework __recow_root() to do a proper COW of the whole tree.
But the above rework is not enough: if a free space tree block is
allocated during the current transaction, but before the new chunks are
added, the reworked __recow_root() can't COW it, as btrfs_search_slot()
won't COW a tree block allocated in the current transaction.
So this patch also commits the current transaction before calling
recow_roots(), to force us to re-COW all tree blocks.
This shouldn't be a problem, as at the time of calling we should have
fewer than a dozen tree blocks, thus there won't be a performance impact.
Reported-by: FireFish5000 <firefish5000@gmail.com>
Fixes: 4b6cf2a3eb ("btrfs-progs: mkfs: generate free space tree at make_btrfs() time")
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We need to use direct-IO for zoned devices to preserve the write
ordering. Instead of detecting if the device is zoned or not, we simply
use direct-IO for any kind of device (even if emulated zoned mode on a
regular device).
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Since zone_size() returns an emulated zone size even for a non-zoned
device, we cannot use cfg.zone_size to determine whether the device is
zoned or not. Set zone_size = 0 in non-zoned mode.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Changing several defaults at once is desirable for easier reference,
rather than a number of scattered releases enabling each. The changes
are documented but printing a notice won't hurt as not everybody reads
the documentation or release notes.
Undesired features can be unselected by prepending ^ to the option name,
like:
$ mkfs.btrfs -O ^no-holes
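How a leading ^ could be handled when parsing a single feature token is
sketched below. The table, the bit values and sketch_apply_feature()
are hypothetical and much simpler than the real option parser:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical name-to-bit table; the bit values are arbitrary. */
struct sketch_feature {
	const char *name;
	uint64_t bit;
};

static const struct sketch_feature sketch_table[] = {
	{ "no-holes", 1ULL << 0 },
	{ "free-space-tree", 1ULL << 1 },
};

/*
 * Apply one token such as "no-holes" or "^no-holes" to *flags.
 * Returns 0 on success, -1 if the name is unknown.
 */
static int sketch_apply_feature(const char *token, uint64_t *flags)
{
	int clear = (token[0] == '^');
	const char *name = clear ? token + 1 : token;
	size_t n = sizeof(sketch_table) / sizeof(sketch_table[0]);

	for (size_t i = 0; i < n; i++) {
		if (strcmp(name, sketch_table[i].name) != 0)
			continue;
		if (clear)
			*flags &= ~sketch_table[i].bit;
		else
			*flags |= sketch_table[i].bit;
		return 0;
	}
	return -1;
}
```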
Signed-off-by: David Sterba <dsterba@suse.com>
The original idea of not doing DUP on SSD was that the duplicate blocks
get deduplicated again by the drive firmware. This was in 2013, years
ago. It was speculative then, and even nowadays we don't have much
reliable information from vendors about what optimizations are done on
the drive level.
Over the years, enough information has been gathered by the user
community, and there's no simple answer. Expensive drives are more
reliable but less common, for cheap consumer drives it's vice versa. The
characteristics are described in more detail in manual page btrfs(5) in
section "SOLID STATE DRIVES (SSD)".
The reasoning is based on numerous reports on IRC and the technical
difficulty of making the right decision on the mkfs side. The default is
chosen to be the safe option, and it's up to the user to change that
based on an informed decision.
Issue: #319
Signed-off-by: David Sterba <dsterba@suse.com>
The free space tree is a better way to track the free space and has been
tested in the wild for a long time. Backward compatibility is
sufficient, spanning several long term kernels. On-line conversion from
v1 to v2 can be done by mount, switching from v2 to v1 can be done by
'btrfs check'.
Issue: #295
Signed-off-by: David Sterba <dsterba@suse.com>
Make the helpers using crc32c not inline so the crc32c.h can be removed
from the public headers exported by libbtrfs.
Signed-off-by: David Sterba <dsterba@suse.com>
The default output of mkfs is intentionally verbose so we did not need
the verbosity option. For some additional information it could be useful
to increase the level in case it's wired to the global verbosity
settings.
Signed-off-by: David Sterba <dsterba@suse.com>