Use sbwrite instead of pwrite to support superblock logging in zoned
mode. In addition, call fsync() to persist the superblock to ensure the
write order. It also helps us to detect an unaligned write (write to a
position other than the write pointer) error.
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In zoned mode, chunks must be aligned to zone size to ensure sequential
writing to a block group maps to sequential writing to a device zone.
Thus, we need to tweak the position and the size of the initial system
block group.
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This commit disables some features which are incompatible with zoned btrfs.
RAID/DUP is disabled because we cannot handle two zone append writes to
different zones in the kernel. MIXED_BG is disabled because the allocated
metadata region will be write holes for data writes. Space-cache (v1)
require in-place updatings.
It also disables the "--rootdir" option for now. The copying from a
directory needs some tweaks for zoned btrfs (e.g. zone size aware space
calculation), and we do not implement them yet.
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Make mkfs.btrfs aware of the "zoned" feature flag and prepare the disks
for mkfs.btrfs. It automatically detects host-managed zoned device and
enables the future.
It also adds "zone_size" to struct btrfs_mkfs_config to track the zone
size.
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We cannot overwrite superblock magic in a sequential required zone.
Instead, we can reset the zone to wipe it.
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If we zero out a region in a sequential write required zone, we cannot
write to the region until we reset the zone. Thus, we must prohibit zeroing
out to a sequential write required zone.
zero_dev_clamped() is modified to take the zone information and it calls
zero_zone_blocks() if the device is host managed to avoid writing to
sequential write required zones.
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
All zones of zoned block devices should be reset before writing. Support
this by introducing PREP_DEVICE_ZONED.
btrfs_reset_all_zones() walk all the zones on a device, and reset a zone if
it is sequential required zone, or discard the zone range otherwise.
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When freeing a chunk, we can/should reset the underlying device zones
for the chunk. Introduce btrfs_reset_chunk_zones() and reset the zones.
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Tree manipulating operations like merging nodes often release
once-allocated tree nodes. Btrfs cleans such nodes so that pages in the
node are not uselessly written out. On ZONED drives, however, such
optimization blocks the following IOs as the cancellation of the write
out of the freed blocks breaks the sequential write sequence expected by
the device.
Check if next dirty extent buffer is continuous to a previously written
one. If not, it redirty extent buffers between the previous one and the
next one, so that all dirty buffers are written sequentially.
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Conventional zones do not have a write pointer, so we cannot use it to
determine the allocation offset for sequential allocation if a block
group contains a conventional zone.
But instead, we can consider the end of the highest addressed extent in
the block group for the allocation offset.
For new block group, we cannot calculate the allocation offset by
consulting the extent tree, because it can cause deadlock by taking
extent buffer lock after chunk mutex, which is already taken in
btrfs_make_block_group(). Since it is a new block group anyways, we can
simply set the allocation offset to 0.
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Implement a sequential extent allocator for zoned filesystems. This
allocator only needs to check if there is enough space in the block group
after the allocation pointer to satisfy the extent allocation request.
Since the allocator is really simple, we implement it directly in
find_search_start().
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
A zoned filesystem must allocate blocks at the zones' write pointer. The
device's write pointer position can be mapped to a logical address
within a block group. To facilitate this, add an "alloc_offset" to the
block group to track the logical addresses of the write pointer.
This logical address is populated in btrfs_load_block_group_zone_info()
from the write pointers of corresponding zones.
For now, zoned filesystems the single profile. Supporting non-single
profile with zone append writing is not trivial. For example, in the DUP
profile, we send a zone append writing IO to two zones on a device. The
device reply with written LBAs for the IOs. If the offsets of the
returned addresses from the beginning of the zone are different, then it
results in different logical addresses.
We need fine-grained logical to physical mapping to support such
separated physical address issue. Since it should require additional
metadata type, disable non-single profiles for now.
This commit supports the case all the zones in a block group are
sequential. The next patch will handle the case having a conventional
zone.
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Implement a zoned chunk and device extent allocator. One device zone
becomes a device extent so that a zone reset affects only this device
extent and does not change the state of blocks in the neighbor device
extents.
To implement the allocator, we need to extend the following functions for
a zoned filesystem:
- init_alloc_chunk_ctl
- dev_extent_search_start
- dev_extent_hole_check
- decide_stripe_size
Here, dev_extent_hole_check() is newly introduced to check the validity of
a hole found.
init_alloc_chunk_ctl_zoned() is mostly the same as regular one. It always
set the stripe_size to the zone size and aligns the parameters to the zone
size.
dev_extent_search_start() only aligns the start offset to zone boundaries.
We don't care about the first 1MB like in regular filesystem because we
anyway reserve the first two zones for superblock logging.
dev_extent_hole_check_zoned() checks if zones in given hole are either
conventional or empty sequential zones. Also, it skips zones reserved for
superblock logging.
With the change to the hole, the new hole may now contain pending extents.
So, in this case, loop again to check that.
Finally, decide_stripe_size_zoned() should shrink the number of devices
instead of stripe size because we need to honor stripe_size == zone_size.
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Superblock (and its copies) is the only data structure in btrfs which has a
fixed location on a device. Since we cannot overwrite in a sequential write
required zone, we cannot place superblock in the zone. One easy solution
is limiting superblock and copies to be placed only in conventional zones.
However, this method has two downsides: one is reduced number of superblock
copies. The location of the second copy of superblock is 256GB, which is in
a sequential write required zone on typical devices in the market today.
So, the number of superblock and copies is limited to be two. Second
downside is that we cannot support devices which have no conventional zones
at all.
To solve these two problems, we employ superblock log writing. It uses two
adjacent zones as a circular buffer to write updated superblocks. Once the
first zone is filled up, start writing into the second one. Then, when
both zones are filled up and before starting to write to the first zone
again, reset the first zone.
We can determine the position of the latest superblock by reading write
pointer information from a device. One corner case is when both zones are
full. For this situation, we read out the last superblock of each zone, and
compare them to determine which zone is older.
The following zones are reserved as the circular buffer on ZONED btrfs.
- primary superblock: offset 0B (and the following zone)
- first copy: offset 512G (and the following zone)
- Second copy: offset 4T (4096G, and the following zone)
If these reserved zones are conventional, superblock is written fixed at
the start of the zone without logging.
Currently, superblock reading/writing is done by pread/pwrite. This
commit replace the call sites with sbread/sbwrite to wrap the functions.
For zoned btrfs, btrfs_sb_io which is called from sbread/sbwrite
reverses the IO position back to a mirror number, maps the mirror number
into the superblock logging position, and do the IO.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Run a zoned filesystem on non-zoned devices. This is done by "slicing
up" the block device into fixed-sized chunks and emulate a conventional
zone on each of them. The emulated zone size is determined from the size
of device extent.
This is mainly aimed at testing of zoned filesystems, i.e. the zoned
chunk allocator, on regular block devices.
Currently, we always use EMULATED_ZONE_SIZE (256MiB) for the emulated
zone size. In the future, this will be customized by mkfs option.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Placing both data and metadata in a block group is impossible in ZONED
mode. For data, we can allocate a space for it and write it immediately
after the allocation. For metadata, however, we cannot do that, because the
logical addresses are recorded in other metadata buffers to build up the
trees. As a result, a data buffer can be placed after a metadata buffer,
which is not written yet. Writing out the data buffer will break the
sequential write rule.
Check and disallow MIXED_BG with ZONED mode.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The zone append write command has a maximum IO size restriction it
accepts. This is because a zone append write command cannot be split, as
we ask the device to place the data into a specific target zone and the
device responds with the actual written location of the data.
Introduce max_zone_append_size to zone_info and fs_info to track the
value, so we can limit all I/O to a zoned block device that we want to
write using the zone append command to the device's limits.
Zone append command is mandatory for zoned btrfs. So, reject a device
with max_zone_append_size == 0.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Introduce function btrfs_check_zoned_mode() to check if ZONED flag is
enabled on the file system and if the file system consists of zoned
devices with equal zone size.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Get the zone information (number of zones and zone size) from all the
devices, if the volume contains a zoned block device. To avoid costly
run-time zone report commands to test the device zones type during block
allocation, it also records all the zone status (zone type, write
pointer position, etc.).
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
With the zoned feature enabled, a zoned block device-aware btrfs
allocates block groups aligned to the device zones and always written in
sequential zones at the zone write pointer position.
It also supports "emulated" zoned mode on a non-zoned device. In the
emulated mode, btrfs emulates conventional zones by slicing the device
into fixed-size zones.
We don't support conversion from the ext4 volume with the zoned feature
because we can't be sure all the converted block groups are aligned to
zone boundaries.
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If the kernel supports zoned block devices, the file
/usr/include/linux/blkzoned.h will be present. Check this and define
BTRFS_ZONED if the file is present.
If it present, enables ZONED feature, if not disable it.
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Likewise in the kernel code, provide fs_info access from struct
btrfs_device. This will help to unify the code between the kernel and
the userland.
Since fs_info can be NULL at the time of btrfs_add_to_fsid(), let's use
btrfs_open_devices() to set fs_info to the devices.
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Introduce the queue_param helper function to get a device request queue
parameter. This helper will be used later to query information of a zoned
device.
Furthermore, rewrite is_ssd() using the helper function.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
[Naohiro] fixed error return value
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
alloc_chunk_ctl::calc_size is actually the stripe_size in the kernel
side code. Let's rename it to clarify what the "calc" is.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Chunk_bytes_by_type() takes type, calc_size, and ctl as arguments. But
the first two can be obtained from the ctl. Let's drop these arguments
for simplicity.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Since commit b9444efb66 ("btrfs-progs: don't pretend RAID56 has a
different stripe length"), alloc_chunk_ctl::stripe_len is always fixed
to BTRFS_STRIPE_LEN. Let's replace alloc_chunk_ctl::stripe_len with
BTRFS_STRIPE_LEN, like in the kernel code.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Several calculations in the chunk allocation process use this pattern.
x /= y;
x *= y;
Replace this pattern with round_down().
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In the DUP profile, we can use only half of the space available in a
device extent. Fix the calculation of calc_size for it.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_alloc_data_chunk() and create_chunk() have the most part in common.
Let's rewrite btrfs_alloc_data_chunk() using create_chunk().
There are two differences between btrfs_alloc_data_chunk() and
create_chunk(). create_chunk() uses find_next_chunk() to decide the
logical address of the chunk, and it uses btrfs_alloc_dev_extent() to
decide the physical address of a device extent. On the other hand,
btrfs_alloc_data_chunk() uses *start for both logical and physical
addresses.
To support the btrfs_alloc_data_chunk()'s use case, we use ctl->start
and ctl->dev_offset. If these values are set (non-zero), use the
specified values as the address. It is safe to use 0 to indicate the
value is not set here. Because both lower addresses of logical
(0..BTRFS_FIRST_CHUNK_TREE_OBJECT_ID) and physical
(0..BTRFS_BLOCK_RESERVED_1M_FOR_SUPER) are reserved.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Factor out create_chunk() from btrfs_alloc_chunk(). This new function
creates a chunk.
There is no functional changes.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Factor out decide_stripe_size() from btrfs_alloc_chunk(). This new
function calculates the actual stripe size to allocate and decides the
size of a stripe (ctl->calc_size).
This commit has no functional changes.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Move parameter initialization code for regular allocator to
init_alloc_chunk_ctl_policy_regular(). This will help adding another
allocator in the future.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Convert alloc_chunk_ctl::type to take the original type in
btrfs_alloc_chunk(). This will help refactoring in the following commits.
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Factor out the function dev_extent_search_start() from
find_free_dev_extent_start() to decide the starting position of a device
extent search.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Introduce chunk allocation policy for btrfs. This policy controls how
chunks and device extents are allocated from devices.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This was asked on reddit, how to automatically mount a swapfile from
fstab. As this is not completely obvious, document it with an example.
Signed-off-by: David Sterba <dsterba@suse.com>
There were plans to add X as flag to set/unset the btrfs NOCOMPRESS
attribute but that never materialized. In e2fsprogs the letter 'm' has
been assigned to the same functionality and released in version 1.46.2.
Update the docs and mention that the compression options are
conflicting.
Signed-off-by: David Sterba <dsterba@suse.com>
Resize to nums without sign prefix makes false output:
$ btrfs fi resize 1:150g /srv/extra
Resize device id 1 (/dev/sdb1) from 298.09GiB to 0.00B
The resize operation would take effect though.
Fix it by handling the case if mod is 0 in check_resize_args().
Issue: #307
Reported-by: Chris Murphy <lists@colorremedies.com>
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Su Yue <l@damenly.su>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently mkfs.btrfs will output a warning message if the sectorsize is
not the same as page size:
WARNING: the filesystem may not be mountable, sectorsize 4096 doesn't match page size 65536
But since btrfs subpage support for 64K page size is coming, this output
is populating the golden output of fstests, causing tons of false
alerts.
This patch will teach mkfs.btrfs to check
/sys/fs/btrfs/features/supported_sectorsizes and check if the sector
size is supported.
Then only output above warning message if the sector size is not
supported or the file is not found (ie. kernel does not export the file
yet).
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This relicenses the libbtrfsutil library to LGPLv2.1+ from LGPLv3.
People that have contributed non-trivial changes acknowledged the change
and are listed below.
There's a potential licensing conflict with the 'btrfs' utility that is
GPLv2 and statically links libbtrfsutil, this is not a valid combination
per the compatibility matrix as found in
https://www.gnu.org/licenses/gpl-faq.html#AllCompatibility or
http://gplv3.fsf.org/dd3-faq .
We also have an explicit request to change the license [1] (issue #323)
from LGPLv3 to allow use in environments that don't like GPLv3. Though
the library license is not GPLv3, the full text of the license is in the
repository and the 'lesser' part is an addendum. This was perhaps a bit
confusing, nevertheless this gets clarified as well.
[1] https://lore.kernel.org/linux-btrfs/b927ca28-e280-4d79-184f-b72867dbdaa8@denx.de/
Acked-by: Omar Sandoval <osandov@fb.com>
Acked-by: Misono Tomhiro <misono.tomohiro@jp.fujitsu.com>
Acked-by: Qu Wenruo <wqu@suse.com>
Acked-by: Marcos Paulo de Souza <mpdesouza@suse.com>
Acked-by: Anand Jain <anand.jain@oracle.com>
Acked-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
Link: https://bugs.debian.org/985400
Issue: #323
Signed-off-by: Neal Gompa <ngompa@fedoraproject.org>
Signed-off-by: David Sterba <dsterba@suse.com>
In case the right buffer is emptied it's first set to NULL and
subsequently it's dereferenced to get its size to pass to root_sub_used.
This naturally leads to a NULL pointer dereference. The correct thing to
do is to pass the stashed right->len in "blocksize".
Issue: #296
Pull-request: #360
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Marking BUG() unreachable helps us silence unnecessary warnings e.g.
"warning: control reaches end of non-void function [-Wreturn-type]" like
the code below.
int foo()
{
...
if (XXX)
return 0;
else if (YYY)
return 1;
else
BUG();
}
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Add general paragraphs and establish per-ioctl argument description
formatting, using [horizontal] where the struct member and the
description are on the same line, unlike ordinary ::.
Signed-off-by: David Sterba <dsterba@suse.com>
Notes for making it work:
- "CFLAGS=-m32 LDFLAGS=-m32 ./configure ..." passes the right flags
everywhere, otherwise just passing it by EXTRA_CFLAGS/EXTRA_LDFLAGS
might miss some binaries or libraries
- note that passing CFLAGS/LDFLAGS to configure will override the
default flags
- libbtrfsutil and python bindings require to pass the -m32 flags via
EXTRA_PYTHON_CFLAGS/EXTRA_PYTHON_LDFLAGS
- all the e2fsprogs, zlib, lzo, util-linux etc need packaging changes to
provide static libraries - usually frowned upon by distribution unless
justified, even less so for static + 32bit versions, very niche
use case
Signed-off-by: David Sterba <dsterba@suse.com>
The correct checksum type value is set a few lines below, there seems
to be stale crc32 initialization. Also remove the crc32c.h include as
it's not used directly anymore.
Signed-off-by: David Sterba <dsterba@suse.com>
For passing authentication keys to the checksumming functions we need a
container for the key.
Pass in a btrfs_fs_info to btrfs_csum_data() so we can use the fs_info
as a container for the authentication key.
Note this is not always possible for all callers of btrfs_csum_data() so
we're just passing in NULL for now
Functions calling btrfs_csum_data() with a NULL fs_info argument are
currently not supported in the context of an authenticated file system.
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: David Sterba <dsterba@suse.com>
Extending open_ctree with more parameters would be difficult, we'll need
to add more so factor out the parameters to a structure for easier
extension.
Signed-off-by: David Sterba <dsterba@suse.com>
The paths are not adapted to the TEST_TOP added long time ago in
e44f595dd7 ("btrfs-progs: tests: unify test drivers, make ready for
extenral testsuite").
Signed-off-by: David Sterba <dsterba@suse.com>
Add test vectors, a subset without keys as found in linux kernel sources
in crypto/test-mgr.h for all supported hash algorithms.
Signed-off-by: David Sterba <dsterba@suse.com>