The existing attempt for changing csum types is as the following:
- Create a new temporary csum root
- Generate new data csums into the temporary csum root
- Drop the old csum tree and make the temporary one as csum root
- Change the checksums for metadata in-place
Unfortunately after some experiments, the csum root switch method has a
big pitfall, the backref items in extent tree.
Those backref items still point back to the old tree, meaning without a
lot of extra tricks, the extent tree would be corrupted.
Thus we have to go a new single tree variant:
- Generate new data csums into the csum root
The new data csums would have a different objectid to distinguish
them.
- Drop the old data csum items
- Change the key objectids of the new csums
- Change the checksums for metadata in-place
This means unfortunately we have to revert most of the old code, and
update the temporary item format.
The new temporary item would only record the target csum type.
At every stage we have a method to determine the progress, thus no need
for an item, but in the future it's still open for change.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The function write_and_map_eb() is quite abused as a way to write any
generic buffer back to disk.
But we have a more suitable function already, write_data_to_disk().
This patch would remove the abused write_data_to_disk() calls, and
convert the only three valid call sites to write_data_to_disk() instead.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This syncs tree-checker.c from the kernel. The main modification was to
add a open ctree flag to skip the deeper leaf checks, and plumbing this
through tree-checker.c. We need this for things like fsck or
btrfs-image that need to work with slightly corrupted file systems, and
these checks simply make us unable to look at the corrupted blocks.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
These helpers are called __btrfs_check_* in the kernel as they return
the special enum to indicate what part of the leaf/node failed. Rename
the uses in btrfs-progs to match the kernel naming convention to make it
easier to sync that code.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This exists in the kernel to do the read on an extent buffer we may have
already looked up and initialized. Simply create this helper by
extracting out the existing code from read_tree_block and make
read_tree_block call this helper. This gives us the helper we need to
sync ctree.c into btrfs-progs, and keeps the code the same in
btrfs-progs.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We have this extra parameter in the kernel to indicate if we are atomic
and thus can't lock the io_tree when checking the transid for an extent
buffer. This isn't necessary in btrfs-progs, but to allow for easier
syncing of ctree.c add this argument to our copy of btrfs_buffer_uptodate.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In the kernel we only take a bytenr for this as the extent buffer cache
is indexed on bytenr. Since we're passing in the btrfs_fs_info we can
simply use the ->nodesize for the blocksize, and drop the blocksize
argument completely. This brings us into parity with the kernel, which
will allow the syncing of ctree.c.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The in-kernel version of read_tree_block adds some extra sanity checks
to make sure we don't return blocks that don't match what we expect.
This includes the owning root, the level, and the expected first key.
We don't actually do these checks in btrfs-progs, however kernel code
we're going to sync will expect this calling convention, so update it to
match the in-kernel code and then update all the callers.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In the kernel we pass in the root_id for btrfs_free_tree_block instead
of the root itself. Update the btrfs-progs version of the helper to
match what we do in the kernel.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This is a mirror of the change I've done in the kernel, but in progs
it's even more simply because clean_tree_block was just a wrapper around
clear_extent_buffer_dirty. Change this to btrfs_clear_buffer_dirty, and
then update all the callers to use this helper instead of
clean_tree_block and clear_extent_buffer_dirty.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This is how btrfs_alloc_tree_block is defined in the kernel, so when we
go to sync this code in it'll be easier if we're already setup to accept
this argument. Since we're in progs we don't care about nesting so just
use BTRFS_NORMAL_NESTING everywhere, as we sync in the kernel code it'll
get updated to whatever is appropriate.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This is in keeping with what the function actually does, and is named
this way in the kernel.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We changed from members in the root for all the different flags to a
bit based flag system. In order to make syncing the kernel code into
btrfs-progs easier go ahead and sync in the bits we use and update all
the users of the old ->track_dirty and ->ref_cows to use the state bits.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This is a bit larger than the previous syncs, because we use
extent_io_tree's everywhere. There's a lot of stuff added to
kerncompat.h, and then I went through and cleaned up all the API
changes, which were
- extent_io_tree_init takes an fs_info and an owner now.
- extent_io_tree_cleanup is now extent_io_tree_release.
- set_extent_dirty takes a gfpmask.
- clear_extent_dirty takes a cached_state.
- find_first_extent_bit takes a cached_state.
The diffstat looks insane for this, but keep in mind extent-io-tree.c
and extent-io-tree.h are ~2000 loc just by themselves.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Now that this is unused by these helpers and only used by the repair
related code we can remove this argument from the main helpers.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This can be pulled out of the extent buffer that is passed in, drop the
fs_info argument from the function.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
With previous btrfstune support to convert to block-group-tree, it has
implemented most of the infrastructure for bi-directional conversion.
This patch will implement the remaining conversion support to go back to
extent tree.
The modification includes:
- New convert_to_extent_tree() function in btrfstune.c
It's almost the same as convert_to_bg_tree(), but with small changes:
* No need to set extra features like NO_HOLES/FST.
* Need to delete the block group tree when everything finished.
- Update btrfs_delete_and_free_root() to handle non-global roots
Currently the function can only accepts global roots (extent/csum/free
space trees)
If we pass a non-global root into the function, we will screw up
global_roots_tree and crash.
Since we're going to use btrfs_delete_and_free_root() to free block
group tree which is not a global tree, this is needed.
- New handling for half converted fs in get_last_converted_bg()
There are two cases need to be handled:
* The bg tree is already empty
We need to grab the first bg in extent tree.
Or at conversion function we will fail at grabbing the first bg.
* The bg tree is not empty
Then we need to grab the last bg in extent tree.
- Extra root switching in involved functions. This involves:
* read_converting_block_groups()
* insert_block_group_item()
* update_block_group_item()
We just need to update our target root according to the current
compat_ro and super flags.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The checksum conversion is still experimental and still does not convert
all filesystems correctly. Do not use on valuable data.
Previous implementation copied the UUID conversion which was not a good
base for the checksum conversion so it left out basically all trees
except extent and checksum.
This update adds the base for the required safety features:
- let the old csum tree intact until the full conversion is done (ie.
all data are still verifiable)
- add on-disk status tracking item, this should keep the from/to
checksum conversion, last generation to catch potential updates of the
underlying filesystem if conversion is interrupted and the filesystem
mounted
- convert most of the fundamental trees, the subvolumes, tree log and
relocation trees are not converted
- trees are converted in-place to avoid potentially running out of space
but this might be better done by transaction protection with a
temporary tree
Known issues:
- not all trees are converted
- not all checksums are correctly inserted into the new tree and reading
the files leads to EIO
Issue: #438
Signed-off-by: David Sterba <dsterba@suse.com>
We have been overloading the extent_state flags for use on the extent
buffers as well. When we sync extent-io-tree.[ch] this will become
impossible, so rename these flags to EXTENT_BUFFER_* and use those
definitions instead of the extent_state definitions.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We have some extra features in the btrfs-progs copy of the
extent_io_tree that don't exist in the kernel. In order to make syncing
easier simply move this functionality into btrfs_fs_info, that way we
can sync in the new extent_io_tree code and not have to worry about
breaking anything.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We do not use the io_tree, don't bother passing it into
verify_parent_transid.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs-progs has a cache tree embedded in the extent_io_tree in order to
track extent buffers. We use the extent_io_tree part to track dirty,
and the cache tree to keep the extent buffers in. When we sync
extent-io-tree.[ch] we'll lose this ability, so separate out the dirty
tracking into its own extent_io_tree. Subsequent patches will adjust
the extent buffer lookup so it doesn't use the custom extent_io_tree
thing.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This is a cleanup patch to make syncing the btrfs kernel code into
btrfs-progs easier. In btrfs-progs we have an extra cache in the
extent_io_tree that's exclusively used for the extent buffer tracking.
In order to untangle this dependency start passing around the fs_info to
search for extent_buffers, and then have the helpers use the appropriate
structure to find the extent buffer.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The radix-tree is not used in userspace code. In kernel it's for
tracking unpersisted and in-memory structures and has been replaced by
the xarray.
Signed-off-by: David Sterba <dsterba@suse.com>
The function csum_tree_block() is not really utilized by anyone, all
current callers just use csum_tree_block_size().
Furthermore there is a stale definition in common/utils.h which is using
the old "struct btrfs_root" as the first argument, while we have already
migrated to "struct btrfs_fs_info".
So just unexport csum_tree_block() and remove the stale definition.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The new '-b' option will be responsible for converting to block group
tree compat ro feature.
The workflow looks like this for new convert:
- Setting CHANGING_BG_TREE flag
And initialize fs_info->last_converted_bg_bytenr value to (u64)-1.
Any bg with bytenr >= last_converted_bg_bytenr will have its bg item
update go to the new root (bg tree).
- Iterate each block group by their bytenr in descending order
This involves:
* Delete the old bg item from the old tree (extent tree)
* Update last_converted_bg_bytenr to the bytenr of the bg
* Add the new bg item into the new tree (bg tree)
* If we have converted a bunch of bgs, commit current transaction
- Clear CHANGING_BG_TREE flag
And set the new BLOCK_GROUP_TREE compat ro flag and commit.
And since we're doing the convert in multiple transactions, we also need
to resume from last interrupted convert.
In that case, we just grab the last unconverted bg, and start from it.
And to co-operate with the new kernel requirement for both no-holes and
free-space-tree features, the convert tool will check for
free-space-tree feature. If not enabled, will error out with an error
message to how to continue (by mounting with "-o space_cache=v2").
For missing no-holes feature, we just need to set the flag during
convert.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Block group tree feature is completely a standalone feature, and it has
been over 5 years before the initial introduction to solve the long
mount time.
I don't really want to waste another 5 years waiting for a feature which
may or may not work, but definitely not properly reviewed for its
preparation patches.
So this patch will separate the block group tree feature into a
standalone compat RO feature.
There is a catch, in mkfs create_block_group_tree(), current
tree-checker only accepts block group item with valid chunk_objectid,
but the existing code from extent-tree-v2 didn't properly initialize it.
This patch will also fix above mentioned problem so kernel can mount it
correctly.
Now mkfs/fsck should be able to handle the fs with block group tree.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The extent tree v2 (thankfully not yet fully materialized) needs a
new root for storing all block group items.
My initial proposal years ago just added a new tree rootid, and load it
from tree root, just like what we did for quota/free space tree/uuid/extent
roots.
But the extent tree v2 patches introduced a completely new (and to me,
wasteful) way to store block group tree root into super block.
Currently there are only 3 trees stored in super blocks, and they all
have their valid reasons:
- Chunk root
Needed for bootstrap.
- Tree root
Really the entrance of all trees.
- Log root
This is special as log root has to be updated out of existing
transaction mechanism.
There is not even any reason to put block group root into super blocks,
the block group tree is updated at the same timing as old extent tree,
no need for extra bootstrap/out-of-transaction update.
So just move block group root from super block into tree root.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Function write_data_to_disk() can handle RAID56 writes without any
problem.
So just call write_data_to_disk() inside write_and_map_eb() instead of
manually doing the RAID56 write.
Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
Shinichiro reported that "mkfs.btrfs -m DUP" is doing repeated write
into the device.
For non-zoned device this is not a big deal, but for zoned device this
is critical, as zoned device doesn't support overwrite at all.
[CAUSE]
The problem is related to write_and_map_eb() call, since commit
2a93728391 ("btrfs-progs: use write_data_to_disk() to replace
write_extent_to_disk()"), we call write_data_to_disk() for metadata
write back.
But the problem is, write_data_to_disk() will call btrfs_map_block()
with rw = WRITE.
By that btrfs_map_block() will always return all stripes, while in
write_data_to_disk() we also iterate through each mirror of the range.
This results above repeated writeback.
[FIX]
Fix this problem by completely remove @mirror argument
from write_data_to_disk().
With extra comments to explicitly show that function will write to
all mirrors.
Reported-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Fixes: 2a93728391 ("btrfs-progs: use write_data_to_disk() to replace write_extent_to_disk()")
Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The function read_extent_from_disk() is only a wrapper to read tree
block.
And read_extent_data() is just a while loop to eliminate short read
caused by stripe boundary.
In fact, a lot of call sites of read_extent_data() are either reading
metadata (thus no possible short read) or doing extra loop by
themselves.
This patch will replace those two functions with read_data_from_disk(),
making it the only entrance for data/metadata read.
And update read_data_from_disk() to return the read bytes, so caller can
do a simple while loop.
For the few callers of read_extent_data(), open-code a small while loop
for them.
This will allow later RAID56 read repair using P/Q much easier.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Function write_extent_to_disk() is just writing the content of a tree
block to disk.
It can not handle RAID56, and its work is the same as
write_data_to_disk().
Thus we can replace write_extent_to_disk() with write_data_to_disk()
easily.
There is only one special call site in write_raid56_with_parity(), which
can easily be replace with btrfs_pwrite() directly.
This reduce the write entrance, and make later eb::fd removal easier.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
For metadata restore, our logical address is mapped to a single device
with logical address 1:1 mapped to device physical address.
Move this part of code into a helper, this will make later extent buffer
read path refactoer much easier.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We used to use read_whole_eb() to read super block, but it's no longer
the case (so long that I can not even find out which commit did the
conversion).
Thus there is no need for read_whole_eb() to handle super block read
anymore.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We're going to start create global roots from mkfs, and we need to have
a offset set for the root key. Make the btrfs_create_tree() take a key
for the root_key instead of just the objectid so we can setup these new
style roots properly.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The free space tree code already does this, but we need it for cleaning
up per block group roots. Abstract this code out into a helper so that
we can use it in multiple places in the future.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We will now be using block_group->chunk_objectid to point at the global
root id for this particular block group. For now we'll assign this
based on mod'ing the offset of the block group against the number of
global root id's and handle the block_group_item updating appropriately.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In order to make sure the file system is consistent we need to record
the number of global roots we should have in the super block. We could
infer this from the number of global roots we find, however this could
lead to interesting fuzzing problems, so add a source of truth to the
super block in order to make it easier to verify the file system is
consistent.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This adds the ability to load the block group root, as well as make sure
the various backup super block and super block updates are made
appropriately.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This is going to be a different value based on the incompat settings of
the file system, just store this in the fs_info instead of calculating
it every time.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The chunk recover code passes in a buffer it allocates with metadata but
no fs_info, causing fuzz-test 008 to segfault. Fix this test to only
check the flag if we have buf->fs_info set.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
With my global roots prep patches I regressed us on handling the case
where we didn't find a root at all. In this case we need to return an
error (prior we returned -ENOENT) or we need to populate a dummy tree if
we have OPEN_CTREE_PARTIAL set. This fixes a segfault of fuzz test 006.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We were previously getting away with this because the
load_global_roots() treated ENOENT like everything was a-ok. However
that was a bug and fixing that bug uncovered a problem where we were
unconditionally trying to load the free space tree. Fix that by
skipping the load if we do not have the compat bit set.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This is still work in progress but can survive some stress testing.
There are still some sanity checks missing, do not user this on valuable
data. To enables this, configure must be run with the experimental
features enabled.
$ mkfs.btrfs --csum crc32c /dev/sdx
$ <mount, fill with data, unmount>
$ btrfstune --csum sha256
Will change the checksum to sha256.
Implementation:
- set bit on superblock when the checksums are being changed (similar to
the uuid rewrite)
- metadata checksums are overwritten in place
- data checksums:
- the checksum tree is completely deleted and no checksums are
verified
- data blocks are enumerated and all checksums generated (same as
check --init-csum-tree)
To make it usable, it should be restartable and track the current
progress somehow. Also the previous data checksums should be verified
any time they're available.
Signed-off-by: David Sterba <dsterba@suse.com>
For !extent tree v2 we should validate the key.offset == 0, and for
extent tree v2 we should validate that key.offset < nr_global_roots. If
this fails we need to fail to load the global root so that the
appropriate action is taken.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Add the on disk definitions for the block group tree. This will be part
of the super block so we need to add the appropriate helpers to the
super block, as well as adding it to the backup roots.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We are going to have multiples of these trees with extent tree v2, so
add a rb tree to track them based on their root key value. This works
for both v1 and v2, so we can remove the direct pointers to these roots
in our fs_info.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We're going to have multiple free space roots in the future, so access
it via a helper in most cases. We will address the remaining direct
accesses in future patches.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>