Convert is always set to true so there's no point in having it as a
function parameter or using it as a predicate inside
btrfs_alloc_data_chunk. Remove it and all relevant code which would
have never been executed. No semantic changes.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It's always set to BTRFS_BLOCK_GROUP_DATA so sink it into the function.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The sub_stripes variable is initialized to 0 by default and is overridden
only in RAID10 mode. This leads to the following (minor)
artifacts on a freshly created filesystem:
item 3 key (FIRST_CHUNK_TREE CHUNK_ITEM 30408704) itemoff 15863 itemsize 112
length 1073741824 owner 2 stripe_len 65536 type METADATA|RAID1
io_align 65536 io_width 65536 sector_size 4096
num_stripes 2 sub_stripes 0
stripe 0 devid 2 offset 9437184
dev_uuid a020fc2f-b526-4800-9278-156f2f431fe9
stripe 1 devid 1 offset 30408704
dev_uuid 0f78aa72-4626-4057-a8f2-285f46b2c664
After balance resulting chunk item is:
item 3 key (FIRST_CHUNK_TREE CHUNK_ITEM 3251634176) itemoff 15863 itemsize 112
length 268435456 owner 2 stripe_len 65536 type METADATA|RAID1
io_align 65536 io_width 65536 sector_size 4096
num_stripes 2 sub_stripes 1
stripe 0 devid 2 offset 3230662656
dev_uuid a020fc2f-b526-4800-9278-156f2f431fe9
stripe 1 devid 1 offset 3251634176
dev_uuid 0f78aa72-4626-4057-a8f2-285f46b2c664
Kernel code usually initializes it to 1, since it takes the value from
the raid description table which has it set to 1 for all but RAID10 types.
In userspace it has to be statically initialized to 1 since we don't
have btrfs_bg_flags_to_raid_index. Eventually the kernel/userspace needs
to be merged but for now it wouldn't bring much value if this function
is copied.
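As a minimal sketch of the intended result (not the actual btrfs-progs
code), the value stored in a chunk item can be derived from the profile
bits alone. The function name and the SKETCH_ constant are assumptions
made for illustration; the constant is defined locally so the snippet
compiles on its own.

```c
#include <assert.h>
#include <stdint.h>

/* Local stand-in for the RAID10 profile bit (assumption for this sketch). */
#define SKETCH_BLOCK_GROUP_RAID10 (1ULL << 6)

/* Return the sub_stripes value to record for a chunk of the given type. */
int sketch_sub_stripes(uint64_t type)
{
	int sub_stripes = 1;	/* default for every profile except RAID10 */

	if (type & SKETCH_BLOCK_GROUP_RAID10)
		sub_stripes = 2;
	assert(sub_stripes >= 1);	/* 0 is the artifact described above */
	return sub_stripes;
}
```

With a static default of 1, a freshly created RAID1 chunk would show
sub_stripes 1, matching what balance produces.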
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
Although "btrfs rescue zero-log" only resets btrfs_super_block::log_root and
btrfs_super_block::log_root_level, we still use a transaction to write all
super blocks for all devices.
This means we can't handle things like corrupted extent tree:
checksum verify failed on 2172747776 found 000000B6 wanted 00000000
checksum verify failed on 2172747776 found 000000B6 wanted 00000000
bad tree block 2172747776, bytenr mismatch, want=2172747776, have=0
WARNING: could not setup extent tree, skipping it
Clearing log on /dev/nvme/btrfs, previous log_root 0, level 0
ERROR: Corrupted fs, no valid METADATA block group found
ERROR: attempt to start transaction over already running one
[CAUSE]
Because we have an extra check in the transaction code to ensure we have
valid METADATA block groups.
In fact we don't really need a transaction at all.
[FIX]
Instead of committing a transaction, we can just call write_all_supers()
manually, so we can still handle a multi-device fs while avoiding the above
error.
Also, add OPEN_CTREE_NO_BLOCK_GROUPS open ctree flag to make it more
robust.
Link: https://lore.kernel.org/linux-btrfs/CAKbQEqG35D_=8raTFH75-yCYoqH2OvpPEmpj2dxgo+PTc=cfhA@mail.gmail.com/
Reported-by: Christian Pernegger <pernegger@gmail.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Even if we're using OPEN_CTREE_PARTIAL, as "rescue zero-log" does, the
error message still looks too serious even though we skipped that tree:
bad tree block 2172747776, bytenr mismatch, want=2172747776, have=0
Couldn't setup extent tree
^^^^^^^^^^^^^^^^^^^^^^^^^^
This patch will change the error message to:
- Use error() if we're not using OPEN_CTREE_PARTIAL
- Use warning() and explicitly show we're skipping that tree
So the result would be something like:
For non-OPEN_CTREE_PARTIAL case:
bad tree block 2172747776, bytenr mismatch, want=2172747776, have=0
ERROR: could not setup extent tree
For OPEN_CTREE_PARTIAL case:
bad tree block 2172747776, bytenr mismatch, want=2172747776, have=0
WARNING: could not setup extent tree, skipping it
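The selection between the two messages can be sketched as follows. This
is an illustration only: the function name, the buffer-based interface
and the SKETCH_ flag bit are assumptions, and the real code uses the
error() and warning() helpers instead of formatting into a buffer.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical stand-in for the OPEN_CTREE_PARTIAL flag bit. */
#define SKETCH_OPEN_CTREE_PARTIAL (1U << 0)

/*
 * Format the extent tree setup failure message into buf.
 * Returns 0 if the caller may continue (partial open), -1 otherwise.
 */
int sketch_report_extent_tree_failure(unsigned int flags, char *buf, size_t len)
{
	assert(buf != NULL && len > 0);
	if (flags & SKETCH_OPEN_CTREE_PARTIAL) {
		snprintf(buf, len, "WARNING: could not setup extent tree, skipping it");
		return 0;
	}
	snprintf(buf, len, "ERROR: could not setup extent tree");
	return -1;
}
```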
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The manual page of btrfsck clearly states 'btrfs check --repair' is a
dangerous operation.
Although this warning is in place, users do not read the manual page
and/or are used to the behaviour of fsck utilities which repair the
filesystem, and thus potentially cause harm.
Similar to 'btrfs balance' without any filters, add a warning and a
countdown, so users can bail out before eventually corrupting the
filesystem more than it already is.
To override the timeout, let --force skip it and continue.
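A rough sketch of the countdown logic, assuming a per-second wait
callback so the delay can be replaced in tests. The function name, the
callback parameter and the message text are all hypothetical and do not
reflect the actual wording printed by btrfs check.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Count down before a dangerous operation unless force is set.
 * wait_tick is called once per second of delay; a real tool would pass
 * a function that sleeps. Returns the number of ticks waited.
 */
int sketch_countdown(int force, int seconds, void (*wait_tick)(void))
{
	int waited = 0;

	assert(seconds >= 0);
	if (force)
		return 0;	/* --force skips the delay entirely */

	/* Placeholder text, not the real warning. */
	fprintf(stderr, "WARNING: dangerous operation, starting in:");
	for (int i = seconds; i > 0; i--) {
		fprintf(stderr, " %d", i);
		if (wait_tick)
			wait_tick();
		waited++;
	}
	fprintf(stderr, "\n");
	return waited;
}
```

The user can interrupt the process at any point during the loop, which
is the whole purpose of the delay.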
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: David Sterba <dsterba@suse.com>
All the run_* helpers have unused variable cmd, probably a leftover from
debugging the option injection magic.
Signed-off-by: David Sterba <dsterba@suse.com>
Add support for TEST_ARGS_CONVERT to allow injection of e.g. the checksum
option into all tests. Use like
$ make TEST_ARGS_CONVERT='--csum=xxhash' TEST_ENABLE_OVERRIDE=true test-convert
This affects all btrfs-convert commands that are run by run_check and
other helpers, IOW this affects all tests, not just convert specific ones.
Signed-off-by: David Sterba <dsterba@suse.com>
Add support for TEST_ARGS_MKFS to allow injection of e.g. the checksum
option into all tests. Use like
$ make TEST_ARGS_MKFS='--csum=xxhash' TEST_ENABLE_OVERRIDE=true test-mkfs
This affects all mkfs.btrfs commands that are run by run_check and other
helpers, IOW this affects all tests, not just mkfs specific ones.
Signed-off-by: David Sterba <dsterba@suse.com>
This test ensures that the kernel correctly persists backup roots in
case the filesystem has been mounted from a backup root.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
[ cleanup to use common helpers ]
Signed-off-by: David Sterba <dsterba@suse.com>
As progs' transaction/CoW logic evolved over the years, the metadata block
corruption code failed to keep up. It's currently impossible to corrupt
the generation because the CoW logic will not only set it to the value
of the currently running transaction (__btrfs_cow_block), but the
current code will also ASSERT due to the following check in __btrfs_cow_block:
WARN_ON(!(buf->flags & EXTENT_BAD_TRANSID) &&
btrfs_header_generation(buf) > trans->transid);
Fix this by making the generation corruption code directly write
the modified block, outside of the transaction mechanism. At the same
time move the old code into BTRFS_METADATA_BLOCK_SHIFT_ITEMS handling
case, essentially leaving it unchanged.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We access btrfs_block_group_cache::item mostly for @used and @flags.
@flags is already a dedicated member in btrfs_block_group_cache, only
@used doesn't have a dedicated member.
This patch will remove btrfs_block_group_cache::item and add
btrfs_block_group_cache::used.
It's the btrfs-progs equivalent of the following kernel patches:
btrfs: move block_group_item::used to block group
btrfs: move block_group_item::flags to block group
btrfs: remove embedded block_group_cache::item
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs balance status supports both the short and long options -v|--verbose
but the usage failed to show them in its --help. This patch fixes the --help.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs balance start supports both the short and long options -v|--verbose,
however the usage failed to show the long option. This patch fixes the --help.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Even when the -q option is specified, the receive subcommand is not quiet,
as shown below.
$ btrfs receive -q -f /tmp/t /btrfs1
At snapshot ss3
It must be quiet at least when it's been asked to be quiet.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This test uses tool dmsetup so add the global prereq.
Issue: #192
Signed-off-by: Su Yue <Damenly_Su@gmx.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It seems that 18.04 has arrived on Travis, so switch to it. The gcc is 7.4
and the kernel is unfortunately still 4.15.
Signed-off-by: David Sterba <dsterba@suse.com>
Avoid introducing new cases of implicit fallthrough by having this flag
always set, though a conditional check is needed to avoid build breakage
on older compilers or on CI.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When compiling with clang, this warning is shown:
common/utils.c:404:3: warning: declaration does not declare anything [-Wmissing-declarations]
__attribute__ ((fallthrough));
This attribute seems to silence the same warning in GCC. Replacing this
attribute with /* fallthrough */ fixes the warning for both gcc and
clang.
Full support for the attribute will be in clang 10, gcc supports that
now. Let's use what works for both and switch to the attribute in the
future.
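For illustration, a switch using the comment form looks like the sketch
below. The function and its suffix handling are hypothetical and only
show where the comment goes; they are not the code at the line reported
in the warning.

```c
#include <assert.h>
#include <stdint.h>

/* Map a size suffix to a byte multiplier by cascading through the cases. */
uint64_t sketch_suffix_multiplier(char suffix)
{
	uint64_t mult = 1;

	switch (suffix) {
	case 't':
		mult *= 1024;
		/* fallthrough */
	case 'g':
		mult *= 1024;
		/* fallthrough */
	case 'm':
		mult *= 1024;
		/* fallthrough */
	case 'k':
		mult *= 1024;
		/* fallthrough */
	case 'b':
		break;
	default:
		mult = 0;	/* unknown suffix */
	}
	assert(mult == 0 || mult == 1 || mult % 1024 == 0);
	return mult;
}
```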
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This patch does the following refactor:
- Refactor parameter from @root to @fs_info
- Refactor the large loop body into another function
Now we have a helper function, read_one_block_group(), to handle
block group cache and space info related routine.
- Refactor the return value
Although we have code handling ret > 0 from find_first_block_group(),
it never triggers: when there are no more block groups,
find_first_block_group() just returns -ENOENT rather than 1.
This is super confusing, it's almost a miracle it even works.
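The return value convention can be sketched like this: the lookup
reports the end of iteration with -ENOENT, and the loop treats only
that value as a normal stop. Both functions and the sorted array of
start offsets are hypothetical simplifications, not the real block
group code.

```c
#include <assert.h>
#include <errno.h>

/*
 * Hypothetical lookup over an ascending array: returns 0 and sets
 * *found to the first start >= from, or -ENOENT when nothing is left
 * (never a positive value).
 */
int sketch_find_first(const unsigned long *starts, int n,
		      unsigned long from, unsigned long *found)
{
	for (int i = 0; i < n; i++) {
		if (starts[i] >= from) {
			*found = starts[i];
			return 0;
		}
	}
	return -ENOENT;
}

/* Visit every entry; returns the count, or a negative errno on failure. */
int sketch_read_all(const unsigned long *starts, int n)
{
	unsigned long cur = 0, found;
	int count = 0;
	int ret;

	assert(n >= 0);
	while (1) {
		ret = sketch_find_first(starts, n, cur, &found);
		if (ret == -ENOENT)
			break;		/* normal end of iteration */
		if (ret < 0)
			return ret;	/* genuine failure */
		count++;
		cur = found + 1;
	}
	return count;
}
```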
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The following functions are just using @root to reach fs_info:
- exclude_super_stripes
- free_excluded_extents
- add_excluded_extent
Refactor them to use fs_info directly.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: David Sterba <dsterba@suse.com>
The image contains one inode item with invalid generation. The image
can be crafted by "btrfs-corrupt-block -i 257 -f generation". It should
emulate the bad inode generation caused by older kernels from around 2014.
The image is repairable for both original and lowmem mode.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There are at least two bug reports of kernel tree-checker complaining
about invalid inode generation.
All offending inodes seem to have been created by old kernels from around
2014, with an inode generation overflow.
So add such check and repair ability to original mode check.
This involves:
- Calculate the inode generation upper limit
Unlike the lowmem mode context, we don't have any way to determine if
this inode belongs to the log tree.
So we use super_generation + 1 as the upper limit, just like what we did
in the kernel tree checker.
- Check if the inode generation is larger than the upper limit
- Repair by resetting inode generation to current transaction
generation
The difference is that, in original mode, we use a common trans handle for
all repairs and reset the path for each repair.
Reported-by: Charles Wright <charles.v.wright@gmail.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Tested-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There are at least two bug reports of kernel tree-checker complaining
about invalid inode generation.
All offending inodes seem to have been created by old kernels from around
2014, with an inode generation overflow.
So add such check and repair ability to lowmem mode check first.
This involves:
- Calculate the inode generation upper limit
If it's an inode from log tree, then the upper limit is
super_generation + 1, otherwise it's super_generation.
- Check if the inode generation is larger than the upper limit
- Repair by resetting inode generation to current transaction
generation
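The limit calculation and check described above can be sketched as two
small helpers. The names are hypothetical and the sketch ignores how
the inode and super block values are actually read.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Upper limit for a valid inode generation, per the rule above. */
uint64_t sketch_inode_gen_limit(uint64_t super_generation, bool in_log_tree)
{
	return in_log_tree ? super_generation + 1 : super_generation;
}

/* True if the inode generation does not exceed the upper limit. */
bool sketch_inode_gen_is_valid(uint64_t inode_gen, uint64_t super_generation,
			       bool in_log_tree)
{
	assert(super_generation < UINT64_MAX);	/* keep the + 1 from wrapping */
	return inode_gen <= sketch_inode_gen_limit(super_generation, in_log_tree);
}
```

An inode failing this check would have its generation reset to the
current transaction generation.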
Reported-by: Charles Wright <charles.v.wright@gmail.com>
Tested-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Add new test image for imode repair in subvolume trees.
The new test image covers the following cases:
- Regular file with bad imode
It still has the valid INODE_REF and parent dir has correct DIR_INDEX
and DIR_ITEM.
In this case, no matter if the file is empty or not, it should be
repaired using the info from DIR_INDEX of parent dir.
- Non-empty regular file with bad imode, and without INODE_REF
The file should be mostly an orphan, so no INODE_REF for imode lookup.
But it has EXTENT_DATA which should be enough for imode repair.
The repair also involves moving the orphan to lost+found dir.
- Non-empty dir with bad imode, and without INODE_REF
Pretty much the same case, but now a directory.
The repair also involves moving the orphan to lost+found dir.
Also rename the existing test case 039-bad-free-space-cache-inode-mode
to 039-bad-inode-mode, since now we can fix all bad imode.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
To make original mode repair imode errors in subvolume trees, this
patch will:
- Remove the show-stopper checks for root->objectid.
Now repair_imode_original() will accept inodes in subvolume trees.
- Export detect_imode() for original mode
Due to the call requirement, original mode must use an existing trans
handler to do the repair, thus we need to re-implement most of the
work done in repair_imode_common().
- Make repair_imode_original() use detect_imode().
- Free the path after reset_imode()
reset_imode() keeps the path, as lowmem mode uses path to locate its
current check position.
But for original mode, the unreleased path can cause later repair to
report warning, so we need to manually release the path.
- Update rec->imode after imode reset
So later repair depending on rec->imode can get correct value.
- Move the repair before repair_inode_nlinks()
repair_inode_nlinks() needs correct imode to add DIR_INDEX/DIR_ITEM.
So moving the repair before repair_inode_nlinks() makes the latter
repair happier.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
For lowmem mode, if we hit a bad inode mode, normally it is reported
when we are checking the DIR_INDEX/DIR_ITEM of the parent inode.
If we didn't repair at that time, the error will be recorded even if we
fixed it later.
So this patch will check for INODE_ITEM_MISMATCH error type, and if it's
really caused by invalid imode, repair it and clear the error.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[PROBLEM]
Before this patch, repair_imode_common() can only handle two types of
inodes:
- Free space cache inodes
- ROOT DIR inodes
For inodes in subvolume trees, the core complexity is how to determine
the correct imode, thus it was not implemented.
However, there are more reports of incorrect imode in subvolume trees, so
we need to support such a fix.
[ENHANCEMENT]
So this patch adds a new function, detect_imode(), to detect imode for
inodes in subvolume trees. The policy here is, try our best to find a
valid imode to recover. If no convincing info can be found, fail out.
That function will determine imode by:
1) Search for INODE_REF of the inode
If we have INODE_REF, we will then try to find DIR_ITEM/DIR_INDEX.
As long as one valid DIR_ITEM or DIR_INDEX can be found, we convert
the BTRFS_FT_* to imode, then call it a day.
This should be the most accurate way.
2) Search for DIR_INDEX/DIR_ITEM belonging to this inode
If the above search fails, we fall back to locating the DIR_INDEX/DIR_ITEM
just after the INODE_ITEM.
Thus this only works for non-empty directories.
If any can be found, it's definitely a directory.
3) Search for EXTENT_DATA belonging to this inode
If EXTENT_DATA can be found, it's either REG or LNK.
Thus this only works for non-empty file or soft link.
For this case, we default to REG, as user can inspect the file to
determine if it's a file or just a path.
4) Use rdev to detect BLK/CHR
If all above fails, but INODE_ITEM has non-zero rdev, then it's either
a BLK or CHR file. Then we default to BLK.
5) Fail out if none of above methods succeeded
No educated guess to make things worse.
[SHORTCOMING]
The above search is not perfect, there are cases where we can't repair,
e.g. an orphan empty regular inode. Since it's already an orphan, it has no
INODE_REF. And since it's an empty regular file, it has no DIR_INDEX,
EXTENT_DATA or rdev. Thus we can't recover it. For this case it really
doesn't matter, though, as it's already an orphan and will be deleted anyway.
Furthermore, due to the DIR_ITEM/DIR_INDEX/INODE_REF repair code which
can happen before imode repair, it's possible that DIR_ITEM search code
may not be executed. If there is only DIR_ITEM remaining, repair code
will remove the DIR_ITEM completely and move the inode to lost+found,
leaving us no info to rebuild imode. If the DIR_INDEX is missing, the
repair code will re-insert it, and the imode repair code will then use the
DIR_INDEX directly.
But overall, the repair code should handle the invalid imode caused by
older kernels without problem.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Introduce a function, find_file_type(), to find filetype using info from
INODE_REF, including dir_id from key index/name from inode_ref_item.
This function will:
- Search DIR_INDEX first
DIR_INDEX is easier since there is only one item in it.
- Validate the DIR_INDEX item
If the DIR_INDEX is valid, use the filetype and call it a day.
- Search DIR_ITEM then
It needs extra iteration since it's possible to have hash collision.
- Validate the DIR_ITEM
If valid, call it a day. Otherwise return -ENOENT.
This would be used as the primary method to determine the imode in later
imode repair code.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This function will be later used by common mode code, so export it.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Before this patch, we were using a very inefficient way to search
chunks:
We iterate through all clusters to find the chunk root tree block first,
then re-iterate all clusters again to find every child tree block.
Each time we need to iterate all clusters just to find a chunk tree
block. This is obviously inefficient, especially when chunk tree gets
larger. So the original author leaves a comment on it:
/* If you have to ask you aren't worthy */
static int search_for_chunk_blocks()
This patch will change the behavior so that we will only iterate all
clusters once.
The idea behind the optimization is, since we have the superblock
restored first, we could use the CHUNK_ITEMs in
super_block::sys_chunk_array to build a SYSTEM chunk mapping.
Then, when we start to iterate through all items, we can easily skip
unrelated items at different level:
- At cluster level
If a cluster starts beyond the last system chunk map, it must not contain
any chunk tree blocks (as chunk tree blocks only live inside system
chunks)
- At item level
If one item has no intersection with any system chunk map, then it
must not contain any tree blocks.
By this, we can iterate through all clusters just once, and find out all
CHUNK_ITEMs.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Introduce a new helper function, is_in_sys_chunks(), to determine if an
item is in the range of system chunks.
Since btrfs-image will merge adjacent same type extents into one item,
this function is designed to return true for any bytes in system chunk
range.
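A minimal sketch of such a range test, assuming the system chunk ranges
were already collected into an array. The struct and function names are
hypothetical; only the overlap condition is the point.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical record of one system chunk's logical range. */
struct sketch_sys_range {
	uint64_t start;
	uint64_t len;
};

/* True if [start, start + len) shares at least one byte with any range. */
bool sketch_in_sys_chunks(const struct sketch_sys_range *ranges, int nr,
			  uint64_t start, uint64_t len)
{
	assert(nr >= 0);
	for (int i = 0; i < nr; i++) {
		uint64_t r_end = ranges[i].start + ranges[i].len;

		if (start < r_end && ranges[i].start < start + len)
			return true;
	}
	return false;
}
```

Testing for any overlap rather than full containment is what lets a
merged item, which may extend past a system chunk, still be detected.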
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently we are doing a pretty slow search for system chunks before
restoring real data.
The current behavior is to search all clusters for chunk tree root
first, then search all clusters again and again for every chunk tree
block.
This causes recursive calls and a pretty slow start-up. The only good news
is that since chunk trees are normally small, we don't need to iterate too
many times, so overall it's acceptable.
To address this behavior, we can make use of the system chunk array
in the super block.
By recording all system chunk ranges, we can easily determine if an
extent belongs to the chunk tree, and thus do a simple single-pass linear
search for chunk tree leaves.
This patch only introduces the code base for later patches.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There is no need to allocate 2 * max_pending_size (which can be 256M) if
we're just extracting super block.
We only need to prepare BTRFS_SUPER_INFO_SIZE as buffer size.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We can easily get a confusing error message like:
ERROR: restore failed: Success
This is caused by wrong "%m" usage, as we normally use ret to indicate
an error, without populating errno.
This patch fixes it by printing the return value directly, as we normally
have an extra error message that is more meaningful than the return
value.
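A small sketch of the corrected pattern, assuming the message is
formatted into a buffer. The function name and message layout are
hypothetical; the point is that the negative return value is printed
instead of relying on errno.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Format a failure message from a negative return value.
 * Printing ret itself avoids "%m", which reads errno and reports
 * "Success" whenever errno was never set.
 */
int sketch_format_failure(char *buf, size_t len, const char *what, int ret)
{
	assert(buf != NULL && what != NULL && ret < 0);
	return snprintf(buf, len, "ERROR: %s failed: %d", what, ret);
}
```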
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The removed paragraph in btrfs-man5.asciidoc says the same as the
previous one.
Signed-off-by: Merlin Büge <merlin.buege@tuhh.de>
Signed-off-by: David Sterba <dsterba@suse.com>
Add definition, crypto wrappers and support to mkfs for blake2 for
checksumming. There are two aliases: blake2 and blake2b.
Signed-off-by: David Sterba <dsterba@suse.com>
Upstream commit 997fa5ba1e14b52c554fb03ce39e579e6f27b90c,
git repository: git://github.com/BLAKE2/BLAKE2
The reference implementation added in this patch is unchanged and will be
modified only to compile in the current code base, with minimal other
modifications in case of a future sync with upstream code. IOW, the coding
style should stay as-is and does not conform to the other btrfs-progs
code. The same exception applies to the xxhash and sha256 code.
Signed-off-by: David Sterba <dsterba@suse.com>