This reverts commit 31d8235410.
False report of backref mismatches, lots of messages similar to:
Incorrect local backref count on 12713984 root 5 owner 257 offset 12845056 found 1 wanted 0 back 0x7b3ed0
backpointer mismatch on [12713984 131072]
Repairing will make things worse. A fix has been proposed, but is not
finalized so we go with a revert.
Reported-by: Chris Murphy <bugzilla@colorremedies.com>
References: https://bugzilla.kernel.org/show_bug.cgi?id=155791
Signed-off-by: David Sterba <dsterba@suse.com>
Add_tree_backref() can cause BUG_ON() and abort() in quite a lot of
cases, from the ENOMEM to existing tree backref records.
Change all these BUG_ON() and abort() to return proper values.
And modify all callers to handle such problems.
Reported-by: Lukas Lueg <lukas.lueg@gmail.com>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Exposed by fuzzed image from Lukas, which contains invalid drop level
(16), causing segfault when accessing path->nodes[drop_level].
This patch will check drop level against fs tree level and
BTRFS_MAX_LEVEL to avoid such problem.
Reported-by: Lukas Lueg <lukas.lueg@gmail.com>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Current we only do chunk validation check at mount time.
It's good for most case, but for fuzzed or manually crafted images, we
can insert a CHUNK_ITEM key into root tree.
Since mount time check will only check chunk tree, it will not check
CHUNK_ITEM in root tree.
Even with previous key type check against leaf owner, it is still
possible to modify the leaf owner to by-pass it.
So we still need to check chunk validation before processing it.
Reported-by: Lukas Lueg <lukas.lueg@gmail.com>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Btrfs tree implies a lot of restriction on which key types are allowed
in specific roots.
Like CHUNK_ITEM keys are only valid in chunk root.
This patch will add such check at run_next_block() for original mode.
Reported-by: Lukas Lueg <lukas.lueg@gmail.com>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
As we're passing a set of flags, the enum type is not appropriate.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Change the single-purpose option --low-memory to a generic option that
takes the mode. Currently supported are the original mode and the
low-memory in the same way.
Signed-off-by: David Sterba <dsterba@suse.com>
Introduce a new fsck mode: low memory mode.
Old btrfsck is working efficiently but uses some memory for each extent
item. This method will ensure extents are only iterated once at
extent/chunk tree check process.
But since it uses some memory for each extent item, for a large fs with
several TB metadata, this can easily eat up memory and cause OOM.
To handle such limitation and improve scalability, the new low-memory
mode will not use any heap memory to record which extent is checked.
Instead it will use extent backref to avoid most of uneeded checks on
shared fs/subvolume tree blocks.
And with the use forward and backward reference cross check, we can also
ensure every tree block is at least checked once.
Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Introduce a new function traverse_tree_block() to do pre-order
traversal, to co-operate with new fs/subvolume tree skip function.
Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Introduce function should_check() to reduced duplicated tree block check
for fs/subvolume tree.
The idea is, we only check the fs/subvolue tree block if we have the
lowest referencer rootid, according to extent tree.
In that case, we can skip a lot of fs/subvolume tree block checks if
there are a lot of snapshots.
Although we will do a lot of extent tree searches for it.
Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Introduce an entry function, check_leaf_items() to check all
known/valuable items and update related accounting like total_bytes and
csum_bytes.
Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Introduce function check_chunk_item() to check a chunk item.
It will check all chunk stripes with dev extents and the corresponding
block group item.
Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Introduce function check_block_group_item() to check a block group item.
It will check the referencer chunk and the used space accounting with
extent tree.
Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Introduce function check_dev_item() to check used space with dev extent
items.
Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Introduce function check_dev_extent_item() to find its referencer chunk.
Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Introduce function check_extent_item() using previously introduced
functions.
With previous function to check referencer and backref, this function
can be quite easy.
Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Introduce the function check_shared_data_backref() to check the
referencer of a given shared data backref.
Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Introduce new function check_extent_data_backref() to search referencer
for a given data backref.
Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Introduce function check_shared_block_backref() to check shared block
ref.
Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Introduce a new function check_tree_block_backref() to check if a
backref points to correct referencer.
Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Introduce function query_tree_block_level() to resolve tree block level
by the following method:
1) tree block backref level
2) tree block header level
And only when header level == backref level, and transid matches, it will
return the tree block level.
Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Introduce a new function check_data_extent_item() to check if the
corresponding data backref exists in extent tree.
Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Introduce function check_tree_block_ref() to check whether a tree block
has correct backref in extent tree.
Unlike old extent tree check method, we only use search_slot() to search
reference, no extra structure will be allocated in heap to record what we
have checked.
This method may cause a little more IO, but should work for super large
fs without triggering OOM.
Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In walk_down_tree(), we may call btrfs_lookup_extent_info() for same tree
block many times, obviously unnecessary. Here we define a simple struct to
record whether we already have gotten tree block's refs:
struct node_refs {
u64 bytenr[BTRFS_MAX_LEVEL];
u64 refs[BTRFS_MAX_LEVEL];
};
I fill a disk partition with linux kernel source codes and use below
test script to have performance test.
#!/bin/bash
echo 3 > /proc/sys/vm/drop_caches
for ((i = 0; i < 20; i++)); do
time ./btrfsck /dev/sdc5
done 2>&1 | grep real | awk -F "[ms]" '{run_time += $2} END{print run_time / 20}'
Before this patch, it averagely took 0.8447s for every btrfsck execution,
and with this patch, it averagely took 0.7807s.
Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Now that we can verify all qgroups, we can write the corrected qgroups out
to disk when '--repair' is specified. The qgroup status item is also updated
to clear any out-of-date state. The repair_ functions were modeled after the
inode repair code in cmds-check.c.
I also renamed the 'scan' member of qgroup_status_item to 'rescan' in order
to keep consistency with the kernel.
Testing this was easy, I just reproduced qgroup inconsistencies via the
usual routes and had btrfsck fix them.
Signed-off-by: Mark Fasheh <mfasheh@suse.de>
Signed-off-by: David Sterba <dsterba@suse.com>
We now have two data structures that can be used to iterate the same data
set, and there may be quite a few of them in memory. Eliminating the
list_head member will reduce memory consumption while iterating over
the extent backrefs.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
For the pathlogical case, like xfstests generic/297 that creates a
large file consisting of one, repeating reflinked extent, fsck can
take hours. The root cause is that calling find_data_backref while
iterating the extent records is an O(n^2) algorithm. For my
example test run, n was 2*2^20 and fsck was at 8 hours and counting.
This patch supplements the list with an rbtree and drops the runtime
of that testcase to about 20 seconds.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We either open code list_entry calls or outright cast between types. The
compiler will do the right thing if we use static inlines to do
typesafe conversions.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In the new add_extent_rec_nolookup() function, we add bytes_used to
update found bytes accounting.
However there is a typo that we used tmpl->nr, which should be rec->nr.
This will make us to add 1 for data backref, instead the correct size.
Reported-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The extent record's max_size might be 0 and the stripe crossing check
will report a false positive, should use the filesyste nodesize. There's
a global fs_info available.
Signed-off-by: David Sterba <dsterba@suse.com>
Similar to add_extent_rec_nolookup, pass the arguments via a temporary
structure. In case the extent is found, some of the values are not
assigned directly so the semantics is preserved.
Signed-off-by: David Sterba <dsterba@suse.com>
There are just 3 values of flag_block_full_backref, we can utilize a
bitfield and save 8 bytes (192 now).
Signed-off-by: David Sterba <dsterba@suse.com>
Reduce number of parameters that just fill the extent_record from a
temporary template that's supposed to be zeroed and filled by the
callers.
Signed-off-by: David Sterba <dsterba@suse.com>
Before this patch, although btrfsck will check qgroups if quota is
enabled, it always return 0 even qgroup numbers are corrupted.
Fix it by allowing return value from report_qgroups function (formally
defined as print_qgroup_difference).
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Separate the part of add_extent_rec that comes after the lookup does not
succeed, there are callers interested in just this.
Signed-off-by: David Sterba <dsterba@suse.com>
Nodesize is used in kernel, the values are always equal. We have to keep
leafsize in headers, similarly the tree setting functions still take and
set leafsize, but it's effectively a no-op.
Signed-off-by: David Sterba <dsterba@suse.com>
At least 2 user from mail list reported btrfsck reported false alert of
"bad metadata [XXXX,YYYY) crossing stripe boundary".
While the reported number are all inside the same 64K boundary.
After some check, all the false alert have the same bytenr feature,
which can be divided by stripe size (64K).
The result seems to be initial 'max_size' can be 0, causing 'start' +
'max_size' - 1, to cross the stripe boundary.
Fix it by always update extent_record->cross_stripe when the
extent_record is updated, to avoid temporary false alert to be reported.
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The need to specify the chunk root is not that common, we will reserve
the short option -c for later use.
Signed-off-by: David Sterba <dsterba@suse.com>