Introduce a new function, __btrfs_map_block_v2().
Unlike old btrfs_map_block(), which needs different parameter to handle
different RAID profile, this new function uses unified btrfs_map_block
structure to handle all RAID profile in a more meaningful method:
Return physical address along with logical address for each stripe.
For RAID1/Single/DUP (none-stripped):
result would be like:
Map block: Logical 128M, Len 10M, Type RAID1, Stripe len 0, Nr_stripes 2
Stripe 0: Logical 128M, Physical X, Len: 10M Dev dev1
Stripe 1: Logical 128M, Physical Y, Len: 10M Dev dev2
Result will be as long as possible, since it's not stripped at all.
For RAID0/10 (stripped without parity):
Result will be aligned to full stripe size:
Map block: Logical 64K, Len 128K, Type RAID10, Stripe len 64K, Nr_stripes 4
Stripe 0: Logical 64K, Physical X, Len 64K Dev dev1
Stripe 1: Logical 64K, Physical Y, Len 64K Dev dev2
Stripe 2: Logical 128K, Physical Z, Len 64K Dev dev3
Stripe 3: Logical 128K, Physical W, Len 64K Dev dev4
For RAID5/6 (stripped with parity and dev-rotation):
Result will be aligned to full stripe size:
Map block: Logical 64K, Len 128K, Type RAID6, Stripe len 64K, Nr_stripes 4
Stripe 0: Logical 64K, Physical X, Len 64K Dev dev1
Stripe 1: Logical 128K, Physical Y, Len 64K Dev dev2
Stripe 2: Logical RAID5_P, Physical Z, Len 64K Dev dev3
Stripe 3: Logical RAID6_Q, Physical W, Len 64K Dev dev4
The new unified layout should be very flex and can even handle things
like N-way RAID1 (which old mirror_num basic one can't handle well).
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Gu Jinxiang <gujx@cn.fujitsu.com>
4 functions are involved in this refactor: btrfs_make_block_group()
btrfs_make_block_groups(), btrfs_alloc_chunk, btrfs_alloc_data_chunk().
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
BTW, there is a duplicated definition of btrfs_add_device() in
volumes.h, also remove it.
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Just to keep the 1st paramter the same as kernel.
We can also save a few lines since the parameter is shorter now.
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Introduce a new function, btrfs_get_chunk_stripe_len() to get correct
stripe length.
This is very handy for lowmem mode, which checks the mapping between
device extent and chunk item.
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Large numbers like (1024 * 1024 * 1024) may cost reader/reviewer to
waste one second to convert to 1G.
Introduce kernel include/linux/sizes.h to replace any intermediate
number larger than 4096 (not including 4096) to SZ_*.
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Commit 854437ca(btrfs-progs: extent-tree: avoid allocating tree block
that crosses stripe boundary) introduces check for logical bytenr not
crossing stripe boundary.
However that check is not completely correct.
It only checks if the logical bytenr and length agaist absolute logical
offset.
That's to say, it only check if a tree block lies in 64K logical stripe.
But in fact, it's possible a block group starts at bytenr unaligned with
64K, just like the following case.
Then btrfsck will give false alert.
0 32K 64K 96K 128K 160K ...
|--------------- Block group A ---------------------
|<-----TB 32K------>|
|/Scrub stripe unit/|
| WRONG UNIT |
In that case, TB(tree block) at bytenr 32K in fact fits into the kernel
scrub stripe unit.
But doesn't fit into the pure logical 64K stripe.
Fix check_crossing_stripes() to compare bytenr to block group start, not
to absolute logical bytenr.
Reported-by: Jussi Kansanen <jussi.kansanen@gmail.com>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Current we only do chunk validation check at mount time.
It's good for most case, but for fuzzed or manually crafted images, we
can insert a CHUNK_ITEM key into root tree.
Since mount time check will only check chunk tree, it will not check
CHUNK_ITEM in root tree.
Even with previous key type check against leaf owner, it is still
possible to modify the leaf owner to by-pass it.
So we still need to check chunk validation before processing it.
Reported-by: Lukas Lueg <lukas.lueg@gmail.com>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Introduce new function, make_convert_data_chunks(), to build up data
chunks for convert.
It will call a modified version of btrfs_alloc_data_chunk() to force
data chunks to covert all known ext* data.
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Add the overlapping usage and [usage_min, usage_max] members to the
balance args. The min/max values are interpreted iff the corresponding
flag BTRFS_BALANCE_ARGS_USAGE_RANGE is set.
The minimum boundary is inclusive, maximum is exclusive:
* usage_min <= chunk_usage < usage_max
Signed-off-by: David Sterba <dsterba@suse.com>
Add new balance filter 'stripes=<range>' to process only chunks that are
spread accross given number of chunks.
The range minimum and maximum are inclusive.
Signed-off-by: Gabríel Arthúr Pétursson <gabriel@system.is>
[ reworked a bit to use the range helpers, dropped the single value
for stripes ]
Signed-off-by: David Sterba <dsterba@suse.com>
Add the overlapping limit and [limit_min, limit_max] members to the
balance args. The min/max values are interpreted iff the corresponding
flag BTRFS_BALANCE_ARGS_LIMIT_RANGE is set.
The minimum and maximum are inclusive.
Note that the values are only 32bit, but this should be enough for the
foreseeable future.
Signed-off-by: David Sterba <dsterba@suse.com>
Add support to search chunk root, as we only need to search tree roots
in system chunk, which should be very easy to add, just iterate in
system chunks.
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
[ renamed to btrfs_next_bg_* ]
Signed-off-by: David Sterba <dsterba@suse.com>
Commit 854437ca3c ("btrfs-progs:
extent-tree: avoid allocating tree block that crosses stripe boundary")
does not work for 64k nodesize. Due to an off-by-one error, all queries
to check_crossing_stripes will return that all extents cross a stripe
and this will lead to a false ENOSPC. This crashes later
$ ./mkfs.btrfs -n 64k image
./mkfs.btrfs(btrfs_reserve_extent+0xb77)[0x417f38]
./mkfs.btrfs(btrfs_alloc_free_block+0x57)[0x417fe0]
./mkfs.btrfs(__btrfs_cow_block+0x163)[0x408eb7]
./mkfs.btrfs(btrfs_cow_block+0xd0)[0x4097c4]
./mkfs.btrfs(btrfs_search_slot+0x16f)[0x40be4d]
./mkfs.btrfs(btrfs_insert_empty_items+0xc0)[0x40d5f9]
./mkfs.btrfs(btrfs_insert_item+0x99)[0x40da5f]
./mkfs.btrfs(btrfs_make_block_group+0x4d)[0x41705c]
./mkfs.btrfs(main+0xeef)[0x434b56]
Holger Hoffstätte reports that this also fixes false positives in case
the nodesize is less than 64k. This happens when the node blocks end at
the stripe boundary.
Reviewed-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If there is more than one fs_devices in fs_uuids list (like mkfs.btrfs
does), we need close them all before exit. Add a helper for that.
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Kernel btrfs_map_block() function has a limitation that it can only
map BTRFS_STRIPE_LEN size.
That will cause scrub fails to scrub tree block which crosses strip
boundary, causing BUG_ON().
Normally, it's OK as metadata is always in metadata chunk and
BTRFS_STRIPE_LEN can always be divided by node/leaf size.
So without mixed block group, tree block won't cross stripe boundary.
But for mixed block group, especially for btrfs converted from ext4,
it's almost sure one or more tree blocks are not aligned with node size
and cross stripe boundary.
Causing bug with kernel scrub.
This patch will report the problem, although we don't have a good idea
how to fix it in user space until we add the ability to relocate tree
block in user space.
Also, kernel code should also be checked for such tree block alloc
problems.
Reported-by: Chris Murphy <lists@colorremedies.com>
Reported-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Add includes that let the header files compile or add explicit include
of kerncompat if the uXX types are used.
Signed-off-by: David Sterba <dsterba@suse.cz>
*Note*
this handles the problem under umounted state, the similar problem
under mounted state is already fixed by Anand.
Steps to reproduce:
# mkfs.btrfs -f /dev/sda1
# btrfstune -S 1 /dev/sda1
# mount /dev/sda1 /mnt
# btrfs dev add /dev/sda2 /mnt
# umount /mnt <== (umounted)
# btrfs fi show /dev/sda2
result:
Label: none uuid: XXXXXXXXXXXXXXXXXX
Total devices 2 FS bytes used 368.00KiB
devid 2 size 9.31GiB used 1.25GiB path /dev/sda2
*** Some devices missing
Btrfs v3.16-67-g69f54ea-dirty
It is because @btrfs_scan_lblkid() won't establish mappinig
between the seed and sprout devices. So seeding devices are missing.
We could use @open_ctree_* to detect all seed/sprout mappings
for each fs scanned after @btrfs_scan_lblkid().
sth worthes mention:
o If there are multi-level of seeds, all devices in them will be shown
in the ascending order of @devid
o If device replace is execed on a sprout fs with a device in a seed fs,
the replaced device still exist in the seed fs together with
the replacing device in the sprout fs, so we only keep the latest device
with the newest generation
Signed-off-by: Gui Hecheng <guihc.fnst@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.cz>
Btrfs-progs superblock checksum check is somewhat too restricted for
super-recover, since current btrfs-progs will only read the 1st
superblock and if you need super-recover the 1st superblock is
possibly already damaged.
The fix is introducing super_recover parameter for
btrfs_read_dev_super() and callers to allow scan backup superblocks if
needed.
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.cz>
Add more control to the balance behaviour.
Usage filter may not be finegrained enough and can lead to moving too
many chunks at once. Another example use is in connection with
drange+devid or vrange filters that allow to work with a specific chunk
or even with a chunk on a given device.
The limit filter applies last, the value of 0 means no limiting.
CC: Ilya Dryomov <idryomov@gmail.com>
CC: Hugo Mills <hugo@carfax.org.uk>
Signed-off-by: David Sterba <dsterba@suse.cz>
When a struct btrfs_fs_devices was being torn down by
btrfs_close_devices(), there was an invalidated pointer in the global
list fs_uuids which still pointed to it; if a device was closed and
then reopened (which btrfs-convert does), freed memory would be
accessed.
This was found using ThreadSanitizer (pretty much doing what
AddressSanitizer would, but not exiting after the first failure).
To reproduce, build with -fsanitize=thread and run 'make test'.
Representative output is below.
This change makes the current tests TSan-clean.
WARNING: ThreadSanitizer: heap-use-after-free (pid=29161)
Read of size 8 at 0x7d180000eee0 by main thread:
#0 memcmp ??:0
#1 find_fsid .../volumes.c:81
#2 device_list_add .../volumes.c:95
#3 btrfs_scan_one_device .../volumes.c:259
#4 btrfs_scan_fs_devices .../disk-io.c:1002
#5 __open_ctree_fd .../disk-io.c:1090
#6 open_ctree_fd .../disk-io.c:1191
#7 do_convert .../btrfs-convert.c:2317
#8 main .../btrfs-convert.c:2745
Previous write of size 8 at 0x7d180000eee0 by main thread:
#0 free ??:0
#1 btrfs_close_devices .../volumes.c:191
#2 close_ctree .../disk-io.c:1401
#3 do_convert .../btrfs-convert.c:2300
#4 main .../btrfs-convert.c:2745
Location is heap block of size 96 at 0x7d180000eee0 allocated by main thread:
#0 calloc ??:0 (exe+0x00000002acc6)
#1 device_list_add .../volumes.c:97
#2 btrfs_scan_one_device .../volumes.c:259
#3 btrfs_scan_fs_devices .../disk-io.c:1002
#4 __open_ctree_fd .../disk-io.c:1090
#5 open_ctree_fd .../disk-io.c:1191
#6 do_convert .../btrfs-convert.c:2256
#7 main .../btrfs-convert.c:2745
Signed-off-by: Adam Buchbinder <abuchbinder@google.com>
Reviewed-by: Satoru Takeuchi <takeuchi_satoru@jp.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.cz>
a clean up patch, the BTRFS_STRIPE_LEN is been duplicated across
btrfs-progs, the kernel defines it in volume.h so do the same
for progs.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
In files copied from the kernel, mark many functions as static,
and remove any resulting dead code.
Some functions are left unmarked if they aren't static in the
kernel tree.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
This adds a 'btrfs-image -m' option, which let us restore an image that
is built from a btrfs of multiple disks onto several disks altogether.
This aims to address the following case,
$ mkfs.btrfs -m raid0 sda sdb
$ btrfs-image sda image.file
$ btrfs-image -r image.file sdc
---------
so we can only restore metadata onto sdc, and another thing is we can
only mount sdc with degraded mode as we don't provide informations of
another disk. And, it's built as RAID0 and we have only one disk,
so after mount sdc we'll get into readonly mode.
This is just annoying for people(like me) who're trying to restore image
but turn to find they cannot make it work.
So this'll make your life easier, just tap
$ btrfs-image -m image.file sdc sdd
---------
then you get everything about metadata done, the same offset with that of
the originals(of course, you need offer enough disk size, at least the disk
size of the original disks).
Besides, this also works with raid5 and raid6 metadata image.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Add chunk rebuild for RAID1/SINGLE/DUP to chunk-recover command.
Before this patch chunk-recover can only scan and reuse the old chunk
data to recover. With this patch, chunk-recover can use the reference
between chunk/block group/dev extent to rebuild the whole chunk tree
even when old chunks are not available.
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Add chunk-recover program to check or rebuild chunk tree when the system
chunk array or chunk tree is broken.
Due to the importance of the system chunk array and chunk tree, if one of
them is broken, the whole btrfs will be broken even other data are OK.
But we have some hint(fsid, checksum...) to salvage the old metadata.
So this function will first scan the whole file system and collect the
needed data(chunk/block group/dev extent), and check for the references
between them. If the references are OK, the chunk tree can be rebuilt and
luckily the file system will be mountable.
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
David Woodhouse originally contributed this code, and Chris Mason
changed it around to reflect the current design goals for raid56.
The original code expected all metadata and data writes to be full
stripes. This meant metadata block size == stripe size, and had a few
other restrictions.
This version allows metadata blocks smaller than the stripe size. It
implements both raid5 and raid6, although it does not have code to
rebuild from parity if one of the drives is missing or incorrect.
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
This is the user mode part of the device replace patch series.
The command group "btrfs replace" is added with three commands:
- btrfs replace start srcdev|srcdevid targetdev [-Bfr] mount_point
- btrfs replace status mount_point [-1]
- btrfs replace cancel mount_point
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
This is mostly disabled, but it is step one in handling
corrupted block groups in the extent allocation tree.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Changes from V1 to V2:
- support extended attributes
- move btrfs_alloc_data_chunk function to volumes.c
- fix an execution error when additional useless parameters are specified
- fix traverse_directory function so that the insertion functions for the common items are invoked in a single point
The extended attributes is implemented through llistxattr and getxattr function calls.
Thanks
Signed-off-by: Donggeun Kim <dg77.kim@samsung.com>
Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This patch updates btrfs-progs for superblock duplication.
Note: I didn't make this patch as complete as the one for
kernel since updating the converter requires changing the
code again. Thank you,
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
This patch does the following:
1) Update device management code to match the kernel code.
2) Allocator fixes.
3) Add a program called btrfstune to set/clear the SEEDING
super block flags.
The main changes in this patch are adding chunk handing and data relocation
ability. In the last step of conversion, the converter relocates data in system
chunk and move chunk tree into system chunk. In the rollback process, the
converter remove chunk tree from system chunk and copy data back.
Regards
YZ
---
Block headers now store the chunk tree uuid
Chunk items records the device uuid for each stripes
Device extent items record better back refs to the chunk tree
Block groups record better back refs to the chunk tree
The chunk tree format has also changed. The objectid of BTRFS_CHUNK_ITEM_KEY
used to be the logical offset of the chunk. Now it is a chunk tree id,
with the logical offset being stored in the offset field of the key.
This allows a single chunk tree to record multiple logical address spaces,
upping the number of bytes indexed by a chunk tree from 2^64 to
2^128.