When the btrfs_read_fs_root() function is searching a ROOT_ITEM with
location key offset other than -1, it currently fails via BUG_ON.
The offset can have other value than -1, though. This can happen for
example if a subvolume is renamed:
$ btrfs subvolume create X && sync
Create subvolume './X'
$ btrfs inspect-internal dump-tree /dev/root | grep -B 2 'name: X$
location key (270 ROOT_ITEM 18446744073709551615) type DIR
transid 283 data_len 0 name_len 1
name: X
$ mv X Y && sync
$ btrfs inspect-internal dump-tree /dev/root | grep -B 2 'name: Y$
location key (270 ROOT_ITEM 0) type DIR
transid 285 data_len 0 name_len 1
name: Y
As can be seen the offset changed from -1ULL to 0.
Do not fail in this case.
Signed-off-by: Marek Behún <marek.behun@nic.cz>
CC: Qu Wenruo <wqu@suse.com>
CC: Tom Rini <trini@konsulko.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_open_dir already has a check whether the passed path is a
directory and if so it returns a specific error code (-3) when such an
error occurs. Use this instead of open-coding the directory check. To
avoid regression in cli/003 test also move directory checks before fs
type in btrfs_open.
Output before this check:
ERROR: resize works on mounted filesystems and accepts only
directories as argument. Passing file containing a btrfs image
would resize the underlying filesystem instead of the image.
After:
ERROR: not a directory: /root/btrfs-progs/tests/test.img
ERROR: resize works on mounted filesystems and accepts only
directories as argument. Passing file containing a btrfs image
would resize the underlying filesystem instead of the image.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Add a test case which ensures that when resize is tried on an image
instead of a directory appropriate warning is produced and the command
fails.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This reverts commit 61ecaff036.
The libmount functionality is not used anymore, we can remove it
entirely.
Signed-off-by: David Sterba <dsterba@suse.com>
Partial revert of 922eaa7b54 ("btrfs-progs: build: fix linking with
static libmount"), remove the necessary workarounds like the weak
symbols and link time warnings. Symbols renamed not to clash with
libmount (parse_size, canonicalize_path) haven't been reverted because
the new names are acceptable.
Signed-off-by: David Sterba <dsterba@suse.com>
This reverts commit 98a88aec64.
The libmount dependency will be dropped so remove it from the
documentation as well.
Signed-off-by: David Sterba <dsterba@suse.com>
In commit 57cfe29e69 ("btrfs-progs: utils: introduce
find_mount_fsroot") the entries in /proc/self/mountinfo are parsed by a
convenience library libmount, because getmntent does not provide the
information we need to distinguish bind mounts.
Using libmount turned out to be problematic in several ways:
- static build got broken due to clashing symbols, eg. for parsing size
or path canonicalization (#333)
- long-term distros do not have libmount new enough (2.24+) to provide
some functions (mnt_table_is_empty, #334)
- libmount internally uses getgrnam_r/mnt_get_uid/... that are not
static-build friendly, a warning is printed during link time for each
binary; we don't use any of the functions
- libmount has further library dependencies that we don't need:
$ ldd /usr/lib64/libmount.so.1
linux-vdso.so.1 (0x00007fff4f175000)
libc.so.6 => /lib64/libc.so.6 (0x00007f44a1763000)
libblkid.so.1 => /usr/lib64/libblkid.so.1 (0x00007f44a1730000)
libselinux.so.1 => /usr/lib64/libselinux.so.1 (0x00007f44a1704000)
/lib64/ld-linux-x86-64.so.2 (0x00007f44a1998000)
libpcre.so.1 => /usr/lib64/libpcre.so.1 (0x00007f44a166c000)
libdl.so.2 => /lib64/libdl.so.2 (0x00007f44a1666000)
namely selinux, pcre and dl.
Summing it up, libmount causes more trouble than it's worth using a
convenience library, we want to keep the dependencies minimal so the
custom mountinfo parser was inevitable.
Issue: #333
Issue: #334
Issue: #336
Signed-off-by: David Sterba <dsterba@suse.com>
We had a few bugs on the kernel side of send/receive where capabilities
ended up being lost after receiving a send stream. They all stem from the
fact that the kernel used to send all xattrs before issuing the chown
command, and the later clears any existing capabilities in a file or
directory.
Initially a workaround was added to btrfs-progs' receive command, in commit
123a2a0850 ("btrfs-progs: receive: restore capabilities after chown"),
and that fixed some instances of the problem. More recently, other instances
of the problem were found, a proper fix for the kernel was made, which fixes
the root problem by making send always emit the setxattr command for setting
capabilities after issuing a chown command. This was done in kernel commit
89efda52e6b693 ("btrfs: send: emit file capabilities after chown"), which
landed in kernel 5.8.
However, the workaround on the receive command now causes us to incorrectly
set a capability on a file that should not have it, because it assumes all
setxattr commands for a file always comes before a chown.
Example reproducer:
$ cat send-caps.sh
#!/bin/bash
DEV1=/dev/sdh
DEV2=/dev/sdi
MNT1=/mnt/sdh
MNT2=/mnt/sdi
mkfs.btrfs -f $DEV1 > /dev/null
mkfs.btrfs -f $DEV2 > /dev/null
mount $DEV1 $MNT1
mount $DEV2 $MNT2
touch $MNT1/foo
touch $MNT1/bar
setcap cap_net_raw=p $MNT1/foo
btrfs subvolume snapshot -r $MNT1 $MNT1/snap1
btrfs send $MNT1/snap1 | btrfs receive $MNT2
echo
echo "capabilities on destination filesystem:"
echo
getcap $MNT2/snap1/foo
getcap $MNT2/snap1/bar
umount $MNT1
umount $MNT2
When running the test script, we can see that both files foo and bar get
the capability set, when only file foo should have it:
$ ./send-caps.sh
Create a readonly snapshot of '/mnt/sdh' in '/mnt/sdh/snap1'
At subvol /mnt/sdh/snap1
At subvol snap1
capabilities on destination filesystem:
/mnt/sdi/snap1/foo cap_net_raw=p
/mnt/sdi/snap1/bar cap_net_raw=p
Since the kernel fix was backported to all currently supported stable
releases (5.10.x, 5.4.x, 4.19.x, 4.14.x, 4.9.x and 4.4.x), remove the
workaround from receive. Having such a workaround relying on the order
of commands in a send stream is always troublesome and doomed to break
one day.
A test case for fstests will come soon.
Issue: #85
Issue: #202
Issue: #292
Reported-by: Richard Brown <rbrown@suse.de>
Reviewed-by: Su Yue <l@damenly.su>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When the test environment is in 'docker', there's some delay before the
device mapper nodes appear in /dev/mapper. Add a delay before the test
continues, usually 1 second was enough but give it more just in case.
Add ad fallback to skip the test if the device node does not show up,
this is a problem in the running environment and the testsuite should
continue.
Signed-off-by: David Sterba <dsterba@suse.com>
Add some randomization to the long device name in case the testsuite
runs multiple times on the same host.
Signed-off-by: David Sterba <dsterba@suse.com>
In some environments the which utility might not be available and the
shell builtin 'type -p' is readily available.
Signed-off-by: David Sterba <dsterba@suse.com>
Add the specific libmount version that has the required functions,
though it still fails to build on CentOS 7 and similar. The libmount
dependency could be dropped in the future but for now at least document
it.
Issue: #334
Signed-off-by: David Sterba <dsterba@suse.com>
That scrub works only on a mounted filesystem is not clear in connection
with the possibility to start scrub on a given device. Update the manual
page and mention the mount requirement where approrpriate.
Issue: #335
Signed-off-by: David Sterba <dsterba@suse.com>
Testing the statically built binaries is not straightforward, add a
convenient way to do that:
$ make TEST_FLAVOR=static
There should be no difference in the test results.
Signed-off-by: David Sterba <dsterba@suse.com>
The libmount dependency has been added in commit 61ecaff036
("btrfs-progs: build: add libmount dependency"), and static build got
broken. There are functions that do basically the same thing and also
share the name, which in turn fails at link time.
ld: /../lib64/libmount.a(libcommon_la-canonicalize.o): in function `canonicalize_dm_name':
util-linux-2.34/lib/canonicalize.c:58: multiple definition of `canonicalize_dm_name';
common/path-utils.static.o:btrfs-progs/common/path-utils.c:286: first defined here
In case the collision can be resolved by renaming, it's done
(canonicalize_path and parse_size). There are 2 symbols from selinux
that are substituted by a weak aliases during the static build.
There's one new warning due to use of getgrnam_r in libmount that
depends on dynamic linking and may not work properly with static build.
We're not using the related functions directly or indirectly, so it
should be safe to ignore the warnings.
ld: ../lib64/libmount.a(la-utils.o): in function `mnt_get_gid':
util-linux-2.34/libmount/src/utils.c:625: warning: Using 'getgrnam_r' in statically linked applications
+requires at runtime the shared libraries from the glibc version used for linking
Issue: #333
Signed-off-by: David Sterba <dsterba@suse.com>
The jobs has been failing for some time due the time limit 1h:
+ qemu-system-x86_64 -m 512 -nographic -kernel /repo/bzImage -drive
file=/repo/qemu-image.img,index=0,media=disk,format=raw -fsdev
local,id=btrfs-progs,path=/repo,security_model=mapped -device
virtio-9p-pci,fsdev=btrfs-progs,mount_tag=btrfs-progs -append
'console=tty1 root=/dev/sda rw'
main-loop: WARNING: I/O thread spun for 1000 iterations
ERROR: Job failed: execution took longer than 1h0m0s seconds
We'd still like to use the qemu test as it could pull the recent
development kernel that the base image does not provide. However the
overall performance is too bad and it does not make sense to waste
GitLab resources.
Also remove the build status badge from README as it changed at some
point and does not render as a SVG image anymore.
Issue: #171
Signed-off-by: David Sterba <dsterba@suse.com>
The id 0 of the default subvolume is an internal alias for the toplevel
fs tree, kernel does that conversion. Until 2116398b1d ("btrfs-progs:
use libbtrfsutil for set-default") there was no manual conversion and
the value was passed to kernel as-is. With the switch to the
libbtrfsutil API this got broken (4.19).
$ btrfs subvol set-default 0 /path
In this case the default subvolume would be containing subvolume of
/path instead of the toplevel one.
Fix it by manually switching the 0 to 5 in case user specifies that to
avoid the difference in the API, that we can't change.
Issue: #327
Reported-by: Chris Murphy
Signed-off-by: David Sterba <dsterba@suse.com>
There are cases where v1 free space cache is still left while user has
already enabled v2 cache. In that case, we still want to force v1 space
cache cleanup in btrfs-check.
This patch will only warn and not exit if v2 is detected while the user
asked to clear v1.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
To use optimized CRC implementation, the input buffer must be
unsigned long aligned. btrfs receive calculates checksum based on
read_buf, including btrfs_cmd_header (with zeroed CRC field)
and command content.
Reorder the buffer to the beginning of the structure and force the
alignment to 64, this should be cacheline friendly and could speed up
the data transfers.
Interesting parts from the report:
Sending host:
Fedora 33
AMD ThreadRipper 1920X - 128GB RAM
2x10GBit Ethernet, bonded
MegaRaid 9270
6x16TB Seagate Exos in RAID5
Receiving host:
Fedora 33
Intel i3-7300 - HT enabled - 32GB RAM
10GBit Ethernet, single connection
MegaRaid 9260
12x8TB WD NAS drives in RAID5
The 2 hosts are connected to the same 10G switch. The sender could definitely
saturate a 10GBit link. The practically achievable writes on the backup host
would be lower, but still at least 400MB/s. The file system contains mostly
large files of 1GB+, so there is little meta-data.
With btrfs send/receive I'm getting a steady transfer rate of 60MB/s. The copy
has been running for a little over 5 days now, having only transferred some
25TB. This is way too slow for this setup.
Analyzing resource usage, the sender side is fine, both the btrfs send and the
corresponding ssh process only use about 10-10% CPU, which on a 24 threaded
machine is virtually nothing. However, the receiver is running with a load of
~2.6, with the sshd using 30-50% CPU and the btrfs receive a further 60-70%.
The rest of the load comes from IO wait. So the bottleneck is the btrfs receive
clearly.
Issue: #324
Signed-off-by: Sheng Mao <shngmao@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
By using find_mount_fsroot we ensure that we return a valid path to the
final user, by ensuring that even if we return a bind mount, the
pathname of btrfs used was the same from the original mount.
This for a case when bind mounts and normal mount -o subvol=/path are
mixed.
Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This new function checks for filesystem path name that was mounted, thus
being different from find_mount_root. By using libmount we can easily
parse /proc/self/mountinfo file and check for the pathname field.
The function is useful to filter bind mounts with content different from
the original mount, thus making it safe to assume that the reported path
can be accessed by the user, with the right content.
Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
API provided by libmount allows to read various information from /proc
files about mount paths.
Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The lowmem mode excludes all referenced blocks from the allocator in
order to avoid accidentally overwriting blocks while fixing the file
system. However for leaves it wouldn't exclude anything, it would just
pin them down, which gets cleaned up on transaction commit. We're safe
for the first modification, but subsequent modifications could blow up
in our face. Fix this by properly excluding leaves as well as all of
the nodes.
Reviewed-by: Su Yue <l@damenly.su>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently btrfs-convert only copies ext2 inode timestamps
i_[cma]time from ext4, while filling 0 to nsec and crtime fields.
This change copies nsec and crtime by parsing i_[cma]time_extra fields.
Author: Jiachen YANG <farseerfc@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Add a simple framework to exercise the json formatter and add testing
target that validates the output.
Run 'make test-json' to execute all available tests, requires 'jq'
utility for validation (https://github.com/stedolan/jq).
Signed-off-by: David Sterba <dsterba@suse.com>
In cases where the compiler does not initialize the formatter context to
all zeros, there could be garbage values left on the depth 0 that is not
explicitly initialized. This could lead to mistakenly printing a ","
separator before the last closing "}", like
{
"__header": {
"version": "1"
},
}
Signed-off-by: David Sterba <dsterba@suse.com>
The long options array for send is missing the zero terminator, so
unknown options result in a crash:
# btrfs send --foo
Segmentation fault (core dumped)
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Add a Makefile so tests can be run the same way from the standalone
testsuite as they are from git. The build dependencies are not checked
and the default path is for the system binaries.
Because of the path auto detection, running 'make' from the tests/
directory now works the same way as from the toplevel git directory.
Signed-off-by: David Sterba <dsterba@suse.com>
Pre-created image contains a subvolume and a snapshot so that cleaning
of multiple roots is also tested. The mount option 'inode_cache' will be
removed so we need the crafted image.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Inode cache feature is going to be removed in kernel 5.11. After this
kernel version items left on disk by this feature will take some extra
space. Testing showed that the size is actually negligible but for
completeness' sake give ability to users to remove such left-overs.
This is achieved by iterating every fs root and removing respective
items as well as relevant csum extents since the ino cache used the csum
tree for csums.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Add support for json formatting. Switch hard coded printing code to
formatted print with output formatter. Json output would be useful for
other programs that parse output of the command.
The plain text format is not changed for backward compatibility but this
requires to do another switch by the output type.
Example text format:
device: /dev/vdb
devid 1
write_io_errs: 0
read_io_errs: 0
flush_io_errs: 0
corruption_errs: 0
generation_errs: 0
Example json format:
{
"__header": {
"version": "1"
},
"device-stats": [
{
"device": "/dev/vdb",
"devid": "1",
"write_io_errs": "0",
"read_io_errs": "0",
"flush_io_errs": "0",
"corruption_errs": "0",
"generation_errs": "0"
}
]
}
Issue: #291
Signed-off-by: Sidong Yang <realwakka@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Extends fmt_print_start_group() so it can handle when name argument is
NULL. It is useful for printing unnamed array or map.
Signed-off-by: Sidong Yang <realwakka@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
While trying to run down a corruption problem I needed to use
btrfs-image to generate known good states in between tests. At some
point this started failing with
either extent tree is corrupted or deprecated extent ref format
This is because the fs had an extent item that was large enough that it
no longer had inline extent references, they were all keyed extent
references. The check is bogus, we can have extent items that are >=
the extent item size, not just > than the extent item size. Fix the
check so that we can generate metadata dumps properly.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
While debugging a corruption problem I realized we don't spit out the
flags for nodes, which is needed when debugging relocation problems so
we know which nodes are the RELOC root items and which are the actual fs
tree's items. Fix this by unifying the header printing helper so both
leaf's and nodes get the same information printed out.
node 41070940160 level 1 items 34 free space 87 generation 7709536 owner ROOT_TREE
node 41070940160 flags 0x1(WRITTEN) backref revision 1
Same for leaves:
leaf 41070944256 items 12 free space 515 generation 7709536 owner ROOT_TREE
leaf 41070944256 flags 0x1(WRITTEN) backref revision 1
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
While debugging some corruption, I got confused because it appeared as
if we had an invalid parent set on a extent reference, because of this
message:
tree backref 67014213632 parent 5 root 5 not found in extent tree
But it turns out that parent and the root are a union, and we were just
printing it out regardless of the type of backref it was. Fix the error
message to be consistent with the other mismatch messages, simply print
parent or root, depending on the ref type.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The two variants of unit options are not suitable for all commands, the
short options could interfere with existing options or limit future
extensions.
In 'filesystem du' the short options are not documented neither in help
text, nor in documentation so fix the code
In 'scrub status' it's the same but the documentation needs to be fixed
as well.
Signed-off-by: David Sterba <dsterba@suse.com>
The help text and documentation of the --rootid and --uuid parameters
is wrong as it does not say there's a required parameter. Add it and
enhance the docs to clarify what the options do.
Issue: #317
Signed-off-by: David Sterba <dsterba@suse.com>