[PITFALLS]
There are several hidden pitfalls of the existing traverse_directory():
- Hand written preorder traversal
There is already a better written standard library function, nftw()
doing exactly what we need.
- Over-designed path list
To properly handle the directory change, we have structure
directory_name_entry, to record every inode until rootdir.
But it has two string members, dir_name and path, which is a little
confusing and overkill.
As for preorder traversal, we will never need to read the parent's
filename, just its btrfs inode number.
And it's exported while no one utilizes it out of mkfs/rootdir.c.
- Weird inode numbers
We use the inode number from st->st_ino, with an extra offset.
This by itself is not safe, if the rootdir has child directories in
another filesystem.
And this results in very weird inode numbers, e.g.:
item 0 key (256 INODE_ITEM 0) itemoff 16123 itemsize 160
item 6 key (88347519 INODE_ITEM 0) itemoff 15815 itemsize 160
item 16 key (88347520 INODE_ITEM 0) itemoff 15363 itemsize 160
item 20 key (88347521 INODE_ITEM 0) itemoff 15119 itemsize 160
item 24 key (88347522 INODE_ITEM 0) itemoff 14875 itemsize 160
item 26 key (88347523 INODE_ITEM 0) itemoff 14700 itemsize 160
item 28 key (88347524 INODE_ITEM 0) itemoff 14525 itemsize 160
item 30 key (88347557 INODE_ITEM 0) itemoff 14350 itemsize 160
item 32 key (88347566 INODE_ITEM 0) itemoff 14175 itemsize 160
Which is far from a regular fs created by copying the data.
- Weird directory inode size calculation
Unlike the kernel, which updates the directory inode size every time a
new child inode is added, we calculate the directory inode size by
searching all its children first, so new inodes linked to this
directory later won't touch the inode size.
- Bad hard link detection and cross mount point handling
The hard link detection is purely based on the st_ino returned from
the host filesystem, this means we do not have extra checks whether
the inode is even inside the same fs.
And we directly reuse st_nlink from the host filesystem, if there
is a hard link out of rootdir, the st_nlink will be incorrect and
cause a corrupted fs.
Enhance all these points by:
- Use nftw() to do the preorder traversal
It also provides the extra level detection, which is pretty handy.
- Use a simple local inode_entry to record each parent
The only value is a u64 to record the inode number.
And one simple rootdir_path structure to record the list of
inode_entry, along with the current level.
This rootdir_path structure along with two helpers,
rootdir_path_push() and rootdir_path_pop(), along with the
preorder traversal provided by nftw(), are enough for us to record
all the parent directories until the rootdir.
- Grab new inode number properly
Just call btrfs_get_free_objectid() to grab a proper inode number,
rather than using some weird calculated value.
- Treat every inode as a new one
This means we will have no hard link support for now.
But I still believe it's a good trade-off, especially considering the
old handling is buggy for several corner cases.
- Use btrfs_insert_inode() and btrfs_add_link() to update directory
inode automatically
With all the refactoring, the code is shorter and easier to read.
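
As a rough illustration of the approach described above (not the actual
mkfs/rootdir.c code), the sketch below walks a directory with nftw() in
preorder and keeps a stack of parent inode numbers that is trimmed using
the level reported by nftw(). The names parent_stack, stack_push(),
stack_pop(), visit() and walk() are hypothetical, and the next_ino counter
is only a stand-in for btrfs_get_free_objectid().

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <ftw.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/stat.h>

#define MAX_DEPTH 64

/* Hypothetical stand-in for rootdir_path: a stack of parent inode numbers. */
struct parent_stack {
	uint64_t ino[MAX_DEPTH];
	int level;	/* number of directories currently on the stack */
};

static struct parent_stack parents;
static uint64_t next_ino = 256;	/* stand-in for btrfs_get_free_objectid() */

static void stack_push(struct parent_stack *s, uint64_t ino)
{
	assert(s->level < MAX_DEPTH);
	s->ino[s->level++] = ino;
}

static void stack_pop(struct parent_stack *s)
{
	assert(s->level > 0);
	s->level--;
}

/* nftw() callback. Without FTW_DEPTH a directory is seen before its children. */
static int visit(const char *path, const struct stat *st, int type,
		 struct FTW *ftw)
{
	uint64_t ino = next_ino++;

	(void)st;
	/* We may have left deeper directories: drop them until depths match. */
	while (parents.level > ftw->level)
		stack_pop(&parents);

	if (parents.level > 0)
		printf("%s: ino %llu parent %llu\n", path,
		       (unsigned long long)ino,
		       (unsigned long long)parents.ino[parents.level - 1]);
	else
		printf("%s: ino %llu (rootdir)\n", path,
		       (unsigned long long)ino);

	/* Only directories can become parents of later entries. */
	if (type == FTW_D)
		stack_push(&parents, ino);
	return 0;
}

/* Walk rootdir without following symlinks. Returns the nftw() result. */
int walk(const char *rootdir)
{
	parents.level = 0;
	next_ino = 256;
	return nftw(rootdir, visit, 32, FTW_PHYS);
}
```

When an entry at level N is visited, the stack holds exactly N parents, so
the top of the stack is always the directory the entry has to be linked
into.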
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Qu Wenruo <wqu@suse.com>
We use the UASSERT() wrapper instead of the plain assert() as this can
be tuned to print the stack trace too if supported.
Signed-off-by: David Sterba <dsterba@suse.com>
All issues have been fixed in latest master, enable the checks for devel
too. It takes about 17m. Also rename the file, drop the "ci-" prefix.
Signed-off-by: David Sterba <dsterba@suse.com>
Fixes so 'python3 -m build' works and package can be uploaded to pypi
(https://pypi.org/project/btrfsutil/).
- setup.py is still used for local build (make)
- for pypi it must be done by 'python3 -m build', which builds in a
temporary directory
- btrfsutilpy.h must be also distributed
- version is set manually (the git VERSION file is not accessible)
- the project page metadata is empty, the README.md should be added
Issue: #310
Signed-off-by: David Sterba <dsterba@suse.com>
Make it more visible what the result of snapshotted subvolume is. This
partially duplicates the other section.
[ci skip]
Issue: #644
Signed-off-by: David Sterba <dsterba@suse.com>
It is possible to create swapfile on a multi-device filesystem but it's
not reliable. The check that verifies that in kernel:
} else if (device != map->stripes[0].dev) {
	btrfs_warn(fs_info, "swapfile must be on one device");
	ret = -EINVAL;
	goto out;
}
This does not count devices but rather the actual placement of the
swapfile extents, so multi-device filesystem with single profile can
create it as long as there's enough space and the allocator decides to
place it properly.
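
To make the distinction concrete, here is a minimal userspace sketch of a
placement-based check. It is not the kernel code: the devid array (one
device id per swapfile extent) and the function name
swapfile_on_one_device() are assumptions for illustration only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical input: devid[i] is the device holding the i-th swapfile
 * extent. The number of devices in the filesystem is never looked at.
 */
bool swapfile_on_one_device(const uint64_t *devid, size_t nr_extents)
{
	assert(devid != NULL || nr_extents == 0);

	/* Every extent must sit on the same device as the first one. */
	for (size_t i = 1; i < nr_extents; i++) {
		if (devid[i] != devid[0])
			return false;
	}
	return true;
}
```

A filesystem with several devices passes as long as all extents happen to
land on one of them, which is why the result depends on the allocator.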
[ci skip]
Pull-request: #839
Signed-off-by: David Sterba <dsterba@suse.com>
The thread sanitizer finds race conditions and in the past did find
some bugs. There's not much threaded code, it's namely the progress
tracking in btrfs-convert so the coverage is slightly redundant. Add it
just in case.
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
ASAN test fails at misc/055 with the following leak:
Qgroupid Referenced Exclusive Path
-------- ---------- --------- ----
0/5 16.00KiB 16.00KiB <toplevel>
0/256 16.00KiB 16.00KiB <stale>
====== RUN CHECK /home/runner/work/btrfs-progs/btrfs-progs/btrfs qgroup clear-stale /home/runner/work/btrfs-progs/btrfs-progs/tests/mnt
=================================================================
==102571==ERROR: LeakSanitizer: detected memory leaks
Indirect leak of 4096 byte(s) in 1 object(s) allocated from:
#0 0x7fd1c98fbb37 in malloc ../../../../src/libsanitizer/asan/asan_malloc_linux.cpp:69
#1 0x55aa2f8953f8 in btrfs_util_subvolume_path_fd libbtrfsutil/subvolume.c:178
#2 0x55aa2f8fa2a6 in get_or_add_qgroup cmds/qgroup.c:837
#3 0x55aa2f8fa7e9 in update_qgroup_info cmds/qgroup.c:883
#4 0x55aa2f8fd912 in __qgroups_search cmds/qgroup.c:1385
#5 0x55aa2f8fe196 in qgroups_search_all cmds/qgroup.c:1453
#6 0x55aa2f902a7c in cmd_qgroup_clear_stale cmds/qgroup.c:2281
#7 0x55aa2f73425b in cmd_execute cmds/commands.h:126
#8 0x55aa2f734bcc in handle_command_group /home/runner/work/btrfs-progs/btrfs-progs/btrfs.c:177
#9 0x55aa2f73425b in cmd_execute cmds/commands.h:126
#10 0x55aa2f735a96 in main /home/runner/work/btrfs-progs/btrfs-progs/btrfs.c:518
#11 0x7fd1c942a1c9 (/lib/x86_64-linux-gnu/libc.so.6+0x2a1c9) (BuildId: 08134323d00289185684a4cd177d202f39c2a5f3)
#12 0x7fd1c942a28a in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x2a28a) (BuildId: 08134323d00289185684a4cd177d202f39c2a5f3)
#13 0x55aa2f734144 in _start (/home/runner/work/btrfs-progs/btrfs-progs/btrfs+0x84144) (BuildId: 56f3dd838e1ae189c142c5d27fac025cd46deddb)
Indirect leak of 432 byte(s) in 2 object(s) allocated from:
#0 0x7fd1c98fb4d0 in calloc ../../../../src/libsanitizer/asan/asan_malloc_linux.cpp:77
#1 0x55aa2f8fa1a1 in get_or_add_qgroup cmds/qgroup.c:822
#2 0x55aa2f8fa7e9 in update_qgroup_info cmds/qgroup.c:883
#3 0x55aa2f8fd912 in __qgroups_search cmds/qgroup.c:1385
#4 0x55aa2f8fe196 in qgroups_search_all cmds/qgroup.c:1453
#5 0x55aa2f902a7c in cmd_qgroup_clear_stale cmds/qgroup.c:2281
#6 0x55aa2f73425b in cmd_execute cmds/commands.h:126
#7 0x55aa2f734bcc in handle_command_group /home/runner/work/btrfs-progs/btrfs-progs/btrfs.c:177
#8 0x55aa2f73425b in cmd_execute cmds/commands.h:126
#9 0x55aa2f735a96 in main /home/runner/work/btrfs-progs/btrfs-progs/btrfs.c:518
#10 0x7fd1c942a1c9 (/lib/x86_64-linux-gnu/libc.so.6+0x2a1c9) (BuildId: 08134323d00289185684a4cd177d202f39c2a5f3)
#11 0x7fd1c942a28a in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x2a28a) (BuildId: 08134323d00289185684a4cd177d202f39c2a5f3)
#12 0x55aa2f734144 in _start (/home/runner/work/btrfs-progs/btrfs-progs/btrfs+0x84144) (BuildId: 56f3dd838e1ae189c142c5d27fac025cd46deddb)
[CAUSE]
Above leaks are caused by two btrfs_qgroup structures and one path for
toplevel qgroup.
It's caused by the fact that we called qgroups_search_all() but didn't
do any cleanup.
[FIX]
Call __free_all_qgroups() inside cmd_qgroup_clear_stale() to properly
free the qgroups.
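
The pattern behind the fix can be sketched as below. This is a simplified
model, not the cmds/qgroup.c code: struct qgroup, search_all(), free_all()
and clear_stale() are hypothetical stand-ins, and live_allocs only exists
to make the leak visible.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical, heavily reduced qgroup record with an allocated path. */
struct qgroup {
	char *path;
	struct qgroup *next;
};

int live_allocs;	/* outstanding allocations, for illustration only */

static void *xalloc(size_t size)
{
	void *p = calloc(1, size);

	if (p)
		live_allocs++;
	return p;
}

static void xfree(void *p)
{
	if (!p)
		return;
	assert(live_allocs > 0);
	live_allocs--;
	free(p);
}

static struct qgroup *add_qgroup(struct qgroup *head, const char *path)
{
	struct qgroup *q = xalloc(sizeof(*q));

	if (!q)
		return head;
	q->path = xalloc(strlen(path) + 1);
	if (q->path)
		strcpy(q->path, path);
	q->next = head;
	return q;
}

/* Stand-in for the search step: allocates the structures and the paths. */
struct qgroup *search_all(void)
{
	struct qgroup *head = NULL;

	head = add_qgroup(head, "<toplevel>");
	head = add_qgroup(head, "<stale>");
	return head;
}

/* Stand-in for the cleanup step: frees every qgroup and its path. */
void free_all(struct qgroup *head)
{
	while (head) {
		struct qgroup *next = head->next;

		xfree(head->path);
		xfree(head);
		head = next;
	}
}

int clear_stale(void)
{
	struct qgroup *all = search_all();

	/* ... the stale qgroups would be deleted here ... */
	free_all(all);	/* without this call everything above leaks */
	return 0;
}
```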
Fixes: 701ab151c2 ("btrfs-progs: qgroup: new command to delete stale qgroups")
Signed-off-by: Qu Wenruo <wqu@suse.com>
Commit d7492ec59e ("btrfs-progs: use on-stack buffer in
__ino_to_path_fd") was supposed to switch path buffer from dynamic
allocation to on-stack but it was done wrong. The btrfs_data_container
is a flexible array so it needs to be explicitly allocated to the right
size.
The conversion turned it to an array. Gcc 13.x started to warn about
access to fspath->val[i] being out of bounds. Fortunately the overall size
was 65536 and only the first 4096 bytes were used.
cmds/inspect.c: In function ‘__ino_to_path_fd’:
cmds/inspect.c:86:35: warning: array subscript i is outside array bounds of ‘__u64[]’ {aka ‘long long unsigned int[]’} [-Warray-bounds=]
86 | ptr += fspath->val[i];
| ~~~~~~~~~~~^~~
In file included from ./kernel-shared/accessors.h:11,
from cmds/inspect.c:35:
./kernel-shared/uapi/btrfs.h:724:17: note: while referencing ‘val’
724 | __u64 val[]; /* out */
Add an on-stack buffer and map it over fspath, similar to the previous
dynamic array.
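
One possible way to put a structure with a flexible array member on the
stack is sketched below. It is an illustration under assumptions, not the
code from cmds/inspect.c: struct container is a reduced stand-in for
btrfs_data_container, BUF_SIZE is arbitrary, and the union is just one way
to reserve the bytes behind the header.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Reduced stand-in for struct btrfs_data_container (illustrative only). */
struct container {
	uint32_t elem_cnt;
	uint32_t pad;
	uint64_t val[];		/* flexible array member, sized by the caller */
};

#define BUF_SIZE 4096

/* On-stack storage: the raw bytes reserve room for the header plus val[]. */
union container_buf {
	struct container c;
	char raw[BUF_SIZE];
};

/* How many val[] entries fit into the buffer. */
size_t container_capacity(void)
{
	return (BUF_SIZE - offsetof(struct container, val)) / sizeof(uint64_t);
}

/* Fill the first n entries with 1..n and return their sum. */
uint64_t fill_and_sum(size_t n)
{
	union container_buf buf;
	struct container *fspath = &buf.c;
	uint64_t sum = 0;

	assert(n <= container_capacity());
	memset(&buf, 0, sizeof(buf));
	fspath->elem_cnt = (uint32_t)n;
	for (size_t i = 0; i < n; i++)
		fspath->val[i] = i + 1;
	for (size_t i = 0; i < fspath->elem_cnt; i++)
		sum += fspath->val[i];
	return sum;
}
```

Declaring val as a fixed-size array of the structure itself would instead
leave no storage behind the header, which is what the warning points at.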
Signed-off-by: David Sterba <dsterba@suse.com>
mkfs_main() is a main-like function, meaning that return and exit are
equivalent. Deduplicate our cleanup code by moving the error label.
Signed-off-by: Mark Harmstone <maharmstone@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Recent changes to the python code reworked path_converter() so that it
does not use fd_converter() anymore. Assuming it may be used again in the
future, comment it out.
Signed-off-by: David Sterba <dsterba@suse.com>
Currently mkfs uses its own create_uuid_tree(), but that function is
only handling FS_TREE. This means for btrfs-convert we do not generate
the uuid tree, nor add the UUID of the image subvolume. This can be a
problem if we're going to support multiple subvolumes during mkfs time.
To address this, introduce a new helper, btrfs_rebuild_uuid_tree():
- Create a new uuid tree if there is not one
- Remove all the existing items from uuid tree
- Iterate through all subvolumes
* If the subvolume has no valid UUID, regenerate one
* Add the uuid entry for the subvolume UUID
* If the subvolume has received UUID, also add it to UUID tree
This way, the new helper can handle all the uuid tree generation needs for:
- Current mkfs
Only one uuid entry for FS_TREE
- Current btrfs-convert
Only FS_TREE and the image subvolume
- Future multi-subvolume mkfs
As we do the scan for all subvolumes.
- Future "btrfs rescue rebuild-uuid-tree"
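
The steps listed for btrfs_rebuild_uuid_tree() can be modelled with plain
arrays as below. This is only a sketch of the control flow: the structures,
the UUID_KEY_* names and fake_generate_uuid() are hypothetical and do not
touch any on-disk tree.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define UUID_SIZE	16
#define MAX_ENTRIES	16

/* Hypothetical in-memory model, not the on-disk format. */
enum uuid_key_type { UUID_KEY_SUBVOL, UUID_KEY_RECEIVED_SUBVOL };

struct subvol {
	uint64_t id;
	uint8_t uuid[UUID_SIZE];
	uint8_t received_uuid[UUID_SIZE];
};

struct uuid_entry {
	uint8_t uuid[UUID_SIZE];
	enum uuid_key_type type;
	uint64_t subvolid;
};

struct uuid_tree {
	struct uuid_entry entries[MAX_ENTRIES];
	size_t nr;
};

static bool uuid_is_null(const uint8_t *uuid)
{
	static const uint8_t zero[UUID_SIZE];

	return memcmp(uuid, zero, UUID_SIZE) == 0;
}

/* Placeholder only: real code would generate a proper random UUID. */
static void fake_generate_uuid(uint8_t *uuid, uint64_t seed)
{
	for (int i = 0; i < UUID_SIZE; i++)
		uuid[i] = (uint8_t)(seed + i + 1);
}

static void tree_add(struct uuid_tree *t, const uint8_t *uuid,
		     enum uuid_key_type type, uint64_t subvolid)
{
	assert(t->nr < MAX_ENTRIES);
	memcpy(t->entries[t->nr].uuid, uuid, UUID_SIZE);
	t->entries[t->nr].type = type;
	t->entries[t->nr].subvolid = subvolid;
	t->nr++;
}

void rebuild_uuid_tree(struct uuid_tree *t, struct subvol *subvols, size_t nr)
{
	t->nr = 0;	/* remove all the existing items */

	for (size_t i = 0; i < nr; i++) {
		struct subvol *sv = &subvols[i];

		/* No valid UUID: regenerate one. */
		if (uuid_is_null(sv->uuid))
			fake_generate_uuid(sv->uuid, sv->id);
		tree_add(t, sv->uuid, UUID_KEY_SUBVOL, sv->id);

		/* A received UUID gets its own entry as well. */
		if (!uuid_is_null(sv->received_uuid))
			tree_add(t, sv->received_uuid,
				 UUID_KEY_RECEIVED_SUBVOL, sv->id);
	}
}
```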
Signed-off-by: Qu Wenruo <wqu@suse.com>
The modification is minimal:
- Replace WARN_ON() with UASSERT()
- Remove the @trans parameter for btrfs_extend_item() and
btrfs_mark_buffer_dirty()
As the progs version doesn't need a transaction handle.
- Remove the btrfs_uuid_tree_add() in mkfs/main.c
Signed-off-by: Qu Wenruo <wqu@suse.com>
Currently we already have a kernel-shared/uuid-tree.c, which is mostly
shared with kernel.
Kernel also has a uuid-tree.h, but we are still using ctree.h for the
header.
Move all the uuid-tree related definitions to kernel-shared/uuid-tree.h,
making future code sync easier.
Signed-off-by: Qu Wenruo <wqu@suse.com>
btrfs_insert_dir_item wasn't setting the transid field in
btrfs_dir_item. Set it to the current transaction ID rather than writing
uninitialized memory to disk.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Mark Harmstone <maharmstone@fb.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Python 3.13, currently in beta, removed the internal
_PyObject_LookupSpecial function. The libbtrfsutil Python bindings use
it in the path_converter() function because it was based on internal
path_converter() function in CPython [1]. This is causing build failures
on Fedora Rawhide [2] and Gentoo [3]. Replace path_converter() with a
version that only uses public functions based on the one in drgn [4].
1: d9efa45d74/Modules/posixmodule.c (L1253)
2: https://bugzilla.redhat.com/show_bug.cgi?id=2245650
3: https://github.com/kdave/btrfs-progs/issues/838
4: 9ad29fd864/libdrgn/python/util.c (L81)
Issue: #838
Reported-by: Neal Gompa <neal@gompa.dev>
Reported-by: Sam James <sam@gentoo.org>
Reviewed-by: Neal Gompa <neal@gompa.dev>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
With the new feature description output in "btrfs --version" there is no
need to do the config.h hack to determine if we have certain feature.
This provides a more reliable way to detect features.
Signed-off-by: Qu Wenruo <wqu@suse.com>
For "btrfstune --csum", currently we do the following operations in just
one transaction for each:
- Delete old data csums
- Change new data csums objectid
Both operations can modify up to GiB or even TiB level of metadata, doing
them in just one transaction is definitely going to cause problems.
This patch adds a leaf number based threshold (32 leaves), after
modifying/deleting this many leaves, we commit a transaction to avoid
huge amount of dirty leaves piling up.
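
The idea of the threshold can be sketched as follows. This is a simplified
model and not the btrfstune code: struct fake_trans, commit() and
process_leaves() are hypothetical, and a "commit" here only resets a
counter.

```c
#include <assert.h>

#define LEAF_THRESHOLD 32

/* Hypothetical stand-in for a transaction, tracking dirty leaves only. */
struct fake_trans {
	unsigned int dirty_leaves;
	unsigned int commits;
};

static void commit(struct fake_trans *t)
{
	assert(t->dirty_leaves <= LEAF_THRESHOLD);
	t->dirty_leaves = 0;
	t->commits++;
}

/* Modify nr_leaves leaves and return how many commits were needed. */
unsigned int process_leaves(unsigned int nr_leaves)
{
	struct fake_trans trans = { 0, 0 };

	for (unsigned int i = 0; i < nr_leaves; i++) {
		trans.dirty_leaves++;	/* modify or delete one leaf */

		/* Commit early so that dirty leaves never pile up. */
		if (trans.dirty_leaves >= LEAF_THRESHOLD)
			commit(&trans);
	}
	/* Flush whatever is left below the threshold. */
	if (trans.dirty_leaves > 0)
		commit(&trans);
	return trans.commits;
}
```

In this model the number of dirty leaves held at any time is bounded by
the threshold, no matter how large the csum tree is.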
Signed-off-by: Qu Wenruo <wqu@suse.com>
The new test case is to make sure the rollback output for a fixed
content converted fs contains the string "ext2_saved/image".
We had a bug in the past where the string "ext2_saved" could be followed by
unterminated garbage.
Signed-off-by: Qu Wenruo <wqu@suse.com>
[BUG]
When rolling back a converted btrfs, the filename output is corrupted:
$ btrfs-convert -r ~/test.img
btrfs-convert from btrfs-progs v6.9.2
Open filesystem for rollback:
Label:
UUID: df54baf3-c91e-4956-96f9-99413a857576
Restoring from: ext2_saved0ƨy/image
^^^ Corruption
Rollback succeeded
[CAUSE]
The error is in how we handle the filename. In btrfs, none of our strings
are '\0' terminated; they are stored with an explicit length.
But in C, most strings are '\0' terminated, so after reading a filename
from btrfs, we need to manually terminate the string.
However the code adding the terminating '\0' looks like this:
/* Get the filename length. */
name_len = btrfs_root_ref_name_len(path.nodes[0], root_ref_item);
/*
* This should not happen, but as an extra handling for possible
* corrupted btrfs.
*/
if (name_len > sizeof(dir_name))
name_len = sizeof(dir_name) - 1;
/* Got the real filename into our buffer. */
read_extent_buffer(path.nodes[0], dir_name, (unsigned long)(root_ref_item + 1), name_len);
/* Terminate the string. */
dir_name[sizeof(dir_name) - 1] = 0;
The problem is, the final termination is totally wrong, it always makes
the last buffer char '\0', not using the @name_len we read before.
[FIX]
Use @name_len to terminate the string, as we have already updated it to
handle buffer overflow, it can handle both the regular and corrupted
case.
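
A minimal sketch of the corrected pattern, outside of btrfs-progs:
copy_name() is a hypothetical helper that clamps the length so the
terminator always fits, then terminates right after the copied bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Copy a name stored with an explicit length into a C string buffer.
 * copy_name() is a hypothetical helper, not a btrfs-progs function.
 * Returns the number of bytes kept, excluding the terminator.
 */
size_t copy_name(char *dst, size_t dst_size, const char *src, size_t name_len)
{
	assert(dst_size > 0);

	/* Clamp so the terminator always fits, even for a corrupted length. */
	if (name_len > dst_size - 1)
		name_len = dst_size - 1;

	memcpy(dst, src, name_len);
	/* Terminate right after the name, not at the end of the buffer. */
	dst[name_len] = '\0';
	return name_len;
}
```

Terminating only at the last byte of the buffer leaves whatever was in the
buffer between the name and that byte, which is where the garbage after
"ext2_saved" came from.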
Fixes: dc29a5c51d ("btrfs-progs: convert: update default output")
Signed-off-by: Qu Wenruo <wqu@suse.com>
This test case checks:
- If a regular btrfs-image dump has the unsanitized filenames
- If a sanitized btrfs-image dump has filenames properly censored
Signed-off-by: Qu Wenruo <wqu@suse.com>
The new test case does:
- Make sure the build has error injection support
This is done by checking "btrfs --version" output.
- Inject error at the last commit transaction of new data csum
generation
- Resume the csum conversion and make sure it works
Signed-off-by: Qu Wenruo <wqu@suse.com>
[BUG]
There is a bug report that image dump taken by "btrfs-image -s" doesn't
really sanitize the filenames:
# truncate -s 1G source.raw
# mkfs.btrfs -f source.raw
# mount source.raw $mnt
# touch $mnt/top_secret_filename
# touch $mnt/secret_filename
# umount $mnt
# btrfs-image -s source.raw dump.img
# strings dump.img | grep filename
top_secret_filename
secret_filename
top_secret_filename
secret_filename
top_secret_filename
[CAUSE]
Using the above image to store the fs, we got the following result in the fs
tree:
item 0 key (256 INODE_ITEM 0) itemoff 16123 itemsize 160
generation 3 transid 7 size 68 nbytes 16384
block group 0 mode 40755 links 1 uid 0 gid 0 rdev 0
sequence 2 flags 0x0(none)
item 1 key (256 INODE_REF 256) itemoff 16111 itemsize 12
index 0 namelen 2 name: ..
item 2 key (256 DIR_ITEM 439756795) itemoff 16062 itemsize 49
location key (257 INODE_ITEM 0) type FILE
transid 7 data_len 0 name_len 19
name: top_secret_filename
item 3 key (256 DIR_ITEM 693462946) itemoff 16017 itemsize 45
location key (258 INODE_ITEM 0) type FILE
transid 7 data_len 0 name_len 15
name: secret_filename
item 4 key (256 DIR_INDEX 2) itemoff 15968 itemsize 49
location key (257 INODE_ITEM 0) type FILE
transid 7 data_len 0 name_len 19
name: top_secret_filename
item 5 key (256 DIR_INDEX 3) itemoff 15923 itemsize 45
location key (258 INODE_ITEM 0) type FILE
transid 7 data_len 0 name_len 15
name: secret_filename
item 6 key (257 INODE_ITEM 0) itemoff 15763 itemsize 160
generation 7 transid 7 size 0 nbytes 0
block group 0 mode 100644 links 1 uid 0 gid 0 rdev 0
sequence 1 flags 0x0(none)
item 7 key (257 INODE_REF 256) itemoff 15734 itemsize 29
index 2 namelen 19 name: top_secret_filename
item 8 key (258 INODE_ITEM 0) itemoff 15574 itemsize 160
generation 7 transid 7 size 0 nbytes 0
block group 0 mode 100644 links 1 uid 0 gid 0 rdev 0
sequence 1 flags 0x0(none)
item 9 key (258 INODE_REF 256) itemoff 15549 itemsize 25
index 3 namelen 15 name: 1���'�gc*&R
The result shows that only the last INODE_REF got sanitized, all the
remaining items are not touched at all.
This is caused by how we sanitize the filenames:
copy_buffer()
|- memcpy(dst, src->data, src->len);
| This means we copy the whole eb into our buffer already.
|
|- zero_items()
|- sanitize_name()
|- eb = alloc_dummy_eb();
|- memcpy(eb->data, src->data, src->len);
| This means we generate a dummy eb with the same contents of
| the source eb.
|
|- sanitize_dir_item();
| We override the dir item of the given item (specified by the
| slot number) inside our dummy eb.
|
|- memcpy(dst, eb->data, eb->len);
The last one copies the dummy eb into our buffer, with only the slot
corrupted.
But when the whole work flow hits the next slot, we only corrupt the
next slot, but still copy the whole dummy eb back to buffer.
This means the previous slot would be overwritten by the old unsanitized
data.
As a result, only the last slot is corrupted.
[FIX]
Fix the bug by only copying back the corrupted item to the buffer.
So that other slots won't be overwritten by unsanitized data.
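
The fix can be illustrated with a toy buffer of fixed-size items. This is
a sketch under assumptions, not the btrfs-image code: ITEM_SIZE, NR_ITEMS,
sanitize_slot() and sanitize_all() are hypothetical, and "sanitizing" is
just overwriting the item with 'x'.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define ITEM_SIZE	8
#define NR_ITEMS	3
#define EB_SIZE		(ITEM_SIZE * NR_ITEMS)

/* Scrub one slot through a dummy copy, then copy back only that slot. */
void sanitize_slot(char *dst, const char *src, int slot)
{
	char dummy[EB_SIZE];
	size_t off;

	assert(slot >= 0 && slot < NR_ITEMS);
	off = (size_t)slot * ITEM_SIZE;

	/* The dummy starts as a full copy of the unsanitized source. */
	memcpy(dummy, src, EB_SIZE);
	memset(dummy + off, 'x', ITEM_SIZE);

	/*
	 * Copying the whole dummy back would restore the unsanitized
	 * contents of every other slot, so copy the scrubbed item only.
	 */
	memcpy(dst + off, dummy + off, ITEM_SIZE);
}

void sanitize_all(char *dst, const char *src)
{
	memcpy(dst, src, EB_SIZE);
	for (int slot = 0; slot < NR_ITEMS; slot++)
		sanitize_slot(dst, src, slot);
}
```

With the partial copy every slot stays scrubbed after the loop, while
copying the full dummy back each time would leave only the last slot
scrubbed, matching the dump shown above.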
Reported-by: Andrea Gelmini <andrea.gelmini@gmail.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
The filename sanitization is not recommended as it introduces mismatches
between DIR_ITEM and INODE_REF.
Even hash collision mode (double "-s" option) is not ensured to always
find a hash collision, and when it fails to find one, a mismatch happens.
And when a mismatch happens, the kernel will not resolve the path
correctly since the kernel uses the hash from DIR_ITEM to look up the child
inode.
So add a warning into the "-s" option of btrfs-image.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Add exceptions that should not be reported as typos for a reason (names,
abbreviations, preferred other spelling).
Author: Yaroslav Halchenko <debian@onerussian.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Run spellchecking on devel branch on push or on a pull-request.
Author: Yaroslav Halchenko <debian@onerussian.com>
Signed-off-by: David Sterba <dsterba@suse.com>
To ensure that package indexes are up to date.
That should help to avoid recent failed CI runs, which failed to install
certain packages as local cache is out-of-date and remote mirrors no
longer provide that specific (and out-of-date) version of package:
E: Failed to fetch http://azure.archive.ubuntu.com/ubuntu/pool/main/s/systemd/libudev-dev_255.4-1ubuntu8.1_amd64.deb 404 Not Found [IP: 52.147.219.192 80]
Signed-off-by: Yaroslav Halchenko <debian@onerussian.com>
[ Minor modification on the commit message. ]
Signed-off-by: Qu Wenruo <wqu@suse.com>
[ Move cache update to a separate command. ]
Signed-off-by: David Sterba <dsterba@suse.com>
The function btrfs_mksubvol() is very different between btrfs-progs and
kernel, the former version is really just linking a subvolume to another
directory inode, but the kernel version is really to make a completely
new subvolume.
Instead of a same-named function, introduce btrfs_link_subvolume() and use
it to replace the old btrfs_mksubvol().
This is done by:
- Introduce btrfs_link_subvolume()
Which does extra checks before doing any modification:
* Make sure the target inode is a directory
* Make sure there is no filename conflict
Then do the linkage:
* Add the dir_item/dir_index into the parent inode
* Add the forward and backward root refs into tree root
- Introduce link_image_subvolume() helper
Currently btrfs_mksubvol() has a dedicated convert filename retry
behavior, which is unnecessary and should be done by the convert code.
Now move the filename retry behavior into the helper.
- Remove btrfs_mksubvol()
Since there is only one caller utilizing btrfs_mksubvol(), and it's
now gone, we can remove the old btrfs_mksubvol().
Signed-off-by: Qu Wenruo <wqu@suse.com>
There are two different subvolume/data reloc tree creation routines:
- create_subvol() from convert/main.c
* calls btrfs_copy_root() to create an empty root
This is not safe, as it relies on the source root to be empty.
* calls btrfs_read_fs_root() to add it to the cache and trace it
properly
* calls btrfs_make_root_dir() to initialize the empty new root
- create_data_reloc_tree() from mkfs/main.c
* calls btrfs_create_tree() to create an empty root
* Manually add the root to fs_root cache
This is only safe for data reloc tree as it's never updated
inside btrfs-progs.
But not safe for other subvolume trees.
* manually setup the root dir
Both have their good and bad aspects, so here we introduce a new helper,
btrfs_make_subvolume():
- Calls btrfs_create_tree() to create an empty root
- Calls btrfs_read_fs_root() to setup the cache and tracking properly
- Calls btrfs_make_root_dir() to initialize the root dir
- Calls btrfs_update_root() to reflect the rootdir change
So this new helper can replace both create_subvol() and
create_data_reloc_tree().
Signed-off-by: Qu Wenruo <wqu@suse.com>
The list-chunk command is deemed to be reasonably complete so make it
visible in the default build. The output can be tweaked later.
Issue: #559
Signed-off-by: David Sterba <dsterba@suse.com>
The long options now allow passing the unit mode in the usual way, drop
the local variable for raw byte values.
Signed-off-by: David Sterba <dsterba@suse.com>