Pass a struct btrfs_inode to btrfs_readdir_get_delayed_items() as it's
an internal interface, allowing to remove some use of BTRFS_I.
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
Pass a struct btrfs_inode to btrfs_readdir_put_delayed_items() as it's
an internal interface, allowing to remove some use of BTRFS_I.
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
We can add const to many parameters, this is for clarity and minor
addition to safety. There are some minor effects, in the assembly
code and .ko measured on release config. This patch does not cover all
possible conversions.
Signed-off-by: David Sterba <dsterba@suse.com>
The index_cnt field of struct btrfs_inode is used only for two purposes:
1) To store the index for the next entry added to a directory;
2) For the data relocation inode to track the logical start address of the
block group currently being relocated.
For the relocation case we use index_cnt because it's not used for
anything else in the relocation use case - we could have used other fields
that are not used by relocation such as defrag_bytes, last_unlink_trans
or last_reflink_trans for example (among others).
Since the csum_bytes field is not used for directories, do the following
changes:
1) Put index_cnt and csum_bytes in a union, and index_cnt is only
initialized when the inode is a directory. The csum_bytes is only
accessed in IO paths for regular files, so we're fine here;
2) Use the defrag_bytes field for relocation, since the data relocation
inode is never used for defrag purposes. And to make the naming better,
alias it to reloc_block_group_start by using a union.
This reduces the size of struct btrfs_inode by 8 bytes in a release
kernel, from 1056 bytes down to 1048 bytes.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently we use the spinlock inode_lock from struct btrfs_root to
serialize access to two different data structures:
1) The delayed inodes xarray (struct btrfs_root::delayed_nodes);
2) The inodes xarray (struct btrfs_root::inodes).
Instead of using our own lock, we can use the spinlock that is part of the
xarray implementation, by using the xa_lock() and xa_unlock() APIs and
using the xarray APIs with the double underscore prefix that don't take
the xarray locks and assume the caller is using xa_lock() and xa_unlock().
So remove the spinlock inode_lock from struct btrfs_root and use the
corresponding xarray locks. This brings 2 benefits:
1) We reduce the size of struct btrfs_root, from 1336 bytes down to
1328 bytes on a 64 bits release kernel config;
2) We reduce lock contention by not using anymore the same lock for
changing two different and unrelated xarrays.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
A comment from Filipe on one of my previous cleanups brought my
attention to a new helper we have for getting the root id of a root,
which makes it easier to read in the code.
The changes where made with the following Coccinelle semantic patch:
// <smpl>
@@
expression E,E1;
@@
(
E->root_key.objectid = E1
|
- E->root_key.objectid
+ btrfs_root_id(E)
)
// </smpl>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ minor style fixups ]
Signed-off-by: David Sterba <dsterba@suse.com>
When running delayed inode updates, we do not record the inode's root in
the transaction, but we do allocate PREALLOC and thus converted PERTRANS
space for it. To be sure we free that PERTRANS meta rsv, we must ensure
that we record the root in the transaction.
Fixes: 4f5427ccce ("btrfs: delayed-inode: Use new qgroup meta rsv for delayed inode and item")
CC: stable@vger.kernel.org # 6.1+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
The SLAB_MEM_SPREAD flag used to be implemented in SLAB, which was
removed as of v6.8-rc1, so it became a dead flag since the commit
16a1d96835 ("mm/slab: remove mm/slab.c and slab_def.h"). And the
series[1] went on to mark it obsolete to avoid confusion for users.
Here we can just remove all its users, which has no functional change.
[1] https://lore.kernel.org/all/20240223-slab-cleanup-flags-v2-1-02f1753e8303@suse.cz/
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Use the KMEM_CACHE() macro instead of kmem_cache_create() to simplify
the creation of SLAB caches when the default values are used.
Signed-off-by: Kunwu Chan <chentao@kylinos.cn>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The pointer to root is initialized in btrfs_init_delayed_node(), no need
to check for it again. Change the BUG_ON to assertion.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There's a BUG_ON checking for a valid pointer of fs_info::delayed_root
but it is valid since init_mount_fs_info() and has the same lifetime as
fs_info.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The radix-tree has been superseded by the xarray
(https://lwn.net/Articles/745073), this patch converts the
btrfs_root::delayed_nodes, the APIs are used in a simple way.
First idea is to do xa_insert() but this would require GFP_ATOMIC
allocation which we want to avoid if possible. The preload mechanism of
radix-tree can be emulated within the xarray API.
- xa_reserve() with GFP_NOFS outside of the lock, the reserved entry
is inserted atomically at most once
- xa_store() under a lock, in case something races in we can detect that
and xa_load() returns a valid pointer
All uses of xa_load() must check for a valid pointer in case they manage
to get between the xa_reserve() and xa_store(), this is handled in
btrfs_get_delayed_node().
Otherwise the functionality is equivalent, xarray implements the
radix-tree and there should be no performance difference.
The patch continues the efforts started in 253bf57555 ("btrfs: turn
delayed_nodes_tree into an XArray") and fixes the problems with locking
and GFP flags 088aea3b97 ("Revert "btrfs: turn delayed_nodes_tree
into an XArray"").
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[PROBLEM]
The function __btrfs_update_delayed_inode() is doing something not
meeting the code standard of today:
path->slots[0]++
if (path->slots[0] >= btrfs_header_nritems(leaf))
goto search;
again:
if (!is_the_target_inode_ref())
goto out;
ret = btrfs_delete_item();
/* Some cleanup. */
return ret;
search:
ret = search_for_the_last_inode_ref();
goto again;
With the tag named "again", it's pretty common to think it's a loop, but
the truth is, we only need to do the search once, to locate the last
(also the first, since there should only be one INODE_REF or
INODE_EXTREF now) ref of the inode.
[FIX]
Instead of the weird jumps, just do them in a stream-lined fashion.
This removes those weird labels, and add extra comments on why we can do
the different searches.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmU/xAEACgkQxWXV+ddt
WDvYKg//SjTimA5Nins9mb4jdz8n+dDeZnQhKzy3FqInU41EzDRc4WwnEODmDlTa
AyU9rGB3k0JNSUc075jZFCyLqq/ARiOqRi4x33Gk0ckIlc4X5OgBoqP2XkPh0VlP
txskLCrmhc3pwyR4ErlFDX2jebIUXfkv39bJuE40grGvUatRe+WNq0ERIrgO8RAr
Rc3hBotMH8AIqfD1L6j1ZiZIAyrOkT1BJMuqeoq27/gJZn/MRhM9TCrMTzfWGaoW
SxPrQiCDEN3KECsOY/caroMn3AekDijg/ley1Nf7Z0N6oEV+n4VWWPBFE9HhRz83
9fIdvSbGjSJF6ekzTjcVXPAbcuKZFzeqOdBRMIW3TIUo7mZQyJTVkMsc1y/NL2Z3
9DhlRLIzvWJJjt1CEK0u18n5IU+dGngdktbhWWIuIlo8r+G/iKR/7zqU92VfWLHL
Z7/eh6HgH5zr2bm+yKORbrUjkv4IVhGVarW8D4aM+MCG0lFN2GaPcJCCUrp4n7rZ
PzpQbxXa38ANBk6hsp4ndS8TJSBL9moY8tumzLcKg97nzNMV6KpBdV/G6/QfRLCN
3kM6UbwTAkMwGcQS86Mqx6s04ORLnQeD6f7N6X4Ppx0Mi/zkjI2HkRuvQGp12B0v
iZjCCZAYY2Iu+/TU0GrCXSss/grzIAUPzM9msyV3XGO/VBpwdec=
=9TVx
-----END PGP SIGNATURE-----
Merge tag 'for-6.7-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs updates from David Sterba:
"New features:
- raid-stripe-tree
New tree for logical file extent mapping where the physical mapping
may not match on multiple devices. This is now used in zoned mode
to implement RAID0/RAID1* profiles, but can be used in non-zoned
mode as well. The support for RAID56 is in development and will
eventually fix the problems with the current implementation. This
is a backward incompatible feature and has to be enabled at mkfs
time.
- simple quota accounting (squota)
A simplified mode of qgroup that accounts all space on the initial
extent owners (a subvolume), the snapshots are then cheap to create
and delete. The deletion of snapshots in fully accounting qgroups
is a known CPU/IO performance bottleneck.
The squota is not suitable for the general use case but works well
for containers where the original subvolume exists for the whole
time. This is a backward incompatible feature as it needs extending
some structures, but can be enabled on an existing filesystem.
- temporary filesystem fsid (temp_fsid)
The fsid identifies a filesystem and is hard coded in the
structures, which disallows mounting the same fsid found on
different devices.
For a single device filesystem this is not strictly necessary, a
new temporary fsid can be generated on mount e.g. after a device is
cloned. This will be used by Steam Deck for root partition A/B
testing, or can be used for VM root images.
Other user visible changes:
- filesystems with partially finished metadata_uuid conversion cannot
be mounted anymore and the uuid fixup has to be done by btrfs-progs
(btrfstune).
Performance improvements:
- reduce reservations for checksum deletions (with enabled free space
tree by factor of 4), on a sample workload on file with many
extents the deletion time decreased by 12%
- make extent state merges more efficient during insertions, reduce
rb-tree iterations (run time of critical functions reduced by 5%)
Core changes:
- the integrity check functionality has been removed, this was a
debugging feature and removal does not affect other integrity
checks like checksums or tree-checker
- space reservation changes:
- more efficient delayed ref reservations, this avoids building up
too much work or overusing or exhausting the global block
reserve in some situations
- move delayed refs reservation to the transaction start time,
this prevents some ENOSPC corner cases related to exhaustion of
global reserve
- improvements in reducing excessive reservations for block group
items
- adjust overcommit logic in near full situations, account for one
more chunk to eventually allocate metadata chunk, this is mostly
relevant for small filesystems (<10GiB)
- single device filesystems are scanned but not registered (except
seed devices), this allows temp_fsid to work
- qgroup iterations do not need GFP_ATOMIC allocations anymore
- cleanups, refactoring, reduced data structure size, function
parameter simplifications, error handling fixes"
* tag 'for-6.7-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (156 commits)
btrfs: open code timespec64 in struct btrfs_inode
btrfs: remove redundant log root tree index assignment during log sync
btrfs: remove redundant initialization of variable dirty in btrfs_update_time()
btrfs: sysfs: show temp_fsid feature
btrfs: disable the device add feature for temp-fsid
btrfs: disable the seed feature for temp-fsid
btrfs: update comment for temp-fsid, fsid, and metadata_uuid
btrfs: remove pointless empty log context list check when syncing log
btrfs: update comment for struct btrfs_inode::lock
btrfs: remove pointless barrier from btrfs_sync_file()
btrfs: add and use helpers for reading and writing last_trans_committed
btrfs: add and use helpers for reading and writing fs_info->generation
btrfs: add and use helpers for reading and writing log_transid
btrfs: add and use helpers for reading and writing last_log_commit
btrfs: support cloned-device mount capability
btrfs: add helper function find_fsid_by_disk
btrfs: stop reserving excessive space for block group item insertions
btrfs: stop reserving excessive space for block group item updates
btrfs: reorder btrfs_inode to fill gaps
btrfs: open code btrfs_ordered_inode_tree in btrfs_inode
...
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZTppYgAKCRCRxhvAZXjc
okIHAP9anLz1QDyMLH12ASuHjgBc0Of3jcB6NB97IWGpL4O21gEA46ohaD+vcJuC
YkBLU3lXqQ87nfu28ExFAzh10hG2jwM=
=m4pB
-----END PGP SIGNATURE-----
Merge tag 'vfs-6.7.ctime' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs
Pull vfs inode time accessor updates from Christian Brauner:
"This finishes the conversion of all inode time fields to accessor
functions as discussed on list. Changing timestamps manually as we
used to do before is error prone. Using accessors function makes this
robust.
It does not contain the switch of the time fields to discrete 64 bit
integers to replace struct timespec and free up space in struct inode.
But after this, the switch can be trivially made and the patch should
only affect the vfs if we decide to do it"
* tag 'vfs-6.7.ctime' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs: (86 commits)
fs: rename inode i_atime and i_mtime fields
security: convert to new timestamp accessors
selinux: convert to new timestamp accessors
apparmor: convert to new timestamp accessors
sunrpc: convert to new timestamp accessors
mm: convert to new timestamp accessors
bpf: convert to new timestamp accessors
ipc: convert to new timestamp accessors
linux: convert to new timestamp accessors
zonefs: convert to new timestamp accessors
xfs: convert to new timestamp accessors
vboxsf: convert to new timestamp accessors
ufs: convert to new timestamp accessors
udf: convert to new timestamp accessors
ubifs: convert to new timestamp accessors
tracefs: convert to new timestamp accessors
sysv: convert to new timestamp accessors
squashfs: convert to new timestamp accessors
server: convert to new timestamp accessors
client: convert to new timestamp accessors
...
The type of timespec64::tv_nsec is 'unsigned long', while we have only
u32 for on-disk and in-memory. This wastes a few bytes in btrfs_inode.
Add separate members for sec and nsec with the corresponding type width.
This creates a 4 byte hole in btrfs_inode which can be utilized in the
future.
Signed-off-by: David Sterba <dsterba@suse.com>
The root argument for btrfs_delayed_update_inode() always matches the root
of the given inode, so remove the root argument and get it from the inode
argument.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There are two callbacks defined in btrfs_work but only two actually make
use of them, otherwise there are NULLs. We can get rid of the freeing
callback making it a special case of the normal work. This reduces the
size of btrfs_work by 8 bytes, final layout:
struct btrfs_work {
btrfs_func_t func; /* 0 8 */
btrfs_ordered_func_t ordered_func; /* 8 8 */
struct work_struct normal_work; /* 16 32 */
struct list_head ordered_list; /* 48 16 */
/* --- cacheline 1 boundary (64 bytes) --- */
struct btrfs_workqueue * wq; /* 64 8 */
long unsigned int flags; /* 72 8 */
/* size: 80, cachelines: 2, members: 6 */
/* last cacheline: 16 bytes */
};
This in turn reduces size of other structures (on a release config):
- async_chunk 160 -> 152
- async_submit_bio 152 -> 144
- btrfs_async_delayed_work 104 -> 96
- btrfs_caching_control 176 -> 168
- btrfs_delalloc_work 144 -> 136
- btrfs_fs_info 3608 -> 3600
- btrfs_ordered_extent 440 -> 424
- btrfs_writepage_fixup 104 -> 96
Signed-off-by: David Sterba <dsterba@suse.com>
When marking an extent buffer as dirty, at btrfs_mark_buffer_dirty(),
we check if its generation matches the running transaction and if not we
just print a warning. Such mismatch is an indicator that something really
went wrong and only printing a warning message (and stack trace) is not
enough to prevent a corruption. Allowing a transaction to commit with such
an extent buffer will trigger an error if we ever try to read it from disk
due to a generation mismatch with its parent generation.
So abort the current transaction with -EUCLEAN if we notice a generation
mismatch. For this we need to pass a transaction handle to
btrfs_mark_buffer_dirty() which is always available except in test code,
in which case we can pass NULL since it operates on dummy extent buffers
and all test roots have a single node/leaf (root node at level 0).
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Function name in the comment does not bring much value to code not
exposed as API and we don't stick to the kdoc format anymore. Update
formatting of parameter descriptions.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The second comment at btrfs_delayed_item_reserve_metadata() refers to a
field named "index_items_size" of a delayed inode, however that field
does not exists - it existed in a previous patch version, but then it
split into the fields "curr_index_batch_size" and "index_item_leaves"
in the final patch version that was picked. So update the comment.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmUmbQMACgkQxWXV+ddt
WDtBshAAqOwMrqRwOKOze/LQ4Kl9A8p0l+XxYdt7nRSY7n15xpN6uLVsc0gTwO5n
HOquDe2ivrpdOXI6ArcujTTFHaBGX+mmubU/yi54MH0iwuCR32dYhj3j7mDUIf6F
GpTEjgxIdE4AMUw7e7Rzqbdcmq//+H+bBdm+2YkNNEBmPP06483GYthjKJ7zWdrn
pPksR9f611aHU4jZnKZJeHgZh4iVrIszIxkjeMD5NJ6KUb8LJmISLOOJzowkmugt
JH8bd1F/+/53MmpntWGnHnURI9J6UxBL0cNnYW26FjY21N3RGR2BumotW73hYaD7
6fwuxs4ZWlLqHUtIOaAVUUSfEVse7k/i7m4+sDB1JLh26alqUHunqCFV+3ROTnOY
jHwWW+qyQhxJnfgtHyDrwcybfW0V41hhmDIhoeezkSDtbnacNTMfwzXS2ELcp0KJ
/13TCruweFN0g4lBR8HfbKJCCzPayxCirtubx1nIMRysHfo10aDWz1MSvr3mkOyo
gwif/j9BMKN0+fg6l9eZNHWHfQ8qfL3dvSRBlvJcP5mnG5ZuVkxJUFH0m/UfdFbZ
sbeJHSP9wex5tJKmG3kJPAuZWwGLHCiMMCnsWoq+02KV8IXrw3Ji5z/8Hhsb51Ps
r7BGRO2A2rD9XLJtc9BCiwiV177/WknmTUtRpOyxHFfb37bKmHg=
=Wz/9
-----END PGP SIGNATURE-----
Merge tag 'for-6.6-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
"A revert of recent mount option parsing fix, this breaks mounts with
security options.
The second patch is a flexible array annotation"
* tag 'for-6.6-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: add __counted_by for struct btrfs_delayed_item and use struct_size()
Revert "btrfs: reject unknown mount options early"
Prepare for the coming implementation by GCC and Clang of the __counted_by
attribute. Flexible array members annotated with __counted_by can have
their accesses bounds-checked at run-time via CONFIG_UBSAN_BOUNDS (for
array indexing) and CONFIG_FORTIFY_SOURCE (for strcpy/memcpy-family
functions).
While there, use struct_size() helper, instead of the open-coded
version, to calculate the size for the allocation of the whole
flexible structure, including of course, the flexible-array member.
This code was found with the help of Coccinelle, and audited and
fixed manually.
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmT/hwAACgkQxWXV+ddt
WDsn7hAAngwEMKEAH9Jvu/BtHgRYcAdsGh5Mxw34aQf1+DAaH03GGsZjN6hfHYo4
FMsnnvoZD5VPfuaFaQVd+mS9mRzikm503W7KfZFAPAQTOjz50RZbohLnZWa3eFbI
46OcpoHusxwoYosEmIAt+dcw/gDlT9fpj+W11dKYtwOEjCqGA/OeKoVenfk38hVJ
r+XhLwZFf4dPIqE3Ht26UtJk87Xs2X0/LQxOX3vM1MZ+l38N4dyo7TQnwfTHlQNw
AK9sK6vp3rpRR96rvTV1dWr9lnmE7wky+Vh36DN/jxpzbW7Wx8IVoobBpcsO4Tyk
Vw/rdjB7g7LfBmjLFhWvvQ73jv0WjIUUzXH17RuxOeyAQJ9tXFztVMh+QoVVC/Ka
NxwA5uqyJKR7DIA+kLL06abUnASUVgP6Krdv9Fk7rYCKWluWk1k9ls9XaFFhytvg
eeno/UB0px1rwps5P5zfaSXLIXEl53Luy5rFhTMCCNQfXyo+Qe6PJyTafR3E0uP8
aXJV1lPG+o7qi9Vwg+20yy//1sE5gR0dLrcTaup3/20RK6eljZ/bNSkl3GJR9mlS
YF+J/Ccia06y8Qo0xaeCofxkoI3J/PK6KPOTt8yZDgYoetYgHhrfBRO0I7ZU4Edq
10512hAeskzPt6+5348+/jOEENASffXKP3FJSdDEzWd33vtlaHE=
=mHTa
-----END PGP SIGNATURE-----
Merge tag 'for-6.6-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
- several fixes for handling directory item (inserting, removing,
iteration, error handling)
- fix transaction commit stalls when auto relocation is running and
blocks other tasks that want to commit
- fix a build error when DEBUG is enabled
- fix lockdep warning in inode number lookup ioctl
- fix race when finishing block group creation
- remove link to obsolete wiki in several files
* tag 'for-6.6-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
MAINTAINERS: remove links to obsolete btrfs.wiki.kernel.org
btrfs: assert delayed node locked when removing delayed item
btrfs: remove BUG() after failure to insert delayed dir index item
btrfs: improve error message after failure to add delayed dir index item
btrfs: fix a compilation error if DEBUG is defined in btree_dirty_folio
btrfs: check for BTRFS_FS_ERROR in pending ordered assert
btrfs: fix lockdep splat and potential deadlock after failure running delayed items
btrfs: do not block starts waiting on previous transaction commit
btrfs: release path before inode lookup during the ino lookup ioctl
btrfs: fix race between finishing block group creation and its item update
When removing a delayed item, or releasing which will remove it as well,
we will modify one of the delayed node's rbtrees and item counter if the
delayed item is in one of the rbtrees. This require having the delayed
node's mutex locked, otherwise we will race with other tasks modifying
the rbtrees and the counter.
This is motivated by a previous version of another patch actually calling
btrfs_release_delayed_item() after unlocking the delayed node's mutex and
against a delayed item that is in a rbtree.
So assert at __btrfs_remove_delayed_item() that the delayed node's mutex
is locked.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Instead of calling BUG() when we fail to insert a delayed dir index item
into the delayed node's tree, we can just release all the resources we
have allocated/acquired before and return the error to the caller. This is
fine because all existing call chains undo anything they have done before
calling btrfs_insert_delayed_dir_index() or BUG_ON (when creating pending
snapshots in the transaction commit path).
So remove the BUG() call and do proper error handling.
This relates to a syzbot report linked below, but does not fix it because
it only prevents hitting a BUG(), it does not fix the issue where somehow
we attempt to use twice the same index number for different index items.
Link: https://lore.kernel.org/linux-btrfs/00000000000036e1290603e097e0@google.com/
CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If we fail to add a delayed dir index item because there's already another
item with the same index number, we print an error message (and then BUG).
However that message isn't very helpful to debug anything because we don't
know what's the index number and what are the values of index counters in
the inode and its delayed inode (index_cnt fields of struct btrfs_inode
and struct btrfs_delayed_node).
So update the error message to include the index number and counters.
We actually had a recent case where this issue was hit by a syzbot report
(see the link below).
Link: https://lore.kernel.org/linux-btrfs/00000000000036e1290603e097e0@google.com/
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmTskOwACgkQxWXV+ddt
WDsNJw/8CCi41Z7e3LdJsQd2iy3/+oJZUvIGuT5YvshYxTLCbV7AL+diBPnSQs4Q
/KFMGL7RZBgJzwVoSQtXnESXXgX8VOVfN1zY//k5g6z7BscCEQd73H/M0B8ciZy/
aBygm9tJ7EtWbGZWNR8yad8YtOgl6xoClrPnJK/DCLwMGPy2o+fnKP3Y9FOKY5KM
1Sl0Y4FlJ9dTJpxIwYbx4xmuyHrh2OivjU/KnS9SzQlHu0nl6zsIAE45eKem2/EG
1figY5aFBYPpPYfopbLDalEBR3bQGiViZVJuNEop3AimdcMOXw9jBF3EZYUb5Tgn
MleMDgmmjLGOE/txGhvTxKj9kci2aGX+fJn3jXbcIMksAA0OQFLPqzGvEQcrs6Ok
HA0RsmAkS5fWNDCuuo4ZPXEyUPvluTQizkwyoulOfnK+UPJCWaRqbEBMTsvm6M6X
wFT2czwLpaEU/W6loIZkISUhfbRqVoA3DfHy398QXNzRhSrg8fQJjma1f7mrHvTi
CzU+OD5YSC2nXktVOnklyTr0XT+7HF69cumlDbr8TS8u1qu8n1keU/7M3MBB4xZk
BZFJDz8pnsAqpwVA4T434E/w45MDnYlwBw5r+U8Xjyso8xlau+sYXKcim85vT2Q0
yx/L91P6tdekR1y97p4aDdxw/PgTzdkNGMnsTBMVzgtCj+5pMmE=
=N7Yn
-----END PGP SIGNATURE-----
Merge tag 'for-6.6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs updates from David Sterba:
"No new features, the bulk of the changes are fixes, refactoring and
cleanups. The notable fix is the scrub performance restoration after
rewrite in 6.4, though still only partial.
Fixes:
- scrub performance drop due to rewrite in 6.4 partially restored:
- do IO grouping by blg_plug/blk_unplug again
- avoid unnecessary tree searches when processing stripes, in
extent and checksum trees
- the drop is noticeable on fast PCIe devices, -66% and restored
to -33% of the original
- backports to 6.4 planned
- handle more corner cases of transaction commit during orphan
cleanup or delayed ref processing
- use correct fsid/metadata_uuid when validating super block
- copy directory permissions and time when creating a stub subvolume
Core:
- debugging feature integrity checker deprecated, to be removed in
6.7
- in zoned mode, zones are activated just before the write, making
error handling easier, now the overcommit mechanism can be enabled
again which improves performance by avoiding more frequent flushing
- v0 extent handling completely removed, deprecated long time ago
- error handling improvements
- tests:
- extent buffer bitmap tests
- pinned extent splitting tests
- cleanups and refactoring:
- compression writeback
- extent buffer bitmap
- space flushing, ENOSPC handling"
* tag 'for-6.6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (110 commits)
btrfs: zoned: skip splitting and logical rewriting on pre-alloc write
btrfs: tests: test invalid splitting when skipping pinned drop extent_map
btrfs: tests: add a test for btrfs_add_extent_mapping
btrfs: tests: add extent_map tests for dropping with odd layouts
btrfs: scrub: move write back of repaired sectors to scrub_stripe_read_repair_worker()
btrfs: scrub: don't go ordered workqueue for dev-replace
btrfs: scrub: fix grouping of read IO
btrfs: scrub: avoid unnecessary csum tree search preparing stripes
btrfs: scrub: avoid unnecessary extent tree search preparing stripes
btrfs: copy dir permission and time when creating a stub subvolume
btrfs: remove pointless empty list check when reading delayed dir indexes
btrfs: drop redundant check to use fs_devices::metadata_uuid
btrfs: compare the correct fsid/metadata_uuid in btrfs_validate_super
btrfs: use the correct superblock to compare fsid in btrfs_validate_super
btrfs: simplify memcpy either of metadata_uuid or fsid
btrfs: add a helper to read the superblock metadata_uuid
btrfs: remove v0 extent handling
btrfs: output extra debug info if we failed to find an inline backref
btrfs: move the !zoned assert into run_delalloc_cow
btrfs: consolidate the error handling in run_delalloc_nocow
...
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZOXTKAAKCRCRxhvAZXjc
oifJAQCzi/p+AdQu8LA/0XvR7fTwaq64ZDCibU4BISuLGT2kEgEAuGbuoFZa0rs2
XYD/s4+gi64p9Z01MmXm2XO1pu3GPg0=
=eJz5
-----END PGP SIGNATURE-----
Merge tag 'v6.6-vfs.ctime' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs timestamp updates from Christian Brauner:
"This adds VFS support for multi-grain timestamps and converts tmpfs,
xfs, ext4, and btrfs to use them. This carries acks from all relevant
filesystems.
The VFS always uses coarse-grained timestamps when updating the ctime
and mtime after a change. This has the benefit of allowing filesystems
to optimize away a lot of metadata updates, down to around 1 per
jiffy, even when a file is under heavy writes.
Unfortunately, this has always been an issue when we're exporting via
NFSv3, which relies on timestamps to validate caches. A lot of changes
can happen in a jiffy, so timestamps aren't sufficient to help the
client decide to invalidate the cache.
Even with NFSv4, a lot of exported filesystems don't properly support
a change attribute and are subject to the same problems with timestamp
granularity. Other applications have similar issues with timestamps
(e.g., backup applications).
If we were to always use fine-grained timestamps, that would improve
the situation, but that becomes rather expensive, as the underlying
filesystem would have to log a lot more metadata updates.
This introduces fine-grained timestamps that are used when they are
actively queried.
This uses the 31st bit of the ctime tv_nsec field to indicate that
something has queried the inode for the mtime or ctime. When this flag
is set, on the next mtime or ctime update, the kernel will fetch a
fine-grained timestamp instead of the usual coarse-grained one.
As POSIX generally mandates that when the mtime changes, the ctime
must also change the kernel always stores normalized ctime values, so
only the first 30 bits of the tv_nsec field are ever used.
Filesytems can opt into this behavior by setting the FS_MGTIME flag in
the fstype. Filesystems that don't set this flag will continue to use
coarse-grained timestamps.
Various preparatory changes, fixes and cleanups are included:
- Fixup all relevant places where POSIX requires updating ctime
together with mtime. This is a wide-range of places and all
maintainers provided necessary Acks.
- Add new accessors for inode->i_ctime directly and change all
callers to rely on them. Plain accesses to inode->i_ctime are now
gone and it is accordingly rename to inode->__i_ctime and commented
as requiring accessors.
- Extend generic_fillattr() to pass in a request mask mirroring in a
sense the statx() uapi. This allows callers to pass in a request
mask to only get a subset of attributes filled in.
- Rework timestamp updates so it's possible to drop the @now
parameter the update_time() inode operation and associated helpers.
- Add inode_update_timestamps() and convert all filesystems to it
removing a bunch of open-coding"
* tag 'v6.6-vfs.ctime' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (107 commits)
btrfs: convert to multigrain timestamps
ext4: switch to multigrain timestamps
xfs: switch to multigrain timestamps
tmpfs: add support for multigrain timestamps
fs: add infrastructure for multigrain timestamps
fs: drop the timespec64 argument from update_time
xfs: have xfs_vn_update_time gets its own timestamp
fat: make fat_update_time get its own timestamp
fat: remove i_version handling from fat_update_time
ubifs: have ubifs_update_time use inode_update_timestamps
btrfs: have it use inode_update_timestamps
fs: drop the timespec64 arg from generic_update_time
fs: pass the request_mask to generic_fillattr
fs: remove silly warning from current_time
gfs2: fix timestamp handling on quota inodes
fs: rename i_ctime field to __i_ctime
selinux: convert to ctime accessor functions
security: convert to ctime accessor functions
apparmor: convert to ctime accessor functions
sunrpc: convert to ctime accessor functions
...
At btrfs_readdir_delayed_dir_index(), called when reading a directory, we
have this check for an empty list to return immediately, but it's not
needed since list_for_each_entry_safe(), called immediately after, is
prepared to deal with an empty list, it simply does nothing. So remove
the empty list check.
Besides shorter source code, it also slightly reduces the binary text
size:
Before this change:
$ size fs/btrfs/btrfs.ko
text data bss dec hex filename
1609408 167269 16864 1793541 1b5e05 fs/btrfs/btrfs.ko
After this change:
$ size fs/btrfs/btrfs.ko
text data bss dec hex filename
1609392 167269 16864 1793525 1b5df5 fs/btrfs/btrfs.ko
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The readdir implementation currently processes always up to the last index
it finds. This however can result in an infinite loop if the directory has
a large number of entries such that they won't all fit in the given buffer
passed to the readdir callback, that is, dir_emit() returns a non-zero
value. Because in that case readdir() will be called again and if in the
meanwhile new directory entries were added and we still can't put all the
remaining entries in the buffer, we keep repeating this over and over.
The following C program and test script reproduce the problem:
$ cat /mnt/readdir_prog.c
#include <sys/types.h>
#include <dirent.h>
#include <stdio.h>
int main(int argc, char *argv[])
{
DIR *dir = opendir(".");
struct dirent *dd;
while ((dd = readdir(dir))) {
printf("%s\n", dd->d_name);
rename(dd->d_name, "TEMPFILE");
rename("TEMPFILE", dd->d_name);
}
closedir(dir);
}
$ gcc -o /mnt/readdir_prog /mnt/readdir_prog.c
$ cat test.sh
#!/bin/bash
DEV=/dev/sdi
MNT=/mnt/sdi
mkfs.btrfs -f $DEV &> /dev/null
#mkfs.xfs -f $DEV &> /dev/null
#mkfs.ext4 -F $DEV &> /dev/null
mount $DEV $MNT
mkdir $MNT/testdir
for ((i = 1; i <= 2000; i++)); do
echo -n > $MNT/testdir/file_$i
done
cd $MNT/testdir
/mnt/readdir_prog
cd /mnt
umount $MNT
This behaviour is surprising to applications and it's unlike ext4, xfs,
tmpfs, vfat and other filesystems, which always finish. In this case where
new entries were added due to renames, some file names may be reported
more than once, but this varies according to each filesystem - for example
ext4 never reported the same file more than once while xfs reports the
first 13 file names twice.
So change our readdir implementation to track the last index number when
opendir() is called and then make readdir() never process beyond that
index number. This gives the same behaviour as ext4.
Reported-by: Rob Landley <rob@landley.net>
Link: https://lore.kernel.org/linux-btrfs/2c8c55ec-04c6-e0dc-9c5c-8c7924778c35@landley.net/
Link: https://bugzilla.kernel.org/show_bug.cgi?id=217681
CC: stable@vger.kernel.org # 6.4+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In later patches, we're going to change how the inode's ctime field is
used. Switch to using accessor functions instead of raw accesses of
inode->i_ctime.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Message-Id: <20230705190309.579783-27-jlayton@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Even if the slot is already read out, we may still need to re-balance
the tree, thus it can cause error in that btrfs_del_item() call and we
need to handle it properly.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: void0red <void0red@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The function is for internal interfaces so we should use the
btrfs_inode.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The function is for internal interfaces so we should use the
btrfs_inode.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Move these prototypes out of ctree.h and into file-item.h.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
For directories with encrypted files/filenames, we need to store a flag
indicating this fact. There's no room in other fields, so we'll need to
borrow a bit from dir_type. Since it's now a combination of type and
flags, we rename it to dir_flags to reflect its new usage.
The new flag, FT_ENCRYPTED, indicates a directory containing encrypted
data, which is orthogonal to file type; therefore, add the new
flag, and make conversion from directory type to file type strip the
flag.
As the file types almost never change we can afford to use the bits.
Actual usage will be guarded behind an incompat bit, this patch only
adds the support for later use by fscrypt.
Signed-off-by: Omar Sandoval <osandov@osandov.com>
Signed-off-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This is a large patch, but because they're all macros it's impossible to
split up. Simply copy all of the item accessors in ctree.h and paste
them in accessors.h, and then update any files to include the header so
everything compiles.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ reformat comments, style fixups ]
Signed-off-by: David Sterba <dsterba@suse.com>
We're going to use fs.h to hold fs wide related helpers and definitions,
move the FS_STATE enum and related helpers to fs.h, and then update all
files that need these definitions to include fs.h.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We have a bunch of printk helpers that are in ctree.h. These have
nothing to do with ctree.c, so move them into their own header.
Subsequent patches will cleanup the printk helpers.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This code is used in space-info.c, move the definitions to space-info.h.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When logging a directory we start by flushing all its delayed items.
That results in adding dir index items to the subvolume btree, for new
dentries, and removing dir index items from the subvolume btree for any
dentries that were deleted.
This makes it straightforward to log a directory simply by iterating over
all the modified subvolume btree leaves, especially when we used to log
both dir index keys and dir item keys (before commit 339d035424
("btrfs: only copy dir index keys when logging a directory") and when we
used to copy old dir index entries for leaves modified in the current
transaction (before commit 732d591a5d ("btrfs: stop copying old dir
items when logging a directory")).
From an efficiency point of view this has a couple of drawbacks:
1) Adds extra latency, due to copying delayed items to the subvolume btree
and deleting dir index items from the btree.
Further if there are other tasks accessing the btree, which is common
(syscalls like creat, mkdir, rename, link, unlink, truncate, reflinks,
etc, finishing an ordered extent, etc), lock contention can cause
further delays, both to the task logging a directory and to the other
tasks accessing the btree;
2) More time spent overall flushing delayed items, if after logging the
directory further changes are done to the directory in the same
transaction.
For example, if we add 10 dentries to a directory, fsync it, add more
10 dentries, fsync it again, then add more 10 dentries and fsync it
again, then we end up inserting 3 batches of 10 items to the subvolume
btree. With the changes from this patch, we flush all the delayed items
to the btree only once - a single batch of 30 items, and outside the
logging code (transaction commit or when delayed items are flushed
asynchronously).
This change simply skips the flushing of delayed items every time we log a
directory. Instead we copy the delayed insertion items directly to the log
tree and delete delayed deletion items directly from the log tree.
Therefore avoiding changing first the subvolume btree and then scanning it
for new items to copy from it to the log tree and detecting deletions
by observing gaps in consecutive dir index keys in subvolume btree leaves.
Running the following tests on a non-debug kernel (Debian's default kernel
config), on a box with a NVMe device, a 12 cores Intel CPU and 64G of ram,
produced the results below.
The results compare a branch without this patch and all the other patches
it depends on versus the same branch with the patchset applied.
The patchset is comprised of the following patches:
btrfs: don't drop dir index range items when logging a directory
btrfs: remove the root argument from log_new_dir_dentries()
btrfs: update stale comment for log_new_dir_dentries()
btrfs: free list element sooner at log_new_dir_dentries()
btrfs: avoid memory allocation at log_new_dir_dentries() for common case
btrfs: remove root argument from btrfs_delayed_item_reserve_metadata()
btrfs: store index number instead of key in struct btrfs_delayed_item
btrfs: remove unused logic when looking up delayed items
btrfs: shrink the size of struct btrfs_delayed_item
btrfs: search for last logged dir index if it's not cached in the inode
btrfs: move need_log_inode() to above log_conflicting_inodes()
btrfs: move log_new_dir_dentries() above btrfs_log_inode()
btrfs: log conflicting inodes without holding log mutex of the initial inode
btrfs: skip logging parent dir when conflicting inode is not a dir
btrfs: use delayed items when logging a directory
Custom test script for testing time spent at btrfs_log_inode():
#!/bin/bash
DEV=/dev/nvme0n1
MNT=/mnt/nvme0n1
# Total number of files to create in the test directory.
NUM_FILES=10000
# Fsync after creating or renaming N files.
FSYNC_AFTER=100
umount $DEV &> /dev/null
mkfs.btrfs -f $DEV
mount -o ssd $DEV $MNT
TEST_DIR=$MNT/testdir
mkdir $TEST_DIR
echo "Creating files..."
for ((i = 1; i <= $NUM_FILES; i++)); do
echo -n > $TEST_DIR/file_$i
if (( ($i % $FSYNC_AFTER) == 0 )); then
xfs_io -c "fsync" $TEST_DIR
fi
done
sync
echo "Renaming files..."
for ((i = 1; i <= $NUM_FILES; i++)); do
mv $TEST_DIR/file_$i $TEST_DIR/file_$i.renamed
if (( ($i % $FSYNC_AFTER) == 0 )); then
xfs_io -c "fsync" $TEST_DIR
fi
done
umount $MNT
And using the following bpftrace script to capture the total time that is
spent at btrfs_log_inode():
#!/usr/bin/bpftrace
k:btrfs_log_inode
{
@start_log_inode[tid] = nsecs;
}
kr:btrfs_log_inode
/@start_log_inode[tid]/
{
$dur = (nsecs - @start_log_inode[tid]) / 1000;
@btrfs_log_inode_total_time = sum($dur);
delete(@start_log_inode[tid]);
}
END
{
clear(@start_log_inode);
}
Result before applying patchset:
@btrfs_log_inode_total_time: 622642
Result after applying patchset:
@btrfs_log_inode_total_time: 354134 (-43.1% time spent)
The following dbench script was also used for testing:
#!/bin/bash
NUM_JOBS=$(nproc --all)
DEV=/dev/nvme0n1
MNT=/mnt/nvme0n1
MOUNT_OPTIONS="-o ssd"
MKFS_OPTIONS="-O no-holes -R free-space-tree"
echo "performance" | \
tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
umount $DEV &> /dev/null
mkfs.btrfs -f $MKFS_OPTIONS $DEV
mount $MOUNT_OPTIONS $DEV $MNT
dbench -D $MNT --skip-cleanup -t 120 -S $NUM_JOBS
umount $MNT
Before patchset:
Operation Count AvgLat MaxLat
----------------------------------------
NTCreateX 3322265 0.034 21.032
Close 2440562 0.002 0.994
Rename 140664 1.150 269.633
Unlink 670796 1.093 269.678
Deltree 96 5.481 15.510
Mkdir 48 0.004 0.052
Qpathinfo 3010924 0.014 8.127
Qfileinfo 528055 0.001 0.518
Qfsinfo 552113 0.003 0.372
Sfileinfo 270575 0.005 0.688
Find 1164176 0.052 13.931
WriteX 1658537 0.019 5.918
ReadX 5207412 0.003 1.034
LockX 10818 0.003 0.079
UnlockX 10818 0.002 0.313
Flush 232811 1.027 269.735
Throughput 869.867 MB/sec (sync dirs) 12 clients 12 procs max_latency=269.741 ms
After patchset:
Operation Count AvgLat MaxLat
----------------------------------------
NTCreateX 4152738 0.029 20.863
Close 3050770 0.002 1.119
Rename 175829 0.871 211.741
Unlink 838447 0.845 211.724
Deltree 120 4.798 14.162
Mkdir 60 0.003 0.005
Qpathinfo 3763807 0.011 4.673
Qfileinfo 660111 0.001 0.400
Qfsinfo 690141 0.003 0.429
Sfileinfo 338260 0.005 0.725
Find 1455273 0.046 6.787
WriteX 2073307 0.017 5.690
ReadX 6509193 0.003 1.171
LockX 13522 0.003 0.077
UnlockX 13522 0.002 0.125
Flush 291044 0.811 211.631
Throughput 1089.27 MB/sec (sync dirs) 12 clients 12 procs max_latency=211.750 ms
(+25.2% throughput, -21.5% max latency)
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently struct btrfs_delayed_item has a base size of 96 bytes, but its
size can be decreased by doing the following 2 tweaks:
1) Change data_len from u32 to u16. Our maximum possible leaf size is 64K,
so the data_len can never be larger than that, and in fact it is always
much smaller than that. The max length for a dentry's name is ensured
at the VFS level (PATH_MAX, 4096 bytes) and in struct btrfs_inode_ref
and btrfs_dir_item we use a u16 to store the name's length;
2) Change 'ins_or_del' to a 1 bit enum, which is all we need since it
can only have 2 values. After this there's also no longer the need to
BUG_ON() before using 'ins_or_del' in several places. Also rename the
field from 'ins_or_del' to 'type', which is more clear.
These two tweaks decrease the size of struct btrfs_delayed_item from 96
bytes down to 88 bytes. A previous patch already reduced the size of this
structure by 16 bytes, but an upcoming change will increase its size by
16 bytes (adding a struct list_head element).
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
All callers pass NULL to the 'prev' and 'next' arguments of the function
__btrfs_lookup_delayed_item(), so remove these arguments. Also, remove
the unnecessary wrapper __btrfs_lookup_delayed_insertion_item(), making
btrfs_delete_delayed_insertion_item() directly call
__btrfs_lookup_delayed_item().
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
All delayed items are for dir index keys, so there's really no point of
having an embedded struct btrfs_key in struct btrfs_delayed_item, which
makes the structure use more space than necessary (and adds a hole of 7
bytes).
So replace the key field with an index number (u64), which reduces the
size of struct btrfs_delayed_item from 112 bytes down to 96 bytes.
Some upcoming work will increase the structure size by 16 bytes, so this
change compensates for that future size increase.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The root argument of btrfs_delayed_item_reserve_metadata() is used only
to get the fs_info object, but we already have a transaction handle, which
we can use to get the fs_info. So remove the root argument.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
With Filipe's recent rework of the delayed inode code one aspect which
isn't batched is the release of the reserved metadata of delayed inode's
delete items. With this patch on top of Filipe's rework and running the
same test as provided in the description of a patch titled
"btrfs: improve batch deletion of delayed dir index items" I observe
the following change of the number of calls to btrfs_block_rsv_release:
Before this change:
- block_rsv_release: 1004
- btrfs_delete_delayed_items_total_time: 14602
- delete_batches: 505
After:
- block_rsv_release: 510
- btrfs_delete_delayed_items_total_time: 13643
- delete_batches: 507
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>