2019-06-20 15:37:45 -04:00
|
|
|
// SPDX-License-Identifier: GPL-2.0
|
|
|
|
|
btrfs: introduce size class to block group allocator
The aim of this patch is to reduce the fragmentation of block groups
under certain unhappy workloads. It is particularly effective when the
size of extents correlates with their lifetime, which is something we
have observed causing fragmentation in the fleet at Meta.
This patch categorizes extents into size classes:
- x < 128KiB: "small"
- 128KiB < x < 8MiB: "medium"
- x > 8MiB: "large"
and as much as possible reduces allocations of extents into block groups
that don't match the size class. This takes advantage of any (possible)
correlation between size and lifetime and also leaves behind predictable
re-usable gaps when extents are freed; small writes don't gum up bigger
holes.
Size classes are implemented in the following way:
- Mark each new block group with a size class of the first allocation
that goes into it.
- Add two new passes to ffe: "unset size class" and "wrong size class".
First, try only matching block groups, then try unset ones, then allow
allocation of new ones, and finally allow mismatched block groups.
- Filtering is done just by skipping inappropriate ones, there is no
special size class indexing.
Other solutions I considered were:
- A best fit allocator with an rb-tree. This worked well, as small
writes didn't leak big holes from large freed extents, but led to
regressions in ffe and write performance due to lock contention on
the rb-tree with every allocation possibly updating it in parallel.
Perhaps something clever could be done to do the updates in the
background while being "right enough".
- A fixed size "working set". This prevents freeing an extent
drastically changing where writes currently land, and seems like a
good option too. Doesn't take advantage of size in any way.
- The same size class idea, but implemented with xarray marks. This
turned out to be slower than looping the linked list and skipping
wrong block groups, and is also less flexible since we must have only
3 size classes (max #marks). With the current approach we can have as
many as we like.
Performance testing was done via: https://github.com/josefbacik/fsperf
Of particular relevance are the new fragmentation specific tests.
A brief summary of the testing results:
- Neutral results on existing tests. There are some minor regressions
and improvements here and there, but nothing that truly stands out as
notable.
- Improvement on new tests where size class and extent lifetime are
correlated. Fragmentation in these cases is completely eliminated
and write performance is generally a little better. There is also
significant improvement where extent sizes are just a bit larger than
the size class boundaries.
- Regression on one new tests: where the allocations are sized
intentionally a hair under the borders of the size classes. Results
are neutral on the test that intentionally attacks this new scheme by
mixing extent size and lifetime.
The full dump of the performance results can be found here:
https://bur.io/fsperf/size-class-2022-11-15.txt
(there are ANSI escape codes, so best to curl and view in terminal)
Here is a snippet from the full results for a new test which mixes
buffered writes appending to a long lived set of files and large short
lived fallocates:
bufferedappendvsfallocate results
metric baseline current stdev diff
======================================================================================
avg_commit_ms 31.13 29.20 2.67 -6.22%
bg_count 14 15.60 0 11.43%
commits 11.10 12.20 0.32 9.91%
elapsed 27.30 26.40 2.98 -3.30%
end_state_mount_ns 11122551.90 10635118.90 851143.04 -4.38%
end_state_umount_ns 1.36e+09 1.35e+09 12248056.65 -1.07%
find_free_extent_calls 116244.30 114354.30 964.56 -1.63%
find_free_extent_ns_max 599507.20 1047168.20 103337.08 74.67%
find_free_extent_ns_mean 3607.19 3672.11 101.20 1.80%
find_free_extent_ns_min 500 512 6.67 2.40%
find_free_extent_ns_p50 2848 2876 37.65 0.98%
find_free_extent_ns_p95 4916 5000 75.45 1.71%
find_free_extent_ns_p99 20734.49 20920.48 1670.93 0.90%
frag_pct_max 61.67 0 8.05 -100.00%
frag_pct_mean 43.59 0 6.10 -100.00%
frag_pct_min 25.91 0 16.60 -100.00%
frag_pct_p50 42.53 0 7.25 -100.00%
frag_pct_p95 61.67 0 8.05 -100.00%
frag_pct_p99 61.67 0 8.05 -100.00%
fragmented_bg_count 6.10 0 1.45 -100.00%
max_commit_ms 49.80 46 5.37 -7.63%
sys_cpu 2.59 2.62 0.29 1.39%
write_bw_bytes 1.62e+08 1.68e+08 17975843.50 3.23%
write_clat_ns_mean 57426.39 54475.95 2292.72 -5.14%
write_clat_ns_p50 46950.40 42905.60 2101.35 -8.62%
write_clat_ns_p99 148070.40 143769.60 2115.17 -2.90%
write_io_kbytes 4194304 4194304 0 0.00%
write_iops 2476.15 2556.10 274.29 3.23%
write_lat_ns_max 2101667.60 2251129.50 370556.59 7.11%
write_lat_ns_mean 59374.91 55682.00 2523.09 -6.22%
write_lat_ns_min 17353.10 16250 1646.08 -6.36%
There are some mixed improvements/regressions in most metrics along with
an elimination of fragmentation in this workload.
On the balance, the drastic 1->0 improvement in the happy cases seems
worth the mix of regressions and improvements we do observe.
Some considerations for future work:
- Experimenting with more size classes
- More hinting/search ordering work to approximate a best-fit allocator
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-15 16:06:33 -08:00
|
|
|
#include <linux/sizes.h>
|
2021-10-14 18:39:02 +09:00
|
|
|
#include <linux/list_sort.h>
|
2019-08-21 18:54:28 +02:00
|
|
|
#include "misc.h"
|
2019-06-20 15:37:45 -04:00
|
|
|
#include "ctree.h"
|
|
|
|
#include "block-group.h"
|
2019-06-20 15:37:47 -04:00
|
|
|
#include "space-info.h"
|
2019-08-06 16:43:19 +02:00
|
|
|
#include "disk-io.h"
|
|
|
|
#include "free-space-cache.h"
|
|
|
|
#include "free-space-tree.h"
|
2019-06-20 15:37:55 -04:00
|
|
|
#include "volumes.h"
|
|
|
|
#include "transaction.h"
|
|
|
|
#include "ref-verify.h"
|
2019-06-20 15:37:57 -04:00
|
|
|
#include "sysfs.h"
|
|
|
|
#include "tree-log.h"
|
2019-06-20 15:38:00 -04:00
|
|
|
#include "delalloc-space.h"
|
2019-12-13 16:22:14 -08:00
|
|
|
#include "discard.h"
|
2019-12-10 19:57:51 +02:00
|
|
|
#include "raid56.h"
|
2021-02-04 19:21:50 +09:00
|
|
|
#include "zoned.h"
|
2022-10-19 10:50:47 -04:00
|
|
|
#include "fs.h"
|
2022-10-19 10:51:00 -04:00
|
|
|
#include "accessors.h"
|
2022-10-24 14:46:57 -04:00
|
|
|
#include "extent-tree.h"
|
2019-06-20 15:37:45 -04:00
|
|
|
|
2022-09-14 11:06:34 -04:00
|
|
|
#ifdef CONFIG_BTRFS_DEBUG
|
2024-06-26 23:39:11 +02:00
|
|
|
int btrfs_should_fragment_free_space(const struct btrfs_block_group *block_group)
|
2022-09-14 11:06:34 -04:00
|
|
|
{
|
|
|
|
struct btrfs_fs_info *fs_info = block_group->fs_info;
|
|
|
|
|
|
|
|
return (btrfs_test_opt(fs_info, FRAGMENT_METADATA) &&
|
|
|
|
block_group->flags & BTRFS_BLOCK_GROUP_METADATA) ||
|
|
|
|
(btrfs_test_opt(fs_info, FRAGMENT_DATA) &&
|
|
|
|
block_group->flags & BTRFS_BLOCK_GROUP_DATA);
|
|
|
|
}
|
|
|
|
#endif
|
|
|
|
|
2019-06-20 15:38:05 -04:00
|
|
|
/*
|
|
|
|
* Return target flags in extended format or 0 if restripe for this chunk_type
|
|
|
|
* is not in progress
|
|
|
|
*
|
|
|
|
* Should be called with balance_lock held
|
|
|
|
*/
|
2024-06-26 23:39:11 +02:00
|
|
|
static u64 get_restripe_target(const struct btrfs_fs_info *fs_info, u64 flags)
|
2019-06-20 15:38:05 -04:00
|
|
|
{
|
2024-06-26 23:39:11 +02:00
|
|
|
const struct btrfs_balance_control *bctl = fs_info->balance_ctl;
|
2019-06-20 15:38:05 -04:00
|
|
|
u64 target = 0;
|
|
|
|
|
|
|
|
if (!bctl)
|
|
|
|
return 0;
|
|
|
|
|
|
|
|
if (flags & BTRFS_BLOCK_GROUP_DATA &&
|
|
|
|
bctl->data.flags & BTRFS_BALANCE_ARGS_CONVERT) {
|
|
|
|
target = BTRFS_BLOCK_GROUP_DATA | bctl->data.target;
|
|
|
|
} else if (flags & BTRFS_BLOCK_GROUP_SYSTEM &&
|
|
|
|
bctl->sys.flags & BTRFS_BALANCE_ARGS_CONVERT) {
|
|
|
|
target = BTRFS_BLOCK_GROUP_SYSTEM | bctl->sys.target;
|
|
|
|
} else if (flags & BTRFS_BLOCK_GROUP_METADATA &&
|
|
|
|
bctl->meta.flags & BTRFS_BALANCE_ARGS_CONVERT) {
|
|
|
|
target = BTRFS_BLOCK_GROUP_METADATA | bctl->meta.target;
|
|
|
|
}
|
|
|
|
|
|
|
|
return target;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* @flags: available profiles in extended format (see ctree.h)
|
|
|
|
*
|
|
|
|
* Return reduced profile in chunk format. If profile changing is in progress
|
|
|
|
* (either running or paused) picks the target profile (if it's already
|
|
|
|
* available), otherwise falls back to plain reducing.
|
|
|
|
*/
|
|
|
|
static u64 btrfs_reduce_alloc_profile(struct btrfs_fs_info *fs_info, u64 flags)
|
|
|
|
{
|
|
|
|
u64 num_devices = fs_info->fs_devices->rw_devices;
|
|
|
|
u64 target;
|
|
|
|
u64 raid_type;
|
|
|
|
u64 allowed = 0;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* See if restripe for this chunk_type is in progress, if so try to
|
|
|
|
* reduce to the target profile
|
|
|
|
*/
|
|
|
|
spin_lock(&fs_info->balance_lock);
|
2019-06-20 15:38:07 -04:00
|
|
|
target = get_restripe_target(fs_info, flags);
|
2019-06-20 15:38:05 -04:00
|
|
|
if (target) {
|
2020-07-21 10:48:46 -04:00
|
|
|
spin_unlock(&fs_info->balance_lock);
|
|
|
|
return extended_to_chunk(target);
|
2019-06-20 15:38:05 -04:00
|
|
|
}
|
|
|
|
spin_unlock(&fs_info->balance_lock);
|
|
|
|
|
|
|
|
/* First, mask out the RAID levels which aren't possible */
|
|
|
|
for (raid_type = 0; raid_type < BTRFS_NR_RAID_TYPES; raid_type++) {
|
|
|
|
if (num_devices >= btrfs_raid_array[raid_type].devs_min)
|
|
|
|
allowed |= btrfs_raid_array[raid_type].bg_flag;
|
|
|
|
}
|
|
|
|
allowed &= flags;
|
|
|
|
|
btrfs: add handling for RAID1C23/DUP to btrfs_reduce_alloc_profile
Callers of `btrfs_reduce_alloc_profile` expect it to return exactly
one allocation profile flag, and failing to do so may ultimately
result in a WARN_ON and remount-ro when allocating new blocks, like
the below transaction abort on 6.1.
`btrfs_reduce_alloc_profile` has two ways of determining the profile,
first it checks if a conversion balance is currently running and
uses the profile we're converting to. If no balance is currently
running, it returns the max-redundancy profile which at least one
block in the selected block group has.
This works by simply checking each known allocation profile bit in
redundancy order. However, `btrfs_reduce_alloc_profile` has not been
updated as new flags have been added - first with the `DUP` profile
and later with the RAID1C34 profiles.
Because of the way it checks, if we have blocks with different
profiles and at least one is known, that profile will be selected.
However, if none are known we may return a flag set with multiple
allocation profiles set.
This is currently only possible when a balance from one of the three
unhandled profiles to another of the unhandled profiles is canceled
after allocating at least one block using the new profile.
In that case, a transaction abort like the below will occur and the
filesystem will need to be mounted with -o skip_balance to get it
mounted rw again (but the balance cannot be resumed without a
similar abort).
[770.648] ------------[ cut here ]------------
[770.648] BTRFS: Transaction aborted (error -22)
[770.648] WARNING: CPU: 43 PID: 1159593 at fs/btrfs/extent-tree.c:4122 find_free_extent+0x1d94/0x1e00 [btrfs]
[770.648] CPU: 43 PID: 1159593 Comm: btrfs Tainted: G W 6.1.0-0.deb11.7-powerpc64le #1 Debian 6.1.20-2~bpo11+1a~test
[770.648] Hardware name: T2P9D01 REV 1.00 POWER9 0x4e1202 opal:skiboot-bc106a0 PowerNV
[770.648] NIP: c00800000f6784fc LR: c00800000f6784f8 CTR: c000000000d746c0
[770.648] REGS: c000200089afe9a0 TRAP: 0700 Tainted: G W (6.1.0-0.deb11.7-powerpc64le Debian 6.1.20-2~bpo11+1a~test)
[770.648] MSR: 9000000002029033 <SF,HV,VEC,EE,ME,IR,DR,RI,LE> CR: 28848282 XER: 20040000
[770.648] CFAR: c000000000135110 IRQMASK: 0
GPR00: c00800000f6784f8 c000200089afec40 c00800000f7ea800 0000000000000026
GPR04: 00000001004820c2 c000200089afea00 c000200089afe9f8 0000000000000027
GPR08: c000200ffbfe7f98 c000000002127f90 ffffffffffffffd8 0000000026d6a6e8
GPR12: 0000000028848282 c000200fff7f3800 5deadbeef0000122 c00000002269d000
GPR16: c0002008c7797c40 c000200089afef17 0000000000000000 0000000000000000
GPR20: 0000000000000000 0000000000000001 c000200008bc5a98 0000000000000001
GPR24: 0000000000000000 c0000003c73088d0 c000200089afef17 c000000016d3a800
GPR28: c0000003c7308800 c00000002269d000 ffffffffffffffea 0000000000000001
[770.648] NIP [c00800000f6784fc] find_free_extent+0x1d94/0x1e00 [btrfs]
[770.648] LR [c00800000f6784f8] find_free_extent+0x1d90/0x1e00 [btrfs]
[770.648] Call Trace:
[770.648] [c000200089afec40] [c00800000f6784f8] find_free_extent+0x1d90/0x1e00 [btrfs] (unreliable)
[770.648] [c000200089afed30] [c00800000f681398] btrfs_reserve_extent+0x1a0/0x2f0 [btrfs]
[770.648] [c000200089afeea0] [c00800000f681bf0] btrfs_alloc_tree_block+0x108/0x670 [btrfs]
[770.648] [c000200089afeff0] [c00800000f66bd68] __btrfs_cow_block+0x170/0x850 [btrfs]
[770.648] [c000200089aff100] [c00800000f66c58c] btrfs_cow_block+0x144/0x288 [btrfs]
[770.648] [c000200089aff1b0] [c00800000f67113c] btrfs_search_slot+0x6b4/0xcb0 [btrfs]
[770.648] [c000200089aff2a0] [c00800000f679f60] lookup_inline_extent_backref+0x128/0x7c0 [btrfs]
[770.648] [c000200089aff3b0] [c00800000f67b338] lookup_extent_backref+0x70/0x190 [btrfs]
[770.648] [c000200089aff470] [c00800000f67b54c] __btrfs_free_extent+0xf4/0x1490 [btrfs]
[770.648] [c000200089aff5a0] [c00800000f67d770] __btrfs_run_delayed_refs+0x328/0x1530 [btrfs]
[770.648] [c000200089aff740] [c00800000f67ea2c] btrfs_run_delayed_refs+0xb4/0x3e0 [btrfs]
[770.648] [c000200089aff800] [c00800000f699aa4] btrfs_commit_transaction+0x8c/0x12b0 [btrfs]
[770.648] [c000200089aff8f0] [c00800000f6dc628] reset_balance_state+0x1c0/0x290 [btrfs]
[770.648] [c000200089aff9a0] [c00800000f6e2f7c] btrfs_balance+0x1164/0x1500 [btrfs]
[770.648] [c000200089affb40] [c00800000f6f8e4c] btrfs_ioctl+0x2b54/0x3100 [btrfs]
[770.648] [c000200089affc80] [c00000000053be14] sys_ioctl+0x794/0x1310
[770.648] [c000200089affd70] [c00000000002af98] system_call_exception+0x138/0x250
[770.648] [c000200089affe10] [c00000000000c654] system_call_common+0xf4/0x258
[770.648] --- interrupt: c00 at 0x7fff94126800
[770.648] NIP: 00007fff94126800 LR: 0000000107e0b594 CTR: 0000000000000000
[770.648] REGS: c000200089affe80 TRAP: 0c00 Tainted: G W (6.1.0-0.deb11.7-powerpc64le Debian 6.1.20-2~bpo11+1a~test)
[770.648] MSR: 900000000000d033 <SF,HV,EE,PR,ME,IR,DR,RI,LE> CR: 24002848 XER: 00000000
[770.648] IRQMASK: 0
GPR00: 0000000000000036 00007fffc9439da0 00007fff94217100 0000000000000003
GPR04: 00000000c4009420 00007fffc9439ee8 0000000000000000 0000000000000000
GPR08: 00000000803c7416 0000000000000000 0000000000000000 0000000000000000
GPR12: 0000000000000000 00007fff9467d120 0000000107e64c9c 0000000107e64d0a
GPR16: 0000000107e64d06 0000000107e64cf1 0000000107e64cc4 0000000107e64c73
GPR20: 0000000107e64c31 0000000107e64bf1 0000000107e64be7 0000000000000000
GPR24: 0000000000000000 00007fffc9439ee0 0000000000000003 0000000000000001
GPR28: 00007fffc943f713 0000000000000000 00007fffc9439ee8 0000000000000000
[770.648] NIP [00007fff94126800] 0x7fff94126800
[770.648] LR [0000000107e0b594] 0x107e0b594
[770.648] --- interrupt: c00
[770.648] Instruction dump:
[770.648] 3b00ffe4 e8898828 481175f5 60000000 4bfff4fc 3be00000 4bfff570 3d220000
[770.648] 7fc4f378 e8698830 4811cd95 e8410018 <0fe00000> f9c10060 f9e10068 fa010070
[770.648] ---[ end trace 0000000000000000 ]---
[770.648] BTRFS: error (device dm-2: state A) in find_free_extent_update_loop:4122: errno=-22 unknown
[770.648] BTRFS info (device dm-2: state EA): forced readonly
[770.648] BTRFS: error (device dm-2: state EA) in __btrfs_free_extent:3070: errno=-22 unknown
[770.648] BTRFS error (device dm-2: state EA): failed to run delayed ref for logical 17838685708288 num_bytes 24576 type 184 action 2 ref_mod 1: -22
[770.648] BTRFS: error (device dm-2: state EA) in btrfs_run_delayed_refs:2144: errno=-22 unknown
[770.648] BTRFS: error (device dm-2: state EA) in reset_balance_state:3599: errno=-22 unknown
Fixes: 47e6f7423b91 ("btrfs: add support for 3-copy replication (raid1c3)")
Fixes: 8d6fac0087e5 ("btrfs: add support for 4-copy replication (raid1c4)")
CC: stable@vger.kernel.org # 5.10+
Signed-off-by: Matt Corallo <blnxfsl@bluematt.me>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-05 16:49:45 -07:00
|
|
|
/* Select the highest-redundancy RAID level. */
|
|
|
|
if (allowed & BTRFS_BLOCK_GROUP_RAID1C4)
|
|
|
|
allowed = BTRFS_BLOCK_GROUP_RAID1C4;
|
|
|
|
else if (allowed & BTRFS_BLOCK_GROUP_RAID6)
|
2019-06-20 15:38:05 -04:00
|
|
|
allowed = BTRFS_BLOCK_GROUP_RAID6;
|
btrfs: add handling for RAID1C23/DUP to btrfs_reduce_alloc_profile
Callers of `btrfs_reduce_alloc_profile` expect it to return exactly
one allocation profile flag, and failing to do so may ultimately
result in a WARN_ON and remount-ro when allocating new blocks, like
the below transaction abort on 6.1.
`btrfs_reduce_alloc_profile` has two ways of determining the profile,
first it checks if a conversion balance is currently running and
uses the profile we're converting to. If no balance is currently
running, it returns the max-redundancy profile which at least one
block in the selected block group has.
This works by simply checking each known allocation profile bit in
redundancy order. However, `btrfs_reduce_alloc_profile` has not been
updated as new flags have been added - first with the `DUP` profile
and later with the RAID1C34 profiles.
Because of the way it checks, if we have blocks with different
profiles and at least one is known, that profile will be selected.
However, if none are known we may return a flag set with multiple
allocation profiles set.
This is currently only possible when a balance from one of the three
unhandled profiles to another of the unhandled profiles is canceled
after allocating at least one block using the new profile.
In that case, a transaction abort like the below will occur and the
filesystem will need to be mounted with -o skip_balance to get it
mounted rw again (but the balance cannot be resumed without a
similar abort).
[770.648] ------------[ cut here ]------------
[770.648] BTRFS: Transaction aborted (error -22)
[770.648] WARNING: CPU: 43 PID: 1159593 at fs/btrfs/extent-tree.c:4122 find_free_extent+0x1d94/0x1e00 [btrfs]
[770.648] CPU: 43 PID: 1159593 Comm: btrfs Tainted: G W 6.1.0-0.deb11.7-powerpc64le #1 Debian 6.1.20-2~bpo11+1a~test
[770.648] Hardware name: T2P9D01 REV 1.00 POWER9 0x4e1202 opal:skiboot-bc106a0 PowerNV
[770.648] NIP: c00800000f6784fc LR: c00800000f6784f8 CTR: c000000000d746c0
[770.648] REGS: c000200089afe9a0 TRAP: 0700 Tainted: G W (6.1.0-0.deb11.7-powerpc64le Debian 6.1.20-2~bpo11+1a~test)
[770.648] MSR: 9000000002029033 <SF,HV,VEC,EE,ME,IR,DR,RI,LE> CR: 28848282 XER: 20040000
[770.648] CFAR: c000000000135110 IRQMASK: 0
GPR00: c00800000f6784f8 c000200089afec40 c00800000f7ea800 0000000000000026
GPR04: 00000001004820c2 c000200089afea00 c000200089afe9f8 0000000000000027
GPR08: c000200ffbfe7f98 c000000002127f90 ffffffffffffffd8 0000000026d6a6e8
GPR12: 0000000028848282 c000200fff7f3800 5deadbeef0000122 c00000002269d000
GPR16: c0002008c7797c40 c000200089afef17 0000000000000000 0000000000000000
GPR20: 0000000000000000 0000000000000001 c000200008bc5a98 0000000000000001
GPR24: 0000000000000000 c0000003c73088d0 c000200089afef17 c000000016d3a800
GPR28: c0000003c7308800 c00000002269d000 ffffffffffffffea 0000000000000001
[770.648] NIP [c00800000f6784fc] find_free_extent+0x1d94/0x1e00 [btrfs]
[770.648] LR [c00800000f6784f8] find_free_extent+0x1d90/0x1e00 [btrfs]
[770.648] Call Trace:
[770.648] [c000200089afec40] [c00800000f6784f8] find_free_extent+0x1d90/0x1e00 [btrfs] (unreliable)
[770.648] [c000200089afed30] [c00800000f681398] btrfs_reserve_extent+0x1a0/0x2f0 [btrfs]
[770.648] [c000200089afeea0] [c00800000f681bf0] btrfs_alloc_tree_block+0x108/0x670 [btrfs]
[770.648] [c000200089afeff0] [c00800000f66bd68] __btrfs_cow_block+0x170/0x850 [btrfs]
[770.648] [c000200089aff100] [c00800000f66c58c] btrfs_cow_block+0x144/0x288 [btrfs]
[770.648] [c000200089aff1b0] [c00800000f67113c] btrfs_search_slot+0x6b4/0xcb0 [btrfs]
[770.648] [c000200089aff2a0] [c00800000f679f60] lookup_inline_extent_backref+0x128/0x7c0 [btrfs]
[770.648] [c000200089aff3b0] [c00800000f67b338] lookup_extent_backref+0x70/0x190 [btrfs]
[770.648] [c000200089aff470] [c00800000f67b54c] __btrfs_free_extent+0xf4/0x1490 [btrfs]
[770.648] [c000200089aff5a0] [c00800000f67d770] __btrfs_run_delayed_refs+0x328/0x1530 [btrfs]
[770.648] [c000200089aff740] [c00800000f67ea2c] btrfs_run_delayed_refs+0xb4/0x3e0 [btrfs]
[770.648] [c000200089aff800] [c00800000f699aa4] btrfs_commit_transaction+0x8c/0x12b0 [btrfs]
[770.648] [c000200089aff8f0] [c00800000f6dc628] reset_balance_state+0x1c0/0x290 [btrfs]
[770.648] [c000200089aff9a0] [c00800000f6e2f7c] btrfs_balance+0x1164/0x1500 [btrfs]
[770.648] [c000200089affb40] [c00800000f6f8e4c] btrfs_ioctl+0x2b54/0x3100 [btrfs]
[770.648] [c000200089affc80] [c00000000053be14] sys_ioctl+0x794/0x1310
[770.648] [c000200089affd70] [c00000000002af98] system_call_exception+0x138/0x250
[770.648] [c000200089affe10] [c00000000000c654] system_call_common+0xf4/0x258
[770.648] --- interrupt: c00 at 0x7fff94126800
[770.648] NIP: 00007fff94126800 LR: 0000000107e0b594 CTR: 0000000000000000
[770.648] REGS: c000200089affe80 TRAP: 0c00 Tainted: G W (6.1.0-0.deb11.7-powerpc64le Debian 6.1.20-2~bpo11+1a~test)
[770.648] MSR: 900000000000d033 <SF,HV,EE,PR,ME,IR,DR,RI,LE> CR: 24002848 XER: 00000000
[770.648] IRQMASK: 0
GPR00: 0000000000000036 00007fffc9439da0 00007fff94217100 0000000000000003
GPR04: 00000000c4009420 00007fffc9439ee8 0000000000000000 0000000000000000
GPR08: 00000000803c7416 0000000000000000 0000000000000000 0000000000000000
GPR12: 0000000000000000 00007fff9467d120 0000000107e64c9c 0000000107e64d0a
GPR16: 0000000107e64d06 0000000107e64cf1 0000000107e64cc4 0000000107e64c73
GPR20: 0000000107e64c31 0000000107e64bf1 0000000107e64be7 0000000000000000
GPR24: 0000000000000000 00007fffc9439ee0 0000000000000003 0000000000000001
GPR28: 00007fffc943f713 0000000000000000 00007fffc9439ee8 0000000000000000
[770.648] NIP [00007fff94126800] 0x7fff94126800
[770.648] LR [0000000107e0b594] 0x107e0b594
[770.648] --- interrupt: c00
[770.648] Instruction dump:
[770.648] 3b00ffe4 e8898828 481175f5 60000000 4bfff4fc 3be00000 4bfff570 3d220000
[770.648] 7fc4f378 e8698830 4811cd95 e8410018 <0fe00000> f9c10060 f9e10068 fa010070
[770.648] ---[ end trace 0000000000000000 ]---
[770.648] BTRFS: error (device dm-2: state A) in find_free_extent_update_loop:4122: errno=-22 unknown
[770.648] BTRFS info (device dm-2: state EA): forced readonly
[770.648] BTRFS: error (device dm-2: state EA) in __btrfs_free_extent:3070: errno=-22 unknown
[770.648] BTRFS error (device dm-2: state EA): failed to run delayed ref for logical 17838685708288 num_bytes 24576 type 184 action 2 ref_mod 1: -22
[770.648] BTRFS: error (device dm-2: state EA) in btrfs_run_delayed_refs:2144: errno=-22 unknown
[770.648] BTRFS: error (device dm-2: state EA) in reset_balance_state:3599: errno=-22 unknown
Fixes: 47e6f7423b91 ("btrfs: add support for 3-copy replication (raid1c3)")
Fixes: 8d6fac0087e5 ("btrfs: add support for 4-copy replication (raid1c4)")
CC: stable@vger.kernel.org # 5.10+
Signed-off-by: Matt Corallo <blnxfsl@bluematt.me>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-05 16:49:45 -07:00
|
|
|
else if (allowed & BTRFS_BLOCK_GROUP_RAID1C3)
|
|
|
|
allowed = BTRFS_BLOCK_GROUP_RAID1C3;
|
2019-06-20 15:38:05 -04:00
|
|
|
else if (allowed & BTRFS_BLOCK_GROUP_RAID5)
|
|
|
|
allowed = BTRFS_BLOCK_GROUP_RAID5;
|
|
|
|
else if (allowed & BTRFS_BLOCK_GROUP_RAID10)
|
|
|
|
allowed = BTRFS_BLOCK_GROUP_RAID10;
|
|
|
|
else if (allowed & BTRFS_BLOCK_GROUP_RAID1)
|
|
|
|
allowed = BTRFS_BLOCK_GROUP_RAID1;
|
btrfs: add handling for RAID1C23/DUP to btrfs_reduce_alloc_profile
Callers of `btrfs_reduce_alloc_profile` expect it to return exactly
one allocation profile flag, and failing to do so may ultimately
result in a WARN_ON and remount-ro when allocating new blocks, like
the below transaction abort on 6.1.
`btrfs_reduce_alloc_profile` has two ways of determining the profile,
first it checks if a conversion balance is currently running and
uses the profile we're converting to. If no balance is currently
running, it returns the max-redundancy profile which at least one
block in the selected block group has.
This works by simply checking each known allocation profile bit in
redundancy order. However, `btrfs_reduce_alloc_profile` has not been
updated as new flags have been added - first with the `DUP` profile
and later with the RAID1C34 profiles.
Because of the way it checks, if we have blocks with different
profiles and at least one is known, that profile will be selected.
However, if none are known we may return a flag set with multiple
allocation profiles set.
This is currently only possible when a balance from one of the three
unhandled profiles to another of the unhandled profiles is canceled
after allocating at least one block using the new profile.
In that case, a transaction abort like the below will occur and the
filesystem will need to be mounted with -o skip_balance to get it
mounted rw again (but the balance cannot be resumed without a
similar abort).
[770.648] ------------[ cut here ]------------
[770.648] BTRFS: Transaction aborted (error -22)
[770.648] WARNING: CPU: 43 PID: 1159593 at fs/btrfs/extent-tree.c:4122 find_free_extent+0x1d94/0x1e00 [btrfs]
[770.648] CPU: 43 PID: 1159593 Comm: btrfs Tainted: G W 6.1.0-0.deb11.7-powerpc64le #1 Debian 6.1.20-2~bpo11+1a~test
[770.648] Hardware name: T2P9D01 REV 1.00 POWER9 0x4e1202 opal:skiboot-bc106a0 PowerNV
[770.648] NIP: c00800000f6784fc LR: c00800000f6784f8 CTR: c000000000d746c0
[770.648] REGS: c000200089afe9a0 TRAP: 0700 Tainted: G W (6.1.0-0.deb11.7-powerpc64le Debian 6.1.20-2~bpo11+1a~test)
[770.648] MSR: 9000000002029033 <SF,HV,VEC,EE,ME,IR,DR,RI,LE> CR: 28848282 XER: 20040000
[770.648] CFAR: c000000000135110 IRQMASK: 0
GPR00: c00800000f6784f8 c000200089afec40 c00800000f7ea800 0000000000000026
GPR04: 00000001004820c2 c000200089afea00 c000200089afe9f8 0000000000000027
GPR08: c000200ffbfe7f98 c000000002127f90 ffffffffffffffd8 0000000026d6a6e8
GPR12: 0000000028848282 c000200fff7f3800 5deadbeef0000122 c00000002269d000
GPR16: c0002008c7797c40 c000200089afef17 0000000000000000 0000000000000000
GPR20: 0000000000000000 0000000000000001 c000200008bc5a98 0000000000000001
GPR24: 0000000000000000 c0000003c73088d0 c000200089afef17 c000000016d3a800
GPR28: c0000003c7308800 c00000002269d000 ffffffffffffffea 0000000000000001
[770.648] NIP [c00800000f6784fc] find_free_extent+0x1d94/0x1e00 [btrfs]
[770.648] LR [c00800000f6784f8] find_free_extent+0x1d90/0x1e00 [btrfs]
[770.648] Call Trace:
[770.648] [c000200089afec40] [c00800000f6784f8] find_free_extent+0x1d90/0x1e00 [btrfs] (unreliable)
[770.648] [c000200089afed30] [c00800000f681398] btrfs_reserve_extent+0x1a0/0x2f0 [btrfs]
[770.648] [c000200089afeea0] [c00800000f681bf0] btrfs_alloc_tree_block+0x108/0x670 [btrfs]
[770.648] [c000200089afeff0] [c00800000f66bd68] __btrfs_cow_block+0x170/0x850 [btrfs]
[770.648] [c000200089aff100] [c00800000f66c58c] btrfs_cow_block+0x144/0x288 [btrfs]
[770.648] [c000200089aff1b0] [c00800000f67113c] btrfs_search_slot+0x6b4/0xcb0 [btrfs]
[770.648] [c000200089aff2a0] [c00800000f679f60] lookup_inline_extent_backref+0x128/0x7c0 [btrfs]
[770.648] [c000200089aff3b0] [c00800000f67b338] lookup_extent_backref+0x70/0x190 [btrfs]
[770.648] [c000200089aff470] [c00800000f67b54c] __btrfs_free_extent+0xf4/0x1490 [btrfs]
[770.648] [c000200089aff5a0] [c00800000f67d770] __btrfs_run_delayed_refs+0x328/0x1530 [btrfs]
[770.648] [c000200089aff740] [c00800000f67ea2c] btrfs_run_delayed_refs+0xb4/0x3e0 [btrfs]
[770.648] [c000200089aff800] [c00800000f699aa4] btrfs_commit_transaction+0x8c/0x12b0 [btrfs]
[770.648] [c000200089aff8f0] [c00800000f6dc628] reset_balance_state+0x1c0/0x290 [btrfs]
[770.648] [c000200089aff9a0] [c00800000f6e2f7c] btrfs_balance+0x1164/0x1500 [btrfs]
[770.648] [c000200089affb40] [c00800000f6f8e4c] btrfs_ioctl+0x2b54/0x3100 [btrfs]
[770.648] [c000200089affc80] [c00000000053be14] sys_ioctl+0x794/0x1310
[770.648] [c000200089affd70] [c00000000002af98] system_call_exception+0x138/0x250
[770.648] [c000200089affe10] [c00000000000c654] system_call_common+0xf4/0x258
[770.648] --- interrupt: c00 at 0x7fff94126800
[770.648] NIP: 00007fff94126800 LR: 0000000107e0b594 CTR: 0000000000000000
[770.648] REGS: c000200089affe80 TRAP: 0c00 Tainted: G W (6.1.0-0.deb11.7-powerpc64le Debian 6.1.20-2~bpo11+1a~test)
[770.648] MSR: 900000000000d033 <SF,HV,EE,PR,ME,IR,DR,RI,LE> CR: 24002848 XER: 00000000
[770.648] IRQMASK: 0
GPR00: 0000000000000036 00007fffc9439da0 00007fff94217100 0000000000000003
GPR04: 00000000c4009420 00007fffc9439ee8 0000000000000000 0000000000000000
GPR08: 00000000803c7416 0000000000000000 0000000000000000 0000000000000000
GPR12: 0000000000000000 00007fff9467d120 0000000107e64c9c 0000000107e64d0a
GPR16: 0000000107e64d06 0000000107e64cf1 0000000107e64cc4 0000000107e64c73
GPR20: 0000000107e64c31 0000000107e64bf1 0000000107e64be7 0000000000000000
GPR24: 0000000000000000 00007fffc9439ee0 0000000000000003 0000000000000001
GPR28: 00007fffc943f713 0000000000000000 00007fffc9439ee8 0000000000000000
[770.648] NIP [00007fff94126800] 0x7fff94126800
[770.648] LR [0000000107e0b594] 0x107e0b594
[770.648] --- interrupt: c00
[770.648] Instruction dump:
[770.648] 3b00ffe4 e8898828 481175f5 60000000 4bfff4fc 3be00000 4bfff570 3d220000
[770.648] 7fc4f378 e8698830 4811cd95 e8410018 <0fe00000> f9c10060 f9e10068 fa010070
[770.648] ---[ end trace 0000000000000000 ]---
[770.648] BTRFS: error (device dm-2: state A) in find_free_extent_update_loop:4122: errno=-22 unknown
[770.648] BTRFS info (device dm-2: state EA): forced readonly
[770.648] BTRFS: error (device dm-2: state EA) in __btrfs_free_extent:3070: errno=-22 unknown
[770.648] BTRFS error (device dm-2: state EA): failed to run delayed ref for logical 17838685708288 num_bytes 24576 type 184 action 2 ref_mod 1: -22
[770.648] BTRFS: error (device dm-2: state EA) in btrfs_run_delayed_refs:2144: errno=-22 unknown
[770.648] BTRFS: error (device dm-2: state EA) in reset_balance_state:3599: errno=-22 unknown
Fixes: 47e6f7423b91 ("btrfs: add support for 3-copy replication (raid1c3)")
Fixes: 8d6fac0087e5 ("btrfs: add support for 4-copy replication (raid1c4)")
CC: stable@vger.kernel.org # 5.10+
Signed-off-by: Matt Corallo <blnxfsl@bluematt.me>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-05 16:49:45 -07:00
|
|
|
else if (allowed & BTRFS_BLOCK_GROUP_DUP)
|
|
|
|
allowed = BTRFS_BLOCK_GROUP_DUP;
|
2019-06-20 15:38:05 -04:00
|
|
|
else if (allowed & BTRFS_BLOCK_GROUP_RAID0)
|
|
|
|
allowed = BTRFS_BLOCK_GROUP_RAID0;
|
|
|
|
|
|
|
|
flags &= ~BTRFS_BLOCK_GROUP_PROFILE_MASK;
|
|
|
|
|
|
|
|
return extended_to_chunk(flags | allowed);
|
|
|
|
}
|
|
|
|
|
2020-01-02 17:14:57 +01:00
|
|
|
u64 btrfs_get_alloc_profile(struct btrfs_fs_info *fs_info, u64 orig_flags)
|
2019-06-20 15:38:05 -04:00
|
|
|
{
|
|
|
|
unsigned seq;
|
|
|
|
u64 flags;
|
|
|
|
|
|
|
|
do {
|
|
|
|
flags = orig_flags;
|
|
|
|
seq = read_seqbegin(&fs_info->profiles_lock);
|
|
|
|
|
|
|
|
if (flags & BTRFS_BLOCK_GROUP_DATA)
|
|
|
|
flags |= fs_info->avail_data_alloc_bits;
|
|
|
|
else if (flags & BTRFS_BLOCK_GROUP_SYSTEM)
|
|
|
|
flags |= fs_info->avail_system_alloc_bits;
|
|
|
|
else if (flags & BTRFS_BLOCK_GROUP_METADATA)
|
|
|
|
flags |= fs_info->avail_metadata_alloc_bits;
|
|
|
|
} while (read_seqretry(&fs_info->profiles_lock, seq));
|
|
|
|
|
|
|
|
return btrfs_reduce_alloc_profile(fs_info, flags);
|
|
|
|
}
|
|
|
|
|
2019-10-29 19:20:18 +01:00
|
|
|
void btrfs_get_block_group(struct btrfs_block_group *cache)
|
2019-06-20 15:37:46 -04:00
|
|
|
{
|
2020-07-06 09:14:11 -04:00
|
|
|
refcount_inc(&cache->refs);
|
2019-06-20 15:37:46 -04:00
|
|
|
}
|
|
|
|
|
2019-10-29 19:20:18 +01:00
|
|
|
void btrfs_put_block_group(struct btrfs_block_group *cache)
|
2019-06-20 15:37:46 -04:00
|
|
|
{
|
2020-07-06 09:14:11 -04:00
|
|
|
if (refcount_dec_and_test(&cache->refs)) {
|
2019-06-20 15:37:46 -04:00
|
|
|
WARN_ON(cache->pinned > 0);
|
btrfs: skip reserved bytes warning on unmount after log cleanup failure
After the recent changes made by commit c2e39305299f01 ("btrfs: clear
extent buffer uptodate when we fail to write it") and its followup fix,
commit 651740a5024117 ("btrfs: check WRITE_ERR when trying to read an
extent buffer"), we can now end up not cleaning up space reservations of
log tree extent buffers after a transaction abort happens, as well as not
cleaning up still dirty extent buffers.
This happens because if writeback for a log tree extent buffer failed,
then we have cleared the bit EXTENT_BUFFER_UPTODATE from the extent buffer
and we have also set the bit EXTENT_BUFFER_WRITE_ERR on it. Later on,
when trying to free the log tree with free_log_tree(), which iterates
over the tree, we can end up getting an -EIO error when trying to read
a node or a leaf, since read_extent_buffer_pages() returns -EIO if an
extent buffer does not have EXTENT_BUFFER_UPTODATE set and has the
EXTENT_BUFFER_WRITE_ERR bit set. Getting that -EIO means that we return
immediately as we can not iterate over the entire tree.
In that case we never update the reserved space for an extent buffer in
the respective block group and space_info object.
When this happens we get the following traces when unmounting the fs:
[174957.284509] BTRFS: error (device dm-0) in cleanup_transaction:1913: errno=-5 IO failure
[174957.286497] BTRFS: error (device dm-0) in free_log_tree:3420: errno=-5 IO failure
[174957.399379] ------------[ cut here ]------------
[174957.402497] WARNING: CPU: 2 PID: 3206883 at fs/btrfs/block-group.c:127 btrfs_put_block_group+0x77/0xb0 [btrfs]
[174957.407523] Modules linked in: btrfs overlay dm_zero (...)
[174957.424917] CPU: 2 PID: 3206883 Comm: umount Tainted: G W 5.16.0-rc5-btrfs-next-109 #1
[174957.426689] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
[174957.428716] RIP: 0010:btrfs_put_block_group+0x77/0xb0 [btrfs]
[174957.429717] Code: 21 48 8b bd (...)
[174957.432867] RSP: 0018:ffffb70d41cffdd0 EFLAGS: 00010206
[174957.433632] RAX: 0000000000000001 RBX: ffff8b09c3848000 RCX: ffff8b0758edd1c8
[174957.434689] RDX: 0000000000000001 RSI: ffffffffc0b467e7 RDI: ffff8b0758edd000
[174957.436068] RBP: ffff8b0758edd000 R08: 0000000000000000 R09: 0000000000000000
[174957.437114] R10: 0000000000000246 R11: 0000000000000000 R12: ffff8b09c3848148
[174957.438140] R13: ffff8b09c3848198 R14: ffff8b0758edd188 R15: dead000000000100
[174957.439317] FS: 00007f328fb82800(0000) GS:ffff8b0a2d200000(0000) knlGS:0000000000000000
[174957.440402] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[174957.441164] CR2: 00007fff13563e98 CR3: 0000000404f4e005 CR4: 0000000000370ee0
[174957.442117] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[174957.443076] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[174957.443948] Call Trace:
[174957.444264] <TASK>
[174957.444538] btrfs_free_block_groups+0x255/0x3c0 [btrfs]
[174957.445238] close_ctree+0x301/0x357 [btrfs]
[174957.445803] ? call_rcu+0x16c/0x290
[174957.446250] generic_shutdown_super+0x74/0x120
[174957.446832] kill_anon_super+0x14/0x30
[174957.447305] btrfs_kill_super+0x12/0x20 [btrfs]
[174957.447890] deactivate_locked_super+0x31/0xa0
[174957.448440] cleanup_mnt+0x147/0x1c0
[174957.448888] task_work_run+0x5c/0xa0
[174957.449336] exit_to_user_mode_prepare+0x1e5/0x1f0
[174957.449934] syscall_exit_to_user_mode+0x16/0x40
[174957.450512] do_syscall_64+0x48/0xc0
[174957.450980] entry_SYSCALL_64_after_hwframe+0x44/0xae
[174957.451605] RIP: 0033:0x7f328fdc4a97
[174957.452059] Code: 03 0c 00 f7 (...)
[174957.454320] RSP: 002b:00007fff13564ec8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
[174957.455262] RAX: 0000000000000000 RBX: 00007f328feea264 RCX: 00007f328fdc4a97
[174957.456131] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000560b8ae51dd0
[174957.457118] RBP: 0000560b8ae51ba0 R08: 0000000000000000 R09: 00007fff13563c40
[174957.458005] R10: 00007f328fe49fc0 R11: 0000000000000246 R12: 0000000000000000
[174957.459113] R13: 0000560b8ae51dd0 R14: 0000560b8ae51cb0 R15: 0000000000000000
[174957.460193] </TASK>
[174957.460534] irq event stamp: 0
[174957.461003] hardirqs last enabled at (0): [<0000000000000000>] 0x0
[174957.461947] hardirqs last disabled at (0): [<ffffffffb0e94214>] copy_process+0x934/0x2040
[174957.463147] softirqs last enabled at (0): [<ffffffffb0e94214>] copy_process+0x934/0x2040
[174957.465116] softirqs last disabled at (0): [<0000000000000000>] 0x0
[174957.466323] ---[ end trace bc7ee0c490bce3af ]---
[174957.467282] ------------[ cut here ]------------
[174957.468184] WARNING: CPU: 2 PID: 3206883 at fs/btrfs/block-group.c:3976 btrfs_free_block_groups+0x330/0x3c0 [btrfs]
[174957.470066] Modules linked in: btrfs overlay dm_zero (...)
[174957.483137] CPU: 2 PID: 3206883 Comm: umount Tainted: G W 5.16.0-rc5-btrfs-next-109 #1
[174957.484691] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
[174957.486853] RIP: 0010:btrfs_free_block_groups+0x330/0x3c0 [btrfs]
[174957.488050] Code: 00 00 00 ad de (...)
[174957.491479] RSP: 0018:ffffb70d41cffde0 EFLAGS: 00010206
[174957.492520] RAX: ffff8b08d79310b0 RBX: ffff8b09c3848000 RCX: 0000000000000000
[174957.493868] RDX: 0000000000000001 RSI: fffff443055ee600 RDI: ffffffffb1131846
[174957.495183] RBP: ffff8b08d79310b0 R08: 0000000000000000 R09: 0000000000000000
[174957.496580] R10: 0000000000000001 R11: 0000000000000000 R12: ffff8b08d7931000
[174957.498027] R13: ffff8b09c38492b0 R14: dead000000000122 R15: dead000000000100
[174957.499438] FS: 00007f328fb82800(0000) GS:ffff8b0a2d200000(0000) knlGS:0000000000000000
[174957.500990] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[174957.502117] CR2: 00007fff13563e98 CR3: 0000000404f4e005 CR4: 0000000000370ee0
[174957.503513] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[174957.504864] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[174957.506167] Call Trace:
[174957.506654] <TASK>
[174957.507047] close_ctree+0x301/0x357 [btrfs]
[174957.507867] ? call_rcu+0x16c/0x290
[174957.508567] generic_shutdown_super+0x74/0x120
[174957.509447] kill_anon_super+0x14/0x30
[174957.510194] btrfs_kill_super+0x12/0x20 [btrfs]
[174957.511123] deactivate_locked_super+0x31/0xa0
[174957.511976] cleanup_mnt+0x147/0x1c0
[174957.512610] task_work_run+0x5c/0xa0
[174957.513309] exit_to_user_mode_prepare+0x1e5/0x1f0
[174957.514231] syscall_exit_to_user_mode+0x16/0x40
[174957.515069] do_syscall_64+0x48/0xc0
[174957.515718] entry_SYSCALL_64_after_hwframe+0x44/0xae
[174957.516688] RIP: 0033:0x7f328fdc4a97
[174957.517413] Code: 03 0c 00 f7 d8 (...)
[174957.521052] RSP: 002b:00007fff13564ec8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
[174957.522514] RAX: 0000000000000000 RBX: 00007f328feea264 RCX: 00007f328fdc4a97
[174957.523950] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000560b8ae51dd0
[174957.525375] RBP: 0000560b8ae51ba0 R08: 0000000000000000 R09: 00007fff13563c40
[174957.526763] R10: 00007f328fe49fc0 R11: 0000000000000246 R12: 0000000000000000
[174957.528058] R13: 0000560b8ae51dd0 R14: 0000560b8ae51cb0 R15: 0000000000000000
[174957.529404] </TASK>
[174957.529843] irq event stamp: 0
[174957.530256] hardirqs last enabled at (0): [<0000000000000000>] 0x0
[174957.531061] hardirqs last disabled at (0): [<ffffffffb0e94214>] copy_process+0x934/0x2040
[174957.532075] softirqs last enabled at (0): [<ffffffffb0e94214>] copy_process+0x934/0x2040
[174957.533083] softirqs last disabled at (0): [<0000000000000000>] 0x0
[174957.533865] ---[ end trace bc7ee0c490bce3b0 ]---
[174957.534452] BTRFS info (device dm-0): space_info 4 has 1070841856 free, is not full
[174957.535404] BTRFS info (device dm-0): space_info total=1073741824, used=2785280, pinned=0, reserved=49152, may_use=0, readonly=65536 zone_unusable=0
[174957.537029] BTRFS info (device dm-0): global_block_rsv: size 0 reserved 0
[174957.537859] BTRFS info (device dm-0): trans_block_rsv: size 0 reserved 0
[174957.538697] BTRFS info (device dm-0): chunk_block_rsv: size 0 reserved 0
[174957.539552] BTRFS info (device dm-0): delayed_block_rsv: size 0 reserved 0
[174957.540403] BTRFS info (device dm-0): delayed_refs_rsv: size 0 reserved 0
This also means that in case we have log tree extent buffers that are
still dirty, we can end up not cleaning them up in case we find an
extent buffer with EXTENT_BUFFER_WRITE_ERR set on it, as in that case
we have no way for iterating over the rest of the tree.
This issue is very often triggered with test cases generic/475 and
generic/648 from fstests.
The issue could almost be fixed by iterating over the io tree attached to
each log root which keeps tracks of the range of allocated extent buffers,
log_root->dirty_log_pages, however that does not work and has some
inconveniences:
1) After we sync the log, we clear the range of the extent buffers from
the io tree, so we can't find them after writeback. We could keep the
ranges in the io tree, with a separate bit to signal they represent
extent buffers already written, but that means we need to hold into
more memory until the transaction commits.
How much more memory is used depends a lot on whether we are able to
allocate contiguous extent buffers on disk (and how often) for a log
tree - if we are able to, then a single extent state record can
represent multiple extent buffers, otherwise we need multiple extent
state record structures to track each extent buffer.
In fact, my earlier approach did that:
https://lore.kernel.org/linux-btrfs/3aae7c6728257c7ce2279d6660ee2797e5e34bbd.1641300250.git.fdmanana@suse.com/
However that can cause a very significant negative impact on
performance, not only due to the extra memory usage but also because
we get a larger and deeper dirty_log_pages io tree.
We got a report that, on beefy machines at least, we can get such
performance drop with fsmark for example:
https://lore.kernel.org/linux-btrfs/20220117082426.GE32491@xsang-OptiPlex-9020/
2) We would be doing it only to deal with an unexpected and exceptional
case, which is basically failure to read an extent buffer from disk
due to IO failures. On a healthy system we don't expect transaction
aborts to happen after all;
3) Instead of relying on iterating the log tree or tracking the ranges
of extent buffers in the dirty_log_pages io tree, using the radix
tree that tracks extent buffers (fs_info->buffer_radix) to find all
log tree extent buffers is not reliable either, because after writeback
of an extent buffer it can be evicted from memory by the release page
callback of the btree inode (btree_releasepage()).
Since there's no way to be able to properly cleanup a log tree without
being able to read its extent buffers from disk and without using more
memory to track the logical ranges of the allocated extent buffers do
the following:
1) When we fail to cleanup a log tree, setup a flag that indicates that
failure;
2) Trigger writeback of all log tree extent buffers that are still dirty,
and wait for the writeback to complete. This is just to cleanup their
state, page states, page leaks, etc;
3) When unmounting the fs, ignore if the number of bytes reserved in a
block group and in a space_info is not 0 if, and only if, we failed to
cleanup a log tree. Also ignore only for metadata block groups and the
metadata space_info object.
This is far from a perfect solution, but it serves to silence test
failures such as those from generic/475 and generic/648. However having
a non-zero value for the reserved bytes counters on unmount after a
transaction abort, is not such a terrible thing and it's completely
harmless, it does not affect the filesystem integrity in any way.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-18 13:39:34 +00:00
|
|
|
/*
|
|
|
|
* If there was a failure to cleanup a log tree, very likely due
|
|
|
|
* to an IO failure on a writeback attempt of one or more of its
|
|
|
|
* extent buffers, we could not do proper (and cheap) unaccounting
|
|
|
|
* of their reserved space, so don't warn on reserved > 0 in that
|
|
|
|
* case.
|
|
|
|
*/
|
|
|
|
if (!(cache->flags & BTRFS_BLOCK_GROUP_METADATA) ||
|
|
|
|
!BTRFS_FS_LOG_CLEANUP_ERROR(cache->fs_info))
|
|
|
|
WARN_ON(cache->reserved > 0);
|
2019-06-20 15:37:46 -04:00
|
|
|
|
2019-12-13 16:22:14 -08:00
|
|
|
/*
|
|
|
|
* A block_group shouldn't be on the discard_list anymore.
|
|
|
|
* Remove the block_group from the discard_list to prevent us
|
|
|
|
* from causing a panic due to NULL pointer dereference.
|
|
|
|
*/
|
|
|
|
if (WARN_ON(!list_empty(&cache->discard_list)))
|
|
|
|
btrfs_discard_cancel_work(&cache->fs_info->discard_ctl,
|
|
|
|
cache);
|
|
|
|
|
2019-06-20 15:37:46 -04:00
|
|
|
kfree(cache->free_space_ctl);
|
btrfs: use a dedicated data structure for chunk maps
Currently we abuse the extent_map structure for two purposes:
1) To actually represent extents for inodes;
2) To represent chunk mappings.
This is odd and has several disadvantages:
1) To create a chunk map, we need to do two memory allocations: one for
an extent_map structure and another one for a map_lookup structure, so
more potential for an allocation failure and more complicated code to
manage and link two structures;
2) For a chunk map we actually only use 3 fields (24 bytes) of the
respective extent map structure: the 'start' field to have the logical
start address of the chunk, the 'len' field to have the chunk's size,
and the 'orig_block_len' field to contain the chunk's stripe size.
Besides wasting a memory, it's also odd and not intuitive at all to
have the stripe size in a field named 'orig_block_len'.
We are also using 'block_len' of the extent_map structure to contain
the chunk size, so we have 2 fields for the same value, 'len' and
'block_len', which is pointless;
3) When an extent map is associated to a chunk mapping, we set the bit
EXTENT_FLAG_FS_MAPPING on its flags and then make its member named
'map_lookup' point to the associated map_lookup structure. This means
that for an extent map associated to an inode extent, we are not using
this 'map_lookup' pointer, so wasting 8 bytes (on a 64 bits platform);
4) Extent maps associated to a chunk mapping are never merged or split so
it's pointless to use the existing extent map infrastructure.
So add a dedicated data structure named 'btrfs_chunk_map' to represent
chunk mappings, this is basically the existing map_lookup structure with
some extra fields:
1) 'start' to contain the chunk logical address;
2) 'chunk_len' to contain the chunk's length;
3) 'stripe_size' for the stripe size;
4) 'rb_node' for insertion into a rb tree;
5) 'refs' for reference counting.
This way we do a single memory allocation for chunk mappings and we don't
waste memory for them with unused/unnecessary fields from an extent_map.
We also save 8 bytes from the extent_map structure by removing the
'map_lookup' pointer, so the size of struct extent_map is reduced from
144 bytes down to 136 bytes, and we can now have 30 extents map per 4K
page instead of 28.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-11-21 13:38:38 +00:00
|
|
|
btrfs_free_chunk_map(cache->physical_map);
|
2019-06-20 15:37:46 -04:00
|
|
|
kfree(cache);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2019-06-20 15:37:57 -04:00
|
|
|
/*
|
|
|
|
* This adds the block group to the fs_info rb tree for the block group cache
|
|
|
|
*/
|
|
|
|
static int btrfs_add_block_group_cache(struct btrfs_fs_info *info,
|
2019-10-29 19:20:18 +01:00
|
|
|
struct btrfs_block_group *block_group)
|
2019-06-20 15:37:57 -04:00
|
|
|
{
|
|
|
|
struct rb_node **p;
|
|
|
|
struct rb_node *parent = NULL;
|
2019-10-29 19:20:18 +01:00
|
|
|
struct btrfs_block_group *cache;
|
2022-04-13 16:20:40 +01:00
|
|
|
bool leftmost = true;
|
2019-06-20 15:37:57 -04:00
|
|
|
|
2020-05-05 07:58:20 +08:00
|
|
|
ASSERT(block_group->length != 0);
|
|
|
|
|
btrfs: use a read/write lock for protecting the block groups tree
Currently we use a spin lock to protect the red black tree that we use to
track block groups. Most accesses to that tree are actually read only and
for large filesystems, with thousands of block groups, it actually has
a bad impact on performance, as concurrent read only searches on the tree
are serialized.
Read only searches on the tree are very frequent and done when:
1) Pinning and unpinning extents, as we need to lookup the respective
block group from the tree;
2) Freeing the last reference of a tree block, regardless if we pin the
underlying extent or add it back to free space cache/tree;
3) During NOCOW writes, both buffered IO and direct IO, we need to check
if the block group that contains an extent is read only or not and to
increment the number of NOCOW writers in the block group. For those
operations we need to search for the block group in the tree.
Similarly, after creating the ordered extent for the NOCOW write, we
need to decrement the number of NOCOW writers from the same block
group, which requires searching for it in the tree;
4) Decreasing the number of extent reservations in a block group;
5) When allocating extents and freeing reserved extents;
6) Adding and removing free space to the free space tree;
7) When releasing delalloc bytes during ordered extent completion;
8) When relocating a block group;
9) During fitrim, to iterate over the block groups;
10) etc;
Write accesses to the tree, to add or remove block groups, are much less
frequent as they happen only when allocating a new block group or when
deleting a block group.
We also use the same spin lock to protect the list of currently caching
block groups. Additions to this list are made when we need to cache a
block group, because we don't have a free space cache for it (or we have
but it's invalid), and removals from this list are done when caching of
the block group's free space finishes. These cases are also not very
common, but when they happen, they happen only once when the filesystem
is mounted.
So switch the lock that protects the tree of block groups from a spinning
lock to a read/write lock.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-04-13 16:20:41 +01:00
|
|
|
write_lock(&info->block_group_cache_lock);
|
2022-04-13 16:20:40 +01:00
|
|
|
p = &info->block_group_cache_tree.rb_root.rb_node;
|
2019-06-20 15:37:57 -04:00
|
|
|
|
|
|
|
while (*p) {
|
|
|
|
parent = *p;
|
2019-10-29 19:20:18 +01:00
|
|
|
cache = rb_entry(parent, struct btrfs_block_group, cache_node);
|
2019-10-23 18:48:22 +02:00
|
|
|
if (block_group->start < cache->start) {
|
2019-06-20 15:37:57 -04:00
|
|
|
p = &(*p)->rb_left;
|
2019-10-23 18:48:22 +02:00
|
|
|
} else if (block_group->start > cache->start) {
|
2019-06-20 15:37:57 -04:00
|
|
|
p = &(*p)->rb_right;
|
2022-04-13 16:20:40 +01:00
|
|
|
leftmost = false;
|
2019-06-20 15:37:57 -04:00
|
|
|
} else {
|
btrfs: use a read/write lock for protecting the block groups tree
Currently we use a spin lock to protect the red black tree that we use to
track block groups. Most accesses to that tree are actually read only and
for large filesystems, with thousands of block groups, it actually has
a bad impact on performance, as concurrent read only searches on the tree
are serialized.
Read only searches on the tree are very frequent and done when:
1) Pinning and unpinning extents, as we need to lookup the respective
block group from the tree;
2) Freeing the last reference of a tree block, regardless if we pin the
underlying extent or add it back to free space cache/tree;
3) During NOCOW writes, both buffered IO and direct IO, we need to check
if the block group that contains an extent is read only or not and to
increment the number of NOCOW writers in the block group. For those
operations we need to search for the block group in the tree.
Similarly, after creating the ordered extent for the NOCOW write, we
need to decrement the number of NOCOW writers from the same block
group, which requires searching for it in the tree;
4) Decreasing the number of extent reservations in a block group;
5) When allocating extents and freeing reserved extents;
6) Adding and removing free space to the free space tree;
7) When releasing delalloc bytes during ordered extent completion;
8) When relocating a block group;
9) During fitrim, to iterate over the block groups;
10) etc;
Write accesses to the tree, to add or remove block groups, are much less
frequent as they happen only when allocating a new block group or when
deleting a block group.
We also use the same spin lock to protect the list of currently caching
block groups. Additions to this list are made when we need to cache a
block group, because we don't have a free space cache for it (or we have
but it's invalid), and removals from this list are done when caching of
the block group's free space finishes. These cases are also not very
common, but when they happen, they happen only once when the filesystem
is mounted.
So switch the lock that protects the tree of block groups from a spinning
lock to a read/write lock.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-04-13 16:20:41 +01:00
|
|
|
write_unlock(&info->block_group_cache_lock);
|
2019-06-20 15:37:57 -04:00
|
|
|
return -EEXIST;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
rb_link_node(&block_group->cache_node, parent, p);
|
2022-04-13 16:20:40 +01:00
|
|
|
rb_insert_color_cached(&block_group->cache_node,
|
|
|
|
&info->block_group_cache_tree, leftmost);
|
2019-06-20 15:37:57 -04:00
|
|
|
|
btrfs: use a read/write lock for protecting the block groups tree
Currently we use a spin lock to protect the red black tree that we use to
track block groups. Most accesses to that tree are actually read only and
for large filesystems, with thousands of block groups, it actually has
a bad impact on performance, as concurrent read only searches on the tree
are serialized.
Read only searches on the tree are very frequent and done when:
1) Pinning and unpinning extents, as we need to lookup the respective
block group from the tree;
2) Freeing the last reference of a tree block, regardless if we pin the
underlying extent or add it back to free space cache/tree;
3) During NOCOW writes, both buffered IO and direct IO, we need to check
if the block group that contains an extent is read only or not and to
increment the number of NOCOW writers in the block group. For those
operations we need to search for the block group in the tree.
Similarly, after creating the ordered extent for the NOCOW write, we
need to decrement the number of NOCOW writers from the same block
group, which requires searching for it in the tree;
4) Decreasing the number of extent reservations in a block group;
5) When allocating extents and freeing reserved extents;
6) Adding and removing free space to the free space tree;
7) When releasing delalloc bytes during ordered extent completion;
8) When relocating a block group;
9) During fitrim, to iterate over the block groups;
10) etc;
Write accesses to the tree, to add or remove block groups, are much less
frequent as they happen only when allocating a new block group or when
deleting a block group.
We also use the same spin lock to protect the list of currently caching
block groups. Additions to this list are made when we need to cache a
block group, because we don't have a free space cache for it (or we have
but it's invalid), and removals from this list are done when caching of
the block group's free space finishes. These cases are also not very
common, but when they happen, they happen only once when the filesystem
is mounted.
So switch the lock that protects the tree of block groups from a spinning
lock to a read/write lock.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-04-13 16:20:41 +01:00
|
|
|
write_unlock(&info->block_group_cache_lock);
|
2019-06-20 15:37:57 -04:00
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2019-06-20 15:37:45 -04:00
|
|
|
/*
|
|
|
|
* This will return the block group at or after bytenr if contains is 0, else
|
|
|
|
* it will return the block group that contains the bytenr
|
|
|
|
*/
|
2019-10-29 19:20:18 +01:00
|
|
|
static struct btrfs_block_group *block_group_cache_tree_search(
|
2019-06-20 15:37:45 -04:00
|
|
|
struct btrfs_fs_info *info, u64 bytenr, int contains)
|
|
|
|
{
|
2019-10-29 19:20:18 +01:00
|
|
|
struct btrfs_block_group *cache, *ret = NULL;
|
2019-06-20 15:37:45 -04:00
|
|
|
struct rb_node *n;
|
|
|
|
u64 end, start;
|
|
|
|
|
btrfs: use a read/write lock for protecting the block groups tree
Currently we use a spin lock to protect the red black tree that we use to
track block groups. Most accesses to that tree are actually read only and
for large filesystems, with thousands of block groups, it actually has
a bad impact on performance, as concurrent read only searches on the tree
are serialized.
Read only searches on the tree are very frequent and done when:
1) Pinning and unpinning extents, as we need to lookup the respective
block group from the tree;
2) Freeing the last reference of a tree block, regardless if we pin the
underlying extent or add it back to free space cache/tree;
3) During NOCOW writes, both buffered IO and direct IO, we need to check
if the block group that contains an extent is read only or not and to
increment the number of NOCOW writers in the block group. For those
operations we need to search for the block group in the tree.
Similarly, after creating the ordered extent for the NOCOW write, we
need to decrement the number of NOCOW writers from the same block
group, which requires searching for it in the tree;
4) Decreasing the number of extent reservations in a block group;
5) When allocating extents and freeing reserved extents;
6) Adding and removing free space to the free space tree;
7) When releasing delalloc bytes during ordered extent completion;
8) When relocating a block group;
9) During fitrim, to iterate over the block groups;
10) etc;
Write accesses to the tree, to add or remove block groups, are much less
frequent as they happen only when allocating a new block group or when
deleting a block group.
We also use the same spin lock to protect the list of currently caching
block groups. Additions to this list are made when we need to cache a
block group, because we don't have a free space cache for it (or we have
but it's invalid), and removals from this list are done when caching of
the block group's free space finishes. These cases are also not very
common, but when they happen, they happen only once when the filesystem
is mounted.
So switch the lock that protects the tree of block groups from a spinning
lock to a read/write lock.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-04-13 16:20:41 +01:00
|
|
|
read_lock(&info->block_group_cache_lock);
|
2022-04-13 16:20:40 +01:00
|
|
|
n = info->block_group_cache_tree.rb_root.rb_node;
|
2019-06-20 15:37:45 -04:00
|
|
|
|
|
|
|
while (n) {
|
2019-10-29 19:20:18 +01:00
|
|
|
cache = rb_entry(n, struct btrfs_block_group, cache_node);
|
2019-10-23 18:48:22 +02:00
|
|
|
end = cache->start + cache->length - 1;
|
|
|
|
start = cache->start;
|
2019-06-20 15:37:45 -04:00
|
|
|
|
|
|
|
if (bytenr < start) {
|
2019-10-23 18:48:22 +02:00
|
|
|
if (!contains && (!ret || start < ret->start))
|
2019-06-20 15:37:45 -04:00
|
|
|
ret = cache;
|
|
|
|
n = n->rb_left;
|
|
|
|
} else if (bytenr > start) {
|
|
|
|
if (contains && bytenr <= end) {
|
|
|
|
ret = cache;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
n = n->rb_right;
|
|
|
|
} else {
|
|
|
|
ret = cache;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
}
|
2022-04-13 16:20:40 +01:00
|
|
|
if (ret)
|
2019-06-20 15:37:45 -04:00
|
|
|
btrfs_get_block_group(ret);
|
btrfs: use a read/write lock for protecting the block groups tree
Currently we use a spin lock to protect the red black tree that we use to
track block groups. Most accesses to that tree are actually read only and
for large filesystems, with thousands of block groups, it actually has
a bad impact on performance, as concurrent read only searches on the tree
are serialized.
Read only searches on the tree are very frequent and done when:
1) Pinning and unpinning extents, as we need to lookup the respective
block group from the tree;
2) Freeing the last reference of a tree block, regardless if we pin the
underlying extent or add it back to free space cache/tree;
3) During NOCOW writes, both buffered IO and direct IO, we need to check
if the block group that contains an extent is read only or not and to
increment the number of NOCOW writers in the block group. For those
operations we need to search for the block group in the tree.
Similarly, after creating the ordered extent for the NOCOW write, we
need to decrement the number of NOCOW writers from the same block
group, which requires searching for it in the tree;
4) Decreasing the number of extent reservations in a block group;
5) When allocating extents and freeing reserved extents;
6) Adding and removing free space to the free space tree;
7) When releasing delalloc bytes during ordered extent completion;
8) When relocating a block group;
9) During fitrim, to iterate over the block groups;
10) etc;
Write accesses to the tree, to add or remove block groups, are much less
frequent as they happen only when allocating a new block group or when
deleting a block group.
We also use the same spin lock to protect the list of currently caching
block groups. Additions to this list are made when we need to cache a
block group, because we don't have a free space cache for it (or we have
but it's invalid), and removals from this list are done when caching of
the block group's free space finishes. These cases are also not very
common, but when they happen, they happen only once when the filesystem
is mounted.
So switch the lock that protects the tree of block groups from a spinning
lock to a read/write lock.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-04-13 16:20:41 +01:00
|
|
|
read_unlock(&info->block_group_cache_lock);
|
2019-06-20 15:37:45 -04:00
|
|
|
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Return the block group that starts at or after bytenr
|
|
|
|
*/
|
2019-10-29 19:20:18 +01:00
|
|
|
struct btrfs_block_group *btrfs_lookup_first_block_group(
|
2019-06-20 15:37:45 -04:00
|
|
|
struct btrfs_fs_info *info, u64 bytenr)
|
|
|
|
{
|
|
|
|
return block_group_cache_tree_search(info, bytenr, 0);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Return the block group that contains the given bytenr
|
|
|
|
*/
|
2019-10-29 19:20:18 +01:00
|
|
|
struct btrfs_block_group *btrfs_lookup_block_group(
|
2019-06-20 15:37:45 -04:00
|
|
|
struct btrfs_fs_info *info, u64 bytenr)
|
|
|
|
{
|
|
|
|
return block_group_cache_tree_search(info, bytenr, 1);
|
|
|
|
}
|
|
|
|
|
2019-10-29 19:20:18 +01:00
|
|
|
struct btrfs_block_group *btrfs_next_block_group(
|
|
|
|
struct btrfs_block_group *cache)
|
2019-06-20 15:37:45 -04:00
|
|
|
{
|
|
|
|
struct btrfs_fs_info *fs_info = cache->fs_info;
|
|
|
|
struct rb_node *node;
|
|
|
|
|
btrfs: use a read/write lock for protecting the block groups tree
Currently we use a spin lock to protect the red black tree that we use to
track block groups. Most accesses to that tree are actually read only and
for large filesystems, with thousands of block groups, it actually has
a bad impact on performance, as concurrent read only searches on the tree
are serialized.
Read only searches on the tree are very frequent and done when:
1) Pinning and unpinning extents, as we need to lookup the respective
block group from the tree;
2) Freeing the last reference of a tree block, regardless if we pin the
underlying extent or add it back to free space cache/tree;
3) During NOCOW writes, both buffered IO and direct IO, we need to check
if the block group that contains an extent is read only or not and to
increment the number of NOCOW writers in the block group. For those
operations we need to search for the block group in the tree.
Similarly, after creating the ordered extent for the NOCOW write, we
need to decrement the number of NOCOW writers from the same block
group, which requires searching for it in the tree;
4) Decreasing the number of extent reservations in a block group;
5) When allocating extents and freeing reserved extents;
6) Adding and removing free space to the free space tree;
7) When releasing delalloc bytes during ordered extent completion;
8) When relocating a block group;
9) During fitrim, to iterate over the block groups;
10) etc;
Write accesses to the tree, to add or remove block groups, are much less
frequent as they happen only when allocating a new block group or when
deleting a block group.
We also use the same spin lock to protect the list of currently caching
block groups. Additions to this list are made when we need to cache a
block group, because we don't have a free space cache for it (or we have
but it's invalid), and removals from this list are done when caching of
the block group's free space finishes. These cases are also not very
common, but when they happen, they happen only once when the filesystem
is mounted.
So switch the lock that protects the tree of block groups from a spinning
lock to a read/write lock.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-04-13 16:20:41 +01:00
|
|
|
read_lock(&fs_info->block_group_cache_lock);
|
2019-06-20 15:37:45 -04:00
|
|
|
|
|
|
|
/* If our block group was removed, we need a full search. */
|
|
|
|
if (RB_EMPTY_NODE(&cache->cache_node)) {
|
2019-10-23 18:48:22 +02:00
|
|
|
const u64 next_bytenr = cache->start + cache->length;
|
2019-06-20 15:37:45 -04:00
|
|
|
|
btrfs: use a read/write lock for protecting the block groups tree
Currently we use a spin lock to protect the red black tree that we use to
track block groups. Most accesses to that tree are actually read only and
for large filesystems, with thousands of block groups, it actually has
a bad impact on performance, as concurrent read only searches on the tree
are serialized.
Read only searches on the tree are very frequent and done when:
1) Pinning and unpinning extents, as we need to lookup the respective
block group from the tree;
2) Freeing the last reference of a tree block, regardless if we pin the
underlying extent or add it back to free space cache/tree;
3) During NOCOW writes, both buffered IO and direct IO, we need to check
if the block group that contains an extent is read only or not and to
increment the number of NOCOW writers in the block group. For those
operations we need to search for the block group in the tree.
Similarly, after creating the ordered extent for the NOCOW write, we
need to decrement the number of NOCOW writers from the same block
group, which requires searching for it in the tree;
4) Decreasing the number of extent reservations in a block group;
5) When allocating extents and freeing reserved extents;
6) Adding and removing free space to the free space tree;
7) When releasing delalloc bytes during ordered extent completion;
8) When relocating a block group;
9) During fitrim, to iterate over the block groups;
10) etc;
Write accesses to the tree, to add or remove block groups, are much less
frequent as they happen only when allocating a new block group or when
deleting a block group.
We also use the same spin lock to protect the list of currently caching
block groups. Additions to this list are made when we need to cache a
block group, because we don't have a free space cache for it (or we have
but it's invalid), and removals from this list are done when caching of
the block group's free space finishes. These cases are also not very
common, but when they happen, they happen only once when the filesystem
is mounted.
So switch the lock that protects the tree of block groups from a spinning
lock to a read/write lock.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-04-13 16:20:41 +01:00
|
|
|
read_unlock(&fs_info->block_group_cache_lock);
|
2019-06-20 15:37:45 -04:00
|
|
|
btrfs_put_block_group(cache);
|
2022-04-13 16:20:42 +01:00
|
|
|
return btrfs_lookup_first_block_group(fs_info, next_bytenr);
|
2019-06-20 15:37:45 -04:00
|
|
|
}
|
|
|
|
node = rb_next(&cache->cache_node);
|
|
|
|
btrfs_put_block_group(cache);
|
|
|
|
if (node) {
|
2019-10-29 19:20:18 +01:00
|
|
|
cache = rb_entry(node, struct btrfs_block_group, cache_node);
|
2019-06-20 15:37:45 -04:00
|
|
|
btrfs_get_block_group(cache);
|
|
|
|
} else
|
|
|
|
cache = NULL;
|
btrfs: use a read/write lock for protecting the block groups tree
Currently we use a spin lock to protect the red black tree that we use to
track block groups. Most accesses to that tree are actually read only and
for large filesystems, with thousands of block groups, it actually has
a bad impact on performance, as concurrent read only searches on the tree
are serialized.
Read only searches on the tree are very frequent and done when:
1) Pinning and unpinning extents, as we need to lookup the respective
block group from the tree;
2) Freeing the last reference of a tree block, regardless if we pin the
underlying extent or add it back to free space cache/tree;
3) During NOCOW writes, both buffered IO and direct IO, we need to check
if the block group that contains an extent is read only or not and to
increment the number of NOCOW writers in the block group. For those
operations we need to search for the block group in the tree.
Similarly, after creating the ordered extent for the NOCOW write, we
need to decrement the number of NOCOW writers from the same block
group, which requires searching for it in the tree;
4) Decreasing the number of extent reservations in a block group;
5) When allocating extents and freeing reserved extents;
6) Adding and removing free space to the free space tree;
7) When releasing delalloc bytes during ordered extent completion;
8) When relocating a block group;
9) During fitrim, to iterate over the block groups;
10) etc;
Write accesses to the tree, to add or remove block groups, are much less
frequent as they happen only when allocating a new block group or when
deleting a block group.
We also use the same spin lock to protect the list of currently caching
block groups. Additions to this list are made when we need to cache a
block group, because we don't have a free space cache for it (or we have
but it's invalid), and removals from this list are done when caching of
the block group's free space finishes. These cases are also not very
common, but when they happen, they happen only once when the filesystem
is mounted.
So switch the lock that protects the tree of block groups from a spinning
lock to a read/write lock.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-04-13 16:20:41 +01:00
|
|
|
read_unlock(&fs_info->block_group_cache_lock);
|
2019-06-20 15:37:45 -04:00
|
|
|
return cache;
|
|
|
|
}
|
2019-06-20 15:37:47 -04:00
|
|
|
|
2022-10-27 14:21:42 +02:00
|
|
|
/*
|
btrfs: avoid double search for block group during NOCOW writes
When doing a NOCOW write, either through direct IO or buffered IO, we do
two lookups for the block group that contains the target extent: once
when we call btrfs_inc_nocow_writers() and then later again when we call
btrfs_dec_nocow_writers() after creating the ordered extent.
The lookups require taking a lock and navigating the red black tree used
to track all block groups, which can take a non-negligible amount of time
for a large filesystem with thousands of block groups, as well as lock
contention and cache line bouncing.
Improve on this by having a single block group search: making
btrfs_inc_nocow_writers() return the block group to its caller and then
have the caller pass that block group to btrfs_dec_nocow_writers().
This is part of a patchset comprised of the following patches:
btrfs: remove search start argument from first_logical_byte()
btrfs: use rbtree with leftmost node cached for tracking lowest block group
btrfs: use a read/write lock for protecting the block groups tree
btrfs: return block group directly at btrfs_next_block_group()
btrfs: avoid double search for block group during NOCOW writes
The following test was used to test these changes from a performance
perspective:
$ cat test.sh
#!/bin/bash
modprobe null_blk nr_devices=0
NULL_DEV_PATH=/sys/kernel/config/nullb/nullb0
mkdir $NULL_DEV_PATH
if [ $? -ne 0 ]; then
echo "Failed to create nullb0 directory."
exit 1
fi
echo 2 > $NULL_DEV_PATH/submit_queues
echo 16384 > $NULL_DEV_PATH/size # 16G
echo 1 > $NULL_DEV_PATH/memory_backed
echo 1 > $NULL_DEV_PATH/power
DEV=/dev/nullb0
MNT=/mnt/nullb0
LOOP_MNT="$MNT/loop"
MOUNT_OPTIONS="-o ssd -o nodatacow"
MKFS_OPTIONS="-R free-space-tree -O no-holes"
cat <<EOF > /tmp/fio-job.ini
[io_uring_writes]
rw=randwrite
fsync=0
fallocate=posix
group_reporting=1
direct=1
ioengine=io_uring
iodepth=64
bs=64k
filesize=1g
runtime=300
time_based
directory=$LOOP_MNT
numjobs=8
thread
EOF
echo performance | \
tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
echo
echo "Using config:"
echo
cat /tmp/fio-job.ini
echo
umount $MNT &> /dev/null
mkfs.btrfs -f $MKFS_OPTIONS $DEV &> /dev/null
mount $MOUNT_OPTIONS $DEV $MNT
mkdir $LOOP_MNT
truncate -s 4T $MNT/loopfile
mkfs.btrfs -f $MKFS_OPTIONS $MNT/loopfile &> /dev/null
mount $MOUNT_OPTIONS $MNT/loopfile $LOOP_MNT
# Trigger the allocation of about 3500 data block groups, without
# actually consuming space on underlying filesystem, just to make
# the tree of block group large.
fallocate -l 3500G $LOOP_MNT/filler
fio /tmp/fio-job.ini
umount $LOOP_MNT
umount $MNT
echo 0 > $NULL_DEV_PATH/power
rmdir $NULL_DEV_PATH
The test was run on a non-debug kernel (Debian's default kernel config),
the result were the following.
Before patchset:
WRITE: bw=1455MiB/s (1526MB/s), 1455MiB/s-1455MiB/s (1526MB/s-1526MB/s), io=426GiB (458GB), run=300006-300006msec
After patchset:
WRITE: bw=1503MiB/s (1577MB/s), 1503MiB/s-1503MiB/s (1577MB/s-1577MB/s), io=440GiB (473GB), run=300006-300006msec
+3.3% write throughput and +3.3% IO done in the same time period.
The test has somewhat limited coverage scope, as with only NOCOW writes
we get less contention on the red black tree of block groups, since we
don't have the extra contention caused by COW writes, namely when
allocating data extents, pinning and unpinning data extents, but on the
hand there's access to tree in the NOCOW path, when incrementing a block
group's number of NOCOW writers.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-04-13 16:20:43 +01:00
|
|
|
* Check if we can do a NOCOW write for a given extent.
|
|
|
|
*
|
|
|
|
* @fs_info: The filesystem information object.
|
|
|
|
* @bytenr: Logical start address of the extent.
|
|
|
|
*
|
|
|
|
* Check if we can do a NOCOW write for the given extent, and increments the
|
|
|
|
* number of NOCOW writers in the block group that contains the extent, as long
|
|
|
|
* as the block group exists and it's currently not in read-only mode.
|
|
|
|
*
|
|
|
|
* Returns: A non-NULL block group pointer if we can do a NOCOW write, the caller
|
|
|
|
* is responsible for calling btrfs_dec_nocow_writers() later.
|
|
|
|
*
|
|
|
|
* Or NULL if we can not do a NOCOW write
|
|
|
|
*/
|
|
|
|
struct btrfs_block_group *btrfs_inc_nocow_writers(struct btrfs_fs_info *fs_info,
|
|
|
|
u64 bytenr)
|
2019-06-20 15:37:47 -04:00
|
|
|
{
|
2019-10-29 19:20:18 +01:00
|
|
|
struct btrfs_block_group *bg;
|
btrfs: avoid double search for block group during NOCOW writes
When doing a NOCOW write, either through direct IO or buffered IO, we do
two lookups for the block group that contains the target extent: once
when we call btrfs_inc_nocow_writers() and then later again when we call
btrfs_dec_nocow_writers() after creating the ordered extent.
The lookups require taking a lock and navigating the red black tree used
to track all block groups, which can take a non-negligible amount of time
for a large filesystem with thousands of block groups, as well as lock
contention and cache line bouncing.
Improve on this by having a single block group search: making
btrfs_inc_nocow_writers() return the block group to its caller and then
have the caller pass that block group to btrfs_dec_nocow_writers().
This is part of a patchset comprised of the following patches:
btrfs: remove search start argument from first_logical_byte()
btrfs: use rbtree with leftmost node cached for tracking lowest block group
btrfs: use a read/write lock for protecting the block groups tree
btrfs: return block group directly at btrfs_next_block_group()
btrfs: avoid double search for block group during NOCOW writes
The following test was used to test these changes from a performance
perspective:
$ cat test.sh
#!/bin/bash
modprobe null_blk nr_devices=0
NULL_DEV_PATH=/sys/kernel/config/nullb/nullb0
mkdir $NULL_DEV_PATH
if [ $? -ne 0 ]; then
echo "Failed to create nullb0 directory."
exit 1
fi
echo 2 > $NULL_DEV_PATH/submit_queues
echo 16384 > $NULL_DEV_PATH/size # 16G
echo 1 > $NULL_DEV_PATH/memory_backed
echo 1 > $NULL_DEV_PATH/power
DEV=/dev/nullb0
MNT=/mnt/nullb0
LOOP_MNT="$MNT/loop"
MOUNT_OPTIONS="-o ssd -o nodatacow"
MKFS_OPTIONS="-R free-space-tree -O no-holes"
cat <<EOF > /tmp/fio-job.ini
[io_uring_writes]
rw=randwrite
fsync=0
fallocate=posix
group_reporting=1
direct=1
ioengine=io_uring
iodepth=64
bs=64k
filesize=1g
runtime=300
time_based
directory=$LOOP_MNT
numjobs=8
thread
EOF
echo performance | \
tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
echo
echo "Using config:"
echo
cat /tmp/fio-job.ini
echo
umount $MNT &> /dev/null
mkfs.btrfs -f $MKFS_OPTIONS $DEV &> /dev/null
mount $MOUNT_OPTIONS $DEV $MNT
mkdir $LOOP_MNT
truncate -s 4T $MNT/loopfile
mkfs.btrfs -f $MKFS_OPTIONS $MNT/loopfile &> /dev/null
mount $MOUNT_OPTIONS $MNT/loopfile $LOOP_MNT
# Trigger the allocation of about 3500 data block groups, without
# actually consuming space on underlying filesystem, just to make
# the tree of block group large.
fallocate -l 3500G $LOOP_MNT/filler
fio /tmp/fio-job.ini
umount $LOOP_MNT
umount $MNT
echo 0 > $NULL_DEV_PATH/power
rmdir $NULL_DEV_PATH
The test was run on a non-debug kernel (Debian's default kernel config),
the result were the following.
Before patchset:
WRITE: bw=1455MiB/s (1526MB/s), 1455MiB/s-1455MiB/s (1526MB/s-1526MB/s), io=426GiB (458GB), run=300006-300006msec
After patchset:
WRITE: bw=1503MiB/s (1577MB/s), 1503MiB/s-1503MiB/s (1577MB/s-1577MB/s), io=440GiB (473GB), run=300006-300006msec
+3.3% write throughput and +3.3% IO done in the same time period.
The test has somewhat limited coverage scope, as with only NOCOW writes
we get less contention on the red black tree of block groups, since we
don't have the extra contention caused by COW writes, namely when
allocating data extents, pinning and unpinning data extents, but on the
hand there's access to tree in the NOCOW path, when incrementing a block
group's number of NOCOW writers.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-04-13 16:20:43 +01:00
|
|
|
bool can_nocow = true;
|
2019-06-20 15:37:47 -04:00
|
|
|
|
|
|
|
bg = btrfs_lookup_block_group(fs_info, bytenr);
|
|
|
|
if (!bg)
|
btrfs: avoid double search for block group during NOCOW writes
When doing a NOCOW write, either through direct IO or buffered IO, we do
two lookups for the block group that contains the target extent: once
when we call btrfs_inc_nocow_writers() and then later again when we call
btrfs_dec_nocow_writers() after creating the ordered extent.
The lookups require taking a lock and navigating the red black tree used
to track all block groups, which can take a non-negligible amount of time
for a large filesystem with thousands of block groups, as well as lock
contention and cache line bouncing.
Improve on this by having a single block group search: making
btrfs_inc_nocow_writers() return the block group to its caller and then
have the caller pass that block group to btrfs_dec_nocow_writers().
This is part of a patchset comprised of the following patches:
btrfs: remove search start argument from first_logical_byte()
btrfs: use rbtree with leftmost node cached for tracking lowest block group
btrfs: use a read/write lock for protecting the block groups tree
btrfs: return block group directly at btrfs_next_block_group()
btrfs: avoid double search for block group during NOCOW writes
The following test was used to test these changes from a performance
perspective:
$ cat test.sh
#!/bin/bash
modprobe null_blk nr_devices=0
NULL_DEV_PATH=/sys/kernel/config/nullb/nullb0
mkdir $NULL_DEV_PATH
if [ $? -ne 0 ]; then
echo "Failed to create nullb0 directory."
exit 1
fi
echo 2 > $NULL_DEV_PATH/submit_queues
echo 16384 > $NULL_DEV_PATH/size # 16G
echo 1 > $NULL_DEV_PATH/memory_backed
echo 1 > $NULL_DEV_PATH/power
DEV=/dev/nullb0
MNT=/mnt/nullb0
LOOP_MNT="$MNT/loop"
MOUNT_OPTIONS="-o ssd -o nodatacow"
MKFS_OPTIONS="-R free-space-tree -O no-holes"
cat <<EOF > /tmp/fio-job.ini
[io_uring_writes]
rw=randwrite
fsync=0
fallocate=posix
group_reporting=1
direct=1
ioengine=io_uring
iodepth=64
bs=64k
filesize=1g
runtime=300
time_based
directory=$LOOP_MNT
numjobs=8
thread
EOF
echo performance | \
tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
echo
echo "Using config:"
echo
cat /tmp/fio-job.ini
echo
umount $MNT &> /dev/null
mkfs.btrfs -f $MKFS_OPTIONS $DEV &> /dev/null
mount $MOUNT_OPTIONS $DEV $MNT
mkdir $LOOP_MNT
truncate -s 4T $MNT/loopfile
mkfs.btrfs -f $MKFS_OPTIONS $MNT/loopfile &> /dev/null
mount $MOUNT_OPTIONS $MNT/loopfile $LOOP_MNT
# Trigger the allocation of about 3500 data block groups, without
# actually consuming space on underlying filesystem, just to make
# the tree of block group large.
fallocate -l 3500G $LOOP_MNT/filler
fio /tmp/fio-job.ini
umount $LOOP_MNT
umount $MNT
echo 0 > $NULL_DEV_PATH/power
rmdir $NULL_DEV_PATH
The test was run on a non-debug kernel (Debian's default kernel config),
the result were the following.
Before patchset:
WRITE: bw=1455MiB/s (1526MB/s), 1455MiB/s-1455MiB/s (1526MB/s-1526MB/s), io=426GiB (458GB), run=300006-300006msec
After patchset:
WRITE: bw=1503MiB/s (1577MB/s), 1503MiB/s-1503MiB/s (1577MB/s-1577MB/s), io=440GiB (473GB), run=300006-300006msec
+3.3% write throughput and +3.3% IO done in the same time period.
The test has somewhat limited coverage scope, as with only NOCOW writes
we get less contention on the red black tree of block groups, since we
don't have the extra contention caused by COW writes, namely when
allocating data extents, pinning and unpinning data extents, but on the
hand there's access to tree in the NOCOW path, when incrementing a block
group's number of NOCOW writers.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-04-13 16:20:43 +01:00
|
|
|
return NULL;
|
2019-06-20 15:37:47 -04:00
|
|
|
|
|
|
|
spin_lock(&bg->lock);
|
|
|
|
if (bg->ro)
|
btrfs: avoid double search for block group during NOCOW writes
When doing a NOCOW write, either through direct IO or buffered IO, we do
two lookups for the block group that contains the target extent: once
when we call btrfs_inc_nocow_writers() and then later again when we call
btrfs_dec_nocow_writers() after creating the ordered extent.
The lookups require taking a lock and navigating the red black tree used
to track all block groups, which can take a non-negligible amount of time
for a large filesystem with thousands of block groups, as well as lock
contention and cache line bouncing.
Improve on this by having a single block group search: making
btrfs_inc_nocow_writers() return the block group to its caller and then
have the caller pass that block group to btrfs_dec_nocow_writers().
This is part of a patchset comprised of the following patches:
btrfs: remove search start argument from first_logical_byte()
btrfs: use rbtree with leftmost node cached for tracking lowest block group
btrfs: use a read/write lock for protecting the block groups tree
btrfs: return block group directly at btrfs_next_block_group()
btrfs: avoid double search for block group during NOCOW writes
The following test was used to test these changes from a performance
perspective:
$ cat test.sh
#!/bin/bash
modprobe null_blk nr_devices=0
NULL_DEV_PATH=/sys/kernel/config/nullb/nullb0
mkdir $NULL_DEV_PATH
if [ $? -ne 0 ]; then
echo "Failed to create nullb0 directory."
exit 1
fi
echo 2 > $NULL_DEV_PATH/submit_queues
echo 16384 > $NULL_DEV_PATH/size # 16G
echo 1 > $NULL_DEV_PATH/memory_backed
echo 1 > $NULL_DEV_PATH/power
DEV=/dev/nullb0
MNT=/mnt/nullb0
LOOP_MNT="$MNT/loop"
MOUNT_OPTIONS="-o ssd -o nodatacow"
MKFS_OPTIONS="-R free-space-tree -O no-holes"
cat <<EOF > /tmp/fio-job.ini
[io_uring_writes]
rw=randwrite
fsync=0
fallocate=posix
group_reporting=1
direct=1
ioengine=io_uring
iodepth=64
bs=64k
filesize=1g
runtime=300
time_based
directory=$LOOP_MNT
numjobs=8
thread
EOF
echo performance | \
tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
echo
echo "Using config:"
echo
cat /tmp/fio-job.ini
echo
umount $MNT &> /dev/null
mkfs.btrfs -f $MKFS_OPTIONS $DEV &> /dev/null
mount $MOUNT_OPTIONS $DEV $MNT
mkdir $LOOP_MNT
truncate -s 4T $MNT/loopfile
mkfs.btrfs -f $MKFS_OPTIONS $MNT/loopfile &> /dev/null
mount $MOUNT_OPTIONS $MNT/loopfile $LOOP_MNT
# Trigger the allocation of about 3500 data block groups, without
# actually consuming space on underlying filesystem, just to make
# the tree of block group large.
fallocate -l 3500G $LOOP_MNT/filler
fio /tmp/fio-job.ini
umount $LOOP_MNT
umount $MNT
echo 0 > $NULL_DEV_PATH/power
rmdir $NULL_DEV_PATH
The test was run on a non-debug kernel (Debian's default kernel config),
the result were the following.
Before patchset:
WRITE: bw=1455MiB/s (1526MB/s), 1455MiB/s-1455MiB/s (1526MB/s-1526MB/s), io=426GiB (458GB), run=300006-300006msec
After patchset:
WRITE: bw=1503MiB/s (1577MB/s), 1503MiB/s-1503MiB/s (1577MB/s-1577MB/s), io=440GiB (473GB), run=300006-300006msec
+3.3% write throughput and +3.3% IO done in the same time period.
The test has somewhat limited coverage scope, as with only NOCOW writes
we get less contention on the red black tree of block groups, since we
don't have the extra contention caused by COW writes, namely when
allocating data extents, pinning and unpinning data extents, but on the
hand there's access to tree in the NOCOW path, when incrementing a block
group's number of NOCOW writers.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-04-13 16:20:43 +01:00
|
|
|
can_nocow = false;
|
2019-06-20 15:37:47 -04:00
|
|
|
else
|
|
|
|
atomic_inc(&bg->nocow_writers);
|
|
|
|
spin_unlock(&bg->lock);
|
|
|
|
|
btrfs: avoid double search for block group during NOCOW writes
When doing a NOCOW write, either through direct IO or buffered IO, we do
two lookups for the block group that contains the target extent: once
when we call btrfs_inc_nocow_writers() and then later again when we call
btrfs_dec_nocow_writers() after creating the ordered extent.
The lookups require taking a lock and navigating the red black tree used
to track all block groups, which can take a non-negligible amount of time
for a large filesystem with thousands of block groups, as well as lock
contention and cache line bouncing.
Improve on this by having a single block group search: making
btrfs_inc_nocow_writers() return the block group to its caller and then
have the caller pass that block group to btrfs_dec_nocow_writers().
This is part of a patchset comprised of the following patches:
btrfs: remove search start argument from first_logical_byte()
btrfs: use rbtree with leftmost node cached for tracking lowest block group
btrfs: use a read/write lock for protecting the block groups tree
btrfs: return block group directly at btrfs_next_block_group()
btrfs: avoid double search for block group during NOCOW writes
The following test was used to test these changes from a performance
perspective:
$ cat test.sh
#!/bin/bash
modprobe null_blk nr_devices=0
NULL_DEV_PATH=/sys/kernel/config/nullb/nullb0
mkdir $NULL_DEV_PATH
if [ $? -ne 0 ]; then
echo "Failed to create nullb0 directory."
exit 1
fi
echo 2 > $NULL_DEV_PATH/submit_queues
echo 16384 > $NULL_DEV_PATH/size # 16G
echo 1 > $NULL_DEV_PATH/memory_backed
echo 1 > $NULL_DEV_PATH/power
DEV=/dev/nullb0
MNT=/mnt/nullb0
LOOP_MNT="$MNT/loop"
MOUNT_OPTIONS="-o ssd -o nodatacow"
MKFS_OPTIONS="-R free-space-tree -O no-holes"
cat <<EOF > /tmp/fio-job.ini
[io_uring_writes]
rw=randwrite
fsync=0
fallocate=posix
group_reporting=1
direct=1
ioengine=io_uring
iodepth=64
bs=64k
filesize=1g
runtime=300
time_based
directory=$LOOP_MNT
numjobs=8
thread
EOF
echo performance | \
tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
echo
echo "Using config:"
echo
cat /tmp/fio-job.ini
echo
umount $MNT &> /dev/null
mkfs.btrfs -f $MKFS_OPTIONS $DEV &> /dev/null
mount $MOUNT_OPTIONS $DEV $MNT
mkdir $LOOP_MNT
truncate -s 4T $MNT/loopfile
mkfs.btrfs -f $MKFS_OPTIONS $MNT/loopfile &> /dev/null
mount $MOUNT_OPTIONS $MNT/loopfile $LOOP_MNT
# Trigger the allocation of about 3500 data block groups, without
# actually consuming space on underlying filesystem, just to make
# the tree of block group large.
fallocate -l 3500G $LOOP_MNT/filler
fio /tmp/fio-job.ini
umount $LOOP_MNT
umount $MNT
echo 0 > $NULL_DEV_PATH/power
rmdir $NULL_DEV_PATH
The test was run on a non-debug kernel (Debian's default kernel config),
the result were the following.
Before patchset:
WRITE: bw=1455MiB/s (1526MB/s), 1455MiB/s-1455MiB/s (1526MB/s-1526MB/s), io=426GiB (458GB), run=300006-300006msec
After patchset:
WRITE: bw=1503MiB/s (1577MB/s), 1503MiB/s-1503MiB/s (1577MB/s-1577MB/s), io=440GiB (473GB), run=300006-300006msec
+3.3% write throughput and +3.3% IO done in the same time period.
The test has somewhat limited coverage scope, as with only NOCOW writes
we get less contention on the red black tree of block groups, since we
don't have the extra contention caused by COW writes, namely when
allocating data extents, pinning and unpinning data extents, but on the
hand there's access to tree in the NOCOW path, when incrementing a block
group's number of NOCOW writers.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-04-13 16:20:43 +01:00
|
|
|
if (!can_nocow) {
|
2019-06-20 15:37:47 -04:00
|
|
|
btrfs_put_block_group(bg);
|
btrfs: avoid double search for block group during NOCOW writes
When doing a NOCOW write, either through direct IO or buffered IO, we do
two lookups for the block group that contains the target extent: once
when we call btrfs_inc_nocow_writers() and then later again when we call
btrfs_dec_nocow_writers() after creating the ordered extent.
The lookups require taking a lock and navigating the red black tree used
to track all block groups, which can take a non-negligible amount of time
for a large filesystem with thousands of block groups, as well as lock
contention and cache line bouncing.
Improve on this by having a single block group search: making
btrfs_inc_nocow_writers() return the block group to its caller and then
have the caller pass that block group to btrfs_dec_nocow_writers().
This is part of a patchset comprised of the following patches:
btrfs: remove search start argument from first_logical_byte()
btrfs: use rbtree with leftmost node cached for tracking lowest block group
btrfs: use a read/write lock for protecting the block groups tree
btrfs: return block group directly at btrfs_next_block_group()
btrfs: avoid double search for block group during NOCOW writes
The following test was used to test these changes from a performance
perspective:
$ cat test.sh
#!/bin/bash
modprobe null_blk nr_devices=0
NULL_DEV_PATH=/sys/kernel/config/nullb/nullb0
mkdir $NULL_DEV_PATH
if [ $? -ne 0 ]; then
echo "Failed to create nullb0 directory."
exit 1
fi
echo 2 > $NULL_DEV_PATH/submit_queues
echo 16384 > $NULL_DEV_PATH/size # 16G
echo 1 > $NULL_DEV_PATH/memory_backed
echo 1 > $NULL_DEV_PATH/power
DEV=/dev/nullb0
MNT=/mnt/nullb0
LOOP_MNT="$MNT/loop"
MOUNT_OPTIONS="-o ssd -o nodatacow"
MKFS_OPTIONS="-R free-space-tree -O no-holes"
cat <<EOF > /tmp/fio-job.ini
[io_uring_writes]
rw=randwrite
fsync=0
fallocate=posix
group_reporting=1
direct=1
ioengine=io_uring
iodepth=64
bs=64k
filesize=1g
runtime=300
time_based
directory=$LOOP_MNT
numjobs=8
thread
EOF
echo performance | \
tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
echo
echo "Using config:"
echo
cat /tmp/fio-job.ini
echo
umount $MNT &> /dev/null
mkfs.btrfs -f $MKFS_OPTIONS $DEV &> /dev/null
mount $MOUNT_OPTIONS $DEV $MNT
mkdir $LOOP_MNT
truncate -s 4T $MNT/loopfile
mkfs.btrfs -f $MKFS_OPTIONS $MNT/loopfile &> /dev/null
mount $MOUNT_OPTIONS $MNT/loopfile $LOOP_MNT
# Trigger the allocation of about 3500 data block groups, without
# actually consuming space on underlying filesystem, just to make
# the tree of block group large.
fallocate -l 3500G $LOOP_MNT/filler
fio /tmp/fio-job.ini
umount $LOOP_MNT
umount $MNT
echo 0 > $NULL_DEV_PATH/power
rmdir $NULL_DEV_PATH
The test was run on a non-debug kernel (Debian's default kernel config),
the result were the following.
Before patchset:
WRITE: bw=1455MiB/s (1526MB/s), 1455MiB/s-1455MiB/s (1526MB/s-1526MB/s), io=426GiB (458GB), run=300006-300006msec
After patchset:
WRITE: bw=1503MiB/s (1577MB/s), 1503MiB/s-1503MiB/s (1577MB/s-1577MB/s), io=440GiB (473GB), run=300006-300006msec
+3.3% write throughput and +3.3% IO done in the same time period.
The test has somewhat limited coverage scope, as with only NOCOW writes
we get less contention on the red black tree of block groups, since we
don't have the extra contention caused by COW writes, namely when
allocating data extents, pinning and unpinning data extents, but on the
hand there's access to tree in the NOCOW path, when incrementing a block
group's number of NOCOW writers.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-04-13 16:20:43 +01:00
|
|
|
return NULL;
|
|
|
|
}
|
2019-06-20 15:37:47 -04:00
|
|
|
|
btrfs: avoid double search for block group during NOCOW writes
When doing a NOCOW write, either through direct IO or buffered IO, we do
two lookups for the block group that contains the target extent: once
when we call btrfs_inc_nocow_writers() and then later again when we call
btrfs_dec_nocow_writers() after creating the ordered extent.
The lookups require taking a lock and navigating the red black tree used
to track all block groups, which can take a non-negligible amount of time
for a large filesystem with thousands of block groups, as well as lock
contention and cache line bouncing.
Improve on this by having a single block group search: making
btrfs_inc_nocow_writers() return the block group to its caller and then
have the caller pass that block group to btrfs_dec_nocow_writers().
This is part of a patchset comprised of the following patches:
btrfs: remove search start argument from first_logical_byte()
btrfs: use rbtree with leftmost node cached for tracking lowest block group
btrfs: use a read/write lock for protecting the block groups tree
btrfs: return block group directly at btrfs_next_block_group()
btrfs: avoid double search for block group during NOCOW writes
The following test was used to test these changes from a performance
perspective:
$ cat test.sh
#!/bin/bash
modprobe null_blk nr_devices=0
NULL_DEV_PATH=/sys/kernel/config/nullb/nullb0
mkdir $NULL_DEV_PATH
if [ $? -ne 0 ]; then
echo "Failed to create nullb0 directory."
exit 1
fi
echo 2 > $NULL_DEV_PATH/submit_queues
echo 16384 > $NULL_DEV_PATH/size # 16G
echo 1 > $NULL_DEV_PATH/memory_backed
echo 1 > $NULL_DEV_PATH/power
DEV=/dev/nullb0
MNT=/mnt/nullb0
LOOP_MNT="$MNT/loop"
MOUNT_OPTIONS="-o ssd -o nodatacow"
MKFS_OPTIONS="-R free-space-tree -O no-holes"
cat <<EOF > /tmp/fio-job.ini
[io_uring_writes]
rw=randwrite
fsync=0
fallocate=posix
group_reporting=1
direct=1
ioengine=io_uring
iodepth=64
bs=64k
filesize=1g
runtime=300
time_based
directory=$LOOP_MNT
numjobs=8
thread
EOF
echo performance | \
tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
echo
echo "Using config:"
echo
cat /tmp/fio-job.ini
echo
umount $MNT &> /dev/null
mkfs.btrfs -f $MKFS_OPTIONS $DEV &> /dev/null
mount $MOUNT_OPTIONS $DEV $MNT
mkdir $LOOP_MNT
truncate -s 4T $MNT/loopfile
mkfs.btrfs -f $MKFS_OPTIONS $MNT/loopfile &> /dev/null
mount $MOUNT_OPTIONS $MNT/loopfile $LOOP_MNT
# Trigger the allocation of about 3500 data block groups, without
# actually consuming space on underlying filesystem, just to make
# the tree of block group large.
fallocate -l 3500G $LOOP_MNT/filler
fio /tmp/fio-job.ini
umount $LOOP_MNT
umount $MNT
echo 0 > $NULL_DEV_PATH/power
rmdir $NULL_DEV_PATH
The test was run on a non-debug kernel (Debian's default kernel config),
the result were the following.
Before patchset:
WRITE: bw=1455MiB/s (1526MB/s), 1455MiB/s-1455MiB/s (1526MB/s-1526MB/s), io=426GiB (458GB), run=300006-300006msec
After patchset:
WRITE: bw=1503MiB/s (1577MB/s), 1503MiB/s-1503MiB/s (1577MB/s-1577MB/s), io=440GiB (473GB), run=300006-300006msec
+3.3% write throughput and +3.3% IO done in the same time period.
The test has somewhat limited coverage scope, as with only NOCOW writes
we get less contention on the red black tree of block groups, since we
don't have the extra contention caused by COW writes, namely when
allocating data extents, pinning and unpinning data extents, but on the
hand there's access to tree in the NOCOW path, when incrementing a block
group's number of NOCOW writers.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-04-13 16:20:43 +01:00
|
|
|
/* No put on block group, done by btrfs_dec_nocow_writers(). */
|
|
|
|
return bg;
|
2019-06-20 15:37:47 -04:00
|
|
|
}
|
|
|
|
|
2022-10-27 14:21:42 +02:00
|
|
|
/*
|
btrfs: avoid double search for block group during NOCOW writes
When doing a NOCOW write, either through direct IO or buffered IO, we do
two lookups for the block group that contains the target extent: once
when we call btrfs_inc_nocow_writers() and then later again when we call
btrfs_dec_nocow_writers() after creating the ordered extent.
The lookups require taking a lock and navigating the red black tree used
to track all block groups, which can take a non-negligible amount of time
for a large filesystem with thousands of block groups, as well as lock
contention and cache line bouncing.
Improve on this by having a single block group search: making
btrfs_inc_nocow_writers() return the block group to its caller and then
have the caller pass that block group to btrfs_dec_nocow_writers().
This is part of a patchset comprised of the following patches:
btrfs: remove search start argument from first_logical_byte()
btrfs: use rbtree with leftmost node cached for tracking lowest block group
btrfs: use a read/write lock for protecting the block groups tree
btrfs: return block group directly at btrfs_next_block_group()
btrfs: avoid double search for block group during NOCOW writes
The following test was used to test these changes from a performance
perspective:
$ cat test.sh
#!/bin/bash
modprobe null_blk nr_devices=0
NULL_DEV_PATH=/sys/kernel/config/nullb/nullb0
mkdir $NULL_DEV_PATH
if [ $? -ne 0 ]; then
echo "Failed to create nullb0 directory."
exit 1
fi
echo 2 > $NULL_DEV_PATH/submit_queues
echo 16384 > $NULL_DEV_PATH/size # 16G
echo 1 > $NULL_DEV_PATH/memory_backed
echo 1 > $NULL_DEV_PATH/power
DEV=/dev/nullb0
MNT=/mnt/nullb0
LOOP_MNT="$MNT/loop"
MOUNT_OPTIONS="-o ssd -o nodatacow"
MKFS_OPTIONS="-R free-space-tree -O no-holes"
cat <<EOF > /tmp/fio-job.ini
[io_uring_writes]
rw=randwrite
fsync=0
fallocate=posix
group_reporting=1
direct=1
ioengine=io_uring
iodepth=64
bs=64k
filesize=1g
runtime=300
time_based
directory=$LOOP_MNT
numjobs=8
thread
EOF
echo performance | \
tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
echo
echo "Using config:"
echo
cat /tmp/fio-job.ini
echo
umount $MNT &> /dev/null
mkfs.btrfs -f $MKFS_OPTIONS $DEV &> /dev/null
mount $MOUNT_OPTIONS $DEV $MNT
mkdir $LOOP_MNT
truncate -s 4T $MNT/loopfile
mkfs.btrfs -f $MKFS_OPTIONS $MNT/loopfile &> /dev/null
mount $MOUNT_OPTIONS $MNT/loopfile $LOOP_MNT
# Trigger the allocation of about 3500 data block groups, without
# actually consuming space on underlying filesystem, just to make
# the tree of block group large.
fallocate -l 3500G $LOOP_MNT/filler
fio /tmp/fio-job.ini
umount $LOOP_MNT
umount $MNT
echo 0 > $NULL_DEV_PATH/power
rmdir $NULL_DEV_PATH
The test was run on a non-debug kernel (Debian's default kernel config),
the result were the following.
Before patchset:
WRITE: bw=1455MiB/s (1526MB/s), 1455MiB/s-1455MiB/s (1526MB/s-1526MB/s), io=426GiB (458GB), run=300006-300006msec
After patchset:
WRITE: bw=1503MiB/s (1577MB/s), 1503MiB/s-1503MiB/s (1577MB/s-1577MB/s), io=440GiB (473GB), run=300006-300006msec
+3.3% write throughput and +3.3% IO done in the same time period.
The test has somewhat limited coverage scope, as with only NOCOW writes
we get less contention on the red black tree of block groups, since we
don't have the extra contention caused by COW writes, namely when
allocating data extents, pinning and unpinning data extents, but on the
hand there's access to tree in the NOCOW path, when incrementing a block
group's number of NOCOW writers.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-04-13 16:20:43 +01:00
|
|
|
* Decrement the number of NOCOW writers in a block group.
|
|
|
|
*
|
|
|
|
* This is meant to be called after a previous call to btrfs_inc_nocow_writers(),
|
|
|
|
* and on the block group returned by that call. Typically this is called after
|
|
|
|
* creating an ordered extent for a NOCOW write, to prevent races with scrub and
|
|
|
|
* relocation.
|
|
|
|
*
|
|
|
|
* After this call, the caller should not use the block group anymore. It it wants
|
|
|
|
* to use it, then it should get a reference on it before calling this function.
|
|
|
|
*/
|
|
|
|
void btrfs_dec_nocow_writers(struct btrfs_block_group *bg)
|
2019-06-20 15:37:47 -04:00
|
|
|
{
|
|
|
|
if (atomic_dec_and_test(&bg->nocow_writers))
|
|
|
|
wake_up_var(&bg->nocow_writers);
|
btrfs: avoid double search for block group during NOCOW writes
When doing a NOCOW write, either through direct IO or buffered IO, we do
two lookups for the block group that contains the target extent: once
when we call btrfs_inc_nocow_writers() and then later again when we call
btrfs_dec_nocow_writers() after creating the ordered extent.
The lookups require taking a lock and navigating the red black tree used
to track all block groups, which can take a non-negligible amount of time
for a large filesystem with thousands of block groups, as well as lock
contention and cache line bouncing.
Improve on this by having a single block group search: making
btrfs_inc_nocow_writers() return the block group to its caller and then
have the caller pass that block group to btrfs_dec_nocow_writers().
This is part of a patchset comprised of the following patches:
btrfs: remove search start argument from first_logical_byte()
btrfs: use rbtree with leftmost node cached for tracking lowest block group
btrfs: use a read/write lock for protecting the block groups tree
btrfs: return block group directly at btrfs_next_block_group()
btrfs: avoid double search for block group during NOCOW writes
The following test was used to test these changes from a performance
perspective:
$ cat test.sh
#!/bin/bash
modprobe null_blk nr_devices=0
NULL_DEV_PATH=/sys/kernel/config/nullb/nullb0
mkdir $NULL_DEV_PATH
if [ $? -ne 0 ]; then
echo "Failed to create nullb0 directory."
exit 1
fi
echo 2 > $NULL_DEV_PATH/submit_queues
echo 16384 > $NULL_DEV_PATH/size # 16G
echo 1 > $NULL_DEV_PATH/memory_backed
echo 1 > $NULL_DEV_PATH/power
DEV=/dev/nullb0
MNT=/mnt/nullb0
LOOP_MNT="$MNT/loop"
MOUNT_OPTIONS="-o ssd -o nodatacow"
MKFS_OPTIONS="-R free-space-tree -O no-holes"
cat <<EOF > /tmp/fio-job.ini
[io_uring_writes]
rw=randwrite
fsync=0
fallocate=posix
group_reporting=1
direct=1
ioengine=io_uring
iodepth=64
bs=64k
filesize=1g
runtime=300
time_based
directory=$LOOP_MNT
numjobs=8
thread
EOF
echo performance | \
tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
echo
echo "Using config:"
echo
cat /tmp/fio-job.ini
echo
umount $MNT &> /dev/null
mkfs.btrfs -f $MKFS_OPTIONS $DEV &> /dev/null
mount $MOUNT_OPTIONS $DEV $MNT
mkdir $LOOP_MNT
truncate -s 4T $MNT/loopfile
mkfs.btrfs -f $MKFS_OPTIONS $MNT/loopfile &> /dev/null
mount $MOUNT_OPTIONS $MNT/loopfile $LOOP_MNT
# Trigger the allocation of about 3500 data block groups, without
# actually consuming space on underlying filesystem, just to make
# the tree of block group large.
fallocate -l 3500G $LOOP_MNT/filler
fio /tmp/fio-job.ini
umount $LOOP_MNT
umount $MNT
echo 0 > $NULL_DEV_PATH/power
rmdir $NULL_DEV_PATH
The test was run on a non-debug kernel (Debian's default kernel config),
the result were the following.
Before patchset:
WRITE: bw=1455MiB/s (1526MB/s), 1455MiB/s-1455MiB/s (1526MB/s-1526MB/s), io=426GiB (458GB), run=300006-300006msec
After patchset:
WRITE: bw=1503MiB/s (1577MB/s), 1503MiB/s-1503MiB/s (1577MB/s-1577MB/s), io=440GiB (473GB), run=300006-300006msec
+3.3% write throughput and +3.3% IO done in the same time period.
The test has somewhat limited coverage scope, as with only NOCOW writes
we get less contention on the red black tree of block groups, since we
don't have the extra contention caused by COW writes, namely when
allocating data extents, pinning and unpinning data extents, but on the
hand there's access to tree in the NOCOW path, when incrementing a block
group's number of NOCOW writers.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-04-13 16:20:43 +01:00
|
|
|
|
|
|
|
/* For the lookup done by a previous call to btrfs_inc_nocow_writers(). */
|
2019-06-20 15:37:47 -04:00
|
|
|
btrfs_put_block_group(bg);
|
|
|
|
}
|
|
|
|
|
2019-10-29 19:20:18 +01:00
|
|
|
void btrfs_wait_nocow_writers(struct btrfs_block_group *bg)
|
2019-06-20 15:37:47 -04:00
|
|
|
{
|
|
|
|
wait_var_event(&bg->nocow_writers, !atomic_read(&bg->nocow_writers));
|
|
|
|
}
|
|
|
|
|
|
|
|
void btrfs_dec_block_group_reservations(struct btrfs_fs_info *fs_info,
|
|
|
|
const u64 start)
|
|
|
|
{
|
2019-10-29 19:20:18 +01:00
|
|
|
struct btrfs_block_group *bg;
|
2019-06-20 15:37:47 -04:00
|
|
|
|
|
|
|
bg = btrfs_lookup_block_group(fs_info, start);
|
|
|
|
ASSERT(bg);
|
|
|
|
if (atomic_dec_and_test(&bg->reservations))
|
|
|
|
wake_up_var(&bg->reservations);
|
|
|
|
btrfs_put_block_group(bg);
|
|
|
|
}
|
|
|
|
|
2019-10-29 19:20:18 +01:00
|
|
|
void btrfs_wait_block_group_reservations(struct btrfs_block_group *bg)
|
2019-06-20 15:37:47 -04:00
|
|
|
{
|
|
|
|
struct btrfs_space_info *space_info = bg->space_info;
|
|
|
|
|
|
|
|
ASSERT(bg->ro);
|
|
|
|
|
|
|
|
if (!(bg->flags & BTRFS_BLOCK_GROUP_DATA))
|
|
|
|
return;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Our block group is read only but before we set it to read only,
|
|
|
|
* some task might have had allocated an extent from it already, but it
|
|
|
|
* has not yet created a respective ordered extent (and added it to a
|
|
|
|
* root's list of ordered extents).
|
|
|
|
* Therefore wait for any task currently allocating extents, since the
|
|
|
|
* block group's reservations counter is incremented while a read lock
|
|
|
|
* on the groups' semaphore is held and decremented after releasing
|
|
|
|
* the read access on that semaphore and creating the ordered extent.
|
|
|
|
*/
|
|
|
|
down_write(&space_info->groups_sem);
|
|
|
|
up_write(&space_info->groups_sem);
|
|
|
|
|
|
|
|
wait_var_event(&bg->reservations, !atomic_read(&bg->reservations));
|
|
|
|
}
|
2019-08-06 16:43:19 +02:00
|
|
|
|
|
|
|
struct btrfs_caching_control *btrfs_get_caching_control(
|
2019-10-29 19:20:18 +01:00
|
|
|
struct btrfs_block_group *cache)
|
2019-08-06 16:43:19 +02:00
|
|
|
{
|
|
|
|
struct btrfs_caching_control *ctl;
|
|
|
|
|
|
|
|
spin_lock(&cache->lock);
|
|
|
|
if (!cache->caching_ctl) {
|
|
|
|
spin_unlock(&cache->lock);
|
|
|
|
return NULL;
|
|
|
|
}
|
|
|
|
|
|
|
|
ctl = cache->caching_ctl;
|
|
|
|
refcount_inc(&ctl->count);
|
|
|
|
spin_unlock(&cache->lock);
|
|
|
|
return ctl;
|
|
|
|
}
|
|
|
|
|
2024-02-29 16:30:07 +08:00
|
|
|
static void btrfs_put_caching_control(struct btrfs_caching_control *ctl)
|
2019-08-06 16:43:19 +02:00
|
|
|
{
|
|
|
|
if (refcount_dec_and_test(&ctl->count))
|
|
|
|
kfree(ctl);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* When we wait for progress in the block group caching, its because our
|
|
|
|
* allocation attempt failed at least once. So, we must sleep and let some
|
|
|
|
* progress happen before we try again.
|
|
|
|
*
|
|
|
|
* This function will sleep at least once waiting for new free space to show
|
|
|
|
* up, and then it will check the block group free space numbers for our min
|
|
|
|
* num_bytes. Another option is to have it go ahead and look in the rbtree for
|
|
|
|
* a free extent of a given size, but this is a good start.
|
|
|
|
*
|
|
|
|
* Callers of this must check if cache->cached == BTRFS_CACHE_ERROR before using
|
|
|
|
* any of the information in this block group.
|
|
|
|
*/
|
2019-10-29 19:20:18 +01:00
|
|
|
void btrfs_wait_block_group_cache_progress(struct btrfs_block_group *cache,
|
2019-08-06 16:43:19 +02:00
|
|
|
u64 num_bytes)
|
|
|
|
{
|
|
|
|
struct btrfs_caching_control *caching_ctl;
|
btrfs: wait for actual caching progress during allocation
Recently we've been having mysterious hangs while running generic/475 on
the CI system. This turned out to be something like this:
Task 1
dmsetup suspend --nolockfs
-> __dm_suspend
-> dm_wait_for_completion
-> dm_wait_for_bios_completion
-> Unable to complete because of IO's on a plug in Task 2
Task 2
wb_workfn
-> wb_writeback
-> blk_start_plug
-> writeback_sb_inodes
-> Infinite loop unable to make an allocation
Task 3
cache_block_group
->read_extent_buffer_pages
->Waiting for IO to complete that can't be submitted because Task 1
suspended the DM device
The problem here is that we need Task 2 to be scheduled completely for
the blk plug to flush. Normally this would happen, we normally wait for
the block group caching to finish (Task 3), and this schedule would
result in the block plug flushing.
However if there's enough free space available from the current caching
to satisfy the allocation we won't actually wait for the caching to
complete. This check however just checks that we have enough space, not
that we can make the allocation. In this particular case we were trying
to allocate 9MiB, and we had 10MiB of free space, but we didn't have
9MiB of contiguous space to allocate, and thus the allocation failed and
we looped.
We specifically don't cycle through the FFE loop until we stop finding
cached block groups because we don't want to allocate new block groups
just because we're caching, so we short circuit the normal loop once we
hit LOOP_CACHING_WAIT and we found a caching block group.
This is normally fine, except in this particular case where the caching
thread can't make progress because the DM device has been suspended.
Fix this by not only waiting for free space to >= the amount of space we
want to allocate, but also that we make some progress in caching from
the time we start waiting. This will keep us from busy looping when the
caching is taking a while but still theoretically has enough space for
us to allocate from, and fixes this particular case by forcing us to
actually sleep and wait for forward progress, which will flush the plug.
With this fix we're no longer hanging with generic/475.
CC: stable@vger.kernel.org # 6.1+
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-07-21 16:09:43 -04:00
|
|
|
int progress;
|
2019-08-06 16:43:19 +02:00
|
|
|
|
|
|
|
caching_ctl = btrfs_get_caching_control(cache);
|
|
|
|
if (!caching_ctl)
|
|
|
|
return;
|
|
|
|
|
btrfs: wait for actual caching progress during allocation
Recently we've been having mysterious hangs while running generic/475 on
the CI system. This turned out to be something like this:
Task 1
dmsetup suspend --nolockfs
-> __dm_suspend
-> dm_wait_for_completion
-> dm_wait_for_bios_completion
-> Unable to complete because of IO's on a plug in Task 2
Task 2
wb_workfn
-> wb_writeback
-> blk_start_plug
-> writeback_sb_inodes
-> Infinite loop unable to make an allocation
Task 3
cache_block_group
->read_extent_buffer_pages
->Waiting for IO to complete that can't be submitted because Task 1
suspended the DM device
The problem here is that we need Task 2 to be scheduled completely for
the blk plug to flush. Normally this would happen, we normally wait for
the block group caching to finish (Task 3), and this schedule would
result in the block plug flushing.
However if there's enough free space available from the current caching
to satisfy the allocation we won't actually wait for the caching to
complete. This check however just checks that we have enough space, not
that we can make the allocation. In this particular case we were trying
to allocate 9MiB, and we had 10MiB of free space, but we didn't have
9MiB of contiguous space to allocate, and thus the allocation failed and
we looped.
We specifically don't cycle through the FFE loop until we stop finding
cached block groups because we don't want to allocate new block groups
just because we're caching, so we short circuit the normal loop once we
hit LOOP_CACHING_WAIT and we found a caching block group.
This is normally fine, except in this particular case where the caching
thread can't make progress because the DM device has been suspended.
Fix this by not only waiting for free space to >= the amount of space we
want to allocate, but also that we make some progress in caching from
the time we start waiting. This will keep us from busy looping when the
caching is taking a while but still theoretically has enough space for
us to allocate from, and fixes this particular case by forcing us to
actually sleep and wait for forward progress, which will flush the plug.
With this fix we're no longer hanging with generic/475.
CC: stable@vger.kernel.org # 6.1+
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-07-21 16:09:43 -04:00
|
|
|
/*
|
|
|
|
* We've already failed to allocate from this block group, so even if
|
|
|
|
* there's enough space in the block group it isn't contiguous enough to
|
|
|
|
* allow for an allocation, so wait for at least the next wakeup tick,
|
|
|
|
* or for the thing to be done.
|
|
|
|
*/
|
|
|
|
progress = atomic_read(&caching_ctl->progress);
|
|
|
|
|
2019-10-29 19:20:18 +01:00
|
|
|
wait_event(caching_ctl->wait, btrfs_block_group_done(cache) ||
|
btrfs: wait for actual caching progress during allocation
Recently we've been having mysterious hangs while running generic/475 on
the CI system. This turned out to be something like this:
Task 1
dmsetup suspend --nolockfs
-> __dm_suspend
-> dm_wait_for_completion
-> dm_wait_for_bios_completion
-> Unable to complete because of IO's on a plug in Task 2
Task 2
wb_workfn
-> wb_writeback
-> blk_start_plug
-> writeback_sb_inodes
-> Infinite loop unable to make an allocation
Task 3
cache_block_group
->read_extent_buffer_pages
->Waiting for IO to complete that can't be submitted because Task 1
suspended the DM device
The problem here is that we need Task 2 to be scheduled completely for
the blk plug to flush. Normally this would happen, we normally wait for
the block group caching to finish (Task 3), and this schedule would
result in the block plug flushing.
However if there's enough free space available from the current caching
to satisfy the allocation we won't actually wait for the caching to
complete. This check however just checks that we have enough space, not
that we can make the allocation. In this particular case we were trying
to allocate 9MiB, and we had 10MiB of free space, but we didn't have
9MiB of contiguous space to allocate, and thus the allocation failed and
we looped.
We specifically don't cycle through the FFE loop until we stop finding
cached block groups because we don't want to allocate new block groups
just because we're caching, so we short circuit the normal loop once we
hit LOOP_CACHING_WAIT and we found a caching block group.
This is normally fine, except in this particular case where the caching
thread can't make progress because the DM device has been suspended.
Fix this by not only waiting for free space to >= the amount of space we
want to allocate, but also that we make some progress in caching from
the time we start waiting. This will keep us from busy looping when the
caching is taking a while but still theoretically has enough space for
us to allocate from, and fixes this particular case by forcing us to
actually sleep and wait for forward progress, which will flush the plug.
With this fix we're no longer hanging with generic/475.
CC: stable@vger.kernel.org # 6.1+
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-07-21 16:09:43 -04:00
|
|
|
(progress != atomic_read(&caching_ctl->progress) &&
|
|
|
|
(cache->free_space_ctl->free_space >= num_bytes)));
|
2019-08-06 16:43:19 +02:00
|
|
|
|
|
|
|
btrfs_put_caching_control(caching_ctl);
|
|
|
|
}
|
|
|
|
|
btrfs: fix space cache corruption and potential double allocations
When testing space_cache v2 on a large set of machines, we encountered a
few symptoms:
1. "unable to add free space :-17" (EEXIST) errors.
2. Missing free space info items, sometimes caught with a "missing free
space info for X" error.
3. Double-accounted space: ranges that were allocated in the extent tree
and also marked as free in the free space tree, ranges that were
marked as allocated twice in the extent tree, or ranges that were
marked as free twice in the free space tree. If the latter made it
onto disk, the next reboot would hit the BUG_ON() in
add_new_free_space().
4. On some hosts with no on-disk corruption or error messages, the
in-memory space cache (dumped with drgn) disagreed with the free
space tree.
All of these symptoms have the same underlying cause: a race between
caching the free space for a block group and returning free space to the
in-memory space cache for pinned extents causes us to double-add a free
range to the space cache. This race exists when free space is cached
from the free space tree (space_cache=v2) or the extent tree
(nospace_cache, or space_cache=v1 if the cache needs to be regenerated).
struct btrfs_block_group::last_byte_to_unpin and struct
btrfs_block_group::progress are supposed to protect against this race,
but commit d0c2f4fa555e ("btrfs: make concurrent fsyncs wait less when
waiting for a transaction commit") subtly broke this by allowing
multiple transactions to be unpinning extents at the same time.
Specifically, the race is as follows:
1. An extent is deleted from an uncached block group in transaction A.
2. btrfs_commit_transaction() is called for transaction A.
3. btrfs_run_delayed_refs() -> __btrfs_free_extent() runs the delayed
ref for the deleted extent.
4. __btrfs_free_extent() -> do_free_extent_accounting() ->
add_to_free_space_tree() adds the deleted extent back to the free
space tree.
5. do_free_extent_accounting() -> btrfs_update_block_group() ->
btrfs_cache_block_group() queues up the block group to get cached.
block_group->progress is set to block_group->start.
6. btrfs_commit_transaction() for transaction A calls
switch_commit_roots(). It sets block_group->last_byte_to_unpin to
block_group->progress, which is block_group->start because the block
group hasn't been cached yet.
7. The caching thread gets to our block group. Since the commit roots
were already switched, load_free_space_tree() sees the deleted extent
as free and adds it to the space cache. It finishes caching and sets
block_group->progress to U64_MAX.
8. btrfs_commit_transaction() advances transaction A to
TRANS_STATE_SUPER_COMMITTED.
9. fsync calls btrfs_commit_transaction() for transaction B. Since
transaction A is already in TRANS_STATE_SUPER_COMMITTED and the
commit is for fsync, it advances.
10. btrfs_commit_transaction() for transaction B calls
switch_commit_roots(). This time, the block group has already been
cached, so it sets block_group->last_byte_to_unpin to U64_MAX.
11. btrfs_commit_transaction() for transaction A calls
btrfs_finish_extent_commit(), which calls unpin_extent_range() for
the deleted extent. It sees last_byte_to_unpin set to U64_MAX (by
transaction B!), so it adds the deleted extent to the space cache
again!
This explains all of our symptoms above:
* If the sequence of events is exactly as described above, when the free
space is re-added in step 11, it will fail with EEXIST.
* If another thread reallocates the deleted extent in between steps 7
and 11, then step 11 will silently re-add that space to the space
cache as free even though it is actually allocated. Then, if that
space is allocated *again*, the free space tree will be corrupted
(namely, the wrong item will be deleted).
* If we don't catch this free space tree corruption, it will continue
to get worse as extents are deleted and reallocated.
The v1 space_cache is synchronously loaded when an extent is deleted
(btrfs_update_block_group() with alloc=0 calls btrfs_cache_block_group()
with load_cache_only=1), so it is not normally affected by this bug.
However, as noted above, if we fail to load the space cache, we will
fall back to caching from the extent tree and may hit this bug.
The easiest fix for this race is to also make caching from the free
space tree or extent tree synchronous. Josef tested this and found no
performance regressions.
A few extra changes fall out of this change. Namely, this fix does the
following, with step 2 being the crucial fix:
1. Factor btrfs_caching_ctl_wait_done() out of
btrfs_wait_block_group_cache_done() to allow waiting on a caching_ctl
that we already hold a reference to.
2. Change the call in btrfs_cache_block_group() of
btrfs_wait_space_cache_v1_finished() to
btrfs_caching_ctl_wait_done(), which makes us wait regardless of the
space_cache option.
3. Delete the now unused btrfs_wait_space_cache_v1_finished() and
space_cache_v1_done().
4. Change btrfs_cache_block_group()'s `int load_cache_only` parameter to
`bool wait` to more accurately describe its new meaning.
5. Change a few callers which had a separate call to
btrfs_wait_block_group_cache_done() to use wait = true instead.
6. Make btrfs_wait_block_group_cache_done() static now that it's not
used outside of block-group.c anymore.
Fixes: d0c2f4fa555e ("btrfs: make concurrent fsyncs wait less when waiting for a transaction commit")
CC: stable@vger.kernel.org # 5.12+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-08-23 11:28:13 -07:00
|
|
|
static int btrfs_caching_ctl_wait_done(struct btrfs_block_group *cache,
|
|
|
|
struct btrfs_caching_control *caching_ctl)
|
|
|
|
{
|
|
|
|
wait_event(caching_ctl->wait, btrfs_block_group_done(cache));
|
|
|
|
return cache->cached == BTRFS_CACHE_ERROR ? -EIO : 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
static int btrfs_wait_block_group_cache_done(struct btrfs_block_group *cache)
|
2019-08-06 16:43:19 +02:00
|
|
|
{
|
|
|
|
struct btrfs_caching_control *caching_ctl;
|
btrfs: fix space cache corruption and potential double allocations
When testing space_cache v2 on a large set of machines, we encountered a
few symptoms:
1. "unable to add free space :-17" (EEXIST) errors.
2. Missing free space info items, sometimes caught with a "missing free
space info for X" error.
3. Double-accounted space: ranges that were allocated in the extent tree
and also marked as free in the free space tree, ranges that were
marked as allocated twice in the extent tree, or ranges that were
marked as free twice in the free space tree. If the latter made it
onto disk, the next reboot would hit the BUG_ON() in
add_new_free_space().
4. On some hosts with no on-disk corruption or error messages, the
in-memory space cache (dumped with drgn) disagreed with the free
space tree.
All of these symptoms have the same underlying cause: a race between
caching the free space for a block group and returning free space to the
in-memory space cache for pinned extents causes us to double-add a free
range to the space cache. This race exists when free space is cached
from the free space tree (space_cache=v2) or the extent tree
(nospace_cache, or space_cache=v1 if the cache needs to be regenerated).
struct btrfs_block_group::last_byte_to_unpin and struct
btrfs_block_group::progress are supposed to protect against this race,
but commit d0c2f4fa555e ("btrfs: make concurrent fsyncs wait less when
waiting for a transaction commit") subtly broke this by allowing
multiple transactions to be unpinning extents at the same time.
Specifically, the race is as follows:
1. An extent is deleted from an uncached block group in transaction A.
2. btrfs_commit_transaction() is called for transaction A.
3. btrfs_run_delayed_refs() -> __btrfs_free_extent() runs the delayed
ref for the deleted extent.
4. __btrfs_free_extent() -> do_free_extent_accounting() ->
add_to_free_space_tree() adds the deleted extent back to the free
space tree.
5. do_free_extent_accounting() -> btrfs_update_block_group() ->
btrfs_cache_block_group() queues up the block group to get cached.
block_group->progress is set to block_group->start.
6. btrfs_commit_transaction() for transaction A calls
switch_commit_roots(). It sets block_group->last_byte_to_unpin to
block_group->progress, which is block_group->start because the block
group hasn't been cached yet.
7. The caching thread gets to our block group. Since the commit roots
were already switched, load_free_space_tree() sees the deleted extent
as free and adds it to the space cache. It finishes caching and sets
block_group->progress to U64_MAX.
8. btrfs_commit_transaction() advances transaction A to
TRANS_STATE_SUPER_COMMITTED.
9. fsync calls btrfs_commit_transaction() for transaction B. Since
transaction A is already in TRANS_STATE_SUPER_COMMITTED and the
commit is for fsync, it advances.
10. btrfs_commit_transaction() for transaction B calls
switch_commit_roots(). This time, the block group has already been
cached, so it sets block_group->last_byte_to_unpin to U64_MAX.
11. btrfs_commit_transaction() for transaction A calls
btrfs_finish_extent_commit(), which calls unpin_extent_range() for
the deleted extent. It sees last_byte_to_unpin set to U64_MAX (by
transaction B!), so it adds the deleted extent to the space cache
again!
This explains all of our symptoms above:
* If the sequence of events is exactly as described above, when the free
space is re-added in step 11, it will fail with EEXIST.
* If another thread reallocates the deleted extent in between steps 7
and 11, then step 11 will silently re-add that space to the space
cache as free even though it is actually allocated. Then, if that
space is allocated *again*, the free space tree will be corrupted
(namely, the wrong item will be deleted).
* If we don't catch this free space tree corruption, it will continue
to get worse as extents are deleted and reallocated.
The v1 space_cache is synchronously loaded when an extent is deleted
(btrfs_update_block_group() with alloc=0 calls btrfs_cache_block_group()
with load_cache_only=1), so it is not normally affected by this bug.
However, as noted above, if we fail to load the space cache, we will
fall back to caching from the extent tree and may hit this bug.
The easiest fix for this race is to also make caching from the free
space tree or extent tree synchronous. Josef tested this and found no
performance regressions.
A few extra changes fall out of this change. Namely, this fix does the
following, with step 2 being the crucial fix:
1. Factor btrfs_caching_ctl_wait_done() out of
btrfs_wait_block_group_cache_done() to allow waiting on a caching_ctl
that we already hold a reference to.
2. Change the call in btrfs_cache_block_group() of
btrfs_wait_space_cache_v1_finished() to
btrfs_caching_ctl_wait_done(), which makes us wait regardless of the
space_cache option.
3. Delete the now unused btrfs_wait_space_cache_v1_finished() and
space_cache_v1_done().
4. Change btrfs_cache_block_group()'s `int load_cache_only` parameter to
`bool wait` to more accurately describe its new meaning.
5. Change a few callers which had a separate call to
btrfs_wait_block_group_cache_done() to use wait = true instead.
6. Make btrfs_wait_block_group_cache_done() static now that it's not
used outside of block-group.c anymore.
Fixes: d0c2f4fa555e ("btrfs: make concurrent fsyncs wait less when waiting for a transaction commit")
CC: stable@vger.kernel.org # 5.12+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-08-23 11:28:13 -07:00
|
|
|
int ret;
|
2019-08-06 16:43:19 +02:00
|
|
|
|
|
|
|
caching_ctl = btrfs_get_caching_control(cache);
|
|
|
|
if (!caching_ctl)
|
|
|
|
return (cache->cached == BTRFS_CACHE_ERROR) ? -EIO : 0;
|
btrfs: fix space cache corruption and potential double allocations
When testing space_cache v2 on a large set of machines, we encountered a
few symptoms:
1. "unable to add free space :-17" (EEXIST) errors.
2. Missing free space info items, sometimes caught with a "missing free
space info for X" error.
3. Double-accounted space: ranges that were allocated in the extent tree
and also marked as free in the free space tree, ranges that were
marked as allocated twice in the extent tree, or ranges that were
marked as free twice in the free space tree. If the latter made it
onto disk, the next reboot would hit the BUG_ON() in
add_new_free_space().
4. On some hosts with no on-disk corruption or error messages, the
in-memory space cache (dumped with drgn) disagreed with the free
space tree.
All of these symptoms have the same underlying cause: a race between
caching the free space for a block group and returning free space to the
in-memory space cache for pinned extents causes us to double-add a free
range to the space cache. This race exists when free space is cached
from the free space tree (space_cache=v2) or the extent tree
(nospace_cache, or space_cache=v1 if the cache needs to be regenerated).
struct btrfs_block_group::last_byte_to_unpin and struct
btrfs_block_group::progress are supposed to protect against this race,
but commit d0c2f4fa555e ("btrfs: make concurrent fsyncs wait less when
waiting for a transaction commit") subtly broke this by allowing
multiple transactions to be unpinning extents at the same time.
Specifically, the race is as follows:
1. An extent is deleted from an uncached block group in transaction A.
2. btrfs_commit_transaction() is called for transaction A.
3. btrfs_run_delayed_refs() -> __btrfs_free_extent() runs the delayed
ref for the deleted extent.
4. __btrfs_free_extent() -> do_free_extent_accounting() ->
add_to_free_space_tree() adds the deleted extent back to the free
space tree.
5. do_free_extent_accounting() -> btrfs_update_block_group() ->
btrfs_cache_block_group() queues up the block group to get cached.
block_group->progress is set to block_group->start.
6. btrfs_commit_transaction() for transaction A calls
switch_commit_roots(). It sets block_group->last_byte_to_unpin to
block_group->progress, which is block_group->start because the block
group hasn't been cached yet.
7. The caching thread gets to our block group. Since the commit roots
were already switched, load_free_space_tree() sees the deleted extent
as free and adds it to the space cache. It finishes caching and sets
block_group->progress to U64_MAX.
8. btrfs_commit_transaction() advances transaction A to
TRANS_STATE_SUPER_COMMITTED.
9. fsync calls btrfs_commit_transaction() for transaction B. Since
transaction A is already in TRANS_STATE_SUPER_COMMITTED and the
commit is for fsync, it advances.
10. btrfs_commit_transaction() for transaction B calls
switch_commit_roots(). This time, the block group has already been
cached, so it sets block_group->last_byte_to_unpin to U64_MAX.
11. btrfs_commit_transaction() for transaction A calls
btrfs_finish_extent_commit(), which calls unpin_extent_range() for
the deleted extent. It sees last_byte_to_unpin set to U64_MAX (by
transaction B!), so it adds the deleted extent to the space cache
again!
This explains all of our symptoms above:
* If the sequence of events is exactly as described above, when the free
space is re-added in step 11, it will fail with EEXIST.
* If another thread reallocates the deleted extent in between steps 7
and 11, then step 11 will silently re-add that space to the space
cache as free even though it is actually allocated. Then, if that
space is allocated *again*, the free space tree will be corrupted
(namely, the wrong item will be deleted).
* If we don't catch this free space tree corruption, it will continue
to get worse as extents are deleted and reallocated.
The v1 space_cache is synchronously loaded when an extent is deleted
(btrfs_update_block_group() with alloc=0 calls btrfs_cache_block_group()
with load_cache_only=1), so it is not normally affected by this bug.
However, as noted above, if we fail to load the space cache, we will
fall back to caching from the extent tree and may hit this bug.
The easiest fix for this race is to also make caching from the free
space tree or extent tree synchronous. Josef tested this and found no
performance regressions.
A few extra changes fall out of this change. Namely, this fix does the
following, with step 2 being the crucial fix:
1. Factor btrfs_caching_ctl_wait_done() out of
btrfs_wait_block_group_cache_done() to allow waiting on a caching_ctl
that we already hold a reference to.
2. Change the call in btrfs_cache_block_group() of
btrfs_wait_space_cache_v1_finished() to
btrfs_caching_ctl_wait_done(), which makes us wait regardless of the
space_cache option.
3. Delete the now unused btrfs_wait_space_cache_v1_finished() and
space_cache_v1_done().
4. Change btrfs_cache_block_group()'s `int load_cache_only` parameter to
`bool wait` to more accurately describe its new meaning.
5. Change a few callers which had a separate call to
btrfs_wait_block_group_cache_done() to use wait = true instead.
6. Make btrfs_wait_block_group_cache_done() static now that it's not
used outside of block-group.c anymore.
Fixes: d0c2f4fa555e ("btrfs: make concurrent fsyncs wait less when waiting for a transaction commit")
CC: stable@vger.kernel.org # 5.12+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-08-23 11:28:13 -07:00
|
|
|
ret = btrfs_caching_ctl_wait_done(cache, caching_ctl);
|
2019-08-06 16:43:19 +02:00
|
|
|
btrfs_put_caching_control(caching_ctl);
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
|
|
|
#ifdef CONFIG_BTRFS_DEBUG
|
2019-10-29 19:20:18 +01:00
|
|
|
static void fragment_free_space(struct btrfs_block_group *block_group)
|
2019-08-06 16:43:19 +02:00
|
|
|
{
|
|
|
|
struct btrfs_fs_info *fs_info = block_group->fs_info;
|
2019-10-23 18:48:22 +02:00
|
|
|
u64 start = block_group->start;
|
|
|
|
u64 len = block_group->length;
|
2019-08-06 16:43:19 +02:00
|
|
|
u64 chunk = block_group->flags & BTRFS_BLOCK_GROUP_METADATA ?
|
|
|
|
fs_info->nodesize : fs_info->sectorsize;
|
|
|
|
u64 step = chunk << 1;
|
|
|
|
|
|
|
|
while (len > chunk) {
|
|
|
|
btrfs_remove_free_space(block_group, start, chunk);
|
|
|
|
start += step;
|
|
|
|
if (len < step)
|
|
|
|
len = 0;
|
|
|
|
else
|
|
|
|
len -= step;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
#endif
|
|
|
|
|
|
|
|
/*
|
2023-06-30 16:03:45 +01:00
|
|
|
* Add a free space range to the in memory free space cache of a block group.
|
|
|
|
* This checks if the range contains super block locations and any such
|
|
|
|
* locations are not added to the free space cache.
|
|
|
|
*
|
|
|
|
* @block_group: The target block group.
|
|
|
|
* @start: Start offset of the range.
|
|
|
|
* @end: End offset of the range (exclusive).
|
|
|
|
* @total_added_ret: Optional pointer to return the total amount of space
|
|
|
|
* added to the block group's free space cache.
|
|
|
|
*
|
|
|
|
* Returns 0 on success or < 0 on error.
|
2019-08-06 16:43:19 +02:00
|
|
|
*/
|
2023-06-30 16:03:46 +01:00
|
|
|
int btrfs_add_new_free_space(struct btrfs_block_group *block_group, u64 start,
|
|
|
|
u64 end, u64 *total_added_ret)
|
2019-08-06 16:43:19 +02:00
|
|
|
{
|
|
|
|
struct btrfs_fs_info *info = block_group->fs_info;
|
2023-06-30 16:03:44 +01:00
|
|
|
u64 extent_start, extent_end, size;
|
2019-08-06 16:43:19 +02:00
|
|
|
int ret;
|
|
|
|
|
2023-06-30 16:03:44 +01:00
|
|
|
if (total_added_ret)
|
|
|
|
*total_added_ret = 0;
|
|
|
|
|
2019-08-06 16:43:19 +02:00
|
|
|
while (start < end) {
|
2023-06-30 16:03:49 +01:00
|
|
|
if (!find_first_extent_bit(&info->excluded_extents, start,
|
|
|
|
&extent_start, &extent_end,
|
|
|
|
EXTENT_DIRTY | EXTENT_UPTODATE,
|
|
|
|
NULL))
|
2019-08-06 16:43:19 +02:00
|
|
|
break;
|
|
|
|
|
|
|
|
if (extent_start <= start) {
|
|
|
|
start = extent_end + 1;
|
|
|
|
} else if (extent_start > start && extent_start < end) {
|
|
|
|
size = extent_start - start;
|
2019-12-13 16:22:14 -08:00
|
|
|
ret = btrfs_add_free_space_async_trimmed(block_group,
|
|
|
|
start, size);
|
2023-06-30 16:03:44 +01:00
|
|
|
if (ret)
|
|
|
|
return ret;
|
|
|
|
if (total_added_ret)
|
|
|
|
*total_added_ret += size;
|
2019-08-06 16:43:19 +02:00
|
|
|
start = extent_end + 1;
|
|
|
|
} else {
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
if (start < end) {
|
|
|
|
size = end - start;
|
2019-12-13 16:22:14 -08:00
|
|
|
ret = btrfs_add_free_space_async_trimmed(block_group, start,
|
|
|
|
size);
|
2023-06-30 16:03:44 +01:00
|
|
|
if (ret)
|
|
|
|
return ret;
|
|
|
|
if (total_added_ret)
|
|
|
|
*total_added_ret += size;
|
2019-08-06 16:43:19 +02:00
|
|
|
}
|
|
|
|
|
2023-06-30 16:03:44 +01:00
|
|
|
return 0;
|
2019-08-06 16:43:19 +02:00
|
|
|
}
|
|
|
|
|
2022-12-15 16:06:34 -08:00
|
|
|
/*
|
|
|
|
* Get an arbitrary extent item index / max_index through the block group
|
|
|
|
*
|
|
|
|
* @block_group the block group to sample from
|
|
|
|
* @index: the integral step through the block group to grab from
|
|
|
|
* @max_index: the granularity of the sampling
|
|
|
|
* @key: return value parameter for the item we find
|
|
|
|
*
|
|
|
|
* Pre-conditions on indices:
|
|
|
|
* 0 <= index <= max_index
|
|
|
|
* 0 < max_index
|
|
|
|
*
|
|
|
|
* Returns: 0 on success, 1 if the search didn't yield a useful item, negative
|
|
|
|
* error code on error.
|
|
|
|
*/
|
|
|
|
static int sample_block_group_extent_item(struct btrfs_caching_control *caching_ctl,
|
|
|
|
struct btrfs_block_group *block_group,
|
|
|
|
int index, int max_index,
|
2023-02-15 12:59:50 -08:00
|
|
|
struct btrfs_key *found_key)
|
2022-12-15 16:06:34 -08:00
|
|
|
{
|
|
|
|
struct btrfs_fs_info *fs_info = block_group->fs_info;
|
|
|
|
struct btrfs_root *extent_root;
|
|
|
|
u64 search_offset;
|
|
|
|
u64 search_end = block_group->start + block_group->length;
|
|
|
|
struct btrfs_path *path;
|
2023-02-15 12:59:50 -08:00
|
|
|
struct btrfs_key search_key;
|
|
|
|
int ret = 0;
|
2022-12-15 16:06:34 -08:00
|
|
|
|
|
|
|
ASSERT(index >= 0);
|
|
|
|
ASSERT(index <= max_index);
|
|
|
|
ASSERT(max_index > 0);
|
|
|
|
lockdep_assert_held(&caching_ctl->mutex);
|
|
|
|
lockdep_assert_held_read(&fs_info->commit_root_sem);
|
|
|
|
|
|
|
|
path = btrfs_alloc_path();
|
|
|
|
if (!path)
|
|
|
|
return -ENOMEM;
|
|
|
|
|
|
|
|
extent_root = btrfs_extent_root(fs_info, max_t(u64, block_group->start,
|
|
|
|
BTRFS_SUPER_INFO_OFFSET));
|
|
|
|
|
|
|
|
path->skip_locking = 1;
|
|
|
|
path->search_commit_root = 1;
|
|
|
|
path->reada = READA_FORWARD;
|
|
|
|
|
|
|
|
search_offset = index * div_u64(block_group->length, max_index);
|
2023-02-15 12:59:50 -08:00
|
|
|
search_key.objectid = block_group->start + search_offset;
|
|
|
|
search_key.type = BTRFS_EXTENT_ITEM_KEY;
|
|
|
|
search_key.offset = 0;
|
2022-12-15 16:06:34 -08:00
|
|
|
|
2023-02-15 12:59:50 -08:00
|
|
|
btrfs_for_each_slot(extent_root, &search_key, found_key, path, ret) {
|
2022-12-15 16:06:34 -08:00
|
|
|
/* Success; sampled an extent item in the block group */
|
2023-02-15 12:59:50 -08:00
|
|
|
if (found_key->type == BTRFS_EXTENT_ITEM_KEY &&
|
|
|
|
found_key->objectid >= block_group->start &&
|
|
|
|
found_key->objectid + found_key->offset <= search_end)
|
|
|
|
break;
|
2022-12-15 16:06:34 -08:00
|
|
|
|
|
|
|
/* We can't possibly find a valid extent item anymore */
|
2023-02-15 12:59:50 -08:00
|
|
|
if (found_key->objectid >= search_end) {
|
2022-12-15 16:06:34 -08:00
|
|
|
ret = 1;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
}
|
2023-02-15 12:59:50 -08:00
|
|
|
|
2022-12-15 16:06:34 -08:00
|
|
|
lockdep_assert_held(&caching_ctl->mutex);
|
|
|
|
lockdep_assert_held_read(&fs_info->commit_root_sem);
|
|
|
|
btrfs_free_path(path);
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Best effort attempt to compute a block group's size class while caching it.
|
|
|
|
*
|
|
|
|
* @block_group: the block group we are caching
|
|
|
|
*
|
|
|
|
* We cannot infer the size class while adding free space extents, because that
|
|
|
|
* logic doesn't care about contiguous file extents (it doesn't differentiate
|
|
|
|
* between a 100M extent and 100 contiguous 1M extents). So we need to read the
|
|
|
|
* file extent items. Reading all of them is quite wasteful, because usually
|
|
|
|
* only a handful are enough to give a good answer. Therefore, we just grab 5 of
|
|
|
|
* them at even steps through the block group and pick the smallest size class
|
|
|
|
* we see. Since size class is best effort, and not guaranteed in general,
|
|
|
|
* inaccuracy is acceptable.
|
|
|
|
*
|
|
|
|
* To be more explicit about why this algorithm makes sense:
|
|
|
|
*
|
|
|
|
* If we are caching in a block group from disk, then there are three major cases
|
|
|
|
* to consider:
|
|
|
|
* 1. the block group is well behaved and all extents in it are the same size
|
|
|
|
* class.
|
|
|
|
* 2. the block group is mostly one size class with rare exceptions for last
|
|
|
|
* ditch allocations
|
|
|
|
* 3. the block group was populated before size classes and can have a totally
|
|
|
|
* arbitrary mix of size classes.
|
|
|
|
*
|
|
|
|
* In case 1, looking at any extent in the block group will yield the correct
|
|
|
|
* result. For the mixed cases, taking the minimum size class seems like a good
|
|
|
|
* approximation, since gaps from frees will be usable to the size class. For
|
|
|
|
* 2., a small handful of file extents is likely to yield the right answer. For
|
|
|
|
* 3, we can either read every file extent, or admit that this is best effort
|
|
|
|
* anyway and try to stay fast.
|
|
|
|
*
|
|
|
|
* Returns: 0 on success, negative error code on error.
|
|
|
|
*/
|
|
|
|
static int load_block_group_size_class(struct btrfs_caching_control *caching_ctl,
|
|
|
|
struct btrfs_block_group *block_group)
|
|
|
|
{
|
2023-02-15 12:59:50 -08:00
|
|
|
struct btrfs_fs_info *fs_info = block_group->fs_info;
|
2022-12-15 16:06:34 -08:00
|
|
|
struct btrfs_key key;
|
|
|
|
int i;
|
|
|
|
u64 min_size = block_group->length;
|
|
|
|
enum btrfs_block_group_size_class size_class = BTRFS_BG_SZ_NONE;
|
|
|
|
int ret;
|
|
|
|
|
2022-12-15 16:06:35 -08:00
|
|
|
if (!btrfs_block_group_should_use_size_class(block_group))
|
2022-12-15 16:06:34 -08:00
|
|
|
return 0;
|
|
|
|
|
2023-02-15 12:59:50 -08:00
|
|
|
lockdep_assert_held(&caching_ctl->mutex);
|
|
|
|
lockdep_assert_held_read(&fs_info->commit_root_sem);
|
2022-12-15 16:06:34 -08:00
|
|
|
for (i = 0; i < 5; ++i) {
|
|
|
|
ret = sample_block_group_extent_item(caching_ctl, block_group, i, 5, &key);
|
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
|
|
|
if (ret > 0)
|
|
|
|
continue;
|
|
|
|
min_size = min_t(u64, min_size, key.offset);
|
|
|
|
size_class = btrfs_calc_block_group_size_class(min_size);
|
|
|
|
}
|
|
|
|
if (size_class != BTRFS_BG_SZ_NONE) {
|
|
|
|
spin_lock(&block_group->lock);
|
|
|
|
block_group->size_class = size_class;
|
|
|
|
spin_unlock(&block_group->lock);
|
|
|
|
}
|
|
|
|
out:
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2019-08-06 16:43:19 +02:00
|
|
|
static int load_extent_tree_free(struct btrfs_caching_control *caching_ctl)
|
|
|
|
{
|
2019-10-29 19:20:18 +01:00
|
|
|
struct btrfs_block_group *block_group = caching_ctl->block_group;
|
2019-08-06 16:43:19 +02:00
|
|
|
struct btrfs_fs_info *fs_info = block_group->fs_info;
|
2021-11-05 16:45:45 -04:00
|
|
|
struct btrfs_root *extent_root;
|
2019-08-06 16:43:19 +02:00
|
|
|
struct btrfs_path *path;
|
|
|
|
struct extent_buffer *leaf;
|
|
|
|
struct btrfs_key key;
|
|
|
|
u64 total_found = 0;
|
|
|
|
u64 last = 0;
|
|
|
|
u32 nritems;
|
|
|
|
int ret;
|
|
|
|
bool wakeup = true;
|
|
|
|
|
|
|
|
path = btrfs_alloc_path();
|
|
|
|
if (!path)
|
|
|
|
return -ENOMEM;
|
|
|
|
|
2019-10-23 18:48:22 +02:00
|
|
|
last = max_t(u64, block_group->start, BTRFS_SUPER_INFO_OFFSET);
|
2021-11-05 16:45:45 -04:00
|
|
|
extent_root = btrfs_extent_root(fs_info, last);
|
2019-08-06 16:43:19 +02:00
|
|
|
|
|
|
|
#ifdef CONFIG_BTRFS_DEBUG
|
|
|
|
/*
|
|
|
|
* If we're fragmenting we don't want to make anybody think we can
|
|
|
|
* allocate from this block group until we've had a chance to fragment
|
|
|
|
* the free space.
|
|
|
|
*/
|
|
|
|
if (btrfs_should_fragment_free_space(block_group))
|
|
|
|
wakeup = false;
|
|
|
|
#endif
|
|
|
|
/*
|
|
|
|
* We don't want to deadlock with somebody trying to allocate a new
|
|
|
|
* extent for the extent root while also trying to search the extent
|
|
|
|
* root to add free space. So we skip locking and search the commit
|
|
|
|
* root, since its read-only
|
|
|
|
*/
|
|
|
|
path->skip_locking = 1;
|
|
|
|
path->search_commit_root = 1;
|
|
|
|
path->reada = READA_FORWARD;
|
|
|
|
|
|
|
|
key.objectid = last;
|
|
|
|
key.offset = 0;
|
|
|
|
key.type = BTRFS_EXTENT_ITEM_KEY;
|
|
|
|
|
|
|
|
next:
|
|
|
|
ret = btrfs_search_slot(NULL, extent_root, &key, path, 0, 0);
|
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
|
|
|
|
|
|
|
leaf = path->nodes[0];
|
|
|
|
nritems = btrfs_header_nritems(leaf);
|
|
|
|
|
|
|
|
while (1) {
|
|
|
|
if (btrfs_fs_closing(fs_info) > 1) {
|
|
|
|
last = (u64)-1;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (path->slots[0] < nritems) {
|
|
|
|
btrfs_item_key_to_cpu(leaf, &key, path->slots[0]);
|
|
|
|
} else {
|
|
|
|
ret = btrfs_find_next_key(extent_root, path, &key, 0, 0);
|
|
|
|
if (ret)
|
|
|
|
break;
|
|
|
|
|
|
|
|
if (need_resched() ||
|
|
|
|
rwsem_is_contended(&fs_info->commit_root_sem)) {
|
|
|
|
btrfs_release_path(path);
|
|
|
|
up_read(&fs_info->commit_root_sem);
|
|
|
|
mutex_unlock(&caching_ctl->mutex);
|
|
|
|
cond_resched();
|
|
|
|
mutex_lock(&caching_ctl->mutex);
|
|
|
|
down_read(&fs_info->commit_root_sem);
|
|
|
|
goto next;
|
|
|
|
}
|
|
|
|
|
|
|
|
ret = btrfs_next_leaf(extent_root, path);
|
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
|
|
|
if (ret)
|
|
|
|
break;
|
|
|
|
leaf = path->nodes[0];
|
|
|
|
nritems = btrfs_header_nritems(leaf);
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (key.objectid < last) {
|
|
|
|
key.objectid = last;
|
|
|
|
key.offset = 0;
|
|
|
|
key.type = BTRFS_EXTENT_ITEM_KEY;
|
|
|
|
btrfs_release_path(path);
|
|
|
|
goto next;
|
|
|
|
}
|
|
|
|
|
2019-10-23 18:48:22 +02:00
|
|
|
if (key.objectid < block_group->start) {
|
2019-08-06 16:43:19 +02:00
|
|
|
path->slots[0]++;
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
|
2019-10-23 18:48:22 +02:00
|
|
|
if (key.objectid >= block_group->start + block_group->length)
|
2019-08-06 16:43:19 +02:00
|
|
|
break;
|
|
|
|
|
|
|
|
if (key.type == BTRFS_EXTENT_ITEM_KEY ||
|
|
|
|
key.type == BTRFS_METADATA_ITEM_KEY) {
|
2023-06-30 16:03:44 +01:00
|
|
|
u64 space_added;
|
|
|
|
|
2023-06-30 16:03:46 +01:00
|
|
|
ret = btrfs_add_new_free_space(block_group, last,
|
|
|
|
key.objectid, &space_added);
|
2023-06-30 16:03:44 +01:00
|
|
|
if (ret)
|
|
|
|
goto out;
|
|
|
|
total_found += space_added;
|
2019-08-06 16:43:19 +02:00
|
|
|
if (key.type == BTRFS_METADATA_ITEM_KEY)
|
|
|
|
last = key.objectid +
|
|
|
|
fs_info->nodesize;
|
|
|
|
else
|
|
|
|
last = key.objectid + key.offset;
|
|
|
|
|
|
|
|
if (total_found > CACHING_CTL_WAKE_UP) {
|
|
|
|
total_found = 0;
|
btrfs: wait for actual caching progress during allocation
Recently we've been having mysterious hangs while running generic/475 on
the CI system. This turned out to be something like this:
Task 1
dmsetup suspend --nolockfs
-> __dm_suspend
-> dm_wait_for_completion
-> dm_wait_for_bios_completion
-> Unable to complete because of IO's on a plug in Task 2
Task 2
wb_workfn
-> wb_writeback
-> blk_start_plug
-> writeback_sb_inodes
-> Infinite loop unable to make an allocation
Task 3
cache_block_group
->read_extent_buffer_pages
->Waiting for IO to complete that can't be submitted because Task 1
suspended the DM device
The problem here is that we need Task 2 to be scheduled completely for
the blk plug to flush. Normally this would happen, we normally wait for
the block group caching to finish (Task 3), and this schedule would
result in the block plug flushing.
However if there's enough free space available from the current caching
to satisfy the allocation we won't actually wait for the caching to
complete. This check however just checks that we have enough space, not
that we can make the allocation. In this particular case we were trying
to allocate 9MiB, and we had 10MiB of free space, but we didn't have
9MiB of contiguous space to allocate, and thus the allocation failed and
we looped.
We specifically don't cycle through the FFE loop until we stop finding
cached block groups because we don't want to allocate new block groups
just because we're caching, so we short circuit the normal loop once we
hit LOOP_CACHING_WAIT and we found a caching block group.
This is normally fine, except in this particular case where the caching
thread can't make progress because the DM device has been suspended.
Fix this by not only waiting for free space to >= the amount of space we
want to allocate, but also that we make some progress in caching from
the time we start waiting. This will keep us from busy looping when the
caching is taking a while but still theoretically has enough space for
us to allocate from, and fixes this particular case by forcing us to
actually sleep and wait for forward progress, which will flush the plug.
With this fix we're no longer hanging with generic/475.
CC: stable@vger.kernel.org # 6.1+
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-07-21 16:09:43 -04:00
|
|
|
if (wakeup) {
|
|
|
|
atomic_inc(&caching_ctl->progress);
|
2019-08-06 16:43:19 +02:00
|
|
|
wake_up(&caching_ctl->wait);
|
btrfs: wait for actual caching progress during allocation
Recently we've been having mysterious hangs while running generic/475 on
the CI system. This turned out to be something like this:
Task 1
dmsetup suspend --nolockfs
-> __dm_suspend
-> dm_wait_for_completion
-> dm_wait_for_bios_completion
-> Unable to complete because of IO's on a plug in Task 2
Task 2
wb_workfn
-> wb_writeback
-> blk_start_plug
-> writeback_sb_inodes
-> Infinite loop unable to make an allocation
Task 3
cache_block_group
->read_extent_buffer_pages
->Waiting for IO to complete that can't be submitted because Task 1
suspended the DM device
The problem here is that we need Task 2 to be scheduled completely for
the blk plug to flush. Normally this would happen, we normally wait for
the block group caching to finish (Task 3), and this schedule would
result in the block plug flushing.
However if there's enough free space available from the current caching
to satisfy the allocation we won't actually wait for the caching to
complete. This check however just checks that we have enough space, not
that we can make the allocation. In this particular case we were trying
to allocate 9MiB, and we had 10MiB of free space, but we didn't have
9MiB of contiguous space to allocate, and thus the allocation failed and
we looped.
We specifically don't cycle through the FFE loop until we stop finding
cached block groups because we don't want to allocate new block groups
just because we're caching, so we short circuit the normal loop once we
hit LOOP_CACHING_WAIT and we found a caching block group.
This is normally fine, except in this particular case where the caching
thread can't make progress because the DM device has been suspended.
Fix this by not only waiting for free space to >= the amount of space we
want to allocate, but also that we make some progress in caching from
the time we start waiting. This will keep us from busy looping when the
caching is taking a while but still theoretically has enough space for
us to allocate from, and fixes this particular case by forcing us to
actually sleep and wait for forward progress, which will flush the plug.
With this fix we're no longer hanging with generic/475.
CC: stable@vger.kernel.org # 6.1+
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-07-21 16:09:43 -04:00
|
|
|
}
|
2019-08-06 16:43:19 +02:00
|
|
|
}
|
|
|
|
}
|
|
|
|
path->slots[0]++;
|
|
|
|
}
|
|
|
|
|
2023-06-30 16:03:46 +01:00
|
|
|
ret = btrfs_add_new_free_space(block_group, last,
|
|
|
|
block_group->start + block_group->length,
|
|
|
|
NULL);
|
2019-08-06 16:43:19 +02:00
|
|
|
out:
|
|
|
|
btrfs_free_path(path);
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2023-06-30 16:03:51 +01:00
|
|
|
static inline void btrfs_free_excluded_extents(const struct btrfs_block_group *bg)
|
|
|
|
{
|
|
|
|
clear_extent_bits(&bg->fs_info->excluded_extents, bg->start,
|
|
|
|
bg->start + bg->length - 1, EXTENT_UPTODATE);
|
|
|
|
}
|
|
|
|
|
2019-08-06 16:43:19 +02:00
|
|
|
static noinline void caching_thread(struct btrfs_work *work)
|
|
|
|
{
|
2019-10-29 19:20:18 +01:00
|
|
|
struct btrfs_block_group *block_group;
|
2019-08-06 16:43:19 +02:00
|
|
|
struct btrfs_fs_info *fs_info;
|
|
|
|
struct btrfs_caching_control *caching_ctl;
|
|
|
|
int ret;
|
|
|
|
|
|
|
|
caching_ctl = container_of(work, struct btrfs_caching_control, work);
|
|
|
|
block_group = caching_ctl->block_group;
|
|
|
|
fs_info = block_group->fs_info;
|
|
|
|
|
|
|
|
mutex_lock(&caching_ctl->mutex);
|
|
|
|
down_read(&fs_info->commit_root_sem);
|
|
|
|
|
2022-12-15 16:06:34 -08:00
|
|
|
load_block_group_size_class(caching_ctl, block_group);
|
2020-10-23 09:58:10 -04:00
|
|
|
if (btrfs_test_opt(fs_info, SPACE_CACHE)) {
|
|
|
|
ret = load_free_space_cache(block_group);
|
|
|
|
if (ret == 1) {
|
|
|
|
ret = 0;
|
|
|
|
goto done;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* We failed to load the space cache, set ourselves to
|
|
|
|
* CACHE_STARTED and carry on.
|
|
|
|
*/
|
|
|
|
spin_lock(&block_group->lock);
|
|
|
|
block_group->cached = BTRFS_CACHE_STARTED;
|
|
|
|
spin_unlock(&block_group->lock);
|
|
|
|
wake_up(&caching_ctl->wait);
|
|
|
|
}
|
|
|
|
|
btrfs: fix possible free space tree corruption with online conversion
While running btrfs/011 in a loop I would often ASSERT() while trying to
add a new free space entry that already existed, or get an EEXIST while
adding a new block to the extent tree, which is another indication of
double allocation.
This occurs because when we do the free space tree population, we create
the new root and then populate the tree and commit the transaction.
The problem is when you create a new root, the root node and commit root
node are the same. During this initial transaction commit we will run
all of the delayed refs that were paused during the free space tree
generation, and thus begin to cache block groups. While caching block
groups the caching thread will be reading from the main root for the
free space tree, so as we make allocations we'll be changing the free
space tree, which can cause us to add the same range twice which results
in either the ASSERT(ret != -EEXIST); in __btrfs_add_free_space, or in a
variety of different errors when running delayed refs because of a
double allocation.
Fix this by marking the fs_info as unsafe to load the free space tree,
and fall back on the old slow method. We could be smarter than this,
for example caching the block group while we're populating the free
space tree, but since this is a serious problem I've opted for the
simplest solution.
CC: stable@vger.kernel.org # 4.9+
Fixes: a5ed91828518 ("Btrfs: implement the free space B-tree")
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-01-15 16:26:17 -05:00
|
|
|
/*
|
|
|
|
* If we are in the transaction that populated the free space tree we
|
|
|
|
* can't actually cache from the free space tree as our commit root and
|
|
|
|
* real root are the same, so we could change the contents of the blocks
|
|
|
|
* while caching. Instead do the slow caching in this case, and after
|
|
|
|
* the transaction has committed we will be safe.
|
|
|
|
*/
|
|
|
|
if (btrfs_fs_compat_ro(fs_info, FREE_SPACE_TREE) &&
|
|
|
|
!(test_bit(BTRFS_FS_FREE_SPACE_TREE_UNTRUSTED, &fs_info->flags)))
|
2019-08-06 16:43:19 +02:00
|
|
|
ret = load_free_space_tree(caching_ctl);
|
|
|
|
else
|
|
|
|
ret = load_extent_tree_free(caching_ctl);
|
2020-10-23 09:58:10 -04:00
|
|
|
done:
|
2019-08-06 16:43:19 +02:00
|
|
|
spin_lock(&block_group->lock);
|
|
|
|
block_group->caching_ctl = NULL;
|
|
|
|
block_group->cached = ret ? BTRFS_CACHE_ERROR : BTRFS_CACHE_FINISHED;
|
|
|
|
spin_unlock(&block_group->lock);
|
|
|
|
|
|
|
|
#ifdef CONFIG_BTRFS_DEBUG
|
|
|
|
if (btrfs_should_fragment_free_space(block_group)) {
|
|
|
|
u64 bytes_used;
|
|
|
|
|
|
|
|
spin_lock(&block_group->space_info->lock);
|
|
|
|
spin_lock(&block_group->lock);
|
2019-10-23 18:48:22 +02:00
|
|
|
bytes_used = block_group->length - block_group->used;
|
2019-08-06 16:43:19 +02:00
|
|
|
block_group->space_info->bytes_used += bytes_used >> 1;
|
|
|
|
spin_unlock(&block_group->lock);
|
|
|
|
spin_unlock(&block_group->space_info->lock);
|
2019-06-20 15:38:07 -04:00
|
|
|
fragment_free_space(block_group);
|
2019-08-06 16:43:19 +02:00
|
|
|
}
|
|
|
|
#endif
|
|
|
|
|
|
|
|
up_read(&fs_info->commit_root_sem);
|
|
|
|
btrfs_free_excluded_extents(block_group);
|
|
|
|
mutex_unlock(&caching_ctl->mutex);
|
|
|
|
|
|
|
|
wake_up(&caching_ctl->wait);
|
|
|
|
|
|
|
|
btrfs_put_caching_control(caching_ctl);
|
|
|
|
btrfs_put_block_group(block_group);
|
|
|
|
}
|
|
|
|
|
btrfs: fix space cache corruption and potential double allocations
When testing space_cache v2 on a large set of machines, we encountered a
few symptoms:
1. "unable to add free space :-17" (EEXIST) errors.
2. Missing free space info items, sometimes caught with a "missing free
space info for X" error.
3. Double-accounted space: ranges that were allocated in the extent tree
and also marked as free in the free space tree, ranges that were
marked as allocated twice in the extent tree, or ranges that were
marked as free twice in the free space tree. If the latter made it
onto disk, the next reboot would hit the BUG_ON() in
add_new_free_space().
4. On some hosts with no on-disk corruption or error messages, the
in-memory space cache (dumped with drgn) disagreed with the free
space tree.
All of these symptoms have the same underlying cause: a race between
caching the free space for a block group and returning free space to the
in-memory space cache for pinned extents causes us to double-add a free
range to the space cache. This race exists when free space is cached
from the free space tree (space_cache=v2) or the extent tree
(nospace_cache, or space_cache=v1 if the cache needs to be regenerated).
struct btrfs_block_group::last_byte_to_unpin and struct
btrfs_block_group::progress are supposed to protect against this race,
but commit d0c2f4fa555e ("btrfs: make concurrent fsyncs wait less when
waiting for a transaction commit") subtly broke this by allowing
multiple transactions to be unpinning extents at the same time.
Specifically, the race is as follows:
1. An extent is deleted from an uncached block group in transaction A.
2. btrfs_commit_transaction() is called for transaction A.
3. btrfs_run_delayed_refs() -> __btrfs_free_extent() runs the delayed
ref for the deleted extent.
4. __btrfs_free_extent() -> do_free_extent_accounting() ->
add_to_free_space_tree() adds the deleted extent back to the free
space tree.
5. do_free_extent_accounting() -> btrfs_update_block_group() ->
btrfs_cache_block_group() queues up the block group to get cached.
block_group->progress is set to block_group->start.
6. btrfs_commit_transaction() for transaction A calls
switch_commit_roots(). It sets block_group->last_byte_to_unpin to
block_group->progress, which is block_group->start because the block
group hasn't been cached yet.
7. The caching thread gets to our block group. Since the commit roots
were already switched, load_free_space_tree() sees the deleted extent
as free and adds it to the space cache. It finishes caching and sets
block_group->progress to U64_MAX.
8. btrfs_commit_transaction() advances transaction A to
TRANS_STATE_SUPER_COMMITTED.
9. fsync calls btrfs_commit_transaction() for transaction B. Since
transaction A is already in TRANS_STATE_SUPER_COMMITTED and the
commit is for fsync, it advances.
10. btrfs_commit_transaction() for transaction B calls
switch_commit_roots(). This time, the block group has already been
cached, so it sets block_group->last_byte_to_unpin to U64_MAX.
11. btrfs_commit_transaction() for transaction A calls
btrfs_finish_extent_commit(), which calls unpin_extent_range() for
the deleted extent. It sees last_byte_to_unpin set to U64_MAX (by
transaction B!), so it adds the deleted extent to the space cache
again!
This explains all of our symptoms above:
* If the sequence of events is exactly as described above, when the free
space is re-added in step 11, it will fail with EEXIST.
* If another thread reallocates the deleted extent in between steps 7
and 11, then step 11 will silently re-add that space to the space
cache as free even though it is actually allocated. Then, if that
space is allocated *again*, the free space tree will be corrupted
(namely, the wrong item will be deleted).
* If we don't catch this free space tree corruption, it will continue
to get worse as extents are deleted and reallocated.
The v1 space_cache is synchronously loaded when an extent is deleted
(btrfs_update_block_group() with alloc=0 calls btrfs_cache_block_group()
with load_cache_only=1), so it is not normally affected by this bug.
However, as noted above, if we fail to load the space cache, we will
fall back to caching from the extent tree and may hit this bug.
The easiest fix for this race is to also make caching from the free
space tree or extent tree synchronous. Josef tested this and found no
performance regressions.
A few extra changes fall out of this change. Namely, this fix does the
following, with step 2 being the crucial fix:
1. Factor btrfs_caching_ctl_wait_done() out of
btrfs_wait_block_group_cache_done() to allow waiting on a caching_ctl
that we already hold a reference to.
2. Change the call in btrfs_cache_block_group() of
btrfs_wait_space_cache_v1_finished() to
btrfs_caching_ctl_wait_done(), which makes us wait regardless of the
space_cache option.
3. Delete the now unused btrfs_wait_space_cache_v1_finished() and
space_cache_v1_done().
4. Change btrfs_cache_block_group()'s `int load_cache_only` parameter to
`bool wait` to more accurately describe its new meaning.
5. Change a few callers which had a separate call to
btrfs_wait_block_group_cache_done() to use wait = true instead.
6. Make btrfs_wait_block_group_cache_done() static now that it's not
used outside of block-group.c anymore.
Fixes: d0c2f4fa555e ("btrfs: make concurrent fsyncs wait less when waiting for a transaction commit")
CC: stable@vger.kernel.org # 5.12+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-08-23 11:28:13 -07:00
|
|
|
int btrfs_cache_block_group(struct btrfs_block_group *cache, bool wait)
|
2019-08-06 16:43:19 +02:00
|
|
|
{
|
|
|
|
struct btrfs_fs_info *fs_info = cache->fs_info;
|
2020-10-23 09:58:10 -04:00
|
|
|
struct btrfs_caching_control *caching_ctl = NULL;
|
2019-08-06 16:43:19 +02:00
|
|
|
int ret = 0;
|
|
|
|
|
2021-02-04 19:21:53 +09:00
|
|
|
/* Allocator for zoned filesystems does not use the cache at all */
|
|
|
|
if (btrfs_is_zoned(fs_info))
|
|
|
|
return 0;
|
|
|
|
|
2019-08-06 16:43:19 +02:00
|
|
|
caching_ctl = kzalloc(sizeof(*caching_ctl), GFP_NOFS);
|
|
|
|
if (!caching_ctl)
|
|
|
|
return -ENOMEM;
|
|
|
|
|
|
|
|
INIT_LIST_HEAD(&caching_ctl->list);
|
|
|
|
mutex_init(&caching_ctl->mutex);
|
|
|
|
init_waitqueue_head(&caching_ctl->wait);
|
|
|
|
caching_ctl->block_group = cache;
|
2020-10-23 09:58:10 -04:00
|
|
|
refcount_set(&caching_ctl->count, 2);
|
btrfs: wait for actual caching progress during allocation
Recently we've been having mysterious hangs while running generic/475 on
the CI system. This turned out to be something like this:
Task 1
dmsetup suspend --nolockfs
-> __dm_suspend
-> dm_wait_for_completion
-> dm_wait_for_bios_completion
-> Unable to complete because of IO's on a plug in Task 2
Task 2
wb_workfn
-> wb_writeback
-> blk_start_plug
-> writeback_sb_inodes
-> Infinite loop unable to make an allocation
Task 3
cache_block_group
->read_extent_buffer_pages
->Waiting for IO to complete that can't be submitted because Task 1
suspended the DM device
The problem here is that we need Task 2 to be scheduled completely for
the blk plug to flush. Normally this would happen, we normally wait for
the block group caching to finish (Task 3), and this schedule would
result in the block plug flushing.
However if there's enough free space available from the current caching
to satisfy the allocation we won't actually wait for the caching to
complete. This check however just checks that we have enough space, not
that we can make the allocation. In this particular case we were trying
to allocate 9MiB, and we had 10MiB of free space, but we didn't have
9MiB of contiguous space to allocate, and thus the allocation failed and
we looped.
We specifically don't cycle through the FFE loop until we stop finding
cached block groups because we don't want to allocate new block groups
just because we're caching, so we short circuit the normal loop once we
hit LOOP_CACHING_WAIT and we found a caching block group.
This is normally fine, except in this particular case where the caching
thread can't make progress because the DM device has been suspended.
Fix this by not only waiting for free space to >= the amount of space we
want to allocate, but also that we make some progress in caching from
the time we start waiting. This will keep us from busy looping when the
caching is taking a while but still theoretically has enough space for
us to allocate from, and fixes this particular case by forcing us to
actually sleep and wait for forward progress, which will flush the plug.
With this fix we're no longer hanging with generic/475.
CC: stable@vger.kernel.org # 6.1+
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-07-21 16:09:43 -04:00
|
|
|
atomic_set(&caching_ctl->progress, 0);
|
2023-09-19 18:49:23 +02:00
|
|
|
btrfs_init_work(&caching_ctl->work, caching_thread, NULL);
|
2019-08-06 16:43:19 +02:00
|
|
|
|
|
|
|
spin_lock(&cache->lock);
|
|
|
|
if (cache->cached != BTRFS_CACHE_NO) {
|
|
|
|
kfree(caching_ctl);
|
2020-10-23 09:58:10 -04:00
|
|
|
|
|
|
|
caching_ctl = cache->caching_ctl;
|
|
|
|
if (caching_ctl)
|
|
|
|
refcount_inc(&caching_ctl->count);
|
|
|
|
spin_unlock(&cache->lock);
|
|
|
|
goto out;
|
2019-08-06 16:43:19 +02:00
|
|
|
}
|
|
|
|
WARN_ON(cache->caching_ctl);
|
|
|
|
cache->caching_ctl = caching_ctl;
|
btrfs: fix space cache corruption and potential double allocations
When testing space_cache v2 on a large set of machines, we encountered a
few symptoms:
1. "unable to add free space :-17" (EEXIST) errors.
2. Missing free space info items, sometimes caught with a "missing free
space info for X" error.
3. Double-accounted space: ranges that were allocated in the extent tree
and also marked as free in the free space tree, ranges that were
marked as allocated twice in the extent tree, or ranges that were
marked as free twice in the free space tree. If the latter made it
onto disk, the next reboot would hit the BUG_ON() in
add_new_free_space().
4. On some hosts with no on-disk corruption or error messages, the
in-memory space cache (dumped with drgn) disagreed with the free
space tree.
All of these symptoms have the same underlying cause: a race between
caching the free space for a block group and returning free space to the
in-memory space cache for pinned extents causes us to double-add a free
range to the space cache. This race exists when free space is cached
from the free space tree (space_cache=v2) or the extent tree
(nospace_cache, or space_cache=v1 if the cache needs to be regenerated).
struct btrfs_block_group::last_byte_to_unpin and struct
btrfs_block_group::progress are supposed to protect against this race,
but commit d0c2f4fa555e ("btrfs: make concurrent fsyncs wait less when
waiting for a transaction commit") subtly broke this by allowing
multiple transactions to be unpinning extents at the same time.
Specifically, the race is as follows:
1. An extent is deleted from an uncached block group in transaction A.
2. btrfs_commit_transaction() is called for transaction A.
3. btrfs_run_delayed_refs() -> __btrfs_free_extent() runs the delayed
ref for the deleted extent.
4. __btrfs_free_extent() -> do_free_extent_accounting() ->
add_to_free_space_tree() adds the deleted extent back to the free
space tree.
5. do_free_extent_accounting() -> btrfs_update_block_group() ->
btrfs_cache_block_group() queues up the block group to get cached.
block_group->progress is set to block_group->start.
6. btrfs_commit_transaction() for transaction A calls
switch_commit_roots(). It sets block_group->last_byte_to_unpin to
block_group->progress, which is block_group->start because the block
group hasn't been cached yet.
7. The caching thread gets to our block group. Since the commit roots
were already switched, load_free_space_tree() sees the deleted extent
as free and adds it to the space cache. It finishes caching and sets
block_group->progress to U64_MAX.
8. btrfs_commit_transaction() advances transaction A to
TRANS_STATE_SUPER_COMMITTED.
9. fsync calls btrfs_commit_transaction() for transaction B. Since
transaction A is already in TRANS_STATE_SUPER_COMMITTED and the
commit is for fsync, it advances.
10. btrfs_commit_transaction() for transaction B calls
switch_commit_roots(). This time, the block group has already been
cached, so it sets block_group->last_byte_to_unpin to U64_MAX.
11. btrfs_commit_transaction() for transaction A calls
btrfs_finish_extent_commit(), which calls unpin_extent_range() for
the deleted extent. It sees last_byte_to_unpin set to U64_MAX (by
transaction B!), so it adds the deleted extent to the space cache
again!
This explains all of our symptoms above:
* If the sequence of events is exactly as described above, when the free
space is re-added in step 11, it will fail with EEXIST.
* If another thread reallocates the deleted extent in between steps 7
and 11, then step 11 will silently re-add that space to the space
cache as free even though it is actually allocated. Then, if that
space is allocated *again*, the free space tree will be corrupted
(namely, the wrong item will be deleted).
* If we don't catch this free space tree corruption, it will continue
to get worse as extents are deleted and reallocated.
The v1 space_cache is synchronously loaded when an extent is deleted
(btrfs_update_block_group() with alloc=0 calls btrfs_cache_block_group()
with load_cache_only=1), so it is not normally affected by this bug.
However, as noted above, if we fail to load the space cache, we will
fall back to caching from the extent tree and may hit this bug.
The easiest fix for this race is to also make caching from the free
space tree or extent tree synchronous. Josef tested this and found no
performance regressions.
A few extra changes fall out of this change. Namely, this fix does the
following, with step 2 being the crucial fix:
1. Factor btrfs_caching_ctl_wait_done() out of
btrfs_wait_block_group_cache_done() to allow waiting on a caching_ctl
that we already hold a reference to.
2. Change the call in btrfs_cache_block_group() of
btrfs_wait_space_cache_v1_finished() to
btrfs_caching_ctl_wait_done(), which makes us wait regardless of the
space_cache option.
3. Delete the now unused btrfs_wait_space_cache_v1_finished() and
space_cache_v1_done().
4. Change btrfs_cache_block_group()'s `int load_cache_only` parameter to
`bool wait` to more accurately describe its new meaning.
5. Change a few callers which had a separate call to
btrfs_wait_block_group_cache_done() to use wait = true instead.
6. Make btrfs_wait_block_group_cache_done() static now that it's not
used outside of block-group.c anymore.
Fixes: d0c2f4fa555e ("btrfs: make concurrent fsyncs wait less when waiting for a transaction commit")
CC: stable@vger.kernel.org # 5.12+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-08-23 11:28:13 -07:00
|
|
|
cache->cached = BTRFS_CACHE_STARTED;
|
2019-08-06 16:43:19 +02:00
|
|
|
spin_unlock(&cache->lock);
|
|
|
|
|
btrfs: use a read/write lock for protecting the block groups tree
Currently we use a spin lock to protect the red black tree that we use to
track block groups. Most accesses to that tree are actually read only and
for large filesystems, with thousands of block groups, it actually has
a bad impact on performance, as concurrent read only searches on the tree
are serialized.
Read only searches on the tree are very frequent and done when:
1) Pinning and unpinning extents, as we need to lookup the respective
block group from the tree;
2) Freeing the last reference of a tree block, regardless if we pin the
underlying extent or add it back to free space cache/tree;
3) During NOCOW writes, both buffered IO and direct IO, we need to check
if the block group that contains an extent is read only or not and to
increment the number of NOCOW writers in the block group. For those
operations we need to search for the block group in the tree.
Similarly, after creating the ordered extent for the NOCOW write, we
need to decrement the number of NOCOW writers from the same block
group, which requires searching for it in the tree;
4) Decreasing the number of extent reservations in a block group;
5) When allocating extents and freeing reserved extents;
6) Adding and removing free space to the free space tree;
7) When releasing delalloc bytes during ordered extent completion;
8) When relocating a block group;
9) During fitrim, to iterate over the block groups;
10) etc;
Write accesses to the tree, to add or remove block groups, are much less
frequent as they happen only when allocating a new block group or when
deleting a block group.
We also use the same spin lock to protect the list of currently caching
block groups. Additions to this list are made when we need to cache a
block group, because we don't have a free space cache for it (or we have
but it's invalid), and removals from this list are done when caching of
the block group's free space finishes. These cases are also not very
common, but when they happen, they happen only once when the filesystem
is mounted.
So switch the lock that protects the tree of block groups from a spinning
lock to a read/write lock.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-04-13 16:20:41 +01:00
|
|
|
write_lock(&fs_info->block_group_cache_lock);
|
2019-08-06 16:43:19 +02:00
|
|
|
refcount_inc(&caching_ctl->count);
|
|
|
|
list_add_tail(&caching_ctl->list, &fs_info->caching_block_groups);
|
btrfs: use a read/write lock for protecting the block groups tree
Currently we use a spin lock to protect the red black tree that we use to
track block groups. Most accesses to that tree are actually read only and
for large filesystems, with thousands of block groups, it actually has
a bad impact on performance, as concurrent read only searches on the tree
are serialized.
Read only searches on the tree are very frequent and done when:
1) Pinning and unpinning extents, as we need to lookup the respective
block group from the tree;
2) Freeing the last reference of a tree block, regardless if we pin the
underlying extent or add it back to free space cache/tree;
3) During NOCOW writes, both buffered IO and direct IO, we need to check
if the block group that contains an extent is read only or not and to
increment the number of NOCOW writers in the block group. For those
operations we need to search for the block group in the tree.
Similarly, after creating the ordered extent for the NOCOW write, we
need to decrement the number of NOCOW writers from the same block
group, which requires searching for it in the tree;
4) Decreasing the number of extent reservations in a block group;
5) When allocating extents and freeing reserved extents;
6) Adding and removing free space to the free space tree;
7) When releasing delalloc bytes during ordered extent completion;
8) When relocating a block group;
9) During fitrim, to iterate over the block groups;
10) etc;
Write accesses to the tree, to add or remove block groups, are much less
frequent as they happen only when allocating a new block group or when
deleting a block group.
We also use the same spin lock to protect the list of currently caching
block groups. Additions to this list are made when we need to cache a
block group, because we don't have a free space cache for it (or we have
but it's invalid), and removals from this list are done when caching of
the block group's free space finishes. These cases are also not very
common, but when they happen, they happen only once when the filesystem
is mounted.
So switch the lock that protects the tree of block groups from a spinning
lock to a read/write lock.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-04-13 16:20:41 +01:00
|
|
|
write_unlock(&fs_info->block_group_cache_lock);
|
2019-08-06 16:43:19 +02:00
|
|
|
|
|
|
|
btrfs_get_block_group(cache);
|
|
|
|
|
|
|
|
btrfs_queue_work(fs_info->caching_workers, &caching_ctl->work);
|
2020-10-23 09:58:10 -04:00
|
|
|
out:
|
btrfs: fix space cache corruption and potential double allocations
When testing space_cache v2 on a large set of machines, we encountered a
few symptoms:
1. "unable to add free space :-17" (EEXIST) errors.
2. Missing free space info items, sometimes caught with a "missing free
space info for X" error.
3. Double-accounted space: ranges that were allocated in the extent tree
and also marked as free in the free space tree, ranges that were
marked as allocated twice in the extent tree, or ranges that were
marked as free twice in the free space tree. If the latter made it
onto disk, the next reboot would hit the BUG_ON() in
add_new_free_space().
4. On some hosts with no on-disk corruption or error messages, the
in-memory space cache (dumped with drgn) disagreed with the free
space tree.
All of these symptoms have the same underlying cause: a race between
caching the free space for a block group and returning free space to the
in-memory space cache for pinned extents causes us to double-add a free
range to the space cache. This race exists when free space is cached
from the free space tree (space_cache=v2) or the extent tree
(nospace_cache, or space_cache=v1 if the cache needs to be regenerated).
struct btrfs_block_group::last_byte_to_unpin and struct
btrfs_block_group::progress are supposed to protect against this race,
but commit d0c2f4fa555e ("btrfs: make concurrent fsyncs wait less when
waiting for a transaction commit") subtly broke this by allowing
multiple transactions to be unpinning extents at the same time.
Specifically, the race is as follows:
1. An extent is deleted from an uncached block group in transaction A.
2. btrfs_commit_transaction() is called for transaction A.
3. btrfs_run_delayed_refs() -> __btrfs_free_extent() runs the delayed
ref for the deleted extent.
4. __btrfs_free_extent() -> do_free_extent_accounting() ->
add_to_free_space_tree() adds the deleted extent back to the free
space tree.
5. do_free_extent_accounting() -> btrfs_update_block_group() ->
btrfs_cache_block_group() queues up the block group to get cached.
block_group->progress is set to block_group->start.
6. btrfs_commit_transaction() for transaction A calls
switch_commit_roots(). It sets block_group->last_byte_to_unpin to
block_group->progress, which is block_group->start because the block
group hasn't been cached yet.
7. The caching thread gets to our block group. Since the commit roots
were already switched, load_free_space_tree() sees the deleted extent
as free and adds it to the space cache. It finishes caching and sets
block_group->progress to U64_MAX.
8. btrfs_commit_transaction() advances transaction A to
TRANS_STATE_SUPER_COMMITTED.
9. fsync calls btrfs_commit_transaction() for transaction B. Since
transaction A is already in TRANS_STATE_SUPER_COMMITTED and the
commit is for fsync, it advances.
10. btrfs_commit_transaction() for transaction B calls
switch_commit_roots(). This time, the block group has already been
cached, so it sets block_group->last_byte_to_unpin to U64_MAX.
11. btrfs_commit_transaction() for transaction A calls
btrfs_finish_extent_commit(), which calls unpin_extent_range() for
the deleted extent. It sees last_byte_to_unpin set to U64_MAX (by
transaction B!), so it adds the deleted extent to the space cache
again!
This explains all of our symptoms above:
* If the sequence of events is exactly as described above, when the free
space is re-added in step 11, it will fail with EEXIST.
* If another thread reallocates the deleted extent in between steps 7
and 11, then step 11 will silently re-add that space to the space
cache as free even though it is actually allocated. Then, if that
space is allocated *again*, the free space tree will be corrupted
(namely, the wrong item will be deleted).
* If we don't catch this free space tree corruption, it will continue
to get worse as extents are deleted and reallocated.
The v1 space_cache is synchronously loaded when an extent is deleted
(btrfs_update_block_group() with alloc=0 calls btrfs_cache_block_group()
with load_cache_only=1), so it is not normally affected by this bug.
However, as noted above, if we fail to load the space cache, we will
fall back to caching from the extent tree and may hit this bug.
The easiest fix for this race is to also make caching from the free
space tree or extent tree synchronous. Josef tested this and found no
performance regressions.
A few extra changes fall out of this change. Namely, this fix does the
following, with step 2 being the crucial fix:
1. Factor btrfs_caching_ctl_wait_done() out of
btrfs_wait_block_group_cache_done() to allow waiting on a caching_ctl
that we already hold a reference to.
2. Change the call in btrfs_cache_block_group() of
btrfs_wait_space_cache_v1_finished() to
btrfs_caching_ctl_wait_done(), which makes us wait regardless of the
space_cache option.
3. Delete the now unused btrfs_wait_space_cache_v1_finished() and
space_cache_v1_done().
4. Change btrfs_cache_block_group()'s `int load_cache_only` parameter to
`bool wait` to more accurately describe its new meaning.
5. Change a few callers which had a separate call to
btrfs_wait_block_group_cache_done() to use wait = true instead.
6. Make btrfs_wait_block_group_cache_done() static now that it's not
used outside of block-group.c anymore.
Fixes: d0c2f4fa555e ("btrfs: make concurrent fsyncs wait less when waiting for a transaction commit")
CC: stable@vger.kernel.org # 5.12+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-08-23 11:28:13 -07:00
|
|
|
if (wait && caching_ctl)
|
|
|
|
ret = btrfs_caching_ctl_wait_done(cache, caching_ctl);
|
2020-10-23 09:58:10 -04:00
|
|
|
if (caching_ctl)
|
|
|
|
btrfs_put_caching_control(caching_ctl);
|
2019-08-06 16:43:19 +02:00
|
|
|
|
|
|
|
return ret;
|
|
|
|
}
|
2019-06-20 15:37:55 -04:00
|
|
|
|
|
|
|
static void clear_avail_alloc_bits(struct btrfs_fs_info *fs_info, u64 flags)
|
|
|
|
{
|
|
|
|
u64 extra_flags = chunk_to_extended(flags) &
|
|
|
|
BTRFS_EXTENDED_PROFILE_MASK;
|
|
|
|
|
|
|
|
write_seqlock(&fs_info->profiles_lock);
|
|
|
|
if (flags & BTRFS_BLOCK_GROUP_DATA)
|
|
|
|
fs_info->avail_data_alloc_bits &= ~extra_flags;
|
|
|
|
if (flags & BTRFS_BLOCK_GROUP_METADATA)
|
|
|
|
fs_info->avail_metadata_alloc_bits &= ~extra_flags;
|
|
|
|
if (flags & BTRFS_BLOCK_GROUP_SYSTEM)
|
|
|
|
fs_info->avail_system_alloc_bits &= ~extra_flags;
|
|
|
|
write_sequnlock(&fs_info->profiles_lock);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Clear incompat bits for the following feature(s):
|
|
|
|
*
|
|
|
|
* - RAID56 - in case there's neither RAID5 nor RAID6 profile block group
|
|
|
|
* in the whole filesystem
|
2019-10-31 15:52:01 +01:00
|
|
|
*
|
|
|
|
* - RAID1C34 - same as above for RAID1C3 and RAID1C4 block groups
|
2019-06-20 15:37:55 -04:00
|
|
|
*/
|
|
|
|
static void clear_incompat_bg_bits(struct btrfs_fs_info *fs_info, u64 flags)
|
|
|
|
{
|
2019-10-31 15:52:01 +01:00
|
|
|
bool found_raid56 = false;
|
|
|
|
bool found_raid1c34 = false;
|
|
|
|
|
|
|
|
if ((flags & BTRFS_BLOCK_GROUP_RAID56_MASK) ||
|
|
|
|
(flags & BTRFS_BLOCK_GROUP_RAID1C3) ||
|
|
|
|
(flags & BTRFS_BLOCK_GROUP_RAID1C4)) {
|
2019-06-20 15:37:55 -04:00
|
|
|
struct list_head *head = &fs_info->space_info;
|
|
|
|
struct btrfs_space_info *sinfo;
|
|
|
|
|
|
|
|
list_for_each_entry_rcu(sinfo, head, list) {
|
|
|
|
down_read(&sinfo->groups_sem);
|
|
|
|
if (!list_empty(&sinfo->block_groups[BTRFS_RAID_RAID5]))
|
2019-10-31 15:52:01 +01:00
|
|
|
found_raid56 = true;
|
2019-06-20 15:37:55 -04:00
|
|
|
if (!list_empty(&sinfo->block_groups[BTRFS_RAID_RAID6]))
|
2019-10-31 15:52:01 +01:00
|
|
|
found_raid56 = true;
|
|
|
|
if (!list_empty(&sinfo->block_groups[BTRFS_RAID_RAID1C3]))
|
|
|
|
found_raid1c34 = true;
|
|
|
|
if (!list_empty(&sinfo->block_groups[BTRFS_RAID_RAID1C4]))
|
|
|
|
found_raid1c34 = true;
|
2019-06-20 15:37:55 -04:00
|
|
|
up_read(&sinfo->groups_sem);
|
|
|
|
}
|
2020-03-20 18:43:48 +00:00
|
|
|
if (!found_raid56)
|
2019-10-31 15:52:01 +01:00
|
|
|
btrfs_clear_fs_incompat(fs_info, RAID56);
|
2020-03-20 18:43:48 +00:00
|
|
|
if (!found_raid1c34)
|
2019-10-31 15:52:01 +01:00
|
|
|
btrfs_clear_fs_incompat(fs_info, RAID1C34);
|
2019-06-20 15:37:55 -04:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2024-05-19 08:14:56 +08:00
|
|
|
static struct btrfs_root *btrfs_block_group_root(struct btrfs_fs_info *fs_info)
|
|
|
|
{
|
|
|
|
if (btrfs_fs_compat_ro(fs_info, BLOCK_GROUP_TREE))
|
|
|
|
return fs_info->block_group_root;
|
|
|
|
return btrfs_extent_root(fs_info, 0);
|
|
|
|
}
|
|
|
|
|
2020-05-05 07:58:21 +08:00
|
|
|
static int remove_block_group_item(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_path *path,
|
|
|
|
struct btrfs_block_group *block_group)
|
|
|
|
{
|
|
|
|
struct btrfs_fs_info *fs_info = trans->fs_info;
|
|
|
|
struct btrfs_root *root;
|
|
|
|
struct btrfs_key key;
|
|
|
|
int ret;
|
|
|
|
|
2021-11-05 16:45:36 -04:00
|
|
|
root = btrfs_block_group_root(fs_info);
|
2020-05-05 07:58:21 +08:00
|
|
|
key.objectid = block_group->start;
|
|
|
|
key.type = BTRFS_BLOCK_GROUP_ITEM_KEY;
|
|
|
|
key.offset = block_group->length;
|
|
|
|
|
|
|
|
ret = btrfs_search_slot(trans, root, &key, path, -1, 1);
|
|
|
|
if (ret > 0)
|
|
|
|
ret = -ENOENT;
|
|
|
|
if (ret < 0)
|
|
|
|
return ret;
|
|
|
|
|
|
|
|
ret = btrfs_del_item(trans, root, path);
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2019-06-20 15:37:55 -04:00
|
|
|
int btrfs_remove_block_group(struct btrfs_trans_handle *trans,
|
btrfs: use a dedicated data structure for chunk maps
Currently we abuse the extent_map structure for two purposes:
1) To actually represent extents for inodes;
2) To represent chunk mappings.
This is odd and has several disadvantages:
1) To create a chunk map, we need to do two memory allocations: one for
an extent_map structure and another one for a map_lookup structure, so
more potential for an allocation failure and more complicated code to
manage and link two structures;
2) For a chunk map we actually only use 3 fields (24 bytes) of the
respective extent map structure: the 'start' field to have the logical
start address of the chunk, the 'len' field to have the chunk's size,
and the 'orig_block_len' field to contain the chunk's stripe size.
Besides wasting a memory, it's also odd and not intuitive at all to
have the stripe size in a field named 'orig_block_len'.
We are also using 'block_len' of the extent_map structure to contain
the chunk size, so we have 2 fields for the same value, 'len' and
'block_len', which is pointless;
3) When an extent map is associated to a chunk mapping, we set the bit
EXTENT_FLAG_FS_MAPPING on its flags and then make its member named
'map_lookup' point to the associated map_lookup structure. This means
that for an extent map associated to an inode extent, we are not using
this 'map_lookup' pointer, so wasting 8 bytes (on a 64 bits platform);
4) Extent maps associated to a chunk mapping are never merged or split so
it's pointless to use the existing extent map infrastructure.
So add a dedicated data structure named 'btrfs_chunk_map' to represent
chunk mappings, this is basically the existing map_lookup structure with
some extra fields:
1) 'start' to contain the chunk logical address;
2) 'chunk_len' to contain the chunk's length;
3) 'stripe_size' for the stripe size;
4) 'rb_node' for insertion into a rb tree;
5) 'refs' for reference counting.
This way we do a single memory allocation for chunk mappings and we don't
waste memory for them with unused/unnecessary fields from an extent_map.
We also save 8 bytes from the extent_map structure by removing the
'map_lookup' pointer, so the size of struct extent_map is reduced from
144 bytes down to 136 bytes, and we can now have 30 extents map per 4K
page instead of 28.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-11-21 13:38:38 +00:00
|
|
|
struct btrfs_chunk_map *map)
|
2019-06-20 15:37:55 -04:00
|
|
|
{
|
|
|
|
struct btrfs_fs_info *fs_info = trans->fs_info;
|
|
|
|
struct btrfs_path *path;
|
2019-10-29 19:20:18 +01:00
|
|
|
struct btrfs_block_group *block_group;
|
2019-06-20 15:37:55 -04:00
|
|
|
struct btrfs_free_cluster *cluster;
|
|
|
|
struct inode *inode;
|
|
|
|
struct kobject *kobj = NULL;
|
|
|
|
int ret;
|
|
|
|
int index;
|
|
|
|
int factor;
|
|
|
|
struct btrfs_caching_control *caching_ctl = NULL;
|
btrfs: use a dedicated data structure for chunk maps
Currently we abuse the extent_map structure for two purposes:
1) To actually represent extents for inodes;
2) To represent chunk mappings.
This is odd and has several disadvantages:
1) To create a chunk map, we need to do two memory allocations: one for
an extent_map structure and another one for a map_lookup structure, so
more potential for an allocation failure and more complicated code to
manage and link two structures;
2) For a chunk map we actually only use 3 fields (24 bytes) of the
respective extent map structure: the 'start' field to have the logical
start address of the chunk, the 'len' field to have the chunk's size,
and the 'orig_block_len' field to contain the chunk's stripe size.
Besides wasting a memory, it's also odd and not intuitive at all to
have the stripe size in a field named 'orig_block_len'.
We are also using 'block_len' of the extent_map structure to contain
the chunk size, so we have 2 fields for the same value, 'len' and
'block_len', which is pointless;
3) When an extent map is associated to a chunk mapping, we set the bit
EXTENT_FLAG_FS_MAPPING on its flags and then make its member named
'map_lookup' point to the associated map_lookup structure. This means
that for an extent map associated to an inode extent, we are not using
this 'map_lookup' pointer, so wasting 8 bytes (on a 64 bits platform);
4) Extent maps associated to a chunk mapping are never merged or split so
it's pointless to use the existing extent map infrastructure.
So add a dedicated data structure named 'btrfs_chunk_map' to represent
chunk mappings, this is basically the existing map_lookup structure with
some extra fields:
1) 'start' to contain the chunk logical address;
2) 'chunk_len' to contain the chunk's length;
3) 'stripe_size' for the stripe size;
4) 'rb_node' for insertion into a rb tree;
5) 'refs' for reference counting.
This way we do a single memory allocation for chunk mappings and we don't
waste memory for them with unused/unnecessary fields from an extent_map.
We also save 8 bytes from the extent_map structure by removing the
'map_lookup' pointer, so the size of struct extent_map is reduced from
144 bytes down to 136 bytes, and we can now have 30 extents map per 4K
page instead of 28.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-11-21 13:38:38 +00:00
|
|
|
bool remove_map;
|
2019-06-20 15:37:55 -04:00
|
|
|
bool remove_rsv = false;
|
|
|
|
|
btrfs: use a dedicated data structure for chunk maps
Currently we abuse the extent_map structure for two purposes:
1) To actually represent extents for inodes;
2) To represent chunk mappings.
This is odd and has several disadvantages:
1) To create a chunk map, we need to do two memory allocations: one for
an extent_map structure and another one for a map_lookup structure, so
more potential for an allocation failure and more complicated code to
manage and link two structures;
2) For a chunk map we actually only use 3 fields (24 bytes) of the
respective extent map structure: the 'start' field to have the logical
start address of the chunk, the 'len' field to have the chunk's size,
and the 'orig_block_len' field to contain the chunk's stripe size.
Besides wasting a memory, it's also odd and not intuitive at all to
have the stripe size in a field named 'orig_block_len'.
We are also using 'block_len' of the extent_map structure to contain
the chunk size, so we have 2 fields for the same value, 'len' and
'block_len', which is pointless;
3) When an extent map is associated to a chunk mapping, we set the bit
EXTENT_FLAG_FS_MAPPING on its flags and then make its member named
'map_lookup' point to the associated map_lookup structure. This means
that for an extent map associated to an inode extent, we are not using
this 'map_lookup' pointer, so wasting 8 bytes (on a 64 bits platform);
4) Extent maps associated to a chunk mapping are never merged or split so
it's pointless to use the existing extent map infrastructure.
So add a dedicated data structure named 'btrfs_chunk_map' to represent
chunk mappings, this is basically the existing map_lookup structure with
some extra fields:
1) 'start' to contain the chunk logical address;
2) 'chunk_len' to contain the chunk's length;
3) 'stripe_size' for the stripe size;
4) 'rb_node' for insertion into a rb tree;
5) 'refs' for reference counting.
This way we do a single memory allocation for chunk mappings and we don't
waste memory for them with unused/unnecessary fields from an extent_map.
We also save 8 bytes from the extent_map structure by removing the
'map_lookup' pointer, so the size of struct extent_map is reduced from
144 bytes down to 136 bytes, and we can now have 30 extents map per 4K
page instead of 28.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-11-21 13:38:38 +00:00
|
|
|
block_group = btrfs_lookup_block_group(fs_info, map->start);
|
2024-01-20 02:17:03 +01:00
|
|
|
if (!block_group)
|
|
|
|
return -ENOENT;
|
|
|
|
|
2019-06-20 15:37:55 -04:00
|
|
|
BUG_ON(!block_group->ro);
|
|
|
|
|
|
|
|
trace_btrfs_remove_block_group(block_group);
|
|
|
|
/*
|
|
|
|
* Free the reserved super bytes from this block group before
|
|
|
|
* remove it.
|
|
|
|
*/
|
|
|
|
btrfs_free_excluded_extents(block_group);
|
2019-10-23 18:48:22 +02:00
|
|
|
btrfs_free_ref_tree_range(fs_info, block_group->start,
|
|
|
|
block_group->length);
|
2019-06-20 15:37:55 -04:00
|
|
|
|
|
|
|
index = btrfs_bg_flags_to_raid_index(block_group->flags);
|
|
|
|
factor = btrfs_bg_type_to_factor(block_group->flags);
|
|
|
|
|
|
|
|
/* make sure this block group isn't part of an allocation cluster */
|
|
|
|
cluster = &fs_info->data_alloc_cluster;
|
|
|
|
spin_lock(&cluster->refill_lock);
|
|
|
|
btrfs_return_cluster_to_free_space(block_group, cluster);
|
|
|
|
spin_unlock(&cluster->refill_lock);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* make sure this block group isn't part of a metadata
|
|
|
|
* allocation cluster
|
|
|
|
*/
|
|
|
|
cluster = &fs_info->meta_alloc_cluster;
|
|
|
|
spin_lock(&cluster->refill_lock);
|
|
|
|
btrfs_return_cluster_to_free_space(block_group, cluster);
|
|
|
|
spin_unlock(&cluster->refill_lock);
|
|
|
|
|
2021-02-04 19:22:18 +09:00
|
|
|
btrfs_clear_treelog_bg(block_group);
|
2021-09-09 01:19:26 +09:00
|
|
|
btrfs_clear_data_reloc_bg(block_group);
|
2021-02-04 19:22:18 +09:00
|
|
|
|
2019-06-20 15:37:55 -04:00
|
|
|
path = btrfs_alloc_path();
|
|
|
|
if (!path) {
|
|
|
|
ret = -ENOMEM;
|
2020-06-01 19:12:06 +01:00
|
|
|
goto out;
|
2019-06-20 15:37:55 -04:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* get the inode first so any iput calls done for the io_list
|
|
|
|
* aren't the final iput (no unlinks allowed now)
|
|
|
|
*/
|
|
|
|
inode = lookup_free_space_inode(block_group, path);
|
|
|
|
|
|
|
|
mutex_lock(&trans->transaction->cache_write_mutex);
|
|
|
|
/*
|
|
|
|
* Make sure our free space cache IO is done before removing the
|
|
|
|
* free space inode
|
|
|
|
*/
|
|
|
|
spin_lock(&trans->transaction->dirty_bgs_lock);
|
|
|
|
if (!list_empty(&block_group->io_list)) {
|
|
|
|
list_del_init(&block_group->io_list);
|
|
|
|
|
|
|
|
WARN_ON(!IS_ERR(inode) && inode != block_group->io_ctl.inode);
|
|
|
|
|
|
|
|
spin_unlock(&trans->transaction->dirty_bgs_lock);
|
|
|
|
btrfs_wait_cache_io(trans, block_group, path);
|
|
|
|
btrfs_put_block_group(block_group);
|
|
|
|
spin_lock(&trans->transaction->dirty_bgs_lock);
|
|
|
|
}
|
|
|
|
|
|
|
|
if (!list_empty(&block_group->dirty_list)) {
|
|
|
|
list_del_init(&block_group->dirty_list);
|
|
|
|
remove_rsv = true;
|
|
|
|
btrfs_put_block_group(block_group);
|
|
|
|
}
|
|
|
|
spin_unlock(&trans->transaction->dirty_bgs_lock);
|
|
|
|
mutex_unlock(&trans->transaction->cache_write_mutex);
|
|
|
|
|
2020-11-18 15:06:25 -08:00
|
|
|
ret = btrfs_remove_free_space_inode(trans, inode, block_group);
|
|
|
|
if (ret)
|
2020-06-01 19:12:06 +01:00
|
|
|
goto out;
|
2019-06-20 15:37:55 -04:00
|
|
|
|
btrfs: use a read/write lock for protecting the block groups tree
Currently we use a spin lock to protect the red black tree that we use to
track block groups. Most accesses to that tree are actually read only and
for large filesystems, with thousands of block groups, it actually has
a bad impact on performance, as concurrent read only searches on the tree
are serialized.
Read only searches on the tree are very frequent and done when:
1) Pinning and unpinning extents, as we need to lookup the respective
block group from the tree;
2) Freeing the last reference of a tree block, regardless if we pin the
underlying extent or add it back to free space cache/tree;
3) During NOCOW writes, both buffered IO and direct IO, we need to check
if the block group that contains an extent is read only or not and to
increment the number of NOCOW writers in the block group. For those
operations we need to search for the block group in the tree.
Similarly, after creating the ordered extent for the NOCOW write, we
need to decrement the number of NOCOW writers from the same block
group, which requires searching for it in the tree;
4) Decreasing the number of extent reservations in a block group;
5) When allocating extents and freeing reserved extents;
6) Adding and removing free space to the free space tree;
7) When releasing delalloc bytes during ordered extent completion;
8) When relocating a block group;
9) During fitrim, to iterate over the block groups;
10) etc;
Write accesses to the tree, to add or remove block groups, are much less
frequent as they happen only when allocating a new block group or when
deleting a block group.
We also use the same spin lock to protect the list of currently caching
block groups. Additions to this list are made when we need to cache a
block group, because we don't have a free space cache for it (or we have
but it's invalid), and removals from this list are done when caching of
the block group's free space finishes. These cases are also not very
common, but when they happen, they happen only once when the filesystem
is mounted.
So switch the lock that protects the tree of block groups from a spinning
lock to a read/write lock.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-04-13 16:20:41 +01:00
|
|
|
write_lock(&fs_info->block_group_cache_lock);
|
2022-04-13 16:20:40 +01:00
|
|
|
rb_erase_cached(&block_group->cache_node,
|
|
|
|
&fs_info->block_group_cache_tree);
|
2019-06-20 15:37:55 -04:00
|
|
|
RB_CLEAR_NODE(&block_group->cache_node);
|
|
|
|
|
2020-06-01 19:12:06 +01:00
|
|
|
/* Once for the block groups rbtree */
|
|
|
|
btrfs_put_block_group(block_group);
|
|
|
|
|
btrfs: use a read/write lock for protecting the block groups tree
Currently we use a spin lock to protect the red black tree that we use to
track block groups. Most accesses to that tree are actually read only and
for large filesystems, with thousands of block groups, it actually has
a bad impact on performance, as concurrent read only searches on the tree
are serialized.
Read only searches on the tree are very frequent and done when:
1) Pinning and unpinning extents, as we need to lookup the respective
block group from the tree;
2) Freeing the last reference of a tree block, regardless if we pin the
underlying extent or add it back to free space cache/tree;
3) During NOCOW writes, both buffered IO and direct IO, we need to check
if the block group that contains an extent is read only or not and to
increment the number of NOCOW writers in the block group. For those
operations we need to search for the block group in the tree.
Similarly, after creating the ordered extent for the NOCOW write, we
need to decrement the number of NOCOW writers from the same block
group, which requires searching for it in the tree;
4) Decreasing the number of extent reservations in a block group;
5) When allocating extents and freeing reserved extents;
6) Adding and removing free space to the free space tree;
7) When releasing delalloc bytes during ordered extent completion;
8) When relocating a block group;
9) During fitrim, to iterate over the block groups;
10) etc;
Write accesses to the tree, to add or remove block groups, are much less
frequent as they happen only when allocating a new block group or when
deleting a block group.
We also use the same spin lock to protect the list of currently caching
block groups. Additions to this list are made when we need to cache a
block group, because we don't have a free space cache for it (or we have
but it's invalid), and removals from this list are done when caching of
the block group's free space finishes. These cases are also not very
common, but when they happen, they happen only once when the filesystem
is mounted.
So switch the lock that protects the tree of block groups from a spinning
lock to a read/write lock.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-04-13 16:20:41 +01:00
|
|
|
write_unlock(&fs_info->block_group_cache_lock);
|
2019-06-20 15:37:55 -04:00
|
|
|
|
|
|
|
down_write(&block_group->space_info->groups_sem);
|
|
|
|
/*
|
|
|
|
* we must use list_del_init so people can check to see if they
|
|
|
|
* are still on the list after taking the semaphore
|
|
|
|
*/
|
|
|
|
list_del_init(&block_group->list);
|
|
|
|
if (list_empty(&block_group->space_info->block_groups[index])) {
|
|
|
|
kobj = block_group->space_info->block_group_kobjs[index];
|
|
|
|
block_group->space_info->block_group_kobjs[index] = NULL;
|
|
|
|
clear_avail_alloc_bits(fs_info, block_group->flags);
|
|
|
|
}
|
|
|
|
up_write(&block_group->space_info->groups_sem);
|
|
|
|
clear_incompat_bg_bits(fs_info, block_group->flags);
|
|
|
|
if (kobj) {
|
|
|
|
kobject_del(kobj);
|
|
|
|
kobject_put(kobj);
|
|
|
|
}
|
|
|
|
|
|
|
|
if (block_group->cached == BTRFS_CACHE_STARTED)
|
|
|
|
btrfs_wait_block_group_cache_done(block_group);
|
2022-07-15 15:45:27 -04:00
|
|
|
|
|
|
|
write_lock(&fs_info->block_group_cache_lock);
|
|
|
|
caching_ctl = btrfs_get_caching_control(block_group);
|
|
|
|
if (!caching_ctl) {
|
|
|
|
struct btrfs_caching_control *ctl;
|
|
|
|
|
|
|
|
list_for_each_entry(ctl, &fs_info->caching_block_groups, list) {
|
|
|
|
if (ctl->block_group == block_group) {
|
|
|
|
caching_ctl = ctl;
|
|
|
|
refcount_inc(&caching_ctl->count);
|
|
|
|
break;
|
|
|
|
}
|
2019-06-20 15:37:55 -04:00
|
|
|
}
|
|
|
|
}
|
2022-07-15 15:45:27 -04:00
|
|
|
if (caching_ctl)
|
|
|
|
list_del_init(&caching_ctl->list);
|
|
|
|
write_unlock(&fs_info->block_group_cache_lock);
|
|
|
|
|
|
|
|
if (caching_ctl) {
|
|
|
|
/* Once for the caching bgs list and once for us. */
|
|
|
|
btrfs_put_caching_control(caching_ctl);
|
|
|
|
btrfs_put_caching_control(caching_ctl);
|
|
|
|
}
|
2019-06-20 15:37:55 -04:00
|
|
|
|
|
|
|
spin_lock(&trans->transaction->dirty_bgs_lock);
|
|
|
|
WARN_ON(!list_empty(&block_group->dirty_list));
|
|
|
|
WARN_ON(!list_empty(&block_group->io_list));
|
|
|
|
spin_unlock(&trans->transaction->dirty_bgs_lock);
|
|
|
|
|
|
|
|
btrfs_remove_free_space_cache(block_group);
|
|
|
|
|
|
|
|
spin_lock(&block_group->space_info->lock);
|
|
|
|
list_del_init(&block_group->ro_list);
|
|
|
|
|
|
|
|
if (btrfs_test_opt(fs_info, ENOSPC_DEBUG)) {
|
|
|
|
WARN_ON(block_group->space_info->total_bytes
|
2019-10-23 18:48:22 +02:00
|
|
|
< block_group->length);
|
2019-06-20 15:37:55 -04:00
|
|
|
WARN_ON(block_group->space_info->bytes_readonly
|
2021-02-04 19:21:52 +09:00
|
|
|
< block_group->length - block_group->zone_unusable);
|
|
|
|
WARN_ON(block_group->space_info->bytes_zone_unusable
|
|
|
|
< block_group->zone_unusable);
|
2019-06-20 15:37:55 -04:00
|
|
|
WARN_ON(block_group->space_info->disk_total
|
2019-10-23 18:48:22 +02:00
|
|
|
< block_group->length * factor);
|
2019-06-20 15:37:55 -04:00
|
|
|
}
|
2019-10-23 18:48:22 +02:00
|
|
|
block_group->space_info->total_bytes -= block_group->length;
|
2021-02-04 19:21:52 +09:00
|
|
|
block_group->space_info->bytes_readonly -=
|
|
|
|
(block_group->length - block_group->zone_unusable);
|
2023-02-15 09:18:02 +09:00
|
|
|
btrfs_space_info_update_bytes_zone_unusable(fs_info, block_group->space_info,
|
|
|
|
-block_group->zone_unusable);
|
2019-10-23 18:48:22 +02:00
|
|
|
block_group->space_info->disk_total -= block_group->length * factor;
|
2019-06-20 15:37:55 -04:00
|
|
|
|
|
|
|
spin_unlock(&block_group->space_info->lock);
|
|
|
|
|
btrfs: fix race between block group removal and block group creation
There is a race between block group removal and block group creation
when the removal is completed by a task running fitrim or scrub. When
this happens we end up failing the block group creation with an error
-EEXIST since we attempt to insert a duplicate block group item key
in the extent tree. That results in a transaction abort.
The race happens like this:
1) Task A is doing a fitrim, and at btrfs_trim_block_group() it freezes
block group X with btrfs_freeze_block_group() (until very recently
that was named btrfs_get_block_group_trimming());
2) Task B starts removing block group X, either because it's now unused
or due to relocation for example. So at btrfs_remove_block_group(),
while holding the chunk mutex and the block group's lock, it sets
the 'removed' flag of the block group and it sets the local variable
'remove_em' to false, because the block group is currently frozen
(its 'frozen' counter is > 0, until very recently this counter was
named 'trimming');
3) Task B unlocks the block group and the chunk mutex;
4) Task A is done trimming the block group and unfreezes the block group
by calling btrfs_unfreeze_block_group() (until very recently this was
named btrfs_put_block_group_trimming()). In this function we lock the
block group and set the local variable 'cleanup' to true because we
were able to decrement the block group's 'frozen' counter down to 0 and
the flag 'removed' is set in the block group.
Since 'cleanup' is set to true, it locks the chunk mutex and removes
the extent mapping representing the block group from the mapping tree;
5) Task C allocates a new block group Y and it picks up the logical address
that block group X had as the logical address for Y, because X was the
block group with the highest logical address and now the second block
group with the highest logical address, the last in the fs mapping tree,
ends at an offset corresponding to block group X's logical address (this
logical address selection is done at volumes.c:find_next_chunk()).
At this point the new block group Y does not have yet its item added
to the extent tree (nor the corresponding device extent items and
chunk item in the device and chunk trees). The new group Y is added to
the list of pending block groups in the transaction handle;
6) Before task B proceeds to removing the block group item for block
group X from the extent tree, which has a key matching:
(X logical offset, BTRFS_BLOCK_GROUP_ITEM_KEY, length)
task C while ending its transaction handle calls
btrfs_create_pending_block_groups(), which finds block group Y and
tries to insert the block group item for Y into the exten tree, which
fails with -EEXIST since logical offset is the same that X had and
task B hasn't yet deleted the key from the extent tree.
This failure results in a transaction abort, producing a stack like
the following:
------------[ cut here ]------------
BTRFS: Transaction aborted (error -17)
WARNING: CPU: 2 PID: 19736 at fs/btrfs/block-group.c:2074 btrfs_create_pending_block_groups+0x1eb/0x260 [btrfs]
Modules linked in: btrfs blake2b_generic xor raid6_pq (...)
CPU: 2 PID: 19736 Comm: fsstress Tainted: G W 5.6.0-rc7-btrfs-next-58 #5
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014
RIP: 0010:btrfs_create_pending_block_groups+0x1eb/0x260 [btrfs]
Code: ff ff ff 48 8b 55 50 f0 48 (...)
RSP: 0018:ffffa4160a1c7d58 EFLAGS: 00010286
RAX: 0000000000000000 RBX: ffff961581909d98 RCX: 0000000000000000
RDX: 0000000000000001 RSI: ffffffffb3d63990 RDI: 0000000000000001
RBP: ffff9614f3356a58 R08: 0000000000000000 R09: 0000000000000001
R10: ffff9615b65b0040 R11: 0000000000000000 R12: ffff961581909c10
R13: ffff9615b0c32000 R14: ffff9614f3356ab0 R15: ffff9614be779000
FS: 00007f2ce2841e80(0000) GS:ffff9615bae00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000555f18780000 CR3: 0000000131d34005 CR4: 00000000003606e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
btrfs_start_dirty_block_groups+0x398/0x4e0 [btrfs]
btrfs_commit_transaction+0xd0/0xc50 [btrfs]
? btrfs_attach_transaction_barrier+0x1e/0x50 [btrfs]
? __ia32_sys_fdatasync+0x20/0x20
iterate_supers+0xdb/0x180
ksys_sync+0x60/0xb0
__ia32_sys_sync+0xa/0x10
do_syscall_64+0x5c/0x280
entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x7f2ce1d4d5b7
Code: 83 c4 08 48 3d 01 (...)
RSP: 002b:00007ffd8b558c58 EFLAGS: 00000202 ORIG_RAX: 00000000000000a2
RAX: ffffffffffffffda RBX: 000000000000002c RCX: 00007f2ce1d4d5b7
RDX: 00000000ffffffff RSI: 00000000186ba07b RDI: 000000000000002c
RBP: 0000555f17b9e520 R08: 0000000000000012 R09: 000000000000ce00
R10: 0000000000000078 R11: 0000000000000202 R12: 0000000000000032
R13: 0000000051eb851f R14: 00007ffd8b558cd0 R15: 0000555f1798ec20
irq event stamp: 0
hardirqs last enabled at (0): [<0000000000000000>] 0x0
hardirqs last disabled at (0): [<ffffffffb2abdedf>] copy_process+0x74f/0x2020
softirqs last enabled at (0): [<ffffffffb2abdedf>] copy_process+0x74f/0x2020
softirqs last disabled at (0): [<0000000000000000>] 0x0
---[ end trace bd7c03622e0b0a9c ]---
Fix this simply by making btrfs_remove_block_group() remove the block
group's item from the extent tree before it flags the block group as
removed. Also make the free space deletion from the free space tree
before flagging the block group as removed, to avoid a similar race
with adding and removing free space entries for the free space tree.
Fixes: 04216820fe83d5 ("Btrfs: fix race between fs trimming and block group remove/allocation")
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-06-01 19:12:19 +01:00
|
|
|
/*
|
|
|
|
* Remove the free space for the block group from the free space tree
|
|
|
|
* and the block group's item from the extent tree before marking the
|
|
|
|
* block group as removed. This is to prevent races with tasks that
|
|
|
|
* freeze and unfreeze a block group, this task and another task
|
|
|
|
* allocating a new block group - the unfreeze task ends up removing
|
|
|
|
* the block group's extent map before the task calling this function
|
|
|
|
* deletes the block group item from the extent tree, allowing for
|
|
|
|
* another task to attempt to create another block group with the same
|
|
|
|
* item key (and failing with -EEXIST and a transaction abort).
|
|
|
|
*/
|
|
|
|
ret = remove_block_group_free_space(trans, block_group);
|
|
|
|
if (ret)
|
|
|
|
goto out;
|
|
|
|
|
|
|
|
ret = remove_block_group_item(trans, path, block_group);
|
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
|
|
|
|
2019-06-20 15:37:55 -04:00
|
|
|
spin_lock(&block_group->lock);
|
2022-07-15 15:45:24 -04:00
|
|
|
set_bit(BLOCK_GROUP_FLAG_REMOVED, &block_group->runtime_flags);
|
|
|
|
|
2019-06-20 15:37:55 -04:00
|
|
|
/*
|
2020-05-08 11:01:47 +01:00
|
|
|
* At this point trimming or scrub can't start on this block group,
|
|
|
|
* because we removed the block group from the rbtree
|
|
|
|
* fs_info->block_group_cache_tree so no one can't find it anymore and
|
|
|
|
* even if someone already got this block group before we removed it
|
|
|
|
* from the rbtree, they have already incremented block_group->frozen -
|
|
|
|
* if they didn't, for the trimming case they won't find any free space
|
|
|
|
* entries because we already removed them all when we called
|
|
|
|
* btrfs_remove_free_space_cache().
|
2019-06-20 15:37:55 -04:00
|
|
|
*
|
btrfs: use a dedicated data structure for chunk maps
Currently we abuse the extent_map structure for two purposes:
1) To actually represent extents for inodes;
2) To represent chunk mappings.
This is odd and has several disadvantages:
1) To create a chunk map, we need to do two memory allocations: one for
an extent_map structure and another one for a map_lookup structure, so
more potential for an allocation failure and more complicated code to
manage and link two structures;
2) For a chunk map we actually only use 3 fields (24 bytes) of the
respective extent map structure: the 'start' field to have the logical
start address of the chunk, the 'len' field to have the chunk's size,
and the 'orig_block_len' field to contain the chunk's stripe size.
Besides wasting a memory, it's also odd and not intuitive at all to
have the stripe size in a field named 'orig_block_len'.
We are also using 'block_len' of the extent_map structure to contain
the chunk size, so we have 2 fields for the same value, 'len' and
'block_len', which is pointless;
3) When an extent map is associated to a chunk mapping, we set the bit
EXTENT_FLAG_FS_MAPPING on its flags and then make its member named
'map_lookup' point to the associated map_lookup structure. This means
that for an extent map associated to an inode extent, we are not using
this 'map_lookup' pointer, so wasting 8 bytes (on a 64 bits platform);
4) Extent maps associated to a chunk mapping are never merged or split so
it's pointless to use the existing extent map infrastructure.
So add a dedicated data structure named 'btrfs_chunk_map' to represent
chunk mappings, this is basically the existing map_lookup structure with
some extra fields:
1) 'start' to contain the chunk logical address;
2) 'chunk_len' to contain the chunk's length;
3) 'stripe_size' for the stripe size;
4) 'rb_node' for insertion into a rb tree;
5) 'refs' for reference counting.
This way we do a single memory allocation for chunk mappings and we don't
waste memory for them with unused/unnecessary fields from an extent_map.
We also save 8 bytes from the extent_map structure by removing the
'map_lookup' pointer, so the size of struct extent_map is reduced from
144 bytes down to 136 bytes, and we can now have 30 extents map per 4K
page instead of 28.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-11-21 13:38:38 +00:00
|
|
|
* And we must not remove the chunk map from the fs_info->mapping_tree
|
2019-06-20 15:37:55 -04:00
|
|
|
* to prevent the same logical address range and physical device space
|
2020-05-08 11:01:47 +01:00
|
|
|
* ranges from being reused for a new block group. This is needed to
|
|
|
|
* avoid races with trimming and scrub.
|
|
|
|
*
|
|
|
|
* An fs trim operation (btrfs_trim_fs() / btrfs_ioctl_fitrim()) is
|
2019-06-20 15:37:55 -04:00
|
|
|
* completely transactionless, so while it is trimming a range the
|
|
|
|
* currently running transaction might finish and a new one start,
|
|
|
|
* allowing for new block groups to be created that can reuse the same
|
|
|
|
* physical device locations unless we take this special care.
|
|
|
|
*
|
|
|
|
* There may also be an implicit trim operation if the file system
|
|
|
|
* is mounted with -odiscard. The same protections must remain
|
|
|
|
* in place until the extents have been discarded completely when
|
|
|
|
* the transaction commit has completed.
|
|
|
|
*/
|
btrfs: use a dedicated data structure for chunk maps
Currently we abuse the extent_map structure for two purposes:
1) To actually represent extents for inodes;
2) To represent chunk mappings.
This is odd and has several disadvantages:
1) To create a chunk map, we need to do two memory allocations: one for
an extent_map structure and another one for a map_lookup structure, so
more potential for an allocation failure and more complicated code to
manage and link two structures;
2) For a chunk map we actually only use 3 fields (24 bytes) of the
respective extent map structure: the 'start' field to have the logical
start address of the chunk, the 'len' field to have the chunk's size,
and the 'orig_block_len' field to contain the chunk's stripe size.
Besides wasting a memory, it's also odd and not intuitive at all to
have the stripe size in a field named 'orig_block_len'.
We are also using 'block_len' of the extent_map structure to contain
the chunk size, so we have 2 fields for the same value, 'len' and
'block_len', which is pointless;
3) When an extent map is associated to a chunk mapping, we set the bit
EXTENT_FLAG_FS_MAPPING on its flags and then make its member named
'map_lookup' point to the associated map_lookup structure. This means
that for an extent map associated to an inode extent, we are not using
this 'map_lookup' pointer, so wasting 8 bytes (on a 64 bits platform);
4) Extent maps associated to a chunk mapping are never merged or split so
it's pointless to use the existing extent map infrastructure.
So add a dedicated data structure named 'btrfs_chunk_map' to represent
chunk mappings, this is basically the existing map_lookup structure with
some extra fields:
1) 'start' to contain the chunk logical address;
2) 'chunk_len' to contain the chunk's length;
3) 'stripe_size' for the stripe size;
4) 'rb_node' for insertion into a rb tree;
5) 'refs' for reference counting.
This way we do a single memory allocation for chunk mappings and we don't
waste memory for them with unused/unnecessary fields from an extent_map.
We also save 8 bytes from the extent_map structure by removing the
'map_lookup' pointer, so the size of struct extent_map is reduced from
144 bytes down to 136 bytes, and we can now have 30 extents map per 4K
page instead of 28.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-11-21 13:38:38 +00:00
|
|
|
remove_map = (atomic_read(&block_group->frozen) == 0);
|
2019-06-20 15:37:55 -04:00
|
|
|
spin_unlock(&block_group->lock);
|
|
|
|
|
btrfs: use a dedicated data structure for chunk maps
Currently we abuse the extent_map structure for two purposes:
1) To actually represent extents for inodes;
2) To represent chunk mappings.
This is odd and has several disadvantages:
1) To create a chunk map, we need to do two memory allocations: one for
an extent_map structure and another one for a map_lookup structure, so
more potential for an allocation failure and more complicated code to
manage and link two structures;
2) For a chunk map we actually only use 3 fields (24 bytes) of the
respective extent map structure: the 'start' field to have the logical
start address of the chunk, the 'len' field to have the chunk's size,
and the 'orig_block_len' field to contain the chunk's stripe size.
Besides wasting a memory, it's also odd and not intuitive at all to
have the stripe size in a field named 'orig_block_len'.
We are also using 'block_len' of the extent_map structure to contain
the chunk size, so we have 2 fields for the same value, 'len' and
'block_len', which is pointless;
3) When an extent map is associated to a chunk mapping, we set the bit
EXTENT_FLAG_FS_MAPPING on its flags and then make its member named
'map_lookup' point to the associated map_lookup structure. This means
that for an extent map associated to an inode extent, we are not using
this 'map_lookup' pointer, so wasting 8 bytes (on a 64 bits platform);
4) Extent maps associated to a chunk mapping are never merged or split so
it's pointless to use the existing extent map infrastructure.
So add a dedicated data structure named 'btrfs_chunk_map' to represent
chunk mappings, this is basically the existing map_lookup structure with
some extra fields:
1) 'start' to contain the chunk logical address;
2) 'chunk_len' to contain the chunk's length;
3) 'stripe_size' for the stripe size;
4) 'rb_node' for insertion into a rb tree;
5) 'refs' for reference counting.
This way we do a single memory allocation for chunk mappings and we don't
waste memory for them with unused/unnecessary fields from an extent_map.
We also save 8 bytes from the extent_map structure by removing the
'map_lookup' pointer, so the size of struct extent_map is reduced from
144 bytes down to 136 bytes, and we can now have 30 extents map per 4K
page instead of 28.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-11-21 13:38:38 +00:00
|
|
|
if (remove_map)
|
|
|
|
btrfs_remove_chunk_map(fs_info, map);
|
2020-04-21 10:54:11 +08:00
|
|
|
|
2020-06-01 19:12:06 +01:00
|
|
|
out:
|
2020-04-21 10:54:11 +08:00
|
|
|
/* Once for the lookup reference */
|
|
|
|
btrfs_put_block_group(block_group);
|
2019-06-20 15:37:55 -04:00
|
|
|
if (remove_rsv)
|
2023-09-28 11:12:49 +01:00
|
|
|
btrfs_dec_delayed_refs_rsv_bg_updates(fs_info);
|
2019-06-20 15:37:55 -04:00
|
|
|
btrfs_free_path(path);
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
|
|
|
struct btrfs_trans_handle *btrfs_start_trans_remove_block_group(
|
|
|
|
struct btrfs_fs_info *fs_info, const u64 chunk_offset)
|
|
|
|
{
|
2021-11-05 16:45:36 -04:00
|
|
|
struct btrfs_root *root = btrfs_block_group_root(fs_info);
|
btrfs: use a dedicated data structure for chunk maps
Currently we abuse the extent_map structure for two purposes:
1) To actually represent extents for inodes;
2) To represent chunk mappings.
This is odd and has several disadvantages:
1) To create a chunk map, we need to do two memory allocations: one for
an extent_map structure and another one for a map_lookup structure, so
more potential for an allocation failure and more complicated code to
manage and link two structures;
2) For a chunk map we actually only use 3 fields (24 bytes) of the
respective extent map structure: the 'start' field to have the logical
start address of the chunk, the 'len' field to have the chunk's size,
and the 'orig_block_len' field to contain the chunk's stripe size.
Besides wasting a memory, it's also odd and not intuitive at all to
have the stripe size in a field named 'orig_block_len'.
We are also using 'block_len' of the extent_map structure to contain
the chunk size, so we have 2 fields for the same value, 'len' and
'block_len', which is pointless;
3) When an extent map is associated to a chunk mapping, we set the bit
EXTENT_FLAG_FS_MAPPING on its flags and then make its member named
'map_lookup' point to the associated map_lookup structure. This means
that for an extent map associated to an inode extent, we are not using
this 'map_lookup' pointer, so wasting 8 bytes (on a 64 bits platform);
4) Extent maps associated to a chunk mapping are never merged or split so
it's pointless to use the existing extent map infrastructure.
So add a dedicated data structure named 'btrfs_chunk_map' to represent
chunk mappings, this is basically the existing map_lookup structure with
some extra fields:
1) 'start' to contain the chunk logical address;
2) 'chunk_len' to contain the chunk's length;
3) 'stripe_size' for the stripe size;
4) 'rb_node' for insertion into a rb tree;
5) 'refs' for reference counting.
This way we do a single memory allocation for chunk mappings and we don't
waste memory for them with unused/unnecessary fields from an extent_map.
We also save 8 bytes from the extent_map structure by removing the
'map_lookup' pointer, so the size of struct extent_map is reduced from
144 bytes down to 136 bytes, and we can now have 30 extents map per 4K
page instead of 28.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-11-21 13:38:38 +00:00
|
|
|
struct btrfs_chunk_map *map;
|
2019-06-20 15:37:55 -04:00
|
|
|
unsigned int num_items;
|
|
|
|
|
btrfs: use a dedicated data structure for chunk maps
Currently we abuse the extent_map structure for two purposes:
1) To actually represent extents for inodes;
2) To represent chunk mappings.
This is odd and has several disadvantages:
1) To create a chunk map, we need to do two memory allocations: one for
an extent_map structure and another one for a map_lookup structure, so
more potential for an allocation failure and more complicated code to
manage and link two structures;
2) For a chunk map we actually only use 3 fields (24 bytes) of the
respective extent map structure: the 'start' field to have the logical
start address of the chunk, the 'len' field to have the chunk's size,
and the 'orig_block_len' field to contain the chunk's stripe size.
Besides wasting a memory, it's also odd and not intuitive at all to
have the stripe size in a field named 'orig_block_len'.
We are also using 'block_len' of the extent_map structure to contain
the chunk size, so we have 2 fields for the same value, 'len' and
'block_len', which is pointless;
3) When an extent map is associated to a chunk mapping, we set the bit
EXTENT_FLAG_FS_MAPPING on its flags and then make its member named
'map_lookup' point to the associated map_lookup structure. This means
that for an extent map associated to an inode extent, we are not using
this 'map_lookup' pointer, so wasting 8 bytes (on a 64 bits platform);
4) Extent maps associated to a chunk mapping are never merged or split so
it's pointless to use the existing extent map infrastructure.
So add a dedicated data structure named 'btrfs_chunk_map' to represent
chunk mappings, this is basically the existing map_lookup structure with
some extra fields:
1) 'start' to contain the chunk logical address;
2) 'chunk_len' to contain the chunk's length;
3) 'stripe_size' for the stripe size;
4) 'rb_node' for insertion into a rb tree;
5) 'refs' for reference counting.
This way we do a single memory allocation for chunk mappings and we don't
waste memory for them with unused/unnecessary fields from an extent_map.
We also save 8 bytes from the extent_map structure by removing the
'map_lookup' pointer, so the size of struct extent_map is reduced from
144 bytes down to 136 bytes, and we can now have 30 extents map per 4K
page instead of 28.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-11-21 13:38:38 +00:00
|
|
|
map = btrfs_find_chunk_map(fs_info, chunk_offset, 1);
|
|
|
|
ASSERT(map != NULL);
|
|
|
|
ASSERT(map->start == chunk_offset);
|
2019-06-20 15:37:55 -04:00
|
|
|
|
|
|
|
/*
|
|
|
|
* We need to reserve 3 + N units from the metadata space info in order
|
|
|
|
* to remove a block group (done at btrfs_remove_chunk() and at
|
|
|
|
* btrfs_remove_block_group()), which are used for:
|
|
|
|
*
|
|
|
|
* 1 unit for adding the free space inode's orphan (located in the tree
|
|
|
|
* of tree roots).
|
|
|
|
* 1 unit for deleting the block group item (located in the extent
|
|
|
|
* tree).
|
|
|
|
* 1 unit for deleting the free space item (located in tree of tree
|
|
|
|
* roots).
|
|
|
|
* N units for deleting N device extent items corresponding to each
|
|
|
|
* stripe (located in the device tree).
|
|
|
|
*
|
|
|
|
* In order to remove a block group we also need to reserve units in the
|
|
|
|
* system space info in order to update the chunk tree (update one or
|
|
|
|
* more device items and remove one chunk item), but this is done at
|
|
|
|
* btrfs_remove_chunk() through a call to check_system_chunk().
|
|
|
|
*/
|
|
|
|
num_items = 3 + map->num_stripes;
|
btrfs: use a dedicated data structure for chunk maps
Currently we abuse the extent_map structure for two purposes:
1) To actually represent extents for inodes;
2) To represent chunk mappings.
This is odd and has several disadvantages:
1) To create a chunk map, we need to do two memory allocations: one for
an extent_map structure and another one for a map_lookup structure, so
more potential for an allocation failure and more complicated code to
manage and link two structures;
2) For a chunk map we actually only use 3 fields (24 bytes) of the
respective extent map structure: the 'start' field to have the logical
start address of the chunk, the 'len' field to have the chunk's size,
and the 'orig_block_len' field to contain the chunk's stripe size.
Besides wasting a memory, it's also odd and not intuitive at all to
have the stripe size in a field named 'orig_block_len'.
We are also using 'block_len' of the extent_map structure to contain
the chunk size, so we have 2 fields for the same value, 'len' and
'block_len', which is pointless;
3) When an extent map is associated to a chunk mapping, we set the bit
EXTENT_FLAG_FS_MAPPING on its flags and then make its member named
'map_lookup' point to the associated map_lookup structure. This means
that for an extent map associated to an inode extent, we are not using
this 'map_lookup' pointer, so wasting 8 bytes (on a 64 bits platform);
4) Extent maps associated to a chunk mapping are never merged or split so
it's pointless to use the existing extent map infrastructure.
So add a dedicated data structure named 'btrfs_chunk_map' to represent
chunk mappings, this is basically the existing map_lookup structure with
some extra fields:
1) 'start' to contain the chunk logical address;
2) 'chunk_len' to contain the chunk's length;
3) 'stripe_size' for the stripe size;
4) 'rb_node' for insertion into a rb tree;
5) 'refs' for reference counting.
This way we do a single memory allocation for chunk mappings and we don't
waste memory for them with unused/unnecessary fields from an extent_map.
We also save 8 bytes from the extent_map structure by removing the
'map_lookup' pointer, so the size of struct extent_map is reduced from
144 bytes down to 136 bytes, and we can now have 30 extents map per 4K
page instead of 28.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-11-21 13:38:38 +00:00
|
|
|
btrfs_free_chunk_map(map);
|
2019-06-20 15:37:55 -04:00
|
|
|
|
2021-11-05 16:45:36 -04:00
|
|
|
return btrfs_start_transaction_fallback_global_rsv(root, num_items);
|
2019-06-20 15:37:55 -04:00
|
|
|
}
|
|
|
|
|
2019-06-20 15:37:59 -04:00
|
|
|
/*
|
|
|
|
* Mark block group @cache read-only, so later write won't happen to block
|
|
|
|
* group @cache.
|
|
|
|
*
|
|
|
|
* If @force is not set, this function will only mark the block group readonly
|
|
|
|
* if we have enough free space (1M) in other metadata/system block groups.
|
|
|
|
* If @force is not set, this function will mark the block group readonly
|
|
|
|
* without checking free space.
|
|
|
|
*
|
|
|
|
* NOTE: This function doesn't care if other block groups can contain all the
|
|
|
|
* data in this block group. That check should be done by relocation routine,
|
|
|
|
* not this function.
|
|
|
|
*/
|
2019-10-29 19:20:18 +01:00
|
|
|
static int inc_block_group_ro(struct btrfs_block_group *cache, int force)
|
2019-06-20 15:37:59 -04:00
|
|
|
{
|
|
|
|
struct btrfs_space_info *sinfo = cache->space_info;
|
|
|
|
u64 num_bytes;
|
|
|
|
int ret = -ENOSPC;
|
|
|
|
|
|
|
|
spin_lock(&sinfo->lock);
|
|
|
|
spin_lock(&cache->lock);
|
|
|
|
|
btrfs: fix race between writes to swap files and scrub
When we active a swap file, at btrfs_swap_activate(), we acquire the
exclusive operation lock to prevent the physical location of the swap
file extents to be changed by operations such as balance and device
replace/resize/remove. We also call there can_nocow_extent() which,
among other things, checks if the block group of a swap file extent is
currently RO, and if it is we can not use the extent, since a write
into it would result in COWing the extent.
However we have no protection against a scrub operation running after we
activate the swap file, which can result in the swap file extents to be
COWed while the scrub is running and operating on the respective block
group, because scrub turns a block group into RO before it processes it
and then back again to RW mode after processing it. That means an attempt
to write into a swap file extent while scrub is processing the respective
block group, will result in COWing the extent, changing its physical
location on disk.
Fix this by making sure that block groups that have extents that are used
by active swap files can not be turned into RO mode, therefore making it
not possible for a scrub to turn them into RO mode. When a scrub finds a
block group that can not be turned to RO due to the existence of extents
used by swap files, it proceeds to the next block group and logs a warning
message that mentions the block group was skipped due to active swap
files - this is the same approach we currently use for balance.
Fixes: ed46ff3d42378 ("Btrfs: support swap files")
CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-05 12:55:37 +00:00
|
|
|
if (cache->swap_extents) {
|
|
|
|
ret = -ETXTBSY;
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
|
2019-06-20 15:37:59 -04:00
|
|
|
if (cache->ro) {
|
|
|
|
cache->ro++;
|
|
|
|
ret = 0;
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
|
2019-10-23 18:48:22 +02:00
|
|
|
num_bytes = cache->length - cache->reserved - cache->pinned -
|
2021-02-04 19:21:52 +09:00
|
|
|
cache->bytes_super - cache->zone_unusable - cache->used;
|
2019-06-20 15:37:59 -04:00
|
|
|
|
|
|
|
/*
|
2020-01-17 09:07:39 -05:00
|
|
|
* Data never overcommits, even in mixed mode, so do just the straight
|
|
|
|
* check of left over space in how much we have allocated.
|
2019-06-20 15:37:59 -04:00
|
|
|
*/
|
2020-01-17 09:07:39 -05:00
|
|
|
if (force) {
|
|
|
|
ret = 0;
|
|
|
|
} else if (sinfo->flags & BTRFS_BLOCK_GROUP_DATA) {
|
|
|
|
u64 sinfo_used = btrfs_space_info_used(sinfo, true);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Here we make sure if we mark this bg RO, we still have enough
|
|
|
|
* free space as buffer.
|
|
|
|
*/
|
|
|
|
if (sinfo_used + num_bytes <= sinfo->total_bytes)
|
|
|
|
ret = 0;
|
|
|
|
} else {
|
|
|
|
/*
|
|
|
|
* We overcommit metadata, so we need to do the
|
|
|
|
* btrfs_can_overcommit check here, and we need to pass in
|
|
|
|
* BTRFS_RESERVE_NO_FLUSH to give ourselves the most amount of
|
|
|
|
* leeway to allow us to mark this block group as read only.
|
|
|
|
*/
|
|
|
|
if (btrfs_can_overcommit(cache->fs_info, sinfo, num_bytes,
|
|
|
|
BTRFS_RESERVE_NO_FLUSH))
|
|
|
|
ret = 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (!ret) {
|
2019-06-20 15:37:59 -04:00
|
|
|
sinfo->bytes_readonly += num_bytes;
|
2021-02-04 19:21:52 +09:00
|
|
|
if (btrfs_is_zoned(cache->fs_info)) {
|
|
|
|
/* Migrate zone_unusable bytes to readonly */
|
|
|
|
sinfo->bytes_readonly += cache->zone_unusable;
|
2023-02-15 09:18:02 +09:00
|
|
|
btrfs_space_info_update_bytes_zone_unusable(cache->fs_info, sinfo,
|
|
|
|
-cache->zone_unusable);
|
2021-02-04 19:21:52 +09:00
|
|
|
cache->zone_unusable = 0;
|
|
|
|
}
|
2019-06-20 15:37:59 -04:00
|
|
|
cache->ro++;
|
|
|
|
list_add_tail(&cache->ro_list, &sinfo->ro_bgs);
|
|
|
|
}
|
|
|
|
out:
|
|
|
|
spin_unlock(&cache->lock);
|
|
|
|
spin_unlock(&sinfo->lock);
|
|
|
|
if (ret == -ENOSPC && btrfs_test_opt(cache->fs_info, ENOSPC_DEBUG)) {
|
|
|
|
btrfs_info(cache->fs_info,
|
2019-10-23 18:48:22 +02:00
|
|
|
"unable to make block group %llu ro", cache->start);
|
2019-06-20 15:37:59 -04:00
|
|
|
btrfs_dump_space_info(cache->fs_info, cache->space_info, 0, 0);
|
|
|
|
}
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2020-01-20 16:09:18 +02:00
|
|
|
static bool clean_pinned_extents(struct btrfs_trans_handle *trans,
|
2024-06-26 23:39:11 +02:00
|
|
|
const struct btrfs_block_group *bg)
|
2020-01-20 16:09:17 +02:00
|
|
|
{
|
2024-06-26 23:39:11 +02:00
|
|
|
struct btrfs_fs_info *fs_info = trans->fs_info;
|
2020-01-20 16:09:18 +02:00
|
|
|
struct btrfs_transaction *prev_trans = NULL;
|
2020-01-20 16:09:17 +02:00
|
|
|
const u64 start = bg->start;
|
|
|
|
const u64 end = start + bg->length - 1;
|
|
|
|
int ret;
|
|
|
|
|
2020-01-20 16:09:18 +02:00
|
|
|
spin_lock(&fs_info->trans_lock);
|
|
|
|
if (trans->transaction->list.prev != &fs_info->trans_list) {
|
|
|
|
prev_trans = list_last_entry(&trans->transaction->list,
|
|
|
|
struct btrfs_transaction, list);
|
|
|
|
refcount_inc(&prev_trans->use_count);
|
|
|
|
}
|
|
|
|
spin_unlock(&fs_info->trans_lock);
|
|
|
|
|
2020-01-20 16:09:17 +02:00
|
|
|
/*
|
|
|
|
* Hold the unused_bg_unpin_mutex lock to avoid racing with
|
|
|
|
* btrfs_finish_extent_commit(). If we are at transaction N, another
|
|
|
|
* task might be running finish_extent_commit() for the previous
|
|
|
|
* transaction N - 1, and have seen a range belonging to the block
|
2020-01-20 16:09:18 +02:00
|
|
|
* group in pinned_extents before we were able to clear the whole block
|
|
|
|
* group range from pinned_extents. This means that task can lookup for
|
|
|
|
* the block group after we unpinned it from pinned_extents and removed
|
2024-01-12 18:45:24 +01:00
|
|
|
* it, leading to an error at unpin_extent_range().
|
2020-01-20 16:09:17 +02:00
|
|
|
*/
|
|
|
|
mutex_lock(&fs_info->unused_bg_unpin_mutex);
|
2020-01-20 16:09:18 +02:00
|
|
|
if (prev_trans) {
|
|
|
|
ret = clear_extent_bits(&prev_trans->pinned_extents, start, end,
|
|
|
|
EXTENT_DIRTY);
|
|
|
|
if (ret)
|
2020-04-17 16:36:50 +01:00
|
|
|
goto out;
|
2020-01-20 16:09:18 +02:00
|
|
|
}
|
2020-01-20 16:09:17 +02:00
|
|
|
|
2020-01-20 16:09:18 +02:00
|
|
|
ret = clear_extent_bits(&trans->transaction->pinned_extents, start, end,
|
2020-01-20 16:09:17 +02:00
|
|
|
EXTENT_DIRTY);
|
2020-04-17 16:36:50 +01:00
|
|
|
out:
|
2020-01-20 16:09:17 +02:00
|
|
|
mutex_unlock(&fs_info->unused_bg_unpin_mutex);
|
2020-04-17 16:36:15 +01:00
|
|
|
if (prev_trans)
|
|
|
|
btrfs_put_transaction(prev_trans);
|
2020-01-20 16:09:17 +02:00
|
|
|
|
2020-04-17 16:36:50 +01:00
|
|
|
return ret == 0;
|
2020-01-20 16:09:17 +02:00
|
|
|
}
|
|
|
|
|
2019-06-20 15:37:55 -04:00
|
|
|
/*
|
|
|
|
* Process the unused_bgs list and remove any that don't have any allocated
|
|
|
|
* space inside of them.
|
|
|
|
*/
|
|
|
|
void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info)
|
|
|
|
{
|
btrfs: do not delete unused block group if it may be used soon
Before deleting a block group that is in the list of unused block groups
(fs_info->unused_bgs), we check if the block group became used before
deleting it, as extents from it may have been allocated after it was added
to the list.
However even if the block group was not yet used, there may be tasks that
have only reserved space and have not yet allocated extents, and they
might be relying on the availability of the unused block group in order
to allocate extents. The reservation works first by increasing the
"bytes_may_use" field of the corresponding space_info object (which may
first require flushing delayed items, allocating a new block group, etc),
and only later a task does the actual allocation of extents.
For metadata we usually don't end up using all reserved space, as we are
pessimistic and typically account for the worst cases (need to COW every
single node in a path of a tree at maximum possible height, etc). For
data we usually reserve the exact amount of space we're going to allocate
later, except when using compression where we always reserve space based
on the uncompressed size, as compression is only triggered when writeback
starts so we don't know in advance how much space we'll actually need, or
if the data is compressible.
So don't delete an unused block group if the total size of its space_info
object minus the block group's size is less then the sum of used space and
space that may be used (space_info->bytes_may_use), as that means we have
tasks that reserved space and may need to allocate extents from the block
group. In this case, besides skipping the deletion, re-add the block group
to the list of unused block groups so that it may be reconsidered later,
in case the tasks that reserved space end up not needing to allocate
extents from it.
Allowing the deletion of the block group while we have reserved space, can
result in tasks failing to allocate metadata extents (-ENOSPC) while under
a transaction handle, resulting in a transaction abort, or failure during
writeback for the case of data extents.
CC: stable@vger.kernel.org # 6.0+
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-01-25 09:53:14 +00:00
|
|
|
LIST_HEAD(retry_list);
|
2019-10-29 19:20:18 +01:00
|
|
|
struct btrfs_block_group *block_group;
|
2019-06-20 15:37:55 -04:00
|
|
|
struct btrfs_space_info *space_info;
|
|
|
|
struct btrfs_trans_handle *trans;
|
btrfs: handle empty block_group removal for async discard
block_group removal is a little tricky. It can race with the extent
allocator, the cleaner thread, and balancing. The current path is for a
block_group to be added to the unused_bgs list. Then, when the cleaner
thread comes around, it starts a transaction and then proceeds with
removing the block_group. Extents that are pinned are subsequently
removed from the pinned trees and then eventually a discard is issued
for the entire block_group.
Async discard introduces another player into the game, the discard
workqueue. While it has none of the racing issues, the new problem is
ensuring we don't leave free space untrimmed prior to forgetting the
block_group. This is handled by placing fully free block_groups on a
separate discard queue. This is necessary to maintain discarding order
as in the future we will slowly trim even fully free block_groups. The
ordering helps us make progress on the same block_group rather than say
the last fully freed block_group or needing to search through the fully
freed block groups at the beginning of a list and insert after.
The new order of events is a fully freed block group gets placed on the
unused discard queue first. Once it's processed, it will be placed on
the unusued_bgs list and then the original sequence of events will
happen, just without the final whole block_group discard.
The mount flags can change when processing unused_bgs, so when flipping
from DISCARD to DISCARD_ASYNC, the unused_bgs must be punted to the
discard_list to be trimmed. If we flip off DISCARD_ASYNC, we punt
free block groups on the discard_list to the unused_bg queue which will
do the final discard for us.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-12-13 16:22:15 -08:00
|
|
|
const bool async_trim_enabled = btrfs_test_opt(fs_info, DISCARD_ASYNC);
|
2019-06-20 15:37:55 -04:00
|
|
|
int ret = 0;
|
|
|
|
|
|
|
|
if (!test_bit(BTRFS_FS_OPEN, &fs_info->flags))
|
|
|
|
return;
|
|
|
|
|
2022-07-15 15:45:21 -04:00
|
|
|
if (btrfs_fs_closing(fs_info))
|
|
|
|
return;
|
|
|
|
|
2020-12-18 14:24:19 -05:00
|
|
|
/*
|
|
|
|
* Long running balances can keep us blocked here for eternity, so
|
|
|
|
* simply skip deletion if we're unable to get the mutex.
|
|
|
|
*/
|
2021-04-19 16:41:01 +09:00
|
|
|
if (!mutex_trylock(&fs_info->reclaim_bgs_lock))
|
2020-12-18 14:24:19 -05:00
|
|
|
return;
|
|
|
|
|
2019-06-20 15:37:55 -04:00
|
|
|
spin_lock(&fs_info->unused_bgs_lock);
|
|
|
|
while (!list_empty(&fs_info->unused_bgs)) {
|
btrfs: do not delete unused block group if it may be used soon
Before deleting a block group that is in the list of unused block groups
(fs_info->unused_bgs), we check if the block group became used before
deleting it, as extents from it may have been allocated after it was added
to the list.
However even if the block group was not yet used, there may be tasks that
have only reserved space and have not yet allocated extents, and they
might be relying on the availability of the unused block group in order
to allocate extents. The reservation works first by increasing the
"bytes_may_use" field of the corresponding space_info object (which may
first require flushing delayed items, allocating a new block group, etc),
and only later a task does the actual allocation of extents.
For metadata we usually don't end up using all reserved space, as we are
pessimistic and typically account for the worst cases (need to COW every
single node in a path of a tree at maximum possible height, etc). For
data we usually reserve the exact amount of space we're going to allocate
later, except when using compression where we always reserve space based
on the uncompressed size, as compression is only triggered when writeback
starts so we don't know in advance how much space we'll actually need, or
if the data is compressible.
So don't delete an unused block group if the total size of its space_info
object minus the block group's size is less then the sum of used space and
space that may be used (space_info->bytes_may_use), as that means we have
tasks that reserved space and may need to allocate extents from the block
group. In this case, besides skipping the deletion, re-add the block group
to the list of unused block groups so that it may be reconsidered later,
in case the tasks that reserved space end up not needing to allocate
extents from it.
Allowing the deletion of the block group while we have reserved space, can
result in tasks failing to allocate metadata extents (-ENOSPC) while under
a transaction handle, resulting in a transaction abort, or failure during
writeback for the case of data extents.
CC: stable@vger.kernel.org # 6.0+
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-01-25 09:53:14 +00:00
|
|
|
u64 used;
|
2019-06-20 15:37:55 -04:00
|
|
|
int trimming;
|
|
|
|
|
|
|
|
block_group = list_first_entry(&fs_info->unused_bgs,
|
2019-10-29 19:20:18 +01:00
|
|
|
struct btrfs_block_group,
|
2019-06-20 15:37:55 -04:00
|
|
|
bg_list);
|
|
|
|
list_del_init(&block_group->bg_list);
|
|
|
|
|
|
|
|
space_info = block_group->space_info;
|
|
|
|
|
|
|
|
if (ret || btrfs_mixed_space_info(space_info)) {
|
|
|
|
btrfs_put_block_group(block_group);
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
spin_unlock(&fs_info->unused_bgs_lock);
|
|
|
|
|
2019-12-13 16:22:14 -08:00
|
|
|
btrfs_discard_cancel_work(&fs_info->discard_ctl, block_group);
|
|
|
|
|
2019-06-20 15:37:55 -04:00
|
|
|
/* Don't want to race with allocators so take the groups_sem */
|
|
|
|
down_write(&space_info->groups_sem);
|
btrfs: handle empty block_group removal for async discard
block_group removal is a little tricky. It can race with the extent
allocator, the cleaner thread, and balancing. The current path is for a
block_group to be added to the unused_bgs list. Then, when the cleaner
thread comes around, it starts a transaction and then proceeds with
removing the block_group. Extents that are pinned are subsequently
removed from the pinned trees and then eventually a discard is issued
for the entire block_group.
Async discard introduces another player into the game, the discard
workqueue. While it has none of the racing issues, the new problem is
ensuring we don't leave free space untrimmed prior to forgetting the
block_group. This is handled by placing fully free block_groups on a
separate discard queue. This is necessary to maintain discarding order
as in the future we will slowly trim even fully free block_groups. The
ordering helps us make progress on the same block_group rather than say
the last fully freed block_group or needing to search through the fully
freed block groups at the beginning of a list and insert after.
The new order of events is a fully freed block group gets placed on the
unused discard queue first. Once it's processed, it will be placed on
the unusued_bgs list and then the original sequence of events will
happen, just without the final whole block_group discard.
The mount flags can change when processing unused_bgs, so when flipping
from DISCARD to DISCARD_ASYNC, the unused_bgs must be punted to the
discard_list to be trimmed. If we flip off DISCARD_ASYNC, we punt
free block groups on the discard_list to the unused_bg queue which will
do the final discard for us.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-12-13 16:22:15 -08:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Async discard moves the final block group discard to be prior
|
|
|
|
* to the unused_bgs code path. Therefore, if it's not fully
|
|
|
|
* trimmed, punt it back to the async discard lists.
|
|
|
|
*/
|
|
|
|
if (btrfs_test_opt(fs_info, DISCARD_ASYNC) &&
|
|
|
|
!btrfs_is_free_space_trimmed(block_group)) {
|
|
|
|
trace_btrfs_skip_unused_block_group(block_group);
|
|
|
|
up_write(&space_info->groups_sem);
|
|
|
|
/* Requeue if we failed because of async discard */
|
|
|
|
btrfs_discard_queue_work(&fs_info->discard_ctl,
|
|
|
|
block_group);
|
|
|
|
goto next;
|
|
|
|
}
|
|
|
|
|
btrfs: do not delete unused block group if it may be used soon
Before deleting a block group that is in the list of unused block groups
(fs_info->unused_bgs), we check if the block group became used before
deleting it, as extents from it may have been allocated after it was added
to the list.
However even if the block group was not yet used, there may be tasks that
have only reserved space and have not yet allocated extents, and they
might be relying on the availability of the unused block group in order
to allocate extents. The reservation works first by increasing the
"bytes_may_use" field of the corresponding space_info object (which may
first require flushing delayed items, allocating a new block group, etc),
and only later a task does the actual allocation of extents.
For metadata we usually don't end up using all reserved space, as we are
pessimistic and typically account for the worst cases (need to COW every
single node in a path of a tree at maximum possible height, etc). For
data we usually reserve the exact amount of space we're going to allocate
later, except when using compression where we always reserve space based
on the uncompressed size, as compression is only triggered when writeback
starts so we don't know in advance how much space we'll actually need, or
if the data is compressible.
So don't delete an unused block group if the total size of its space_info
object minus the block group's size is less then the sum of used space and
space that may be used (space_info->bytes_may_use), as that means we have
tasks that reserved space and may need to allocate extents from the block
group. In this case, besides skipping the deletion, re-add the block group
to the list of unused block groups so that it may be reconsidered later,
in case the tasks that reserved space end up not needing to allocate
extents from it.
Allowing the deletion of the block group while we have reserved space, can
result in tasks failing to allocate metadata extents (-ENOSPC) while under
a transaction handle, resulting in a transaction abort, or failure during
writeback for the case of data extents.
CC: stable@vger.kernel.org # 6.0+
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-01-25 09:53:14 +00:00
|
|
|
spin_lock(&space_info->lock);
|
2019-06-20 15:37:55 -04:00
|
|
|
spin_lock(&block_group->lock);
|
2024-01-25 09:53:06 +00:00
|
|
|
if (btrfs_is_block_group_used(block_group) || block_group->ro ||
|
2019-06-20 15:37:55 -04:00
|
|
|
list_is_singular(&block_group->list)) {
|
|
|
|
/*
|
|
|
|
* We want to bail if we made new allocations or have
|
|
|
|
* outstanding allocations in this block group. We do
|
|
|
|
* the ro check in case balance is currently acting on
|
|
|
|
* this block group.
|
2024-01-25 09:53:26 +00:00
|
|
|
*
|
|
|
|
* Also bail out if this is the only block group for its
|
|
|
|
* type, because otherwise we would lose profile
|
|
|
|
* information from fs_info->avail_*_alloc_bits and the
|
|
|
|
* next block group of this type would be created with a
|
|
|
|
* "single" profile (even if we're in a raid fs) because
|
|
|
|
* fs_info->avail_*_alloc_bits would be 0.
|
2019-06-20 15:37:55 -04:00
|
|
|
*/
|
|
|
|
trace_btrfs_skip_unused_block_group(block_group);
|
|
|
|
spin_unlock(&block_group->lock);
|
btrfs: do not delete unused block group if it may be used soon
Before deleting a block group that is in the list of unused block groups
(fs_info->unused_bgs), we check if the block group became used before
deleting it, as extents from it may have been allocated after it was added
to the list.
However even if the block group was not yet used, there may be tasks that
have only reserved space and have not yet allocated extents, and they
might be relying on the availability of the unused block group in order
to allocate extents. The reservation works first by increasing the
"bytes_may_use" field of the corresponding space_info object (which may
first require flushing delayed items, allocating a new block group, etc),
and only later a task does the actual allocation of extents.
For metadata we usually don't end up using all reserved space, as we are
pessimistic and typically account for the worst cases (need to COW every
single node in a path of a tree at maximum possible height, etc). For
data we usually reserve the exact amount of space we're going to allocate
later, except when using compression where we always reserve space based
on the uncompressed size, as compression is only triggered when writeback
starts so we don't know in advance how much space we'll actually need, or
if the data is compressible.
So don't delete an unused block group if the total size of its space_info
object minus the block group's size is less then the sum of used space and
space that may be used (space_info->bytes_may_use), as that means we have
tasks that reserved space and may need to allocate extents from the block
group. In this case, besides skipping the deletion, re-add the block group
to the list of unused block groups so that it may be reconsidered later,
in case the tasks that reserved space end up not needing to allocate
extents from it.
Allowing the deletion of the block group while we have reserved space, can
result in tasks failing to allocate metadata extents (-ENOSPC) while under
a transaction handle, resulting in a transaction abort, or failure during
writeback for the case of data extents.
CC: stable@vger.kernel.org # 6.0+
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-01-25 09:53:14 +00:00
|
|
|
spin_unlock(&space_info->lock);
|
2019-06-20 15:37:55 -04:00
|
|
|
up_write(&space_info->groups_sem);
|
|
|
|
goto next;
|
|
|
|
}
|
btrfs: do not delete unused block group if it may be used soon
Before deleting a block group that is in the list of unused block groups
(fs_info->unused_bgs), we check if the block group became used before
deleting it, as extents from it may have been allocated after it was added
to the list.
However even if the block group was not yet used, there may be tasks that
have only reserved space and have not yet allocated extents, and they
might be relying on the availability of the unused block group in order
to allocate extents. The reservation works first by increasing the
"bytes_may_use" field of the corresponding space_info object (which may
first require flushing delayed items, allocating a new block group, etc),
and only later a task does the actual allocation of extents.
For metadata we usually don't end up using all reserved space, as we are
pessimistic and typically account for the worst cases (need to COW every
single node in a path of a tree at maximum possible height, etc). For
data we usually reserve the exact amount of space we're going to allocate
later, except when using compression where we always reserve space based
on the uncompressed size, as compression is only triggered when writeback
starts so we don't know in advance how much space we'll actually need, or
if the data is compressible.
So don't delete an unused block group if the total size of its space_info
object minus the block group's size is less then the sum of used space and
space that may be used (space_info->bytes_may_use), as that means we have
tasks that reserved space and may need to allocate extents from the block
group. In this case, besides skipping the deletion, re-add the block group
to the list of unused block groups so that it may be reconsidered later,
in case the tasks that reserved space end up not needing to allocate
extents from it.
Allowing the deletion of the block group while we have reserved space, can
result in tasks failing to allocate metadata extents (-ENOSPC) while under
a transaction handle, resulting in a transaction abort, or failure during
writeback for the case of data extents.
CC: stable@vger.kernel.org # 6.0+
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-01-25 09:53:14 +00:00
|
|
|
|
|
|
|
/*
|
|
|
|
* The block group may be unused but there may be space reserved
|
|
|
|
* accounting with the existence of that block group, that is,
|
|
|
|
* space_info->bytes_may_use was incremented by a task but no
|
|
|
|
* space was yet allocated from the block group by the task.
|
|
|
|
* That space may or may not be allocated, as we are generally
|
|
|
|
* pessimistic about space reservation for metadata as well as
|
|
|
|
* for data when using compression (as we reserve space based on
|
|
|
|
* the worst case, when data can't be compressed, and before
|
|
|
|
* actually attempting compression, before starting writeback).
|
|
|
|
*
|
|
|
|
* So check if the total space of the space_info minus the size
|
|
|
|
* of this block group is less than the used space of the
|
|
|
|
* space_info - if that's the case, then it means we have tasks
|
|
|
|
* that might be relying on the block group in order to allocate
|
|
|
|
* extents, and add back the block group to the unused list when
|
|
|
|
* we finish, so that we retry later in case no tasks ended up
|
|
|
|
* needing to allocate extents from the block group.
|
|
|
|
*/
|
|
|
|
used = btrfs_space_info_used(space_info, true);
|
2024-02-21 07:35:52 -08:00
|
|
|
if (space_info->total_bytes - block_group->length < used &&
|
|
|
|
block_group->zone_unusable < block_group->length) {
|
btrfs: do not delete unused block group if it may be used soon
Before deleting a block group that is in the list of unused block groups
(fs_info->unused_bgs), we check if the block group became used before
deleting it, as extents from it may have been allocated after it was added
to the list.
However even if the block group was not yet used, there may be tasks that
have only reserved space and have not yet allocated extents, and they
might be relying on the availability of the unused block group in order
to allocate extents. The reservation works first by increasing the
"bytes_may_use" field of the corresponding space_info object (which may
first require flushing delayed items, allocating a new block group, etc),
and only later a task does the actual allocation of extents.
For metadata we usually don't end up using all reserved space, as we are
pessimistic and typically account for the worst cases (need to COW every
single node in a path of a tree at maximum possible height, etc). For
data we usually reserve the exact amount of space we're going to allocate
later, except when using compression where we always reserve space based
on the uncompressed size, as compression is only triggered when writeback
starts so we don't know in advance how much space we'll actually need, or
if the data is compressible.
So don't delete an unused block group if the total size of its space_info
object minus the block group's size is less then the sum of used space and
space that may be used (space_info->bytes_may_use), as that means we have
tasks that reserved space and may need to allocate extents from the block
group. In this case, besides skipping the deletion, re-add the block group
to the list of unused block groups so that it may be reconsidered later,
in case the tasks that reserved space end up not needing to allocate
extents from it.
Allowing the deletion of the block group while we have reserved space, can
result in tasks failing to allocate metadata extents (-ENOSPC) while under
a transaction handle, resulting in a transaction abort, or failure during
writeback for the case of data extents.
CC: stable@vger.kernel.org # 6.0+
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-01-25 09:53:14 +00:00
|
|
|
/*
|
|
|
|
* Add a reference for the list, compensate for the ref
|
|
|
|
* drop under the "next" label for the
|
|
|
|
* fs_info->unused_bgs list.
|
|
|
|
*/
|
|
|
|
btrfs_get_block_group(block_group);
|
|
|
|
list_add_tail(&block_group->bg_list, &retry_list);
|
|
|
|
|
|
|
|
trace_btrfs_skip_unused_block_group(block_group);
|
|
|
|
spin_unlock(&block_group->lock);
|
|
|
|
spin_unlock(&space_info->lock);
|
|
|
|
up_write(&space_info->groups_sem);
|
|
|
|
goto next;
|
|
|
|
}
|
|
|
|
|
2019-06-20 15:37:55 -04:00
|
|
|
spin_unlock(&block_group->lock);
|
btrfs: do not delete unused block group if it may be used soon
Before deleting a block group that is in the list of unused block groups
(fs_info->unused_bgs), we check if the block group became used before
deleting it, as extents from it may have been allocated after it was added
to the list.
However even if the block group was not yet used, there may be tasks that
have only reserved space and have not yet allocated extents, and they
might be relying on the availability of the unused block group in order
to allocate extents. The reservation works first by increasing the
"bytes_may_use" field of the corresponding space_info object (which may
first require flushing delayed items, allocating a new block group, etc),
and only later a task does the actual allocation of extents.
For metadata we usually don't end up using all reserved space, as we are
pessimistic and typically account for the worst cases (need to COW every
single node in a path of a tree at maximum possible height, etc). For
data we usually reserve the exact amount of space we're going to allocate
later, except when using compression where we always reserve space based
on the uncompressed size, as compression is only triggered when writeback
starts so we don't know in advance how much space we'll actually need, or
if the data is compressible.
So don't delete an unused block group if the total size of its space_info
object minus the block group's size is less then the sum of used space and
space that may be used (space_info->bytes_may_use), as that means we have
tasks that reserved space and may need to allocate extents from the block
group. In this case, besides skipping the deletion, re-add the block group
to the list of unused block groups so that it may be reconsidered later,
in case the tasks that reserved space end up not needing to allocate
extents from it.
Allowing the deletion of the block group while we have reserved space, can
result in tasks failing to allocate metadata extents (-ENOSPC) while under
a transaction handle, resulting in a transaction abort, or failure during
writeback for the case of data extents.
CC: stable@vger.kernel.org # 6.0+
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-01-25 09:53:14 +00:00
|
|
|
spin_unlock(&space_info->lock);
|
2019-06-20 15:37:55 -04:00
|
|
|
|
|
|
|
/* We don't want to force the issue, only flip if it's ok. */
|
2019-06-20 15:38:07 -04:00
|
|
|
ret = inc_block_group_ro(block_group, 0);
|
2019-06-20 15:37:55 -04:00
|
|
|
up_write(&space_info->groups_sem);
|
|
|
|
if (ret < 0) {
|
|
|
|
ret = 0;
|
|
|
|
goto next;
|
|
|
|
}
|
|
|
|
|
btrfs: zoned: zone finish unused block group
While the active zones within an active block group are reset, and their
active resource is released, the block group itself is kept in the active
block group list and marked as active. As a result, the list will contain
more than max_active_zones block groups. That itself is not fatal for the
device as the zones are properly reset.
However, that inflated list is, of course, strange. Also, a to-appear
patch series, which deactivates an active block group on demand, gets
confused with the wrong list.
So, fix the issue by finishing the unused block group once it gets
read-only, so that we can release the active resource in an early stage.
Fixes: be1a1d7a5d24 ("btrfs: zoned: finish fully written block group")
CC: stable@vger.kernel.org # 5.16+
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-03 17:48:54 -07:00
|
|
|
ret = btrfs_zone_finish(block_group);
|
|
|
|
if (ret < 0) {
|
|
|
|
btrfs_dec_block_group_ro(block_group);
|
|
|
|
if (ret == -EAGAIN)
|
|
|
|
ret = 0;
|
|
|
|
goto next;
|
|
|
|
}
|
|
|
|
|
2019-06-20 15:37:55 -04:00
|
|
|
/*
|
|
|
|
* Want to do this before we do anything else so we can recover
|
|
|
|
* properly if we fail to join the transaction.
|
|
|
|
*/
|
|
|
|
trans = btrfs_start_trans_remove_block_group(fs_info,
|
2019-10-23 18:48:22 +02:00
|
|
|
block_group->start);
|
2019-06-20 15:37:55 -04:00
|
|
|
if (IS_ERR(trans)) {
|
|
|
|
btrfs_dec_block_group_ro(block_group);
|
|
|
|
ret = PTR_ERR(trans);
|
|
|
|
goto next;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* We could have pending pinned extents for this block group,
|
|
|
|
* just delete them, we don't care about them anymore.
|
|
|
|
*/
|
2020-04-17 16:36:50 +01:00
|
|
|
if (!clean_pinned_extents(trans, block_group)) {
|
|
|
|
btrfs_dec_block_group_ro(block_group);
|
2019-06-20 15:37:55 -04:00
|
|
|
goto end_trans;
|
2020-04-17 16:36:50 +01:00
|
|
|
}
|
2019-06-20 15:37:55 -04:00
|
|
|
|
2019-12-13 16:22:14 -08:00
|
|
|
/*
|
|
|
|
* At this point, the block_group is read only and should fail
|
|
|
|
* new allocations. However, btrfs_finish_extent_commit() can
|
|
|
|
* cause this block_group to be placed back on the discard
|
|
|
|
* lists because now the block_group isn't fully discarded.
|
|
|
|
* Bail here and try again later after discarding everything.
|
|
|
|
*/
|
|
|
|
spin_lock(&fs_info->discard_ctl.lock);
|
|
|
|
if (!list_empty(&block_group->discard_list)) {
|
|
|
|
spin_unlock(&fs_info->discard_ctl.lock);
|
|
|
|
btrfs_dec_block_group_ro(block_group);
|
|
|
|
btrfs_discard_queue_work(&fs_info->discard_ctl,
|
|
|
|
block_group);
|
|
|
|
goto end_trans;
|
|
|
|
}
|
|
|
|
spin_unlock(&fs_info->discard_ctl.lock);
|
|
|
|
|
2019-06-20 15:37:55 -04:00
|
|
|
/* Reset pinned so btrfs_put_block_group doesn't complain */
|
|
|
|
spin_lock(&space_info->lock);
|
|
|
|
spin_lock(&block_group->lock);
|
|
|
|
|
|
|
|
btrfs_space_info_update_bytes_pinned(fs_info, space_info,
|
|
|
|
-block_group->pinned);
|
|
|
|
space_info->bytes_readonly += block_group->pinned;
|
|
|
|
block_group->pinned = 0;
|
|
|
|
|
|
|
|
spin_unlock(&block_group->lock);
|
|
|
|
spin_unlock(&space_info->lock);
|
|
|
|
|
btrfs: handle empty block_group removal for async discard
block_group removal is a little tricky. It can race with the extent
allocator, the cleaner thread, and balancing. The current path is for a
block_group to be added to the unused_bgs list. Then, when the cleaner
thread comes around, it starts a transaction and then proceeds with
removing the block_group. Extents that are pinned are subsequently
removed from the pinned trees and then eventually a discard is issued
for the entire block_group.
Async discard introduces another player into the game, the discard
workqueue. While it has none of the racing issues, the new problem is
ensuring we don't leave free space untrimmed prior to forgetting the
block_group. This is handled by placing fully free block_groups on a
separate discard queue. This is necessary to maintain discarding order
as in the future we will slowly trim even fully free block_groups. The
ordering helps us make progress on the same block_group rather than say
the last fully freed block_group or needing to search through the fully
freed block groups at the beginning of a list and insert after.
The new order of events is a fully freed block group gets placed on the
unused discard queue first. Once it's processed, it will be placed on
the unusued_bgs list and then the original sequence of events will
happen, just without the final whole block_group discard.
The mount flags can change when processing unused_bgs, so when flipping
from DISCARD to DISCARD_ASYNC, the unused_bgs must be punted to the
discard_list to be trimmed. If we flip off DISCARD_ASYNC, we punt
free block groups on the discard_list to the unused_bg queue which will
do the final discard for us.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-12-13 16:22:15 -08:00
|
|
|
/*
|
|
|
|
* The normal path here is an unused block group is passed here,
|
|
|
|
* then trimming is handled in the transaction commit path.
|
|
|
|
* Async discard interposes before this to do the trimming
|
|
|
|
* before coming down the unused block group path as trimming
|
|
|
|
* will no longer be done later in the transaction commit path.
|
|
|
|
*/
|
|
|
|
if (!async_trim_enabled && btrfs_test_opt(fs_info, DISCARD_ASYNC))
|
|
|
|
goto flip_async;
|
|
|
|
|
2021-02-04 19:21:56 +09:00
|
|
|
/*
|
|
|
|
* DISCARD can flip during remount. On zoned filesystems, we
|
|
|
|
* need to reset sequential-required zones.
|
|
|
|
*/
|
|
|
|
trimming = btrfs_test_opt(fs_info, DISCARD_SYNC) ||
|
|
|
|
btrfs_is_zoned(fs_info);
|
2019-06-20 15:37:55 -04:00
|
|
|
|
|
|
|
/* Implicit trim during transaction commit. */
|
|
|
|
if (trimming)
|
2020-05-08 11:01:47 +01:00
|
|
|
btrfs_freeze_block_group(block_group);
|
2019-06-20 15:37:55 -04:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Btrfs_remove_chunk will abort the transaction if things go
|
|
|
|
* horribly wrong.
|
|
|
|
*/
|
2019-10-23 18:48:22 +02:00
|
|
|
ret = btrfs_remove_chunk(trans, block_group->start);
|
2019-06-20 15:37:55 -04:00
|
|
|
|
|
|
|
if (ret) {
|
|
|
|
if (trimming)
|
2020-05-08 11:01:47 +01:00
|
|
|
btrfs_unfreeze_block_group(block_group);
|
2019-06-20 15:37:55 -04:00
|
|
|
goto end_trans;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* If we're not mounted with -odiscard, we can just forget
|
|
|
|
* about this block group. Otherwise we'll need to wait
|
|
|
|
* until transaction commit to do the actual discard.
|
|
|
|
*/
|
|
|
|
if (trimming) {
|
|
|
|
spin_lock(&fs_info->unused_bgs_lock);
|
|
|
|
/*
|
|
|
|
* A concurrent scrub might have added us to the list
|
|
|
|
* fs_info->unused_bgs, so use a list_move operation
|
|
|
|
* to add the block group to the deleted_bgs list.
|
|
|
|
*/
|
|
|
|
list_move(&block_group->bg_list,
|
|
|
|
&trans->transaction->deleted_bgs);
|
|
|
|
spin_unlock(&fs_info->unused_bgs_lock);
|
|
|
|
btrfs_get_block_group(block_group);
|
|
|
|
}
|
|
|
|
end_trans:
|
|
|
|
btrfs_end_transaction(trans);
|
|
|
|
next:
|
|
|
|
btrfs_put_block_group(block_group);
|
|
|
|
spin_lock(&fs_info->unused_bgs_lock);
|
|
|
|
}
|
btrfs: do not delete unused block group if it may be used soon
Before deleting a block group that is in the list of unused block groups
(fs_info->unused_bgs), we check if the block group became used before
deleting it, as extents from it may have been allocated after it was added
to the list.
However even if the block group was not yet used, there may be tasks that
have only reserved space and have not yet allocated extents, and they
might be relying on the availability of the unused block group in order
to allocate extents. The reservation works first by increasing the
"bytes_may_use" field of the corresponding space_info object (which may
first require flushing delayed items, allocating a new block group, etc),
and only later a task does the actual allocation of extents.
For metadata we usually don't end up using all reserved space, as we are
pessimistic and typically account for the worst cases (need to COW every
single node in a path of a tree at maximum possible height, etc). For
data we usually reserve the exact amount of space we're going to allocate
later, except when using compression where we always reserve space based
on the uncompressed size, as compression is only triggered when writeback
starts so we don't know in advance how much space we'll actually need, or
if the data is compressible.
So don't delete an unused block group if the total size of its space_info
object minus the block group's size is less then the sum of used space and
space that may be used (space_info->bytes_may_use), as that means we have
tasks that reserved space and may need to allocate extents from the block
group. In this case, besides skipping the deletion, re-add the block group
to the list of unused block groups so that it may be reconsidered later,
in case the tasks that reserved space end up not needing to allocate
extents from it.
Allowing the deletion of the block group while we have reserved space, can
result in tasks failing to allocate metadata extents (-ENOSPC) while under
a transaction handle, resulting in a transaction abort, or failure during
writeback for the case of data extents.
CC: stable@vger.kernel.org # 6.0+
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-01-25 09:53:14 +00:00
|
|
|
list_splice_tail(&retry_list, &fs_info->unused_bgs);
|
2019-06-20 15:37:55 -04:00
|
|
|
spin_unlock(&fs_info->unused_bgs_lock);
|
2021-04-19 16:41:01 +09:00
|
|
|
mutex_unlock(&fs_info->reclaim_bgs_lock);
|
btrfs: handle empty block_group removal for async discard
block_group removal is a little tricky. It can race with the extent
allocator, the cleaner thread, and balancing. The current path is for a
block_group to be added to the unused_bgs list. Then, when the cleaner
thread comes around, it starts a transaction and then proceeds with
removing the block_group. Extents that are pinned are subsequently
removed from the pinned trees and then eventually a discard is issued
for the entire block_group.
Async discard introduces another player into the game, the discard
workqueue. While it has none of the racing issues, the new problem is
ensuring we don't leave free space untrimmed prior to forgetting the
block_group. This is handled by placing fully free block_groups on a
separate discard queue. This is necessary to maintain discarding order
as in the future we will slowly trim even fully free block_groups. The
ordering helps us make progress on the same block_group rather than say
the last fully freed block_group or needing to search through the fully
freed block groups at the beginning of a list and insert after.
The new order of events is a fully freed block group gets placed on the
unused discard queue first. Once it's processed, it will be placed on
the unusued_bgs list and then the original sequence of events will
happen, just without the final whole block_group discard.
The mount flags can change when processing unused_bgs, so when flipping
from DISCARD to DISCARD_ASYNC, the unused_bgs must be punted to the
discard_list to be trimmed. If we flip off DISCARD_ASYNC, we punt
free block groups on the discard_list to the unused_bg queue which will
do the final discard for us.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-12-13 16:22:15 -08:00
|
|
|
return;
|
|
|
|
|
|
|
|
flip_async:
|
|
|
|
btrfs_end_transaction(trans);
|
btrfs: do not delete unused block group if it may be used soon
Before deleting a block group that is in the list of unused block groups
(fs_info->unused_bgs), we check if the block group became used before
deleting it, as extents from it may have been allocated after it was added
to the list.
However even if the block group was not yet used, there may be tasks that
have only reserved space and have not yet allocated extents, and they
might be relying on the availability of the unused block group in order
to allocate extents. The reservation works first by increasing the
"bytes_may_use" field of the corresponding space_info object (which may
first require flushing delayed items, allocating a new block group, etc),
and only later a task does the actual allocation of extents.
For metadata we usually don't end up using all reserved space, as we are
pessimistic and typically account for the worst cases (need to COW every
single node in a path of a tree at maximum possible height, etc). For
data we usually reserve the exact amount of space we're going to allocate
later, except when using compression where we always reserve space based
on the uncompressed size, as compression is only triggered when writeback
starts so we don't know in advance how much space we'll actually need, or
if the data is compressible.
So don't delete an unused block group if the total size of its space_info
object minus the block group's size is less then the sum of used space and
space that may be used (space_info->bytes_may_use), as that means we have
tasks that reserved space and may need to allocate extents from the block
group. In this case, besides skipping the deletion, re-add the block group
to the list of unused block groups so that it may be reconsidered later,
in case the tasks that reserved space end up not needing to allocate
extents from it.
Allowing the deletion of the block group while we have reserved space, can
result in tasks failing to allocate metadata extents (-ENOSPC) while under
a transaction handle, resulting in a transaction abort, or failure during
writeback for the case of data extents.
CC: stable@vger.kernel.org # 6.0+
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-01-25 09:53:14 +00:00
|
|
|
spin_lock(&fs_info->unused_bgs_lock);
|
|
|
|
list_splice_tail(&retry_list, &fs_info->unused_bgs);
|
|
|
|
spin_unlock(&fs_info->unused_bgs_lock);
|
2021-04-19 16:41:01 +09:00
|
|
|
mutex_unlock(&fs_info->reclaim_bgs_lock);
|
btrfs: handle empty block_group removal for async discard
block_group removal is a little tricky. It can race with the extent
allocator, the cleaner thread, and balancing. The current path is for a
block_group to be added to the unused_bgs list. Then, when the cleaner
thread comes around, it starts a transaction and then proceeds with
removing the block_group. Extents that are pinned are subsequently
removed from the pinned trees and then eventually a discard is issued
for the entire block_group.
Async discard introduces another player into the game, the discard
workqueue. While it has none of the racing issues, the new problem is
ensuring we don't leave free space untrimmed prior to forgetting the
block_group. This is handled by placing fully free block_groups on a
separate discard queue. This is necessary to maintain discarding order
as in the future we will slowly trim even fully free block_groups. The
ordering helps us make progress on the same block_group rather than say
the last fully freed block_group or needing to search through the fully
freed block groups at the beginning of a list and insert after.
The new order of events is a fully freed block group gets placed on the
unused discard queue first. Once it's processed, it will be placed on
the unusued_bgs list and then the original sequence of events will
happen, just without the final whole block_group discard.
The mount flags can change when processing unused_bgs, so when flipping
from DISCARD to DISCARD_ASYNC, the unused_bgs must be punted to the
discard_list to be trimmed. If we flip off DISCARD_ASYNC, we punt
free block groups on the discard_list to the unused_bg queue which will
do the final discard for us.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-12-13 16:22:15 -08:00
|
|
|
btrfs_put_block_group(block_group);
|
|
|
|
btrfs_discard_punt_unused_bgs_list(fs_info);
|
2019-06-20 15:37:55 -04:00
|
|
|
}
|
|
|
|
|
2019-10-29 19:20:18 +01:00
|
|
|
void btrfs_mark_bg_unused(struct btrfs_block_group *bg)
|
2019-06-20 15:37:55 -04:00
|
|
|
{
|
|
|
|
struct btrfs_fs_info *fs_info = bg->fs_info;
|
|
|
|
|
|
|
|
spin_lock(&fs_info->unused_bgs_lock);
|
|
|
|
if (list_empty(&bg->bg_list)) {
|
|
|
|
btrfs_get_block_group(bg);
|
btrfs: fix use-after-free of new block group that became unused
If a task creates a new block group and that block group becomes unused
before we finish its creation, at btrfs_create_pending_block_groups(),
then when btrfs_mark_bg_unused() is called against the block group, we
assume that the block group is currently in the list of block groups to
reclaim, and we move it out of the list of new block groups and into the
list of unused block groups. This has two consequences:
1) We move it out of the list of new block groups associated to the
current transaction. So the block group creation is not finished and
if we attempt to delete the bg because it's unused, we will not find
the block group item in the extent tree (or the new block group tree),
its device extent items in the device tree etc, resulting in the
deletion to fail due to the missing items;
2) We don't increment the reference count on the block group when we
move it to the list of unused block groups, because we assumed the
block group was on the list of block groups to reclaim, and in that
case it already has the correct reference count. However the block
group was on the list of new block groups, in which case no extra
reference was taken because it's local to the current task. This
later results in doing an extra reference count decrement when
removing the block group from the unused list, eventually leading the
reference count to 0.
This second case was caught when running generic/297 from fstests, which
produced the following assertion failure and stack trace:
[589.559] assertion failed: refcount_read(&block_group->refs) == 1, in fs/btrfs/block-group.c:4299
[589.559] ------------[ cut here ]------------
[589.559] kernel BUG at fs/btrfs/block-group.c:4299!
[589.560] invalid opcode: 0000 [#1] PREEMPT SMP PTI
[589.560] CPU: 8 PID: 2819134 Comm: umount Tainted: G W 6.4.0-rc6-btrfs-next-134+ #1
[589.560] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.2-0-gea1b7a073390-prebuilt.qemu.org 04/01/2014
[589.560] RIP: 0010:btrfs_free_block_groups+0x449/0x4a0 [btrfs]
[589.561] Code: 68 62 da c0 (...)
[589.561] RSP: 0018:ffffa55a8c3b3d98 EFLAGS: 00010246
[589.561] RAX: 0000000000000058 RBX: ffff8f030d7f2000 RCX: 0000000000000000
[589.562] RDX: 0000000000000000 RSI: ffffffff953f0878 RDI: 00000000ffffffff
[589.562] RBP: ffff8f030d7f2088 R08: 0000000000000000 R09: ffffa55a8c3b3c50
[589.562] R10: 0000000000000001 R11: 0000000000000001 R12: ffff8f05850b4c00
[589.562] R13: ffff8f030d7f2090 R14: ffff8f05850b4cd8 R15: dead000000000100
[589.563] FS: 00007f497fd2e840(0000) GS:ffff8f09dfc00000(0000) knlGS:0000000000000000
[589.563] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[589.563] CR2: 00007f497ff8ec10 CR3: 0000000271472006 CR4: 0000000000370ee0
[589.563] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[589.564] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[589.564] Call Trace:
[589.564] <TASK>
[589.565] ? __die_body+0x1b/0x60
[589.565] ? die+0x39/0x60
[589.565] ? do_trap+0xeb/0x110
[589.565] ? btrfs_free_block_groups+0x449/0x4a0 [btrfs]
[589.566] ? do_error_trap+0x6a/0x90
[589.566] ? btrfs_free_block_groups+0x449/0x4a0 [btrfs]
[589.566] ? exc_invalid_op+0x4e/0x70
[589.566] ? btrfs_free_block_groups+0x449/0x4a0 [btrfs]
[589.567] ? asm_exc_invalid_op+0x16/0x20
[589.567] ? btrfs_free_block_groups+0x449/0x4a0 [btrfs]
[589.567] ? btrfs_free_block_groups+0x449/0x4a0 [btrfs]
[589.567] close_ctree+0x35d/0x560 [btrfs]
[589.568] ? fsnotify_sb_delete+0x13e/0x1d0
[589.568] ? dispose_list+0x3a/0x50
[589.568] ? evict_inodes+0x151/0x1a0
[589.568] generic_shutdown_super+0x73/0x1a0
[589.569] kill_anon_super+0x14/0x30
[589.569] btrfs_kill_super+0x12/0x20 [btrfs]
[589.569] deactivate_locked_super+0x2e/0x70
[589.569] cleanup_mnt+0x104/0x160
[589.570] task_work_run+0x56/0x90
[589.570] exit_to_user_mode_prepare+0x160/0x170
[589.570] syscall_exit_to_user_mode+0x22/0x50
[589.570] ? __x64_sys_umount+0x12/0x20
[589.571] do_syscall_64+0x48/0x90
[589.571] entry_SYSCALL_64_after_hwframe+0x72/0xdc
[589.571] RIP: 0033:0x7f497ff0a567
[589.571] Code: af 98 0e (...)
[589.572] RSP: 002b:00007ffc98347358 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
[589.572] RAX: 0000000000000000 RBX: 00007f49800b8264 RCX: 00007f497ff0a567
[589.572] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000557f558abfa0
[589.573] RBP: 0000557f558a6ba0 R08: 0000000000000000 R09: 00007ffc98346100
[589.573] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[589.573] R13: 0000557f558abfa0 R14: 0000557f558a6cb0 R15: 0000557f558a6dd0
[589.573] </TASK>
[589.574] Modules linked in: dm_snapshot dm_thin_pool (...)
[589.576] ---[ end trace 0000000000000000 ]---
Fix this by adding a runtime flag to the block group to tell that the
block group is still in the list of new block groups, and therefore it
should not be moved to the list of unused block groups, at
btrfs_mark_bg_unused(), until the flag is cleared, when we finish the
creation of the block group at btrfs_create_pending_block_groups().
Fixes: a9f189716cf1 ("btrfs: move out now unused BG from the reclaim list")
CC: stable@vger.kernel.org # 5.15+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-28 17:13:37 +01:00
|
|
|
trace_btrfs_add_unused_block_group(bg);
|
2019-06-20 15:37:55 -04:00
|
|
|
list_add_tail(&bg->bg_list, &fs_info->unused_bgs);
|
btrfs: fix use-after-free of new block group that became unused
If a task creates a new block group and that block group becomes unused
before we finish its creation, at btrfs_create_pending_block_groups(),
then when btrfs_mark_bg_unused() is called against the block group, we
assume that the block group is currently in the list of block groups to
reclaim, and we move it out of the list of new block groups and into the
list of unused block groups. This has two consequences:
1) We move it out of the list of new block groups associated to the
current transaction. So the block group creation is not finished and
if we attempt to delete the bg because it's unused, we will not find
the block group item in the extent tree (or the new block group tree),
its device extent items in the device tree etc, resulting in the
deletion to fail due to the missing items;
2) We don't increment the reference count on the block group when we
move it to the list of unused block groups, because we assumed the
block group was on the list of block groups to reclaim, and in that
case it already has the correct reference count. However the block
group was on the list of new block groups, in which case no extra
reference was taken because it's local to the current task. This
later results in doing an extra reference count decrement when
removing the block group from the unused list, eventually leading the
reference count to 0.
This second case was caught when running generic/297 from fstests, which
produced the following assertion failure and stack trace:
[589.559] assertion failed: refcount_read(&block_group->refs) == 1, in fs/btrfs/block-group.c:4299
[589.559] ------------[ cut here ]------------
[589.559] kernel BUG at fs/btrfs/block-group.c:4299!
[589.560] invalid opcode: 0000 [#1] PREEMPT SMP PTI
[589.560] CPU: 8 PID: 2819134 Comm: umount Tainted: G W 6.4.0-rc6-btrfs-next-134+ #1
[589.560] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.2-0-gea1b7a073390-prebuilt.qemu.org 04/01/2014
[589.560] RIP: 0010:btrfs_free_block_groups+0x449/0x4a0 [btrfs]
[589.561] Code: 68 62 da c0 (...)
[589.561] RSP: 0018:ffffa55a8c3b3d98 EFLAGS: 00010246
[589.561] RAX: 0000000000000058 RBX: ffff8f030d7f2000 RCX: 0000000000000000
[589.562] RDX: 0000000000000000 RSI: ffffffff953f0878 RDI: 00000000ffffffff
[589.562] RBP: ffff8f030d7f2088 R08: 0000000000000000 R09: ffffa55a8c3b3c50
[589.562] R10: 0000000000000001 R11: 0000000000000001 R12: ffff8f05850b4c00
[589.562] R13: ffff8f030d7f2090 R14: ffff8f05850b4cd8 R15: dead000000000100
[589.563] FS: 00007f497fd2e840(0000) GS:ffff8f09dfc00000(0000) knlGS:0000000000000000
[589.563] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[589.563] CR2: 00007f497ff8ec10 CR3: 0000000271472006 CR4: 0000000000370ee0
[589.563] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[589.564] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[589.564] Call Trace:
[589.564] <TASK>
[589.565] ? __die_body+0x1b/0x60
[589.565] ? die+0x39/0x60
[589.565] ? do_trap+0xeb/0x110
[589.565] ? btrfs_free_block_groups+0x449/0x4a0 [btrfs]
[589.566] ? do_error_trap+0x6a/0x90
[589.566] ? btrfs_free_block_groups+0x449/0x4a0 [btrfs]
[589.566] ? exc_invalid_op+0x4e/0x70
[589.566] ? btrfs_free_block_groups+0x449/0x4a0 [btrfs]
[589.567] ? asm_exc_invalid_op+0x16/0x20
[589.567] ? btrfs_free_block_groups+0x449/0x4a0 [btrfs]
[589.567] ? btrfs_free_block_groups+0x449/0x4a0 [btrfs]
[589.567] close_ctree+0x35d/0x560 [btrfs]
[589.568] ? fsnotify_sb_delete+0x13e/0x1d0
[589.568] ? dispose_list+0x3a/0x50
[589.568] ? evict_inodes+0x151/0x1a0
[589.568] generic_shutdown_super+0x73/0x1a0
[589.569] kill_anon_super+0x14/0x30
[589.569] btrfs_kill_super+0x12/0x20 [btrfs]
[589.569] deactivate_locked_super+0x2e/0x70
[589.569] cleanup_mnt+0x104/0x160
[589.570] task_work_run+0x56/0x90
[589.570] exit_to_user_mode_prepare+0x160/0x170
[589.570] syscall_exit_to_user_mode+0x22/0x50
[589.570] ? __x64_sys_umount+0x12/0x20
[589.571] do_syscall_64+0x48/0x90
[589.571] entry_SYSCALL_64_after_hwframe+0x72/0xdc
[589.571] RIP: 0033:0x7f497ff0a567
[589.571] Code: af 98 0e (...)
[589.572] RSP: 002b:00007ffc98347358 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
[589.572] RAX: 0000000000000000 RBX: 00007f49800b8264 RCX: 00007f497ff0a567
[589.572] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000557f558abfa0
[589.573] RBP: 0000557f558a6ba0 R08: 0000000000000000 R09: 00007ffc98346100
[589.573] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[589.573] R13: 0000557f558abfa0 R14: 0000557f558a6cb0 R15: 0000557f558a6dd0
[589.573] </TASK>
[589.574] Modules linked in: dm_snapshot dm_thin_pool (...)
[589.576] ---[ end trace 0000000000000000 ]---
Fix this by adding a runtime flag to the block group to tell that the
block group is still in the list of new block groups, and therefore it
should not be moved to the list of unused block groups, at
btrfs_mark_bg_unused(), until the flag is cleared, when we finish the
creation of the block group at btrfs_create_pending_block_groups().
Fixes: a9f189716cf1 ("btrfs: move out now unused BG from the reclaim list")
CC: stable@vger.kernel.org # 5.15+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-28 17:13:37 +01:00
|
|
|
} else if (!test_bit(BLOCK_GROUP_FLAG_NEW, &bg->runtime_flags)) {
|
2023-06-06 14:36:34 +09:00
|
|
|
/* Pull out the block group from the reclaim_bgs list. */
|
btrfs: fix use-after-free of new block group that became unused
If a task creates a new block group and that block group becomes unused
before we finish its creation, at btrfs_create_pending_block_groups(),
then when btrfs_mark_bg_unused() is called against the block group, we
assume that the block group is currently in the list of block groups to
reclaim, and we move it out of the list of new block groups and into the
list of unused block groups. This has two consequences:
1) We move it out of the list of new block groups associated to the
current transaction. So the block group creation is not finished and
if we attempt to delete the bg because it's unused, we will not find
the block group item in the extent tree (or the new block group tree),
its device extent items in the device tree etc, resulting in the
deletion to fail due to the missing items;
2) We don't increment the reference count on the block group when we
move it to the list of unused block groups, because we assumed the
block group was on the list of block groups to reclaim, and in that
case it already has the correct reference count. However the block
group was on the list of new block groups, in which case no extra
reference was taken because it's local to the current task. This
later results in doing an extra reference count decrement when
removing the block group from the unused list, eventually leading the
reference count to 0.
This second case was caught when running generic/297 from fstests, which
produced the following assertion failure and stack trace:
[589.559] assertion failed: refcount_read(&block_group->refs) == 1, in fs/btrfs/block-group.c:4299
[589.559] ------------[ cut here ]------------
[589.559] kernel BUG at fs/btrfs/block-group.c:4299!
[589.560] invalid opcode: 0000 [#1] PREEMPT SMP PTI
[589.560] CPU: 8 PID: 2819134 Comm: umount Tainted: G W 6.4.0-rc6-btrfs-next-134+ #1
[589.560] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.2-0-gea1b7a073390-prebuilt.qemu.org 04/01/2014
[589.560] RIP: 0010:btrfs_free_block_groups+0x449/0x4a0 [btrfs]
[589.561] Code: 68 62 da c0 (...)
[589.561] RSP: 0018:ffffa55a8c3b3d98 EFLAGS: 00010246
[589.561] RAX: 0000000000000058 RBX: ffff8f030d7f2000 RCX: 0000000000000000
[589.562] RDX: 0000000000000000 RSI: ffffffff953f0878 RDI: 00000000ffffffff
[589.562] RBP: ffff8f030d7f2088 R08: 0000000000000000 R09: ffffa55a8c3b3c50
[589.562] R10: 0000000000000001 R11: 0000000000000001 R12: ffff8f05850b4c00
[589.562] R13: ffff8f030d7f2090 R14: ffff8f05850b4cd8 R15: dead000000000100
[589.563] FS: 00007f497fd2e840(0000) GS:ffff8f09dfc00000(0000) knlGS:0000000000000000
[589.563] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[589.563] CR2: 00007f497ff8ec10 CR3: 0000000271472006 CR4: 0000000000370ee0
[589.563] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[589.564] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[589.564] Call Trace:
[589.564] <TASK>
[589.565] ? __die_body+0x1b/0x60
[589.565] ? die+0x39/0x60
[589.565] ? do_trap+0xeb/0x110
[589.565] ? btrfs_free_block_groups+0x449/0x4a0 [btrfs]
[589.566] ? do_error_trap+0x6a/0x90
[589.566] ? btrfs_free_block_groups+0x449/0x4a0 [btrfs]
[589.566] ? exc_invalid_op+0x4e/0x70
[589.566] ? btrfs_free_block_groups+0x449/0x4a0 [btrfs]
[589.567] ? asm_exc_invalid_op+0x16/0x20
[589.567] ? btrfs_free_block_groups+0x449/0x4a0 [btrfs]
[589.567] ? btrfs_free_block_groups+0x449/0x4a0 [btrfs]
[589.567] close_ctree+0x35d/0x560 [btrfs]
[589.568] ? fsnotify_sb_delete+0x13e/0x1d0
[589.568] ? dispose_list+0x3a/0x50
[589.568] ? evict_inodes+0x151/0x1a0
[589.568] generic_shutdown_super+0x73/0x1a0
[589.569] kill_anon_super+0x14/0x30
[589.569] btrfs_kill_super+0x12/0x20 [btrfs]
[589.569] deactivate_locked_super+0x2e/0x70
[589.569] cleanup_mnt+0x104/0x160
[589.570] task_work_run+0x56/0x90
[589.570] exit_to_user_mode_prepare+0x160/0x170
[589.570] syscall_exit_to_user_mode+0x22/0x50
[589.570] ? __x64_sys_umount+0x12/0x20
[589.571] do_syscall_64+0x48/0x90
[589.571] entry_SYSCALL_64_after_hwframe+0x72/0xdc
[589.571] RIP: 0033:0x7f497ff0a567
[589.571] Code: af 98 0e (...)
[589.572] RSP: 002b:00007ffc98347358 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
[589.572] RAX: 0000000000000000 RBX: 00007f49800b8264 RCX: 00007f497ff0a567
[589.572] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000557f558abfa0
[589.573] RBP: 0000557f558a6ba0 R08: 0000000000000000 R09: 00007ffc98346100
[589.573] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[589.573] R13: 0000557f558abfa0 R14: 0000557f558a6cb0 R15: 0000557f558a6dd0
[589.573] </TASK>
[589.574] Modules linked in: dm_snapshot dm_thin_pool (...)
[589.576] ---[ end trace 0000000000000000 ]---
Fix this by adding a runtime flag to the block group to tell that the
block group is still in the list of new block groups, and therefore it
should not be moved to the list of unused block groups, at
btrfs_mark_bg_unused(), until the flag is cleared, when we finish the
creation of the block group at btrfs_create_pending_block_groups().
Fixes: a9f189716cf1 ("btrfs: move out now unused BG from the reclaim list")
CC: stable@vger.kernel.org # 5.15+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-28 17:13:37 +01:00
|
|
|
trace_btrfs_add_unused_block_group(bg);
|
2023-06-06 14:36:34 +09:00
|
|
|
list_move_tail(&bg->bg_list, &fs_info->unused_bgs);
|
2019-06-20 15:37:55 -04:00
|
|
|
}
|
|
|
|
spin_unlock(&fs_info->unused_bgs_lock);
|
|
|
|
}
|
2019-06-20 15:37:57 -04:00
|
|
|
|
2021-10-14 18:39:02 +09:00
|
|
|
/*
|
|
|
|
* We want block groups with a low number of used bytes to be in the beginning
|
|
|
|
* of the list, so they will get reclaimed first.
|
|
|
|
*/
|
|
|
|
static int reclaim_bgs_cmp(void *unused, const struct list_head *a,
|
|
|
|
const struct list_head *b)
|
|
|
|
{
|
|
|
|
const struct btrfs_block_group *bg1, *bg2;
|
|
|
|
|
|
|
|
bg1 = list_entry(a, struct btrfs_block_group, bg_list);
|
|
|
|
bg2 = list_entry(b, struct btrfs_block_group, bg_list);
|
|
|
|
|
|
|
|
return bg1->used > bg2->used;
|
|
|
|
}
|
|
|
|
|
2024-06-26 23:39:11 +02:00
|
|
|
static inline bool btrfs_should_reclaim(const struct btrfs_fs_info *fs_info)
|
2022-03-29 01:56:09 -07:00
|
|
|
{
|
|
|
|
if (btrfs_is_zoned(fs_info))
|
|
|
|
return btrfs_zoned_should_reclaim(fs_info);
|
|
|
|
return true;
|
|
|
|
}
|
|
|
|
|
2024-06-26 23:39:11 +02:00
|
|
|
static bool should_reclaim_block_group(const struct btrfs_block_group *bg, u64 bytes_freed)
|
2022-10-13 15:52:10 -07:00
|
|
|
{
|
btrfs: dynamic block_group reclaim threshold
We can currently recover allocated block_groups by:
- explicitly starting balance operations
- "auto reclaim" via bg_reclaim_threshold
The latter works by checking against a fixed threshold on frees. If we
pass from above the threshold to below, relocation triggers and the
block group will get reclaimed by the cleaner thread (assuming it is
still eligible)
Picking a threshold is challenging. Too high, and you end up trying to
reclaim very full block_groups which is quite costly, and you don't do
reclaim on block_groups that don't get quite THAT full, but could still
be quite fragmented and stranding a lot of space. Too low, and you
similarly miss out on reclaim even if you badly need it to avoid running
out of unallocated space, if you have heavily fragmented block groups
living above the threshold.
No matter the threshold, it suffers from a workload that happens to
bounce around that threshold, which can introduce arbitrary amounts of
reclaim waste.
To improve this situation, introduce a dynamic threshold. The basic idea
behind this threshold is that it should be very lax when there is plenty
of unallocated space, and increasingly aggressive as we approach zero
unallocated space. To that end, it sets a target for unallocated space
(10 chunks) and then linearly increases the threshold as the amount of
space short of the target we are increases. The formula is:
(target - unalloc) / target
I tested this by running it on three interesting workloads:
1. bounce allocations around X% full.
2. fill up all the way and introduce full fragmentation.
3. write in a fragmented way until the filesystem is just about full.
1. and 2. attack the weaknesses of a fixed threshold; fixed either works
perfectly or fully falls apart, depending on the threshold. Dynamic
always handles these cases well.
3. attacks dynamic by checking whether it is too zealous to reclaim in
conditions with low unallocated and low unused. It tends to claw back
1GiB of unallocated fairly aggressively, but not much more. Early
versions of dynamic threshold struggled on this test.
Additional work could be done to intelligently ratchet up the urgency of
reclaim in very low unallocated conditions. Existing mechanisms are
already useless in that case anyway.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-02-02 11:52:16 -08:00
|
|
|
const int thresh_pct = btrfs_calc_reclaim_threshold(bg->space_info);
|
|
|
|
u64 thresh_bytes = mult_perc(bg->length, thresh_pct);
|
2022-10-13 15:52:10 -07:00
|
|
|
const u64 new_val = bg->used;
|
|
|
|
const u64 old_val = new_val + bytes_freed;
|
|
|
|
|
btrfs: dynamic block_group reclaim threshold
We can currently recover allocated block_groups by:
- explicitly starting balance operations
- "auto reclaim" via bg_reclaim_threshold
The latter works by checking against a fixed threshold on frees. If we
pass from above the threshold to below, relocation triggers and the
block group will get reclaimed by the cleaner thread (assuming it is
still eligible)
Picking a threshold is challenging. Too high, and you end up trying to
reclaim very full block_groups which is quite costly, and you don't do
reclaim on block_groups that don't get quite THAT full, but could still
be quite fragmented and stranding a lot of space. Too low, and you
similarly miss out on reclaim even if you badly need it to avoid running
out of unallocated space, if you have heavily fragmented block groups
living above the threshold.
No matter the threshold, it suffers from a workload that happens to
bounce around that threshold, which can introduce arbitrary amounts of
reclaim waste.
To improve this situation, introduce a dynamic threshold. The basic idea
behind this threshold is that it should be very lax when there is plenty
of unallocated space, and increasingly aggressive as we approach zero
unallocated space. To that end, it sets a target for unallocated space
(10 chunks) and then linearly increases the threshold as the amount of
space short of the target we are increases. The formula is:
(target - unalloc) / target
I tested this by running it on three interesting workloads:
1. bounce allocations around X% full.
2. fill up all the way and introduce full fragmentation.
3. write in a fragmented way until the filesystem is just about full.
1. and 2. attack the weaknesses of a fixed threshold; fixed either works
perfectly or fully falls apart, depending on the threshold. Dynamic
always handles these cases well.
3. attacks dynamic by checking whether it is too zealous to reclaim in
conditions with low unallocated and low unused. It tends to claw back
1GiB of unallocated fairly aggressively, but not much more. Early
versions of dynamic threshold struggled on this test.
Additional work could be done to intelligently ratchet up the urgency of
reclaim in very low unallocated conditions. Existing mechanisms are
already useless in that case anyway.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-02-02 11:52:16 -08:00
|
|
|
if (thresh_bytes == 0)
|
2022-10-13 15:52:10 -07:00
|
|
|
return false;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* If we were below the threshold before don't reclaim, we are likely a
|
|
|
|
* brand new block group and we don't want to relocate new block groups.
|
|
|
|
*/
|
btrfs: dynamic block_group reclaim threshold
We can currently recover allocated block_groups by:
- explicitly starting balance operations
- "auto reclaim" via bg_reclaim_threshold
The latter works by checking against a fixed threshold on frees. If we
pass from above the threshold to below, relocation triggers and the
block group will get reclaimed by the cleaner thread (assuming it is
still eligible)
Picking a threshold is challenging. Too high, and you end up trying to
reclaim very full block_groups which is quite costly, and you don't do
reclaim on block_groups that don't get quite THAT full, but could still
be quite fragmented and stranding a lot of space. Too low, and you
similarly miss out on reclaim even if you badly need it to avoid running
out of unallocated space, if you have heavily fragmented block groups
living above the threshold.
No matter the threshold, it suffers from a workload that happens to
bounce around that threshold, which can introduce arbitrary amounts of
reclaim waste.
To improve this situation, introduce a dynamic threshold. The basic idea
behind this threshold is that it should be very lax when there is plenty
of unallocated space, and increasingly aggressive as we approach zero
unallocated space. To that end, it sets a target for unallocated space
(10 chunks) and then linearly increases the threshold as the amount of
space short of the target we are increases. The formula is:
(target - unalloc) / target
I tested this by running it on three interesting workloads:
1. bounce allocations around X% full.
2. fill up all the way and introduce full fragmentation.
3. write in a fragmented way until the filesystem is just about full.
1. and 2. attack the weaknesses of a fixed threshold; fixed either works
perfectly or fully falls apart, depending on the threshold. Dynamic
always handles these cases well.
3. attacks dynamic by checking whether it is too zealous to reclaim in
conditions with low unallocated and low unused. It tends to claw back
1GiB of unallocated fairly aggressively, but not much more. Early
versions of dynamic threshold struggled on this test.
Additional work could be done to intelligently ratchet up the urgency of
reclaim in very low unallocated conditions. Existing mechanisms are
already useless in that case anyway.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-02-02 11:52:16 -08:00
|
|
|
if (old_val < thresh_bytes)
|
2022-10-13 15:52:10 -07:00
|
|
|
return false;
|
btrfs: dynamic block_group reclaim threshold
We can currently recover allocated block_groups by:
- explicitly starting balance operations
- "auto reclaim" via bg_reclaim_threshold
The latter works by checking against a fixed threshold on frees. If we
pass from above the threshold to below, relocation triggers and the
block group will get reclaimed by the cleaner thread (assuming it is
still eligible)
Picking a threshold is challenging. Too high, and you end up trying to
reclaim very full block_groups which is quite costly, and you don't do
reclaim on block_groups that don't get quite THAT full, but could still
be quite fragmented and stranding a lot of space. Too low, and you
similarly miss out on reclaim even if you badly need it to avoid running
out of unallocated space, if you have heavily fragmented block groups
living above the threshold.
No matter the threshold, it suffers from a workload that happens to
bounce around that threshold, which can introduce arbitrary amounts of
reclaim waste.
To improve this situation, introduce a dynamic threshold. The basic idea
behind this threshold is that it should be very lax when there is plenty
of unallocated space, and increasingly aggressive as we approach zero
unallocated space. To that end, it sets a target for unallocated space
(10 chunks) and then linearly increases the threshold as the amount of
space short of the target we are increases. The formula is:
(target - unalloc) / target
I tested this by running it on three interesting workloads:
1. bounce allocations around X% full.
2. fill up all the way and introduce full fragmentation.
3. write in a fragmented way until the filesystem is just about full.
1. and 2. attack the weaknesses of a fixed threshold; fixed either works
perfectly or fully falls apart, depending on the threshold. Dynamic
always handles these cases well.
3. attacks dynamic by checking whether it is too zealous to reclaim in
conditions with low unallocated and low unused. It tends to claw back
1GiB of unallocated fairly aggressively, but not much more. Early
versions of dynamic threshold struggled on this test.
Additional work could be done to intelligently ratchet up the urgency of
reclaim in very low unallocated conditions. Existing mechanisms are
already useless in that case anyway.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-02-02 11:52:16 -08:00
|
|
|
if (new_val >= thresh_bytes)
|
2022-10-13 15:52:10 -07:00
|
|
|
return false;
|
|
|
|
return true;
|
|
|
|
}
|
|
|
|
|
2021-04-19 16:41:02 +09:00
|
|
|
void btrfs_reclaim_bgs_work(struct work_struct *work)
|
|
|
|
{
|
|
|
|
struct btrfs_fs_info *fs_info =
|
|
|
|
container_of(work, struct btrfs_fs_info, reclaim_bgs_work);
|
|
|
|
struct btrfs_block_group *bg;
|
|
|
|
struct btrfs_space_info *space_info;
|
2024-06-07 12:50:14 -07:00
|
|
|
LIST_HEAD(retry_list);
|
2021-04-19 16:41:02 +09:00
|
|
|
|
|
|
|
if (!test_bit(BTRFS_FS_OPEN, &fs_info->flags))
|
|
|
|
return;
|
|
|
|
|
2022-07-15 15:45:21 -04:00
|
|
|
if (btrfs_fs_closing(fs_info))
|
|
|
|
return;
|
|
|
|
|
2022-03-29 01:56:09 -07:00
|
|
|
if (!btrfs_should_reclaim(fs_info))
|
|
|
|
return;
|
|
|
|
|
2022-02-18 13:14:19 +09:00
|
|
|
sb_start_write(fs_info->sb);
|
|
|
|
|
|
|
|
if (!btrfs_exclop_start(fs_info, BTRFS_EXCLOP_BALANCE)) {
|
|
|
|
sb_end_write(fs_info->sb);
|
2021-04-19 16:41:02 +09:00
|
|
|
return;
|
2022-02-18 13:14:19 +09:00
|
|
|
}
|
2021-04-19 16:41:02 +09:00
|
|
|
|
2021-07-06 01:32:38 +09:00
|
|
|
/*
|
|
|
|
* Long running balances can keep us blocked here for eternity, so
|
|
|
|
* simply skip reclaim if we're unable to get the mutex.
|
|
|
|
*/
|
|
|
|
if (!mutex_trylock(&fs_info->reclaim_bgs_lock)) {
|
|
|
|
btrfs_exclop_finish(fs_info);
|
2022-02-18 13:14:19 +09:00
|
|
|
sb_end_write(fs_info->sb);
|
2021-07-06 01:32:38 +09:00
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
2021-04-19 16:41:02 +09:00
|
|
|
spin_lock(&fs_info->unused_bgs_lock);
|
2021-10-14 18:39:02 +09:00
|
|
|
/*
|
|
|
|
* Sort happens under lock because we can't simply splice it and sort.
|
|
|
|
* The block groups might still be in use and reachable via bg_list,
|
|
|
|
* and their presence in the reclaim_bgs list must be preserved.
|
|
|
|
*/
|
|
|
|
list_sort(NULL, &fs_info->reclaim_bgs, reclaim_bgs_cmp);
|
2021-04-19 16:41:02 +09:00
|
|
|
while (!list_empty(&fs_info->reclaim_bgs)) {
|
2021-06-29 03:16:46 +09:00
|
|
|
u64 zone_unusable;
|
2024-01-25 14:10:30 -08:00
|
|
|
u64 reclaimed;
|
2021-06-21 11:10:38 +01:00
|
|
|
int ret = 0;
|
|
|
|
|
2021-04-19 16:41:02 +09:00
|
|
|
bg = list_first_entry(&fs_info->reclaim_bgs,
|
|
|
|
struct btrfs_block_group,
|
|
|
|
bg_list);
|
|
|
|
list_del_init(&bg->bg_list);
|
|
|
|
|
|
|
|
space_info = bg->space_info;
|
|
|
|
spin_unlock(&fs_info->unused_bgs_lock);
|
|
|
|
|
|
|
|
/* Don't race with allocators so take the groups_sem */
|
|
|
|
down_write(&space_info->groups_sem);
|
|
|
|
|
btrfs: dynamic block_group reclaim threshold
We can currently recover allocated block_groups by:
- explicitly starting balance operations
- "auto reclaim" via bg_reclaim_threshold
The latter works by checking against a fixed threshold on frees. If we
pass from above the threshold to below, relocation triggers and the
block group will get reclaimed by the cleaner thread (assuming it is
still eligible)
Picking a threshold is challenging. Too high, and you end up trying to
reclaim very full block_groups which is quite costly, and you don't do
reclaim on block_groups that don't get quite THAT full, but could still
be quite fragmented and stranding a lot of space. Too low, and you
similarly miss out on reclaim even if you badly need it to avoid running
out of unallocated space, if you have heavily fragmented block groups
living above the threshold.
No matter the threshold, it suffers from a workload that happens to
bounce around that threshold, which can introduce arbitrary amounts of
reclaim waste.
To improve this situation, introduce a dynamic threshold. The basic idea
behind this threshold is that it should be very lax when there is plenty
of unallocated space, and increasingly aggressive as we approach zero
unallocated space. To that end, it sets a target for unallocated space
(10 chunks) and then linearly increases the threshold as the amount of
space short of the target we are increases. The formula is:
(target - unalloc) / target
I tested this by running it on three interesting workloads:
1. bounce allocations around X% full.
2. fill up all the way and introduce full fragmentation.
3. write in a fragmented way until the filesystem is just about full.
1. and 2. attack the weaknesses of a fixed threshold; fixed either works
perfectly or fully falls apart, depending on the threshold. Dynamic
always handles these cases well.
3. attacks dynamic by checking whether it is too zealous to reclaim in
conditions with low unallocated and low unused. It tends to claw back
1GiB of unallocated fairly aggressively, but not much more. Early
versions of dynamic threshold struggled on this test.
Additional work could be done to intelligently ratchet up the urgency of
reclaim in very low unallocated conditions. Existing mechanisms are
already useless in that case anyway.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-02-02 11:52:16 -08:00
|
|
|
spin_lock(&space_info->lock);
|
2021-04-19 16:41:02 +09:00
|
|
|
spin_lock(&bg->lock);
|
|
|
|
if (bg->reserved || bg->pinned || bg->ro) {
|
|
|
|
/*
|
|
|
|
* We want to bail if we made new allocations or have
|
|
|
|
* outstanding allocations in this block group. We do
|
|
|
|
* the ro check in case balance is currently acting on
|
|
|
|
* this block group.
|
|
|
|
*/
|
|
|
|
spin_unlock(&bg->lock);
|
btrfs: dynamic block_group reclaim threshold
We can currently recover allocated block_groups by:
- explicitly starting balance operations
- "auto reclaim" via bg_reclaim_threshold
The latter works by checking against a fixed threshold on frees. If we
pass from above the threshold to below, relocation triggers and the
block group will get reclaimed by the cleaner thread (assuming it is
still eligible)
Picking a threshold is challenging. Too high, and you end up trying to
reclaim very full block_groups which is quite costly, and you don't do
reclaim on block_groups that don't get quite THAT full, but could still
be quite fragmented and stranding a lot of space. Too low, and you
similarly miss out on reclaim even if you badly need it to avoid running
out of unallocated space, if you have heavily fragmented block groups
living above the threshold.
No matter the threshold, it suffers from a workload that happens to
bounce around that threshold, which can introduce arbitrary amounts of
reclaim waste.
To improve this situation, introduce a dynamic threshold. The basic idea
behind this threshold is that it should be very lax when there is plenty
of unallocated space, and increasingly aggressive as we approach zero
unallocated space. To that end, it sets a target for unallocated space
(10 chunks) and then linearly increases the threshold as the amount of
space short of the target we are increases. The formula is:
(target - unalloc) / target
I tested this by running it on three interesting workloads:
1. bounce allocations around X% full.
2. fill up all the way and introduce full fragmentation.
3. write in a fragmented way until the filesystem is just about full.
1. and 2. attack the weaknesses of a fixed threshold; fixed either works
perfectly or fully falls apart, depending on the threshold. Dynamic
always handles these cases well.
3. attacks dynamic by checking whether it is too zealous to reclaim in
conditions with low unallocated and low unused. It tends to claw back
1GiB of unallocated fairly aggressively, but not much more. Early
versions of dynamic threshold struggled on this test.
Additional work could be done to intelligently ratchet up the urgency of
reclaim in very low unallocated conditions. Existing mechanisms are
already useless in that case anyway.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-02-02 11:52:16 -08:00
|
|
|
spin_unlock(&space_info->lock);
|
2021-04-19 16:41:02 +09:00
|
|
|
up_write(&space_info->groups_sem);
|
|
|
|
goto next;
|
|
|
|
}
|
2022-10-13 15:52:09 -07:00
|
|
|
if (bg->used == 0) {
|
|
|
|
/*
|
|
|
|
* It is possible that we trigger relocation on a block
|
|
|
|
* group as its extents are deleted and it first goes
|
|
|
|
* below the threshold, then shortly after goes empty.
|
|
|
|
*
|
|
|
|
* In this case, relocating it does delete it, but has
|
|
|
|
* some overhead in relocation specific metadata, looking
|
|
|
|
* for the non-existent extents and running some extra
|
|
|
|
* transactions, which we can avoid by using one of the
|
|
|
|
* other mechanisms for dealing with empty block groups.
|
|
|
|
*/
|
|
|
|
if (!btrfs_test_opt(fs_info, DISCARD_ASYNC))
|
|
|
|
btrfs_mark_bg_unused(bg);
|
|
|
|
spin_unlock(&bg->lock);
|
btrfs: dynamic block_group reclaim threshold
We can currently recover allocated block_groups by:
- explicitly starting balance operations
- "auto reclaim" via bg_reclaim_threshold
The latter works by checking against a fixed threshold on frees. If we
pass from above the threshold to below, relocation triggers and the
block group will get reclaimed by the cleaner thread (assuming it is
still eligible)
Picking a threshold is challenging. Too high, and you end up trying to
reclaim very full block_groups which is quite costly, and you don't do
reclaim on block_groups that don't get quite THAT full, but could still
be quite fragmented and stranding a lot of space. Too low, and you
similarly miss out on reclaim even if you badly need it to avoid running
out of unallocated space, if you have heavily fragmented block groups
living above the threshold.
No matter the threshold, it suffers from a workload that happens to
bounce around that threshold, which can introduce arbitrary amounts of
reclaim waste.
To improve this situation, introduce a dynamic threshold. The basic idea
behind this threshold is that it should be very lax when there is plenty
of unallocated space, and increasingly aggressive as we approach zero
unallocated space. To that end, it sets a target for unallocated space
(10 chunks) and then linearly increases the threshold as the amount of
space short of the target we are increases. The formula is:
(target - unalloc) / target
I tested this by running it on three interesting workloads:
1. bounce allocations around X% full.
2. fill up all the way and introduce full fragmentation.
3. write in a fragmented way until the filesystem is just about full.
1. and 2. attack the weaknesses of a fixed threshold; fixed either works
perfectly or fully falls apart, depending on the threshold. Dynamic
always handles these cases well.
3. attacks dynamic by checking whether it is too zealous to reclaim in
conditions with low unallocated and low unused. It tends to claw back
1GiB of unallocated fairly aggressively, but not much more. Early
versions of dynamic threshold struggled on this test.
Additional work could be done to intelligently ratchet up the urgency of
reclaim in very low unallocated conditions. Existing mechanisms are
already useless in that case anyway.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-02-02 11:52:16 -08:00
|
|
|
spin_unlock(&space_info->lock);
|
2022-10-13 15:52:09 -07:00
|
|
|
up_write(&space_info->groups_sem);
|
|
|
|
goto next;
|
2022-10-13 15:52:10 -07:00
|
|
|
|
|
|
|
}
|
|
|
|
/*
|
|
|
|
* The block group might no longer meet the reclaim condition by
|
|
|
|
* the time we get around to reclaiming it, so to avoid
|
|
|
|
* reclaiming overly full block_groups, skip reclaiming them.
|
|
|
|
*
|
|
|
|
* Since the decision making process also depends on the amount
|
|
|
|
* being freed, pass in a fake giant value to skip that extra
|
|
|
|
* check, which is more meaningful when adding to the list in
|
|
|
|
* the first place.
|
|
|
|
*/
|
|
|
|
if (!should_reclaim_block_group(bg, bg->length)) {
|
|
|
|
spin_unlock(&bg->lock);
|
btrfs: dynamic block_group reclaim threshold
We can currently recover allocated block_groups by:
- explicitly starting balance operations
- "auto reclaim" via bg_reclaim_threshold
The latter works by checking against a fixed threshold on frees. If we
pass from above the threshold to below, relocation triggers and the
block group will get reclaimed by the cleaner thread (assuming it is
still eligible)
Picking a threshold is challenging. Too high, and you end up trying to
reclaim very full block_groups which is quite costly, and you don't do
reclaim on block_groups that don't get quite THAT full, but could still
be quite fragmented and stranding a lot of space. Too low, and you
similarly miss out on reclaim even if you badly need it to avoid running
out of unallocated space, if you have heavily fragmented block groups
living above the threshold.
No matter the threshold, it suffers from a workload that happens to
bounce around that threshold, which can introduce arbitrary amounts of
reclaim waste.
To improve this situation, introduce a dynamic threshold. The basic idea
behind this threshold is that it should be very lax when there is plenty
of unallocated space, and increasingly aggressive as we approach zero
unallocated space. To that end, it sets a target for unallocated space
(10 chunks) and then linearly increases the threshold as the amount of
space short of the target we are increases. The formula is:
(target - unalloc) / target
I tested this by running it on three interesting workloads:
1. bounce allocations around X% full.
2. fill up all the way and introduce full fragmentation.
3. write in a fragmented way until the filesystem is just about full.
1. and 2. attack the weaknesses of a fixed threshold; fixed either works
perfectly or fully falls apart, depending on the threshold. Dynamic
always handles these cases well.
3. attacks dynamic by checking whether it is too zealous to reclaim in
conditions with low unallocated and low unused. It tends to claw back
1GiB of unallocated fairly aggressively, but not much more. Early
versions of dynamic threshold struggled on this test.
Additional work could be done to intelligently ratchet up the urgency of
reclaim in very low unallocated conditions. Existing mechanisms are
already useless in that case anyway.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-02-02 11:52:16 -08:00
|
|
|
spin_unlock(&space_info->lock);
|
2022-10-13 15:52:10 -07:00
|
|
|
up_write(&space_info->groups_sem);
|
|
|
|
goto next;
|
2022-10-13 15:52:09 -07:00
|
|
|
}
|
2021-04-19 16:41:02 +09:00
|
|
|
spin_unlock(&bg->lock);
|
btrfs: dynamic block_group reclaim threshold
We can currently recover allocated block_groups by:
- explicitly starting balance operations
- "auto reclaim" via bg_reclaim_threshold
The latter works by checking against a fixed threshold on frees. If we
pass from above the threshold to below, relocation triggers and the
block group will get reclaimed by the cleaner thread (assuming it is
still eligible)
Picking a threshold is challenging. Too high, and you end up trying to
reclaim very full block_groups which is quite costly, and you don't do
reclaim on block_groups that don't get quite THAT full, but could still
be quite fragmented and stranding a lot of space. Too low, and you
similarly miss out on reclaim even if you badly need it to avoid running
out of unallocated space, if you have heavily fragmented block groups
living above the threshold.
No matter the threshold, it suffers from a workload that happens to
bounce around that threshold, which can introduce arbitrary amounts of
reclaim waste.
To improve this situation, introduce a dynamic threshold. The basic idea
behind this threshold is that it should be very lax when there is plenty
of unallocated space, and increasingly aggressive as we approach zero
unallocated space. To that end, it sets a target for unallocated space
(10 chunks) and then linearly increases the threshold as the amount of
space short of the target we are increases. The formula is:
(target - unalloc) / target
I tested this by running it on three interesting workloads:
1. bounce allocations around X% full.
2. fill up all the way and introduce full fragmentation.
3. write in a fragmented way until the filesystem is just about full.
1. and 2. attack the weaknesses of a fixed threshold; fixed either works
perfectly or fully falls apart, depending on the threshold. Dynamic
always handles these cases well.
3. attacks dynamic by checking whether it is too zealous to reclaim in
conditions with low unallocated and low unused. It tends to claw back
1GiB of unallocated fairly aggressively, but not much more. Early
versions of dynamic threshold struggled on this test.
Additional work could be done to intelligently ratchet up the urgency of
reclaim in very low unallocated conditions. Existing mechanisms are
already useless in that case anyway.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-02-02 11:52:16 -08:00
|
|
|
spin_unlock(&space_info->lock);
|
2021-04-19 16:41:02 +09:00
|
|
|
|
2023-06-06 14:36:35 +09:00
|
|
|
/*
|
|
|
|
* Get out fast, in case we're read-only or unmounting the
|
|
|
|
* filesystem. It is OK to drop block groups from the list even
|
|
|
|
* for the read-only case. As we did sb_start_write(),
|
|
|
|
* "mount -o remount,ro" won't happen and read-only filesystem
|
|
|
|
* means it is forced read-only due to a fatal error. So, it
|
|
|
|
* never gets back to read-write to let us reclaim again.
|
|
|
|
*/
|
|
|
|
if (btrfs_need_cleaner_sleep(fs_info)) {
|
2021-04-19 16:41:02 +09:00
|
|
|
up_write(&space_info->groups_sem);
|
|
|
|
goto next;
|
|
|
|
}
|
|
|
|
|
2021-06-29 03:16:46 +09:00
|
|
|
/*
|
|
|
|
* Cache the zone_unusable value before turning the block group
|
|
|
|
* to read only. As soon as the blog group is read only it's
|
|
|
|
* zone_unusable value gets moved to the block group's read-only
|
|
|
|
* bytes and isn't available for calculations anymore.
|
|
|
|
*/
|
|
|
|
zone_unusable = bg->zone_unusable;
|
2021-04-19 16:41:02 +09:00
|
|
|
ret = inc_block_group_ro(bg, 0);
|
|
|
|
up_write(&space_info->groups_sem);
|
|
|
|
if (ret < 0)
|
|
|
|
goto next;
|
|
|
|
|
2021-06-29 03:16:46 +09:00
|
|
|
btrfs_info(fs_info,
|
|
|
|
"reclaiming chunk %llu with %llu%% used %llu%% unusable",
|
2023-02-21 10:11:24 -08:00
|
|
|
bg->start,
|
|
|
|
div64_u64(bg->used * 100, bg->length),
|
2021-06-29 03:16:46 +09:00
|
|
|
div64_u64(zone_unusable * 100, bg->length));
|
2021-04-19 16:41:02 +09:00
|
|
|
trace_btrfs_reclaim_block_group(bg);
|
2024-01-25 14:10:30 -08:00
|
|
|
reclaimed = bg->used;
|
2021-04-19 16:41:02 +09:00
|
|
|
ret = btrfs_relocate_chunk(fs_info, bg->start);
|
2022-07-25 13:05:05 -04:00
|
|
|
if (ret) {
|
|
|
|
btrfs_dec_block_group_ro(bg);
|
2021-04-19 16:41:02 +09:00
|
|
|
btrfs_err(fs_info, "error relocating chunk %llu",
|
|
|
|
bg->start);
|
2024-01-25 14:10:30 -08:00
|
|
|
reclaimed = 0;
|
|
|
|
spin_lock(&space_info->lock);
|
|
|
|
space_info->reclaim_errors++;
|
2024-02-14 11:25:50 -08:00
|
|
|
if (READ_ONCE(space_info->periodic_reclaim))
|
|
|
|
space_info->periodic_reclaim_ready = false;
|
2024-01-25 14:10:30 -08:00
|
|
|
spin_unlock(&space_info->lock);
|
2022-07-25 13:05:05 -04:00
|
|
|
}
|
2024-01-25 14:10:30 -08:00
|
|
|
spin_lock(&space_info->lock);
|
|
|
|
space_info->reclaim_count++;
|
|
|
|
space_info->reclaim_bytes += reclaimed;
|
|
|
|
spin_unlock(&space_info->lock);
|
2021-04-19 16:41:02 +09:00
|
|
|
|
|
|
|
next:
|
2024-02-14 11:25:50 -08:00
|
|
|
if (ret && !READ_ONCE(space_info->periodic_reclaim)) {
|
2024-06-07 12:50:14 -07:00
|
|
|
/* Refcount held by the reclaim_bgs list after splice. */
|
btrfs: fix adding block group to a reclaim list and the unused list during reclaim
There is a potential parallel list adding for retrying in
btrfs_reclaim_bgs_work and adding to the unused list. Since the block
group is removed from the reclaim list and it is on a relocation work,
it can be added into the unused list in parallel. When that happens,
adding it to the reclaim list will corrupt the list head and trigger
list corruption like below.
Fix it by taking fs_info->unused_bgs_lock.
[177.504][T2585409] BTRFS error (device nullb1): error relocating ch= unk 2415919104
[177.514][T2585409] list_del corruption. next->prev should be ff1100= 0344b119c0, but was ff11000377e87c70. (next=3Dff110002390cd9c0)
[177.529][T2585409] ------------[ cut here ]------------
[177.537][T2585409] kernel BUG at lib/list_debug.c:65!
[177.545][T2585409] Oops: invalid opcode: 0000 [#1] PREEMPT SMP KASAN NOPTI
[177.555][T2585409] CPU: 9 PID: 2585409 Comm: kworker/u128:2 Tainted: G W 6.10.0-rc5-kts #1
[177.568][T2585409] Hardware name: Supermicro SYS-520P-WTR/X12SPW-TF, BIOS 1.2 02/14/2022
[177.579][T2585409] Workqueue: events_unbound btrfs_reclaim_bgs_work[btrfs]
[177.589][T2585409] RIP: 0010:__list_del_entry_valid_or_report.cold+0x70/0x72
[177.624][T2585409] RSP: 0018:ff11000377e87a70 EFLAGS: 00010286
[177.633][T2585409] RAX: 000000000000006d RBX: ff11000344b119c0 RCX:0000000000000000
[177.644][T2585409] RDX: 000000000000006d RSI: 0000000000000008 RDI:ffe21c006efd0f40
[177.655][T2585409] RBP: ff110002e0509f78 R08: 0000000000000001 R09:ffe21c006efd0f08
[177.665][T2585409] R10: ff11000377e87847 R11: 0000000000000000 R12:ff110002390cd9c0
[177.676][T2585409] R13: ff11000344b119c0 R14: ff110002e0508000 R15:dffffc0000000000
[177.687][T2585409] FS: 0000000000000000(0000) GS:ff11000fec880000(0000) knlGS:0000000000000000
[177.700][T2585409] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[177.709][T2585409] CR2: 00007f06bc7b1978 CR3: 0000001021e86005 CR4:0000000000771ef0
[177.720][T2585409] DR0: 0000000000000000 DR1: 0000000000000000 DR2:0000000000000000
[177.731][T2585409] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:0000000000000400
[177.742][T2585409] PKRU: 55555554
[177.748][T2585409] Call Trace:
[177.753][T2585409] <TASK>
[177.759][T2585409] ? __die_body.cold+0x19/0x27
[177.766][T2585409] ? die+0x2e/0x50
[177.772][T2585409] ? do_trap+0x1ea/0x2d0
[177.779][T2585409] ? __list_del_entry_valid_or_report.cold+0x70/0x72
[177.788][T2585409] ? do_error_trap+0xa3/0x160
[177.795][T2585409] ? __list_del_entry_valid_or_report.cold+0x70/0x72
[177.805][T2585409] ? handle_invalid_op+0x2c/0x40
[177.812][T2585409] ? __list_del_entry_valid_or_report.cold+0x70/0x72
[177.820][T2585409] ? exc_invalid_op+0x2d/0x40
[177.827][T2585409] ? asm_exc_invalid_op+0x1a/0x20
[177.834][T2585409] ? __list_del_entry_valid_or_report.cold+0x70/0x72
[177.843][T2585409] btrfs_delete_unused_bgs+0x3d9/0x14c0 [btrfs]
There is a similar retry_list code in btrfs_delete_unused_bgs(), but it is
safe, AFAICS. Since the block group was in the unused list, the used bytes
should be 0 when it was added to the unused list. Then, it checks
block_group->{used,reserved,pinned} are still 0 under the
block_group->lock. So, they should be still eligible for the unused list,
not the reclaim list.
The reason it is safe there it's because because we're holding
space_info->groups_sem in write mode.
That means no other task can allocate from the block group, so while we
are at deleted_unused_bgs() it's not possible for other tasks to
allocate and deallocate extents from the block group, so it can't be
added to the unused list or the reclaim list by anyone else.
The bug can be reproduced by btrfs/166 after a few rounds. In practice
this can be hit when relocation cannot find more chunk space and ends
with ENOSPC.
Reported-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Suggested-by: Johannes Thumshirn <Johannes.Thumshirn@wdc.com>
Fixes: 4eb4e85c4f81 ("btrfs: retry block group reclaim without infinite loop")
CC: stable@vger.kernel.org # 5.15+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-06-28 13:32:24 +09:00
|
|
|
spin_lock(&fs_info->unused_bgs_lock);
|
|
|
|
/*
|
|
|
|
* This block group might be added to the unused list
|
|
|
|
* during the above process. Move it back to the
|
|
|
|
* reclaim list otherwise.
|
|
|
|
*/
|
|
|
|
if (list_empty(&bg->bg_list)) {
|
|
|
|
btrfs_get_block_group(bg);
|
|
|
|
list_add_tail(&bg->bg_list, &retry_list);
|
|
|
|
}
|
|
|
|
spin_unlock(&fs_info->unused_bgs_lock);
|
2024-06-07 12:50:14 -07:00
|
|
|
}
|
btrfs: make send work with concurrent block group relocation
We don't allow send and balance/relocation to run in parallel in order
to prevent send failing or silently producing some bad stream. This is
because while send is using an extent (specially metadata) or about to
read a metadata extent and expecting it belongs to a specific parent
node, relocation can run, the transaction used for the relocation is
committed and the extent gets reallocated while send is still using the
extent, so it ends up with a different content than expected. This can
result in just failing to read a metadata extent due to failure of the
validation checks (parent transid, level, etc), failure to find a
backreference for a data extent, and other unexpected failures. Besides
reallocation, there's also a similar problem of an extent getting
discarded when it's unpinned after the transaction used for block group
relocation is committed.
The restriction between balance and send was added in commit 9e967495e0e0
("Btrfs: prevent send failures and crashes due to concurrent relocation"),
kernel 5.3, while the more general restriction between send and relocation
was added in commit 1cea5cf0e664 ("btrfs: ensure relocation never runs
while we have send operations running"), kernel 5.14.
Both send and relocation can be very long running operations. Relocation
because it has to do a lot of IO and expensive backreference lookups in
case there are many snapshots, and send due to read IO when operating on
very large trees. This makes it inconvenient for users and tools to deal
with scheduling both operations.
For zoned filesystem we also have automatic block group relocation, so
send can fail with -EAGAIN when users least expect it or send can end up
delaying the block group relocation for too long. In the future we might
also get the automatic block group relocation for non zoned filesystems.
This change makes it possible for send and relocation to run in parallel.
This is achieved the following way:
1) For all tree searches, send acquires a read lock on the commit root
semaphore;
2) After each tree search, and before releasing the commit root semaphore,
the leaf is cloned and placed in the search path (struct btrfs_path);
3) After releasing the commit root semaphore, the changed_cb() callback
is invoked, which operates on the leaf and writes commands to the pipe
(or file in case send/receive is not used with a pipe). It's important
here to not hold a lock on the commit root semaphore, because if we did
we could deadlock when sending and receiving to the same filesystem
using a pipe - the send task blocks on the pipe because it's full, the
receive task, which is the only consumer of the pipe, triggers a
transaction commit when attempting to create a subvolume or reserve
space for a write operation for example, but the transaction commit
blocks trying to write lock the commit root semaphore, resulting in a
deadlock;
4) Before moving to the next key, or advancing to the next change in case
of an incremental send, check if a transaction used for relocation was
committed (or is about to finish its commit). If so, release the search
path(s) and restart the search, to where we were before, so that we
don't operate on stale extent buffers. The search restarts are always
possible because both the send and parent roots are RO, and no one can
add, remove of update keys (change their offset) in RO trees - the
only exception is deduplication, but that is still not allowed to run
in parallel with send;
5) Periodically check if there is contention on the commit root semaphore,
which means there is a transaction commit trying to write lock it, and
release the semaphore and reschedule if there is contention, so as to
avoid causing any significant delays to transaction commits.
This leaves some room for optimizations for send to have less path
releases and re searching the trees when there's relocation running, but
for now it's kept simple as it performs quite well (on very large trees
with resulting send streams in the order of a few hundred gigabytes).
Test case btrfs/187, from fstests, stresses relocation, send and
deduplication attempting to run in parallel, but without verifying if send
succeeds and if it produces correct streams. A new test case will be added
that exercises relocation happening in parallel with send and then checks
that send succeeds and the resulting streams are correct.
A final note is that for now this still leaves the mutual exclusion
between send operations and deduplication on files belonging to a root
used by send operations. A solution for that will be slightly more complex
but it will eventually be built on top of this change.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-11-22 12:03:38 +00:00
|
|
|
btrfs_put_block_group(bg);
|
2023-06-06 14:36:33 +09:00
|
|
|
|
|
|
|
mutex_unlock(&fs_info->reclaim_bgs_lock);
|
|
|
|
/*
|
|
|
|
* Reclaiming all the block groups in the list can take really
|
|
|
|
* long. Prioritize cleaning up unused block groups.
|
|
|
|
*/
|
|
|
|
btrfs_delete_unused_bgs(fs_info);
|
|
|
|
/*
|
|
|
|
* If we are interrupted by a balance, we can just bail out. The
|
|
|
|
* cleaner thread restart again if necessary.
|
|
|
|
*/
|
|
|
|
if (!mutex_trylock(&fs_info->reclaim_bgs_lock))
|
|
|
|
goto end;
|
2021-04-19 16:41:02 +09:00
|
|
|
spin_lock(&fs_info->unused_bgs_lock);
|
|
|
|
}
|
|
|
|
spin_unlock(&fs_info->unused_bgs_lock);
|
|
|
|
mutex_unlock(&fs_info->reclaim_bgs_lock);
|
2023-06-06 14:36:33 +09:00
|
|
|
end:
|
2024-06-07 12:50:14 -07:00
|
|
|
spin_lock(&fs_info->unused_bgs_lock);
|
|
|
|
list_splice_tail(&retry_list, &fs_info->reclaim_bgs);
|
|
|
|
spin_unlock(&fs_info->unused_bgs_lock);
|
2021-04-19 16:41:02 +09:00
|
|
|
btrfs_exclop_finish(fs_info);
|
2022-02-18 13:14:19 +09:00
|
|
|
sb_end_write(fs_info->sb);
|
2021-04-19 16:41:02 +09:00
|
|
|
}
|
|
|
|
|
|
|
|
void btrfs_reclaim_bgs(struct btrfs_fs_info *fs_info)
|
|
|
|
{
|
2024-02-02 11:54:33 -08:00
|
|
|
btrfs_reclaim_sweep(fs_info);
|
2021-04-19 16:41:02 +09:00
|
|
|
spin_lock(&fs_info->unused_bgs_lock);
|
|
|
|
if (!list_empty(&fs_info->reclaim_bgs))
|
|
|
|
queue_work(system_unbound_wq, &fs_info->reclaim_bgs_work);
|
|
|
|
spin_unlock(&fs_info->unused_bgs_lock);
|
|
|
|
}
|
|
|
|
|
|
|
|
void btrfs_mark_bg_to_reclaim(struct btrfs_block_group *bg)
|
|
|
|
{
|
|
|
|
struct btrfs_fs_info *fs_info = bg->fs_info;
|
|
|
|
|
|
|
|
spin_lock(&fs_info->unused_bgs_lock);
|
|
|
|
if (list_empty(&bg->bg_list)) {
|
|
|
|
btrfs_get_block_group(bg);
|
|
|
|
trace_btrfs_add_reclaim_block_group(bg);
|
|
|
|
list_add_tail(&bg->bg_list, &fs_info->reclaim_bgs);
|
|
|
|
}
|
|
|
|
spin_unlock(&fs_info->unused_bgs_lock);
|
|
|
|
}
|
|
|
|
|
2024-06-26 23:39:11 +02:00
|
|
|
static int read_bg_from_eb(struct btrfs_fs_info *fs_info, const struct btrfs_key *key,
|
|
|
|
const struct btrfs_path *path)
|
2020-06-02 19:05:57 +09:00
|
|
|
{
|
btrfs: use a dedicated data structure for chunk maps
Currently we abuse the extent_map structure for two purposes:
1) To actually represent extents for inodes;
2) To represent chunk mappings.
This is odd and has several disadvantages:
1) To create a chunk map, we need to do two memory allocations: one for
an extent_map structure and another one for a map_lookup structure, so
more potential for an allocation failure and more complicated code to
manage and link two structures;
2) For a chunk map we actually only use 3 fields (24 bytes) of the
respective extent map structure: the 'start' field to have the logical
start address of the chunk, the 'len' field to have the chunk's size,
and the 'orig_block_len' field to contain the chunk's stripe size.
Besides wasting a memory, it's also odd and not intuitive at all to
have the stripe size in a field named 'orig_block_len'.
We are also using 'block_len' of the extent_map structure to contain
the chunk size, so we have 2 fields for the same value, 'len' and
'block_len', which is pointless;
3) When an extent map is associated to a chunk mapping, we set the bit
EXTENT_FLAG_FS_MAPPING on its flags and then make its member named
'map_lookup' point to the associated map_lookup structure. This means
that for an extent map associated to an inode extent, we are not using
this 'map_lookup' pointer, so wasting 8 bytes (on a 64 bits platform);
4) Extent maps associated to a chunk mapping are never merged or split so
it's pointless to use the existing extent map infrastructure.
So add a dedicated data structure named 'btrfs_chunk_map' to represent
chunk mappings, this is basically the existing map_lookup structure with
some extra fields:
1) 'start' to contain the chunk logical address;
2) 'chunk_len' to contain the chunk's length;
3) 'stripe_size' for the stripe size;
4) 'rb_node' for insertion into a rb tree;
5) 'refs' for reference counting.
This way we do a single memory allocation for chunk mappings and we don't
waste memory for them with unused/unnecessary fields from an extent_map.
We also save 8 bytes from the extent_map structure by removing the
'map_lookup' pointer, so the size of struct extent_map is reduced from
144 bytes down to 136 bytes, and we can now have 30 extents map per 4K
page instead of 28.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-11-21 13:38:38 +00:00
|
|
|
struct btrfs_chunk_map *map;
|
2020-06-02 19:05:57 +09:00
|
|
|
struct btrfs_block_group_item bg;
|
|
|
|
struct extent_buffer *leaf;
|
|
|
|
int slot;
|
|
|
|
u64 flags;
|
|
|
|
int ret = 0;
|
|
|
|
|
|
|
|
slot = path->slots[0];
|
|
|
|
leaf = path->nodes[0];
|
|
|
|
|
btrfs: use a dedicated data structure for chunk maps
Currently we abuse the extent_map structure for two purposes:
1) To actually represent extents for inodes;
2) To represent chunk mappings.
This is odd and has several disadvantages:
1) To create a chunk map, we need to do two memory allocations: one for
an extent_map structure and another one for a map_lookup structure, so
more potential for an allocation failure and more complicated code to
manage and link two structures;
2) For a chunk map we actually only use 3 fields (24 bytes) of the
respective extent map structure: the 'start' field to have the logical
start address of the chunk, the 'len' field to have the chunk's size,
and the 'orig_block_len' field to contain the chunk's stripe size.
Besides wasting a memory, it's also odd and not intuitive at all to
have the stripe size in a field named 'orig_block_len'.
We are also using 'block_len' of the extent_map structure to contain
the chunk size, so we have 2 fields for the same value, 'len' and
'block_len', which is pointless;
3) When an extent map is associated to a chunk mapping, we set the bit
EXTENT_FLAG_FS_MAPPING on its flags and then make its member named
'map_lookup' point to the associated map_lookup structure. This means
that for an extent map associated to an inode extent, we are not using
this 'map_lookup' pointer, so wasting 8 bytes (on a 64 bits platform);
4) Extent maps associated to a chunk mapping are never merged or split so
it's pointless to use the existing extent map infrastructure.
So add a dedicated data structure named 'btrfs_chunk_map' to represent
chunk mappings, this is basically the existing map_lookup structure with
some extra fields:
1) 'start' to contain the chunk logical address;
2) 'chunk_len' to contain the chunk's length;
3) 'stripe_size' for the stripe size;
4) 'rb_node' for insertion into a rb tree;
5) 'refs' for reference counting.
This way we do a single memory allocation for chunk mappings and we don't
waste memory for them with unused/unnecessary fields from an extent_map.
We also save 8 bytes from the extent_map structure by removing the
'map_lookup' pointer, so the size of struct extent_map is reduced from
144 bytes down to 136 bytes, and we can now have 30 extents map per 4K
page instead of 28.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-11-21 13:38:38 +00:00
|
|
|
map = btrfs_find_chunk_map(fs_info, key->objectid, key->offset);
|
|
|
|
if (!map) {
|
2020-06-02 19:05:57 +09:00
|
|
|
btrfs_err(fs_info,
|
|
|
|
"logical %llu len %llu found bg but no related chunk",
|
|
|
|
key->objectid, key->offset);
|
|
|
|
return -ENOENT;
|
|
|
|
}
|
|
|
|
|
btrfs: use a dedicated data structure for chunk maps
Currently we abuse the extent_map structure for two purposes:
1) To actually represent extents for inodes;
2) To represent chunk mappings.
This is odd and has several disadvantages:
1) To create a chunk map, we need to do two memory allocations: one for
an extent_map structure and another one for a map_lookup structure, so
more potential for an allocation failure and more complicated code to
manage and link two structures;
2) For a chunk map we actually only use 3 fields (24 bytes) of the
respective extent map structure: the 'start' field to have the logical
start address of the chunk, the 'len' field to have the chunk's size,
and the 'orig_block_len' field to contain the chunk's stripe size.
Besides wasting a memory, it's also odd and not intuitive at all to
have the stripe size in a field named 'orig_block_len'.
We are also using 'block_len' of the extent_map structure to contain
the chunk size, so we have 2 fields for the same value, 'len' and
'block_len', which is pointless;
3) When an extent map is associated to a chunk mapping, we set the bit
EXTENT_FLAG_FS_MAPPING on its flags and then make its member named
'map_lookup' point to the associated map_lookup structure. This means
that for an extent map associated to an inode extent, we are not using
this 'map_lookup' pointer, so wasting 8 bytes (on a 64 bits platform);
4) Extent maps associated to a chunk mapping are never merged or split so
it's pointless to use the existing extent map infrastructure.
So add a dedicated data structure named 'btrfs_chunk_map' to represent
chunk mappings, this is basically the existing map_lookup structure with
some extra fields:
1) 'start' to contain the chunk logical address;
2) 'chunk_len' to contain the chunk's length;
3) 'stripe_size' for the stripe size;
4) 'rb_node' for insertion into a rb tree;
5) 'refs' for reference counting.
This way we do a single memory allocation for chunk mappings and we don't
waste memory for them with unused/unnecessary fields from an extent_map.
We also save 8 bytes from the extent_map structure by removing the
'map_lookup' pointer, so the size of struct extent_map is reduced from
144 bytes down to 136 bytes, and we can now have 30 extents map per 4K
page instead of 28.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-11-21 13:38:38 +00:00
|
|
|
if (map->start != key->objectid || map->chunk_len != key->offset) {
|
2020-06-02 19:05:57 +09:00
|
|
|
btrfs_err(fs_info,
|
|
|
|
"block group %llu len %llu mismatch with chunk %llu len %llu",
|
btrfs: use a dedicated data structure for chunk maps
Currently we abuse the extent_map structure for two purposes:
1) To actually represent extents for inodes;
2) To represent chunk mappings.
This is odd and has several disadvantages:
1) To create a chunk map, we need to do two memory allocations: one for
an extent_map structure and another one for a map_lookup structure, so
more potential for an allocation failure and more complicated code to
manage and link two structures;
2) For a chunk map we actually only use 3 fields (24 bytes) of the
respective extent map structure: the 'start' field to have the logical
start address of the chunk, the 'len' field to have the chunk's size,
and the 'orig_block_len' field to contain the chunk's stripe size.
Besides wasting a memory, it's also odd and not intuitive at all to
have the stripe size in a field named 'orig_block_len'.
We are also using 'block_len' of the extent_map structure to contain
the chunk size, so we have 2 fields for the same value, 'len' and
'block_len', which is pointless;
3) When an extent map is associated to a chunk mapping, we set the bit
EXTENT_FLAG_FS_MAPPING on its flags and then make its member named
'map_lookup' point to the associated map_lookup structure. This means
that for an extent map associated to an inode extent, we are not using
this 'map_lookup' pointer, so wasting 8 bytes (on a 64 bits platform);
4) Extent maps associated to a chunk mapping are never merged or split so
it's pointless to use the existing extent map infrastructure.
So add a dedicated data structure named 'btrfs_chunk_map' to represent
chunk mappings, this is basically the existing map_lookup structure with
some extra fields:
1) 'start' to contain the chunk logical address;
2) 'chunk_len' to contain the chunk's length;
3) 'stripe_size' for the stripe size;
4) 'rb_node' for insertion into a rb tree;
5) 'refs' for reference counting.
This way we do a single memory allocation for chunk mappings and we don't
waste memory for them with unused/unnecessary fields from an extent_map.
We also save 8 bytes from the extent_map structure by removing the
'map_lookup' pointer, so the size of struct extent_map is reduced from
144 bytes down to 136 bytes, and we can now have 30 extents map per 4K
page instead of 28.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-11-21 13:38:38 +00:00
|
|
|
key->objectid, key->offset, map->start, map->chunk_len);
|
2020-06-02 19:05:57 +09:00
|
|
|
ret = -EUCLEAN;
|
btrfs: use a dedicated data structure for chunk maps
Currently we abuse the extent_map structure for two purposes:
1) To actually represent extents for inodes;
2) To represent chunk mappings.
This is odd and has several disadvantages:
1) To create a chunk map, we need to do two memory allocations: one for
an extent_map structure and another one for a map_lookup structure, so
more potential for an allocation failure and more complicated code to
manage and link two structures;
2) For a chunk map we actually only use 3 fields (24 bytes) of the
respective extent map structure: the 'start' field to have the logical
start address of the chunk, the 'len' field to have the chunk's size,
and the 'orig_block_len' field to contain the chunk's stripe size.
Besides wasting a memory, it's also odd and not intuitive at all to
have the stripe size in a field named 'orig_block_len'.
We are also using 'block_len' of the extent_map structure to contain
the chunk size, so we have 2 fields for the same value, 'len' and
'block_len', which is pointless;
3) When an extent map is associated to a chunk mapping, we set the bit
EXTENT_FLAG_FS_MAPPING on its flags and then make its member named
'map_lookup' point to the associated map_lookup structure. This means
that for an extent map associated to an inode extent, we are not using
this 'map_lookup' pointer, so wasting 8 bytes (on a 64 bits platform);
4) Extent maps associated to a chunk mapping are never merged or split so
it's pointless to use the existing extent map infrastructure.
So add a dedicated data structure named 'btrfs_chunk_map' to represent
chunk mappings, this is basically the existing map_lookup structure with
some extra fields:
1) 'start' to contain the chunk logical address;
2) 'chunk_len' to contain the chunk's length;
3) 'stripe_size' for the stripe size;
4) 'rb_node' for insertion into a rb tree;
5) 'refs' for reference counting.
This way we do a single memory allocation for chunk mappings and we don't
waste memory for them with unused/unnecessary fields from an extent_map.
We also save 8 bytes from the extent_map structure by removing the
'map_lookup' pointer, so the size of struct extent_map is reduced from
144 bytes down to 136 bytes, and we can now have 30 extents map per 4K
page instead of 28.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-11-21 13:38:38 +00:00
|
|
|
goto out_free_map;
|
2020-06-02 19:05:57 +09:00
|
|
|
}
|
|
|
|
|
|
|
|
read_extent_buffer(leaf, &bg, btrfs_item_ptr_offset(leaf, slot),
|
|
|
|
sizeof(bg));
|
|
|
|
flags = btrfs_stack_block_group_flags(&bg) &
|
|
|
|
BTRFS_BLOCK_GROUP_TYPE_MASK;
|
|
|
|
|
btrfs: use a dedicated data structure for chunk maps
Currently we abuse the extent_map structure for two purposes:
1) To actually represent extents for inodes;
2) To represent chunk mappings.
This is odd and has several disadvantages:
1) To create a chunk map, we need to do two memory allocations: one for
an extent_map structure and another one for a map_lookup structure, so
more potential for an allocation failure and more complicated code to
manage and link two structures;
2) For a chunk map we actually only use 3 fields (24 bytes) of the
respective extent map structure: the 'start' field to have the logical
start address of the chunk, the 'len' field to have the chunk's size,
and the 'orig_block_len' field to contain the chunk's stripe size.
Besides wasting a memory, it's also odd and not intuitive at all to
have the stripe size in a field named 'orig_block_len'.
We are also using 'block_len' of the extent_map structure to contain
the chunk size, so we have 2 fields for the same value, 'len' and
'block_len', which is pointless;
3) When an extent map is associated to a chunk mapping, we set the bit
EXTENT_FLAG_FS_MAPPING on its flags and then make its member named
'map_lookup' point to the associated map_lookup structure. This means
that for an extent map associated to an inode extent, we are not using
this 'map_lookup' pointer, so wasting 8 bytes (on a 64 bits platform);
4) Extent maps associated to a chunk mapping are never merged or split so
it's pointless to use the existing extent map infrastructure.
So add a dedicated data structure named 'btrfs_chunk_map' to represent
chunk mappings, this is basically the existing map_lookup structure with
some extra fields:
1) 'start' to contain the chunk logical address;
2) 'chunk_len' to contain the chunk's length;
3) 'stripe_size' for the stripe size;
4) 'rb_node' for insertion into a rb tree;
5) 'refs' for reference counting.
This way we do a single memory allocation for chunk mappings and we don't
waste memory for them with unused/unnecessary fields from an extent_map.
We also save 8 bytes from the extent_map structure by removing the
'map_lookup' pointer, so the size of struct extent_map is reduced from
144 bytes down to 136 bytes, and we can now have 30 extents map per 4K
page instead of 28.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-11-21 13:38:38 +00:00
|
|
|
if (flags != (map->type & BTRFS_BLOCK_GROUP_TYPE_MASK)) {
|
2020-06-02 19:05:57 +09:00
|
|
|
btrfs_err(fs_info,
|
|
|
|
"block group %llu len %llu type flags 0x%llx mismatch with chunk type flags 0x%llx",
|
|
|
|
key->objectid, key->offset, flags,
|
btrfs: use a dedicated data structure for chunk maps
Currently we abuse the extent_map structure for two purposes:
1) To actually represent extents for inodes;
2) To represent chunk mappings.
This is odd and has several disadvantages:
1) To create a chunk map, we need to do two memory allocations: one for
an extent_map structure and another one for a map_lookup structure, so
more potential for an allocation failure and more complicated code to
manage and link two structures;
2) For a chunk map we actually only use 3 fields (24 bytes) of the
respective extent map structure: the 'start' field to have the logical
start address of the chunk, the 'len' field to have the chunk's size,
and the 'orig_block_len' field to contain the chunk's stripe size.
Besides wasting a memory, it's also odd and not intuitive at all to
have the stripe size in a field named 'orig_block_len'.
We are also using 'block_len' of the extent_map structure to contain
the chunk size, so we have 2 fields for the same value, 'len' and
'block_len', which is pointless;
3) When an extent map is associated to a chunk mapping, we set the bit
EXTENT_FLAG_FS_MAPPING on its flags and then make its member named
'map_lookup' point to the associated map_lookup structure. This means
that for an extent map associated to an inode extent, we are not using
this 'map_lookup' pointer, so wasting 8 bytes (on a 64 bits platform);
4) Extent maps associated to a chunk mapping are never merged or split so
it's pointless to use the existing extent map infrastructure.
So add a dedicated data structure named 'btrfs_chunk_map' to represent
chunk mappings, this is basically the existing map_lookup structure with
some extra fields:
1) 'start' to contain the chunk logical address;
2) 'chunk_len' to contain the chunk's length;
3) 'stripe_size' for the stripe size;
4) 'rb_node' for insertion into a rb tree;
5) 'refs' for reference counting.
This way we do a single memory allocation for chunk mappings and we don't
waste memory for them with unused/unnecessary fields from an extent_map.
We also save 8 bytes from the extent_map structure by removing the
'map_lookup' pointer, so the size of struct extent_map is reduced from
144 bytes down to 136 bytes, and we can now have 30 extents map per 4K
page instead of 28.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-11-21 13:38:38 +00:00
|
|
|
(BTRFS_BLOCK_GROUP_TYPE_MASK & map->type));
|
2020-06-02 19:05:57 +09:00
|
|
|
ret = -EUCLEAN;
|
|
|
|
}
|
|
|
|
|
btrfs: use a dedicated data structure for chunk maps
Currently we abuse the extent_map structure for two purposes:
1) To actually represent extents for inodes;
2) To represent chunk mappings.
This is odd and has several disadvantages:
1) To create a chunk map, we need to do two memory allocations: one for
an extent_map structure and another one for a map_lookup structure, so
more potential for an allocation failure and more complicated code to
manage and link two structures;
2) For a chunk map we actually only use 3 fields (24 bytes) of the
respective extent map structure: the 'start' field to have the logical
start address of the chunk, the 'len' field to have the chunk's size,
and the 'orig_block_len' field to contain the chunk's stripe size.
Besides wasting a memory, it's also odd and not intuitive at all to
have the stripe size in a field named 'orig_block_len'.
We are also using 'block_len' of the extent_map structure to contain
the chunk size, so we have 2 fields for the same value, 'len' and
'block_len', which is pointless;
3) When an extent map is associated to a chunk mapping, we set the bit
EXTENT_FLAG_FS_MAPPING on its flags and then make its member named
'map_lookup' point to the associated map_lookup structure. This means
that for an extent map associated to an inode extent, we are not using
this 'map_lookup' pointer, so wasting 8 bytes (on a 64 bits platform);
4) Extent maps associated to a chunk mapping are never merged or split so
it's pointless to use the existing extent map infrastructure.
So add a dedicated data structure named 'btrfs_chunk_map' to represent
chunk mappings, this is basically the existing map_lookup structure with
some extra fields:
1) 'start' to contain the chunk logical address;
2) 'chunk_len' to contain the chunk's length;
3) 'stripe_size' for the stripe size;
4) 'rb_node' for insertion into a rb tree;
5) 'refs' for reference counting.
This way we do a single memory allocation for chunk mappings and we don't
waste memory for them with unused/unnecessary fields from an extent_map.
We also save 8 bytes from the extent_map structure by removing the
'map_lookup' pointer, so the size of struct extent_map is reduced from
144 bytes down to 136 bytes, and we can now have 30 extents map per 4K
page instead of 28.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-11-21 13:38:38 +00:00
|
|
|
out_free_map:
|
|
|
|
btrfs_free_chunk_map(map);
|
2020-06-02 19:05:57 +09:00
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2019-06-20 15:37:57 -04:00
|
|
|
static int find_first_block_group(struct btrfs_fs_info *fs_info,
|
|
|
|
struct btrfs_path *path,
|
2024-06-26 23:39:11 +02:00
|
|
|
const struct btrfs_key *key)
|
2019-06-20 15:37:57 -04:00
|
|
|
{
|
2021-11-05 16:45:36 -04:00
|
|
|
struct btrfs_root *root = btrfs_block_group_root(fs_info);
|
2020-06-02 19:05:57 +09:00
|
|
|
int ret;
|
2019-06-20 15:37:57 -04:00
|
|
|
struct btrfs_key found_key;
|
|
|
|
|
2022-03-09 14:50:39 +01:00
|
|
|
btrfs_for_each_slot(root, key, &found_key, path, ret) {
|
2019-06-20 15:37:57 -04:00
|
|
|
if (found_key.objectid >= key->objectid &&
|
|
|
|
found_key.type == BTRFS_BLOCK_GROUP_ITEM_KEY) {
|
2022-03-09 14:50:39 +01:00
|
|
|
return read_bg_from_eb(fs_info, &found_key, path);
|
2019-06-20 15:37:57 -04:00
|
|
|
}
|
|
|
|
}
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
|
|
|
static void set_avail_alloc_bits(struct btrfs_fs_info *fs_info, u64 flags)
|
|
|
|
{
|
|
|
|
u64 extra_flags = chunk_to_extended(flags) &
|
|
|
|
BTRFS_EXTENDED_PROFILE_MASK;
|
|
|
|
|
|
|
|
write_seqlock(&fs_info->profiles_lock);
|
|
|
|
if (flags & BTRFS_BLOCK_GROUP_DATA)
|
|
|
|
fs_info->avail_data_alloc_bits |= extra_flags;
|
|
|
|
if (flags & BTRFS_BLOCK_GROUP_METADATA)
|
|
|
|
fs_info->avail_metadata_alloc_bits |= extra_flags;
|
|
|
|
if (flags & BTRFS_BLOCK_GROUP_SYSTEM)
|
|
|
|
fs_info->avail_system_alloc_bits |= extra_flags;
|
|
|
|
write_sequnlock(&fs_info->profiles_lock);
|
|
|
|
}
|
|
|
|
|
2022-10-27 14:21:42 +02:00
|
|
|
/*
|
|
|
|
* Map a physical disk address to a list of logical addresses.
|
2021-01-22 11:57:58 +02:00
|
|
|
*
|
|
|
|
* @fs_info: the filesystem
|
2019-12-10 19:57:51 +02:00
|
|
|
* @chunk_start: logical address of block group
|
|
|
|
* @physical: physical address to map to logical addresses
|
|
|
|
* @logical: return array of logical addresses which map to @physical
|
|
|
|
* @naddrs: length of @logical
|
|
|
|
* @stripe_len: size of IO stripe for the given block group
|
|
|
|
*
|
|
|
|
* Maps a particular @physical disk address to a list of @logical addresses.
|
|
|
|
* Used primarily to exclude those portions of a block group that contain super
|
|
|
|
* block copies.
|
|
|
|
*/
|
|
|
|
int btrfs_rmap_block(struct btrfs_fs_info *fs_info, u64 chunk_start,
|
2022-12-12 08:37:24 +01:00
|
|
|
u64 physical, u64 **logical, int *naddrs, int *stripe_len)
|
2019-12-10 19:57:51 +02:00
|
|
|
{
|
btrfs: use a dedicated data structure for chunk maps
Currently we abuse the extent_map structure for two purposes:
1) To actually represent extents for inodes;
2) To represent chunk mappings.
This is odd and has several disadvantages:
1) To create a chunk map, we need to do two memory allocations: one for
an extent_map structure and another one for a map_lookup structure, so
more potential for an allocation failure and more complicated code to
manage and link two structures;
2) For a chunk map we actually only use 3 fields (24 bytes) of the
respective extent map structure: the 'start' field to have the logical
start address of the chunk, the 'len' field to have the chunk's size,
and the 'orig_block_len' field to contain the chunk's stripe size.
Besides wasting a memory, it's also odd and not intuitive at all to
have the stripe size in a field named 'orig_block_len'.
We are also using 'block_len' of the extent_map structure to contain
the chunk size, so we have 2 fields for the same value, 'len' and
'block_len', which is pointless;
3) When an extent map is associated to a chunk mapping, we set the bit
EXTENT_FLAG_FS_MAPPING on its flags and then make its member named
'map_lookup' point to the associated map_lookup structure. This means
that for an extent map associated to an inode extent, we are not using
this 'map_lookup' pointer, so wasting 8 bytes (on a 64 bits platform);
4) Extent maps associated to a chunk mapping are never merged or split so
it's pointless to use the existing extent map infrastructure.
So add a dedicated data structure named 'btrfs_chunk_map' to represent
chunk mappings, this is basically the existing map_lookup structure with
some extra fields:
1) 'start' to contain the chunk logical address;
2) 'chunk_len' to contain the chunk's length;
3) 'stripe_size' for the stripe size;
4) 'rb_node' for insertion into a rb tree;
5) 'refs' for reference counting.
This way we do a single memory allocation for chunk mappings and we don't
waste memory for them with unused/unnecessary fields from an extent_map.
We also save 8 bytes from the extent_map structure by removing the
'map_lookup' pointer, so the size of struct extent_map is reduced from
144 bytes down to 136 bytes, and we can now have 30 extents map per 4K
page instead of 28.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-11-21 13:38:38 +00:00
|
|
|
struct btrfs_chunk_map *map;
|
2019-12-10 19:57:51 +02:00
|
|
|
u64 *buf;
|
|
|
|
u64 bytenr;
|
2019-11-19 14:05:53 +02:00
|
|
|
u64 data_stripe_length;
|
|
|
|
u64 io_stripe_size;
|
|
|
|
int i, nr = 0;
|
|
|
|
int ret = 0;
|
2019-12-10 19:57:51 +02:00
|
|
|
|
btrfs: use a dedicated data structure for chunk maps
Currently we abuse the extent_map structure for two purposes:
1) To actually represent extents for inodes;
2) To represent chunk mappings.
This is odd and has several disadvantages:
1) To create a chunk map, we need to do two memory allocations: one for
an extent_map structure and another one for a map_lookup structure, so
more potential for an allocation failure and more complicated code to
manage and link two structures;
2) For a chunk map we actually only use 3 fields (24 bytes) of the
respective extent map structure: the 'start' field to have the logical
start address of the chunk, the 'len' field to have the chunk's size,
and the 'orig_block_len' field to contain the chunk's stripe size.
Besides wasting a memory, it's also odd and not intuitive at all to
have the stripe size in a field named 'orig_block_len'.
We are also using 'block_len' of the extent_map structure to contain
the chunk size, so we have 2 fields for the same value, 'len' and
'block_len', which is pointless;
3) When an extent map is associated to a chunk mapping, we set the bit
EXTENT_FLAG_FS_MAPPING on its flags and then make its member named
'map_lookup' point to the associated map_lookup structure. This means
that for an extent map associated to an inode extent, we are not using
this 'map_lookup' pointer, so wasting 8 bytes (on a 64 bits platform);
4) Extent maps associated to a chunk mapping are never merged or split so
it's pointless to use the existing extent map infrastructure.
So add a dedicated data structure named 'btrfs_chunk_map' to represent
chunk mappings, this is basically the existing map_lookup structure with
some extra fields:
1) 'start' to contain the chunk logical address;
2) 'chunk_len' to contain the chunk's length;
3) 'stripe_size' for the stripe size;
4) 'rb_node' for insertion into a rb tree;
5) 'refs' for reference counting.
This way we do a single memory allocation for chunk mappings and we don't
waste memory for them with unused/unnecessary fields from an extent_map.
We also save 8 bytes from the extent_map structure by removing the
'map_lookup' pointer, so the size of struct extent_map is reduced from
144 bytes down to 136 bytes, and we can now have 30 extents map per 4K
page instead of 28.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-11-21 13:38:38 +00:00
|
|
|
map = btrfs_get_chunk_map(fs_info, chunk_start, 1);
|
|
|
|
if (IS_ERR(map))
|
2019-12-10 19:57:51 +02:00
|
|
|
return -EIO;
|
|
|
|
|
btrfs: use a dedicated data structure for chunk maps
Currently we abuse the extent_map structure for two purposes:
1) To actually represent extents for inodes;
2) To represent chunk mappings.
This is odd and has several disadvantages:
1) To create a chunk map, we need to do two memory allocations: one for
an extent_map structure and another one for a map_lookup structure, so
more potential for an allocation failure and more complicated code to
manage and link two structures;
2) For a chunk map we actually only use 3 fields (24 bytes) of the
respective extent map structure: the 'start' field to have the logical
start address of the chunk, the 'len' field to have the chunk's size,
and the 'orig_block_len' field to contain the chunk's stripe size.
Besides wasting a memory, it's also odd and not intuitive at all to
have the stripe size in a field named 'orig_block_len'.
We are also using 'block_len' of the extent_map structure to contain
the chunk size, so we have 2 fields for the same value, 'len' and
'block_len', which is pointless;
3) When an extent map is associated to a chunk mapping, we set the bit
EXTENT_FLAG_FS_MAPPING on its flags and then make its member named
'map_lookup' point to the associated map_lookup structure. This means
that for an extent map associated to an inode extent, we are not using
this 'map_lookup' pointer, so wasting 8 bytes (on a 64 bits platform);
4) Extent maps associated to a chunk mapping are never merged or split so
it's pointless to use the existing extent map infrastructure.
So add a dedicated data structure named 'btrfs_chunk_map' to represent
chunk mappings, this is basically the existing map_lookup structure with
some extra fields:
1) 'start' to contain the chunk logical address;
2) 'chunk_len' to contain the chunk's length;
3) 'stripe_size' for the stripe size;
4) 'rb_node' for insertion into a rb tree;
5) 'refs' for reference counting.
This way we do a single memory allocation for chunk mappings and we don't
waste memory for them with unused/unnecessary fields from an extent_map.
We also save 8 bytes from the extent_map structure by removing the
'map_lookup' pointer, so the size of struct extent_map is reduced from
144 bytes down to 136 bytes, and we can now have 30 extents map per 4K
page instead of 28.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-11-21 13:38:38 +00:00
|
|
|
data_stripe_length = map->stripe_size;
|
2023-02-17 13:36:58 +08:00
|
|
|
io_stripe_size = BTRFS_STRIPE_LEN;
|
btrfs: use a dedicated data structure for chunk maps
Currently we abuse the extent_map structure for two purposes:
1) To actually represent extents for inodes;
2) To represent chunk mappings.
This is odd and has several disadvantages:
1) To create a chunk map, we need to do two memory allocations: one for
an extent_map structure and another one for a map_lookup structure, so
more potential for an allocation failure and more complicated code to
manage and link two structures;
2) For a chunk map we actually only use 3 fields (24 bytes) of the
respective extent map structure: the 'start' field to have the logical
start address of the chunk, the 'len' field to have the chunk's size,
and the 'orig_block_len' field to contain the chunk's stripe size.
Besides wasting a memory, it's also odd and not intuitive at all to
have the stripe size in a field named 'orig_block_len'.
We are also using 'block_len' of the extent_map structure to contain
the chunk size, so we have 2 fields for the same value, 'len' and
'block_len', which is pointless;
3) When an extent map is associated to a chunk mapping, we set the bit
EXTENT_FLAG_FS_MAPPING on its flags and then make its member named
'map_lookup' point to the associated map_lookup structure. This means
that for an extent map associated to an inode extent, we are not using
this 'map_lookup' pointer, so wasting 8 bytes (on a 64 bits platform);
4) Extent maps associated to a chunk mapping are never merged or split so
it's pointless to use the existing extent map infrastructure.
So add a dedicated data structure named 'btrfs_chunk_map' to represent
chunk mappings, this is basically the existing map_lookup structure with
some extra fields:
1) 'start' to contain the chunk logical address;
2) 'chunk_len' to contain the chunk's length;
3) 'stripe_size' for the stripe size;
4) 'rb_node' for insertion into a rb tree;
5) 'refs' for reference counting.
This way we do a single memory allocation for chunk mappings and we don't
waste memory for them with unused/unnecessary fields from an extent_map.
We also save 8 bytes from the extent_map structure by removing the
'map_lookup' pointer, so the size of struct extent_map is reduced from
144 bytes down to 136 bytes, and we can now have 30 extents map per 4K
page instead of 28.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-11-21 13:38:38 +00:00
|
|
|
chunk_start = map->start;
|
2019-12-10 19:57:51 +02:00
|
|
|
|
2020-04-03 16:40:34 +03:00
|
|
|
/* For RAID5/6 adjust to a full IO stripe length */
|
|
|
|
if (map->type & BTRFS_BLOCK_GROUP_RAID56_MASK)
|
2023-06-22 14:42:40 +08:00
|
|
|
io_stripe_size = btrfs_stripe_nr_to_offset(nr_data_stripes(map));
|
2019-12-10 19:57:51 +02:00
|
|
|
|
|
|
|
buf = kcalloc(map->num_stripes, sizeof(u64), GFP_NOFS);
|
2019-11-19 14:05:53 +02:00
|
|
|
if (!buf) {
|
|
|
|
ret = -ENOMEM;
|
|
|
|
goto out;
|
|
|
|
}
|
2019-12-10 19:57:51 +02:00
|
|
|
|
|
|
|
for (i = 0; i < map->num_stripes; i++) {
|
2019-11-19 14:05:53 +02:00
|
|
|
bool already_inserted = false;
|
2023-02-17 13:36:59 +08:00
|
|
|
u32 stripe_nr;
|
|
|
|
u32 offset;
|
2019-11-19 14:05:53 +02:00
|
|
|
int j;
|
|
|
|
|
|
|
|
if (!in_range(physical, map->stripes[i].physical,
|
|
|
|
data_stripe_length))
|
2019-12-10 19:57:51 +02:00
|
|
|
continue;
|
|
|
|
|
2023-02-17 13:36:58 +08:00
|
|
|
stripe_nr = (physical - map->stripes[i].physical) >>
|
|
|
|
BTRFS_STRIPE_LEN_SHIFT;
|
|
|
|
offset = (physical - map->stripes[i].physical) &
|
|
|
|
BTRFS_STRIPE_LEN_MASK;
|
2019-12-10 19:57:51 +02:00
|
|
|
|
2022-06-23 16:57:02 +02:00
|
|
|
if (map->type & (BTRFS_BLOCK_GROUP_RAID0 |
|
2023-02-17 13:36:59 +08:00
|
|
|
BTRFS_BLOCK_GROUP_RAID10))
|
|
|
|
stripe_nr = div_u64(stripe_nr * map->num_stripes + i,
|
|
|
|
map->sub_stripes);
|
2019-12-10 19:57:51 +02:00
|
|
|
/*
|
|
|
|
* The remaining case would be for RAID56, multiply by
|
|
|
|
* nr_data_stripes(). Alternatively, just use rmap_len below
|
|
|
|
* instead of map->stripe_len
|
|
|
|
*/
|
2021-02-04 19:22:02 +09:00
|
|
|
bytenr = chunk_start + stripe_nr * io_stripe_size + offset;
|
2019-11-19 14:05:53 +02:00
|
|
|
|
|
|
|
/* Ensure we don't add duplicate addresses */
|
2019-12-10 19:57:51 +02:00
|
|
|
for (j = 0; j < nr; j++) {
|
2019-11-19 14:05:53 +02:00
|
|
|
if (buf[j] == bytenr) {
|
|
|
|
already_inserted = true;
|
2019-12-10 19:57:51 +02:00
|
|
|
break;
|
2019-11-19 14:05:53 +02:00
|
|
|
}
|
2019-12-10 19:57:51 +02:00
|
|
|
}
|
2019-11-19 14:05:53 +02:00
|
|
|
|
|
|
|
if (!already_inserted)
|
2019-12-10 19:57:51 +02:00
|
|
|
buf[nr++] = bytenr;
|
|
|
|
}
|
|
|
|
|
|
|
|
*logical = buf;
|
|
|
|
*naddrs = nr;
|
2019-11-19 14:05:53 +02:00
|
|
|
*stripe_len = io_stripe_size;
|
|
|
|
out:
|
btrfs: use a dedicated data structure for chunk maps
Currently we abuse the extent_map structure for two purposes:
1) To actually represent extents for inodes;
2) To represent chunk mappings.
This is odd and has several disadvantages:
1) To create a chunk map, we need to do two memory allocations: one for
an extent_map structure and another one for a map_lookup structure, so
more potential for an allocation failure and more complicated code to
manage and link two structures;
2) For a chunk map we actually only use 3 fields (24 bytes) of the
respective extent map structure: the 'start' field to have the logical
start address of the chunk, the 'len' field to have the chunk's size,
and the 'orig_block_len' field to contain the chunk's stripe size.
Besides wasting a memory, it's also odd and not intuitive at all to
have the stripe size in a field named 'orig_block_len'.
We are also using 'block_len' of the extent_map structure to contain
the chunk size, so we have 2 fields for the same value, 'len' and
'block_len', which is pointless;
3) When an extent map is associated to a chunk mapping, we set the bit
EXTENT_FLAG_FS_MAPPING on its flags and then make its member named
'map_lookup' point to the associated map_lookup structure. This means
that for an extent map associated to an inode extent, we are not using
this 'map_lookup' pointer, so wasting 8 bytes (on a 64 bits platform);
4) Extent maps associated to a chunk mapping are never merged or split so
it's pointless to use the existing extent map infrastructure.
So add a dedicated data structure named 'btrfs_chunk_map' to represent
chunk mappings, this is basically the existing map_lookup structure with
some extra fields:
1) 'start' to contain the chunk logical address;
2) 'chunk_len' to contain the chunk's length;
3) 'stripe_size' for the stripe size;
4) 'rb_node' for insertion into a rb tree;
5) 'refs' for reference counting.
This way we do a single memory allocation for chunk mappings and we don't
waste memory for them with unused/unnecessary fields from an extent_map.
We also save 8 bytes from the extent_map structure by removing the
'map_lookup' pointer, so the size of struct extent_map is reduced from
144 bytes down to 136 bytes, and we can now have 30 extents map per 4K
page instead of 28.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-11-21 13:38:38 +00:00
|
|
|
btrfs_free_chunk_map(map);
|
2019-11-19 14:05:53 +02:00
|
|
|
return ret;
|
2019-12-10 19:57:51 +02:00
|
|
|
}
|
|
|
|
|
2019-10-29 19:20:18 +01:00
|
|
|
static int exclude_super_stripes(struct btrfs_block_group *cache)
|
2019-06-20 15:37:57 -04:00
|
|
|
{
|
|
|
|
struct btrfs_fs_info *fs_info = cache->fs_info;
|
btrfs: implement log-structured superblock for ZONED mode
Superblock (and its copies) is the only data structure in btrfs which
has a fixed location on a device. Since we cannot overwrite in a
sequential write required zone, we cannot place superblock in the zone.
One easy solution is limiting superblock and copies to be placed only in
conventional zones. However, this method has two downsides: one is
reduced number of superblock copies. The location of the second copy of
superblock is 256GB, which is in a sequential write required zone on
typical devices in the market today. So, the number of superblock and
copies is limited to be two. Second downside is that we cannot support
devices which have no conventional zones at all.
To solve these two problems, we employ superblock log writing. It uses
two adjacent zones as a circular buffer to write updated superblocks.
Once the first zone is filled up, start writing into the second one.
Then, when both zones are filled up and before starting to write to the
first zone again, it reset the first zone.
We can determine the position of the latest superblock by reading write
pointer information from a device. One corner case is when both zones
are full. For this situation, we read out the last superblock of each
zone, and compare them to determine which zone is older.
The following zones are reserved as the circular buffer on ZONED btrfs.
- The primary superblock: zones 0 and 1
- The first copy: zones 16 and 17
- The second copy: zones 1024 or zone at 256GB which is minimum, and
next to it
If these reserved zones are conventional, superblock is written fixed at
the start of the zone without logging.
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-11-10 20:26:14 +09:00
|
|
|
const bool zoned = btrfs_is_zoned(fs_info);
|
2019-06-20 15:37:57 -04:00
|
|
|
u64 bytenr;
|
|
|
|
u64 *logical;
|
|
|
|
int stripe_len;
|
|
|
|
int i, nr, ret;
|
|
|
|
|
2019-10-23 18:48:22 +02:00
|
|
|
if (cache->start < BTRFS_SUPER_INFO_OFFSET) {
|
|
|
|
stripe_len = BTRFS_SUPER_INFO_OFFSET - cache->start;
|
2019-06-20 15:37:57 -04:00
|
|
|
cache->bytes_super += stripe_len;
|
2023-06-30 16:03:50 +01:00
|
|
|
ret = set_extent_bit(&fs_info->excluded_extents, cache->start,
|
|
|
|
cache->start + stripe_len - 1,
|
|
|
|
EXTENT_UPTODATE, NULL);
|
2019-06-20 15:37:57 -04:00
|
|
|
if (ret)
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
|
|
|
for (i = 0; i < BTRFS_SUPER_MIRROR_MAX; i++) {
|
|
|
|
bytenr = btrfs_sb_offset(i);
|
2022-12-12 08:37:24 +01:00
|
|
|
ret = btrfs_rmap_block(fs_info, cache->start,
|
2019-06-20 15:37:57 -04:00
|
|
|
bytenr, &logical, &nr, &stripe_len);
|
|
|
|
if (ret)
|
|
|
|
return ret;
|
|
|
|
|
btrfs: implement log-structured superblock for ZONED mode
Superblock (and its copies) is the only data structure in btrfs which
has a fixed location on a device. Since we cannot overwrite in a
sequential write required zone, we cannot place superblock in the zone.
One easy solution is limiting superblock and copies to be placed only in
conventional zones. However, this method has two downsides: one is
reduced number of superblock copies. The location of the second copy of
superblock is 256GB, which is in a sequential write required zone on
typical devices in the market today. So, the number of superblock and
copies is limited to be two. Second downside is that we cannot support
devices which have no conventional zones at all.
To solve these two problems, we employ superblock log writing. It uses
two adjacent zones as a circular buffer to write updated superblocks.
Once the first zone is filled up, start writing into the second one.
Then, when both zones are filled up and before starting to write to the
first zone again, it reset the first zone.
We can determine the position of the latest superblock by reading write
pointer information from a device. One corner case is when both zones
are full. For this situation, we read out the last superblock of each
zone, and compare them to determine which zone is older.
The following zones are reserved as the circular buffer on ZONED btrfs.
- The primary superblock: zones 0 and 1
- The first copy: zones 16 and 17
- The second copy: zones 1024 or zone at 256GB which is minimum, and
next to it
If these reserved zones are conventional, superblock is written fixed at
the start of the zone without logging.
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-11-10 20:26:14 +09:00
|
|
|
/* Shouldn't have super stripes in sequential zones */
|
|
|
|
if (zoned && nr) {
|
2023-07-03 12:03:21 +01:00
|
|
|
kfree(logical);
|
btrfs: implement log-structured superblock for ZONED mode
Superblock (and its copies) is the only data structure in btrfs which
has a fixed location on a device. Since we cannot overwrite in a
sequential write required zone, we cannot place superblock in the zone.
One easy solution is limiting superblock and copies to be placed only in
conventional zones. However, this method has two downsides: one is
reduced number of superblock copies. The location of the second copy of
superblock is 256GB, which is in a sequential write required zone on
typical devices in the market today. So, the number of superblock and
copies is limited to be two. Second downside is that we cannot support
devices which have no conventional zones at all.
To solve these two problems, we employ superblock log writing. It uses
two adjacent zones as a circular buffer to write updated superblocks.
Once the first zone is filled up, start writing into the second one.
Then, when both zones are filled up and before starting to write to the
first zone again, it reset the first zone.
We can determine the position of the latest superblock by reading write
pointer information from a device. One corner case is when both zones
are full. For this situation, we read out the last superblock of each
zone, and compare them to determine which zone is older.
The following zones are reserved as the circular buffer on ZONED btrfs.
- The primary superblock: zones 0 and 1
- The first copy: zones 16 and 17
- The second copy: zones 1024 or zone at 256GB which is minimum, and
next to it
If these reserved zones are conventional, superblock is written fixed at
the start of the zone without logging.
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-11-10 20:26:14 +09:00
|
|
|
btrfs_err(fs_info,
|
|
|
|
"zoned: block group %llu must not contain super block",
|
|
|
|
cache->start);
|
|
|
|
return -EUCLEAN;
|
|
|
|
}
|
|
|
|
|
2019-06-20 15:37:57 -04:00
|
|
|
while (nr--) {
|
2020-04-03 16:40:35 +03:00
|
|
|
u64 len = min_t(u64, stripe_len,
|
|
|
|
cache->start + cache->length - logical[nr]);
|
2019-06-20 15:37:57 -04:00
|
|
|
|
|
|
|
cache->bytes_super += len;
|
2023-06-30 16:03:50 +01:00
|
|
|
ret = set_extent_bit(&fs_info->excluded_extents, logical[nr],
|
|
|
|
logical[nr] + len - 1,
|
|
|
|
EXTENT_UPTODATE, NULL);
|
2019-06-20 15:37:57 -04:00
|
|
|
if (ret) {
|
|
|
|
kfree(logical);
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
kfree(logical);
|
|
|
|
}
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2019-10-29 19:20:18 +01:00
|
|
|
static struct btrfs_block_group *btrfs_create_block_group_cache(
|
2020-05-05 07:58:20 +08:00
|
|
|
struct btrfs_fs_info *fs_info, u64 start)
|
2019-06-20 15:37:57 -04:00
|
|
|
{
|
2019-10-29 19:20:18 +01:00
|
|
|
struct btrfs_block_group *cache;
|
2019-06-20 15:37:57 -04:00
|
|
|
|
|
|
|
cache = kzalloc(sizeof(*cache), GFP_NOFS);
|
|
|
|
if (!cache)
|
|
|
|
return NULL;
|
|
|
|
|
|
|
|
cache->free_space_ctl = kzalloc(sizeof(*cache->free_space_ctl),
|
|
|
|
GFP_NOFS);
|
|
|
|
if (!cache->free_space_ctl) {
|
|
|
|
kfree(cache);
|
|
|
|
return NULL;
|
|
|
|
}
|
|
|
|
|
2019-10-23 18:48:22 +02:00
|
|
|
cache->start = start;
|
2019-06-20 15:37:57 -04:00
|
|
|
|
|
|
|
cache->fs_info = fs_info;
|
|
|
|
cache->full_stripe_len = btrfs_full_stripe_len(fs_info, start);
|
|
|
|
|
btrfs: handle empty block_group removal for async discard
block_group removal is a little tricky. It can race with the extent
allocator, the cleaner thread, and balancing. The current path is for a
block_group to be added to the unused_bgs list. Then, when the cleaner
thread comes around, it starts a transaction and then proceeds with
removing the block_group. Extents that are pinned are subsequently
removed from the pinned trees and then eventually a discard is issued
for the entire block_group.
Async discard introduces another player into the game, the discard
workqueue. While it has none of the racing issues, the new problem is
ensuring we don't leave free space untrimmed prior to forgetting the
block_group. This is handled by placing fully free block_groups on a
separate discard queue. This is necessary to maintain discarding order
as in the future we will slowly trim even fully free block_groups. The
ordering helps us make progress on the same block_group rather than say
the last fully freed block_group or needing to search through the fully
freed block groups at the beginning of a list and insert after.
The new order of events is a fully freed block group gets placed on the
unused discard queue first. Once it's processed, it will be placed on
the unusued_bgs list and then the original sequence of events will
happen, just without the final whole block_group discard.
The mount flags can change when processing unused_bgs, so when flipping
from DISCARD to DISCARD_ASYNC, the unused_bgs must be punted to the
discard_list to be trimmed. If we flip off DISCARD_ASYNC, we punt
free block groups on the discard_list to the unused_bg queue which will
do the final discard for us.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-12-13 16:22:15 -08:00
|
|
|
cache->discard_index = BTRFS_DISCARD_INDEX_UNUSED;
|
|
|
|
|
2020-07-06 09:14:11 -04:00
|
|
|
refcount_set(&cache->refs, 1);
|
2019-06-20 15:37:57 -04:00
|
|
|
spin_lock_init(&cache->lock);
|
|
|
|
init_rwsem(&cache->data_rwsem);
|
|
|
|
INIT_LIST_HEAD(&cache->list);
|
|
|
|
INIT_LIST_HEAD(&cache->cluster_list);
|
|
|
|
INIT_LIST_HEAD(&cache->bg_list);
|
|
|
|
INIT_LIST_HEAD(&cache->ro_list);
|
2019-12-13 16:22:14 -08:00
|
|
|
INIT_LIST_HEAD(&cache->discard_list);
|
2019-06-20 15:37:57 -04:00
|
|
|
INIT_LIST_HEAD(&cache->dirty_list);
|
|
|
|
INIT_LIST_HEAD(&cache->io_list);
|
2021-08-19 21:19:17 +09:00
|
|
|
INIT_LIST_HEAD(&cache->active_bg_list);
|
2020-10-23 09:58:08 -04:00
|
|
|
btrfs_init_free_space_ctl(cache, cache->free_space_ctl);
|
2020-05-08 11:01:47 +01:00
|
|
|
atomic_set(&cache->frozen, 0);
|
2019-06-20 15:37:57 -04:00
|
|
|
mutex_init(&cache->free_space_lock);
|
|
|
|
|
|
|
|
return cache;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Iterate all chunks and verify that each of them has the corresponding block
|
|
|
|
* group
|
|
|
|
*/
|
|
|
|
static int check_chunk_block_group_mappings(struct btrfs_fs_info *fs_info)
|
|
|
|
{
|
|
|
|
u64 start = 0;
|
|
|
|
int ret = 0;
|
|
|
|
|
|
|
|
while (1) {
|
btrfs: use a dedicated data structure for chunk maps
Currently we abuse the extent_map structure for two purposes:
1) To actually represent extents for inodes;
2) To represent chunk mappings.
This is odd and has several disadvantages:
1) To create a chunk map, we need to do two memory allocations: one for
an extent_map structure and another one for a map_lookup structure, so
more potential for an allocation failure and more complicated code to
manage and link two structures;
2) For a chunk map we actually only use 3 fields (24 bytes) of the
respective extent map structure: the 'start' field to have the logical
start address of the chunk, the 'len' field to have the chunk's size,
and the 'orig_block_len' field to contain the chunk's stripe size.
Besides wasting a memory, it's also odd and not intuitive at all to
have the stripe size in a field named 'orig_block_len'.
We are also using 'block_len' of the extent_map structure to contain
the chunk size, so we have 2 fields for the same value, 'len' and
'block_len', which is pointless;
3) When an extent map is associated to a chunk mapping, we set the bit
EXTENT_FLAG_FS_MAPPING on its flags and then make its member named
'map_lookup' point to the associated map_lookup structure. This means
that for an extent map associated to an inode extent, we are not using
this 'map_lookup' pointer, so wasting 8 bytes (on a 64 bits platform);
4) Extent maps associated to a chunk mapping are never merged or split so
it's pointless to use the existing extent map infrastructure.
So add a dedicated data structure named 'btrfs_chunk_map' to represent
chunk mappings, this is basically the existing map_lookup structure with
some extra fields:
1) 'start' to contain the chunk logical address;
2) 'chunk_len' to contain the chunk's length;
3) 'stripe_size' for the stripe size;
4) 'rb_node' for insertion into a rb tree;
5) 'refs' for reference counting.
This way we do a single memory allocation for chunk mappings and we don't
waste memory for them with unused/unnecessary fields from an extent_map.
We also save 8 bytes from the extent_map structure by removing the
'map_lookup' pointer, so the size of struct extent_map is reduced from
144 bytes down to 136 bytes, and we can now have 30 extents map per 4K
page instead of 28.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-11-21 13:38:38 +00:00
|
|
|
struct btrfs_chunk_map *map;
|
|
|
|
struct btrfs_block_group *bg;
|
|
|
|
|
2019-06-20 15:37:57 -04:00
|
|
|
/*
|
btrfs: use a dedicated data structure for chunk maps
Currently we abuse the extent_map structure for two purposes:
1) To actually represent extents for inodes;
2) To represent chunk mappings.
This is odd and has several disadvantages:
1) To create a chunk map, we need to do two memory allocations: one for
an extent_map structure and another one for a map_lookup structure, so
more potential for an allocation failure and more complicated code to
manage and link two structures;
2) For a chunk map we actually only use 3 fields (24 bytes) of the
respective extent map structure: the 'start' field to have the logical
start address of the chunk, the 'len' field to have the chunk's size,
and the 'orig_block_len' field to contain the chunk's stripe size.
Besides wasting a memory, it's also odd and not intuitive at all to
have the stripe size in a field named 'orig_block_len'.
We are also using 'block_len' of the extent_map structure to contain
the chunk size, so we have 2 fields for the same value, 'len' and
'block_len', which is pointless;
3) When an extent map is associated to a chunk mapping, we set the bit
EXTENT_FLAG_FS_MAPPING on its flags and then make its member named
'map_lookup' point to the associated map_lookup structure. This means
that for an extent map associated to an inode extent, we are not using
this 'map_lookup' pointer, so wasting 8 bytes (on a 64 bits platform);
4) Extent maps associated to a chunk mapping are never merged or split so
it's pointless to use the existing extent map infrastructure.
So add a dedicated data structure named 'btrfs_chunk_map' to represent
chunk mappings, this is basically the existing map_lookup structure with
some extra fields:
1) 'start' to contain the chunk logical address;
2) 'chunk_len' to contain the chunk's length;
3) 'stripe_size' for the stripe size;
4) 'rb_node' for insertion into a rb tree;
5) 'refs' for reference counting.
This way we do a single memory allocation for chunk mappings and we don't
waste memory for them with unused/unnecessary fields from an extent_map.
We also save 8 bytes from the extent_map structure by removing the
'map_lookup' pointer, so the size of struct extent_map is reduced from
144 bytes down to 136 bytes, and we can now have 30 extents map per 4K
page instead of 28.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-11-21 13:38:38 +00:00
|
|
|
* btrfs_find_chunk_map() will return the first chunk map
|
|
|
|
* intersecting the range, so setting @length to 1 is enough to
|
2019-06-20 15:37:57 -04:00
|
|
|
* get the first chunk.
|
|
|
|
*/
|
btrfs: use a dedicated data structure for chunk maps
Currently we abuse the extent_map structure for two purposes:
1) To actually represent extents for inodes;
2) To represent chunk mappings.
This is odd and has several disadvantages:
1) To create a chunk map, we need to do two memory allocations: one for
an extent_map structure and another one for a map_lookup structure, so
more potential for an allocation failure and more complicated code to
manage and link two structures;
2) For a chunk map we actually only use 3 fields (24 bytes) of the
respective extent map structure: the 'start' field to have the logical
start address of the chunk, the 'len' field to have the chunk's size,
and the 'orig_block_len' field to contain the chunk's stripe size.
Besides wasting a memory, it's also odd and not intuitive at all to
have the stripe size in a field named 'orig_block_len'.
We are also using 'block_len' of the extent_map structure to contain
the chunk size, so we have 2 fields for the same value, 'len' and
'block_len', which is pointless;
3) When an extent map is associated to a chunk mapping, we set the bit
EXTENT_FLAG_FS_MAPPING on its flags and then make its member named
'map_lookup' point to the associated map_lookup structure. This means
that for an extent map associated to an inode extent, we are not using
this 'map_lookup' pointer, so wasting 8 bytes (on a 64 bits platform);
4) Extent maps associated to a chunk mapping are never merged or split so
it's pointless to use the existing extent map infrastructure.
So add a dedicated data structure named 'btrfs_chunk_map' to represent
chunk mappings, this is basically the existing map_lookup structure with
some extra fields:
1) 'start' to contain the chunk logical address;
2) 'chunk_len' to contain the chunk's length;
3) 'stripe_size' for the stripe size;
4) 'rb_node' for insertion into a rb tree;
5) 'refs' for reference counting.
This way we do a single memory allocation for chunk mappings and we don't
waste memory for them with unused/unnecessary fields from an extent_map.
We also save 8 bytes from the extent_map structure by removing the
'map_lookup' pointer, so the size of struct extent_map is reduced from
144 bytes down to 136 bytes, and we can now have 30 extents map per 4K
page instead of 28.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-11-21 13:38:38 +00:00
|
|
|
map = btrfs_find_chunk_map(fs_info, start, 1);
|
|
|
|
if (!map)
|
2019-06-20 15:37:57 -04:00
|
|
|
break;
|
|
|
|
|
btrfs: use a dedicated data structure for chunk maps
Currently we abuse the extent_map structure for two purposes:
1) To actually represent extents for inodes;
2) To represent chunk mappings.
This is odd and has several disadvantages:
1) To create a chunk map, we need to do two memory allocations: one for
an extent_map structure and another one for a map_lookup structure, so
more potential for an allocation failure and more complicated code to
manage and link two structures;
2) For a chunk map we actually only use 3 fields (24 bytes) of the
respective extent map structure: the 'start' field to have the logical
start address of the chunk, the 'len' field to have the chunk's size,
and the 'orig_block_len' field to contain the chunk's stripe size.
Besides wasting a memory, it's also odd and not intuitive at all to
have the stripe size in a field named 'orig_block_len'.
We are also using 'block_len' of the extent_map structure to contain
the chunk size, so we have 2 fields for the same value, 'len' and
'block_len', which is pointless;
3) When an extent map is associated to a chunk mapping, we set the bit
EXTENT_FLAG_FS_MAPPING on its flags and then make its member named
'map_lookup' point to the associated map_lookup structure. This means
that for an extent map associated to an inode extent, we are not using
this 'map_lookup' pointer, so wasting 8 bytes (on a 64 bits platform);
4) Extent maps associated to a chunk mapping are never merged or split so
it's pointless to use the existing extent map infrastructure.
So add a dedicated data structure named 'btrfs_chunk_map' to represent
chunk mappings, this is basically the existing map_lookup structure with
some extra fields:
1) 'start' to contain the chunk logical address;
2) 'chunk_len' to contain the chunk's length;
3) 'stripe_size' for the stripe size;
4) 'rb_node' for insertion into a rb tree;
5) 'refs' for reference counting.
This way we do a single memory allocation for chunk mappings and we don't
waste memory for them with unused/unnecessary fields from an extent_map.
We also save 8 bytes from the extent_map structure by removing the
'map_lookup' pointer, so the size of struct extent_map is reduced from
144 bytes down to 136 bytes, and we can now have 30 extents map per 4K
page instead of 28.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-11-21 13:38:38 +00:00
|
|
|
bg = btrfs_lookup_block_group(fs_info, map->start);
|
2019-06-20 15:37:57 -04:00
|
|
|
if (!bg) {
|
|
|
|
btrfs_err(fs_info,
|
|
|
|
"chunk start=%llu len=%llu doesn't have corresponding block group",
|
btrfs: use a dedicated data structure for chunk maps
Currently we abuse the extent_map structure for two purposes:
1) To actually represent extents for inodes;
2) To represent chunk mappings.
This is odd and has several disadvantages:
1) To create a chunk map, we need to do two memory allocations: one for
an extent_map structure and another one for a map_lookup structure, so
more potential for an allocation failure and more complicated code to
manage and link two structures;
2) For a chunk map we actually only use 3 fields (24 bytes) of the
respective extent map structure: the 'start' field to have the logical
start address of the chunk, the 'len' field to have the chunk's size,
and the 'orig_block_len' field to contain the chunk's stripe size.
Besides wasting a memory, it's also odd and not intuitive at all to
have the stripe size in a field named 'orig_block_len'.
We are also using 'block_len' of the extent_map structure to contain
the chunk size, so we have 2 fields for the same value, 'len' and
'block_len', which is pointless;
3) When an extent map is associated to a chunk mapping, we set the bit
EXTENT_FLAG_FS_MAPPING on its flags and then make its member named
'map_lookup' point to the associated map_lookup structure. This means
that for an extent map associated to an inode extent, we are not using
this 'map_lookup' pointer, so wasting 8 bytes (on a 64 bits platform);
4) Extent maps associated to a chunk mapping are never merged or split so
it's pointless to use the existing extent map infrastructure.
So add a dedicated data structure named 'btrfs_chunk_map' to represent
chunk mappings, this is basically the existing map_lookup structure with
some extra fields:
1) 'start' to contain the chunk logical address;
2) 'chunk_len' to contain the chunk's length;
3) 'stripe_size' for the stripe size;
4) 'rb_node' for insertion into a rb tree;
5) 'refs' for reference counting.
This way we do a single memory allocation for chunk mappings and we don't
waste memory for them with unused/unnecessary fields from an extent_map.
We also save 8 bytes from the extent_map structure by removing the
'map_lookup' pointer, so the size of struct extent_map is reduced from
144 bytes down to 136 bytes, and we can now have 30 extents map per 4K
page instead of 28.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-11-21 13:38:38 +00:00
|
|
|
map->start, map->chunk_len);
|
2019-06-20 15:37:57 -04:00
|
|
|
ret = -EUCLEAN;
|
btrfs: use a dedicated data structure for chunk maps
Currently we abuse the extent_map structure for two purposes:
1) To actually represent extents for inodes;
2) To represent chunk mappings.
This is odd and has several disadvantages:
1) To create a chunk map, we need to do two memory allocations: one for
an extent_map structure and another one for a map_lookup structure, so
more potential for an allocation failure and more complicated code to
manage and link two structures;
2) For a chunk map we actually only use 3 fields (24 bytes) of the
respective extent map structure: the 'start' field to have the logical
start address of the chunk, the 'len' field to have the chunk's size,
and the 'orig_block_len' field to contain the chunk's stripe size.
Besides wasting a memory, it's also odd and not intuitive at all to
have the stripe size in a field named 'orig_block_len'.
We are also using 'block_len' of the extent_map structure to contain
the chunk size, so we have 2 fields for the same value, 'len' and
'block_len', which is pointless;
3) When an extent map is associated to a chunk mapping, we set the bit
EXTENT_FLAG_FS_MAPPING on its flags and then make its member named
'map_lookup' point to the associated map_lookup structure. This means
that for an extent map associated to an inode extent, we are not using
this 'map_lookup' pointer, so wasting 8 bytes (on a 64 bits platform);
4) Extent maps associated to a chunk mapping are never merged or split so
it's pointless to use the existing extent map infrastructure.
So add a dedicated data structure named 'btrfs_chunk_map' to represent
chunk mappings, this is basically the existing map_lookup structure with
some extra fields:
1) 'start' to contain the chunk logical address;
2) 'chunk_len' to contain the chunk's length;
3) 'stripe_size' for the stripe size;
4) 'rb_node' for insertion into a rb tree;
5) 'refs' for reference counting.
This way we do a single memory allocation for chunk mappings and we don't
waste memory for them with unused/unnecessary fields from an extent_map.
We also save 8 bytes from the extent_map structure by removing the
'map_lookup' pointer, so the size of struct extent_map is reduced from
144 bytes down to 136 bytes, and we can now have 30 extents map per 4K
page instead of 28.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-11-21 13:38:38 +00:00
|
|
|
btrfs_free_chunk_map(map);
|
2019-06-20 15:37:57 -04:00
|
|
|
break;
|
|
|
|
}
|
btrfs: use a dedicated data structure for chunk maps
Currently we abuse the extent_map structure for two purposes:
1) To actually represent extents for inodes;
2) To represent chunk mappings.
This is odd and has several disadvantages:
1) To create a chunk map, we need to do two memory allocations: one for
an extent_map structure and another one for a map_lookup structure, so
more potential for an allocation failure and more complicated code to
manage and link two structures;
2) For a chunk map we actually only use 3 fields (24 bytes) of the
respective extent map structure: the 'start' field to have the logical
start address of the chunk, the 'len' field to have the chunk's size,
and the 'orig_block_len' field to contain the chunk's stripe size.
Besides wasting a memory, it's also odd and not intuitive at all to
have the stripe size in a field named 'orig_block_len'.
We are also using 'block_len' of the extent_map structure to contain
the chunk size, so we have 2 fields for the same value, 'len' and
'block_len', which is pointless;
3) When an extent map is associated to a chunk mapping, we set the bit
EXTENT_FLAG_FS_MAPPING on its flags and then make its member named
'map_lookup' point to the associated map_lookup structure. This means
that for an extent map associated to an inode extent, we are not using
this 'map_lookup' pointer, so wasting 8 bytes (on a 64 bits platform);
4) Extent maps associated to a chunk mapping are never merged or split so
it's pointless to use the existing extent map infrastructure.
So add a dedicated data structure named 'btrfs_chunk_map' to represent
chunk mappings, this is basically the existing map_lookup structure with
some extra fields:
1) 'start' to contain the chunk logical address;
2) 'chunk_len' to contain the chunk's length;
3) 'stripe_size' for the stripe size;
4) 'rb_node' for insertion into a rb tree;
5) 'refs' for reference counting.
This way we do a single memory allocation for chunk mappings and we don't
waste memory for them with unused/unnecessary fields from an extent_map.
We also save 8 bytes from the extent_map structure by removing the
'map_lookup' pointer, so the size of struct extent_map is reduced from
144 bytes down to 136 bytes, and we can now have 30 extents map per 4K
page instead of 28.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-11-21 13:38:38 +00:00
|
|
|
if (bg->start != map->start || bg->length != map->chunk_len ||
|
2019-06-20 15:37:57 -04:00
|
|
|
(bg->flags & BTRFS_BLOCK_GROUP_TYPE_MASK) !=
|
btrfs: use a dedicated data structure for chunk maps
Currently we abuse the extent_map structure for two purposes:
1) To actually represent extents for inodes;
2) To represent chunk mappings.
This is odd and has several disadvantages:
1) To create a chunk map, we need to do two memory allocations: one for
an extent_map structure and another one for a map_lookup structure, so
more potential for an allocation failure and more complicated code to
manage and link two structures;
2) For a chunk map we actually only use 3 fields (24 bytes) of the
respective extent map structure: the 'start' field to have the logical
start address of the chunk, the 'len' field to have the chunk's size,
and the 'orig_block_len' field to contain the chunk's stripe size.
Besides wasting a memory, it's also odd and not intuitive at all to
have the stripe size in a field named 'orig_block_len'.
We are also using 'block_len' of the extent_map structure to contain
the chunk size, so we have 2 fields for the same value, 'len' and
'block_len', which is pointless;
3) When an extent map is associated to a chunk mapping, we set the bit
EXTENT_FLAG_FS_MAPPING on its flags and then make its member named
'map_lookup' point to the associated map_lookup structure. This means
that for an extent map associated to an inode extent, we are not using
this 'map_lookup' pointer, so wasting 8 bytes (on a 64 bits platform);
4) Extent maps associated to a chunk mapping are never merged or split so
it's pointless to use the existing extent map infrastructure.
So add a dedicated data structure named 'btrfs_chunk_map' to represent
chunk mappings, this is basically the existing map_lookup structure with
some extra fields:
1) 'start' to contain the chunk logical address;
2) 'chunk_len' to contain the chunk's length;
3) 'stripe_size' for the stripe size;
4) 'rb_node' for insertion into a rb tree;
5) 'refs' for reference counting.
This way we do a single memory allocation for chunk mappings and we don't
waste memory for them with unused/unnecessary fields from an extent_map.
We also save 8 bytes from the extent_map structure by removing the
'map_lookup' pointer, so the size of struct extent_map is reduced from
144 bytes down to 136 bytes, and we can now have 30 extents map per 4K
page instead of 28.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-11-21 13:38:38 +00:00
|
|
|
(map->type & BTRFS_BLOCK_GROUP_TYPE_MASK)) {
|
2019-06-20 15:37:57 -04:00
|
|
|
btrfs_err(fs_info,
|
|
|
|
"chunk start=%llu len=%llu flags=0x%llx doesn't match block group start=%llu len=%llu flags=0x%llx",
|
btrfs: use a dedicated data structure for chunk maps
Currently we abuse the extent_map structure for two purposes:
1) To actually represent extents for inodes;
2) To represent chunk mappings.
This is odd and has several disadvantages:
1) To create a chunk map, we need to do two memory allocations: one for
an extent_map structure and another one for a map_lookup structure, so
more potential for an allocation failure and more complicated code to
manage and link two structures;
2) For a chunk map we actually only use 3 fields (24 bytes) of the
respective extent map structure: the 'start' field to have the logical
start address of the chunk, the 'len' field to have the chunk's size,
and the 'orig_block_len' field to contain the chunk's stripe size.
Besides wasting a memory, it's also odd and not intuitive at all to
have the stripe size in a field named 'orig_block_len'.
We are also using 'block_len' of the extent_map structure to contain
the chunk size, so we have 2 fields for the same value, 'len' and
'block_len', which is pointless;
3) When an extent map is associated to a chunk mapping, we set the bit
EXTENT_FLAG_FS_MAPPING on its flags and then make its member named
'map_lookup' point to the associated map_lookup structure. This means
that for an extent map associated to an inode extent, we are not using
this 'map_lookup' pointer, so wasting 8 bytes (on a 64 bits platform);
4) Extent maps associated to a chunk mapping are never merged or split so
it's pointless to use the existing extent map infrastructure.
So add a dedicated data structure named 'btrfs_chunk_map' to represent
chunk mappings, this is basically the existing map_lookup structure with
some extra fields:
1) 'start' to contain the chunk logical address;
2) 'chunk_len' to contain the chunk's length;
3) 'stripe_size' for the stripe size;
4) 'rb_node' for insertion into a rb tree;
5) 'refs' for reference counting.
This way we do a single memory allocation for chunk mappings and we don't
waste memory for them with unused/unnecessary fields from an extent_map.
We also save 8 bytes from the extent_map structure by removing the
'map_lookup' pointer, so the size of struct extent_map is reduced from
144 bytes down to 136 bytes, and we can now have 30 extents map per 4K
page instead of 28.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-11-21 13:38:38 +00:00
|
|
|
map->start, map->chunk_len,
|
|
|
|
map->type & BTRFS_BLOCK_GROUP_TYPE_MASK,
|
2019-10-23 18:48:22 +02:00
|
|
|
bg->start, bg->length,
|
2019-06-20 15:37:57 -04:00
|
|
|
bg->flags & BTRFS_BLOCK_GROUP_TYPE_MASK);
|
|
|
|
ret = -EUCLEAN;
|
btrfs: use a dedicated data structure for chunk maps
Currently we abuse the extent_map structure for two purposes:
1) To actually represent extents for inodes;
2) To represent chunk mappings.
This is odd and has several disadvantages:
1) To create a chunk map, we need to do two memory allocations: one for
an extent_map structure and another one for a map_lookup structure, so
more potential for an allocation failure and more complicated code to
manage and link two structures;
2) For a chunk map we actually only use 3 fields (24 bytes) of the
respective extent map structure: the 'start' field to have the logical
start address of the chunk, the 'len' field to have the chunk's size,
and the 'orig_block_len' field to contain the chunk's stripe size.
Besides wasting a memory, it's also odd and not intuitive at all to
have the stripe size in a field named 'orig_block_len'.
We are also using 'block_len' of the extent_map structure to contain
the chunk size, so we have 2 fields for the same value, 'len' and
'block_len', which is pointless;
3) When an extent map is associated to a chunk mapping, we set the bit
EXTENT_FLAG_FS_MAPPING on its flags and then make its member named
'map_lookup' point to the associated map_lookup structure. This means
that for an extent map associated to an inode extent, we are not using
this 'map_lookup' pointer, so wasting 8 bytes (on a 64 bits platform);
4) Extent maps associated to a chunk mapping are never merged or split so
it's pointless to use the existing extent map infrastructure.
So add a dedicated data structure named 'btrfs_chunk_map' to represent
chunk mappings, this is basically the existing map_lookup structure with
some extra fields:
1) 'start' to contain the chunk logical address;
2) 'chunk_len' to contain the chunk's length;
3) 'stripe_size' for the stripe size;
4) 'rb_node' for insertion into a rb tree;
5) 'refs' for reference counting.
This way we do a single memory allocation for chunk mappings and we don't
waste memory for them with unused/unnecessary fields from an extent_map.
We also save 8 bytes from the extent_map structure by removing the
'map_lookup' pointer, so the size of struct extent_map is reduced from
144 bytes down to 136 bytes, and we can now have 30 extents map per 4K
page instead of 28.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-11-21 13:38:38 +00:00
|
|
|
btrfs_free_chunk_map(map);
|
2019-06-20 15:37:57 -04:00
|
|
|
btrfs_put_block_group(bg);
|
|
|
|
break;
|
|
|
|
}
|
btrfs: use a dedicated data structure for chunk maps
Currently we abuse the extent_map structure for two purposes:
1) To actually represent extents for inodes;
2) To represent chunk mappings.
This is odd and has several disadvantages:
1) To create a chunk map, we need to do two memory allocations: one for
an extent_map structure and another one for a map_lookup structure, so
more potential for an allocation failure and more complicated code to
manage and link two structures;
2) For a chunk map we actually only use 3 fields (24 bytes) of the
respective extent map structure: the 'start' field to have the logical
start address of the chunk, the 'len' field to have the chunk's size,
and the 'orig_block_len' field to contain the chunk's stripe size.
Besides wasting a memory, it's also odd and not intuitive at all to
have the stripe size in a field named 'orig_block_len'.
We are also using 'block_len' of the extent_map structure to contain
the chunk size, so we have 2 fields for the same value, 'len' and
'block_len', which is pointless;
3) When an extent map is associated to a chunk mapping, we set the bit
EXTENT_FLAG_FS_MAPPING on its flags and then make its member named
'map_lookup' point to the associated map_lookup structure. This means
that for an extent map associated to an inode extent, we are not using
this 'map_lookup' pointer, so wasting 8 bytes (on a 64 bits platform);
4) Extent maps associated to a chunk mapping are never merged or split so
it's pointless to use the existing extent map infrastructure.
So add a dedicated data structure named 'btrfs_chunk_map' to represent
chunk mappings, this is basically the existing map_lookup structure with
some extra fields:
1) 'start' to contain the chunk logical address;
2) 'chunk_len' to contain the chunk's length;
3) 'stripe_size' for the stripe size;
4) 'rb_node' for insertion into a rb tree;
5) 'refs' for reference counting.
This way we do a single memory allocation for chunk mappings and we don't
waste memory for them with unused/unnecessary fields from an extent_map.
We also save 8 bytes from the extent_map structure by removing the
'map_lookup' pointer, so the size of struct extent_map is reduced from
144 bytes down to 136 bytes, and we can now have 30 extents map per 4K
page instead of 28.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-11-21 13:38:38 +00:00
|
|
|
start = map->start + map->chunk_len;
|
|
|
|
btrfs_free_chunk_map(map);
|
2019-06-20 15:37:57 -04:00
|
|
|
btrfs_put_block_group(bg);
|
|
|
|
}
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2019-10-10 10:39:27 +08:00
|
|
|
static int read_one_block_group(struct btrfs_fs_info *info,
|
2021-02-04 19:21:44 +09:00
|
|
|
struct btrfs_block_group_item *bgi,
|
2019-11-05 09:35:35 +08:00
|
|
|
const struct btrfs_key *key,
|
2019-10-10 10:39:27 +08:00
|
|
|
int need_clear)
|
|
|
|
{
|
2019-10-29 19:20:18 +01:00
|
|
|
struct btrfs_block_group *cache;
|
2019-10-10 10:39:27 +08:00
|
|
|
const bool mixed = btrfs_fs_incompat(info, MIXED_GROUPS);
|
|
|
|
int ret;
|
|
|
|
|
2019-11-05 09:35:35 +08:00
|
|
|
ASSERT(key->type == BTRFS_BLOCK_GROUP_ITEM_KEY);
|
2019-10-10 10:39:27 +08:00
|
|
|
|
2020-05-05 07:58:20 +08:00
|
|
|
cache = btrfs_create_block_group_cache(info, key->objectid);
|
2019-10-10 10:39:27 +08:00
|
|
|
if (!cache)
|
|
|
|
return -ENOMEM;
|
|
|
|
|
2021-02-04 19:21:44 +09:00
|
|
|
cache->length = key->offset;
|
|
|
|
cache->used = btrfs_stack_block_group_used(bgi);
|
btrfs: skip update of block group item if used bytes are the same
[BACKGROUND]
When committing a transaction, we will update block group items for all
dirty block groups.
But in fact, dirty block groups don't always need to update their block
group items.
It's pretty common to have a metadata block group which experienced
several COW operations, but still have the same amount of used bytes.
In that case, we may unnecessarily COW a tree block doing nothing.
[ENHANCEMENT]
This patch will introduce btrfs_block_group::commit_used member to
remember the last used bytes, and use that new member to skip
unnecessary block group item update.
This would be more common for large filesystems, where metadata block
group can be as large as 1GiB, containing at most 64K metadata items.
In that case, if COW added and then deleted one metadata item near the
end of the block group, then it's completely possible we don't need to
touch the block group item at all.
[BENCHMARK]
The change itself can have quite a high chance (20~80%) to skip block
group item updates in lot of workloads.
As a result, it would result shorter time spent on
btrfs_write_dirty_block_groups(), and overall reduce the execution time
of the critical section of btrfs_commit_transaction().
Here comes a fio command, which will do random writes in 4K block size,
causing a very heavy metadata updates.
fio --filename=$mnt/file --size=512M --rw=randwrite --direct=1 --bs=4k \
--ioengine=libaio --iodepth=64 --runtime=300 --numjobs=4 \
--name=random_write --fallocate=none --time_based --fsync_on_close=1
The file size (512M) and number of threads (4) means 2GiB file size in
total, but during the full 300s run time, my dedicated SATA SSD is able
to write around 20~25GiB, which is over 10 times the file size.
Thus after we fill the initial 2G, we should not cause much block group
item updates.
Please note, the fio numbers by themselves don't have much change, but
if we look deeper, there is some reduced execution time, especially for
the critical section of btrfs_commit_transaction().
I added extra trace_printk() to measure the following per-transaction
execution time:
- Critical section of btrfs_commit_transaction()
By re-using the existing update_commit_stats() function, which
has already calculated the interval correctly.
- The while() loop for btrfs_write_dirty_block_groups()
Although this includes the execution time of btrfs_run_delayed_refs(),
it should still be representative overall.
Both result involves transid 7~30, the same amount of transaction
committed.
The result looks like this:
| Before | After | Diff
----------------------+-------------------+----------------+--------
Transaction interval | 229247198.5 | 215016933.6 | -6.2%
Block group interval | 23133.33333 | 18970.83333 | -18.0%
The change in block group item updates is more obvious, as skipped block
group item updates also mean less delayed refs.
And the overall execution time for that block group update loop is
pretty small, thus we can assume the extent tree is already mostly
cached. If we can skip an uncached tree block, it would cause more
obvious change.
Unfortunately the overall reduction in commit transaction critical
section is much smaller, as the block group item updates loop is not
really the major part, at least not for the above fio script.
But still we have a observable reduction in the critical section.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-09 14:45:22 +08:00
|
|
|
cache->commit_used = cache->used;
|
2021-02-04 19:21:44 +09:00
|
|
|
cache->flags = btrfs_stack_block_group_flags(bgi);
|
2021-12-15 15:40:08 -05:00
|
|
|
cache->global_root_id = btrfs_stack_block_group_chunk_objectid(bgi);
|
2020-05-05 07:58:20 +08:00
|
|
|
|
2020-08-21 11:54:44 -03:00
|
|
|
set_free_space_tree_thresholds(cache);
|
|
|
|
|
2019-10-10 10:39:27 +08:00
|
|
|
if (need_clear) {
|
|
|
|
/*
|
|
|
|
* When we mount with old space cache, we need to
|
|
|
|
* set BTRFS_DC_CLEAR and set dirty flag.
|
|
|
|
*
|
|
|
|
* a) Setting 'BTRFS_DC_CLEAR' makes sure that we
|
|
|
|
* truncate the old free space cache inode and
|
|
|
|
* setup a new one.
|
|
|
|
* b) Setting 'dirty flag' makes sure that we flush
|
|
|
|
* the new space cache info onto disk.
|
|
|
|
*/
|
|
|
|
if (btrfs_test_opt(info, SPACE_CACHE))
|
|
|
|
cache->disk_cache_state = BTRFS_DC_CLEAR;
|
|
|
|
}
|
|
|
|
if (!mixed && ((cache->flags & BTRFS_BLOCK_GROUP_METADATA) &&
|
|
|
|
(cache->flags & BTRFS_BLOCK_GROUP_DATA))) {
|
|
|
|
btrfs_err(info,
|
|
|
|
"bg %llu is a mixed block group but filesystem hasn't enabled mixed block groups",
|
|
|
|
cache->start);
|
|
|
|
ret = -EINVAL;
|
|
|
|
goto error;
|
|
|
|
}
|
|
|
|
|
2021-02-04 19:21:51 +09:00
|
|
|
ret = btrfs_load_block_group_zone_info(cache, false);
|
2021-02-04 19:21:50 +09:00
|
|
|
if (ret) {
|
|
|
|
btrfs_err(info, "zoned: failed to load zone info of bg %llu",
|
|
|
|
cache->start);
|
|
|
|
goto error;
|
|
|
|
}
|
|
|
|
|
2019-10-10 10:39:27 +08:00
|
|
|
/*
|
|
|
|
* We need to exclude the super stripes now so that the space info has
|
|
|
|
* super bytes accounted for, otherwise we'll think we have more space
|
|
|
|
* than we actually do.
|
|
|
|
*/
|
|
|
|
ret = exclude_super_stripes(cache);
|
|
|
|
if (ret) {
|
|
|
|
/* We may have excluded something, so call this just in case. */
|
|
|
|
btrfs_free_excluded_extents(cache);
|
|
|
|
goto error;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
2021-02-04 19:21:52 +09:00
|
|
|
* For zoned filesystem, space after the allocation offset is the only
|
|
|
|
* free space for a block group. So, we don't need any caching work.
|
|
|
|
* btrfs_calc_zone_unusable() will set the amount of free space and
|
|
|
|
* zone_unusable space.
|
|
|
|
*
|
|
|
|
* For regular filesystem, check for two cases, either we are full, and
|
|
|
|
* therefore don't need to bother with the caching work since we won't
|
|
|
|
* find any space, or we are empty, and we can just add all the space
|
|
|
|
* in and be done with it. This saves us _a_lot_ of time, particularly
|
|
|
|
* in the full case.
|
2019-10-10 10:39:27 +08:00
|
|
|
*/
|
2021-02-04 19:21:52 +09:00
|
|
|
if (btrfs_is_zoned(info)) {
|
|
|
|
btrfs_calc_zone_unusable(cache);
|
2021-08-19 21:19:09 +09:00
|
|
|
/* Should not have any excluded extents. Just in case, though. */
|
|
|
|
btrfs_free_excluded_extents(cache);
|
2021-02-04 19:21:52 +09:00
|
|
|
} else if (cache->length == cache->used) {
|
2019-10-10 10:39:27 +08:00
|
|
|
cache->cached = BTRFS_CACHE_FINISHED;
|
|
|
|
btrfs_free_excluded_extents(cache);
|
|
|
|
} else if (cache->used == 0) {
|
|
|
|
cache->cached = BTRFS_CACHE_FINISHED;
|
2023-06-30 16:03:46 +01:00
|
|
|
ret = btrfs_add_new_free_space(cache, cache->start,
|
|
|
|
cache->start + cache->length, NULL);
|
2019-10-10 10:39:27 +08:00
|
|
|
btrfs_free_excluded_extents(cache);
|
2023-06-30 16:03:44 +01:00
|
|
|
if (ret)
|
|
|
|
goto error;
|
2019-10-10 10:39:27 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
ret = btrfs_add_block_group_cache(info, cache);
|
|
|
|
if (ret) {
|
|
|
|
btrfs_remove_free_space_cache(cache);
|
|
|
|
goto error;
|
|
|
|
}
|
|
|
|
trace_btrfs_add_block_group(info, cache, 0);
|
2022-07-15 15:45:23 -04:00
|
|
|
btrfs_add_bg_to_space_info(info, cache);
|
2019-10-10 10:39:27 +08:00
|
|
|
|
|
|
|
set_avail_alloc_bits(info, cache->flags);
|
2021-08-24 13:27:42 +08:00
|
|
|
if (btrfs_chunk_writeable(info, cache->start)) {
|
|
|
|
if (cache->used == 0) {
|
|
|
|
ASSERT(list_empty(&cache->bg_list));
|
|
|
|
if (btrfs_test_opt(info, DISCARD_ASYNC))
|
|
|
|
btrfs_discard_queue_work(&info->discard_ctl, cache);
|
|
|
|
else
|
|
|
|
btrfs_mark_bg_unused(cache);
|
|
|
|
}
|
|
|
|
} else {
|
2019-10-10 10:39:27 +08:00
|
|
|
inc_block_group_ro(cache, 1);
|
|
|
|
}
|
2021-08-24 13:27:42 +08:00
|
|
|
|
2019-10-10 10:39:27 +08:00
|
|
|
return 0;
|
|
|
|
error:
|
|
|
|
btrfs_put_block_group(cache);
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2020-10-16 11:29:18 -04:00
|
|
|
static int fill_dummy_bgs(struct btrfs_fs_info *fs_info)
|
|
|
|
{
|
|
|
|
struct rb_node *node;
|
|
|
|
int ret = 0;
|
|
|
|
|
btrfs: use a dedicated data structure for chunk maps
Currently we abuse the extent_map structure for two purposes:
1) To actually represent extents for inodes;
2) To represent chunk mappings.
This is odd and has several disadvantages:
1) To create a chunk map, we need to do two memory allocations: one for
an extent_map structure and another one for a map_lookup structure, so
more potential for an allocation failure and more complicated code to
manage and link two structures;
2) For a chunk map we actually only use 3 fields (24 bytes) of the
respective extent map structure: the 'start' field to have the logical
start address of the chunk, the 'len' field to have the chunk's size,
and the 'orig_block_len' field to contain the chunk's stripe size.
Besides wasting a memory, it's also odd and not intuitive at all to
have the stripe size in a field named 'orig_block_len'.
We are also using 'block_len' of the extent_map structure to contain
the chunk size, so we have 2 fields for the same value, 'len' and
'block_len', which is pointless;
3) When an extent map is associated to a chunk mapping, we set the bit
EXTENT_FLAG_FS_MAPPING on its flags and then make its member named
'map_lookup' point to the associated map_lookup structure. This means
that for an extent map associated to an inode extent, we are not using
this 'map_lookup' pointer, so wasting 8 bytes (on a 64 bits platform);
4) Extent maps associated to a chunk mapping are never merged or split so
it's pointless to use the existing extent map infrastructure.
So add a dedicated data structure named 'btrfs_chunk_map' to represent
chunk mappings, this is basically the existing map_lookup structure with
some extra fields:
1) 'start' to contain the chunk logical address;
2) 'chunk_len' to contain the chunk's length;
3) 'stripe_size' for the stripe size;
4) 'rb_node' for insertion into a rb tree;
5) 'refs' for reference counting.
This way we do a single memory allocation for chunk mappings and we don't
waste memory for them with unused/unnecessary fields from an extent_map.
We also save 8 bytes from the extent_map structure by removing the
'map_lookup' pointer, so the size of struct extent_map is reduced from
144 bytes down to 136 bytes, and we can now have 30 extents map per 4K
page instead of 28.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-11-21 13:38:38 +00:00
|
|
|
for (node = rb_first_cached(&fs_info->mapping_tree); node; node = rb_next(node)) {
|
|
|
|
struct btrfs_chunk_map *map;
|
2020-10-16 11:29:18 -04:00
|
|
|
struct btrfs_block_group *bg;
|
|
|
|
|
btrfs: use a dedicated data structure for chunk maps
Currently we abuse the extent_map structure for two purposes:
1) To actually represent extents for inodes;
2) To represent chunk mappings.
This is odd and has several disadvantages:
1) To create a chunk map, we need to do two memory allocations: one for
an extent_map structure and another one for a map_lookup structure, so
more potential for an allocation failure and more complicated code to
manage and link two structures;
2) For a chunk map we actually only use 3 fields (24 bytes) of the
respective extent map structure: the 'start' field to have the logical
start address of the chunk, the 'len' field to have the chunk's size,
and the 'orig_block_len' field to contain the chunk's stripe size.
Besides wasting a memory, it's also odd and not intuitive at all to
have the stripe size in a field named 'orig_block_len'.
We are also using 'block_len' of the extent_map structure to contain
the chunk size, so we have 2 fields for the same value, 'len' and
'block_len', which is pointless;
3) When an extent map is associated to a chunk mapping, we set the bit
EXTENT_FLAG_FS_MAPPING on its flags and then make its member named
'map_lookup' point to the associated map_lookup structure. This means
that for an extent map associated to an inode extent, we are not using
this 'map_lookup' pointer, so wasting 8 bytes (on a 64 bits platform);
4) Extent maps associated to a chunk mapping are never merged or split so
it's pointless to use the existing extent map infrastructure.
So add a dedicated data structure named 'btrfs_chunk_map' to represent
chunk mappings, this is basically the existing map_lookup structure with
some extra fields:
1) 'start' to contain the chunk logical address;
2) 'chunk_len' to contain the chunk's length;
3) 'stripe_size' for the stripe size;
4) 'rb_node' for insertion into a rb tree;
5) 'refs' for reference counting.
This way we do a single memory allocation for chunk mappings and we don't
waste memory for them with unused/unnecessary fields from an extent_map.
We also save 8 bytes from the extent_map structure by removing the
'map_lookup' pointer, so the size of struct extent_map is reduced from
144 bytes down to 136 bytes, and we can now have 30 extents map per 4K
page instead of 28.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-11-21 13:38:38 +00:00
|
|
|
map = rb_entry(node, struct btrfs_chunk_map, rb_node);
|
|
|
|
bg = btrfs_create_block_group_cache(fs_info, map->start);
|
2020-10-16 11:29:18 -04:00
|
|
|
if (!bg) {
|
|
|
|
ret = -ENOMEM;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
|
|
|
|
/* Fill dummy cache as FULL */
|
btrfs: use a dedicated data structure for chunk maps
Currently we abuse the extent_map structure for two purposes:
1) To actually represent extents for inodes;
2) To represent chunk mappings.
This is odd and has several disadvantages:
1) To create a chunk map, we need to do two memory allocations: one for
an extent_map structure and another one for a map_lookup structure, so
more potential for an allocation failure and more complicated code to
manage and link two structures;
2) For a chunk map we actually only use 3 fields (24 bytes) of the
respective extent map structure: the 'start' field to have the logical
start address of the chunk, the 'len' field to have the chunk's size,
and the 'orig_block_len' field to contain the chunk's stripe size.
Besides wasting a memory, it's also odd and not intuitive at all to
have the stripe size in a field named 'orig_block_len'.
We are also using 'block_len' of the extent_map structure to contain
the chunk size, so we have 2 fields for the same value, 'len' and
'block_len', which is pointless;
3) When an extent map is associated to a chunk mapping, we set the bit
EXTENT_FLAG_FS_MAPPING on its flags and then make its member named
'map_lookup' point to the associated map_lookup structure. This means
that for an extent map associated to an inode extent, we are not using
this 'map_lookup' pointer, so wasting 8 bytes (on a 64 bits platform);
4) Extent maps associated to a chunk mapping are never merged or split so
it's pointless to use the existing extent map infrastructure.
So add a dedicated data structure named 'btrfs_chunk_map' to represent
chunk mappings, this is basically the existing map_lookup structure with
some extra fields:
1) 'start' to contain the chunk logical address;
2) 'chunk_len' to contain the chunk's length;
3) 'stripe_size' for the stripe size;
4) 'rb_node' for insertion into a rb tree;
5) 'refs' for reference counting.
This way we do a single memory allocation for chunk mappings and we don't
waste memory for them with unused/unnecessary fields from an extent_map.
We also save 8 bytes from the extent_map structure by removing the
'map_lookup' pointer, so the size of struct extent_map is reduced from
144 bytes down to 136 bytes, and we can now have 30 extents map per 4K
page instead of 28.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-11-21 13:38:38 +00:00
|
|
|
bg->length = map->chunk_len;
|
2020-10-16 11:29:18 -04:00
|
|
|
bg->flags = map->type;
|
|
|
|
bg->cached = BTRFS_CACHE_FINISHED;
|
btrfs: use a dedicated data structure for chunk maps
Currently we abuse the extent_map structure for two purposes:
1) To actually represent extents for inodes;
2) To represent chunk mappings.
This is odd and has several disadvantages:
1) To create a chunk map, we need to do two memory allocations: one for
an extent_map structure and another one for a map_lookup structure, so
more potential for an allocation failure and more complicated code to
manage and link two structures;
2) For a chunk map we actually only use 3 fields (24 bytes) of the
respective extent map structure: the 'start' field to have the logical
start address of the chunk, the 'len' field to have the chunk's size,
and the 'orig_block_len' field to contain the chunk's stripe size.
Besides wasting a memory, it's also odd and not intuitive at all to
have the stripe size in a field named 'orig_block_len'.
We are also using 'block_len' of the extent_map structure to contain
the chunk size, so we have 2 fields for the same value, 'len' and
'block_len', which is pointless;
3) When an extent map is associated to a chunk mapping, we set the bit
EXTENT_FLAG_FS_MAPPING on its flags and then make its member named
'map_lookup' point to the associated map_lookup structure. This means
that for an extent map associated to an inode extent, we are not using
this 'map_lookup' pointer, so wasting 8 bytes (on a 64 bits platform);
4) Extent maps associated to a chunk mapping are never merged or split so
it's pointless to use the existing extent map infrastructure.
So add a dedicated data structure named 'btrfs_chunk_map' to represent
chunk mappings, this is basically the existing map_lookup structure with
some extra fields:
1) 'start' to contain the chunk logical address;
2) 'chunk_len' to contain the chunk's length;
3) 'stripe_size' for the stripe size;
4) 'rb_node' for insertion into a rb tree;
5) 'refs' for reference counting.
This way we do a single memory allocation for chunk mappings and we don't
waste memory for them with unused/unnecessary fields from an extent_map.
We also save 8 bytes from the extent_map structure by removing the
'map_lookup' pointer, so the size of struct extent_map is reduced from
144 bytes down to 136 bytes, and we can now have 30 extents map per 4K
page instead of 28.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-11-21 13:38:38 +00:00
|
|
|
bg->used = map->chunk_len;
|
2020-10-16 11:29:18 -04:00
|
|
|
bg->flags = map->type;
|
|
|
|
ret = btrfs_add_block_group_cache(fs_info, bg);
|
2021-07-19 13:43:04 +08:00
|
|
|
/*
|
|
|
|
* We may have some valid block group cache added already, in
|
|
|
|
* that case we skip to the next one.
|
|
|
|
*/
|
|
|
|
if (ret == -EEXIST) {
|
|
|
|
ret = 0;
|
|
|
|
btrfs_put_block_group(bg);
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
|
2020-10-16 11:29:18 -04:00
|
|
|
if (ret) {
|
|
|
|
btrfs_remove_free_space_cache(bg);
|
|
|
|
btrfs_put_block_group(bg);
|
|
|
|
break;
|
|
|
|
}
|
2021-07-19 13:43:04 +08:00
|
|
|
|
2022-07-15 15:45:23 -04:00
|
|
|
btrfs_add_bg_to_space_info(fs_info, bg);
|
2020-10-16 11:29:18 -04:00
|
|
|
|
|
|
|
set_avail_alloc_bits(fs_info, bg->flags);
|
|
|
|
}
|
|
|
|
if (!ret)
|
|
|
|
btrfs_init_global_block_rsv(fs_info);
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2019-06-20 15:37:57 -04:00
|
|
|
int btrfs_read_block_groups(struct btrfs_fs_info *info)
|
|
|
|
{
|
2021-11-05 16:45:36 -04:00
|
|
|
struct btrfs_root *root = btrfs_block_group_root(info);
|
2019-06-20 15:37:57 -04:00
|
|
|
struct btrfs_path *path;
|
|
|
|
int ret;
|
2019-10-29 19:20:18 +01:00
|
|
|
struct btrfs_block_group *cache;
|
2019-06-20 15:37:57 -04:00
|
|
|
struct btrfs_space_info *space_info;
|
|
|
|
struct btrfs_key key;
|
|
|
|
int need_clear = 0;
|
|
|
|
u64 cache_gen;
|
|
|
|
|
2022-08-09 13:02:16 +08:00
|
|
|
/*
|
|
|
|
* Either no extent root (with ibadroots rescue option) or we have
|
|
|
|
* unsupported RO options. The fs can never be mounted read-write, so no
|
|
|
|
* need to waste time searching block group items.
|
|
|
|
*
|
|
|
|
* This also allows new extent tree related changes to be RO compat,
|
|
|
|
* no need for a full incompat flag.
|
|
|
|
*/
|
|
|
|
if (!root || (btrfs_super_compat_ro_flags(info->super_copy) &
|
|
|
|
~BTRFS_FEATURE_COMPAT_RO_SUPP))
|
2020-10-16 11:29:18 -04:00
|
|
|
return fill_dummy_bgs(info);
|
|
|
|
|
2019-06-20 15:37:57 -04:00
|
|
|
key.objectid = 0;
|
|
|
|
key.offset = 0;
|
|
|
|
key.type = BTRFS_BLOCK_GROUP_ITEM_KEY;
|
|
|
|
path = btrfs_alloc_path();
|
|
|
|
if (!path)
|
|
|
|
return -ENOMEM;
|
|
|
|
|
|
|
|
cache_gen = btrfs_super_cache_generation(info->super_copy);
|
|
|
|
if (btrfs_test_opt(info, SPACE_CACHE) &&
|
|
|
|
btrfs_super_generation(info->super_copy) != cache_gen)
|
|
|
|
need_clear = 1;
|
|
|
|
if (btrfs_test_opt(info, CLEAR_CACHE))
|
|
|
|
need_clear = 1;
|
|
|
|
|
|
|
|
while (1) {
|
2021-02-04 19:21:44 +09:00
|
|
|
struct btrfs_block_group_item bgi;
|
|
|
|
struct extent_buffer *leaf;
|
|
|
|
int slot;
|
|
|
|
|
2019-06-20 15:37:57 -04:00
|
|
|
ret = find_first_block_group(info, path, &key);
|
|
|
|
if (ret > 0)
|
|
|
|
break;
|
|
|
|
if (ret != 0)
|
|
|
|
goto error;
|
|
|
|
|
2021-02-04 19:21:44 +09:00
|
|
|
leaf = path->nodes[0];
|
|
|
|
slot = path->slots[0];
|
|
|
|
|
|
|
|
read_extent_buffer(leaf, &bgi, btrfs_item_ptr_offset(leaf, slot),
|
|
|
|
sizeof(bgi));
|
|
|
|
|
|
|
|
btrfs_item_key_to_cpu(leaf, &key, slot);
|
|
|
|
btrfs_release_path(path);
|
|
|
|
ret = read_one_block_group(info, &bgi, &key, need_clear);
|
2019-10-10 10:39:27 +08:00
|
|
|
if (ret < 0)
|
2019-06-20 15:37:57 -04:00
|
|
|
goto error;
|
2019-10-10 10:39:27 +08:00
|
|
|
key.objectid += key.offset;
|
|
|
|
key.offset = 0;
|
2019-06-20 15:37:57 -04:00
|
|
|
}
|
2020-10-14 17:00:51 -04:00
|
|
|
btrfs_release_path(path);
|
2019-06-20 15:37:57 -04:00
|
|
|
|
2020-09-01 17:40:37 -04:00
|
|
|
list_for_each_entry(space_info, &info->space_info, list) {
|
btrfs: do not create raid sysfs entries under any locks
While running xfstests btrfs/177 I got the following lockdep splat
======================================================
WARNING: possible circular locking dependency detected
5.9.0-rc3+ #5 Not tainted
------------------------------------------------------
kswapd0/100 is trying to acquire lock:
ffff97066aa56760 (&delayed_node->mutex){+.+.}-{3:3}, at: __btrfs_release_delayed_node.part.0+0x3f/0x330
but task is already holding lock:
ffffffff9fd74700 (fs_reclaim){+.+.}-{0:0}, at: __fs_reclaim_acquire+0x5/0x30
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #3 (fs_reclaim){+.+.}-{0:0}:
fs_reclaim_acquire+0x65/0x80
slab_pre_alloc_hook.constprop.0+0x20/0x200
kmem_cache_alloc+0x37/0x270
alloc_inode+0x82/0xb0
iget_locked+0x10d/0x2c0
kernfs_get_inode+0x1b/0x130
kernfs_get_tree+0x136/0x240
sysfs_get_tree+0x16/0x40
vfs_get_tree+0x28/0xc0
path_mount+0x434/0xc00
__x64_sys_mount+0xe3/0x120
do_syscall_64+0x33/0x40
entry_SYSCALL_64_after_hwframe+0x44/0xa9
-> #2 (kernfs_mutex){+.+.}-{3:3}:
__mutex_lock+0x7e/0x7e0
kernfs_add_one+0x23/0x150
kernfs_create_dir_ns+0x7a/0xb0
sysfs_create_dir_ns+0x60/0xb0
kobject_add_internal+0xc0/0x2c0
kobject_add+0x6e/0x90
btrfs_sysfs_add_block_group_type+0x102/0x160
btrfs_make_block_group+0x167/0x230
btrfs_alloc_chunk+0x54f/0xb80
btrfs_chunk_alloc+0x18e/0x3a0
find_free_extent+0xdf6/0x1210
btrfs_reserve_extent+0xb3/0x1b0
btrfs_alloc_tree_block+0xb0/0x310
alloc_tree_block_no_bg_flush+0x4a/0x60
__btrfs_cow_block+0x11a/0x530
btrfs_cow_block+0x104/0x220
btrfs_search_slot+0x52e/0x9d0
btrfs_insert_empty_items+0x64/0xb0
btrfs_new_inode+0x225/0x730
btrfs_create+0xab/0x1f0
lookup_open.isra.0+0x52d/0x690
path_openat+0x2a7/0x9e0
do_filp_open+0x75/0x100
do_sys_openat2+0x7b/0x130
__x64_sys_openat+0x46/0x70
do_syscall_64+0x33/0x40
entry_SYSCALL_64_after_hwframe+0x44/0xa9
-> #1 (&fs_info->chunk_mutex){+.+.}-{3:3}:
__mutex_lock+0x7e/0x7e0
btrfs_chunk_alloc+0x125/0x3a0
find_free_extent+0xdf6/0x1210
btrfs_reserve_extent+0xb3/0x1b0
btrfs_alloc_tree_block+0xb0/0x310
alloc_tree_block_no_bg_flush+0x4a/0x60
__btrfs_cow_block+0x11a/0x530
btrfs_cow_block+0x104/0x220
btrfs_search_slot+0x52e/0x9d0
btrfs_lookup_inode+0x2a/0x8f
__btrfs_update_delayed_inode+0x80/0x240
btrfs_commit_inode_delayed_inode+0x119/0x120
btrfs_evict_inode+0x357/0x500
evict+0xcf/0x1f0
do_unlinkat+0x1a9/0x2b0
do_syscall_64+0x33/0x40
entry_SYSCALL_64_after_hwframe+0x44/0xa9
-> #0 (&delayed_node->mutex){+.+.}-{3:3}:
__lock_acquire+0x119c/0x1fc0
lock_acquire+0xa7/0x3d0
__mutex_lock+0x7e/0x7e0
__btrfs_release_delayed_node.part.0+0x3f/0x330
btrfs_evict_inode+0x24c/0x500
evict+0xcf/0x1f0
dispose_list+0x48/0x70
prune_icache_sb+0x44/0x50
super_cache_scan+0x161/0x1e0
do_shrink_slab+0x178/0x3c0
shrink_slab+0x17c/0x290
shrink_node+0x2b2/0x6d0
balance_pgdat+0x30a/0x670
kswapd+0x213/0x4c0
kthread+0x138/0x160
ret_from_fork+0x1f/0x30
other info that might help us debug this:
Chain exists of:
&delayed_node->mutex --> kernfs_mutex --> fs_reclaim
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(fs_reclaim);
lock(kernfs_mutex);
lock(fs_reclaim);
lock(&delayed_node->mutex);
*** DEADLOCK ***
3 locks held by kswapd0/100:
#0: ffffffff9fd74700 (fs_reclaim){+.+.}-{0:0}, at: __fs_reclaim_acquire+0x5/0x30
#1: ffffffff9fd65c50 (shrinker_rwsem){++++}-{3:3}, at: shrink_slab+0x115/0x290
#2: ffff9706629780e0 (&type->s_umount_key#36){++++}-{3:3}, at: super_cache_scan+0x38/0x1e0
stack backtrace:
CPU: 1 PID: 100 Comm: kswapd0 Not tainted 5.9.0-rc3+ #5
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014
Call Trace:
dump_stack+0x8b/0xb8
check_noncircular+0x12d/0x150
__lock_acquire+0x119c/0x1fc0
lock_acquire+0xa7/0x3d0
? __btrfs_release_delayed_node.part.0+0x3f/0x330
__mutex_lock+0x7e/0x7e0
? __btrfs_release_delayed_node.part.0+0x3f/0x330
? __btrfs_release_delayed_node.part.0+0x3f/0x330
? lock_acquire+0xa7/0x3d0
? find_held_lock+0x2b/0x80
__btrfs_release_delayed_node.part.0+0x3f/0x330
btrfs_evict_inode+0x24c/0x500
evict+0xcf/0x1f0
dispose_list+0x48/0x70
prune_icache_sb+0x44/0x50
super_cache_scan+0x161/0x1e0
do_shrink_slab+0x178/0x3c0
shrink_slab+0x17c/0x290
shrink_node+0x2b2/0x6d0
balance_pgdat+0x30a/0x670
kswapd+0x213/0x4c0
? _raw_spin_unlock_irqrestore+0x41/0x50
? add_wait_queue_exclusive+0x70/0x70
? balance_pgdat+0x670/0x670
kthread+0x138/0x160
? kthread_create_worker_on_cpu+0x40/0x40
ret_from_fork+0x1f/0x30
This happens because when we link in a block group with a new raid index
type we'll create the corresponding sysfs entries for it. This is
problematic because while restriping we're holding the chunk_mutex, and
while mounting we're holding the tree locks.
Fixing this isn't pretty, we move the call to the sysfs stuff into the
btrfs_create_pending_block_groups() work, where we're not holding any
locks. This creates a slight race where other threads could see that
there's no sysfs kobj for that raid type, and race to create the
sysfs dir. Fix this by wrapping the creation in space_info->lock, so we
only get one thread calling kobject_add() for the new directory. We
don't worry about the lock on cleanup as it only gets deleted on
unmount.
On mount it's more straightforward, we loop through the space_infos
already, just check every raid index in each space_info and added the
sysfs entries for the corresponding block groups.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-09-01 17:40:38 -04:00
|
|
|
int i;
|
|
|
|
|
|
|
|
for (i = 0; i < BTRFS_NR_RAID_TYPES; i++) {
|
|
|
|
if (list_empty(&space_info->block_groups[i]))
|
|
|
|
continue;
|
|
|
|
cache = list_first_entry(&space_info->block_groups[i],
|
|
|
|
struct btrfs_block_group,
|
|
|
|
list);
|
|
|
|
btrfs_sysfs_add_block_group_type(cache);
|
|
|
|
}
|
|
|
|
|
2019-06-20 15:37:57 -04:00
|
|
|
if (!(btrfs_get_alloc_profile(info, space_info->flags) &
|
|
|
|
(BTRFS_BLOCK_GROUP_RAID10 |
|
|
|
|
BTRFS_BLOCK_GROUP_RAID1_MASK |
|
|
|
|
BTRFS_BLOCK_GROUP_RAID56_MASK |
|
|
|
|
BTRFS_BLOCK_GROUP_DUP)))
|
|
|
|
continue;
|
|
|
|
/*
|
|
|
|
* Avoid allocating from un-mirrored block group if there are
|
|
|
|
* mirrored block groups.
|
|
|
|
*/
|
|
|
|
list_for_each_entry(cache,
|
|
|
|
&space_info->block_groups[BTRFS_RAID_RAID0],
|
|
|
|
list)
|
2019-06-20 15:38:07 -04:00
|
|
|
inc_block_group_ro(cache, 1);
|
2019-06-20 15:37:57 -04:00
|
|
|
list_for_each_entry(cache,
|
|
|
|
&space_info->block_groups[BTRFS_RAID_SINGLE],
|
|
|
|
list)
|
2019-06-20 15:38:07 -04:00
|
|
|
inc_block_group_ro(cache, 1);
|
2019-06-20 15:37:57 -04:00
|
|
|
}
|
|
|
|
|
|
|
|
btrfs_init_global_block_rsv(info);
|
|
|
|
ret = check_chunk_block_group_mappings(info);
|
|
|
|
error:
|
|
|
|
btrfs_free_path(path);
|
2021-07-19 13:43:04 +08:00
|
|
|
/*
|
|
|
|
* We've hit some error while reading the extent tree, and have
|
|
|
|
* rescue=ibadroots mount option.
|
|
|
|
* Try to fill the tree using dummy block groups so that the user can
|
|
|
|
* continue to mount and grab their data.
|
|
|
|
*/
|
|
|
|
if (ret && btrfs_test_opt(info, IGNOREBADROOTS))
|
|
|
|
ret = fill_dummy_bgs(info);
|
2019-06-20 15:37:57 -04:00
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
btrfs: rework chunk allocation to avoid exhaustion of the system chunk array
Commit eafa4fd0ad0607 ("btrfs: fix exhaustion of the system chunk array
due to concurrent allocations") fixed a problem that resulted in
exhausting the system chunk array in the superblock when there are many
tasks allocating chunks in parallel. Basically too many tasks enter the
first phase of chunk allocation without previous tasks having finished
their second phase of allocation, resulting in too many system chunks
being allocated. That was originally observed when running the fallocate
tests of stress-ng on a PowerPC machine, using a node size of 64K.
However that commit also introduced a deadlock where a task in phase 1 of
the chunk allocation waited for another task that had allocated a system
chunk to finish its phase 2, but that other task was waiting on an extent
buffer lock held by the first task, therefore resulting in both tasks not
making any progress. That change was later reverted by a patch with the
subject "btrfs: fix deadlock with concurrent chunk allocations involving
system chunks", since there is no simple and short solution to address it
and the deadlock is relatively easy to trigger on zoned filesystems, while
the system chunk array exhaustion is not so common.
This change reworks the chunk allocation to avoid the system chunk array
exhaustion. It accomplishes that by making the first phase of chunk
allocation do the updates of the device items in the chunk btree and the
insertion of the new chunk item in the chunk btree. This is done while
under the protection of the chunk mutex (fs_info->chunk_mutex), in the
same critical section that checks for available system space, allocates
a new system chunk if needed and reserves system chunk space. This way
we do not have chunk space reserved until the second phase completes.
The same logic is applied to chunk removal as well, since it keeps
reserved system space long after it is done updating the chunk btree.
For direct allocation of system chunks, the previous behaviour remains,
because otherwise we would deadlock on extent buffers of the chunk btree.
Changes to the chunk btree are by large done by chunk allocation and chunk
removal, which first reserve chunk system space and then later do changes
to the chunk btree. The other remaining cases are uncommon and correspond
to adding a device, removing a device and resizing a device. All these
other cases do not pre-reserve system space, they modify the chunk btree
right away, so they don't hold reserved space for a long period like chunk
allocation and chunk removal do.
The diff of this change is huge, but more than half of it is just addition
of comments describing both how things work regarding chunk allocation and
removal, including both the new behavior and the parts of the old behavior
that did not change.
CC: stable@vger.kernel.org # 5.12+
Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Tested-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Tested-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-06-29 14:43:06 +01:00
|
|
|
/*
|
|
|
|
* This function, insert_block_group_item(), belongs to the phase 2 of chunk
|
|
|
|
* allocation.
|
|
|
|
*
|
|
|
|
* See the comment at btrfs_chunk_alloc() for details about the chunk allocation
|
|
|
|
* phases.
|
|
|
|
*/
|
2020-05-05 07:58:22 +08:00
|
|
|
static int insert_block_group_item(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_block_group *block_group)
|
|
|
|
{
|
|
|
|
struct btrfs_fs_info *fs_info = trans->fs_info;
|
|
|
|
struct btrfs_block_group_item bgi;
|
2021-11-05 16:45:36 -04:00
|
|
|
struct btrfs_root *root = btrfs_block_group_root(fs_info);
|
2020-05-05 07:58:22 +08:00
|
|
|
struct btrfs_key key;
|
btrfs: fix block group item corruption after inserting new block group
We can often end up inserting a block group item, for a new block group,
with a wrong value for the used bytes field.
This happens if for the new allocated block group, in the same transaction
that created the block group, we have tasks allocating extents from it as
well as tasks removing extents from it.
For example:
1) Task A creates a metadata block group X;
2) Two extents are allocated from block group X, so its "used" field is
updated to 32K, and its "commit_used" field remains as 0;
3) Transaction commit starts, by some task B, and it enters
btrfs_start_dirty_block_groups(). There it tries to update the block
group item for block group X, which currently has its "used" field with
a value of 32K. But that fails since the block group item was not yet
inserted, and so on failure update_block_group_item() sets the
"commit_used" field of the block group back to 0;
4) The block group item is inserted by task A, when for example
btrfs_create_pending_block_groups() is called when releasing its
transaction handle. This results in insert_block_group_item() inserting
the block group item in the extent tree (or block group tree), with a
"used" field having a value of 32K, but without updating the
"commit_used" field in the block group, which remains with value of 0;
5) The two extents are freed from block X, so its "used" field changes
from 32K to 0;
6) The transaction commit by task B continues, it enters
btrfs_write_dirty_block_groups() which calls update_block_group_item()
for block group X, and there it decides to skip the block group item
update, because "used" has a value of 0 and "commit_used" has a value
of 0 too.
As a result, we end up with a block item having a 32K "used" field but
no extents allocated from it.
When this issue happens, a btrfs check reports an error like this:
[1/7] checking root items
[2/7] checking extents
block group [1104150528 1073741824] used 39796736 but extent items used 0
ERROR: errors found in extent allocation tree or chunk allocation
(...)
Fix this by making insert_block_group_item() update the block group's
"commit_used" field.
Fixes: 7248e0cebbef ("btrfs: skip update of block group item if used bytes are the same")
CC: stable@vger.kernel.org # 6.2+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-03-06 10:13:34 +00:00
|
|
|
u64 old_commit_used;
|
|
|
|
int ret;
|
2020-05-05 07:58:22 +08:00
|
|
|
|
|
|
|
spin_lock(&block_group->lock);
|
|
|
|
btrfs_set_stack_block_group_used(&bgi, block_group->used);
|
|
|
|
btrfs_set_stack_block_group_chunk_objectid(&bgi,
|
2021-12-15 15:40:08 -05:00
|
|
|
block_group->global_root_id);
|
2020-05-05 07:58:22 +08:00
|
|
|
btrfs_set_stack_block_group_flags(&bgi, block_group->flags);
|
btrfs: fix block group item corruption after inserting new block group
We can often end up inserting a block group item, for a new block group,
with a wrong value for the used bytes field.
This happens if for the new allocated block group, in the same transaction
that created the block group, we have tasks allocating extents from it as
well as tasks removing extents from it.
For example:
1) Task A creates a metadata block group X;
2) Two extents are allocated from block group X, so its "used" field is
updated to 32K, and its "commit_used" field remains as 0;
3) Transaction commit starts, by some task B, and it enters
btrfs_start_dirty_block_groups(). There it tries to update the block
group item for block group X, which currently has its "used" field with
a value of 32K. But that fails since the block group item was not yet
inserted, and so on failure update_block_group_item() sets the
"commit_used" field of the block group back to 0;
4) The block group item is inserted by task A, when for example
btrfs_create_pending_block_groups() is called when releasing its
transaction handle. This results in insert_block_group_item() inserting
the block group item in the extent tree (or block group tree), with a
"used" field having a value of 32K, but without updating the
"commit_used" field in the block group, which remains with value of 0;
5) The two extents are freed from block X, so its "used" field changes
from 32K to 0;
6) The transaction commit by task B continues, it enters
btrfs_write_dirty_block_groups() which calls update_block_group_item()
for block group X, and there it decides to skip the block group item
update, because "used" has a value of 0 and "commit_used" has a value
of 0 too.
As a result, we end up with a block item having a 32K "used" field but
no extents allocated from it.
When this issue happens, a btrfs check reports an error like this:
[1/7] checking root items
[2/7] checking extents
block group [1104150528 1073741824] used 39796736 but extent items used 0
ERROR: errors found in extent allocation tree or chunk allocation
(...)
Fix this by making insert_block_group_item() update the block group's
"commit_used" field.
Fixes: 7248e0cebbef ("btrfs: skip update of block group item if used bytes are the same")
CC: stable@vger.kernel.org # 6.2+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-03-06 10:13:34 +00:00
|
|
|
old_commit_used = block_group->commit_used;
|
|
|
|
block_group->commit_used = block_group->used;
|
2020-05-05 07:58:22 +08:00
|
|
|
key.objectid = block_group->start;
|
|
|
|
key.type = BTRFS_BLOCK_GROUP_ITEM_KEY;
|
|
|
|
key.offset = block_group->length;
|
|
|
|
spin_unlock(&block_group->lock);
|
|
|
|
|
btrfs: fix block group item corruption after inserting new block group
We can often end up inserting a block group item, for a new block group,
with a wrong value for the used bytes field.
This happens if for the new allocated block group, in the same transaction
that created the block group, we have tasks allocating extents from it as
well as tasks removing extents from it.
For example:
1) Task A creates a metadata block group X;
2) Two extents are allocated from block group X, so its "used" field is
updated to 32K, and its "commit_used" field remains as 0;
3) Transaction commit starts, by some task B, and it enters
btrfs_start_dirty_block_groups(). There it tries to update the block
group item for block group X, which currently has its "used" field with
a value of 32K. But that fails since the block group item was not yet
inserted, and so on failure update_block_group_item() sets the
"commit_used" field of the block group back to 0;
4) The block group item is inserted by task A, when for example
btrfs_create_pending_block_groups() is called when releasing its
transaction handle. This results in insert_block_group_item() inserting
the block group item in the extent tree (or block group tree), with a
"used" field having a value of 32K, but without updating the
"commit_used" field in the block group, which remains with value of 0;
5) The two extents are freed from block X, so its "used" field changes
from 32K to 0;
6) The transaction commit by task B continues, it enters
btrfs_write_dirty_block_groups() which calls update_block_group_item()
for block group X, and there it decides to skip the block group item
update, because "used" has a value of 0 and "commit_used" has a value
of 0 too.
As a result, we end up with a block item having a 32K "used" field but
no extents allocated from it.
When this issue happens, a btrfs check reports an error like this:
[1/7] checking root items
[2/7] checking extents
block group [1104150528 1073741824] used 39796736 but extent items used 0
ERROR: errors found in extent allocation tree or chunk allocation
(...)
Fix this by making insert_block_group_item() update the block group's
"commit_used" field.
Fixes: 7248e0cebbef ("btrfs: skip update of block group item if used bytes are the same")
CC: stable@vger.kernel.org # 6.2+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-03-06 10:13:34 +00:00
|
|
|
ret = btrfs_insert_item(trans, root, &key, &bgi, sizeof(bgi));
|
|
|
|
if (ret < 0) {
|
|
|
|
spin_lock(&block_group->lock);
|
|
|
|
block_group->commit_used = old_commit_used;
|
|
|
|
spin_unlock(&block_group->lock);
|
|
|
|
}
|
|
|
|
|
|
|
|
return ret;
|
2020-05-05 07:58:22 +08:00
|
|
|
}
|
|
|
|
|
2021-07-05 12:29:19 +03:00
|
|
|
static int insert_dev_extent(struct btrfs_trans_handle *trans,
|
2024-06-26 23:39:11 +02:00
|
|
|
const struct btrfs_device *device, u64 chunk_offset,
|
|
|
|
u64 start, u64 num_bytes)
|
2021-07-05 12:29:19 +03:00
|
|
|
{
|
|
|
|
struct btrfs_fs_info *fs_info = device->fs_info;
|
|
|
|
struct btrfs_root *root = fs_info->dev_root;
|
|
|
|
struct btrfs_path *path;
|
|
|
|
struct btrfs_dev_extent *extent;
|
|
|
|
struct extent_buffer *leaf;
|
|
|
|
struct btrfs_key key;
|
|
|
|
int ret;
|
|
|
|
|
|
|
|
WARN_ON(!test_bit(BTRFS_DEV_STATE_IN_FS_METADATA, &device->dev_state));
|
|
|
|
WARN_ON(test_bit(BTRFS_DEV_STATE_REPLACE_TGT, &device->dev_state));
|
|
|
|
path = btrfs_alloc_path();
|
|
|
|
if (!path)
|
|
|
|
return -ENOMEM;
|
|
|
|
|
|
|
|
key.objectid = device->devid;
|
|
|
|
key.type = BTRFS_DEV_EXTENT_KEY;
|
|
|
|
key.offset = start;
|
|
|
|
ret = btrfs_insert_empty_item(trans, root, path, &key, sizeof(*extent));
|
|
|
|
if (ret)
|
|
|
|
goto out;
|
|
|
|
|
|
|
|
leaf = path->nodes[0];
|
|
|
|
extent = btrfs_item_ptr(leaf, path->slots[0], struct btrfs_dev_extent);
|
|
|
|
btrfs_set_dev_extent_chunk_tree(leaf, extent, BTRFS_CHUNK_TREE_OBJECTID);
|
|
|
|
btrfs_set_dev_extent_chunk_objectid(leaf, extent,
|
|
|
|
BTRFS_FIRST_CHUNK_TREE_OBJECTID);
|
|
|
|
btrfs_set_dev_extent_chunk_offset(leaf, extent, chunk_offset);
|
|
|
|
|
|
|
|
btrfs_set_dev_extent_length(leaf, extent, num_bytes);
|
2023-09-12 13:04:29 +01:00
|
|
|
btrfs_mark_buffer_dirty(trans, leaf);
|
2021-07-05 12:29:19 +03:00
|
|
|
out:
|
|
|
|
btrfs_free_path(path);
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* This function belongs to phase 2.
|
|
|
|
*
|
|
|
|
* See the comment at btrfs_chunk_alloc() for details about the chunk allocation
|
|
|
|
* phases.
|
|
|
|
*/
|
|
|
|
static int insert_dev_extents(struct btrfs_trans_handle *trans,
|
|
|
|
u64 chunk_offset, u64 chunk_size)
|
|
|
|
{
|
|
|
|
struct btrfs_fs_info *fs_info = trans->fs_info;
|
|
|
|
struct btrfs_device *device;
|
btrfs: use a dedicated data structure for chunk maps
Currently we abuse the extent_map structure for two purposes:
1) To actually represent extents for inodes;
2) To represent chunk mappings.
This is odd and has several disadvantages:
1) To create a chunk map, we need to do two memory allocations: one for
an extent_map structure and another one for a map_lookup structure, so
more potential for an allocation failure and more complicated code to
manage and link two structures;
2) For a chunk map we actually only use 3 fields (24 bytes) of the
respective extent map structure: the 'start' field to have the logical
start address of the chunk, the 'len' field to have the chunk's size,
and the 'orig_block_len' field to contain the chunk's stripe size.
Besides wasting a memory, it's also odd and not intuitive at all to
have the stripe size in a field named 'orig_block_len'.
We are also using 'block_len' of the extent_map structure to contain
the chunk size, so we have 2 fields for the same value, 'len' and
'block_len', which is pointless;
3) When an extent map is associated to a chunk mapping, we set the bit
EXTENT_FLAG_FS_MAPPING on its flags and then make its member named
'map_lookup' point to the associated map_lookup structure. This means
that for an extent map associated to an inode extent, we are not using
this 'map_lookup' pointer, so wasting 8 bytes (on a 64 bits platform);
4) Extent maps associated to a chunk mapping are never merged or split so
it's pointless to use the existing extent map infrastructure.
So add a dedicated data structure named 'btrfs_chunk_map' to represent
chunk mappings, this is basically the existing map_lookup structure with
some extra fields:
1) 'start' to contain the chunk logical address;
2) 'chunk_len' to contain the chunk's length;
3) 'stripe_size' for the stripe size;
4) 'rb_node' for insertion into a rb tree;
5) 'refs' for reference counting.
This way we do a single memory allocation for chunk mappings and we don't
waste memory for them with unused/unnecessary fields from an extent_map.
We also save 8 bytes from the extent_map structure by removing the
'map_lookup' pointer, so the size of struct extent_map is reduced from
144 bytes down to 136 bytes, and we can now have 30 extents map per 4K
page instead of 28.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-11-21 13:38:38 +00:00
|
|
|
struct btrfs_chunk_map *map;
|
2021-07-05 12:29:19 +03:00
|
|
|
u64 dev_offset;
|
|
|
|
int i;
|
|
|
|
int ret = 0;
|
|
|
|
|
btrfs: use a dedicated data structure for chunk maps
Currently we abuse the extent_map structure for two purposes:
1) To actually represent extents for inodes;
2) To represent chunk mappings.
This is odd and has several disadvantages:
1) To create a chunk map, we need to do two memory allocations: one for
an extent_map structure and another one for a map_lookup structure, so
more potential for an allocation failure and more complicated code to
manage and link two structures;
2) For a chunk map we actually only use 3 fields (24 bytes) of the
respective extent map structure: the 'start' field to have the logical
start address of the chunk, the 'len' field to have the chunk's size,
and the 'orig_block_len' field to contain the chunk's stripe size.
Besides wasting a memory, it's also odd and not intuitive at all to
have the stripe size in a field named 'orig_block_len'.
We are also using 'block_len' of the extent_map structure to contain
the chunk size, so we have 2 fields for the same value, 'len' and
'block_len', which is pointless;
3) When an extent map is associated to a chunk mapping, we set the bit
EXTENT_FLAG_FS_MAPPING on its flags and then make its member named
'map_lookup' point to the associated map_lookup structure. This means
that for an extent map associated to an inode extent, we are not using
this 'map_lookup' pointer, so wasting 8 bytes (on a 64 bits platform);
4) Extent maps associated to a chunk mapping are never merged or split so
it's pointless to use the existing extent map infrastructure.
So add a dedicated data structure named 'btrfs_chunk_map' to represent
chunk mappings, this is basically the existing map_lookup structure with
some extra fields:
1) 'start' to contain the chunk logical address;
2) 'chunk_len' to contain the chunk's length;
3) 'stripe_size' for the stripe size;
4) 'rb_node' for insertion into a rb tree;
5) 'refs' for reference counting.
This way we do a single memory allocation for chunk mappings and we don't
waste memory for them with unused/unnecessary fields from an extent_map.
We also save 8 bytes from the extent_map structure by removing the
'map_lookup' pointer, so the size of struct extent_map is reduced from
144 bytes down to 136 bytes, and we can now have 30 extents map per 4K
page instead of 28.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-11-21 13:38:38 +00:00
|
|
|
map = btrfs_get_chunk_map(fs_info, chunk_offset, chunk_size);
|
|
|
|
if (IS_ERR(map))
|
|
|
|
return PTR_ERR(map);
|
2021-07-05 12:29:19 +03:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Take the device list mutex to prevent races with the final phase of
|
|
|
|
* a device replace operation that replaces the device object associated
|
|
|
|
* with the map's stripes, because the device object's id can change
|
|
|
|
* at any time during that final phase of the device replace operation
|
|
|
|
* (dev-replace.c:btrfs_dev_replace_finishing()), so we could grab the
|
|
|
|
* replaced device and then see it with an ID of BTRFS_DEV_REPLACE_DEVID,
|
|
|
|
* resulting in persisting a device extent item with such ID.
|
|
|
|
*/
|
|
|
|
mutex_lock(&fs_info->fs_devices->device_list_mutex);
|
|
|
|
for (i = 0; i < map->num_stripes; i++) {
|
|
|
|
device = map->stripes[i].dev;
|
|
|
|
dev_offset = map->stripes[i].physical;
|
|
|
|
|
|
|
|
ret = insert_dev_extent(trans, device, chunk_offset, dev_offset,
|
2023-11-21 13:38:39 +00:00
|
|
|
map->stripe_size);
|
2021-07-05 12:29:19 +03:00
|
|
|
if (ret)
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
mutex_unlock(&fs_info->fs_devices->device_list_mutex);
|
|
|
|
|
btrfs: use a dedicated data structure for chunk maps
Currently we abuse the extent_map structure for two purposes:
1) To actually represent extents for inodes;
2) To represent chunk mappings.
This is odd and has several disadvantages:
1) To create a chunk map, we need to do two memory allocations: one for
an extent_map structure and another one for a map_lookup structure, so
more potential for an allocation failure and more complicated code to
manage and link two structures;
2) For a chunk map we actually only use 3 fields (24 bytes) of the
respective extent map structure: the 'start' field to have the logical
start address of the chunk, the 'len' field to have the chunk's size,
and the 'orig_block_len' field to contain the chunk's stripe size.
Besides wasting a memory, it's also odd and not intuitive at all to
have the stripe size in a field named 'orig_block_len'.
We are also using 'block_len' of the extent_map structure to contain
the chunk size, so we have 2 fields for the same value, 'len' and
'block_len', which is pointless;
3) When an extent map is associated to a chunk mapping, we set the bit
EXTENT_FLAG_FS_MAPPING on its flags and then make its member named
'map_lookup' point to the associated map_lookup structure. This means
that for an extent map associated to an inode extent, we are not using
this 'map_lookup' pointer, so wasting 8 bytes (on a 64 bits platform);
4) Extent maps associated to a chunk mapping are never merged or split so
it's pointless to use the existing extent map infrastructure.
So add a dedicated data structure named 'btrfs_chunk_map' to represent
chunk mappings, this is basically the existing map_lookup structure with
some extra fields:
1) 'start' to contain the chunk logical address;
2) 'chunk_len' to contain the chunk's length;
3) 'stripe_size' for the stripe size;
4) 'rb_node' for insertion into a rb tree;
5) 'refs' for reference counting.
This way we do a single memory allocation for chunk mappings and we don't
waste memory for them with unused/unnecessary fields from an extent_map.
We also save 8 bytes from the extent_map structure by removing the
'map_lookup' pointer, so the size of struct extent_map is reduced from
144 bytes down to 136 bytes, and we can now have 30 extents map per 4K
page instead of 28.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-11-21 13:38:38 +00:00
|
|
|
btrfs_free_chunk_map(map);
|
2021-07-05 12:29:19 +03:00
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
btrfs: rework chunk allocation to avoid exhaustion of the system chunk array
Commit eafa4fd0ad0607 ("btrfs: fix exhaustion of the system chunk array
due to concurrent allocations") fixed a problem that resulted in
exhausting the system chunk array in the superblock when there are many
tasks allocating chunks in parallel. Basically too many tasks enter the
first phase of chunk allocation without previous tasks having finished
their second phase of allocation, resulting in too many system chunks
being allocated. That was originally observed when running the fallocate
tests of stress-ng on a PowerPC machine, using a node size of 64K.
However that commit also introduced a deadlock where a task in phase 1 of
the chunk allocation waited for another task that had allocated a system
chunk to finish its phase 2, but that other task was waiting on an extent
buffer lock held by the first task, therefore resulting in both tasks not
making any progress. That change was later reverted by a patch with the
subject "btrfs: fix deadlock with concurrent chunk allocations involving
system chunks", since there is no simple and short solution to address it
and the deadlock is relatively easy to trigger on zoned filesystems, while
the system chunk array exhaustion is not so common.
This change reworks the chunk allocation to avoid the system chunk array
exhaustion. It accomplishes that by making the first phase of chunk
allocation do the updates of the device items in the chunk btree and the
insertion of the new chunk item in the chunk btree. This is done while
under the protection of the chunk mutex (fs_info->chunk_mutex), in the
same critical section that checks for available system space, allocates
a new system chunk if needed and reserves system chunk space. This way
we do not have chunk space reserved until the second phase completes.
The same logic is applied to chunk removal as well, since it keeps
reserved system space long after it is done updating the chunk btree.
For direct allocation of system chunks, the previous behaviour remains,
because otherwise we would deadlock on extent buffers of the chunk btree.
Changes to the chunk btree are by large done by chunk allocation and chunk
removal, which first reserve chunk system space and then later do changes
to the chunk btree. The other remaining cases are uncommon and correspond
to adding a device, removing a device and resizing a device. All these
other cases do not pre-reserve system space, they modify the chunk btree
right away, so they don't hold reserved space for a long period like chunk
allocation and chunk removal do.
The diff of this change is huge, but more than half of it is just addition
of comments describing both how things work regarding chunk allocation and
removal, including both the new behavior and the parts of the old behavior
that did not change.
CC: stable@vger.kernel.org # 5.12+
Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Tested-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Tested-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-06-29 14:43:06 +01:00
|
|
|
/*
|
|
|
|
* This function, btrfs_create_pending_block_groups(), belongs to the phase 2 of
|
|
|
|
* chunk allocation.
|
|
|
|
*
|
|
|
|
* See the comment at btrfs_chunk_alloc() for details about the chunk allocation
|
|
|
|
* phases.
|
|
|
|
*/
|
2019-06-20 15:37:57 -04:00
|
|
|
void btrfs_create_pending_block_groups(struct btrfs_trans_handle *trans)
|
|
|
|
{
|
|
|
|
struct btrfs_fs_info *fs_info = trans->fs_info;
|
2019-10-29 19:20:18 +01:00
|
|
|
struct btrfs_block_group *block_group;
|
2019-06-20 15:37:57 -04:00
|
|
|
int ret = 0;
|
|
|
|
|
|
|
|
while (!list_empty(&trans->new_bgs)) {
|
btrfs: do not create raid sysfs entries under any locks
While running xfstests btrfs/177 I got the following lockdep splat
======================================================
WARNING: possible circular locking dependency detected
5.9.0-rc3+ #5 Not tainted
------------------------------------------------------
kswapd0/100 is trying to acquire lock:
ffff97066aa56760 (&delayed_node->mutex){+.+.}-{3:3}, at: __btrfs_release_delayed_node.part.0+0x3f/0x330
but task is already holding lock:
ffffffff9fd74700 (fs_reclaim){+.+.}-{0:0}, at: __fs_reclaim_acquire+0x5/0x30
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #3 (fs_reclaim){+.+.}-{0:0}:
fs_reclaim_acquire+0x65/0x80
slab_pre_alloc_hook.constprop.0+0x20/0x200
kmem_cache_alloc+0x37/0x270
alloc_inode+0x82/0xb0
iget_locked+0x10d/0x2c0
kernfs_get_inode+0x1b/0x130
kernfs_get_tree+0x136/0x240
sysfs_get_tree+0x16/0x40
vfs_get_tree+0x28/0xc0
path_mount+0x434/0xc00
__x64_sys_mount+0xe3/0x120
do_syscall_64+0x33/0x40
entry_SYSCALL_64_after_hwframe+0x44/0xa9
-> #2 (kernfs_mutex){+.+.}-{3:3}:
__mutex_lock+0x7e/0x7e0
kernfs_add_one+0x23/0x150
kernfs_create_dir_ns+0x7a/0xb0
sysfs_create_dir_ns+0x60/0xb0
kobject_add_internal+0xc0/0x2c0
kobject_add+0x6e/0x90
btrfs_sysfs_add_block_group_type+0x102/0x160
btrfs_make_block_group+0x167/0x230
btrfs_alloc_chunk+0x54f/0xb80
btrfs_chunk_alloc+0x18e/0x3a0
find_free_extent+0xdf6/0x1210
btrfs_reserve_extent+0xb3/0x1b0
btrfs_alloc_tree_block+0xb0/0x310
alloc_tree_block_no_bg_flush+0x4a/0x60
__btrfs_cow_block+0x11a/0x530
btrfs_cow_block+0x104/0x220
btrfs_search_slot+0x52e/0x9d0
btrfs_insert_empty_items+0x64/0xb0
btrfs_new_inode+0x225/0x730
btrfs_create+0xab/0x1f0
lookup_open.isra.0+0x52d/0x690
path_openat+0x2a7/0x9e0
do_filp_open+0x75/0x100
do_sys_openat2+0x7b/0x130
__x64_sys_openat+0x46/0x70
do_syscall_64+0x33/0x40
entry_SYSCALL_64_after_hwframe+0x44/0xa9
-> #1 (&fs_info->chunk_mutex){+.+.}-{3:3}:
__mutex_lock+0x7e/0x7e0
btrfs_chunk_alloc+0x125/0x3a0
find_free_extent+0xdf6/0x1210
btrfs_reserve_extent+0xb3/0x1b0
btrfs_alloc_tree_block+0xb0/0x310
alloc_tree_block_no_bg_flush+0x4a/0x60
__btrfs_cow_block+0x11a/0x530
btrfs_cow_block+0x104/0x220
btrfs_search_slot+0x52e/0x9d0
btrfs_lookup_inode+0x2a/0x8f
__btrfs_update_delayed_inode+0x80/0x240
btrfs_commit_inode_delayed_inode+0x119/0x120
btrfs_evict_inode+0x357/0x500
evict+0xcf/0x1f0
do_unlinkat+0x1a9/0x2b0
do_syscall_64+0x33/0x40
entry_SYSCALL_64_after_hwframe+0x44/0xa9
-> #0 (&delayed_node->mutex){+.+.}-{3:3}:
__lock_acquire+0x119c/0x1fc0
lock_acquire+0xa7/0x3d0
__mutex_lock+0x7e/0x7e0
__btrfs_release_delayed_node.part.0+0x3f/0x330
btrfs_evict_inode+0x24c/0x500
evict+0xcf/0x1f0
dispose_list+0x48/0x70
prune_icache_sb+0x44/0x50
super_cache_scan+0x161/0x1e0
do_shrink_slab+0x178/0x3c0
shrink_slab+0x17c/0x290
shrink_node+0x2b2/0x6d0
balance_pgdat+0x30a/0x670
kswapd+0x213/0x4c0
kthread+0x138/0x160
ret_from_fork+0x1f/0x30
other info that might help us debug this:
Chain exists of:
&delayed_node->mutex --> kernfs_mutex --> fs_reclaim
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(fs_reclaim);
lock(kernfs_mutex);
lock(fs_reclaim);
lock(&delayed_node->mutex);
*** DEADLOCK ***
3 locks held by kswapd0/100:
#0: ffffffff9fd74700 (fs_reclaim){+.+.}-{0:0}, at: __fs_reclaim_acquire+0x5/0x30
#1: ffffffff9fd65c50 (shrinker_rwsem){++++}-{3:3}, at: shrink_slab+0x115/0x290
#2: ffff9706629780e0 (&type->s_umount_key#36){++++}-{3:3}, at: super_cache_scan+0x38/0x1e0
stack backtrace:
CPU: 1 PID: 100 Comm: kswapd0 Not tainted 5.9.0-rc3+ #5
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014
Call Trace:
dump_stack+0x8b/0xb8
check_noncircular+0x12d/0x150
__lock_acquire+0x119c/0x1fc0
lock_acquire+0xa7/0x3d0
? __btrfs_release_delayed_node.part.0+0x3f/0x330
__mutex_lock+0x7e/0x7e0
? __btrfs_release_delayed_node.part.0+0x3f/0x330
? __btrfs_release_delayed_node.part.0+0x3f/0x330
? lock_acquire+0xa7/0x3d0
? find_held_lock+0x2b/0x80
__btrfs_release_delayed_node.part.0+0x3f/0x330
btrfs_evict_inode+0x24c/0x500
evict+0xcf/0x1f0
dispose_list+0x48/0x70
prune_icache_sb+0x44/0x50
super_cache_scan+0x161/0x1e0
do_shrink_slab+0x178/0x3c0
shrink_slab+0x17c/0x290
shrink_node+0x2b2/0x6d0
balance_pgdat+0x30a/0x670
kswapd+0x213/0x4c0
? _raw_spin_unlock_irqrestore+0x41/0x50
? add_wait_queue_exclusive+0x70/0x70
? balance_pgdat+0x670/0x670
kthread+0x138/0x160
? kthread_create_worker_on_cpu+0x40/0x40
ret_from_fork+0x1f/0x30
This happens because when we link in a block group with a new raid index
type we'll create the corresponding sysfs entries for it. This is
problematic because while restriping we're holding the chunk_mutex, and
while mounting we're holding the tree locks.
Fixing this isn't pretty, we move the call to the sysfs stuff into the
btrfs_create_pending_block_groups() work, where we're not holding any
locks. This creates a slight race where other threads could see that
there's no sysfs kobj for that raid type, and race to create the
sysfs dir. Fix this by wrapping the creation in space_info->lock, so we
only get one thread calling kobject_add() for the new directory. We
don't worry about the lock on cleanup as it only gets deleted on
unmount.
On mount it's more straightforward, we loop through the space_infos
already, just check every raid index in each space_info and added the
sysfs entries for the corresponding block groups.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-09-01 17:40:38 -04:00
|
|
|
int index;
|
|
|
|
|
2019-06-20 15:37:57 -04:00
|
|
|
block_group = list_first_entry(&trans->new_bgs,
|
2019-10-29 19:20:18 +01:00
|
|
|
struct btrfs_block_group,
|
2019-06-20 15:37:57 -04:00
|
|
|
bg_list);
|
|
|
|
if (ret)
|
|
|
|
goto next;
|
|
|
|
|
btrfs: do not create raid sysfs entries under any locks
While running xfstests btrfs/177 I got the following lockdep splat
======================================================
WARNING: possible circular locking dependency detected
5.9.0-rc3+ #5 Not tainted
------------------------------------------------------
kswapd0/100 is trying to acquire lock:
ffff97066aa56760 (&delayed_node->mutex){+.+.}-{3:3}, at: __btrfs_release_delayed_node.part.0+0x3f/0x330
but task is already holding lock:
ffffffff9fd74700 (fs_reclaim){+.+.}-{0:0}, at: __fs_reclaim_acquire+0x5/0x30
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #3 (fs_reclaim){+.+.}-{0:0}:
fs_reclaim_acquire+0x65/0x80
slab_pre_alloc_hook.constprop.0+0x20/0x200
kmem_cache_alloc+0x37/0x270
alloc_inode+0x82/0xb0
iget_locked+0x10d/0x2c0
kernfs_get_inode+0x1b/0x130
kernfs_get_tree+0x136/0x240
sysfs_get_tree+0x16/0x40
vfs_get_tree+0x28/0xc0
path_mount+0x434/0xc00
__x64_sys_mount+0xe3/0x120
do_syscall_64+0x33/0x40
entry_SYSCALL_64_after_hwframe+0x44/0xa9
-> #2 (kernfs_mutex){+.+.}-{3:3}:
__mutex_lock+0x7e/0x7e0
kernfs_add_one+0x23/0x150
kernfs_create_dir_ns+0x7a/0xb0
sysfs_create_dir_ns+0x60/0xb0
kobject_add_internal+0xc0/0x2c0
kobject_add+0x6e/0x90
btrfs_sysfs_add_block_group_type+0x102/0x160
btrfs_make_block_group+0x167/0x230
btrfs_alloc_chunk+0x54f/0xb80
btrfs_chunk_alloc+0x18e/0x3a0
find_free_extent+0xdf6/0x1210
btrfs_reserve_extent+0xb3/0x1b0
btrfs_alloc_tree_block+0xb0/0x310
alloc_tree_block_no_bg_flush+0x4a/0x60
__btrfs_cow_block+0x11a/0x530
btrfs_cow_block+0x104/0x220
btrfs_search_slot+0x52e/0x9d0
btrfs_insert_empty_items+0x64/0xb0
btrfs_new_inode+0x225/0x730
btrfs_create+0xab/0x1f0
lookup_open.isra.0+0x52d/0x690
path_openat+0x2a7/0x9e0
do_filp_open+0x75/0x100
do_sys_openat2+0x7b/0x130
__x64_sys_openat+0x46/0x70
do_syscall_64+0x33/0x40
entry_SYSCALL_64_after_hwframe+0x44/0xa9
-> #1 (&fs_info->chunk_mutex){+.+.}-{3:3}:
__mutex_lock+0x7e/0x7e0
btrfs_chunk_alloc+0x125/0x3a0
find_free_extent+0xdf6/0x1210
btrfs_reserve_extent+0xb3/0x1b0
btrfs_alloc_tree_block+0xb0/0x310
alloc_tree_block_no_bg_flush+0x4a/0x60
__btrfs_cow_block+0x11a/0x530
btrfs_cow_block+0x104/0x220
btrfs_search_slot+0x52e/0x9d0
btrfs_lookup_inode+0x2a/0x8f
__btrfs_update_delayed_inode+0x80/0x240
btrfs_commit_inode_delayed_inode+0x119/0x120
btrfs_evict_inode+0x357/0x500
evict+0xcf/0x1f0
do_unlinkat+0x1a9/0x2b0
do_syscall_64+0x33/0x40
entry_SYSCALL_64_after_hwframe+0x44/0xa9
-> #0 (&delayed_node->mutex){+.+.}-{3:3}:
__lock_acquire+0x119c/0x1fc0
lock_acquire+0xa7/0x3d0
__mutex_lock+0x7e/0x7e0
__btrfs_release_delayed_node.part.0+0x3f/0x330
btrfs_evict_inode+0x24c/0x500
evict+0xcf/0x1f0
dispose_list+0x48/0x70
prune_icache_sb+0x44/0x50
super_cache_scan+0x161/0x1e0
do_shrink_slab+0x178/0x3c0
shrink_slab+0x17c/0x290
shrink_node+0x2b2/0x6d0
balance_pgdat+0x30a/0x670
kswapd+0x213/0x4c0
kthread+0x138/0x160
ret_from_fork+0x1f/0x30
other info that might help us debug this:
Chain exists of:
&delayed_node->mutex --> kernfs_mutex --> fs_reclaim
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(fs_reclaim);
lock(kernfs_mutex);
lock(fs_reclaim);
lock(&delayed_node->mutex);
*** DEADLOCK ***
3 locks held by kswapd0/100:
#0: ffffffff9fd74700 (fs_reclaim){+.+.}-{0:0}, at: __fs_reclaim_acquire+0x5/0x30
#1: ffffffff9fd65c50 (shrinker_rwsem){++++}-{3:3}, at: shrink_slab+0x115/0x290
#2: ffff9706629780e0 (&type->s_umount_key#36){++++}-{3:3}, at: super_cache_scan+0x38/0x1e0
stack backtrace:
CPU: 1 PID: 100 Comm: kswapd0 Not tainted 5.9.0-rc3+ #5
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014
Call Trace:
dump_stack+0x8b/0xb8
check_noncircular+0x12d/0x150
__lock_acquire+0x119c/0x1fc0
lock_acquire+0xa7/0x3d0
? __btrfs_release_delayed_node.part.0+0x3f/0x330
__mutex_lock+0x7e/0x7e0
? __btrfs_release_delayed_node.part.0+0x3f/0x330
? __btrfs_release_delayed_node.part.0+0x3f/0x330
? lock_acquire+0xa7/0x3d0
? find_held_lock+0x2b/0x80
__btrfs_release_delayed_node.part.0+0x3f/0x330
btrfs_evict_inode+0x24c/0x500
evict+0xcf/0x1f0
dispose_list+0x48/0x70
prune_icache_sb+0x44/0x50
super_cache_scan+0x161/0x1e0
do_shrink_slab+0x178/0x3c0
shrink_slab+0x17c/0x290
shrink_node+0x2b2/0x6d0
balance_pgdat+0x30a/0x670
kswapd+0x213/0x4c0
? _raw_spin_unlock_irqrestore+0x41/0x50
? add_wait_queue_exclusive+0x70/0x70
? balance_pgdat+0x670/0x670
kthread+0x138/0x160
? kthread_create_worker_on_cpu+0x40/0x40
ret_from_fork+0x1f/0x30
This happens because when we link in a block group with a new raid index
type we'll create the corresponding sysfs entries for it. This is
problematic because while restriping we're holding the chunk_mutex, and
while mounting we're holding the tree locks.
Fixing this isn't pretty, we move the call to the sysfs stuff into the
btrfs_create_pending_block_groups() work, where we're not holding any
locks. This creates a slight race where other threads could see that
there's no sysfs kobj for that raid type, and race to create the
sysfs dir. Fix this by wrapping the creation in space_info->lock, so we
only get one thread calling kobject_add() for the new directory. We
don't worry about the lock on cleanup as it only gets deleted on
unmount.
On mount it's more straightforward, we loop through the space_infos
already, just check every raid index in each space_info and added the
sysfs entries for the corresponding block groups.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-09-01 17:40:38 -04:00
|
|
|
index = btrfs_bg_flags_to_raid_index(block_group->flags);
|
|
|
|
|
2020-05-05 07:58:22 +08:00
|
|
|
ret = insert_block_group_item(trans, block_group);
|
2019-06-20 15:37:57 -04:00
|
|
|
if (ret)
|
|
|
|
btrfs_abort_transaction(trans, ret);
|
2022-07-15 15:45:24 -04:00
|
|
|
if (!test_bit(BLOCK_GROUP_FLAG_CHUNK_ITEM_INSERTED,
|
|
|
|
&block_group->runtime_flags)) {
|
btrfs: rework chunk allocation to avoid exhaustion of the system chunk array
Commit eafa4fd0ad0607 ("btrfs: fix exhaustion of the system chunk array
due to concurrent allocations") fixed a problem that resulted in
exhausting the system chunk array in the superblock when there are many
tasks allocating chunks in parallel. Basically too many tasks enter the
first phase of chunk allocation without previous tasks having finished
their second phase of allocation, resulting in too many system chunks
being allocated. That was originally observed when running the fallocate
tests of stress-ng on a PowerPC machine, using a node size of 64K.
However that commit also introduced a deadlock where a task in phase 1 of
the chunk allocation waited for another task that had allocated a system
chunk to finish its phase 2, but that other task was waiting on an extent
buffer lock held by the first task, therefore resulting in both tasks not
making any progress. That change was later reverted by a patch with the
subject "btrfs: fix deadlock with concurrent chunk allocations involving
system chunks", since there is no simple and short solution to address it
and the deadlock is relatively easy to trigger on zoned filesystems, while
the system chunk array exhaustion is not so common.
This change reworks the chunk allocation to avoid the system chunk array
exhaustion. It accomplishes that by making the first phase of chunk
allocation do the updates of the device items in the chunk btree and the
insertion of the new chunk item in the chunk btree. This is done while
under the protection of the chunk mutex (fs_info->chunk_mutex), in the
same critical section that checks for available system space, allocates
a new system chunk if needed and reserves system chunk space. This way
we do not have chunk space reserved until the second phase completes.
The same logic is applied to chunk removal as well, since it keeps
reserved system space long after it is done updating the chunk btree.
For direct allocation of system chunks, the previous behaviour remains,
because otherwise we would deadlock on extent buffers of the chunk btree.
Changes to the chunk btree are by large done by chunk allocation and chunk
removal, which first reserve chunk system space and then later do changes
to the chunk btree. The other remaining cases are uncommon and correspond
to adding a device, removing a device and resizing a device. All these
other cases do not pre-reserve system space, they modify the chunk btree
right away, so they don't hold reserved space for a long period like chunk
allocation and chunk removal do.
The diff of this change is huge, but more than half of it is just addition
of comments describing both how things work regarding chunk allocation and
removal, including both the new behavior and the parts of the old behavior
that did not change.
CC: stable@vger.kernel.org # 5.12+
Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Tested-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Tested-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-06-29 14:43:06 +01:00
|
|
|
mutex_lock(&fs_info->chunk_mutex);
|
|
|
|
ret = btrfs_chunk_alloc_add_chunk_item(trans, block_group);
|
|
|
|
mutex_unlock(&fs_info->chunk_mutex);
|
|
|
|
if (ret)
|
|
|
|
btrfs_abort_transaction(trans, ret);
|
|
|
|
}
|
2021-07-05 12:29:19 +03:00
|
|
|
ret = insert_dev_extents(trans, block_group->start,
|
|
|
|
block_group->length);
|
2019-06-20 15:37:57 -04:00
|
|
|
if (ret)
|
|
|
|
btrfs_abort_transaction(trans, ret);
|
|
|
|
add_block_group_free_space(trans, block_group);
|
btrfs: do not create raid sysfs entries under any locks
While running xfstests btrfs/177 I got the following lockdep splat
======================================================
WARNING: possible circular locking dependency detected
5.9.0-rc3+ #5 Not tainted
------------------------------------------------------
kswapd0/100 is trying to acquire lock:
ffff97066aa56760 (&delayed_node->mutex){+.+.}-{3:3}, at: __btrfs_release_delayed_node.part.0+0x3f/0x330
but task is already holding lock:
ffffffff9fd74700 (fs_reclaim){+.+.}-{0:0}, at: __fs_reclaim_acquire+0x5/0x30
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #3 (fs_reclaim){+.+.}-{0:0}:
fs_reclaim_acquire+0x65/0x80
slab_pre_alloc_hook.constprop.0+0x20/0x200
kmem_cache_alloc+0x37/0x270
alloc_inode+0x82/0xb0
iget_locked+0x10d/0x2c0
kernfs_get_inode+0x1b/0x130
kernfs_get_tree+0x136/0x240
sysfs_get_tree+0x16/0x40
vfs_get_tree+0x28/0xc0
path_mount+0x434/0xc00
__x64_sys_mount+0xe3/0x120
do_syscall_64+0x33/0x40
entry_SYSCALL_64_after_hwframe+0x44/0xa9
-> #2 (kernfs_mutex){+.+.}-{3:3}:
__mutex_lock+0x7e/0x7e0
kernfs_add_one+0x23/0x150
kernfs_create_dir_ns+0x7a/0xb0
sysfs_create_dir_ns+0x60/0xb0
kobject_add_internal+0xc0/0x2c0
kobject_add+0x6e/0x90
btrfs_sysfs_add_block_group_type+0x102/0x160
btrfs_make_block_group+0x167/0x230
btrfs_alloc_chunk+0x54f/0xb80
btrfs_chunk_alloc+0x18e/0x3a0
find_free_extent+0xdf6/0x1210
btrfs_reserve_extent+0xb3/0x1b0
btrfs_alloc_tree_block+0xb0/0x310
alloc_tree_block_no_bg_flush+0x4a/0x60
__btrfs_cow_block+0x11a/0x530
btrfs_cow_block+0x104/0x220
btrfs_search_slot+0x52e/0x9d0
btrfs_insert_empty_items+0x64/0xb0
btrfs_new_inode+0x225/0x730
btrfs_create+0xab/0x1f0
lookup_open.isra.0+0x52d/0x690
path_openat+0x2a7/0x9e0
do_filp_open+0x75/0x100
do_sys_openat2+0x7b/0x130
__x64_sys_openat+0x46/0x70
do_syscall_64+0x33/0x40
entry_SYSCALL_64_after_hwframe+0x44/0xa9
-> #1 (&fs_info->chunk_mutex){+.+.}-{3:3}:
__mutex_lock+0x7e/0x7e0
btrfs_chunk_alloc+0x125/0x3a0
find_free_extent+0xdf6/0x1210
btrfs_reserve_extent+0xb3/0x1b0
btrfs_alloc_tree_block+0xb0/0x310
alloc_tree_block_no_bg_flush+0x4a/0x60
__btrfs_cow_block+0x11a/0x530
btrfs_cow_block+0x104/0x220
btrfs_search_slot+0x52e/0x9d0
btrfs_lookup_inode+0x2a/0x8f
__btrfs_update_delayed_inode+0x80/0x240
btrfs_commit_inode_delayed_inode+0x119/0x120
btrfs_evict_inode+0x357/0x500
evict+0xcf/0x1f0
do_unlinkat+0x1a9/0x2b0
do_syscall_64+0x33/0x40
entry_SYSCALL_64_after_hwframe+0x44/0xa9
-> #0 (&delayed_node->mutex){+.+.}-{3:3}:
__lock_acquire+0x119c/0x1fc0
lock_acquire+0xa7/0x3d0
__mutex_lock+0x7e/0x7e0
__btrfs_release_delayed_node.part.0+0x3f/0x330
btrfs_evict_inode+0x24c/0x500
evict+0xcf/0x1f0
dispose_list+0x48/0x70
prune_icache_sb+0x44/0x50
super_cache_scan+0x161/0x1e0
do_shrink_slab+0x178/0x3c0
shrink_slab+0x17c/0x290
shrink_node+0x2b2/0x6d0
balance_pgdat+0x30a/0x670
kswapd+0x213/0x4c0
kthread+0x138/0x160
ret_from_fork+0x1f/0x30
other info that might help us debug this:
Chain exists of:
&delayed_node->mutex --> kernfs_mutex --> fs_reclaim
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(fs_reclaim);
lock(kernfs_mutex);
lock(fs_reclaim);
lock(&delayed_node->mutex);
*** DEADLOCK ***
3 locks held by kswapd0/100:
#0: ffffffff9fd74700 (fs_reclaim){+.+.}-{0:0}, at: __fs_reclaim_acquire+0x5/0x30
#1: ffffffff9fd65c50 (shrinker_rwsem){++++}-{3:3}, at: shrink_slab+0x115/0x290
#2: ffff9706629780e0 (&type->s_umount_key#36){++++}-{3:3}, at: super_cache_scan+0x38/0x1e0
stack backtrace:
CPU: 1 PID: 100 Comm: kswapd0 Not tainted 5.9.0-rc3+ #5
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014
Call Trace:
dump_stack+0x8b/0xb8
check_noncircular+0x12d/0x150
__lock_acquire+0x119c/0x1fc0
lock_acquire+0xa7/0x3d0
? __btrfs_release_delayed_node.part.0+0x3f/0x330
__mutex_lock+0x7e/0x7e0
? __btrfs_release_delayed_node.part.0+0x3f/0x330
? __btrfs_release_delayed_node.part.0+0x3f/0x330
? lock_acquire+0xa7/0x3d0
? find_held_lock+0x2b/0x80
__btrfs_release_delayed_node.part.0+0x3f/0x330
btrfs_evict_inode+0x24c/0x500
evict+0xcf/0x1f0
dispose_list+0x48/0x70
prune_icache_sb+0x44/0x50
super_cache_scan+0x161/0x1e0
do_shrink_slab+0x178/0x3c0
shrink_slab+0x17c/0x290
shrink_node+0x2b2/0x6d0
balance_pgdat+0x30a/0x670
kswapd+0x213/0x4c0
? _raw_spin_unlock_irqrestore+0x41/0x50
? add_wait_queue_exclusive+0x70/0x70
? balance_pgdat+0x670/0x670
kthread+0x138/0x160
? kthread_create_worker_on_cpu+0x40/0x40
ret_from_fork+0x1f/0x30
This happens because when we link in a block group with a new raid index
type we'll create the corresponding sysfs entries for it. This is
problematic because while restriping we're holding the chunk_mutex, and
while mounting we're holding the tree locks.
Fixing this isn't pretty, we move the call to the sysfs stuff into the
btrfs_create_pending_block_groups() work, where we're not holding any
locks. This creates a slight race where other threads could see that
there's no sysfs kobj for that raid type, and race to create the
sysfs dir. Fix this by wrapping the creation in space_info->lock, so we
only get one thread calling kobject_add() for the new directory. We
don't worry about the lock on cleanup as it only gets deleted on
unmount.
On mount it's more straightforward, we loop through the space_infos
already, just check every raid index in each space_info and added the
sysfs entries for the corresponding block groups.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-09-01 17:40:38 -04:00
|
|
|
|
|
|
|
/*
|
|
|
|
* If we restriped during balance, we may have added a new raid
|
|
|
|
* type, so now add the sysfs entries when it is safe to do so.
|
|
|
|
* We don't have to worry about locking here as it's handled in
|
|
|
|
* btrfs_sysfs_add_block_group_type.
|
|
|
|
*/
|
|
|
|
if (block_group->space_info->block_group_kobjs[index] == NULL)
|
|
|
|
btrfs_sysfs_add_block_group_type(block_group);
|
|
|
|
|
2019-06-20 15:37:57 -04:00
|
|
|
/* Already aborted the transaction if it failed. */
|
|
|
|
next:
|
2023-09-28 11:12:50 +01:00
|
|
|
btrfs_dec_delayed_refs_rsv_bg_inserts(fs_info);
|
2019-06-20 15:37:57 -04:00
|
|
|
list_del_init(&block_group->bg_list);
|
btrfs: fix use-after-free of new block group that became unused
If a task creates a new block group and that block group becomes unused
before we finish its creation, at btrfs_create_pending_block_groups(),
then when btrfs_mark_bg_unused() is called against the block group, we
assume that the block group is currently in the list of block groups to
reclaim, and we move it out of the list of new block groups and into the
list of unused block groups. This has two consequences:
1) We move it out of the list of new block groups associated to the
current transaction. So the block group creation is not finished and
if we attempt to delete the bg because it's unused, we will not find
the block group item in the extent tree (or the new block group tree),
its device extent items in the device tree etc, resulting in the
deletion to fail due to the missing items;
2) We don't increment the reference count on the block group when we
move it to the list of unused block groups, because we assumed the
block group was on the list of block groups to reclaim, and in that
case it already has the correct reference count. However the block
group was on the list of new block groups, in which case no extra
reference was taken because it's local to the current task. This
later results in doing an extra reference count decrement when
removing the block group from the unused list, eventually leading the
reference count to 0.
This second case was caught when running generic/297 from fstests, which
produced the following assertion failure and stack trace:
[589.559] assertion failed: refcount_read(&block_group->refs) == 1, in fs/btrfs/block-group.c:4299
[589.559] ------------[ cut here ]------------
[589.559] kernel BUG at fs/btrfs/block-group.c:4299!
[589.560] invalid opcode: 0000 [#1] PREEMPT SMP PTI
[589.560] CPU: 8 PID: 2819134 Comm: umount Tainted: G W 6.4.0-rc6-btrfs-next-134+ #1
[589.560] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.2-0-gea1b7a073390-prebuilt.qemu.org 04/01/2014
[589.560] RIP: 0010:btrfs_free_block_groups+0x449/0x4a0 [btrfs]
[589.561] Code: 68 62 da c0 (...)
[589.561] RSP: 0018:ffffa55a8c3b3d98 EFLAGS: 00010246
[589.561] RAX: 0000000000000058 RBX: ffff8f030d7f2000 RCX: 0000000000000000
[589.562] RDX: 0000000000000000 RSI: ffffffff953f0878 RDI: 00000000ffffffff
[589.562] RBP: ffff8f030d7f2088 R08: 0000000000000000 R09: ffffa55a8c3b3c50
[589.562] R10: 0000000000000001 R11: 0000000000000001 R12: ffff8f05850b4c00
[589.562] R13: ffff8f030d7f2090 R14: ffff8f05850b4cd8 R15: dead000000000100
[589.563] FS: 00007f497fd2e840(0000) GS:ffff8f09dfc00000(0000) knlGS:0000000000000000
[589.563] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[589.563] CR2: 00007f497ff8ec10 CR3: 0000000271472006 CR4: 0000000000370ee0
[589.563] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[589.564] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[589.564] Call Trace:
[589.564] <TASK>
[589.565] ? __die_body+0x1b/0x60
[589.565] ? die+0x39/0x60
[589.565] ? do_trap+0xeb/0x110
[589.565] ? btrfs_free_block_groups+0x449/0x4a0 [btrfs]
[589.566] ? do_error_trap+0x6a/0x90
[589.566] ? btrfs_free_block_groups+0x449/0x4a0 [btrfs]
[589.566] ? exc_invalid_op+0x4e/0x70
[589.566] ? btrfs_free_block_groups+0x449/0x4a0 [btrfs]
[589.567] ? asm_exc_invalid_op+0x16/0x20
[589.567] ? btrfs_free_block_groups+0x449/0x4a0 [btrfs]
[589.567] ? btrfs_free_block_groups+0x449/0x4a0 [btrfs]
[589.567] close_ctree+0x35d/0x560 [btrfs]
[589.568] ? fsnotify_sb_delete+0x13e/0x1d0
[589.568] ? dispose_list+0x3a/0x50
[589.568] ? evict_inodes+0x151/0x1a0
[589.568] generic_shutdown_super+0x73/0x1a0
[589.569] kill_anon_super+0x14/0x30
[589.569] btrfs_kill_super+0x12/0x20 [btrfs]
[589.569] deactivate_locked_super+0x2e/0x70
[589.569] cleanup_mnt+0x104/0x160
[589.570] task_work_run+0x56/0x90
[589.570] exit_to_user_mode_prepare+0x160/0x170
[589.570] syscall_exit_to_user_mode+0x22/0x50
[589.570] ? __x64_sys_umount+0x12/0x20
[589.571] do_syscall_64+0x48/0x90
[589.571] entry_SYSCALL_64_after_hwframe+0x72/0xdc
[589.571] RIP: 0033:0x7f497ff0a567
[589.571] Code: af 98 0e (...)
[589.572] RSP: 002b:00007ffc98347358 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
[589.572] RAX: 0000000000000000 RBX: 00007f49800b8264 RCX: 00007f497ff0a567
[589.572] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000557f558abfa0
[589.573] RBP: 0000557f558a6ba0 R08: 0000000000000000 R09: 00007ffc98346100
[589.573] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[589.573] R13: 0000557f558abfa0 R14: 0000557f558a6cb0 R15: 0000557f558a6dd0
[589.573] </TASK>
[589.574] Modules linked in: dm_snapshot dm_thin_pool (...)
[589.576] ---[ end trace 0000000000000000 ]---
Fix this by adding a runtime flag to the block group to tell that the
block group is still in the list of new block groups, and therefore it
should not be moved to the list of unused block groups, at
btrfs_mark_bg_unused(), until the flag is cleared, when we finish the
creation of the block group at btrfs_create_pending_block_groups().
Fixes: a9f189716cf1 ("btrfs: move out now unused BG from the reclaim list")
CC: stable@vger.kernel.org # 5.15+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-28 17:13:37 +01:00
|
|
|
clear_bit(BLOCK_GROUP_FLAG_NEW, &block_group->runtime_flags);
|
btrfs: add new unused block groups to the list of unused block groups
Space reservations for metadata are, most of the time, pessimistic as we
reserve space for worst possible cases - where tree heights are at the
maximum possible height (8), we need to COW every extent buffer in a tree
path, need to split extent buffers, etc.
For data, we generally reserve the exact amount of space we are going to
allocate. The exception here is when using compression, in which case we
reserve space matching the uncompressed size, as the compression only
happens at writeback time and in the worst possible case we need that
amount of space in case the data is not compressible.
This means that when there's not available space in the corresponding
space_info object, we may need to allocate a new block group, and then
that block group might not be used after all. In this case the block
group is never added to the list of unused block groups and ends up
never being deleted - except if we unmount and mount again the fs, as
when reading block groups from disk we add unused ones to the list of
unused block groups (fs_info->unused_bgs). Otherwise a block group is
only added to the list of unused block groups when we deallocate the
last extent from it, so if no extent is ever allocated, the block group
is kept around forever.
This also means that if we have a bunch of tasks reserving space in
parallel we can end up allocating many block groups that end up never
being used or kept around for too long without being used, which has
the potential to result in ENOSPC failures in case for example we over
allocate too many metadata block groups and then end up in a state
without enough unallocated space to allocate a new data block group.
This is more likely to happen with metadata reservations as of kernel
6.7, namely since commit 28270e25c69a ("btrfs: always reserve space for
delayed refs when starting transaction"), because we started to always
reserve space for delayed references when starting a transaction handle
for a non-zero number of items, and also to try to reserve space to fill
the gap between the delayed block reserve's reserved space and its size.
So to avoid this, when finishing the creation a new block group, add the
block group to the list of unused block groups if it's still unused at
that time. This way the next time the cleaner kthread runs, it will delete
the block group if it's still unused and not needed to satisfy existing
space reservations.
Reported-by: Ivan Shapovalov <intelfx@intelfx.name>
Link: https://lore.kernel.org/linux-btrfs/9cdbf0ca9cdda1b4c84e15e548af7d7f9f926382.camel@intelfx.name/
CC: stable@vger.kernel.org # 6.7+
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-01-25 09:53:19 +00:00
|
|
|
|
|
|
|
/*
|
|
|
|
* If the block group is still unused, add it to the list of
|
|
|
|
* unused block groups. The block group may have been created in
|
|
|
|
* order to satisfy a space reservation, in which case the
|
|
|
|
* extent allocation only happens later. But often we don't
|
|
|
|
* actually need to allocate space that we previously reserved,
|
|
|
|
* so the block group may become unused for a long time. For
|
|
|
|
* example for metadata we generally reserve space for a worst
|
|
|
|
* possible scenario, but then don't end up allocating all that
|
|
|
|
* space or none at all (due to no need to COW, extent buffers
|
|
|
|
* were already COWed in the current transaction and still
|
|
|
|
* unwritten, tree heights lower than the maximum possible
|
|
|
|
* height, etc). For data we generally reserve the axact amount
|
|
|
|
* of space we are going to allocate later, the exception is
|
|
|
|
* when using compression, as we must reserve space based on the
|
|
|
|
* uncompressed data size, because the compression is only done
|
|
|
|
* when writeback triggered and we don't know how much space we
|
|
|
|
* are actually going to need, so we reserve the uncompressed
|
|
|
|
* size because the data may be uncompressible in the worst case.
|
|
|
|
*/
|
|
|
|
if (ret == 0) {
|
|
|
|
bool used;
|
|
|
|
|
|
|
|
spin_lock(&block_group->lock);
|
|
|
|
used = btrfs_is_block_group_used(block_group);
|
|
|
|
spin_unlock(&block_group->lock);
|
|
|
|
|
|
|
|
if (!used)
|
|
|
|
btrfs_mark_bg_unused(block_group);
|
|
|
|
}
|
2019-06-20 15:37:57 -04:00
|
|
|
}
|
|
|
|
btrfs_trans_release_chunk_metadata(trans);
|
|
|
|
}
|
|
|
|
|
2021-12-15 15:40:08 -05:00
|
|
|
/*
|
|
|
|
* For extent tree v2 we use the block_group_item->chunk_offset to point at our
|
|
|
|
* global root id. For v1 it's always set to BTRFS_FIRST_CHUNK_TREE_OBJECTID.
|
|
|
|
*/
|
2024-06-26 23:39:11 +02:00
|
|
|
static u64 calculate_global_root_id(const struct btrfs_fs_info *fs_info, u64 offset)
|
2021-12-15 15:40:08 -05:00
|
|
|
{
|
|
|
|
u64 div = SZ_1G;
|
|
|
|
u64 index;
|
|
|
|
|
|
|
|
if (!btrfs_fs_incompat(fs_info, EXTENT_TREE_V2))
|
|
|
|
return BTRFS_FIRST_CHUNK_TREE_OBJECTID;
|
|
|
|
|
|
|
|
/* If we have a smaller fs index based on 128MiB. */
|
|
|
|
if (btrfs_super_total_bytes(fs_info->super_copy) <= (SZ_1G * 10ULL))
|
|
|
|
div = SZ_128M;
|
|
|
|
|
|
|
|
offset = div64_u64(offset, div);
|
|
|
|
div64_u64_rem(offset, fs_info->nr_global_roots, &index);
|
|
|
|
return index;
|
|
|
|
}
|
|
|
|
|
btrfs: rework chunk allocation to avoid exhaustion of the system chunk array
Commit eafa4fd0ad0607 ("btrfs: fix exhaustion of the system chunk array
due to concurrent allocations") fixed a problem that resulted in
exhausting the system chunk array in the superblock when there are many
tasks allocating chunks in parallel. Basically too many tasks enter the
first phase of chunk allocation without previous tasks having finished
their second phase of allocation, resulting in too many system chunks
being allocated. That was originally observed when running the fallocate
tests of stress-ng on a PowerPC machine, using a node size of 64K.
However that commit also introduced a deadlock where a task in phase 1 of
the chunk allocation waited for another task that had allocated a system
chunk to finish its phase 2, but that other task was waiting on an extent
buffer lock held by the first task, therefore resulting in both tasks not
making any progress. That change was later reverted by a patch with the
subject "btrfs: fix deadlock with concurrent chunk allocations involving
system chunks", since there is no simple and short solution to address it
and the deadlock is relatively easy to trigger on zoned filesystems, while
the system chunk array exhaustion is not so common.
This change reworks the chunk allocation to avoid the system chunk array
exhaustion. It accomplishes that by making the first phase of chunk
allocation do the updates of the device items in the chunk btree and the
insertion of the new chunk item in the chunk btree. This is done while
under the protection of the chunk mutex (fs_info->chunk_mutex), in the
same critical section that checks for available system space, allocates
a new system chunk if needed and reserves system chunk space. This way
we do not have chunk space reserved until the second phase completes.
The same logic is applied to chunk removal as well, since it keeps
reserved system space long after it is done updating the chunk btree.
For direct allocation of system chunks, the previous behaviour remains,
because otherwise we would deadlock on extent buffers of the chunk btree.
Changes to the chunk btree are by large done by chunk allocation and chunk
removal, which first reserve chunk system space and then later do changes
to the chunk btree. The other remaining cases are uncommon and correspond
to adding a device, removing a device and resizing a device. All these
other cases do not pre-reserve system space, they modify the chunk btree
right away, so they don't hold reserved space for a long period like chunk
allocation and chunk removal do.
The diff of this change is huge, but more than half of it is just addition
of comments describing both how things work regarding chunk allocation and
removal, including both the new behavior and the parts of the old behavior
that did not change.
CC: stable@vger.kernel.org # 5.12+
Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Tested-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Tested-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-06-29 14:43:06 +01:00
|
|
|
struct btrfs_block_group *btrfs_make_block_group(struct btrfs_trans_handle *trans,
|
2023-03-21 11:13:45 +00:00
|
|
|
u64 type,
|
btrfs: rework chunk allocation to avoid exhaustion of the system chunk array
Commit eafa4fd0ad0607 ("btrfs: fix exhaustion of the system chunk array
due to concurrent allocations") fixed a problem that resulted in
exhausting the system chunk array in the superblock when there are many
tasks allocating chunks in parallel. Basically too many tasks enter the
first phase of chunk allocation without previous tasks having finished
their second phase of allocation, resulting in too many system chunks
being allocated. That was originally observed when running the fallocate
tests of stress-ng on a PowerPC machine, using a node size of 64K.
However that commit also introduced a deadlock where a task in phase 1 of
the chunk allocation waited for another task that had allocated a system
chunk to finish its phase 2, but that other task was waiting on an extent
buffer lock held by the first task, therefore resulting in both tasks not
making any progress. That change was later reverted by a patch with the
subject "btrfs: fix deadlock with concurrent chunk allocations involving
system chunks", since there is no simple and short solution to address it
and the deadlock is relatively easy to trigger on zoned filesystems, while
the system chunk array exhaustion is not so common.
This change reworks the chunk allocation to avoid the system chunk array
exhaustion. It accomplishes that by making the first phase of chunk
allocation do the updates of the device items in the chunk btree and the
insertion of the new chunk item in the chunk btree. This is done while
under the protection of the chunk mutex (fs_info->chunk_mutex), in the
same critical section that checks for available system space, allocates
a new system chunk if needed and reserves system chunk space. This way
we do not have chunk space reserved until the second phase completes.
The same logic is applied to chunk removal as well, since it keeps
reserved system space long after it is done updating the chunk btree.
For direct allocation of system chunks, the previous behaviour remains,
because otherwise we would deadlock on extent buffers of the chunk btree.
Changes to the chunk btree are by large done by chunk allocation and chunk
removal, which first reserve chunk system space and then later do changes
to the chunk btree. The other remaining cases are uncommon and correspond
to adding a device, removing a device and resizing a device. All these
other cases do not pre-reserve system space, they modify the chunk btree
right away, so they don't hold reserved space for a long period like chunk
allocation and chunk removal do.
The diff of this change is huge, but more than half of it is just addition
of comments describing both how things work regarding chunk allocation and
removal, including both the new behavior and the parts of the old behavior
that did not change.
CC: stable@vger.kernel.org # 5.12+
Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Tested-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Tested-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-06-29 14:43:06 +01:00
|
|
|
u64 chunk_offset, u64 size)
|
2019-06-20 15:37:57 -04:00
|
|
|
{
|
|
|
|
struct btrfs_fs_info *fs_info = trans->fs_info;
|
2019-10-29 19:20:18 +01:00
|
|
|
struct btrfs_block_group *cache;
|
2019-06-20 15:37:57 -04:00
|
|
|
int ret;
|
|
|
|
|
|
|
|
btrfs_set_log_full_commit(trans);
|
|
|
|
|
2020-05-05 07:58:20 +08:00
|
|
|
cache = btrfs_create_block_group_cache(fs_info, chunk_offset);
|
2019-06-20 15:37:57 -04:00
|
|
|
if (!cache)
|
btrfs: rework chunk allocation to avoid exhaustion of the system chunk array
Commit eafa4fd0ad0607 ("btrfs: fix exhaustion of the system chunk array
due to concurrent allocations") fixed a problem that resulted in
exhausting the system chunk array in the superblock when there are many
tasks allocating chunks in parallel. Basically too many tasks enter the
first phase of chunk allocation without previous tasks having finished
their second phase of allocation, resulting in too many system chunks
being allocated. That was originally observed when running the fallocate
tests of stress-ng on a PowerPC machine, using a node size of 64K.
However that commit also introduced a deadlock where a task in phase 1 of
the chunk allocation waited for another task that had allocated a system
chunk to finish its phase 2, but that other task was waiting on an extent
buffer lock held by the first task, therefore resulting in both tasks not
making any progress. That change was later reverted by a patch with the
subject "btrfs: fix deadlock with concurrent chunk allocations involving
system chunks", since there is no simple and short solution to address it
and the deadlock is relatively easy to trigger on zoned filesystems, while
the system chunk array exhaustion is not so common.
This change reworks the chunk allocation to avoid the system chunk array
exhaustion. It accomplishes that by making the first phase of chunk
allocation do the updates of the device items in the chunk btree and the
insertion of the new chunk item in the chunk btree. This is done while
under the protection of the chunk mutex (fs_info->chunk_mutex), in the
same critical section that checks for available system space, allocates
a new system chunk if needed and reserves system chunk space. This way
we do not have chunk space reserved until the second phase completes.
The same logic is applied to chunk removal as well, since it keeps
reserved system space long after it is done updating the chunk btree.
For direct allocation of system chunks, the previous behaviour remains,
because otherwise we would deadlock on extent buffers of the chunk btree.
Changes to the chunk btree are by large done by chunk allocation and chunk
removal, which first reserve chunk system space and then later do changes
to the chunk btree. The other remaining cases are uncommon and correspond
to adding a device, removing a device and resizing a device. All these
other cases do not pre-reserve system space, they modify the chunk btree
right away, so they don't hold reserved space for a long period like chunk
allocation and chunk removal do.
The diff of this change is huge, but more than half of it is just addition
of comments describing both how things work regarding chunk allocation and
removal, including both the new behavior and the parts of the old behavior
that did not change.
CC: stable@vger.kernel.org # 5.12+
Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Tested-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Tested-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-06-29 14:43:06 +01:00
|
|
|
return ERR_PTR(-ENOMEM);
|
2019-06-20 15:37:57 -04:00
|
|
|
|
btrfs: fix use-after-free of new block group that became unused
If a task creates a new block group and that block group becomes unused
before we finish its creation, at btrfs_create_pending_block_groups(),
then when btrfs_mark_bg_unused() is called against the block group, we
assume that the block group is currently in the list of block groups to
reclaim, and we move it out of the list of new block groups and into the
list of unused block groups. This has two consequences:
1) We move it out of the list of new block groups associated to the
current transaction. So the block group creation is not finished and
if we attempt to delete the bg because it's unused, we will not find
the block group item in the extent tree (or the new block group tree),
its device extent items in the device tree etc, resulting in the
deletion to fail due to the missing items;
2) We don't increment the reference count on the block group when we
move it to the list of unused block groups, because we assumed the
block group was on the list of block groups to reclaim, and in that
case it already has the correct reference count. However the block
group was on the list of new block groups, in which case no extra
reference was taken because it's local to the current task. This
later results in doing an extra reference count decrement when
removing the block group from the unused list, eventually leading the
reference count to 0.
This second case was caught when running generic/297 from fstests, which
produced the following assertion failure and stack trace:
[589.559] assertion failed: refcount_read(&block_group->refs) == 1, in fs/btrfs/block-group.c:4299
[589.559] ------------[ cut here ]------------
[589.559] kernel BUG at fs/btrfs/block-group.c:4299!
[589.560] invalid opcode: 0000 [#1] PREEMPT SMP PTI
[589.560] CPU: 8 PID: 2819134 Comm: umount Tainted: G W 6.4.0-rc6-btrfs-next-134+ #1
[589.560] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.2-0-gea1b7a073390-prebuilt.qemu.org 04/01/2014
[589.560] RIP: 0010:btrfs_free_block_groups+0x449/0x4a0 [btrfs]
[589.561] Code: 68 62 da c0 (...)
[589.561] RSP: 0018:ffffa55a8c3b3d98 EFLAGS: 00010246
[589.561] RAX: 0000000000000058 RBX: ffff8f030d7f2000 RCX: 0000000000000000
[589.562] RDX: 0000000000000000 RSI: ffffffff953f0878 RDI: 00000000ffffffff
[589.562] RBP: ffff8f030d7f2088 R08: 0000000000000000 R09: ffffa55a8c3b3c50
[589.562] R10: 0000000000000001 R11: 0000000000000001 R12: ffff8f05850b4c00
[589.562] R13: ffff8f030d7f2090 R14: ffff8f05850b4cd8 R15: dead000000000100
[589.563] FS: 00007f497fd2e840(0000) GS:ffff8f09dfc00000(0000) knlGS:0000000000000000
[589.563] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[589.563] CR2: 00007f497ff8ec10 CR3: 0000000271472006 CR4: 0000000000370ee0
[589.563] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[589.564] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[589.564] Call Trace:
[589.564] <TASK>
[589.565] ? __die_body+0x1b/0x60
[589.565] ? die+0x39/0x60
[589.565] ? do_trap+0xeb/0x110
[589.565] ? btrfs_free_block_groups+0x449/0x4a0 [btrfs]
[589.566] ? do_error_trap+0x6a/0x90
[589.566] ? btrfs_free_block_groups+0x449/0x4a0 [btrfs]
[589.566] ? exc_invalid_op+0x4e/0x70
[589.566] ? btrfs_free_block_groups+0x449/0x4a0 [btrfs]
[589.567] ? asm_exc_invalid_op+0x16/0x20
[589.567] ? btrfs_free_block_groups+0x449/0x4a0 [btrfs]
[589.567] ? btrfs_free_block_groups+0x449/0x4a0 [btrfs]
[589.567] close_ctree+0x35d/0x560 [btrfs]
[589.568] ? fsnotify_sb_delete+0x13e/0x1d0
[589.568] ? dispose_list+0x3a/0x50
[589.568] ? evict_inodes+0x151/0x1a0
[589.568] generic_shutdown_super+0x73/0x1a0
[589.569] kill_anon_super+0x14/0x30
[589.569] btrfs_kill_super+0x12/0x20 [btrfs]
[589.569] deactivate_locked_super+0x2e/0x70
[589.569] cleanup_mnt+0x104/0x160
[589.570] task_work_run+0x56/0x90
[589.570] exit_to_user_mode_prepare+0x160/0x170
[589.570] syscall_exit_to_user_mode+0x22/0x50
[589.570] ? __x64_sys_umount+0x12/0x20
[589.571] do_syscall_64+0x48/0x90
[589.571] entry_SYSCALL_64_after_hwframe+0x72/0xdc
[589.571] RIP: 0033:0x7f497ff0a567
[589.571] Code: af 98 0e (...)
[589.572] RSP: 002b:00007ffc98347358 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
[589.572] RAX: 0000000000000000 RBX: 00007f49800b8264 RCX: 00007f497ff0a567
[589.572] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000557f558abfa0
[589.573] RBP: 0000557f558a6ba0 R08: 0000000000000000 R09: 00007ffc98346100
[589.573] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[589.573] R13: 0000557f558abfa0 R14: 0000557f558a6cb0 R15: 0000557f558a6dd0
[589.573] </TASK>
[589.574] Modules linked in: dm_snapshot dm_thin_pool (...)
[589.576] ---[ end trace 0000000000000000 ]---
Fix this by adding a runtime flag to the block group to tell that the
block group is still in the list of new block groups, and therefore it
should not be moved to the list of unused block groups, at
btrfs_mark_bg_unused(), until the flag is cleared, when we finish the
creation of the block group at btrfs_create_pending_block_groups().
Fixes: a9f189716cf1 ("btrfs: move out now unused BG from the reclaim list")
CC: stable@vger.kernel.org # 5.15+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-28 17:13:37 +01:00
|
|
|
/*
|
|
|
|
* Mark it as new before adding it to the rbtree of block groups or any
|
|
|
|
* list, so that no other task finds it and calls btrfs_mark_bg_unused()
|
|
|
|
* before the new flag is set.
|
|
|
|
*/
|
|
|
|
set_bit(BLOCK_GROUP_FLAG_NEW, &cache->runtime_flags);
|
|
|
|
|
2020-05-05 07:58:20 +08:00
|
|
|
cache->length = size;
|
2020-08-21 11:54:44 -03:00
|
|
|
set_free_space_tree_thresholds(cache);
|
2019-06-20 15:37:57 -04:00
|
|
|
cache->flags = type;
|
|
|
|
cache->cached = BTRFS_CACHE_FINISHED;
|
2021-12-15 15:40:08 -05:00
|
|
|
cache->global_root_id = calculate_global_root_id(fs_info, cache->start);
|
|
|
|
|
2020-11-18 15:06:18 -08:00
|
|
|
if (btrfs_fs_compat_ro(fs_info, FREE_SPACE_TREE))
|
2022-10-31 20:33:44 +01:00
|
|
|
set_bit(BLOCK_GROUP_FLAG_NEEDS_FREE_SPACE, &cache->runtime_flags);
|
2021-02-04 19:21:50 +09:00
|
|
|
|
2021-02-04 19:21:51 +09:00
|
|
|
ret = btrfs_load_block_group_zone_info(cache, true);
|
2021-02-04 19:21:50 +09:00
|
|
|
if (ret) {
|
|
|
|
btrfs_put_block_group(cache);
|
btrfs: rework chunk allocation to avoid exhaustion of the system chunk array
Commit eafa4fd0ad0607 ("btrfs: fix exhaustion of the system chunk array
due to concurrent allocations") fixed a problem that resulted in
exhausting the system chunk array in the superblock when there are many
tasks allocating chunks in parallel. Basically too many tasks enter the
first phase of chunk allocation without previous tasks having finished
their second phase of allocation, resulting in too many system chunks
being allocated. That was originally observed when running the fallocate
tests of stress-ng on a PowerPC machine, using a node size of 64K.
However that commit also introduced a deadlock where a task in phase 1 of
the chunk allocation waited for another task that had allocated a system
chunk to finish its phase 2, but that other task was waiting on an extent
buffer lock held by the first task, therefore resulting in both tasks not
making any progress. That change was later reverted by a patch with the
subject "btrfs: fix deadlock with concurrent chunk allocations involving
system chunks", since there is no simple and short solution to address it
and the deadlock is relatively easy to trigger on zoned filesystems, while
the system chunk array exhaustion is not so common.
This change reworks the chunk allocation to avoid the system chunk array
exhaustion. It accomplishes that by making the first phase of chunk
allocation do the updates of the device items in the chunk btree and the
insertion of the new chunk item in the chunk btree. This is done while
under the protection of the chunk mutex (fs_info->chunk_mutex), in the
same critical section that checks for available system space, allocates
a new system chunk if needed and reserves system chunk space. This way
we do not have chunk space reserved until the second phase completes.
The same logic is applied to chunk removal as well, since it keeps
reserved system space long after it is done updating the chunk btree.
For direct allocation of system chunks, the previous behaviour remains,
because otherwise we would deadlock on extent buffers of the chunk btree.
Changes to the chunk btree are by large done by chunk allocation and chunk
removal, which first reserve chunk system space and then later do changes
to the chunk btree. The other remaining cases are uncommon and correspond
to adding a device, removing a device and resizing a device. All these
other cases do not pre-reserve system space, they modify the chunk btree
right away, so they don't hold reserved space for a long period like chunk
allocation and chunk removal do.
The diff of this change is huge, but more than half of it is just addition
of comments describing both how things work regarding chunk allocation and
removal, including both the new behavior and the parts of the old behavior
that did not change.
CC: stable@vger.kernel.org # 5.12+
Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Tested-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Tested-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-06-29 14:43:06 +01:00
|
|
|
return ERR_PTR(ret);
|
2021-02-04 19:21:50 +09:00
|
|
|
}
|
|
|
|
|
2019-06-20 15:37:57 -04:00
|
|
|
ret = exclude_super_stripes(cache);
|
|
|
|
if (ret) {
|
|
|
|
/* We may have excluded something, so call this just in case */
|
|
|
|
btrfs_free_excluded_extents(cache);
|
|
|
|
btrfs_put_block_group(cache);
|
btrfs: rework chunk allocation to avoid exhaustion of the system chunk array
Commit eafa4fd0ad0607 ("btrfs: fix exhaustion of the system chunk array
due to concurrent allocations") fixed a problem that resulted in
exhausting the system chunk array in the superblock when there are many
tasks allocating chunks in parallel. Basically too many tasks enter the
first phase of chunk allocation without previous tasks having finished
their second phase of allocation, resulting in too many system chunks
being allocated. That was originally observed when running the fallocate
tests of stress-ng on a PowerPC machine, using a node size of 64K.
However that commit also introduced a deadlock where a task in phase 1 of
the chunk allocation waited for another task that had allocated a system
chunk to finish its phase 2, but that other task was waiting on an extent
buffer lock held by the first task, therefore resulting in both tasks not
making any progress. That change was later reverted by a patch with the
subject "btrfs: fix deadlock with concurrent chunk allocations involving
system chunks", since there is no simple and short solution to address it
and the deadlock is relatively easy to trigger on zoned filesystems, while
the system chunk array exhaustion is not so common.
This change reworks the chunk allocation to avoid the system chunk array
exhaustion. It accomplishes that by making the first phase of chunk
allocation do the updates of the device items in the chunk btree and the
insertion of the new chunk item in the chunk btree. This is done while
under the protection of the chunk mutex (fs_info->chunk_mutex), in the
same critical section that checks for available system space, allocates
a new system chunk if needed and reserves system chunk space. This way
we do not have chunk space reserved until the second phase completes.
The same logic is applied to chunk removal as well, since it keeps
reserved system space long after it is done updating the chunk btree.
For direct allocation of system chunks, the previous behaviour remains,
because otherwise we would deadlock on extent buffers of the chunk btree.
Changes to the chunk btree are by large done by chunk allocation and chunk
removal, which first reserve chunk system space and then later do changes
to the chunk btree. The other remaining cases are uncommon and correspond
to adding a device, removing a device and resizing a device. All these
other cases do not pre-reserve system space, they modify the chunk btree
right away, so they don't hold reserved space for a long period like chunk
allocation and chunk removal do.
The diff of this change is huge, but more than half of it is just addition
of comments describing both how things work regarding chunk allocation and
removal, including both the new behavior and the parts of the old behavior
that did not change.
CC: stable@vger.kernel.org # 5.12+
Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Tested-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Tested-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-06-29 14:43:06 +01:00
|
|
|
return ERR_PTR(ret);
|
2019-06-20 15:37:57 -04:00
|
|
|
}
|
|
|
|
|
2023-06-30 16:03:46 +01:00
|
|
|
ret = btrfs_add_new_free_space(cache, chunk_offset, chunk_offset + size, NULL);
|
2019-06-20 15:37:57 -04:00
|
|
|
btrfs_free_excluded_extents(cache);
|
2023-06-30 16:03:44 +01:00
|
|
|
if (ret) {
|
|
|
|
btrfs_put_block_group(cache);
|
|
|
|
return ERR_PTR(ret);
|
|
|
|
}
|
2019-06-20 15:37:57 -04:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Ensure the corresponding space_info object is created and
|
|
|
|
* assigned to our block group. We want our bg to be added to the rbtree
|
|
|
|
* with its ->space_info set.
|
|
|
|
*/
|
|
|
|
cache->space_info = btrfs_find_space_info(fs_info, cache->flags);
|
|
|
|
ASSERT(cache->space_info);
|
|
|
|
|
|
|
|
ret = btrfs_add_block_group_cache(fs_info, cache);
|
|
|
|
if (ret) {
|
|
|
|
btrfs_remove_free_space_cache(cache);
|
|
|
|
btrfs_put_block_group(cache);
|
btrfs: rework chunk allocation to avoid exhaustion of the system chunk array
Commit eafa4fd0ad0607 ("btrfs: fix exhaustion of the system chunk array
due to concurrent allocations") fixed a problem that resulted in
exhausting the system chunk array in the superblock when there are many
tasks allocating chunks in parallel. Basically too many tasks enter the
first phase of chunk allocation without previous tasks having finished
their second phase of allocation, resulting in too many system chunks
being allocated. That was originally observed when running the fallocate
tests of stress-ng on a PowerPC machine, using a node size of 64K.
However that commit also introduced a deadlock where a task in phase 1 of
the chunk allocation waited for another task that had allocated a system
chunk to finish its phase 2, but that other task was waiting on an extent
buffer lock held by the first task, therefore resulting in both tasks not
making any progress. That change was later reverted by a patch with the
subject "btrfs: fix deadlock with concurrent chunk allocations involving
system chunks", since there is no simple and short solution to address it
and the deadlock is relatively easy to trigger on zoned filesystems, while
the system chunk array exhaustion is not so common.
This change reworks the chunk allocation to avoid the system chunk array
exhaustion. It accomplishes that by making the first phase of chunk
allocation do the updates of the device items in the chunk btree and the
insertion of the new chunk item in the chunk btree. This is done while
under the protection of the chunk mutex (fs_info->chunk_mutex), in the
same critical section that checks for available system space, allocates
a new system chunk if needed and reserves system chunk space. This way
we do not have chunk space reserved until the second phase completes.
The same logic is applied to chunk removal as well, since it keeps
reserved system space long after it is done updating the chunk btree.
For direct allocation of system chunks, the previous behaviour remains,
because otherwise we would deadlock on extent buffers of the chunk btree.
Changes to the chunk btree are by large done by chunk allocation and chunk
removal, which first reserve chunk system space and then later do changes
to the chunk btree. The other remaining cases are uncommon and correspond
to adding a device, removing a device and resizing a device. All these
other cases do not pre-reserve system space, they modify the chunk btree
right away, so they don't hold reserved space for a long period like chunk
allocation and chunk removal do.
The diff of this change is huge, but more than half of it is just addition
of comments describing both how things work regarding chunk allocation and
removal, including both the new behavior and the parts of the old behavior
that did not change.
CC: stable@vger.kernel.org # 5.12+
Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Tested-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Tested-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-06-29 14:43:06 +01:00
|
|
|
return ERR_PTR(ret);
|
2019-06-20 15:37:57 -04:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Now that our block group has its ->space_info set and is inserted in
|
|
|
|
* the rbtree, update the space info's counters.
|
|
|
|
*/
|
|
|
|
trace_btrfs_add_block_group(fs_info, cache, 1);
|
2022-07-15 15:45:23 -04:00
|
|
|
btrfs_add_bg_to_space_info(fs_info, cache);
|
2019-06-20 15:37:57 -04:00
|
|
|
btrfs_update_global_block_rsv(fs_info);
|
|
|
|
|
2022-07-15 15:45:22 -04:00
|
|
|
#ifdef CONFIG_BTRFS_DEBUG
|
|
|
|
if (btrfs_should_fragment_free_space(cache)) {
|
2023-03-21 11:13:45 +00:00
|
|
|
cache->space_info->bytes_used += size >> 1;
|
2022-07-15 15:45:22 -04:00
|
|
|
fragment_free_space(cache);
|
|
|
|
}
|
|
|
|
#endif
|
2019-06-20 15:37:57 -04:00
|
|
|
|
|
|
|
list_add_tail(&cache->bg_list, &trans->new_bgs);
|
2023-09-28 11:12:50 +01:00
|
|
|
btrfs_inc_delayed_refs_rsv_bg_inserts(fs_info);
|
2019-06-20 15:37:57 -04:00
|
|
|
|
|
|
|
set_avail_alloc_bits(fs_info, type);
|
btrfs: rework chunk allocation to avoid exhaustion of the system chunk array
Commit eafa4fd0ad0607 ("btrfs: fix exhaustion of the system chunk array
due to concurrent allocations") fixed a problem that resulted in
exhausting the system chunk array in the superblock when there are many
tasks allocating chunks in parallel. Basically too many tasks enter the
first phase of chunk allocation without previous tasks having finished
their second phase of allocation, resulting in too many system chunks
being allocated. That was originally observed when running the fallocate
tests of stress-ng on a PowerPC machine, using a node size of 64K.
However that commit also introduced a deadlock where a task in phase 1 of
the chunk allocation waited for another task that had allocated a system
chunk to finish its phase 2, but that other task was waiting on an extent
buffer lock held by the first task, therefore resulting in both tasks not
making any progress. That change was later reverted by a patch with the
subject "btrfs: fix deadlock with concurrent chunk allocations involving
system chunks", since there is no simple and short solution to address it
and the deadlock is relatively easy to trigger on zoned filesystems, while
the system chunk array exhaustion is not so common.
This change reworks the chunk allocation to avoid the system chunk array
exhaustion. It accomplishes that by making the first phase of chunk
allocation do the updates of the device items in the chunk btree and the
insertion of the new chunk item in the chunk btree. This is done while
under the protection of the chunk mutex (fs_info->chunk_mutex), in the
same critical section that checks for available system space, allocates
a new system chunk if needed and reserves system chunk space. This way
we do not have chunk space reserved until the second phase completes.
The same logic is applied to chunk removal as well, since it keeps
reserved system space long after it is done updating the chunk btree.
For direct allocation of system chunks, the previous behaviour remains,
because otherwise we would deadlock on extent buffers of the chunk btree.
Changes to the chunk btree are by large done by chunk allocation and chunk
removal, which first reserve chunk system space and then later do changes
to the chunk btree. The other remaining cases are uncommon and correspond
to adding a device, removing a device and resizing a device. All these
other cases do not pre-reserve system space, they modify the chunk btree
right away, so they don't hold reserved space for a long period like chunk
allocation and chunk removal do.
The diff of this change is huge, but more than half of it is just addition
of comments describing both how things work regarding chunk allocation and
removal, including both the new behavior and the parts of the old behavior
that did not change.
CC: stable@vger.kernel.org # 5.12+
Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Tested-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Tested-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-06-29 14:43:06 +01:00
|
|
|
return cache;
|
2019-06-20 15:37:57 -04:00
|
|
|
}
|
2019-06-20 15:37:59 -04:00
|
|
|
|
btrfs: scrub: Don't check free space before marking a block group RO
[BUG]
When running btrfs/072 with only one online CPU, it has a pretty high
chance to fail:
btrfs/072 12s ... _check_dmesg: something found in dmesg (see xfstests-dev/results//btrfs/072.dmesg)
- output mismatch (see xfstests-dev/results//btrfs/072.out.bad)
--- tests/btrfs/072.out 2019-10-22 15:18:14.008965340 +0800
+++ /xfstests-dev/results//btrfs/072.out.bad 2019-11-14 15:56:45.877152240 +0800
@@ -1,2 +1,3 @@
QA output created by 072
Silence is golden
+Scrub find errors in "-m dup -d single" test
...
And with the following call trace:
BTRFS info (device dm-5): scrub: started on devid 1
------------[ cut here ]------------
BTRFS: Transaction aborted (error -27)
WARNING: CPU: 0 PID: 55087 at fs/btrfs/block-group.c:1890 btrfs_create_pending_block_groups+0x3e6/0x470 [btrfs]
CPU: 0 PID: 55087 Comm: btrfs Tainted: G W O 5.4.0-rc1-custom+ #13
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
RIP: 0010:btrfs_create_pending_block_groups+0x3e6/0x470 [btrfs]
Call Trace:
__btrfs_end_transaction+0xdb/0x310 [btrfs]
btrfs_end_transaction+0x10/0x20 [btrfs]
btrfs_inc_block_group_ro+0x1c9/0x210 [btrfs]
scrub_enumerate_chunks+0x264/0x940 [btrfs]
btrfs_scrub_dev+0x45c/0x8f0 [btrfs]
btrfs_ioctl+0x31a1/0x3fb0 [btrfs]
do_vfs_ioctl+0x636/0xaa0
ksys_ioctl+0x67/0x90
__x64_sys_ioctl+0x43/0x50
do_syscall_64+0x79/0xe0
entry_SYSCALL_64_after_hwframe+0x49/0xbe
---[ end trace 166c865cec7688e7 ]---
[CAUSE]
The error number -27 is -EFBIG, returned from the following call chain:
btrfs_end_transaction()
|- __btrfs_end_transaction()
|- btrfs_create_pending_block_groups()
|- btrfs_finish_chunk_alloc()
|- btrfs_add_system_chunk()
This happens because we have used up all space of
btrfs_super_block::sys_chunk_array.
The root cause is, we have the following bad loop of creating tons of
system chunks:
1. The only SYSTEM chunk is being scrubbed
It's very common to have only one SYSTEM chunk.
2. New SYSTEM bg will be allocated
As btrfs_inc_block_group_ro() will check if we have enough space
after marking current bg RO. If not, then allocate a new chunk.
3. New SYSTEM bg is still empty, will be reclaimed
During the reclaim, we will mark it RO again.
4. That newly allocated empty SYSTEM bg get scrubbed
We go back to step 2, as the bg is already mark RO but still not
cleaned up yet.
If the cleaner kthread doesn't get executed fast enough (e.g. only one
CPU), then we will get more and more empty SYSTEM chunks, using up all
the space of btrfs_super_block::sys_chunk_array.
[FIX]
Since scrub/dev-replace doesn't always need to allocate new extent,
especially chunk tree extent, so we don't really need to do chunk
pre-allocation.
To break above spiral, here we introduce a new parameter to
btrfs_inc_block_group(), @do_chunk_alloc, which indicates whether we
need extra chunk pre-allocation.
For relocation, we pass @do_chunk_alloc=true, while for scrub, we pass
@do_chunk_alloc=false.
This should keep unnecessary empty chunks from popping up for scrub.
Also, since there are two parameters for btrfs_inc_block_group_ro(),
add more comment for it.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-15 10:09:00 +08:00
|
|
|
/*
|
|
|
|
* Mark one block group RO, can be called several times for the same block
|
|
|
|
* group.
|
|
|
|
*
|
|
|
|
* @cache: the destination block group
|
|
|
|
* @do_chunk_alloc: whether need to do chunk pre-allocation, this is to
|
|
|
|
* ensure we still have some free space after marking this
|
|
|
|
* block group RO.
|
|
|
|
*/
|
|
|
|
int btrfs_inc_block_group_ro(struct btrfs_block_group *cache,
|
|
|
|
bool do_chunk_alloc)
|
2019-06-20 15:37:59 -04:00
|
|
|
{
|
|
|
|
struct btrfs_fs_info *fs_info = cache->fs_info;
|
|
|
|
struct btrfs_trans_handle *trans;
|
2021-11-05 16:45:36 -04:00
|
|
|
struct btrfs_root *root = btrfs_block_group_root(fs_info);
|
2019-06-20 15:37:59 -04:00
|
|
|
u64 alloc_flags;
|
|
|
|
int ret;
|
2021-02-17 15:12:50 +02:00
|
|
|
bool dirty_bg_running;
|
2019-06-20 15:37:59 -04:00
|
|
|
|
2021-12-16 19:47:35 +08:00
|
|
|
/*
|
|
|
|
* This can only happen when we are doing read-only scrub on read-only
|
|
|
|
* mount.
|
|
|
|
* In that case we should not start a new transaction on read-only fs.
|
|
|
|
* Thus here we skip all chunk allocations.
|
|
|
|
*/
|
|
|
|
if (sb_rdonly(fs_info->sb)) {
|
|
|
|
mutex_lock(&fs_info->ro_block_group_mutex);
|
|
|
|
ret = inc_block_group_ro(cache, 0);
|
|
|
|
mutex_unlock(&fs_info->ro_block_group_mutex);
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2021-02-17 15:12:50 +02:00
|
|
|
do {
|
2021-11-05 16:45:36 -04:00
|
|
|
trans = btrfs_join_transaction(root);
|
2021-02-17 15:12:50 +02:00
|
|
|
if (IS_ERR(trans))
|
|
|
|
return PTR_ERR(trans);
|
2019-06-20 15:37:59 -04:00
|
|
|
|
2021-02-17 15:12:50 +02:00
|
|
|
dirty_bg_running = false;
|
2019-06-20 15:37:59 -04:00
|
|
|
|
2021-02-17 15:12:50 +02:00
|
|
|
/*
|
|
|
|
* We're not allowed to set block groups readonly after the dirty
|
|
|
|
* block group cache has started writing. If it already started,
|
|
|
|
* back off and let this transaction commit.
|
|
|
|
*/
|
|
|
|
mutex_lock(&fs_info->ro_block_group_mutex);
|
|
|
|
if (test_bit(BTRFS_TRANS_DIRTY_BG_RUN, &trans->transaction->flags)) {
|
|
|
|
u64 transid = trans->transid;
|
2019-06-20 15:37:59 -04:00
|
|
|
|
2021-02-17 15:12:50 +02:00
|
|
|
mutex_unlock(&fs_info->ro_block_group_mutex);
|
|
|
|
btrfs_end_transaction(trans);
|
|
|
|
|
|
|
|
ret = btrfs_wait_for_commit(fs_info, transid);
|
|
|
|
if (ret)
|
|
|
|
return ret;
|
|
|
|
dirty_bg_running = true;
|
|
|
|
}
|
|
|
|
} while (dirty_bg_running);
|
2019-06-20 15:37:59 -04:00
|
|
|
|
btrfs: scrub: Don't check free space before marking a block group RO
[BUG]
When running btrfs/072 with only one online CPU, it has a pretty high
chance to fail:
btrfs/072 12s ... _check_dmesg: something found in dmesg (see xfstests-dev/results//btrfs/072.dmesg)
- output mismatch (see xfstests-dev/results//btrfs/072.out.bad)
--- tests/btrfs/072.out 2019-10-22 15:18:14.008965340 +0800
+++ /xfstests-dev/results//btrfs/072.out.bad 2019-11-14 15:56:45.877152240 +0800
@@ -1,2 +1,3 @@
QA output created by 072
Silence is golden
+Scrub find errors in "-m dup -d single" test
...
And with the following call trace:
BTRFS info (device dm-5): scrub: started on devid 1
------------[ cut here ]------------
BTRFS: Transaction aborted (error -27)
WARNING: CPU: 0 PID: 55087 at fs/btrfs/block-group.c:1890 btrfs_create_pending_block_groups+0x3e6/0x470 [btrfs]
CPU: 0 PID: 55087 Comm: btrfs Tainted: G W O 5.4.0-rc1-custom+ #13
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
RIP: 0010:btrfs_create_pending_block_groups+0x3e6/0x470 [btrfs]
Call Trace:
__btrfs_end_transaction+0xdb/0x310 [btrfs]
btrfs_end_transaction+0x10/0x20 [btrfs]
btrfs_inc_block_group_ro+0x1c9/0x210 [btrfs]
scrub_enumerate_chunks+0x264/0x940 [btrfs]
btrfs_scrub_dev+0x45c/0x8f0 [btrfs]
btrfs_ioctl+0x31a1/0x3fb0 [btrfs]
do_vfs_ioctl+0x636/0xaa0
ksys_ioctl+0x67/0x90
__x64_sys_ioctl+0x43/0x50
do_syscall_64+0x79/0xe0
entry_SYSCALL_64_after_hwframe+0x49/0xbe
---[ end trace 166c865cec7688e7 ]---
[CAUSE]
The error number -27 is -EFBIG, returned from the following call chain:
btrfs_end_transaction()
|- __btrfs_end_transaction()
|- btrfs_create_pending_block_groups()
|- btrfs_finish_chunk_alloc()
|- btrfs_add_system_chunk()
This happens because we have used up all space of
btrfs_super_block::sys_chunk_array.
The root cause is, we have the following bad loop of creating tons of
system chunks:
1. The only SYSTEM chunk is being scrubbed
It's very common to have only one SYSTEM chunk.
2. New SYSTEM bg will be allocated
As btrfs_inc_block_group_ro() will check if we have enough space
after marking current bg RO. If not, then allocate a new chunk.
3. New SYSTEM bg is still empty, will be reclaimed
During the reclaim, we will mark it RO again.
4. That newly allocated empty SYSTEM bg get scrubbed
We go back to step 2, as the bg is already mark RO but still not
cleaned up yet.
If the cleaner kthread doesn't get executed fast enough (e.g. only one
CPU), then we will get more and more empty SYSTEM chunks, using up all
the space of btrfs_super_block::sys_chunk_array.
[FIX]
Since scrub/dev-replace doesn't always need to allocate new extent,
especially chunk tree extent, so we don't really need to do chunk
pre-allocation.
To break above spiral, here we introduce a new parameter to
btrfs_inc_block_group(), @do_chunk_alloc, which indicates whether we
need extra chunk pre-allocation.
For relocation, we pass @do_chunk_alloc=true, while for scrub, we pass
@do_chunk_alloc=false.
This should keep unnecessary empty chunks from popping up for scrub.
Also, since there are two parameters for btrfs_inc_block_group_ro(),
add more comment for it.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-15 10:09:00 +08:00
|
|
|
if (do_chunk_alloc) {
|
2019-06-20 15:37:59 -04:00
|
|
|
/*
|
btrfs: scrub: Don't check free space before marking a block group RO
[BUG]
When running btrfs/072 with only one online CPU, it has a pretty high
chance to fail:
btrfs/072 12s ... _check_dmesg: something found in dmesg (see xfstests-dev/results//btrfs/072.dmesg)
- output mismatch (see xfstests-dev/results//btrfs/072.out.bad)
--- tests/btrfs/072.out 2019-10-22 15:18:14.008965340 +0800
+++ /xfstests-dev/results//btrfs/072.out.bad 2019-11-14 15:56:45.877152240 +0800
@@ -1,2 +1,3 @@
QA output created by 072
Silence is golden
+Scrub find errors in "-m dup -d single" test
...
And with the following call trace:
BTRFS info (device dm-5): scrub: started on devid 1
------------[ cut here ]------------
BTRFS: Transaction aborted (error -27)
WARNING: CPU: 0 PID: 55087 at fs/btrfs/block-group.c:1890 btrfs_create_pending_block_groups+0x3e6/0x470 [btrfs]
CPU: 0 PID: 55087 Comm: btrfs Tainted: G W O 5.4.0-rc1-custom+ #13
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
RIP: 0010:btrfs_create_pending_block_groups+0x3e6/0x470 [btrfs]
Call Trace:
__btrfs_end_transaction+0xdb/0x310 [btrfs]
btrfs_end_transaction+0x10/0x20 [btrfs]
btrfs_inc_block_group_ro+0x1c9/0x210 [btrfs]
scrub_enumerate_chunks+0x264/0x940 [btrfs]
btrfs_scrub_dev+0x45c/0x8f0 [btrfs]
btrfs_ioctl+0x31a1/0x3fb0 [btrfs]
do_vfs_ioctl+0x636/0xaa0
ksys_ioctl+0x67/0x90
__x64_sys_ioctl+0x43/0x50
do_syscall_64+0x79/0xe0
entry_SYSCALL_64_after_hwframe+0x49/0xbe
---[ end trace 166c865cec7688e7 ]---
[CAUSE]
The error number -27 is -EFBIG, returned from the following call chain:
btrfs_end_transaction()
|- __btrfs_end_transaction()
|- btrfs_create_pending_block_groups()
|- btrfs_finish_chunk_alloc()
|- btrfs_add_system_chunk()
This happens because we have used up all space of
btrfs_super_block::sys_chunk_array.
The root cause is, we have the following bad loop of creating tons of
system chunks:
1. The only SYSTEM chunk is being scrubbed
It's very common to have only one SYSTEM chunk.
2. New SYSTEM bg will be allocated
As btrfs_inc_block_group_ro() will check if we have enough space
after marking current bg RO. If not, then allocate a new chunk.
3. New SYSTEM bg is still empty, will be reclaimed
During the reclaim, we will mark it RO again.
4. That newly allocated empty SYSTEM bg get scrubbed
We go back to step 2, as the bg is already mark RO but still not
cleaned up yet.
If the cleaner kthread doesn't get executed fast enough (e.g. only one
CPU), then we will get more and more empty SYSTEM chunks, using up all
the space of btrfs_super_block::sys_chunk_array.
[FIX]
Since scrub/dev-replace doesn't always need to allocate new extent,
especially chunk tree extent, so we don't really need to do chunk
pre-allocation.
To break above spiral, here we introduce a new parameter to
btrfs_inc_block_group(), @do_chunk_alloc, which indicates whether we
need extra chunk pre-allocation.
For relocation, we pass @do_chunk_alloc=true, while for scrub, we pass
@do_chunk_alloc=false.
This should keep unnecessary empty chunks from popping up for scrub.
Also, since there are two parameters for btrfs_inc_block_group_ro(),
add more comment for it.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-15 10:09:00 +08:00
|
|
|
* If we are changing raid levels, try to allocate a
|
|
|
|
* corresponding block group with the new raid level.
|
2019-06-20 15:37:59 -04:00
|
|
|
*/
|
btrfs: don't adjust bg flags and use default allocation profiles
btrfs/061 has been failing consistently for me recently with a
transaction abort. We run out of space in the system chunk array, which
means we've allocated way too many system chunks than we need.
Chris added this a long time ago for balance as a poor mans restriping.
If you had a single disk and then added another disk and then did a
balance, update_block_group_flags would then figure out which RAID level
you needed.
Fast forward to today and we have restriping behavior, so we can
explicitly tell the fs that we're trying to change the raid level. This
is accomplished through the normal get_alloc_profile path.
Furthermore this code actually causes btrfs/061 to fail, because we do
things like mkfs -m dup -d single with multiple devices. This trips
this check
alloc_flags = update_block_group_flags(fs_info, cache->flags);
if (alloc_flags != cache->flags) {
ret = btrfs_chunk_alloc(trans, alloc_flags, CHUNK_ALLOC_FORCE);
in btrfs_inc_block_group_ro. Because we're balancing and scrubbing, but
not actually restriping, we keep forcing chunk allocation of RAID1
chunks. This eventually causes us to run out of system space and the
file system aborts and flips read only.
We don't need this poor mans restriping any more, simply use the normal
get_alloc_profile helper, which will get the correct alloc_flags and
thus make the right decision for chunk allocation. This keeps us from
allocating a billion system chunks and falling over.
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-07-21 10:48:45 -04:00
|
|
|
alloc_flags = btrfs_get_alloc_profile(fs_info, cache->flags);
|
btrfs: scrub: Don't check free space before marking a block group RO
[BUG]
When running btrfs/072 with only one online CPU, it has a pretty high
chance to fail:
btrfs/072 12s ... _check_dmesg: something found in dmesg (see xfstests-dev/results//btrfs/072.dmesg)
- output mismatch (see xfstests-dev/results//btrfs/072.out.bad)
--- tests/btrfs/072.out 2019-10-22 15:18:14.008965340 +0800
+++ /xfstests-dev/results//btrfs/072.out.bad 2019-11-14 15:56:45.877152240 +0800
@@ -1,2 +1,3 @@
QA output created by 072
Silence is golden
+Scrub find errors in "-m dup -d single" test
...
And with the following call trace:
BTRFS info (device dm-5): scrub: started on devid 1
------------[ cut here ]------------
BTRFS: Transaction aborted (error -27)
WARNING: CPU: 0 PID: 55087 at fs/btrfs/block-group.c:1890 btrfs_create_pending_block_groups+0x3e6/0x470 [btrfs]
CPU: 0 PID: 55087 Comm: btrfs Tainted: G W O 5.4.0-rc1-custom+ #13
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
RIP: 0010:btrfs_create_pending_block_groups+0x3e6/0x470 [btrfs]
Call Trace:
__btrfs_end_transaction+0xdb/0x310 [btrfs]
btrfs_end_transaction+0x10/0x20 [btrfs]
btrfs_inc_block_group_ro+0x1c9/0x210 [btrfs]
scrub_enumerate_chunks+0x264/0x940 [btrfs]
btrfs_scrub_dev+0x45c/0x8f0 [btrfs]
btrfs_ioctl+0x31a1/0x3fb0 [btrfs]
do_vfs_ioctl+0x636/0xaa0
ksys_ioctl+0x67/0x90
__x64_sys_ioctl+0x43/0x50
do_syscall_64+0x79/0xe0
entry_SYSCALL_64_after_hwframe+0x49/0xbe
---[ end trace 166c865cec7688e7 ]---
[CAUSE]
The error number -27 is -EFBIG, returned from the following call chain:
btrfs_end_transaction()
|- __btrfs_end_transaction()
|- btrfs_create_pending_block_groups()
|- btrfs_finish_chunk_alloc()
|- btrfs_add_system_chunk()
This happens because we have used up all space of
btrfs_super_block::sys_chunk_array.
The root cause is, we have the following bad loop of creating tons of
system chunks:
1. The only SYSTEM chunk is being scrubbed
It's very common to have only one SYSTEM chunk.
2. New SYSTEM bg will be allocated
As btrfs_inc_block_group_ro() will check if we have enough space
after marking current bg RO. If not, then allocate a new chunk.
3. New SYSTEM bg is still empty, will be reclaimed
During the reclaim, we will mark it RO again.
4. That newly allocated empty SYSTEM bg get scrubbed
We go back to step 2, as the bg is already mark RO but still not
cleaned up yet.
If the cleaner kthread doesn't get executed fast enough (e.g. only one
CPU), then we will get more and more empty SYSTEM chunks, using up all
the space of btrfs_super_block::sys_chunk_array.
[FIX]
Since scrub/dev-replace doesn't always need to allocate new extent,
especially chunk tree extent, so we don't really need to do chunk
pre-allocation.
To break above spiral, here we introduce a new parameter to
btrfs_inc_block_group(), @do_chunk_alloc, which indicates whether we
need extra chunk pre-allocation.
For relocation, we pass @do_chunk_alloc=true, while for scrub, we pass
@do_chunk_alloc=false.
This should keep unnecessary empty chunks from popping up for scrub.
Also, since there are two parameters for btrfs_inc_block_group_ro(),
add more comment for it.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-15 10:09:00 +08:00
|
|
|
if (alloc_flags != cache->flags) {
|
|
|
|
ret = btrfs_chunk_alloc(trans, alloc_flags,
|
|
|
|
CHUNK_ALLOC_FORCE);
|
|
|
|
/*
|
|
|
|
* ENOSPC is allowed here, we may have enough space
|
|
|
|
* already allocated at the new raid level to carry on
|
|
|
|
*/
|
|
|
|
if (ret == -ENOSPC)
|
|
|
|
ret = 0;
|
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
|
|
|
}
|
2019-06-20 15:37:59 -04:00
|
|
|
}
|
|
|
|
|
2020-01-17 09:07:38 -05:00
|
|
|
ret = inc_block_group_ro(cache, 0);
|
2019-06-20 15:37:59 -04:00
|
|
|
if (!ret)
|
|
|
|
goto out;
|
2023-04-13 13:57:17 +08:00
|
|
|
if (ret == -ETXTBSY)
|
|
|
|
goto unlock_out;
|
|
|
|
|
|
|
|
/*
|
2023-12-05 19:26:39 +01:00
|
|
|
* Skip chunk allocation if the bg is SYSTEM, this is to avoid system
|
2023-04-13 13:57:17 +08:00
|
|
|
* chunk allocation storm to exhaust the system chunk array. Otherwise
|
|
|
|
* we still want to try our best to mark the block group read-only.
|
|
|
|
*/
|
|
|
|
if (!do_chunk_alloc && ret == -ENOSPC &&
|
|
|
|
(cache->flags & BTRFS_BLOCK_GROUP_SYSTEM))
|
|
|
|
goto unlock_out;
|
|
|
|
|
2019-06-20 15:37:59 -04:00
|
|
|
alloc_flags = btrfs_get_alloc_profile(fs_info, cache->space_info->flags);
|
|
|
|
ret = btrfs_chunk_alloc(trans, alloc_flags, CHUNK_ALLOC_FORCE);
|
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
2022-07-09 08:18:48 +09:00
|
|
|
/*
|
|
|
|
* We have allocated a new chunk. We also need to activate that chunk to
|
|
|
|
* grant metadata tickets for zoned filesystem.
|
|
|
|
*/
|
|
|
|
ret = btrfs_zoned_activate_one_bg(fs_info, cache->space_info, true);
|
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
|
|
|
|
2019-06-20 15:38:07 -04:00
|
|
|
ret = inc_block_group_ro(cache, 0);
|
btrfs: fix race between writes to swap files and scrub
When we active a swap file, at btrfs_swap_activate(), we acquire the
exclusive operation lock to prevent the physical location of the swap
file extents to be changed by operations such as balance and device
replace/resize/remove. We also call there can_nocow_extent() which,
among other things, checks if the block group of a swap file extent is
currently RO, and if it is we can not use the extent, since a write
into it would result in COWing the extent.
However we have no protection against a scrub operation running after we
activate the swap file, which can result in the swap file extents to be
COWed while the scrub is running and operating on the respective block
group, because scrub turns a block group into RO before it processes it
and then back again to RW mode after processing it. That means an attempt
to write into a swap file extent while scrub is processing the respective
block group, will result in COWing the extent, changing its physical
location on disk.
Fix this by making sure that block groups that have extents that are used
by active swap files can not be turned into RO mode, therefore making it
not possible for a scrub to turn them into RO mode. When a scrub finds a
block group that can not be turned to RO due to the existence of extents
used by swap files, it proceeds to the next block group and logs a warning
message that mentions the block group was skipped due to active swap
files - this is the same approach we currently use for balance.
Fixes: ed46ff3d42378 ("Btrfs: support swap files")
CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-05 12:55:37 +00:00
|
|
|
if (ret == -ETXTBSY)
|
|
|
|
goto unlock_out;
|
2019-06-20 15:37:59 -04:00
|
|
|
out:
|
|
|
|
if (cache->flags & BTRFS_BLOCK_GROUP_SYSTEM) {
|
btrfs: don't adjust bg flags and use default allocation profiles
btrfs/061 has been failing consistently for me recently with a
transaction abort. We run out of space in the system chunk array, which
means we've allocated way too many system chunks than we need.
Chris added this a long time ago for balance as a poor mans restriping.
If you had a single disk and then added another disk and then did a
balance, update_block_group_flags would then figure out which RAID level
you needed.
Fast forward to today and we have restriping behavior, so we can
explicitly tell the fs that we're trying to change the raid level. This
is accomplished through the normal get_alloc_profile path.
Furthermore this code actually causes btrfs/061 to fail, because we do
things like mkfs -m dup -d single with multiple devices. This trips
this check
alloc_flags = update_block_group_flags(fs_info, cache->flags);
if (alloc_flags != cache->flags) {
ret = btrfs_chunk_alloc(trans, alloc_flags, CHUNK_ALLOC_FORCE);
in btrfs_inc_block_group_ro. Because we're balancing and scrubbing, but
not actually restriping, we keep forcing chunk allocation of RAID1
chunks. This eventually causes us to run out of system space and the
file system aborts and flips read only.
We don't need this poor mans restriping any more, simply use the normal
get_alloc_profile helper, which will get the correct alloc_flags and
thus make the right decision for chunk allocation. This keeps us from
allocating a billion system chunks and falling over.
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-07-21 10:48:45 -04:00
|
|
|
alloc_flags = btrfs_get_alloc_profile(fs_info, cache->flags);
|
2019-06-20 15:37:59 -04:00
|
|
|
mutex_lock(&fs_info->chunk_mutex);
|
|
|
|
check_system_chunk(trans, alloc_flags);
|
|
|
|
mutex_unlock(&fs_info->chunk_mutex);
|
|
|
|
}
|
btrfs: scrub: Don't check free space before marking a block group RO
[BUG]
When running btrfs/072 with only one online CPU, it has a pretty high
chance to fail:
btrfs/072 12s ... _check_dmesg: something found in dmesg (see xfstests-dev/results//btrfs/072.dmesg)
- output mismatch (see xfstests-dev/results//btrfs/072.out.bad)
--- tests/btrfs/072.out 2019-10-22 15:18:14.008965340 +0800
+++ /xfstests-dev/results//btrfs/072.out.bad 2019-11-14 15:56:45.877152240 +0800
@@ -1,2 +1,3 @@
QA output created by 072
Silence is golden
+Scrub find errors in "-m dup -d single" test
...
And with the following call trace:
BTRFS info (device dm-5): scrub: started on devid 1
------------[ cut here ]------------
BTRFS: Transaction aborted (error -27)
WARNING: CPU: 0 PID: 55087 at fs/btrfs/block-group.c:1890 btrfs_create_pending_block_groups+0x3e6/0x470 [btrfs]
CPU: 0 PID: 55087 Comm: btrfs Tainted: G W O 5.4.0-rc1-custom+ #13
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
RIP: 0010:btrfs_create_pending_block_groups+0x3e6/0x470 [btrfs]
Call Trace:
__btrfs_end_transaction+0xdb/0x310 [btrfs]
btrfs_end_transaction+0x10/0x20 [btrfs]
btrfs_inc_block_group_ro+0x1c9/0x210 [btrfs]
scrub_enumerate_chunks+0x264/0x940 [btrfs]
btrfs_scrub_dev+0x45c/0x8f0 [btrfs]
btrfs_ioctl+0x31a1/0x3fb0 [btrfs]
do_vfs_ioctl+0x636/0xaa0
ksys_ioctl+0x67/0x90
__x64_sys_ioctl+0x43/0x50
do_syscall_64+0x79/0xe0
entry_SYSCALL_64_after_hwframe+0x49/0xbe
---[ end trace 166c865cec7688e7 ]---
[CAUSE]
The error number -27 is -EFBIG, returned from the following call chain:
btrfs_end_transaction()
|- __btrfs_end_transaction()
|- btrfs_create_pending_block_groups()
|- btrfs_finish_chunk_alloc()
|- btrfs_add_system_chunk()
This happens because we have used up all space of
btrfs_super_block::sys_chunk_array.
The root cause is, we have the following bad loop of creating tons of
system chunks:
1. The only SYSTEM chunk is being scrubbed
It's very common to have only one SYSTEM chunk.
2. New SYSTEM bg will be allocated
As btrfs_inc_block_group_ro() will check if we have enough space
after marking current bg RO. If not, then allocate a new chunk.
3. New SYSTEM bg is still empty, will be reclaimed
During the reclaim, we will mark it RO again.
4. That newly allocated empty SYSTEM bg get scrubbed
We go back to step 2, as the bg is already mark RO but still not
cleaned up yet.
If the cleaner kthread doesn't get executed fast enough (e.g. only one
CPU), then we will get more and more empty SYSTEM chunks, using up all
the space of btrfs_super_block::sys_chunk_array.
[FIX]
Since scrub/dev-replace doesn't always need to allocate new extent,
especially chunk tree extent, so we don't really need to do chunk
pre-allocation.
To break above spiral, here we introduce a new parameter to
btrfs_inc_block_group(), @do_chunk_alloc, which indicates whether we
need extra chunk pre-allocation.
For relocation, we pass @do_chunk_alloc=true, while for scrub, we pass
@do_chunk_alloc=false.
This should keep unnecessary empty chunks from popping up for scrub.
Also, since there are two parameters for btrfs_inc_block_group_ro(),
add more comment for it.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-15 10:09:00 +08:00
|
|
|
unlock_out:
|
2019-06-20 15:37:59 -04:00
|
|
|
mutex_unlock(&fs_info->ro_block_group_mutex);
|
|
|
|
|
|
|
|
btrfs_end_transaction(trans);
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2019-10-29 19:20:18 +01:00
|
|
|
void btrfs_dec_block_group_ro(struct btrfs_block_group *cache)
|
2019-06-20 15:37:59 -04:00
|
|
|
{
|
|
|
|
struct btrfs_space_info *sinfo = cache->space_info;
|
|
|
|
u64 num_bytes;
|
|
|
|
|
|
|
|
BUG_ON(!cache->ro);
|
|
|
|
|
|
|
|
spin_lock(&sinfo->lock);
|
|
|
|
spin_lock(&cache->lock);
|
|
|
|
if (!--cache->ro) {
|
2021-02-04 19:21:52 +09:00
|
|
|
if (btrfs_is_zoned(cache->fs_info)) {
|
|
|
|
/* Migrate zone_unusable bytes back */
|
2021-08-19 21:19:10 +09:00
|
|
|
cache->zone_unusable =
|
2023-02-15 09:18:02 +09:00
|
|
|
(cache->alloc_offset - cache->used - cache->pinned -
|
|
|
|
cache->reserved) +
|
2021-08-19 21:19:10 +09:00
|
|
|
(cache->length - cache->zone_capacity);
|
2023-02-15 09:18:02 +09:00
|
|
|
btrfs_space_info_update_bytes_zone_unusable(cache->fs_info, sinfo,
|
|
|
|
cache->zone_unusable);
|
2021-02-04 19:21:52 +09:00
|
|
|
sinfo->bytes_readonly -= cache->zone_unusable;
|
|
|
|
}
|
btrfs: zoned: fix negative space_info->bytes_readonly
Consider we have a using block group on zoned btrfs.
|<- ZU ->|<- used ->|<---free--->|
`- Alloc offset
ZU: Zone unusable
Marking the block group read-only will migrate the zone unusable bytes
to the read-only bytes. So, we will have this.
|<- RO ->|<- used ->|<--- RO --->|
RO: Read only
When marking it back to read-write, btrfs_dec_block_group_ro()
subtracts the above "RO" bytes from the
space_info->bytes_readonly. And, it moves the zone unusable bytes back
and again subtracts those bytes from the space_info->bytes_readonly,
leading to negative bytes_readonly.
This can be observed in the output as eg.:
Data, single: total=512.00MiB, used=165.21MiB, zone_unusable=16.00EiB
Data, single: total=536870912, used=173256704, zone_unusable=18446744073603186688
This commit fixes the issue by reordering the operations.
Link: https://github.com/naota/linux/issues/37
Reported-by: David Sterba <dsterba@suse.com>
Fixes: 169e0da91a21 ("btrfs: zoned: track unusable bytes for zones")
CC: stable@vger.kernel.org # 5.12+
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-06-17 13:56:18 +09:00
|
|
|
num_bytes = cache->length - cache->reserved -
|
|
|
|
cache->pinned - cache->bytes_super -
|
|
|
|
cache->zone_unusable - cache->used;
|
|
|
|
sinfo->bytes_readonly -= num_bytes;
|
2019-06-20 15:37:59 -04:00
|
|
|
list_del_init(&cache->ro_list);
|
|
|
|
}
|
|
|
|
spin_unlock(&cache->lock);
|
|
|
|
spin_unlock(&sinfo->lock);
|
|
|
|
}
|
2019-06-20 15:38:00 -04:00
|
|
|
|
2020-05-05 07:58:23 +08:00
|
|
|
static int update_block_group_item(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_path *path,
|
|
|
|
struct btrfs_block_group *cache)
|
2019-06-20 15:38:00 -04:00
|
|
|
{
|
|
|
|
struct btrfs_fs_info *fs_info = trans->fs_info;
|
|
|
|
int ret;
|
2021-11-05 16:45:36 -04:00
|
|
|
struct btrfs_root *root = btrfs_block_group_root(fs_info);
|
2019-06-20 15:38:00 -04:00
|
|
|
unsigned long bi;
|
|
|
|
struct extent_buffer *leaf;
|
2019-10-23 18:48:11 +02:00
|
|
|
struct btrfs_block_group_item bgi;
|
2019-10-23 18:48:22 +02:00
|
|
|
struct btrfs_key key;
|
btrfs: skip update of block group item if used bytes are the same
[BACKGROUND]
When committing a transaction, we will update block group items for all
dirty block groups.
But in fact, dirty block groups don't always need to update their block
group items.
It's pretty common to have a metadata block group which experienced
several COW operations, but still have the same amount of used bytes.
In that case, we may unnecessarily COW a tree block doing nothing.
[ENHANCEMENT]
This patch will introduce btrfs_block_group::commit_used member to
remember the last used bytes, and use that new member to skip
unnecessary block group item update.
This would be more common for large filesystems, where metadata block
group can be as large as 1GiB, containing at most 64K metadata items.
In that case, if COW added and then deleted one metadata item near the
end of the block group, then it's completely possible we don't need to
touch the block group item at all.
[BENCHMARK]
The change itself can have quite a high chance (20~80%) to skip block
group item updates in lot of workloads.
As a result, it would result shorter time spent on
btrfs_write_dirty_block_groups(), and overall reduce the execution time
of the critical section of btrfs_commit_transaction().
Here comes a fio command, which will do random writes in 4K block size,
causing a very heavy metadata updates.
fio --filename=$mnt/file --size=512M --rw=randwrite --direct=1 --bs=4k \
--ioengine=libaio --iodepth=64 --runtime=300 --numjobs=4 \
--name=random_write --fallocate=none --time_based --fsync_on_close=1
The file size (512M) and number of threads (4) means 2GiB file size in
total, but during the full 300s run time, my dedicated SATA SSD is able
to write around 20~25GiB, which is over 10 times the file size.
Thus after we fill the initial 2G, we should not cause much block group
item updates.
Please note, the fio numbers by themselves don't have much change, but
if we look deeper, there is some reduced execution time, especially for
the critical section of btrfs_commit_transaction().
I added extra trace_printk() to measure the following per-transaction
execution time:
- Critical section of btrfs_commit_transaction()
By re-using the existing update_commit_stats() function, which
has already calculated the interval correctly.
- The while() loop for btrfs_write_dirty_block_groups()
Although this includes the execution time of btrfs_run_delayed_refs(),
it should still be representative overall.
Both result involves transid 7~30, the same amount of transaction
committed.
The result looks like this:
| Before | After | Diff
----------------------+-------------------+----------------+--------
Transaction interval | 229247198.5 | 215016933.6 | -6.2%
Block group interval | 23133.33333 | 18970.83333 | -18.0%
The change in block group item updates is more obvious, as skipped block
group item updates also mean less delayed refs.
And the overall execution time for that block group update loop is
pretty small, thus we can assume the extent tree is already mostly
cached. If we can skip an uncached tree block, it would cause more
obvious change.
Unfortunately the overall reduction in commit transaction critical
section is much smaller, as the block group item updates loop is not
really the major part, at least not for the above fio script.
But still we have a observable reduction in the critical section.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-09 14:45:22 +08:00
|
|
|
u64 old_commit_used;
|
|
|
|
u64 used;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Block group items update can be triggered out of commit transaction
|
|
|
|
* critical section, thus we need a consistent view of used bytes.
|
|
|
|
* We cannot use cache->used directly outside of the spin lock, as it
|
|
|
|
* may be changed.
|
|
|
|
*/
|
|
|
|
spin_lock(&cache->lock);
|
|
|
|
old_commit_used = cache->commit_used;
|
|
|
|
used = cache->used;
|
|
|
|
/* No change in used bytes, can safely skip it. */
|
|
|
|
if (cache->commit_used == used) {
|
|
|
|
spin_unlock(&cache->lock);
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
cache->commit_used = used;
|
|
|
|
spin_unlock(&cache->lock);
|
2019-10-23 18:48:22 +02:00
|
|
|
|
|
|
|
key.objectid = cache->start;
|
|
|
|
key.type = BTRFS_BLOCK_GROUP_ITEM_KEY;
|
|
|
|
key.offset = cache->length;
|
2019-06-20 15:38:00 -04:00
|
|
|
|
2020-05-05 07:58:23 +08:00
|
|
|
ret = btrfs_search_slot(trans, root, &key, path, 0, 1);
|
2019-06-20 15:38:00 -04:00
|
|
|
if (ret) {
|
|
|
|
if (ret > 0)
|
|
|
|
ret = -ENOENT;
|
|
|
|
goto fail;
|
|
|
|
}
|
|
|
|
|
|
|
|
leaf = path->nodes[0];
|
|
|
|
bi = btrfs_item_ptr_offset(leaf, path->slots[0]);
|
btrfs: skip update of block group item if used bytes are the same
[BACKGROUND]
When committing a transaction, we will update block group items for all
dirty block groups.
But in fact, dirty block groups don't always need to update their block
group items.
It's pretty common to have a metadata block group which experienced
several COW operations, but still have the same amount of used bytes.
In that case, we may unnecessarily COW a tree block doing nothing.
[ENHANCEMENT]
This patch will introduce btrfs_block_group::commit_used member to
remember the last used bytes, and use that new member to skip
unnecessary block group item update.
This would be more common for large filesystems, where metadata block
group can be as large as 1GiB, containing at most 64K metadata items.
In that case, if COW added and then deleted one metadata item near the
end of the block group, then it's completely possible we don't need to
touch the block group item at all.
[BENCHMARK]
The change itself can have quite a high chance (20~80%) to skip block
group item updates in lot of workloads.
As a result, it would result shorter time spent on
btrfs_write_dirty_block_groups(), and overall reduce the execution time
of the critical section of btrfs_commit_transaction().
Here comes a fio command, which will do random writes in 4K block size,
causing a very heavy metadata updates.
fio --filename=$mnt/file --size=512M --rw=randwrite --direct=1 --bs=4k \
--ioengine=libaio --iodepth=64 --runtime=300 --numjobs=4 \
--name=random_write --fallocate=none --time_based --fsync_on_close=1
The file size (512M) and number of threads (4) means 2GiB file size in
total, but during the full 300s run time, my dedicated SATA SSD is able
to write around 20~25GiB, which is over 10 times the file size.
Thus after we fill the initial 2G, we should not cause much block group
item updates.
Please note, the fio numbers by themselves don't have much change, but
if we look deeper, there is some reduced execution time, especially for
the critical section of btrfs_commit_transaction().
I added extra trace_printk() to measure the following per-transaction
execution time:
- Critical section of btrfs_commit_transaction()
By re-using the existing update_commit_stats() function, which
has already calculated the interval correctly.
- The while() loop for btrfs_write_dirty_block_groups()
Although this includes the execution time of btrfs_run_delayed_refs(),
it should still be representative overall.
Both result involves transid 7~30, the same amount of transaction
committed.
The result looks like this:
| Before | After | Diff
----------------------+-------------------+----------------+--------
Transaction interval | 229247198.5 | 215016933.6 | -6.2%
Block group interval | 23133.33333 | 18970.83333 | -18.0%
The change in block group item updates is more obvious, as skipped block
group item updates also mean less delayed refs.
And the overall execution time for that block group update loop is
pretty small, thus we can assume the extent tree is already mostly
cached. If we can skip an uncached tree block, it would cause more
obvious change.
Unfortunately the overall reduction in commit transaction critical
section is much smaller, as the block group item updates loop is not
really the major part, at least not for the above fio script.
But still we have a observable reduction in the critical section.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-09 14:45:22 +08:00
|
|
|
btrfs_set_stack_block_group_used(&bgi, used);
|
2019-10-23 18:48:18 +02:00
|
|
|
btrfs_set_stack_block_group_chunk_objectid(&bgi,
|
2021-12-15 15:40:08 -05:00
|
|
|
cache->global_root_id);
|
2019-10-23 18:48:18 +02:00
|
|
|
btrfs_set_stack_block_group_flags(&bgi, cache->flags);
|
2019-10-23 18:48:11 +02:00
|
|
|
write_extent_buffer(leaf, &bgi, bi, sizeof(bgi));
|
2023-09-12 13:04:29 +01:00
|
|
|
btrfs_mark_buffer_dirty(trans, leaf);
|
2019-06-20 15:38:00 -04:00
|
|
|
fail:
|
|
|
|
btrfs_release_path(path);
|
btrfs: fix race between finishing block group creation and its item update
Commit 675dfe1223a6 ("btrfs: fix block group item corruption after
inserting new block group") fixed one race that resulted in not persisting
a block group's item when its "used" bytes field decreases to zero.
However there's another race that can happen in a much shorter time window
that results in the same problem. The following sequence of steps explains
how it can happen:
1) Task A creates a metadata block group X, its "used" and "commit_used"
fields are initialized to 0;
2) Two extents are allocated from block group X, so its "used" field is
updated to 32K, and its "commit_used" field remains as 0;
3) Transaction commit starts, by some task B, and it enters
btrfs_start_dirty_block_groups(). There it tries to update the block
group item for block group X, which currently has its "used" field with
a value of 32K and its "commit_used" field with a value of 0. However
that fails since the block group item was not yet inserted, so at
update_block_group_item(), the btrfs_search_slot() call returns 1, and
then we set 'ret' to -ENOENT. Before jumping to the label 'fail'...
4) The block group item is inserted by task A, when for example
btrfs_create_pending_block_groups() is called when releasing its
transaction handle. This results in insert_block_group_item() inserting
the block group item in the extent tree (or block group tree), with a
"used" field having a value of 32K and setting "commit_used", in struct
btrfs_block_group, to the same value (32K);
5) Task B jumps to the 'fail' label and then resets the "commit_used"
field to 0. At btrfs_start_dirty_block_groups(), because -ENOENT was
returned from update_block_group_item(), we add the block group again
to the list of dirty block groups, so that we will try again in the
critical section of the transaction commit when calling
btrfs_write_dirty_block_groups();
6) Later the two extents from block group X are freed, so its "used" field
becomes 0;
7) If no more extents are allocated from block group X before we get into
btrfs_write_dirty_block_groups(), then when we call
update_block_group_item() again for block group X, we will not update
the block group item to reflect that it has 0 bytes used, because the
"used" and "commit_used" fields in struct btrfs_block_group have the
same value, a value of 0.
As a result after committing the transaction we have an empty block
group with its block group item having a 32K value for its "used" field.
This will trigger errors from fsck ("btrfs check" command) and after
mounting again the fs, the cleaner kthread will not automatically delete
the empty block group, since its "used" field is not 0. Possibly there
are other issues due to this inconsistency.
When this issue happens, the error reported by fsck is like this:
[1/7] checking root items
[2/7] checking extents
block group [1104150528 1073741824] used 39796736 but extent items used 0
ERROR: errors found in extent allocation tree or chunk allocation
(...)
So fix this by not resetting the "commit_used" field of a block group when
we don't find the block group item at update_block_group_item().
Fixes: 7248e0cebbef ("btrfs: skip update of block group item if used bytes are the same")
CC: stable@vger.kernel.org # 6.2+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-09-04 12:10:31 +01:00
|
|
|
/*
|
|
|
|
* We didn't update the block group item, need to revert commit_used
|
|
|
|
* unless the block group item didn't exist yet - this is to prevent a
|
|
|
|
* race with a concurrent insertion of the block group item, with
|
|
|
|
* insert_block_group_item(), that happened just after we attempted to
|
|
|
|
* update. In that case we would reset commit_used to 0 just after the
|
|
|
|
* insertion set it to a value greater than 0 - if the block group later
|
|
|
|
* becomes with 0 used bytes, we would incorrectly skip its update.
|
|
|
|
*/
|
|
|
|
if (ret < 0 && ret != -ENOENT) {
|
btrfs: skip update of block group item if used bytes are the same
[BACKGROUND]
When committing a transaction, we will update block group items for all
dirty block groups.
But in fact, dirty block groups don't always need to update their block
group items.
It's pretty common to have a metadata block group which experienced
several COW operations, but still have the same amount of used bytes.
In that case, we may unnecessarily COW a tree block doing nothing.
[ENHANCEMENT]
This patch will introduce btrfs_block_group::commit_used member to
remember the last used bytes, and use that new member to skip
unnecessary block group item update.
This would be more common for large filesystems, where metadata block
group can be as large as 1GiB, containing at most 64K metadata items.
In that case, if COW added and then deleted one metadata item near the
end of the block group, then it's completely possible we don't need to
touch the block group item at all.
[BENCHMARK]
The change itself can have quite a high chance (20~80%) to skip block
group item updates in lot of workloads.
As a result, it would result shorter time spent on
btrfs_write_dirty_block_groups(), and overall reduce the execution time
of the critical section of btrfs_commit_transaction().
Here comes a fio command, which will do random writes in 4K block size,
causing a very heavy metadata updates.
fio --filename=$mnt/file --size=512M --rw=randwrite --direct=1 --bs=4k \
--ioengine=libaio --iodepth=64 --runtime=300 --numjobs=4 \
--name=random_write --fallocate=none --time_based --fsync_on_close=1
The file size (512M) and number of threads (4) means 2GiB file size in
total, but during the full 300s run time, my dedicated SATA SSD is able
to write around 20~25GiB, which is over 10 times the file size.
Thus after we fill the initial 2G, we should not cause much block group
item updates.
Please note, the fio numbers by themselves don't have much change, but
if we look deeper, there is some reduced execution time, especially for
the critical section of btrfs_commit_transaction().
I added extra trace_printk() to measure the following per-transaction
execution time:
- Critical section of btrfs_commit_transaction()
By re-using the existing update_commit_stats() function, which
has already calculated the interval correctly.
- The while() loop for btrfs_write_dirty_block_groups()
Although this includes the execution time of btrfs_run_delayed_refs(),
it should still be representative overall.
Both result involves transid 7~30, the same amount of transaction
committed.
The result looks like this:
| Before | After | Diff
----------------------+-------------------+----------------+--------
Transaction interval | 229247198.5 | 215016933.6 | -6.2%
Block group interval | 23133.33333 | 18970.83333 | -18.0%
The change in block group item updates is more obvious, as skipped block
group item updates also mean less delayed refs.
And the overall execution time for that block group update loop is
pretty small, thus we can assume the extent tree is already mostly
cached. If we can skip an uncached tree block, it would cause more
obvious change.
Unfortunately the overall reduction in commit transaction critical
section is much smaller, as the block group item updates loop is not
really the major part, at least not for the above fio script.
But still we have a observable reduction in the critical section.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-09 14:45:22 +08:00
|
|
|
spin_lock(&cache->lock);
|
|
|
|
cache->commit_used = old_commit_used;
|
|
|
|
spin_unlock(&cache->lock);
|
|
|
|
}
|
2019-06-20 15:38:00 -04:00
|
|
|
return ret;
|
|
|
|
|
|
|
|
}
|
|
|
|
|
2019-10-29 19:20:18 +01:00
|
|
|
static int cache_save_setup(struct btrfs_block_group *block_group,
|
2019-06-20 15:38:00 -04:00
|
|
|
struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_path *path)
|
|
|
|
{
|
|
|
|
struct btrfs_fs_info *fs_info = block_group->fs_info;
|
|
|
|
struct inode *inode = NULL;
|
|
|
|
struct extent_changeset *data_reserved = NULL;
|
|
|
|
u64 alloc_hint = 0;
|
|
|
|
int dcs = BTRFS_DC_ERROR;
|
2021-04-13 14:23:14 +08:00
|
|
|
u64 cache_size = 0;
|
2019-06-20 15:38:00 -04:00
|
|
|
int retries = 0;
|
|
|
|
int ret = 0;
|
|
|
|
|
2020-11-18 15:06:26 -08:00
|
|
|
if (!btrfs_test_opt(fs_info, SPACE_CACHE))
|
|
|
|
return 0;
|
|
|
|
|
2019-06-20 15:38:00 -04:00
|
|
|
/*
|
|
|
|
* If this block group is smaller than 100 megs don't bother caching the
|
|
|
|
* block group.
|
|
|
|
*/
|
2019-10-23 18:48:22 +02:00
|
|
|
if (block_group->length < (100 * SZ_1M)) {
|
2019-06-20 15:38:00 -04:00
|
|
|
spin_lock(&block_group->lock);
|
|
|
|
block_group->disk_cache_state = BTRFS_DC_WRITTEN;
|
|
|
|
spin_unlock(&block_group->lock);
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2020-02-05 17:34:34 +01:00
|
|
|
if (TRANS_ABORTED(trans))
|
2019-06-20 15:38:00 -04:00
|
|
|
return 0;
|
|
|
|
again:
|
|
|
|
inode = lookup_free_space_inode(block_group, path);
|
|
|
|
if (IS_ERR(inode) && PTR_ERR(inode) != -ENOENT) {
|
|
|
|
ret = PTR_ERR(inode);
|
|
|
|
btrfs_release_path(path);
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (IS_ERR(inode)) {
|
|
|
|
BUG_ON(retries);
|
|
|
|
retries++;
|
|
|
|
|
|
|
|
if (block_group->ro)
|
|
|
|
goto out_free;
|
|
|
|
|
|
|
|
ret = create_free_space_inode(trans, block_group, path);
|
|
|
|
if (ret)
|
|
|
|
goto out_free;
|
|
|
|
goto again;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* We want to set the generation to 0, that way if anything goes wrong
|
|
|
|
* from here on out we know not to trust this cache when we load up next
|
|
|
|
* time.
|
|
|
|
*/
|
|
|
|
BTRFS_I(inode)->generation = 0;
|
2023-09-22 11:37:22 +01:00
|
|
|
ret = btrfs_update_inode(trans, BTRFS_I(inode));
|
2019-06-20 15:38:00 -04:00
|
|
|
if (ret) {
|
|
|
|
/*
|
|
|
|
* So theoretically we could recover from this, simply set the
|
|
|
|
* super cache generation to 0 so we know to invalidate the
|
|
|
|
* cache, but then we'd have to keep track of the block groups
|
|
|
|
* that fail this way so we know we _have_ to reset this cache
|
|
|
|
* before the next commit or risk reading stale cache. So to
|
|
|
|
* limit our exposure to horrible edge cases lets just abort the
|
|
|
|
* transaction, this only happens in really bad situations
|
|
|
|
* anyway.
|
|
|
|
*/
|
|
|
|
btrfs_abort_transaction(trans, ret);
|
|
|
|
goto out_put;
|
|
|
|
}
|
|
|
|
WARN_ON(ret);
|
|
|
|
|
|
|
|
/* We've already setup this transaction, go ahead and exit */
|
|
|
|
if (block_group->cache_generation == trans->transid &&
|
|
|
|
i_size_read(inode)) {
|
|
|
|
dcs = BTRFS_DC_SETUP;
|
|
|
|
goto out_put;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (i_size_read(inode) > 0) {
|
|
|
|
ret = btrfs_check_trunc_cache_free_space(fs_info,
|
|
|
|
&fs_info->global_block_rsv);
|
|
|
|
if (ret)
|
|
|
|
goto out_put;
|
|
|
|
|
|
|
|
ret = btrfs_truncate_free_space_cache(trans, NULL, inode);
|
|
|
|
if (ret)
|
|
|
|
goto out_put;
|
|
|
|
}
|
|
|
|
|
|
|
|
spin_lock(&block_group->lock);
|
|
|
|
if (block_group->cached != BTRFS_CACHE_FINISHED ||
|
|
|
|
!btrfs_test_opt(fs_info, SPACE_CACHE)) {
|
|
|
|
/*
|
|
|
|
* don't bother trying to write stuff out _if_
|
|
|
|
* a) we're not cached,
|
|
|
|
* b) we're with nospace_cache mount option,
|
|
|
|
* c) we're with v2 space_cache (FREE_SPACE_TREE).
|
|
|
|
*/
|
|
|
|
dcs = BTRFS_DC_WRITTEN;
|
|
|
|
spin_unlock(&block_group->lock);
|
|
|
|
goto out_put;
|
|
|
|
}
|
|
|
|
spin_unlock(&block_group->lock);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* We hit an ENOSPC when setting up the cache in this transaction, just
|
|
|
|
* skip doing the setup, we've already cleared the cache so we're safe.
|
|
|
|
*/
|
|
|
|
if (test_bit(BTRFS_TRANS_CACHE_ENOSPC, &trans->transaction->flags)) {
|
|
|
|
ret = -ENOSPC;
|
|
|
|
goto out_put;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Try to preallocate enough space based on how big the block group is.
|
|
|
|
* Keep in mind this has to include any pinned space which could end up
|
|
|
|
* taking up quite a bit since it's not folded into the other space
|
|
|
|
* cache.
|
|
|
|
*/
|
2021-04-13 14:23:14 +08:00
|
|
|
cache_size = div_u64(block_group->length, SZ_256M);
|
|
|
|
if (!cache_size)
|
|
|
|
cache_size = 1;
|
2019-06-20 15:38:00 -04:00
|
|
|
|
2021-04-13 14:23:14 +08:00
|
|
|
cache_size *= 16;
|
|
|
|
cache_size *= fs_info->sectorsize;
|
2019-06-20 15:38:00 -04:00
|
|
|
|
2020-06-03 08:55:41 +03:00
|
|
|
ret = btrfs_check_data_free_space(BTRFS_I(inode), &data_reserved, 0,
|
2022-09-12 12:27:44 -07:00
|
|
|
cache_size, false);
|
2019-06-20 15:38:00 -04:00
|
|
|
if (ret)
|
|
|
|
goto out_put;
|
|
|
|
|
2021-04-13 14:23:14 +08:00
|
|
|
ret = btrfs_prealloc_file_range_trans(inode, trans, 0, 0, cache_size,
|
|
|
|
cache_size, cache_size,
|
2019-06-20 15:38:00 -04:00
|
|
|
&alloc_hint);
|
|
|
|
/*
|
|
|
|
* Our cache requires contiguous chunks so that we don't modify a bunch
|
|
|
|
* of metadata or split extents when writing the cache out, which means
|
|
|
|
* we can enospc if we are heavily fragmented in addition to just normal
|
|
|
|
* out of space conditions. So if we hit this just skip setting up any
|
|
|
|
* other block groups for this transaction, maybe we'll unpin enough
|
|
|
|
* space the next time around.
|
|
|
|
*/
|
|
|
|
if (!ret)
|
|
|
|
dcs = BTRFS_DC_SETUP;
|
|
|
|
else if (ret == -ENOSPC)
|
|
|
|
set_bit(BTRFS_TRANS_CACHE_ENOSPC, &trans->transaction->flags);
|
|
|
|
|
|
|
|
out_put:
|
|
|
|
iput(inode);
|
|
|
|
out_free:
|
|
|
|
btrfs_release_path(path);
|
|
|
|
out:
|
|
|
|
spin_lock(&block_group->lock);
|
|
|
|
if (!ret && dcs == BTRFS_DC_SETUP)
|
|
|
|
block_group->cache_generation = trans->transid;
|
|
|
|
block_group->disk_cache_state = dcs;
|
|
|
|
spin_unlock(&block_group->lock);
|
|
|
|
|
|
|
|
extent_changeset_free(data_reserved);
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
|
|
|
int btrfs_setup_space_cache(struct btrfs_trans_handle *trans)
|
|
|
|
{
|
|
|
|
struct btrfs_fs_info *fs_info = trans->fs_info;
|
2019-10-29 19:20:18 +01:00
|
|
|
struct btrfs_block_group *cache, *tmp;
|
2019-06-20 15:38:00 -04:00
|
|
|
struct btrfs_transaction *cur_trans = trans->transaction;
|
|
|
|
struct btrfs_path *path;
|
|
|
|
|
|
|
|
if (list_empty(&cur_trans->dirty_bgs) ||
|
|
|
|
!btrfs_test_opt(fs_info, SPACE_CACHE))
|
|
|
|
return 0;
|
|
|
|
|
|
|
|
path = btrfs_alloc_path();
|
|
|
|
if (!path)
|
|
|
|
return -ENOMEM;
|
|
|
|
|
|
|
|
/* Could add new block groups, use _safe just in case */
|
|
|
|
list_for_each_entry_safe(cache, tmp, &cur_trans->dirty_bgs,
|
|
|
|
dirty_list) {
|
|
|
|
if (cache->disk_cache_state == BTRFS_DC_CLEAR)
|
|
|
|
cache_save_setup(cache, trans, path);
|
|
|
|
}
|
|
|
|
|
|
|
|
btrfs_free_path(path);
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Transaction commit does final block group cache writeback during a critical
|
|
|
|
* section where nothing is allowed to change the FS. This is required in
|
|
|
|
* order for the cache to actually match the block group, but can introduce a
|
|
|
|
* lot of latency into the commit.
|
|
|
|
*
|
|
|
|
* So, btrfs_start_dirty_block_groups is here to kick off block group cache IO.
|
|
|
|
* There's a chance we'll have to redo some of it if the block group changes
|
|
|
|
* again during the commit, but it greatly reduces the commit latency by
|
|
|
|
* getting rid of the easy block groups while we're still allowing others to
|
|
|
|
* join the commit.
|
|
|
|
*/
|
|
|
|
int btrfs_start_dirty_block_groups(struct btrfs_trans_handle *trans)
|
|
|
|
{
|
|
|
|
struct btrfs_fs_info *fs_info = trans->fs_info;
|
2019-10-29 19:20:18 +01:00
|
|
|
struct btrfs_block_group *cache;
|
2019-06-20 15:38:00 -04:00
|
|
|
struct btrfs_transaction *cur_trans = trans->transaction;
|
|
|
|
int ret = 0;
|
|
|
|
int should_put;
|
|
|
|
struct btrfs_path *path = NULL;
|
|
|
|
LIST_HEAD(dirty);
|
|
|
|
struct list_head *io = &cur_trans->io_bgs;
|
|
|
|
int loops = 0;
|
|
|
|
|
|
|
|
spin_lock(&cur_trans->dirty_bgs_lock);
|
|
|
|
if (list_empty(&cur_trans->dirty_bgs)) {
|
|
|
|
spin_unlock(&cur_trans->dirty_bgs_lock);
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
list_splice_init(&cur_trans->dirty_bgs, &dirty);
|
|
|
|
spin_unlock(&cur_trans->dirty_bgs_lock);
|
|
|
|
|
|
|
|
again:
|
|
|
|
/* Make sure all the block groups on our dirty list actually exist */
|
|
|
|
btrfs_create_pending_block_groups(trans);
|
|
|
|
|
|
|
|
if (!path) {
|
|
|
|
path = btrfs_alloc_path();
|
2021-01-14 14:02:43 -05:00
|
|
|
if (!path) {
|
|
|
|
ret = -ENOMEM;
|
|
|
|
goto out;
|
|
|
|
}
|
2019-06-20 15:38:00 -04:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* cache_write_mutex is here only to save us from balance or automatic
|
|
|
|
* removal of empty block groups deleting this block group while we are
|
|
|
|
* writing out the cache
|
|
|
|
*/
|
|
|
|
mutex_lock(&trans->transaction->cache_write_mutex);
|
|
|
|
while (!list_empty(&dirty)) {
|
|
|
|
bool drop_reserve = true;
|
|
|
|
|
2019-10-29 19:20:18 +01:00
|
|
|
cache = list_first_entry(&dirty, struct btrfs_block_group,
|
2019-06-20 15:38:00 -04:00
|
|
|
dirty_list);
|
|
|
|
/*
|
|
|
|
* This can happen if something re-dirties a block group that
|
|
|
|
* is already under IO. Just wait for it to finish and then do
|
|
|
|
* it all again
|
|
|
|
*/
|
|
|
|
if (!list_empty(&cache->io_list)) {
|
|
|
|
list_del_init(&cache->io_list);
|
|
|
|
btrfs_wait_cache_io(trans, cache, path);
|
|
|
|
btrfs_put_block_group(cache);
|
|
|
|
}
|
|
|
|
|
|
|
|
|
|
|
|
/*
|
|
|
|
* btrfs_wait_cache_io uses the cache->dirty_list to decide if
|
|
|
|
* it should update the cache_state. Don't delete until after
|
|
|
|
* we wait.
|
|
|
|
*
|
|
|
|
* Since we're not running in the commit critical section
|
|
|
|
* we need the dirty_bgs_lock to protect from update_block_group
|
|
|
|
*/
|
|
|
|
spin_lock(&cur_trans->dirty_bgs_lock);
|
|
|
|
list_del_init(&cache->dirty_list);
|
|
|
|
spin_unlock(&cur_trans->dirty_bgs_lock);
|
|
|
|
|
|
|
|
should_put = 1;
|
|
|
|
|
|
|
|
cache_save_setup(cache, trans, path);
|
|
|
|
|
|
|
|
if (cache->disk_cache_state == BTRFS_DC_SETUP) {
|
|
|
|
cache->io_ctl.inode = NULL;
|
|
|
|
ret = btrfs_write_out_cache(trans, cache, path);
|
|
|
|
if (ret == 0 && cache->io_ctl.inode) {
|
|
|
|
should_put = 0;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* The cache_write_mutex is protecting the
|
|
|
|
* io_list, also refer to the definition of
|
|
|
|
* btrfs_transaction::io_bgs for more details
|
|
|
|
*/
|
|
|
|
list_add_tail(&cache->io_list, io);
|
|
|
|
} else {
|
|
|
|
/*
|
|
|
|
* If we failed to write the cache, the
|
|
|
|
* generation will be bad and life goes on
|
|
|
|
*/
|
|
|
|
ret = 0;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
if (!ret) {
|
2020-05-05 07:58:23 +08:00
|
|
|
ret = update_block_group_item(trans, path, cache);
|
2019-06-20 15:38:00 -04:00
|
|
|
/*
|
|
|
|
* Our block group might still be attached to the list
|
|
|
|
* of new block groups in the transaction handle of some
|
|
|
|
* other task (struct btrfs_trans_handle->new_bgs). This
|
|
|
|
* means its block group item isn't yet in the extent
|
|
|
|
* tree. If this happens ignore the error, as we will
|
|
|
|
* try again later in the critical section of the
|
|
|
|
* transaction commit.
|
|
|
|
*/
|
|
|
|
if (ret == -ENOENT) {
|
|
|
|
ret = 0;
|
|
|
|
spin_lock(&cur_trans->dirty_bgs_lock);
|
|
|
|
if (list_empty(&cache->dirty_list)) {
|
|
|
|
list_add_tail(&cache->dirty_list,
|
|
|
|
&cur_trans->dirty_bgs);
|
|
|
|
btrfs_get_block_group(cache);
|
|
|
|
drop_reserve = false;
|
|
|
|
}
|
|
|
|
spin_unlock(&cur_trans->dirty_bgs_lock);
|
|
|
|
} else if (ret) {
|
|
|
|
btrfs_abort_transaction(trans, ret);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
/* If it's not on the io list, we need to put the block group */
|
|
|
|
if (should_put)
|
|
|
|
btrfs_put_block_group(cache);
|
|
|
|
if (drop_reserve)
|
2023-09-28 11:12:49 +01:00
|
|
|
btrfs_dec_delayed_refs_rsv_bg_updates(fs_info);
|
2019-06-20 15:38:00 -04:00
|
|
|
/*
|
|
|
|
* Avoid blocking other tasks for too long. It might even save
|
|
|
|
* us from writing caches for block groups that are going to be
|
|
|
|
* removed.
|
|
|
|
*/
|
|
|
|
mutex_unlock(&trans->transaction->cache_write_mutex);
|
2021-01-14 14:02:43 -05:00
|
|
|
if (ret)
|
|
|
|
goto out;
|
2019-06-20 15:38:00 -04:00
|
|
|
mutex_lock(&trans->transaction->cache_write_mutex);
|
|
|
|
}
|
|
|
|
mutex_unlock(&trans->transaction->cache_write_mutex);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Go through delayed refs for all the stuff we've just kicked off
|
|
|
|
* and then loop back (just once)
|
|
|
|
*/
|
2020-12-16 11:22:17 -05:00
|
|
|
if (!ret)
|
|
|
|
ret = btrfs_run_delayed_refs(trans, 0);
|
2019-06-20 15:38:00 -04:00
|
|
|
if (!ret && loops == 0) {
|
|
|
|
loops++;
|
|
|
|
spin_lock(&cur_trans->dirty_bgs_lock);
|
|
|
|
list_splice_init(&cur_trans->dirty_bgs, &dirty);
|
|
|
|
/*
|
|
|
|
* dirty_bgs_lock protects us from concurrent block group
|
|
|
|
* deletes too (not just cache_write_mutex).
|
|
|
|
*/
|
|
|
|
if (!list_empty(&dirty)) {
|
|
|
|
spin_unlock(&cur_trans->dirty_bgs_lock);
|
|
|
|
goto again;
|
|
|
|
}
|
|
|
|
spin_unlock(&cur_trans->dirty_bgs_lock);
|
2021-01-14 14:02:43 -05:00
|
|
|
}
|
|
|
|
out:
|
|
|
|
if (ret < 0) {
|
|
|
|
spin_lock(&cur_trans->dirty_bgs_lock);
|
|
|
|
list_splice_init(&dirty, &cur_trans->dirty_bgs);
|
|
|
|
spin_unlock(&cur_trans->dirty_bgs_lock);
|
2019-06-20 15:38:00 -04:00
|
|
|
btrfs_cleanup_dirty_bgs(cur_trans, fs_info);
|
|
|
|
}
|
|
|
|
|
|
|
|
btrfs_free_path(path);
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
|
|
|
int btrfs_write_dirty_block_groups(struct btrfs_trans_handle *trans)
|
|
|
|
{
|
|
|
|
struct btrfs_fs_info *fs_info = trans->fs_info;
|
2019-10-29 19:20:18 +01:00
|
|
|
struct btrfs_block_group *cache;
|
2019-06-20 15:38:00 -04:00
|
|
|
struct btrfs_transaction *cur_trans = trans->transaction;
|
|
|
|
int ret = 0;
|
|
|
|
int should_put;
|
|
|
|
struct btrfs_path *path;
|
|
|
|
struct list_head *io = &cur_trans->io_bgs;
|
|
|
|
|
|
|
|
path = btrfs_alloc_path();
|
|
|
|
if (!path)
|
|
|
|
return -ENOMEM;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Even though we are in the critical section of the transaction commit,
|
|
|
|
* we can still have concurrent tasks adding elements to this
|
|
|
|
* transaction's list of dirty block groups. These tasks correspond to
|
|
|
|
* endio free space workers started when writeback finishes for a
|
|
|
|
* space cache, which run inode.c:btrfs_finish_ordered_io(), and can
|
|
|
|
* allocate new block groups as a result of COWing nodes of the root
|
|
|
|
* tree when updating the free space inode. The writeback for the space
|
|
|
|
* caches is triggered by an earlier call to
|
|
|
|
* btrfs_start_dirty_block_groups() and iterations of the following
|
|
|
|
* loop.
|
|
|
|
* Also we want to do the cache_save_setup first and then run the
|
|
|
|
* delayed refs to make sure we have the best chance at doing this all
|
|
|
|
* in one shot.
|
|
|
|
*/
|
|
|
|
spin_lock(&cur_trans->dirty_bgs_lock);
|
|
|
|
while (!list_empty(&cur_trans->dirty_bgs)) {
|
|
|
|
cache = list_first_entry(&cur_trans->dirty_bgs,
|
2019-10-29 19:20:18 +01:00
|
|
|
struct btrfs_block_group,
|
2019-06-20 15:38:00 -04:00
|
|
|
dirty_list);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* This can happen if cache_save_setup re-dirties a block group
|
|
|
|
* that is already under IO. Just wait for it to finish and
|
|
|
|
* then do it all again
|
|
|
|
*/
|
|
|
|
if (!list_empty(&cache->io_list)) {
|
|
|
|
spin_unlock(&cur_trans->dirty_bgs_lock);
|
|
|
|
list_del_init(&cache->io_list);
|
|
|
|
btrfs_wait_cache_io(trans, cache, path);
|
|
|
|
btrfs_put_block_group(cache);
|
|
|
|
spin_lock(&cur_trans->dirty_bgs_lock);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Don't remove from the dirty list until after we've waited on
|
|
|
|
* any pending IO
|
|
|
|
*/
|
|
|
|
list_del_init(&cache->dirty_list);
|
|
|
|
spin_unlock(&cur_trans->dirty_bgs_lock);
|
|
|
|
should_put = 1;
|
|
|
|
|
|
|
|
cache_save_setup(cache, trans, path);
|
|
|
|
|
|
|
|
if (!ret)
|
btrfs: allow to run delayed refs by bytes to be released instead of count
When running delayed references, through btrfs_run_delayed_refs(), we can
specify how many to run, run all existing delayed references and keep
running delayed references while we can find any. This is controlled with
the value of the 'count' argument, where a value of 0 means to run all
delayed references that exist by the time btrfs_run_delayed_refs() is
called, (unsigned long)-1 means to keep running delayed references while
we are able find any, and any other value to run that exact number of
delayed references.
Typically a specific value other than 0 or -1 is used when flushing space
to try to release a certain amount of bytes for a ticket. In this case
we just simply calculate how many delayed reference heads correspond to a
specific amount of bytes, with calc_delayed_refs_nr(). However that only
takes into account the space reserved for the reference heads themselves,
and does not account for the space reserved for deleting checksums from
the csum tree (see add_delayed_ref_head() and update_existing_head_ref())
in case we are going to delete a data extent. This means we may end up
running more delayed references than necessary in case we process delayed
references for deleting a data extent.
So change the logic of btrfs_run_delayed_refs() to take a bytes argument
to specify how many bytes of delayed references to run/release, using the
special values of 0 to mean all existing delayed references and U64_MAX
(or (u64)-1) to keep running delayed references while we can find any.
This prevents running more delayed references than necessary, when we have
delayed references for deleting data extents, but also makes the upcoming
changes/patches simpler and it's preparatory work for them.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-09-08 18:20:34 +01:00
|
|
|
ret = btrfs_run_delayed_refs(trans, U64_MAX);
|
2019-06-20 15:38:00 -04:00
|
|
|
|
|
|
|
if (!ret && cache->disk_cache_state == BTRFS_DC_SETUP) {
|
|
|
|
cache->io_ctl.inode = NULL;
|
|
|
|
ret = btrfs_write_out_cache(trans, cache, path);
|
|
|
|
if (ret == 0 && cache->io_ctl.inode) {
|
|
|
|
should_put = 0;
|
|
|
|
list_add_tail(&cache->io_list, io);
|
|
|
|
} else {
|
|
|
|
/*
|
|
|
|
* If we failed to write the cache, the
|
|
|
|
* generation will be bad and life goes on
|
|
|
|
*/
|
|
|
|
ret = 0;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
if (!ret) {
|
2020-05-05 07:58:23 +08:00
|
|
|
ret = update_block_group_item(trans, path, cache);
|
2019-06-20 15:38:00 -04:00
|
|
|
/*
|
|
|
|
* One of the free space endio workers might have
|
|
|
|
* created a new block group while updating a free space
|
|
|
|
* cache's inode (at inode.c:btrfs_finish_ordered_io())
|
|
|
|
* and hasn't released its transaction handle yet, in
|
|
|
|
* which case the new block group is still attached to
|
|
|
|
* its transaction handle and its creation has not
|
|
|
|
* finished yet (no block group item in the extent tree
|
|
|
|
* yet, etc). If this is the case, wait for all free
|
|
|
|
* space endio workers to finish and retry. This is a
|
2020-08-04 19:48:34 -07:00
|
|
|
* very rare case so no need for a more efficient and
|
2019-06-20 15:38:00 -04:00
|
|
|
* complex approach.
|
|
|
|
*/
|
|
|
|
if (ret == -ENOENT) {
|
|
|
|
wait_event(cur_trans->writer_wait,
|
|
|
|
atomic_read(&cur_trans->num_writers) == 1);
|
2020-05-05 07:58:23 +08:00
|
|
|
ret = update_block_group_item(trans, path, cache);
|
2019-06-20 15:38:00 -04:00
|
|
|
}
|
|
|
|
if (ret)
|
|
|
|
btrfs_abort_transaction(trans, ret);
|
|
|
|
}
|
|
|
|
|
|
|
|
/* If its not on the io list, we need to put the block group */
|
|
|
|
if (should_put)
|
|
|
|
btrfs_put_block_group(cache);
|
2023-09-28 11:12:49 +01:00
|
|
|
btrfs_dec_delayed_refs_rsv_bg_updates(fs_info);
|
2019-06-20 15:38:00 -04:00
|
|
|
spin_lock(&cur_trans->dirty_bgs_lock);
|
|
|
|
}
|
|
|
|
spin_unlock(&cur_trans->dirty_bgs_lock);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Refer to the definition of io_bgs member for details why it's safe
|
|
|
|
* to use it without any locking
|
|
|
|
*/
|
|
|
|
while (!list_empty(io)) {
|
2019-10-29 19:20:18 +01:00
|
|
|
cache = list_first_entry(io, struct btrfs_block_group,
|
2019-06-20 15:38:00 -04:00
|
|
|
io_list);
|
|
|
|
list_del_init(&cache->io_list);
|
|
|
|
btrfs_wait_cache_io(trans, cache, path);
|
|
|
|
btrfs_put_block_group(cache);
|
|
|
|
}
|
|
|
|
|
|
|
|
btrfs_free_path(path);
|
|
|
|
return ret;
|
|
|
|
}
|
2019-06-20 15:38:02 -04:00
|
|
|
|
|
|
|
int btrfs_update_block_group(struct btrfs_trans_handle *trans,
|
2021-10-13 14:05:14 +08:00
|
|
|
u64 bytenr, u64 num_bytes, bool alloc)
|
2019-06-20 15:38:02 -04:00
|
|
|
{
|
|
|
|
struct btrfs_fs_info *info = trans->fs_info;
|
2023-09-13 12:23:18 +01:00
|
|
|
struct btrfs_space_info *space_info;
|
|
|
|
struct btrfs_block_group *cache;
|
2019-06-20 15:38:02 -04:00
|
|
|
u64 old_val;
|
2023-09-13 12:23:18 +01:00
|
|
|
bool reclaim = false;
|
2023-09-28 11:12:49 +01:00
|
|
|
bool bg_already_dirty = true;
|
2019-06-20 15:38:02 -04:00
|
|
|
int factor;
|
|
|
|
|
|
|
|
/* Block accounting for super block */
|
|
|
|
spin_lock(&info->delalloc_root_lock);
|
|
|
|
old_val = btrfs_super_bytes_used(info->super_copy);
|
|
|
|
if (alloc)
|
|
|
|
old_val += num_bytes;
|
|
|
|
else
|
|
|
|
old_val -= num_bytes;
|
|
|
|
btrfs_set_super_bytes_used(info->super_copy, old_val);
|
|
|
|
spin_unlock(&info->delalloc_root_lock);
|
|
|
|
|
2023-09-13 12:23:18 +01:00
|
|
|
cache = btrfs_lookup_block_group(info, bytenr);
|
|
|
|
if (!cache)
|
|
|
|
return -ENOENT;
|
2022-03-29 01:56:07 -07:00
|
|
|
|
2023-09-13 12:23:18 +01:00
|
|
|
/* An extent can not span multiple block groups. */
|
|
|
|
ASSERT(bytenr + num_bytes <= cache->start + cache->length);
|
2019-06-20 15:38:02 -04:00
|
|
|
|
2023-09-13 12:23:18 +01:00
|
|
|
space_info = cache->space_info;
|
|
|
|
factor = btrfs_bg_type_to_factor(cache->flags);
|
2019-06-20 15:38:02 -04:00
|
|
|
|
2023-09-13 12:23:18 +01:00
|
|
|
/*
|
|
|
|
* If this block group has free space cache written out, we need to make
|
|
|
|
* sure to load it if we are removing space. This is because we need
|
|
|
|
* the unpinning stage to actually add the space back to the block group,
|
|
|
|
* otherwise we will leak space.
|
|
|
|
*/
|
|
|
|
if (!alloc && !btrfs_block_group_done(cache))
|
|
|
|
btrfs_cache_block_group(cache, true);
|
2019-06-20 15:38:02 -04:00
|
|
|
|
2023-09-13 12:23:18 +01:00
|
|
|
spin_lock(&space_info->lock);
|
|
|
|
spin_lock(&cache->lock);
|
2019-06-20 15:38:02 -04:00
|
|
|
|
2023-09-13 12:23:18 +01:00
|
|
|
if (btrfs_test_opt(info, SPACE_CACHE) &&
|
|
|
|
cache->disk_cache_state < BTRFS_DC_CLEAR)
|
|
|
|
cache->disk_cache_state = BTRFS_DC_CLEAR;
|
2019-06-20 15:38:02 -04:00
|
|
|
|
2023-09-13 12:23:18 +01:00
|
|
|
old_val = cache->used;
|
|
|
|
if (alloc) {
|
|
|
|
old_val += num_bytes;
|
|
|
|
cache->used = old_val;
|
|
|
|
cache->reserved -= num_bytes;
|
2024-02-02 11:54:33 -08:00
|
|
|
cache->reclaim_mark = 0;
|
2023-09-13 12:23:18 +01:00
|
|
|
space_info->bytes_reserved -= num_bytes;
|
|
|
|
space_info->bytes_used += num_bytes;
|
|
|
|
space_info->disk_used += num_bytes * factor;
|
2024-02-14 11:25:50 -08:00
|
|
|
if (READ_ONCE(space_info->periodic_reclaim))
|
|
|
|
btrfs_space_info_update_reclaimable(space_info, -num_bytes);
|
2023-09-13 12:23:18 +01:00
|
|
|
spin_unlock(&cache->lock);
|
|
|
|
spin_unlock(&space_info->lock);
|
|
|
|
} else {
|
|
|
|
old_val -= num_bytes;
|
|
|
|
cache->used = old_val;
|
|
|
|
cache->pinned += num_bytes;
|
|
|
|
btrfs_space_info_update_bytes_pinned(info, space_info, num_bytes);
|
|
|
|
space_info->bytes_used -= num_bytes;
|
|
|
|
space_info->disk_used -= num_bytes * factor;
|
2024-02-14 11:25:50 -08:00
|
|
|
if (READ_ONCE(space_info->periodic_reclaim))
|
|
|
|
btrfs_space_info_update_reclaimable(space_info, num_bytes);
|
|
|
|
else
|
|
|
|
reclaim = should_reclaim_block_group(cache, num_bytes);
|
2019-06-20 15:38:02 -04:00
|
|
|
|
2023-09-13 12:23:18 +01:00
|
|
|
spin_unlock(&cache->lock);
|
|
|
|
spin_unlock(&space_info->lock);
|
2019-06-20 15:38:02 -04:00
|
|
|
|
2023-09-13 12:23:18 +01:00
|
|
|
set_extent_bit(&trans->transaction->pinned_extents, bytenr,
|
|
|
|
bytenr + num_bytes - 1, EXTENT_DIRTY, NULL);
|
|
|
|
}
|
|
|
|
|
|
|
|
spin_lock(&trans->transaction->dirty_bgs_lock);
|
|
|
|
if (list_empty(&cache->dirty_list)) {
|
|
|
|
list_add_tail(&cache->dirty_list, &trans->transaction->dirty_bgs);
|
2023-09-28 11:12:49 +01:00
|
|
|
bg_already_dirty = false;
|
2023-09-13 12:23:18 +01:00
|
|
|
btrfs_get_block_group(cache);
|
|
|
|
}
|
|
|
|
spin_unlock(&trans->transaction->dirty_bgs_lock);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* No longer have used bytes in this block group, queue it for deletion.
|
|
|
|
* We do this after adding the block group to the dirty list to avoid
|
|
|
|
* races between cleaner kthread and space cache writeout.
|
|
|
|
*/
|
|
|
|
if (!alloc && old_val == 0) {
|
|
|
|
if (!btrfs_test_opt(info, DISCARD_ASYNC))
|
|
|
|
btrfs_mark_bg_unused(cache);
|
|
|
|
} else if (!alloc && reclaim) {
|
|
|
|
btrfs_mark_bg_to_reclaim(cache);
|
2019-06-20 15:38:02 -04:00
|
|
|
}
|
|
|
|
|
2023-09-13 12:23:18 +01:00
|
|
|
btrfs_put_block_group(cache);
|
|
|
|
|
2019-06-20 15:38:02 -04:00
|
|
|
/* Modified block groups are accounted for in the delayed_refs_rsv. */
|
2023-09-28 11:12:49 +01:00
|
|
|
if (!bg_already_dirty)
|
|
|
|
btrfs_inc_delayed_refs_rsv_bg_updates(info);
|
2023-09-13 12:23:18 +01:00
|
|
|
|
|
|
|
return 0;
|
2019-06-20 15:38:02 -04:00
|
|
|
}
|
|
|
|
|
2022-10-27 14:21:42 +02:00
|
|
|
/*
|
|
|
|
* Update the block_group and space info counters.
|
|
|
|
*
|
2019-06-20 15:38:02 -04:00
|
|
|
* @cache: The cache we are manipulating
|
|
|
|
* @ram_bytes: The number of bytes of file content, and will be same to
|
|
|
|
* @num_bytes except for the compress path.
|
|
|
|
* @num_bytes: The number of bytes in question
|
|
|
|
* @delalloc: The blocks are allocated for the delalloc write
|
|
|
|
*
|
|
|
|
* This is called by the allocator when it reserves space. If this is a
|
|
|
|
* reservation and the block group has become read only we cannot make the
|
|
|
|
* reservation and return -EAGAIN, otherwise this function always succeeds.
|
|
|
|
*/
|
2019-10-29 19:20:18 +01:00
|
|
|
int btrfs_add_reserved_bytes(struct btrfs_block_group *cache,
|
btrfs: introduce size class to block group allocator
The aim of this patch is to reduce the fragmentation of block groups
under certain unhappy workloads. It is particularly effective when the
size of extents correlates with their lifetime, which is something we
have observed causing fragmentation in the fleet at Meta.
This patch categorizes extents into size classes:
- x < 128KiB: "small"
- 128KiB < x < 8MiB: "medium"
- x > 8MiB: "large"
and as much as possible reduces allocations of extents into block groups
that don't match the size class. This takes advantage of any (possible)
correlation between size and lifetime and also leaves behind predictable
re-usable gaps when extents are freed; small writes don't gum up bigger
holes.
Size classes are implemented in the following way:
- Mark each new block group with a size class of the first allocation
that goes into it.
- Add two new passes to ffe: "unset size class" and "wrong size class".
First, try only matching block groups, then try unset ones, then allow
allocation of new ones, and finally allow mismatched block groups.
- Filtering is done just by skipping inappropriate ones, there is no
special size class indexing.
Other solutions I considered were:
- A best fit allocator with an rb-tree. This worked well, as small
writes didn't leak big holes from large freed extents, but led to
regressions in ffe and write performance due to lock contention on
the rb-tree with every allocation possibly updating it in parallel.
Perhaps something clever could be done to do the updates in the
background while being "right enough".
- A fixed size "working set". This prevents freeing an extent
drastically changing where writes currently land, and seems like a
good option too. Doesn't take advantage of size in any way.
- The same size class idea, but implemented with xarray marks. This
turned out to be slower than looping the linked list and skipping
wrong block groups, and is also less flexible since we must have only
3 size classes (max #marks). With the current approach we can have as
many as we like.
Performance testing was done via: https://github.com/josefbacik/fsperf
Of particular relevance are the new fragmentation specific tests.
A brief summary of the testing results:
- Neutral results on existing tests. There are some minor regressions
and improvements here and there, but nothing that truly stands out as
notable.
- Improvement on new tests where size class and extent lifetime are
correlated. Fragmentation in these cases is completely eliminated
and write performance is generally a little better. There is also
significant improvement where extent sizes are just a bit larger than
the size class boundaries.
- Regression on one new tests: where the allocations are sized
intentionally a hair under the borders of the size classes. Results
are neutral on the test that intentionally attacks this new scheme by
mixing extent size and lifetime.
The full dump of the performance results can be found here:
https://bur.io/fsperf/size-class-2022-11-15.txt
(there are ANSI escape codes, so best to curl and view in terminal)
Here is a snippet from the full results for a new test which mixes
buffered writes appending to a long lived set of files and large short
lived fallocates:
bufferedappendvsfallocate results
metric baseline current stdev diff
======================================================================================
avg_commit_ms 31.13 29.20 2.67 -6.22%
bg_count 14 15.60 0 11.43%
commits 11.10 12.20 0.32 9.91%
elapsed 27.30 26.40 2.98 -3.30%
end_state_mount_ns 11122551.90 10635118.90 851143.04 -4.38%
end_state_umount_ns 1.36e+09 1.35e+09 12248056.65 -1.07%
find_free_extent_calls 116244.30 114354.30 964.56 -1.63%
find_free_extent_ns_max 599507.20 1047168.20 103337.08 74.67%
find_free_extent_ns_mean 3607.19 3672.11 101.20 1.80%
find_free_extent_ns_min 500 512 6.67 2.40%
find_free_extent_ns_p50 2848 2876 37.65 0.98%
find_free_extent_ns_p95 4916 5000 75.45 1.71%
find_free_extent_ns_p99 20734.49 20920.48 1670.93 0.90%
frag_pct_max 61.67 0 8.05 -100.00%
frag_pct_mean 43.59 0 6.10 -100.00%
frag_pct_min 25.91 0 16.60 -100.00%
frag_pct_p50 42.53 0 7.25 -100.00%
frag_pct_p95 61.67 0 8.05 -100.00%
frag_pct_p99 61.67 0 8.05 -100.00%
fragmented_bg_count 6.10 0 1.45 -100.00%
max_commit_ms 49.80 46 5.37 -7.63%
sys_cpu 2.59 2.62 0.29 1.39%
write_bw_bytes 1.62e+08 1.68e+08 17975843.50 3.23%
write_clat_ns_mean 57426.39 54475.95 2292.72 -5.14%
write_clat_ns_p50 46950.40 42905.60 2101.35 -8.62%
write_clat_ns_p99 148070.40 143769.60 2115.17 -2.90%
write_io_kbytes 4194304 4194304 0 0.00%
write_iops 2476.15 2556.10 274.29 3.23%
write_lat_ns_max 2101667.60 2251129.50 370556.59 7.11%
write_lat_ns_mean 59374.91 55682.00 2523.09 -6.22%
write_lat_ns_min 17353.10 16250 1646.08 -6.36%
There are some mixed improvements/regressions in most metrics along with
an elimination of fragmentation in this workload.
On the balance, the drastic 1->0 improvement in the happy cases seems
worth the mix of regressions and improvements we do observe.
Some considerations for future work:
- Experimenting with more size classes
- More hinting/search ordering work to approximate a best-fit allocator
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-15 16:06:33 -08:00
|
|
|
u64 ram_bytes, u64 num_bytes, int delalloc,
|
|
|
|
bool force_wrong_size_class)
|
2019-06-20 15:38:02 -04:00
|
|
|
{
|
|
|
|
struct btrfs_space_info *space_info = cache->space_info;
|
btrfs: introduce size class to block group allocator
The aim of this patch is to reduce the fragmentation of block groups
under certain unhappy workloads. It is particularly effective when the
size of extents correlates with their lifetime, which is something we
have observed causing fragmentation in the fleet at Meta.
This patch categorizes extents into size classes:
- x < 128KiB: "small"
- 128KiB < x < 8MiB: "medium"
- x > 8MiB: "large"
and as much as possible reduces allocations of extents into block groups
that don't match the size class. This takes advantage of any (possible)
correlation between size and lifetime and also leaves behind predictable
re-usable gaps when extents are freed; small writes don't gum up bigger
holes.
Size classes are implemented in the following way:
- Mark each new block group with a size class of the first allocation
that goes into it.
- Add two new passes to ffe: "unset size class" and "wrong size class".
First, try only matching block groups, then try unset ones, then allow
allocation of new ones, and finally allow mismatched block groups.
- Filtering is done just by skipping inappropriate ones, there is no
special size class indexing.
Other solutions I considered were:
- A best fit allocator with an rb-tree. This worked well, as small
writes didn't leak big holes from large freed extents, but led to
regressions in ffe and write performance due to lock contention on
the rb-tree with every allocation possibly updating it in parallel.
Perhaps something clever could be done to do the updates in the
background while being "right enough".
- A fixed size "working set". This prevents freeing an extent
drastically changing where writes currently land, and seems like a
good option too. Doesn't take advantage of size in any way.
- The same size class idea, but implemented with xarray marks. This
turned out to be slower than looping the linked list and skipping
wrong block groups, and is also less flexible since we must have only
3 size classes (max #marks). With the current approach we can have as
many as we like.
Performance testing was done via: https://github.com/josefbacik/fsperf
Of particular relevance are the new fragmentation specific tests.
A brief summary of the testing results:
- Neutral results on existing tests. There are some minor regressions
and improvements here and there, but nothing that truly stands out as
notable.
- Improvement on new tests where size class and extent lifetime are
correlated. Fragmentation in these cases is completely eliminated
and write performance is generally a little better. There is also
significant improvement where extent sizes are just a bit larger than
the size class boundaries.
- Regression on one new tests: where the allocations are sized
intentionally a hair under the borders of the size classes. Results
are neutral on the test that intentionally attacks this new scheme by
mixing extent size and lifetime.
The full dump of the performance results can be found here:
https://bur.io/fsperf/size-class-2022-11-15.txt
(there are ANSI escape codes, so best to curl and view in terminal)
Here is a snippet from the full results for a new test which mixes
buffered writes appending to a long lived set of files and large short
lived fallocates:
bufferedappendvsfallocate results
metric baseline current stdev diff
======================================================================================
avg_commit_ms 31.13 29.20 2.67 -6.22%
bg_count 14 15.60 0 11.43%
commits 11.10 12.20 0.32 9.91%
elapsed 27.30 26.40 2.98 -3.30%
end_state_mount_ns 11122551.90 10635118.90 851143.04 -4.38%
end_state_umount_ns 1.36e+09 1.35e+09 12248056.65 -1.07%
find_free_extent_calls 116244.30 114354.30 964.56 -1.63%
find_free_extent_ns_max 599507.20 1047168.20 103337.08 74.67%
find_free_extent_ns_mean 3607.19 3672.11 101.20 1.80%
find_free_extent_ns_min 500 512 6.67 2.40%
find_free_extent_ns_p50 2848 2876 37.65 0.98%
find_free_extent_ns_p95 4916 5000 75.45 1.71%
find_free_extent_ns_p99 20734.49 20920.48 1670.93 0.90%
frag_pct_max 61.67 0 8.05 -100.00%
frag_pct_mean 43.59 0 6.10 -100.00%
frag_pct_min 25.91 0 16.60 -100.00%
frag_pct_p50 42.53 0 7.25 -100.00%
frag_pct_p95 61.67 0 8.05 -100.00%
frag_pct_p99 61.67 0 8.05 -100.00%
fragmented_bg_count 6.10 0 1.45 -100.00%
max_commit_ms 49.80 46 5.37 -7.63%
sys_cpu 2.59 2.62 0.29 1.39%
write_bw_bytes 1.62e+08 1.68e+08 17975843.50 3.23%
write_clat_ns_mean 57426.39 54475.95 2292.72 -5.14%
write_clat_ns_p50 46950.40 42905.60 2101.35 -8.62%
write_clat_ns_p99 148070.40 143769.60 2115.17 -2.90%
write_io_kbytes 4194304 4194304 0 0.00%
write_iops 2476.15 2556.10 274.29 3.23%
write_lat_ns_max 2101667.60 2251129.50 370556.59 7.11%
write_lat_ns_mean 59374.91 55682.00 2523.09 -6.22%
write_lat_ns_min 17353.10 16250 1646.08 -6.36%
There are some mixed improvements/regressions in most metrics along with
an elimination of fragmentation in this workload.
On the balance, the drastic 1->0 improvement in the happy cases seems
worth the mix of regressions and improvements we do observe.
Some considerations for future work:
- Experimenting with more size classes
- More hinting/search ordering work to approximate a best-fit allocator
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-15 16:06:33 -08:00
|
|
|
enum btrfs_block_group_size_class size_class;
|
2019-06-20 15:38:02 -04:00
|
|
|
int ret = 0;
|
|
|
|
|
|
|
|
spin_lock(&space_info->lock);
|
|
|
|
spin_lock(&cache->lock);
|
|
|
|
if (cache->ro) {
|
|
|
|
ret = -EAGAIN;
|
btrfs: introduce size class to block group allocator
The aim of this patch is to reduce the fragmentation of block groups
under certain unhappy workloads. It is particularly effective when the
size of extents correlates with their lifetime, which is something we
have observed causing fragmentation in the fleet at Meta.
This patch categorizes extents into size classes:
- x < 128KiB: "small"
- 128KiB < x < 8MiB: "medium"
- x > 8MiB: "large"
and as much as possible reduces allocations of extents into block groups
that don't match the size class. This takes advantage of any (possible)
correlation between size and lifetime and also leaves behind predictable
re-usable gaps when extents are freed; small writes don't gum up bigger
holes.
Size classes are implemented in the following way:
- Mark each new block group with a size class of the first allocation
that goes into it.
- Add two new passes to ffe: "unset size class" and "wrong size class".
First, try only matching block groups, then try unset ones, then allow
allocation of new ones, and finally allow mismatched block groups.
- Filtering is done just by skipping inappropriate ones, there is no
special size class indexing.
Other solutions I considered were:
- A best fit allocator with an rb-tree. This worked well, as small
writes didn't leak big holes from large freed extents, but led to
regressions in ffe and write performance due to lock contention on
the rb-tree with every allocation possibly updating it in parallel.
Perhaps something clever could be done to do the updates in the
background while being "right enough".
- A fixed size "working set". This prevents freeing an extent
drastically changing where writes currently land, and seems like a
good option too. Doesn't take advantage of size in any way.
- The same size class idea, but implemented with xarray marks. This
turned out to be slower than looping the linked list and skipping
wrong block groups, and is also less flexible since we must have only
3 size classes (max #marks). With the current approach we can have as
many as we like.
Performance testing was done via: https://github.com/josefbacik/fsperf
Of particular relevance are the new fragmentation specific tests.
A brief summary of the testing results:
- Neutral results on existing tests. There are some minor regressions
and improvements here and there, but nothing that truly stands out as
notable.
- Improvement on new tests where size class and extent lifetime are
correlated. Fragmentation in these cases is completely eliminated
and write performance is generally a little better. There is also
significant improvement where extent sizes are just a bit larger than
the size class boundaries.
- Regression on one new tests: where the allocations are sized
intentionally a hair under the borders of the size classes. Results
are neutral on the test that intentionally attacks this new scheme by
mixing extent size and lifetime.
The full dump of the performance results can be found here:
https://bur.io/fsperf/size-class-2022-11-15.txt
(there are ANSI escape codes, so best to curl and view in terminal)
Here is a snippet from the full results for a new test which mixes
buffered writes appending to a long lived set of files and large short
lived fallocates:
bufferedappendvsfallocate results
metric baseline current stdev diff
======================================================================================
avg_commit_ms 31.13 29.20 2.67 -6.22%
bg_count 14 15.60 0 11.43%
commits 11.10 12.20 0.32 9.91%
elapsed 27.30 26.40 2.98 -3.30%
end_state_mount_ns 11122551.90 10635118.90 851143.04 -4.38%
end_state_umount_ns 1.36e+09 1.35e+09 12248056.65 -1.07%
find_free_extent_calls 116244.30 114354.30 964.56 -1.63%
find_free_extent_ns_max 599507.20 1047168.20 103337.08 74.67%
find_free_extent_ns_mean 3607.19 3672.11 101.20 1.80%
find_free_extent_ns_min 500 512 6.67 2.40%
find_free_extent_ns_p50 2848 2876 37.65 0.98%
find_free_extent_ns_p95 4916 5000 75.45 1.71%
find_free_extent_ns_p99 20734.49 20920.48 1670.93 0.90%
frag_pct_max 61.67 0 8.05 -100.00%
frag_pct_mean 43.59 0 6.10 -100.00%
frag_pct_min 25.91 0 16.60 -100.00%
frag_pct_p50 42.53 0 7.25 -100.00%
frag_pct_p95 61.67 0 8.05 -100.00%
frag_pct_p99 61.67 0 8.05 -100.00%
fragmented_bg_count 6.10 0 1.45 -100.00%
max_commit_ms 49.80 46 5.37 -7.63%
sys_cpu 2.59 2.62 0.29 1.39%
write_bw_bytes 1.62e+08 1.68e+08 17975843.50 3.23%
write_clat_ns_mean 57426.39 54475.95 2292.72 -5.14%
write_clat_ns_p50 46950.40 42905.60 2101.35 -8.62%
write_clat_ns_p99 148070.40 143769.60 2115.17 -2.90%
write_io_kbytes 4194304 4194304 0 0.00%
write_iops 2476.15 2556.10 274.29 3.23%
write_lat_ns_max 2101667.60 2251129.50 370556.59 7.11%
write_lat_ns_mean 59374.91 55682.00 2523.09 -6.22%
write_lat_ns_min 17353.10 16250 1646.08 -6.36%
There are some mixed improvements/regressions in most metrics along with
an elimination of fragmentation in this workload.
On the balance, the drastic 1->0 improvement in the happy cases seems
worth the mix of regressions and improvements we do observe.
Some considerations for future work:
- Experimenting with more size classes
- More hinting/search ordering work to approximate a best-fit allocator
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-15 16:06:33 -08:00
|
|
|
goto out;
|
|
|
|
}
|
2020-07-21 10:22:19 -04:00
|
|
|
|
2022-12-15 16:06:35 -08:00
|
|
|
if (btrfs_block_group_should_use_size_class(cache)) {
|
btrfs: introduce size class to block group allocator
The aim of this patch is to reduce the fragmentation of block groups
under certain unhappy workloads. It is particularly effective when the
size of extents correlates with their lifetime, which is something we
have observed causing fragmentation in the fleet at Meta.
This patch categorizes extents into size classes:
- x < 128KiB: "small"
- 128KiB < x < 8MiB: "medium"
- x > 8MiB: "large"
and as much as possible reduces allocations of extents into block groups
that don't match the size class. This takes advantage of any (possible)
correlation between size and lifetime and also leaves behind predictable
re-usable gaps when extents are freed; small writes don't gum up bigger
holes.
Size classes are implemented in the following way:
- Mark each new block group with a size class of the first allocation
that goes into it.
- Add two new passes to ffe: "unset size class" and "wrong size class".
First, try only matching block groups, then try unset ones, then allow
allocation of new ones, and finally allow mismatched block groups.
- Filtering is done just by skipping inappropriate ones, there is no
special size class indexing.
Other solutions I considered were:
- A best fit allocator with an rb-tree. This worked well, as small
writes didn't leak big holes from large freed extents, but led to
regressions in ffe and write performance due to lock contention on
the rb-tree with every allocation possibly updating it in parallel.
Perhaps something clever could be done to do the updates in the
background while being "right enough".
- A fixed size "working set". This prevents freeing an extent
drastically changing where writes currently land, and seems like a
good option too. Doesn't take advantage of size in any way.
- The same size class idea, but implemented with xarray marks. This
turned out to be slower than looping the linked list and skipping
wrong block groups, and is also less flexible since we must have only
3 size classes (max #marks). With the current approach we can have as
many as we like.
Performance testing was done via: https://github.com/josefbacik/fsperf
Of particular relevance are the new fragmentation specific tests.
A brief summary of the testing results:
- Neutral results on existing tests. There are some minor regressions
and improvements here and there, but nothing that truly stands out as
notable.
- Improvement on new tests where size class and extent lifetime are
correlated. Fragmentation in these cases is completely eliminated
and write performance is generally a little better. There is also
significant improvement where extent sizes are just a bit larger than
the size class boundaries.
- Regression on one new tests: where the allocations are sized
intentionally a hair under the borders of the size classes. Results
are neutral on the test that intentionally attacks this new scheme by
mixing extent size and lifetime.
The full dump of the performance results can be found here:
https://bur.io/fsperf/size-class-2022-11-15.txt
(there are ANSI escape codes, so best to curl and view in terminal)
Here is a snippet from the full results for a new test which mixes
buffered writes appending to a long lived set of files and large short
lived fallocates:
bufferedappendvsfallocate results
metric baseline current stdev diff
======================================================================================
avg_commit_ms 31.13 29.20 2.67 -6.22%
bg_count 14 15.60 0 11.43%
commits 11.10 12.20 0.32 9.91%
elapsed 27.30 26.40 2.98 -3.30%
end_state_mount_ns 11122551.90 10635118.90 851143.04 -4.38%
end_state_umount_ns 1.36e+09 1.35e+09 12248056.65 -1.07%
find_free_extent_calls 116244.30 114354.30 964.56 -1.63%
find_free_extent_ns_max 599507.20 1047168.20 103337.08 74.67%
find_free_extent_ns_mean 3607.19 3672.11 101.20 1.80%
find_free_extent_ns_min 500 512 6.67 2.40%
find_free_extent_ns_p50 2848 2876 37.65 0.98%
find_free_extent_ns_p95 4916 5000 75.45 1.71%
find_free_extent_ns_p99 20734.49 20920.48 1670.93 0.90%
frag_pct_max 61.67 0 8.05 -100.00%
frag_pct_mean 43.59 0 6.10 -100.00%
frag_pct_min 25.91 0 16.60 -100.00%
frag_pct_p50 42.53 0 7.25 -100.00%
frag_pct_p95 61.67 0 8.05 -100.00%
frag_pct_p99 61.67 0 8.05 -100.00%
fragmented_bg_count 6.10 0 1.45 -100.00%
max_commit_ms 49.80 46 5.37 -7.63%
sys_cpu 2.59 2.62 0.29 1.39%
write_bw_bytes 1.62e+08 1.68e+08 17975843.50 3.23%
write_clat_ns_mean 57426.39 54475.95 2292.72 -5.14%
write_clat_ns_p50 46950.40 42905.60 2101.35 -8.62%
write_clat_ns_p99 148070.40 143769.60 2115.17 -2.90%
write_io_kbytes 4194304 4194304 0 0.00%
write_iops 2476.15 2556.10 274.29 3.23%
write_lat_ns_max 2101667.60 2251129.50 370556.59 7.11%
write_lat_ns_mean 59374.91 55682.00 2523.09 -6.22%
write_lat_ns_min 17353.10 16250 1646.08 -6.36%
There are some mixed improvements/regressions in most metrics along with
an elimination of fragmentation in this workload.
On the balance, the drastic 1->0 improvement in the happy cases seems
worth the mix of regressions and improvements we do observe.
Some considerations for future work:
- Experimenting with more size classes
- More hinting/search ordering work to approximate a best-fit allocator
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-15 16:06:33 -08:00
|
|
|
size_class = btrfs_calc_block_group_size_class(num_bytes);
|
|
|
|
ret = btrfs_use_block_group_size_class(cache, size_class, force_wrong_size_class);
|
|
|
|
if (ret)
|
|
|
|
goto out;
|
2019-06-20 15:38:02 -04:00
|
|
|
}
|
btrfs: introduce size class to block group allocator
The aim of this patch is to reduce the fragmentation of block groups
under certain unhappy workloads. It is particularly effective when the
size of extents correlates with their lifetime, which is something we
have observed causing fragmentation in the fleet at Meta.
This patch categorizes extents into size classes:
- x < 128KiB: "small"
- 128KiB < x < 8MiB: "medium"
- x > 8MiB: "large"
and as much as possible reduces allocations of extents into block groups
that don't match the size class. This takes advantage of any (possible)
correlation between size and lifetime and also leaves behind predictable
re-usable gaps when extents are freed; small writes don't gum up bigger
holes.
Size classes are implemented in the following way:
- Mark each new block group with a size class of the first allocation
that goes into it.
- Add two new passes to ffe: "unset size class" and "wrong size class".
First, try only matching block groups, then try unset ones, then allow
allocation of new ones, and finally allow mismatched block groups.
- Filtering is done just by skipping inappropriate ones, there is no
special size class indexing.
Other solutions I considered were:
- A best fit allocator with an rb-tree. This worked well, as small
writes didn't leak big holes from large freed extents, but led to
regressions in ffe and write performance due to lock contention on
the rb-tree with every allocation possibly updating it in parallel.
Perhaps something clever could be done to do the updates in the
background while being "right enough".
- A fixed size "working set". This prevents freeing an extent
drastically changing where writes currently land, and seems like a
good option too. Doesn't take advantage of size in any way.
- The same size class idea, but implemented with xarray marks. This
turned out to be slower than looping the linked list and skipping
wrong block groups, and is also less flexible since we must have only
3 size classes (max #marks). With the current approach we can have as
many as we like.
Performance testing was done via: https://github.com/josefbacik/fsperf
Of particular relevance are the new fragmentation specific tests.
A brief summary of the testing results:
- Neutral results on existing tests. There are some minor regressions
and improvements here and there, but nothing that truly stands out as
notable.
- Improvement on new tests where size class and extent lifetime are
correlated. Fragmentation in these cases is completely eliminated
and write performance is generally a little better. There is also
significant improvement where extent sizes are just a bit larger than
the size class boundaries.
- Regression on one new tests: where the allocations are sized
intentionally a hair under the borders of the size classes. Results
are neutral on the test that intentionally attacks this new scheme by
mixing extent size and lifetime.
The full dump of the performance results can be found here:
https://bur.io/fsperf/size-class-2022-11-15.txt
(there are ANSI escape codes, so best to curl and view in terminal)
Here is a snippet from the full results for a new test which mixes
buffered writes appending to a long lived set of files and large short
lived fallocates:
bufferedappendvsfallocate results
metric baseline current stdev diff
======================================================================================
avg_commit_ms 31.13 29.20 2.67 -6.22%
bg_count 14 15.60 0 11.43%
commits 11.10 12.20 0.32 9.91%
elapsed 27.30 26.40 2.98 -3.30%
end_state_mount_ns 11122551.90 10635118.90 851143.04 -4.38%
end_state_umount_ns 1.36e+09 1.35e+09 12248056.65 -1.07%
find_free_extent_calls 116244.30 114354.30 964.56 -1.63%
find_free_extent_ns_max 599507.20 1047168.20 103337.08 74.67%
find_free_extent_ns_mean 3607.19 3672.11 101.20 1.80%
find_free_extent_ns_min 500 512 6.67 2.40%
find_free_extent_ns_p50 2848 2876 37.65 0.98%
find_free_extent_ns_p95 4916 5000 75.45 1.71%
find_free_extent_ns_p99 20734.49 20920.48 1670.93 0.90%
frag_pct_max 61.67 0 8.05 -100.00%
frag_pct_mean 43.59 0 6.10 -100.00%
frag_pct_min 25.91 0 16.60 -100.00%
frag_pct_p50 42.53 0 7.25 -100.00%
frag_pct_p95 61.67 0 8.05 -100.00%
frag_pct_p99 61.67 0 8.05 -100.00%
fragmented_bg_count 6.10 0 1.45 -100.00%
max_commit_ms 49.80 46 5.37 -7.63%
sys_cpu 2.59 2.62 0.29 1.39%
write_bw_bytes 1.62e+08 1.68e+08 17975843.50 3.23%
write_clat_ns_mean 57426.39 54475.95 2292.72 -5.14%
write_clat_ns_p50 46950.40 42905.60 2101.35 -8.62%
write_clat_ns_p99 148070.40 143769.60 2115.17 -2.90%
write_io_kbytes 4194304 4194304 0 0.00%
write_iops 2476.15 2556.10 274.29 3.23%
write_lat_ns_max 2101667.60 2251129.50 370556.59 7.11%
write_lat_ns_mean 59374.91 55682.00 2523.09 -6.22%
write_lat_ns_min 17353.10 16250 1646.08 -6.36%
There are some mixed improvements/regressions in most metrics along with
an elimination of fragmentation in this workload.
On the balance, the drastic 1->0 improvement in the happy cases seems
worth the mix of regressions and improvements we do observe.
Some considerations for future work:
- Experimenting with more size classes
- More hinting/search ordering work to approximate a best-fit allocator
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-15 16:06:33 -08:00
|
|
|
cache->reserved += num_bytes;
|
|
|
|
space_info->bytes_reserved += num_bytes;
|
|
|
|
trace_btrfs_space_reservation(cache->fs_info, "space_info",
|
|
|
|
space_info->flags, num_bytes, 1);
|
|
|
|
btrfs_space_info_update_bytes_may_use(cache->fs_info,
|
|
|
|
space_info, -ram_bytes);
|
|
|
|
if (delalloc)
|
|
|
|
cache->delalloc_bytes += num_bytes;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Compression can use less space than we reserved, so wake tickets if
|
|
|
|
* that happens.
|
|
|
|
*/
|
|
|
|
if (num_bytes < ram_bytes)
|
|
|
|
btrfs_try_granting_tickets(cache->fs_info, space_info);
|
|
|
|
out:
|
2019-06-20 15:38:02 -04:00
|
|
|
spin_unlock(&cache->lock);
|
|
|
|
spin_unlock(&space_info->lock);
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2022-10-27 14:21:42 +02:00
|
|
|
/*
|
|
|
|
* Update the block_group and space info counters.
|
|
|
|
*
|
2019-06-20 15:38:02 -04:00
|
|
|
* @cache: The cache we are manipulating
|
|
|
|
* @num_bytes: The number of bytes in question
|
|
|
|
* @delalloc: The blocks are allocated for the delalloc write
|
|
|
|
*
|
|
|
|
* This is called by somebody who is freeing space that was never actually used
|
|
|
|
* on disk. For example if you reserve some space for a new leaf in transaction
|
|
|
|
* A and before transaction A commits you free that leaf, you call this with
|
|
|
|
* reserve set to 0 in order to clear the reservation.
|
|
|
|
*/
|
2019-10-29 19:20:18 +01:00
|
|
|
void btrfs_free_reserved_bytes(struct btrfs_block_group *cache,
|
2019-06-20 15:38:02 -04:00
|
|
|
u64 num_bytes, int delalloc)
|
|
|
|
{
|
|
|
|
struct btrfs_space_info *space_info = cache->space_info;
|
|
|
|
|
|
|
|
spin_lock(&space_info->lock);
|
|
|
|
spin_lock(&cache->lock);
|
|
|
|
if (cache->ro)
|
|
|
|
space_info->bytes_readonly += num_bytes;
|
|
|
|
cache->reserved -= num_bytes;
|
|
|
|
space_info->bytes_reserved -= num_bytes;
|
|
|
|
space_info->max_extent_size = 0;
|
|
|
|
|
|
|
|
if (delalloc)
|
|
|
|
cache->delalloc_bytes -= num_bytes;
|
|
|
|
spin_unlock(&cache->lock);
|
2020-07-21 10:22:17 -04:00
|
|
|
|
|
|
|
btrfs_try_granting_tickets(cache->fs_info, space_info);
|
2019-06-20 15:38:02 -04:00
|
|
|
spin_unlock(&space_info->lock);
|
|
|
|
}
|
2019-06-20 15:38:04 -04:00
|
|
|
|
|
|
|
static void force_metadata_allocation(struct btrfs_fs_info *info)
|
|
|
|
{
|
|
|
|
struct list_head *head = &info->space_info;
|
|
|
|
struct btrfs_space_info *found;
|
|
|
|
|
2020-09-01 17:40:37 -04:00
|
|
|
list_for_each_entry(found, head, list) {
|
2019-06-20 15:38:04 -04:00
|
|
|
if (found->flags & BTRFS_BLOCK_GROUP_METADATA)
|
|
|
|
found->force_alloc = CHUNK_ALLOC_FORCE;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2024-06-26 23:39:11 +02:00
|
|
|
static int should_alloc_chunk(const struct btrfs_fs_info *fs_info,
|
|
|
|
const struct btrfs_space_info *sinfo, int force)
|
2019-06-20 15:38:04 -04:00
|
|
|
{
|
|
|
|
u64 bytes_used = btrfs_space_info_used(sinfo, false);
|
|
|
|
u64 thresh;
|
|
|
|
|
|
|
|
if (force == CHUNK_ALLOC_FORCE)
|
|
|
|
return 1;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* in limited mode, we want to have some free space up to
|
|
|
|
* about 1% of the FS size.
|
|
|
|
*/
|
|
|
|
if (force == CHUNK_ALLOC_LIMITED) {
|
|
|
|
thresh = btrfs_super_total_bytes(fs_info->super_copy);
|
2022-10-26 23:25:14 +02:00
|
|
|
thresh = max_t(u64, SZ_64M, mult_perc(thresh, 1));
|
2019-06-20 15:38:04 -04:00
|
|
|
|
|
|
|
if (sinfo->total_bytes - bytes_used < thresh)
|
|
|
|
return 1;
|
|
|
|
}
|
|
|
|
|
2022-10-26 23:25:14 +02:00
|
|
|
if (bytes_used + SZ_2M < mult_perc(sinfo->total_bytes, 80))
|
2019-06-20 15:38:04 -04:00
|
|
|
return 0;
|
|
|
|
return 1;
|
|
|
|
}
|
|
|
|
|
|
|
|
int btrfs_force_chunk_alloc(struct btrfs_trans_handle *trans, u64 type)
|
|
|
|
{
|
|
|
|
u64 alloc_flags = btrfs_get_alloc_profile(trans->fs_info, type);
|
|
|
|
|
|
|
|
return btrfs_chunk_alloc(trans, alloc_flags, CHUNK_ALLOC_FORCE);
|
|
|
|
}
|
|
|
|
|
2022-03-22 18:11:33 +09:00
|
|
|
static struct btrfs_block_group *do_chunk_alloc(struct btrfs_trans_handle *trans, u64 flags)
|
btrfs: rework chunk allocation to avoid exhaustion of the system chunk array
Commit eafa4fd0ad0607 ("btrfs: fix exhaustion of the system chunk array
due to concurrent allocations") fixed a problem that resulted in
exhausting the system chunk array in the superblock when there are many
tasks allocating chunks in parallel. Basically too many tasks enter the
first phase of chunk allocation without previous tasks having finished
their second phase of allocation, resulting in too many system chunks
being allocated. That was originally observed when running the fallocate
tests of stress-ng on a PowerPC machine, using a node size of 64K.
However that commit also introduced a deadlock where a task in phase 1 of
the chunk allocation waited for another task that had allocated a system
chunk to finish its phase 2, but that other task was waiting on an extent
buffer lock held by the first task, therefore resulting in both tasks not
making any progress. That change was later reverted by a patch with the
subject "btrfs: fix deadlock with concurrent chunk allocations involving
system chunks", since there is no simple and short solution to address it
and the deadlock is relatively easy to trigger on zoned filesystems, while
the system chunk array exhaustion is not so common.
This change reworks the chunk allocation to avoid the system chunk array
exhaustion. It accomplishes that by making the first phase of chunk
allocation do the updates of the device items in the chunk btree and the
insertion of the new chunk item in the chunk btree. This is done while
under the protection of the chunk mutex (fs_info->chunk_mutex), in the
same critical section that checks for available system space, allocates
a new system chunk if needed and reserves system chunk space. This way
we do not have chunk space reserved until the second phase completes.
The same logic is applied to chunk removal as well, since it keeps
reserved system space long after it is done updating the chunk btree.
For direct allocation of system chunks, the previous behaviour remains,
because otherwise we would deadlock on extent buffers of the chunk btree.
Changes to the chunk btree are by large done by chunk allocation and chunk
removal, which first reserve chunk system space and then later do changes
to the chunk btree. The other remaining cases are uncommon and correspond
to adding a device, removing a device and resizing a device. All these
other cases do not pre-reserve system space, they modify the chunk btree
right away, so they don't hold reserved space for a long period like chunk
allocation and chunk removal do.
The diff of this change is huge, but more than half of it is just addition
of comments describing both how things work regarding chunk allocation and
removal, including both the new behavior and the parts of the old behavior
that did not change.
CC: stable@vger.kernel.org # 5.12+
Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Tested-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Tested-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-06-29 14:43:06 +01:00
|
|
|
{
|
|
|
|
struct btrfs_block_group *bg;
|
|
|
|
int ret;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Check if we have enough space in the system space info because we
|
|
|
|
* will need to update device items in the chunk btree and insert a new
|
|
|
|
* chunk item in the chunk btree as well. This will allocate a new
|
|
|
|
* system block group if needed.
|
|
|
|
*/
|
|
|
|
check_system_chunk(trans, flags);
|
|
|
|
|
2021-08-18 13:41:19 +03:00
|
|
|
bg = btrfs_create_chunk(trans, flags);
|
btrfs: rework chunk allocation to avoid exhaustion of the system chunk array
Commit eafa4fd0ad0607 ("btrfs: fix exhaustion of the system chunk array
due to concurrent allocations") fixed a problem that resulted in
exhausting the system chunk array in the superblock when there are many
tasks allocating chunks in parallel. Basically too many tasks enter the
first phase of chunk allocation without previous tasks having finished
their second phase of allocation, resulting in too many system chunks
being allocated. That was originally observed when running the fallocate
tests of stress-ng on a PowerPC machine, using a node size of 64K.
However that commit also introduced a deadlock where a task in phase 1 of
the chunk allocation waited for another task that had allocated a system
chunk to finish its phase 2, but that other task was waiting on an extent
buffer lock held by the first task, therefore resulting in both tasks not
making any progress. That change was later reverted by a patch with the
subject "btrfs: fix deadlock with concurrent chunk allocations involving
system chunks", since there is no simple and short solution to address it
and the deadlock is relatively easy to trigger on zoned filesystems, while
the system chunk array exhaustion is not so common.
This change reworks the chunk allocation to avoid the system chunk array
exhaustion. It accomplishes that by making the first phase of chunk
allocation do the updates of the device items in the chunk btree and the
insertion of the new chunk item in the chunk btree. This is done while
under the protection of the chunk mutex (fs_info->chunk_mutex), in the
same critical section that checks for available system space, allocates
a new system chunk if needed and reserves system chunk space. This way
we do not have chunk space reserved until the second phase completes.
The same logic is applied to chunk removal as well, since it keeps
reserved system space long after it is done updating the chunk btree.
For direct allocation of system chunks, the previous behaviour remains,
because otherwise we would deadlock on extent buffers of the chunk btree.
Changes to the chunk btree are by large done by chunk allocation and chunk
removal, which first reserve chunk system space and then later do changes
to the chunk btree. The other remaining cases are uncommon and correspond
to adding a device, removing a device and resizing a device. All these
other cases do not pre-reserve system space, they modify the chunk btree
right away, so they don't hold reserved space for a long period like chunk
allocation and chunk removal do.
The diff of this change is huge, but more than half of it is just addition
of comments describing both how things work regarding chunk allocation and
removal, including both the new behavior and the parts of the old behavior
that did not change.
CC: stable@vger.kernel.org # 5.12+
Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Tested-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Tested-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-06-29 14:43:06 +01:00
|
|
|
if (IS_ERR(bg)) {
|
|
|
|
ret = PTR_ERR(bg);
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
|
|
|
|
ret = btrfs_chunk_alloc_add_chunk_item(trans, bg);
|
|
|
|
/*
|
|
|
|
* Normally we are not expected to fail with -ENOSPC here, since we have
|
|
|
|
* previously reserved space in the system space_info and allocated one
|
2021-10-13 10:12:50 +01:00
|
|
|
* new system chunk if necessary. However there are three exceptions:
|
btrfs: rework chunk allocation to avoid exhaustion of the system chunk array
Commit eafa4fd0ad0607 ("btrfs: fix exhaustion of the system chunk array
due to concurrent allocations") fixed a problem that resulted in
exhausting the system chunk array in the superblock when there are many
tasks allocating chunks in parallel. Basically too many tasks enter the
first phase of chunk allocation without previous tasks having finished
their second phase of allocation, resulting in too many system chunks
being allocated. That was originally observed when running the fallocate
tests of stress-ng on a PowerPC machine, using a node size of 64K.
However that commit also introduced a deadlock where a task in phase 1 of
the chunk allocation waited for another task that had allocated a system
chunk to finish its phase 2, but that other task was waiting on an extent
buffer lock held by the first task, therefore resulting in both tasks not
making any progress. That change was later reverted by a patch with the
subject "btrfs: fix deadlock with concurrent chunk allocations involving
system chunks", since there is no simple and short solution to address it
and the deadlock is relatively easy to trigger on zoned filesystems, while
the system chunk array exhaustion is not so common.
This change reworks the chunk allocation to avoid the system chunk array
exhaustion. It accomplishes that by making the first phase of chunk
allocation do the updates of the device items in the chunk btree and the
insertion of the new chunk item in the chunk btree. This is done while
under the protection of the chunk mutex (fs_info->chunk_mutex), in the
same critical section that checks for available system space, allocates
a new system chunk if needed and reserves system chunk space. This way
we do not have chunk space reserved until the second phase completes.
The same logic is applied to chunk removal as well, since it keeps
reserved system space long after it is done updating the chunk btree.
For direct allocation of system chunks, the previous behaviour remains,
because otherwise we would deadlock on extent buffers of the chunk btree.
Changes to the chunk btree are by large done by chunk allocation and chunk
removal, which first reserve chunk system space and then later do changes
to the chunk btree. The other remaining cases are uncommon and correspond
to adding a device, removing a device and resizing a device. All these
other cases do not pre-reserve system space, they modify the chunk btree
right away, so they don't hold reserved space for a long period like chunk
allocation and chunk removal do.
The diff of this change is huge, but more than half of it is just addition
of comments describing both how things work regarding chunk allocation and
removal, including both the new behavior and the parts of the old behavior
that did not change.
CC: stable@vger.kernel.org # 5.12+
Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Tested-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Tested-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-06-29 14:43:06 +01:00
|
|
|
*
|
|
|
|
* 1) We may have enough free space in the system space_info but all the
|
|
|
|
* existing system block groups have a profile which can not be used
|
|
|
|
* for extent allocation.
|
|
|
|
*
|
|
|
|
* This happens when mounting in degraded mode. For example we have a
|
|
|
|
* RAID1 filesystem with 2 devices, lose one device and mount the fs
|
|
|
|
* using the other device in degraded mode. If we then allocate a chunk,
|
|
|
|
* we may have enough free space in the existing system space_info, but
|
|
|
|
* none of the block groups can be used for extent allocation since they
|
|
|
|
* have a RAID1 profile, and because we are in degraded mode with a
|
|
|
|
* single device, we are forced to allocate a new system chunk with a
|
|
|
|
* SINGLE profile. Making check_system_chunk() iterate over all system
|
|
|
|
* block groups and check if they have a usable profile and enough space
|
|
|
|
* can be slow on very large filesystems, so we tolerate the -ENOSPC and
|
|
|
|
* try again after forcing allocation of a new system chunk. Like this
|
|
|
|
* we avoid paying the cost of that search in normal circumstances, when
|
|
|
|
* we were not mounted in degraded mode;
|
|
|
|
*
|
|
|
|
* 2) We had enough free space info the system space_info, and one suitable
|
|
|
|
* block group to allocate from when we called check_system_chunk()
|
|
|
|
* above. However right after we called it, the only system block group
|
|
|
|
* with enough free space got turned into RO mode by a running scrub,
|
|
|
|
* and in this case we have to allocate a new one and retry. We only
|
|
|
|
* need do this allocate and retry once, since we have a transaction
|
2021-10-13 10:12:50 +01:00
|
|
|
* handle and scrub uses the commit root to search for block groups;
|
|
|
|
*
|
|
|
|
* 3) We had one system block group with enough free space when we called
|
|
|
|
* check_system_chunk(), but after that, right before we tried to
|
|
|
|
* allocate the last extent buffer we needed, a discard operation came
|
|
|
|
* in and it temporarily removed the last free space entry from the
|
|
|
|
* block group (discard removes a free space entry, discards it, and
|
|
|
|
* then adds back the entry to the block group cache).
|
btrfs: rework chunk allocation to avoid exhaustion of the system chunk array
Commit eafa4fd0ad0607 ("btrfs: fix exhaustion of the system chunk array
due to concurrent allocations") fixed a problem that resulted in
exhausting the system chunk array in the superblock when there are many
tasks allocating chunks in parallel. Basically too many tasks enter the
first phase of chunk allocation without previous tasks having finished
their second phase of allocation, resulting in too many system chunks
being allocated. That was originally observed when running the fallocate
tests of stress-ng on a PowerPC machine, using a node size of 64K.
However that commit also introduced a deadlock where a task in phase 1 of
the chunk allocation waited for another task that had allocated a system
chunk to finish its phase 2, but that other task was waiting on an extent
buffer lock held by the first task, therefore resulting in both tasks not
making any progress. That change was later reverted by a patch with the
subject "btrfs: fix deadlock with concurrent chunk allocations involving
system chunks", since there is no simple and short solution to address it
and the deadlock is relatively easy to trigger on zoned filesystems, while
the system chunk array exhaustion is not so common.
This change reworks the chunk allocation to avoid the system chunk array
exhaustion. It accomplishes that by making the first phase of chunk
allocation do the updates of the device items in the chunk btree and the
insertion of the new chunk item in the chunk btree. This is done while
under the protection of the chunk mutex (fs_info->chunk_mutex), in the
same critical section that checks for available system space, allocates
a new system chunk if needed and reserves system chunk space. This way
we do not have chunk space reserved until the second phase completes.
The same logic is applied to chunk removal as well, since it keeps
reserved system space long after it is done updating the chunk btree.
For direct allocation of system chunks, the previous behaviour remains,
because otherwise we would deadlock on extent buffers of the chunk btree.
Changes to the chunk btree are by large done by chunk allocation and chunk
removal, which first reserve chunk system space and then later do changes
to the chunk btree. The other remaining cases are uncommon and correspond
to adding a device, removing a device and resizing a device. All these
other cases do not pre-reserve system space, they modify the chunk btree
right away, so they don't hold reserved space for a long period like chunk
allocation and chunk removal do.
The diff of this change is huge, but more than half of it is just addition
of comments describing both how things work regarding chunk allocation and
removal, including both the new behavior and the parts of the old behavior
that did not change.
CC: stable@vger.kernel.org # 5.12+
Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Tested-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Tested-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-06-29 14:43:06 +01:00
|
|
|
*/
|
|
|
|
if (ret == -ENOSPC) {
|
|
|
|
const u64 sys_flags = btrfs_system_alloc_profile(trans->fs_info);
|
|
|
|
struct btrfs_block_group *sys_bg;
|
|
|
|
|
2021-08-18 13:41:19 +03:00
|
|
|
sys_bg = btrfs_create_chunk(trans, sys_flags);
|
btrfs: rework chunk allocation to avoid exhaustion of the system chunk array
Commit eafa4fd0ad0607 ("btrfs: fix exhaustion of the system chunk array
due to concurrent allocations") fixed a problem that resulted in
exhausting the system chunk array in the superblock when there are many
tasks allocating chunks in parallel. Basically too many tasks enter the
first phase of chunk allocation without previous tasks having finished
their second phase of allocation, resulting in too many system chunks
being allocated. That was originally observed when running the fallocate
tests of stress-ng on a PowerPC machine, using a node size of 64K.
However that commit also introduced a deadlock where a task in phase 1 of
the chunk allocation waited for another task that had allocated a system
chunk to finish its phase 2, but that other task was waiting on an extent
buffer lock held by the first task, therefore resulting in both tasks not
making any progress. That change was later reverted by a patch with the
subject "btrfs: fix deadlock with concurrent chunk allocations involving
system chunks", since there is no simple and short solution to address it
and the deadlock is relatively easy to trigger on zoned filesystems, while
the system chunk array exhaustion is not so common.
This change reworks the chunk allocation to avoid the system chunk array
exhaustion. It accomplishes that by making the first phase of chunk
allocation do the updates of the device items in the chunk btree and the
insertion of the new chunk item in the chunk btree. This is done while
under the protection of the chunk mutex (fs_info->chunk_mutex), in the
same critical section that checks for available system space, allocates
a new system chunk if needed and reserves system chunk space. This way
we do not have chunk space reserved until the second phase completes.
The same logic is applied to chunk removal as well, since it keeps
reserved system space long after it is done updating the chunk btree.
For direct allocation of system chunks, the previous behaviour remains,
because otherwise we would deadlock on extent buffers of the chunk btree.
Changes to the chunk btree are by large done by chunk allocation and chunk
removal, which first reserve chunk system space and then later do changes
to the chunk btree. The other remaining cases are uncommon and correspond
to adding a device, removing a device and resizing a device. All these
other cases do not pre-reserve system space, they modify the chunk btree
right away, so they don't hold reserved space for a long period like chunk
allocation and chunk removal do.
The diff of this change is huge, but more than half of it is just addition
of comments describing both how things work regarding chunk allocation and
removal, including both the new behavior and the parts of the old behavior
that did not change.
CC: stable@vger.kernel.org # 5.12+
Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Tested-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Tested-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-06-29 14:43:06 +01:00
|
|
|
if (IS_ERR(sys_bg)) {
|
|
|
|
ret = PTR_ERR(sys_bg);
|
|
|
|
btrfs_abort_transaction(trans, ret);
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
|
|
|
|
ret = btrfs_chunk_alloc_add_chunk_item(trans, sys_bg);
|
|
|
|
if (ret) {
|
|
|
|
btrfs_abort_transaction(trans, ret);
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
|
|
|
|
ret = btrfs_chunk_alloc_add_chunk_item(trans, bg);
|
|
|
|
if (ret) {
|
|
|
|
btrfs_abort_transaction(trans, ret);
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
} else if (ret) {
|
|
|
|
btrfs_abort_transaction(trans, ret);
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
out:
|
|
|
|
btrfs_trans_release_chunk_metadata(trans);
|
|
|
|
|
2022-03-22 18:11:33 +09:00
|
|
|
if (ret)
|
|
|
|
return ERR_PTR(ret);
|
|
|
|
|
|
|
|
btrfs_get_block_group(bg);
|
|
|
|
return bg;
|
btrfs: rework chunk allocation to avoid exhaustion of the system chunk array
Commit eafa4fd0ad0607 ("btrfs: fix exhaustion of the system chunk array
due to concurrent allocations") fixed a problem that resulted in
exhausting the system chunk array in the superblock when there are many
tasks allocating chunks in parallel. Basically too many tasks enter the
first phase of chunk allocation without previous tasks having finished
their second phase of allocation, resulting in too many system chunks
being allocated. That was originally observed when running the fallocate
tests of stress-ng on a PowerPC machine, using a node size of 64K.
However that commit also introduced a deadlock where a task in phase 1 of
the chunk allocation waited for another task that had allocated a system
chunk to finish its phase 2, but that other task was waiting on an extent
buffer lock held by the first task, therefore resulting in both tasks not
making any progress. That change was later reverted by a patch with the
subject "btrfs: fix deadlock with concurrent chunk allocations involving
system chunks", since there is no simple and short solution to address it
and the deadlock is relatively easy to trigger on zoned filesystems, while
the system chunk array exhaustion is not so common.
This change reworks the chunk allocation to avoid the system chunk array
exhaustion. It accomplishes that by making the first phase of chunk
allocation do the updates of the device items in the chunk btree and the
insertion of the new chunk item in the chunk btree. This is done while
under the protection of the chunk mutex (fs_info->chunk_mutex), in the
same critical section that checks for available system space, allocates
a new system chunk if needed and reserves system chunk space. This way
we do not have chunk space reserved until the second phase completes.
The same logic is applied to chunk removal as well, since it keeps
reserved system space long after it is done updating the chunk btree.
For direct allocation of system chunks, the previous behaviour remains,
because otherwise we would deadlock on extent buffers of the chunk btree.
Changes to the chunk btree are by large done by chunk allocation and chunk
removal, which first reserve chunk system space and then later do changes
to the chunk btree. The other remaining cases are uncommon and correspond
to adding a device, removing a device and resizing a device. All these
other cases do not pre-reserve system space, they modify the chunk btree
right away, so they don't hold reserved space for a long period like chunk
allocation and chunk removal do.
The diff of this change is huge, but more than half of it is just addition
of comments describing both how things work regarding chunk allocation and
removal, including both the new behavior and the parts of the old behavior
that did not change.
CC: stable@vger.kernel.org # 5.12+
Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Tested-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Tested-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-06-29 14:43:06 +01:00
|
|
|
}
|
|
|
|
|
2019-06-20 15:38:04 -04:00
|
|
|
/*
|
btrfs: rework chunk allocation to avoid exhaustion of the system chunk array
Commit eafa4fd0ad0607 ("btrfs: fix exhaustion of the system chunk array
due to concurrent allocations") fixed a problem that resulted in
exhausting the system chunk array in the superblock when there are many
tasks allocating chunks in parallel. Basically too many tasks enter the
first phase of chunk allocation without previous tasks having finished
their second phase of allocation, resulting in too many system chunks
being allocated. That was originally observed when running the fallocate
tests of stress-ng on a PowerPC machine, using a node size of 64K.
However that commit also introduced a deadlock where a task in phase 1 of
the chunk allocation waited for another task that had allocated a system
chunk to finish its phase 2, but that other task was waiting on an extent
buffer lock held by the first task, therefore resulting in both tasks not
making any progress. That change was later reverted by a patch with the
subject "btrfs: fix deadlock with concurrent chunk allocations involving
system chunks", since there is no simple and short solution to address it
and the deadlock is relatively easy to trigger on zoned filesystems, while
the system chunk array exhaustion is not so common.
This change reworks the chunk allocation to avoid the system chunk array
exhaustion. It accomplishes that by making the first phase of chunk
allocation do the updates of the device items in the chunk btree and the
insertion of the new chunk item in the chunk btree. This is done while
under the protection of the chunk mutex (fs_info->chunk_mutex), in the
same critical section that checks for available system space, allocates
a new system chunk if needed and reserves system chunk space. This way
we do not have chunk space reserved until the second phase completes.
The same logic is applied to chunk removal as well, since it keeps
reserved system space long after it is done updating the chunk btree.
For direct allocation of system chunks, the previous behaviour remains,
because otherwise we would deadlock on extent buffers of the chunk btree.
Changes to the chunk btree are by large done by chunk allocation and chunk
removal, which first reserve chunk system space and then later do changes
to the chunk btree. The other remaining cases are uncommon and correspond
to adding a device, removing a device and resizing a device. All these
other cases do not pre-reserve system space, they modify the chunk btree
right away, so they don't hold reserved space for a long period like chunk
allocation and chunk removal do.
The diff of this change is huge, but more than half of it is just addition
of comments describing both how things work regarding chunk allocation and
removal, including both the new behavior and the parts of the old behavior
that did not change.
CC: stable@vger.kernel.org # 5.12+
Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Tested-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Tested-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-06-29 14:43:06 +01:00
|
|
|
* Chunk allocation is done in 2 phases:
|
|
|
|
*
|
|
|
|
* 1) Phase 1 - through btrfs_chunk_alloc() we allocate device extents for
|
|
|
|
* the chunk, the chunk mapping, create its block group and add the items
|
|
|
|
* that belong in the chunk btree to it - more specifically, we need to
|
|
|
|
* update device items in the chunk btree and add a new chunk item to it.
|
|
|
|
*
|
|
|
|
* 2) Phase 2 - through btrfs_create_pending_block_groups(), we add the block
|
|
|
|
* group item to the extent btree and the device extent items to the devices
|
|
|
|
* btree.
|
|
|
|
*
|
|
|
|
* This is done to prevent deadlocks. For example when COWing a node from the
|
|
|
|
* extent btree we are holding a write lock on the node's parent and if we
|
|
|
|
* trigger chunk allocation and attempted to insert the new block group item
|
|
|
|
* in the extent btree right way, we could deadlock because the path for the
|
|
|
|
* insertion can include that parent node. At first glance it seems impossible
|
|
|
|
* to trigger chunk allocation after starting a transaction since tasks should
|
|
|
|
* reserve enough transaction units (metadata space), however while that is true
|
|
|
|
* most of the time, chunk allocation may still be triggered for several reasons:
|
|
|
|
*
|
|
|
|
* 1) When reserving metadata, we check if there is enough free space in the
|
|
|
|
* metadata space_info and therefore don't trigger allocation of a new chunk.
|
|
|
|
* However later when the task actually tries to COW an extent buffer from
|
|
|
|
* the extent btree or from the device btree for example, it is forced to
|
|
|
|
* allocate a new block group (chunk) because the only one that had enough
|
|
|
|
* free space was just turned to RO mode by a running scrub for example (or
|
|
|
|
* device replace, block group reclaim thread, etc), so we can not use it
|
|
|
|
* for allocating an extent and end up being forced to allocate a new one;
|
|
|
|
*
|
|
|
|
* 2) Because we only check that the metadata space_info has enough free bytes,
|
|
|
|
* we end up not allocating a new metadata chunk in that case. However if
|
|
|
|
* the filesystem was mounted in degraded mode, none of the existing block
|
|
|
|
* groups might be suitable for extent allocation due to their incompatible
|
|
|
|
* profile (for e.g. mounting a 2 devices filesystem, where all block groups
|
|
|
|
* use a RAID1 profile, in degraded mode using a single device). In this case
|
|
|
|
* when the task attempts to COW some extent buffer of the extent btree for
|
|
|
|
* example, it will trigger allocation of a new metadata block group with a
|
|
|
|
* suitable profile (SINGLE profile in the example of the degraded mount of
|
|
|
|
* the RAID1 filesystem);
|
|
|
|
*
|
|
|
|
* 3) The task has reserved enough transaction units / metadata space, but when
|
|
|
|
* it attempts to COW an extent buffer from the extent or device btree for
|
|
|
|
* example, it does not find any free extent in any metadata block group,
|
|
|
|
* therefore forced to try to allocate a new metadata block group.
|
|
|
|
* This is because some other task allocated all available extents in the
|
|
|
|
* meanwhile - this typically happens with tasks that don't reserve space
|
|
|
|
* properly, either intentionally or as a bug. One example where this is
|
|
|
|
* done intentionally is fsync, as it does not reserve any transaction units
|
|
|
|
* and ends up allocating a variable number of metadata extents for log
|
2021-10-13 10:12:50 +01:00
|
|
|
* tree extent buffers;
|
|
|
|
*
|
|
|
|
* 4) The task has reserved enough transaction units / metadata space, but right
|
|
|
|
* before it tries to allocate the last extent buffer it needs, a discard
|
|
|
|
* operation comes in and, temporarily, removes the last free space entry from
|
|
|
|
* the only metadata block group that had free space (discard starts by
|
|
|
|
* removing a free space entry from a block group, then does the discard
|
|
|
|
* operation and, once it's done, it adds back the free space entry to the
|
|
|
|
* block group).
|
btrfs: rework chunk allocation to avoid exhaustion of the system chunk array
Commit eafa4fd0ad0607 ("btrfs: fix exhaustion of the system chunk array
due to concurrent allocations") fixed a problem that resulted in
exhausting the system chunk array in the superblock when there are many
tasks allocating chunks in parallel. Basically too many tasks enter the
first phase of chunk allocation without previous tasks having finished
their second phase of allocation, resulting in too many system chunks
being allocated. That was originally observed when running the fallocate
tests of stress-ng on a PowerPC machine, using a node size of 64K.
However that commit also introduced a deadlock where a task in phase 1 of
the chunk allocation waited for another task that had allocated a system
chunk to finish its phase 2, but that other task was waiting on an extent
buffer lock held by the first task, therefore resulting in both tasks not
making any progress. That change was later reverted by a patch with the
subject "btrfs: fix deadlock with concurrent chunk allocations involving
system chunks", since there is no simple and short solution to address it
and the deadlock is relatively easy to trigger on zoned filesystems, while
the system chunk array exhaustion is not so common.
This change reworks the chunk allocation to avoid the system chunk array
exhaustion. It accomplishes that by making the first phase of chunk
allocation do the updates of the device items in the chunk btree and the
insertion of the new chunk item in the chunk btree. This is done while
under the protection of the chunk mutex (fs_info->chunk_mutex), in the
same critical section that checks for available system space, allocates
a new system chunk if needed and reserves system chunk space. This way
we do not have chunk space reserved until the second phase completes.
The same logic is applied to chunk removal as well, since it keeps
reserved system space long after it is done updating the chunk btree.
For direct allocation of system chunks, the previous behaviour remains,
because otherwise we would deadlock on extent buffers of the chunk btree.
Changes to the chunk btree are by large done by chunk allocation and chunk
removal, which first reserve chunk system space and then later do changes
to the chunk btree. The other remaining cases are uncommon and correspond
to adding a device, removing a device and resizing a device. All these
other cases do not pre-reserve system space, they modify the chunk btree
right away, so they don't hold reserved space for a long period like chunk
allocation and chunk removal do.
The diff of this change is huge, but more than half of it is just addition
of comments describing both how things work regarding chunk allocation and
removal, including both the new behavior and the parts of the old behavior
that did not change.
CC: stable@vger.kernel.org # 5.12+
Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Tested-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Tested-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-06-29 14:43:06 +01:00
|
|
|
*
|
|
|
|
* We also need this 2 phases setup when adding a device to a filesystem with
|
|
|
|
* a seed device - we must create new metadata and system chunks without adding
|
|
|
|
* any of the block group items to the chunk, extent and device btrees. If we
|
|
|
|
* did not do it this way, we would get ENOSPC when attempting to update those
|
|
|
|
* btrees, since all the chunks from the seed device are read-only.
|
|
|
|
*
|
|
|
|
* Phase 1 does the updates and insertions to the chunk btree because if we had
|
|
|
|
* it done in phase 2 and have a thundering herd of tasks allocating chunks in
|
|
|
|
* parallel, we risk having too many system chunks allocated by many tasks if
|
|
|
|
* many tasks reach phase 1 without the previous ones completing phase 2. In the
|
|
|
|
* extreme case this leads to exhaustion of the system chunk array in the
|
|
|
|
* superblock. This is easier to trigger if using a btree node/leaf size of 64K
|
|
|
|
* and with RAID filesystems (so we have more device items in the chunk btree).
|
|
|
|
* This has happened before and commit eafa4fd0ad0607 ("btrfs: fix exhaustion of
|
|
|
|
* the system chunk array due to concurrent allocations") provides more details.
|
|
|
|
*
|
btrfs: fix deadlock between chunk allocation and chunk btree modifications
When a task is doing some modification to the chunk btree and it is not in
the context of a chunk allocation or a chunk removal, it can deadlock with
another task that is currently allocating a new data or metadata chunk.
These contexts are the following:
* When relocating a system chunk, when we need to COW the extent buffers
that belong to the chunk btree;
* When adding a new device (ioctl), where we need to add a new device item
to the chunk btree;
* When removing a device (ioctl), where we need to remove a device item
from the chunk btree;
* When resizing a device (ioctl), where we need to update a device item in
the chunk btree and may need to relocate a system chunk that lies beyond
the new device size when shrinking a device.
The problem happens due to a sequence of steps like the following:
1) Task A starts a data or metadata chunk allocation and it locks the
chunk mutex;
2) Task B is relocating a system chunk, and when it needs to COW an extent
buffer of the chunk btree, it has locked both that extent buffer as
well as its parent extent buffer;
3) Since there is not enough available system space, either because none
of the existing system block groups have enough free space or because
the only one with enough free space is in RO mode due to the relocation,
task B triggers a new system chunk allocation. It blocks when trying to
acquire the chunk mutex, currently held by task A;
4) Task A enters btrfs_chunk_alloc_add_chunk_item(), in order to insert
the new chunk item into the chunk btree and update the existing device
items there. But in order to do that, it has to lock the extent buffer
that task B locked at step 2, or its parent extent buffer, but task B
is waiting on the chunk mutex, which is currently locked by task A,
therefore resulting in a deadlock.
One example report when the deadlock happens with system chunk relocation:
INFO: task kworker/u9:5:546 blocked for more than 143 seconds.
Not tainted 5.15.0-rc3+ #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:kworker/u9:5 state:D stack:25936 pid: 546 ppid: 2 flags:0x00004000
Workqueue: events_unbound btrfs_async_reclaim_metadata_space
Call Trace:
context_switch kernel/sched/core.c:4940 [inline]
__schedule+0xcd9/0x2530 kernel/sched/core.c:6287
schedule+0xd3/0x270 kernel/sched/core.c:6366
rwsem_down_read_slowpath+0x4ee/0x9d0 kernel/locking/rwsem.c:993
__down_read_common kernel/locking/rwsem.c:1214 [inline]
__down_read kernel/locking/rwsem.c:1223 [inline]
down_read_nested+0xe6/0x440 kernel/locking/rwsem.c:1590
__btrfs_tree_read_lock+0x31/0x350 fs/btrfs/locking.c:47
btrfs_tree_read_lock fs/btrfs/locking.c:54 [inline]
btrfs_read_lock_root_node+0x8a/0x320 fs/btrfs/locking.c:191
btrfs_search_slot_get_root fs/btrfs/ctree.c:1623 [inline]
btrfs_search_slot+0x13b4/0x2140 fs/btrfs/ctree.c:1728
btrfs_update_device+0x11f/0x500 fs/btrfs/volumes.c:2794
btrfs_chunk_alloc_add_chunk_item+0x34d/0xea0 fs/btrfs/volumes.c:5504
do_chunk_alloc fs/btrfs/block-group.c:3408 [inline]
btrfs_chunk_alloc+0x84d/0xf50 fs/btrfs/block-group.c:3653
flush_space+0x54e/0xd80 fs/btrfs/space-info.c:670
btrfs_async_reclaim_metadata_space+0x396/0xa90 fs/btrfs/space-info.c:953
process_one_work+0x9df/0x16d0 kernel/workqueue.c:2297
worker_thread+0x90/0xed0 kernel/workqueue.c:2444
kthread+0x3e5/0x4d0 kernel/kthread.c:319
ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:295
INFO: task syz-executor:9107 blocked for more than 143 seconds.
Not tainted 5.15.0-rc3+ #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:syz-executor state:D stack:23200 pid: 9107 ppid: 7792 flags:0x00004004
Call Trace:
context_switch kernel/sched/core.c:4940 [inline]
__schedule+0xcd9/0x2530 kernel/sched/core.c:6287
schedule+0xd3/0x270 kernel/sched/core.c:6366
schedule_preempt_disabled+0xf/0x20 kernel/sched/core.c:6425
__mutex_lock_common kernel/locking/mutex.c:669 [inline]
__mutex_lock+0xc96/0x1680 kernel/locking/mutex.c:729
btrfs_chunk_alloc+0x31a/0xf50 fs/btrfs/block-group.c:3631
find_free_extent_update_loop fs/btrfs/extent-tree.c:3986 [inline]
find_free_extent+0x25cb/0x3a30 fs/btrfs/extent-tree.c:4335
btrfs_reserve_extent+0x1f1/0x500 fs/btrfs/extent-tree.c:4415
btrfs_alloc_tree_block+0x203/0x1120 fs/btrfs/extent-tree.c:4813
__btrfs_cow_block+0x412/0x1620 fs/btrfs/ctree.c:415
btrfs_cow_block+0x2f6/0x8c0 fs/btrfs/ctree.c:570
btrfs_search_slot+0x1094/0x2140 fs/btrfs/ctree.c:1768
relocate_tree_block fs/btrfs/relocation.c:2694 [inline]
relocate_tree_blocks+0xf73/0x1770 fs/btrfs/relocation.c:2757
relocate_block_group+0x47e/0xc70 fs/btrfs/relocation.c:3673
btrfs_relocate_block_group+0x48a/0xc60 fs/btrfs/relocation.c:4070
btrfs_relocate_chunk+0x96/0x280 fs/btrfs/volumes.c:3181
__btrfs_balance fs/btrfs/volumes.c:3911 [inline]
btrfs_balance+0x1f03/0x3cd0 fs/btrfs/volumes.c:4301
btrfs_ioctl_balance+0x61e/0x800 fs/btrfs/ioctl.c:4137
btrfs_ioctl+0x39ea/0x7b70 fs/btrfs/ioctl.c:4949
vfs_ioctl fs/ioctl.c:51 [inline]
__do_sys_ioctl fs/ioctl.c:874 [inline]
__se_sys_ioctl fs/ioctl.c:860 [inline]
__x64_sys_ioctl+0x193/0x200 fs/ioctl.c:860
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x44/0xae
So fix this by making sure that whenever we try to modify the chunk btree
and we are neither in a chunk allocation context nor in a chunk remove
context, we reserve system space before modifying the chunk btree.
Reported-by: Hao Sun <sunhao.th@gmail.com>
Link: https://lore.kernel.org/linux-btrfs/CACkBjsax51i4mu6C0C3vJqQN3NR_iVuucoeG3U1HXjrgzn5FFQ@mail.gmail.com/
Fixes: 79bd37120b1495 ("btrfs: rework chunk allocation to avoid exhaustion of the system chunk array")
CC: stable@vger.kernel.org # 5.14+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-13 10:12:49 +01:00
|
|
|
* Allocation of system chunks does not happen through this function. A task that
|
|
|
|
* needs to update the chunk btree (the only btree that uses system chunks), must
|
|
|
|
* preallocate chunk space by calling either check_system_chunk() or
|
|
|
|
* btrfs_reserve_chunk_metadata() - the former is used when allocating a data or
|
|
|
|
* metadata chunk or when removing a chunk, while the later is used before doing
|
|
|
|
* a modification to the chunk btree - use cases for the later are adding,
|
|
|
|
* removing and resizing a device as well as relocation of a system chunk.
|
|
|
|
* See the comment below for more details.
|
btrfs: rework chunk allocation to avoid exhaustion of the system chunk array
Commit eafa4fd0ad0607 ("btrfs: fix exhaustion of the system chunk array
due to concurrent allocations") fixed a problem that resulted in
exhausting the system chunk array in the superblock when there are many
tasks allocating chunks in parallel. Basically too many tasks enter the
first phase of chunk allocation without previous tasks having finished
their second phase of allocation, resulting in too many system chunks
being allocated. That was originally observed when running the fallocate
tests of stress-ng on a PowerPC machine, using a node size of 64K.
However that commit also introduced a deadlock where a task in phase 1 of
the chunk allocation waited for another task that had allocated a system
chunk to finish its phase 2, but that other task was waiting on an extent
buffer lock held by the first task, therefore resulting in both tasks not
making any progress. That change was later reverted by a patch with the
subject "btrfs: fix deadlock with concurrent chunk allocations involving
system chunks", since there is no simple and short solution to address it
and the deadlock is relatively easy to trigger on zoned filesystems, while
the system chunk array exhaustion is not so common.
This change reworks the chunk allocation to avoid the system chunk array
exhaustion. It accomplishes that by making the first phase of chunk
allocation do the updates of the device items in the chunk btree and the
insertion of the new chunk item in the chunk btree. This is done while
under the protection of the chunk mutex (fs_info->chunk_mutex), in the
same critical section that checks for available system space, allocates
a new system chunk if needed and reserves system chunk space. This way
we do not have chunk space reserved until the second phase completes.
The same logic is applied to chunk removal as well, since it keeps
reserved system space long after it is done updating the chunk btree.
For direct allocation of system chunks, the previous behaviour remains,
because otherwise we would deadlock on extent buffers of the chunk btree.
Changes to the chunk btree are by large done by chunk allocation and chunk
removal, which first reserve chunk system space and then later do changes
to the chunk btree. The other remaining cases are uncommon and correspond
to adding a device, removing a device and resizing a device. All these
other cases do not pre-reserve system space, they modify the chunk btree
right away, so they don't hold reserved space for a long period like chunk
allocation and chunk removal do.
The diff of this change is huge, but more than half of it is just addition
of comments describing both how things work regarding chunk allocation and
removal, including both the new behavior and the parts of the old behavior
that did not change.
CC: stable@vger.kernel.org # 5.12+
Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Tested-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Tested-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-06-29 14:43:06 +01:00
|
|
|
*
|
|
|
|
* The reservation of system space, done through check_system_chunk(), as well
|
|
|
|
* as all the updates and insertions into the chunk btree must be done while
|
|
|
|
* holding fs_info->chunk_mutex. This is important to guarantee that while COWing
|
|
|
|
* an extent buffer from the chunks btree we never trigger allocation of a new
|
|
|
|
* system chunk, which would result in a deadlock (trying to lock twice an
|
|
|
|
* extent buffer of the chunk btree, first time before triggering the chunk
|
|
|
|
* allocation and the second time during chunk allocation while attempting to
|
|
|
|
* update the chunks btree). The system chunk array is also updated while holding
|
|
|
|
* that mutex. The same logic applies to removing chunks - we must reserve system
|
|
|
|
* space, update the chunk btree and the system chunk array in the superblock
|
|
|
|
* while holding fs_info->chunk_mutex.
|
|
|
|
*
|
|
|
|
* This function, btrfs_chunk_alloc(), belongs to phase 1.
|
|
|
|
*
|
|
|
|
* If @force is CHUNK_ALLOC_FORCE:
|
2019-06-20 15:38:04 -04:00
|
|
|
* - return 1 if it successfully allocates a chunk,
|
|
|
|
* - return errors including -ENOSPC otherwise.
|
btrfs: rework chunk allocation to avoid exhaustion of the system chunk array
Commit eafa4fd0ad0607 ("btrfs: fix exhaustion of the system chunk array
due to concurrent allocations") fixed a problem that resulted in
exhausting the system chunk array in the superblock when there are many
tasks allocating chunks in parallel. Basically too many tasks enter the
first phase of chunk allocation without previous tasks having finished
their second phase of allocation, resulting in too many system chunks
being allocated. That was originally observed when running the fallocate
tests of stress-ng on a PowerPC machine, using a node size of 64K.
However that commit also introduced a deadlock where a task in phase 1 of
the chunk allocation waited for another task that had allocated a system
chunk to finish its phase 2, but that other task was waiting on an extent
buffer lock held by the first task, therefore resulting in both tasks not
making any progress. That change was later reverted by a patch with the
subject "btrfs: fix deadlock with concurrent chunk allocations involving
system chunks", since there is no simple and short solution to address it
and the deadlock is relatively easy to trigger on zoned filesystems, while
the system chunk array exhaustion is not so common.
This change reworks the chunk allocation to avoid the system chunk array
exhaustion. It accomplishes that by making the first phase of chunk
allocation do the updates of the device items in the chunk btree and the
insertion of the new chunk item in the chunk btree. This is done while
under the protection of the chunk mutex (fs_info->chunk_mutex), in the
same critical section that checks for available system space, allocates
a new system chunk if needed and reserves system chunk space. This way
we do not have chunk space reserved until the second phase completes.
The same logic is applied to chunk removal as well, since it keeps
reserved system space long after it is done updating the chunk btree.
For direct allocation of system chunks, the previous behaviour remains,
because otherwise we would deadlock on extent buffers of the chunk btree.
Changes to the chunk btree are by large done by chunk allocation and chunk
removal, which first reserve chunk system space and then later do changes
to the chunk btree. The other remaining cases are uncommon and correspond
to adding a device, removing a device and resizing a device. All these
other cases do not pre-reserve system space, they modify the chunk btree
right away, so they don't hold reserved space for a long period like chunk
allocation and chunk removal do.
The diff of this change is huge, but more than half of it is just addition
of comments describing both how things work regarding chunk allocation and
removal, including both the new behavior and the parts of the old behavior
that did not change.
CC: stable@vger.kernel.org # 5.12+
Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Tested-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Tested-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-06-29 14:43:06 +01:00
|
|
|
* If @force is NOT CHUNK_ALLOC_FORCE:
|
2019-06-20 15:38:04 -04:00
|
|
|
* - return 0 if it doesn't need to allocate a new chunk,
|
|
|
|
* - return 1 if it successfully allocates a chunk,
|
|
|
|
* - return errors including -ENOSPC otherwise.
|
|
|
|
*/
|
|
|
|
int btrfs_chunk_alloc(struct btrfs_trans_handle *trans, u64 flags,
|
|
|
|
enum btrfs_chunk_alloc_enum force)
|
|
|
|
{
|
|
|
|
struct btrfs_fs_info *fs_info = trans->fs_info;
|
|
|
|
struct btrfs_space_info *space_info;
|
2022-03-22 18:11:33 +09:00
|
|
|
struct btrfs_block_group *ret_bg;
|
2019-06-20 15:38:04 -04:00
|
|
|
bool wait_for_alloc = false;
|
|
|
|
bool should_alloc = false;
|
2022-03-22 18:11:34 +09:00
|
|
|
bool from_extent_allocation = false;
|
2019-06-20 15:38:04 -04:00
|
|
|
int ret = 0;
|
|
|
|
|
2022-03-22 18:11:34 +09:00
|
|
|
if (force == CHUNK_ALLOC_FORCE_FOR_EXTENT) {
|
|
|
|
from_extent_allocation = true;
|
|
|
|
force = CHUNK_ALLOC_FORCE;
|
|
|
|
}
|
|
|
|
|
2019-06-20 15:38:04 -04:00
|
|
|
/* Don't re-enter if we're already allocating a chunk */
|
|
|
|
if (trans->allocating_chunk)
|
|
|
|
return -ENOSPC;
|
btrfs: rework chunk allocation to avoid exhaustion of the system chunk array
Commit eafa4fd0ad0607 ("btrfs: fix exhaustion of the system chunk array
due to concurrent allocations") fixed a problem that resulted in
exhausting the system chunk array in the superblock when there are many
tasks allocating chunks in parallel. Basically too many tasks enter the
first phase of chunk allocation without previous tasks having finished
their second phase of allocation, resulting in too many system chunks
being allocated. That was originally observed when running the fallocate
tests of stress-ng on a PowerPC machine, using a node size of 64K.
However that commit also introduced a deadlock where a task in phase 1 of
the chunk allocation waited for another task that had allocated a system
chunk to finish its phase 2, but that other task was waiting on an extent
buffer lock held by the first task, therefore resulting in both tasks not
making any progress. That change was later reverted by a patch with the
subject "btrfs: fix deadlock with concurrent chunk allocations involving
system chunks", since there is no simple and short solution to address it
and the deadlock is relatively easy to trigger on zoned filesystems, while
the system chunk array exhaustion is not so common.
This change reworks the chunk allocation to avoid the system chunk array
exhaustion. It accomplishes that by making the first phase of chunk
allocation do the updates of the device items in the chunk btree and the
insertion of the new chunk item in the chunk btree. This is done while
under the protection of the chunk mutex (fs_info->chunk_mutex), in the
same critical section that checks for available system space, allocates
a new system chunk if needed and reserves system chunk space. This way
we do not have chunk space reserved until the second phase completes.
The same logic is applied to chunk removal as well, since it keeps
reserved system space long after it is done updating the chunk btree.
For direct allocation of system chunks, the previous behaviour remains,
because otherwise we would deadlock on extent buffers of the chunk btree.
Changes to the chunk btree are by large done by chunk allocation and chunk
removal, which first reserve chunk system space and then later do changes
to the chunk btree. The other remaining cases are uncommon and correspond
to adding a device, removing a device and resizing a device. All these
other cases do not pre-reserve system space, they modify the chunk btree
right away, so they don't hold reserved space for a long period like chunk
allocation and chunk removal do.
The diff of this change is huge, but more than half of it is just addition
of comments describing both how things work regarding chunk allocation and
removal, including both the new behavior and the parts of the old behavior
that did not change.
CC: stable@vger.kernel.org # 5.12+
Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Tested-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Tested-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-06-29 14:43:06 +01:00
|
|
|
/*
|
btrfs: fix deadlock between chunk allocation and chunk btree modifications
When a task is doing some modification to the chunk btree and it is not in
the context of a chunk allocation or a chunk removal, it can deadlock with
another task that is currently allocating a new data or metadata chunk.
These contexts are the following:
* When relocating a system chunk, when we need to COW the extent buffers
that belong to the chunk btree;
* When adding a new device (ioctl), where we need to add a new device item
to the chunk btree;
* When removing a device (ioctl), where we need to remove a device item
from the chunk btree;
* When resizing a device (ioctl), where we need to update a device item in
the chunk btree and may need to relocate a system chunk that lies beyond
the new device size when shrinking a device.
The problem happens due to a sequence of steps like the following:
1) Task A starts a data or metadata chunk allocation and it locks the
chunk mutex;
2) Task B is relocating a system chunk, and when it needs to COW an extent
buffer of the chunk btree, it has locked both that extent buffer as
well as its parent extent buffer;
3) Since there is not enough available system space, either because none
of the existing system block groups have enough free space or because
the only one with enough free space is in RO mode due to the relocation,
task B triggers a new system chunk allocation. It blocks when trying to
acquire the chunk mutex, currently held by task A;
4) Task A enters btrfs_chunk_alloc_add_chunk_item(), in order to insert
the new chunk item into the chunk btree and update the existing device
items there. But in order to do that, it has to lock the extent buffer
that task B locked at step 2, or its parent extent buffer, but task B
is waiting on the chunk mutex, which is currently locked by task A,
therefore resulting in a deadlock.
One example report when the deadlock happens with system chunk relocation:
INFO: task kworker/u9:5:546 blocked for more than 143 seconds.
Not tainted 5.15.0-rc3+ #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:kworker/u9:5 state:D stack:25936 pid: 546 ppid: 2 flags:0x00004000
Workqueue: events_unbound btrfs_async_reclaim_metadata_space
Call Trace:
context_switch kernel/sched/core.c:4940 [inline]
__schedule+0xcd9/0x2530 kernel/sched/core.c:6287
schedule+0xd3/0x270 kernel/sched/core.c:6366
rwsem_down_read_slowpath+0x4ee/0x9d0 kernel/locking/rwsem.c:993
__down_read_common kernel/locking/rwsem.c:1214 [inline]
__down_read kernel/locking/rwsem.c:1223 [inline]
down_read_nested+0xe6/0x440 kernel/locking/rwsem.c:1590
__btrfs_tree_read_lock+0x31/0x350 fs/btrfs/locking.c:47
btrfs_tree_read_lock fs/btrfs/locking.c:54 [inline]
btrfs_read_lock_root_node+0x8a/0x320 fs/btrfs/locking.c:191
btrfs_search_slot_get_root fs/btrfs/ctree.c:1623 [inline]
btrfs_search_slot+0x13b4/0x2140 fs/btrfs/ctree.c:1728
btrfs_update_device+0x11f/0x500 fs/btrfs/volumes.c:2794
btrfs_chunk_alloc_add_chunk_item+0x34d/0xea0 fs/btrfs/volumes.c:5504
do_chunk_alloc fs/btrfs/block-group.c:3408 [inline]
btrfs_chunk_alloc+0x84d/0xf50 fs/btrfs/block-group.c:3653
flush_space+0x54e/0xd80 fs/btrfs/space-info.c:670
btrfs_async_reclaim_metadata_space+0x396/0xa90 fs/btrfs/space-info.c:953
process_one_work+0x9df/0x16d0 kernel/workqueue.c:2297
worker_thread+0x90/0xed0 kernel/workqueue.c:2444
kthread+0x3e5/0x4d0 kernel/kthread.c:319
ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:295
INFO: task syz-executor:9107 blocked for more than 143 seconds.
Not tainted 5.15.0-rc3+ #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:syz-executor state:D stack:23200 pid: 9107 ppid: 7792 flags:0x00004004
Call Trace:
context_switch kernel/sched/core.c:4940 [inline]
__schedule+0xcd9/0x2530 kernel/sched/core.c:6287
schedule+0xd3/0x270 kernel/sched/core.c:6366
schedule_preempt_disabled+0xf/0x20 kernel/sched/core.c:6425
__mutex_lock_common kernel/locking/mutex.c:669 [inline]
__mutex_lock+0xc96/0x1680 kernel/locking/mutex.c:729
btrfs_chunk_alloc+0x31a/0xf50 fs/btrfs/block-group.c:3631
find_free_extent_update_loop fs/btrfs/extent-tree.c:3986 [inline]
find_free_extent+0x25cb/0x3a30 fs/btrfs/extent-tree.c:4335
btrfs_reserve_extent+0x1f1/0x500 fs/btrfs/extent-tree.c:4415
btrfs_alloc_tree_block+0x203/0x1120 fs/btrfs/extent-tree.c:4813
__btrfs_cow_block+0x412/0x1620 fs/btrfs/ctree.c:415
btrfs_cow_block+0x2f6/0x8c0 fs/btrfs/ctree.c:570
btrfs_search_slot+0x1094/0x2140 fs/btrfs/ctree.c:1768
relocate_tree_block fs/btrfs/relocation.c:2694 [inline]
relocate_tree_blocks+0xf73/0x1770 fs/btrfs/relocation.c:2757
relocate_block_group+0x47e/0xc70 fs/btrfs/relocation.c:3673
btrfs_relocate_block_group+0x48a/0xc60 fs/btrfs/relocation.c:4070
btrfs_relocate_chunk+0x96/0x280 fs/btrfs/volumes.c:3181
__btrfs_balance fs/btrfs/volumes.c:3911 [inline]
btrfs_balance+0x1f03/0x3cd0 fs/btrfs/volumes.c:4301
btrfs_ioctl_balance+0x61e/0x800 fs/btrfs/ioctl.c:4137
btrfs_ioctl+0x39ea/0x7b70 fs/btrfs/ioctl.c:4949
vfs_ioctl fs/ioctl.c:51 [inline]
__do_sys_ioctl fs/ioctl.c:874 [inline]
__se_sys_ioctl fs/ioctl.c:860 [inline]
__x64_sys_ioctl+0x193/0x200 fs/ioctl.c:860
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x44/0xae
So fix this by making sure that whenever we try to modify the chunk btree
and we are neither in a chunk allocation context nor in a chunk remove
context, we reserve system space before modifying the chunk btree.
Reported-by: Hao Sun <sunhao.th@gmail.com>
Link: https://lore.kernel.org/linux-btrfs/CACkBjsax51i4mu6C0C3vJqQN3NR_iVuucoeG3U1HXjrgzn5FFQ@mail.gmail.com/
Fixes: 79bd37120b1495 ("btrfs: rework chunk allocation to avoid exhaustion of the system chunk array")
CC: stable@vger.kernel.org # 5.14+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-13 10:12:49 +01:00
|
|
|
* Allocation of system chunks can not happen through this path, as we
|
|
|
|
* could end up in a deadlock if we are allocating a data or metadata
|
|
|
|
* chunk and there is another task modifying the chunk btree.
|
|
|
|
*
|
|
|
|
* This is because while we are holding the chunk mutex, we will attempt
|
|
|
|
* to add the new chunk item to the chunk btree or update an existing
|
|
|
|
* device item in the chunk btree, while the other task that is modifying
|
|
|
|
* the chunk btree is attempting to COW an extent buffer while holding a
|
|
|
|
* lock on it and on its parent - if the COW operation triggers a system
|
|
|
|
* chunk allocation, then we can deadlock because we are holding the
|
|
|
|
* chunk mutex and we may need to access that extent buffer or its parent
|
|
|
|
* in order to add the chunk item or update a device item.
|
|
|
|
*
|
|
|
|
* Tasks that want to modify the chunk tree should reserve system space
|
|
|
|
* before updating the chunk btree, by calling either
|
|
|
|
* btrfs_reserve_chunk_metadata() or check_system_chunk().
|
|
|
|
* It's possible that after a task reserves the space, it still ends up
|
|
|
|
* here - this happens in the cases described above at do_chunk_alloc().
|
|
|
|
* The task will have to either retry or fail.
|
btrfs: rework chunk allocation to avoid exhaustion of the system chunk array
Commit eafa4fd0ad0607 ("btrfs: fix exhaustion of the system chunk array
due to concurrent allocations") fixed a problem that resulted in
exhausting the system chunk array in the superblock when there are many
tasks allocating chunks in parallel. Basically too many tasks enter the
first phase of chunk allocation without previous tasks having finished
their second phase of allocation, resulting in too many system chunks
being allocated. That was originally observed when running the fallocate
tests of stress-ng on a PowerPC machine, using a node size of 64K.
However that commit also introduced a deadlock where a task in phase 1 of
the chunk allocation waited for another task that had allocated a system
chunk to finish its phase 2, but that other task was waiting on an extent
buffer lock held by the first task, therefore resulting in both tasks not
making any progress. That change was later reverted by a patch with the
subject "btrfs: fix deadlock with concurrent chunk allocations involving
system chunks", since there is no simple and short solution to address it
and the deadlock is relatively easy to trigger on zoned filesystems, while
the system chunk array exhaustion is not so common.
This change reworks the chunk allocation to avoid the system chunk array
exhaustion. It accomplishes that by making the first phase of chunk
allocation do the updates of the device items in the chunk btree and the
insertion of the new chunk item in the chunk btree. This is done while
under the protection of the chunk mutex (fs_info->chunk_mutex), in the
same critical section that checks for available system space, allocates
a new system chunk if needed and reserves system chunk space. This way
we do not have chunk space reserved until the second phase completes.
The same logic is applied to chunk removal as well, since it keeps
reserved system space long after it is done updating the chunk btree.
For direct allocation of system chunks, the previous behaviour remains,
because otherwise we would deadlock on extent buffers of the chunk btree.
Changes to the chunk btree are by large done by chunk allocation and chunk
removal, which first reserve chunk system space and then later do changes
to the chunk btree. The other remaining cases are uncommon and correspond
to adding a device, removing a device and resizing a device. All these
other cases do not pre-reserve system space, they modify the chunk btree
right away, so they don't hold reserved space for a long period like chunk
allocation and chunk removal do.
The diff of this change is huge, but more than half of it is just addition
of comments describing both how things work regarding chunk allocation and
removal, including both the new behavior and the parts of the old behavior
that did not change.
CC: stable@vger.kernel.org # 5.12+
Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Tested-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Tested-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-06-29 14:43:06 +01:00
|
|
|
*/
|
btrfs: fix deadlock between chunk allocation and chunk btree modifications
When a task is doing some modification to the chunk btree and it is not in
the context of a chunk allocation or a chunk removal, it can deadlock with
another task that is currently allocating a new data or metadata chunk.
These contexts are the following:
* When relocating a system chunk, when we need to COW the extent buffers
that belong to the chunk btree;
* When adding a new device (ioctl), where we need to add a new device item
to the chunk btree;
* When removing a device (ioctl), where we need to remove a device item
from the chunk btree;
* When resizing a device (ioctl), where we need to update a device item in
the chunk btree and may need to relocate a system chunk that lies beyond
the new device size when shrinking a device.
The problem happens due to a sequence of steps like the following:
1) Task A starts a data or metadata chunk allocation and it locks the
chunk mutex;
2) Task B is relocating a system chunk, and when it needs to COW an extent
buffer of the chunk btree, it has locked both that extent buffer as
well as its parent extent buffer;
3) Since there is not enough available system space, either because none
of the existing system block groups have enough free space or because
the only one with enough free space is in RO mode due to the relocation,
task B triggers a new system chunk allocation. It blocks when trying to
acquire the chunk mutex, currently held by task A;
4) Task A enters btrfs_chunk_alloc_add_chunk_item(), in order to insert
the new chunk item into the chunk btree and update the existing device
items there. But in order to do that, it has to lock the extent buffer
that task B locked at step 2, or its parent extent buffer, but task B
is waiting on the chunk mutex, which is currently locked by task A,
therefore resulting in a deadlock.
One example report when the deadlock happens with system chunk relocation:
INFO: task kworker/u9:5:546 blocked for more than 143 seconds.
Not tainted 5.15.0-rc3+ #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:kworker/u9:5 state:D stack:25936 pid: 546 ppid: 2 flags:0x00004000
Workqueue: events_unbound btrfs_async_reclaim_metadata_space
Call Trace:
context_switch kernel/sched/core.c:4940 [inline]
__schedule+0xcd9/0x2530 kernel/sched/core.c:6287
schedule+0xd3/0x270 kernel/sched/core.c:6366
rwsem_down_read_slowpath+0x4ee/0x9d0 kernel/locking/rwsem.c:993
__down_read_common kernel/locking/rwsem.c:1214 [inline]
__down_read kernel/locking/rwsem.c:1223 [inline]
down_read_nested+0xe6/0x440 kernel/locking/rwsem.c:1590
__btrfs_tree_read_lock+0x31/0x350 fs/btrfs/locking.c:47
btrfs_tree_read_lock fs/btrfs/locking.c:54 [inline]
btrfs_read_lock_root_node+0x8a/0x320 fs/btrfs/locking.c:191
btrfs_search_slot_get_root fs/btrfs/ctree.c:1623 [inline]
btrfs_search_slot+0x13b4/0x2140 fs/btrfs/ctree.c:1728
btrfs_update_device+0x11f/0x500 fs/btrfs/volumes.c:2794
btrfs_chunk_alloc_add_chunk_item+0x34d/0xea0 fs/btrfs/volumes.c:5504
do_chunk_alloc fs/btrfs/block-group.c:3408 [inline]
btrfs_chunk_alloc+0x84d/0xf50 fs/btrfs/block-group.c:3653
flush_space+0x54e/0xd80 fs/btrfs/space-info.c:670
btrfs_async_reclaim_metadata_space+0x396/0xa90 fs/btrfs/space-info.c:953
process_one_work+0x9df/0x16d0 kernel/workqueue.c:2297
worker_thread+0x90/0xed0 kernel/workqueue.c:2444
kthread+0x3e5/0x4d0 kernel/kthread.c:319
ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:295
INFO: task syz-executor:9107 blocked for more than 143 seconds.
Not tainted 5.15.0-rc3+ #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:syz-executor state:D stack:23200 pid: 9107 ppid: 7792 flags:0x00004004
Call Trace:
context_switch kernel/sched/core.c:4940 [inline]
__schedule+0xcd9/0x2530 kernel/sched/core.c:6287
schedule+0xd3/0x270 kernel/sched/core.c:6366
schedule_preempt_disabled+0xf/0x20 kernel/sched/core.c:6425
__mutex_lock_common kernel/locking/mutex.c:669 [inline]
__mutex_lock+0xc96/0x1680 kernel/locking/mutex.c:729
btrfs_chunk_alloc+0x31a/0xf50 fs/btrfs/block-group.c:3631
find_free_extent_update_loop fs/btrfs/extent-tree.c:3986 [inline]
find_free_extent+0x25cb/0x3a30 fs/btrfs/extent-tree.c:4335
btrfs_reserve_extent+0x1f1/0x500 fs/btrfs/extent-tree.c:4415
btrfs_alloc_tree_block+0x203/0x1120 fs/btrfs/extent-tree.c:4813
__btrfs_cow_block+0x412/0x1620 fs/btrfs/ctree.c:415
btrfs_cow_block+0x2f6/0x8c0 fs/btrfs/ctree.c:570
btrfs_search_slot+0x1094/0x2140 fs/btrfs/ctree.c:1768
relocate_tree_block fs/btrfs/relocation.c:2694 [inline]
relocate_tree_blocks+0xf73/0x1770 fs/btrfs/relocation.c:2757
relocate_block_group+0x47e/0xc70 fs/btrfs/relocation.c:3673
btrfs_relocate_block_group+0x48a/0xc60 fs/btrfs/relocation.c:4070
btrfs_relocate_chunk+0x96/0x280 fs/btrfs/volumes.c:3181
__btrfs_balance fs/btrfs/volumes.c:3911 [inline]
btrfs_balance+0x1f03/0x3cd0 fs/btrfs/volumes.c:4301
btrfs_ioctl_balance+0x61e/0x800 fs/btrfs/ioctl.c:4137
btrfs_ioctl+0x39ea/0x7b70 fs/btrfs/ioctl.c:4949
vfs_ioctl fs/ioctl.c:51 [inline]
__do_sys_ioctl fs/ioctl.c:874 [inline]
__se_sys_ioctl fs/ioctl.c:860 [inline]
__x64_sys_ioctl+0x193/0x200 fs/ioctl.c:860
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x44/0xae
So fix this by making sure that whenever we try to modify the chunk btree
and we are neither in a chunk allocation context nor in a chunk remove
context, we reserve system space before modifying the chunk btree.
Reported-by: Hao Sun <sunhao.th@gmail.com>
Link: https://lore.kernel.org/linux-btrfs/CACkBjsax51i4mu6C0C3vJqQN3NR_iVuucoeG3U1HXjrgzn5FFQ@mail.gmail.com/
Fixes: 79bd37120b1495 ("btrfs: rework chunk allocation to avoid exhaustion of the system chunk array")
CC: stable@vger.kernel.org # 5.14+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-13 10:12:49 +01:00
|
|
|
if (flags & BTRFS_BLOCK_GROUP_SYSTEM)
|
btrfs: rework chunk allocation to avoid exhaustion of the system chunk array
Commit eafa4fd0ad0607 ("btrfs: fix exhaustion of the system chunk array
due to concurrent allocations") fixed a problem that resulted in
exhausting the system chunk array in the superblock when there are many
tasks allocating chunks in parallel. Basically too many tasks enter the
first phase of chunk allocation without previous tasks having finished
their second phase of allocation, resulting in too many system chunks
being allocated. That was originally observed when running the fallocate
tests of stress-ng on a PowerPC machine, using a node size of 64K.
However that commit also introduced a deadlock where a task in phase 1 of
the chunk allocation waited for another task that had allocated a system
chunk to finish its phase 2, but that other task was waiting on an extent
buffer lock held by the first task, therefore resulting in both tasks not
making any progress. That change was later reverted by a patch with the
subject "btrfs: fix deadlock with concurrent chunk allocations involving
system chunks", since there is no simple and short solution to address it
and the deadlock is relatively easy to trigger on zoned filesystems, while
the system chunk array exhaustion is not so common.
This change reworks the chunk allocation to avoid the system chunk array
exhaustion. It accomplishes that by making the first phase of chunk
allocation do the updates of the device items in the chunk btree and the
insertion of the new chunk item in the chunk btree. This is done while
under the protection of the chunk mutex (fs_info->chunk_mutex), in the
same critical section that checks for available system space, allocates
a new system chunk if needed and reserves system chunk space. This way
we do not have chunk space reserved until the second phase completes.
The same logic is applied to chunk removal as well, since it keeps
reserved system space long after it is done updating the chunk btree.
For direct allocation of system chunks, the previous behaviour remains,
because otherwise we would deadlock on extent buffers of the chunk btree.
Changes to the chunk btree are by large done by chunk allocation and chunk
removal, which first reserve chunk system space and then later do changes
to the chunk btree. The other remaining cases are uncommon and correspond
to adding a device, removing a device and resizing a device. All these
other cases do not pre-reserve system space, they modify the chunk btree
right away, so they don't hold reserved space for a long period like chunk
allocation and chunk removal do.
The diff of this change is huge, but more than half of it is just addition
of comments describing both how things work regarding chunk allocation and
removal, including both the new behavior and the parts of the old behavior
that did not change.
CC: stable@vger.kernel.org # 5.12+
Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Tested-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Tested-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-06-29 14:43:06 +01:00
|
|
|
return -ENOSPC;
|
2019-06-20 15:38:04 -04:00
|
|
|
|
|
|
|
space_info = btrfs_find_space_info(fs_info, flags);
|
|
|
|
ASSERT(space_info);
|
|
|
|
|
|
|
|
do {
|
|
|
|
spin_lock(&space_info->lock);
|
|
|
|
if (force < space_info->force_alloc)
|
|
|
|
force = space_info->force_alloc;
|
|
|
|
should_alloc = should_alloc_chunk(fs_info, space_info, force);
|
|
|
|
if (space_info->full) {
|
|
|
|
/* No more free physical space */
|
|
|
|
if (should_alloc)
|
|
|
|
ret = -ENOSPC;
|
|
|
|
else
|
|
|
|
ret = 0;
|
|
|
|
spin_unlock(&space_info->lock);
|
|
|
|
return ret;
|
|
|
|
} else if (!should_alloc) {
|
|
|
|
spin_unlock(&space_info->lock);
|
|
|
|
return 0;
|
|
|
|
} else if (space_info->chunk_alloc) {
|
|
|
|
/*
|
|
|
|
* Someone is already allocating, so we need to block
|
|
|
|
* until this someone is finished and then loop to
|
|
|
|
* recheck if we should continue with our allocation
|
|
|
|
* attempt.
|
|
|
|
*/
|
|
|
|
wait_for_alloc = true;
|
2022-06-13 18:31:17 -04:00
|
|
|
force = CHUNK_ALLOC_NO_FORCE;
|
2019-06-20 15:38:04 -04:00
|
|
|
spin_unlock(&space_info->lock);
|
|
|
|
mutex_lock(&fs_info->chunk_mutex);
|
|
|
|
mutex_unlock(&fs_info->chunk_mutex);
|
|
|
|
} else {
|
|
|
|
/* Proceed with allocation */
|
|
|
|
space_info->chunk_alloc = 1;
|
|
|
|
wait_for_alloc = false;
|
|
|
|
spin_unlock(&space_info->lock);
|
|
|
|
}
|
|
|
|
|
|
|
|
cond_resched();
|
|
|
|
} while (wait_for_alloc);
|
|
|
|
|
|
|
|
mutex_lock(&fs_info->chunk_mutex);
|
|
|
|
trans->allocating_chunk = true;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* If we have mixed data/metadata chunks we want to make sure we keep
|
|
|
|
* allocating mixed chunks instead of individual chunks.
|
|
|
|
*/
|
|
|
|
if (btrfs_mixed_space_info(space_info))
|
|
|
|
flags |= (BTRFS_BLOCK_GROUP_DATA | BTRFS_BLOCK_GROUP_METADATA);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* if we're doing a data chunk, go ahead and make sure that
|
|
|
|
* we keep a reasonable number of metadata chunks allocated in the
|
|
|
|
* FS as well.
|
|
|
|
*/
|
|
|
|
if (flags & BTRFS_BLOCK_GROUP_DATA && fs_info->metadata_ratio) {
|
|
|
|
fs_info->data_chunk_allocations++;
|
|
|
|
if (!(fs_info->data_chunk_allocations %
|
|
|
|
fs_info->metadata_ratio))
|
|
|
|
force_metadata_allocation(fs_info);
|
|
|
|
}
|
|
|
|
|
2022-03-22 18:11:33 +09:00
|
|
|
ret_bg = do_chunk_alloc(trans, flags);
|
2019-06-20 15:38:04 -04:00
|
|
|
trans->allocating_chunk = false;
|
|
|
|
|
2022-03-22 18:11:34 +09:00
|
|
|
if (IS_ERR(ret_bg)) {
|
2022-03-22 18:11:33 +09:00
|
|
|
ret = PTR_ERR(ret_bg);
|
2023-08-08 01:12:39 +09:00
|
|
|
} else if (from_extent_allocation && (flags & BTRFS_BLOCK_GROUP_DATA)) {
|
2022-03-22 18:11:34 +09:00
|
|
|
/*
|
|
|
|
* New block group is likely to be used soon. Try to activate
|
|
|
|
* it now. Failure is OK for now.
|
|
|
|
*/
|
|
|
|
btrfs_zone_activate(ret_bg);
|
|
|
|
}
|
|
|
|
|
|
|
|
if (!ret)
|
2022-03-22 18:11:33 +09:00
|
|
|
btrfs_put_block_group(ret_bg);
|
|
|
|
|
2019-06-20 15:38:04 -04:00
|
|
|
spin_lock(&space_info->lock);
|
|
|
|
if (ret < 0) {
|
|
|
|
if (ret == -ENOSPC)
|
|
|
|
space_info->full = 1;
|
|
|
|
else
|
|
|
|
goto out;
|
|
|
|
} else {
|
|
|
|
ret = 1;
|
|
|
|
space_info->max_extent_size = 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
space_info->force_alloc = CHUNK_ALLOC_NO_FORCE;
|
|
|
|
out:
|
|
|
|
space_info->chunk_alloc = 0;
|
|
|
|
spin_unlock(&space_info->lock);
|
|
|
|
mutex_unlock(&fs_info->chunk_mutex);
|
|
|
|
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2024-06-26 23:39:11 +02:00
|
|
|
static u64 get_profile_num_devs(const struct btrfs_fs_info *fs_info, u64 type)
|
2019-06-20 15:38:04 -04:00
|
|
|
{
|
|
|
|
u64 num_dev;
|
|
|
|
|
|
|
|
num_dev = btrfs_raid_array[btrfs_bg_flags_to_raid_index(type)].devs_max;
|
|
|
|
if (!num_dev)
|
|
|
|
num_dev = fs_info->fs_devices->rw_devices;
|
|
|
|
|
|
|
|
return num_dev;
|
|
|
|
}
|
|
|
|
|
btrfs: fix deadlock between chunk allocation and chunk btree modifications
When a task is doing some modification to the chunk btree and it is not in
the context of a chunk allocation or a chunk removal, it can deadlock with
another task that is currently allocating a new data or metadata chunk.
These contexts are the following:
* When relocating a system chunk, when we need to COW the extent buffers
that belong to the chunk btree;
* When adding a new device (ioctl), where we need to add a new device item
to the chunk btree;
* When removing a device (ioctl), where we need to remove a device item
from the chunk btree;
* When resizing a device (ioctl), where we need to update a device item in
the chunk btree and may need to relocate a system chunk that lies beyond
the new device size when shrinking a device.
The problem happens due to a sequence of steps like the following:
1) Task A starts a data or metadata chunk allocation and it locks the
chunk mutex;
2) Task B is relocating a system chunk, and when it needs to COW an extent
buffer of the chunk btree, it has locked both that extent buffer as
well as its parent extent buffer;
3) Since there is not enough available system space, either because none
of the existing system block groups have enough free space or because
the only one with enough free space is in RO mode due to the relocation,
task B triggers a new system chunk allocation. It blocks when trying to
acquire the chunk mutex, currently held by task A;
4) Task A enters btrfs_chunk_alloc_add_chunk_item(), in order to insert
the new chunk item into the chunk btree and update the existing device
items there. But in order to do that, it has to lock the extent buffer
that task B locked at step 2, or its parent extent buffer, but task B
is waiting on the chunk mutex, which is currently locked by task A,
therefore resulting in a deadlock.
One example report when the deadlock happens with system chunk relocation:
INFO: task kworker/u9:5:546 blocked for more than 143 seconds.
Not tainted 5.15.0-rc3+ #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:kworker/u9:5 state:D stack:25936 pid: 546 ppid: 2 flags:0x00004000
Workqueue: events_unbound btrfs_async_reclaim_metadata_space
Call Trace:
context_switch kernel/sched/core.c:4940 [inline]
__schedule+0xcd9/0x2530 kernel/sched/core.c:6287
schedule+0xd3/0x270 kernel/sched/core.c:6366
rwsem_down_read_slowpath+0x4ee/0x9d0 kernel/locking/rwsem.c:993
__down_read_common kernel/locking/rwsem.c:1214 [inline]
__down_read kernel/locking/rwsem.c:1223 [inline]
down_read_nested+0xe6/0x440 kernel/locking/rwsem.c:1590
__btrfs_tree_read_lock+0x31/0x350 fs/btrfs/locking.c:47
btrfs_tree_read_lock fs/btrfs/locking.c:54 [inline]
btrfs_read_lock_root_node+0x8a/0x320 fs/btrfs/locking.c:191
btrfs_search_slot_get_root fs/btrfs/ctree.c:1623 [inline]
btrfs_search_slot+0x13b4/0x2140 fs/btrfs/ctree.c:1728
btrfs_update_device+0x11f/0x500 fs/btrfs/volumes.c:2794
btrfs_chunk_alloc_add_chunk_item+0x34d/0xea0 fs/btrfs/volumes.c:5504
do_chunk_alloc fs/btrfs/block-group.c:3408 [inline]
btrfs_chunk_alloc+0x84d/0xf50 fs/btrfs/block-group.c:3653
flush_space+0x54e/0xd80 fs/btrfs/space-info.c:670
btrfs_async_reclaim_metadata_space+0x396/0xa90 fs/btrfs/space-info.c:953
process_one_work+0x9df/0x16d0 kernel/workqueue.c:2297
worker_thread+0x90/0xed0 kernel/workqueue.c:2444
kthread+0x3e5/0x4d0 kernel/kthread.c:319
ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:295
INFO: task syz-executor:9107 blocked for more than 143 seconds.
Not tainted 5.15.0-rc3+ #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:syz-executor state:D stack:23200 pid: 9107 ppid: 7792 flags:0x00004004
Call Trace:
context_switch kernel/sched/core.c:4940 [inline]
__schedule+0xcd9/0x2530 kernel/sched/core.c:6287
schedule+0xd3/0x270 kernel/sched/core.c:6366
schedule_preempt_disabled+0xf/0x20 kernel/sched/core.c:6425
__mutex_lock_common kernel/locking/mutex.c:669 [inline]
__mutex_lock+0xc96/0x1680 kernel/locking/mutex.c:729
btrfs_chunk_alloc+0x31a/0xf50 fs/btrfs/block-group.c:3631
find_free_extent_update_loop fs/btrfs/extent-tree.c:3986 [inline]
find_free_extent+0x25cb/0x3a30 fs/btrfs/extent-tree.c:4335
btrfs_reserve_extent+0x1f1/0x500 fs/btrfs/extent-tree.c:4415
btrfs_alloc_tree_block+0x203/0x1120 fs/btrfs/extent-tree.c:4813
__btrfs_cow_block+0x412/0x1620 fs/btrfs/ctree.c:415
btrfs_cow_block+0x2f6/0x8c0 fs/btrfs/ctree.c:570
btrfs_search_slot+0x1094/0x2140 fs/btrfs/ctree.c:1768
relocate_tree_block fs/btrfs/relocation.c:2694 [inline]
relocate_tree_blocks+0xf73/0x1770 fs/btrfs/relocation.c:2757
relocate_block_group+0x47e/0xc70 fs/btrfs/relocation.c:3673
btrfs_relocate_block_group+0x48a/0xc60 fs/btrfs/relocation.c:4070
btrfs_relocate_chunk+0x96/0x280 fs/btrfs/volumes.c:3181
__btrfs_balance fs/btrfs/volumes.c:3911 [inline]
btrfs_balance+0x1f03/0x3cd0 fs/btrfs/volumes.c:4301
btrfs_ioctl_balance+0x61e/0x800 fs/btrfs/ioctl.c:4137
btrfs_ioctl+0x39ea/0x7b70 fs/btrfs/ioctl.c:4949
vfs_ioctl fs/ioctl.c:51 [inline]
__do_sys_ioctl fs/ioctl.c:874 [inline]
__se_sys_ioctl fs/ioctl.c:860 [inline]
__x64_sys_ioctl+0x193/0x200 fs/ioctl.c:860
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x44/0xae
So fix this by making sure that whenever we try to modify the chunk btree
and we are neither in a chunk allocation context nor in a chunk remove
context, we reserve system space before modifying the chunk btree.
Reported-by: Hao Sun <sunhao.th@gmail.com>
Link: https://lore.kernel.org/linux-btrfs/CACkBjsax51i4mu6C0C3vJqQN3NR_iVuucoeG3U1HXjrgzn5FFQ@mail.gmail.com/
Fixes: 79bd37120b1495 ("btrfs: rework chunk allocation to avoid exhaustion of the system chunk array")
CC: stable@vger.kernel.org # 5.14+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-13 10:12:49 +01:00
|
|
|
static void reserve_chunk_space(struct btrfs_trans_handle *trans,
|
|
|
|
u64 bytes,
|
|
|
|
u64 type)
|
2019-06-20 15:38:04 -04:00
|
|
|
{
|
|
|
|
struct btrfs_fs_info *fs_info = trans->fs_info;
|
|
|
|
struct btrfs_space_info *info;
|
|
|
|
u64 left;
|
|
|
|
int ret = 0;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Needed because we can end up allocating a system chunk and for an
|
|
|
|
* atomic and race free space reservation in the chunk block reserve.
|
|
|
|
*/
|
|
|
|
lockdep_assert_held(&fs_info->chunk_mutex);
|
|
|
|
|
|
|
|
info = btrfs_find_space_info(fs_info, BTRFS_BLOCK_GROUP_SYSTEM);
|
|
|
|
spin_lock(&info->lock);
|
|
|
|
left = info->total_bytes - btrfs_space_info_used(info, true);
|
|
|
|
spin_unlock(&info->lock);
|
|
|
|
|
btrfs: fix deadlock between chunk allocation and chunk btree modifications
When a task is doing some modification to the chunk btree and it is not in
the context of a chunk allocation or a chunk removal, it can deadlock with
another task that is currently allocating a new data or metadata chunk.
These contexts are the following:
* When relocating a system chunk, when we need to COW the extent buffers
that belong to the chunk btree;
* When adding a new device (ioctl), where we need to add a new device item
to the chunk btree;
* When removing a device (ioctl), where we need to remove a device item
from the chunk btree;
* When resizing a device (ioctl), where we need to update a device item in
the chunk btree and may need to relocate a system chunk that lies beyond
the new device size when shrinking a device.
The problem happens due to a sequence of steps like the following:
1) Task A starts a data or metadata chunk allocation and it locks the
chunk mutex;
2) Task B is relocating a system chunk, and when it needs to COW an extent
buffer of the chunk btree, it has locked both that extent buffer as
well as its parent extent buffer;
3) Since there is not enough available system space, either because none
of the existing system block groups have enough free space or because
the only one with enough free space is in RO mode due to the relocation,
task B triggers a new system chunk allocation. It blocks when trying to
acquire the chunk mutex, currently held by task A;
4) Task A enters btrfs_chunk_alloc_add_chunk_item(), in order to insert
the new chunk item into the chunk btree and update the existing device
items there. But in order to do that, it has to lock the extent buffer
that task B locked at step 2, or its parent extent buffer, but task B
is waiting on the chunk mutex, which is currently locked by task A,
therefore resulting in a deadlock.
One example report when the deadlock happens with system chunk relocation:
INFO: task kworker/u9:5:546 blocked for more than 143 seconds.
Not tainted 5.15.0-rc3+ #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:kworker/u9:5 state:D stack:25936 pid: 546 ppid: 2 flags:0x00004000
Workqueue: events_unbound btrfs_async_reclaim_metadata_space
Call Trace:
context_switch kernel/sched/core.c:4940 [inline]
__schedule+0xcd9/0x2530 kernel/sched/core.c:6287
schedule+0xd3/0x270 kernel/sched/core.c:6366
rwsem_down_read_slowpath+0x4ee/0x9d0 kernel/locking/rwsem.c:993
__down_read_common kernel/locking/rwsem.c:1214 [inline]
__down_read kernel/locking/rwsem.c:1223 [inline]
down_read_nested+0xe6/0x440 kernel/locking/rwsem.c:1590
__btrfs_tree_read_lock+0x31/0x350 fs/btrfs/locking.c:47
btrfs_tree_read_lock fs/btrfs/locking.c:54 [inline]
btrfs_read_lock_root_node+0x8a/0x320 fs/btrfs/locking.c:191
btrfs_search_slot_get_root fs/btrfs/ctree.c:1623 [inline]
btrfs_search_slot+0x13b4/0x2140 fs/btrfs/ctree.c:1728
btrfs_update_device+0x11f/0x500 fs/btrfs/volumes.c:2794
btrfs_chunk_alloc_add_chunk_item+0x34d/0xea0 fs/btrfs/volumes.c:5504
do_chunk_alloc fs/btrfs/block-group.c:3408 [inline]
btrfs_chunk_alloc+0x84d/0xf50 fs/btrfs/block-group.c:3653
flush_space+0x54e/0xd80 fs/btrfs/space-info.c:670
btrfs_async_reclaim_metadata_space+0x396/0xa90 fs/btrfs/space-info.c:953
process_one_work+0x9df/0x16d0 kernel/workqueue.c:2297
worker_thread+0x90/0xed0 kernel/workqueue.c:2444
kthread+0x3e5/0x4d0 kernel/kthread.c:319
ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:295
INFO: task syz-executor:9107 blocked for more than 143 seconds.
Not tainted 5.15.0-rc3+ #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:syz-executor state:D stack:23200 pid: 9107 ppid: 7792 flags:0x00004004
Call Trace:
context_switch kernel/sched/core.c:4940 [inline]
__schedule+0xcd9/0x2530 kernel/sched/core.c:6287
schedule+0xd3/0x270 kernel/sched/core.c:6366
schedule_preempt_disabled+0xf/0x20 kernel/sched/core.c:6425
__mutex_lock_common kernel/locking/mutex.c:669 [inline]
__mutex_lock+0xc96/0x1680 kernel/locking/mutex.c:729
btrfs_chunk_alloc+0x31a/0xf50 fs/btrfs/block-group.c:3631
find_free_extent_update_loop fs/btrfs/extent-tree.c:3986 [inline]
find_free_extent+0x25cb/0x3a30 fs/btrfs/extent-tree.c:4335
btrfs_reserve_extent+0x1f1/0x500 fs/btrfs/extent-tree.c:4415
btrfs_alloc_tree_block+0x203/0x1120 fs/btrfs/extent-tree.c:4813
__btrfs_cow_block+0x412/0x1620 fs/btrfs/ctree.c:415
btrfs_cow_block+0x2f6/0x8c0 fs/btrfs/ctree.c:570
btrfs_search_slot+0x1094/0x2140 fs/btrfs/ctree.c:1768
relocate_tree_block fs/btrfs/relocation.c:2694 [inline]
relocate_tree_blocks+0xf73/0x1770 fs/btrfs/relocation.c:2757
relocate_block_group+0x47e/0xc70 fs/btrfs/relocation.c:3673
btrfs_relocate_block_group+0x48a/0xc60 fs/btrfs/relocation.c:4070
btrfs_relocate_chunk+0x96/0x280 fs/btrfs/volumes.c:3181
__btrfs_balance fs/btrfs/volumes.c:3911 [inline]
btrfs_balance+0x1f03/0x3cd0 fs/btrfs/volumes.c:4301
btrfs_ioctl_balance+0x61e/0x800 fs/btrfs/ioctl.c:4137
btrfs_ioctl+0x39ea/0x7b70 fs/btrfs/ioctl.c:4949
vfs_ioctl fs/ioctl.c:51 [inline]
__do_sys_ioctl fs/ioctl.c:874 [inline]
__se_sys_ioctl fs/ioctl.c:860 [inline]
__x64_sys_ioctl+0x193/0x200 fs/ioctl.c:860
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x44/0xae
So fix this by making sure that whenever we try to modify the chunk btree
and we are neither in a chunk allocation context nor in a chunk remove
context, we reserve system space before modifying the chunk btree.
Reported-by: Hao Sun <sunhao.th@gmail.com>
Link: https://lore.kernel.org/linux-btrfs/CACkBjsax51i4mu6C0C3vJqQN3NR_iVuucoeG3U1HXjrgzn5FFQ@mail.gmail.com/
Fixes: 79bd37120b1495 ("btrfs: rework chunk allocation to avoid exhaustion of the system chunk array")
CC: stable@vger.kernel.org # 5.14+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-13 10:12:49 +01:00
|
|
|
if (left < bytes && btrfs_test_opt(fs_info, ENOSPC_DEBUG)) {
|
2019-06-20 15:38:04 -04:00
|
|
|
btrfs_info(fs_info, "left=%llu, need=%llu, flags=%llu",
|
btrfs: fix deadlock between chunk allocation and chunk btree modifications
When a task is doing some modification to the chunk btree and it is not in
the context of a chunk allocation or a chunk removal, it can deadlock with
another task that is currently allocating a new data or metadata chunk.
These contexts are the following:
* When relocating a system chunk, when we need to COW the extent buffers
that belong to the chunk btree;
* When adding a new device (ioctl), where we need to add a new device item
to the chunk btree;
* When removing a device (ioctl), where we need to remove a device item
from the chunk btree;
* When resizing a device (ioctl), where we need to update a device item in
the chunk btree and may need to relocate a system chunk that lies beyond
the new device size when shrinking a device.
The problem happens due to a sequence of steps like the following:
1) Task A starts a data or metadata chunk allocation and it locks the
chunk mutex;
2) Task B is relocating a system chunk, and when it needs to COW an extent
buffer of the chunk btree, it has locked both that extent buffer as
well as its parent extent buffer;
3) Since there is not enough available system space, either because none
of the existing system block groups have enough free space or because
the only one with enough free space is in RO mode due to the relocation,
task B triggers a new system chunk allocation. It blocks when trying to
acquire the chunk mutex, currently held by task A;
4) Task A enters btrfs_chunk_alloc_add_chunk_item(), in order to insert
the new chunk item into the chunk btree and update the existing device
items there. But in order to do that, it has to lock the extent buffer
that task B locked at step 2, or its parent extent buffer, but task B
is waiting on the chunk mutex, which is currently locked by task A,
therefore resulting in a deadlock.
One example report when the deadlock happens with system chunk relocation:
INFO: task kworker/u9:5:546 blocked for more than 143 seconds.
Not tainted 5.15.0-rc3+ #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:kworker/u9:5 state:D stack:25936 pid: 546 ppid: 2 flags:0x00004000
Workqueue: events_unbound btrfs_async_reclaim_metadata_space
Call Trace:
context_switch kernel/sched/core.c:4940 [inline]
__schedule+0xcd9/0x2530 kernel/sched/core.c:6287
schedule+0xd3/0x270 kernel/sched/core.c:6366
rwsem_down_read_slowpath+0x4ee/0x9d0 kernel/locking/rwsem.c:993
__down_read_common kernel/locking/rwsem.c:1214 [inline]
__down_read kernel/locking/rwsem.c:1223 [inline]
down_read_nested+0xe6/0x440 kernel/locking/rwsem.c:1590
__btrfs_tree_read_lock+0x31/0x350 fs/btrfs/locking.c:47
btrfs_tree_read_lock fs/btrfs/locking.c:54 [inline]
btrfs_read_lock_root_node+0x8a/0x320 fs/btrfs/locking.c:191
btrfs_search_slot_get_root fs/btrfs/ctree.c:1623 [inline]
btrfs_search_slot+0x13b4/0x2140 fs/btrfs/ctree.c:1728
btrfs_update_device+0x11f/0x500 fs/btrfs/volumes.c:2794
btrfs_chunk_alloc_add_chunk_item+0x34d/0xea0 fs/btrfs/volumes.c:5504
do_chunk_alloc fs/btrfs/block-group.c:3408 [inline]
btrfs_chunk_alloc+0x84d/0xf50 fs/btrfs/block-group.c:3653
flush_space+0x54e/0xd80 fs/btrfs/space-info.c:670
btrfs_async_reclaim_metadata_space+0x396/0xa90 fs/btrfs/space-info.c:953
process_one_work+0x9df/0x16d0 kernel/workqueue.c:2297
worker_thread+0x90/0xed0 kernel/workqueue.c:2444
kthread+0x3e5/0x4d0 kernel/kthread.c:319
ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:295
INFO: task syz-executor:9107 blocked for more than 143 seconds.
Not tainted 5.15.0-rc3+ #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:syz-executor state:D stack:23200 pid: 9107 ppid: 7792 flags:0x00004004
Call Trace:
context_switch kernel/sched/core.c:4940 [inline]
__schedule+0xcd9/0x2530 kernel/sched/core.c:6287
schedule+0xd3/0x270 kernel/sched/core.c:6366
schedule_preempt_disabled+0xf/0x20 kernel/sched/core.c:6425
__mutex_lock_common kernel/locking/mutex.c:669 [inline]
__mutex_lock+0xc96/0x1680 kernel/locking/mutex.c:729
btrfs_chunk_alloc+0x31a/0xf50 fs/btrfs/block-group.c:3631
find_free_extent_update_loop fs/btrfs/extent-tree.c:3986 [inline]
find_free_extent+0x25cb/0x3a30 fs/btrfs/extent-tree.c:4335
btrfs_reserve_extent+0x1f1/0x500 fs/btrfs/extent-tree.c:4415
btrfs_alloc_tree_block+0x203/0x1120 fs/btrfs/extent-tree.c:4813
__btrfs_cow_block+0x412/0x1620 fs/btrfs/ctree.c:415
btrfs_cow_block+0x2f6/0x8c0 fs/btrfs/ctree.c:570
btrfs_search_slot+0x1094/0x2140 fs/btrfs/ctree.c:1768
relocate_tree_block fs/btrfs/relocation.c:2694 [inline]
relocate_tree_blocks+0xf73/0x1770 fs/btrfs/relocation.c:2757
relocate_block_group+0x47e/0xc70 fs/btrfs/relocation.c:3673
btrfs_relocate_block_group+0x48a/0xc60 fs/btrfs/relocation.c:4070
btrfs_relocate_chunk+0x96/0x280 fs/btrfs/volumes.c:3181
__btrfs_balance fs/btrfs/volumes.c:3911 [inline]
btrfs_balance+0x1f03/0x3cd0 fs/btrfs/volumes.c:4301
btrfs_ioctl_balance+0x61e/0x800 fs/btrfs/ioctl.c:4137
btrfs_ioctl+0x39ea/0x7b70 fs/btrfs/ioctl.c:4949
vfs_ioctl fs/ioctl.c:51 [inline]
__do_sys_ioctl fs/ioctl.c:874 [inline]
__se_sys_ioctl fs/ioctl.c:860 [inline]
__x64_sys_ioctl+0x193/0x200 fs/ioctl.c:860
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x44/0xae
So fix this by making sure that whenever we try to modify the chunk btree
and we are neither in a chunk allocation context nor in a chunk remove
context, we reserve system space before modifying the chunk btree.
Reported-by: Hao Sun <sunhao.th@gmail.com>
Link: https://lore.kernel.org/linux-btrfs/CACkBjsax51i4mu6C0C3vJqQN3NR_iVuucoeG3U1HXjrgzn5FFQ@mail.gmail.com/
Fixes: 79bd37120b1495 ("btrfs: rework chunk allocation to avoid exhaustion of the system chunk array")
CC: stable@vger.kernel.org # 5.14+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-13 10:12:49 +01:00
|
|
|
left, bytes, type);
|
2019-06-20 15:38:04 -04:00
|
|
|
btrfs_dump_space_info(fs_info, info, 0, 0);
|
|
|
|
}
|
|
|
|
|
btrfs: fix deadlock between chunk allocation and chunk btree modifications
When a task is doing some modification to the chunk btree and it is not in
the context of a chunk allocation or a chunk removal, it can deadlock with
another task that is currently allocating a new data or metadata chunk.
These contexts are the following:
* When relocating a system chunk, when we need to COW the extent buffers
that belong to the chunk btree;
* When adding a new device (ioctl), where we need to add a new device item
to the chunk btree;
* When removing a device (ioctl), where we need to remove a device item
from the chunk btree;
* When resizing a device (ioctl), where we need to update a device item in
the chunk btree and may need to relocate a system chunk that lies beyond
the new device size when shrinking a device.
The problem happens due to a sequence of steps like the following:
1) Task A starts a data or metadata chunk allocation and it locks the
chunk mutex;
2) Task B is relocating a system chunk, and when it needs to COW an extent
buffer of the chunk btree, it has locked both that extent buffer as
well as its parent extent buffer;
3) Since there is not enough available system space, either because none
of the existing system block groups have enough free space or because
the only one with enough free space is in RO mode due to the relocation,
task B triggers a new system chunk allocation. It blocks when trying to
acquire the chunk mutex, currently held by task A;
4) Task A enters btrfs_chunk_alloc_add_chunk_item(), in order to insert
the new chunk item into the chunk btree and update the existing device
items there. But in order to do that, it has to lock the extent buffer
that task B locked at step 2, or its parent extent buffer, but task B
is waiting on the chunk mutex, which is currently locked by task A,
therefore resulting in a deadlock.
One example report when the deadlock happens with system chunk relocation:
INFO: task kworker/u9:5:546 blocked for more than 143 seconds.
Not tainted 5.15.0-rc3+ #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:kworker/u9:5 state:D stack:25936 pid: 546 ppid: 2 flags:0x00004000
Workqueue: events_unbound btrfs_async_reclaim_metadata_space
Call Trace:
context_switch kernel/sched/core.c:4940 [inline]
__schedule+0xcd9/0x2530 kernel/sched/core.c:6287
schedule+0xd3/0x270 kernel/sched/core.c:6366
rwsem_down_read_slowpath+0x4ee/0x9d0 kernel/locking/rwsem.c:993
__down_read_common kernel/locking/rwsem.c:1214 [inline]
__down_read kernel/locking/rwsem.c:1223 [inline]
down_read_nested+0xe6/0x440 kernel/locking/rwsem.c:1590
__btrfs_tree_read_lock+0x31/0x350 fs/btrfs/locking.c:47
btrfs_tree_read_lock fs/btrfs/locking.c:54 [inline]
btrfs_read_lock_root_node+0x8a/0x320 fs/btrfs/locking.c:191
btrfs_search_slot_get_root fs/btrfs/ctree.c:1623 [inline]
btrfs_search_slot+0x13b4/0x2140 fs/btrfs/ctree.c:1728
btrfs_update_device+0x11f/0x500 fs/btrfs/volumes.c:2794
btrfs_chunk_alloc_add_chunk_item+0x34d/0xea0 fs/btrfs/volumes.c:5504
do_chunk_alloc fs/btrfs/block-group.c:3408 [inline]
btrfs_chunk_alloc+0x84d/0xf50 fs/btrfs/block-group.c:3653
flush_space+0x54e/0xd80 fs/btrfs/space-info.c:670
btrfs_async_reclaim_metadata_space+0x396/0xa90 fs/btrfs/space-info.c:953
process_one_work+0x9df/0x16d0 kernel/workqueue.c:2297
worker_thread+0x90/0xed0 kernel/workqueue.c:2444
kthread+0x3e5/0x4d0 kernel/kthread.c:319
ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:295
INFO: task syz-executor:9107 blocked for more than 143 seconds.
Not tainted 5.15.0-rc3+ #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:syz-executor state:D stack:23200 pid: 9107 ppid: 7792 flags:0x00004004
Call Trace:
context_switch kernel/sched/core.c:4940 [inline]
__schedule+0xcd9/0x2530 kernel/sched/core.c:6287
schedule+0xd3/0x270 kernel/sched/core.c:6366
schedule_preempt_disabled+0xf/0x20 kernel/sched/core.c:6425
__mutex_lock_common kernel/locking/mutex.c:669 [inline]
__mutex_lock+0xc96/0x1680 kernel/locking/mutex.c:729
btrfs_chunk_alloc+0x31a/0xf50 fs/btrfs/block-group.c:3631
find_free_extent_update_loop fs/btrfs/extent-tree.c:3986 [inline]
find_free_extent+0x25cb/0x3a30 fs/btrfs/extent-tree.c:4335
btrfs_reserve_extent+0x1f1/0x500 fs/btrfs/extent-tree.c:4415
btrfs_alloc_tree_block+0x203/0x1120 fs/btrfs/extent-tree.c:4813
__btrfs_cow_block+0x412/0x1620 fs/btrfs/ctree.c:415
btrfs_cow_block+0x2f6/0x8c0 fs/btrfs/ctree.c:570
btrfs_search_slot+0x1094/0x2140 fs/btrfs/ctree.c:1768
relocate_tree_block fs/btrfs/relocation.c:2694 [inline]
relocate_tree_blocks+0xf73/0x1770 fs/btrfs/relocation.c:2757
relocate_block_group+0x47e/0xc70 fs/btrfs/relocation.c:3673
btrfs_relocate_block_group+0x48a/0xc60 fs/btrfs/relocation.c:4070
btrfs_relocate_chunk+0x96/0x280 fs/btrfs/volumes.c:3181
__btrfs_balance fs/btrfs/volumes.c:3911 [inline]
btrfs_balance+0x1f03/0x3cd0 fs/btrfs/volumes.c:4301
btrfs_ioctl_balance+0x61e/0x800 fs/btrfs/ioctl.c:4137
btrfs_ioctl+0x39ea/0x7b70 fs/btrfs/ioctl.c:4949
vfs_ioctl fs/ioctl.c:51 [inline]
__do_sys_ioctl fs/ioctl.c:874 [inline]
__se_sys_ioctl fs/ioctl.c:860 [inline]
__x64_sys_ioctl+0x193/0x200 fs/ioctl.c:860
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x44/0xae
So fix this by making sure that whenever we try to modify the chunk btree
and we are neither in a chunk allocation context nor in a chunk remove
context, we reserve system space before modifying the chunk btree.
Reported-by: Hao Sun <sunhao.th@gmail.com>
Link: https://lore.kernel.org/linux-btrfs/CACkBjsax51i4mu6C0C3vJqQN3NR_iVuucoeG3U1HXjrgzn5FFQ@mail.gmail.com/
Fixes: 79bd37120b1495 ("btrfs: rework chunk allocation to avoid exhaustion of the system chunk array")
CC: stable@vger.kernel.org # 5.14+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-13 10:12:49 +01:00
|
|
|
if (left < bytes) {
|
2019-06-20 15:38:04 -04:00
|
|
|
u64 flags = btrfs_system_alloc_profile(fs_info);
|
btrfs: rework chunk allocation to avoid exhaustion of the system chunk array
Commit eafa4fd0ad0607 ("btrfs: fix exhaustion of the system chunk array
due to concurrent allocations") fixed a problem that resulted in
exhausting the system chunk array in the superblock when there are many
tasks allocating chunks in parallel. Basically too many tasks enter the
first phase of chunk allocation without previous tasks having finished
their second phase of allocation, resulting in too many system chunks
being allocated. That was originally observed when running the fallocate
tests of stress-ng on a PowerPC machine, using a node size of 64K.
However that commit also introduced a deadlock where a task in phase 1 of
the chunk allocation waited for another task that had allocated a system
chunk to finish its phase 2, but that other task was waiting on an extent
buffer lock held by the first task, therefore resulting in both tasks not
making any progress. That change was later reverted by a patch with the
subject "btrfs: fix deadlock with concurrent chunk allocations involving
system chunks", since there is no simple and short solution to address it
and the deadlock is relatively easy to trigger on zoned filesystems, while
the system chunk array exhaustion is not so common.
This change reworks the chunk allocation to avoid the system chunk array
exhaustion. It accomplishes that by making the first phase of chunk
allocation do the updates of the device items in the chunk btree and the
insertion of the new chunk item in the chunk btree. This is done while
under the protection of the chunk mutex (fs_info->chunk_mutex), in the
same critical section that checks for available system space, allocates
a new system chunk if needed and reserves system chunk space. This way
we do not have chunk space reserved until the second phase completes.
The same logic is applied to chunk removal as well, since it keeps
reserved system space long after it is done updating the chunk btree.
For direct allocation of system chunks, the previous behaviour remains,
because otherwise we would deadlock on extent buffers of the chunk btree.
Changes to the chunk btree are by large done by chunk allocation and chunk
removal, which first reserve chunk system space and then later do changes
to the chunk btree. The other remaining cases are uncommon and correspond
to adding a device, removing a device and resizing a device. All these
other cases do not pre-reserve system space, they modify the chunk btree
right away, so they don't hold reserved space for a long period like chunk
allocation and chunk removal do.
The diff of this change is huge, but more than half of it is just addition
of comments describing both how things work regarding chunk allocation and
removal, including both the new behavior and the parts of the old behavior
that did not change.
CC: stable@vger.kernel.org # 5.12+
Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Tested-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Tested-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-06-29 14:43:06 +01:00
|
|
|
struct btrfs_block_group *bg;
|
2019-06-20 15:38:04 -04:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Ignore failure to create system chunk. We might end up not
|
|
|
|
* needing it, as we might not need to COW all nodes/leafs from
|
|
|
|
* the paths we visit in the chunk tree (they were already COWed
|
|
|
|
* or created in the current transaction for example).
|
|
|
|
*/
|
2021-08-18 13:41:19 +03:00
|
|
|
bg = btrfs_create_chunk(trans, flags);
|
btrfs: rework chunk allocation to avoid exhaustion of the system chunk array
Commit eafa4fd0ad0607 ("btrfs: fix exhaustion of the system chunk array
due to concurrent allocations") fixed a problem that resulted in
exhausting the system chunk array in the superblock when there are many
tasks allocating chunks in parallel. Basically too many tasks enter the
first phase of chunk allocation without previous tasks having finished
their second phase of allocation, resulting in too many system chunks
being allocated. That was originally observed when running the fallocate
tests of stress-ng on a PowerPC machine, using a node size of 64K.
However that commit also introduced a deadlock where a task in phase 1 of
the chunk allocation waited for another task that had allocated a system
chunk to finish its phase 2, but that other task was waiting on an extent
buffer lock held by the first task, therefore resulting in both tasks not
making any progress. That change was later reverted by a patch with the
subject "btrfs: fix deadlock with concurrent chunk allocations involving
system chunks", since there is no simple and short solution to address it
and the deadlock is relatively easy to trigger on zoned filesystems, while
the system chunk array exhaustion is not so common.
This change reworks the chunk allocation to avoid the system chunk array
exhaustion. It accomplishes that by making the first phase of chunk
allocation do the updates of the device items in the chunk btree and the
insertion of the new chunk item in the chunk btree. This is done while
under the protection of the chunk mutex (fs_info->chunk_mutex), in the
same critical section that checks for available system space, allocates
a new system chunk if needed and reserves system chunk space. This way
we do not have chunk space reserved until the second phase completes.
The same logic is applied to chunk removal as well, since it keeps
reserved system space long after it is done updating the chunk btree.
For direct allocation of system chunks, the previous behaviour remains,
because otherwise we would deadlock on extent buffers of the chunk btree.
Changes to the chunk btree are by large done by chunk allocation and chunk
removal, which first reserve chunk system space and then later do changes
to the chunk btree. The other remaining cases are uncommon and correspond
to adding a device, removing a device and resizing a device. All these
other cases do not pre-reserve system space, they modify the chunk btree
right away, so they don't hold reserved space for a long period like chunk
allocation and chunk removal do.
The diff of this change is huge, but more than half of it is just addition
of comments describing both how things work regarding chunk allocation and
removal, including both the new behavior and the parts of the old behavior
that did not change.
CC: stable@vger.kernel.org # 5.12+
Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Tested-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Tested-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-06-29 14:43:06 +01:00
|
|
|
if (IS_ERR(bg)) {
|
|
|
|
ret = PTR_ERR(bg);
|
btrfs: fix deadlock between chunk allocation and chunk btree modifications
When a task is doing some modification to the chunk btree and it is not in
the context of a chunk allocation or a chunk removal, it can deadlock with
another task that is currently allocating a new data or metadata chunk.
These contexts are the following:
* When relocating a system chunk, when we need to COW the extent buffers
that belong to the chunk btree;
* When adding a new device (ioctl), where we need to add a new device item
to the chunk btree;
* When removing a device (ioctl), where we need to remove a device item
from the chunk btree;
* When resizing a device (ioctl), where we need to update a device item in
the chunk btree and may need to relocate a system chunk that lies beyond
the new device size when shrinking a device.
The problem happens due to a sequence of steps like the following:
1) Task A starts a data or metadata chunk allocation and it locks the
chunk mutex;
2) Task B is relocating a system chunk, and when it needs to COW an extent
buffer of the chunk btree, it has locked both that extent buffer as
well as its parent extent buffer;
3) Since there is not enough available system space, either because none
of the existing system block groups have enough free space or because
the only one with enough free space is in RO mode due to the relocation,
task B triggers a new system chunk allocation. It blocks when trying to
acquire the chunk mutex, currently held by task A;
4) Task A enters btrfs_chunk_alloc_add_chunk_item(), in order to insert
the new chunk item into the chunk btree and update the existing device
items there. But in order to do that, it has to lock the extent buffer
that task B locked at step 2, or its parent extent buffer, but task B
is waiting on the chunk mutex, which is currently locked by task A,
therefore resulting in a deadlock.
One example report when the deadlock happens with system chunk relocation:
INFO: task kworker/u9:5:546 blocked for more than 143 seconds.
Not tainted 5.15.0-rc3+ #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:kworker/u9:5 state:D stack:25936 pid: 546 ppid: 2 flags:0x00004000
Workqueue: events_unbound btrfs_async_reclaim_metadata_space
Call Trace:
context_switch kernel/sched/core.c:4940 [inline]
__schedule+0xcd9/0x2530 kernel/sched/core.c:6287
schedule+0xd3/0x270 kernel/sched/core.c:6366
rwsem_down_read_slowpath+0x4ee/0x9d0 kernel/locking/rwsem.c:993
__down_read_common kernel/locking/rwsem.c:1214 [inline]
__down_read kernel/locking/rwsem.c:1223 [inline]
down_read_nested+0xe6/0x440 kernel/locking/rwsem.c:1590
__btrfs_tree_read_lock+0x31/0x350 fs/btrfs/locking.c:47
btrfs_tree_read_lock fs/btrfs/locking.c:54 [inline]
btrfs_read_lock_root_node+0x8a/0x320 fs/btrfs/locking.c:191
btrfs_search_slot_get_root fs/btrfs/ctree.c:1623 [inline]
btrfs_search_slot+0x13b4/0x2140 fs/btrfs/ctree.c:1728
btrfs_update_device+0x11f/0x500 fs/btrfs/volumes.c:2794
btrfs_chunk_alloc_add_chunk_item+0x34d/0xea0 fs/btrfs/volumes.c:5504
do_chunk_alloc fs/btrfs/block-group.c:3408 [inline]
btrfs_chunk_alloc+0x84d/0xf50 fs/btrfs/block-group.c:3653
flush_space+0x54e/0xd80 fs/btrfs/space-info.c:670
btrfs_async_reclaim_metadata_space+0x396/0xa90 fs/btrfs/space-info.c:953
process_one_work+0x9df/0x16d0 kernel/workqueue.c:2297
worker_thread+0x90/0xed0 kernel/workqueue.c:2444
kthread+0x3e5/0x4d0 kernel/kthread.c:319
ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:295
INFO: task syz-executor:9107 blocked for more than 143 seconds.
Not tainted 5.15.0-rc3+ #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:syz-executor state:D stack:23200 pid: 9107 ppid: 7792 flags:0x00004004
Call Trace:
context_switch kernel/sched/core.c:4940 [inline]
__schedule+0xcd9/0x2530 kernel/sched/core.c:6287
schedule+0xd3/0x270 kernel/sched/core.c:6366
schedule_preempt_disabled+0xf/0x20 kernel/sched/core.c:6425
__mutex_lock_common kernel/locking/mutex.c:669 [inline]
__mutex_lock+0xc96/0x1680 kernel/locking/mutex.c:729
btrfs_chunk_alloc+0x31a/0xf50 fs/btrfs/block-group.c:3631
find_free_extent_update_loop fs/btrfs/extent-tree.c:3986 [inline]
find_free_extent+0x25cb/0x3a30 fs/btrfs/extent-tree.c:4335
btrfs_reserve_extent+0x1f1/0x500 fs/btrfs/extent-tree.c:4415
btrfs_alloc_tree_block+0x203/0x1120 fs/btrfs/extent-tree.c:4813
__btrfs_cow_block+0x412/0x1620 fs/btrfs/ctree.c:415
btrfs_cow_block+0x2f6/0x8c0 fs/btrfs/ctree.c:570
btrfs_search_slot+0x1094/0x2140 fs/btrfs/ctree.c:1768
relocate_tree_block fs/btrfs/relocation.c:2694 [inline]
relocate_tree_blocks+0xf73/0x1770 fs/btrfs/relocation.c:2757
relocate_block_group+0x47e/0xc70 fs/btrfs/relocation.c:3673
btrfs_relocate_block_group+0x48a/0xc60 fs/btrfs/relocation.c:4070
btrfs_relocate_chunk+0x96/0x280 fs/btrfs/volumes.c:3181
__btrfs_balance fs/btrfs/volumes.c:3911 [inline]
btrfs_balance+0x1f03/0x3cd0 fs/btrfs/volumes.c:4301
btrfs_ioctl_balance+0x61e/0x800 fs/btrfs/ioctl.c:4137
btrfs_ioctl+0x39ea/0x7b70 fs/btrfs/ioctl.c:4949
vfs_ioctl fs/ioctl.c:51 [inline]
__do_sys_ioctl fs/ioctl.c:874 [inline]
__se_sys_ioctl fs/ioctl.c:860 [inline]
__x64_sys_ioctl+0x193/0x200 fs/ioctl.c:860
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x44/0xae
So fix this by making sure that whenever we try to modify the chunk btree
and we are neither in a chunk allocation context nor in a chunk remove
context, we reserve system space before modifying the chunk btree.
Reported-by: Hao Sun <sunhao.th@gmail.com>
Link: https://lore.kernel.org/linux-btrfs/CACkBjsax51i4mu6C0C3vJqQN3NR_iVuucoeG3U1HXjrgzn5FFQ@mail.gmail.com/
Fixes: 79bd37120b1495 ("btrfs: rework chunk allocation to avoid exhaustion of the system chunk array")
CC: stable@vger.kernel.org # 5.14+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-13 10:12:49 +01:00
|
|
|
} else {
|
2022-07-09 08:18:48 +09:00
|
|
|
/*
|
|
|
|
* We have a new chunk. We also need to activate it for
|
|
|
|
* zoned filesystem.
|
|
|
|
*/
|
|
|
|
ret = btrfs_zoned_activate_one_bg(fs_info, info, true);
|
|
|
|
if (ret < 0)
|
|
|
|
return;
|
|
|
|
|
btrfs: rework chunk allocation to avoid exhaustion of the system chunk array
Commit eafa4fd0ad0607 ("btrfs: fix exhaustion of the system chunk array
due to concurrent allocations") fixed a problem that resulted in
exhausting the system chunk array in the superblock when there are many
tasks allocating chunks in parallel. Basically too many tasks enter the
first phase of chunk allocation without previous tasks having finished
their second phase of allocation, resulting in too many system chunks
being allocated. That was originally observed when running the fallocate
tests of stress-ng on a PowerPC machine, using a node size of 64K.
However that commit also introduced a deadlock where a task in phase 1 of
the chunk allocation waited for another task that had allocated a system
chunk to finish its phase 2, but that other task was waiting on an extent
buffer lock held by the first task, therefore resulting in both tasks not
making any progress. That change was later reverted by a patch with the
subject "btrfs: fix deadlock with concurrent chunk allocations involving
system chunks", since there is no simple and short solution to address it
and the deadlock is relatively easy to trigger on zoned filesystems, while
the system chunk array exhaustion is not so common.
This change reworks the chunk allocation to avoid the system chunk array
exhaustion. It accomplishes that by making the first phase of chunk
allocation do the updates of the device items in the chunk btree and the
insertion of the new chunk item in the chunk btree. This is done while
under the protection of the chunk mutex (fs_info->chunk_mutex), in the
same critical section that checks for available system space, allocates
a new system chunk if needed and reserves system chunk space. This way
we do not have chunk space reserved until the second phase completes.
The same logic is applied to chunk removal as well, since it keeps
reserved system space long after it is done updating the chunk btree.
For direct allocation of system chunks, the previous behaviour remains,
because otherwise we would deadlock on extent buffers of the chunk btree.
Changes to the chunk btree are by large done by chunk allocation and chunk
removal, which first reserve chunk system space and then later do changes
to the chunk btree. The other remaining cases are uncommon and correspond
to adding a device, removing a device and resizing a device. All these
other cases do not pre-reserve system space, they modify the chunk btree
right away, so they don't hold reserved space for a long period like chunk
allocation and chunk removal do.
The diff of this change is huge, but more than half of it is just addition
of comments describing both how things work regarding chunk allocation and
removal, including both the new behavior and the parts of the old behavior
that did not change.
CC: stable@vger.kernel.org # 5.12+
Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Tested-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Tested-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-06-29 14:43:06 +01:00
|
|
|
/*
|
|
|
|
* If we fail to add the chunk item here, we end up
|
|
|
|
* trying again at phase 2 of chunk allocation, at
|
|
|
|
* btrfs_create_pending_block_groups(). So ignore
|
btrfs: fix deadlock between chunk allocation and chunk btree modifications
When a task is doing some modification to the chunk btree and it is not in
the context of a chunk allocation or a chunk removal, it can deadlock with
another task that is currently allocating a new data or metadata chunk.
These contexts are the following:
* When relocating a system chunk, when we need to COW the extent buffers
that belong to the chunk btree;
* When adding a new device (ioctl), where we need to add a new device item
to the chunk btree;
* When removing a device (ioctl), where we need to remove a device item
from the chunk btree;
* When resizing a device (ioctl), where we need to update a device item in
the chunk btree and may need to relocate a system chunk that lies beyond
the new device size when shrinking a device.
The problem happens due to a sequence of steps like the following:
1) Task A starts a data or metadata chunk allocation and it locks the
chunk mutex;
2) Task B is relocating a system chunk, and when it needs to COW an extent
buffer of the chunk btree, it has locked both that extent buffer as
well as its parent extent buffer;
3) Since there is not enough available system space, either because none
of the existing system block groups have enough free space or because
the only one with enough free space is in RO mode due to the relocation,
task B triggers a new system chunk allocation. It blocks when trying to
acquire the chunk mutex, currently held by task A;
4) Task A enters btrfs_chunk_alloc_add_chunk_item(), in order to insert
the new chunk item into the chunk btree and update the existing device
items there. But in order to do that, it has to lock the extent buffer
that task B locked at step 2, or its parent extent buffer, but task B
is waiting on the chunk mutex, which is currently locked by task A,
therefore resulting in a deadlock.
One example report when the deadlock happens with system chunk relocation:
INFO: task kworker/u9:5:546 blocked for more than 143 seconds.
Not tainted 5.15.0-rc3+ #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:kworker/u9:5 state:D stack:25936 pid: 546 ppid: 2 flags:0x00004000
Workqueue: events_unbound btrfs_async_reclaim_metadata_space
Call Trace:
context_switch kernel/sched/core.c:4940 [inline]
__schedule+0xcd9/0x2530 kernel/sched/core.c:6287
schedule+0xd3/0x270 kernel/sched/core.c:6366
rwsem_down_read_slowpath+0x4ee/0x9d0 kernel/locking/rwsem.c:993
__down_read_common kernel/locking/rwsem.c:1214 [inline]
__down_read kernel/locking/rwsem.c:1223 [inline]
down_read_nested+0xe6/0x440 kernel/locking/rwsem.c:1590
__btrfs_tree_read_lock+0x31/0x350 fs/btrfs/locking.c:47
btrfs_tree_read_lock fs/btrfs/locking.c:54 [inline]
btrfs_read_lock_root_node+0x8a/0x320 fs/btrfs/locking.c:191
btrfs_search_slot_get_root fs/btrfs/ctree.c:1623 [inline]
btrfs_search_slot+0x13b4/0x2140 fs/btrfs/ctree.c:1728
btrfs_update_device+0x11f/0x500 fs/btrfs/volumes.c:2794
btrfs_chunk_alloc_add_chunk_item+0x34d/0xea0 fs/btrfs/volumes.c:5504
do_chunk_alloc fs/btrfs/block-group.c:3408 [inline]
btrfs_chunk_alloc+0x84d/0xf50 fs/btrfs/block-group.c:3653
flush_space+0x54e/0xd80 fs/btrfs/space-info.c:670
btrfs_async_reclaim_metadata_space+0x396/0xa90 fs/btrfs/space-info.c:953
process_one_work+0x9df/0x16d0 kernel/workqueue.c:2297
worker_thread+0x90/0xed0 kernel/workqueue.c:2444
kthread+0x3e5/0x4d0 kernel/kthread.c:319
ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:295
INFO: task syz-executor:9107 blocked for more than 143 seconds.
Not tainted 5.15.0-rc3+ #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:syz-executor state:D stack:23200 pid: 9107 ppid: 7792 flags:0x00004004
Call Trace:
context_switch kernel/sched/core.c:4940 [inline]
__schedule+0xcd9/0x2530 kernel/sched/core.c:6287
schedule+0xd3/0x270 kernel/sched/core.c:6366
schedule_preempt_disabled+0xf/0x20 kernel/sched/core.c:6425
__mutex_lock_common kernel/locking/mutex.c:669 [inline]
__mutex_lock+0xc96/0x1680 kernel/locking/mutex.c:729
btrfs_chunk_alloc+0x31a/0xf50 fs/btrfs/block-group.c:3631
find_free_extent_update_loop fs/btrfs/extent-tree.c:3986 [inline]
find_free_extent+0x25cb/0x3a30 fs/btrfs/extent-tree.c:4335
btrfs_reserve_extent+0x1f1/0x500 fs/btrfs/extent-tree.c:4415
btrfs_alloc_tree_block+0x203/0x1120 fs/btrfs/extent-tree.c:4813
__btrfs_cow_block+0x412/0x1620 fs/btrfs/ctree.c:415
btrfs_cow_block+0x2f6/0x8c0 fs/btrfs/ctree.c:570
btrfs_search_slot+0x1094/0x2140 fs/btrfs/ctree.c:1768
relocate_tree_block fs/btrfs/relocation.c:2694 [inline]
relocate_tree_blocks+0xf73/0x1770 fs/btrfs/relocation.c:2757
relocate_block_group+0x47e/0xc70 fs/btrfs/relocation.c:3673
btrfs_relocate_block_group+0x48a/0xc60 fs/btrfs/relocation.c:4070
btrfs_relocate_chunk+0x96/0x280 fs/btrfs/volumes.c:3181
__btrfs_balance fs/btrfs/volumes.c:3911 [inline]
btrfs_balance+0x1f03/0x3cd0 fs/btrfs/volumes.c:4301
btrfs_ioctl_balance+0x61e/0x800 fs/btrfs/ioctl.c:4137
btrfs_ioctl+0x39ea/0x7b70 fs/btrfs/ioctl.c:4949
vfs_ioctl fs/ioctl.c:51 [inline]
__do_sys_ioctl fs/ioctl.c:874 [inline]
__se_sys_ioctl fs/ioctl.c:860 [inline]
__x64_sys_ioctl+0x193/0x200 fs/ioctl.c:860
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x44/0xae
So fix this by making sure that whenever we try to modify the chunk btree
and we are neither in a chunk allocation context nor in a chunk remove
context, we reserve system space before modifying the chunk btree.
Reported-by: Hao Sun <sunhao.th@gmail.com>
Link: https://lore.kernel.org/linux-btrfs/CACkBjsax51i4mu6C0C3vJqQN3NR_iVuucoeG3U1HXjrgzn5FFQ@mail.gmail.com/
Fixes: 79bd37120b1495 ("btrfs: rework chunk allocation to avoid exhaustion of the system chunk array")
CC: stable@vger.kernel.org # 5.14+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-13 10:12:49 +01:00
|
|
|
* any error here. An ENOSPC here could happen, due to
|
|
|
|
* the cases described at do_chunk_alloc() - the system
|
|
|
|
* block group we just created was just turned into RO
|
|
|
|
* mode by a scrub for example, or a running discard
|
|
|
|
* temporarily removed its free space entries, etc.
|
btrfs: rework chunk allocation to avoid exhaustion of the system chunk array
Commit eafa4fd0ad0607 ("btrfs: fix exhaustion of the system chunk array
due to concurrent allocations") fixed a problem that resulted in
exhausting the system chunk array in the superblock when there are many
tasks allocating chunks in parallel. Basically too many tasks enter the
first phase of chunk allocation without previous tasks having finished
their second phase of allocation, resulting in too many system chunks
being allocated. That was originally observed when running the fallocate
tests of stress-ng on a PowerPC machine, using a node size of 64K.
However that commit also introduced a deadlock where a task in phase 1 of
the chunk allocation waited for another task that had allocated a system
chunk to finish its phase 2, but that other task was waiting on an extent
buffer lock held by the first task, therefore resulting in both tasks not
making any progress. That change was later reverted by a patch with the
subject "btrfs: fix deadlock with concurrent chunk allocations involving
system chunks", since there is no simple and short solution to address it
and the deadlock is relatively easy to trigger on zoned filesystems, while
the system chunk array exhaustion is not so common.
This change reworks the chunk allocation to avoid the system chunk array
exhaustion. It accomplishes that by making the first phase of chunk
allocation do the updates of the device items in the chunk btree and the
insertion of the new chunk item in the chunk btree. This is done while
under the protection of the chunk mutex (fs_info->chunk_mutex), in the
same critical section that checks for available system space, allocates
a new system chunk if needed and reserves system chunk space. This way
we do not have chunk space reserved until the second phase completes.
The same logic is applied to chunk removal as well, since it keeps
reserved system space long after it is done updating the chunk btree.
For direct allocation of system chunks, the previous behaviour remains,
because otherwise we would deadlock on extent buffers of the chunk btree.
Changes to the chunk btree are by large done by chunk allocation and chunk
removal, which first reserve chunk system space and then later do changes
to the chunk btree. The other remaining cases are uncommon and correspond
to adding a device, removing a device and resizing a device. All these
other cases do not pre-reserve system space, they modify the chunk btree
right away, so they don't hold reserved space for a long period like chunk
allocation and chunk removal do.
The diff of this change is huge, but more than half of it is just addition
of comments describing both how things work regarding chunk allocation and
removal, including both the new behavior and the parts of the old behavior
that did not change.
CC: stable@vger.kernel.org # 5.12+
Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Tested-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Tested-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-06-29 14:43:06 +01:00
|
|
|
*/
|
|
|
|
btrfs_chunk_alloc_add_chunk_item(trans, bg);
|
|
|
|
}
|
2019-06-20 15:38:04 -04:00
|
|
|
}
|
|
|
|
|
|
|
|
if (!ret) {
|
2021-11-09 10:12:07 -05:00
|
|
|
ret = btrfs_block_rsv_add(fs_info,
|
2019-06-20 15:38:04 -04:00
|
|
|
&fs_info->chunk_block_rsv,
|
btrfs: fix deadlock between chunk allocation and chunk btree modifications
When a task is doing some modification to the chunk btree and it is not in
the context of a chunk allocation or a chunk removal, it can deadlock with
another task that is currently allocating a new data or metadata chunk.
These contexts are the following:
* When relocating a system chunk, when we need to COW the extent buffers
that belong to the chunk btree;
* When adding a new device (ioctl), where we need to add a new device item
to the chunk btree;
* When removing a device (ioctl), where we need to remove a device item
from the chunk btree;
* When resizing a device (ioctl), where we need to update a device item in
the chunk btree and may need to relocate a system chunk that lies beyond
the new device size when shrinking a device.
The problem happens due to a sequence of steps like the following:
1) Task A starts a data or metadata chunk allocation and it locks the
chunk mutex;
2) Task B is relocating a system chunk, and when it needs to COW an extent
buffer of the chunk btree, it has locked both that extent buffer as
well as its parent extent buffer;
3) Since there is not enough available system space, either because none
of the existing system block groups have enough free space or because
the only one with enough free space is in RO mode due to the relocation,
task B triggers a new system chunk allocation. It blocks when trying to
acquire the chunk mutex, currently held by task A;
4) Task A enters btrfs_chunk_alloc_add_chunk_item(), in order to insert
the new chunk item into the chunk btree and update the existing device
items there. But in order to do that, it has to lock the extent buffer
that task B locked at step 2, or its parent extent buffer, but task B
is waiting on the chunk mutex, which is currently locked by task A,
therefore resulting in a deadlock.
One example report when the deadlock happens with system chunk relocation:
INFO: task kworker/u9:5:546 blocked for more than 143 seconds.
Not tainted 5.15.0-rc3+ #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:kworker/u9:5 state:D stack:25936 pid: 546 ppid: 2 flags:0x00004000
Workqueue: events_unbound btrfs_async_reclaim_metadata_space
Call Trace:
context_switch kernel/sched/core.c:4940 [inline]
__schedule+0xcd9/0x2530 kernel/sched/core.c:6287
schedule+0xd3/0x270 kernel/sched/core.c:6366
rwsem_down_read_slowpath+0x4ee/0x9d0 kernel/locking/rwsem.c:993
__down_read_common kernel/locking/rwsem.c:1214 [inline]
__down_read kernel/locking/rwsem.c:1223 [inline]
down_read_nested+0xe6/0x440 kernel/locking/rwsem.c:1590
__btrfs_tree_read_lock+0x31/0x350 fs/btrfs/locking.c:47
btrfs_tree_read_lock fs/btrfs/locking.c:54 [inline]
btrfs_read_lock_root_node+0x8a/0x320 fs/btrfs/locking.c:191
btrfs_search_slot_get_root fs/btrfs/ctree.c:1623 [inline]
btrfs_search_slot+0x13b4/0x2140 fs/btrfs/ctree.c:1728
btrfs_update_device+0x11f/0x500 fs/btrfs/volumes.c:2794
btrfs_chunk_alloc_add_chunk_item+0x34d/0xea0 fs/btrfs/volumes.c:5504
do_chunk_alloc fs/btrfs/block-group.c:3408 [inline]
btrfs_chunk_alloc+0x84d/0xf50 fs/btrfs/block-group.c:3653
flush_space+0x54e/0xd80 fs/btrfs/space-info.c:670
btrfs_async_reclaim_metadata_space+0x396/0xa90 fs/btrfs/space-info.c:953
process_one_work+0x9df/0x16d0 kernel/workqueue.c:2297
worker_thread+0x90/0xed0 kernel/workqueue.c:2444
kthread+0x3e5/0x4d0 kernel/kthread.c:319
ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:295
INFO: task syz-executor:9107 blocked for more than 143 seconds.
Not tainted 5.15.0-rc3+ #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:syz-executor state:D stack:23200 pid: 9107 ppid: 7792 flags:0x00004004
Call Trace:
context_switch kernel/sched/core.c:4940 [inline]
__schedule+0xcd9/0x2530 kernel/sched/core.c:6287
schedule+0xd3/0x270 kernel/sched/core.c:6366
schedule_preempt_disabled+0xf/0x20 kernel/sched/core.c:6425
__mutex_lock_common kernel/locking/mutex.c:669 [inline]
__mutex_lock+0xc96/0x1680 kernel/locking/mutex.c:729
btrfs_chunk_alloc+0x31a/0xf50 fs/btrfs/block-group.c:3631
find_free_extent_update_loop fs/btrfs/extent-tree.c:3986 [inline]
find_free_extent+0x25cb/0x3a30 fs/btrfs/extent-tree.c:4335
btrfs_reserve_extent+0x1f1/0x500 fs/btrfs/extent-tree.c:4415
btrfs_alloc_tree_block+0x203/0x1120 fs/btrfs/extent-tree.c:4813
__btrfs_cow_block+0x412/0x1620 fs/btrfs/ctree.c:415
btrfs_cow_block+0x2f6/0x8c0 fs/btrfs/ctree.c:570
btrfs_search_slot+0x1094/0x2140 fs/btrfs/ctree.c:1768
relocate_tree_block fs/btrfs/relocation.c:2694 [inline]
relocate_tree_blocks+0xf73/0x1770 fs/btrfs/relocation.c:2757
relocate_block_group+0x47e/0xc70 fs/btrfs/relocation.c:3673
btrfs_relocate_block_group+0x48a/0xc60 fs/btrfs/relocation.c:4070
btrfs_relocate_chunk+0x96/0x280 fs/btrfs/volumes.c:3181
__btrfs_balance fs/btrfs/volumes.c:3911 [inline]
btrfs_balance+0x1f03/0x3cd0 fs/btrfs/volumes.c:4301
btrfs_ioctl_balance+0x61e/0x800 fs/btrfs/ioctl.c:4137
btrfs_ioctl+0x39ea/0x7b70 fs/btrfs/ioctl.c:4949
vfs_ioctl fs/ioctl.c:51 [inline]
__do_sys_ioctl fs/ioctl.c:874 [inline]
__se_sys_ioctl fs/ioctl.c:860 [inline]
__x64_sys_ioctl+0x193/0x200 fs/ioctl.c:860
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x44/0xae
So fix this by making sure that whenever we try to modify the chunk btree
and we are neither in a chunk allocation context nor in a chunk remove
context, we reserve system space before modifying the chunk btree.
Reported-by: Hao Sun <sunhao.th@gmail.com>
Link: https://lore.kernel.org/linux-btrfs/CACkBjsax51i4mu6C0C3vJqQN3NR_iVuucoeG3U1HXjrgzn5FFQ@mail.gmail.com/
Fixes: 79bd37120b1495 ("btrfs: rework chunk allocation to avoid exhaustion of the system chunk array")
CC: stable@vger.kernel.org # 5.14+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-13 10:12:49 +01:00
|
|
|
bytes, BTRFS_RESERVE_NO_FLUSH);
|
btrfs: fix deadlock with concurrent chunk allocations involving system chunks
When a task attempting to allocate a new chunk verifies that there is not
currently enough free space in the system space_info and there is another
task that allocated a new system chunk but it did not finish yet the
creation of the respective block group, it waits for that other task to
finish creating the block group. This is to avoid exhaustion of the system
chunk array in the superblock, which is limited, when we have a thundering
herd of tasks allocating new chunks. This problem was described and fixed
by commit eafa4fd0ad0607 ("btrfs: fix exhaustion of the system chunk array
due to concurrent allocations").
However there are two very similar scenarios where this can lead to a
deadlock:
1) Task B allocated a new system chunk and task A is waiting on task B
to finish creation of the respective system block group. However before
task B ends its transaction handle and finishes the creation of the
system block group, it attempts to allocate another chunk (like a data
chunk for an fallocate operation for a very large range). Task B will
be unable to progress and allocate the new chunk, because task A set
space_info->chunk_alloc to 1 and therefore it loops at
btrfs_chunk_alloc() waiting for task A to finish its chunk allocation
and set space_info->chunk_alloc to 0, but task A is waiting on task B
to finish creation of the new system block group, therefore resulting
in a deadlock;
2) Task B allocated a new system chunk and task A is waiting on task B to
finish creation of the respective system block group. By the time that
task B enter the final phase of block group allocation, which happens
at btrfs_create_pending_block_groups(), when it modifies the extent
tree, the device tree or the chunk tree to insert the items for some
new block group, it needs to allocate a new chunk, so it ends up at
btrfs_chunk_alloc() and keeps looping there because task A has set
space_info->chunk_alloc to 1, but task A is waiting for task B to
finish creation of the new system block group and release the reserved
system space, therefore resulting in a deadlock.
In short, the problem is if a task B needs to allocate a new chunk after
it previously allocated a new system chunk and if another task A is
currently waiting for task B to complete the allocation of the new system
chunk.
Unfortunately this deadlock scenario introduced by the previous fix for
the system chunk array exhaustion problem does not have a simple and short
fix, and requires a big change to rework the chunk allocation code so that
chunk btree updates are all made in the first phase of chunk allocation.
And since this deadlock regression is being frequently hit on zoned
filesystems and the system chunk array exhaustion problem is triggered
in more extreme cases (originally observed on PowerPC with a node size
of 64K when running the fallocate tests from stress-ng), revert the
changes from that commit. The next patch in the series, with a subject
of "btrfs: rework chunk allocation to avoid exhaustion of the system
chunk array" does the necessary changes to fix the system chunk array
exhaustion problem.
Reported-by: Naohiro Aota <naohiro.aota@wdc.com>
Link: https://lore.kernel.org/linux-btrfs/20210621015922.ewgbffxuawia7liz@naota-xeon/
Fixes: eafa4fd0ad0607 ("btrfs: fix exhaustion of the system chunk array due to concurrent allocations")
CC: stable@vger.kernel.org # 5.12+
Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Tested-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Tested-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-06-29 14:43:05 +01:00
|
|
|
if (!ret)
|
btrfs: fix deadlock between chunk allocation and chunk btree modifications
When a task is doing some modification to the chunk btree and it is not in
the context of a chunk allocation or a chunk removal, it can deadlock with
another task that is currently allocating a new data or metadata chunk.
These contexts are the following:
* When relocating a system chunk, when we need to COW the extent buffers
that belong to the chunk btree;
* When adding a new device (ioctl), where we need to add a new device item
to the chunk btree;
* When removing a device (ioctl), where we need to remove a device item
from the chunk btree;
* When resizing a device (ioctl), where we need to update a device item in
the chunk btree and may need to relocate a system chunk that lies beyond
the new device size when shrinking a device.
The problem happens due to a sequence of steps like the following:
1) Task A starts a data or metadata chunk allocation and it locks the
chunk mutex;
2) Task B is relocating a system chunk, and when it needs to COW an extent
buffer of the chunk btree, it has locked both that extent buffer as
well as its parent extent buffer;
3) Since there is not enough available system space, either because none
of the existing system block groups have enough free space or because
the only one with enough free space is in RO mode due to the relocation,
task B triggers a new system chunk allocation. It blocks when trying to
acquire the chunk mutex, currently held by task A;
4) Task A enters btrfs_chunk_alloc_add_chunk_item(), in order to insert
the new chunk item into the chunk btree and update the existing device
items there. But in order to do that, it has to lock the extent buffer
that task B locked at step 2, or its parent extent buffer, but task B
is waiting on the chunk mutex, which is currently locked by task A,
therefore resulting in a deadlock.
One example report when the deadlock happens with system chunk relocation:
INFO: task kworker/u9:5:546 blocked for more than 143 seconds.
Not tainted 5.15.0-rc3+ #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:kworker/u9:5 state:D stack:25936 pid: 546 ppid: 2 flags:0x00004000
Workqueue: events_unbound btrfs_async_reclaim_metadata_space
Call Trace:
context_switch kernel/sched/core.c:4940 [inline]
__schedule+0xcd9/0x2530 kernel/sched/core.c:6287
schedule+0xd3/0x270 kernel/sched/core.c:6366
rwsem_down_read_slowpath+0x4ee/0x9d0 kernel/locking/rwsem.c:993
__down_read_common kernel/locking/rwsem.c:1214 [inline]
__down_read kernel/locking/rwsem.c:1223 [inline]
down_read_nested+0xe6/0x440 kernel/locking/rwsem.c:1590
__btrfs_tree_read_lock+0x31/0x350 fs/btrfs/locking.c:47
btrfs_tree_read_lock fs/btrfs/locking.c:54 [inline]
btrfs_read_lock_root_node+0x8a/0x320 fs/btrfs/locking.c:191
btrfs_search_slot_get_root fs/btrfs/ctree.c:1623 [inline]
btrfs_search_slot+0x13b4/0x2140 fs/btrfs/ctree.c:1728
btrfs_update_device+0x11f/0x500 fs/btrfs/volumes.c:2794
btrfs_chunk_alloc_add_chunk_item+0x34d/0xea0 fs/btrfs/volumes.c:5504
do_chunk_alloc fs/btrfs/block-group.c:3408 [inline]
btrfs_chunk_alloc+0x84d/0xf50 fs/btrfs/block-group.c:3653
flush_space+0x54e/0xd80 fs/btrfs/space-info.c:670
btrfs_async_reclaim_metadata_space+0x396/0xa90 fs/btrfs/space-info.c:953
process_one_work+0x9df/0x16d0 kernel/workqueue.c:2297
worker_thread+0x90/0xed0 kernel/workqueue.c:2444
kthread+0x3e5/0x4d0 kernel/kthread.c:319
ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:295
INFO: task syz-executor:9107 blocked for more than 143 seconds.
Not tainted 5.15.0-rc3+ #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:syz-executor state:D stack:23200 pid: 9107 ppid: 7792 flags:0x00004004
Call Trace:
context_switch kernel/sched/core.c:4940 [inline]
__schedule+0xcd9/0x2530 kernel/sched/core.c:6287
schedule+0xd3/0x270 kernel/sched/core.c:6366
schedule_preempt_disabled+0xf/0x20 kernel/sched/core.c:6425
__mutex_lock_common kernel/locking/mutex.c:669 [inline]
__mutex_lock+0xc96/0x1680 kernel/locking/mutex.c:729
btrfs_chunk_alloc+0x31a/0xf50 fs/btrfs/block-group.c:3631
find_free_extent_update_loop fs/btrfs/extent-tree.c:3986 [inline]
find_free_extent+0x25cb/0x3a30 fs/btrfs/extent-tree.c:4335
btrfs_reserve_extent+0x1f1/0x500 fs/btrfs/extent-tree.c:4415
btrfs_alloc_tree_block+0x203/0x1120 fs/btrfs/extent-tree.c:4813
__btrfs_cow_block+0x412/0x1620 fs/btrfs/ctree.c:415
btrfs_cow_block+0x2f6/0x8c0 fs/btrfs/ctree.c:570
btrfs_search_slot+0x1094/0x2140 fs/btrfs/ctree.c:1768
relocate_tree_block fs/btrfs/relocation.c:2694 [inline]
relocate_tree_blocks+0xf73/0x1770 fs/btrfs/relocation.c:2757
relocate_block_group+0x47e/0xc70 fs/btrfs/relocation.c:3673
btrfs_relocate_block_group+0x48a/0xc60 fs/btrfs/relocation.c:4070
btrfs_relocate_chunk+0x96/0x280 fs/btrfs/volumes.c:3181
__btrfs_balance fs/btrfs/volumes.c:3911 [inline]
btrfs_balance+0x1f03/0x3cd0 fs/btrfs/volumes.c:4301
btrfs_ioctl_balance+0x61e/0x800 fs/btrfs/ioctl.c:4137
btrfs_ioctl+0x39ea/0x7b70 fs/btrfs/ioctl.c:4949
vfs_ioctl fs/ioctl.c:51 [inline]
__do_sys_ioctl fs/ioctl.c:874 [inline]
__se_sys_ioctl fs/ioctl.c:860 [inline]
__x64_sys_ioctl+0x193/0x200 fs/ioctl.c:860
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x44/0xae
So fix this by making sure that whenever we try to modify the chunk btree
and we are neither in a chunk allocation context nor in a chunk remove
context, we reserve system space before modifying the chunk btree.
Reported-by: Hao Sun <sunhao.th@gmail.com>
Link: https://lore.kernel.org/linux-btrfs/CACkBjsax51i4mu6C0C3vJqQN3NR_iVuucoeG3U1HXjrgzn5FFQ@mail.gmail.com/
Fixes: 79bd37120b1495 ("btrfs: rework chunk allocation to avoid exhaustion of the system chunk array")
CC: stable@vger.kernel.org # 5.14+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-13 10:12:49 +01:00
|
|
|
trans->chunk_bytes_reserved += bytes;
|
2019-06-20 15:38:04 -04:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
btrfs: fix deadlock between chunk allocation and chunk btree modifications
When a task is doing some modification to the chunk btree and it is not in
the context of a chunk allocation or a chunk removal, it can deadlock with
another task that is currently allocating a new data or metadata chunk.
These contexts are the following:
* When relocating a system chunk, when we need to COW the extent buffers
that belong to the chunk btree;
* When adding a new device (ioctl), where we need to add a new device item
to the chunk btree;
* When removing a device (ioctl), where we need to remove a device item
from the chunk btree;
* When resizing a device (ioctl), where we need to update a device item in
the chunk btree and may need to relocate a system chunk that lies beyond
the new device size when shrinking a device.
The problem happens due to a sequence of steps like the following:
1) Task A starts a data or metadata chunk allocation and it locks the
chunk mutex;
2) Task B is relocating a system chunk, and when it needs to COW an extent
buffer of the chunk btree, it has locked both that extent buffer as
well as its parent extent buffer;
3) Since there is not enough available system space, either because none
of the existing system block groups have enough free space or because
the only one with enough free space is in RO mode due to the relocation,
task B triggers a new system chunk allocation. It blocks when trying to
acquire the chunk mutex, currently held by task A;
4) Task A enters btrfs_chunk_alloc_add_chunk_item(), in order to insert
the new chunk item into the chunk btree and update the existing device
items there. But in order to do that, it has to lock the extent buffer
that task B locked at step 2, or its parent extent buffer, but task B
is waiting on the chunk mutex, which is currently locked by task A,
therefore resulting in a deadlock.
One example report when the deadlock happens with system chunk relocation:
INFO: task kworker/u9:5:546 blocked for more than 143 seconds.
Not tainted 5.15.0-rc3+ #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:kworker/u9:5 state:D stack:25936 pid: 546 ppid: 2 flags:0x00004000
Workqueue: events_unbound btrfs_async_reclaim_metadata_space
Call Trace:
context_switch kernel/sched/core.c:4940 [inline]
__schedule+0xcd9/0x2530 kernel/sched/core.c:6287
schedule+0xd3/0x270 kernel/sched/core.c:6366
rwsem_down_read_slowpath+0x4ee/0x9d0 kernel/locking/rwsem.c:993
__down_read_common kernel/locking/rwsem.c:1214 [inline]
__down_read kernel/locking/rwsem.c:1223 [inline]
down_read_nested+0xe6/0x440 kernel/locking/rwsem.c:1590
__btrfs_tree_read_lock+0x31/0x350 fs/btrfs/locking.c:47
btrfs_tree_read_lock fs/btrfs/locking.c:54 [inline]
btrfs_read_lock_root_node+0x8a/0x320 fs/btrfs/locking.c:191
btrfs_search_slot_get_root fs/btrfs/ctree.c:1623 [inline]
btrfs_search_slot+0x13b4/0x2140 fs/btrfs/ctree.c:1728
btrfs_update_device+0x11f/0x500 fs/btrfs/volumes.c:2794
btrfs_chunk_alloc_add_chunk_item+0x34d/0xea0 fs/btrfs/volumes.c:5504
do_chunk_alloc fs/btrfs/block-group.c:3408 [inline]
btrfs_chunk_alloc+0x84d/0xf50 fs/btrfs/block-group.c:3653
flush_space+0x54e/0xd80 fs/btrfs/space-info.c:670
btrfs_async_reclaim_metadata_space+0x396/0xa90 fs/btrfs/space-info.c:953
process_one_work+0x9df/0x16d0 kernel/workqueue.c:2297
worker_thread+0x90/0xed0 kernel/workqueue.c:2444
kthread+0x3e5/0x4d0 kernel/kthread.c:319
ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:295
INFO: task syz-executor:9107 blocked for more than 143 seconds.
Not tainted 5.15.0-rc3+ #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:syz-executor state:D stack:23200 pid: 9107 ppid: 7792 flags:0x00004004
Call Trace:
context_switch kernel/sched/core.c:4940 [inline]
__schedule+0xcd9/0x2530 kernel/sched/core.c:6287
schedule+0xd3/0x270 kernel/sched/core.c:6366
schedule_preempt_disabled+0xf/0x20 kernel/sched/core.c:6425
__mutex_lock_common kernel/locking/mutex.c:669 [inline]
__mutex_lock+0xc96/0x1680 kernel/locking/mutex.c:729
btrfs_chunk_alloc+0x31a/0xf50 fs/btrfs/block-group.c:3631
find_free_extent_update_loop fs/btrfs/extent-tree.c:3986 [inline]
find_free_extent+0x25cb/0x3a30 fs/btrfs/extent-tree.c:4335
btrfs_reserve_extent+0x1f1/0x500 fs/btrfs/extent-tree.c:4415
btrfs_alloc_tree_block+0x203/0x1120 fs/btrfs/extent-tree.c:4813
__btrfs_cow_block+0x412/0x1620 fs/btrfs/ctree.c:415
btrfs_cow_block+0x2f6/0x8c0 fs/btrfs/ctree.c:570
btrfs_search_slot+0x1094/0x2140 fs/btrfs/ctree.c:1768
relocate_tree_block fs/btrfs/relocation.c:2694 [inline]
relocate_tree_blocks+0xf73/0x1770 fs/btrfs/relocation.c:2757
relocate_block_group+0x47e/0xc70 fs/btrfs/relocation.c:3673
btrfs_relocate_block_group+0x48a/0xc60 fs/btrfs/relocation.c:4070
btrfs_relocate_chunk+0x96/0x280 fs/btrfs/volumes.c:3181
__btrfs_balance fs/btrfs/volumes.c:3911 [inline]
btrfs_balance+0x1f03/0x3cd0 fs/btrfs/volumes.c:4301
btrfs_ioctl_balance+0x61e/0x800 fs/btrfs/ioctl.c:4137
btrfs_ioctl+0x39ea/0x7b70 fs/btrfs/ioctl.c:4949
vfs_ioctl fs/ioctl.c:51 [inline]
__do_sys_ioctl fs/ioctl.c:874 [inline]
__se_sys_ioctl fs/ioctl.c:860 [inline]
__x64_sys_ioctl+0x193/0x200 fs/ioctl.c:860
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x44/0xae
So fix this by making sure that whenever we try to modify the chunk btree
and we are neither in a chunk allocation context nor in a chunk remove
context, we reserve system space before modifying the chunk btree.
Reported-by: Hao Sun <sunhao.th@gmail.com>
Link: https://lore.kernel.org/linux-btrfs/CACkBjsax51i4mu6C0C3vJqQN3NR_iVuucoeG3U1HXjrgzn5FFQ@mail.gmail.com/
Fixes: 79bd37120b1495 ("btrfs: rework chunk allocation to avoid exhaustion of the system chunk array")
CC: stable@vger.kernel.org # 5.14+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-13 10:12:49 +01:00
|
|
|
/*
|
|
|
|
* Reserve space in the system space for allocating or removing a chunk.
|
|
|
|
* The caller must be holding fs_info->chunk_mutex.
|
|
|
|
*/
|
|
|
|
void check_system_chunk(struct btrfs_trans_handle *trans, u64 type)
|
|
|
|
{
|
|
|
|
struct btrfs_fs_info *fs_info = trans->fs_info;
|
|
|
|
const u64 num_devs = get_profile_num_devs(fs_info, type);
|
|
|
|
u64 bytes;
|
|
|
|
|
|
|
|
/* num_devs device items to update and 1 chunk item to add or remove. */
|
|
|
|
bytes = btrfs_calc_metadata_size(fs_info, num_devs) +
|
|
|
|
btrfs_calc_insert_metadata_size(fs_info, 1);
|
|
|
|
|
|
|
|
reserve_chunk_space(trans, bytes, type);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Reserve space in the system space, if needed, for doing a modification to the
|
|
|
|
* chunk btree.
|
|
|
|
*
|
|
|
|
* @trans: A transaction handle.
|
|
|
|
* @is_item_insertion: Indicate if the modification is for inserting a new item
|
|
|
|
* in the chunk btree or if it's for the deletion or update
|
|
|
|
* of an existing item.
|
|
|
|
*
|
|
|
|
* This is used in a context where we need to update the chunk btree outside
|
|
|
|
* block group allocation and removal, to avoid a deadlock with a concurrent
|
|
|
|
* task that is allocating a metadata or data block group and therefore needs to
|
|
|
|
* update the chunk btree while holding the chunk mutex. After the update to the
|
|
|
|
* chunk btree is done, btrfs_trans_release_chunk_metadata() should be called.
|
|
|
|
*
|
|
|
|
*/
|
|
|
|
void btrfs_reserve_chunk_metadata(struct btrfs_trans_handle *trans,
|
|
|
|
bool is_item_insertion)
|
|
|
|
{
|
|
|
|
struct btrfs_fs_info *fs_info = trans->fs_info;
|
|
|
|
u64 bytes;
|
|
|
|
|
|
|
|
if (is_item_insertion)
|
|
|
|
bytes = btrfs_calc_insert_metadata_size(fs_info, 1);
|
|
|
|
else
|
|
|
|
bytes = btrfs_calc_metadata_size(fs_info, 1);
|
|
|
|
|
|
|
|
mutex_lock(&fs_info->chunk_mutex);
|
|
|
|
reserve_chunk_space(trans, bytes, BTRFS_BLOCK_GROUP_SYSTEM);
|
|
|
|
mutex_unlock(&fs_info->chunk_mutex);
|
|
|
|
}
|
|
|
|
|
2019-06-20 15:38:06 -04:00
|
|
|
void btrfs_put_block_group_cache(struct btrfs_fs_info *info)
|
|
|
|
{
|
2019-10-29 19:20:18 +01:00
|
|
|
struct btrfs_block_group *block_group;
|
2019-06-20 15:38:06 -04:00
|
|
|
|
2022-07-15 15:45:26 -04:00
|
|
|
block_group = btrfs_lookup_first_block_group(info, 0);
|
|
|
|
while (block_group) {
|
|
|
|
btrfs_wait_block_group_cache_done(block_group);
|
|
|
|
spin_lock(&block_group->lock);
|
|
|
|
if (test_and_clear_bit(BLOCK_GROUP_FLAG_IREF,
|
|
|
|
&block_group->runtime_flags)) {
|
2022-01-31 18:46:17 +01:00
|
|
|
struct btrfs_inode *inode = block_group->inode;
|
2022-07-15 15:45:26 -04:00
|
|
|
|
|
|
|
block_group->inode = NULL;
|
2019-06-20 15:38:06 -04:00
|
|
|
spin_unlock(&block_group->lock);
|
|
|
|
|
2022-07-15 15:45:26 -04:00
|
|
|
ASSERT(block_group->io_ctl.inode == NULL);
|
2022-01-31 18:46:17 +01:00
|
|
|
iput(&inode->vfs_inode);
|
2022-07-15 15:45:26 -04:00
|
|
|
} else {
|
|
|
|
spin_unlock(&block_group->lock);
|
|
|
|
}
|
|
|
|
block_group = btrfs_next_block_group(block_group);
|
2019-06-20 15:38:06 -04:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Must be called only after stopping all workers, since we could have block
|
|
|
|
* group caching kthreads running, and therefore they could race with us if we
|
|
|
|
* freed the block groups before stopping them.
|
|
|
|
*/
|
|
|
|
int btrfs_free_block_groups(struct btrfs_fs_info *info)
|
|
|
|
{
|
2019-10-29 19:20:18 +01:00
|
|
|
struct btrfs_block_group *block_group;
|
2019-06-20 15:38:06 -04:00
|
|
|
struct btrfs_space_info *space_info;
|
|
|
|
struct btrfs_caching_control *caching_ctl;
|
|
|
|
struct rb_node *n;
|
|
|
|
|
btrfs: zoned: activate metadata block group on write time
In the current implementation, block groups are activated at reservation
time to ensure that all reserved bytes can be written to an active metadata
block group. However, this approach has proven to be less efficient, as it
activates block groups more frequently than necessary, putting pressure on
the active zone resource and leading to potential issues such as early
ENOSPC or hung_task.
Another drawback of the current method is that it hampers metadata
over-commit, and necessitates additional flush operations and block group
allocations, resulting in decreased overall performance.
To address these issues, this commit introduces a write-time activation of
metadata and system block group. This involves reserving at least one
active block group specifically for a metadata and system block group.
Since metadata write-out is always allocated sequentially, when we need to
write to a non-active block group, we can wait for the ongoing IOs to
complete, activate a new block group, and then proceed with writing to the
new block group.
Fixes: b09315139136 ("btrfs: zoned: activate metadata block group on flush_space")
CC: stable@vger.kernel.org # 6.1+
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-08-08 01:12:37 +09:00
|
|
|
if (btrfs_is_zoned(info)) {
|
|
|
|
if (info->active_meta_bg) {
|
|
|
|
btrfs_put_block_group(info->active_meta_bg);
|
|
|
|
info->active_meta_bg = NULL;
|
|
|
|
}
|
|
|
|
if (info->active_system_bg) {
|
|
|
|
btrfs_put_block_group(info->active_system_bg);
|
|
|
|
info->active_system_bg = NULL;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
btrfs: use a read/write lock for protecting the block groups tree
Currently we use a spin lock to protect the red black tree that we use to
track block groups. Most accesses to that tree are actually read only and
for large filesystems, with thousands of block groups, it actually has
a bad impact on performance, as concurrent read only searches on the tree
are serialized.
Read only searches on the tree are very frequent and done when:
1) Pinning and unpinning extents, as we need to lookup the respective
block group from the tree;
2) Freeing the last reference of a tree block, regardless if we pin the
underlying extent or add it back to free space cache/tree;
3) During NOCOW writes, both buffered IO and direct IO, we need to check
if the block group that contains an extent is read only or not and to
increment the number of NOCOW writers in the block group. For those
operations we need to search for the block group in the tree.
Similarly, after creating the ordered extent for the NOCOW write, we
need to decrement the number of NOCOW writers from the same block
group, which requires searching for it in the tree;
4) Decreasing the number of extent reservations in a block group;
5) When allocating extents and freeing reserved extents;
6) Adding and removing free space to the free space tree;
7) When releasing delalloc bytes during ordered extent completion;
8) When relocating a block group;
9) During fitrim, to iterate over the block groups;
10) etc;
Write accesses to the tree, to add or remove block groups, are much less
frequent as they happen only when allocating a new block group or when
deleting a block group.
We also use the same spin lock to protect the list of currently caching
block groups. Additions to this list are made when we need to cache a
block group, because we don't have a free space cache for it (or we have
but it's invalid), and removals from this list are done when caching of
the block group's free space finishes. These cases are also not very
common, but when they happen, they happen only once when the filesystem
is mounted.
So switch the lock that protects the tree of block groups from a spinning
lock to a read/write lock.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-04-13 16:20:41 +01:00
|
|
|
write_lock(&info->block_group_cache_lock);
|
2019-06-20 15:38:06 -04:00
|
|
|
while (!list_empty(&info->caching_block_groups)) {
|
|
|
|
caching_ctl = list_entry(info->caching_block_groups.next,
|
|
|
|
struct btrfs_caching_control, list);
|
|
|
|
list_del(&caching_ctl->list);
|
|
|
|
btrfs_put_caching_control(caching_ctl);
|
|
|
|
}
|
btrfs: use a read/write lock for protecting the block groups tree
Currently we use a spin lock to protect the red black tree that we use to
track block groups. Most accesses to that tree are actually read only and
for large filesystems, with thousands of block groups, it actually has
a bad impact on performance, as concurrent read only searches on the tree
are serialized.
Read only searches on the tree are very frequent and done when:
1) Pinning and unpinning extents, as we need to lookup the respective
block group from the tree;
2) Freeing the last reference of a tree block, regardless if we pin the
underlying extent or add it back to free space cache/tree;
3) During NOCOW writes, both buffered IO and direct IO, we need to check
if the block group that contains an extent is read only or not and to
increment the number of NOCOW writers in the block group. For those
operations we need to search for the block group in the tree.
Similarly, after creating the ordered extent for the NOCOW write, we
need to decrement the number of NOCOW writers from the same block
group, which requires searching for it in the tree;
4) Decreasing the number of extent reservations in a block group;
5) When allocating extents and freeing reserved extents;
6) Adding and removing free space to the free space tree;
7) When releasing delalloc bytes during ordered extent completion;
8) When relocating a block group;
9) During fitrim, to iterate over the block groups;
10) etc;
Write accesses to the tree, to add or remove block groups, are much less
frequent as they happen only when allocating a new block group or when
deleting a block group.
We also use the same spin lock to protect the list of currently caching
block groups. Additions to this list are made when we need to cache a
block group, because we don't have a free space cache for it (or we have
but it's invalid), and removals from this list are done when caching of
the block group's free space finishes. These cases are also not very
common, but when they happen, they happen only once when the filesystem
is mounted.
So switch the lock that protects the tree of block groups from a spinning
lock to a read/write lock.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-04-13 16:20:41 +01:00
|
|
|
write_unlock(&info->block_group_cache_lock);
|
2019-06-20 15:38:06 -04:00
|
|
|
|
|
|
|
spin_lock(&info->unused_bgs_lock);
|
|
|
|
while (!list_empty(&info->unused_bgs)) {
|
|
|
|
block_group = list_first_entry(&info->unused_bgs,
|
2019-10-29 19:20:18 +01:00
|
|
|
struct btrfs_block_group,
|
2019-06-20 15:38:06 -04:00
|
|
|
bg_list);
|
|
|
|
list_del_init(&block_group->bg_list);
|
|
|
|
btrfs_put_block_group(block_group);
|
|
|
|
}
|
|
|
|
|
2021-04-19 16:41:02 +09:00
|
|
|
while (!list_empty(&info->reclaim_bgs)) {
|
|
|
|
block_group = list_first_entry(&info->reclaim_bgs,
|
|
|
|
struct btrfs_block_group,
|
|
|
|
bg_list);
|
|
|
|
list_del_init(&block_group->bg_list);
|
|
|
|
btrfs_put_block_group(block_group);
|
|
|
|
}
|
|
|
|
spin_unlock(&info->unused_bgs_lock);
|
|
|
|
|
2021-08-19 21:19:17 +09:00
|
|
|
spin_lock(&info->zone_active_bgs_lock);
|
|
|
|
while (!list_empty(&info->zone_active_bgs)) {
|
|
|
|
block_group = list_first_entry(&info->zone_active_bgs,
|
|
|
|
struct btrfs_block_group,
|
|
|
|
active_bg_list);
|
|
|
|
list_del_init(&block_group->active_bg_list);
|
|
|
|
btrfs_put_block_group(block_group);
|
|
|
|
}
|
|
|
|
spin_unlock(&info->zone_active_bgs_lock);
|
|
|
|
|
btrfs: use a read/write lock for protecting the block groups tree
Currently we use a spin lock to protect the red black tree that we use to
track block groups. Most accesses to that tree are actually read only and
for large filesystems, with thousands of block groups, it actually has
a bad impact on performance, as concurrent read only searches on the tree
are serialized.
Read only searches on the tree are very frequent and done when:
1) Pinning and unpinning extents, as we need to lookup the respective
block group from the tree;
2) Freeing the last reference of a tree block, regardless if we pin the
underlying extent or add it back to free space cache/tree;
3) During NOCOW writes, both buffered IO and direct IO, we need to check
if the block group that contains an extent is read only or not and to
increment the number of NOCOW writers in the block group. For those
operations we need to search for the block group in the tree.
Similarly, after creating the ordered extent for the NOCOW write, we
need to decrement the number of NOCOW writers from the same block
group, which requires searching for it in the tree;
4) Decreasing the number of extent reservations in a block group;
5) When allocating extents and freeing reserved extents;
6) Adding and removing free space to the free space tree;
7) When releasing delalloc bytes during ordered extent completion;
8) When relocating a block group;
9) During fitrim, to iterate over the block groups;
10) etc;
Write accesses to the tree, to add or remove block groups, are much less
frequent as they happen only when allocating a new block group or when
deleting a block group.
We also use the same spin lock to protect the list of currently caching
block groups. Additions to this list are made when we need to cache a
block group, because we don't have a free space cache for it (or we have
but it's invalid), and removals from this list are done when caching of
the block group's free space finishes. These cases are also not very
common, but when they happen, they happen only once when the filesystem
is mounted.
So switch the lock that protects the tree of block groups from a spinning
lock to a read/write lock.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-04-13 16:20:41 +01:00
|
|
|
write_lock(&info->block_group_cache_lock);
|
2022-04-13 16:20:40 +01:00
|
|
|
while ((n = rb_last(&info->block_group_cache_tree.rb_root)) != NULL) {
|
2019-10-29 19:20:18 +01:00
|
|
|
block_group = rb_entry(n, struct btrfs_block_group,
|
2019-06-20 15:38:06 -04:00
|
|
|
cache_node);
|
2022-04-13 16:20:40 +01:00
|
|
|
rb_erase_cached(&block_group->cache_node,
|
|
|
|
&info->block_group_cache_tree);
|
2019-06-20 15:38:06 -04:00
|
|
|
RB_CLEAR_NODE(&block_group->cache_node);
|
btrfs: use a read/write lock for protecting the block groups tree
Currently we use a spin lock to protect the red black tree that we use to
track block groups. Most accesses to that tree are actually read only and
for large filesystems, with thousands of block groups, it actually has
a bad impact on performance, as concurrent read only searches on the tree
are serialized.
Read only searches on the tree are very frequent and done when:
1) Pinning and unpinning extents, as we need to lookup the respective
block group from the tree;
2) Freeing the last reference of a tree block, regardless if we pin the
underlying extent or add it back to free space cache/tree;
3) During NOCOW writes, both buffered IO and direct IO, we need to check
if the block group that contains an extent is read only or not and to
increment the number of NOCOW writers in the block group. For those
operations we need to search for the block group in the tree.
Similarly, after creating the ordered extent for the NOCOW write, we
need to decrement the number of NOCOW writers from the same block
group, which requires searching for it in the tree;
4) Decreasing the number of extent reservations in a block group;
5) When allocating extents and freeing reserved extents;
6) Adding and removing free space to the free space tree;
7) When releasing delalloc bytes during ordered extent completion;
8) When relocating a block group;
9) During fitrim, to iterate over the block groups;
10) etc;
Write accesses to the tree, to add or remove block groups, are much less
frequent as they happen only when allocating a new block group or when
deleting a block group.
We also use the same spin lock to protect the list of currently caching
block groups. Additions to this list are made when we need to cache a
block group, because we don't have a free space cache for it (or we have
but it's invalid), and removals from this list are done when caching of
the block group's free space finishes. These cases are also not very
common, but when they happen, they happen only once when the filesystem
is mounted.
So switch the lock that protects the tree of block groups from a spinning
lock to a read/write lock.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-04-13 16:20:41 +01:00
|
|
|
write_unlock(&info->block_group_cache_lock);
|
2019-06-20 15:38:06 -04:00
|
|
|
|
|
|
|
down_write(&block_group->space_info->groups_sem);
|
|
|
|
list_del(&block_group->list);
|
|
|
|
up_write(&block_group->space_info->groups_sem);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* We haven't cached this block group, which means we could
|
|
|
|
* possibly have excluded extents on this block group.
|
|
|
|
*/
|
|
|
|
if (block_group->cached == BTRFS_CACHE_NO ||
|
|
|
|
block_group->cached == BTRFS_CACHE_ERROR)
|
|
|
|
btrfs_free_excluded_extents(block_group);
|
|
|
|
|
|
|
|
btrfs_remove_free_space_cache(block_group);
|
|
|
|
ASSERT(block_group->cached != BTRFS_CACHE_STARTED);
|
|
|
|
ASSERT(list_empty(&block_group->dirty_list));
|
|
|
|
ASSERT(list_empty(&block_group->io_list));
|
|
|
|
ASSERT(list_empty(&block_group->bg_list));
|
2020-07-06 09:14:11 -04:00
|
|
|
ASSERT(refcount_read(&block_group->refs) == 1);
|
btrfs: fix race between writes to swap files and scrub
When we active a swap file, at btrfs_swap_activate(), we acquire the
exclusive operation lock to prevent the physical location of the swap
file extents to be changed by operations such as balance and device
replace/resize/remove. We also call there can_nocow_extent() which,
among other things, checks if the block group of a swap file extent is
currently RO, and if it is we can not use the extent, since a write
into it would result in COWing the extent.
However we have no protection against a scrub operation running after we
activate the swap file, which can result in the swap file extents to be
COWed while the scrub is running and operating on the respective block
group, because scrub turns a block group into RO before it processes it
and then back again to RW mode after processing it. That means an attempt
to write into a swap file extent while scrub is processing the respective
block group, will result in COWing the extent, changing its physical
location on disk.
Fix this by making sure that block groups that have extents that are used
by active swap files can not be turned into RO mode, therefore making it
not possible for a scrub to turn them into RO mode. When a scrub finds a
block group that can not be turned to RO due to the existence of extents
used by swap files, it proceeds to the next block group and logs a warning
message that mentions the block group was skipped due to active swap
files - this is the same approach we currently use for balance.
Fixes: ed46ff3d42378 ("Btrfs: support swap files")
CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-05 12:55:37 +00:00
|
|
|
ASSERT(block_group->swap_extents == 0);
|
2019-06-20 15:38:06 -04:00
|
|
|
btrfs_put_block_group(block_group);
|
|
|
|
|
btrfs: use a read/write lock for protecting the block groups tree
Currently we use a spin lock to protect the red black tree that we use to
track block groups. Most accesses to that tree are actually read only and
for large filesystems, with thousands of block groups, it actually has
a bad impact on performance, as concurrent read only searches on the tree
are serialized.
Read only searches on the tree are very frequent and done when:
1) Pinning and unpinning extents, as we need to lookup the respective
block group from the tree;
2) Freeing the last reference of a tree block, regardless if we pin the
underlying extent or add it back to free space cache/tree;
3) During NOCOW writes, both buffered IO and direct IO, we need to check
if the block group that contains an extent is read only or not and to
increment the number of NOCOW writers in the block group. For those
operations we need to search for the block group in the tree.
Similarly, after creating the ordered extent for the NOCOW write, we
need to decrement the number of NOCOW writers from the same block
group, which requires searching for it in the tree;
4) Decreasing the number of extent reservations in a block group;
5) When allocating extents and freeing reserved extents;
6) Adding and removing free space to the free space tree;
7) When releasing delalloc bytes during ordered extent completion;
8) When relocating a block group;
9) During fitrim, to iterate over the block groups;
10) etc;
Write accesses to the tree, to add or remove block groups, are much less
frequent as they happen only when allocating a new block group or when
deleting a block group.
We also use the same spin lock to protect the list of currently caching
block groups. Additions to this list are made when we need to cache a
block group, because we don't have a free space cache for it (or we have
but it's invalid), and removals from this list are done when caching of
the block group's free space finishes. These cases are also not very
common, but when they happen, they happen only once when the filesystem
is mounted.
So switch the lock that protects the tree of block groups from a spinning
lock to a read/write lock.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-04-13 16:20:41 +01:00
|
|
|
write_lock(&info->block_group_cache_lock);
|
2019-06-20 15:38:06 -04:00
|
|
|
}
|
btrfs: use a read/write lock for protecting the block groups tree
Currently we use a spin lock to protect the red black tree that we use to
track block groups. Most accesses to that tree are actually read only and
for large filesystems, with thousands of block groups, it actually has
a bad impact on performance, as concurrent read only searches on the tree
are serialized.
Read only searches on the tree are very frequent and done when:
1) Pinning and unpinning extents, as we need to lookup the respective
block group from the tree;
2) Freeing the last reference of a tree block, regardless if we pin the
underlying extent or add it back to free space cache/tree;
3) During NOCOW writes, both buffered IO and direct IO, we need to check
if the block group that contains an extent is read only or not and to
increment the number of NOCOW writers in the block group. For those
operations we need to search for the block group in the tree.
Similarly, after creating the ordered extent for the NOCOW write, we
need to decrement the number of NOCOW writers from the same block
group, which requires searching for it in the tree;
4) Decreasing the number of extent reservations in a block group;
5) When allocating extents and freeing reserved extents;
6) Adding and removing free space to the free space tree;
7) When releasing delalloc bytes during ordered extent completion;
8) When relocating a block group;
9) During fitrim, to iterate over the block groups;
10) etc;
Write accesses to the tree, to add or remove block groups, are much less
frequent as they happen only when allocating a new block group or when
deleting a block group.
We also use the same spin lock to protect the list of currently caching
block groups. Additions to this list are made when we need to cache a
block group, because we don't have a free space cache for it (or we have
but it's invalid), and removals from this list are done when caching of
the block group's free space finishes. These cases are also not very
common, but when they happen, they happen only once when the filesystem
is mounted.
So switch the lock that protects the tree of block groups from a spinning
lock to a read/write lock.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-04-13 16:20:41 +01:00
|
|
|
write_unlock(&info->block_group_cache_lock);
|
2019-06-20 15:38:06 -04:00
|
|
|
|
|
|
|
btrfs_release_global_block_rsv(info);
|
|
|
|
|
|
|
|
while (!list_empty(&info->space_info)) {
|
|
|
|
space_info = list_entry(info->space_info.next,
|
|
|
|
struct btrfs_space_info,
|
|
|
|
list);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Do not hide this behind enospc_debug, this is actually
|
|
|
|
* important and indicates a real bug if this happens.
|
|
|
|
*/
|
|
|
|
if (WARN_ON(space_info->bytes_pinned > 0 ||
|
|
|
|
space_info->bytes_may_use > 0))
|
|
|
|
btrfs_dump_space_info(info, space_info, 0, 0);
|
btrfs: skip reserved bytes warning on unmount after log cleanup failure
After the recent changes made by commit c2e39305299f01 ("btrfs: clear
extent buffer uptodate when we fail to write it") and its followup fix,
commit 651740a5024117 ("btrfs: check WRITE_ERR when trying to read an
extent buffer"), we can now end up not cleaning up space reservations of
log tree extent buffers after a transaction abort happens, as well as not
cleaning up still dirty extent buffers.
This happens because if writeback for a log tree extent buffer failed,
then we have cleared the bit EXTENT_BUFFER_UPTODATE from the extent buffer
and we have also set the bit EXTENT_BUFFER_WRITE_ERR on it. Later on,
when trying to free the log tree with free_log_tree(), which iterates
over the tree, we can end up getting an -EIO error when trying to read
a node or a leaf, since read_extent_buffer_pages() returns -EIO if an
extent buffer does not have EXTENT_BUFFER_UPTODATE set and has the
EXTENT_BUFFER_WRITE_ERR bit set. Getting that -EIO means that we return
immediately as we can not iterate over the entire tree.
In that case we never update the reserved space for an extent buffer in
the respective block group and space_info object.
When this happens we get the following traces when unmounting the fs:
[174957.284509] BTRFS: error (device dm-0) in cleanup_transaction:1913: errno=-5 IO failure
[174957.286497] BTRFS: error (device dm-0) in free_log_tree:3420: errno=-5 IO failure
[174957.399379] ------------[ cut here ]------------
[174957.402497] WARNING: CPU: 2 PID: 3206883 at fs/btrfs/block-group.c:127 btrfs_put_block_group+0x77/0xb0 [btrfs]
[174957.407523] Modules linked in: btrfs overlay dm_zero (...)
[174957.424917] CPU: 2 PID: 3206883 Comm: umount Tainted: G W 5.16.0-rc5-btrfs-next-109 #1
[174957.426689] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
[174957.428716] RIP: 0010:btrfs_put_block_group+0x77/0xb0 [btrfs]
[174957.429717] Code: 21 48 8b bd (...)
[174957.432867] RSP: 0018:ffffb70d41cffdd0 EFLAGS: 00010206
[174957.433632] RAX: 0000000000000001 RBX: ffff8b09c3848000 RCX: ffff8b0758edd1c8
[174957.434689] RDX: 0000000000000001 RSI: ffffffffc0b467e7 RDI: ffff8b0758edd000
[174957.436068] RBP: ffff8b0758edd000 R08: 0000000000000000 R09: 0000000000000000
[174957.437114] R10: 0000000000000246 R11: 0000000000000000 R12: ffff8b09c3848148
[174957.438140] R13: ffff8b09c3848198 R14: ffff8b0758edd188 R15: dead000000000100
[174957.439317] FS: 00007f328fb82800(0000) GS:ffff8b0a2d200000(0000) knlGS:0000000000000000
[174957.440402] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[174957.441164] CR2: 00007fff13563e98 CR3: 0000000404f4e005 CR4: 0000000000370ee0
[174957.442117] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[174957.443076] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[174957.443948] Call Trace:
[174957.444264] <TASK>
[174957.444538] btrfs_free_block_groups+0x255/0x3c0 [btrfs]
[174957.445238] close_ctree+0x301/0x357 [btrfs]
[174957.445803] ? call_rcu+0x16c/0x290
[174957.446250] generic_shutdown_super+0x74/0x120
[174957.446832] kill_anon_super+0x14/0x30
[174957.447305] btrfs_kill_super+0x12/0x20 [btrfs]
[174957.447890] deactivate_locked_super+0x31/0xa0
[174957.448440] cleanup_mnt+0x147/0x1c0
[174957.448888] task_work_run+0x5c/0xa0
[174957.449336] exit_to_user_mode_prepare+0x1e5/0x1f0
[174957.449934] syscall_exit_to_user_mode+0x16/0x40
[174957.450512] do_syscall_64+0x48/0xc0
[174957.450980] entry_SYSCALL_64_after_hwframe+0x44/0xae
[174957.451605] RIP: 0033:0x7f328fdc4a97
[174957.452059] Code: 03 0c 00 f7 (...)
[174957.454320] RSP: 002b:00007fff13564ec8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
[174957.455262] RAX: 0000000000000000 RBX: 00007f328feea264 RCX: 00007f328fdc4a97
[174957.456131] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000560b8ae51dd0
[174957.457118] RBP: 0000560b8ae51ba0 R08: 0000000000000000 R09: 00007fff13563c40
[174957.458005] R10: 00007f328fe49fc0 R11: 0000000000000246 R12: 0000000000000000
[174957.459113] R13: 0000560b8ae51dd0 R14: 0000560b8ae51cb0 R15: 0000000000000000
[174957.460193] </TASK>
[174957.460534] irq event stamp: 0
[174957.461003] hardirqs last enabled at (0): [<0000000000000000>] 0x0
[174957.461947] hardirqs last disabled at (0): [<ffffffffb0e94214>] copy_process+0x934/0x2040
[174957.463147] softirqs last enabled at (0): [<ffffffffb0e94214>] copy_process+0x934/0x2040
[174957.465116] softirqs last disabled at (0): [<0000000000000000>] 0x0
[174957.466323] ---[ end trace bc7ee0c490bce3af ]---
[174957.467282] ------------[ cut here ]------------
[174957.468184] WARNING: CPU: 2 PID: 3206883 at fs/btrfs/block-group.c:3976 btrfs_free_block_groups+0x330/0x3c0 [btrfs]
[174957.470066] Modules linked in: btrfs overlay dm_zero (...)
[174957.483137] CPU: 2 PID: 3206883 Comm: umount Tainted: G W 5.16.0-rc5-btrfs-next-109 #1
[174957.484691] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
[174957.486853] RIP: 0010:btrfs_free_block_groups+0x330/0x3c0 [btrfs]
[174957.488050] Code: 00 00 00 ad de (...)
[174957.491479] RSP: 0018:ffffb70d41cffde0 EFLAGS: 00010206
[174957.492520] RAX: ffff8b08d79310b0 RBX: ffff8b09c3848000 RCX: 0000000000000000
[174957.493868] RDX: 0000000000000001 RSI: fffff443055ee600 RDI: ffffffffb1131846
[174957.495183] RBP: ffff8b08d79310b0 R08: 0000000000000000 R09: 0000000000000000
[174957.496580] R10: 0000000000000001 R11: 0000000000000000 R12: ffff8b08d7931000
[174957.498027] R13: ffff8b09c38492b0 R14: dead000000000122 R15: dead000000000100
[174957.499438] FS: 00007f328fb82800(0000) GS:ffff8b0a2d200000(0000) knlGS:0000000000000000
[174957.500990] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[174957.502117] CR2: 00007fff13563e98 CR3: 0000000404f4e005 CR4: 0000000000370ee0
[174957.503513] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[174957.504864] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[174957.506167] Call Trace:
[174957.506654] <TASK>
[174957.507047] close_ctree+0x301/0x357 [btrfs]
[174957.507867] ? call_rcu+0x16c/0x290
[174957.508567] generic_shutdown_super+0x74/0x120
[174957.509447] kill_anon_super+0x14/0x30
[174957.510194] btrfs_kill_super+0x12/0x20 [btrfs]
[174957.511123] deactivate_locked_super+0x31/0xa0
[174957.511976] cleanup_mnt+0x147/0x1c0
[174957.512610] task_work_run+0x5c/0xa0
[174957.513309] exit_to_user_mode_prepare+0x1e5/0x1f0
[174957.514231] syscall_exit_to_user_mode+0x16/0x40
[174957.515069] do_syscall_64+0x48/0xc0
[174957.515718] entry_SYSCALL_64_after_hwframe+0x44/0xae
[174957.516688] RIP: 0033:0x7f328fdc4a97
[174957.517413] Code: 03 0c 00 f7 d8 (...)
[174957.521052] RSP: 002b:00007fff13564ec8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
[174957.522514] RAX: 0000000000000000 RBX: 00007f328feea264 RCX: 00007f328fdc4a97
[174957.523950] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000560b8ae51dd0
[174957.525375] RBP: 0000560b8ae51ba0 R08: 0000000000000000 R09: 00007fff13563c40
[174957.526763] R10: 00007f328fe49fc0 R11: 0000000000000246 R12: 0000000000000000
[174957.528058] R13: 0000560b8ae51dd0 R14: 0000560b8ae51cb0 R15: 0000000000000000
[174957.529404] </TASK>
[174957.529843] irq event stamp: 0
[174957.530256] hardirqs last enabled at (0): [<0000000000000000>] 0x0
[174957.531061] hardirqs last disabled at (0): [<ffffffffb0e94214>] copy_process+0x934/0x2040
[174957.532075] softirqs last enabled at (0): [<ffffffffb0e94214>] copy_process+0x934/0x2040
[174957.533083] softirqs last disabled at (0): [<0000000000000000>] 0x0
[174957.533865] ---[ end trace bc7ee0c490bce3b0 ]---
[174957.534452] BTRFS info (device dm-0): space_info 4 has 1070841856 free, is not full
[174957.535404] BTRFS info (device dm-0): space_info total=1073741824, used=2785280, pinned=0, reserved=49152, may_use=0, readonly=65536 zone_unusable=0
[174957.537029] BTRFS info (device dm-0): global_block_rsv: size 0 reserved 0
[174957.537859] BTRFS info (device dm-0): trans_block_rsv: size 0 reserved 0
[174957.538697] BTRFS info (device dm-0): chunk_block_rsv: size 0 reserved 0
[174957.539552] BTRFS info (device dm-0): delayed_block_rsv: size 0 reserved 0
[174957.540403] BTRFS info (device dm-0): delayed_refs_rsv: size 0 reserved 0
This also means that in case we have log tree extent buffers that are
still dirty, we can end up not cleaning them up in case we find an
extent buffer with EXTENT_BUFFER_WRITE_ERR set on it, as in that case
we have no way for iterating over the rest of the tree.
This issue is very often triggered with test cases generic/475 and
generic/648 from fstests.
The issue could almost be fixed by iterating over the io tree attached to
each log root which keeps tracks of the range of allocated extent buffers,
log_root->dirty_log_pages, however that does not work and has some
inconveniences:
1) After we sync the log, we clear the range of the extent buffers from
the io tree, so we can't find them after writeback. We could keep the
ranges in the io tree, with a separate bit to signal they represent
extent buffers already written, but that means we need to hold into
more memory until the transaction commits.
How much more memory is used depends a lot on whether we are able to
allocate contiguous extent buffers on disk (and how often) for a log
tree - if we are able to, then a single extent state record can
represent multiple extent buffers, otherwise we need multiple extent
state record structures to track each extent buffer.
In fact, my earlier approach did that:
https://lore.kernel.org/linux-btrfs/3aae7c6728257c7ce2279d6660ee2797e5e34bbd.1641300250.git.fdmanana@suse.com/
However that can cause a very significant negative impact on
performance, not only due to the extra memory usage but also because
we get a larger and deeper dirty_log_pages io tree.
We got a report that, on beefy machines at least, we can get such
performance drop with fsmark for example:
https://lore.kernel.org/linux-btrfs/20220117082426.GE32491@xsang-OptiPlex-9020/
2) We would be doing it only to deal with an unexpected and exceptional
case, which is basically failure to read an extent buffer from disk
due to IO failures. On a healthy system we don't expect transaction
aborts to happen after all;
3) Instead of relying on iterating the log tree or tracking the ranges
of extent buffers in the dirty_log_pages io tree, using the radix
tree that tracks extent buffers (fs_info->buffer_radix) to find all
log tree extent buffers is not reliable either, because after writeback
of an extent buffer it can be evicted from memory by the release page
callback of the btree inode (btree_releasepage()).
Since there's no way to be able to properly cleanup a log tree without
being able to read its extent buffers from disk and without using more
memory to track the logical ranges of the allocated extent buffers do
the following:
1) When we fail to cleanup a log tree, setup a flag that indicates that
failure;
2) Trigger writeback of all log tree extent buffers that are still dirty,
and wait for the writeback to complete. This is just to cleanup their
state, page states, page leaks, etc;
3) When unmounting the fs, ignore if the number of bytes reserved in a
block group and in a space_info is not 0 if, and only if, we failed to
cleanup a log tree. Also ignore only for metadata block groups and the
metadata space_info object.
This is far from a perfect solution, but it serves to silence test
failures such as those from generic/475 and generic/648. However having
a non-zero value for the reserved bytes counters on unmount after a
transaction abort, is not such a terrible thing and it's completely
harmless, it does not affect the filesystem integrity in any way.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-18 13:39:34 +00:00
|
|
|
|
|
|
|
/*
|
|
|
|
* If there was a failure to cleanup a log tree, very likely due
|
|
|
|
* to an IO failure on a writeback attempt of one or more of its
|
|
|
|
* extent buffers, we could not do proper (and cheap) unaccounting
|
|
|
|
* of their reserved space, so don't warn on bytes_reserved > 0 in
|
|
|
|
* that case.
|
|
|
|
*/
|
|
|
|
if (!(space_info->flags & BTRFS_BLOCK_GROUP_METADATA) ||
|
|
|
|
!BTRFS_FS_LOG_CLEANUP_ERROR(info)) {
|
|
|
|
if (WARN_ON(space_info->bytes_reserved > 0))
|
|
|
|
btrfs_dump_space_info(info, space_info, 0, 0);
|
|
|
|
}
|
|
|
|
|
2020-04-07 11:38:49 +01:00
|
|
|
WARN_ON(space_info->reclaim_size > 0);
|
2019-06-20 15:38:06 -04:00
|
|
|
list_del(&space_info->list);
|
|
|
|
btrfs_sysfs_remove_space_info(space_info);
|
|
|
|
}
|
|
|
|
return 0;
|
|
|
|
}
|
2020-05-08 11:01:59 +01:00
|
|
|
|
|
|
|
void btrfs_freeze_block_group(struct btrfs_block_group *cache)
|
|
|
|
{
|
|
|
|
atomic_inc(&cache->frozen);
|
|
|
|
}
|
|
|
|
|
|
|
|
void btrfs_unfreeze_block_group(struct btrfs_block_group *block_group)
|
|
|
|
{
|
|
|
|
struct btrfs_fs_info *fs_info = block_group->fs_info;
|
|
|
|
bool cleanup;
|
|
|
|
|
|
|
|
spin_lock(&block_group->lock);
|
|
|
|
cleanup = (atomic_dec_and_test(&block_group->frozen) &&
|
2022-07-15 15:45:24 -04:00
|
|
|
test_bit(BLOCK_GROUP_FLAG_REMOVED, &block_group->runtime_flags));
|
2020-05-08 11:01:59 +01:00
|
|
|
spin_unlock(&block_group->lock);
|
|
|
|
|
|
|
|
if (cleanup) {
|
btrfs: use a dedicated data structure for chunk maps
Currently we abuse the extent_map structure for two purposes:
1) To actually represent extents for inodes;
2) To represent chunk mappings.
This is odd and has several disadvantages:
1) To create a chunk map, we need to do two memory allocations: one for
an extent_map structure and another one for a map_lookup structure, so
more potential for an allocation failure and more complicated code to
manage and link two structures;
2) For a chunk map we actually only use 3 fields (24 bytes) of the
respective extent map structure: the 'start' field to have the logical
start address of the chunk, the 'len' field to have the chunk's size,
and the 'orig_block_len' field to contain the chunk's stripe size.
Besides wasting a memory, it's also odd and not intuitive at all to
have the stripe size in a field named 'orig_block_len'.
We are also using 'block_len' of the extent_map structure to contain
the chunk size, so we have 2 fields for the same value, 'len' and
'block_len', which is pointless;
3) When an extent map is associated to a chunk mapping, we set the bit
EXTENT_FLAG_FS_MAPPING on its flags and then make its member named
'map_lookup' point to the associated map_lookup structure. This means
that for an extent map associated to an inode extent, we are not using
this 'map_lookup' pointer, so wasting 8 bytes (on a 64 bits platform);
4) Extent maps associated to a chunk mapping are never merged or split so
it's pointless to use the existing extent map infrastructure.
So add a dedicated data structure named 'btrfs_chunk_map' to represent
chunk mappings, this is basically the existing map_lookup structure with
some extra fields:
1) 'start' to contain the chunk logical address;
2) 'chunk_len' to contain the chunk's length;
3) 'stripe_size' for the stripe size;
4) 'rb_node' for insertion into a rb tree;
5) 'refs' for reference counting.
This way we do a single memory allocation for chunk mappings and we don't
waste memory for them with unused/unnecessary fields from an extent_map.
We also save 8 bytes from the extent_map structure by removing the
'map_lookup' pointer, so the size of struct extent_map is reduced from
144 bytes down to 136 bytes, and we can now have 30 extents map per 4K
page instead of 28.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-11-21 13:38:38 +00:00
|
|
|
struct btrfs_chunk_map *map;
|
|
|
|
|
|
|
|
map = btrfs_find_chunk_map(fs_info, block_group->start, 1);
|
|
|
|
/* Logic error, can't happen. */
|
|
|
|
ASSERT(map);
|
|
|
|
|
|
|
|
btrfs_remove_chunk_map(fs_info, map);
|
|
|
|
|
|
|
|
/* Once for our lookup reference. */
|
|
|
|
btrfs_free_chunk_map(map);
|
2020-05-08 11:01:59 +01:00
|
|
|
|
|
|
|
/*
|
|
|
|
* We may have left one free space entry and other possible
|
|
|
|
* tasks trimming this block group have left 1 entry each one.
|
|
|
|
* Free them if any.
|
|
|
|
*/
|
2022-08-08 16:10:27 -04:00
|
|
|
btrfs_remove_free_space_cache(block_group);
|
2020-05-08 11:01:59 +01:00
|
|
|
}
|
|
|
|
}
|
btrfs: fix race between writes to swap files and scrub
When we active a swap file, at btrfs_swap_activate(), we acquire the
exclusive operation lock to prevent the physical location of the swap
file extents to be changed by operations such as balance and device
replace/resize/remove. We also call there can_nocow_extent() which,
among other things, checks if the block group of a swap file extent is
currently RO, and if it is we can not use the extent, since a write
into it would result in COWing the extent.
However we have no protection against a scrub operation running after we
activate the swap file, which can result in the swap file extents to be
COWed while the scrub is running and operating on the respective block
group, because scrub turns a block group into RO before it processes it
and then back again to RW mode after processing it. That means an attempt
to write into a swap file extent while scrub is processing the respective
block group, will result in COWing the extent, changing its physical
location on disk.
Fix this by making sure that block groups that have extents that are used
by active swap files can not be turned into RO mode, therefore making it
not possible for a scrub to turn them into RO mode. When a scrub finds a
block group that can not be turned to RO due to the existence of extents
used by swap files, it proceeds to the next block group and logs a warning
message that mentions the block group was skipped due to active swap
files - this is the same approach we currently use for balance.
Fixes: ed46ff3d42378 ("Btrfs: support swap files")
CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-05 12:55:37 +00:00
|
|
|
|
|
|
|
bool btrfs_inc_block_group_swap_extents(struct btrfs_block_group *bg)
|
|
|
|
{
|
|
|
|
bool ret = true;
|
|
|
|
|
|
|
|
spin_lock(&bg->lock);
|
|
|
|
if (bg->ro)
|
|
|
|
ret = false;
|
|
|
|
else
|
|
|
|
bg->swap_extents++;
|
|
|
|
spin_unlock(&bg->lock);
|
|
|
|
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
|
|
|
void btrfs_dec_block_group_swap_extents(struct btrfs_block_group *bg, int amount)
|
|
|
|
{
|
|
|
|
spin_lock(&bg->lock);
|
|
|
|
ASSERT(!bg->ro);
|
|
|
|
ASSERT(bg->swap_extents >= amount);
|
|
|
|
bg->swap_extents -= amount;
|
|
|
|
spin_unlock(&bg->lock);
|
|
|
|
}
|
btrfs: introduce size class to block group allocator
The aim of this patch is to reduce the fragmentation of block groups
under certain unhappy workloads. It is particularly effective when the
size of extents correlates with their lifetime, which is something we
have observed causing fragmentation in the fleet at Meta.
This patch categorizes extents into size classes:
- x < 128KiB: "small"
- 128KiB < x < 8MiB: "medium"
- x > 8MiB: "large"
and as much as possible reduces allocations of extents into block groups
that don't match the size class. This takes advantage of any (possible)
correlation between size and lifetime and also leaves behind predictable
re-usable gaps when extents are freed; small writes don't gum up bigger
holes.
Size classes are implemented in the following way:
- Mark each new block group with a size class of the first allocation
that goes into it.
- Add two new passes to ffe: "unset size class" and "wrong size class".
First, try only matching block groups, then try unset ones, then allow
allocation of new ones, and finally allow mismatched block groups.
- Filtering is done just by skipping inappropriate ones, there is no
special size class indexing.
Other solutions I considered were:
- A best fit allocator with an rb-tree. This worked well, as small
writes didn't leak big holes from large freed extents, but led to
regressions in ffe and write performance due to lock contention on
the rb-tree with every allocation possibly updating it in parallel.
Perhaps something clever could be done to do the updates in the
background while being "right enough".
- A fixed size "working set". This prevents freeing an extent
drastically changing where writes currently land, and seems like a
good option too. Doesn't take advantage of size in any way.
- The same size class idea, but implemented with xarray marks. This
turned out to be slower than looping the linked list and skipping
wrong block groups, and is also less flexible since we must have only
3 size classes (max #marks). With the current approach we can have as
many as we like.
Performance testing was done via: https://github.com/josefbacik/fsperf
Of particular relevance are the new fragmentation specific tests.
A brief summary of the testing results:
- Neutral results on existing tests. There are some minor regressions
and improvements here and there, but nothing that truly stands out as
notable.
- Improvement on new tests where size class and extent lifetime are
correlated. Fragmentation in these cases is completely eliminated
and write performance is generally a little better. There is also
significant improvement where extent sizes are just a bit larger than
the size class boundaries.
- Regression on one new tests: where the allocations are sized
intentionally a hair under the borders of the size classes. Results
are neutral on the test that intentionally attacks this new scheme by
mixing extent size and lifetime.
The full dump of the performance results can be found here:
https://bur.io/fsperf/size-class-2022-11-15.txt
(there are ANSI escape codes, so best to curl and view in terminal)
Here is a snippet from the full results for a new test which mixes
buffered writes appending to a long lived set of files and large short
lived fallocates:
bufferedappendvsfallocate results
metric baseline current stdev diff
======================================================================================
avg_commit_ms 31.13 29.20 2.67 -6.22%
bg_count 14 15.60 0 11.43%
commits 11.10 12.20 0.32 9.91%
elapsed 27.30 26.40 2.98 -3.30%
end_state_mount_ns 11122551.90 10635118.90 851143.04 -4.38%
end_state_umount_ns 1.36e+09 1.35e+09 12248056.65 -1.07%
find_free_extent_calls 116244.30 114354.30 964.56 -1.63%
find_free_extent_ns_max 599507.20 1047168.20 103337.08 74.67%
find_free_extent_ns_mean 3607.19 3672.11 101.20 1.80%
find_free_extent_ns_min 500 512 6.67 2.40%
find_free_extent_ns_p50 2848 2876 37.65 0.98%
find_free_extent_ns_p95 4916 5000 75.45 1.71%
find_free_extent_ns_p99 20734.49 20920.48 1670.93 0.90%
frag_pct_max 61.67 0 8.05 -100.00%
frag_pct_mean 43.59 0 6.10 -100.00%
frag_pct_min 25.91 0 16.60 -100.00%
frag_pct_p50 42.53 0 7.25 -100.00%
frag_pct_p95 61.67 0 8.05 -100.00%
frag_pct_p99 61.67 0 8.05 -100.00%
fragmented_bg_count 6.10 0 1.45 -100.00%
max_commit_ms 49.80 46 5.37 -7.63%
sys_cpu 2.59 2.62 0.29 1.39%
write_bw_bytes 1.62e+08 1.68e+08 17975843.50 3.23%
write_clat_ns_mean 57426.39 54475.95 2292.72 -5.14%
write_clat_ns_p50 46950.40 42905.60 2101.35 -8.62%
write_clat_ns_p99 148070.40 143769.60 2115.17 -2.90%
write_io_kbytes 4194304 4194304 0 0.00%
write_iops 2476.15 2556.10 274.29 3.23%
write_lat_ns_max 2101667.60 2251129.50 370556.59 7.11%
write_lat_ns_mean 59374.91 55682.00 2523.09 -6.22%
write_lat_ns_min 17353.10 16250 1646.08 -6.36%
There are some mixed improvements/regressions in most metrics along with
an elimination of fragmentation in this workload.
On the balance, the drastic 1->0 improvement in the happy cases seems
worth the mix of regressions and improvements we do observe.
Some considerations for future work:
- Experimenting with more size classes
- More hinting/search ordering work to approximate a best-fit allocator
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-15 16:06:33 -08:00
|
|
|
|
|
|
|
enum btrfs_block_group_size_class btrfs_calc_block_group_size_class(u64 size)
|
|
|
|
{
|
|
|
|
if (size <= SZ_128K)
|
|
|
|
return BTRFS_BG_SZ_SMALL;
|
|
|
|
if (size <= SZ_8M)
|
|
|
|
return BTRFS_BG_SZ_MEDIUM;
|
|
|
|
return BTRFS_BG_SZ_LARGE;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Handle a block group allocating an extent in a size class
|
|
|
|
*
|
|
|
|
* @bg: The block group we allocated in.
|
|
|
|
* @size_class: The size class of the allocation.
|
|
|
|
* @force_wrong_size_class: Whether we are desperate enough to allow
|
|
|
|
* mismatched size classes.
|
|
|
|
*
|
|
|
|
* Returns: 0 if the size class was valid for this block_group, -EAGAIN in the
|
|
|
|
* case of a race that leads to the wrong size class without
|
|
|
|
* force_wrong_size_class set.
|
|
|
|
*
|
|
|
|
* find_free_extent will skip block groups with a mismatched size class until
|
|
|
|
* it really needs to avoid ENOSPC. In that case it will set
|
|
|
|
* force_wrong_size_class. However, if a block group is newly allocated and
|
|
|
|
* doesn't yet have a size class, then it is possible for two allocations of
|
|
|
|
* different sizes to race and both try to use it. The loser is caught here and
|
|
|
|
* has to retry.
|
|
|
|
*/
|
|
|
|
int btrfs_use_block_group_size_class(struct btrfs_block_group *bg,
|
|
|
|
enum btrfs_block_group_size_class size_class,
|
|
|
|
bool force_wrong_size_class)
|
|
|
|
{
|
|
|
|
ASSERT(size_class != BTRFS_BG_SZ_NONE);
|
|
|
|
|
|
|
|
/* The new allocation is in the right size class, do nothing */
|
|
|
|
if (bg->size_class == size_class)
|
|
|
|
return 0;
|
|
|
|
/*
|
|
|
|
* The new allocation is in a mismatched size class.
|
|
|
|
* This means one of two things:
|
|
|
|
*
|
|
|
|
* 1. Two tasks in find_free_extent for different size_classes raced
|
|
|
|
* and hit the same empty block_group. Make the loser try again.
|
|
|
|
* 2. A call to find_free_extent got desperate enough to set
|
|
|
|
* 'force_wrong_slab'. Don't change the size_class, but allow the
|
|
|
|
* allocation.
|
|
|
|
*/
|
|
|
|
if (bg->size_class != BTRFS_BG_SZ_NONE) {
|
|
|
|
if (force_wrong_size_class)
|
|
|
|
return 0;
|
|
|
|
return -EAGAIN;
|
|
|
|
}
|
|
|
|
/*
|
|
|
|
* The happy new block group case: the new allocation is the first
|
|
|
|
* one in the block_group so we set size_class.
|
|
|
|
*/
|
|
|
|
bg->size_class = size_class;
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
|
2022-12-15 16:06:35 -08:00
|
|
|
|
2024-06-26 23:39:11 +02:00
|
|
|
bool btrfs_block_group_should_use_size_class(const struct btrfs_block_group *bg)
|
2022-12-15 16:06:35 -08:00
|
|
|
{
|
|
|
|
if (btrfs_is_zoned(bg->fs_info))
|
|
|
|
return false;
|
|
|
|
if (!btrfs_is_block_group_data_only(bg))
|
|
|
|
return false;
|
|
|
|
return true;
|
|
|
|
}
|