2019-06-18 20:09:19 +00:00
// SPDX-License-Identifier: GPL-2.0

#include <linux/spinlock.h>
btrfs: dynamic block_group reclaim threshold
We can currently recover allocated block_groups by:
- explicitly starting balance operations
- "auto reclaim" via bg_reclaim_threshold
The latter works by checking against a fixed threshold on frees. If we
pass from above the threshold to below, relocation triggers and the
block group will get reclaimed by the cleaner thread (assuming it is
still eligible).
Picking a threshold is challenging. Too high, and you end up trying to
reclaim very full block_groups which is quite costly, and you don't do
reclaim on block_groups that don't get quite THAT full, but could still
be quite fragmented and stranding a lot of space. Too low, and you
similarly miss out on reclaim even if you badly need it to avoid running
out of unallocated space, if you have heavily fragmented block groups
living above the threshold.
No matter the threshold, it suffers from a workload that happens to
bounce around that threshold, which can introduce arbitrary amounts of
reclaim waste.
To improve this situation, introduce a dynamic threshold. The basic idea
behind this threshold is that it should be very lax when there is plenty
of unallocated space, and increasingly aggressive as we approach zero
unallocated space. To that end, it sets a target for unallocated space
(10 chunks) and then linearly increases the threshold as unallocated
space falls further short of that target. The formula is:
(target - unalloc) / target
I tested this by running it on three interesting workloads:
1. bounce allocations around X% full.
2. fill up all the way and introduce full fragmentation.
3. write in a fragmented way until the filesystem is just about full.
1. and 2. attack the weaknesses of a fixed threshold; fixed either works
perfectly or fully falls apart, depending on the threshold. Dynamic
always handles these cases well.
3. attacks dynamic by checking whether it is too zealous to reclaim in
conditions with low unallocated and low unused. It tends to claw back
1GiB of unallocated fairly aggressively, but not much more. Early
versions of dynamic threshold struggled on this test.
Additional work could be done to intelligently ratchet up the urgency of
reclaim in very low unallocated conditions. Existing mechanisms are
already useless in that case anyway.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-02-02 19:52:16 +00:00
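The formula in the commit message above can be sketched in userspace C. This is only an illustration under stated assumptions: the helper name toy_dynamic_reclaim_threshold(), the scaling to a percentage, and the clamp to 0 once unallocated space reaches the target are choices made for the example, not the kernel's implementation.

```c
#include <stdint.h>

/* "10 chunks" target taken from the commit message above. */
#define TOY_UNALLOC_CHUNK_TARGET 10ULL

/*
 * Hypothetical helper: return a reclaim threshold in percent.
 * It evaluates (target - unalloc) / target, scaled by 100, where
 * target = 10 * chunk_size. The result is 0 when there is at least
 * the target amount of unallocated space and 100 when there is none.
 */
static int toy_dynamic_reclaim_threshold(uint64_t unalloc, uint64_t chunk_size)
{
	uint64_t target = TOY_UNALLOC_CHUNK_TARGET * chunk_size;

	/* Plenty of unallocated space (or a degenerate chunk size): be lax. */
	if (target == 0 || unalloc >= target)
		return 0;

	return (int)(((target - unalloc) * 100) / target);
}
```

With 1GiB chunks, 5GiB of unallocated space gives a threshold of 50 and 1GiB gives 90, so reclaim becomes more aggressive as unallocated space shrinks.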
#include <linux/minmax.h>
#include "misc.h"
#include "ctree.h"
#include "space-info.h"
#include "sysfs.h"
#include "volumes.h"
#include "free-space-cache.h"
#include "ordered-data.h"
#include "transaction.h"
#include "block-group.h"
#include "fs.h"
#include "accessors.h"
#include "extent-tree.h"

/*
 * HOW DOES SPACE RESERVATION WORK
 *
 * If you want to know about delalloc specifically, there is a separate comment
 * for that with the delalloc code. This comment is about how the whole system
 * works generally.
 *
 * BASIC CONCEPTS
 *
 *   1) space_info. This is the ultimate arbiter of how much space we can use.
 *   There's a description of the bytes_ fields with the struct declaration,
 *   refer to that for specifics on each field. Suffice it to say that for
 *   reservations we care about total_bytes - SUM(space_info->bytes_) when
 *   determining if there is space to make an allocation. There is a space_info
 *   for METADATA, SYSTEM, and DATA areas.
 *
 *   2) block_rsv's. These are basically buckets for every different type of
 *   metadata reservation we have. You can see the comment in the block_rsv
 *   code on the rules for each type, but generally block_rsv->reserved is how
 *   much space is accounted for in space_info->bytes_may_use.
 *
 *   3) btrfs_calc*_size. These are the worst case calculations we use based
 *   on the number of items we will want to modify. We have one for changing
 *   items, and one for inserting new items. Generally we use these helpers to
 *   determine the size of the block reserves, and then use the actual bytes
 *   values to adjust the space_info counters.
 *
 * MAKING RESERVATIONS, THE NORMAL CASE
 *
 *   We call into either btrfs_reserve_data_bytes() or
 *   btrfs_reserve_metadata_bytes(), depending on which we're looking for, with
 *   num_bytes we want to reserve.
 *
 *   ->reserve
 *     space_info->bytes_may_use += num_bytes
 *
 *   ->extent allocation
 *     Call btrfs_add_reserved_bytes() which does
 *     space_info->bytes_may_use -= num_bytes
 *     space_info->bytes_reserved += extent_bytes
 *
 *   ->insert reference
 *     Call btrfs_update_block_group() which does
 *     space_info->bytes_reserved -= extent_bytes
 *     space_info->bytes_used += extent_bytes
 *
 * MAKING RESERVATIONS, FLUSHING NORMALLY (non-priority)
 *
 *   Assume we are unable to simply make the reservation because we do not have
 *   enough space
 *
 *   -> __reserve_bytes
 *     create a reserve_ticket with ->bytes set to our reservation, add it to
 *     the tail of space_info->tickets, kick async flush thread
 *
 *   ->handle_reserve_ticket
 *     wait on ticket->wait for ->bytes to be reduced to 0, or ->error to be set
 *     on the ticket.
 *
 *   -> btrfs_async_reclaim_metadata_space/btrfs_async_reclaim_data_space
 *     Flushes various things attempting to free up space.
 *
 *   -> btrfs_try_granting_tickets()
 *     This is called by anything that either subtracts space from
 *     space_info->bytes_may_use, ->bytes_pinned, etc, or adds to the
 *     space_info->total_bytes. This loops through the ->priority_tickets and
 *     then the ->tickets list checking to see if the reservation can be
 *     completed. If it can the space is added to space_info->bytes_may_use and
 *     the ticket is woken up.
 *
 *   -> ticket wakeup
 *     Check if ->bytes == 0, if it is we got our reservation and we can carry
 *     on, if not return the appropriate error (ENOSPC, but can be EINTR if we
 *     were interrupted.)
 *
 * MAKING RESERVATIONS, FLUSHING HIGH PRIORITY
 *
 *   Same as the above, except we add ourselves to the
 *   space_info->priority_tickets, and we do not use ticket->wait, we simply
 *   call flush_space() ourselves for the states that are safe for us to call
 *   without deadlocking and hope for the best.
 *
 * THE FLUSHING STATES
 *
 *   Generally speaking we will have two cases for each state, a "nice" state
 *   and an "ALL THE THINGS" state. In btrfs we delay a lot of work in order to
 *   reduce the locking overhead on the various trees, and even to keep from
 *   doing any work at all in the case of delayed refs. Each of these delayed
 *   things however holds reservations, and so letting them run allows us to
 *   reclaim space so we can make new reservations.
 *
 *   FLUSH_DELAYED_ITEMS
 *     Every inode has a delayed item to update the inode. Take a simple write
 *     for example, we would update the inode item at write time to update the
 *     mtime, and then again at finish_ordered_io() time in order to update the
 *     isize or bytes. We keep these delayed items to coalesce these operations
 *     into a single operation done on demand. These are an easy way to reclaim
 *     metadata space.
 *
 *   FLUSH_DELALLOC
 *     Look at the delalloc comment to get an idea of how much space is reserved
 *     for delayed allocation. We can reclaim some of this space simply by
 *     running delalloc, but usually we need to wait for ordered extents to
 *     reclaim the bulk of this space.
 *
 *   FLUSH_DELAYED_REFS
 *     We have a block reserve for the outstanding delayed refs space, and every
 *     delayed ref operation holds a reservation. Running these is a quick way
 *     to reclaim space, but we want to hold this until the end because COW can
 *     churn a lot and we can avoid making some extent tree modifications if we
 *     are able to delay for as long as possible.
 *
 *   ALLOC_CHUNK
 *     We will skip this the first time through space reservation, because of
 *     overcommit and we don't want to have a lot of useless metadata space when
 *     our worst case reservations will likely never come true.
 *
 *   RUN_DELAYED_IPUTS
 *     If we're freeing inodes we're likely freeing checksums, file extent
 *     items, and extent tree items. Loads of space could be freed up by these
 *     operations, however they won't be usable until the transaction commits.
 *
 *   COMMIT_TRANS
 *     This will commit the transaction. Historically we had a lot of logic
 *     surrounding whether or not we'd commit the transaction, but this was born
 *     out of a pre-tickets era where we could end up committing the transaction
 *     thousands of times in a row without making progress. Now thanks to our
 *     ticketing system we know if we're not making progress and can error
 *     everybody out after a few commits rather than burning the disk hoping for
 *     a different answer.
 *
 * OVERCOMMIT
 *
 *   Because we hold so many reservations for metadata we will allow you to
 *   reserve more space than is currently free in the currently allocated
 *   metadata space. This only happens with metadata, data does not allow
 *   overcommitting.
 *
 *   You can see the current logic for when we allow overcommit in
 *   btrfs_can_overcommit(), but it only applies to unallocated space. If there
 *   is no unallocated space to be had, all reservations are kept within the
 *   free space in the allocated metadata chunks.
 *
 *   Because of overcommitting, you generally want to use the
 *   btrfs_can_overcommit() logic for metadata allocations, as it does the right
 *   thing with or without extra unallocated space.
*/
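
The "normal case" flow described in the comment above (reserve, extent allocation, insert reference) can be modelled with a toy userspace structure. This is a sketch under assumptions: struct toy_space_info and the toy_*() helpers are hypothetical names invented for illustration, only three of the byte counters are modelled, and a failed reservation simply returns false where the real code would queue a ticket and flush.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy model only; this is not the kernel's struct btrfs_space_info. */
struct toy_space_info {
	uint64_t total_bytes;
	uint64_t bytes_may_use;		/* reserved, not yet backed by an extent */
	uint64_t bytes_reserved;	/* extent allocated, reference not inserted */
	uint64_t bytes_used;		/* extent referenced in the extent tree */
};

/* Sum of the counters that count against total_bytes in this model. */
static uint64_t toy_used(const struct toy_space_info *s)
{
	return s->bytes_may_use + s->bytes_reserved + s->bytes_used;
}

/* ->reserve: account num_bytes in bytes_may_use if it fits. */
static bool toy_reserve(struct toy_space_info *s, uint64_t num_bytes)
{
	if (toy_used(s) + num_bytes > s->total_bytes)
		return false;
	s->bytes_may_use += num_bytes;
	return true;
}

/* ->extent allocation: move the reservation to bytes_reserved. */
static void toy_alloc_extent(struct toy_space_info *s, uint64_t num_bytes,
			     uint64_t extent_bytes)
{
	s->bytes_may_use -= num_bytes;
	s->bytes_reserved += extent_bytes;
}

/* ->insert reference: the extent is now counted as used. */
static void toy_insert_reference(struct toy_space_info *s, uint64_t extent_bytes)
{
	s->bytes_reserved -= extent_bytes;
	s->bytes_used += extent_bytes;
}
```

Reserving 60 bytes out of 100, allocating a 40 byte extent against that reservation, and then inserting the reference leaves bytes_used at 40 with the other two counters back at 0.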

u64 __pure btrfs_space_info_used(const struct btrfs_space_info *s_info,
				 bool may_use_included)
{
	ASSERT(s_info);
	return s_info->bytes_used + s_info->bytes_reserved +
		s_info->bytes_pinned + s_info->bytes_readonly +
		s_info->bytes_zone_unusable +
		(may_use_included ? s_info->bytes_may_use : 0);
}

/*
 * after adding space to the filesystem, we need to clear the full flags
 * on all the space infos.
 */
void btrfs_clear_space_info_full(struct btrfs_fs_info *info)
{
	struct list_head *head = &info->space_info;
	struct btrfs_space_info *found;

	list_for_each_entry(found, head, list)
		found->full = 0;
}

/*
 * Block groups with more than this value (percents) of unusable space will be
 * scheduled for background reclaim.
 */
#define BTRFS_DEFAULT_ZONED_RECLAIM_THRESH			(75)

#define BTRFS_UNALLOC_BLOCK_GROUP_TARGET (10ULL)

/*
 * Calculate chunk size depending on volume type (regular or zoned).
 */
static u64 calc_chunk_size(const struct btrfs_fs_info *fs_info, u64 flags)
{
	if (btrfs_is_zoned(fs_info))
		return fs_info->zone_size;

	ASSERT(flags & BTRFS_BLOCK_GROUP_TYPE_MASK);

	if (flags & BTRFS_BLOCK_GROUP_DATA)
btrfs: fix the max chunk size and stripe length calculation
[BEHAVIOR CHANGE]
Since commit f6fca3917b4d ("btrfs: store chunk size in space-info
struct"), btrfs no longer can create larger data chunks than 1G:
mkfs.btrfs -f -m raid1 -d raid0 $dev1 $dev2 $dev3 $dev4
mount $dev1 $mnt
btrfs balance start --full $mnt
btrfs balance start --full $mnt
umount $mnt
btrfs ins dump-tree -t chunk $dev1 | grep "DATA|RAID0" -C 2
Before that offending commit, what we got is a 4G data chunk:
item 6 key (FIRST_CHUNK_TREE CHUNK_ITEM 9492758528) itemoff 15491 itemsize 176
length 4294967296 owner 2 stripe_len 65536 type DATA|RAID0
io_align 65536 io_width 65536 sector_size 4096
num_stripes 4 sub_stripes 1
Now what we got is only 1G data chunk:
item 6 key (FIRST_CHUNK_TREE CHUNK_ITEM 6271533056) itemoff 15491 itemsize 176
length 1073741824 owner 2 stripe_len 65536 type DATA|RAID0
io_align 65536 io_width 65536 sector_size 4096
num_stripes 4 sub_stripes 1
This multiplies the number of data chunks by the number of devices,
which not only increases system chunk usage but also greatly increases
mount time.
Without a proper reason, we should not change the max chunk size.
[CAUSE]
Previously, we set the max data chunk size to 10G and the max data
stripe length to 1G.
Commit f6fca3917b4d ("btrfs: store chunk size in space-info struct")
completely ignored the 10G limit and used the 1G max stripe limit
instead, causing the above shrink in max data chunk size.
[FIX]
Fix the max data chunk size to 10G, and in decide_stripe_size_regular()
we limit stripe_size to 1G manually.
This should only affect data chunks, as for metadata chunks we always
set the max stripe size the same as max chunk size (256M or 1G
depending on fs size).
Now the same script produces the same old result:
item 6 key (FIRST_CHUNK_TREE CHUNK_ITEM 9492758528) itemoff 15491 itemsize 176
length 4294967296 owner 2 stripe_len 65536 type DATA|RAID0
io_align 65536 io_width 65536 sector_size 4096
num_stripes 4 sub_stripes 1
Reported-by: Wang Yugui <wangyugui@e16-tech.com>
Fixes: f6fca3917b4d ("btrfs: store chunk size in space-info struct")
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-08-18 07:06:44 +00:00
		return BTRFS_MAX_DATA_CHUNK_SIZE;
	else if (flags & BTRFS_BLOCK_GROUP_SYSTEM)
		return SZ_32M;

	/* Handle BTRFS_BLOCK_GROUP_METADATA */
	if (fs_info->fs_devices->total_rw_bytes > 50ULL * SZ_1G)
		return SZ_1G;

	return SZ_256M;
}
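
The size selection in calc_chunk_size() can be restated as a userspace sketch for experimentation. The names below (toy_calc_chunk_size(), the TOY_* constants and flag values) are assumptions made for this example and do not exist in the kernel; the 10G data limit is taken from the commit message quoted inside the function, and the zoned case is reduced to a boolean plus a zone size.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy stand-ins for the kernel's size constants and block group flags. */
#define TOY_SZ_32M		(32ULL << 20)
#define TOY_SZ_256M		(256ULL << 20)
#define TOY_SZ_1G		(1ULL << 30)
#define TOY_MAX_DATA_CHUNK_SIZE	(10ULL * TOY_SZ_1G)

#define TOY_BG_DATA		(1U << 0)
#define TOY_BG_SYSTEM		(1U << 1)
#define TOY_BG_METADATA		(1U << 2)

/*
 * Hypothetical userspace mirror of the decision order shown above:
 * zoned first, then DATA, then SYSTEM, and finally METADATA, whose
 * size depends on whether the filesystem is larger than 50G.
 */
static uint64_t toy_calc_chunk_size(bool zoned, uint64_t zone_size,
				    unsigned int flags, uint64_t total_rw_bytes)
{
	if (zoned)
		return zone_size;

	if (flags & TOY_BG_DATA)
		return TOY_MAX_DATA_CHUNK_SIZE;
	if (flags & TOY_BG_SYSTEM)
		return TOY_SZ_32M;

	/* Remaining case: metadata. */
	if (total_rw_bytes > 50ULL * TOY_SZ_1G)
		return TOY_SZ_1G;
	return TOY_SZ_256M;
}
```

Because the DATA check comes first, a mixed DATA|METADATA flag set takes the data limit in this sketch, just as the ordering of the checks in the function above implies.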

/*
 * Update default chunk size.
 */
void btrfs_update_space_info_chunk_size(struct btrfs_space_info *space_info,
					u64 chunk_size)
{
	WRITE_ONCE(space_info->chunk_size, chunk_size);
}

static int create_space_info(struct btrfs_fs_info *info, u64 flags)
{
	struct btrfs_space_info *space_info;
	int i;
	int ret;

	space_info = kzalloc(sizeof(*space_info), GFP_NOFS);
	if (!space_info)
		return -ENOMEM;

	space_info->fs_info = info;
	for (i = 0; i < BTRFS_NR_RAID_TYPES; i++)
		INIT_LIST_HEAD(&space_info->block_groups[i]);
	init_rwsem(&space_info->groups_sem);
	spin_lock_init(&space_info->lock);
	space_info->flags = flags & BTRFS_BLOCK_GROUP_TYPE_MASK;
	space_info->force_alloc = CHUNK_ALLOC_NO_FORCE;
	INIT_LIST_HEAD(&space_info->ro_bgs);
	INIT_LIST_HEAD(&space_info->tickets);
	INIT_LIST_HEAD(&space_info->priority_tickets);
	space_info->clamp = 1;
	btrfs_update_space_info_chunk_size(space_info, calc_chunk_size(info, flags));

	if (btrfs_is_zoned(info))
		space_info->bg_reclaim_threshold = BTRFS_DEFAULT_ZONED_RECLAIM_THRESH;

	ret = btrfs_sysfs_add_space_info_type(info, space_info);
	if (ret)
		return ret;

	list_add(&space_info->list, &info->space_info);
	if (flags & BTRFS_BLOCK_GROUP_DATA)
		info->data_sinfo = space_info;

	return ret;
}

int btrfs_init_space_info(struct btrfs_fs_info *fs_info)
{
	struct btrfs_super_block *disk_super;
	u64 features;
	u64 flags;
	int mixed = 0;
	int ret;

	disk_super = fs_info->super_copy;
	if (!btrfs_super_root(disk_super))
		return -EINVAL;

	features = btrfs_super_incompat_flags(disk_super);
	if (features & BTRFS_FEATURE_INCOMPAT_MIXED_GROUPS)
		mixed = 1;

	flags = BTRFS_BLOCK_GROUP_SYSTEM;
	ret = create_space_info(fs_info, flags);
	if (ret)
		goto out;

	if (mixed) {
		flags = BTRFS_BLOCK_GROUP_METADATA | BTRFS_BLOCK_GROUP_DATA;
		ret = create_space_info(fs_info, flags);
	} else {
		flags = BTRFS_BLOCK_GROUP_METADATA;
		ret = create_space_info(fs_info, flags);
		if (ret)
			goto out;

		flags = BTRFS_BLOCK_GROUP_DATA;
		ret = create_space_info(fs_info, flags);
	}
out:
	return ret;
}

void btrfs_add_bg_to_space_info(struct btrfs_fs_info *info,
				struct btrfs_block_group *block_group)
{
	struct btrfs_space_info *found;
	int factor, index;

	factor = btrfs_bg_type_to_factor(block_group->flags);

	found = btrfs_find_space_info(info, block_group->flags);
	ASSERT(found);
	spin_lock(&found->lock);
	found->total_bytes += block_group->length;
	found->disk_total += block_group->length * factor;
	found->bytes_used += block_group->used;
	found->disk_used += block_group->used * factor;
	found->bytes_readonly += block_group->bytes_super;
	btrfs_space_info_update_bytes_zone_unusable(info, found, block_group->zone_unusable);
	if (block_group->length > 0)
		found->full = 0;
	btrfs_try_granting_tickets(info, found);
	spin_unlock(&found->lock);

	block_group->space_info = found;

	index = btrfs_bg_flags_to_raid_index(block_group->flags);
	down_write(&found->groups_sem);
	list_add_tail(&block_group->list, &found->block_groups[index]);
	up_write(&found->groups_sem);
}

struct btrfs_space_info *btrfs_find_space_info(struct btrfs_fs_info *info,
					       u64 flags)
{
	struct list_head *head = &info->space_info;
	struct btrfs_space_info *found;

	flags &= BTRFS_BLOCK_GROUP_TYPE_MASK;

	list_for_each_entry(found, head, list) {
		if (found->flags & flags)
			return found;
	}
	return NULL;
}

static u64 calc_effective_data_chunk_size(struct btrfs_fs_info *fs_info)
{
	struct btrfs_space_info *data_sinfo;
	u64 data_chunk_size;

	/*
	 * Calculate the data_chunk_size, space_info->chunk_size is the
	 * "optimal" chunk size based on the fs size. However when we actually
	 * allocate the chunk we will strip this down further, making it no
	 * more than 10% of the disk or 1G, whichever is smaller.
	 *
	 * On the zoned mode, we need to use zone_size (= data_sinfo->chunk_size)
	 * as it is.
	 */
	data_sinfo = btrfs_find_space_info(fs_info, BTRFS_BLOCK_GROUP_DATA);
	if (btrfs_is_zoned(fs_info))
		return data_sinfo->chunk_size;
	data_chunk_size = min(data_sinfo->chunk_size,
			      mult_perc(fs_info->fs_devices->total_rw_bytes, 10));
	return min_t(u64, data_chunk_size, SZ_1G);
}
}
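
The clamping done by calc_effective_data_chunk_size() can be tried out in a small userspace sketch. The function name toy_effective_data_chunk_size() and its flat parameters are assumptions for illustration; the 10% step is written as a plain multiply and divide, which is assumed to approximate mult_perc() for the purposes of the example.

```c
#include <stdbool.h>
#include <stdint.h>

#define TOY_SZ_1G (1ULL << 30)

/*
 * Hypothetical userspace version of the clamp described in the comment
 * above: on a zoned filesystem the chunk size is used as is; otherwise
 * the result is the smallest of the optimal chunk size, 10% of the
 * writable device bytes, and 1G.
 */
static uint64_t toy_effective_data_chunk_size(bool zoned, uint64_t chunk_size,
					      uint64_t total_rw_bytes)
{
	uint64_t ten_percent = total_rw_bytes * 10 / 100;
	uint64_t size = chunk_size;

	if (zoned)
		return chunk_size;

	if (ten_percent < size)
		size = ten_percent;
	if (size > TOY_SZ_1G)
		size = TOY_SZ_1G;
	return size;
}
```

For an 8GiB filesystem with a 10G optimal data chunk size, the 10% term wins and the effective size is a little under 820MiB; for a 100GiB filesystem the 1G cap applies instead.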

static u64 calc_available_free_space(struct btrfs_fs_info *fs_info,
			  const struct btrfs_space_info *space_info,
			  enum btrfs_reserve_flush_enum flush)
{
	u64 profile;
	u64 avail;
btrfs: adjust overcommit logic when very close to full
A user reported some unpleasant behavior with very small file systems.
The reproducer is this
$ mkfs.btrfs -f -m single -b 8g /dev/vdb
$ mount /dev/vdb /mnt/test
$ dd if=/dev/zero of=/mnt/test/testfile bs=512M count=20
This will result in usage that looks like this
Overall:
Device size: 8.00GiB
Device allocated: 8.00GiB
Device unallocated: 1.00MiB
Device missing: 0.00B
Device slack: 2.00GiB
Used: 5.47GiB
Free (estimated): 2.52GiB (min: 2.52GiB)
Free (statfs, df): 0.00B
Data ratio: 1.00
Metadata ratio: 1.00
Global reserve: 5.50MiB (used: 0.00B)
Multiple profiles: no
Data,single: Size:7.99GiB, Used:5.46GiB (68.41%)
/dev/vdb 7.99GiB
Metadata,single: Size:8.00MiB, Used:5.77MiB (72.07%)
/dev/vdb 8.00MiB
System,single: Size:4.00MiB, Used:16.00KiB (0.39%)
/dev/vdb 4.00MiB
Unallocated:
/dev/vdb 1.00MiB
As you can see we've gotten ourselves quite full with metadata, with all
of the disk being allocated for data.
On smaller file systems there's not a lot of time before we get full, so
our overcommit behavior bites us here. Generally speaking data
reservations result in chunk allocations as we assume reservation ==
actual use for data. This means at any point we could end up with a
chunk allocation for data, and if we're very close to full we could do
this before we have a chance to figure out that we need another metadata
chunk.
Address this by adjusting the overcommit logic. Simply put we need to
take away 1 chunk from the available chunk space in case of a data
reservation. This will allow us to stop overcommitting before we
potentially lose this space to a data allocation. With this fix in
place we properly allocate a metadata chunk before we're completely
full, allowing for enough slack space in metadata.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-09-27 17:47:01 +00:00
	u64 data_chunk_size;
	int factor;

	if (space_info->flags & BTRFS_BLOCK_GROUP_SYSTEM)
		profile = btrfs_system_alloc_profile(fs_info);
	else
		profile = btrfs_metadata_alloc_profile(fs_info);

	avail = atomic64_read(&fs_info->free_chunk_space);

	/*
	 * If we have dup, raid1 or raid10 then only half of the free
	 * space is actually usable. For raid56, the space info used
	 * doesn't include the parity drive, so we don't have to
	 * change the math
	 */
	factor = btrfs_bg_type_to_factor(profile);
	avail = div_u64(avail, factor);
	if (avail == 0)
		return 0;

btrfs: dynamic block_group reclaim threshold
We can currently recover allocated block_groups by:
- explicitly starting balance operations
- "auto reclaim" via bg_reclaim_threshold
The latter works by checking against a fixed threshold on frees. If we
pass from above the threshold to below, relocation triggers and the
block group will get reclaimed by the cleaner thread (assuming it is
still eligible)
Picking a threshold is challenging. Too high, and you end up trying to
reclaim very full block_groups which is quite costly, and you don't do
reclaim on block_groups that don't get quite THAT full, but could still
be quite fragmented and stranding a lot of space. Too low, and you
similarly miss out on reclaim even if you badly need it to avoid running
out of unallocated space, if you have heavily fragmented block groups
living above the threshold.
No matter the threshold, it suffers from a workload that happens to
bounce around that threshold, which can introduce arbitrary amounts of
reclaim waste.
To improve this situation, introduce a dynamic threshold. The basic idea
behind this threshold is that it should be very lax when there is plenty
of unallocated space, and increasingly aggressive as we approach zero
unallocated space. To that end, it sets a target for unallocated space
(10 chunks) and then raises the threshold linearly as unallocated space
falls further short of that target. The formula is:
(target - unalloc) / target
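As a worked illustration of the formula, here is a minimal userspace sketch. It assumes the result is expressed as a percentage and that the threshold is 0 once unallocated space reaches the target; the function name and the percentage scaling are assumptions made for this example, not taken from the patch.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Illustrative only: reclaim threshold as a percentage (0..100).
 * The target of 10 chunks is the value stated in the commit message.
 */
static unsigned int dynamic_reclaim_threshold_pct(uint64_t unalloc,
						  uint64_t chunk_size)
{
	uint64_t target = 10 * chunk_size;

	assert(chunk_size > 0);

	/* Plenty of unallocated space: no pressure to reclaim. */
	if (unalloc >= target)
		return 0;

	/* (target - unalloc) / target, scaled to a percentage. */
	return (unsigned int)(((target - unalloc) * 100) / target);
}
```

With 1GiB chunks, 5GiB unallocated gives 50 and 1GiB unallocated gives 90, so the threshold rises as unallocated space shrinks.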
I tested this by running it on three interesting workloads:
1. bounce allocations around X% full.
2. fill up all the way and introduce full fragmentation.
3. write in a fragmented way until the filesystem is just about full.
1. and 2. attack the weaknesses of a fixed threshold; fixed either works
perfectly or fully falls apart, depending on the threshold. Dynamic
always handles these cases well.
3. attacks dynamic by checking whether it is too zealous to reclaim in
conditions with low unallocated and low unused. It tends to claw back
1GiB of unallocated fairly aggressively, but not much more. Early
versions of dynamic threshold struggled on this test.
Additional work could be done to intelligently ratchet up the urgency of
reclaim in very low unallocated conditions. Existing mechanisms are
already useless in that case anyway.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-02-02 19:52:16 +00:00
	data_chunk_size = calc_effective_data_chunk_size(fs_info);
2023-09-27 17:47:01 +00:00
	/*
	 * Since data allocations immediately use block groups as part of the
	 * reservation, because we assume that data reservations will == actual
	 * usage, we could potentially overcommit and then immediately have that
	 * available space used by a data allocation, which could put us in a
	 * bind when we get close to filling the file system.
	 *
	 * To handle this simply remove the data_chunk_size from the available
	 * space. If we are relatively empty this won't affect our ability to
	 * overcommit much, and if we're very close to full it'll keep us from
	 * getting into a position where we've given ourselves very little
	 * metadata wiggle room.
	 */
	if (avail <= data_chunk_size)
		return 0;
	avail -= data_chunk_size;
2019-06-18 20:09:20 +00:00
	/*
	 * If we aren't flushing all things, let us overcommit up to
	 * 1/2 of the space. If we can flush, don't let us overcommit
	 * too much, let it overcommit up to 1/8 of the space.
	 */
	if (flush == BTRFS_RESERVE_FLUSH_ALL)
		avail >>= 3;
	else
		avail >>= 1;
2024-06-20 06:05:45 +00:00
	/*
	 * On the zoned mode, we always allocate one zone as one chunk.
	 * Returning non-zone size aligned bytes here will result in
	 * less pressure for the async metadata reclaim process, and it
	 * will over-commit too much leading to ENOSPC. Align down to the
	 * zone size to avoid that.
	 */
	if (btrfs_is_zoned(fs_info))
		avail = ALIGN_DOWN(avail, fs_info->zone_size);
2020-02-21 21:41:10 +00:00
	return avail;
}

int btrfs_can_overcommit(struct btrfs_fs_info *fs_info,
2024-06-26 21:39:11 +00:00
			 const struct btrfs_space_info *space_info, u64 bytes,
2020-02-21 21:41:10 +00:00
			 enum btrfs_reserve_flush_enum flush)
{
	u64 avail;
	u64 used;

	/* Don't overcommit when in mixed mode */
	if (space_info->flags & BTRFS_BLOCK_GROUP_DATA)
		return 0;

	used = btrfs_space_info_used(space_info, true);
2023-08-07 16:12:40 +00:00
	avail = calc_available_free_space(fs_info, space_info, flush);
2019-06-18 20:09:20 +00:00
2023-03-13 07:06:14 +00:00
	if (used + bytes < space_info->total_bytes + avail)
2019-06-18 20:09:20 +00:00
		return 1;
	return 0;
}
2019-06-18 20:09:22 +00:00
2020-04-07 10:38:49 +00:00
static void remove_ticket(struct btrfs_space_info *space_info,
			  struct reserve_ticket *ticket)
{
	if (!list_empty(&ticket->list)) {
		list_del_init(&ticket->list);
		ASSERT(space_info->reclaim_size >= ticket->bytes);
		space_info->reclaim_size -= ticket->bytes;
	}
}
2019-06-18 20:09:22 +00:00
/*
 * This is for space we already have accounted in space_info->bytes_may_use, so
 * basically when we're returning space from block_rsv's.
 */
2019-08-22 19:10:58 +00:00
void btrfs_try_granting_tickets(struct btrfs_fs_info *fs_info,
				struct btrfs_space_info *space_info)
2019-06-18 20:09:22 +00:00
{
	struct list_head *head;
	enum btrfs_reserve_flush_enum flush = BTRFS_RESERVE_NO_FLUSH;
2019-08-22 19:10:58 +00:00
	lockdep_assert_held(&space_info->lock);
2019-06-18 20:09:22 +00:00
2019-08-22 19:10:58 +00:00
	head = &space_info->priority_tickets;
2019-06-18 20:09:22 +00:00
again:
btrfs: stop partially refilling tickets when releasing space
btrfs_space_info_add_old_bytes is used when adding the extra space from
an existing reservation back into the space_info to be used by any
waiting tickets. In order to keep us from overcommitting we check to
make sure that we can still use this space for our reserve ticket, and
if we cannot we'll simply subtract it from space_info->bytes_may_use.
However this is problematic, because it assumes that only changes to
bytes_may_use would affect our ability to make reservations. Any
changes to bytes_reserved would be missed. If we were unable to make a
reservation prior because of reserved space, but that reserved space was
freed due to unlink or truncate and we were allowed to immediately
reclaim that metadata space we would still ENOSPC.
Consider the example where we create a file with a bunch of extents,
using up 2MiB of actual space for the new tree blocks. Then we try to
make a reservation of 2MiB but we do not have enough space to make this
reservation. The iput() occurs in another thread and we remove this
space, and since we did not write the blocks we simply do
space_info->bytes_reserved -= 2MiB. We would never see this because we
do not check our space info used, we just try to re-use the freed
reservations.
To fix this problem, and to greatly simplify the wakeup code, do away
with this partial refilling nonsense. Use
btrfs_space_info_add_old_bytes to subtract the reservation from
space_info->bytes_may_use, and then check the ticket against the total
used of the space_info the same way we do with the initial reservation
attempt.
This keeps the reservation logic consistent and solves the problem of
early ENOSPC in the case that we free up space in places other than
bytes_may_use and bytes_pinned. Thanks,
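The idea of checking each waiting ticket against the total used space can be sketched in userspace C. This is a simplified model under stated assumptions: the toy_* names are hypothetical, tickets are a plain array instead of a list, and overcommit and the separate priority list are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, reduced stand-in for the space_info counters. */
struct toy_space_info {
	uint64_t total_bytes;
	uint64_t bytes_used;
	uint64_t bytes_reserved;
	uint64_t bytes_may_use;
};

static uint64_t toy_used(const struct toy_space_info *si)
{
	return si->bytes_used + si->bytes_reserved + si->bytes_may_use;
}

/*
 * Grant tickets in FIFO order and stop at the first one that does not
 * fit. Every check looks at the full used total, so space freed from
 * bytes_reserved is noticed as well. Returns the number granted.
 */
static size_t toy_grant_tickets(struct toy_space_info *si,
				uint64_t *tickets, size_t nr)
{
	size_t granted = 0;

	assert(si != NULL && tickets != NULL);

	for (size_t i = 0; i < nr; i++) {
		if (toy_used(si) + tickets[i] > si->total_bytes)
			break;
		si->bytes_may_use += tickets[i];
		tickets[i] = 0;	/* a zeroed ticket counts as satisfied */
		granted++;
	}
	return granted;
}
```

In the 2MiB example from the message, the ticket is refused while the 2MiB is still counted in bytes_reserved and is granted on the next pass after that amount has been subtracted.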
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-08-28 15:15:24 +00:00
	while (!list_empty(head)) {
		struct reserve_ticket *ticket;
		u64 used = btrfs_space_info_used(space_info, true);

		ticket = list_first_entry(head, struct reserve_ticket, list);
2021-05-21 15:42:23 +00:00
		/* Check and see if our ticket can be satisfied now. */
2023-03-13 07:06:14 +00:00
		if ((used + ticket->bytes <= space_info->total_bytes) ||
2020-01-17 14:07:39 +00:00
		    btrfs_can_overcommit(fs_info, space_info, ticket->bytes,
					 flush)) {
2019-08-28 15:15:24 +00:00
			btrfs_space_info_update_bytes_may_use(fs_info,
							      space_info,
							      ticket->bytes);
2020-04-07 10:38:49 +00:00
			remove_ticket(space_info, ticket);
2019-06-18 20:09:22 +00:00
			ticket->bytes = 0;
			space_info->tickets_id++;
			wake_up(&ticket->wait);
		} else {
2019-08-28 15:15:24 +00:00
			break;
2019-06-18 20:09:22 +00:00
		}
	}
2019-08-28 15:15:24 +00:00
	if (head == &space_info->priority_tickets) {
2019-06-18 20:09:22 +00:00
		head = &space_info->tickets;
		flush = BTRFS_RESERVE_FLUSH_ALL;
		goto again;
	}
}
2019-06-18 20:09:24 +00:00
#define DUMP_BLOCK_RSV(fs_info, rsv_name)				\
do {									\
	struct btrfs_block_rsv *__rsv = &(fs_info)->rsv_name;		\
	spin_lock(&__rsv->lock);					\
	btrfs_info(fs_info, #rsv_name ": size %llu reserved %llu",	\
		   __rsv->size, __rsv->reserved);			\
	spin_unlock(&__rsv->lock);					\
} while (0)
2022-08-25 07:09:09 +00:00
static const char *space_info_flag_to_str(const struct btrfs_space_info *space_info)
{
	switch (space_info->flags) {
	case BTRFS_BLOCK_GROUP_SYSTEM:
		return "SYSTEM";
	case BTRFS_BLOCK_GROUP_METADATA | BTRFS_BLOCK_GROUP_DATA:
		return "DATA+METADATA";
	case BTRFS_BLOCK_GROUP_DATA:
		return "DATA";
	case BTRFS_BLOCK_GROUP_METADATA:
		return "METADATA";
	default:
		return "UNKNOWN";
	}
}
btrfs: dump all space infos if we abort transaction due to ENOSPC
We have hit some transaction aborts due to -ENOSPC internally.
Normally we should always reserve enough space for metadata for every
transaction, thus hitting -ENOSPC should really indicate some cases we
didn't expect.
But unfortunately the current error reporting only gives a kernel
warning and stack trace, which does not really help to debug what is
causing the problem.
And the mount option enospc_debug can only help when the user can
reproduce the problem, but in most cases such a transaction abort by
-ENOSPC is really hard to reproduce.
So this patch will dump all space infos (data, metadata, system) when we
abort the first transaction with -ENOSPC.
This should at least provide some clue to us.
An example of a dump looks like this:
BTRFS: Transaction aborted (error -28)
WARNING: CPU: 8 PID: 3366 at fs/btrfs/transaction.c:2137 btrfs_commit_transaction+0xf81/0xfb0 [btrfs]
<call trace skipped>
---[ end trace 0000000000000000 ]---
BTRFS info (device dm-1: state A): dumping space info:
BTRFS info (device dm-1: state A): space_info DATA has 6791168 free, is not full
BTRFS info (device dm-1: state A): space_info total=8388608, used=1597440, pinned=0, reserved=0, may_use=0, readonly=0 zone_unusable=0
BTRFS info (device dm-1: state A): space_info METADATA has 257114112 free, is not full
BTRFS info (device dm-1: state A): space_info total=268435456, used=131072, pinned=180224, reserved=65536, may_use=10878976, readonly=65536 zone_unusable=0
BTRFS info (device dm-1: state A): space_info SYSTEM has 8372224 free, is not full
BTRFS info (device dm-1: state A): space_info total=8388608, used=16384, pinned=0, reserved=0, may_use=0, readonly=0 zone_unusable=0
BTRFS info (device dm-1: state A): global_block_rsv: size 3670016 reserved 3670016
BTRFS info (device dm-1: state A): trans_block_rsv: size 0 reserved 0
BTRFS info (device dm-1: state A): chunk_block_rsv: size 0 reserved 0
BTRFS info (device dm-1: state A): delayed_block_rsv: size 4063232 reserved 4063232
BTRFS info (device dm-1: state A): delayed_refs_rsv: size 3145728 reserved 3145728
BTRFS: error (device dm-1: state A) in btrfs_commit_transaction:2137: errno=-28 No space left
BTRFS info (device dm-1: state EA): forced readonly
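The "free" figure in each space_info line can be reproduced from the counters printed on the following line. The sketch below assumes that free is total minus the sum of all six counters, which matches the numbers in the example dump; the toy_* names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mirror of the counters printed in the dump. */
struct toy_space_dump {
	uint64_t total, used, pinned, reserved, may_use, readonly, zone_unusable;
};

/* Signed result: may_use can exceed total when metadata is overcommitted. */
static int64_t toy_free_bytes(const struct toy_space_dump *d)
{
	uint64_t in_use;

	assert(d != NULL);
	in_use = d->used + d->pinned + d->reserved + d->may_use +
		 d->readonly + d->zone_unusable;
	return (int64_t)(d->total - in_use);
}
```

For the METADATA line above this gives 268435456 - 11321344 = 257114112, the value shown in the dump.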
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-08-25 07:09:10 +00:00
static void dump_global_block_rsv(struct btrfs_fs_info *fs_info)
{
	DUMP_BLOCK_RSV(fs_info, global_block_rsv);
	DUMP_BLOCK_RSV(fs_info, trans_block_rsv);
	DUMP_BLOCK_RSV(fs_info, chunk_block_rsv);
	DUMP_BLOCK_RSV(fs_info, delayed_block_rsv);
	DUMP_BLOCK_RSV(fs_info, delayed_refs_rsv);
}
2024-06-26 21:39:11 +00:00
static void __btrfs_dump_space_info(const struct btrfs_fs_info *fs_info,
				    const struct btrfs_space_info *info)
2019-06-18 20:09:24 +00:00
{
2022-08-25 07:09:09 +00:00
	const char *flag_str = space_info_flag_to_str(info);
2019-08-22 19:19:04 +00:00
	lockdep_assert_held(&info->lock);
2019-06-18 20:09:24 +00:00
2021-09-16 12:43:29 +00:00
	/* The free space could be negative in case of overcommit */
2022-08-25 07:09:09 +00:00
	btrfs_info(fs_info, "space_info %s has %lld free, is %sfull",
		   flag_str,
2021-09-16 12:43:29 +00:00
		   (s64)(info->total_bytes - btrfs_space_info_used(info, true)),
2019-06-18 20:09:24 +00:00
		   info->full ? "" : "not ");
	btrfs_info(fs_info,
2022-08-25 07:09:09 +00:00
"space_info total=%llu, used=%llu, pinned=%llu, reserved=%llu, may_use=%llu, readonly=%llu zone_unusable=%llu",
2019-06-18 20:09:24 +00:00
		info->total_bytes, info->bytes_used, info->bytes_pinned,
		info->bytes_reserved, info->bytes_may_use,
2021-02-04 10:21:52 +00:00
		info->bytes_readonly, info->bytes_zone_unusable);
2019-08-22 19:19:04 +00:00
}

void btrfs_dump_space_info(struct btrfs_fs_info *fs_info,
			   struct btrfs_space_info *info, u64 bytes,
			   int dump_block_groups)
{
2019-10-29 18:20:18 +00:00
	struct btrfs_block_group *cache;
2023-07-26 15:57:02 +00:00
	u64 total_avail = 0;
2019-08-22 19:19:04 +00:00
	int index = 0;

	spin_lock(&info->lock);
	__btrfs_dump_space_info(fs_info, info);
2022-08-25 07:09:10 +00:00
	dump_global_block_rsv(fs_info);
2019-08-22 19:19:04 +00:00
	spin_unlock(&info->lock);
2019-06-18 20:09:24 +00:00
	if (!dump_block_groups)
		return;

	down_read(&info->groups_sem);
again:
	list_for_each_entry(cache, &info->block_groups[index], list) {
2023-07-26 15:57:01 +00:00
		u64 avail;
2019-06-18 20:09:24 +00:00
		spin_lock(&cache->lock);
2023-07-26 15:57:01 +00:00
		avail = cache->length - cache->used - cache->pinned -
2024-07-11 14:50:58 +00:00
			cache->reserved - cache->bytes_super - cache->zone_unusable;
2019-06-18 20:09:24 +00:00
		btrfs_info(fs_info,
2023-07-26 15:57:01 +00:00
"block group %llu has %llu bytes, %llu used %llu pinned %llu reserved %llu delalloc %llu super %llu zone_unusable (%llu bytes available) %s",
2023-07-26 15:57:00 +00:00
			   cache->start, cache->length, cache->used, cache->pinned,
			   cache->reserved, cache->delalloc_bytes,
			   cache->bytes_super, cache->zone_unusable,
2023-07-26 15:57:01 +00:00
			   avail, cache->ro ? "[readonly]" : "");
2019-06-18 20:09:24 +00:00
		spin_unlock(&cache->lock);
btrfs: fix lockdep splat from btrfs_dump_space_info
When running with -o enospc_debug you can get the following splat if one
of the dump_space_info's trip
======================================================
WARNING: possible circular locking dependency detected
5.8.0-rc5+ #20 Tainted: G OE
------------------------------------------------------
dd/563090 is trying to acquire lock:
ffff9e7dbf4f1e18 (&ctl->tree_lock){+.+.}-{2:2}, at: btrfs_dump_free_space+0x2b/0xa0 [btrfs]
but task is already holding lock:
ffff9e7e2284d428 (&cache->lock){+.+.}-{2:2}, at: btrfs_dump_space_info+0xaa/0x120 [btrfs]
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #3 (&cache->lock){+.+.}-{2:2}:
_raw_spin_lock+0x25/0x30
btrfs_add_reserved_bytes+0x3c/0x3c0 [btrfs]
find_free_extent+0x7ef/0x13b0 [btrfs]
btrfs_reserve_extent+0x9b/0x180 [btrfs]
btrfs_alloc_tree_block+0xc1/0x340 [btrfs]
alloc_tree_block_no_bg_flush+0x4a/0x60 [btrfs]
__btrfs_cow_block+0x122/0x530 [btrfs]
btrfs_cow_block+0x106/0x210 [btrfs]
commit_cowonly_roots+0x55/0x300 [btrfs]
btrfs_commit_transaction+0x4ed/0xac0 [btrfs]
sync_filesystem+0x74/0x90
generic_shutdown_super+0x22/0x100
kill_anon_super+0x14/0x30
btrfs_kill_super+0x12/0x20 [btrfs]
deactivate_locked_super+0x36/0x70
cleanup_mnt+0x104/0x160
task_work_run+0x5f/0x90
__prepare_exit_to_usermode+0x1bd/0x1c0
do_syscall_64+0x5e/0xb0
entry_SYSCALL_64_after_hwframe+0x44/0xa9
-> #2 (&space_info->lock){+.+.}-{2:2}:
_raw_spin_lock+0x25/0x30
btrfs_block_rsv_release+0x1a6/0x3f0 [btrfs]
btrfs_inode_rsv_release+0x4f/0x170 [btrfs]
btrfs_clear_delalloc_extent+0x155/0x480 [btrfs]
clear_state_bit+0x81/0x1a0 [btrfs]
__clear_extent_bit+0x25c/0x5d0 [btrfs]
clear_extent_bit+0x15/0x20 [btrfs]
btrfs_invalidatepage+0x2b7/0x3c0 [btrfs]
truncate_cleanup_page+0x47/0xe0
truncate_inode_pages_range+0x238/0x840
truncate_pagecache+0x44/0x60
btrfs_setattr+0x202/0x5e0 [btrfs]
notify_change+0x33b/0x490
do_truncate+0x76/0xd0
path_openat+0x687/0xa10
do_filp_open+0x91/0x100
do_sys_openat2+0x215/0x2d0
do_sys_open+0x44/0x80
do_syscall_64+0x52/0xb0
entry_SYSCALL_64_after_hwframe+0x44/0xa9
-> #1 (&tree->lock#2){+.+.}-{2:2}:
_raw_spin_lock+0x25/0x30
find_first_extent_bit+0x32/0x150 [btrfs]
write_pinned_extent_entries.isra.0+0xc5/0x100 [btrfs]
__btrfs_write_out_cache+0x172/0x480 [btrfs]
btrfs_write_out_cache+0x7a/0xf0 [btrfs]
btrfs_write_dirty_block_groups+0x286/0x3b0 [btrfs]
commit_cowonly_roots+0x245/0x300 [btrfs]
btrfs_commit_transaction+0x4ed/0xac0 [btrfs]
close_ctree+0xf9/0x2f5 [btrfs]
generic_shutdown_super+0x6c/0x100
kill_anon_super+0x14/0x30
btrfs_kill_super+0x12/0x20 [btrfs]
deactivate_locked_super+0x36/0x70
cleanup_mnt+0x104/0x160
task_work_run+0x5f/0x90
__prepare_exit_to_usermode+0x1bd/0x1c0
do_syscall_64+0x5e/0xb0
entry_SYSCALL_64_after_hwframe+0x44/0xa9
-> #0 (&ctl->tree_lock){+.+.}-{2:2}:
__lock_acquire+0x1240/0x2460
lock_acquire+0xab/0x360
_raw_spin_lock+0x25/0x30
btrfs_dump_free_space+0x2b/0xa0 [btrfs]
btrfs_dump_space_info+0xf4/0x120 [btrfs]
btrfs_reserve_extent+0x176/0x180 [btrfs]
__btrfs_prealloc_file_range+0x145/0x550 [btrfs]
cache_save_setup+0x28d/0x3b0 [btrfs]
btrfs_start_dirty_block_groups+0x1fc/0x4f0 [btrfs]
btrfs_commit_transaction+0xcc/0xac0 [btrfs]
btrfs_alloc_data_chunk_ondemand+0x162/0x4c0 [btrfs]
btrfs_check_data_free_space+0x4c/0xa0 [btrfs]
btrfs_buffered_write.isra.0+0x19b/0x740 [btrfs]
btrfs_file_write_iter+0x3cf/0x610 [btrfs]
new_sync_write+0x11e/0x1b0
vfs_write+0x1c9/0x200
ksys_write+0x68/0xe0
do_syscall_64+0x52/0xb0
entry_SYSCALL_64_after_hwframe+0x44/0xa9
other info that might help us debug this:
Chain exists of:
&ctl->tree_lock --> &space_info->lock --> &cache->lock
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(&cache->lock);
lock(&space_info->lock);
lock(&cache->lock);
lock(&ctl->tree_lock);
*** DEADLOCK ***
6 locks held by dd/563090:
#0: ffff9e7e21d18448 (sb_writers#14){.+.+}-{0:0}, at: vfs_write+0x195/0x200
#1: ffff9e7dd0410ed8 (&sb->s_type->i_mutex_key#19){++++}-{3:3}, at: btrfs_file_write_iter+0x86/0x610 [btrfs]
#2: ffff9e7e21d18638 (sb_internal#2){.+.+}-{0:0}, at: start_transaction+0x40b/0x5b0 [btrfs]
#3: ffff9e7e1f05d688 (&cur_trans->cache_write_mutex){+.+.}-{3:3}, at: btrfs_start_dirty_block_groups+0x158/0x4f0 [btrfs]
#4: ffff9e7e2284ddb8 (&space_info->groups_sem){++++}-{3:3}, at: btrfs_dump_space_info+0x69/0x120 [btrfs]
#5: ffff9e7e2284d428 (&cache->lock){+.+.}-{2:2}, at: btrfs_dump_space_info+0xaa/0x120 [btrfs]
stack backtrace:
CPU: 3 PID: 563090 Comm: dd Tainted: G OE 5.8.0-rc5+ #20
Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./890FX Deluxe5, BIOS P1.40 05/03/2011
Call Trace:
dump_stack+0x96/0xd0
check_noncircular+0x162/0x180
__lock_acquire+0x1240/0x2460
? wake_up_klogd.part.0+0x30/0x40
lock_acquire+0xab/0x360
? btrfs_dump_free_space+0x2b/0xa0 [btrfs]
_raw_spin_lock+0x25/0x30
? btrfs_dump_free_space+0x2b/0xa0 [btrfs]
btrfs_dump_free_space+0x2b/0xa0 [btrfs]
btrfs_dump_space_info+0xf4/0x120 [btrfs]
btrfs_reserve_extent+0x176/0x180 [btrfs]
__btrfs_prealloc_file_range+0x145/0x550 [btrfs]
? btrfs_qgroup_reserve_data+0x1d/0x60 [btrfs]
cache_save_setup+0x28d/0x3b0 [btrfs]
btrfs_start_dirty_block_groups+0x1fc/0x4f0 [btrfs]
btrfs_commit_transaction+0xcc/0xac0 [btrfs]
? start_transaction+0xe0/0x5b0 [btrfs]
btrfs_alloc_data_chunk_ondemand+0x162/0x4c0 [btrfs]
btrfs_check_data_free_space+0x4c/0xa0 [btrfs]
btrfs_buffered_write.isra.0+0x19b/0x740 [btrfs]
? ktime_get_coarse_real_ts64+0xa8/0xd0
? trace_hardirqs_on+0x1c/0xe0
btrfs_file_write_iter+0x3cf/0x610 [btrfs]
new_sync_write+0x11e/0x1b0
vfs_write+0x1c9/0x200
ksys_write+0x68/0xe0
do_syscall_64+0x52/0xb0
entry_SYSCALL_64_after_hwframe+0x44/0xa9
This is because we're holding the block_group->lock while trying to dump
the free space cache. However, the dump itself does not need this lock;
it is only needed to read the values for the printk, so move the free
space cache dumping
outside of the block group lock.
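The pattern behind the fix is to read the values under the lock, drop it, and only then call the helper that takes a different lock. The sketch below models this with toy locks that only track whether they are held. All toy_* names are hypothetical, and the locks stand in for cache->lock and ctl->tree_lock without any real concurrency.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy lock: records only whether it is currently held. */
struct toy_lock { bool held; };

static void toy_lock_acquire(struct toy_lock *l) { assert(!l->held); l->held = true; }
static void toy_lock_release(struct toy_lock *l) { assert(l->held); l->held = false; }

struct toy_group {
	struct toy_lock lock;		/* stands in for cache->lock */
	struct toy_lock tree_lock;	/* stands in for ctl->tree_lock */
	uint64_t length;
	uint64_t used;
};

/* Takes tree_lock, so it must not be called with the group lock held. */
static void toy_dump_free_space(struct toy_group *g)
{
	assert(!g->lock.held);
	toy_lock_acquire(&g->tree_lock);
	/* ... walk the free space entries here ... */
	toy_lock_release(&g->tree_lock);
}

static uint64_t toy_dump_group(struct toy_group *g)
{
	uint64_t avail;

	toy_lock_acquire(&g->lock);
	avail = g->length - g->used;	/* read the values under the lock */
	toy_lock_release(&g->lock);

	toy_dump_free_space(g);		/* second lock taken only after release */
	return avail;
}
```

Because the two locks are never held at the same time on this path, it cannot contribute to the circular dependency reported in the splat.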
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-07-17 19:12:29 +00:00
		btrfs_dump_free_space(cache, bytes);
2023-07-26 15:57:02 +00:00
		total_avail += avail;
2019-06-18 20:09:24 +00:00
	}
	if (++index < BTRFS_NR_RAID_TYPES)
		goto again;
	up_read(&info->groups_sem);
2023-07-26 15:57:02 +00:00
	btrfs_info(fs_info, "%llu bytes available across all block groups", total_avail);
2019-06-18 20:09:24 +00:00
}
2019-06-18 20:09:25 +00:00
2023-03-21 11:13:54 +00:00
static inline u64 calc_reclaim_items_nr(const struct btrfs_fs_info *fs_info,
2019-06-18 20:09:25 +00:00
					u64 to_reclaim)
{
	u64 bytes;
	u64 nr;

	bytes = btrfs_calc_insert_metadata_size(fs_info, 1);
	nr = div64_u64(to_reclaim, bytes);
	if (!nr)
		nr = 1;
	return nr;
}

/*
 * shrink metadata reservation for delalloc
 */
2020-07-21 14:22:15 +00:00
static void shrink_delalloc(struct btrfs_fs_info *fs_info,
			    struct btrfs_space_info *space_info,
2021-04-28 17:38:48 +00:00
|
|
|
u64 to_reclaim, bool wait_ordered,
|
|
|
|
bool for_preempt)
|
2019-06-18 20:09:25 +00:00
|
|
|
{
|
|
|
|
struct btrfs_trans_handle *trans;
|
|
|
|
u64 delalloc_bytes;
|
2020-10-09 13:28:20 +00:00
|
|
|
u64 ordered_bytes;
|
2019-06-18 20:09:25 +00:00
|
|
|
u64 items;
|
|
|
|
long time_left;
|
|
|
|
int loops;
|
|
|
|
|
btrfs: use delalloc_bytes to determine flush amount for shrink_delalloc
We have been hitting some early ENOSPC issues in production with more
recent kernels, and I tracked it down to us simply not flushing delalloc
as aggressively as we should be. With tracing I was seeing us failing
all tickets with all of the block rsvs at or around 0, with very little
pinned space, but still around 120MiB of outstanding bytes_may_use.
Upon further investigation I saw that we were flushing around 14 pages
per shrink call for delalloc, despite having around 2GiB of delalloc
outstanding.
Consider the example of an 8-way machine, all CPUs trying to create a
file in parallel, which at the time of this commit requires 5 items to
do. Assuming a 16k leaf size, we have 10MiB of total metadata reclaim
size waiting on reservations. Now assume we have 128MiB of delalloc
outstanding. With our current math we would set items to 20, and then
set to_reclaim to 20 * 256k, or 5MiB.
Assuming that we went through this loop all 3 times, for both
FLUSH_DELALLOC and FLUSH_DELALLOC_WAIT, and then did the full loop
twice, we'd only flush 60MiB of the 128MiB delalloc space. This could
leave a fair bit of delalloc reservations still hanging around by the
time we go to ENOSPC out all the remaining tickets.
Fix this in two ways. First, change the calculations to be a fraction of
the total delalloc bytes on the system. Prior to this change we were
calculating based on dirty inodes so our math made more sense, now it's
just completely unrelated to what we're actually doing.
Second, add a FLUSH_DELALLOC_FULL state that we hold off on until we've
gone through the flush states at least once. This will empty the system
of all delalloc so we're sure to be truly out of space when we start
failing tickets.
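The arithmetic in the example above can be checked with a small userspace sketch. All function and macro names here are hypothetical. The 1/8 fraction is an assumption taken from the `delalloc_bytes >> 3` expression in shrink_delalloc(); the commit message itself only says "a fraction".

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_KIB(x) ((uint64_t)(x) << 10)

/* Old math from the example: each item is worth a fixed 256KiB of flushing. */
static uint64_t old_flush_per_call(uint64_t items)
{
	return items * SKETCH_KIB(256);
}

/* Upper bound on bytes flushed when every loop iteration flushes per_call. */
static uint64_t total_flushed(uint64_t per_call, unsigned int loops,
			      unsigned int states, unsigned int passes)
{
	return per_call * loops * states * passes;
}

/* New math: never target less than 1/8 of the outstanding delalloc. */
static uint64_t new_flush_target(uint64_t to_reclaim, uint64_t delalloc_bytes)
{
	uint64_t fraction = delalloc_bytes >> 3;

	assert(fraction <= delalloc_bytes);	/* a right shift can only shrink */
	return to_reclaim > fraction ? to_reclaim : fraction;
}
```

Under the old math, 20 items give 5MiB per call, and 3 loops over 2 states for 2 full passes give the 60MiB quoted above. With 10MiB to reclaim and 128MiB of delalloc outstanding, the new target per call is 16MiB.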
I'm tagging stable 5.10 and forward, because this is where we started
using the page stuff heavily again. This affects earlier kernel
versions as well, but would be a pain to backport to them as the
flushing mechanisms aren't the same.
CC: stable@vger.kernel.org # 5.10+
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-07-14 18:47:20 +00:00
|
|
|
delalloc_bytes = percpu_counter_sum_positive(&fs_info->delalloc_bytes);
|
|
|
|
ordered_bytes = percpu_counter_sum_positive(&fs_info->ordered_bytes);
|
|
|
|
if (delalloc_bytes == 0 && ordered_bytes == 0)
|
|
|
|
return;
|
|
|
|
|
2019-06-18 20:09:25 +00:00
|
|
|
/* Calculate the number of pages we need to flush for space reservation */
|
2020-07-21 14:22:14 +00:00
|
|
|
if (to_reclaim == U64_MAX) {
|
|
|
|
items = U64_MAX;
|
|
|
|
} else {
|
|
|
|
/*
|
|
|
|
* to_reclaim is set to however much metadata we need to
|
|
|
|
* reclaim, but reclaiming that much data doesn't really track
|
2021-07-14 18:47:20 +00:00
|
|
|
* exactly. What we really want to do is reclaim a full inode's
|
|
|
|
* worth of reservations, however that's not available to us
|
|
|
|
* here. We will take a fraction of the delalloc bytes for our
|
|
|
|
* flushing loops and hope for the best. Delalloc will expand
|
|
|
|
* the amount we write to cover an entire dirty extent, which
|
|
|
|
* will reclaim the metadata reservation for that range. If
|
|
|
|
* it's not enough subsequent flush stages will be more
|
|
|
|
* aggressive.
|
2020-07-21 14:22:14 +00:00
|
|
|
*/
|
2021-07-14 18:47:20 +00:00
|
|
|
to_reclaim = max(to_reclaim, delalloc_bytes >> 3);
|
2020-07-21 14:22:14 +00:00
|
|
|
items = calc_reclaim_items_nr(fs_info, to_reclaim) * 2;
|
|
|
|
}
|
2019-06-18 20:09:25 +00:00
|
|
|
|
2022-03-31 10:34:08 +00:00
|
|
|
trans = current->journal_info;
|
2019-06-18 20:09:25 +00:00
|
|
|
|
|
|
|
/*
|
|
|
|
* If we are doing more ordered than delalloc we need to just wait on
|
|
|
|
* ordered extents, otherwise we'll waste time trying to flush delalloc
|
|
|
|
* that likely won't give us the space back we need.
|
|
|
|
*/
|
2021-04-28 17:38:48 +00:00
|
|
|
if (ordered_bytes > delalloc_bytes && !for_preempt)
|
2019-06-18 20:09:25 +00:00
|
|
|
wait_ordered = true;
|
|
|
|
|
|
|
|
loops = 0;
|
2020-10-09 13:28:20 +00:00
|
|
|
while ((delalloc_bytes || ordered_bytes) && loops < 3) {
|
2021-01-11 10:58:11 +00:00
|
|
|
u64 temp = min(delalloc_bytes, to_reclaim) >> PAGE_SHIFT;
|
|
|
|
long nr_pages = min_t(u64, temp, LONG_MAX);
|
2021-07-14 18:47:21 +00:00
|
|
|
int async_pages;
|
btrfs: shrink delalloc pages instead of full inodes
Commit 38d715f494f2 ("btrfs: use btrfs_start_delalloc_roots in
shrink_delalloc") cleaned up how we do delalloc shrinking by utilizing
some infrastructure we have in place to flush inodes that we use for
device replace and snapshot. However this introduced a pretty serious
performance regression. To reproduce, the user untarred the source
tarball of Firefox (360MiB xz compressed/1.5GiB uncompressed), and would
see it take anywhere from 5 to 20 times as long to untar in 5.10
compared to 5.9. This was observed on fast devices (SSD and better) and
not on HDD.
The root cause is that previously we would generally use the normal
writeback path to reclaim delalloc space, and for this we would provide
it with the number of pages we wanted to flush. The referenced commit
changed this to flush that many inodes, which drastically increased the
amount of space we were flushing in certain cases, which severely
affected performance.
We cannot revert this patch unfortunately because of 3d45f221ce62
("btrfs: fix deadlock when cloning inline extent and low on free
metadata space") which requires the ability to skip flushing inodes that
are being cloned in certain scenarios, which means we need to keep using
our flushing infrastructure or risk re-introducing the deadlock.
Instead to fix this problem we can go back to providing
btrfs_start_delalloc_roots with a number of pages to flush, and then set
up a writeback_control and utilize sync_inode() to handle the flushing
for us. This gives us the same behavior we had prior to the fix, while
still allowing us to avoid the deadlock that was fixed by Filipe. I
redid the user's original test and got the following results on one of
our test machines (256GiB of ram, 56 cores, 2TiB Intel NVMe drive)
5.9 0m54.258s
5.10 1m26.212s
5.10+patch 0m38.800s
5.10+patch is significantly faster than plain 5.9 because of my patch
series "Change data reservations to use the ticketing infra" which
contained the patch that introduced the regression, but generally
improved the overall ENOSPC flushing mechanisms.
Additional testing on a consumer-grade SSD (8GiB RAM, 8 CPUs) confirms
the results:
5.10.5 4m00s
5.10.5+patch 1m08s
5.11-rc2 5m14s
5.11-rc2+patch 1m30s
Reported-by: René Rebe <rene@exactcode.de>
Fixes: 38d715f494f2 ("btrfs: use btrfs_start_delalloc_roots in shrink_delalloc")
CC: stable@vger.kernel.org # 5.10
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Tested-by: David Sterba <dsterba@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add my test results ]
Signed-off-by: David Sterba <dsterba@suse.com>
2021-01-07 22:08:30 +00:00
|
|
|
|
|
|
|
btrfs_start_delalloc_roots(fs_info, nr_pages, true);
|
2019-06-18 20:09:25 +00:00
|
|
|
|
2021-07-14 18:47:21 +00:00
|
|
|
/*
|
|
|
|
* We need to make sure any outstanding async pages are now
|
|
|
|
* processed before we continue. This is because things like
|
|
|
|
* sync_inode() try to be smart and skip writing if the inode is
|
|
|
|
* marked clean. We don't use filemap_fwrite for flushing
|
|
|
|
* because we want to control how many pages we write out at a
|
|
|
|
* time, thus this is the only safe way to make sure we've
|
|
|
|
* waited for outstanding compressed workers to have started
|
|
|
|
* their jobs and thus have ordered extents set up properly.
|
|
|
|
*
|
|
|
|
* This exists because we do not want to wait for each
|
|
|
|
* individual inode to finish its async work, we simply want to
|
|
|
|
* start the IO on everybody, and then come back here and wait
|
|
|
|
* for all of the async work to catch up. Once we're done with
|
|
|
|
* that we know we'll have ordered extents for everything and we
|
|
|
|
* can decide if we wait for that or not.
|
|
|
|
*
|
|
|
|
* If we choose to replace this in the future, make absolutely
|
|
|
|
* sure that the proper waiting is being done in the async case,
|
|
|
|
* as there have been bugs in that area before.
|
|
|
|
*/
|
|
|
|
async_pages = atomic_read(&fs_info->async_delalloc_pages);
|
|
|
|
if (!async_pages)
|
|
|
|
goto skip_async;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* We don't want to wait forever, if we wrote fewer pages in this
|
|
|
|
* loop than we have outstanding, only wait for that number of
|
|
|
|
* pages, otherwise we can wait for all async pages to finish
|
|
|
|
* before continuing.
|
|
|
|
*/
|
|
|
|
if (async_pages > nr_pages)
|
|
|
|
async_pages -= nr_pages;
|
|
|
|
else
|
|
|
|
async_pages = 0;
|
|
|
|
wait_event(fs_info->async_submit_wait,
|
|
|
|
atomic_read(&fs_info->async_delalloc_pages) <=
|
|
|
|
async_pages);
|
|
|
|
skip_async:
|
2019-06-18 20:09:25 +00:00
|
|
|
loops++;
|
|
|
|
if (wait_ordered && !trans) {
|
2024-05-14 14:48:12 +00:00
|
|
|
btrfs_wait_ordered_roots(fs_info, items, NULL);
|
2019-06-18 20:09:25 +00:00
|
|
|
} else {
|
|
|
|
time_left = schedule_timeout_killable(1);
|
|
|
|
if (time_left)
|
|
|
|
break;
|
|
|
|
}
|
2020-07-21 14:22:22 +00:00
|
|
|
|
2021-04-28 17:38:48 +00:00
|
|
|
/*
|
|
|
|
* If we are for preemption we just want a one-shot of delalloc
|
|
|
|
* flushing so we can stop flushing if we decide we don't need
|
|
|
|
* to anymore.
|
|
|
|
*/
|
|
|
|
if (for_preempt)
|
|
|
|
break;
|
|
|
|
|
2020-07-21 14:22:22 +00:00
|
|
|
spin_lock(&space_info->lock);
|
|
|
|
if (list_empty(&space_info->tickets) &&
|
|
|
|
list_empty(&space_info->priority_tickets)) {
|
|
|
|
spin_unlock(&space_info->lock);
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
spin_unlock(&space_info->lock);
|
|
|
|
|
2019-06-18 20:09:25 +00:00
|
|
|
delalloc_bytes = percpu_counter_sum_positive(
|
|
|
|
&fs_info->delalloc_bytes);
|
2020-10-09 13:28:20 +00:00
|
|
|
ordered_bytes = percpu_counter_sum_positive(
|
|
|
|
&fs_info->ordered_bytes);
|
2019-06-18 20:09:25 +00:00
|
|
|
}
|
|
|
|
}
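Two calculations inside the shrink_delalloc() loop are easy to get wrong: how many pages to flush per iteration, and how far async_delalloc_pages must drop before the loop continues. The sketch below reproduces both in userspace. The helper names are hypothetical, and `SKETCH_PAGE_SHIFT` assumes 4KiB pages, which is not guaranteed on every architecture.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12	/* assumption: 4KiB pages */

/* Pages to flush this iteration: min(delalloc, to_reclaim), capped at LONG_MAX. */
static long pages_to_flush(uint64_t delalloc_bytes, uint64_t to_reclaim)
{
	uint64_t bytes = delalloc_bytes < to_reclaim ? delalloc_bytes : to_reclaim;
	uint64_t pages = bytes >> SKETCH_PAGE_SHIFT;

	return pages > (uint64_t)LONG_MAX ? LONG_MAX : (long)pages;
}

/*
 * Level that the outstanding async page count has to fall to before the
 * loop moves on: wait only for the pages written in this iteration when
 * more than that are outstanding, otherwise wait for all of them.
 */
static long async_wait_target(long async_pages, long nr_pages)
{
	assert(async_pages >= 0 && nr_pages >= 0);
	return async_pages > nr_pages ? async_pages - nr_pages : 0;
}
```

For example, with 100 async pages outstanding and 30 pages submitted in this iteration, the wait ends once the counter reaches 70. With only 10 outstanding, it waits until the counter reaches 0.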
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Try to flush some data based on policy set by @state. This is only advisory
|
|
|
|
* and may fail for various reasons. The caller is supposed to examine the
|
|
|
|
* state of @space_info to detect the outcome.
|
|
|
|
*/
|
|
|
|
static void flush_space(struct btrfs_fs_info *fs_info,
|
|
|
|
struct btrfs_space_info *space_info, u64 num_bytes,
|
2020-10-09 13:28:28 +00:00
|
|
|
enum btrfs_flush_state state, bool for_preempt)
|
2019-06-18 20:09:25 +00:00
|
|
|
{
|
2021-11-05 20:45:43 +00:00
|
|
|
struct btrfs_root *root = fs_info->tree_root;
|
2019-06-18 20:09:25 +00:00
|
|
|
struct btrfs_trans_handle *trans;
|
|
|
|
int nr;
|
|
|
|
int ret = 0;
|
|
|
|
|
|
|
|
switch (state) {
|
|
|
|
case FLUSH_DELAYED_ITEMS_NR:
|
|
|
|
case FLUSH_DELAYED_ITEMS:
|
|
|
|
if (state == FLUSH_DELAYED_ITEMS_NR)
|
|
|
|
nr = calc_reclaim_items_nr(fs_info, num_bytes) * 2;
|
|
|
|
else
|
|
|
|
nr = -1;
|
|
|
|
|
2023-07-26 15:57:10 +00:00
|
|
|
trans = btrfs_join_transaction_nostart(root);
|
2019-06-18 20:09:25 +00:00
|
|
|
if (IS_ERR(trans)) {
|
|
|
|
ret = PTR_ERR(trans);
|
2023-07-26 15:57:10 +00:00
|
|
|
if (ret == -ENOENT)
|
|
|
|
ret = 0;
|
2019-06-18 20:09:25 +00:00
|
|
|
break;
|
|
|
|
}
|
|
|
|
ret = btrfs_run_delayed_items_nr(trans, nr);
|
|
|
|
btrfs_end_transaction(trans);
|
|
|
|
break;
|
|
|
|
case FLUSH_DELALLOC:
|
|
|
|
case FLUSH_DELALLOC_WAIT:
|
2021-07-14 18:47:20 +00:00
|
|
|
case FLUSH_DELALLOC_FULL:
|
|
|
|
if (state == FLUSH_DELALLOC_FULL)
|
|
|
|
num_bytes = U64_MAX;
|
2020-07-21 14:22:15 +00:00
|
|
|
shrink_delalloc(fs_info, space_info, num_bytes,
|
2021-07-14 18:47:20 +00:00
|
|
|
state != FLUSH_DELALLOC, for_preempt);
|
2019-06-18 20:09:25 +00:00
|
|
|
break;
|
|
|
|
case FLUSH_DELAYED_REFS_NR:
|
|
|
|
case FLUSH_DELAYED_REFS:
|
2023-07-26 15:57:10 +00:00
|
|
|
trans = btrfs_join_transaction_nostart(root);
|
2019-06-18 20:09:25 +00:00
|
|
|
if (IS_ERR(trans)) {
|
|
|
|
ret = PTR_ERR(trans);
|
2023-07-26 15:57:10 +00:00
|
|
|
if (ret == -ENOENT)
|
|
|
|
ret = 0;
|
2019-06-18 20:09:25 +00:00
|
|
|
break;
|
|
|
|
}
|
|
|
|
if (state == FLUSH_DELAYED_REFS_NR)
|
btrfs: allow to run delayed refs by bytes to be released instead of count
When running delayed references, through btrfs_run_delayed_refs(), we can
specify how many to run, run all existing delayed references, or keep
running delayed references while we can find any. This is controlled with
the value of the 'count' argument, where a value of 0 means to run all
delayed references that exist by the time btrfs_run_delayed_refs() is
called, (unsigned long)-1 means to keep running delayed references while
we are able to find any, and any other value means to run that exact
number of delayed references.
Typically a specific value other than 0 or -1 is used when flushing space
to try to release a certain amount of bytes for a ticket. In this case
we just simply calculate how many delayed reference heads correspond to a
specific amount of bytes, with calc_delayed_refs_nr(). However that only
takes into account the space reserved for the reference heads themselves,
and does not account for the space reserved for deleting checksums from
the csum tree (see add_delayed_ref_head() and update_existing_head_ref())
in case we are going to delete a data extent. This means we may end up
running more delayed references than necessary in case we process delayed
references for deleting a data extent.
So change the logic of btrfs_run_delayed_refs() to take a bytes argument
to specify how many bytes of delayed references to run/release, using the
special values of 0 to mean all existing delayed references and U64_MAX
(or (u64)-1) to keep running delayed references while we can find any.
This prevents running more delayed references than necessary, when we have
delayed references for deleting data extents, but also makes the upcoming
changes/patches simpler and it's preparatory work for them.
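The three meanings of the new bytes argument can be summarised in a small sketch. The enum and the `classify_run_bytes` helper are hypothetical and exist only to illustrate the special values described above. They are not part of the btrfs API.

```c
#include <assert.h>
#include <stdint.h>

/* The sentinel below relies on the argument being a 64-bit type. */
static_assert(sizeof(uint64_t) == 8, "bytes argument must be 64 bits wide");

/* Hypothetical labels for the three behaviours of the bytes argument. */
enum sketch_run_mode {
	SKETCH_RUN_ALL_EXISTING,	/* bytes == 0 */
	SKETCH_RUN_WHILE_ANY_FOUND,	/* bytes == U64_MAX, i.e. (u64)-1 */
	SKETCH_RUN_UP_TO_BYTES,		/* any other value */
};

static enum sketch_run_mode classify_run_bytes(uint64_t bytes)
{
	if (bytes == 0)
		return SKETCH_RUN_ALL_EXISTING;
	if (bytes == UINT64_MAX)
		return SKETCH_RUN_WHILE_ANY_FOUND;
	return SKETCH_RUN_UP_TO_BYTES;
}
```

In flush_space(), FLUSH_DELAYED_REFS_NR passes num_bytes, which is normally the third case, while FLUSH_DELAYED_REFS passes 0 and so runs everything that exists at the time of the call.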
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-09-08 17:20:34 +00:00
|
|
|
btrfs_run_delayed_refs(trans, num_bytes);
|
2019-06-18 20:09:25 +00:00
|
|
|
else
|
2023-09-08 17:20:34 +00:00
|
|
|
btrfs_run_delayed_refs(trans, 0);
|
2019-06-18 20:09:25 +00:00
|
|
|
btrfs_end_transaction(trans);
|
|
|
|
break;
|
|
|
|
case ALLOC_CHUNK:
|
|
|
|
case ALLOC_CHUNK_FORCE:
|
|
|
|
trans = btrfs_join_transaction(root);
|
|
|
|
if (IS_ERR(trans)) {
|
|
|
|
ret = PTR_ERR(trans);
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
ret = btrfs_chunk_alloc(trans,
|
2020-07-21 14:22:16 +00:00
|
|
|
btrfs_get_alloc_profile(fs_info, space_info->flags),
|
2019-06-18 20:09:25 +00:00
|
|
|
(state == ALLOC_CHUNK) ? CHUNK_ALLOC_NO_FORCE :
|
|
|
|
CHUNK_ALLOC_FORCE);
|
|
|
|
btrfs_end_transaction(trans);
|
2022-07-08 23:18:47 +00:00
|
|
|
|
2019-06-18 20:09:25 +00:00
|
|
|
if (ret > 0 || ret == -ENOSPC)
|
|
|
|
ret = 0;
|
|
|
|
break;
|
2019-08-01 22:19:33 +00:00
|
|
|
case RUN_DELAYED_IPUTS:
|
2019-06-18 20:09:25 +00:00
|
|
|
/*
|
|
|
|
* If we have pending delayed iputs then we could free up a
|
|
|
|
* bunch of pinned space, so make sure we run the iputs before
|
|
|
|
* we do our pinned bytes check below.
|
|
|
|
*/
|
|
|
|
btrfs_run_delayed_iputs(fs_info);
|
|
|
|
btrfs_wait_on_delayed_iputs(fs_info);
|
2019-08-01 22:19:33 +00:00
|
|
|
break;
|
|
|
|
case COMMIT_TRANS:
|
2021-06-22 12:51:58 +00:00
|
|
|
ASSERT(current->journal_info == NULL);
|
2023-07-26 15:57:11 +00:00
|
|
|
/*
|
|
|
|
* We don't want to start a new transaction, just attach to the
|
|
|
|
* current one or wait until it fully commits in case its commit is
|
|
|
|
* happening at the moment. Note: we don't use a nostart join
|
|
|
|
* because that does not wait for a transaction to fully commit
|
|
|
|
* (only for it to be unblocked, state TRANS_STATE_UNBLOCKED).
|
|
|
|
*/
|
2024-05-22 08:26:44 +00:00
|
|
|
ret = btrfs_commit_current_transaction(root);
|
2020-10-09 13:28:21 +00:00
|
|
|
break;
|
2019-06-18 20:09:25 +00:00
|
|
|
default:
|
|
|
|
ret = -ENOSPC;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
|
|
|
|
trace_btrfs_flush_space(fs_info, space_info->flags, num_bytes, state,
|
2020-10-09 13:28:28 +00:00
|
|
|
ret, for_preempt);
|
2019-06-18 20:09:25 +00:00
|
|
|
return;
|
|
|
|
}
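The three delalloc flush states share one call to shrink_delalloc() and differ only in the arguments they produce. A minimal sketch of that mapping follows. The `sketch_` enum and the struct are hypothetical mirrors of the kernel names, introduced so the example compiles on its own.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical mirror of the three delalloc-related flush states. */
enum sketch_flush_state {
	SKETCH_FLUSH_DELALLOC,
	SKETCH_FLUSH_DELALLOC_WAIT,
	SKETCH_FLUSH_DELALLOC_FULL,
};

/* The two arguments that vary between the states. */
struct sketch_shrink_args {
	uint64_t to_reclaim;
	bool wait_ordered;
};

static struct sketch_shrink_args delalloc_args(enum sketch_flush_state state,
					       uint64_t num_bytes)
{
	struct sketch_shrink_args args;

	assert(state <= SKETCH_FLUSH_DELALLOC_FULL);

	/* The FULL state asks for everything, regardless of num_bytes. */
	args.to_reclaim = (state == SKETCH_FLUSH_DELALLOC_FULL) ?
			  UINT64_MAX : num_bytes;
	/* Only the plain FLUSH_DELALLOC state skips waiting on ordered extents. */
	args.wait_ordered = (state != SKETCH_FLUSH_DELALLOC);
	return args;
}
```

Note that shrink_delalloc() may still set wait_ordered itself when ordered bytes exceed delalloc bytes, so this mapping only describes what flush_space() passes in.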
|
|
|
|
|
2024-06-26 21:39:11 +00:00
|
|
|
static u64 btrfs_calc_reclaim_metadata_size(struct btrfs_fs_info *fs_info,
|
|
|
|
const struct btrfs_space_info *space_info)
|
2019-06-18 20:09:25 +00:00
|
|
|
{
|
|
|
|
u64 used;
|
2020-02-21 21:41:10 +00:00
|
|
|
u64 avail;
|
2020-03-10 09:00:35 +00:00
|
|
|
u64 to_reclaim = space_info->reclaim_size;
|
2019-06-18 20:09:25 +00:00
|
|
|
|
2020-03-10 09:00:35 +00:00
|
|
|
lockdep_assert_held(&space_info->lock);
|
2020-02-21 21:41:10 +00:00
|
|
|
|
|
|
|
avail = calc_available_free_space(fs_info, space_info,
|
|
|
|
BTRFS_RESERVE_FLUSH_ALL);
|
|
|
|
used = btrfs_space_info_used(space_info, true);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* We may be flushing because suddenly we have less space than we had
|
|
|
|
* before, and now we're well over-committed based on our current free
|
|
|
|
* space. If that's the case add in our overage so we make sure to put
|
|
|
|
* appropriate pressure on the flushing state machine.
|
|
|
|
*/
|
2023-03-13 07:06:14 +00:00
|
|
|
if (space_info->total_bytes + avail < used)
|
|
|
|
to_reclaim += used - (space_info->total_bytes + avail);
|
2020-02-21 21:41:10 +00:00
|
|
|
|
2019-06-18 20:09:25 +00:00
|
|
|
return to_reclaim;
|
|
|
|
}
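The overage adjustment in btrfs_calc_reclaim_metadata_size() reduces to a few lines of arithmetic. The sketch below takes the four inputs as plain integers; the function name is hypothetical, and locking and the real calc_available_free_space() call are left out.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Start from the queued reclaim size and add the amount by which usage
 * exceeds total_bytes + avail, if it does.
 */
static uint64_t reclaim_with_overage(uint64_t reclaim_size, uint64_t total_bytes,
				     uint64_t avail, uint64_t used)
{
	uint64_t to_reclaim = reclaim_size;
	uint64_t limit = total_bytes + avail;

	assert(limit >= total_bytes);	/* the sketch assumes no u64 overflow */
	if (limit < used)
		to_reclaim += used - limit;
	return to_reclaim;
}
```

For instance, with 100 bytes queued, 1000 total, 200 available and 1500 used, the overage is 300 and the result is 400. If usage is within the limit, the queued size is returned unchanged.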
|
|
|
|
|
2020-10-09 13:28:23 +00:00
|
|
|
static bool need_preemptive_reclaim(struct btrfs_fs_info *fs_info,
|
2024-06-26 21:39:11 +00:00
|
|
|
const struct btrfs_space_info *space_info)
|
2019-06-18 20:09:25 +00:00
|
|
|
{
|
2024-02-19 19:41:23 +00:00
|
|
|
const u64 global_rsv_size = btrfs_block_rsv_reserved(&fs_info->global_block_rsv);
|
2020-10-09 13:28:26 +00:00
|
|
|
u64 ordered, delalloc;
|
2022-07-08 23:18:45 +00:00
|
|
|
u64 thresh;
|
2020-10-09 13:28:26 +00:00
|
|
|
u64 used;
|
2019-06-18 20:09:25 +00:00
|
|
|
|
2023-03-13 07:06:14 +00:00
|
|
|
thresh = mult_perc(space_info->total_bytes, 90);
|
2022-07-08 23:18:45 +00:00
|
|
|
|
2022-03-03 00:38:39 +00:00
|
|
|
lockdep_assert_held(&space_info->lock);
|
|
|
|
|
2019-06-18 20:09:25 +00:00
|
|
|
/* If we're just plain full then async reclaim just slows us down. */
|
2021-04-28 17:38:44 +00:00
|
|
|
if ((space_info->bytes_used + space_info->bytes_reserved +
|
|
|
|
global_rsv_size) >= thresh)
|
2020-10-09 13:28:23 +00:00
|
|
|
return false;
|
2019-06-18 20:09:25 +00:00
|
|
|
|
2021-08-11 18:37:16 +00:00
|
|
|
used = space_info->bytes_may_use + space_info->bytes_pinned;
|
|
|
|
|
|
|
|
/* The total flushable belongs to the global rsv, don't flush. */
|
|
|
|
if (global_rsv_size >= used)
|
|
|
|
return false;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* 128MiB is 1/4 of the maximum global rsv size. If we have less than
|
|
|
|
* that devoted to other reservations then there's no sense in flushing,
|
|
|
|
* we don't have a lot of things that need flushing.
|
|
|
|
*/
|
|
|
|
if (used - global_rsv_size <= SZ_128M)
|
|
|
|
return false;
|
|
|
|
|
2020-10-09 13:28:24 +00:00
|
|
|
/*
|
|
|
|
* We have tickets queued, bail so we don't compete with the async
|
|
|
|
* flushers.
|
|
|
|
*/
|
|
|
|
if (space_info->reclaim_size)
|
|
|
|
return false;
|
|
|
|
|
2020-10-09 13:28:26 +00:00
|
|
|
/*
|
|
|
|
* If we have over half of the free space occupied by reservations or
|
|
|
|
* pinned then we want to start flushing.
|
|
|
|
*
|
|
|
|
* We do not do the traditional thing here, which is to say
|
|
|
|
*
|
|
|
|
* if (used >= ((total_bytes + avail) / 2))
|
|
|
|
* return 1;
|
|
|
|
*
|
|
|
|
* because this doesn't quite work how we want. If we had more than 50%
|
|
|
|
* of the space_info used by bytes_used and we had 0 available we'd just
|
|
|
|
* constantly run the background flusher. Instead we want it to kick in
|
2020-10-09 13:28:27 +00:00
|
|
|
* if our reclaimable space exceeds our clamped free space.
|
|
|
|
*
|
|
|
|
* Our clamping range is 2^1 -> 2^8. Practically speaking that means
|
|
|
|
* the following:
|
|
|
|
*
|
|
|
|
* Amount of RAM    Minimum threshold    Maximum threshold
|
|
|
|
*
|
|
|
|
* 256GiB           1GiB                 128GiB
|
|
|
|
* 128GiB           512MiB               64GiB
|
|
|
|
* 64GiB            256MiB               32GiB
|
|
|
|
* 32GiB            128MiB               16GiB
|
|
|
|
* 16GiB            64MiB                8GiB
|
|
|
|
*
|
|
|
|
* This is the range our thresholds will fall in, corresponding to how
|
|
|
|
* much delalloc we need for the background flusher to kick in.
|
2020-10-09 13:28:26 +00:00
|
|
|
*/
|
2020-10-09 13:28:27 +00:00
|
|
|
|
2020-10-09 13:28:26 +00:00
|
|
|
thresh = calc_available_free_space(fs_info, space_info,
|
|
|
|
BTRFS_RESERVE_FLUSH_ALL);
|
2021-04-28 17:38:45 +00:00
|
|
|
used = space_info->bytes_used + space_info->bytes_reserved +
|
|
|
|
space_info->bytes_readonly + global_rsv_size;
|
2023-03-13 07:06:14 +00:00
|
|
|
if (used < space_info->total_bytes)
|
|
|
|
thresh += space_info->total_bytes - used;
|
2020-10-09 13:28:27 +00:00
|
|
|
thresh >>= space_info->clamp;
|
2020-10-09 13:28:25 +00:00
|
|
|
|
2020-10-09 13:28:26 +00:00
|
|
|
used = space_info->bytes_pinned;
|
2020-10-09 13:28:25 +00:00
|
|
|
|
2020-10-09 13:28:26 +00:00
|
|
|
/*
|
|
|
|
* If we have more ordered bytes than delalloc bytes then we're either
|
|
|
|
* doing a lot of DIO, or we simply don't have a lot of delalloc waiting
|
|
|
|
* around. Preemptive flushing is only useful in that it can free up
|
|
|
|
* space before tickets need to wait for things to finish. In the case
|
|
|
|
* of ordered extents, preemptively waiting on ordered extents gets us
|
|
|
|
* nothing, if our reservations are tied up in ordered extents we'll
|
|
|
|
* simply have to slow down writers by forcing them to wait on ordered
|
|
|
|
* extents.
|
|
|
|
*
|
|
|
|
* In the case that ordered is larger than delalloc, only include the
|
|
|
|
* block reserves that we would actually be able to directly reclaim
|
|
|
|
* from. In this case if we're heavy on metadata operations this will
|
|
|
|
* clearly be heavy enough to warrant preemptive flushing. In the case
|
|
|
|
* of heavy DIO or ordered reservations, preemptive flushing will just
|
|
|
|
* waste time and cause us to slow down.
|
2021-04-28 17:38:47 +00:00
|
|
|
*
|
|
|
|
* We want to make sure we truly are maxed out on ordered however, so
|
|
|
|
* cut ordered in half, and if it's still higher than delalloc then we
|
|
|
|
* can keep flushing. This is to avoid the case where we start
|
|
|
|
* flushing, and now delalloc == ordered and we stop preemptively
|
|
|
|
* flushing when we could still have several gigs of delalloc to flush.
|
2020-10-09 13:28:26 +00:00
|
|
|
*/
|
2021-04-28 17:38:47 +00:00
|
|
|
ordered = percpu_counter_read_positive(&fs_info->ordered_bytes) >> 1;
|
2021-03-24 13:44:21 +00:00
|
|
|
delalloc = percpu_counter_read_positive(&fs_info->delalloc_bytes);
|
2020-10-09 13:28:26 +00:00
|
|
|
if (ordered >= delalloc)
|
2024-02-19 19:41:23 +00:00
|
|
|
used += btrfs_block_rsv_reserved(&fs_info->delayed_refs_rsv) +
|
|
|
|
btrfs_block_rsv_reserved(&fs_info->delayed_block_rsv);
|
2020-10-09 13:28:25 +00:00
|
|
|
else
|
2021-04-28 17:38:46 +00:00
|
|
|
used += space_info->bytes_may_use - global_rsv_size;
|
2019-06-18 20:09:25 +00:00
|
|
|
|
|
|
|
return (used >= thresh && !btrfs_fs_closing(fs_info) &&
|
|
|
|
!test_bit(BTRFS_FS_STATE_REMOUNTING, &fs_info->fs_state));
|
|
|
|
}
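The clamping table in the comment above can be reproduced with a few lines of arithmetic. The following is a minimal user-space sketch, not kernel code: `clamped_threshold()` and `exceeds_threshold()` are hypothetical helpers that mirror only the `thresh >>= space_info->clamp` step and the final `used >= thresh` comparison, and they leave out the filesystem closing and remounting checks. The first column of the table is treated as the input quantity being shifted.

```c
#include <assert.h>
#include <stdint.h>

#define GIB (1024ULL * 1024 * 1024)

/* Clamp limits taken from the comment: shifts of 2^1 through 2^8. */
#define CLAMP_MIN 1
#define CLAMP_MAX 8

/*
 * Hypothetical helper: scale a byte count by the clamp, mirroring
 * "thresh >>= space_info->clamp".
 */
static uint64_t clamped_threshold(uint64_t bytes, int clamp)
{
	assert(clamp >= CLAMP_MIN && clamp <= CLAMP_MAX);
	return bytes >> clamp;
}

/*
 * Hypothetical helper: the final comparison only. The real function also
 * checks that the filesystem is neither closing nor remounting.
 */
static int exceeds_threshold(uint64_t used, uint64_t bytes, int clamp)
{
	return used >= clamped_threshold(bytes, clamp);
}
```

With an input of 256GiB, a clamp of 1 gives the 128GiB maximum and a clamp of 8 gives the 1GiB minimum, matching the first row of the table. A larger clamp therefore makes preemptive flushing start earlier.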
|
|
|
|
|
2020-03-13 19:58:05 +00:00
|
|
|
static bool steal_from_global_rsv(struct btrfs_fs_info *fs_info,
|
|
|
|
struct btrfs_space_info *space_info,
|
|
|
|
struct reserve_ticket *ticket)
|
|
|
|
{
|
|
|
|
struct btrfs_block_rsv *global_rsv = &fs_info->global_block_rsv;
|
|
|
|
u64 min_bytes;
|
|
|
|
|
2021-11-09 15:12:03 +00:00
|
|
|
if (!ticket->steal)
|
|
|
|
return false;
|
|
|
|
|
2020-03-13 19:58:05 +00:00
|
|
|
if (global_rsv->space_info != space_info)
|
|
|
|
return false;
|
|
|
|
|
|
|
|
spin_lock(&global_rsv->lock);
|
2022-10-26 21:25:14 +00:00
|
|
|
min_bytes = mult_perc(global_rsv->size, 10);
|
2020-03-13 19:58:05 +00:00
|
|
|
if (global_rsv->reserved < min_bytes + ticket->bytes) {
|
|
|
|
spin_unlock(&global_rsv->lock);
|
|
|
|
return false;
|
|
|
|
}
|
|
|
|
global_rsv->reserved -= ticket->bytes;
|
2020-06-27 10:40:44 +00:00
|
|
|
remove_ticket(space_info, ticket);
|
2020-03-13 19:58:05 +00:00
|
|
|
ticket->bytes = 0;
|
|
|
|
wake_up(&ticket->wait);
|
|
|
|
space_info->tickets_id++;
|
|
|
|
if (global_rsv->reserved < global_rsv->size)
|
|
|
|
global_rsv->full = 0;
|
|
|
|
spin_unlock(&global_rsv->lock);
|
|
|
|
|
|
|
|
return true;
|
|
|
|
}
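The admission check in `steal_from_global_rsv()` keeps at least 10% of the global reserve's size in place after the ticket is paid out. Below is a minimal user-space sketch of just that arithmetic. `struct rsv_model`, `percent_of()` and `try_steal()` are hypothetical names for illustration; locking, ticket removal and wakeups are omitted, and `percent_of()` assumes `mult_perc()` computes `num * percent / 100`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for mult_perc(), assumed to be num * percent / 100. */
static uint64_t percent_of(uint64_t num, uint32_t percent)
{
	assert(percent <= 100);
	return num * percent / 100;
}

/* Hypothetical, lock-free model of a block reserve; not the kernel struct. */
struct rsv_model {
	uint64_t size;
	uint64_t reserved;
};

/*
 * Pay ticket_bytes out of the reserve only if at least 10% of its size
 * would remain reserved afterwards.
 */
static bool try_steal(struct rsv_model *rsv, uint64_t ticket_bytes)
{
	uint64_t min_bytes = percent_of(rsv->size, 10);

	if (rsv->reserved < min_bytes + ticket_bytes)
		return false;
	rsv->reserved -= ticket_bytes;
	return true;
}
```

For a reserve with size 1000 and 300 reserved, a 200-byte ticket succeeds and leaves exactly the 100-byte floor; any further ticket is refused.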
|
|
|
|
|
2019-08-28 15:12:47 +00:00
|
|
|
/*
|
2023-09-07 23:09:25 +00:00
|
|
|
* We've exhausted our flushing, start failing tickets.
|
|
|
|
*
|
2019-08-28 15:12:47 +00:00
|
|
|
* @fs_info - fs_info for this fs
|
|
|
|
* @space_info - the space info we were flushing
|
|
|
|
*
|
|
|
|
* We call this when we've exhausted our flushing ability and haven't made
|
|
|
|
* progress in satisfying tickets. The reservation code handles tickets in
|
|
|
|
* order, so if there is a large ticket first and then smaller ones we could
|
|
|
|
* very well satisfy the smaller tickets. This will attempt to wake up any
|
|
|
|
* tickets in the list to catch this case.
|
|
|
|
*
|
|
|
|
* This function returns true if it was able to make progress by clearing out
|
|
|
|
* other tickets, or if it stumbles across a ticket that was smaller than the
|
|
|
|
* first ticket.
|
|
|
|
*/
|
|
|
|
static bool maybe_fail_all_tickets(struct btrfs_fs_info *fs_info,
|
|
|
|
struct btrfs_space_info *space_info)
|
2019-06-18 20:09:25 +00:00
|
|
|
{
|
|
|
|
struct reserve_ticket *ticket;
|
2019-08-28 15:12:47 +00:00
|
|
|
u64 tickets_id = space_info->tickets_id;
|
2021-10-05 20:35:26 +00:00
|
|
|
const bool aborted = BTRFS_FS_ERROR(fs_info);
|
2019-08-28 15:12:47 +00:00
|
|
|
|
2021-07-14 18:47:19 +00:00
|
|
|
trace_btrfs_fail_all_tickets(fs_info, space_info);
|
|
|
|
|
2019-08-22 19:19:04 +00:00
|
|
|
if (btrfs_test_opt(fs_info, ENOSPC_DEBUG)) {
|
|
|
|
btrfs_info(fs_info, "cannot satisfy tickets, dumping space info");
|
|
|
|
__btrfs_dump_space_info(fs_info, space_info);
|
|
|
|
}
|
|
|
|
|
2019-08-28 15:12:47 +00:00
|
|
|
while (!list_empty(&space_info->tickets) &&
|
|
|
|
tickets_id == space_info->tickets_id) {
|
|
|
|
ticket = list_first_entry(&space_info->tickets,
|
|
|
|
struct reserve_ticket, list);
|
|
|
|
|
2021-11-09 15:12:03 +00:00
|
|
|
if (!aborted && steal_from_global_rsv(fs_info, space_info, ticket))
|
2020-03-13 19:58:05 +00:00
|
|
|
return true;
|
|
|
|
|
2021-10-05 20:35:26 +00:00
|
|
|
if (!aborted && btrfs_test_opt(fs_info, ENOSPC_DEBUG))
|
2019-08-22 19:19:04 +00:00
|
|
|
btrfs_info(fs_info, "failing ticket with %llu bytes",
|
|
|
|
ticket->bytes);
|
|
|
|
|
2020-04-07 10:38:49 +00:00
|
|
|
remove_ticket(space_info, ticket);
|
2021-10-05 20:35:26 +00:00
|
|
|
if (aborted)
|
|
|
|
ticket->error = -EIO;
|
|
|
|
else
|
|
|
|
ticket->error = -ENOSPC;
|
2019-06-18 20:09:25 +00:00
|
|
|
wake_up(&ticket->wait);
|
2019-08-28 15:12:47 +00:00
|
|
|
|
|
|
|
/*
|
|
|
|
* We're just throwing tickets away, so more flushing may not
|
|
|
|
* trip over btrfs_try_granting_tickets, so we need to call it
|
|
|
|
* here to see if we can make progress with the next ticket in
|
|
|
|
* the list.
|
|
|
|
*/
|
2021-10-05 20:35:26 +00:00
|
|
|
if (!aborted)
|
|
|
|
btrfs_try_granting_tickets(fs_info, space_info);
|
2019-06-18 20:09:25 +00:00
|
|
|
}
|
2019-08-28 15:12:47 +00:00
|
|
|
return (tickets_id != space_info->tickets_id);
|
2019-06-18 20:09:25 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* This is for normal flushers, we can wait all goddamned day if we want to. We
|
|
|
|
* will loop and continuously try to flush as long as we are making progress.
|
|
|
|
* We count progress as clearing off tickets each time we have to loop.
|
|
|
|
*/
|
|
|
|
static void btrfs_async_reclaim_metadata_space(struct work_struct *work)
|
|
|
|
{
|
|
|
|
struct btrfs_fs_info *fs_info;
|
|
|
|
struct btrfs_space_info *space_info;
|
|
|
|
u64 to_reclaim;
|
2020-10-09 13:28:18 +00:00
|
|
|
enum btrfs_flush_state flush_state;
|
2019-06-18 20:09:25 +00:00
|
|
|
int commit_cycles = 0;
|
|
|
|
u64 last_tickets_id;
|
|
|
|
|
|
|
|
fs_info = container_of(work, struct btrfs_fs_info, async_reclaim_work);
|
|
|
|
space_info = btrfs_find_space_info(fs_info, BTRFS_BLOCK_GROUP_METADATA);
|
|
|
|
|
|
|
|
spin_lock(&space_info->lock);
|
2019-11-26 16:25:53 +00:00
|
|
|
to_reclaim = btrfs_calc_reclaim_metadata_size(fs_info, space_info);
|
2019-06-18 20:09:25 +00:00
|
|
|
if (!to_reclaim) {
|
|
|
|
space_info->flush = 0;
|
|
|
|
spin_unlock(&space_info->lock);
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
last_tickets_id = space_info->tickets_id;
|
|
|
|
spin_unlock(&space_info->lock);
|
|
|
|
|
|
|
|
flush_state = FLUSH_DELAYED_ITEMS_NR;
|
|
|
|
do {
|
2020-10-09 13:28:28 +00:00
|
|
|
flush_space(fs_info, space_info, to_reclaim, flush_state, false);
|
2019-06-18 20:09:25 +00:00
|
|
|
spin_lock(&space_info->lock);
|
|
|
|
if (list_empty(&space_info->tickets)) {
|
|
|
|
space_info->flush = 0;
|
|
|
|
spin_unlock(&space_info->lock);
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
to_reclaim = btrfs_calc_reclaim_metadata_size(fs_info,
|
2019-11-26 16:25:53 +00:00
|
|
|
space_info);
|
2019-06-18 20:09:25 +00:00
|
|
|
if (last_tickets_id == space_info->tickets_id) {
|
|
|
|
flush_state++;
|
|
|
|
} else {
|
|
|
|
last_tickets_id = space_info->tickets_id;
|
|
|
|
flush_state = FLUSH_DELAYED_ITEMS_NR;
|
|
|
|
if (commit_cycles)
|
|
|
|
commit_cycles--;
|
|
|
|
}
|
|
|
|
|
btrfs: use delalloc_bytes to determine flush amount for shrink_delalloc
We have been hitting some early ENOSPC issues in production with more
recent kernels, and I tracked it down to us simply not flushing delalloc
as aggressively as we should be. With tracing I was seeing us failing
all tickets with all of the block rsvs at or around 0, with very little
pinned space, but still around 120MiB of outstanding bytes_may_use.
Upon further investigation I saw that we were flushing around 14 pages
per shrink call for delalloc, despite having around 2GiB of delalloc
outstanding.
Consider the example of an 8-way machine, with all CPUs trying to create a
file in parallel, which at the time of this commit requires 5 items to
do. Assuming a 16k leaf size, we have 10MiB of total metadata reclaim
size waiting on reservations. Now assume we have 128MiB of delalloc
outstanding. With our current math we would set items to 20, and then
set to_reclaim to 20 * 256k, or 5MiB.
Assuming that we went through this loop all 3 times, for both
FLUSH_DELALLOC and FLUSH_DELALLOC_WAIT, and then did the full loop
twice, we'd only flush 60MiB of the 128MiB delalloc space. This could
leave a fair bit of delalloc reservations still hanging around by the
time we go to ENOSPC out all the remaining tickets.
Fix this in two ways. First, change the calculations to be a fraction of
the total delalloc bytes on the system. Prior to this change we were
calculating based on dirty inodes so our math made more sense, now it's
just completely unrelated to what we're actually doing.
Second, add a FLUSH_DELALLOC_FULL state, that we hold off until we've
gone through the flush states at least once. This will empty the system
of all delalloc so we're sure to be truly out of space when we start
failing tickets.
I'm tagging stable 5.10 and forward, because this is where we started
using the page stuff heavily again. This affects earlier kernel
versions as well, but would be a pain to backport to them as the
flushing mechanisms aren't the same.
CC: stable@vger.kernel.org # 5.10+
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-07-14 18:47:20 +00:00
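The flush totals quoted in the commit message above can be checked with a short calculation. This is a sketch using only the figures stated there (20 items, 256k per item, 3 loop iterations, 2 delalloc flush states, 2 full passes); `old_per_call_bytes()` and `old_total_flushed()` are hypothetical helper names, not kernel functions.

```c
#include <assert.h>
#include <stdint.h>

#define KIB 1024ULL
#define MIB (1024ULL * KIB)

/* Bytes targeted per shrink call under the old math: items * 256KiB. */
static uint64_t old_per_call_bytes(uint64_t items)
{
	return items * 256 * KIB;
}

/*
 * Total flushed across `loops` iterations per state, `states` flush
 * states, and `cycles` full passes through the state machine.
 */
static uint64_t old_total_flushed(uint64_t items, unsigned int loops,
				  unsigned int states, unsigned int cycles)
{
	assert(loops && states && cycles);
	return old_per_call_bytes(items) * loops * states * cycles;
}
```

With 20 items the per-call target is 5MiB, and 5MiB * 3 * 2 * 2 gives the 60MiB total, which is less than half of the 128MiB of outstanding delalloc in the example.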
|
|
|
/*
|
|
|
|
* We do not want to empty the system of delalloc unless we're
|
|
|
|
* under heavy pressure, so allow one trip through the flushing
|
|
|
|
* logic before we start doing a FLUSH_DELALLOC_FULL.
|
|
|
|
*/
|
|
|
|
if (flush_state == FLUSH_DELALLOC_FULL && !commit_cycles)
|
|
|
|
flush_state++;
|
|
|
|
|
2019-06-18 20:09:25 +00:00
|
|
|
/*
|
|
|
|
* We don't want to force a chunk allocation until we've tried
|
|
|
|
* pretty hard to reclaim space. Think of the case where we
|
|
|
|
* freed up a bunch of space and so have a lot of pinned space
|
|
|
|
* to reclaim. We would rather use that than possibly create an
|
|
|
|
* underutilized metadata chunk. So if this is our first run
|
|
|
|
* through the flushing state machine skip ALLOC_CHUNK_FORCE and
|
|
|
|
* commit the transaction. If nothing has changed the next go
|
|
|
|
* around then we can force a chunk allocation.
|
|
|
|
*/
|
|
|
|
if (flush_state == ALLOC_CHUNK_FORCE && !commit_cycles)
|
|
|
|
flush_state++;
|
|
|
|
|
|
|
|
if (flush_state > COMMIT_TRANS) {
|
|
|
|
commit_cycles++;
|
|
|
|
if (commit_cycles > 2) {
|
2019-08-28 15:12:47 +00:00
|
|
|
if (maybe_fail_all_tickets(fs_info, space_info)) {
|
2019-06-18 20:09:25 +00:00
|
|
|
flush_state = FLUSH_DELAYED_ITEMS_NR;
|
|
|
|
commit_cycles--;
|
|
|
|
} else {
|
|
|
|
space_info->flush = 0;
|
|
|
|
}
|
|
|
|
} else {
|
|
|
|
flush_state = FLUSH_DELAYED_ITEMS_NR;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
spin_unlock(&space_info->lock);
|
|
|
|
} while (flush_state <= COMMIT_TRANS);
|
|
|
|
}
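The two "skip on the first pass" rules in the loop above are easier to see in isolation. The sketch below is a user-space model, not kernel code. The `enum flush_model` values and their ordering are an assumption about `enum btrfs_flush_state` (the listing only shows that `ALLOC_CHUNK_FORCE` and `FLUSH_DELALLOC_FULL` precede `COMMIT_TRANS`), and `next_state()` and `states_in_pass()` are hypothetical helpers.

```c
#include <assert.h>

/* Assumed ordering of the flush states; treat it as illustrative only. */
enum flush_model {
	M_FLUSH_DELAYED_ITEMS_NR = 1,
	M_FLUSH_DELAYED_ITEMS,
	M_FLUSH_DELAYED_REFS_NR,
	M_FLUSH_DELAYED_REFS,
	M_FLUSH_DELALLOC,
	M_FLUSH_DELALLOC_WAIT,
	M_FLUSH_DELALLOC_FULL,
	M_ALLOC_CHUNK,
	M_ALLOC_CHUNK_FORCE,
	M_RUN_DELAYED_IPUTS,
	M_COMMIT_TRANS,
};

/*
 * Advance one step when no ticket progress was made. The two expensive
 * states are skipped until a full pass has completed (commit_cycles != 0).
 */
static int next_state(int state, int commit_cycles)
{
	assert(state >= M_FLUSH_DELAYED_ITEMS_NR);
	state++;
	if (state == M_FLUSH_DELALLOC_FULL && !commit_cycles)
		state++;
	if (state == M_ALLOC_CHUNK_FORCE && !commit_cycles)
		state++;
	return state;
}

/* Count how many states one uninterrupted pass visits. */
static int states_in_pass(int commit_cycles)
{
	int count = 0;
	int state = M_FLUSH_DELAYED_ITEMS_NR;

	while (state <= M_COMMIT_TRANS) {
		count++;
		state = next_state(state, commit_cycles);
	}
	return count;
}
```

Under the assumed ordering, the first pass visits 9 of the 11 states, and later passes visit all 11. The model leaves out the reset to the first state on ticket progress and the `maybe_fail_all_tickets()` path taken once `commit_cycles` exceeds 2.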
|
|
|
|
|
btrfs: improve preemptive background space flushing
Currently if we ever have to flush space because we do not have enough
we allocate a ticket and attach it to the space_info, and then
systematically flush things in the filesystem that hold space
reservations until our space is reclaimed.
However this has a latency cost, we must go to sleep and wait for the
flushing to make progress before we are woken up and allowed to continue
doing our work.
In order to address that we used to kick off the async worker to flush
space preemptively, so that we could be reclaiming space hopefully
before any tasks needed to stop and wait for space to reclaim.
When I introduced the ticketed ENOSPC stuff this broke slightly in the
fact that we were using tickets to indicate if we were done flushing.
No tickets, no more flushing. However this meant that we essentially
never preemptively flushed. This caused a write performance regression
that Nikolay noticed in an unrelated patch that removed the committing
of the transaction during btrfs_end_transaction.
The behavior before that patch was that btrfs_end_transaction()
would see that we were low on space, and it would commit the
transaction. This was bad because in this particular case you could end
up with thousands and thousands of transactions being committed during
the 5 minute reproducer. With the patch to remove this behavior we got
much more sane transaction commits, but we ended up slower because we
would write for a while, flush, write for a while, flush again.
To address this we need to reinstate a preemptive flushing mechanism.
However it is distinctly different from our ticketing flushing in that
it doesn't have tickets to base its decisions on. Instead of bolting
this logic into our existing flushing work, add another worker to handle
this preemptive flushing. Here we will attempt to be slightly
intelligent about the things that we are flushing, attempting to balance
between whichever pool is taking up the most space.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-09 13:28:22 +00:00
|
|
|
/*
|
|
|
|
* This handles pre-flushing of metadata space before we get to the point that
|
|
|
|
* we need to start blocking threads on tickets. The logic here is different
|
|
|
|
* from the other flush paths because it doesn't rely on tickets to tell us how
|
|
|
|
* much we need to flush, instead it attempts to keep us below the 80% full
|
|
|
|
* watermark of space by flushing whichever reservation pool is currently the
|
|
|
|
* largest.
|
|
|
|
*/
|
|
|
|
static void btrfs_preempt_reclaim_metadata_space(struct work_struct *work)
|
|
|
|
{
|
|
|
|
struct btrfs_fs_info *fs_info;
|
|
|
|
struct btrfs_space_info *space_info;
|
|
|
|
struct btrfs_block_rsv *delayed_block_rsv;
|
|
|
|
struct btrfs_block_rsv *delayed_refs_rsv;
|
|
|
|
struct btrfs_block_rsv *global_rsv;
|
|
|
|
struct btrfs_block_rsv *trans_rsv;
|
2020-10-09 13:28:27 +00:00
|
|
|
int loops = 0;
|
2020-10-09 13:28:22 +00:00
|
|
|
|
|
|
|
fs_info = container_of(work, struct btrfs_fs_info,
|
|
|
|
preempt_reclaim_work);
|
|
|
|
space_info = btrfs_find_space_info(fs_info, BTRFS_BLOCK_GROUP_METADATA);
|
|
|
|
delayed_block_rsv = &fs_info->delayed_block_rsv;
|
|
|
|
delayed_refs_rsv = &fs_info->delayed_refs_rsv;
|
|
|
|
global_rsv = &fs_info->global_block_rsv;
|
|
|
|
trans_rsv = &fs_info->trans_block_rsv;
|
|
|
|
|
|
|
|
spin_lock(&space_info->lock);
|
2020-10-09 13:28:26 +00:00
|
|
|
while (need_preemptive_reclaim(fs_info, space_info)) {
|
2020-10-09 13:28:22 +00:00
|
|
|
enum btrfs_flush_state flush;
|
|
|
|
u64 delalloc_size = 0;
|
|
|
|
u64 to_reclaim, block_rsv_size;
|
2024-02-19 19:41:23 +00:00
|
|
|
const u64 global_rsv_size = btrfs_block_rsv_reserved(global_rsv);
|
2020-10-09 13:28:22 +00:00
|
|
|
|
2020-10-09 13:28:27 +00:00
|
|
|
loops++;
|
|
|
|
|
2020-10-09 13:28:22 +00:00
|
|
|
/*
|
|
|
|
* We don't have a precise counter for the metadata being
|
|
|
|
* reserved for delalloc, so we'll approximate it by subtracting
|
|
|
|
* out the block rsv's space from the bytes_may_use. If that
|
|
|
|
* amount is higher than the individual reserves, then we can
|
|
|
|
* assume it's tied up in delalloc reservations.
|
|
|
|
*/
|
|
|
|
block_rsv_size = global_rsv_size +
|
2024-02-19 19:41:23 +00:00
|
|
|
btrfs_block_rsv_reserved(delayed_block_rsv) +
|
|
|
|
btrfs_block_rsv_reserved(delayed_refs_rsv) +
|
|
|
|
btrfs_block_rsv_reserved(trans_rsv);
|
2020-10-09 13:28:22 +00:00
|
|
|
if (block_rsv_size < space_info->bytes_may_use)
|
|
|
|
delalloc_size = space_info->bytes_may_use - block_rsv_size;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* We don't want to include the global_rsv in our calculation,
|
|
|
|
* because that's space we can't touch. Subtract it from the
|
|
|
|
* block_rsv_size for the next checks.
|
|
|
|
*/
|
|
|
|
block_rsv_size -= global_rsv_size;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* We really want to avoid flushing delalloc too much, as it
|
|
|
|
* could result in poor allocation patterns, so only flush it if
|
|
|
|
* it's larger than the rest of the pools combined.
|
|
|
|
*/
|
|
|
|
if (delalloc_size > block_rsv_size) {
|
|
|
|
to_reclaim = delalloc_size;
|
|
|
|
flush = FLUSH_DELALLOC;
|
|
|
|
} else if (space_info->bytes_pinned >
|
2024-02-19 19:41:23 +00:00
|
|
|
(btrfs_block_rsv_reserved(delayed_block_rsv) +
|
|
|
|
btrfs_block_rsv_reserved(delayed_refs_rsv))) {
|
2020-10-09 13:28:22 +00:00
|
|
|
to_reclaim = space_info->bytes_pinned;
|
2021-06-22 12:51:58 +00:00
|
|
|
flush = COMMIT_TRANS;
|
2024-02-19 19:41:23 +00:00
|
|
|
} else if (btrfs_block_rsv_reserved(delayed_block_rsv) >
|
|
|
|
btrfs_block_rsv_reserved(delayed_refs_rsv)) {
|
|
|
|
to_reclaim = btrfs_block_rsv_reserved(delayed_block_rsv);
|
2020-10-09 13:28:22 +00:00
|
|
|
flush = FLUSH_DELAYED_ITEMS_NR;
|
|
|
|
} else {
|
2024-02-19 19:41:23 +00:00
|
|
|
to_reclaim = btrfs_block_rsv_reserved(delayed_refs_rsv);
|
2020-10-09 13:28:22 +00:00
|
|
|
flush = FLUSH_DELAYED_REFS_NR;
|
|
|
|
}
|
|
|
|
|
2022-02-25 21:20:28 +00:00
|
|
|
spin_unlock(&space_info->lock);
|
|
|
|
|
2020-10-09 13:28:22 +00:00
|
|
|
/*
|
|
|
|
* We don't want to reclaim everything, just a portion, so scale
|
|
|
|
* down the to_reclaim by 1/4. If it takes us down to 0,
|
|
|
|
* reclaim 1 item's worth.
|
|
|
|
*/
|
|
|
|
to_reclaim >>= 2;
|
|
|
|
if (!to_reclaim)
|
|
|
|
to_reclaim = btrfs_calc_insert_metadata_size(fs_info, 1);
|
2020-10-09 13:28:28 +00:00
|
|
|
flush_space(fs_info, space_info, to_reclaim, flush, true);
|
2020-10-09 13:28:22 +00:00
|
|
|
cond_resched();
|
|
|
|
spin_lock(&space_info->lock);
|
|
|
|
}
|
2020-10-09 13:28:27 +00:00
|
|
|
|
|
|
|
/* We only went through once, back off our clamping. */
|
|
|
|
if (loops == 1 && !space_info->reclaim_size)
|
|
|
|
space_info->clamp = max(1, space_info->clamp - 1);
|
2020-10-09 13:28:29 +00:00
|
|
|
trace_btrfs_done_preemptive_reclaim(fs_info, space_info);
|
2020-10-09 13:28:22 +00:00
|
|
|
spin_unlock(&space_info->lock);
|
|
|
|
}
|
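The tail of the preemptive reclaim loop above does two small pieces of arithmetic: it shrinks the chosen reclaim target to a quarter (with a floor of one item's worth), and it backs the clamp off by one step after a single-pass run. A minimal standalone sketch of just that arithmetic follows. `ONE_ITEM_BYTES` is a hypothetical stand-in for `btrfs_calc_insert_metadata_size(fs_info, 1)`, and the helper names are illustrative only, not kernel functions.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for btrfs_calc_insert_metadata_size(fs_info, 1). */
#define ONE_ITEM_BYTES (256ULL * 1024ULL)

/*
 * Scale a reclaim target down to a quarter. If the shift rounds down to
 * zero, fall back to one item's worth so the flush still does something.
 */
static uint64_t scale_preempt_reclaim(uint64_t to_reclaim)
{
	to_reclaim >>= 2;
	if (!to_reclaim)
		to_reclaim = ONE_ITEM_BYTES;
	return to_reclaim;
}

/* Back the clamp off by one step, never dropping below 1. */
static int back_off_clamp(int clamp)
{
	assert(clamp >= 1);
	return clamp - 1 > 1 ? clamp - 1 : 1;
}
```

For instance, a 4MiB target becomes 1MiB, while a 3-byte target would round to zero and is replaced by one item's worth.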
|
|
|
|
2020-07-21 14:22:34 +00:00
|
|
|
/*
|
|
|
|
* FLUSH_DELALLOC_WAIT:
|
|
|
|
* Space is freed from flushing delalloc in one of two ways.
|
|
|
|
*
|
|
|
|
* 1) compression is on and we allocate less space than we reserved
|
|
|
|
* 2) we are overwriting existing space
|
|
|
|
*
|
|
|
|
* For #1 that extra space is reclaimed as soon as the delalloc pages are
|
|
|
|
* COWed, by way of btrfs_add_reserved_bytes() which adds the actual extent
|
|
|
|
* length to ->bytes_reserved, and subtracts the reserved space from
|
|
|
|
* ->bytes_may_use.
|
|
|
|
*
|
|
|
|
* For #2 this is trickier. Once the ordered extent runs we will drop the
|
|
|
|
* extent in the range we are overwriting, which creates a delayed ref for
|
|
|
|
* that freed extent. This however is not reclaimed until the transaction
|
|
|
|
* commits, thus the next stages.
|
|
|
|
*
|
|
|
|
* RUN_DELAYED_IPUTS
|
|
|
|
* If we are freeing inodes, we want to make sure all delayed iputs have
|
|
|
|
* completed, because they could have been on an inode with i_nlink == 0, and
|
|
|
|
* thus have been truncated and freed up space. But again this space is not
|
2024-09-24 03:09:44 +00:00
|
|
|
* immediately reusable, it comes in the form of a delayed ref, which must be
|
2020-07-21 14:22:34 +00:00
|
|
|
* run and then the transaction must be committed.
|
|
|
|
*
|
|
|
|
* COMMIT_TRANS
|
2021-06-22 12:51:58 +00:00
|
|
|
* This is where we reclaim all of the pinned space generated by running the
|
|
|
|
* iputs
|
btrfs: fix possible infinite loop in data async reclaim
Dave reported an issue where generic/102 would sometimes hang. This
turned out to be because we'd get into this spot where we were no longer
making progress on data reservations because our exit condition was not
met. The logic is basically
    while (!space_info->full && !list_empty(&space_info->tickets))
        flush_space(space_info, flush_state);
where flush state is our various flush states, but doesn't include
ALLOC_CHUNK_FORCE. This is because we actually lead with allocating
chunks, and so the assumption was that once you got to the actual
flushing states you could no longer allocate chunks. This was a stupid
assumption, because you could have deleted block groups that would be
reclaimed by a transaction commit, thus unsetting space_info->full.
This is essentially what happens with generic/102, and so sometimes
you'd get stuck in the flushing loop because we weren't allocating
chunks, but flushing space wasn't giving us what we needed to make
progress.
Fix this by adding ALLOC_CHUNK_FORCE to the end of our flushing states,
that way we will eventually bail out because we did end up with
space_info->full if we freed a chunk previously. Otherwise, as is the
case for this test, we'll allocate our chunk and continue on our happy
merry way.
Reported-by: David Sterba <dsterba@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-08-25 20:56:59 +00:00
|
|
|
*
|
|
|
|
* ALLOC_CHUNK_FORCE
|
|
|
|
* For data we start with alloc chunk force, however we could have been full
|
|
|
|
* before, and then the transaction commit could have freed new block groups,
|
|
|
|
* so if we now have space to allocate do the force chunk allocation.
|
2020-07-21 14:22:34 +00:00
|
|
|
*/
|
2020-07-21 14:22:33 +00:00
|
|
|
static const enum btrfs_flush_state data_flush_states[] = {
|
btrfs: use delalloc_bytes to determine flush amount for shrink_delalloc
We have been hitting some early ENOSPC issues in production with more
recent kernels, and I tracked it down to us simply not flushing delalloc
as aggressively as we should be. With tracing I was seeing us failing
all tickets with all of the block rsvs at or around 0, with very little
pinned space, but still around 120MiB of outstanding bytes_may_use.
Upon further investigation I saw that we were flushing around 14 pages
per shrink call for delalloc, despite having around 2GiB of delalloc
outstanding.
Consider the example of an 8-way machine, all CPUs trying to create a
file in parallel, which at the time of this commit requires 5 items to
do. Assuming a 16k leaf size, we have 10MiB of total metadata reclaim
size waiting on reservations. Now assume we have 128MiB of delalloc
outstanding. With our current math we would set items to 20, and then
set to_reclaim to 20 * 256k, or 5MiB.
Assuming that we went through this loop all 3 times, for both
FLUSH_DELALLOC and FLUSH_DELALLOC_WAIT, and then did the full loop
twice, we'd only flush 60MiB of the 128MiB delalloc space. This could
leave a fair bit of delalloc reservations still hanging around by the
time we go to ENOSPC out all the remaining tickets.
Fix this in two ways. First, change the calculations to be a fraction of
the total delalloc bytes on the system. Prior to this change we were
calculating based on dirty inodes so our math made more sense, now it's
just completely unrelated to what we're actually doing.
Second, add a FLUSH_DELALLOC_FULL state, that we hold off until we've
gone through the flush states at least once. This will empty the system
of all delalloc so we're sure to be truly out of space when we start
failing tickets.
I'm tagging stable 5.10 and forward, because this is where we started
using the page stuff heavily again. This affects earlier kernel
versions as well, but would be a pain to backport to them as the
flushing mechanisms aren't the same.
CC: stable@vger.kernel.org # 5.10+
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-07-14 18:47:20 +00:00
|
|
|
FLUSH_DELALLOC_FULL,
|
2020-07-21 14:22:33 +00:00
|
|
|
RUN_DELAYED_IPUTS,
|
|
|
|
COMMIT_TRANS,
|
2020-08-25 20:56:59 +00:00
|
|
|
ALLOC_CHUNK_FORCE,
|
2020-07-21 14:22:33 +00:00
|
|
|
};
|
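The numbers in the shrink_delalloc commit message above can be checked with a few lines of arithmetic. The sketch below only restates the figures quoted in that message (8 CPUs, 5 items per file create, 256k per item for a 16k leaf size, 20 items per shrink call, 3 loops over 2 delalloc states, 2 full passes). The constant and function names are made up for illustration and nothing here is measured.

```c
#include <assert.h>
#include <stdint.h>

#define KIB 1024ULL
#define MIB (1024ULL * KIB)

/* Figures quoted in the commit message; names are illustrative only. */
enum {
	EXAMPLE_CPUS      = 8,
	ITEMS_PER_CREATE  = 5,
	ITEMS_PER_SHRINK  = 20,
	LOOPS_PER_STATE   = 3,
	DELALLOC_STATES   = 2,	/* FLUSH_DELALLOC and FLUSH_DELALLOC_WAIT */
	FULL_PASSES       = 2,
};

/* Per-item reservation the message uses for a 16k leaf size. */
static const uint64_t bytes_per_item = 256 * KIB;

/* Metadata reclaim size waiting on reservations: 8 * 5 * 256k. */
static uint64_t waiting_reclaim_bytes(void)
{
	return EXAMPLE_CPUS * ITEMS_PER_CREATE * bytes_per_item;
}

/* Amount requested per shrink call with the old math: 20 * 256k. */
static uint64_t bytes_per_shrink(void)
{
	return ITEMS_PER_SHRINK * bytes_per_item;
}

/* Upper bound on delalloc flushed before tickets start failing. */
static uint64_t total_flushed_bytes(void)
{
	assert(bytes_per_shrink() > 0);
	return LOOPS_PER_STATE * DELALLOC_STATES * FULL_PASSES *
	       bytes_per_shrink();
}
```

With these inputs the results are 10MiB waiting, 5MiB per shrink call, and 60MiB flushed in total, which is less than half of the 128MiB of delalloc in the example.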
|
|
|
|
|
|
|
static void btrfs_async_reclaim_data_space(struct work_struct *work)
|
2019-06-18 20:09:25 +00:00
|
|
|
{
|
2020-07-21 14:22:33 +00:00
|
|
|
struct btrfs_fs_info *fs_info;
|
|
|
|
struct btrfs_space_info *space_info;
|
|
|
|
u64 last_tickets_id;
|
2020-10-09 13:28:18 +00:00
|
|
|
enum btrfs_flush_state flush_state = 0;
|
2020-07-21 14:22:33 +00:00
|
|
|
|
|
|
|
fs_info = container_of(work, struct btrfs_fs_info, async_data_reclaim_work);
|
|
|
|
space_info = fs_info->data_sinfo;
|
|
|
|
|
|
|
|
spin_lock(&space_info->lock);
|
|
|
|
if (list_empty(&space_info->tickets)) {
|
|
|
|
space_info->flush = 0;
|
|
|
|
spin_unlock(&space_info->lock);
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
last_tickets_id = space_info->tickets_id;
|
|
|
|
spin_unlock(&space_info->lock);
|
|
|
|
|
|
|
|
while (!space_info->full) {
|
2020-10-09 13:28:28 +00:00
|
|
|
flush_space(fs_info, space_info, U64_MAX, ALLOC_CHUNK_FORCE, false);
|
2020-07-21 14:22:33 +00:00
|
|
|
spin_lock(&space_info->lock);
|
|
|
|
if (list_empty(&space_info->tickets)) {
|
|
|
|
space_info->flush = 0;
|
|
|
|
spin_unlock(&space_info->lock);
|
|
|
|
return;
|
|
|
|
}
|
2021-10-05 20:35:26 +00:00
|
|
|
|
|
|
|
/* Something happened, fail everything and bail. */
|
|
|
|
if (BTRFS_FS_ERROR(fs_info))
|
|
|
|
goto aborted_fs;
|
2020-07-21 14:22:33 +00:00
|
|
|
last_tickets_id = space_info->tickets_id;
|
|
|
|
spin_unlock(&space_info->lock);
|
|
|
|
}
|
|
|
|
|
|
|
|
while (flush_state < ARRAY_SIZE(data_flush_states)) {
|
|
|
|
flush_space(fs_info, space_info, U64_MAX,
|
2020-10-09 13:28:28 +00:00
|
|
|
data_flush_states[flush_state], false);
|
2020-07-21 14:22:33 +00:00
|
|
|
spin_lock(&space_info->lock);
|
|
|
|
if (list_empty(&space_info->tickets)) {
|
|
|
|
space_info->flush = 0;
|
|
|
|
spin_unlock(&space_info->lock);
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (last_tickets_id == space_info->tickets_id) {
|
|
|
|
flush_state++;
|
|
|
|
} else {
|
|
|
|
last_tickets_id = space_info->tickets_id;
|
|
|
|
flush_state = 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (flush_state >= ARRAY_SIZE(data_flush_states)) {
|
|
|
|
if (space_info->full) {
|
|
|
|
if (maybe_fail_all_tickets(fs_info, space_info))
|
|
|
|
flush_state = 0;
|
|
|
|
else
|
|
|
|
space_info->flush = 0;
|
|
|
|
} else {
|
|
|
|
flush_state = 0;
|
|
|
|
}
|
2021-10-05 20:35:26 +00:00
|
|
|
|
|
|
|
/* Something happened, fail everything and bail. */
|
|
|
|
if (BTRFS_FS_ERROR(fs_info))
|
|
|
|
goto aborted_fs;
|
|
|
|
|
2020-07-21 14:22:33 +00:00
|
|
|
}
|
|
|
|
spin_unlock(&space_info->lock);
|
|
|
|
}
|
2021-10-05 20:35:26 +00:00
|
|
|
return;
|
|
|
|
|
|
|
|
aborted_fs:
|
|
|
|
maybe_fail_all_tickets(fs_info, space_info);
|
|
|
|
space_info->flush = 0;
|
|
|
|
spin_unlock(&space_info->lock);
|
2020-07-21 14:22:33 +00:00
|
|
|
}
|
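The second loop in btrfs_async_reclaim_data_space() above is a small state machine: stay on the flush-state list while progress is being made, step forward when it is not, and stop only once the list is exhausted with the space_info full and nothing left to retry. The following is a simplified userspace sketch of that stepping rule, not the kernel code. `struct flush_cursor`, `advance_flush_cursor()` and the `retry_after_fail` flag (standing in for the return value of maybe_fail_all_tickets()) are assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical cursor over a flush-state array such as data_flush_states. */
struct flush_cursor {
	size_t state;			/* index of the next state to run */
	uint64_t last_tickets_id;	/* tickets_id seen after the last flush */
};

/*
 * Update the cursor after one flush attempt.
 *
 * An unchanged tickets_id means no ticket was served, so move on to the
 * next state. A changed tickets_id means progress, so start over from the
 * first state. Returns false when the worker should stop.
 */
static bool advance_flush_cursor(struct flush_cursor *cur, size_t nr_states,
				 uint64_t tickets_id, bool full,
				 bool retry_after_fail)
{
	assert(nr_states > 0);

	if (cur->last_tickets_id == tickets_id) {
		cur->state++;
	} else {
		cur->last_tickets_id = tickets_id;
		cur->state = 0;
	}

	if (cur->state < nr_states)
		return true;

	/* Ran off the end of the list while full: stop unless told to retry. */
	if (full && !retry_after_fail)
		return false;

	cur->state = 0;
	return true;
}
```

Ending the list with ALLOC_CHUNK_FORCE matters here: it is what lets `full` become true again after a transaction commit freed block groups, so the stop condition can eventually be reached.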
|
|
|
|
|
|
|
void btrfs_init_async_reclaim_work(struct btrfs_fs_info *fs_info)
|
|
|
|
{
|
|
|
|
INIT_WORK(&fs_info->async_reclaim_work, btrfs_async_reclaim_metadata_space);
|
|
|
|
INIT_WORK(&fs_info->async_data_reclaim_work, btrfs_async_reclaim_data_space);
|
2020-10-09 13:28:22 +00:00
|
|
|
INIT_WORK(&fs_info->preempt_reclaim_work,
|
|
|
|
btrfs_preempt_reclaim_metadata_space);
|
2019-06-18 20:09:25 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
static const enum btrfs_flush_state priority_flush_states[] = {
|
|
|
|
FLUSH_DELAYED_ITEMS_NR,
|
|
|
|
FLUSH_DELAYED_ITEMS,
|
|
|
|
ALLOC_CHUNK,
|
|
|
|
};
|
|
|
|
|
2019-08-01 22:19:37 +00:00
|
|
|
static const enum btrfs_flush_state evict_flush_states[] = {
|
|
|
|
FLUSH_DELAYED_ITEMS_NR,
|
|
|
|
FLUSH_DELAYED_ITEMS,
|
|
|
|
FLUSH_DELAYED_REFS_NR,
|
|
|
|
FLUSH_DELAYED_REFS,
|
|
|
|
FLUSH_DELALLOC,
|
|
|
|
FLUSH_DELALLOC_WAIT,
|
2021-07-14 18:47:20 +00:00
|
|
|
FLUSH_DELALLOC_FULL,
|
2019-08-01 22:19:37 +00:00
|
|
|
ALLOC_CHUNK,
|
|
|
|
COMMIT_TRANS,
|
|
|
|
};
|
|
|
|
|
2019-06-18 20:09:25 +00:00
|
|
|
static void priority_reclaim_metadata_space(struct btrfs_fs_info *fs_info,
|
2019-08-01 22:19:36 +00:00
|
|
|
struct btrfs_space_info *space_info,
|
|
|
|
struct reserve_ticket *ticket,
|
|
|
|
const enum btrfs_flush_state *states,
|
|
|
|
int states_nr)
|
2019-06-18 20:09:25 +00:00
|
|
|
{
|
|
|
|
u64 to_reclaim;
|
2021-11-09 15:12:01 +00:00
|
|
|
int flush_state = 0;
|
2019-06-18 20:09:25 +00:00
|
|
|
|
|
|
|
spin_lock(&space_info->lock);
|
2019-11-26 16:25:53 +00:00
|
|
|
to_reclaim = btrfs_calc_reclaim_metadata_size(fs_info, space_info);
|
2021-11-09 15:12:02 +00:00
|
|
|
/*
|
|
|
|
* This is the priority reclaim path, so to_reclaim could be >0 still
|
2022-05-25 14:27:25 +00:00
|
|
|
* because we may have only satisfied the priority tickets and still
|
2021-11-09 15:12:02 +00:00
|
|
|
* left non priority tickets on the list. We would then have
|
|
|
|
* to_reclaim but ->bytes == 0.
|
|
|
|
*/
|
|
|
|
if (ticket->bytes == 0) {
|
2019-06-18 20:09:25 +00:00
|
|
|
spin_unlock(&space_info->lock);
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
2021-11-09 15:12:01 +00:00
|
|
|
while (flush_state < states_nr) {
|
|
|
|
spin_unlock(&space_info->lock);
|
2020-10-09 13:28:28 +00:00
|
|
|
flush_space(fs_info, space_info, to_reclaim, states[flush_state],
|
|
|
|
false);
|
2019-06-18 20:09:25 +00:00
|
|
|
flush_state++;
|
|
|
|
spin_lock(&space_info->lock);
|
|
|
|
if (ticket->bytes == 0) {
|
|
|
|
spin_unlock(&space_info->lock);
|
|
|
|
return;
|
|
|
|
}
|
2021-11-09 15:12:01 +00:00
|
|
|
}
|
|
|
|
|
2023-07-26 15:57:03 +00:00
|
|
|
/*
|
|
|
|
* Attempt to steal from the global rsv if we can, except if the fs was
|
|
|
|
* turned into error mode due to a transaction abort when flushing space
|
2023-07-26 15:57:06 +00:00
|
|
|
* above, in that case fail with the abort error instead of returning
|
|
|
|
* success to the caller if we can steal from the global rsv - this is
|
|
|
|
* just to have the caller fail immediately instead of later when trying to
|
|
|
|
* modify the fs, making it easier to debug -ENOSPC problems.
|
2023-07-26 15:57:03 +00:00
|
|
|
*/
|
|
|
|
if (BTRFS_FS_ERROR(fs_info)) {
|
2023-07-26 15:57:06 +00:00
|
|
|
ticket->error = BTRFS_FS_ERROR(fs_info);
|
2023-07-26 15:57:03 +00:00
|
|
|
remove_ticket(space_info, ticket);
|
|
|
|
} else if (!steal_from_global_rsv(fs_info, space_info, ticket)) {
|
2021-11-09 15:12:04 +00:00
|
|
|
ticket->error = -ENOSPC;
|
|
|
|
remove_ticket(space_info, ticket);
|
|
|
|
}
|
|
|
|
|
2021-11-09 15:12:01 +00:00
|
|
|
/*
|
|
|
|
* We must run try_granting_tickets here because we could be a large
|
|
|
|
* ticket in front of a smaller ticket that can now be satisfied with
|
|
|
|
* the available space.
|
|
|
|
*/
|
|
|
|
btrfs_try_granting_tickets(fs_info, space_info);
|
|
|
|
spin_unlock(&space_info->lock);
|
2019-06-18 20:09:25 +00:00
|
|
|
}
|
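The control flow of priority_reclaim_metadata_space() above can be summarised as: run each flush state once, return as soon as the ticket is satisfied, and otherwise decide the ticket's fate from the filesystem error state and the global reserve. The sketch below models only that decision order in plain C. The callback type, the `fs_error` and `can_steal` parameters, and the helper names are hypothetical stand-ins for flush_space(), BTRFS_FS_ERROR() and steal_from_global_rsv(); locking and ticket list handling are left out.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical flush callback: returns true once the ticket is satisfied. */
typedef bool (*flush_fn)(int state);

/*
 * Walk the states once, in order. On failure, an aborted filesystem wins
 * over stealing from the global reserve, and -ENOSPC is the last resort.
 */
static int priority_reclaim_sketch(const int *states, int states_nr,
				   flush_fn flush, int fs_error,
				   bool can_steal)
{
	assert(flush);

	for (int i = 0; i < states_nr; i++) {
		if (flush(states[i]))
			return 0;
	}

	if (fs_error)
		return fs_error;
	return can_steal ? 0 : -ENOSPC;
}

/* Example callbacks used to exercise the sketch. */
static bool never_satisfied(int state)
{
	(void)state;
	return false;
}

static bool satisfied_at_two(int state)
{
	return state == 2;
}
```

Checking the error state before the steal attempt mirrors the comment in the function: the caller fails immediately with the abort error instead of succeeding now and failing later.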
|
|
|
|
2020-07-21 14:22:26 +00:00
|
|
|
static void priority_reclaim_data_space(struct btrfs_fs_info *fs_info,
|
|
|
|
struct btrfs_space_info *space_info,
|
2020-07-21 14:22:33 +00:00
|
|
|
struct reserve_ticket *ticket)
|
2020-07-21 14:22:26 +00:00
|
|
|
{
|
2021-11-09 15:12:01 +00:00
|
|
|
spin_lock(&space_info->lock);
|
2021-11-09 15:12:02 +00:00
|
|
|
|
|
|
|
/* We could have been granted before we got here. */
|
|
|
|
if (ticket->bytes == 0) {
|
|
|
|
spin_unlock(&space_info->lock);
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
2020-07-21 14:22:26 +00:00
|
|
|
while (!space_info->full) {
|
2021-11-09 15:12:01 +00:00
|
|
|
spin_unlock(&space_info->lock);
|
2020-10-09 13:28:28 +00:00
|
|
|
flush_space(fs_info, space_info, U64_MAX, ALLOC_CHUNK_FORCE, false);
|
2020-07-21 14:22:26 +00:00
|
|
|
spin_lock(&space_info->lock);
|
|
|
|
if (ticket->bytes == 0) {
|
|
|
|
spin_unlock(&space_info->lock);
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
}
|
2021-11-09 15:12:01 +00:00
|
|
|
|
|
|
|
ticket->error = -ENOSPC;
|
|
|
|
remove_ticket(space_info, ticket);
|
|
|
|
btrfs_try_granting_tickets(fs_info, space_info);
|
|
|
|
spin_unlock(&space_info->lock);
|
2020-07-21 14:22:26 +00:00
|
|
|
}
|
|
|
|
|
2024-10-09 14:31:02 +00:00
|
|
|
static void wait_reserve_ticket(struct btrfs_space_info *space_info,
|
2019-08-01 22:19:34 +00:00
|
|
|
struct reserve_ticket *ticket)
|
2019-06-18 20:09:25 +00:00
|
|
|
|
|
|
|
{
|
|
|
|
DEFINE_WAIT(wait);
|
|
|
|
int ret = 0;
|
|
|
|
|
|
|
|
spin_lock(&space_info->lock);
|
|
|
|
while (ticket->bytes > 0 && ticket->error == 0) {
|
|
|
|
ret = prepare_to_wait_event(&ticket->wait, &wait, TASK_KILLABLE);
|
|
|
|
if (ret) {
|
Btrfs: fix race leading to metadata space leak after task received signal
When a task that is allocating metadata needs to wait for the async
reclaim job to process its ticket and gets a signal (because it was killed
for example) before doing the wait, the task ends up erroring out but
with space reserved for its ticket, which never gets released, resulting
in a metadata space leak (more specifically a leak in the bytes_may_use
counter of the metadata space_info object).
Here's the sequence of steps leading to the space leak:
1) A task tries to create a file for example, so it ends up trying to
start a transaction at btrfs_create();
2) The filesystem is currently in a state where there is not enough
metadata free space to satisfy the transaction's needs. So at
space-info.c:__reserve_metadata_bytes() we create a ticket and
add it to the list of tickets of the space info object. Also,
because the metadata async reclaim job is not running, we queue
a job ro run metadata reclaim;
3) In the meanwhile the task receives a signal (like SIGTERM from
a kill command for example);
4) After queuing the async reclaim job, at __reserve_metadata_bytes(),
we unlock the metadata space info and call handle_reserve_ticket();
5) That last function calls wait_reserve_ticket(), which acquires the
lock from the metadata space info. Then in the first iteration of
its while loop, it calls prepare_to_wait_event(), which returns
-ERESTARTSYS because the task has a pending signal. As a result,
we set the error field of the ticket to -EINTR and exit the while
loop without deleting the ticket from the list of tickets (in the
space info object). After exiting the loop we unlock the space info;
6) The async reclaim job is able to release enough metadata, acquires
the metadata space info's lock and then reserves space for the ticket,
since the ticket is still in the list of (non-priority) tickets. The
space reservation happens at btrfs_try_granting_tickets(), called from
maybe_fail_all_tickets(). This increments the bytes_may_use counter
from the metadata space info object, sets the ticket's bytes field to
zero (meaning success, that space was reserved) and removes it from
the list of tickets;
7) wait_reserve_ticket() returns, with the error field of the ticket
set to -EINTR. Then handle_reserve_ticket() just propagates that error
to the caller. Because an error was returned, the caller does not
release the reserved space, since the expectation is that any error
means no space was reserved.
Fix this by removing the ticket from the list, while holding the space
info lock, at wait_reserve_ticket() when prepare_to_wait_event() returns
an error.
Also add some comments and an assertion to guarantee we never end up with
a ticket that has an error set and a bytes counter field set to zero, to
more easily detect regressions in the future.
This issue could be triggered sporadically by some test cases from fstests
such as generic/269 for example, which tries to fill a filesystem and then
kills fsstress processes running in the background.
When this issue happens, we get a warning in syslog/dmesg when unmounting
the filesystem, like the following:
------------[ cut here ]------------
WARNING: CPU: 0 PID: 13240 at fs/btrfs/block-group.c:3186 btrfs_free_block_groups+0x314/0x470 [btrfs]
(...)
CPU: 0 PID: 13240 Comm: umount Tainted: G W L 5.3.0-rc8-btrfs-next-48+ #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-0-ga698c8995f-prebuilt.qemu.org 04/01/2014
RIP: 0010:btrfs_free_block_groups+0x314/0x470 [btrfs]
(...)
RSP: 0018:ffff9910c14cfdb8 EFLAGS: 00010286
RAX: 0000000000000024 RBX: ffff89cd8a4d55f0 RCX: 0000000000000000
RDX: 0000000000000000 RSI: ffff89cdf6a178a8 RDI: ffff89cdf6a178a8
RBP: ffff9910c14cfde8 R08: 0000000000000000 R09: 0000000000000001
R10: ffff89cd4d618040 R11: 0000000000000000 R12: ffff89cd8a4d5508
R13: ffff89cde7c4a600 R14: dead000000000122 R15: dead000000000100
FS: 00007f42754432c0(0000) GS:ffff89cdf6a00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fd25a47f730 CR3: 000000021f8d6006 CR4: 00000000003606f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
close_ctree+0x1ad/0x390 [btrfs]
generic_shutdown_super+0x6c/0x110
kill_anon_super+0xe/0x30
btrfs_kill_super+0x12/0xa0 [btrfs]
deactivate_locked_super+0x3a/0x70
cleanup_mnt+0xb4/0x160
task_work_run+0x7e/0xc0
exit_to_usermode_loop+0xfa/0x100
do_syscall_64+0x1cb/0x220
entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x7f4274d2cb37
(...)
RSP: 002b:00007ffcff701d38 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
RAX: 0000000000000000 RBX: 0000557ebde2f060 RCX: 00007f4274d2cb37
RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000557ebde2f240
RBP: 0000557ebde2f240 R08: 0000557ebde2f270 R09: 0000000000000015
R10: 00000000000006b4 R11: 0000000000000246 R12: 00007f427522ee64
R13: 0000000000000000 R14: 0000000000000000 R15: 00007ffcff701fc0
irq event stamp: 0
hardirqs last enabled at (0): [<0000000000000000>] 0x0
hardirqs last disabled at (0): [<ffffffffb12b561e>] copy_process+0x75e/0x1fd0
softirqs last enabled at (0): [<ffffffffb12b561e>] copy_process+0x75e/0x1fd0
softirqs last disabled at (0): [<0000000000000000>] 0x0
---[ end trace bcf4b235461b26f6 ]---
BTRFS info (device sdb): space_info 4 has 19116032 free, is full
BTRFS info (device sdb): space_info total=33554432, used=14176256, pinned=0, reserved=0, may_use=196608, readonly=65536
BTRFS info (device sdb): global_block_rsv: size 0 reserved 0
BTRFS info (device sdb): trans_block_rsv: size 0 reserved 0
BTRFS info (device sdb): chunk_block_rsv: size 0 reserved 0
BTRFS info (device sdb): delayed_block_rsv: size 0 reserved 0
BTRFS info (device sdb): delayed_refs_rsv: size 0 reserved 0
Fixes: 374bf9c5cd7d0b ("btrfs: unify error handling for ticket flushing")
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-10-25 09:53:41 +00:00
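The race described in the commit message above can be reproduced in a small userspace model. This is only a sketch under stated assumptions: every name below (`model_ticket`, `model_space`, `model_interrupt_wait`, `model_try_grant`, `model_leak`) is hypothetical and not part of btrfs, locking is omitted, and the `queued` flag stands in for membership in the space_info ticket list.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical userspace model; none of these names exist in btrfs. */
struct model_ticket {
	uint64_t bytes;	/* bytes still wanted; 0 means the reservation was granted */
	int error;	/* 0 or a negative errno */
	bool queued;	/* stands in for being on the space_info ticket list */
};

struct model_space {
	uint64_t bytes_may_use;	/* outstanding reservations */
	uint64_t free_bytes;	/* space the reclaim job managed to free */
};

/* The waiter saw a pending signal. "fixed" selects the post-fix behaviour. */
static void model_interrupt_wait(struct model_ticket *t, bool fixed)
{
	if (fixed)
		t->queued = false;	/* dequeue before the lock would be dropped */
	t->error = -EINTR;
}

/* Async reclaim granting space to a ticket that is still queued. */
static void model_try_grant(struct model_space *s, struct model_ticket *t)
{
	if (!t->queued || t->bytes > s->free_bytes)
		return;
	s->free_bytes -= t->bytes;
	s->bytes_may_use += t->bytes;
	t->bytes = 0;
	t->queued = false;
}

/* Returns the bytes left behind in bytes_may_use after an interrupted wait. */
static uint64_t model_leak(bool fixed)
{
	struct model_space s = { .bytes_may_use = 0, .free_bytes = 1 << 20 };
	struct model_ticket t = { .bytes = 196608, .error = 0, .queued = true };

	model_interrupt_wait(&t, fixed);
	model_try_grant(&s, &t);
	/* The caller sees t.error != 0 and therefore releases nothing. */
	return t.error ? s.bytes_may_use : 0;
}
```

In this model, `model_leak(false)` leaves the ticket's bytes accounted in `bytes_may_use` even though the caller got an error, while `model_leak(true)` leaves nothing behind because the ticket was dequeued before reclaim could grant it.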
|
|
|
/*
|
|
|
|
* Delete us from the list. After we unlock the space
|
|
|
|
* info, we don't want the async reclaim job to reserve
|
|
|
|
* space for this ticket. If that would happen, then the
|
|
|
|
* ticket's task would not know that space was reserved
|
|
|
|
* despite getting an error, resulting in a space leak
|
|
|
|
* (bytes_may_use counter of our space_info).
|
|
|
|
*/
|
2020-04-07 10:38:49 +00:00
|
|
|
remove_ticket(space_info, ticket);
|
2019-08-01 22:19:34 +00:00
|
|
|
ticket->error = -EINTR;
|
2019-06-18 20:09:25 +00:00
|
|
|
break;
|
|
|
|
}
|
|
|
|
spin_unlock(&space_info->lock);
|
|
|
|
|
|
|
|
schedule();
|
|
|
|
|
|
|
|
finish_wait(&ticket->wait, &wait);
|
|
|
|
spin_lock(&space_info->lock);
|
|
|
|
}
|
|
|
|
spin_unlock(&space_info->lock);
|
|
|
|
}
|
|
|
|
|
2022-10-27 12:21:42 +00:00
|
|
|
/*
|
|
|
|
* Do the appropriate flushing and waiting for a ticket.
|
2021-01-22 09:58:02 +00:00
|
|
|
*
|
|
|
|
* @fs_info: the filesystem
|
|
|
|
* @space_info: space info for the reservation
|
|
|
|
* @ticket: ticket for the reservation
|
2020-10-09 13:28:19 +00:00
|
|
|
* @start_ns: timestamp when the reservation started
|
|
|
|
* @orig_bytes: amount of bytes originally reserved
|
2021-01-22 09:58:02 +00:00
|
|
|
* @flush: how much we can flush
|
2019-08-01 22:19:35 +00:00
|
|
|
*
|
|
|
|
* This does the work of figuring out how to flush for the ticket, waiting for
|
|
|
|
* the reservation, and returning the appropriate error if there is one.
|
|
|
|
*/
|
|
|
|
static int handle_reserve_ticket(struct btrfs_fs_info *fs_info,
|
|
|
|
struct btrfs_space_info *space_info,
|
|
|
|
struct reserve_ticket *ticket,
|
2020-10-09 13:28:19 +00:00
|
|
|
u64 start_ns, u64 orig_bytes,
|
2019-08-01 22:19:35 +00:00
|
|
|
enum btrfs_reserve_flush_enum flush)
|
|
|
|
{
|
|
|
|
int ret;
|
|
|
|
|
2019-08-01 22:19:37 +00:00
|
|
|
switch (flush) {
|
2020-07-21 14:22:33 +00:00
|
|
|
case BTRFS_RESERVE_FLUSH_DATA:
|
2019-08-01 22:19:37 +00:00
|
|
|
case BTRFS_RESERVE_FLUSH_ALL:
|
2020-03-13 19:58:05 +00:00
|
|
|
case BTRFS_RESERVE_FLUSH_ALL_STEAL:
|
2024-10-09 14:31:02 +00:00
|
|
|
wait_reserve_ticket(space_info, ticket);
|
2019-08-01 22:19:37 +00:00
|
|
|
break;
|
|
|
|
case BTRFS_RESERVE_FLUSH_LIMIT:
|
2019-08-01 22:19:36 +00:00
|
|
|
priority_reclaim_metadata_space(fs_info, space_info, ticket,
|
|
|
|
priority_flush_states,
|
|
|
|
ARRAY_SIZE(priority_flush_states));
|
2019-08-01 22:19:37 +00:00
|
|
|
break;
|
|
|
|
case BTRFS_RESERVE_FLUSH_EVICT:
|
|
|
|
priority_reclaim_metadata_space(fs_info, space_info, ticket,
|
|
|
|
evict_flush_states,
|
|
|
|
ARRAY_SIZE(evict_flush_states));
|
|
|
|
break;
|
2020-07-21 14:22:26 +00:00
|
|
|
case BTRFS_RESERVE_FLUSH_FREE_SPACE_INODE:
|
2020-07-21 14:22:33 +00:00
|
|
|
priority_reclaim_data_space(fs_info, space_info, ticket);
|
2020-07-21 14:22:26 +00:00
|
|
|
break;
|
2019-08-01 22:19:37 +00:00
|
|
|
default:
|
|
|
|
ASSERT(0);
|
|
|
|
break;
|
|
|
|
}
|
2019-08-01 22:19:35 +00:00
|
|
|
|
|
|
|
ret = ticket->error;
|
|
|
|
ASSERT(list_empty(&ticket->list));
|
2019-10-25 09:53:41 +00:00
|
|
|
/*
|
|
|
|
* Check that we can't have an error set if the reservation succeeded,
|
|
|
|
* as that would confuse tasks and lead them to error out without
|
|
|
|
* releasing reserved space (if an error happens the expectation is that
|
|
|
|
* space wasn't reserved at all).
|
|
|
|
*/
|
|
|
|
ASSERT(!(ticket->bytes == 0 && ticket->error));
|
2020-10-09 13:28:19 +00:00
|
|
|
trace_btrfs_reserve_ticket(fs_info, space_info->flags, orig_bytes,
|
|
|
|
start_ns, flush, ticket->error);
|
2019-08-01 22:19:35 +00:00
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2020-03-13 19:58:08 +00:00
|
|
|
/*
|
|
|
|
* This returns true if this flush state will go through the ordinary flushing
|
|
|
|
* code.
|
|
|
|
*/
|
|
|
|
static inline bool is_normal_flushing(enum btrfs_reserve_flush_enum flush)
|
|
|
|
{
|
|
|
|
return (flush == BTRFS_RESERVE_FLUSH_ALL) ||
|
|
|
|
(flush == BTRFS_RESERVE_FLUSH_ALL_STEAL);
|
|
|
|
}
|
|
|
|
|
2020-10-09 13:28:27 +00:00
|
|
|
static inline void maybe_clamp_preempt(struct btrfs_fs_info *fs_info,
|
|
|
|
struct btrfs_space_info *space_info)
|
|
|
|
{
|
|
|
|
u64 ordered = percpu_counter_sum_positive(&fs_info->ordered_bytes);
|
|
|
|
u64 delalloc = percpu_counter_sum_positive(&fs_info->delalloc_bytes);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* If we're heavy on ordered operations then clamping won't help us. We
|
|
|
|
* need to clamp specifically to keep up with dirty'ing buffered
|
|
|
|
* writers, because there's not a 1:1 correlation of writing delalloc
|
|
|
|
* and freeing space, like there is with flushing delayed refs or
|
|
|
|
* delayed nodes. If we're already more ordered than delalloc then
|
|
|
|
* we're keeping up, otherwise we aren't and should probably clamp.
|
|
|
|
*/
|
|
|
|
if (ordered < delalloc)
|
|
|
|
space_info->clamp = min(space_info->clamp + 1, 8);
|
|
|
|
}
|
|
|
|
|
2021-11-09 15:12:04 +00:00
|
|
|
static inline bool can_steal(enum btrfs_reserve_flush_enum flush)
|
|
|
|
{
|
|
|
|
return (flush == BTRFS_RESERVE_FLUSH_ALL_STEAL ||
|
|
|
|
flush == BTRFS_RESERVE_FLUSH_EVICT);
|
|
|
|
}
|
|
|
|
|
btrfs: introduce BTRFS_RESERVE_FLUSH_EMERGENCY
Inside of FB, as well as some user reports, we've had a consistent
problem of occasional ENOSPC transaction aborts. Inside FB we were
seeing ~100-200 ENOSPC aborts per day in the fleet, which is a really
low occurrence rate given the size of our fleet, but it's not nothing.
There are two causes of this particular problem.
First is delayed allocation. The reservation system for delalloc
assumes that contiguous dirty ranges will result in 1 file extent item.
However if there is memory pressure that results in fragmented writeout,
or there is fragmentation in the block groups, this won't necessarily be
true. Consider the case where we do a single 256MiB write to a file and
then close it. We will have 1 reservation for the inode update, the
reservations for the checksum updates, and 1 reservation for the file
extent item. At some point later we decide to write this entire range
out, but we're so fragmented that we break this into 100 different file
extents. Since we've already closed the file and are no longer writing
to it there's nothing to trigger a refill of the delalloc block rsv to
satisfy the 99 new file extent reservations we need. At this point we
exhaust our delalloc reservation, and we begin to steal from the global
reserve. If you have enough of these cases going in parallel you can
easily exhaust the global reserve, get an ENOSPC at
btrfs_alloc_tree_block() time, and then abort the transaction.
The other case is the delayed refs reserve. The delayed refs reserve
updates its size based on outstanding delayed refs and dirty block
groups. However we only refill this block reserve when returning
excess reservations and when we call btrfs_start_transaction(root, X).
We will reserve 2*X credits at transaction start time, and fill in X
into the delayed refs reserve to make sure it stays topped off.
Generally this works well, but clearly has downsides. If we do a
particularly delayed ref heavy operation we may never catch up in our
reservations. Additionally running delayed refs generates more delayed
refs, and at that point we may be committing the transaction and have no
way to trigger a refill of our delayed refs rsv. Then a similar thing
occurs with the delalloc reserve.
Generally speaking we well over-reserve in all of our block rsvs. If we
reserve 1 credit we're usually reserving around 264k of space, but we'll
often not use any of that reservation, or use a few blocks of that
reservation. We can be reasonably sure that as long as you were able to
reserve space up front for your operation you'll be able to find space
on disk for that reservation.
So introduce a new flushing state, BTRFS_RESERVE_FLUSH_EMERGENCY. This
gets used in the case that we've exhausted our reserve and the global
reserve. It simply forces a reservation if we have enough actual space
on disk to make the reservation, which is almost always the case. This
keeps us from hitting ENOSPC aborts in these odd occurrences where we've
not kept up with the delayed work.
Fixing this in a complete way is going to be relatively complicated and
time consuming. This patch is what I discussed with Filipe earlier this
year, and what I put into our kernels inside FB. With this patch we're
down to 1-2 ENOSPC aborts per week, which is a significant reduction.
This is a decent stop gap until we can work out a more holistic
solution to these two corner cases.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-09 13:35:01 +00:00
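The emergency path described in the commit message above can be sketched as a small userspace model. This is a simplified illustration, not the kernel implementation: the names `model_flush`, `model_space_info`, and `model_reserve` are hypothetical, pending tickets and overcommit are ignored, and space accounting is reduced to two counters. The assumption modelled here is that the normal check counts outstanding reservations (`bytes_may_use`) as used, while the emergency check only counts space that is really allocated.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical stand-ins for the flush modes; not the btrfs enum. */
enum model_flush {
	MODEL_NO_FLUSH,
	MODEL_FLUSH_ALL,
	MODEL_FLUSH_EMERGENCY,
};

/* Hypothetical, heavily reduced space accounting. */
struct model_space_info {
	uint64_t total_bytes;
	uint64_t bytes_used;	/* space that is really allocated */
	uint64_t bytes_may_use;	/* outstanding reservations */
};

/* Returns 0 and records the reservation on success, -ENOSPC otherwise. */
static int model_reserve(struct model_space_info *s, uint64_t bytes,
			 enum model_flush flush)
{
	uint64_t used = s->bytes_used + s->bytes_may_use;

	assert(bytes > 0);

	/* Normal path: reservations count against the total. */
	if (used + bytes <= s->total_bytes) {
		s->bytes_may_use += bytes;
		return 0;
	}

	/* Emergency path: only really allocated space counts. */
	if (flush == MODEL_FLUSH_EMERGENCY &&
	    s->bytes_used + bytes <= s->total_bytes) {
		s->bytes_may_use += bytes;
		return 0;
	}

	return -ENOSPC;
}
```

With `total_bytes = 100`, `bytes_used = 40`, and `bytes_may_use = 60`, a 10-byte request fails under `MODEL_FLUSH_ALL` but succeeds under `MODEL_FLUSH_EMERGENCY`, because the over-reserved but unallocated space is not counted against it.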
|
|
|
/*
|
|
|
|
* NO_FLUSH and FLUSH_EMERGENCY don't want to create a ticket, they just want to
|
|
|
|
* fail as quickly as possible.
|
|
|
|
*/
|
|
|
|
static inline bool can_ticket(enum btrfs_reserve_flush_enum flush)
|
|
|
|
{
|
|
|
|
return (flush != BTRFS_RESERVE_NO_FLUSH &&
|
|
|
|
flush != BTRFS_RESERVE_FLUSH_EMERGENCY);
|
|
|
|
}
|
|
|
|
|
2022-10-27 12:21:42 +00:00
|
|
|
/*
|
|
|
|
* Try to reserve bytes from the block_rsv's space.
|
2021-01-22 09:58:02 +00:00
|
|
|
*
|
|
|
|
* @fs_info: the filesystem
|
|
|
|
* @space_info: space info we want to allocate from
|
|
|
|
* @orig_bytes: number of bytes we want
|
|
|
|
* @flush: whether or not we can flush to make our reservation
|
2019-06-18 20:09:25 +00:00
|
|
|
*
|
|
|
|
* This will reserve orig_bytes number of bytes from the space info associated
|
|
|
|
* with the block_rsv. If there is not enough space it will make an attempt to
|
|
|
|
* flush out space to make room. It will do this by flushing delalloc if
|
|
|
|
* possible or committing the transaction. If flush is 0 then no attempts to
|
|
|
|
* regain reservations will be made and this will fail if there is not enough
|
|
|
|
* space already.
|
|
|
|
*/
|
2020-07-21 14:22:28 +00:00
|
|
|
static int __reserve_bytes(struct btrfs_fs_info *fs_info,
|
|
|
|
struct btrfs_space_info *space_info, u64 orig_bytes,
|
|
|
|
enum btrfs_reserve_flush_enum flush)
|
2019-06-18 20:09:25 +00:00
|
|
|
{
|
2020-07-21 14:22:33 +00:00
|
|
|
struct work_struct *async_work;
|
2019-06-18 20:09:25 +00:00
|
|
|
struct reserve_ticket ticket;
|
2020-10-09 13:28:19 +00:00
|
|
|
u64 start_ns = 0;
|
2019-06-18 20:09:25 +00:00
|
|
|
u64 used;
|
2023-03-21 11:13:42 +00:00
|
|
|
int ret = -ENOSPC;
|
2019-08-22 19:10:54 +00:00
|
|
|
bool pending_tickets;
|
2019-06-18 20:09:25 +00:00
|
|
|
|
|
|
|
ASSERT(orig_bytes);
|
2023-03-21 11:13:41 +00:00
|
|
|
/*
|
|
|
|
* If we have a transaction handle (current->journal_info != NULL), then
|
|
|
|
* the flush method can be neither BTRFS_RESERVE_FLUSH_ALL* nor
|
|
|
|
* BTRFS_RESERVE_FLUSH_EVICT, as we could deadlock because those
|
|
|
|
* flushing methods can trigger transaction commits.
|
|
|
|
*/
|
|
|
|
if (current->journal_info) {
|
|
|
|
/* One assert per line for easier debugging. */
|
|
|
|
ASSERT(flush != BTRFS_RESERVE_FLUSH_ALL);
|
|
|
|
ASSERT(flush != BTRFS_RESERVE_FLUSH_ALL_STEAL);
|
|
|
|
ASSERT(flush != BTRFS_RESERVE_FLUSH_EVICT);
|
|
|
|
}
|
2019-06-18 20:09:25 +00:00
|
|
|
|
2020-07-21 14:22:33 +00:00
|
|
|
if (flush == BTRFS_RESERVE_FLUSH_DATA)
|
|
|
|
async_work = &fs_info->async_data_reclaim_work;
|
|
|
|
else
|
|
|
|
async_work = &fs_info->async_reclaim_work;
|
|
|
|
|
2019-06-18 20:09:25 +00:00
|
|
|
spin_lock(&space_info->lock);
|
|
|
|
used = btrfs_space_info_used(space_info, true);
|
2020-03-13 19:58:08 +00:00
|
|
|
|
|
|
|
/*
|
|
|
|
* We don't want NO_FLUSH allocations to jump everybody, they can
|
|
|
|
* generally handle ENOSPC in a different way, so treat them the same as
|
|
|
|
* normal flushers when it comes to skipping pending tickets.
|
|
|
|
*/
|
|
|
|
if (is_normal_flushing(flush) || (flush == BTRFS_RESERVE_NO_FLUSH))
|
|
|
|
pending_tickets = !list_empty(&space_info->tickets) ||
|
|
|
|
!list_empty(&space_info->priority_tickets);
|
|
|
|
else
|
|
|
|
pending_tickets = !list_empty(&space_info->priority_tickets);
|
2019-06-18 20:09:25 +00:00
|
|
|
|
|
|
|
/*
|
2019-06-25 18:11:31 +00:00
|
|
|
* Carry on if we have enough space (short-circuit) OR call
|
|
|
|
* can_overcommit() to ensure we can overcommit to continue.
|
2019-06-18 20:09:25 +00:00
|
|
|
*/
|
2019-08-22 19:10:54 +00:00
|
|
|
if (!pending_tickets &&
|
2023-03-13 07:06:14 +00:00
|
|
|
((used + orig_bytes <= space_info->total_bytes) ||
|
2020-01-17 14:07:39 +00:00
|
|
|
btrfs_can_overcommit(fs_info, space_info, orig_bytes, flush))) {
|
2019-06-18 20:09:25 +00:00
|
|
|
btrfs_space_info_update_bytes_may_use(fs_info, space_info,
|
|
|
|
orig_bytes);
|
|
|
|
ret = 0;
|
|
|
|
}
|
|
|
|
|
2022-09-09 13:35:01 +00:00
|
|
|
/*
|
|
|
|
* Things are dire, we need to make a reservation so we don't abort. We
|
|
|
|
* will let this reservation go through as long as we have actual space
|
|
|
|
* left to allocate for the block.
|
|
|
|
*/
|
|
|
|
if (ret && unlikely(flush == BTRFS_RESERVE_FLUSH_EMERGENCY)) {
|
|
|
|
used = btrfs_space_info_used(space_info, false);
|
2023-03-13 07:06:14 +00:00
|
|
|
if (used + orig_bytes <= space_info->total_bytes) {
|
2022-09-09 13:35:01 +00:00
|
|
|
btrfs_space_info_update_bytes_may_use(fs_info, space_info,
|
|
|
|
orig_bytes);
|
|
|
|
ret = 0;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2019-06-18 20:09:25 +00:00
|
|
|
/*
|
|
|
|
* If we couldn't make a reservation then set up our reservation ticket
|
|
|
|
* and kick the async worker if it's not already running.
|
|
|
|
*
|
|
|
|
* If we are a priority flusher then we just need to add our ticket to
|
|
|
|
* the list and we will do our own flushing further down.
|
|
|
|
*/
|
2022-09-09 13:35:01 +00:00
|
|
|
if (ret && can_ticket(flush)) {
|
2019-06-18 20:09:25 +00:00
|
|
|
ticket.bytes = orig_bytes;
|
|
|
|
ticket.error = 0;
|
2020-03-10 09:00:35 +00:00
|
|
|
space_info->reclaim_size += ticket.bytes;
|
2019-06-18 20:09:25 +00:00
|
|
|
init_waitqueue_head(&ticket.wait);
|
2021-11-09 15:12:04 +00:00
|
|
|
ticket.steal = can_steal(flush);
|
2020-10-09 13:28:19 +00:00
|
|
|
if (trace_btrfs_reserve_ticket_enabled())
|
|
|
|
start_ns = ktime_get_ns();
|
|
|
|
|
2020-03-13 19:58:05 +00:00
|
|
|
if (flush == BTRFS_RESERVE_FLUSH_ALL ||
|
2020-07-21 14:22:33 +00:00
|
|
|
flush == BTRFS_RESERVE_FLUSH_ALL_STEAL ||
|
|
|
|
flush == BTRFS_RESERVE_FLUSH_DATA) {
|
2019-06-18 20:09:25 +00:00
|
|
|
list_add_tail(&ticket.list, &space_info->tickets);
|
|
|
|
if (!space_info->flush) {
|
2021-04-28 17:38:43 +00:00
|
|
|
/*
|
|
|
|
* We were forced to add a reserve ticket, so
|
|
|
|
* our preemptive flushing is unable to keep
|
|
|
|
* up. Clamp down on the threshold for the
|
|
|
|
* preemptive flushing in order to keep up with
|
|
|
|
* the workload.
|
|
|
|
*/
|
|
|
|
maybe_clamp_preempt(fs_info, space_info);
|
|
|
|
|
2019-06-18 20:09:25 +00:00
|
|
|
space_info->flush = 1;
|
|
|
|
trace_btrfs_trigger_flush(fs_info,
|
|
|
|
space_info->flags,
|
|
|
|
orig_bytes, flush,
|
|
|
|
"enospc");
|
2020-07-21 14:22:33 +00:00
|
|
|
queue_work(system_unbound_wq, async_work);
|
2019-06-18 20:09:25 +00:00
|
|
|
}
|
|
|
|
} else {
|
|
|
|
list_add_tail(&ticket.list,
|
|
|
|
&space_info->priority_tickets);
|
|
|
|
}
|
|
|
|
} else if (!ret && space_info->flags & BTRFS_BLOCK_GROUP_METADATA) {
|
|
|
|
/*
|
|
|
|
* We will do the space reservation dance during log replay,
|
|
|
|
* which means we won't have fs_info->fs_root set, so don't do
|
|
|
|
* the async reclaim as we will panic.
|
|
|
|
*/
|
|
|
|
if (!test_bit(BTRFS_FS_LOG_RECOVERING, &fs_info->flags) &&
|
2021-04-28 17:38:42 +00:00
|
|
|
!work_busy(&fs_info->preempt_reclaim_work) &&
|
|
|
|
need_preemptive_reclaim(fs_info, space_info)) {
|
2019-06-18 20:09:25 +00:00
|
|
|
trace_btrfs_trigger_flush(fs_info, space_info->flags,
|
|
|
|
orig_bytes, flush, "preempt");
|
|
|
|
queue_work(system_unbound_wq,
|
btrfs: improve preemptive background space flushing
Currently if we ever have to flush space because we do not have enough
we allocate a ticket and attach it to the space_info, and then
systematically flush things in the filesystem that hold space
reservations until our space is reclaimed.
However this has a latency cost, we must go to sleep and wait for the
flushing to make progress before we are woken up and allowed to continue
doing our work.
In order to address that we used to kick off the async worker to flush
space preemptively, so that we could be reclaiming space hopefully
before any tasks needed to stop and wait for space to reclaim.
When I introduced the ticketed ENOSPC stuff this broke slightly in the
fact that we were using tickets to indicate if we were done flushing.
No tickets, no more flushing. However this meant that we essentially
never preemptively flushed. This caused a write performance regression
that Nikolay noticed in an unrelated patch that removed the committing
of the transaction during btrfs_end_transaction.
The behavior that happened pre that patch was btrfs_end_transaction()
would see that we were low on space, and it would commit the
transaction. This was bad because in this particular case you could end
up with thousands and thousands of transactions being committed during
the 5 minute reproducer. With the patch to remove this behavior we got
much more sane transaction commits, but we ended up slower because we
would write for a while, flush, write for a while, flush again.
To address this we need to reinstate a preemptive flushing mechanism.
However it is distinctly different from our ticketing flushing in that
it doesn't have tickets to base its decisions on. Instead of bolting
this logic into our existing flushing work, add another worker to handle
this preemptive flushing. Here we will attempt to be slightly
intelligent about the things that we flush, attempting to balance
between whichever pool is taking up the most space.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-09 13:28:22 +00:00
|
|
|
&fs_info->preempt_reclaim_work);
|
2019-06-18 20:09:25 +00:00
|
|
|
}
|
|
|
|
}
|
|
|
|
spin_unlock(&space_info->lock);
|
2022-09-09 13:35:01 +00:00
|
|
|
if (!ret || !can_ticket(flush))
|
2019-06-18 20:09:25 +00:00
|
|
|
return ret;
|
|
|
|
|
2020-10-09 13:28:19 +00:00
|
|
|
return handle_reserve_ticket(fs_info, space_info, &ticket, start_ns,
|
|
|
|
orig_bytes, flush);
|
2019-06-18 20:09:25 +00:00
|
|
|
}
|
|
|
|
|
2022-10-27 12:21:42 +00:00
|
|
|
/*
|
|
|
|
* Try to reserve metadata bytes from the block_rsv's space.
|
2021-01-22 09:58:02 +00:00
|
|
|
*
|
2021-12-20 07:23:06 +00:00
|
|
|
* @fs_info: the filesystem
|
2023-09-08 17:20:20 +00:00
|
|
|
* @space_info: the space_info we're allocating for
|
2021-01-22 09:58:02 +00:00
|
|
|
* @orig_bytes: number of bytes we want
|
|
|
|
* @flush: whether or not we can flush to make our reservation
|
2019-06-18 20:09:25 +00:00
|
|
|
*
|
|
|
|
* This will reserve orig_bytes number of bytes from the space info associated
|
|
|
|
* with the block_rsv. If there is not enough space it will make an attempt to
|
|
|
|
* flush out space to make room. It will do this by flushing delalloc if
|
|
|
|
* possible or committing the transaction. If flush is 0 then no attempts to
|
|
|
|
* regain reservations will be made and this will fail if there is not enough
|
|
|
|
* space already.
|
|
|
|
*/
|
2021-11-09 15:12:07 +00:00
|
|
|
int btrfs_reserve_metadata_bytes(struct btrfs_fs_info *fs_info,
|
2023-09-08 17:20:20 +00:00
|
|
|
struct btrfs_space_info *space_info,
|
2019-06-18 20:09:25 +00:00
|
|
|
u64 orig_bytes,
|
|
|
|
enum btrfs_reserve_flush_enum flush)
|
|
|
|
{
|
|
|
|
int ret;
|
|
|
|
|
2023-09-08 17:20:20 +00:00
|
|
|
ret = __reserve_bytes(fs_info, space_info, orig_bytes, flush);
|
2019-06-18 20:09:25 +00:00
|
|
|
if (ret == -ENOSPC) {
|
|
|
|
trace_btrfs_space_reservation(fs_info, "space_info:enospc",
|
2023-09-08 17:20:20 +00:00
|
|
|
space_info->flags, orig_bytes, 1);
|
2019-06-18 20:09:25 +00:00
|
|
|
|
|
|
|
if (btrfs_test_opt(fs_info, ENOSPC_DEBUG))
|
2023-09-08 17:20:20 +00:00
|
|
|
btrfs_dump_space_info(fs_info, space_info, orig_bytes, 0);
|
2019-06-18 20:09:25 +00:00
|
|
|
}
|
|
|
|
return ret;
|
|
|
|
}
|
2020-07-21 14:22:25 +00:00
|
|
|
|
2022-10-27 12:21:42 +00:00
|
|
|
/*
|
|
|
|
* Try to reserve data bytes for an allocation.
|
2021-01-22 09:58:02 +00:00
|
|
|
*
|
|
|
|
* @fs_info: the filesystem
|
|
|
|
* @bytes: number of bytes we need
|
|
|
|
* @flush: how we are allowed to flush
|
2020-07-21 14:22:25 +00:00
|
|
|
*
|
|
|
|
* This will reserve bytes from the data space info. If there is not enough
|
|
|
|
* space then we will attempt to flush space as specified by flush.
|
|
|
|
*/
|
|
|
|
int btrfs_reserve_data_bytes(struct btrfs_fs_info *fs_info, u64 bytes,
|
|
|
|
enum btrfs_reserve_flush_enum flush)
|
|
|
|
{
|
|
|
|
struct btrfs_space_info *data_sinfo = fs_info->data_sinfo;
|
2020-07-21 14:22:28 +00:00
|
|
|
int ret;
|
2020-07-21 14:22:25 +00:00
|
|
|
|
2020-07-21 14:22:28 +00:00
|
|
|
ASSERT(flush == BTRFS_RESERVE_FLUSH_DATA ||
|
2022-09-12 19:27:44 +00:00
|
|
|
flush == BTRFS_RESERVE_FLUSH_FREE_SPACE_INODE ||
|
|
|
|
flush == BTRFS_RESERVE_NO_FLUSH);
|
2020-07-21 14:22:25 +00:00
|
|
|
ASSERT(!current->journal_info || flush != BTRFS_RESERVE_FLUSH_DATA);
|
|
|
|
|
2020-07-21 14:22:28 +00:00
|
|
|
ret = __reserve_bytes(fs_info, data_sinfo, bytes, flush);
|
|
|
|
if (ret == -ENOSPC) {
|
|
|
|
trace_btrfs_space_reservation(fs_info, "space_info:enospc",
|
2020-07-21 14:22:25 +00:00
|
|
|
data_sinfo->flags, bytes, 1);
|
2020-07-21 14:22:28 +00:00
|
|
|
if (btrfs_test_opt(fs_info, ENOSPC_DEBUG))
|
|
|
|
btrfs_dump_space_info(fs_info, data_sinfo, bytes, 0);
|
|
|
|
}
|
2020-07-21 14:22:25 +00:00
|
|
|
return ret;
|
|
|
|
}
|
btrfs: dump all space infos if we abort transaction due to ENOSPC
We have hit some transaction aborts due to -ENOSPC internally.
Normally we should always reserve enough metadata space for every
transaction, so hitting -ENOSPC indicates a case we didn't expect.
Unfortunately the current error reporting only gives a kernel
warning and a stack trace, which does not help much in debugging what is
causing the problem.
The enospc_debug mount option only helps when the user can reproduce the
problem, but in most cases such a transaction abort due to -ENOSPC is
really hard to reproduce.
So this patch dumps all space infos (data, metadata, system) when we
abort the first transaction with -ENOSPC.
This should at least give us some clues.
The example of a dump would look like this:
BTRFS: Transaction aborted (error -28)
WARNING: CPU: 8 PID: 3366 at fs/btrfs/transaction.c:2137 btrfs_commit_transaction+0xf81/0xfb0 [btrfs]
<call trace skipped>
---[ end trace 0000000000000000 ]---
BTRFS info (device dm-1: state A): dumping space info:
BTRFS info (device dm-1: state A): space_info DATA has 6791168 free, is not full
BTRFS info (device dm-1: state A): space_info total=8388608, used=1597440, pinned=0, reserved=0, may_use=0, readonly=0 zone_unusable=0
BTRFS info (device dm-1: state A): space_info METADATA has 257114112 free, is not full
BTRFS info (device dm-1: state A): space_info total=268435456, used=131072, pinned=180224, reserved=65536, may_use=10878976, readonly=65536 zone_unusable=0
BTRFS info (device dm-1: state A): space_info SYSTEM has 8372224 free, is not full
BTRFS info (device dm-1: state A): space_info total=8388608, used=16384, pinned=0, reserved=0, may_use=0, readonly=0 zone_unusable=0
BTRFS info (device dm-1: state A): global_block_rsv: size 3670016 reserved 3670016
BTRFS info (device dm-1: state A): trans_block_rsv: size 0 reserved 0
BTRFS info (device dm-1: state A): chunk_block_rsv: size 0 reserved 0
BTRFS info (device dm-1: state A): delayed_block_rsv: size 4063232 reserved 4063232
BTRFS info (device dm-1: state A): delayed_refs_rsv: size 3145728 reserved 3145728
BTRFS: error (device dm-1: state A) in btrfs_commit_transaction:2137: errno=-28 No space left
BTRFS info (device dm-1: state EA): forced readonly
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-08-25 07:09:10 +00:00
|
|
|
|
|
|
|
/* Dump all the space infos when we abort a transaction due to ENOSPC. */
|
|
|
|
__cold void btrfs_dump_space_info_for_trans_abort(struct btrfs_fs_info *fs_info)
|
|
|
|
{
|
|
|
|
struct btrfs_space_info *space_info;
|
|
|
|
|
|
|
|
btrfs_info(fs_info, "dumping space info:");
|
|
|
|
list_for_each_entry(space_info, &fs_info->space_info, list) {
|
|
|
|
spin_lock(&space_info->lock);
|
|
|
|
__btrfs_dump_space_info(fs_info, space_info);
|
|
|
|
spin_unlock(&space_info->lock);
|
|
|
|
}
|
|
|
|
dump_global_block_rsv(fs_info);
|
|
|
|
}
|
2022-10-24 18:46:56 +00:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Account the unused space of all the readonly block groups in the space_info.
|
|
|
|
* Takes mirrors into account.
|
|
|
|
*/
|
|
|
|
u64 btrfs_account_ro_block_groups_free_space(struct btrfs_space_info *sinfo)
|
|
|
|
{
|
|
|
|
struct btrfs_block_group *block_group;
|
|
|
|
u64 free_bytes = 0;
|
|
|
|
int factor;
|
|
|
|
|
|
|
|
/* It's df, we don't care if it's racy */
|
|
|
|
if (list_empty(&sinfo->ro_bgs))
|
|
|
|
return 0;
|
|
|
|
|
|
|
|
spin_lock(&sinfo->lock);
|
|
|
|
list_for_each_entry(block_group, &sinfo->ro_bgs, ro_list) {
|
|
|
|
spin_lock(&block_group->lock);
|
|
|
|
|
|
|
|
if (!block_group->ro) {
|
|
|
|
spin_unlock(&block_group->lock);
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
|
|
|
|
factor = btrfs_bg_type_to_factor(block_group->flags);
|
|
|
|
free_bytes += (block_group->length -
|
|
|
|
block_group->used) * factor;
|
|
|
|
|
|
|
|
spin_unlock(&block_group->lock);
|
|
|
|
}
|
|
|
|
spin_unlock(&sinfo->lock);
|
|
|
|
|
|
|
|
return free_bytes;
|
|
|
|
}
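The accounting loop above can be illustrated outside the kernel. The following is a minimal user-space sketch, not the kernel implementation: `struct sketch_bg` and `sketch_ro_free_bytes()` are hypothetical names, locking is omitted, and the `factor` field stands in for whatever `btrfs_bg_type_to_factor()` returns for the block group's profile.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for a block group. */
struct sketch_bg {
	uint64_t length;	/* total bytes in the block group */
	uint64_t used;		/* bytes currently used */
	int ro;			/* non-zero if the block group is read-only */
	int factor;		/* assumed mirror factor for the profile */
};

/*
 * Sum the unused bytes of every read-only block group, scaled by its
 * mirror factor. Block groups that are not read-only are skipped, as in
 * the loop above.
 */
static uint64_t sketch_ro_free_bytes(const struct sketch_bg *bgs, size_t n)
{
	uint64_t free_bytes = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		if (!bgs[i].ro)
			continue;
		free_bytes += (bgs[i].length - bgs[i].used) * bgs[i].factor;
	}
	return free_bytes;
}
```

With a read-only group of 100 bytes (40 used, factor 1), a read-only group of 200 bytes (50 used, factor 2) and a writable group, the sketch returns 60 + 300 = 360.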
|
2024-02-02 19:52:16 +00:00
|
|
|
|
|
|
|
static u64 calc_pct_ratio(u64 x, u64 y)
|
|
|
|
{
|
|
|
|
int err;
|
|
|
|
|
|
|
|
if (!y)
|
|
|
|
return 0;
|
|
|
|
again:
|
|
|
|
err = check_mul_overflow(100, x, &x);
|
|
|
|
if (err)
|
|
|
|
goto lose_precision;
|
|
|
|
return div64_u64(x, y);
|
|
|
|
lose_precision:
|
|
|
|
x >>= 10;
|
|
|
|
y >>= 10;
|
|
|
|
if (!y)
|
|
|
|
y = 1;
|
|
|
|
goto again;
|
|
|
|
}
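The overflow fallback in calc_pct_ratio() can be tried in user space. This is a sketch under stated assumptions, not the kernel code: `sketch_pct_ratio()` is a hypothetical name, a loop replaces the goto labels, and the GCC/Clang builtin `__builtin_mul_overflow()` stands in for the kernel's `check_mul_overflow()`.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Sketch of a percentage helper: returns 100 * x / y. If 100 * x would not
 * fit in 64 bits, both operands lose their 10 low bits and the multiply is
 * retried, trading precision for range.
 */
static uint64_t sketch_pct_ratio(uint64_t x, uint64_t y)
{
	uint64_t scaled;

	if (y == 0)
		return 0;

	while (__builtin_mul_overflow((uint64_t)100, x, &scaled)) {
		x >>= 10;
		y >>= 10;
		if (y == 0)
			y = 1;
	}
	return scaled / y;
}

/* Small usage example; the values are illustrative only. */
static void sketch_pct_ratio_examples(void)
{
	assert(sketch_pct_ratio(3, 10) == 30);
	/* A zero divisor yields 0 rather than a division fault. */
	assert(sketch_pct_ratio(1, 0) == 0);
	/* 100 * UINT64_MAX overflows, so one precision-losing retry runs. */
	assert(sketch_pct_ratio(UINT64_MAX, UINT64_MAX) == 100);
}
```

Shifting both operands by the same amount keeps the ratio approximately intact, which is enough for a value that is only used as a coarse percentage.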
|
|
|
|
|
|
|
|
/*
|
|
|
|
* A reasonable buffer for unallocated space is 10 data block_groups.
|
|
|
|
* If we claw this back repeatedly, we can still achieve efficient
|
|
|
|
* utilization when near full, and not do too much reclaim while
|
|
|
|
* always maintaining a solid buffer for workloads that quickly
|
|
|
|
* allocate and pressure the unallocated space.
|
|
|
|
*/
|
|
|
|
static u64 calc_unalloc_target(struct btrfs_fs_info *fs_info)
|
|
|
|
{
|
2024-02-14 19:25:50 +00:00
|
|
|
u64 chunk_sz = calc_effective_data_chunk_size(fs_info);
|
|
|
|
|
|
|
|
return BTRFS_UNALLOC_BLOCK_GROUP_TARGET * chunk_sz;
|
2024-02-02 19:52:16 +00:00
|
|
|
}
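The unallocated-space target feeds the dynamic threshold formula from the commit message, (target - unalloc) / target. A rough user-space sketch of that arithmetic follows. The names `SKETCH_TARGET_CHUNKS` and `sketch_reclaim_threshold()` are hypothetical, the value 10 is taken from the "10 chunks" wording of the commit message, and the chunk size is passed in directly instead of being derived from the filesystem.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: the target is 10 chunks, per the commit message. */
#define SKETCH_TARGET_CHUNKS 10ULL

/*
 * Return a reclaim threshold in percent: 100 * (target - unalloc) / target.
 * The result is 0 when unallocated space already meets the target, and
 * also 0 when less than one chunk of unused space is available, since
 * relocation would have nowhere to go. This sketch assumes sizes small
 * enough that 100 * want cannot overflow 64 bits.
 */
static int sketch_reclaim_threshold(uint64_t unalloc, uint64_t unused,
				    uint64_t chunk_size)
{
	uint64_t target = SKETCH_TARGET_CHUNKS * chunk_size;
	uint64_t want = target > unalloc ? target - unalloc : 0;

	if (target == 0 || unused < chunk_size)
		return 0;

	return (int)(100 * want / target);
}
```

For example, with 1 GiB chunks, 7 GiB unallocated and at least 1 GiB unused, the shortfall is 3 GiB out of a 10 GiB target, which gives a threshold of 30.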
|
|
|
|
|
|
|
|
/*
|
|
|
|
* The fundamental goal of automatic reclaim is to protect the filesystem's
|
|
|
|
* unallocated space and thus minimize the probability of the filesystem going
|
|
|
|
* read only when a metadata allocation failure causes a transaction abort.
|
|
|
|
*
|
|
|
|
* However, relocations happen into the space_info's unused space, therefore
|
|
|
|
* automatic reclaim must also back off as that space runs low. There is no
|
|
|
|
* value in doing trivial "relocations" of re-writing the same block group
|
|
|
|
* into a fresh one.
|
|
|
|
*
|
|
|
|
* Furthermore, we want to avoid doing too much reclaim even if there are good
|
|
|
|
* candidates. This is because the allocator is pretty good at filling up the
|
|
|
|
* holes with writes. So we want to do just enough reclaim to try and stay
|
|
|
|
* safe from running out of unallocated space but not be wasteful about it.
|
|
|
|
*
|
|
|
|
* Therefore, the dynamic reclaim threshold is calculated as follows:
|
|
|
|
* - calculate a target unallocated amount of 10 block group sized chunks
|
|
|
|
* - ratchet up the intensity of reclaim depending on how far we are from
|
|
|
|
* that target by using a formula of (target - unalloc) / target to set the
* threshold.
|
|
|
|
*
|
|
|
|
* Typically with 10 block groups as the target, the discrete values this comes
|
|
|
|
* out to are 0, 10, 20, ... , 80, 90, and 99.
|
|
|
|
*/
|
2024-06-26 21:39:11 +00:00
|
|
|
static int calc_dynamic_reclaim_threshold(const struct btrfs_space_info *space_info)
|
2024-02-02 19:52:16 +00:00
|
|
|
{
|
|
|
|
struct btrfs_fs_info *fs_info = space_info->fs_info;
|
|
|
|
u64 unalloc = atomic64_read(&fs_info->free_chunk_space);
|
|
|
|
u64 target = calc_unalloc_target(fs_info);
|
|
|
|
u64 alloc = space_info->total_bytes;
|
|
|
|
u64 used = btrfs_space_info_used(space_info, false);
|
|
|
|
u64 unused = alloc - used;
|
|
|
|
u64 want = target > unalloc ? target - unalloc : 0;
|
|
|
|
u64 data_chunk_size = calc_effective_data_chunk_size(fs_info);
|
|
|
|
|
2024-02-14 19:25:50 +00:00
|
|
|
/* If we have no unused space, don't bother, it won't work anyway. */
|
2024-02-02 19:52:16 +00:00
	if (unused < data_chunk_size)
		return 0;

2024-02-14 19:25:50 +00:00
	/* Cast to int is OK because want <= target. */
	return calc_pct_ratio(want, target);
2024-02-02 19:52:16 +00:00
}

2024-06-26 21:39:11 +00:00
int btrfs_calc_reclaim_threshold(const struct btrfs_space_info *space_info)
2024-02-02 19:52:16 +00:00
{
	lockdep_assert_held(&space_info->lock);

	if (READ_ONCE(space_info->dynamic_reclaim))
		return calc_dynamic_reclaim_threshold(space_info);
	return READ_ONCE(space_info->bg_reclaim_threshold);
}
2024-02-02 19:54:33 +00:00
2024-02-14 19:29:50 +00:00
/*
 * Under "urgent" reclaim, we will reclaim even fresh block groups that have
 * recently seen successful allocations, as we are desperate to reclaim
 * whatever we can to avoid ENOSPC in a transaction leading to a readonly fs.
 */
static bool is_reclaim_urgent(struct btrfs_space_info *space_info)
{
	struct btrfs_fs_info *fs_info = space_info->fs_info;
	u64 unalloc = atomic64_read(&fs_info->free_chunk_space);
	u64 data_chunk_size = calc_effective_data_chunk_size(fs_info);

	return unalloc < data_chunk_size;
}

2024-10-09 14:31:04 +00:00
static void do_reclaim_sweep(struct btrfs_space_info *space_info, int raid)
2024-02-02 19:54:33 +00:00
{
	struct btrfs_block_group *bg;
	int thresh_pct;
2024-02-14 19:29:50 +00:00
	bool try_again = true;
	bool urgent;
2024-02-02 19:54:33 +00:00

	spin_lock(&space_info->lock);
2024-02-14 19:29:50 +00:00
	urgent = is_reclaim_urgent(space_info);
2024-02-02 19:54:33 +00:00
	thresh_pct = btrfs_calc_reclaim_threshold(space_info);
	spin_unlock(&space_info->lock);

	down_read(&space_info->groups_sem);
2024-02-14 19:29:50 +00:00
again:
2024-02-02 19:54:33 +00:00
	list_for_each_entry(bg, &space_info->block_groups[raid], list) {
		u64 thresh;
		bool reclaim = false;

		btrfs_get_block_group(bg);
		spin_lock(&bg->lock);
		thresh = mult_perc(bg->length, thresh_pct);
2024-02-14 19:29:50 +00:00
		if (bg->used < thresh && bg->reclaim_mark) {
			try_again = false;
2024-02-02 19:54:33 +00:00
			reclaim = true;
2024-02-14 19:29:50 +00:00
		}
2024-02-02 19:54:33 +00:00
		bg->reclaim_mark++;
		spin_unlock(&bg->lock);
		if (reclaim)
			btrfs_mark_bg_to_reclaim(bg);
		btrfs_put_block_group(bg);
	}
2024-02-14 19:29:50 +00:00

	/*
	 * In situations where we are very motivated to reclaim (low unalloc)
	 * use two passes to make the reclaim mark check best effort.
	 *
	 * If we have any staler groups, we don't touch the fresher ones, but if we
	 * really need a block group, do take a fresh one.
	 */
	if (try_again && urgent) {
		try_again = false;
		goto again;
	}

2024-02-02 19:54:33 +00:00
	up_read(&space_info->groups_sem);
}
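
The two-pass behaviour of do_reclaim_sweep() can be sketched in user space as follows. This is a simplified model, not the kernel code: struct demo_bg, demo_reclaim_sweep and the queued flag are hypothetical stand-ins, locking and reference counting are left out, and it assumes that reclaim_mark is cleared elsewhere when a group sees a fresh allocation (that part is not shown in this section).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a block group; not the kernel structure. */
struct demo_bg {
	uint64_t length;		/* total size of the group */
	uint64_t used;			/* bytes in use */
	unsigned int reclaim_mark;	/* sweeps seen since last allocation */
	bool queued;			/* models being handed to reclaim */
};

/*
 * One sweep over `n` groups. A group is queued only if it is below the
 * threshold AND already carries a mark from an earlier pass. If nothing
 * qualified and reclaim is urgent, a second pass runs; by then every
 * group has been marked, so fresh groups become eligible too.
 * Returns the number of groups newly queued.
 */
static size_t demo_reclaim_sweep(struct demo_bg *bgs, size_t n,
				 int thresh_pct, bool urgent)
{
	bool try_again = true;
	size_t queued = 0;

	assert(thresh_pct >= 0 && thresh_pct <= 100);
again:
	for (size_t i = 0; i < n; i++) {
		uint64_t thresh = bgs[i].length * (uint64_t)thresh_pct / 100;

		if (bgs[i].used < thresh && bgs[i].reclaim_mark) {
			try_again = false;
			if (!bgs[i].queued) {
				bgs[i].queued = true;
				queued++;
			}
		}
		bgs[i].reclaim_mark++;
	}

	if (try_again && urgent) {
		try_again = false;
		goto again;
	}
	return queued;
}
```

In this model a non-urgent sweep over freshly allocated groups queues nothing and only marks them, while an urgent sweep takes a fresh group in the same call if no staler candidate exists.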
2024-02-14 19:25:50 +00:00
void btrfs_space_info_update_reclaimable(struct btrfs_space_info *space_info, s64 bytes)
{
	u64 chunk_sz = calc_effective_data_chunk_size(space_info->fs_info);

	lockdep_assert_held(&space_info->lock);
	space_info->reclaimable_bytes += bytes;

	if (space_info->reclaimable_bytes >= chunk_sz)
		btrfs_set_periodic_reclaim_ready(space_info, true);
}

void btrfs_set_periodic_reclaim_ready(struct btrfs_space_info *space_info, bool ready)
{
	lockdep_assert_held(&space_info->lock);
	if (!READ_ONCE(space_info->periodic_reclaim))
		return;
	if (ready != space_info->periodic_reclaim_ready) {
		space_info->periodic_reclaim_ready = ready;
		if (!ready)
			space_info->reclaimable_bytes = 0;
	}
}

bool btrfs_should_periodic_reclaim(struct btrfs_space_info *space_info)
{
	bool ret;

	if (space_info->flags & BTRFS_BLOCK_GROUP_SYSTEM)
		return false;
	if (!READ_ONCE(space_info->periodic_reclaim))
		return false;

	spin_lock(&space_info->lock);
	ret = space_info->periodic_reclaim_ready;
	btrfs_set_periodic_reclaim_ready(space_info, false);
	spin_unlock(&space_info->lock);

	return ret;
}
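
The interplay of the three functions above (accumulate reclaimable bytes, raise a ready flag once a chunk's worth has built up, then consume and reset the flag) can be modelled in user space roughly as below. All demo_* names are hypothetical, locking and the SYSTEM block group check are omitted, and the chunk size is passed in as a signed value purely to keep the comparison simple in this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the relevant space_info fields. */
struct demo_space_info {
	bool periodic_reclaim;		/* feature switch */
	bool periodic_reclaim_ready;	/* enough has built up to sweep */
	int64_t reclaimable_bytes;	/* running total since last sweep */
};

/* Flip the ready flag; clearing it also resets the running total. */
static void demo_set_ready(struct demo_space_info *si, bool ready)
{
	if (!si->periodic_reclaim)
		return;
	if (ready != si->periodic_reclaim_ready) {
		si->periodic_reclaim_ready = ready;
		if (!ready)
			si->reclaimable_bytes = 0;
	}
}

/* Account freed bytes; become ready once a chunk's worth accumulated. */
static void demo_update_reclaimable(struct demo_space_info *si,
				    int64_t bytes, int64_t chunk_sz)
{
	assert(chunk_sz > 0);
	si->reclaimable_bytes += bytes;
	if (si->reclaimable_bytes >= chunk_sz)
		demo_set_ready(si, true);
}

/* Report whether a sweep is due, consuming the ready state. */
static bool demo_should_reclaim(struct demo_space_info *si)
{
	bool ret;

	if (!si->periodic_reclaim)
		return false;
	ret = si->periodic_reclaim_ready;
	demo_set_ready(si, false);
	return ret;
}
```

One consequence visible in this model: asking whether a sweep is due while the flag is still clear leaves the running total untouched, so partial progress towards a chunk's worth of reclaimable bytes is kept.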
2024-06-26 21:39:11 +00:00
void btrfs_reclaim_sweep(const struct btrfs_fs_info *fs_info)
2024-02-02 19:54:33 +00:00
{
	int raid;
	struct btrfs_space_info *space_info;

	list_for_each_entry(space_info, &fs_info->space_info, list) {
2024-02-14 19:25:50 +00:00
		if (!btrfs_should_periodic_reclaim(space_info))
2024-02-02 19:54:33 +00:00
			continue;
2024-08-27 10:30:10 +00:00
		for (raid = 0; raid < BTRFS_NR_RAID_TYPES; raid++)
2024-10-09 14:31:04 +00:00
			do_reclaim_sweep(space_info, raid);
2024-02-02 19:54:33 +00:00
	}
}