From bf9821ba4792a0d9a2e72803ae7b4341faf3d532 Mon Sep 17 00:00:00 2001 From: Naohiro Aota Date: Tue, 1 Oct 2024 17:03:32 +0900 Subject: [PATCH 1/7] btrfs: zoned: fix zone unusable accounting for freed reserved extent When btrfs reserves an extent and does not use it (e.g, by an error), it calls btrfs_free_reserved_extent() to free the reserved extent. In the process, it calls btrfs_add_free_space() and then it accounts the region bytes as block_group->zone_unusable. However, it leaves the space_info->bytes_zone_unusable side not updated. As a result, ENOSPC can happen while a space_info reservation succeeded. The reservation is fine because the freed region is not added in space_info->bytes_zone_unusable, leaving that space as "free". OTOH, corresponding block group counts it as zone_unusable and its allocation pointer is not rewound, we cannot allocate an extent from that block group. That will also negate space_info's async/sync reclaim process, and cause an ENOSPC error from the extent allocation process. Fix that by returning the space to space_info->bytes_zone_unusable. Ideally, since a bio is not submitted for this reserved region, we should return the space to free space and rewind the allocation pointer. But, it needs rework on extent allocation handling, so let it work in this way for now. Fixes: 169e0da91a21 ("btrfs: zoned: track unusable bytes for zones") CC: stable@vger.kernel.org # 5.15+ Reviewed-by: Johannes Thumshirn Signed-off-by: Naohiro Aota Reviewed-by: David Sterba Signed-off-by: David Sterba --- fs/btrfs/block-group.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c index 7980b2e33a92..4423d8b716a5 100644 --- a/fs/btrfs/block-group.c +++ b/fs/btrfs/block-group.c @@ -3819,6 +3819,8 @@ void btrfs_free_reserved_bytes(struct btrfs_block_group *cache, spin_lock(&cache->lock); if (cache->ro) space_info->bytes_readonly += num_bytes; + else if (btrfs_is_zoned(cache->fs_info)) + space_info->bytes_zone_unusable += num_bytes; cache->reserved -= num_bytes; space_info->bytes_reserved -= num_bytes; space_info->max_extent_size = 0; From 3510e684b8f6a569c2f8b86870da116e2ffeec2d Mon Sep 17 00:00:00 2001 From: Filipe Manana Date: Mon, 14 Oct 2024 16:14:18 +0100 Subject: [PATCH 2/7] btrfs: clear force-compress on remount when compress mount option is given After the migration to use fs context for processing mount options we had a slight change in the semantics for remounting a filesystem that was mounted with compress-force. Before we could clear compress-force by passing only "-o compress[=algo]" during a remount, but after that change that does not work anymore, force-compress is still present and one needs to pass "-o compress-force=no,compress[=algo]" to the mount command. Example, when running on a kernel 6.8+: $ mount -o compress-force=zlib:9 /dev/sdi /mnt/sdi $ mount | grep sdi /dev/sdi on /mnt/sdi type btrfs (rw,relatime,compress-force=zlib:9,discard=async,space_cache=v2,subvolid=5,subvol=/) $ mount -o remount,compress=zlib:5 /mnt/sdi $ mount | grep sdi /dev/sdi on /mnt/sdi type btrfs (rw,relatime,compress-force=zlib:5,discard=async,space_cache=v2,subvolid=5,subvol=/) On a 6.7 kernel (or older): $ mount -o compress-force=zlib:9 /dev/sdi /mnt/sdi $ mount | grep sdi /dev/sdi on /mnt/sdi type btrfs (rw,relatime,compress-force=zlib:9,discard=async,space_cache=v2,subvolid=5,subvol=/) $ mount -o remount,compress=zlib:5 /mnt/sdi $ mount | grep sdi /dev/sdi on /mnt/sdi type btrfs (rw,relatime,compress=zlib:5,discard=async,space_cache=v2,subvolid=5,subvol=/) So update btrfs_parse_param() to clear "compress-force" when "compress" is given, providing the same semantics as kernel 6.7 and older. Reported-by: Roman Mamedov Link: https://lore.kernel.org/linux-btrfs/20241014182416.13d0f8b0@nvm/ CC: stable@vger.kernel.org # 6.8+ Signed-off-by: Filipe Manana Reviewed-by: David Sterba Signed-off-by: David Sterba --- fs/btrfs/super.c | 9 +++++++++ 1 file changed, 9 insertions(+) diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c index 98fa0f382480..3f8673f97432 100644 --- a/fs/btrfs/super.c +++ b/fs/btrfs/super.c @@ -340,6 +340,15 @@ static int btrfs_parse_param(struct fs_context *fc, struct fs_parameter *param) fallthrough; case Opt_compress: case Opt_compress_type: + /* + * Provide the same semantics as older kernels that don't use fs + * context, specifying the "compress" option clears + * "force-compress" without the need to pass + * "compress-force=[no|none]" before specifying "compress". + */ + if (opt != Opt_compress_force && opt != Opt_compress_force_type) + btrfs_clear_opt(ctx->mount_opt, FORCE_COMPRESS); + if (opt == Opt_compress || opt == Opt_compress_force) { ctx->compress_type = BTRFS_COMPRESS_ZLIB; ctx->compress_level = BTRFS_ZLIB_DEFAULT_LEVEL; From 5f9062a48db260fd6b53d86ecfb4d5dc59266316 Mon Sep 17 00:00:00 2001 From: Qu Wenruo Date: Tue, 10 Sep 2024 15:21:04 +0930 Subject: [PATCH 3/7] btrfs: qgroup: set a more sane default value for subtree drop threshold Since commit 011b46c30476 ("btrfs: skip subtree scan if it's too high to avoid low stall in btrfs_commit_transaction()"), btrfs qgroup can automatically skip large subtree scan at the cost of marking qgroup inconsistent. It's designed to address the final performance problem of snapshot drop with qgroup enabled, but to be safe the default value is BTRFS_MAX_LEVEL, requiring a user space daemon to set a different value to make it work. I'd say it's not a good idea to rely on user space tool to set this default value, especially when some operations (snapshot dropping) can be triggered immediately after mount, leaving a very small window to that that sysfs interface. So instead of disabling this new feature by default, enable it with a low threshold (3), so that large subvolume tree drop at mount time won't cause huge qgroup workload. CC: stable@vger.kernel.org # 6.1 Signed-off-by: Qu Wenruo Reviewed-by: David Sterba Signed-off-by: David Sterba --- fs/btrfs/disk-io.c | 2 +- fs/btrfs/qgroup.c | 2 +- fs/btrfs/qgroup.h | 2 ++ 3 files changed, 4 insertions(+), 2 deletions(-) diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c index 1238a38c59b2..5afb68c0304b 100644 --- a/fs/btrfs/disk-io.c +++ b/fs/btrfs/disk-io.c @@ -1959,7 +1959,7 @@ static void btrfs_init_qgroup(struct btrfs_fs_info *fs_info) fs_info->qgroup_seq = 1; fs_info->qgroup_ulist = NULL; fs_info->qgroup_rescan_running = false; - fs_info->qgroup_drop_subtree_thres = BTRFS_MAX_LEVEL; + fs_info->qgroup_drop_subtree_thres = BTRFS_QGROUP_DROP_SUBTREE_THRES_DEFAULT; mutex_init(&fs_info->qgroup_rescan_lock); } diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c index 1332ec59c539..a0e8deca87a7 100644 --- a/fs/btrfs/qgroup.c +++ b/fs/btrfs/qgroup.c @@ -1407,7 +1407,7 @@ int btrfs_quota_disable(struct btrfs_fs_info *fs_info) fs_info->quota_root = NULL; fs_info->qgroup_flags &= ~BTRFS_QGROUP_STATUS_FLAG_ON; fs_info->qgroup_flags &= ~BTRFS_QGROUP_STATUS_FLAG_SIMPLE_MODE; - fs_info->qgroup_drop_subtree_thres = BTRFS_MAX_LEVEL; + fs_info->qgroup_drop_subtree_thres = BTRFS_QGROUP_DROP_SUBTREE_THRES_DEFAULT; spin_unlock(&fs_info->qgroup_lock); btrfs_free_qgroup_config(fs_info); diff --git a/fs/btrfs/qgroup.h b/fs/btrfs/qgroup.h index 98adf4ec7b01..c229256d6fd5 100644 --- a/fs/btrfs/qgroup.h +++ b/fs/btrfs/qgroup.h @@ -121,6 +121,8 @@ struct btrfs_inode; #define BTRFS_QGROUP_RUNTIME_FLAG_CANCEL_RESCAN (1ULL << 63) #define BTRFS_QGROUP_RUNTIME_FLAG_NO_ACCOUNTING (1ULL << 62) +#define BTRFS_QGROUP_DROP_SUBTREE_THRES_DEFAULT (3) + /* * Record a dirty extent, and info qgroup to update quota on it */ From f10f59f91a6278e9637327d1206140d28e2d5004 Mon Sep 17 00:00:00 2001 From: Qu Wenruo Date: Wed, 9 Oct 2024 09:37:03 +1030 Subject: [PATCH 4/7] btrfs: fix the delalloc range locking if sector size < page size Inside lock_delalloc_folios(), there are several problems related to sector size < page size handling: - Set the writer locks without checking if the folio is still valid We call btrfs_folio_start_writer_lock() just like it's folio_lock(). But since the folio may not even be the folio of the current mapping, we can easily screw up the folio->private. - The range is not clamped inside the page This means we can over write other bitmaps if the start/len is not properly handled, and trigger the btrfs_subpage_assert(). - @processed_end is always rounded up to page end If the delalloc range is not page aligned, and we need to retry (returning -EAGAIN), then we will unlock to the page end. Thankfully this is not a huge problem, as now btrfs_folio_end_writer_lock() can handle range larger than the locked range, and only unlock what is already locked. Fix all these problems by: - Lock and check the folio first, then call btrfs_folio_set_writer_lock() So that if we got a folio not belonging to the inode, we won't touch folio->private. - Properly truncate the range inside the page - Update @processed_end to the locked range end Fixes: 1e1de38792e0 ("btrfs: make process_one_page() to handle subpage locking") CC: stable@vger.kernel.org # 6.1+ Signed-off-by: Qu Wenruo Reviewed-by: David Sterba Signed-off-by: David Sterba --- fs/btrfs/extent_io.c | 17 +++++++++-------- 1 file changed, 9 insertions(+), 8 deletions(-) diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c index 309a8ae48434..872cca54cc6c 100644 --- a/fs/btrfs/extent_io.c +++ b/fs/btrfs/extent_io.c @@ -262,22 +262,23 @@ static noinline int lock_delalloc_folios(struct inode *inode, for (i = 0; i < found_folios; i++) { struct folio *folio = fbatch.folios[i]; - u32 len = end + 1 - start; + u64 range_start; + u32 range_len; if (folio == locked_folio) continue; - if (btrfs_folio_start_writer_lock(fs_info, folio, start, - len)) - goto out; - + folio_lock(folio); if (!folio_test_dirty(folio) || folio->mapping != mapping) { - btrfs_folio_end_writer_lock(fs_info, folio, start, - len); + folio_unlock(folio); goto out; } + range_start = max_t(u64, folio_pos(folio), start); + range_len = min_t(u64, folio_pos(folio) + folio_size(folio), + end + 1) - range_start; + btrfs_folio_set_writer_lock(fs_info, folio, range_start, range_len); - processed_end = folio_pos(folio) + folio_size(folio) - 1; + processed_end = range_start + range_len - 1; } folio_batch_release(&fbatch); cond_resched(); From 7a2339058ed71f54c1e12e1b3c25aab1b1ba7943 Mon Sep 17 00:00:00 2001 From: Boris Burkov Date: Fri, 18 Oct 2024 15:44:34 -0700 Subject: [PATCH 5/7] btrfs: fix read corruption due to race with extent map merging In debugging some corrupt squashfs files, we observed symptoms of corrupt page cache pages but correct on-disk contents. Further investigation revealed that the exact symptom was a correct page followed by an incorrect, duplicate, page. This got us thinking about extent maps. commit ac05ca913e9f ("Btrfs: fix race between using extent maps and merging them") enforces a reference count on the primary `em` extent_map being merged, as that one gets modified. However, since, commit 3d2ac9922465 ("btrfs: introduce new members for extent_map") both 'em' and 'merge' get modified, which started modifying 'merge' and thus introduced the same race. We were able to reproduce this by looping the affected squashfs workload in parallel on a bunch of separate btrfs-es while also dropping caches. We are still working on a simple enough reproducer to make into an fstest. The simplest fix is to stop modifying 'merge', which is not essential, as it is dropped immediately after the merge. This behavior is simply a consequence of the order of the two extent maps being important in computing the new values. Modify merge_ondisk_extents to take prev and next by const* and also take a third merged parameter that it puts the results in. Note that this introduces the rather odd behavior of passing 'em' to merge_ondisk_extents as a const * and as a regular ptr. Fixes: 3d2ac9922465 ("btrfs: introduce new members for extent_map") CC: stable@vger.kernel.org # 6.11+ Reviewed-by: Qu Wenruo Reviewed-by: Filipe Manana Signed-off-by: Omar Sandoval Signed-off-by: Boris Burkov Signed-off-by: David Sterba --- fs/btrfs/extent_map.c | 31 ++++++++++++++++--------------- 1 file changed, 16 insertions(+), 15 deletions(-) diff --git a/fs/btrfs/extent_map.c b/fs/btrfs/extent_map.c index 25d191f1ac10..668c617444a5 100644 --- a/fs/btrfs/extent_map.c +++ b/fs/btrfs/extent_map.c @@ -243,13 +243,19 @@ static bool mergeable_maps(const struct extent_map *prev, const struct extent_ma /* * Handle the on-disk data extents merge for @prev and @next. * + * @prev: left extent to merge + * @next: right extent to merge + * @merged: the extent we will not discard after the merge; updated with new values + * + * After this, one of the two extents is the new merged extent and the other is + * removed from the tree and likely freed. Note that @merged is one of @prev/@next + * so there is const/non-const aliasing occurring here. + * * Only touches disk_bytenr/disk_num_bytes/offset/ram_bytes. * For now only uncompressed regular extent can be merged. - * - * @prev and @next will be both updated to point to the new merged range. - * Thus one of them should be removed by the caller. */ -static void merge_ondisk_extents(struct extent_map *prev, struct extent_map *next) +static void merge_ondisk_extents(const struct extent_map *prev, const struct extent_map *next, + struct extent_map *merged) { u64 new_disk_bytenr; u64 new_disk_num_bytes; @@ -284,15 +290,10 @@ static void merge_ondisk_extents(struct extent_map *prev, struct extent_map *nex new_disk_bytenr; new_offset = prev->disk_bytenr + prev->offset - new_disk_bytenr; - prev->disk_bytenr = new_disk_bytenr; - prev->disk_num_bytes = new_disk_num_bytes; - prev->ram_bytes = new_disk_num_bytes; - prev->offset = new_offset; - - next->disk_bytenr = new_disk_bytenr; - next->disk_num_bytes = new_disk_num_bytes; - next->ram_bytes = new_disk_num_bytes; - next->offset = new_offset; + merged->disk_bytenr = new_disk_bytenr; + merged->disk_num_bytes = new_disk_num_bytes; + merged->ram_bytes = new_disk_num_bytes; + merged->offset = new_offset; } static void dump_extent_map(struct btrfs_fs_info *fs_info, const char *prefix, @@ -361,7 +362,7 @@ static void try_merge_map(struct btrfs_inode *inode, struct extent_map *em) em->generation = max(em->generation, merge->generation); if (em->disk_bytenr < EXTENT_MAP_LAST_BYTE) - merge_ondisk_extents(merge, em); + merge_ondisk_extents(merge, em, em); em->flags |= EXTENT_FLAG_MERGED; validate_extent_map(fs_info, em); @@ -378,7 +379,7 @@ static void try_merge_map(struct btrfs_inode *inode, struct extent_map *em) if (rb && can_merge_extent_map(merge) && mergeable_maps(em, merge)) { em->len += merge->len; if (em->disk_bytenr < EXTENT_MAP_LAST_BYTE) - merge_ondisk_extents(em, merge); + merge_ondisk_extents(em, merge, em); validate_extent_map(fs_info, em); rb_erase(&merge->rb_node, &tree->root); RB_CLEAR_NODE(&merge->rb_node); From 3c36a72c1d27de6618c1c480c793d9924640f5bb Mon Sep 17 00:00:00 2001 From: Qu Wenruo Date: Thu, 19 Sep 2024 20:18:11 +0930 Subject: [PATCH 6/7] btrfs: reject ro->rw reconfiguration if there are hard ro requirements [BUG] Syzbot reports the following crash: BTRFS info (device loop0 state MCS): disabling free space tree BTRFS info (device loop0 state MCS): clearing compat-ro feature flag for FREE_SPACE_TREE (0x1) BTRFS info (device loop0 state MCS): clearing compat-ro feature flag for FREE_SPACE_TREE_VALID (0x2) Oops: general protection fault, probably for non-canonical address 0xdffffc0000000003: 0000 [#1] PREEMPT SMP KASAN NOPTI KASAN: null-ptr-deref in range [0x0000000000000018-0x000000000000001f] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2~bpo12+1 04/01/2014 RIP: 0010:backup_super_roots fs/btrfs/disk-io.c:1691 [inline] RIP: 0010:write_all_supers+0x97a/0x40f0 fs/btrfs/disk-io.c:4041 Call Trace: btrfs_commit_transaction+0x1eae/0x3740 fs/btrfs/transaction.c:2530 btrfs_delete_free_space_tree+0x383/0x730 fs/btrfs/free-space-tree.c:1312 btrfs_start_pre_rw_mount+0xf28/0x1300 fs/btrfs/disk-io.c:3012 btrfs_remount_rw fs/btrfs/super.c:1309 [inline] btrfs_reconfigure+0xae6/0x2d40 fs/btrfs/super.c:1534 btrfs_reconfigure_for_mount fs/btrfs/super.c:2020 [inline] btrfs_get_tree_subvol fs/btrfs/super.c:2079 [inline] btrfs_get_tree+0x918/0x1920 fs/btrfs/super.c:2115 vfs_get_tree+0x90/0x2b0 fs/super.c:1800 do_new_mount+0x2be/0xb40 fs/namespace.c:3472 do_mount fs/namespace.c:3812 [inline] __do_sys_mount fs/namespace.c:4020 [inline] __se_sys_mount+0x2d6/0x3c0 fs/namespace.c:3997 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x77/0x7f [CAUSE] To support mounting different subvolume with different RO/RW flags for the new mount APIs, btrfs introduced two workaround to support this feature: - Skip mount option/feature checks if we are mounting a different subvolume - Reconfigure the fs to RW if the initial mount is RO Combining these two, we can have the following sequence: - Mount the fs ro,rescue=all,clear_cache,space_cache=v1 rescue=all will mark the fs as hard read-only, so no v2 cache clearing will happen. - Mount a subvolume rw of the same fs. We go into btrfs_get_tree_subvol(), but fc_mount() returns EBUSY because our new fc is RW, different from the original fs. Now we enter btrfs_reconfigure_for_mount(), which switches the RO flag first so that we can grab the existing fs_info. Then we reconfigure the fs to RW. - During reconfiguration, option/features check is skipped This means we will restart the v2 cache clearing, and convert back to v1 cache. This will trigger fs writes, and since the original fs has "rescue=all" option, it skips the csum tree read. And eventually causing NULL pointer dereference in super block writeback. [FIX] For reconfiguration caused by different subvolume RO/RW flags, ensure we always run btrfs_check_options() to ensure we have proper hard RO requirements met. In fact the function btrfs_check_options() doesn't really do many complex checks, but hard RO requirement and some feature dependency checks, thus there is no special reason not to do the check for mount reconfiguration. Reported-by: syzbot+56360f93efa90ff15870@syzkaller.appspotmail.com Link: https://lore.kernel.org/linux-btrfs/0000000000008c5d090621cb2770@google.com/ Fixes: f044b318675f ("btrfs: handle the ro->rw transition for mounting different subvolumes") CC: stable@vger.kernel.org # 6.8+ Reviewed-by: Johannes Thumshirn Signed-off-by: Qu Wenruo Signed-off-by: David Sterba --- fs/btrfs/super.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c index 3f8673f97432..926d7a9ed99d 100644 --- a/fs/btrfs/super.c +++ b/fs/btrfs/super.c @@ -1507,8 +1507,7 @@ static int btrfs_reconfigure(struct fs_context *fc) sync_filesystem(sb); set_bit(BTRFS_FS_STATE_REMOUNTING, &fs_info->fs_state); - if (!mount_reconfigure && - !btrfs_check_options(fs_info, &ctx->mount_opt, fc->sb_flags)) + if (!btrfs_check_options(fs_info, &ctx->mount_opt, fc->sb_flags)) return -EINVAL; ret = btrfs_check_features(fs_info, !(fc->sb_flags & SB_RDONLY)); From 75f49c3dc7b7423d3734f2e4dabe3dac8d064338 Mon Sep 17 00:00:00 2001 From: Yue Haibing Date: Tue, 22 Oct 2024 17:52:08 +0800 Subject: [PATCH 7/7] btrfs: fix passing 0 to ERR_PTR in btrfs_search_dir_index_item() The ret may be zero in btrfs_search_dir_index_item() and should not passed to ERR_PTR(). Now btrfs_unlink_subvol() is the only caller to this, reconstructed it to check ERR_PTR(-ENOENT) while ret >= 0. This fixes smatch warnings: fs/btrfs/dir-item.c:353 btrfs_search_dir_index_item() warn: passing zero to 'ERR_PTR' Fixes: 9dcbe16fccbb ("btrfs: use btrfs_for_each_slot in btrfs_search_dir_index_item") CC: stable@vger.kernel.org # 6.1+ Reviewed-by: Johannes Thumshirn Signed-off-by: Yue Haibing Reviewed-by: David Sterba Signed-off-by: David Sterba --- fs/btrfs/dir-item.c | 4 ++-- fs/btrfs/inode.c | 7 ++----- 2 files changed, 4 insertions(+), 7 deletions(-) diff --git a/fs/btrfs/dir-item.c b/fs/btrfs/dir-item.c index 001c0c2f872c..1e8cd7c9472e 100644 --- a/fs/btrfs/dir-item.c +++ b/fs/btrfs/dir-item.c @@ -347,8 +347,8 @@ btrfs_search_dir_index_item(struct btrfs_root *root, struct btrfs_path *path, return di; } /* Adjust return code if the key was not found in the next leaf. */ - if (ret > 0) - ret = 0; + if (ret >= 0) + ret = -ENOENT; return ERR_PTR(ret); } diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index c6e4b58c334c..c6d257fb40af 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -4368,11 +4368,8 @@ static int btrfs_unlink_subvol(struct btrfs_trans_handle *trans, */ if (btrfs_ino(inode) == BTRFS_EMPTY_SUBVOL_DIR_OBJECTID) { di = btrfs_search_dir_index_item(root, path, dir_ino, &fname.disk_name); - if (IS_ERR_OR_NULL(di)) { - if (!di) - ret = -ENOENT; - else - ret = PTR_ERR(di); + if (IS_ERR(di)) { + ret = PTR_ERR(di); btrfs_abort_transaction(trans, ret); goto out; }