linux-next

mirror of https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git synced 2024-12-28 16:52:18 +00:00

Author	SHA1	Message	Date
Anand Jain	eff96dae96	btrfs: introduce RAID1 round-robin read balancing This feature balances I/O across the striped devices when reading from RAID1 blocks. echo round-robin[:min_contiguous_read] > /sys/fs/btrfs/<uuid>/read_policy The min_contiguous_read parameter defines the minimum read size before switching to the next mirrored device. This setting is optional, with a default value of 256 KiB. Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-12-18 02:41:49 +01:00
Anand Jain	6b471f9f5c	btrfs: handle value associated with raid1 balancing parameter This change enables specifying additional configuration values alongside the raid1 balancing / read policy in a single input string. Updated btrfs_read_policy_to_enum() to parse and handle a value associated with the policy in the format `policy:value`, the value part if present is converted 64-bit integer. Update btrfs_read_policy_store() to accommodate the new parameter. Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-12-18 02:41:49 +01:00
Anand Jain	a7b574a1f8	btrfs: add btrfs_read_policy_to_enum helper and refactor read policy store Introduce the `btrfs_read_policy_to_enum` helper function to simplify the conversion of a string read policy to its corresponding enum value. This reduces duplication and improves code clarity in `btrfs_read_policy_store`. The `btrfs_read_policy_store` function has been refactored to use the new helper. The parameter is copied locally to allow modification, enabling the separation of the method and its value. This prepares for the addition of more functionality in subsequent patches. Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-12-18 02:41:49 +01:00
Anand Jain	30680021e7	btrfs: simplify output formatting in btrfs_read_policy_show Refactor the logic in btrfs_read_policy_show() to streamline the formatting of read policies output. Streamline the space and bracket handling around the active policy without altering the functional output. This is in preparation to add more methods. Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-12-18 02:41:49 +01:00
Anand Jain	ca56af5079	btrfs: initialize fs_devices->fs_info earlier Currently, fs_devices->fs_info is initialized in btrfs_init_devices_late(), but this occurs too late for find_live_mirror(), which is invoked by load_super_root() much earlier than btrfs_init_devices_late(). Fix this by moving the initialization to open_ctree(), before load_super_root(). Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-12-18 02:41:48 +01:00
Naohiro Aota	38cc9c2c10	btrfs: zoned: calculate max_zone_append_size properly on non-zoned setup Since commit `559218d43e` ("block: pre-calculate max_zone_append_sectors"), queue_limits's max_zone_append_sectors is default to be 0 and it is only updated when there is a zoned device. So, we have lim->max_zone_append_sectors = 0 when there is no zoned device in the filesystem. That leads to fs_info->max_zone_append_size and fs_info->max_extent_size to be 0, which causes several errors. One example is the divide error as below. Running simple test as btrfs/001 on a non-zoned device with the zoned mode (zoned emulation) leads to this error because we have fs_info->max_extent_size = 0 in count_max_extents(). Oops: divide error: 0000 [#1] PREEMPT SMP KASAN NOPTI CPU: 21 UID: 0 PID: 2378822 Comm: dd Tainted: G W 6.13.0-rc2-kts #1 Tainted: [W]=WARN Hardware name: Supermicro SYS-520P-WTR/X12SPW-TF, BIOS 1.2 02/14/2022 RIP: 0010:btrfs_delalloc_reserve_metadata+0x161/0x790 [btrfs] The block layer logic, having max_zone_append_sectors = 0 by stacking non-zoned devices, seems reasonable to me. So, let's deal with that in btrfs side by ignoring max_zone_append_sectors if it is non-zoned setup. Reported-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com> Fixes: `559218d43e` ("block: pre-calculate max_zone_append_sectors") CC: Christoph Hellwig <hch@lst.de> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-12-18 02:41:31 +01:00
Qu Wenruo	149f7d3c2b	btrfs: enhance ordered extent double freeing detection With recent bugs exposed through run_delalloc_range() failure, the importance of detecting double accounting is obvious. Currently the way to detect such errors is to just check if we underflow the btrfs_ordered_extent::bytes_left member. That's fine but that only shows the length we're trying to decrease, not enough to show the problem. Here we enhance the situation by: - Introduce btrfs_ordered_extent::finished_bitmap This is a new bitmap to indicate which blocks are already finished. This bitmap will be initialized at alloc_ordered_extent() and release when the ordered extent is freed. - Detect any already finished block during can_finish_ordered_extent() If double accounting detected, show the full range we're trying and the bitmap. - Make sure the bitmap is all set when the OE is finished - Show the full range we're finishing for the existing double accounting detection This is to enhance the code to work with the new run_delalloc_range() error messages. This will have extra memory and runtime cost, now an ordered extent can have as large as 4K memory just for the finished_bitmap, and extra operations to detect such double accounting. Thus this double accounting detection is only enabled for CONFIG_BTRFS_DEBUG build for developers. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-12-18 02:41:31 +01:00
Qu Wenruo	d39da2bda9	btrfs: add extra error messages for delalloc range related errors All the error handling bugs I hit so far are all -ENOSPC from either: - cow_file_range() - run_delalloc_nocow() - submit_uncompressed_range() Previously when those functions failed, there is no error message at all, making the debugging much harder. So here we introduce extra error messages for: - cow_file_range() - run_delalloc_nocow() - submit_uncompressed_range() - writepage_delalloc() when btrfs_run_delalloc_range() failed - extent_writepage() when extent_writepage_io() failed One example of the new debug error messages is the following one: run fstests generic/750 at 2024-12-08 12:41:41 BTRFS: device fsid 461b25f5-e240-4543-8deb-e7c2bd01a6d3 devid 1 transid 8 /dev/mapper/test-scratch1 (253:4) scanned by mount (2436600) BTRFS info (device dm-4): first mount of filesystem 461b25f5-e240-4543-8deb-e7c2bd01a6d3 BTRFS info (device dm-4): using crc32c (crc32c-arm64) checksum algorithm BTRFS info (device dm-4): forcing free space tree for sector size 4096 with page size 65536 BTRFS info (device dm-4): using free-space-tree BTRFS warning (device dm-4): read-write for sector size 4096 with page size 65536 is experimental BTRFS info (device dm-4): checking UUID tree BTRFS error (device dm-4): cow_file_range failed, root=363 inode=412 start=503808 len=98304: -28 BTRFS error (device dm-4): run_delalloc_nocow failed, root=363 inode=412 start=503808 len=98304: -28 BTRFS error (device dm-4): failed to run delalloc range, root=363 ino=412 folio=458752 submit_bitmap=11-15 start=503808 len=98304: -28 Which shows an error from cow_file_range() which is called inside a nocow write attempt, along with the extra bitmap from writepage_delalloc(). Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-12-18 02:41:30 +01:00
Qu Wenruo	87c53d275f	btrfs: subpage: dump the involved bitmap when ASSERT() failed For btrfs_folio_assert_not_dirty() and btrfs_folio_set_lock(), we call bitmap_test_range_all_zero() to ensure the involved range has not bit set. However with my recent enhanced delalloc range error handling, I'm hitting the ASSERT() inside btrfs_folio_set_lock(), and is wondering if it's some error handling not properly cleanup the locked bitmap but directly unlock the page. So add some extra dumpping for the ASSERTs to dump the involved bitmap to help debug. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-12-18 02:41:30 +01:00
Qu Wenruo	aba5d9b96a	btrfs: subpage: fix the bitmap dump for the locked flags We're dumping the locked bitmap into the @checked_bitmap variable, causing incorrect values during debug. Thankfuklly even during my development I haven't hit a case where I need to dump the locked bitmap. But for the sake of consistency, fix it by dumpping the locked bitmap into @locked_bitmap variable for output. Fixes: `75258f20fb` ("btrfs: subpage: dump extra subpage bitmaps for debug") Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-12-18 02:41:30 +01:00
Qu Wenruo	b635fbe47a	btrfs: do proper folio cleanup when run_delalloc_nocow() failed [BUG] With CONFIG_DEBUG_VM set, test case generic/476 has some chance to crash with the following VM_BUG_ON_FOLIO(): BTRFS error (device dm-3): cow_file_range failed, start 1146880 end 1253375 len 106496 ret -28 BTRFS error (device dm-3): run_delalloc_nocow failed, start 1146880 end 1253375 len 106496 ret -28 page: refcount:4 mapcount:0 mapping:00000000592787cc index:0x12 pfn:0x10664 aops:btrfs_aops [btrfs] ino:101 dentry name(?):"f1774" flags: 0x2fffff80004028(uptodate\|lru\|private\|node=0\|zone=2\|lastcpupid=0xfffff) page dumped because: VM_BUG_ON_FOLIO(!folio_test_locked(folio)) ------------[ cut here ]------------ kernel BUG at mm/page-writeback.c:2992! Internal error: Oops - BUG: 00000000f2000800 [#1] SMP CPU: 2 UID: 0 PID: 3943513 Comm: kworker/u24:15 Tainted: G OE 6.12.0-rc7-custom+ #87 Tainted: [O]=OOT_MODULE, [E]=UNSIGNED_MODULE Hardware name: QEMU KVM Virtual Machine, BIOS unknown 2/2/2022 Workqueue: events_unbound btrfs_async_reclaim_data_space [btrfs] pc : folio_clear_dirty_for_io+0x128/0x258 lr : folio_clear_dirty_for_io+0x128/0x258 Call trace: folio_clear_dirty_for_io+0x128/0x258 btrfs_folio_clamp_clear_dirty+0x80/0xd0 [btrfs] __process_folios_contig+0x154/0x268 [btrfs] extent_clear_unlock_delalloc+0x5c/0x80 [btrfs] run_delalloc_nocow+0x5f8/0x760 [btrfs] btrfs_run_delalloc_range+0xa8/0x220 [btrfs] writepage_delalloc+0x230/0x4c8 [btrfs] extent_writepage+0xb8/0x358 [btrfs] extent_write_cache_pages+0x21c/0x4e8 [btrfs] btrfs_writepages+0x94/0x150 [btrfs] do_writepages+0x74/0x190 filemap_fdatawrite_wbc+0x88/0xc8 start_delalloc_inodes+0x178/0x3a8 [btrfs] btrfs_start_delalloc_roots+0x174/0x280 [btrfs] shrink_delalloc+0x114/0x280 [btrfs] flush_space+0x250/0x2f8 [btrfs] btrfs_async_reclaim_data_space+0x180/0x228 [btrfs] process_one_work+0x164/0x408 worker_thread+0x25c/0x388 kthread+0x100/0x118 ret_from_fork+0x10/0x20 Code: 910a8021 a90363f7 a9046bf9 94012379 (d4210000) ---[ end trace 0000000000000000 ]--- [CAUSE] The first two lines of extra debug messages show the problem is caused by the error handling of run_delalloc_nocow(). E.g. we have the following dirtied range (4K blocksize 4K page size): 0 16K 32K \|//////////////////////////////////////\| \| Pre-allocated \| And the range [0, 16K) has a preallocated extent. - Enter run_delalloc_nocow() for range [0, 16K) Which found range [0, 16K) is preallocated, can do the proper NOCOW write. - Enter fallback_to_fow() for range [16K, 32K) Since the range [16K, 32K) is not backed by preallocated extent, we have to go COW. - cow_file_range() failed for range [16K, 32K) So cow_file_range() will do the clean up by clearing folio dirty, unlock the folios. Now the folios in range [16K, 32K) is unlocked. - Enter extent_clear_unlock_delalloc() from run_delalloc_nocow() Which is called with PAGE_START_WRITEBACK to start page writeback. But folios can only be marked writeback when it's properly locked, thus this triggered the VM_BUG_ON_FOLIO(). Furthermore there is another hidden but common bug that run_delalloc_nocow() is not clearing the folio dirty flags in its error handling path. This is the common bug shared between run_delalloc_nocow() and cow_file_range(). [FIX] - Clear folio dirty for range [@start, @cur_offset) Introduce a helper, cleanup_dirty_folios(), which will find and lock the folio in the range, clear the dirty flag and start/end the writeback, with the extra handling for the @locked_folio. - Introduce a helper to record the last failed COW range end This is to trace which range we should skip, to avoid double unlocking. - Skip the failed COW range for the error handling Cc: stable@vger.kernel.org Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-12-18 02:41:30 +01:00
Qu Wenruo	df92c54755	btrfs: do proper folio cleanup when cow_file_range() failed [BUG] When testing with COW fixup marked as BUG_ON() (this is involved with the new pin_user_pages() change, which should not result new out-of-band dirty pages), I hit a crash triggered by the BUG_ON() from hitting COW fixup path. This BUG_ON() happens just after a failed btrfs_run_delalloc_range(): BTRFS error (device dm-2): failed to run delalloc range, root 348 ino 405 folio 65536 submit_bitmap 6-15 start 90112 len 106496: -28 ------------[ cut here ]------------ kernel BUG at fs/btrfs/extent_io.c:1444! Internal error: Oops - BUG: 00000000f2000800 [#1] SMP CPU: 0 UID: 0 PID: 434621 Comm: kworker/u24:8 Tainted: G OE 6.12.0-rc7-custom+ #86 Hardware name: QEMU KVM Virtual Machine, BIOS unknown 2/2/2022 Workqueue: events_unbound btrfs_async_reclaim_data_space [btrfs] pc : extent_writepage_io+0x2d4/0x308 [btrfs] lr : extent_writepage_io+0x2d4/0x308 [btrfs] Call trace: extent_writepage_io+0x2d4/0x308 [btrfs] extent_writepage+0x218/0x330 [btrfs] extent_write_cache_pages+0x1d4/0x4b0 [btrfs] btrfs_writepages+0x94/0x150 [btrfs] do_writepages+0x74/0x190 filemap_fdatawrite_wbc+0x88/0xc8 start_delalloc_inodes+0x180/0x3b0 [btrfs] btrfs_start_delalloc_roots+0x174/0x280 [btrfs] shrink_delalloc+0x114/0x280 [btrfs] flush_space+0x250/0x2f8 [btrfs] btrfs_async_reclaim_data_space+0x180/0x228 [btrfs] process_one_work+0x164/0x408 worker_thread+0x25c/0x388 kthread+0x100/0x118 ret_from_fork+0x10/0x20 Code: aa1403e1 9402f3ef aa1403e0 9402f36f (d4210000) ---[ end trace 0000000000000000 ]--- [CAUSE] That failure is mostly from cow_file_range(), where we can hit -ENOSPC. Although the -ENOSPC is already a bug related to our space reservation code, let's just focus on the error handling. For example, we have the following dirty range [0, 64K) of an inode, with 4K sector size and 4K page size: 0 16K 32K 48K 64K \|///////////////////////////////////////\| \|#######################################\| Where \|///\| means page are still dirty, and \|###\| means the extent io tree has EXTENT_DELALLOC flag. - Enter extent_writepage() for page 0 - Enter btrfs_run_delalloc_range() for range [0, 64K) - Enter cow_file_range() for range [0, 64K) - Function btrfs_reserve_extent() only reserved one 16K extent So we created extent map and ordered extent for range [0, 16K) 0 16K 32K 48K 64K \|////////\|//////////////////////////////\| \|<- OE ->\|##############################\| And range [0, 16K) has its delalloc flag cleared. But since we haven't yet submit any bio, involved 4 pages are still dirty. - Function btrfs_reserve_extent() return with -ENOSPC Now we have to run error cleanup, which will clear all EXTENT_DELALLOC flags and clear the dirty flags for the remaining ranges: 0 16K 32K 48K 64K \|////////\| \| \| \| \| Note that range [0, 16K) still has their pages dirty. - Some time later, writeback are triggered again for the range [0, 16K) since the page range still have dirty flags. - btrfs_run_delalloc_range() will do nothing because there is no EXTENT_DELALLOC flag. - extent_writepage_io() find page 0 has no ordered flag Which falls into the COW fixup path, triggering the BUG_ON(). Unfortunately this error handling bug dates back to the introduction of btrfs. Thankfully with the abuse of cow fixup, at least it won't crash the kernel. [FIX] Instead of immediately unlock the extent and folios, we keep the extent and folios locked until either erroring out or the whole delalloc range finished. When the whole delalloc range finished without error, we just unlock the whole range with PAGE_SET_ORDERED (and PAGE_UNLOCK for !keep_locked cases), with EXTENT_DELALLOC and EXTENT_LOCKED cleared. And those involved folios will be properly submitted, with their dirty flags cleared during submission. For the error path, it will be a little more complex: - The range with ordered extent allocated (range (1)) We only clear the EXTENT_DELALLOC and EXTENT_LOCKED, as the remaining flags are cleaned up by btrfs_mark_ordered_io_finished()->btrfs_finish_one_ordered(). For folios we finish the IO (clear dirty, start writeback and immediately finish the writeback) and unlock the folios. - The range with reserved extent but no ordered extent (range(2)) - The range we never touched (range(3)) For both range (2) and range(3) the behavior is not changed. Now even if cow_file_range() failed halfway with some successfully reserved extents/ordered extents, we will keep all folios clean, so there will be no future writeback triggered on them. Cc: stable@vger.kernel.org Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-12-18 02:41:30 +01:00
Qu Wenruo	064709bb67	btrfs: fix the error handling of submit_uncompressed_range() [BUG] If btrfs failed to compress the range, or can not reserve a large enough data extent (e.g. too fragmented free space), btrfs will fall back to submit_uncompressed_range(). But inside submit_uncompressed_range(), run_dealloc_cow() can also fail due to -ENOSPC or whatever other errors. In that case there are 3 bugs in the error handling: 1) Double freeing for the same ordered extent Which can lead to crash due to ordered extent double accounting 2) Start/end writeback without updating the subpage writeback bitmap 3) Unlock the folio without clear the subpage lock bitmap Both bug 2) and 3) will crash the kernel if the btrfs block size is smaller than folio size, as the next time the folio get writeback/lock updates, subpage will find the bitmap already have the range set, triggering an ASSERT(). [CAUSE] Bug 1) happens in the following call chain: submit_uncompressed_range() \|- run_dealloc_cow() \| \|- cow_file_range() \| \|- btrfs_reserve_extent() \| Failed with -ENOSPC or whatever error \| \|- btrfs_clean_up_ordered_extents() \| \|- btrfs_mark_ordered_io_finished() \| Which cleans all the ordered extents in the async_extent range. \| \|- btrfs_mark_ordered_io_finished() Which cleans the folio range. The finished ordered extents may not be immediately removed from the ordered io tree, as they are removed inside a work queue. So the second btrfs_mark_ordered_io_finished() may find the finished but not-yet-removed ordered extents, and double free them. Furthermore, the second btrfs_mark_ordered_io_finished() is not subpage compatible, as it uses fixed folio_pos() with PAGE_SIZE, which can cover other ordered extents. Bug 2) and 3) are more straight forward, btrfs just calls folio_unlock(), folio_start_writeback() and folio_end_writeback(), other than the helpers which handle subpage cases. [FIX] For bug 1) since the first btrfs_cleanup_ordered_extents() call is handling the whole range, we should not do the second btrfs_mark_ordered_io_finished() call. And for the first btrfs_cleanup_ordered_extents(), we no longer need to pass the @locked_page parameter, as we are already in the async extent context, thus will never rely on the error handling inside btrfs_run_delalloc_range(). So just let the btrfs_clean_up_ordered_extents() to handle every folio equally. For bug 2) we should not even call folio_start_writeback()/folio_end_writeback() anymore. As the error handling protocol, cow_file_range() should clear dirty flag and start/finish the writeback for the whole range passed in. For bug 3) just change the folio_unlock() to btrfs_folio_end_lock() helper. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-12-18 02:41:30 +01:00
Qu Wenruo	9c68bdd38f	btrfs: fix double accounting race when extent_writepage_io() failed [BUG] If submit_one_sector() failed inside extent_writepage_io() for sector size < page size cases (e.g. 4K sector size and 64K page size), then we can hit double ordered extent accounting error. This should be very rare, as submit_one_sector() only fails when we failed to grab the extent map, and such extent map should exist inside the memory and have been pinned. [CAUSE] For example we have the following folio layout: 0 4K 32K 48K 60K 64K \|//\| \|//////\| \|///\| Where \|///\| is the dirty range we need to writeback. The 3 different dirty ranges are submitted for regular COW. Now we hit the following sequence: - submit_one_sector() returned 0 for [0, 4K) - submit_one_sector() returned 0 for [32K, 48K) - submit_one_sector() returned error for [60K, 64K) - btrfs_mark_ordered_io_finished() called for the whole folio This will mark the following ranges as finished: * [0, 4K) * [32K, 48K) Both ranges have their IO already submitted, this cleanup will lead to double accounting. * [60K, 64K) That's the correct cleanup. The only good news is, this error is only theoretical, as the target extent map is always pinned, thus we should directly grab it from memory, other than reading it from the disk. [FIX] Instead of calling btrfs_mark_ordered_io_finished() for the whole folio range, which can touch ranges we should not touch, instead move the error handling inside extent_writepage_io(). So that we can cleanup exact sectors that are ought to be submitted but failed. This provide much more accurate cleanup, avoiding the double accounting. Cc: stable@vger.kernel.org # 5.15+ Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-12-18 02:41:30 +01:00
Qu Wenruo	e863c615af	btrfs: fix double accounting race when btrfs_run_delalloc_range() failed [BUG] There are several double accounting case, where the WARN_ON_ONCE() is triggered inside can_finish_ordered_extent(). And all such cases points back to the btrfs_mark_ordered_io_finished() call inside extent_writepage() when it hits some error. [CAUSE] With extra debug patches to show where the error is from, it turns out to be btrfs_run_delalloc_range() can fail with -ENOSPC. Such failure itself is already a symptom of some bad data/metadata space reservation, but here we need to focus on the error handling part. For example, we have the following dirty page layout (4K sector size and 4K page size): 0 16K 32K \|/////\|/////\|/////\|/////\|/////\|/////\|/////\|/////\| Where the range [0, 32K) is dirty and we need to write all the 8 pages back. When handling the first page 0, we go the following sequence: - btrfs_run_delalloc_range() for range [0, 32k) We enter cow_file_range() for [0, 32K) - btrfs_reserve_extent() only returned a 16K data extent. This can be caused by fragmentation, and it's already an indication we're almost running of space. Now we have the following layout: 0 16K 32K \|<----- Reserved ------>\|/////\|/////\|/////\|/////\| The range [0, 16K) has ordered extent allocated. - btrfs_reserve_extent() returned -ENOSPC We really run out of space. But since we have reserved space for range [0, 16K) we need to clean them up. But that cleanup for ordered extent only happens inside btrfs_run_delalloc_range(). - btrfs_run_delalloc_range() cleanup the reserved ordered extent By calling btrfs_mark_ordered_io_finished() for range [0, 32K). It will locate the ordered extent [0, 16K) and mark it as IOERR. Also since the ordered extent is only 16K, we're finishing the whole ordered extent. Thus we call btrfs_queue_ordered_fn() to queue to finish the ordered extent. But still, the ordered extent [0, 16K) is still in the btrfs_inode::ordered_tree. - extent_writepage() cleanup the ordered extent inside the folio We call btrfs_mark_ordered_io_finished() for range [0, 4K). Since the finished ordered extent [0, 16K) is not yet removed (racy, depends on when btrfs_finish_one_ordered() is called), if btrfs_mark_ordered_io_finished() is called before btrfs_finish_one_ordered(), we will double account and trigger the warning inside can_finish_ordered_extent(). So the root cause is, we're relying on btrfs_mark_ordered_io_finished() to handle ranges which is already cleaned up. Unfortunately the bug dates back to the early days when btrfs_mark_ordered_io_finished() is introduced as a no-brain choice for error paths, but such no-brain solution just hides all the race and make us less cautious when handling errors. [FIX] Instead of relying on the btrfs_mark_ordered_io_finished() call to cleanup the whole folio range, record the last successfully ran delalloc range. And combined with bio_ctrl->submit_bitmap to properly clean up any newly created ordered extents. Since we have cleaned up the ordered extents in range, we should not rely on the btrfs_mark_ordered_io_finished() inside extent_writepage() anymore. By this, we ensure btrfs_mark_ordered_io_finished() is only called once when writepage_delalloc() failed. Cc: stable@vger.kernel.org # 5.15+ Fixes: `e65f152e43` ("btrfs: refactor how we finish ordered extent io for endio functions") Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-12-18 02:41:30 +01:00
Qu Wenruo	d4519378e4	btrfs: validate system chunk array at btrfs_validate_super() Currently btrfs_validate_super() only does a very basic check on the array chunk size (not too large than the available space, but not too small to contain no chunk). The more comprehensive checks (the regular chunk checks and size check inside the system chunk array) is all done inside btrfs_read_sys_array(). It's not a big deal, but for the sake of concentrated verification, we should validate the system chunk array at the time of super block validation. So this patch does the following modification: - Introduce a helper btrfs_check_system_chunk_array() * Validate the disk key * Validate the size before we access the full chunk/stripe items. * Do the full chunk item validation - Call btrfs_check_system_chunk_array() at btrfs_validate_super() - Simplify the checks inside btrfs_read_sys_array() Now the checks will be converted to an ASSERT(). - Simplify the checks inside read_one_chunk() Now all chunk items inside system chunk array and chunk tree is verified, there is no need to verify it again inside read_one_chunk(). This change has the following advantages: - More comprehensive checks at write time Although this also means extra memcpy() for the superblocks at write time, due to the limits that we need a dummy extent buffer to utilize all the extent buffer helpers. - Slightly improved readablity when iterating the system chunk array Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-12-18 02:40:49 +01:00
Qu Wenruo	5bb49ec59a	btrfs: scrub: use generic ratelimit helpers to output error messages Currently scrub goes different ways to rate limits its error messages: - regular btrfs_err_rl_in_rcu() helper for repaired sectors and the initial message for unrepaired sectors - Manually rate limits scrub_print_common_warning() I'd say the different rate limits could lead to cases where we only got the "unable to fixup (regular) error" messages but the detailed report about that corruption is ratelimited. To make the whole rate limit works more consistently, change the rate limit by: - Always using btrfs_*_rl() helpers - Remove the initial "unable to fixup (regular) error" message Since we're ensured to have at least one error message for each unrepaired sector (before rate limit), there is no need for a duplicated line. And if we hit rate limits, we will rate limit all the error messages together, not treating different error messages differently. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-12-18 02:40:49 +01:00
Qu Wenruo	fba78d9f9b	btrfs: scrub: ensure we output at least one error message for unrepaired corruption For btrfs scrub error messages, I have hit a lot of support cases on older kernels where no filename is outputted at all, with only error messages like: BTRFS error (device dm-0): bdev /dev/mapper/sys-rootlv errs: wr 0, rd 0, flush 0, corrupt 2823, gen 0 BTRFS error (device dm-0): unable to fixup (regular) error at logical 1563504640 on dev /dev/mapper/sys-rootlv BTRFS error (device dm-0): bdev /dev/mapper/sys-rootlv errs: wr 0, rd 0, flush 0, corrupt 2824, gen 0 BTRFS error (device dm-0): unable to fixup (regular) error at logical 1579016192 on dev /dev/mapper/sys-rootlv BTRFS error (device dm-0): bdev /dev/mapper/sys-rootlv errs: wr 0, rd 0, flush 0, corrupt 2825, gen 0 The "unable to fixup (regular) error" line shows we hit an unrepairable error, then normally we would do data/metadata backref walk to grab the correct info. But we can hit cases like the inode is already orphan (unlinked from any parent inode), or even the subvolume is orphan (unlinked and waiting to be deleted). In that case we're not sure what's the proper way to continue (Is it some critical data/metadata? Would it prevent the system from booting?) To improve the situation, this patch would: - Ensure we at least output one message for each unrepairable error If by somehow we didn't output any error message for the error, we always fallback to the basic logical/physical error output, with its type (data or metadata). Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-12-18 02:40:49 +01:00
Qu Wenruo	985a948c8a	btrfs: scrub: simplify the inode iteration output The following two output are not really needed: - nlinks Normally file inodes should have nlinks as 1, for those inodes have multiple hard links, the inode/root number is still enough to pin it down to certain inode. - size The size is always fixed to sector size. By removing the nlinks output, we can reduce one inode item lookup. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-12-18 02:40:49 +01:00
Qu Wenruo	861fc3eab7	btrfs: scrub: remove unnecessary dev/physical lookup for scrub_stripe_report_errors() The @stripe passed into scrub_stripe_report_errors() either has stripe->dev and stripe->physical properly populated (regular data stripes) or stripe->flags would have SCRUB_STRIPE_FLAG_NO_REPORT (RAID56 P/Q stripes). Thus there is no need to go with btrfs_map_block() to get the dev/physical. Just add an extra ASSERT() to make sure we get stripe->dev populated correctly. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-12-18 02:40:49 +01:00
Qu Wenruo	ae88f4cfa4	btrfs: scrub: remove unused is_super parameter from scrub_print_common_warning() Since commit `2a2dc22f7e` ("btrfs: scrub: use dedicated super block verification function to scrub one super block"), the super block scrubbing is handled in a dedicated helper, thus scrub_print_common_warning() no longer needs to print warning for super block errors. Just remove the parameter. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-12-18 02:40:49 +01:00
Qu Wenruo	df8f0d1132	btrfs: reduce the log level for btrfs_dev_stat_inc_and_print() Currently when we increase the device statistics, it would always lead to an error message in the kernel log. However this output is mostly duplicated with the existing ones: - For scrub operations We always have the following messages: * "fixed up error at logical %llu" * "unable to fixup (regular) error at logical %llu" So no matter if the corruption is repaired or not, it scrub would output an error message to indicate the problem. - For non-scrub read operations We also have the following messages: * "csum failed root %lld inode %llu off %llu" for data csum mismatch * "bad (tree block start\|fsid\|tree block level)" for metadata * "read error corrected: ino %llu off %llu" for repaired data/metadata So the error message from btrfs_dev_stat_inc_and_print() is duplicated. The real usage for the btrfs device statistics is for some user space daemon to check if there is any new errors, acting like some checks on SMART, thus we don't really need/want those messages in dmesg. This patch would reduce the log level to debug (disabled by default) for btrfs_dev_stat_inc_and_print(). For users really want to utilize btrfs devices statistics, they should go check "btrfs device stats" periodically, and we should focus the kernel error messages to more important things. CC: stable@vger.kernel.org Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-12-18 02:40:48 +01:00
Qu Wenruo	3c43979ad2	btrfs: scrub: fix incorrectly reported logical/physical address [BUG] Scrub is not reporting the correct logical/physical address, it can be verified by the following script: # mkfs.btrfs -f $dev1 # mount $dev1 $mnt # xfs_io -f -c "pwrite -S 0xaa 0 128k" $mnt/file1 # umount $mnt # xfs_io -f -c "pwrite -S 0xff 13647872 4k" $dev1 # mount $dev1 $mnt # btrfs scrub start -fB $mnt # umount $mnt Note above 13647872 is the physical address for logical 13631488 + 4K. Scrub would report the following error: BTRFS error (device dm-2): unable to fixup (regular) error at logical 13631488 on dev /dev/mapper/test-scratch1 physical 13631488 BTRFS warning (device dm-2): checksum error at logical 13631488 on dev /dev/mapper/test-scratch1, physical 13631488, root 5, inode 257, offset 0, length 4096, links 1 (path: file1) On the other hand, "btrfs check --check-data-csum" is reporting the correct logical/physical address: Checking filesystem on /dev/test/scratch1 UUID: db2eb621-b09d-4f24-8199-da17dc7b3201 [5/7] checking csums against data mirror 1 bytenr 13647872 csum 0x13fec125 expected csum 0x656bd64e ERROR: errors found in csum tree [CAUSE] In the function scrub_stripe_report_errors(), we always use the stripe->logical and its physical address to print the error message, not taking the sector number into consideration at all. [FIX] Fix the error reporting function by calculating logical/physical with the sector number. Now the scrub report is correct: BTRFS error (device dm-2): unable to fixup (regular) error at logical 13647872 on dev /dev/mapper/test-scratch1 physical 13647872 BTRFS warning (device dm-2): checksum error at logical 13647872 on dev /dev/mapper/test-scratch1, physical 13647872, root 5, inode 257, offset 16384, length 4096, links 1 (path: file1) Fixes: `0096580713` ("btrfs: scrub: introduce error reporting functionality for scrub_stripe") CC: stable@vger.kernel.org #6.4+ Reviewed-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-12-18 02:40:48 +01:00
David Sterba	fc42bf9d7a	btrfs: handle unexpected parent block offset in btrfs_alloc_tree_block() Change a BUG_ON to a proper error handling, here it checks that a root other than reloc tree does not see a non-zero offset. This is set by btrfs_force_cow_block() and is a special case so the check makes sure it's not accidentally set by other callers. Reviewed-by: Boris Burkov <boris@bur.io> Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-12-18 02:40:48 +01:00
David Sterba	e3dea2dbc0	btrfs: === misc-next on b-for-next === Any commits after this one are for testing and evaluation only. Signed-off-by: David Sterba <dsterba@suse.com>	2024-12-18 02:40:48 +01:00
Filipe Manana	ebd8327fe7	btrfs: use uuid_is_null() to verify if an uuid is empty At btrfs_is_empty_uuid() we have our custom code to check if an uuid is empty, however there a kernel uuid library that has a function named uuid_is_null() which does the same and probably more efficient. So change btrfs_is_empty_uuid() to use uuid_is_null(), which is almost a directly replacement, it just wraps the necessary casting since our uuid types are u8 arrays while the uuid kernel library uses the uuid_t type, which is just a typedef of an u8 array of 16 elements as well. Also since the function is now to trivial, make it a static inline function in fs.h. Suggested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-12-18 02:32:28 +01:00
Julian Sun	a298aba81d	btrfs: fix transaction atomicity bug when enabling simple quotas set squota incompat bit before committing the transaction that enables the feature With the config CONFIG_BTRFS_ASSERT enabled, an assertion failure occurs regarding the simple quota feature. [ 5.596534] assertion failed: btrfs_fs_incompat(fs_info, SIMPLE_QUOTA), in fs/btrfs/qgroup.c:365 [ 5.597098] ------------[ cut here ]------------ [ 5.597371] kernel BUG at fs/btrfs/qgroup.c:365! [ 5.597946] CPU: 1 UID: 0 PID: 268 Comm: mount Not tainted 6.13.0-rc2-00031-gf92f4749861b #146 [ 5.598450] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-debian-1.16.2-1 04/01/2014 [ 5.599008] RIP: 0010:btrfs_read_qgroup_config+0x74d/0x7a0 [ 5.604303] <TASK> [ 5.605230] ? btrfs_read_qgroup_config+0x74d/0x7a0 [ 5.605538] ? exc_invalid_op+0x56/0x70 [ 5.605775] ? btrfs_read_qgroup_config+0x74d/0x7a0 [ 5.606066] ? asm_exc_invalid_op+0x1f/0x30 [ 5.606441] ? btrfs_read_qgroup_config+0x74d/0x7a0 [ 5.606741] ? btrfs_read_qgroup_config+0x74d/0x7a0 [ 5.607038] ? try_to_wake_up+0x317/0x760 [ 5.607286] open_ctree+0xd9c/0x1710 [ 5.607509] btrfs_get_tree+0x58a/0x7e0 [ 5.608002] vfs_get_tree+0x2e/0x100 [ 5.608224] fc_mount+0x16/0x60 [ 5.608420] btrfs_get_tree+0x2f8/0x7e0 [ 5.608897] vfs_get_tree+0x2e/0x100 [ 5.609121] path_mount+0x4c8/0xbc0 [ 5.609538] __x64_sys_mount+0x10d/0x150 The issue can be easily reproduced using the following reproduer: root@q:linux# cat repro.sh set -e mkfs.btrfs -f /dev/sdb > /dev/null mount /dev/sdb /mnt/btrfs btrfs quota enable -s /mnt/btrfs umount /mnt/btrfs mount /dev/sdb /mnt/btrfs The issue is that when enabling quotas, at btrfs_quota_enable(), we set BTRFS_QGROUP_STATUS_FLAG_SIMPLE_MODE at fs_info->qgroup_flags and persist it in the quota root in the item with the key BTRFS_QGROUP_STATUS_KEY, but we only set the incompat bit BTRFS_FEATURE_INCOMPAT_SIMPLE_QUOTA after we commit the transaction used to enable simple quotas. This means that if after that transaction commit we unmount the filesystem without starting and committing any other transaction, or we have a power failure, the next time we mount the filesystem we will find the flag BTRFS_QGROUP_STATUS_FLAG_SIMPLE_MODE set in the item with the key BTRFS_QGROUP_STATUS_KEY but we will not find the incompat bit BTRFS_FEATURE_INCOMPAT_SIMPLE_QUOTA set in the superblock, triggering an assertion failure at: btrfs_read_qgroup_config() -> qgroup_read_enable_gen() To fix this issue, set the BTRFS_FEATURE_INCOMPAT_SIMPLE_QUOTA flag immediately after setting the BTRFS_QGROUP_STATUS_FLAG_SIMPLE_MODE. This ensures that both flags are flushed to disk within the same transaction. Fixes: `182940f4f4` ("btrfs: qgroup: add new quota mode for simple quotas") Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Julian Sun <sunjunchao2870@gmail.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-12-18 02:32:28 +01:00
Filipe Manana	515979e1d8	btrfs: remove pointless comment from ctree.h It's pointless to have a comment above the prototype declarations of btrfs_ctree_init() and btrfs_ctree_exit() mentioning that they are declared in ctree.c. This is from the old days when ctree.h was used to place anything that didn't fit in any other file. So remove it. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-12-18 02:32:28 +01:00
Filipe Manana	3902cb1b9d	btrfs: move extent-tree function declarations out of ctree.h We have 3 functions that have their prototypes declared in ctree.h but they are defined at extent-tree.c and they are unrelated to the btree data structure. Move the prototypes out of ctree.h and into extent-tree.h. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-12-18 02:32:28 +01:00
Filipe Manana	bc1d3d0046	btrfs: move btrfs_alloc_write_mask() into fs.h Currently btrfs_alloc_write_mask() is defined in ctree.h but it's not related at all to the btree data structure, so move it into fs.h. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-12-18 02:32:27 +01:00
Filipe Manana	d555752f59	btrfs: move BTRFS_BYTES_TO_BLKS() into fs.h Currently BTRFS_BYTES_TO_BLKS() is defined in ctree.h but it's not related at all to the btree data structure, so move it into fs.h. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-12-18 02:32:27 +01:00
Filipe Manana	7dc9ea3b2d	btrfs: move the folio ordered helpers from ctree.h into fs.h The folio ordered helper macros are defined at ctree.h but this is not the best place since ctree.{h,c} is all about the btree data structure implementation and not a generic module. So move these macros into the fs.h header. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-12-18 02:32:27 +01:00
Filipe Manana	ed49aa9fb3	btrfs: move btrfs_is_empty_uuid() from ioctl.c into fs.c It's a generic helper not specific to ioctls and used in several places, so move it out from ioctl.c and into fs.c. While at it change its return type from int to bool and declare the loop variable in the loop itself. This also slightly reduces the module's size. Before this change: $ size fs/btrfs/btrfs.ko text data bss dec hex filename 1781492 161037 16920 `1959449` 1de619 fs/btrfs/btrfs.ko After this change: $ size fs/btrfs/btrfs.ko text data bss dec hex filename 1781340 161037 16920 1959297 1de581 fs/btrfs/btrfs.ko Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-12-18 02:32:27 +01:00
Filipe Manana	a52943f27a	btrfs: move the exclusive operation functions into fs.c The declarations for the exclusive operation functions are located at fs.h but their definitions are in ioctl.c, which doesn't make much sense since (most of them) are used in several files other than ioctl.c. Since they are used in several files and they are generic enough, move them out of ioctl.c and into fs.c, even the ones that are currently only used at ioctl.c, for the sake of having them all in the same C file. This also reduces the module's size. Before this change: $ size fs/btrfs/btrfs.ko text data bss dec hex filename 1782094 161045 16920 1960059 1de87b fs/btrfs/btrfs.ko After this change: $ size fs/btrfs/btrfs.ko text data bss dec hex filename 1781492 161037 16920 `1959449` 1de619 fs/btrfs/btrfs.ko Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-12-18 02:32:26 +01:00
Filipe Manana	cfbdb7750c	btrfs: move csum related functions from ctree.c into fs.c The ctree module is about the implementation of the btree data structure and not a place holder for generic filesystem things like the csum algorithm details. Move the functions related to the csum algorithm details away from ctree.c and into fs.c, which is a far better place for them. Also fix missing punctuation in comments and change one multiline comment to a single line comment since everything fits in under 80 characters. For some reason this also sligthly reduces the module's size. Before this change: $ size fs/btrfs/btrfs.ko text data bss dec hex filename 1782126 161045 16920 1960091 1de89b fs/btrfs/btrfs.ko After this change: $ size fs/btrfs/btrfs.ko text data bss dec hex filename 1782094 161045 16920 1960059 1de87b fs/btrfs/btrfs.ko Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-12-18 02:32:26 +01:00
Filipe Manana	0f1523dc0e	btrfs: move abort_should_print_stack() to transaction.h The function abort_should_print_stack() is declared in transaction.h but its definition is in ctree.c, which doesn't make sense since ctree.c is the btree implementation and the function is related to the transaction code. Move its definition into transaction.h as an inline function since it's a very short and trivial function, and also add the 'btrfs_' prefix into its name. This change also reduces the module size. Before this change: $ size fs/btrfs/btrfs.ko text data bss dec hex filename 1783148 161137 16920 1961205 1decf5 fs/btrfs/btrfs.ko After this change: $ size fs/btrfs/btrfs.ko text data bss dec hex filename 1782126 161045 16920 1960091 1de89b fs/btrfs/btrfs.ko Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-12-18 02:32:26 +01:00
Johannes Thumshirn	d6a31cd040	btrfs: pass btrfs_io_geometry to is_single_device_io Now that we have the stripe tree decision saved in struct btrfs_io_geometry we can pass it into is_single_device_io() and get rid of another call to btrfs_need_raid_stripe_tree_update(). Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-12-18 02:32:25 +01:00
Johannes Thumshirn	dd88750692	btrfs: cache RAID stripe tree decision in btrfs_io_context Cache the decision if a particular I/O needs to update RAID stripe tree entries in struct btrfs_io_context. Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-12-18 02:32:25 +01:00
Johannes Thumshirn	f31fcf0b29	btrfs: cache stripe tree usage in io_geometry Cache the return of btrfs_need_stripe_tree_update() in struct btrfs_io_geometry. Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-12-18 02:32:25 +01:00
Filipe Manana	31dd35f120	btrfs: add assertions and comment about path expectations to btrfs_cross_ref_exist() We should always call check_delayed_ref() with a path having a locked leaf from the extent tree where either the extent item is located or where it should be located in case it doesn't exist yet (when there's a pending unflushed delayed ref to do it), as we need to lock any existing delayed ref head while holding such leaf locked in order to avoid races with flushing delayed references, which could make us think an extent is not shared when it really is. So add some assertions and a comment about such expectations to btrfs_cross_ref_exist(), which is the only caller of check_delayed_ref(). Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-12-18 02:32:25 +01:00
Filipe Manana	68de37aa69	btrfs: add function comment for check_committed_ref() There are some not immediately obvious details about the operation of check_committed_ref(), namely that when it returns 0 it must return with the path having a locked leaf from the extent tree that contains the extent's extent item, so that we can later check for delayed refs when calling check_delayed_ref() in a way that doesn't race with a task running delayed references. For similar reasons, it must also return with a locked leaf when the extent item is not found, and that leaf is where the extent item should be located, because we may have delayed references that are going to create the extent item. Also document that the function can return false positives in order to not be too slow, and that the most important is to not return false negatives. So add a function comment to check_committed_ref(). Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-12-18 02:32:24 +01:00
Filipe Manana	dffd80e3d7	btrfs: simplify arguments for btrfs_cross_ref_exist() Instead of passing a root and an objectid which matches an inode number, pass the inode instead, since the root is always the root associated to the inode and the objectid is the number of that inode. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-12-18 02:32:24 +01:00
Filipe Manana	5cb68c6eac	btrfs: simplify return logic at check_committed_ref() Instead of setting the value to return in a local variable 'ret' and then jumping into a label named 'out' that does nothing but return that value, simplify everything by getting rid of the label and directly returning a value. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-12-18 02:32:24 +01:00
Filipe Manana	ad6cf0bcc3	btrfs: avoid redundant call to get inline ref type at check_committed_ref() At check_committed_ref() we are calling btrfs_get_extent_inline_ref_type() twice, once before we check if have an inline extent owner ref (for simple qgroups) and then once again sometime after that check. This second call is redundant when we have simple quotas disabled or we found an inline ref that is not of the owner ref type. So avoid this second call unless we have simple quotas enabled and found an owner ref, saving a function call that does inline ref validation again. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-12-18 02:32:24 +01:00
Filipe Manana	841c77efc3	btrfs: remove the snapshot check from check_committed_ref() At check_committed_ref() we have this check to see if the data extent was created in a generation lower than or equals to the generation where the last snapshot for the root was created, and if so we return immediately with 1, since it's very likely the extent is shared, referenced by other root. The only call chain for check_committed_ref() is the following: can_nocow_file_extent() btrfs_cross_ref_exist() check_committed_ref() And we already do that snapshot check at can_nocow_file_extent(), before we call btrfs_cross_ref_exist(). This makes the check done at check_committed_ref() redundant, so remove it. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-12-18 02:32:23 +01:00
Filipe Manana	2a01169cbf	btrfs: remove no longer needed strict argument from can_nocow_extent() All callers of can_nocow_extent() now pass a value of false for its 'strict' argument, making it redundant. So remove the argument from can_nocow_extent() as well as can_nocow_file_extent(), btrfs_cross_ref_exist() and check_committed_ref(), because this argument was used just to influence the behavior of check_committed_ref(). Also remove the 'strict' field from struct can_nocow_file_extent_args, which is now always false as well, as its value is taken from the argument to can_nocow_extent(). Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-12-18 02:32:23 +01:00
Filipe Manana	95418f4c79	btrfs: avoid monopolizing a core when activating a swap file During swap activation we iterate over the extents of a file and we can have many thounsands of them, so we can end up in a busy loop monopolizing a core. Avoid this by doing a voluntary reschedule after processing each extent. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-12-18 02:32:23 +01:00
Filipe Manana	c95d6a3344	btrfs: allow swap activation to be interruptible During swap activation we iterate over the extents of a file, then do several checks for each extent, some of which may take some significant time such as checking if an extent is shared. Since a file can have many thousands of extents, this can be a very slow operation and it's currently not interruptible. I had a bug during development of a previous patch that resulted in an infinite loop when iterating the extents, so a core was busy looping and I couldn't cancel the operation, which is very annoying and requires a reboot. So make the loop interruptible by checking for fatal signals at the end of each iteration and stopping immediately if there is one. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-12-18 02:32:22 +01:00
Filipe Manana	ab427e6d83	btrfs: fix swap file activation failure due to extents that used to be shared When activating a swap file, to determine if an extent is shared we use can_nocow_extent(), which ends up at btrfs_cross_ref_exist(). That helper is meant to be quick because it's used in the NOCOW write path, when flushing delalloc and when doing a direct IO write, however it does return some false positives, meaning it may indicate that an extent is shared even if it's no longer the case. For the write path this is fine, we just do a unnecessary COW operation instead of doing a more rigorous check which would be too heavy (calling btrfs_is_data_extent_shared()). However when activating a swap file, the false positives simply result in a failure, which is confusing for users/applications. One particular case where this happens is when a data extent only has 1 reference but that reference is not inlined in the extent item located in the extent tree - this happens when we create more than 33 references for an extent and then delete those 33 references plus every other non-inline reference except one. The function check_committed_ref() assumes that if the size of an extent item doesn't match the size of struct btrfs_extent_item plus the size of an inline reference (plus an owner reference in case simple quotas are enabled), then the extent is shared - that is not the case however, we can have a single reference but it's not inlined - the reason we do this is to be fast and avoid inspecting non-inline references which may be located in another leaf of the extent tree, slowing down write paths. The following test script reproduces the bug: $ cat test.sh #!/bin/bash DEV=/dev/sdi MNT=/mnt/sdi NUM_CLONES=50 umount $DEV &> /dev/null run_test() { local sync_after_add_reflinks=$1 local sync_after_remove_reflinks=$2 mkfs.btrfs -f $DEV > /dev/null #mkfs.xfs -f $DEV > /dev/null mount $DEV $MNT touch $MNT/foo chmod 0600 $MNT/foo # On btrfs the file must be NOCOW. chattr +C $MNT/foo &> /dev/null xfs_io -s -c "pwrite -b 1M 0 1M" $MNT/foo mkswap $MNT/foo for ((i = 1; i <= $NUM_CLONES; i++)); do touch $MNT/foo_clone_$i chmod 0600 $MNT/foo_clone_$i # On btrfs the file must be NOCOW. chattr +C $MNT/foo_clone_$i &> /dev/null cp --reflink=always $MNT/foo $MNT/foo_clone_$i done if [ $sync_after_add_reflinks -ne 0 ]; then # Flush delayed refs and commit current transaction. sync -f $MNT fi # Remove the original file and all clones except the last. rm -f $MNT/foo for ((i = 1; i < $NUM_CLONES; i++)); do rm -f $MNT/foo_clone_$i done if [ $sync_after_remove_reflinks -ne 0 ]; then # Flush delayed refs and commit current transaction. sync -f $MNT fi # Now use the last clone as a swap file. It should work since # its extent are not shared anymore. swapon $MNT/foo_clone_${NUM_CLONES} swapoff $MNT/foo_clone_${NUM_CLONES} umount $MNT } echo -e "\nTest without sync after creating and removing clones" run_test 0 0 echo -e "\nTest with sync after creating clones" run_test 1 0 echo -e "\nTest with sync after removing clones" run_test 0 1 echo -e "\nTest with sync after creating and removing clones" run_test 1 1 Running the test: $ ./test.sh Test without sync after creating and removing clones wrote 1048576/1048576 bytes at offset 0 1 MiB, 1 ops; 0.0017 sec (556.793 MiB/sec and 556.7929 ops/sec) Setting up swapspace version 1, size = 1020 KiB (1044480 bytes) no label, UUID=a6b9c29e-5ef4-4689-a8ac-bc199c750f02 swapon: /mnt/sdi/foo_clone_50: swapon failed: Invalid argument swapoff: /mnt/sdi/foo_clone_50: swapoff failed: Invalid argument Test with sync after creating clones wrote 1048576/1048576 bytes at offset 0 1 MiB, 1 ops; 0.0036 sec (271.739 MiB/sec and 271.7391 ops/sec) Setting up swapspace version 1, size = 1020 KiB (1044480 bytes) no label, UUID=5e9008d6-1f7a-4948-a1b4-3f30aba20a33 swapon: /mnt/sdi/foo_clone_50: swapon failed: Invalid argument swapoff: /mnt/sdi/foo_clone_50: swapoff failed: Invalid argument Test with sync after removing clones wrote 1048576/1048576 bytes at offset 0 1 MiB, 1 ops; 0.0103 sec (96.665 MiB/sec and 96.6651 ops/sec) Setting up swapspace version 1, size = 1020 KiB (1044480 bytes) no label, UUID=916c2740-fa9f-4385-9f06-29c3f89e4764 Test with sync after creating and removing clones wrote 1048576/1048576 bytes at offset 0 1 MiB, 1 ops; 0.0031 sec (314.268 MiB/sec and 314.2678 ops/sec) Setting up swapspace version 1, size = 1020 KiB (1044480 bytes) no label, UUID=06aab1dd-4d90-49c0-bd9f-3a8db4e2f912 swapon: /mnt/sdi/foo_clone_50: swapon failed: Invalid argument swapoff: /mnt/sdi/foo_clone_50: swapoff failed: Invalid argument Fix this by reworking btrfs_swap_activate() to instead of using extent maps and checking for shared extents with can_nocow_extent(), iterate over the inode's file extent items and use the accurate btrfs_is_data_extent_shared(). Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-12-18 02:32:22 +01:00
Filipe Manana	72fdda4a2f	btrfs: fix race with memory mapped writes when activating swap file When activating the swap file we flush all delalloc and wait for ordered extent completion, so that we don't miss any delalloc and extents before we check that the file's extent layout is usable for a swap file and activate the swap file. We are called with the inode's VFS lock acquired, so we won't race with buffered and direct IO writes, however we can still race with memory mapped writes since they don't acquire the inode's VFS lock. The race window is between flushing all delalloc and locking the whole file's extent range, since memory mapped writes lock an extent range with the length of a page. Fix this by acquiring the inode's mmap lock before we flush delalloc. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-12-18 02:31:35 +01:00
Boris Burkov	bc215833ab	btrfs: check folio mapping after unlock in put_file_data() When we call btrfs_read_folio() we get an unlocked folio, so it is possible for a different thread to concurrently modify folio->mapping. We must check that this hasn't happened once we do have the lock. CC: stable@vger.kernel.org # 6.12+ Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>	2024-12-18 02:23:51 +01:00
Boris Burkov	51d5fedbc2	btrfs: check folio mapping after unlock in relocate_one_folio() When we call btrfs_read_folio() to bring a folio uptodate, we unlock the folio. The result of that is that a different thread can modify the mapping (like remove it with invalidate) before we call folio_lock(). This results in an invalid page and we need to try again. In particular, if we are relocating concurrently with aborting a transaction, this can result in a crash like the following: BUG: kernel NULL pointer dereference, address: 0000000000000000 PGD 0 P4D 0 Oops: 0000 [#1] SMP CPU: 76 PID: 1411631 Comm: kworker/u322:5 Workqueue: events_unbound btrfs_reclaim_bgs_work RIP: 0010:set_page_extent_mapped+0x20/0xb0 RSP: 0018:ffffc900516a7be8 EFLAGS: 00010246 RAX: ffffea009e851d08 RBX: ffffea009e0b1880 RCX: 0000000000000000 RDX: 0000000000000000 RSI: ffffc900516a7b90 RDI: ffffea009e0b1880 RBP: 0000000003573000 R08: 0000000000000001 R09: ffff88c07fd2f3f0 R10: 0000000000000000 R11: 0000194754b575be R12: 0000000003572000 R13: 0000000003572fff R14: 0000000000100cca R15: 0000000005582fff FS: 0000000000000000(0000) GS:ffff88c07fd00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000000 CR3: 000000407d00f002 CR4: 00000000007706f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 PKRU: 55555554 Call Trace: <TASK> ? __die+0x78/0xc0 ? page_fault_oops+0x2a8/0x3a0 ? __switch_to+0x133/0x530 ? wq_worker_running+0xa/0x40 ? exc_page_fault+0x63/0x130 ? asm_exc_page_fault+0x22/0x30 ? set_page_extent_mapped+0x20/0xb0 relocate_file_extent_cluster+0x1a7/0x940 relocate_data_extent+0xaf/0x120 relocate_block_group+0x20f/0x480 btrfs_relocate_block_group+0x152/0x320 btrfs_relocate_chunk+0x3d/0x120 btrfs_reclaim_bgs_work+0x2ae/0x4e0 process_scheduled_works+0x184/0x370 worker_thread+0xc6/0x3e0 ? blk_add_timer+0xb0/0xb0 kthread+0xae/0xe0 ? flush_tlb_kernel_range+0x90/0x90 ret_from_fork+0x2f/0x40 ? flush_tlb_kernel_range+0x90/0x90 ret_from_fork_asm+0x11/0x20 </TASK> This occurs because cleanup_one_transaction() calls destroy_delalloc_inodes() which calls invalidate_inode_pages2() which takes the folio_lock before setting mapping to NULL. We fail to check this, and subsequently call set_extent_mapping(), which assumes that mapping != NULL (in fact it asserts that in debug mode) Note that the "fixes" patch here is not the one that introduced the race (the very first iteration of this code from 2009) but a more recent change that made this particular crash happen in practice. Fixes: `e7f1326cc2` ("btrfs: set page extent mapped after read_folio in relocate_one_page") CC: stable@vger.kernel.org # 6.1+ Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>	2024-12-18 01:20:16 +01:00
Olga Kornievskaia	9048cf05a1	NFSD: fix management of pending async copies Currently the pending_async_copies count is decremented just before a struct nfsd4_copy is destroyed. After commit `aa0ebd21df` ("NFSD: Add nfsd4_copy time-to-live") nfsd4_copy structures sticks around for 10 lease periods after the COPY itself has completed, the pending_async_copies count stays high for a long time. This causes NFSD to avoid the use of background copy even though the actual background copy workload might no longer be running. In this patch, decrement pending_async_copies once async copy thread is done processing the copy work. Fixes: `aa0ebd21df` ("NFSD: Add nfsd4_copy time-to-live") Signed-off-by: Olga Kornievskaia <okorniev@redhat.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>	2024-12-17 16:35:53 -05:00
Christian Brauner	ef5bbd2a28	Merge branch 'vfs-6.14.mount' into vfs.all	2024-12-17 21:42:07 +01:00
Christian Brauner	2b26e73aae	Merge branch 'kernel-6.14.cred' into vfs.all	2024-12-17 21:41:51 +01:00
Christian Brauner	f72c407e9e	Merge branch 'vfs-6.14.pidfs' into vfs.all	2024-12-17 21:41:50 +01:00
Christian Brauner	4554288d75	Merge branch 'vfs-6.14.misc' into vfs.all	2024-12-17 21:41:49 +01:00
Christian Brauner	6a6d921e15	Merge branch 'vfs-6.14.kcore' into vfs.all	2024-12-17 21:41:48 +01:00
Christian Brauner	a529c02c8c	Merge branch 'vfs-6.14.netfs' into vfs.all	2024-12-17 21:41:47 +01:00
Christian Brauner	578eb3b6a9	fs: use xarray for old mount id While the ida does use the xarray internally we can use it explicitly which allows us to increment the unique mount id under the xa lock. This allows us to remove the atomic as we're now allocating both ids in one go. Link: https://lore.kernel.org/r/20241217-erhielten-regung-44bb1604ca8f@brauner Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-12-17 21:41:25 +01:00
Christian Brauner	b782c5efe7	fs: cache first and last mount Speed up listmount() by caching the first and last node making retrieval of the first and last mount of each mount namespace O(1). Link: https://lore.kernel.org/r/20241215-vfs-6-14-mount-work-v1-2-fd55922c4af8@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-12-17 21:41:24 +01:00
Christian Brauner	5a4e727d1a	fs: kill MNT_ONRB Move mnt->mnt_node into the union with mnt->mnt_rcu and mnt->mnt_llist instead of keeping it with mnt->mnt_list. This allows us to use RB_CLEAR_NODE(&mnt->mnt_node) in umount_tree() as well as list_empty(&mnt->mnt_node). That in turn allows us to remove MNT_ONRB. Link: https://lore.kernel.org/r/20241215-vfs-6-14-mount-work-v1-1-fd55922c4af8@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-12-17 21:41:23 +01:00
Christian Brauner	63ef7e3494	fs: simplify rwlock to spinlock We're not taking the read_lock() anymore now that all lookup is lockless. Just use a simple spinlock. Link: https://lore.kernel.org/r/20241213-work-mount-rbtree-lockless-v3-6-6e3cdaf9b280@kernel.org Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-12-17 21:41:22 +01:00
Christian Brauner	208fc3fb60	fs: lockless mntns lookup for nsfs We already made the rbtree lookup lockless for the simple lookup case. However, walking the list of mount namespaces via nsfs still happens with taking the read lock blocking concurrent additions of new mount namespaces pointlessly. Plus, such additions are rare anyway so allow lockless lookup of the previous and next mount namespace by keeping a separate list. This also allows to make some things simpler in the code. Link: https://lore.kernel.org/r/20241213-work-mount-rbtree-lockless-v3-5-6e3cdaf9b280@kernel.org Reviewed-by: Jeff Layton <jlayton@kernel.org> Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-12-17 21:41:22 +01:00
Christian Brauner	b31421b665	fs: lockless mntns rbtree lookup Currently we use a read-write lock but for the simple search case we can make this lockless. Creating a new mount namespace is a rather rare event compared with querying mounts in a foreign mount namespace. Once this is picked up by e.g., systemd to list mounts in another mount in it's isolated services or in containers this will be used a lot so this seems worthwhile doing. Link: https://lore.kernel.org/r/20241213-work-mount-rbtree-lockless-v3-3-6e3cdaf9b280@kernel.org Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-12-17 21:41:21 +01:00
Qu Wenruo	dfb92681a1	btrfs: tree-checker: reject inline extent items with 0 ref count [BUG] There is a bug report in the mailing list where btrfs_run_delayed_refs() failed to drop the ref count for logical 25870311358464 num_bytes 2113536. The involved leaf dump looks like this: item 166 key (25870311358464 168 2113536) itemoff 10091 itemsize 50 extent refs 1 gen 84178 flags 1 ref#0: shared data backref parent 32399126528000 count 0 <<< ref#1: shared data backref parent 31808973717504 count 1 Notice the count number is 0. [CAUSE] There is no concrete evidence yet, but considering 0 -> 1 is also a single bit flipped, it's possible that hardware memory bitflip is involved, causing the on-disk extent tree to be corrupted. [FIX] To prevent us reading such corrupted extent item, or writing such damaged extent item back to disk, enhance the handling of BTRFS_EXTENT_DATA_REF_KEY and BTRFS_SHARED_DATA_REF_KEY keys for both inlined and key items, to detect such 0 ref count and reject them. CC: stable@vger.kernel.org # 5.4+ Link: https://lore.kernel.org/linux-btrfs/7c69dd49-c346-4806-86e7-e6f863a66f48@app.fastmail.com/ Reported-by: Frankie Fisher <frankie@terrorise.me.uk> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-12-17 19:54:32 +01:00
Christoph Hellwig	be691b5e59	btrfs: split bios to the fs sector size boundary Btrfs like other file systems can't really deal with I/O not aligned to it's internal block size (which strangely is called sector size in btrfs, for historical reasons), but the block layer split helper doesn't even know about that. Round down the split boundary so that all I/Os are aligned. Fixes: `d5e4377d50` ("btrfs: split zone append bios in btrfs_submit_bio") CC: stable@vger.kernel.org # 6.12 Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-12-17 19:54:32 +01:00
Christoph Hellwig	6c3864e055	btrfs: use bio_is_zone_append() in the completion handler Otherwise it won't catch bios turned into regular writes by the block level zone write plugging. The additional test it adds is for emulated zone append. Fixes: `9b1ce7f0c6` ("block: Implement zone append emulation") CC: stable@vger.kernel.org # 6.12 Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-12-17 19:54:32 +01:00
Josef Bacik	d75d72a858	btrfs: fix improper generation check in snapshot delete We have been using the following check if (generation <= root->root_key.offset) to make decisions about whether or not to visit a node during snapshot delete. This is because for normal subvolumes this is set to 0, and for snapshots it's set to the creation generation. The idea being that if the generation of the node is less than or equal to our creation generation then we don't need to visit that node, because it doesn't belong to us, we can simply drop our reference and move on. However reloc roots don't have their generation stored in root->root_key.offset, instead that is the objectid of their corresponding fs root. This means we can incorrectly not walk into nodes that need to be dropped when deleting a reloc root. There are a variety of consequences to making the wrong choice in two distinct areas. visit_node_for_delete() 1. False positive. We think we are newer than the block when we really aren't. We don't visit the node and drop our reference to the node and carry on. This would result in leaked space. 2. False negative. We do decide to walk down into a block that we should have just dropped our reference to. However this means that the child node will have refs > 1, so we will switch to UPDATE_BACKREF, and then the subsequent walk_down_proc() will notice that btrfs_header_owner(node) != root->root_key.objectid and it'll break out of the loop, and then walk_up_proc() will drop our reference, so this appears to be ok. do_walk_down() 1. False positive. We are in UPDATE_BACKREF and incorrectly decide that we are done and don't need to update the backref for our lower nodes. This is another case that simply won't happen with relocation, as we only have to do UPDATE_BACKREF if the node below us was shared and didn't have FULL_BACKREF set, and since we don't own that node because we're a reloc root we actually won't end up in this case. 2. False negative. Again this is tricky because as described above, we simply wouldn't be here from relocation, because we don't own any of the nodes because we never set btrfs_header_owner() to the reloc root objectid, and we always use FULL_BACKREF, we never actually need to set FULL_BACKREF on any children. Having spent a lot of time stressing relocation/snapshot delete recently I've not seen this pop in practice. But this is objectively incorrect, so fix this to get the correct starting generation based on the root we're dropping to keep me from thinking there's a problem here. CC: stable@vger.kernel.org Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-12-17 19:54:32 +01:00
Linus Torvalds	ed90ed56e4	Changes since last update: - Fix (pcluster) memory leak and (sbi) UAF after umounting; - Fix a case of PSI memstall mis-accounting; - Use buffered I/Os by default for file-backed mounts. -----BEGIN PGP SIGNATURE----- iQJFBAABCgAvFiEEQ0A6bDUS9Y+83NPFUXZn5Zlu5qoFAmdhgxQRHHhpYW5nQGtl cm5lbC5vcmcACgkQUXZn5Zlu5qrLmxAAm621Zq5Jz+AlN2HvBpyfIjD8eXtdCEd6 8r6+2e5aw8HpZKyKBo1ET3gTSA9KO4FbdZl0S9e+SfPJDa/Tak4e5mzaF8su1LnS bzg3MQwU8W7bahsKn6OOnC4pTFvKL1ZdLvujbqjEDYXEP2cUEjxtZbHPpbTCRpte lhbN9444lfJevtyaNK92SP5NQjPYNDN0J6QJZIZuRMB9IDA2zsiuzBnqUVMkGbRx iiH3gsWo0l554RXY81rMwLLHMsW79Qc5fBD2pmkzzp1ioH8YyY0+aylZi/ps9tcr xgOGZNKJT3fouhPVSE/QMdiqlNZW8qd/jwc3S0l8yeYn55pHftKCC0wysrGkXjVw ODHU6WYWSNtZ2uxCU44lDKVnse4fIksFX7w1/BZer7dZy8kUNZ4hexLQp+kSBpFs QKK3bJpN85GfNndk9X+vk6MFPHpEougJNiywVMAPCa55heeCMTES+vW5WjpIBjuz hyU26y5xELAbK4T+VmNlNh16LEbV1rUyvBHaq4vhVJensEQQu8pusqQH0gMYZi3l Bn5drLmsSG6zaMeeBc14609f3IBJBgkzIi7G5wFuIK4viqcRkh0nCf1c6D10vgST G+8CTwks6c2TTHANvIPzs3Ciw6FTBQym/CJSItPcoLpc5xoDfcAYA2uuCyhz9khZ A3kR3lNe0e0= =Idg8 -----END PGP SIGNATURE----- Merge tag 'erofs-for-6.13-rc4-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs Pull erofs fixes from Gao Xiang: "The first one fixes a syzbot UAF report caused by a commit introduced in this cycle, but it also addresses a longstanding memory leak. The second one resolves a PSI memstall mis-accounting issue. The remaining patches switch file-backed mounts to use buffered I/Os by default instead of direct I/Os, since the page cache of underlay files is typically valid and maybe even dirty. This change also aligns with the default policy of loopback devices. A mount option has been added to try to use direct I/Os explicitly. Summary: - Fix (pcluster) memory leak and (sbi) UAF after umounting - Fix a case of PSI memstall mis-accounting - Use buffered I/Os by default for file-backed mounts" * tag 'erofs-for-6.13-rc4-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs: erofs: use buffered I/O for file-backed mounts by default erofs: reference `struct erofs_device_info` for erofs_map_dev erofs: use `struct erofs_device_info` for the primary device erofs: add erofs_sb_free() helper MAINTAINERS: erofs: update Yue Hu's email address erofs: fix PSI memstall accounting erofs: fix rare pcluster memory leak after unmounting	2024-12-17 09:04:42 -08:00
Zhang Kunbo	bedb4e6088	fs/nfs: fix missing declaration of nfs_idmap_cache_timeout fs/nfs/super.c should include fs/nfs/nfs4idmap.h for declaration of nfs_idmap_cache_timeout. This fixes the sparse warning: fs/nfs/super.c:1397:14: warning: symbol 'nfs_idmap_cache_timeout' was not declared. Should it be static? Signed-off-by: Zhang Kunbo <zhangkunbo@huawei.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>	2024-12-17 11:14:20 -05:00
Trond Myklebust	62e2a47cea	NFS/pnfs: Fix a live lock between recalled layouts and layoutget When the server is recalling a layout, we should ignore the count of outstanding layoutget calls, since the server is expected to return either NFS4ERR_RECALLCONFLICT or NFS4ERR_RETURNCONFLICT for as long as the recall is outstanding. Currently, we may end up livelocking, causing the layout to eventually be forcibly revoked. Fixes: `bf0291dd22` ("pNFS: Ensure LAYOUTGET and LAYOUTRETURN are properly serialised") Cc: stable@vger.kernel.org Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>	2024-12-17 11:10:55 -05:00
Yang Erkun	69d803c40e	nfsd: Revert "nfsd: release svc_expkey/svc_export with rcu_work" This reverts commit `f8c989a0c8`. Before this commit, svc_export_put or expkey_put will call path_put with sync mode. After this commit, path_put will be called with async mode. And this can lead the unexpected results show as follow. mkfs.xfs -f /dev/sda echo "/ (rw,no_root_squash,fsid=0)" > /etc/exports echo "/mnt (rw,no_root_squash,fsid=1)" >> /etc/exports exportfs -ra service nfs-server start mount -t nfs -o vers=4.0 127.0.0.1:/mnt /mnt1 mount /dev/sda /mnt/sda touch /mnt1/sda/file exportfs -r umount /mnt/sda # failed unexcepted The touch will finally call nfsd_cross_mnt, add refcount to mount, and then add cache_head. Before this commit, exportfs -r will call cache_flush to cleanup all cache_head, and path_put in svc_export_put/expkey_put will be finished with sync mode. So, the latter umount will always success. However, after this commit, path_put will be called with async mode, the latter umount may failed, and if we add some delay, umount will success too. Personally I think this bug and should be fixed. We first revert before bugfix patch, and then fix the original bug with a different way. Fixes: `f8c989a0c8` ("nfsd: release svc_expkey/svc_export with rcu_work") Signed-off-by: Yang Erkun <yangerkun@huawei.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>	2024-12-17 09:45:23 -05:00
Zhang Kunbo	2b2fc0be98	fs: fix missing declaration of init_files fs/file.c should include include/linux/init_task.h for declaration of init_files. This fixes the sparse warning: fs/file.c:501:21: warning: symbol 'init_files' was not declared. Should it be static? Signed-off-by: Zhang Kunbo <zhangkunbo@huawei.com> Link: https://lore.kernel.org/r/20241217071836.2634868-1-zhangkunbo@huawei.com Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-12-17 13:38:46 +01:00
Yuezhang Mo	d141e72aef	exfat: fix the new buffer was not zeroed before writing In exfat, not only the newly allocated space will be mapped as the new buffer, but also the space between ->valid_size and the file size will be mapped as the new buffer. If the buffer is mapped as new in ->write_begin(), it will be zeroed. But if the buffer has been mapped as new before ->write_begin(), ->write_begin() will not zero them, resulting in access to uninitialized data. So this commit uses folio_zero_new_buffers() to zero the new buffers after ->write_begin(). Fixes: `6630ea4910` ("exfat: move extend valid_size into ->page_mkwrite()") Reported-by: syzbot+91ae49e1c1a2634d20c0@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=91ae49e1c1a2634d20c0 Tested-by: syzbot+91ae49e1c1a2634d20c0@syzkaller.appspotmail.com Signed-off-by: Yuezhang Mo <Yuezhang.Mo@sony.com> Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>	2024-12-17 20:22:08 +09:00
Yuezhang Mo	4265319f10	exfat: fix the infinite loop in exfat_readdir() If the file system is corrupted so that a cluster is linked to itself in the cluster chain, and there is an unused directory entry in the cluster, 'dentry' will not be incremented, causing condition 'dentry < max_dentries' unable to prevent an infinite loop. This infinite loop causes s_lock not to be released, and other tasks will hang, such as exfat_sync_fs(). This commit stops traversing the cluster chain when there is unused directory entry in the cluster to avoid this infinite loop. Reported-by: syzbot+205c2644abdff9d3f9fc@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=205c2644abdff9d3f9fc Tested-by: syzbot+205c2644abdff9d3f9fc@syzkaller.appspotmail.com Fixes: `ca06197382` ("exfat: add directory operations") Signed-off-by: Yuezhang Mo <Yuezhang.Mo@sony.com> Reviewed-by: Sungjong Seo <sj1557.seo@samsung.com> Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>	2024-12-17 20:22:05 +09:00
Yuezhang Mo	ffee32cf4b	exfat: fix the infinite loop in __exfat_free_cluster() In __exfat_free_cluster(), the cluster chain is traversed until the EOF cluster. If the cluster chain includes a loop due to file system corruption, the EOF cluster cannot be traversed, resulting in an infinite loop. To avoid this infinite loop, this commit changes to only traverse and free the number of clusters indicated by the file size. Reported-by: syzbot+1de5a37cb85a2d536330@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=1de5a37cb85a2d536330 Tested-by: syzbot+1de5a37cb85a2d536330@syzkaller.appspotmail.com Fixes: `31023864e6` ("exfat: add fat entry operations") Suggested-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Yuezhang Mo <Yuezhang.Mo@sony.com> Reviewed-by: Sungjong Seo <sj1557.seo@samsung.com> Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>	2024-12-17 20:22:02 +09:00
Yuezhang Mo	70465acbb0	exfat: fix exfat_find_empty_entry() not returning error on failure On failure, "dentry" is the error code. If the error code indicates that there is no space, a new cluster may need to be allocated; for other errors, it should be returned directly. Only on success, "dentry" is the index of the directory entry, and it needs to be converted into the directory entry index within the cluster where it is located. Fixes: `8a3f5711ad` ("exfat: reduce FAT chain traversal") Reported-by: syzbot+6f6c9397e0078ef60bce@syzkaller.appspotmail.com Tested-by: syzbot+6f6c9397e0078ef60bce@syzkaller.appspotmail.com Signed-off-by: Yuezhang Mo <Yuezhang.Mo@sony.com> Reviewed-by: Sungjong Seo <sj1557.seo@samsung.com> Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>	2024-12-17 20:21:59 +09:00
Frederic Weisbecker	0936fadc01	treewide: Introduce kthread_run_worker[_on_cpu]() kthread_create() creates a kthread without running it yet. kthread_run() creates a kthread and runs it. On the other hand, kthread_create_worker() creates a kthread worker and runs it. This difference in behaviours is confusing. Also there is no way to create a kthread worker and affine it using kthread_bind_mask() or kthread_affine_preferred() before starting it. Consolidate the behaviours and introduce kthread_run_worker[_on_cpu]() that behaves just like kthread_run(). kthread_create_worker[_on_cpu]() will now only create a kthread worker without starting it. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>	2024-12-17 11:02:42 +01:00
Frederic Weisbecker	2ff8f9a3b1	kthread: Unify kthread_create_on_cpu() and kthread_create_worker_on_cpu() automatic format kthread_create_on_cpu() uses the CPU argument as an implicit and unique printf argument to add to the format whereas kthread_create_worker_on_cpu() still relies on explicitly passing the printf arguments. This difference in behaviour is error prone and doesn't help standardizing per-CPU kthread names. Unify the behaviours and convert kthread_create_worker_on_cpu() to use the printf behaviour of kthread_create_on_cpu(). Signed-off-by: Frederic Weisbecker <frederic@kernel.org>	2024-12-17 11:02:42 +01:00
Christian Brauner	16ecd47cb0	pidfs: lookup pid through rbtree The new pid inode number allocation scheme is neat but I overlooked a possible, even though unlikely, attack that can be used to trigger an overflow on both 32bit and 64bit. An unique 64 bit identifier was constructed for each struct pid by two combining a 32 bit idr with a 32 bit generation number. A 32bit number was allocated using the idr_alloc_cyclic() infrastructure. When the idr wrapped around a 32 bit wraparound counter was incremented. The 32 bit wraparound counter served as the upper 32 bits and the allocated idr number as the lower 32 bits. Since the idr can only allocate up to INT_MAX entries everytime a wraparound happens INT_MAX - 1 entries are lost (Ignoring that numbering always starts at 2 to avoid theoretical collisions with the root inode number.). If userspace fully populates the idr such that and puts itself into control of two entries such that one entry is somewhere in the middle and the other entry is the INT_MAX entry then it is possible to overflow the wraparound counter. That is probably difficult to pull off but the mere possibility is annoying. The problem could be contained to 32 bit by switching to a data structure such as the maple tree that allows allocating 64 bit numbers on 64 bit machines. That would leave 32 bit in a lurch but that probably doesn't matter that much. The other problem is that removing entries form the maple tree is somewhat non-trivial because the removal code can be called under the irq write lock of tasklist_lock and irq{save,restore} code. Instead, allocate unique identifiers for struct pid by simply incrementing a 64 bit counter and insert each struct pid into the rbtree so it can be looked up to decode file handles avoiding to leak actual pids across pid namespaces in file handles. On both 64 bit and 32 bit the same 64 bit identifier is used to lookup struct pid in the rbtree. On 64 bit the unique identifier for struct pid simply becomes the inode number. Comparing two pidfds continues to be as simple as comparing inode numbers. On 32 bit the 64 bit number assigned to struct pid is split into two 32 bit numbers. The lower 32 bits are used as the inode number and the upper 32 bits are used as the inode generation number. Whenever a wraparound happens on 32 bit the 64 bit number will be incremented by 2 so inode numbering starts at 2 again. When a wraparound happens on 32 bit multiple pidfds with the same inode number are likely to exist. This isn't a problem since before pidfs pidfds used the anonymous inode meaning all pidfds had the same inode number. On 32 bit sserspace can thus reconstruct the 64 bit identifier by retrieving both the inode number and the inode generation number to compare, or use file handles. This gives the same guarantees on both 32 bit and 64 bit. Link: https://lore.kernel.org/r/20241214-gekoppelt-erdarbeiten-a1f9a982a5a6@brauner Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-12-17 09:16:18 +01:00
Christian Brauner	8ce3528188	pidfs: check for valid ioctl commands Prior to doing any work, check whether the provided ioctl command is supported by pidfs. Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-12-17 09:16:18 +01:00
Christian Brauner	b3caba8f7a	pidfs: implement file handle support On 64-bit platforms, userspace can read the pidfd's inode in order to get a never-repeated PID identifier. On 32-bit platforms this identifier is not exposed, as inodes are limited to 32 bits. Instead expose the identifier via export_fh, which makes it available to userspace via name_to_handle_at. In addition we implement fh_to_dentry, which allows userspace to recover a pidfd from a pidfs file handle. Signed-off-by: Erin Shepherd <erin.shepherd@e43.eu> [brauner: patch heavily rewritten] Link: https://lore.kernel.org/r/20241129-work-pidfs-file_handle-v1-6-87d803a42495@kernel.org Reviewed-by: Amir Goldstein <amir73il@gmail.com> Co-Developed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-12-17 09:16:17 +01:00
Christian Brauner	c220e216d6	exportfs: add permission method This allows filesystems such as pidfs to provide their custom permission checks. Link: https://lore.kernel.org/r/20241129-work-pidfs-file_handle-v1-5-87d803a42495@kernel.org Reviewed-by: Amir Goldstein <amir73il@gmail.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-12-17 09:16:11 +01:00
Kees Cook	543841d180	exec: fix up /proc/pid/comm in the execveat(AT_EMPTY_PATH) case Zbigniew mentioned at Linux Plumber's that systemd is interested in switching to execveat() for service execution, but can't, because the contents of /proc/pid/comm are the file descriptor which was used, instead of the path to the binary[1]. This makes the output of tools like top and ps useless, especially in a world where most fds are opened CLOEXEC so the number is truly meaningless. When the filename passed in is empty (e.g. with AT_EMPTY_PATH), use the dentry's filename for "comm" instead of using the useless numeral from the synthetic fdpath construction. This way the actual exec machinery is unchanged, but cosmetically the comm looks reasonable to admins investigating things. Instead of adding TASK_COMM_LEN more bytes to bprm, use one of the unused flag bits to indicate that we need to set "comm" from the dentry. Suggested-by: Zbigniew Jędrzejewski-Szmek <zbyszek@in.waw.pl> Suggested-by: Tycho Andersen <tandersen@netflix.com> Suggested-by: Al Viro <viro@zeniv.linux.org.uk> Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Link: https://github.com/uapi-group/kernel-features#set-comm-field-before-exec [1] Reviewed-by: Aleksa Sarai <cyphar@cyphar.com> Tested-by: Zbigniew Jędrzejewski-Szmek <zbyszek@in.waw.pl> Signed-off-by: Kees Cook <kees@kernel.org>	2024-12-16 16:54:00 -08:00
Kees Cook	3a3f61ce5e	exec: Make sure task->comm is always NUL-terminated Using strscpy() meant that the final character in task->comm may be non-NUL for a moment before the "string too long" truncation happens. Instead of adding a new use of the ambiguous strncpy(), we'd want to use memtostr_pad() which enforces being able to check at compile time that sizes are sensible, but this requires being able to see string buffer lengths. Instead of trying to inline __set_task_comm() (which needs to call trace and perf functions), just open-code it. But to make sure we're always safe, add compile-time checking like we already do for get_task_comm(). Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Suggested-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Kees Cook <kees@kernel.org>	2024-12-16 16:53:00 -08:00
Ilya Dryomov	18d44c5d06	ceph: allocate sparse_ext map only for sparse reads If mounted with sparseread option, ceph_direct_read_write() ends up making an unnecessarily allocation for O_DIRECT writes. Fixes: `03bc06c7b0` ("ceph: add new mount option to enable sparse reads") Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Alex Markuze <amarkuze@redhat.com>	2024-12-16 23:25:44 +01:00
Ilya Dryomov	66e0c4f914	ceph: fix memory leak in ceph_direct_read_write() The bvecs array which is allocated in iter_get_bvecs_alloc() is leaked and pages remain pinned if ceph_alloc_sparse_ext_map() fails. There is no need to delay the allocation of sparse_ext map until after the bvecs array is set up, so fix this by moving sparse_ext allocation a bit earlier. Also, make a similar adjustment in __ceph_sync_read() for consistency (a leak of the same kind in __ceph_sync_read() has been addressed differently). Cc: stable@vger.kernel.org Fixes: `03bc06c7b0` ("ceph: add new mount option to enable sparse reads") Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Alex Markuze <amarkuze@redhat.com>	2024-12-16 23:25:44 +01:00
Alex Markuze	9abee47580	ceph: improve error handling and short/overflow-read logic in __ceph_sync_read() This patch refines the read logic in __ceph_sync_read() to ensure more predictable and efficient behavior in various edge cases. - Return early if the requested read length is zero or if the file size (`i_size`) is zero. - Initialize the index variable (`idx`) where needed and reorder some code to ensure it is always set before use. - Improve error handling by checking for negative return values earlier. - Remove redundant encrypted file checks after failures. Only attempt filesystem-level decryption if the read succeeded. - Simplify leftover calculations to correctly handle cases where the read extends beyond the end of the file or stops short. This can be hit by continuously reading a file while, on another client, we keep truncating and writing new data into it. - This resolves multiple issues caused by integer and consequent buffer overflow (`pages` array being accessed beyond `num_pages`): - https://tracker.ceph.com/issues/67524 - https://tracker.ceph.com/issues/68980 - https://tracker.ceph.com/issues/68981 Cc: stable@vger.kernel.org Fixes: `1065da21e5` ("ceph: stop copying to iter at EOF on sync reads") Reported-by: Luis Henriques (SUSE) <luis.henriques@linux.dev> Signed-off-by: Alex Markuze <amarkuze@redhat.com> Reviewed-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2024-12-16 23:25:43 +01:00
Ilya Dryomov	12eb22a5a6	ceph: validate snapdirname option length when mounting It becomes a path component, so it shouldn't exceed NAME_MAX characters. This was hardened in commit `c152737be2` ("ceph: Use strscpy() instead of strcpy() in __get_snap_name()"), but no actual check was put in place. Cc: stable@vger.kernel.org Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Alex Markuze <amarkuze@redhat.com>	2024-12-16 23:25:43 +01:00
Max Kellermann	550f7ca98e	ceph: give up on paths longer than PATH_MAX If the full path to be built by ceph_mdsc_build_path() happens to be longer than PATH_MAX, then this function will enter an endless (retry) loop, effectively blocking the whole task. Most of the machine becomes unusable, making this a very simple and effective DoS vulnerability. I cannot imagine why this retry was ever implemented, but it seems rather useless and harmful to me. Let's remove it and fail with ENAMETOOLONG instead. Cc: stable@vger.kernel.org Reported-by: Dario Weißer <dario@cure53.de> Signed-off-by: Max Kellermann <max.kellermann@ionos.com> Reviewed-by: Alex Markuze <amarkuze@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2024-12-16 23:25:43 +01:00
Max Kellermann	d6fd6f8280	ceph: fix memory leaks in __ceph_sync_read() In two `break` statements, the call to ceph_release_page_vector() was missing, leaking the allocation from ceph_alloc_page_vector(). Instead of adding the missing ceph_release_page_vector() calls, the Ceph maintainers preferred to transfer page ownership to the `ceph_osd_request` by passing `own_pages=true` to osd_req_op_extent_osd_data_pages(). This requires postponing the ceph_osdc_put_request() call until after the block that accesses the `pages`. Cc: stable@vger.kernel.org Fixes: `03bc06c7b0` ("ceph: add new mount option to enable sparse reads") Fixes: `f0fe1e54cf` ("ceph: plumb in decryption during reads") Signed-off-by: Max Kellermann <max.kellermann@ionos.com> Reviewed-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2024-12-16 23:25:43 +01:00
Kent Overstreet	ca2e7a3de8	bcachefs: Fix assert for online fsck We can't check if we're racing with fsck ending until mark_lock is held. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-16 17:12:47 -05:00
Kent Overstreet	b5677d4d8d	bcachefs: Handle -BCH_ERR_need_mark_replicas in gc Locking considerations (possibly no longer relevant?) mean that when an accounting update needs a new superblock replicas entry to be created, it's deferred to the transaction commit error path. But accounting updates for gc/fcsk aren't done from the transaction commit path - so we need to handle -BCH_ERR_btree_insert_need_mark_replicas locally. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-16 14:14:08 -05:00
Kent Overstreet	7b5ddd26bc	bcachefs: Write lock btree node in key cache fills this addresses a key cache coherency bug Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-16 14:01:32 -05:00
Kent Overstreet	989229db3f	bcachefs: kill __bch2_btree_iter_flags() bch2_btree_iter_flags() now takes a level parameter; this fixes a bug where using a node iterator on a leaf wouldn't set BTREE_ITER_with_key_cache, leading to fun cache coherency bugs. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-16 14:01:32 -05:00
Kent Overstreet	83fa58a370	bcachefs: Drop redundant "read error" call from btree_gc The btree node read error path already calls topology error, so this is entirely redundant, and we're not specific enough about our error codes - this was triggering for bucket_ref_update() errors. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-16 14:01:32 -05:00
Kent Overstreet	52084849f3	bcachefs: Drop racy warning Checking for writing past i_size after unlocking the folio and clearing the dirty bit is racy, and we already check it at the start. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-16 14:01:32 -05:00
Kent Overstreet	959dbb09ed	bcachefs: better check_bp_exists() error message Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-16 14:01:32 -05:00
Hongbo Li	92b9e40732	bcachefs: add counter_flags for counters In bcachefs, io_read and io_write counter record the amount of data which has been read and written. They increase in unit of sector, so to display correctly, they need to be shifted to the left by the size of a sector. Other counters like io_move, move_extent_{read, write, finish} also have this problem. In order to support different unit, we add extra column to mark the counter type by using TYPE_COUNTER and TYPE_SECTORS in BCH_PERSISTENT_COUNTERS(). Fixes: `1c6fdbd8f2` ("bcachefs: Initial commit") Signed-off-by: Hongbo Li <lihongbo22@huawei.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-16 14:01:32 -05:00
Kent Overstreet	10e485bba0	bcachefs: bcachefs_metadata_version_autofix_errors It's time to make self healing the default: change the error action for old filesystems to fix_safe, matching the default for current filesystems. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-16 14:01:32 -05:00
Kent Overstreet	7dbe48b963	bcachefs: bcachefs_metadata_version_persistent_inode_cursors Persistent cursors for inode allocation. A free inodes btree would add substantial overhead to inode allocation and freeing - a "next num to allocate" cursor is always going to be faster. We just need it to be persistent, to avoid scanning the inodes btree from the start on startup. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-16 14:01:32 -05:00
Dmitry Antipov	76f01376df	f2fs: ensure that node info flags are always initialized Syzbot has reported the following KMSAN splat: BUG: KMSAN: uninit-value in f2fs_new_node_page+0x1494/0x1630 f2fs_new_node_page+0x1494/0x1630 f2fs_new_inode_page+0xb9/0x100 f2fs_init_inode_metadata+0x176/0x1e90 f2fs_add_inline_entry+0x723/0xc90 f2fs_do_add_link+0x48f/0xa70 f2fs_symlink+0x6af/0xfc0 vfs_symlink+0x1f1/0x470 do_symlinkat+0x471/0xbc0 __x64_sys_symlink+0xcf/0x140 x64_sys_call+0x2fcc/0x3d90 do_syscall_64+0xd9/0x1b0 entry_SYSCALL_64_after_hwframe+0x77/0x7f Local variable new_ni created at: f2fs_new_node_page+0x9d/0x1630 f2fs_new_inode_page+0xb9/0x100 So adjust 'f2fs_get_node_info()' to ensure that 'flag' field of 'struct node_info' is always initialized. Reported-by: syzbot+5141f6db57a2f7614352@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=5141f6db57a2f7614352 Fixes: `e05df3b115` ("f2fs: add node operations") Suggested-by: Chao Yu <chao@kernel.org> Signed-off-by: Dmitry Antipov <dmantipov@yandex.ru> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2024-12-16 16:12:54 +00:00
Yongpeng Yang	e9a844f6e4	f2fs: The GC triggered by ioctl also needs to mark the segno as victim In SSR mode, the segment selected for allocation might be the same as the target segment of the GC triggered by ioctl, resulting in the GC moving the CURSEG_I(sbi, type)->segno. Thread A Thread B or Thread A - f2fs_ioc_gc_range - __f2fs_ioc_gc_range(.victim_segno=segno#N) - f2fs_gc - __get_victim - f2fs_get_victim : segno#N is valid, return segno#N as source segment of GC - f2fs_allocate_data_block - need_new_seg - get_ssr_segment - f2fs_get_victim : get segno #N as destination segment - change_curseg Fixes: `e066b83c9b` ("f2fs: add ioctl to flush data from faster device to cold area") Signed-off-by: Yongpeng Yang <yangyongpeng1@oppo.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2024-12-16 16:12:29 +00:00
zangyangyang1	5f65945427	f2fs: cache more dentry pages While traversing dir entries in dentry page, it's better to refresh current accessed page in lru list by using FGP_ACCESSED flag, otherwise, such page may has less chance to survive during memory reclaim, result in causing additional IO when revisiting dentry page. Signed-off-by: zangyangyang1 <zangyangyang1@xiaomi.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2024-12-16 16:12:28 +00:00
Matthew Wilcox (Oracle)	c910a64bc4	f2fs: Remove calls to folio_file_mapping() All folios that f2fs sees belong to f2fs and not to the swapcache so it can dereference folio->mapping directly like all other filesystems do. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2024-12-16 16:12:26 +00:00
Matthew Wilcox (Oracle)	19bbd306dd	f2fs: Convert __read_io_type() to take a folio Remove the last call to page_file_mapping() as both callers can now pass in a folio. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2024-12-16 16:12:23 +00:00
Matthew Wilcox (Oracle)	f58d864582	f2fs: Use a data folio in f2fs_submit_page_bio() Remove a call to compound_head(). We can call bio_add_folio_nofail() here because we just allocated the bio, so we know it can't fail and thus the error path can never be taken. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2024-12-16 16:12:19 +00:00
Matthew Wilcox (Oracle)	0765b3f989	f2fs: Use a folio more in f2fs_submit_page_bio() Cache the result of page_folio(fio->page) in a local variable so we don't have to keep calling it. Saves a couple of calls to compound_head() and removes an access to page->mapping. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2024-12-16 16:12:16 +00:00
Matthew Wilcox (Oracle)	e0821645dd	f2fs: Convert f2fs_finish_read_bio() to use folios Use bio_for_each_folio_all() to iterate over each folio in the bio. This lets us use folio_end_read() which saves an atomic operation and memory barrier compared to marking the folio uptodate and unlocking it as two separate operations. This also removes a few hidden calls to compound_head(). Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2024-12-16 16:12:13 +00:00
Matthew Wilcox (Oracle)	1cf7460070	f2fs: Add F2FS_F_SB() This is the folio equivalent of F2FS_P_SB(). Removes a call to page_file_mapping() as we know folios seen by f2fs are never part of the swap cache. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2024-12-16 16:12:10 +00:00
Matthew Wilcox (Oracle)	87e2a15bc0	f2fs: Convert submit tracepoints to take a folio Remove accesses to page->index and page->mapping as well as unnecessary calls to page_file_mapping(). Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2024-12-16 16:12:07 +00:00
Matthew Wilcox (Oracle)	ac866908d7	f2fs: Use a folio in f2fs_write_compressed_pages() Remove accesses to page->index and an unnecessary reference to page->mapping. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2024-12-16 16:12:04 +00:00
Matthew Wilcox (Oracle)	1cda5bc0b2	f2fs: Use a folio in f2fs_truncate_partial_cluster() Convert the incoming page to a folio and use it throughout. Removes an access to page->index. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2024-12-16 16:12:01 +00:00
Matthew Wilcox (Oracle)	ff6c82a934	f2fs: Use a folio in f2fs_compress_write_end() This removes an access of page->index. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2024-12-16 16:11:58 +00:00
Matthew Wilcox (Oracle)	a909c17953	f2fs: Use a folio in f2fs_all_cluster_page_ready() Remove references to page->index and use folio_test_uptodate() instead of PageUptodate(). Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2024-12-16 16:11:51 +00:00
John Garry	ac3f91005d	block: Delete bio_set_prio() Since commit `43b62ce3ff` ("block: move bio io prio to a new field"), macro bio_set_prio() does nothing but set bio->bi_ioprio. All other places just set bio->bi_ioprio directly, so replace bio_set_prio() remaining callsites with setting bio->bi_ioprio directly and delete that macro. Signed-off-by: John Garry <john.g.garry@oracle.com> Acked-by: Jack Wang <jinpu.wang@ionos.com> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/20241202111957.2311683-3-john.g.garry@oracle.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-12-16 06:58:30 -07:00
Gao Xiang	6422cde1b0	erofs: use buffered I/O for file-backed mounts by default For many use cases (e.g. container images are just fetched from remote), performance will be impacted if underlay page cache is up-to-date but direct i/o flushes dirty pages first. Instead, let's use buffered I/O by default to keep in sync with loop devices and add a (re)mount option to explicitly give a try to use direct I/O if supported by the underlying files. The container startup time is improved as below: [workload] docker.io/library/workpress:latest unpack 1st run non-1st runs EROFS snapshotter buffered I/O file 4.586404265s 0.308s 0.198s EROFS snapshotter direct I/O file 4.581742849s 2.238s 0.222s EROFS snapshotter loop 4.596023152s 0.346s 0.201s Overlayfs snapshotter 5.382851037s 0.206s 0.214s Fixes: `fb17675026` ("erofs: add file-backed mount support") Cc: Derek McGowan <derek@mcg.dev> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20241212134336.2059899-1-hsiangkao@linux.alibaba.com	2024-12-16 21:02:07 +08:00
Gao Xiang	f8d920a402	erofs: reference `struct erofs_device_info` for erofs_map_dev Record `m_sb` and `m_dif` to replace `m_fscache`, `m_daxdev`, `m_fp` and `m_dax_part_off` in order to simplify the codebase. Note that `m_bdev` is still left since it can be assigned from `sb->s_bdev` directly. Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20241212235401.2857246-1-hsiangkao@linux.alibaba.com	2024-12-16 21:02:06 +08:00
Gao Xiang	7b00af2c54	erofs: use `struct erofs_device_info` for the primary device Instead of just listing each one directly in `struct erofs_sb_info` except that we still use `sb->s_bdev` for the primary block device. Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20241216125310.930933-2-hsiangkao@linux.alibaba.com	2024-12-16 21:01:59 +08:00
Namjae Jeon	fe4ed2f09b	ksmbd: conn lock to serialize smb2 negotiate If client send parallel smb2 negotiate request on same connection, ksmbd_conn can be racy. smb2 negotiate handling that are not performance-related can be serialized with conn lock. Signed-off-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>	2024-12-15 22:20:03 -06:00
Marios Makassikis	43fb7bce88	ksmbd: fix broken transfers when exceeding max simultaneous operations Since commit `0a77d947f5` ("ksmbd: check outstanding simultaneous SMB operations"), ksmbd enforces a maximum number of simultaneous operations for a connection. The problem is that reaching the limit causes ksmbd to close the socket, and the client has no indication that it should have slowed down. This behaviour can be reproduced by setting "smb2 max credits = 128" (or lower), and transferring a large file (25GB). smbclient fails as below: $ smbclient //192.168.1.254/testshare -U user%pass smb: \> put file.bin cli_push returned NT_STATUS_USER_SESSION_DELETED putting file file.bin as \file.bin smb2cli_req_compound_submit: Insufficient credits. 0 available, 1 needed NT_STATUS_INTERNAL_ERROR closing remote file \file.bin smb: \> smb2cli_req_compound_submit: Insufficient credits. 0 available, 1 needed Windows clients fail with 0x8007003b (with smaller files even). Fix this by delaying reading from the socket until there's room to allocate a request. This effectively applies backpressure on the client, so the transfer completes, albeit at a slower rate. Fixes: `0a77d947f5` ("ksmbd: check outstanding simultaneous SMB operations") Signed-off-by: Marios Makassikis <mmakassikis@freebox.fr> Signed-off-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>	2024-12-15 22:20:03 -06:00
Marios Makassikis	83c47d9e0c	ksmbd: count all requests in req_running counter This changes the semantics of req_running to count all in-flight requests on a given connection, rather than the number of elements in the conn->request list. The latter is used only in smb2_cancel, and the counter is not used Signed-off-by: Marios Makassikis <mmakassikis@freebox.fr> Acked-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>	2024-12-15 22:20:03 -06:00
Linus Torvalds	7031a38ab7	First batch of EFI fixes for v6.13 - Limit EFI zboot to GZIP and ZSTD before it comes in wider use - Fix inconsistent error when looking up a non-existent file in efivarfs with a name that does not adhere to the NAME-GUID format - Drop some unused code -----BEGIN PGP SIGNATURE----- iHUEABYIAB0WIQQQm/3uucuRGn1Dmh0wbglWLn0tXAUCZ17ajwAKCRAwbglWLn0t XGkQAQCuIi5yPony5hJf6vrYXm7rnHN2NS9Wg7q3rKNR7TIGMQD/YHRdNJbJ4nO5 BrOVS4eVXvSzvWrYxB/W4EAMJ1uyLgs= =LNFy -----END PGP SIGNATURE----- Merge tag 'efi-fixes-for-v6.13-1' of git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi Pull EFI fixes from Ard Biesheuvel: - Limit EFI zboot to GZIP and ZSTD before it comes in wider use - Fix inconsistent error when looking up a non-existent file in efivarfs with a name that does not adhere to the NAME-GUID format - Drop some unused code * tag 'efi-fixes-for-v6.13-1' of git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi: efi/esrt: remove esre_attribute::store() efivarfs: Fix error on non-existent file efi/zboot: Limit compression options to GZIP and ZSTD	2024-12-15 15:33:41 -08:00
Kent Overstreet	b4f1b7e26c	bcachefs: bcachefs_metadata_version_inode_depth This adds a new inode field, bi_depth, for directory inodes: this allows us to make the check_directory_structure pass much more efficient. Currently, to ensure the filesystem is fully connect and has no loops, for every directory we follow backpointers until we find the root. But by adding a depth counter, it sufficies to only check the parent of each directory, and check that the parent's bi_depth is smaller. (fsck doesn't require that bi_depth = parent->bi_depth + 1; if a rename causes bi_depth off, but the chain to the root is still strictly decreasing, then the algorithm still works and there's no need for fsck to fixup the bi_depth fields). We've already checked backpointers, so we know that every directory (excluding the root)has a valid parent: if bi_depth is always decreasing, every chain must terminate, and terminate at the root directory. bi_depth will not necessarily be correct when fsck runs, due to directory renames - we can't change bi_depth on every child directory when renaming a directory. That's ok; fsck will silently fix the bi_depth field as needed, and future fsck runs will be much faster. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-14 22:49:02 -05:00
Kent Overstreet	7fa06b998c	bcachefs: Option changes now get propagated to reflinked data Now that bch2_move_get_io_opts() re-propagates changed inode io options to bch_extent_rebalance, we can properly suport changing IO path options for reflinked data. Changing a per-file IO path option, either via the xattr interface or via the BCHFS_IOC_REINHERIT_ATTRS ioctl, will now trigger a scan (the inode number is marked as needing a scan, via bch2_set_rebalance_needs_scan()), and rebalance will use bch2_move_data(), which will walk the inode number and pick up the new options. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-14 22:49:02 -05:00
Kent Overstreet	710fb4e0ab	bcachefs: bcachefs_metadata_version_reflink_p_may_update_opts Previously, io path option changes on a file would be picked up automatically and applied to existing data - but not for reflinked data, as we had no way of doing this safely. A user may have had permission to copy (and reflink) a given file, but not write to it, and if so they shouldn't be allowed to change e.g. nr_replicas or other options. This uses the incompat feature mechanism in the previous patch to add a new incompatible flag to bch_reflink_p, indicating whether a given reflink pointer may propagate io path option changes back to the indirect extent. In this initial patch we're only setting it for the source extents. We'd like to set it for the destination in a reflink copy, when the user has write access to the source, but that requires mnt_idmap which is not curretly plumbed up to remap_file_range. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-14 22:49:02 -05:00
Kent Overstreet	a06f09e44c	bcachefs: BCH_SB_VERSION_INCOMPAT We've been getting away from feature bits: they don't have any kind of ordering, and thus it's possible for people to enable weird combinations of features that were never tested or intended to be run. Much better to just give every new feature, compatible or incompatible, a version number. Additionally, we probably won't ever rev the major version number: major version numbers represent incompatible versions, but that doesn't really fit with how we actually roll out incompatible features - we need a better way of rolling out incompatible features. So, this patch adds two new superblock fields: - BCH_SB_VERSION_INCOMPAT - BCH_SB_VERSION_INCOMPAT_ALLOWED BCH_SB_VERSION_INCOMPAT_ALLOWED indicates that incompatible features up to version number x are allowed to be used without user prompting, but it does not by itself deny old versions from mounting. BCH_SB_VERSION_INCOMPAT does deny old versions from mounting, and must be <= BCH_SB_VERSION_INCOMPAT_ALLOWED. BCH_SB_VERSION_INCOMPAT will only be set when a codepath attempts to use an incompatible feature, so as to not unnecessarily break compatibility with old versions. bch2_request_incompat_feature() is the new interface to check if an incompatible feature may be used. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-14 22:49:02 -05:00
Kent Overstreet	a33c661174	bcachefs: Only run check_backpointers_to_extents in debug mode The backpointers passes, check_backpointers_to_extents() and check_extents_to_backpointers() are the most expensive fsck passes. Now that we're running the same check and repair code when using a backpointer at runtime (via bch2_backpointer_get_key()) that fsck does, there's no reason fsck needs to - except to verify that the filesystem really has no errors in debug mode. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-14 22:49:02 -05:00
Kent Overstreet	ae7a394719	bcachefs: better backpointer_target_not_found() error message Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-14 22:49:02 -05:00
Kent Overstreet	19b148fc65	bcachefs: bch2_backpointer_get_key() now repairs dangling backpointers Continuing on with the self healing theme, we should be running any check and repair code at runtime that we can - instead of declaring the filesystemt inconsistent. This will also let us skip running the backpointers -> extents fsck pass except in debug mode. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-14 22:49:02 -05:00
Kent Overstreet	cbe8afdbcd	bcachefs: check_extents_to_backpointers() now only checks buckets with mismatches Instead of walking every extent and every backpointer it points to, first sum up backpointers in each bucket and check for mismatches, and only look for missing backpointers if mismatches were detected, and only check extents in those buckets. This is a major fsck scalability improvement, since the two backpointers passes (backpointers -> extents and extents -> backpointers) are the most expensive fsck passes by far. Additionally, to speed up the upgrade for backpointer bucket gens, or in situations when we have to rebuild alloc info, add a special case for when no backpointers are found in a bucket - don't check each individual backpointer (in particular, avoiding the write buffer flushes), just recreate them. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-14 22:49:02 -05:00
Kent Overstreet	9364e11cb3	bcachefs: Add write buffer flush param to backpointer_get_key() In an upcoming patch bch2_backpointer_get_key() will be repairing when it finds a dangling backpointer; it will need to flush the btree write buffer before it can definitively say there's an error. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-14 22:49:02 -05:00
Kent Overstreet	78daf5eaab	bcachefs: kill __bch2_extent_ptr_to_bp() Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-14 22:49:02 -05:00
Kent Overstreet	70e1e1af77	bcachefs: bch2_extent_ptr_to_bp() no longer depends on device bch_backpointer no longer contains the bucket_offset field, it's just a direct LBA mapping (with low bits to account for compressed extent splitting), so we don't need to refer to the device to construct it anymore. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-14 22:49:02 -05:00
Kent Overstreet	c6b74e6733	bcachefs: bcachefs_metadata_version_disk_accounting_big_endian Fix sort order for disk accounting keys, in order to fix a regression on mount times. The typetag is now the most significant byte of the key, meaning disk accounting keys of the same type now sort together. This lets us skip over disk accounting keys that aren't mirrored in memory when reading accounting at startup, instead of having them interleaved with other counter types. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-14 22:49:02 -05:00
Kent Overstreet	ab7eb8e365	bcachefs: bcachefs_metadata_version_backpointer_bucket_gen New on disk format version: backpointers new include the generation number of the bucket they refer to, and the obsolete bucket_offset field (no longer needed because we no longer store backpointers in alloc keys) is gone. This is an expensive forced upgrade - hopefully the last; we have to run the extents_to_backpointers recovery pass to regenerate backpointers. It's a forced incompatible upgrade because the alternative would've been permamently making backpointers bigger, and as one of the biggest btrees (along with the extents btree) that's not an ideal option. It's worth it though, because this allows us to make the check_extents_to_backpointers pass drastically cheaper: an upcoming patch changes it to sum up backpointers in a bucket and check the sum against the sector counts for that bucket, only looking for missing backpointers if they don't match (and then only for specific buckets). Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-14 22:49:02 -05:00
Kent Overstreet	8062b34861	bcachefs: bch2_btree_path_peek_slot() doesn't return errors Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-14 22:49:02 -05:00
Kent Overstreet	0694b43ff9	bcachefs: trace_key_cache_fill Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-14 22:49:02 -05:00
Kent Overstreet	dd0d1ff378	bcachefs: Log message in journal for snapshot deletion Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-14 22:48:30 -05:00
Kent Overstreet	50dd5a0edf	bcachefs: bch2_trans_log_msg() Export a helper for logging to the journal when we're already in a transaction context. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-14 22:48:30 -05:00
Kent Overstreet	5d9b21a555	bcachefs: Kill snapshot_t->equiv Now entirely dead code. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-14 22:48:30 -05:00
Kent Overstreet	7c410e21d8	bcachefs: Snapshot deletion no longer uses snapshot_t->equiv Switch to generating a private list of interior nodes to delete, instead of using the equivalence class in the global data structure. This eliminates possible races with snapshot creation, and is much cleaner - it'll let us delete a lot of janky code for calculating and maintaining the equivalence classes. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-14 22:48:30 -05:00
Kent Overstreet	46f92a9e99	bcachefs: Kill equiv_seen arg to delete_dead_snapshots_process_key() When deleting dead snapshots, we move keys from redundant interior snapshot nodes to child nodes - unless there's already a key, in which case the ancestor key is deleted. Previously, we tracked via equiv_seen whether the child snapshot had a key, but this was tricky w.r.t. transaction restarts, and not transactionally safe w.r.t. updates in the child snapshot. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-14 22:48:30 -05:00
Kent Overstreet	75bca41052	bcachefs: Don't run overwrite triggers before insert This breaks when the trigger is inserting updates for the same btree, as the inode trigger now does. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-14 22:48:30 -05:00
Kent Overstreet	395d7f5e24	bcachefs: alloc_data_type_set() happens in alloc trigger Originally, we ran insert triggers before overwrite so that if an extent was being moved (by fallocate insert/collapse range), the bucket sector count wouldn't hit 0 partway through, and so we don't trigger state changes caused by that too soon. But this is better solved by just moving the data type change to the alloc trigger itself, where it's already called. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-14 22:48:30 -05:00
Kent Overstreet	760fbaf1a8	bcachefs: Fix key cache + BTREE_ITER_all_snapshots Normally, whitouts (KEY_TYPE_whitout) are filtered from btree lookups, since they exist only to represent deletions of keys in ancestor snapshots - except, they should not be filtered in BTREE_ITER_all_snapshots mode, so that e.g. snapshot deletion can clean them up. This means that that the key cache has to store whiteouts, and key cache fills cannot filter them. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-14 22:48:30 -05:00
Kent Overstreet	bf5ec9b976	bcachefs: Fix btree_trans_peek_key_cache() BTREE_ITER_all_snapshots In BTREE_ITER_all_snapshots mode, we're required to only return keys where the snapshot field matches the iterator position - BTREE_ITER_filter_snapshots requires pulling keys into the key cache from ancestor snapshots, so we have to check for that. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-14 22:48:30 -05:00
Kent Overstreet	eece59055b	bcachefs: tidy btree_trans_peek_journal() Change to match bch2_btree_trans_peek_updates() calling convention. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-14 22:48:30 -05:00
Kent Overstreet	52ee09e70b	bcachefs: tidy up __bch2_btree_iter_peek() Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-14 22:47:28 -05:00

1 2 3 4 5 ...

96184 Commits