linux-next

mirror of https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git synced 2024-12-28 16:52:18 +00:00

Author	SHA1	Message	Date
Stephen Rothwell	8dad5129f0	Merge branch 'for_next' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs.git	2024-12-20 09:19:18 +11:00
Anand Jain	45d8d33c88	btrfs: modload to print RAID1 balancing status Modified the Btrfs loading message to include the RAID1 balancing status if the experimental feature is enabled. Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-12-18 02:42:33 +01:00
Anand Jain	e51f52e1d3	btrfs: enable RAID1 balancing configuration via modprobe parameter This update allows configuring the `raid1-balancing` methods using a modprobe parameter when experimental mode CONFIG_BTRFS_EXPERIMENTAL is enabled. Examples: - Set the RAID1 balancing method to round-robin with a custom `min_contiguous_read` of 192k: $ modprobe btrfs raid1-balancing=round-robin:196608 - Set the round-robin balancing method with the default `min_contiguous_read` of 256k: $ modprobe btrfs raid1-balancing=round-robin - Set the `devid` balancing method, defaulting to the latest device: $ modprobe btrfs raid1-balancing=devid Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-12-18 02:41:50 +01:00
Anand Jain	e2f11776f9	btrfs: expose experimental mode in module information Commit c9c49e8f157e ("btrfs: split out CONFIG_BTRFS_EXPERIMENTAL from CONFIG_BTRFS_DEBUG") introduces a way to enable or disable experimental features, print its status during module load, like so: Btrfs loaded, experimental=on, debug=on, assert=on, zoned=yes, fsverity=yes Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-12-18 02:41:50 +01:00
Anand Jain	2c0cf2b44b	btrfs: add RAID1 preferred read device When there's stale data on a mirrored device, this feature lets you choose which device to read from. Mainly used for testing. echo "devid:<devid-value>" > /sys/fs/btrfs/<UUID>/read_policy Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-12-18 02:41:50 +01:00
Anand Jain	eff96dae96	btrfs: introduce RAID1 round-robin read balancing This feature balances I/O across the striped devices when reading from RAID1 blocks. echo round-robin[:min_contiguous_read] > /sys/fs/btrfs/<uuid>/read_policy The min_contiguous_read parameter defines the minimum read size before switching to the next mirrored device. This setting is optional, with a default value of 256 KiB. Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-12-18 02:41:49 +01:00
Anand Jain	6b471f9f5c	btrfs: handle value associated with raid1 balancing parameter This change enables specifying additional configuration values alongside the raid1 balancing / read policy in a single input string. Updated btrfs_read_policy_to_enum() to parse and handle a value associated with the policy in the format `policy:value`, the value part if present is converted 64-bit integer. Update btrfs_read_policy_store() to accommodate the new parameter. Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-12-18 02:41:49 +01:00
Anand Jain	a7b574a1f8	btrfs: add btrfs_read_policy_to_enum helper and refactor read policy store Introduce the `btrfs_read_policy_to_enum` helper function to simplify the conversion of a string read policy to its corresponding enum value. This reduces duplication and improves code clarity in `btrfs_read_policy_store`. The `btrfs_read_policy_store` function has been refactored to use the new helper. The parameter is copied locally to allow modification, enabling the separation of the method and its value. This prepares for the addition of more functionality in subsequent patches. Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-12-18 02:41:49 +01:00
Anand Jain	30680021e7	btrfs: simplify output formatting in btrfs_read_policy_show Refactor the logic in btrfs_read_policy_show() to streamline the formatting of read policies output. Streamline the space and bracket handling around the active policy without altering the functional output. This is in preparation to add more methods. Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-12-18 02:41:49 +01:00
Anand Jain	ca56af5079	btrfs: initialize fs_devices->fs_info earlier Currently, fs_devices->fs_info is initialized in btrfs_init_devices_late(), but this occurs too late for find_live_mirror(), which is invoked by load_super_root() much earlier than btrfs_init_devices_late(). Fix this by moving the initialization to open_ctree(), before load_super_root(). Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-12-18 02:41:48 +01:00
Naohiro Aota	38cc9c2c10	btrfs: zoned: calculate max_zone_append_size properly on non-zoned setup Since commit `559218d43e` ("block: pre-calculate max_zone_append_sectors"), queue_limits's max_zone_append_sectors is default to be 0 and it is only updated when there is a zoned device. So, we have lim->max_zone_append_sectors = 0 when there is no zoned device in the filesystem. That leads to fs_info->max_zone_append_size and fs_info->max_extent_size to be 0, which causes several errors. One example is the divide error as below. Running simple test as btrfs/001 on a non-zoned device with the zoned mode (zoned emulation) leads to this error because we have fs_info->max_extent_size = 0 in count_max_extents(). Oops: divide error: 0000 [#1] PREEMPT SMP KASAN NOPTI CPU: 21 UID: 0 PID: 2378822 Comm: dd Tainted: G W 6.13.0-rc2-kts #1 Tainted: [W]=WARN Hardware name: Supermicro SYS-520P-WTR/X12SPW-TF, BIOS 1.2 02/14/2022 RIP: 0010:btrfs_delalloc_reserve_metadata+0x161/0x790 [btrfs] The block layer logic, having max_zone_append_sectors = 0 by stacking non-zoned devices, seems reasonable to me. So, let's deal with that in btrfs side by ignoring max_zone_append_sectors if it is non-zoned setup. Reported-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com> Fixes: `559218d43e` ("block: pre-calculate max_zone_append_sectors") CC: Christoph Hellwig <hch@lst.de> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-12-18 02:41:31 +01:00
Qu Wenruo	149f7d3c2b	btrfs: enhance ordered extent double freeing detection With recent bugs exposed through run_delalloc_range() failure, the importance of detecting double accounting is obvious. Currently the way to detect such errors is to just check if we underflow the btrfs_ordered_extent::bytes_left member. That's fine but that only shows the length we're trying to decrease, not enough to show the problem. Here we enhance the situation by: - Introduce btrfs_ordered_extent::finished_bitmap This is a new bitmap to indicate which blocks are already finished. This bitmap will be initialized at alloc_ordered_extent() and release when the ordered extent is freed. - Detect any already finished block during can_finish_ordered_extent() If double accounting detected, show the full range we're trying and the bitmap. - Make sure the bitmap is all set when the OE is finished - Show the full range we're finishing for the existing double accounting detection This is to enhance the code to work with the new run_delalloc_range() error messages. This will have extra memory and runtime cost, now an ordered extent can have as large as 4K memory just for the finished_bitmap, and extra operations to detect such double accounting. Thus this double accounting detection is only enabled for CONFIG_BTRFS_DEBUG build for developers. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-12-18 02:41:31 +01:00
Qu Wenruo	d39da2bda9	btrfs: add extra error messages for delalloc range related errors All the error handling bugs I hit so far are all -ENOSPC from either: - cow_file_range() - run_delalloc_nocow() - submit_uncompressed_range() Previously when those functions failed, there is no error message at all, making the debugging much harder. So here we introduce extra error messages for: - cow_file_range() - run_delalloc_nocow() - submit_uncompressed_range() - writepage_delalloc() when btrfs_run_delalloc_range() failed - extent_writepage() when extent_writepage_io() failed One example of the new debug error messages is the following one: run fstests generic/750 at 2024-12-08 12:41:41 BTRFS: device fsid 461b25f5-e240-4543-8deb-e7c2bd01a6d3 devid 1 transid 8 /dev/mapper/test-scratch1 (253:4) scanned by mount (2436600) BTRFS info (device dm-4): first mount of filesystem 461b25f5-e240-4543-8deb-e7c2bd01a6d3 BTRFS info (device dm-4): using crc32c (crc32c-arm64) checksum algorithm BTRFS info (device dm-4): forcing free space tree for sector size 4096 with page size 65536 BTRFS info (device dm-4): using free-space-tree BTRFS warning (device dm-4): read-write for sector size 4096 with page size 65536 is experimental BTRFS info (device dm-4): checking UUID tree BTRFS error (device dm-4): cow_file_range failed, root=363 inode=412 start=503808 len=98304: -28 BTRFS error (device dm-4): run_delalloc_nocow failed, root=363 inode=412 start=503808 len=98304: -28 BTRFS error (device dm-4): failed to run delalloc range, root=363 ino=412 folio=458752 submit_bitmap=11-15 start=503808 len=98304: -28 Which shows an error from cow_file_range() which is called inside a nocow write attempt, along with the extra bitmap from writepage_delalloc(). Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-12-18 02:41:30 +01:00
Qu Wenruo	87c53d275f	btrfs: subpage: dump the involved bitmap when ASSERT() failed For btrfs_folio_assert_not_dirty() and btrfs_folio_set_lock(), we call bitmap_test_range_all_zero() to ensure the involved range has not bit set. However with my recent enhanced delalloc range error handling, I'm hitting the ASSERT() inside btrfs_folio_set_lock(), and is wondering if it's some error handling not properly cleanup the locked bitmap but directly unlock the page. So add some extra dumpping for the ASSERTs to dump the involved bitmap to help debug. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-12-18 02:41:30 +01:00
Qu Wenruo	aba5d9b96a	btrfs: subpage: fix the bitmap dump for the locked flags We're dumping the locked bitmap into the @checked_bitmap variable, causing incorrect values during debug. Thankfuklly even during my development I haven't hit a case where I need to dump the locked bitmap. But for the sake of consistency, fix it by dumpping the locked bitmap into @locked_bitmap variable for output. Fixes: `75258f20fb` ("btrfs: subpage: dump extra subpage bitmaps for debug") Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-12-18 02:41:30 +01:00
Qu Wenruo	b635fbe47a	btrfs: do proper folio cleanup when run_delalloc_nocow() failed [BUG] With CONFIG_DEBUG_VM set, test case generic/476 has some chance to crash with the following VM_BUG_ON_FOLIO(): BTRFS error (device dm-3): cow_file_range failed, start 1146880 end 1253375 len 106496 ret -28 BTRFS error (device dm-3): run_delalloc_nocow failed, start 1146880 end 1253375 len 106496 ret -28 page: refcount:4 mapcount:0 mapping:00000000592787cc index:0x12 pfn:0x10664 aops:btrfs_aops [btrfs] ino:101 dentry name(?):"f1774" flags: 0x2fffff80004028(uptodate\|lru\|private\|node=0\|zone=2\|lastcpupid=0xfffff) page dumped because: VM_BUG_ON_FOLIO(!folio_test_locked(folio)) ------------[ cut here ]------------ kernel BUG at mm/page-writeback.c:2992! Internal error: Oops - BUG: 00000000f2000800 [#1] SMP CPU: 2 UID: 0 PID: 3943513 Comm: kworker/u24:15 Tainted: G OE 6.12.0-rc7-custom+ #87 Tainted: [O]=OOT_MODULE, [E]=UNSIGNED_MODULE Hardware name: QEMU KVM Virtual Machine, BIOS unknown 2/2/2022 Workqueue: events_unbound btrfs_async_reclaim_data_space [btrfs] pc : folio_clear_dirty_for_io+0x128/0x258 lr : folio_clear_dirty_for_io+0x128/0x258 Call trace: folio_clear_dirty_for_io+0x128/0x258 btrfs_folio_clamp_clear_dirty+0x80/0xd0 [btrfs] __process_folios_contig+0x154/0x268 [btrfs] extent_clear_unlock_delalloc+0x5c/0x80 [btrfs] run_delalloc_nocow+0x5f8/0x760 [btrfs] btrfs_run_delalloc_range+0xa8/0x220 [btrfs] writepage_delalloc+0x230/0x4c8 [btrfs] extent_writepage+0xb8/0x358 [btrfs] extent_write_cache_pages+0x21c/0x4e8 [btrfs] btrfs_writepages+0x94/0x150 [btrfs] do_writepages+0x74/0x190 filemap_fdatawrite_wbc+0x88/0xc8 start_delalloc_inodes+0x178/0x3a8 [btrfs] btrfs_start_delalloc_roots+0x174/0x280 [btrfs] shrink_delalloc+0x114/0x280 [btrfs] flush_space+0x250/0x2f8 [btrfs] btrfs_async_reclaim_data_space+0x180/0x228 [btrfs] process_one_work+0x164/0x408 worker_thread+0x25c/0x388 kthread+0x100/0x118 ret_from_fork+0x10/0x20 Code: 910a8021 a90363f7 a9046bf9 94012379 (d4210000) ---[ end trace 0000000000000000 ]--- [CAUSE] The first two lines of extra debug messages show the problem is caused by the error handling of run_delalloc_nocow(). E.g. we have the following dirtied range (4K blocksize 4K page size): 0 16K 32K \|//////////////////////////////////////\| \| Pre-allocated \| And the range [0, 16K) has a preallocated extent. - Enter run_delalloc_nocow() for range [0, 16K) Which found range [0, 16K) is preallocated, can do the proper NOCOW write. - Enter fallback_to_fow() for range [16K, 32K) Since the range [16K, 32K) is not backed by preallocated extent, we have to go COW. - cow_file_range() failed for range [16K, 32K) So cow_file_range() will do the clean up by clearing folio dirty, unlock the folios. Now the folios in range [16K, 32K) is unlocked. - Enter extent_clear_unlock_delalloc() from run_delalloc_nocow() Which is called with PAGE_START_WRITEBACK to start page writeback. But folios can only be marked writeback when it's properly locked, thus this triggered the VM_BUG_ON_FOLIO(). Furthermore there is another hidden but common bug that run_delalloc_nocow() is not clearing the folio dirty flags in its error handling path. This is the common bug shared between run_delalloc_nocow() and cow_file_range(). [FIX] - Clear folio dirty for range [@start, @cur_offset) Introduce a helper, cleanup_dirty_folios(), which will find and lock the folio in the range, clear the dirty flag and start/end the writeback, with the extra handling for the @locked_folio. - Introduce a helper to record the last failed COW range end This is to trace which range we should skip, to avoid double unlocking. - Skip the failed COW range for the error handling Cc: stable@vger.kernel.org Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-12-18 02:41:30 +01:00
Qu Wenruo	df92c54755	btrfs: do proper folio cleanup when cow_file_range() failed [BUG] When testing with COW fixup marked as BUG_ON() (this is involved with the new pin_user_pages() change, which should not result new out-of-band dirty pages), I hit a crash triggered by the BUG_ON() from hitting COW fixup path. This BUG_ON() happens just after a failed btrfs_run_delalloc_range(): BTRFS error (device dm-2): failed to run delalloc range, root 348 ino 405 folio 65536 submit_bitmap 6-15 start 90112 len 106496: -28 ------------[ cut here ]------------ kernel BUG at fs/btrfs/extent_io.c:1444! Internal error: Oops - BUG: 00000000f2000800 [#1] SMP CPU: 0 UID: 0 PID: 434621 Comm: kworker/u24:8 Tainted: G OE 6.12.0-rc7-custom+ #86 Hardware name: QEMU KVM Virtual Machine, BIOS unknown 2/2/2022 Workqueue: events_unbound btrfs_async_reclaim_data_space [btrfs] pc : extent_writepage_io+0x2d4/0x308 [btrfs] lr : extent_writepage_io+0x2d4/0x308 [btrfs] Call trace: extent_writepage_io+0x2d4/0x308 [btrfs] extent_writepage+0x218/0x330 [btrfs] extent_write_cache_pages+0x1d4/0x4b0 [btrfs] btrfs_writepages+0x94/0x150 [btrfs] do_writepages+0x74/0x190 filemap_fdatawrite_wbc+0x88/0xc8 start_delalloc_inodes+0x180/0x3b0 [btrfs] btrfs_start_delalloc_roots+0x174/0x280 [btrfs] shrink_delalloc+0x114/0x280 [btrfs] flush_space+0x250/0x2f8 [btrfs] btrfs_async_reclaim_data_space+0x180/0x228 [btrfs] process_one_work+0x164/0x408 worker_thread+0x25c/0x388 kthread+0x100/0x118 ret_from_fork+0x10/0x20 Code: aa1403e1 9402f3ef aa1403e0 9402f36f (d4210000) ---[ end trace 0000000000000000 ]--- [CAUSE] That failure is mostly from cow_file_range(), where we can hit -ENOSPC. Although the -ENOSPC is already a bug related to our space reservation code, let's just focus on the error handling. For example, we have the following dirty range [0, 64K) of an inode, with 4K sector size and 4K page size: 0 16K 32K 48K 64K \|///////////////////////////////////////\| \|#######################################\| Where \|///\| means page are still dirty, and \|###\| means the extent io tree has EXTENT_DELALLOC flag. - Enter extent_writepage() for page 0 - Enter btrfs_run_delalloc_range() for range [0, 64K) - Enter cow_file_range() for range [0, 64K) - Function btrfs_reserve_extent() only reserved one 16K extent So we created extent map and ordered extent for range [0, 16K) 0 16K 32K 48K 64K \|////////\|//////////////////////////////\| \|<- OE ->\|##############################\| And range [0, 16K) has its delalloc flag cleared. But since we haven't yet submit any bio, involved 4 pages are still dirty. - Function btrfs_reserve_extent() return with -ENOSPC Now we have to run error cleanup, which will clear all EXTENT_DELALLOC flags and clear the dirty flags for the remaining ranges: 0 16K 32K 48K 64K \|////////\| \| \| \| \| Note that range [0, 16K) still has their pages dirty. - Some time later, writeback are triggered again for the range [0, 16K) since the page range still have dirty flags. - btrfs_run_delalloc_range() will do nothing because there is no EXTENT_DELALLOC flag. - extent_writepage_io() find page 0 has no ordered flag Which falls into the COW fixup path, triggering the BUG_ON(). Unfortunately this error handling bug dates back to the introduction of btrfs. Thankfully with the abuse of cow fixup, at least it won't crash the kernel. [FIX] Instead of immediately unlock the extent and folios, we keep the extent and folios locked until either erroring out or the whole delalloc range finished. When the whole delalloc range finished without error, we just unlock the whole range with PAGE_SET_ORDERED (and PAGE_UNLOCK for !keep_locked cases), with EXTENT_DELALLOC and EXTENT_LOCKED cleared. And those involved folios will be properly submitted, with their dirty flags cleared during submission. For the error path, it will be a little more complex: - The range with ordered extent allocated (range (1)) We only clear the EXTENT_DELALLOC and EXTENT_LOCKED, as the remaining flags are cleaned up by btrfs_mark_ordered_io_finished()->btrfs_finish_one_ordered(). For folios we finish the IO (clear dirty, start writeback and immediately finish the writeback) and unlock the folios. - The range with reserved extent but no ordered extent (range(2)) - The range we never touched (range(3)) For both range (2) and range(3) the behavior is not changed. Now even if cow_file_range() failed halfway with some successfully reserved extents/ordered extents, we will keep all folios clean, so there will be no future writeback triggered on them. Cc: stable@vger.kernel.org Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-12-18 02:41:30 +01:00
Qu Wenruo	064709bb67	btrfs: fix the error handling of submit_uncompressed_range() [BUG] If btrfs failed to compress the range, or can not reserve a large enough data extent (e.g. too fragmented free space), btrfs will fall back to submit_uncompressed_range(). But inside submit_uncompressed_range(), run_dealloc_cow() can also fail due to -ENOSPC or whatever other errors. In that case there are 3 bugs in the error handling: 1) Double freeing for the same ordered extent Which can lead to crash due to ordered extent double accounting 2) Start/end writeback without updating the subpage writeback bitmap 3) Unlock the folio without clear the subpage lock bitmap Both bug 2) and 3) will crash the kernel if the btrfs block size is smaller than folio size, as the next time the folio get writeback/lock updates, subpage will find the bitmap already have the range set, triggering an ASSERT(). [CAUSE] Bug 1) happens in the following call chain: submit_uncompressed_range() \|- run_dealloc_cow() \| \|- cow_file_range() \| \|- btrfs_reserve_extent() \| Failed with -ENOSPC or whatever error \| \|- btrfs_clean_up_ordered_extents() \| \|- btrfs_mark_ordered_io_finished() \| Which cleans all the ordered extents in the async_extent range. \| \|- btrfs_mark_ordered_io_finished() Which cleans the folio range. The finished ordered extents may not be immediately removed from the ordered io tree, as they are removed inside a work queue. So the second btrfs_mark_ordered_io_finished() may find the finished but not-yet-removed ordered extents, and double free them. Furthermore, the second btrfs_mark_ordered_io_finished() is not subpage compatible, as it uses fixed folio_pos() with PAGE_SIZE, which can cover other ordered extents. Bug 2) and 3) are more straight forward, btrfs just calls folio_unlock(), folio_start_writeback() and folio_end_writeback(), other than the helpers which handle subpage cases. [FIX] For bug 1) since the first btrfs_cleanup_ordered_extents() call is handling the whole range, we should not do the second btrfs_mark_ordered_io_finished() call. And for the first btrfs_cleanup_ordered_extents(), we no longer need to pass the @locked_page parameter, as we are already in the async extent context, thus will never rely on the error handling inside btrfs_run_delalloc_range(). So just let the btrfs_clean_up_ordered_extents() to handle every folio equally. For bug 2) we should not even call folio_start_writeback()/folio_end_writeback() anymore. As the error handling protocol, cow_file_range() should clear dirty flag and start/finish the writeback for the whole range passed in. For bug 3) just change the folio_unlock() to btrfs_folio_end_lock() helper. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-12-18 02:41:30 +01:00
Qu Wenruo	9c68bdd38f	btrfs: fix double accounting race when extent_writepage_io() failed [BUG] If submit_one_sector() failed inside extent_writepage_io() for sector size < page size cases (e.g. 4K sector size and 64K page size), then we can hit double ordered extent accounting error. This should be very rare, as submit_one_sector() only fails when we failed to grab the extent map, and such extent map should exist inside the memory and have been pinned. [CAUSE] For example we have the following folio layout: 0 4K 32K 48K 60K 64K \|//\| \|//////\| \|///\| Where \|///\| is the dirty range we need to writeback. The 3 different dirty ranges are submitted for regular COW. Now we hit the following sequence: - submit_one_sector() returned 0 for [0, 4K) - submit_one_sector() returned 0 for [32K, 48K) - submit_one_sector() returned error for [60K, 64K) - btrfs_mark_ordered_io_finished() called for the whole folio This will mark the following ranges as finished: * [0, 4K) * [32K, 48K) Both ranges have their IO already submitted, this cleanup will lead to double accounting. * [60K, 64K) That's the correct cleanup. The only good news is, this error is only theoretical, as the target extent map is always pinned, thus we should directly grab it from memory, other than reading it from the disk. [FIX] Instead of calling btrfs_mark_ordered_io_finished() for the whole folio range, which can touch ranges we should not touch, instead move the error handling inside extent_writepage_io(). So that we can cleanup exact sectors that are ought to be submitted but failed. This provide much more accurate cleanup, avoiding the double accounting. Cc: stable@vger.kernel.org # 5.15+ Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-12-18 02:41:30 +01:00
Qu Wenruo	e863c615af	btrfs: fix double accounting race when btrfs_run_delalloc_range() failed [BUG] There are several double accounting case, where the WARN_ON_ONCE() is triggered inside can_finish_ordered_extent(). And all such cases points back to the btrfs_mark_ordered_io_finished() call inside extent_writepage() when it hits some error. [CAUSE] With extra debug patches to show where the error is from, it turns out to be btrfs_run_delalloc_range() can fail with -ENOSPC. Such failure itself is already a symptom of some bad data/metadata space reservation, but here we need to focus on the error handling part. For example, we have the following dirty page layout (4K sector size and 4K page size): 0 16K 32K \|/////\|/////\|/////\|/////\|/////\|/////\|/////\|/////\| Where the range [0, 32K) is dirty and we need to write all the 8 pages back. When handling the first page 0, we go the following sequence: - btrfs_run_delalloc_range() for range [0, 32k) We enter cow_file_range() for [0, 32K) - btrfs_reserve_extent() only returned a 16K data extent. This can be caused by fragmentation, and it's already an indication we're almost running of space. Now we have the following layout: 0 16K 32K \|<----- Reserved ------>\|/////\|/////\|/////\|/////\| The range [0, 16K) has ordered extent allocated. - btrfs_reserve_extent() returned -ENOSPC We really run out of space. But since we have reserved space for range [0, 16K) we need to clean them up. But that cleanup for ordered extent only happens inside btrfs_run_delalloc_range(). - btrfs_run_delalloc_range() cleanup the reserved ordered extent By calling btrfs_mark_ordered_io_finished() for range [0, 32K). It will locate the ordered extent [0, 16K) and mark it as IOERR. Also since the ordered extent is only 16K, we're finishing the whole ordered extent. Thus we call btrfs_queue_ordered_fn() to queue to finish the ordered extent. But still, the ordered extent [0, 16K) is still in the btrfs_inode::ordered_tree. - extent_writepage() cleanup the ordered extent inside the folio We call btrfs_mark_ordered_io_finished() for range [0, 4K). Since the finished ordered extent [0, 16K) is not yet removed (racy, depends on when btrfs_finish_one_ordered() is called), if btrfs_mark_ordered_io_finished() is called before btrfs_finish_one_ordered(), we will double account and trigger the warning inside can_finish_ordered_extent(). So the root cause is, we're relying on btrfs_mark_ordered_io_finished() to handle ranges which is already cleaned up. Unfortunately the bug dates back to the early days when btrfs_mark_ordered_io_finished() is introduced as a no-brain choice for error paths, but such no-brain solution just hides all the race and make us less cautious when handling errors. [FIX] Instead of relying on the btrfs_mark_ordered_io_finished() call to cleanup the whole folio range, record the last successfully ran delalloc range. And combined with bio_ctrl->submit_bitmap to properly clean up any newly created ordered extents. Since we have cleaned up the ordered extents in range, we should not rely on the btrfs_mark_ordered_io_finished() inside extent_writepage() anymore. By this, we ensure btrfs_mark_ordered_io_finished() is only called once when writepage_delalloc() failed. Cc: stable@vger.kernel.org # 5.15+ Fixes: `e65f152e43` ("btrfs: refactor how we finish ordered extent io for endio functions") Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-12-18 02:41:30 +01:00
Qu Wenruo	d4519378e4	btrfs: validate system chunk array at btrfs_validate_super() Currently btrfs_validate_super() only does a very basic check on the array chunk size (not too large than the available space, but not too small to contain no chunk). The more comprehensive checks (the regular chunk checks and size check inside the system chunk array) is all done inside btrfs_read_sys_array(). It's not a big deal, but for the sake of concentrated verification, we should validate the system chunk array at the time of super block validation. So this patch does the following modification: - Introduce a helper btrfs_check_system_chunk_array() * Validate the disk key * Validate the size before we access the full chunk/stripe items. * Do the full chunk item validation - Call btrfs_check_system_chunk_array() at btrfs_validate_super() - Simplify the checks inside btrfs_read_sys_array() Now the checks will be converted to an ASSERT(). - Simplify the checks inside read_one_chunk() Now all chunk items inside system chunk array and chunk tree is verified, there is no need to verify it again inside read_one_chunk(). This change has the following advantages: - More comprehensive checks at write time Although this also means extra memcpy() for the superblocks at write time, due to the limits that we need a dummy extent buffer to utilize all the extent buffer helpers. - Slightly improved readablity when iterating the system chunk array Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-12-18 02:40:49 +01:00
Qu Wenruo	5bb49ec59a	btrfs: scrub: use generic ratelimit helpers to output error messages Currently scrub goes different ways to rate limits its error messages: - regular btrfs_err_rl_in_rcu() helper for repaired sectors and the initial message for unrepaired sectors - Manually rate limits scrub_print_common_warning() I'd say the different rate limits could lead to cases where we only got the "unable to fixup (regular) error" messages but the detailed report about that corruption is ratelimited. To make the whole rate limit works more consistently, change the rate limit by: - Always using btrfs_*_rl() helpers - Remove the initial "unable to fixup (regular) error" message Since we're ensured to have at least one error message for each unrepaired sector (before rate limit), there is no need for a duplicated line. And if we hit rate limits, we will rate limit all the error messages together, not treating different error messages differently. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-12-18 02:40:49 +01:00
Qu Wenruo	fba78d9f9b	btrfs: scrub: ensure we output at least one error message for unrepaired corruption For btrfs scrub error messages, I have hit a lot of support cases on older kernels where no filename is outputted at all, with only error messages like: BTRFS error (device dm-0): bdev /dev/mapper/sys-rootlv errs: wr 0, rd 0, flush 0, corrupt 2823, gen 0 BTRFS error (device dm-0): unable to fixup (regular) error at logical 1563504640 on dev /dev/mapper/sys-rootlv BTRFS error (device dm-0): bdev /dev/mapper/sys-rootlv errs: wr 0, rd 0, flush 0, corrupt 2824, gen 0 BTRFS error (device dm-0): unable to fixup (regular) error at logical 1579016192 on dev /dev/mapper/sys-rootlv BTRFS error (device dm-0): bdev /dev/mapper/sys-rootlv errs: wr 0, rd 0, flush 0, corrupt 2825, gen 0 The "unable to fixup (regular) error" line shows we hit an unrepairable error, then normally we would do data/metadata backref walk to grab the correct info. But we can hit cases like the inode is already orphan (unlinked from any parent inode), or even the subvolume is orphan (unlinked and waiting to be deleted). In that case we're not sure what's the proper way to continue (Is it some critical data/metadata? Would it prevent the system from booting?) To improve the situation, this patch would: - Ensure we at least output one message for each unrepairable error If by somehow we didn't output any error message for the error, we always fallback to the basic logical/physical error output, with its type (data or metadata). Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-12-18 02:40:49 +01:00
Qu Wenruo	985a948c8a	btrfs: scrub: simplify the inode iteration output The following two output are not really needed: - nlinks Normally file inodes should have nlinks as 1, for those inodes have multiple hard links, the inode/root number is still enough to pin it down to certain inode. - size The size is always fixed to sector size. By removing the nlinks output, we can reduce one inode item lookup. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-12-18 02:40:49 +01:00
Qu Wenruo	861fc3eab7	btrfs: scrub: remove unnecessary dev/physical lookup for scrub_stripe_report_errors() The @stripe passed into scrub_stripe_report_errors() either has stripe->dev and stripe->physical properly populated (regular data stripes) or stripe->flags would have SCRUB_STRIPE_FLAG_NO_REPORT (RAID56 P/Q stripes). Thus there is no need to go with btrfs_map_block() to get the dev/physical. Just add an extra ASSERT() to make sure we get stripe->dev populated correctly. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-12-18 02:40:49 +01:00
Qu Wenruo	ae88f4cfa4	btrfs: scrub: remove unused is_super parameter from scrub_print_common_warning() Since commit `2a2dc22f7e` ("btrfs: scrub: use dedicated super block verification function to scrub one super block"), the super block scrubbing is handled in a dedicated helper, thus scrub_print_common_warning() no longer needs to print warning for super block errors. Just remove the parameter. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-12-18 02:40:49 +01:00
Qu Wenruo	df8f0d1132	btrfs: reduce the log level for btrfs_dev_stat_inc_and_print() Currently when we increase the device statistics, it would always lead to an error message in the kernel log. However this output is mostly duplicated with the existing ones: - For scrub operations We always have the following messages: * "fixed up error at logical %llu" * "unable to fixup (regular) error at logical %llu" So no matter if the corruption is repaired or not, it scrub would output an error message to indicate the problem. - For non-scrub read operations We also have the following messages: * "csum failed root %lld inode %llu off %llu" for data csum mismatch * "bad (tree block start\|fsid\|tree block level)" for metadata * "read error corrected: ino %llu off %llu" for repaired data/metadata So the error message from btrfs_dev_stat_inc_and_print() is duplicated. The real usage for the btrfs device statistics is for some user space daemon to check if there is any new errors, acting like some checks on SMART, thus we don't really need/want those messages in dmesg. This patch would reduce the log level to debug (disabled by default) for btrfs_dev_stat_inc_and_print(). For users really want to utilize btrfs devices statistics, they should go check "btrfs device stats" periodically, and we should focus the kernel error messages to more important things. CC: stable@vger.kernel.org Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-12-18 02:40:48 +01:00
Qu Wenruo	3c43979ad2	btrfs: scrub: fix incorrectly reported logical/physical address [BUG] Scrub is not reporting the correct logical/physical address, it can be verified by the following script: # mkfs.btrfs -f $dev1 # mount $dev1 $mnt # xfs_io -f -c "pwrite -S 0xaa 0 128k" $mnt/file1 # umount $mnt # xfs_io -f -c "pwrite -S 0xff 13647872 4k" $dev1 # mount $dev1 $mnt # btrfs scrub start -fB $mnt # umount $mnt Note above 13647872 is the physical address for logical 13631488 + 4K. Scrub would report the following error: BTRFS error (device dm-2): unable to fixup (regular) error at logical 13631488 on dev /dev/mapper/test-scratch1 physical 13631488 BTRFS warning (device dm-2): checksum error at logical 13631488 on dev /dev/mapper/test-scratch1, physical 13631488, root 5, inode 257, offset 0, length 4096, links 1 (path: file1) On the other hand, "btrfs check --check-data-csum" is reporting the correct logical/physical address: Checking filesystem on /dev/test/scratch1 UUID: db2eb621-b09d-4f24-8199-da17dc7b3201 [5/7] checking csums against data mirror 1 bytenr 13647872 csum 0x13fec125 expected csum 0x656bd64e ERROR: errors found in csum tree [CAUSE] In the function scrub_stripe_report_errors(), we always use the stripe->logical and its physical address to print the error message, not taking the sector number into consideration at all. [FIX] Fix the error reporting function by calculating logical/physical with the sector number. Now the scrub report is correct: BTRFS error (device dm-2): unable to fixup (regular) error at logical 13647872 on dev /dev/mapper/test-scratch1 physical 13647872 BTRFS warning (device dm-2): checksum error at logical 13647872 on dev /dev/mapper/test-scratch1, physical 13647872, root 5, inode 257, offset 16384, length 4096, links 1 (path: file1) Fixes: `0096580713` ("btrfs: scrub: introduce error reporting functionality for scrub_stripe") CC: stable@vger.kernel.org #6.4+ Reviewed-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-12-18 02:40:48 +01:00
David Sterba	fc42bf9d7a	btrfs: handle unexpected parent block offset in btrfs_alloc_tree_block() Change a BUG_ON to a proper error handling, here it checks that a root other than reloc tree does not see a non-zero offset. This is set by btrfs_force_cow_block() and is a special case so the check makes sure it's not accidentally set by other callers. Reviewed-by: Boris Burkov <boris@bur.io> Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-12-18 02:40:48 +01:00
David Sterba	e3dea2dbc0	btrfs: === misc-next on b-for-next === Any commits after this one are for testing and evaluation only. Signed-off-by: David Sterba <dsterba@suse.com>	2024-12-18 02:40:48 +01:00
Filipe Manana	ebd8327fe7	btrfs: use uuid_is_null() to verify if an uuid is empty At btrfs_is_empty_uuid() we have our custom code to check if an uuid is empty, however there a kernel uuid library that has a function named uuid_is_null() which does the same and probably more efficient. So change btrfs_is_empty_uuid() to use uuid_is_null(), which is almost a directly replacement, it just wraps the necessary casting since our uuid types are u8 arrays while the uuid kernel library uses the uuid_t type, which is just a typedef of an u8 array of 16 elements as well. Also since the function is now to trivial, make it a static inline function in fs.h. Suggested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-12-18 02:32:28 +01:00
Julian Sun	a298aba81d	btrfs: fix transaction atomicity bug when enabling simple quotas set squota incompat bit before committing the transaction that enables the feature With the config CONFIG_BTRFS_ASSERT enabled, an assertion failure occurs regarding the simple quota feature. [ 5.596534] assertion failed: btrfs_fs_incompat(fs_info, SIMPLE_QUOTA), in fs/btrfs/qgroup.c:365 [ 5.597098] ------------[ cut here ]------------ [ 5.597371] kernel BUG at fs/btrfs/qgroup.c:365! [ 5.597946] CPU: 1 UID: 0 PID: 268 Comm: mount Not tainted 6.13.0-rc2-00031-gf92f4749861b #146 [ 5.598450] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-debian-1.16.2-1 04/01/2014 [ 5.599008] RIP: 0010:btrfs_read_qgroup_config+0x74d/0x7a0 [ 5.604303] <TASK> [ 5.605230] ? btrfs_read_qgroup_config+0x74d/0x7a0 [ 5.605538] ? exc_invalid_op+0x56/0x70 [ 5.605775] ? btrfs_read_qgroup_config+0x74d/0x7a0 [ 5.606066] ? asm_exc_invalid_op+0x1f/0x30 [ 5.606441] ? btrfs_read_qgroup_config+0x74d/0x7a0 [ 5.606741] ? btrfs_read_qgroup_config+0x74d/0x7a0 [ 5.607038] ? try_to_wake_up+0x317/0x760 [ 5.607286] open_ctree+0xd9c/0x1710 [ 5.607509] btrfs_get_tree+0x58a/0x7e0 [ 5.608002] vfs_get_tree+0x2e/0x100 [ 5.608224] fc_mount+0x16/0x60 [ 5.608420] btrfs_get_tree+0x2f8/0x7e0 [ 5.608897] vfs_get_tree+0x2e/0x100 [ 5.609121] path_mount+0x4c8/0xbc0 [ 5.609538] __x64_sys_mount+0x10d/0x150 The issue can be easily reproduced using the following reproduer: root@q:linux# cat repro.sh set -e mkfs.btrfs -f /dev/sdb > /dev/null mount /dev/sdb /mnt/btrfs btrfs quota enable -s /mnt/btrfs umount /mnt/btrfs mount /dev/sdb /mnt/btrfs The issue is that when enabling quotas, at btrfs_quota_enable(), we set BTRFS_QGROUP_STATUS_FLAG_SIMPLE_MODE at fs_info->qgroup_flags and persist it in the quota root in the item with the key BTRFS_QGROUP_STATUS_KEY, but we only set the incompat bit BTRFS_FEATURE_INCOMPAT_SIMPLE_QUOTA after we commit the transaction used to enable simple quotas. This means that if after that transaction commit we unmount the filesystem without starting and committing any other transaction, or we have a power failure, the next time we mount the filesystem we will find the flag BTRFS_QGROUP_STATUS_FLAG_SIMPLE_MODE set in the item with the key BTRFS_QGROUP_STATUS_KEY but we will not find the incompat bit BTRFS_FEATURE_INCOMPAT_SIMPLE_QUOTA set in the superblock, triggering an assertion failure at: btrfs_read_qgroup_config() -> qgroup_read_enable_gen() To fix this issue, set the BTRFS_FEATURE_INCOMPAT_SIMPLE_QUOTA flag immediately after setting the BTRFS_QGROUP_STATUS_FLAG_SIMPLE_MODE. This ensures that both flags are flushed to disk within the same transaction. Fixes: `182940f4f4` ("btrfs: qgroup: add new quota mode for simple quotas") Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Julian Sun <sunjunchao2870@gmail.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-12-18 02:32:28 +01:00
Filipe Manana	515979e1d8	btrfs: remove pointless comment from ctree.h It's pointless to have a comment above the prototype declarations of btrfs_ctree_init() and btrfs_ctree_exit() mentioning that they are declared in ctree.c. This is from the old days when ctree.h was used to place anything that didn't fit in any other file. So remove it. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-12-18 02:32:28 +01:00
Filipe Manana	3902cb1b9d	btrfs: move extent-tree function declarations out of ctree.h We have 3 functions that have their prototypes declared in ctree.h but they are defined at extent-tree.c and they are unrelated to the btree data structure. Move the prototypes out of ctree.h and into extent-tree.h. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-12-18 02:32:28 +01:00
Filipe Manana	bc1d3d0046	btrfs: move btrfs_alloc_write_mask() into fs.h Currently btrfs_alloc_write_mask() is defined in ctree.h but it's not related at all to the btree data structure, so move it into fs.h. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-12-18 02:32:27 +01:00
Filipe Manana	d555752f59	btrfs: move BTRFS_BYTES_TO_BLKS() into fs.h Currently BTRFS_BYTES_TO_BLKS() is defined in ctree.h but it's not related at all to the btree data structure, so move it into fs.h. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-12-18 02:32:27 +01:00
Filipe Manana	7dc9ea3b2d	btrfs: move the folio ordered helpers from ctree.h into fs.h The folio ordered helper macros are defined at ctree.h but this is not the best place since ctree.{h,c} is all about the btree data structure implementation and not a generic module. So move these macros into the fs.h header. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-12-18 02:32:27 +01:00
Filipe Manana	ed49aa9fb3	btrfs: move btrfs_is_empty_uuid() from ioctl.c into fs.c It's a generic helper not specific to ioctls and used in several places, so move it out from ioctl.c and into fs.c. While at it change its return type from int to bool and declare the loop variable in the loop itself. This also slightly reduces the module's size. Before this change: $ size fs/btrfs/btrfs.ko text data bss dec hex filename 1781492 161037 16920 `1959449` 1de619 fs/btrfs/btrfs.ko After this change: $ size fs/btrfs/btrfs.ko text data bss dec hex filename 1781340 161037 16920 1959297 1de581 fs/btrfs/btrfs.ko Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-12-18 02:32:27 +01:00
Filipe Manana	a52943f27a	btrfs: move the exclusive operation functions into fs.c The declarations for the exclusive operation functions are located at fs.h but their definitions are in ioctl.c, which doesn't make much sense since (most of them) are used in several files other than ioctl.c. Since they are used in several files and they are generic enough, move them out of ioctl.c and into fs.c, even the ones that are currently only used at ioctl.c, for the sake of having them all in the same C file. This also reduces the module's size. Before this change: $ size fs/btrfs/btrfs.ko text data bss dec hex filename 1782094 161045 16920 1960059 1de87b fs/btrfs/btrfs.ko After this change: $ size fs/btrfs/btrfs.ko text data bss dec hex filename 1781492 161037 16920 `1959449` 1de619 fs/btrfs/btrfs.ko Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-12-18 02:32:26 +01:00
Filipe Manana	cfbdb7750c	btrfs: move csum related functions from ctree.c into fs.c The ctree module is about the implementation of the btree data structure and not a place holder for generic filesystem things like the csum algorithm details. Move the functions related to the csum algorithm details away from ctree.c and into fs.c, which is a far better place for them. Also fix missing punctuation in comments and change one multiline comment to a single line comment since everything fits in under 80 characters. For some reason this also sligthly reduces the module's size. Before this change: $ size fs/btrfs/btrfs.ko text data bss dec hex filename 1782126 161045 16920 1960091 1de89b fs/btrfs/btrfs.ko After this change: $ size fs/btrfs/btrfs.ko text data bss dec hex filename 1782094 161045 16920 1960059 1de87b fs/btrfs/btrfs.ko Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-12-18 02:32:26 +01:00
Filipe Manana	0f1523dc0e	btrfs: move abort_should_print_stack() to transaction.h The function abort_should_print_stack() is declared in transaction.h but its definition is in ctree.c, which doesn't make sense since ctree.c is the btree implementation and the function is related to the transaction code. Move its definition into transaction.h as an inline function since it's a very short and trivial function, and also add the 'btrfs_' prefix into its name. This change also reduces the module size. Before this change: $ size fs/btrfs/btrfs.ko text data bss dec hex filename 1783148 161137 16920 1961205 1decf5 fs/btrfs/btrfs.ko After this change: $ size fs/btrfs/btrfs.ko text data bss dec hex filename 1782126 161045 16920 1960091 1de89b fs/btrfs/btrfs.ko Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-12-18 02:32:26 +01:00
Johannes Thumshirn	d6a31cd040	btrfs: pass btrfs_io_geometry to is_single_device_io Now that we have the stripe tree decision saved in struct btrfs_io_geometry we can pass it into is_single_device_io() and get rid of another call to btrfs_need_raid_stripe_tree_update(). Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-12-18 02:32:25 +01:00
Johannes Thumshirn	dd88750692	btrfs: cache RAID stripe tree decision in btrfs_io_context Cache the decision if a particular I/O needs to update RAID stripe tree entries in struct btrfs_io_context. Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-12-18 02:32:25 +01:00
Johannes Thumshirn	f31fcf0b29	btrfs: cache stripe tree usage in io_geometry Cache the return of btrfs_need_stripe_tree_update() in struct btrfs_io_geometry. Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-12-18 02:32:25 +01:00
Filipe Manana	31dd35f120	btrfs: add assertions and comment about path expectations to btrfs_cross_ref_exist() We should always call check_delayed_ref() with a path having a locked leaf from the extent tree where either the extent item is located or where it should be located in case it doesn't exist yet (when there's a pending unflushed delayed ref to do it), as we need to lock any existing delayed ref head while holding such leaf locked in order to avoid races with flushing delayed references, which could make us think an extent is not shared when it really is. So add some assertions and a comment about such expectations to btrfs_cross_ref_exist(), which is the only caller of check_delayed_ref(). Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-12-18 02:32:25 +01:00
Filipe Manana	68de37aa69	btrfs: add function comment for check_committed_ref() There are some not immediately obvious details about the operation of check_committed_ref(), namely that when it returns 0 it must return with the path having a locked leaf from the extent tree that contains the extent's extent item, so that we can later check for delayed refs when calling check_delayed_ref() in a way that doesn't race with a task running delayed references. For similar reasons, it must also return with a locked leaf when the extent item is not found, and that leaf is where the extent item should be located, because we may have delayed references that are going to create the extent item. Also document that the function can return false positives in order to not be too slow, and that the most important is to not return false negatives. So add a function comment to check_committed_ref(). Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-12-18 02:32:24 +01:00
Filipe Manana	dffd80e3d7	btrfs: simplify arguments for btrfs_cross_ref_exist() Instead of passing a root and an objectid which matches an inode number, pass the inode instead, since the root is always the root associated to the inode and the objectid is the number of that inode. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-12-18 02:32:24 +01:00
Filipe Manana	5cb68c6eac	btrfs: simplify return logic at check_committed_ref() Instead of setting the value to return in a local variable 'ret' and then jumping into a label named 'out' that does nothing but return that value, simplify everything by getting rid of the label and directly returning a value. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-12-18 02:32:24 +01:00
Filipe Manana	ad6cf0bcc3	btrfs: avoid redundant call to get inline ref type at check_committed_ref() At check_committed_ref() we are calling btrfs_get_extent_inline_ref_type() twice, once before we check if have an inline extent owner ref (for simple qgroups) and then once again sometime after that check. This second call is redundant when we have simple quotas disabled or we found an inline ref that is not of the owner ref type. So avoid this second call unless we have simple quotas enabled and found an owner ref, saving a function call that does inline ref validation again. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-12-18 02:32:24 +01:00
Filipe Manana	841c77efc3	btrfs: remove the snapshot check from check_committed_ref() At check_committed_ref() we have this check to see if the data extent was created in a generation lower than or equals to the generation where the last snapshot for the root was created, and if so we return immediately with 1, since it's very likely the extent is shared, referenced by other root. The only call chain for check_committed_ref() is the following: can_nocow_file_extent() btrfs_cross_ref_exist() check_committed_ref() And we already do that snapshot check at can_nocow_file_extent(), before we call btrfs_cross_ref_exist(). This makes the check done at check_committed_ref() redundant, so remove it. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-12-18 02:32:23 +01:00

1 2 3 4 5 ...

13701 Commits