linux

mirror of https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git synced 2025-01-09 06:43:09 +00:00

Linux kernel source tree


Qu Wenruo f6065f8ede btrfs: raid56: don't trust any cached sector in __raid56_parity_recover()		2022-07-25 17:45:37 +02:00

[BUG]
There is a small workload which will always fail with recent kernels:
(A simplified version from btrfs/125 test case)

    mkfs.btrfs -f -m raid5 -d raid5 -b 1G $dev1 $dev2 $dev3
    mount $dev1 $mnt
    xfs_io -f -c "pwrite -S 0xee 0 1M" $mnt/file1
    sync
    umount $mnt
    btrfs dev scan -u $dev3
    mount -o degraded $dev1 $mnt
    xfs_io -f -c "pwrite -S 0xff 0 128M" $mnt/file2
    umount $mnt
    btrfs dev scan
    mount $dev1 $mnt
    btrfs balance start --full-balance $mnt
    umount $mnt

The failure is always a failure to read some tree blocks:

    BTRFS info (device dm-4): relocating block group 217710592 flags data|raid5
    BTRFS error (device dm-4): parent transid verify failed on 38993920 wanted 9 found 7
    BTRFS error (device dm-4): parent transid verify failed on 38993920 wanted 9 found 7
    ...

[CAUSE]
With the recently added debug output, we can see all RAID56 operations
related to full stripe 38928384:

    56.1183: raid56_read_partial: full_stripe=38928384 devid=2 type=DATA1 offset=0 opf=0x0 physical=9502720 len=65536
    56.1185: raid56_read_partial: full_stripe=38928384 devid=3 type=DATA2 offset=16384 opf=0x0 physical=9519104 len=16384
    56.1185: raid56_read_partial: full_stripe=38928384 devid=3 type=DATA2 offset=49152 opf=0x0 physical=9551872 len=16384
    56.1187: raid56_write_stripe: full_stripe=38928384 devid=3 type=DATA2 offset=0 opf=0x1 physical=9502720 len=16384
    56.1188: raid56_write_stripe: full_stripe=38928384 devid=3 type=DATA2 offset=32768 opf=0x1 physical=9535488 len=16384
    56.1188: raid56_write_stripe: full_stripe=38928384 devid=1 type=PQ1 offset=0 opf=0x1 physical=30474240 len=16384
    56.1189: raid56_write_stripe: full_stripe=38928384 devid=1 type=PQ1 offset=32768 opf=0x1 physical=30507008 len=16384
    56.1218: raid56_write_stripe: full_stripe=38928384 devid=3 type=DATA2 offset=49152 opf=0x1 physical=9551872 len=16384
    56.1219: raid56_write_stripe: full_stripe=38928384 devid=1 type=PQ1 offset=49152 opf=0x1 physical=30523392 len=16384
    56.2721: raid56_parity_recover: full stripe=38928384 eb=39010304 mirror=2
    56.2723: raid56_parity_recover: full stripe=38928384 eb=39010304 mirror=2
    56.2724: raid56_parity_recover: full stripe=38928384 eb=39010304 mirror=2

Before we enter raid56_parity_recover(), we have triggered some metadata
write for the full stripe 38928384, which leads us to read all the
sectors from disk.

Furthermore, btrfs raid56 write will cache its calculated P/Q sectors to
avoid unnecessary reads.

This means, for that full stripe, after any partial write, we will have
stale data, along with P/Q calculated using that stale data.

Thankfully due to patch "btrfs: only write the sectors in the vertical
stripe which has data stripes" we haven't submitted all the corrupted
P/Q to disk.

When we really need to recover a certain range, aka in
raid56_parity_recover(), we will use the cached rbio, along with its
cached sectors (the full stripe is all cached).

This explains why we have no event raid56_scrub_read_recover()
triggered.

Since we have the cached P/Q which is calculated using the stale data,
the recovered one will just be stale.

In our particular test case, it will always return the same incorrect
metadata, thus causing the same error message "parent transid verify
failed on 39010304 wanted 9 found 7" again and again.

[BTRFS DESTRUCTIVE RMW PROBLEM]
Test case btrfs/125 (and above workload) always has its trouble with
the destructive read-modify-write (RMW) cycle:

             0       32K     64K
    Data1:   | Good  | Good  |
    Data2:   | Bad   | Bad   |
    Parity:  | Good  | Good  |

In above case, if we trigger any write into Data1, we will use the bad
data in Data2 to re-generate parity, killing the only chance to recover
Data2, thus Data2 is lost forever.

This destructive RMW cycle is not specific to btrfs RAID56, but there
are some btrfs specific behaviors making the case even worse:

- Btrfs will cache sectors for unrelated vertical stripes.

  In above example, if we're only writing into 0~32K range, btrfs will
  still read data range (32K ~ 64K) of Data1, and (64K~128K) of Data2.
  This behavior is to cache sectors for later update.

  Incidentally commit `d4e28d9b5f` ("btrfs: raid56: make steal_rbio()
  subpage compatible") has a bug which makes RAID56 never trust the
  cached sectors, thus slightly improving the situation for recovery.

  Unfortunately, follow up fix "btrfs: update stripe_sectors::uptodate in
  steal_rbio" will revert the behavior back to the old one.

- Btrfs raid56 partial write will update all P/Q sectors and cache them

  This means, even if data at (64K ~ 96K) of Data2 is free space, and
  only (96K ~ 128K) of Data2 is really stale data, when we write into
  that (96K ~ 128K) we will update all the parity sectors for the full
  stripe.

  This unnecessary behavior will completely kill the chance of recovery.

  Thankfully, an unrelated optimization "btrfs: only write the sectors
  in the vertical stripe which has data stripes" will prevent
  submitting the write bio for untouched vertical sectors.

  That optimization will keep the on-disk P/Q untouched for a chance for
  later recovery.

[FIX]
Although we have no good way to completely fix the destructive RMW
(unless we go full scrub for each partial write), we can still limit the
damage.

With patch "btrfs: only write the sectors in the vertical stripe which
has data stripes" now we won't really submit the P/Q of unrelated
vertical stripes, so the on-disk P/Q should still be fine.

Now all we really need to do is just drop all the cached sectors when
doing recovery.

By this, we have a chance to read the original P/Q from disk, and have a
chance to recover the stale data, while still keeping the cache to speed
up the regular write path.

In fact, just dropping all the cache for recovery path is good enough to
allow the test case btrfs/125 along with the small script to pass
reliably.

The lack of metadata write after the degraded mount, and forced metadata
COW is saving us this time.

So this patch will fix the behavior by not trusting any cache in
__raid56_parity_recover(), to solve the problem while still keeping the
cache useful.

But please note that this test pass DOES NOT mean we have solved the
destructive RMW problem, we just do damage control a little better.

Related patches:

- btrfs: only write the sectors in the vertical stripe
- `d4e28d9b5f` ("btrfs: raid56: make steal_rbio() subpage compatible")
- btrfs: update stripe_sectors::uptodate in steal_rbio

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
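
The difference between recovering from cached parity and from on-disk parity, as described in the commit message above, can be sketched with a toy single-parity (XOR) model. This is only an illustration under simplifying assumptions (two data devices, one parity device, two vertical stripes, 4-byte sectors); it is not btrfs code, and every name in it (`xor`, `recover_d2`, `cached_parity`, `disk_parity`, and so on) is hypothetical.

```python
"""Toy model of single-parity (RAID5-like) recovery; not btrfs code."""


def xor(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length sectors."""
    return bytes(x ^ y for x, y in zip(a, b))


def recover_d2(d1: bytes, parity: bytes) -> bytes:
    """Rebuild the Data2 sector from the Data1 sector and parity."""
    return xor(d1, parity)


# Two vertical stripes (index 0 and 1), one sector per device in each.
good_d1 = [b"\x11" * 4, b"\x22" * 4]
good_d2 = [b"\x33" * 4, b"\x44" * 4]
stale_d2 = [b"\x00" * 4, b"\x00" * 4]  # device that missed the writes

# On-disk parity was computed from the good data.
disk_parity = [xor(good_d1[i], good_d2[i]) for i in range(2)]

# A partial write updates Data1 in vertical stripe 0 only, but parity is
# recomputed for the whole full stripe from the stale Data2 and cached.
new_d1 = [b"\x55" * 4, good_d1[1]]
cached_parity = [xor(new_d1[i], stale_d2[i]) for i in range(2)]

# Only the touched vertical stripe is written back to disk.
disk_parity[0] = cached_parity[0]

# Recovering vertical stripe 1 from the cache versus from disk.
from_cache = recover_d2(new_d1[1], cached_parity[1])
from_disk = recover_d2(new_d1[1], disk_parity[1])
# Vertical stripe 0 went through the destructive RMW cycle.
stripe0 = recover_d2(new_d1[0], disk_parity[0])

print("stripe 1 from cache is stale:", from_cache == stale_d2[1])
print("stripe 1 from disk is good:  ", from_disk == good_d2[1])
print("stripe 0 is lost (stale):    ", stripe0 == stale_d2[0])
```

In this model the untouched vertical stripe is still recoverable from disk but not from the cache, which mirrors why the recovery path drops cached sectors; the touched vertical stripe stays lost, which mirrors the remaining destructive RMW problem.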
arch	- Reorganize the perf LBR init code so that a TSX quirk is applied early	2022-07-24 09:55:53 -07:00
block	block: fix missing blkcg_bio_issue_init	2022-07-14 10:54:49 -06:00
certs	certs: make system keyring depend on x509 parser	2022-07-24 12:53:55 -07:00
crypto	crypto: s390 - do not depend on CRYPTO_HW for SIMD implementations	2022-07-06 20:04:06 -07:00
Documentation	* Check for invalid flags to KVM_CAP_X86_USER_SPACE_MSR	2022-07-23 10:22:26 -07:00
drivers	- Make retbleed mitigations 64-bit only (32-bit will need a bit more	2022-07-24 09:40:17 -07:00
fs	btrfs: raid56: don't trust any cached sector in __raid56_parity_recover()	2022-07-25 17:45:37 +02:00
include	btrfs: add tracepoints for ordered extents	2022-07-25 17:45:34 +02:00
init	gcc-12: disable '-Warray-bounds' universally for now	2022-06-09 10:11:12 -07:00
ipc	ipc: Free mq_sysctls if ipc namespace creation failed	2022-06-22 17:47:41 -05:00
kernel	A single fix to correct a wrong BUG_ON() condition for deboosted tasks.	2022-07-24 09:50:53 -07:00
lib	ubsan: disable UBSAN_DIV_ZERO for clang	2022-07-14 15:45:26 -07:00
LICENSES	LICENSES/LGPL-2.1: Add LGPL-2.1-or-later as valid identifiers	2021-12-16 14:33:10 +01:00
mm	mm: split huge PUD on wp_huge_pud fallback	2022-07-03 15:42:33 -07:00
net	net/sched: cls_api: Fix flow action initialization	2022-07-20 10:54:27 +01:00
samples	Fixes and minor clean ups for tracing:	2022-07-12 16:17:40 -07:00
scripts	scripts/gdb: Fix gdb 'lx-symbols' command	2022-07-21 10:40:55 -07:00
security	lockdown: Fix kexec lockdown bypass with ima policy	2022-07-20 09:56:48 -07:00
sound	ASoC: Drop Rockchip BCLK management for v5.19	2022-07-15 12:31:07 +02:00
tools	* Check for invalid flags to KVM_CAP_X86_USER_SPACE_MSR	2022-07-23 10:22:26 -07:00
usr	Not a lot of material this cycle. Many singleton patches against various	2022-05-27 11:22:03 -07:00
virt	KVM: x86: disable preemption around the call to kvm_arch_vcpu_{un\|}blocking	2022-06-09 10:52:20 -04:00
.clang-format	clang-format: Fix space after for_each macros	2022-05-20 19:27:16 +02:00
.cocciconfig	scripts: add Linux .cocciconfig for coccinelle	2016-07-22 12:13:39 +02:00
.get_maintainer.ignore	Opt out of scripts/get_maintainer.pl	2019-05-16 10:53:40 -07:00
.gitattributes	.gitattributes: use 'dts' diff driver for dts files	2019-12-04 19:44:11 -08:00
.gitignore	kbuild: split the second line of .mod into .usyms	2022-05-08 03:16:59 +09:00
.mailmap	cxl fixes for 5.19-rc6	2022-07-08 12:55:25 -07:00
COPYING	COPYING: state that all contributions really are covered by this file	2020-02-10 13:32:20 -08:00
CREDITS	MAINTAINERS: mark ARM/PALM TREO SUPPORT orphan	2022-07-07 15:17:00 +02:00
Kbuild	kbuild: rename hostprogs-y/always to hostprogs/always-y	2020-02-04 01:53:07 +09:00
Kconfig	kbuild: ensure full rebuild when the compiler is updated	2020-05-12 13:28:33 +09:00
MAINTAINERS	Pin control fixes for the v5.19 kernel cycle:	2022-07-22 12:24:04 -07:00
Makefile	Linux 5.19-rc8	2022-07-24 13:26:27 -07:00
README	Drop all 00-INDEX files from Documentation/	2018-09-09 15:08:58 -06:00

README

Linux kernel
============

There are several guides for kernel developers and users. These guides can
be rendered in a number of formats, like HTML and PDF. Please read
Documentation/admin-guide/README.rst first.

In order to build the documentation, use ``make htmldocs`` or
``make pdfdocs``.  The formatted documentation can also be read online at:

    https://www.kernel.org/doc/html/latest/

There are various text files in the Documentation/ subdirectory,
several of them using the reStructuredText markup notation.

Please read the Documentation/process/changes.rst file, as it contains the
requirements for building and running the kernel, and information about
the problems which may result from upgrading your kernel.