linux-next

mirror of https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git synced 2024-12-29 09:12:07 +00:00

Author	SHA1	Message	Date
Kumar Kartikeya Dwivedi	838a10bd2e	bpf: Augment raw_tp arguments with PTR_MAYBE_NULL Arguments to a raw tracepoint are tagged as trusted, which carries the semantics that the pointer will be non-NULL. However, in certain cases, a raw tracepoint argument may end up being NULL. More context about this issue is available in [0]. Thus, there is a discrepancy between the reality, that raw_tp arguments can actually be NULL, and the verifier's knowledge, that they are never NULL, causing explicit NULL check branch to be dead code eliminated. A previous attempt [1], i.e. the second fixed commit, was made to simulate symbolic execution as if in most accesses, the argument is a non-NULL raw_tp, except for conditional jumps. This tried to suppress branch prediction while preserving compatibility, but surfaced issues with production programs that were difficult to solve without increasing verifier complexity. A more complete discussion of issues and fixes is available at [2]. Fix this by maintaining an explicit list of tracepoints where the arguments are known to be NULL, and mark the positional arguments as PTR_MAYBE_NULL. Additionally, capture the tracepoints where arguments are known to be ERR_PTR, and mark these arguments as scalar values to prevent potential dereference. Each hex digit is used to encode NULL-ness (0x1) or ERR_PTR-ness (0x2), shifted by the zero-indexed argument number x 4. This can be represented as follows: 1st arg: 0x1 2nd arg: 0x10 3rd arg: 0x100 ... and so on (likewise for ERR_PTR case). In the future, an automated pass will be used to produce such a list, or insert __nullable annotations automatically for tracepoints. Each compilation unit will be analyzed and results will be collated to find whether a tracepoint pointer is definitely not null, maybe null, or an unknown state where verifier conservatively marks it PTR_MAYBE_NULL. A proof of concept of this tool from Eduard is available at [3]. Note that in case we don't find a specification in the raw_tp_null_args array and the tracepoint belongs to a kernel module, we will conservatively mark the arguments as PTR_MAYBE_NULL. This is because unlike for in-tree modules, out-of-tree module tracepoints may pass NULL freely to the tracepoint. We don't protect against such tracepoints passing ERR_PTR (which is uncommon anyway), lest we mark all such arguments as SCALAR_VALUE. While we are it, let's adjust the test raw_tp_null to not perform dereference of the skb->mark, as that won't be allowed anymore, and make it more robust by using inline assembly to test the dead code elimination behavior, which should still stay the same. [0]: https://lore.kernel.org/bpf/ZrCZS6nisraEqehw@jlelli-thinkpadt14gen4.remote.csb [1]: https://lore.kernel.org/all/20241104171959.2938862-1-memxor@gmail.com [2]: https://lore.kernel.org/bpf/20241206161053.809580-1-memxor@gmail.com [3]: https://github.com/eddyz87/llvm-project/tree/nullness-for-tracepoint-params Reported-by: Juri Lelli <juri.lelli@redhat.com> # original bug Reported-by: Manu Bretelle <chantra@meta.com> # bugs in masking fix Fixes: `3f00c52393` ("bpf: Allow trusted pointers to be passed to KF_TRUSTED_ARGS kfuncs") Fixes: `cb4158ce8e` ("bpf: Mark raw_tp arguments with PTR_MAYBE_NULL") Reviewed-by: Eduard Zingerman <eddyz87@gmail.com> Co-developed-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20241213221929.3495062-3-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2024-12-13 16:24:53 -08:00
Kumar Kartikeya Dwivedi	c00d738e16	bpf: Revert "bpf: Mark raw_tp arguments with PTR_MAYBE_NULL" This patch reverts commit `cb4158ce8e` ("bpf: Mark raw_tp arguments with PTR_MAYBE_NULL"). The patch was well-intended and meant to be as a stop-gap fixing branch prediction when the pointer may actually be NULL at runtime. Eventually, it was supposed to be replaced by an automated script or compiler pass detecting possibly NULL arguments and marking them accordingly. However, it caused two main issues observed for production programs and failed to preserve backwards compatibility. First, programs relied on the verifier not exploring == NULL branch when pointer is not NULL, thus they started failing with a 'dereference of scalar' error. Next, allowing raw_tp arguments to be modified surfaced the warning in the verifier that warns against reg->off when PTR_MAYBE_NULL is set. More information, context, and discusson on both problems is available in [0]. Overall, this approach had several shortcomings, and the fixes would further complicate the verifier's logic, and the entire masking scheme would have to be removed eventually anyway. Hence, revert the patch in preparation of a better fix avoiding these issues to replace this commit. [0]: https://lore.kernel.org/bpf/20241206161053.809580-1-memxor@gmail.com Reported-by: Manu Bretelle <chantra@meta.com> Fixes: `cb4158ce8e` ("bpf: Mark raw_tp arguments with PTR_MAYBE_NULL") Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20241213221929.3495062-2-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2024-12-13 16:24:53 -08:00
Linus Torvalds	c30c65f3fe	block-6.13-20241213 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmdckwAQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgppJmD/9uUPcPm51b5E7fzD5Hqvlb22uZMYbXs1vR 1NZWdPJhMoMPBXyQ0GN7wHThvwQ8VuTcZNs8pzSJWpZMMhgsdleViDL8hedPeblt TSDc6g2gEt7TtIGIhNqq7bQNW61a+KxZz55B/qKqlJOUsW7ALPuM4m34vSMTNKw8 c/RK3PtTxSvE5nmqLzeynw2Zo7IZ0PL2NSYZ0oID9ZcGtj4ItezhshXTPLuLNuRM ppvc9u3JGyAzVJI/I0GNNW2Xo2maFWtvcWznaegowoBzjQO4Qfo9WDtn3uJFl8Di N6M7H4GASo80l+Hd1eAal3YrM53Z1RW1Mj4xaA2+vtL+p7k5tfV3WAr00hNsK9TW 401KbBGqgvrVS7Y/Y0ADVqqoCePPhblZmWbJu36Jrz0/nDGi3lPOSigYSANlxGsC t0aeeD7lRTd5qJ5+pQnVQL9uaNTXZcUPizgGGG0y9/RcNPzot54ap0/cIEM3fXEQ nd2Tv7O1lSyE5O2JCaoBY/P5ytI7LgZHDH2/ZaUZyDxKFgXSj+2NufTUGj/YmZhr ZKU6U25cFhXEkMVDV4AUUh9Dq723dYkXE4xyl00eNiz6C0J9PQ8uvrlRgC9mPJeQ g4xvyIZxkqeRmzAvUIipKQSZ5Iia1/Jtwqs1RBcda8/7w3sOmQOA9xJGjevLI8G6 he9AQP63JQ== =trxL -----END PGP SIGNATURE----- Merge tag 'block-6.13-20241213' of git://git.kernel.dk/linux Pull block fixes from Jens Axboe: - Series from Damien fixing issues with the zoned write plugging - Fix for a potential UAF in block cgroups - Fix deadlock around queue freezing and the sysfs lock - Various little cleanups and fixes * tag 'block-6.13-20241213' of git://git.kernel.dk/linux: block: Fix potential deadlock while freezing queue and acquiring sysfs_lock block: Fix queue_iostats_passthrough_show() blk-mq: Clean up blk_mq_requeue_work() mq-deadline: Remove a local variable blk-iocost: Avoid using clamp() on inuse in __propagate_weights() block: Make bio_iov_bvec_set() accept pointer to const iov_iter block: get wp_offset by bdev_offset_from_zone_start blk-cgroup: Fix UAF in blkcg_unpin_online() MAINTAINERS: update Coly Li's email address block: Prevent potential deadlocks in zone write plug error recovery dm: Fix dm-zoned-reclaim zone write pointer alignment block: Ignore REQ_NOWAIT for zone reset and zone finish operations block: Use a zone write plug BIO work for REQ_NOWAIT BIOs	2024-12-13 15:10:59 -08:00
Linus Torvalds	2a816e4f6a	io_uring-6.13-20241213 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmdckxAQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpq3LD/9oZzEBKsIvAl7tPeHsmVBTaQIC8kS93gQB 8RxCZ3YJmM1Ilu/sezKnkDEbl9aVEAtgwCh04Lep42ye4tBA4eMfQ8kPj2/HWYaY Vt4vM20Ls/diGGrjSK0958F8zwBeMKZsgSYeXhNf5IyOzXX1lY+JhCg2I+MLnYNi fw5IngQBB3yZwtMk4VN35Jf5M/PvE7pww8MMarSsfYsSAawf50Fs85K5xoJ4vTWE NxYcIU7MG6bqc6qrretNhgnXnw1OeZpjivXZAJv9+0WAJkMTxJ5QPLUpD7NJDi4b GUD/+SiLaxCmXs/jib/pafXPEbcuhX4a7c3yy0Ceq94OrE7F7H52Dz7PqQ4VxAAP wU5FaFjmytx+EqwDCAk1Cgvb3LX/gD793rA4A8FP64gBhlSmaW/yeMkffO9p9/Xb BIaX8BLb5iVER5I5/BpdpTa6TPy0mzGDDAD/+/DU+V65PLa7Qnt4bDLpSgUzw0OJ EIaqOxGZVF4xi4PqMFlOZqysI+rci3kOfHCrILWxuBlxOCaspRk8HFN0eypsgLu8 NenTlQBY+fKaKK32AQEfd21in5+tNBSy+E6H3HIWwUedNk2zihSuO12muXkOsfim BG0CgSfMUzIA6Pxgd87NaE62JXQBhZ2JZQW8swIqKc1m8/p/V//9yYeH97sbZasK 7cdiHV4anQ== =0uqv -----END PGP SIGNATURE----- Merge tag 'io_uring-6.13-20241213' of git://git.kernel.dk/linux Pull io_uring fix from Jens Axboe: "A single fix for a regression introduced in the 6.13 merge window" * tag 'io_uring-6.13-20241213' of git://git.kernel.dk/linux: io_uring/rsrc: don't put/free empty buffers	2024-12-13 15:08:55 -08:00
Linus Torvalds	1c021e7908	libnvdimm fixes for 6.13-rc3 - sysbot fix for out of bounds access -----BEGIN PGP SIGNATURE----- iIoEABYKADIWIQSgX9xt+GwmrJEQ+euebuN7TNx1MQUCZ1x73BQcaXJhLndlaW55 QGludGVsLmNvbQAKCRCebuN7TNx1Md/JAQC1l9Fxu9FR4UW8nA+wgU0P8XYGKqmD EqRUb82LI13f5QD/WtBQEVgBTxD5sVlMBjEGIuYE9hMZbdN0LB8wnK0HfQ8= =87RX -----END PGP SIGNATURE----- Merge tag 'libnvdimm-fixes-6.13-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm Pull libnvdimm fix from Ira Weiny: - sysbot fix for out of bounds access * tag 'libnvdimm-fixes-6.13-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: acpi: nfit: vmalloc-out-of-bounds Read in acpi_nfit_ctl	2024-12-13 15:07:22 -08:00
Leon Romanovsky	824927e884	ARC: build: Try to guess GCC variant of cross compiler ARC GCC compiler is packaged starting from Fedora 39i and the GCC variant of cross compile tools has arc-linux-gnu- prefix and not arc-linux-. This is causing that CROSS_COMPILE variable is left unset. This change allows builds without need to supply CROSS_COMPILE argument if distro package is used. Before this change: $ make -j 128 ARCH=arc W=1 drivers/infiniband/hw/mlx4/ gcc: warning: ‘-mcpu=’ is deprecated; use ‘-mtune=’ or ‘-march=’ instead gcc: error: unrecognized command-line option ‘-mmedium-calls’ gcc: error: unrecognized command-line option ‘-mlock’ gcc: error: unrecognized command-line option ‘-munaligned-access’ [1] https://packages.fedoraproject.org/pkgs/cross-gcc/gcc-arc-linux-gnu/index.html Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Vineet Gupta <vgupta@kernel.org>	2024-12-13 14:39:24 -08:00
Linus Torvalds	4800575d8c	Bug fixes for 6.13-rc3 * Fixes for scrub subsystem * Fix quota crashes * Fix sb_spino_align checons on large fsblock sizes * Fix discarded superblock updates * fix stuck unmount due to a locked inode Signed-off-by: Carlos Maiolino <cem@kernel.org> -----BEGIN PGP SIGNATURE----- iJUEABMJAB0WIQQMHYkcUKcy4GgPe2RGdaER5QtfpgUCZ1yHBQAKCRBGdaER5Qtf pvogAX98nf8qE2Y2I6DvXVeX+LOPWNG+WTiRpF3d8NxjhCfGn79am71YrtkHf6qC V0halh8Bfi+SWsVqYYuJK+7/SUjGS2MzNqG0NY32wNYqAIroQA/Xh9IaQXX0q8t2 FLGWIaxWOA== =0g8J -----END PGP SIGNATURE----- Merge tag 'xfs-fixes-6.13-rc3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux Pull xfs fixes from Carlos Maiolino: - Fixes for scrub subsystem - Fix quota crashes - Fix sb_spino_align checons on large fsblock sizes - Fix discarded superblock updates - Fix stuck unmount due to a locked inode * tag 'xfs-fixes-6.13-rc3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (28 commits) xfs: port xfs_ioc_start_commit to multigrain timestamps xfs: return from xfs_symlink_verify early on V4 filesystems xfs: fix zero byte checking in the superblock scrubber xfs: check pre-metadir fields correctly xfs: don't crash on corrupt /quotas dirent xfs: don't move nondir/nonreg temporary repair files to the metadir namespace xfs: fix sb_spino_align checks for large fsblock sizes xfs: convert quotacheck to attach dquot buffers xfs: attach dquot buffer to dquot log item buffer xfs: clean up log item accesses in xfs_qm_dqflush{,_done} xfs: separate dquot buffer reads from xfs_dqflush xfs: don't lose solo dquot update transactions xfs: don't lose solo superblock counter update transactions xfs: avoid nested calls to __xfs_trans_commit xfs: only run precommits once per transaction object xfs: unlock inodes when erroring out of xfs_trans_alloc_dir xfs: fix scrub tracepoints when inode-rooted btrees are involved xfs: update btree keys correctly when _insrec splits an inode root block xfs: fix error bailout in xfs_rtginode_create xfs: fix null bno_hint handling in xfs_rtallocate_rtg ...	2024-12-13 14:27:40 -08:00
Linus Torvalds	01af50af76	arm64 fixes for 6.13-rc3: - arm64 stacktrace: address some fallout from the recent changes to unwinding across exception boundaries - Ensure the arm64 signal delivery failure is recoverable - only override the return registers after all the user accesses took place - Fix the arm64 kselftest access to SVCR - only when SME is detected -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEE5RElWfyWxS+3PLO2a9axLQDIXvEFAmdccV0ACgkQa9axLQDI XvHC6Q//X7A2aziMs+FTYrDc0sqK09Ur53SNyQZuA6v1dXqxLH+SbzM75n6TMv9O a5lv/1Jxw0aQPOhTsRgg7wV+svMWyOx52ouUSjFVRPocBFgnxiyyxcxgC9TmPMTh MbZPa36R7qSpfUSPK6+uOTI/eQ6cD66az4fq7Y936cyY8J1/xg5pQXSE2WWGx4nz Ryci4iksNB8IXH37twDmPvE6446Q82eLcJR9zfh+lhf+Rgx5/rNEPdqtNJPR1a7N ftVDHTvcMTLmoMsjvOkhtfjbwsGBbz4Ezji0GW3+zPcZK5EYRjUZvOnG3HD2/mds iYfh0xux0GBEsjozkIshu7wg1JqMwJJ8zBXu62/TcKHG8LG3C/k6kCr+SN7IL8p8 qzxxkkD61AxAY0aHuJ56Ao9nCLbHkAFGOJXUiyDJQALzt2KKN4aoOt+0zj35yYFz PONt4FPy3ulX6fk0XrAcI5E64hjeW/ib5wluOxb79Wn07ZYZQ/Ff42kJEMVL6VA7 bFbbYKtbjFhQQPac/TLbqtwgMX0lA6bSZ5bPhoPZys3iReVbDiGvszRsWh4WQYgX X0sujmaSMkjU1IICfZs3yABLQ0aDJKeuaKHnri9WT3iuBOjiLlChVnXqd4VRgXfy DUE6bKQ3ohHVTrIxpNcBzWUYsSAeZMPyakuCY6pxuVwzBPrOR28= =s7gn -----END PGP SIGNATURE----- Merge tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux Pull arm64 fixes from Catalin Marinas: - arm64 stacktrace: address some fallout from the recent changes to unwinding across exception boundaries - Ensure the arm64 signal delivery failure is recoverable - only override the return registers after all the user accesses took place - Fix the arm64 kselftest access to SVCR - only when SME is detected * tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: kselftest/arm64: abi: fix SVCR detection arm64: signal: Ensure signal delivery failure is recoverable arm64: stacktrace: Don't WARN when unwinding other tasks arm64: stacktrace: Skip reporting LR at exception boundaries	2024-12-13 14:23:09 -08:00
Linus Torvalds	a5b3d5dfce	RISC-V Fixes for 6.13-rc3 * A fix to avoid taking a mutex while resolving jump_labels in the mutex implementation. * A fix to avoid trying to resolve the early boot DT pointer via the linear map. * A fix to avoid trying to IPI kfence TLB flushes, as kfence might flush with IRQs disabled. * A fix to avoid calling PMD destructors on PMDs that were never constructed. -----BEGIN PGP SIGNATURE----- iQJHBAABCAAxFiEEKzw3R0RoQ7JKlDp6LhMZ81+7GIkFAmdcYiETHHBhbG1lckBk YWJiZWx0LmNvbQAKCRAuExnzX7sYif8DD/4x/V+ZDInlTpQcPpKC7LtDaNBUyZ2r 4nENb4CcHkLTJjHHKUS9WFkH+woH0hx43yDkm5AKO3RttYj41fbybhBiomeEQdbE iA6d7zQD3EFTXP+Pg/fBfzEWgMUCl9GJjtJn8ZpwqsibMOcNrfog9i/6ErPnPOKi Q5+nx/Qyg2pxKI/9kTFsJ8W39iRaFZTVltrfZHc26i96ohx1b8J93KXT2dRFvkV8 doPcJkuEAP1l52B4y+Vo3T+01LS4bVWwv2Ye5r268lLB4DkYIwk/Hh6HG0DTeAID o0TWbCREuR4gvur8ge3C6P4v7lCT+gv2tSxW9vxd1y6vMhIgRnDHYMoloDDWw352 6hTDg6akqTC4jfMFwyOFo98qze/bghCG+ELgMxhbgQamrWO7qzVZCjkgKB2Wal3b rcY5eYMPFIR0YIBHa/kQ4J6mFVGhK4sBxpB9w3Z7BSYnMU+uCXChMmxo4kSXJelt N2Ny9rYlNxxhUg6OSDXAWQAoIUHo6O89gWZyXnFlBUfaI6cnVFDwd+2F6QoRhmhu XNTHl699qcV5Unb08qd2wnCXFv0X0WVVGTPO5rFH2/HlkPoPwZIGEv/kU+U42n5p OiIteste0TH5w20knrWYHZKEnNkaONcxNqmloTK89qQL+526s65pcgPk6W1hFqow 4WpJWIgwTRdd3A== =uaBB -----END PGP SIGNATURE----- Merge tag 'riscv-for-linus-6.13-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux Pull RISC-V fixes from Palmer Dabbelt: - avoid taking a mutex while resolving jump_labels in the mutex implementation - avoid trying to resolve the early boot DT pointer via the linear map - avoid trying to IPI kfence TLB flushes, as kfence might flush with IRQs disabled - avoid calling PMD destructors on PMDs that were never constructed * tag 'riscv-for-linus-6.13-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux: riscv: mm: Do not call pmd dtor on vmemmap page table teardown riscv: Fix IPIs usage in kfence_protect_page() riscv: Fix wrong usage of __pa() on a fixmap address riscv: Fixup boot failure when CONFIG_DEBUG_RT_MUTEXES=y	2024-12-13 14:18:18 -08:00
Rafael J. Wysocki	e14d5ae28e	Merge branch 'acpica' Merge an ACPICA fix for 6.13-rc3: - Don't release a context_mutex that was never acquired in acpi_remove_address_space_handler() (Daniil Tatianin). * acpica: ACPICA: events/evxfregn: don't release the ContextMutex that was never acquired	2024-12-13 21:32:06 +01:00
Paolo Bonzini	3522c41975	KVM/riscv fixes for 6.13, take #1 - Replace csr_write() with csr_set() for HVIEN PMU overflow bit -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEZdn75s5e6LHDQ+f/rUjsVaLHLAcFAmdcUn0ACgkQrUjsVaLH LAcoyg//RD70/kMg4ZvE9rjNSBE7c4TemvHaCPgSpI4qM/r1HeJSMUDj+AIHvhg2 S3jwgi32oJYaSXHEknvwwtaPN61JHn/mMeTcDlNSBG0vXacoAnprmfDMKzUoYNHc Ui9bkXhmkUUjLsbaW+lTOIHgUUCaLm4hh5m6W3adslCA/AfYn8v0XKiCHYVRSAXR qOgWvo2QSyowaGf9tU9ttvaigkV9FBUyIWJyOJcxwvkcvOg/fhvYF1CkQTxJh+m1 N7C7fh5CzvdugojlWtjMIm9kAGPnhLXnuNFLGZfAkgxltAlI+nzGALAS2tdq70/n DOBbf0J2zI1zPCCGYt0nPCIyqacduJHTxHLbS7gX+nIGdfoACVtS/ir/3gNIS4RW NnMgJi7OSvnAudlmOluJdp6K8Xavk/EVxgGY7ItUXpncH2g2lOe/2C7Ihsb6x/mF xbap8hTYOtUh9jUhmjcx47CzrmkufUKGwBJAk36PsLAwRkvVqS/gsQDVuHwqhpby 66cAj0xDqq8gSg2dJhQQzOB8bUvWhozUAjoaJMpWX1UjNph9ZkqwZPsKB1YXyLsx N5OhcmGCMdVIaJ2R/wmflWZCJI4U4ycZXO9VMdOFW882JoNya6uiNohJv/fosUjF 5JXFAgbM4tJ0M+UdDyAb3k85SQ/pTIKE/aoON8VtZBefku/DyNw= =/5Vc -----END PGP SIGNATURE----- Merge tag 'kvm-riscv-fixes-6.13-1' of https://github.com/kvm-riscv/linux into HEAD KVM/riscv fixes for 6.13, take #1 - Replace csr_write() with csr_set() for HVIEN PMU overflow bit	2024-12-13 13:59:20 -05:00
Sean Christopherson	1201f226c8	KVM: x86: Cache CPUID.0xD XSTATE offsets+sizes during module init Snapshot the output of CPUID.0xD.[1..n] during kvm.ko initiliaization to avoid the overead of CPUID during runtime. The offset, size, and metadata for CPUID.0xD.[1..n] sub-leaves does not depend on XCR0 or XSS values, i.e. is constant for a given CPU, and thus can be cached during module load. On Intel's Emerald Rapids, CPUID is wildly expensive, to the point where recomputing XSAVE offsets and sizes results in a 4x increase in latency of nested VM-Enter and VM-Exit (nested transitions can trigger xstate_required_size() multiple times per transition), relative to using cached values. The issue is easily visible by running `perf top` while triggering nested transitions: kvm_update_cpuid_runtime() shows up at a whopping 50%. As measured via RDTSC from L2 (using KVM-Unit-Test's CPUID VM-Exit test and a slightly modified L1 KVM to handle CPUID in the fastpath), a nested roundtrip to emulate CPUID on Skylake (SKX), Icelake (ICX), and Emerald Rapids (EMR) takes: SKX 11650 ICX 22350 EMR 28850 Using cached values, the latency drops to: SKX 6850 ICX 9000 EMR 7900 The underlying issue is that CPUID itself is slow on ICX, and comically slow on EMR. The problem is exacerbated on CPUs which support XSAVES and/or XSAVEC, as KVM invokes xstate_required_size() twice on each runtime CPUID update, and because there are more supported XSAVE features (CPUID for supported XSAVE feature sub-leafs is significantly slower). SKX: CPUID.0xD.2 = 348 cycles CPUID.0xD.3 = 400 cycles CPUID.0xD.4 = 276 cycles CPUID.0xD.5 = 236 cycles <other sub-leaves are similar> EMR: CPUID.0xD.2 = 1138 cycles CPUID.0xD.3 = 1362 cycles CPUID.0xD.4 = 1068 cycles CPUID.0xD.5 = 910 cycles CPUID.0xD.6 = 914 cycles CPUID.0xD.7 = 1350 cycles CPUID.0xD.8 = 734 cycles CPUID.0xD.9 = 766 cycles CPUID.0xD.10 = 732 cycles CPUID.0xD.11 = 718 cycles CPUID.0xD.12 = 734 cycles CPUID.0xD.13 = 1700 cycles CPUID.0xD.14 = 1126 cycles CPUID.0xD.15 = 898 cycles CPUID.0xD.16 = 716 cycles CPUID.0xD.17 = 748 cycles CPUID.0xD.18 = 776 cycles Note, updating runtime CPUID information multiple times per nested transition is itself a flaw, especially since CPUID is a mandotory intercept on both Intel and AMD. E.g. KVM doesn't need to ensure emulated CPUID state is up-to-date while running L2. That flaw will be fixed in a future patch, as deferring runtime CPUID updates is more subtle than it appears at first glance, the benefits aren't super critical to have once the XSAVE issue is resolved, and caching CPUID output is desirable even if KVM's updates are deferred. Cc: Jim Mattson <jmattson@google.com> Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com> Message-ID: <20241211013302.1347853-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2024-12-13 13:58:10 -05:00
Linus Torvalds	243f750a2d	gpio fixes for v6.13-rc3 - fix several low-level issues in gpio-graniterapids - fix an initialization order issue that manifests itself with __counted_by() checks in gpio-ljca - don't default to y for CONFIG_GPIO_MVEBU with COMPILE_TEST enabled - move the DEFAULT_SYMBOL_NAMESPACE define before the export.h include in gpio-idio-16 -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEFp3rbAvDxGAT0sefEacuoBRx13IFAmdcOcIACgkQEacuoBRx 13Ky3g/+JZawOniCmm5vBEH5SS4lhjEMts0i0WkaH5oIhESxsLgfZO8e0ucQt4td AmVk7ingHaoY1GYL1bmhkZEELHhqrSYb0uCMMtK1AxvXOeMUdGRj2QYpdKLGdWL3 dCrWemJi8sx0UDWs4LLvR5HcoKAACB5Mu2NmOMTtuYiBB8qIfCWQ0+0Q09/yhJM8 WNhGComnuKPvVgndHrGcd5H7CTUJobyacYu3dwvzPjPC+PvC6YTrkos5xonbyx1E GQQ3bxgU/ZvuGp6OKKeugaMxJcWOcrBm7emYiCyYwcasrh3GghUAcQ8UAP9CLYSx U1AzWI1lDqNdh7a+pr+HYR43aJ8RJ8lWFKA5wlUHSZv97G4aZyFRPBIC0AI4C9os OvW/sS4IvJ/snY7dyQ7972RLEMzmSSpEDWyYdrwOCDAtRt6KGpK7GH6yQrxQJCUN 6g30wry+zzlybki0gyyQB53b8X5pth6VuDtpaPSwFPQo1JL+Z0abbIPjqJKrVZBt UoVLKAev5Jo6U5/aiPvddBMw31RPeAN2dQTuYxKI8fv9xtWMYrpa9y9Hzgt4s1HX F+Pl2cVUnM+cKWYQbk7m1s07jHMjsnaH0RCxPX54RnXaQ7hevXfBn0HeNkETN426 uZmlQIPgIB7cP2zdBDnJ6Jzgzv5mu+azIuTb11PL6JmL+DOXzhw= =nunS -----END PGP SIGNATURE----- Merge tag 'gpio-fixes-for-v6.13-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/brgl/linux Pull gpio fixes from Bartosz Golaszewski: - fix several low-level issues in gpio-graniterapids - fix an initialization order issue that manifests itself with __counted_by() checks in gpio-ljca - don't default to y for CONFIG_GPIO_MVEBU with COMPILE_TEST enabled - move the DEFAULT_SYMBOL_NAMESPACE define before the export.h include in gpio-idio-16 * tag 'gpio-fixes-for-v6.13-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/brgl/linux: gpio: idio-16: Actually make use of the GPIO_IDIO_16 symbol namespace gpio: graniterapids: Fix GPIO Ack functionality gpio: graniterapids: Check if GPIO line can be used for IRQs gpio: graniterapids: Determine if GPIO pad can be used by driver gpio: graniterapids: Fix invalid RXEVCFG register bitmask gpio: graniterapids: Fix invalid GPI_IS register offset gpio: graniterapids: Fix incorrect BAR assignment gpio: graniterapids: Fix vGPIO driver crash gpio: ljca: Initialize num before accessing item in ljca_gpio_config gpio: GPIO_MVEBU should not default to y when compile-testing	2024-12-13 09:57:15 -08:00
Nilay Shroff	be26ba9642	block: Fix potential deadlock while freezing queue and acquiring sysfs_lock For storing a value to a queue attribute, the queue_attr_store function first freezes the queue (->q_usage_counter(io)) and then acquire ->sysfs_lock. This seems not correct as the usual ordering should be to acquire ->sysfs_lock before freezing the queue. This incorrect ordering causes the following lockdep splat which we are able to reproduce always simply by accessing /sys/kernel/debug file using ls command: [ 57.597146] WARNING: possible circular locking dependency detected [ 57.597154] 6.12.0-10553-gb86545e02e8c #20 Tainted: G W [ 57.597162] ------------------------------------------------------ [ 57.597168] ls/4605 is trying to acquire lock: [ 57.597176] c00000003eb56710 (&mm->mmap_lock){++++}-{4:4}, at: __might_fault+0x58/0xc0 [ 57.597200] but task is already holding lock: [ 57.597207] c0000018e27c6810 (&sb->s_type->i_mutex_key#3){++++}-{4:4}, at: iterate_dir+0x94/0x1d4 [ 57.597226] which lock already depends on the new lock. [ 57.597233] the existing dependency chain (in reverse order) is: [ 57.597241] -> #5 (&sb->s_type->i_mutex_key#3){++++}-{4:4}: [ 57.597255] down_write+0x6c/0x18c [ 57.597264] start_creating+0xb4/0x24c [ 57.597274] debugfs_create_dir+0x2c/0x1e8 [ 57.597283] blk_register_queue+0xec/0x294 [ 57.597292] add_disk_fwnode+0x2e4/0x548 [ 57.597302] brd_alloc+0x2c8/0x338 [ 57.597309] brd_init+0x100/0x178 [ 57.597317] do_one_initcall+0x88/0x3e4 [ 57.597326] kernel_init_freeable+0x3cc/0x6e0 [ 57.597334] kernel_init+0x34/0x1cc [ 57.597342] ret_from_kernel_user_thread+0x14/0x1c [ 57.597350] -> #4 (&q->debugfs_mutex){+.+.}-{4:4}: [ 57.597362] __mutex_lock+0xfc/0x12a0 [ 57.597370] blk_register_queue+0xd4/0x294 [ 57.597379] add_disk_fwnode+0x2e4/0x548 [ 57.597388] brd_alloc+0x2c8/0x338 [ 57.597395] brd_init+0x100/0x178 [ 57.597402] do_one_initcall+0x88/0x3e4 [ 57.597410] kernel_init_freeable+0x3cc/0x6e0 [ 57.597418] kernel_init+0x34/0x1cc [ 57.597426] ret_from_kernel_user_thread+0x14/0x1c [ 57.597434] -> #3 (&q->sysfs_lock){+.+.}-{4:4}: [ 57.597446] __mutex_lock+0xfc/0x12a0 [ 57.597454] queue_attr_store+0x9c/0x110 [ 57.597462] sysfs_kf_write+0x70/0xb0 [ 57.597471] kernfs_fop_write_iter+0x1b0/0x2ac [ 57.597480] vfs_write+0x3dc/0x6e8 [ 57.597488] ksys_write+0x84/0x140 [ 57.597495] system_call_exception+0x130/0x360 [ 57.597504] system_call_common+0x160/0x2c4 [ 57.597516] -> #2 (&q->q_usage_counter(io)#21){++++}-{0:0}: [ 57.597530] __submit_bio+0x5ec/0x828 [ 57.597538] submit_bio_noacct_nocheck+0x1e4/0x4f0 [ 57.597547] iomap_readahead+0x2a0/0x448 [ 57.597556] xfs_vm_readahead+0x28/0x3c [ 57.597564] read_pages+0x88/0x41c [ 57.597571] page_cache_ra_unbounded+0x1ac/0x2d8 [ 57.597580] filemap_get_pages+0x188/0x984 [ 57.597588] filemap_read+0x13c/0x4bc [ 57.597596] xfs_file_buffered_read+0x88/0x17c [ 57.597605] xfs_file_read_iter+0xac/0x158 [ 57.597614] vfs_read+0x2d4/0x3b4 [ 57.597622] ksys_read+0x84/0x144 [ 57.597629] system_call_exception+0x130/0x360 [ 57.597637] system_call_common+0x160/0x2c4 [ 57.597647] -> #1 (mapping.invalidate_lock#2){++++}-{4:4}: [ 57.597661] down_read+0x6c/0x220 [ 57.597669] filemap_fault+0x870/0x100c [ 57.597677] xfs_filemap_fault+0xc4/0x18c [ 57.597684] __do_fault+0x64/0x164 [ 57.597693] __handle_mm_fault+0x1274/0x1dac [ 57.597702] handle_mm_fault+0x248/0x484 [ 57.597711] ___do_page_fault+0x428/0xc0c [ 57.597719] hash__do_page_fault+0x30/0x68 [ 57.597727] do_hash_fault+0x90/0x35c [ 57.597736] data_access_common_virt+0x210/0x220 [ 57.597745] _copy_from_user+0xf8/0x19c [ 57.597754] sel_write_load+0x178/0xd54 [ 57.597762] vfs_write+0x108/0x6e8 [ 57.597769] ksys_write+0x84/0x140 [ 57.597777] system_call_exception+0x130/0x360 [ 57.597785] system_call_common+0x160/0x2c4 [ 57.597794] -> #0 (&mm->mmap_lock){++++}-{4:4}: [ 57.597806] __lock_acquire+0x17cc/0x2330 [ 57.597814] lock_acquire+0x138/0x400 [ 57.597822] __might_fault+0x7c/0xc0 [ 57.597830] filldir64+0xe8/0x390 [ 57.597839] dcache_readdir+0x80/0x2d4 [ 57.597846] iterate_dir+0xd8/0x1d4 [ 57.597855] sys_getdents64+0x88/0x2d4 [ 57.597864] system_call_exception+0x130/0x360 [ 57.597872] system_call_common+0x160/0x2c4 [ 57.597881] other info that might help us debug this: [ 57.597888] Chain exists of: &mm->mmap_lock --> &q->debugfs_mutex --> &sb->s_type->i_mutex_key#3 [ 57.597905] Possible unsafe locking scenario: [ 57.597911] CPU0 CPU1 [ 57.597917] ---- ---- [ 57.597922] rlock(&sb->s_type->i_mutex_key#3); [ 57.597932] lock(&q->debugfs_mutex); [ 57.597940] lock(&sb->s_type->i_mutex_key#3); [ 57.597950] rlock(&mm->mmap_lock); [ 57.597958] * DEADLOCK * [ 57.597965] 2 locks held by ls/4605: [ 57.597971] #0: c0000000137c12f8 (&f->f_pos_lock){+.+.}-{4:4}, at: fdget_pos+0xcc/0x154 [ 57.597989] #1: c0000018e27c6810 (&sb->s_type->i_mutex_key#3){++++}-{4:4}, at: iterate_dir+0x94/0x1d4 Prevent the above lockdep warning by acquiring ->sysfs_lock before freezing the queue while storing a queue attribute in queue_attr_store function. Later, we also found[1] another function __blk_mq_update_nr_ hw_queues where we first freeze queue and then acquire the ->sysfs_lock. So we've also updated lock ordering in __blk_mq_update_nr_hw_queues function and ensured that in all code paths we follow the correct lock ordering i.e. acquire ->sysfs_lock before freezing the queue. [1] https://lore.kernel.org/all/CAFj5m9Ke8+EHKQBs_Nk6hqd=LGXtk4mUxZUN5==ZcCjnZSBwHw@mail.gmail.com/ Reported-by: kjain@linux.ibm.com Fixes: `af28141498` ("block: freeze the queue in queue_attr_store") Tested-by: kjain@linux.ibm.com Cc: hch@lst.de Cc: axboe@kernel.dk Cc: ritesh.list@gmail.com Cc: ming.lei@redhat.com Cc: gjoyce@linux.ibm.com Signed-off-by: Nilay Shroff <nilay@linux.ibm.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20241210144222.1066229-1-nilay@linux.ibm.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-12-13 10:51:58 -07:00
Linus Torvalds	de20dc2b96	sound fixes for 6.13-rc3 A collection of fixes; all look small, device-specific and boring. -----BEGIN PGP SIGNATURE----- iQJCBAABCAAsFiEEIXTw5fNLNI7mMiVaLtJE4w1nLE8FAmda9xkOHHRpd2FpQHN1 c2UuZGUACgkQLtJE4w1nLE/kUA/+Jx7jW36zxRhPqRydtpmkRXGPOcsV1y6NZyD8 i0vjlW6oNHZZeOtdU4ap2CL8x6YMYnCzQhtFbZD1Lob2O2G2aUYi9tWtYWeetws1 OIB/Nj7xebJDtPl7yIrE5I9TRYGQKMT33LVopnFdihcVlQhSGmxIhJSm1C+n1Nsf gFGt2smUnaJ3dzaRXcUDaAcwoc4+DGJ4owpWJfnQae+klI+A6iiVSEoOcfoFL++q yp200Zb9Q49OUnDkqe2Nwjsqj3rSMulEtuF6S/NOiFqDR/Q6TDQfv+a1jaTPx58v OVtyeByMFV+usvq8qRPEgaAsQuWX+IukHoHEkLH3uUEt3DXAi5p2USl33bnabNv/ le+du0O1vJmYkZD72jPBtkKwDElnRs8w1BlMcrenAdjHtwfB+Wy+cW6bzugiV53K ykQV3jGEThPerXomR9etYrumsvBNW7SsaZpy+8T7WjTAS1IPHaOkitlHAI16Ejv4 F8o2SDm8pf2lBc9pGPZmOk3wJWLeLk1dn2q62wtRCD5RHFbC/Eh4JFn7hKq2rfyl am9c9fmfs+3BMHNuue8UJgsO4BV8+pUwRy5ZrG1sbUQ0ZOxqczSw+/+ehHozahBC i7DtlpJNBuOhRrZukAp3sGE0Mx6muoslZXo0CVqvFIJFgXR0BXx4Ymt8LLqOTBTl 1w0FBoE= =u9jP -----END PGP SIGNATURE----- Merge tag 'sound-6.13-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound Pull sound fixes from Takashi Iwai: "A collection of fixes; all look small, device-specific and boring" * tag 'sound-6.13-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound: ASoC: Intel: sof_sdw: Add space for a terminator into DAIs array ASoC: fsl_spdif: change IFACE_PCM to IFACE_MIXER ASoC: fsl_xcvr: change IFACE_PCM to IFACE_MIXER ASoC: tas2781: Fix calibration issue in stress test ASoC: audio-graph-card: Call of_node_put() on correct node ASoC: amd: yc: Fix the wrong return value ALSA: control: Avoid WARN() for symlink errors sound: usb: format: don't warn that raw DSD is unsupported sound: usb: enable DSD output for ddHiFi TC44C ALSA: hda/realtek: Add new alc2xx-fixup-headset-mic model ALSA: hda/ca0132: Use standard HD-audio quirk matching helpers ALSA: usb-audio: Add implicit feedback quirk for Yamaha THR5 ALSA: hda/realtek - Add support for ASUS Zen AIO 27 Z272SD_A272SD audio ALSA: hda/realtek: Fix headset mic on Acer Nitro 5 ALSA: hda: cs35l56: Remove calls to cs35l56_force_sync_asp1_registers_from_cache()	2024-12-13 09:51:40 -08:00
Linus Torvalds	a3170b7d93	A single fix for a docs-build regression caused by the EXPORT_SYMBOL_NS() mass change. -----BEGIN PGP SIGNATURE----- iQFDBAABCAAtFiEEIw+MvkEiF49krdp9F0NaE2wMflgFAmdcUaMPHGNvcmJldEBs d24ubmV0AAoJEBdDWhNsDH5YHocH+gJMedy9MoAzDORi1D22qRBYdYcjSkE1vFcu lNNoxDckzu1OnycWFfdkKG4dZct8G9JoSec3sxoofqRlHCvayrfeRanLveY7kLMv p5KK1HZSe9G7C8HXnJGoiRzmTiBzYFI8NWtCb/S1XplwzoTUeorbKSiV4EmyGdcS Dm4Hx0PK9HIij8klL9KN1QKWhpWFk5qs8Js8fJIIUZiV0T1B9VDMYAK8LjPdH7lG gKPTHT21OGF7AuIgLQBrqHwutBKSbnrfHqfoRFY/dyZQS8PXtqmMuNepCqsPAQQ0 vNuAbdAihsaFSof69Sdx8s4uw7PoTDZIf/GoWrB+B9Q9MLh75G8= =HZYa -----END PGP SIGNATURE----- Merge tag 'docs-6.13-fix' of git://git.lwn.net/linux Pull documentation fix from Jonathan Corbet: "A single fix for a docs-build regression caused by the EXPORT_SYMBOL_NS() mass change" * tag 'docs-6.13-fix' of git://git.lwn.net/linux: scripts/kernel-doc: Get -export option working again	2024-12-13 09:46:02 -08:00
Linus Torvalds	266facde83	slab fixes for 6.13-rc3 -----BEGIN PGP SIGNATURE----- iQEzBAABCAAdFiEEe7vIQRWZI0iWSE3xu+CwddJFiJoFAmdb/84ACgkQu+CwddJF iJqEhQf+KHXQfZ/m7ma6bUEisjVOkUsxnpSj/DM1CP03+czqXoSEr+CHX2g1v0Os pZSOGGPTtvDgSBO0UZgo1P+sXyl0hrs2A/XrVQfKPmTfc9mPoxFTkbhI7Rv9hZfi 9OVh9kg3OE2hixD9ussGmxtFTXE3vXeu9zASa2sEdpakc3gxgeNhIklGK9r3oQuF 4aa33JpRV/oF80XAAUch5ltAO0tqQWtN50CbnoKc5cGRDuFRRdzXMMiL7gGiCctI te3sVMc6mnusdJqvp+AE10aT0iQpRWwL1lzGrgp3PuZ7DN8UQHxBIf8u8VIjyPML D2JZEYmqykKSvXbDzYBdOAF8XLEYCw== =PpHb -----END PGP SIGNATURE----- Merge tag 'slab-for-6.13-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab Pull slab fix from Vlastimil Babka: - Fix for memcg unreclaimable slab stats drift when post-charging large kmalloc allocations (Shakeel Butt) * tag 'slab-for-6.13-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab: memcg: slub: fix SUnreclaim for post charged objects	2024-12-13 09:43:50 -08:00
Marc Zyngier	773c05f417	irqchip/gic-v3: Work around insecure GIC integrations It appears that the relatively popular RK3399 SoC has been put together using a large amount of illicit substances, as experiments reveal that its integration of GIC500 exposes the secure programming interface to non-secure. This has some pretty bad effects on the way priorities are handled, and results in a dead machine if booting with pseudo-NMI enabled (irqchip.gicv3_pseudo_nmi=1) if the kernel contains `18fdb6348c` ("arm64: irqchip/gic-v3: Select priorities at boot time"), which relies on the priorities being programmed using the NS view. Let's restore some sanity by going one step further and disable security altogether in this case. This is not any worse, and puts us in a mode where priorities actually make some sense. Huge thanks to Mark Kettenis who initially identified this issue on OpenBSD, and to Chen-Yu Tsai who reported the problem in Linux. Fixes: `18fdb6348c` ("arm64: irqchip/gic-v3: Select priorities at boot time") Reported-by: Mark Kettenis <mark.kettenis@xs4all.nl> Reported-by: Chen-Yu Tsai <wens@csie.org> Signed-off-by: Marc Zyngier <maz@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Chen-Yu Tsai <wens@csie.org> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/all/20241213141037.3995049-1-maz@kernel.org	2024-12-13 18:15:29 +01:00
Uros Bizjak	a1855f1b7c	irqchip/gic: Correct declaration of percpu_base pointer in union gic_base percpu_base is used in various percpu functions that expect variable in __percpu address space. Correct the declaration of percpu_base to void __iomem __percpu percpu_base; to declare the variable as __percpu pointer. The patch fixes several sparse warnings: irq-gic.c:1172:44: warning: incorrect type in assignment (different address spaces) irq-gic.c:1172:44: expected void [noderef] __percpu [noderef] __iomem percpu_base irq-gic.c:1172:44: got void [noderef] __iomem [noderef] __percpu * ... irq-gic.c:1231:43: warning: incorrect type in argument 1 (different address spaces) irq-gic.c:1231:43: expected void [noderef] __percpu __pdata irq-gic.c:1231:43: got void [noderef] __percpu [noderef] __iomem *percpu_base There were no changes in the resulting object files. Signed-off-by: Uros Bizjak <ubizjak@gmail.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/all/20241213145809.2918-2-ubizjak@gmail.com	2024-12-13 18:11:52 +01:00
Bart Van Assche	a6fe7b7051	block: Fix queue_iostats_passthrough_show() Make queue_iostats_passthrough_show() report 0/1 in sysfs instead of 0/4. This patch fixes the following sparse warning: block/blk-sysfs.c:266:31: warning: incorrect type in argument 1 (different base types) block/blk-sysfs.c:266:31: expected unsigned long var block/blk-sysfs.c:266:31: got restricted blk_flags_t Cc: Keith Busch <kbusch@kernel.org> Cc: Christoph Hellwig <hch@lst.de> Fixes: `110234da18` ("block: enable passthrough command statistics") Signed-off-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20241212212941.1268662-4-bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-12-13 08:09:43 -07:00
Bart Van Assche	312ccd4b75	blk-mq: Clean up blk_mq_requeue_work() Move a statement that occurs in both branches of an if-statement in front of the if-statement. Fix a typo in a source code comment. No functionality has been changed. Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Nitesh Shetty <nj.shetty@samsung.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20241212212941.1268662-3-bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-12-13 08:09:43 -07:00
Bart Van Assche	e01424fab3	mq-deadline: Remove a local variable Since commit `fde02699c2` ("block: mq-deadline: Remove support for zone write locking"), the local variable 'insert_before' is assigned once and is used once. Hence remove this local variable. Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Nitesh Shetty <nj.shetty@samsung.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20241212212941.1268662-2-bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-12-13 08:09:43 -07:00
Weizhao Ouyang	ce03573a19	kselftest/arm64: abi: fix SVCR detection When using svcr_in to check ZA and Streaming Mode, we should make sure that the value in x2 is correct, otherwise it may trigger an Illegal instruction if FEAT_SVE and !FEAT_SME. Fixes: `43e3f85523` ("kselftest/arm64: Add SME support to syscall ABI test") Signed-off-by: Weizhao Ouyang <o451686892@gmail.com> Reviewed-by: Mark Brown <broonie@kernel.org> Link: https://lore.kernel.org/r/20241211111639.12344-1-o451686892@gmail.com Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>	2024-12-13 14:59:37 +00:00
Lu Baolu	dda2b8c3c6	iommu/vt-d: Avoid draining PRQ in sva mm release path When a PASID is used for SVA by a device, it's possible that the PASID entry is cleared before the device flushes all ongoing DMA requests and removes the SVA domain. This can occur when an exception happens and the process terminates before the device driver stops DMA and calls the iommu driver to unbind the PASID. There's no need to drain the PRQ in the mm release path. Instead, the PRQ will be drained in the SVA unbind path. Unfortunately, commit `c43e1ccdeb` ("iommu/vt-d: Drain PRQs when domain removed from RID") changed this behavior by unconditionally draining the PRQ in intel_pasid_tear_down_entry(). This can lead to a potential sleeping-in-atomic-context issue. Smatch static checker warning: drivers/iommu/intel/prq.c:95 intel_iommu_drain_pasid_prq() warn: sleeping in atomic context To avoid this issue, prevent draining the PRQ in the SVA mm release path and restore the previous behavior. Fixes: `c43e1ccdeb` ("iommu/vt-d: Drain PRQs when domain removed from RID") Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Closes: https://lore.kernel.org/linux-iommu/c5187676-2fa2-4e29-94e0-4a279dc88b49@stanley.mountain/ Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Link: https://lore.kernel.org/r/20241212021529.1104745-1-baolu.lu@linux.intel.com Signed-off-by: Joerg Roedel <jroedel@suse.de>	2024-12-13 15:54:27 +01:00
Yi Liu	74536f9196	iommu/vt-d: Fix qi_batch NULL pointer with nested parent domain The qi_batch is allocated when assigning cache tag for a domain. While for nested parent domain, it is missed. Hence, when trying to map pages to the nested parent, NULL dereference occurred. Also, there is potential memleak since there is no lock around domain->qi_batch allocation. To solve it, add a helper for qi_batch allocation, and call it in both the __cache_tag_assign_domain() and __cache_tag_assign_parent_domain(). BUG: kernel NULL pointer dereference, address: 0000000000000200 #PF: supervisor read access in kernel mode #PF: error_code(0x0000) - not-present page PGD 8104795067 P4D 0 Oops: Oops: 0000 [#1] PREEMPT SMP NOPTI CPU: 223 UID: 0 PID: 4357 Comm: qemu-system-x86 Not tainted 6.13.0-rc1-00028-g4b50c3c3b998-dirty #2632 Call Trace: ? __die+0x24/0x70 ? page_fault_oops+0x80/0x150 ? do_user_addr_fault+0x63/0x7b0 ? exc_page_fault+0x7c/0x220 ? asm_exc_page_fault+0x26/0x30 ? cache_tag_flush_range_np+0x13c/0x260 intel_iommu_iotlb_sync_map+0x1a/0x30 iommu_map+0x61/0xf0 batch_to_domain+0x188/0x250 iopt_area_fill_domains+0x125/0x320 ? rcu_is_watching+0x11/0x50 iopt_map_pages+0x63/0x100 iopt_map_common.isra.0+0xa7/0x190 iopt_map_user_pages+0x6a/0x80 iommufd_ioas_map+0xcd/0x1d0 iommufd_fops_ioctl+0x118/0x1c0 __x64_sys_ioctl+0x93/0xc0 do_syscall_64+0x71/0x140 entry_SYSCALL_64_after_hwframe+0x76/0x7e Fixes: `705c1cdf1e` ("iommu/vt-d: Introduce batched cache invalidation") Cc: stable@vger.kernel.org Co-developed-by: Lu Baolu <baolu.lu@linux.intel.com> Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com> Signed-off-by: Yi Liu <yi.l.liu@intel.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Link: https://lore.kernel.org/r/20241210130322.17175-1-yi.l.liu@intel.com Signed-off-by: Joerg Roedel <jroedel@suse.de>	2024-12-13 15:54:25 +01:00
Lu Baolu	1f2557e08a	iommu/vt-d: Remove cache tags before disabling ATS The current implementation removes cache tags after disabling ATS, leading to potential memory leaks and kernel crashes. Specifically, CACHE_TAG_DEVTLB type cache tags may still remain in the list even after the domain is freed, causing a use-after-free condition. This issue really shows up when multiple VFs from different PFs passed through to a single user-space process via vfio-pci. In such cases, the kernel may crash with kernel messages like: BUG: kernel NULL pointer dereference, address: 0000000000000014 PGD 19036a067 P4D 1940a3067 PUD 136c9b067 PMD 0 Oops: Oops: 0000 [#1] PREEMPT SMP NOPTI CPU: 74 UID: 0 PID: 3183 Comm: testCli Not tainted 6.11.9 #2 RIP: 0010:cache_tag_flush_range+0x9b/0x250 Call Trace: <TASK> ? __die+0x1f/0x60 ? page_fault_oops+0x163/0x590 ? exc_page_fault+0x72/0x190 ? asm_exc_page_fault+0x22/0x30 ? cache_tag_flush_range+0x9b/0x250 ? cache_tag_flush_range+0x5d/0x250 intel_iommu_tlb_sync+0x29/0x40 intel_iommu_unmap_pages+0xfe/0x160 __iommu_unmap+0xd8/0x1a0 vfio_unmap_unpin+0x182/0x340 [vfio_iommu_type1] vfio_remove_dma+0x2a/0xb0 [vfio_iommu_type1] vfio_iommu_type1_ioctl+0xafa/0x18e0 [vfio_iommu_type1] Move cache_tag_unassign_domain() before iommu_disable_pci_caps() to fix it. Fixes: `3b1d9e2b2d` ("iommu/vt-d: Add cache tag assignment interface") Cc: stable@vger.kernel.org Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Link: https://lore.kernel.org/r/20241129020506.576413-1-baolu.lu@linux.intel.com Signed-off-by: Joerg Roedel <jroedel@suse.de>	2024-12-13 15:54:24 +01:00
Kevin Brodsky	a3b4647e2f	arm64: signal: Ensure signal delivery failure is recoverable Commit `eaf62ce156` ("arm64/signal: Set up and restore the GCS context for signal handlers") introduced a potential failure point at the end of setup_return(). This is unfortunate as it is too late to deliver a SIGSEGV: if that SIGSEGV is handled, the subsequent sigreturn will end up returning to the original handler, which is not the intention (since we failed to deliver that signal). Make sure this does not happen by calling gcs_signal_entry() at the very beginning of setup_return(), and add a comment just after to discourage error cases being introduced from that point onwards. While at it, also take care of copy_siginfo_to_user(): since it may fail, we shouldn't be calling it after setup_return() either. Call it before setup_return() instead, and move the setting of X1/X2 inside setup_return() where it belongs (after the "point of no failure"). Background: the first part of setup_rt_frame(), including setup_sigframe(), has no impact on the execution of the interrupted thread. The signal frame is written to the stack, but the stack pointer remains unchanged. Failure at this stage can be recovered by a SIGSEGV handler, and sigreturn will restore the original context, at the point where the original signal occurred. On the other hand, once setup_return() has updated registers including SP, the thread's control flow has been modified and we must deliver the original signal. Fixes: `eaf62ce156` ("arm64/signal: Set up and restore the GCS context for signal handlers") Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com> Reviewed-by: Dave Martin <Dave.Martin@arm.com> Reviewed-by: Mark Brown <broonie@kernel.org> Link: https://lore.kernel.org/r/20241210160940.2031997-1-kevin.brodsky@arm.com Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>	2024-12-13 14:13:27 +00:00
Vineeth Pillai (Google)	c7f7e9c731	sched/dlserver: Fix dlserver time accounting dlserver time is accounted when: - dlserver is active and the dlserver proxies the cfs task. - dlserver is active but deferred and cfs task runs after being picked through the normal fair class pick. dl_server_update is called in two places to make sure that both the above times are accounted for. But it doesn't check if dlserver is active or not. Now that we have this dl_server_active flag, we can consolidate dl_server_update into one place and all we need to check is whether dlserver is active or not. When dlserver is active there is only two possible conditions: - dlserver is deferred. - cfs task is running on behalf of dlserver. Fixes: `a110a81c52` ("sched/deadline: Deferrable dl server") Signed-off-by: "Vineeth Pillai (Google)" <vineeth@bitbyteword.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Marcel Ziswiler <marcel.ziswiler@codethink.co.uk> # ROCK 5B Link: https://lore.kernel.org/r/20241213032244.877029-2-vineeth@bitbyteword.org	2024-12-13 12:57:35 +01:00
Vineeth Pillai (Google)	b53127db1d	sched/dlserver: Fix dlserver double enqueue dlserver can get dequeued during a dlserver pick_task due to the delayed deueue feature and this can lead to issues with dlserver logic as it still thinks that dlserver is on the runqueue. The dlserver throttling and replenish logic gets confused and can lead to double enqueue of dlserver. Double enqueue of dlserver could happend due to couple of reasons: Case 1 ------ Delayed dequeue feature[1] can cause dlserver being stopped during a pick initiated by dlserver: __pick_next_task pick_task_dl -> server_pick_task pick_task_fair pick_next_entity (if (sched_delayed)) dequeue_entities dl_server_stop server_pick_task goes ahead with update_curr_dl_se without knowing that dlserver is dequeued and this confuses the logic and may lead to unintended enqueue while the server is stopped. Case 2 ------ A race condition between a task dequeue on one cpu and same task's enqueue on this cpu by a remote cpu while the lock is released causing dlserver double enqueue. One cpu would be in the schedule() and releasing RQ-lock: current->state = TASK_INTERRUPTIBLE(); schedule(); deactivate_task() dl_stop_server(); pick_next_task() pick_next_task_fair() sched_balance_newidle() rq_unlock(this_rq) at which point another CPU can take our RQ-lock and do: try_to_wake_up() ttwu_queue() rq_lock() ... activate_task() dl_server_start() --> first enqueue wakeup_preempt() := check_preempt_wakeup_fair() update_curr() update_curr_task() if (current->dl_server) dl_server_update() enqueue_dl_entity() --> second enqueue This bug was not apparent as the enqueue in dl_server_start doesn't usually happen because of the defer logic. But as a side effect of the first case(dequeue during dlserver pick), dl_throttled and dl_yield will be set and this causes the time accounting of dlserver to messup and then leading to a enqueue in dl_server_start. Have an explicit flag representing the status of dlserver to avoid the confusion. This is set in dl_server_start and reset in dlserver_stop. Fixes: `63ba8422f8` ("sched/deadline: Introduce deadline servers") Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: "Vineeth Pillai (Google)" <vineeth@bitbyteword.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Marcel Ziswiler <marcel.ziswiler@codethink.co.uk> # ROCK 5B Link: https://lkml.kernel.org/r/20241213032244.877029-1-vineeth@bitbyteword.org	2024-12-13 12:57:34 +01:00
Jiri Slaby (SUSE)	145ac100b6	efi/esrt: remove esre_attribute::store() esre_attribute::store() is not needed since commit `af97a77bc0` (efi: Move some sysfs files to be read-only by root). Drop it. Found by https://github.com/jirislaby/clang-struct. Signed-off-by: Jiri Slaby (SUSE) <jirislaby@kernel.org> Cc: Ard Biesheuvel <ardb@kernel.org> Cc: linux-efi@vger.kernel.org Signed-off-by: Ard Biesheuvel <ardb@kernel.org>	2024-12-13 08:43:58 +01:00
Carlos Maiolino	bf354410af	xfs: bug fixes for 6.13 [01/12] Bug fixes for 6.13. This has been running on the djcloud for months with no problems. Enjoy! Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQQ2qTKExjcn+O1o2YRKO3ySh0YRpgUCZ1uRzwAKCRBKO3ySh0YR poYjAP0YxGr59OFEmdu9fZzLRQoARjchlqMmYiMOokbXxqGfhgD/Wo7Er+Dpj4KE jIvDWUy8anoKuE2pvcRVBYyYaPoTNgY= =i+9a -----END PGP SIGNATURE----- Merge tag 'xfs-6.13-fixes_2024-12-12' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into next-rc xfs: bug fixes for 6.13 [01/12] Bug fixes for 6.13. This has been running on the djcloud for months with no problems. Enjoy! Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2024-12-13 07:47:12 +01:00
Darrick J. Wong	12f2930f5f	xfs: port xfs_ioc_start_commit to multigrain timestamps Take advantage of the multigrain timestamp APIs to ensure that nobody can sneak in and write things to a file between starting a file update operation and committing the results. This should have been part of the multigrain timestamp merge, but I forgot to fling it at jlayton when he resubmitted the patchset due to developer bandwidth problems. Cc: <stable@vger.kernel.org> # v6.13-rc1 Fixes: `4e40eff0b5` ("fs: add infrastructure for multigrain timestamps") Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jeff Layton <jlayton@kernel.org>	2024-12-12 17:45:13 -08:00
Darrick J. Wong	7f8b718c58	xfs: return from xfs_symlink_verify early on V4 filesystems V4 symlink blocks didn't have headers, so return early if this is a V4 filesystem. Cc: <stable@vger.kernel.org> # v5.1 Fixes: `39708c20ab` ("xfs: miscellaneous verifier magic value fixups") Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-12-12 17:45:13 -08:00
Darrick J. Wong	c004a793e0	xfs: fix zero byte checking in the superblock scrubber The logic to check that the region past the end of the superblock is all zeroes is wrong -- we don't want to check only the bytes past the end of the maximally sized ondisk superblock structure as currently defined in xfs_format.h; we want to check the bytes beyond the end of the ondisk as defined by the feature bits. Port the superblock size logic from xfs_repair and then put it to use in xfs_scrub. Cc: <stable@vger.kernel.org> # v4.15 Fixes: `21fb4cb198` ("xfs: scrub the secondary superblocks") Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-12-12 17:45:13 -08:00
Darrick J. Wong	06b20ef09b	xfs: check pre-metadir fields correctly The checks that were added to the superblock scrubber for metadata directories aren't quite right -- the old inode pointers are now defined to be zeroes until someone else reuses them. Also consolidate the new metadir field checks to one place; they were inexplicably scattered around. Cc: <stable@vger.kernel.org> # v6.13-rc1 Fixes: `28d756d4d5` ("xfs: update sb field checks when metadir is turned on") Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-12-12 17:45:12 -08:00
Darrick J. Wong	e57e083be9	xfs: don't crash on corrupt /quotas dirent If the /quotas dirent points to an inode but the inode isn't loadable (and hence mkdir returns -EEXIST), don't crash, just bail out. Cc: <stable@vger.kernel.org> # v6.13-rc1 Fixes: `e80fbe1ad8` ("xfs: use metadir for quota inodes") Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-12-12 17:45:12 -08:00
Darrick J. Wong	3853b5e1d7	xfs: don't move nondir/nonreg temporary repair files to the metadir namespace Only directories or regular files are allowed in the metadata directory tree. Don't move the repair tempfile to the metadir namespace if this is not true; this will cause the inode verifiers to trip. xrep_tempfile_adjust_directory_tree opportunistically moves sc->tempip from the regular directory tree to the metadata directory tree if sc->ip is part of the metadata directory tree. However, the scrub setup functions grab sc->ip and create sc->tempip before we actually get around to checking if the file mode is the right type for the scrubber. IOWs, you can invoke the symlink scrubber with the file handle of a subdirectory in the metadir. xrep_setup_symlink will create a temporary symlink file, xrep_tempfile_adjust_directory_tree will foolishly try to set the METADATA flag on the temp symlink, which trips the inode verifier in the inode item precommit, which shuts down the filesystem when expensive checks are turned on. If they're /not/ turned on, then xchk_symlink will return ENOENT when it sees that it's been passed a symlink, but the invalid inode could still get flushed to disk. We don't want that. Cc: <stable@vger.kernel.org> # v6.13-rc1 Fixes: `9dc31acb01` ("xfs: move repair temporary files to the metadata directory tree") Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-12-12 17:45:12 -08:00
Darrick J. Wong	7f8a44f372	xfs: fix sb_spino_align checks for large fsblock sizes For a sparse inodes filesystem, mkfs.xfs computes the values of sb_spino_align and sb_inoalignmt with the following code: int cluster_size = XFS_INODE_BIG_CLUSTER_SIZE; if (cfg->sb_feat.crcs_enabled) cluster_size = cfg->inodesize / XFS_DINODE_MIN_SIZE; sbp->sb_spino_align = cluster_size >> cfg->blocklog; sbp->sb_inoalignmt = XFS_INODES_PER_CHUNK cfg->inodesize >> cfg->blocklog; On a V5 filesystem with 64k fsblocks and 512 byte inodes, this results in cluster_size = 8192 * (512 / 256) = 16384. As a result, sb_spino_align and sb_inoalignmt are both set to zero. Unfortunately, this trips the new sb_spino_align check that was just added to xfs_validate_sb_common, and the mkfs fails: # mkfs.xfs -f -b size=64k, /dev/sda meta-data=/dev/sda isize=512 agcount=4, agsize=81136 blks = sectsz=512 attr=2, projid32bit=1 = crc=1 finobt=1, sparse=1, rmapbt=1 = reflink=1 bigtime=1 inobtcount=1 nrext64=1 = exchange=0 metadir=0 data = bsize=65536 blocks=324544, imaxpct=25 = sunit=0 swidth=0 blks naming =version 2 bsize=65536 ascii-ci=0, ftype=1, parent=0 log =internal log bsize=65536 blocks=5006, version=2 = sectsz=512 sunit=0 blks, lazy-count=1 realtime =none extsz=65536 blocks=0, rtextents=0 = rgcount=0 rgsize=0 extents Discarding blocks...Sparse inode alignment (0) is invalid. Metadata corruption detected at 0x560ac5a80bbe, xfs_sb block 0x0/0x200 libxfs_bwrite: write verifier failed on xfs_sb bno 0x0/0x1 mkfs.xfs: Releasing dirty buffer to free list! found dirty buffer (bulk) on free list! Sparse inode alignment (0) is invalid. Metadata corruption detected at 0x560ac5a80bbe, xfs_sb block 0x0/0x200 libxfs_bwrite: write verifier failed on xfs_sb bno 0x0/0x1 mkfs.xfs: writing AG headers failed, err=22 Prior to commit `59e43f5479` this all worked fine, even if "sparse" inodes are somewhat meaningless when everything fits in a single fsblock. Adjust the checks to handle existing filesystems. Cc: <stable@vger.kernel.org> # v6.13-rc1 Fixes: `59e43f5479` ("xfs: sb_spino_align is not verified") Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-12-12 17:45:12 -08:00
Darrick J. Wong	ca378189fd	xfs: convert quotacheck to attach dquot buffers Now that we've converted the dquot logging machinery to attach the dquot buffer to the li_buf pointer so that the AIL dqflush doesn't have to allocate or read buffers in a reclaim path, do the same for the quotacheck code so that the reclaim shrinker dqflush call doesn't have to do that either. Cc: <stable@vger.kernel.org> # v6.12 Fixes: `903edea6c5` ("mm: warn about illegal __GFP_NOFAIL usage in a more appropriate location and manner") Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-12-12 17:45:12 -08:00
Darrick J. Wong	acc8f8628c	xfs: attach dquot buffer to dquot log item buffer Ever since 6.12-rc1, I've observed a pile of warnings from the kernel when running fstests with quotas enabled: WARNING: CPU: 1 PID: 458580 at mm/page_alloc.c:4221 __alloc_pages_noprof+0xc9c/0xf18 CPU: 1 UID: 0 PID: 458580 Comm: xfsaild/sda3 Tainted: G W 6.12.0-rc6-djwa #rc6 6ee3e0e531f6457e2d26aa008a3b65ff184b377c <snip> Call trace: __alloc_pages_noprof+0xc9c/0xf18 alloc_pages_mpol_noprof+0x94/0x240 alloc_pages_noprof+0x68/0xf8 new_slab+0x3e0/0x568 ___slab_alloc+0x5a0/0xb88 __slab_alloc.constprop.0+0x7c/0xf8 __kmalloc_noprof+0x404/0x4d0 xfs_buf_get_map+0x594/0xde0 [xfs 384cb02810558b4c490343c164e9407332118f88] xfs_buf_read_map+0x64/0x2e0 [xfs 384cb02810558b4c490343c164e9407332118f88] xfs_trans_read_buf_map+0x1dc/0x518 [xfs 384cb02810558b4c490343c164e9407332118f88] xfs_qm_dqflush+0xac/0x468 [xfs 384cb02810558b4c490343c164e9407332118f88] xfs_qm_dquot_logitem_push+0xe4/0x148 [xfs 384cb02810558b4c490343c164e9407332118f88] xfsaild+0x3f4/0xde8 [xfs 384cb02810558b4c490343c164e9407332118f88] kthread+0x110/0x128 ret_from_fork+0x10/0x20 ---[ end trace 0000000000000000 ]--- This corresponds to the line: WARN_ON_ONCE(current->flags & PF_MEMALLOC); within the NOFAIL checks. What's happening here is that the XFS AIL is trying to write a disk quota update back into the filesystem, but for that it needs to read the ondisk buffer for the dquot. The buffer is not in memory anymore, probably because it was evicted. Regardless, the buffer cache tries to allocate a new buffer, but those allocations are NOFAIL. The AIL thread has marked itself PF_MEMALLOC (aka noreclaim) since commit `43ff2122e6` ("xfs: on-stack delayed write buffer lists") presumably because reclaim can push on XFS to push on the AIL. An easy way to fix this probably would have been to drop the NOFAIL flag from the xfs_buf allocation and open code a retry loop, but then there's still the problem that for bs>ps filesystems, the buffer itself could require up to 64k worth of pages. Inode items had similar behavior (multi-page cluster buffers that we don't want to allocate in the AIL) which we solved by making transaction precommit attach the inode cluster buffers to the dirty log item. Let's solve the dquot problem in the same way. So: Make a real precommit handler to read the dquot buffer and attach it to the log item; pass it to dqflush in the push method; and have the iodone function detach the buffer once we've flushed everything. Add a state flag to the log item to track when a thread has entered the precommit -> push mechanism to skip the detaching if it turns out that the dquot is very busy, as we don't hold the dquot lock between log item commit and AIL push). Reading and attaching the dquot buffer in the precommit hook is inspired by the work done for inode cluster buffers some time ago. Cc: <stable@vger.kernel.org> # v6.12 Fixes: `903edea6c5` ("mm: warn about illegal __GFP_NOFAIL usage in a more appropriate location and manner") Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-12-12 17:45:11 -08:00
Darrick J. Wong	ec88b41b93	xfs: clean up log item accesses in xfs_qm_dqflush{,_done} Clean up these functions a little bit before we move on to the real modifications, and make the variable naming consistent for dquot log items. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-12-12 17:45:11 -08:00
Darrick J. Wong	a40fe30868	xfs: separate dquot buffer reads from xfs_dqflush The first step towards holding the dquot buffer in the li_buf instead of reading it in the AIL is to separate the part that reads the buffer from the actual flush code. There should be no functional changes. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-12-12 17:45:11 -08:00
Darrick J. Wong	07137e925f	xfs: don't lose solo dquot update transactions Quota counter updates are tracked via incore objects which hang off the xfs_trans object. These changes are then turned into dirty log items in xfs_trans_apply_dquot_deltas just prior to commiting the log items to the CIL. However, updating the incore deltas do not cause XFS_TRANS_DIRTY to be set on the transaction. In other words, a pure quota counter update will be silently discarded if there are no other dirty log items attached to the transaction. This is currently not the case anywhere in the filesystem because quota updates always dirty at least one other metadata item, but a subsequent bug fix will add dquot log item precommits, so we actually need a dirty dquot log item prior to xfs_trans_run_precommits. Also let's not leave a logic bomb. Cc: <stable@vger.kernel.org> # v2.6.35 Fixes: `0924378a68` ("xfs: split out iclog writing from xfs_trans_commit()") Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-12-12 17:45:11 -08:00
Darrick J. Wong	3762113b59	xfs: don't lose solo superblock counter update transactions Superblock counter updates are tracked via per-transaction counters in the xfs_trans object. These changes are then turned into dirty log items in xfs_trans_apply_sb_deltas just prior to commiting the log items to the CIL. However, updating the per-transaction counter deltas do not cause XFS_TRANS_DIRTY to be set on the transaction. In other words, a pure sb counter update will be silently discarded if there are no other dirty log items attached to the transaction. This is currently not the case anywhere in the filesystem because sb counter updates always dirty at least one other metadata item, but let's not leave a logic bomb. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-12-12 17:45:11 -08:00
Darrick J. Wong	a004afdc62	xfs: avoid nested calls to __xfs_trans_commit Currently, __xfs_trans_commit calls xfs_defer_finish_noroll, which calls __xfs_trans_commit again on the same transaction. In other words, there's a nested function call (albeit with slightly different arguments) that has caused minor amounts of confusion in the past. There's no reason to keep this around, since there's only one place where we actually want the xfs_defer_finish_noroll, and that is in the top level xfs_trans_commit call. This also reduces stack usage a little bit. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-12-12 17:45:11 -08:00
Darrick J. Wong	44d9b07e52	xfs: only run precommits once per transaction object Committing a transaction tx0 with a defer ops chain of (A, B, C) creates a chain of transactions that looks like this: tx0 -> txA -> txB -> txC Prior to commit `cb04211748`, __xfs_trans_commit would run precommits on tx0, then call xfs_defer_finish_noroll to convert A-C to tx[A-C]. Unfortunately, after the finish_noroll loop we forgot to run precommits on txC. That was fixed by adding the second precommit call. Unfortunately, none of us remembered that xfs_defer_finish_noroll calls __xfs_trans_commit a second time to commit tx0 before finishing work A in txA and committing that. In other words, we run precommits twice on tx0: xfs_trans_commit(tx0) __xfs_trans_commit(tx0, false) xfs_trans_run_precommits(tx0) xfs_defer_finish_noroll(tx0) xfs_trans_roll(tx0) txA = xfs_trans_dup(tx0) __xfs_trans_commit(tx0, true) xfs_trans_run_precommits(tx0) This currently isn't an issue because the inode item precommit is idempotent; the iunlink item precommit deletes itself so it can't be called again; and the buffer/dquot item precommits only check the incore objects for corruption. However, it doesn't make sense to run precommits twice. Fix this situation by only running precommits after finish_noroll. Cc: <stable@vger.kernel.org> # v6.4 Fixes: `cb04211748` ("xfs: defered work could create precommits") Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-12-12 17:45:10 -08:00
Darrick J. Wong	53b001a21c	xfs: unlock inodes when erroring out of xfs_trans_alloc_dir Debugging a filesystem patch with generic/475 caused the system to hang after observing the following sequences in dmesg: XFS (dm-0): metadata I/O error in "xfs_imap_to_bp+0x61/0xe0 [xfs]" at daddr 0x491520 len 32 error 5 XFS (dm-0): metadata I/O error in "xfs_btree_read_buf_block+0xba/0x160 [xfs]" at daddr 0x3445608 len 8 error 5 XFS (dm-0): metadata I/O error in "xfs_imap_to_bp+0x61/0xe0 [xfs]" at daddr 0x138e1c0 len 32 error 5 XFS (dm-0): log I/O error -5 XFS (dm-0): Metadata I/O Error (0x1) detected at xfs_trans_read_buf_map+0x1ea/0x4b0 [xfs] (fs/xfs/xfs_trans_buf.c:311). Shutting down filesystem. XFS (dm-0): Please unmount the filesystem and rectify the problem(s) XFS (dm-0): Internal error dqp->q_ino.reserved < dqp->q_ino.count at line 869 of file fs/xfs/xfs_trans_dquot.c. Caller xfs_trans_dqresv+0x236/0x440 [xfs] XFS (dm-0): Corruption detected. Unmount and run xfs_repair XFS (dm-0): Unmounting Filesystem be6bcbcc-9921-4deb-8d16-7cc94e335fa7 The system is stuck in unmount trying to lock a couple of inodes so that they can be purged. The dquot corruption notice above is a clue to what happened -- a link() call tried to set up a transaction to link a child into a directory. Quota reservation for the transaction failed after IO errors shut down the filesystem, but then we forgot to unlock the inodes on our way out. Fix that. Cc: <stable@vger.kernel.org> # v6.10 Fixes: `bd5562111d` ("xfs: Hold inode locks in xfs_trans_alloc_dir") Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-12-12 17:45:10 -08:00
Darrick J. Wong	ffc3ea4f3c	xfs: fix scrub tracepoints when inode-rooted btrees are involved Fix a minor mistakes in the scrub tracepoints that can manifest when inode-rooted btrees are enabled. The existing code worked fine for bmap btrees, but we should tighten the code up to be less sloppy. Cc: <stable@vger.kernel.org> # v5.7 Fixes: `92219c292a` ("xfs: convert btree cursor inode-private member names") Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-12-12 17:45:10 -08:00
Darrick J. Wong	6d7b4bc1c3	xfs: update btree keys correctly when _insrec splits an inode root block In commit `2c813ad66a`, I partially fixed a bug wherein xfs_btree_insrec would erroneously try to update the parent's key for a block that had been split if we decided to insert the new record into the new block. The solution was to detect this situation and update the in-core key value that we pass up to the caller so that the caller will (eventually) add the new block to the parent level of the tree with the correct key. However, I missed a subtlety about the way inode-rooted btrees work. If the full block was a maximally sized inode root block, we'll solve that fullness by moving the root block's records to a new block, resizing the root block, and updating the root to point to the new block. We don't pass a pointer to the new block to the caller because that work has already been done. The new record will /always/ land in the new block, so in this case we need to use xfs_btree_update_keys to update the keys. This bug can theoretically manifest itself in the very rare case that we split a bmbt root block and the new record lands in the very first slot of the new block, though I've never managed to trigger it in practice. However, it is very easy to reproduce by running generic/522 with the realtime rmapbt patchset if rtinherit=1. Cc: <stable@vger.kernel.org> # v4.8 Fixes: `2c813ad66a` ("xfs: support btrees with overlapping intervals for keys") Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-12-12 17:45:10 -08:00
Darrick J. Wong	23bee6f390	xfs: fix error bailout in xfs_rtginode_create smatch reported that we screwed up the error cleanup in this function. Fix it. Cc: <stable@vger.kernel.org> # v6.13-rc1 Fixes: `ae897e0bed` ("xfs: support creating per-RTG files in growfs") Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-12-12 17:45:10 -08:00

1 2 3 4 5 ...

1324295 Commits