Arguments to a raw tracepoint are tagged as trusted, which carries the
semantics that the pointer will be non-NULL. However, in certain cases,
a raw tracepoint argument may end up being NULL. More context about this
issue is available in [0].
Thus, there is a discrepancy between the reality, that raw_tp arguments can
actually be NULL, and the verifier's knowledge, that they are never NULL,
causing explicit NULL check branch to be dead code eliminated.
A previous attempt [1], i.e. the second fixed commit, was made to
simulate symbolic execution as if in most accesses, the argument is a
non-NULL raw_tp, except for conditional jumps. This tried to suppress
branch prediction while preserving compatibility, but surfaced issues
with production programs that were difficult to solve without increasing
verifier complexity. A more complete discussion of issues and fixes is
available at [2].
Fix this by maintaining an explicit list of tracepoints where the
arguments are known to be NULL, and mark the positional arguments as
PTR_MAYBE_NULL. Additionally, capture the tracepoints where arguments
are known to be ERR_PTR, and mark these arguments as scalar values to
prevent potential dereference.
Each hex digit is used to encode NULL-ness (0x1) or ERR_PTR-ness (0x2),
shifted by the zero-indexed argument number x 4. This can be represented
as follows:
1st arg: 0x1
2nd arg: 0x10
3rd arg: 0x100
... and so on (likewise for ERR_PTR case).
In the future, an automated pass will be used to produce such a list, or
insert __nullable annotations automatically for tracepoints. Each
compilation unit will be analyzed and results will be collated to find
whether a tracepoint pointer is definitely not null, maybe null, or an
unknown state where verifier conservatively marks it PTR_MAYBE_NULL.
A proof of concept of this tool from Eduard is available at [3].
Note that in case we don't find a specification in the raw_tp_null_args
array and the tracepoint belongs to a kernel module, we will
conservatively mark the arguments as PTR_MAYBE_NULL. This is because
unlike for in-tree modules, out-of-tree module tracepoints may pass NULL
freely to the tracepoint. We don't protect against such tracepoints
passing ERR_PTR (which is uncommon anyway), lest we mark all such
arguments as SCALAR_VALUE.
While we are it, let's adjust the test raw_tp_null to not perform
dereference of the skb->mark, as that won't be allowed anymore, and make
it more robust by using inline assembly to test the dead code
elimination behavior, which should still stay the same.
[0]: https://lore.kernel.org/bpf/ZrCZS6nisraEqehw@jlelli-thinkpadt14gen4.remote.csb
[1]: https://lore.kernel.org/all/20241104171959.2938862-1-memxor@gmail.com
[2]: https://lore.kernel.org/bpf/20241206161053.809580-1-memxor@gmail.com
[3]: https://github.com/eddyz87/llvm-project/tree/nullness-for-tracepoint-params
Reported-by: Juri Lelli <juri.lelli@redhat.com> # original bug
Reported-by: Manu Bretelle <chantra@meta.com> # bugs in masking fix
Fixes: 3f00c52393 ("bpf: Allow trusted pointers to be passed to KF_TRUSTED_ARGS kfuncs")
Fixes: cb4158ce8e ("bpf: Mark raw_tp arguments with PTR_MAYBE_NULL")
Reviewed-by: Eduard Zingerman <eddyz87@gmail.com>
Co-developed-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20241213221929.3495062-3-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
This patch reverts commit
cb4158ce8e ("bpf: Mark raw_tp arguments with PTR_MAYBE_NULL"). The
patch was well-intended and meant to be as a stop-gap fixing branch
prediction when the pointer may actually be NULL at runtime. Eventually,
it was supposed to be replaced by an automated script or compiler pass
detecting possibly NULL arguments and marking them accordingly.
However, it caused two main issues observed for production programs and
failed to preserve backwards compatibility. First, programs relied on
the verifier not exploring == NULL branch when pointer is not NULL, thus
they started failing with a 'dereference of scalar' error. Next,
allowing raw_tp arguments to be modified surfaced the warning in the
verifier that warns against reg->off when PTR_MAYBE_NULL is set.
More information, context, and discusson on both problems is available
in [0]. Overall, this approach had several shortcomings, and the fixes
would further complicate the verifier's logic, and the entire masking
scheme would have to be removed eventually anyway.
Hence, revert the patch in preparation of a better fix avoiding these
issues to replace this commit.
[0]: https://lore.kernel.org/bpf/20241206161053.809580-1-memxor@gmail.com
Reported-by: Manu Bretelle <chantra@meta.com>
Fixes: cb4158ce8e ("bpf: Mark raw_tp arguments with PTR_MAYBE_NULL")
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20241213221929.3495062-2-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmdckwAQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgppJmD/9uUPcPm51b5E7fzD5Hqvlb22uZMYbXs1vR
1NZWdPJhMoMPBXyQ0GN7wHThvwQ8VuTcZNs8pzSJWpZMMhgsdleViDL8hedPeblt
TSDc6g2gEt7TtIGIhNqq7bQNW61a+KxZz55B/qKqlJOUsW7ALPuM4m34vSMTNKw8
c/RK3PtTxSvE5nmqLzeynw2Zo7IZ0PL2NSYZ0oID9ZcGtj4ItezhshXTPLuLNuRM
ppvc9u3JGyAzVJI/I0GNNW2Xo2maFWtvcWznaegowoBzjQO4Qfo9WDtn3uJFl8Di
N6M7H4GASo80l+Hd1eAal3YrM53Z1RW1Mj4xaA2+vtL+p7k5tfV3WAr00hNsK9TW
401KbBGqgvrVS7Y/Y0ADVqqoCePPhblZmWbJu36Jrz0/nDGi3lPOSigYSANlxGsC
t0aeeD7lRTd5qJ5+pQnVQL9uaNTXZcUPizgGGG0y9/RcNPzot54ap0/cIEM3fXEQ
nd2Tv7O1lSyE5O2JCaoBY/P5ytI7LgZHDH2/ZaUZyDxKFgXSj+2NufTUGj/YmZhr
ZKU6U25cFhXEkMVDV4AUUh9Dq723dYkXE4xyl00eNiz6C0J9PQ8uvrlRgC9mPJeQ
g4xvyIZxkqeRmzAvUIipKQSZ5Iia1/Jtwqs1RBcda8/7w3sOmQOA9xJGjevLI8G6
he9AQP63JQ==
=trxL
-----END PGP SIGNATURE-----
Merge tag 'block-6.13-20241213' of git://git.kernel.dk/linux
Pull block fixes from Jens Axboe:
- Series from Damien fixing issues with the zoned write plugging
- Fix for a potential UAF in block cgroups
- Fix deadlock around queue freezing and the sysfs lock
- Various little cleanups and fixes
* tag 'block-6.13-20241213' of git://git.kernel.dk/linux:
block: Fix potential deadlock while freezing queue and acquiring sysfs_lock
block: Fix queue_iostats_passthrough_show()
blk-mq: Clean up blk_mq_requeue_work()
mq-deadline: Remove a local variable
blk-iocost: Avoid using clamp() on inuse in __propagate_weights()
block: Make bio_iov_bvec_set() accept pointer to const iov_iter
block: get wp_offset by bdev_offset_from_zone_start
blk-cgroup: Fix UAF in blkcg_unpin_online()
MAINTAINERS: update Coly Li's email address
block: Prevent potential deadlocks in zone write plug error recovery
dm: Fix dm-zoned-reclaim zone write pointer alignment
block: Ignore REQ_NOWAIT for zone reset and zone finish operations
block: Use a zone write plug BIO work for REQ_NOWAIT BIOs
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmdckxAQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpq3LD/9oZzEBKsIvAl7tPeHsmVBTaQIC8kS93gQB
8RxCZ3YJmM1Ilu/sezKnkDEbl9aVEAtgwCh04Lep42ye4tBA4eMfQ8kPj2/HWYaY
Vt4vM20Ls/diGGrjSK0958F8zwBeMKZsgSYeXhNf5IyOzXX1lY+JhCg2I+MLnYNi
fw5IngQBB3yZwtMk4VN35Jf5M/PvE7pww8MMarSsfYsSAawf50Fs85K5xoJ4vTWE
NxYcIU7MG6bqc6qrretNhgnXnw1OeZpjivXZAJv9+0WAJkMTxJ5QPLUpD7NJDi4b
GUD/+SiLaxCmXs/jib/pafXPEbcuhX4a7c3yy0Ceq94OrE7F7H52Dz7PqQ4VxAAP
wU5FaFjmytx+EqwDCAk1Cgvb3LX/gD793rA4A8FP64gBhlSmaW/yeMkffO9p9/Xb
BIaX8BLb5iVER5I5/BpdpTa6TPy0mzGDDAD/+/DU+V65PLa7Qnt4bDLpSgUzw0OJ
EIaqOxGZVF4xi4PqMFlOZqysI+rci3kOfHCrILWxuBlxOCaspRk8HFN0eypsgLu8
NenTlQBY+fKaKK32AQEfd21in5+tNBSy+E6H3HIWwUedNk2zihSuO12muXkOsfim
BG0CgSfMUzIA6Pxgd87NaE62JXQBhZ2JZQW8swIqKc1m8/p/V//9yYeH97sbZasK
7cdiHV4anQ==
=0uqv
-----END PGP SIGNATURE-----
Merge tag 'io_uring-6.13-20241213' of git://git.kernel.dk/linux
Pull io_uring fix from Jens Axboe:
"A single fix for a regression introduced in the 6.13 merge window"
* tag 'io_uring-6.13-20241213' of git://git.kernel.dk/linux:
io_uring/rsrc: don't put/free empty buffers
- sysbot fix for out of bounds access
-----BEGIN PGP SIGNATURE-----
iIoEABYKADIWIQSgX9xt+GwmrJEQ+euebuN7TNx1MQUCZ1x73BQcaXJhLndlaW55
QGludGVsLmNvbQAKCRCebuN7TNx1Md/JAQC1l9Fxu9FR4UW8nA+wgU0P8XYGKqmD
EqRUb82LI13f5QD/WtBQEVgBTxD5sVlMBjEGIuYE9hMZbdN0LB8wnK0HfQ8=
=87RX
-----END PGP SIGNATURE-----
Merge tag 'libnvdimm-fixes-6.13-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm
Pull libnvdimm fix from Ira Weiny:
- sysbot fix for out of bounds access
* tag 'libnvdimm-fixes-6.13-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
acpi: nfit: vmalloc-out-of-bounds Read in acpi_nfit_ctl
ARC GCC compiler is packaged starting from Fedora 39i and the GCC
variant of cross compile tools has arc-linux-gnu- prefix and not
arc-linux-. This is causing that CROSS_COMPILE variable is left unset.
This change allows builds without need to supply CROSS_COMPILE argument
if distro package is used.
Before this change:
$ make -j 128 ARCH=arc W=1 drivers/infiniband/hw/mlx4/
gcc: warning: ‘-mcpu=’ is deprecated; use ‘-mtune=’ or ‘-march=’ instead
gcc: error: unrecognized command-line option ‘-mmedium-calls’
gcc: error: unrecognized command-line option ‘-mlock’
gcc: error: unrecognized command-line option ‘-munaligned-access’
[1] https://packages.fedoraproject.org/pkgs/cross-gcc/gcc-arc-linux-gnu/index.html
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Vineet Gupta <vgupta@kernel.org>
* Fixes for scrub subsystem
* Fix quota crashes
* Fix sb_spino_align checons on large fsblock sizes
* Fix discarded superblock updates
* fix stuck unmount due to a locked inode
Signed-off-by: Carlos Maiolino <cem@kernel.org>
-----BEGIN PGP SIGNATURE-----
iJUEABMJAB0WIQQMHYkcUKcy4GgPe2RGdaER5QtfpgUCZ1yHBQAKCRBGdaER5Qtf
pvogAX98nf8qE2Y2I6DvXVeX+LOPWNG+WTiRpF3d8NxjhCfGn79am71YrtkHf6qC
V0halh8Bfi+SWsVqYYuJK+7/SUjGS2MzNqG0NY32wNYqAIroQA/Xh9IaQXX0q8t2
FLGWIaxWOA==
=0g8J
-----END PGP SIGNATURE-----
Merge tag 'xfs-fixes-6.13-rc3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull xfs fixes from Carlos Maiolino:
- Fixes for scrub subsystem
- Fix quota crashes
- Fix sb_spino_align checons on large fsblock sizes
- Fix discarded superblock updates
- Fix stuck unmount due to a locked inode
* tag 'xfs-fixes-6.13-rc3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (28 commits)
xfs: port xfs_ioc_start_commit to multigrain timestamps
xfs: return from xfs_symlink_verify early on V4 filesystems
xfs: fix zero byte checking in the superblock scrubber
xfs: check pre-metadir fields correctly
xfs: don't crash on corrupt /quotas dirent
xfs: don't move nondir/nonreg temporary repair files to the metadir namespace
xfs: fix sb_spino_align checks for large fsblock sizes
xfs: convert quotacheck to attach dquot buffers
xfs: attach dquot buffer to dquot log item buffer
xfs: clean up log item accesses in xfs_qm_dqflush{,_done}
xfs: separate dquot buffer reads from xfs_dqflush
xfs: don't lose solo dquot update transactions
xfs: don't lose solo superblock counter update transactions
xfs: avoid nested calls to __xfs_trans_commit
xfs: only run precommits once per transaction object
xfs: unlock inodes when erroring out of xfs_trans_alloc_dir
xfs: fix scrub tracepoints when inode-rooted btrees are involved
xfs: update btree keys correctly when _insrec splits an inode root block
xfs: fix error bailout in xfs_rtginode_create
xfs: fix null bno_hint handling in xfs_rtallocate_rtg
...
- arm64 stacktrace: address some fallout from the recent changes to
unwinding across exception boundaries
- Ensure the arm64 signal delivery failure is recoverable - only
override the return registers after all the user accesses took place
- Fix the arm64 kselftest access to SVCR - only when SME is detected
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE5RElWfyWxS+3PLO2a9axLQDIXvEFAmdccV0ACgkQa9axLQDI
XvHC6Q//X7A2aziMs+FTYrDc0sqK09Ur53SNyQZuA6v1dXqxLH+SbzM75n6TMv9O
a5lv/1Jxw0aQPOhTsRgg7wV+svMWyOx52ouUSjFVRPocBFgnxiyyxcxgC9TmPMTh
MbZPa36R7qSpfUSPK6+uOTI/eQ6cD66az4fq7Y936cyY8J1/xg5pQXSE2WWGx4nz
Ryci4iksNB8IXH37twDmPvE6446Q82eLcJR9zfh+lhf+Rgx5/rNEPdqtNJPR1a7N
ftVDHTvcMTLmoMsjvOkhtfjbwsGBbz4Ezji0GW3+zPcZK5EYRjUZvOnG3HD2/mds
iYfh0xux0GBEsjozkIshu7wg1JqMwJJ8zBXu62/TcKHG8LG3C/k6kCr+SN7IL8p8
qzxxkkD61AxAY0aHuJ56Ao9nCLbHkAFGOJXUiyDJQALzt2KKN4aoOt+0zj35yYFz
PONt4FPy3ulX6fk0XrAcI5E64hjeW/ib5wluOxb79Wn07ZYZQ/Ff42kJEMVL6VA7
bFbbYKtbjFhQQPac/TLbqtwgMX0lA6bSZ5bPhoPZys3iReVbDiGvszRsWh4WQYgX
X0sujmaSMkjU1IICfZs3yABLQ0aDJKeuaKHnri9WT3iuBOjiLlChVnXqd4VRgXfy
DUE6bKQ3ohHVTrIxpNcBzWUYsSAeZMPyakuCY6pxuVwzBPrOR28=
=s7gn
-----END PGP SIGNATURE-----
Merge tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
Pull arm64 fixes from Catalin Marinas:
- arm64 stacktrace: address some fallout from the recent changes to
unwinding across exception boundaries
- Ensure the arm64 signal delivery failure is recoverable - only
override the return registers after all the user accesses took place
- Fix the arm64 kselftest access to SVCR - only when SME is detected
* tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
kselftest/arm64: abi: fix SVCR detection
arm64: signal: Ensure signal delivery failure is recoverable
arm64: stacktrace: Don't WARN when unwinding other tasks
arm64: stacktrace: Skip reporting LR at exception boundaries
* A fix to avoid taking a mutex while resolving jump_labels in the mutex
implementation.
* A fix to avoid trying to resolve the early boot DT pointer via the
linear map.
* A fix to avoid trying to IPI kfence TLB flushes, as kfence might flush
with IRQs disabled.
* A fix to avoid calling PMD destructors on PMDs that were never
constructed.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCAAxFiEEKzw3R0RoQ7JKlDp6LhMZ81+7GIkFAmdcYiETHHBhbG1lckBk
YWJiZWx0LmNvbQAKCRAuExnzX7sYif8DD/4x/V+ZDInlTpQcPpKC7LtDaNBUyZ2r
4nENb4CcHkLTJjHHKUS9WFkH+woH0hx43yDkm5AKO3RttYj41fbybhBiomeEQdbE
iA6d7zQD3EFTXP+Pg/fBfzEWgMUCl9GJjtJn8ZpwqsibMOcNrfog9i/6ErPnPOKi
Q5+nx/Qyg2pxKI/9kTFsJ8W39iRaFZTVltrfZHc26i96ohx1b8J93KXT2dRFvkV8
doPcJkuEAP1l52B4y+Vo3T+01LS4bVWwv2Ye5r268lLB4DkYIwk/Hh6HG0DTeAID
o0TWbCREuR4gvur8ge3C6P4v7lCT+gv2tSxW9vxd1y6vMhIgRnDHYMoloDDWw352
6hTDg6akqTC4jfMFwyOFo98qze/bghCG+ELgMxhbgQamrWO7qzVZCjkgKB2Wal3b
rcY5eYMPFIR0YIBHa/kQ4J6mFVGhK4sBxpB9w3Z7BSYnMU+uCXChMmxo4kSXJelt
N2Ny9rYlNxxhUg6OSDXAWQAoIUHo6O89gWZyXnFlBUfaI6cnVFDwd+2F6QoRhmhu
XNTHl699qcV5Unb08qd2wnCXFv0X0WVVGTPO5rFH2/HlkPoPwZIGEv/kU+U42n5p
OiIteste0TH5w20knrWYHZKEnNkaONcxNqmloTK89qQL+526s65pcgPk6W1hFqow
4WpJWIgwTRdd3A==
=uaBB
-----END PGP SIGNATURE-----
Merge tag 'riscv-for-linus-6.13-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux
Pull RISC-V fixes from Palmer Dabbelt:
- avoid taking a mutex while resolving jump_labels in the mutex
implementation
- avoid trying to resolve the early boot DT pointer via the linear map
- avoid trying to IPI kfence TLB flushes, as kfence might flush with
IRQs disabled
- avoid calling PMD destructors on PMDs that were never constructed
* tag 'riscv-for-linus-6.13-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux:
riscv: mm: Do not call pmd dtor on vmemmap page table teardown
riscv: Fix IPIs usage in kfence_protect_page()
riscv: Fix wrong usage of __pa() on a fixmap address
riscv: Fixup boot failure when CONFIG_DEBUG_RT_MUTEXES=y
Merge an ACPICA fix for 6.13-rc3:
- Don't release a context_mutex that was never acquired in
acpi_remove_address_space_handler() (Daniil Tatianin).
* acpica:
ACPICA: events/evxfregn: don't release the ContextMutex that was never acquired
- Replace csr_write() with csr_set() for HVIEN PMU overflow bit
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEZdn75s5e6LHDQ+f/rUjsVaLHLAcFAmdcUn0ACgkQrUjsVaLH
LAcoyg//RD70/kMg4ZvE9rjNSBE7c4TemvHaCPgSpI4qM/r1HeJSMUDj+AIHvhg2
S3jwgi32oJYaSXHEknvwwtaPN61JHn/mMeTcDlNSBG0vXacoAnprmfDMKzUoYNHc
Ui9bkXhmkUUjLsbaW+lTOIHgUUCaLm4hh5m6W3adslCA/AfYn8v0XKiCHYVRSAXR
qOgWvo2QSyowaGf9tU9ttvaigkV9FBUyIWJyOJcxwvkcvOg/fhvYF1CkQTxJh+m1
N7C7fh5CzvdugojlWtjMIm9kAGPnhLXnuNFLGZfAkgxltAlI+nzGALAS2tdq70/n
DOBbf0J2zI1zPCCGYt0nPCIyqacduJHTxHLbS7gX+nIGdfoACVtS/ir/3gNIS4RW
NnMgJi7OSvnAudlmOluJdp6K8Xavk/EVxgGY7ItUXpncH2g2lOe/2C7Ihsb6x/mF
xbap8hTYOtUh9jUhmjcx47CzrmkufUKGwBJAk36PsLAwRkvVqS/gsQDVuHwqhpby
66cAj0xDqq8gSg2dJhQQzOB8bUvWhozUAjoaJMpWX1UjNph9ZkqwZPsKB1YXyLsx
N5OhcmGCMdVIaJ2R/wmflWZCJI4U4ycZXO9VMdOFW882JoNya6uiNohJv/fosUjF
5JXFAgbM4tJ0M+UdDyAb3k85SQ/pTIKE/aoON8VtZBefku/DyNw=
=/5Vc
-----END PGP SIGNATURE-----
Merge tag 'kvm-riscv-fixes-6.13-1' of https://github.com/kvm-riscv/linux into HEAD
KVM/riscv fixes for 6.13, take #1
- Replace csr_write() with csr_set() for HVIEN PMU overflow bit
Snapshot the output of CPUID.0xD.[1..n] during kvm.ko initiliaization to
avoid the overead of CPUID during runtime. The offset, size, and metadata
for CPUID.0xD.[1..n] sub-leaves does not depend on XCR0 or XSS values, i.e.
is constant for a given CPU, and thus can be cached during module load.
On Intel's Emerald Rapids, CPUID is *wildly* expensive, to the point where
recomputing XSAVE offsets and sizes results in a 4x increase in latency of
nested VM-Enter and VM-Exit (nested transitions can trigger
xstate_required_size() multiple times per transition), relative to using
cached values. The issue is easily visible by running `perf top` while
triggering nested transitions: kvm_update_cpuid_runtime() shows up at a
whopping 50%.
As measured via RDTSC from L2 (using KVM-Unit-Test's CPUID VM-Exit test
and a slightly modified L1 KVM to handle CPUID in the fastpath), a nested
roundtrip to emulate CPUID on Skylake (SKX), Icelake (ICX), and Emerald
Rapids (EMR) takes:
SKX 11650
ICX 22350
EMR 28850
Using cached values, the latency drops to:
SKX 6850
ICX 9000
EMR 7900
The underlying issue is that CPUID itself is slow on ICX, and comically
slow on EMR. The problem is exacerbated on CPUs which support XSAVES
and/or XSAVEC, as KVM invokes xstate_required_size() twice on each
runtime CPUID update, and because there are more supported XSAVE features
(CPUID for supported XSAVE feature sub-leafs is significantly slower).
SKX:
CPUID.0xD.2 = 348 cycles
CPUID.0xD.3 = 400 cycles
CPUID.0xD.4 = 276 cycles
CPUID.0xD.5 = 236 cycles
<other sub-leaves are similar>
EMR:
CPUID.0xD.2 = 1138 cycles
CPUID.0xD.3 = 1362 cycles
CPUID.0xD.4 = 1068 cycles
CPUID.0xD.5 = 910 cycles
CPUID.0xD.6 = 914 cycles
CPUID.0xD.7 = 1350 cycles
CPUID.0xD.8 = 734 cycles
CPUID.0xD.9 = 766 cycles
CPUID.0xD.10 = 732 cycles
CPUID.0xD.11 = 718 cycles
CPUID.0xD.12 = 734 cycles
CPUID.0xD.13 = 1700 cycles
CPUID.0xD.14 = 1126 cycles
CPUID.0xD.15 = 898 cycles
CPUID.0xD.16 = 716 cycles
CPUID.0xD.17 = 748 cycles
CPUID.0xD.18 = 776 cycles
Note, updating runtime CPUID information multiple times per nested
transition is itself a flaw, especially since CPUID is a mandotory
intercept on both Intel and AMD. E.g. KVM doesn't need to ensure emulated
CPUID state is up-to-date while running L2. That flaw will be fixed in a
future patch, as deferring runtime CPUID updates is more subtle than it
appears at first glance, the benefits aren't super critical to have once
the XSAVE issue is resolved, and caching CPUID output is desirable even if
KVM's updates are deferred.
Cc: Jim Mattson <jmattson@google.com>
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-ID: <20241211013302.1347853-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
- fix several low-level issues in gpio-graniterapids
- fix an initialization order issue that manifests itself with
__counted_by() checks in gpio-ljca
- don't default to y for CONFIG_GPIO_MVEBU with COMPILE_TEST enabled
- move the DEFAULT_SYMBOL_NAMESPACE define before the export.h include
in gpio-idio-16
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEFp3rbAvDxGAT0sefEacuoBRx13IFAmdcOcIACgkQEacuoBRx
13Ky3g/+JZawOniCmm5vBEH5SS4lhjEMts0i0WkaH5oIhESxsLgfZO8e0ucQt4td
AmVk7ingHaoY1GYL1bmhkZEELHhqrSYb0uCMMtK1AxvXOeMUdGRj2QYpdKLGdWL3
dCrWemJi8sx0UDWs4LLvR5HcoKAACB5Mu2NmOMTtuYiBB8qIfCWQ0+0Q09/yhJM8
WNhGComnuKPvVgndHrGcd5H7CTUJobyacYu3dwvzPjPC+PvC6YTrkos5xonbyx1E
GQQ3bxgU/ZvuGp6OKKeugaMxJcWOcrBm7emYiCyYwcasrh3GghUAcQ8UAP9CLYSx
U1AzWI1lDqNdh7a+pr+HYR43aJ8RJ8lWFKA5wlUHSZv97G4aZyFRPBIC0AI4C9os
OvW/sS4IvJ/snY7dyQ7972RLEMzmSSpEDWyYdrwOCDAtRt6KGpK7GH6yQrxQJCUN
6g30wry+zzlybki0gyyQB53b8X5pth6VuDtpaPSwFPQo1JL+Z0abbIPjqJKrVZBt
UoVLKAev5Jo6U5/aiPvddBMw31RPeAN2dQTuYxKI8fv9xtWMYrpa9y9Hzgt4s1HX
F+Pl2cVUnM+cKWYQbk7m1s07jHMjsnaH0RCxPX54RnXaQ7hevXfBn0HeNkETN426
uZmlQIPgIB7cP2zdBDnJ6Jzgzv5mu+azIuTb11PL6JmL+DOXzhw=
=nunS
-----END PGP SIGNATURE-----
Merge tag 'gpio-fixes-for-v6.13-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/brgl/linux
Pull gpio fixes from Bartosz Golaszewski:
- fix several low-level issues in gpio-graniterapids
- fix an initialization order issue that manifests itself with
__counted_by() checks in gpio-ljca
- don't default to y for CONFIG_GPIO_MVEBU with COMPILE_TEST enabled
- move the DEFAULT_SYMBOL_NAMESPACE define before the export.h include
in gpio-idio-16
* tag 'gpio-fixes-for-v6.13-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/brgl/linux:
gpio: idio-16: Actually make use of the GPIO_IDIO_16 symbol namespace
gpio: graniterapids: Fix GPIO Ack functionality
gpio: graniterapids: Check if GPIO line can be used for IRQs
gpio: graniterapids: Determine if GPIO pad can be used by driver
gpio: graniterapids: Fix invalid RXEVCFG register bitmask
gpio: graniterapids: Fix invalid GPI_IS register offset
gpio: graniterapids: Fix incorrect BAR assignment
gpio: graniterapids: Fix vGPIO driver crash
gpio: ljca: Initialize num before accessing item in ljca_gpio_config
gpio: GPIO_MVEBU should not default to y when compile-testing
A collection of fixes; all look small, device-specific and boring.
-----BEGIN PGP SIGNATURE-----
iQJCBAABCAAsFiEEIXTw5fNLNI7mMiVaLtJE4w1nLE8FAmda9xkOHHRpd2FpQHN1
c2UuZGUACgkQLtJE4w1nLE/kUA/+Jx7jW36zxRhPqRydtpmkRXGPOcsV1y6NZyD8
i0vjlW6oNHZZeOtdU4ap2CL8x6YMYnCzQhtFbZD1Lob2O2G2aUYi9tWtYWeetws1
OIB/Nj7xebJDtPl7yIrE5I9TRYGQKMT33LVopnFdihcVlQhSGmxIhJSm1C+n1Nsf
gFGt2smUnaJ3dzaRXcUDaAcwoc4+DGJ4owpWJfnQae+klI+A6iiVSEoOcfoFL++q
yp200Zb9Q49OUnDkqe2Nwjsqj3rSMulEtuF6S/NOiFqDR/Q6TDQfv+a1jaTPx58v
OVtyeByMFV+usvq8qRPEgaAsQuWX+IukHoHEkLH3uUEt3DXAi5p2USl33bnabNv/
le+du0O1vJmYkZD72jPBtkKwDElnRs8w1BlMcrenAdjHtwfB+Wy+cW6bzugiV53K
ykQV3jGEThPerXomR9etYrumsvBNW7SsaZpy+8T7WjTAS1IPHaOkitlHAI16Ejv4
F8o2SDm8pf2lBc9pGPZmOk3wJWLeLk1dn2q62wtRCD5RHFbC/Eh4JFn7hKq2rfyl
am9c9fmfs+3BMHNuue8UJgsO4BV8+pUwRy5ZrG1sbUQ0ZOxqczSw+/+ehHozahBC
i7DtlpJNBuOhRrZukAp3sGE0Mx6muoslZXo0CVqvFIJFgXR0BXx4Ymt8LLqOTBTl
1w0FBoE=
=u9jP
-----END PGP SIGNATURE-----
Merge tag 'sound-6.13-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound
Pull sound fixes from Takashi Iwai:
"A collection of fixes; all look small, device-specific and boring"
* tag 'sound-6.13-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound:
ASoC: Intel: sof_sdw: Add space for a terminator into DAIs array
ASoC: fsl_spdif: change IFACE_PCM to IFACE_MIXER
ASoC: fsl_xcvr: change IFACE_PCM to IFACE_MIXER
ASoC: tas2781: Fix calibration issue in stress test
ASoC: audio-graph-card: Call of_node_put() on correct node
ASoC: amd: yc: Fix the wrong return value
ALSA: control: Avoid WARN() for symlink errors
sound: usb: format: don't warn that raw DSD is unsupported
sound: usb: enable DSD output for ddHiFi TC44C
ALSA: hda/realtek: Add new alc2xx-fixup-headset-mic model
ALSA: hda/ca0132: Use standard HD-audio quirk matching helpers
ALSA: usb-audio: Add implicit feedback quirk for Yamaha THR5
ALSA: hda/realtek - Add support for ASUS Zen AIO 27 Z272SD_A272SD audio
ALSA: hda/realtek: Fix headset mic on Acer Nitro 5
ALSA: hda: cs35l56: Remove calls to cs35l56_force_sync_asp1_registers_from_cache()
mass change.
-----BEGIN PGP SIGNATURE-----
iQFDBAABCAAtFiEEIw+MvkEiF49krdp9F0NaE2wMflgFAmdcUaMPHGNvcmJldEBs
d24ubmV0AAoJEBdDWhNsDH5YHocH+gJMedy9MoAzDORi1D22qRBYdYcjSkE1vFcu
lNNoxDckzu1OnycWFfdkKG4dZct8G9JoSec3sxoofqRlHCvayrfeRanLveY7kLMv
p5KK1HZSe9G7C8HXnJGoiRzmTiBzYFI8NWtCb/S1XplwzoTUeorbKSiV4EmyGdcS
Dm4Hx0PK9HIij8klL9KN1QKWhpWFk5qs8Js8fJIIUZiV0T1B9VDMYAK8LjPdH7lG
gKPTHT21OGF7AuIgLQBrqHwutBKSbnrfHqfoRFY/dyZQS8PXtqmMuNepCqsPAQQ0
vNuAbdAihsaFSof69Sdx8s4uw7PoTDZIf/GoWrB+B9Q9MLh75G8=
=HZYa
-----END PGP SIGNATURE-----
Merge tag 'docs-6.13-fix' of git://git.lwn.net/linux
Pull documentation fix from Jonathan Corbet:
"A single fix for a docs-build regression caused by the
EXPORT_SYMBOL_NS() mass change"
* tag 'docs-6.13-fix' of git://git.lwn.net/linux:
scripts/kernel-doc: Get -export option working again
It appears that the relatively popular RK3399 SoC has been put together
using a large amount of illicit substances, as experiments reveal that its
integration of GIC500 exposes the *secure* programming interface to
non-secure.
This has some pretty bad effects on the way priorities are handled, and
results in a dead machine if booting with pseudo-NMI enabled
(irqchip.gicv3_pseudo_nmi=1) if the kernel contains 18fdb6348c ("arm64:
irqchip/gic-v3: Select priorities at boot time"), which relies on the
priorities being programmed using the NS view.
Let's restore some sanity by going one step further and disable security
altogether in this case. This is not any worse, and puts us in a mode where
priorities actually make some sense.
Huge thanks to Mark Kettenis who initially identified this issue on
OpenBSD, and to Chen-Yu Tsai who reported the problem in Linux.
Fixes: 18fdb6348c ("arm64: irqchip/gic-v3: Select priorities at boot time")
Reported-by: Mark Kettenis <mark.kettenis@xs4all.nl>
Reported-by: Chen-Yu Tsai <wens@csie.org>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Chen-Yu Tsai <wens@csie.org>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/all/20241213141037.3995049-1-maz@kernel.org
percpu_base is used in various percpu functions that expect variable in
__percpu address space. Correct the declaration of percpu_base to
void __iomem * __percpu *percpu_base;
to declare the variable as __percpu pointer.
The patch fixes several sparse warnings:
irq-gic.c:1172:44: warning: incorrect type in assignment (different address spaces)
irq-gic.c:1172:44: expected void [noderef] __percpu *[noderef] __iomem *percpu_base
irq-gic.c:1172:44: got void [noderef] __iomem *[noderef] __percpu *
...
irq-gic.c:1231:43: warning: incorrect type in argument 1 (different address spaces)
irq-gic.c:1231:43: expected void [noderef] __percpu *__pdata
irq-gic.c:1231:43: got void [noderef] __percpu *[noderef] __iomem *percpu_base
There were no changes in the resulting object files.
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/all/20241213145809.2918-2-ubizjak@gmail.com
Make queue_iostats_passthrough_show() report 0/1 in sysfs instead of 0/4.
This patch fixes the following sparse warning:
block/blk-sysfs.c:266:31: warning: incorrect type in argument 1 (different base types)
block/blk-sysfs.c:266:31: expected unsigned long var
block/blk-sysfs.c:266:31: got restricted blk_flags_t
Cc: Keith Busch <kbusch@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Fixes: 110234da18 ("block: enable passthrough command statistics")
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20241212212941.1268662-4-bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Move a statement that occurs in both branches of an if-statement in front
of the if-statement. Fix a typo in a source code comment. No functionality
has been changed.
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Nitesh Shetty <nj.shetty@samsung.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20241212212941.1268662-3-bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Since commit fde02699c2 ("block: mq-deadline: Remove support for zone
write locking"), the local variable 'insert_before' is assigned once and
is used once. Hence remove this local variable.
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Nitesh Shetty <nj.shetty@samsung.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20241212212941.1268662-2-bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
When using svcr_in to check ZA and Streaming Mode, we should make sure
that the value in x2 is correct, otherwise it may trigger an Illegal
instruction if FEAT_SVE and !FEAT_SME.
Fixes: 43e3f85523 ("kselftest/arm64: Add SME support to syscall ABI test")
Signed-off-by: Weizhao Ouyang <o451686892@gmail.com>
Reviewed-by: Mark Brown <broonie@kernel.org>
Link: https://lore.kernel.org/r/20241211111639.12344-1-o451686892@gmail.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
When a PASID is used for SVA by a device, it's possible that the PASID
entry is cleared before the device flushes all ongoing DMA requests and
removes the SVA domain. This can occur when an exception happens and the
process terminates before the device driver stops DMA and calls the
iommu driver to unbind the PASID.
There's no need to drain the PRQ in the mm release path. Instead, the PRQ
will be drained in the SVA unbind path.
Unfortunately, commit c43e1ccdeb ("iommu/vt-d: Drain PRQs when domain
removed from RID") changed this behavior by unconditionally draining the
PRQ in intel_pasid_tear_down_entry(). This can lead to a potential
sleeping-in-atomic-context issue.
Smatch static checker warning:
drivers/iommu/intel/prq.c:95 intel_iommu_drain_pasid_prq()
warn: sleeping in atomic context
To avoid this issue, prevent draining the PRQ in the SVA mm release path
and restore the previous behavior.
Fixes: c43e1ccdeb ("iommu/vt-d: Drain PRQs when domain removed from RID")
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Closes: https://lore.kernel.org/linux-iommu/c5187676-2fa2-4e29-94e0-4a279dc88b49@stanley.mountain/
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Link: https://lore.kernel.org/r/20241212021529.1104745-1-baolu.lu@linux.intel.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
The qi_batch is allocated when assigning cache tag for a domain. While
for nested parent domain, it is missed. Hence, when trying to map pages
to the nested parent, NULL dereference occurred. Also, there is potential
memleak since there is no lock around domain->qi_batch allocation.
To solve it, add a helper for qi_batch allocation, and call it in both
the __cache_tag_assign_domain() and __cache_tag_assign_parent_domain().
BUG: kernel NULL pointer dereference, address: 0000000000000200
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 8104795067 P4D 0
Oops: Oops: 0000 [#1] PREEMPT SMP NOPTI
CPU: 223 UID: 0 PID: 4357 Comm: qemu-system-x86 Not tainted 6.13.0-rc1-00028-g4b50c3c3b998-dirty #2632
Call Trace:
? __die+0x24/0x70
? page_fault_oops+0x80/0x150
? do_user_addr_fault+0x63/0x7b0
? exc_page_fault+0x7c/0x220
? asm_exc_page_fault+0x26/0x30
? cache_tag_flush_range_np+0x13c/0x260
intel_iommu_iotlb_sync_map+0x1a/0x30
iommu_map+0x61/0xf0
batch_to_domain+0x188/0x250
iopt_area_fill_domains+0x125/0x320
? rcu_is_watching+0x11/0x50
iopt_map_pages+0x63/0x100
iopt_map_common.isra.0+0xa7/0x190
iopt_map_user_pages+0x6a/0x80
iommufd_ioas_map+0xcd/0x1d0
iommufd_fops_ioctl+0x118/0x1c0
__x64_sys_ioctl+0x93/0xc0
do_syscall_64+0x71/0x140
entry_SYSCALL_64_after_hwframe+0x76/0x7e
Fixes: 705c1cdf1e ("iommu/vt-d: Introduce batched cache invalidation")
Cc: stable@vger.kernel.org
Co-developed-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Link: https://lore.kernel.org/r/20241210130322.17175-1-yi.l.liu@intel.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
The current implementation removes cache tags after disabling ATS,
leading to potential memory leaks and kernel crashes. Specifically,
CACHE_TAG_DEVTLB type cache tags may still remain in the list even
after the domain is freed, causing a use-after-free condition.
This issue really shows up when multiple VFs from different PFs
passed through to a single user-space process via vfio-pci. In such
cases, the kernel may crash with kernel messages like:
BUG: kernel NULL pointer dereference, address: 0000000000000014
PGD 19036a067 P4D 1940a3067 PUD 136c9b067 PMD 0
Oops: Oops: 0000 [#1] PREEMPT SMP NOPTI
CPU: 74 UID: 0 PID: 3183 Comm: testCli Not tainted 6.11.9 #2
RIP: 0010:cache_tag_flush_range+0x9b/0x250
Call Trace:
<TASK>
? __die+0x1f/0x60
? page_fault_oops+0x163/0x590
? exc_page_fault+0x72/0x190
? asm_exc_page_fault+0x22/0x30
? cache_tag_flush_range+0x9b/0x250
? cache_tag_flush_range+0x5d/0x250
intel_iommu_tlb_sync+0x29/0x40
intel_iommu_unmap_pages+0xfe/0x160
__iommu_unmap+0xd8/0x1a0
vfio_unmap_unpin+0x182/0x340 [vfio_iommu_type1]
vfio_remove_dma+0x2a/0xb0 [vfio_iommu_type1]
vfio_iommu_type1_ioctl+0xafa/0x18e0 [vfio_iommu_type1]
Move cache_tag_unassign_domain() before iommu_disable_pci_caps() to fix
it.
Fixes: 3b1d9e2b2d ("iommu/vt-d: Add cache tag assignment interface")
Cc: stable@vger.kernel.org
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Link: https://lore.kernel.org/r/20241129020506.576413-1-baolu.lu@linux.intel.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Commit eaf62ce156 ("arm64/signal: Set up and restore the GCS
context for signal handlers") introduced a potential failure point
at the end of setup_return(). This is unfortunate as it is too late
to deliver a SIGSEGV: if that SIGSEGV is handled, the subsequent
sigreturn will end up returning to the original handler, which is
not the intention (since we failed to deliver that signal).
Make sure this does not happen by calling gcs_signal_entry()
at the very beginning of setup_return(), and add a comment just
after to discourage error cases being introduced from that point
onwards.
While at it, also take care of copy_siginfo_to_user(): since it may
fail, we shouldn't be calling it after setup_return() either. Call
it before setup_return() instead, and move the setting of X1/X2
inside setup_return() where it belongs (after the "point of no
failure").
Background: the first part of setup_rt_frame(), including
setup_sigframe(), has no impact on the execution of the interrupted
thread. The signal frame is written to the stack, but the stack
pointer remains unchanged. Failure at this stage can be recovered by
a SIGSEGV handler, and sigreturn will restore the original context,
at the point where the original signal occurred. On the other hand,
once setup_return() has updated registers including SP, the thread's
control flow has been modified and we must deliver the original
signal.
Fixes: eaf62ce156 ("arm64/signal: Set up and restore the GCS context for signal handlers")
Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
Reviewed-by: Dave Martin <Dave.Martin@arm.com>
Reviewed-by: Mark Brown <broonie@kernel.org>
Link: https://lore.kernel.org/r/20241210160940.2031997-1-kevin.brodsky@arm.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
dlserver time is accounted when:
- dlserver is active and the dlserver proxies the cfs task.
- dlserver is active but deferred and cfs task runs after being picked
through the normal fair class pick.
dl_server_update is called in two places to make sure that both the
above times are accounted for. But it doesn't check if dlserver is
active or not. Now that we have this dl_server_active flag, we can
consolidate dl_server_update into one place and all we need to check is
whether dlserver is active or not. When dlserver is active there is only
two possible conditions:
- dlserver is deferred.
- cfs task is running on behalf of dlserver.
Fixes: a110a81c52 ("sched/deadline: Deferrable dl server")
Signed-off-by: "Vineeth Pillai (Google)" <vineeth@bitbyteword.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Marcel Ziswiler <marcel.ziswiler@codethink.co.uk> # ROCK 5B
Link: https://lore.kernel.org/r/20241213032244.877029-2-vineeth@bitbyteword.org
dlserver can get dequeued during a dlserver pick_task due to the delayed
deueue feature and this can lead to issues with dlserver logic as it
still thinks that dlserver is on the runqueue. The dlserver throttling
and replenish logic gets confused and can lead to double enqueue of
dlserver.
Double enqueue of dlserver could happend due to couple of reasons:
Case 1
------
Delayed dequeue feature[1] can cause dlserver being stopped during a
pick initiated by dlserver:
__pick_next_task
pick_task_dl -> server_pick_task
pick_task_fair
pick_next_entity (if (sched_delayed))
dequeue_entities
dl_server_stop
server_pick_task goes ahead with update_curr_dl_se without knowing that
dlserver is dequeued and this confuses the logic and may lead to
unintended enqueue while the server is stopped.
Case 2
------
A race condition between a task dequeue on one cpu and same task's enqueue
on this cpu by a remote cpu while the lock is released causing dlserver
double enqueue.
One cpu would be in the schedule() and releasing RQ-lock:
current->state = TASK_INTERRUPTIBLE();
schedule();
deactivate_task()
dl_stop_server();
pick_next_task()
pick_next_task_fair()
sched_balance_newidle()
rq_unlock(this_rq)
at which point another CPU can take our RQ-lock and do:
try_to_wake_up()
ttwu_queue()
rq_lock()
...
activate_task()
dl_server_start() --> first enqueue
wakeup_preempt() := check_preempt_wakeup_fair()
update_curr()
update_curr_task()
if (current->dl_server)
dl_server_update()
enqueue_dl_entity() --> second enqueue
This bug was not apparent as the enqueue in dl_server_start doesn't
usually happen because of the defer logic. But as a side effect of the
first case(dequeue during dlserver pick), dl_throttled and dl_yield will
be set and this causes the time accounting of dlserver to messup and
then leading to a enqueue in dl_server_start.
Have an explicit flag representing the status of dlserver to avoid the
confusion. This is set in dl_server_start and reset in dlserver_stop.
Fixes: 63ba8422f8 ("sched/deadline: Introduce deadline servers")
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: "Vineeth Pillai (Google)" <vineeth@bitbyteword.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Marcel Ziswiler <marcel.ziswiler@codethink.co.uk> # ROCK 5B
Link: https://lkml.kernel.org/r/20241213032244.877029-1-vineeth@bitbyteword.org
esre_attribute::store() is not needed since commit af97a77bc0 (efi:
Move some sysfs files to be read-only by root). Drop it.
Found by https://github.com/jirislaby/clang-struct.
Signed-off-by: Jiri Slaby (SUSE) <jirislaby@kernel.org>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: linux-efi@vger.kernel.org
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Bug fixes for 6.13.
This has been running on the djcloud for months with no problems. Enjoy!
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQQ2qTKExjcn+O1o2YRKO3ySh0YRpgUCZ1uRzwAKCRBKO3ySh0YR
poYjAP0YxGr59OFEmdu9fZzLRQoARjchlqMmYiMOokbXxqGfhgD/Wo7Er+Dpj4KE
jIvDWUy8anoKuE2pvcRVBYyYaPoTNgY=
=i+9a
-----END PGP SIGNATURE-----
Merge tag 'xfs-6.13-fixes_2024-12-12' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into next-rc
xfs: bug fixes for 6.13 [01/12]
Bug fixes for 6.13.
This has been running on the djcloud for months with no problems. Enjoy!
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
Take advantage of the multigrain timestamp APIs to ensure that nobody
can sneak in and write things to a file between starting a file update
operation and committing the results. This should have been part of the
multigrain timestamp merge, but I forgot to fling it at jlayton when he
resubmitted the patchset due to developer bandwidth problems.
Cc: <stable@vger.kernel.org> # v6.13-rc1
Fixes: 4e40eff0b5 ("fs: add infrastructure for multigrain timestamps")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
V4 symlink blocks didn't have headers, so return early if this is a V4
filesystem.
Cc: <stable@vger.kernel.org> # v5.1
Fixes: 39708c20ab ("xfs: miscellaneous verifier magic value fixups")
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
The logic to check that the region past the end of the superblock is all
zeroes is wrong -- we don't want to check only the bytes past the end of
the maximally sized ondisk superblock structure as currently defined in
xfs_format.h; we want to check the bytes beyond the end of the ondisk as
defined by the feature bits.
Port the superblock size logic from xfs_repair and then put it to use in
xfs_scrub.
Cc: <stable@vger.kernel.org> # v4.15
Fixes: 21fb4cb198 ("xfs: scrub the secondary superblocks")
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
The checks that were added to the superblock scrubber for metadata
directories aren't quite right -- the old inode pointers are now defined
to be zeroes until someone else reuses them. Also consolidate the new
metadir field checks to one place; they were inexplicably scattered
around.
Cc: <stable@vger.kernel.org> # v6.13-rc1
Fixes: 28d756d4d5 ("xfs: update sb field checks when metadir is turned on")
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
If the /quotas dirent points to an inode but the inode isn't loadable
(and hence mkdir returns -EEXIST), don't crash, just bail out.
Cc: <stable@vger.kernel.org> # v6.13-rc1
Fixes: e80fbe1ad8 ("xfs: use metadir for quota inodes")
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Only directories or regular files are allowed in the metadata directory
tree. Don't move the repair tempfile to the metadir namespace if this
is not true; this will cause the inode verifiers to trip.
xrep_tempfile_adjust_directory_tree opportunistically moves sc->tempip
from the regular directory tree to the metadata directory tree if sc->ip
is part of the metadata directory tree. However, the scrub setup
functions grab sc->ip and create sc->tempip before we actually get
around to checking if the file mode is the right type for the scrubber.
IOWs, you can invoke the symlink scrubber with the file handle of a
subdirectory in the metadir. xrep_setup_symlink will create a temporary
symlink file, xrep_tempfile_adjust_directory_tree will foolishly try to
set the METADATA flag on the temp symlink, which trips the inode
verifier in the inode item precommit, which shuts down the filesystem
when expensive checks are turned on. If they're /not/ turned on, then
xchk_symlink will return ENOENT when it sees that it's been passed a
symlink, but the invalid inode could still get flushed to disk. We
don't want that.
Cc: <stable@vger.kernel.org> # v6.13-rc1
Fixes: 9dc31acb01 ("xfs: move repair temporary files to the metadata directory tree")
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
For a sparse inodes filesystem, mkfs.xfs computes the values of
sb_spino_align and sb_inoalignmt with the following code:
int cluster_size = XFS_INODE_BIG_CLUSTER_SIZE;
if (cfg->sb_feat.crcs_enabled)
cluster_size *= cfg->inodesize / XFS_DINODE_MIN_SIZE;
sbp->sb_spino_align = cluster_size >> cfg->blocklog;
sbp->sb_inoalignmt = XFS_INODES_PER_CHUNK *
cfg->inodesize >> cfg->blocklog;
On a V5 filesystem with 64k fsblocks and 512 byte inodes, this results
in cluster_size = 8192 * (512 / 256) = 16384. As a result,
sb_spino_align and sb_inoalignmt are both set to zero. Unfortunately,
this trips the new sb_spino_align check that was just added to
xfs_validate_sb_common, and the mkfs fails:
# mkfs.xfs -f -b size=64k, /dev/sda
meta-data=/dev/sda isize=512 agcount=4, agsize=81136 blks
= sectsz=512 attr=2, projid32bit=1
= crc=1 finobt=1, sparse=1, rmapbt=1
= reflink=1 bigtime=1 inobtcount=1 nrext64=1
= exchange=0 metadir=0
data = bsize=65536 blocks=324544, imaxpct=25
= sunit=0 swidth=0 blks
naming =version 2 bsize=65536 ascii-ci=0, ftype=1, parent=0
log =internal log bsize=65536 blocks=5006, version=2
= sectsz=512 sunit=0 blks, lazy-count=1
realtime =none extsz=65536 blocks=0, rtextents=0
= rgcount=0 rgsize=0 extents
Discarding blocks...Sparse inode alignment (0) is invalid.
Metadata corruption detected at 0x560ac5a80bbe, xfs_sb block 0x0/0x200
libxfs_bwrite: write verifier failed on xfs_sb bno 0x0/0x1
mkfs.xfs: Releasing dirty buffer to free list!
found dirty buffer (bulk) on free list!
Sparse inode alignment (0) is invalid.
Metadata corruption detected at 0x560ac5a80bbe, xfs_sb block 0x0/0x200
libxfs_bwrite: write verifier failed on xfs_sb bno 0x0/0x1
mkfs.xfs: writing AG headers failed, err=22
Prior to commit 59e43f5479 this all worked fine, even if "sparse"
inodes are somewhat meaningless when everything fits in a single
fsblock. Adjust the checks to handle existing filesystems.
Cc: <stable@vger.kernel.org> # v6.13-rc1
Fixes: 59e43f5479 ("xfs: sb_spino_align is not verified")
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Now that we've converted the dquot logging machinery to attach the dquot
buffer to the li_buf pointer so that the AIL dqflush doesn't have to
allocate or read buffers in a reclaim path, do the same for the
quotacheck code so that the reclaim shrinker dqflush call doesn't have
to do that either.
Cc: <stable@vger.kernel.org> # v6.12
Fixes: 903edea6c5 ("mm: warn about illegal __GFP_NOFAIL usage in a more appropriate location and manner")
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Ever since 6.12-rc1, I've observed a pile of warnings from the kernel
when running fstests with quotas enabled:
WARNING: CPU: 1 PID: 458580 at mm/page_alloc.c:4221 __alloc_pages_noprof+0xc9c/0xf18
CPU: 1 UID: 0 PID: 458580 Comm: xfsaild/sda3 Tainted: G W 6.12.0-rc6-djwa #rc6 6ee3e0e531f6457e2d26aa008a3b65ff184b377c
<snip>
Call trace:
__alloc_pages_noprof+0xc9c/0xf18
alloc_pages_mpol_noprof+0x94/0x240
alloc_pages_noprof+0x68/0xf8
new_slab+0x3e0/0x568
___slab_alloc+0x5a0/0xb88
__slab_alloc.constprop.0+0x7c/0xf8
__kmalloc_noprof+0x404/0x4d0
xfs_buf_get_map+0x594/0xde0 [xfs 384cb02810558b4c490343c164e9407332118f88]
xfs_buf_read_map+0x64/0x2e0 [xfs 384cb02810558b4c490343c164e9407332118f88]
xfs_trans_read_buf_map+0x1dc/0x518 [xfs 384cb02810558b4c490343c164e9407332118f88]
xfs_qm_dqflush+0xac/0x468 [xfs 384cb02810558b4c490343c164e9407332118f88]
xfs_qm_dquot_logitem_push+0xe4/0x148 [xfs 384cb02810558b4c490343c164e9407332118f88]
xfsaild+0x3f4/0xde8 [xfs 384cb02810558b4c490343c164e9407332118f88]
kthread+0x110/0x128
ret_from_fork+0x10/0x20
---[ end trace 0000000000000000 ]---
This corresponds to the line:
WARN_ON_ONCE(current->flags & PF_MEMALLOC);
within the NOFAIL checks. What's happening here is that the XFS AIL is
trying to write a disk quota update back into the filesystem, but for
that it needs to read the ondisk buffer for the dquot. The buffer is
not in memory anymore, probably because it was evicted. Regardless, the
buffer cache tries to allocate a new buffer, but those allocations are
NOFAIL. The AIL thread has marked itself PF_MEMALLOC (aka noreclaim)
since commit 43ff2122e6 ("xfs: on-stack delayed write buffer lists")
presumably because reclaim can push on XFS to push on the AIL.
An easy way to fix this probably would have been to drop the NOFAIL flag
from the xfs_buf allocation and open code a retry loop, but then there's
still the problem that for bs>ps filesystems, the buffer itself could
require up to 64k worth of pages.
Inode items had similar behavior (multi-page cluster buffers that we
don't want to allocate in the AIL) which we solved by making transaction
precommit attach the inode cluster buffers to the dirty log item. Let's
solve the dquot problem in the same way.
So: Make a real precommit handler to read the dquot buffer and attach it
to the log item; pass it to dqflush in the push method; and have the
iodone function detach the buffer once we've flushed everything. Add a
state flag to the log item to track when a thread has entered the
precommit -> push mechanism to skip the detaching if it turns out that
the dquot is very busy, as we don't hold the dquot lock between log item
commit and AIL push).
Reading and attaching the dquot buffer in the precommit hook is inspired
by the work done for inode cluster buffers some time ago.
Cc: <stable@vger.kernel.org> # v6.12
Fixes: 903edea6c5 ("mm: warn about illegal __GFP_NOFAIL usage in a more appropriate location and manner")
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Clean up these functions a little bit before we move on to the real
modifications, and make the variable naming consistent for dquot log
items.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
The first step towards holding the dquot buffer in the li_buf instead of
reading it in the AIL is to separate the part that reads the buffer from
the actual flush code. There should be no functional changes.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Quota counter updates are tracked via incore objects which hang off the
xfs_trans object. These changes are then turned into dirty log items in
xfs_trans_apply_dquot_deltas just prior to commiting the log items to
the CIL.
However, updating the incore deltas do not cause XFS_TRANS_DIRTY to be
set on the transaction. In other words, a pure quota counter update
will be silently discarded if there are no other dirty log items
attached to the transaction.
This is currently not the case anywhere in the filesystem because quota
updates always dirty at least one other metadata item, but a subsequent
bug fix will add dquot log item precommits, so we actually need a dirty
dquot log item prior to xfs_trans_run_precommits. Also let's not leave
a logic bomb.
Cc: <stable@vger.kernel.org> # v2.6.35
Fixes: 0924378a68 ("xfs: split out iclog writing from xfs_trans_commit()")
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Superblock counter updates are tracked via per-transaction counters in
the xfs_trans object. These changes are then turned into dirty log
items in xfs_trans_apply_sb_deltas just prior to commiting the log items
to the CIL.
However, updating the per-transaction counter deltas do not cause
XFS_TRANS_DIRTY to be set on the transaction. In other words, a pure sb
counter update will be silently discarded if there are no other dirty
log items attached to the transaction.
This is currently not the case anywhere in the filesystem because sb
counter updates always dirty at least one other metadata item, but let's
not leave a logic bomb.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Currently, __xfs_trans_commit calls xfs_defer_finish_noroll, which calls
__xfs_trans_commit again on the same transaction. In other words,
there's a nested function call (albeit with slightly different
arguments) that has caused minor amounts of confusion in the past.
There's no reason to keep this around, since there's only one place
where we actually want the xfs_defer_finish_noroll, and that is in the
top level xfs_trans_commit call.
This also reduces stack usage a little bit.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Committing a transaction tx0 with a defer ops chain of (A, B, C)
creates a chain of transactions that looks like this:
tx0 -> txA -> txB -> txC
Prior to commit cb04211748, __xfs_trans_commit would run precommits
on tx0, then call xfs_defer_finish_noroll to convert A-C to tx[A-C].
Unfortunately, after the finish_noroll loop we forgot to run precommits
on txC. That was fixed by adding the second precommit call.
Unfortunately, none of us remembered that xfs_defer_finish_noroll
calls __xfs_trans_commit a second time to commit tx0 before finishing
work A in txA and committing that. In other words, we run precommits
twice on tx0:
xfs_trans_commit(tx0)
__xfs_trans_commit(tx0, false)
xfs_trans_run_precommits(tx0)
xfs_defer_finish_noroll(tx0)
xfs_trans_roll(tx0)
txA = xfs_trans_dup(tx0)
__xfs_trans_commit(tx0, true)
xfs_trans_run_precommits(tx0)
This currently isn't an issue because the inode item precommit is
idempotent; the iunlink item precommit deletes itself so it can't be
called again; and the buffer/dquot item precommits only check the incore
objects for corruption. However, it doesn't make sense to run
precommits twice.
Fix this situation by only running precommits after finish_noroll.
Cc: <stable@vger.kernel.org> # v6.4
Fixes: cb04211748 ("xfs: defered work could create precommits")
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Debugging a filesystem patch with generic/475 caused the system to hang
after observing the following sequences in dmesg:
XFS (dm-0): metadata I/O error in "xfs_imap_to_bp+0x61/0xe0 [xfs]" at daddr 0x491520 len 32 error 5
XFS (dm-0): metadata I/O error in "xfs_btree_read_buf_block+0xba/0x160 [xfs]" at daddr 0x3445608 len 8 error 5
XFS (dm-0): metadata I/O error in "xfs_imap_to_bp+0x61/0xe0 [xfs]" at daddr 0x138e1c0 len 32 error 5
XFS (dm-0): log I/O error -5
XFS (dm-0): Metadata I/O Error (0x1) detected at xfs_trans_read_buf_map+0x1ea/0x4b0 [xfs] (fs/xfs/xfs_trans_buf.c:311). Shutting down filesystem.
XFS (dm-0): Please unmount the filesystem and rectify the problem(s)
XFS (dm-0): Internal error dqp->q_ino.reserved < dqp->q_ino.count at line 869 of file fs/xfs/xfs_trans_dquot.c. Caller xfs_trans_dqresv+0x236/0x440 [xfs]
XFS (dm-0): Corruption detected. Unmount and run xfs_repair
XFS (dm-0): Unmounting Filesystem be6bcbcc-9921-4deb-8d16-7cc94e335fa7
The system is stuck in unmount trying to lock a couple of inodes so that
they can be purged. The dquot corruption notice above is a clue to what
happened -- a link() call tried to set up a transaction to link a child
into a directory. Quota reservation for the transaction failed after IO
errors shut down the filesystem, but then we forgot to unlock the inodes
on our way out. Fix that.
Cc: <stable@vger.kernel.org> # v6.10
Fixes: bd5562111d ("xfs: Hold inode locks in xfs_trans_alloc_dir")
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Fix a minor mistakes in the scrub tracepoints that can manifest when
inode-rooted btrees are enabled. The existing code worked fine for bmap
btrees, but we should tighten the code up to be less sloppy.
Cc: <stable@vger.kernel.org> # v5.7
Fixes: 92219c292a ("xfs: convert btree cursor inode-private member names")
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
In commit 2c813ad66a, I partially fixed a bug wherein xfs_btree_insrec
would erroneously try to update the parent's key for a block that had
been split if we decided to insert the new record into the new block.
The solution was to detect this situation and update the in-core key
value that we pass up to the caller so that the caller will (eventually)
add the new block to the parent level of the tree with the correct key.
However, I missed a subtlety about the way inode-rooted btrees work. If
the full block was a maximally sized inode root block, we'll solve that
fullness by moving the root block's records to a new block, resizing the
root block, and updating the root to point to the new block. We don't
pass a pointer to the new block to the caller because that work has
already been done. The new record will /always/ land in the new block,
so in this case we need to use xfs_btree_update_keys to update the keys.
This bug can theoretically manifest itself in the very rare case that we
split a bmbt root block and the new record lands in the very first slot
of the new block, though I've never managed to trigger it in practice.
However, it is very easy to reproduce by running generic/522 with the
realtime rmapbt patchset if rtinherit=1.
Cc: <stable@vger.kernel.org> # v4.8
Fixes: 2c813ad66a ("xfs: support btrees with overlapping intervals for keys")
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
smatch reported that we screwed up the error cleanup in this function.
Fix it.
Cc: <stable@vger.kernel.org> # v6.13-rc1
Fixes: ae897e0bed ("xfs: support creating per-RTG files in growfs")
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>