46761 Commits

Author SHA1 Message Date
Peter Zijlstra
2a77e4be12 sched/fair: Untangle NEXT_BUDDY and pick_next_task()
There are 3 sites using set_next_buddy() and only one is conditional
on NEXT_BUDDY, the other two sites are unconditional; to note:

  - yield_to_task()
  - cgroup dequeue / pick optimization

However, having NEXT_BUDDY control both the wakeup-preemption and the
picking side of things means its near useless.

Fixes: 147f3efaa241 ("sched/fair: Implement an EEVDF-like scheduling policy")
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20241129101541.GA33464@noisy.programming.kicks-ass.net
2024-12-09 11:48:13 +01:00
Andy Shevchenko
95d9fed3a2 sched/fair: Mark m*_vruntime() with __maybe_unused
When max_vruntime() is unused, it prevents kernel builds with clang,
`make W=1` and CONFIG_WERROR=y:

kernel/sched/fair.c:526:19: error: unused function 'max_vruntime' [-Werror,-Wunused-function]
  526 | static inline u64 max_vruntime(u64 max_vruntime, u64 vruntime)
      |                   ^~~~~~~~~~~~

Fix this by marking them with __maybe_unused (all cases for the sake of
symmetry).

See also commit 6863f5643dd7 ("kbuild: allow Clang to find unused static
inline functions for W=1 build").

Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20241202173546.634433-1-andriy.shevchenko@linux.intel.com
2024-12-09 11:48:13 +01:00
Vincent Guittot
0429489e09 sched/fair: Fix variable declaration position
Move variable declaration at the beginning of the function

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Link: https://lore.kernel.org/r/20241202174606.4074512-12-vincent.guittot@linaro.org
2024-12-09 11:48:13 +01:00
Vincent Guittot
61b82dfb6b sched/fair: Do not try to migrate delayed dequeue task
Migrating a delayed dequeued task doesn't help in balancing the number
of runnable tasks in the system.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Link: https://lore.kernel.org/r/20241202174606.4074512-11-vincent.guittot@linaro.org
2024-12-09 11:48:13 +01:00
Vincent Guittot
736c55a02c sched/fair: Rename cfs_rq.nr_running into nr_queued
Rename cfs_rq.nr_running into cfs_rq.nr_queued which better reflects the
reality as the value includes both the ready to run tasks and the delayed
dequeue tasks.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Link: https://lore.kernel.org/r/20241202174606.4074512-10-vincent.guittot@linaro.org
2024-12-09 11:48:12 +01:00
Vincent Guittot
43eef7c3a4 sched/fair: Remove unused cfs_rq.idle_nr_running
cfs_rq.idle_nr_running field is not used anywhere so we can remove the
useless associated computation. Last user went in commit 5e963f2bd465
("sched/fair: Commit to EEVDF").

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Link: https://lore.kernel.org/r/20241202174606.4074512-9-vincent.guittot@linaro.org
2024-12-09 11:48:12 +01:00
Vincent Guittot
31898e7b87 sched/fair: Rename cfs_rq.idle_h_nr_running into h_nr_idle
Use same naming convention as others starting with h_nr_* and rename
idle_h_nr_running into h_nr_idle.
The "running" is not correct anymore as it includes delayed dequeue tasks
as well.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Link: https://lore.kernel.org/r/20241202174606.4074512-8-vincent.guittot@linaro.org
2024-12-09 11:48:12 +01:00
Vincent Guittot
9216582b0b sched/fair: Removed unsued cfs_rq.h_nr_delayed
h_nr_delayed is not used anymore. We now have:
 - h_nr_runnable which tracks tasks ready to run
 - h_nr_queued which tracks enqueued tasks either ready to run or
   delayed dequeue

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Link: https://lore.kernel.org/r/20241202174606.4074512-7-vincent.guittot@linaro.org
2024-12-09 11:48:11 +01:00
Vincent Guittot
1a49104496 sched/fair: Use the new cfs_rq.h_nr_runnable
Use the new h_nr_runnable that tracks only queued and runnable tasks in the
statistics that are used to balance the system:

 - PELT runnable_avg
 - deciding if a group is overloaded or has spare capacity
 - numa stats
 - reduced capacity management
 - load balance
 - nohz kick

It should be noticed that the rq->nr_running still counts the delayed
dequeued tasks as delayed dequeue is a fair feature that is meaningless
at core level.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Link: https://lore.kernel.org/r/20241202174606.4074512-6-vincent.guittot@linaro.org
2024-12-09 11:48:11 +01:00
Vincent Guittot
c2a295bffe sched/fair: Add new cfs_rq.h_nr_runnable
With delayed dequeued feature, a sleeping sched_entity remains queued in
the rq until its lag has elapsed. As a result, it stays also visible
in the statistics that are used to balance the system and in particular
the field cfs.h_nr_queued when the sched_entity is associated to a task.

Create a new h_nr_runnable that tracks only queued and runnable tasks.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Link: https://lore.kernel.org/r/20241202174606.4074512-5-vincent.guittot@linaro.org
2024-12-09 11:48:11 +01:00
Vincent Guittot
7b8a702d94 sched/fair: Rename h_nr_running into h_nr_queued
With delayed dequeued feature, a sleeping sched_entity remains queued
in the rq until its lag has elapsed but can't run.
Rename h_nr_running into h_nr_queued to reflect this new behavior.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Link: https://lore.kernel.org/r/20241202174606.4074512-4-vincent.guittot@linaro.org
2024-12-09 11:48:11 +01:00
Peter Zijlstra
40c3b94fbb Merge branch 'sched/urgent'
Sync with urgent bits as a base for further work.

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
2024-12-09 11:48:10 +01:00
Peter Zijlstra
76f2f78329 sched/eevdf: More PELT vs DELAYED_DEQUEUE
Vincent and Dietmar noted that while
commit fc1892becd56 ("sched/eevdf: Fixup PELT vs DELAYED_DEQUEUE") fixes
the entity runnable stats, it does not adjust the cfs_rq runnable stats,
which are based off of h_nr_running.

Track h_nr_delayed such that we can discount those and adjust the
signal.

Fixes: fc1892becd56 ("sched/eevdf: Fixup PELT vs DELAYED_DEQUEUE")
Closes: https://lore.kernel.org/lkml/a9a45193-d0c6-4ba2-a822-464ad30b550e@arm.com/
Closes: https://lore.kernel.org/lkml/CAKfTPtCNUvWE_GX5LyvTF-WdxUT=ZgvZZv-4t=eWntg5uOFqiQ@mail.gmail.com/
[ Fixes checkpatch warnings and rebased ]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reported-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Reported-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: "Peter Zijlstra (Intel)" <peterz@infradead.org>
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Link: https://lore.kernel.org/r/20241202174606.4074512-3-vincent.guittot@linaro.org
2024-12-09 11:48:09 +01:00
Vincent Guittot
c1f43c342e sched/fair: Fix sched_can_stop_tick() for fair tasks
We can't stop the tick of a rq if there are at least 2 tasks enqueued in
the whole hierarchy and not only at the root cfs rq.

rq->cfs.nr_running tracks the number of sched_entity at one level
whereas rq->cfs.h_nr_running tracks all queued tasks in the
hierarchy.

Fixes: 11cc374f4643b ("sched_ext: Simplify scx_can_stop_tick() invocation in sched_can_stop_tick()")
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Link: https://lore.kernel.org/r/20241202174606.4074512-2-vincent.guittot@linaro.org
2024-12-09 11:48:09 +01:00
K Prateek Nayak
493afbd187 sched/fair: Fix NEXT_BUDDY
Adam reports that enabling NEXT_BUDDY insta triggers a WARN in
pick_next_entity().

Moving clear_buddies() up before the delayed dequeue bits ensures
no ->next buddy becomes delayed. Further ensure no new ->next buddy
ever starts as delayed.

Fixes: 152e11f6df29 ("sched/fair: Implement delayed dequeue")
Reported-by: Adam Li <adamli@os.amperecomputing.com>
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Adam Li <adamli@os.amperecomputing.com>
Link: https://lkml.kernel.org/r/670a0d54-e398-4b1f-8a6e-90784e2fdf89@amd.com
2024-12-09 11:48:09 +01:00
Wardenjohn
3dae09de40 livepatch: Add stack_order sysfs attribute
Add "stack_order" sysfs attribute which holds the order in which a live
patch module was loaded into the system. A user can then determine an
active live patched version of a function.

cat /sys/kernel/livepatch/livepatch_1/stack_order -> 1

means that livepatch_1 is the first live patch applied

cat /sys/kernel/livepatch/livepatch_module/stack_order -> N

means that livepatch_module is the Nth live patch applied

Suggested-by: Petr Mladek <pmladek@suse.com>
Suggested-by: Miroslav Benes <mbenes@suse.cz>
Suggested-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Wardenjohn <zhangwarden@gmail.com>
Acked-by: Josh Poimboeuf <jpoimboe@kernel.org>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Tested-by: Petr Mladek <pmladek@suse.com>
Reviewed-by: Miroslav Benes <mbenes@suse.cz>
Link: https://lore.kernel.org/r/20241008014856.3729-2-zhangwarden@gmail.com
[pmladek@suse.com: Updated kernel version and date in the ABI documentation.]
Signed-off-by: Petr Mladek <pmladek@suse.com>
2024-12-09 11:44:03 +01:00
Alexei Starovoitov
442bc81bd3 Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Cross-merge bpf fixes after downstream PR.

Trivial conflict:
tools/testing/selftests/bpf/prog_tests/verifier.c

Adjacent changes in:
Auto-merging kernel/bpf/verifier.c
Auto-merging samples/bpf/Makefile
Auto-merging tools/testing/selftests/bpf/.gitignore
Auto-merging tools/testing/selftests/bpf/Makefile
Auto-merging tools/testing/selftests/bpf/prog_tests/verifier.c

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-12-08 17:01:51 -08:00
Linus Torvalds
eadaac4dd2 - Fix a /proc/interrupts formatting regression
- Have the BCM2836 interrupt controller enter power management states properly
 
 - Other fixlets
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmdVeI0ACgkQEsHwGGHe
 VUpE9A/9Gs2K9ImUlQu5giKEM2dhu0U/oEN/KNKayd6mjSR4MWEovvMeE16M2AUA
 uALQVxCviJKy6vl5J2aYw5QIhktfGwrdmr4zGK61pfhGWD8dClaCxJpbfUvtx1Bu
 1K9U8PpSh1grScbcjDFBOIIzHsmQgpHFTJxVwktLXEktJsWJyzhxqvdiduMXvDlF
 T5WOIr/A5MwGtoP0kySQA49k0ymgUkXf1UgAl7nTksLEI50SMo3Wt7vpQ+qLk6xD
 3RdywlARaVOLY5GlRsFtUTZP6o06/8aDrEmrUHnOiip9u7pKzMiJR4aKzlScCdTN
 BRKEegLxgJqA0uvQAfYP7kCP4l5fFAAVHmcenZCuXDlcexJsnLxYJChxJKUK7CAt
 wCMZifysp13aF3gyT6BfNKEPZOEqFDykxUvzT/F1d2t0Z7yq1GHIXcDHC8eG0p2H
 3f/YOSi+5KgDTi+xzT1hOPn3HcTADhF7wrj2oppOmJa/FyJVrRVeL+DP+uNPD+ux
 tMlVknBp3nlYQ4Ll773GZmvdaamkUfw9U5eZRi04CrvVOcanq/KjKIL24bq9ODS0
 i6dgPfZPwYI+nuenGv7VEpgo92/oynqGxuShESk1c1S/rCL+D2kMiNaRvR9cVsU3
 +Gvr2v9B0KqLy79gw9Fhdqzbhnzhkcpa72B4JNJwEtHdO+t4iwI=
 =Uyoi
 -----END PGP SIGNATURE-----

Merge tag 'irq_urgent_for_v6.13_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull irq fixes from Borislav Petkov:

 - Fix a /proc/interrupts formatting regression

 - Have the BCM2836 interrupt controller enter power management states
   properly

 - Other fixlets

* tag 'irq_urgent_for_v6.13_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  irqchip/stm32mp-exti: CONFIG_STM32MP_EXTI should not default to y when compile-testing
  genirq/proc: Add missing space separator back
  irqchip/bcm2836: Enable SKIP_SET_WAKE and MASK_ON_SUSPEND
  irqchip/gic-v3: Fix irq_complete_ack() comment
2024-12-08 11:54:04 -08:00
Linus Torvalds
c25ca0c2e4 - Handle the case where clocksources with small counter width can, in
conjunction with overly long idle sleeps, falsely trigger the negative
   motion detection of clocksources
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmdVb44ACgkQEsHwGGHe
 VUr/bA/+L5vH39rL4IGQ05LZ/TaZC5iSadsWeW26UccbV+5jw4q2hmVterH3TXDp
 ujsvPmRLPN/tjCdwMNKDRmVTc5nBoO+OHBErf+o3mqZMI8XdXz7bOEvNhjnlIPnT
 ILu/c8h2g3bHpFBYHsJjyhrIkfvfspG/yeD6V7SXI1r0StqRfoo6QmbbZwYpdrn3
 +ORs8TdW6GEp7gJhdCzSXxzbfXnvtRsBZvsBLxoIter9Kqd+pFpVxj7CqoHWEiBM
 NQHN/2DG3uczoVVtOB7VK9edAYlpe9mzokB4wRClXo21D7JFze0m2TJGJ3hf9eRZ
 RbzZea0CQNa11NUlxoRUrN+jG/CHjnptNFycRJIEtb6YgKyoizJ/x8CBOWI3a8nU
 NTGBIwXAeYcYrrsP5f3bmDRcks9OO+E2quZiGJZorq1zDxzxnPs8ALmtwCB64UhD
 ro0VAT1d7JuMdnmFUKSwf35nLydnUiBqRC2cb03jMZAze+YmFCgMh5xjtzTPw+WE
 QDIR9Eu6ebSM80ldXGISHYn1wHxQVFtJ2cN9gmV+Lnaekys6huDQkARCoWfLhdYc
 CiqZSvlMPis+VAhkglwnHlxc/mGpfFTZXh47oXzNbXw2J2bWhhbyzl9X8UiSqw+A
 UOfq/JRis2j6NsmKkGvTU/hGtEPOzXw2EStAuXm+OVg5TDKwr74=
 =mARf
 -----END PGP SIGNATURE-----

Merge tag 'timers_urgent_for_v6.13_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull timer fix from Borislav Petkov:

 - Handle the case where clocksources with small counter width can,
   in conjunction with overly long idle sleeps, falsely trigger the
   negative motion detection of clocksources

* tag 'timers_urgent_for_v6.13_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  clocksource: Make negative motion detection more robust
2024-12-08 11:51:29 -08:00
Linus Torvalds
553c89ec31 24 hotfixes. 17 are cc:stable. 15 are MM and 9 are non-MM.
The usual bunch of singletons - please see the relevant changelogs for
 details.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZ1U/QwAKCRDdBJ7gKXxA
 jnE7AQC0eyNNvaL5pLCIxN/Vmr8YeuWP1dldgI29TjrH/JKjSQEAihZNqVZYjoIT
 Gf7Y+IKnc4LbfAXcTe+MfJFeDexM5AU=
 =U5LQ
 -----END PGP SIGNATURE-----

Merge tag 'mm-hotfixes-stable-2024-12-07-22-39' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull misc fixes from Andrew Morton:
 "24 hotfixes.  17 are cc:stable.  15 are MM and 9 are non-MM.

  The usual bunch of singletons - please see the relevant changelogs for
  details"

* tag 'mm-hotfixes-stable-2024-12-07-22-39' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (24 commits)
  iio: magnetometer: yas530: use signed integer type for clamp limits
  sched/numa: fix memory leak due to the overwritten vma->numab_state
  mm/damon: fix order of arguments in damos_before_apply tracepoint
  lib: stackinit: hide never-taken branch from compiler
  mm/filemap: don't call folio_test_locked() without a reference in next_uptodate_folio()
  scatterlist: fix incorrect func name in kernel-doc
  mm: correct typo in MMAP_STATE() macro
  mm: respect mmap hint address when aligning for THP
  mm: memcg: declare do_memsw_account inline
  mm/codetag: swap tags when migrate pages
  ocfs2: update seq_file index in ocfs2_dlm_seq_next
  stackdepot: fix stack_depot_save_flags() in NMI context
  mm: open-code page_folio() in dump_page()
  mm: open-code PageTail in folio_flags() and const_folio_flags()
  mm: fix vrealloc()'s KASAN poisoning logic
  Revert "readahead: properly shorten readahead when falling back to do_page_cache_ra()"
  selftests/damon: add _damon_sysfs.py to TEST_FILES
  selftest: hugetlb_dio: fix test naming
  ocfs2: free inode when ocfs2_get_init_inode() fails
  nilfs2: fix potential out-of-bounds memory access in nilfs_find_entry()
  ...
2024-12-08 11:26:13 -08:00
Masami Hiramatsu (Google)
7dbc10961c tracing/dynevent: Adopt guard() and scoped_guard()
Use guard() or scoped_guard() in dynamic events for critical sections
rather than discrete lock/unlock pairs.

Link: https://lore.kernel.org/all/173289891992.73724.9491350426847416169.stgit@devnote2/

Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2024-12-08 23:28:06 +09:00
Masami Hiramatsu (Google)
36364aa071 tracing/eprobe: Adopt guard() and scoped_guard()
Use guard() or scoped_guard() in eprobe events for critical sections
rather than discrete lock/unlock pairs.

Link: https://lore.kernel.org/all/173289890996.73724.17421347964110362029.stgit@devnote2/

Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2024-12-08 23:27:53 +09:00
Masami Hiramatsu (Google)
c776572f1e tracing/uprobe: Adopt guard() and scoped_guard()
Use guard() or scoped_guard() in uprobe events for critical sections
rather than discrete lock/unlock pairs.

Link: https://lore.kernel.org/all/173289889911.73724.12457932738419630525.stgit@devnote2/

Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2024-12-08 23:27:41 +09:00
Masami Hiramatsu (Google)
93bc1de21e tracing/kprobe: Adopt guard() and scoped_guard()
Use guard() or scoped_guard() in kprobe events for critical sections
rather than discrete lock/unlock pairs.

Link: https://lore.kernel.org/all/173289888883.73724.6586200652276577583.stgit@devnote2/

Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2024-12-08 23:27:25 +09:00
Masami Hiramatsu (Google)
b6b68a3e64 kprobes: Adopt guard() and scoped_guard()
Use guard() or scoped_guard() for critical sections rather than
discrete lock/unlock pairs.

Link: https://lore.kernel.org/all/173289887835.73724.608223217359025939.stgit@devnote2/

Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2024-12-08 23:27:12 +09:00
Thomas Weißschuh
e77ab93e0e kprobes: Reduce preempt disable scope in check_kprobe_access_safe()
Commit a189d0350f387 ("kprobes: disable preempt for module_text_address() and kernel_text_address()")
introduced a preempt_disable() region to protect against concurrent
module unloading. However this region also includes the call to
jump_label_text_reserved() which takes a long time;
up to 400us, iterating over approx 6000 jump tables.

The scope protected by preempt_disable() is largen than necessary.
core_kernel_text() does not need to be protected as it does not interact
with module code at all.
Only the scope from __module_text_address() to try_module_get() needs to
be protected.
By limiting the critical section to __module_text_address() and
try_module_get() the function responsible for the latency spike remains
preemptible.

This works fine even when !CONFIG_MODULES as in that case
try_module_get() will always return true and that block can be optimized
away.

Limit the critical section to __module_text_address() and
try_module_get(). Use guard(preempt)() for easier error handling.

While at it also remove a spurious *probed_mod = NULL in an error
path. On errors the output parameter is never inspected by the caller.
Some error paths were clearing the parameters, some didn't.
Align them for clarity.

Link: https://lore.kernel.org/all/20241121-kprobes-preempt-v1-1-fd581ee7fcbb@linutronix.de/

Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de>
Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2024-12-08 23:26:49 +09:00
Masami Hiramatsu (Google)
494b332064 tracing/eprobe: Fix to release eprobe when failed to add dyn_event
Fix eprobe event to unregister event call and release eprobe when it fails
to add dynamic event correctly.

Link: https://lore.kernel.org/all/173289886698.73724.1959899350183686006.stgit@devnote2/

Fixes: 7491e2c44278 ("tracing: Add a probe that attaches to trace events")
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2024-12-08 23:25:09 +09:00
Linus Torvalds
b5f217084a BPF fixes:
- Fix several issues for BPF LPM trie map which were found by
   syzbot and during addition of new test cases (Hou Tao)
 
 - Fix a missing process_iter_arg register type check in the
   BPF verifier (Kumar Kartikeya Dwivedi, Tao Lyu)
 
 - Fix several correctness gaps in the BPF verifier when
   interacting with the BPF stack without CAP_PERFMON
   (Kumar Kartikeya Dwivedi, Eduard Zingerman, Tao Lyu)
 
 - Fix OOB BPF map writes when deleting elements for the case of
   xsk map as well as devmap (Maciej Fijalkowski)
 
 - Fix xsk sockets to always clear DMA mapping information when
   unmapping the pool (Larysa Zaremba)
 
 - Fix sk_mem_uncharge logic in tcp_bpf_sendmsg to only uncharge
   after sent bytes have been finalized (Zijian Zhang)
 
 - Fix BPF sockmap with vsocks which was missing a queue check
   in poll and sockmap cleanup on close (Michal Luczaj)
 
 - Fix tools infra to override makefile ARCH variable if defined
   but empty, which addresses cross-building tools. (Björn Töpel)
 
 - Fix two resolve_btfids build warnings on unresolved bpf_lsm
   symbols (Thomas Weißschuh)
 
 - Fix a NULL pointer dereference in bpftool (Amir Mohammadi)
 
 - Fix BPF selftests to check for CONFIG_PREEMPTION instead of
   CONFIG_PREEMPT (Sebastian Andrzej Siewior)
 
 Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
 -----BEGIN PGP SIGNATURE-----
 
 iIsEABYKADMWIQTFp0I1jqZrAX+hPRXbK58LschIgwUCZ1N8bhUcZGFuaWVsQGlv
 Z2VhcmJveC5uZXQACgkQ2yufC7HISIO6ZAD+ITpujJgxvFGC0R7E9o3XJ7V1SpmR
 SlW0lGpj6vOHTUAA/2MRoZurJSTbdT3fbWiCUgU1rMcwkoErkyxUaPuBci0D
 =kgXL
 -----END PGP SIGNATURE-----

Merge tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf

Pull bpf fixes from Daniel Borkmann::

 - Fix several issues for BPF LPM trie map which were found by syzbot
   and during addition of new test cases (Hou Tao)

 - Fix a missing process_iter_arg register type check in the BPF
   verifier (Kumar Kartikeya Dwivedi, Tao Lyu)

 - Fix several correctness gaps in the BPF verifier when interacting
   with the BPF stack without CAP_PERFMON (Kumar Kartikeya Dwivedi,
   Eduard Zingerman, Tao Lyu)

 - Fix OOB BPF map writes when deleting elements for the case of xsk map
   as well as devmap (Maciej Fijalkowski)

 - Fix xsk sockets to always clear DMA mapping information when
   unmapping the pool (Larysa Zaremba)

 - Fix sk_mem_uncharge logic in tcp_bpf_sendmsg to only uncharge after
   sent bytes have been finalized (Zijian Zhang)

 - Fix BPF sockmap with vsocks which was missing a queue check in poll
   and sockmap cleanup on close (Michal Luczaj)

 - Fix tools infra to override makefile ARCH variable if defined but
   empty, which addresses cross-building tools. (Björn Töpel)

 - Fix two resolve_btfids build warnings on unresolved bpf_lsm symbols
   (Thomas Weißschuh)

 - Fix a NULL pointer dereference in bpftool (Amir Mohammadi)

 - Fix BPF selftests to check for CONFIG_PREEMPTION instead of
   CONFIG_PREEMPT (Sebastian Andrzej Siewior)

* tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf: (31 commits)
  selftests/bpf: Add more test cases for LPM trie
  selftests/bpf: Move test_lpm_map.c to map_tests
  bpf: Use raw_spinlock_t for LPM trie
  bpf: Switch to bpf mem allocator for LPM trie
  bpf: Fix exact match conditions in trie_get_next_key()
  bpf: Handle in-place update for full LPM trie correctly
  bpf: Handle BPF_EXIST and BPF_NOEXIST for LPM trie
  bpf: Remove unnecessary kfree(im_node) in lpm_trie_update_elem
  bpf: Remove unnecessary check when updating LPM trie
  selftests/bpf: Add test for narrow spill into 64-bit spilled scalar
  selftests/bpf: Add test for reading from STACK_INVALID slots
  selftests/bpf: Introduce __caps_unpriv annotation for tests
  bpf: Fix narrow scalar spill onto 64-bit spilled scalar slots
  bpf: Don't mark STACK_INVALID as STACK_MISC in mark_stack_slot_misc
  samples/bpf: Remove unnecessary -I flags from libbpf EXTRA_CFLAGS
  bpf: Zero index arg error string for dynptr and iter
  selftests/bpf: Add tests for iter arg check
  bpf: Ensure reg is PTR_TO_STACK in process_iter_arg
  tools: Override makefile ARCH variable if defined, but empty
  selftests/bpf: Add apply_bytes test to test_txmsg_redir_wait_sndmem in test_sockmap
  ...
2024-12-06 15:07:48 -08:00
Paul Moore
8bea8f86f7 Automated merge of 'dev' into 'next'
* dev:
  audit: fix suffixed '/' filename matching
2024-12-06 16:56:48 -05:00
Hou Tao
6a5c63d43c bpf: Use raw_spinlock_t for LPM trie
After switching from kmalloc() to the bpf memory allocator, there will be
no blocking operation during the update of LPM trie. Therefore, change
trie->lock from spinlock_t to raw_spinlock_t to make LPM trie usable in
atomic context, even on RT kernels.

The max value of prefixlen is 2048. Therefore, update or deletion
operations will find the target after at most 2048 comparisons.
Constructing a test case which updates an element after 2048 comparisons
under a 8 CPU VM, and the average time and the maximal time for such
update operation is about 210us and 900us.

Signed-off-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20241206110622.1161752-8-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-12-06 09:14:26 -08:00
Hou Tao
3d8dc43eb2 bpf: Switch to bpf mem allocator for LPM trie
Multiple syzbot warnings have been reported. These warnings are mainly
about the lock order between trie->lock and kmalloc()'s internal lock.
See report [1] as an example:

======================================================
WARNING: possible circular locking dependency detected
6.10.0-rc7-syzkaller-00003-g4376e966ecb7 #0 Not tainted
------------------------------------------------------
syz.3.2069/15008 is trying to acquire lock:
ffff88801544e6d8 (&n->list_lock){-.-.}-{2:2}, at: get_partial_node ...

but task is already holding lock:
ffff88802dcc89f8 (&trie->lock){-.-.}-{2:2}, at: trie_update_elem ...

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-> #1 (&trie->lock){-.-.}-{2:2}:
       __raw_spin_lock_irqsave
       _raw_spin_lock_irqsave+0x3a/0x60
       trie_delete_elem+0xb0/0x820
       ___bpf_prog_run+0x3e51/0xabd0
       __bpf_prog_run32+0xc1/0x100
       bpf_dispatcher_nop_func
       ......
       bpf_trace_run2+0x231/0x590
       __bpf_trace_contention_end+0xca/0x110
       trace_contention_end.constprop.0+0xea/0x170
       __pv_queued_spin_lock_slowpath+0x28e/0xcc0
       pv_queued_spin_lock_slowpath
       queued_spin_lock_slowpath
       queued_spin_lock
       do_raw_spin_lock+0x210/0x2c0
       __raw_spin_lock_irqsave
       _raw_spin_lock_irqsave+0x42/0x60
       __put_partials+0xc3/0x170
       qlink_free
       qlist_free_all+0x4e/0x140
       kasan_quarantine_reduce+0x192/0x1e0
       __kasan_slab_alloc+0x69/0x90
       kasan_slab_alloc
       slab_post_alloc_hook
       slab_alloc_node
       kmem_cache_alloc_node_noprof+0x153/0x310
       __alloc_skb+0x2b1/0x380
       ......

-> #0 (&n->list_lock){-.-.}-{2:2}:
       check_prev_add
       check_prevs_add
       validate_chain
       __lock_acquire+0x2478/0x3b30
       lock_acquire
       lock_acquire+0x1b1/0x560
       __raw_spin_lock_irqsave
       _raw_spin_lock_irqsave+0x3a/0x60
       get_partial_node.part.0+0x20/0x350
       get_partial_node
       get_partial
       ___slab_alloc+0x65b/0x1870
       __slab_alloc.constprop.0+0x56/0xb0
       __slab_alloc_node
       slab_alloc_node
       __do_kmalloc_node
       __kmalloc_node_noprof+0x35c/0x440
       kmalloc_node_noprof
       bpf_map_kmalloc_node+0x98/0x4a0
       lpm_trie_node_alloc
       trie_update_elem+0x1ef/0xe00
       bpf_map_update_value+0x2c1/0x6c0
       map_update_elem+0x623/0x910
       __sys_bpf+0x90c/0x49a0
       ...

other info that might help us debug this:

 Possible unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock(&trie->lock);
                               lock(&n->list_lock);
                               lock(&trie->lock);
  lock(&n->list_lock);

 *** DEADLOCK ***

[1]: https://syzkaller.appspot.com/bug?extid=9045c0a3d5a7f1b119f7

A bpf program attached to trace_contention_end() triggers after
acquiring &n->list_lock. The program invokes trie_delete_elem(), which
then acquires trie->lock. However, it is possible that another
process is invoking trie_update_elem(). trie_update_elem() will acquire
trie->lock first, then invoke kmalloc_node(). kmalloc_node() may invoke
get_partial_node() and try to acquire &n->list_lock (not necessarily the
same lock object). Therefore, lockdep warns about the circular locking
dependency.

Invoking kmalloc() before acquiring trie->lock could fix the warning.
However, since BPF programs call be invoked from any context (e.g.,
through kprobe/tracepoint/fentry), there may still be lock ordering
problems for internal locks in kmalloc() or trie->lock itself.

To eliminate these potential lock ordering problems with kmalloc()'s
internal locks, replacing kmalloc()/kfree()/kfree_rcu() with equivalent
BPF memory allocator APIs that can be invoked in any context. The lock
ordering problems with trie->lock (e.g., reentrance) will be handled
separately.

Three aspects of this change require explanation:

1. Intermediate and leaf nodes are allocated from the same allocator.
Since the value size of LPM trie is usually small, using a single
alocator reduces the memory overhead of the BPF memory allocator.

2. Leaf nodes are allocated before disabling IRQs. This handles cases
where leaf_size is large (e.g., > 4KB - 8) and updates require
intermediate node allocation. If leaf nodes were allocated in
IRQ-disabled region, the free objects in BPF memory allocator would not
be refilled timely and the intermediate node allocation may fail.

3. Paired migrate_{disable|enable}() calls for node alloc and free. The
BPF memory allocator uses per-CPU struct internally, these paired calls
are necessary to guarantee correctness.

Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20241206110622.1161752-7-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-12-06 09:14:26 -08:00
Hou Tao
27abc7b3fa bpf: Fix exact match conditions in trie_get_next_key()
trie_get_next_key() uses node->prefixlen == key->prefixlen to identify
an exact match, However, it is incorrect because when the target key
doesn't fully match the found node (e.g., node->prefixlen != matchlen),
these two nodes may also have the same prefixlen. It will return
expected result when the passed key exist in the trie. However when a
recently-deleted key or nonexistent key is passed to
trie_get_next_key(), it may skip keys and return incorrect result.

Fix it by using node->prefixlen == matchlen to identify exact matches.
When the condition is true after the search, it also implies
node->prefixlen equals key->prefixlen, otherwise, the search would
return NULL instead.

Fixes: b471f2f1de8b ("bpf: implement MAP_GET_NEXT_KEY command for LPM_TRIE map")
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20241206110622.1161752-6-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-12-06 09:14:26 -08:00
Hou Tao
532d6b36b2 bpf: Handle in-place update for full LPM trie correctly
When a LPM trie is full, in-place updates of existing elements
incorrectly return -ENOSPC.

Fix this by deferring the check of trie->n_entries. For new insertions,
n_entries must not exceed max_entries. However, in-place updates are
allowed even when the trie is full.

Fixes: b95a5c4db09b ("bpf: add a longest prefix match trie map implementation")
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20241206110622.1161752-5-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-12-06 09:14:26 -08:00
Hou Tao
eae6a075e9 bpf: Handle BPF_EXIST and BPF_NOEXIST for LPM trie
Add the currently missing handling for the BPF_EXIST and BPF_NOEXIST
flags. These flags can be specified by users and are relevant since LPM
trie supports exact matches during update.

Fixes: b95a5c4db09b ("bpf: add a longest prefix match trie map implementation")
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20241206110622.1161752-4-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-12-06 09:14:26 -08:00
Hou Tao
3d5611b4d7 bpf: Remove unnecessary kfree(im_node) in lpm_trie_update_elem
There is no need to call kfree(im_node) when updating element fails,
because im_node must be NULL. Remove the unnecessary kfree() for
im_node.

Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20241206110622.1161752-3-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-12-06 09:14:25 -08:00
Hou Tao
156c977c53 bpf: Remove unnecessary check when updating LPM trie
When "node->prefixlen == matchlen" is true, it means that the node is
fully matched. If "node->prefixlen == key->prefixlen" is false, it means
the prefix length of key is greater than the prefix length of node,
otherwise, matchlen will not be equal with node->prefixlen. However, it
also implies that the prefix length of node must be less than
max_prefixlen.

Therefore, "node->prefixlen == trie->max_prefixlen" will always be false
when the check of "node->prefixlen == key->prefixlen" returns false.
Remove this unnecessary comparison.

Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20241206110622.1161752-2-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-12-06 09:14:25 -08:00
Adrian Huang
5f1b64e9a9 sched/numa: fix memory leak due to the overwritten vma->numab_state
[Problem Description]
When running the hackbench program of LTP, the following memory leak is
reported by kmemleak.

  # /opt/ltp/testcases/bin/hackbench 20 thread 1000
  Running with 20*40 (== 800) tasks.

  # dmesg | grep kmemleak
  ...
  kmemleak: 480 new suspected memory leaks (see /sys/kernel/debug/kmemleak)
  kmemleak: 665 new suspected memory leaks (see /sys/kernel/debug/kmemleak)

  # cat /sys/kernel/debug/kmemleak
  unreferenced object 0xffff888cd8ca2c40 (size 64):
    comm "hackbench", pid 17142, jiffies 4299780315
    hex dump (first 32 bytes):
      ac 74 49 00 01 00 00 00 4c 84 49 00 01 00 00 00  .tI.....L.I.....
      00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
    backtrace (crc bff18fd4):
      [<ffffffff81419a89>] __kmalloc_cache_noprof+0x2f9/0x3f0
      [<ffffffff8113f715>] task_numa_work+0x725/0xa00
      [<ffffffff8110f878>] task_work_run+0x58/0x90
      [<ffffffff81ddd9f8>] syscall_exit_to_user_mode+0x1c8/0x1e0
      [<ffffffff81dd78d5>] do_syscall_64+0x85/0x150
      [<ffffffff81e0012b>] entry_SYSCALL_64_after_hwframe+0x76/0x7e
  ...

This issue can be consistently reproduced on three different servers:
  * a 448-core server
  * a 256-core server
  * a 192-core server

[Root Cause]
Since multiple threads are created by the hackbench program (along with
the command argument 'thread'), a shared vma might be accessed by two or
more cores simultaneously. When two or more cores observe that
vma->numab_state is NULL at the same time, vma->numab_state will be
overwritten.

Although current code ensures that only one thread scans the VMAs in a
single 'numa_scan_period', there might be a chance for another thread
to enter in the next 'numa_scan_period' while we have not gotten till
numab_state allocation [1].

Note that the command `/opt/ltp/testcases/bin/hackbench 50 process 1000`
cannot the reproduce the issue. It is verified with 200+ test runs.

[Solution]
Use the cmpxchg atomic operation to ensure that only one thread executes
the vma->numab_state assignment.

[1] https://lore.kernel.org/lkml/1794be3c-358c-4cdc-a43d-a1f841d91ef7@amd.com/

Link: https://lkml.kernel.org/r/20241113102146.2384-1-ahuang12@lenovo.com
Fixes: ef6a22b70f6d ("sched/numa: apply the scan delay to every new vma")
Signed-off-by: Adrian Huang <ahuang12@lenovo.com>
Reported-by: Jiwei Sun <sunjw10@lenovo.com>
Reviewed-by: Raghavendra K T <raghavendra.kt@amd.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Ben Segall <bsegall@google.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-12-05 19:54:48 -08:00
Alexander Lobakin
7cd1107f48 bpf, xdp: constify some bpf_prog * function arguments
In lots of places, bpf_prog pointer is used only for tracing or other
stuff that doesn't modify the structure itself. Same for net_device.
Address at least some of them and add `const` attributes there. The
object code didn't change, but that may prevent unwanted data
modifications and also allow more helpers to have const arguments.

Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-12-05 18:41:06 -08:00
Ricardo Robaina
e92eebb0d6 audit: fix suffixed '/' filename matching
When the user specifies a directory to delete with the suffix '/',
the audit record fails to collect the filename, resulting in the
following logs:

 type=PATH msg=audit(10/30/2024 14:11:17.796:6304) : item=2 name=(null)
 type=PATH msg=audit(10/30/2024 14:11:17.796:6304) : item=1 name=(null)

It happens because the value of the variables dname, and n->name->name
in __audit_inode_child() differ only by the suffix '/'. This commit
treats this corner case by handling pathname's trailing slashes in
audit_compare_dname_path().

Steps to reproduce the issue:

 # auditctl -w /tmp
 $ mkdir /tmp/foo
 $ rm -r /tmp/foo/
 # ausearch -i | grep PATH | tail -3

The first version of this patch was based on a GitHub patch/PR by
user @hqh2010 [1].

Link: https://github.com/linux-audit/audit-kernel/pull/148 [1]

Suggested-by: Paul Moore <paul@paul-moore.com>
Signed-off-by: Ricardo Robaina <rrobaina@redhat.com>
Reviewed-by: Richard Guy Briggs <rgb@redhat.com>
[PM: subject tweak, trim old metadata]
Signed-off-by: Paul Moore <paul@paul-moore.com>
2024-12-05 19:22:38 -05:00
Linus Torvalds
b8f52214c6 audit/stable-6.13 PR 20241205
-----BEGIN PGP SIGNATURE-----
 
 iQJIBAABCAAyFiEES0KozwfymdVUl37v6iDy2pc3iXMFAmdSMGQUHHBhdWxAcGF1
 bC1tb29yZS5jb20ACgkQ6iDy2pc3iXNAZg//aDeY3r9tIBdE8FKXdhLFUuggkBAL
 BY4LR67FGFTklelO6oy+mGPxuby06BkwL3VLo086req55vL19pUOceNnBtWA6d1G
 ncZz6qnmFquEXTO5aDXrCIf9fG9zSSdd/D+sDjeUpq/35YJlXnEuyUMY88nm/sAQ
 LKSYOhiooTNRkE6MK60Wd9c6geAi8ER3dxO7l0agD3FGRrK3TOQkK2R/WsOcZTkE
 HtVI/s1EL+ao0s/UiY+xGuRhOgujJ0Gtokqc29m5F6a8I6SiXtXpG6okZK/7KCv0
 ta/87U3VGqXeTCf2aWKWkcDZQLZLajggJthWx0vb3OsA6ppON3FCkKqtSShqhphY
 7QQZV+CWXhHabwDhm4E5UrKu5JDb9wa1duyDnuVXLoPYFAfgMcukDrkrc9PWiKCr
 rbfV/tuwQO3XRKp643sXRcnn6lyM9KPyG77kmxyOD8duYdwX/B5Szf0o8DROwXz0
 /2htu2ZH9UPfwrQhf/MduKuN7Izz+vHFzMnMZZ2l0CVQXVgYq86kcQ4BS7C+ruXw
 PUD+7K1l1Q4tK0mk43K8I5zUqIf8wHERgYCiGkCwJew+q4rea7LNmE4vXvEnLl4W
 sMMYwtFynnNuEedMY3rhk66iz9+epZyou9W3VMoNKiwMdm5DCsPcuSX88zYUDsrP
 TFwMQjfqgpzOtFM=
 =RX05
 -----END PGP SIGNATURE-----

Merge tag 'audit-pr-20241205' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit

Pull audit build problem workaround from Paul Moore:
 "A minor audit patch that shuffles some code slightly to workaround a
  GCC bug affecting a number of people.

  The GCC folks have been able to reproduce the problem and are
  discussing solutions (see the bug report link in the commit), but
  since the workaround is trivial let's do that in the kernel so we can
  unblock people who are hitting this"

* tag 'audit-pr-20241205' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit:
  audit: workaround a GCC bug triggered by task comm changes
2024-12-05 15:11:39 -08:00
Linus Torvalds
9d6a414ad3 tracing fixes for v6.13:
- Fix trace histogram sort function cmp_entries_dup()
 
   The sort function cmp_entries_dup() returns either 1 or 0, and not
   -1 if parameter "a" is less than "b" by memcmp().
 
 - Fix archs that call trace_hardirqs_off() without RCU watching
 
   Both x86 and arm64 no longer call any tracepoints with RCU not
   watching. It was assumed that it was safe to get rid of
   trace_*_rcuidle() version of the tracepoint calls. This was needed
   to get rid of the SRCU protection and be able to implement features
   like faultable traceponits and add rust tracepoints.
 
   Unfortunately, there were a few architectures that still relied on
   that logic. There's only one file that has tracepoints that are
   called without RCU watching. Add macro logic around the tracepoints
   for architectures that do not have CONFIG_ARCH_WANTS_NO_INSTR defined
   will check if the code is in the idle path (the only place RCU isn't
   watching), and enable RCU around calling the tracepoint, but only
   do it if the tracepoint is enabled.
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCZ1G5gxQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qnsXAQCzFHRbTrrmSmvKRHdWxUxhlYjKALHA
 v6DCySLgdNtv0QD8D5hHeGzhVXUhECG0mUcduZ7wvaym+yAQWU5V9gUcRwU=
 =E8i1
 -----END PGP SIGNATURE-----

Merge tag 'trace-v6.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull tracing fixes from Steven Rostedt:

 - Fix trace histogram sort function cmp_entries_dup()

   The sort function cmp_entries_dup() returns either 1 or 0, and not -1
   if parameter "a" is less than "b" by memcmp().

 - Fix archs that call trace_hardirqs_off() without RCU watching

   Both x86 and arm64 no longer call any tracepoints with RCU not
   watching. It was assumed that it was safe to get rid of
   trace_*_rcuidle() version of the tracepoint calls. This was needed to
   get rid of the SRCU protection and be able to implement features like
   faultable traceponits and add rust tracepoints.

   Unfortunately, there were a few architectures that still relied on
   that logic. There's only one file that has tracepoints that are
   called without RCU watching. Add macro logic around the tracepoints
   for architectures that do not have CONFIG_ARCH_WANTS_NO_INSTR defined
   will check if the code is in the idle path (the only place RCU isn't
   watching), and enable RCU around calling the tracepoint, but only do
   it if the tracepoint is enabled.

* tag 'trace-v6.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  tracing: Fix archs that still call tracepoints without RCU watching
  tracing: Fix cmp_entries_dup() to respect sort() comparison rules
2024-12-05 10:17:55 -08:00
Thomas Gleixner
76031d9536 clocksource: Make negative motion detection more robust
Guenter reported boot stalls on a emulated ARM 32-bit platform, which has a
24-bit wide clocksource.

It turns out that the calculated maximal idle time, which limits idle
sleeps to prevent clocksource wrap arounds, is close to the point where the
negative motion detection triggers.

  max_idle_ns:                    597268854 ns
  negative motion tripping point: 671088640 ns

If the idle wakeup is delayed beyond that point, the clocksource
advances far enough to trigger the negative motion detection. This
prevents the clock to advance and in the worst case the system stalls
completely if the consecutive sleeps based on the stale clock are
delayed as well.

Cure this by calculating a more robust cut-off value for negative motion,
which covers 87.5% of the actual clocksource counter width. Compare the
delta against this value to catch negative motion. This is specifically for
clock sources with a small counter width as their wrap around time is close
to the half counter width. For clock sources with wide counters this is not
a problem because the maximum idle time is far from the half counter width
due to the math overflow protection constraints.

For the case at hand this results in a tripping point of 1174405120ns.

Note, that this cannot prevent issues when the delay exceeds the 87.5%
margin, but that's not different from the previous unchecked version which
allowed arbitrary time jumps.

Systems with small counter width are prone to invalid results, but this
problem is unlikely to be seen on real hardware. If such a system
completely stalls for more than half a second, then there are other more
urgent problems than the counter wrapping around.

Fixes: c163e40af9b2 ("timekeeping: Always check for negative motion")
Reported-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Guenter Roeck <linux@roeck-us.net>
Link: https://lore.kernel.org/all/8734j5ul4x.ffs@tglx
Closes: https://lore.kernel.org/all/387b120b-d68a-45e8-b6ab-768cd95d11c2@roeck-us.net
2024-12-05 16:03:24 +01:00
Steven Rostedt
dc1b157b82 tracing: Fix archs that still call tracepoints without RCU watching
Tracepoints require having RCU "watching" as it uses RCU to do updates to
the tracepoints. There are some cases that would call a tracepoint when
RCU was not "watching". This was usually in the idle path where RCU has
"shutdown". For the few locations that had tracepoints without RCU
watching, there was an trace_*_rcuidle() variant that could be used. This
used SRCU for protection.

There are tracepoints that trace when interrupts and preemption are
enabled and disabled. In some architectures, these tracepoints are called
in a path where RCU is not watching. When x86 and arm64 removed these
locations, it was incorrectly assumed that it would be safe to remove the
trace_*_rcuidle() variant and also remove the SRCU logic, as it made the
code more complex and harder to implement new tracepoint features (like
faultable tracepoints and tracepoints in rust).

Instead of bringing back the trace_*_rcuidle(), as it will not be trivial
to do as new code has already been added depending on its removal, add a
workaround to the one file that still requires it (trace_preemptirq.c). If
the architecture does not define CONFIG_ARCH_WANTS_NO_INSTR, then check if
the code is in the idle path, and if so, call ct_irq_enter/exit() which
will enable RCU around the tracepoint.

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Link: https://lore.kernel.org/20241204100414.4d3e06d0@gandalf.local.home
Reported-by: Geert Uytterhoeven <geert@linux-m68k.org>
Fixes: 48bcda684823 ("tracing: Remove definition of trace_*_rcuidle()")
Closes: https://lore.kernel.org/all/bddb02de-957a-4df5-8e77-829f55728ea2@roeck-us.net/
Acked-by: Paul E. McKenney <paulmck@kernel.org>
Tested-by: Guenter Roeck <linux@roeck-us.net>
Tested-by: Geert Uytterhoeven <geert+renesas@glider.be>
Tested-by: Madhavan Srinivasan <maddy@linux.ibm.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-12-05 09:28:58 -05:00
Mathieu Desnoyers
63a48181fb smp/scf: Evaluate local cond_func() before IPI side-effects
In smp_call_function_many_cond(), the local cond_func() is evaluated
after triggering the remote CPU IPIs.

If cond_func() depends on loading shared state updated by other CPU's
IPI handlers func(), then triggering execution of remote CPUs IPI before
evaluating cond_func() may have unexpected consequences.

One example scenario is evaluating a jiffies delay in cond_func(), which
is updated by func() in the IPI handlers. This situation can prevent
execution of periodic cleanup code on the local CPU.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Rik van Riel <riel@surriel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20241203163558.3455535-1-mathieu.desnoyers@efficios.com
2024-12-05 14:25:28 +01:00
Wolfram Sang
d6482311ef PM: sleep: autosleep: don't include 'pm_wakeup.h' directly
The header clearly states that it does not want to be included directly,
only via 'device.h'. 'platform_device.h' works equally well. Remove the
direct inclusion.

Signed-off-by: Wolfram Sang <wsa+renesas@sang-engineering.com>
Link: https://patch.msgid.link/20241118072917.3853-16-wsa+renesas@sang-engineering.com
[ rjw: Subject edit ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2024-12-05 12:14:26 +01:00
Yafang shao
d9381508ea audit: workaround a GCC bug triggered by task comm changes
A build failure has been reported with the following details:

   In file included from include/linux/string.h:390,
                    from include/linux/bitmap.h:13,
                    from include/linux/cpumask.h:12,
                    from include/linux/smp.h:13,
                    from include/linux/lockdep.h:14,
                    from include/linux/spinlock.h:63,
                    from include/linux/wait.h:9,
                    from include/linux/wait_bit.h:8,
                    from include/linux/fs.h:6,
                    from kernel/auditsc.c:37:
   In function 'sized_strscpy',
       inlined from '__audit_ptrace' at kernel/auditsc.c:2732:2:
>> include/linux/fortify-string.h:293:17:
   error: call to '__write_overflow' declared with attribute error:
   detected write beyond size of object (1st parameter)
     293 |                 __write_overflow();
         |                 ^~~~~~~~~~~~~~~~~~
   In function 'sized_strscpy',
       inlined from 'audit_signal_info_syscall' at kernel/auditsc.c:2759:3:
>> include/linux/fortify-string.h:293:17:
   error: call to '__write_overflow' declared with attribute error:
   detected write beyond size of object (1st parameter)
     293 |                 __write_overflow();
         |                 ^~~~~~~~~~~~~~~~~~

The issue appears to be a GCC bug, though the root cause remains
unclear at this time. For now, let's implement a workaround.

A bug report has also been filed with GCC [0].

Link: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117912 [0]

Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202410171420.1V00ICVG-lkp@intel.com/
Reported-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Closes: https://lore.kernel.org/all/20241128182435.57a1ea6f@gandalf.local.home/
Reported-by: Zhuo, Qiuxu <qiuxu.zhuo@intel.com>
Closes: https://lore.kernel.org/all/CY8PR11MB71348E568DBDA576F17DAFF389362@CY8PR11MB7134.namprd11.prod.outlook.com/
Originally-by: Kees Cook <kees@kernel.org>
Link: https://lore.kernel.org/linux-hardening/202410171059.C2C395030@keescook/
Signed-off-by: Yafang shao <laoar.shao@gmail.com>
Tested-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Tested-by: Yafang Shao <laoar.shao@gmail.com>
[PM: subject tweak, description line wrapping]
Signed-off-by: Paul Moore <paul@paul-moore.com>
2024-12-04 22:57:46 -05:00
Andrea Righi
4572541892 sched_ext: Use the NUMA scheduling domain for NUMA optimizations
Rely on the NUMA scheduling domain topology, instead of accessing NUMA
topology information directly.

There is basically no functional change, but in this way we ensure
consistent use of the same topology information determined by the
scheduling subsystem.

Fixes: f6ce6b949304 ("sched_ext: Do not enable LLC/NUMA optimizations when domains overlap")
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2024-12-04 09:49:56 -10:00
Casey Schaufler
2d470c7781 lsm: replace context+len with lsm_context
Replace the (secctx,seclen) pointer pair with a single
lsm_context pointer to allow return of the LSM identifier
along with the context and context length. This allows
security_release_secctx() to know how to release the
context. Callers have been modified to use or save the
returned data from the new structure.

security_secid_to_secctx() and security_lsmproc_to_secctx()
will now return the length value on success instead of 0.

Cc: netdev@vger.kernel.org
Cc: audit@vger.kernel.org
Cc: netfilter-devel@vger.kernel.org
Cc: Todd Kjos <tkjos@google.com>
Signed-off-by: Casey Schaufler <casey@schaufler-ca.com>
[PM: subject tweak, kdoc fix, signedness fix from Dan Carpenter]
Signed-off-by: Paul Moore <paul@paul-moore.com>
2024-12-04 14:42:31 -05:00
Tao Lyu
b0e66977dc bpf: Fix narrow scalar spill onto 64-bit spilled scalar slots
When CAP_PERFMON and CAP_SYS_ADMIN (allow_ptr_leaks) are disabled, the
verifier aims to reject partial overwrite on an 8-byte stack slot that
contains a spilled pointer.

However, in such a scenario, it rejects all partial stack overwrites as
long as the targeted stack slot is a spilled register, because it does
not check if the stack slot is a spilled pointer.

Incomplete checks will result in the rejection of valid programs, which
spill narrower scalar values onto scalar slots, as shown below.

0: R1=ctx() R10=fp0
; asm volatile ( @ repro.bpf.c:679
0: (7a) *(u64 *)(r10 -8) = 1          ; R10=fp0 fp-8_w=1
1: (62) *(u32 *)(r10 -8) = 1
attempt to corrupt spilled pointer on stack
processed 2 insns (limit 1000000) max_states_per_insn 0 total_states 0 peak_states 0 mark_read 0.

Fix this by expanding the check to not consider spilled scalar registers
when rejecting the write into the stack.

Previous discussion on this patch is at link [0].

  [0]: https://lore.kernel.org/bpf/20240403202409.2615469-1-tao.lyu@epfl.ch

Fixes: ab125ed3ec1c ("bpf: fix check for attempt to corrupt spilled pointer")
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Tao Lyu <tao.lyu@epfl.ch>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20241204044757.1483141-3-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-12-04 09:19:50 -08:00
Kumar Kartikeya Dwivedi
69772f509e bpf: Don't mark STACK_INVALID as STACK_MISC in mark_stack_slot_misc
Inside mark_stack_slot_misc, we should not upgrade STACK_INVALID to
STACK_MISC when allow_ptr_leaks is false, since invalid contents
shouldn't be read unless the program has the relevant capabilities.
The relaxation only makes sense when env->allow_ptr_leaks is true.

However, such conversion in privileged mode becomes unnecessary, as
invalid slots can be read without being upgraded to STACK_MISC.

Currently, the condition is inverted (i.e. checking for true instead of
false), simply remove it to restore correct behavior.

Fixes: eaf18febd6eb ("bpf: preserve STACK_ZERO slots on partial reg spills")
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Reported-by: Tao Lyu <tao.lyu@epfl.ch>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20241204044757.1483141-2-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-12-04 09:19:50 -08:00