46682 Commits

Author SHA1 Message Date
Ingo Molnar
8e94367b8d Merge branch into tip/master: 'sched/core'
# New commits in sched/core:
    7c8cd569ff66 ("docs: Update Schedstat version to 17")
    011b3a14dc66 ("sched/stats: Print domain name in /proc/schedstat")
    1c055a0f5d3b ("sched: Move sched domain name out of CONFIG_SCHED_DEBUG")
    3b2a793ea70f ("sched: Report the different kinds of imbalances in /proc/schedstat")
    c3856c9ce6b8 ("sched/fair: Cleanup in migrate_degrades_locality() to improve readability")
    a430d99e3490 ("sched/fair: Fix value reported by hot tasks pulled in /proc/schedstat")
    ee8118c1f186 ("sched/fair: Update comments after sched_tick() rename.")
    af98d8a36a96 ("sched/fair: Fix CPU bandwidth limit bypass during CPU hotplug")
    7675361ff9a1 ("sched: deadline: Cleanup goto label in pick_earliest_pushable_dl_task")
    7d5265ffcd8b ("rseq: Validate read-only fields under DEBUG_RSEQ config")
    2a77e4be12cb ("sched/fair: Untangle NEXT_BUDDY and pick_next_task()")
    95d9fed3a2ae ("sched/fair: Mark m*_vruntime() with __maybe_unused")
    0429489e0928 ("sched/fair: Fix variable declaration position")
    61b82dfb6b7e ("sched/fair: Do not try to migrate delayed dequeue task")
    736c55a02c47 ("sched/fair: Rename cfs_rq.nr_running into nr_queued")
    43eef7c3a4a6 ("sched/fair: Remove unused cfs_rq.idle_nr_running")
    31898e7b87dd ("sched/fair: Rename cfs_rq.idle_h_nr_running into h_nr_idle")
    9216582b0bfb ("sched/fair: Removed unsued cfs_rq.h_nr_delayed")
    1a49104496d3 ("sched/fair: Use the new cfs_rq.h_nr_runnable")
    c2a295bffeaf ("sched/fair: Add new cfs_rq.h_nr_runnable")
    7b8a702d9438 ("sched/fair: Rename h_nr_running into h_nr_queued")
    c907cd44a108 ("sched: Unify HK_TYPE_{TIMER|TICK|MISC} to HK_TYPE_KERNEL_NOISE")
    6010d245ddc9 ("sched/isolation: Consolidate housekeeping cpumasks that are always identical")
    1174b9344bc7 ("sched/isolation: Make "isolcpus=nohz" equivalent to "nohz_full"")
    ae5c677729e9 ("sched/core: Remove HK_TYPE_SCHED")
    a76328d44c7a ("sched/fair: Remove CONFIG_CFS_BANDWIDTH=n definition of cfs_bandwidth_used()")
    3a181f20fb4e ("sched/deadline: Consolidate Timer Cancellation")
    53916d5fd3c0 ("sched/deadline: Check bandwidth overflow earlier for hotplug")
    d4742f6ed7ea ("sched/deadline: Correctly account for allocated bandwidth during hotplug")
    41d4200b7103 ("sched/deadline: Restore dl_server bandwidth on non-destructive root domain changes")
    59297e2093ce ("sched: add READ_ONCE to task_on_rq_queued")
    108ad0999085 ("sched: Don't try to catch up excess steal time.")

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2025-01-13 18:13:43 +01:00
Ingo Molnar
b431780cdd Merge branch into tip/master: 'perf/core'
# New commits in perf/core:
    b709eb872e19 ("perf: map pages in advance")
    6d642735cdb6 ("perf/x86/intel/uncore: Support more units on Granite Rapids")
    3f710be02ea6 ("perf/x86/intel/uncore: Clean up func_id")
    0e45818ec189 ("perf/x86/intel: Support RDPMC metrics clear mode")
    02c56362a7d3 ("uprobes: Guard against kmemdup() failing in dup_return_instance()")
    d29e744c7167 ("perf/x86: Relax privilege filter restriction on AMD IBS")
    6057b90ecc84 ("perf/core: Export perf_exclude_event()")
    8622e45b5da1 ("uprobes: Reuse return_instances between multiple uretprobes within task")
    0cf981de7687 ("uprobes: Ensure return_instance is detached from the list before freeing")
    636666a1c733 ("uprobes: Decouple return_instance list traversal and freeing")
    2ff913ab3f47 ("uprobes: Simplify session consumer tracking")
    e0925f2dc4de ("uprobes: add speculative lockless VMA-to-inode-to-uprobe resolution")
    83e3dc9a5d4d ("uprobes: simplify find_active_uprobe_rcu() VMA checks")
    03a001b156d2 ("mm: introduce mmap_lock_speculate_{try_begin|retry}")
    eb449bd96954 ("mm: convert mm_lock_seq to a proper seqcount")
    7528585290a1 ("mm/gup: Use raw_seqcount_try_begin()")
    96450ead1652 ("seqlock: add raw_seqcount_try_begin")
    b4943b8bfc41 ("perf/x86/rapl: Add core energy counter support for AMD CPUs")
    54d2759778c1 ("perf/x86/rapl: Move the cntr_mask to rapl_pmus struct")
    bdc57ec70548 ("perf/x86/rapl: Remove the global variable rapl_msrs")
    abf03d9bd20c ("perf/x86/rapl: Modify the generic variable names to *_pkg*")
    eeca4c6b2529 ("perf/x86/rapl: Add arguments to the init and cleanup functions")
    cd29d83a6d81 ("perf/x86/rapl: Make rapl_model struct global")
    8bf1c86e5ac8 ("perf/x86/rapl: Rename rapl_pmu variables")
    1d5e2f637a94 ("perf/x86/rapl: Remove the cpu_to_rapl_pmu() function")
    e4b444347795 ("x86/topology: Introduce topology_logical_core_id()")
    2f2db347071a ("perf/x86/rapl: Remove the unused get_rapl_pmu_cpumask() function")
    ae55e308bde2 ("perf/x86/intel/ds: Simplify the PEBS records processing for adaptive PEBS")
    3c00ed344cef ("perf/x86/intel/ds: Factor out functions for PEBS records processing")
    7087bfb0adc9 ("perf/x86/intel/ds: Clarify adaptive PEBS processing")
    faac6f105ef1 ("perf/core: Check sample_type in perf_sample_save_brstack")
    f226805bc5f6 ("perf/core: Check sample_type in perf_sample_save_callchain")
    b9c44b91476b ("perf/core: Save raw sample data conditionally based on sample type")

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2025-01-13 18:13:43 +01:00
Ingo Molnar
90df9792d9 Merge branch into tip/master: 'locking/core'
# New commits in locking/core:
    cb4ccc70344c ("MAINTAINERS: Add static_call_inline.c to STATIC BRANCH/CALL")
    a937f384c9da ("cleanup, tags: Create tags for the cleanup primitives")
    abfdccd6af2b ("sched/wake_q: Add helper to call wake_up_q after unlock with preemption disabled")
    fbd7a5a0359b ("rust: sync: Add lock::Backend::assert_is_held()")
    eb5ccb038284 ("rust: sync: Add SpinLockGuard type alias")
    37624dde4768 ("rust: sync: Add MutexGuard type alias")
    daa03fe50ec3 ("rust: sync: Make Guard::new() public")
    15abc88057ee ("rust: sync: Add Lock::from_raw() for Lock<(), B>")
    9793c9bb91f1 ("locking: MAINTAINERS: Start watching Rust locking primitives")
    343060092585 ("lockdep: Move lockdep_assert_locked() under #ifdef CONFIG_PROVE_LOCKING")
    8148fa2e022b ("lockdep: Mark chain_hlock_class_idx() with __maybe_unused")
    bd7b5ae26618 ("lockdep: Document MAX_LOCKDEP_CHAIN_HLOCKS calculation")
    88a79e88a97c ("lockdep: Clarify size for LOCKDEP_*_BITS configs")
    e638072e6172 ("lockdep: Fix upper limit for LOCKDEP_*_BITS configs")
    0d3547df6934 ("locking/ww_mutex/test: Use swap() macro")
    63a48181fbcd ("smp/scf: Evaluate local cond_func() before IPI side-effects")
    d387ceb17149 ("locking/lockdep: Enforce PROVE_RAW_LOCK_NESTING only if ARCH_SUPPORTS_RT")

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2025-01-13 18:13:42 +01:00
Ingo Molnar
7308d9dc81 Merge branch into tip/master: 'irq/core'
# New commits in irq/core:
    b4706d814921 ("genirq/kexec: Prevent redundant IRQ masking by checking state before shutdown")
    bad6722e478f ("kexec: Consolidate machine_kexec_mask_interrupts() implementation")
    429f49ad361c ("genirq: Reuse irq_thread_fn() for forced thread case")
    6f8b79683dfb ("genirq: Move irq_thread_fn() further up in the code")

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2025-01-13 18:13:42 +01:00
Ingo Molnar
e46ca77dd1 Merge branch into tip/master: 'sched/urgent'
# New commits in sched/urgent:
    66951e4860d3 ("sched/fair: Fix update_cfs_group() vs DELAY_DEQUEUE")
    6d71a9c61604 ("sched/fair: Fix EEVDF entity placement bug causing scheduling lag")

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2025-01-13 18:13:40 +01:00
Peter Zijlstra
66951e4860 sched/fair: Fix update_cfs_group() vs DELAY_DEQUEUE
Normally dequeue_entities() will continue to dequeue an empty group entity;
except DELAY_DEQUEUE changes things -- it retains empty entities such that they
might continue to compete and burn off some lag.

However, doing this results in update_cfs_group() re-computing the cgroup
weight 'slice' for an empty group, which it (rightly) figures isn't much at
all. This in turn means that the delayed entity is not competing at the
expected weight. Worse, the very low weight causes its lag to be inflated,
which combined with avg_vruntime() using scale_load_down(), leads to artifacts.

As such, don't adjust the weight for empty group entities and let them compete
at their original weight.

Fixes: 152e11f6df29 ("sched/fair: Implement delayed dequeue")
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20250110115720.GA17405@noisy.programming.kicks-ass.net
2025-01-13 13:50:56 +01:00
Linus Torvalds
a603abe345 - Fix a #GP in the perf user callchain code caused by a race between uprobe
freeing the task and the bpf profiler unwinding the task's user stack
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmeDphkACgkQEsHwGGHe
 VUrFjxAAickP9S3nlduOzjOO9Pa85MUbQ5wgzrpa29KV75xez9w7IWmambBbYkrY
 zxV/vJqVEjuaJki/kqgtPNmp7tHjDBwW/sTqSI8TTeIwogfht4WPPA2YEHR2pDK4
 t8XNEHGnP38o1oJ6j+zLO9vktieJ/T65yZurmGwVfmGpNOIHNBSzCFopGFCXV41k
 WcNi1E3dOgSbAQESvF+J1ZtkcmBXovoyE7k+H5bbuRcoyFF1RhIDvKcGGY5m7FDo
 Cb92wJTbm9kQaWdOc8oa808pyVtmh0wy+1I9dvoQ+sPlhLzy4p32uOpWUlJkpV51
 lZgPO0NunnLlHNL4zK4M7OBphlEbr8JaXQbgDLtn8TnfPKlh1sZ0DWoVcyXqB77g
 cOlsSEDYzSbf/5TKDZMfeh4koEZvtNmDH6SjUYxC6bdfpfd8D5zp8TbvPJ6XmM8m
 tFn4rhTY5rf2+AjgZs16jkpNlDk+pmwXiczxhldMR/U9y5meea96pe+r8HPpQk27
 1t9N0ixt+EhY1xkITYkS06UV/nJJzejbtrCytkh/FLePQCSi+IbgpxUASVHnJSur
 4ctWZTm+1CxZ7SRZ9VEsPYXfRfRtJjPKOqheQR2RNRi9SnBi7AlJfMOffEZqj8/p
 q8C2qtwOlBdxo/t87NnTsvmZE3mfWJBgN2KmO/5YsshRx15qPis=
 =oobp
 -----END PGP SIGNATURE-----

Merge tag 'perf_urgent_for_v6.13_rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull perf fix from Borislav Petkov:

 - Fix a #GP in the perf user callchain code caused by a race between
   uprobe freeing the task and the bpf profiler unwinding the task's
   user stack

* tag 'perf_urgent_for_v6.13_rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  uprobes: Fix race in uprobe_free_utask
2025-01-12 11:57:45 -08:00
Linus Torvalds
a87d1203bb Probes fixes for v6.13-rc6:
- tracing/kprobes: Fix to free trace_kprobe objects at a failure path
   in __trace_kprobe_create() function. This fixes a memory leak.
 -----BEGIN PGP SIGNATURE-----
 
 iQFPBAABCgA5FiEEh7BulGwFlgAOi5DV2/sHvwUrPxsFAmeDLdAbHG1hc2FtaS5o
 aXJhbWF0c3VAZ21haWwuY29tAAoJENv7B78FKz8bOzgH/3ctLQDber/jgWtiwqII
 Y4xEBjXfS+Dia9YOhM/Bu1HgD7dCuyESoxjDpvCRM2oKLV0O5HuMqrNSOmKjqJru
 vH0j6UG6sMnhOzbB/GArkt5NFYrgxvlOqvEoAK7PBem9rf0/cBFbzYMALwzto/pc
 1v2ipv6V29H8aNCNBKgcDU/MlPTR2wpnStpVuXJzVtjlXpFCwbJjhIs2OEs2Xfud
 lzw/QV91h70rgj+YjhI5i/B0h0LVQmAz++8UYPjvdSjUeVnjQz07eaNWb+pHQepV
 lJ8yUoho+cPm1UyVwt7Sw3dDjfSOhcAgpwOUAqWndHQ4Dm6uzsZd2472aRrJRUC3
 NeE=
 =k41d
 -----END PGP SIGNATURE-----

Merge tag 'probes-fixes-v6.13-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull probes fix from Masami Hiramatsu:
 "Fix to free trace_kprobe objects at a failure path in
  __trace_kprobe_create() function. This fixes a memory leak"

* tag 'probes-fixes-v6.13-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  tracing/kprobes: Fix to free objects when failed to copy a symbol
2025-01-11 20:34:12 -08:00
Linus Torvalds
2e3f3090bd sched_ext: Fixes for v6.13-rc6
- Fix corner case bug where ops.dispatch() couldn't extend the execution of
   the current task if SCX_OPS_ENQ_LAST is set.
 
 - Fix ops.cpu_release() not being called when a SCX task is preempted by a
   higher priority sched class task.
 
 - Fix buitin idle mask being incorrectly left as busy after an idle CPU is
   picked and kicked.
 
 - scx_ops_bypass() was unnecessarily using rq_lock() which comes with rq
   pinning related sanity checks which could trigger spuriously. Switch to
   raw_spin_rq_lock().
 -----BEGIN PGP SIGNATURE-----
 
 iIQEABYKACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCZ4Gmpw4cdGpAa2VybmVs
 Lm9yZwAKCRCxYfJx3gVYGVntAP0b4i4PEIkupj9+i8ZzlwqvYX3gFJ7E4v3wmjDp
 1VYdrAD/ZetrhrM+9RyyKpMIDFnN+xE6YbslBSlAzGzgfdsbXA0=
 =zGXi
 -----END PGP SIGNATURE-----

Merge tag 'sched_ext-for-6.13-rc6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext

Pull sched_ext fixes from Tejun Heo:

 - Fix corner case bug where ops.dispatch() couldn't extend the
   execution of the current task if SCX_OPS_ENQ_LAST is set.

 - Fix ops.cpu_release() not being called when a SCX task is preempted
   by a higher priority sched class task.

 - Fix buitin idle mask being incorrectly left as busy after an idle CPU
   is picked and kicked.

 - scx_ops_bypass() was unnecessarily using rq_lock() which comes with
   rq pinning related sanity checks which could trigger spuriously.
   Switch to raw_spin_rq_lock().

* tag 'sched_ext-for-6.13-rc6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext:
  sched_ext: idle: Refresh idle masks during idle-to-idle transitions
  sched_ext: switch class when preempted by higher priority scheduler
  sched_ext: Replace rq_lock() to raw_spin_rq_lock() in scx_ops_bypass()
  sched_ext: keep running prev when prev->scx.slice != 0
2025-01-10 15:11:58 -08:00
Linus Torvalds
58624e4bc8 cgroup: Fixes for v6.13-rc6
All are cpuset changes:
 
 - Fix isolated CPUs leaking into sched domains.
 
 - Remove now unnecessary kernfs active break which can trigger a warning.
 
 - Comment updates.
 -----BEGIN PGP SIGNATURE-----
 
 iIQEABYKACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCZ4Gkug4cdGpAa2VybmVs
 Lm9yZwAKCRCxYfJx3gVYGXRGAQCf9aL+UWZZiVqcvRjBt8z3gxW9HQOCXYXNGlLF
 EKFFuAD+KLox+flPLbgNv9IwZnswv9+SdOTCE1TlT0GQFBPZcQU=
 =suPy
 -----END PGP SIGNATURE-----

Merge tag 'cgroup-for-6.13-rc6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup

Pull cgroup fixes from Tejun Heo:
 "Cpuset fixes:

   - Fix isolated CPUs leaking into sched domains

   - Remove now unnecessary kernfs active break which can trigger a
     warning

   - Comment updates"

* tag 'cgroup-for-6.13-rc6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
  cgroup/cpuset: remove kernfs active break
  cgroup/cpuset: Prevent leakage of isolated CPUs into sched domains
  cgroup/cpuset: Remove stale text
2025-01-10 15:03:02 -08:00
Linus Torvalds
257a8be4e9 workqueue: Fixes for v6.13-rc6
- Add a WARN_ON_ONCE() on queue_delayed_work_on() on an offline CPU as such
   work items won't get executed till the CPU comes back online.
 -----BEGIN PGP SIGNATURE-----
 
 iIQEABYKACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCZ4Gjlw4cdGpAa2VybmVs
 Lm9yZwAKCRCxYfJx3gVYGTNSAQDX7+9ODdDXHEUnViU4QCK6EAsKmp+PHlZLo/0K
 PVm4SQD/QtPj3jwyEhhdRlaL0+IbTyfG3rURxv53XUGl+TJ1qA8=
 =SYtY
 -----END PGP SIGNATURE-----

Merge tag 'wq-for-6.13-rc6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq

Pull workqueue fix from Tejun Heo:

 - Add a WARN_ON_ONCE() on queue_delayed_work_on() on an offline CPU as
   such work items won't get executed till the CPU comes back online

* tag 'wq-for-6.13-rc6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
  workqueue: warn if delayed_work is queued to an offlined cpu.
2025-01-10 14:52:30 -08:00
Andrea Righi
a2a3374c47 sched_ext: idle: Refresh idle masks during idle-to-idle transitions
With the consolidation of put_prev_task/set_next_task(), see
commit 436f3eed5c69 ("sched: Combine the last put_prev_task() and the
first set_next_task()"), we are now skipping the transition between
these two functions when the previous and the next tasks are the same.

As a result, the scx idle state of a CPU is updated only when
transitioning to or from the idle thread. While this is generally
correct, it can lead to uneven and inefficient core utilization in
certain scenarios [1].

A typical scenario involves proactive wake-ups: scx_bpf_pick_idle_cpu()
selects and marks an idle CPU as busy, followed by a wake-up via
scx_bpf_kick_cpu(), without dispatching any tasks. In this case, the CPU
continues running the idle thread, returns to idle, but remains marked
as busy, preventing it from being selected again as an idle CPU (until a
task eventually runs on it and releases the CPU).

For example, running a workload that uses 20% of each CPU, combined with
an scx scheduler using proactive wake-ups, results in the following core
utilization:

 CPU 0: 25.7%
 CPU 1: 29.3%
 CPU 2: 26.5%
 CPU 3: 25.5%
 CPU 4:  0.0%
 CPU 5: 25.5%
 CPU 6:  0.0%
 CPU 7: 10.5%

To address this, refresh the idle state also in pick_task_idle(), during
idle-to-idle transitions, but only trigger ops.update_idle() on actual
state changes to prevent unnecessary updates to the scx scheduler and
maintain balanced state transitions.

With this change in place, the core utilization in the previous example
becomes the following:

 CPU 0: 18.8%
 CPU 1: 19.4%
 CPU 2: 18.0%
 CPU 3: 18.7%
 CPU 4: 19.3%
 CPU 5: 18.9%
 CPU 6: 18.7%
 CPU 7: 19.3%

[1] https://github.com/sched-ext/scx/pull/1139

Fixes: 7c65ae81ea86 ("sched_ext: Don't call put_prev_task_scx() before picking the next task")
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-01-10 12:40:42 -10:00
Imran Khan
da30ba227c workqueue: warn if delayed_work is queued to an offlined cpu.
delayed_work submitted to an offlined cpu, will not get executed,
after the specified delay if the cpu remains offline. If the cpu
never comes online the work will never get executed.
checking for online cpu in __queue_delayed_work, does not sound
like a good idea because to do this reliably we need hotplug lock
and since work may be submitted from atomic contexts, we would
have to use cpus_read_trylock. But if trylock fails we would queue
the work on any cpu and this may not be optimal because our intended
cpu might still be online.

Putting a WARN_ON_ONCE for an already offlined cpu, will indicate users
of queue_delayed_work_on, if they are (wrongly) trying to queue
delayed_work on offlined cpu. Also indicate the problem of using
offlined cpu with queue_delayed_work_on, in its description.

Signed-off-by: Imran Khan <imran.f.khan@oracle.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-01-10 08:33:39 -10:00
Lorenzo Stoakes
b709eb872e perf: map pages in advance
We are adjusting struct page to make it smaller, removing unneeded fields
which correctly belong to struct folio.

Two of those fields are page->index and page->mapping. Perf is currently
making use of both of these. This is unnecessary. This patch eliminates
this.

Perf establishes its own internally controlled memory-mapped pages using
vm_ops hooks. The first page in the mapping is the read/write user control
page, and the rest of the mapping consists of read-only pages.

The VMA is backed by kernel memory either from the buddy allocator or
vmalloc depending on configuration. It is intended to be mapped read/write,
but because it has a page_mkwrite() hook, vma_wants_writenotify() indicates
that it should be mapped read-only.

When a write fault occurs, the provided page_mkwrite() hook,
perf_mmap_fault() (doing double duty handing faults as well) uses the
vmf->pgoff field to determine if this is the first page, allowing for the
desired read/write first page, read-only rest mapping.

For this to work the implementation has to carefully work around faulting
logic. When a page is write-faulted, the fault() hook is called first, then
its page_mkwrite() hook is called (to allow for dirty tracking in file
systems).

On fault we set the folio's mapping in perf_mmap_fault(), this is because
when do_page_mkwrite() is subsequently invoked, it treats a missing mapping
as an indicator that the fault should be retried.

We also set the folio's index so, given the folio is being treated as faux
user memory, it correctly references its offset within the VMA.

This explains why the mapping and index fields are used - but it's not
necessary.

We preallocate pages when perf_mmap() is called for the first time via
rb_alloc(), and further allocate auxiliary pages via rb_aux_alloc() as
needed if the mapping requires it.

This allocation is done in the f_ops->mmap() hook provided in perf_mmap(),
and so we can instead simply map all the memory right away here - there's
no point in handling (read) page faults when we don't demand page nor need
to be notified about them (perf does not).

This patch therefore changes this logic to map everything when the mmap()
hook is called, establishing a PFN map. It implements vm_ops->pfn_mkwrite()
to provide the required read/write vs. read-only behaviour, which does not
require the previously implemented workarounds.

While it is not ideal to use a VM_PFNMAP here, doing anything else will
result in the page_mkwrite() hook need to be provided, which requires the
same page->mapping hack this patch seeks to undo.

It will also result in the pages being treated as folios and placed on the
rmap, which really does not make sense for these mappings.

Semantically it makes sense to establish this as some kind of special
mapping, as the pages are managed by perf and are not strictly user pages,
but currently the only means by which we can do so functionally while
maintaining the required R/W and R/O behaviour is a PFN map.

There should be no change to actual functionality as a result of this
change.

Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20250103153151.124163-1-lorenzo.stoakes@oracle.com
2025-01-10 18:16:50 +01:00
Jiri Olsa
b583ef82b6 uprobes: Fix race in uprobe_free_utask
Max Makarov reported kernel panic [1] in perf user callchain code.

The reason for that is the race between uprobe_free_utask and bpf
profiler code doing the perf user stack unwind and is triggered
within uprobe_free_utask function:
  - after current->utask is freed and
  - before current->utask is set to NULL

 general protection fault, probably for non-canonical address 0x9e759c37ee555c76: 0000 [#1] SMP PTI
 RIP: 0010:is_uprobe_at_func_entry+0x28/0x80
 ...
  ? die_addr+0x36/0x90
  ? exc_general_protection+0x217/0x420
  ? asm_exc_general_protection+0x26/0x30
  ? is_uprobe_at_func_entry+0x28/0x80
  perf_callchain_user+0x20a/0x360
  get_perf_callchain+0x147/0x1d0
  bpf_get_stackid+0x60/0x90
  bpf_prog_9aac297fb833e2f5_do_perf_event+0x434/0x53b
  ? __smp_call_single_queue+0xad/0x120
  bpf_overflow_handler+0x75/0x110
  ...
  asm_sysvec_apic_timer_interrupt+0x1a/0x20
 RIP: 0010:__kmem_cache_free+0x1cb/0x350
 ...
  ? uprobe_free_utask+0x62/0x80
  ? acct_collect+0x4c/0x220
  uprobe_free_utask+0x62/0x80
  mm_release+0x12/0xb0
  do_exit+0x26b/0xaa0
  __x64_sys_exit+0x1b/0x20
  do_syscall_64+0x5a/0x80

It can be easily reproduced by running following commands in
separate terminals:

  # while :; do bpftrace -e 'uprobe:/bin/ls:_start  { printf("hit\n"); }' -c ls; done
  # bpftrace -e 'profile:hz:100000 { @[ustack()] = count(); }'

Fixing this by making sure current->utask pointer is set to NULL
before we start to release the utask object.

[1] https://github.com/grafana/pyroscope/issues/3673

Fixes: cfa7f3d2c526 ("perf,x86: avoid missing caller address in stack traces captured in uprobe")
Reported-by: Max Makarov <maxpain@linux.com>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20250109141440.2692173-1-jolsa@kernel.org
2025-01-10 09:28:01 +01:00
Masami Hiramatsu (Google)
30c8fd31c5 tracing/kprobes: Fix to free objects when failed to copy a symbol
In __trace_kprobe_create(), if something fails it must goto error block
to free objects. But when strdup() a symbol, it returns without that.
Fix it to goto the error block to free objects correctly.

Link: https://lore.kernel.org/all/173643297743.1514810.2408159540454241947.stgit@devnote2/

Fixes: 6212dd29683e ("tracing/kprobes: Use dyn_event framework for kprobe events")
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-01-10 08:57:18 +09:00
Peter Zijlstra
6d71a9c616 sched/fair: Fix EEVDF entity placement bug causing scheduling lag
I noticed this in my traces today:

       turbostat-1222    [006] d..2.   311.935649: reweight_entity: (ffff888108f13e00-ffff88885ef38440-6)
                               { weight: 1048576 avg_vruntime: 3184159639071 vruntime: 3184159640194 (-1123) deadline: 3184162621107 } ->
                               { weight: 2 avg_vruntime: 3184177463330 vruntime: 3184748414495 (-570951165) deadline: 4747605329439 }
       turbostat-1222    [006] d..2.   311.935651: reweight_entity: (ffff888108f13e00-ffff88885ef38440-6)
                               { weight: 2 avg_vruntime: 3184177463330 vruntime: 3184748414495 (-570951165) deadline: 4747605329439 } ->
                               { weight: 1048576 avg_vruntime: 3184176414812 vruntime: 3184177464419 (-1049607) deadline: 3184180445332 }

Which is a weight transition: 1048576 -> 2 -> 1048576.

One would expect the lag to shoot out *AND* come back, notably:

  -1123*1048576/2 = -588775424
  -588775424*2/1048576 = -1123

Except the trace shows it is all off. Worse, subsequent cycles shoot it
out further and further.

This made me have a very hard look at reweight_entity(), and
specifically the ->on_rq case, which is more prominent with
DELAY_DEQUEUE.

And indeed, it is all sorts of broken. While the computation of the new
lag is correct, the computation for the new vruntime, using the new lag
is broken for it does not consider the logic set out in place_entity().

With the below patch, I now see things like:

    migration/12-55      [012] d..3.   309.006650: reweight_entity: (ffff8881e0e6f600-ffff88885f235f40-12)
                               { weight: 977582 avg_vruntime: 4860513347366 vruntime: 4860513347908 (-542) deadline: 4860516552475 } ->
                               { weight: 2 avg_vruntime: 4860528915984 vruntime: 4860793840706 (-264924722) deadline: 6427157349203 }
    migration/14-62      [014] d..3.   309.006698: reweight_entity: (ffff8881e0e6cc00-ffff88885f3b5f40-15)
                               { weight: 2 avg_vruntime: 4874472992283 vruntime: 4939833828823 (-65360836540) deadline: 6316614641111 } ->
                               { weight: 967149 avg_vruntime: 4874217684324 vruntime: 4874217688559 (-4235) deadline: 4874220535650 }

Which isn't perfect yet, but much closer.

Reported-by: Doug Smythies <dsmythies@telus.net>
Reported-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Fixes: eab03c23c2a1 ("sched/eevdf: Fix vruntime adjustment on reweight")
Link: https://lore.kernel.org/r/20250109105959.GA2981@noisy.programming.kicks-ass.net
2025-01-09 12:55:27 +01:00
Chen Ridong
3cb97a927f cgroup/cpuset: remove kernfs active break
A warning was found:

WARNING: CPU: 10 PID: 3486953 at fs/kernfs/file.c:828
CPU: 10 PID: 3486953 Comm: rmdir Kdump: loaded Tainted: G
RIP: 0010:kernfs_should_drain_open_files+0x1a1/0x1b0
RSP: 0018:ffff8881107ef9e0 EFLAGS: 00010202
RAX: 0000000080000002 RBX: ffff888154738c00 RCX: dffffc0000000000
RDX: 0000000000000007 RSI: 0000000000000004 RDI: ffff888154738c04
RBP: ffff888154738c04 R08: ffffffffaf27fa15 R09: ffffed102a8e7180
R10: ffff888154738c07 R11: 0000000000000000 R12: ffff888154738c08
R13: ffff888750f8c000 R14: ffff888750f8c0e8 R15: ffff888154738ca0
FS:  00007f84cd0be740(0000) GS:ffff8887ddc00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000555f9fbe00c8 CR3: 0000000153eec001 CR4: 0000000000370ee0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 kernfs_drain+0x15e/0x2f0
 __kernfs_remove+0x165/0x300
 kernfs_remove_by_name_ns+0x7b/0xc0
 cgroup_rm_file+0x154/0x1c0
 cgroup_addrm_files+0x1c2/0x1f0
 css_clear_dir+0x77/0x110
 kill_css+0x4c/0x1b0
 cgroup_destroy_locked+0x194/0x380
 cgroup_rmdir+0x2a/0x140

It can be explained by:
rmdir 				echo 1 > cpuset.cpus
				kernfs_fop_write_iter // active=0
cgroup_rm_file
kernfs_remove_by_name_ns	kernfs_get_active // active=1
__kernfs_remove					  // active=0x80000002
kernfs_drain			cpuset_write_resmask
wait_event
//waiting (active == 0x80000001)
				kernfs_break_active_protection
				// active = 0x80000001
// continue
				kernfs_unbreak_active_protection
				// active = 0x80000002
...
kernfs_should_drain_open_files
// warning occurs
				kernfs_put_active

This warning is caused by 'kernfs_break_active_protection' when it is
writing to cpuset.cpus, and the cgroup is removed concurrently.

The commit 3a5a6d0c2b03 ("cpuset: don't nest cgroup_mutex inside
get_online_cpus()") made cpuset_hotplug_workfn asynchronous, This change
involves calling flush_work(), which can create a multiple processes
circular locking dependency that involve cgroup_mutex, potentially leading
to a deadlock. To avoid deadlock. the commit 76bb5ab8f6e3 ("cpuset: break
kernfs active protection in cpuset_write_resmask()") added
'kernfs_break_active_protection' in the cpuset_write_resmask. This could
lead to this warning.

After the commit 2125c0034c5d ("cgroup/cpuset: Make cpuset hotplug
processing synchronous"), the cpuset_write_resmask no longer needs to
wait the hotplug to finish, which means that concurrent hotplug and cpuset
operations are no longer possible. Therefore, the deadlock doesn't exist
anymore and it does not have to 'break active protection' now. To fix this
warning, just remove kernfs_break_active_protection operation in the
'cpuset_write_resmask'.

Fixes: bdb2fd7fc56e ("kernfs: Skip kernfs_drain_open_files() more aggressively")
Fixes: 76bb5ab8f6e3 ("cpuset: break kernfs active protection in cpuset_write_resmask()")
Reported-by: Ji Fa <jifa@huawei.com>
Signed-off-by: Chen Ridong <chenridong@huawei.com>
Acked-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-01-08 15:54:39 -10:00
Honglei Wang
68e449d849 sched_ext: switch class when preempted by higher priority scheduler
ops.cpu_release() function, if defined, must be invoked when preempted by
a higher priority scheduler class task. This scenario was skipped in
commit f422316d7466 ("sched_ext: Remove switch_class_scx()"). Let's fix
it.

Fixes: f422316d7466 ("sched_ext: Remove switch_class_scx()")
Signed-off-by: Honglei Wang <jameshongleiwang@126.com>
Acked-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-01-08 06:51:40 -10:00
Changwoo Min
6268d5bc10 sched_ext: Replace rq_lock() to raw_spin_rq_lock() in scx_ops_bypass()
scx_ops_bypass() iterates all CPUs to re-enqueue all the scx tasks.
For each CPU, it acquires a lock using rq_lock() regardless of whether
a CPU is offline or the CPU is currently running a task in a higher
scheduler class (e.g., deadline). The rq_lock() is supposed to be used
for online CPUs, and the use of rq_lock() may trigger an unnecessary
warning in rq_pin_lock(). Therefore, replace rq_lock() to
raw_spin_rq_lock() in scx_ops_bypass().

Without this change, we observe the following warning:

===== START =====
[    6.615205] rq->balance_callback && rq->balance_callback != &balance_push_callback
[    6.615208] WARNING: CPU: 2 PID: 0 at kernel/sched/sched.h:1730 __schedule+0x1130/0x1c90
=====  END  =====

Fixes: 0e7ffff1b811 ("scx: Fix raciness in scx_ops_bypass()")
Signed-off-by: Changwoo Min <changwoo@igalia.com>
Acked-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-01-08 06:48:53 -10:00
Henry Huang
30dd3b13f9 sched_ext: keep running prev when prev->scx.slice != 0
When %SCX_OPS_ENQ_LAST is set and prev->scx.slice != 0,
@prev will be dispacthed into the local DSQ in put_prev_task_scx().
However, pick_task_scx() is executed before put_prev_task_scx(),
so it will not pick @prev.
Set %SCX_RQ_BAL_KEEP in balance_one() to ensure that pick_task_scx()
can pick @prev.

Signed-off-by: Henry Huang <henry.hj@antgroup.com>
Acked-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-01-08 06:48:33 -10:00
Linus Torvalds
fbfd64d25c vfs-6.13-rc7.fixes
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZ3vs1AAKCRCRxhvAZXjc
 omdqAP9Mn4HF85p5X7WRtUgrF7MGQft3EBfWE+sUxCMTc49NGQD/Ti7hqGNleEih
 MmjUjLZSG1e3lFHYQm0nqmjO2RexbQ0=
 =Li7D
 -----END PGP SIGNATURE-----

Merge tag 'vfs-6.13-rc7.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull vfs fixes from Christian Brauner:

 - Relax assertions on failure to encode file handles

   The ->encode_fh() method can fail for various reasons. None of them
   warrant a WARN_ON().

 - Fix overlayfs file handle encoding by allowing encoding an fid from
   an inode without an alias

 - Make sure fuse_dir_open() handles FOPEN_KEEP_CACHE. If it's not
   specified fuse needs to invaludate the directory inode page cache

 - Fix qnx6 so it builds with gcc-15

 - Various fixes for netfslib and ceph and nfs filesystems:
     - Ignore silly rename files from afs and nfs when building header
       archives
     - Fix read result collection in netfslib with multiple subrequests
     - Handle ENOMEM for netfslib buffered reads
     - Fix oops in nfs_netfs_init_request()
     - Parse the secctx command immediately in cachefiles
     - Remove a redundant smp_rmb() in netfslib
     - Handle recursion in read retry in netfslib
     - Fix clearing of folio_queue
     - Fix missing cancellation of copy-to_cache when the cache for a
       file is temporarly disabled in netfslib

 - Sanity check the hfs root record

 - Fix zero padding data issues in concurrent write scenarios

 - Fix is_mnt_ns_file() after converting nsfs to path_from_stashed()

 - Fix missing declaration of init_files

 - Increase I/O priority when writing revoke records in jbd2

 - Flush filesystem device before updating tail sequence in jbd2

* tag 'vfs-6.13-rc7.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (23 commits)
  ovl: support encoding fid from inode with no alias
  ovl: pass realinode to ovl_encode_real_fh() instead of realdentry
  fuse: respect FOPEN_KEEP_CACHE on opendir
  netfs: Fix is-caching check in read-retry
  netfs: Fix the (non-)cancellation of copy when cache is temporarily disabled
  netfs: Fix ceph copy to cache on write-begin
  netfs: Work around recursion by abandoning retry if nothing read
  netfs: Fix missing barriers by using clear_and_wake_up_bit()
  netfs: Remove redundant use of smp_rmb()
  cachefiles: Parse the "secctx" immediately
  nfs: Fix oops in nfs_netfs_init_request() when copying to cache
  netfs: Fix enomem handling in buffered reads
  netfs: Fix non-contiguous donation between completed reads
  kheaders: Ignore silly-rename files
  fs: relax assertions on failure to encode file handles
  fs: fix missing declaration of init_files
  fs: fix is_mnt_ns_file()
  iomap: fix zero padding data issue in concurrent append writes
  iomap: pass byte granular end position to iomap_add_to_ioend
  jbd2: flush filesystem device before updating tail sequence
  ...
2025-01-06 10:26:39 -08:00
Linus Torvalds
5635d8bad2 25 hotfixes. 16 are cc:stable. 18 are MM and 7 are non-MM.
The usual bunch of singletons and two doubletons - please see the relevant
 changelogs for details.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZ3noXwAKCRDdBJ7gKXxA
 jkzRAP9Ejb8kbgCrA3cptnzlVkDCDUm0TmleepT3bx6B2rH0BgEAzSiTXf4ioZPg
 4pOHnKIGOWEVPcVwBrdA0irWG+QPYAQ=
 =nEIZ
 -----END PGP SIGNATURE-----

Merge tag 'mm-hotfixes-stable-2025-01-04-18-02' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull hotfixes from Andrew Morton:
 "25 hotfixes.  16 are cc:stable.  18 are MM and 7 are non-MM.

  The usual bunch of singletons and two doubletons - please see the
  relevant changelogs for details"

* tag 'mm-hotfixes-stable-2025-01-04-18-02' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (25 commits)
  MAINTAINERS: change Arınç _NAL's name and email address
  scripts/sorttable: fix orc_sort_cmp() to maintain symmetry and transitivity
  mm/util: make memdup_user_nul() similar to memdup_user()
  mm, madvise: fix potential workingset node list_lru leaks
  mm/damon/core: fix ignored quota goals and filters of newly committed schemes
  mm/damon/core: fix new damon_target objects leaks on damon_commit_targets()
  mm/list_lru: fix false warning of negative counter
  vmstat: disable vmstat_work on vmstat_cpu_down_prep()
  mm: shmem: fix the update of 'shmem_falloc->nr_unswapped'
  mm: shmem: fix incorrect index alignment for within_size policy
  percpu: remove intermediate variable in PERCPU_PTR()
  mm: zswap: fix race between [de]compression and CPU hotunplug
  ocfs2: fix slab-use-after-free due to dangling pointer dqi_priv
  fs/proc/task_mmu: fix pagemap flags with PMD THP entries on 32bit
  kcov: mark in_softirq_really() as __always_inline
  docs: mm: fix the incorrect 'FileHugeMapped' field
  mailmap: modify the entry for Mathieu Othacehe
  mm/kmemleak: fix sleeping function called from invalid context at print message
  mm: hugetlb: independent PMD page table shared count
  maple_tree: reload mas before the second call for mas_empty_area
  ...
2025-01-05 10:37:45 -08:00
Linus Torvalds
63676eefb7 sched_ext: Fixes for v6.13-rc5
- Fix the bug where bpf_iter_scx_dsq_new() was not initializing the
   iterator's flags and could inadvertently enable e.g. reverse iteration.
 
 - Fix the bug where scx_ops_bypass() could call irq_restore twice.
 
 - Add Andrea and Changwoo as maintainers for better review coverage.
 
 - selftests and tools/sched_ext build and other fixes.
 -----BEGIN PGP SIGNATURE-----
 
 iIQEABYKACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCZ3hpXg4cdGpAa2VybmVs
 Lm9yZwAKCRCxYfJx3gVYGS/lAQDOZDfcJtO1VEsLoPY9NhFHPuBDTfoJyjSi/4mh
 GsjgDAD/Sx0rN6C9S/+ToUjmq3FA+ft0m2+97VqgLwkzwA9YxwI=
 =jaZ6
 -----END PGP SIGNATURE-----

Merge tag 'sched_ext-for-6.13-rc5-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext

Pull sched_ext fixes from Tejun Heo:

 - Fix a bug where bpf_iter_scx_dsq_new() was not initializing the
   iterator's flags and could inadvertently enable e.g. reverse
   iteration

 - Fix a bug where scx_ops_bypass() could call irq_restore twice

 - Add Andrea and Changwoo as maintainers for better review coverage

 - selftests and tools/sched_ext build and other fixes

* tag 'sched_ext-for-6.13-rc5-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext:
  sched_ext: Fix dsq_local_on selftest
  sched_ext: initialize kit->cursor.flags
  sched_ext: Fix invalid irq restore in scx_ops_bypass()
  MAINTAINERS: add me as reviewer for sched_ext
  MAINTAINERS: add self as reviewer for sched_ext
  scx: Fix maximal BPF selftest prog
  sched_ext: fix application of sizeof to pointer
  selftests/sched_ext: fix build after renames in sched_ext API
  sched_ext: Add __weak to fix the build errors
2025-01-03 15:09:12 -08:00
Linus Torvalds
f9aa1fb9f8 workqueue: Fixes for v6.13-rc5
- Suppress a corner case spurious flush dependency warning.
 
 - Two trivial changes.
 -----BEGIN PGP SIGNATURE-----
 
 iIQEABYKACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCZ3hmjA4cdGpAa2VybmVs
 Lm9yZwAKCRCxYfJx3gVYGUrkAP90cajNtGbtFR1J61N4dTSfjBz8L7oQ6GLLyjCB
 MDxvpQD/ViVVpHBl9/jfObk//p6YMBTBD2Zp/aBc3mkKOVhfqws=
 =eUNO
 -----END PGP SIGNATURE-----

Merge tag 'wq-for-6.13-rc5-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq

Pull workqueue fixes from Tejun Heo:

 - Suppress a corner case spurious flush dependency warning

 - Two trivial changes

* tag 'wq-for-6.13-rc5-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
  workqueue: add printf attribute to __alloc_workqueue()
  workqueue: Do not warn when cancelling WQ_MEM_RECLAIM work from !WQ_MEM_RECLAIM worker
  rust: add safety comment in workqueue traits
2025-01-03 15:03:56 -08:00
Linus Torvalds
e30dd219c7 Fixes for ftrace in v6.13:
- Add needed READ_ONCE() around access to the fgraph array element
 
   The updates to the fgraph array can happen when callbacks are registered
   and unregistered. The __ftrace_return_to_handler() can handle reading
   either the old value or the new value. But once it reads that value
   it must stay consistent otherwise the check that looks to see if the
   value is a stub may show false, but if the compiler decides to re-read
   after that check, it can be true which can cause the code to crash
   later on.
 
 - Make function profiler use the top level ops for filtering again
 
   When function graph became available for instances, its filter ops became
   independent from the top level set_ftrace_filter. In the process the
   function profiler received its own filter ops as well. But the function
   profiler uses the top level set_ftrace_filter file and does not have one
   of its own. In giving it its own filter ops, it lost any user interface
   it once had. Make it use the top level set_ftrace_filter file again.
   This fixes a regression.
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCZ3cR4RQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qjxfAQCPhNztdmGmEYmuBtONPHwejidWnuJ6
 Rl2mQxEbp40OUgD+JvSWofhRsvtXWlymqZ9j+dKMegLqMeq834hB0LK4NAg=
 =+KqV
 -----END PGP SIGNATURE-----

Merge tag 'ftrace-v6.13-rc5-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull ftrace fixes from Steven Rostedt:

 - Add needed READ_ONCE() around access to the fgraph array element

   The updates to the fgraph array can happen when callbacks are
   registered and unregistered. The __ftrace_return_to_handler() can
   handle reading either the old value or the new value. But once it
   reads that value it must stay consistent otherwise the check that
   looks to see if the value is a stub may show false, but if the
   compiler decides to re-read after that check, it can be true which
   can cause the code to crash later on.

 - Make function profiler use the top level ops for filtering again

   When function graph became available for instances, its filter ops
   became independent from the top level set_ftrace_filter. In the
   process the function profiler received its own filter ops as well.
   But the function profiler uses the top level set_ftrace_filter file
   and does not have one of its own. In giving it its own filter ops, it
   lost any user interface it once had. Make it use the top level
   set_ftrace_filter file again. This fixes a regression.

* tag 'ftrace-v6.13-rc5-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  ftrace: Fix function profiler's filtering functionality
  fgraph: Add READ_ONCE() when accessing fgraph_array[]
2025-01-03 10:04:43 -08:00
Kohei Enju
789a8cff8d ftrace: Fix function profiler's filtering functionality
Commit c132be2c4fcc ("function_graph: Have the instances use their own
ftrace_ops for filtering"), function profiler (enabled via
function_profile_enabled) has been showing statistics for all functions,
ignoring set_ftrace_filter settings.

While tracers are instantiated, the function profiler is not. Therefore, it
should use the global set_ftrace_filter for consistency.  This patch
modifies the function profiler to use the global filter, fixing the
filtering functionality.

Before (filtering not working):
```
root@localhost:~# echo 'vfs*' > /sys/kernel/tracing/set_ftrace_filter
root@localhost:~# echo 1 > /sys/kernel/tracing/function_profile_enabled
root@localhost:~# sleep 1
root@localhost:~# echo 0 > /sys/kernel/tracing/function_profile_enabled
root@localhost:~# head /sys/kernel/tracing/trace_stat/*
  Function                               Hit    Time            Avg
     s^2
  --------                               ---    ----            ---
     ---
  schedule                               314    22290594 us     70989.15 us
     40372231 us
  x64_sys_call                          1527    8762510 us      5738.382 us
     3414354 us
  schedule_hrtimeout_range               176    8665356 us      49234.98 us
     405618876 us
  __x64_sys_ppoll                        324    5656635 us      17458.75 us
     19203976 us
  do_sys_poll                            324    5653747 us      17449.83 us
     19214945 us
  schedule_timeout                        67    5531396 us      82558.15 us
     2136740827 us
  __x64_sys_pselect6                      12    3029540 us      252461.7 us
     63296940171 us
  do_pselect.constprop.0                  12    3029532 us      252461.0 us
     63296952931 us
```

After (filtering working):
```
root@localhost:~# echo 'vfs*' > /sys/kernel/tracing/set_ftrace_filter
root@localhost:~# echo 1 > /sys/kernel/tracing/function_profile_enabled
root@localhost:~# sleep 1
root@localhost:~# echo 0 > /sys/kernel/tracing/function_profile_enabled
root@localhost:~# head /sys/kernel/tracing/trace_stat/*
  Function                               Hit    Time            Avg
     s^2
  --------                               ---    ----            ---
     ---
  vfs_write                              462    68476.43 us     148.217 us
     25874.48 us
  vfs_read                               641    9611.356 us     14.994 us
     28868.07 us
  vfs_fstat                              890    878.094 us      0.986 us
     1.667 us
  vfs_fstatat                            227    757.176 us      3.335 us
     18.928 us
  vfs_statx                              226    610.610 us      2.701 us
     17.749 us
  vfs_getattr_nosec                     1187    460.919 us      0.388 us
     0.326 us
  vfs_statx_path                         297    343.287 us      1.155 us
     11.116 us
  vfs_rename                               6    291.575 us      48.595 us
     9889.236 us
```

Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/20250101190820.72534-1-enjuk@amazon.com
Fixes: c132be2c4fcc ("function_graph: Have the instances use their own ftrace_ops for filtering")
Signed-off-by: Kohei Enju <enjuk@amazon.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-01-02 17:21:33 -05:00
Zilin Guan
d654740337 fgraph: Add READ_ONCE() when accessing fgraph_array[]
In __ftrace_return_to_handler(), a loop iterates over the fgraph_array[]
elements, which are fgraph_ops. The loop checks if an element is a
fgraph_stub to prevent using a fgraph_stub afterward.

However, if the compiler reloads fgraph_array[] after this check, it might
race with an update to fgraph_array[] that introduces a fgraph_stub. This
could result in the stub being processed, but the stub contains a null
"func_hash" field, leading to a NULL pointer dereference.

To ensure that the gops compared against the fgraph_stub matches the gops
processed later, add a READ_ONCE(). A similar patch appears in commit
63a8dfb ("function_graph: Add READ_ONCE() when accessing fgraph_array[]").

Cc: stable@vger.kernel.org
Fixes: 37238abe3cb47 ("ftrace/function_graph: Pass fgraph_ops to function graph callbacks")
Link: https://lore.kernel.org/20241231113731.277668-1-zilin@seu.edu.cn
Signed-off-by: Zilin Guan <zilin@seu.edu.cn>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-01-02 17:21:18 -05:00
Steven Rostedt
afc6717628 tracing: Have process_string() also allow arrays
In order to catch a common bug where a TRACE_EVENT() TP_fast_assign()
assigns an address of an allocated string to the ring buffer and then
references it in TP_printk(), which can be executed hours later when the
string is free, the function test_event_printk() runs on all events as
they are registered to make sure there's no unwanted dereferencing.

It calls process_string() to handle cases in TP_printk() format that has
"%s". It returns whether or not the string is safe. But it can have some
false positives.

For instance, xe_bo_move() has:

 TP_printk("move_lacks_source:%s, migrate object %p [size %zu] from %s to %s device_id:%s",
            __entry->move_lacks_source ? "yes" : "no", __entry->bo, __entry->size,
            xe_mem_type_to_name[__entry->old_placement],
            xe_mem_type_to_name[__entry->new_placement], __get_str(device_id))

Where the "%s" references into xe_mem_type_to_name[]. This is an array of
pointers that should be safe for the event to access. Instead of flagging
this as a bad reference, if a reference points to an array, where the
record field is the index, consider it safe.

Link: https://lore.kernel.org/all/9dee19b6185d325d0e6fa5f7cbba81d007d99166.camel@sapience.com/

Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/20241231000646.324fb5f7@gandalf.local.home
Fixes: 65a25d9f7ac02 ("tracing: Add "%s" check in test_event_printk()")
Reported-by: Genes Lists <lists@sapience.com>
Tested-by: Gene C <arch@sapience.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-12-31 00:10:32 -05:00
Arnd Bergmann
cb0ca08b32 kcov: mark in_softirq_really() as __always_inline
If gcc decides not to inline in_softirq_really(), objtool warns about a
function call with UACCESS enabled:

kernel/kcov.o: warning: objtool: __sanitizer_cov_trace_pc+0x1e: call to in_softirq_really() with UACCESS enabled
kernel/kcov.o: warning: objtool: check_kcov_mode+0x11: call to in_softirq_really() with UACCESS enabled

Mark this as __always_inline to avoid the problem.

Link: https://lkml.kernel.org/r/20241217071814.2261620-1-arnd@kernel.org
Fixes: 7d4df2dad312 ("kcov: properly check for softirq context")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Marco Elver <elver@google.com>
Cc: Aleksandr Nogikh <nogikh@google.com>
Cc: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Josh Poimboeuf <jpoimboe@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-12-30 17:59:08 -08:00
Linus Torvalds
bf7a281b80 Fix missed rtmutex wakeups causing sporadic boot hangs
and other misbehavior.
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmdxC+ERHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1jvDw/+Kl24Gjai6hy7yFukGRFRkAezx3YRyK8F
 SM/vg2GzNaottkUSO3ywD//SMoG3qqkBOIukrS8kJXjLNlx1TI6AqGLVA9g9LpMw
 KFgqvIb4llstsAh7s8coCSJIVOCGcNC306EPfqvrhlU16YqFHRggQUqSiycRXQEd
 SDSAiNsiez0g0a0x1qI0lbFtF7l/Xht1CxOmpc0NQe8OXZcOXJI1z92DbzDsY+r4
 g77sJ3jHT9j3rpz7MPdh4xS8RJnT/E3wAKn5dnS0pSJ58UFOndIgncKoeEpPC3gW
 1hFWx+3IC2n0/t4m5TQhtpSFv0W4tkhwWOMI7JlRw2Sx2z0T/gnJsYH7E+DSu138
 XYmRCuW+BHrFjG+Pns4bpndf8Gy2HSHjvp0AB9iUqzfIkWVkQNjBdonfdvY5pey0
 EwkxCZPcWT8j0HehM9MhntYojfgy/Au/Z+EOZQSDDHKLAvkkE5ai1FPCjvhBxrCe
 FGko03zS77O+yayTFwXdtbn0StM1Bfa8WcCKxAKErsYqOOB4AP1bJWAknBKw0O4b
 Kj2nVSf7etDcue6sey9HWd1+pNzUsAlsuRM+bsa/dp2rxHxbbHVVHV1Yy0jTgHTL
 RkK8C3FyZbya4nhl0qY7kYudes37S8aT8AQEvyJ9/Y0aLURuESzdxiX1Knk0W2zs
 WsRnDI85Yq0=
 =0Vde
 -----END PGP SIGNATURE-----

Merge tag 'locking-urgent-2024-12-29' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull locking fix from Ingo Molnar:
 "Fix missed rtmutex wakeups causing sporadic boot hangs and other
  misbehavior"

* tag 'locking-urgent-2024-12-29' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  locking/rtmutex: Make sure we wake anything on the wake_q when we release the lock->wait_lock
2024-12-29 10:04:47 -08:00
Linus Torvalds
411a678d30 Probes fixes for v6.13-rc4:
- tracing/kprobes: Change the priority of the module callback of kprobe
   events so that it is called after the jump label list on the module is
   updated. This ensures the kprobe can check whether it is not on the
   jump label address correctly.
 -----BEGIN PGP SIGNATURE-----
 
 iQFPBAABCgA5FiEEh7BulGwFlgAOi5DV2/sHvwUrPxsFAmduAMgbHG1hc2FtaS5o
 aXJhbWF0c3VAZ21haWwuY29tAAoJENv7B78FKz8bJ6YH/2QBkWNTe3qjxdPsTxJ2
 MyL2PO8tMwZbNSyYZ1yGnbguWUUKVkuiheS/qWhLNpuVEyb6Q9/Zuifh5rFqDbf0
 Ug3YvsP7gQurmqDm1NGlnMic3zlmZaYDtXCKB+kiA3HO3iP92zesTJlasiok3aSd
 sQphxUzmG41BQUDN5/LktGjVb5juf3Xq6i6bdCd6wunUbGWCEE+XmFrg1oVnutES
 GTckUGswUBGbgkcVPc07UfKZpNzZdyZlmbVfOISCdYIAddUKftATN7SaUrM29oqC
 /lkUcxeXSVXBIUkbA1p50nfjYzTWNeXG92WrvMrRZjNivyMf/nUJnxrlHsv5h2Dy
 gtI=
 =d3Zj
 -----END PGP SIGNATURE-----

Merge tag 'probes-fixes-v6.13-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull probes fix from Masami Hiramatsu:
 "Change the priority of the module callback of kprobe events so that it
  is called after the jump label list on the module is updated.

  This ensures the kprobe can check whether it is not on the jump label
  address correctly"

* tag 'probes-fixes-v6.13-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  tracing/kprobe: Make trace_kprobe's module callback called after jump_label update
2024-12-27 11:03:15 -08:00
Henry Huang
35bf430e08 sched_ext: initialize kit->cursor.flags
struct bpf_iter_scx_dsq *it maybe not initialized.
If we didn't call scx_bpf_dsq_move_set_vtime and scx_bpf_dsq_move_set_slice
before scx_bpf_dsq_move, it would cause unexpected behaviors:
1. Assign a huge slice into p->scx.slice
2. Assign a invalid vtime into p->scx.dsq_vtime

Signed-off-by: Henry Huang <henry.hj@antgroup.com>
Fixes: 6462dd53a260 ("sched_ext: Compact struct bpf_iter_scx_dsq_kern")
Cc: stable@vger.kernel.org # v6.12
Signed-off-by: Tejun Heo <tj@kernel.org>
2024-12-24 10:56:08 -10:00
Su Hui
d57212f281 workqueue: add printf attribute to __alloc_workqueue()
Fix a compiler warning with W=1:
kernel/workqueue.c: error:
function ‘__alloc_workqueue’ might be a candidate for ‘gnu_printf’
format attribute[-Werror=suggest-attribute=format]
 5657 |  name_len = vsnprintf(wq->name, sizeof(wq->name), fmt, args);
      |  ^~~~~~~~

Fixes: 9b59a85a84dc ("workqueue: Don't call va_start / va_end twice")
Signed-off-by: Su Hui <suhui@nfschina.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2024-12-24 09:50:38 -10:00
Lizhi Xu
98feccbf32 tracing: Prevent bad count for tracing_cpumask_write
If a large count is provided, it will trigger a warning in bitmap_parse_user.
Also check zero for it.

Cc: stable@vger.kernel.org
Fixes: 9e01c1b74c953 ("cpumask: convert kernel trace functions")
Link: https://lore.kernel.org/20241216073238.2573704-1-lizhi.xu@windriver.com
Reported-by: syzbot+0aecfd34fb878546f3fd@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=0aecfd34fb878546f3fd
Tested-by: syzbot+0aecfd34fb878546f3fd@syzkaller.appspotmail.com
Signed-off-by: Lizhi Xu <lizhi.xu@windriver.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-12-23 21:59:15 -05:00
Masami Hiramatsu (Google)
d685d55dfc tracing/kprobe: Make trace_kprobe's module callback called after jump_label update
Make sure the trace_kprobe's module notifer callback function is called
after jump_label's callback is called. Since the trace_kprobe's callback
eventually checks jump_label address during registering new kprobe on
the loading module, jump_label must be updated before this registration
happens.

Link: https://lore.kernel.org/all/173387585556.995044.3157941002975446119.stgit@devnote2/

Fixes: 614243181050 ("tracing/kprobes: Support module init function probing")
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2024-12-24 00:08:13 +09:00
Peter Zijlstra
630a937016 Lockdep changes for v6.14:
- Use swap() macro in the ww_mutex test.
 - Minor fixes and documentation for lockdep configs on internal data structure sizes.
 - Some "-Wunused-function" warning fixes for Clang.
 
 Rust locking changes for v6.14:
 
 - Add Rust locking files into LOCKING PRIMITIVES maintainer entry.
 - Add `Lock<(), ..>::from_raw()` function to support abstraction on low level locking.
 - Expose `Guard::new()` for public usage and add type alias for spinlock and mutex guards.
 - Add lockdep checking when creating a new lock `Guard`.
 -----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEj5IosQTPz8XU1wRHSXnow7UH+rgFAmdl/LoACgkQSXnow7UH
 +rhNrAf/epZAkkTmFgSqdx0ZNtKUA14Hqp9ie7SJylU6B9dfXmvZzaNBlowk5Edq
 yGGJQYuzuT+PFYZkNEuSZYcrqUq+b4s8MyF/8h3+lyZT6p1Jhapvq16id5yA1u0l
 MxMqAZC1D1ruDev2H8IxLlhHlDsSYS0erVNB2ZTFJwL0rZNyUXMZ4Y/o972GjAPt
 8g9NlPB3ZTCVmyVtwy7rCexSuVTGDE3BRL9/W9q8eMZNnHq46xDsHRrn9NO4cDmv
 FogniH9xjFYetZMilYkpHwygAMX1P2t6x29Q+u464bStIWIOjkthYjkoePNXwZQd
 XgvN37j508VHLJ3sod38+IpnfhlZHA==
 =IJvk
 -----END PGP SIGNATURE-----

Merge tag 'lockdep-for-tip.20241220' of git://git.kernel.org/pub/scm/linux/kernel/git/boqun/linux into locking/core

Lockdep changes for v6.14:

- Use swap() macro in the ww_mutex test.
- Minor fixes and documentation for lockdep configs on internal data structure sizes.
- Some "-Wunused-function" warning fixes for Clang.

Rust locking changes for v6.14:

- Add Rust locking files into LOCKING PRIMITIVES maintainer entry.
- Add `Lock<(), ..>::from_raw()` function to support abstraction on low level locking.
- Expose `Guard::new()` for public usage and add type alias for spinlock and mutex guards.
- Add lockdep checking when creating a new lock `Guard`.
2024-12-22 12:43:31 +01:00
Linus Torvalds
4aa748dd1a 25 hotfixes. 16 are cc:stable. 19 are MM and 6 are non-MM.
The usual bunch of singletons and doubletons - please see the relevant
 changelogs for details.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZ2cghQAKCRDdBJ7gKXxA
 jgrsAQCvlSmHYYLXBE1A6cram4qWgEP/2vD94d6sVv9UipO/FAEA8y1K7dbT2AGX
 A5ESuRndu5Iy76mb6Tiarqa/yt56QgU=
 =ZYVx
 -----END PGP SIGNATURE-----

Merge tag 'mm-hotfixes-stable-2024-12-21-12-09' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull misc fixes from Andrew Morton:
 "25 hotfixes.  16 are cc:stable.  19 are MM and 6 are non-MM.

  The usual bunch of singletons and doubletons - please see the relevant
  changelogs for details"

* tag 'mm-hotfixes-stable-2024-12-21-12-09' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (25 commits)
  mm: huge_memory: handle strsep not finding delimiter
  alloc_tag: fix set_codetag_empty() when !CONFIG_MEM_ALLOC_PROFILING_DEBUG
  alloc_tag: fix module allocation tags populated area calculation
  mm/codetag: clear tags before swap
  mm/vmstat: fix a W=1 clang compiler warning
  mm: convert partially_mapped set/clear operations to be atomic
  nilfs2: fix buffer head leaks in calls to truncate_inode_pages()
  vmalloc: fix accounting with i915
  mm/page_alloc: don't call pfn_to_page() on possibly non-existent PFN in split_large_buddy()
  fork: avoid inappropriate uprobe access to invalid mm
  nilfs2: prevent use of deleted inode
  zram: fix uninitialized ZRAM not releasing backing device
  zram: refuse to use zero sized block device as backing device
  mm: use clear_user_(high)page() for arch with special user folio handling
  mm: introduce cpu_icache_is_aliasing() across all architectures
  mm: add RCU annotation to pte_offset_map(_lock)
  mm: correctly reference merged VMA
  mm: use aligned address in copy_user_gigantic_page()
  mm: use aligned address in clear_gigantic_page()
  mm: shmem: fix ShmemHugePages at swapout
  ...
2024-12-21 15:31:56 -08:00
Linus Torvalds
9c707ba99f BPF fixes:
- Fix inlining of bpf_get_smp_processor_id helper for !CONFIG_SMP
   systems (Andrea Righi)
 
 - Fix BPF USDT selftests helper code to use asm constraint "m"
   for LoongArch (Tiezhu Yang)
 
 - Fix BPF selftest compilation error in get_uprobe_offset when
   PROCMAP_QUERY is not defined (Jerome Marchand)
 
 - Fix BPF bpf_skb_change_tail helper when used in context of
   BPF sockmap to handle negative skb header offsets (Cong Wang)
 
 - Several fixes to BPF sockmap code, among others, in the area
   of socket buffer accounting (Levi Zim, Zijian Zhang, Cong Wang)
 
 Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
 -----BEGIN PGP SIGNATURE-----
 
 iIsEABYKADMWIQTFp0I1jqZrAX+hPRXbK58LschIgwUCZ2YJABUcZGFuaWVsQGlv
 Z2VhcmJveC5uZXQACgkQ2yufC7HISINDEgD+N4uVg+rp8Z8pg9jcai4WUERmRG20
 NcQTfBXczLHkwIcBALvn7NVvbTAINJzBTnukbjX3XbWFz2cJ/xHxDYXycP4I
 =SwXG
 -----END PGP SIGNATURE-----

Merge tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf

Pull BPF fixes from Daniel Borkmann:

 - Fix inlining of bpf_get_smp_processor_id helper for !CONFIG_SMP
   systems (Andrea Righi)

 - Fix BPF USDT selftests helper code to use asm constraint "m" for
   LoongArch (Tiezhu Yang)

 - Fix BPF selftest compilation error in get_uprobe_offset when
   PROCMAP_QUERY is not defined (Jerome Marchand)

 - Fix BPF bpf_skb_change_tail helper when used in context of BPF
   sockmap to handle negative skb header offsets (Cong Wang)

 - Several fixes to BPF sockmap code, among others, in the area of
   socket buffer accounting (Levi Zim, Zijian Zhang, Cong Wang)

* tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
  selftests/bpf: Test bpf_skb_change_tail() in TC ingress
  selftests/bpf: Introduce socket_helpers.h for TC tests
  selftests/bpf: Add a BPF selftest for bpf_skb_change_tail()
  bpf: Check negative offsets in __bpf_skb_min_len()
  tcp_bpf: Fix copied value in tcp_bpf_sendmsg
  skmsg: Return copied bytes in sk_msg_memcopy_from_iter
  tcp_bpf: Add sk_rmem_alloc related logic for tcp_bpf ingress redirection
  tcp_bpf: Charge receive socket buffer in bpf_tcp_ingress()
  selftests/bpf: Fix compilation error in get_uprobe_offset()
  selftests/bpf: Use asm constraint "m" for LoongArch
  bpf: Fix bpf_get_smp_processor_id() on !CONFIG_SMP
2024-12-21 11:07:19 -08:00
David Howells
973b710b88
kheaders: Ignore silly-rename files
Tell tar to ignore silly-rename files (".__afs*" and ".nfs*") when building
the header archive.  These occur when a file that is open is unlinked
locally, but hasn't yet been closed.  Such files are visible to the user
via the getdents() syscall and so programs may want to do things with them.

During the kernel build, such files may be made during the processing of
header files and the cleanup may get deferred by fput() which may result in
tar seeing these files when it reads the directory, but they may have
disappeared by the time it tries to open them, causing tar to fail with an
error.  Further, we don't want to include them in the tarball if they still
exist.

With CONFIG_HEADERS_INSTALL=y, something like the following may be seen:

   find: './kernel/.tmp_cpio_dir/include/dt-bindings/reset/.__afs2080': No such file or directory
   tar: ./include/linux/greybus/.__afs3C95: File removed before we read it

The find warning doesn't seem to cause a problem.

Fix this by telling tar when called from in gen_kheaders.sh to exclude such
files.  This only affects afs and nfs; cifs uses the Windows Hidden
attribute to prevent the file from being seen.

Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/r/20241213135013.2964079-2-dhowells@redhat.com
cc: Masahiro Yamada <masahiroy@kernel.org>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
cc: linux-nfs@vger.kernel.org
cc: linux-kernel@vger.kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-12-20 22:07:55 +01:00
Linus Torvalds
5b83bcdea5 ring-buffer fixes for v6.13:
- Fix possible overflow of mmapped ring buffer with bad offset
 
   If the mmap() to the ring buffer passes in a start address that
   is passed the end of the mmapped file, it is not caught and
   a slab-out-of-bounds is triggered.
 
   Add a check to make sure the start address is within the bounds
 
 - Do not use TP_printk() to boot mapped ring buffers
 
   As a boot mapped ring buffer's data may have pointers that map to
   the previous boot's memory map, it is unsafe to allow the TP_printk()
   to be used to read the boot mapped buffer's events. If a TP_printk()
   points to a static string from within the kernel it will not match
   the current kernel mapping if KASLR is active, and it can fault.
 
   Have it simply print out the raw fields.
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCZ2QuXRQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qncvAQDf2s2WWsy4pYp2mpRtBXvAPf6tpBdi
 J9eceJQbwJVJHAEApQjEFfbUxLh2WgPU1Cn++PwDA+NLiru70+S0vtDLWwE=
 =OI+v
 -----END PGP SIGNATURE-----

Merge tag 'trace-ringbuffer-v6.13-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull ring-buffer fixes from Steven Rostedt:

 - Fix possible overflow of mmapped ring buffer with bad offset

   If the mmap() to the ring buffer passes in a start address that is
   passed the end of the mmapped file, it is not caught and a
   slab-out-of-bounds is triggered.

   Add a check to make sure the start address is within the bounds

 - Do not use TP_printk() to boot mapped ring buffers

   As a boot mapped ring buffer's data may have pointers that map to the
   previous boot's memory map, it is unsafe to allow the TP_printk() to
   be used to read the boot mapped buffer's events. If a TP_printk()
   points to a static string from within the kernel it will not match
   the current kernel mapping if KASLR is active, and it can fault.

   Have it simply print out the raw fields.

* tag 'trace-ringbuffer-v6.13-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  trace/ring-buffer: Do not use TP_printk() formatting for boot mapped buffers
  ring-buffer: Fix overflow in __rb_map_vma
2024-12-20 10:13:26 -08:00
John Stultz
abfdccd6af sched/wake_q: Add helper to call wake_up_q after unlock with preemption disabled
A common pattern seen when wake_qs are used to defer a wakeup
until after a lock is released is something like:
  preempt_disable();
  raw_spin_unlock(lock);
  wake_up_q(wake_q);
  preempt_enable();

So create some raw_spin_unlock*_wake() helper functions to clean
this up.

Applies on top of the fix I submitted here:
 https://lore.kernel.org/lkml/20241212222138.2400498-1-jstultz@google.com/

NOTE: I recognise the unlock()/unlock_irq()/unlock_irqrestore()
variants creates its own duplication, which we could use a macro
to generate the similar functions, but I often dislike how those
generation macros making finding the actual implementation
harder, so I left the three functions as is. If folks would
prefer otherwise, let me know and I'll switch it.

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: John Stultz <jstultz@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20241217040803.243420-1-jstultz@google.com
2024-12-20 15:31:21 +01:00
Peter Zijlstra
c2db11a750 Merge branch 'locking/urgent'
Sync with urgent -- avoid conflicts.

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
2024-12-20 15:31:19 +01:00
Swapnil Sapkal
7c8cd569ff docs: Update Schedstat version to 17
Update the Schedstat version to 17 as more fields are added to report
different kinds of imbalances in the sched domain. Also domain field
started printing corresponding domain name.

Signed-off-by: Swapnil Sapkal <swapnil.sapkal@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20241220063224.17767-7-swapnil.sapkal@amd.com
2024-12-20 15:31:18 +01:00
K Prateek Nayak
011b3a14dc sched/stats: Print domain name in /proc/schedstat
Currently, there does not exist a straightforward way to extract the
names of the sched domains and match them to the per-cpu domain entry in
/proc/schedstat other than looking at the debugfs files which are only
visible after enabling "verbose" debug after commit 34320745dfc9
("sched/debug: Put sched/domains files under the verbose flag")

Since tools like `perf sched stats`[1] require displaying per-domain
information in user friendly manner, display the names of sched domain,
alongside their level in /proc/schedstat.

Domain names also makes the /proc/schedstat data unambiguous when some
of the cpus are offline. For example, on a 128 cpus AMD Zen3 machine
where CPU0 and CPU64 are SMT siblings and CPU64 is offline:

Before:
    cpu0 ...
    domain0 ...
    domain1 ...
    cpu1 ...
    domain0 ...
    domain1 ...
    domain2 ...

After:
    cpu0 ...
    domain0 MC ...
    domain1 PKG ...
    cpu1 ...
    domain0 SMT ...
    domain1 MC ...
    domain2 PKG ...

[1] https://lore.kernel.org/lkml/20241122084452.1064968-1-swapnil.sapkal@amd.com/

Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: Ravi Bangoria <ravi.bangoria@amd.com>
Signed-off-by: Swapnil Sapkal <swapnil.sapkal@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: James Clark <james.clark@linaro.org>
Link: https://lore.kernel.org/r/20241220063224.17767-6-swapnil.sapkal@amd.com
2024-12-20 15:31:18 +01:00
Swapnil Sapkal
1c055a0f5d sched: Move sched domain name out of CONFIG_SCHED_DEBUG
/proc/schedstat file shows cpu and sched domain level scheduler
statistics. It does not show domain name instead shows domain level.
It will be very useful for tools like `perf sched stats`[1] to
aggragate domain level stats if domain names are shown in /proc/schedstat.
But sched domain name is guarded by CONFIG_SCHED_DEBUG. As per the
discussion[2], move sched domain name out of CONFIG_SCHED_DEBUG.

[1] https://lore.kernel.org/lkml/20241122084452.1064968-1-swapnil.sapkal@amd.com/
[2] https://lore.kernel.org/lkml/fcefeb4d-3acb-462d-9c9b-3df8d927e522@amd.com/

Suggested-by: "Gautham R. Shenoy" <gautham.shenoy@amd.com>
Signed-off-by: Swapnil Sapkal <swapnil.sapkal@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20241220063224.17767-5-swapnil.sapkal@amd.com
2024-12-20 15:31:17 +01:00
Swapnil Sapkal
3b2a793ea7 sched: Report the different kinds of imbalances in /proc/schedstat
In /proc/schedstat, lb_imbalance reports the sum of imbalances
discovered in sched domains with each call to sched_balance_rq(), which is
not very useful because lb_imbalance does not mention whether the imbalance
is due to load, utilization, nr_tasks or misfit_tasks. Remove this field
from /proc/schedstat.

Currently there is no field in /proc/schedstat to report different types
of imbalances. Introduce new fields in /proc/schedstat to report the
total imbalances in load, utilization, nr_tasks or misfit_tasks.

Added fields to /proc/schedstat:
        - lb_imbalance_load: Total imbalance due to load.
        - lb_imbalance_util: Total imbalance due to utilization.
        - lb_imbalance_task: Total imbalance due to number of tasks.
        - lb_imbalance_misfit: Total imbalance due to misfit tasks.

Signed-off-by: Swapnil Sapkal <swapnil.sapkal@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Shrikanth Hegde <sshegde@linux.ibm.com>
Link: https://lore.kernel.org/r/20241220063224.17767-4-swapnil.sapkal@amd.com
2024-12-20 15:31:17 +01:00
Peter Zijlstra
c3856c9ce6 sched/fair: Cleanup in migrate_degrades_locality() to improve readability
migrate_degrade_locality() would return {1, 0, -1} respectively to
indicate that migration would degrade-locality, would improve
locality, would be ambivalent to locality improvements.

This patch improves readability by changing the return value to mean:
* Any positive value degrades locality
* 0 migration doesn't affect locality
* Any negative value improves locality

[Swapnil: Fixed comments around code and wrote commit log]

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Not-yet-signed-off-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Swapnil Sapkal <swapnil.sapkal@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20241220063224.17767-3-swapnil.sapkal@amd.com
2024-12-20 15:31:17 +01:00
Peter Zijlstra
a430d99e34 sched/fair: Fix value reported by hot tasks pulled in /proc/schedstat
In /proc/schedstat, lb_hot_gained reports the number hot tasks pulled
during load balance. This value is incremented in can_migrate_task()
if the task is migratable and hot. After incrementing the value,
load balancer can still decide not to migrate this task leading to wrong
accounting. Fix this by incrementing stats when hot tasks are detached.
This issue only exists in detach_tasks() where we can decide to not
migrate hot task even if it is migratable. However, in detach_one_task(),
we migrate it unconditionally.

[Swapnil: Handled the case where nr_failed_migrations_hot was not accounted properly and wrote commit log]

Fixes: d31980846f96 ("sched: Move up affinity check to mitigate useless redoing overhead")
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reported-by: "Gautham R. Shenoy" <gautham.shenoy@amd.com>
Not-yet-signed-off-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Swapnil Sapkal <swapnil.sapkal@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20241220063224.17767-2-swapnil.sapkal@amd.com
2024-12-20 15:31:16 +01:00
Sebastian Andrzej Siewior
ee8118c1f1 sched/fair: Update comments after sched_tick() rename.
scheduler_tick() was renamed to sched_tick() in 86dd6c04ef9f2
("sched/balancing: Rename scheduler_tick() => sched_tick()").

Update comments still referring to scheduler_tick.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20241219085839.302378-1-bigeasy@linutronix.de
2024-12-20 15:31:16 +01:00