linux-next

mirror of https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git synced 2025-01-15 02:05:33 +00:00

Author	SHA1	Message	Date
Ingo Molnar	8e94367b8d	Merge branch into tip/master: 'sched/core' # New commits in sched/core: 7c8cd569ff66 ("docs: Update Schedstat version to 17") 011b3a14dc66 ("sched/stats: Print domain name in /proc/schedstat") 1c055a0f5d3b ("sched: Move sched domain name out of CONFIG_SCHED_DEBUG") 3b2a793ea70f ("sched: Report the different kinds of imbalances in /proc/schedstat") c3856c9ce6b8 ("sched/fair: Cleanup in migrate_degrades_locality() to improve readability") a430d99e3490 ("sched/fair: Fix value reported by hot tasks pulled in /proc/schedstat") ee8118c1f186 ("sched/fair: Update comments after sched_tick() rename.") af98d8a36a96 ("sched/fair: Fix CPU bandwidth limit bypass during CPU hotplug") 7675361ff9a1 ("sched: deadline: Cleanup goto label in pick_earliest_pushable_dl_task") 7d5265ffcd8b ("rseq: Validate read-only fields under DEBUG_RSEQ config") 2a77e4be12cb ("sched/fair: Untangle NEXT_BUDDY and pick_next_task()") 95d9fed3a2ae ("sched/fair: Mark m*_vruntime() with __maybe_unused") 0429489e0928 ("sched/fair: Fix variable declaration position") 61b82dfb6b7e ("sched/fair: Do not try to migrate delayed dequeue task") 736c55a02c47 ("sched/fair: Rename cfs_rq.nr_running into nr_queued") 43eef7c3a4a6 ("sched/fair: Remove unused cfs_rq.idle_nr_running") 31898e7b87dd ("sched/fair: Rename cfs_rq.idle_h_nr_running into h_nr_idle") 9216582b0bfb ("sched/fair: Removed unsued cfs_rq.h_nr_delayed") 1a49104496d3 ("sched/fair: Use the new cfs_rq.h_nr_runnable") c2a295bffeaf ("sched/fair: Add new cfs_rq.h_nr_runnable") 7b8a702d9438 ("sched/fair: Rename h_nr_running into h_nr_queued") c907cd44a108 ("sched: Unify HK_TYPE_{TIMER|TICK|MISC} to HK_TYPE_KERNEL_NOISE") 6010d245ddc9 ("sched/isolation: Consolidate housekeeping cpumasks that are always identical") 1174b9344bc7 ("sched/isolation: Make "isolcpus=nohz" equivalent to "nohz_full"") ae5c677729e9 ("sched/core: Remove HK_TYPE_SCHED") a76328d44c7a ("sched/fair: Remove CONFIG_CFS_BANDWIDTH=n definition of cfs_bandwidth_used()")
3a181f20fb4e ("sched/deadline: Consolidate Timer Cancellation") 53916d5fd3c0 ("sched/deadline: Check bandwidth overflow earlier for hotplug") d4742f6ed7ea ("sched/deadline: Correctly account for allocated bandwidth during hotplug") 41d4200b7103 ("sched/deadline: Restore dl_server bandwidth on non-destructive root domain changes") 59297e2093ce ("sched: add READ_ONCE to task_on_rq_queued") 108ad0999085 ("sched: Don't try to catch up excess steal time.") Signed-off-by: Ingo Molnar <mingo@kernel.org>	2025-01-13 18:13:43 +01:00
Ingo Molnar	b431780cdd	Merge branch into tip/master: 'perf/core' # New commits in perf/core: b709eb872e19 ("perf: map pages in advance") 6d642735cdb6 ("perf/x86/intel/uncore: Support more units on Granite Rapids") 3f710be02ea6 ("perf/x86/intel/uncore: Clean up func_id") 0e45818ec189 ("perf/x86/intel: Support RDPMC metrics clear mode") 02c56362a7d3 ("uprobes: Guard against kmemdup() failing in dup_return_instance()") d29e744c7167 ("perf/x86: Relax privilege filter restriction on AMD IBS") 6057b90ecc84 ("perf/core: Export perf_exclude_event()") 8622e45b5da1 ("uprobes: Reuse return_instances between multiple uretprobes within task") 0cf981de7687 ("uprobes: Ensure return_instance is detached from the list before freeing") 636666a1c733 ("uprobes: Decouple return_instance list traversal and freeing") 2ff913ab3f47 ("uprobes: Simplify session consumer tracking") e0925f2dc4de ("uprobes: add speculative lockless VMA-to-inode-to-uprobe resolution") 83e3dc9a5d4d ("uprobes: simplify find_active_uprobe_rcu() VMA checks") 03a001b156d2 ("mm: introduce mmap_lock_speculate_{try_begin|retry}") eb449bd96954 ("mm: convert mm_lock_seq to a proper seqcount") 7528585290a1 ("mm/gup: Use raw_seqcount_try_begin()") 96450ead1652 ("seqlock: add raw_seqcount_try_begin") b4943b8bfc41 ("perf/x86/rapl: Add core energy counter support for AMD CPUs") 54d2759778c1 ("perf/x86/rapl: Move the cntr_mask to rapl_pmus struct") bdc57ec70548 ("perf/x86/rapl: Remove the global variable rapl_msrs") abf03d9bd20c ("perf/x86/rapl: Modify the generic variable names to _pkg") eeca4c6b2529 ("perf/x86/rapl: Add arguments to the init and cleanup functions") cd29d83a6d81 ("perf/x86/rapl: Make rapl_model struct global") 8bf1c86e5ac8 ("perf/x86/rapl: Rename rapl_pmu variables") 1d5e2f637a94 ("perf/x86/rapl: Remove the cpu_to_rapl_pmu() function") e4b444347795 ("x86/topology: Introduce topology_logical_core_id()") 2f2db347071a ("perf/x86/rapl: Remove the unused get_rapl_pmu_cpumask() function") ae55e308bde2
("perf/x86/intel/ds: Simplify the PEBS records processing for adaptive PEBS") 3c00ed344cef ("perf/x86/intel/ds: Factor out functions for PEBS records processing") 7087bfb0adc9 ("perf/x86/intel/ds: Clarify adaptive PEBS processing") faac6f105ef1 ("perf/core: Check sample_type in perf_sample_save_brstack") f226805bc5f6 ("perf/core: Check sample_type in perf_sample_save_callchain") b9c44b91476b ("perf/core: Save raw sample data conditionally based on sample type") Signed-off-by: Ingo Molnar <mingo@kernel.org>	2025-01-13 18:13:43 +01:00
Ingo Molnar	90df9792d9	Merge branch into tip/master: 'locking/core' # New commits in locking/core: cb4ccc70344c ("MAINTAINERS: Add static_call_inline.c to STATIC BRANCH/CALL") a937f384c9da ("cleanup, tags: Create tags for the cleanup primitives") abfdccd6af2b ("sched/wake_q: Add helper to call wake_up_q after unlock with preemption disabled") fbd7a5a0359b ("rust: sync: Add lock::Backend::assert_is_held()") eb5ccb038284 ("rust: sync: Add SpinLockGuard type alias") 37624dde4768 ("rust: sync: Add MutexGuard type alias") daa03fe50ec3 ("rust: sync: Make Guard::new() public") 15abc88057ee ("rust: sync: Add Lock::from_raw() for Lock<(), B>") 9793c9bb91f1 ("locking: MAINTAINERS: Start watching Rust locking primitives") 343060092585 ("lockdep: Move lockdep_assert_locked() under #ifdef CONFIG_PROVE_LOCKING") 8148fa2e022b ("lockdep: Mark chain_hlock_class_idx() with __maybe_unused") bd7b5ae26618 ("lockdep: Document MAX_LOCKDEP_CHAIN_HLOCKS calculation") 88a79e88a97c ("lockdep: Clarify size for LOCKDEP_*_BITS configs") e638072e6172 ("lockdep: Fix upper limit for LOCKDEP_*_BITS configs") 0d3547df6934 ("locking/ww_mutex/test: Use swap() macro") 63a48181fbcd ("smp/scf: Evaluate local cond_func() before IPI side-effects") d387ceb17149 ("locking/lockdep: Enforce PROVE_RAW_LOCK_NESTING only if ARCH_SUPPORTS_RT") Signed-off-by: Ingo Molnar <mingo@kernel.org>	2025-01-13 18:13:42 +01:00
Ingo Molnar	7308d9dc81	Merge branch into tip/master: 'irq/core' # New commits in irq/core: b4706d814921 ("genirq/kexec: Prevent redundant IRQ masking by checking state before shutdown") bad6722e478f ("kexec: Consolidate machine_kexec_mask_interrupts() implementation") 429f49ad361c ("genirq: Reuse irq_thread_fn() for forced thread case") 6f8b79683dfb ("genirq: Move irq_thread_fn() further up in the code") Signed-off-by: Ingo Molnar <mingo@kernel.org>	2025-01-13 18:13:42 +01:00
Ingo Molnar	e46ca77dd1	Merge branch into tip/master: 'sched/urgent' # New commits in sched/urgent: 66951e4860d3 ("sched/fair: Fix update_cfs_group() vs DELAY_DEQUEUE") 6d71a9c61604 ("sched/fair: Fix EEVDF entity placement bug causing scheduling lag") Signed-off-by: Ingo Molnar <mingo@kernel.org>	2025-01-13 18:13:40 +01:00
Peter Zijlstra	66951e4860	sched/fair: Fix update_cfs_group() vs DELAY_DEQUEUE Normally dequeue_entities() will continue to dequeue an empty group entity; except DELAY_DEQUEUE changes things -- it retains empty entities such that they might continue to compete and burn off some lag. However, doing this results in update_cfs_group() re-computing the cgroup weight 'slice' for an empty group, which it (rightly) figures isn't much at all. This in turn means that the delayed entity is not competing at the expected weight. Worse, the very low weight causes its lag to be inflated, which combined with avg_vruntime() using scale_load_down(), leads to artifacts. As such, don't adjust the weight for empty group entities and let them compete at their original weight. Fixes: 152e11f6df29 ("sched/fair: Implement delayed dequeue") Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20250110115720.GA17405@noisy.programming.kicks-ass.net	2025-01-13 13:50:56 +01:00
Linus Torvalds	a603abe345	- Fix a #GP in the perf user callchain code caused by a race between uprobe freeing the task and the bpf profiler unwinding the task's user stack -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmeDphkACgkQEsHwGGHe VUrFjxAAickP9S3nlduOzjOO9Pa85MUbQ5wgzrpa29KV75xez9w7IWmambBbYkrY zxV/vJqVEjuaJki/kqgtPNmp7tHjDBwW/sTqSI8TTeIwogfht4WPPA2YEHR2pDK4 t8XNEHGnP38o1oJ6j+zLO9vktieJ/T65yZurmGwVfmGpNOIHNBSzCFopGFCXV41k WcNi1E3dOgSbAQESvF+J1ZtkcmBXovoyE7k+H5bbuRcoyFF1RhIDvKcGGY5m7FDo Cb92wJTbm9kQaWdOc8oa808pyVtmh0wy+1I9dvoQ+sPlhLzy4p32uOpWUlJkpV51 lZgPO0NunnLlHNL4zK4M7OBphlEbr8JaXQbgDLtn8TnfPKlh1sZ0DWoVcyXqB77g cOlsSEDYzSbf/5TKDZMfeh4koEZvtNmDH6SjUYxC6bdfpfd8D5zp8TbvPJ6XmM8m tFn4rhTY5rf2+AjgZs16jkpNlDk+pmwXiczxhldMR/U9y5meea96pe+r8HPpQk27 1t9N0ixt+EhY1xkITYkS06UV/nJJzejbtrCytkh/FLePQCSi+IbgpxUASVHnJSur 4ctWZTm+1CxZ7SRZ9VEsPYXfRfRtJjPKOqheQR2RNRi9SnBi7AlJfMOffEZqj8/p q8C2qtwOlBdxo/t87NnTsvmZE3mfWJBgN2KmO/5YsshRx15qPis= =oobp -----END PGP SIGNATURE----- Merge tag 'perf_urgent_for_v6.13_rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull perf fix from Borislav Petkov: - Fix a #GP in the perf user callchain code caused by a race between uprobe freeing the task and the bpf profiler unwinding the task's user stack * tag 'perf_urgent_for_v6.13_rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: uprobes: Fix race in uprobe_free_utask	2025-01-12 11:57:45 -08:00
Linus Torvalds	a87d1203bb	Probes fixes for v6.13-rc6: - tracing/kprobes: Fix to free trace_kprobe objects at a failure path in __trace_kprobe_create() function. This fixes a memory leak. -----BEGIN PGP SIGNATURE----- iQFPBAABCgA5FiEEh7BulGwFlgAOi5DV2/sHvwUrPxsFAmeDLdAbHG1hc2FtaS5o aXJhbWF0c3VAZ21haWwuY29tAAoJENv7B78FKz8bOzgH/3ctLQDber/jgWtiwqII Y4xEBjXfS+Dia9YOhM/Bu1HgD7dCuyESoxjDpvCRM2oKLV0O5HuMqrNSOmKjqJru vH0j6UG6sMnhOzbB/GArkt5NFYrgxvlOqvEoAK7PBem9rf0/cBFbzYMALwzto/pc 1v2ipv6V29H8aNCNBKgcDU/MlPTR2wpnStpVuXJzVtjlXpFCwbJjhIs2OEs2Xfud lzw/QV91h70rgj+YjhI5i/B0h0LVQmAz++8UYPjvdSjUeVnjQz07eaNWb+pHQepV lJ8yUoho+cPm1UyVwt7Sw3dDjfSOhcAgpwOUAqWndHQ4Dm6uzsZd2472aRrJRUC3 NeE= =k41d -----END PGP SIGNATURE----- Merge tag 'probes-fixes-v6.13-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull probes fix from Masami Hiramatsu: "Fix to free trace_kprobe objects at a failure path in __trace_kprobe_create() function. This fixes a memory leak" * tag 'probes-fixes-v6.13-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: tracing/kprobes: Fix to free objects when failed to copy a symbol	2025-01-11 20:34:12 -08:00
Linus Torvalds	2e3f3090bd	sched_ext: Fixes for v6.13-rc6 - Fix corner case bug where ops.dispatch() couldn't extend the execution of the current task if SCX_OPS_ENQ_LAST is set. - Fix ops.cpu_release() not being called when a SCX task is preempted by a higher priority sched class task. - Fix buitin idle mask being incorrectly left as busy after an idle CPU is picked and kicked. - scx_ops_bypass() was unnecessarily using rq_lock() which comes with rq pinning related sanity checks which could trigger spuriously. Switch to raw_spin_rq_lock(). -----BEGIN PGP SIGNATURE----- iIQEABYKACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCZ4Gmpw4cdGpAa2VybmVs Lm9yZwAKCRCxYfJx3gVYGVntAP0b4i4PEIkupj9+i8ZzlwqvYX3gFJ7E4v3wmjDp 1VYdrAD/ZetrhrM+9RyyKpMIDFnN+xE6YbslBSlAzGzgfdsbXA0= =zGXi -----END PGP SIGNATURE----- Merge tag 'sched_ext-for-6.13-rc6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext Pull sched_ext fixes from Tejun Heo: - Fix corner case bug where ops.dispatch() couldn't extend the execution of the current task if SCX_OPS_ENQ_LAST is set. - Fix ops.cpu_release() not being called when a SCX task is preempted by a higher priority sched class task. - Fix buitin idle mask being incorrectly left as busy after an idle CPU is picked and kicked. - scx_ops_bypass() was unnecessarily using rq_lock() which comes with rq pinning related sanity checks which could trigger spuriously. Switch to raw_spin_rq_lock(). * tag 'sched_ext-for-6.13-rc6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext: sched_ext: idle: Refresh idle masks during idle-to-idle transitions sched_ext: switch class when preempted by higher priority scheduler sched_ext: Replace rq_lock() to raw_spin_rq_lock() in scx_ops_bypass() sched_ext: keep running prev when prev->scx.slice != 0	2025-01-10 15:11:58 -08:00
Linus Torvalds	58624e4bc8	cgroup: Fixes for v6.13-rc6 All are cpuset changes: - Fix isolated CPUs leaking into sched domains. - Remove now unnecessary kernfs active break which can trigger a warning. - Comment updates. -----BEGIN PGP SIGNATURE----- iIQEABYKACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCZ4Gkug4cdGpAa2VybmVs Lm9yZwAKCRCxYfJx3gVYGXRGAQCf9aL+UWZZiVqcvRjBt8z3gxW9HQOCXYXNGlLF EKFFuAD+KLox+flPLbgNv9IwZnswv9+SdOTCE1TlT0GQFBPZcQU= =suPy -----END PGP SIGNATURE----- Merge tag 'cgroup-for-6.13-rc6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup fixes from Tejun Heo: "Cpuset fixes: - Fix isolated CPUs leaking into sched domains - Remove now unnecessary kernfs active break which can trigger a warning - Comment updates" * tag 'cgroup-for-6.13-rc6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: cgroup/cpuset: remove kernfs active break cgroup/cpuset: Prevent leakage of isolated CPUs into sched domains cgroup/cpuset: Remove stale text	2025-01-10 15:03:02 -08:00
Linus Torvalds	257a8be4e9	workqueue: Fixes for v6.13-rc6 - Add a WARN_ON_ONCE() on queue_delayed_work_on() on an offline CPU as such work items won't get executed till the CPU comes back online. -----BEGIN PGP SIGNATURE----- iIQEABYKACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCZ4Gjlw4cdGpAa2VybmVs Lm9yZwAKCRCxYfJx3gVYGTNSAQDX7+9ODdDXHEUnViU4QCK6EAsKmp+PHlZLo/0K PVm4SQD/QtPj3jwyEhhdRlaL0+IbTyfG3rURxv53XUGl+TJ1qA8= =SYtY -----END PGP SIGNATURE----- Merge tag 'wq-for-6.13-rc6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq Pull workqueue fix from Tejun Heo: - Add a WARN_ON_ONCE() on queue_delayed_work_on() on an offline CPU as such work items won't get executed till the CPU comes back online * tag 'wq-for-6.13-rc6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: workqueue: warn if delayed_work is queued to an offlined cpu.	2025-01-10 14:52:30 -08:00
Andrea Righi	a2a3374c47	sched_ext: idle: Refresh idle masks during idle-to-idle transitions With the consolidation of put_prev_task/set_next_task(), see commit 436f3eed5c69 ("sched: Combine the last put_prev_task() and the first set_next_task()"), we are now skipping the transition between these two functions when the previous and the next tasks are the same. As a result, the scx idle state of a CPU is updated only when transitioning to or from the idle thread. While this is generally correct, it can lead to uneven and inefficient core utilization in certain scenarios [1]. A typical scenario involves proactive wake-ups: scx_bpf_pick_idle_cpu() selects and marks an idle CPU as busy, followed by a wake-up via scx_bpf_kick_cpu(), without dispatching any tasks. In this case, the CPU continues running the idle thread, returns to idle, but remains marked as busy, preventing it from being selected again as an idle CPU (until a task eventually runs on it and releases the CPU). For example, running a workload that uses 20% of each CPU, combined with an scx scheduler using proactive wake-ups, results in the following core utilization: CPU 0: 25.7% CPU 1: 29.3% CPU 2: 26.5% CPU 3: 25.5% CPU 4: 0.0% CPU 5: 25.5% CPU 6: 0.0% CPU 7: 10.5% To address this, refresh the idle state also in pick_task_idle(), during idle-to-idle transitions, but only trigger ops.update_idle() on actual state changes to prevent unnecessary updates to the scx scheduler and maintain balanced state transitions. With this change in place, the core utilization in the previous example becomes the following: CPU 0: 18.8% CPU 1: 19.4% CPU 2: 18.0% CPU 3: 18.7% CPU 4: 19.3% CPU 5: 18.9% CPU 6: 18.7% CPU 7: 19.3% [1] https://github.com/sched-ext/scx/pull/1139 Fixes: 7c65ae81ea86 ("sched_ext: Don't call put_prev_task_scx() before picking the next task") Signed-off-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2025-01-10 12:40:42 -10:00
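The stale-busy state described in the commit above can be reproduced with a toy model. This is a minimal sketch in Python, not the sched_ext code: the class `ToyIdleTracker`, its methods, and the `refresh_on_idle_to_idle` switch are hypothetical names invented for illustration. It assumes a CPU that is picked and kicked without any task being dispatched goes straight back to the idle thread.

```python
class ToyIdleTracker:
    """Toy per-CPU idle mask (hypothetical; not the kernel implementation)."""

    def __init__(self, nr_cpus: int, refresh_on_idle_to_idle: bool):
        self.idle = [True] * nr_cpus
        self.refresh = refresh_on_idle_to_idle
        # Stands in for the number of ops.update_idle() notifications.
        self.update_idle_calls = 0

    def pick_idle_cpu(self):
        """Pick the first idle CPU and mark it busy; None if none is idle."""
        for cpu, is_idle in enumerate(self.idle):
            if is_idle:
                self.idle[cpu] = False
                return cpu
        return None

    def kick_without_dispatch(self, cpu: int) -> None:
        """The CPU wakes, finds nothing to run, and re-picks the idle thread."""
        if self.refresh and not self.idle[cpu]:
            # Idle-to-idle transition: refresh the mask and notify only
            # because the recorded state actually changes.
            self.idle[cpu] = True
            self.update_idle_calls += 1


def idle_cpus_after_kicks(refresh: bool, nr_cpus: int = 4, rounds: int = 8) -> int:
    """Run proactive wake-ups and count CPUs still marked idle afterwards."""
    tracker = ToyIdleTracker(nr_cpus, refresh)
    for _ in range(rounds):
        cpu = tracker.pick_idle_cpu()
        if cpu is None:
            break  # every CPU is (wrongly) marked busy
        tracker.kick_without_dispatch(cpu)
    return sum(tracker.idle)


print(idle_cpus_after_kicks(refresh=False), idle_cpus_after_kicks(refresh=True))
```

Without the refresh, every CPU in the model ends up marked busy although none is running a task. With it, the mask matches reality again.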
Imran Khan	da30ba227c	workqueue: warn if delayed_work is queued to an offlined cpu. delayed_work submitted to an offlined cpu, will not get executed, after the specified delay if the cpu remains offline. If the cpu never comes online the work will never get executed. checking for online cpu in __queue_delayed_work, does not sound like a good idea because to do this reliably we need hotplug lock and since work may be submitted from atomic contexts, we would have to use cpus_read_trylock. But if trylock fails we would queue the work on any cpu and this may not be optimal because our intended cpu might still be online. Putting a WARN_ON_ONCE for an already offlined cpu, will indicate users of queue_delayed_work_on, if they are (wrongly) trying to queue delayed_work on offlined cpu. Also indicate the problem of using offlined cpu with queue_delayed_work_on, in its description. Signed-off-by: Imran Khan <imran.f.khan@oracle.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2025-01-10 08:33:39 -10:00
Lorenzo Stoakes	b709eb872e	perf: map pages in advance We are adjusting struct page to make it smaller, removing unneeded fields which correctly belong to struct folio. Two of those fields are page->index and page->mapping. Perf is currently making use of both of these. This is unnecessary. This patch eliminates this. Perf establishes its own internally controlled memory-mapped pages using vm_ops hooks. The first page in the mapping is the read/write user control page, and the rest of the mapping consists of read-only pages. The VMA is backed by kernel memory either from the buddy allocator or vmalloc depending on configuration. It is intended to be mapped read/write, but because it has a page_mkwrite() hook, vma_wants_writenotify() indicates that it should be mapped read-only. When a write fault occurs, the provided page_mkwrite() hook, perf_mmap_fault() (doing double duty handing faults as well) uses the vmf->pgoff field to determine if this is the first page, allowing for the desired read/write first page, read-only rest mapping. For this to work the implementation has to carefully work around faulting logic. When a page is write-faulted, the fault() hook is called first, then its page_mkwrite() hook is called (to allow for dirty tracking in file systems). On fault we set the folio's mapping in perf_mmap_fault(), this is because when do_page_mkwrite() is subsequently invoked, it treats a missing mapping as an indicator that the fault should be retried. We also set the folio's index so, given the folio is being treated as faux user memory, it correctly references its offset within the VMA. This explains why the mapping and index fields are used - but it's not necessary. We preallocate pages when perf_mmap() is called for the first time via rb_alloc(), and further allocate auxiliary pages via rb_aux_alloc() as needed if the mapping requires it. 
This allocation is done in the f_ops->mmap() hook provided in perf_mmap(), and so we can instead simply map all the memory right away here - there's no point in handling (read) page faults when we don't demand page nor need to be notified about them (perf does not). This patch therefore changes this logic to map everything when the mmap() hook is called, establishing a PFN map. It implements vm_ops->pfn_mkwrite() to provide the required read/write vs. read-only behaviour, which does not require the previously implemented workarounds. While it is not ideal to use a VM_PFNMAP here, doing anything else will result in the page_mkwrite() hook need to be provided, which requires the same page->mapping hack this patch seeks to undo. It will also result in the pages being treated as folios and placed on the rmap, which really does not make sense for these mappings. Semantically it makes sense to establish this as some kind of special mapping, as the pages are managed by perf and are not strictly user pages, but currently the only means by which we can do so functionally while maintaining the required R/W and R/O behaviour is a PFN map. There should be no change to actual functionality as a result of this change. Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20250103153151.124163-1-lorenzo.stoakes@oracle.com	2025-01-10 18:16:50 +01:00
Jiri Olsa	b583ef82b6	uprobes: Fix race in uprobe_free_utask Max Makarov reported kernel panic [1] in perf user callchain code. The reason for that is the race between uprobe_free_utask and bpf profiler code doing the perf user stack unwind and is triggered within uprobe_free_utask function: - after current->utask is freed and - before current->utask is set to NULL general protection fault, probably for non-canonical address 0x9e759c37ee555c76: 0000 [#1] SMP PTI RIP: 0010:is_uprobe_at_func_entry+0x28/0x80 ... ? die_addr+0x36/0x90 ? exc_general_protection+0x217/0x420 ? asm_exc_general_protection+0x26/0x30 ? is_uprobe_at_func_entry+0x28/0x80 perf_callchain_user+0x20a/0x360 get_perf_callchain+0x147/0x1d0 bpf_get_stackid+0x60/0x90 bpf_prog_9aac297fb833e2f5_do_perf_event+0x434/0x53b ? __smp_call_single_queue+0xad/0x120 bpf_overflow_handler+0x75/0x110 ... asm_sysvec_apic_timer_interrupt+0x1a/0x20 RIP: 0010:__kmem_cache_free+0x1cb/0x350 ... ? uprobe_free_utask+0x62/0x80 ? acct_collect+0x4c/0x220 uprobe_free_utask+0x62/0x80 mm_release+0x12/0xb0 do_exit+0x26b/0xaa0 __x64_sys_exit+0x1b/0x20 do_syscall_64+0x5a/0x80 It can be easily reproduced by running following commands in separate terminals: # while :; do bpftrace -e 'uprobe:/bin/ls:_start { printf("hit\n"); }' -c ls; done # bpftrace -e 'profile:hz:100000 { @[ustack()] = count(); }' Fixing this by making sure current->utask pointer is set to NULL before we start to release the utask object. [1] https://github.com/grafana/pyroscope/issues/3673 Fixes: cfa7f3d2c526 ("perf,x86: avoid missing caller address in stack traces captured in uprobe") Reported-by: Max Makarov <maxpain@linux.com> Signed-off-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20250109141440.2692173-1-jolsa@kernel.org	2025-01-10 09:28:01 +01:00
Masami Hiramatsu (Google)	30c8fd31c5	tracing/kprobes: Fix to free objects when failed to copy a symbol In __trace_kprobe_create(), if something fails it must goto error block to free objects. But when strdup() a symbol, it returns without that. Fix it to goto the error block to free objects correctly. Link: https://lore.kernel.org/all/173643297743.1514810.2408159540454241947.stgit@devnote2/ Fixes: 6212dd29683e ("tracing/kprobes: Use dyn_event framework for kprobe events") Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>	2025-01-10 08:57:18 +09:00
Peter Zijlstra	6d71a9c616	sched/fair: Fix EEVDF entity placement bug causing scheduling lag I noticed this in my traces today: turbostat-1222 [006] d..2. 311.935649: reweight_entity: (ffff888108f13e00-ffff88885ef38440-6) { weight: 1048576 avg_vruntime: 3184159639071 vruntime: 3184159640194 (-1123) deadline: 3184162621107 } -> { weight: 2 avg_vruntime: 3184177463330 vruntime: 3184748414495 (-570951165) deadline: 4747605329439 } turbostat-1222 [006] d..2. 311.935651: reweight_entity: (ffff888108f13e00-ffff88885ef38440-6) { weight: 2 avg_vruntime: 3184177463330 vruntime: 3184748414495 (-570951165) deadline: 4747605329439 } -> { weight: 1048576 avg_vruntime: 3184176414812 vruntime: 3184177464419 (-1049607) deadline: 3184180445332 } Which is a weight transition: 1048576 -> 2 -> 1048576. One would expect the lag to shoot out AND come back, notably: -1123*1048576/2 = -588775424; -588775424*2/1048576 = -1123 Except the trace shows it is all off. Worse, subsequent cycles shoot it out further and further. This made me have a very hard look at reweight_entity(), and specifically the ->on_rq case, which is more prominent with DELAY_DEQUEUE. And indeed, it is all sorts of broken. While the computation of the new lag is correct, the computation for the new vruntime, using the new lag is broken for it does not consider the logic set out in place_entity(). With the below patch, I now see things like: migration/12-55 [012] d..3. 309.006650: reweight_entity: (ffff8881e0e6f600-ffff88885f235f40-12) { weight: 977582 avg_vruntime: 4860513347366 vruntime: 4860513347908 (-542) deadline: 4860516552475 } -> { weight: 2 avg_vruntime: 4860528915984 vruntime: 4860793840706 (-264924722) deadline: 6427157349203 } migration/14-62 [014] d..3.
309.006698: reweight_entity: (ffff8881e0e6cc00-ffff88885f3b5f40-15) { weight: 2 avg_vruntime: 4874472992283 vruntime: 4939833828823 (-65360836540) deadline: 6316614641111 } -> { weight: 967149 avg_vruntime: 4874217684324 vruntime: 4874217688559 (-4235) deadline: 4874220535650 } Which isn't perfect yet, but much closer. Reported-by: Doug Smythies <dsmythies@telus.net> Reported-by: Ingo Molnar <mingo@kernel.org> Tested-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Fixes: eab03c23c2a1 ("sched/eevdf: Fix vruntime adjustment on reweight") Link: https://lore.kernel.org/r/20250109105959.GA2981@noisy.programming.kicks-ass.net	2025-01-09 12:55:27 +01:00
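The round trip that the commit message above expects for the 1048576 -> 2 -> 1048576 weight transition can be checked with plain integer arithmetic. The sketch below is illustrative Python, not the kernel's reweight_entity(). The helper names `div_trunc` and `rescale_lag` are hypothetical, and the assumption is only that the lag is scaled by old weight over new weight with C-style truncating division.

```python
def div_trunc(n: int, d: int) -> int:
    """Integer division that truncates toward zero, like C's `/` on signed ints."""
    q = abs(n) // abs(d)
    return q if (n >= 0) == (d > 0) else -q


def rescale_lag(lag: int, old_weight: int, new_weight: int) -> int:
    """Hypothetical helper: keep lag * weight constant across a weight change."""
    return div_trunc(lag * old_weight, new_weight)


# Lag and weights taken from the first trace line in the commit message.
lag_heavy = -1123
lag_light = rescale_lag(lag_heavy, 1048576, 2)   # weight 1048576 -> 2
lag_back = rescale_lag(lag_light, 2, 1048576)    # weight 2 -> 1048576

print(lag_light, lag_back)
```

Under these assumptions the lag grows by a factor of 524288 at weight 2 and returns to its starting value when the weight is restored. That is the behaviour the trace in the commit message fails to show before the fix.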
Chen Ridong	3cb97a927f	cgroup/cpuset: remove kernfs active break A warning was found: WARNING: CPU: 10 PID: 3486953 at fs/kernfs/file.c:828 CPU: 10 PID: 3486953 Comm: rmdir Kdump: loaded Tainted: G RIP: 0010:kernfs_should_drain_open_files+0x1a1/0x1b0 RSP: 0018:ffff8881107ef9e0 EFLAGS: 00010202 RAX: 0000000080000002 RBX: ffff888154738c00 RCX: dffffc0000000000 RDX: 0000000000000007 RSI: 0000000000000004 RDI: ffff888154738c04 RBP: ffff888154738c04 R08: ffffffffaf27fa15 R09: ffffed102a8e7180 R10: ffff888154738c07 R11: 0000000000000000 R12: ffff888154738c08 R13: ffff888750f8c000 R14: ffff888750f8c0e8 R15: ffff888154738ca0 FS: 00007f84cd0be740(0000) GS:ffff8887ddc00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000555f9fbe00c8 CR3: 0000000153eec001 CR4: 0000000000370ee0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: kernfs_drain+0x15e/0x2f0 __kernfs_remove+0x165/0x300 kernfs_remove_by_name_ns+0x7b/0xc0 cgroup_rm_file+0x154/0x1c0 cgroup_addrm_files+0x1c2/0x1f0 css_clear_dir+0x77/0x110 kill_css+0x4c/0x1b0 cgroup_destroy_locked+0x194/0x380 cgroup_rmdir+0x2a/0x140 It can be explained by: rmdir echo 1 > cpuset.cpus kernfs_fop_write_iter // active=0 cgroup_rm_file kernfs_remove_by_name_ns kernfs_get_active // active=1 __kernfs_remove // active=0x80000002 kernfs_drain cpuset_write_resmask wait_event //waiting (active == 0x80000001) kernfs_break_active_protection // active = 0x80000001 // continue kernfs_unbreak_active_protection // active = 0x80000002 ... kernfs_should_drain_open_files // warning occurs kernfs_put_active This warning is caused by 'kernfs_break_active_protection' when it is writing to cpuset.cpus, and the cgroup is removed concurrently. 
The commit 3a5a6d0c2b03 ("cpuset: don't nest cgroup_mutex inside get_online_cpus()") made cpuset_hotplug_workfn asynchronous. This change involves calling flush_work(), which can create a multi-process circular locking dependency involving cgroup_mutex, potentially leading to a deadlock. To avoid the deadlock, commit 76bb5ab8f6e3 ("cpuset: break kernfs active protection in cpuset_write_resmask()") added 'kernfs_break_active_protection' in cpuset_write_resmask. This could lead to this warning. After commit 2125c0034c5d ("cgroup/cpuset: Make cpuset hotplug processing synchronous"), cpuset_write_resmask no longer needs to wait for the hotplug to finish, which means that concurrent hotplug and cpuset operations are no longer possible. Therefore, the deadlock no longer exists and there is no need to 'break active protection'. To fix this warning, just remove the kernfs_break_active_protection operation in 'cpuset_write_resmask'. Fixes: bdb2fd7fc56e ("kernfs: Skip kernfs_drain_open_files() more aggressively") Fixes: 76bb5ab8f6e3 ("cpuset: break kernfs active protection in cpuset_write_resmask()") Reported-by: Ji Fa <jifa@huawei.com> Signed-off-by: Chen Ridong <chenridong@huawei.com> Acked-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2025-01-08 15:54:39 -10:00
Honglei Wang	68e449d849	sched_ext: switch class when preempted by higher priority scheduler ops.cpu_release() function, if defined, must be invoked when preempted by a higher priority scheduler class task. This scenario was skipped in commit f422316d7466 ("sched_ext: Remove switch_class_scx()"). Let's fix it. Fixes: f422316d7466 ("sched_ext: Remove switch_class_scx()") Signed-off-by: Honglei Wang <jameshongleiwang@126.com> Acked-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2025-01-08 06:51:40 -10:00
Changwoo Min	6268d5bc10	sched_ext: Replace rq_lock() to raw_spin_rq_lock() in scx_ops_bypass() scx_ops_bypass() iterates all CPUs to re-enqueue all the scx tasks. For each CPU, it acquires a lock using rq_lock() regardless of whether a CPU is offline or the CPU is currently running a task in a higher scheduler class (e.g., deadline). The rq_lock() is supposed to be used for online CPUs, and the use of rq_lock() may trigger an unnecessary warning in rq_pin_lock(). Therefore, replace rq_lock() to raw_spin_rq_lock() in scx_ops_bypass(). Without this change, we observe the following warning: ===== START ===== [ 6.615205] rq->balance_callback && rq->balance_callback != &balance_push_callback [ 6.615208] WARNING: CPU: 2 PID: 0 at kernel/sched/sched.h:1730 __schedule+0x1130/0x1c90 ===== END ===== Fixes: 0e7ffff1b811 ("scx: Fix raciness in scx_ops_bypass()") Signed-off-by: Changwoo Min <changwoo@igalia.com> Acked-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2025-01-08 06:48:53 -10:00
Henry Huang	30dd3b13f9	sched_ext: keep running prev when prev->scx.slice != 0 When %SCX_OPS_ENQ_LAST is set and prev->scx.slice != 0, @prev will be dispatched into the local DSQ in put_prev_task_scx(). However, pick_task_scx() is executed before put_prev_task_scx(), so it will not pick @prev. Set %SCX_RQ_BAL_KEEP in balance_one() to ensure that pick_task_scx() can pick @prev. Signed-off-by: Henry Huang <henry.hj@antgroup.com> Acked-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2025-01-08 06:48:33 -10:00
Linus Torvalds	fbfd64d25c	vfs-6.13-rc7.fixes -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZ3vs1AAKCRCRxhvAZXjc omdqAP9Mn4HF85p5X7WRtUgrF7MGQft3EBfWE+sUxCMTc49NGQD/Ti7hqGNleEih MmjUjLZSG1e3lFHYQm0nqmjO2RexbQ0= =Li7D -----END PGP SIGNATURE----- Merge tag 'vfs-6.13-rc7.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs fixes from Christian Brauner: - Relax assertions on failure to encode file handles The ->encode_fh() method can fail for various reasons. None of them warrant a WARN_ON(). - Fix overlayfs file handle encoding by allowing encoding an fid from an inode without an alias - Make sure fuse_dir_open() handles FOPEN_KEEP_CACHE. If it's not specified fuse needs to invaludate the directory inode page cache - Fix qnx6 so it builds with gcc-15 - Various fixes for netfslib and ceph and nfs filesystems: - Ignore silly rename files from afs and nfs when building header archives - Fix read result collection in netfslib with multiple subrequests - Handle ENOMEM for netfslib buffered reads - Fix oops in nfs_netfs_init_request() - Parse the secctx command immediately in cachefiles - Remove a redundant smp_rmb() in netfslib - Handle recursion in read retry in netfslib - Fix clearing of folio_queue - Fix missing cancellation of copy-to_cache when the cache for a file is temporarly disabled in netfslib - Sanity check the hfs root record - Fix zero padding data issues in concurrent write scenarios - Fix is_mnt_ns_file() after converting nsfs to path_from_stashed() - Fix missing declaration of init_files - Increase I/O priority when writing revoke records in jbd2 - Flush filesystem device before updating tail sequence in jbd2 * tag 'vfs-6.13-rc7.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (23 commits) ovl: support encoding fid from inode with no alias ovl: pass realinode to ovl_encode_real_fh() instead of realdentry fuse: respect FOPEN_KEEP_CACHE on opendir netfs: Fix is-caching check in read-retry netfs: 
Fix the (non-)cancellation of copy when cache is temporarily disabled netfs: Fix ceph copy to cache on write-begin netfs: Work around recursion by abandoning retry if nothing read netfs: Fix missing barriers by using clear_and_wake_up_bit() netfs: Remove redundant use of smp_rmb() cachefiles: Parse the "secctx" immediately nfs: Fix oops in nfs_netfs_init_request() when copying to cache netfs: Fix enomem handling in buffered reads netfs: Fix non-contiguous donation between completed reads kheaders: Ignore silly-rename files fs: relax assertions on failure to encode file handles fs: fix missing declaration of init_files fs: fix is_mnt_ns_file() iomap: fix zero padding data issue in concurrent append writes iomap: pass byte granular end position to iomap_add_to_ioend jbd2: flush filesystem device before updating tail sequence ...	2025-01-06 10:26:39 -08:00
Linus Torvalds	5635d8bad2	25 hotfixes. 16 are cc:stable. 18 are MM and 7 are non-MM. The usual bunch of singletons and two doubletons - please see the relevant changelogs for details. -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZ3noXwAKCRDdBJ7gKXxA jkzRAP9Ejb8kbgCrA3cptnzlVkDCDUm0TmleepT3bx6B2rH0BgEAzSiTXf4ioZPg 4pOHnKIGOWEVPcVwBrdA0irWG+QPYAQ= =nEIZ -----END PGP SIGNATURE----- Merge tag 'mm-hotfixes-stable-2025-01-04-18-02' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull hotfixes from Andrew Morton: "25 hotfixes. 16 are cc:stable. 18 are MM and 7 are non-MM. The usual bunch of singletons and two doubletons - please see the relevant changelogs for details" * tag 'mm-hotfixes-stable-2025-01-04-18-02' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (25 commits) MAINTAINERS: change Arınç ÜNAL's name and email address scripts/sorttable: fix orc_sort_cmp() to maintain symmetry and transitivity mm/util: make memdup_user_nul() similar to memdup_user() mm, madvise: fix potential workingset node list_lru leaks mm/damon/core: fix ignored quota goals and filters of newly committed schemes mm/damon/core: fix new damon_target objects leaks on damon_commit_targets() mm/list_lru: fix false warning of negative counter vmstat: disable vmstat_work on vmstat_cpu_down_prep() mm: shmem: fix the update of 'shmem_falloc->nr_unswapped' mm: shmem: fix incorrect index alignment for within_size policy percpu: remove intermediate variable in PERCPU_PTR() mm: zswap: fix race between [de]compression and CPU hotunplug ocfs2: fix slab-use-after-free due to dangling pointer dqi_priv fs/proc/task_mmu: fix pagemap flags with PMD THP entries on 32bit kcov: mark in_softirq_really() as __always_inline docs: mm: fix the incorrect 'FileHugeMapped' field mailmap: modify the entry for Mathieu Othacehe mm/kmemleak: fix sleeping function called from invalid context at print message mm: hugetlb: independent PMD page table shared count maple_tree: reload 
mas before the second call for mas_empty_area ...	2025-01-05 10:37:45 -08:00
Linus Torvalds	63676eefb7	sched_ext: Fixes for v6.13-rc5 - Fix the bug where bpf_iter_scx_dsq_new() was not initializing the iterator's flags and could inadvertently enable e.g. reverse iteration. - Fix the bug where scx_ops_bypass() could call irq_restore twice. - Add Andrea and Changwoo as maintainers for better review coverage. - selftests and tools/sched_ext build and other fixes. -----BEGIN PGP SIGNATURE----- iIQEABYKACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCZ3hpXg4cdGpAa2VybmVs Lm9yZwAKCRCxYfJx3gVYGS/lAQDOZDfcJtO1VEsLoPY9NhFHPuBDTfoJyjSi/4mh GsjgDAD/Sx0rN6C9S/+ToUjmq3FA+ft0m2+97VqgLwkzwA9YxwI= =jaZ6 -----END PGP SIGNATURE----- Merge tag 'sched_ext-for-6.13-rc5-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext Pull sched_ext fixes from Tejun Heo: - Fix a bug where bpf_iter_scx_dsq_new() was not initializing the iterator's flags and could inadvertently enable e.g. reverse iteration - Fix a bug where scx_ops_bypass() could call irq_restore twice - Add Andrea and Changwoo as maintainers for better review coverage - selftests and tools/sched_ext build and other fixes * tag 'sched_ext-for-6.13-rc5-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext: sched_ext: Fix dsq_local_on selftest sched_ext: initialize kit->cursor.flags sched_ext: Fix invalid irq restore in scx_ops_bypass() MAINTAINERS: add me as reviewer for sched_ext MAINTAINERS: add self as reviewer for sched_ext scx: Fix maximal BPF selftest prog sched_ext: fix application of sizeof to pointer selftests/sched_ext: fix build after renames in sched_ext API sched_ext: Add __weak to fix the build errors	2025-01-03 15:09:12 -08:00
Linus Torvalds	f9aa1fb9f8	workqueue: Fixes for v6.13-rc5 - Suppress a corner case spurious flush dependency warning. - Two trivial changes. -----BEGIN PGP SIGNATURE----- iIQEABYKACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCZ3hmjA4cdGpAa2VybmVs Lm9yZwAKCRCxYfJx3gVYGUrkAP90cajNtGbtFR1J61N4dTSfjBz8L7oQ6GLLyjCB MDxvpQD/ViVVpHBl9/jfObk//p6YMBTBD2Zp/aBc3mkKOVhfqws= =eUNO -----END PGP SIGNATURE----- Merge tag 'wq-for-6.13-rc5-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq Pull workqueue fixes from Tejun Heo: - Suppress a corner case spurious flush dependency warning - Two trivial changes * tag 'wq-for-6.13-rc5-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: workqueue: add printf attribute to __alloc_workqueue() workqueue: Do not warn when cancelling WQ_MEM_RECLAIM work from !WQ_MEM_RECLAIM worker rust: add safety comment in workqueue traits	2025-01-03 15:03:56 -08:00
Linus Torvalds	e30dd219c7	Fixes for ftrace in v6.13: - Add needed READ_ONCE() around access to the fgraph array element The updates to the fgraph array can happen when callbacks are registered and unregistered. The __ftrace_return_to_handler() can handle reading either the old value or the new value. But once it reads that value it must stay consistent otherwise the check that looks to see if the value is a stub may show false, but if the compiler decides to re-read after that check, it can be true which can cause the code to crash later on. - Make function profiler use the top level ops for filtering again When function graph became available for instances, its filter ops became independent from the top level set_ftrace_filter. In the process the function profiler received its own filter ops as well. But the function profiler uses the top level set_ftrace_filter file and does not have one of its own. In giving it its own filter ops, it lost any user interface it once had. Make it use the top level set_ftrace_filter file again. This fixes a regression. -----BEGIN PGP SIGNATURE----- iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCZ3cR4RQccm9zdGVkdEBn b29kbWlzLm9yZwAKCRAp5XQQmuv6qjxfAQCPhNztdmGmEYmuBtONPHwejidWnuJ6 Rl2mQxEbp40OUgD+JvSWofhRsvtXWlymqZ9j+dKMegLqMeq834hB0LK4NAg= =+KqV -----END PGP SIGNATURE----- Merge tag 'ftrace-v6.13-rc5-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull ftrace fixes from Steven Rostedt: - Add needed READ_ONCE() around access to the fgraph array element The updates to the fgraph array can happen when callbacks are registered and unregistered. The __ftrace_return_to_handler() can handle reading either the old value or the new value. But once it reads that value it must stay consistent otherwise the check that looks to see if the value is a stub may show false, but if the compiler decides to re-read after that check, it can be true which can cause the code to crash later on. 
- Make function profiler use the top level ops for filtering again When function graph became available for instances, its filter ops became independent from the top level set_ftrace_filter. In the process the function profiler received its own filter ops as well. But the function profiler uses the top level set_ftrace_filter file and does not have one of its own. In giving it its own filter ops, it lost any user interface it once had. Make it use the top level set_ftrace_filter file again. This fixes a regression. * tag 'ftrace-v6.13-rc5-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: ftrace: Fix function profiler's filtering functionality fgraph: Add READ_ONCE() when accessing fgraph_array[]	2025-01-03 10:04:43 -08:00
Kohei Enju	789a8cff8d	ftrace: Fix function profiler's filtering functionality Commit c132be2c4fcc ("function_graph: Have the instances use their own ftrace_ops for filtering"), function profiler (enabled via function_profile_enabled) has been showing statistics for all functions, ignoring set_ftrace_filter settings. While tracers are instantiated, the function profiler is not. Therefore, it should use the global set_ftrace_filter for consistency. This patch modifies the function profiler to use the global filter, fixing the filtering functionality. Before (filtering not working): ``` root@localhost:~# echo 'vfs' > /sys/kernel/tracing/set_ftrace_filter root@localhost:~# echo 1 > /sys/kernel/tracing/function_profile_enabled root@localhost:~# sleep 1 root@localhost:~# echo 0 > /sys/kernel/tracing/function_profile_enabled root@localhost:~# head /sys/kernel/tracing/trace_stat/ Function Hit Time Avg s^2 -------- --- ---- --- --- schedule 314 22290594 us 70989.15 us 40372231 us x64_sys_call 1527 8762510 us 5738.382 us 3414354 us schedule_hrtimeout_range 176 8665356 us 49234.98 us 405618876 us __x64_sys_ppoll 324 5656635 us 17458.75 us 19203976 us do_sys_poll 324 5653747 us 17449.83 us 19214945 us schedule_timeout 67 5531396 us 82558.15 us 2136740827 us __x64_sys_pselect6 12 3029540 us 252461.7 us 63296940171 us do_pselect.constprop.0 12 3029532 us 252461.0 us 63296952931 us ``` After (filtering working): ``` root@localhost:~# echo 'vfs' > /sys/kernel/tracing/set_ftrace_filter root@localhost:~# echo 1 > /sys/kernel/tracing/function_profile_enabled root@localhost:~# sleep 1 root@localhost:~# echo 0 > /sys/kernel/tracing/function_profile_enabled root@localhost:~# head /sys/kernel/tracing/trace_stat/ Function Hit Time Avg s^2 -------- --- ---- --- --- vfs_write 462 68476.43 us 148.217 us 25874.48 us vfs_read 641 9611.356 us 14.994 us 28868.07 us vfs_fstat 890 878.094 us 0.986 us 1.667 us vfs_fstatat 227 757.176 us 3.335 us 18.928 us vfs_statx 226 610.610 us 2.701 us 17.749 us 
vfs_getattr_nosec 1187 460.919 us 0.388 us 0.326 us vfs_statx_path 297 343.287 us 1.155 us 11.116 us vfs_rename 6 291.575 us 48.595 us 9889.236 us ``` Cc: stable@vger.kernel.org Link: https://lore.kernel.org/20250101190820.72534-1-enjuk@amazon.com Fixes: c132be2c4fcc ("function_graph: Have the instances use their own ftrace_ops for filtering") Signed-off-by: Kohei Enju <enjuk@amazon.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>	2025-01-02 17:21:33 -05:00
Zilin Guan	d654740337	fgraph: Add READ_ONCE() when accessing fgraph_array[] In __ftrace_return_to_handler(), a loop iterates over the fgraph_array[] elements, which are fgraph_ops. The loop checks if an element is a fgraph_stub to prevent using a fgraph_stub afterward. However, if the compiler reloads fgraph_array[] after this check, it might race with an update to fgraph_array[] that introduces a fgraph_stub. This could result in the stub being processed, but the stub contains a null "func_hash" field, leading to a NULL pointer dereference. To ensure that the gops compared against the fgraph_stub matches the gops processed later, add a READ_ONCE(). A similar patch appears in commit 63a8dfb ("function_graph: Add READ_ONCE() when accessing fgraph_array[]"). Cc: stable@vger.kernel.org Fixes: 37238abe3cb47 ("ftrace/function_graph: Pass fgraph_ops to function graph callbacks") Link: https://lore.kernel.org/20241231113731.277668-1-zilin@seu.edu.cn Signed-off-by: Zilin Guan <zilin@seu.edu.cn> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>	2025-01-02 17:21:18 -05:00
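Note on the fgraph_array[] fix above: the point is that the pointer compared against the stub must be the same pointer that is dereferenced afterward. The following is a minimal userspace sketch of that idea, not the kernel code. `READ_ONCE_SKETCH`, `struct ops_sketch`, `stub_sketch`, `array_sketch`, and `sum_hashes_sketch()` are hypothetical names made up for illustration, and the macro is only an approximation of the kernel's `READ_ONCE()` built on a volatile access (it relies on the GCC/Clang `__typeof__` extension).

```c
#include <stddef.h>

/* Userspace approximation of READ_ONCE(); an assumption of this sketch. */
#define READ_ONCE_SKETCH(x) (*(const volatile __typeof__(x) *)&(x))

/* Hypothetical stand-in for struct fgraph_ops. */
struct ops_sketch {
    const int *func_hash;   /* NULL in the stub, as the commit describes */
};

static struct ops_sketch stub_sketch = { .func_hash = NULL };
static int hash_value_sketch = 7;
static struct ops_sketch real_sketch = { .func_hash = &hash_value_sketch };

#define SLOTS_SKETCH 4
static struct ops_sketch *array_sketch[SLOTS_SKETCH] = {
    &stub_sketch, &real_sketch, &stub_sketch, &real_sketch,
};

/* Sum func_hash values of all non-stub entries, loading each slot once. */
int sum_hashes_sketch(void)
{
    int sum = 0;

    for (int i = 0; i < SLOTS_SKETCH; i++) {
        /* Single load: the value checked is the value used below. */
        struct ops_sketch *gops = READ_ONCE_SKETCH(array_sketch[i]);

        if (gops == &stub_sketch)
            continue;
        sum += *gops->func_hash;   /* never reached for the stub */
    }
    return sum;
}
```

Without the single load into `gops`, a compiler would be free to read the array slot again after the comparison, which is the reload the commit message warns about.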
Steven Rostedt	afc6717628	tracing: Have process_string() also allow arrays In order to catch a common bug where a TRACE_EVENT() TP_fast_assign() assigns an address of an allocated string to the ring buffer and then references it in TP_printk(), which can be executed hours later when the string is free, the function test_event_printk() runs on all events as they are registered to make sure there's no unwanted dereferencing. It calls process_string() to handle cases in TP_printk() format that has "%s". It returns whether or not the string is safe. But it can have some false positives. For instance, xe_bo_move() has: TP_printk("move_lacks_source:%s, migrate object %p [size %zu] from %s to %s device_id:%s", __entry->move_lacks_source ? "yes" : "no", __entry->bo, __entry->size, xe_mem_type_to_name[__entry->old_placement], xe_mem_type_to_name[__entry->new_placement], __get_str(device_id)) Where the "%s" references into xe_mem_type_to_name[]. This is an array of pointers that should be safe for the event to access. Instead of flagging this as a bad reference, if a reference points to an array, where the record field is the index, consider it safe. Link: https://lore.kernel.org/all/9dee19b6185d325d0e6fa5f7cbba81d007d99166.camel@sapience.com/ Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://lore.kernel.org/20241231000646.324fb5f7@gandalf.local.home Fixes: 65a25d9f7ac02 ("tracing: Add "%s" check in test_event_printk()") Reported-by: Genes Lists <lists@sapience.com> Tested-by: Gene C <arch@sapience.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>	2024-12-31 00:10:32 -05:00
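Note on the process_string() change above: the case being allowed is a "%s" argument that indexes a static array of strings with a value stored in the event record. A minimal userspace sketch of why that is safe, under the assumption that the array lives for the lifetime of the program. `type_names_sketch`, `struct event_record_sketch`, and `format_event_sketch()` are hypothetical names and are not the xe driver's definitions.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical static table; it is never freed, unlike an allocated string. */
static const char *const type_names_sketch[] = { "system", "vram", "stolen" };
#define TYPE_COUNT_SKETCH (sizeof(type_names_sketch) / sizeof(type_names_sketch[0]))

/* The record stores only indexes, never pointers to strings. */
struct event_record_sketch {
    unsigned int old_placement;
    unsigned int new_placement;
};

/* Format a record later; returns snprintf()'s result, or -1 on a bad index. */
int format_event_sketch(const struct event_record_sketch *rec,
                        char *buf, size_t len)
{
    if (rec->old_placement >= TYPE_COUNT_SKETCH ||
        rec->new_placement >= TYPE_COUNT_SKETCH)
        return -1;

    return snprintf(buf, len, "from %s to %s",
                    type_names_sketch[rec->old_placement],
                    type_names_sketch[rec->new_placement]);
}
```

Because only the index is recorded, printing the event long after it was logged still dereferences valid memory.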
Arnd Bergmann	cb0ca08b32	kcov: mark in_softirq_really() as __always_inline If gcc decides not to inline in_softirq_really(), objtool warns about a function call with UACCESS enabled: kernel/kcov.o: warning: objtool: __sanitizer_cov_trace_pc+0x1e: call to in_softirq_really() with UACCESS enabled kernel/kcov.o: warning: objtool: check_kcov_mode+0x11: call to in_softirq_really() with UACCESS enabled Mark this as __always_inline to avoid the problem. Link: https://lkml.kernel.org/r/20241217071814.2261620-1-arnd@kernel.org Fixes: 7d4df2dad312 ("kcov: properly check for softirq context") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Marco Elver <elver@google.com> Cc: Aleksandr Nogikh <nogikh@google.com> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Josh Poimboeuf <jpoimboe@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-12-30 17:59:08 -08:00
Linus Torvalds	bf7a281b80	Fix missed rtmutex wakeups causing sporadic boot hangs and other misbehavior. Signed-off-by: Ingo Molnar <mingo@kernel.org> -----BEGIN PGP SIGNATURE----- iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmdxC+ERHG1pbmdvQGtl cm5lbC5vcmcACgkQEnMQ0APhK1jvDw/+Kl24Gjai6hy7yFukGRFRkAezx3YRyK8F SM/vg2GzNaottkUSO3ywD//SMoG3qqkBOIukrS8kJXjLNlx1TI6AqGLVA9g9LpMw KFgqvIb4llstsAh7s8coCSJIVOCGcNC306EPfqvrhlU16YqFHRggQUqSiycRXQEd SDSAiNsiez0g0a0x1qI0lbFtF7l/Xht1CxOmpc0NQe8OXZcOXJI1z92DbzDsY+r4 g77sJ3jHT9j3rpz7MPdh4xS8RJnT/E3wAKn5dnS0pSJ58UFOndIgncKoeEpPC3gW 1hFWx+3IC2n0/t4m5TQhtpSFv0W4tkhwWOMI7JlRw2Sx2z0T/gnJsYH7E+DSu138 XYmRCuW+BHrFjG+Pns4bpndf8Gy2HSHjvp0AB9iUqzfIkWVkQNjBdonfdvY5pey0 EwkxCZPcWT8j0HehM9MhntYojfgy/Au/Z+EOZQSDDHKLAvkkE5ai1FPCjvhBxrCe FGko03zS77O+yayTFwXdtbn0StM1Bfa8WcCKxAKErsYqOOB4AP1bJWAknBKw0O4b Kj2nVSf7etDcue6sey9HWd1+pNzUsAlsuRM+bsa/dp2rxHxbbHVVHV1Yy0jTgHTL RkK8C3FyZbya4nhl0qY7kYudes37S8aT8AQEvyJ9/Y0aLURuESzdxiX1Knk0W2zs WsRnDI85Yq0= =0Vde -----END PGP SIGNATURE----- Merge tag 'locking-urgent-2024-12-29' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull locking fix from Ingo Molnar: "Fix missed rtmutex wakeups causing sporadic boot hangs and other misbehavior" * tag 'locking-urgent-2024-12-29' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: locking/rtmutex: Make sure we wake anything on the wake_q when we release the lock->wait_lock	2024-12-29 10:04:47 -08:00
Linus Torvalds	411a678d30	Probes fixes for v6.13-rc4: - tracing/kprobes: Change the priority of the module callback of kprobe events so that it is called after the jump label list on the module is updated. This ensures the kprobe can check whether it is not on the jump label address correctly. -----BEGIN PGP SIGNATURE----- iQFPBAABCgA5FiEEh7BulGwFlgAOi5DV2/sHvwUrPxsFAmduAMgbHG1hc2FtaS5o aXJhbWF0c3VAZ21haWwuY29tAAoJENv7B78FKz8bJ6YH/2QBkWNTe3qjxdPsTxJ2 MyL2PO8tMwZbNSyYZ1yGnbguWUUKVkuiheS/qWhLNpuVEyb6Q9/Zuifh5rFqDbf0 Ug3YvsP7gQurmqDm1NGlnMic3zlmZaYDtXCKB+kiA3HO3iP92zesTJlasiok3aSd sQphxUzmG41BQUDN5/LktGjVb5juf3Xq6i6bdCd6wunUbGWCEE+XmFrg1oVnutES GTckUGswUBGbgkcVPc07UfKZpNzZdyZlmbVfOISCdYIAddUKftATN7SaUrM29oqC /lkUcxeXSVXBIUkbA1p50nfjYzTWNeXG92WrvMrRZjNivyMf/nUJnxrlHsv5h2Dy gtI= =d3Zj -----END PGP SIGNATURE----- Merge tag 'probes-fixes-v6.13-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull probes fix from Masami Hiramatsu: "Change the priority of the module callback of kprobe events so that it is called after the jump label list on the module is updated. This ensures the kprobe can check whether it is not on the jump label address correctly" * tag 'probes-fixes-v6.13-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: tracing/kprobe: Make trace_kprobe's module callback called after jump_label update	2024-12-27 11:03:15 -08:00
Henry Huang	35bf430e08	sched_ext: initialize kit->cursor.flags struct bpf_iter_scx_dsq *it may not be initialized. If scx_bpf_dsq_move_set_vtime and scx_bpf_dsq_move_set_slice are not called before scx_bpf_dsq_move, this can cause unexpected behavior: 1. Assign a huge slice into p->scx.slice 2. Assign an invalid vtime into p->scx.dsq_vtime Signed-off-by: Henry Huang <henry.hj@antgroup.com> Fixes: 6462dd53a260 ("sched_ext: Compact struct bpf_iter_scx_dsq_kern") Cc: stable@vger.kernel.org # v6.12 Signed-off-by: Tejun Heo <tj@kernel.org>	2024-12-24 10:56:08 -10:00
Su Hui	d57212f281	workqueue: add printf attribute to __alloc_workqueue() Fix a compiler warning with W=1: kernel/workqueue.c: error: function ‘__alloc_workqueue’ might be a candidate for ‘gnu_printf’ format attribute[-Werror=suggest-attribute=format] 5657 \| name_len = vsnprintf(wq->name, sizeof(wq->name), fmt, args); \| ^~~~~~~~ Fixes: 9b59a85a84dc ("workqueue: Don't call va_start / va_end twice") Signed-off-by: Su Hui <suhui@nfschina.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2024-12-24 09:50:38 -10:00
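Note on the warning above: GCC suggests the format attribute for any function that forwards a format string and a `va_list` to a printf-family function. A minimal userspace sketch of such a function with the attribute applied. `format_name_sketch()` is a hypothetical helper, not the kernel's `__alloc_workqueue()`, and it spells out the GCC/Clang attribute directly rather than using the kernel's `__printf()` wrapper.

```c
#include <stdarg.h>
#include <stdio.h>

/*
 * Argument 3 is the format string and the variadic arguments start at
 * argument 4, so the compiler can check callers' format/argument pairs.
 */
__attribute__((format(printf, 3, 4)))
int format_name_sketch(char *buf, size_t len, const char *fmt, ...)
{
    va_list args;
    int n;

    va_start(args, fmt);
    n = vsnprintf(buf, len, fmt, args);   /* the call that triggers the hint */
    va_end(args);

    return n;
}
```

With the attribute in place, a call such as `format_name_sketch(buf, sizeof(buf), "wq-%d", "oops")` is diagnosed at compile time when format warnings are enabled.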
Lizhi Xu	98feccbf32	tracing: Prevent bad count for tracing_cpumask_write If a large count is provided, it will trigger a warning in bitmap_parse_user. Also check zero for it. Cc: stable@vger.kernel.org Fixes: 9e01c1b74c953 ("cpumask: convert kernel trace functions") Link: https://lore.kernel.org/20241216073238.2573704-1-lizhi.xu@windriver.com Reported-by: syzbot+0aecfd34fb878546f3fd@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=0aecfd34fb878546f3fd Tested-by: syzbot+0aecfd34fb878546f3fd@syzkaller.appspotmail.com Signed-off-by: Lizhi Xu <lizhi.xu@windriver.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>	2024-12-23 21:59:15 -05:00
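Note on the count check above: the commit rejects both a zero count and an overly large one before the user buffer is parsed. A minimal userspace sketch of that kind of guard. `check_count_sketch()` and the `MAX_COUNT_SKETCH` limit are hypothetical; the commit message does not state the bound the kernel actually uses.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical upper bound, chosen only for this sketch. */
#define MAX_COUNT_SKETCH 4096

/* Return the count if it is acceptable, or -EINVAL if it is 0 or too large. */
long check_count_sketch(size_t count)
{
    if (count == 0 || count > MAX_COUNT_SKETCH)
        return -EINVAL;

    return (long)count;
}
```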
Masami Hiramatsu (Google)	d685d55dfc	tracing/kprobe: Make trace_kprobe's module callback called after jump_label update Make sure the trace_kprobe's module notifier callback function is called after jump_label's callback is called. Since the trace_kprobe's callback eventually checks jump_label address while registering a new kprobe on the module being loaded, jump_label must be updated before this registration happens. Link: https://lore.kernel.org/all/173387585556.995044.3157941002975446119.stgit@devnote2/ Fixes: 614243181050 ("tracing/kprobes: Support module init function probing") Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>	2024-12-24 00:08:13 +09:00
Peter Zijlstra	630a937016	Lockdep changes for v6.14: - Use swap() macro in the ww_mutex test. - Minor fixes and documentation for lockdep configs on internal data structure sizes. - Some "-Wunused-function" warning fixes for Clang. Rust locking changes for v6.14: - Add Rust locking files into LOCKING PRIMITIVES maintainer entry. - Add `Lock<(), ..>::from_raw()` function to support abstraction on low level locking. - Expose `Guard::new()` for public usage and add type alias for spinlock and mutex guards. - Add lockdep checking when creating a new lock `Guard`. -----BEGIN PGP SIGNATURE----- iQEzBAABCAAdFiEEj5IosQTPz8XU1wRHSXnow7UH+rgFAmdl/LoACgkQSXnow7UH +rhNrAf/epZAkkTmFgSqdx0ZNtKUA14Hqp9ie7SJylU6B9dfXmvZzaNBlowk5Edq yGGJQYuzuT+PFYZkNEuSZYcrqUq+b4s8MyF/8h3+lyZT6p1Jhapvq16id5yA1u0l MxMqAZC1D1ruDev2H8IxLlhHlDsSYS0erVNB2ZTFJwL0rZNyUXMZ4Y/o972GjAPt 8g9NlPB3ZTCVmyVtwy7rCexSuVTGDE3BRL9/W9q8eMZNnHq46xDsHRrn9NO4cDmv FogniH9xjFYetZMilYkpHwygAMX1P2t6x29Q+u464bStIWIOjkthYjkoePNXwZQd XgvN37j508VHLJ3sod38+IpnfhlZHA== =IJvk -----END PGP SIGNATURE----- Merge tag 'lockdep-for-tip.20241220' of git://git.kernel.org/pub/scm/linux/kernel/git/boqun/linux into locking/core Lockdep changes for v6.14: - Use swap() macro in the ww_mutex test. - Minor fixes and documentation for lockdep configs on internal data structure sizes. - Some "-Wunused-function" warning fixes for Clang. Rust locking changes for v6.14: - Add Rust locking files into LOCKING PRIMITIVES maintainer entry. - Add `Lock<(), ..>::from_raw()` function to support abstraction on low level locking. - Expose `Guard::new()` for public usage and add type alias for spinlock and mutex guards. - Add lockdep checking when creating a new lock `Guard`.	2024-12-22 12:43:31 +01:00
Linus Torvalds	4aa748dd1a	25 hotfixes. 16 are cc:stable. 19 are MM and 6 are non-MM. The usual bunch of singletons and doubletons - please see the relevant changelogs for details. -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZ2cghQAKCRDdBJ7gKXxA jgrsAQCvlSmHYYLXBE1A6cram4qWgEP/2vD94d6sVv9UipO/FAEA8y1K7dbT2AGX A5ESuRndu5Iy76mb6Tiarqa/yt56QgU= =ZYVx -----END PGP SIGNATURE----- Merge tag 'mm-hotfixes-stable-2024-12-21-12-09' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull misc fixes from Andrew Morton: "25 hotfixes. 16 are cc:stable. 19 are MM and 6 are non-MM. The usual bunch of singletons and doubletons - please see the relevant changelogs for details" * tag 'mm-hotfixes-stable-2024-12-21-12-09' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (25 commits) mm: huge_memory: handle strsep not finding delimiter alloc_tag: fix set_codetag_empty() when !CONFIG_MEM_ALLOC_PROFILING_DEBUG alloc_tag: fix module allocation tags populated area calculation mm/codetag: clear tags before swap mm/vmstat: fix a W=1 clang compiler warning mm: convert partially_mapped set/clear operations to be atomic nilfs2: fix buffer head leaks in calls to truncate_inode_pages() vmalloc: fix accounting with i915 mm/page_alloc: don't call pfn_to_page() on possibly non-existent PFN in split_large_buddy() fork: avoid inappropriate uprobe access to invalid mm nilfs2: prevent use of deleted inode zram: fix uninitialized ZRAM not releasing backing device zram: refuse to use zero sized block device as backing device mm: use clear_user_(high)page() for arch with special user folio handling mm: introduce cpu_icache_is_aliasing() across all architectures mm: add RCU annotation to pte_offset_map(_lock) mm: correctly reference merged VMA mm: use aligned address in copy_user_gigantic_page() mm: use aligned address in clear_gigantic_page() mm: shmem: fix ShmemHugePages at swapout ...	2024-12-21 15:31:56 -08:00
Linus Torvalds	9c707ba99f	BPF fixes: - Fix inlining of bpf_get_smp_processor_id helper for !CONFIG_SMP systems (Andrea Righi) - Fix BPF USDT selftests helper code to use asm constraint "m" for LoongArch (Tiezhu Yang) - Fix BPF selftest compilation error in get_uprobe_offset when PROCMAP_QUERY is not defined (Jerome Marchand) - Fix BPF bpf_skb_change_tail helper when used in context of BPF sockmap to handle negative skb header offsets (Cong Wang) - Several fixes to BPF sockmap code, among others, in the area of socket buffer accounting (Levi Zim, Zijian Zhang, Cong Wang) Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> -----BEGIN PGP SIGNATURE----- iIsEABYKADMWIQTFp0I1jqZrAX+hPRXbK58LschIgwUCZ2YJABUcZGFuaWVsQGlv Z2VhcmJveC5uZXQACgkQ2yufC7HISINDEgD+N4uVg+rp8Z8pg9jcai4WUERmRG20 NcQTfBXczLHkwIcBALvn7NVvbTAINJzBTnukbjX3XbWFz2cJ/xHxDYXycP4I =SwXG -----END PGP SIGNATURE----- Merge tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf Pull BPF fixes from Daniel Borkmann: - Fix inlining of bpf_get_smp_processor_id helper for !CONFIG_SMP systems (Andrea Righi) - Fix BPF USDT selftests helper code to use asm constraint "m" for LoongArch (Tiezhu Yang) - Fix BPF selftest compilation error in get_uprobe_offset when PROCMAP_QUERY is not defined (Jerome Marchand) - Fix BPF bpf_skb_change_tail helper when used in context of BPF sockmap to handle negative skb header offsets (Cong Wang) - Several fixes to BPF sockmap code, among others, in the area of socket buffer accounting (Levi Zim, Zijian Zhang, Cong Wang) * tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf: selftests/bpf: Test bpf_skb_change_tail() in TC ingress selftests/bpf: Introduce socket_helpers.h for TC tests selftests/bpf: Add a BPF selftest for bpf_skb_change_tail() bpf: Check negative offsets in __bpf_skb_min_len() tcp_bpf: Fix copied value in tcp_bpf_sendmsg skmsg: Return copied bytes in sk_msg_memcopy_from_iter tcp_bpf: Add sk_rmem_alloc related logic for tcp_bpf 
ingress redirection tcp_bpf: Charge receive socket buffer in bpf_tcp_ingress() selftests/bpf: Fix compilation error in get_uprobe_offset() selftests/bpf: Use asm constraint "m" for LoongArch bpf: Fix bpf_get_smp_processor_id() on !CONFIG_SMP	2024-12-21 11:07:19 -08:00
David Howells	973b710b88	kheaders: Ignore silly-rename files Tell tar to ignore silly-rename files (".__afs" and ".nfs") when building the header archive. These occur when a file that is open is unlinked locally, but hasn't yet been closed. Such files are visible to the user via the getdents() syscall and so programs may want to do things with them. During the kernel build, such files may be made during the processing of header files and the cleanup may get deferred by fput() which may result in tar seeing these files when it reads the directory, but they may have disappeared by the time it tries to open them, causing tar to fail with an error. Further, we don't want to include them in the tarball if they still exist. With CONFIG_HEADERS_INSTALL=y, something like the following may be seen: find: './kernel/.tmp_cpio_dir/include/dt-bindings/reset/.__afs2080': No such file or directory tar: ./include/linux/greybus/.__afs3C95: File removed before we read it The find warning doesn't seem to cause a problem. Fix this by telling tar, when called from gen_kheaders.sh, to exclude such files. This only affects afs and nfs; cifs uses the Windows Hidden attribute to prevent the file from being seen. Signed-off-by: David Howells <dhowells@redhat.com> Link: https://lore.kernel.org/r/20241213135013.2964079-2-dhowells@redhat.com cc: Masahiro Yamada <masahiroy@kernel.org> cc: Marc Dionne <marc.dionne@auristor.com> cc: linux-afs@lists.infradead.org cc: linux-nfs@vger.kernel.org cc: linux-kernel@vger.kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-12-20 22:07:55 +01:00
Linus Torvalds	5b83bcdea5	ring-buffer fixes for v6.13: - Fix possible overflow of mmapped ring buffer with bad offset If the mmap() to the ring buffer passes in a start address that is past the end of the mmapped file, it is not caught and a slab-out-of-bounds is triggered. Add a check to make sure the start address is within the bounds - Do not use TP_printk() to boot mapped ring buffers As a boot mapped ring buffer's data may have pointers that map to the previous boot's memory map, it is unsafe to allow the TP_printk() to be used to read the boot mapped buffer's events. If a TP_printk() points to a static string from within the kernel it will not match the current kernel mapping if KASLR is active, and it can fault. Have it simply print out the raw fields. -----BEGIN PGP SIGNATURE----- iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCZ2QuXRQccm9zdGVkdEBn b29kbWlzLm9yZwAKCRAp5XQQmuv6qncvAQDf2s2WWsy4pYp2mpRtBXvAPf6tpBdi J9eceJQbwJVJHAEApQjEFfbUxLh2WgPU1Cn++PwDA+NLiru70+S0vtDLWwE= =OI+v -----END PGP SIGNATURE----- Merge tag 'trace-ringbuffer-v6.13-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull ring-buffer fixes from Steven Rostedt: - Fix possible overflow of mmapped ring buffer with bad offset If the mmap() to the ring buffer passes in a start address that is past the end of the mmapped file, it is not caught and a slab-out-of-bounds is triggered. Add a check to make sure the start address is within the bounds - Do not use TP_printk() to boot mapped ring buffers As a boot mapped ring buffer's data may have pointers that map to the previous boot's memory map, it is unsafe to allow the TP_printk() to be used to read the boot mapped buffer's events. If a TP_printk() points to a static string from within the kernel it will not match the current kernel mapping if KASLR is active, and it can fault. Have it simply print out the raw fields. 
* tag 'trace-ringbuffer-v6.13-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: trace/ring-buffer: Do not use TP_printk() formatting for boot mapped buffers ring-buffer: Fix overflow in __rb_map_vma	2024-12-20 10:13:26 -08:00
John Stultz	abfdccd6af	sched/wake_q: Add helper to call wake_up_q after unlock with preemption disabled A common pattern seen when wake_qs are used to defer a wakeup until after a lock is released is something like: preempt_disable(); raw_spin_unlock(lock); wake_up_q(wake_q); preempt_enable(); So create some raw_spin_unlock*_wake() helper functions to clean this up. Applies on top of the fix I submitted here: https://lore.kernel.org/lkml/20241212222138.2400498-1-jstultz@google.com/ NOTE: I recognise the unlock()/unlock_irq()/unlock_irqrestore() variants create their own duplication, and we could use a macro to generate the similar functions, but I often dislike how those generation macros make finding the actual implementation harder, so I left the three functions as is. If folks would prefer otherwise, let me know and I'll switch it. Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: John Stultz <jstultz@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20241217040803.243420-1-jstultz@google.com	2024-12-20 15:31:21 +01:00
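Note on the wake_q helper above: the pattern is to queue waiters while the lock is held and perform the wakeups only after it is released, with the unlock and wakeup wrapped in one helper. The following is a single-threaded userspace sketch of that ordering, not the kernel helpers. Every name ending in `_sketch` is hypothetical, the "lock" is a plain flag, and the kernel's `preempt_disable()`/`preempt_enable()` calls have no equivalent here and appear only as comments.

```c
#include <stddef.h>

#define WAKE_Q_MAX_SKETCH 8

/* Hypothetical stand-ins for raw_spinlock_t and struct wake_q_head. */
struct lock_sketch { int held; };
struct wake_q_sketch { int ids[WAKE_Q_MAX_SKETCH]; size_t n; };

/* Bookkeeping so the ordering can be checked afterward. */
static size_t woken_sketch;
static size_t woken_while_locked_sketch;

static void lock_acquire_sketch(struct lock_sketch *l) { l->held = 1; }
static void lock_release_sketch(struct lock_sketch *l) { l->held = 0; }

/* Queue a waiter under the lock; the wakeup itself is deferred. */
static void wake_q_add_sketch(struct wake_q_sketch *q, int id)
{
    if (q->n < WAKE_Q_MAX_SKETCH)
        q->ids[q->n++] = id;
}

/* "Wake" every queued waiter and record whether the lock was still held. */
static void wake_up_q_sketch(struct wake_q_sketch *q, const struct lock_sketch *l)
{
    for (size_t i = 0; i < q->n; i++) {
        woken_sketch++;
        if (l->held)
            woken_while_locked_sketch++;
    }
    q->n = 0;
}

/* Helper in the spirit of raw_spin_unlock*_wake(): unlock, then wake. */
static void unlock_wake_sketch(struct lock_sketch *l, struct wake_q_sketch *q)
{
    /* kernel: preempt_disable() would go here */
    lock_release_sketch(l);
    wake_up_q_sketch(q, l);
    /* kernel: preempt_enable() would go here */
}

/* Queue two waiters under the lock, release it, and return the wake count. */
size_t release_and_wake_sketch(void)
{
    struct lock_sketch lock = { 0 };
    struct wake_q_sketch q = { .n = 0 };

    lock_acquire_sketch(&lock);
    wake_q_add_sketch(&q, 1);
    wake_q_add_sketch(&q, 2);
    unlock_wake_sketch(&lock, &q);

    return woken_sketch;
}
```

Callers then make one call instead of repeating the four-line sequence quoted in the commit message.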
Peter Zijlstra	c2db11a750	Merge branch 'locking/urgent' Sync with urgent -- avoid conflicts. Signed-off-by: Peter Zijlstra <peterz@infradead.org>	2024-12-20 15:31:19 +01:00
Swapnil Sapkal	7c8cd569ff	docs: Update Schedstat version to 17 Update the Schedstat version to 17, as more fields have been added to report different kinds of imbalances in the sched domain. Also, the domain field now prints the corresponding domain name. Signed-off-by: Swapnil Sapkal <swapnil.sapkal@amd.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20241220063224.17767-7-swapnil.sapkal@amd.com	2024-12-20 15:31:18 +01:00
K Prateek Nayak	011b3a14dc	sched/stats: Print domain name in /proc/schedstat Currently, there does not exist a straightforward way to extract the names of the sched domains and match them to the per-cpu domain entry in /proc/schedstat other than looking at the debugfs files which are only visible after enabling "verbose" debug after commit 34320745dfc9 ("sched/debug: Put sched/domains files under the verbose flag"). Since tools like `perf sched stats`[1] require displaying per-domain information in a user-friendly manner, display the names of the sched domains, alongside their level in /proc/schedstat. Domain names also make the /proc/schedstat data unambiguous when some of the cpus are offline. For example, on a 128-CPU AMD Zen3 machine where CPU0 and CPU64 are SMT siblings and CPU64 is offline: Before: cpu0 ... domain0 ... domain1 ... cpu1 ... domain0 ... domain1 ... domain2 ... After: cpu0 ... domain0 MC ... domain1 PKG ... cpu1 ... domain0 SMT ... domain1 MC ... domain2 PKG ... [1] https://lore.kernel.org/lkml/20241122084452.1064968-1-swapnil.sapkal@amd.com/ Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com> Signed-off-by: Ravi Bangoria <ravi.bangoria@amd.com> Signed-off-by: Swapnil Sapkal <swapnil.sapkal@amd.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: James Clark <james.clark@linaro.org> Link: https://lore.kernel.org/r/20241220063224.17767-6-swapnil.sapkal@amd.com	2024-12-20 15:31:18 +01:00
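Note on the new /proc/schedstat layout above: with the name printed after the level, a tool can read both from the start of each domain line. A minimal sketch of that parsing step, assuming only the "domain<level> <name> ..." prefix shown in the "After" example. `parse_domain_sketch()` is a hypothetical helper, and the fields that follow the name are elided in the commit message, so the sketch ignores them.

```c
#include <stdio.h>

/*
 * Parse the "domain<level> <name>" prefix of a schedstat line.
 * Returns 0 on success, -1 if the line is not a domain line.
 * name must have room for 16 bytes.
 */
int parse_domain_sketch(const char *line, int *level, char *name)
{
    /* %15s bounds the copy so a long token cannot overflow name[16]. */
    if (sscanf(line, "domain%d %15s", level, name) != 2)
        return -1;

    return 0;
}
```

On the older layout, where no name is printed, the second token would be whatever field comes next, so a real tool would presumably check the schedstat version before trusting the name.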
Swapnil Sapkal	1c055a0f5d	sched: Move sched domain name out of CONFIG_SCHED_DEBUG /proc/schedstat file shows cpu and sched domain level scheduler statistics. It does not show the domain name; instead it shows the domain level. It will be very useful for tools like `perf sched stats`[1] to aggregate domain level stats if domain names are shown in /proc/schedstat. But sched domain name is guarded by CONFIG_SCHED_DEBUG. As per the discussion[2], move sched domain name out of CONFIG_SCHED_DEBUG. [1] https://lore.kernel.org/lkml/20241122084452.1064968-1-swapnil.sapkal@amd.com/ [2] https://lore.kernel.org/lkml/fcefeb4d-3acb-462d-9c9b-3df8d927e522@amd.com/ Suggested-by: "Gautham R. Shenoy" <gautham.shenoy@amd.com> Signed-off-by: Swapnil Sapkal <swapnil.sapkal@amd.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20241220063224.17767-5-swapnil.sapkal@amd.com	2024-12-20 15:31:17 +01:00
Swapnil Sapkal	3b2a793ea7	sched: Report the different kinds of imbalances in /proc/schedstat In /proc/schedstat, lb_imbalance reports the sum of imbalances discovered in sched domains with each call to sched_balance_rq(), which is not very useful because lb_imbalance does not mention whether the imbalance is due to load, utilization, nr_tasks or misfit_tasks. Remove this field from /proc/schedstat. Currently there is no field in /proc/schedstat to report different types of imbalances. Introduce new fields in /proc/schedstat to report the total imbalances in load, utilization, nr_tasks or misfit_tasks. Added fields to /proc/schedstat: - lb_imbalance_load: Total imbalance due to load. - lb_imbalance_util: Total imbalance due to utilization. - lb_imbalance_task: Total imbalance due to number of tasks. - lb_imbalance_misfit: Total imbalance due to misfit tasks. Signed-off-by: Swapnil Sapkal <swapnil.sapkal@amd.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Shrikanth Hegde <sshegde@linux.ibm.com> Link: https://lore.kernel.org/r/20241220063224.17767-4-swapnil.sapkal@amd.com	2024-12-20 15:31:17 +01:00
Peter Zijlstra	c3856c9ce6	sched/fair: Cleanup in migrate_degrades_locality() to improve readability migrate_degrades_locality() would return {1, 0, -1} respectively to indicate that migration would degrade locality, would improve locality, would be ambivalent to locality improvements. This patch improves readability by changing the return value to mean: * Any positive value degrades locality * 0 migration doesn't affect locality * Any negative value improves locality [Swapnil: Fixed comments around code and wrote commit log] Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Not-yet-signed-off-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Swapnil Sapkal <swapnil.sapkal@amd.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20241220063224.17767-3-swapnil.sapkal@amd.com	2024-12-20 15:31:17 +01:00
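The change of return convention described in the commit above can be illustrated with a small Python model. This is not the kernel's C code: `describe_locality_effect` and `old_to_new` are hypothetical helpers that only encode the two conventions as stated in the commit message.

```python
def describe_locality_effect(ret):
    """Interpret a return value under the new, sign-based convention."""
    if ret > 0:
        return "degrades locality"
    if ret < 0:
        return "improves locality"
    return "does not affect locality"


def old_to_new(old_ret):
    """Map the old {1, 0, -1} convention onto the new sign-based one.

    Old convention per the commit message: 1 = degrades, 0 = improves,
    -1 = ambivalent. The concrete values returned here are illustrative;
    only their sign carries meaning.
    """
    mapping = {1: 1, 0: -1, -1: 0}
    return mapping[old_ret]


if __name__ == "__main__":
    for old in (1, 0, -1):
        new = old_to_new(old)
        print(old, "->", new, "->", describe_locality_effect(new))
```

Under the new convention the sign alone carries the meaning, so a caller can test for `> 0` or `< 0` without remembering which of three magic values stood for which case.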
Peter Zijlstra	a430d99e34	sched/fair: Fix value reported by hot tasks pulled in /proc/schedstat In /proc/schedstat, lb_hot_gained reports the number of hot tasks pulled during load balance. This value is incremented in can_migrate_task() if the task is migratable and hot. After incrementing the value, the load balancer can still decide not to migrate this task, leading to wrong accounting. Fix this by incrementing stats when hot tasks are detached. This issue only exists in detach_tasks() where we can decide to not migrate a hot task even if it is migratable. However, in detach_one_task(), we migrate it unconditionally. [Swapnil: Handled the case where nr_failed_migrations_hot was not accounted properly and wrote commit log] Fixes: d31980846f96 ("sched: Move up affinity check to mitigate useless redoing overhead") Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reported-by: "Gautham R. Shenoy" <gautham.shenoy@amd.com> Not-yet-signed-off-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Swapnil Sapkal <swapnil.sapkal@amd.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20241220063224.17767-2-swapnil.sapkal@amd.com	2024-12-20 15:31:16 +01:00
Sebastian Andrzej Siewior	ee8118c1f1	sched/fair: Update comments after sched_tick() rename. scheduler_tick() was renamed to sched_tick() in 86dd6c04ef9f2 ("sched/balancing: Rename scheduler_tick() => sched_tick()"). Update comments still referring to scheduler_tick. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20241219085839.302378-1-bigeasy@linutronix.de	2024-12-20 15:31:16 +01:00

46682 Commits