linux-next

mirror of https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git synced 2025-01-15 11:47:47 +00:00

Author	SHA1	Message	Date
Masami Hiramatsu (Google)	4e83017e4c	tracing/eprobe: Adopt guard() and scoped_guard() Use guard() or scoped_guard() in eprobe events for critical sections rather than discrete lock/unlock pairs. Link: https://lore.kernel.org/all/173289890996.73724.17421347964110362029.stgit@devnote2/ Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>	2025-01-10 09:00:12 +09:00
Masami Hiramatsu (Google)	f8821732dc	tracing/uprobe: Adopt guard() and scoped_guard() Use guard() or scoped_guard() in uprobe events for critical sections rather than discrete lock/unlock pairs. Link: https://lore.kernel.org/all/173289889911.73724.12457932738419630525.stgit@devnote2/ Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>	2025-01-10 09:00:12 +09:00
Masami Hiramatsu (Google)	2cba0070cd	tracing/kprobe: Adopt guard() and scoped_guard() Use guard() or scoped_guard() in kprobe events for critical sections rather than discrete lock/unlock pairs. Link: https://lore.kernel.org/all/173289888883.73724.6586200652276577583.stgit@devnote2/ Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>	2025-01-10 09:00:12 +09:00
Masami Hiramatsu (Google)	587e8e6d64	kprobes: Adopt guard() and scoped_guard() Use guard() or scoped_guard() for critical sections rather than discrete lock/unlock pairs. Link: https://lore.kernel.org/all/173289887835.73724.608223217359025939.stgit@devnote2/ Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>	2025-01-10 09:00:12 +09:00
Thomas Weißschuh	1703abb4de	kprobes: Reduce preempt disable scope in check_kprobe_access_safe() Commit a189d0350f387 ("kprobes: disable preempt for module_text_address() and kernel_text_address()") introduced a preempt_disable() region to protect against concurrent module unloading. However this region also includes the call to jump_label_text_reserved() which takes a long time; up to 400us, iterating over approx 6000 jump tables. The scope protected by preempt_disable() is largen than necessary. core_kernel_text() does not need to be protected as it does not interact with module code at all. Only the scope from __module_text_address() to try_module_get() needs to be protected. By limiting the critical section to __module_text_address() and try_module_get() the function responsible for the latency spike remains preemptible. This works fine even when !CONFIG_MODULES as in that case try_module_get() will always return true and that block can be optimized away. Limit the critical section to __module_text_address() and try_module_get(). Use guard(preempt)() for easier error handling. While at it also remove a spurious *probed_mod = NULL in an error path. On errors the output parameter is never inspected by the caller. Some error paths were clearing the parameters, some didn't. Align them for clarity. Link: https://lore.kernel.org/all/20241121-kprobes-preempt-v1-1-fd581ee7fcbb@linutronix.de/ Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de> Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>	2025-01-10 09:00:12 +09:00
Masami Hiramatsu (Google)	30c8fd31c5	tracing/kprobes: Fix to free objects when failed to copy a symbol In __trace_kprobe_create(), if something fails it must goto error block to free objects. But when strdup() a symbol, it returns without that. Fix it to goto the error block to free objects correctly. Link: https://lore.kernel.org/all/173643297743.1514810.2408159540454241947.stgit@devnote2/ Fixes: 6212dd29683e ("tracing/kprobes: Use dyn_event framework for kprobe events") Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>	2025-01-10 08:57:18 +09:00
Alexei Starovoitov	6e90b3222a	Merge branch 'bpf-next/master' into for-next Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-01-09 07:42:08 -08:00
Peter Zijlstra	6d71a9c616	sched/fair: Fix EEVDF entity placement bug causing scheduling lag I noticed this in my traces today: turbostat-1222 [006] d..2. 311.935649: reweight_entity: (ffff888108f13e00-ffff88885ef38440-6) { weight: 1048576 avg_vruntime: 3184159639071 vruntime: 3184159640194 (-1123) deadline: 3184162621107 } -> { weight: 2 avg_vruntime: 3184177463330 vruntime: 3184748414495 (-570951165) deadline: 4747605329439 } turbostat-1222 [006] d..2. 311.935651: reweight_entity: (ffff888108f13e00-ffff88885ef38440-6) { weight: 2 avg_vruntime: 3184177463330 vruntime: 3184748414495 (-570951165) deadline: 4747605329439 } -> { weight: 1048576 avg_vruntime: 3184176414812 vruntime: 3184177464419 (-1049607) deadline: 3184180445332 } Which is a weight transition: 1048576 -> 2 -> 1048576. One would expect the lag to shoot out AND come back, notably: -11231048576/2 = -588775424 -5887754242/1048576 = -1123 Except the trace shows it is all off. Worse, subsequent cycles shoot it out further and further. This made me have a very hard look at reweight_entity(), and specifically the ->on_rq case, which is more prominent with DELAY_DEQUEUE. And indeed, it is all sorts of broken. While the computation of the new lag is correct, the computation for the new vruntime, using the new lag is broken for it does not consider the logic set out in place_entity(). With the below patch, I now see things like: migration/12-55 [012] d..3. 309.006650: reweight_entity: (ffff8881e0e6f600-ffff88885f235f40-12) { weight: 977582 avg_vruntime: 4860513347366 vruntime: 4860513347908 (-542) deadline: 4860516552475 } -> { weight: 2 avg_vruntime: 4860528915984 vruntime: 4860793840706 (-264924722) deadline: 6427157349203 } migration/14-62 [014] d..3. 309.006698: reweight_entity: (ffff8881e0e6cc00-ffff88885f3b5f40-15) { weight: 2 avg_vruntime: 4874472992283 vruntime: 4939833828823 (-65360836540) deadline: 6316614641111 } -> { weight: 967149 avg_vruntime: 4874217684324 vruntime: 4874217688559 (-4235) deadline: 4874220535650 } Which isn't perfect yet, but much closer. Reported-by: Doug Smythies <dsmythies@telus.net> Reported-by: Ingo Molnar <mingo@kernel.org> Tested-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Fixes: eab03c23c2a1 ("sched/eevdf: Fix vruntime adjustment on reweight") Link: https://lore.kernel.org/r/20250109105959.GA2981@noisy.programming.kicks-ass.net	2025-01-09 12:55:27 +01:00
Hou Tao	d86088e2c3	bpf: Remove migrate_{disable\|enable} from bpf_selem_free() bpf_selem_free() has the following three callers: (1) bpf_local_storage_update It will be invoked through ->map_update_elem syscall or helpers for storage map. Migration has already been disabled in these running contexts. (2) bpf_sk_storage_clone It has already disabled migration before invoking bpf_selem_free(). (3) bpf_selem_free_list bpf_selem_free_list() has three callers: bpf_selem_unlink_storage(), bpf_local_storage_update() and bpf_local_storage_destroy(). The callers of bpf_selem_unlink_storage() includes: storage map ->map_delete_elem syscall, storage map delete helpers and bpf_local_storage_map_free(). These contexts have already disabled migration when invoking bpf_selem_unlink() which invokes bpf_selem_unlink_storage() and bpf_selem_free_list() correspondingly. bpf_local_storage_update() has been analyzed as the first caller above. bpf_local_storage_destroy() is invoked when freeing the local storage for the kernel object. Now cgroup, task, inode and sock storage have already disabled migration before invoking bpf_local_storage_destroy(). After the analyses above, it is safe to remove migrate_{disable\|enable} from bpf_selem_free(). Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20250108010728.207536-17-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-01-08 18:06:37 -08:00
Hou Tao	7b984359e0	bpf: Remove migrate_{disable\|enable} from bpf_local_storage_free() bpf_local_storage_free() has three callers: 1) bpf_local_storage_alloc() Its caller must have disabled migration. 2) bpf_local_storage_destroy() Its four callers (bpf_{cgrp\|inode\|task\|sk}_storage_free()) have already invoked migrate_disable() before invoking bpf_local_storage_destroy(). 3) bpf_selem_unlink() Its callers include: cgrp/inode/task/sk storage ->map_delete_elem callbacks, bpf_{cgrp\|inode\|task\|sk}_storage_delete() helpers and bpf_local_storage_map_free(). All of these callers have already disabled migration before invoking bpf_selem_unlink(). Therefore, it is OK to remove migrate_{disable\|enable} pair from bpf_local_storage_free(). Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20250108010728.207536-16-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-01-08 18:06:37 -08:00
Hou Tao	4855a75ebf	bpf: Remove migrate_{disable\|enable} from bpf_local_storage_alloc() These two callers of bpf_local_storage_alloc() are the same as bpf_selem_alloc(): bpf_sk_storage_clone() and bpf_local_storage_update(). The running contexts of these two callers have already disabled migration, therefore, there is no need to add extra migrate_{disable\|enable} pair in bpf_local_storage_alloc(). Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20250108010728.207536-15-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-01-08 18:06:37 -08:00
Hou Tao	2269b32ab0	bpf: Remove migrate_{disable\|enable} from bpf_selem_alloc() bpf_selem_alloc() has two callers: (1) bpf_sk_storage_clone_elem() bpf_sk_storage_clone() has already disabled migration before invoking bpf_sk_storage_clone_elem(). (2) bpf_local_storage_update() Its callers include: cgrp/task/inode/sock storage ->map_update_elem() callbacks and bpf_{cgrp\|task\|inode\|sk}_storage_get() helpers. These running contexts have already disabled migration Therefore, there is no need to add extra migrate_{disable\|enable} pair in bpf_selem_alloc(). Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20250108010728.207536-14-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-01-08 18:06:37 -08:00
Hou Tao	6a52b965ab	bpf: Remove migrate_{disable,enable} in bpf_cpumask_release() When BPF program invokes bpf_cpumask_release(), the migration must have been disabled. When bpf_cpumask_release_dtor() invokes bpf_cpumask_release(), the caller bpf_obj_free_fields() also has disabled migration, therefore, it is OK to remove the unnecessary migrate_{disable\|enable} pair in bpf_cpumask_release(). Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20250108010728.207536-13-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-01-08 18:06:37 -08:00
Hou Tao	1d2dbe7120	bpf: Remove migrate_{disable\|enable} in bpf_obj_free_fields() The callers of bpf_obj_free_fields() have already guaranteed that the migration is disabled, therefore, there is no need to invoke migrate_{disable,enable} pair in bpf_obj_free_fields()'s underly implementation. This patch removes unnecessary migrate_{disable\|enable} pairs from bpf_obj_free_fields() and its callees: bpf_list_head_free() and bpf_rb_root_free(). Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20250108010728.207536-12-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-01-08 18:06:36 -08:00
Hou Tao	4b7e7cd1c1	bpf: Disable migration before calling ops->map_free() The freeing of all map elements may invoke bpf_obj_free_fields() to free the special fields in the map value. Since these special fields may be allocated from bpf memory allocator, migrate_{disable\|enable} pairs are necessary for the freeing of these special fields. To simplify reasoning about when migrate_disable() is needed for the freeing of these special fields, let the caller to guarantee migration is disabled before invoking bpf_obj_free_fields(). Therefore, disabling migration before calling ops->map_free() to simplify the freeing of map values or special fields allocated from bpf memory allocator. After disabling migration in bpf_map_free(), there is no need for additional migration_{disable\|enable} pairs in these ->map_free() callbacks. Remove these redundant invocations. The migrate_{disable\|enable} pairs in the underlying implementation of bpf_obj_free_fields() will be removed by the following patch. Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20250108010728.207536-11-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-01-08 18:06:36 -08:00
Hou Tao	090d7f2e64	bpf: Disable migration in bpf_selem_free_rcu bpf_selem_free_rcu() calls bpf_obj_free_fields() to free the special fields in map value (e.g., kptr). Since kptrs may be allocated from bpf memory allocator, migrate_{disable\|enable} pairs are necessary for the freeing of these kptrs. To simplify reasoning about when migrate_disable() is needed for the freeing of these dynamically-allocated kptrs, let the caller to guarantee migration is disabled before invoking bpf_obj_free_fields(). Therefore, the patch adds migrate_{disable\|enable} pair in bpf_selem_free_rcu(). The migrate_{disable\|enable} pairs in the underlying implementation of bpf_obj_free_fields() will be removed by the following patch. Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20250108010728.207536-10-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-01-08 18:06:36 -08:00
Hou Tao	e319cdc895	bpf: Disable migration when destroying inode storage When destroying inode storage, it invokes bpf_local_storage_destroy() to remove all storage elements saved in the inode storage. The destroy procedure will call bpf_selem_free() to free the element, and bpf_selem_free() calls bpf_obj_free_fields() to free the special fields in map value (e.g., kptr). Since kptrs may be allocated from bpf memory allocator, migrate_{disable\|enable} pairs are necessary for the freeing of these kptrs. To simplify reasoning about when migrate_disable() is needed for the freeing of these dynamically-allocated kptrs, let the caller to guarantee migration is disabled before invoking bpf_obj_free_fields(). Therefore, the patch adds migrate_{disable\|enable} pair in bpf_inode_storage_free(). The migrate_{disable\|enable} pairs in the underlying implementation of bpf_obj_free_fields() will be removed by the following patch. Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20250108010728.207536-7-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-01-08 18:06:36 -08:00
Hou Tao	9e6c958b54	bpf: Remove migrate_{disable\|enable} from bpf_task_storage_lock helpers Three callers of bpf_task_storage_lock() are ->map_lookup_elem, ->map_update_elem, ->map_delete_elem from bpf syscall. BPF syscall for these three operations of task storage has already disabled migration. Another two callers are bpf_task_storage_get() and bpf_task_storage_delete() helpers which will be used by BPF program. Two callers of bpf_task_storage_trylock() are bpf_task_storage_get() and bpf_task_storage_delete() helpers. The running contexts of these helpers have already disabled migration. Therefore, it is safe to remove migrate_{disable\|enable} from task storage lock helpers for these call sites. However, bpf_task_storage_free() also invokes bpf_task_storage_lock() and its running context doesn't disable migration, therefore, add the missed migrate_{disable\|enable} in bpf_task_storage_free(). Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20250108010728.207536-6-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-01-08 18:06:36 -08:00
Hou Tao	25dc65f75b	bpf: Remove migrate_{disable\|enable} from bpf_cgrp_storage_lock helpers Three callers of bpf_cgrp_storage_lock() are ->map_lookup_elem, ->map_update_elem, ->map_delete_elem from bpf syscall. BPF syscall for these three operations of cgrp storage has already disabled migration. Two call sites of bpf_cgrp_storage_trylock() are bpf_cgrp_storage_get(), and bpf_cgrp_storage_delete() helpers. The running contexts of these helpers have already disabled migration. Therefore, it is safe to remove migrate_disable() for these callers. However, bpf_cgrp_storage_free() also invokes bpf_cgrp_storage_lock() and its running context doesn't disable migration. Therefore, also add the missed migrate_{disabled\|enable} in bpf_cgrp_storage_free(). Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20250108010728.207536-5-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-01-08 18:06:36 -08:00
Hou Tao	53f2ba0b1c	bpf: Remove migrate_{disable\|enable} in htab_elem_free htab_elem_free() has two call-sites: delete_all_elements() has already disabled migration, free_htab_elem() is invoked by other 4 functions: __htab_map_lookup_and_delete_elem, __htab_map_lookup_and_delete_batch, htab_map_update_elem and htab_map_delete_elem. BPF syscall has already disabled migration before invoking ->map_update_elem, ->map_delete_elem, and ->map_lookup_and_delete_elem callbacks for hash map. __htab_map_lookup_and_delete_batch() also disables migration before invoking free_htab_elem(). ->map_update_elem() and ->map_delete_elem() of hash map may be invoked by BPF program and the running context of BPF program has already disabled migration. Therefore, it is safe to remove the migration_{disable\|enable} pair in htab_elem_free() Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20250108010728.207536-4-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-01-08 18:06:36 -08:00
Hou Tao	ea5b229630	bpf: Remove migrate_{disable\|enable} in ->map_for_each_callback BPF program may call bpf_for_each_map_elem(), and it will call the ->map_for_each_callback callback of related bpf map. Considering the running context of bpf program has already disabled migration, remove the unnecessary migrate_{disable\|enable} pair in the implementations of ->map_for_each_callback. To ensure the guarantee will not be voilated later, also add cant_migrate() check in the implementations. Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20250108010728.207536-3-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-01-08 18:06:35 -08:00
Hou Tao	1b1a01db17	bpf: Remove migrate_{disable\|enable} from LPM trie Both bpf program and bpf syscall may invoke ->update or ->delete operation for LPM trie. For bpf program, its running context has already disabled migration explicitly through (migrate_disable()) or implicitly through (preempt_disable() or disable irq). For bpf syscall, the migration is disabled through the use of bpf_disable_instrumentation() before invoking the corresponding map operation callback. Therefore, it is safe to remove the migrate_{disable\|enable){} pair from LPM trie. Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20250108010728.207536-2-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-01-08 18:06:35 -08:00
Chen Ridong	3cb97a927f	cgroup/cpuset: remove kernfs active break A warning was found: WARNING: CPU: 10 PID: 3486953 at fs/kernfs/file.c:828 CPU: 10 PID: 3486953 Comm: rmdir Kdump: loaded Tainted: G RIP: 0010:kernfs_should_drain_open_files+0x1a1/0x1b0 RSP: 0018:ffff8881107ef9e0 EFLAGS: 00010202 RAX: 0000000080000002 RBX: ffff888154738c00 RCX: dffffc0000000000 RDX: 0000000000000007 RSI: 0000000000000004 RDI: ffff888154738c04 RBP: ffff888154738c04 R08: ffffffffaf27fa15 R09: ffffed102a8e7180 R10: ffff888154738c07 R11: 0000000000000000 R12: ffff888154738c08 R13: ffff888750f8c000 R14: ffff888750f8c0e8 R15: ffff888154738ca0 FS: 00007f84cd0be740(0000) GS:ffff8887ddc00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000555f9fbe00c8 CR3: 0000000153eec001 CR4: 0000000000370ee0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: kernfs_drain+0x15e/0x2f0 __kernfs_remove+0x165/0x300 kernfs_remove_by_name_ns+0x7b/0xc0 cgroup_rm_file+0x154/0x1c0 cgroup_addrm_files+0x1c2/0x1f0 css_clear_dir+0x77/0x110 kill_css+0x4c/0x1b0 cgroup_destroy_locked+0x194/0x380 cgroup_rmdir+0x2a/0x140 It can be explained by: rmdir echo 1 > cpuset.cpus kernfs_fop_write_iter // active=0 cgroup_rm_file kernfs_remove_by_name_ns kernfs_get_active // active=1 __kernfs_remove // active=0x80000002 kernfs_drain cpuset_write_resmask wait_event //waiting (active == 0x80000001) kernfs_break_active_protection // active = 0x80000001 // continue kernfs_unbreak_active_protection // active = 0x80000002 ... kernfs_should_drain_open_files // warning occurs kernfs_put_active This warning is caused by 'kernfs_break_active_protection' when it is writing to cpuset.cpus, and the cgroup is removed concurrently. The commit 3a5a6d0c2b03 ("cpuset: don't nest cgroup_mutex inside get_online_cpus()") made cpuset_hotplug_workfn asynchronous, This change involves calling flush_work(), which can create a multiple processes circular locking dependency that involve cgroup_mutex, potentially leading to a deadlock. To avoid deadlock. the commit 76bb5ab8f6e3 ("cpuset: break kernfs active protection in cpuset_write_resmask()") added 'kernfs_break_active_protection' in the cpuset_write_resmask. This could lead to this warning. After the commit 2125c0034c5d ("cgroup/cpuset: Make cpuset hotplug processing synchronous"), the cpuset_write_resmask no longer needs to wait the hotplug to finish, which means that concurrent hotplug and cpuset operations are no longer possible. Therefore, the deadlock doesn't exist anymore and it does not have to 'break active protection' now. To fix this warning, just remove kernfs_break_active_protection operation in the 'cpuset_write_resmask'. Fixes: bdb2fd7fc56e ("kernfs: Skip kernfs_drain_open_files() more aggressively") Fixes: 76bb5ab8f6e3 ("cpuset: break kernfs active protection in cpuset_write_resmask()") Reported-by: Ji Fa <jifa@huawei.com> Signed-off-by: Chen Ridong <chenridong@huawei.com> Acked-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2025-01-08 15:54:39 -10:00
Jiri Olsa	2ebadb60cb	bpf: Return error for missed kprobe multi bpf program execution When kprobe multi bpf program can't be executed due to recursion check, we currently return 0 (success) to fprobe layer where it's ignored for standard kprobe multi probes. For kprobe session the success return value will make fprobe layer to install return probe and try to execute it as well. But the return session probe should not get executed, because the entry part did not run. FWIW the return probe bpf program most likely won't get executed, because its recursion check will likely fail as well, but we don't need to run it in the first place.. also we can make this clear and obvious. It also affects missed counts for kprobe session program execution, which are now doubled (extra count for not executed return probe). Signed-off-by: Jiri Olsa <jolsa@kernel.org> Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Link: https://lore.kernel.org/r/20250106175048.1443905-1-jolsa@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-01-08 09:39:58 -08:00
Pu Lehui	ca3c4f646a	bpf: Move out synchronize_rcu_tasks_trace from mutex CS Commit ef1b808e3b7c ("bpf: Fix UAF via mismatching bpf_prog/attachment RCU flavors") resolved a possible UAF issue in uprobes that attach non-sleepable bpf prog by explicitly waiting for a tasks-trace-RCU grace period. But, in the current implementation, synchronize_rcu_tasks_trace is included within the mutex critical section, which increases the length of the critical section and may affect performance. So let's move out synchronize_rcu_tasks_trace from mutex CS. Signed-off-by: Pu Lehui <pulehui@huawei.com> Reviewed-by: Jiri Olsa <jolsa@kernel.org> Link: https://lore.kernel.org/r/20250104013946.1111785-1-pulehui@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-01-08 09:38:41 -08:00
Soma Nakata	b8b1e30016	bpf: Fix range_tree_set() error handling range_tree_set() might fail and return -ENOMEM, causing subsequent `bpf_arena_alloc_pages` to fail. Add the error handling. Signed-off-by: Soma Nakata <soma.nakata@somane.sakura.ne.jp> Acked-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20250106231536.52856-1-soma.nakata@somane.sakura.ne.jp Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-01-08 09:35:33 -08:00
Honglei Wang	68e449d849	sched_ext: switch class when preempted by higher priority scheduler ops.cpu_release() function, if defined, must be invoked when preempted by a higher priority scheduler class task. This scenario was skipped in commit f422316d7466 ("sched_ext: Remove switch_class_scx()"). Let's fix it. Fixes: f422316d7466 ("sched_ext: Remove switch_class_scx()") Signed-off-by: Honglei Wang <jameshongleiwang@126.com> Acked-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2025-01-08 06:51:40 -10:00
Changwoo Min	6268d5bc10	sched_ext: Replace rq_lock() to raw_spin_rq_lock() in scx_ops_bypass() scx_ops_bypass() iterates all CPUs to re-enqueue all the scx tasks. For each CPU, it acquires a lock using rq_lock() regardless of whether a CPU is offline or the CPU is currently running a task in a higher scheduler class (e.g., deadline). The rq_lock() is supposed to be used for online CPUs, and the use of rq_lock() may trigger an unnecessary warning in rq_pin_lock(). Therefore, replace rq_lock() to raw_spin_rq_lock() in scx_ops_bypass(). Without this change, we observe the following warning: ===== START ===== [ 6.615205] rq->balance_callback && rq->balance_callback != &balance_push_callback [ 6.615208] WARNING: CPU: 2 PID: 0 at kernel/sched/sched.h:1730 __schedule+0x1130/0x1c90 ===== END ===== Fixes: 0e7ffff1b811 ("scx: Fix raciness in scx_ops_bypass()") Signed-off-by: Changwoo Min <changwoo@igalia.com> Acked-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2025-01-08 06:48:53 -10:00
Henry Huang	30dd3b13f9	sched_ext: keep running prev when prev->scx.slice != 0 When %SCX_OPS_ENQ_LAST is set and prev->scx.slice != 0, @prev will be dispacthed into the local DSQ in put_prev_task_scx(). However, pick_task_scx() is executed before put_prev_task_scx(), so it will not pick @prev. Set %SCX_RQ_BAL_KEEP in balance_one() to ensure that pick_task_scx() can pick @prev. Signed-off-by: Henry Huang <henry.hj@antgroup.com> Acked-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2025-01-08 06:48:33 -10:00
Masami Hiramatsu (Google)	66fc6f521a	tracing/hist: Support POLLPRI event for poll on histogram Since POLLIN will not be flushed until the hist file is read, the user needs to repeatedly read() and poll() on the hist file for monitoring the event continuously. But the read() is somewhat redundant when the user is only monitoring for event updates. Add POLLPRI poll event on the hist file so the event returns when a histogram is updated after open(), poll() or read(). Thus it is possible to wait for the next event without having to issue a read(). Cc: Shuah Khan <shuah@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://lore.kernel.org/173527248770.464571.2536902137325258133.stgit@devnote2 Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Reviewed-by: Tom Zanussi <zanussi@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>	2025-01-07 11:46:32 -05:00
Masami Hiramatsu (Google)	1bd13edbbe	tracing/hist: Add poll(POLLIN) support on hist file Add poll syscall support on the `hist` file. The Waiter will be waken up when the histogram is updated with POLLIN. Currently, there is no way to wait for a specific event in userspace. So user needs to peek the `trace` periodicaly, or wait on `trace_pipe`. But it is not a good idea to peek at the `trace` for an event that randomly happens. And `trace_pipe` is not coming back until a page is filled with events. This allows a user to wait for a specific event on the `hist` file. User can set a histogram trigger on the event which they want to monitor and poll() on its `hist` file. Since this poll() returns POLLIN, the next poll() will return soon unless a read() happens on that hist file. NOTE: To read the hist file again, you must set the file offset to 0, but just for monitoring the event, you may not need to read the histogram. Cc: Shuah Khan <shuah@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://lore.kernel.org/173527247756.464571.14236296701625509931.stgit@devnote2 Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Reviewed-by: Tom Zanussi <zanussi@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>	2025-01-07 11:44:49 -05:00
Steven Rostedt	22bec11a56	tracing: Fix using ret variable in tracing_set_tracer() When the function tracing_set_tracer() switched over to using the guard() infrastructure, it did not need to save the 'ret' variable and would just return the value when an error arised, instead of setting ret and jumping to an out label. When CONFIG_TRACER_SNAPSHOT is enabled, it had code that expected the "ret" variable to be initialized to zero and had set 'ret' while holding an arch_spin_lock() (not used by guard), and then upon releasing the lock it would check 'ret' and exit if set. But because ret was only set when an error occurred while holding the locks, 'ret' would be used uninitialized if there was no error. The code in the CONFIG_TRACER_SNAPSHOT block should be self contain. Make sure 'ret' is also set when no error occurred. Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://lore.kernel.org/20250106111143.2f90ff65@gandalf.local.home Reported-by: kernel test robot <lkp@intel.com> Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Closes: https://lore.kernel.org/r/202412271654.nJVBuwmF-lkp@intel.com/ Fixes: d33b10c0c73ad ("tracing: Switch trace.c code over to use guard()") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>	2025-01-07 11:39:46 -05:00
Emil Tsalapatis	512816403e	bpf: Allow bpf_for/bpf_repeat calls while holding a spinlock Add the bpf_iter_num_* kfuncs called by bpf_for in special_kfunc_list, and allow the calls even while holding a spin lock. Signed-off-by: Emil Tsalapatis (Meta) <emil@etsalapatis.com> Reviewed-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20250104202528.882482-2-emil@etsalapatis.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-01-06 10:59:49 -08:00
Linus Torvalds	fbfd64d25c	vfs-6.13-rc7.fixes -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZ3vs1AAKCRCRxhvAZXjc omdqAP9Mn4HF85p5X7WRtUgrF7MGQft3EBfWE+sUxCMTc49NGQD/Ti7hqGNleEih MmjUjLZSG1e3lFHYQm0nqmjO2RexbQ0= =Li7D -----END PGP SIGNATURE----- Merge tag 'vfs-6.13-rc7.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs fixes from Christian Brauner: - Relax assertions on failure to encode file handles The ->encode_fh() method can fail for various reasons. None of them warrant a WARN_ON(). - Fix overlayfs file handle encoding by allowing encoding an fid from an inode without an alias - Make sure fuse_dir_open() handles FOPEN_KEEP_CACHE. If it's not specified fuse needs to invaludate the directory inode page cache - Fix qnx6 so it builds with gcc-15 - Various fixes for netfslib and ceph and nfs filesystems: - Ignore silly rename files from afs and nfs when building header archives - Fix read result collection in netfslib with multiple subrequests - Handle ENOMEM for netfslib buffered reads - Fix oops in nfs_netfs_init_request() - Parse the secctx command immediately in cachefiles - Remove a redundant smp_rmb() in netfslib - Handle recursion in read retry in netfslib - Fix clearing of folio_queue - Fix missing cancellation of copy-to_cache when the cache for a file is temporarly disabled in netfslib - Sanity check the hfs root record - Fix zero padding data issues in concurrent write scenarios - Fix is_mnt_ns_file() after converting nsfs to path_from_stashed() - Fix missing declaration of init_files - Increase I/O priority when writing revoke records in jbd2 - Flush filesystem device before updating tail sequence in jbd2 * tag 'vfs-6.13-rc7.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (23 commits) ovl: support encoding fid from inode with no alias ovl: pass realinode to ovl_encode_real_fh() instead of realdentry fuse: respect FOPEN_KEEP_CACHE on opendir netfs: Fix is-caching check in read-retry netfs: Fix the (non-)cancellation of copy when cache is temporarily disabled netfs: Fix ceph copy to cache on write-begin netfs: Work around recursion by abandoning retry if nothing read netfs: Fix missing barriers by using clear_and_wake_up_bit() netfs: Remove redundant use of smp_rmb() cachefiles: Parse the "secctx" immediately nfs: Fix oops in nfs_netfs_init_request() when copying to cache netfs: Fix enomem handling in buffered reads netfs: Fix non-contiguous donation between completed reads kheaders: Ignore silly-rename files fs: relax assertions on failure to encode file handles fs: fix missing declaration of init_files fs: fix is_mnt_ns_file() iomap: fix zero padding data issue in concurrent append writes iomap: pass byte granular end position to iomap_add_to_ioend jbd2: flush filesystem device before updating tail sequence ...	2025-01-06 10:26:39 -08:00
Maarten Lankhorst	b168ed458d	kernel/cgroup: Add "dmem" memory accounting cgroup This code is based on the RDMA and misc cgroup initially, but now uses page_counter. It uses the same min/low/max semantics as the memory cgroup as a result. There's a small mismatch as TTM uses u64, and page_counter long pages. In practice it's not a problem. 32-bits systems don't really come with >=4GB cards and as long as we're consistently wrong with units, it's fine. The device page size may not be in the same units as kernel page size, and each region might also have a different page size (VRAM vs GART for example). The interface is simple: - Call dmem_cgroup_register_region() - Use dmem_cgroup_try_charge to check if you can allocate a chunk of memory, use dmem_cgroup__uncharge when freeing it. This may return an error code, or -EAGAIN when the cgroup limit is reached. In that case a reference to the limiting pool is returned. - The limiting cs can be used as compare function for dmem_cgroup_state_evict_valuable. - After having evicted enough, drop reference to limiting cs with dmem_cgroup_pool_state_put. This API allows you to limit device resources with cgroups. You can see the supported cards in /sys/fs/cgroup/dmem.capacity You need to echo +dmem to cgroup.subtree_control, and then you can partition device memory. Co-developed-by: Friedrich Vock <friedrich.vock@gmx.de> Signed-off-by: Friedrich Vock <friedrich.vock@gmx.de> Co-developed-by: Maxime Ripard <mripard@kernel.org> Signed-off-by: Maarten Lankhorst <dev@lankhorst.se> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20241204143112.1250983-1-dev@lankhorst.se Signed-off-by: Maxime Ripard <mripard@kernel.org>	2025-01-06 17:24:38 +01:00
Linus Torvalds	5635d8bad2	25 hotfixes. 16 are cc:stable. 18 are MM and 7 are non-MM. The usual bunch of singletons and two doubletons - please see the relevant changelogs for details. -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZ3noXwAKCRDdBJ7gKXxA jkzRAP9Ejb8kbgCrA3cptnzlVkDCDUm0TmleepT3bx6B2rH0BgEAzSiTXf4ioZPg 4pOHnKIGOWEVPcVwBrdA0irWG+QPYAQ= =nEIZ -----END PGP SIGNATURE----- Merge tag 'mm-hotfixes-stable-2025-01-04-18-02' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull hotfixes from Andrew Morton: "25 hotfixes. 16 are cc:stable. 18 are MM and 7 are non-MM. The usual bunch of singletons and two doubletons - please see the relevant changelogs for details" * tag 'mm-hotfixes-stable-2025-01-04-18-02' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (25 commits) MAINTAINERS: change Arınç _NAL's name and email address scripts/sorttable: fix orc_sort_cmp() to maintain symmetry and transitivity mm/util: make memdup_user_nul() similar to memdup_user() mm, madvise: fix potential workingset node list_lru leaks mm/damon/core: fix ignored quota goals and filters of newly committed schemes mm/damon/core: fix new damon_target objects leaks on damon_commit_targets() mm/list_lru: fix false warning of negative counter vmstat: disable vmstat_work on vmstat_cpu_down_prep() mm: shmem: fix the update of 'shmem_falloc->nr_unswapped' mm: shmem: fix incorrect index alignment for within_size policy percpu: remove intermediate variable in PERCPU_PTR() mm: zswap: fix race between [de]compression and CPU hotunplug ocfs2: fix slab-use-after-free due to dangling pointer dqi_priv fs/proc/task_mmu: fix pagemap flags with PMD THP entries on 32bit kcov: mark in_softirq_really() as __always_inline docs: mm: fix the incorrect 'FileHugeMapped' field mailmap: modify the entry for Mathieu Othacehe mm/kmemleak: fix sleeping function called from invalid context at print message mm: hugetlb: independent PMD page table shared count maple_tree: reload mas before the second call for mas_empty_area ...	2025-01-05 10:37:45 -08:00
Thomas Weißschuh	9ff6e943bc	padata: fix sysfs store callback check padata_sysfs_store() was copied from padata_sysfs_show() but this check was not adapted. Today there is no attribute which can fail this check, but if there is one it may as well be correct. Fixes: 5e017dc3f8bc ("padata: Added sysfs primitives to padata subsystem") Signed-off-by: Thomas Weißschuh <linux@weissschuh.net> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>	2025-01-04 08:53:47 +08:00
Jakub Kicinski	385f186aba	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Cross-merge networking fixes after downstream PR (net-6.13-rc6). No conflicts. Adjacent changes: include/linux/if_vlan.h f91a5b808938 ("af_packet: fix vlan_get_protocol_dgram() vs MSG_PEEK") 3f330db30638 ("net: reformat kdoc return statements") Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-01-03 16:29:29 -08:00
Linus Torvalds	63676eefb7	sched_ext: Fixes for v6.13-rc5 - Fix the bug where bpf_iter_scx_dsq_new() was not initializing the iterator's flags and could inadvertently enable e.g. reverse iteration. - Fix the bug where scx_ops_bypass() could call irq_restore twice. - Add Andrea and Changwoo as maintainers for better review coverage. - selftests and tools/sched_ext build and other fixes. -----BEGIN PGP SIGNATURE----- iIQEABYKACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCZ3hpXg4cdGpAa2VybmVs Lm9yZwAKCRCxYfJx3gVYGS/lAQDOZDfcJtO1VEsLoPY9NhFHPuBDTfoJyjSi/4mh GsjgDAD/Sx0rN6C9S/+ToUjmq3FA+ft0m2+97VqgLwkzwA9YxwI= =jaZ6 -----END PGP SIGNATURE----- Merge tag 'sched_ext-for-6.13-rc5-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext Pull sched_ext fixes from Tejun Heo: - Fix a bug where bpf_iter_scx_dsq_new() was not initializing the iterator's flags and could inadvertently enable e.g. reverse iteration - Fix a bug where scx_ops_bypass() could call irq_restore twice - Add Andrea and Changwoo as maintainers for better review coverage - selftests and tools/sched_ext build and other fixes * tag 'sched_ext-for-6.13-rc5-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext: sched_ext: Fix dsq_local_on selftest sched_ext: initialize kit->cursor.flags sched_ext: Fix invalid irq restore in scx_ops_bypass() MAINTAINERS: add me as reviewer for sched_ext MAINTAINERS: add self as reviewer for sched_ext scx: Fix maximal BPF selftest prog sched_ext: fix application of sizeof to pointer selftests/sched_ext: fix build after renames in sched_ext API sched_ext: Add __weak to fix the build errors	2025-01-03 15:09:12 -08:00
Linus Torvalds	f9aa1fb9f8	workqueue: Fixes for v6.13-rc5 - Suppress a corner case spurious flush dependency warning. - Two trivial changes. -----BEGIN PGP SIGNATURE----- iIQEABYKACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCZ3hmjA4cdGpAa2VybmVs Lm9yZwAKCRCxYfJx3gVYGUrkAP90cajNtGbtFR1J61N4dTSfjBz8L7oQ6GLLyjCB MDxvpQD/ViVVpHBl9/jfObk//p6YMBTBD2Zp/aBc3mkKOVhfqws= =eUNO -----END PGP SIGNATURE----- Merge tag 'wq-for-6.13-rc5-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq Pull workqueue fixes from Tejun Heo: - Suppress a corner case spurious flush dependency warning - Two trivial changes * tag 'wq-for-6.13-rc5-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: workqueue: add printf attribute to __alloc_workqueue() workqueue: Do not warn when cancelling WQ_MEM_RECLAIM work from !WQ_MEM_RECLAIM worker rust: add safety comment in workqueue traits	2025-01-03 15:03:56 -08:00
Martin KaFai Lau	96ea081ed5	bpf: Reject struct_ops registration that uses module ptr and the module btf_id is missing There is a UAF report in the bpf_struct_ops when CONFIG_MODULES=n. In particular, the report is on tcp_congestion_ops that has a "struct module owner" member. For struct_ops that has a "struct module owner" member, it can be extended either by the regular kernel module or by the bpf_struct_ops. bpf_try_module_get() will be used to do the refcounting and different refcount is done based on the owner pointer. When CONFIG_MODULES=n, the btf_id of the "struct module" is missing: WARN: resolve_btfids: unresolved symbol module Thus, the bpf_try_module_get() cannot do the correct refcounting. Not all subsystem's struct_ops requires the "struct module owner" member. e.g. the recent sched_ext_ops. This patch is to disable bpf_struct_ops registration if the struct_ops has the "struct module " member and the "struct module" btf_id is missing. The btf_type_is_fwd() helper is moved to the btf.h header file for this test. This has happened since the beginning of bpf_struct_ops which has gone through many changes. The Fixes tag is set to a recent commit that this patch can apply cleanly. Considering CONFIG_MODULES=n is not common and the age of the issue, targeting for bpf-next also. Fixes: 1611603537a4 ("bpf: Create argument information for nullable arguments.") Reported-by: Robert Morris <rtm@csail.mit.edu> Closes: https://lore.kernel.org/bpf/74665.1733669976@localhost/ Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Tested-by: Eduard Zingerman <eddyz87@gmail.com> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20241220201818.127152-1-martin.lau@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-01-03 10:16:46 -08:00
Linus Torvalds	e30dd219c7	Fixes for ftrace in v6.13: - Add needed READ_ONCE() around access to the fgraph array element The updates to the fgraph array can happen when callbacks are registered and unregistered. The __ftrace_return_to_handler() can handle reading either the old value or the new value. But once it reads that value it must stay consistent otherwise the check that looks to see if the value is a stub may show false, but if the compiler decides to re-read after that check, it can be true which can cause the code to crash later on. - Make function profiler use the top level ops for filtering again When function graph became available for instances, its filter ops became independent from the top level set_ftrace_filter. In the process the function profiler received its own filter ops as well. But the function profiler uses the top level set_ftrace_filter file and does not have one of its own. In giving it its own filter ops, it lost any user interface it once had. Make it use the top level set_ftrace_filter file again. This fixes a regression. -----BEGIN PGP SIGNATURE----- iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCZ3cR4RQccm9zdGVkdEBn b29kbWlzLm9yZwAKCRAp5XQQmuv6qjxfAQCPhNztdmGmEYmuBtONPHwejidWnuJ6 Rl2mQxEbp40OUgD+JvSWofhRsvtXWlymqZ9j+dKMegLqMeq834hB0LK4NAg= =+KqV -----END PGP SIGNATURE----- Merge tag 'ftrace-v6.13-rc5-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull ftrace fixes from Steven Rostedt: - Add needed READ_ONCE() around access to the fgraph array element The updates to the fgraph array can happen when callbacks are registered and unregistered. The __ftrace_return_to_handler() can handle reading either the old value or the new value. But once it reads that value it must stay consistent otherwise the check that looks to see if the value is a stub may show false, but if the compiler decides to re-read after that check, it can be true which can cause the code to crash later on. - Make function profiler use the top level ops for filtering again When function graph became available for instances, its filter ops became independent from the top level set_ftrace_filter. In the process the function profiler received its own filter ops as well. But the function profiler uses the top level set_ftrace_filter file and does not have one of its own. In giving it its own filter ops, it lost any user interface it once had. Make it use the top level set_ftrace_filter file again. This fixes a regression. * tag 'ftrace-v6.13-rc5-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: ftrace: Fix function profiler's filtering functionality fgraph: Add READ_ONCE() when accessing fgraph_array[]	2025-01-03 10:04:43 -08:00
Kohei Enju	789a8cff8d	ftrace: Fix function profiler's filtering functionality Commit c132be2c4fcc ("function_graph: Have the instances use their own ftrace_ops for filtering"), function profiler (enabled via function_profile_enabled) has been showing statistics for all functions, ignoring set_ftrace_filter settings. While tracers are instantiated, the function profiler is not. Therefore, it should use the global set_ftrace_filter for consistency. This patch modifies the function profiler to use the global filter, fixing the filtering functionality. Before (filtering not working): ``` root@localhost:~# echo 'vfs' > /sys/kernel/tracing/set_ftrace_filter root@localhost:~# echo 1 > /sys/kernel/tracing/function_profile_enabled root@localhost:~# sleep 1 root@localhost:~# echo 0 > /sys/kernel/tracing/function_profile_enabled root@localhost:~# head /sys/kernel/tracing/trace_stat/ Function Hit Time Avg s^2 -------- --- ---- --- --- schedule 314 22290594 us 70989.15 us 40372231 us x64_sys_call 1527 8762510 us 5738.382 us 3414354 us schedule_hrtimeout_range 176 8665356 us 49234.98 us 405618876 us __x64_sys_ppoll 324 5656635 us 17458.75 us 19203976 us do_sys_poll 324 5653747 us 17449.83 us 19214945 us schedule_timeout 67 5531396 us 82558.15 us 2136740827 us __x64_sys_pselect6 12 3029540 us 252461.7 us 63296940171 us do_pselect.constprop.0 12 3029532 us 252461.0 us 63296952931 us ``` After (filtering working): ``` root@localhost:~# echo 'vfs' > /sys/kernel/tracing/set_ftrace_filter root@localhost:~# echo 1 > /sys/kernel/tracing/function_profile_enabled root@localhost:~# sleep 1 root@localhost:~# echo 0 > /sys/kernel/tracing/function_profile_enabled root@localhost:~# head /sys/kernel/tracing/trace_stat/ Function Hit Time Avg s^2 -------- --- ---- --- --- vfs_write 462 68476.43 us 148.217 us 25874.48 us vfs_read 641 9611.356 us 14.994 us 28868.07 us vfs_fstat 890 878.094 us 0.986 us 1.667 us vfs_fstatat 227 757.176 us 3.335 us 18.928 us vfs_statx 226 610.610 us 2.701 us 17.749 us vfs_getattr_nosec 1187 460.919 us 0.388 us 0.326 us vfs_statx_path 297 343.287 us 1.155 us 11.116 us vfs_rename 6 291.575 us 48.595 us 9889.236 us ``` Cc: stable@vger.kernel.org Link: https://lore.kernel.org/20250101190820.72534-1-enjuk@amazon.com Fixes: c132be2c4fcc ("function_graph: Have the instances use their own ftrace_ops for filtering") Signed-off-by: Kohei Enju <enjuk@amazon.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>	2025-01-02 17:21:33 -05:00
Zilin Guan	d654740337	fgraph: Add READ_ONCE() when accessing fgraph_array[] In __ftrace_return_to_handler(), a loop iterates over the fgraph_array[] elements, which are fgraph_ops. The loop checks if an element is a fgraph_stub to prevent using a fgraph_stub afterward. However, if the compiler reloads fgraph_array[] after this check, it might race with an update to fgraph_array[] that introduces a fgraph_stub. This could result in the stub being processed, but the stub contains a null "func_hash" field, leading to a NULL pointer dereference. To ensure that the gops compared against the fgraph_stub matches the gops processed later, add a READ_ONCE(). A similar patch appears in commit 63a8dfb ("function_graph: Add READ_ONCE() when accessing fgraph_array[]"). Cc: stable@vger.kernel.org Fixes: 37238abe3cb47 ("ftrace/function_graph: Pass fgraph_ops to function graph callbacks") Link: https://lore.kernel.org/20241231113731.277668-1-zilin@seu.edu.cn Signed-off-by: Zilin Guan <zilin@seu.edu.cn> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>	2025-01-02 17:21:18 -05:00
Steven Rostedt	afc6717628	tracing: Have process_string() also allow arrays In order to catch a common bug where a TRACE_EVENT() TP_fast_assign() assigns an address of an allocated string to the ring buffer and then references it in TP_printk(), which can be executed hours later when the string is free, the function test_event_printk() runs on all events as they are registered to make sure there's no unwanted dereferencing. It calls process_string() to handle cases in TP_printk() format that has "%s". It returns whether or not the string is safe. But it can have some false positives. For instance, xe_bo_move() has: TP_printk("move_lacks_source:%s, migrate object %p [size %zu] from %s to %s device_id:%s", __entry->move_lacks_source ? "yes" : "no", __entry->bo, __entry->size, xe_mem_type_to_name[__entry->old_placement], xe_mem_type_to_name[__entry->new_placement], __get_str(device_id)) Where the "%s" references into xe_mem_type_to_name[]. This is an array of pointers that should be safe for the event to access. Instead of flagging this as a bad reference, if a reference points to an array, where the record field is the index, consider it safe. Link: https://lore.kernel.org/all/9dee19b6185d325d0e6fa5f7cbba81d007d99166.camel@sapience.com/ Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://lore.kernel.org/20241231000646.324fb5f7@gandalf.local.home Fixes: 65a25d9f7ac02 ("tracing: Add "%s" check in test_event_printk()") Reported-by: Genes Lists <lists@sapience.com> Tested-by: Gene C <arch@sapience.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>	2024-12-31 00:10:32 -05:00
Pei Xiao	dfa94ce54f	bpf: Use refcount_t instead of atomic_t for mmap_count Use an API that resembles more the actual use of mmap_count. Found by cocci: kernel/bpf/arena.c:245:6-25: WARNING: atomic_dec_and_test variation before object free at line 249. Fixes: b90d77e5fd78 ("bpf: Fix remap of arena.") Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202412292037.LXlYSHKl-lkp@intel.com/ Signed-off-by: Pei Xiao <xiaopei01@kylinos.cn> Link: https://lore.kernel.org/r/6ecce439a6bc81adb85d5080908ea8959b792a50.1735542814.git.xiaopei01@kylinos.cn Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2024-12-30 20:12:21 -08:00
Arnd Bergmann	cb0ca08b32	kcov: mark in_softirq_really() as __always_inline If gcc decides not to inline in_softirq_really(), objtool warns about a function call with UACCESS enabled: kernel/kcov.o: warning: objtool: __sanitizer_cov_trace_pc+0x1e: call to in_softirq_really() with UACCESS enabled kernel/kcov.o: warning: objtool: check_kcov_mode+0x11: call to in_softirq_really() with UACCESS enabled Mark this as __always_inline to avoid the problem. Link: https://lkml.kernel.org/r/20241217071814.2261620-1-arnd@kernel.org Fixes: 7d4df2dad312 ("kcov: properly check for softirq context") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Marco Elver <elver@google.com> Cc: Aleksandr Nogikh <nogikh@google.com> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Josh Poimboeuf <jpoimboe@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-12-30 17:59:08 -08:00
Lorenzo Pieralisi	654a3381e3	bpf: Remove unused MT_ENTRY define The range tree introduction removed the need for maple tree usage but missed removing the MT_ENTRY defined value that was used to mark maple tree allocated entries. Remove the MT_ENTRY define. Signed-off-by: Lorenzo Pieralisi <lpieralisi@kernel.org> Link: https://lore.kernel.org/r/20241223115901.14207-1-lpieralisi@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2024-12-30 15:18:13 -08:00
Thomas Weißschuh	4a24035964	bpf: Fix holes in special_kfunc_list if !CONFIG_NET If the function is not available its entry has to be replaced with BTF_ID_UNUSED instead of skipped. Otherwise the list doesn't work correctly. Reported-by: Alexei Starovoitov <alexei.starovoitov@gmail.com> Closes: https://lore.kernel.org/lkml/CAADnVQJQpVziHzrPCCpGE5=8uzw2OkxP8gqe1FkJ6_XVVyVbNw@mail.gmail.com/ Fixes: 00a5acdbf398 ("bpf: Fix configuration-dependent BTF function references") Signed-off-by: Thomas Weißschuh <linux@weissschuh.net> Acked-by: Jiri Olsa <jolsa@kernel.org> Link: https://lore.kernel.org/r/20241219-bpf-fix-special_kfunc_list-v1-1-d9d50dd61505@weissschuh.net Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2024-12-30 14:52:08 -08:00
Matan Shachnai	9aa0ebde00	bpf, verifier: Improve precision of BPF_MUL This patch improves (or maintains) the precision of register value tracking in BPF_MUL across all possible inputs. It also simplifies scalar32_min_max_mul() and scalar_min_max_mul(). As it stands, BPF_MUL is composed of three functions: case BPF_MUL: tnum_mul(); scalar32_min_max_mul(); scalar_min_max_mul(); The current implementation of scalar_min_max_mul() restricts the u64 input ranges of dst_reg and src_reg to be within [0, U32_MAX]: /* Both values are positive, so we can work with unsigned and * copy the result to signed (unless it exceeds S64_MAX). / if (umax_val > U32_MAX \|\| dst_reg->umax_value > U32_MAX) { / Potential overflow, we know nothing / __mark_reg64_unbounded(dst_reg); return; } This restriction is done to avoid unsigned overflow, which could otherwise wrap the result around 0, and leave an unsound output where umin > umax. We also observe that limiting these u64 input ranges to [0, U32_MAX] leads to a loss of precision. Consider the case where the u64 bounds of dst_reg are [0, 2^34] and the u64 bounds of src_reg are [0, 2^2]. While the multiplication of these two bounds doesn't overflow and is sound [0, 2^36], the current scalar_min_max_mul() would set the entire register state to unbounded. Importantly, we update BPF_MUL to allow signed bound multiplication (i.e. multiplying negative bounds) as well as allow u64 inputs to take on values from [0, U64_MAX]. We perform signed multiplication on two bounds [a,b] and [c,d] by multiplying every combination of the bounds (i.e. ac, ad, bc, and bd) and checking for overflow of each product. If there is an overflow, we mark the signed bounds unbounded [S64_MIN, S64_MAX]. In the case of no overflow, we take the minimum of these products to be the resulting smin, and the maximum to be the resulting smax. The key idea here is that if there’s no possibility of overflow, either when multiplying signed bounds or unsigned bounds, we can safely multiply the respective bounds; otherwise, we set the bounds that exhibit overflow (during multiplication) to unbounded. if (check_mul_overflow(dst_umax, src_reg->umax_value, dst_umax) \|\| (check_mul_overflow(dst_umin, src_reg->umin_value, dst_umin))) { / Overflow possible, we know nothing / dst_umin = 0; dst_umax = U64_MAX; } ... Below, we provide an example BPF program (below) that exhibits the imprecision in the current BPF_MUL, where the outputs are all unbounded. In contrast, the updated BPF_MUL produces a bounded register state: BPF_LD_IMM64(BPF_REG_1, 11), BPF_LD_IMM64(BPF_REG_2, 4503599627370624), BPF_ALU64_IMM(BPF_NEG, BPF_REG_2, 0), BPF_ALU64_IMM(BPF_NEG, BPF_REG_2, 0), BPF_ALU64_REG(BPF_AND, BPF_REG_1, BPF_REG_2), BPF_LD_IMM64(BPF_REG_3, 809591906117232263), BPF_ALU64_REG(BPF_MUL, BPF_REG_3, BPF_REG_1), BPF_MOV64_IMM(BPF_REG_0, 1), BPF_EXIT_INSN(), Verifier log using the old BPF_MUL: func#0 @0 0: R1=ctx() R10=fp0 0: (18) r1 = 0xb ; R1_w=11 2: (18) r2 = 0x10000000000080 ; R2_w=0x10000000000080 4: (87) r2 = -r2 ; R2_w=scalar() 5: (87) r2 = -r2 ; R2_w=scalar() 6: (5f) r1 &= r2 ; R1_w=scalar(smin=smin32=0,smax=umax=smax32=umax32=11,var_off=(0x0; 0xb)) R2_w=scalar() 7: (18) r3 = 0xb3c3f8c99262687 ; R3_w=0xb3c3f8c99262687 9: (2f) r3 = r1 ; R1_w=scalar(smin=smin32=0,smax=umax=smax32=umax32=11,var_off=(0x0; 0xb)) R3_w=scalar() ... Verifier using the new updated BPF_MUL (more precise bounds at label 9) func#0 @0 0: R1=ctx() R10=fp0 0: (18) r1 = 0xb ; R1_w=11 2: (18) r2 = 0x10000000000080 ; R2_w=0x10000000000080 4: (87) r2 = -r2 ; R2_w=scalar() 5: (87) r2 = -r2 ; R2_w=scalar() 6: (5f) r1 &= r2 ; R1_w=scalar(smin=smin32=0,smax=umax=smax32=umax32=11,var_off=(0x0; 0xb)) R2_w=scalar() 7: (18) r3 = 0xb3c3f8c99262687 ; R3_w=0xb3c3f8c99262687 9: (2f) r3 *= r1 ; R1_w=scalar(smin=smin32=0,smax=umax=smax32=umax32=11,var_off=(0x0; 0xb)) R3_w=scalar(smin=0,smax=umax=0x7b96bb0a94a3a7cd,var_off=(0x0; 0x7fffffffffffffff)) ... Finally, we proved the soundness of the new scalar_min_max_mul() and scalar32_min_max_mul() functions. Typically, multiplication operations are expensive to check with bitvector-based solvers. We were able to prove the soundness of these functions using Non-Linear Integer Arithmetic (NIA) theory. Additionally, using Agni [2,3], we obtained the encodings for scalar32_min_max_mul() and scalar_min_max_mul() in bitvector theory, and were able to prove their soundness using 8-bit bitvectors (instead of 64-bit bitvectors that the functions actually use). In conclusion, with this patch, 1. We were able to show that we can improve the overall precision of BPF_MUL. We proved (using an SMT solver) that this new version of BPF_MUL is at least as precise as the current version for all inputs and more precise for some inputs. 2. We are able to prove the soundness of the new scalar_min_max_mul() and scalar32_min_max_mul(). By leveraging the existing proof of tnum_mul [1], we can say that the composition of these three functions within BPF_MUL is sound. [1] https://ieeexplore.ieee.org/abstract/document/9741267 [2] https://link.springer.com/chapter/10.1007/978-3-031-37709-9_12 [3] https://people.cs.rutgers.edu/~sn349/papers/sas24-preprint.pdf Co-developed-by: Harishankar Vishwanathan <harishankar.vishwanathan@gmail.com> Signed-off-by: Harishankar Vishwanathan <harishankar.vishwanathan@gmail.com> Co-developed-by: Srinivas Narayana <srinivas.narayana@rutgers.edu> Signed-off-by: Srinivas Narayana <srinivas.narayana@rutgers.edu> Co-developed-by: Santosh Nagarakatte <santosh.nagarakatte@rutgers.edu> Signed-off-by: Santosh Nagarakatte <santosh.nagarakatte@rutgers.edu> Signed-off-by: Matan Shachnai <m.shachnai@gmail.com> Link: https://lore.kernel.org/r/20241218032337.12214-2-m.shachnai@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2024-12-30 14:49:42 -08:00

1 2 3 4 5 ...

46872 Commits