Commit Graph

44097 Commits

Author SHA1 Message Date
Tejun Heo
e9a8e01f9b workqueue: Clean up enum work_bits and related constants
The bits of work->data are used for a few different purposes. How the bits
are used is determined by enum work_bits. The planned disable/enable support
will add another use, so let's clean it up a bit in preparation.

- Let WORK_STRUCT_*_BIT's values be determined by enum definition order.

- Deliminate different bit sections the same way using SHIFT and BITS
  values.

- Rename __WORK_OFFQ_CANCELING to WORK_OFFQ_CANCELING_BIT for consistency.

- Introduce WORK_STRUCT_PWQ_SHIFT and replace WORK_STRUCT_FLAG_MASK and
  WORK_STRUCT_WQ_DATA_MASK with WQ_STRUCT_PWQ_MASK for clarity.

- Improve documentation.

No functional changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
2024-02-20 19:36:14 -10:00
Tejun Heo
c5f5b9422a workqueue: Introduce work_cancel_flags
The cancel path used bool @is_dwork to distinguish canceling a regular work
and a delayed one. The planned disable/enable support will need passing
around another flag in the code path. As passing them around with bools will
be confusing, let's introduce named flags to pass around in the cancel path.

WORK_CANCEL_DELAYED replaces @is_dwork. No functional changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
2024-02-20 19:36:14 -10:00
Tejun Heo
c26e2f2e2f workqueue: Use variable name irq_flags for saving local irq flags
Using the generic term `flags` for irq flags is conventional but can be
confusing as there's quite a bit of code dealing with work flags which
involves some subtleties. Let's use a more explicit name `irq_flags` for
local irq flags. No functional changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
2024-02-20 19:36:14 -10:00
Tejun Heo
cdc6e4b329 workqueue: Reorganize flush and cancel[_sync] functions
They are currently a bit disorganized with flush and cancel functions mixed.
Reoranize them so that flush functions come first, cancel next and
cancel_sync last. This way, we won't have to add prototypes for internal
functions for the planned disable/enable support.

This is pure code reorganization. No functional changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
2024-02-20 19:36:14 -10:00
Tejun Heo
c5140688d1 workqueue: Rename __cancel_work_timer() to __cancel_timer_sync()
__cancel_work_timer() is used to implement cancel_work_sync() and
cancel_delayed_work_sync(), similarly to how __cancel_work() is used to
implement cancel_work() and cancel_delayed_work(). ie. The _timer part of
the name is a complete misnomer. The difference from __cancel_work() is the
fact that it syncs against work item execution not whether it handles timers
or not.

Let's rename it to less confusing __cancel_work_sync(). No functional
change.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
2024-02-20 19:36:14 -10:00
Tejun Heo
d355001fa9 workqueue: Use rcu_read_lock_any_held() instead of rcu_read_lock_held()
The different flavors of RCU read critical sections have been unified. Let's
update the locking assertion macros accordingly to avoid requiring
unnecessary explicit rcu_read_[un]lock() calls.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
2024-02-20 19:36:13 -10:00
Tejun Heo
c7a40c49af workqueue: Cosmetic changes
Reorder some global declarations and adjust comments and whitespaces for
clarity and consistency. No functional changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
2024-02-20 19:36:13 -10:00
Marco Elver
de2683e7fd hardening: Enable KFENCE in the hardening config
KFENCE is not a security mitigation mechanism (due to sampling), but has
the performance characteristics of unintrusive hardening techniques.
When used at scale, however, it improves overall security by allowing
kernel developers to detect heap memory-safety bugs cheaply.

Link: https://lkml.kernel.org/r/79B9A832-B3DE-4229-9D87-748B2CFB7D12@kernel.org
Cc: Matthieu Baerts <matttbe@kernel.org>
Cc: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Marco Elver <elver@google.com>
Link: https://lore.kernel.org/r/20240212130116.997627-1-elver@google.com
Signed-off-by: Kees Cook <keescook@chromium.org>
2024-02-20 20:47:32 -08:00
Lukas Bulwahn
7b3133aa4b hardening: drop obsolete DRM_LEGACY from config fragment
Commit 94f8f319cb ("drm: Remove Kconfig option for legacy support
(CONFIG_DRM_LEGACY)") removes the config DRM_LEGACY, but one reference to
that config is left in the hardening.config fragment.

As there is no drm legacy driver left, we do not need to recommend this
attack surface reduction anymore.

Drop this reference in hardening.config fragment.

Signed-off-by: Lukas Bulwahn <lukas.bulwahn@gmail.com>
Link: https://lore.kernel.org/r/20240208091045.9219-3-lukas.bulwahn@gmail.com
Signed-off-by: Kees Cook <keescook@chromium.org>
2024-02-20 20:47:32 -08:00
Lukas Bulwahn
006eac3fe2 hardening: drop obsolete UBSAN_SANITIZE_ALL from config fragment
Commit 7a628f818499 ("ubsan: Remove CONFIG_UBSAN_SANITIZE_ALL") removes the
config UBSAN_SANITIZE_ALL, but one reference to that config is left in the
hardening.config fragment.

Drop this reference in hardening.config fragment.

Note that CONFIG_UBSAN is still enabled in the hardening.config fragment,
so the functionality when using this fragment remains the same.

Signed-off-by: Lukas Bulwahn <lukas.bulwahn@gmail.com>
Link: https://lore.kernel.org/r/20240208091045.9219-2-lukas.bulwahn@gmail.com
Signed-off-by: Kees Cook <keescook@chromium.org>
2024-02-20 20:47:32 -08:00
Linus Torvalds
944d5fe50f sched/membarrier: reduce the ability to hammer on sys_membarrier
On some systems, sys_membarrier can be very expensive, causing overall
slowdowns for everything.  So put a lock on the path in order to
serialize the accesses to prevent the ability for this to be called at
too high of a frequency and saturate the machine.

Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-and-tested-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Acked-by: Borislav Petkov <bp@alien8.de>
Fixes: 22e4ebb975 ("membarrier: Provide expedited private command")
Fixes: c5f58bd58f ("membarrier: Provide GLOBAL_EXPEDITED command")
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2024-02-20 09:38:05 -08:00
Marc Zyngier
5aa3c0cf5b genirq/irqdomain: Don't call ops->select for DOMAIN_BUS_ANY tokens
Users of the IRQCHIP_PLATFORM_DRIVER_{BEGIN,END} helpers rely on a fwspec
containing only the fwnode (and crucially a number of parameters set to 0)
together with a DOMAIN_BUS_ANY token to check whether a parent irqchip has
probed and registered a domain.

Since de1ff306dc ("genirq/irqdomain: Remove the param count restriction
from select()"), ops->select() is called unconditionally, meaning that
irqchips implementing select() now need to handle ANY as a match.

Instead of adding more esoteric checks to the individual drivers, add that
condition to irq_find_matching_fwspec(), and let it handle the corner case,
as per the comment in the function.

This restores the functionality of the above helpers.

Fixes: de1ff306dc ("genirq/irqdomain: Remove the param count restriction from select()")
Reported-by: Dmitry Baryshkov <dmitry.baryshkov@linaro.org>
Reported-by: Biju Das <biju.das.jz@bp.renesas.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Dmitry Baryshkov <dmitry.baryshkov@linaro.org>
Tested-by: Biju Das <biju.das.jz@bp.renesas.com>
Link: https://lore.kernel.org/r/20240220114731.1898534-1-maz@kernel.org
Link: https://lore.kernel.org/r/20240219-gic-fix-child-domain-v1-1-09f8fd2d9a8f@linaro.org
2024-02-20 17:30:57 +01:00
Ingo Molnar
94bf12af35 Merge tag 'v6.8-rc5' into timers/core, to resolve conflict
There's a conflict between this recent upstream fix:

  dad6a09f31 ("hrtimer: Report offline hrtimer enqueue")

and a pending commit in the timers tree:

  1a4729ecaf ("hrtimers: Move hrtimer base related definitions into hrtimer_defs.h")

Resolve it by applying the upstream fix to the new <linux/hrtimer_defs.h> header.

 Conflict:
	include/linux/hrtimer.h
 Semantic conflict:
	include/linux/hrtimer_defs.h

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2024-02-19 22:27:57 +01:00
Alexey Dobriyan
da92df490e cpu: Mark cpu_possible_mask as __ro_after_init
cpu_possible_mask is by definition "cpus which could be hotplugged without
reboot". It's a property which is fixed after kernel enumerates the
hardware configuration.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/41cd78af-92a3-4f23-8c7a-4316a04a66d8@p183
2024-02-19 18:05:47 +01:00
Yafang Shao
5f2ae606cb bpf: Fix an issue due to uninitialized bpf_iter_task
Failure to initialize it->pos, coupled with the presence of an invalid
value in the flags variable, can lead to it->pos referencing an invalid
task, potentially resulting in a kernel panic. To mitigate this risk, it's
crucial to ensure proper initialization of it->pos to NULL.

Fixes: ac8148d957 ("bpf: bpf_iter_task_next: use next_task(kit->task) rather than next_task(kit->pos)")
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Link: https://lore.kernel.org/bpf/20240217114152.1623-2-laoar.shao@gmail.com
2024-02-19 12:28:15 +01:00
Martin KaFai Lau
0281b919e1 bpf: Fix racing between bpf_timer_cancel_and_free and bpf_timer_cancel
The following race is possible between bpf_timer_cancel_and_free
and bpf_timer_cancel. It will lead a UAF on the timer->timer.

bpf_timer_cancel();
	spin_lock();
	t = timer->time;
	spin_unlock();

					bpf_timer_cancel_and_free();
						spin_lock();
						t = timer->timer;
						timer->timer = NULL;
						spin_unlock();
						hrtimer_cancel(&t->timer);
						kfree(t);

	/* UAF on t */
	hrtimer_cancel(&t->timer);

In bpf_timer_cancel_and_free, this patch frees the timer->timer
after a rcu grace period. This requires a rcu_head addition
to the "struct bpf_hrtimer". Another kfree(t) happens in bpf_timer_init,
this does not need a kfree_rcu because it is still under the
spin_lock and timer->timer has not been visible by others yet.

In bpf_timer_cancel, rcu_read_lock() is added because this helper
can be used in a non rcu critical section context (e.g. from
a sleepable bpf prog). Other timer->timer usages in helpers.c
have been audited, bpf_timer_cancel() is the only place where
timer->timer is used outside of the spin_lock.

Another solution considered is to mark a t->flag in bpf_timer_cancel
and clear it after hrtimer_cancel() is done.  In bpf_timer_cancel_and_free,
it busy waits for the flag to be cleared before kfree(t). This patch
goes with a straight forward solution and frees timer->timer after
a rcu grace period.

Fixes: b00628b1c7 ("bpf: Introduce bpf timers.")
Suggested-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/bpf/20240215211218.990808-1-martin.lau@linux.dev
2024-02-19 12:26:46 +01:00
Peter Hilber
14274d0bd3 timekeeping: Fix cross-timestamp interpolation for non-x86
So far, get_device_system_crosststamp() unconditionally passes
system_counterval.cycles to timekeeping_cycles_to_ns(). But when
interpolating system time (do_interp == true), system_counterval.cycles is
before tkr_mono.cycle_last, contrary to the timekeeping_cycles_to_ns()
expectations.

On x86, CONFIG_CLOCKSOURCE_VALIDATE_LAST_CYCLE will mitigate on
interpolating, setting delta to 0. With delta == 0, xtstamp->sys_monoraw
and xtstamp->sys_realtime are then set to the last update time, as
implicitly expected by adjust_historical_crosststamp(). On other
architectures, the resulting nonsense xtstamp->sys_monoraw and
xtstamp->sys_realtime corrupt the xtstamp (ts) adjustment in
adjust_historical_crosststamp().

Fix this by deriving xtstamp->sys_monoraw and xtstamp->sys_realtime from
the last update time when interpolating, by using the local variable
"cycles". The local variable already has the right value when
interpolating, unlike system_counterval.cycles.

Fixes: 2c756feb18 ("time: Add history to cross timestamp interface supporting slower devices")
Signed-off-by: Peter Hilber <peter.hilber@opensynergy.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: John Stultz <jstultz@google.com>
Link: https://lore.kernel.org/r/20231218073849.35294-4-peter.hilber@opensynergy.com
2024-02-19 12:18:51 +01:00
Peter Hilber
87a4113088 timekeeping: Fix cross-timestamp interpolation corner case decision
The cycle_between() helper checks if parameter test is in the open interval
(before, after). Colloquially speaking, this also applies to the counter
wrap-around special case before > after. get_device_system_crosststamp()
currently uses cycle_between() at the first call site to decide whether to
interpolate for older counter readings.

get_device_system_crosststamp() has the following problem with
cycle_between() testing against an open interval: Assume that, by chance,
cycles == tk->tkr_mono.cycle_last (in the following, "cycle_last" for
brevity). Then, cycle_between() at the first call site, with effective
argument values cycle_between(cycle_last, cycles, now), returns false,
enabling interpolation. During interpolation,
get_device_system_crosststamp() will then call cycle_between() at the
second call site (if a history_begin was supplied). The effective argument
values are cycle_between(history_begin->cycles, cycles, cycles), since
system_counterval.cycles == interval_start == cycles, per the assumption.
Due to the test against the open interval, cycle_between() returns false
again. This causes get_device_system_crosststamp() to return -EINVAL.

This failure should be avoided, since get_device_system_crosststamp() works
both when cycles follows cycle_last (no interpolation), and when cycles
precedes cycle_last (interpolation). For the case cycles == cycle_last,
interpolation is actually unneeded.

Fix this by changing cycle_between() into timestamp_in_interval(), which
now checks against the closed interval, rather than the open interval.

This changes the get_device_system_crosststamp() behavior for three corner
cases:

1. Bypass interpolation in the case cycles == tk->tkr_mono.cycle_last,
   fixing the problem described above.

2. At the first timestamp_in_interval() call site, cycles == now no longer
   causes failure.

3. At the second timestamp_in_interval() call site, history_begin->cycles
   == system_counterval.cycles no longer causes failure.
   adjust_historical_crosststamp() also works for this corner case,
   where partial_history_cycles == total_history_cycles.

These behavioral changes should not cause any problems.

Fixes: 2c756feb18 ("time: Add history to cross timestamp interface supporting slower devices")
Signed-off-by: Peter Hilber <peter.hilber@opensynergy.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20231218073849.35294-3-peter.hilber@opensynergy.com
2024-02-19 12:18:51 +01:00
Peter Hilber
84dccadd3e timekeeping: Fix cross-timestamp interpolation on counter wrap
cycle_between() decides whether get_device_system_crosststamp() will
interpolate for older counter readings.

cycle_between() yields wrong results for a counter wrap-around where after
< before < test, and for the case after < test < before.

Fix the comparison logic.

Fixes: 2c756feb18 ("time: Add history to cross timestamp interface supporting slower devices")
Signed-off-by: Peter Hilber <peter.hilber@opensynergy.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: John Stultz <jstultz@google.com>
Link: https://lore.kernel.org/r/20231218073849.35294-2-peter.hilber@opensynergy.com
2024-02-19 12:18:51 +01:00
Crystal Wood
c99303a2d2 genirq: Wake interrupt threads immediately when changing affinity
The affinity setting of interrupt threads happens in the context of the
thread when the thread is woken up by an hard interrupt. As this can be an
arbitrary after changing the affinity, the thread can become runnable on an
isolated CPU and cause isolation disruption.

Avoid this by checking the set affinity request in wait_for_interrupt() and
waking the threads immediately when the affinity is modified.

Note that this is of the most benefit on systems where the interrupt
affinity itself does not need to be deferred to the interrupt handler, but
even where that's not the case, the total dirsuption will be less.

Signed-off-by: Crystal Wood <crwood@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20240122235353.15235-1-crwood@redhat.com
2024-02-19 11:43:46 +01:00
Anna-Maria Behnsen
892abd3571 timers: Add struct member description for timer_base
timer_base struct lacks description of struct members. Important struct
member information is sprinkled in comments or in code all over the place.

Collect information and write struct description to keep track of most
important information in a single place.

Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20240123164702.55612-5-anna-maria@linutronix.de
2024-02-19 09:38:00 +01:00
Anna-Maria Behnsen
f365d05506 tick/sched: Add function description for tick_nohz_next_event()
The return value of tick_nohz_next_event() is not obvious at the first
glance. Add a kernel-doc compatible function description which also covers
return values.

Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20240123164702.55612-4-anna-maria@linutronix.de
2024-02-19 09:38:00 +01:00
Anna-Maria Behnsen
ca2768bbf5 hrtimers: Update formatting of documentation
Documentation of functions lacks the annotations which are used by
kernel-doc and *.rst to make appearance in rendered documents more
user-friendly.

Use those annotations to improve user-friendliness. While at it prevent
duplication of comments and use a reference instead.

Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20240123164702.55612-3-anna-maria@linutronix.de
2024-02-19 09:37:59 +01:00
Linus Torvalds
ad645dea35 Probes fixes for v6.8-rc4:
- tracing/probes: Fix BTF structure member finder to find the members
     which are placed after any anonymous union member correctly.
 -----BEGIN PGP SIGNATURE-----
 
 iQFPBAABCgA5FiEEh7BulGwFlgAOi5DV2/sHvwUrPxsFAmXQqH8bHG1hc2FtaS5o
 aXJhbWF0c3VAZ21haWwuY29tAAoJENv7B78FKz8bKCsH/3EWGLBbwNMFFje+t2Bp
 L6LgeyG7ZUzeDb4ioO2bae++e3m7xtR7YAWbZoX193/BfPbBpr9rkUCfYqAaImBk
 dzdI+E8/gOkliLP0Vvj5GIecKDUdlB053NRgoOcq0n1NYOO3/4kyPZ+53tX5gsBI
 YNcdIKDdEVWX0+vinK6cCjWOLdVJc/8pH/6G2KNgG6Qs2T1dV0YzW5QRv1hhUOr3
 1Obog64AFt7tVtIkGOK19gRJKa6VpysdD8wpekJnKfquTtFffrsgf4yTg84zf7Fe
 ueaOtJ7pV4R0SbQ/n5SnXe5cxXjX7MD24FZaeHCB7mp8sOsdeTukFmSTsIgOFeD/
 cBQ=
 =KY+B
 -----END PGP SIGNATURE-----

Merge tag 'probes-fixes-v6.8-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull probes fix from Masami Hiramatsu:

 - tracing/probes: Fix BTF structure member finder to find the members
   which are placed after any anonymous union member correctly.

* tag 'probes-fixes-v6.8-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  tracing/probes: Fix to search structure fields correctly
2024-02-17 07:59:47 -08:00
Masami Hiramatsu (Google)
9704669c38 tracing/probes: Fix to search structure fields correctly
Fix to search a field from the structure which has anonymous union
correctly.
Since the reference `type` pointer was updated in the loop, the search
loop suddenly aborted where it hits an anonymous union. Thus it can not
find the field after the anonymous union. This avoids updating the
cursor `type` pointer in the loop.

Link: https://lore.kernel.org/all/170791694361.389532.10047514554799419688.stgit@devnote2/

Fixes: 302db0f5b3 ("tracing/probes: Add a function to search a member of a struct/union")
Cc: stable@vger.kernel.org
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2024-02-17 21:25:42 +09:00
Linus Torvalds
975b26ab30 workqueue: Fixes for v6.8-rc4
Just one patch to revert ca10d851b9 ("workqueue: Override implicit ordered
 attribute in workqueue_apply_unbound_cpumask()"). This could break ordering
 guarantee for ordered workqueues. The problem that the commit tried to
 resolve partially - making ordered workqueues follow unbound cpumask - is
 fully solved in wq/for-6.9 branch.
 -----BEGIN PGP SIGNATURE-----
 
 iIQEABYKACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCZc/UeA4cdGpAa2VybmVs
 Lm9yZwAKCRCxYfJx3gVYGR/pAQCeGwvE0m/NzdIsFh0Gd6fLwZlE9S9Kh0zsCaV4
 3N5GdwD/XAXG77ZIBx1z88CT6YDyRt7iKWU6acSnfglgeFEJoAU=
 =Z54U
 -----END PGP SIGNATURE-----

Merge tag 'wq-for-6.8-rc4-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq

Pull workqueue fix from Tejun Heo:
 "Just one patch to revert commit ca10d851b9 ("workqueue: Override
  implicit ordered attribute in workqueue_apply_unbound_cpumask()").

  This commit could break ordering guarantees for ordered workqueues.
  The problem that the commit tried to resolve partially - making
  ordered workqueues follow unbound cpumask - is fully solved in
  wq/for-6.9 branch"

* tag 'wq-for-6.8-rc4-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
  Revert "workqueue: Override implicit ordered attribute in workqueue_apply_unbound_cpumask()"
2024-02-16 14:00:19 -08:00
Linus Torvalds
4b6f7c624e Tracing fixes for v6.8:
- Fix the #ifndef that didn't have CONFIG_ on HAVE_DYNAMIC_FTRACE_WITH_REGS
   The fix to have dynamic trampolines work with x86 broke arm64 as
   the config used in the #ifdef was HAVE_DYNAMIC_FTRACE_WITH_REGS and not
   CONFIG_HAVE_DYNAMIC_FTRACE_WITH_REGS which removed the fix that the
   previous fix was to fix.
 
 - Fix tracing_on state
   The code to test if "tracing_on" is set used ring_buffer_record_is_on()
   which returns false if the ring buffer isn't able to be written to.
   But the ring buffer disable has several bits that disable it.
   One is internal disabling which is used for resizing and other
   modifications of the ring buffer. But the "tracing_on" user space
   visible flag should only report if tracing is actually on and not
   internally disabled, as this can cause confusion as writing "1"
   when it is disabled will not enable it.
 
   Instead use ring_buffer_record_is_set_on() which shows the user space
   visible settings.
 
 - Fix a false positive kmemleak on saved cmdlines
   Now that the saved_cmdlines structure is allocated via alloc_page()
   and not via kmalloc() it has become invisible to kmemleak.
   The allocation done to one of its pointers was flagged as a
   dangling allocation leak. Make kmemleak aware of this allocation
   and free.
 
 - Fix synthetic event dynamic strings.
   A update that cleaned up the synthetic event code removed the
   return value of trace_string(), and had it return zero instead
   of the length, causing dynamic strings in the synthetic event
   to always have zero size.
 
 - Clean up documentation and header files for seq_buf
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCZc92CxQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qiD6AQCCtF9hBWq7slLlBQ+k07hWXOi1h4nR
 Ae6UmoGlu3u4ugEA6/g8mO2vjABagnK7RSk/R+s1SvSGqWkmAsWeEKirEAY=
 =Eqjs
 -----END PGP SIGNATURE-----

Merge tag 'trace-v6.8-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull tracing fixes from Steven Rostedt:

 - Fix the #ifndef that didn't have the 'CONFIG_' prefix on
   HAVE_DYNAMIC_FTRACE_WITH_REGS

   The fix to have dynamic trampolines work with x86 broke arm64 as the
   config used in the #ifdef was HAVE_DYNAMIC_FTRACE_WITH_REGS and not
   CONFIG_HAVE_DYNAMIC_FTRACE_WITH_REGS which removed the fix that the
   previous fix was to fix.

 - Fix tracing_on state

   The code to test if "tracing_on" is set incorrectly used
   ring_buffer_record_is_on() which returns false if the ring buffer
   isn't able to be written to.

   But the ring buffer disable has several bits that disable it. One is
   internal disabling which is used for resizing and other modifications
   of the ring buffer. But the "tracing_on" user space visible flag
   should only report if tracing is actually on and not internally
   disabled, as this can cause confusion as writing "1" when it is
   disabled will not enable it.

   Instead use ring_buffer_record_is_set_on() which shows the user space
   visible settings.

 - Fix a false positive kmemleak on saved cmdlines

   Now that the saved_cmdlines structure is allocated via alloc_page()
   and not via kmalloc() it has become invisible to kmemleak. The
   allocation done to one of its pointers was flagged as a dangling
   allocation leak. Make kmemleak aware of this allocation and free.

 - Fix synthetic event dynamic strings

   An update that cleaned up the synthetic event code removed the return
   value of trace_string(), and had it return zero instead of the
   length, causing dynamic strings in the synthetic event to always have
   zero size.

 - Clean up documentation and header files for seq_buf

* tag 'trace-v6.8-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  seq_buf: Fix kernel documentation
  seq_buf: Don't use "proxy" headers
  tracing/synthetic: Fix trace_string() return value
  tracing: Inform kmemleak of saved_cmdlines allocation
  tracing: Use ring_buffer_record_is_set_on() in tracer_tracing_is_on()
  tracing: Fix HAVE_DYNAMIC_FTRACE_WITH_REGS ifdef
2024-02-16 10:33:51 -08:00
Tejun Heo
fd0a68a233 workqueue, irq_work: Build fix for !CONFIG_IRQ_WORK
2f34d7337d ("workqueue: Fix queue_work_on() with BH workqueues") added
irq_work usage to workqueue; however, it turns out irq_work is actually
optional and the change breaks build on configuration which doesn't have
CONFIG_IRQ_WORK enabled.

Fix build by making workqueue use irq_work only when CONFIG_SMP and enabling
CONFIG_IRQ_WORK when CONFIG_SMP is set. It's reasonable to argue that it may
be better to just always enable it. However, this still saves a small bit of
memory for tiny UP configs and also the least amount of change, so, for now,
let's keep it conditional.

Verified to do the right thing for x86_64 allnoconfig and defconfig, and
aarch64 allnoconfig, allnoconfig + prink disable (SMP but nothing selects
IRQ_WORK) and a modified aarch64 Kconfig where !SMP and nothing selects
IRQ_WORK.

v2: `depends on SMP` leads to Kconfig warnings when CONFIG_IRQ_WORK is
    selected by something else when !CONFIG_SMP. Use `def_bool y if SMP`
    instead.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Naresh Kamboju <naresh.kamboju@linaro.org>
Tested-by: Anders Roxell <anders.roxell@linaro.org>
Fixes: 2f34d7337d ("workqueue: Fix queue_work_on() with BH workqueues")
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
2024-02-16 06:33:43 -10:00
Shrikanth Hegde
8cec3dd9e5 sched/core: Simplify code by removing duplicate #ifdefs
There's a few cases of nested #ifdefs in the scheduler code
that can be simplified:

  #ifdef DEFINE_A
  ...code block...
    #ifdef DEFINE_A       <-- This is a duplicate.
    ...code block...
    #endif
  #else
    #ifndef DEFINE_A     <-- This is also duplicate.
    ...code block...
    #endif
  #endif

More details about the script and methods used to find these code
patterns can be found at:

  https://lore.kernel.org/all/20240118080326.13137-1-sshegde@linux.ibm.com/

No change in functionality intended.

[ mingo: Clarified the changelog. ]

Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lore.kernel.org/r/20240216061433.535522-1-sshegde@linux.ibm.com
2024-02-16 09:37:15 +01:00
Matthieu Baerts (NGI0)
3738d710af configs/debug: add NET debug config
The debug.config file is really great to easily enable a bunch of
general debugging features on a CI-like setup. But it would be great to
also include core networking debugging config.

A few CI's validating features from the Net tree also enable a few other
debugging options on top of debug.config. A small selection is quite
generic for the whole net tree. They validate some assumptions in
different parts of the core net tree. As suggested by Jakub Kicinski in
[1], having them added to this debug.config file would help other CIs
using network features to find bugs in this area.

Note that the two REFCNT configs also select REF_TRACKER, which doesn't
seem to be an issue.

Link: https://lore.kernel.org/netdev/20240202093148.33bd2b14@kernel.org/T/ [1]
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20240212-kconfig-debug-enable-net-v1-1-fb026de8174c@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-02-15 17:46:53 -08:00
Jakub Kicinski
73be9a3aab Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR.

No conflicts.

Adjacent changes:

net/core/dev.c
  9f30831390 ("net: add rcu safety to rtnl_prop_list_size()")
  723de3ebef ("net: free altname using an RCU callback")

net/unix/garbage.c
  11498715f2 ("af_unix: Remove io_uring code for GC.")
  25236c91b5 ("af_unix: Fix task hung while purging oob_skb in GC.")

drivers/net/ethernet/renesas/ravb_main.c
  ed4adc0720 ("net: ravb: Count packets instead of descriptors in GbEth RX path"
)
  c2da940857 ("ravb: Add Rx checksum offload support for GbEth")

net/mptcp/protocol.c
  bdd70eb689 ("mptcp: drop the push_pending field")
  28e5c13805 ("mptcp: annotate lockless accesses around read-mostly fields")

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-02-15 16:20:04 -08:00
Yonghong Song
682158ab53 bpf: Fix test verif_scale_strobemeta_subprogs failure due to llvm19
With latest llvm19, I hit the following selftest failures with

  $ ./test_progs -j
  libbpf: prog 'on_event': BPF program load failed: Permission denied
  libbpf: prog 'on_event': -- BEGIN PROG LOAD LOG --
  combined stack size of 4 calls is 544. Too large
  verification time 1344153 usec
  stack depth 24+440+0+32
  processed 51008 insns (limit 1000000) max_states_per_insn 19 total_states 1467 peak_states 303 mark_read 146
  -- END PROG LOAD LOG --
  libbpf: prog 'on_event': failed to load: -13
  libbpf: failed to load object 'strobemeta_subprogs.bpf.o'
  scale_test:FAIL:expect_success unexpected error: -13 (errno 13)
  #498     verif_scale_strobemeta_subprogs:FAIL

The verifier complains too big of the combined stack size (544 bytes) which
exceeds the maximum stack limit 512. This is a regression from llvm19 ([1]).

In the above error log, the original stack depth is 24+440+0+32.
To satisfy interpreter's need, in verifier the stack depth is adjusted to
32+448+32+32=544 which exceeds 512, hence the error. The same adjusted
stack size is also used for jit case.

But the jitted codes could use smaller stack size.

  $ egrep -r stack_depth | grep round_up
  arm64/net/bpf_jit_comp.c:       ctx->stack_size = round_up(prog->aux->stack_depth, 16);
  loongarch/net/bpf_jit.c:        bpf_stack_adjust = round_up(ctx->prog->aux->stack_depth, 16);
  powerpc/net/bpf_jit_comp.c:     cgctx.stack_size = round_up(fp->aux->stack_depth, 16);
  riscv/net/bpf_jit_comp32.c:             round_up(ctx->prog->aux->stack_depth, STACK_ALIGN);
  riscv/net/bpf_jit_comp64.c:     bpf_stack_adjust = round_up(ctx->prog->aux->stack_depth, 16);
  s390/net/bpf_jit_comp.c:        u32 stack_depth = round_up(fp->aux->stack_depth, 8);
  sparc/net/bpf_jit_comp_64.c:            stack_needed += round_up(stack_depth, 16);
  x86/net/bpf_jit_comp.c:         EMIT3_off32(0x48, 0x81, 0xEC, round_up(stack_depth, 8));
  x86/net/bpf_jit_comp.c: int tcc_off = -4 - round_up(stack_depth, 8);
  x86/net/bpf_jit_comp.c:                     round_up(stack_depth, 8));
  x86/net/bpf_jit_comp.c: int tcc_off = -4 - round_up(stack_depth, 8);
  x86/net/bpf_jit_comp.c:         EMIT3_off32(0x48, 0x81, 0xC4, round_up(stack_depth, 8));

In the above, STACK_ALIGN in riscv/net/bpf_jit_comp32.c is defined as 16.
So stack is aligned in either 8 or 16, x86/s390 having 8-byte stack alignment and
the rest having 16-byte alignment.

This patch calculates total stack depth based on 16-byte alignment if jit is requested.
For the above failing case, the new stack size will be 32+448+0+32=512 and no verification
failure. llvm19 regression will be discussed separately in llvm upstream.

The verifier change caused three test failures as these tests compared messages
with stack size. More specifically,
  - test_global_funcs/global_func1: fail with interpreter mode and success with jit mode.
    Adjusted stack sizes so both jit and interpreter modes will fail.
  - async_stack_depth/{pseudo_call_check, async_call_root_check}: since jit and interpreter
    will calculate different stack sizes, the failure msg is adjusted to omit those
    specific stack size numbers.

  [1] https://lore.kernel.org/bpf/32bde0f0-1881-46c9-931a-673be566c61d@linux.dev/

Suggested-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20240214232951.4113094-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-02-15 13:45:27 -08:00
Andrii Nakryiko
57354f5fde bpf: improve duplicate source code line detection
Verifier log avoids printing the same source code line multiple times
when a consecutive block of BPF assembly instructions are covered by the
same original (C) source code line. This greatly improves verifier log
legibility.

Unfortunately, this check is imperfect and in production applications it
quite often happens that verifier log will have multiple duplicated
source lines emitted, for no apparently good reason. E.g., this is
excerpt from a real-world BPF application (with register states omitted
for clarity):

BEFORE
======
; for (int i = 0; i < STROBE_MAX_MAP_ENTRIES; ++i) { @ strobemeta_probe.bpf.c:394
5369: (07) r8 += 2                    ;
5370: (07) r7 += 16                   ;
; for (int i = 0; i < STROBE_MAX_MAP_ENTRIES; ++i) { @ strobemeta_probe.bpf.c:394
5371: (07) r9 += 1                    ;
5372: (79) r4 = *(u64 *)(r10 -32)     ;
; for (int i = 0; i < STROBE_MAX_MAP_ENTRIES; ++i) { @ strobemeta_probe.bpf.c:394
5373: (55) if r9 != 0xf goto pc+2
; if (i >= map->cnt) @ strobemeta_probe.bpf.c:396
5376: (79) r1 = *(u64 *)(r10 -40)     ;
5377: (79) r1 = *(u64 *)(r1 +8)       ;
; if (i >= map->cnt) @ strobemeta_probe.bpf.c:396
5378: (dd) if r1 s<= r9 goto pc-5     ;
; descr->key_lens[i] = 0; @ strobemeta_probe.bpf.c:398
5379: (b4) w1 = 0                     ;
5380: (6b) *(u16 *)(r8 -30) = r1      ;
; task, data, off, STROBE_MAX_STR_LEN, map->entries[i].key); @ strobemeta_probe.bpf.c:400
5381: (79) r3 = *(u64 *)(r7 -8)       ;
5382: (7b) *(u64 *)(r10 -24) = r6     ;
; task, data, off, STROBE_MAX_STR_LEN, map->entries[i].key); @ strobemeta_probe.bpf.c:400
5383: (bc) w6 = w6                    ;
; barrier_var(payload_off); @ strobemeta_probe.bpf.c:280
5384: (bf) r2 = r6                    ;
5385: (bf) r1 = r4                    ;

As can be seen, line 394 is emitted thrice, 396 is emitted twice, and
line 400 is duplicated as well. Note that there are no intermingling
other lines of source code in between these duplicates, so the issue is
not compiler reordering assembly instruction such that multiple original
source code lines are in effect.

It becomes more obvious what's going on if we look at *full* original line info
information (using btfdump for this, [0]):

  #2764: line: insn #5363 --> 394:3 @ ./././strobemeta_probe.bpf.c
            for (int i = 0; i < STROBE_MAX_MAP_ENTRIES; ++i) {
  #2765: line: insn #5373 --> 394:21 @ ./././strobemeta_probe.bpf.c
            for (int i = 0; i < STROBE_MAX_MAP_ENTRIES; ++i) {
  #2766: line: insn #5375 --> 394:47 @ ./././strobemeta_probe.bpf.c
            for (int i = 0; i < STROBE_MAX_MAP_ENTRIES; ++i) {
  #2767: line: insn #5377 --> 394:3 @ ./././strobemeta_probe.bpf.c
            for (int i = 0; i < STROBE_MAX_MAP_ENTRIES; ++i) {
  #2768: line: insn #5378 --> 414:10 @ ./././strobemeta_probe.bpf.c
            return off;

We can see that there are four line info records covering
instructions #5363 through #5377 (instruction indices are shifted due to
subprog instruction being appended to main program), all of them are
pointing to the same C source code line #394. But each of them points to
a different part of that line, which is denoted by differing column
numbers (3, 21, 47, 3).

But verifier log doesn't distinguish between parts of the same source code line
and doesn't emit this column number information, so for end user it's just a
repetitive visual noise. So let's improve the detection of repeated source code
line and avoid this.

With the changes in this patch, we get this output for the same piece of BPF
program log:

AFTER
=====
; for (int i = 0; i < STROBE_MAX_MAP_ENTRIES; ++i) { @ strobemeta_probe.bpf.c:394
5369: (07) r8 += 2                    ;
5370: (07) r7 += 16                   ;
5371: (07) r9 += 1                    ;
5372: (79) r4 = *(u64 *)(r10 -32)     ;
5373: (55) if r9 != 0xf goto pc+2
; if (i >= map->cnt) @ strobemeta_probe.bpf.c:396
5376: (79) r1 = *(u64 *)(r10 -40)     ;
5377: (79) r1 = *(u64 *)(r1 +8)       ;
5378: (dd) if r1 s<= r9 goto pc-5     ;
; descr->key_lens[i] = 0; @ strobemeta_probe.bpf.c:398
5379: (b4) w1 = 0                     ;
5380: (6b) *(u16 *)(r8 -30) = r1      ;
; task, data, off, STROBE_MAX_STR_LEN, map->entries[i].key); @ strobemeta_probe.bpf.c:400
5381: (79) r3 = *(u64 *)(r7 -8)       ;
5382: (7b) *(u64 *)(r10 -24) = r6     ;
5383: (bc) w6 = w6                    ;
; barrier_var(payload_off); @ strobemeta_probe.bpf.c:280
5384: (bf) r2 = r6                    ;
5385: (bf) r1 = r4                    ;

All the duplication is gone and the log is cleaner and less distracting.

  [0] https://github.com/anakryiko/btfdump

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20240214174100.2847419-1-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-02-15 13:00:48 -08:00
Thomas Gleixner
9bbe13a5d4 genirq/msi: Provide MSI_FLAG_PARENT_PM_DEV
Some platform-MSI implementations require that power management is
redirected to the underlying interrupt chip device. To make this work
with per device MSI domains provide a new feature flag and let the
core code handle the setup of dev->pm_dev when set during device MSI
domain creation.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Anup Patel <apatel@ventanamicro.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20240127161753.114685-14-apatel@ventanamicro.com
2024-02-15 17:55:41 +01:00
Thomas Gleixner
e49312fe09 genirq/irqdomain: Reroute device MSI create_mapping
Reroute interrupt allocation in irq_create_fwspec_mapping() if the domain
is a MSI device domain. This is required to convert the support for wire
to MSI bridges to per device MSI domains.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Anup Patel <apatel@ventanamicro.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20240127161753.114685-13-apatel@ventanamicro.com
2024-02-15 17:55:41 +01:00
Thomas Gleixner
0ee1578b00 genirq/msi: Provide allocation/free functions for "wired" MSI interrupts
To support wire to MSI bridges proper in the MSI core infrastructure it is
required to have separate allocation/free interfaces which can be invoked
from the regular irqdomain allocaton/free functions.

The mechanism for allocation is:
  - Allocate the next free MSI descriptor index in the domain
  - Store the hardware interrupt number and the trigger type
    which was extracted by the irqdomain core from the firmware spec
    in the MSI descriptor device cookie so it can be retrieved by
    the underlying interrupt domain and interrupt chip
  - Use the regular MSI allocation mechanism for the newly allocated
    index which returns a fully initialized Linux interrupt on succes

This works because:
  - the domains have a fixed size
  - each hardware interrupt is only allocated once
  - the underlying domain does not care about the MSI index it only cares
    about the hardware interrupt number and the trigger type

The free function looks up the MSI index in the MSI descriptor of the
provided Linux interrupt number and uses the regular index based free
functions of the MSI core.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Anup Patel <apatel@ventanamicro.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20240127161753.114685-12-apatel@ventanamicro.com
2024-02-15 17:55:41 +01:00
Thomas Gleixner
9d1c58c800 genirq/msi: Optionally use dev->fwnode for device domain
To support wire to MSI domains via the MSI infrastructure it is required to
use the firmware node of the device which implements this for creating the
MSI domain. Otherwise the existing firmware match mechanisms to find the
correct irqdomain for a wired interrupt which is connected to a wire to MSI
bridge would fail.

This cannot be used for the general case because not all devices provide
firmware nodes and all regular per device MSI domains are directly
associated to the device and have not be searched for.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Anup Patel <apatel@ventanamicro.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20240127161753.114685-11-apatel@ventanamicro.com
2024-02-15 17:55:41 +01:00
Thomas Gleixner
3095cc0d5b genirq/msi: Split msi_domain_alloc_irq_at()
In preparation for providing a special allocation function for wired
interrupts which are connected to a wire to MSI bridge, split the inner
workings of msi_domain_alloc_irq_at() out into a helper function so the
code can be shared.

No functional change.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Anup Patel <apatel@ventanamicro.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20240127161753.114685-9-apatel@ventanamicro.com
2024-02-15 17:55:40 +01:00
Thomas Gleixner
9c78c1a85c genirq/msi: Provide optional translation op
irq_create_fwspec_mapping() requires translation of the firmware spec to a
hardware interrupt number and the trigger type information.

Wired interrupts which are connected to a wire to MSI bridge, like MBIGEN
are allocated that way. So far MBIGEN provides a regular irqdomain which
then hooks backwards into the MSI infrastructure. That's an unholy mess and
will be replaced with per device MSI domains which are regular MSI domains.

Interrupts on MSI domains are not supported by irq_create_fwspec_mapping(),
but for making the wire to MSI bridges sane it makes sense to provide a
special allocation/free interface in the MSI infrastructure. That avoids
the backdoors into the core MSI allocation code and just shares all the
regular MSI infrastructure.

Provide an optional translation callback in msi_domain_ops which can be
utilized by these wire to MSI bridges. No other MSI domain should provide a
translation callback. The default translation callback of the MSI
irqdomains will warn when it is invoked on a non-prepared MSI domain.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Anup Patel <apatel@ventanamicro.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20240127161753.114685-8-apatel@ventanamicro.com
2024-02-15 17:55:40 +01:00
Thomas Gleixner
de1ff306dc genirq/irqdomain: Remove the param count restriction from select()
Now that the GIC-v3 callback can handle invocation with a fwspec parameter
count of 0 lift the restriction in the core code and invoke select()
unconditionally when the domain provides it.

Preparatory change for per device MSI domains.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Anup Patel <apatel@ventanamicro.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20240127161753.114685-3-apatel@ventanamicro.com
2024-02-15 17:55:39 +01:00
Thorsten Blum
9b6326354c tracing/synthetic: Fix trace_string() return value
Fix trace_string() by assigning the string length to the return variable
which got lost in commit ddeea494a1 ("tracing/synthetic: Use union
instead of casts") and caused trace_string() to always return 0.

Link: https://lore.kernel.org/linux-trace-kernel/20240214220555.711598-1-thorsten.blum@toblux.com

Cc: stable@vger.kernel.org
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Fixes: ddeea494a1 ("tracing/synthetic: Use union instead of casts")
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Thorsten Blum <thorsten.blum@toblux.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-02-15 11:40:01 -05:00
Andrii Nakryiko
a4561f5afe bpf: Use O(log(N)) binary search to find line info record
Real-world BPF applications keep growing in size. Medium-sized production
application can easily have 50K+ verified instructions, and its line
info section in .BTF.ext has more than 3K entries.

When verifier emits log with log_level>=1, it annotates assembly code
with matched original C source code. Currently it uses linear search
over line info records to find a match. As complexity of BPF
applications grows, this O(K * N) approach scales poorly.

So, let's instead of linear O(N) search for line info record use faster
equivalent O(log(N)) binary search algorithm. It's not a plain binary
search, as we don't look for exact match. It's an upper bound search
variant, looking for rightmost line info record that starts at or before
given insn_off.

Some unscientific measurements were done before and after this change.
They were done in VM and fluctuate a bit, but overall the speed up is
undeniable.

BASELINE
========
File                              Program           Duration (us)   Insns
--------------------------------  ----------------  -------------  ------
katran.bpf.o                      balancer_ingress        2497130  343552
pyperf600.bpf.linked3.o           on_event               12389611  627288
strobelight_pyperf_libbpf.o       on_py_event              387399   52445
--------------------------------  ----------------  -------------  ------

BINARY SEARCH
=============

File                              Program           Duration (us)   Insns
--------------------------------  ----------------  -------------  ------
katran.bpf.o                      balancer_ingress        2339312  343552
pyperf600.bpf.linked3.o           on_event                5602203  627288
strobelight_pyperf_libbpf.o       on_py_event              294761   52445
--------------------------------  ----------------  -------------  ------

While Katran's speed up is pretty modest (about 105ms, or 6%), for
production pyperf BPF program (on_py_event) it's much greater already,
going from 387ms down to 295ms (23% improvement).

Looking at BPF selftests's biggest pyperf example, we can see even more
dramatic improvement, shaving more than 50% of time, going from 12.3s
down to 5.6s.

Different amount of improvement is the function of overall amount of BPF
assembly instructions in .bpf.o files (which contributes to how much
line info records there will be and thus, on average, how much time linear
search will take), among other things:

$ llvm-objdump -d katran.bpf.o | wc -l
3863
$ llvm-objdump -d strobelight_pyperf_libbpf.o | wc -l
6997
$ llvm-objdump -d pyperf600.bpf.linked3.o | wc -l
87854

Granted, this only applies to debugging cases (e.g., using veristat, or
failing verification in production), but seems worth doing to improve
overall developer experience anyways.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/bpf/20240214002311.2197116-1-andrii@kernel.org
2024-02-14 23:53:42 +01:00
Tejun Heo
2f34d7337d workqueue: Fix queue_work_on() with BH workqueues
When queue_work_on() is used to queue a BH work item on a remote CPU, the
work item is queued on that CPU but kick_pool() raises softirq on the local
CPU. This leads to stalls as the work item won't be executed until something
else on the remote CPU schedules a BH work item or tasklet locally.

Fix it by bouncing raising softirq to the target CPU using per-cpu irq_work.

Signed-off-by: Tejun Heo <tj@kernel.org>
Fixes: 4cb1ef6460 ("workqueue: Implement BH workqueues to eventually replace tasklets")
2024-02-14 08:33:55 -10:00
Steven Rostedt (Google)
2394ac4145 tracing: Inform kmemleak of saved_cmdlines allocation
The allocation of the struct saved_cmdlines_buffer structure changed from:

        s = kmalloc(sizeof(*s), GFP_KERNEL);
	s->saved_cmdlines = kmalloc_array(TASK_COMM_LEN, val, GFP_KERNEL);

to:

	orig_size = sizeof(*s) + val * TASK_COMM_LEN;
	order = get_order(orig_size);
	size = 1 << (order + PAGE_SHIFT);
	page = alloc_pages(GFP_KERNEL, order);
	if (!page)
		return NULL;

	s = page_address(page);
	memset(s, 0, sizeof(*s));

	s->saved_cmdlines = kmalloc_array(TASK_COMM_LEN, val, GFP_KERNEL);

Where that s->saved_cmdlines allocation looks to be a dangling allocation
to kmemleak. That's because kmemleak only keeps track of kmalloc()
allocations. For allocations that use page_alloc() directly, the kmemleak
needs to be explicitly informed about it.

Add kmemleak_alloc() and kmemleak_free() around the page allocation so
that it doesn't give the following false positive:

unreferenced object 0xffff8881010c8000 (size 32760):
  comm "swapper", pid 0, jiffies 4294667296
  hex dump (first 32 bytes):
    ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff  ................
    ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff  ................
  backtrace (crc ae6ec1b9):
    [<ffffffff86722405>] kmemleak_alloc+0x45/0x80
    [<ffffffff8414028d>] __kmalloc_large_node+0x10d/0x190
    [<ffffffff84146ab1>] __kmalloc+0x3b1/0x4c0
    [<ffffffff83ed7103>] allocate_cmdlines_buffer+0x113/0x230
    [<ffffffff88649c34>] tracer_alloc_buffers.isra.0+0x124/0x460
    [<ffffffff8864a174>] early_trace_init+0x14/0xa0
    [<ffffffff885dd5ae>] start_kernel+0x12e/0x3c0
    [<ffffffff885f5758>] x86_64_start_reservations+0x18/0x30
    [<ffffffff885f582b>] x86_64_start_kernel+0x7b/0x80
    [<ffffffff83a001c3>] secondary_startup_64_no_verify+0x15e/0x16b

Link: https://lore.kernel.org/linux-trace-kernel/87r0hfnr9r.fsf@kernel.org/
Link: https://lore.kernel.org/linux-trace-kernel/20240214112046.09a322d6@gandalf.local.home

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Fixes: 44dc5c41b5 ("tracing: Fix wasted memory in saved_cmdlines logic")
Reported-by: Kalle Valo <kvalo@kernel.org>
Tested-by: Kalle Valo <kvalo@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-02-14 12:36:34 -05:00
Onkarnath
c90e3ecc91 rcu/sync: remove un-used rcu_sync_enter_start function
With commit '6a010a49b63a ("cgroup: Make !percpu threadgroup_rwsem
operations optional")' usage of rcu_sync_enter_start is removed.

So this function can also be removed.

In the words of Oleg Nesterov:

	__rcu_sync_enter(wait => false) is a better alternative if
	someone needs rcu_sync_enter_start() again.

Link: https://lore.kernel.org/all/20220725121208.GB28662@redhat.com/
Signed-off-by: Onkarnath <onkarnath.1@samsung.com>
Signed-off-by: Maninder Singh <maninder1.s@samsung.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
2024-02-14 08:00:57 -08:00
Paul E. McKenney
fd2a749d3f rcutorture: Suppress rtort_pipe_count warnings until after stalls
Currently, if rcu_torture_writer() sees fewer than ten grace periods
having elapsed during a call to stutter_wait() that actually waited,
the rtort_pipe_count warning is emitted.  This has worked well for
a long time.  Except that the rcutorture TREE07 scenario now does a
short-term 14-second RCU CPU stall, which can most definitely case
false-positive rtort_pipe_count warnings.

This commit therefore changes rcu_torture_writer() to compute the
full expected holdoff and stall duration, and to refuse to report any
rtort_pipe_count warnings until after all stalls have completed.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Cc: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
2024-02-14 08:00:57 -08:00
Joel Fernandes (Google)
67050837ec srcu: Improve comments about acceleration leak
The comments added in commit 1ef990c4b36b ("srcu: No need to
advance/accelerate if no callback enqueued") are a bit confusing.
The comments are describing a scenario for code that was moved and is
no longer the way it was (snapshot after advancing). Improve the code
comments to reflect this and also document why acceleration can never
fail.

Cc: Frederic Weisbecker <frederic@kernel.org>
Cc: Neeraj Upadhyay <neeraj.iitr10@gmail.com>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
2024-02-14 08:00:57 -08:00
Qais Yousef
7f66f099de rcu: Provide a boot time parameter to control lazy RCU
To allow more flexible arrangements while still provide a single kernel
for distros, provide a boot time parameter to enable/disable lazy RCU.

Specify:

	rcutree.enable_rcu_lazy=[y|1|n|0]

Which also requires

	rcu_nocbs=all

at boot time to enable/disable lazy RCU.

To disable it by default at build time when CONFIG_RCU_LAZY=y, the new
CONFIG_RCU_LAZY_DEFAULT_OFF can be used.

Signed-off-by: Qais Yousef (Google) <qyousef@layalina.io>
Tested-by: Andrea Righi <andrea.righi@canonical.com>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
2024-02-14 08:00:57 -08:00
Frederic Weisbecker
499d7e7e83 rcu: Rename jiffies_till_flush to jiffies_lazy_flush
The variable name jiffies_till_flush is too generic and therefore:

* It may shadow a global variable
* It doesn't tell on what it operates

Make the name more precise, along with the related APIs.

Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
2024-02-14 08:00:57 -08:00
Paul E. McKenney
3b239b308e context_tracking: Fix kerneldoc headers for __ct_user_{enter,exit}()
Document the "state" parameter of both of these functions.

Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202312041922.YZCcEPYD-lkp@intel.com/
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Tested-by: Randy Dunlap <rdunlap@infradead.org>
Acked-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
2024-02-14 07:53:50 -08:00
Frederic Weisbecker
23da2ad64d rcu/exp: Remove rcu_par_gp_wq
TREE04 running on short iterations can produce writer stalls of the
following kind:

 ??? Writer stall state RTWS_EXP_SYNC(4) g3968 f0x0 ->state 0x2 cpu 0
 task:rcu_torture_wri state:D stack:14568 pid:83    ppid:2      flags:0x00004000
 Call Trace:
  <TASK>
  __schedule+0x2de/0x850
  ? trace_event_raw_event_rcu_exp_funnel_lock+0x6d/0xb0
  schedule+0x4f/0x90
  synchronize_rcu_expedited+0x430/0x670
  ? __pfx_autoremove_wake_function+0x10/0x10
  ? __pfx_synchronize_rcu_expedited+0x10/0x10
  do_rtws_sync.constprop.0+0xde/0x230
  rcu_torture_writer+0x4b4/0xcd0
  ? __pfx_rcu_torture_writer+0x10/0x10
  kthread+0xc7/0xf0
  ? __pfx_kthread+0x10/0x10
  ret_from_fork+0x2f/0x50
  ? __pfx_kthread+0x10/0x10
  ret_from_fork_asm+0x1b/0x30
  </TASK>

Waiting for an expedited grace period and polling for an expedited
grace period both are operations that internally rely on the same
workqueue performing necessary asynchronous work.

However, a dependency chain is involved between those two operations,
as depicted below:

       ====== CPU 0 =======                          ====== CPU 1 =======

                                                     synchronize_rcu_expedited()
                                                         exp_funnel_lock()
                                                             mutex_lock(&rcu_state.exp_mutex);
    start_poll_synchronize_rcu_expedited
        queue_work(rcu_gp_wq, &rnp->exp_poll_wq);
                                                         synchronize_rcu_expedited_queue_work()
                                                             queue_work(rcu_gp_wq, &rew->rew_work);
                                                         wait_event() // A, wait for &rew->rew_work completion
                                                         mutex_unlock() // B
    //======> switch to kworker

    sync_rcu_do_polled_gp() {
        synchronize_rcu_expedited()
            exp_funnel_lock()
                mutex_lock(&rcu_state.exp_mutex); // C, wait B
                ....
    } // D

Since workqueues are usually implemented on top of several kworkers
handling the queue concurrently, the above situation wouldn't deadlock
most of the time because A then doesn't depend on D. But in case of
memory stress, a single kworker may end up handling alone all the works
in a serialized way. In that case the above layout becomes a problem
because A then waits for D, closing a circular dependency:

	A -> D -> C -> B -> A

This however only happens when CONFIG_RCU_EXP_KTHREAD=n. Indeed
synchronize_rcu_expedited() is otherwise implemented on top of a kthread
worker while polling still relies on rcu_gp_wq workqueue, breaking the
above circular dependency chain.

Fix this with making expedited grace period to always rely on kthread
worker. The workqueue based implementation is essentially a duplicate
anyway now that the per-node initialization is performed by per-node
kthread workers.

Meanwhile the CONFIG_RCU_EXP_KTHREAD switch is still kept around to
manage the scheduler policy of these kthread workers.

Reported-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Reported-by: Thomas Gleixner <tglx@linutronix.de>
Suggested-by: Joel Fernandes <joel@joelfernandes.org>
Suggested-by: Paul E. McKenney <paulmck@kernel.org>
Suggested-by: Neeraj upadhyay <Neeraj.Upadhyay@amd.com>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
2024-02-14 07:51:36 -08:00
Frederic Weisbecker
b67cffcbbf rcu/exp: Handle parallel exp gp kworkers affinity
Affine the parallel expedited gp kworkers to their respective RCU node
in order to make them close to the cache their are playing with.

This reuses the boost kthreads machinery that probe into CPU hotplug
operations such that the kthreads become/stay affine to their respective
node as soon/long as they contain online CPUs. Otherwise and if the
current CPU going down was the last online on the leaf node, the related
kthread is affine to the housekeeping CPUs.

In the long run, this affinity VS CPU hotplug operation game should
probably be implemented at the generic kthread level.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
[boqun: s/* rcu_boost_task/*rcu_boost_task as reported by checkpatch]
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
2024-02-14 07:51:36 -08:00
Frederic Weisbecker
8e5e621566 rcu/exp: Make parallel exp gp kworker per rcu node
When CONFIG_RCU_EXP_KTHREAD=n, the expedited grace period per node
initialization is performed in parallel via workqueues (one work per
node).

However in CONFIG_RCU_EXP_KTHREAD=y, this per node initialization is
performed by a single kworker serializing each node initialization (one
work for all nodes).

The second part is certainly less scalable and efficient beyond a single
leaf node.

To improve this, expand this single kworker into per-node kworkers. This
new layout is eventually intended to remove the workqueues based
implementation since it will essentially now become duplicate code.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
2024-02-14 07:51:36 -08:00
Frederic Weisbecker
c19e5d3b49 rcu/exp: Move expedited kthread worker creation functions above rcutree_prepare_cpu()
The expedited kthread worker performing the per node initialization is
going to be split into per node kthreads. As such, the future per node
kthread creation will need to be called from CPU hotplug callbacks
instead of an initcall, right beside the per node boost kthread
creation.

To prepare for that, move the kthread worker creation above
rcutree_prepare_cpu() as a first step to make the review smoother for
the upcoming modifications.

No intended functional change.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
2024-02-14 07:51:36 -08:00
Frederic Weisbecker
7836b27060 rcu: s/boost_kthread_mutex/kthread_mutex
This mutex is currently protecting per node boost kthreads creation and
affinity setting across CPU hotplug operations.

Since the expedited kworkers will soon be split per node as well, they
will be subject to the same concurrency constraints against hotplug.

Therefore their creation and affinity tuning operations will be grouped
with those of boost kthreads and then rely on the same mutex.

To prepare for that, generalize its name.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
2024-02-14 07:51:36 -08:00
Frederic Weisbecker
e7539ffc9a rcu/exp: Handle RCU expedited grace period kworker allocation failure
Just like is done for the kworker performing nodes initialization,
gracefully handle the possible allocation failure of the RCU expedited
grace period main kworker.

While at it perform a rename of the related checking functions to better
reflect the expedited specifics.

Reviewed-by: Kalesh Singh <kaleshsingh@google.com>
Fixes: 9621fbee44 ("rcu: Move expedited grace period (GP) work to RT kthread_worker")
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
2024-02-14 07:51:36 -08:00
Frederic Weisbecker
a636c5e6f8 rcu/exp: Fix RCU expedited parallel grace period kworker allocation failure recovery
Under CONFIG_RCU_EXP_KTHREAD=y, the nodes initialization for expedited
grace periods is queued to a kworker. However if the allocation of that
kworker failed, the nodes initialization is performed synchronously by
the caller instead.

Now the check for kworker initialization failure relies on the kworker
pointer to be NULL while its value might actually encapsulate an
allocation failure error.

Make sure to handle this case.

Reviewed-by: Kalesh Singh <kaleshsingh@google.com>
Fixes: 9621fbee44 ("rcu: Move expedited grace period (GP) work to RT kthread_worker")
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
2024-02-14 07:51:36 -08:00
Frederic Weisbecker
a7e4074dcc rcu/exp: Remove full barrier upon main thread wakeup
When an expedited grace period is ending, care must be taken so that all
the quiescent states propagated up to the root are correctly ordered
against the wake up of the main expedited grace period workqueue.

This ordering is already carried through the root rnp locking augmented
by an smp_mb__after_unlock_lock() barrier.

Therefore the explicit smp_mb() placed before the wake up is not needed
and can be removed.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
2024-02-14 07:51:35 -08:00
Zqiang
f3c4c00784 rcu/nocb: Check rdp_gp->nocb_timer in __call_rcu_nocb_wake()
Currently, only rdp_gp->nocb_timer is used, for nocb_timer of
no-rdp_gp structure, the timer_pending() is always return false,
this commit therefore need to check rdp_gp->nocb_timer in
__call_rcu_nocb_wake().

Signed-off-by: Zqiang <qiang.zhang1211@gmail.com>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
2024-02-14 07:50:45 -08:00
Zqiang
dda98810b5 rcu/nocb: Fix WARN_ON_ONCE() in the rcu_nocb_bypass_lock()
For the kernels built with CONFIG_RCU_NOCB_CPU_DEFAULT_ALL=y and
CONFIG_RCU_LAZY=y, the following scenarios will trigger WARN_ON_ONCE()
in the rcu_nocb_bypass_lock() and rcu_nocb_wait_contended() functions:

        CPU2                                               CPU11
kthread
rcu_nocb_cb_kthread                                       ksys_write
rcu_do_batch                                              vfs_write
rcu_torture_timer_cb                                      proc_sys_write
__kmem_cache_free                                         proc_sys_call_handler
kmemleak_free                                             drop_caches_sysctl_handler
delete_object_full                                        drop_slab
__delete_object                                           shrink_slab
put_object                                                lazy_rcu_shrink_scan
call_rcu                                                  rcu_nocb_flush_bypass
__call_rcu_commn                                            rcu_nocb_bypass_lock
                                                            raw_spin_trylock(&rdp->nocb_bypass_lock) fail
                                                            atomic_inc(&rdp->nocb_lock_contended);
rcu_nocb_wait_contended                                     WARN_ON_ONCE(smp_processor_id() != rdp->cpu);
 WARN_ON_ONCE(atomic_read(&rdp->nocb_lock_contended))                                          |
                            |_ _ _ _ _ _ _ _ _ _same rdp and rdp->cpu != 11_ _ _ _ _ _ _ _ _ __|

Reproduce this bug with "echo 3 > /proc/sys/vm/drop_caches".

This commit therefore uses rcu_nocb_try_flush_bypass() instead of
rcu_nocb_flush_bypass() in lazy_rcu_shrink_scan().  If the nocb_bypass
queue is being flushed, then rcu_nocb_try_flush_bypass will return
directly.

Signed-off-by: Zqiang <qiang.zhang1211@gmail.com>
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
2024-02-14 07:50:45 -08:00
Frederic Weisbecker
afd4e69647 rcu/nocb: Re-arrange call_rcu() NOCB specific code
Currently the call_rcu() function interleaves NOCB and !NOCB enqueue
code in a complicated way such that:

* The bypass enqueue code may or may not have enqueued and may or may
  not have locked the ->nocb_lock. Everything that follows is in a
  Schrödinger locking state for the unwary reviewer's eyes.

* The was_alldone is always set but only used in NOCB related code.

* The NOCB wake up is distantly related to the locking hopefully
  performed by the bypass enqueue code that did not enqueue on the
  bypass list.

Unconfuse the whole and gather NOCB and !NOCB specific enqueue code to
their own functions.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
2024-02-14 07:50:45 -08:00
Frederic Weisbecker
b913c3fe68 rcu/nocb: Make IRQs disablement symmetric
Currently IRQs are disabled on call_rcu() and then depending on the
context:

* If the CPU is in nocb mode:

   - If the callback is enqueued in the bypass list, IRQs are re-enabled
     implictly by rcu_nocb_try_bypass()

   - If the callback is enqueued in the normal list, IRQs are re-enabled
     implicitly by __call_rcu_nocb_wake()

* If the CPU is NOT in nocb mode, IRQs are reenabled explicitly from call_rcu()

This makes the code a bit hard to follow, especially as it interleaves
with nocb locking.

To make the IRQ flags coverage clearer and also in order to prepare for
moving all the nocb enqueue code to its own function, always re-enable
the IRQ flags explicitly from call_rcu().

Reviewed-by: Neeraj Upadhyay (AMD) <neeraj.iitr10@gmail.com>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
2024-02-14 07:50:45 -08:00
Frederic Weisbecker
1e8e6951a5 rcu/nocb: Remove needless full barrier after callback advancing
A full barrier is issued from nocb_gp_wait() upon callbacks advancing
to order grace period completion with callbacks execution.

However these two events are already ordered by the
smp_mb__after_unlock_lock() barrier within the call to
raw_spin_lock_rcu_node() that is necessary for callbacks advancing to
happen.

The following litmus test shows the kind of guarantee that this barrier
provides:

	C smp_mb__after_unlock_lock

	{}

	// rcu_gp_cleanup()
	P0(spinlock_t *rnp_lock, int *gpnum)
	{
		// Grace period cleanup increase gp sequence number
		spin_lock(rnp_lock);
		WRITE_ONCE(*gpnum, 1);
		spin_unlock(rnp_lock);
	}

	// nocb_gp_wait()
	P1(spinlock_t *rnp_lock, spinlock_t *nocb_lock, int *gpnum, int *cb_ready)
	{
		int r1;

		// Call rcu_advance_cbs() from nocb_gp_wait()
		spin_lock(nocb_lock);
		spin_lock(rnp_lock);
		smp_mb__after_unlock_lock();
		r1 = READ_ONCE(*gpnum);
		WRITE_ONCE(*cb_ready, 1);
		spin_unlock(rnp_lock);
		spin_unlock(nocb_lock);
	}

	// nocb_cb_wait()
	P2(spinlock_t *nocb_lock, int *cb_ready, int *cb_executed)
	{
		int r2;

		// rcu_do_batch() -> rcu_segcblist_extract_done_cbs()
		spin_lock(nocb_lock);
		r2 = READ_ONCE(*cb_ready);
		spin_unlock(nocb_lock);

		// Actual callback execution
		WRITE_ONCE(*cb_executed, 1);
	}

	P3(int *cb_executed, int *gpnum)
	{
		int r3;

		WRITE_ONCE(*cb_executed, 2);
		smp_mb();
		r3 = READ_ONCE(*gpnum);
	}

	exists (1:r1=1 /\ 2:r2=1 /\ cb_executed=2 /\ 3:r3=0) (* Bad outcome. *)

Here the bad outcome only occurs if the smp_mb__after_unlock_lock() is
removed. This barrier orders the grace period completion against
callbacks advancing and even later callbacks invocation, thanks to the
opportunistic propagation via the ->nocb_lock to nocb_cb_wait().

Therefore the smp_mb() placed after callbacks advancing can be safely
removed.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
2024-02-14 07:50:45 -08:00
Frederic Weisbecker
ca16265aaf rcu/nocb: Remove needless LOAD-ACQUIRE
The LOAD-ACQUIRE access performed on rdp->nocb_cb_sleep advertizes
ordering callback execution against grace period completion. However
this is contradicted by the following:

* This LOAD-ACQUIRE doesn't pair with anything. The only counterpart
  barrier that can be found is the smp_mb() placed after callbacks
  advancing in nocb_gp_wait(). However the barrier is placed _after_
  ->nocb_cb_sleep write.

* Callbacks can be concurrently advanced between the LOAD-ACQUIRE on
  ->nocb_cb_sleep and the call to rcu_segcblist_extract_done_cbs() in
  rcu_do_batch(), making any ordering based on ->nocb_cb_sleep broken.

* Both rcu_segcblist_extract_done_cbs() and rcu_advance_cbs() are called
  under the nocb_lock, the latter hereby providing already the desired
  ACQUIRE semantics.

Therefore it is safe to access ->nocb_cb_sleep with a simple compiler
barrier.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
2024-02-14 07:50:44 -08:00
Ingo Molnar
4589f199eb Merge branch 'x86/bugs' into x86/core, to pick up pending changes before dependent patches
Merge in pending alternatives patching infrastructure changes, before
applying more patches.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2024-02-14 10:49:37 +01:00
Andrii Nakryiko
7cc13adbd0 bpf: emit source code file name and line number in verifier log
As BPF applications grow in size and complexity and are separated into
multiple .bpf.c files that are statically linked together, it becomes
harder and harder to match verifier's BPF assembly level output to
original C code. While often annotated C source code is unique enough to
be able to identify the file it belongs to, quite often this is actually
problematic as parts of source code can be quite generic.

Long story short, it is very useful to see source code file name and
line number information along with the original C code. Verifier already
knows this information, we just need to output it.

This patch extends verifier log with file name and line number
information, emitted next to original (presumably C) source code,
annotating BPF assembly output, like so:

  ; <original C code> @ <filename>.bpf.c:<line>

If file name has directory names in it, they are stripped away. This
should be fine in practice as file names tend to be pretty unique with
C code anyways, and keeping log size smaller is always good.

In practice this might look something like below, where some code is
coming from application files, while others are from libbpf's usdt.bpf.h
header file:

  ; if (STROBEMETA_READ( @ strobemeta_probe.bpf.c:534
  5592: (79) r1 = *(u64 *)(r10 -56)     ; R1_w=mem_or_null(id=1589,sz=7680) R10=fp0
  5593: (7b) *(u64 *)(r10 -56) = r1     ; R1_w=mem_or_null(id=1589,sz=7680) R10=fp0
  5594: (79) r3 = *(u64 *)(r10 -8)      ; R3_w=scalar() R10=fp0 fp-8=mmmmmmmm

  ...

  170: (71) r1 = *(u8 *)(r8 +15)        ; frame1: R1_w=scalar(...) R8_w=map_value(map=__bpf_usdt_spec,ks=4,vs=208)
  171: (67) r1 <<= 56                   ; frame1: R1_w=scalar(...)
  172: (c7) r1 s>>= 56                  ; frame1: R1_w=scalar(smin=smin32=-128,smax=smax32=127)
  ; val <<= arg_spec->arg_bitshift; @ usdt.bpf.h:183
  173: (67) r1 <<= 32                   ; frame1: R1_w=scalar(...)
  174: (77) r1 >>= 32                   ; frame1: R1_w=scalar(smin=0,smax=umax=0xffffffff,var_off=(0x0; 0xffffffff))
  175: (79) r2 = *(u64 *)(r10 -8)       ; frame1: R2_w=scalar() R10=fp0 fp-8=mmmmmmmm
  176: (6f) r2 <<= r1                   ; frame1: R1_w=scalar(smin=0,smax=umax=0xffffffff,var_off=(0x0; 0xffffffff)) R2_w=scalar()
  177: (7b) *(u64 *)(r10 -8) = r2       ; frame1: R2_w=scalar(id=61) R10=fp0 fp-8_w=scalar(id=61)
  ; if (arg_spec->arg_signed) @ usdt.bpf.h:184
  178: (bf) r3 = r2                     ; frame1: R2_w=scalar(id=61) R3_w=scalar(id=61)
  179: (7f) r3 >>= r1                   ; frame1: R1_w=scalar(smin=0,smax=umax=0xffffffff,var_off=(0x0; 0xffffffff)) R3_w=scalar()
  ; if (arg_spec->arg_signed) @ usdt.bpf.h:184
  180: (71) r4 = *(u8 *)(r8 +14)
  181: safe

log_fixup tests needed a minor adjustment as verifier log output
increased a bit and that test is quite sensitive to such changes.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20240212235944.2816107-1-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-02-13 18:51:32 -08:00
Andrii Nakryiko
879bbe7aa4 bpf: don't infer PTR_TO_CTX for programs with unnamed context type
For program types that don't have named context type name (e.g., BPF
iterator programs or tracepoint programs), ctx_tname will be a non-NULL
empty string. For such programs it shouldn't be possible to have
PTR_TO_CTX argument for global subprogs based on type name alone.
arg:ctx tag is the only way to have PTR_TO_CTX passed into global
subprog for such program types.

Fix this loophole, which currently would assume PTR_TO_CTX whenever
user uses a pointer to anonymous struct as an argument to their global
subprogs. This happens in practice with the following (quite common, in
practice) approach:

typedef struct { /* anonymous */
    int x;
} my_type_t;

int my_subprog(my_type_t *arg) { ... }

User's intent is to have PTR_TO_MEM argument for `arg`, but verifier
will complain about expecting PTR_TO_CTX.

This fix also closes unintended s390x-specific KPROBE handling of
PTR_TO_CTX case. Selftest change is necessary to accommodate this.

Fixes: 91cc1a9974 ("bpf: Annotate context types")
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20240212233221.2575350-4-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-02-13 18:46:47 -08:00
Andrii Nakryiko
824c58fb10 bpf: handle bpf_user_pt_regs_t typedef explicitly for PTR_TO_CTX global arg
Expected canonical argument type for global function arguments
representing PTR_TO_CTX is `bpf_user_pt_regs_t *ctx`. This currently
works on s390x by accident because kernel resolves such typedef to
underlying struct (which is anonymous on s390x), and erroneously
accepting it as expected context type. We are fixing this problem next,
which would break s390x arch, so we need to handle `bpf_user_pt_regs_t`
case explicitly for KPROBE programs.

Fixes: 91cc1a9974 ("bpf: Annotate context types")
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20240212233221.2575350-3-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-02-13 18:46:47 -08:00
Andrii Nakryiko
fb5b86cfd4 bpf: simplify btf_get_prog_ctx_type() into btf_is_prog_ctx_type()
Return result of btf_get_prog_ctx_type() is never used and callers only
check NULL vs non-NULL case to determine if given type matches expected
PTR_TO_CTX type. So rename function to `btf_is_prog_ctx_type()` and
return a simple true/false. We'll use this simpler interface to handle
kprobe program type's special typedef case in the next patch.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20240212233221.2575350-2-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-02-13 18:46:46 -08:00
Oliver Crumrine
32e18e7688 bpf: remove check in __cgroup_bpf_run_filter_skb
Originally, this patch removed a redundant check in
BPF_CGROUP_RUN_PROG_INET_EGRESS, as the check was already being done in
the function it called, __cgroup_bpf_run_filter_skb. For v2, it was
reccomended that I remove the check from __cgroup_bpf_run_filter_skb,
and add the checks to the other macro that calls that function,
BPF_CGROUP_RUN_PROG_INET_INGRESS.

To sum it up, checking that the socket exists and that it is a full
socket is now part of both macros BPF_CGROUP_RUN_PROG_INET_EGRESS and
BPF_CGROUP_RUN_PROG_INET_INGRESS, and it is no longer part of the
function they call, __cgroup_bpf_run_filter_skb.

v3->v4: Fixed weird merge conflict.
v2->v3: Sent to bpf-next instead of generic patch
v1->v2: Addressed feedback about where check should be removed.

Signed-off-by: Oliver Crumrine <ozlinuxc@gmail.com>
Acked-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/r/7lv62yiyvmj5a7eozv2iznglpkydkdfancgmbhiptrgvgan5sy@3fl3onchgdz3
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2024-02-13 15:41:17 -08:00
Kui-Feng Lee
1611603537 bpf: Create argument information for nullable arguments.
Collect argument information from the type information of stub functions to
mark arguments of BPF struct_ops programs with PTR_MAYBE_NULL if they are
nullable.  A nullable argument is annotated by suffixing "__nullable" at
the argument name of stub function.

For nullable arguments, this patch sets a struct bpf_ctx_arg_aux to label
their reg_type with PTR_TO_BTF_ID | PTR_TRUSTED | PTR_MAYBE_NULL. This
makes the verifier to check programs and ensure that they properly check
the pointer. The programs should check if the pointer is null before
accessing the pointed memory.

The implementer of a struct_ops type should annotate the arguments that can
be null. The implementer should define a stub function (empty) as a
placeholder for each defined operator. The name of a stub function should
be in the pattern "<st_op_type>__<operator name>". For example, for
test_maybe_null of struct bpf_testmod_ops, it's stub function name should
be "bpf_testmod_ops__test_maybe_null". You mark an argument nullable by
suffixing the argument name with "__nullable" at the stub function.

Since we already has stub functions for kCFI, we just reuse these stub
functions with the naming convention mentioned earlier. These stub
functions with the naming convention is only required if there are nullable
arguments to annotate. For functions having not nullable arguments, stub
functions are not necessary for the purpose of this patch.

This patch will prepare a list of struct bpf_ctx_arg_aux, aka arg_info, for
each member field of a struct_ops type.  "arg_info" will be assigned to
"prog->aux->ctx_arg_info" of BPF struct_ops programs in
check_struct_ops_btf_id() so that it can be used by btf_ctx_access() later
to set reg_type properly for the verifier.

Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com>
Link: https://lore.kernel.org/r/20240209023750.1153905-4-thinker.li@gmail.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2024-02-13 15:16:44 -08:00
Kui-Feng Lee
6115a0aeef bpf: Move __kfunc_param_match_suffix() to btf.c.
Move __kfunc_param_match_suffix() to btf.c and rename it as
btf_param_match_suffix(). It can be reused by bpf_struct_ops later.

Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com>
Link: https://lore.kernel.org/r/20240209023750.1153905-3-thinker.li@gmail.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2024-02-13 15:16:44 -08:00
Kui-Feng Lee
77c0208e19 bpf: add btf pointer to struct bpf_ctx_arg_aux.
Enable the providers to use types defined in a module instead of in the
kernel (btf_vmlinux).

Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com>
Link: https://lore.kernel.org/r/20240209023750.1153905-2-thinker.li@gmail.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2024-02-13 15:16:44 -08:00
Hari Bathini
11f522256e bpf: Fix warning for bpf_cpumask in verifier
Compiling with CONFIG_BPF_SYSCALL & !CONFIG_BPF_JIT throws the below
warning:

  "WARN: resolve_btfids: unresolved symbol bpf_cpumask"

Fix it by adding the appropriate #ifdef.

Signed-off-by: Hari Bathini <hbathini@linux.ibm.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Stanislav Fomichev <sdf@google.com>
Acked-by: David Vernet <void@manifault.com>
Link: https://lore.kernel.org/bpf/20240208100115.602172-1-hbathini@linux.ibm.com
2024-02-13 11:13:39 -08:00
Yonghong Song
178c54666f bpf: Mark bpf_spin_{lock,unlock}() helpers with notrace correctly
Currently tracing is supposed not to allow for bpf_spin_{lock,unlock}()
helper calls. This is to prevent deadlock for the following cases:
  - there is a prog (prog-A) calling bpf_spin_{lock,unlock}().
  - there is a tracing program (prog-B), e.g., fentry, attached
    to bpf_spin_lock() and/or bpf_spin_unlock().
  - prog-B calls bpf_spin_{lock,unlock}().
For such a case, when prog-A calls bpf_spin_{lock,unlock}(),
a deadlock will happen.

The related source codes are below in kernel/bpf/helpers.c:
  notrace BPF_CALL_1(bpf_spin_lock, struct bpf_spin_lock *, lock)
  notrace BPF_CALL_1(bpf_spin_unlock, struct bpf_spin_lock *, lock)
notrace is supposed to prevent fentry prog from attaching to
bpf_spin_{lock,unlock}().

But actually this is not the case and fentry prog can successfully
attached to bpf_spin_lock(). Siddharth Chintamaneni reported
the issue in [1]. The following is the macro definition for
above BPF_CALL_1:
  #define BPF_CALL_x(x, name, ...)                                               \
        static __always_inline                                                 \
        u64 ____##name(__BPF_MAP(x, __BPF_DECL_ARGS, __BPF_V, __VA_ARGS__));   \
        typedef u64 (*btf_##name)(__BPF_MAP(x, __BPF_DECL_ARGS, __BPF_V, __VA_ARGS__)); \
        u64 name(__BPF_REG(x, __BPF_DECL_REGS, __BPF_N, __VA_ARGS__));         \
        u64 name(__BPF_REG(x, __BPF_DECL_REGS, __BPF_N, __VA_ARGS__))          \
        {                                                                      \
                return ((btf_##name)____##name)(__BPF_MAP(x,__BPF_CAST,__BPF_N,__VA_ARGS__));\
        }                                                                      \
        static __always_inline                                                 \
        u64 ____##name(__BPF_MAP(x, __BPF_DECL_ARGS, __BPF_V, __VA_ARGS__))

  #define BPF_CALL_1(name, ...)   BPF_CALL_x(1, name, __VA_ARGS__)

The notrace attribute is actually applied to the static always_inline function
____bpf_spin_{lock,unlock}(). The actual callback function
bpf_spin_{lock,unlock}() is not marked with notrace, hence
allowing fentry prog to attach to two helpers, and this
may cause the above mentioned deadlock. Siddharth Chintamaneni
actually has a reproducer in [2].

To fix the issue, a new macro NOTRACE_BPF_CALL_1 is introduced which
will add notrace attribute to the original function instead of
the hidden always_inline function and this fixed the problem.

  [1] https://lore.kernel.org/bpf/CAE5sdEigPnoGrzN8WU7Tx-h-iFuMZgW06qp0KHWtpvoXxf1OAQ@mail.gmail.com/
  [2] https://lore.kernel.org/bpf/CAE5sdEg6yUc_Jz50AnUXEEUh6O73yQ1Z6NV2srJnef0ZrQkZew@mail.gmail.com/

Fixes: d83525ca62 ("bpf: introduce bpf_spin_lock")
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/bpf/20240207070102.335167-1-yonghong.song@linux.dev
2024-02-13 11:11:25 -08:00
Daniel Xu
5b268d1ebc bpf: Have bpf_rdonly_cast() take a const pointer
Since 20d59ee551 ("libbpf: add bpf_core_cast() macro"), libbpf is now
exporting a const arg version of bpf_rdonly_cast(). This causes the
following conflicting type error when generating kfunc prototypes from
BTF:

In file included from skeleton/pid_iter.bpf.c:5:
/home/dxu/dev/linux/tools/bpf/bpftool/bootstrap/libbpf/include/bpf/bpf_core_read.h:297:14: error: conflicting types for 'bpf_rdonly_cast'
extern void *bpf_rdonly_cast(const void *obj__ign, __u32 btf_id__k) __ksym __weak;
             ^
./vmlinux.h:135625:14: note: previous declaration is here
extern void *bpf_rdonly_cast(void *obj__ign, u32 btf_id__k) __weak __ksym;

This is b/c the kernel defines bpf_rdonly_cast() with non-const arg.
Since const arg is more permissive and thus backwards compatible, we
change the kernel definition as well to avoid conflicting type errors.

Signed-off-by: Daniel Xu <dxu@dxuuu.xyz>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/bpf/dfd3823f11ffd2d4c838e961d61ec9ae8a646773.1707080349.git.dxu@dxuuu.xyz
2024-02-13 11:05:26 -08:00
Sven Schnelle
a6eaa24f1c tracing: Use ring_buffer_record_is_set_on() in tracer_tracing_is_on()
tracer_tracing_is_on() checks whether record_disabled is not zero. This
checks both the record_disabled counter and the RB_BUFFER_OFF flag.
Reading the source it looks like this function should only check for
the RB_BUFFER_OFF flag. Therefore use ring_buffer_record_is_set_on().
This fixes spurious fails in the 'test for function traceon/off triggers'
test from the ftrace testsuite when the system is under load.

Link: https://lore.kernel.org/linux-trace-kernel/20240205065340.2848065-1-svens@linux.ibm.com

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Tested-By: Mete Durlu <meted@linux.ibm.com>
Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-02-13 12:04:17 -05:00
Petr Pavlu
bdbddb109c tracing: Fix HAVE_DYNAMIC_FTRACE_WITH_REGS ifdef
Commit a8b9cf62ad ("ftrace: Fix DIRECT_CALLS to use SAVE_REGS by
default") attempted to fix an issue with direct trampolines on x86, see
its description for details. However, it wrongly referenced the
HAVE_DYNAMIC_FTRACE_WITH_REGS config option and the problem is still
present.

Add the missing "CONFIG_" prefix for the logic to work as intended.

Link: https://lore.kernel.org/linux-trace-kernel/20240213132434.22537-1-petr.pavlu@suse.com

Fixes: a8b9cf62ad ("ftrace: Fix DIRECT_CALLS to use SAVE_REGS by default")
Signed-off-by: Petr Pavlu <petr.pavlu@suse.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-02-13 11:56:49 -05:00
Marco Elver
68bc61c26c bpf: Allow compiler to inline most of bpf_local_storage_lookup()
In various performance profiles of kernels with BPF programs attached,
bpf_local_storage_lookup() appears as a significant portion of CPU
cycles spent. To enable the compiler generate more optimal code, turn
bpf_local_storage_lookup() into a static inline function, where only the
cache insertion code path is outlined

Notably, outlining cache insertion helps avoid bloating callers by
duplicating setting up calls to raw_spin_{lock,unlock}_irqsave() (on
architectures which do not inline spin_lock/unlock, such as x86), which
would cause the compiler produce worse code by deciding to outline
otherwise inlinable functions. The call overhead is neutral, because we
make 2 calls either way: either calling raw_spin_lock_irqsave() and
raw_spin_unlock_irqsave(); or call __bpf_local_storage_insert_cache(),
which calls raw_spin_lock_irqsave(), followed by a tail-call to
raw_spin_unlock_irqsave() where the compiler can perform TCO and (in
optimized uninstrumented builds) turns it into a plain jump. The call to
__bpf_local_storage_insert_cache() can be elided entirely if
cacheit_lockit is a false constant expression.

Based on results from './benchs/run_bench_local_storage.sh' (21 trials,
reboot between each trial; x86 defconfig + BPF, clang 16) this produces
improvements in throughput and latency in the majority of cases, with an
average (geomean) improvement of 8%:

+---- Hashmap Control --------------------
|
| + num keys: 10
| :                                         <before>             | <after>
| +-+ hashmap (control) sequential get    +----------------------+----------------------
|   +- hits throughput                    | 14.789 M ops/s       | 14.745 M ops/s (  ~  )
|   +- hits latency                       | 67.679 ns/op         | 67.879 ns/op   (  ~  )
|   +- important_hits throughput          | 14.789 M ops/s       | 14.745 M ops/s (  ~  )
|
| + num keys: 1000
| :                                         <before>             | <after>
| +-+ hashmap (control) sequential get    +----------------------+----------------------
|   +- hits throughput                    | 12.233 M ops/s       | 12.170 M ops/s (  ~  )
|   +- hits latency                       | 81.754 ns/op         | 82.185 ns/op   (  ~  )
|   +- important_hits throughput          | 12.233 M ops/s       | 12.170 M ops/s (  ~  )
|
| + num keys: 10000
| :                                         <before>             | <after>
| +-+ hashmap (control) sequential get    +----------------------+----------------------
|   +- hits throughput                    | 7.220 M ops/s        | 7.204 M ops/s  (  ~  )
|   +- hits latency                       | 138.522 ns/op        | 138.842 ns/op  (  ~  )
|   +- important_hits throughput          | 7.220 M ops/s        | 7.204 M ops/s  (  ~  )
|
| + num keys: 100000
| :                                         <before>             | <after>
| +-+ hashmap (control) sequential get    +----------------------+----------------------
|   +- hits throughput                    | 5.061 M ops/s        | 5.165 M ops/s  (+2.1%)
|   +- hits latency                       | 198.483 ns/op        | 194.270 ns/op  (-2.1%)
|   +- important_hits throughput          | 5.061 M ops/s        | 5.165 M ops/s  (+2.1%)
|
| + num keys: 4194304
| :                                         <before>             | <after>
| +-+ hashmap (control) sequential get    +----------------------+----------------------
|   +- hits throughput                    | 2.864 M ops/s        | 2.882 M ops/s  (  ~  )
|   +- hits latency                       | 365.220 ns/op        | 361.418 ns/op  (-1.0%)
|   +- important_hits throughput          | 2.864 M ops/s        | 2.882 M ops/s  (  ~  )
|
+---- Local Storage ----------------------
|
| + num_maps: 1
| :                                         <before>             | <after>
| +-+ local_storage cache sequential get  +----------------------+----------------------
|   +- hits throughput                    | 33.005 M ops/s       | 39.068 M ops/s (+18.4%)
|   +- hits latency                       | 30.300 ns/op         | 25.598 ns/op   (-15.5%)
|   +- important_hits throughput          | 33.005 M ops/s       | 39.068 M ops/s (+18.4%)
| :
| :                                         <before>             | <after>
| +-+ local_storage cache interleaved get +----------------------+----------------------
|   +- hits throughput                    | 37.151 M ops/s       | 44.926 M ops/s (+20.9%)
|   +- hits latency                       | 26.919 ns/op         | 22.259 ns/op   (-17.3%)
|   +- important_hits throughput          | 37.151 M ops/s       | 44.926 M ops/s (+20.9%)
|
| + num_maps: 10
| :                                         <before>             | <after>
| +-+ local_storage cache sequential get  +----------------------+----------------------
|   +- hits throughput                    | 32.288 M ops/s       | 38.099 M ops/s (+18.0%)
|   +- hits latency                       | 30.972 ns/op         | 26.248 ns/op   (-15.3%)
|   +- important_hits throughput          | 3.229 M ops/s        | 3.810 M ops/s  (+18.0%)
| :
| :                                         <before>             | <after>
| +-+ local_storage cache interleaved get +----------------------+----------------------
|   +- hits throughput                    | 34.473 M ops/s       | 41.145 M ops/s (+19.4%)
|   +- hits latency                       | 29.010 ns/op         | 24.307 ns/op   (-16.2%)
|   +- important_hits throughput          | 12.312 M ops/s       | 14.695 M ops/s (+19.4%)
|
| + num_maps: 16
| :                                         <before>             | <after>
| +-+ local_storage cache sequential get  +----------------------+----------------------
|   +- hits throughput                    | 32.524 M ops/s       | 38.341 M ops/s (+17.9%)
|   +- hits latency                       | 30.748 ns/op         | 26.083 ns/op   (-15.2%)
|   +- important_hits throughput          | 2.033 M ops/s        | 2.396 M ops/s  (+17.9%)
| :
| :                                         <before>             | <after>
| +-+ local_storage cache interleaved get +----------------------+----------------------
|   +- hits throughput                    | 34.575 M ops/s       | 41.338 M ops/s (+19.6%)
|   +- hits latency                       | 28.925 ns/op         | 24.193 ns/op   (-16.4%)
|   +- important_hits throughput          | 11.001 M ops/s       | 13.153 M ops/s (+19.6%)
|
| + num_maps: 17
| :                                         <before>             | <after>
| +-+ local_storage cache sequential get  +----------------------+----------------------
|   +- hits throughput                    | 28.861 M ops/s       | 32.756 M ops/s (+13.5%)
|   +- hits latency                       | 34.649 ns/op         | 30.530 ns/op   (-11.9%)
|   +- important_hits throughput          | 1.700 M ops/s        | 1.929 M ops/s  (+13.5%)
| :
| :                                         <before>             | <after>
| +-+ local_storage cache interleaved get +----------------------+----------------------
|   +- hits throughput                    | 31.529 M ops/s       | 36.110 M ops/s (+14.5%)
|   +- hits latency                       | 31.719 ns/op         | 27.697 ns/op   (-12.7%)
|   +- important_hits throughput          | 9.598 M ops/s        | 10.993 M ops/s (+14.5%)
|
| + num_maps: 24
| :                                         <before>             | <after>
| +-+ local_storage cache sequential get  +----------------------+----------------------
|   +- hits throughput                    | 18.602 M ops/s       | 19.937 M ops/s (+7.2%)
|   +- hits latency                       | 53.767 ns/op         | 50.166 ns/op   (-6.7%)
|   +- important_hits throughput          | 0.776 M ops/s        | 0.831 M ops/s  (+7.2%)
| :
| :                                         <before>             | <after>
| +-+ local_storage cache interleaved get +----------------------+----------------------
|   +- hits throughput                    | 21.718 M ops/s       | 23.332 M ops/s (+7.4%)
|   +- hits latency                       | 46.047 ns/op         | 42.865 ns/op   (-6.9%)
|   +- important_hits throughput          | 6.110 M ops/s        | 6.564 M ops/s  (+7.4%)
|
| + num_maps: 32
| :                                         <before>             | <after>
| +-+ local_storage cache sequential get  +----------------------+----------------------
|   +- hits throughput                    | 14.118 M ops/s       | 14.626 M ops/s (+3.6%)
|   +- hits latency                       | 70.856 ns/op         | 68.381 ns/op   (-3.5%)
|   +- important_hits throughput          | 0.442 M ops/s        | 0.458 M ops/s  (+3.6%)
| :
| :                                         <before>             | <after>
| +-+ local_storage cache interleaved get +----------------------+----------------------
|   +- hits throughput                    | 17.111 M ops/s       | 17.906 M ops/s (+4.6%)
|   +- hits latency                       | 58.451 ns/op         | 55.865 ns/op   (-4.4%)
|   +- important_hits throughput          | 4.776 M ops/s        | 4.998 M ops/s  (+4.6%)
|
| + num_maps: 100
| :                                         <before>             | <after>
| +-+ local_storage cache sequential get  +----------------------+----------------------
|   +- hits throughput                    | 5.281 M ops/s        | 5.528 M ops/s  (+4.7%)
|   +- hits latency                       | 192.398 ns/op        | 183.059 ns/op  (-4.9%)
|   +- important_hits throughput          | 0.053 M ops/s        | 0.055 M ops/s  (+4.9%)
| :
| :                                         <before>             | <after>
| +-+ local_storage cache interleaved get +----------------------+----------------------
|   +- hits throughput                    | 6.265 M ops/s        | 6.498 M ops/s  (+3.7%)
|   +- hits latency                       | 161.436 ns/op        | 152.877 ns/op  (-5.3%)
|   +- important_hits throughput          | 1.636 M ops/s        | 1.697 M ops/s  (+3.7%)
|
| + num_maps: 1000
| :                                         <before>             | <after>
| +-+ local_storage cache sequential get  +----------------------+----------------------
|   +- hits throughput                    | 0.355 M ops/s        | 0.354 M ops/s  (  ~  )
|   +- hits latency                       | 2826.538 ns/op       | 2827.139 ns/op (  ~  )
|   +- important_hits throughput          | 0.000 M ops/s        | 0.000 M ops/s  (  ~  )
| :
| :                                         <before>             | <after>
| +-+ local_storage cache interleaved get +----------------------+----------------------
|   +- hits throughput                    | 0.404 M ops/s        | 0.403 M ops/s  (  ~  )
|   +- hits latency                       | 2481.190 ns/op       | 2487.555 ns/op (  ~  )
|   +- important_hits throughput          | 0.102 M ops/s        | 0.101 M ops/s  (  ~  )

The on_lookup test in {cgrp,task}_ls_recursion.c is removed
because the bpf_local_storage_lookup is no longer traceable
and adding tracepoint will make the compiler generate worse
code: https://lore.kernel.org/bpf/ZcJmok64Xqv6l4ZS@elver.google.com/

Signed-off-by: Marco Elver <elver@google.com>
Cc: Martin KaFai Lau <martin.lau@linux.dev>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20240207122626.3508658-1-elver@google.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2024-02-11 14:06:24 -08:00
Linus Torvalds
2766f59ca4 - Make sure a warning is issued when a hrtimer gets queued after the
timers have been migrated on the CPU down path and thus said timer
   will get ignored
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmXIpjEACgkQEsHwGGHe
 VUrG8g/7BuGzGzFHYli7LZKuzI76tN0CU44v7oPauqZvkcaSUmF/E+/RAeHxMjUa
 20Mlo2AUGrPkKbRUgN93IrRKZAfdjqKQ6UZiH/FTyFUtFfs9XNv2G0cCIVQAepXL
 WbKPxL/M9vbnJwK6CC5prSLHazGH2y5vg0zbY9RycGKvxza+HgkIrZoYp7ctHX1A
 xZFF5EyLu6g0x+yz7Tt0Zf93tADJxFSHmfz7Nmx1RFh0GJzceuKUvC2ZVyUr63fv
 ttQn4TLm0NKySaR+SPYPKKp1lLkHvfh9pFV2kdI/c7oo4Pig6bFTjMJpf0o541A2
 s87sz2w6P16LMi2sjf/ASQmgMHmGiIQlmjjFbVX8sKeibdtUM3Vg7s/Hs/EilY7Q
 P7ANSmZTtQBoQsWd/E+8aOBUkC263Ua0uoOufH7dcfL9mSJxUos1SPleRnaO4o5n
 mm5GVDxggNj879nHZUBh9g7+JsLdZ5yozWne0xAyrI0WycsK0hzWuW0B5p4QMK3T
 4zamSuZNObBUdbwb5cD1fL5X5aRkPvorj9iLliui1X0wfA+nByR5cjlcptXirCq4
 8D2WCiQ+tbP3w4UJ0gtrC3mlXokWorMV8XoLZm9+RtWLi2LFAhMCZ9U3B0EgsziD
 whMGJBXKRaeeFnCnxPjlHC/5a1nXTCqFlks+PSUVT6RiavYJ7VA=
 =+SjG
 -----END PGP SIGNATURE-----

Merge tag 'timers_urgent_for_v6.8_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull timer fix from Borislav Petkov:

 - Make sure a warning is issued when a hrtimer gets queued after the
   timers have been migrated on the CPU down path and thus said timer
   will get ignored

* tag 'timers_urgent_for_v6.8_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  hrtimer: Report offline hrtimer enqueue
2024-02-11 11:44:14 -08:00
Linus Torvalds
7521f258ea 21 hotfixes. 12 are cc:stable and the remainder pertain to post-6.7
issues or aren't considered to be needed in earlier kernel versions.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZcfLvgAKCRDdBJ7gKXxA
 joCTAP4/XdBXA7Sj3GyjSAkYjg2U0quwX9oRhsx2Qy9duPDaLAD+NRl9XG14YSOB
 f/7OiTQoDfnwVgHAOVBHY/ylrcgZRQg=
 =2wdS
 -----END PGP SIGNATURE-----

Merge tag 'mm-hotfixes-stable-2024-02-10-11-16' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull misc fixes from Andrew Morton:
 "21 hotfixes. 12 are cc:stable and the remainder pertain to post-6.7
  issues or aren't considered to be needed in earlier kernel versions"

* tag 'mm-hotfixes-stable-2024-02-10-11-16' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (21 commits)
  nilfs2: fix potential bug in end_buffer_async_write
  mm/damon/sysfs-schemes: fix wrong DAMOS tried regions update timeout setup
  nilfs2: fix hang in nilfs_lookup_dirty_data_buffers()
  MAINTAINERS: Leo Yan has moved
  mm/zswap: don't return LRU_SKIP if we have dropped lru lock
  fs,hugetlb: fix NULL pointer dereference in hugetlbs_fill_super
  mailmap: switch email address for John Moon
  mm: zswap: fix objcg use-after-free in entry destruction
  mm/madvise: don't forget to leave lazy MMU mode in madvise_cold_or_pageout_pte_range()
  arch/arm/mm: fix major fault accounting when retrying under per-VMA lock
  selftests: core: include linux/close_range.h for CLOSE_RANGE_* macros
  mm/memory-failure: fix crash in split_huge_page_to_list from soft_offline_page
  mm: memcg: optimize parent iteration in memcg_rstat_updated()
  nilfs2: fix data corruption in dsync block recovery for small block sizes
  mm/userfaultfd: UFFDIO_MOVE implementation should use ptep_get()
  exit: wait_task_zombie: kill the no longer necessary spin_lock_irq(siglock)
  fs/proc: do_task_stat: use sig->stats_lock to gather the threads/children stats
  fs/proc: do_task_stat: move thread_group_cputime_adjusted() outside of lock_task_sighand()
  getrusage: use sig->stats_lock rather than lock_task_sighand()
  getrusage: move thread_group_cputime_adjusted() outside of lock_task_sighand()
  ...
2024-02-10 15:28:07 -08:00
Oleg Nesterov
81b9d8ac06
pidfd: change pidfd_send_signal() to respect PIDFD_THREAD
Turn kill_pid_info() into kill_pid_info_type(), this allows to pass any
pid_type to group_send_sig_info(), despite its name it should work fine
even if type = PIDTYPE_PID.

Change pidfd_send_signal() to use PIDTYPE_PID or PIDTYPE_TGID depending
on PIDFD_THREAD.

While at it kill another TODO comment in pidfd_show_fdinfo(). As Christian
expains fdinfo reports f_flags, userspace can already detect PIDFD_THREAD.

Reviewed-by: Tycho Andersen <tandersen@netflix.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Link: https://lore.kernel.org/r/20240209130650.GA8048@redhat.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-02-10 22:37:23 +01:00
Oleg Nesterov
c044a95026
signal: fill in si_code in prepare_kill_siginfo()
So that do_tkill() can use this helper too. This also simplifies
the next patch.

TODO: perhaps we can kill prepare_kill_siginfo() and change the
callers to use SEND_SIG_NOINFO,  but this needs some changes in
__send_signal_locked() and TP_STORE_SIGINFO().

Reviewed-by: Tycho Andersen <tandersen@netflix.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Link: https://lore.kernel.org/r/20240209130620.GA8039@redhat.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-02-10 22:37:18 +01:00
Tejun Heo
bf52b1ac6a async: Use a dedicated unbound workqueue with raised min_active
Async can schedule a number of interdependent work items. However, since
5797b1c189 ("workqueue: Implement system-wide nr_active enforcement for
unbound workqueues"), unbound workqueues have separate min_active which sets
the number of interdependent work items that can be handled. This default
value is 8 which isn't sufficient for async and can lead to stalls during
resume from suspend in some cases.

Let's use a dedicated unbound workqueue with raised min_active.

Link: http://lkml.kernel.org/r/708a65cc-79ec-44a6-8454-a93d0f3114c3@samsung.com
Reported-by: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Rafael J. Wysocki <rjw@rjwysocki.net>
Tested-by: Marek Szyprowski <m.szyprowski@samsung.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2024-02-09 11:13:59 -10:00
Tejun Heo
8f172181f2 workqueue: Implement workqueue_set_min_active()
Since 5797b1c189 ("workqueue: Implement system-wide nr_active enforcement
for unbound workqueues"), unbound workqueues have separate min_active which
sets the number of interdependent work items that can be handled. This value
is currently initialized to WQ_DFL_MIN_ACTIVE which is 8. This isn't high
enough for some users, let's add an interface to adjust the setting.

Signed-off-by: Tejun Heo <tj@kernel.org>
2024-02-09 11:13:59 -10:00
Waiman Long
516d3dc99f workqueue: Fix kernel-doc comment of unplug_oldest_pwq()
Fix the kernel-doc comment of the unplug_oldest_pwq() function to enable
proper processing and formatting of the embedded ASCII diagram.

Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2024-02-09 11:04:13 -10:00
Linus Torvalds
ca8a66738a Tracing fixes for v6.8-rc3:
- Fix broken direct trampolines being called when another callback is
   attached the same function. ARM 64 does not support FTRACE_WITH_REGS, and
   when it added direct trampoline calls from ftrace, it removed the
   "WITH_REGS" flag from the ftrace_ops for direct trampolines. This broke
   x86 as x86 requires direct trampolines to have WITH_REGS. This wasn't
   noticed because direct trampolines work as long as the function it is
   attached to is not shared with other callbacks (like the function tracer).
   When there's other callbacks, a helper trampoline is called, to call all
   the non direct callbacks and when it returns, the direct trampoline is
   called. For x86, the direct trampoline sets a flag in the regs field to
   tell the x86 specific code to call the direct trampoline. But this only
   works if the ftrace_ops had WITH_REGS set. ARM does things differently
   that does not require this. For now, set WITH_REGS if the arch supports
   WITH_REGS (which ARM does not), and this makes it work for both ARM64 and
   x86.
 
 - Fix wasted memory in the saved_cmdlines logic.
 
   The saved_cmdlines is a cache that maps PIDs to COMMs that tracing can
   use. Most trace events only save the PID in the event. The saved_cmdlines
   file lists PIDs to COMMs so that the tracing tools can show an actual name
   and not just a PID for each event. There's an array of PIDs that map to a
   small set of saved COMM strings. The array is set to PID_MAX_DEFAULT which
   is usually set to 32768. When a PID comes in, it will add itself to this
   array along with the index into the COMM array (note if the system allows
   more than PID_MAX_DEFAULT, this cache is similar to cache lines as an
   update of a PID that has the same PID_MAX_DEFAULT bits set will flush out
   another task with the same matching bits set).
 
   A while ago, the size of this cache was changed to be dynamic and the
   array was moved into a structure and created with kmalloc(). But this
   new structure had the size of 131104 bytes, or 0x20020 in hex. As kmalloc
   allocates in powers of two, it was actually allocating 0x40000 bytes
   (262144) leaving 131040 bytes of wasted memory. The last element of this
   structure was a pointer to the COMM string array which defaulted to just
   saving 128 COMMs.
 
   By changing the last field of this structure to a variable length string,
   and just having it round up to fill the allocated memory, the default
   size of the saved COMM cache is now 8190. This not only uses the wasted
   space, but actually saves space by removing the extra allocation for the
   COMM names.
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCZcYi8RQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qqENAQD6xGE9EPkbHArElKfgpSuQOfGhcyyP
 LjgVhqVgmIoqUwD8CeVpxk3VwZIOQYvPn5XictcZgkYSeEWUZcKYg4c/3gs=
 =iIBv
 -----END PGP SIGNATURE-----

Merge tag 'trace-v6.8-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull tracing fixes from Steven Rostedt:

 - Fix broken direct trampolines being called when another callback is
   attached the same function.

   ARM 64 does not support FTRACE_WITH_REGS, and when it added direct
   trampoline calls from ftrace, it removed the "WITH_REGS" flag from
   the ftrace_ops for direct trampolines. This broke x86 as x86 requires
   direct trampolines to have WITH_REGS.

   This wasn't noticed because direct trampolines work as long as the
   function it is attached to is not shared with other callbacks (like
   the function tracer). When there are other callbacks, a helper
   trampoline is called, to call all the non direct callbacks and when
   it returns, the direct trampoline is called.

   For x86, the direct trampoline sets a flag in the regs field to tell
   the x86 specific code to call the direct trampoline. But this only
   works if the ftrace_ops had WITH_REGS set. ARM does things
   differently that does not require this. For now, set WITH_REGS if the
   arch supports WITH_REGS (which ARM does not), and this makes it work
   for both ARM64 and x86.

 - Fix wasted memory in the saved_cmdlines logic.

   The saved_cmdlines is a cache that maps PIDs to COMMs that tracing
   can use. Most trace events only save the PID in the event. The
   saved_cmdlines file lists PIDs to COMMs so that the tracing tools can
   show an actual name and not just a PID for each event. There's an
   array of PIDs that map to a small set of saved COMM strings. The
   array is set to PID_MAX_DEFAULT which is usually set to 32768. When a
   PID comes in, it will add itself to this array along with the index
   into the COMM array (note if the system allows more than
   PID_MAX_DEFAULT, this cache is similar to cache lines as an update of
   a PID that has the same PID_MAX_DEFAULT bits set will flush out
   another task with the same matching bits set).

   A while ago, the size of this cache was changed to be dynamic and the
   array was moved into a structure and created with kmalloc(). But this
   new structure had the size of 131104 bytes, or 0x20020 in hex. As
   kmalloc allocates in powers of two, it was actually allocating
   0x40000 bytes (262144) leaving 131040 bytes of wasted memory. The
   last element of this structure was a pointer to the COMM string array
   which defaulted to just saving 128 COMMs.

   By changing the last field of this structure to a variable length
   string, and just having it round up to fill the allocated memory, the
   default size of the saved COMM cache is now 8190. This not only uses
   the wasted space, but actually saves space by removing the extra
   allocation for the COMM names.

* tag 'trace-v6.8-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  tracing: Fix wasted memory in saved_cmdlines logic
  ftrace: Fix DIRECT_CALLS to use SAVE_REGS by default
2024-02-09 11:13:19 -08:00
Steven Rostedt (Google)
44dc5c41b5 tracing: Fix wasted memory in saved_cmdlines logic
While looking at improving the saved_cmdlines cache I found a huge amount
of wasted memory that should be used for the cmdlines.

The tracing data saves pids during the trace. At sched switch, if a trace
occurred, it will save the comm of the task that did the trace. This is
saved in a "cache" that maps pids to comms and exposed to user space via
the /sys/kernel/tracing/saved_cmdlines file. Currently it only caches by
default 128 comms.

The structure that uses this creates an array to store the pids using
PID_MAX_DEFAULT (which is usually set to 32768). This causes the structure
to be of the size of 131104 bytes on 64 bit machines.

In hex: 131104 = 0x20020, and since the kernel allocates generic memory in
powers of two, the kernel would allocate 0x40000 or 262144 bytes to store
this structure. That leaves 131040 bytes of wasted space.

Worse, the structure points to an allocated array to store the comm names,
which is 16 bytes times the amount of names to save (currently 128), which
is 2048 bytes. Instead of allocating a separate array, make the structure
end with a variable length string and use the extra space for that.

This is similar to a recommendation that Linus had made about eventfs_inode names:

  https://lore.kernel.org/all/20240130190355.11486-5-torvalds@linux-foundation.org/

Instead of allocating a separate string array to hold the saved comms,
have the structure end with: char saved_cmdlines[]; and round up to the
next power of two over sizeof(struct saved_cmdline_buffers) + num_cmdlines * TASK_COMM_LEN
It will use this extra space for the saved_cmdline portion.

Now, instead of saving only 128 comms by default, by using this wasted
space at the end of the structure it can save over 8000 comms and even
saves space by removing the need for allocating the other array.

Link: https://lore.kernel.org/linux-trace-kernel/20240209063622.1f7b6d5f@rorschach.local.home

Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Vincent Donnefort <vdonnefort@google.com>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Mete Durlu <meted@linux.ibm.com>
Fixes: 939c7a4f04 ("tracing: Introduce saved_cmdlines_size file")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-02-09 06:43:21 -05:00
Masami Hiramatsu (Google)
a8b9cf62ad ftrace: Fix DIRECT_CALLS to use SAVE_REGS by default
The commit 60c8971899 ("ftrace: Make DIRECT_CALLS work WITH_ARGS
and !WITH_REGS") changed DIRECT_CALLS to use SAVE_ARGS when there
are multiple ftrace_ops at the same function, but since the x86 only
support to jump to direct_call from ftrace_regs_caller, when we set
the function tracer on the same target function on x86, ftrace-direct
does not work as below (this actually works on arm64.)

At first, insmod ftrace-direct.ko to put a direct_call on
'wake_up_process()'.

 # insmod kernel/samples/ftrace/ftrace-direct.ko
 # less trace
...
          <idle>-0       [006] ..s1.   564.686958: my_direct_func: waking up rcu_preempt-17
          <idle>-0       [007] ..s1.   564.687836: my_direct_func: waking up kcompactd0-63
          <idle>-0       [006] ..s1.   564.690926: my_direct_func: waking up rcu_preempt-17
          <idle>-0       [006] ..s1.   564.696872: my_direct_func: waking up rcu_preempt-17
          <idle>-0       [007] ..s1.   565.191982: my_direct_func: waking up kcompactd0-63

Setup a function filter to the 'wake_up_process' too, and enable it.

 # cd /sys/kernel/tracing/
 # echo wake_up_process > set_ftrace_filter
 # echo function > current_tracer
 # less trace
...
          <idle>-0       [006] ..s3.   686.180972: wake_up_process <-call_timer_fn
          <idle>-0       [006] ..s3.   686.186919: wake_up_process <-call_timer_fn
          <idle>-0       [002] ..s3.   686.264049: wake_up_process <-call_timer_fn
          <idle>-0       [002] d.h6.   686.515216: wake_up_process <-kick_pool
          <idle>-0       [002] d.h6.   686.691386: wake_up_process <-kick_pool

Then, only function tracer is shown on x86.
But if you enable 'kprobe on ftrace' event (which uses SAVE_REGS flag)
on the same function, it is shown again.

 # echo 'p wake_up_process' >> dynamic_events
 # echo 1 > events/kprobes/p_wake_up_process_0/enable
 # echo > trace
 # less trace
...
          <idle>-0       [006] ..s2.  2710.345919: p_wake_up_process_0: (wake_up_process+0x4/0x20)
          <idle>-0       [006] ..s3.  2710.345923: wake_up_process <-call_timer_fn
          <idle>-0       [006] ..s1.  2710.345928: my_direct_func: waking up rcu_preempt-17
          <idle>-0       [006] ..s2.  2710.349931: p_wake_up_process_0: (wake_up_process+0x4/0x20)
          <idle>-0       [006] ..s3.  2710.349934: wake_up_process <-call_timer_fn
          <idle>-0       [006] ..s1.  2710.349937: my_direct_func: waking up rcu_preempt-17

To fix this issue, use SAVE_REGS flag for multiple ftrace_ops flag of
direct_call by default.

Link: https://lore.kernel.org/linux-trace-kernel/170484558617.178953.1590516949390270842.stgit@devnote2

Fixes: 60c8971899 ("ftrace: Make DIRECT_CALLS work WITH_ARGS and !WITH_REGS")
Cc: stable@vger.kernel.org
Cc: Florent Revest <revest@chromium.org>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Reviewed-by: Mark Rutland <mark.rutland@arm.com>
Tested-by: Mark Rutland <mark.rutland@arm.com> [arm64]
Acked-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-02-09 04:58:22 -05:00
Jakub Kicinski
3be042cf46 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR.

No conflicts.

Adjacent changes:

drivers/net/ethernet/stmicro/stmmac/common.h
  38cc3c6dcc ("net: stmmac: protect updates of 64-bit statistics counters")
  fd5a6a7131 ("net: stmmac: est: Per Tx-queue error count for HLBF")
  c5c3e1bfc9 ("net: stmmac: Offload queueMaxSDU from tc-taprio")

drivers/net/wireless/microchip/wilc1000/netdev.c
  c901388028 ("wifi: fill in MODULE_DESCRIPTION()s for wilc1000")
  328efda22a ("wifi: wilc1000: do not realloc workqueue everytime an interface is added")

net/unix/garbage.c
  11498715f2 ("af_unix: Remove io_uring code for GC.")
  1279f9d9de ("af_unix: Call kfree_skb() for dead unix_(sk)->oob_skb in GC.")

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-02-08 15:30:33 -08:00
Geliang Tang
947e56f82f bpf, btf: Check btf for register_bpf_struct_ops
Similar to the handling in the functions __register_btf_kfunc_id_set()
and register_btf_id_dtor_kfuncs(), this patch uses the newly added
helper check_btf_kconfigs() to handle module with its btf section
stripped.

While at it, the patch also adds the missed IS_ERR() check to fix the
commit f6be98d199 ("bpf, net: switch to dynamic registration")

Fixes: f6be98d199 ("bpf, net: switch to dynamic registration")
Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn>
Link: https://lore.kernel.org/r/69082b9835463fe36f9e354bddf2d0a97df39c2b.1707373307.git.tanggeliang@kylinos.cn
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2024-02-08 11:37:49 -08:00
Waiman Long
49584bb8dd workqueue: Bind unbound workqueue rescuer to wq_unbound_cpumask
Commit 85f0ab43f9 ("kernel/workqueue: Bind rescuer to unbound
cpumask for WQ_UNBOUND") modified init_rescuer() to bind rescuer of
an unbound workqueue to the cpumask in wq->unbound_attrs. However
unbound_attrs->cpumask's of all workqueues are initialized to
cpu_possible_mask and will only be changed if it has the WQ_SYSFS flag
to expose a cpumask sysfs file to be written by users. So this patch
doesn't achieve what it is intended to do.

If an unbound workqueue is created after wq_unbound_cpumask is modified
and there is no more unbound cpumask update after that, the unbound
rescuer will be bound to all CPUs unless the workqueue is created
with the WQ_SYSFS flag and a user explicitly modified its cpumask
sysfs file.  Fix this problem by binding directly to wq_unbound_cpumask
in init_rescuer().

Fixes: 85f0ab43f9 ("kernel/workqueue: Bind rescuer to unbound cpumask for WQ_UNBOUND")
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2024-02-08 09:23:38 -10:00
Juri Lelli
d64f2fa064 kernel/workqueue: Let rescuers follow unbound wq cpumask changes
When workqueue cpumask changes are committed the associated rescuer (if
one exists) affinity is not touched and this might be a problem down the
line for isolated setups.

Make sure rescuers affinity is updated every time a workqueue cpumask
changes, so that rescuers can't break isolation.

 [longman: set_cpus_allowed_ptr() will block until the designated task
  is enqueued on an allowed CPU, no wake_up_process() needed. Also use
  the unbound_effective_cpumask() helper as suggested by Tejun.]

Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2024-02-08 09:23:32 -10:00
Geliang Tang
9e60b0e025 bpf, btf: Add check_btf_kconfigs helper
This patch extracts duplicate code on error path when btf_get_module_btf()
returns NULL from the functions __register_btf_kfunc_id_set() and
register_btf_id_dtor_kfuncs() into a new helper named check_btf_kconfigs()
to check CONFIG_DEBUG_INFO_BTF and CONFIG_DEBUG_INFO_BTF_MODULES in it.

Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/r/fa5537fc55f1e4d0bfd686598c81b7ab9dbd82b7.1707373307.git.tanggeliang@kylinos.cn
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2024-02-08 11:22:56 -08:00
Waiman Long
4c065dbce1 workqueue: Enable unbound cpumask update on ordered workqueues
Ordered workqueues does not currently follow changes made to the
global unbound cpumask because per-pool workqueue changes may break
the ordering guarantee. IOW, a work function in an ordered workqueue
may run on an isolated CPU.

This patch enables ordered workqueues to follow changes made to the
global unbound cpumask by temporaily plug or suspend the newly allocated
pool_workqueue from executing newly queued work items until the old
pwq has been properly drained. For ordered workqueues, there should
only be one pwq that is unplugged, the rests should be plugged.

This enables ordered workqueues to follow the unbound cpumask changes
like other unbound workqueues at the expense of some delay in execution
of work functions during the transition period.

Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2024-02-08 09:22:49 -10:00
Waiman Long
26fb7e3dda workqueue: Link pwq's into wq->pwqs from oldest to newest
Add a new pwq into the tail of wq->pwqs so that pwq iteration will
start from the oldest pwq to the newest. This ordering will facilitate
the inclusion of ordered workqueues in a wq_unbound_cpumask update.

Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2024-02-08 09:22:30 -10:00
Geliang Tang
b9a395f0f7 bpf, btf: Fix return value of register_btf_id_dtor_kfuncs
The same as __register_btf_kfunc_id_set(), to let the modules with
stripped btf section loaded, this patch changes the return value of
register_btf_id_dtor_kfuncs() too from -ENOENT to 0 when btf is NULL.

Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn>
Link: https://lore.kernel.org/r/eab65586d7fb0e72f2707d3747c7d4a5d60c823f.1707373307.git.tanggeliang@kylinos.cn
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2024-02-08 11:05:07 -08:00
Li zeming
9efd24ec55 kprobes: Remove unnecessary initial values of variables
ri and sym is assigned first, so it does not need to initialize the
assignment.

Link: https://lore.kernel.org/all/20230919012823.7815-1-zeming@nfschina.com/

Signed-off-by: Li zeming <zeming@nfschina.com>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2024-02-08 23:29:29 +09:00
Masami Hiramatsu (Google)
9a571c1e27 tracing/probes: Fix to set arg size and fmt after setting type from BTF
Since the BTF type setting updates probe_arg::type, the type size
calculation and setting print-fmt should be done after that.
Without this fix, the argument size and print-fmt can be wrong.

Link: https://lore.kernel.org/all/170602218196.215583.6417859469540955777.stgit@devnote2/

Fixes: b576e09701 ("tracing/probes: Support function parameters if BTF is available")
Cc: stable@vger.kernel.org
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2024-02-08 23:26:25 +09:00
Masami Hiramatsu (Google)
8c427cc2fa tracing/probes: Fix to show a parse error for bad type for $comm
Fix to show a parse error for bad type (non-string) for $comm/$COMM and
immediate-string. With this fix, error_log file shows appropriate error
message as below.

 /sys/kernel/tracing # echo 'p vfs_read $comm:u32' >> kprobe_events
sh: write error: Invalid argument
 /sys/kernel/tracing # echo 'p vfs_read \"hoge":u32' >> kprobe_events
sh: write error: Invalid argument
 /sys/kernel/tracing # cat error_log

[   30.144183] trace_kprobe: error: $comm and immediate-string only accepts string type
  Command: p vfs_read $comm:u32
                            ^
[   62.618500] trace_kprobe: error: $comm and immediate-string only accepts string type
  Command: p vfs_read \"hoge":u32
                              ^
Link: https://lore.kernel.org/all/170602215411.215583.2238016352271091852.stgit@devnote2/

Fixes: 3dd1f7f24f ("tracing: probeevent: Fix to make the type of $comm string")
Cc: stable@vger.kernel.org
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2024-02-08 23:26:13 +09:00
Oleg Nesterov
c1be35a16b exit: wait_task_zombie: kill the no longer necessary spin_lock_irq(siglock)
After the recent changes nobody use siglock to read the values protected
by stats_lock, we can kill spin_lock_irq(&current->sighand->siglock) and
update the comment.

With this patch only __exit_signal() and thread_group_start_cputime() take
stats_lock under siglock.

Link: https://lkml.kernel.org/r/20240123153359.GA21866@redhat.com
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Dylan Hatch <dylanbhatch@google.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-07 21:20:33 -08:00
Oleg Nesterov
f7ec1cd5cc getrusage: use sig->stats_lock rather than lock_task_sighand()
lock_task_sighand() can trigger a hard lockup. If NR_CPUS threads call
getrusage() at the same time and the process has NR_THREADS, spin_lock_irq
will spin with irqs disabled O(NR_CPUS * NR_THREADS) time.

Change getrusage() to use sig->stats_lock, it was specifically designed
for this type of use. This way it runs lockless in the likely case.

TODO:
	- Change do_task_stat() to use sig->stats_lock too, then we can
	  remove spin_lock_irq(siglock) in wait_task_zombie().

	- Turn sig->stats_lock into seqcount_rwlock_t, this way the
	  readers in the slow mode won't exclude each other. See
	  https://lore.kernel.org/all/20230913154907.GA26210@redhat.com/

	- stats_lock has to disable irqs because ->siglock can be taken
	  in irq context, it would be very nice to change __exit_signal()
	  to avoid the siglock->stats_lock dependency.

Link: https://lkml.kernel.org/r/20240122155053.GA26214@redhat.com
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reported-by: Dylan Hatch <dylanbhatch@google.com>
Tested-by: Dylan Hatch <dylanbhatch@google.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-07 21:20:32 -08:00
Oleg Nesterov
daa694e413 getrusage: move thread_group_cputime_adjusted() outside of lock_task_sighand()
Patch series "getrusage: use sig->stats_lock", v2.


This patch (of 2):

thread_group_cputime() does its own locking, we can safely shift
thread_group_cputime_adjusted() which does another for_each_thread loop
outside of ->siglock protected section.

This is also preparation for the next patch which changes getrusage() to
use stats_lock instead of siglock, thread_group_cputime() takes the same
lock.  With the current implementation recursive read_seqbegin_or_lock()
is fine, thread_group_cputime() can't enter the slow mode if the caller
holds stats_lock, yet this looks more safe and better performance-wise.

Link: https://lkml.kernel.org/r/20240122155023.GA26169@redhat.com
Link: https://lkml.kernel.org/r/20240122155050.GA26205@redhat.com
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reported-by: Dylan Hatch <dylanbhatch@google.com>
Tested-by: Dylan Hatch <dylanbhatch@google.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-07 21:20:32 -08:00
Masahiro Yamada
e55dad12ab bpf: Merge two CONFIG_BPF entries
'config BPF' exists in both init/Kconfig and kernel/bpf/Kconfig.

Commit b24abcff91 ("bpf, kconfig: Add consolidated menu entry for bpf
with core options") added the second one to kernel/bpf/Kconfig instead
of moving the existing one.

Merge them together.

Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/bpf/20240204075634.32969-1-masahiroy@kernel.org
2024-02-07 16:38:20 -08:00
Peter Hilber
4b7f521229 timekeeping: Evaluate system_counterval_t.cs_id instead of .cs
Clocksource pointers can be problematic to obtain for drivers which are not
clocksource drivers themselves. In particular, the RFC virtio_rtc driver
[1] would require a new helper function to obtain a pointer to the ARM
Generic Timer clocksource. The ptp_kvm driver also required a similar
workaround.

Address this by evaluating the clocksource ID, rather than the clocksource
pointer, of struct system_counterval_t. By this, setting the clocksource
pointer becomes unneeded, and get_device_system_crosststamp() callers will
no longer need to supply clocksource pointers.

All relevant clocksource drivers provide the ID, so this change is not
changing the behaviour.

[1] https://lore.kernel.org/lkml/20231218073849.35294-1-peter.hilber@opensynergy.com/

Signed-off-by: Peter Hilber <peter.hilber@opensynergy.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20240201010453.2212371-7-peter.hilber@opensynergy.com
2024-02-07 17:05:21 +01:00
Ricardo B. Marliere
49f1ff50d4 clockevents: Make clockevents_subsys const
Now that the driver core can properly handle constant struct bus_type,
move the clockevents_subsys variable to be a constant structure as well,
placing it into read-only memory which can not be modified at runtime.

Suggested-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Ricardo B. Marliere <ricardo@marliere.net>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Link: https://lore.kernel.org/r/20240204-bus_cleanup-time-v1-2-207ec18e24b8@marliere.net
2024-02-07 15:11:24 +01:00
Ricardo B. Marliere
2bc7fc24f9 clocksource: Make clocksource_subsys const
Now that the driver core can properly handle constant struct bus_type,
move the clocksource_subsys variable to be a constant structure as well,
placing it into read-only memory which can not be modified at runtime.

Suggested-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Ricardo B. Marliere <ricardo@marliere.net>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Acked-by: John Stultz <jstultz@google.com>
Link: https://lore.kernel.org/r/20240204-bus_cleanup-time-v1-1-207ec18e24b8@marliere.net
2024-02-07 15:11:24 +01:00
Tycho Andersen
0c9bd6bc4b
pidfd: getfd should always report ESRCH if a task is exiting
We can get EBADF from pidfd_getfd() if a task is currently exiting,
which might be confusing. Let's check PF_EXITING, and just report ESRCH
if so.

I chose PF_EXITING, because it is set in exit_signals(), which is called
before exit_files(). Since ->exit_status is mostly set after
exit_files() in exit_notify(), using that still leaves a window open for
the race.

Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Tycho Andersen <tandersen@netflix.com>
Link: https://lore.kernel.org/r/20240206192357.81942-1-tycho@tycho.pizza
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-02-07 12:09:44 +01:00
Oleg Nesterov
83b290c9e3 pidfd: clone: allow CLONE_THREAD | CLONE_PIDFD together
copy_process() just needs to pass PIDFD_THREAD to __pidfd_prepare()
if clone_flags & CLONE_THREAD.

We can also add another CLONE_ flag (or perhaps reuse CLONE_DETACHED)
to enforce PIDFD_THREAD without CLONE_THREAD.

Originally-from: Tycho Andersen <tycho@tycho.pizza>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Link: https://lore.kernel.org/r/20240205145532.GA28823@redhat.com
Reviewed-by: Tycho Andersen <tandersen@netflix.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-02-06 14:39:32 +01:00
Oleg Nesterov
e2e8a142fb pidfd: exit: kill the no longer used thread_group_exited()
It was used by pidfd_poll() but now it has no callers.

If it finally finds a modular user we can revert this change, but note
that the comment above this helper and the changelog in 38fd525a4c
("exit: Factor thread_group_exited out of pidfd_poll") are not accurate,
thread_group_exited() won't return true if all other threads have passed
exit_notify() and are zombies, it returns true only when all other threads
are completely gone. Not to mention that it can only work if the task
identified by @pid is a thread-group leader.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Link: https://lore.kernel.org/r/20240205174347.GA31461@redhat.com
Reviewed-by: Tycho Andersen <tandersen@netflix.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-02-06 14:02:51 +01:00
Oleg Nesterov
9ed52108f6 pidfd: change do_notify_pidfd() to use __wake_up(poll_to_key(EPOLLIN))
rather than wake_up_all(). This way do_notify_pidfd() won't wakeup the
POLLHUP-only waiters which wait for pid_task() == NULL.

TODO:
    - as Christian pointed out, this asks for the new wake_up_all_poll()
      helper, it can already have other users.

    - we can probably discriminate the PIDFD_THREAD and non-PIDFD_THREAD
      waiters, but this needs more work. See
      https://lore.kernel.org/all/20240205140848.GA15853@redhat.com/

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Link: https://lore.kernel.org/r/20240205141348.GA16539@redhat.com
Reviewed-by: Tycho Andersen <tandersen@netflix.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-02-06 13:52:51 +01:00
Frederic Weisbecker
dad6a09f31 hrtimer: Report offline hrtimer enqueue
The hrtimers migration on CPU-down hotplug process has been moved
earlier, before the CPU actually goes to die. This leaves a small window
of opportunity to queue an hrtimer in a blind spot, leaving it ignored.

For example a practical case has been reported with RCU waking up a
SCHED_FIFO task right before the CPUHP_AP_IDLE_DEAD stage, queuing that
way a sched/rt timer to the local offline CPU.

Make sure such situations never go unnoticed and warn when that happens.

Fixes: 5c0930ccaa ("hrtimers: Push pending hrtimers away from outgoing CPU earlier")
Reported-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20240129235646.3171983-4-boqun.feng@gmail.com
2024-02-06 10:56:35 +01:00
Kumar Kartikeya Dwivedi
6fceea0fa5 bpf: Transfer RCU lock state between subprog calls
Allow transferring an imbalanced RCU lock state between subprog calls
during verification. This allows patterns where a subprog call returns
with an RCU lock held, or a subprog call releases an RCU lock held by
the caller. Currently, the verifier would end up complaining if the RCU
lock is not released when processing an exit from a subprog, which is
non-ideal if its execution is supposed to be enclosed in an RCU read
section of the caller.

Instead, simply only check whether we are processing exit for frame#0
and do not complain on an active RCU lock otherwise. We only need to
update the check when processing BPF_EXIT insn, as copy_verifier_state
is already set up to do the right thing.

Suggested-by: David Vernet <void@manifault.com>
Tested-by: Yafang Shao <laoar.shao@gmail.com>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Acked-by: David Vernet <void@manifault.com>
Link: https://lore.kernel.org/r/20240205055646.1112186-2-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-02-05 20:00:14 -08:00
Kumar Kartikeya Dwivedi
a44b1334aa bpf: Allow calling static subprogs while holding a bpf_spin_lock
Currently, calling any helpers, kfuncs, or subprogs except the graph
data structure (lists, rbtrees) API kfuncs while holding a bpf_spin_lock
is not allowed. One of the original motivations of this decision was to
force the BPF programmer's hand into keeping the bpf_spin_lock critical
section small, and to ensure the execution time of the program does not
increase due to lock waiting times. In addition to this, some of the
helpers and kfuncs may be unsafe to call while holding a bpf_spin_lock.

However, when it comes to subprog calls, atleast for static subprogs,
the verifier is able to explore their instructions during verification.
Therefore, it is similar in effect to having the same code inlined into
the critical section. Hence, not allowing static subprog calls in the
bpf_spin_lock critical section is mostly an annoyance that needs to be
worked around, without providing any tangible benefit.

Unlike static subprog calls, global subprog calls are not safe to permit
within the critical section, as the verifier does not explore them
during verification, therefore whether the same lock will be taken
again, or unlocked, cannot be ascertained.

Therefore, allow calling static subprogs within a bpf_spin_lock critical
section, and only reject it in case the subprog linkage is global.

Acked-by: Yonghong Song <yonghong.song@linux.dev>
Acked-by: David Vernet <void@manifault.com>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20240204222349.938118-2-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-02-05 19:58:47 -08:00
Tejun Heo
40911d4457 Merge branch 'for-6.8-fixes' into for-6.9
The for-6.8-fixes commit ae9cc8956944 ("Revert "workqueue: Override implicit
ordered attribute in workqueue_apply_unbound_cpumask()") also fixes build for

Signed-off-by: Tejun Heo <tj@kernel.org>
2024-02-05 15:49:47 -10:00
Tejun Heo
aac8a59537 Revert "workqueue: Override implicit ordered attribute in workqueue_apply_unbound_cpumask()"
This reverts commit ca10d851b9.

The commit allowed workqueue_apply_unbound_cpumask() to clear __WQ_ORDERED
on now removed implicitly ordered workqueues. This was incorrect in that
system-wide config change shouldn't break ordering properties of all
workqueues. The reason why apply_workqueue_attrs() path was allowed to do so
was because it was targeting the specific workqueue - either the workqueue
had WQ_SYSFS set or the workqueue user specifically tried to change
max_active, both of which indicate that the workqueue doesn't need to be
ordered.

The implicitly ordered workqueue promotion was removed by the previous
commit 3bc1e711c2 ("workqueue: Don't implicitly make UNBOUND workqueues w/
@max_active==1 ordered"). However, it didn't update this path and broke
build. Let's revert the commit which was incorrect in the first place which
also fixes build.

Signed-off-by: Tejun Heo <tj@kernel.org>
Fixes: 3bc1e711c2 ("workqueue: Don't implicitly make UNBOUND workqueues w/ @max_active==1 ordered")
Fixes: ca10d851b9 ("workqueue: Override implicit ordered attribute in workqueue_apply_unbound_cpumask()")
Cc: stable@vger.kernel.org # v6.6+
Signed-off-by: Tejun Heo <tj@kernel.org>
2024-02-05 15:49:06 -10:00
Tejun Heo
3bc1e711c2 workqueue: Don't implicitly make UNBOUND workqueues w/ @max_active==1 ordered
5c0338c687 ("workqueue: restore WQ_UNBOUND/max_active==1 to be ordered")
automoatically promoted UNBOUND workqueues w/ @max_active==1 to ordered
workqueues because UNBOUND workqueues w/ @max_active==1 used to be the way
to create ordered workqueues and the new NUMA support broke it. These
problems can be subtle and the fact that they can only trigger on NUMA
machines made them even more difficult to debug.

However, overloading the UNBOUND allocation interface this way creates other
issues. It's difficult to tell whether a given workqueue actually needs to
be ordered and users that legitimately want a min concurrency level wq
unexpectedly gets an ordered one instead. With planned UNBOUND workqueue
udpates to improve execution locality and more prevalence of chiplet designs
which can benefit from such improvements, this isn't a state we wanna be in
forever.

There aren't that many UNBOUND w/ @max_active==1 users in the tree and the
preceding patches audited all and converted them to
alloc_ordered_workqueue() as appropriate. This patch removes the implicit
promotion of UNBOUND w/ @max_active==1 workqueues to ordered ones.

v2: v1 patch incorrectly dropped !list_empty(&wq->pwqs) condition in
    apply_workqueue_attrs_locked() which spuriously triggers WARNING and
    fails workqueue creation. Fix it.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: kernel test robot <oliver.sang@intel.com>
Link: https://lore.kernel.org/oe-lkp/202304251050.45a5df1f-oliver.sang@intel.com
2024-02-05 14:19:10 -10:00
Tejun Heo
7245d24f87 backtracetest: Convert from tasklet to BH workqueue
The only generic interface to execute asynchronously in the BH context is
tasklet; however, it's marked deprecated and has some design flaws. To
replace tasklets, BH workqueue support was recently added. A BH workqueue
behaves similarly to regular workqueues except that the queued work items
are executed in the BH context.

This patch converts backtracetest from tasklet to BH workqueue.

- Replace "irq" with "bh" in names and message to better reflect what's
  happening.

- Replace completion usage with a flush_work() call.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Arjan van de Ven <arjan@linux.intel.com>
2024-02-05 13:22:34 -10:00
Kui-Feng Lee
df9705eaa0 bpf: Remove an unnecessary check.
The "i" here is always equal to "btf_type_vlen(t)" since
the "for_each_member()" loop never breaks.

Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20240203055119.2235598-1-thinker.li@gmail.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2024-02-05 10:25:08 -08:00
Waiman Long
8eb17dc1a6 workqueue: Skip __WQ_DESTROYING workqueues when updating global unbound cpumask
Skip updating workqueues with __WQ_DESTROYING bit set when updating
global unbound cpumask to avoid unnecessary work and other complications.

Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2024-02-05 07:52:22 -10:00
Wang Jinchao
96068b6030 workqueue: fix a typo in comment
There should be three, fix it.

Signed-off-by: Wang Jinchao <wangjinchao@xfusion.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2024-02-05 07:48:25 -10:00
Tejun Heo
4f19b8e01e Revert "workqueue: make wq_subsys const"
This reverts commit d412ace111. This leads to
build failures as it depends on a driver-core commit 32f78abe59 ("driver
core: bus: constantify subsys_register() calls"). Let's drop it from wq tree
and route it through driver-core tree.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202402051505.kM9Rr3CJ-lkp@intel.com/
2024-02-05 07:19:54 -10:00
Jens Axboe
06b23f92af block: update cached timestamp post schedule/preemption
Mark the task as having a cached timestamp when set assign it, so we
can efficiently check if it needs updating post being scheduled back in.
This covers both the actual schedule out case, which would've flushed
the plug, and the preemption case which doesn't touch the plugged
requests (for many reasons, one of them being then we'd need to have
preemption disabled around plug state manipulation).

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-02-05 10:07:34 -07:00
Tejun Heo
4cb1ef6460 workqueue: Implement BH workqueues to eventually replace tasklets
The only generic interface to execute asynchronously in the BH context is
tasklet; however, it's marked deprecated and has some design flaws such as
the execution code accessing the tasklet item after the execution is
complete which can lead to subtle use-after-free in certain usage scenarios
and less-developed flush and cancel mechanisms.

This patch implements BH workqueues which share the same semantics and
features of regular workqueues but execute their work items in the softirq
context. As there is always only one BH execution context per CPU, none of
the concurrency management mechanisms applies and a BH workqueue can be
thought of as a convenience wrapper around softirq.

Except for the inability to sleep while executing and lack of max_active
adjustments, BH workqueues and work items should behave the same as regular
workqueues and work items.

Currently, the execution is hooked to tasklet[_hi]. However, the goal is to
convert all tasklet users over to BH workqueues. Once the conversion is
complete, tasklet can be removed and BH workqueues can directly take over
the tasklet softirqs.

system_bh[_highpri]_wq are added. As queue-wide flushing doesn't exist in
tasklet, all existing tasklet users should be able to use the system BH
workqueues without creating their own workqueues.

v3: - Add missing interrupt.h include.

v2: - Instead of using tasklets, hook directly into its softirq action
      functions - tasklet[_hi]_action(). This is slightly cheaper and closer
      to the eventual code structure we want to arrive at. Suggested by Lai.

    - Lai also pointed out several places which need NULL worker->task
      handling or can use clarification. Updated.

Signed-off-by: Tejun Heo <tj@kernel.org>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/CAHk-=wjDW53w4-YcSmgKC5RruiRLHmJ1sXeYdp_ZgVoBw=5byA@mail.gmail.com
Tested-by: Allen Pais <allen.lkml@gmail.com>
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
2024-02-04 11:28:06 -10:00
Tejun Heo
2fcdb1b444 workqueue: Factor out init_cpu_worker_pool()
Factor out init_cpu_worker_pool() from workqueue_init_early(). This is pure
reorganization in preparation of BH workqueue support.

Signed-off-by: Tejun Heo <tj@kernel.org>
Tested-by: Allen Pais <allen.lkml@gmail.com>
2024-02-04 11:28:06 -10:00
Tejun Heo
c35aea39d1 workqueue: Update lock debugging code
These changes are in preparation of BH workqueue which will execute work
items from BH context.

- Update lock and RCU depth checks in process_one_work() so that it
  remembers and checks against the starting depths and prints out the depth
  changes.

- Factor out lockdep annotations in the flush paths into
  touch_{wq|work}_lockdep_map(). The work->lockdep_map touching is moved
  from __flush_work() to its callee - start_flush_work(). This brings it
  closer to the wq counterpart and will allow testing the associated wq's
  flags which will be needed to support BH workqueues. This is not expected
  to cause any functional changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Tested-by: Allen Pais <allen.lkml@gmail.com>
2024-02-04 11:28:06 -10:00
Ricardo B. Marliere
d412ace111 workqueue: make wq_subsys const
Now that the driver core can properly handle constant struct bus_type,
move the wq_subsys variable to be a constant structure as well,
placing it into read-only memory which can not be modified at runtime.

Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Suggested-and-reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Ricardo B. Marliere <ricardo@marliere.net>
Signed-off-by: Tejun Heo <tj@kernel.org>
2024-02-04 11:23:25 -10:00
Tejun Heo
c70e1779b7 workqueue: Fix pwq->nr_in_flight corruption in try_to_grab_pending()
dd6c3c5441 ("workqueue: Move pwq_dec_nr_in_flight() to the end of work
item handling") relocated pwq_dec_nr_in_flight() after
set_work_pool_and_keep_pending(). However, the latter destroys information
contained in work->data that's needed by pwq_dec_nr_in_flight() including
the flush color. With flush color destroyed, flush_workqueue() can stall
easily when mixed with cancel_work*() usages.

This is easily triggered by running xfstests generic/001 test on xfs:

     INFO: task umount:6305 blocked for more than 122 seconds.
     ...
     task:umount          state:D stack:13008 pid:6305  tgid:6305  ppid:6301   flags:0x00004000
     Call Trace:
      <TASK>
      __schedule+0x2f6/0xa20
      schedule+0x36/0xb0
      schedule_timeout+0x20b/0x280
      wait_for_completion+0x8a/0x140
      __flush_workqueue+0x11a/0x3b0
      xfs_inodegc_flush+0x24/0xf0
      xfs_unmountfs+0x14/0x180
      xfs_fs_put_super+0x3d/0x90
      generic_shutdown_super+0x7c/0x160
      kill_block_super+0x1b/0x40
      xfs_kill_sb+0x12/0x30
      deactivate_locked_super+0x35/0x90
      deactivate_super+0x42/0x50
      cleanup_mnt+0x109/0x170
      __cleanup_mnt+0x12/0x20
      task_work_run+0x60/0x90
      syscall_exit_to_user_mode+0x146/0x150
      do_syscall_64+0x5d/0x110
      entry_SYSCALL_64_after_hwframe+0x6c/0x74

Fix it by stashing work_data before calling set_work_pool_and_keep_pending()
and using the stashed value for pwq_dec_nr_in_flight().

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Chandan Babu R <chandanbabu@kernel.org>
Link: http://lkml.kernel.org/r/87o7cxeehy.fsf@debian-BULLSEYE-live-builder-AMD64
Fixes: dd6c3c5441 ("workqueue: Move pwq_dec_nr_in_flight() to the end of work item handling")
2024-02-04 11:14:49 -10:00
Andrii Nakryiko
1eb986746a bpf: don't emit warnings intended for global subprogs for static subprogs
When btf_prepare_func_args() was generalized to handle both static and
global subprogs, a few warnings/errors that are meant only for global
subprog cases started to be emitted for static subprogs, where they are
sort of expected and irrelavant.

Stop polutting verifier logs with irrelevant scary-looking messages.

Fixes: e26080d0da ("bpf: prepare btf_prepare_func_args() for handling static subprogs")
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20240202190529.2374377-4-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-02-02 18:08:59 -08:00
Andrii Nakryiko
8f13c34087 bpf: handle trusted PTR_TO_BTF_ID_OR_NULL in argument check logic
Add PTR_TRUSTED | PTR_MAYBE_NULL modifiers for PTR_TO_BTF_ID to
check_reg_type() to support passing trusted nullable PTR_TO_BTF_ID
registers into global functions accepting `__arg_trusted __arg_nullable`
arguments. This hasn't been caught earlier because tests were either
passing known non-NULL PTR_TO_BTF_ID registers or known NULL (SCALAR)
registers.

When utilizing this functionality in complicated real-world BPF
application that passes around PTR_TO_BTF_ID_OR_NULL, it became apparent
that verifier rejects valid case because check_reg_type() doesn't handle
this case explicitly. Existing check_reg_type() logic is already
anticipating this combination, so we just need to explicitly list this
combo in the switch statement.

Fixes: e2b3c4ff5d ("bpf: add __arg_trusted global func arg tag")
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20240202190529.2374377-2-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-02-02 18:08:58 -08:00
Linus Torvalds
56897d5188 Tracing and eventfs fixes for v6.8:
- Fix the return code for ring_buffer_poll_wait()
   It was returing a -EINVAL instead of EPOLLERR.
 
 - Zero out the tracefs_inode so that all fields are initialized.
   The ti->private could have had stale data, but instead of
   just initializing it to NULL, clear out the entire structure
   when it is allocated.
 
 - Fix a crash in timerlat
   The hrtimer was initialized at read and not open, but is
   canceled at close. If the file was opened and never read
   the close will pass a NULL pointer to hrtime_cancel().
 
 - Rewrite of eventfs.
   Linus wrote a patch series to remove the dentry references in the
   eventfs_inode and to use ref counting and more of proper VFS
   interfaces to make it work.
 
 - Add warning to put_ei() if ei is not set to free. That means
   something is about to free it when it shouldn't.
 
 - Restructure the eventfs_inode to make it more compact, and remove
   the unused llist field.
 
 - Remove the fsnotify*() funtions for when the inodes were being created
   in the lookup code. It doesn't make sense to notify about creation
   just because something is being looked up.
 
 - The inode hard link count was not accurate. It was being updated
   when a file was looked up. The inodes of directories were updating
   their parent inode hard link count every time the inode was created.
   That means if memory reclaim cleaned a stale directory inode and
   the inode was lookup up again, it would increment the parent inode
   again as well. Al Viro said to just have all eventfs directories
   have a hard link count of 1. That tells user space not to trust it.
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCZb1l/RQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qk6jAQDmecDOnx+j/Rm5krbX/meVPYXFj2CU
 1wO7w1HBzopsBwEA5AjTKm9IGrl/eVG/+jViS165b+sJfwEcblHEFPWcIwo=
 =uUzb
 -----END PGP SIGNATURE-----

Merge tag 'trace-v6.8-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull tracing and eventfs fixes from Steven Rostedt:

 - Fix the return code for ring_buffer_poll_wait()

   It was returing a -EINVAL instead of EPOLLERR.

 - Zero out the tracefs_inode so that all fields are initialized.

   The ti->private could have had stale data, but instead of just
   initializing it to NULL, clear out the entire structure when it is
   allocated.

 - Fix a crash in timerlat

   The hrtimer was initialized at read and not open, but is canceled at
   close. If the file was opened and never read the close will pass a
   NULL pointer to hrtime_cancel().

 - Rewrite of eventfs.

   Linus wrote a patch series to remove the dentry references in the
   eventfs_inode and to use ref counting and more of proper VFS
   interfaces to make it work.

 - Add warning to put_ei() if ei is not set to free. That means
   something is about to free it when it shouldn't.

 - Restructure the eventfs_inode to make it more compact, and remove the
   unused llist field.

 - Remove the fsnotify*() funtions for when the inodes were being
   created in the lookup code. It doesn't make sense to notify about
   creation just because something is being looked up.

 - The inode hard link count was not accurate.

   It was being updated when a file was looked up. The inodes of
   directories were updating their parent inode hard link count every
   time the inode was created. That means if memory reclaim cleaned a
   stale directory inode and the inode was lookup up again, it would
   increment the parent inode again as well. Al Viro said to just have
   all eventfs directories have a hard link count of 1. That tells user
   space not to trust it.

* tag 'trace-v6.8-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  eventfs: Keep all directory links at 1
  eventfs: Remove fsnotify*() functions from lookup()
  eventfs: Restructure eventfs_inode structure to be more condensed
  eventfs: Warn if an eventfs_inode is freed without is_freed being set
  tracing/timerlat: Move hrtimer_init to timerlat_fd open()
  eventfs: Get rid of dentry pointers without refcounts
  eventfs: Clean up dentry ops and add revalidate function
  eventfs: Remove unused d_parent pointer field
  tracefs: dentry lookup crapectomy
  tracefs: Avoid using the ei->dentry pointer unnecessarily
  eventfs: Initialize the tracefs inode properly
  tracefs: Zero out the tracefs_inode when allocating it
  ring-buffer: Clean ring_buffer_poll_wait() error return
2024-02-02 15:32:58 -08:00
Eduard Zingerman
6efbde200b bpf: Handle scalar spill vs all MISC in stacksafe()
When check_stack_read_fixed_off() reads value from an spi
all stack slots of which are set to STACK_{MISC,INVALID},
the destination register is set to unbound SCALAR_VALUE.

Exploit this fact by allowing stacksafe() to use a fake
unbound scalar register to compare 'mmmm mmmm' stack value
in old state vs spilled 64-bit scalar in current state
and vice versa.

Veristat results after this patch show some gains:

./veristat -C -e file,prog,states -f 'states_pct>10'  not-opt after
File                     Program                States   (DIFF)
-----------------------  ---------------------  ---------------
bpf_overlay.o            tail_rev_nodeport_lb4    -45 (-15.85%)
bpf_xdp.o                tail_lb_ipv4            -541 (-19.57%)
pyperf100.bpf.o          on_event                -680 (-10.42%)
pyperf180.bpf.o          on_event               -2164 (-19.62%)
pyperf600.bpf.o          on_event               -9799 (-24.84%)
strobemeta.bpf.o         on_event               -9157 (-65.28%)
xdp_synproxy_kern.bpf.o  syncookie_tc             -54 (-19.29%)
xdp_synproxy_kern.bpf.o  syncookie_xdp            -74 (-24.50%)

Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20240127175237.526726-6-maxtram95@gmail.com
2024-02-02 13:22:14 -08:00
Maxim Mikityanskiy
c1e6148cb4 bpf: Preserve boundaries and track scalars on narrowing fill
When the width of a fill is smaller than the width of the preceding
spill, the information about scalar boundaries can still be preserved,
as long as it's coerced to the right width (done by coerce_reg_to_size).
Even further, if the actual value fits into the fill width, the ID can
be preserved as well for further tracking of equal scalars.

Implement the above improvements, which makes narrowing fills behave the
same as narrowing spills and MOVs between registers.

Two tests are adjusted to accommodate for endianness differences and to
take into account that it's now allowed to do a narrowing fill from the
least significant bits.

reg_bounds_sync is added to coerce_reg_to_size to correctly adjust
umin/umax boundaries after the var_off truncation, for example, a 64-bit
value 0xXXXXXXXX00000000, when read as a 32-bit, gets umin = 0, umax =
0xFFFFFFFF, var_off = (0x0; 0xffffffff00000000), which needs to be
synced down to umax = 0, otherwise reg_bounds_sanity_check doesn't pass.

Signed-off-by: Maxim Mikityanskiy <maxim@isovalent.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20240127175237.526726-4-maxtram95@gmail.com
2024-02-02 13:22:14 -08:00
Maxim Mikityanskiy
e67ddd9b1c bpf: Track spilled unbounded scalars
Support the pattern where an unbounded scalar is spilled to the stack,
then boundary checks are performed on the src register, after which the
stack frame slot is refilled into a register.

Before this commit, the verifier didn't treat the src register and the
stack slot as related if the src register was an unbounded scalar. The
register state wasn't copied, the id wasn't preserved, and the stack
slot was marked as STACK_MISC. Subsequent boundary checks on the src
register wouldn't result in updating the boundaries of the spilled
variable on the stack.

After this commit, the verifier will preserve the bond between src and
dst even if src is unbounded, which permits to do boundary checks on src
and refill dst later, still remembering its boundaries. Such a pattern
is sometimes generated by clang when compiling complex long functions.

One test is adjusted to reflect that now unbounded scalars are tracked.

Signed-off-by: Maxim Mikityanskiy <maxim@isovalent.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/bpf/20240127175237.526726-2-maxtram95@gmail.com
2024-02-02 13:22:14 -08:00
Oleg Nesterov
a1c6d5439f pid: kill the obsolete PIDTYPE_PID code in transfer_pid()
transfer_pid() must be never called with pid == PIDTYPE_PID,
new_leader->thread_pid should be changed by exchange_tids().

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Link: https://lore.kernel.org/r/20240202131255.GA26025@redhat.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-02-02 14:57:53 +01:00
Oleg Nesterov
43f0df54c9 pidfd_poll: report POLLHUP when pid_task() == NULL
Add another wake_up_all(wait_pidfd) into __change_pid() and change
pidfd_poll() to include EPOLLHUP if task == NULL.

This allows to wait until the target process/thread is reaped.

TODO: change do_notify_pidfd() to use the keyed wakeups.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Link: https://lore.kernel.org/r/20240202131226.GA26018@redhat.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-02-02 14:57:53 +01:00
Oleg Nesterov
64bef697d3 pidfd: implement PIDFD_THREAD flag for pidfd_open()
With this flag:

	- pidfd_open() doesn't require that the target task must be
	  a thread-group leader

	- pidfd_poll() succeeds when the task exits and becomes a
	  zombie (iow, passes exit_notify()), even if it is a leader
	  and thread-group is not empty.

	  This means that the behaviour of pidfd_poll(PIDFD_THREAD,
	  pid-of-group-leader) is not well defined if it races with
	  exec() from its sub-thread; pidfd_poll() can succeed or not
	  depending on whether pidfd_task_exited() is called before
	  or after exchange_tids().

	  Perhaps we can improve this behaviour later, pidfd_poll()
	  can probably take sig->group_exec_task into account. But
	  this doesn't really differ from the case when the leader
	  exits before other threads (so pidfd_poll() succeeds) and
	  then another thread execs and pidfd_poll() will block again.

thread_group_exited() is no longer used, perhaps it can die.

Co-developed-by: Tycho Andersen <tycho@tycho.pizza>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Link: https://lore.kernel.org/r/20240131132602.GA23641@redhat.com
Tested-by: Tycho Andersen <tandersen@netflix.com>
Reviewed-by: Tycho Andersen <tandersen@netflix.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-02-02 13:12:28 +01:00
Oleg Nesterov
21e25205d7 pidfd: don't do_notify_pidfd() if !thread_group_empty()
do_notify_pidfd() makes no sense until the whole thread group exits, change
do_notify_parent() to check thread_group_empty().

This avoids the unnecessary do_notify_pidfd() when tsk is not a leader, or
it exits before other threads, or it has a ptraced EXIT_ZOMBIE sub-thread.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Link: https://lore.kernel.org/r/20240127132407.GA29136@redhat.com
Reviewed-by: Tycho Andersen <tandersen@netflix.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-02-02 13:12:28 +01:00
Oleg Nesterov
cdefbf2324 pidfd: cleanup the usage of __pidfd_prepare's flags
- make pidfd_create() static.

- Don't pass O_RDWR | O_CLOEXEC to __pidfd_prepare() in copy_process(),
  __pidfd_prepare() adds these flags unconditionally.

- Kill the flags check in __pidfd_prepare(). sys_pidfd_open() checks the
  flags itself, all other users of pidfd_prepare() pass flags = 0.

  If we need a sanity check for those other in kernel users then
  WARN_ON_ONCE(flags & ~PIDFD_NONBLOCK) makes more sense.

- Don't pass O_RDWR to get_unused_fd_flags(), it ignores everything except
  O_CLOEXEC.

- Don't pass O_CLOEXEC to anon_inode_getfile(), it ignores everything except
  O_ACCMODE | O_NONBLOCK.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Link: https://lore.kernel.org/r/20240125161734.GA778@redhat.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-02-02 13:12:28 +01:00
Wang Jinchao
b639585e71 fork: Using clone_flags for legacy clone check
In the current implementation of clone(), there is a line that
initializes `u64 clone_flags = args->flags` at the top.
This means that there is no longer a need to use args->flags
for the legacy clone check.

Signed-off-by: Wang Jinchao <wangjinchao@xfusion.com>
Link: https://lore.kernel.org/r/202401311054+0800-wangjinchao@xfusion.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-02-02 13:12:28 +01:00
Jakub Kicinski
cf244463a2 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR.

No conflicts or adjacent changes.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-02-01 15:12:37 -08:00
Jingzi Meng
09ce61e27d cap_syslog: remove CAP_SYS_ADMIN when dmesg_restrict
CAP_SYSLOG was separated from CAP_SYS_ADMIN and introduced in Linux
2.6.37 (2010-11). For a long time, certain syslog actions required
CAP_SYS_ADMIN or CAP_SYSLOG. Maybe it’s time to officially remove
CAP_SYS_ADMIN for more fine-grained control.

CAP_SYS_ADMIN was once removed but added back for backwards
compatibility reasons. In commit 38ef4c2e43 ("syslog: check cap_syslog
when dmesg_restrict") (2010-12), CAP_SYS_ADMIN was no longer needed. And
in commit ee24aebffb ("cap_syslog: accept CAP_SYS_ADMIN for now")
(2011-02), it was accepted again. Since then, CAP_SYS_ADMIN has been
preserved.

Now that almost 13 years have passed, the legacy application may have
had enough time to be updated.

Signed-off-by: Jingzi Meng <mengjingzi@iie.ac.cn>
Reviewed-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20240105062007.26965-1-mengjingzi@iie.ac.cn
Signed-off-by: Kees Cook <keescook@chromium.org>
2024-02-01 10:04:58 -08:00
Matt Bobrowski
1581e5118e bpf: Minor clean-up to sleepable_lsm_hooks BTF set
There's already one main CONFIG_SECURITY_NETWORK ifdef block within
the sleepable_lsm_hooks BTF set. Consolidate this duplicated ifdef
block as there's no need for it and all things guarded by it should
remain in one place in this specific context.

Signed-off-by: Matt Bobrowski <mattbobrowski@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/Zbt1smz43GDMbVU3@google.com
2024-02-01 18:37:45 +01:00
Daniel Bristot de Oliveira
1389358bb0 tracing/timerlat: Move hrtimer_init to timerlat_fd open()
Currently, the timerlat's hrtimer is initialized at the first read of
timerlat_fd, and destroyed at close(). It works, but it causes an error
if the user program open() and close() the file without reading.

Here's an example:

 # echo NO_OSNOISE_WORKLOAD > /sys/kernel/debug/tracing/osnoise/options
 # echo timerlat > /sys/kernel/debug/tracing/current_tracer

 # cat <<EOF > ./timerlat_load.py
 # !/usr/bin/env python3

 timerlat_fd = open("/sys/kernel/tracing/osnoise/per_cpu/cpu0/timerlat_fd", 'r')
 timerlat_fd.close();
 EOF

 # ./taskset -c 0 ./timerlat_load.py
<BOOM>

 BUG: kernel NULL pointer dereference, address: 0000000000000010
 #PF: supervisor read access in kernel mode
 #PF: error_code(0x0000) - not-present page
 PGD 0 P4D 0
 Oops: 0000 [#1] PREEMPT SMP NOPTI
 CPU: 1 PID: 2673 Comm: python3 Not tainted 6.6.13-200.fc39.x86_64 #1
 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-1.fc39 04/01/2014
 RIP: 0010:hrtimer_active+0xd/0x50
 Code: 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa 0f 1f 44 00 00 48 8b 57 30 <8b> 42 10 a8 01 74 09 f3 90 8b 42 10 a8 01 75 f7 80 7f 38 00 75 1d
 RSP: 0018:ffffb031009b7e10 EFLAGS: 00010286
 RAX: 000000000002db00 RBX: ffff9118f786db08 RCX: 0000000000000000
 RDX: 0000000000000000 RSI: ffff9117a0e64400 RDI: ffff9118f786db08
 RBP: ffff9118f786db80 R08: ffff9117a0ddd420 R09: ffff9117804d4f70
 R10: 0000000000000000 R11: 0000000000000000 R12: ffff9118f786db08
 R13: ffff91178fdd5e20 R14: ffff9117840978c0 R15: 0000000000000000
 FS:  00007f2ffbab1740(0000) GS:ffff9118f7840000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 0000000000000010 CR3: 00000001b402e000 CR4: 0000000000750ee0
 PKRU: 55555554
 Call Trace:
  <TASK>
  ? __die+0x23/0x70
  ? page_fault_oops+0x171/0x4e0
  ? srso_alias_return_thunk+0x5/0x7f
  ? avc_has_extended_perms+0x237/0x520
  ? exc_page_fault+0x7f/0x180
  ? asm_exc_page_fault+0x26/0x30
  ? hrtimer_active+0xd/0x50
  hrtimer_cancel+0x15/0x40
  timerlat_fd_release+0x48/0xe0
  __fput+0xf5/0x290
  __x64_sys_close+0x3d/0x80
  do_syscall_64+0x60/0x90
  ? srso_alias_return_thunk+0x5/0x7f
  ? __x64_sys_ioctl+0x72/0xd0
  ? srso_alias_return_thunk+0x5/0x7f
  ? syscall_exit_to_user_mode+0x2b/0x40
  ? srso_alias_return_thunk+0x5/0x7f
  ? do_syscall_64+0x6c/0x90
  ? srso_alias_return_thunk+0x5/0x7f
  ? exit_to_user_mode_prepare+0x142/0x1f0
  ? srso_alias_return_thunk+0x5/0x7f
  ? syscall_exit_to_user_mode+0x2b/0x40
  ? srso_alias_return_thunk+0x5/0x7f
  ? do_syscall_64+0x6c/0x90
  entry_SYSCALL_64_after_hwframe+0x6e/0xd8
 RIP: 0033:0x7f2ffb321594
 Code: 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 80 3d d5 cd 0d 00 00 74 13 b8 03 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 3c c3 0f 1f 00 55 48 89 e5 48 83 ec 10 89 7d
 RSP: 002b:00007ffe8d8eef18 EFLAGS: 00000202 ORIG_RAX: 0000000000000003
 RAX: ffffffffffffffda RBX: 00007f2ffba4e668 RCX: 00007f2ffb321594
 RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000003
 RBP: 00007ffe8d8eef40 R08: 0000000000000000 R09: 0000000000000000
 R10: 55c926e3167eae79 R11: 0000000000000202 R12: 0000000000000003
 R13: 00007ffe8d8ef030 R14: 0000000000000000 R15: 00007f2ffba4e668
  </TASK>
 CR2: 0000000000000010
 ---[ end trace 0000000000000000 ]---

Move hrtimer_init to timerlat_fd open() to avoid this problem.

Link: https://lore.kernel.org/linux-trace-kernel/7324dd3fc0035658c99b825204a66049389c56e3.1706798888.git.bristot@kernel.org

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: stable@vger.kernel.org
Fixes: e88ed227f6 ("tracing/timerlat: Add user-space interface")
Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-02-01 11:50:13 -05:00
Daniel Xu
6f3189f38a bpf: treewide: Annotate BPF kfuncs in BTF
This commit marks kfuncs as such inside the .BTF_ids section. The upshot
of these annotations is that we'll be able to automatically generate
kfunc prototypes for downstream users. The process is as follows:

1. In source, use BTF_KFUNCS_START/END macro pair to mark kfuncs
2. During build, pahole injects into BTF a "bpf_kfunc" BTF_DECL_TAG for
   each function inside BTF_KFUNCS sets
3. At runtime, vmlinux or module BTF is made available in sysfs
4. At runtime, bpftool (or similar) can look at provided BTF and
   generate appropriate prototypes for functions with "bpf_kfunc" tag

To ensure future kfunc are similarly tagged, we now also return error
inside kfunc registration for untagged kfuncs. For vmlinux kfuncs,
we also WARN(), as initcall machinery does not handle errors.

Signed-off-by: Daniel Xu <dxu@dxuuu.xyz>
Acked-by: Benjamin Tissoires <bentiss@kernel.org>
Link: https://lore.kernel.org/r/e55150ceecbf0a5d961e608941165c0bee7bc943.1706491398.git.dxu@dxuuu.xyz
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-01-31 20:40:56 -08:00
Vincent Donnefort
66bbea9ed6 ring-buffer: Clean ring_buffer_poll_wait() error return
The return type for ring_buffer_poll_wait() is __poll_t. This is behind
the scenes an unsigned where we can set event bits. In case of a
non-allocated CPU, we do return instead -EINVAL (0xffffffea). Lucky us,
this ends up setting few error bits (EPOLLERR | EPOLLHUP | EPOLLNVAL), so
user-space at least is aware something went wrong.

Nonetheless, this is an incorrect code. Replace that -EINVAL with a
proper EPOLLERR to clean that output. As this doesn't change the
behaviour, there's no need to treat this change as a bug fix.

Link: https://lore.kernel.org/linux-trace-kernel/20240131140955.3322792-1-vdonnefort@google.com

Cc: stable@vger.kernel.org
Fixes: 6721cb6002 ("ring-buffer: Do not poll non allocated cpu buffers")
Signed-off-by: Vincent Donnefort <vdonnefort@google.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-01-31 14:10:24 -05:00
Tejun Heo
c5f8cd6c62 workqueue: Avoid premature init of wq->node_nr_active[].max
System workqueues are allocated early during boot from
workqueue_init_early(). While allocating unbound workqueues,
wq_update_node_max_active() is invoked from apply_workqueue_attrs() and
accesses NUMA topology to initialize wq->node_nr_active[].max.

However, topology information may not be set up at this point.
wq_update_node_max_active() is explicitly invoked from
workqueue_init_topology() later when topology information is known to be
available.

This doesn't seem to crash anything but it's doing useless work with dubious
data. Let's skip the premature and duplicate node_max_active updates by
initializing the field to WQ_DFL_MIN_ACTIVE on allocation and making
wq_update_node_max_active() noop until workqueue_init_topology().

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 kernel/workqueue.c |    8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 9221a4c57ae1..a65081ec6780 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -386,6 +386,8 @@ static const char *wq_affn_names[WQ_AFFN_NR_TYPES] = {
 	[WQ_AFFN_SYSTEM]		= "system",
 };

+static bool wq_topo_initialized = false;
+
 /*
  * Per-cpu work items which run for longer than the following threshold are
  * automatically considered CPU intensive and excluded from concurrency
@@ -1510,6 +1512,9 @@ static void wq_update_node_max_active(struct workqueue_struct *wq, int off_cpu)

 	lockdep_assert_held(&wq->mutex);

+	if (!wq_topo_initialized)
+		return;
+
 	if (!cpumask_test_cpu(off_cpu, effective))
 		off_cpu = -1;

@@ -4356,6 +4361,7 @@ static void free_node_nr_active(struct wq_node_nr_active **nna_ar)

 static void init_node_nr_active(struct wq_node_nr_active *nna)
 {
+	nna->max = WQ_DFL_MIN_ACTIVE;
 	atomic_set(&nna->nr, 0);
 	raw_spin_lock_init(&nna->lock);
 	INIT_LIST_HEAD(&nna->pending_pwqs);
@@ -7400,6 +7406,8 @@ void __init workqueue_init_topology(void)
 	init_pod_type(&wq_pod_types[WQ_AFFN_CACHE], cpus_share_cache);
 	init_pod_type(&wq_pod_types[WQ_AFFN_NUMA], cpus_share_numa);

+	wq_topo_initialized = true;
+
 	mutex_lock(&wq_pool_mutex);

 	/*
2024-01-30 19:17:00 -10:00
Tejun Heo
15930da42f workqueue: Don't call cpumask_test_cpu() with -1 CPU in wq_update_node_max_active()
For wq_update_node_max_active(), @off_cpu of -1 indicates that no CPU is
going down. The function was incorrectly calling cpumask_test_cpu() with -1
CPU leading to oopses like the following on some archs:

  Unable to handle kernel paging request at virtual address ffff0002100296e0
  ..
  pc : wq_update_node_max_active+0x50/0x1fc
  lr : wq_update_node_max_active+0x1f0/0x1fc
  ...
  Call trace:
    wq_update_node_max_active+0x50/0x1fc
    apply_wqattrs_commit+0xf0/0x114
    apply_workqueue_attrs_locked+0x58/0xa0
    alloc_workqueue+0x5ac/0x774
    workqueue_init_early+0x460/0x540
    start_kernel+0x258/0x684
    __primary_switched+0xb8/0xc0
  Code: 9100a273 35000d01 53067f00 d0016dc1 (f8607a60)
  ---[ end trace 0000000000000000 ]---
  Kernel panic - not syncing: Attempted to kill the idle task!
  ---[ end Kernel panic - not syncing: Attempted to kill the idle task! ]---

Fix it.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Marek Szyprowski <m.szyprowski@samsung.com>
Reported-by: Nathan Chancellor <nathan@kernel.org>
Tested-by: Nathan Chancellor <nathan@kernel.org>
Link: http://lkml.kernel.org/r/91eacde0-df99-4d5c-a980-91046f66e612@samsung.com
Fixes: 5797b1c189 ("workqueue: Implement system-wide nr_active enforcement for unbound workqueues")
2024-01-30 18:55:55 -10:00
Andrii Nakryiko
8f2b44cd9d bpf: add arg:nullable tag to be combined with trusted pointers
Add ability to mark arg:trusted arguments with optional arg:nullable
tag to mark it as PTR_TO_BTF_ID_OR_NULL variant, which will allow
callers to pass NULL, and subsequently will force global subprog's code
to do NULL check. This allows to have "optional" PTR_TO_BTF_ID values
passed into global subprogs.

For now arg:nullable cannot be combined with anything else.

Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20240130000648.2144827-3-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-01-30 09:41:50 -08:00
Andrii Nakryiko
e2b3c4ff5d bpf: add __arg_trusted global func arg tag
Add support for passing PTR_TO_BTF_ID registers to global subprogs.
Currently only PTR_TRUSTED flavor of PTR_TO_BTF_ID is supported.
Non-NULL semantics is assumed, so caller will be forced to prove
PTR_TO_BTF_ID can't be NULL.

Note, we disallow global subprogs to destroy passed in PTR_TO_BTF_ID
arguments, even the trusted one. We achieve that by not setting
ref_obj_id when validating subprog code. This basically enforces (in
Rust terms) borrowing semantics vs move semantics. Borrowing semantics
seems to be a better fit for isolated global subprog validation
approach.

Implementation-wise, we utilize existing logic for matching
user-provided BTF type to kernel-side BTF type, used by BPF CO-RE logic
and following same matching rules. We enforce a unique match for types.

Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20240130000648.2144827-2-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-01-30 09:41:50 -08:00
Haiyue Wang
6668e818f9 bpf,token: Use BIT_ULL() to convert the bit mask
Replace the '(1ULL << *)' with the macro BIT_ULL(nr).

Signed-off-by: Haiyue Wang <haiyue.wang@intel.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20240127134901.3698613-1-haiyue.wang@intel.com
2024-01-29 20:04:55 -08:00
Linus Torvalds
9b7bd05beb tracing: Two small fixes for tracefs and eventfs:
- Fix register_snapshot_trigger() on allocation error
   If the snapshot fails to allocate, the register_snapshot_trigger() can
   still return success. If the call to tracing_alloc_snapshot_instance()
   returned anything but 0, it returned 0, but it should have been returning
   the error code from that allocation function.
 
 - Remove leftover code from tracefs doing a dentry walk on remount.
   The update_gid() function was called by the tracefs code on remount
   to update the gid of eventfs, but that is no longer the case, but that
   code wasn't deleted. Nothing calls it. Remove it.
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCZbcbZBQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qqEsAQCWTIQ/yAktP2doLvuccZbZSc5NFHZt
 hZNRHep/SMQqKwD6AlesgiXlXM39HiRTiV17jPzrL65brcTEYMJmqCaWugk=
 =8nia
 -----END PGP SIGNATURE-----

Merge tag 'trace-v6.8-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull tracing fixes from Steven Rostedt:
 "Two small fixes for tracefs and eventfs:

   - Fix register_snapshot_trigger() on allocation error

     If the snapshot fails to allocate, the register_snapshot_trigger()
     can still return success. If the call to
     tracing_alloc_snapshot_instance() returned anything but 0, it
     returned 0, but it should have been returning the error code from
     that allocation function.

   - Remove leftover code from tracefs doing a dentry walk on remount.

     The update_gid() function was called by the tracefs code on remount
     to update the gid of eventfs, but that is no longer the case, but
     that code wasn't deleted. Nothing calls it. Remove it"

* tag 'trace-v6.8-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  tracefs: remove stale 'update_gid' code
  tracing/trigger: Fix to return error if failed to alloc snapshot
2024-01-29 18:42:34 -08:00
Leonardo Bras
aae17ebb53 workqueue: Avoid using isolated cpus' timers on queue_delayed_work
When __queue_delayed_work() is called, it chooses a cpu for handling the
timer interrupt. As of today, it will pick either the cpu passed as
parameter or the last cpu used for this.

This is not good if a system does use CPU isolation, because it can take
away some valuable cpu time to:
1 - deal with the timer interrupt,
2 - schedule-out the desired task,
3 - queue work on a random workqueue, and
4 - schedule the desired task back to the cpu.

So to fix this, during __queue_delayed_work(), if cpu isolation is in
place, pick a random non-isolated cpu to handle the timer interrupt.

As an optimization, if the current cpu is not isolated, use it instead
of looking for another candidate.

Signed-off-by: Leonardo Bras <leobras@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2024-01-29 15:21:37 -10:00
Linus Torvalds
6f3d7d5ced 22 hotfixes. 11 are cc:stable and the remainder address post-6.7 issues
or aren't considered appropriate for backporting.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZbdSnwAKCRDdBJ7gKXxA
 jv49AQCY8eLOgE0L+25HZm99HleBwapbKJozcmsXgMPlgeFZHgEA8saExeL+Nzae
 6ktxmGXoVw2t3FJ67Zr66VE3EyHVKAY=
 =HWuo
 -----END PGP SIGNATURE-----

Merge tag 'mm-hotfixes-stable-2024-01-28-23-21' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull misc fixes from Andrew Morton:
 "22 hotfixes. 11 are cc:stable and the remainder address post-6.7
  issues or aren't considered appropriate for backporting"

* tag 'mm-hotfixes-stable-2024-01-28-23-21' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (22 commits)
  mm: thp_get_unmapped_area must honour topdown preference
  mm: huge_memory: don't force huge page alignment on 32 bit
  userfaultfd: fix mmap_changing checking in mfill_atomic_hugetlb
  selftests/mm: ksm_tests should only MADV_HUGEPAGE valid memory
  scs: add CONFIG_MMU dependency for vfree_atomic()
  mm/memory: fix folio_set_dirty() vs. folio_mark_dirty() in zap_pte_range()
  mm/huge_memory: fix folio_set_dirty() vs. folio_mark_dirty()
  selftests/mm: Update va_high_addr_switch.sh to check CPU for la57 flag
  selftests: mm: fix map_hugetlb failure on 64K page size systems
  MAINTAINERS: supplement of zswap maintainers update
  stackdepot: make fast paths lock-less again
  stackdepot: add stats counters exported via debugfs
  mm, kmsan: fix infinite recursion due to RCU critical section
  mm/writeback: fix possible divide-by-zero in wb_dirty_limits(), again
  selftests/mm: switch to bash from sh
  MAINTAINERS: add man-pages git trees
  mm: memcontrol: don't throttle dying tasks on memory.high
  mm: mmap: map MAP_STACK to VM_NOHUGEPAGE
  uprobes: use pagesize-aligned virtual address when replacing pages
  selftests/mm: mremap_test: fix build warning
  ...
2024-01-29 17:12:16 -08:00
Florian Lehner
aecaa3ed48 perf/bpf: Fix duplicate type check
Remove the duplicate check on type and unify result.

Signed-off-by: Florian Lehner <dev@der-flo.net>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/bpf/20240120150920.3370-1-dev@der-flo.net
2024-01-29 22:40:37 +01:00
Andrii Nakryiko
add9c58cd4 bpf: move arg:ctx type enforcement check inside the main logic loop
Now that bpf and bpf-next trees converged and we don't run the risk of
merge conflicts, move btf_validate_prog_ctx_type() into its most logical
place inside the main logic loop.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20240125205510.3642094-4-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-01-29 12:34:13 -08:00
Tejun Heo
5797b1c189 workqueue: Implement system-wide nr_active enforcement for unbound workqueues
A pool_workqueue (pwq) represents the connection between a workqueue and a
worker_pool. One of the roles that a pwq plays is enforcement of the
max_active concurrency limit. Before 636b927eba ("workqueue: Make unbound
workqueues to use per-cpu pool_workqueues"), there was one pwq per each CPU
for per-cpu workqueues and per each NUMA node for unbound workqueues, which
was a natural result of per-cpu workqueues being served by per-cpu pools and
unbound by per-NUMA pools.

In terms of max_active enforcement, this was, while not perfect, workable.
For per-cpu workqueues, it was fine. For unbound, it wasn't great in that
NUMA machines would get max_active that's multiplied by the number of nodes
but didn't cause huge problems because NUMA machines are relatively rare and
the node count is usually pretty low.

However, cache layouts are more complex now and sharing a worker pool across
a whole node didn't really work well for unbound workqueues. Thus, a series
of commits culminating on 8639ecebc9 ("workqueue: Make unbound workqueues
to use per-cpu pool_workqueues") implemented more flexible affinity
mechanism for unbound workqueues which enables using e.g. last-level-cache
aligned pools. In the process, 636b927eba ("workqueue: Make unbound
workqueues to use per-cpu pool_workqueues") made unbound workqueues use
per-cpu pwqs like per-cpu workqueues.

While the change was necessary to enable more flexible affinity scopes, this
came with the side effect of blowing up the effective max_active for unbound
workqueues. Before, the effective max_active for unbound workqueues was
multiplied by the number of nodes. After, by the number of CPUs.

636b927eba ("workqueue: Make unbound workqueues to use per-cpu
pool_workqueues") claims that this should generally be okay. It is okay for
users which self-regulates concurrency level which are the vast majority;
however, there are enough use cases which actually depend on max_active to
prevent the level of concurrency from going bonkers including several IO
handling workqueues that can issue a work item for each in-flight IO. With
targeted benchmarks, the misbehavior can easily be exposed as reported in
http://lkml.kernel.org/r/dbu6wiwu3sdhmhikb2w6lns7b27gbobfavhjj57kwi2quafgwl@htjcc5oikcr3.

Unfortunately, there is no way to express what these use cases need using
per-cpu max_active. A CPU may issue most of in-flight IOs, so we don't want
to set max_active too low but as soon as we increase max_active a bit, we
can end up with unreasonable number of in-flight work items when many CPUs
issue IOs at the same time. ie. The acceptable lowest max_active is higher
than the acceptable highest max_active.

Ideally, max_active for an unbound workqueue should be system-wide so that
the users can regulate the total level of concurrency regardless of node and
cache layout. The reasons workqueue hasn't implemented that yet are:

- One max_active enforcement decouples from pool boundaires, chaining
  execution after a work item finishes requires inter-pool operations which
  would require lock dancing, which is nasty.

- Sharing a single nr_active count across the whole system can be pretty
  expensive on NUMA machines.

- Per-pwq enforcement had been more or less okay while we were using
  per-node pools.

It looks like we no longer can avoid decoupling max_active enforcement from
pool boundaries. This patch implements system-wide nr_active mechanism with
the following design characteristics:

- To avoid sharing a single counter across multiple nodes, the configured
  max_active is split across nodes according to the proportion of each
  workqueue's online effective CPUs per node. e.g. A node with twice more
  online effective CPUs will get twice higher portion of max_active.

- Workqueue used to be able to process a chain of interdependent work items
  which is as long as max_active. We can't do this anymore as max_active is
  distributed across the nodes. Instead, a new parameter min_active is
  introduced which determines the minimum level of concurrency within a node
  regardless of how max_active distribution comes out to be.

  It is set to the smaller of max_active and WQ_DFL_MIN_ACTIVE which is 8.
  This can lead to higher effective max_weight than configured and also
  deadlocks if a workqueue was depending on being able to handle chains of
  interdependent work items that are longer than 8.

  I believe these should be fine given that the number of CPUs in each NUMA
  node is usually higher than 8 and work item chain longer than 8 is pretty
  unlikely. However, if these assumptions turn out to be wrong, we'll need
  to add an interface to adjust min_active.

- Each unbound wq has an array of struct wq_node_nr_active which tracks
  per-node nr_active. When its pwq wants to run a work item, it has to
  obtain the matching node's nr_active. If over the node's max_active, the
  pwq is queued on wq_node_nr_active->pending_pwqs. As work items finish,
  the completion path round-robins the pending pwqs activating the first
  inactive work item of each, which involves some pool lock dancing and
  kicking other pools. It's not the simplest code but doesn't look too bad.

v4: - wq_adjust_max_active() updated to invoke wq_update_node_max_active().

    - wq_adjust_max_active() is now protected by wq->mutex instead of
      wq_pool_mutex.

v3: - wq_node_max_active() used to calculate per-node max_active on the fly
      based on system-wide CPU online states. Lai pointed out that this can
      lead to skewed distributions for workqueues with restricted cpumasks.
      Update the max_active distribution to use per-workqueue effective
      online CPU counts instead of system-wide and cache the calculation
      results in node_nr_active->max.

v2: - wq->min/max_active now uses WRITE/READ_ONCE() as suggested by Lai.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Naohiro Aota <Naohiro.Aota@wdc.com>
Link: http://lkml.kernel.org/r/dbu6wiwu3sdhmhikb2w6lns7b27gbobfavhjj57kwi2quafgwl@htjcc5oikcr3
Fixes: 636b927eba ("workqueue: Make unbound workqueues to use per-cpu pool_workqueues")
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
2024-01-29 08:11:25 -10:00
Tejun Heo
91ccc6e723 workqueue: Introduce struct wq_node_nr_active
Currently, for both percpu and unbound workqueues, max_active applies
per-cpu, which is a recent change for unbound workqueues. The change for
unbound workqueues was a significant departure from the previous behavior of
per-node application. It made some use cases create undesirable number of
concurrent work items and left no good way of fixing them. To address the
problem, workqueue is implementing a NUMA node segmented global nr_active
mechanism, which will be explained further in the next patch.

As a preparation, this patch introduces struct wq_node_nr_active. It's a
data structured allocated for each workqueue and NUMA node pair and
currently only tracks the workqueue's number of active work items on the
node. This is split out from the next patch to make it easier to understand
and review.

Note that there is an extra wq_node_nr_active allocated for the invalid node
nr_node_ids which is used to track nr_active for pools which don't have NUMA
node associated such as the default fallback system-wide pool.

This doesn't cause any behavior changes visible to userland yet. The next
patch will expand to implement the control mechanism on top.

v4: - Fixed out-of-bound access when freeing per-cpu workqueues.

v3: - Use flexible array for wq->node_nr_active as suggested by Lai.

v2: - wq->max_active now uses WRITE/READ_ONCE() as suggested by Lai.

    - Lai pointed out that pwq_tryinc_nr_active() incorrectly dropped
      pwq->max_active check. Restored. As the next patch replaces the
      max_active enforcement mechanism, this doesn't change the end result.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
2024-01-29 08:11:24 -10:00
Tejun Heo
dd6c3c5441 workqueue: Move pwq_dec_nr_in_flight() to the end of work item handling
The planned shared nr_active handling for unbound workqueues will make
pwq_dec_nr_active() sometimes drop the pool lock temporarily to acquire
other pool locks, which is necessary as retirement of an nr_active count
from one pool may need kick off an inactive work item in another pool.

This patch moves pwq_dec_nr_in_flight() call in try_to_grab_pending() to the
end of work item handling so that work item state changes stay atomic.
process_one_work() which is the other user of pwq_dec_nr_in_flight() already
calls it at the end of work item handling. Comments are added to both call
sites and pwq_dec_nr_in_flight().

This shouldn't cause any behavior changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
2024-01-29 08:11:24 -10:00
Tejun Heo
9f66cff212 workqueue: RCU protect wq->dfl_pwq and implement accessors for it
wq->cpu_pwq is RCU protected but wq->dfl_pwq isn't. This is okay because
currently wq->dfl_pwq is used only accessed to install it into wq->cpu_pwq
which doesn't require RCU access. However, we want to be able to access
wq->dfl_pwq under RCU in the future to access its __pod_cpumask and the code
can be made easier to read by making the two pwq fields behave in the same
way.

- Make wq->dfl_pwq RCU protected.

- Add unbound_pwq_slot() and unbound_pwq() which can access both ->dfl_pwq
  and ->cpu_pwq. The former returns the double pointer that can be used
  access and update the pwqs. The latter performs locking check and
  dereferences the double pointer.

- pwq accesses and updates are converted to use unbound_pwq[_slot]().

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
2024-01-29 08:11:24 -10:00
Tejun Heo
c5404d4e6d workqueue: Make wq_adjust_max_active() round-robin pwqs while activating
wq_adjust_max_active() needs to activate work items after max_active is
increased. Previously, it did that by visiting each pwq once activating all
that could be activated. While this makes sense with per-pwq nr_active,
nr_active will be shared across multiple pwqs for unbound wqs. Then, we'd
want to round-robin through pwqs to be fairer.

In preparation, this patch makes wq_adjust_max_active() round-robin pwqs
while activating. While the activation ordering changes, this shouldn't
cause user-noticeable behavior changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
2024-01-29 08:11:24 -10:00
Tejun Heo
1c270b79ce workqueue: Move nr_active handling into helpers
__queue_work(), pwq_dec_nr_in_flight() and wq_adjust_max_active() were
open-coding nr_active handling, which is fine given that the operations are
trivial. However, the planned unbound nr_active update will make them more
complicated, so let's move them into helpers.

- pwq_tryinc_nr_active() is added. It increments nr_active if under
  max_active limit and return a boolean indicating whether inc was
  successful. Note that the function is structured to accommodate future
  changes. __queue_work() is updated to use the new helper.

- pwq_activate_first_inactive() is updated to use pwq_tryinc_nr_active() and
  thus no longer assumes that nr_active is under max_active and returns a
  boolean to indicate whether a work item has been activated.

- wq_adjust_max_active() no longer tests directly whether a work item can be
  activated. Instead, it's updated to use the return value of
  pwq_activate_first_inactive() to tell whether a work item has been
  activated.

- nr_active decrement and activating the first inactive work item is
  factored into pwq_dec_nr_active().

v3: - WARN_ON_ONCE(!WORK_STRUCT_INACTIVE) added to __pwq_activate_work() as
      now we're calling the function unconditionally from
      pwq_activate_first_inactive().

v2: - wq->max_active now uses WRITE/READ_ONCE() as suggested by Lai.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
2024-01-29 08:11:24 -10:00
Tejun Heo
4c6380305d workqueue: Replace pwq_activate_inactive_work() with [__]pwq_activate_work()
To prepare for unbound nr_active handling improvements, move work activation
part of pwq_activate_inactive_work() into __pwq_activate_work() and add
pwq_activate_work() which tests WORK_STRUCT_INACTIVE and updates nr_active.

pwq_activate_first_inactive() and try_to_grab_pending() are updated to use
pwq_activate_work(). The latter conversion is functionally identical. For
the former, this conversion adds an unnecessary WORK_STRUCT_INACTIVE
testing. This is temporary and will be removed by the next patch.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
2024-01-29 08:11:24 -10:00
Tejun Heo
afa87ce853 workqueue: Factor out pwq_is_empty()
"!pwq->nr_active && list_empty(&pwq->inactive_works)" test is repeated
multiple times. Let's factor it out into pwq_is_empty().

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
2024-01-29 08:11:24 -10:00
Tejun Heo
a045a272d8 workqueue: Move pwq->max_active to wq->max_active
max_active is a workqueue-wide setting and the configured value is stored in
wq->saved_max_active; however, the effective value was stored in
pwq->max_active. While this is harmless, it makes max_active update process
more complicated and gets in the way of the planned max_active semantic
updates for unbound workqueues.

This patches moves pwq->max_active to wq->max_active. This simplifies the
code and makes freezing and noop max_active updates cheaper too. No
user-visible behavior change is intended.

As wq->max_active is updated while holding wq mutex but read without any
locking, it now uses WRITE/READ_ONCE(). A new locking locking rule WO is
added for it.

v2: wq->max_active now uses WRITE/READ_ONCE() as suggested by Lai.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
2024-01-29 08:11:24 -10:00
Bartosz Golaszewski
aafd753555 genirq/irq_sim: Shrink code by using <linux/cleanup.h> helpers
Use the new __free() mechanism to remove all gotos and simplify the error
paths.

Signed-off-by: Bartosz Golaszewski <bartosz.golaszewski@linaro.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Nathan Chancellor <nathan@kernel.org>
Link: https://lore.kernel.org/r/20240122124243.44002-5-brgl@bgdev.pl
2024-01-29 11:07:57 +01:00
Linus Torvalds
648f575d5e - Prevent an inconsistent futex operation leading to stale state
exposure
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmW2LAgACgkQEsHwGGHe
 VUqowBAAiW9aPQmp401DSXLX+bX0oS5IQVEZnAEE3hQTWxdvDoIdmX+SBReSutXy
 PDm8mZgVtIiUg3V5bu7/9Dgpu7ovRuChJPjjkYFUDcEmzmsMI11W6u8+8eyt8yRd
 X9LuGUeXPJSI1kadYudhFUhl6X6KcXj4Y+XUqNcyp8yClSEcLriYeiumNApSEzj6
 BneO5VBbXTpJq1b7GOlC4MNhNXhx+WlUdJUb3VPLlxy/akxrNs9x0ASdOuqslCq8
 X9SJPnKeRh0mpezmWDgU72eQ/3vpvWQzwyXvp2pQGbjArCx7IwwD765NDu0P6651
 C/+4ruXmcd+Jp3wuobdHG8/J2NlZQy8tZQm284YkS5vyBQDi4s17hycXw/aeUFpu
 /3LR1Hppl//u7hkaHszE8vE5l6in4a2XAbk9EozChVj/aHRJqIaLn8TGQRquK4Tg
 uRjIC3O2ubJCsIlNIczysjCobSCO+cELwUuFVHh7cdmQAgUwF3efDab0+pJ7MHFb
 ZEcqQbIt4FGea4BGzvRYCYj6W9bkhzttnH+68ef+mDA3BcdGoYnHcQ143M8duNhe
 0inWCibQXMFC9EGPjC8Sz8WvzF/L5KL9bPQmO1sitIzH6kbU3o7PBk2Fe4V6+KP9
 THK865SJ/9QirjXrGmp9Sle6dqJRUylmt1ts8reOWACZ98LKeWU=
 =ibuM
 -----END PGP SIGNATURE-----

Merge tag 'locking_urgent_for_v6.8_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull locking fix from Borislav Petkov:

 - Prevent an inconsistent futex operation leading to stale state
   exposure

* tag 'locking_urgent_for_v6.8_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  futex: Prevent the reuse of stale pi_state
2024-01-28 10:38:16 -08:00
Linus Torvalds
0e4363ac1a - Initialize the resend node of each IRQ descriptor, not only the first one
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmW2KrEACgkQEsHwGGHe
 VUr1WA//Qsi2JkxO1lyUQgnyuXqs0+oVZJxFyH2dFYzWkfSaxgsyPZ0H+wsweDfP
 OgoNzwwDf1IaNbVz2voV6lSM/30ujJMx4aAucT5WTEXa12cJsvipxRiNd8WU8GqQ
 buBz+vnS9IJ2WfM7UxhIVevYFU8H/ERcSO9WCII0YjlcVxmlwMK3B7kFdpBPdT5Q
 m5hvBorZzIa9wD3TI7e+VvEVbCx0WjsYYEpXDXM/yCf1Juc9952pjjunzx3YmJES
 5JG2WpnEvmNdWwIPO0NAjs7Shw/MNViXTy5Ls5jcbswiAcBoUxNHQlUsvNVDaVyv
 8eMCkPzuSipY8HSoetQSTJl+mr3LyYRvevKahuTgwbS8K+kxgClqHFoLZVqolWnk
 2IDo63R6Ex6lb1Xpb/Rpg/4j4NqUVWcvPHf6Z2CmMRq/XbSk2DIFl1Wxjgy/Cjnu
 +nNLw2FYayEBrKF3VlYgERGoCfBrEsksxzljjeHFn5XWr+G2x1ykF37xaWjQ2+oV
 sFl6UYwIsdqPCjHmpT6R1lwCdeEC3o3Zc2Kf5uEVj+pXacKJkxZU0L6ZneO8UiEc
 rtc0gTgm9ZNd8oDsjsaBU1A3KxH9lOfVz82ZV0tipz94dcN4zrB9Qag0Yw+64YOC
 cQ9cRKiFiCVCeD1ksDLZe1IUX++T2Y9O8MZv06ZDrbaFC56+lU8=
 =IgON
 -----END PGP SIGNATURE-----

Merge tag 'irq_urgent_for_v6.8_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull irq fix from Borislav Petkov:

 - Initialize the resend node of each IRQ descriptor, not only the first
   one

* tag 'irq_urgent_for_v6.8_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  genirq: Initialize resend_node hlist for all interrupt descriptors
2024-01-28 10:34:55 -08:00
Linus Torvalds
90db544eba - Preserve the number of idle calls and sleep entries across CPU hotplug
events in order to be able to compute correct averages
 
 - Limit the duration of the clocksource watchdog checking interval as
   too long intervals lead to wrongly marking the TSC as unstable
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmW2KYEACgkQEsHwGGHe
 VUoX7w//Ulls1tp3m1oiejTBtUmkewSmnhNAfkHJv3MlKNe+ttG6LvQVh9g2bf1Z
 FOu2M2se0ge8G5xf3+I5E4rpqlJZSuhPmNmIET+aj+2a61UJq/6zE1Zw6mxjrJOK
 emjOYKTxZ/HxvKJJGO6NiH8Iv5Aj3nQR3Y6oyb/FyP5TLJ6MCT21iEaqyqU7P+Ix
 AHIS3cL97M5R/tFtP2CY3PV2M6hJ0lqapSi9t75hT8DfJN1TNQ5SvFkKgmOIrGFw
 2WxPTSTEZAnXlvI4cC3Nru9i64QQRw9S05FFelX2pwxE/7wVzBvfh8cjuGZJBve/
 KQhNnQ4/fzv6E/hUcavKuOyk1lx5XonfCuG4RFoLl67LjLbLh+Q55RBdXflBPF4T
 Ow9BSyQNFu391C2Bl5gJUYVd2JMv+IVpi2wUiwrXJ/Mxj+A2J7Fj0jz7hMbNCmsU
 EaA+QyfkAGsoa99xP3UDhPzxoCr2s5YTAxH+IUeSWeI25PMq9f+6fifXBwG+GaVa
 FS6Ei1VI0GCNmcYFYawHJbdM2ui5h7lZ96aEpOBSVcAv/2yBNgqxuYZ+icO/wI6N
 JM0DSEEOrWcytfxftl7LmglJauhXSKZH4UTG4RCz0IkDgR72Wn0QF4cqm4wWQ5yh
 n5/xO+SbkzE57bltsnAkvpu0a110fdK5ec+vkFIy4PrxyO83XhY=
 =9RXx
 -----END PGP SIGNATURE-----

Merge tag 'timers_urgent_for_v6.8_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull timer fixes from Borislav Petkov:

 - Preserve the number of idle calls and sleep entries across CPU
   hotplug events in order to be able to compute correct averages

 - Limit the duration of the clocksource watchdog checking interval as
   too long intervals lead to wrongly marking the TSC as unstable

* tag 'timers_urgent_for_v6.8_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  tick/sched: Preserve number of idle sleeps across CPU hotplug events
  clocksource: Skip watchdog check for large watchdog intervals
2024-01-28 10:33:14 -08:00
Jakub Kicinski
92046e83c0 bpf-next-for-netdev
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTFp0I1jqZrAX+hPRXbK58LschIgwUCZbQV+gAKCRDbK58LschI
 g2OeAP0VvhZS9SPiS+/AMAFuw2W1BkMrFNbfBTc3nzRnyJSmNAD+NG4CLLJvsKI9
 olu7VC20B8pLTGLUGIUSwqnjOC+Kkgc=
 =wVMl
 -----END PGP SIGNATURE-----

Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next

Daniel Borkmann says:

====================
pull-request: bpf-next 2024-01-26

We've added 107 non-merge commits during the last 4 day(s) which contain
a total of 101 files changed, 6009 insertions(+), 1260 deletions(-).

The main changes are:

1) Add BPF token support to delegate a subset of BPF subsystem
   functionality from privileged system-wide daemons such as systemd
   through special mount options for userns-bound BPF fs to a trusted
   & unprivileged application. With addressed changes from Christian
   and Linus' reviews, from Andrii Nakryiko.

2) Support registration of struct_ops types from modules which helps
   projects like fuse-bpf that seeks to implement a new struct_ops type,
   from Kui-Feng Lee.

3) Add support for retrieval of cookies for perf/kprobe multi links,
   from Jiri Olsa.

4) Bigger batch of prep-work for the BPF verifier to eventually support
   preserving boundaries and tracking scalars on narrowing fills,
   from Maxim Mikityanskiy.

5) Extend the tc BPF flavor to support arbitrary TCP SYN cookies to help
   with the scenario of SYN floods, from Kuniyuki Iwashima.

6) Add code generation to inline the bpf_kptr_xchg() helper which
   improves performance when stashing/popping the allocated BPF objects,
   from Hou Tao.

7) Extend BPF verifier to track aligned ST stores as imprecise spilled
   registers, from Yonghong Song.

8) Several fixes to BPF selftests around inline asm constraints and
   unsupported VLA code generation, from Jose E. Marchesi.

9) Various updates to the BPF IETF instruction set draft document such
   as the introduction of conformance groups for instructions,
   from Dave Thaler.

10) Fix BPF verifier to make infinite loop detection in is_state_visited()
    exact to catch some too lax spill/fill corner cases,
    from Eduard Zingerman.

11) Refactor the BPF verifier pointer ALU check to allow ALU explicitly
    instead of implicitly for various register types, from Hao Sun.

12) Fix the flaky tc_redirect_dtime BPF selftest due to slowness
    in neighbor advertisement at setup time, from Martin KaFai Lau.

13) Change BPF selftests to skip callback tests for the case when the
    JIT is disabled, from Tiezhu Yang.

14) Add a small extension to libbpf which allows to auto create
    a map-in-map's inner map, from Andrey Grafin.

* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (107 commits)
  selftests/bpf: Add missing line break in test_verifier
  bpf, docs: Clarify definitions of various instructions
  bpf: Fix error checks against bpf_get_btf_vmlinux().
  bpf: One more maintainer for libbpf and BPF selftests
  selftests/bpf: Incorporate LSM policy to token-based tests
  selftests/bpf: Add tests for LIBBPF_BPF_TOKEN_PATH envvar
  libbpf: Support BPF token path setting through LIBBPF_BPF_TOKEN_PATH envvar
  selftests/bpf: Add tests for BPF object load with implicit token
  selftests/bpf: Add BPF object loading tests with explicit token passing
  libbpf: Wire up BPF token support at BPF object level
  libbpf: Wire up token_fd into feature probing logic
  libbpf: Move feature detection code into its own file
  libbpf: Further decouple feature checking logic from bpf_object
  libbpf: Split feature detectors definitions from cached results
  selftests/bpf: Utilize string values for delegate_xxx mount options
  bpf: Support symbolic BPF FS delegation mount options
  bpf: Fail BPF_TOKEN_CREATE if no delegation option was set on BPF FS
  bpf,selinux: Allocate bpf_security_struct per BPF token
  selftests/bpf: Add BPF token-enabled tests
  libbpf: Add BPF token support to bpf_prog_load() API
  ...
====================

Link: https://lore.kernel.org/r/20240126215710.19855-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-01-26 21:08:22 -08:00
Tejun Heo
e563d0a7cd workqueue: Break up enum definitions and give names to the types
workqueue is collecting different sorts of enums into a single unnamed enum
type which can increase confusion around enum width. Also, unnamed enums
can't be accessed from BPF. Let's break up enum definitions according to
their purposes and give them type names.

Signed-off-by: Tejun Heo <tj@kernel.org>
2024-01-26 11:55:50 -10:00
Tejun Heo
6a229b0e2f workqueue: Drop unnecessary kick_pool() in create_worker()
After creating a new worker, create_worker() is calling kick_pool() to wake
up the new worker task. However, as kick_pool() doesn't do anything if there
is no work pending, it also calls wake_up_process() explicitly. There's no
reason to call kick_pool() at all. wake_up_process() is enough by itself.
Drop the unnecessary kick_pool() call.

Signed-off-by: Tejun Heo <tj@kernel.org>
2024-01-26 11:55:46 -10:00
Masami Hiramatsu (Google)
0958b33ef5 tracing/trigger: Fix to return error if failed to alloc snapshot
Fix register_snapshot_trigger() to return error code if it failed to
allocate a snapshot instead of 0 (success). Unless that, it will register
snapshot trigger without an error.

Link: https://lore.kernel.org/linux-trace-kernel/170622977792.270660.2789298642759362200.stgit@devnote2

Fixes: 0bbe7f7199 ("tracing: Fix the race between registering 'snapshot' event trigger and triggering 'snapshot' operation")
Cc: stable@vger.kernel.org
Cc: Vincent Donnefort <vdonnefort@google.com>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-01-26 15:10:24 -05:00
Li Zhijian
effe6d278e kernel/cpu: Convert snprintf() to sysfs_emit()
Per filesystems/sysfs.rst, show() should only use sysfs_emit()
or sysfs_emit_at() when formatting the value to be returned to user space.

coccinelle complains that there are still a couple of functions that use
snprintf(). Convert them to sysfs_emit().

No functional change intended.

Signed-off-by: Li Zhijian <lizhijian@fujitsu.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20240116045151.3940401-40-lizhijian@fujitsu.com
2024-01-26 18:25:16 +01:00
Randy Dunlap
ef7e585bf4 cpu/hotplug: Delete an extraneous kernel-doc description
struct cpuhp_cpu_state has an extraneous kernel-doc comment for @cpu.
There is no struct member by that name, so remove the comment to
prevent the kernel-doc warning:

  kernel/cpu.c:85: warning: Excess struct member 'cpu' description in 'cpuhp_cpu_state'

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20240114030615.30441-1-rdunlap@infradead.org
2024-01-26 17:44:42 +01:00
Bartosz Golaszewski
8dab7fd47e genirq/irq_sim: Order headers alphabetically
For better readability and maintenance keep headers in alphabetical
order.

Signed-off-by: Bartosz Golaszewski <bartosz.golaszewski@linaro.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20240122124243.44002-4-brgl@bgdev.pl
2024-01-26 13:44:48 +01:00
Bartosz Golaszewski
3832f39042 genirq/irq_sim: Remove unused field from struct irq_sim_irq_ctx
The irqnum field is unused. Remove it.

Signed-off-by: Bartosz Golaszewski <bartosz.golaszewski@linaro.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20240122124243.44002-3-brgl@bgdev.pl
2024-01-26 13:44:48 +01:00
David Hildenbrand
4dca82d141 uprobes: use pagesize-aligned virtual address when replacing pages
uprobes passes an unaligned page mapping address to
folio_add_new_anon_rmap(), which ends up triggering a VM_BUG_ON() we
recently extended in commit 372cbd4d5a ("mm: non-pmd-mappable, large
folios for folio_add_new_anon_rmap()").

Arguably, this is uprobes code doing something wrong; however, for the
time being it would have likely worked in rmap code because
__folio_set_anon() would set folio->index to the same value.

Looking at __replace_page(), we'd also pass slightly wrong values to
mmu_notifier_range_init(), page_vma_mapped_walk(), flush_cache_page(),
ptep_clear_flush() and set_pte_at_notify().  I suspect most of them are
fine, but let's just mark the introducing commit as the one needed fixing.
I don't think CC stable is warranted.

We'll add more sanity checks in rmap code separately, to make sure that we
always get properly aligned addresses.

Link: https://lkml.kernel.org/r/20240115100731.91007-1-david@redhat.com
Fixes: c517ee744b ("uprobes: __replace_page() should not use page_address_in_vma()")
Signed-off-by: David Hildenbrand <david@redhat.com>
Reported-by: Jiri Olsa <jolsa@kernel.org>
Closes: https://lkml.kernel.org/r/ZaMR2EWN-HvlCfUl@krava
Tested-by: Jiri Olsa <jolsa@kernel.org>
Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Alexander Shishkin
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Ian Rogers <irogers@google.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-01-25 23:52:20 -08:00
Kui-Feng Lee
e6be8cd5d3 bpf: Fix error checks against bpf_get_btf_vmlinux().
In bpf_struct_ops_map_alloc, it needs to check for NULL in the returned
pointer of bpf_get_btf_vmlinux() when CONFIG_DEBUG_INFO_BTF is not set.
ENOTSUPP is used to preserve the same behavior before the
struct_ops kmod support.

In the function check_struct_ops_btf_id(), instead of redoing the
bpf_get_btf_vmlinux() that has already been done in syscall.c, the fix
here is to check for prog->aux->attach_btf_id.
BPF_PROG_TYPE_STRUCT_OPS must require attach_btf_id and syscall.c
guarantees a valid attach_btf as long as attach_btf_id is set.
When attach_btf_id is not set, this patch returns -ENOTSUPP
because it is what the selftest in test_libbpf_probe_prog_types()
and libbpf_probes.c are expecting for feature probing purpose.

Changes from v1:

 - Remove an unnecessary NULL check in check_struct_ops_btf_id()

Reported-by: syzbot+88f0aafe5f950d7489d7@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/bpf/00000000000040d68a060fc8db8c@google.com/
Reported-by: syzbot+1336f3d4b10bcda75b89@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/bpf/00000000000026353b060fc21c07@google.com/
Fixes: fcc2c1fb06 ("bpf: pass attached BTF to the bpf_struct_ops subsystem")
Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com>
Link: https://lore.kernel.org/r/20240126023113.1379504-1-thinker.li@gmail.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2024-01-25 20:49:59 -08:00
Audra Mitchell
8318d6a636 workqueue: Shorten events_freezable_power_efficient name
Since we have set the WQ_NAME_LEN to 32, decrease the name of
events_freezable_power_efficient so that it does not trip the name length
warning when the workqueue is created.

Signed-off-by: Audra Mitchell <audra@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2024-01-25 09:11:40 -10:00
Linus Torvalds
3cb9871f81 Urgent RCU pull request for v6.8
This commit fixes RCU grace period stalls, which are observed when
 an outgoing CPU's quiescent state reporting results in wakeup of
 one of the grace period kthreads, to complete the grace period. If
 those kthreads have SCHED_FIFO policy, the wake up can indirectly
 arm the RT bandwith timer to the local offline CPU. Earlier migration
 of the hrtimers from the CPU introduced in commit 5c0930ccaa
 ("hrtimers: Push pending hrtimers away from outgoing CPU earlier")
 results in this timer getting ignored. If the RCU grace period
 kthreads are waiting for RT bandwidth to be available, they may
 never be actually scheduled, resulting in RCU stall warnings.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQSi2tPIQIc2VEtjarIAHS7/6Z0wpQUCZbFOUgAKCRAAHS7/6Z0w
 pQjcAQCg/tJYRjwGUPebKLUgkmXlR+IIANzEgvES/RgWTOld5gEAklZVTjf3J0qt
 QeU9WC3My2cVPKvv6kqnuQ9rrqMQ3g0=
 =U5an
 -----END PGP SIGNATURE-----

Merge tag 'urgent-rcu.2024.01.24a' of https://github.com/neeraju/linux

Pull RCU fix from Neeraj Upadhyay:
 "This fixes RCU grace period stalls, which are observed when an
  outgoing CPU's quiescent state reporting results in wakeup of one of
  the grace period kthreads, to complete the grace period.

  If those kthreads have SCHED_FIFO policy, the wake up can indirectly
  arm the RT bandwith timer to the local offline CPU.

  Earlier migration of the hrtimers from the CPU introduced in commit
  5c0930ccaa ("hrtimers: Push pending hrtimers away from outgoing CPU
  earlier") results in this timer getting ignored.

  If the RCU grace period kthreads are waiting for RT bandwidth to be
  available, they may never be actually scheduled, resulting in RCU
  stall warnings"

* tag 'urgent-rcu.2024.01.24a' of https://github.com/neeraju/linux:
  rcu: Defer RCU kthreads wakeup when CPU is dying
2024-01-25 10:21:21 -08:00
Kunwu Chan
aa13b70908 audit: use KMEM_CACHE() instead of kmem_cache_create()
Use the new KMEM_CACHE() macro instead of direct kmem_cache_create
to simplify the creation of SLAB caches.

Signed-off-by: Kunwu Chan <chentao@kylinos.cn>
[PM: subject line tweaks]
Signed-off-by: Paul Moore <paul@paul-moore.com>
2024-01-25 10:12:22 -05:00
Li zeming
3104d0e94f audit: remove unnecessary assignment in audit_dupe_lsm_field()
The ret variable is assigned when it does not need to be defined, as it
has already been assigned before use.

Signed-off-by: Li zeming <zeming@nfschina.com>
[PM: rewrite subject line]
Signed-off-by: Paul Moore <paul@paul-moore.com>
2024-01-25 09:59:27 -05:00
Tim Chen
9a574ea906 tick/sched: Preserve number of idle sleeps across CPU hotplug events
Commit 71fee48f ("tick-sched: Fix idle and iowait sleeptime accounting vs
CPU hotplug") preserved total idle sleep time and iowait sleeptime across
CPU hotplug events.

Similar reasoning applies to the number of idle calls and idle sleeps to
get the proper average of sleep time per idle invocation.

Preserve those fields too.

Fixes: 71fee48f ("tick-sched: Fix idle and iowait sleeptime accounting vs CPU hotplug")
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20240122233534.3094238-1-tim.c.chen@linux.intel.com
2024-01-25 09:52:40 +01:00
Jiri Wiesner
6446495535 clocksource: Skip watchdog check for large watchdog intervals
There have been reports of the watchdog marking clocksources unstable on
machines with 8 NUMA nodes:

  clocksource: timekeeping watchdog on CPU373:
  Marking clocksource 'tsc' as unstable because the skew is too large:
  clocksource:   'hpet' wd_nsec: 14523447520
  clocksource:   'tsc'  cs_nsec: 14524115132

The measured clocksource skew - the absolute difference between cs_nsec
and wd_nsec - was 668 microseconds:

  cs_nsec - wd_nsec = 14524115132 - 14523447520 = 667612

The kernel used 200 microseconds for the uncertainty_margin of both the
clocksource and watchdog, resulting in a threshold of 400 microseconds (the
md variable). Both the cs_nsec and the wd_nsec value indicate that the
readout interval was circa 14.5 seconds.  The observed behaviour is that
watchdog checks failed for large readout intervals on 8 NUMA node
machines. This indicates that the size of the skew was directly proportinal
to the length of the readout interval on those machines. The measured
clocksource skew, 668 microseconds, was evaluated against a threshold (the
md variable) that is suited for readout intervals of roughly
WATCHDOG_INTERVAL, i.e. HZ >> 1, which is 0.5 second.

The intention of 2e27e793e2 ("clocksource: Reduce clocksource-skew
threshold") was to tighten the threshold for evaluating skew and set the
lower bound for the uncertainty_margin of clocksources to twice
WATCHDOG_MAX_SKEW. Later in c37e85c135 ("clocksource: Loosen clocksource
watchdog constraints"), the WATCHDOG_MAX_SKEW constant was increased to
125 microseconds to fit the limit of NTP, which is able to use a
clocksource that suffers from up to 500 microseconds of skew per second.
Both the TSC and the HPET use default uncertainty_margin. When the
readout interval gets stretched the default uncertainty_margin is no
longer a suitable lower bound for evaluating skew - it imposes a limit
that is far stricter than the skew with which NTP can deal.

The root causes of the skew being directly proportinal to the length of
the readout interval are:

  * the inaccuracy of the shift/mult pairs of clocksources and the watchdog
  * the conversion to nanoseconds is imprecise for large readout intervals

Prevent this by skipping the current watchdog check if the readout
interval exceeds 2 * WATCHDOG_INTERVAL. Considering the maximum readout
interval of 2 * WATCHDOG_INTERVAL, the current default uncertainty margin
(of the TSC and HPET) corresponds to a limit on clocksource skew of 250
ppm (microseconds of skew per second).  To keep the limit imposed by NTP
(500 microseconds of skew per second) for all possible readout intervals,
the margins would have to be scaled so that the threshold value is
proportional to the length of the actual readout interval.

As for why the readout interval may get stretched: Since the watchdog is
executed in softirq context the expiration of the watchdog timer can get
severely delayed on account of a ksoftirqd thread not getting to run in a
timely manner. Surely, a system with such belated softirq execution is not
working well and the scheduling issue should be looked into but the
clocksource watchdog should be able to deal with it accordingly.

Fixes: 2e27e793e2 ("clocksource: Reduce clocksource-skew threshold")
Suggested-by: Feng Tang <feng.tang@intel.com>
Signed-off-by: Jiri Wiesner <jwiesner@suse.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Feng Tang <feng.tang@intel.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20240122172350.GA740@incl
2024-01-25 09:13:16 +01:00
Andrii Nakryiko
6c1752e0b6 bpf: Support symbolic BPF FS delegation mount options
Besides already supported special "any" value and hex bit mask, support
string-based parsing of delegation masks based on exact enumerator
names. Utilize BTF information of `enum bpf_cmd`, `enum bpf_map_type`,
`enum bpf_prog_type`, and `enum bpf_attach_type` types to find supported
symbolic names (ignoring __MAX_xxx guard values and stripping repetitive
prefixes like BPF_ for cmd and attach types, BPF_MAP_TYPE_ for maps, and
BPF_PROG_TYPE_ for prog types). The case doesn't matter, but it is
normalized to lower case in mount option output. So "PROG_LOAD",
"prog_load", and "MAP_create" are all valid values to specify for
delegate_cmds options, "array" is among supported for map types, etc.

Besides supporting string values, we also support multiple values
specified at the same time, using colon (':') separator.

There are corresponding changes on bpf_show_options side to use known
values to print them in human-readable format, falling back to hex mask
printing, if there are any unrecognized bits. This shouldn't be
necessary when enum BTF information is present, but in general we should
always be able to fall back to this even if kernel was built without BTF.
As mentioned, emitted symbolic names are normalized to be all lower case.

Example below shows various ways to specify delegate_cmds options
through mount command and how mount options are printed back:

12/14 14:39:07.604
vmuser@archvm:~/local/linux/tools/testing/selftests/bpf
$ mount | rg token

  $ sudo mkdir -p /sys/fs/bpf/token
  $ sudo mount -t bpf bpffs /sys/fs/bpf/token \
               -o delegate_cmds=prog_load:MAP_CREATE \
               -o delegate_progs=kprobe \
               -o delegate_attachs=xdp
  $ mount | grep token
  bpffs on /sys/fs/bpf/token type bpf (rw,relatime,delegate_cmds=map_create:prog_load,delegate_progs=kprobe,delegate_attachs=xdp)

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20240124022127.2379740-20-andrii@kernel.org
2024-01-24 16:21:02 -08:00
Andrii Nakryiko
aeaa97b006 bpf: Fail BPF_TOKEN_CREATE if no delegation option was set on BPF FS
It's quite confusing in practice when it's possible to successfully
create a BPF token from BPF FS that didn't have any of delegate_xxx
mount options set up. While it's not wrong, it's actually more
meaningful to reject BPF_TOKEN_CREATE with specific error code (-ENOENT)
to let user-space know that no token delegation is setup up.

So, instead of creating empty BPF token that will be always ignored
because it doesn't have any of the allow_xxx bits set, reject it with
-ENOENT. If we ever need empty BPF token to be possible, we can support
that with extra flag passed into BPF_TOKEN_CREATE.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Christian Brauner <brauner@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20240124022127.2379740-19-andrii@kernel.org
2024-01-24 16:21:02 -08:00
Andrii Nakryiko
f568a3d49a bpf,lsm: Add BPF token LSM hooks
Wire up bpf_token_create and bpf_token_free LSM hooks, which allow to
allocate LSM security blob (we add `void *security` field to struct
bpf_token for that), but also control who can instantiate BPF token.
This follows existing pattern for BPF map and BPF prog.

Also add security_bpf_token_allow_cmd() and security_bpf_token_capable()
LSM hooks that allow LSM implementation to control and negate (if
necessary) BPF token's delegation of a specific bpf_cmd and capability,
respectively.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Paul Moore <paul@paul-moore.com>
Link: https://lore.kernel.org/bpf/20240124022127.2379740-12-andrii@kernel.org
2024-01-24 16:21:01 -08:00
Andrii Nakryiko
a2431c7eab bpf,lsm: Refactor bpf_map_alloc/bpf_map_free LSM hooks
Similarly to bpf_prog_alloc LSM hook, rename and extend bpf_map_alloc
hook into bpf_map_create, taking not just struct bpf_map, but also
bpf_attr and bpf_token, to give a fuller context to LSMs.

Unlike bpf_prog_alloc, there is no need to move the hook around, as it
currently is firing right before allocating BPF map ID and FD, which
seems to be a sweet spot.

But like bpf_prog_alloc/bpf_prog_free combo, make sure that bpf_map_free
LSM hook is called even if bpf_map_create hook returned error, as if few
LSMs are combined together it could be that one LSM successfully
allocated security blob for its needs, while subsequent LSM rejected BPF
map creation. The former LSM would still need to free up LSM blob, so we
need to ensure security_bpf_map_free() is called regardless of the
outcome.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Paul Moore <paul@paul-moore.com>
Link: https://lore.kernel.org/bpf/20240124022127.2379740-11-andrii@kernel.org
2024-01-24 16:21:01 -08:00
Andrii Nakryiko
1b67772e4e bpf,lsm: Refactor bpf_prog_alloc/bpf_prog_free LSM hooks
Based on upstream discussion ([0]), rework existing
bpf_prog_alloc_security LSM hook. Rename it to bpf_prog_load and instead
of passing bpf_prog_aux, pass proper bpf_prog pointer for a full BPF
program struct. Also, we pass bpf_attr union with all the user-provided
arguments for BPF_PROG_LOAD command.  This will give LSMs as much
information as we can basically provide.

The hook is also BPF token-aware now, and optional bpf_token struct is
passed as a third argument. bpf_prog_load LSM hook is called after
a bunch of sanity checks were performed, bpf_prog and bpf_prog_aux were
allocated and filled out, but right before performing full-fledged BPF
verification step.

bpf_prog_free LSM hook is now accepting struct bpf_prog argument, for
consistency. SELinux code is adjusted to all new names, types, and
signatures.

Note, given that bpf_prog_load (previously bpf_prog_alloc) hook can be
used by some LSMs to allocate extra security blob, but also by other
LSMs to reject BPF program loading, we need to make sure that
bpf_prog_free LSM hook is called after bpf_prog_load/bpf_prog_alloc one
*even* if the hook itself returned error. If we don't do that, we run
the risk of leaking memory. This seems to be possible today when
combining SELinux and BPF LSM, as one example, depending on their
relative ordering.

Also, for BPF LSM setup, add bpf_prog_load and bpf_prog_free to
sleepable LSM hooks list, as they are both executed in sleepable
context. Also drop bpf_prog_load hook from untrusted, as there is no
issue with refcount or anything else anymore, that originally forced us
to add it to untrusted list in c0c852dd18 ("bpf: Do not mark certain LSM
hook arguments as trusted"). We now trigger this hook much later and it
should not be an issue anymore.

  [0] https://lore.kernel.org/bpf/9fe88aef7deabbe87d3fc38c4aea3c69.paul@paul-moore.com/

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Paul Moore <paul@paul-moore.com>
Link: https://lore.kernel.org/bpf/20240124022127.2379740-10-andrii@kernel.org
2024-01-24 16:21:01 -08:00
Andrii Nakryiko
d79a354975 bpf: Consistently use BPF token throughout BPF verifier logic
Remove remaining direct queries to perfmon_capable() and bpf_capable()
in BPF verifier logic and instead use BPF token (if available) to make
decisions about privileges.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20240124022127.2379740-9-andrii@kernel.org
2024-01-24 16:21:01 -08:00
Andrii Nakryiko
bbc1d24724 bpf: Take into account BPF token when fetching helper protos
Instead of performing unconditional system-wide bpf_capable() and
perfmon_capable() calls inside bpf_base_func_proto() function (and other
similar ones) to determine eligibility of a given BPF helper for a given
program, use previously recorded BPF token during BPF_PROG_LOAD command
handling to inform the decision.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20240124022127.2379740-8-andrii@kernel.org
2024-01-24 16:21:01 -08:00
Andrii Nakryiko
caf8f28e03 bpf: Add BPF token support to BPF_PROG_LOAD command
Add basic support of BPF token to BPF_PROG_LOAD. BPF_F_TOKEN_FD flag
should be set in prog_flags field when providing prog_token_fd.

Wire through a set of allowed BPF program types and attach types,
derived from BPF FS at BPF token creation time. Then make sure we
perform bpf_token_capable() checks everywhere where it's relevant.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20240124022127.2379740-7-andrii@kernel.org
2024-01-24 16:21:01 -08:00
Andrii Nakryiko
9ea7c4bf17 bpf: Add BPF token support to BPF_BTF_LOAD command
Accept BPF token FD in BPF_BTF_LOAD command to allow BTF data loading
through delegated BPF token. BPF_F_TOKEN_FD flag has to be specified
when passing BPF token FD. Given BPF_BTF_LOAD command didn't have flags
field before, we also add btf_flags field.

BTF loading is a pretty straightforward operation, so as long as BPF
token is created with allow_cmds granting BPF_BTF_LOAD command, kernel
proceeds to parsing BTF data and creating BTF object.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20240124022127.2379740-6-andrii@kernel.org
2024-01-24 16:21:01 -08:00
Andrii Nakryiko
a177fc2bf6 bpf: Add BPF token support to BPF_MAP_CREATE command
Allow providing token_fd for BPF_MAP_CREATE command to allow controlled
BPF map creation from unprivileged process through delegated BPF token.
New BPF_F_TOKEN_FD flag is added to specify together with BPF token FD
for BPF_MAP_CREATE command.

Wire through a set of allowed BPF map types to BPF token, derived from
BPF FS at BPF token creation time. This, in combination with allowed_cmds
allows to create a narrowly-focused BPF token (controlled by privileged
agent) with a restrictive set of BPF maps that application can attempt
to create.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20240124022127.2379740-5-andrii@kernel.org
2024-01-24 16:21:01 -08:00
Andrii Nakryiko
35f96de041 bpf: Introduce BPF token object
Add new kind of BPF kernel object, BPF token. BPF token is meant to
allow delegating privileged BPF functionality, like loading a BPF
program or creating a BPF map, from privileged process to a *trusted*
unprivileged process, all while having a good amount of control over which
privileged operations could be performed using provided BPF token.

This is achieved through mounting BPF FS instance with extra delegation
mount options, which determine what operations are delegatable, and also
constraining it to the owning user namespace (as mentioned in the
previous patch).

BPF token itself is just a derivative from BPF FS and can be created
through a new bpf() syscall command, BPF_TOKEN_CREATE, which accepts BPF
FS FD, which can be attained through open() API by opening BPF FS mount
point. Currently, BPF token "inherits" delegated command, map types,
prog type, and attach type bit sets from BPF FS as is. In the future,
having an BPF token as a separate object with its own FD, we can allow
to further restrict BPF token's allowable set of things either at the
creation time or after the fact, allowing the process to guard itself
further from unintentionally trying to load undesired kind of BPF
programs. But for now we keep things simple and just copy bit sets as is.

When BPF token is created from BPF FS mount, we take reference to the
BPF super block's owning user namespace, and then use that namespace for
checking all the {CAP_BPF, CAP_PERFMON, CAP_NET_ADMIN, CAP_SYS_ADMIN}
capabilities that are normally only checked against init userns (using
capable()), but now we check them using ns_capable() instead (if BPF
token is provided). See bpf_token_capable() for details.

Such setup means that BPF token in itself is not sufficient to grant BPF
functionality. User namespaced process has to *also* have necessary
combination of capabilities inside that user namespace. So while
previously CAP_BPF was useless when granted within user namespace, now
it gains a meaning and allows container managers and sys admins to have
a flexible control over which processes can and need to use BPF
functionality within the user namespace (i.e., container in practice).
And BPF FS delegation mount options and derived BPF tokens serve as
a per-container "flag" to grant overall ability to use bpf() (plus further
restrict on which parts of bpf() syscalls are treated as namespaced).

Note also, BPF_TOKEN_CREATE command itself requires ns_capable(CAP_BPF)
within the BPF FS owning user namespace, rounding up the ns_capable()
story of BPF token. Also creating BPF token in init user namespace is
currently not supported, given BPF token doesn't have any effect in init
user namespace anyways.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Christian Brauner <brauner@kernel.org>
Link: https://lore.kernel.org/bpf/20240124022127.2379740-4-andrii@kernel.org
2024-01-24 16:21:01 -08:00
Andrii Nakryiko
6fe01d3cbb bpf: Add BPF token delegation mount options to BPF FS
Add few new mount options to BPF FS that allow to specify that a given
BPF FS instance allows creation of BPF token (added in the next patch),
and what sort of operations are allowed under BPF token. As such, we get
4 new mount options, each is a bit mask
  - `delegate_cmds` allow to specify which bpf() syscall commands are
    allowed with BPF token derived from this BPF FS instance;
  - if BPF_MAP_CREATE command is allowed, `delegate_maps` specifies
    a set of allowable BPF map types that could be created with BPF token;
  - if BPF_PROG_LOAD command is allowed, `delegate_progs` specifies
    a set of allowable BPF program types that could be loaded with BPF token;
  - if BPF_PROG_LOAD command is allowed, `delegate_attachs` specifies
    a set of allowable BPF program attach types that could be loaded with
    BPF token; delegate_progs and delegate_attachs are meant to be used
    together, as full BPF program type is, in general, determined
    through both program type and program attach type.

Currently, these mount options accept the following forms of values:
  - a special value "any", that enables all possible values of a given
  bit set;
  - numeric value (decimal or hexadecimal, determined by kernel
  automatically) that specifies a bit mask value directly;
  - all the values for a given mount option are combined, if specified
  multiple times. E.g., `mount -t bpf nodev /path/to/mount -o
  delegate_maps=0x1 -o delegate_maps=0x2` will result in a combined 0x3
  mask.

Ideally, more convenient (for humans) symbolic form derived from
corresponding UAPI enums would be accepted (e.g., `-o
delegate_progs=kprobe|tracepoint`) and I intend to implement this, but
it requires a bunch of UAPI header churn, so I postponed it until this
feature lands upstream or at least there is a definite consensus that
this feature is acceptable and is going to make it, just to minimize
amount of wasted effort and not increase amount of non-essential code to
be reviewed.

Attentive reader will notice that BPF FS is now marked as
FS_USERNS_MOUNT, which theoretically makes it mountable inside non-init
user namespace as long as the process has sufficient *namespaced*
capabilities within that user namespace. But in reality we still
restrict BPF FS to be mountable only by processes with CAP_SYS_ADMIN *in
init userns* (extra check in bpf_fill_super()). FS_USERNS_MOUNT is added
to allow creating BPF FS context object (i.e., fsopen("bpf")) from
inside unprivileged process inside non-init userns, to capture that
userns as the owning userns. It will still be required to pass this
context object back to privileged process to instantiate and mount it.

This manipulation is important, because capturing non-init userns as the
owning userns of BPF FS instance (super block) allows to use that userns
to constraint BPF token to that userns later on (see next patch). So
creating BPF FS with delegation inside unprivileged userns will restrict
derived BPF token objects to only "work" inside that intended userns,
making it scoped to a intended "container". Also, setting these
delegation options requires capable(CAP_SYS_ADMIN), so unprivileged
process cannot set this up without involvement of a privileged process.

There is a set of selftests at the end of the patch set that simulates
this sequence of steps and validates that everything works as intended.
But careful review is requested to make sure there are no missed gaps in
the implementation and testing.

This somewhat subtle set of aspects is the result of previous
discussions ([0]) about various user namespace implications and
interactions with BPF token functionality and is necessary to contain
BPF token inside intended user namespace.

  [0] https://lore.kernel.org/bpf/20230704-hochverdient-lehne-eeb9eeef785e@brauner/

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Christian Brauner <brauner@kernel.org>
Link: https://lore.kernel.org/bpf/20240124022127.2379740-3-andrii@kernel.org
2024-01-24 16:21:00 -08:00
Andrii Nakryiko
ed1ad5a741 bpf: Align CAP_NET_ADMIN checks with bpf_capable() approach
Within BPF syscall handling code CAP_NET_ADMIN checks stand out a bit
compared to CAP_BPF and CAP_PERFMON checks. For the latter, CAP_BPF or
CAP_PERFMON are checked first, but if they are not set, CAP_SYS_ADMIN
takes over and grants whatever part of BPF syscall is required.

Similar kind of checks that involve CAP_NET_ADMIN are not so consistent.
One out of four uses does follow CAP_BPF/CAP_PERFMON model: during
BPF_PROG_LOAD, if the type of BPF program is "network-related" either
CAP_NET_ADMIN or CAP_SYS_ADMIN is required to proceed.

But in three other cases CAP_NET_ADMIN is required even if CAP_SYS_ADMIN
is set:
  - when creating DEVMAP/XDKMAP/CPU_MAP maps;
  - when attaching CGROUP_SKB programs;
  - when handling BPF_PROG_QUERY command.

This patch is changing the latter three cases to follow BPF_PROG_LOAD
model, that is allowing to proceed under either CAP_NET_ADMIN or
CAP_SYS_ADMIN.

This also makes it cleaner in subsequent BPF token patches to switch
wholesomely to a generic bpf_token_capable(int cap) check, that always
falls back to CAP_SYS_ADMIN if requested capability is missing.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Yafang Shao <laoar.shao@gmail.com>
Link: https://lore.kernel.org/bpf/20240124022127.2379740-2-andrii@kernel.org
2024-01-24 16:21:00 -08:00
Linus Torvalds
cf10015a24 execve fixes for v6.8-rc2
- Fix error handling in begin_new_exec() (Bernd Edlinger)
 
 - MAINTAINERS: specifically mention ELF (Alexey Dobriyan)
 
 - Various cleanups related to earlier open() (Askar Safin, Kees Cook)
 -----BEGIN PGP SIGNATURE-----
 
 iQJKBAABCgA0FiEEpcP2jyKd1g9yPm4TiXL039xtwCYFAmWxbGsWHGtlZXNjb29r
 QGNocm9taXVtLm9yZwAKCRCJcvTf3G3AJiQZD/9Lxd6ntRORthvCGk07g12fGZhQ
 OstFdbHyk5/Z+6/uKxSMvkoPZwJkXF2n3D/AvlfMFgyDBvLCFUu08jZOV31YFbeQ
 OFXVGcbY7nexkAmC6eN2k3SX8E+jzLdbcHeVk/iJomNUYBNTpExXhGMEyqZ53Pzo
 fo1uaRNGreCdSP04aHU1LE0vx7p16553oBeBZFT+iLd4glLte+E1TOZh4cIaSZbK
 5h0e+vG1XSBd9uP3fbYEyf+1JzKuhmm1RrVVaDkds1CLgJzUxh0cE1U9otKfnrwf
 xyBu556wTb001vYAIIcLlOJq+ROdiuA12RSyyHbKZmYAWTkQnBgKPV8BGDbshtzN
 zykJEsbRnWV3vN1n6+UzCEknE/xjvywEEdJgghZh46zk2NjnbtULOonLq8aMw7SA
 O+kcr4rqPLuRnxnkBw7QqA1y09QD9+M/iRQdgahsBIaDM3mMXGQsqeJAo9tFxO2M
 oJ1gJ9A7IdeULMBQ7zKVxTvC5c5fF2/CA5jpHUjASiUOTqcfHkPRYX2GINE62Heb
 xfsc3c1RhDrknMA/O01c8ziEBzZqhHUq4vGgWn0VjwIspYyfOOJYneeIx6/pJyTY
 OXbgaK+NetDCOKcv91Jjj0xfxrP0WogzvDbT9j2NuViqX24aQR1oZrredWPCTt5S
 wKouTaLVsM10EwR/Rw==
 =oOcx
 -----END PGP SIGNATURE-----

Merge tag 'execve-v6.8-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux

Pull execve fixes from Kees Cook:

 - Fix error handling in begin_new_exec() (Bernd Edlinger)

 - MAINTAINERS: specifically mention ELF (Alexey Dobriyan)

 - Various cleanups related to earlier open() (Askar Safin, Kees Cook)

* tag 'execve-v6.8-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
  exec: Distinguish in_execve from in_exec
  exec: Fix error handling in begin_new_exec()
  exec: Add do_close_execat() helper
  exec: remove useless comment
  ELF, MAINTAINERS: specifically mention ELF
2024-01-24 13:32:29 -08:00
Kees Cook
90383cc078 exec: Distinguish in_execve from in_exec
Just to help distinguish the fs->in_exec flag from the current->in_execve
flag, add comments in check_unsafe_exec() and copy_fs() for more
context. Also note that in_execve is only used by TOMOYO now.

Cc: Kentaro Takeda <takedakn@nttdata.co.jp>
Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-mm@kvack.org
Signed-off-by: Kees Cook <keescook@chromium.org>
2024-01-24 11:48:52 -08:00