Commit Graph

45887 Commits

Author SHA1 Message Date
Linus Torvalds
8d8d276ba2 More tracing fixes for 6.11:
- Move declaration of interface_lock outside of CONFIG_TIMERLAT_TRACER
 
   The fix to some locking races moved the declaration of the
   interface_lock up in the file, but also moved it into the
   CONFIG_TIMERLAT_TRACER #ifdef block, breaking the build when
   that wasn't set. Move it further up and out of that #ifdef block.
 
 - Remove unused function run_tracer_selftest() stub
 
   When CONFIG_FTRACE_STARTUP_TEST is not set the stub function
   run_tracer_selftest() is not used and clang is warning about it.
   Remove the function stub as it is not needed.
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCZt9WIRQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qj2PAPsHsAHxF4oPhXi9UmGHH+l0NcWm87U2
 B5JE+73M+RaDQgD/WpdGJaQRudUwic0wu+aHXzMFae3DVd/WUjWbGnlo5gI=
 =pS08
 -----END PGP SIGNATURE-----

Merge tag 'trace-v6.11-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull tracing fixes from Steven Rostedt:

 - Move declaration of interface_lock outside of CONFIG_TIMERLAT_TRACER

   The fix to some locking races moved the declaration of the
   interface_lock up in the file, but also moved it into the
   CONFIG_TIMERLAT_TRACER #ifdef block, breaking the build when that
   wasn't set. Move it further up and out of that #ifdef block.

 - Remove unused function run_tracer_selftest() stub

   When CONFIG_FTRACE_STARTUP_TEST is not set the stub function
   run_tracer_selftest() is not used and clang is warning about it.
   Remove the function stub as it is not needed.

* tag 'trace-v6.11-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  tracing: Drop unused helper function to fix the build
  tracing/osnoise: Fix build when timerlat is not enabled
2024-09-10 09:05:20 -07:00
Benjamin ROBIN
35b603f8a7 ntp: Make sure RTC is synchronized when time goes backwards
sync_hw_clock() is normally called every 11 minutes when time is
synchronized. The issue is that this periodic timer uses the REALTIME
clock, so when time moves backwards (the NTP server jumps into the past),
the timer expires late.

If the timer expires late, which can be days later, the RTC will no longer
be updated, which is an issue if the device is abruptly powered OFF during
this period. When the device restarts (when powered ON), it will have
the date prior to the ADJ_SETOFFSET call.

A normal NTP server should not jump into the past like that, but it is
possible... Another way of reproducing this issue is to use phc2sys to
synchronize the REALTIME clock with, for example, an IRIG timecode with
the source always starting at the same date (not synchronized).

Also, if the time jumps into the future by less than 11 minutes, the RTC may
not be updated immediately (a minor issue). Consider the following scenario:
 - Time is synchronized, and sync_hw_clock() was just called (the timer
   expires in 11 minutes).
 - The time jumps into the future by a couple of minutes.
 - The time is synchronized again.
 - Users may expect the RTC to be updated as soon as possible, and not
   after 11 minutes (for the same reason, in case a power loss occurs in this
   period).

Cancel periodic timer on any time jump (ADJ_SETOFFSET) greater than or
equal to 1s. The timer will be relaunched at the end of do_adjtimex() if
NTP is still considered synced. Otherwise the timer will be relaunched
later when NTP is synced. This way, when the time is synchronized again,
the RTC is updated after less than 2 seconds.

Signed-off-by: Benjamin ROBIN <dev@benjarobin.fr>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20240908140836.203911-1-dev@benjarobin.fr
2024-09-10 13:50:40 +02:00
Thomas Gleixner
2f7eedca6c Merge branch 'linus' into timers/core
To update with the latest fixes.
2024-09-10 13:49:53 +02:00
Waiman Long
d00b83d416 locking/rwsem: Move is_rwsem_reader_owned() and rwsem_owner() under CONFIG_DEBUG_RWSEMS
Both is_rwsem_reader_owned() and rwsem_owner() are currently only used when
CONFIG_DEBUG_RWSEMS is defined. This causes a compilation error with clang
when `make W=1` and CONFIG_WERROR=y:

kernel/locking/rwsem.c:187:20: error: unused function 'is_rwsem_reader_owned' [-Werror,-Wunused-function]
  187 | static inline bool is_rwsem_reader_owned(struct rw_semaphore *sem)
      |                    ^~~~~~~~~~~~~~~~~~~~~
kernel/locking/rwsem.c:271:35: error: unused function 'rwsem_owner' [-Werror,-Wunused-function]
  271 | static inline struct task_struct *rwsem_owner(struct rw_semaphore *sem)
      |                                   ^~~~~~~~~~~

Fix this by moving these two functions under the CONFIG_DEBUG_RWSEMS define.

Reported-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Tested-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Link: https://lore.kernel.org/r/20240909182905.161156-1-longman@redhat.com
2024-09-10 12:02:33 +02:00
Peter Zijlstra
1d7f856c2c jump_label: Fix static_key_slow_dec() yet again
While commit 83ab38ef0a ("jump_label: Fix concurrency issues in
static_key_slow_dec()") fixed one problem, it created yet another,
notably the following is now possible:

  slow_dec
    if (try_dec) // dec_not_one-ish, false
    // enabled == 1
                                slow_inc
                                  if (inc_not_disabled) // inc_not_zero-ish
                                  // enabled == 2
                                    return

    guard(mutex)(&jump_label_mutex);
    if (atomic_cmpxchg(1,0)==1) // false, we're 2

                                slow_dec
                                  if (try-dec) // dec_not_one, true
                                  // enabled == 1
                                    return
    else
      try_dec() // dec_not_one, false
      WARN

Use dec_and_test instead of cmpxchg(), like it was prior to
83ab38ef0a. Add a few WARNs for the paranoid.

Fixes: 83ab38ef0a ("jump_label: Fix concurrency issues in static_key_slow_dec()")
Reported-by: "Darrick J. Wong" <djwong@kernel.org>
Tested-by: Klara Modin <klarasmodin@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2024-09-10 11:57:27 +02:00
Kan Liang
a48a36b316 perf: Add PERF_EV_CAP_READ_SCOPE
Usually, an event can be read from any CPU of the scope. It doesn't need
to be read from the advertised CPU.

Add a new event cap, PERF_EV_CAP_READ_SCOPE. An event of a PMU with
scope can be read from any active CPU in the scope.
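
As a rough illustration (not part of the patch), a PMU driver whose counters
are per-die rather than per-CPU could advertise the capability from its
event_init callback; the my_pmu_event_init() name is a placeholder:

  static int my_pmu_event_init(struct perf_event *event)
  {
          if (event->attr.type != event->pmu->type)
                  return -ENOENT;

          /* let perf read this event from any active CPU in the PMU's
           * scope, not only from the advertised event->cpu */
          event->event_caps |= PERF_EV_CAP_READ_SCOPE;

          return 0;
  }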

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20240802151643.1691631-3-kan.liang@linux.intel.com
2024-09-10 11:44:13 +02:00
Kan Liang
4ba4f1afb6 perf: Generic hotplug support for a PMU with a scope
The perf subsystem assumes that the counters of a PMU are per-CPU. So
the user space tool reads a counter from each CPU in system-wide
mode. However, many PMUs don't have a per-CPU counter. The counter is
effective for a scope, e.g., a die or a socket. To address this, a
cpumask is exposed by the kernel driver to restrict reads to one CPU,
which stands for a specific scope. In case the given CPU is removed,
the hotplug support has to be implemented for each such driver.

The code to support the cpumask and hotplug is very similar across drivers:
- Expose a cpumask in sysfs
- Pick up another CPU in the same scope if the given CPU is removed.
- Invoke perf_pmu_migrate_context() to migrate to a new CPU.
- In event init, always set the CPU in the cpumask to event->cpu

Similar duplicated code is implemented in each such PMU driver. It
would be good to introduce a generic infrastructure to avoid such
duplication.

Five common scopes are implemented here: core, die, cluster, pkg, and
system-wide. The scope can be set when a PMU is registered. If so, a
"cpumask" is automatically exposed for the PMU.

The "cpumask" is from the perf_online_<scope>_mask, which is to track
the active CPU for each scope. They are set when the first CPU of the
scope is online via the generic perf hotplug support. When a
corresponding CPU is removed, the perf_online_<scope>_mask is updated
accordingly and the PMU will be moved to a new CPU from the same scope
if possible.
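
A minimal sketch of how a driver might use this; the .scope field and the
PERF_PMU_SCOPE_DIE constant follow the naming described in this series,
everything else is a placeholder:

  static struct pmu my_die_pmu = {
          .event_init     = my_pmu_event_init,
          .add            = my_pmu_add,
          .del            = my_pmu_del,
          .start          = my_pmu_start,
          .stop           = my_pmu_stop,
          .read           = my_pmu_read,
          /* declare the counter scope; the core then exposes "cpumask"
           * and migrates the PMU context on hotplug automatically */
          .scope          = PERF_PMU_SCOPE_DIE,
  };

  err = perf_pmu_register(&my_die_pmu, "my_die_pmu", -1);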

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20240802151643.1691631-2-kan.liang@linux.intel.com
2024-09-10 11:44:12 +02:00
Huang Shijie
2cab4bd024 sched/debug: Fix the runnable tasks output
The current runnable tasks output looks like:

  runnable tasks:
   S            task   PID         tree-key  switches  prio     wait-time             sum-exec        sum-sleep
  -------------------------------------------------------------------------------------------------------------
   Ikworker/R-rcu_g     4         0.129049 E         0.620179           0.750000         0.002920         2   100         0.000000         0.002920         0.000000         0.000000 0 0 /
   Ikworker/R-sync_     5         0.125328 E         0.624147           0.750000         0.001840         2   100         0.000000         0.001840         0.000000         0.000000 0 0 /
   Ikworker/R-slub_     6         0.120835 E         0.628680           0.750000         0.001800         2   100         0.000000         0.001800         0.000000         0.000000 0 0 /
   Ikworker/R-netns     7         0.114294 E         0.634701           0.750000         0.002400         2   100         0.000000         0.002400         0.000000         0.000000 0 0 /
   I    kworker/0:1     9       508.781746 E       511.754666           3.000000       151.575240       224   120         0.000000       151.575240         0.000000         0.000000 0 0 /

This is messy. Remove the duplicate printing of sum_exec_runtime and
tidy up the layout to make it look like:

  runnable tasks:
   S            task   PID       vruntime   eligible    deadline             slice          sum-exec      switches  prio         wait-time        sum-sleep       sum-block  node   group-id  group-path
  -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
   I     kworker/0:3  1698       295.001459   E         297.977619           3.000000        38.862920         9     120         0.000000         0.000000         0.000000   0      0        /
   I     kworker/0:4  1702       278.026303   E         281.026303           3.000000         9.918760         3     120         0.000000         0.000000         0.000000   0      0        /
   S  NetworkManager  2646         0.377936   E           2.598104           3.000000        98.535880       314     120         0.000000         0.000000         0.000000   0      0        /system.slice/NetworkManager.service
   S       virtqemud  2689         0.541016   E           2.440104           3.000000        50.967960        80     120         0.000000         0.000000         0.000000   0      0        /system.slice/virtqemud.service
   S   gsd-smartcard  3058        73.604144   E          76.475904           3.000000        74.033320        88     120         0.000000         0.000000         0.000000   0      0        /user.slice/user-42.slice/session-c1.scope

Reviewed-by: Christoph Lameter (Ampere) <cl@linux.com>
Signed-off-by: Huang Shijie <shijie@os.amperecomputing.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20240906053019.7874-1-shijie@os.amperecomputing.com
2024-09-10 09:51:15 +02:00
Peter Zijlstra
c662e2b1e8 sched: Fix sched_delayed vs sched_core
Completely analogous to commit dfa0a574cb ("sched/uclamg: Handle
delayed dequeue"), avoid double dequeue for the sched_core entries.

Fixes: 152e11f6df ("sched/fair: Implement delayed dequeue")
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2024-09-10 09:51:15 +02:00
Dietmar Eggemann
729288bc68 kernel/sched: Fix util_est accounting for DELAY_DEQUEUE
Remove delayed tasks from util_est even though they are runnable.

Exclude delayed tasks which are (a) migrating between rq's or (b) in a
SAVE/RESTORE dequeue/enqueue.

Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/c49ef5fe-a909-43f1-b02f-a765ab9cedbf@arm.com
2024-09-10 09:51:15 +02:00
Chen Yu
6b9ccbc033 kthread: Fix task state in kthread worker if being frozen
When analyzing a kernel warning message, Peter pointed out that there is a race
condition when the kworker is being frozen and falls into try_to_freeze() with
TASK_INTERRUPTIBLE, which could trigger a might_sleep() warning in try_to_freeze().
Although the root cause is not related to freeze() [1], it is still worth fixing
this issue first.

One possible race scenario:

        CPU 0                                           CPU 1
        -----                                           -----

        // kthread_worker_fn
        set_current_state(TASK_INTERRUPTIBLE);
                                                       suspend_freeze_processes()
                                                         freeze_processes
                                                           static_branch_inc(&freezer_active);
                                                         freeze_kernel_threads
                                                           pm_nosig_freezing = true;
        if (work) { //false
          __set_current_state(TASK_RUNNING);

        } else if (!freezing(current)) //false, been frozen

                      freezing():
                      if (static_branch_unlikely(&freezer_active))
                        if (pm_nosig_freezing)
                          return true;
          schedule()
	}

        // state is still TASK_INTERRUPTIBLE
        try_to_freeze()
          might_sleep() <--- warning

Fix this by explicitly setting the task state to TASK_RUNNING before entering
try_to_freeze().
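
A minimal sketch of the resulting loop body in kthread_worker_fn(),
simplified, with the added state reset marked:

  set_current_state(TASK_INTERRUPTIBLE);
  ...
  if (work) {
          __set_current_state(TASK_RUNNING);
          work->func(work);
  } else if (!freezing(current)) {
          schedule();
  } else {
          /*
           * The fix: schedule() is skipped because a freeze is pending,
           * so explicitly drop back to TASK_RUNNING before calling
           * try_to_freeze(), which may sleep.
           */
          __set_current_state(TASK_RUNNING);
  }

  try_to_freeze();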

Fixes: b56c0d8937 ("kthread: implement kthread_worker")
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Suggested-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/lkml/Zs2ZoAcUsZMX2B%2FI@chenyu5-mobl2/ [1]
2024-09-10 09:51:14 +02:00
Chen Yu
84d265281d sched/pelt: Use rq_clock_task() for hw_pressure
commit 97450eb909 ("sched/pelt: Remove shift of thermal clock")
removed the decay_shift for hw_pressure. This commit uses the
sched_clock_task() in sched_tick() while it replaces the
sched_clock_task() with rq_clock_pelt() in __update_blocked_others().
This could bring inconsistency. One possible scenario I can think of
is in ___update_load_sum():

  u64 delta = now - sa->last_update_time

'now' could be calculated by rq_clock_pelt() from
__update_blocked_others(), and last_update_time was calculated by
rq_clock_task() previously from sched_tick(). Usually the former
lags behind the latter, which causes a very large 'delta' and brings
unexpected behavior.
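
A sketch of the kind of change this implies in __update_blocked_others();
the exact call site shape is an assumption:

  /* before: hw_pressure was decayed with the PELT clock */
  decayed |= update_hw_load_avg(rq_clock_pelt(rq), rq, hw_pressure);

  /* after: use the task clock, so 'now' and sa->last_update_time
   * (set from sched_tick()) come from the same clock */
  decayed |= update_hw_load_avg(rq_clock_task(rq), rq, hw_pressure);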

Fixes: 97450eb909 ("sched/pelt: Remove shift of thermal clock")
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Hongyan Xia <hongyan.xia2@arm.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lkml.kernel.org/r/20240827112607.181206-1-yu.c.chen@intel.com
2024-09-10 09:51:14 +02:00
Vincent Guittot
5d871a6399 sched/fair: Move effective_cpu_util() and effective_cpu_util() in fair.c
Move the effective_cpu_util() and sched_cpu_util() functions into fair.c
with the other utilization-related functions.

No functional change.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20240904092417.20660-1-vincent.guittot@linaro.org
2024-09-10 09:51:14 +02:00
Peter Zijlstra
3dcac251b0 sched/core: Introduce SM_IDLE and an idle re-entry fast-path in __schedule()
Since commit b2a02fc43a ("smp: Optimize
send_call_function_single_ipi()") an idle CPU in TIF_POLLING_NRFLAG mode
can be pulled out of idle by setting TIF_NEED_RESCHED flag to service an
IPI without actually sending an interrupt. Even in cases where the IPI
handler does not queue a task on the idle CPU, do_idle() will call
__schedule() since need_resched() returns true in these cases.

Introduce and use SM_IDLE to identify call to __schedule() from
schedule_idle() and shorten the idle re-entry time by skipping
pick_next_task() when nr_running is 0 and the previous task is the idle
task.
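
Roughly, the fast path looks like the following inside __schedule(); labels
and the surrounding dequeue/pick logic are simplified here:

  if (sched_mode == SM_IDLE) {
          /* prev is the idle task; with nothing runnable, skip
           * pick_next_task() and go straight back to idle */
          if (!rq->nr_running) {
                  next = prev;
                  goto picked;
          }
  } else if (prev_state) {
          /* regular dequeue/delayed-dequeue handling */
          ...
  }

  next = pick_next_task(rq, prev, &rf);
picked:
  ...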

With the SM_IDLE fast-path, the time taken to complete a fixed set of
IPIs using ipistorm improves noticeably. Following are the numbers
from a dual socket Intel Ice Lake Xeon server (2 x 32C/64T) and
3rd Generation AMD EPYC system (2 x 64C/128T) (boost on, C2 disabled)
running ipistorm between CPU8 and CPU16:

cmdline: insmod ipistorm.ko numipi=100000 single=1 offset=8 cpulist=8 wait=1

   ==================================================================
   Test          : ipistorm (modified)
   Units         : Normalized runtime
   Interpretation: Lower is better
   Statistic     : AMean
   ======================= Intel Ice Lake Xeon ======================
   kernel:				time [pct imp]
   tip:sched/core			1.00 [baseline]
   tip:sched/core + SM_IDLE		0.80 [20.51%]
   ==================== 3rd Generation AMD EPYC =====================
   kernel:				time [pct imp]
   tip:sched/core			1.00 [baseline]
   tip:sched/core + SM_IDLE		0.90 [10.17%]
   ==================================================================

[ kprateek: Commit message, SM_RTLOCK_WAIT fix ]

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Not-yet-signed-off-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lore.kernel.org/r/20240809092240.6921-1-kprateek.nayak@amd.com
2024-09-10 09:51:14 +02:00
Sean Anderson
038eb433dc dma-mapping: add tracing for dma-mapping API calls
When debugging drivers, it can often be useful to trace when memory gets
(un)mapped for DMA (and can be accessed by the device). Add some
tracepoints for this purpose.

Use u64 instead of phys_addr_t and dma_addr_t (and similarly %llx instead
of %pa) because libtraceevent can't handle typedefs in all cases.

Signed-off-by: Sean Anderson <sean.anderson@linux.dev>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2024-09-10 07:47:19 +03:00
Jinjie Ruan
546f02823d user_namespace: use kmemdup_array() instead of kmemdup() for multiple allocation
Let kmemdup_array() take care of the multiplication and possible
overflows.

Link: https://lkml.kernel.org/r/20240828072340.1249310-1-ruanjinjie@huawei.com
Signed-off-by: Jinjie Ruan <ruanjinjie@huawei.com>
Reviewed-by: Kees Cook <kees@kernel.org>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Li zeming <zeming@nfschina.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-09-09 16:47:42 -07:00
Tejun Heo
4c30f5ce4f sched_ext: Implement scx_bpf_dispatch[_vtime]_from_dsq()
Once a task is put into a DSQ, the allowed operations are fairly limited.
Tasks in the built-in local and global DSQs are executed automatically and,
ignoring dequeue, there is only one way a task in a user DSQ can be
manipulated - scx_bpf_consume() moves the first task to the dispatching
local DSQ. This inflexibility sometimes gets in the way and is an area where
multiple feature requests have been made.

Implement scx_bpf_dispatch[_vtime]_from_dsq(), which can be called during
DSQ iteration and can move the task to any DSQ - local DSQs, global DSQ and
user DSQs. The kfuncs can be called from ops.dispatch() and any BPF context
which doesn't hold a rq lock, including BPF timers and SYSCALL programs.

This is an expansion of an earlier patch which only allowed moving into the
dispatching local DSQ:

  http://lkml.kernel.org/r/Zn4Cw4FDTmvXnhaf@slm.duckdns.org

v2: Remove @slice and @vtime from scx_bpf_dispatch_from_dsq[_vtime]() as
    they push scx_bpf_dispatch_from_dsq_vtime() over the kfunc argument
    count limit and often won't be needed anyway. Instead provide
    scx_bpf_dispatch_from_dsq_set_{slice|vtime}() kfuncs which can be called
    only when needed and override the specified parameter for the subsequent
    dispatch.
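
For illustration, an ops.dispatch() implementation could use the new kfuncs
roughly as below. The BPF_STRUCT_OPS and bpf_for_each()/BPF_FOR_EACH_ITER
macros and the MY_DSQ id reflect the sched_ext BPF tooling conventions and
are assumptions here, not part of this patch:

  void BPF_STRUCT_OPS(example_dispatch, s32 cpu, struct task_struct *prev)
  {
          struct task_struct *p;

          /* walk a user DSQ and move the first runnable-here task to
           * this CPU's local DSQ */
          bpf_for_each(scx_dsq, p, MY_DSQ, 0) {
                  if (!bpf_cpumask_test_cpu(cpu, p->cpus_ptr))
                          continue;
                  if (scx_bpf_dispatch_from_dsq(BPF_FOR_EACH_ITER, p,
                                                SCX_DSQ_LOCAL, 0))
                          break;
          }
  }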

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Daniel Hodges <hodges.daniel.scott@gmail.com>
Cc: David Vernet <void@manifault.com>
Cc: Changwoo Min <multics69@gmail.com>
Cc: Andrea Righi <andrea.righi@linux.dev>
Cc: Dan Schatzberg <schatzberg.dan@gmail.com>
2024-09-09 13:42:47 -10:00
Tejun Heo
6462dd53a2 sched_ext: Compact struct bpf_iter_scx_dsq_kern
struct bpf_iter_scx_dsq is defined as 6 u64's and bpf_iter_scx_dsq_kern was
using 5 of them. We want to add two more u64 fields but it's better if we do
so while staying within bpf_iter_scx_dsq to maintain binary compatibility.

The way bpf_iter_scx_dsq_kern is laid out is rather inefficient - the node
field takes up three u64's but only one bit of the last u64 is used. Turn
the bool into u32 flags and only use the lower 16 bits freeing up 48 bits -
16 bits for flags, 32 bits for a u32 - for use by struct
bpf_iter_scx_dsq_kern.

This allows moving the dsq_seq and flags fields of bpf_iter_scx_dsq_kern
into the cursor field reducing the struct size by a full u64.

No behavior changes intended.

Signed-off-by: Tejun Heo <tj@kernel.org>
2024-09-09 13:42:47 -10:00
Tejun Heo
cf3e94430d sched_ext: Replace consume_local_task() with move_local_task_to_local_dsq()
- Rename move_task_to_local_dsq() to move_remote_task_to_local_dsq().

- Rename consume_local_task() to move_local_task_to_local_dsq() and remove
  task_unlink_from_dsq() and source DSQ unlocking from it.

This is to make the migration code easier to reuse.

No functional changes intended.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: David Vernet <void@manifault.com>
2024-09-09 13:42:47 -10:00
Tejun Heo
d434210e13 sched_ext: Move consume_local_task() upward
So that the local case comes first and two CONFIG_SMP blocks can be merged.

No functional changes intended.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: David Vernet <void@manifault.com>
2024-09-09 13:42:47 -10:00
Tejun Heo
6557133ecd sched_ext: Move sanity check and dsq_mod_nr() into task_unlink_from_dsq()
All task_unlink_from_dsq() users are doing dsq_mod_nr(dsq, -1). Move it into
task_unlink_from_dsq(). Also move sanity check into it.

No functional changes intended.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: David Vernet <void@manifault.com>
2024-09-09 13:42:47 -10:00
Tejun Heo
1389f49098 sched_ext: Reorder args for consume_local/remote_task()
Reorder args for consistency in the order of:

  current_rq, p, src_[rq|dsq], dst_[rq|dsq].

No functional changes intended.

Signed-off-by: Tejun Heo <tj@kernel.org>
2024-09-09 13:42:47 -10:00
Tejun Heo
18f856991d sched_ext: Restructure dispatch_to_local_dsq()
Now that there's nothing left after the big if block, flip the if condition
and unindent the body.

No functional changes intended.

v2: Add BUG() to clarify control can't reach the end of
    dispatch_to_local_dsq() in UP kernels per David.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: David Vernet <void@manifault.com>
2024-09-09 13:42:47 -10:00
Tejun Heo
0aab26309e sched_ext: Fix processs_ddsp_deferred_locals() by unifying DTL_INVALID handling
With the preceding update, the only return value which makes meaningful
difference is DTL_INVALID, for which one caller, finish_dispatch(), falls
back to the global DSQ and the other, process_ddsp_deferred_locals(),
doesn't do anything.

It should always fall back to the global DSQ. Move the global DSQ fallback
into dispatch_to_local_dsq() and remove the return value.

v2: Patch title and description updated to reflect the behavior fix for
    process_ddsp_deferred_locals().

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: David Vernet <void@manifault.com>
2024-09-09 13:42:47 -10:00
Tejun Heo
e683949a4b sched_ext: Make find_dsq_for_dispatch() handle SCX_DSQ_LOCAL_ON
find_dsq_for_dispatch() handles all DSQ IDs except SCX_DSQ_LOCAL_ON.
Instead, each caller is handling SCX_DSQ_LOCAL_ON before calling it. Move
SCX_DSQ_LOCAL_ON lookup into find_dsq_for_dispatch() to remove duplicate
code in direct_dispatch() and dispatch_to_local_dsq().

No functional changes intended.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: David Vernet <void@manifault.com>
2024-09-09 13:42:47 -10:00
Tejun Heo
4d3ca89bdd sched_ext: Refactor consume_remote_task()
The tricky p->scx.holding_cpu handling was split across
consume_remote_task() body and move_task_to_local_dsq(). Refactor such that:

- All the tricky part is now in the new unlink_dsq_and_lock_src_rq() with
  consolidated documentation.

- move_task_to_local_dsq() now implements straightforward task migration
  making it easier to use in other places.

- dispatch_to_local_dsq() is another user of move_task_to_local_dsq(). The
  usage is updated accordingly. This makes the local and remote cases more
  symmetric.

No functional changes intended.

v2: s/task_rq/src_rq/ for consistency.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: David Vernet <void@manifault.com>
2024-09-09 13:42:47 -10:00
Tejun Heo
fdaedba2f9 sched_ext: Rename scx_kfunc_set_sleepable to unlocked and relocate
Sleepable kfuncs don't need to be in their own kfunc set as each is tagged
with KF_SLEEPABLE. Rename the set to scx_kfunc_set_unlocked, indicating that
the rq lock is not held, and relocate it right above the "any" set. This will
be used to add kfuncs that are allowed to be called from SYSCALL but not
TRACING.

No functional changes intended.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: David Vernet <void@manifault.com>
2024-09-09 13:42:47 -10:00
Kinsey Ho
0e40cf2a8b cgroup: clarify css sibling linkage is protected by cgroup_mutex or RCU
Patch series "Improve mem_cgroup_iter()", v4.

Incremental cgroup iteration is being used again [1]. This patchset
improves the reliability of mem_cgroup_iter(). It also improves
simplicity and code readability.

[1] https://lore.kernel.org/20240514202641.2821494-1-hannes@cmpxchg.org/


This patch (of 5):

Explicitly document that css sibling/descendant linkage is protected by
cgroup_mutex or RCU.  Also, document in css_next_descendant_pre() and
similar functions that it isn't necessary to hold a ref on @pos.

The following changes in this patchset rely on this clarification for
simplification in memcg iteration code.

Link: https://lkml.kernel.org/r/20240905003058.1859929-1-kinseyho@google.com
Link: https://lkml.kernel.org/r/20240905003058.1859929-2-kinseyho@google.com
Suggested-by: Yosry Ahmed <yosryahmed@google.com>
Reviewed-by: Michal Koutný <mkoutny@suse.com>
Signed-off-by: Kinsey Ho <kinseyho@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Tejun Heo <tj@kernel.org>
Cc: Zefan Li <lizefan.x@bytedance.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: T.J. Mercier <tjmercier@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-09-09 16:39:15 -07:00
Sven Schnelle
08e28de116 uprobes: use vm_special_mapping close() functionality
The following KASAN splat was shown:

[   44.505448] ==================================================================
[   44.505455] BUG: KASAN: slab-use-after-free in special_mapping_close+0x9c/0xc8
[   44.505471] Read of size 8 at addr 00000000868dac48 by task sh/1384
[   44.505479]
[   44.505486] CPU: 51 UID: 0 PID: 1384 Comm: sh Not tainted 6.11.0-rc6-next-20240902-dirty #1496
[   44.505503] Hardware name: IBM 3931 A01 704 (z/VM 7.3.0)
[   44.505508] Call Trace:
[   44.505511]  [<000b0324d2f78080>] dump_stack_lvl+0xd0/0x108
[   44.505521]  [<000b0324d2f5435c>] print_address_description.constprop.0+0x34/0x2e0
[   44.505529]  [<000b0324d2f5464c>] print_report+0x44/0x138
[   44.505536]  [<000b0324d1383192>] kasan_report+0xc2/0x140
[   44.505543]  [<000b0324d2f52904>] special_mapping_close+0x9c/0xc8
[   44.505550]  [<000b0324d12c7978>] remove_vma+0x78/0x120
[   44.505557]  [<000b0324d128a2c6>] exit_mmap+0x326/0x750
[   44.505563]  [<000b0324d0ba655a>] __mmput+0x9a/0x370
[   44.505570]  [<000b0324d0bbfbe0>] exit_mm+0x240/0x340
[   44.505575]  [<000b0324d0bc0228>] do_exit+0x548/0xd70
[   44.505580]  [<000b0324d0bc1102>] do_group_exit+0x132/0x390
[   44.505586]  [<000b0324d0bc13b6>] __s390x_sys_exit_group+0x56/0x60
[   44.505592]  [<000b0324d0adcbd6>] do_syscall+0x2f6/0x430
[   44.505599]  [<000b0324d2f78434>] __do_syscall+0xa4/0x170
[   44.505606]  [<000b0324d2f9454c>] system_call+0x74/0x98
[   44.505614]
[   44.505616] Allocated by task 1384:
[   44.505621]  kasan_save_stack+0x40/0x70
[   44.505630]  kasan_save_track+0x28/0x40
[   44.505636]  __kasan_kmalloc+0xa0/0xc0
[   44.505642]  __create_xol_area+0xfa/0x410
[   44.505648]  get_xol_area+0xb0/0xf0
[   44.505652]  uprobe_notify_resume+0x27a/0x470
[   44.505657]  irqentry_exit_to_user_mode+0x15e/0x1d0
[   44.505664]  pgm_check_handler+0x122/0x170
[   44.505670]
[   44.505672] Freed by task 1384:
[   44.505676]  kasan_save_stack+0x40/0x70
[   44.505682]  kasan_save_track+0x28/0x40
[   44.505687]  kasan_save_free_info+0x4a/0x70
[   44.505693]  __kasan_slab_free+0x5a/0x70
[   44.505698]  kfree+0xe8/0x3f0
[   44.505704]  __mmput+0x20/0x370
[   44.505709]  exit_mm+0x240/0x340
[   44.505713]  do_exit+0x548/0xd70
[   44.505718]  do_group_exit+0x132/0x390
[   44.505722]  __s390x_sys_exit_group+0x56/0x60
[   44.505727]  do_syscall+0x2f6/0x430
[   44.505732]  __do_syscall+0xa4/0x170
[   44.505738]  system_call+0x74/0x98

The problem is that uprobe_clear_state() kfree's struct xol_area, which
contains struct vm_special_mapping *xol_mapping. This one is passed to
_install_special_mapping() in xol_add_vma().
__mmput() reads:

static inline void __mmput(struct mm_struct *mm)
{
        VM_BUG_ON(atomic_read(&mm->mm_users));

        uprobe_clear_state(mm);
        exit_aio(mm);
        ksm_exit(mm);
        khugepaged_exit(mm); /* must run before exit_mmap */
        exit_mmap(mm);
        ...
}

So uprobe_clear_state() at the beginning frees the memory area
containing the vm_special_mapping data, but exit_mmap() uses this
address later via vma->vm_private_data (which was set in
_install_special_mapping()).

Fix this by moving uprobe_clear_state() to uprobes.c and using it as
the close() callback.

[usama.anjum@collabora.com: remove unneeded condition]
  Link: https://lkml.kernel.org/r/20240906101825.177490-1-usama.anjum@collabora.com
Link: https://lkml.kernel.org/r/20240903073629.2442754-1-svens@linux.ibm.com
Fixes: 223febc6e5 ("mm: add optional close() to struct vm_special_mapping")
Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Ian Rogers <irogers@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Kan Liang <kan.liang@linux.intel.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-09-09 16:39:14 -07:00
Tejun Heo
3ac352797c sched_ext: Add missing static to scx_dump_data
scx_dump_data is only used inside ext.c but doesn't have static. Add it.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202409070218.RB5WsQ07-lkp@intel.com/
2024-09-09 13:34:33 -10:00
Maxim Mikityanskiy
bee109b7b3 bpf: Fix error message on kfunc arg type mismatch
When "arg#%d expected pointer to ctx, but got %s" error is printed, both
template parts actually point to the type of the argument, therefore, it
will also say "but got PTR", regardless of what was the actual register
type.

Fix the message to print the register type in the second part of the
template, change the existing test to adapt to the new format, and add a
new test to test the case when arg is a pointer to context, but reg is a
scalar.
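
A sketch of the shape of the fix in the verifier's kfunc argument check;
variable names and the surrounding context are assumptions:

  /* before: the "got" part was derived from the expected argument type,
   * so it always printed "PTR" */
  verbose(env, "arg#%d expected pointer to ctx, but got %s\n",
          i, btf_type_str(t));

  /* after: print what the register actually holds */
  verbose(env, "arg#%d expected pointer to ctx, but got %s\n",
          i, reg_type_str(env, reg->type));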

Fixes: 00b85860fe ("bpf: Rewrite kfunc argument handling")
Signed-off-by: Maxim Mikityanskiy <maxim@isovalent.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/bpf/20240909133909.1315460-1-maxim@isovalent.com
2024-09-09 15:58:17 -07:00
Andy Shevchenko
4e378158e5 tracing: Drop unused helper function to fix the build
A helper function is defined but not used. This, in particular,
prevents kernel builds with clang, `make W=1` and CONFIG_WERROR=y:

kernel/trace/trace.c:2229:19: error: unused function 'run_tracer_selftest' [-Werror,-Wunused-function]
 2229 | static inline int run_tracer_selftest(struct tracer *type)
      |                   ^~~~~~~~~~~~~~~~~~~

Fix this by dropping unused functions.

See also commit 6863f5643d ("kbuild: allow Clang to find unused static
inline functions for W=1 build").
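
For context, the dropped stub is the !CONFIG_FTRACE_STARTUP_TEST fallback,
roughly:

  #else /* !CONFIG_FTRACE_STARTUP_TEST */
  static inline int run_tracer_selftest(struct tracer *type)
  {
          return 0;
  }
  #endif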

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Nathan Chancellor <nathan@kernel.org>
Cc: Nick Desaulniers <ndesaulniers@google.com>
Cc: Bill Wendling <morbo@google.com>
Cc: Justin Stitt <justinstitt@google.com>
Link: https://lore.kernel.org/20240909105314.928302-1-andriy.shevchenko@linux.intel.com
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-09-09 16:04:25 -04:00
Steven Rostedt
af17814334 tracing/osnoise: Fix build when timerlat is not enabled
To fix some critical section races, the interface_lock was added to a few
locations. One of those locations was above where the interface_lock was
declared, so the declaration was moved up before that usage.
Unfortunately, where it was placed was inside a CONFIG_TIMERLAT_TRACER
ifdef block. As the interface_lock is used outside that config, this broke
the build when CONFIG_OSNOISE_TRACER was enabled but
CONFIG_TIMERLAT_TRACER was not.

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: "Helena Anna" <helena.anna.dubel@intel.com>
Cc: "Luis Claudio R. Goncalves" <lgoncalv@redhat.com>
Cc: Tomas Glozar <tglozar@redhat.com>
Link: https://lore.kernel.org/20240909103231.23a289e2@gandalf.local.home
Fixes: e6a53481da ("tracing/timerlat: Only clear timer if a kthread exists")
Reported-by: "Bityutskiy, Artem" <artem.bityutskiy@intel.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-09-09 16:03:57 -04:00
Yu Liao
3e5b2e81f1 printk: Export match_devname_and_update_preferred_console()
When building serial_base as a module, modpost fails with the following
error message:

  ERROR: modpost: "match_devname_and_update_preferred_console"
  [drivers/tty/serial/serial_base.ko] undefined!

Export the symbol to allow using it from modules.

Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202409071312.qlwtTOS1-lkp@intel.com/
Fixes: 12c91cec31 ("serial: core: Add serial_base_match_and_update_preferred_console()")
Signed-off-by: Yu Liao <liaoyu15@huawei.com>
Link: https://lore.kernel.org/r/20240909075652.747370-1-liaoyu15@huawei.com
Signed-off-by: Petr Mladek <pmladek@suse.com>
2024-09-09 17:35:06 +02:00
Anna-Maria Behnsen
bd7c8ff9fe treewide: Fix wrong singular form of jiffies in comments
There are several comments all over the place which use a wrong singular
form of jiffies.

Replace 'jiffie' by 'jiffy'. No functional change.

Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> # m68k
Link: https://lore.kernel.org/all/20240904-devel-anna-maria-b4-timers-flseep-v1-3-e98760256370@linutronix.de
2024-09-08 20:47:40 +02:00
Anna-Maria Behnsen
662a1bfb90 cpu: Use already existing usleep_range()
usleep_range() is a wrapper around usleep_range_state() which hands in
TASK_UNINTERRUPTIBLE as the state argument.

Use the already existing wrapper usleep_range(). No functional change.
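
The change amounts to the following substitution (a sketch; the min/max
values below are placeholders for whatever the call site already uses):

  /* before */
  usleep_range_state(10000, 20000, TASK_UNINTERRUPTIBLE);

  /* after: usleep_range() already passes TASK_UNINTERRUPTIBLE */
  usleep_range(10000, 20000);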

Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/all/20240904-devel-anna-maria-b4-timers-flseep-v1-2-e98760256370@linutronix.de
2024-09-08 20:47:40 +02:00
Anna-Maria Behnsen
fe90c5ba88 timers: Rename next_expiry_recalc() to be unique
next_expiry_recalc is the name of a function as well as the name of a
struct member of struct timer_base. This might lead to confusion.

Rename next_expiry_recalc() to timer_recalc_next_expiry(). No functional
change.

Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/all/20240904-devel-anna-maria-b4-timers-flseep-v1-1-e98760256370@linutronix.de
2024-09-08 20:47:40 +02:00
Neeraj Upadhyay
355debb83b Merge branches 'context_tracking.15.08.24a', 'csd.lock.15.08.24a', 'nocb.09.09.24a', 'rcutorture.14.08.24a', 'rcustall.09.09.24a', 'srcu.12.08.24a', 'rcu.tasks.14.08.24a', 'rcu_scaling_tests.15.08.24a', 'fixes.12.08.24a' and 'misc.11.08.24a' into next.09.09.24a 2024-09-09 00:09:47 +05:30
Paul E. McKenney
1ecd9d68eb rcu: Defer printing stall-warning backtrace when holding rcu_node lock
The rcu_dump_cpu_stacks() function holds the leaf rcu_node structure's ->lock
when dumping the stacks of any CPUs stalling the current grace period.
This lock is held to prevent confusion that would otherwise occur when
the stalled CPU reported its quiescent state (and then went on to do
unrelated things) just as the backtrace NMI was heading towards it.

This has worked well, but on larger systems has recently been observed
to cause severe lock contention resulting in CSD-lock stalls and other
general unhappiness.

This commit therefore does printk_deferred_enter() before acquiring
the lock and printk_deferred_exit() after releasing it, thus deferring
the overhead of actually outputting the stack trace out of that lock's
critical section.
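
In sketch form, the change wraps the locked section like this (stack-dump
details elided):

  printk_deferred_enter();
  raw_spin_lock_irqsave_rcu_node(rnp, flags);
  /* ... decide which CPUs are stalling and trigger their backtraces ... */
  raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
  printk_deferred_exit();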

Reported-by: Rik van Riel <riel@surriel.com>
Suggested-by: Rik van Riel <riel@surriel.com>
Signed-off-by: "Paul E. McKenney" <paulmck@kernel.org>
Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
2024-09-09 00:06:44 +05:30
Frederic Weisbecker
7562eed272 rcu/nocb: Remove superfluous memory barrier after bypass enqueue
Pre-GP accesses performed by the update side must be ordered against
post-GP accesses performed by the readers. This is ensured by the
bypass or nocb locking on enqueue time, followed by the fully ordered
rnp locking initiated while callbacks are accelerated, and then
propagated throughout the whole GP lifecycle associated with the
callbacks.

Therefore the explicit barrier advertising ordering between bypass
enqueue and rcuo wakeup is superfluous. If anything, it would even only
order the first bypass callback enqueue against the rcuo wakeup and
ignore all the subsequent ones.

Remove the needless barrier.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
2024-09-09 00:05:26 +05:30
Frederic Weisbecker
1b022b8763 rcu/nocb: Conditionally wake up rcuo if not already waiting on GP
A callback enqueuer currently wakes up the rcuo kthread if it is adding
the first non-done callback of a CPU, whether the kthread is waiting on
a grace period or not (unless the CPU is offline).

This looks like a desired behaviour because then the rcuo kthread
doesn't wait for the end of the current grace period to handle the
callback. It is accelerated right away and assigned to the next grace
period. The GP kthread is notified about that fact and iterates with
the upcoming GP without sleeping in-between.

However this best-case scenario is contradicted by a few details,
depending on the situation:

1) If the callback is a non-bypass one queued with IRQs enabled, the
   wake up only occurs if no other pending callbacks are on the list.
   Therefore the theoretical "optimization" actually applies on rare
   occasions.

2) If the callback is a non-bypass one queued with IRQs disabled, the
   situation is similar with even more uncertainty due to the deferred
   wake up.

3) If the callback is lazy, a few jiffies don't make any difference.

4) If the callback is bypass, the wake up timer is programmed 2 jiffies
   ahead by rcuo in case the regular pending queue has been handled
   in the meantime. The rare storm of callbacks can otherwise wait for
   the currently elapsing grace period to be flushed and handled.

For all those reasons, the optimization is only theoretical and
occasional. Therefore it is reasonable that callbacks enqueuers only
wake up the rcuo kthread when it is not already waiting on a grace
period to complete.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
2024-09-09 00:05:26 +05:30
Frederic Weisbecker
9139f93209 rcu/nocb: Fix RT throttling hrtimer armed from offline CPU
After a CPU is marked offline and until it reaches its final trip to
idle, rcuo has several opportunities to be woken up, either because
a callback has been queued in the meantime or because
rcutree_report_cpu_dead() has issued the final deferred NOCB wake up.

If RCU-boosting is enabled, RCU kthreads are set to SCHED_FIFO policy.
And if RT-bandwidth is enabled, the related hrtimer might be armed.
However this then happens after hrtimers have been migrated at the
CPUHP_AP_HRTIMERS_DYING stage, which is broken as reported by the
following warning:

 Call trace:
  enqueue_hrtimer+0x7c/0xf8
  hrtimer_start_range_ns+0x2b8/0x300
  enqueue_task_rt+0x298/0x3f0
  enqueue_task+0x94/0x188
  ttwu_do_activate+0xb4/0x27c
  try_to_wake_up+0x2d8/0x79c
  wake_up_process+0x18/0x28
  __wake_nocb_gp+0x80/0x1a0
  do_nocb_deferred_wakeup_common+0x3c/0xcc
  rcu_report_dead+0x68/0x1ac
  cpuhp_report_idle_dead+0x48/0x9c
  do_idle+0x288/0x294
  cpu_startup_entry+0x34/0x3c
  secondary_start_kernel+0x138/0x158

Fix this by waking up rcuo using an IPI if necessary. Since the
existing API to deal with this situation only handles swait queue, rcuo
is only woken up from offline CPUs if it's not already waiting on a
grace period. In the worst case some callbacks will just wait for a
grace period to complete before being assigned to a subsequent one.

Reported-by: "Cheng-Jui Wang (王正睿)" <Cheng-Jui.Wang@mediatek.com>
Fixes: 5c0930ccaa ("hrtimers: Push pending hrtimers away from outgoing CPU earlier")
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
2024-09-09 00:05:26 +05:30
Frederic Weisbecker
1fcb932c8b rcu/nocb: Simplify (de-)offloading state machine
Now that the (de-)offloading process can only apply to offline CPUs,
there is no more concurrency between rcu_core and nocb kthreads. Also
the mutation now happens on empty queues.

Therefore the state machine can be reduced to a single bit called
SEGCBLIST_OFFLOADED. Simplify the transition as follows:

* Upon offloading: queue the rdp to be added to the rcuog list and
  wait for the rcuog kthread to set the SEGCBLIST_OFFLOADED bit. Unpark
  rcuo kthread.

* Upon de-offloading: Park rcuo kthread. Queue the rdp to be removed
  from the rcuog list and wait for the rcuog kthread to clear the
  SEGCBLIST_OFFLOADED bit.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
2024-09-09 00:03:55 +05:30
Linus Torvalds
e20398877b - Fix perf's AUX buffer serialization
- Prevent uninitialized struct members in perf's uprobes handling
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmbdaUMACgkQEsHwGGHe
 VUoo5hAAkDYx/gqFiU4Zqr4EXu6mfG5qFRnSE5PMsgGYDt1gE+dY6Xugs5vYa7uh
 AzzqcFLw46ZbrOjXv359WBxljYMQCnFI9SbP/1pAYqtUs1X1q3bMl6iuYbHU8DkB
 NHaSCmcyxPBLANezxka554pg0Yqsb/ME4tnxomVH65GosgfG4dxCOpGB8S1jB9Wt
 g8TeXn+pEYwn50wFOTA2MTy+OtwcJZxl1cPRLhJGywY20znJrU0OAFTySdZeAfjm
 3ekMau9coXErmETsiTj5+B6ornWfCvGgYMFpZxj4lkWppJEoxEovzOauUSgkxEjZ
 qM056212tqfTYHVC6SO70mKkRcGQBD3FEQFi7+Ugv9GVIhzML5UN9z0eIKCNvcvU
 dWTCaFPPG1/WwlsKXKaaCJkvt6f+rGuL2zdyZczeiiKlcyvuABSZv/9DscxmhQUh
 5n2ZfigNXTnjUj0c2LxjBuXFmHrdbLnz5IGVr/9Ux0euXSBWJR+1HNoGpWTSHFWy
 aHioF3rgPHMvV0YVzpzb5Arz+ldUEV+ymHwtWOGuxGAtyk7SydpkbKEqZ1AYXyUX
 FEeRP/ryYw8FxTOvsNvpB85X24YDG/LrUgJdX7fbYeZjlm6Nd8IpU8LKdyLTmhmg
 YuIENCa+U6RQZd1dsRW4SqdOuackRyjH4pcQqZsg5i4nNczH+Z4=
 =Alrr
 -----END PGP SIGNATURE-----

Merge tag 'perf_urgent_for_v6.11_rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull perf fixes from Borislav Petkov:

 - Fix perf's AUX buffer serialization

 - Prevent uninitialized struct members in perf's uprobes handling

* tag 'perf_urgent_for_v6.11_rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf/aux: Fix AUX buffer serialization
  uprobes: Use kzalloc to allocate xol area
2024-09-08 10:20:44 -07:00
Costa Shulyupin
a6fe30d1e3 genirq: Use cpumask_intersects()
Replace `cpumask_any_and(a, b) >= nr_cpu_ids` and `cpumask_any_and(a, b) <
nr_cpu_ids` with the more readable `!cpumask_intersects(a, b)` and
`cpumask_intersects(a, b)`, respectively.
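
The transformation is mechanical; for example (mask names and return value
are placeholders):

  /* before */
  if (cpumask_any_and(mask, cpu_online_mask) >= nr_cpu_ids)
          return -EINVAL;

  /* after */
  if (!cpumask_intersects(mask, cpu_online_mask))
          return -EINVAL;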

Signed-off-by: Costa Shulyupin <costa.shul@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20240906170142.1135207-1-costa.shul@redhat.com
2024-09-08 16:06:51 +02:00
Tejun Heo
02e65e1c12 sched_ext: Add missing static to scx_has_op[]
scx_has_op[] is only used inside ext.c but doesn't have static. Add it.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202409062337.m7qqI88I-lkp@intel.com/
2024-09-06 08:18:55 -10:00
Tejun Heo
da330f5e4c sched_ext: Temporarily work around pick_task_scx() being called without balance_scx()
pick_task_scx() must be preceded by balance_scx() but there currently is a
bug where fair could say yes on balance() but no on pick_task(), which then
ends up calling pick_task_scx() without preceding balance_scx(). Work around
by dropping WARN_ON_ONCE() and ignoring cases which don't make sense.

This isn't great and can theoretically lead to stalls. However, for
switch_all cases, this happens only while a BPF scheduler is being loaded or
unloaded, and, for partial cases, fair will likely keep triggering this CPU.

This will be reverted once the fair behavior is fixed.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
2024-09-06 08:17:09 -10:00
Thomas Gleixner
fe513c2ef0 static_call: Replace pointless WARN_ON() in static_call_module_notify()
static_call_module_notify() triggers a WARN_ON() when memory allocation
fails in __static_call_add_module().

That's not really justified, because the failure case must be correctly
handled by the well known call chain and the error code is passed
through to the initiating userspace application.

A memory allocation fail is not a fatal problem, but the WARN_ON() takes
the machine out when panic_on_warn is set.

Replace it with a pr_warn().
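
A rough sketch of the resulting notifier branch; the exact message text is
an assumption:

  case MODULE_STATE_COMING:
          ret = static_call_add_module(mod);
          if (ret)
                  pr_warn("Failed to allocate memory for static calls\n");
          break;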

Fixes: 9183c3f9ed ("static_call: Add inline static call infrastructure")
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/8734mf7pmb.ffs@tglx
2024-09-06 16:29:22 +02:00
Thomas Gleixner
4b30051c48 static_call: Handle module init failure correctly in static_call_del_module()
Module insertion invokes static_call_add_module() to initialize the static
calls in a module. static_call_add_module() invokes __static_call_init(),
which allocates a struct static_call_mod to either encapsulate the built-in
static call sites of the associated key into it so further modules can be
added or to append the module to the module chain.

If that allocation fails the function returns with an error code and the
module core invokes static_call_del_module() to clean up eventually added
static_call_mod entries.

This works correctly when all keys used by the module were converted over
to a module chain before the failure. If not then static_call_del_module()
causes a #GP as it blindly assumes that key::mods points to a valid struct
static_call_mod.

The problem is that key::mods is not an individual struct member of struct
static_call_key; it's part of a union to save space:

        union {
                /* bit 0: 0 = mods, 1 = sites */
                unsigned long type;
                struct static_call_mod *mods;
                struct static_call_site *sites;
	};

key::sites is a pointer to the list of built-in usage sites of the static
call. The type of the pointer is differentiated by bit 0. A mods pointer
has the bit clear, the sites pointer has the bit set.

As static_call_del_module() blindly assumes that the pointer is a valid
static_call_mod type, it fails to check for this failure case and
dereferences the pointer to the list of built-in call sites, which is
obviously bogus.

Cure it by checking whether the key has a sites or a mods pointer.

If it's a sites pointer then the key is not to be touched. As the sites are
walked in the same order as in __static_call_init() the site walk can be
terminated because all subsequent sites have not been touched by the init
code due to the error exit.

If it was converted before the allocation failure, then the inner loop which
searches for a module match will find nothing.

A failure in the second allocation in __static_call_init() is harmless and
does not require special treatment. The first allocation succeeded and
converted the key to a module chain. That first entry has mod::mod == NULL
and mod::next == NULL, so the inner loop of static_call_del_module() will
neither find a module match nor a module chain. The next site in the walk
was either already converted, but can't match the module, or it will exit
the outer loop because it has a static_call_site pointer and not a
static_call_mod pointer.
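
In sketch form, the added guard in static_call_del_module() tests bit 0 of
the union shown above before treating the pointer as a module chain; the
loop details are simplified:

  for (site = start; site < stop; site++) {
          key = static_call_key(site);

          /*
           * If the key still carries a static_call_site pointer (bit 0
           * set), __static_call_init() never converted it, nor any of
           * the following sites. Nothing to clean up.
           */
          if (key->type & 1)
                  break;

          /* ... otherwise walk key->mods and unlink this module ... */
  }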

Fixes: 9183c3f9ed ("static_call: Add inline static call infrastructure")
Closes: https://lore.kernel.org/all/20230915082126.4187913-1-ruanjinjie@huawei.com
Reported-by: Jinjie Ruan <ruanjinjie@huawei.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Jinjie Ruan <ruanjinjie@huawei.com>
Link: https://lore.kernel.org/r/87zfon6b0s.ffs@tglx
2024-09-06 16:29:21 +02:00
Costa Shulyupin
87b5a153b8 genirq/cpuhotplug: Use cpumask_intersects()
Replace `cpumask_any_and(a, b) >= nr_cpu_ids`
with the more readable `!cpumask_intersects(a, b)`.

[ tglx: Massaged change log ]

Signed-off-by: Costa Shulyupin <costa.shul@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/all/20240904134823.777623-2-costa.shul@redhat.com
2024-09-06 16:28:39 +02:00
Linus Torvalds
b831f83e40 bpf-6.11-rc7
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE+soXsSLHKoYyzcli6rmadz2vbToFAmbaYFMACgkQ6rmadz2v
 bTq7JBAAipwHeOL3IYproQxGy+f0W3Uik9FNlavSQ3zpJHmTJcpf0ysXkqH23g2q
 26CF0R44gmGMkdbZsxbk3HLI2qRmzxmznYCDH0g7d9qwzQMhFHIiY7TW7UD/XbKx
 UHdHLb5PYrj+j94T1WGiQdvbZYDlpmdz5rFA9K/TBtBArqYp9mA4D/cIlTDBfFpk
 cjhSGVl9x/BKbiHKApxSGcR7Fh/+ux9mVdlssWQNhRfm3V2tbRSAw1i1/ydTG+4c
 bf/m0RSIDfPMxy1i7D0lNRbclzWVisTqNzDXHfQoRUJMuMDfsK4UZB/6gvh+2LKy
 D60vT8AfN5ygjJbLdFbwFGnEymjfsXWguyqfQB0d9Hj/2/EsZ01rI2ikJv9J+qKl
 wwZM3YeA3Q/V0mZ5wCONp2dn+s+82nga+fdvCRFz6SLkWQwgbW5BYHFF1c60V9MH
 Pbd9Y5VfCOEZRzR6RxbmguPrnoU1+BUwQeIAp9L73bllrzhtmh/aL/b03uw8/wUh
 I+peLxJ+DVp6wTudgvSMviMySWcztuz397G7TnFyG0V4nKe1+QxSaQWWw2HKvpy3
 i+m98qoWqbuJqz49FpEtX6x/17gZZNA0LK648D77nrOfsGWOLTKOZUDbNWbTPw9a
 Gojg5obJ8P82yO9UCYQLyGsAJxJrKZv3OEmqy0mRG1hrSMsozxg=
 =5Quw
 -----END PGP SIGNATURE-----

Merge tag 'bpf-6.11-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf

Pull bpf fixes from Alexei Starovoitov:

 - Fix crash when btf_parse_base() returns an error (Martin Lau)

 - Fix out of bounds access in btf_name_valid_section() (Jeongjun Park)

* tag 'bpf-6.11-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
  selftests/bpf: Add a selftest to check for incorrect names
  bpf: add check for invalid name in btf_name_valid_section()
  bpf: Fix a crash when btf_parse_base() returns an error pointer
2024-09-05 20:10:53 -07:00
JP Kobryn
bc638d8cb5 bpf: allow kfuncs within tracepoint and perf event programs
Associate tracepoint and perf event program types with the kfunc tracing
hook. This allows calling kfuncs within these types of programs.

Signed-off-by: JP Kobryn <inwardvessel@gmail.com>
Link: https://lore.kernel.org/r/20240905223812.141857-2-inwardvessel@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-09-05 17:02:03 -07:00
Andrii Nakryiko
2db2b8cb8f bpf: change int cmd argument in __sys_bpf into typed enum bpf_cmd
This improves the BTF data recorded about this function and makes
debugging/tracing better, because now the command can be displayed as a
symbolic name instead of an obscure number.
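
The change is confined to the prototype (a sketch):

  /* before */
  static int __sys_bpf(int cmd, bpfptr_t uattr, unsigned int size);

  /* after: BTF now records the enum, so tracing tools can show e.g.
   * "BPF_PROG_LOAD" instead of a raw number */
  static int __sys_bpf(enum bpf_cmd cmd, bpfptr_t uattr, unsigned int size);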

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20240905210520.2252984-1-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-09-05 16:58:51 -07:00
Linus Torvalds
e4b42053b7 Tracing fixes for 6.11:
- Fix adding a new fgraph callback after function graph tracing has
   already started.
 
   If the new caller does not initialize its hash before registering the
   fgraph_ops, it can cause a NULL pointer dereference. Fix this by adding
   a new parameter to ftrace_graph_enable_direct() passing in the newly
   added gops directly and not rely on using the fgraph_array[], as entries
   in the fgraph_array[] must be initialized. Assign the new gops to the
   fgraph_array[] after it goes through ftrace_startup_subops() as that
   will properly initialize the gops->ops and initialize its hashes.
 
 - Fix a memory leak in fgraph storage memory test.
 
   If the "multiple fgraph storage on a function" boot up selftest
   fails in the registering of the function graph tracer, it will
   not free the memory it allocated for the filter. Break the loop
   up into two where it allocates the filters first and then registers
   the functions where any errors will do the appropriate clean ups.
 
 - Only clear the timerlat timers if it has an associated kthread.
 
   In the rtla tool that uses timerlat, if it was killed just as it
   was shutting down, the signals can free the kthread and the timer.
   But the closing of the timerlat files could cause the hrtimer_cancel()
   to be called on the already freed timer. As the kthread variable is
   set to NULL when the kthreads are stopped and the timers are freed,
   it can be used to know not to call hrtimer_cancel() on the timer if
   the kthread variable is NULL.
 
 - Use a cpumask to keep track of osnoise/timerlat kthreads
 
   The timerlat tracer can use user space threads for its analysis.
   With the killing of the rtla tool, the kernel can get confused
   between if it is using a user space thread to analyze or one of its
   own kernel threads. When this confusion happens, kthread_stop()
   can be called on a user space thread and bad things happen.
   As the kernel threads are per-cpu, a bitmask can be used to know
   when a kernel thread is used or when a user space thread is used.
 
 - Add missing interface_lock to osnoise/timerlat stop_kthread()
 
   The stop_kthread() function in osnoise/timerlat clears the
   osnoise kthread variable, and if it was a user space thread does
   a put_task on it. But this can race with the closing of the timerlat
   files that also does a put_task on the kthread, and if the race happens
   the task will have put_task called on it twice and oops.
 
 - Add cond_resched() to the tracing_iter_reset() loop.
 
   The latency tracers keep writing to the ring buffer without resetting
   when it issues a new "start" event (like interrupts being disabled).
   When reading the buffer with an iterator, the tracing_iter_reset()
   sets its pointer to that start event by walking through all the events
   in the buffer until it gets to the time stamp of the start event.
   In the case of a very large buffer, the loop that looks for the start
   event has been reported to take so long on a non-preemptible kernel
   that it can trigger a soft lockup warning. Add a cond_resched() into
   that loop to make sure that doesn't happen.
 
 - Use list_del_rcu() for eventfs ei->list variable
 
   It was reported that running loops of creating and deleting kprobe events
   could cause a crash due to the eventfs list iteration hitting a LIST_POISON
   variable. This is because the list is protected by SRCU but when an item is
   deleted from the list, it was using list_del() which poisons the "next"
   pointer. This is what list_del_rcu() is meant to prevent.
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCZtohNBQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qtoNAQDQKjomYLCpLz2EqgHZ6VB81QVrHuqt
 cU7xuEfUJDzyyAEA/n0t6quIdjYRd6R2/KxGkP6By/805Coq4IZMTgNQmw0=
 =nZ7k
 -----END PGP SIGNATURE-----

Merge tag 'trace-v6.11-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull tracing fixes from Steven Rostedt:

 - Fix adding a new fgraph callback after function graph tracing has
   already started.

   If the new caller does not initialize its hash before registering the
   fgraph_ops, it can cause a NULL pointer dereference. Fix this by
   adding a new parameter to ftrace_graph_enable_direct() passing in the
   newly added gops directly and not rely on using the fgraph_array[],
   as entries in the fgraph_array[] must be initialized.

   Assign the new gops to the fgraph_array[] after it goes through
   ftrace_startup_subops() as that will properly initialize the
   gops->ops and initialize its hashes.

 - Fix a memory leak in fgraph storage memory test.

   If the "multiple fgraph storage on a function" boot up selftest fails
   in the registering of the function graph tracer, it will not free the
   memory it allocated for the filter. Break the loop up into two where
   it allocates the filters first and then registers the functions where
   any errors will do the appropriate clean ups.

 - Only clear the timerlat timers if it has an associated kthread.

   In the rtla tool that uses timerlat, if it was killed just as it was
   shutting down, the signals can free the kthread and the timer. But
   the closing of the timerlat files could cause the hrtimer_cancel() to
   be called on the already freed timer. As the kthread variable is is
   set to NULL when the kthreads are stopped and the timers are freed it
   can be used to know not to call hrtimer_cancel() on the timer if the
   kthread variable is NULL.

 - Use a cpumask to keep track of osnoise/timerlat kthreads

   The timerlat tracer can use user space threads for its analysis. With
   the killing of the rtla tool, the kernel can get confused about whether
   it is using a user space thread for the analysis or one of its own kernel
   threads. When this confusion happens, kthread_stop() can be called on
   a user space thread and bad things happen. As the kernel threads are
   per-cpu, a bitmask can be used to know when a kernel thread is used
   or when a user space thread is used.

 - Add missing interface_lock to osnoise/timerlat stop_kthread()

   The stop_kthread() function in osnoise/timerlat clears the osnoise
   kthread variable, and if it was a user space thread does a put_task
   on it. But this can race with the closing of the timerlat files that
   also does a put_task on the kthread, and if the race happens the task
   will have put_task called on it twice and oops.

 - Add cond_resched() to the tracing_iter_reset() loop.

   The latency tracers keep writing to the ring buffer without resetting
   when it issues a new "start" event (like interrupts being disabled).
   When reading the buffer with an iterator, the tracing_iter_reset()
   sets its pointer to that start event by walking through all the
   events in the buffer until it gets to the time stamp of the start
   event. In the case of a very large buffer, the loop that looks for
   the start event has been reported to take so long on a non-preemptible
   kernel that it can trigger a soft lockup warning. Add a
   cond_resched() into that loop to make sure that doesn't happen.

 - Use list_del_rcu() for eventfs ei->list variable

   It was reported that running loops of creating and deleting kprobe
   events could cause a crash due to the eventfs list iteration hitting
   a LIST_POISON variable. This is because the list is protected by SRCU
   but when an item is deleted from the list, it was using list_del()
   which poisons the "next" pointer. This is what list_del_rcu() is meant
   to prevent (see the sketch below).
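
   A minimal sketch of the difference (struct, SRCU domain and callback
   names here are illustrative, not the eventfs code):

     struct item {
             struct list_head list;
             struct rcu_head rcu;
     };

     static void remove_item(struct item *item)
     {
             /* keeps item->list.next valid for concurrent SRCU readers */
             list_del_rcu(&item->list);
             /* free only after all current readers are done */
             call_srcu(&my_srcu, &item->rcu, free_item_cb);
             /* list_del() instead would set item->list.next to LIST_POISON1,
              * and a reader still iterating the list would crash on it */
     }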

* tag 'trace-v6.11-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  tracing/timerlat: Add interface_lock around clearing of kthread in stop_kthread()
  tracing/timerlat: Only clear timer if a kthread exists
  tracing/osnoise: Use a cpumask to know what threads are kthreads
  eventfs: Use list_del_rcu() for SRCU protected list variable
  tracing: Avoid possible softlockup in tracing_iter_reset()
  tracing: Fix memory leak in fgraph storage selftest
  tracing: fgraph: Fix to add new fgraph_ops to array after ftrace_startup_subops()
2024-09-05 16:29:41 -07:00
Shung-Hsi Yu
1ae497c78f bpf: use type_may_be_null() helper for nullable-param check
Commit 980ca8ceea ("bpf: check bpf_dummy_struct_ops program params for
test runs") does bitwise AND between reg_type and PTR_MAYBE_NULL, which
is correct, but due to type difference the compiler complains:

  net/bpf/bpf_dummy_struct_ops.c:118:31: warning: bitwise operation between different enumeration types ('const enum bpf_reg_type' and 'enum bpf_type_flag') [-Wenum-enum-conversion]
    118 |                 if (info && (info->reg_type & PTR_MAYBE_NULL))
        |                              ~~~~~~~~~~~~~~ ^ ~~~~~~~~~~~~~~

Work around the warning by moving the type_may_be_null() helper from
verifier.c into bpf_verifier.h, and reuse it here to check whether the
param is nullable.
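
The helper is a thin wrapper that keeps the enum/flag mixing in one place,
roughly:

  /* include/linux/bpf_verifier.h */
  static inline bool type_may_be_null(u32 type)
  {
          return type & PTR_MAYBE_NULL;
  }

  /* net/bpf/bpf_dummy_struct_ops.c */
  if (info && type_may_be_null(info->reg_type))
          /* ... treat the param as nullable ... */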

Fixes: 980ca8ceea ("bpf: check bpf_dummy_struct_ops program params for test runs")
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202404241956.HEiRYwWq-lkp@intel.com/
Signed-off-by: Shung-Hsi Yu <shung-hsi.yu@suse.com>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20240905055233.70203-1-shung-hsi.yu@suse.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-09-05 13:29:06 -07:00
Jiri Olsa
900f362e20 bpf: Fix uprobe multi pid filter check
Uprobe multi link does its own process (thread leader) filtering before
running the bpf program by comparing task's vm pointers.

But as Oleg pointed out there can be processes sharing the vm (CLONE_VM),
so we can't just compare task->vm pointers; instead we need to use the
same_thread_group() check.
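
A sketch of the corrected filter (field names approximate):

  /* old check: missed CLONE_VM processes sharing the mm */
  if (link->task && current->mm != link->task->mm)
          return 0;

  /* new check: filter on the thread group (the process) itself */
  if (link->task && !same_thread_group(current, link->task))
          return 0;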

Suggested-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Link: https://lore.kernel.org/bpf/20240905115124.1503998-2-jolsa@kernel.org
2024-09-05 12:43:22 -07:00
Steven Rostedt
5bfbcd1ee5 tracing/timerlat: Add interface_lock around clearing of kthread in stop_kthread()
The timerlat interface will get and put the task that is part of the
"kthread" field of the osn_var to keep it around until all references are
released. But there's a race in the "stop_kthread()" code that will call
put_task_struct() on the kthread if it is not a kernel thread. This can
race with the releasing of the references to that task struct and the
put_task_struct() can be called twice when it should have been called just
once.

Take the interface_lock() in stop_kthread() to synchronize this change.
But to do so, the function stop_per_cpu_kthreads() needs to change the
loop from for_each_online_cpu() to for_each_possible_cpu() and remove the
cpus_read_lock(), as the interface_lock cannot be taken while the cpu
locks are held. The only side effect of this change is that it may do some
extra work, as the per_cpu variables of the offline CPUs would not be set
anyway, and would simply be skipped in the loop.

Remove unneeded "return;" in stop_kthread().
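
Roughly, with variable names approximate and only the reference-dropping
branch shown:

  static void stop_kthread(unsigned int cpu)
  {
          struct osnoise_variables *osn_var = per_cpu_ptr(&per_cpu_osnoise_var, cpu);
          struct task_struct *kthread;

          /* Serialize against timerlat_fd_release(), which also clears the
           * kthread pointer and does its own put_task_struct(). */
          mutex_lock(&interface_lock);
          kthread = osn_var->kthread;
          osn_var->kthread = NULL;
          mutex_unlock(&interface_lock);

          if (kthread)
                  put_task_struct(kthread);
  }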

Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Tomas Glozar <tglozar@redhat.com>
Cc: John Kacur <jkacur@redhat.com>
Cc: "Luis Claudio R. Goncalves" <lgoncalv@redhat.com>
Link: https://lore.kernel.org/20240905113359.2b934242@gandalf.local.home
Fixes: e88ed227f6 ("tracing/timerlat: Add user-space interface")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-09-05 12:01:37 -04:00
Steven Rostedt
e6a53481da tracing/timerlat: Only clear timer if a kthread exists
The timerlat tracer can use user space threads to check for osnoise and
timer latency. If the program using this is killed via a SIGTERM, the
threads are shut down one at a time and another tracing instance can start
up resetting the threads before they are fully closed. That causes the
hrtimer assigned to the kthread to be shut down and freed twice when the
dying thread finally closes the file descriptors, causing a use-after-free
bug.

Only cancel the hrtimer if the associated thread is still around. Also add
the interface_lock around the resetting of the tlat_var->kthread.

Note, this is just a quick fix that can be backported to stable. A real
fix is to have a better synchronization between the shutdown of old
threads and the starting of new ones.
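
In timerlat_fd_release() the guard looks roughly like this (field names
approximate):

  mutex_lock(&interface_lock);
  /* stop_kthread() may already have torn down the thread and its timer;
   * in that case the kthread pointer is NULL and the hrtimer must not
   * be touched. */
  if (tlat_var->kthread)
          hrtimer_cancel(&tlat_var->timer);
  tlat_var->kthread = NULL;
  mutex_unlock(&interface_lock);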

Link: https://lore.kernel.org/all/20240820130001.124768-1-tglozar@redhat.com/

Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: "Luis Claudio R. Goncalves" <lgoncalv@redhat.com>
Link: https://lore.kernel.org/20240905085330.45985730@gandalf.local.home
Fixes: e88ed227f6 ("tracing/timerlat: Add user-space interface")
Reported-by: Tomas Glozar <tglozar@redhat.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-09-05 11:30:23 -04:00
Steven Rostedt
177e1cc2f4 tracing/osnoise: Use a cpumask to know what threads are kthreads
The start_kthread() and stop_kthread() code was not always called with the
interface_lock held. This means that the kthread variable could be
unexpectedly changed causing the kthread_stop() to be called on it when it
should not have been, leading to:

 while true; do
   rtla timerlat top -u -q & PID=$!;
   sleep 5;
   kill -INT $PID;
   sleep 0.001;
   kill -TERM $PID;
   wait $PID;
  done

Causing the following OOPS:

 Oops: general protection fault, probably for non-canonical address 0xdffffc0000000002: 0000 [#1] PREEMPT SMP KASAN PTI
 KASAN: null-ptr-deref in range [0x0000000000000010-0x0000000000000017]
 CPU: 5 UID: 0 PID: 885 Comm: timerlatu/5 Not tainted 6.11.0-rc4-test-00002-gbc754cc76d1b-dirty #125 a533010b71dab205ad2f507188ce8c82203b0254
 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
 RIP: 0010:hrtimer_active+0x58/0x300
 Code: 48 c1 ee 03 41 54 48 01 d1 48 01 d6 55 53 48 83 ec 20 80 39 00 0f 85 30 02 00 00 49 8b 6f 30 4c 8d 75 10 4c 89 f0 48 c1 e8 03 <0f> b6 3c 10 4c 89 f0 83 e0 07 83 c0 03 40 38 f8 7c 09 40 84 ff 0f
 RSP: 0018:ffff88811d97f940 EFLAGS: 00010202
 RAX: 0000000000000002 RBX: ffff88823c6b5b28 RCX: ffffed10478d6b6b
 RDX: dffffc0000000000 RSI: ffffed10478d6b6c RDI: ffff88823c6b5b28
 RBP: 0000000000000000 R08: ffff88823c6b5b58 R09: ffff88823c6b5b60
 R10: ffff88811d97f957 R11: 0000000000000010 R12: 00000000000a801d
 R13: ffff88810d8b35d8 R14: 0000000000000010 R15: ffff88823c6b5b28
 FS:  0000000000000000(0000) GS:ffff88823c680000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 0000561858ad7258 CR3: 000000007729e001 CR4: 0000000000170ef0
 Call Trace:
  <TASK>
  ? die_addr+0x40/0xa0
  ? exc_general_protection+0x154/0x230
  ? asm_exc_general_protection+0x26/0x30
  ? hrtimer_active+0x58/0x300
  ? __pfx_mutex_lock+0x10/0x10
  ? __pfx_locks_remove_file+0x10/0x10
  hrtimer_cancel+0x15/0x40
  timerlat_fd_release+0x8e/0x1f0
  ? security_file_release+0x43/0x80
  __fput+0x372/0xb10
  task_work_run+0x11e/0x1f0
  ? _raw_spin_lock+0x85/0xe0
  ? __pfx_task_work_run+0x10/0x10
  ? poison_slab_object+0x109/0x170
  ? do_exit+0x7a0/0x24b0
  do_exit+0x7bd/0x24b0
  ? __pfx_migrate_enable+0x10/0x10
  ? __pfx_do_exit+0x10/0x10
  ? __pfx_read_tsc+0x10/0x10
  ? ktime_get+0x64/0x140
  ? _raw_spin_lock_irq+0x86/0xe0
  do_group_exit+0xb0/0x220
  get_signal+0x17ba/0x1b50
  ? vfs_read+0x179/0xa40
  ? timerlat_fd_read+0x30b/0x9d0
  ? __pfx_get_signal+0x10/0x10
  ? __pfx_timerlat_fd_read+0x10/0x10
  arch_do_signal_or_restart+0x8c/0x570
  ? __pfx_arch_do_signal_or_restart+0x10/0x10
  ? vfs_read+0x179/0xa40
  ? ksys_read+0xfe/0x1d0
  ? __pfx_ksys_read+0x10/0x10
  syscall_exit_to_user_mode+0xbc/0x130
  do_syscall_64+0x74/0x110
  ? __pfx___rseq_handle_notify_resume+0x10/0x10
  ? __pfx_ksys_read+0x10/0x10
  ? fpregs_restore_userregs+0xdb/0x1e0
  ? fpregs_restore_userregs+0xdb/0x1e0
  ? syscall_exit_to_user_mode+0x116/0x130
  ? do_syscall_64+0x74/0x110
  ? do_syscall_64+0x74/0x110
  ? do_syscall_64+0x74/0x110
  entry_SYSCALL_64_after_hwframe+0x71/0x79
 RIP: 0033:0x7ff0070eca9c
 Code: Unable to access opcode bytes at 0x7ff0070eca72.
 RSP: 002b:00007ff006dff8c0 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
 RAX: 0000000000000000 RBX: 0000000000000005 RCX: 00007ff0070eca9c
 RDX: 0000000000000400 RSI: 00007ff006dff9a0 RDI: 0000000000000003
 RBP: 00007ff006dffde0 R08: 0000000000000000 R09: 00007ff000000ba0
 R10: 00007ff007004b08 R11: 0000000000000246 R12: 0000000000000003
 R13: 00007ff006dff9a0 R14: 0000000000000007 R15: 0000000000000008
  </TASK>
 Modules linked in: snd_hda_intel snd_intel_dspcfg snd_intel_sdw_acpi snd_hda_codec snd_hwdep snd_hda_core
 ---[ end trace 0000000000000000 ]---

This is because it would mistakenly call kthread_stop() on a user space
thread making it "exit" before it actually exits.

Since kthreads are created based on global behavior, use a cpumask to know
when kthreads are running and that they need to be shut down before
proceeding to do new work.
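
A sketch of the bookkeeping (mask and variable names approximate):

  static cpumask_t kthread_cpumask;

  /* start_kthread(): remember that this CPU's worker is a real kthread */
  cpumask_set_cpu(cpu, &kthread_cpumask);

  /* stop_kthread(): only a thread we created may be kthread_stop()ed;
   * a registered user-space workload thread only has its reference dropped */
  if (cpumask_test_and_clear_cpu(cpu, &kthread_cpumask))
          kthread_stop(kthread);
  else
          put_task_struct(kthread);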

Link: https://lore.kernel.org/all/20240820130001.124768-1-tglozar@redhat.com/

This was debugged by using the persistent ring buffer:

Link: https://lore.kernel.org/all/20240823013902.135036960@goodmis.org/

Note, locking was originally used to fix this, but that proved to cause too
many deadlocks to work around:

  https://lore.kernel.org/linux-trace-kernel/20240823102816.5e55753b@gandalf.local.home/

Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: "Luis Claudio R. Goncalves" <lgoncalv@redhat.com>
Link: https://lore.kernel.org/20240904103428.08efdf4c@gandalf.local.home
Fixes: e88ed227f6 ("tracing/timerlat: Add user-space interface")
Reported-by: Tomas Glozar <tglozar@redhat.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-09-05 11:30:22 -04:00
Andrii Nakryiko
cd7bdd9d46 uprobes: perform lockless SRCU-protected uprobes_tree lookup
Another big bottleneck to scalability is uprobes_treelock that's taken in
a very hot path in handle_swbp(). Now that uprobes are SRCU-protected,
take advantage of that and make the uprobes_tree RB-tree lookup lockless.

To make RB-tree RCU-protected lockless lookup correct, we need to take
into account that such RB-tree lookup can return false negatives if there
are parallel RB-tree modifications (rotations) going on. We use seqcount
lock to detect whether the RB-tree changed, and if we find nothing while
the RB-tree got modified in between, we just retry. If a uprobe was found, then
it's guaranteed to be a correct lookup.
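
The lookup/retry pattern is roughly this (seqcount name approximate):

  static struct uprobe *find_uprobe_rcu(struct inode *inode, loff_t offset)
  {
          struct uprobe *uprobe;
          unsigned int seq;

          do {
                  seq = read_seqcount_begin(&uprobes_seqcount);
                  uprobe = __find_uprobe(inode, offset); /* plain RB-tree walk */
                  if (uprobe)
                          return uprobe;  /* a hit is always trustworthy */
          } while (read_seqcount_retry(&uprobes_seqcount, seq));

          return NULL;    /* no rotation raced with us: really not found */
  }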

With all the lock-avoiding changes done, we get a pretty decent
improvement in performance and scalability of uprobes with the number of
CPUs, even though we are still nowhere near linear scalability. This is
due to SRCU not really scaling very well with the number of CPUs on
the particular hardware that was used for testing (80-core Intel Xeon Gold
6138 CPU @ 2.00GHz), but also due to the remaining mmap_lock, which is
currently taken to resolve interrupt address to inode+offset and then
uprobe instance. And, of course, uretprobes still need similar RCU to
avoid refcount in the hot path, which will be addressed in the follow up
patches.

Nevertheless, the improvement is good. We used BPF selftest-based
uprobe-nop and uretprobe-nop benchmarks to get the below numbers,
varying number of CPUs on which uprobes and uretprobes are triggered.

BASELINE
========
uprobe-nop      ( 1 cpus):    3.032 ± 0.023M/s  (  3.032M/s/cpu)
uprobe-nop      ( 2 cpus):    3.452 ± 0.005M/s  (  1.726M/s/cpu)
uprobe-nop      ( 4 cpus):    3.663 ± 0.005M/s  (  0.916M/s/cpu)
uprobe-nop      ( 8 cpus):    3.718 ± 0.038M/s  (  0.465M/s/cpu)
uprobe-nop      (16 cpus):    3.344 ± 0.008M/s  (  0.209M/s/cpu)
uprobe-nop      (32 cpus):    2.288 ± 0.021M/s  (  0.071M/s/cpu)
uprobe-nop      (64 cpus):    3.205 ± 0.004M/s  (  0.050M/s/cpu)

uretprobe-nop   ( 1 cpus):    1.979 ± 0.005M/s  (  1.979M/s/cpu)
uretprobe-nop   ( 2 cpus):    2.361 ± 0.005M/s  (  1.180M/s/cpu)
uretprobe-nop   ( 4 cpus):    2.309 ± 0.002M/s  (  0.577M/s/cpu)
uretprobe-nop   ( 8 cpus):    2.253 ± 0.001M/s  (  0.282M/s/cpu)
uretprobe-nop   (16 cpus):    2.007 ± 0.000M/s  (  0.125M/s/cpu)
uretprobe-nop   (32 cpus):    1.624 ± 0.003M/s  (  0.051M/s/cpu)
uretprobe-nop   (64 cpus):    2.149 ± 0.001M/s  (  0.034M/s/cpu)

SRCU CHANGES
============
uprobe-nop      ( 1 cpus):    3.276 ± 0.005M/s  (  3.276M/s/cpu)
uprobe-nop      ( 2 cpus):    4.125 ± 0.002M/s  (  2.063M/s/cpu)
uprobe-nop      ( 4 cpus):    7.713 ± 0.002M/s  (  1.928M/s/cpu)
uprobe-nop      ( 8 cpus):    8.097 ± 0.006M/s  (  1.012M/s/cpu)
uprobe-nop      (16 cpus):    6.501 ± 0.056M/s  (  0.406M/s/cpu)
uprobe-nop      (32 cpus):    4.398 ± 0.084M/s  (  0.137M/s/cpu)
uprobe-nop      (64 cpus):    6.452 ± 0.000M/s  (  0.101M/s/cpu)

uretprobe-nop   ( 1 cpus):    2.055 ± 0.001M/s  (  2.055M/s/cpu)
uretprobe-nop   ( 2 cpus):    2.677 ± 0.000M/s  (  1.339M/s/cpu)
uretprobe-nop   ( 4 cpus):    4.561 ± 0.003M/s  (  1.140M/s/cpu)
uretprobe-nop   ( 8 cpus):    5.291 ± 0.002M/s  (  0.661M/s/cpu)
uretprobe-nop   (16 cpus):    5.065 ± 0.019M/s  (  0.317M/s/cpu)
uretprobe-nop   (32 cpus):    3.622 ± 0.003M/s  (  0.113M/s/cpu)
uretprobe-nop   (64 cpus):    3.723 ± 0.002M/s  (  0.058M/s/cpu)

Peak throughput increased from 3.7 mln/s (uprobe triggerings) up to about
8 mln/s. For uretprobes it's a bit more modest, with a bump from 2.4 mln/s
to 5 mln/s.

Suggested-by: "Peter Zijlstra (Intel)" <peterz@infradead.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Link: https://lore.kernel.org/r/20240903174603.3554182-8-andrii@kernel.org
2024-09-05 16:56:15 +02:00
Peter Zijlstra
04b01625da perf/uprobe: split uprobe_unregister()
With uprobe_unregister() having grown a synchronize_srcu(), it becomes
fairly slow to call. Esp. since both users of this API call it in a
loop.

Peel off the sync_srcu() and do it once, after the loop.

We also need to add uprobe_unregister_sync() into uprobe_register()'s
error handling path, as we need to be careful about returning to the
caller before we have a guarantee that a partially attached consumer won't
be called anymore. This is an unlikely slow path and this should be
totally fine to be slow in the case of a failed attach.
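
Conceptually the API becomes a two-step pattern for batch users (sketch,
using the _nosync/_sync naming described by the series):

  /* detach all consumers first, without waiting ... */
  for (i = 0; i < cnt; i++)
          uprobe_unregister_nosync(uprobes[i].uprobe, &uprobes[i].consumer);

  /* ... then pay for synchronize_srcu() exactly once for the whole batch */
  uprobe_unregister_sync();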

Co-developed-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Link: https://lore.kernel.org/r/20240903174603.3554182-6-andrii@kernel.org
2024-09-05 16:56:14 +02:00
Andrii Nakryiko
cc01bd044e uprobes: traverse uprobe's consumer list locklessly under SRCU protection
uprobe->register_rwsem is one of a few big bottlenecks to scalability of
uprobes, so we need to get rid of it to improve uprobe performance and
multi-CPU scalability.

First, we turn uprobe's consumer list into a typical doubly-linked list
and utilize existing RCU-aware helpers for traversing such lists, as
well as adding and removing elements from it.

For entry uprobes we already have SRCU protection active since before
uprobe lookup. For uretprobe we keep refcount, guaranteeing that uprobe
won't go away from under us, but we add SRCU protection around consumer
list traversal.

Lastly, to keep handler_chain()'s UPROBE_HANDLER_REMOVE handling simple,
we remember whether any removal was requested during handler calls, but
then we double-check the decision under a proper register_rwsem using
consumers' filter callbacks. Handler removal is very rare, so this extra
lock won't hurt performance, overall, but we also avoid the need for any
extra protection (e.g., seqcount locks).
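
The traversal side then looks roughly like this (list member name
approximate):

  struct uprobe_consumer *uc;
  bool remove = true;
  int srcu_idx;

  srcu_idx = srcu_read_lock(&uprobes_srcu);
  list_for_each_entry_srcu(uc, &uprobe->consumers, cons_node,
                           srcu_read_lock_held(&uprobes_srcu)) {
          int rc = uc->handler ? uc->handler(uc, regs) : 0;

          remove &= (rc == UPROBE_HANDLER_REMOVE);
  }
  srcu_read_unlock(&uprobes_srcu, srcu_idx);
  /* "remove" is then re-validated under register_rwsem, as described above */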

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Link: https://lore.kernel.org/r/20240903174603.3554182-5-andrii@kernel.org
2024-09-05 16:56:14 +02:00
Andrii Nakryiko
59da880afe uprobes: get rid of enum uprobe_filter_ctx in uprobe filter callbacks
It serves no purpose beyond adding an unnecessary argument passed to the
filter callback. Just get rid of it, no one is actually using it.
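
After the change the callback is reduced to roughly:

  -bool (*filter)(struct uprobe_consumer *self, enum uprobe_filter_ctx ctx,
  -               struct mm_struct *mm);
  +bool (*filter)(struct uprobe_consumer *self, struct mm_struct *mm);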

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Link: https://lore.kernel.org/r/20240903174603.3554182-4-andrii@kernel.org
2024-09-05 16:56:14 +02:00
Andrii Nakryiko
8617408f7a uprobes: protected uprobe lifetime with SRCU
To avoid unnecessarily taking a (brief) refcount on uprobe during
breakpoint handling in handle_swbp for entry uprobes, make find_uprobe()
not take refcount, but protect the lifetime of a uprobe instance with
RCU. This improves scalability, as refcount gets quite expensive due to
cache line bouncing between multiple CPUs.

Specifically, we utilize our own uprobe-specific SRCU instance for this
RCU protection. put_uprobe() will delay actual kfree() using call_srcu().

For now, uretprobe and single-stepping handling will still acquire
refcount as necessary. We'll address these issues in follow up patches
by making them use SRCU with timeout.
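
The resulting shape, heavily simplified (helper and field names
approximate):

  /* handle_swbp(), entry-uprobe path: lifetime covered by SRCU,
   * no refcount bump needed */
  int srcu_idx = srcu_read_lock(&uprobes_srcu);
  struct uprobe *uprobe = find_active_uprobe(regs, &is_swbp);

  if (uprobe)
          handler_chain(uprobe, regs);    /* run the consumers */
  srcu_read_unlock(&uprobes_srcu, srcu_idx);

  /* put_uprobe(): defer the actual free until all SRCU readers are done */
  if (refcount_dec_and_test(&uprobe->ref))
          call_srcu(&uprobes_srcu, &uprobe->rcu, uprobe_free_rcu);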

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Link: https://lore.kernel.org/r/20240903174603.3554182-3-andrii@kernel.org
2024-09-05 16:56:13 +02:00
Andrii Nakryiko
3f7f1a64da uprobes: revamp uprobe refcounting and lifetime management
Revamp how struct uprobe is refcounted, and thus how its lifetime is
managed.

Right now, there are a few possible "owners" of uprobe refcount:
  - uprobes_tree RB tree assumes one refcount when uprobe is registered
    and added to the lookup tree;
  - while uprobe is triggered and kernel is handling it in the breakpoint
    handler code, temporary refcount bump is done to keep uprobe from
    being freed;
  - if we have uretprobe requested on a given struct uprobe instance, we
    take another refcount to keep uprobe alive until user space code
    returns from the function and triggers return handler.

The uprobes_tree's extra refcount of 1 is confusing and problematic. No
matter how many actual consumers are attached, they all share the same
refcount, and we have an extra logic to drop the "last" (which might not
really be last) refcount once uprobe's consumer list becomes empty.

This is unconventional and has to be kept in mind as a special case all
the time. Further, because of this design we have the situations where
find_uprobe() will find uprobe, bump refcount, return it to the caller,
but that uprobe will still need uprobe_is_active() check, after which
the caller is required to drop refcount and try again. This is just too
many details leaking to the higher level logic.

This patch changes refcounting scheme in such a way as to not have
uprobes_tree keeping extra refcount for struct uprobe. Instead, each
uprobe_consumer is assuming its own refcount, which will be dropped
when consumer is unregistered. Other than that, all the active users of
uprobe (entry and return uprobe handling code) keeps exactly the same
refcounting approach.

With the above setup, once uprobe's refcount drops to zero, we need to
make sure that uprobe's "destructor" removes uprobe from uprobes_tree,
of course. This, though, races with uprobe entry handling code in
handle_swbp(), which, through find_active_uprobe()->find_uprobe() lookup,
can race with uprobe being destroyed after refcount drops to zero (e.g.,
due to uprobe_consumer unregistering). So we add try_get_uprobe(), which
will attempt to bump refcount, unless it already is zero. Caller needs
to guarantee that uprobe instance won't be freed in parallel, which is
the case while we keep uprobes_treelock (for read or write, doesn't
matter).

Note also, we now don't leak the race between registration and
unregistration, so we remove the retry logic completely. If
find_uprobe() returns valid uprobe, it's guaranteed to remain in
uprobes_tree with properly incremented refcount. The race is handled
inside __insert_uprobe() and put_uprobe() working together:
__insert_uprobe() will remove uprobe from RB-tree, if it can't bump
refcount and will retry to insert the new uprobe instance. put_uprobe()
won't attempt to remove uprobe from RB-tree, if it's already not there.
All that is protected by uprobes_treelock, which keeps things simple.
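
The lookup-side helper described above is essentially:

  /* Caller must hold uprobes_treelock (read or write), which guarantees
   * the uprobe cannot be freed while we attempt the bump. */
  static struct uprobe *try_get_uprobe(struct uprobe *uprobe)
  {
          if (refcount_inc_not_zero(&uprobe->ref))
                  return uprobe;
          return NULL;
  }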

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Link: https://lore.kernel.org/r/20240903174603.3554182-2-andrii@kernel.org
2024-09-05 16:56:13 +02:00
Oleg Nesterov
5fe6e308ab bpf: Fix use-after-free in bpf_uprobe_multi_link_attach()
If bpf_link_prime() fails, bpf_uprobe_multi_link_attach() goes to the
error_free label and frees the array of bpf_uprobe's without calling
bpf_uprobe_unregister().

This leaks bpf_uprobe->uprobe and worse, this frees bpf_uprobe->consumer
without removing it from the uprobe->consumers list.

Fixes: 89ae89f53d ("bpf: Add multi uprobe link")
Closes: https://lore.kernel.org/all/000000000000382d39061f59f2dd@google.com/
Reported-by: syzbot+f7a1c2c2711e4a780f19@syzkaller.appspotmail.com
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Tested-by: syzbot+f7a1c2c2711e4a780f19@syzkaller.appspotmail.com
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20240813152524.GA7292@redhat.com
2024-09-05 16:56:13 +02:00
Luo Gengkun
62c0b10615 perf/core: Fix small negative period being ignored
In perf_adjust_period, we will first calculate period, and then use
this period to calculate delta. However, when delta is less than 0,
there will be a deviation compared to when delta is greater than or
equal to 0. For example, when delta is in the range [-14,-1], the
range of delta = delta + 7 is [-7,6], so the final value of delta/8 is 0
and adjustments that should have been -1 or -2 are ignored. This is
unacceptable when the target period is very short, because we will lose
a lot of samples.
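
The arithmetic, sketched (the "+ 7" low-pass rounding is what swallows
small negative deltas; making it sign-symmetric is one way to fix it):

  s64 period = perf_calculate_period(event, nsec, count);
  s64 delta  = (s64)(period - hwc->sample_period);

  /* old: delta = (delta + 7) / 8 maps every delta in [-14, -1] to 0 */
  if (delta >= 0)
          delta += 7;
  else
          delta -= 7;
  delta /= 8;                     /* low pass filter, now sign-symmetric */

  sample_period = hwc->sample_period + delta;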

Here are some tests and analyses:
before:
  # perf record -e cs -F 1000  ./a.out
  [ perf record: Woken up 1 times to write data ]
  [ perf record: Captured and wrote 0.022 MB perf.data (518 samples) ]

  # perf script
  ...
  a.out     396   257.956048:         23 cs:  ffffffff81f4eeec schedul>
  a.out     396   257.957891:         23 cs:  ffffffff81f4eeec schedul>
  a.out     396   257.959730:         23 cs:  ffffffff81f4eeec schedul>
  a.out     396   257.961545:         23 cs:  ffffffff81f4eeec schedul>
  a.out     396   257.963355:         23 cs:  ffffffff81f4eeec schedul>
  a.out     396   257.965163:         23 cs:  ffffffff81f4eeec schedul>
  a.out     396   257.966973:         23 cs:  ffffffff81f4eeec schedul>
  a.out     396   257.968785:         23 cs:  ffffffff81f4eeec schedul>
  a.out     396   257.970593:         23 cs:  ffffffff81f4eeec schedul>
  ...

after:
  # perf record -e cs -F 1000  ./a.out
  [ perf record: Woken up 1 times to write data ]
  [ perf record: Captured and wrote 0.058 MB perf.data (1466 samples) ]

  # perf script
  ...
  a.out     395    59.338813:         11 cs:  ffffffff81f4eeec schedul>
  a.out     395    59.339707:         12 cs:  ffffffff81f4eeec schedul>
  a.out     395    59.340682:         13 cs:  ffffffff81f4eeec schedul>
  a.out     395    59.341751:         13 cs:  ffffffff81f4eeec schedul>
  a.out     395    59.342799:         12 cs:  ffffffff81f4eeec schedul>
  a.out     395    59.343765:         11 cs:  ffffffff81f4eeec schedul>
  a.out     395    59.344651:         11 cs:  ffffffff81f4eeec schedul>
  a.out     395    59.345539:         12 cs:  ffffffff81f4eeec schedul>
  a.out     395    59.346502:         13 cs:  ffffffff81f4eeec schedul>
  ...

test.c

#include <unistd.h>

int main() {
        for (int i = 0; i < 20000; i++)
                usleep(10);

        return 0;
}

  # time ./a.out
  real    0m1.583s
  user    0m0.040s
  sys     0m0.298s

The above results were tested on x86-64 qemu with KVM enabled using
test.c as the test program. Ideally, we should have around 1500 samples,
but the previous algorithm had only about 500, whereas the modified
algorithm now has about 1400. Furthermore, the new version shows 1
sample per 0.001s, while the previous one is 1 sample per 0.002s. This
indicates that the new algorithm is more sensitive to small negative
values compared to the old algorithm.

Fixes: bd2b5b1284 ("perf_counter: More aggressive frequency adjustment")
Signed-off-by: Luo Gengkun <luogengkun@huaweicloud.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Adrian Hunter <adrian.hunter@intel.com>
Reviewed-by: Kan Liang <kan.liang@linux.intel.com>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20240831074316.2106159-2-luogengkun@huaweicloud.com
2024-09-05 16:56:13 +02:00
Zheng Yejian
49aa8a1f4d tracing: Avoid possible softlockup in tracing_iter_reset()
In __tracing_open(), when max latency tracers took place on the cpu,
the time start of its buffer would be updated, then event entries with
timestamps being earlier than start of the buffer would be skipped
(see tracing_iter_reset()).

Softlockup will occur if the kernel is non-preemptible and too many
entries were skipped in the loop that reset every cpu buffer, so add
cond_resched() to avoid it.
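
The fix is a single rescheduling point in the skip loop, roughly:

  struct ring_buffer_event *event;
  u64 ts;

  /* Skip entries written before the start of the latency window.  On a
   * huge buffer this loop can run for a very long time, so yield on
   * non-preemptible kernels. */
  while ((event = ring_buffer_iter_peek(buf_iter, &ts))) {
          if (ts >= iter->array_buffer->time_start)
                  break;
          entries++;
          ring_buffer_iter_advance(buf_iter);
          cond_resched();                 /* the added line */
  }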

Cc: stable@vger.kernel.org
Fixes: 2f26ebd549 ("tracing: use timestamp to determine start of latency traces")
Link: https://lore.kernel.org/20240827124654.3817443-1-zhengyejian@huaweicloud.com
Suggested-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Zheng Yejian <zhengyejian@huaweicloud.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-09-05 10:18:48 -04:00
Leon Romanovsky
19156263cb dma-mapping: use IOMMU DMA calls for common alloc/free page calls
Common alloc and free pages routines are called when IOMMU DMA is used,
and internally they call into the DMA ops structure, which is not available
for the default IOMMU. This patch adds the necessary checks to call into
IOMMU DMA instead.

It fixes the following crash:

 Unable to handle kernel NULL pointer dereference at virtual address 0000000000000040
 Mem abort info:
   ESR = 0x0000000096000006
   EC = 0x25: DABT (current EL), IL = 32 bits
   SET = 0, FnV = 0
   EA = 0, S1PTW = 0
   FSC = 0x06: level 2 translation fault
 Data abort info:
   ISV = 0, ISS = 0x00000006, ISS2 = 0x00000000
   CM = 0, WnR = 0, TnD = 0, TagAccess = 0
   GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
 user pgtable: 4k pages, 48-bit VAs, pgdp=00000000d20bb000
 [0000000000000040] pgd=08000000d20c1003
 , p4d=08000000d20c1003
 , pud=08000000d20c2003, pmd=0000000000000000
 Internal error: Oops: 0000000096000006 [#1] PREEMPT SMP
 Modules linked in: ipv6 hci_uart venus_core btqca
v4l2_mem2mem btrtl qcom_spmi_adc5 sbs_battery btbcm qcom_vadc_common
cros_ec_typec videobuf2_v4l2 leds_cros_ec cros_kbd_led_backlight
cros_ec_chardev videodev elan_i2c
videobuf2_common qcom_stats mc bluetooth coresight_stm stm_core
ecdh_generic ecc pwrseq_core panel_edp icc_bwmon ath10k_snoc ath10k_core
ath mac80211 phy_qcom_qmp_combo aux_bridge libarc4 coresight_replicator
coresight_etm4x coresight_tmc
coresight_funnel cfg80211 rfkill coresight qcom_wdt cbmem ramoops
reed_solomon pwm_bl coreboot_table backlight crct10dif_ce
 CPU: 7 UID: 0 PID: 70 Comm: kworker/u32:4 Not tainted 6.11.0-rc6-next-20240903-00003-gdfc6015d0711 #660
 Hardware name: Google Lazor Limozeen without Touchscreen (rev5 - rev8) (DT)
 Workqueue: events_unbound deferred_probe_work_func
 hub 2-1:1.0: 4 ports detected

 pstate: 80400009 (Nzcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
 pc : dma_common_alloc_pages+0x54/0x1b4
 lr : dma_common_alloc_pages+0x4c/0x1b4
 sp : ffff8000807d3730
 x29: ffff8000807d3730 x28: ffff02a7d312f880 x27: 0000000000000001
 x26: 000000000000c000 x25: 0000000000000000 x24: 0000000000000001
 x23: ffff02a7d23b6898 x22: 0000000000006cc0 x21: 000000000000c000
 x20: ffff02a7858bf410 x19: fffffe0a60006000 x18: 0000000000000001
 x17: 00000000000000d5 x16: 1fffe054f0bcc261 x15: 0000000000000001
 x14: ffff02a7844dc680 x13: 0000000000100180 x12: dead000000000100
 x11: dead000000000122 x10: 00000000001001ff x9 : ffff02a87f7b7b00
 x8 : ffff02a87f7b7b00 x7 : ffff405977d6b000 x6 : ffff8000807d3310
 x5 : ffff02a87f6b6398 x4 : 0000000000000001 x3 : ffff405977d6b000
 x2 : ffff02a7844dc600 x1 : 0000000100000000 x0 : fffffe0a60006000
 Call trace:
  dma_common_alloc_pages+0x54/0x1b4
  __dma_alloc_pages+0x68/0x90
  dma_alloc_pages+0x10/0x1c
  snd_dma_noncoherent_alloc+0x28/0x8c
  __snd_dma_alloc_pages+0x30/0x50
  snd_dma_alloc_dir_pages+0x40/0x80
  do_alloc_pages+0xb8/0x13c
  preallocate_pcm_pages+0x6c/0xf8
  preallocate_pages+0x160/0x1a4
  snd_pcm_set_managed_buffer_all+0x64/0xb0
  lpass_platform_pcm_new+0xc0/0xe8
  snd_soc_pcm_component_new+0x3c/0xc8
  soc_new_pcm+0x4fc/0x668
  snd_soc_bind_card+0xabc/0xbac
  snd_soc_register_card+0xf0/0x108
  devm_snd_soc_register_card+0x4c/0xa4
  sc7180_snd_platform_probe+0x180/0x224
  platform_probe+0x68/0xc0
  really_probe+0xbc/0x298
  __driver_probe_device+0x78/0x12c
  driver_probe_device+0x3c/0x15c
  __device_attach_driver+0xb8/0x134
  bus_for_each_drv+0x84/0xe0
  __device_attach+0x9c/0x188
  device_initial_probe+0x14/0x20
  bus_probe_device+0xac/0xb0
  deferred_probe_work_func+0x88/0xc0
  process_one_work+0x14c/0x28c
  worker_thread+0x2cc/0x3d4
  kthread+0x114/0x118
  ret_from_fork+0x10/0x20
 Code: f9411c19 940000c9 aa0003f3 b4000460 (f9402326)
 ---[ end trace 0000000000000000 ]---

Fixes: b5c58b2fdc ("dma-mapping: direct calls for dma-iommu")
Closes: https://lore.kernel.org/all/10431dfd-ce04-4e0f-973b-c78477303c18@notapiano
Reported-by: Nícolas F. R. A. Prado <nfraprado@collabora.com> #KernelCI
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Tested-by: Nícolas F. R. A. Prado <nfraprado@collabora.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2024-09-05 14:29:42 +03:00
Ingo Molnar
95c13662b6 Merge branch 'perf/urgent' into perf/core, to pick up fixes
This also refreshes the -rc1 based branch to -rc5.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2024-09-05 11:17:43 +02:00
Tejun Heo
649e980dad Merge branch 'bpf/master' into for-6.12
Pull bpf/master to receive baebe9aaba ("bpf: allow passing struct
bpf_iter_<type> as kfunc arguments") and related changes in preparation for
the DSQ iterator patchset.

Signed-off-by: Tejun Heo <tj@kernel.org>
2024-09-04 11:41:32 -10:00
Waiman Long
8c7e22fc91 cgroup/cpuset: Move cpu.h include to cpuset-internal.h
The newly created cpuset-v1.c file uses the cpus_read_lock/unlock()
functions which are defined in cpu.h, but that header is not yet included
by cpuset-internal.h, leading to a compilation error under certain kernel
configurations. Fix it by moving the cpu.h include from cpuset.c to
cpuset-internal.h. While at it, sort the include files in alphabetical order.

Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202408311612.mQTuO946-lkp@intel.com/
Fixes: 047b830974 ("cgroup/cpuset: move relax_domain_level to cpuset-v1.c")
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2024-09-04 10:46:49 -10:00
Tejun Heo
8195136669 sched_ext: Add cgroup support
Add sched_ext_ops operations to init/exit cgroups, and track task migrations
and config changes. A BPF scheduler may not implement, or implement only a
subset of, cgroup features. The implemented features can be indicated using
%SCX_OPS_HAS_CGROUP_* flags. If the cgroup configuration makes use of features
that are not implemented, a warning is triggered.

While a BPF scheduler is being enabled and disabled, relevant cgroup
operations are locked out using scx_cgroup_rwsem. This avoids situations
like task prep taking place while the task is being moved across cgroups,
making things easier for BPF schedulers.
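
The new surface is a set of optional callbacks on sched_ext_ops, roughly of
this shape (signatures approximate):

  struct sched_ext_ops {
          /* ... existing ops ... */
          s32  (*cgroup_init)(struct cgroup *cgrp,
                              struct scx_cgroup_init_args *args);
          void (*cgroup_exit)(struct cgroup *cgrp);
          s32  (*cgroup_prep_move)(struct task_struct *p,
                                   struct cgroup *from, struct cgroup *to);
          void (*cgroup_move)(struct task_struct *p,
                              struct cgroup *from, struct cgroup *to);
          void (*cgroup_cancel_move)(struct task_struct *p,
                                     struct cgroup *from, struct cgroup *to);
          void (*cgroup_set_weight)(struct cgroup *cgrp, u32 weight);
  };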

v7: - cgroup interface file visibility toggling is dropped in favor of just
      warning messages. Dynamically changing interface visibility caused more
      confusion than it helped.

v6: - Updated to reflect the removal of SCX_KF_SLEEPABLE.

    - Updated to use CONFIG_GROUP_SCHED_WEIGHT and fixes for
      !CONFIG_FAIR_GROUP_SCHED && CONFIG_EXT_GROUP_SCHED.

v5: - Flipped the locking order between scx_cgroup_rwsem and
      cpus_read_lock() to avoid locking order conflict w/ cpuset. Better
      documentation around locking.

    - sched_move_task() takes an early exit if the source and destination
      are identical. This triggered the warning in scx_cgroup_can_attach()
      as it left p->scx.cgrp_moving_from uncleared. Updated the cgroup
      migration path so that ops.cgroup_prep_move() is skipped for identity
      migrations so that its invocations always match ops.cgroup_move()
      one-to-one.

v4: - Example schedulers moved into their own patches.

    - Fix build failure when !CONFIG_CGROUP_SCHED, reported by Andrea Righi.

v3: - Make scx_example_pair switch all tasks by default.

    - Convert to BPF inline iterators.

    - scx_bpf_task_cgroup() is added to determine the current cgroup from
      CPU controller's POV. This allows BPF schedulers to accurately track
      CPU cgroup membership.

    - scx_example_flatcg added. This demonstrates flattened hierarchy
      implementation of CPU cgroup control and shows significant performance
      improvement when cgroups which are nested multiple levels are under
      competition.

v2: - Build fixes for different CONFIG combinations.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: David Vernet <dvernet@meta.com>
Acked-by: Josh Don <joshdon@google.com>
Acked-by: Hao Luo <haoluo@google.com>
Acked-by: Barret Rhoden <brho@google.com>
Reported-by: kernel test robot <lkp@intel.com>
Cc: Andrea Righi <andrea.righi@canonical.com>
2024-09-04 10:24:59 -10:00
Tejun Heo
e179e80c5d sched: Introduce CONFIG_GROUP_SCHED_WEIGHT
sched_ext will soon add cgroup cpu.weight support. The cgroup interface code
is currently gated behind CONFIG_FAIR_GROUP_SCHED. As the fair class and/or
SCX may implement the feature, put the interface code behind the new
CONFIG_GROUP_SCHED_WEIGHT which is selected by CONFIG_FAIR_GROUP_SCHED.
This allows either sched class to enable the interface code without adding
more complex CONFIG tests.

When !CONFIG_FAIR_GROUP_SCHED, a dummy version of sched_group_set_shares()
is added to support later CONFIG_GROUP_SCHED_WEIGHT &&
!CONFIG_FAIR_GROUP_SCHED builds.
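
A sketch of that dummy (exact placement and signature assumed):

  #ifndef CONFIG_FAIR_GROUP_SCHED
  /* stub so CONFIG_GROUP_SCHED_WEIGHT && !CONFIG_FAIR_GROUP_SCHED builds;
   * the shares value is simply ignored when the fair class is not built */
  static inline int sched_group_set_shares(struct task_group *tg,
                                           unsigned long shares)
  {
          return 0;
  }
  #endif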

No functional changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
2024-09-04 10:24:59 -10:00
Tejun Heo
41082c1d1d sched: Make cpu_shares_read_u64() use tg_weight()
Move tg_weight() upward and make cpu_shares_read_u64() use it too. This
makes the weight retrieval shared between cgroup v1 and v2 paths and will be
used to implement cgroup support for sched_ext.

No functional changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
2024-09-04 10:24:59 -10:00
Tejun Heo
859dc4ec5a sched: Expose css_tg()
A new BPF extensible sched_class will use css_tg() in the init and exit
paths to visit all task_groups by walking cgroups.

v4: __setscheduler_prio() is already exposed. Dropped from this patch.

v3: Dropped SCHED_CHANGE_BLOCK() as upstream is adding more generic cleanup
    mechanism.

v2: Expose SCHED_CHANGE_BLOCK() too and update the description.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: David Vernet <dvernet@meta.com>
Acked-by: Josh Don <joshdon@google.com>
Acked-by: Hao Luo <haoluo@google.com>
Acked-by: Barret Rhoden <brho@google.com>
2024-09-04 10:24:59 -10:00
Tejun Heo
a8532fac7b sched_ext: TASK_DEAD tasks must be switched into SCX on ops_enable
During scx_ops_enable(), SCX needs to invoke the sleepable ops.init_task()
on every task. To do this, it does get_task_struct() on each iterated task,
drop the lock and then call ops.init_task().

However, a TASK_DEAD task may already have lost all its usage count and be
waiting for RCU grace period to be freed. If get_task_struct() is called on
such task, use-after-free can happen. To avoid such situations,
scx_ops_enable() skips initialization of TASK_DEAD tasks, which seems safe
as they are never going to be scheduled again.

Unfortunately, a racing sched_setscheduler(2) can grab the task before the
task is unhashed and then continue to e.g. move the task from RT to SCX
after TASK_DEAD is set and ops_enable skipped the task. As the task hasn't
gone through scx_ops_init_task(), scx_ops_enable_task() called from
switching_to_scx() triggers the following warning:

  sched_ext: Invalid task state transition 0 -> 3 for stress-ng-race-[2872]
  WARNING: CPU: 6 PID: 2367 at kernel/sched/ext.c:3327 scx_ops_enable_task+0x18f/0x1f0
  ...
  RIP: 0010:scx_ops_enable_task+0x18f/0x1f0
  ...
   switching_to_scx+0x13/0xa0
   __sched_setscheduler+0x84e/0xa50
   do_sched_setscheduler+0x104/0x1c0
   __x64_sys_sched_setscheduler+0x18/0x30
   do_syscall_64+0x7b/0x140
   entry_SYSCALL_64_after_hwframe+0x76/0x7e

As in the ops_disable path, it just doesn't seem like a good idea to leave
any task in an inconsistent state, even when the task is dead. The root
cause is ops_enable not being able to tell reliably whether a task is truly
dead (no one else is looking at it and it's about to be freed) and was
testing TASK_DEAD instead. Fix it by testing the task's usage count
directly.

- ops_init no longer ignores TASK_DEAD tasks. As now all users iterate all
  tasks, @include_dead is removed from scx_task_iter_next_locked() along
  with dead task filtering.

- tryget_task_struct() is added (see the sketch below). Tasks are skipped iff
  tryget_task_struct() fails.
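
The new helper is essentially (sketch):

  /* Bump the usage count only if the task still has one, i.e. it is not
   * already sitting in its final RCU-delayed free. */
  static inline struct task_struct *tryget_task_struct(struct task_struct *tsk)
  {
          return refcount_inc_not_zero(&tsk->usage) ? tsk : NULL;
  }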

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: David Vernet <void@manifault.com>
Cc: Peter Zijlstra <peterz@infradead.org>
2024-09-04 10:23:32 -10:00
Tejun Heo
61eeb9a905 sched_ext: TASK_DEAD tasks must be switched out of SCX on ops_disable
scx_ops_disable_workfn() only switches !TASK_DEAD tasks out of SCX while
calling scx_ops_exit_task() on all tasks including dead ones. This can leave
a dead task on SCX but with SCX_TASK_NONE state, which is inconsistent.

If another task was in the process of changing the TASK_DEAD task's
scheduling class and grabs the rq lock after scx_ops_disable_workfn() is
done with the task, the task ends up calling scx_ops_disable_task() on the
dead task which is in an inconsistent state triggering a warning:

  WARNING: CPU: 6 PID: 3316 at kernel/sched/ext.c:3411 scx_ops_disable_task+0x12c/0x160
  ...
  RIP: 0010:scx_ops_disable_task+0x12c/0x160
  ...
  Call Trace:
   <TASK>
   check_class_changed+0x2c/0x70
   __sched_setscheduler+0x8a0/0xa50
   do_sched_setscheduler+0x104/0x1c0
   __x64_sys_sched_setscheduler+0x18/0x30
   do_syscall_64+0x7b/0x140
   entry_SYSCALL_64_after_hwframe+0x76/0x7e
  RIP: 0033:0x7f140d70ea5b

There is no reason to leave dead tasks on SCX when unloading the BPF
scheduler. Fix by making scx_ops_disable_workfn() eject all tasks including
the dead ones from SCX.

Signed-off-by: Tejun Heo <tj@kernel.org>
2024-09-04 10:22:55 -10:00
Martin KaFai Lau
00750788df bpf: Fix indentation issue in epilogue_idx
There is a report of a new indentation issue in epilogue_idx.
This patch fixes it.

Fixes: 169c31761c ("bpf: Add gen_epilogue to bpf_verifier_ops")
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202408311622.4GzlzN33-lkp@intel.com/
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://lore.kernel.org/r/20240904180847.56947-3-martin.lau@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-09-04 12:45:18 -07:00
Martin KaFai Lau
940ce73bde bpf: Remove the insn_buf array stack usage from the inline_bpf_loop()
This patch removes the insn_buf array stack usage from the
inline_bpf_loop(). Instead, the env->insn_buf is used. The
usage in inline_bpf_loop() needs more than 16 insn, so the
INSN_BUF_SIZE needs to be increased from 16 to 32.
The compiler stack size warning on the verifier is gone
after this change.

Cc: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://lore.kernel.org/r/20240904180847.56947-2-martin.lau@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-09-04 12:45:18 -07:00
Jeongjun Park
bb6705c3f9 bpf: add check for invalid name in btf_name_valid_section()
If the length of the name string is 1 and the value of name[0] is a NULL
byte, an OOB vulnerability occurs in btf_name_valid_section() and the
return value is true, so the invalid name passes the check.

To solve this, check whether the first position is a NULL byte and
whether the first character is printable.
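
The added guard is roughly:

  /* btf_name_valid_section(), added guard (sketch): reject an empty name
   * or a non-printable first character before scanning the rest, so a
   * 1-byte name consisting only of the NUL terminator can no longer slip
   * through the check. */
  const char *src = btf_str_by_offset(btf, offset);

  if (!*src || !isprint(*src))
          return false;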

Suggested-by: Eduard Zingerman <eddyz87@gmail.com>
Fixes: bd70a8fb7c ("bpf: Allow all printable characters in BTF DATASEC names")
Signed-off-by: Jeongjun Park <aha310510@gmail.com>
Link: https://lore.kernel.org/r/20240831054702.364455-1-aha310510@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
2024-09-04 11:56:34 -07:00
Peter Zijlstra
2ab9d83026 perf/aux: Fix AUX buffer serialization
Ole reported that event->mmap_mutex is strictly insufficient to
serialize the AUX buffer; add a per-RB mutex to fully serialize it.

Note that in the lock order comment the perf_event::mmap_mutex order
was already wrong, that is, it nesting under mmap_lock is not new with
this patch.

Fixes: 45bfb2e504 ("perf: Add AUX area to ring buffer for raw data streams")
Reported-by: Ole <ole@binarygecko.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2024-09-04 18:22:56 +02:00
Linus Torvalds
76c0f27d06 17 hotfixes, 15 of which are cc:stable.
Mostly MM, no identifiable theme.  And a few nilfs2 fixups.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZtfR/wAKCRDdBJ7gKXxA
 jofjAP9rUlliIcn8zcy7vmBTuMaH4SkoULB64QWAUddaWV+SCAEA+q0sntLPnTIZ
 My3sfihR6mbvhkgKbvIHm6YYQI56NAc=
 =b4Lr
 -----END PGP SIGNATURE-----

Merge tag 'mm-hotfixes-stable-2024-09-03-20-19' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull misc fixes from Andrew Morton:
 "17 hotfixes, 15 of which are cc:stable.

  Mostly MM, no identifiable theme.  And a few nilfs2 fixups"

* tag 'mm-hotfixes-stable-2024-09-03-20-19' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
  alloc_tag: fix allocation tag reporting when CONFIG_MODULES=n
  mm: vmalloc: optimize vmap_lazy_nr arithmetic when purging each vmap_area
  mailmap: update entry for Jan Kuliga
  codetag: debug: mark codetags for poisoned page as empty
  mm/memcontrol: respect zswap.writeback setting from parent cg too
  scripts: fix gfp-translate after ___GFP_*_BITS conversion to an enum
  Revert "mm: skip CMA pages when they are not available"
  maple_tree: remove rcu_read_lock() from mt_validate()
  kexec_file: fix elfcorehdr digest exclusion when CONFIG_CRASH_HOTPLUG=y
  mm/slub: add check for s->flags in the alloc_tagging_slab_free_hook
  nilfs2: fix state management in error path of log writing function
  nilfs2: fix missing cleanup on rollforward recovery error
  nilfs2: protect references to superblock parameters exposed in sysfs
  userfaultfd: don't BUG_ON() if khugepaged yanks our page table
  userfaultfd: fix checks for huge PMDs
  mm: vmalloc: ensure vmap_block is initialised before adding to queue
  selftests: mm: fix build errors on armhf
2024-09-04 08:37:33 -07:00
John Ogness
daeed1595b printk: Avoid false positive lockdep report for legacy printing
Legacy console printing from printk() caller context may invoke
the console driver from atomic context. This leads to a lockdep
splat because the console driver will acquire a sleeping lock
and the caller may already hold a spinning lock. This is noticed
by lockdep on !PREEMPT_RT configurations because it will lead to
a problem on PREEMPT_RT.

However, on PREEMPT_RT the printing path from atomic context is
always avoided and the console driver is always invoked from a
dedicated thread. Thus the lockdep splat on !PREEMPT_RT is a
false positive.

For !PREEMPT_RT override the lock-context before invoking the
console driver to avoid the false positive.

Do not override the lock-context for PREEMPT_RT in order to
allow lockdep to catch any real locking context issues related
to the write callback usage.

Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20240904120536.115780-18-john.ogness@linutronix.de
Signed-off-by: Petr Mladek <pmladek@suse.com>
2024-09-04 15:56:33 +02:00
John Ogness
1529bbb6e2 printk: nbcon: Assign nice -20 for printing threads
It is important that console printing threads are scheduled
shortly after a printk call and with generous runtime budgets.

Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20240904120536.115780-17-john.ogness@linutronix.de
Signed-off-by: Petr Mladek <pmladek@suse.com>
2024-09-04 15:56:33 +02:00
John Ogness
5f53ca3ff8 printk: Implement legacy printer kthread for PREEMPT_RT
The write() callback of legacy consoles usually makes use of
spinlocks. This is not permitted with PREEMPT_RT in atomic
contexts.

For PREEMPT_RT, create a new kthread to handle printing of all
the legacy consoles (and nbcon consoles if boot consoles are
registered). This allows legacy consoles to work on PREEMPT_RT
without requiring modification. (However they will not have
the reliability properties guaranteed by nbcon atomic
consoles.)

Use the existing printk_kthreads_check_locked() to start/stop
the legacy kthread as needed.

Introduce the macro force_legacy_kthread() to query if the
forced threading of legacy consoles is in effect. Although
currently only enabled for PREEMPT_RT, this acts as a simple
mechanism for the future to allow other preemption models to
easily take advantage of the non-interference property provided
by the legacy kthread.
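
The macro itself is little more than a config query, roughly:

  /* are legacy consoles forced onto the dedicated kthread?
   * currently only PREEMPT_RT enables this */
  #define force_legacy_kthread()  (IS_ENABLED(CONFIG_PREEMPT_RT))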

When force_legacy_kthread() is true, the legacy kthread
fulfills the role of the console_flush_type @legacy_offload by
waking the legacy kthread instead of printing via the
console_lock in the irq_work. If the legacy kthread is not
yet available, no legacy printing takes place (unless in
panic).

If for some reason the legacy kthread fails to create, any
legacy consoles are unregistered. With force_legacy_kthread(),
the legacy kthread is a critical component for legacy consoles.

These changes only affect CONFIG_PREEMPT_RT.

Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20240904120536.115780-16-john.ogness@linutronix.de
Signed-off-by: Petr Mladek <pmladek@suse.com>
2024-09-04 15:56:33 +02:00
John Ogness
5102981d5e printk: nbcon: Show replay message on takeover
An emergency or panic context can takeover console ownership
while the current owner was printing a printk message. The
atomic printer will re-print the message that the previous
owner was printing. However, this can look confusing to the
user and may even seem as though a message was lost.

  [3430014.1
  [3430014.181123] usb 1-2: Product: USB Audio

Add a new field @nbcon_prev_seq to struct console to track
the sequence number to print that was assigned to the previous
console owner. If this matches the sequence number to print
that the current owner is assigned, then a takeover must have
occurred. In this case, print an additional message to inform
the user that the previous message is being printed again.

  [3430014.1
  ** replaying previous printk message **
  [3430014.181123] usb 1-2: Product: USB Audio

Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20240904120536.115780-12-john.ogness@linutronix.de
Signed-off-by: Petr Mladek <pmladek@suse.com>
2024-09-04 15:56:32 +02:00
John Ogness
75d430372a printk: Provide helper for message prepending
In order to support prepending different texts to printk
messages, split out the prepending code into a helper
function.

Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20240904120536.115780-11-john.ogness@linutronix.de
Signed-off-by: Petr Mladek <pmladek@suse.com>
2024-09-04 15:56:32 +02:00
John Ogness
13189fa73a printk: nbcon: Rely on kthreads for normal operation
Once the kthread is running and available
(i.e. @printk_kthreads_running is set), the kthread becomes
responsible for flushing any pending messages which are added
in NBCON_PRIO_NORMAL context. Namely the legacy
console_flush_all() and device_release() no longer flush the
console. And nbcon_atomic_flush_pending() used by
nbcon_cpu_emergency_exit() no longer flushes messages added
after the emergency messages.

The console context is safe when used by the kthread only when
one of the following conditions is true:

  1. Other caller acquires the console context with
     NBCON_PRIO_NORMAL with preemption disabled. It will
     release the context before rescheduling.

  2. Other caller acquires the console context with
     NBCON_PRIO_NORMAL under the device_lock.

  3. The kthread is the only context which acquires the console
     with NBCON_PRIO_NORMAL.

This is satisfied for all atomic printing call sites:

nbcon_legacy_emit_next_record() (#1)

nbcon_atomic_flush_pending_con() (#1)

nbcon_device_release() (#2)

It is even double guaranteed when @printk_kthreads_running
is set because then _only_ the kthread will print for
NBCON_PRIO_NORMAL. (#3)

Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20240904120536.115780-10-john.ogness@linutronix.de
Signed-off-by: Petr Mladek <pmladek@suse.com>
2024-09-04 15:56:32 +02:00
John Ogness
5c586baa60 printk: nbcon: Use thread callback if in task context for legacy
When printing via console_lock, the write_atomic() callback is
used for nbcon consoles. However, if it is known that the
current context is a task context, the write_thread() callback
can be used instead.

Using write_thread() instead of write_atomic() helps to reduce
large disabled preemption regions when the device_lock does not
disable preemption.

This is mainly a preparatory change to allow avoiding
write_atomic() completely during normal operation if boot
consoles are registered.

As a side-effect, it also allows consolidating the printing
code for legacy printing and the kthread printer.

Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20240904120536.115780-9-john.ogness@linutronix.de
Signed-off-by: Petr Mladek <pmladek@suse.com>
2024-09-04 15:56:32 +02:00
John Ogness
9b79a3d0d6 printk: nbcon: Relocate nbcon_atomic_emit_one()
Move nbcon_atomic_emit_one() so that it can be used by
nbcon_kthread_func() in a follow-up commit.

Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20240904120536.115780-8-john.ogness@linutronix.de
Signed-off-by: Petr Mladek <pmladek@suse.com>
2024-09-04 15:56:32 +02:00
Thomas Gleixner
76f258bf3f printk: nbcon: Introduce printer kthreads
Provide the main implementation for running a printer kthread
per nbcon console that is takeover/handover aware. This
includes:

- new mandatory write_thread() callback
- kthread creation
- kthread main printing loop
- kthread wakeup mechanism
- kthread shutdown

kthread creation is a bit tricky because consoles may register
before kthreads can be created. In such cases, registration
will succeed, even though no kthread exists. Once kthreads can
be created, an early_initcall will set @printk_kthreads_ready.
If there are no registered boot consoles, the early_initcall
creates the kthreads for all registered nbcon consoles. If
kthread creation fails, the related console is unregistered.

If there are registered boot consoles when
@printk_kthreads_ready is set, no kthreads are created until
the final boot console unregisters.

Once kthread creation finally occurs, @printk_kthreads_running
is set so that the system knows kthreads are available for all
registered nbcon consoles.

If @printk_kthreads_running is already set when the console
is registering, the kthread is created during registration. If
kthread creation fails, the registration will fail.

Until @printk_kthreads_running is set, console printing occurs
directly via the console_lock.

kthread shutdown on system shutdown/reboot is necessary to
ensure the printer kthreads finish their printing so that the
system can cleanly transition back to direct printing via the
console_lock in order to reliably push out the final
shutdown/reboot messages. @printk_kthreads_running is cleared
before shutting down the individual kthreads.

The kthread uses a new mandatory write_thread() callback that
is called with both device_lock() and the console context
acquired.

The console ownership handling is necessary for synchronization
against write_atomic() which is synchronized only via the
console context ownership.

The device_lock() serializes acquiring the console context with
NBCON_PRIO_NORMAL. It is needed in case the device_lock() does
not disable preemption. It prevents the following race:

CPU0				CPU1

 [ task A ]

 nbcon_context_try_acquire()
   # success with NORMAL prio
   # .unsafe == false;  // safe for takeover

 [ schedule: task A -> B ]

				WARN_ON()
				  nbcon_atomic_flush_pending()
				    nbcon_context_try_acquire()
				      # success with EMERGENCY prio

				      # flushing
				      nbcon_context_release()

				      # HERE: con->nbcon_state is free
				      #       to take by anyone !!!

 nbcon_context_try_acquire()
   # success with NORMAL prio [ task B ]

 [ schedule: task B -> A ]

 nbcon_enter_unsafe()
   nbcon_context_can_proceed()

BUG: nbcon_context_can_proceed() returns "true" because
     the console is owned by a context on CPU0 with
     NBCON_PRIO_NORMAL.

     But it should return "false". The console is owned
     by a context from task B and we do the check
     in a context from task A.

Note that with these changes, the printer kthreads do not yet
take over full responsibility for nbcon printing during normal
operation. These changes only focus on the lifecycle of the
kthreads.

Co-developed-by: John Ogness <john.ogness@linutronix.de>
Signed-off-by: John Ogness <john.ogness@linutronix.de>
Signed-off-by: Thomas Gleixner (Intel) <tglx@linutronix.de>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20240904120536.115780-7-john.ogness@linutronix.de
Signed-off-by: Petr Mladek <pmladek@suse.com>
2024-09-04 15:56:32 +02:00
John Ogness
fb9fabf3d8 printk: nbcon: Init @nbcon_seq to highest possible
When initializing an nbcon console, have nbcon_alloc() set
@nbcon_seq to the highest possible sequence number. For all
practical purposes, this will guarantee that the console
will have nothing to print until later when @nbcon_seq is
set to the proper initial printing value.

This will be particularly important once kthread printing is
introduced because nbcon_alloc() can create/start the kthread
before the desired initial sequence number is known.

Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20240904120536.115780-6-john.ogness@linutronix.de
Signed-off-by: Petr Mladek <pmladek@suse.com>
2024-09-04 15:56:32 +02:00
John Ogness
6cb58cfebb printk: nbcon: Add context to usable() and emit()
The nbcon consoles will have two callbacks to be used for
different contexts. In order to determine if an nbcon console
is usable, console_is_usable() must know if it is a context
that will need to use the optional write_atomic() callback.
Also, nbcon_emit_next_record() must know which callback it
needs to call.

Add an extra parameter @use_atomic to console_is_usable() and
nbcon_emit_next_record() to specify this.

Since so far only the write_atomic() callback exists,
@use_atomic is set to true for all call sites.

For legacy consoles, @use_atomic is not used.

Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20240904120536.115780-5-john.ogness@linutronix.de
Signed-off-by: Petr Mladek <pmladek@suse.com>
2024-09-04 15:56:32 +02:00
John Ogness
0e53e2d9f7 printk: Flush console on unregister_console()
Ensure consoles have flushed pending records before
unregistering. The console should print up to at least its
related "console disabled" record.

Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20240904120536.115780-4-john.ogness@linutronix.de
Signed-off-by: Petr Mladek <pmladek@suse.com>
2024-09-04 15:56:32 +02:00
John Ogness
e37577ebbf printk: Fail pr_flush() if before SYSTEM_SCHEDULING
A follow-up change adds pr_flush() to console unregistration.
However, with boot consoles, unregistration can happen very
early if regular consoles are registering as well.
In this case the pr_flush() is not important because all
consoles are flushed when checking the initial console sequence
number.

Allow pr_flush() to fail if @system_state has not yet reached
SYSTEM_SCHEDULING. This avoids might_sleep() and msleep()
explosions that would otherwise occur:

[    0.436739][    T0] printk: legacy console [ttyS0] enabled
[    0.439820][    T0] printk: legacy bootconsole [earlyser0] disabled
[    0.446822][    T0] BUG: scheduling while atomic: swapper/0/0/0x00000002
[    0.450491][    T0] 1 lock held by swapper/0/0:
[    0.457897][    T0]  #0: ffffffff82ae5f88 (console_mutex){+.+.}-{4:4}, at: console_list_lock+0x20/0x70
[    0.463141][    T0] Modules linked in:
[    0.465307][    T0] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 6.10.0-rc1+ #372
[    0.469394][    T0] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014
[    0.474402][    T0] Call Trace:
[    0.476246][    T0]  <TASK>
[    0.481473][    T0]  dump_stack_lvl+0x93/0xb0
[    0.483949][    T0]  dump_stack+0x10/0x20
[    0.486256][    T0]  __schedule_bug+0x68/0x90
[    0.488753][    T0]  __schedule+0xb9b/0xd80
[    0.491179][    T0]  ? lock_release+0xb5/0x270
[    0.493732][    T0]  schedule+0x43/0x170
[    0.495998][    T0]  schedule_timeout+0xc5/0x1e0
[    0.498634][    T0]  ? __pfx_process_timeout+0x10/0x10
[    0.501522][    T0]  ? msleep+0x13/0x50
[    0.503728][    T0]  msleep+0x3c/0x50
[    0.505847][    T0]  __pr_flush.constprop.0.isra.0+0x56/0x500
[    0.509050][    T0]  ? _printk+0x58/0x80
[    0.511332][    T0]  ? lock_is_held_type+0x9c/0x110
[    0.514106][    T0]  unregister_console_locked+0xe1/0x450
[    0.517144][    T0]  register_console+0x509/0x620
[    0.519827][    T0]  ? __pfx_univ8250_console_init+0x10/0x10
[    0.523042][    T0]  univ8250_console_init+0x24/0x40
[    0.525845][    T0]  console_init+0x43/0x210
[    0.528280][    T0]  start_kernel+0x493/0x980
[    0.530773][    T0]  x86_64_start_reservations+0x18/0x30
[    0.533755][    T0]  x86_64_start_kernel+0xae/0xc0
[    0.536473][    T0]  common_startup_64+0x12c/0x138
[    0.539210][    T0]  </TASK>

And then the kernel goes into an infinite loop complaining about:

1. releasing a pinned lock
2. unpinning an unpinned lock
3. bad: scheduling from the idle thread!
4. goto 1
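
A hedged sketch of the early bail-out described above (the exact check
and its placement in __pr_flush() may differ):

  /* too early to sleep; report failure instead of msleep()ing */
  if (system_state < SYSTEM_SCHEDULING)
          return false;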

Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20240904120536.115780-3-john.ogness@linutronix.de
Signed-off-by: Petr Mladek <pmladek@suse.com>
2024-09-04 15:56:32 +02:00
John Ogness
bd07d86452 printk: nbcon: Add function for printers to reacquire ownership
Since ownership can be lost at any time due to handover or
takeover, a printing context _must_ be prepared to back out
immediately and carefully. However, there are scenarios where
the printing context must reacquire ownership in order to
finalize or revert hardware changes.

One such example is when interrupts are disabled during
printing. No other context will automagically re-enable the
interrupts. For this case, the disabling context _must_
reacquire nbcon ownership so that it can re-enable the
interrupts.

Provide nbcon_reacquire_nobuf() for exactly this purpose. It
allows a printing context to reacquire ownership using the same
priority as its previous ownership.

Note that after a successful reacquire the printing context
will have no output buffer because that has been lost. This
function cannot be used to resume printing.
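
A hedged sketch of the intended usage pattern; apart from
nbcon_reacquire_nobuf() itself, the names below are placeholders for
driver-side code:

  /* write callback: mask device interrupts while emitting */
  uart_mask_irqs(port);                     /* placeholder helper */
  if (!emit_chars(wctxt)) {                 /* placeholder helper */
          /*
           * Ownership was handed over or taken over mid-print. The
           * output buffer is gone, but reacquire (same priority) so
           * the hardware change below can be safely reverted.
           */
          nbcon_reacquire_nobuf(wctxt);
  }
  uart_unmask_irqs(port);                   /* placeholder helper */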

Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20240904120536.115780-2-john.ogness@linutronix.de
Signed-off-by: Petr Mladek <pmladek@suse.com>
2024-09-04 15:56:31 +02:00
John Ogness
d33d5e683b printk: nbcon: Use raw_cpu_ptr() instead of open coding
There is no need to open code a non-migration-checking
this_cpu_ptr(). That is exactly what raw_cpu_ptr() is.
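
Generic illustration of the replacement (not the actual printk hunk):

  -       ptr = per_cpu_ptr(&var, raw_smp_processor_id());
  +       ptr = raw_cpu_ptr(&var);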

Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/87plpum4jw.fsf@jogness.linutronix.de
Signed-off-by: Petr Mladek <pmladek@suse.com>
2024-09-04 12:28:25 +02:00
Thorsten Blum
8db70faeab cpu: Fix W=1 build kernel-doc warning
Building the kernel with W=1 generates the following warning:

  kernel/cpu.c:2693: warning: This comment starts with '/**',
  		     but isn't a kernel-doc comment.

The function topology_is_core_online() is a simple helper function and
doesn't need a kernel-doc comment.

Use a normal comment instead.

Signed-off-by: Thorsten Blum <thorsten.blum@toblux.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20240825221152.71951-2-thorsten.blum@toblux.com
2024-09-04 12:18:10 +02:00
Thomas Gleixner
eb876ea724 Merge branch 'linus' into smp/core
Pull in upstream changes so further patches don't conflict.
2024-09-04 12:15:38 +02:00
Anna-Maria Behnsen
79f8b28e85 timers: Annotate possible non critical data race of next_expiry
Global timers could be expired remotely when the target CPU is idle. After
a remote timer expiry, the remote timer_base->next_expiry value is updated
while holding the timer_base->lock. When the formerly idle CPU becomes
active at the same time and checks whether timers need to expire, this
check is done lockless as it is on the local CPU. This could lead to a data
race, which was reported by sysbot:

  https://lore.kernel.org/r/000000000000916e55061f969e14@google.com

When the value is read lockless but changed by the remote CPU, only two non
critical scenarios could happen:

1) The already updated value is read -> everything is perfect

2) The old value is read -> a superfluous timer soft interrupt is raised

The same situation could happen when enqueueing a new first pinned timer by
a remote CPU also with non critical scenarios:

1) The already updated value is read -> everything is perfect

2) The old value is read -> when the CPU is idle, an IPI is executed
nevertheless, and when the CPU isn't idle, the updated value will be visible
on the next tick and the timer might be late by one jiffy.

As this is very unlikely to happen, doing the check under the lock would be
way more effort than accepting a superfluous timer soft interrupt or a
possible 1 jiffy delay of the timer.

Document and annotate this non critical behavior in the code by using
READ/WRITE_ONCE() pair when accessing timer_base->next_expiry.
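
A minimal sketch of the annotation pattern, simplified from the
description above (the real code in kernel/time/timer.c differs in
detail):

  /* remote expiry path, under timer_base->lock */
  WRITE_ONCE(base->next_expiry, next);

  /* lockless check on the (formerly idle) local CPU */
  if (time_after_eq(jiffies, READ_ONCE(base->next_expiry)))
          raise_softirq(TIMER_SOFTIRQ);   /* worst case: superfluous */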

Reported-by: syzbot+bf285fcc0a048e028118@syzkaller.appspotmail.com
Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/all/20240829154305.19259-1-anna-maria@linutronix.de
Closes: https://lore.kernel.org/lkml/000000000000916e55061f969e14@google.com
2024-09-04 11:57:56 +02:00
Jinjie Ruan
85a147a986 printk: Use the BITS_PER_LONG macro
sizeof(unsigned long) * 8 is the number of bits in an unsigned long
variable. Replace it with the BITS_PER_LONG macro to make it simpler.
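
The replacement amounts to hunks of the following form (illustrative,
not the exact diff):

  -       nr_bits = sizeof(unsigned long) * 8;
  +       nr_bits = BITS_PER_LONG;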

Signed-off-by: Jinjie Ruan <ruanjinjie@huawei.com>
Reviewed-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20240903035358.308482-1-ruanjinjie@huawei.com
Signed-off-by: Petr Mladek <pmladek@suse.com>
2024-09-04 11:57:48 +02:00
Tejun Heo
37cb049ef8 sched_ext: Remove sched_class->switch_class()
With sched_ext converted to use put_prev_task() for class switch detection,
there's no user of switch_class() left. Drop it.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
2024-09-03 21:54:29 -10:00
Tejun Heo
f422316d74 sched_ext: Remove switch_class_scx()
Now that put_prev_task_scx() is called with @next on task switches, there's
no reason to use sched_class.switch_class(). Rename switch_class_scx() to
switch_class() and call it from put_prev_task_scx().

Signed-off-by: Tejun Heo <tj@kernel.org>
2024-09-03 21:54:29 -10:00
Tejun Heo
65aaf90569 sched_ext: Relocate functions in kernel/sched/ext.c
Relocate functions to ease the removal of switch_class_scx(). No functional
changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
2024-09-03 21:54:29 -10:00
Tejun Heo
753e2836d1 sched_ext: Unify regular and core-sched pick task paths
Because the BPF scheduler's dispatch path is invoked from balance(),
sched_ext needs to invoke balance_one() on all sibling rq's before picking
the next task for core-sched.

Before the recent pick_next_task() updates, sched_ext couldn't share the pick
task logic between regular and core-sched paths because pick_next_task() depended
on put_prev_task() being called on the current task. Tasks currently running
on sibling rq's can't be put when one rq is trying to pick the next task, so
pick_task_scx() had to have a separate mechanism to pick between a sibling
rq's current task and the first task in its local DSQ.

However, with the preceding updates, pick_next_task_scx() no longer depends
on the current task being put and can compare the current task and the next
in line statelessly, and the pick task logic should be shareable between
regular and core-sched paths.

Unify regular and core-sched pick task paths:

- There's no reason to distinguish local and sibling picks anymore. @local
  is removed from balance_one().

- pick_next_task_scx() is turned into pick_task_scx() by dropping the
  put_prev_set_next_task() call.

- The old pick_task_scx() is dropped.

Signed-off-by: Tejun Heo <tj@kernel.org>
2024-09-03 21:54:29 -10:00
Tejun Heo
8b1451f2f7 sched_ext: Replace SCX_TASK_BAL_KEEP with SCX_RQ_BAL_KEEP
SCX_TASK_BAL_KEEP is used by balance_one() to tell pick_next_task_scx() to
keep running the current task. It's not really a task property. Replace it
with SCX_RQ_BAL_KEEP which resides in rq->scx.flags and is a better fit for
the usage. Also, the existing clearing rule is unnecessarily strict and
makes it difficult to use with core-sched. Just clear it on entry to
balance_one().

Signed-off-by: Tejun Heo <tj@kernel.org>
2024-09-03 21:54:28 -10:00
Tejun Heo
7c65ae81ea sched_ext: Don't call put_prev_task_scx() before picking the next task
fd03c5b858 ("sched: Rework pick_next_task()") changed the definition of
pick_next_task() from:

  pick_next_task() := pick_task() + set_next_task(.first = true)

to:

  pick_next_task(prev) := pick_task() + put_prev_task() + set_next_task(.first = true)

making it pick_next_task()'s responsibility to invoke put_prev_task(). This
reordering allows pick_task() to be shared between regular and core-sched
paths and put_prev_task() to know the next task.

sched_ext depended on put_prev_task_scx() enqueueing the current task before
pick_next_task_scx() is called. While pulling sched/core changes,
70cc76aa0d80 ("Merge branch 'tip/sched/core' into for-6.12") added an
explicit put_prev_task_scx() call for SCX tasks in pick_next_task_scx()
before picking the first task as a workaround.

Clean it up and adopt the conventions that other sched classes are
following.

The logic for keeping the current task running was spread out and required
the task to be put on the local DSQ before picking:

  - balance_one() used SCX_TASK_BAL_KEEP to indicate that the task is still
    runnable, hasn't exhausted its slice, and thus should keep running.

  - put_prev_task_scx() enqueued the task to local DSQ if SCX_TASK_BAL_KEEP
    is set. It also called do_enqueue_task() with SCX_ENQ_LAST if it is the
    only runnable task. do_enqueue_task() in turn decided whether to use the
    local DSQ depending on SCX_OPS_ENQ_LAST.

Consolidate the logic in balance_one() as it always knows whether it is
going to keep the current task. balance_one() now considers all conditions
where the current task should be kept and uses SCX_TASK_BAL_KEEP to tell
pick_next_task_scx() to keep the current task instead of picking one from
the local DSQ. Accordingly, SCX_ENQ_LAST handling is removed from
put_prev_task_scx() and do_enqueue_task() and pick_next_task_scx() is
updated to pick the current task if SCX_TASK_BAL_KEEP is set.

The workaround put_prev_task[_scx]() calls are replaced with
put_prev_set_next_task().

This causes two behavior changes observable from the BPF scheduler:

- When a task keeps running, it no longer goes through an enqueue/dequeue
  cycle and thus ops.stopping/running() transitions. The new behavior is
  better and all the existing schedulers should be able to handle it.

- The BPF scheduler cannot keep executing the current task by enqueueing
  SCX_ENQ_LAST task to the local DSQ. If SCX_OPS_ENQ_LAST is specified, the
  BPF scheduler is responsible for resuming execution after each
  SCX_ENQ_LAST. SCX_OPS_ENQ_LAST is mostly useful for cases where scheduling
  decisions are not made on the local CPU - e.g. central or userspace-driven
  scheduling - and the new behavior is more logical and shouldn't pose any
  problems. SCX_OPS_ENQ_LAST demonstration from scx_qmap is dropped as it
  doesn't fit that well anymore and the last task handling is moved to the
  end of qmap_dispatch().

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: David Vernet <void@manifault.com>
Cc: Andrea Righi <righi.andrea@gmail.com>
Cc: Changwoo Min <multics69@gmail.com>
Cc: Daniel Hodges <hodges.daniel.scott@gmail.com>
Cc: Dan Schatzberg <schatzberg.dan@gmail.com>
2024-09-03 21:54:28 -10:00
Yujie Liu
f22cde4371 sched/numa: Fix the vma scan starving issue
Problem statement:
Since commit fc137c0dda ("sched/numa: enhance vma scanning logic"), the
Numa vma scan overhead has been reduced a lot.  Meanwhile, the reduced
vma scanning might generate less Numa page fault information.  The
insufficient information makes it harder for the Numa balancer to make
decisions.  Later, commit b7a5b537c5 ("sched/numa: Complete scanning of
partial VMAs regardless of PID activity") and commit 84db47ca71
("sched/numa: Fix mm numa_scan_seq based unconditional scan") were found to
bring back part of the performance.

Recently when running SPECcpu omnetpp_r on a 320 CPUs/2 Sockets system, a
long duration of remote Numa node reads was observed via PMU events: a few
cores had ~500MB/s remote memory access for ~20 seconds.  This causes
high core-to-core variance and a performance penalty.  After the
investigation, it was found that many vmas are skipped due to the active
PID check.  According to the trace events, in most cases,
vma_is_accessed() returns false because the history access info stored in
pids_active array has been cleared.

Proposal:
The main idea is to adjust vma_is_accessed() to let it return true more
easily: compare the diff between mm->numa_scan_seq and
vma->numab_state->prev_scan_seq.  If the diff has exceeded the threshold,
scan the vma.
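
A hedged sketch of the added check in vma_is_accessed(); the threshold
name is a placeholder, the field names follow the text above:

  /* force a scan if this vma has been starved for too long */
  if (mm->numa_scan_seq - vma->numab_state->prev_scan_seq >=
      VMA_RESCAN_THRESHOLD)               /* placeholder constant */
          return true;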

This patch especially helps the cases where there is a small number of
threads, like the process-based SPECcpu.  Without this patch, if the
SPECcpu process accesses the vma at the beginning, then sleeps for a long
time, the pids_active array will be cleared.  As a result, if this process
is woken up again, it never has a chance to set prot_none anymore,
because only the first 2 accesses are granted a vma scan:
(current->mm->numa_scan_seq - vma->numab_state->start_scan_seq) < 2.  To
make it worse, no other threads within the task can help set the
prot_none.  This causes information loss.

Raghavendra helped test current patch and got the positive result
on the AMD platform:

autonumabench NUMA01
                            base                  patched
Amean     syst-NUMA01      194.05 (   0.00%)      165.11 *  14.92%*
Amean     elsp-NUMA01      324.86 (   0.00%)      315.58 *   2.86%*

Duration User      380345.36   368252.04
Duration System      1358.89     1156.23
Duration Elapsed     2277.45     2213.25

autonumabench NUMA02

Amean     syst-NUMA02        1.12 (   0.00%)        1.09 *   2.93%*
Amean     elsp-NUMA02        3.50 (   0.00%)        3.56 *  -1.84%*

Duration User        1513.23     1575.48
Duration System         8.33        8.13
Duration Elapsed       28.59       29.71

kernbench

Amean     user-256    22935.42 (   0.00%)    22535.19 *   1.75%*
Amean     syst-256     7284.16 (   0.00%)     7608.72 *  -4.46%*
Amean     elsp-256      159.01 (   0.00%)      158.17 *   0.53%*

Duration User       68816.41    67615.74
Duration System     21873.94    22848.08
Duration Elapsed      506.66      504.55

Intel 256 CPUs/2 Sockets:
autonuma benchmark also shows improvements:

                                               v6.10-rc5              v6.10-rc5
                                                                         +patch
Amean     syst-NUMA01                  245.85 (   0.00%)      230.84 *   6.11%*
Amean     syst-NUMA01_THREADLOCAL      205.27 (   0.00%)      191.86 *   6.53%*
Amean     syst-NUMA02                   18.57 (   0.00%)       18.09 *   2.58%*
Amean     syst-NUMA02_SMT                2.63 (   0.00%)        2.54 *   3.47%*
Amean     elsp-NUMA01                  517.17 (   0.00%)      526.34 *  -1.77%*
Amean     elsp-NUMA01_THREADLOCAL       99.92 (   0.00%)      100.59 *  -0.67%*
Amean     elsp-NUMA02                   15.81 (   0.00%)       15.72 *   0.59%*
Amean     elsp-NUMA02_SMT               13.23 (   0.00%)       12.89 *   2.53%*

                   v6.10-rc5   v6.10-rc5
                                  +patch
Duration User     1064010.16  1075416.23
Duration System      3307.64     3104.66
Duration Elapsed     4537.54     4604.73

The SPECcpu remote node access issue disappears with the patch applied.

Link: https://lkml.kernel.org/r/20240827112958.181388-1-yu.c.chen@intel.com
Fixes: fc137c0dda ("sched/numa: enhance vma scanning logic")
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Co-developed-by: Chen Yu <yu.c.chen@intel.com>
Signed-off-by: Yujie Liu <yujie.liu@intel.com>
Reported-by: Xiaoping Zhou <xiaoping.zhou@intel.com>
Reviewed-and-tested-by: Raghavendra K T <raghavendra.kt@amd.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: "Chen, Tim C" <tim.c.chen@intel.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Raghavendra K T <raghavendra.kt@amd.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-09-03 21:15:56 -07:00
Matthew Wilcox (Oracle)
4ffca5a966 mm: support only one page_type per page
By using a few values in the top byte, users of page_type can store up to
24 bits of additional data in page_type.  It also reduces the code size as
(with the replacement of READ_ONCE() with data_race()) the kernel can check
just a single byte, e.g.:

ffffffff811e3a79:       8b 47 30                mov    0x30(%rdi),%eax
ffffffff811e3a7c:       55                      push   %rbp
ffffffff811e3a7d:       48 89 e5                mov    %rsp,%rbp
ffffffff811e3a80:       25 00 00 00 82          and    $0x82000000,%eax
ffffffff811e3a85:       3d 00 00 00 80          cmp    $0x80000000,%eax
ffffffff811e3a8a:       74 4d                   je     ffffffff811e3ad9 <folio_mapping+0x69>

becomes:

ffffffff811e3a69:       80 7f 33 f5             cmpb   $0xf5,0x33(%rdi)
ffffffff811e3a6d:       55                      push   %rbp
ffffffff811e3a6e:       48 89 e5                mov    %rsp,%rbp
ffffffff811e3a71:       74 4d                   je     ffffffff811e3ac0 <folio_mapping+0x60>

replacing three instructions with one.

[wangkefeng.wang@huawei.com: fix ubsan warnings]
  Link: https://lkml.kernel.org/r/2d19c48a-c550-4345-bf36-d05cd303c5de@huawei.com
Link: https://lkml.kernel.org/r/20240821173914.2270383-4-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Cc: Kent Overstreet <kent.overstreet@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-09-03 21:15:43 -07:00
Mike Rapoport (Microsoft)
0e8b67982b mm: move kernel/numa.c to mm/
Patch series "mm: introduce numa_memblks", v4.

Following the discussion about handling of CXL fixed memory windows on
arm64 [1] I decided to bite the bullet and move numa_memblks from x86 to
the generic code so they will be available on arm64/riscv and maybe on
loongarch sometime later.

While it could be possible to use memblock to describe CXL memory windows,
it currently lacks the notion of unpopulated memory ranges, which
numa_memblks does implement.

Another reason to make numa_memblks generic is that both arch_numa (arm64
and riscv) and loongarch use a trimmed copy of the x86 code although there
is no fundamental reason why the same code cannot be used on all these
platforms.  Having numa_memblks in mm/ will make its interaction with
ACPI and FDT more consistent and I believe will reduce the maintenance
burden.

And with generic numa_memblks it is (almost) straightforward to enable
NUMA emulation on arm64 and riscv.

The first 9 commits in this series are cleanups that are not strictly
related to numa_memblks.
Commits 10-16 slightly reorder code in x86 to allow extracting numa_memblks
and NUMA emulation to the generic code.
Commits 17-19 actually move the code from arch/x86/ to mm/ and commits 20-22
do some aftermath cleanups.
Commit 23 updates of_numa_init() to return an error if no NUMA nodes were
found in the device tree.
Commit 24 switches arch_numa to numa_memblks.
Commit 25 enables usage of phys_to_target_node() and
memory_add_physaddr_to_nid() with numa_memblks.
Commit 26 moves the description for numa=fake from x86 to admin-guide.

[1] https://lore.kernel.org/all/20240529171236.32002-1-Jonathan.Cameron@huawei.com/


This patch (of 26):

The stub functions in kernel/numa.c belong to mm/ rather than to kernel/

Link: https://lkml.kernel.org/r/20240807064110.1003856-1-rppt@kernel.org
Link: https://lkml.kernel.org/r/20240807064110.1003856-2-rppt@kernel.org
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Tested-by: Zi Yan <ziy@nvidia.com> # for x86_64 and arm64
Tested-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> [arm64 + CXL via QEMU]
Acked-by: Dan Williams <dan.j.williams@intel.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Andreas Larsson <andreas@gaisler.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: David S. Miller <davem@davemloft.net>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Huacai Chen <chenhuacai@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jiaxun Yang <jiaxun.yang@flygoat.com>
Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Rafael J. Wysocki <rafael@kernel.org>
Cc: Rob Herring (Arm) <robh@kernel.org>
Cc: Samuel Holland <samuel.holland@sifive.com>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-09-03 21:15:26 -07:00
Chen Yu
f689a3ab7b dma-direct: optimize page freeing when it is not addressable
When the CMA allocation succeeds but isn't addressable, its buffer has
already been released and the page is set to NULL.  So later when the
normal page allocation succeeds but isn't addressable, __free_pages()
can be used to free that normal page rather than using
dma_free_contiguous(), which does extra checks that are not needed.

Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2024-09-04 07:08:51 +03:00
Christoph Hellwig
de6c85bf91 dma-mapping: clearly mark DMA ops as an architecture feature
DMA ops are a helper for architectures and not for drivers to override
the DMA implementation.

Unfortunately driver authors keep ignoring this.  Make the fact clearer
by renaming the symbol to ARCH_HAS_DMA_OPS and having the two drivers that
override their dma_ops depend on that.  These drivers should probably be
marked broken, but we can give them a bit of a grace period for that.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Sakari Ailus <sakari.ailus@linux.intel.com> # for IPU6
Acked-by: Robin Murphy <robin.murphy@arm.com>
2024-09-04 07:08:51 +03:00
Tejun Heo
d7b01aef9d Merge branch 'tip/sched/core' into for-6.12
- Resolve trivial context conflicts from dl_server clearing being moved
  around.

- Add @next to put_prev_task_scx() and @prev to pick_next_task_scx() to
  match sched/core.

- Merge sched_class->switch_class() addition from sched_ext with
  tip/sched/core changes in __pick_next_task().

- Make pick_next_task_scx() call put_prev_task_scx() to emulate the previous
  behavior where sched_class->put_prev_task() was called before
  sched_class->pick_next_task().

While this makes sched_ext build and function, the behavior is not in line
with other sched classes. The follow-up patches will address the
discrepancies and remove sched_class->switch_class().

Signed-off-by: Tejun Heo <tj@kernel.org>
2024-09-03 12:49:18 -10:00
Hongbo Li
8c1867a2f0 audit: Make use of str_enabled_disabled() helper
Use the str_enabled_disabled() helper instead of open-coding
the same.
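
The conversion is of this general form (illustrative, not the exact
audit hunk):

  -       s = enabled ? "enabled" : "disabled";
  +       s = str_enabled_disabled(enabled);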

Signed-off-by: Hongbo Li <lihongbo22@huawei.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
2024-09-03 16:35:16 -04:00
Sven Schnelle
e240b0fde5 uprobes: Use kzalloc to allocate xol area
To prevent uninitialized members, use kzalloc() to allocate
the xol area.
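
The fix boils down to a hunk of this shape (illustrative):

  -       area = kmalloc(sizeof(*area), GFP_KERNEL);
  +       area = kzalloc(sizeof(*area), GFP_KERNEL);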

Fixes: b059a453b1 ("x86/vdso: Add mremap hook to vm_special_mapping")
Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Link: https://lore.kernel.org/r/20240903102313.3402529-1-svens@linux.ibm.com
2024-09-03 16:54:02 +02:00
Peter Zijlstra
b2d70222db sched: Add put_prev_task(.next)
In order to tell the previous sched_class what the next task is, add
put_prev_task(.next).

Notably, SCX will use this to:

 1) determine whether the next task will leave the SCX sched class and
    push the current task to another CPU if possible.
 2) gather statistics on how often and by which other classes it gets
    preempted

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20240813224016.367421076@infradead.org
2024-09-03 15:26:32 +02:00
Peter Zijlstra
bd9bbc96e8 sched: Rework dl_server
When a task is selected through a dl_server, it will have p->dl_server
set, such that it can account runtime to the dl_server, see
update_curr_task().

Currently p->dl_server is set in pick*task() whenever it goes through
the dl_server, clearing it is a bit of a mess though. The trivial
solution is clearing it on the final put (now that we have this
location).

However, this gives a problem when:

	p = pick_task(rq);
	if (p)
		put_prev_set_next_task(rq, prev, next);

picks the same task but through a different path, notably when it goes
from picking through the dl_server to a direct pick or vice-versa. In
that case we cannot readily determine whether we should clear or
preserve p->dl_server.

An additional complication is pick_*task() setting p->dl_server for a
remote pick; it might still need to update runtime before it schedules
the core_pick.

Close all these holes and remove all the random clearing of
p->dl_server by:

 - having pick_*task() manage rq->dl_server

 - having the final put_prev_task() clear p->dl_server

 - having the first set_next_task() set p->dl_server = rq->dl_server

 - complicate the core_sched code to save/restore rq->dl_server where
   appropriate.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20240813224016.259853414@infradead.org
2024-09-03 15:26:32 +02:00
Peter Zijlstra
436f3eed5c sched: Combine the last put_prev_task() and the first set_next_task()
Ensure the last put_prev_task() and the first set_next_task() always
go together.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20240813224016.158454756@infradead.org
2024-09-03 15:26:31 +02:00
Peter Zijlstra
fd03c5b858 sched: Rework pick_next_task()
The current rule is that:

  pick_next_task() := pick_task() + set_next_task(.first = true)

And many classes implement it directly as such. Change things around
to make pick_next_task() optional while also changing the definition to:

  pick_next_task(prev) := pick_task() + put_prev_task() + set_next_task(.first = true)

The reason is that sched_ext would like to have a 'final' call that
knows the next task. By placing put_prev_task() right next to
set_next_task() (as it already is for sched_core) this becomes
trivial.

As a bonus, this is a nice cleanup on its own.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20240813224016.051225657@infradead.org
2024-09-03 15:26:31 +02:00
Peter Zijlstra
260598f142 sched: Split up put_prev_task_balance()
With the goal of pushing put_prev_task() after pick_task() / into
pick_next_task().

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20240813224015.943143811@infradead.org
2024-09-03 15:26:31 +02:00
Peter Zijlstra
4686cc598f sched: Clean up DL server vs core sched
Abide by the simple rule:

  pick_next_task() := pick_task() + set_next_task(.first = true)

This allows us to trivially get rid of server_pick_next() and things
collapse nicely.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20240813224015.837303391@infradead.org
2024-09-03 15:26:31 +02:00
Peter Zijlstra
dae4320b29 sched: Fixup set_next_task() implementations
The rule is that:

  pick_next_task() := pick_task() + set_next_task(.first = true)

Turns out, there are still a few things in pick_next_task() that are
missing from that combination.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20240813224015.724111109@infradead.org
2024-09-03 15:26:30 +02:00
Peter Zijlstra
7d2180d9d9 sched: Use set_next_task(.first) where required
Turns out the core_sched bits forgot to use the
set_next_task(.first=true) variant. Notably:

  pick_next_task() := pick_task() + set_next_task(.first = true)

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20240813224015.614146342@infradead.org
2024-09-03 15:26:30 +02:00
Valentin Schneider
75b6499024 sched/fair: Properly deactivate sched_delayed task upon class change
__sched_setscheduler() goes through an enqueue/dequeue cycle like so:

  flags := DEQUEUE_SAVE | DEQUEUE_MOVE | DEQUEUE_NOCLOCK;
  prev_class->dequeue_task(rq, p, flags);
  new_class->enqueue_task(rq, p, flags);

when prev_class := fair_sched_class, this is followed by:

  dequeue_task(rq, p, DEQUEUE_NOCLOCK | DEQUEUE_SLEEP);

the idea being that since the task has switched classes, we need to drop
the sched_delayed logic and have that task be deactivated per its previous
dequeue_task(..., DEQUEUE_SLEEP).

Unfortunately, this leaves the task on_rq. This is missing the tail end of
dequeue_entities() that issues __block_task(), which __sched_setscheduler()
won't have done due to not using DEQUEUE_DELAYED - not that it should, as
it is pretty much a fair_sched_class specific thing.

Make switched_from_fair() properly deactivate sched_delayed tasks upon
class changes via __block_task(), as if a
  dequeue_task(..., DEQUEUE_DELAYED)
had been issued.

Fixes: 2e0199df25 ("sched/fair: Prepare exit/cleanup paths for delayed_dequeue")
Reported-by: "Paul E. McKenney" <paulmck@kernel.org>
Reported-by: Chen Yu <yu.c.chen@intel.com>
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20240829135353.1524260-1-vschneid@redhat.com
2024-09-03 15:26:30 +02:00
Huang Shijie
9c602adb79 sched/deadline: Fix schedstats vs deadline servers
In dl_server_start(), when schedstats is enabled, the following
happens:

  dl_server_start()
    dl_se->dl_server = 1;
    enqueue_dl_entity()
      update_stats_enqueue_dl()
        __schedstats_from_dl_se()
          dl_task_of()
            BUG_ON(dl_server(dl_se));

Since only tasks have schedstats and internal entries do not, avoid
trying to update stats in this case.

Fixes: 63ba8422f8 ("sched/deadline: Introduce deadline servers")
Signed-off-by: Huang Shijie <shijie@os.amperecomputing.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Juri Lelli <juri.lelli@redhat.com>
Link: https://lkml.kernel.org/r/20240829031111.12142-1-shijie@os.amperecomputing.com
2024-09-03 15:26:30 +02:00
Chen Yu
e16c7b0778 kthread: fix task state in kthread worker if being frozen
When analyzing a kernel warning message, Peter pointed out that there is a
race condition when the kworker is being frozen and falls into
try_to_freeze() with TASK_INTERRUPTIBLE, which could trigger a
might_sleep() warning in try_to_freeze().  Although the root cause is not
related to freeze() [1], it is still worthwhile to fix this issue ahead of
time.

One possible race scenario:

        CPU 0                                           CPU 1
        -----                                           -----

        // kthread_worker_fn
        set_current_state(TASK_INTERRUPTIBLE);
                                                       suspend_freeze_processes()
                                                         freeze_processes
                                                           static_branch_inc(&freezer_active);
                                                         freeze_kernel_threads
                                                           pm_nosig_freezing = true;
        if (work) { //false
          __set_current_state(TASK_RUNNING);

        } else if (!freezing(current)) //false, been frozen

                      freezing():
                      if (static_branch_unlikely(&freezer_active))
                        if (pm_nosig_freezing)
                          return true;
          schedule()
	}

        // state is still TASK_INTERRUPTIBLE
        try_to_freeze()
          might_sleep() <--- warning

Fix this by explicitly setting the task state to TASK_RUNNING before
entering try_to_freeze().
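
A hedged sketch of the fix in kthread_worker_fn(), simplified from the
scenario above:

  } else if (!freezing(current)) {
          schedule();
  } else {
          /*
           * Being frozen: still TASK_INTERRUPTIBLE here, but
           * try_to_freeze() may sleep, so become TASK_RUNNING first.
           */
          __set_current_state(TASK_RUNNING);
  }

  try_to_freeze();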

Link: https://lore.kernel.org/lkml/Zs2ZoAcUsZMX2B%2FI@chenyu5-mobl2/ [1]
Link: https://lkml.kernel.org/r/20240827112308.181081-1-yu.c.chen@intel.com
Fixes: b56c0d8937 ("kthread: implement kthread_worker")
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Suggested-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Andreas Gruenbacher <agruenba@redhat.com>
Cc: David Gow <davidgow@google.com>
Cc: Mateusz Guzik <mjguzik@gmail.com>
Cc: Mickaël Salaün <mic@digikod.net>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-09-01 20:43:41 -07:00
Sourabh Jain
c91c6062d6 Document/kexec: generalize crash hotplug description
Commit 79365026f8 ("crash: add a new kexec flag for hotplug support")
generalizes the crash hotplug support to allow architectures to update
multiple kexec segments on CPU/Memory hotplug and not just elfcorehdr. 
Therefore, update the relevant kernel documentation to reflect the same.

No functional change.

Link: https://lkml.kernel.org/r/20240812041651.703156-1-sourabhjain@linux.ibm.com
Signed-off-by: Sourabh Jain <sourabhjain@linux.ibm.com>
Reviewed-by: Petr Tesarik <ptesarik@suse.com>
Acked-by: Baoquan He <bhe@redhat.com>
Cc: Hari Bathini <hbathini@linux.ibm.com>
Cc: Petr Tesarik <petr@tesarici.cz>
Cc: Sourabh Jain <sourabhjain@linux.ibm.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-09-01 20:43:37 -07:00
Jani Nikula
6ce2082fd3 fault-inject: improve build for CONFIG_FAULT_INJECTION=n
The fault-inject.h users across the kernel need to add a lot of #ifdef
CONFIG_FAULT_INJECTION to cater for shortcomings in the header.  Make
fault-inject.h self-contained for CONFIG_FAULT_INJECTION=n, and add stubs
for DECLARE_FAULT_ATTR(), setup_fault_attr(), should_fail_ex(), and
should_fail() to allow removal of conditional compilation.
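
For instance, the should_fail() stub for CONFIG_FAULT_INJECTION=n
presumably reduces to something like this (hedged sketch; the actual
header may differ):

  static inline bool should_fail(struct fault_attr *attr, ssize_t size)
  {
          return false;   /* never inject when the feature is off */
  }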

[akpm@linux-foundation.org: repair fallout from no longer including debugfs.h into fault-inject.h]
[akpm@linux-foundation.org: fix drivers/misc/xilinx_tmr_inject.c]
[akpm@linux-foundation.org: Add debugfs.h inclusion to more files, per Stephen]
Link: https://lkml.kernel.org/r/20240813121237.2382534-1-jani.nikula@intel.com
Fixes: 6ff1cb355e ("[PATCH] fault-injection capabilities infrastructure")
Signed-off-by: Jani Nikula <jani.nikula@intel.com>
Cc: Akinobu Mita <akinobu.mita@gmail.com>
Cc: Abhinav Kumar <quic_abhinavk@quicinc.com>
Cc: Dmitry Baryshkov <dmitry.baryshkov@linaro.org>
Cc: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
Cc: Lucas De Marchi <lucas.demarchi@intel.com>
Cc: Rob Clark <robdclark@gmail.com>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-09-01 20:43:33 -07:00
Waiman Long
97cf8f5f93 watchdog: handle the ENODEV failure case of lockup_detector_delay_init() separately
When watchdog_hardlockup_probe() is being called by
lockup_detector_delay_init(), an error return of -ENODEV will happen
for the arm64 arch when arch_perf_nmi_is_available() returns false. This
means that NMI is not usable by the hard lockup detector and so has to
be disabled. This can be considered a deficiency in that particular
arm64 chip, but there is nothing we can do about it.  That also means
the following error will always be reported when the kernel boots up.

  watchdog: Delayed init of the lockup detector failed: -19

The word "failed" itself has a connotation that there is something
wrong with the kernel, which is not really the case here. Handle this
special ENODEV case separately and explain the reason behind disabling
the hard lockup detector without causing anxiety for those users who read
the above message and wonder about it.

Link: https://lkml.kernel.org/r/20240802151621.617244-1-longman@redhat.com
Signed-off-by: Waiman Long <longman@redhat.com>
Cc: Douglas Anderson <dianders@chromium.org>
Cc: Joel Granados <j.granados@samsung.com>
Cc: Li Zhe <lizhe.67@bytedance.com>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Thomas Weißschuh <linux@weissschuh.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-09-01 20:43:32 -07:00
Jeff Johnson
588661fd87 locking/ww_mutex/test: add MODULE_DESCRIPTION()
Fix the 'make W=1' warning:
WARNING: modpost: missing MODULE_DESCRIPTION() in kernel/locking/test-ww_mutex.o

Link: https://lkml.kernel.org/r/20240730-module_description_orphans-v1-5-7094088076c8@quicinc.com
Signed-off-by: Jeff Johnson <quic_jjohnson@quicinc.com>
Acked-by: Waiman Long <longman@redhat.com>
Cc: Alexandre Torgue <alexandre.torgue@foss.st.com>
Cc: Alistar Popple <alistair@popple.id.au>
Cc: Andrew Jeffery <andrew@codeconstruct.com.au>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Eddie James <eajames@linux.ibm.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jeremy Kerr <jk@ozlabs.org>
Cc: Joel Stanley <joel@jms.id.au>
Cc: Karol Herbst <karolherbst@gmail.com>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Maxime Coquelin <mcoquelin.stm32@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Naveen N Rao <naveen@kernel.org>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Nouveau <nouveau@lists.freedesktop.org>
Cc: Pekka Paalanen <ppaalanen@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J. Wysocki <rafael@kernel.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Steven Rostedt (Google) <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-09-01 20:43:31 -07:00
Jinjie Ruan
59d58189f3 crash: fix crash memory reserve exceed system memory bug
On an x86_32 Qemu machine with 1GB of memory, the cmdline "crashkernel=4G"
is wrongly accepted as ok, as below:
	crashkernel reserved: 0x0000000020000000 - 0x0000000120000000 (4096 MB)

It's similar on other architectures, such as ARM32 and RISCV32.

The cause is that crash_size is parsed and printed with the "unsigned long
long" data type, which is 8 bytes, but the allocation is done with
"phys_addr_t", which is 4 bytes, in memblock_phys_alloc_range().

Fix it by checking if crash_size is greater than the system RAM size and
returning an error if so.

After this patch, the above confusing reserve success info is no longer
printed.
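
A hedged sketch of the added sanity check (placement and the warning
text are approximate):

  total_mem = memblock_phys_mem_size();
  if (crash_size >= total_mem) {
          pr_warn("crashkernel: requested size exceeds system RAM\n");
          return -EINVAL;
  }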

Link: https://lkml.kernel.org/r/20240729115252.1659112-1-ruanjinjie@huawei.com
Signed-off-by: Jinjie Ruan <ruanjinjie@huawei.com>
Suggested-by: Mike Rapoport <rppt@kernel.org>
Acked-by: Baoquan He <bhe@redhat.com>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Dave Young <dyoung@redhat.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-09-01 20:43:30 -07:00
Uros Bizjak
acf02be3c7 kexec: use atomic_try_cmpxchg_acquire() in kexec_trylock()
Use atomic_try_cmpxchg_acquire(*ptr, &old, new) instead of
atomic_cmpxchg_acquire(*ptr, old, new) == old in kexec_trylock().
The x86 CMPXCHG instruction returns success in the ZF flag, so
this change saves a compare after cmpxchg.
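
Roughly, the transformation looks like this (illustrative; the actual
kexec_trylock() code may differ):

  /* before */
  return atomic_cmpxchg_acquire(&__kexec_lock, 0, 1) == 0;

  /* after: x86 can test ZF directly, saving the compare */
  int old = 0;
  return atomic_try_cmpxchg_acquire(&__kexec_lock, &old, 1);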

Link: https://lkml.kernel.org/r/20240719103937.53742-1-ubizjak@gmail.com
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Acked-by: Baoquan He <bhe@redhat.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-09-01 20:43:23 -07:00
Kaiyang Zhao
03790c51a4 mm: create promo_wmark_pages and clean up open-coded sites
Patch series "mm: print the promo watermark in zoneinfo", v2.


This patch (of 2):

Define promo_wmark_pages and convert current call sites of wmark_pages
with fixed WMARK_PROMO to using it instead.

Link: https://lkml.kernel.org/r/20240801232548.36604-1-kaiyang2@cs.cmu.edu
Link: https://lkml.kernel.org/r/20240801232548.36604-2-kaiyang2@cs.cmu.edu
Signed-off-by: Kaiyang Zhao <kaiyang2@cs.cmu.edu>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-09-01 20:25:58 -07:00
David Finkel
c6f53ed8f2 mm, memcg: cg2 memory{.swap,}.peak write handlers
Patch series "mm, memcg: cg2 memory{.swap,}.peak write handlers", v7.


This patch (of 2):

Other mechanisms for querying the peak memory usage of either a process or
v1 memory cgroup allow for resetting the high watermark.  Restore parity
with those mechanisms, but with a less racy API.

For example:
 - Any write to memory.max_usage_in_bytes in a cgroup v1 mount resets
   the high watermark.
 - writing "5" to the clear_refs pseudo-file in a processes's proc
   directory resets the peak RSS.

This change is an evolution of a previous patch, which mostly copied the
cgroup v1 behavior.  However, there were concerns about races/ownership
issues with a global reset, so instead this change makes the reset
file-descriptor-local.

Writing any non-empty string to the memory.peak and memory.swap.peak
pseudo-files resets the high watermark to the current usage for subsequent
reads through that same FD.
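
A hedged usage sketch from userspace (the cgroup path is just an
example; error handling omitted):

  int fd = open("/sys/fs/cgroup/job0/memory.peak", O_RDWR);
  char buf[32];

  write(fd, "reset\n", 6);        /* any non-empty string resets   */
  /* ... run the work item ... */
  lseek(fd, 0, SEEK_SET);
  read(fd, buf, sizeof(buf));     /* peak since reset, for this fd */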

Notably, following Johannes's suggestion, this implementation moves the
O(FDs that have written) behavior onto the FD write(2) path.  Instead, on
the page-allocation path, we simply add one additional watermark to
conditionally bump per-hierarchy level in the page-counter.

Additionally, this takes Longman's suggestion of nesting the
page-charging-path checks for the two watermarks to reduce the number of
common-case comparisons.

This behavior is particularly useful for work scheduling systems that need
to track memory usage of worker processes/cgroups per-work-item.  Since
memory can't be squeezed like CPU can (the OOM-killer has opinions), these
systems need to track the peak memory usage to compute system/container
fullness when binpacking workitems.

Most notably, Vimeo's use-case involves a system that's doing global
binpacking across many Kubernetes pods/containers, and while we can use
PSI for some local decisions about overload, we strive to avoid packing
workloads too tightly in the first place.  To facilitate this, we track
the peak memory usage.  However, since we run with long-lived workers (to
amortize startup costs) we need a way to track the high watermark while a
work-item is executing.  Polling runs the risk of missing short spikes
that last for timescales below the polling interval, and peak memory
tracking at the cgroup level is otherwise perfect for this use-case.

As this data is used to ensure that binpacked work ends up with sufficient
headroom, this use-case mostly avoids the inaccuracies surrounding
reclaimable memory.

Link: https://lkml.kernel.org/r/20240730231304.761942-1-davidf@vimeo.com
Link: https://lkml.kernel.org/r/20240729143743.34236-1-davidf@vimeo.com
Link: https://lkml.kernel.org/r/20240729143743.34236-2-davidf@vimeo.com
Signed-off-by: David Finkel <davidf@vimeo.com>
Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Suggested-by: Waiman Long <longman@redhat.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Michal Koutný <mkoutny@suse.com>
Acked-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Zefan Li <lizefan.x@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-09-01 20:25:53 -07:00
David Hildenbrand
394290cba9 mm: turn USE_SPLIT_PTE_PTLOCKS / USE_SPLIT_PMD_PTLOCKS into Kconfig options
Patch series "mm: split PTE/PMD PT table Kconfig cleanups+clarifications".

This series is a follow up to the fixes:
	"[PATCH v1 0/2] mm/hugetlb: fix hugetlb vs. core-mm PT locking"

When working on the fixes, I wondered why 8xx is fine (-> never uses split
PT locks) and how PT locking even works properly with PMD page table
sharing (-> always requires split PMD PT locks).

Let's improve the split PT lock detection, make hugetlb properly depend on
it and make 8xx bail out if it would ever get enabled by accident.

As an alternative to patch #3 we could extend the Kconfig
SPLIT_PTE_PTLOCKS option from patch #2 -- but enforcing it closer to the
code that actually implements it feels a bit nicer for documentation
purposes, and there is no need to actually disable it because it should
always be disabled (!SMP).

Did a bunch of cross-compilations to make sure that split PTE/PMD PT locks
are still getting used where we would expect them.

[1] https://lkml.kernel.org/r/20240725183955.2268884-1-david@redhat.com


This patch (of 3):

Let's clean that up a bit and prepare for depending on
CONFIG_SPLIT_PMD_PTLOCKS in other Kconfig options.

More cleanups would be reasonable (like the arch-specific "depends on" for
CONFIG_SPLIT_PTE_PTLOCKS), but we'll leave that for another day.

Link: https://lkml.kernel.org/r/20240726150728.3159964-1-david@redhat.com
Link: https://lkml.kernel.org/r/20240726150728.3159964-2-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: "Naveen N. Rao" <naveen.n.rao@linux.ibm.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Peter Xu <peterx@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-09-01 20:25:51 -07:00
Pasha Tatashin
fbe76a6557 task_stack: uninline stack_not_used
Given that stack_not_used() is not a performance-critical function,
uninline it.

Link: https://lkml.kernel.org/r/20240730150158.832783-4-pasha.tatashin@soleen.com
Link: https://lkml.kernel.org/r/20240724203322.2765486-4-pasha.tatashin@soleen.com
Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Domenico Cerasuolo <cerasuolodomenico@gmail.com>
Cc: Kent Overstreet <kent.overstreet@linux.dev>
Cc: Li Zhijian <lizhijian@fujitsu.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Nhat Pham <nphamcs@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-09-01 20:25:49 -07:00
Pasha Tatashin
c4a6fce856 vmstat: kernel stack usage histogram
As part of the dynamic kernel stack project, we need to know the amount of
data that can be saved by reducing the default kernel stack size [1].

Provide a kernel stack usage histogram to aid in optimizing kernel stack
sizes and minimizing memory waste in large-scale environments.  The
histogram divides stack usage into power-of-two buckets and reports the
results in /proc/vmstat.  This information is especially valuable in
environments with millions of machines, where even small optimizations can
have a significant impact.

The histogram data is presented in /proc/vmstat with entries like
"kstack_1k", "kstack_2k", and so on, indicating the number of threads that
exited with stack usage falling within each respective bucket.

Example outputs:
Intel:
$ grep kstack /proc/vmstat
kstack_1k 3
kstack_2k 188
kstack_4k 11391
kstack_8k 243
kstack_16k 0

ARM with 64K page_size:
$ grep kstack /proc/vmstat
kstack_1k 1
kstack_2k 340
kstack_4k 25212
kstack_8k 1659
kstack_16k 0
kstack_32k 0
kstack_64k 0

Note: once the dynamic kernel stack is implemented, the usability of this
feature will depend on the implementation: on hardware that supports
faults on kernel stacks, we will have other metrics that show the total
number of pages allocated for stacks.  On hardware where faults are not
supported, we will most likely have some optimization where only some
threads are extended, and for those, these metrics will still be very
useful.

[1] https://lwn.net/Articles/974367
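
A simplified sketch of the bucketing (illustrative; the KSTACK_* event names
mirror the /proc/vmstat entries above, but the exact hook point and enum
layout are up to the patch):

/* Called on task exit with the peak stack usage, e.g.
 * used_stack = THREAD_SIZE - stack_not_used(tsk). */
static void kstack_histogram(unsigned long used_stack)
{
        if (used_stack <= 1024)
                count_vm_event(KSTACK_1K);      /* shows up as "kstack_1k" */
        else if (used_stack <= 2048)
                count_vm_event(KSTACK_2K);
        else if (used_stack <= 4096)
                count_vm_event(KSTACK_4K);
        else if (used_stack <= 8192)
                count_vm_event(KSTACK_8K);
        else
                count_vm_event(KSTACK_16K);     /* ...more buckets with 64K pages */
}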

Link: https://lkml.kernel.org/r/20240730150158.832783-3-pasha.tatashin@soleen.com
Link: https://lkml.kernel.org/r/20240724203322.2765486-3-pasha.tatashin@soleen.com
Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Reviewed-by: Kent Overstreet <kent.overstreet@linux.dev>
Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Domenico Cerasuolo <cerasuolodomenico@gmail.com>
Cc: Li Zhijian <lizhijian@fujitsu.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Nhat Pham <nphamcs@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-09-01 20:25:49 -07:00
Zi Yan
2a28713a67 memory tiering: introduce folio_use_access_time() check
If memory tiering mode is on and a folio is not in the top-tier memory,
the folio's cpupid field is repurposed to store the page access time.
Instead of an open-coded check, use a function to encapsulate the check.
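
A minimal sketch of such a helper (assuming the existing memory-tiering
predicates; the merged version may differ in detail):

/* True when the folio's cpupid field holds the page access time, i.e.
 * memory tiering is on and the folio is not in top-tier memory. */
static inline bool folio_use_access_time(struct folio *folio)
{
        return (sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING) &&
               !node_is_toptier(folio_nid(folio));
}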

Link: https://lkml.kernel.org/r/20240724130115.793641-3-ziy@nvidia.com
Signed-off-by: Zi Yan <ziy@nvidia.com>
Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-09-01 20:25:47 -07:00
Danilo Krummrich
590b9d576c mm: kvmalloc: align kvrealloc() with krealloc()
Besides the obvious (and desired) difference between krealloc() and
kvrealloc(), there is some inconsistency in their function signatures and
behavior:

 - krealloc() frees the memory when the requested size is zero, whereas
   kvrealloc() simply returns a pointer to the existing allocation.

 - krealloc() behaves like kmalloc() if a NULL pointer is passed, whereas
   kvrealloc() does not accept a NULL pointer at all and, if passed,
   would fault instead.

 - krealloc() is self-contained, whereas kvrealloc() relies on the caller
   to provide the size of the previous allocation.

Inconsistent behavior throughout allocation APIs is error prone, hence
make kvrealloc() behave like krealloc(), which seems superior in all
mentioned aspects.

Besides that, implementing kvrealloc() by making use of krealloc() and
vrealloc() provides opportunities to grow (and shrink) allocations more
efficiently.  For instance, vrealloc() can be optimized to allocate and
map additional pages to grow the allocation or unmap and free unused pages
to shrink the allocation.
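
A short usage sketch of the aligned semantics (hypothetical caller; buffer
type and sizes are made up for illustration):

/* With kvrealloc() following krealloc() semantics:
 *  - passing NULL behaves like kvmalloc()
 *  - a size of 0 frees the allocation
 *  - the caller no longer passes the old allocation size
 */
u32 *buf, *tmp;

buf = kvrealloc(NULL, 16 * sizeof(*buf), GFP_KERNEL);    /* acts like kvmalloc() */
if (!buf)
        return -ENOMEM;

tmp = kvrealloc(buf, 1024 * sizeof(*buf), GFP_KERNEL);   /* grow, possibly in place */
if (!tmp) {
        kvfree(buf);
        return -ENOMEM;
}
buf = tmp;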

[dakr@kernel.org: document concurrency restrictions]
  Link: https://lkml.kernel.org/r/20240725125442.4957-1-dakr@kernel.org
[dakr@kernel.org: disable KASAN when switching to vmalloc]
  Link: https://lkml.kernel.org/r/20240730185049.6244-2-dakr@kernel.org
[dakr@kernel.org: properly document __GFP_ZERO behavior]
  Link: https://lkml.kernel.org/r/20240730185049.6244-5-dakr@kernel.org
Link: https://lkml.kernel.org/r/20240722163111.4766-3-dakr@kernel.org
Signed-off-by: Danilo Krummrich <dakr@kernel.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Chandan Babu R <chandan.babu@oracle.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Kees Cook <kees@kernel.org>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Miguel Ojeda <ojeda@kernel.org>
Cc: Oliver Upton <oliver.upton@linux.dev>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Uladzislau Rezki <urezki@gmail.com>
Cc: Wedson Almeida Filho <wedsonaf@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-09-01 20:25:44 -07:00
Petr Tesarik
6dacd79d28 kexec_file: fix elfcorehdr digest exclusion when CONFIG_CRASH_HOTPLUG=y
Fix the condition to exclude the elfcorehdr segment from the SHA digest
calculation.

The j iterator is an index into the output sha_regions[] array, not into
the input image->segment[] array.  Once it reaches
image->elfcorehdr_index, all subsequent segments are excluded.  Besides,
if the purgatory segment precedes the elfcorehdr segment, the elfcorehdr
may be wrongly included in the calculation.
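
Schematically (a simplified sketch of the loop in
kexec_calculate_store_digests(); field names abridged), the comparison has
to use the segment index i rather than the sha_regions[] index j:

for (j = 0, i = 0; i < image->nr_segments; i++) {
        struct kexec_segment *ksegment = &image->segment[i];

        if (ksegment->kbuf == pi->purgatory_buf)
                continue;               /* purgatory is digested separately */

#ifdef CONFIG_CRASH_HOTPLUG
        /* Fix: 'i' indexes image->segment[], 'j' indexes sha_regions[] */
        if (i == image->elfcorehdr_index)       /* was: j == ... */
                continue;
#endif

        sha_regions[j].start = ksegment->mem;   /* hashed and recorded here */
        sha_regions[j].len   = ksegment->memsz;
        j++;
}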

Link: https://lkml.kernel.org/r/20240805150750.170739-1-petr.tesarik@suse.com
Fixes: f7cc804a9f ("kexec: exclude elfcorehdr from the segment digest")
Signed-off-by: Petr Tesarik <ptesarik@suse.com>
Acked-by: Baoquan He <bhe@redhat.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Hari Bathini <hbathini@linux.ibm.com>
Cc: Sourabh Jain <sourabhjain@linux.ibm.com>
Cc: Eric DeVolder <eric_devolder@yahoo.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-09-01 17:59:01 -07:00
Linus Torvalds
c9f016e72b A set of X86 fixes:
- x2apic_disable() clears x2apic_state and x2apic_mode unconditionally,
     even when the state is X2APIC_ON_LOCKED, which prevents the kernel from
     disabling it, thereby creating inconsistent state.
 
     Reorder the logic so it actually works correctly.
 
   - The XSTATE logic for handling LBR is incorrect as it assumes that
     XSAVES supports LBR when the CPU supports LBR. In fact both conditions
     need to be true. Otherwise the enablement of LBR in the IA32_XSS MSR
     fails and subsequently the machine crashes on the next XRSTORS
     operation because IA32_XSS is not initialized.
 
     Cache the XSTATE support bit during init and make the related functions
     use this cached information and the LBR CPU feature bit to cure this.
 
   - Cure a long standing bug in KASLR
 
     KASLR uses the full address space between PAGE_OFFSET and vaddr_end to
     randomize the starting points of the direct map, vmalloc and vmemmap
     regions.  It thereby limits the size of the direct map by using the
     installed memory size plus an extra configurable margin for hot-plug
     memory.  This limitation is done to gain more randomization space
     because otherwise only the holes between the direct map, vmalloc,
     vmemmap and vaddr_end would be usable for randomizing.
 
     The limited direct map size is not exposed to the rest of the kernel, so
     the memory hot-plug and resource management related code paths still
     operate under the assumption that the available address space can be
     determined with MAX_PHYSMEM_BITS.
 
     request_free_mem_region() allocates from (1 << MAX_PHYSMEM_BITS) - 1
     downwards.  That means the first allocation happens past the end of the
     direct map and if unlucky this address is in the vmalloc space, which
     causes high_memory to become greater than VMALLOC_START and consequently
     causes iounmap() to fail for valid ioremap addresses.
 
     Cure this by exposing the end of the direct map via PHYSMEM_END and use
     that for the memory hot-plug and resource management related places
     instead of relying on MAX_PHYSMEM_BITS. In the KASLR case PHYSMEM_END
     maps to a variable which is initialized by the KASLR initialization and
     otherwise it is based on MAX_PHYSMEM_BITS as before.
 
   - Prevent a data leak in mmio_read(). The TDVMCALL exposes the value of
     an uninitialized variable on the stack to the VMM. The variable is only
     required as an output value, so it does not have to be exposed to the
     VMM in the first place.
 
   - Prevent an array overrun in the resource control code on systems with
     Sub-NUMA Clustering enabled because the code failed to adjust the index
     by the number of SNC nodes per L3 cache.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmbUUu0THHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYodFsEADFgxq2wjnH+VpuaIhLiQIfUa7iVeUl
 bwHAakZRMJ+Cb8BsvaRCMdAWWF+cRdLabAHuh7MRJFFzzdwrVTswnxT9baUBBjEe
 Kd3ZeQOS4AvWxpJNQEDg9r7tYtavmml9ix+Jh0OF+YmXLIweQk5RhDN+ncha07cJ
 0DuPt4ngI24iyAyUX+7gZsRZiwoOm0HqImaRiisaspTbGpNwnrwFQCEioCdwnAv0
 H5S7WTAlsZURCINLBNT+fV5oPjk2E3Ckj/CCJGoG1LYedGUD/44M1Hj0Xsqm4pHF
 Zd0+CuFyYpGqkAuBY6moWOheYP8V2U+yhf9Rtvh8/+h3qxZ/yon5i0ycO/2wMjiF
 0NBomMeKh4PNyefYq8lHWK3kcXphrXH3yv09wVBDdLMXDy98beuS5NScGgza8148
 /nqq0l1uLUyM9TkWg9H+4wW73EzQW1DYIliDU3tC98u+E77kQbyCx+2f0WI2k+ar
 3wy7nYzyEJXl38NUTB+La4xXbhsELcaYQ/Q6scIsWAL+6+KlRb3FNBn+HT+KmOmF
 y702km/28C0uxrLk2OQCjX/zXQtXe2/4aoUzGqFf9atsifa0IBrc8YBzdIDB49Jt
 zz/MOAZTcz4jfyD3sRfYuG2QhBbdTz3f/kd3OryquitdAGozpoeztMIGs1PU2Y6s
 zInlLtUwaosadg==
 =T4i1
 -----END PGP SIGNATURE-----

Merge tag 'x86-urgent-2024-09-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 fixes from Thomas Gleixner:

 - x2apic_disable() clears x2apic_state and x2apic_mode unconditionally,
   even when the state is X2APIC_ON_LOCKED, which prevents the kernel from
   disabling it, thereby creating inconsistent state.

   Reorder the logic so it actually works correctly.

 - The XSTATE logic for handling LBR is incorrect as it assumes that
   XSAVES supports LBR when the CPU supports LBR. In fact both
   conditions need to be true. Otherwise the enablement of LBR in the
   IA32_XSS MSR fails and subsequently the machine crashes on the next
   XRSTORS operation because IA32_XSS is not initialized.

   Cache the XSTATE support bit during init and make the related
   functions use this cached information and the LBR CPU feature bit to
   cure this.

 - Cure a long standing bug in KASLR

   KASLR uses the full address space between PAGE_OFFSET and vaddr_end
   to randomize the starting points of the direct map, vmalloc and
   vmemmap regions. It thereby limits the size of the direct map by
   using the installed memory size plus an extra configurable margin for
   hot-plug memory. This limitation is done to gain more randomization
   space because otherwise only the holes between the direct map,
   vmalloc, vmemmap and vaddr_end would be usable for randomizing.

   The limited direct map size is not exposed to the rest of the kernel,
   so the memory hot-plug and resource management related code paths
   still operate under the assumption that the available address space
   can be determined with MAX_PHYSMEM_BITS.

   request_free_mem_region() allocates from (1 << MAX_PHYSMEM_BITS) - 1
   downwards. That means the first allocation happens past the end of
   the direct map and if unlucky this address is in the vmalloc space,
   which causes high_memory to become greater than VMALLOC_START and
   consequently causes iounmap() to fail for valid ioremap addresses.

   Cure this by exposing the end of the direct map via PHYSMEM_END and
   use that for the memory hot-plug and resource management related
   places instead of relying on MAX_PHYSMEM_BITS. In the KASLR case
   PHYSMEM_END maps to a variable which is initialized by the KASLR
   initialization and otherwise it is based on MAX_PHYSMEM_BITS as
   before.

 - Prevent a data leak in mmio_read(). The TDVMCALL exposes the value of
   an uninitialized variable on the stack to the VMM. The variable is only
   required as an output value, so it does not have to be exposed to the
   VMM in the first place.

 - Prevent an array overrun in the resource control code on systems with
   Sub-NUMA Clustering enabled because the code failed to adjust the
   index by the number of SNC nodes per L3 cache.
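
For the KASLR item, the direction of the change can be sketched as follows
(illustrative; the exact definitions in the merged patches may differ):

/* include/linux/mm.h (sketch): generic fallback when the architecture
 * does not limit the direct map below the architectural maximum. */
#ifndef PHYSMEM_END
# define PHYSMEM_END    ((1ULL << MAX_PHYSMEM_BITS) - 1)
#endif

/* arch/x86 with memory KASLR (sketch): the KASLR init code records the
 * real end of the (possibly shrunk) direct map in a variable instead. */
extern unsigned long physmem_end;
#define PHYSMEM_END     physmem_end

/* Memory hot-plug and resource management code, e.g.
 * request_free_mem_region(), then allocates downwards from PHYSMEM_END
 * instead of (1 << MAX_PHYSMEM_BITS) - 1, so it can no longer land in
 * the vmalloc area. */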

* tag 'x86-urgent-2024-09-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/resctrl: Fix arch_mbm_* array overrun on SNC
  x86/tdx: Fix data leak in mmio_read()
  x86/kaslr: Expose and use the end of the physical memory address space
  x86/fpu: Avoid writing LBR bit to IA32_XSS unless supported
  x86/apic: Make x2apic_disable() work correctly
2024-09-01 14:43:08 -07:00
Linus Torvalds
51859c5aa6 A single fix for rt_mutex:
The deadlock detection code drops into an infinite scheduling loop while
   still holding rt_mutex::wait_lock, which rightfully triggers a
   'scheduling in atomic' warning. Unlock it before that.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmbLKh0THHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoSh0D/sHm7YsexyVAXRxtm1vCJ1rE+ditn/K
 X60QuH8EyLa2abUY9xKKKaM3YuE+4juN1dyej6uBcET8n9olqGQn0pyxukTDy3Hq
 Khy7OIAzYfGd7J0kJXA6DHMWAi16e4GduMcDYCw0BwYCQG4D1Ybe/lfsm0mgdcd/
 aLFSD8/+oAK7bIWy9VE4wXUPT51v7iFnwajvcKsOL1c3u2StF6A0zCLbjkU5Di20
 6Ko4f+B8dh3Bh9yIP/uiwHQHdVUjXp5Y9pVuFLemC8BqGVrwMi71jEEFYQva+evU
 UTQ8VCbhxIAcz6pqWYA3P5WeqFIQLAXq5UNJLK+63qm/Wf4eoaXNhK+Nr2fCigYu
 R6T+H5nx2WATATXfPORgfHYgHWwyWaj6nUPK0kaJYUTHFK5Nlg/ir5siqE/Gz9qq
 ldajnvKpheeuu6rsrdHdJrdI84XVOmjeAJ+5A8i7VMv2jEE50txlxXKt5jeVt2yE
 xm2+NABT4Dlycf/56e6OYZUEADQfX1YlvFGBQe1UYpcmOC1nTKaWWeQ1xlA3ZR92
 plUgXLtRSKaZ3vEFQ3L+/1w0Af3P3/mapb+IgTxW/FEt8WAw6UuwBS9gMZDOhTI+
 GZF0EFj8tUmTNDcWkNAu3m3y5Qmp3iVFYBbYXmyJdKVRJbaMu/uq51JnOOIxmaLf
 L6KovDUg8eF3bQ==
 =B3iX
 -----END PGP SIGNATURE-----

Merge tag 'locking-urgent-2024-08-25' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull locking fix from Thomas Gleixner:
 "A single fix for rt_mutex.

  The deadlock detection code drops into an infinite scheduling loop
  while still holding rt_mutex::wait_lock, which rightfully triggers a
  'scheduling in atomic' warning.

  Unlock it before that"
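
The shape of the fix, as a simplified sketch (ww_mutex handling and the
deadlock warning omitted):

static void __sched rt_mutex_handle_deadlock(int res, int detect_deadlock,
                                             struct rt_mutex_base *lock,
                                             struct rt_mutex_waiter *w)
{
        if (res != -EDEADLOCK || detect_deadlock)
                return;

        /* The fix: drop wait_lock before parking the task forever,
         * otherwise we schedule() with a raw spinlock held and trigger
         * the 'scheduling while atomic' warning. */
        raw_spin_unlock_irq(&lock->wait_lock);

        while (1) {
                set_current_state(TASK_INTERRUPTIBLE);
                schedule();
        }
}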

* tag 'locking-urgent-2024-08-25' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  rtmutex: Drop rt_mutex::wait_lock before scheduling
2024-09-01 14:26:33 -07:00
Masahiro Yamada
2893f00322 tinyconfig: remove unnecessary 'is not set' for choice blocks
This reverts the following commits:

 - 236dec0510 ("kconfig: tinyconfig: provide whole choice blocks to
   avoid warnings")

 - b0f269728c ("x86/config: Fix warning for 'make ARCH=x86_64
   tinyconfig'")

Since commit f79dc03fe6 ("kconfig: refactor choice value calculation"),
it is no longer necessary to disable the remaining options in choice
blocks.

Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
2024-09-01 20:34:38 +09:00
Tejun Heo
62607d033b sched_ext: Use sched_clock_cpu() instead of rq_clock_task() in touch_core_sched()
Since 3cf78c5d01 ("sched_ext: Unpin and repin rq lock from
balance_scx()"), sched_ext's balance path terminates rq_pin in the outermost
function. This is simpler and in line with what other balance functions are
doing, but it loses control over rq->clock_update_flags, which makes
assert_clock_updated() trigger if another CPU pins the rq lock.

The only place this matters is touch_core_sched(), which uses the timestamp
to order tasks from sibling rq's. Switch to sched_clock_cpu(). Later, it may
be better to use a per-core dispatch sequence number.

v2: Use sched_clock_cpu() instead of ktime_get_ns() per David.
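
Roughly, the resulting helper looks like this sketch (guard checks omitted;
the per-task field name follows sched_ext's entity state and may not match
the source exactly):

static void touch_core_sched(struct rq *rq, struct task_struct *p)
{
        /* Timestamp used to order tasks across sibling runqueues.
         * sched_clock_cpu() does not depend on rq->clock_update_flags,
         * so it stays valid even when another CPU has the rq pinned. */
        p->scx.core_sched_at = sched_clock_cpu(cpu_of(rq));
}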

Signed-off-by: Tejun Heo <tj@kernel.org>
Fixes: 3cf78c5d01 ("sched_ext: Unpin and repin rq lock from balance_scx()")
Acked-by: David Vernet <void@manifault.com>
Cc: Peter Zijlstra <peterz@infradead.org>
2024-08-30 19:35:19 -10:00
Tejun Heo
0366017e09 sched_ext: Use task_can_run_on_remote_rq() test in dispatch_to_local_dsq()
When deciding whether a task can be migrated to a CPU,
dispatch_to_local_dsq() was open-coding p->cpus_allowed and scx_rq_online()
tests instead of using task_can_run_on_remote_rq(). This had two problems.

- It was missing the is_migration_disabled() check and thus could try to
  migrate a task which shouldn't be migrated, leading to assertion and
  scheduling failures.

- It was testing p->cpus_ptr directly instead of using task_allowed_on_cpu()
  and thus failed to consider ISA compatibility.

Update dispatch_to_local_dsq() to use task_can_run_on_remote_rq():

- Move scx_ops_error() triggering into task_can_run_on_remote_rq().

- When migration isn't allowed, fall back to the global DSQ instead of the
  source DSQ by returning DTL_INVALID. This is both simpler and an overall
  better behavior.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Acked-by: David Vernet <void@manifault.com>
2024-08-30 19:34:46 -10:00
Chen Ridong
1abab1ba07 cgroup/cpuset: guard cpuset-v1 code under CONFIG_CPUSETS_V1
This patch introduces CONFIG_CPUSETS_V1 and guards the cpuset-v1 code under
CONFIG_CPUSETS_V1. The default value of CONFIG_CPUSETS_V1 is N, so that
users who have adopted v2 don't have to 'pay' for cpuset v1.
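
Illustratively (a sketch with hypothetical helper names, not the actual
interfaces), cpuset-v1.c is only built when the option is selected and the
common code sees empty stubs otherwise:

/* kernel/cgroup/Makefile: obj-$(CONFIG_CPUSETS_V1) += cpuset-v1.o */

/* cpuset-internal.h (sketch) */
#ifdef CONFIG_CPUSETS_V1
void cpuset1_update_tasks_flags(struct cpuset *cs);
#else
static inline void cpuset1_update_tasks_flags(struct cpuset *cs) { }
#endif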

Signed-off-by: Chen Ridong <chenridong@huawei.com>
Acked-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2024-08-30 10:00:16 -10:00
Chen Ridong
381b53c3b5 cgroup/cpuset: rename functions shared between v1 and v2
Some function names declared in cpuset-internal.h are generic. To avoid
conflicting with other variables of the same name, rename these functions
with a cpuset_/cpuset1_ prefix to make them unique to cpuset.

Signed-off-by: Chen Ridong <chenridong@huawei.com>
Acked-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2024-08-30 10:00:16 -10:00
Chen Ridong
b0ced9d378 cgroup/cpuset: move v1 interfaces to cpuset-v1.c
Move the legacy cpuset controller interface files and the corresponding
code into cpuset-v1.c. 'update_flag', 'cpuset_write_resmask' and
'cpuset_common_seq_show' are also used for v1, so declare them in
cpuset-internal.h.

'cpuset_write_s64', 'cpuset_read_s64' and 'fmeter_getrate' are now only
used in cpuset-v1.c, so make them static.

Signed-off-by: Chen Ridong <chenridong@huawei.com>
Acked-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2024-08-30 10:00:16 -10:00
Chen Ridong
be126b5b1b cgroup/cpuset: move validate_change_legacy to cpuset-v1.c
The validate_change_legacy() function is used for v1, so move it to
cpuset-v1.c. The two macros 'cpuset_for_each_child' and
'cpuset_for_each_descendant_pre' are common to v1 and v2, so move them to
cpuset-internal.h.

Signed-off-by: Chen Ridong <chenridong@huawei.com>
Acked-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2024-08-30 10:00:16 -10:00