36535 Commits

Author SHA1 Message Date
Linus Torvalds
7749b0337b Fix a memory leak in the recently introduced sigqueue cache.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmDUL7URHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1jWUw/+Igx3XBKs0mB42kdgarx5jsWLMbh+kPnh
 GYD8ugZSopGWoRLsD5GJTDwixsbW6uxr6cgo1ees6SiuYaSs66K4wxh5CX+SzDk2
 lh/DGjsRJI7IDmMEItzAtuoFaMknllBP4JEBm6iH0cyH9pLj4mpDDO8lD6BHn6xs
 WB79tqKZuXpDxxh7WKZOi57Uh6oTVN8B4wvPQCLhHd8FW6rC2l180CItQuZsHUGP
 gl3vuFOsfa07UUzs+VYH4Q+Pfujk43dej2fZSmFfF6eDufxT0dRW9C+/SiXZiuXW
 kUrVa7wupX2jyMpM/pl5T0lgQb07WhT4Gz+V9klhG+ZHeXsOgDwHiZReMfzldOgt
 +w5exrN//x223oWCksmnQiQ/cG1lt/yyUqvw12/0fsYGT6TIUnvMXKpN6lU/K60/
 M3eLgVYHV0P7AYvNYxWaX044cAd71jP+OlGnk7ivbSmiEZczK8GKsKYoAYKP6ne1
 3QV+6Q2Gv4hDVdcPs46Ms4R9FW8RNNFaE6emjp1T6oSKTjvEVnZ6jhql3BMVk6dz
 p0vExwCtewnV7EgpCov0UNDSZrc3BxyHk0trAoDUDwl/7pUoQHsrs6gQwgsRi1aN
 mRP2qJGlpD1Eo7k4w8vpBSwklzWRiPHNiIywl9YyYW+yPIzJOVo/7RNdID1hCq83
 MbxGNuQcN/8=
 =exj4
 -----END PGP SIGNATURE-----

Merge tag 'core-urgent-2021-06-24' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull sigqueue cache fix from Ingo Molnar:
 "Fix a memory leak in the recently introduced sigqueue cache"

* tag 'core-urgent-2021-06-24' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  signal: Prevent sigqueue caching after task got released
2021-06-24 09:06:19 -07:00
Linus Torvalds
666751701b A last minute cgroup bandwidth scheduling fix for a recently
introduced logic fail which triggered a kernel warning by
 LTP's cfs_bandwidth01.
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmDULloRHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1iTMA/9EogeU4F8ncEgqkkrYbCmpnSYKVbnJzf8
 cEuX4lOgz0Fd5Ps3mWEN7L99jaDgPsaMMiIKi1UQhDZNy3ND6eHywlXHVfxiMKw9
 YEozI/apwyEykp8J6laigSH0N/g5sp+YT5kcU3QsaLDoN7et7pgwSFjqsuC/kHRI
 nnnNFbsO8A1Haq8qMt1W3kThTdaB+HXfBDZdZO7lsIC69GGHbkKPRfiHSZmBfG98
 GhvwpziAlJgOu6mHyGoQtDCVH00y1CNctUi9KVx4lC9ZRCWgIwHk++vgrHgNRxXu
 FUqkH+qsgH4MMO7MopPOgtkVK7RfdXspHNydogrLHhtsFyOXoP5f5vVdgIKBakSq
 aOfIIhyzEvdxentAcfnUAa7aJ6F6Og3N8VUBA/Zi7Vm4IUNM7mmKO8/ixRlpRBf2
 Ymj/Cp7LQPIyGV2s/EN8G24+5T6hEmuLkz9WzXKcHju+4UC9hVQzdJhT1iFk5MUw
 Iy7uIWG1NzYs5bI5zPrK9YeJYzFDF/RBxM9S5znlH8hcl1L910m7LNGnY8aiJrS4
 /w8PqTX9rGrLrDrFt/dBYX3CNl1oRZAJouTyBNFMJ1LchkTdKc8QN4FN877cTvQE
 GuQLOyPqK+dY/pElx2jr9wnIdzaWBMv4ZG6azZqkrc7LaEVtKoin3NSkSfqd0cu2
 QkTSup4mhuU=
 =KzBo
 -----END PGP SIGNATURE-----

Merge tag 'sched-urgent-2021-06-24' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull scheduler fix from Ingo Molnar:
 "A last minute cgroup bandwidth scheduling fix for a recently
  introduced logic fail which triggered a kernel warning by LTP's
  cfs_bandwidth01 test"

* tag 'sched-urgent-2021-06-24' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/fair: Ensure that the CFS parent is added after unthrottling
2021-06-24 08:58:23 -07:00
Linus Torvalds
c0e457851f Address a number of objtool warnings that got reported.
No change in behavior intended, but code generation might be
 impacted by:
 
    1f008d46f124: ("x86: Always inline task_size_max()")
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmDUK9ERHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1jIUg/8CMtepfPe5KAJKW0Q1Fi9c5mfsQqKJToQ
 rNk25Rna+oH1P2kiHxW8920/0gjQJwqO1nkdUfQhI2pcPdTSg63DMQJeakX1aBq5
 v8bGq5TRe8xYx07sp6ATiqD2kntvdS3NW0kOzSa3N7e4jO7U8bJd+J/4DYW6KIrI
 yAIgkHPPsOW/UAbk6Nza/lyjbhqJ7xxzhpfzGHB1k0bbb9d5X8HHJ2b7BOq/0tom
 5m2vFKu8EUy8o8qqeZrWVoJHBQNo28Eh33bAuv2ZQWpFKih6p++ZyyH8QG9SiDUv
 sIHneKDNs1XZ25F0Q7fHHi2xcUCcoI4ssJldatC0stTpZySw8u6JKVaqC/LFxEWz
 veaqQ5ENrsemOksXoIT4ECObwaw+mInEYZNQIEa1ntTGe9WJRqwmJPjkcgdZTHBM
 sV3HmQcIDycB7qy3MZ+vYt0WtCm1ihGcEECkyxNnnPLko4PJS+SN2zU/uOakHeO+
 sTfGVzgAixBy3JxJk1nECOHLhSmdRr3b0+DU6/32hNiHGliABZiyrStUGEYAk6We
 4g1zy9EwJKrB61y3QXJN6yMdc0yTQ1e2C1aXtohU57hVGv92gigRz/bt3YqMytZq
 Y7F+aF5nWjt6AcmjhDFpxHTQXwNsQB3P8waXe/pcFUHMg5xKU/JvRd3FsvPD8Az3
 uqIa2LhiKw4=
 =FtTl
 -----END PGP SIGNATURE-----

Merge tag 'objtool-urgent-2021-06-24' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull objtool fixes from Ingo Molnar:
 "Address a number of objtool warnings that got reported.

  No change in behavior intended, but code generation might be impacted
  by commit 1f008d46f124 ("x86: Always inline task_size_max()")"

* tag 'objtool-urgent-2021-06-24' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  locking/lockdep: Improve noinstr vs errors
  x86: Always inline task_size_max()
  x86/xen: Fix noinstr fail in exc_xen_unknown_trap()
  x86/xen: Fix noinstr fail in xen_pv_evtchn_do_upcall()
  x86/entry: Fix noinstr fail in __do_fast_syscall_32()
  objtool/x86: Ignore __x86_indirect_alt_* symbols
2021-06-24 08:47:33 -07:00
Cassio Neri
2760105516 time: Improve performance of time64_to_tm()
The current implementation of time64_to_tm() contains unnecessary loops,
branches and look-up tables. The new one uses an arithmetic-based algorithm
appeared in [1] and is approximately 3x faster (YMMV).

The drawback is that the new code isn't intuitive and contains many 'magic
numbers' (not unusual for this type of algorithm). However, [1] justifies
all those numbers and, given this function's history, the code is unlikely
to need much maintenance, if any at all.

Add a KUnit test for it which checks every day in a 160,000 years interval
centered at 1970-01-01 against the expected result.

[1] Neri, Schneider, "Euclidean Affine Functions and Applications to
Calendar Algorithms". https://arxiv.org/abs/2102.06959

Signed-off-by: Cassio Neri <cassio.neri@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20210622213616.313046-1-cassio.neri@gmail.com
2021-06-24 11:51:59 +02:00
Beata Michalska
c744dc4ab5 sched/topology: Rework CPU capacity asymmetry detection
Currently the CPU capacity asymmetry detection, performed through
asym_cpu_capacity_level, tries to identify the lowest topology level
at which the highest CPU capacity is being observed, not necessarily
finding the level at which all possible capacity values are visible
to all CPUs, which might be bit problematic for some possible/valid
asymmetric topologies i.e.:

DIE      [                                ]
MC       [                       ][       ]

CPU       [0] [1] [2] [3] [4] [5]  [6] [7]
Capacity  |.....| |.....| |.....|  |.....|
	     L	     M       B        B

Where:
 arch_scale_cpu_capacity(L) = 512
 arch_scale_cpu_capacity(M) = 871
 arch_scale_cpu_capacity(B) = 1024

In this particular case, the asymmetric topology level will point
at MC, as all possible CPU masks for that level do cover the CPU
with the highest capacity. It will work just fine for the first
cluster, not so much for the second one though (consider the
find_energy_efficient_cpu which might end up attempting the energy
aware wake-up for a domain that does not see any asymmetry at all)

Rework the way the capacity asymmetry levels are being detected,
allowing to point to the lowest topology level (for a given CPU), where
full set of available CPU capacities is visible to all CPUs within given
domain. As a result, the per-cpu sd_asym_cpucapacity might differ across
the domains. This will have an impact on EAS wake-up placement in a way
that it might see different range of CPUs to be considered, depending on
the given current and target CPUs.

Additionally, those levels, where any range of asymmetry (not
necessarily full) is being detected will get identified as well.
The selected asymmetric topology level will be denoted by
SD_ASYM_CPUCAPACITY_FULL sched domain flag whereas the 'sub-levels'
would receive the already used SD_ASYM_CPUCAPACITY flag. This allows
maintaining the current behaviour for asymmetric topologies, with
misfit migration operating correctly on lower levels, if applicable,
as any asymmetry is enough to trigger the misfit migration.
The logic there relies on the SD_ASYM_CPUCAPACITY flag and does not
relate to the full asymmetry level denoted by the sd_asym_cpucapacity
pointer.

Detecting the CPU capacity asymmetry is being based on a set of
available CPU capacities for all possible CPUs. This data is being
generated upon init and updated once CPU topology changes are being
detected (through arch_update_cpu_topology). As such, any changes
to identified CPU capacities (like initializing cpufreq) need to be
explicitly advertised by corresponding archs to trigger rebuilding
the data.

Additional -dflags- parameter, used when building sched domains, has
been removed as well, as the asymmetry flags are now being set directly
in sd_init.

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Suggested-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Beata Michalska <beata.michalska@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Tested-by: Valentin Schneider <valentin.schneider@arm.com>
Link: https://lore.kernel.org/r/20210603140627.8409-3-beata.michalska@arm.com
2021-06-24 09:07:51 +02:00
Zhaoyang Huang
8f91efd870 psi: Fix race between psi_trigger_create/destroy
Race detected between psi_trigger_destroy/create as shown below, which
cause panic by accessing invalid psi_system->poll_wait->wait_queue_entry
and psi_system->poll_timer->entry->next. Under this modification, the
race window is removed by initialising poll_wait and poll_timer in
group_init which are executed only once at beginning.

  psi_trigger_destroy()                   psi_trigger_create()

  mutex_lock(trigger_lock);
  rcu_assign_pointer(poll_task, NULL);
  mutex_unlock(trigger_lock);
					  mutex_lock(trigger_lock);
					  if (!rcu_access_pointer(group->poll_task)) {
					    timer_setup(poll_timer, poll_timer_fn, 0);
					    rcu_assign_pointer(poll_task, task);
					  }
					  mutex_unlock(trigger_lock);

  synchronize_rcu();
  del_timer_sync(poll_timer); <-- poll_timer has been reinitialized by
                                  psi_trigger_create()

So, trigger_lock/RCU correctly protects destruction of
group->poll_task but misses this race affecting poll_timer and
poll_wait.

Fixes: 461daba06bdc ("psi: eliminate kthread_worker from psi trigger scheduling mechanism")
Co-developed-by: ziwei.dai <ziwei.dai@unisoc.com>
Signed-off-by: ziwei.dai <ziwei.dai@unisoc.com>
Co-developed-by: ke.wang <ke.wang@unisoc.com>
Signed-off-by: ke.wang <ke.wang@unisoc.com>
Signed-off-by: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Link: https://lkml.kernel.org/r/1623371374-15664-1-git-send-email-huangzhaoyang@gmail.com
2021-06-24 09:07:50 +02:00
Huaixin Chang
f4183717b3 sched/fair: Introduce the burstable CFS controller
The CFS bandwidth controller limits CPU requests of a task group to
quota during each period. However, parallel workloads might be bursty
so that they get throttled even when their average utilization is under
quota. And they are latency sensitive at the same time so that
throttling them is undesired.

We borrow time now against our future underrun, at the cost of increased
interference against the other system users. All nicely bounded.

Traditional (UP-EDF) bandwidth control is something like:

  (U = \Sum u_i) <= 1

This guaranteeds both that every deadline is met and that the system is
stable. After all, if U were > 1, then for every second of walltime,
we'd have to run more than a second of program time, and obviously miss
our deadline, but the next deadline will be further out still, there is
never time to catch up, unbounded fail.

This work observes that a workload doesn't always executes the full
quota; this enables one to describe u_i as a statistical distribution.

For example, have u_i = {x,e}_i, where x is the p(95) and x+e p(100)
(the traditional WCET). This effectively allows u to be smaller,
increasing the efficiency (we can pack more tasks in the system), but at
the cost of missing deadlines when all the odds line up. However, it
does maintain stability, since every overrun must be paired with an
underrun as long as our x is above the average.

That is, suppose we have 2 tasks, both specify a p(95) value, then we
have a p(95)*p(95) = 90.25% chance both tasks are within their quota and
everything is good. At the same time we have a p(5)p(5) = 0.25% chance
both tasks will exceed their quota at the same time (guaranteed deadline
fail). Somewhere in between there's a threshold where one exceeds and
the other doesn't underrun enough to compensate; this depends on the
specific CDFs.

At the same time, we can say that the worst case deadline miss, will be
\Sum e_i; that is, there is a bounded tardiness (under the assumption
that x+e is indeed WCET).

The benefit of burst is seen when testing with schbench. Default value of
kernel.sched_cfs_bandwidth_slice_us(5ms) and CONFIG_HZ(1000) is used.

	mkdir /sys/fs/cgroup/cpu/test
	echo $$ > /sys/fs/cgroup/cpu/test/cgroup.procs
	echo 100000 > /sys/fs/cgroup/cpu/test/cpu.cfs_quota_us
	echo 100000 > /sys/fs/cgroup/cpu/test/cpu.cfs_burst_us

	./schbench -m 1 -t 3 -r 20 -c 80000 -R 10

The average CPU usage is at 80%. I run this for 10 times, and got long tail
latency for 6 times and got throttled for 8 times.

Tail latencies are shown below, and it wasn't the worst case.

	Latency percentiles (usec)
		50.0000th: 19872
		75.0000th: 21344
		90.0000th: 22176
		95.0000th: 22496
		*99.0000th: 22752
		99.5000th: 22752
		99.9000th: 22752
		min=0, max=22727
	rps: 9.90 p95 (usec) 22496 p99 (usec) 22752 p95/cputime 28.12% p99/cputime 28.44%

The interferenece when using burst is valued by the possibilities for
missing the deadline and the average WCET. Test results showed that when
there many cgroups or CPU is under utilized, the interference is
limited. More details are shown in:
https://lore.kernel.org/lkml/5371BD36-55AE-4F71-B9D7-B86DC32E3D2B@linux.alibaba.com/

Co-developed-by: Shanpei Chen <shanpeic@linux.alibaba.com>
Signed-off-by: Shanpei Chen <shanpeic@linux.alibaba.com>
Co-developed-by: Tianchen Ding <dtcccc@linux.alibaba.com>
Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com>
Signed-off-by: Huaixin Chang <changhuaixin@linux.alibaba.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Ben Segall <bsegall@google.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20210621092800.23714-2-changhuaixin@linux.alibaba.com
2021-06-24 09:07:50 +02:00
David S. Miller
c2f5c57d99 Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Daniel Borkmann says:

====================
pull-request: bpf 2021-06-23

The following pull-request contains BPF updates for your *net* tree.

We've added 14 non-merge commits during the last 6 day(s) which contain
a total of 13 files changed, 137 insertions(+), 64 deletions(-).

Note that when you merge net into net-next, there is a small merge conflict
between 9f2470fbc4cb ("skmsg: Improve udp_bpf_recvmsg() accuracy") from bpf
with c49661aa6f70 ("skmsg: Remove unused parameters of sk_msg_wait_data()")
from net-next. Resolution is to: i) net/ipv4/udp_bpf.c: take udp_msg_wait_data()
and remove err parameter from the function, ii) net/ipv4/tcp_bpf.c: take
tcp_msg_wait_data() and remove err parameter from the function, iii) for
net/core/skmsg.c and include/linux/skmsg.h: remove the sk_msg_wait_data()
implementation and its prototype in header.

The main changes are:

1) Fix BPF poke descriptor adjustments after insn rewrite, from John Fastabend.

2) Fix regression when using BPF_OBJ_GET with non-O_RDWR flags, from Maciej Żenczykowski.

3) Various bug and error handling fixes for UDP-related sock_map, from Cong Wang.

4) Fix patching of vmlinux BTF IDs with correct endianness, from Tony Ambardar.

5) Two fixes for TX descriptor validation in AF_XDP, from Magnus Karlsson.

6) Fix overflow in size calculation for bpf_map_area_alloc(), from Bui Quang Minh.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-23 14:12:14 -07:00
Peter Zijlstra
012669c740 perf: Fix task context PMU for Hetero
On HETEROGENEOUS hardware (ARM big.Little, Intel Alderlake etc.) each
CPU might have a different hardware PMU. Since each such PMU is
represented by a different struct pmu, but we only have a single HW
task context.

That means that the task context needs to switch PMU type when it
switches CPUs.

Not doing this means that ctx->pmu calls (pmu_{dis,en}able(),
{start,commit,cancel}_txn() etc.) are called against the wrong PMU and
things will go wobbly.

Fixes: f83d2f91d259 ("perf/x86/intel: Add Alder Lake Hybrid support")
Reported-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Kan Liang <kan.liang@linux.intel.com>
Link: https://lkml.kernel.org/r/YMsy7BuGT8nBTspT@hirez.programming.kicks-ass.net
2021-06-23 18:30:56 +02:00
Linus Torvalds
8fd2ed1c01 Merge branch 'stable/for-linus-5.14' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/swiotlb
Pull swiotlb fix from Konrad Rzeszutek Wilk:
 "A fix for the regression for the DMA operations where the offset was
  ignored and corruptions would appear.

  Going forward there will be a cleanups to make the offset and
  alignment logic more clearer and better test-cases to help with this"

* 'stable/for-linus-5.14' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/swiotlb:
  swiotlb: manipulate orig_addr when tlb_addr has offset
2021-06-23 09:04:07 -07:00
Mark Brown
7fb593cbd8
Merge remote-tracking branch 'regulator/for-5.14' into regulator-next 2021-06-23 16:56:31 +01:00
John Fastabend
7506d211b9 bpf: Fix null ptr deref with mixed tail calls and subprogs
The sub-programs prog->aux->poke_tab[] is populated in jit_subprogs() and
then used when emitting 'BPF_JMP|BPF_TAIL_CALL' insn->code from the
individual JITs. The poke_tab[] to use is stored in the insn->imm by
the code adding it to that array slot. The JIT then uses imm to find the
right entry for an individual instruction. In the x86 bpf_jit_comp.c
this is done by calling emit_bpf_tail_call_direct with the poke_tab[]
of the imm value.

However, we observed the below null-ptr-deref when mixing tail call
programs with subprog programs. For this to happen we just need to
mix bpf-2-bpf calls and tailcalls with some extra calls or instructions
that would be patched later by one of the fixup routines. So whats
happening?

Before the fixup_call_args() -- where the jit op is done -- various
code patching is done by do_misc_fixups(). This may increase the
insn count, for example when we patch map_lookup_up using map_gen_lookup
hook. This does two things. First, it means the instruction index,
insn_idx field, of a tail call instruction will move by a 'delta'.

In verifier code,

 struct bpf_jit_poke_descriptor desc = {
  .reason = BPF_POKE_REASON_TAIL_CALL,
  .tail_call.map = BPF_MAP_PTR(aux->map_ptr_state),
  .tail_call.key = bpf_map_key_immediate(aux),
  .insn_idx = i + delta,
 };

Then subprog start values subprog_info[i].start will be updated
with the delta and any poke descriptor index will also be updated
with the delta in adjust_poke_desc(). If we look at the adjust
subprog starts though we see its only adjusted when the delta
occurs before the new instructions,

        /* NOTE: fake 'exit' subprog should be updated as well. */
        for (i = 0; i <= env->subprog_cnt; i++) {
                if (env->subprog_info[i].start <= off)
                        continue;

Earlier subprograms are not changed because their start values
are not moved. But, adjust_poke_desc() does the offset + delta
indiscriminately. The result is poke descriptors are potentially
corrupted.

Then in jit_subprogs() we only populate the poke_tab[]
when the above insn_idx is less than the next subprogram start. From
above we corrupted our insn_idx so we might incorrectly assume a
poke descriptor is not used in a subprogram omitting it from the
subprogram. And finally when the jit runs it does the deref of poke_tab
when emitting the instruction and crashes with below. Because earlier
step omitted the poke descriptor.

The fix is straight forward with above context. Simply move same logic
from adjust_subprog_starts() into adjust_poke_descs() and only adjust
insn_idx when needed.

[   82.396354] bpf_testmod: version magic '5.12.0-rc2alu+ SMP preempt mod_unload ' should be '5.12.0+ SMP preempt mod_unload '
[   82.623001] loop10: detected capacity change from 0 to 8
[   88.487424] ==================================================================
[   88.487438] BUG: KASAN: null-ptr-deref in do_jit+0x184a/0x3290
[   88.487455] Write of size 8 at addr 0000000000000008 by task test_progs/5295
[   88.487471] CPU: 7 PID: 5295 Comm: test_progs Tainted: G          I       5.12.0+ #386
[   88.487483] Hardware name: Dell Inc. Precision 5820 Tower/002KVM, BIOS 1.9.2 01/24/2019
[   88.487490] Call Trace:
[   88.487498]  dump_stack+0x93/0xc2
[   88.487515]  kasan_report.cold+0x5f/0xd8
[   88.487530]  ? do_jit+0x184a/0x3290
[   88.487542]  do_jit+0x184a/0x3290
 ...
[   88.487709]  bpf_int_jit_compile+0x248/0x810
 ...
[   88.487765]  bpf_check+0x3718/0x5140
 ...
[   88.487920]  bpf_prog_load+0xa22/0xf10

Fixes: a748c6975dea3 ("bpf: propagate poke descriptors to subprograms")
Reported-by: Jussi Maki <joamaki@gmail.com>
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Reviewed-by: Daniel Borkmann <daniel@iogearbox.net>
2021-06-22 14:46:39 -07:00
Mimi Zohar
0c18f29aae module: limit enabling module.sig_enforce
Irrespective as to whether CONFIG_MODULE_SIG is configured, specifying
"module.sig_enforce=1" on the boot command line sets "sig_enforce".
Only allow "sig_enforce" to be set when CONFIG_MODULE_SIG is configured.

This patch makes the presence of /sys/module/module/parameters/sig_enforce
dependent on CONFIG_MODULE_SIG=y.

Fixes: fda784e50aac ("module: export module signature enforcement status")
Reported-by: Nayna Jain <nayna@linux.ibm.com>
Tested-by: Mimi Zohar <zohar@linux.ibm.com>
Tested-by: Jessica Yu <jeyu@kernel.org>
Signed-off-by: Mimi Zohar <zohar@linux.ibm.com>
Signed-off-by: Jessica Yu <jeyu@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-22 11:13:19 -07:00
Nick Desaulniers
51c2ee6d12 Kconfig: Introduce ARCH_WANTS_NO_INSTR and CC_HAS_NO_PROFILE_FN_ATTR
We don't want compiler instrumentation to touch noinstr functions,
which are annotated with the no_profile_instrument_function function
attribute. Add a Kconfig test for this and make GCOV depend on it, and
in the future, PGO.

If an architecture is using noinstr, it should denote that via this
Kconfig value. That makes Kconfigs that depend on noinstr able to express
dependencies in an architecturally agnostic way.

Cc: Masahiro Yamada <masahiroy@kernel.org>
Link: https://lore.kernel.org/lkml/YMTn9yjuemKFLbws@hirez.programming.kicks-ass.net/
Link: https://lore.kernel.org/lkml/YMcssV%2Fn5IBGv4f0@hirez.programming.kicks-ass.net/
Suggested-by: Nathan Chancellor <nathan@kernel.org>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Nick Desaulniers <ndesaulniers@google.com>
Reviewed-by: Peter Oberparleiter <oberpar@linux.ibm.com>
Reviewed-by: Nathan Chancellor <nathan@kernel.org>
Acked-by: Mark Rutland <mark.rutland@arm.com>
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20210621231822.2848305-4-ndesaulniers@google.com
2021-06-22 11:07:18 -07:00
Bui Quang Minh
7dd5d437c2 bpf: Fix integer overflow in argument calculation for bpf_map_area_alloc
In 32-bit architecture, the result of sizeof() is a 32-bit integer so
the expression becomes the multiplication between 2 32-bit integer which
can potentially leads to integer overflow. As a result,
bpf_map_area_alloc() allocates less memory than needed.

Fix this by casting 1 operand to u64.

Fixes: 0d2c4f964050 ("bpf: Eliminate rlimit-based memory accounting for sockmap and sockhash maps")
Fixes: 99c51064fb06 ("devmap: Use bpf_map_area_alloc() for allocating hash buckets")
Fixes: 546ac1ffb70d ("bpf: add devmap, a map for storing net device references")
Signed-off-by: Bui Quang Minh <minhquangbui99@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210613143440.71975-1-minhquangbui99@gmail.com
2021-06-22 10:14:29 -07:00
Baokun Li
4e82d2e20f clockevents: Use list_move() instead of list_del()/list_add()
Simplify the code.

Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20210609070242.1322450-1-libaokun1@huawei.com
2021-06-22 17:16:46 +02:00
Feng Tang
22a2238337 clocksource: Print deviation in nanoseconds when a clocksource becomes unstable
Currently when an unstable clocksource is detected, the raw counters of
that clocksource and watchdog will be printed, which can only be understood
after some math calculation.

So print the delta in nanoseconds as well to make it easier for humans to
check the results.

[ paulmck: Fix typo. ]

Signed-off-by: Feng Tang <feng.tang@intel.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20210527190124.440372-6-paulmck@kernel.org
2021-06-22 16:53:17 +02:00
Paul E. McKenney
1253b9b87e clocksource: Provide kernel module to test clocksource watchdog
When the clocksource watchdog marks a clock as unstable, this might
be due to that clock being unstable or it might be due to delays that
happen to occur between the reads of the two clocks.  It would be good
to have a way of testing the clocksource watchdog's ability to
distinguish between these two causes of clock skew and instability.

Therefore, provide a new clocksource-wdtest module selected by a new
TEST_CLOCKSOURCE_WATCHDOG Kconfig option.  This module has a single module
parameter named "holdoff" that provides the number of seconds of delay
before testing should start, which defaults to zero when built as a module
and to 10 seconds when built directly into the kernel.  Very large systems
that boot slowly may need to increase the value of this module parameter.

This module uses hand-crafted clocksource structures to do its testing,
thus avoiding messing up timing for the rest of the kernel and for user
applications.  This module first verifies that the ->uncertainty_margin
field of the clocksource structures are set sanely.  It then tests the
delay-detection capability of the clocksource watchdog, increasing the
number of consecutive delays injected, first provoking console messages
complaining about the delays and finally forcing a clock-skew event.
Unexpected test results cause at least one WARN_ON_ONCE() console splat.
If there are no splats, the test has passed.  Finally, it fuzzes the
value returned from a clocksource to test the clocksource watchdog's
ability to detect time skew.

This module checks the state of its clocksource after each test, and
uses WARN_ON_ONCE() to emit a console splat if there are any failures.
This should enable all types of test frameworks to detect any such
failures.

This facility is intended for diagnostic use only, and should be avoided
on production systems.

Reported-by: Chris Mason <clm@fb.com>
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Feng Tang <feng.tang@intel.com>
Link: https://lore.kernel.org/r/20210527190124.440372-5-paulmck@kernel.org
2021-06-22 16:53:17 +02:00
Paul E. McKenney
2e27e793e2 clocksource: Reduce clocksource-skew threshold
Currently, WATCHDOG_THRESHOLD is set to detect a 62.5-millisecond skew in
a 500-millisecond WATCHDOG_INTERVAL.  This requires that clocks be skewed
by more than 12.5% in order to be marked unstable.  Except that a clock
that is skewed by that much is probably destroying unsuspecting software
right and left.  And given that there are now checks for false-positive
skews due to delays between reading the two clocks, it should be possible
to greatly decrease WATCHDOG_THRESHOLD, at least for fine-grained clocks
such as TSC.

Therefore, add a new uncertainty_margin field to the clocksource structure
that contains the maximum uncertainty in nanoseconds for the corresponding
clock.  This field may be initialized manually, as it is for
clocksource_tsc_early and clocksource_jiffies, which is copied to
refined_jiffies.  If the field is not initialized manually, it will be
computed at clock-registry time as the period of the clock in question
based on the scale and freq parameters to __clocksource_update_freq_scale()
function.  If either of those two parameters are zero, the
tens-of-milliseconds WATCHDOG_THRESHOLD is used as a cowardly alternative
to dividing by zero.  No matter how the uncertainty_margin field is
calculated, it is bounded below by twice WATCHDOG_MAX_SKEW, that is, by 100
microseconds.

Note that manually initialized uncertainty_margin fields are not adjusted,
but there is a WARN_ON_ONCE() that triggers if any such field is less than
twice WATCHDOG_MAX_SKEW.  This WARN_ON_ONCE() is intended to discourage
production use of the one-nanosecond uncertainty_margin values that are
used to test the clock-skew code itself.

The actual clock-skew check uses the sum of the uncertainty_margin fields
of the two clocksource structures being compared.  Integer overflow is
avoided because the largest computed value of the uncertainty_margin
fields is one billion (10^9), and double that value fits into an
unsigned int.  However, if someone manually specifies (say) UINT_MAX,
they will get what they deserve.

Note that the refined_jiffies uncertainty_margin field is initialized to
TICK_NSEC, which means that skew checks involving this clocksource will
be sufficently forgiving.  In a similar vein, the clocksource_tsc_early
uncertainty_margin field is initialized to 32*NSEC_PER_MSEC, which
replicates the current behavior and allows custom setting if needed
in order to address the rare skews detected for this clocksource in
current mainline.

Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Feng Tang <feng.tang@intel.com>
Link: https://lore.kernel.org/r/20210527190124.440372-4-paulmck@kernel.org
2021-06-22 16:53:16 +02:00
Paul E. McKenney
fa218f1cce clocksource: Limit number of CPUs checked for clock synchronization
Currently, if skew is detected on a clock marked CLOCK_SOURCE_VERIFY_PERCPU,
that clock is checked on all CPUs.  This is thorough, but might not be
what you want on a system with a few tens of CPUs, let alone a few hundred
of them.

Therefore, by default check only up to eight randomly chosen CPUs.  Also
provide a new clocksource.verify_n_cpus kernel boot parameter.  A value of
-1 says to check all of the CPUs, and a non-negative value says to randomly
select that number of CPUs, without concern about selecting the same CPU
multiple times.  However, make use of a cpumask so that a given CPU will be
checked at most once.

Suggested-by: Thomas Gleixner <tglx@linutronix.de> # For verify_n_cpus=1.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Feng Tang <feng.tang@intel.com>
Link: https://lore.kernel.org/r/20210527190124.440372-3-paulmck@kernel.org
2021-06-22 16:53:16 +02:00
Paul E. McKenney
7560c02bdf clocksource: Check per-CPU clock synchronization when marked unstable
Some sorts of per-CPU clock sources have a history of going out of
synchronization with each other.  However, this problem has purportedy been
solved in the past ten years.  Except that it is all too possible that the
problem has instead simply been made less likely, which might mean that
some of the occasional "Marking clocksource 'tsc' as unstable" messages
might be due to desynchronization.  How would anyone know?

Therefore apply CPU-to-CPU synchronization checking to newly unstable
clocksource that are marked with the new CLOCK_SOURCE_VERIFY_PERCPU flag.
Lists of desynchronized CPUs are printed, with the caveat that if it
is the reporting CPU that is itself desynchronized, it will appear that
all the other clocks are wrong.  Just like in real life.

Reported-by: Chris Mason <clm@fb.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Feng Tang <feng.tang@intel.com>
Link: https://lore.kernel.org/r/20210527190124.440372-2-paulmck@kernel.org
2021-06-22 16:53:16 +02:00
Paul E. McKenney
db3a34e174 clocksource: Retry clock read if long delays detected
When the clocksource watchdog marks a clock as unstable, this might be due
to that clock being unstable or it might be due to delays that happen to
occur between the reads of the two clocks.  Yes, interrupts are disabled
across those two reads, but there are no shortage of things that can delay
interrupts-disabled regions of code ranging from SMI handlers to vCPU
preemption.  It would be good to have some indication as to why the clock
was marked unstable.

Therefore, re-read the watchdog clock on either side of the read from the
clock under test.  If the watchdog clock shows an excessive time delta
between its pair of reads, the reads are retried.

The maximum number of retries is specified by a new kernel boot parameter
clocksource.max_cswd_read_retries, which defaults to three, that is, up to
four reads, one initial and up to three retries.  If more than one retry
was required, a message is printed on the console (the occasional single
retry is expected behavior, especially in guest OSes).  If the maximum
number of retries is exceeded, the clock under test will be marked
unstable.  However, the probability of this happening due to various sorts
of delays is quite small.  In addition, the reason (clock-read delays) for
the unstable marking will be apparent.

Reported-by: Chris Mason <clm@fb.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Feng Tang <feng.tang@intel.com>
Link: https://lore.kernel.org/r/20210527190124.440372-1-paulmck@kernel.org
2021-06-22 16:53:16 +02:00
Xiongwei Song
0e8a89d49d locking/lockdep: Correct the description error for check_redundant()
If there is no matched result, check_redundant() will return BFS_RNOMATCH.

Signed-off-by: Xiongwei Song <sxwjean@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Boqun Feng <boqun.feng@gmail.com>
Link: https://lkml.kernel.org/r/20210618130230.123249-1-sxwjean@me.com
2021-06-22 16:42:09 +02:00
Thomas Gleixner
bf22a69768 futex: Provide FUTEX_LOCK_PI2 to support clock selection
The FUTEX_LOCK_PI futex operand uses a CLOCK_REALTIME based absolute
timeout since it was implemented, but it does not require that the
FUTEX_CLOCK_REALTIME flag is set, because that was introduced later.

In theory as none of the user space implementations can set the
FUTEX_CLOCK_REALTIME flag on this operand, it would be possible to
creatively abuse it and make the meaning invers, i.e. select CLOCK_REALTIME
when not set and CLOCK_MONOTONIC when set. But that's a nasty hackery.

Another option would be to have a new FUTEX_CLOCK_MONOTONIC flag only for
FUTEX_LOCK_PI, but that's also awkward because it does not allow libraries
to handle the timeout clock selection consistently.

So provide a new FUTEX_LOCK_PI2 operand which implements the timeout
semantics which the other operands use and leave FUTEX_LOCK_PI alone.

Reported-by: Kurt Kanzenbach <kurt@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20210422194705.440773992@linutronix.de
2021-06-22 16:42:09 +02:00
Thomas Gleixner
e112c41341 futex: Prepare futex_lock_pi() for runtime clock selection
futex_lock_pi() is the only futex operation which cannot select the clock
for timeouts (CLOCK_MONOTONIC/CLOCK_REALTIME). That's inconsistent and
there is no particular reason why this cannot be supported.

This was overlooked when CLOCK_REALTIME_FLAG was introduced and
unfortunately not reported when the inconsistency was discovered in glibc.

Prepare the function and enforce the CLOCK_REALTIME_FLAG on FUTEX_LOCK_PI
so that a new FUTEX_LOCK_PI2 can implement it correctly.

Reported-by: Kurt Kanzenbach <kurt@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20210422194705.338657741@linutronix.de
2021-06-22 16:42:08 +02:00
Peter Zijlstra
f8b298cc39 lockdep: Fix wait-type for empty stack
Even the very first lock can violate the wait-context check, consider
the various IRQ contexts.

Fixes: de8f5e4f2dc1 ("lockdep: Introduce wait-type checks")
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Joerg Roedel <jroedel@suse.de>
Link: https://lore.kernel.org/r/20210617190313.256987481@infradead.org
2021-06-22 16:42:08 +02:00
Boqun Feng
7b1f8c6179 lockding/lockdep: Avoid to find wrong lock dep path in check_irq_usage()
In the step #3 of check_irq_usage(), we seach backwards to find a lock
whose usage conflicts the usage of @target_entry1 on safe/unsafe.
However, we should only keep the irq-unsafe usage of @target_entry1 into
consideration, because it could be a case where a lock is hardirq-unsafe
but soft-safe, and in check_irq_usage() we find it because its
hardirq-unsafe could result into a hardirq-safe-unsafe deadlock, but
currently since we don't filter out the other usage bits, so we may find
a lock dependency path softirq-unsafe -> softirq-safe, which in fact
doesn't cause a deadlock. And this may cause misleading lockdep splats.

Fix this by only keeping LOCKF_ENABLED_IRQ_ALL bits when we try the
backwards search.

Reported-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20210618170110.3699115-4-boqun.feng@gmail.com
2021-06-22 16:42:07 +02:00
Boqun Feng
d4c157c7b1 locking/lockdep: Remove the unnecessary trace saving
In print_bad_irq_dependency(), save_trace() is called to set the ->trace
for @prev_root as the current call trace, however @prev_root corresponds
to the the held lock, which may not be acquired in current call trace,
therefore it's wrong to use save_trace() to set ->trace of @prev_root.
Moreover, with our adjustment of printing backwards dependency path, the
->trace of @prev_root is unncessary, so remove it.

Reported-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20210618170110.3699115-3-boqun.feng@gmail.com
2021-06-22 16:42:07 +02:00
Boqun Feng
69c7a5fb24 locking/lockdep: Fix the dep path printing for backwards BFS
We use the same code to print backwards lock dependency path as the
forwards lock dependency path, and this could result into incorrect
printing because for a backwards lock_list ->trace is not the call trace
where the lock of ->class is acquired.

Fix this by introducing a separate function on printing the backwards
dependency path. Also add a few comments about the printing while we are
at it.

Reported-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20210618170110.3699115-2-boqun.feng@gmail.com
2021-06-22 16:42:06 +02:00
Qais Yousef
0213b7083e sched/uclamp: Fix uclamp_tg_restrict()
Now cpu.uclamp.min acts as a protection, we need to make sure that the
uclamp request of the task is within the allowed range of the cgroup,
that is it is clamp()'ed correctly by tg->uclamp[UCLAMP_MIN] and
tg->uclamp[UCLAMP_MAX].

As reported by Xuewen [1] we can have some corner cases where there's
inversion between uclamp requested by task (p) and the uclamp values of
the taskgroup it's attached to (tg). Following table demonstrates
2 corner cases:

	           |  p  |  tg  |  effective
	-----------+-----+------+-----------
	CASE 1
	-----------+-----+------+-----------
	uclamp_min | 60% | 0%   |  60%
	-----------+-----+------+-----------
	uclamp_max | 80% | 50%  |  50%
	-----------+-----+------+-----------
	CASE 2
	-----------+-----+------+-----------
	uclamp_min | 0%  | 30%  |  30%
	-----------+-----+------+-----------
	uclamp_max | 20% | 50%  |  20%
	-----------+-----+------+-----------

With this fix we get:

	           |  p  |  tg  |  effective
	-----------+-----+------+-----------
	CASE 1
	-----------+-----+------+-----------
	uclamp_min | 60% | 0%   |  50%
	-----------+-----+------+-----------
	uclamp_max | 80% | 50%  |  50%
	-----------+-----+------+-----------
	CASE 2
	-----------+-----+------+-----------
	uclamp_min | 0%  | 30%  |  30%
	-----------+-----+------+-----------
	uclamp_max | 20% | 50%  |  30%
	-----------+-----+------+-----------

Additionally uclamp_update_active_tasks() must now unconditionally
update both UCLAMP_MIN/MAX because changing the tg's UCLAMP_MAX for
instance could have an impact on the effective UCLAMP_MIN of the tasks.

	           |  p  |  tg  |  effective
	-----------+-----+------+-----------
	old
	-----------+-----+------+-----------
	uclamp_min | 60% | 0%   |  50%
	-----------+-----+------+-----------
	uclamp_max | 80% | 50%  |  50%
	-----------+-----+------+-----------
	*new*
	-----------+-----+------+-----------
	uclamp_min | 60% | 0%   | *60%*
	-----------+-----+------+-----------
	uclamp_max | 80% |*70%* | *70%*
	-----------+-----+------+-----------

[1] https://lore.kernel.org/lkml/CAB8ipk_a6VFNjiEnHRHkUMBKbA+qzPQvhtNjJ_YNzQhqV_o8Zw@mail.gmail.com/

Fixes: 0c18f2ecfcc2 ("sched/uclamp: Fix wrong implementation of cpu.uclamp.min")
Reported-by: Xuewen Yan <xuewen.yan94@gmail.com>
Signed-off-by: Qais Yousef <qais.yousef@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20210617165155.3774110-1-qais.yousef@arm.com
2021-06-22 16:41:59 +02:00
Vincent Donnefort
d7d607096a sched/rt: Fix Deadline utilization tracking during policy change
DL keeps track of the utilization on a per-rq basis with the structure
avg_dl. This utilization is updated during task_tick_dl(),
put_prev_task_dl() and set_next_task_dl(). However, when the current
running task changes its policy, set_next_task_dl() which would usually
take care of updating the utilization when the rq starts running DL
tasks, will not see a such change, leaving the avg_dl structure outdated.
When that very same task will be dequeued later, put_prev_task_dl() will
then update the utilization, based on a wrong last_update_time, leading to
a huge spike in the DL utilization signal.

The signal would eventually recover from this issue after few ms. Even
if no DL tasks are run, avg_dl is also updated in
__update_blocked_others(). But as the CPU capacity depends partly on the
avg_dl, this issue has nonetheless a significant impact on the scheduler.

Fix this issue by ensuring a load update when a running task changes
its policy to DL.

Fixes: 3727e0e ("sched/dl: Add dl_rq utilization tracking")
Signed-off-by: Vincent Donnefort <vincent.donnefort@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lore.kernel.org/r/1624271872-211872-3-git-send-email-vincent.donnefort@arm.com
2021-06-22 16:41:59 +02:00
Vincent Donnefort
fecfcbc288 sched/rt: Fix RT utilization tracking during policy change
RT keeps track of the utilization on a per-rq basis with the structure
avg_rt. This utilization is updated during task_tick_rt(),
put_prev_task_rt() and set_next_task_rt(). However, when the current
running task changes its policy, set_next_task_rt() which would usually
take care of updating the utilization when the rq starts running RT tasks,
will not see a such change, leaving the avg_rt structure outdated. When
that very same task will be dequeued later, put_prev_task_rt() will then
update the utilization, based on a wrong last_update_time, leading to a
huge spike in the RT utilization signal.

The signal would eventually recover from this issue after few ms. Even if
no RT tasks are run, avg_rt is also updated in __update_blocked_others().
But as the CPU capacity depends partly on the avg_rt, this issue has
nonetheless a significant impact on the scheduler.

Fix this issue by ensuring a load update when a running task changes
its policy to RT.

Fixes: 371bf427 ("sched/rt: Add rt_rq utilization tracking")
Signed-off-by: Vincent Donnefort <vincent.donnefort@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lore.kernel.org/r/1624271872-211872-2-git-send-email-vincent.donnefort@arm.com
2021-06-22 16:41:59 +02:00
Baokun Li
64ab707125 clockevents: Add missing parameter documentation
Add the missing documentation for the @cpu parameter of
tick_cleanup_dead_cpu().

Signed-off-by: Baokun Li <libaokun1@huawei.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20210608024305.2750999-1-libaokun1@huawei.com
2021-06-22 16:33:16 +02:00
Thomas Gleixner
399f8dd9a8 signal: Prevent sigqueue caching after task got released
syzbot reported a memory leak related to sigqueue caching.

The assumption that a task cannot cache a sigqueue after the signal handler
has been dropped and exit_task_sigqueue_cache() has been invoked turns out
to be wrong.

Such a task can still invoke release_task(other_task), which cleans up the
signals of 'other_task' and ends up in sigqueue_cache_or_free(), which in
turn will cache the signal because task->sigqueue_cache is NULL. That's
obviously bogus because nothing will free the cached signal of that task
anymore, so the cached item is leaked.

This happens when e.g. the last non-leader thread exits and reaps the
zombie leader.

Prevent this by setting tsk::sigqueue_cache to an error pointer value in
exit_task_sigqueue_cache() which forces any subsequent invocation of
sigqueue_cache_or_free() from that task to hand the sigqueue back to the
kmemcache.

Add comments to all relevant places.

Fixes: 4bad58ebc8bc ("signal: Allow tasks to cache one sigqueue struct")
Reported-by: syzbot+0bac5fec63d4f399ba98@syzkaller.appspotmail.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Link: https://lore.kernel.org/r/878s32g6j5.ffs@nanos.tec.linutronix.de
2021-06-22 15:55:41 +02:00
Maciej Żenczykowski
5dec6d96d1 bpf: Fix regression on BPF_OBJ_GET with non-O_RDWR flags
This reverts commit d37300ed1821 ("bpf: program: Refuse non-O_RDWR flags
in BPF_OBJ_GET"). It breaks Android userspace which expects to be able to
fetch programs with just read permissions.

See: https://cs.android.com/android/platform/superproject/+/master:frameworks/libs/net/common/native/bpf_syscall_wrappers/include/BpfSyscallWrappers.h;drc=7005c764be23d31fa1d69e826b4a2f6689a8c81e;l=124

Side-note: another option to fix it would be to extend bpf_prog_new_fd()
and to pass in used file mode flags in the same way as we do for maps via
bpf_map_new_fd(). Meaning, they'd end up in anon_inode_getfd() and thus
would be retained for prog fd operations with bpf() syscall. Right now
these flags are not checked with progs since they are immutable for their
lifetime (as opposed to maps which can be updated from user space). In
future this could potentially change with new features, but at that point
it's still fine to do the bpf_prog_new_fd() extension when needed. For a
simple stable fix, a revert is less churn.

Fixes: d37300ed1821 ("bpf: program: Refuse non-O_RDWR flags in BPF_OBJ_GET")
Signed-off-by: Maciej Żenczykowski <maze@google.com>
[ Daniel: added side-note to commit message ]
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Lorenz Bauer <lmb@cloudflare.com>
Acked-by: Greg Kroah-Hartman <gregkh@google.com>
Link: https://lore.kernel.org/bpf/20210618105526.265003-1-zenczykowski@gmail.com
2021-06-22 14:57:43 +02:00
Rik van Riel
fdaba61ef8 sched/fair: Ensure that the CFS parent is added after unthrottling
Ensure that a CFS parent will be in the list whenever one of its children is also
in the list.

A warning on rq->tmp_alone_branch != &rq->leaf_cfs_rq_list has been
reported while running LTP test cfs_bandwidth01.

Odin Ugedal found the root cause:

	$ tree /sys/fs/cgroup/ltp/ -d --charset=ascii
	/sys/fs/cgroup/ltp/
	|-- drain
	`-- test-6851
	    `-- level2
		|-- level3a
		|   |-- worker1
		|   `-- worker2
		`-- level3b
		    `-- worker3

Timeline (ish):
- worker3 gets throttled
- level3b is decayed, since it has no more load
- level2 get throttled
- worker3 get unthrottled
- level2 get unthrottled
  - worker3 is added to list
  - level3b is not added to list, since nr_running==0 and is decayed

 [ Vincent Guittot: Rebased and updated to fix for the reported warning. ]

Fixes: a7b359fc6a37 ("sched/fair: Correctly insert cfs_rq's to list on unthrottle")
Reported-by: Sachin Sant <sachinp@linux.vnet.ibm.com>
Suggested-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Rik van Riel <riel@surriel.com>
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Sachin Sant <sachinp@linux.vnet.ibm.com>
Acked-by: Odin Ugedal <odin@uged.al>
Link: https://lore.kernel.org/r/20210621174330.11258-1-vincent.guittot@linaro.org
2021-06-22 14:06:57 +02:00
Peter Zijlstra
49faa77759 locking/lockdep: Improve noinstr vs errors
Better handle the failure paths.

  vmlinux.o: warning: objtool: debug_locks_off()+0x23: call to console_verbose() leaves .noinstr.text section
  vmlinux.o: warning: objtool: debug_locks_off()+0x19: call to __kasan_check_write() leaves .noinstr.text section

  debug_locks_off+0x19/0x40:
  instrument_atomic_write at include/linux/instrumented.h:86
  (inlined by) __debug_locks_off at include/linux/debug_locks.h:17
  (inlined by) debug_locks_off at lib/debug_locks.c:41

Fixes: 6eebad1ad303 ("lockdep: __always_inline more for noinstr")
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20210621120120.784404944@infradead.org
2021-06-22 13:56:43 +02:00
John Ogness
3342aa8e6b printk: fix cpu lock ordering
The cpu lock implementation uses a full memory barrier to take
the lock, but no memory barriers when releasing the lock. This
means that changes performed by a lock owner may not be seen by
the next lock owner. This may have been "good enough" for use
by dump_stack() as a serialization mechanism, but it is not
enough to provide proper protection for a critical section.

Correct this problem by using acquire/release memory barriers
for lock/unlock, respectively.

Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20210617095051.4808-3-john.ogness@linutronix.de
2021-06-22 09:57:15 +02:00
John Ogness
766c268bc6 lib/dump_stack: move cpu lock to printk.c
dump_stack() implements its own cpu-reentrant spinning lock to
best-effort serialize stack traces in the printk log. However,
there are other functions (such as show_regs()) that can also
benefit from this serialization.

Move the cpu-reentrant spinning lock (cpu lock) into new helper
functions printk_cpu_lock_irqsave()/printk_cpu_unlock_irqrestore()
so that it is available for others as well. For !CONFIG_SMP the
cpu lock is a NOP.

Note that having multiple cpu locks in the system can easily
lead to deadlock. Code needing a cpu lock should use the
printk cpu lock, since the printk cpu lock could be acquired
from any code and any context.

Also note that it is not necessary for a cpu lock to disable
interrupts. However, in upcoming work this cpu lock will be used
for emergency tasks (for example, atomic consoles during kernel
crashes) and any interruptions while holding the cpu lock should
be avoided if possible.

Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Reviewed-by: Petr Mladek <pmladek@suse.com>
[pmladek@suse.com: Backported on top of 5.13-rc1.]
Signed-off-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20210617095051.4808-2-john.ogness@linutronix.de
2021-06-22 09:56:10 +02:00
Bumyong Lee
5f89468e2f swiotlb: manipulate orig_addr when tlb_addr has offset
in case of driver wants to sync part of ranges with offset,
swiotlb_tbl_sync_single() copies from orig_addr base to tlb_addr with
offset and ends up with data mismatch.

It was removed from
"swiotlb: don't modify orig_addr in swiotlb_tbl_sync_single",
but said logic has to be added back in.

From Linus's email:
"That commit which the removed the offset calculation entirely, because the old

        (unsigned long)tlb_addr & (IO_TLB_SIZE - 1)

was wrong, but instead of removing it, I think it should have just
fixed it to be

        (tlb_addr - mem->start) & (IO_TLB_SIZE - 1);

instead. That way the slot offset always matches the slot index calculation."

(Unfortunatly that broke NVMe).

The use-case that drivers are hitting is as follow:

1. Get dma_addr_t from dma_map_single()

dma_addr_t tlb_addr = dma_map_single(dev, vaddr, vsize, DMA_TO_DEVICE);

    |<---------------vsize------------->|
    +-----------------------------------+
    |                                   | original buffer
    +-----------------------------------+
  vaddr

 swiotlb_align_offset
     |<----->|<---------------vsize------------->|
     +-------+-----------------------------------+
     |       |                                   | swiotlb buffer
     +-------+-----------------------------------+
          tlb_addr

2. Do something
3. Sync dma_addr_t through dma_sync_single_for_device(..)

dma_sync_single_for_device(dev, tlb_addr + offset, size, DMA_TO_DEVICE);

  Error case.
    Copy data to original buffer but it is from base addr (instead of
  base addr + offset) in original buffer:

 swiotlb_align_offset
     |<----->|<- offset ->|<- size ->|
     +-------+-----------------------------------+
     |       |            |##########|           | swiotlb buffer
     +-------+-----------------------------------+
          tlb_addr

    |<- size ->|
    +-----------------------------------+
    |##########|                        | original buffer
    +-----------------------------------+
  vaddr

The fix is to copy the data to the original buffer and take into
account the offset, like so:

 swiotlb_align_offset
     |<----->|<- offset ->|<- size ->|
     +-------+-----------------------------------+
     |       |            |##########|           | swiotlb buffer
     +-------+-----------------------------------+
          tlb_addr

    |<- offset ->|<- size ->|
    +-----------------------------------+
    |            |##########|           | original buffer
    +-----------------------------------+
  vaddr

[One fix which was Linus's that made more sense to as it created a
symmetry would break NVMe. The reason for that is the:
 unsigned int offset = (tlb_addr - mem->start) & (IO_TLB_SIZE - 1);

would come up with the proper offset, but it would lose the
alignment (which this patch contains).]

Fixes: 16fc3cef33a0 ("swiotlb: don't modify orig_addr in swiotlb_tbl_sync_single")
Signed-off-by: Bumyong Lee <bumyong.lee@samsung.com>
Signed-off-by: Chanho Park <chanho61.park@samsung.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reported-by: Dominique MARTINET <dominique.martinet@atmark-techno.com>
Reported-by: Horia Geantă <horia.geanta@nxp.com>
Tested-by: Horia Geantă <horia.geanta@nxp.com>
CC: stable@vger.kernel.org
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2021-06-21 08:59:02 -04:00
Matti Vaittinen
dfa19b1138
reboot: Add hardware protection power-off
There can be few cases when we need to shut-down the system in order to
protect the hardware. Currently this is done at least by the thermal core
when temperature raises over certain limit.

Some PMICs can also generate interrupts for example for over-current or
over-voltage, voltage drops, short-circuit, ... etc. On some systems
these are a sign of hardware failure and only thing to do is try to
protect the rest of the hardware by shutting down the system.

Add shut-down logic which can be used by all subsystems instead of
implementing the shutdown in each subsystem. The logic is stolen from
thermal_core with difference of using atomic_t instead of a mutex in
order to allow calls directly from IRQ context and changing the WARN()
to pr_emerg() as discussed here:
https://lore.kernel.org/lkml/YJuPwAZroVZ%2Fw633@alley/
and here:
https://lore.kernel.org/linux-iommu/20210331093104.383705-4-geert+renesas@glider.be/

Signed-off-by: Matti Vaittinen <matti.vaittinen@fi.rohmeurope.com>
Acked-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Link: https://lore.kernel.org/r/e83ec1ca9408f90c857ea9dcdc57b14d9037b03f.1622628333.git.matti.vaittinen@fi.rohmeurope.com
Signed-off-by: Mark Brown <broonie@kernel.org>
2021-06-21 13:08:36 +01:00
Thomas Gleixner
b22afcdf04 cpu/hotplug: Cure the cpusets trainwreck
Alexey and Joshua tried to solve a cpusets related hotplug problem which is
user space visible and results in unexpected behaviour for some time after
a CPU has been plugged in and the corresponding uevent was delivered.

cpusets delegate the hotplug work (rebuilding cpumasks etc.) to a
workqueue. This is done because the cpusets code has already a lock
nesting of cgroups_mutex -> cpu_hotplug_lock. A synchronous callback or
waiting for the work to finish with cpu_hotplug_lock held can and will
deadlock because that results in the reverse lock order.

As a consequence the uevent can be delivered before cpusets have consistent
state which means that a user space invocation of sched_setaffinity() to
move a task to the plugged CPU fails up to the point where the scheduled
work has been processed.

The same is true for CPU unplug, but that does not create user observable
failure (yet).

It's still inconsistent to claim that an operation is finished before it
actually is and that's the real issue at hand. uevents just make it
reliably observable.

Obviously the problem should be fixed in cpusets/cgroups, but untangling
that is pretty much impossible because according to the changelog of the
commit which introduced this 8 years ago:

 3a5a6d0c2b03("cpuset: don't nest cgroup_mutex inside get_online_cpus()")

the lock order cgroups_mutex -> cpu_hotplug_lock is a design decision and
the whole code is built around that.

So bite the bullet and invoke the relevant cpuset function, which waits for
the work to finish, in _cpu_up/down() after dropping cpu_hotplug_lock and
only when tasks are not frozen by suspend/hibernate because that would
obviously wait forever.

Waiting there with cpu_add_remove_lock, which is protecting the present
and possible CPU maps, held is not a problem at all because neither work
queues nor cpusets/cgroups have any lockchains related to that lock.

Waiting in the hotplug machinery is not problematic either because there
are already state callbacks which wait for hardware queues to drain. It
makes the operations slightly slower, but hotplug is slow anyway.

This ensures that state is consistent before returning from a hotplug
up/down operation. It's still inconsistent during the operation, but that's
a different story.

Add a large comment which explains why this is done and why this is not a
dump ground for the hack of the day to work around half thought out locking
schemes. Document also the implications vs. hotplug operations and
serialization or the lack of it.

Thanks to Alexy and Joshua for analyzing why this temporary
sched_setaffinity() failure happened.

Fixes: 3a5a6d0c2b03("cpuset: don't nest cgroup_mutex inside get_online_cpus()")
Reported-by: Alexey Klimov <aklimov@redhat.com>
Reported-by: Joshua Baker <jobaker@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Alexey Klimov <aklimov@redhat.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/87tuowcnv3.ffs@nanos.tec.linutronix.de
2021-06-21 10:31:06 +02:00
Linus Torvalds
cba5e97280 - A single fix to restore fairness between control groups with equal priority
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmDO65oACgkQEsHwGGHe
 VUqEuw//R3hYrPE6luEeDb8V3er2QXJn320G0Jv2yqycmZlZYzN0tuCg0heRUWGS
 Mh8jnKlNZ0GWh/9CWApxQFbqMnt/G3kkLhMZRbIfNm/dvRxdVVB8MP98+5sgcMv0
 mbuLNrrANB6HANMxs0lM0IvzH24YDDVI92J17zSZEi4mKjDPmHAvtZgjfhzlNU4p
 EZEMx7UZLyBsP4/cMXpq2SEmY8A3evjKkSjVIXhBMf929mpbGYzSM5RSPy+abGDt
 ai/E5644RT27HWo/g7mh/szC9OZbg6TNbsF9J6msInh6kCDLBv6Awh/0NUM3bDu4
 r7H2qgxTIv3oisTZNf9qjyx1uStcJJDGF66t7NhYXRVqChQbOYNBoYJ85kVSh/FD
 jjiow9WtwYrbjQ8bT/+NkGu0poUL5gxnGqUfFyucofq46ct48+I36pnr+3V12OTj
 BxJb0sIDAJNPxv1NODpOMtEJJeiumtROU0usFHb9wnTz8jGhU/PFyKdKk8wa41f2
 kG3fQp39gI6T+D0Va3tMCY3UNOFocDmDYhsYZfRtUPp+d1T2jTPP23Yx5XCQ6gew
 cxliIwttQAO6W4dJe4H5Txm5jmjzDfWJ9zrAXuABkVghRxXswiGHhhfUn2AhJK0w
 9Sz+E7xYA/G5ht6pCvXpFfsJ0S0R4YFHddu7rS8S/EGNLe2729o=
 =KnMk
 -----END PGP SIGNATURE-----

Merge tag 'sched_urgent_for_v5.13_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull scheduler fix from Borislav Petkov:
 "A single fix to restore fairness between control groups with equal
  priority"

* tag 'sched_urgent_for_v5.13_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/fair: Correctly insert cfs_rq's to list on unthrottle
2021-06-20 09:44:52 -07:00
Jakub Kicinski
adc2e56ebe Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Trivial conflicts in net/can/isotp.c and
tools/testing/selftests/net/mptcp/mptcp_connect.sh

scaled_ppm_to_ppb() was moved from drivers/ptp/ptp_clock.c
to include/linux/ptp_clock_kernel.h in -next so re-apply
the fix there.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-06-18 19:47:02 -07:00
Linus Torvalds
9ed13a17e3 Networking fixes for 5.13-rc7, including fixes from wireless, bpf,
bluetooth, netfilter and can.
 
 Current release - regressions:
 
  - mlxsw: spectrum_qdisc: Pass handle, not band number to find_class()
           to fix modifying offloaded qdiscs
 
  - lantiq: net: fix duplicated skb in rx descriptor ring
 
  - rtnetlink: fix regression in bridge VLAN configuration, empty info
               is not an error, bot-generated "fix" was not needed
 
  - libbpf: s/rx/tx/ typo on umem->rx_ring_setup_done to fix
            umem creation
 
 Current release - new code bugs:
 
  - ethtool: fix NULL pointer dereference during module EEPROM dump via
             the new netlink API
 
  - mlx5e: don't update netdev RQs with PTP-RQ, the special purpose queue
           should not be visible to the stack
 
  - mlx5e: select special PTP queue only for SKBTX_HW_TSTAMP skbs
 
  - mlx5e: verify dev is present in get devlink port ndo, avoid a panic
 
 Previous releases - regressions:
 
  - neighbour: allow NUD_NOARP entries to be force GCed
 
  - further fixes for fallout from reorg of WiFi locking
      (staging: rtl8723bs, mac80211, cfg80211)
 
  - skbuff: fix incorrect msg_zerocopy copy notifications
 
  - mac80211: fix NULL ptr deref for injected rate info
 
  - Revert "net/mlx5: Arm only EQs with EQEs" it may cause missed IRQs
 
 Previous releases - always broken:
 
  - bpf: more speculative execution fixes
 
  - netfilter: nft_fib_ipv6: skip ipv6 packets from any to link-local
 
  - udp: fix race between close() and udp_abort() resulting in a panic
 
  - fix out of bounds when parsing TCP options before packets
    are validated (in netfilter: synproxy, tc: sch_cake and mptcp)
 
  - mptcp: improve operation under memory pressure, add missing wake-ups
 
  - mptcp: fix double-lock/soft lookup in subflow_error_report()
 
  - bridge: fix races (null pointer deref and UAF) in vlan tunnel egress
 
  - ena: fix DMA mapping function issues in XDP
 
  - rds: fix memory leak in rds_recvmsg
 
 Misc:
 
  - vrf: allow larger MTUs
 
  - icmp: don't send out ICMP messages with a source address of 0.0.0.0
 
  - cdc_ncm: switch to eth%d interface naming
 
 Signed-off-by: Jakub Kicinski <kuba@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmDNP7EACgkQMUZtbf5S
 IrvTmxAAgOAM9MdRl9wnYtqXKPXJ1JJtenozwt1yX6b6OG+Ns7cm6YYafU3KoZWR
 KlzpvP90vRrER3RqksbMngHzvGjZKDS4LWRur7sRlJ1TBQoLrQCIbriAh07d7wlU
 0nnS4J8mczTCKx78QCUYy1QBIX5TQrUbx0JQZDPoIPBjFeILW+Gx/Ghg5tUR4mhf
 6icYqwIPocTXO37ZmWOzezZNVOXJF4kaQUZeuOHNe5hOtm6EeIpZbW1Xx3DIr5bd
 80a/uNU7nVyos0n7jxnfVE/oelTnYbT5scZeV/PPVqZ4U113f7uex2QP23/XhGSX
 lK1EhwPqPOyaNhQoihLM6Xzd4o7aZOcmF8NY96xqjC+DqdN+juvfJU+ClCZojGIj
 H4bwCSaj3y2PiimfQdBiIKvYMc5d4zBdw/Dpk/gLDp4d5N638TAtuunK4Mj+TEuT
 QF1qkBLIB4HFtLS0M35/twk93md/5GUdSTij2GB3fOkAWRu2m266P5m+4DigW/TB
 Xm8FgKdetvxVP0Qv/p49nPEn24Ny8wCafH1x1wVTmoda2qi6j1EXMuSa0PlCdz70
 Sl5FrlxdEkOpC4p+Aoc8APSoBXnOriAlpU+z/EVb8Co4JR/+Ge5zBWpsiZDVD0/K
 Ay0FW3I87iyn9tw1H1Fzr9GBlVl5vWRauZFHjzl90fWakCrCzJE=
 =xxUe
 -----END PGP SIGNATURE-----

Merge tag 'net-5.13-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Pull networking fixes from Jakub Kicinski:
 "Networking fixes for 5.13-rc7, including fixes from wireless, bpf,
  bluetooth, netfilter and can.

  Current release - regressions:

   - mlxsw: spectrum_qdisc: Pass handle, not band number to find_class()
     to fix modifying offloaded qdiscs

   - lantiq: net: fix duplicated skb in rx descriptor ring

   - rtnetlink: fix regression in bridge VLAN configuration, empty info
     is not an error, bot-generated "fix" was not needed

   - libbpf: s/rx/tx/ typo on umem->rx_ring_setup_done to fix umem
     creation

  Current release - new code bugs:

   - ethtool: fix NULL pointer dereference during module EEPROM dump via
     the new netlink API

   - mlx5e: don't update netdev RQs with PTP-RQ, the special purpose
     queue should not be visible to the stack

   - mlx5e: select special PTP queue only for SKBTX_HW_TSTAMP skbs

   - mlx5e: verify dev is present in get devlink port ndo, avoid a panic

  Previous releases - regressions:

   - neighbour: allow NUD_NOARP entries to be force GCed

   - further fixes for fallout from reorg of WiFi locking (staging:
     rtl8723bs, mac80211, cfg80211)

   - skbuff: fix incorrect msg_zerocopy copy notifications

   - mac80211: fix NULL ptr deref for injected rate info

   - Revert "net/mlx5: Arm only EQs with EQEs" it may cause missed IRQs

  Previous releases - always broken:

   - bpf: more speculative execution fixes

   - netfilter: nft_fib_ipv6: skip ipv6 packets from any to link-local

   - udp: fix race between close() and udp_abort() resulting in a panic

   - fix out of bounds when parsing TCP options before packets are
     validated (in netfilter: synproxy, tc: sch_cake and mptcp)

   - mptcp: improve operation under memory pressure, add missing
     wake-ups

   - mptcp: fix double-lock/soft lookup in subflow_error_report()

   - bridge: fix races (null pointer deref and UAF) in vlan tunnel
     egress

   - ena: fix DMA mapping function issues in XDP

   - rds: fix memory leak in rds_recvmsg

  Misc:

   - vrf: allow larger MTUs

   - icmp: don't send out ICMP messages with a source address of 0.0.0.0

   - cdc_ncm: switch to eth%d interface naming"

* tag 'net-5.13-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (139 commits)
  net: ethernet: fix potential use-after-free in ec_bhf_remove
  selftests/net: Add icmp.sh for testing ICMP dummy address responses
  icmp: don't send out ICMP messages with a source address of 0.0.0.0
  net: ll_temac: Avoid ndo_start_xmit returning NETDEV_TX_BUSY
  net: ll_temac: Fix TX BD buffer overwrite
  net: ll_temac: Add memory-barriers for TX BD access
  net: ll_temac: Make sure to free skb when it is completely used
  MAINTAINERS: add Guvenc as SMC maintainer
  bnxt_en: Call bnxt_ethtool_free() in bnxt_init_one() error path
  bnxt_en: Fix TQM fastpath ring backing store computation
  bnxt_en: Rediscover PHY capabilities after firmware reset
  cxgb4: fix wrong shift.
  mac80211: handle various extensible elements correctly
  mac80211: reset profile_periodicity/ema_ap
  cfg80211: avoid double free of PMSR request
  cfg80211: make certificate generation more robust
  mac80211: minstrel_ht: fix sample time check
  net: qed: Fix memcpy() overflow of qed_dcbx_params()
  net: cdc_eem: fix tx fixup skb leak
  net: hamradio: fix memory leak in mkiss_close
  ...
2021-06-18 18:55:29 -07:00
Linus Torvalds
89fec74203 Tracing fixes for 5.13:
- Have recordmcount check for valid st_shndx otherwise some archs may have
    invalid references for the mcount location.
 
  - Two fixes done for mapping pids to task names. Traces were not showing
    the names of tasks when they should have.
 
  - Fix to trace_clock_global() to prevent it from going backwards
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCYMyrQBQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qmMvAP4/teh5R2Z1ms0S0JlUM6sS3miNP1t/
 x2QQtk2xuzbgmwD/Vn+01//7VyRPGuFm9dD82vSizQiNrns0EHXHB2cEPAQ=
 =UWEU
 -----END PGP SIGNATURE-----

Merge tag 'trace-v5.13-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace

Pull tracing fixes from Steven Rostedt:

 - Have recordmcount check for valid st_shndx otherwise some archs may
   have invalid references for the mcount location.

 - Two fixes done for mapping pids to task names. Traces were not
   showing the names of tasks when they should have.

 - Fix to trace_clock_global() to prevent it from going backwards

* tag 'trace-v5.13-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
  tracing: Do no increment trace_clock_global() by one
  tracing: Do not stop recording comms if the trace file is being read
  tracing: Do not stop recording cmdlines when tracing is off
  recordmcount: Correct st_shndx handling
2021-06-18 10:57:09 -07:00
Linus Torvalds
0f4022a490 printk fixup for 5.13
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEESH4wyp42V4tXvYsjUqAMR0iAlPIFAmDMjAwACgkQUqAMR0iA
 lPItwRAAk/vvX97LYEXOpdnL/897xqb8EcQCwuTrZzs+GE8ZSZwUi5SXWoLsNGag
 /osbWbXu3x+iAu9Mta3RsFIahHpqFiGCuveSVoBxGW/vpuUrmNcc/qELqLdaaZvr
 1M3X1jUWzj2pJ6bSO/pNHI4xDi0MFODNuqMG5TYT2VdLJaCxJcDTVYabQ/Y0iauH
 idAfDJBDezb9Zpk6+rjn2RqruldfA0gUroAmsVD4U3QMkbkKlTXfdd/xDA8/BUgm
 elSUqqZw27/hFImXEp2Ng/MpLLiT8n0aApoQFwOHRoWvyJ3+KMjZnXHheqpmKqzf
 gcTaXis2w8TMAnMmfQuXGV1iQJzDYJs6XlmMjqf1Ah3DDJO2pkjOpeVO8p+1R7DK
 lmbrw5y9rhLh2DBlWfe1aclWsc9XQ5ndwrq1y7aYYgaOX8PR7pwsJoRZI+6zk22j
 0hb4jj1eFP+9agKcTgaTaDF/peWfdYjZ8bvkEsaQcKSWrsMeiTdqH65xhhV2p3+3
 NvAjMPdtQIgsWZ/M2sIDmWsyfk6PWuhtkxzEdYrQMAnsASOMEQ+oZ/xFtmOU/Q5m
 3za569tfdUPCUSUJhqVJ8HNJftcNdHNRpiz0jiWElTmLEVRYfuObflRcyvxnNnIP
 C7i4Qhyyw/jlmXphZQbwCKVQuIJz4qb0pxBi5cCCuf/Qyxog5Uw=
 =HDfb
 -----END PGP SIGNATURE-----

Merge tag 'printk-for-5.13-fixup' of git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux

Pull printk fixup from Petr Mladek:
 "Fix misplaced EXPORT_SYMBOL(vsprintf)"

* tag 'printk-for-5.13-fixup' of git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux:
  printk: Move EXPORT_SYMBOL() closer to vprintk definition
2021-06-18 10:50:41 -07:00
Linus Torvalds
944293bcee Power management fix for 5.13-rc7
Remove recently added frequency invariance support from the CPPC
 cpufreq driver, because it has turned out to be problematic and it
 cannot be fixed properly on time for 5.13 (Viresh Kumar).
 -----BEGIN PGP SIGNATURE-----
 
 iQJGBAABCAAwFiEE4fcc61cGeeHD/fCwgsRv/nhiVHEFAmDMzWkSHHJqd0Byand5
 c29ja2kubmV0AAoJEILEb/54YlRxCeYP/2ZvmDC+saX4jz6FCTvJ8J4jc7fGB3QB
 k+3ek7tNvazYU470mBR68VFYGvqFkjFIP1yjJgNbh4u8fex1lPLFuWAmPurGic7o
 xmuk2+uRG/aSOQPmf+tEVjcBSQn+g/SkvfC6ZIZvbODvCSWQ3VwKWORD+1BdajQJ
 7VJCVOXSyqAbeDsijmuJ0j1Nzj6SZbmn7pQODXCcjuwG+jOBbrPGmjBlW24kAVRM
 mF9k7xqbjcikzaW9SQZ1S38uzRDu5nK9ipcpdJtDUZ7N+xoa1qUjNLQY3H9gQDYQ
 xuocdhpYbOi5ip8srYSJc16xGJPCR7oRxsFS4vuUZAUyuDg0IIvuS2bSLTM7NLBv
 hCaibKQQr8fFW87Dnalv7Ld046g0bLAhan33V/8x9lSNOVDSAdFEHOtrXSSCgeEr
 XiOlxqT1dCl5FU427fA2QpSnONOfl+JXuZ+zGzP99q1VVdK6xW1cIbIrCaEhHuZv
 N9AODRSizrNurU1foBKgZRGj6EGjF/1EuHJ8GiTXQ31SWEdzUKY+J9VIwwPrMTJW
 MAMGrgqBpajIry5hzpLaf5uFFfiK/7ijg2Nn3nSMCyd1mwG9hyh2EreomH6lEOxf
 Iiu+OFn4cz+R6MDksLBMFCoxUPHBzNpN0tc9IiEQSU8R6xy3GpQhUu4l2JLnKRu+
 dZ16oPbhZAnq
 =tW38
 -----END PGP SIGNATURE-----

Merge tag 'pm-5.13-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull power management fix from Rafael Wysocki:
 "Remove recently added frequency invariance support from the CPPC
  cpufreq driver, because it has turned out to be problematic and it
  cannot be fixed properly on time for 5.13 (Viresh Kumar)"

* tag 'pm-5.13-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  Revert "cpufreq: CPPC: Add support for frequency invariance"
2021-06-18 10:42:36 -07:00
Steven Rostedt (VMware)
89529d8b8f tracing: Do no increment trace_clock_global() by one
The trace_clock_global() tries to make sure the events between CPUs is
somewhat in order. A global value is used and updated by the latest read
of a clock. If one CPU is ahead by a little, and is read by another CPU, a
lock is taken, and if the timestamp of the other CPU is behind, it will
simply use the other CPUs timestamp.

The lock is also only taken with a "trylock" due to tracing, and strange
recursions can happen. The lock is not taken at all in NMI context.

In the case where the lock is not able to be taken, the non synced
timestamp is returned. But it will not be less than the saved global
timestamp.

The problem arises because when the time goes "backwards" the time
returned is the saved timestamp plus 1. If the lock is not taken, and the
plus one to the timestamp is returned, there's a small race that can cause
the time to go backwards!

	CPU0				CPU1
	----				----
				trace_clock_global() {
				    ts = clock() [ 1000 ]
				    trylock(clock_lock) [ success ]
				    global_ts = ts; [ 1000 ]

				    <interrupted by NMI>
 trace_clock_global() {
    ts = clock() [ 999 ]
    if (ts < global_ts)
	ts = global_ts + 1 [ 1001 ]

    trylock(clock_lock) [ fail ]

    return ts [ 1001]
 }
				    unlock(clock_lock);
				    return ts; [ 1000 ]
				}

 trace_clock_global() {
    ts = clock() [ 1000 ]
    if (ts < global_ts) [ false 1000 == 1000 ]

    trylock(clock_lock) [ success ]
    global_ts = ts; [ 1000 ]
    unlock(clock_lock)

    return ts; [ 1000 ]
 }

The above case shows to reads of trace_clock_global() on the same CPU, but
the second read returns one less than the first read. That is, time when
backwards, and this is not what is allowed by trace_clock_global().

This was triggered by heavy tracing and the ring buffer checker that tests
for the clock going backwards:

 Ring buffer clock went backwards: 20613921464 -> 20613921463
 ------------[ cut here ]------------
 WARNING: CPU: 2 PID: 0 at kernel/trace/ring_buffer.c:3412 check_buffer+0x1b9/0x1c0
 Modules linked in:
 [..]
 [CPU: 2]TIME DOES NOT MATCH expected:20620711698 actual:20620711697 delta:6790234 before:20613921463 after:20613921463
   [20613915818] PAGE TIME STAMP
   [20613915818] delta:0
   [20613915819] delta:1
   [20613916035] delta:216
   [20613916465] delta:430
   [20613916575] delta:110
   [20613916749] delta:174
   [20613917248] delta:499
   [20613917333] delta:85
   [20613917775] delta:442
   [20613917921] delta:146
   [20613918321] delta:400
   [20613918568] delta:247
   [20613918768] delta:200
   [20613919306] delta:538
   [20613919353] delta:47
   [20613919980] delta:627
   [20613920296] delta:316
   [20613920571] delta:275
   [20613920862] delta:291
   [20613921152] delta:290
   [20613921464] delta:312
   [20613921464] delta:0 TIME EXTEND
   [20613921464] delta:0

This happened more than once, and always for an off by one result. It also
started happening after commit aafe104aa9096 was added.

Cc: stable@vger.kernel.org
Fixes: aafe104aa9096 ("tracing: Restructure trace_clock_global() to never block")
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-06-18 09:10:00 -04:00
Steven Rostedt (VMware)
4fdd595e4f tracing: Do not stop recording comms if the trace file is being read
A while ago, when the "trace" file was opened, tracing was stopped, and
code was added to stop recording the comms to saved_cmdlines, for mapping
of the pids to the task name.

Code has been added that only records the comm if a trace event occurred,
and there's no reason to not trace it if the trace file is opened.

Cc: stable@vger.kernel.org
Fixes: 7ffbd48d5cab2 ("tracing: Cache comms only after an event occurred")
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-06-18 09:10:00 -04:00