Pull scheduler updates from Thomas Gleixner:
- Cleanup and improvement of NUMA balancing
- Refactoring and improvements to the PELT (Per Entity Load Tracking)
code
- Watchdog simplification and related cleanups
- The usual pile of small incremental fixes and improvements
* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (41 commits)
watchdog: Reduce message verbosity
stop_machine: Reflow cpu_stop_queue_two_works()
sched/numa: Move task_numa_placement() closer to numa_migrate_preferred()
sched/numa: Use group_weights to identify if migration degrades locality
sched/numa: Update the scan period without holding the numa_group lock
sched/numa: Remove numa_has_capacity()
sched/numa: Modify migrate_swap() to accept additional parameters
sched/numa: Remove unused task_capacity from 'struct numa_stats'
sched/numa: Skip nodes that are at 'hoplimit'
sched/debug: Reverse the order of printing faults
sched/numa: Use task faults only if numa_group is not yet set up
sched/numa: Set preferred_node based on best_cpu
sched/numa: Simplify load_too_imbalanced()
sched/numa: Evaluate move once per node
sched/numa: Remove redundant field
sched/debug: Show the sum wait time of a task group
sched/fair: Remove #ifdefs from scale_rt_capacity()
sched/core: Remove get_cpu() from sched_fork()
sched/cpufreq: Clarify sugov_get_util()
sched/sysctl: Remove unused sched_time_avg_ms sysctl
...
Pull scheduler fix from Thomas Gleixner:
"A single bugfix to prevent a pinned thread which queues stomp machine
work to be preempted by the stopper thread on its CPU which causes a
live lock as it is unable to wake the second CPUs stopper thread"
* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
stop_machine: Atomically queue and wake stopper threads
Pull RCU updates from Thomas Gleixner:
"A large update to RCU:
Preparatory work for consolidating the RCU flavors:
- Introduce grace-period sequence numbers to the RCU-bh, RCU-preempt,
and RCU-sched flavors, replacing the old ->gpnum and ->completed
pair of fields.
This change allows lockless code to obtain the complete
grace-period state with a single READ_ONCE(), which is needed to
maintain tolerable lock contention during the upcoming
consolidation of the three RCU flavors.
Note that grace-period sequence numbers are already used by
rcu_barrier(), expedited RCU grace periods, and SRCU, and are thus
already heavily used and well-tested. Joel Fernandes contributed a
number of excellent fixes and improvements.
- Clean up some grace-period-reporting loose ends, including
improving the handling of quiescent states from offline CPUs and
fixing some false-positive WARN_ON_ONCE() invocations.
(Strictly speaking, the WARN_ON_ONCE() invocations were quite
correct, but their invariants were (harmlessly) violated by the
earlier sloppy handling of quiescent states from offline CPUs.)
In addition, improve grace-period forward-progress guarantees so as
to allow removal of fail-safe checks that required otherwise
needless lock acquisitions. Finally, add more diagnostics to help
debug the upcoming consolidation of the RCU-bh, RCU-preempt, and
RCU-sched flavors.
The rest:
- SRCU updates
- Updates to rcutorture and associated scripting.
- The usual pile of miscellaneous fixes"
* 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (118 commits)
rcutorture: Fix rcu_barrier successes counter
rcutorture: Add support to detect if boost kthread prio is too low
rcutorture: Use monotonic timestamp for stall detection
rcutorture: Make boost test more robust
rcutorture: Disable RT throttling for boost tests
rcutorture: Emphasize testing of single reader protection type
rcutorture: Handle extended read-side critical sections
rcutorture: Make rcu_torture_timer() use rcu_torture_one_read()
rcutorture: Use per-CPU random state for rcu_torture_timer()
rcutorture: Use atomic increment for n_rcu_torture_timers
rcutorture: Extract common code from rcu_torture_reader()
rcuperf: Remove unused torturing_tasks() function
rcu: Remove rcutorture test version and sequence number
rcutorture: Change units of onoff_interval to jiffies
rcu: Assign higher prio to RCU threads if rcutorture is built-in
rculist: Improve documentation for list_for_each_entry_from_rcu()
srcu: Add grace-period number to rcutorture statistics printout
rcu: Print stall-warning NMI dyntick state in hexadecimal
MAINTAINERS: Update RCU, SRCU, and TORTURE-TEST entries
rcu: Make rcu_seq_diff() more exact
...
Pull genirq updates from Thomas Gleixner:
"The irq departement provides:
- A synchronization fix for free_irq() to synchronize just the
removed interrupt thread on shared interrupt lines.
- Consolidate the multi low level interrupt entry handling and mvoe
it to the generic code instead of adding yet another copy for
RISC-V
- Refactoring of the ARM LPI allocator and LPI exposure to the
hypervisor
- Yet another interrupt chip driver for the JZ4725B SoC
- Speed up for /proc/interrupts as people seem to love reading this
file with high frequency
- Miscellaneous fixes and updates"
* 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (23 commits)
irqchip/gic-v3-its: Make its_lock a raw_spin_lock_t
genirq/irqchip: Remove MULTI_IRQ_HANDLER as it's now obselete
openrisc: Use the new GENERIC_IRQ_MULTI_HANDLER
arm64: Use the new GENERIC_IRQ_MULTI_HANDLER
ARM: Convert to GENERIC_IRQ_MULTI_HANDLER
irqchip: Port the ARM IRQ drivers to GENERIC_IRQ_MULTI_HANDLER
irqchip/gic-v3-its: Reduce minimum LPI allocation to 1 for PCI devices
dt-bindings: irqchip: renesas-irqc: Document r8a77980 support
dt-bindings: irqchip: renesas-irqc: Document r8a77470 support
irqchip/ingenic: Add support for the JZ4725B SoC
irqchip/stm32: Add exti0 translation for stm32mp1
genirq: Remove redundant NULL pointer check in __free_irq()
irqchip/gic-v3-its: Honor hypervisor enforced LPI range
irqchip/gic-v3: Expose GICD_TYPER in the rdist structure
irqchip/gic-v3-its: Drop chunk allocation compatibility
irqchip/gic-v3-its: Move minimum LPI requirements to individual busses
irqchip/gic-v3-its: Use full range of LPIs
irqchip/gic-v3-its: Refactor LPI allocator
genirq: Synchronize only with single thread on free_irq()
genirq: Update code comments wrt recycled thread_mask
...
Daniel Borkmann says:
====================
pull-request: bpf-next 2018-08-13
The following pull-request contains BPF updates for your *net-next* tree.
The main changes are:
1) Add driver XDP support for veth. This can be used in conjunction with
redirect of another XDP program e.g. sitting on NIC so the xdp_frame
can be forwarded to the peer veth directly without modification,
from Toshiaki.
2) Add a new BPF map type REUSEPORT_SOCKARRAY and prog type SK_REUSEPORT
in order to provide more control and visibility on where a SO_REUSEPORT
sk should be located, and the latter enables to directly select a sk
from the bpf map. This also enables map-in-map for application migration
use cases, from Martin.
3) Add a new BPF helper bpf_skb_ancestor_cgroup_id() that returns the id
of cgroup v2 that is the ancestor of the cgroup associated with the
skb at the ancestor_level, from Andrey.
4) Implement BPF fs map pretty-print support based on BTF data for regular
hash table and LRU map, from Yonghong.
5) Decouple the ability to attach BTF for a map from the key and value
pretty-printer in BPF fs, and enable further support of BTF for maps for
percpu and LPM trie, from Daniel.
6) Implement a better BPF sample of using XDP's CPU redirect feature for
load balancing SKB processing to remote CPU. The sample implements the
same XDP load balancing as Suricata does which is symmetric hash based
on IP and L4 protocol, from Jesper.
7) Revert adding NULL pointer check with WARN_ON_ONCE() in __xdp_return()'s
critical path as it is ensured that the allocator is present, from Björn.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
parisc is the only Linux architecture which has defined a value for ENOTSUP.
All other architectures #define ENOTSUP as EOPNOTSUPP in their libc headers.
Having an own value for ENOTSUP which is different than EOPNOTSUPP often gives
problems with userspace programs which expect both to be the same. One such
example is a build error in the libuv package, as can be seen in
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=900237.
Since we dropped HP-UX support, there is no real benefit in keeping an own
value for ENOTSUP. This patch drops the parisc value for ENOTSUP from the
kernel sources. glibc needs no patch, it reuses the exported headers.
Signed-off-by: Helge Deller <deller@gmx.de>
Commit a26ca7c982cb ("bpf: btf: Add pretty print support to
the basic arraymap") and 699c86d6ec21 ("bpf: btf: add pretty
print for hash/lru_hash maps") enabled support for BTF and
dumping via BPF fs for array and hash/lru map. However, both
can be decoupled from each other such that regular BPF maps
can be supported for attaching BTF key/value information,
while not all maps necessarily need to dump via map_seq_show_elem()
callback.
The basic sanity check which is a prerequisite for all maps
is that key/value size has to match in any case, and some maps
can have extra checks via map_check_btf() callback, e.g.
probing certain types or indicating no support in general. With
that we can also enable retrieving BTF info for per-cpu map
types and lpm.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Yonghong Song <yhs@fb.com>
This is purely a preparatory patch for upcoming changes during the 4.19
merge window.
We have a function called "boot_cpu_state_init()" that isn't really
about the bootup cpu state: that is done much earlier by the similarly
named "boot_cpu_init()" (note lack of "state" in name).
This function initializes some hotplug CPU state, and needs to run after
the percpu data has been properly initialized. It even has a comment to
that effect.
Except it _doesn't_ actually run after the percpu data has been properly
initialized. On x86 it happens to do that, but on at least arm and
arm64, the percpu base pointers are initialized by the arch-specific
'smp_prepare_boot_cpu()' hook, which ran _after_ boot_cpu_state_init().
This had some unexpected results, and in particular we have a patch
pending for the merge window that did the obvious cleanup of using
'this_cpu_write()' in the cpu hotplug init code:
- per_cpu_ptr(&cpuhp_state, smp_processor_id())->state = CPUHP_ONLINE;
+ this_cpu_write(cpuhp_state.state, CPUHP_ONLINE);
which is obviously the right thing to do. Except because of the
ordering issue, it actually failed miserably and unexpectedly on arm64.
So this just fixes the ordering, and changes the name of the function to
be 'boot_cpu_hotplug_init()' to make it obvious that it's about cpu
hotplug state, because the core CPU state was supposed to have already
been done earlier.
Marked for stable, since the (not yet merged) patch that will show this
problem is marked for stable.
Reported-by: Vlastimil Babka <vbabka@suse.cz>
Reported-by: Mian Yousaf Kaukab <yousaf.kaukab@suse.com>
Suggested-by: Catalin Marinas <catalin.marinas@arm.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will.deacon@arm.com>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Holding uts_sem as a writer while accessing userspace memory allows a
namespace admin to stall all processes that attempt to take uts_sem.
Instead, move data through stack buffers and don't access userspace memory
while uts_sem is held.
Cc: stable@vger.kernel.org
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
The old code would hold the userns_state_mutex indefinitely if
memdup_user_nul stalled due to e.g. a userfault region. Prevent that by
moving the memdup_user_nul in front of the mutex_lock().
Note: This changes the error precedence of invalid buf/count/*ppos vs
map already written / capabilities missing.
Fixes: 22d917d80e84 ("userns: Rework the user_namespace adding uid/gid...")
Cc: stable@vger.kernel.org
Signed-off-by: Jann Horn <jannh@google.com>
Acked-by: Christian Brauner <christian@brauner.io>
Acked-by: Serge Hallyn <serge@hallyn.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
This patch adds a BPF_PROG_TYPE_SK_REUSEPORT which can select
a SO_REUSEPORT sk from a BPF_MAP_TYPE_REUSEPORT_ARRAY. Like other
non SK_FILTER/CGROUP_SKB program, it requires CAP_SYS_ADMIN.
BPF_PROG_TYPE_SK_REUSEPORT introduces "struct sk_reuseport_kern"
to store the bpf context instead of using the skb->cb[48].
At the SO_REUSEPORT sk lookup time, it is in the middle of transiting
from a lower layer (ipv4/ipv6) to a upper layer (udp/tcp). At this
point, it is not always clear where the bpf context can be appended
in the skb->cb[48] to avoid saving-and-restoring cb[]. Even putting
aside the difference between ipv4-vs-ipv6 and udp-vs-tcp. It is not
clear if the lower layer is only ipv4 and ipv6 in the future and
will it not touch the cb[] again before transiting to the upper
layer.
For example, in udp_gro_receive(), it uses the 48 byte NAPI_GRO_CB
instead of IP[6]CB and it may still modify the cb[] after calling
the udp[46]_lib_lookup_skb(). Because of the above reason, if
sk->cb is used for the bpf ctx, saving-and-restoring is needed
and likely the whole 48 bytes cb[] has to be saved and restored.
Instead of saving, setting and restoring the cb[], this patch opts
to create a new "struct sk_reuseport_kern" and setting the needed
values in there.
The new BPF_PROG_TYPE_SK_REUSEPORT and "struct sk_reuseport_(kern|md)"
will serve all ipv4/ipv6 + udp/tcp combinations. There is no protocol
specific usage at this point and it is also inline with the current
sock_reuseport.c implementation (i.e. no protocol specific requirement).
In "struct sk_reuseport_md", this patch exposes data/data_end/len
with semantic similar to other existing usages. Together
with "bpf_skb_load_bytes()" and "bpf_skb_load_bytes_relative()",
the bpf prog can peek anywhere in the skb. The "bind_inany" tells
the bpf prog that the reuseport group is bind-ed to a local
INANY address which cannot be learned from skb.
The new "bind_inany" is added to "struct sock_reuseport" which will be
used when running the new "BPF_PROG_TYPE_SK_REUSEPORT" bpf prog in order
to avoid repeating the "bind INANY" test on
"sk_v6_rcv_saddr/sk->sk_rcv_saddr" every time a bpf prog is run. It can
only be properly initialized when a "sk->sk_reuseport" enabled sk is
adding to a hashtable (i.e. during "reuseport_alloc()" and
"reuseport_add_sock()").
The new "sk_select_reuseport()" is the main helper that the
bpf prog will use to select a SO_REUSEPORT sk. It is the only function
that can use the new BPF_MAP_TYPE_REUSEPORT_ARRAY. As mentioned in
the earlier patch, the validity of a selected sk is checked in
run time in "sk_select_reuseport()". Doing the check in
verification time is difficult and inflexible (consider the map-in-map
use case). The runtime check is to compare the selected sk's reuseport_id
with the reuseport_id that we want. This helper will return -EXXX if the
selected sk cannot serve the incoming request (e.g. reuseport_id
not match). The bpf prog can decide if it wants to do SK_DROP as its
discretion.
When the bpf prog returns SK_PASS, the kernel will check if a
valid sk has been selected (i.e. "reuse_kern->selected_sk != NULL").
If it does , it will use the selected sk. If not, the kernel
will select one from "reuse->socks[]" (as before this patch).
The SK_DROP and SK_PASS handling logic will be in the next patch.
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
This patch introduces a new map type BPF_MAP_TYPE_REUSEPORT_SOCKARRAY.
To unleash the full potential of a bpf prog, it is essential for the
userspace to be capable of directly setting up a bpf map which can then
be consumed by the bpf prog to make decision. In this case, decide which
SO_REUSEPORT sk to serve the incoming request.
By adding BPF_MAP_TYPE_REUSEPORT_SOCKARRAY, the userspace has total control
and visibility on where a SO_REUSEPORT sk should be located in a bpf map.
The later patch will introduce BPF_PROG_TYPE_SK_REUSEPORT such that
the bpf prog can directly select a sk from the bpf map. That will
raise the programmability of the bpf prog attached to a reuseport
group (a group of sk serving the same IP:PORT).
For example, in UDP, the bpf prog can peek into the payload (e.g.
through the "data" pointer introduced in the later patch) to learn
the application level's connection information and then decide which sk
to pick from a bpf map. The userspace can tightly couple the sk's location
in a bpf map with the application logic in generating the UDP payload's
connection information. This connection info contact/API stays within the
userspace.
Also, when used with map-in-map, the userspace can switch the
old-server-process's inner map to a new-server-process's inner map
in one call "bpf_map_update_elem(outer_map, &index, &new_reuseport_array)".
The bpf prog will then direct incoming requests to the new process instead
of the old process. The old process can finish draining the pending
requests (e.g. by "accept()") before closing the old-fds. [Note that
deleting a fd from a bpf map does not necessary mean the fd is closed]
During map_update_elem(),
Only SO_REUSEPORT sk (i.e. which has already been added
to a reuse->socks[]) can be used. That means a SO_REUSEPORT sk that is
"bind()" for UDP or "bind()+listen()" for TCP. These conditions are
ensured in "reuseport_array_update_check()".
A SO_REUSEPORT sk can only be added once to a map (i.e. the
same sk cannot be added twice even to the same map). SO_REUSEPORT
already allows another sk to be created for the same IP:PORT.
There is no need to re-create a similar usage in the BPF side.
When a SO_REUSEPORT is deleted from the "reuse->socks[]" (e.g. "close()"),
it will notify the bpf map to remove it from the map also. It is
done through "bpf_sk_reuseport_detach()" and it will only be called
if >=1 of the "reuse->sock[]" has ever been added to a bpf map.
The map_update()/map_delete() has to be in-sync with the
"reuse->socks[]". Hence, the same "reuseport_lock" used
by "reuse->socks[]" has to be used here also. Care has
been taken to ensure the lock is only acquired when the
adding sk passes some strict tests. and
freeing the map does not require the reuseport_lock.
The reuseport_array will also support lookup from the syscall
side. It will return a sock_gen_cookie(). The sock_gen_cookie()
is on-demand (i.e. a sk's cookie is not generated until the very
first map_lookup_elem()).
The lookup cookie is 64bits but it goes against the logical userspace
expectation on 32bits sizeof(fd) (and as other fd based bpf maps do also).
It may catch user in surprise if we enforce value_size=8 while
userspace still pass a 32bits fd during update. Supporting different
value_size between lookup and update seems unintuitive also.
We also need to consider what if other existing fd based maps want
to return 64bits value from syscall's lookup in the future.
Hence, reuseport_array supports both value_size 4 and 8, and
assuming user will usually use value_size=4. The syscall's lookup
will return ENOSPC on value_size=4. It will will only
return 64bits value from sock_gen_cookie() when user consciously
choose value_size=8 (as a signal that lookup is desired) which then
requires a 64bits value in both lookup and update.
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
While debugging another bug, I was looking at all the synchronize*()
functions being used in kernel/trace, and noticed that trace_uprobes was
using synchronize_sched(), with a comment to synchronize with
{u,ret}_probe_trace_func(). When looking at those functions, the data is
protected with "rcu_read_lock()" and not with "rcu_read_lock_sched()". This
is using the wrong synchronize_*() function.
Link: http://lkml.kernel.org/r/20180809160553.469e1e32@gandalf.local.home
Cc: stable@vger.kernel.org
Fixes: 70ed91c6ec7f8 ("tracing/uprobes: Support ftrace_event_file base multibuffer")
Acked-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Now that some trace events can be protected by srcu_read_lock(tracepoint_srcu),
we need to make sure all locations that depend on this are also protected.
There were many places that did a synchronize_sched() thinking that it was
enough to protect againts access to trace events. This use to be the case,
but now that we use SRCU for _rcuidle() trace events, they may not be
protected by synchronize_sched(), as they may be called in paths that RCU is
not watching for preempt disable.
Fixes: e6753f23d961d ("tracepoint: Make rcuidle tracepoint callers use SRCU")
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Pointer ftrace_swapper_pid is defined but is never used hence it is
redundant and can be removed. The use of this variable was removed
in commit 345ddcc882d8 ("ftrace: Have set_ftrace_pid use the bitmap
like events do").
Cleans up clang warning:
warning: 'ftrace_swapper_pid' defined but not used [-Wunused-const-variable=]
Link: http://lkml.kernel.org/r/20180809125609.13142-1-colin.king@canonical.com
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Joel Fernandes created a nice patch that cleaned up the duplicate hooks used
by lockdep and irqsoff latency tracer. It made both use tracepoints. But the
latency tracer is triggering warnings when using tracepoints to call into
the latency tracer's routines. Mainly, they can be called from NMI context.
If that happens, then the SRCU may not work properly because on some
architectures, SRCU is not safe to be called in both NMI and non-NMI
context.
This is a partial revert of the clean up patch c3bc8fd637a9 ("tracing:
Centralize preemptirq tracepoints and unify their usage") that adds back the
direct calls into the latency tracer. It also only calls the trace events
when not in NMI.
Link: http://lkml.kernel.org/r/20180809210654.622445925@goodmis.org
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Fixes: c3bc8fd637a9 ("tracing: Centralize preemptirq tracepoints and unify their usage")
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
I was hitting the following warning:
WARNING: CPU: 0 PID: 1 at kernel/trace/trace_irqsoff.c:631 tracer_hardirqs_off+0x15/0x2a
Modules linked in:
CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.18.0-rc6-test+ #13
Hardware name: MSI MS-7823/CSM-H87M-G43 (MS-7823), BIOS V1.6 02/22/2014
EIP: tracer_hardirqs_off+0x15/0x2a
Code: ff 85 c0 74 0e 8b 45 00 8b 50 04 8b 45 04 e8 35 ff ff ff 5d c3 55 64 a1 cc 37 51 c1 a9 ff ff ff 7f 89 e5 53 89 d3 89 ca 75 02 <0f> 0b e8 90 fc ff ff 85 c0 74 07 89 d8 e8 0c ff ff ff 5b 5d c3 55
EAX: 80000000 EBX: c04337f0 ECX: c04338e3 EDX: c04338e3
ESI: c04337f0 EDI: c04338e3 EBP: f2aa1d68 ESP: f2aa1d64
DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 EFLAGS: 00210046
CR0: 80050033 CR2: 00000000 CR3: 01668000 CR4: 001406f0
Call Trace:
trace_irq_disable_rcuidle+0x63/0x6c
trace_hardirqs_off+0x26/0x30
default_send_IPI_mask_allbutself_logical+0x31/0x93
default_send_IPI_allbutself+0x37/0x48
native_send_call_func_ipi+0x4d/0x6a
smp_call_function_many+0x165/0x19d
? add_nops+0x34/0x34
? trace_hardirqs_on_caller+0x2d/0x2d
? add_nops+0x34/0x34
smp_call_function+0x1f/0x23
on_each_cpu+0x15/0x43
? trace_hardirqs_on_caller+0x2d/0x2d
? trace_hardirqs_on_caller+0x2d/0x2d
? trace_irq_disable_rcuidle+0x1/0x6c
text_poke_bp+0xa0/0xc2
? trace_hardirqs_on_caller+0x2d/0x2d
arch_jump_label_transform+0xa7/0xcb
? trace_irq_disable_rcuidle+0x5/0x6c
__jump_label_update+0x3e/0x6d
jump_label_update+0x7d/0x81
static_key_slow_inc_cpuslocked+0x58/0x6d
static_key_slow_inc+0x19/0x20
tracepoint_probe_register_prio+0x19e/0x1d1
? start_critical_timings+0x1c/0x1c
tracepoint_probe_register+0xf/0x11
irqsoff_tracer_init+0x21/0xf2
tracer_init+0x16/0x1a
trace_selftest_startup_irqsoff+0x25/0xc4
run_tracer_selftest+0xca/0x131
register_tracer+0xd5/0x172
? trace_event_define_fields_preemptirq_template+0x45/0x45
init_irqsoff_tracer+0xd/0x11
do_one_initcall+0xab/0x1e8
? rcu_read_lock_sched_held+0x3d/0x44
? trace_initcall_level+0x52/0x86
kernel_init_freeable+0x195/0x21a
? rest_init+0xb4/0xb4
kernel_init+0xd/0xe4
ret_from_fork+0x2e/0x38
It is due to running a CONFIG_PREEMPT_VOLUNTARY kernel, which would trigger
this warning every time:
WARN_ON_ONCE(preempt_count());
Because on CONFIG_PREEMPT_VOLUNTARY, preempt_count() is always zero.
This warning is to make sure preempt_count is set because
tracer_hardirqs_on() does a preempt_enable_notrace() to make the
preempt_trace() work properly, as being called by a trace event, the trace
event code disables preemption, and the tracer wants to know what the
preemption was before it was called.
Instead of enabling preemption like this, just record the preempt_count,
subtract PREEMPT_DISABLE_OFFSET from it (which is zero with !CONFIG_PREEMPT
set), and pass that value to the necessary functions, which should use the
passed in parameter instead of calling preempt_count() directly.
Fixes: da5b3ebb45277 ("tracing: irqsoff: Account for additional preempt_disable")
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Joel Fernandes created a nice patch that cleaned up the duplicate hooks used
by lockdep and irqsoff latency tracer. It made both use tracepoints. But it
caused lockdep to trigger several false positives. We have not figured out
why yet, but removing lockdep from using the trace event hooks and just call
its helper functions directly (like it use to), makes the problem go away.
This is a partial revert of the clean up patch c3bc8fd637a9 ("tracing:
Centralize preemptirq tracepoints and unify their usage") that adds direct
calls for lockdep, but also keeps most of the clean up done to get rid of
the horrible preprocessor if statements.
Link: http://lkml.kernel.org/r/20180806155058.5ee875f4@gandalf.local.home
Cc: Peter Zijlstra <peterz@infradead.org>
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Fixes: c3bc8fd637a9 ("tracing: Centralize preemptirq tracepoints and unify their usage")
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Commit a26ca7c982cb ("bpf: btf: Add pretty print support to
the basic arraymap") added pretty print support to array map.
This patch adds pretty print for hash and lru_hash maps.
The following example shows the pretty-print result of
a pinned hashmap:
struct map_value {
int count_a;
int count_b;
};
cat /sys/fs/bpf/pinned_hash_map:
87907: {87907,87908}
57354: {37354,57355}
76625: {76625,76626}
...
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
In function map_seq_next() of kernel/bpf/inode.c,
the first key will be the "0" regardless of the map type.
This works for array. But for hash type, if it happens
key "0" is in the map, the bpffs map show will miss
some items if the key "0" is not the first element of
the first bucket.
This patch fixed the issue by guaranteeing to get
the first element, if the seq_show is just started,
by passing NULL pointer key to map_get_next_key() callback.
This way, no missing elements will occur for
bpffs hash table show even if key "0" is in the map.
Fixes: a26ca7c982cb5 ("bpf: btf: Add pretty print support to the basic arraymap")
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Like cpumap teardown, the devmap teardown code also flush remaining
xdp_frames, via bq_xmit_all() in case map entry is removed. The code
can call xdp_return_frame_rx_napi, from the the wrong context, in-case
ndo_xdp_xmit() fails.
Fixes: 389ab7f01af9 ("xdp: introduce xdp_return_frame_rx_napi")
Fixes: 735fc4054b3a ("xdp: change ndo_xdp_xmit API to support bulking")
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
When removing a cpumap entry, a number of syncronization steps happen.
Eventually the teardown code __cpu_map_entry_free is invoked from/via
call_rcu.
The teardown code __cpu_map_entry_free() flushes remaining xdp_frames,
by invoking bq_flush_to_queue, which calls xdp_return_frame_rx_napi().
The issues is that the teardown code is not running in the RX NAPI
code path. Thus, it is not allowed to invoke the NAPI variant of
xdp_return_frame.
This bug was found and triggered by using the --stress-mode option to
the samples/bpf program xdp_redirect_cpu. It is hard to trigger,
because the ptr_ring have to be full and cpumap bulk queue max
contains 8 packets, and a remote CPU is racing to empty the ptr_ring
queue.
Fixes: 389ab7f01af9 ("xdp: introduce xdp_return_frame_rx_napi")
Tested-by: Jean-Tsung Hsiao <jhsiao@redhat.com>
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Wen Yang <wen.yang99@zte.com.cn> and majiang <ma.jiang@zte.com.cn>
report that a periodic signal received during fork can cause fork to
continually restart preventing an application from making progress.
The code was being overly pessimistic. Fork needs to guarantee that a
signal sent to multiple processes is logically delivered before the
fork and just to the forking process or logically delivered after the
fork to both the forking process and it's newly spawned child. For
signals like periodic timers that are always delivered to a single
process fork can safely complete and let them appear to logically
delivered after the fork().
While examining this issue I also discovered that fork today will miss
signals delivered to multiple processes during the fork and handled by
another thread. Similarly the current code will also miss blocked
signals that are delivered to multiple process, as those signals will
not appear pending during fork.
Add a list of each thread that is currently forking, and keep on that
list a signal set that records all of the signals sent to multiple
processes. When fork completes initialize the new processes
shared_pending signal set with it. The calculate_sigpending function
will see those signals and set TIF_SIGPENDING causing the new task to
take the slow path to userspace to handle those signals. Making it
appear as if those signals were received immediately after the fork.
It is not possible to send real time signals to multiple processes and
exceptions don't go to multiple processes, which means that that are
no signals sent to multiple processes that require siginfo. This
means it is safe to not bother collecting siginfo on signals sent
during fork.
The sigaction of a child of fork is initially the same as the
sigaction of the parent process. So a signal the parent ignores the
child will also initially ignore. Therefore it is safe to ignore
signals sent to multiple processes and ignored by the forking process.
Signals sent to only a single process or only a single thread and delivered
during fork are treated as if they are received after the fork, and generally
not dealt with. They won't cause any problems.
V2: Added removal from the multiprocess list on failure.
V3: Use -ERESTARTNOINTR directly
V4: - Don't queue both SIGCONT and SIGSTOP
- Initialize signal_struct.multiprocess in init_task
- Move setting of shared_pending to before the new task
is visible to signals. This prevents signals from comming
in before shared_pending.signal is set to delayed.signal
and being lost.
V5: - rework list add and delete to account for idle threads
v6: - Use sigdelsetmask when removing stop signals
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=200447
Reported-by: Wen Yang <wen.yang99@zte.com.cn> and
Reported-by: majiang <ma.jiang@zte.com.cn>
Fixes: 4a2c7a7837da ("[PATCH] make fork() atomic wrt pgrp/session signals")
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
In bpf_tcp_sendmsg() the sk_alloc_sg() may fail. In the case of
ENOMEM, it may also mean that we've partially filled the scatterlist
entries with pages. Later jumping to sk_stream_wait_memory()
we could further fail with an error for several reasons, however
we miss to call free_start_sg() if the local sk_msg_buff was used.
Fixes: 4f738adba30a ("bpf: create tcp_bpf_ulp allowing BPF to monitor socket TX/RX data")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
While working on bpf_tcp_sendmsg() code, I noticed that when a
sk->sk_err is set we error out with err = sk->sk_err. However
this is problematic since sk->sk_err is a positive error value
and therefore we will neither go into sk_stream_error() nor will
we report an error back to user space. I had this case with EPIPE
and user space was thinking sendmsg() succeeded since EPIPE is
a positive value, thinking we submitted 32 bytes. Fix it by
negating the sk->sk_err value.
Fixes: 4f738adba30a ("bpf: create tcp_bpf_ulp allowing BPF to monitor socket TX/RX data")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Daniel Borkmann says:
====================
pull-request: bpf-next 2018-08-07
The following pull-request contains BPF updates for your *net-next* tree.
The main changes are:
1) Add cgroup local storage for BPF programs, which provides a fast
accessible memory for storing various per-cgroup data like number
of transmitted packets, etc, from Roman.
2) Support bpf_get_socket_cookie() BPF helper in several more program
types that have a full socket available, from Andrey.
3) Significantly improve the performance of perf events which are
reported from BPF offload. Also convert a couple of BPF AF_XDP
samples overto use libbpf, both from Jakub.
4) seg6local LWT provides the End.DT6 action, which allows to
decapsulate an outer IPv6 header containing a Segment Routing Header.
Adds this action now to the seg6local BPF interface, from Mathieu.
5) Do not mark dst register as unbounded in MOV64 instruction when
both src and dst register are the same, from Arthur.
6) Define u_smp_rmb() and u_smp_wmb() to their respective barrier
instructions on arm64 for the AF_XDP sample code, from Brian.
7) Convert the tcp_client.py and tcp_server.py BPF selftest scripts
over from Python 2 to Python 3, from Jeremy.
8) Enable BTF build flags to the BPF sample code Makefile, from Taeung.
9) Remove an unnecessary rcu_read_lock() in run_lwt_bpf(), from Taehee.
10) Several improvements to the README.rst from the BPF documentation
to make it more consistent with RST format, from Tobin.
11) Replace all occurrences of strerror() by calls to strerror_r()
in libbpf and fix a FORTIFY_SOURCE build error along with it,
from Thomas.
12) Fix a bug in bpftool's get_btf() function to correctly propagate
an error via PTR_ERR(), from Yue.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
__cgroup_bpf_attach() and __cgroup_bpf_detach() functions have
a good amount of duplicated code, which is possible to eliminate
by introducing the update_effective_progs() helper function.
The update_effective_progs() calls compute_effective_progs()
and then in case of success it calls activate_effective_progs()
for each descendant cgroup. In case of failure (OOM), it releases
allocated prog arrays and return the error code.
Signed-off-by: Roman Gushchin <guro@fb.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Josh reported that the late SMT evaluation in cpu_smt_state_init() sets
cpu_smt_control to CPU_SMT_NOT_SUPPORTED in case that 'nosmt' was supplied
on the kernel command line as it cannot differentiate between SMT disabled
by BIOS and SMT soft disable via 'nosmt'. That wreckages the state and
makes the sysfs interface unusable.
Rework this so that during bringup of the non boot CPUs the availability of
SMT is determined in cpu_smt_allowed(). If a newly booted CPU is not a
'primary' thread then set the local cpu_smt_available marker and evaluate
this explicitely right after the initial SMP bringup has finished.
SMT evaulation on x86 is a trainwreck as the firmware has all the
information _before_ booting the kernel, but there is no interface to query
it.
Fixes: 73d5e2b47264 ("cpu/hotplug: detect SMT disabled by BIOS")
Reported-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Recently we tried to make the preemptirqsoff tracer to use irqsoff
tracepoint probes. However this causes issues as reported by Masami:
[2.271078] Testing tracer preemptirqsoff: .. no entries found ..FAILED!
[2.381015] WARNING: CPU: 0 PID: 1 at /home/mhiramat/ksrc/linux/kernel/
trace/trace.c:1512 run_tracer_selftest+0xf3/0x154
This is due to the tracepoint code increasing the preempt nesting count
by calling an additional preempt_disable before calling into the
preemptoff tracer which messes up the preempt_count() check in
tracer_hardirqs_off.
To fix this, make the irqsoff tracer probes balance the additional outer
preempt_disable with a preempt_enable_notrace.
The other way to fix this is to just use SRCU for all tracepoints.
However we can't do that because we can't use NMIs from RCU context.
Link: http://lkml.kernel.org/r/20180806034049.67949-1-joel@joelfernandes.org
Fixes: c3bc8fd637a9 ("tracing: Centralize preemptirq tracepoints and unify their usage")
Fixes: e6753f23d961 ("tracepoint: Make rcuidle tracepoint callers use SRCU")
Reported-by: Masami Hiramatsu <mhiramat@kernel.org>
Tested-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
- GICv3 ITS LPI allocation revamp
- GICv3 support for hypervisor-enforced LPI range
- GICv3 ITS conversion to raw spinlock
-----BEGIN PGP SIGNATURE-----
iQJJBAABCAAzFiEEn9UcU+C1Yxj9lZw9I9DQutE9ekMFAltoBXMVHG1hcmMuenlu
Z2llckBhcm0uY29tAAoJECPQ0LrRPXpDyUYP/1feAq3F7ZmhCIZka4c6y/m4EBpq
BjWEEgOAGMEyyB4s98flsRtZcEUxxp6CqEXo2FgCsd1Nj+og7oA7vwOlqy3aGzsi
9f/Z5Wi6SlG06lH5tmYNkyVbGk2tE3s2FzkH5Rg8qZGk+X3OCOdNs/+G20pYAkSp
ESePWSapbQUJSExJ1MqzfdHFidtVA1V+ev8BKdIp2ykl1NRae8LJeKHIbqac49Ym
JclfCLFpQM1M1ElB9j0E8hAvZhz10oOz7TtBR737O/1QEifVyFqGBckPzldvwIJM
zZ+nR+Yzj1ruD109xwaF1iKy9AinZWhiqrtN7UXJ3jwHtNih+sy0R6FQ38GMNoOC
0K02n/qStR5xglGr4BmAcWlOuFtBYWfz6HpSVMqaTWWmOxHEiqS6pXtEA+dV/YyI
wHLbo0YzpWTQm6t1+b/PoByAJ0/hOcD1nOD57b+NGjX7tZV0sGjpGsecvFhTSywh
BN3COBi9k/FOBrOTGDX1qUAI+mEf76vc2BAC+BkkoiiMg3WlY0E9qfQJguUxHdrb
0LS3lDZoHCNoz8RZLrUyenTT0NYGcjPGUTinMDJWG79VGXOWFexTDdCuX0kF90CK
1Zie3O6lrTYolmaiyLUxwukKp1SVUyoA5IpKVwfDJQYUhEfk27yvlzg2MBMcHDRA
uy3QSkmjx9vw/sAu
=gKw8
-----END PGP SIGNATURE-----
Merge tag 'irqchip-4.19' of git://git.kernel.org/pub/scm/linux/kernel/git/maz/arm-platforms into irq/core
Pull irqchip updates from Marc Zyngier:
- GICv3 ITS LPI allocation revamp
- GICv3 support for hypervisor-enforced LPI range
- GICv3 ITS conversion to raw spinlock
At present, "systemctl suspend" and "shutdown" can run in parrallel. A
system can suspend after devices_shutdown(), and resume. Then the shutdown
task goes on to power off. This causes many devices are not really shut
off. Hence replacing reboot_mutex with system_transition_mutex (renamed
from pm_mutex) to achieve the exclusion. The renaming of pm_mutex as
system_transition_mutex can be better to reflect the purpose of the mutex.
Signed-off-by: Pingfan Liu <kernelfans@gmail.com>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
In preparation to enabling -Wimplicit-fallthrough, mark switch cases
where we are expecting to fall through.
This addresses Coverity-ID: 114713 ("Missing break in switch").
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
-----BEGIN PGP SIGNATURE-----
iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAltU8z0eHHRvcnZhbGRz
QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiG5X8H/2fJr7m3k242+t76
sitwvx1eoPqTgryW59dRKm9IuXAGA+AjauvHzaz1QxomeQa50JghGWefD0eiJfkA
1AphQ/24EOiAbbVk084dAI/C2p122dE4D5Fy7CrfLnuouyrbFaZI5STbnrRct7sR
9deeYW0GDHO1Uenp4WDCj0baaqJqaevZ+7GG09DnWpya2nQtSkGBjqn6GpYmrfOU
mqFuxAX8mEOW6cwK16y/vYtnVjuuMAiZ63/OJ8AQ6d6ArGLwAsdn7f8Fn4I4tEr2
L0d3CRLUyegms4++Dmlu05k64buQu46WlPhjCZc5/Ts4kjrNxBuHejj2/jeSnUSt
vJJlibI=
=42a5
-----END PGP SIGNATURE-----
Merge tag 'v4.18-rc6' into for-4.19/block2
Pull in 4.18-rc6 to get the NVMe core AEN change to avoid a
merge conflict down the line.
Signed-of-by: Jens Axboe <axboe@kernel.dk>
When cpu_stop_queue_work() releases the lock for the stopper
thread that was queued into its wake queue, preemption is
enabled, which leads to the following deadlock:
CPU0 CPU1
sched_setaffinity(0, ...)
__set_cpus_allowed_ptr()
stop_one_cpu(0, ...) stop_two_cpus(0, 1, ...)
cpu_stop_queue_work(0, ...) cpu_stop_queue_two_works(0, ..., 1, ...)
-grabs lock for migration/0-
-spins with preemption disabled,
waiting for migration/0's lock to be
released-
-adds work items for migration/0
and queues migration/0 to its
wake_q-
-releases lock for migration/0
and preemption is enabled-
-current thread is preempted,
and __set_cpus_allowed_ptr
has changed the thread's
cpu allowed mask to CPU1 only-
-acquires migration/0 and migration/1's
locks-
-adds work for migration/0 but does not
add migration/0 to wake_q, since it is
already in a wake_q-
-adds work for migration/1 and adds
migration/1 to its wake_q-
-releases migration/0 and migration/1's
locks, wakes migration/1, and enables
preemption-
-since migration/1 is requested to run,
migration/1 begins to run and waits on
migration/0, but migration/0 will never
be able to run, since the thread that
can wake it is affine to CPU1-
Disable preemption in cpu_stop_queue_work() before queueing works for
stopper threads, and queueing the stopper thread in the wake queue, to
ensure that the operation of queueing the works and waking the stopper
threads is atomic.
Fixes: 0b26351b910f ("stop_machine, sched: Fix migrate_swap() vs. active_balance() deadlock")
Signed-off-by: Prasad Sodagudi <psodagud@codeaurora.org>
Signed-off-by: Isaac J. Manjarres <isaacm@codeaurora.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: peterz@infradead.org
Cc: matt@codeblueprint.co.uk
Cc: bigeasy@linutronix.de
Cc: gregkh@linuxfoundation.org
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/1533329766-4856-1-git-send-email-isaacm@codeaurora.org
Co-Developed-by: Isaac J. Manjarres <isaacm@codeaurora.org>
Pull timer fixes from Thomas Gleixner:
"Two oneliners addressing NOHZ failures:
- Use a bitmask to check for the pending timer softirq and not the
bit number. The existing code using the bit number checked for
the wrong bit, which caused timers to either expire late or stop
completely.
- Make the nohz evaluation on interrupt exit more robust. The
existing code did not re-arm the hardware when interrupting a
running softirq in task context (ksoftirqd or tail of
local_bh_enable()), which caused timers to either expire late
or stop completely"
* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
nohz: Fix missing tick reprogram when interrupting an inline softirq
nohz: Fix local_timer_softirq_pending()
There are only two signals that are delivered to every member of a
signal group: SIGSTOP and SIGKILL. Signal delivery requires every
signal appear to be delivered either before or after a clone syscall.
SIGKILL terminates the clone so does not need to be considered. Which
leaves only SIGSTOP that needs to be considered when creating new
threads.
Today in the event of a group stop TIF_SIGPENDING will get set and the
fork will restart ensuring the fork syscall participates in the group
stop.
A fork (especially of a process with a lot of memory) is one of the
most expensive system so we really only want to restart a fork when
necessary.
It is easy so check to see if a SIGSTOP is ongoing and have the new
thread join it immediate after the clone completes. Making it appear
the clone completed happened just before the SIGSTOP.
The calculate_sigpending function will see the bits set in jobctl and
set TIF_SIGPENDING to ensure the new task takes the slow path to userspace.
V2: The call to task_join_group_stop was moved before the new task is
added to the thread group list. This should not matter as
sighand->siglock is held over both the addition of the threads,
the call to task_join_group_stop and do_signal_stop. But the change
is trivial and it is one less thing to worry about when reading
the code.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Add a function calculate_sigpending to test to see if any signals are
pending for a new task immediately following fork. Signals have to
happen either before or after fork. Today our practice is to push
all of the signals to before the fork, but that has the downside that
frequent or periodic signals can make fork take much much longer than
normal or prevent fork from completing entirely.
So we need move signals that we can after the fork to prevent that.
This updates the code to set TIF_SIGPENDING on a new task if there
are signals or other activities that have moved so that they appear
to happen after the fork.
As the code today restarts if it sees any such activity this won't
immediately have an effect, as there will be no reason for it
to set TIF_SIGPENDING immediately after the fork.
Adding calculate_sigpending means the code in fork can safely be
changed to not always restart if a signal is pending.
The new calculate_sigpending function sets sigpending if there
are pending bits in jobctl, pending signals, the freezer needs
to freeze the new task or the live kernel patching framework
need the new thread to take the slow path to userspace.
I have verified that setting TIF_SIGPENDING does make a new process
take the slow path to userspace before it executes it's first userspace
instruction.
I have looked at the callers of signal_wake_up and the code paths
setting TIF_SIGPENDING and I don't see anything else that needs to be
handled. The code probably doesn't need to set TIF_SIGPENDING for the
kernel live patching as it uses a separate thread flag as well. But
at this point it seems safer reuse the recalc_sigpending logic and get
the kernel live patching folks to sort out their story later.
V2: I have moved the test into schedule_tail where siglock can
be grabbed and recalc_sigpending can be reused directly.
Further as the last action of setting up a new task this
guarantees that TIF_SIGPENDING will be properly set in the
new process.
The helper calculate_sigpending takes the siglock and
uncontitionally sets TIF_SIGPENDING and let's recalc_sigpending
clear TIF_SIGPENDING if it is unnecessary. This allows reusing
the existing code and keeps maintenance of the conditions simple.
Oleg Nesterov <oleg@redhat.com> suggested the movement
and pointed out the need to take siglock if this code
was going to be called while the new task is discoverable.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
The full nohz tick is reprogrammed in irq_exit() only if the exit is not in
a nesting interrupt. This stands as an optimization: whether a hardirq or a
softirq is interrupted, the tick is going to be reprogrammed when necessary
at the end of the inner interrupt, with even potential new updates on the
timer queue.
When soft interrupts are interrupted, it's assumed that they are executing
on the tail of an interrupt return. In that case tick_nohz_irq_exit() is
called after softirq processing to take care of the tick reprogramming.
But the assumption is wrong: softirqs can be processed inline as well, ie:
outside of an interrupt, like in a call to local_bh_enable() or from
ksoftirqd.
Inline softirqs don't reprogram the tick once they are done, as opposed to
interrupt tail softirq processing. So if a tick interrupts an inline
softirq processing, the next timer will neither be reprogrammed from the
interrupting tick's irq_exit() nor after the interrupted softirq
processing. This situation may leave the tick unprogrammed while timers are
armed.
To fix this, simply keep reprogramming the tick even if a softirq has been
interrupted. That can be optimized further, but for now correctness is more
important.
Note that new timers enqueued in nohz_full mode after a softirq gets
interrupted will still be handled just fine through self-IPIs triggered by
the timer code.
Reported-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
Cc: stable@vger.kernel.org # 4.14+
Link: https://lkml.kernel.org/r/1533303094-15855-1-git-send-email-frederic@kernel.org
The support of force threading interrupts which are set up with both a
primary and a threaded handler wreckaged the setup of regular requested
threaded interrupts (primary handler == NULL).
The reason is that it does not check whether the primary handler is set to
the default handler which wakes the handler thread. Instead it replaces the
thread handler with the primary handler as it would do with force threaded
interrupts which have been requested via request_irq(). So both the primary
and the thread handler become the same which then triggers the warnon that
the thread handler tries to wakeup a not configured secondary thread.
Fortunately this only happens when the driver omits the IRQF_ONESHOT flag
when requesting the threaded interrupt, which is normaly caught by the
sanity checks when force irq threading is disabled.
Fix it by skipping the force threading setup when a regular threaded
interrupt is requested. As a consequence the interrupt request which lacks
the IRQ_ONESHOT flag is rejected correctly instead of silently wreckaging
it.
Fixes: 2a1d3ab8986d ("genirq: Handle force threading of irqs with primary and thread handler")
Reported-by: Kurt Kanzenbach <kurt.kanzenbach@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Kurt Kanzenbach <kurt.kanzenbach@linutronix.de>
Cc: stable@vger.kernel.org
Code is emitting the following error message during boot on systems
without PMU hardware support while probing NMI capability.
NMI watchdog: Perf event create on CPU 0 failed with -2
This error is emitted as the perf subsystem returns -ENOENT due to lack of
PMUs in the system.
It is followed by the warning that NMI watchdog is disabled:
NMI watchdog: Perf NMI watchdog permanently disabled
While NMI disabled information is useful for ordinary users, seeing a PERF
event create failed with error code -2 is not.
Reduce the message severity to debug so that if debugging is still possible
in case the error code returned by perf is required for analysis.
Signed-off-by: Sinan Kaya <okaya@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Don Zickus <dzickus@redhat.com>
Cc: Kate Stewart <kstewart@linuxfoundation.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Colin Ian King <colin.king@canonical.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=599368
Link: https://lkml.kernel.org/r/20180803060943.2643-1-okaya@kernel.org
The bpf_get_local_storage() helper function is used
to get a pointer to the bpf local storage from a bpf program.
It takes a pointer to a storage map and flags as arguments.
Right now it accepts only cgroup storage maps, and flags
argument has to be 0. Further it can be extended to support
other types of local storage: e.g. thread local storage etc.
Signed-off-by: Roman Gushchin <guro@fb.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
As there is one-to-one relation between a bpf program
and cgroup local storage map, there is no sense in
creating a map of cgroup local storage maps.
Forbid it explicitly to avoid possible side effects.
Signed-off-by: Roman Gushchin <guro@fb.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
BPF_MAP_TYPE_CGROUP_STORAGE maps are special in a way
that the access from the bpf program side is lookup-free.
That means the result is guaranteed to be a valid
pointer to the cgroup storage; no NULL-check is required.
This patch introduces BPF_PTR_TO_MAP_VALUE return type,
which is required to cause the verifier accept programs,
which are not checking the map value pointer for being NULL.
Signed-off-by: Roman Gushchin <guro@fb.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>