The second input parameter of 'wait_rm_addr/sf $1 1' is misused. If it's
1, wait_rm_addr/sf will never break, and will loop ten times, then
'wait_rm_addr/sf' equals to 'sleep 1'. This delay time is too long,
which can sometimes make the tests fail.
A better way to use wait_rm_addr/sf is to use rm_addr/sf_count to obtain
the current value, and then pass into wait_rm_addr/sf.
Fixes: 4369c198e599 ("selftests: mptcp: test userspace pm out of transfer")
Cc: stable@vger.kernel.org
Suggested-by: Matthieu Baerts <matttbe@kernel.org>
Reviewed-by: Matthieu Baerts <matttbe@kernel.org>
Signed-off-by: Geliang Tang <geliang.tang@suse.com>
Signed-off-by: Mat Martineau <martineau@kernel.org>
Link: https://lore.kernel.org/r/20231025-send-net-next-20231025-v1-2-db8f25f798eb@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Some userspace pm tests failed are reported by CI:
112 userspace pm add & remove address
syn [ ok ]
synack [ ok ]
ack [ ok ]
add [ ok ]
echo [ ok ]
mptcp_info subflows=1:1 [ ok ]
subflows_total 2:2 [ ok ]
mptcp_info add_addr_signal=1:1 [ ok ]
rm [ ok ]
rmsf [ ok ]
Info: invert
mptcp_info subflows=0:0 [ ok ]
subflows_total 1:1 [fail]
got subflows 0:0 expected 1:1
Server ns stats
TcpPassiveOpens 2 0.0
TcpInSegs 118 0.0
This patch fixes them by changing 'speed' to 5 to run the tests much more
slowly.
Fixes: 4369c198e599 ("selftests: mptcp: test userspace pm out of transfer")
Cc: stable@vger.kernel.org
Reviewed-by: Matthieu Baerts <matttbe@kernel.org>
Signed-off-by: Geliang Tang <geliang.tang@suse.com>
Signed-off-by: Mat Martineau <martineau@kernel.org>
Link: https://lore.kernel.org/r/20231025-send-net-next-20231025-v1-1-db8f25f798eb@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Test the new MDB get functionality by converting dump and grep to MDB
get.
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Test the new MDB get functionality by converting dump and grep to MDB
get.
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQTFp0I1jqZrAX+hPRXbK58LschIgwUCZTp12QAKCRDbK58LschI
g8BrAQDifqp5liEEdXV8jdReBwJtqInjrL5tzy5LcyHUMQbTaAEA6Ph3Ct3B+3oA
mFnIW/y6UJiJrby0Xz4+vV5BXI/5WQg=
=pLCV
-----END PGP SIGNATURE-----
Merge tag 'for-netdev' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Daniel Borkmann says:
====================
pull-request: bpf-next 2023-10-26
We've added 51 non-merge commits during the last 10 day(s) which contain
a total of 75 files changed, 5037 insertions(+), 200 deletions(-).
The main changes are:
1) Add open-coded task, css_task and css iterator support.
One of the use cases is customizable OOM victim selection via BPF,
from Chuyi Zhou.
2) Fix BPF verifier's iterator convergence logic to use exact states
comparison for convergence checks, from Eduard Zingerman,
Andrii Nakryiko and Alexei Starovoitov.
3) Add BPF programmable net device where bpf_mprog defines the logic
of its xmit routine. It can operate in L3 and L2 mode,
from Daniel Borkmann and Nikolay Aleksandrov.
4) Batch of fixes for BPF per-CPU kptr and re-enable unit_size checking
for global per-CPU allocator, from Hou Tao.
5) Fix libbpf which eagerly assumed that SHT_GNU_verdef ELF section
was going to be present whenever a binary has SHT_GNU_versym section,
from Andrii Nakryiko.
6) Fix BPF ringbuf correctness to fold smp_mb__before_atomic() into
atomic_set_release(), from Paul E. McKenney.
7) Add a warning if NAPI callback missed xdp_do_flush() under
CONFIG_DEBUG_NET which helps checking if drivers were missing
the former, from Sebastian Andrzej Siewior.
8) Fix missed RCU read-lock in bpf_task_under_cgroup() which was throwing
a warning under sleepable programs, from Yafang Shao.
9) Avoid unnecessary -EBUSY from htab_lock_bucket by disabling IRQ before
checking map_locked, from Song Liu.
10) Make BPF CI linked_list failure test more robust,
from Kumar Kartikeya Dwivedi.
11) Enable samples/bpf to be built as PIE in Fedora, from Viktor Malik.
12) Fix xsk starving when multiple xsk sockets were associated with
a single xsk_buff_pool, from Albert Huang.
13) Clarify the signed modulo implementation for the BPF ISA standardization
document that it uses truncated division, from Dave Thaler.
14) Improve BPF verifier's JEQ/JNE branch taken logic to also consider
signed bounds knowledge, from Andrii Nakryiko.
15) Add an option to XDP selftests to use multi-buffer AF_XDP
xdp_hw_metadata and mark used XDP programs as capable to use frags,
from Larysa Zaremba.
16) Fix bpftool's BTF dumper wrt printing a pointer value and another
one to fix struct_ops dump in an array, from Manu Bretelle.
* tag 'for-netdev' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (51 commits)
netkit: Remove explicit active/peer ptr initialization
selftests/bpf: Fix selftests broken by mitigations=off
samples/bpf: Allow building with custom bpftool
samples/bpf: Fix passing LDFLAGS to libbpf
samples/bpf: Allow building with custom CFLAGS/LDFLAGS
bpf: Add more WARN_ON_ONCE checks for mismatched alloc and free
selftests/bpf: Add selftests for netkit
selftests/bpf: Add netlink helper library
bpftool: Extend net dump with netkit progs
bpftool: Implement link show support for netkit
libbpf: Add link-based API for netkit
tools: Sync if_link uapi header
netkit, bpf: Add bpf programmable net device
bpf: Improve JEQ/JNE branch taken logic
bpf: Fold smp_mb__before_atomic() into atomic_set_release()
bpf: Fix unnecessary -EBUSY from htab_lock_bucket
xsk: Avoid starving the xsk further down the list
bpf: print full verifier states on infinite loop detection
selftests/bpf: test if state loops are detected in a tricky case
bpf: correct loop detection for iterators convergence
...
====================
Link: https://lore.kernel.org/r/20231026150509.2824-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Page pool code is compiled conditionally, but the operations
are part of the shared netlink family. We can handle this
by reporting empty list of pools or -EOPNOTSUPP / -ENOSYS
but the cleanest way seems to be removing the ops completely
at compilation time. That way user can see that the page
pool ops are not present using genetlink introspection.
Same way they'd check if the kernel is "new enough" to
support the ops.
Extend the specs with the ability to specify the config
condition under which op (and its policies, etc.) should
be hidden.
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Link: https://lore.kernel.org/r/20231025162253.133159-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
struct nla_policy is usually constant itself, but unless
we make the ranges inside constant we won't be able to
make range structs const. The ranges are not modified
by the core.
Reviewed-by: Johannes Berg <johannes@sipsolutions.net>
Reviewed-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Link: https://lore.kernel.org/r/20231025162204.132528-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Add 82 test suites to check edge cases related to bind() and connect()
actions. They are defined with 6 fixtures and their variants:
The "protocol" fixture is extended with 12 variants defined as a matrix
of: sandboxed/not-sandboxed, IPv4/IPv6/unix network domain, and
stream/datagram socket. 4 related tests suites are defined:
* bind: Tests bind action.
* connect: Tests connect action.
* bind_unspec: Tests bind action with the AF_UNSPEC socket family.
* connect_unspec: Tests connect action with the AF_UNSPEC socket family.
The "ipv4" fixture is extended with 4 variants defined as a matrix
of: sandboxed/not-sandboxed, and stream/datagram socket. 1 related test
suite is defined:
* from_unix_to_inet: Tests to make sure unix sockets' actions are not
restricted by Landlock rules applied to TCP ones.
The "tcp_layers" fixture is extended with 8 variants defined as a matrix
of: IPv4/IPv6 network domain, and different number of landlock rule
layers. 2 related tests suites are defined:
* ruleset_overlap: Tests nested layers with less constraints.
* ruleset_expand: Tests nested layers with more constraints.
In the "mini" fixture 4 tests suites are defined:
* network_access_rights: Tests handling of known access rights.
* unknown_access_rights: Tests handling of unknown access rights.
* inval: Tests unhandled allowed access and zero access value.
* tcp_port_overflow: Tests with port values greater than 65535.
The "ipv4_tcp" fixture supports IPv4 network domain with stream socket.
2 tests suites are defined:
* port_endianness: Tests with big/little endian port formats.
* with_fs: Tests a ruleset with both filesystem and network
restrictions.
The "port_specific" fixture is extended with 4 variants defined
as a matrix of: sandboxed/not-sandboxed, IPv4/IPv6 network domain,
and stream socket. 2 related tests suites are defined:
* bind_connect_zero: Tests with port 0.
* bind_connect_1023: Tests with port 1023.
Test coverage for security/landlock is 92.4% of 710 lines according to
gcc/gcov-13.
Signed-off-by: Konstantin Meskhidze <konstantin.meskhidze@huawei.com>
Link: https://lore.kernel.org/r/20231026014751.414649-11-konstantin.meskhidze@huawei.com
[mic: Extend commit message, update test coverage, clean up capability
use, fix useless TEST_F_FORK, and improve ipv4_tcp.with_fs]
Co-developed-by: Mickaël Salaün <mic@digikod.net>
Signed-off-by: Mickaël Salaün <mic@digikod.net>
Add network rules support in the ruleset management helpers and the
landlock_create_ruleset() syscall. Extend user space API to support
network actions:
* Add new network access rights: LANDLOCK_ACCESS_NET_BIND_TCP and
LANDLOCK_ACCESS_NET_CONNECT_TCP.
* Add a new network rule type: LANDLOCK_RULE_NET_PORT tied to struct
landlock_net_port_attr. The allowed_access field contains the network
access rights, and the port field contains the port value according to
the controlled protocol. This field can take up to a 64-bit value
but the maximum value depends on the related protocol (e.g. 16-bit
value for TCP). Network port is in host endianness [1].
* Add a new handled_access_net field to struct landlock_ruleset_attr
that contains network access rights.
* Increment the Landlock ABI version to 4.
Implement socket_bind() and socket_connect() LSM hooks, which enable
to control TCP socket binding and connection to specific ports.
Expand access_masks_t from u16 to u32 to be able to store network access
rights alongside filesystem access rights for rulesets' handled access
rights.
Access rights are not tied to socket file descriptors but checked at
bind() or connect() call time against the caller's Landlock domain. For
the filesystem, a file descriptor is a direct access to a file/data.
However, for network sockets, we cannot identify for which data or peer
a newly created socket will give access to. Indeed, we need to wait for
a connect or bind request to identify the use case for this socket.
Likewise a directory file descriptor may enable to open another file
(i.e. a new data item), but this opening is also restricted by the
caller's domain, not the file descriptor's access rights [2].
[1] https://lore.kernel.org/r/278ab07f-7583-a4e0-3d37-1bacd091531d@digikod.net
[2] https://lore.kernel.org/r/263c1eb3-602f-57fe-8450-3f138581bee7@digikod.net
Signed-off-by: Konstantin Meskhidze <konstantin.meskhidze@huawei.com>
Link: https://lore.kernel.org/r/20231026014751.414649-9-konstantin.meskhidze@huawei.com
[mic: Extend commit message, fix typo in comments, and specify
endianness in the documentation]
Co-developed-by: Mickaël Salaün <mic@digikod.net>
Signed-off-by: Mickaël Salaün <mic@digikod.net>
There are two reasons to do this, firstly there is a shellcheck warning
in cs_etm_dev_name(), which can be completely deleted. And secondly the
current iteration method doesn't support systems with both ETE and ETM
because it picks one or the other. There isn't a known system with this
configuration, but it could happen in the future.
Iterating over all the sources for each CPU can be done by going through
/sys/bus/event_source/devices/cs_etm/cpu* and following the symlink back
to the Coresight device in /sys/bus/coresight/devices. This will work
whether the device is ETE, ETM or any future name, and is much simpler
and doesn't require any hard coded version numbers
Suggested-by: Suzuki K Poulose <suzuki.poulose@arm.com>
Signed-off-by: James Clark <james.clark@arm.com>
Acked-by: Ian Rogers <irogers@google.com>
Tested-by: Leo Yan <leo.yan@linaro.org>
Cc: tianruidong@linux.alibaba.com
Cc: Kajol Jain <kjain@linux.ibm.com>
Cc: Anushree Mathur <anushree.mathur@linux.vnet.ibm.com>
Cc: Tiezhu Yang <yangtiezhu@loongson.cn>
Cc: atrajeev@linux.vnet.ibm.com
Cc: coresight@lists.linaro.org
Link: https://lore.kernel.org/r/20231023131550.487760-1-james.clark@arm.com
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Add tma_info_system_socket_clks and uncore_freq metrics.
The associated converter script fix is in:
https://github.com/intel/perfmon/pull/112
Signed-off-by: Ian Rogers <irogers@google.com>
Reviewed-by: Kan Liang <kan.liang@linux.intel.com>
Cc: Caleb Biggers <caleb.biggers@intel.com>
Cc: Perry Taylor <perry.taylor@intel.com>
Link: https://lore.kernel.org/r/20230926205948.1399594-3-irogers@google.com
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Add tma_info_system_socket_clks and uncore_freq metrics that require a
broadwellx style uncore event for UNC_CLOCK.
The associated converter script fix is in:
https://github.com/intel/perfmon/pull/112
Fixes: 7d124303d620 ("perf vendor events intel: Update broadwell variant events/metrics")
Signed-off-by: Ian Rogers <irogers@google.com>
Reviewed-by: Kan Liang <kan.liang@linux.intel.com>
Cc: Caleb Biggers <caleb.biggers@intel.com>
Cc: Perry Taylor <perry.taylor@intel.com>
Link: https://lore.kernel.org/r/20230926205948.1399594-2-irogers@google.com
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
The IOMMU_HWPT_ALLOC ioctl now supports passing user_data to allocate a
user-managed domain for nested HWPTs. Add its coverage for that. Also,
update _test_cmd_hwpt_alloc() and add test_cmd/err_hwpt_alloc_nested().
Link: https://lore.kernel.org/r/20231026043938.63898-11-yi.l.liu@intel.com
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
When we configure the kernel command line with 'mitigations=off' and set
the sysctl knob 'kernel.unprivileged_bpf_disabled' to 0, the commit
bc5bc309db45 ("bpf: Inherit system settings for CPU security mitigations")
causes issues in the execution of `test_progs -t verifier`. This is
because 'mitigations=off' bypasses Spectre v1 and Spectre v4 protections.
Currently, when a program requests to run in unprivileged mode
(kernel.unprivileged_bpf_disabled = 0), the BPF verifier may prevent
it from running due to the following conditions not being enabled:
- bypass_spec_v1
- bypass_spec_v4
- allow_ptr_leaks
- allow_uninit_stack
While 'mitigations=off' enables the first two conditions, it does not
enable the latter two. As a result, some test cases in
'test_progs -t verifier' that were expected to fail to run may run
successfully, while others still fail but with different error messages.
This makes it challenging to address them comprehensively.
Moreover, in the future, we may introduce more fine-grained control over
CPU mitigations, such as enabling only bypass_spec_v1 or bypass_spec_v4.
Given the complexity of the situation, rather than fixing each broken test
case individually, it's preferable to skip them when 'mitigations=off' is
in effect and introduce specific test cases for the new 'mitigations=off'
scenario. For instance, we can introduce new BTF declaration tags like
'__failure__nospec', '__failure_nospecv1' and '__failure_nospecv4'.
In this patch, the approach is to simply skip the broken test cases when
'mitigations=off' is enabled. The result of `test_progs -t verifier` as
follows after this commit,
Before this commit
==================
- without 'mitigations=off'
- kernel.unprivileged_bpf_disabled = 2
Summary: 74/948 PASSED, 388 SKIPPED, 0 FAILED
- kernel.unprivileged_bpf_disabled = 0
Summary: 74/1336 PASSED, 0 SKIPPED, 0 FAILED <<<<
- with 'mitigations=off'
- kernel.unprivileged_bpf_disabled = 2
Summary: 74/948 PASSED, 388 SKIPPED, 0 FAILED
- kernel.unprivileged_bpf_disabled = 0
Summary: 63/1276 PASSED, 0 SKIPPED, 11 FAILED <<<< 11 FAILED
After this commit
=================
- without 'mitigations=off'
- kernel.unprivileged_bpf_disabled = 2
Summary: 74/948 PASSED, 388 SKIPPED, 0 FAILED
- kernel.unprivileged_bpf_disabled = 0
Summary: 74/1336 PASSED, 0 SKIPPED, 0 FAILED <<<<
- with this patch, with 'mitigations=off'
- kernel.unprivileged_bpf_disabled = 2
Summary: 74/948 PASSED, 388 SKIPPED, 0 FAILED
- kernel.unprivileged_bpf_disabled = 0
Summary: 74/948 PASSED, 388 SKIPPED, 0 FAILED <<<< SKIPPED
Fixes: bc5bc309db45 ("bpf: Inherit system settings for CPU security mitigations")
Reported-by: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Closes: https://lore.kernel.org/bpf/CAADnVQKUBJqg+hHtbLeeC2jhoJAWqnmRAzXW3hmUCNSV9kx4sQ@mail.gmail.com
Link: https://lore.kernel.org/bpf/20231025031144.5508-1-laoar.shao@gmail.com
Merge updates related to system sleep handling, one power capping update
and one PM utility update for 6.7-rc1:
- Use __get_safe_page() rather than touching the list in hibernation
snapshot code (Brian Geffon).
- Fix symbol export for _SIMPLE_ variants of _PM_OPS() (Raag Jadav).
- Clean up sync_read handling in snapshot_write_next() (Brian Geffon).
- Fix kerneldoc comments for swsusp_check() and swsusp_close() to
better match code (Christoph Hellwig).
- Downgrade BIOS locked limits pr_warn() in the Intel RAPL power
capping driver to pr_debug() (Ville Syrjälä).
- Change the minimum python version for the intel_pstate_tracer utility
from 2.7 to 3.6 (Doug Smythies).
* pm-sleep:
PM: hibernate: fix the kerneldoc comment for swsusp_check() and swsusp_close()
PM: hibernate: Clean up sync_read handling in snapshot_write_next()
PM: sleep: Fix symbol export for _SIMPLE_ variants of _PM_OPS()
PM: hibernate: Use __get_safe_page() rather than touching the list
* powercap:
powercap: intel_rapl: Downgrade BIOS locked limits pr_warn() to pr_debug()
* pm-tools:
tools/power/x86/intel_pstate_tracer: python minimum version
Broadwell-de has a consumer core and server uncore. The uncore_arb PMU
isn't present and the broadwellx style cbox PMU should be used
instead. Fix the tma_info_system_dram_bw_use metric to use the server
metric rather than client.
The associated converter script fix is in:
https://github.com/intel/perfmon/pull/111
Fixes: 7d124303d620 ("perf vendor events intel: Update broadwell variant events/metrics")
Signed-off-by: Ian Rogers <irogers@google.com>
Reviewed-by: Kan Liang <kan.liang@linux.intel.com>
Cc: Caleb Biggers <caleb.biggers@intel.com>
Cc: Perry Taylor <perry.taylor@intel.com>
Link: https://lore.kernel.org/r/20230926031034.1201145-1-irogers@google.com
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Fix leak where mem_info__put wouldn't release the maps/map as used by
perf mem. Add exit functions and use elsewhere that the maps and map
are released.
Signed-off-by: Ian Rogers <irogers@google.com>
Cc: K Prateek Nayak <kprateek.nayak@amd.com>
Cc: Ravi Bangoria <ravi.bangoria@amd.com>
Cc: Sandipan Das <sandipan.das@amd.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: German Gomez <german.gomez@arm.com>
Cc: James Clark <james.clark@arm.com>
Cc: Nick Terrell <terrelln@fb.com>
Cc: Sean Christopherson <seanjc@google.com>
Cc: Changbin Du <changbin.du@huawei.com>
Cc: liuwenyu <liuwenyu7@huawei.com>
Cc: Yang Jihong <yangjihong1@huawei.com>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Miguel Ojeda <ojeda@kernel.org>
Cc: Song Liu <song@kernel.org>
Cc: Leo Yan <leo.yan@linaro.org>
Cc: Kajol Jain <kjain@linux.ibm.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Kan Liang <kan.liang@linux.intel.com>
Cc: Athira Rajeev <atrajeev@linux.vnet.ibm.com>
Cc: Yanteng Si <siyanteng@loongson.cn>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Link: https://lore.kernel.org/r/20231024222353.3024098-12-irogers@google.com
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Display code doesn't modify the branch_type_stat so switch uses to
const. This is done to aid refactoring struct callchain_list where
current the branch_type_stat is embedded even if not used.
Signed-off-by: Ian Rogers <irogers@google.com>
Cc: K Prateek Nayak <kprateek.nayak@amd.com>
Cc: Ravi Bangoria <ravi.bangoria@amd.com>
Cc: Sandipan Das <sandipan.das@amd.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: German Gomez <german.gomez@arm.com>
Cc: James Clark <james.clark@arm.com>
Cc: Nick Terrell <terrelln@fb.com>
Cc: Sean Christopherson <seanjc@google.com>
Cc: Changbin Du <changbin.du@huawei.com>
Cc: liuwenyu <liuwenyu7@huawei.com>
Cc: Yang Jihong <yangjihong1@huawei.com>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Miguel Ojeda <ojeda@kernel.org>
Cc: Song Liu <song@kernel.org>
Cc: Leo Yan <leo.yan@linaro.org>
Cc: Kajol Jain <kjain@linux.ibm.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Kan Liang <kan.liang@linux.intel.com>
Cc: Athira Rajeev <atrajeev@linux.vnet.ibm.com>
Cc: Yanteng Si <siyanteng@loongson.cn>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Link: https://lore.kernel.org/r/20231024222353.3024098-9-irogers@google.com
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Caught by address/leak sanitizer.
Signed-off-by: Ian Rogers <irogers@google.com>
Cc: K Prateek Nayak <kprateek.nayak@amd.com>
Cc: Ravi Bangoria <ravi.bangoria@amd.com>
Cc: Sandipan Das <sandipan.das@amd.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: German Gomez <german.gomez@arm.com>
Cc: James Clark <james.clark@arm.com>
Cc: Nick Terrell <terrelln@fb.com>
Cc: Sean Christopherson <seanjc@google.com>
Cc: Changbin Du <changbin.du@huawei.com>
Cc: liuwenyu <liuwenyu7@huawei.com>
Cc: Yang Jihong <yangjihong1@huawei.com>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Miguel Ojeda <ojeda@kernel.org>
Cc: Song Liu <song@kernel.org>
Cc: Leo Yan <leo.yan@linaro.org>
Cc: Kajol Jain <kjain@linux.ibm.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Kan Liang <kan.liang@linux.intel.com>
Cc: Athira Rajeev <atrajeev@linux.vnet.ibm.com>
Cc: Yanteng Si <siyanteng@loongson.cn>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Link: https://lore.kernel.org/r/20231024222353.3024098-8-irogers@google.com
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Commit 40826c45eb0b ("perf thread: Remove notion of dead threads")
removed dead threads but the list head wasn't removed. Remove it here.
Signed-off-by: Ian Rogers <irogers@google.com>
Cc: K Prateek Nayak <kprateek.nayak@amd.com>
Cc: Ravi Bangoria <ravi.bangoria@amd.com>
Cc: Sandipan Das <sandipan.das@amd.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: German Gomez <german.gomez@arm.com>
Cc: James Clark <james.clark@arm.com>
Cc: Nick Terrell <terrelln@fb.com>
Cc: Sean Christopherson <seanjc@google.com>
Cc: Changbin Du <changbin.du@huawei.com>
Cc: liuwenyu <liuwenyu7@huawei.com>
Cc: Yang Jihong <yangjihong1@huawei.com>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Miguel Ojeda <ojeda@kernel.org>
Cc: Song Liu <song@kernel.org>
Cc: Leo Yan <leo.yan@linaro.org>
Cc: Kajol Jain <kjain@linux.ibm.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Kan Liang <kan.liang@linux.intel.com>
Cc: Athira Rajeev <atrajeev@linux.vnet.ibm.com>
Cc: Yanteng Si <siyanteng@loongson.cn>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Link: https://lore.kernel.org/r/20231024222353.3024098-7-irogers@google.com
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Caught using reference count checking on perf top with
"--call-graph=lbr". After this no memory leaks were detected.
Fixes: 57849998e2cd ("perf report: Add processing for cycle histograms")
Signed-off-by: Ian Rogers <irogers@google.com>
Cc: K Prateek Nayak <kprateek.nayak@amd.com>
Cc: Ravi Bangoria <ravi.bangoria@amd.com>
Cc: Sandipan Das <sandipan.das@amd.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: German Gomez <german.gomez@arm.com>
Cc: James Clark <james.clark@arm.com>
Cc: Nick Terrell <terrelln@fb.com>
Cc: Sean Christopherson <seanjc@google.com>
Cc: Changbin Du <changbin.du@huawei.com>
Cc: liuwenyu <liuwenyu7@huawei.com>
Cc: Yang Jihong <yangjihong1@huawei.com>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Miguel Ojeda <ojeda@kernel.org>
Cc: Song Liu <song@kernel.org>
Cc: Leo Yan <leo.yan@linaro.org>
Cc: Kajol Jain <kjain@linux.ibm.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Kan Liang <kan.liang@linux.intel.com>
Cc: Athira Rajeev <atrajeev@linux.vnet.ibm.com>
Cc: Yanteng Si <siyanteng@loongson.cn>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Link: https://lore.kernel.org/r/20231024222353.3024098-6-irogers@google.com
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Comparing pointers with reference count checking is tricky to avoid a
SEGV. Add a convenience macro to simplify and use.
Signed-off-by: Ian Rogers <irogers@google.com>
Cc: K Prateek Nayak <kprateek.nayak@amd.com>
Cc: Ravi Bangoria <ravi.bangoria@amd.com>
Cc: Sandipan Das <sandipan.das@amd.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: German Gomez <german.gomez@arm.com>
Cc: James Clark <james.clark@arm.com>
Cc: Nick Terrell <terrelln@fb.com>
Cc: Sean Christopherson <seanjc@google.com>
Cc: Changbin Du <changbin.du@huawei.com>
Cc: liuwenyu <liuwenyu7@huawei.com>
Cc: Yang Jihong <yangjihong1@huawei.com>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Miguel Ojeda <ojeda@kernel.org>
Cc: Song Liu <song@kernel.org>
Cc: Leo Yan <leo.yan@linaro.org>
Cc: Kajol Jain <kjain@linux.ibm.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Kan Liang <kan.liang@linux.intel.com>
Cc: Athira Rajeev <atrajeev@linux.vnet.ibm.com>
Cc: Yanteng Si <siyanteng@loongson.cn>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Link: https://lore.kernel.org/r/20231024222353.3024098-5-irogers@google.com
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Make the implicit REFCOUNT_CHECKING robust to when building with GCC.
Fixes: 9be6ab181b7b ("libperf rc_check: Enable implicitly with sanitizers")
Signed-off-by: Ian Rogers <irogers@google.com>
Cc: K Prateek Nayak <kprateek.nayak@amd.com>
Cc: Ravi Bangoria <ravi.bangoria@amd.com>
Cc: Sandipan Das <sandipan.das@amd.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: German Gomez <german.gomez@arm.com>
Cc: James Clark <james.clark@arm.com>
Cc: Nick Terrell <terrelln@fb.com>
Cc: Sean Christopherson <seanjc@google.com>
Cc: Changbin Du <changbin.du@huawei.com>
Cc: liuwenyu <liuwenyu7@huawei.com>
Cc: Yang Jihong <yangjihong1@huawei.com>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Miguel Ojeda <ojeda@kernel.org>
Cc: Song Liu <song@kernel.org>
Cc: Leo Yan <leo.yan@linaro.org>
Cc: Kajol Jain <kjain@linux.ibm.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Kan Liang <kan.liang@linux.intel.com>
Cc: Athira Rajeev <atrajeev@linux.vnet.ibm.com>
Cc: Yanteng Si <siyanteng@loongson.cn>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Link: https://lore.kernel.org/r/20231024222353.3024098-4-irogers@google.com
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Running perf top with address sanitizer and "--call-graph=lbr" fails
due to reading sample 0 when no samples exist. Add a guard to prevent
this.
Fixes: e2b23483eb1d ("perf machine: Factor out lbr_callchain_add_lbr_ip()")
Signed-off-by: Ian Rogers <irogers@google.com>
Cc: K Prateek Nayak <kprateek.nayak@amd.com>
Cc: Ravi Bangoria <ravi.bangoria@amd.com>
Cc: Sandipan Das <sandipan.das@amd.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: German Gomez <german.gomez@arm.com>
Cc: James Clark <james.clark@arm.com>
Cc: Nick Terrell <terrelln@fb.com>
Cc: Sean Christopherson <seanjc@google.com>
Cc: Changbin Du <changbin.du@huawei.com>
Cc: liuwenyu <liuwenyu7@huawei.com>
Cc: Yang Jihong <yangjihong1@huawei.com>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Miguel Ojeda <ojeda@kernel.org>
Cc: Song Liu <song@kernel.org>
Cc: Leo Yan <leo.yan@linaro.org>
Cc: Kajol Jain <kjain@linux.ibm.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Kan Liang <kan.liang@linux.intel.com>
Cc: Athira Rajeev <atrajeev@linux.vnet.ibm.com>
Cc: Yanteng Si <siyanteng@loongson.cn>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Link: https://lore.kernel.org/r/20231024222353.3024098-3-irogers@google.com
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Mutex error check will capture trying to take the lock recursively and
other problems that rwlock won't. At the expense of concurrency, adda
debug mode that uses a mutex in place of a rwsem.
Signed-off-by: Ian Rogers <irogers@google.com>
Cc: K Prateek Nayak <kprateek.nayak@amd.com>
Cc: Ravi Bangoria <ravi.bangoria@amd.com>
Cc: Sandipan Das <sandipan.das@amd.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: German Gomez <german.gomez@arm.com>
Cc: James Clark <james.clark@arm.com>
Cc: Nick Terrell <terrelln@fb.com>
Cc: Sean Christopherson <seanjc@google.com>
Cc: Changbin Du <changbin.du@huawei.com>
Cc: liuwenyu <liuwenyu7@huawei.com>
Cc: Yang Jihong <yangjihong1@huawei.com>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Miguel Ojeda <ojeda@kernel.org>
Cc: Song Liu <song@kernel.org>
Cc: Leo Yan <leo.yan@linaro.org>
Cc: Kajol Jain <kjain@linux.ibm.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Kan Liang <kan.liang@linux.intel.com>
Cc: Athira Rajeev <atrajeev@linux.vnet.ibm.com>
Cc: Yanteng Si <siyanteng@loongson.cn>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Link: https://lore.kernel.org/r/20231024222353.3024098-2-irogers@google.com
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
To address this grep 3.8 warning:
grep: warning: stray \ before #
We needed to remove the '' around the grep expression and keep the \
before # so that it is escaped by the $(shell grep ...) and thus doesn't
get to grep.
We need that \ before the #, otherwise we get this:
Makefile.perf:364: *** unterminated call to function 'shell': missing ')'. Stop.
As everything after the # will be considered a comment.
Removing the single quotes needs some more escaping so that _some_ of
the escaped chars gets to grep, like the '\|' that becomes '\\\|´.
Running on debian:10, where there is no libtraceevent-devel available,
we get:
Makefile.perf:367: *** PYTHON_EXT_SRCS= util/python.c ../lib/ctype.c util/cap.c util/evlist.c util/evsel.c util/evsel_fprintf.c util/perf_event_attr_fprintf.c util/cpumap.c util/memswap.c util/mmap.c util/namespaces.c ../lib/bitmap.c ../lib/find_bit.c ../lib/list_sort.c ../lib/hweight.c ../lib/string.c ../lib/vsprintf.c util/thread_map.c util/util.c util/cgroup.c util/parse-branch-options.c util/rblist.c util/counts.c util/print_binary.c util/strlist.c ../lib/rbtree.c util/string.c util/symbol_fprintf.c util/units.c util/affinity.c util/rwsem.c util/hashmap.c util/perf_regs.c util/fncache.c util/perf-regs-arch/perf_regs_aarch64.c util/perf-regs-arch/perf_regs_arm.c util/perf-regs-arch/perf_regs_csky.c util/perf-regs-arch/perf_regs_loongarch.c util/perf-regs-arch/perf_regs_mips.c util/perf-regs-arch/perf_regs_powerpc.c util/perf-regs-arch/perf_regs_riscv.c util/perf-regs-arch/perf_regs_s390.c util/perf-regs-arch/perf_regs_x86.c. Stop.
make[1]: *** [Makefile.perf:242: sub-make] Error 2
I.e. both the comments and the util/trace-event.c were removed.
When using:
msg := $(error PYTHON_EXT_SRCS=$(PYTHON_EXT_SRCS))
While on the more recent fedora:38, with the new grep and make packages
and libtraceevent-devel installed:
Makefile.perf:367: *** PYTHON_EXT_SRCS= util/python.c ../lib/ctype.c util/cap.c util/evlist.c util/evsel.c util/evsel_fprintf.c util/perf_event_attr_fprintf.c util/cpumap.c util/memswap.c util/mmap.c util/namespaces.c ../lib/bitmap.c ../lib/find_bit.c ../lib/list_sort.c ../lib/hweight.c ../lib/string.c ../lib/vsprintf.c util/thread_map.c util/util.c util/cgroup.c util/parse-branch-options.c util/rblist.c util/counts.c util/print_binary.c util/strlist.c util/trace-event.c ../lib/rbtree.c util/string.c util/symbol_fprintf.c util/units.c util/affinity.c util/rwsem.c util/hashmap.c util/perf_regs.c util/fncache.c util/perf-regs-arch/perf_regs_aarch64.c util/perf-regs-arch/perf_regs_arm.c util/perf-regs-arch/perf_regs_csky.c util/perf-regs-arch/perf_regs_loongarch.c util/perf-regs-arch/perf_regs_mips.c util/perf-regs-arch/perf_regs_powerpc.c util/perf-regs-arch/perf_regs_riscv.c util/perf-regs-arch/perf_regs_s390.c util/perf-regs-arch/perf_regs_x86.c. Stop.
make[1]: *** [Makefile.perf:242: sub-make] Error 2
make: *** [Makefile:113: install-bin] Error 2
make: Leaving directory '/home/acme/git/perf-tools-next/tools/perf'
$
I.e. only the comments were removed.
If we build it on the same fedora:38 system, but using NO_LIBTRACEEVENT=1
$ make NO_LIBTRACEEVENT=1 CORESIGHT=1 O=/tmp/build/$(basename $PWD) -C tools/perf install-bin
Makefile.perf:367: *** PYTHON_EXT_SRCS= util/python.c ../lib/ctype.c util/cap.c util/evlist.c util/evsel.c util/evsel_fprintf.c util/perf_event_attr_fprintf.c util/cpumap.c util/memswap.c util/mmap.c util/namespaces.c ../lib/bitmap.c ../lib/find_bit.c ../lib/list_sort.c ../lib/hweight.c ../lib/string.c ../lib/vsprintf.c util/thread_map.c util/util.c util/cgroup.c util/parse-branch-options.c util/rblist.c util/counts.c util/print_binary.c util/strlist.c ../lib/rbtree.c util/string.c util/symbol_fprintf.c util/units.c util/affinity.c util/rwsem.c util/hashmap.c util/perf_regs.c util/fncache.c util/perf-regs-arch/perf_regs_aarch64.c util/perf-regs-arch/perf_regs_arm.c util/perf-regs-arch/perf_regs_csky.c util/perf-regs-arch/perf_regs_loongarch.c util/perf-regs-arch/perf_regs_mips.c util/perf-regs-arch/perf_regs_powerpc.c util/perf-regs-arch/perf_regs_riscv.c util/perf-regs-arch/perf_regs_s390.c util/perf-regs-arch/perf_regs_x86.c. Stop.
make[1]: *** [Makefile.perf:242: sub-make] Error 2
make: *** [Makefile:113: install-bin] Error 2
make: Leaving directory '/home/acme/git/perf-tools-next/tools/perf'
$
Both comments and the util/trace-event.c file removed.
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Acked-by: Ian Rogers <irogers@google.com>
Link: https://lore.kernel.org/r/ZTj6mfM9UqY2DggC@kernel.org
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
The hierarchy mode needs to setup output formats for each evsel.
Normally setup_sorting() handles this at the beginning, but it cannot
do that if data comes from a pipe since there's no evsel info before
reading the data. And then perf report cannot process the samples
in hierarchy mode and think as if there's no sample.
Let's check the condition and setup the output formats after reading
data so that it can find evsels.
Before:
$ ./perf record -o- true | ./perf report -i- --hierarchy -q
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.000 MB - ]
Error:
The - data has no samples!
After:
$ ./perf record -o- true | ./perf report -i- --hierarchy -q
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.000 MB - ]
94.76% true
94.76% [kernel.kallsyms]
94.76% [k] filemap_fault
5.24% perf-ex
5.24% [kernel.kallsyms]
5.06% [k] __memset
0.18% [k] native_write_msr
Acked-by: Ian Rogers <irogers@google.com>
Link: https://lore.kernel.org/r/20231025003121.2811738-1-namhyung@kernel.org
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Currently lock contention timestamp is maintained in a hash map keyed by
pid. That means it needs to get and release a map element (which is
proctected by spinlock!) on each contention begin and end pair. This
can impact on performance if there are a lot of contention (usually from
spinlocks).
It used to go with task local storage but it had an issue on memory
allocation in some critical paths. Although it's addressed in recent
kernels IIUC, the tool should support old kernels too. So it cannot
simply switch to the task local storage at least for now.
As spinlocks create lots of contention and they disabled preemption
during the spinning, it can use per-cpu array to keep the timestamp to
avoid overhead in hashmap update and delete.
In contention_begin, it's easy to check the lock types since it can see
the flags. But contention_end cannot see it. So let's try to per-cpu
array first (unconditionally) if it has an active element (lock != 0).
Then it should be used and per-task tstamp map should not be used until
the per-cpu array element is cleared which means nested spinlock
contention (if any) was finished and it nows see (the outer) lock.
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Acked-by: Ian Rogers <irogers@google.com>
Cc: Hao Luo <haoluo@google.com>
Cc: Song Liu <song@kernel.org>
Cc: bpf@vger.kernel.org
Link: https://lore.kernel.org/r/20231020204741.1869520-3-namhyung@kernel.org
When pelem is NULL, it'd create a new entry with zero data. But it
might be preempted by IRQ/NMI just before calling bpf_map_update_elem()
then there's a chance to call it twice for the same pid. So it'd be
better to use BPF_NOEXIST flag and check the return value to prevent
the race.
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Acked-by: Ian Rogers <irogers@google.com>
Cc: Hao Luo <haoluo@google.com>
Cc: Song Liu <song@kernel.org>
Cc: bpf@vger.kernel.org
Link: https://lore.kernel.org/r/20231020204741.1869520-2-namhyung@kernel.org
It checks the current lock to calculated the delta of contention time.
The address is saved in the tstamp map which is allocated at begining of
contention and released at end of contention.
But it's possible for bpf_map_delete_elem() to fail. In that case, the
element in the tstamp map kept for the current lock and it makes the
next contention for the same lock tracked incorrectly. Specificially
the next contention begin will see the existing element for the task and
it'd just return. Then the next contention end will see the element and
calculate the time using the timestamp for the previous begin.
This can result in a large value for two small contentions happened from
time to time. Let's clear the lock address so that it can be updated
next time even if the bpf_map_delete_elem() failed.
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Acked-by: Ian Rogers <irogers@google.com>
Cc: Hao Luo <haoluo@google.com>
Cc: Song Liu <song@kernel.org>
Cc: bpf@vger.kernel.org
Link: https://lore.kernel.org/r/20231020204741.1869520-1-namhyung@kernel.org
evsel__increase_rlimit() helper does nothing with evsel, and description
of the functionality is inaccurate, rename it and move to util/rlimit.c.
By the way, fix a checkppatch warning about misplaced license tag:
WARNING: Misplaced SPDX-License-Identifier tag - use line 1 instead
#160: FILE: tools/perf/util/rlimit.h:3:
/* SPDX-License-Identifier: LGPL-2.1 */
No functional change.
Signed-off-by: Yang Jihong <yangjihong1@huawei.com>
Link: https://lore.kernel.org/r/20231023033144.1011896-1-yangjihong1@huawei.com
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
The -G/--cgroups option is to put sender and receiver in different
cgroups in order to measure cgroup context switch overheads.
Users need to make sure the cgroups exist and accessible. The following
example should the effect of this change. Please don't forget taskset
before the perf bench to measure cgroup switches properly. Otherwise
each task would run on a different CPU and generate cgroup switches
regardless of this change.
# perf stat -e context-switches,cgroup-switches \
> taskset -c 0 perf bench sched pipe -l 10000 > /dev/null
Performance counter stats for 'taskset -c 0 perf bench sched pipe -l 10000':
20,001 context-switches
2 cgroup-switches
0.053449651 seconds time elapsed
0.011286000 seconds user
0.041869000 seconds sys
# perf stat -e context-switches,cgroup-switches \
> taskset -c 0 perf bench sched pipe -l 10000 -G AAA,BBB > /dev/null
Performance counter stats for 'taskset -c 0 perf bench sched pipe -l 10000 -G AAA,BBB':
20,001 context-switches
20,001 cgroup-switches
0.052768627 seconds time elapsed
0.006284000 seconds user
0.046266000 seconds sys
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Link: https://lore.kernel.org/r/20231017202342.1353124-1-namhyung@kernel.org
CoreSight might be not available, in such case, skip the tests.
Signed-off-by: Michael Petlan <mpetlan@redhat.com>
Reviewed-by: Leo Yan <leo.yan@linaro.org>
Reviewed-by: Carsten Haitzler <carsten.haitzler@arm.com>
Cc: vmolnaro@redhat.com
Link: https://lore.kernel.org/r/20231019091137.22525-1-mpetlan@redhat.com
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
This file was renamed from .txt to .rst and left a dangling reference.
Fix it.
Fixes: 151f4e2bdc7a ("docs: power: convert docs to ReST and rename to *.rst")
Cc: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Mark Brown <broonie@kernel.org>
Cc: Srivatsa S. Bhat (VMware) <srivatsa@csail.mit.edu>
Cc: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Cc: Jacob Pan <jacob.jun.pan@linux.intel.com>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Len Brown <len.brown@intel.com>
Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com>
Signed-off-by: Shuah Khan <skhan@linuxfoundation.org>
suspend/resume routines (Balsam Chihi)
- Fix probe for THERMAL_V2 for the Mediatek LVTS driver (Markus
Schneider-Pargmann)
- Remove duplicate error message in the max76620 driver when
thermal_of_zone_register() fails as the sub routine already show one
(Thierry Reding)
- Add i.MX7D compatible bindings to fix a warning from dtbs_check for
the imx6ul platform (Alexander Stein)
- Add sa8775p compatible for the QCom tsens driver (Priyansh Jain)
- Fix error check in lvts_debugfs_init() which is checking against
NULL instead of PTR_ERR() on the LVTS Mediatek driver (Minjie Du)
- Remove unused variable in the thermal/tools (Kuan-Wei Chiu)
- Document the imx8dl thermal sensor (Fabio Estevam)
- Add variable names in callback prototypes to prevent warning from
checkpatch.pl for the imx8mm driver (Bragatheswaran Manickavel)
- Add missing unevaluatedProperties on child node schemas for tegra124
(Rob Herring)
- Add mt7988 support for the Mediatek LVTS driver (Frank Wunderlich)
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEGn3N4YVz0WNVyHskqDIjiipP6E8FAmU35s8ACgkQqDIjiipP
6E8nggf/VQei7x/fRTFRYJVyP6SdwzpZPcyMYEcwL50wpp/d/R5xapFzM7nofV6n
SYFXkpuejGH5d5JbJjpkv1S56Q/Bf6yy7cGciKiBmxaU2cWY636KmhlUPe2kfmX5
ipiNi/ZU0zAjZM7FRpBVPgp0HTvnGehDzIaSi6Y/vKVAw5TPTzjLz8oNoeHGP4/6
xNRgJuTwAvqL1lqZoo9YV7tVdkcuQamf3dkDkHqvSSL9UilHZEqrtLyh1Iz12+py
LESjlePz1jzK169oG5QiA4thIMV7CPTwkrkAJXwT2H2zWccg7JzZ7K1S3aJq2vHL
5UMcsUE2wCczfdnlhxD6wkJ3CDyirA==
=i0K9
-----END PGP SIGNATURE-----
Merge tag 'thermal-v6.7-rc1' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/thermal/linux
Merge thermal control (ARM drivers mostly) updates for 6.7-rc1 from
Daniel Lezcano:
"- Add support for Mediatek LVTS MT8192 driver along with the
suspend/resume routines (Balsam Chihi)
- Fix probe for THERMAL_V2 for the Mediatek LVTS driver (Markus
Schneider-Pargmann)
- Remove duplicate error message in the max76620 driver when
thermal_of_zone_register() fails as the sub routine already show one
(Thierry Reding)
- Add i.MX7D compatible bindings to fix a warning from dtbs_check for
the imx6ul platform (Alexander Stein)
- Add sa8775p compatible for the QCom tsens driver (Priyansh Jain)
- Fix error check in lvts_debugfs_init() which is checking against
NULL instead of PTR_ERR() on the LVTS Mediatek driver (Minjie Du)
- Remove unused variable in the thermal/tools (Kuan-Wei Chiu)
- Document the imx8dl thermal sensor (Fabio Estevam)
- Add variable names in callback prototypes to prevent warning from
checkpatch.pl for the imx8mm driver (Bragatheswaran Manickavel)
- Add missing unevaluatedProperties on child node schemas for tegra124
(Rob Herring)
- Add mt7988 support for the Mediatek LVTS driver (Frank Wunderlich)"
* tag 'thermal-v6.7-rc1' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/thermal/linux:
thermal/qcom/tsens: Drop ops_v0_1
thermal/drivers/mediatek/lvts_thermal: Update calibration data documentation
thermal/drivers/mediatek/lvts_thermal: Add mt8192 support
thermal/drivers/mediatek/lvts_thermal: Add suspend and resume
dt-bindings: thermal: mediatek: Add LVTS thermal controller definition for mt8192
thermal/drivers/mediatek: Fix probe for THERMAL_V2
thermal/drivers/max77620: Remove duplicate error message
dt-bindings: timer: add imx7d compatible
dt-bindings: net: microchip: Allow nvmem-cell usage
dt-bindings: imx-thermal: Add #thermal-sensor-cells property
dt-bindings: thermal: tsens: Add sa8775p compatible
thermal/drivers/mediatek/lvts_thermal: Fix error check in lvts_debugfs_init()
tools/thermal: Remove unused 'mds' and 'nrhandler' variables
dt-bindings: thermal: fsl,scu-thermal: Document imx8dl
thermal/drivers/imx8mm_thermal: Fix function pointer declaration by adding identifier name
dt-bindings: thermal: nvidia,tegra124-soctherm: Add missing unevaluatedProperties on child node schemas
thermal/drivers/mediatek/lvts_thermal: Add mt7988 support
thermal/drivers/mediatek/lvts_thermal: Make coeff configurable
dt-bindings: thermal: mediatek: Add LVTS thermal sensors for mt7988
dt-bindings: thermal: mediatek: Add mt7988 lvts compatible
Add a minimal netlink helper library for the BPF selftests. This has been
taken and cut down and cleaned up from iproute2. This covers basics such
as netdevice creation which we need for BPF selftests / BPF CI given
iproute2 package cannot cover it yet.
Stanislav Fomichev suggested that this could be replaced in future by ynl
tool generated C code once it has RTNL support to create devices. Once we
get to this point the BPF CI would also need to add libmnl. If no further
extensions are needed, a second option could be that we remove this code
again once iproute2 package has support.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://lore.kernel.org/r/20231024214904.29825-7-daniel@iogearbox.net
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Add support to dump BPF programs on netkit via bpftool. This includes both
the BPF link and attach ops programs. Dumped information contain the attach
location, function entry name, program ID and link ID when applicable.
Example with tc BPF link:
# ./bpftool net
xdp:
tc:
nk1(22) netkit/peer tc1 prog_id 43 link_id 12
[...]
Example with json dump:
# ./bpftool net --json | jq
[
{
"xdp": [],
"tc": [
{
"devname": "nk1",
"ifindex": 18,
"kind": "netkit/primary",
"name": "tc1",
"prog_id": 29,
"prog_flags": [],
"link_id": 8,
"link_flags": []
}
],
"flow_dissector": [],
"netfilter": []
}
]
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Quentin Monnet <quentin@isovalent.com>
Acked-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://lore.kernel.org/r/20231024214904.29825-6-daniel@iogearbox.net
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Add support to dump netkit link information to bpftool in similar way as
we have for XDP. The netkit link info only exposes the ifindex and the
attach_type.
Below shows an example link dump output, and a cgroup link is included for
comparison, too:
# bpftool link
[...]
10: cgroup prog 2466
cgroup_id 1 attach_type cgroup_inet6_post_bind
[...]
8: netkit prog 35
ifindex nk1(18) attach_type netkit_primary
[...]
Equivalent json output:
# bpftool link --json
[...]
{
"id": 10,
"type": "cgroup",
"prog_id": 2466,
"cgroup_id": 1,
"attach_type": "cgroup_inet6_post_bind"
},
[...]
{
"id": 12,
"type": "netkit",
"prog_id": 61,
"devname": "nk1",
"ifindex": 21,
"attach_type": "netkit_primary"
}
[...]
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Quentin Monnet <quentin@isovalent.com>
Acked-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://lore.kernel.org/r/20231024214904.29825-5-daniel@iogearbox.net
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
This adds bpf_program__attach_netkit() API to libbpf. Overall it is very
similar to tcx. The API looks as following:
LIBBPF_API struct bpf_link *
bpf_program__attach_netkit(const struct bpf_program *prog, int ifindex,
const struct bpf_netkit_opts *opts);
The struct bpf_netkit_opts is done in similar way as struct bpf_tcx_opts
for supporting bpf_mprog control parameters. The attach location for the
primary and peer device is derived from the program section "netkit/primary"
and "netkit/peer", respectively.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://lore.kernel.org/r/20231024214904.29825-4-daniel@iogearbox.net
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Sync if_link uapi header to the latest version as we need the refresher
in tooling for netkit device. Given it's been a while since the last sync
and the diff is fairly big, it has been done as its own commit.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://lore.kernel.org/r/20231024214904.29825-3-daniel@iogearbox.net
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
This work adds a new, minimal BPF-programmable device called "netkit"
(former PoC code-name "meta") we recently presented at LSF/MM/BPF. The
core idea is that BPF programs are executed within the drivers xmit routine
and therefore e.g. in case of containers/Pods moving BPF processing closer
to the source.
One of the goals was that in case of Pod egress traffic, this allows to
move BPF programs from hostns tcx ingress into the device itself, providing
earlier drop or forward mechanisms, for example, if the BPF program
determines that the skb must be sent out of the node, then a redirect to
the physical device can take place directly without going through per-CPU
backlog queue. This helps to shift processing for such traffic from softirq
to process context, leading to better scheduling decisions/performance (see
measurements in the slides).
In this initial version, the netkit device ships as a pair, but we plan to
extend this further so it can also operate in single device mode. The pair
comes with a primary and a peer device. Only the primary device, typically
residing in hostns, can manage BPF programs for itself and its peer. The
peer device is designated for containers/Pods and cannot attach/detach
BPF programs. Upon the device creation, the user can set the default policy
to 'pass' or 'drop' for the case when no BPF program is attached.
Additionally, the device can be operated in L3 (default) or L2 mode. The
management of BPF programs is done via bpf_mprog, so that multi-attach is
supported right from the beginning with similar API and dependency controls
as tcx. For details on the latter see commit 053c8e1f235d ("bpf: Add generic
attach/detach/query API for multi-progs"). tc BPF compatibility is provided,
so that existing programs can be easily migrated.
Going forward, we plan to use netkit devices in Cilium as the main device
type for connecting Pods. They will be operated in L3 mode in order to
simplify a Pod's neighbor management and the peer will operate in default
drop mode, so that no traffic is leaving between the time when a Pod is
brought up by the CNI plugin and programs attached by the agent.
Additionally, the programs we attach via tcx on the physical devices are
using bpf_redirect_peer() for inbound traffic into netkit device, hence the
latter is also supporting the ndo_get_peer_dev callback. Similarly, we use
bpf_redirect_neigh() for the way out, pushing from netkit peer to phys device
directly. Also, BIG TCP is supported on netkit device. For the follow-up
work in single device mode, we plan to convert Cilium's cilium_host/_net
devices into a single one.
An extensive test suite for checking device operations and the BPF program
and link management API comes as BPF selftests in this series.
Co-developed-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Acked-by: Stanislav Fomichev <sdf@google.com>
Acked-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://github.com/borkmann/iproute2/tree/pr/netkit
Link: http://vger.kernel.org/bpfconf2023_material/tcx_meta_netdev_borkmann.pdf (24ff.)
Link: https://lore.kernel.org/r/20231024214904.29825-2-daniel@iogearbox.net
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>