Commit Graph

44728 Commits

Author SHA1 Message Date
Andrii Nakryiko
9ee9822908 bpf: save extended inner map info for percpu array maps as well
ARRAY_OF_MAPS and HASH_OF_MAPS map types have special logic to save
a few extra fields required for correct operations of ARRAY maps, when
they are used as inner maps. PERCPU_ARRAY maps have similar
requirements as they now support generating inline element lookup
logic. So make sure that both classes of maps are handled correctly.

Reported-by: Jakub Kicinski <kuba@kernel.org>
Fixes: db69718b8e ("bpf: inline bpf_map_lookup_elem() for PERCPU_ARRAY maps")
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20240515062440.846086-1-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-05-15 09:34:54 -07:00
Petr Mladek
dafc2d0f1b Merge branch 'for-6.10-base-small' into for-linus 2024-05-15 11:58:26 +02:00
Steven Rostedt (Google)
b9c6820f02 ring-buffer: Add cast to unsigned long addr passed to virt_to_page()
The sub-buffer pages are held in an unsigned long array, and when it is
passed to virt_to_page() a cast is needed.

Link: https://lore.kernel.org/all/20240515124808.06279d04@canb.auug.org.au/
Link: https://lore.kernel.org/linux-trace-kernel/20240515010558.4abaefdd@rorschach.local.home

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Fixes: 117c39200d ("ring-buffer: Introducing ring-buffer mapping functions")
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-05-15 01:28:42 -04:00
Linus Torvalds
1b294a1f35 Networking changes for 6.10.
Core & protocols
 ----------------
 
  - Complete rework of garbage collection of AF_UNIX sockets.
    AF_UNIX is prone to forming reference count cycles due to fd passing
    functionality. New method based on Tarjan's Strongly Connected Components
    algorithm should be both faster and remove a lot of workarounds
    we accumulated over the years.
 
  - Add TCP fraglist GRO support, allowing chaining multiple TCP packets
    and forwarding them together. Useful for small switches / routers which
    lack basic checksum offload in some scenarios (e.g. PPPoE).
 
  - Support using SMP threads for handling packet backlog i.e. packet
    processing from software interfaces and old drivers which don't
    use NAPI. This helps move the processing out of the softirq jumble.
 
  - Continue work of converting from rtnl lock to RCU protection.
    Don't require rtnl lock when reading: IPv6 routing FIB, IPv6 address
    labels, netdev threaded NAPI sysfs files, bonding driver's sysfs files,
    MPLS devconf, IPv4 FIB rules, netns IDs, tcp metrics, TC Qdiscs,
    neighbor entries, ARP entries via ioctl(SIOCGARP), a lot of the link
    information available via rtnetlink.
 
  - Small optimizations from Eric to UDP wake up handling, memory accounting,
    RPS/RFS implementation, TCP packet sizing etc.
 
  - Allow direct page recycling in the bulk API used by XDP, for +2% PPS.
 
  - Support peek with an offset on TCP sockets.
 
  - Add MPTCP APIs for querying last time packets were received/sent/acked,
    and whether MPTCP "upgrade" succeeded on a TCP socket.
 
  - Add intra-node communication shortcut to improve SMC performance.
 
  - Add IPv6 (and IPv{4,6}-over-IPv{4,6}) support to the GTP protocol driver.
 
  - Add HSR-SAN (RedBOX) mode of operation to the HSR protocol driver.
 
  - Add reset reasons for tracing what caused a TCP reset to be sent.
 
  - Introduce direction attribute for xfrm (IPSec) states.
    State can be used either for input or output packet processing.
 
 Things we sprinkled into general kernel code
 --------------------------------------------
 
  - Add bitmap_{read,write}(), bitmap_size(), expose BYTES_TO_BITS().
    This required touch-ups and renaming of a few existing users.
 
  - Add Endian-dependent __counted_by_{le,be} annotations.
 
  - Make building selftests "quieter" by printing summaries like
    "CC object.o" rather than full commands with all the arguments.
 
 Netfilter
 ---------
 
  - Use GFP_KERNEL to clone elements, to deal better with OOM situations
    and avoid failures in the .commit step.
 
 BPF
 ---
 
  - Add eBPF JIT for ARCv2 CPUs.
 
  - Support attaching kprobe BPF programs through kprobe_multi link in
    a session mode, meaning, a BPF program is attached to both function entry
    and return, the entry program can decide if the return program gets
    executed and the entry program can share u64 cookie value with return
    program. "Session mode" is a common use-case for tetragon and bpftrace.
 
  - Add the ability to specify and retrieve BPF cookie for raw tracepoint
    programs in order to ease migration from classic to raw tracepoints.
 
  - Add an internal-only BPF per-CPU instruction for resolving per-CPU
    memory addresses and implement support in x86, ARM64 and RISC-V JITs.
    This allows inlining functions which need to access per-CPU state.
 
  - Optimize x86 BPF JIT's emit_mov_imm64, and add support for various
    atomics in bpf_arena which can be JITed as a single x86 instruction.
    Support BPF arena on ARM64.
 
  - Add a new bpf_wq API for deferring events and refactor process-context
    bpf_timer code to keep common code where possible.
 
  - Harden the BPF verifier's and/or/xor value tracking.
 
  - Introduce crypto kfuncs to let BPF programs call kernel crypto APIs.
 
  - Support bpf_tail_call_static() helper for BPF programs with GCC 13.
 
  - Add bpf_preempt_{disable,enable}() kfuncs in order to allow a BPF
    program to have code sections where preemption is disabled.
 
 Driver API
 ----------
 
  - Skip software TC processing completely if all installed rules are
    marked as HW-only, instead of checking the HW-only flag rule by rule.
 
  - Add support for configuring PoE (Power over Ethernet), similar to
    the already existing support for PoDL (Power over Data Line) config.
 
  - Initial bits of a queue control API, for now allowing a single queue
    to be reset without disturbing packet flow to other queues.
 
  - Common (ethtool) statistics for hardware timestamping.
 
 Tests and tooling
 -----------------
 
  - Remove the need to create a config file to run the net forwarding tests
    so that a naive "make run_tests" can exercise them.
 
  - Define a method of writing tests which require an external endpoint
    to communicate with (to send/receive data towards the test machine).
    Add a few such tests.
 
  - Create a shared code library for writing Python tests. Expose the YAML
    Netlink library from tools/ to the tests for easy Netlink access.
 
  - Move netfilter tests under net/, extend them, separate performance tests
    from correctness tests, and iron out issues found by running them
    "on every commit".
 
  - Refactor BPF selftests to use common network helpers.
 
  - Further work filling in YAML definitions of Netlink messages for:
    nftables, team driver, bonding interfaces, vlan interfaces, VF info,
    TC u32 mark, TC police action.
 
  - Teach Python YAML Netlink to decode attribute policies.
 
  - Extend the definition of the "indexed array" construct in the specs
    to cover arrays of scalars rather than just nests.
 
  - Add hyperlinks between definitions in generated Netlink docs.
 
 Drivers
 -------
 
  - Make sure unsupported flower control flags are rejected by drivers,
    and make more drivers report errors directly to the application rather
    than dmesg (large number of driver changes from Asbjørn Sloth Tønnesen).
 
  - Ethernet high-speed NICs:
    - Broadcom (bnxt):
      - support multiple RSS contexts and steering traffic to them
      - support XDP metadata
      - make page pool allocations more NUMA aware
    - Intel (100G, ice, idpf):
      - extract datapath code common among Intel drivers into a library
      - use fewer resources in switchdev by sharing queues with the PF
      - add PFCP filter support
      - add Ethernet filter support
      - use a spinlock instead of HW lock in PTP clock ops
      - support 5 layer Tx scheduler topology
    - nVidia/Mellanox:
      - 800G link modes and 100G SerDes speeds
      - per-queue IRQ coalescing configuration
    - Marvell Octeon:
      - support offloading TC packet mark action
 
  - Ethernet NICs consumer, embedded and virtual:
    - stop lying about skb->truesize in USB Ethernet drivers, it messes up
      TCP memory calculations
    - Google cloud vNIC:
      - support changing ring size via ethtool
      - support ring reset using the queue control API
    - VirtIO net:
      - expose flow hash from RSS to XDP
      - per-queue statistics
      - add selftests
    - Synopsys (stmmac):
      - support controllers which require an RX clock signal from the MII
        bus to perform their hardware initialization
    - TI:
      - icssg_prueth: support ICSSG-based Ethernet on AM65x SR1.0 devices
      - icssg_prueth: add SW TX / RX Coalescing based on hrtimers
      - cpsw: minimal XDP support
    - Renesas (ravb):
      - support describing the MDIO bus
    - Realtek (r8169):
      - add support for RTL8168M
    - Microchip Sparx5:
      - matchall and flower actions mirred and redirect
 
  - Ethernet switches:
    - nVidia/Mellanox:
      - improve events processing performance
    - Marvell:
      - add support for MV88E6250 family internal PHYs
    - Microchip:
      - add DCB and DSCP mapping support for KSZ switches
      - vsc73xx: convert to PHYLINK
    - Realtek:
      - rtl8226b/rtl8221b: add C45 instances and SerDes switching
 
  - Many driver changes related to PHYLIB and PHYLINK deprecated API cleanup.
 
  - Ethernet PHYs:
    - Add a new driver for Airoha EN8811H 2.5 Gigabit PHY.
    - micrel: lan8814: add support for PPS out and external timestamp trigger
 
  - WiFi:
    - Disable Wireless Extensions (WEXT) in all Wi-Fi 7 devices drivers.
      Modern devices can only be configured using nl80211.
    - mac80211/cfg80211
      - handle color change per link for WiFi 7 Multi-Link Operation
    - Intel (iwlwifi):
      - don't support puncturing in 5 GHz
      - support monitor mode on passive channels
      - BZ-W device support
      - P2P with HE/EHT support
      - re-add support for firmware API 90
      - provide channel survey information for Automatic Channel Selection
    - MediaTek (mt76):
      - mt7921 LED control
      - mt7925 EHT radiotap support
      - mt7920e PCI support
    - Qualcomm (ath11k):
      - P2P support for QCA6390, WCN6855 and QCA2066
      - support hibernation
      - ieee80211-freq-limit Device Tree property support
    - Qualcomm (ath12k):
      - refactoring in preparation of multi-link support
      - suspend and hibernation support
      - ACPI support
      - debugfs support, including dfs_simulate_radar support
    - RealTek:
      - rtw88: RTL8723CS SDIO device support
      - rtw89: RTL8922AE Wi-Fi 7 PCI device support
      - rtw89: complete features of new WiFi 7 chip 8922AE including
        BT-coexistence and Wake-on-WLAN
      - rtw89: use BIOS ACPI settings to set TX power and channels
      - rtl8xxxu: enable Management Frame Protection (MFP) support
 
  - Bluetooth:
    - support for Intel BlazarI and Filmore Peak2 (BE201)
    - support for MediaTek MT7921S SDIO
    - initial support for Intel PCIe BT driver
    - remove HCI_AMP support
 
 Signed-off-by: Jakub Kicinski <kuba@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmZD6sQACgkQMUZtbf5S
 IrtLYw/+I73ePGIye37o2jpbodcLAUZVfF3r6uYUzK8hokEcKD0QVJa9w7PizLZ3
 UO45ClOXFLJCkfP4reFenLfxGCel2AJI+F7VFl2xaO2XgrcH/lnVrHqKZEAEXjls
 KoYMnShIolv7h2MKP6hHtyTi2j1wvQUKsZC71o9/fuW+4fUT8gECx1YtYcL73wrw
 gEMdlUgBYC3jiiCUHJIFX6iPJ2t/TC+q1eIIF2K/Osrk2kIqQhzoozcL4vpuAZQT
 99ljx/qRelXa8oppDb7nM5eulg7WY8ZqxEfFZphTMC5nLEGzClxuOTTl2kDYI/D/
 UZmTWZDY+F5F0xvNk2gH84qVJXBOVDoobpT7hVA/tDuybobc/kvGDzRayEVqVzKj
 Q0tPlJs+xBZpkK5TVnxaFLJVOM+p1Xosxy3kNVXmuYNBvT/R89UbJiCrUKqKZF+L
 z/1mOYUv8UklHqYAeuJSptHvqJjTGa/fsEYP7dAUBbc1N2eVB8mzZ4mgU5rYXbtC
 E6UXXiWnoSRm8bmco9QmcWWoXt5UGEizHSJLz6t1R5Df/YmXhWlytll5aCwY1ksf
 FNoL7S4u7AZThL1Nwi7yUs4CAjhk/N4aOsk+41S0sALCx30BJuI6UdesAxJ0lu+Z
 fwCQYbs27y4p7mBLbkYwcQNxAxGm7PSK4yeyRIy2njiyV4qnLf8=
 =EsC2
 -----END PGP SIGNATURE-----

Merge tag 'net-next-6.10' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next

Pull networking updates from Jakub Kicinski:
 "Core & protocols:

   - Complete rework of garbage collection of AF_UNIX sockets.

     AF_UNIX is prone to forming reference count cycles due to fd
     passing functionality. New method based on Tarjan's Strongly
     Connected Components algorithm should be both faster and remove a
     lot of workarounds we accumulated over the years.

   - Add TCP fraglist GRO support, allowing chaining multiple TCP
     packets and forwarding them together. Useful for small switches /
     routers which lack basic checksum offload in some scenarios (e.g.
     PPPoE).

   - Support using SMP threads for handling packet backlog i.e. packet
     processing from software interfaces and old drivers which don't use
     NAPI. This helps move the processing out of the softirq jumble.

   - Continue work of converting from rtnl lock to RCU protection.

     Don't require rtnl lock when reading: IPv6 routing FIB, IPv6
     address labels, netdev threaded NAPI sysfs files, bonding driver's
     sysfs files, MPLS devconf, IPv4 FIB rules, netns IDs, tcp metrics,
     TC Qdiscs, neighbor entries, ARP entries via ioctl(SIOCGARP), a lot
     of the link information available via rtnetlink.

   - Small optimizations from Eric to UDP wake up handling, memory
     accounting, RPS/RFS implementation, TCP packet sizing etc.

   - Allow direct page recycling in the bulk API used by XDP, for +2%
     PPS.

   - Support peek with an offset on TCP sockets.

   - Add MPTCP APIs for querying last time packets were received/sent/acked
     and whether MPTCP "upgrade" succeeded on a TCP socket.

   - Add intra-node communication shortcut to improve SMC performance.

   - Add IPv6 (and IPv{4,6}-over-IPv{4,6}) support to the GTP protocol
     driver.

   - Add HSR-SAN (RedBOX) mode of operation to the HSR protocol driver.

   - Add reset reasons for tracing what caused a TCP reset to be sent.

   - Introduce direction attribute for xfrm (IPSec) states. State can be
     used either for input or output packet processing.

  Things we sprinkled into general kernel code:

   - Add bitmap_{read,write}(), bitmap_size(), expose BYTES_TO_BITS().

     This required touch-ups and renaming of a few existing users.

   - Add Endian-dependent __counted_by_{le,be} annotations.

   - Make building selftests "quieter" by printing summaries like
     "CC object.o" rather than full commands with all the arguments.

  Netfilter:

   - Use GFP_KERNEL to clone elements, to deal better with OOM
     situations and avoid failures in the .commit step.

  BPF:

   - Add eBPF JIT for ARCv2 CPUs.

   - Support attaching kprobe BPF programs through kprobe_multi link in
     a session mode, meaning, a BPF program is attached to both function
     entry and return, the entry program can decide if the return
     program gets executed and the entry program can share u64 cookie
     value with return program. "Session mode" is a common use-case for
     tetragon and bpftrace.

   - Add the ability to specify and retrieve BPF cookie for raw
     tracepoint programs in order to ease migration from classic to raw
     tracepoints.

   - Add an internal-only BPF per-CPU instruction for resolving per-CPU
     memory addresses and implement support in x86, ARM64 and RISC-V
     JITs. This allows inlining functions which need to access per-CPU
     state.

   - Optimize x86 BPF JIT's emit_mov_imm64, and add support for various
     atomics in bpf_arena which can be JITed as a single x86
     instruction. Support BPF arena on ARM64.

   - Add a new bpf_wq API for deferring events and refactor
     process-context bpf_timer code to keep common code where possible.

   - Harden the BPF verifier's and/or/xor value tracking.

   - Introduce crypto kfuncs to let BPF programs call kernel crypto
     APIs.

   - Support bpf_tail_call_static() helper for BPF programs with GCC 13.

   - Add bpf_preempt_{disable,enable}() kfuncs in order to allow a BPF
     program to have code sections where preemption is disabled.

  Driver API:

   - Skip software TC processing completely if all installed rules are
     marked as HW-only, instead of checking the HW-only flag rule by
     rule.

   - Add support for configuring PoE (Power over Ethernet), similar to
     the already existing support for PoDL (Power over Data Line)
     config.

   - Initial bits of a queue control API, for now allowing a single
     queue to be reset without disturbing packet flow to other queues.

   - Common (ethtool) statistics for hardware timestamping.

  Tests and tooling:

   - Remove the need to create a config file to run the net forwarding
     tests so that a naive "make run_tests" can exercise them.

   - Define a method of writing tests which require an external endpoint
     to communicate with (to send/receive data towards the test
     machine). Add a few such tests.

   - Create a shared code library for writing Python tests. Expose the
     YAML Netlink library from tools/ to the tests for easy Netlink
     access.

   - Move netfilter tests under net/, extend them, separate performance
     tests from correctness tests, and iron out issues found by running
     them "on every commit".

   - Refactor BPF selftests to use common network helpers.

   - Further work filling in YAML definitions of Netlink messages for:
     nftables, team driver, bonding interfaces, vlan interfaces, VF
     info, TC u32 mark, TC police action.

   - Teach Python YAML Netlink to decode attribute policies.

   - Extend the definition of the "indexed array" construct in the specs
     to cover arrays of scalars rather than just nests.

   - Add hyperlinks between definitions in generated Netlink docs.

  Drivers:

   - Make sure unsupported flower control flags are rejected by drivers,
     and make more drivers report errors directly to the application
     rather than dmesg (large number of driver changes from Asbjørn
     Sloth Tønnesen).

   - Ethernet high-speed NICs:
      - Broadcom (bnxt):
         - support multiple RSS contexts and steering traffic to them
         - support XDP metadata
         - make page pool allocations more NUMA aware
      - Intel (100G, ice, idpf):
         - extract datapath code common among Intel drivers into a library
         - use fewer resources in switchdev by sharing queues with the PF
         - add PFCP filter support
         - add Ethernet filter support
         - use a spinlock instead of HW lock in PTP clock ops
         - support 5 layer Tx scheduler topology
      - nVidia/Mellanox:
         - 800G link modes and 100G SerDes speeds
         - per-queue IRQ coalescing configuration
      - Marvell Octeon:
         - support offloading TC packet mark action

   - Ethernet NICs consumer, embedded and virtual:
      - stop lying about skb->truesize in USB Ethernet drivers, it
        messes up TCP memory calculations
      - Google cloud vNIC:
         - support changing ring size via ethtool
         - support ring reset using the queue control API
      - VirtIO net:
         - expose flow hash from RSS to XDP
         - per-queue statistics
         - add selftests
      - Synopsys (stmmac):
         - support controllers which require an RX clock signal from the
           MII bus to perform their hardware initialization
      - TI:
         - icssg_prueth: support ICSSG-based Ethernet on AM65x SR1.0 devices
         - icssg_prueth: add SW TX / RX Coalescing based on hrtimers
         - cpsw: minimal XDP support
      - Renesas (ravb):
         - support describing the MDIO bus
      - Realtek (r8169):
         - add support for RTL8168M
      - Microchip Sparx5:
         - matchall and flower actions mirred and redirect

   - Ethernet switches:
      - nVidia/Mellanox:
         - improve events processing performance
      - Marvell:
         - add support for MV88E6250 family internal PHYs
      - Microchip:
         - add DCB and DSCP mapping support for KSZ switches
         - vsc73xx: convert to PHYLINK
      - Realtek:
         - rtl8226b/rtl8221b: add C45 instances and SerDes switching

   - Many driver changes related to PHYLIB and PHYLINK deprecated API
     cleanup

   - Ethernet PHYs:
      - Add a new driver for Airoha EN8811H 2.5 Gigabit PHY.
      - micrel: lan8814: add support for PPS out and external timestamp trigger

   - WiFi:
      - Disable Wireless Extensions (WEXT) in all Wi-Fi 7 devices
        drivers. Modern devices can only be configured using nl80211.
      - mac80211/cfg80211
         - handle color change per link for WiFi 7 Multi-Link Operation
      - Intel (iwlwifi):
         - don't support puncturing in 5 GHz
         - support monitor mode on passive channels
         - BZ-W device support
         - P2P with HE/EHT support
         - re-add support for firmware API 90
         - provide channel survey information for Automatic Channel Selection
      - MediaTek (mt76):
         - mt7921 LED control
         - mt7925 EHT radiotap support
         - mt7920e PCI support
      - Qualcomm (ath11k):
         - P2P support for QCA6390, WCN6855 and QCA2066
         - support hibernation
         - ieee80211-freq-limit Device Tree property support
      - Qualcomm (ath12k):
         - refactoring in preparation of multi-link support
         - suspend and hibernation support
         - ACPI support
         - debugfs support, including dfs_simulate_radar support
      - RealTek:
         - rtw88: RTL8723CS SDIO device support
         - rtw89: RTL8922AE Wi-Fi 7 PCI device support
         - rtw89: complete features of new WiFi 7 chip 8922AE including
           BT-coexistence and Wake-on-WLAN
         - rtw89: use BIOS ACPI settings to set TX power and channels
         - rtl8xxxu: enable Management Frame Protection (MFP) support

   - Bluetooth:
      - support for Intel BlazarI and Filmore Peak2 (BE201)
      - support for MediaTek MT7921S SDIO
      - initial support for Intel PCIe BT driver
      - remove HCI_AMP support"

* tag 'net-next-6.10' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1827 commits)
  selftests: netfilter: fix packetdrill conntrack testcase
  net: gro: fix napi_gro_cb zeroed alignment
  Bluetooth: btintel_pcie: Refactor and code cleanup
  Bluetooth: btintel_pcie: Fix warning reported by sparse
  Bluetooth: hci_core: Fix not handling hdev->le_num_of_adv_sets=1
  Bluetooth: btintel: Fix compiler warning for multi_v7_defconfig config
  Bluetooth: btintel_pcie: Fix compiler warnings
  Bluetooth: btintel_pcie: Add *setup* function to download firmware
  Bluetooth: btintel_pcie: Add support for PCIe transport
  Bluetooth: btintel: Export few static functions
  Bluetooth: HCI: Remove HCI_AMP support
  Bluetooth: L2CAP: Fix div-by-zero in l2cap_le_flowctl_init()
  Bluetooth: qca: Fix error code in qca_read_fw_build_info()
  Bluetooth: hci_conn: Use __counted_by() and avoid -Wfamnae warning
  Bluetooth: btintel: Add support for Filmore Peak2 (BE201)
  Bluetooth: btintel: Add support for BlazarI
  LE Create Connection command timeout increased to 20 secs
  dt-bindings: net: bluetooth: Add MediaTek MT7921S SDIO Bluetooth
  Bluetooth: compute LE flow credits based on recvbuf space
  Bluetooth: hci_sync: Use cmd->num_cis instead of magic number
  ...
2024-05-14 19:42:24 -07:00
Linus Torvalds
0c181b1d97 Power management updates for 6.10-rc1
- Rework the handling of disabled turbo in the intel_pstate driver and
    make it update the maximum CPU frequency consistently regardless of
    the reason on top of a number of cleanups (Rafael Wysocki).
 
  - Add missing checks for NULL .exit() cpufreq driver callback to the
    cpufreq core (Viresh Kumar).
 
  - Prevent pulicy->max from going above the frequency QoS maximum value
    when cpufreq_frequency_table_verify() is used (Xuewen Yan).
 
  - Prevent a negative CPU number or frequency value from being printed
    if they are really large (Joshua Yeong).
 
  - Update MAINTAINERS entry for amd-pstate to add two new submaintainers
    and a designated reviewer (Huang Rui).
 
  - Clean up the amd-pstate driver and update its documentation (Gautham
    Shenoy).
 
  - Fix the highest frequency issue in the amd-pstate driver which limits
    performance (Perry Yuan).
 
  - Enable CPPC v2 for certain processors in the family 17H, as requested
    by TR40 processor users who expect improved performance and lower
    system temperature (Perry Yuan).
 
  - Change latency and delay values to be read from platform firmware
    firstly for more accurate timing (Perry Yuan).
 
  - A new quirk is introduced for supporting amd-pstate on legacy
    processors which either lack CPPC capability, or only only have CPPC
    v2 capability (Perry Yuan).
 
  - Sun50i cpufreq: Add support for opp_supported_hw, H616 platform and
    general cleanups (Andre Przywara, Martin Botka, Brandon Cheo Fusi,
    Dan Carpenter, Viresh Kumar).
 
  - CPPC cpufreq: Fix possible null pointer dereference (Aleksandr
    Mishin).
 
  - Eliminate uses of of_node_put() from cpufreq (Javier Carrasco,
    Shivani Gupta).
 
  - brcmstb-avs: ISO C90 forbids mixed declarations (Portia Stephens).
 
  - mediatek cpufreq: Add support for MT7988A (Sam Shih).
 
  - cpufreq-qcom-hw: Add SM4450 compatibles in DT bindings (Tengfei Fan).
 
  - Fix struct cpudata::epp_cached kernel-doc in the intel_pstate cpufreq
    driver (Jeff Johnson).
 
  - Fix kerneldoc description of ladder_do_selection() (Jeff Johnson).
 
  - Convert the cpuidle kirkwood driver to platform remove callback
    returning void (Yangtao Li).
 
  - Replace deprecated strncpy() with strscpy() in the hibernation core
    code (Justin Stitt).
 
  - Use %ps to simplify debug output in the core system-wide suspend and
    resume code (Len Brown).
 
  - Remove unnecessary else from device_init_wakeup() and make
    device_wakeup_disable() return void (Dhruva Gole).
 
  - Enable PMU support in the Intel TPMI RAPL driver (Zhang Rui).
 
  - Add support for ArrowLake-H platform to the Intel RAPL driver (Zhang
    Rui).
 
  - Avoid explicit cpumask allocation on stack in DTPM (Dawei Li).
 
  - Make the Samsung exynos-asv driver update the Energy Model after
    adjusting voltage on top of some preliminary changes of the OPP and
    Enery Model generic code (Lukasz Luba).
 
  - Remove a reference to a function that has been dropped from the power
    management documentation (Bjorn Helgaas).
 
  - Convert the platfrom remove callback to .remove_new for the
    exyno-nocp, exynos-ppmu, mtk-cci-devfreq, sun8i-a33-mbus, and
    rk3399_dmc devfreq drivers (Uwe Kleine-König).
 
  - Use DEFINE_SIMPLE_PM_OPS for exyno-bus.c driver (Anand Moon).
 -----BEGIN PGP SIGNATURE-----
 
 iQJGBAABCAAwFiEE4fcc61cGeeHD/fCwgsRv/nhiVHEFAmZCZrASHHJqd0Byand5
 c29ja2kubmV0AAoJEILEb/54YlRxH3cP/RwYN6V3H+XUlhxN0M1GXb8zkLGTLm9X
 mGRKzDAoElGwYJVSpGPPtP0F+IaS3Sb7JnB719lSS7u7LmYIcqivTaBRdDIHWILJ
 qWbTSy7+84Zakf0RZ5qRr3GIGcNHmY5QDZf3/jC0AX4VBnFqFCjpaW04zmUjmAqn
 k13V3vfHl0J2/qKkm/JIvg2hubcAQzcP9UMgsjRE/S9QzNScEe7910v+0pv8XyUW
 4kdjSItUG8CaJV5er/XarYl4bh39OqT8Lvuo4wbaCFvOyRsMHoXqStxZVLTb9iEI
 j96vBXdy5Bfs503vc+Bu3TGcKPQTfjeRkEYDlwvpxwtJfMGnRQemgidSQwsbz208
 oQaybFxU0UHMgsVh1R0VrbdrhUuMxUz1OrCPSg6rhYJTZ1UhTwISoDTdf+SstGCC
 ODZgG59m6ez5udFAeavLA319jQEGL/oWPkHckVld4Gr10qrMu7SWseflx/+RY2dG
 Rjvd/Kv9FYWVyrIttQf3YIFlc3SLhM5K4IxPhzvj94MDs4spbwAx3wk5lR1Nw2ct
 HIVVjfBS+9I5dlRI7+VLM7VzD1JUxOOeZH84aTMDL080hiFZLEJaD+TkCc2QCa02
 5fGSa1DM5wX87TCdltRtW+OP715Q+97OXdeRQtwgIewfM8zPi0m2ctODNj08+EO1
 qmlFSJYTmFhR
 =el5Y
 -----END PGP SIGNATURE-----

Merge tag 'pm-6.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull power management updates from Rafael Wysocki:
 "These are mostly cpufreq updates, including a significant intel-pstate
  driver update and several amd-pstate improvements plus some updates of
  ARM cpufreq drivers, general fixes and cleanups.

  Also included are changes related to system sleep, power capping
  updates adding support for a new platform and a new hardware feature
  (among other things), a Samsung exynos-asv driver update allowing it
  to change its Energy Model after adjusting voltage, minor cpuidle and
  devfreq updates and a small documentation cleanup.

  Specifics:

   - Rework the handling of disabled turbo in the intel_pstate driver
     and make it update the maximum CPU frequency consistently
     regardless of the reason on top of a number of cleanups (Rafael
     Wysocki)

   - Add missing checks for NULL .exit() cpufreq driver callback to the
     cpufreq core (Viresh Kumar)

   - Prevent pulicy->max from going above the frequency QoS maximum
     value when cpufreq_frequency_table_verify() is used (Xuewen Yan)

   - Prevent a negative CPU number or frequency value from being printed
     if they are really large (Joshua Yeong)

   - Update MAINTAINERS entry for amd-pstate to add two new
     submaintainers and a designated reviewer (Huang Rui)

   - Clean up the amd-pstate driver and update its documentation
     (Gautham Shenoy)

   - Fix the highest frequency issue in the amd-pstate driver which
     limits performance (Perry Yuan)

   - Enable CPPC v2 for certain processors in the family 17H, as
     requested by TR40 processor users who expect improved performance
     and lower system temperature (Perry Yuan)

   - Change latency and delay values to be read from platform firmware
     firstly for more accurate timing (Perry Yuan)

   - A new quirk is introduced for supporting amd-pstate on legacy
     processors which either lack CPPC capability, or only only have
     CPPC v2 capability (Perry Yuan)

   - Sun50i cpufreq: Add support for opp_supported_hw, H616 platform and
     general cleanups (Andre Przywara, Martin Botka, Brandon Cheo Fusi,
     Dan Carpenter, Viresh Kumar)

   - CPPC cpufreq: Fix possible null pointer dereference (Aleksandr
     Mishin)

   - Eliminate uses of of_node_put() from cpufreq (Javier Carrasco,
     Shivani Gupta)

   - brcmstb-avs: ISO C90 forbids mixed declarations (Portia Stephens)

   - mediatek cpufreq: Add support for MT7988A (Sam Shih)

   - cpufreq-qcom-hw: Add SM4450 compatibles in DT bindings (Tengfei
     Fan)

   - Fix struct cpudata::epp_cached kernel-doc in the intel_pstate
     cpufreq driver (Jeff Johnson)

   - Fix kerneldoc description of ladder_do_selection() (Jeff Johnson)

   - Convert the cpuidle kirkwood driver to platform remove callback
     returning void (Yangtao Li)

   - Replace deprecated strncpy() with strscpy() in the hibernation core
     code (Justin Stitt)

   - Use %ps to simplify debug output in the core system-wide suspend
     and resume code (Len Brown)

   - Remove unnecessary else from device_init_wakeup() and make
     device_wakeup_disable() return void (Dhruva Gole)

   - Enable PMU support in the Intel TPMI RAPL driver (Zhang Rui)

   - Add support for ArrowLake-H platform to the Intel RAPL driver
     (Zhang Rui)

   - Avoid explicit cpumask allocation on stack in DTPM (Dawei Li)

   - Make the Samsung exynos-asv driver update the Energy Model after
     adjusting voltage on top of some preliminary changes of the OPP and
     Enery Model generic code (Lukasz Luba)

   - Remove a reference to a function that has been dropped from the
     power management documentation (Bjorn Helgaas)

   - Convert the platfrom remove callback to .remove_new for the
     exyno-nocp, exynos-ppmu, mtk-cci-devfreq, sun8i-a33-mbus, and
     rk3399_dmc devfreq drivers (Uwe Kleine-König)

   - Use DEFINE_SIMPLE_PM_OPS for exyno-bus.c driver (Anand Moon)"

* tag 'pm-6.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (68 commits)
  PM / devfreq: exynos: Use DEFINE_SIMPLE_DEV_PM_OPS for PM functions
  PM / devfreq: rk3399_dmc: Convert to platform remove callback returning void
  PM / devfreq: sun8i-a33-mbus: Convert to platform remove callback returning void
  PM / devfreq: mtk-cci: Convert to platform remove callback returning void
  PM / devfreq: exynos-ppmu: Convert to platform remove callback returning void
  PM / devfreq: exynos-nocp: Convert to platform remove callback returning void
  cpufreq: amd-pstate: fix the highest frequency issue which limits performance
  cpufreq: intel_pstate: fix struct cpudata::epp_cached kernel-doc
  cpuidle: ladder: fix ladder_do_selection() kernel-doc
  powercap: intel_rapl_tpmi: Enable PMU support
  powercap: intel_rapl: Introduce APIs for PMU support
  PM: hibernate: replace deprecated strncpy() with strscpy()
  cpufreq: Fix up printing large CPU numbers and frequency values
  MAINTAINERS: cpufreq: amd-pstate: Add co-maintainers and reviewer
  cpufreq: amd-pstate: remove unused variable lowest_nonlinear_freq
  cpufreq: amd-pstate: fix code format problems
  cpufreq: amd-pstate: Add quirk for the pstate CPPC capabilities missing
  cppc_acpi: print error message if CPPC is unsupported
  cpufreq: amd-pstate: get transition delay and latency value from ACPI tables
  cpufreq: amd-pstate: Bail out if min/max/nominal_freq is 0
  ...
2024-05-14 13:19:15 -07:00
Jesper Dangaard Brouer
21c38a3bd4 cgroup/rstat: add cgroup_rstat_cpu_lock helpers and tracepoints
This closely resembles helpers added for the global cgroup_rstat_lock in
commit fc29e04ae1 ("cgroup/rstat: add cgroup_rstat_lock helpers and
tracepoints"). This is for the per CPU lock cgroup_rstat_cpu_lock.

Based on production workloads, we observe the fast-path "update" function
cgroup_rstat_updated() is invoked around 3 million times per sec, while the
"flush" function cgroup_rstat_flush_locked(), walking each possible CPU,
can see periodic spikes of 700 invocations/sec.

For this reason, the tracepoints are split into normal and fastpath
versions for this per-CPU lock. Making it feasible for production to
continuously monitor the non-fastpath tracepoint to detect lock contention
issues. The reason for monitoring is that lock disables IRQs which can
disturb e.g. softirq processing on the local CPUs involved. When the
global cgroup_rstat_lock stops disabling IRQs (e.g converted to a mutex),
this per CPU lock becomes the next bottleneck that can introduce latency
variations.

A practical bpftrace script for monitoring contention latency:

 bpftrace -e '
   tracepoint:cgroup:cgroup_rstat_cpu_lock_contended {
     @start[tid]=nsecs; @cnt[probe]=count()}
   tracepoint:cgroup:cgroup_rstat_cpu_locked {
     if (args->contended) {
       @wait_ns=hist(nsecs-@start[tid]); delete(@start[tid]);}
     @cnt[probe]=count()}
   interval:s:1 {time("%H:%M:%S "); print(@wait_ns); print(@cnt); clear(@cnt);}'

Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
2024-05-14 09:43:17 -10:00
Linus Torvalds
896d3fce84 linux_kselftest-kunit-6.10-rc1
This kunit update for Linux 6.10-rc1 consists of:
 
 - fix to race condition in try-catch completion
 - change to __kunit_test_suites_init() to exit early if there is
   nothing to test
 - change to string-stream-test to use KUNIT_DEFINE_ACTION_WRAPPER
 - moving fault tests behind KUNIT_FAULT_TEST Kconfig option
 - kthread test fixes and improvements
 - iov_iter test fixes
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEPZKym/RZuOCGeA/kCwJExA0NQxwFAmZCNsAACgkQCwJExA0N
 Qxx03w/9EmjF3T16LPaeuerdoypWDcroDT6gpoFXGrvf3lDrna8uDNija5Pb1yMn
 l97wla3IJ1EZRMTy1jgWGQiiGIdkV8hcze65HZMi19qx/49TUbhA/pTmpYC56cp9
 sk2fBjOHz8iI4kdL4eCMr9MpSiwOIDcfWOr1Lh/AP2LHOU1pRdFZbwO6iZ3wyGlJ
 JH4D1CwmfgMGEau4qUo0jvuRbFAf33S+yEI9gr8CskPItljFVO4jVz4lprnTbU9i
 qAOivHzwcHyYc0upb6q2vIlp8vhmDygG/m07lnwfF7ZHsYo+3zV4FkxHspN2+jGA
 frH7Y0X9zt6YjRRMb9NcNnI67VTiSNzdCvB7urUhKlbXoZ2gjtgB7zHeQtAhlXRo
 XVa4QgWBI5ExKBuLI+0yKo4wEO8M0quXxhbX+2Q+tsRnoYmhwb0G8AUyl/26bt2g
 RelGrArDS5eMrlxl97rjMGFrB5Uan2MR751tl+aZPgyNRW3tRKJnQLZmM1z8aFQp
 vGReT6POzCnQ1wLUkcj6mnObbv9XuuYY1BQgKCtmJflvRToEuwpLOKK8Uca7ou3p
 TbVarGIn0jdHv4zGkXrAkt/mhcxanBXhVKLfh/MqQ7fCZBULkSrjJFLhCpvvHwIV
 nckaP2sZWls6FTDuawFOUxrr/+LjJchMmHhFy9MiDaVoieiTg6U=
 =3QIa
 -----END PGP SIGNATURE-----

Merge tag 'linux_kselftest-kunit-6.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest

Pull kunit updates from Shuah Khan:

 - fix race condition in try-catch completion

 - change __kunit_test_suites_init() to exit early if there is
   nothing to test

 - change string-stream-test to use KUNIT_DEFINE_ACTION_WRAPPER

 - move fault tests behind KUNIT_FAULT_TEST Kconfig option

 - kthread test fixes and improvements

 - iov_iter test fixes

* tag 'linux_kselftest-kunit-6.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest:
  kunit: bail out early in __kunit_test_suites_init() if there are no suites to test
  kunit: string-stream-test: use KUNIT_DEFINE_ACTION_WRAPPER
  kunit: test: Move fault tests behind KUNIT_FAULT_TEST Kconfig option
  kunit: unregister the device on error
  kunit: Fix race condition in try-catch completion
  kunit: Add tests for fault
  kunit: Print last test location on fault
  kunit: Fix KUNIT_SUCCESS() calls in iov_iter tests
  kunit: Handle test faults
  kunit: Fix timeout message
  kunit: Fix kthread reference
  kunit: Handle thread creation error
2024-05-14 11:32:52 -07:00
Jakub Kicinski
654de42f3f Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Merge in late fixes to prepare for the 6.10 net-next PR.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-14 10:53:19 -07:00
Linus Torvalds
6bfd2d442a Updates for the interrupt subsystem:
- Core code:
 
    - Interrupt storm detection for the lockup watchdog:
 
      Lockups which are caused by interrupt storms are not easy to debug
      because there is no information about the events which make the lockup
      detector trigger.
 
      To make this more user friendly, provide an extenstion to interrupt
      statistics which allows to take snapshots and an interface to retrieve
      the delta to the snapshot. Use this new mechanism in the watchdog code
      to do a two stage lockup analysis by taking the snapshot and printing
      the deltas for the topmost active interrupts on the second trigger.
 
      Note: This contains both the interrupt and the watchdog changes as
      the latter depend on the former obviously.
 
   - Avoid summation loops in the /proc/interrupts output and use the global
     counter when possible
 
   - Skip suspended interrupts on CPU hotplug operations to ensure that they
     are not delivered before the system resumes the device drivers when
     coming out of suspend.
 
   - On CPU hot-unplug interrupts which are affine to the outgoing CPU are
     migrated to a different CPU in the affinity mask. This can fail when
     the CPUs have no vectors left. Instead of giving up try to migrate it
     to any online CPU and thereby breaking the affinity setting in order to
     prevent a stale device interrupt which targets an offline CPU
 
   - The usual small cleanups
 
  - Driver code:
 
   - Support for the RISCV AIA MSI controller
 
   - Make the interrupt allocation for the Loongson PCH controller more
     flexible to prevent vector exhaustion
 
   - The usual set of cleanups and fixes all over the place
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmZBCM0THHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoeZHEACqMLN3K+1HyWflYtcTHJeYCjZLHS77
 2tQeKaaskOA4W6dcGXPxMw5CHqAobHVQQMqgcJxhUdqQiOJnFFnrtCD7JtqM0hWK
 UORNbyeovuhAo+iJ0fTuS8p63H7vm2GIWwBLWJnOuChYv/6Yyx5Cald1skvyvbzL
 zePhiiAf5mkdmJMeT5wJSCqEWSRYOXsVAJ/0YAwFG3bKkJH3bmDo6SDJY02sXT5P
 pjbtD/0hum9wIVT4fNdYleHHQMdBdj9dLlcxXBikHq50mDMw7GxvjKiLcXmoerw3
 rEBfVVJp3qpSofpNJZ3HH0ywcF3yUzq04/LPE9Tk2MoQ8NF0GzP8r9Ahke4B7cUj
 FysWNiAlC2IisEi6th313FZkTLx0zgewdsdEBTLt8eAE9TU0wamRbo99LZ8i/Qr3
 hk7jV8DzL+EDQJLgl4p1iPJgA708eW17tbCxLEa15VKVV6P58miohmhx/IfPO2Gx
 FV1PPehtItsmiK/UoRtUCoFdFsqNQtOE+h8DWLyy8RDmhBqGbn9Ut4euXiQIF+rX
 WJKPFfslCTR39BrBcZnZeNsgOCN7tEfFRstzjzkey1DaeTGWtxmA5UGhpC2vT74y
 YyXluvZlgKr4S64ABmcqQj++hQLho0OQAih3uW5YVxt4VxEUcXYMJOsV1AQGpMjF
 UnewWH5opBQdfw==
 =jFLf
 -----END PGP SIGNATURE-----

Merge tag 'irq-core-2024-05-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull interrupt subsystem updates from Thomas Gleixner:
 "Core code:

   - Interrupt storm detection for the lockup watchdog:

     Lockups which are caused by interrupt storms are not easy to debug
     because there is no information about the events which make the
     lockup detector trigger.

     To make this more user friendly, provide an extenstion to interrupt
     statistics which allows to take snapshots and an interface to
     retrieve the delta to the snapshot. Use this new mechanism in the
     watchdog code to do a two stage lockup analysis by taking the
     snapshot and printing the deltas for the topmost active interrupts
     on the second trigger.

     Note: This contains both the interrupt and the watchdog changes as
     the latter depend on the former obviously.

   - Avoid summation loops in the /proc/interrupts output and use the
     global counter when possible

   - Skip suspended interrupts on CPU hotplug operations to ensure that
     they are not delivered before the system resumes the device drivers
     when coming out of suspend.

   - On CPU hot-unplug interrupts which are affine to the outgoing CPU
     are migrated to a different CPU in the affinity mask. This can fail
     when the CPUs have no vectors left. Instead of giving up try to
     migrate it to any online CPU and thereby breaking the affinity
     setting in order to prevent a stale device interrupt which targets
     an offline CPU

   - The usual small cleanups

  Driver code:

   - Support for the RISCV AIA MSI controller

   - Make the interrupt allocation for the Loongson PCH controller more
     flexible to prevent vector exhaustion

   - The usual set of cleanups and fixes all over the place"

* tag 'irq-core-2024-05-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (51 commits)
  irqchip/gic-v3-its: Remove BUG_ON in its_vpe_irq_domain_alloc
  cpuidle: Avoid explicit cpumask allocation on stack
  irqchip/sifive-plic: Avoid explicit cpumask allocation on stack
  irqchip/riscv-aplic-direct: Avoid explicit cpumask allocation on stack
  irqchip/loongson-eiointc: Avoid explicit cpumask allocation on stack
  irqchip/gic-v3-its: Avoid explicit cpumask allocation on stack
  irqchip/irq-bcm6345-l1: Avoid explicit cpumask allocation on stack
  cpumask: Introduce cpumask_first_and_and()
  irqchip/irq-brcmstb-l2: Avoid saving mask on shutdown
  genirq: Reuse irq_is_nmi()
  genirq/cpuhotplug: Retry with cpu_online_mask when migration fails
  genirq/cpuhotplug: Skip suspended interrupts when restoring affinity
  arm64: dts: st: Add interrupt parent to pinctrl on stm32mp251
  arm64: dts: st: Add exti1 and exti2 nodes on stm32mp251
  ARM: dts: stm32: List exti parent interrupts on stm32mp131
  ARM: dts: stm32: List exti parent interrupts on stm32mp151
  arm64: Kconfig.platforms: Enable STM32_EXTI for ARCH_STM32
  irqchip/stm32-exti: Mark events reserved with RIF configuration check
  irqchip/stm32-exti: Skip secure events
  irqchip/stm32-exti: Convert driver to standard PM
  ...
2024-05-14 09:47:14 -07:00
Linus Torvalds
2d9db778dd Timers and timekeeping updates:
- Core code:
 
    - Make timekeeping and VDSO time readouts resilent against math overflow:
 
      In guest context the kernel is prone to math overflow when the host
      defers the timer interrupt due to overload, malfunction or malice.
 
      This can be mitigated by checking the clocksource delta for the
      maximum deferrement which is readily available. If that value is
      exceeded then the code uses a slowpath function which can handle the
      multiplication overflow.
 
      This functionality is enabled unconditionally in the kernel, but made
      conditional in the VDSO code. The latter is conditional because it
      allows architectures to optimize the check so it is not causing
      performance regressions.
 
      On X86 this is achieved by reworking the existing check for negative
      TSC deltas as a negative delta obviously exceeds the maximum
      deferrement when it is evaluated as an unsigned value. That avoids two
      conditionals in the hotpath and allows to hide both the negative delta
      and the large delta handling in the same slow path.
 
    - Add an initial minimal ktime_t abstraction for Rust
 
    - The usual boring cleanups and enhancements
 
  - Drivers:
 
    - Boring updates to device trees and trivial enhancements in various
      drivers.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmZBErUTHHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoZVhD/9iUPzcGNgqGqcO1bXy6dH4xLpeec6o
 2En1vg45DOaygN7DFxkoei20KJtfdFeaaEDH8UqmOfPcpLIuVAd0yqhgDQtx6ZcO
 XNd09SFDInzUt1Ot/WcoXp5N6Wt3vyEgUAlIN1fQdbaZ3fh6OhGhXXCRfiRCGXU1
 ea2pSunLuRf1pKU0AYhGIexnZMOHC4NmVXw/m+WNw5DJrmWB+OaNFKfMoQjtQ1HD
 Vgyr2RALHnIeXm60y2j3dD7TWGXICE/edzOd7pEyg5LFXsmcp388eu/DEdOq3OTV
 tsHLgIi05GJym3dykPBVwZk09M5oVNNfkg9zDxHWhSLkEJmc4QUaH3dgM8uBoaRW
 pS3LaO3ePxWmtAOdSNKFY6xnl6df+PYJoZcIF/GuXgty7im+VLK9C4M05mSjey00
 omcEywvmGdFezY6D9MmjjhFa+q2v9zpRjFpCWaIv3DQdAaDPrOzBk4SSqHZOV4lq
 +hp7ar1mTn1FPrXBouwyOgSOUANISV5cy/QuwOtrVIuVR4rWFVgfWo/7J32/q5Ik
 XBR0lTdQy1Biogf6xy0HCY+4wItOLTqEXXqeknHSMJpDzj5uZglZemgKbix1wVJ9
 8YlD85Q7sktlPmiLMKV9ra0MKVyXDoIrgt4hX98A8M12q9bNdw23x0p0jkJHwGha
 ZYUyX+XxKgOJug==
 =pL+S
 -----END PGP SIGNATURE-----

Merge tag 'timers-core-2024-05-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull timers and timekeeping updates from Thomas Gleixner:
 "Core code:

   - Make timekeeping and VDSO time readouts resilent against math
     overflow:

     In guest context the kernel is prone to math overflow when the host
     defers the timer interrupt due to overload, malfunction or malice.

     This can be mitigated by checking the clocksource delta for the
     maximum deferrement which is readily available. If that value is
     exceeded then the code uses a slowpath function which can handle
     the multiplication overflow.

     This functionality is enabled unconditionally in the kernel, but
     made conditional in the VDSO code. The latter is conditional
     because it allows architectures to optimize the check so it is not
     causing performance regressions.

     On X86 this is achieved by reworking the existing check for
     negative TSC deltas as a negative delta obviously exceeds the
     maximum deferrement when it is evaluated as an unsigned value. That
     avoids two conditionals in the hotpath and allows to hide both the
     negative delta and the large delta handling in the same slow path.

   - Add an initial minimal ktime_t abstraction for Rust

   - The usual boring cleanups and enhancements

  Drivers:

   - Boring updates to device trees and trivial enhancements in various
     drivers"

* tag 'timers-core-2024-05-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (33 commits)
  clocksource/drivers/arm_arch_timer: Mark hisi_161010101_oem_info const
  clocksource/drivers/timer-ti-dm: Remove an unused field in struct dmtimer
  clocksource/drivers/renesas-ostm: Avoid reprobe after successful early probe
  clocksource/drivers/renesas-ostm: Allow OSTM driver to reprobe for RZ/V2H(P) SoC
  dt-bindings: timer: renesas: ostm: Document Renesas RZ/V2H(P) SoC
  rust: time: doc: Add missing C header links
  clocksource: Make the int help prompt unit readable in ncurses
  hrtimer: Rename __hrtimer_hres_active() to hrtimer_hres_active()
  timerqueue: Remove never used function timerqueue_node_expires()
  rust: time: Add Ktime
  vdso: Fix powerpc build U64_MAX undeclared error
  clockevents: Convert s[n]printf() to sysfs_emit()
  clocksource: Convert s[n]printf() to sysfs_emit()
  clocksource: Make watchdog and suspend-timing multiplication overflow safe
  timekeeping: Let timekeeping_cycles_to_ns() handle both under and overflow
  timekeeping: Make delta calculation overflow safe
  timekeeping: Prepare timekeeping_cycles_to_ns() for overflow safety
  timekeeping: Fold in timekeeping_delta_to_ns()
  timekeeping: Consolidate timekeeping helpers
  timekeeping: Refactor timekeeping helpers
  ...
2024-05-14 09:27:40 -07:00
Zheng Yejian
e60b613df8 ftrace: Fix possible use-after-free issue in ftrace_location()
KASAN reports a bug:

  BUG: KASAN: use-after-free in ftrace_location+0x90/0x120
  Read of size 8 at addr ffff888141d40010 by task insmod/424
  CPU: 8 PID: 424 Comm: insmod Tainted: G        W          6.9.0-rc2+
  [...]
  Call Trace:
   <TASK>
   dump_stack_lvl+0x68/0xa0
   print_report+0xcf/0x610
   kasan_report+0xb5/0xe0
   ftrace_location+0x90/0x120
   register_kprobe+0x14b/0xa40
   kprobe_init+0x2d/0xff0 [kprobe_example]
   do_one_initcall+0x8f/0x2d0
   do_init_module+0x13a/0x3c0
   load_module+0x3082/0x33d0
   init_module_from_file+0xd2/0x130
   __x64_sys_finit_module+0x306/0x440
   do_syscall_64+0x68/0x140
   entry_SYSCALL_64_after_hwframe+0x71/0x79

The root cause is that, in lookup_rec(), ftrace record of some address
is being searched in ftrace pages of some module, but those ftrace pages
at the same time is being freed in ftrace_release_mod() as the
corresponding module is being deleted:

           CPU1                       |      CPU2
  register_kprobes() {                | delete_module() {
    check_kprobe_address_safe() {     |
      arch_check_ftrace_location() {  |
        ftrace_location() {           |
          lookup_rec() // USE!        |   ftrace_release_mod() // Free!

To fix this issue:
  1. Hold rcu lock as accessing ftrace pages in ftrace_location_range();
  2. Use ftrace_location_range() instead of lookup_rec() in
     ftrace_location();
  3. Call synchronize_rcu() before freeing any ftrace pages both in
     ftrace_process_locs()/ftrace_release_mod()/ftrace_free_mem().

Link: https://lore.kernel.org/linux-trace-kernel/20240509192859.1273558-1-zhengyejian1@huawei.com

Cc: stable@vger.kernel.org
Cc: <mhiramat@kernel.org>
Cc: <mark.rutland@arm.com>
Cc: <mathieu.desnoyers@efficios.com>
Fixes: ae6aa16fdc ("kprobes: introduce ftrace based optimization")
Suggested-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Zheng Yejian <zhengyejian1@huawei.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-05-14 11:13:27 -04:00
Mike Rapoport (IBM)
2c9e5d4a00 bpf: remove CONFIG_BPF_JIT dependency on CONFIG_MODULES of
BPF just-in-time compiler depended on CONFIG_MODULES because it used
module_alloc() to allocate memory for the generated code.

Since code allocations are now implemented with execmem, drop dependency of
CONFIG_BPF_JIT on CONFIG_MODULES and make it select CONFIG_EXECMEM.

Suggested-by: Björn Töpel <bjorn@kernel.org>
Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
2024-05-14 00:36:29 -07:00
Mike Rapoport (IBM)
7582b7be16 kprobes: remove dependency on CONFIG_MODULES
kprobes depended on CONFIG_MODULES because it has to allocate memory for
code.

Since code allocations are now implemented with execmem, kprobes can be
enabled in non-modular kernels.

Add #ifdef CONFIG_MODULE guards for the code dealing with kprobes inside
modules, make CONFIG_KPROBES select CONFIG_EXECMEM and drop the
dependency of CONFIG_KPROBES on CONFIG_MODULES.

Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
[mcgrof: rebase in light of NEED_TASKS_RCU ]
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
2024-05-14 00:35:06 -07:00
Mike Rapoport (IBM)
223b5e57d0 mm/execmem, arch: convert remaining overrides of module_alloc to execmem
Extend execmem parameters to accommodate more complex overrides of
module_alloc() by architectures.

This includes specification of a fallback range required by arm, arm64
and powerpc, EXECMEM_MODULE_DATA type required by powerpc, support for
allocation of KASAN shadow required by s390 and x86 and support for
late initialization of execmem required by arm64.

The core implementation of execmem_alloc() takes care of suppressing
warnings when the initial allocation fails but there is a fallback range
defined.

Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org>
Acked-by: Will Deacon <will@kernel.org>
Acked-by: Song Liu <song@kernel.org>
Tested-by: Liviu Dudau <liviu@dudau.co.uk>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
2024-05-14 00:31:43 -07:00
Mike Rapoport (IBM)
12af2b83d0 mm: introduce execmem_alloc() and execmem_free()
module_alloc() is used everywhere as a mean to allocate memory for code.

Beside being semantically wrong, this unnecessarily ties all subsystems
that need to allocate code, such as ftrace, kprobes and BPF to modules and
puts the burden of code allocation to the modules code.

Several architectures override module_alloc() because of various
constraints where the executable memory can be located and this causes
additional obstacles for improvements of code allocation.

Start splitting code allocation from modules by introducing execmem_alloc()
and execmem_free() APIs.

Initially, execmem_alloc() is a wrapper for module_alloc() and
execmem_free() is a replacement of module_memfree() to allow updating all
call sites to use the new APIs.

Since architectures define different restrictions on placement,
permissions, alignment and other parameters for memory that can be used by
different subsystems that allocate executable memory, execmem_alloc() takes
a type argument, that will be used to identify the calling subsystem and to
allow architectures define parameters for ranges suitable for that
subsystem.

No functional changes.

Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Acked-by: Song Liu <song@kernel.org>
Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
2024-05-14 00:31:43 -07:00
Mike Rapoport (IBM)
bc6b94d3ea module: make module_memory_{alloc,free} more self-contained
Move the logic related to the memory allocation and freeing into
module_memory_alloc() and module_memory_free().

Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org>
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Acked-by: Song Liu <song@kernel.org>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
2024-05-14 00:31:43 -07:00
Justin Stitt
086437d94a kallsyms: replace deprecated strncpy with strscpy
strncpy() is deprecated for use on NUL-terminated destination strings
[1] and as such we should prefer more robust and less ambiguous string
interfaces. The goal is to remove its use completely [2].

namebuf is eventually cleaned of any trailing llvm suffixes using
strstr(). This hints that namebuf should be NUL-terminated.

static void cleanup_symbol_name(char *s)
{
	char *res;
	...
	res = strstr(s, ".llvm.");
	...
}

Due to this, use strscpy() over strncpy() as it guarantees
NUL-termination on the destination buffer. Drop the -1 from the length
calculation as it is no longer needed to ensure NUL-termination.

Link: https://www.kernel.org/doc/html/latest/process/deprecated.html#strncpy-on-nul-terminated-strings [1]
Link: https://manpages.debian.org/testing/linux-manual-4.8/strscpy.9.en.html
Link: https://github.com/KSPP/linux/issues/90 [2]
Cc: linux-hardening@vger.kernel.org
Signed-off-by: Justin Stitt <justinstitt@google.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
2024-05-14 00:31:43 -07:00
Yifan Hong
8d0b728840 module: allow UNUSED_KSYMS_WHITELIST to be relative against objtree.
If UNUSED_KSYMS_WHITELIST is a file generated
before Kbuild runs, and the source tree is in
a read-only filesystem, the developer must put
the file somewhere and specify an absolute
path to UNUSED_KSYMS_WHITELIST. This worked,
but if IKCONFIG=y, an absolute path is embedded
into .config and eventually into vmlinux, causing
the build to be less reproducible when building
on a different machine.

This patch makes the handling of
UNUSED_KSYMS_WHITELIST to be similar to
MODULE_SIG_KEY.

First, check if UNUSED_KSYMS_WHITELIST is an
absolute path, just as before this patch. If so,
use the path as is.

If it is a relative path, use wildcard to check
the existence of the file below objtree first.
If it does not exist, fall back to the original
behavior of adding $(srctree)/ before the value.

After this patch, the developer can put the generated
file in objtree, then use a relative path against
objtree in .config, eradicating any absolute paths
that may be evaluated differently on different machines.

Signed-off-by: Yifan Hong <elsk@google.com>
Reviewed-by: Elliot Berman <quic_eberman@quicinc.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
2024-05-14 00:31:43 -07:00
Dr. David Alan Gilbert
d2cc859cc8 ftrace: Remove unused global 'ftrace_direct_func_count'
Commit 8788ca164e ("ftrace: Remove the legacy _ftrace_direct API")
stopped setting the 'ftrace_direct_func_count' variable, but left
it around.  Clean it up.

Link: https://lore.kernel.org/linux-trace-kernel/20240506233305.215735-1-linux@treblig.org

Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-05-14 02:09:40 -04:00
Dr. David Alan Gilbert
c9d5b7b826 ftrace: Remove unused list 'ftrace_direct_funcs'
Commit 8788ca164e ("ftrace: Remove the legacy _ftrace_direct API")
stopped using 'ftrace_direct_funcs' (and the associated
struct ftrace_direct_func).  Remove them.

Build tested only (on x86-64 with FTRACE and DYNAMIC_FTRACE
enabled)

Link: https://lore.kernel.org/linux-trace-kernel/20240504132303.67538-1-linux@treblig.org

Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-05-14 02:09:00 -04:00
Linus Torvalds
6e5a0c30b6 Scheduler changes for v6.10:
- Add cpufreq pressure feedback for the scheduler
 
  - Rework misfit load-balancing wrt. affinity restrictions
 
  - Clean up and simplify the code around ::overutilized and
    ::overload access.
 
  - Simplify sched_balance_newidle()
 
  - Bump SCHEDSTAT_VERSION to 16 due to a cleanup of CPU_MAX_IDLE_TYPES
    handling that changed the output.
 
  - Rework & clean up <asm/vtime.h> interactions wrt. arch_vtime_task_switch()
 
  - Reorganize, clean up and unify most of the higher level
    scheduler balancing function names around the sched_balance_*()
    prefix.
 
  - Simplify the balancing flag code (sched_balance_running)
 
  - Miscellaneous cleanups & fixes
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmZBtA0RHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1gQEw//WiCiV7zTlWShSiG/g8GTfoAvl53QTWXF
 0jQ8TUcoIhxB5VeGgxVG1srYt8f505UXjH7L0MJLrbC3nOgRCg4NK57WiQEachKK
 HORIJHT0tMMsKIwX9D5Ovo4xYJn+j7mv7j/caB+hIlzZAbWk+zZPNWcS84p0ZS/4
 appY6RIcp7+cI7bisNMGUuNZS14+WMdWoX3TgoI6ekgDZ7Ky+kQvkwGEMBXsNElO
 qZOj6yS/QUE4Htwz0tVfd6h5svoPM/VJMIvl0yfddPGurfNw6jEh/fjcXnLdAzZ6
 9mgcosETncQbm0vfSac116lrrZIR9ygXW/yXP5S7I5dt+r+5pCrBZR2E5g7U4Ezp
 GjX1+6J9U6r6y12AMLRjadFOcDvxdwtszhZq4/wAcmS3B9dvupnH/w7zqY9ho3wr
 hTdtDHoAIzxJh7RNEHgeUC0/yQX3wJ9THzfYltDRIIjHTuvl4d5lHgsug+4Y9ClE
 pUIQm/XKouweQN9TZz2ULle4ZhRrR9sM9QfZYfirJ/RppmuKool4riWyQFQNHLCy
 mBRMjFFsTpFIOoZXU6pD4EabOpWdNrRRuND/0yg3WbDat2gBWq6jvSFv2UN1/v7i
 Un5jijTuN7t8yP5lY5Tyf47kQfLlA9bUx1v56KnF9mrpI87FyiDD3MiQVhDsvpGX
 rP96BIOrkSo=
 =obph
 -----END PGP SIGNATURE-----

Merge tag 'sched-core-2024-05-13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull scheduler updates from Ingo Molnar:

 - Add cpufreq pressure feedback for the scheduler

 - Rework misfit load-balancing wrt affinity restrictions

 - Clean up and simplify the code around ::overutilized and
   ::overload access.

 - Simplify sched_balance_newidle()

 - Bump SCHEDSTAT_VERSION to 16 due to a cleanup of CPU_MAX_IDLE_TYPES
   handling that changed the output.

 - Rework & clean up <asm/vtime.h> interactions wrt arch_vtime_task_switch()

 - Reorganize, clean up and unify most of the higher level
   scheduler balancing function names around the sched_balance_*()
   prefix

 - Simplify the balancing flag code (sched_balance_running)

 - Miscellaneous cleanups & fixes

* tag 'sched-core-2024-05-13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (50 commits)
  sched/pelt: Remove shift of thermal clock
  sched/cpufreq: Rename arch_update_thermal_pressure() => arch_update_hw_pressure()
  thermal/cpufreq: Remove arch_update_thermal_pressure()
  sched/cpufreq: Take cpufreq feedback into account
  cpufreq: Add a cpufreq pressure feedback for the scheduler
  sched/fair: Fix update of rd->sg_overutilized
  sched/vtime: Do not include <asm/vtime.h> header
  s390/irq,nmi: Include <asm/vtime.h> header directly
  s390/vtime: Remove unused __ARCH_HAS_VTIME_TASK_SWITCH leftover
  sched/vtime: Get rid of generic vtime_task_switch() implementation
  sched/vtime: Remove confusing arch_vtime_task_switch() declaration
  sched/balancing: Simplify the sg_status bitmask and use separate ->overloaded and ->overutilized flags
  sched/fair: Rename set_rd_overutilized_status() to set_rd_overutilized()
  sched/fair: Rename SG_OVERLOAD to SG_OVERLOADED
  sched/fair: Rename {set|get}_rd_overload() to {set|get}_rd_overloaded()
  sched/fair: Rename root_domain::overload to ::overloaded
  sched/fair: Use helper functions to access root_domain::overload
  sched/fair: Check root_domain::overload value before update
  sched/fair: Combine EAS check with root_domain::overutilized access
  sched/fair: Simplify the continue_balancing logic in sched_balance_newidle()
  ...
2024-05-13 17:18:51 -07:00
Linus Torvalds
17ca7fc22f Perf events changes for v6.10:
- Combine perf and BPF for fast evalution of HW breakpoint
    conditions.
 
  - Add LBR capture support outside of hardware events
 
  - Trigger IO signals for watermark_wakeup
 
  - Add RAPL support for Intel Arrow Lake and Lunar Lake
 
  - Optimize frequency-throttling
 
  - Miscellaneous cleanups & fixes.
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmZBsC8RHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1izyxAAo7yOdhk9q+y2YWlKx2FmxUlZ8vlxBDRT
 22bIN2d1ADrRS2IMsXC2/PhLnw0RNMCjBf6vyXi1hrMMK2zjuCFet5WDN8NboWEp
 hMdUSv1ODf5vb2I8frYS9X4jPtXDKSpIBR9e3E7iFYU6vj3BUXLSXnfXFjRsLU8i
 BG1k4apAWkDw0UjwQsRdxOoTFxp17idO3Ruz0/ksXleO/0aR0WR68tGO2WS1Hz95
 mBhdjudekpWgT8VktGPrXsgUU3jqywTx04zFkWS36+IqDqNeNMPmePC7hqohlvv4
 ZEPg6XrjdFmcDE6nc2YFYLD9njLDbdKPLeGTEtSNFSAmHYqV8W+UFlNa6hlXEE7n
 KFnvJ8zLymW/UQGaPsIcqqTSXkGKuTsUZJO+QK/VF+sK7VpMJtwTaUliSlN7zQtF
 6HDBjp4sLB3NW16AN/M65LjpqyLdRxD7tvXoPLTt9mOVQt41ckv2Tfe2m6hg9OVQ
 qFzEdhgXxOUMyO9ifEX4HC2sBkKee4Jt76SLkpdr6kuuqlTRisIVdhlJ7yjK9/Rk
 RbuK/4eqL1p/o4GFAPP8gQjfdMSWatOZzxpE4V1cnzEdGjwuUMPJrbYPiAkgHskO
 HpzXtY+xFbAiaDanW1kUmwlqO8yO18WvdUem+SRRlFvbeE+grmgmtRZecNOi7mgg
 MlKdr1a4mV8=
 =r0yr
 -----END PGP SIGNATURE-----

Merge tag 'perf-core-2024-05-13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull perf events updates from Ingo Molnar:

 - Combine perf and BPF for fast evalution of HW breakpoint
   conditions

 - Add LBR capture support outside of hardware events

 - Trigger IO signals for watermark_wakeup

 - Add RAPL support for Intel Arrow Lake and Lunar Lake

 - Optimize frequency-throttling

 - Miscellaneous cleanups & fixes

* tag 'perf-core-2024-05-13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (21 commits)
  perf/bpf: Mark perf_event_set_bpf_handler() and perf_event_free_bpf_handler() as inline too
  selftests/perf_events: Test FASYNC with watermark wakeups
  perf/ring_buffer: Trigger IO signals for watermark_wakeup
  perf: Move perf_event_fasync() to perf_event.h
  perf/bpf: Change the !CONFIG_BPF_SYSCALL stubs to static inlines
  selftest/bpf: Test a perf BPF program that suppresses side effects
  perf/bpf: Allow a BPF program to suppress all sample side effects
  perf/bpf: Remove unneeded uses_default_overflow_handler()
  perf/bpf: Call BPF handler directly, not through overflow machinery
  perf/bpf: Remove #ifdef CONFIG_BPF_SYSCALL from struct perf_event members
  perf/bpf: Create bpf_overflow_handler() stub for !CONFIG_BPF_SYSCALL
  perf/bpf: Reorder bpf_overflow_handler() ahead of __perf_event_overflow()
  perf/x86/rapl: Add support for Intel Lunar Lake
  perf/x86/rapl: Add support for Intel Arrow Lake
  perf/core: Reduce PMU access to adjust sample freq
  perf/core: Optimize perf_adjust_freq_unthr_context()
  perf/x86/amd: Don't reject non-sampling events with configured LBR
  perf/x86/amd: Support capturing LBR from software events
  perf/x86/amd: Avoid taking branches before disabling LBR
  perf/x86/amd: Ensure amd_pmu_core_disable_all() is always inlined
  ...
2024-05-13 17:13:47 -07:00
Linus Torvalds
48fc82c40b Locking changes for v6.10:
- Over a dozen code generation micro-optimizations for the atomic
    and spinlock code.
 
  - Add more __ro_after_init attributes
 
  - Robustify the lockdevent_*() macros
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmZBrMMRHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1gSuA//YyLRTCGtH6d/fCudlzzoa14MHO/QiCv7
 lgmq3Vqif/m+MW7LwQJbLrxDPJPT1mE9Ol9woOc133Cj1QZhF/HQvDAKT9ZpMoXU
 d8U3kuZ7tN41TJuQx6vNSCv3w5ToKeXaQJGxiT6od2Y/0QlhUKhVBSBQVtyc/ma6
 o1Uhq1Qp5KPj928jiqwI0JCZJFqqLvzq/rIT38V05phHEPet4GbLMbz9ZTsw70pm
 xmLzGLXJQ9maziuVcmRUrctsAkbk+VhChQ9p4HrH6AcYPwyQoF+zJr7iocyzIMG2
 xQqhEYShI72lcRft8hZwlrLTKZJWSAkDIxIxaQ2egzsNBwBPbRpP0mUIz3qbwJxQ
 fqzKGxwDmxjiX1Ib4gIVje66hp2QpPX5G1ARoeKvbrHkXxzqVuFlaQBn1+OAQ/GV
 mNzKADxrjalhyiMksHXbEbUNEvXCGqC2N9AOWT6XNvpLDqTJBz/wB+f9cbx3gYEO
 9rXwVicWXLzUnEfbRaEjCrDeMEHMLqhaZIndgCx07JpFkkTtKLD1N9tBxFPNH+SP
 XK7SAsXrxwhBjGbWItfF4eOaPCey+/+kGhOPadfTg3g9zDjEBvX/YNBBw9q2CUWc
 JWd/gct+/Jnnkh1jdIj9yRF2xciVY+iOshHRzG+clo/PhRTwv+DwfMJ/uzn+oaSF
 vOT+exKA8bg=
 =rT48
 -----END PGP SIGNATURE-----

Merge tag 'locking-core-2024-05-13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull locking updates from Ingo Molnar:

 - Over a dozen code generation micro-optimizations for the atomic
   and spinlock code

 - Add more __ro_after_init attributes

 - Robustify the lockdevent_*() macros

* tag 'locking-core-2024-05-13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  locking/pvqspinlock/x86: Use _Q_LOCKED_VAL in PV_UNLOCK_ASM macro
  locking/qspinlock/x86: Micro-optimize virt_spin_lock()
  locking/atomic/x86: Merge __arch{,_try}_cmpxchg64_emu_local() with __arch{,_try}_cmpxchg64_emu()
  locking/atomic/x86: Introduce arch_try_cmpxchg64_local()
  locking/pvqspinlock/x86: Remove redundant CMP after CMPXCHG in __raw_callee_save___pv_queued_spin_unlock()
  locking/pvqspinlock: Use try_cmpxchg() in qspinlock_paravirt.h
  locking/pvqspinlock: Use try_cmpxchg_acquire() in trylock_clear_pending()
  locking/qspinlock: Use atomic_try_cmpxchg_relaxed() in xchg_tail()
  locking/atomic/x86: Define arch_atomic_sub() family using arch_atomic_add() functions
  locking/atomic/x86: Rewrite x86_32 arch_atomic64_{,fetch}_{and,or,xor}() functions
  locking/atomic/x86: Introduce arch_atomic64_read_nonatomic() to x86_32
  locking/atomic/x86: Introduce arch_atomic64_try_cmpxchg() to x86_32
  locking/atomic/x86: Introduce arch_try_cmpxchg64() for !CONFIG_X86_CMPXCHG64
  locking/atomic/x86: Modernize x86_32 arch_{,try_}_cmpxchg64{,_local}()
  locking/atomic/x86: Correct the definition of __arch_try_cmpxchg128()
  x86/tsc: Make __use_tsc __ro_after_init
  x86/kvm: Make kvm_async_pf_enabled __ro_after_init
  context_tracking: Make context_tracking_key __ro_after_init
  jump_label,module: Don't alloc static_key_mod for __ro_after_init keys
  locking/qspinlock: Always evaluate lockevent* non-event parameter once
2024-05-13 17:01:28 -07:00
Thorsten Blum
347bd7f072 tracing: Improve benchmark test performance by using do_div()
Partially revert commit d6cb38e108 ("tracing: Use div64_u64() instead
of do_div()") and use do_div() again to utilize its faster 64-by-32
division compared to the 64-by-64 division done by div64_u64().

Explicitly cast the divisor bm_cnt to u32 to prevent a Coccinelle
warning reported by do_div.cocci. The warning was removed with commit
d6cb38e108 ("tracing: Use div64_u64() instead of do_div()").

Using the faster 64-by-32 division and casting bm_cnt to u32 is safe
because we return early from trace_do_benchmark() if bm_cnt > UINT_MAX.
This approach is already used twice in trace_do_benchmark() when
calculating the standard deviation:

	do_div(stddev, (u32)bm_cnt);
	do_div(stddev, (u32)bm_cnt - 1);

Link: https://lore.kernel.org/linux-trace-kernel/20240329160229.4874-2-thorsten.blum@toblux.com

Signed-off-by: Thorsten Blum <thorsten.blum@toblux.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-05-13 20:00:57 -04:00
Steven Rostedt (Google)
fe832be05a ring-buffer: Have mmapped ring buffer keep track of missed events
While testing libtracefs on the mmapped ring buffer, the test that checks
if missed events are accounted for failed when using the mapped buffer.
This is because the mapped page does not update the missed events that
were dropped because the writer filled up the ring buffer before the
reader could catch it.

Add the missed events to the reader page/sub-buffer when the IOCTL is done
and a new reader page is acquired.

Note that all accesses to the reader_page via rb_page_commit() had to be
switched to rb_page_size(), and rb_page_size() which was just a copy of
rb_page_commit() but now it masks out the RB_MISSED bits. This is needed
as the mapped reader page is still active in the ring buffer code and
where it reads the commit field of the bpage for the size, it now must
mask it otherwise the missed bits that are now set will corrupt the size
returned.

Link: https://lore.kernel.org/linux-trace-kernel/20240312175405.12fb6726@gandalf.local.home

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Vincent Donnefort <vdonnefort@google.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-05-13 19:55:22 -04:00
Jakub Kicinski
6e62702feb bpf-next-for-netdev
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTFp0I1jqZrAX+hPRXbK58LschIgwUCZkGcZAAKCRDbK58LschI
 g6o6APwLsqhrM2w71VUN5ciCxu4H5VDtZp6wkdqtVbxxU4qNxQEApKgYgKt8ZLF3
 Kily5c7m+S4ZXhMX21rb8JhSAz0dfQk=
 =5Dk7
 -----END PGP SIGNATURE-----

Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next

Daniel Borkmann says:

====================
pull-request: bpf-next 2024-05-13

We've added 119 non-merge commits during the last 14 day(s) which contain
a total of 134 files changed, 9462 insertions(+), 4742 deletions(-).

The main changes are:

1) Add BPF JIT support for 32-bit ARCv2 processors, from Shahab Vahedi.

2) Add BPF range computation improvements to the verifier in particular
   around XOR and OR operators, refactoring of checks for range computation
   and relaxing MUL range computation so that src_reg can also be an unknown
   scalar, from Cupertino Miranda.

3) Add support to attach kprobe BPF programs through kprobe_multi link in
   a session mode, meaning, a BPF program is attached to both function entry
   and return, the entry program can decide if the return program gets
   executed and the entry program can share u64 cookie value with return
   program. Session mode is a common use-case for tetragon and bpftrace,
   from Jiri Olsa.

4) Fix a potential overflow in libbpf's ring__consume_n() and improve libbpf
   as well as BPF selftest's struct_ops handling, from Andrii Nakryiko.

5) Improvements to BPF selftests in context of BPF gcc backend,
   from Jose E. Marchesi & David Faust.

6) Migrate remaining BPF selftest tests from test_sock_addr.c to prog_test-
   -style in order to retire the old test, run it in BPF CI and additionally
   expand test coverage, from Jordan Rife.

7) Big batch for BPF selftest refactoring in order to remove duplicate code
   around common network helpers, from Geliang Tang.

8) Another batch of improvements to BPF selftests to retire obsolete
   bpf_tcp_helpers.h as everything is available vmlinux.h,
   from Martin KaFai Lau.

9) Fix BPF map tear-down to not walk the map twice on free when both timer
   and wq is used, from Benjamin Tissoires.

10) Fix BPF verifier assumptions about socket->sk that it can be non-NULL,
    from Alexei Starovoitov.

11) Change BTF build scripts to using --btf_features for pahole v1.26+,
    from Alan Maguire.

12) Small improvements to BPF reusing struct_size() and krealloc_array(),
    from Andy Shevchenko.

13) Fix s390 JIT to emit a barrier for BPF_FETCH instructions,
    from Ilya Leoshkevich.

14) Extend TCP ->cong_control() callback in order to feed in ack and
    flag parameters and allow write-access to tp->snd_cwnd_stamp
    from BPF program, from Miao Xu.

15) Add support for internal-only per-CPU instructions to inline
    bpf_get_smp_processor_id() helper call for arm64 and riscv64 BPF JITs,
    from Puranjay Mohan.

16) Follow-up to remove the redundant ethtool.h from tooling infrastructure,
    from Tushar Vyavahare.

17) Extend libbpf to support "module:<function>" syntax for tracing
    programs, from Viktor Malik.

* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (119 commits)
  bpf: make list_for_each_entry portable
  bpf: ignore expected GCC warning in test_global_func10.c
  bpf: disable strict aliasing in test_global_func9.c
  selftests/bpf: Free strdup memory in xdp_hw_metadata
  selftests/bpf: Fix a few tests for GCC related warnings.
  bpf: avoid gcc overflow warning in test_xdp_vlan.c
  tools: remove redundant ethtool.h from tooling infra
  selftests/bpf: Expand ATTACH_REJECT tests
  selftests/bpf: Expand getsockname and getpeername tests
  sefltests/bpf: Expand sockaddr hook deny tests
  selftests/bpf: Expand sockaddr program return value tests
  selftests/bpf: Retire test_sock_addr.(c|sh)
  selftests/bpf: Remove redundant sendmsg test cases
  selftests/bpf: Migrate ATTACH_REJECT test cases
  selftests/bpf: Migrate expected_attach_type tests
  selftests/bpf: Migrate wildcard destination rewrite test
  selftests/bpf: Migrate sendmsg6 v4 mapped address tests
  selftests/bpf: Migrate sendmsg deny test cases
  selftests/bpf: Migrate WILDCARD_IP test
  selftests/bpf: Handle SYSCALL_EPERM and SYSCALL_ENOTSUPP test cases
  ...
====================

Link: https://lore.kernel.org/r/20240513134114.17575-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-13 16:41:10 -07:00
Paul E. McKenney
33f137143e ftrace: Use asynchronous grace period for register_ftrace_direct()
When running heavy test workloads with KASAN enabled, RCU Tasks grace
periods can extend for many tens of seconds, significantly slowing
trace registration.  Therefore, make the registration-side RCU Tasks
grace period be asynchronous via call_rcu_tasks().

Link: https://lore.kernel.org/linux-trace-kernel/ac05be77-2972-475b-9b57-56bef15aa00a@paulmck-laptop

Reported-by: Jakub Kicinski <kuba@kernel.org>
Reported-by: Alexei Starovoitov <ast@kernel.org>
Reported-by: Chris Mason <clm@fb.com>
Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-05-13 19:36:40 -04:00
Yuran Pereira
c5963a0990 ftrace: Replaces simple_strtoul in ftrace
The function simple_strtoul performs no error checking in scenarios
where the input value overflows the intended output variable.
This results in this function successfully returning, even when the
output does not match the input string (aka the function returns
successfully even when the result is wrong).

Or as it was mentioned [1], "...simple_strtol(), simple_strtoll(),
simple_strtoul(), and simple_strtoull() functions explicitly ignore
overflows, which may lead to unexpected results in callers."
Hence, the use of those functions is discouraged.

This patch replaces all uses of the simple_strtoul with the safer
alternatives kstrtoul and kstruint.

Callers affected:
- add_rec_by_index
- set_graph_max_depth_function

Side effects of this patch:
- Since `fgraph_max_depth` is an `unsigned int`, this patch uses
  kstrtouint instead of kstrtoul to avoid any compiler warnings
  that could originate from calling the latter.
- This patch ensures that the callers of kstrtou* return accordingly
  when kstrtoul and kstruint fail for some reason.
  In this case, both callers this patch is addressing return 0 on error.

[1] https://www.kernel.org/doc/html/latest/process/deprecated.html#simple-strtol-simple-strtoll-simple-strtoul-simple-strtoull

Link: https://lore.kernel.org/linux-trace-kernel/GV1PR10MB656333529A8D7B8AFB28D238E8B4A@GV1PR10MB6563.EURPRD10.PROD.OUTLOOK.COM

Signed-off-by: Yuran Pereira <yuran.pereira@hotmail.com>
Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-05-13 19:36:40 -04:00
Vincent Donnefort
cf9f0f7c4c tracing: Allow user-space mapping of the ring-buffer
Currently, user-space extracts data from the ring-buffer via splice,
which is handy for storage or network sharing. However, due to splice
limitations, it is imposible to do real-time analysis without a copy.

A solution for that problem is to let the user-space map the ring-buffer
directly.

The mapping is exposed via the per-CPU file trace_pipe_raw. The first
element of the mapping is the meta-page. It is followed by each
subbuffer constituting the ring-buffer, ordered by their unique page ID:

  * Meta-page -- include/uapi/linux/trace_mmap.h for a description
  * Subbuf ID 0
  * Subbuf ID 1
     ...

It is therefore easy to translate a subbuf ID into an offset in the
mapping:

  reader_id = meta->reader->id;
  reader_offset = meta->meta_page_size + reader_id * meta->subbuf_size;

When new data is available, the mapper must call a newly introduced ioctl:
TRACE_MMAP_IOCTL_GET_READER. This will update the Meta-page reader ID to
point to the next reader containing unread data.

Mapping will prevent snapshot and buffer size modifications.

Link: https://lore.kernel.org/linux-trace-kernel/20240510140435.3550353-4-vdonnefort@google.com

CC: <linux-mm@kvack.org>
Signed-off-by: Vincent Donnefort <vdonnefort@google.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-05-13 18:09:56 -04:00
Vincent Donnefort
117c39200d ring-buffer: Introducing ring-buffer mapping functions
In preparation for allowing the user-space to map a ring-buffer, add
a set of mapping functions:

  ring_buffer_{map,unmap}()

And controls on the ring-buffer:

  ring_buffer_map_get_reader()  /* swap reader and head */

Mapping the ring-buffer also involves:

  A unique ID for each subbuf of the ring-buffer, currently they are
  only identified through their in-kernel VA.

  A meta-page, where are stored ring-buffer statistics and a
  description for the current reader

The linear mapping exposes the meta-page, and each subbuf of the
ring-buffer, ordered following their unique ID, assigned during the
first mapping.

Once mapped, no subbuf can get in or out of the ring-buffer: the buffer
size will remain unmodified and the splice enabling functions will in
reality simply memcpy the data instead of swapping subbufs.

Link: https://lore.kernel.org/linux-trace-kernel/20240510140435.3550353-3-vdonnefort@google.com

CC: <linux-mm@kvack.org>
Signed-off-by: Vincent Donnefort <vdonnefort@google.com>
Acked-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-05-13 18:09:56 -04:00
Vincent Donnefort
c09d4167b5 ring-buffer: Allocate sub-buffers with __GFP_COMP
In preparation for the ring-buffer memory mapping, allocate compound
pages for the ring-buffer sub-buffers to enable us to map them to
user-space with vm_insert_pages().

Link: https://lore.kernel.org/linux-trace-kernel/20240510140435.3550353-2-vdonnefort@google.com

Acked-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Vincent Donnefort <vdonnefort@google.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-05-13 18:09:55 -04:00
Linus Torvalds
84c7d76b5a This update includes the following changes:
API:
 
 - Remove crypto stats interface.
 
 Algorithms:
 
 - Add faster AES-XTS on modern x86_64 CPUs.
 - Forbid curves with order less than 224 bits in ecc (FIPS 186-5).
 - Add ECDSA NIST P521.
 
 Drivers:
 
 - Expose otp zone in atmel.
 - Add dh fallback for primes > 4K in qat.
 - Add interface for live migration in qat.
 - Use dma for aes requests in starfive.
 - Add full DMA support for stm32mpx in stm32.
 - Add Tegra Security Engine driver.
 
 Others:
 
 - Introduce scope-based x509_certificate allocation.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEn51F/lCuNhUwmDeSxycdCkmxi6cFAmZBjXMACgkQxycdCkmx
 i6cQ7g/+JPKnzQedhpJSK5AnkAkqO9kJ16JdeB7AtdSeZZA/EIFxuXZ3Fv1fH44y
 1CCibowc5zdss8F/1iOqPc57u5vy2Mjyw8qlhs7JlmcYf/lo7CBGfT8Uxo7BK/S9
 n+/+y47Xu5p3yt/c6ldrwqjOaWaYuaCKICZtS91XVvrxM80iVnmDSQCNkcch4KQ4
 nsdcVJhS4lOStBNjKtkhWlgufqdp8RPzKYH2B6GbW9z6en8WeTbnoMhgqjqQ3UID
 /DHtixyee0MDUDReQrixyCM3XMV5er/qBMoDrCxipBuVrr4GMd2GlCEaZbXfTUW0
 3K8Nle4KMMqi81lBAQKiD/hRjrC68FHOvVRGHtZntR0+NZ/nlinXCVWv4iHwRzAB
 7BOqRTC3mfv+uMhTvgwQAkXCHAhivMokSzTaDCIrzPLjKIx2BOfVZKmPBt98LxeW
 8/JfgEK4gX6wxe4GRftueEApCfWQrwYK60j5bIkescaJ/mI7M5bEByvTTob1lAka
 Fw5kGDy8dVnrG9HagLwnXoI1pIGmca8hV1t24Vf1OCdWLgOW+GTCIuyutL2c9AWv
 0vEbytGZl69XJlIgQGVcv9RM6NlIXxHwfSHU59N/SHTXhlHjm1XWi3HCiJaZ1b6+
 pcILMJ29FMs8LobiN7PT+rNu6fboaH0/o+R7OK9mKRut864xFTk=
 =NDS0
 -----END PGP SIGNATURE-----

Merge tag 'v6.10-p1' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6

Pull crypto updates from Herbert Xu:
 "API:
   - Remove crypto stats interface

  Algorithms:
   - Add faster AES-XTS on modern x86_64 CPUs
   - Forbid curves with order less than 224 bits in ecc (FIPS 186-5)
   - Add ECDSA NIST P521

  Drivers:
   - Expose otp zone in atmel
   - Add dh fallback for primes > 4K in qat
   - Add interface for live migration in qat
   - Use dma for aes requests in starfive
   - Add full DMA support for stm32mpx in stm32
   - Add Tegra Security Engine driver

  Others:
   - Introduce scope-based x509_certificate allocation"

* tag 'v6.10-p1' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: (123 commits)
  crypto: atmel-sha204a - provide the otp content
  crypto: atmel-sha204a - add reading from otp zone
  crypto: atmel-i2c - rename read function
  crypto: atmel-i2c - add missing arg description
  crypto: iaa - Use kmemdup() instead of kzalloc() and memcpy()
  crypto: sahara - use 'time_left' variable with wait_for_completion_timeout()
  crypto: api - use 'time_left' variable with wait_for_completion_killable_timeout()
  crypto: caam - i.MX8ULP donot have CAAM page0 access
  crypto: caam - init-clk based on caam-page0-access
  crypto: starfive - Use fallback for unaligned dma access
  crypto: starfive - Do not free stack buffer
  crypto: starfive - Skip unneeded fallback allocation
  crypto: starfive - Skip dma setup for zeroed message
  crypto: hisilicon/sec2 - fix for register offset
  crypto: hisilicon/debugfs - mask the unnecessary info from the dump
  crypto: qat - specify firmware files for 402xx
  crypto: x86/aes-gcm - simplify GCM hash subkey derivation
  crypto: x86/aes-gcm - delete unused GCM assembly code
  crypto: x86/aes-xts - simplify loop in xts_crypt_slowpath()
  hwrng: stm32 - repair clock handling
  ...
2024-05-13 14:53:05 -07:00
Linus Torvalds
87caef4220 hardening updates for 6.10-rc1
- selftests: Add str*cmp tests (Ivan Orlov)
 
 - __counted_by: provide UAPI for _le/_be variants (Erick Archer)
 
 - Various strncpy deprecation refactors (Justin Stitt)
 
 - stackleak: Use a copy of soon-to-be-const sysctl table (Thomas Weißschuh)
 
 - UBSAN: Work around i386 -regparm=3 bug with Clang prior to version 19
 
 - Provide helper to deal with non-NUL-terminated string copying
 
 - SCSI: Fix older string copying bugs (with new helper)
 
 - selftests: Consolidate string helper behavioral tests
 
 - selftests: add memcpy() fortify tests
 
 - string: Add additional __realloc_size() annotations for "dup" helpers
 
 - LKDTM: Fix KCFI+rodata+objtool confusion
 
 - hardening.config: Enable KCFI
 -----BEGIN PGP SIGNATURE-----
 
 iQJKBAABCgA0FiEEpcP2jyKd1g9yPm4TiXL039xtwCYFAmY/yCUWHGtlZXNjb29r
 QGNocm9taXVtLm9yZwAKCRCJcvTf3G3AJuf2D/9xlQA7UxUDlm1Z6DPYzTZfNm4M
 D+RJ1QoLNbZEYSzULWvfRSWI+c82qINoSgvtv2DdhWqSKivcMoeNDN846gewfwMY
 0q3iChbhPaNBAHaXat1pf0iA6q2n/wpg1jv1C1PmPVSaEpl0CeQ2MLXSOMz9Gb7G
 FkkaN/v+YlShUzkw61KwKPg959/bh5vCBbeLjSd1XAhLGKU7nWw4yj0J3usTnRbV
 icCnW4mk9SD+pIli/+n7t/QIvPMf6TrJZoSgH9P7YNm+wNme4UEAm1PJz8F+KVAH
 D3CJhlH36l8TrndsHMsHgDjKtUUchh+ExOlWGw3ObUnbU7ST2JP6crAdjtnyT2eN
 uF+ELBT97SskFBAlzOzBSIs8lEwBZzTdJCmWqEBr3ZxxR7lcClmqbJY+X/FhvXko
 o7PvtCbHCatpDPJPZ0e25nVsfEJS29RUED5Gen6vWcUtuvdFEgws70s5BDAbSZTo
 RoJsuDqlRAFLdNDYmEN3UTGcm+PBjPgKsBrXiiNr4Y0BilU67Bzdmd8jiZC9ARe6
 +3cfQRs0uWdemANzvrN5FnrIUhjRHWTvfVTXcC9Jt53HntIuMhhRajJuMcTAX5uQ
 iWACUR14RL8lfInS8phWB5T4AvNexTFc6kVRqNzsGB0ZutsnAsqELttCk57tYQVr
 Hlv/MbePyyLSKF/nYA==
 =CgsW
 -----END PGP SIGNATURE-----

Merge tag 'hardening-6.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux

Pull hardening updates from Kees Cook:
 "The bulk of the changes here are related to refactoring and expanding
  the KUnit tests for string helper and fortify behavior.

  Some trivial strncpy replacements in fs/ were carried in my tree. Also
  some fixes to SCSI string handling were carried in my tree since the
  helper for those was introduce here. Beyond that, just little fixes
  all around: objtool getting confused about LKDTM+KCFI, preparing for
  future refactors (constification of sysctl tables, additional
  __counted_by annotations), a Clang UBSAN+i386 crash fix, and adding
  more options in the hardening.config Kconfig fragment.

  Summary:

   - selftests: Add str*cmp tests (Ivan Orlov)

   - __counted_by: provide UAPI for _le/_be variants (Erick Archer)

   - Various strncpy deprecation refactors (Justin Stitt)

   - stackleak: Use a copy of soon-to-be-const sysctl table (Thomas
     Weißschuh)

   - UBSAN: Work around i386 -regparm=3 bug with Clang prior to
     version 19

   - Provide helper to deal with non-NUL-terminated string copying

   - SCSI: Fix older string copying bugs (with new helper)

   - selftests: Consolidate string helper behavioral tests

   - selftests: add memcpy() fortify tests

   - string: Add additional __realloc_size() annotations for "dup"
     helpers

   - LKDTM: Fix KCFI+rodata+objtool confusion

   - hardening.config: Enable KCFI"

* tag 'hardening-6.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: (29 commits)
  uapi: stddef.h: Provide UAPI macros for __counted_by_{le, be}
  stackleak: Use a copy of the ctl_table argument
  string: Add additional __realloc_size() annotations for "dup" helpers
  kunit/fortify: Fix replaced failure path to unbreak __alloc_size
  hardening: Enable KCFI and some other options
  lkdtm: Disable CFI checking for perms functions
  kunit/fortify: Add memcpy() tests
  kunit/fortify: Do not spam logs with fortify WARNs
  kunit/fortify: Rename tests to use recommended conventions
  init: replace deprecated strncpy with strscpy_pad
  kunit/fortify: Fix mismatched kvalloc()/vfree() usage
  scsi: qla2xxx: Avoid possible run-time warning with long model_num
  scsi: mpi3mr: Avoid possible run-time warning with long manufacturer strings
  scsi: mptfusion: Avoid possible run-time warning with long manufacturer strings
  fs: ecryptfs: replace deprecated strncpy with strscpy
  hfsplus: refactor copy_name to not use strncpy
  reiserfs: replace deprecated strncpy with scnprintf
  virt: acrn: replace deprecated strncpy with strscpy
  ubsan: Avoid i386 UBSAN handler crashes with Clang
  ubsan: Remove 1-element array usage in debug reporting
  ...
2024-05-13 14:14:05 -07:00
Linus Torvalds
1ba58f1ae9 seccomp update for 6.10-rc1
- Prepare for sysctl table constification
 -----BEGIN PGP SIGNATURE-----
 
 iQJKBAABCgA0FiEEpcP2jyKd1g9yPm4TiXL039xtwCYFAmY/xOgWHGtlZXNjb29r
 QGNocm9taXVtLm9yZwAKCRCJcvTf3G3AJvpQD/4+wrbWSMl2x7WRj3pBDFMhOjQv
 98FHC6llMCZyFVvsCX68orSi575YSv5jcGCkT0XRdLGBPfOFi6KxzeGsOewW1jAo
 YkZdZrOr8msBLitr9DYPdhzMtK2UEddnc2AVk/CcCsEA0pzqYndp1oQ/Kmz1Ump2
 ISBzz5GUZ0AElmXH9gr908NbTaidlfCEKqVpGdlzs/E5qN8rEZMofvnhGCWo9ZgA
 bvQ+OLV2qmJuKAKxIuo+NB4cPp/D41B+U0SrYMiK4vBTAlFmf16i3P/m4SEx3TQ0
 eS2B/aA0f6mG9NoVGQW2mRCSi+zDpVyA7HLcSFVjSerBZF2aBFPCX12rRlZXonK5
 kk6lvE/zeM0wAqKhxEUPYcCdE5gUKzRE2TbsUuqkca60gvY2EhhZbYkkN+Vm7eZ3
 XYWw6xIcUX7UFtRMQwB67ARDVpJ0Dc4sk5KTx9v0GQG3MguNf6YG37FhEahVxAd1
 V10SUg3Y5ykTImgD+g6PUMMwxYtU3RuoSGaXOFJa3tzHy7EE+dBuUQFa5JzYm3V7
 OppMgbxz0eqAU4OvD/xM3dYUsd+PxCt+4Zy2OEuip+bYiyS3CPP0elvIOdNyqDTw
 5aPxog3xwNsFCVlmp7/pSj+Aj5hvjFlA7SkQ/oxdGL+rxCb/h+fhwlBLxJZdGHeS
 X2RrkHhGPdUcAoTDTg==
 =EzcC
 -----END PGP SIGNATURE-----

Merge tag 'seccomp-6.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux

Pull seccomp update from Kees Cook:

 - Prepare for sysctl table constification

* tag 'seccomp-6.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
  seccomp: Constify sysctl subhelpers
2024-05-13 13:56:36 -07:00
Jakub Kicinski
c9f9df3f63 bpf-for-netdev
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTFp0I1jqZrAX+hPRXbK58LschIgwUCZkGSLAAKCRDbK58LschI
 gzFIAQDX9yJYEj3ppVR3oPf9Czqj4oVPE2ZNAmVlTig3eZikfAD9Gh0s5iERnFfs
 WAST1OgUF/4EHktO/7PKtkvBg0DdQgk=
 =DviD
 -----END PGP SIGNATURE-----

Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf

Daniel Borkmann says:

====================
pull-request: bpf 2024-05-13

We've added 3 non-merge commits during the last 2 day(s) which contain
a total of 2 files changed, 62 insertions(+), 8 deletions(-).

The main changes are:

1) Fix a case where syzkaller found that it's unexpectedly possible
   to attach a cgroup_skb program to the sockopt hooks. The fix adds
   missing attach_type enforcement for the link_create case along
   with selftests, from Stanislav Fomichev.

* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
  selftests/bpf: Add sockopt case to verify prog_type
  selftests/bpf: Extend sockopt tests to use BPF_LINK_CREATE
  bpf: Add BPF_PROG_TYPE_CGROUP_SKB attach type enforcement in BPF_LINK_CREATE
====================

Link: https://lore.kernel.org/r/20240513041845.31040-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-13 13:10:48 -07:00
Rafael J. Wysocki
de1c2722e0 Merge branches 'pm-em' and 'pm-docs'
Merge Enery Model update and a power management documentation update for
6.10:

 - Make the Samsung exynos-asv driver update the Energy Model after
   adjusting voltage on top of some preliminary changes of the OPP and
   Enery Model generic code (Lukasz Luba).

 - Remove a reference to a function that has been dropped from the power
   management documentation (Bjorn Helgaas).

* pm-em:
  soc: samsung: exynos-asv: Update Energy Model after adjusting voltage
  PM: EM: Add em_dev_update_chip_binning()
  PM: EM: Refactor em_adjust_new_capacity()
  OPP: OF: Export dev_opp_pm_calc_power() for usage from EM

* pm-docs:
  Documentation: PM: Update platform_pci_wakeup_init() reference
2024-05-13 20:19:58 +02:00
Rafael J. Wysocki
440f9d47df Merge branches 'pm-cpuidle', 'pm-sleep' and 'pm-powercap'
Merge cpuidle updates, changes related to system sleep and power capping
updates for 6.10:

 - Fix kerneldoc description of ladder_do_selection() (Jeff Johnson).

 - Convert the cpuidle kirkwood driver to platform remove callback
   returning void (Yangtao Li).

 - Replace deprecated strncpy() with strscpy() in the hibernation core
   code (Justin Stitt).

 - Use %ps to simplify debug output in the core system-wide suspend and
   resume code (Len Brown).

 - Remove unnecessary else from device_init_wakeup() and make
   device_wakeup_disable() return void (Dhruva Gole).

 - Enable PMU support in the Intel TPMI RAPL driver (Zhang Rui).

 - Add support for ArrowLake-H platform to the Intel RAPL driver (Zhang
   Rui).

 - Avoid explicit cpumask allocation on stack in DTPM (Dawei Li).

* pm-cpuidle:
  cpuidle: ladder: fix ladder_do_selection() kernel-doc
  cpuidle: kirkwood: Convert to platform remove callback returning void

* pm-sleep:
  PM: hibernate: replace deprecated strncpy() with strscpy()
  PM: sleep: Take advantage of %ps to simplify debug output
  PM: wakeup: Remove unnecessary else from device_init_wakeup()
  PM: wakeup: make device_wakeup_disable() return void

* pm-powercap:
  powercap: intel_rapl_tpmi: Enable PMU support
  powercap: intel_rapl: Introduce APIs for PMU support
  powercap: intel_rapl: Sort header files
  powercap: intel_rapl: Add support for ArrowLake-H platform
  powercap: DTPM: Avoid explicit cpumask allocation on stack
2024-05-13 20:14:10 +02:00
Linus Torvalds
c07ea940a0 kcsan: Introduce __data_racy type qualifier
This commit adds a __data_racy type qualifier that enables kernel
 developers to inform KCSAN that a given variable is a shared variable
 without needing to mark each and every access.  This allows pre-KCSAN
 code to be correctly (if approximately) instrumented withh very little
 effort, and also provides people reading the code a clear indication that
 the variable is in fact shared.  In addition, it permits incremental
 transition to per-access KCSAN marking, so that (for example) a given
 subsystem can be transitioned one variable at a time, while avoiding
 large numbers of KCSAN warnings during this transition.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEbK7UrM+RBIrCoViJnr8S83LZ+4wFAmY+i+cTHHBhdWxtY2tA
 a2VybmVsLm9yZwAKCRCevxLzctn7jJQ1D/9eOBNKefU7duZgAOzUizPdxRvxKzPx
 UENz6DU/xXB+jcaWiRvdWyFgIFnUS/TaZcwtthXh4bV1I754dRFy8X9+/uHd8AVY
 MUwRkhY3Nie/MgkvLrEmMsfWn9zSUp0Pwq4dwFdhvb0aosFSn7PgtSrE62+RafpZ
 k1abEUa62MfSLJjJ7C8ThYk9broAgz37drloAStAr4PvrCM4JaoeChkStaAK80z1
 qq3EblLtXlzKcW1UNkvsbTxcnv+quLsI4EHKSnN3O8l47/F/k52ENz5Qp1pYTOLk
 kO3IZjqFqnIH6Re5eHPA05cwQssJFvsB8gfB+g+kc2uOK/z7wwg0/gqf9SZyaosw
 ABoaxflfNE/mTzKVgob3wqGyhlsAE/R2k02yoMad4X78ATOi9RpjdH6xC4OOXYfV
 4P8g2hGAHNR8UgYosXFx+YCu2ktGYyfsqTicMaaaECUfxFeJjJ1QqgwHYHADDDv/
 x8UxggAco1jul+6fikPGnjDgBN5IJOwS26NEUguqAFqYMTF8OO/x6ag6cqG5nk3a
 b41GF4HEfoQtJduuOv8jVntyTRU7zbpH+AVuinQ1V34kpYp5fE75p30P4UUjMegA
 JaAoOeD9aebEUHHlujomaV/QKSHobYLmYp/ARe2QZjp7aiELcjvV/ThOdwRxGEZg
 Zl4qRaGc9YO/Ag==
 =f1gr
 -----END PGP SIGNATURE-----

Merge tag 'kcsan.2024.05.10a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu

Pull kcsan update from Paul McKenney:
 "Introduce __data_racy type qualifier

  This adds a __data_racy type qualifier that enables kernel developers
  to inform KCSAN that a given variable is a shared variable without
  needing to mark each and every access.

  This allows pre-KCSAN code to be correctly (if approximately)
  instrumented withh very little effort, and also provides people
  reading the code a clear indication that the variable is in fact
  shared.

  In addition, it permits incremental transition to per-access KCSAN
  marking, so that (for example) a given subsystem can be transitioned
  one variable at a time, while avoiding large numbers of KCSAN warnings
  during this transition"

* tag 'kcsan.2024.05.10a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu:
  kcsan, compiler_types: Introduce __data_racy type qualifier
2024-05-13 10:13:39 -07:00
Linus Torvalds
c0b9620bc3 RCU pull request for v6.10
This pull request contains the following branches:
 
 fixes.2024.04.15a: Fix a lockdep complain for lazy-preemptible kernel,
 remove redundant BH disable for TINY_RCU, remove redundant READ_ONCE()
 in tree.c, fix false positives KCSAN splat and fix buffer overflow in
 the print_cpu_stall_info().
 
 misc.2024.04.12a: Misc updates related to bpf, tracing and update the
 MAINTAINERS file.
 
 rcu-sync-normal-improve.2024.04.15a: An improvement of a normal
 synchronize_rcu() call in terms of latency. It maintains a separate
 track for sync. users only. This approach bypasses per-cpu nocb-lists
 thus sync-users do not depend on nocb-list length and how fast regular
 callbacks are processed.
 
 rcu-tasks.2024.04.15a: RCU tasks, switch tasks RCU grace periods to
 sleep at TASK_IDLE priority, fix some comments, add some diagnostic
 warning to the exit_tasks_rcu_start() and fix a buffer overflow in
 the show_rcu_tasks_trace_gp_kthread().
 
 rcutorture.2024.04.15a: Increase memory to guest OS, fix a Tasks
 Rude RCU testing, some updates for TREE09, dump mode information
 to debug GP kthread state, remove redundant READ_ONCE(), fix some
 comments about RCU_TORTURE_PIPE_LEN and pipe_count, remove some
 redundant pointer initialization, fix a hung splat task by when
 the rcutorture tests start to exit, fix invalid context warning,
 add '--do-kvfree' parameter to torture test and use slow register
 unregister callbacks only for rcutype test.
 -----BEGIN PGP SIGNATURE-----
 
 iQGzBAABCgAdFiEEu6QRe/mAUYNn5U0PBYqkjnKWLM8FAmYzsmUACgkQBYqkjnKW
 LM/FAwv+LcIJ9lO/wzUpnH3d3djBOPmyu7Us8ERNY5lcVZ+neS2m3vxq0kOk/cnV
 RGgZc7qjWqMQ9hAx/MmIodmiw036ceRDe5CP/Ec/TYx68m+NPG3VnP08s/xLXLlx
 n8aSJJu37y0ElMQMwvuQaoNJ2xqlZ8AHCR6iaqJtzmPBR6zHLyeCPVpdPJQfcSO7
 +9ABzqo8isGxeuaAE7y0WUp0ZsSpdYvdext5SStjtvZ+hKERdVluhBF+OxZIZByp
 RSBoZJrbTKKpzTUBSE0ci+mlfqBPmSVjjqvygscuwOoKhm+601E51DYb1QXkGujq
 vuc1f/c7VjTAXyvs9k4An2x3XcN5SFhA6Bhc+L6aU/UJBzAWrJJkVOwS79gHNSn1
 qshyhpDLE8MiBEi0QxaEmBZLkz3BX1aYbQA0+5wvgoz0u8QglrpRrPRIWUWC0wvq
 SOLIibZkJuPUOZuD5AP4tg80swTuSCvyWuiKUVRnJK9FsYKdcyNUCnOLIwUzQlrg
 1/hatlvS
 =cq8V
 -----END PGP SIGNATURE-----

Merge tag 'rcu.next.v6.10' of https://github.com/urezki/linux

Pull RCU updates from Uladzislau Rezki:

 - Fix a lockdep complain for lazy-preemptible kernel, remove redundant
   BH disable for TINY_RCU, remove redundant READ_ONCE() in tree.c, fix
   false positives KCSAN splat and fix buffer overflow in the
   print_cpu_stall_info().

 - Misc updates related to bpf, tracing and update the MAINTAINERS file.

 - An improvement of a normal synchronize_rcu() call in terms of
   latency. It maintains a separate track for sync. users only. This
   approach bypasses per-cpu nocb-lists thus sync-users do not depend on
   nocb-list length and how fast regular callbacks are processed.

 - RCU tasks: switch tasks RCU grace periods to sleep at TASK_IDLE
   priority, fix some comments, add some diagnostic warning to the
   exit_tasks_rcu_start() and fix a buffer overflow in the
   show_rcu_tasks_trace_gp_kthread().

 - RCU torture: Increase memory to guest OS, fix a Tasks Rude RCU
   testing, some updates for TREE09, dump mode information to debug GP
   kthread state, remove redundant READ_ONCE(), fix some comments about
   RCU_TORTURE_PIPE_LEN and pipe_count, remove some redundant pointer
   initialization, fix a hung splat task by when the rcutorture tests
   start to exit, fix invalid context warning, add '--do-kvfree'
   parameter to torture test and use slow register unregister callbacks
   only for rcutype test.

* tag 'rcu.next.v6.10' of https://github.com/urezki/linux: (48 commits)
  rcutorture: Use rcu_gp_slow_register/unregister() only for rcutype test
  torture: Scale --do-kvfree test time
  rcutorture: Fix invalid context warning when enable srcu barrier testing
  rcutorture: Make stall-tasks directly exit when rcutorture tests end
  rcutorture: Removing redundant function pointer initialization
  rcutorture: Make rcutorture support print rcu-tasks gp state
  rcutorture: Use the gp_kthread_dbg operation specified by cur_ops
  rcutorture: Re-use value stored to ->rtort_pipe_count instead of re-reading
  rcutorture: Fix rcu_torture_one_read() pipe_count overflow comment
  rcutorture: Remove extraneous rcu_torture_pipe_update_one() READ_ONCE()
  rcu: Allocate WQ with WQ_MEM_RECLAIM bit set
  rcu: Support direct wake-up of synchronize_rcu() users
  rcu: Add a trace event for synchronize_rcu_normal()
  rcu: Reduce synchronize_rcu() latency
  rcu: Fix buffer overflow in print_cpu_stall_info()
  rcu: Mollify sparse with RCU guard
  rcu-tasks: Fix show_rcu_tasks_trace_gp_kthread buffer overflow
  rcu-tasks: Fix the comments for tasks_rcu_exit_srcu_stall_timer
  rcu-tasks: Replace exit_tasks_rcu_start() initialization with WARN_ON_ONCE()
  rcu: Remove redundant CONFIG_PROVE_RCU #if condition
  ...
2024-05-13 09:49:06 -07:00
Beau Belgrave
bd125a0840 tracing/user_events: Fix non-spaced field matching
When the ABI was updated to prevent same name w/different args, it
missed an important corner case when fields don't end with a space.
Typically, space is used for fields to help separate them, like
"u8 field1; u8 field2". If no spaces are used, like
"u8 field1;u8 field2", then the parsing works for the first time.
However, the match check fails on a subsequent register, leading to
confusion.

This is because the match check uses argv_split() and assumes that all
fields will be split upon the space. When spaces are used, we get back
{ "u8", "field1;" }, without spaces we get back { "u8", "field1;u8" }.
This causes a mismatch, and the user program gets back -EADDRINUSE.

Add a method to detect this case before calling argv_split(). If found
force a space after the field separator character ';'. This ensures all
cases work properly for matching.

With this fix, the following are all treated as matching:
u8 field1;u8 field2
u8 field1; u8 field2
u8 field1;\tu8 field2
u8 field1;\nu8 field2

Link: https://lore.kernel.org/linux-trace-kernel/20240423162338.292-2-beaub@linux.microsoft.com

Fixes: ba470eebc2 ("tracing/user_events: Prevent same name but different args event")
Signed-off-by: Beau Belgrave <beaub@linux.microsoft.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-05-13 12:15:45 -04:00
Michael Ellerman
e789d4499a Merge branch 'topic/kdump-hotplug' into next
Merge our topic branch containing kdump hotplug changes, more detail from the
original cover letter:

Commit 2472627561 ("crash: add generic infrastructure for crash
hotplug support") added a generic infrastructure that allows
architectures to selectively update the kdump image component during CPU
or memory add/remove events within the kernel itself.

This patch series adds crash hotplug handler for PowerPC and enable
support to update the kdump image on CPU/Memory add/remove events.

Among the 6 patches in this series, the first two patches make changes
to the generic crash hotplug handler to assist PowerPC in adding support
for this feature. The last four patches add support for this feature.

The following section outlines the problem addressed by this patch
series, along with the current solution, its shortcomings, and the
proposed resolution.

Problem:
========
Due to CPU/Memory hotplug or online/offline events the elfcorehdr
(which describes the CPUs and memory of the crashed kernel) and FDT
(Flattened Device Tree) of kdump image becomes outdated. Consequently,
attempting dump collection with an outdated elfcorehdr or FDT can lead
to failed or inaccurate dump collection.

Going forward CPU hotplug or online/offline events are referred as
CPU/Memory add/remove events.

Existing solution and its shortcoming:
======================================
The current solution to address the above issue involves monitoring the
CPU/memory add/remove events in userspace using udev rules and whenever
there are changes in CPU and memory resources, the entire kdump image
is loaded again. The kdump image includes kernel, initrd, elfcorehdr,
FDT, purgatory. Given that only elfcorehdr and FDT get outdated due to
CPU/Memory add/remove events, reloading the entire kdump image is
inefficient. More importantly, kdump remains inactive for a substantial
amount of time until the kdump reload completes.

Proposed solution:
==================
Instead of initiating a full kdump image reload from userspace on
CPU/Memory hotplug and online/offline events, the proposed solution aims
to update only the necessary kdump image component within the kernel
itself.
2024-05-13 23:12:08 +10:00
Puranjay Mohan
2ddec2c80b riscv, bpf: inline bpf_get_smp_processor_id()
Inline the calls to bpf_get_smp_processor_id() in the riscv bpf jit.

RISCV saves the pointer to the CPU's task_struct in the TP (thread
pointer) register. This makes it trivial to get the CPU's processor id.
As thread_info is the first member of task_struct, we can read the
processor id from TP + offsetof(struct thread_info, cpu).

          RISCV64 JIT output for `call bpf_get_smp_processor_id`
	  ======================================================

                Before                           After
               --------                         -------

         auipc   t1,0x848c                  ld    a5,32(tp)
         jalr    604(t1)
         mv      a5,a0

Benchmark using [1] on Qemu.

./benchs/run_bench_trigger.sh glob-arr-inc arr-inc hash-inc

+---------------+------------------+------------------+--------------+
|      Name     |     Before       |       After      |   % change   |
|---------------+------------------+------------------+--------------|
| glob-arr-inc  | 1.077 ± 0.006M/s | 1.336 ± 0.010M/s |   + 24.04%   |
| arr-inc       | 1.078 ± 0.002M/s | 1.332 ± 0.015M/s |   + 23.56%   |
| hash-inc      | 0.494 ± 0.004M/s | 0.653 ± 0.001M/s |   + 32.18%   |
+---------------+------------------+------------------+--------------+

NOTE: This benchmark includes changes from this patch and the previous
      patch that implemented the per-cpu insn.

[1] https://github.com/anakryiko/linux/commit/8dec900975ef

Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Björn Töpel <bjorn@kernel.org>
Link: https://lore.kernel.org/r/20240502151854.9810-3-puranjay@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-05-12 16:54:34 -07:00
Paolo Bonzini
4232da23d7 Merge tag 'loongarch-kvm-6.10' of git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson into HEAD
LoongArch KVM changes for v6.10

1. Add ParaVirt IPI support.
2. Add software breakpoint support.
3. Add mmio trace events support.
2024-05-10 13:20:18 -04:00
Jakub Kicinski
e7073830cc Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR.

No conflicts.

Adjacent changes:

drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
  35d92abfba ("net: hns3: fix kernel crash when devlink reload during initialization")
  2a1a1a7b5f ("net: hns3: add command queue trace for hns3")

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-09 10:01:01 -07:00
Alexander Lobakin
a6016aac52 dma: fix DMA sync for drivers not calling dma_set_mask*()
There are several reports that the DMA sync shortcut broke non-coherent
devices.
dev->dma_need_sync is false after the &device allocation and if a driver
didn't call dma_set_mask*(), it will still be false even if the device
is not DMA-coherent and thus needs synchronizing. Due to historical
reasons, there's still a lot of drivers not calling it.
Invert the boolean, so that the sync will be performed by default and
the shortcut will be enabled only when calling dma_set_mask*().

Reported-by: Steven Price <steven.price@arm.com>
Closes: https://lore.kernel.org/lkml/010686f5-3049-46a1-8230-7752a1b433ff@arm.com
Reported-by: Marek Szyprowski <m.szyprowski@samsung.com>
Closes: https://lore.kernel.org/lkml/46160534-5003-4809-a408-6b3a3f4921e9@samsung.com
Fixes: f406c8e4b7. ("dma: avoid redundant calls for sync operations")
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Steven Price <steven.price@arm.com>
Tested-by: Marek Szyprowski <m.szyprowski@samsung.com>
2024-05-09 19:00:29 +02:00
Kyle Meyer
05037e5f0f sched/topology: Optimize topology_span_sane()
Optimize topology_span_sane() by removing duplicate comparisons.

Since topology_span_sane() is called inside of for_each_cpu(), each
previous CPU has already been compared against every other CPU. The
current CPU only needs to be compared against higher-numbered CPUs.

The total number of comparisons is reduced from N * (N - 1) to
N * (N - 1) / 2 on each non-NUMA scheduling domain level.

Signed-off-by: Kyle Meyer <kyle.meyer@hpe.com>
Reviewed-by: Yury Norov <yury.norov@gmail.com>
Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Yury Norov <yury.norov@gmail.com>
2024-05-09 09:25:08 -07:00
Wardenjohn
d927752f28 livepatch: Rename KLP_* to KLP_TRANSITION_*
The original macros of KLP_* is about the state of the transition.
Rename macros of KLP_* to KLP_TRANSITION_* to fix the confusing
description of klp transition state.

Signed-off-by: Wardenjohn <zhangwarden@gmail.com>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Tested-by: Petr Mladek <pmladek@suse.com>
Acked-by: Josh Poimboeuf <jpoimboe@kernel.org>
Acked-by: Miroslav Benes <mbenes@suse.cz>
Link: https://lore.kernel.org/r/20240507050111.38195-2-zhangwarden@gmail.com
Signed-off-by: Petr Mladek <pmladek@suse.com>
2024-05-09 15:48:01 +02:00
Kees Cook
e406737b11 seccomp: Constify sysctl subhelpers
The read_actions_logged() and write_actions_logged() helpers called by the
sysctl proc handler seccomp_actions_logged_handler() are already expecting
their sysctl table argument to be read-only. Actually mark the argument
as const in preparation[1] for global constification of the sysctl tables.

Suggested-by: Thomas Weißschuh <linux@weissschuh.net>
Link: https://lore.kernel.org/lkml/20240423-sysctl-const-handler-v3-11-e0beccb836e2@weissschuh.net/ [1]
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Reviewed-by: Thomas Weißschuh <linux@weissschuh.net>
Link: https://lore.kernel.org/r/20240508171337.work.861-kees@kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
2024-05-08 12:50:40 -07:00
Andrew Morton
8fcb916cac kernel/watchdog_perf.c: tidy up kerneldoc
It is unconventional to have a blank line between name-of-function and
description-of-args.

Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Song Liu <song@kernel.org>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-08 08:41:29 -07:00
Song Liu
393fb313a2 watchdog: allow nmi watchdog to use raw perf event
NMI watchdog permanently consumes one hardware counters per CPU on the
system.  For systems that use many hardware counters, this causes more
aggressive time multiplexing of perf events.

OTOH, some CPUs (mostly Intel) support "ref-cycles" event, which is rarely
used.  Add kernel cmdline arg nmi_watchdog=rNNN to configure the watchdog
to use raw event.  For example, on Intel CPUs, we can use "r300" to
configure the watchdog to use ref-cycles event.

If the raw event does not work, fall back to use "cycles".

[akpm@linux-foundation.org: fix kerneldoc]
Link: https://lkml.kernel.org/r/20240430060236.1878002-2-song@kernel.org
Signed-off-by: Song Liu <song@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-08 08:41:29 -07:00
Song Liu
602ba77361 watchdog: handle comma separated nmi_watchdog command line
Per the document, the kernel can accept comma separated command line like
nmi_watchdog=nopanic,0.  However, the code doesn't really handle it.  Fix
the kernel to handle it properly.

Link: https://lkml.kernel.org/r/20240430060236.1878002-1-song@kernel.org
Signed-off-by: Song Liu <song@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-08 08:41:28 -07:00
Baoquan He
4707c13de3 crash: add prefix for crash dumping messages
Add pr_fmt() to kernel/crash_core.c to add the module name to debugging
message printed as prefix.

And also add prefix 'crashkernel:' to two lines of message printing code
in kernel/crash_reserve.c. In kernel/crash_reserve.c, almost all
debugging messages have 'crashkernel:' prefix or there's keyword
crashkernel at the beginning or in the middle, adding pr_fmt() makes it
redundant.

Link: https://lkml.kernel.org/r/20240418035843.1562887-1-bhe@redhat.com
Signed-off-by: Baoquan He <bhe@redhat.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: Jiri Slaby <jirislaby@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-08 08:41:26 -07:00
Levi Yun
d7ad05c86e timers/migration: Prevent out of bounds access on failure
When tmigr_setup_groups() fails the level 0 group allocation, then the
cleanup derefences index -1 of the local stack array.

Prevent this by checking the loop condition first.

Fixes: 7ee9887703 ("timers: Implement the hierarchical pull model")
Signed-off-by: Levi Yun <ppbuk5246@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Link: https://lore.kernel.org/r/20240506041059.86877-1-ppbuk5246@gmail.com
2024-05-08 11:19:43 +02:00
Haiyue Wang
75b0fbf15d bpf: Remove redundant page mask of vmf->address
As the comment described in "struct vm_fault":
	".address"      : 'Faulting virtual address - masked'
	".real_address" : 'Faulting virtual address - unmasked'

The link [1] said: "Whatever the routes, all architectures end up to the
invocation of handle_mm_fault() which, in turn, (likely) ends up calling
__handle_mm_fault() to carry out the actual work of allocating the page
tables."

  __handle_mm_fault() does address assignment:
	.address = address & PAGE_MASK,
	.real_address = address,

This is debug dump by running `./test_progs -a "*arena*"`:

[   69.767494] arena fault: vmf->address = 10000001d000, vmf->real_address = 10000001d008
[   69.767496] arena fault: vmf->address = 10000001c000, vmf->real_address = 10000001c008
[   69.767499] arena fault: vmf->address = 10000001b000, vmf->real_address = 10000001b008
[   69.767501] arena fault: vmf->address = 10000001a000, vmf->real_address = 10000001a008
[   69.767504] arena fault: vmf->address = 100000019000, vmf->real_address = 100000019008
[   69.769388] arena fault: vmf->address = 10000001e000, vmf->real_address = 10000001e1e8

So we can use the value of 'vmf->address' to do BPF arena kernel address
space cast directly.

[1] https://docs.kernel.org/mm/page_tables.html

Signed-off-by: Haiyue Wang <haiyue.wang@intel.com>
Link: https://lore.kernel.org/r/20240507063358.8048-1-haiyue.wang@intel.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-05-07 14:13:17 -07:00
Marco Elver
31f605a308 kcsan, compiler_types: Introduce __data_racy type qualifier
Based on the discussion at [1], it would be helpful to mark certain
variables as explicitly "data racy", which would result in KCSAN not
reporting data races involving any accesses on such variables. To do
that, introduce the __data_racy type qualifier:

	struct foo {
		...
		int __data_racy bar;
		...
	};

In KCSAN-kernels, __data_racy turns into volatile, which KCSAN already
treats specially by considering them "marked". In non-KCSAN kernels the
type qualifier turns into no-op.

The generated code between KCSAN-instrumented kernels and non-KCSAN
kernels is already huge (inserted calls into runtime for every memory
access), so the extra generated code (if any) due to volatile for few
such __data_racy variables are unlikely to have measurable impact on
performance.

Link: https://lore.kernel.org/all/CAHk-=wi3iondeh_9V2g3Qz5oHTRjLsOpoy83hb58MVh=nRZe0A@mail.gmail.com/ [1]
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Marco Elver <elver@google.com>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2024-05-07 11:39:50 -07:00
Paolo Bonzini
aa24865fb5 KVM/riscv changes for 6.10
- Support guest breakpoints using ebreak
 - Introduce per-VCPU mp_state_lock and reset_cntx_lock
 - Virtualize SBI PMU snapshot and counter overflow interrupts
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEZdn75s5e6LHDQ+f/rUjsVaLHLAcFAmYwgroACgkQrUjsVaLH
 LAfckxAAnCvW9Ahcy0GgM2EwTtYDoNkQp1A6Wkp/a3nXBvc3hXMnlyZQ4YkyJ1T3
 BfQABCWEXWiDyEVpN9KUKtzUJi7WJz0MFuph5kvyZwMl53zddUNFqXpN4Hbb58/d
 dqjTJg7AnHbvirfhlHay/Rp+EaYsDq1E5GviDBi46yFkH/vB8IPpWdFLh3pD/+7f
 bmG5jeLos8zsWEwe3pAIC2hLDj0vFRRe2YJuXTZ9fvPzGBsPN9OHrtq0JbB3lRGt
 WRiYKPJiFjt2P3TjPkjh4N1Xmy8pJaEetu0Qwa1TR6I+ULs2ZcFzx9cw2VuoRQ2C
 uNhVx0o5ulAzJwGgX4U49ZTK4M7a5q6xf6zpqNFHbyy5tZylKJuBEWucuSyF1kTU
 RpjNinZ1PShzjx7HU+2gKPu+bmKHgfwKlr2Dp9Cx92IV9It3Wt1VEXWsjatciMfj
 EGYx+E9VcEOfX6INwX/TiO4ti7chLH/sFc+LhLqvw/1elhi83yAWbszjUmJ1Vrx1
 k1eATN2Hehvw06Y72lc+PrD0sYUmJPcDMVk3MSh/cSC8OODmZ9vi32v8Ie2bjNS5
 gHRLc05av1aX8yX+GRpUSPkCRL/XQ2J3jLG4uc3FmBMcWEhAtnIPsvXnCvV8f2mw
 aYrN+VF/FuRfumuYX6jWN6dwEwDO96AN425Rqu9MXik5KqSASXQ=
 =mGfY
 -----END PGP SIGNATURE-----

Merge tag 'kvm-riscv-6.10-1' of https://github.com/kvm-riscv/linux into HEAD

 KVM/riscv changes for 6.10

- Support guest breakpoints using ebreak
- Introduce per-VCPU mp_state_lock and reset_cntx_lock
- Virtualize SBI PMU snapshot and counter overflow interrupts
- New selftests for SBI PMU and Guest ebreak
2024-05-07 13:03:03 -04:00
Alexander Lobakin
f406c8e4b7 dma: avoid redundant calls for sync operations
Quite often, devices do not need dma_sync operations on x86_64 at least.
Indeed, when dev_is_dma_coherent(dev) is true and
dev_use_swiotlb(dev) is false, iommu_dma_sync_single_for_cpu()
and friends do nothing.

However, indirectly calling them when CONFIG_RETPOLINE=y consumes about
10% of cycles on a cpu receiving packets from softirq at ~100Gbit rate.
Even if/when CONFIG_RETPOLINE is not set, there is a cost of about 3%.

Add dev->need_dma_sync boolean and turn it off during the device
initialization (dma_set_mask()) depending on the setup:
dev_is_dma_coherent() for the direct DMA, !(sync_single_for_device ||
sync_single_for_cpu) or the new dma_map_ops flag, %DMA_F_CAN_SKIP_SYNC,
advertised for non-NULL DMA ops.
Then later, if/when swiotlb is used for the first time, the flag
is reset back to on, from swiotlb_tbl_map_single().

On iavf, the UDP trafficgen with XDP_DROP in skb mode test shows
+3-5% increase for direct DMA.

Suggested-by: Christoph Hellwig <hch@lst.de> # direct DMA shortcut
Co-developed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2024-05-07 13:29:53 +02:00
Alexander Lobakin
fe7514b149 dma: compile-out DMA sync op calls when not used
Some platforms do have DMA, but DMA there is always direct and coherent.
Currently, even on such platforms DMA sync operations are compiled and
called.
Add a new hidden Kconfig symbol, DMA_NEED_SYNC, and set it only when
either sync operations are needed or there is DMA ops or swiotlb
or DMA debug is enabled. Compile global dma_sync_*() and dma_need_sync()
only when it's set, otherwise provide empty inline stubs.
The change allows for future optimizations of DMA sync calls depending
on runtime conditions.

Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2024-05-07 13:29:53 +02:00
Michael Kelley
327e2c97c4 swiotlb: remove alloc_size argument to swiotlb_tbl_map_single()
Currently swiotlb_tbl_map_single() takes alloc_align_mask and
alloc_size arguments to specify an swiotlb allocation that is larger
than mapping_size.  This larger allocation is used solely by
iommu_dma_map_single() to handle untrusted devices that should not have
DMA visibility to memory pages that are partially used for unrelated
kernel data.

Having two arguments to specify the allocation is redundant. While
alloc_align_mask naturally specifies the alignment of the starting
address of the allocation, it can also implicitly specify the size
by rounding up the mapping_size to that alignment.

Additionally, the current approach has an edge case bug.
iommu_dma_map_page() already does the rounding up to compute the
alloc_size argument. But swiotlb_tbl_map_single() then calculates the
alignment offset based on the DMA min_align_mask, and adds that offset to
alloc_size. If the offset is non-zero, the addition may result in a value
that is larger than the max the swiotlb can allocate.  If the rounding up
is done _after_ the alignment offset is added to the mapping_size (and
the original mapping_size conforms to the value returned by
swiotlb_max_mapping_size), then the max that the swiotlb can allocate
will not be exceeded.

In view of these issues, simplify the swiotlb_tbl_map_single() interface
by removing the alloc_size argument. Most call sites pass the same value
for mapping_size and alloc_size, and they pass alloc_align_mask as zero.
Just remove the redundant argument from these callers, as they will see
no functional change. For iommu_dma_map_page() also remove the alloc_size
argument, and have swiotlb_tbl_map_single() compute the alloc_size by
rounding up mapping_size after adding the offset based on min_align_mask.
This has the side effect of fixing the edge case bug but with no other
functional change.

Also add a sanity test on the alloc_align_mask. While IOMMU code
currently ensures the granule is not larger than PAGE_SIZE, if that
guarantee were to be removed in the future, the downstream effect on the
swiotlb might go unnoticed until strange allocation failures occurred.

Tested on an ARM64 system with 16K page size and some kernel test-only
hackery to allow modifying the DMA min_align_mask and the granule size
that becomes the alloc_align_mask. Tested these combinations with a
variety of original memory addresses and sizes, including those that
reproduce the edge case bug:

 * 4K granule and 0 min_align_mask
 * 4K granule and 0xFFF min_align_mask (4K - 1)
 * 16K granule and 0xFFF min_align_mask
 * 64K granule and 0xFFF min_align_mask
 * 64K granule and 0x3FFF min_align_mask (16K - 1)

With the changes, all combinations pass.

Signed-off-by: Michael Kelley <mhklinux@outlook.com>
Reviewed-by: Petr Tesarik <petr@tesarici.cz>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2024-05-07 13:29:28 +02:00
Justin Stitt
e0550222e0 printk: cleanup deprecated uses of strncpy/strcpy
Cleanup some deprecated uses of strncpy() and strcpy() [1].

There doesn't seem to be any bugs with the current code but the
readability of this code could benefit from a quick makeover while
removing some deprecated stuff as a benefit.

The most interesting replacement made in this patch involves
concatenating "ttyS" with a digit-led user-supplied string. Instead of
doing two distinct string copies with carefully managed offsets and
lengths, let's use the more robust and self-explanatory scnprintf().
scnprintf will 1) respect the bounds of @buf, 2) null-terminate @buf, 3)
do the concatenation. This allows us to drop the manual NUL-byte assignment.

Also, since isdigit() is used about a dozen lines after the open-coded
version we'll replace it for uniformity's sake.

All the strcpy() --> strscpy() replacements are trivial as the source
strings are literals and much smaller than the destination size. No
behavioral change here.

Use the new 2-argument version of strscpy() introduced in Commit
e6584c3964 ("string: Allow 2-argument strscpy()"). However, to make
this work fully (since the size must be known at compile time), also
update the extern-qualified declaration to have the proper size
information.

Link: https://www.kernel.org/doc/html/latest/process/deprecated.html#strncpy-on-nul-terminated-strings [1]
Link: https://github.com/KSPP/linux/issues/90 [2]
Link: https://manpages.debian.org/testing/linux-manual-4.8/strscpy.9.en.html [3]
Cc: linux-hardening@vger.kernel.org
Signed-off-by: Justin Stitt <justinstitt@google.com>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20240429-strncpy-kernel-printk-printk-c-v1-1-4da7926d7b69@google.com
[pmladek@suse.com: Removed obsolete brackets and added empty lines.]
Signed-off-by: Petr Mladek <pmladek@suse.com>
2024-05-07 10:41:51 +02:00
Cupertino Miranda
41d047a871 bpf/verifier: relax MUL range computation check
MUL instruction required that src_reg would be a known value (i.e.
src_reg would be a const value). The condition in this case can be
relaxed, since the range computation algorithm used in current code
already supports a proper range computation for any valid range value on
its operands.

Signed-off-by: Cupertino Miranda <cupertino.miranda@oracle.com>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Acked-by: Andrii Nakryiko <andrii.nakryiko@gmail.com>
Cc: Yonghong Song <yonghong.song@linux.dev>
Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Cc: David Faust <david.faust@oracle.com>
Cc: Jose Marchesi <jose.marchesi@oracle.com>
Cc: Elena Zannoni <elena.zannoni@oracle.com>
Link: https://lore.kernel.org/r/20240506141849.185293-6-cupertino.miranda@oracle.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-05-06 17:09:12 -07:00
Cupertino Miranda
138cc42c05 bpf/verifier: improve XOR and OR range computation
Range for XOR and OR operators would not be attempted unless src_reg
would resolve to a single value, i.e. a known constant value.
This condition is unnecessary, and the following XOR/OR operator
handling could compute a possible better range.

Acked-by: Eduard Zingerman <eddyz87@gmail.com>

Signed-off-by: Cupertino Miranda <cupertino.miranda@oracle.com
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Cc: Yonghong Song <yonghong.song@linux.dev>
Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Cc: David Faust <david.faust@oracle.com>
Cc: Jose Marchesi <jose.marchesi@oracle.com>
Cc: Elena Zannoni <elena.zannoni@oracle.com>
Cc: Andrii Nakryiko <andrii.nakryiko@gmail.com>
Link: https://lore.kernel.org/r/20240506141849.185293-4-cupertino.miranda@oracle.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-05-06 17:09:11 -07:00
Cupertino Miranda
0922c78f59 bpf/verifier: refactor checks for range computation
Split range computation checks in its own function, isolating pessimitic
range set for dst_reg and failing return to a single point.

Signed-off-by: Cupertino Miranda <cupertino.miranda@oracle.com>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Cc: Yonghong Song <yonghong.song@linux.dev>
Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Cc: David Faust <david.faust@oracle.com>
Cc: Jose Marchesi <jose.marchesi@oracle.com>
Cc: Elena Zannoni <elena.zannoni@oracle.com>
Cc: Andrii Nakryiko <andrii.nakryiko@gmail.com>

bpf/verifier: improve code after range computation recent changes.
Link: https://lore.kernel.org/r/20240506141849.185293-3-cupertino.miranda@oracle.com

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-05-06 17:09:11 -07:00
Cupertino Miranda
d786957ebd bpf/verifier: replace calls to mark_reg_unknown.
In order to further simplify the code in adjust_scalar_min_max_vals all
the calls to mark_reg_unknown are replaced by __mark_reg_unknown.

static void mark_reg_unknown(struct bpf_verifier_env *env,
  			     struct bpf_reg_state *regs, u32 regno)
{
	if (WARN_ON(regno >= MAX_BPF_REG)) {
		... mark all regs not init ...
		return;
    }
	__mark_reg_unknown(env, regs + regno);
}

The 'regno >= MAX_BPF_REG' does not apply to
adjust_scalar_min_max_vals(), because it is only called from the
following stack:
  - check_alu_op
    - adjust_reg_min_max_vals
      - adjust_scalar_min_max_vals

The check_alu_op() does check_reg_arg() which verifies that both src and
dst register numbers are within bounds.

Signed-off-by: Cupertino Miranda <cupertino.miranda@oracle.com>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Cc: Yonghong Song <yonghong.song@linux.dev>
Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Cc: David Faust <david.faust@oracle.com>
Cc: Jose Marchesi <jose.marchesi@oracle.com>
Cc: Elena Zannoni <elena.zannoni@oracle.com>
Cc: Andrii Nakryiko <andrii.nakryiko@gmail.com>
Link: https://lore.kernel.org/r/20240506141849.185293-2-cupertino.miranda@oracle.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-05-06 17:09:11 -07:00
Mickaël Salaün
3a35c13007 kunit: Handle test faults
Previously, when a kernel test thread crashed (e.g. NULL pointer
dereference, general protection fault), the KUnit test hanged for 30
seconds and exited with a timeout error.

Fix this issue by waiting on task_struct->vfork_done instead of the
custom kunit_try_catch.try_completion, and track the execution state by
initially setting try_result with -EINTR and only setting it to 0 if
the test passed.

Fix kunit_generic_run_threadfn_adapter() signature by returning 0
instead of calling kthread_complete_and_exit().  Because thread's exit
code is never checked, always set it to 0 to make it clear.  To make
this explicit, export kthread_exit() for KUnit tests built as module.

Fix the -EINTR error message, which couldn't be reached until now.

This is tested with a following patch.

Cc: Brendan Higgins <brendanhiggins@google.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Shuah Khan <skhan@linuxfoundation.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: David Gow <davidgow@google.com>
Tested-by: Rae Moar <rmoar@google.com>
Signed-off-by: Mickaël Salaün <mic@digikod.net>
Link: https://lore.kernel.org/r/20240408074625.65017-5-mic@digikod.net
Signed-off-by: Shuah Khan <skhan@linuxfoundation.org>
2024-05-06 14:22:02 -06:00
Yoann Congal
b3e90f375b printk: Change type of CONFIG_BASE_SMALL to bool
CONFIG_BASE_SMALL is currently a type int but is only used as a boolean.

So, change its type to bool and adapt all usages:
CONFIG_BASE_SMALL == 0 becomes !IS_ENABLED(CONFIG_BASE_SMALL) and
CONFIG_BASE_SMALL != 0 becomes  IS_ENABLED(CONFIG_BASE_SMALL).

Reviewed-by: Petr Mladek <pmladek@suse.com>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Masahiro Yamada <masahiroy@kernel.org>
Signed-off-by: Yoann Congal <yoann.congal@smile.fr>
Link: https://lore.kernel.org/r/20240505080343.1471198-3-yoann.congal@smile.fr
Signed-off-by: Petr Mladek <pmladek@suse.com>
2024-05-06 17:39:09 +02:00
Benjamin Gray
628d701f2d powerpc/dexcr: Add DEXCR prctl interface
Now that we track a DEXCR on a per-task basis, individual tasks are free
to configure it as they like.

The interface is a pair of getter/setter prctl's that work on a single
aspect at a time (multiple aspects at once is more difficult if there
are different rules applied for each aspect, now or in future). The
getter shows the current state of the process config, and the setter
allows setting/clearing the aspect.

Signed-off-by: Benjamin Gray <bgray@linux.ibm.com>
[mpe: Account for PR_RISCV_SET_ICACHE_FLUSH_CTX, shrink some longs lines]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://msgid.link/20240417112325.728010-5-bgray@linux.ibm.com
2024-05-06 22:04:31 +10:00
Linus Torvalds
80f8b450bf Fix suspicious RCU usage in __do_softirq().
Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmY3SxERHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1irFBAAkF7nMNof2kDXmHqeINNp0ZreVYEcVnTM
 S0xTUCvJ1C0UQgxPqOOlpODfOJLANqBS/xpwWTxzvdDemXDTAEeaiZz2wmiS77qG
 8Q98k39AOH1gynSIoZE9df4tniw2WxYaU5CMveT85YeMIW8rE3B0i/uNyrsCPJDw
 P9Bv0rBc96hbrFs32alVcix6YN1QySo8O9oZW+rRQndh8zd1lBCKVKC2QCGGLh7b
 pS45F0vJt6mVmdVURWvGtoaIh5PKNPBP1exfJow79AgogMuLgXm9JHltErgWc55L
 b508AjH29pKGb0a54hUaLAnXk1Fmu7xGZkQWIwUO7/U2ZYUR+3/eQ8UVoGhcole+
 nS/jew1er4W4/KLqhThKnNSuJaQeLljKbbsOK0bk4Dv1NTfiu83WIxgwVBZfR5Dx
 zZSG+PNcLxqVQDUz+bicy0l31x2bwGEjBnop9llPz/h+eeJHD7i3LVi+wVtrIyeP
 iLaRQVvFSgkFECJglq4aPBZ30bqU387hE9oKx+FW0WCUO6CWMg+rjqs8/MSAB31H
 8HKk9WxAWxlOdlAoESJawVLJxuAKHnVdgfilKjiBH5j5nUUB59cLNEcK+nA6W9t2
 ooGsIEiFNB1Uvt01awcSDOPUaE47H490gdZS4uuz93dTtBX6uPc+wYX0elrR8t7p
 /JRDKNBhlIg=
 =ZLyW
 -----END PGP SIGNATURE-----

Merge tag 'irq-urgent-2024-05-05' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull irq fix from Ingo Molnar:
 "Fix suspicious RCU usage in __do_softirq()"

* tag 'irq-urgent-2024-05-05' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  softirq: Fix suspicious RCU usage in __do_softirq()
2024-05-05 10:12:32 -07:00
Linus Torvalds
2c17a1cd90 Probes fixes for v6.9-rc6:
- probe-events: Fix memory leak in parsing probe argument. There is a
   memory leak (forget to free an allocated buffer) in a memory allocation
   failure path. Fixes it to jump to the correct error handling code.
 -----BEGIN PGP SIGNATURE-----
 
 iQFPBAABCgA5FiEEh7BulGwFlgAOi5DV2/sHvwUrPxsFAmY2NRQbHG1hc2FtaS5o
 aXJhbWF0c3VAZ21haWwuY29tAAoJENv7B78FKz8bIacH/RmSQaraWiwQmMaWT8Pp
 wotOxtMYnl2uLNeVx3vn55+G1Xr/rJP3E9EBGTa+HMPky3trea07eBM5B3UnwT2y
 Y75Nhm6z3SFaLBygdKmQZgyIJF1W9w6J1cfqPwPlfR3h08a/9rNojd/DKBo7fLjk
 uwGAUHsB6sNhTvRF64wtr+I7V+8CGwNnApyQvf/mLnHsELerzm86nxDhXcfIvb1P
 UbM4nupqrV3QYCLYdXmma34PFFJzS3ioINGn692QtHFOSEdSwJfqsNv6AU/w98zD
 8o2rlSadc64Yl74vMLFRtBVS3K49VQXNgUUXjx2Gpj9/v80qn+B41HwaNSl1Lagx
 lIY=
 =tob5
 -----END PGP SIGNATURE-----

Merge tag 'probes-fixes-v6.9-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull probes fix from Masami Hiramatsu:

 - probe-events: Fix memory leak in parsing probe argument.

   There is a memory leak (forget to free an allocated buffer) in a
   memory allocation failure path. Fix it to jump to the correct error
   handling code.

* tag 'probes-fixes-v6.9-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  tracing/probes: Fix memory leak in traceprobe_parse_probe_arg_body()
2024-05-05 09:56:50 -07:00
Linus Torvalds
e92b99ae82 tracing and tracefs fixes for v6.9
- Fix RCU callback of freeing an eventfs_inode.
   The freeing of the eventfs_inode from the kref going to zero
   freed the contents of the eventfs_inode and then used kfree_rcu()
   to free the inode itself. But the contents should also be protected
   by RCU. Switch to a call_rcu() that calls a function to free all
   of the eventfs_inode after the RCU synchronization.
 
 - The tracing subsystem maps its own descriptor to a file represented by
   eventfs. The freeing of this descriptor needs to know when the
   last reference of an eventfs_inode is released, but currently
   there is no interface for that. Add a "release" callback to
   the eventfs_inode entry array that allows for freeing of data
   that can be referenced by the eventfs_inode being opened.
   Then increment the ref counter for this descriptor when the
   eventfs_inode file is created, and decrement/free it when the
   last reference to the eventfs_inode is released and the file
   is removed. This prevents races between freeing the descriptor
   and the opening of the eventfs file.
 
 - Fix the permission processing of eventfs.
   The change to make the permissions of eventfs default to the mount
   point but keep track of when changes were made had a side effect
   that could cause security concerns. When the tracefs is remounted
   with a given gid or uid, all the files within it should inherit
   that gid or uid. But if the admin had changed the permission of
   some file within the tracefs file system, it would not get updated
   by the remount. This caused the kselftest of file permissions
   to fail the second time it is run. The first time, all changes
   would look fine, but the second time, because the changes were
   "saved", the remount did not reset them.
 
   Create a link list of all existing tracefs inodes, and clear the
   saved flags on them on a remount if the remount changes the
   corresponding gid or uid fields.
 
   This also simplifies the code by removing the distinction between the
   toplevel eventfs and an instance eventfs. They should both act the
   same. They were different because of a misconception due to the
   remount not resetting the flags. Now that remount resets all the
   files and directories to default to the root node if a uid/gid is
   specified, it makes the logic simpler to implement.
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCZjXxzxQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qqzGAQCX8g7gtngGgwSsWqPW5GmecCifwFja
 k7cVEDhMYPnDeAEAkYi2ZBgJRkPsWPfMRClDK/DXP4woOo58asxtIxfTMgg=
 =mCkt
 -----END PGP SIGNATURE-----

Merge tag 'trace-v6.9-rc6-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull tracing and tracefs fixes from Steven Rostedt:

 - Fix RCU callback of freeing an eventfs_inode.

   The freeing of the eventfs_inode from the kref going to zero freed
   the contents of the eventfs_inode and then used kfree_rcu() to free
   the inode itself. But the contents should also be protected by RCU.
   Switch to a call_rcu() that calls a function to free all of the
   eventfs_inode after the RCU synchronization.

 - The tracing subsystem maps its own descriptor to a file represented
   by eventfs. The freeing of this descriptor needs to know when the
   last reference of an eventfs_inode is released, but currently there
   is no interface for that.

   Add a "release" callback to the eventfs_inode entry array that allows
   for freeing of data that can be referenced by the eventfs_inode being
   opened. Then increment the ref counter for this descriptor when the
   eventfs_inode file is created, and decrement/free it when the last
   reference to the eventfs_inode is released and the file is removed.
   This prevents races between freeing the descriptor and the opening of
   the eventfs file.

 - Fix the permission processing of eventfs.

   The change to make the permissions of eventfs default to the mount
   point but keep track of when changes were made had a side effect that
   could cause security concerns. When the tracefs is remounted with a
   given gid or uid, all the files within it should inherit that gid or
   uid. But if the admin had changed the permission of some file within
   the tracefs file system, it would not get updated by the remount.

   This caused the kselftest of file permissions to fail the second time
   it is run. The first time, all changes would look fine, but the
   second time, because the changes were "saved", the remount did not
   reset them.

   Create a link list of all existing tracefs inodes, and clear the
   saved flags on them on a remount if the remount changes the
   corresponding gid or uid fields.

   This also simplifies the code by removing the distinction between the
   toplevel eventfs and an instance eventfs. They should both act the
   same. They were different because of a misconception due to the
   remount not resetting the flags. Now that remount resets all the
   files and directories to default to the root node if a uid/gid is
   specified, it makes the logic simpler to implement.

* tag 'trace-v6.9-rc6-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  eventfs: Have "events" directory get permissions from its parent
  eventfs: Do not treat events directory different than other directories
  eventfs: Do not differentiate the toplevel events directory
  tracefs: Still use mount point as default permissions for instances
  tracefs: Reset permissions on remount if permissions are options
  eventfs: Free all of the eventfs_inode after RCU
  eventfs/tracing: Add callback for release of an eventfs_inode
2024-05-05 09:53:09 -07:00
Linus Torvalds
4fbcf58590 dma-mapping fix for Linux 6.9
- fix the combination of restricted pools and dynamic swiotlb
    (Will Deacon)
 -----BEGIN PGP SIGNATURE-----
 
 iQI/BAABCgApFiEEgdbnc3r/njty3Iq9D55TZVIEUYMFAmY1uVELHGhjaEBsc3Qu
 ZGUACgkQD55TZVIEUYP3ZBAAi+aGmWWnpF6ujgGTjLSABztNWimuyr8GgZwCiRL4
 otvp/u6Iq6kHQJvPvDpJVUYV80unqz4NV67JYnsOi1kX2QHVME8ActHrQ/tpbiyz
 QrIRQ75iQUH4PVlBubUHHT0/zZoHn5RB1D8rB1vRBIxR+ApN2LIUq74d5W6YMcoE
 LcatCYLbomKovRFEornQ7+a9rHkiZvUPwbXqpxPUVAUnpaS2cTy6Tc5EmKOu00yi
 iMEvx5Hzmb2we0oHTwTNnrjzpmSTNww8geNOKBYRij+3VWBeb1weapJEl/EJ3hRh
 B7xkSNvFPMDMVlTUwO4+Bb6W76xbVXteiFsCatGV+2EUmJUlpw50uEUmA/smACuV
 Aw9oz6MZEj0VZjY+2kliYxO5sfgeU2Is/ZS2iTPB2pNcYlHppG4Fn4Bob9E+MJ9p
 aR+D4NbcrjM/PS4yIgto9/lyjQKu/Vs2T2c8eblE9Vp+io0/ZLI1dguOspRx2eAd
 sWSNZBSTPjrFQJuuszS+skws+s6j9hKCwi6N4Neb39+HNWvjJa0SYvBDFjoXBbd6
 kfwMWvMwRNDd0YhGAzfapPguy+FEtAoJ6s7SSSLG1XQ3BfKoC2YTQKjfG9aid+n4
 MmoAL+UGnXw31IAsITAQGMFC6h41mhNDlKPXJIm4/n8PEW8P7GIQugHN5SvuNXnN
 qk8=
 =05zN
 -----END PGP SIGNATURE-----

Merge tag 'dma-mapping-6.9-2024-05-04' of git://git.infradead.org/users/hch/dma-mapping

Pull dma-mapping fix from Christoph Hellwig:

 - fix the combination of restricted pools and dynamic swiotlb
   (Will Deacon)

* tag 'dma-mapping-6.9-2024-05-04' of git://git.infradead.org/users/hch/dma-mapping:
  swiotlb: initialise restricted pool list_head when SWIOTLB_DYNAMIC=y
2024-05-05 09:49:21 -07:00
Steven Rostedt (Google)
b63db58e2f eventfs/tracing: Add callback for release of an eventfs_inode
Synthetic events create and destroy tracefs files when they are created
and removed. The tracing subsystem has its own file descriptor
representing the state of the events attached to the tracefs files.
There's a race between the eventfs files and this file descriptor of the
tracing system where the following can cause an issue:

With two scripts 'A' and 'B' doing:

  Script 'A':
    echo "hello int aaa" > /sys/kernel/tracing/synthetic_events
    while :
    do
      echo 0 > /sys/kernel/tracing/events/synthetic/hello/enable
    done

  Script 'B':
    echo > /sys/kernel/tracing/synthetic_events

Script 'A' creates a synthetic event "hello" and then just writes zero
into its enable file.

Script 'B' removes all synthetic events (including the newly created
"hello" event).

What happens is that the opening of the "enable" file has:

 {
	struct trace_event_file *file = inode->i_private;
	int ret;

	ret = tracing_check_open_get_tr(file->tr);
 [..]

But deleting the events frees the "file" descriptor, and a "use after
free" happens with the dereference at "file->tr".

The file descriptor does have a reference counter, but there needs to be a
way to decrement it from the eventfs when the eventfs_inode is removed
that represents this file descriptor.

Add an optional "release" callback to the eventfs_entry array structure,
that gets called when the eventfs file is about to be removed. This allows
for the creating on the eventfs file to increment the tracing file
descriptor ref counter. When the eventfs file is deleted, it can call the
release function that will call the put function for the tracing file
descriptor.

This will protect the tracing file from being freed while a eventfs file
that references it is being opened.

Link: https://lore.kernel.org/linux-trace-kernel/20240426073410.17154-1-Tze-nan.Wu@mediatek.com/
Link: https://lore.kernel.org/linux-trace-kernel/20240502090315.448cba46@gandalf.local.home

Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Fixes: 5790b1fb3d ("eventfs: Remove eventfs_file and just use eventfs_inode")
Reported-by: Tze-nan wu <Tze-nan.Wu@mediatek.com>
Tested-by: Tze-nan Wu (吳澤南) <Tze-nan.Wu@mediatek.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-05-04 04:25:37 -04:00
Thomas Weißschuh
0e148d3cca stackleak: Use a copy of the ctl_table argument
Sysctl handlers are not supposed to modify the ctl_table passed to them.
Adapt the logic to work with a temporary variable, similar to how it is
done in other parts of the kernel.

This is also a prerequisite to enforce the immutability of the argument
through the callbacks.

Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
Reviewed-by: Tycho Andersen <tycho@tycho.pizza>
Link: https://lore.kernel.org/r/20240503-sysctl-const-stackleak-v1-1-603fecb19170@weissschuh.net
Signed-off-by: Kees Cook <keescook@chromium.org>
2024-05-03 12:35:12 -07:00
Al Viro
b1439b179d swsusp: don't bother with setting block size
same as with the swap...

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-05-02 17:39:44 -04:00
Jakub Kicinski
e958da0ddb Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR.

Conflicts:

include/linux/filter.h
kernel/bpf/core.c
  66e13b615a ("bpf: verifier: prevent userspace memory access")
  d503a04f8b ("bpf: Add support for certain atomics in bpf_arena to x86 JIT")
https://lore.kernel.org/all/20240429114939.210328b0@canb.auug.org.au/

No adjacent changes.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-02 12:06:25 -07:00
Linus Torvalds
545c494465 Including fixes from bpf.
Relatively calm week, likely due to public holiday in most places.
 No known outstanding regressions.
 
 Current release - regressions:
 
   - rxrpc: fix wrong alignmask in __page_frag_alloc_align()
 
   - eth: e1000e: change usleep_range to udelay in PHY mdic access
 
 Previous releases - regressions:
 
   - gro: fix udp bad offset in socket lookup
 
   - bpf: fix incorrect runtime stat for arm64
 
   - tipc: fix UAF in error path
 
   - netfs: fix a potential infinite loop in extract_user_to_sg()
 
   - eth: ice: ensure the copied buf is NUL terminated
 
   - eth: qeth: fix kernel panic after setting hsuid
 
 Previous releases - always broken:
 
   - bpf:
     - verifier: prevent userspace memory access
     - xdp: use flags field to disambiguate broadcast redirect
 
   - bridge: fix multicast-to-unicast with fraglist GSO
 
   - mptcp: ensure snd_nxt is properly initialized on connect
 
   - nsh: fix outer header access in nsh_gso_segment().
 
   - eth: bcmgenet: fix racing registers access
 
   - eth: vxlan: fix stats counters.
 
 Misc:
 
   - a bunch of MAINTAINERS file updates
 
 Signed-off-by: Paolo Abeni <pabeni@redhat.com>
 -----BEGIN PGP SIGNATURE-----
 
 iQJGBAABCAAwFiEEg1AjqC77wbdLX2LbKSR5jcyPE6QFAmYzaRsSHHBhYmVuaUBy
 ZWRoYXQuY29tAAoJECkkeY3MjxOkh70P/jzsTsvzHspu3RUwcsyvWpSoJPcxP2tF
 5SKR66o8sbSjB5I26zUi/LtRZgbPO32GmLN2Y8GvP74h9lwKdDo4AY4volZKCT6f
 lRG6GohvMa0lSPSn1fti7CKVzDOsaTHvLz3uBBr+Xb9ITCKh+I+zGEEDGj/47SQN
 tmDWHPF8OMs2ezmYS5NqRIQ3CeRz6uyLmEoZhVm4SolypZ18oEg7GCtL3u6U48n+
 e3XB3WwKl0ZxK8ipvPgUDwGIDuM5hEyAaeNon3zpYGoqitRsRITUjULpb9dT4DtJ
 Jma3OkarFJNXgm4N/p/nAtQ9AdiAloF9ivZXs2t0XCdrrUZJUh05yuikoX+mLfpw
 GedG2AbaVl6mdqNkrHeyf5SXKuiPgeCLVfF2xMjS0l1kFbY+Bt8BqnRSdOrcoUG0
 zlSzBeBtajttMdnalWv2ZshjP8uo/NjXydUjoVNwuq8xGO5wP+zhNnwhOvecNyUg
 t7q2PLokahlz4oyDqyY/7SQ0hSEndqxOlt43I6CthoWH0XkS83nTPdQXcTKQParD
 ntJUk5QYwefUT1gimbn/N8GoP7a1+ysWiqcf/7+SNm932gJGiDt36+HOEmyhIfIG
 IDWTWJJW64SnPBIUw59MrG7hMtbfaiZiFQqeUJQpFVrRr+tg5z5NUZ5thA+EJVd8
 qiVDvmngZFiv
 =f6KY
 -----END PGP SIGNATURE-----

Merge tag 'net-6.9-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Pull networking fixes from Paolo Abeni:
 "Including fixes from bpf.

  Relatively calm week, likely due to public holiday in most places. No
  known outstanding regressions.

  Current release - regressions:

   - rxrpc: fix wrong alignmask in __page_frag_alloc_align()

   - eth: e1000e: change usleep_range to udelay in PHY mdic access

  Previous releases - regressions:

   - gro: fix udp bad offset in socket lookup

   - bpf: fix incorrect runtime stat for arm64

   - tipc: fix UAF in error path

   - netfs: fix a potential infinite loop in extract_user_to_sg()

   - eth: ice: ensure the copied buf is NUL terminated

   - eth: qeth: fix kernel panic after setting hsuid

  Previous releases - always broken:

   - bpf:
       - verifier: prevent userspace memory access
       - xdp: use flags field to disambiguate broadcast redirect

   - bridge: fix multicast-to-unicast with fraglist GSO

   - mptcp: ensure snd_nxt is properly initialized on connect

   - nsh: fix outer header access in nsh_gso_segment().

   - eth: bcmgenet: fix racing registers access

   - eth: vxlan: fix stats counters.

  Misc:

   - a bunch of MAINTAINERS file updates"

* tag 'net-6.9-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (45 commits)
  MAINTAINERS: mark MYRICOM MYRI-10G as Orphan
  MAINTAINERS: remove Ariel Elior
  net: gro: add flush check in udp_gro_receive_segment
  net: gro: fix udp bad offset in socket lookup by adding {inner_}network_offset to napi_gro_cb
  ipv4: Fix uninit-value access in __ip_make_skb()
  s390/qeth: Fix kernel panic after setting hsuid
  vxlan: Pull inner IP header in vxlan_rcv().
  tipc: fix a possible memleak in tipc_buf_append
  tipc: fix UAF in error path
  rxrpc: Clients must accept conn from any address
  net: core: reject skb_copy(_expand) for fraglist GSO skbs
  net: bridge: fix multicast-to-unicast with fraglist GSO
  mptcp: ensure snd_nxt is properly initialized on connect
  e1000e: change usleep_range to udelay in PHY mdic access
  net: dsa: mv88e6xxx: Fix number of databases for 88E6141 / 88E6341
  cxgb4: Properly lock TX queue for the selftest.
  rxrpc: Fix using alignmask being zero for __page_frag_alloc_align()
  vxlan: Add missing VNI filter counter update in arp_reduce().
  vxlan: Fix racy device stats updates.
  net: qede: use return from qede_parse_actions()
  ...
2024-05-02 08:51:47 -07:00
Will Deacon
75961ffb5c swiotlb: initialise restricted pool list_head when SWIOTLB_DYNAMIC=y
Using restricted DMA pools (CONFIG_DMA_RESTRICTED_POOL=y) in conjunction
with dynamic SWIOTLB (CONFIG_SWIOTLB_DYNAMIC=y) leads to the following
crash when initialising the restricted pools at boot-time:

  | Unable to handle kernel NULL pointer dereference at virtual address 0000000000000008
  | Internal error: Oops: 0000000096000005 [#1] PREEMPT SMP
  | pc : rmem_swiotlb_device_init+0xfc/0x1ec
  | lr : rmem_swiotlb_device_init+0xf0/0x1ec
  | Call trace:
  |  rmem_swiotlb_device_init+0xfc/0x1ec
  |  of_reserved_mem_device_init_by_idx+0x18c/0x238
  |  of_dma_configure_id+0x31c/0x33c
  |  platform_dma_configure+0x34/0x80

faddr2line reveals that the crash is in the list validation code:

  include/linux/list.h:83
  include/linux/rculist.h:79
  include/linux/rculist.h:106
  kernel/dma/swiotlb.c:306
  kernel/dma/swiotlb.c:1695

because add_mem_pool() is trying to list_add_rcu() to a NULL
'mem->pools'.

Fix the crash by initialising the 'mem->pools' list_head in
rmem_swiotlb_device_init() before calling add_mem_pool().

Reported-by: Nikita Ioffe <ioffe@google.com>
Tested-by: Nikita Ioffe <ioffe@google.com>
Fixes: 1aaa736815 ("swiotlb: allocate a new memory pool when existing pools are full")
Signed-off-by: Will Deacon <will@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2024-05-02 14:57:04 +02:00
Ard Biesheuvel
377d909511 vmlinux: Avoid weak reference to notes section
Weak references are references that are permitted to remain unsatisfied
in the final link. This means they cannot be implemented using place
relative relocations, resulting in GOT entries when using position
independent code generation.

The notes section should always exist, so the weak annotations can be
omitted.

Acked-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
2024-05-02 19:48:26 +09:00
Ard Biesheuvel
951bcae6c5 kallsyms: Avoid weak references for kallsyms symbols
kallsyms is a directory of all the symbols in the vmlinux binary, and so
creating it is somewhat of a chicken-and-egg problem, as its non-zero
size affects the layout of the binary, and therefore the values of the
symbols.

For this reason, the kernel is linked more than once, and the first pass
does not include any kallsyms data at all. For the linker to accept
this, the symbol declarations describing the kallsyms metadata are
emitted as having weak linkage, so they can remain unsatisfied. During
the subsequent passes, the weak references are satisfied by the kallsyms
metadata that was constructed based on information gathered from the
preceding passes.

Weak references lead to somewhat worse codegen, because taking their
address may need to produce NULL (if the reference was unsatisfied), and
this is not usually supported by RIP or PC relative symbol references.

Given that these references are ultimately always satisfied in the final
link, let's drop the weak annotation, and instead, provide fallback
definitions in the linker script that are only emitted if an unsatisfied
reference exists.

While at it, drop the FRV specific annotation that these symbols reside
in .rodata - FRV is long gone.

Tested-by: Nick Desaulniers <ndesaulniers@google.com> # Boot
Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Link: https://lkml.kernel.org/r/20230504174320.3930345-1-ardb%40kernel.org
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
2024-05-02 19:48:26 +09:00
Vadim Fedorenko
ac2f438c2a bpf: crypto: fix build when CONFIG_CRYPTO=m
Crypto subsytem can be build as a module. In this case we still have to
build BPF crypto framework otherwise the build will fail.

Fixes: 3e1c6f3540 ("bpf: make common crypto API for TC/XDP programs")
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202405011634.4JK40epY-lkp@intel.com/
Signed-off-by: Vadim Fedorenko <vadfed@meta.com>
Link: https://lore.kernel.org/r/20240501170130.1682309-1-vadfed@meta.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2024-05-01 13:32:26 -07:00
Kees Cook
a284e43852 hardening: Enable KCFI and some other options
Add some stuff that got missed along the way:

- CONFIG_UNWIND_PATCH_PAC_INTO_SCS=y so SCS vs PAC is hardware
  selectable.

- CONFIG_X86_KERNEL_IBT=y while a default, just be sure.

- CONFIG_CFI_CLANG=y globally.

- CONFIG_PAGE_TABLE_CHECK=y for userspace mapping sanity.

Reviewed-by: Nathan Chancellor <nathan@kernel.org>
Link: https://lore.kernel.org/r/20240501193709.make.982-kees@kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
2024-05-01 12:38:14 -07:00
Andrii Nakryiko
e03c05ac98 rethook: honor CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING in rethook_try_get()
Take into account CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING when validating
that RCU is watching when trying to setup rethooko on a function entry.

One notable exception when we force rcu_is_watching() check is
CONFIG_KPROBE_EVENTS_ON_NOTRACE=y case, in which case kretprobes will use
old-style int3-based workflow instead of relying on ftrace, making RCU
watching check important to validate.

This further (in addition to improvements in the previous patch)
improves BPF multi-kretprobe (which rely on rethook) runtime throughput
by 2.3%, according to BPF benchmarks ([0]).

  [0] https://lore.kernel.org/bpf/CAEf4BzauQ2WKMjZdc9s0rBWa01BYbgwHN6aNDXQSHYia47pQ-w@mail.gmail.com/

Link: https://lore.kernel.org/all/20240418190909.704286-2-andrii@kernel.org/

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Paul E. McKenney <paulmck@kernel.org>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2024-05-01 23:18:48 +09:00
Andrii Nakryiko
b0e28a4b5b ftrace: make extra rcu_is_watching() validation check optional
Introduce CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING config option to
control whether ftrace low-level code performs additional
rcu_is_watching()-based validation logic in an attempt to catch noinstr
violations.

This check is expected to never be true and is mostly useful for
low-level validation of ftrace subsystem invariants. For most users it
should probably be kept disabled to eliminate unnecessary runtime
overhead.

This improves BPF multi-kretprobe (relying on ftrace and rethook
infrastructure) runtime throughput by 2%, according to BPF benchmarks ([0]).

  [0] https://lore.kernel.org/bpf/CAEf4BzauQ2WKMjZdc9s0rBWa01BYbgwHN6aNDXQSHYia47pQ-w@mail.gmail.com/

Link: https://lore.kernel.org/all/20240418190909.704286-1-andrii@kernel.org/

Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Paul E. McKenney <paulmck@kernel.org>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2024-05-01 23:18:48 +09:00
Jonathan Haslam
0dc715295d uprobes: reduce contention on uprobes_tree access
Active uprobes are stored in an RB tree and accesses to this tree are
dominated by read operations. Currently these accesses are serialized by
a spinlock but this leads to enormous contention when large numbers of
threads are executing active probes.

This patch converts the spinlock used to serialize access to the
uprobes_tree RB tree into a reader-writer spinlock. This lock type
aligns naturally with the overwhelmingly read-only nature of the tree
usage here. Although the addition of reader-writer spinlocks are
discouraged [0], this fix is proposed as an interim solution while an
RCU based approach is implemented (that work is in a nascent form). This
fix also has the benefit of being trivial, self contained and therefore
simple to backport.

We have used a uprobe benchmark from the BPF selftests [1] to estimate
the improvements. Each block of results below show 1 line per execution
of the benchmark ("the "Summary" line) and each line is a run with one
more thread added - a thread is a "producer". The lines are edited to
remove extraneous output.

The tests were executed with this driver script:

for num_threads in {1..20}
do
  sudo ./bench -a -p $num_threads trig-uprobe-nop | grep Summary
done

SPINLOCK (BEFORE)
==================
Summary: hits    1.396 ± 0.007M/s (  1.396M/prod)
Summary: hits    1.656 ± 0.016M/s (  0.828M/prod)
Summary: hits    2.246 ± 0.008M/s (  0.749M/prod)
Summary: hits    2.114 ± 0.010M/s (  0.529M/prod)
Summary: hits    2.013 ± 0.009M/s (  0.403M/prod)
Summary: hits    1.753 ± 0.008M/s (  0.292M/prod)
Summary: hits    1.847 ± 0.001M/s (  0.264M/prod)
Summary: hits    1.889 ± 0.001M/s (  0.236M/prod)
Summary: hits    1.833 ± 0.006M/s (  0.204M/prod)
Summary: hits    1.900 ± 0.003M/s (  0.190M/prod)
Summary: hits    1.918 ± 0.006M/s (  0.174M/prod)
Summary: hits    1.925 ± 0.002M/s (  0.160M/prod)
Summary: hits    1.837 ± 0.001M/s (  0.141M/prod)
Summary: hits    1.898 ± 0.001M/s (  0.136M/prod)
Summary: hits    1.799 ± 0.016M/s (  0.120M/prod)
Summary: hits    1.850 ± 0.005M/s (  0.109M/prod)
Summary: hits    1.816 ± 0.002M/s (  0.101M/prod)
Summary: hits    1.787 ± 0.001M/s (  0.094M/prod)
Summary: hits    1.764 ± 0.002M/s (  0.088M/prod)

RW SPINLOCK (AFTER)
===================
Summary: hits    1.444 ± 0.020M/s (  1.444M/prod)
Summary: hits    2.279 ± 0.011M/s (  1.139M/prod)
Summary: hits    3.422 ± 0.014M/s (  1.141M/prod)
Summary: hits    3.565 ± 0.017M/s (  0.891M/prod)
Summary: hits    2.671 ± 0.013M/s (  0.534M/prod)
Summary: hits    2.409 ± 0.005M/s (  0.401M/prod)
Summary: hits    2.485 ± 0.008M/s (  0.355M/prod)
Summary: hits    2.496 ± 0.003M/s (  0.312M/prod)
Summary: hits    2.585 ± 0.002M/s (  0.287M/prod)
Summary: hits    2.908 ± 0.011M/s (  0.291M/prod)
Summary: hits    2.346 ± 0.016M/s (  0.213M/prod)
Summary: hits    2.804 ± 0.004M/s (  0.234M/prod)
Summary: hits    2.556 ± 0.001M/s (  0.197M/prod)
Summary: hits    2.754 ± 0.004M/s (  0.197M/prod)
Summary: hits    2.482 ± 0.002M/s (  0.165M/prod)
Summary: hits    2.412 ± 0.005M/s (  0.151M/prod)
Summary: hits    2.710 ± 0.003M/s (  0.159M/prod)
Summary: hits    2.826 ± 0.005M/s (  0.157M/prod)
Summary: hits    2.718 ± 0.001M/s (  0.143M/prod)
Summary: hits    2.844 ± 0.006M/s (  0.142M/prod)

The numbers in parenthesis give averaged throughput per thread which is
of greatest interest here as a measure of scalability. Improvements are
in the order of 22 - 68% with this particular benchmark (mean = 43%).

V2:
 - Updated commit message to include benchmark results.

[0] https://docs.kernel.org/locking/spinlocks.html
[1] https://github.com/torvalds/linux/blob/master/tools/testing/selftests/bpf/benchs/bench_trigger.c

Link: https://lore.kernel.org/all/20240422102306.6026-1-jonathan.haslam@gmail.com/

Signed-off-by: Jonathan Haslam <jonathan.haslam@gmail.com>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2024-05-01 23:18:47 +09:00
Kui-Feng Lee
5120d167e2 rethook: Remove warning messages printed for finding return address of a frame.
The function rethook_find_ret_addr() prints a warning message and returns 0
when the target task is running and is not the "current" task in order to
prevent the incorrect return address, although it still may return an
incorrect address.

However, the warning message turns into noise when BPF profiling programs
call bpf_get_task_stack() on running tasks in a firm with a large number of
hosts.

The callers should be aware and willing to take the risk of receiving an
incorrect return address from a task that is currently running other than
the "current" one. A warning is not needed here as the callers are intent
on it.

Link: https://lore.kernel.org/all/20240408175140.60223-1-thinker.li@gmail.com/

Acked-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2024-05-01 23:18:47 +09:00
Ye Bin
20fe4d07bd tracing/probes: support '%pD' type for print struct file's name
As like '%pd' type, this patch supports print type '%pD' for print file's
name. For example "name=$arg1:%pD" casts the `$arg1` as (struct file*),
dereferences the "file.f_path.dentry.d_name.name" field and stores it to
"name" argument as a kernel string.
Here is an example:
[tracing]# echo 'p:testprobe vfs_read name=$arg1:%pD' > kprobe_event
[tracing]# echo 1 > events/kprobes/testprobe/enable
[tracing]# grep -q "1" events/kprobes/testprobe/enable
[tracing]# echo 0 > events/kprobes/testprobe/enable
[tracing]# grep "vfs_read" trace | grep "enable"
            grep-15108   [003] .....  5228.328609: testprobe: (vfs_read+0x4/0xbb0) name="enable"

Note that this expects the given argument (e.g. $arg1) is an address of struct
file. User must ensure it.

Link: https://lore.kernel.org/all/20240322064308.284457-3-yebin10@huawei.com/
[Masami: replaced "previous patch" with '%pd' type]

Signed-off-by: Ye Bin <yebin10@huawei.com>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2024-05-01 23:18:47 +09:00
Ye Bin
d9b15224dd tracing/probes: support '%pd' type for print struct dentry's name
During fault locating, the file name needs to be printed based on the
dentry  address. The offset needs to be calculated each time, which
is troublesome. Similar to printk, kprobe support print type '%pd' for
print dentry's name. For example "name=$arg1:%pd" casts the `$arg1`
as (struct dentry *), dereferences the "d_name.name" field and stores
it to "name" argument as a kernel string.
Here is an example:
[tracing]# echo 'p:testprobe dput name=$arg1:%pd' > kprobe_events
[tracing]# echo 1 > events/kprobes/testprobe/enable
[tracing]# grep -q "1" events/kprobes/testprobe/enable
[tracing]# echo 0 > events/kprobes/testprobe/enable
[tracing]# cat trace | grep "enable"
	    bash-14844   [002] ..... 16912.889543: testprobe: (dput+0x4/0x30) name="enable"
            grep-15389   [003] ..... 16922.834182: testprobe: (dput+0x4/0x30) name="enable"
            grep-15389   [003] ..... 16922.836103: testprobe: (dput+0x4/0x30) name="enable"
            bash-14844   [001] ..... 16931.820909: testprobe: (dput+0x4/0x30) name="enable"

Note that this expects the given argument (e.g. $arg1) is an address of struct
dentry. User must ensure it.

Link: https://lore.kernel.org/all/20240322064308.284457-2-yebin10@huawei.com/

Signed-off-by: Ye Bin <yebin10@huawei.com>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2024-05-01 23:18:47 +09:00
Andrii Nakryiko
cdf355cc60 uprobes: add speculative lockless system-wide uprobe filter check
It's very common with BPF-based uprobe/uretprobe use cases to have
a system-wide (not PID specific) probes used. In this case uprobe's
trace_uprobe_filter->nr_systemwide counter is bumped at registration
time, and actual filtering is short circuited at the time when
uprobe/uretprobe is triggered.

This is a great optimization, and the only issue with it is that to even
get to checking this counter uprobe subsystem is taking
read-side trace_uprobe_filter->rwlock. This is actually noticeable in
profiles and is just another point of contention when uprobe is
triggered on multiple CPUs simultaneously.

This patch moves this nr_systemwide check outside of filter list's
rwlock scope, as rwlock is meant to protect list modification, while
nr_systemwide-based check is speculative and racy already, despite the
lock (as discussed in [0]). trace_uprobe_filter_remove() and
trace_uprobe_filter_add() already check for filter->nr_systewide
explicitly outside of __uprobe_perf_filter, so no modifications are
required there.

Confirming with BPF selftests's based benchmarks.

BEFORE (based on changes in previous patch)
===========================================
uprobe-nop     :    2.732 ± 0.022M/s
uprobe-push    :    2.621 ± 0.016M/s
uprobe-ret     :    1.105 ± 0.007M/s
uretprobe-nop  :    1.396 ± 0.007M/s
uretprobe-push :    1.347 ± 0.008M/s
uretprobe-ret  :    0.800 ± 0.006M/s

AFTER
=====
uprobe-nop     :    2.878 ± 0.017M/s (+5.5%, total +8.3%)
uprobe-push    :    2.753 ± 0.013M/s (+5.3%, total +10.2%)
uprobe-ret     :    1.142 ± 0.010M/s (+3.8%, total +3.8%)
uretprobe-nop  :    1.444 ± 0.008M/s (+3.5%, total +6.5%)
uretprobe-push :    1.410 ± 0.010M/s (+4.8%, total +7.1%)
uretprobe-ret  :    0.816 ± 0.002M/s (+2.0%, total +3.9%)

In the above, first percentage value is based on top of previous patch
(lazy uprobe buffer optimization), while the "total" percentage is
based on kernel without any of the changes in this patch set.

As can be seen, we get about 4% - 10% speed up, in total, with both lazy
uprobe buffer and speculative filter check optimizations.

  [0] https://lore.kernel.org/bpf/20240313131926.GA19986@redhat.com/

Reviewed-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/all/20240318181728.2795838-4-andrii@kernel.org/

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2024-05-01 23:18:46 +09:00
Andrii Nakryiko
1b8f85defb uprobes: prepare uprobe args buffer lazily
uprobe_cpu_buffer and corresponding logic to store uprobe args into it
are used for uprobes/uretprobes that are created through tracefs or
perf events.

BPF is yet another user of uprobe/uretprobe infrastructure, but doesn't
need uprobe_cpu_buffer and associated data. For BPF-only use cases this
buffer handling and preparation is a pure overhead. At the same time,
BPF-only uprobe/uretprobe usage is very common in practice. Also, for
a lot of cases applications are very senstivie to performance overheads,
as they might be tracing a very high frequency functions like
malloc()/free(), so every bit of performance improvement matters.

All that is to say that this uprobe_cpu_buffer preparation is an
unnecessary overhead that each BPF user of uprobes/uretprobe has to pay.
This patch is changing this by making uprobe_cpu_buffer preparation
optional. It will happen only if either tracefs-based or perf event-based
uprobe/uretprobe consumer is registered for given uprobe/uretprobe. For
BPF-only use cases this step will be skipped.

We used uprobe/uretprobe benchmark which is part of BPF selftests (see [0])
to estimate the improvements. We have 3 uprobe and 3 uretprobe
scenarios, which vary an instruction that is replaced by uprobe: nop
(fastest uprobe case), `push rbp` (typical case), and non-simulated
`ret` instruction (slowest case). Benchmark thread is constantly calling
user space function in a tight loop. User space function has attached
BPF uprobe or uretprobe program doing nothing but atomic counter
increments to count number of triggering calls. Benchmark emits
throughput in millions of executions per second.

BEFORE these changes
====================
uprobe-nop     :    2.657 ± 0.024M/s
uprobe-push    :    2.499 ± 0.018M/s
uprobe-ret     :    1.100 ± 0.006M/s
uretprobe-nop  :    1.356 ± 0.004M/s
uretprobe-push :    1.317 ± 0.019M/s
uretprobe-ret  :    0.785 ± 0.007M/s

AFTER these changes
===================
uprobe-nop     :    2.732 ± 0.022M/s (+2.8%)
uprobe-push    :    2.621 ± 0.016M/s (+4.9%)
uprobe-ret     :    1.105 ± 0.007M/s (+0.5%)
uretprobe-nop  :    1.396 ± 0.007M/s (+2.9%)
uretprobe-push :    1.347 ± 0.008M/s (+2.3%)
uretprobe-ret  :    0.800 ± 0.006M/s (+1.9)

So the improvements on this particular machine seems to be between 2% and 5%.

  [0] https://github.com/torvalds/linux/blob/master/tools/testing/selftests/bpf/benchs/bench_trigger.c

Reviewed-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/all/20240318181728.2795838-3-andrii@kernel.org/

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2024-05-01 23:18:46 +09:00
Andrii Nakryiko
3eaea21b4d uprobes: encapsulate preparation of uprobe args buffer
Move the logic of fetching temporary per-CPU uprobe buffer and storing
uprobes args into it to a new helper function. Store data size as part
of this buffer, simplifying interfaces a bit, as now we only pass single
uprobe_cpu_buffer reference around, instead of pointer + dsize.

This logic was duplicated across uprobe_dispatcher and uretprobe_dispatcher,
and now will be centralized. All this is also in preparation to make
this uprobe_cpu_buffer handling logic optional in the next patch.

Link: https://lore.kernel.org/all/20240318181728.2795838-2-andrii@kernel.org/
[Masami: update for v6.9-rc3 kernel]

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Reviewed-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2024-05-01 23:18:46 +09:00
Uladzislau Rezki (Sony)
64619b283b Merge branches 'fixes.2024.04.15a', 'misc.2024.04.12a', 'rcu-sync-normal-improve.2024.04.15a', 'rcu-tasks.2024.04.15a' and 'rcutorture.2024.04.15a' into rcu-merge.2024.04.15a
fixes.2024.04.15a: RCU fixes
misc.2024.04.12a: Miscellaneous fixes
rcu-sync-normal-improve.2024.04.15a: Improving synchronize_rcu() call
rcu-tasks.2024.04.15a: Tasks RCU updates
rcutorture.2024.04.15a: Torture-test updates
2024-05-01 13:04:02 +02:00
Stanislav Fomichev
543576ec15 bpf: Add BPF_PROG_TYPE_CGROUP_SKB attach type enforcement in BPF_LINK_CREATE
bpf_prog_attach uses attach_type_to_prog_type to enforce proper
attach type for BPF_PROG_TYPE_CGROUP_SKB. link_create uses
bpf_prog_get and relies on bpf_prog_attach_check_attach_type
to properly verify prog_type <> attach_type association.

Add missing attach_type enforcement for the link_create case.
Otherwise, it's currently possible to attach cgroup_skb prog
types to other cgroup hooks.

Fixes: af6eea5743 ("bpf: Implement bpf_link-based cgroup BPF program attachment")
Link: https://lore.kernel.org/bpf/0000000000004792a90615a1dde0@google.com/
Reported-by: syzbot+838346b979830606c854@syzkaller.appspotmail.com
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20240426231621.2716876-2-sdf@google.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2024-04-30 10:43:37 -07:00
Palmer Dabbelt
4202f62cb6
Merge patch series "riscv: Create and document PR_RISCV_SET_ICACHE_FLUSH_CTX prctl"
Charlie Jenkins <charlie@rivosinc.com> says:

Improve the performance of icache flushing by creating a new prctl flag
PR_RISCV_SET_ICACHE_FLUSH_CTX. The interface is left generic to allow
for future expansions such as with the proposed J extension [1].

Documentation is also provided to explain the use case.

Patch sent to add PR_RISCV_SET_ICACHE_FLUSH_CTX to man-pages [2].

[1] https://github.com/riscv/riscv-j-extension
[2] https://lore.kernel.org/linux-man/20240124-fencei_prctl-v1-1-0bddafcef331@rivosinc.com

* b4-shazam-merge:
  cpumask: Add assign cpu
  documentation: Document PR_RISCV_SET_ICACHE_FLUSH_CTX prctl
  riscv: Include riscv_set_icache_flush_ctx prctl
  riscv: Remove unnecessary irqflags processor.h include

Link: https://lore.kernel.org/r/20240312-fencei-v13-0-4b6bdc2bbf32@rivosinc.com
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
2024-04-30 10:35:42 -07:00
Jiri Olsa
5c919acef8 bpf: Add support for kprobe session cookie
Adding support for cookie within the session of kprobe multi
entry and return program.

The session cookie is u64 value and can be retrieved be new
kfunc bpf_session_cookie, which returns pointer to the cookie
value. The bpf program can use the pointer to store (on entry)
and load (on return) the value.

The cookie value is implemented via fprobe feature that allows
to share values between entry and return ftrace fprobe callbacks.

Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20240430112830.1184228-4-jolsa@kernel.org
2024-04-30 09:45:53 -07:00
Jiri Olsa
adf46d88ae bpf: Add support for kprobe session context
Adding struct bpf_session_run_ctx object to hold session related
data, which is atm is_return bool and data pointer coming in
following changes.

Placing bpf_session_run_ctx layer in between bpf_run_ctx and
bpf_kprobe_multi_run_ctx so the session data can be retrieved
regardless of if it's kprobe_multi or uprobe_multi link, which
support is coming in future. This way both kprobe_multi and
uprobe_multi can use same kfuncs to access the session data.

Adding bpf_session_is_return kfunc that returns true if the
bpf program is executed from the exit probe of the kprobe multi
link attached in wrapper mode. It returns false otherwise.

Adding new kprobe hook for kprobe program type.

Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20240430112830.1184228-3-jolsa@kernel.org
2024-04-30 09:45:53 -07:00
Jiri Olsa
535a3692ba bpf: Add support for kprobe session attach
Adding support to attach bpf program for entry and return probe
of the same function. This is common use case which at the moment
requires to create two kprobe multi links.

Adding new BPF_TRACE_KPROBE_SESSION attach type that instructs
kernel to attach single link program to both entry and exit probe.

It's possible to control execution of the bpf program on return
probe simply by returning zero or non zero from the entry bpf
program execution to execute or not the bpf program on return
probe respectively.

Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20240430112830.1184228-2-jolsa@kernel.org
2024-04-30 09:45:53 -07:00
Benjamin Tissoires
a891711d01 bpf: Do not walk twice the hash map on free
If someone stores both a timer and a workqueue in a hash map, on free, we
would walk it twice.

Add a check in htab_free_malloced_timers_or_wq and free the timers and
workqueues if they are present.

Fixes: 246331e3f1 ("bpf: allow struct bpf_wq to be embedded in arraymaps and hashmaps")
Signed-off-by: Benjamin Tissoires <bentiss@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/bpf/20240430-bpf-next-v3-2-27afe7f3b17c@kernel.org
2024-04-30 16:28:46 +02:00
Benjamin Tissoires
b98a5c68cc bpf: Do not walk twice the map on free
If someone stores both a timer and a workqueue in a map, on free
we would walk it twice.

Add a check in array_map_free_timers_wq and free the timers and
workqueues if they are present.

Fixes: 246331e3f1 ("bpf: allow struct bpf_wq to be embedded in arraymaps and hashmaps")
Signed-off-by: Benjamin Tissoires <bentiss@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/bpf/20240430-bpf-next-v3-1-27afe7f3b17c@kernel.org
2024-04-30 16:28:33 +02:00
Justin Stitt
7b831bd3cf PM: hibernate: replace deprecated strncpy() with strscpy()
strncpy() is deprecated for use on NUL-terminated destination strings
[1] and as such we should prefer more robust and less ambiguous string
interfaces.

This kernel config option is simply assigned with the resume_file
buffer. It should be NUL-terminated but not necessarily NUL-padded as
per its further usage with other string apis:
|	static int __init find_resume_device(void)
|	{
|		if (!strlen(resume_file))
|			return -ENOENT;
|
|		pm_pr_dbg("Checking hibernation image partition %s\n", resume_file);

Use strscpy() [2] as it guarantees NUL-termination on the destination
buffer. Specifically, use the new 2-argument version of strscpy()
introduced in Commit e6584c3964 ("string: Allow 2-argument
strscpy()").

Link: https://www.kernel.org/doc/html/latest/process/deprecated.html#strncpy-on-nul-terminated-strings [1]
Link: https://manpages.debian.org/testing/linux-manual-4.8/strscpy.9.en.html [2]
Link: https://github.com/KSPP/linux/issues/90
Signed-off-by: Justin Stitt <justinstitt@google.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Dhruva Gole <d-gole@ti.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2024-04-30 12:59:30 +02:00
Andy Shevchenko
a3034872cd bpf: Switch to krealloc_array()
Let the krealloc_array() copy the original data and
check for a multiplication overflow.

Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/bpf/20240429120005.3539116-1-andriy.shevchenko@linux.intel.com
2024-04-29 16:13:14 -07:00