5579 Commits

Author SHA1 Message Date
Paolo Bonzini
7d8942d8e7 KVM GUEST_MEMFD fixes for 6.8:

Merge tag 'kvm-x86-guest_memfd_fixes-6.8' of https://github.com/kvm-x86/linux into HEAD

KVM GUEST_MEMFD fixes for 6.8:

 - Make KVM_MEM_GUEST_MEMFD mutually exclusive with KVM_MEM_READONLY to
   avoid creating ABI that KVM can't sanely support.

 - Update documentation for KVM_SW_PROTECTED_VM to make it abundantly
   clear that such VMs are purely a development and testing vehicle, and
   come with zero guarantees.

 - Limit KVM_SW_PROTECTED_VM guests to the TDP MMU, as the long term plan
   is to support confidential VMs with deterministic private memory (SNP
   and TDX) only in the TDP MMU.

 - Fix a bug in a GUEST_MEMFD negative test that resulted in false passes
   when verifying that KVM_MEM_GUEST_MEMFD memslots can't be dirty logged.
2024-03-09 11:48:35 -05:00
Catalin Marinas
88f0912253 Merge branch 'for-next/stage1-lpa2' into for-next/core
* for-next/stage1-lpa2: (48 commits)
  : Add support for LPA2 and WXN at stage 1
  arm64/mm: Avoid ID mapping of kpti flag if it is no longer needed
  arm64/mm: Use generic __pud_free() helper in pud_free() implementation
  arm64: gitignore: ignore relacheck
  arm64: Use Signed/Unsigned enums for TGRAN{4,16,64} and VARange
  arm64: mm: Make PUD folding check in set_pud() a runtime check
  arm64: mm: add support for WXN memory translation attribute
  mm: add arch hook to validate mmap() prot flags
  arm64: defconfig: Enable LPA2 support
  arm64: Enable 52-bit virtual addressing for 4k and 16k granule configs
  arm64: kvm: avoid CONFIG_PGTABLE_LEVELS for runtime levels
  arm64: ptdump: Deal with translation levels folded at runtime
  arm64: ptdump: Disregard unaddressable VA space
  arm64: mm: Add support for folding PUDs at runtime
  arm64: kasan: Reduce minimum shadow alignment and enable 5 level paging
  arm64: mm: Add 5 level paging support to fixmap and swapper handling
  arm64: Enable LPA2 at boot if supported by the system
  arm64: mm: add LPA2 and 5 level paging support to G-to-nG conversion
  arm64: mm: Add definitions to support 5 levels of paging
  arm64: mm: Add LPA2 support to phys<->pte conversion routines
  arm64: mm: Wire up TCR.DS bit to PTE shareability fields
  ...
2024-03-07 19:05:29 +00:00
Catalin Marinas
0c5ade742e Merge branches 'for-next/reorg-va-space', 'for-next/rust-for-arm64', 'for-next/misc', 'for-next/daif-cleanup', 'for-next/kselftest', 'for-next/documentation', 'for-next/sysreg' and 'for-next/dpisa', remote-tracking branch 'arm64/for-next/perf' into for-next/core
* arm64/for-next/perf: (39 commits)
  docs: perf: Fix build warning of hisi-pcie-pmu.rst
  perf: starfive: Only allow COMPILE_TEST for 64-bit architectures
  MAINTAINERS: Add entry for StarFive StarLink PMU
  docs: perf: Add description for StarFive's StarLink PMU
  dt-bindings: perf: starfive: Add JH8100 StarLink PMU
  perf: starfive: Add StarLink PMU support
  docs: perf: Update usage for target filter of hisi-pcie-pmu
  drivers/perf: hisi_pcie: Merge find_related_event() and get_event_idx()
  drivers/perf: hisi_pcie: Relax the check on related events
  drivers/perf: hisi_pcie: Check the target filter properly
  drivers/perf: hisi_pcie: Add more events for counting TLP bandwidth
  drivers/perf: hisi_pcie: Fix incorrect counting under metric mode
  drivers/perf: hisi_pcie: Introduce hisi_pcie_pmu_get_event_ctrl_val()
  drivers/perf: hisi_pcie: Rename hisi_pcie_pmu_{config,clear}_filter()
  drivers/perf: hisi: Enable HiSilicon Erratum 162700402 quirk for HIP09
  perf/arm_cspmu: Add devicetree support
  dt-bindings/perf: Add Arm CoreSight PMU
  perf/arm_cspmu: Simplify counter reset
  perf/arm_cspmu: Simplify attribute groups
  perf/arm_cspmu: Simplify initialisation
  ...

* for-next/reorg-va-space:
  : Reorganise the arm64 kernel VA space in preparation for LPA2 support
  : (52-bit VA/PA).
  arm64: kaslr: Adjust randomization range dynamically
  arm64: mm: Reclaim unused vmemmap region for vmalloc use
  arm64: vmemmap: Avoid base2 order of struct page size to dimension region
  arm64: ptdump: Discover start of vmemmap region at runtime
  arm64: ptdump: Allow all region boundaries to be defined at boot time
  arm64: mm: Move fixmap region above vmemmap region
  arm64: mm: Move PCI I/O emulation region above the vmemmap region

* for-next/rust-for-arm64:
  : Enable Rust support for arm64
  arm64: rust: Enable Rust support for AArch64
  rust: Refactor the build target to allow the use of builtin targets

* for-next/misc:
  : Miscellaneous arm64 patches
  ARM64: Dynamically allocate cpumasks and increase supported CPUs to 512
  arm64: Remove enable_daif macro
  arm64/hw_breakpoint: Directly use ESR_ELx_WNR for a watchpoint exception
  arm64: cpufeatures: Clean up temporary variable to simplify code
  arm64: Update setup_arch() comment on interrupt masking
  arm64: remove unnecessary ifdefs around is_compat_task()
  arm64: ftrace: Don't forbid CALL_OPS+CC_OPTIMIZE_FOR_SIZE with Clang
  arm64/sme: Ensure that all fields in SMCR_EL1 are set to known values
  arm64/sve: Ensure that all fields in ZCR_EL1 are set to known values
  arm64/sve: Document that __SVE_VQ_MAX is much larger than needed
  arm64: make member of struct pt_regs and its offset macro in the same order
  arm64: remove unneeded BUILD_BUG_ON assertion
  arm64: kretprobes: acquire the regs via a BRK exception
  arm64: io: permit offset addressing
  arm64: errata: Don't enable workarounds for "rare" errata by default

* for-next/daif-cleanup:
  : Clean up DAIF handling for EL0 returns
  arm64: Unmask Debug + SError in do_notify_resume()
  arm64: Move do_notify_resume() to entry-common.c
  arm64: Simplify do_notify_resume() DAIF masking

* for-next/kselftest:
  : Miscellaneous arm64 kselftest patches
  kselftest/arm64: Test that ptrace takes effect in the target process

* for-next/documentation:
  : arm64 documentation patches
  arm64/sme: Remove spurious 'is' in SME documentation
  arm64/fp: Clarify effect of setting an unsupported system VL
  arm64/sme: Fix cut'n'paste in ABI document
  arm64/sve: Remove bitrotted comment about syscall behaviour

* for-next/sysreg:
  : sysreg updates
  arm64/sysreg: Update ID_AA64DFR0_EL1 register
  arm64/sysreg: Update ID_DFR0_EL1 register fields
  arm64/sysreg: Add register fields for ID_AA64DFR1_EL1

* for-next/dpisa:
  : Support for 2023 dpISA extensions
  kselftest/arm64: Add 2023 DPISA hwcap test coverage
  kselftest/arm64: Add basic FPMR test
  kselftest/arm64: Handle FPMR context in generic signal frame parser
  arm64/hwcap: Define hwcaps for 2023 DPISA features
  arm64/ptrace: Expose FPMR via ptrace
  arm64/signal: Add FPMR signal handling
  arm64/fpsimd: Support FEAT_FPMR
  arm64/fpsimd: Enable host kernel access to FPMR
  arm64/cpufeature: Hook new identification registers up to cpufeature
2024-03-07 19:04:55 +00:00
Mark Brown
c1932cac79 arm64/hwcap: Define hwcaps for 2023 DPISA features
The 2023 architecture extensions include a large number of floating point
features, most of which simply add new instructions. Add hwcaps so that
userspace can enumerate these features.
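
As a rough illustration (not part of the patch), userspace would probe
these hwcaps via the auxiliary vector; a minimal sketch, assuming the
HWCAP2_FPMR bit from this series:

 #include <stdio.h>
 #include <sys/auxv.h>
 #include <asm/hwcap.h>   /* HWCAP2_* bit definitions */

 int main(void)
 {
 	unsigned long hwcap2 = getauxval(AT_HWCAP2);

 	/* HWCAP2_FPMR is assumed here; the other 2023 dpISA hwcaps follow
 	 * the same pattern. */
 	if (hwcap2 & HWCAP2_FPMR)
 		printf("FEAT_FPMR supported\n");
 	return 0;
 }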

Signed-off-by: Mark Brown <broonie@kernel.org>
Link: https://lore.kernel.org/r/20240306-arm64-2023-dpisa-v5-6-c568edc8ed7f@kernel.org
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2024-03-07 17:14:54 +00:00
Mark Brown
8c46def444 arm64/signal: Add FPMR signal handling
Expose FPMR in the signal context on systems where it is supported. The
kernel validates the exact size of the FPSIMD registers so we can't readily
add it to fpsimd_context without disruption.

Signed-off-by: Mark Brown <broonie@kernel.org>
Link: https://lore.kernel.org/r/20240306-arm64-2023-dpisa-v5-4-c568edc8ed7f@kernel.org
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2024-03-07 17:14:53 +00:00
Mark Brown
203f2b95a8 arm64/fpsimd: Support FEAT_FPMR
FEAT_FPMR defines a new EL0-accessible register, FPMR, used to configure the
FP8 related features added to the architecture at the same time. Detect
support for this register and context switch it for EL0 when present.

Due to the sharing of responsibility for saving floating point state
between the host kernel and KVM, FP8 support is not yet implemented in KVM
and a stub similar to that used for SVCR is provided for FPMR in order to
avoid bisection issues. To make it easier to share host state with the
hypervisor we store FPMR as a hardened usercopy field in uw (along with
some padding).
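
A hedged sketch of the EL0 save side implied above (the real code lives in
the fpsimd save path; the feature check shown is illustrative):

 	/* Preserve the task's FPMR so it can be context switched for EL0;
 	 * uw.fpmr is the hardened usercopy field mentioned above. */
 	if (system_supports_fpmr())
 		current->thread.uw.fpmr = read_sysreg_s(SYS_FPMR);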

Signed-off-by: Mark Brown <broonie@kernel.org>
Link: https://lore.kernel.org/r/20240306-arm64-2023-dpisa-v5-3-c568edc8ed7f@kernel.org
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2024-03-07 17:14:53 +00:00
Mark Brown
b6c0b424cb arm64/fpsimd: Enable host kernel access to FPMR
FEAT_FPMR provides a new generally accessible architectural register FPMR.
This is only accessible to EL0 and EL1 when HCRX_EL2.EnFPM is set to 1, so
do this when the host is running. The guest part will be done along with
context switching the new register and exposing it via guest management.

Acked-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Mark Brown <broonie@kernel.org>
Link: https://lore.kernel.org/r/20240306-arm64-2023-dpisa-v5-2-c568edc8ed7f@kernel.org
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2024-03-07 17:14:52 +00:00
Mark Brown
cc9f69a3da arm64/cpufeature: Hook new identification registers up to cpufeature
The 2023 architecture extensions have defined several new ID registers;
hook them up to the cpufeature code so we can add feature checks and hwcaps
based on their contents.

Signed-off-by: Mark Brown <broonie@kernel.org>
Link: https://lore.kernel.org/r/20240306-arm64-2023-dpisa-v5-1-c568edc8ed7f@kernel.org
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2024-03-07 17:14:52 +00:00
Oliver Upton
9bd8d7df19 Merge branch kvm-arm64/vfio-normal-nc into kvmarm/next
* kvm-arm64/vfio-normal-nc:
  : Normal-NC support for vfio-pci @ stage-2, courtesy of Ankit Agrawal
  :
  : KVM's policy to date has been that any and all MMIO mapping at stage-2
  : is treated as Device-nGnRE. This is primarily done due to concerns of
  : the guest triggering uncontainable failures in the system if they manage
  : to tickle the device / memory system the wrong way, though this is
  : unnecessarily restrictive for devices that can be reasoned as 'safe'.
  :
  : Unsurprisingly, the Device-* mapping can really hurt the performance of
  : assigned devices that can handle Gathering, and can be an outright
  : correctness issue if the guest driver does unaligned accesses.
  :
  : Rather than opening the floodgates to the full ecosystem of devices that
  : can be exposed to VMs, take the conservative approach and allow PCI
  : devices to be mapped as Normal-NC since it has been determined to be
  : 'safe'.
  vfio: Convey kvm that the vfio-pci device is wc safe
  KVM: arm64: Set io memory s2 pte as normalnc for vfio pci device
  mm: Introduce new flag to indicate wc safe
  KVM: arm64: Introduce new flag for non-cacheable IO memory

Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2024-03-07 00:56:01 +00:00
Oliver Upton
0d874858c6 Merge branch kvm-arm64/vm-configuration into kvmarm/next
* kvm-arm64/vm-configuration: (29 commits)
  : VM configuration enforcement, courtesy of Marc Zyngier
  :
  : Userspace has gained the ability to control the features visible
  : through the ID registers, yet KVM didn't take this into account as the
  : effective feature set when determining trap / emulation behavior. This
  : series adds:
  :
  :  - Mechanism for testing the presence of a particular CPU feature in the
  :    guest's ID registers
  :
  :  - Infrastructure for computing the effective value of VNCR-backed
  :    registers, taking into account the RES0 / RES1 bits for a particular
  :    VM configuration
  :
  :  - Implementation of 'fine-grained UNDEF' controls that shadow the FGT
  :    register definitions.
  KVM: arm64: Don't initialize idreg debugfs w/ preemption disabled
  KVM: arm64: Fail the idreg iterator if idregs aren't initialized
  KVM: arm64: Make build-time check of RES0/RES1 bits optional
  KVM: arm64: Add debugfs file for guest's ID registers
  KVM: arm64: Snapshot all non-zero RES0/RES1 sysreg fields for later checking
  KVM: arm64: Make FEAT_MOPS UNDEF if not advertised to the guest
  KVM: arm64: Make AMU sysreg UNDEF if FEAT_AMU is not advertised to the guest
  KVM: arm64: Make PIR{,E0}_EL1 UNDEF if S1PIE is not advertised to the guest
  KVM: arm64: Make TLBI OS/Range UNDEF if not advertised to the guest
  KVM: arm64: Streamline save/restore of HFG[RW]TR_EL2
  KVM: arm64: Move existing feature disabling over to FGU infrastructure
  KVM: arm64: Propagate and handle Fine-Grained UNDEF bits
  KVM: arm64: Add Fine-Grained UNDEF tracking information
  KVM: arm64: Rename __check_nv_sr_forward() to triage_sysreg_trap()
  KVM: arm64: Use the xarray as the primary sysreg/sysinsn walker
  KVM: arm64: Register AArch64 system register entries with the sysreg xarray
  KVM: arm64: Always populate the trap configuration xarray
  KVM: arm64: nv: Move system instructions to their own sys_reg_desc array
  KVM: arm64: Drop the requirement for XARRAY_MULTI
  KVM: arm64: nv: Turn encoding ranges into discrete XArray stores
  ...

Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2024-03-07 00:51:31 +00:00
Oliver Upton
a040adfb7e Merge branch kvm-arm64/misc into kvmarm/next
* kvm-arm64/misc:
  : Miscellaneous updates
  :
  :  - Fix handling of features w/ nonzero safe values in set_id_regs
  :    selftest
  :
  :  - Cleanup the unused kern_hyp_va() asm macro
  :
  :  - Differentiate nVHE and hVHE in boot-time message
  :
  :  - Several selftests cleanups
  :
  :  - Drop bogus return value from kvm_arch_create_vm_debugfs()
  :
  :  - Make save/restore of SPE and TRBE control registers affect EL1 state
  :    in hVHE mode
  :
  :  - Typos
  KVM: arm64: Fix TRFCR_EL1/PMSCR_EL1 access in hVHE mode
  KVM: selftests: aarch64: Remove unused functions from vpmu test
  KVM: arm64: Fix typos
  KVM: Get rid of return value from kvm_arch_create_vm_debugfs()
  KVM: selftests: Print timer ctl register in ISTATUS assertion
  KVM: selftests: Fix GUEST_PRINTF() format warnings in ARM code
  KVM: arm64: removed unused kern_hyp_va asm macro
  KVM: arm64: add comments to __kern_hyp_va
  KVM: arm64: print Hyp mode
  KVM: arm64: selftests: Handle feature fields with nonzero minimum value correctly

Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2024-03-07 00:50:23 +00:00
Arnd Bergmann
d3e5bab923 arch: simplify architecture specific page size configuration
arc, arm64, parisc and powerpc all have their own Kconfig symbols
in place of the common CONFIG_PAGE_SIZE_4KB symbols. Change these
so the common symbols are the ones that are actually used, while
leaving the architecture-specific ones as the user-visible
place for configuring it, to avoid breaking user configs.
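
A hedged Kconfig sketch of the pattern described above (symbol names are
illustrative, not copied from the patch):

 # the arch symbol stays as the user-visible prompt...
 config ARM64_4K_PAGES
 	bool "4KB"
 	select HAVE_PAGE_SIZE_4KB

 # ...while generic code keys off the common symbol only:
 config PAGE_SIZE_4KB
 	def_bool HAVE_PAGE_SIZE_4KB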

Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu> (powerpc32)
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Acked-by: Helge Deller <deller@gmx.de> # parisc
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2024-03-06 19:29:03 +01:00
Nuno Das Neves
0e3f7d1200 hyperv-tlfs: Change prefix of generic HV_REGISTER_* MSRs to HV_MSR_*
The HV_REGISTER_ defines are used as arguments to hv_set/get_register(), which
delegate to arch-specific mechanisms for getting/setting synthetic
Hyper-V MSRs.

On arm64, HV_REGISTER_ defines are synthetic VP registers accessed via
the get/set vp registers hypercalls. The naming matches the TLFS
document, although these register names are not specific to arm64.

However, on x86 the prefix HV_REGISTER_ indicates Hyper-V MSRs accessed
via rdmsrl()/wrmsrl(). This is not consistent with the TLFS doc, where
HV_REGISTER_ is *only* used for VP register names used by
the get/set register hypercalls.

To fix this inconsistency and prevent future confusion, change the
arch-generic aliases used by callers of hv_set/get_register() to have
the prefix HV_MSR_ instead of HV_REGISTER_.

Use the prefix HV_X64_MSR_ for the x86-only Hyper-V MSRs. On x86, the
generic HV_MSR_'s point to the corresponding HV_X64_MSR_.

Move the arm64 HV_REGISTER_* defines to the asm-generic hyperv-tlfs.h,
since these are not specific to arm64. On arm64, the generic HV_MSR_'s
point to the corresponding HV_REGISTER_.

While at it, rename hv_get/set_registers() and related functions to
hv_get/set_msr(), hv_get/set_nested_msr(), etc. These are only used for
Hyper-V MSRs and this naming makes that clear.
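
A hedged sketch of the aliasing this describes (SINT0 is picked purely for
illustration):

 /* arch-generic callers use the HV_MSR_ names... */
 #if defined(CONFIG_X86_64)
 #define HV_MSR_SINT0	HV_X64_MSR_SINT0	/* real MSR via rdmsrl()/wrmsrl()  */
 #else /* arm64 */
 #define HV_MSR_SINT0	HV_REGISTER_SINT0	/* synthetic VP register hypercall */
 #endif

 	/* ...and the accessor is renamed to match: */
 	u64 sint0 = hv_get_msr(HV_MSR_SINT0);	/* was hv_get_register() */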

Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>
Reviewed-by: Wei Liu <wei.liu@kernel.org>
Reviewed-by: Michael Kelley <mhklinux@outlook.com>
Link: https://lore.kernel.org/r/1708440933-27125-1-git-send-email-nunodasneves@linux.microsoft.com
Signed-off-by: Wei Liu <wei.liu@kernel.org>
Message-ID: <1708440933-27125-1-git-send-email-nunodasneves@linux.microsoft.com>
2024-03-04 06:59:18 +00:00
Jakub Kicinski
4b2765ae41 bpf-next-for-netdev

Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next

Daniel Borkmann says:

====================
pull-request: bpf-next 2024-02-29

We've added 119 non-merge commits during the last 32 day(s) which contain
a total of 150 files changed, 3589 insertions(+), 995 deletions(-).

The main changes are:

1) Extend the BPF verifier to enable static subprog calls in spin lock
   critical sections, from Kumar Kartikeya Dwivedi.

2) Fix confusing and incorrect inference of PTR_TO_CTX argument type
   in BPF global subprogs, from Andrii Nakryiko.

3) Larger batch of riscv BPF JIT improvements and enabling inlining
   of the bpf_kptr_xchg() for RV64, from Pu Lehui.

4) Allow skeleton users to change the values of the fields in struct_ops
   maps at runtime, from Kui-Feng Lee.

5) Extend the verifier's capabilities of tracking scalars when they
   are spilled to stack, especially when the spill or fill is narrowing,
   from Maxim Mikityanskiy & Eduard Zingerman.

6) Various BPF selftest improvements to fix errors under gcc BPF backend,
   from Jose E. Marchesi.

7) Avoid module loading failure when the module trying to register
   a struct_ops has its BTF section stripped, from Geliang Tang.

8) Annotate all kfuncs in .BTF_ids section which eventually allows
   for automatic kfunc prototype generation from bpftool, from Daniel Xu.

9) Several updates to the instruction-set.rst IETF standardization
   document, from Dave Thaler.

10) Shrink the size of struct bpf_map resp. bpf_array,
    from Alexei Starovoitov.

11) Initial small subset of BPF verifier prepwork for sleepable bpf_timer,
    from Benjamin Tissoires.

12) Fix bpftool to be more portable to musl libc by using POSIX's
    basename(), from Arnaldo Carvalho de Melo.

13) Add libbpf support to gcc in CORE macro definitions,
    from Cupertino Miranda.

14) Remove a duplicate type check in perf_event_bpf_event,
    from Florian Lehner.

15) Fix bpf_spin_{un,}lock BPF helpers to actually annotate them
    with notrace correctly, from Yonghong Song.

16) Replace the deprecated bpf_lpm_trie_key 0-length array with flexible
    array to fix build warnings, from Kees Cook.

17) Fix resolve_btfids cross-compilation to non host-native endianness,
    from Viktor Malik.

* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (119 commits)
  selftests/bpf: Test if shadow types work correctly.
  bpftool: Add an example for struct_ops map and shadow type.
  bpftool: Generated shadow variables for struct_ops maps.
  libbpf: Convert st_ops->data to shadow type.
  libbpf: Set btf_value_type_id of struct bpf_map for struct_ops.
  bpf: Replace bpf_lpm_trie_key 0-length array with flexible array
  bpf, arm64: use bpf_prog_pack for memory management
  arm64: patching: implement text_poke API
  bpf, arm64: support exceptions
  arm64: stacktrace: Implement arch_bpf_stack_walk() for the BPF JIT
  bpf: add is_async_callback_calling_insn() helper
  bpf: introduce in_sleepable() helper
  bpf: allow more maps in sleepable bpf programs
  selftests/bpf: Test case for lacking CFI stub functions.
  bpf: Check cfi_stubs before registering a struct_ops type.
  bpf: Clarify batch lookup/lookup_and_delete semantics
  bpf, docs: specify which BPF_ABS and BPF_IND fields were zero
  bpf, docs: Fix typos in instruction-set.rst
  selftests/bpf: update tcp_custom_syncookie to use scalar packet offset
  bpf: Shrink size of struct bpf_map/bpf_array.
  ...
====================

Link: https://lore.kernel.org/r/20240301001625.8800-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-03-02 20:50:59 -08:00
Jinjie Ruan
527db67a4d arm64: Remove enable_daif macro
Since commit bb8e93a287a5 ("arm64: entry: convert SError handlers to C"),
the enable_daif assembler macro is no longer used anywhere, so remove it.

Signed-off-by: Jinjie Ruan <ruanjinjie@huawei.com>
Reviewed-by: Mark Brown <broonie@kernel.org>
Acked-by: Mark Rutland <mark.rutland@arm.com>
Link: https://lore.kernel.org/r/20240229132802.1682026-2-ruanjinjie@huawei.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2024-03-01 17:41:37 +00:00
Anshuman Khandual
9d6b6789c8 arm64/hw_breakpoint: Directly use ESR_ELx_WNR for a watchpoint exception
Let's use the existing ISS encoding for a watchpoint exception, i.e. ESR_ELx_WNR.
This indicates whether an instruction was writing to or reading from a memory
location when the watchpoint exception was taken. While here, drop the
non-standard macro AARCH64_ESR_ACCESS_MASK.
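
A hedged sketch of the resulting check (simplified from the watchpoint
handler):

 	/* ESR_ELx_WNR is set for a write access, clear for a read */
 	access = (esr & ESR_ELx_WNR) ? HW_BREAKPOINT_W : HW_BREAKPOINT_R;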

Cc: Will Deacon <will@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-kernel@vger.kernel.org
Reviewed-by: Mark Brown <broonie@kernel.org>
Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
Acked-by: Mark Rutland <mark.rutland@arm.com>
Link: https://lore.kernel.org/r/20240229083431.356578-1-anshuman.khandual@arm.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2024-03-01 17:36:51 +00:00
Ard Biesheuvel
3137db4c66 arm64/mm: Use generic __pud_free() helper in pud_free() implementation
Commit 0dd4f60a2c76 ("arm64: mm: Add support for folding PUDs at
runtime") implements specialized PUD alloc/free helpers to allow the
decision whether or not to fold PUDs to be made at runtime when the
number of paging levels is 4 or higher.

Its implementation of pud_free() is based on the generic version that
existed when the patch was first written, but in the meantime, the
freeing of a PUD has become a bit more involved, and so instead of
simply freeing the page, we should invoke the generic __pud_free() that
encapsulates whatever needs doing at this point.

This fixes a reported warning emitted by the page flags
self-diagnostics.
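
A hedged sketch of the fixed helper, assuming the pgtable_l4_enabled()
runtime check introduced by the PUD folding series:

 static inline void pud_free(struct mm_struct *mm, pud_t *pud)
 {
 	if (!pgtable_l4_enabled())
 		return;
 	/* was free_page((unsigned long)pud); __pud_free() also covers the
 	 * extra page table teardown that has since been added generically */
 	__pud_free(mm, pud);
 }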

Reported-by: Ryan Roberts <ryan.roberts@arm.com>
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Tested-by: Ryan Roberts <ryan.roberts@arm.com>
Link: https://lore.kernel.org/r/20240301104046.1234309-5-ardb+git@google.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2024-03-01 15:25:45 +00:00
Puranjay Mohan
451c3cab9a arm64: patching: implement text_poke API
The text_poke API is used to implement functions like memcpy() and
memset() for instruction memory (RO+X). The implementation is similar to
the x86 version.

This will be used by the BPF JIT to write and modify BPF programs. There
could be more users of this in the future.
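
A hedged sketch of how a JIT might use such an API (the names follow the
x86 text_poke convention; the arm64 entry points may differ):

 /* copy freshly JITed instructions into the RO+X image and pad it */
 static void emit_into_rox(void *ro_image, const u32 *insns, size_t len)
 {
 	text_poke_copy(ro_image, insns, len);	/* memcpy() for RO+X memory */
 	text_poke_set(ro_image + len, 0, 64);	/* memset()-style zero pad  */
 }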

Signed-off-by: Puranjay Mohan <puranjay12@gmail.com>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Link: https://lore.kernel.org/r/20240228141824.119877-2-puranjay12@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-02-28 13:44:47 -08:00
Leonardo Bras
1984c80546 arm64: remove unnecessary ifdefs around is_compat_task()
Currently some parts of the codebase will test for CONFIG_COMPAT before
testing is_compat_task().

is_compat_task() is an inline function that is only present when CONFIG_COMPAT is enabled.
On the other hand, for !CONFIG_COMPAT, we have in linux/compat.h:

 #define is_compat_task() (0)

Since we have this define available in every usage of is_compat_task() for
!CONFIG_COMPAT, it's unnecessary to keep the ifdefs, since the compiler is
smart enough to optimize out those snippets when CONFIG_COMPAT=n.

This requires some regset code as well as a few other defines to be made
available on !CONFIG_COMPAT, so some symbols can get resolved before
getting optimized-out.
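
A hedged before/after sketch of the pattern being removed (the caller is
hypothetical):

 	/* Before: */
 #ifdef CONFIG_COMPAT
 	if (is_compat_task())
 		return handle_compat_case();	/* hypothetical helper */
 #endif

 	/* After: with is_compat_task() defined to (0) for !CONFIG_COMPAT,
 	 * the compiler discards the dead branch, so the ifdef can go. */
 	if (is_compat_task())
 		return handle_compat_case();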

Signed-off-by: Leonardo Bras <leobras@redhat.com>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Link: https://lore.kernel.org/r/20240109034651.478462-2-leobras@redhat.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2024-02-28 18:01:23 +00:00
Oliver Upton
5c1ebe9ada KVM: arm64: Don't initialize idreg debugfs w/ preemption disabled
Testing KVM with DEBUG_ATOMIC_SLEEP enabled doesn't get far before hitting the
first splat:

  BUG: sleeping function called from invalid context at kernel/locking/rwsem.c:1578
  in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 13062, name: vgic_lpi_stress
  preempt_count: 1, expected: 0
  2 locks held by vgic_lpi_stress/13062:
   #0: ffff080084553240 (&vcpu->mutex){+.+.}-{3:3}, at: kvm_vcpu_ioctl+0xc0/0x13f0
   #1: ffff800080485f08 (&kvm->arch.config_lock){+.+.}-{3:3}, at: kvm_arch_vcpu_ioctl+0xd60/0x1788
  CPU: 19 PID: 13062 Comm: vgic_lpi_stress Tainted: G        W  O       6.8.0-dbg-DEV #1
  Call trace:
   dump_backtrace+0xf8/0x148
   show_stack+0x20/0x38
   dump_stack_lvl+0xb4/0xf8
   dump_stack+0x18/0x40
   __might_resched+0x248/0x2a0
   __might_sleep+0x50/0x88
   down_write+0x30/0x150
   start_creating+0x90/0x1a0
   __debugfs_create_file+0x5c/0x1b0
   debugfs_create_file+0x34/0x48
   kvm_reset_sys_regs+0x120/0x1e8
   kvm_reset_vcpu+0x148/0x270
   kvm_arch_vcpu_ioctl+0xddc/0x1788
   kvm_vcpu_ioctl+0xb6c/0x13f0
   __arm64_sys_ioctl+0x98/0xd8
   invoke_syscall+0x48/0x108
   el0_svc_common+0xb4/0xf0
   do_el0_svc+0x24/0x38
   el0_svc+0x54/0x128
   el0t_64_sync_handler+0x68/0xc0
   el0t_64_sync+0x1a8/0x1b0

kvm_reset_vcpu() disables preemption as it needs to unload vCPU state
from the CPU to twiddle with it, which subsequently explodes when
taking the parent inode's rwsem while creating the idreg debugfs file.

Fix it by moving the initialization to kvm_arch_create_vm_debugfs().

Fixes: 891766581dea ("KVM: arm64: Add debugfs file for guest's ID registers")
Reviewed-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20240227094115.1723330-3-oliver.upton@linux.dev
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2024-02-27 19:19:52 +00:00
Ankit Agrawal
c034ec84e8 KVM: arm64: Introduce new flag for non-cacheable IO memory
Currently, KVM for ARM64 maps at stage 2 memory that is considered device
(i.e. it is not RAM) with DEVICE_nGnRE memory attributes; this setting
overrides (as per the ARM architecture [1]) any device MMIO mapping
present at stage 1, resulting in a set-up whereby a guest operating
system cannot determine device MMIO mapping memory attributes on its
own but it is always overridden by the KVM stage 2 default.

This set-up does not allow guest operating systems to select device
memory attributes independently from KVM stage-2 mappings
(refer to [1], "Combining stage 1 and stage 2 memory type attributes"),
which turns out to be an issue in that guest operating systems
(e.g. Linux) may request to map device MMIO regions with memory
attributes that guarantee better performance (e.g. gathering
attribute - that for some devices can generate larger PCIe memory
writes TLPs) and specific operations (e.g. unaligned transactions)
such as the NormalNC memory type.

The default device stage 2 mapping was chosen in KVM for ARM64 since
it was considered safer (i.e. it would not allow guests to trigger
uncontained failures ultimately crashing the machine), but such failures
turned out to be asynchronous (SError) anyway, defeating the purpose.

Failure containability is a property of the platform and is independent
from the memory type used for MMIO device memory mappings.

Actually, DEVICE_nGnRE memory type is even more problematic than
Normal-NC memory type in terms of fault containability in that e.g.
aborts triggered on DEVICE_nGnRE loads cannot be made, architecturally,
synchronous (i.e. that would imply that the processor should issue at
most 1 load transaction at a time - it cannot pipeline them - otherwise
the synchronous abort semantics would break the no-speculation attribute
attached to DEVICE_XXX memory).

This means that regardless of the combined stage1+stage2 mappings a
platform is safe if and only if device transactions cannot trigger
uncontained failures and that in turn relies on platform capabilities
and the device type being assigned (i.e. PCIe AER/DPC error containment
and RAS architecture[3]); therefore the default KVM device stage 2
memory attributes play no role in making device assignment safer
for a given platform (if the platform design adheres to design
guidelines outlined in [3]) and therefore can be relaxed.

For all these reasons, relax the KVM stage 2 device memory attributes
from DEVICE_nGnRE to Normal-NC.

The NormalNC was chosen over a different Normal memory type default
at stage-2 (e.g. Normal Write-through) to avoid cache allocation/snooping.

Relaxing S2 KVM device MMIO mappings to Normal-NC is not expected to
trigger any issue on guest device reclaim use cases either (i.e. device
MMIO unmap followed by a device reset) at least for PCIe devices, in that
in PCIe a device reset is architected and carried out through PCI config
space transactions that are naturally ordered with respect to MMIO
transactions according to the PCI ordering rules.

Having Normal-NC S2 default puts guests in control (thanks to
stage1+stage2 combined memory attributes rules [1]) of device MMIO
regions memory mappings, according to the rules described in [1]
and summarized here ([(S1) - stage1], [(S2) - stage 2]):

S1           |  S2           | Result
NORMAL-WB    |  NORMAL-NC    | NORMAL-NC
NORMAL-WT    |  NORMAL-NC    | NORMAL-NC
NORMAL-NC    |  NORMAL-NC    | NORMAL-NC
DEVICE<attr> |  NORMAL-NC    | DEVICE<attr>

It is worth noting that currently, to map device MMIO space to user
space in a device pass-through use case, the VFIO framework applies memory
attributes derived from pgprot_noncached() settings applied to VMAs, which
results in Device-nGnRnE memory attributes for the stage-1 VMM mappings.

This means that a userspace mapping for device MMIO space carried
out with the current VFIO framework and a guest OS mapping for the same
MMIO space may result in a mismatched alias as described in [2].

Defaulting KVM device stage-2 mappings to Normal-NC attributes does not
change anything in this respect, in that the mismatched aliases would
only affect (refer to [2] for a detailed explanation) ordering between
the userspace and GuestOS mappings resulting stream of transactions
(i.e. it does not cause loss of property for either stream of
transactions on its own), which is harmless given that the userspace
and GuestOS access to the device is carried out through independent
transactions streams.

A Normal-NC flag is not present today. So add a new kvm_pgtable_prot
(KVM_PGTABLE_PROT_NORMAL_NC) flag for it, along with its
corresponding PTE value 0x5 (0b101) determined from [1].

Lastly, adapt the stage2 PTE property setter function
(stage2_set_prot_attr) to handle the NormalNC attribute.
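
A hedged sketch of the prot handling described above (simplified; the
KVM_S2_MEMATTR() usage is illustrative):

 	if (prot & KVM_PGTABLE_PROT_DEVICE)
 		attr = KVM_S2_MEMATTR(pgt, DEVICE_nGnRE);
 	else if (prot & KVM_PGTABLE_PROT_NORMAL_NC)
 		attr = KVM_S2_MEMATTR(pgt, NORMAL_NC);	/* new: 0b101 */
 	else
 		attr = KVM_S2_MEMATTR(pgt, NORMAL);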

The entire discussion leading to this patch series may be followed through
the following links.
Link: https://lore.kernel.org/all/20230907181459.18145-3-ankita@nvidia.com
Link: https://lore.kernel.org/r/20231205033015.10044-1-ankita@nvidia.com

[1] section D8.5.5 - DDI0487J_a_a-profile_architecture_reference_manual.pdf
[2] section B2.8 - DDI0487J_a_a-profile_architecture_reference_manual.pdf
[3] sections 1.7.7.3/1.8.5.2/appendix C - DEN0029H_SBSA_7.1.pdf

Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Acked-by: Will Deacon <will@kernel.org>
Reviewed-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
Link: https://lore.kernel.org/r/20240224150546.368-2-ankita@nvidia.com
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2024-02-24 17:57:39 +00:00
Bjorn Helgaas
75841d89f3 KVM: arm64: Fix typos
Fix typos, most reported by "codespell arch/arm64".  Only touches comments,
no code changes.

Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Cc: James Morse <james.morse@arm.com>
Cc: Suzuki K Poulose <suzuki.poulose@arm.com>
Cc: Zenghui Yu <yuzenghui@huawei.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: linux-arm-kernel@lists.infradead.org
Cc: kvmarm@lists.linux.dev
Reviewed-by: Zenghui Yu <yuzenghui@huawei.com>
Reviewed-by: Randy Dunlap <rdunlap@infradead.org>
Link: https://lore.kernel.org/r/20240103231605.1801364-6-helgaas@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2024-02-24 09:13:33 +00:00
Baoquan He
40254101d8 arm64, crash: wrap crash dumping code into crash related ifdefs
Now that the crash code under the kernel/ folder has been split out from
the kexec code, crash dumping can be separated from kexec reboot in the
config items on arm64 with some adjustments.

Here, wrap the crash dumping code in CONFIG_CRASH_DUMP ifdeffery.

[bhe@redhat.com: fix building error in generic codes]
  Link: https://lkml.kernel.org/r/20240129135033.157195-2-bhe@redhat.com
Link: https://lkml.kernel.org/r/20240124051254.67105-8-bhe@redhat.com
Signed-off-by: Baoquan He <bhe@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Hari Bathini <hbathini@linux.ibm.com>
Cc: Pingfan Liu <piliu@redhat.com>
Cc: Klara Modin <klarasmodin@gmail.com>
Cc: Michael Kelley <mhklinux@outlook.com>
Cc: Nathan Chancellor <nathan@kernel.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Yang Li <yang.lee@linux.alibaba.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-23 17:48:23 -08:00
Baoquan He
85fcde402d kexec: split crashkernel reservation code out from crash_core.c
Patch series "Split crash out from kexec and clean up related config
items", v3.

Motivation:
=============
Previously, LKP reported a build error. When investigating it, we found it
couldn't be resolved reasonably with the present messy kdump config items.

 https://lore.kernel.org/oe-kbuild-all/202312182200.Ka7MzifQ-lkp@intel.com/

The kdump (crash dumping) related config items can cause confusion:

Firstly,

CRASH_CORE enables codes including
 - crashkernel reservation;
 - elfcorehdr updating;
 - vmcoreinfo exporting;
 - crash hotplug handling;

Now fadump of powerpc, kcore dynamic debugging and kdump all select
CRASH_CORE, while:
 - fadump needs crashkernel parsing, vmcoreinfo exporting, and access to
   the global variable 'elfcorehdr_addr';
 - kcore only needs vmcoreinfo exporting;
 - kdump needs all of the current kernel/crash_core.c.

So enabling only PROC_KCORE or FA_DUMP will enable CRASH_CORE, which
misleads people into thinking crash dumping is enabled, when actually it
is not.

Secondly,

It's not reasonable to allow KEXEC_CORE to select CRASH_CORE.

KEXEC_CORE enables code which allocates control pages, copies
kexec/kdump segments, and prepares for switching. This code is
shared by both kexec reboot and kdump. We may want kexec reboot
but not kdump; in that case, CRASH_CORE should not be selected.

 --------------------
 CONFIG_CRASH_CORE=y
 CONFIG_KEXEC_CORE=y
 CONFIG_KEXEC=y
 CONFIG_KEXEC_FILE=y
 ---------------------

Thirdly,

It's not reasonable to allow CRASH_DUMP to select KEXEC_CORE.

That could leave KEXEC_CORE and CRASH_DUMP enabled independently of
KEXEC or KEXEC_FILE. However, without KEXEC or KEXEC_FILE, the
built-in KEXEC_CORE code doesn't make any sense because no kernel
loading or switching will happen to utilize it.
 ---------------------
 CONFIG_CRASH_CORE=y
 CONFIG_KEXEC_CORE=y
 CONFIG_CRASH_DUMP=y
 ---------------------

What is worse, in this case, on sh and arm, KEXEC relies on MMU while
CRASH_DUMP can still be enabled with !MMU, and then the compile error the
lkp test robot reported in the link above is seen.

 ------arch/sh/Kconfig------
 config ARCH_SUPPORTS_KEXEC
         def_bool MMU

 config ARCH_SUPPORTS_CRASH_DUMP
         def_bool BROKEN_ON_SMP
 ---------------------------

Changes:
===========
1, split out crash_reserve.c from crash_core.c;
2, split out vmcore_info.c from crash_core.c;
3, move crash related code in kexec_core.c into crash_core.c;
4, remove dependency of FA_DUMP on CRASH_DUMP;
5, clean up kdump related config items;
6, wrap up crash code in crash related ifdefs on all 8 arches
   which support crash dumping, except for ppc;

Achievement:
===========
With the above changes, I can rearrange the config item logic as below (the
right item depends on or is selected by the left item):

    PROC_KCORE -----------> VMCORE_INFO

               |----------> VMCORE_INFO
    FA_DUMP----|
               |----------> CRASH_RESERVE

                                                    ---->VMCORE_INFO
                                                   /
                                                   |---->CRASH_RESERVE
    KEXEC      --|                                /|
                 |--> KEXEC_CORE--> CRASH_DUMP-->/-|---->PROC_VMCORE
    KEXEC_FILE --|                               \ |
                                                   \---->CRASH_HOTPLUG


    KEXEC      --|
                 |--> KEXEC_CORE (for kexec reboot only)
    KEXEC_FILE --|

Test
========
On all 8 architectures, including x86_64, arm64, s390x, sh, arm, mips,
riscv and loongarch, I tested the three cases of config item settings below
and all builds passed. Take the configs on x86_64 as an example:

(1) Both CONFIG_KEXEC and KEXEC_FILE are unset; then all kexec/kdump
items are unset automatically:
# Kexec and crash features
# CONFIG_KEXEC is not set
# CONFIG_KEXEC_FILE is not set
# end of Kexec and crash features

(2) set CONFIG_KEXEC_FILE and 'make olddefconfig':
---------------
# Kexec and crash features
CONFIG_CRASH_RESERVE=y
CONFIG_VMCORE_INFO=y
CONFIG_KEXEC_CORE=y
CONFIG_KEXEC_FILE=y
CONFIG_CRASH_DUMP=y
CONFIG_CRASH_HOTPLUG=y
CONFIG_CRASH_MAX_MEMORY_RANGES=8192
# end of Kexec and crash features
---------------

(3) unset CONFIG_CRASH_DUMP in case 2 and execute 'make olddefconfig':
------------------------
# Kexec and crash features
CONFIG_KEXEC_CORE=y
CONFIG_KEXEC_FILE=y
# end of Kexec and crash features
------------------------

Note:
For ppc, further investigation is needed to work out how to split out the
crash code in the arch folder. Hopefully Hari and Pingfan can have a look
and see whether it's doable. For now, ppc either has both kexec and crash
enabled, or both of them disabled altogether.


This patch (of 14):

Both kdump and fa_dump of ppc rely on crashkernel reservation.  Move the
relevant code into separate files: crash_reserve.c and
include/linux/crash_reserve.h.

Also add the config item CRASH_RESERVE to control enabling of this code,
and update the config items related to crashkernel reservation.

Also change the ifdeffery from CONFIG_CRASH_CORE to CONFIG_CRASH_RESERVE
where those scopes are only related to crashkernel reservation.

Finally, rename arch/XXX/include/asm/{crash_core.h => crash_reserve.h} on
arm64, x86 and risc-v because those architectures' crash_core.h is only
related to crashkernel reservation.

[akpm@linux-foundation.org: s/CRASH_RESEERVE/CRASH_RESERVE/, per Klara Modin]
Link: https://lkml.kernel.org/r/20240124051254.67105-1-bhe@redhat.com
Link: https://lkml.kernel.org/r/20240124051254.67105-2-bhe@redhat.com
Signed-off-by: Baoquan He <bhe@redhat.com>
Acked-by: Hari Bathini <hbathini@linux.ibm.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Pingfan Liu <piliu@redhat.com>
Cc: Klara Modin <klarasmodin@gmail.com>
Cc: Michael Kelley <mhklinux@outlook.com>
Cc: Nathan Chancellor <nathan@kernel.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Yang Li <yang.lee@linux.alibaba.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-23 17:48:21 -08:00
Linus Torvalds
86f01602a4 arm64 fixes for -rc6

Merge tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux

Pull arm64 fixes from Will Deacon:
 "A simple fix to a definition in the CXL PMU driver, a couple of
  patches to restore SME control registers on the resume path (since
  Arm's fast model now clears them) and a revert for our jump label asm
  constraints after Geert noticed they broke the build with GCC 5.5.

  There was then the ensuing discussion about raising the minimum GCC
  (and corresponding binutils) versions at [1], but for now we'll keep
  things working as they were until that goes ahead.

   - Revert fix to jump label asm constraints, as it regresses the build
     with some GCC 5.5 toolchains.

   - Restore SME control registers when resuming from suspend

   - Fix incorrect filter definition in CXL PMU driver"

* tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
  arm64/sme: Restore SMCR_EL1.EZT0 on exit from suspend
  arm64/sme: Restore SME registers on exit from suspend
  Revert "arm64: jump_label: use constraints "Si" instead of "i""
  perf: CXL: fix CPMU filter value mask length
2024-02-23 10:26:43 -08:00
Ryan Roberts
f0c2264958 arm64/mm: automatically fold contpte mappings
There are situations where a change to a single PTE could cause the
contpte block in which it resides to become foldable (i.e.  could be
repainted with the contiguous bit).  Such situations arise, for example,
when user space temporarily changes protections, via mprotect, for
individual pages, as can be the case for certain garbage collectors.

We would like to detect when such a PTE change occurs.  However this can
be expensive due to the amount of checking required.  Therefore only
perform the checks when an individual PTE is modified via mprotect
(ptep_modify_prot_commit() -> set_pte_at() -> set_ptes(nr=1)) and only
when we are setting the final PTE in a contpte-aligned block.

Link: https://lkml.kernel.org/r/20240215103205.2607016-19-ryan.roberts@arm.com
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
Acked-by: Mark Rutland <mark.rutland@arm.com>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Barry Song <21cnbao@gmail.com>
Cc: Borislav Petkov (AMD) <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Morse <james.morse@arm.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will@kernel.org>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-22 15:27:19 -08:00
Ryan Roberts
b972fc6afb arm64/mm: __always_inline to improve fork() perf
As set_ptes() and wrprotect_ptes() become a bit more complex, the compiler
may choose not to inline them.  But this is critical for fork()
performance.  So mark the functions, along with contpte_try_unfold() which
is called by them, as __always_inline.  This is worth ~1% on the fork()
microbenchmark with order-0 folios (the common case).

Link: https://lkml.kernel.org/r/20240215103205.2607016-18-ryan.roberts@arm.com
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
Acked-by: Mark Rutland <mark.rutland@arm.com>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Barry Song <21cnbao@gmail.com>
Cc: Borislav Petkov (AMD) <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Morse <james.morse@arm.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will@kernel.org>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-22 15:27:19 -08:00
Ryan Roberts
fb5451e5f7 arm64/mm: implement pte_batch_hint()
When core code iterates over a range of ptes and calls ptep_get() for each
of them, if the range happens to cover contpte mappings, the number of pte
reads becomes amplified by a factor of the number of PTEs in a contpte
block.  This is because for each call to ptep_get(), the implementation
must read all of the ptes in the contpte block to which it belongs to
gather the access and dirty bits.

This causes a hotspot for fork(), as well as operations that unmap memory
such as munmap(), exit and madvise(MADV_DONTNEED).  Fortunately we can fix
this by implementing pte_batch_hint() which allows their iterators to skip
getting the contpte tail ptes when gathering the batch of ptes to operate
on.  This results in the number of PTE reads returning to 1 per pte.
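
A hedged sketch of how an iterator can consume the hint (loop shape is
illustrative, not lifted from core-mm):

 	for (i = 0; i < nr; i += nr_batch) {
 		pte = ptep_get(ptep + i);
 		nr_batch = pte_batch_hint(ptep + i, pte);
 		/* operate on nr_batch ptes at once; contpte tail ptes are
 		 * skipped instead of being re-read one by one */
 	}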

Link: https://lkml.kernel.org/r/20240215103205.2607016-17-ryan.roberts@arm.com
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
Acked-by: Mark Rutland <mark.rutland@arm.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Tested-by: John Hubbard <jhubbard@nvidia.com>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Barry Song <21cnbao@gmail.com>
Cc: Borislav Petkov (AMD) <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Morse <james.morse@arm.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will@kernel.org>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-22 15:27:19 -08:00
Ryan Roberts
6b1e4efb6f arm64/mm: implement new [get_and_]clear_full_ptes() batch APIs
Optimize the contpte implementation to fix some of the
exit/munmap/dontneed performance regression introduced by the initial
contpte commit.  Subsequent patches will solve it entirely.

During exit(), munmap() or madvise(MADV_DONTNEED), mappings must be
cleared.  Previously this was done 1 PTE at a time.  But the core-mm
supports batched clear via the new [get_and_]clear_full_ptes() APIs.  So
let's implement those APIs and for fully covered contpte mappings, we no
longer need to unfold the contpte.  This significantly reduces unfolding
operations, reducing the number of tlbis that must be issued.

Link: https://lkml.kernel.org/r/20240215103205.2607016-15-ryan.roberts@arm.com
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
Tested-by: John Hubbard <jhubbard@nvidia.com>
Acked-by: Mark Rutland <mark.rutland@arm.com>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Barry Song <21cnbao@gmail.com>
Cc: Borislav Petkov (AMD) <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Morse <james.morse@arm.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will@kernel.org>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-22 15:27:18 -08:00
Ryan Roberts
311a6cf296 arm64/mm: implement new wrprotect_ptes() batch API
Optimize the contpte implementation to fix some of the fork performance
regression introduced by the initial contpte commit.  Subsequent patches
will solve it entirely.

During fork(), any private memory in the parent must be write-protected. 
Previously this was done 1 PTE at a time.  But the core-mm supports
batched wrprotect via the new wrprotect_ptes() API.  So let's implement
that API and for fully covered contpte mappings, we no longer need to
unfold the contpte.  This has 2 benefits:

  - reduced unfolding, reduces the number of tlbis that must be issued.
  - The memory remains contpte-mapped ("folded") in the parent, so it
    continues to benefit from the more efficient use of the TLB after
    the fork.

The optimization to wrprotect a whole contpte block without unfolding is
possible thanks to the tightening of the Arm ARM in respect to the
definition and behaviour when 'Misprogramming the Contiguous bit'.  See
section D21194 at https://developer.arm.com/documentation/102105/ja-07/
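
A hedged sketch of the call shape from the batched fork path (variable
names illustrative):

 	/* write-protect nr contiguous ptes in one call instead of nr
 	 * individual ptep_set_wrprotect() calls; a fully covered contpte
 	 * block stays folded in the parent */
 	wrprotect_ptes(src_mm, addr, src_pte, nr);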

Link: https://lkml.kernel.org/r/20240215103205.2607016-14-ryan.roberts@arm.com
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
Tested-by: John Hubbard <jhubbard@nvidia.com>
Acked-by: Mark Rutland <mark.rutland@arm.com>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Barry Song <21cnbao@gmail.com>
Cc: Borislav Petkov (AMD) <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Morse <james.morse@arm.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will@kernel.org>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-22 15:27:18 -08:00
Ryan Roberts
4602e5757b arm64/mm: wire up PTE_CONT for user mappings
With the ptep API sufficiently refactored, we can now introduce a new
"contpte" API layer, which transparently manages the PTE_CONT bit for user
mappings.

In this initial implementation, only suitable batches of PTEs, set via
set_ptes(), are mapped with the PTE_CONT bit.  Any subsequent modification
of individual PTEs will cause an "unfold" operation to repaint the contpte
block as individual PTEs before performing the requested operation. 
While a modification of a single PTE could cause the block of PTEs to
which it belongs to become eligible for "folding" into a contpte entry,
"folding" is not performed in this initial implementation due to the costs
of checking the requirements are met.  Due to this, contpte mappings will
degrade back to normal pte mappings over time if/when protections are
changed.  This will be solved in a future patch.

Since a contpte block only has a single access and dirty bit, the semantic
here changes slightly; when getting a pte (e.g.  ptep_get()) that is part
of a contpte mapping, the access and dirty information are pulled from the
block (so all ptes in the block return the same access/dirty info).  When
changing the access/dirty info on a pte (e.g.  ptep_set_access_flags())
that is part of a contpte mapping, this change will affect the whole
contpte block.  This works fine in practice since we guarantee that
only a single folio is mapped by a contpte block, and the core-mm tracks
access/dirty information per folio.

In order for the public functions, which used to be pure inline, to
continue to be callable by modules, export all the contpte_* symbols that
are now called by those public inline functions.

The feature is enabled/disabled with the ARM64_CONTPTE Kconfig parameter
at build time.  It defaults to enabled as long as its dependency,
TRANSPARENT_HUGEPAGE is also enabled.  The core-mm depends upon
TRANSPARENT_HUGEPAGE to be able to allocate large folios, so if its not
enabled, then there is no chance of meeting the physical contiguity
requirement for contpte mappings.

Link: https://lkml.kernel.org/r/20240215103205.2607016-13-ryan.roberts@arm.com
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
Acked-by: Ard Biesheuvel <ardb@kernel.org>
Tested-by: John Hubbard <jhubbard@nvidia.com>
Acked-by: Mark Rutland <mark.rutland@arm.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Barry Song <21cnbao@gmail.com>
Cc: Borislav Petkov (AMD) <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Morse <james.morse@arm.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will@kernel.org>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-22 15:27:18 -08:00
Ryan Roberts
d9d8dc2bd3 arm64/mm: split __flush_tlb_range() to elide trailing DSB
Split __flush_tlb_range() into __flush_tlb_range_nosync() +
__flush_tlb_range(), in the same way as the existing flush_tlb_page()
arrangement.  This allows calling __flush_tlb_range_nosync() to elide the
trailing DSB.  Forthcoming "contpte" code will take advantage of this when
clearing the young bit from a contiguous range of ptes.

Ordering between dsb and mmu_notifier_arch_invalidate_secondary_tlbs() has
changed, but now aligns with the ordering of __flush_tlb_page().  It has
been discussed that __flush_tlb_page() may be wrong though.  Regardless,
both will be resolved separately if needed.
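
A hedged sketch of the resulting arrangement (signature abridged from the
arm64 header; details may differ):

 static inline void __flush_tlb_range(struct vm_area_struct *vma,
 				     unsigned long start, unsigned long end,
 				     unsigned long stride, bool last_level,
 				     int tlb_level)
 {
 	__flush_tlb_range_nosync(vma, start, end, stride,
 				 last_level, tlb_level);
 	dsb(ish);	/* only the non-_nosync variant issues the DSB */
 }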

Link: https://lkml.kernel.org/r/20240215103205.2607016-12-ryan.roberts@arm.com
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Tested-by: John Hubbard <jhubbard@nvidia.com>
Acked-by: Mark Rutland <mark.rutland@arm.com>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Barry Song <21cnbao@gmail.com>
Cc: Borislav Petkov (AMD) <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Morse <james.morse@arm.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will@kernel.org>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-22 15:27:18 -08:00
Ryan Roberts
5a00bfd6a5 arm64/mm: new ptep layer to manage contig bit
Create a new layer for the in-table PTE manipulation APIs.  For now, the
existing API is prefixed with double underscore to become the arch-private
API and the public API is just a simple wrapper that calls the private
API.

The public API implementation will subsequently be used to transparently
manipulate the contiguous bit where appropriate.  But since there are
already some contig-aware users (e.g.  hugetlb, kernel mapper), we must
first ensure those users use the private API directly so that the future
contig-bit manipulations in the public API do not interfere with those
existing uses.

The following APIs are treated this way:

 - ptep_get
 - set_pte
 - set_ptes
 - pte_clear
 - ptep_get_and_clear
 - ptep_test_and_clear_young
 - ptep_clear_flush_young
 - ptep_set_wrprotect
 - ptep_set_access_flags
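
As a sketch of the wrapping pattern (simplified; the in-tree definitions
may differ in detail):

  /* Public API: thin wrappers around the new arch-private __ variants. */
  #define ptep_get ptep_get
  static inline pte_t ptep_get(pte_t *ptep)
  {
          return __ptep_get(ptep);
  }

  #define set_ptes set_ptes
  static inline void set_ptes(struct mm_struct *mm, unsigned long addr,
                              pte_t *ptep, pte_t pte, unsigned int nr)
  {
          __set_ptes(mm, addr, ptep, pte, nr);
  }

The transparent contpte handling can later be added inside these public
wrappers, while contig-aware users such as hugetlb keep calling the __
variants directly.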

Link: https://lkml.kernel.org/r/20240215103205.2607016-11-ryan.roberts@arm.com
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
Tested-by: John Hubbard <jhubbard@nvidia.com>
Acked-by: Mark Rutland <mark.rutland@arm.com>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Barry Song <21cnbao@gmail.com>
Cc: Borislav Petkov (AMD) <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Morse <james.morse@arm.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will@kernel.org>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-22 15:27:18 -08:00
Ryan Roberts
659e193027 arm64/mm: convert set_pte_at() to set_ptes(..., 1)
Since set_ptes() was introduced, set_pte_at() has been implemented as a
generic macro around set_ptes(..., 1).  So this change should continue to
generate the same code.  However, making this change prepares us for the
transparent contpte support.  It means we can reroute set_ptes() to
__set_ptes().  Since set_pte_at() is a generic macro, there will be no
equivalent __set_pte_at() to reroute to.

Note that a couple of calls to set_pte_at() remain in the arch code.  This
is intentional, since those call sites are acting on behalf of core-mm and
should continue to call into the public set_ptes() rather than the
arch-private __set_ptes().
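
For reference, the generic core-mm definition boils down to a
single-entry call into set_ptes(); roughly:

  /* Generic fallback: set_pte_at() is just set_ptes() for one entry. */
  #define set_pte_at(mm, addr, ptep, pte) set_ptes(mm, addr, ptep, pte, 1)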

Link: https://lkml.kernel.org/r/20240215103205.2607016-9-ryan.roberts@arm.com
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
Tested-by: John Hubbard <jhubbard@nvidia.com>
Acked-by: Mark Rutland <mark.rutland@arm.com>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Barry Song <21cnbao@gmail.com>
Cc: Borislav Petkov (AMD) <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Morse <james.morse@arm.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will@kernel.org>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-22 15:27:18 -08:00
Ryan Roberts
532736558e arm64/mm: convert READ_ONCE(*ptep) to ptep_get(ptep)
There are a number of places in the arch code that read a pte by using the
READ_ONCE() macro.  Refactor these call sites to instead use the
ptep_get() helper, which itself is a READ_ONCE().  Generated code should
be the same.

This will benefit us when we shortly introduce the transparent contpte
support.  In this case, ptep_get() will become more complex so we now have
all the code abstracted through it.
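
Before the contpte rework the helper is trivial, so the conversion is
purely mechanical; the default definition amounts to (sketch):

  static inline pte_t ptep_get(pte_t *ptep)
  {
          /* Same tearing-safe load the call sites used to open-code. */
          return READ_ONCE(*ptep);
  }

i.e. call sites change from pte = READ_ONCE(*ptep) to
pte = ptep_get(ptep) with identical generated code.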

Link: https://lkml.kernel.org/r/20240215103205.2607016-8-ryan.roberts@arm.com
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
Tested-by: John Hubbard <jhubbard@nvidia.com>
Acked-by: Mark Rutland <mark.rutland@arm.com>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Barry Song <21cnbao@gmail.com>
Cc: Borislav Petkov (AMD) <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Morse <james.morse@arm.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will@kernel.org>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-22 15:27:18 -08:00
Ryan Roberts
c1bd2b4028 arm64/mm: convert pte_next_pfn() to pte_advance_pfn()
Core-mm needs to be able to advance the pfn by an arbitrary amount, so
override the new pte_advance_pfn() API to do so.
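
A sketch of what such an override can look like (the exact arm64
implementation may differ in detail; pfn_pte(), pte_pfn() and
pte_pgprot() are the standard helpers):

  #define pte_advance_pfn pte_advance_pfn
  static inline pte_t pte_advance_pfn(pte_t pte, unsigned long nr)
  {
          /* Advance the pfn by nr entries, keeping the attributes. */
          return pfn_pte(pte_pfn(pte) + nr, pte_pgprot(pte));
  }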

Link: https://lkml.kernel.org/r/20240215103205.2607016-5-ryan.roberts@arm.com
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: Mark Rutland <mark.rutland@arm.com>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Barry Song <21cnbao@gmail.com>
Cc: Borislav Petkov (AMD) <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Morse <james.morse@arm.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will@kernel.org>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-22 15:27:18 -08:00
Mark Brown
21eb468e9f arm64/sve: Document that __SVE_VQ_MAX is much larger than needed
__SVE_VQ_MAX is defined without comment as 512 but the actual
architectural maximum is 16, a substantial difference which might not
be obvious to readers especially given the several different units used
for specifying vector sizes in various contexts and the fact that it's
often used via macros.  In an effort to minimise surprises for users who
might assume the value is the architectural maximum and use it to do
things like size allocations, add a comment noting the difference, and
add a note for SVE_VQ_MAX to aid discoverability.

Signed-off-by: Mark Brown <broonie@kernel.org>
Acked-by: Dave Martin <Dave.Martin@arm.com>
Link: https://lore.kernel.org/r/20240209-arm64-sve-vl-max-comment-v2-1-111b283469ee@kernel.org
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2024-02-22 19:32:47 +00:00
Ryan Roberts
6e8f588708 arm64/mm: make set_ptes() robust when OAs cross 48-bit boundary
Patch series "mm/memory: optimize fork() with PTE-mapped THP", v3.

Now that the rmap overhaul[1] is upstream that provides a clean interface
for rmap batching, let's implement PTE batching during fork when
processing PTE-mapped THPs.

This series is partially based on Ryan's previous work[2] to implement
cont-pte support on arm64, but it's a complete rewrite based on [1] to
optimize all architectures independent of any such PTE bits, and to use
the new rmap batching functions that simplify the code and prepare for
further rmap accounting changes.

We collect consecutive PTEs that map consecutive pages of the same large
folio, making sure that the other PTE bits are compatible, and (a) adjust
the refcount only once per batch, (b) call rmap handling functions only
once per batch and (c) perform batch PTE setting/updates.

While this series should be beneficial for adding cont-pte support on
ARM64[2], it's one of the requirements for maintaining a total mapcount[3]
for large folios with minimal added overhead and further changes[4] that
build up on top of the total mapcount.

Independent of all that, this series results in a speedup during fork with
PTE-mapped THP, which is the default with THPs that are smaller than a PMD
(for example, 16KiB to 1024KiB mTHPs for anonymous memory[5]).

On an Intel Xeon Silver 4210R CPU, fork'ing with 1GiB of PTE-mapped folios
of the same size (stddev < 1%) results in the following runtimes for
fork() (shorter is better):

Folio Size | v6.8-rc1 |      New | Change
------------------------------------------
      4KiB | 0.014328 | 0.014035 |   - 2%
     16KiB | 0.014263 | 0.01196  |   -16%
     32KiB | 0.014334 | 0.01094  |   -24%
     64KiB | 0.014046 | 0.010444 |   -26%
    128KiB | 0.014011 | 0.010063 |   -28%
    256KiB | 0.013993 | 0.009938 |   -29%
    512KiB | 0.013983 | 0.00985  |   -30%
   1024KiB | 0.013986 | 0.00982  |   -30%
   2048KiB | 0.014305 | 0.010076 |   -30%

Note that these numbers are even better than the ones from v1 (verified
over multiple reboots), even though there were only minimal code changes. 
Well, I removed a pte_mkclean() call for anon folios, maybe that also
plays a role.

But my experience is that fork() is extremely sensitive to code size,
inlining, ...  so I suspect we'll see on other architectures rather a
change of -20% instead of -30%, and it will be easy to "lose" some of that
speedup in the future by subtle code changes.

Next up is PTE batching when unmapping.  Only tested on x86-64. 
Compile-tested on most other architectures.

[1] https://lkml.kernel.org/r/20231220224504.646757-1-david@redhat.com
[2] https://lkml.kernel.org/r/20231218105100.172635-1-ryan.roberts@arm.com
[3] https://lkml.kernel.org/r/20230809083256.699513-1-david@redhat.com
[4] https://lkml.kernel.org/r/20231124132626.235350-1-david@redhat.com
[5] https://lkml.kernel.org/r/20231207161211.2374093-1-ryan.roberts@arm.com


This patch (of 15):

Since the high bits [51:48] of an OA are not stored contiguously in the
PTE, there is a theoretical bug in set_ptes(), which just adds PAGE_SIZE
to the pte to get the pte with the next pfn.  This works until the pfn
crosses the 48-bit boundary, at which point we overflow into the upper
attributes.

Of course one could argue (and Matthew Wilcox has :) that we will never
see a folio cross this boundary because we only allow naturally aligned
power-of-2 allocation, so this would require a half-petabyte folio.  So
it's only a theoretical bug.  But it's better that the code is robust
regardless.

I've implemented pte_next_pfn() as part of the fix, which is an opt-in
core-mm interface.  So that is now available to the core-mm, which will be
needed shortly to support forthcoming fork()-batching optimizations.
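
To illustrate the difference (sketch only; the function name is made
up, and the real fix is spread across the pte_next_pfn()/set_ptes()
helpers):

  static pte_t pte_next_pfn_sketch(pte_t pte)
  {
          /*
           * Adding PAGE_SIZE to the raw value carries into the upper
           * attribute bits once the pfn crosses the 48-bit boundary:
           *
           *     return __pte(pte_val(pte) + PAGE_SIZE);
           *
           * Recomputing the pte from the next pfn plus the original
           * attributes is robust regardless of where the pfn lands:
           */
          return pfn_pte(pte_pfn(pte) + 1, pte_pgprot(pte));
  }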

Link: https://lkml.kernel.org/r/20240129124649.189745-1-david@redhat.com
Link: https://lkml.kernel.org/r/20240125173534.1659317-1-ryan.roberts@arm.com
Link: https://lkml.kernel.org/r/20240129124649.189745-2-david@redhat.com
Fixes: 4a169d61c2ed ("arm64: implement the new page table range API")
Closes: https://lore.kernel.org/linux-mm/fdaeb9a5-d890-499a-92c8-d171df43ad01@arm.com/
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Tested-by: Ryan Roberts <ryan.roberts@arm.com>
Reviewed-by: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@kernel.org>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: David S. Miller <davem@davemloft.net>
Cc: Dinh Nguyen <dinguyen@kernel.org>
Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Naveen N. Rao <naveen.n.rao@linux.ibm.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Russell King (Oracle) <linux@armlinux.org.uk>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Alexandre Ghiti <alexghiti@rivosinc.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-22 10:24:50 -08:00
Christophe Leroy
a5e8131a03 arm64, powerpc, riscv, s390, x86: ptdump: refactor CONFIG_DEBUG_WX
All architectures using the core ptdump functionality also implement
CONFIG_DEBUG_WX, and they all do it more or less the same way, with a
function called debug_checkwx() that is called by mark_rodata_ro(), which
is a substitute for ptdump_check_wx() when CONFIG_DEBUG_WX is set and a
no-op otherwise.

Refactor by centrally defining debug_checkwx() in linux/ptdump.h and call
debug_checkwx() immediately after calling mark_rodata_ro() instead of
calling it at the end of every mark_rodata_ro().

On x86_32, mark_rodata_ro() first checks __supported_pte_mask has _PAGE_NX
before calling debug_checkwx().  Now the check is inside the callee
ptdump_walk_pgd_level_checkwx().

On powerpc_64, mark_rodata_ro() bails out early before calling
ptdump_check_wx() when the MMU doesn't have the KERNEL_RO feature.  The check
is now also done in ptdump_check_wx() as it is called outside
mark_rodata_ro().
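
The central helper ends up roughly along these lines (a sketch,
assuming ptdump_check_wx() is the hook each ptdump architecture
provides), with its single call site placed right after
mark_rodata_ro():

  /* linux/ptdump.h (sketch): one definition instead of per-arch copies. */
  static inline void debug_checkwx(void)
  {
          if (IS_ENABLED(CONFIG_DEBUG_WX))
                  ptdump_check_wx();
  }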

Link: https://lkml.kernel.org/r/a59b102d7964261d31ead0316a9f18628e4e7a8e.1706610398.git.christophe.leroy@csgroup.eu
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Reviewed-by: Alexandre Ghiti <alexghiti@rivosinc.com>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: "Aneesh Kumar K.V (IBM)" <aneesh.kumar@kernel.org>
Cc: Borislav Petkov (AMD) <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Cc: Greg KH <greg@kroah.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: "Naveen N. Rao" <naveen.n.rao@linux.ibm.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Phong Tran <tranmanphong@gmail.com>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Steven Price <steven.price@arm.com>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-22 10:24:47 -08:00
Mark Rutland
253751233b arm64: kretprobes: acquire the regs via a BRK exception
On arm64, kprobes always take an exception and so create a struct
pt_regs through the usual exception entry logic. Similarly, kretprobes
takes an exception for function entry, but for function returns it
uses a trampoline which attempts to create a struct pt_regs without
taking an exception.

This is problematic for a few reasons, including:

1) The kretprobes trampoline neither saves nor restores all of the
   portions of PSTATE. Before invoking the handler it saves a number of
   portions of PSTATE, and after returning from the handler it restores
   NZCV before returning to the original return address provided by the
   handler.

2) The kretprobe trampoline constructs the PSTATE value piecemeal from
   special purpose registers as it cannot read all of PSTATE atomically
   without taking an exception. This is somewhat fragile, and it's not
   possible to reliably recover PSTATE information which only exists on
   some physical CPUs (e.g. when SSBS support is mismatched).

   Today the kretprobes trampoline does not record:

   - BTYPE
   - SSBS
   - ALLINT
   - SS
   - PAN
   - UAO
   - DIT
   - TCO

   ... and this will only get worse with future architecture extensions
   which add more PSTATE bits.

3) The kretprobes trampoline doesn't store portions of struct pt_regs
   (e.g. the PMR value when using pseudo-NMIs). Due to this, helpers
   which operate on a struct pt_regs, such as interrupts_enabled(), may
   not work correctly.

4) The function entry and function exit handlers run in different
   contexts. The entry handler will always be run in a debug exception
   context (which is currently treated as an NMI), but the return will
   be treated as whatever context the instrumented function was executed
   in. The differences between these contexts are liable to cause
   problems (e.g. as the two can be differently interruptible or
   preemptible, adversely affecting synchronization between the
   handlers).

5) As the kretprobes trampoline runs in the same context as the code
   being probed, it is subject to the same single-stepping context,
   which may not be desirable if this is being driven by the kprobes
   handlers.

Overall, this is fragile, painful to maintain, and gets in the way of
supporting other things (e.g. RELIABLE_STACKTRACE, FEAT_NMI).

This patch addresses these issues by replacing the kretprobes trampoline
with a `BRK` instruction, and using an exception boundary to acquire and
restore the regs, in the same way as the regular kprobes trampoline.

I've tested this atop v6.8-rc3:

| KTAP version 1
| 1..1
|     KTAP version 1
|     # Subtest: kprobes_test
|     # module: test_kprobes
|     1..7
|     ok 1 test_kprobe
|     ok 2 test_kprobes
|     ok 3 test_kprobe_missed
|     ok 4 test_kretprobe
|     ok 5 test_kretprobes
|     ok 6 test_stacktrace_on_kretprobe
|     ok 7 test_stacktrace_on_nested_kretprobe
| # kprobes_test: pass:7 fail:0 skip:0 total:7
| # Totals: pass:7 fail:0 skip:0 total:7
| ok 1 kprobes_test

Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Florent Revest <revest@chromium.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Link: https://lore.kernel.org/r/20240208145916.2004154-1-mark.rutland@arm.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2024-02-20 18:13:57 +00:00
Mark Rutland
997d79eb93 arm64: Move do_notify_resume() to entry-common.c
Currently do_notify_resume() lives in arch/arm64/kernel/signal.c, but it would
make more sense for it to live in entry-common.c as it handles more than
signals, and is coupled with the rest of the return-to-userspace sequence (e.g.
with unusual DAIF masking that matches the exception return requirements).

Move do_notify_resume() to entry-common.c.

There should be no functional change as a result of this patch.

Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: James Morse <james.morse@arm.com>
Cc: Mark Brown <broonie@kernel.org>
Cc: Will Deacon <will@kernel.org>
Reviewed-by: Mark Brown <broonie@kernel.org>
Link: https://lore.kernel.org/r/20240206123848.1696480-3-mark.rutland@arm.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Tested-by: Itaru Kitayama <itaru.kitayama@linux.dev>
2024-02-20 18:12:13 +00:00
Mark Rutland
d044d6ba6f arm64: io: permit offset addressing
Currently our IO accessors all use register addressing without offsets,
but we could safely use offset addressing (without writeback) to
simplify and optimize the generated code.

To function correctly under a hypervisor which emulates IO accesses, we
must ensure that any faulting/trapped IO access results in an ESR_ELx
value with ESR_ELx.ISS.ISV=1 and with the transfer register described in
ESR_ELx.ISS.SRT. This means that we can only use loads/stores of a
single general purpose register (or the zero register), and must avoid
writeback addressing modes. However, we can use immediate offset
addressing modes, as these still provide ESR_ELx.ISS.ISV=1 and a valid
ESR_ELx.ISS.SRT when those accesses fault at Stage-2.

Currently we only use register addressing without offsets. We use the
"r" constraint to place the address into a register, and manually
generate the register addressing by surrounding the resulting register
operand with square braces, e.g.

| static __always_inline void __raw_writeq(u64 val, volatile void __iomem *addr)
| {
|         asm volatile("str %x0, [%1]" : : "rZ" (val), "r" (addr));
| }

Due to this, sequences of adjacent accesses need to generate addresses
using separate instructions. For example, the following code:

| void writeq_zero_8_times(void *ptr)
| {
|        writeq_relaxed(0, ptr + 8 * 0);
|        writeq_relaxed(0, ptr + 8 * 1);
|        writeq_relaxed(0, ptr + 8 * 2);
|        writeq_relaxed(0, ptr + 8 * 3);
|        writeq_relaxed(0, ptr + 8 * 4);
|        writeq_relaxed(0, ptr + 8 * 5);
|        writeq_relaxed(0, ptr + 8 * 6);
|        writeq_relaxed(0, ptr + 8 * 7);
| }

... is compiled to:

| <writeq_zero_8_times>:
|     str     xzr, [x0]
|     add     x1, x0, #0x8
|     str     xzr, [x1]
|     add     x1, x0, #0x10
|     str     xzr, [x1]
|     add     x1, x0, #0x18
|     str     xzr, [x1]
|     add     x1, x0, #0x20
|     str     xzr, [x1]
|     add     x1, x0, #0x28
|     str     xzr, [x1]
|     add     x1, x0, #0x30
|     str     xzr, [x1]
|     add     x0, x0, #0x38
|     str     xzr, [x0]
|     ret

As described above, we could safely use immediate offset addressing,
which would allow the ADDs to be folded into the address generation for
the STRs, resulting in simpler and smaller generated assembly. We can do
this by using the "o" constraint to allow the compiler to generate
offset addressing (without writeback) for a memory operand, e.g.

| static __always_inline void __raw_writeq(u64 val, volatile void __iomem *addr)
| {
|         volatile u64 __iomem *ptr = addr;
|         asm volatile("str %x0, %1" : : "rZ" (val), "o" (*ptr));
| }

... which results in the earlier code sequence being compiled to:

| <writeq_zero_8_times>:
|     str     xzr, [x0]
|     str     xzr, [x0, #8]
|     str     xzr, [x0, #16]
|     str     xzr, [x0, #24]
|     str     xzr, [x0, #32]
|     str     xzr, [x0, #40]
|     str     xzr, [x0, #48]
|     str     xzr, [x0, #56]
|     ret

As Will notes at:

  https://lore.kernel.org/linux-arm-kernel/20240117160528.GA3398@willie-the-truck/

... some compilers struggle with a plain "o" constraint, so it's
preferable to use "Qo", where the additional "Q" constraint permits
using non-offset register addressing.

This patch modifies our IO write accessors to use "Qo" constraints,
resulting in the better code generation described above. The IO read
accessors are left as-is because ARM64_WORKAROUND_DEVICE_LOAD_ACQUIRE
requires that non-offset register addressing is used, as the LDAR
instruction does not support offset addressing.

When compiling v6.8-rc1 defconfig with GCC 13.2.0, this saves ~4KiB of
text:

| [mark@lakrids:~/src/linux]% ls -al vmlinux-*
| -rwxr-xr-x 1 mark mark 153960576 Jan 23 12:01 vmlinux-after
| -rwxr-xr-x 1 mark mark 153862192 Jan 23 11:57 vmlinux-before
|
| [mark@lakrids:~/src/linux]% size vmlinux-before vmlinux-after
|    text    data     bss     dec     hex filename
| 26708921        16690350         622736 44022007        29fb8f7 vmlinux-before
| 26704761        16690414         622736 44017911        29fa8f7 vmlinux-after

... though due to internal alignment of sections, this has no impact on
the size of the resulting Image:

| [mark@lakrids:~/src/linux]% ls -al Image-*
| -rw-r--r-- 1 mark mark 43590144 Jan 23 12:01 Image-after
| -rw-r--r-- 1 mark mark 43590144 Jan 23 11:57 Image-before

Aside from the better code generation, there should be no functional
change as a result of this patch. I have lightly tested this patch,
including booting under KVM (where some devices such as PL011 are
emulated).

Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Will Deacon <will@kernel.org>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Acked-by: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20240124111259.874975-1-mark.rutland@arm.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2024-02-20 18:08:37 +00:00
Mark Brown
9533864816 arm64/sme: Restore SME registers on exit from suspend
The fields in SMCR_EL1 and SMPRI_EL1 reset to an architecturally UNKNOWN
value. Since we do not otherwise manage the traps configured in this
register at runtime we need to reconfigure them after a suspend in case
nothing else was kind enough to preserve them for us.

The vector length will be restored as part of restoring the SME state for
the next SME using task.

Fixes: a1f4ccd25cc2 ("arm64/sme: Provide Kconfig for SME")
Reported-by: Jackson Cooper-Driver <Jackson.Cooper-Driver@arm.com>
Signed-off-by: Mark Brown <broonie@kernel.org>
Link: https://lore.kernel.org/r/20240213-arm64-sme-resume-v3-1-17e05e493471@kernel.org
Signed-off-by: Will Deacon <will@kernel.org>
2024-02-20 12:19:15 +00:00
Will Deacon
a6b3eb304a Revert "arm64: jump_label: use constraints "Si" instead of "i""
This reverts commit f9daab0ad01cf9d165dbbbf106ca4e61d06e7fe8.

Geert reports that his particular GCC 5.5 vintage toolchain fails to
build an arm64 defconfig because of this change:

 |    arch/arm64/include/asm/jump_label.h:25:2: error: invalid 'asm':
 | invalid operand
 |     asm goto(
      ^
Apparently, this is something we claim to support, so let's revert
to the old jump label constraint for now while discussions about raising
the minimum GCC version are ongoing.

Reported-by: Geert Uytterhoeven <geert@linux-m68k.org>
Link: https://lore.kernel.org/r/CAMuHMdX+6fnAf8Hm6EqYJPAjrrLO9T7c=Gu3S8V_pqjSDowJ6g@mail.gmail.com
Signed-off-by: Will Deacon <will@kernel.org>
2024-02-20 12:12:44 +00:00
Marc Zyngier
891766581d KVM: arm64: Add debugfs file for guest's ID registers
Debugging ID register setup can be a complicated affair. Give the
kernel hacker a way to dump that state in an easily parsed format.

Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20240214131827.2856277-27-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2024-02-19 17:13:02 +00:00
Marc Zyngier
84de212d73 KVM: arm64: Make FEAT_MOPS UNDEF if not advertised to the guest
We unconditionally enable FEAT_MOPS, which is obviously wrong.

So let's only do that when it is advertised to the guest, which means
we need to rely on a per-vcpu HCRX_EL2 shadow register.

Signed-off-by: Marc Zyngier <maz@kernel.org>
Reviewed-by: Joey Gouly <joey.gouly@arm.com>
Link: https://lore.kernel.org/r/20240214131827.2856277-25-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2024-02-19 17:13:02 +00:00
Marc Zyngier
c5bac1ef7d KVM: arm64: Move existing feature disabling over to FGU infrastructure
We already trap a bunch of existing features for the purpose of
disabling them (MAIR2, POR, ACCDATA, SME...).

Let's move them over to our brand new FGU infrastructure.

Reviewed-by: Joey Gouly <joey.gouly@arm.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20240214131827.2856277-20-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2024-02-19 17:13:01 +00:00
Marc Zyngier
2fd8f31c32 KVM: arm64: Add Fine-Grained UNDEF tracking information
In order to efficiently handle system register access being disabled,
and this resulting in an UNDEF exception being injected, we introduce
the (slightly dubious) concept of Fine-Grained UNDEF, modeled after
the architectural Fine-Grained Traps.

For each FGT group, we keep a 64 bit word that has the exact same
bit assignment as the corresponding FGT register, where a 1 indicates
that trapping this register should result in an UNDEF exception being
reinjected.

So far, nothing populates this information, nor sets the corresponding
trap bits.
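
As a sketch of the idea (all names below are illustrative, not the
actual KVM definitions):

  /* One 64-bit "UNDEF" word per fine-grained trap group. */
  enum fgu_group_sketch {
          FGU_GROUP_HFGxTR,
          FGU_GROUP_HFGITR,
          FGU_GROUP_HDFGRTR,
          /* ... one entry per FGT register ... */
          FGU_NR_GROUPS,
  };

  struct fgu_state_sketch {
          /*
           * Same bit layout as the corresponding FGT register; a set bit
           * means trapping that register must inject an UNDEF.
           */
          u64 fgu[FGU_NR_GROUPS];
  };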

Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20240214131827.2856277-18-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2024-02-19 17:13:01 +00:00
Marc Zyngier
085eabaa74 KVM: arm64: Rename __check_nv_sr_forward() to triage_sysreg_trap()
__check_nv_sr_forward() is not specific to NV anymore, and does
a lot more. Rename it to triage_sysreg_trap(), making it plain
that its role is to decide where an exception is to be handled.

Reviewed-by: Joey Gouly <joey.gouly@arm.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20240214131827.2856277-17-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2024-02-19 17:13:01 +00:00
Marc Zyngier
cc5f84fbb0 KVM: arm64: Use the xarray as the primary sysreg/sysinsn walker
Since we always start sysreg/sysinsn handling by searching the
xarray, use it as the source of the index in the correct sys_reg_desc
array.

This allows some cleanup, such as moving the handling of unknown
sysregs in a single location.
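
A minimal sketch of the lookup pattern (names and entry encoding are
simplified assumptions; the in-tree code may store more than the bare
index in each xarray entry):

  static const struct sys_reg_desc *triage_sketch(struct xarray *xa,
                                                  const struct sys_reg_desc *table,
                                                  unsigned long encoding)
  {
          void *entry = xa_load(xa, encoding);

          /* Unknown sysregs fall out here and are handled in one place. */
          if (!entry)
                  return NULL;

          return &table[xa_to_value(entry)];
  }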

Reviewed-by: Joey Gouly <joey.gouly@arm.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20240214131827.2856277-16-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2024-02-19 17:13:01 +00:00