linux-next

mirror of https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git synced 2025-01-07 14:32:23 +00:00

Author	SHA1	Message	Date
Sean Christopherson	dfaae8447c	KVM: x86/mmu: Try "unprotect for retry" iff there are indirect SPs Try to unprotect shadow pages if and only if indirect_shadow_pages is non- zero, i.e. iff there is at least one protected such shadow page. Pre- checking indirect_shadow_pages avoids taking mmu_lock for write when the gfn is write-protected by a third party, i.e. not for KVM shadow paging, and in the extremely unlikely case that a different task has already unprotected the last shadow page. Link: https://lore.kernel.org/r/20240831001538.336683-10-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-09-09 20:16:24 -07:00
Sean Christopherson	01dd4d3192	KVM: x86/mmu: Apply retry protection to "fast nTDP unprotect" path Move the anti-infinite-loop protection provided by last_retry_{eip,addr} into kvm_mmu_write_protect_fault() so that it guards unprotect+retry that never hits the emulator, as well as reexecute_instruction(), which is the last ditch "might as well try it" logic that kicks in when emulation fails on an instruction that faulted on a write-protected gfn. Add a new helper, kvm_mmu_unprotect_gfn_and_retry(), to set the retry fields and deduplicate other code (with more to come). Link: https://lore.kernel.org/r/20240831001538.336683-9-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-09-09 20:16:23 -07:00
Sean Christopherson	9c19129e53	KVM: x86: Store gpa as gpa_t, not unsigned long, when unprotecting for retry Store the gpa used to unprotect the faulting gfn for retry as a gpa_t, not an unsigned long. This fixes a bug where 32-bit KVM would unprotect and retry the wrong gfn if the gpa had bits 63:32!=0. In practice, this bug is functionally benign, as unprotecting the wrong gfn is purely a performance issue (thanks to the anti-infinite-loop logic). And of course, almost no one runs 32-bit KVM these days. Reviewed-by: Yuan Yao <yuan.yao@intel.com> Link: https://lore.kernel.org/r/20240831001538.336683-8-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-09-09 20:16:22 -07:00
Sean Christopherson	019f3f84a4	KVM: x86: Get RIP from vCPU state when storing it to last_retry_eip Read RIP from vCPU state instead of pulling it from the emulation context when filling last_retry_eip, which is part of the anti-infinite-loop protection used when unprotecting and retrying instructions that hit a write-protected gfn. This will allow reusing the anti-infinite-loop protection in flows that never make it into the emulator. No functional change intended, as ctxt->eip is set to kvm_rip_read() in init_emulate_ctxt(), and EMULTYPE_PF emulation is mutually exclusive with EMULTYPE_NO_DECODE and EMULTYPE_SKIP, i.e. always goes through x86_decode_emulated_instruction() and hasn't advanced ctxt->eip (yet). Reviewed-by: Yuan Yao <yuan.yao@intel.com> Link: https://lore.kernel.org/r/20240831001538.336683-7-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-09-09 20:16:21 -07:00
Sean Christopherson	c1edcc41c3	KVM: x86: Retry to-be-emulated insn in "slow" unprotect path iff sp is zapped Resume the guest and thus skip emulation of a non-PTE-writing instruction if and only if unprotecting the gfn actually zapped at least one shadow page. If the gfn is write-protected for some reason other than shadow paging, attempting to unprotect the gfn will effectively fail, and thus retrying the instruction is all but guaranteed to be pointless. This bug has existed for a long time, but was effectively fudged around by the retry RIP+address anti-loop detection. Reviewed-by: Yuan Yao <yuan.yao@intel.com> Link: https://lore.kernel.org/r/20240831001538.336683-6-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-09-09 20:16:21 -07:00
Sean Christopherson	2fb2b7877b	KVM: x86/mmu: Skip emulation on page fault iff 1+ SPs were unprotected When doing "fast unprotection" of nested TDP page tables, skip emulation if and only if at least one gfn was unprotected, i.e. continue with emulation if simply resuming is likely to hit the same fault and risk putting the vCPU into an infinite loop. Note, it's entirely possible to get a false negative, e.g. if a different vCPU faults on the same gfn and unprotects the gfn first, but that's a relatively rare edge case, and emulating is still functionally ok, i.e. saving a few cycles by avoiding emulation isn't worth the risk of putting the vCPU into an infinite loop. Opportunistically rewrite the relevant comment to document in gory detail exactly what scenario the "fast unprotect" logic is handling. Fixes: `147277540b` ("kvm: svm: Add support for additional SVM NPF error codes") Cc: Yuan Yao <yuan.yao@intel.com> Reviewed-by: Yuan Yao <yuan.yao@intel.com> Link: https://lore.kernel.org/r/20240831001538.336683-5-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-09-09 20:16:20 -07:00
Sean Christopherson	989a84c93f	KVM: x86/mmu: Trigger unprotect logic only on write-protection page faults Trigger KVM's various "unprotect gfn" paths if and only if the page fault was a write to a write-protected gfn. To do so, add a new page fault return code, RET_PF_WRITE_PROTECTED, to explicitly and precisely track such page faults. If a page fault requires emulation for any MMIO (or any reason besides write-protection), trying to unprotect the gfn is pointless and risks putting the vCPU into an infinite loop. E.g. KVM will put the vCPU into an infinite loop if the vCPU manages to trigger MMIO on a page table walk. Fixes: `147277540b` ("kvm: svm: Add support for additional SVM NPF error codes") Reviewed-by: Yuan Yao <yuan.yao@intel.com> Link: https://lore.kernel.org/r/20240831001538.336683-4-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-09-09 20:16:19 -07:00
Sean Christopherson	4ececec19a	KVM: x86/mmu: Replace PFERR_NESTED_GUEST_PAGE with a more descriptive helper Drop the globally visible PFERR_NESTED_GUEST_PAGE and replace it with a more appropriately named is_write_to_guest_page_table(). The macro name is misleading, because while all nNPT walks match PAGE\|WRITE\|PRESENT, the reverse is not true. No functional change intended. Link: https://lore.kernel.org/r/20240831001538.336683-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-09-09 20:16:18 -07:00
Sean Christopherson	3dde46a21a	KVM: nVMX: Assert that vcpu->mutex is held when accessing secondary VMCSes Add lockdep assertions in get_vmcs12() and get_shadow_vmcs12() to verify the vCPU's mutex is held, as the returned VMCS objects are dynamically allocated/freed when nested VMX is turned on/off, i.e. accessing vmcs12 structures without holding vcpu->mutex is susceptible to use-after-free. Waive the assertion if the VM is being destroyed, as KVM currently forces a nested VM-Exit when freeing the vCPU. If/when that wart is fixed, the assertion can/should be converted to an unqualified lockdep assertion. See also https://lore.kernel.org/all/Zsd0TqCeY3B5Sb5b@google.com. Link: https://lore.kernel.org/r/20240906043413.1049633-8-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-09-09 20:15:03 -07:00
Sean Christopherson	1ed0f119c5	KVM: nVMX: Explicitly invalidate posted_intr_nv if PI is disabled at VM-Enter Explicitly invalidate posted_intr_nv when emulating nested VM-Enter and posted interrupts are disabled to make it clear that posted_intr_nv is valid if and only if nested posted interrupts are enabled, and as a cheap way to harden against KVM bugs. KVM initializes posted_intr_nv to -1 at vCPU creation and resets it to -1 when unloading vmcs12 and/or leaving nested mode, i.e. this is not a bug fix (or at least, it's not intended to be a bug fix). Note, tracking nested.posted_intr_nv as a u16 subtly adds a measure of safety, as it prevents unintentionally matching KVM's informal "no IRQ" vector of -1, stored as a signed int. Because a u16 can be always be represented as a signed int, the effective "invalid" value of posted_intr_nv, 65535, will be preserved as-is when comparing against an int, i.e. will be zero-extended, not sign-extended, and thus won't get a false positive if KVM is buggy and compares posted_intr_nv against -1. Opportunistically add a comment in vmx_deliver_nested_posted_interrupt() to call out that it must check vmx->nested.posted_intr_nv, not the vector in vmcs12, which is presumably the _entire_ reason nested.posted_intr_nv exists. E.g. vmcs12 is a KVM-controlled snapshot, so there are no TOCTOU races to worry about, the only potential badness is if the vCPU leaves nested and frees vmcs12 between the sender checking is_guest_mode() and dereferencing the vmcs12 pointer. Link: https://lore.kernel.org/r/20240906043413.1049633-7-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-09-09 20:15:02 -07:00
Sean Christopherson	aa9477966a	KVM: x86: Fold kvm_get_apic_interrupt() into kvm_cpu_get_interrupt() Fold kvm_get_apic_interrupt() into kvm_cpu_get_interrupt() now that nVMX essentially open codes kvm_get_apic_interrupt() in order to correctly emulate nested posted interrupts. Opportunistically stop exporting kvm_cpu_get_interrupt(), as the aforementioned nVMX flow was the only user in vendor code. Link: https://lore.kernel.org/r/20240906043413.1049633-6-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-09-09 20:15:01 -07:00
Sean Christopherson	6e0b456547	KVM: nVMX: Detect nested posted interrupt NV at nested VM-Exit injection When synthensizing a nested VM-Exit due to an external interrupt, pend a nested posted interrupt if the external interrupt vector matches L2's PI notification vector, i.e. if the interrupt is a PI notification for L2. This fixes a bug where KVM will incorrectly inject VM-Exit instead of processing nested posted interrupt when IPI virtualization is enabled. Per the SDM, detection of the notification vector doesn't occur until the interrupt is acknowledge and deliver to the CPU core. If the external-interrupt exiting VM-execution control is 1, any unmasked external interrupt causes a VM exit (see Section 26.2). If the "process posted interrupts" VM-execution control is also 1, this behavior is changed and the processor handles an external interrupt as follows: 1. The local APIC is acknowledged; this provides the processor core with an interrupt vector, called here the physical vector. 2. If the physical vector equals the posted-interrupt notification vector, the logical processor continues to the next step. Otherwise, a VM exit occurs as it would normally due to an external interrupt; the vector is saved in the VM-exit interruption-information field. For the most part, KVM has avoided problems because a PI NV for L2 that arrives will L2 is active will be processed by hardware, and KVM checks for a pending notification vector during nested VM-Enter. Thus, to hit the bug, the PI NV interrupt needs to sneak its way into L1's vIRR while L2 is active. Without IPI virtualization, the scenario is practically impossible to hit, modulo L1 doing weird things (see below), as the ordering between vmx_deliver_posted_interrupt() and nested VM-Enter effectively guarantees that either the sender will see the vCPU as being in_guest_mode(), or the receiver will see the interrupt in its vIRR. With IPI virtualization, introduced by commit `d588bb9be1` ("KVM: VMX: enable IPI virtualization"), the sending CPU effectively implements a rough equivalent of vmx_deliver_posted_interrupt(), sans the nested PI NV check. If the target vCPU has a valid PID, the CPU will send a PI NV interrupt based on _L1's_ PID, as the sender's because IPIv table points at L1 PIDs. PIR := 32 bytes at PID_ADDR; // under lock PIR[V] := 1; store PIR at PID_ADDR; // release lock NotifyInfo := 8 bytes at PID_ADDR + 32; // under lock IF NotifyInfo.ON = 0 AND NotifyInfo.SN = 0; THEN NotifyInfo.ON := 1; SendNotify := 1; ELSE SendNotify := 0; FI; store NotifyInfo at PID_ADDR + 32; // release lock IF SendNotify = 1; THEN send an IPI specified by NotifyInfo.NDST and NotifyInfo.NV; FI; As a result, the target vCPU ends up receiving an interrupt on KVM's POSTED_INTR_VECTOR while L2 is running, with an interrupt in L1's PIR for L2's nested PI NV. The POSTED_INTR_VECTOR interrupt triggers a VM-Exit from L2 to L0, KVM moves the interrupt from L1's PIR to vIRR, triggers a KVM_REQ_EVENT prior to re-entry to L2, and calls vmx_check_nested_events(), effectively bypassing all of KVM's "early" checks on nested PI NV. Without IPI virtualization, the bug can likely be hit only if L1 programs an assigned device to _post_ an interrupt to L2's notification vector, by way of L1's PID.PIR. Doing so would allow the interrupt to get into L1's vIRR without KVM checking vmcs12's NV. Which is architecturally allowed, but unlikely behavior for a hypervisor. Cc: Zeng Guang <guang.zeng@intel.com> Reviewed-by: Chao Gao <chao.gao@intel.com> Link: https://lore.kernel.org/r/20240906043413.1049633-5-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-09-09 20:15:00 -07:00
Sean Christopherson	8c23670f2b	KVM: nVMX: Suppress external interrupt VM-Exit injection if there's no IRQ In the should-be-impossible scenario that kvm_cpu_get_interrupt() doesn't return a valid vector after checking kvm_cpu_has_interrupt(), skip VM-Exit injection to reduce the probability of crashing/confusing L1. Now that KVM gets the IRQ _before_ calling nested_vmx_vmexit(), squashing the VM-Exit injection is trivial since there are no actions that need to be undone. Reviewed-by: Chao Gao <chao.gao@intel.com> Link: https://lore.kernel.org/r/20240906043413.1049633-4-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-09-09 20:14:59 -07:00
Sean Christopherson	363010e1dd	KVM: nVMX: Get to-be-acknowledge IRQ for nested VM-Exit at injection site Move the logic to get the to-be-acknowledge IRQ for a nested VM-Exit from nested_vmx_vmexit() to vmx_check_nested_events(), which is subtly the one and only path where KVM invokes nested_vmx_vmexit() with EXIT_REASON_EXTERNAL_INTERRUPT. A future fix will perform a last-minute check on L2's nested posted interrupt notification vector, just before injecting a nested VM-Exit. To handle that scenario correctly, KVM needs to get the interrupt _before_ injecting VM-Exit, as simply querying the highest priority interrupt, via kvm_cpu_has_interrupt(), would result in TOCTOU bug, as a new, higher priority interrupt could arrive between kvm_cpu_has_interrupt() and kvm_cpu_get_interrupt(). Unfortunately, simply moving the call to kvm_cpu_get_interrupt() doesn't suffice, as a VMWRITE to GUEST_INTERRUPT_STATUS.SVI is hiding in kvm_get_apic_interrupt(), and acknowledging the interrupt before nested VM-Exit would cause the VMWRITE to hit vmcs02 instead of vmcs01. Open code a rough equivalent to kvm_cpu_get_interrupt() so that the IRQ is acknowledged after emulating VM-Exit, taking care to avoid the TOCTOU issue described above. Opportunistically convert the WARN_ON() to a WARN_ON_ONCE(). If KVM has a bug that results in a false positive from kvm_cpu_has_interrupt(), spamming dmesg won't help the situation. Note, nested_vmx_reflect_vmexit() can never reflect external interrupts as they are always "wanted" by L0. Link: https://lore.kernel.org/r/20240906043413.1049633-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-09-09 20:14:58 -07:00
Sean Christopherson	a194a3a13c	KVM: x86: Move "ack" phase of local APIC IRQ delivery to separate API Split the "ack" phase, i.e. the movement of an interrupt from IRR=>ISR, out of kvm_get_apic_interrupt() and into a separate API so that nested VMX can acknowledge a specific interrupt _after_ emulating a VM-Exit from L2 to L1. To correctly emulate nested posted interrupts while APICv is active, KVM must: 1. find the highest pending interrupt. 2. check if that IRQ is L2's notification vector 3. emulate VM-Exit if the IRQ is NOT the notification vector 4. ACK the IRQ in L1 _after_ VM-Exit When APICv is active, the process of moving the IRQ from the IRR to the ISR also requires a VMWRITE to update vmcs01.GUEST_INTERRUPT_STATUS.SVI, and so acknowledging the interrupt before switching to vmcs01 would result in marking the IRQ as in-service in the wrong VMCS. KVM currently fudges around this issue by doing kvm_get_apic_interrupt() smack dab in the middle of emulating VM-Exit, but that hack doesn't play nice with nested posted interrupts, as notification vector IRQs don't trigger a VM-Exit in the first place. Cc: Nathan Chancellor <nathan@kernel.org> Link: https://lore.kernel.org/r/20240906043413.1049633-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-09-09 20:14:57 -07:00
Kai Huang	7efb4d8a39	KVM: VMX: Also clear SGX EDECCSSA in KVM CPU caps when SGX is disabled When SGX EDECCSSA support was added to KVM in commit `16a7fe3728` ("KVM/VMX: Allow exposing EDECCSSA user leaf function to KVM guest"), it forgot to clear the X86_FEATURE_SGX_EDECCSSA bit in KVM CPU caps when KVM SGX is disabled. Fix it. Fixes: `16a7fe3728` ("KVM/VMX: Allow exposing EDECCSSA user leaf function to KVM guest") Signed-off-by: Kai Huang <kai.huang@intel.com> Link: https://lore.kernel.org/r/20240905120837.579102-1-kai.huang@intel.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-09-09 20:13:55 -07:00
Yue Haibing	4ca077f26d	KVM: x86: Remove some unused declarations Commit `238adc7705` ("KVM: Cleanup LAPIC interface") removed kvm_lapic_get_base() but leave declaration. And other two declarations were never implenmented since introduction. Signed-off-by: Yue Haibing <yuehaibing@huawei.com> Link: https://lore.kernel.org/r/20240830022537.2403873-1-yuehaibing@huawei.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-09-09 20:12:43 -07:00
Sean Christopherson	3f6821aa14	KVM: x86: Forcibly leave nested if RSM to L2 hits shutdown Leave nested mode before synthesizing shutdown (a.k.a. TRIPLE_FAULT) if RSM fails when resuming L2 (a.k.a. guest mode). Architecturally, shutdown on RSM occurs _before_ the transition back to guest mode on both Intel and AMD. On Intel, per the SDM pseudocode, SMRAM state is loaded before critical VMX state: restore state normally from SMRAM; ... CR4.VMXE := value stored internally; IF internal storage indicates that the logical processor had been in VMX operation (root or non-root) THEN enter VMX operation (root or non-root); restore VMX-critical state as defined in Section 32.14.1; ... restore current VMCS pointer; FI; AMD's APM is both less clearcut and more explicit. Because AMD CPUs save VMCB and guest state in SMRAM itself, given the lack of anything in the APM to indicate a shutdown in guest mode is possible, a straightforward reading of the clause on invalid state is that _what_ state is invalid is irrelevant, i.e. all roads lead to shutdown. An RSM causes a processor shutdown if an invalid-state condition is found in the SMRAM state-save area. This fixes a bug found by syzkaller where synthesizing shutdown for L2 led to a nested VM-Exit (if L1 is intercepting shutdown), which in turn caused KVM to complain about trying to cancel a nested VM-Enter (see commit `759cbd5967` ("KVM: x86: nSVM/nVMX: set nested_run_pending on VM entry which is a result of RSM"). Note, Paolo pointed out that KVM shouldn't set nested_run_pending until after loading SMRAM state. But as above, that's only half the story, KVM shouldn't transition to guest mode either. Unfortunately, fixing that mess requires rewriting the nVMX and nSVM RSM flows to not piggyback their nested VM-Enter flows, as executing the nested VM-Enter flows after loading state from SMRAM would clobber much of said state. For now, add a FIXME to call out that transitioning to guest mode before loading state from SMRAM is wrong. Link: https://lore.kernel.org/all/CABgObfYaUHXyRmsmg8UjRomnpQ0Jnaog9-L2gMjsjkqChjDYUQ@mail.gmail.com Reported-by: syzbot+988d9efcdf137bc05f66@syzkaller.appspotmail.com Closes: https://lore.kernel.org/all/0000000000007a9acb06151e1670@google.com Reported-by: Zheyu Ma <zheyuma97@gmail.com> Closes: https://lore.kernel.org/all/CAMhUBjmXMYsEoVYw_M8hSZjBMHh24i88QYm-RY6HDta5YZ7Wgw@mail.gmail.com Analyzed-by: Michal Wilczynski <michal.wilczynski@intel.com> Cc: Kishen Maloor <kishen.maloor@intel.com> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Link: https://lore.kernel.org/r/20240906161337.1118412-1-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-09-09 20:09:49 -07:00
Linus Torvalds	9d4c304001	KVM: x86: don't fall through case statements without annotations clang warns on this because it has an unannotated fall-through between cases: arch/x86/kvm/x86.c:4819:2: error: unannotated fall-through between switch labels [-Werror,-Wimplicit-fallthrough] and while we could annotate it as a fallthrough, the proper fix is to just add the break for this case, instead of falling through to the default case and the break there. gcc also has that warning, but it looks like gcc only warns for the cases where they fall through to "real code", rather than to just a break. Odd. Fixes: `d30d9ee94c` ("KVM: x86: Only advertise KVM_CAP_READONLY_MEM when supported by VM") Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Tom Dohrmann <erbse.13@gmx.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2024-09-06 15:23:33 -07:00
Steven Rostedt	59cbd4eea4	KVM: Remove HIGH_RES_TIMERS dependency Commit `92b5265d38` ("KVM: Depend on HIGH_RES_TIMERS") added a dependency to high resolution timers with the comment: KVM lapic timer and tsc deadline timer based on hrtimer, setting a leftmost node to rb tree and then do hrtimer reprogram. If hrtimer not configured as high resolution, hrtimer_enqueue_reprogram do nothing and then make kvm lapic timer and tsc deadline timer fail. That was back in 2012, where hrtimer_start_range_ns() would do the reprogramming with hrtimer_enqueue_reprogram(). But as that was a nop with high resolution timers disabled, this did not work. But a lot has changed in the last 12 years. For example, commit `49a2a07514` ("hrtimer: Kick lowres dynticks targets on timer enqueue") modifies __hrtimer_start_range_ns() to work with low res timers. There's been lots of other changes that make low res work. ChromeOS has tested this before as well, and it hasn't seen any issues with running KVM with high res timers disabled. There could be problems, especially at low HZ, for guests that do not support kvmclock and rely on precise delivery of periodic timers to keep their clock running. This can be the APIC timer (provided by the kernel), the RTC (provided by userspace), or the i8254 (choice of kernel/userspace). These guests are few and far between these days, and in the case of the APIC timer + Intel hosts we can use the preemption timer (which is TSC-based and has better latency _and_ accuracy). In KVM, only x86 is requiring CONFIG_HIGH_RES_TIMERS; perhaps a "depends on HIGH_RES_TIMERS \|\| EXPERT" could be added to virt/kvm, or a pr_warn could be added to kvm_init if HIGH_RES_TIMERS are not enabled. But in general, it seems that there must be other code in the kernel (maybe sound/?) that is relying on having high-enough HZ or hrtimers but that's not documented anywhere. Whenever you disable it you probably need to know what you're doing and what your workload is; so the dependency is not particularly interesting, and we can just remove it. Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Message-ID: <20240821095127.45d17b19@gandalf.local.home> [Added the last two paragraphs to the commit message. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2024-09-05 12:04:54 -04:00
Sean Christopherson	590b09b1d8	KVM: x86: Register "emergency disable" callbacks when virt is enabled Register the "disable virtualization in an emergency" callback just before KVM enables virtualization in hardware, as there is no functional need to keep the callbacks registered while KVM happens to be loaded, but is inactive, i.e. if KVM hasn't enabled virtualization. Note, unregistering the callback every time the last VM is destroyed could have measurable latency due to the synchronize_rcu() needed to ensure all references to the callback are dropped before KVM is unloaded. But the latency should be a small fraction of the total latency of disabling virtualization across all CPUs, and userspace can set enable_virt_at_load to completely eliminate the runtime overhead. Add a pointer in kvm_x86_ops to allow vendor code to provide its callback. There is no reason to force vendor code to do the registration, and either way KVM would need a new kvm_x86_ops hook. Suggested-by: Kai Huang <kai.huang@intel.com> Reviewed-by: Chao Gao <chao.gao@intel.com> Reviewed-by: Kai Huang <kai.huang@intel.com> Acked-by: Kai Huang <kai.huang@intel.com> Tested-by: Farrah Chen <farrah.chen@intel.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-ID: <20240830043600.127750-11-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2024-09-04 11:02:34 -04:00
Sean Christopherson	0617a769ce	KVM: x86: Rename virtualization {en,dis}abling APIs to match common KVM Rename x86's the per-CPU vendor hooks used to enable virtualization in hardware to align with the recently renamed arch hooks. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Kai Huang <kai.huang@intel.com> Message-ID: <20240830043600.127750-7-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2024-09-04 11:02:33 -04:00
Sean Christopherson	071f24ad28	KVM: Rename arch hooks related to per-CPU virtualization enabling Rename the per-CPU hooks used to enable virtualization in hardware to align with the KVM-wide helpers in kvm_main.c, and to better capture that the callbacks are invoked on every online CPU. No functional change intended. Suggested-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Kai Huang <kai.huang@intel.com> Message-ID: <20240830043600.127750-5-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2024-09-04 11:02:33 -04:00
Tom Dohrmann	d30d9ee94c	KVM: x86: Only advertise KVM_CAP_READONLY_MEM when supported by VM Until recently, KVM_CAP_READONLY_MEM was unconditionally supported on x86, but this is no longer the case for SEV-ES and SEV-SNP VMs. When KVM_CHECK_EXTENSION is invoked on a VM, only advertise KVM_CAP_READONLY_MEM when it's actually supported. Fixes: `66155de93b` ("KVM: x86: Disallow read-only memslots for SEV-ES and SEV-SNP (and TDX)") Cc: Sean Christopherson <seanjc@google.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Michael Roth <michael.roth@amd.com> Signed-off-by: Tom Dohrmann <erbse.13@gmx.de> Message-ID: <20240902144219.3716974-1-erbse.13@gmx.de> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2024-09-02 10:56:10 -04:00
Sean Christopherson	1876dd69df	KVM: x86: Add fastpath handling of HLT VM-Exits Add a fastpath for HLT VM-Exits by immediately re-entering the guest if it has a pending wake event. When virtual interrupt delivery is enabled, i.e. when KVM doesn't need to manually inject interrupts, this allows KVM to stay in the fastpath run loop when a vIRQ arrives between the guest doing CLI and STI;HLT. Without AMD's Idle HLT-intercept support, the CPU generates a HLT VM-Exit even though KVM will immediately resume the guest. Note, on bare metal, it's relatively uncommon for a modern guest kernel to actually trigger this scenario, as the window between the guest checking for a wake event and committing to HLT is quite small. But in a nested environment, the timings change significantly, e.g. rudimentary testing showed that ~50% of HLT exits where HLT-polling was successful would be serviced by this fastpath, i.e. ~50% of the time that a nested vCPU gets a wake event before KVM schedules out the vCPU, the wake event was pending even before the VM-Exit. Link: https://lore.kernel.org/all/20240528041926.3989-3-manali.shukla@amd.com Link: https://lore.kernel.org/r/20240802195120.325560-6-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-08-29 19:50:22 -07:00
Sean Christopherson	70cdd23851	KVM: x86: Reorganize code in x86.c to co-locate vCPU blocking/running helpers Shuffle code around in x86.c so that the various helpers related to vCPU blocking/running logic are (a) located near each other and (b) ordered so that HLT emulation can use kvm_vcpu_has_events() in a future path. No functional change intended. Link: https://lore.kernel.org/r/20240802195120.325560-5-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-08-29 19:50:21 -07:00
Sean Christopherson	f7f39c50ed	KVM: x86: Exit to userspace if fastpath triggers one on instruction skip Exit to userspace if a fastpath handler triggers such an exit, which can happen when skipping the instruction, e.g. due to userspace single-stepping the guest via KVM_GUESTDBG_SINGLESTEP or because of an emulation failure. Fixes: `404d5d7bff` ("KVM: X86: Introduce more exit_fastpath_completion enum values") Link: https://lore.kernel.org/r/20240802195120.325560-4-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-08-29 19:50:21 -07:00
Sean Christopherson	ea60229af7	KVM: x86: Dedup fastpath MSR post-handling logic Now that the WRMSR fastpath for x2APIC_ICR and TSC_DEADLINE are identical, ignoring the backend MSR handling, consolidate the common bits of skipping the instruction and setting the return value. No functional change intended. Link: https://lore.kernel.org/r/20240802195120.325560-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-08-29 19:50:21 -07:00
Sean Christopherson	0dd45f2cd8	KVM: x86: Re-enter guest if WRMSR(X2APIC_ICR) fastpath is successful Re-enter the guest in the fastpath if WRMSR emulation for x2APIC's ICR is successful, as no additional work is needed, i.e. there is no code unique for WRMSR exits between the fastpath and the "!= EXIT_FASTPATH_NONE" check in __vmx_handle_exit(). Link: https://lore.kernel.org/r/20240802195120.325560-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-08-29 19:50:21 -07:00
Sean Christopherson	32071fa355	KVM: SVM: Track the per-CPU host save area as a VMCB pointer The host save area is a VMCB, track it as such to help readers follow along, but mostly to cleanup/simplify the retrieval of the SEV-ES host save area. Note, the compile-time assertion that offsetof(struct vmcb, save) == EXPECTED_VMCB_CONTROL_AREA_SIZE ensures that the SEV-ES save area is indeed at offset 0x400 (whoever added the expected/architectural VMCB offsets apparently likes decimal). No functional change intended. Link: https://lore.kernel.org/r/20240802204511.352017-4-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-08-29 19:06:12 -07:00
Sean Christopherson	48547fe75e	KVM: SVM: Add a helper to convert a SME-aware PA back to a struct page Add __sme_pa_to_page() to pair with __sme_page_pa() and use it to replace open coded equivalents, including for "iopm_base", which previously avoided having to do __sme_clr() by storing the raw PA in the global variable. Opportunistically convert __sme_page_pa() to a helper to provide type safety. No functional change intended. Link: https://lore.kernel.org/r/20240802204511.352017-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-08-29 19:06:12 -07:00
Sean Christopherson	1dc9cc1c4c	KVM: x86/mmu: Reword a misleading comment about checking gpte_changed() Rewrite the comment in FNAME(fetch) to explain why KVM needs to check that the gPTE is still fresh before continuing the shadow page walk, even if KVM already has a linked shadow page for the gPTE in question. No functional change intended. Link: https://lore.kernel.org/r/20240802203900.348808-4-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-08-29 19:05:55 -07:00
Sean Christopherson	7d67b03e6f	KVM: x86/mmu: Drop pointless "return" wrapper label in FNAME(fetch) Drop the pointless and poorly named "out_gpte_changed" label, in FNAME(fetch), and instead return RET_PF_RETRY directly. No functional change intended. Link: https://lore.kernel.org/r/20240802203900.348808-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-08-29 19:05:55 -07:00
Sean Christopherson	174b6e4a25	KVM: x86/mmu: Decrease indentation in logic to sync new indirect shadow page Combine the back-to-back if-statements for synchronizing children when linking a new indirect shadow page in order to decrease the indentation, and to make it easier to "see" the logic in its entirety. No functional change intended. Link: https://lore.kernel.org/r/20240802203900.348808-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-08-29 19:05:55 -07:00
Sean Christopherson	73b42dc69b	KVM: x86: Re-split x2APIC ICR into ICR+ICR2 for AMD (x2AVIC) Re-introduce the "split" x2APIC ICR storage that KVM used prior to Intel's IPI virtualization support, but only for AMD. While not stated anywhere in the APM, despite stating the ICR is a single 64-bit register, AMD CPUs store the 64-bit ICR as two separate 32-bit values in ICR and ICR2. When IPI virtualization (IPIv on Intel, all AVIC flavors on AMD) is enabled, KVM needs to match CPU behavior as some ICR ICR writes will be handled by the CPU, not by KVM. Add a kvm_x86_ops knob to control the underlying format used by the CPU to store the x2APIC ICR, and tune it to AMD vs. Intel regardless of whether or not x2AVIC is enabled. If KVM is handling all ICR writes, the storage format for x2APIC mode doesn't matter, and having the behavior follow AMD versus Intel will provide better test coverage and ease debugging. Fixes: `4d1d7942e3` ("KVM: SVM: Introduce logic to (de)activate x2AVIC mode") Cc: stable@vger.kernel.org Cc: Maxim Levitsky <mlevitsk@redhat.com> Cc: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com> Link: https://lore.kernel.org/r/20240719235107.3023592-4-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-08-29 16:25:05 -07:00
Sean Christopherson	d33234342f	KVM: x86: Move x2APIC ICR helper above kvm_apic_write_nodecode() Hoist kvm_x2apic_icr_write() above kvm_apic_write_nodecode() so that a local helper to _read_ the x2APIC ICR can be added and used in the nodecode path without needing a forward declaration. No functional change intended. Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20240719235107.3023592-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-08-29 16:21:50 -07:00
Sean Christopherson	71bf395a27	KVM: x86: Enforce x2APIC's must-be-zero reserved ICR bits Inject a #GP on a WRMSR(ICR) that attempts to set any reserved bits that are must-be-zero on both Intel and AMD, i.e. any reserved bits other than the BUSY bit, which Intel ignores and basically says is undefined. KVM's xapic_state_test selftest has been fudging the bug since commit `4b88b1a518` ("KVM: selftests: Enhance handling WRMSR ICR register in x2APIC mode"), which essentially removed the testcase instead of fixing the bug. WARN if the nodecode path triggers a #GP, as the CPU is supposed to check reserved bits for ICR when it's partially virtualized. Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20240719235107.3023592-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-08-29 16:21:50 -07:00
Vitaly Kuznetsov	5fa9f0480c	KVM: SEV: Update KVM_AMD_SEV Kconfig entry and mention SEV-SNP SEV-SNP support is present since commit `1dfe571c12` ("KVM: SEV: Add initial SEV-SNP support") but Kconfig entry wasn't updated and still mentions SEV and SEV-ES only. Add SEV-SNP there and, while on it, expand 'SEV' in the description as 'Encrypted VMs' is not what 'SEV' stands for. No functional change. Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Link: https://lore.kernel.org/r/20240828122111.160273-1-vkuznets@redhat.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-08-28 05:46:25 -07:00
Ravi Bangoria	54950bfe2b	KVM: SVM: Don't advertise Bus Lock Detect to guest if SVM support is missing If host supports Bus Lock Detect, KVM advertises it to guests even if SVM support is absent. Additionally, guest wouldn't be able to use it despite guest CPUID bit being set. Fix it by unconditionally clearing the feature bit in KVM cpu capability. Reported-by: Jim Mattson <jmattson@google.com> Closes: https://lore.kernel.org/r/CALMp9eRet6+v8Y1Q-i6mqPm4hUow_kJNhmVHfOV8tMfuSS=tVg@mail.gmail.com Fixes: `76ea438b4a` ("KVM: X86: Expose bus lock debug exception to guest") Cc: stable@vger.kernel.org Signed-off-by: Ravi Bangoria <ravi.bangoria@amd.com> Reviewed-by: Jim Mattson <jmattson@google.com> Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com> Link: https://lore.kernel.org/r/20240808062937.1149-4-ravi.bangoria@amd.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-08-22 12:17:44 -07:00
Sean Christopherson	44dd0f5732	KVM: x86: Suppress userspace access failures on unsupported, "emulated" MSRs Extend KVM's suppression of userspace MSR access failures to MSRs that KVM reports as emulated, but are ultimately unsupported, e.g. if the VMX MSRs are emulated by KVM, but are unsupported given the vCPU model. Suggested-by: Weijiang Yang <weijiang.yang@intel.com> Reviewed-by: Weijiang Yang <weijiang.yang@intel.com> Link: https://lore.kernel.org/r/20240802181935.292540-11-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-08-22 12:07:39 -07:00
Sean Christopherson	64a5d7a109	KVM: x86: Suppress failures on userspace access to advertised, unsupported MSRs Extend KVM's suppression of failures due to a userspace access to an unsupported, but advertised as a "to save" MSR to all MSRs, not just those that happen to reach the default case statements in kvm_get_msr_common() and kvm_set_msr_common(). KVM's soon-to-be-established ABI is that if an MSR is advertised to userspace, then userspace is allowed to read the MSR, and write back the value that was read, i.e. why an MSR is unsupported doesn't change KVM's ABI. Practically speaking, this is very nearly a nop, as the only other paths that return KVM_MSR_RET_UNSUPPORTED are {svm,vmx}_get_feature_msr(), and it's unlikely, though not impossible, that userspace is using KVM_GET_MSRS on unsupported MSRs. The primary goal of moving the suppression to common code is to allow returning KVM_MSR_RET_UNSUPPORTED as appropriate throughout KVM, without having to manually handle the "is userspace accessing an advertised" waiver. I.e. this will allow formalizing KVM's ABI without incurring a high maintenance cost. Link: https://lore.kernel.org/r/20240802181935.292540-10-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-08-22 12:07:38 -07:00
Sean Christopherson	3adef90345	KVM: x86: Hoist x86.c's global msr_* variables up above kvm_do_msr_access() Move the definitions of the various MSR arrays above kvm_do_msr_access() so that kvm_do_msr_access() can query the arrays when handling failures, e.g. to squash errors if userspace tries to read an MSR that isn't fully supported, but that KVM advertised as being an MSR-to-save. No functional change intended. Link: https://lore.kernel.org/r/20240802181935.292540-9-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-08-22 12:07:37 -07:00
Sean Christopherson	1cec203498	KVM: x86: Funnel all fancy MSR return value handling into a common helper Add a common helper, kvm_do_msr_access(), to invoke the "leaf" APIs that are type and access specific, and more importantly to handle errors that are returned from the leaf APIs. I.e. turn kvm_msr_ignored_check() from a a helper that is called on an error, into a trampoline that detects errors and applies relevant side effects, e.g. logging unimplemented accesses. Because the leaf APIs are used for guest accesses, userspace accesses, and KVM accesses, and because KVM supports restricting access to MSRs from userspace via filters, the error handling is subtly non-trivial. E.g. KVM has had at least one bug escape due to making each "outer" function handle errors. See commit `3376ca3f1a` ("KVM: x86: Fix KVM_GET_MSRS stack info leak"). Link: https://lore.kernel.org/r/20240802181935.292540-8-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-08-22 12:07:36 -07:00
Sean Christopherson	7075f16361	KVM: x86: Refactor kvm_get_feature_msr() to avoid struct kvm_msr_entry Refactor kvm_get_feature_msr() to take the components of kvm_msr_entry as separate parameters, along with a vCPU pointer, i.e. to give it the same prototype as kvm_{g,s}et_msr_ignored_check(). This will allow using a common inner helper for handling accesses to "regular" and feature MSRs. No functional change intended. Link: https://lore.kernel.org/r/20240802181935.292540-7-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-08-22 12:07:35 -07:00
Sean Christopherson	b848f24bd7	KVM: x86: Rename get_msr_feature() APIs to get_feature_msr() Rename all APIs related to feature MSRs from get_msr_feature() to get_feature_msr(). The APIs get "feature MSRs", not "MSR features". And unlike kvm_{g,s}et_msr_common(), the "feature" adjective doesn't describe the helper itself. No functional change intended. Link: https://lore.kernel.org/r/20240802181935.292540-6-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-08-22 12:06:56 -07:00
Sean Christopherson	74c6c98a59	KVM: x86: Refactor kvm_x86_ops.get_msr_feature() to avoid kvm_msr_entry Refactor get_msr_feature() to take the index and data pointer as distinct parameters in anticipation of eliminating "struct kvm_msr_entry" usage further up the primary callchain. No functional change intended. Link: https://lore.kernel.org/r/20240802181935.292540-5-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-08-22 12:06:30 -07:00
Sean Christopherson	aaecae7b6a	KVM: x86: Rename KVM_MSR_RET_INVALID to KVM_MSR_RET_UNSUPPORTED Rename the "INVALID" internal MSR error return code to "UNSUPPORTED" to try and make it more clear that access was denied because the MSR itself is unsupported/unknown. "INVALID" is too ambiguous, as it could just as easily mean the value for WRMSR as invalid. Avoid UNKNOWN and UNIMPLEMENTED, as the error code is used for MSRs that _are_ actually implemented by KVM, e.g. if the MSR is unsupported because an associated feature flag is not present in guest CPUID. Opportunistically beef up the comments for the internal MSR error codes. Link: https://lore.kernel.org/r/20240802181935.292540-4-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-08-22 12:06:29 -07:00
Sean Christopherson	b58b808cbe	KVM: x86: Move MSR_TYPE_{R,W,RW} values from VMX to x86, as enums Move VMX's MSR_TYPE_{R,W,RW} #defines to x86.h, as enums, so that they can be used by common x86 code, e.g. instead of doing "bool write". Opportunistically tweak the definitions to make it more obvious that the values are bitmasks, not arbitrary ascending values. No functional change intended. Link: https://lore.kernel.org/r/20240802181935.292540-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-08-22 12:06:29 -07:00
Sean Christopherson	74a0e79df6	KVM: SVM: Disallow guest from changing userspace's MSR_AMD64_DE_CFG value Inject a #GP if the guest attempts to change MSR_AMD64_DE_CFG from its current value, not if the guest attempts to write a value other than KVM's set of supported bits. As per the comment and the changelog of the original code, the intent is to effectively make MSR_AMD64_DE_CFG read- only for the guest. Opportunistically use a more conventional equality check instead of an exclusive-OR check to detect attempts to change bits. Fixes: `d1d93fa90f` ("KVM: SVM: Add MSR-based feature support for serializing LFENCE") Cc: Tom Lendacky <thomas.lendacky@amd.com> Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com> Link: https://lore.kernel.org/r/20240802181935.292540-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-08-22 12:06:24 -07:00
Sean Christopherson	acf2923271	KVM: x86/mmu: Clean up function comments for dirty logging APIs Rework the function comment for kvm_arch_mmu_enable_log_dirty_pt_masked() into the body of the function, as it has gotten a bit stale, is harder to read without the code context, and is the last source of warnings for W=1 builds in KVM x86 due to using a kernel-doc comment without documenting all parameters. Opportunistically subsume the functions comments for kvm_mmu_write_protect_pt_masked() and kvm_mmu_clear_dirty_pt_masked(), as there is no value in regurgitating similar information at a higher level, and capturing the differences between write-protection and PML-based dirty logging is best done in a common location. No functional change intended. Cc: David Matlack <dmatlack@google.com> Reviewed-by: Kai Huang <kai.huang@intel.com> Reviewed-by: Pankaj Gupta <pankaj.gupta@amd.com> Link: https://lore.kernel.org/r/20240802202006.340854-1-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-08-22 11:40:04 -07:00
Li Chen	e0183a42e3	KVM: x86: Use this_cpu_ptr() in kvm_user_return_msr_cpu_online Use this_cpu_ptr() instead of open coding the equivalent in kvm_user_return_msr_cpu_online. Signed-off-by: Li Chen <chenl311@chinatelecom.cn> Link: https://lore.kernel.org/r/87zfp96ojk.wl-me@linux.beauty Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-08-22 11:37:41 -07:00
Vitaly Kuznetsov	2ab637df5f	KVM: VMX: hyper-v: Prevent impossible NULL pointer dereference in evmcs_load() GCC 12.3.0 complains about a potential NULL pointer dereference in evmcs_load() as hv_get_vp_assist_page() can return NULL. In fact, this cannot happen because KVM verifies (hv_init_evmcs()) that every CPU has a valid VP assist page and aborts enabling the feature otherwise. CPU onlining path is also checked in vmx_hardware_enable(). To make the compiler happy and to future proof the code, add a KVM_BUG_ON() sentinel. It doesn't seem to be possible (and logical) to observe evmcs_load() happening without an active vCPU so it is presumed that kvm_get_running_vcpu() can't return NULL. No functional change intended. Reported-by: Mirsad Todorovac <mtodorovac69@gmail.com> Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Link: https://lore.kernel.org/r/20240816130124.286226-1-vkuznets@redhat.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-08-22 11:35:18 -07:00
Maxim Levitsky	41ab0d59fa	KVM: nVMX: Use vmx_segment_cache_clear() instead of open coded equivalent In prepare_vmcs02_rare(), call vmx_segment_cache_clear() instead of setting segment_cache.bitmask directly. Using the helper minimizes the chances of prepare_vmcs02_rare() doing the wrong thing in the future, e.g. if KVM ends up doing more than just zero the bitmask when purging the cache. No functional change intended. Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Link: https://lore.kernel.org/r/20240725175232.337266-2-mlevitsk@redhat.com [sean: massage changelog] Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-08-22 11:35:17 -07:00
Sean Christopherson	653ea4489e	KVM: nVMX: Honor userspace MSR filter lists for nested VM-Enter/VM-Exit Synthesize a consistency check VM-Exit (VM-Enter) or VM-Abort (VM-Exit) if L1 attempts to load/store an MSR via the VMCS MSR lists that userspace has disallowed access to via an MSR filter. Intel already disallows including a handful of "special" MSRs in the VMCS lists, so denying access isn't completely without precedent. More importantly, the behavior is well-defined _and_ can be communicated the end user, e.g. to the customer that owns a VM running as L1 on top of KVM. On the other hand, ignoring userspace MSR filters is all but guaranteed to result in unexpected behavior as the access will hit KVM's internal state, which is likely not up-to-date. Unlike KVM-internal accesses, instruction emulation, and dedicated VMCS fields, the MSRs in the VMCS load/store lists are 100% guest controlled, thus making it all but impossible to reason about the correctness of ignoring the MSR filter. And if userspace really wants to deny access to MSRs via the aforementioned scenarios, userspace can hide the associated feature from the guest, e.g. by disabling the PMU to prevent accessing PERF_GLOBAL_CTRL via its VMCS field. But for the MSR lists, KVM is blindly processing MSRs; the MSR filters are the _only_ way for userspace to deny access. This partially reverts commit `ac8d6cad3c` ("KVM: x86: Only do MSR filtering when access MSR by rdmsr/wrmsr"). Cc: Hou Wenlong <houwenlong.hwl@antgroup.com> Cc: Jim Mattson <jmattson@google.com> Link: https://lore.kernel.org/r/20240722235922.3351122-1-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-08-22 11:35:16 -07:00
Kai Huang	d9aa56edad	KVM: VMX: Do not account for temporary memory allocation in ECREATE emulation In handle_encls_ecreate(), a page is allocated to store a copy of SECS structure used by the ENCLS[ECREATE] leaf from the guest. This page is only used temporarily and is freed after use in handle_encls_ecreate(). Don't account for the memory allocation of this page per [1]. Link: https://lore.kernel.org/kvm/b999afeb588eb75d990891855bc6d58861968f23.camel@intel.com/T/#mb81987afc3ab308bbb5861681aa9a20f2aece7fd [1] Signed-off-by: Kai Huang <kai.huang@intel.com> Link: https://lore.kernel.org/r/20240715101224.90958-1-kai.huang@intel.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-08-22 11:35:15 -07:00
Qiang Liu	caf22c6dd3	KVM: VMX: Modify the BUILD_BUG_ON_MSG of the 32-bit field in the vmcs_check16 function According to the SDM, the meaning of field bit 0 is: Access type (0 = full; 1 = high); must be full for 16-bit, 32-bit, and natural-width fields. So there is no 32-bit high field here, it should be a 32-bit field instead. Signed-off-by: Qiang Liu <liuq131@chinatelecom.cn> Link: https://lore.kernel.org/r/20240702064609.52487-1-liuq131@chinatelecom.cn Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-08-22 11:35:15 -07:00
Yongqiang Liu	c501062bb2	KVM: SVM: Remove unnecessary GFP_KERNEL_ACCOUNT in svm_set_nested_state() The fixed size temporary variables vmcb_control_area and vmcb_save_area allocated in svm_set_nested_state() are released when the function exits. Meanwhile, svm_set_nested_state() also have vcpu mutex held to avoid massive concurrency allocation, so we don't need to set GFP_KERNEL_ACCOUNT. Signed-off-by: Yongqiang Liu <liuyongqiang13@huawei.com> Link: https://lore.kernel.org/r/20240821112737.3649937-1-liuyongqiang13@huawei.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-08-22 11:35:09 -07:00
Xin Li	566975f6ec	KVM: nVMX: Use macros and #defines in vmx_restore_vmx_misc() Use macros in vmx_restore_vmx_misc() instead of open coding everything using BIT_ULL() and GENMASK_ULL(). Opportunistically split feature bits and reserved bits into separate variables, and add a comment explaining the subset logic (it's not immediately obvious that the set of feature bits is NOT the set of _supported_ feature bits). Cc: Shan Kang <shan.kang@intel.com> Cc: Kai Huang <kai.huang@intel.com> Signed-off-by: Xin Li <xin3.li@intel.com> [sean: split to separate patch, write changelog, drop #defines] Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Reviewed-by: Kai Huang <kai.huang@intel.com> Reviewed-by: Zhao Liu <zhao1.liu@intel.com> Link: https://lore.kernel.org/r/20240605231918.2915961-11-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-08-22 11:25:54 -07:00
Xin Li	8f56b14e9f	KVM: VMX: Open code VMX preemption timer rate mask in its accessor Use vmx_misc_preemption_timer_rate() to get the rate in hardware_setup(), and open code the rate's bitmask in vmx_misc_preemption_timer_rate() so that the function looks like all the helpers that grab values from VMX_BASIC and VMX_MISC MSR values. No functional change intended. Cc: Shan Kang <shan.kang@intel.com> Cc: Kai Huang <kai.huang@intel.com> Signed-off-by: Xin Li <xin3.li@intel.com> [sean: split to separate patch, write changelog] Reviewed-by: Kai Huang <kai.huang@intel.com> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Link: https://lore.kernel.org/r/20240605231918.2915961-10-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-08-22 11:25:54 -07:00
Sean Christopherson	dc1e67f70f	KVM VMX: Move MSR_IA32_VMX_MISC bit defines to asm/vmx.h Move the handful of MSR_IA32_VMX_MISC bit defines that are currently in msr-indx.h to vmx.h so that all of the VMX_MISC defines and wrappers can be found in a single location. Opportunistically use BIT_ULL() instead of open coding hex values, add defines for feature bits that are architecturally defined, and move the defines down in the file so that they are colocated with the helpers for getting fields from VMX_MISC. No functional change intended. Cc: Shan Kang <shan.kang@intel.com> Cc: Kai Huang <kai.huang@intel.com> Signed-off-by: Xin Li <xin3.li@intel.com> [sean: split to separate patch, write changelog] Reviewed-by: Zhao Liu <zhao1.liu@intel.com> Reviewed-by: Kai Huang <kai.huang@intel.com> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Link: https://lore.kernel.org/r/20240605231918.2915961-9-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-08-22 11:25:53 -07:00
Sean Christopherson	92e648042c	KVM: nVMX: Add a helper to encode VMCS info in MSR_IA32_VMX_BASIC Add a helper to encode the VMCS revision, size, and supported memory types in MSR_IA32_VMX_BASIC, i.e. when synthesizing KVM's supported BASIC MSR value, and delete the now unused VMCS size and memtype shift macros. For a variety of reasons, KVM has shifted (pun intended) to using helpers to get information from the VMX MSRs, as opposed to defined MASK and SHIFT macros for direct use. Provide a similar helper for the nested VMX code, which needs to set information, so that KVM isn't left with a mix of SHIFT macros and dedicated helpers. Reported-by: Xiaoyao Li <xiaoyao.li@intel.com> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Reviewed-by: Kai Huang <kai.huang@intel.com> Link: https://lore.kernel.org/r/20240605231918.2915961-8-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-08-22 11:25:52 -07:00
Xin Li	c97b106fa8	KVM: nVMX: Use macros and #defines in vmx_restore_vmx_basic() Use macros in vmx_restore_vmx_basic() instead of open coding everything using BIT_ULL() and GENMASK_ULL(). Opportunistically split feature bits and reserved bits into separate variables, and add a comment explaining the subset logic (it's not immediately obvious that the set of feature bits is NOT the set of _supported_ feature bits). Cc: Shan Kang <shan.kang@intel.com> Cc: Kai Huang <kai.huang@intel.com> Signed-off-by: Xin Li <xin3.li@intel.com> [sean: split to separate patch, write changelog, drop #defines] Reviewed-by: Zhao Liu <zhao1.liu@intel.com> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Reviewed-by: Kai Huang <kai.huang@intel.com> Link: https://lore.kernel.org/r/20240605231918.2915961-7-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-08-22 11:25:51 -07:00
Xin Li	9df398ff7d	KVM: VMX: Track CPU's MSR_IA32_VMX_BASIC as a single 64-bit value Track the "basic" capabilities VMX MSR as a single u64 in vmcs_config instead of splitting it across three fields, that obviously don't combine into a single 64-bit value, so that KVM can use the macros that define MSR bits using their absolute position. Replace all open coded shifts and masks, many of which are relative to the "high" half, with the appropriate macro. Opportunistically use VMX_BASIC_32BIT_PHYS_ADDR_ONLY instead of an open coded equivalent, and clean up the related comment to not reference a specific SDM section (to the surprise of no one, the comment is stale). No functional change intended (though obviously the code generation will be quite different). Cc: Shan Kang <shan.kang@intel.com> Cc: Kai Huang <kai.huang@intel.com> Signed-off-by: Xin Li <xin3.li@intel.com> [sean: split to separate patch, write changelog] Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Reviewed-by: Kai Huang <kai.huang@intel.com> Reviewed-by: Zhao Liu <zhao1.liu@intel.com> Link: https://lore.kernel.org/r/20240605231918.2915961-6-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-08-22 11:25:50 -07:00
Sean Christopherson	b6717d35d8	KVM: x86: Stuff vCPU's PAT with default value at RESET, not creation Move the stuffing of the vCPU's PAT to the architectural "default" value from kvm_arch_vcpu_create() to kvm_vcpu_reset(), guarded by !init_event, to better capture that the default value is the value "Following Power-up or Reset". E.g. setting PAT only during creation would break if KVM were to expose a RESET ioctl() to userspace (which is unlikely, but that's not a good reason to have unintuitive code). No functional change. Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Reviewed-by: Kai Huang <kai.huang@intel.com> Reviewed-by: Jim Mattson <jmattson@google.com> Reviewed-by: Zhao Liu <zhao1.liu@intel.com> Link: https://lore.kernel.org/r/20240605231918.2915961-4-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-08-22 11:25:48 -07:00
Sean Christopherson	beb2e44604	x86/cpu: KVM: Move macro to encode PAT value to common header Move pat/memtype.c's PAT() macro to msr-index.h as PAT_VALUE(), and use it in KVM to define the default (Power-On / RESET) PAT value instead of open coding an inscrutable magic number. No functional change intended. Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Reviewed-by: Kai Huang <kai.huang@intel.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20240605231918.2915961-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-08-22 11:25:47 -07:00
Sean Christopherson	e7e80b66fb	x86/cpu: KVM: Add common defines for architectural memory types (PAT, MTRRs, etc.) Add defines for the architectural memory types that can be shoved into various MSRs and registers, e.g. MTRRs, PAT, VMX capabilities MSRs, EPTPs, etc. While most MSRs/registers support only a subset of all memory types, the values themselves are architectural and identical across all users. Leave the goofy MTRR_TYPE_* definitions as-is since they are in a uapi header, but add compile-time assertions to connect the dots (and sanity check that the msr-index.h values didn't get fat-fingered). Keep the VMX_EPTP_MT_* defines so that it's slightly more obvious that the EPTP holds a single memory type in 3 of its 64 bits; those bits just happen to be 2:0, i.e. don't need to be shifted. Opportunistically use X86_MEMTYPE_WB instead of an open coded '6' in setup_vmcs_config(). No functional change intended. Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Kai Huang <kai.huang@intel.com> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Reviewed-by: Kai Huang <kai.huang@intel.com> Link: https://lore.kernel.org/r/20240605231918.2915961-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-08-22 11:25:46 -07:00
Maxim Levitsky	dad1613e05	KVM: SVM: fix emulation of msr reads/writes of MSR_FS_BASE and MSR_GS_BASE If these msrs are read by the emulator (e.g due to 'force emulation' prefix), SVM code currently fails to extract the corresponding segment bases, and return them to the emulator. Fix that. Cc: stable@vger.kernel.org Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Link: https://lore.kernel.org/r/20240802151608.72896-3-mlevitsk@redhat.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-08-22 11:25:34 -07:00
Sean Christopherson	4bcdd831d9	KVM: x86: Acquire kvm->srcu when handling KVM_SET_VCPU_EVENTS Grab kvm->srcu when processing KVM_SET_VCPU_EVENTS, as KVM will forcibly leave nested VMX/SVM if SMM mode is being toggled, and leaving nested VMX reads guest memory. Note, kvm_vcpu_ioctl_x86_set_vcpu_events() can also be called from KVM_RUN via sync_regs(), which already holds SRCU. I.e. trying to precisely use kvm_vcpu_srcu_read_lock() around the problematic SMM code would cause problems. Acquiring SRCU isn't all that expensive, so for simplicity, grab it unconditionally for KVM_SET_VCPU_EVENTS. ============================= WARNING: suspicious RCU usage 6.10.0-rc7-332d2c1d713e-next-vm #552 Not tainted ----------------------------- include/linux/kvm_host.h:1027 suspicious rcu_dereference_check() usage! other info that might help us debug this: rcu_scheduler_active = 2, debug_locks = 1 1 lock held by repro/1071: #0: ffff88811e424430 (&vcpu->mutex){+.+.}-{3:3}, at: kvm_vcpu_ioctl+0x7d/0x970 [kvm] stack backtrace: CPU: 15 PID: 1071 Comm: repro Not tainted 6.10.0-rc7-332d2c1d713e-next-vm #552 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015 Call Trace: <TASK> dump_stack_lvl+0x7f/0x90 lockdep_rcu_suspicious+0x13f/0x1a0 kvm_vcpu_gfn_to_memslot+0x168/0x190 [kvm] kvm_vcpu_read_guest+0x3e/0x90 [kvm] nested_vmx_load_msr+0x6b/0x1d0 [kvm_intel] load_vmcs12_host_state+0x432/0xb40 [kvm_intel] vmx_leave_nested+0x30/0x40 [kvm_intel] kvm_vcpu_ioctl_x86_set_vcpu_events+0x15d/0x2b0 [kvm] kvm_arch_vcpu_ioctl+0x1107/0x1750 [kvm] ? mark_held_locks+0x49/0x70 ? kvm_vcpu_ioctl+0x7d/0x970 [kvm] ? kvm_vcpu_ioctl+0x497/0x970 [kvm] kvm_vcpu_ioctl+0x497/0x970 [kvm] ? lock_acquire+0xba/0x2d0 ? find_held_lock+0x2b/0x80 ? do_user_addr_fault+0x40c/0x6f0 ? lock_release+0xb7/0x270 __x64_sys_ioctl+0x82/0xb0 do_syscall_64+0x6c/0x170 entry_SYSCALL_64_after_hwframe+0x4b/0x53 RIP: 0033:0x7ff11eb1b539 </TASK> Fixes: `f7e570780e` ("KVM: x86: Forcibly leave nested virt when SMM state is toggled") Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20240723232055.3643811-1-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-08-22 11:25:33 -07:00
Sean Christopherson	28cec7f08b	KVM: x86/mmu: Check that root is valid/loaded when pre-faulting SPTEs Error out if kvm_mmu_reload() fails when pre-faulting memory, as trying to fault-in SPTEs will fail miserably due to root.hpa pointing at garbage. Note, kvm_mmu_reload() can return -EIO and thus trigger the WARN on -EIO in kvm_vcpu_pre_fault_memory(), but all such paths also WARN, i.e. the WARN isn't user-triggerable and won't run afoul of warn-on-panic because the kernel would already be panicking. BUG: unable to handle page fault for address: 000029ffffffffe8 #PF: supervisor read access in kernel mode #PF: error_code(0x0000) - not-present page PGD 0 P4D 0 Oops: Oops: 0000 [#1] PREEMPT SMP CPU: 22 PID: 1069 Comm: pre_fault_memor Not tainted 6.10.0-rc7-332d2c1d713e-next-vm #548 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015 RIP: 0010:is_page_fault_stale+0x3e/0xe0 [kvm] RSP: 0018:ffffc9000114bd48 EFLAGS: 00010206 RAX: 00003fffffffffc0 RBX: ffff88810a07c080 RCX: ffffc9000114bd78 RDX: ffff88810a07c080 RSI: ffffea0000000000 RDI: ffff88810a07c080 RBP: ffffc9000114bd78 R08: 00007fa3c8c00000 R09: 8000000000000225 R10: ffffea00043d7d80 R11: 0000000000000000 R12: ffff88810a07c080 R13: 0000000100000000 R14: ffffc9000114be58 R15: 0000000000000000 FS: 00007fa3c9da0740(0000) GS:ffff888277d80000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 000029ffffffffe8 CR3: 000000011d698000 CR4: 0000000000352eb0 Call Trace: <TASK> kvm_tdp_page_fault+0xcc/0x160 [kvm] kvm_mmu_do_page_fault+0xfb/0x1f0 [kvm] kvm_arch_vcpu_pre_fault_memory+0xd0/0x1a0 [kvm] kvm_vcpu_ioctl+0x761/0x8c0 [kvm] __x64_sys_ioctl+0x82/0xb0 do_syscall_64+0x5b/0x160 entry_SYSCALL_64_after_hwframe+0x4b/0x53 </TASK> Modules linked in: kvm_intel kvm CR2: 000029ffffffffe8 ---[ end trace 0000000000000000 ]--- Fixes: `6e01b7601d` ("KVM: x86: Implement kvm_arch_vcpu_pre_fault_memory()") Reported-by: syzbot+23786faffb695f17edaa@syzkaller.appspotmail.com Closes: https://lore.kernel.org/all/0000000000002b84dc061dd73544@google.com Reviewed-by: Kai Huang <kai.huang@intel.com> Tested-by: xingwei lee <xrivendell7@gmail.com> Tested-by: yuxin wang <wang1315768607@163.com> Link: https://lore.kernel.org/r/20240723000211.3352304-1-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-08-22 11:25:32 -07:00
Yan Zhao	e03a7caa53	KVM: x86/mmu: Fixup comments missed by the REMOVED_SPTE=>FROZEN_SPTE rename Replace "removed" with "frozen" in comments as appropriate to complete the rename of REMOVED_SPTE to FROZEN_SPTE. Fixes: `964cea8171` ("KVM: x86/tdp_mmu: Rename REMOVED_SPTE to FROZEN_SPTE") Signed-off-by: Yan Zhao <yan.y.zhao@intel.com> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Link: https://lore.kernel.org/r/20240712233438.518591-1-rick.p.edgecombe@intel.com [sean: write changelog] Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-08-22 11:25:31 -07:00
Tao Su	1c450ffef5	KVM: x86: Advertise AVX10.1 CPUID to userspace Advertise AVX10.1 related CPUIDs, i.e. report AVX10 support bit via CPUID.(EAX=07H, ECX=01H):EDX[bit 19] and new CPUID leaf 0x24H so that guest OS and applications can query the AVX10.1 CPUIDs directly. Intel AVX10 represents the first major new vector ISA since the introduction of Intel AVX512, which will establish a common, converged vector instruction set across all Intel architectures[1]. AVX10.1 is an early version of AVX10, that enumerates the Intel AVX512 instruction set at 128, 256, and 512 bits which is enabled on Granite Rapids. I.e., AVX10.1 is only a new CPUID enumeration with no new functionality. New features, e.g. Embedded Rounding and Suppress All Exceptions (SAE) will be introduced in AVX10.2. Advertising AVX10.1 is safe because there is nothing to enable for AVX10.1, i.e. it's purely a new way to enumerate support, thus there will never be anything for the kernel to enable. Note just the CPUID checking is changed when using AVX512 related instructions, e.g. if using one AVX512 instruction needs to check (AVX512 AND AVX512DQ), it can check ((AVX512 AND AVX512DQ) OR AVX10.1) after checking XCR0[7:5]. The versions of AVX10 are expected to be inclusive, e.g. version N+1 is a superset of version N. Per the spec, the version can never be 0, just advertise AVX10.1 if it's supported in hardware. Moreover, advertising AVX10_{128,256,512} needs to land in the same commit as advertising basic AVX10.1 support, otherwise KVM would advertise an impossible CPU model. E.g. a CPU with AVX512 but not AVX10.1/512 is impossible per the SDM. As more and more AVX related CPUIDs are added (it would have resulted in around 40-50 CPUID flags when developing AVX10), the versioning approach is introduced. But incrementing version numbers are bad for virtualization. E.g. if AVX10.2 has a feature that shouldn't be enumerated to guests for whatever reason, then KVM can't enumerate any "later" features either, because the only way to hide the problematic AVX10.2 feature is to set the version to AVX10.1 or lower[2]. But most AVX features are just passed through and don't have virtualization controls, so AVX10 should not be problematic in practice, so long as Intel honors their promise that future versions will be supersets of past versions. [1] https://cdrdv2.intel.com/v1/dl/getContent/784267 [2] https://lore.kernel.org/all/Zkz5Ak0PQlAN8DxK@google.com/ Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Tao Su <tao1.su@linux.intel.com> Link: https://lore.kernel.org/r/20240819062327.3269720-1-tao1.su@linux.intel.com [sean: minor changelog tweaks] Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-08-22 11:25:25 -07:00
Thorsten Blum	1448d4a935	KVM: x86: Optimize local variable in start_sw_tscdeadline() Change the data type of the local variable this_tsc_khz to u32 because virtual_tsc_khz is also declared as u32. Since do_div() casts the divisor to u32 anyway, changing the data type of this_tsc_khz to u32 also removes the following Coccinelle/coccicheck warning reported by do_div.cocci: WARNING: do_div() does a 64-by-32 division, please consider using div64_ul instead Signed-off-by: Thorsten Blum <thorsten.blum@toblux.com> Link: https://lore.kernel.org/r/20240814203345.2234-2-thorsten.blum@toblux.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-08-22 11:25:24 -07:00
Yan Zhao	aa8d1f48d3	KVM: x86/mmu: Introduce a quirk to control memslot zap behavior Introduce the quirk KVM_X86_QUIRK_SLOT_ZAP_ALL to allow users to select KVM's behavior when a memslot is moved or deleted for KVM_X86_DEFAULT_VM VMs. Make sure KVM behave as if the quirk is always disabled for non-KVM_X86_DEFAULT_VM VMs. The KVM_X86_QUIRK_SLOT_ZAP_ALL quirk offers two behavior options: - when enabled: Invalidate/zap all SPTEs ("zap-all"), - when disabled: Precisely zap only the leaf SPTEs within the range of the moving/deleting memory slot ("zap-slot-leafs-only"). "zap-all" is today's KVM behavior to work around a bug [1] where the changing the zapping behavior of memslot move/deletion would cause VM instability for VMs with an Nvidia GPU assigned; while "zap-slot-leafs-only" allows for more precise zapping of SPTEs within the memory slot range, improving performance in certain scenarios [2], and meeting the functional requirements for TDX. Previous attempts to select "zap-slot-leafs-only" include a per-VM capability approach [3] (which was not preferred because the root cause of the bug remained unidentified) and a per-memslot flag approach [4]. Sean and Paolo finally recommended the implementation of this quirk and explained that it's the least bad option [5]. By default, the quirk is enabled on KVM_X86_DEFAULT_VM VMs to use "zap-all". Users have the option to disable the quirk to select "zap-slot-leafs-only" for specific KVM_X86_DEFAULT_VM VMs that are unaffected by this bug. For non-KVM_X86_DEFAULT_VM VMs, the "zap-slot-leafs-only" behavior is always selected without user's opt-in, regardless of if the user opts for "zap-all". This is because it is assumed until proven otherwise that non- KVM_X86_DEFAULT_VM VMs will not be exposed to the bug [1], and most importantly, it's because TDX must have "zap-slot-leafs-only" always selected. In TDX's case a memslot's GPA range can be a mixture of "private" or "shared" memory. Shared is roughly analogous to how EPT is handled for normal VMs, but private GPAs need lots of special treatment: 1) "zap-all" would require to zap private root page or non-leaf entries or at least leaf-entries beyond the deleting memslot scope. However, TDX demands that the root page of the private page table remains unchanged, with leaf entries being zapped before non-leaf entries, and any dropped private guest pages must be re-accepted by the guest. 2) if "zap-all" zaps only shared page tables, it would result in private pages still being mapped when the memslot is gone. This may affect even other processes if later the gmem fd was whole punched, causing the pages being freed on the host while still mapped in the TD, because there's no pgoff to the gfn information to zap the private page table after memslot is gone. So, simply go "zap-slot-leafs-only" as if the quirk is always disabled for non-KVM_X86_DEFAULT_VM VMs to avoid manual opt-in for every VM type [6] or complicating quirk disabling interface (current quirk disabling interface is limited, no way to query quirks, or force them to be disabled). Add a new function kvm_mmu_zap_memslot_leafs() to implement "zap-slot-leafs-only". This function does not call kvm_unmap_gfn_range(), bypassing special handling to APIC_ACCESS_PAGE_PRIVATE_MEMSLOT, as 1) The APIC_ACCESS_PAGE_PRIVATE_MEMSLOT cannot be created by users, nor can it be moved. It is only deleted by KVM when APICv is permanently inhibited. 2) kvm_vcpu_reload_apic_access_page() effectively does nothing when APIC_ACCESS_PAGE_PRIVATE_MEMSLOT is deleted. 3) Avoid making all cpus request of KVM_REQ_APIC_PAGE_RELOAD can save on costly IPIs. Suggested-by: Kai Huang <kai.huang@intel.com> Suggested-by: Sean Christopherson <seanjc@google.com> Suggested-by: Paolo Bonzini <pbonzini@redhat.com> Link: https://patchwork.kernel.org/project/kvm/patch/20190205210137.1377-11-sean.j.christopherson@intel.com [1] Link: https://patchwork.kernel.org/project/kvm/patch/20190205210137.1377-11-sean.j.christopherson@intel.com/#25054908 [2] Link: https://lore.kernel.org/kvm/20200713190649.GE29725@linux.intel.com/T/#mabc0119583dacf621025e9d873c85f4fbaa66d5c [3] Link: https://lore.kernel.org/all/20240515005952.3410568-3-rick.p.edgecombe@intel.com [4] Link: https://lore.kernel.org/all/7df9032d-83e4-46a1-ab29-6c7973a2ab0b@redhat.com [5] Link: https://lore.kernel.org/all/ZnGa550k46ow2N3L@google.com [6] Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com> Message-ID: <20240703021043.13881-1-yan.y.zhao@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2024-08-14 12:29:11 -04:00
Sean Christopherson	4b7c3f6d04	KVM: x86: Make x2APIC ID 100% readonly Ignore the userspace provided x2APIC ID when fixing up APIC state for KVM_SET_LAPIC, i.e. make the x2APIC fully readonly in KVM. Commit `a92e2543d6` ("KVM: x86: use hardware-compatible format for APIC ID register"), which added the fixup, didn't intend to allow userspace to modify the x2APIC ID. In fact, that commit is when KVM first started treating the x2APIC ID as readonly, apparently to fix some race: static inline u32 kvm_apic_id(struct kvm_lapic apic) { - return (kvm_lapic_get_reg(apic, APIC_ID) >> 24) & 0xff; + / To avoid a race between apic_base and following APIC_ID update when + * switching to x2apic_mode, the x2apic mode returns initial x2apic id. + / + if (apic_x2apic_mode(apic)) + return apic->vcpu->vcpu_id; + + return kvm_lapic_get_reg(apic, APIC_ID) >> 24; } Furthermore, KVM doesn't support delivering interrupts to vCPUs with a modified x2APIC ID, but KVM does* return the modified value on a guest RDMSR and for KVM_GET_LAPIC. I.e. no remotely sane setup can actually work with a modified x2APIC ID. Making the x2APIC ID fully readonly fixes a WARN in KVM's optimized map calculation, which expects the LDR to align with the x2APIC ID. WARNING: CPU: 2 PID: 958 at arch/x86/kvm/lapic.c:331 kvm_recalculate_apic_map+0x609/0xa00 [kvm] CPU: 2 PID: 958 Comm: recalc_apic_map Not tainted 6.4.0-rc3-vanilla+ #35 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Arch Linux 1.16.2-1-1 04/01/2014 RIP: 0010:kvm_recalculate_apic_map+0x609/0xa00 [kvm] Call Trace: <TASK> kvm_apic_set_state+0x1cf/0x5b0 [kvm] kvm_arch_vcpu_ioctl+0x1806/0x2100 [kvm] kvm_vcpu_ioctl+0x663/0x8a0 [kvm] __x64_sys_ioctl+0xb8/0xf0 do_syscall_64+0x56/0x80 entry_SYSCALL_64_after_hwframe+0x46/0xb0 RIP: 0033:0x7fade8b9dd6f Unfortunately, the WARN can still trigger for other CPUs than the current one by racing against KVM_SET_LAPIC, so remove it completely. Reported-by: Michal Luczaj <mhal@rbox.co> Closes: https://lore.kernel.org/all/814baa0c-1eaa-4503-129f-059917365e80@rbox.co Reported-by: Haoyu Wu <haoyuwu254@gmail.com> Closes: https://lore.kernel.org/all/20240126161633.62529-1-haoyuwu254@gmail.com Reported-by: syzbot+545f1326f405db4e1c3e@syzkaller.appspotmail.com Closes: https://lore.kernel.org/all/000000000000c2a6b9061cbca3c3@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> Message-ID: <20240802202941.344889-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2024-08-13 12:01:46 -04:00
Isaku Yamahata	15e1c3d659	KVM: x86: Use this_cpu_ptr() instead of per_cpu_ptr(smp_processor_id()) Use this_cpu_ptr() instead of open coding the equivalent in various user return MSR helpers. Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> Reviewed-by: Chao Gao <chao.gao@intel.com> Reviewed-by: Yuan Yao <yuan.yao@intel.com> [sean: massage changelog] Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Pankaj Gupta <pankaj.gupta@amd.com> Message-ID: <20240802201630.339306-1-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2024-08-13 10:24:37 -04:00
Yue Haibing	b098495e69	KVM: x86: hyper-v: Remove unused inline function kvm_hv_free_pa_page() There is no caller in tree since introduction in commit `b4f69df0f6` ("KVM: x86: Make Hyper-V emulation optional") Signed-off-by: Yue Haibing <yuehaibing@huawei.com> Message-ID: <20240803113233.128185-1-yuehaibing@huawei.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2024-08-13 09:28:48 -04:00
Dan Carpenter	cd2d006065	KVM: SVM: Fix an error code in sev_gmem_post_populate() The copy_from_user() function returns the number of bytes which it was not able to copy. Return -EFAULT instead. Fixes: `dee5a47cc7` ("KVM: SEV: Add KVM_SEV_SNP_LAUNCH_UPDATE command") Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org> Message-ID: <20240612115040.2423290-4-dan.carpenter@linaro.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2024-08-13 06:08:40 -04:00
Dan Carpenter	92b6c2f007	KVM: SVM: Fix uninitialized variable bug If snp_lookup_rmpentry() fails then "assigned" is printed in the error message but it was never initialized. Initialize it to false. Fixes: `dee5a47cc7` ("KVM: SEV: Add KVM_SEV_SNP_LAUNCH_UPDATE command") Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org> Message-ID: <20240612115040.2423290-3-dan.carpenter@linaro.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2024-08-13 06:05:10 -04:00
Al Viro	1da91ea87a	introduce fd_file(), convert all accessors to it. For any changes of struct fd representation we need to turn existing accesses to fields into calls of wrappers. Accesses to struct fd::flags are very few (3 in linux/file.h, 1 in net/socket.c, 3 in fs/overlayfs/file.c and 3 more in explicit initializers). Those can be dealt with in the commit converting to new layout; accesses to struct fd::file are too many for that. This commit converts (almost) all of f.file to fd_file(f). It's not entirely mechanical ('file' is used as a member name more than just in struct fd) and it does not even attempt to distinguish the uses in pointer context from those in boolean context; the latter will be eventually turned into a separate helper (fd_empty()). NOTE: mass conversion to fd_empty(), tempting as it might be, is a bad idea; better do that piecewise in commit that convert from fdget...() to CLASS(...). [conflicts in fs/fhandle.c, kernel/bpf/syscall.c, mm/memcontrol.c caught by git; fs/stat.c one got caught by git grep] [fs/xattr.c conflict] Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2024-08-12 22:00:43 -04:00
Paolo Bonzini	1773014a97	* fix latent bug in how usage of large pages is determined for confidential VMs * fix "underline too short" in docs * eliminate log spam from limited APIC timer periods * disallow pre-faulting of memory before SEV-SNP VMs are initialized * delay clearing and encrypting private memory until it is added to guest page tables * this change also enables another small cleanup: the checks in SNP_LAUNCH_UPDATE that limit it to non-populated, private pages can now be moved in the common kvm_gmem_populate() function -----BEGIN PGP SIGNATURE----- iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmar0uEUHHBib256aW5p QHJlZGhhdC5jb20ACgkQv/vSX3jHroMf9Af9EZ0k0HHltM+iUSqKW+hcfnyjRSlh MI2m8ZFF4Ra4a/H2CYWbUZSZd6U2TGQoy0cz8vN12uiaaRFSXHAzkoy1zhJGYujq ljCUx46Ovo6DDfA1ve9jPdHQNOKWy6Js8yheP+i58Pau1u9fWTewfvWnrwkMgnfD lkrSfnWhw7aBy7jTSd8KflRU/IugP2/ApsIhrjZZ9sFGncAwPBbb8NL/u5tI/l6f VDp1in5a5gk2PhVRVzvINUxNzhcyuQ0wC07N+B4H+3U0NLg4CwiTBJr/yz0OOWz6 ThA20/fLTrs5jc2f5APk1EjGT8pqeMJYydI2FdqafSfY0PcTZJtXvzgdSw== =CwzF -----END PGP SIGNATURE----- Merge branch 'kvm-fixes' into HEAD * fix latent bug in how usage of large pages is determined for confidential VMs * fix "underline too short" in docs * eliminate log spam from limited APIC timer periods * disallow pre-faulting of memory before SEV-SNP VMs are initialized * delay clearing and encrypting private memory until it is added to guest page tables * this change also enables another small cleanup: the checks in SNP_LAUNCH_UPDATE that limit it to non-populated, private pages can now be moved in the common kvm_gmem_populate() function	2024-08-02 12:33:43 -04:00
Ackerley Tng	aca0ec970d	KVM: x86/mmu: fix determination of max NPT mapping level for private pages The `if (req_max_level)` test was meant ignore req_max_level if PG_LEVEL_NONE was returned. Hence, this function should return max_level instead of the ignored req_max_level. This is only a latent issue for now, since guest_memfd does not support large pages. Signed-off-by: Ackerley Tng <ackerleytng@google.com> Message-ID: <20240801173955.1975034-1-ackerleytng@google.com> Fixes: `f32fb32820` ("KVM: x86: Add hook for determining max NPT mapping level") Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2024-08-01 14:13:11 -04:00
Paolo Bonzini	e4ee544792	KVM: guest_memfd: let kvm_gmem_populate() operate only on private gfns This check is currently performed by sev_gmem_post_populate(), but it applies to all callers of kvm_gmem_populate(): the point of the function is that the memory is being encrypted and some work has to be done on all the gfns in order to encrypt them. Therefore, check the KVM_MEMORY_ATTRIBUTE_PRIVATE attribute prior to invoking the callback, and stop the operation if a shared page is encountered. Because CONFIG_KVM_PRIVATE_MEM in principle does not require attributes, this makes kvm_gmem_populate() depend on CONFIG_KVM_GENERIC_PRIVATE_MEM (which does require them). Reviewed-by: Michael Roth <michael.roth@amd.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2024-07-26 14:46:15 -04:00
Paolo Bonzini	4b5f67120a	KVM: extend kvm_range_has_memory_attributes() to check subset of attributes While currently there is no other attribute than KVM_MEMORY_ATTRIBUTE_PRIVATE, KVM code such as kvm_mem_is_private() is written to expect their existence. Allow using kvm_range_has_memory_attributes() as a multi-page version of kvm_mem_is_private(), without it breaking later when more attributes are introduced. Reviewed-by: Michael Roth <michael.roth@amd.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2024-07-26 14:46:15 -04:00
Paolo Bonzini	de80252414	KVM: guest_memfd: move check for already-populated page to common code Do not allow populating the same page twice with startup data. In the case of SEV-SNP, for example, the firmware does not allow it anyway, since the launch-update operation is only possible on pages that are still shared in the RMP. Even if it worked, kvm_gmem_populate()'s callback is meant to have side effects such as updating launch measurements, and updating the same page twice is unlikely to have the desired results. Races between calls to the ioctl are not possible because kvm_gmem_populate() holds slots_lock and the VM should not be running. But again, even if this worked on other confidential computing technology, it doesn't matter to guest_memfd.c whether this is something fishy such as missing synchronization in userspace, or rather something intentional. One of the racers wins, and the page is initialized by either kvm_gmem_prepare_folio() or kvm_gmem_populate(). Anyway, out of paranoia, adjust sev_gmem_post_populate() anyway to use the same errno that kvm_gmem_populate() is using. Reviewed-by: Michael Roth <michael.roth@amd.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2024-07-26 14:46:14 -04:00
Paolo Bonzini	7239ed7467	KVM: remove kvm_arch_gmem_prepare_needed() It is enough to return 0 if a guest need not do any preparation. This is in fact how sev_gmem_prepare() works for non-SNP guests, and it extends naturally to Intel hosts: the x86 callback for gmem_prepare is optional and returns 0 if not defined. Reviewed-by: Michael Roth <michael.roth@amd.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2024-07-26 14:46:14 -04:00
Paolo Bonzini	564429a6bd	KVM: rename CONFIG_HAVE_KVM_GMEM_* to CONFIG_HAVE_KVM_ARCH_GMEM_* Add "ARCH" to the symbols; shortly, the "prepare" phase will include both the arch-independent step to clear out contents left in the page by the host, and the arch-dependent step enabled by CONFIG_HAVE_KVM_GMEM_PREPARE. For consistency do the same for CONFIG_HAVE_KVM_GMEM_INVALIDATE as well. Reviewed-by: Michael Roth <michael.roth@amd.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2024-07-26 14:46:14 -04:00
Paolo Bonzini	5932ca411e	KVM: x86: disallow pre-fault for SNP VMs before initialization KVM_PRE_FAULT_MEMORY for an SNP guest can race with sev_gmem_post_populate() in bad ways. The following sequence for instance can potentially trigger an RMP fault: thread A, sev_gmem_post_populate: called thread B, sev_gmem_prepare: places below 'pfn' in a private state in RMP thread A, sev_gmem_post_populate: vaddr = kmap_local_pfn(pfn + i); thread A, sev_gmem_post_populate: copy_from_user(vaddr, src + i PAGE_SIZE, PAGE_SIZE); RMP #PF Fix this by only allowing KVM_PRE_FAULT_MEMORY to run after a guest's initial private memory contents have been finalized via KVM_SEV_SNP_LAUNCH_FINISH. Beyond fixing this issue, it just sort of makes sense to enforce this, since the KVM_PRE_FAULT_MEMORY documentation states: "KVM maps memory as if the vCPU generated a stage-2 read page fault" which sort of implies we should be acting on the same guest state that a vCPU would see post-launch after the initial guest memory is all set up. Co-developed-by: Michael Roth <michael.roth@amd.com> Signed-off-by: Michael Roth <michael.roth@amd.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2024-07-26 14:46:14 -04:00
Jim Mattson	0005ca2076	KVM: x86: Eliminate log spam from limited APIC timer periods SAP's vSMP MemoryONE continuously requests a local APIC timer period less than 500 us, resulting in the following kernel log spam: kvm: vcpu 15: requested 70240 ns lapic timer period limited to 500000 ns kvm: vcpu 19: requested 52848 ns lapic timer period limited to 500000 ns kvm: vcpu 15: requested 70256 ns lapic timer period limited to 500000 ns kvm: vcpu 9: requested 70256 ns lapic timer period limited to 500000 ns kvm: vcpu 9: requested 70208 ns lapic timer period limited to 500000 ns kvm: vcpu 9: requested 387520 ns lapic timer period limited to 500000 ns kvm: vcpu 9: requested 70160 ns lapic timer period limited to 500000 ns kvm: vcpu 66: requested 205744 ns lapic timer period limited to 500000 ns kvm: vcpu 9: requested 70224 ns lapic timer period limited to 500000 ns kvm: vcpu 9: requested 70256 ns lapic timer period limited to 500000 ns limit_periodic_timer_frequency: 7569 callbacks suppressed ... To eliminate this spam, change the pr_info_ratelimited() in limit_periodic_timer_frequency() to pr_info_once(). Reported-by: James Houghton <jthoughton@google.com> Signed-off-by: Jim Mattson <jmattson@google.com> Message-ID: <20240724190640.2449291-1-jmattson@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2024-07-26 13:33:25 -04:00
Linus Torvalds	2c9b351240	ARM: * Initial infrastructure for shadow stage-2 MMUs, as part of nested virtualization enablement * Support for userspace changes to the guest CTR_EL0 value, enabling (in part) migration of VMs between heterogenous hardware * Fixes + improvements to pKVM's FF-A proxy, adding support for v1.1 of the protocol * FPSIMD/SVE support for nested, including merged trap configuration and exception routing * New command-line parameter to control the WFx trap behavior under KVM * Introduce kCFI hardening in the EL2 hypervisor * Fixes + cleanups for handling presence/absence of FEAT_TCRX * Miscellaneous fixes + documentation updates LoongArch: * Add paravirt steal time support. * Add support for KVM_DIRTY_LOG_INITIALLY_SET. * Add perf kvm-stat support for loongarch. RISC-V: * Redirect AMO load/store access fault traps to guest * perf kvm stat support * Use guest files for IMSIC virtualization, when available ONE_REG support for the Zimop, Zcmop, Zca, Zcf, Zcd, Zcb and Zawrs ISA extensions is coming through the RISC-V tree. s390: * Assortment of tiny fixes which are not time critical x86: * Fixes for Xen emulation. * Add a global struct to consolidate tracking of host values, e.g. EFER * Add KVM_CAP_X86_APIC_BUS_CYCLES_NS to allow configuring the effective APIC bus frequency, because TDX. * Print the name of the APICv/AVIC inhibits in the relevant tracepoint. * Clean up KVM's handling of vendor specific emulation to consistently act on "compatible with Intel/AMD", versus checking for a specific vendor. * Drop MTRR virtualization, and instead always honor guest PAT on CPUs that support self-snoop. * Update to the newfangled Intel CPU FMS infrastructure. * Don't advertise IA32_PERF_GLOBAL_OVF_CTRL as an MSR-to-be-saved, as it reads '0' and writes from userspace are ignored. * Misc cleanups x86 - MMU: * Small cleanups, renames and refactoring extracted from the upcoming Intel TDX support. * Don't allocate kvm_mmu_page.shadowed_translation for shadow pages that can't hold leafs SPTEs. * Unconditionally drop mmu_lock when allocating TDP MMU page tables for eager page splitting, to avoid stalling vCPUs when splitting huge pages. * Bug the VM instead of simply warning if KVM tries to split a SPTE that is non-present or not-huge. KVM is guaranteed to end up in a broken state because the callers fully expect a valid SPTE, it's all but dangerous to let more MMU changes happen afterwards. x86 - AMD: * Make per-CPU save_area allocations NUMA-aware. * Force sev_es_host_save_area() to be inlined to avoid calling into an instrumentable function from noinstr code. * Base support for running SEV-SNP guests. API-wise, this includes a new KVM_X86_SNP_VM type, encrypting/measure the initial image into guest memory, and finalizing it before launching it. Internally, there are some gmem/mmu hooks needed to prepare gmem-allocated pages before mapping them into guest private memory ranges. This includes basic support for attestation guest requests, enough to say that KVM supports the GHCB 2.0 specification. There is no support yet for loading into the firmware those signing keys to be used for attestation requests, and therefore no need yet for the host to provide certificate data for those keys. To support fetching certificate data from userspace, a new KVM exit type will be needed to handle fetching the certificate from userspace. An attempt to define a new KVM_EXIT_COCO/KVM_EXIT_COCO_REQ_CERTS exit type to handle this was introduced in v1 of this patchset, but is still being discussed by community, so for now this patchset only implements a stub version of SNP Extended Guest Requests that does not provide certificate data. x86 - Intel: * Remove an unnecessary EPT TLB flush when enabling hardware. * Fix a series of bugs that cause KVM to fail to detect nested pending posted interrupts as valid wake eents for a vCPU executing HLT in L2 (with HLT-exiting disable by L1). * KVM: x86: Suppress MMIO that is triggered during task switch emulation Explicitly suppress userspace emulated MMIO exits that are triggered when emulating a task switch as KVM doesn't support userspace MMIO during complex (multi-step) emulation. Silently ignoring the exit request can result in the WARN_ON_ONCE(vcpu->mmio_needed) firing if KVM exits to userspace for some other reason prior to purging mmio_needed. See commit `0dc902267c` ("KVM: x86: Suppress pending MMIO write exits if emulator detects exception") for more details on KVM's limitations with respect to emulated MMIO during complex emulator flows. Generic: * Rename the AS_UNMOVABLE flag that was introduced for KVM to AS_INACCESSIBLE, because the special casing needed by these pages is not due to just unmovability (and in fact they are only unmovable because the CPU cannot access them). * New ioctl to populate the KVM page tables in advance, which is useful to mitigate KVM page faults during guest boot or after live migration. The code will also be used by TDX, but (probably) not through the ioctl. * Enable halt poll shrinking by default, as Intel found it to be a clear win. * Setup empty IRQ routing when creating a VM to avoid having to synchronize SRCU when creating a split IRQCHIP on x86. * Rework the sched_in/out() paths to replace kvm_arch_sched_in() with a flag that arch code can use for hooking both sched_in() and sched_out(). * Take the vCPU @id as an "unsigned long" instead of "u32" to avoid truncating a bogus value from userspace, e.g. to help userspace detect bugs. * Mark a vCPU as preempted if and only if it's scheduled out while in the KVM_RUN loop, e.g. to avoid marking it preempted and thus writing guest memory when retrieving guest state during live migration blackout. Selftests: * Remove dead code in the memslot modification stress test. * Treat "branch instructions retired" as supported on all AMD Family 17h+ CPUs. * Print the guest pseudo-RNG seed only when it changes, to avoid spamming the log for tests that create lots of VMs. * Make the PMU counters test less flaky when counting LLC cache misses by doing CLFLUSH{OPT} in every loop iteration. -----BEGIN PGP SIGNATURE----- iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmaZQB0UHHBib256aW5p QHJlZGhhdC5jb20ACgkQv/vSX3jHroNkZwf/bv2jiENaLFNGPe/VqTKMQ6PHQLMG +sNHx6fJPP35gTM8Jqf0/7/ummZXcSuC1mWrzYbecZm7Oeg3vwNXHZ4LquwwX6Dv 8dKcUzLbWDAC4WA3SKhi8C8RV2v6E7ohy69NtAJmFWTc7H95dtIQm6cduV2osTC3 OEuHe1i8d9umk6couL9Qhm8hk3i9v2KgCsrfyNrQgLtS3hu7q6yOTR8nT0iH6sJR KE5A8prBQgLmF34CuvYDw4Hu6E4j+0QmIqodovg2884W1gZQ9LmcVqYPaRZGsG8S iDdbkualLKwiR1TpRr3HJGKWSFdc7RblbsnHRvHIZgFsMQiimh4HrBSCyQ== =zepX -----END PGP SIGNATURE----- Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm Pull kvm updates from Paolo Bonzini: "ARM: - Initial infrastructure for shadow stage-2 MMUs, as part of nested virtualization enablement - Support for userspace changes to the guest CTR_EL0 value, enabling (in part) migration of VMs between heterogenous hardware - Fixes + improvements to pKVM's FF-A proxy, adding support for v1.1 of the protocol - FPSIMD/SVE support for nested, including merged trap configuration and exception routing - New command-line parameter to control the WFx trap behavior under KVM - Introduce kCFI hardening in the EL2 hypervisor - Fixes + cleanups for handling presence/absence of FEAT_TCRX - Miscellaneous fixes + documentation updates LoongArch: - Add paravirt steal time support - Add support for KVM_DIRTY_LOG_INITIALLY_SET - Add perf kvm-stat support for loongarch RISC-V: - Redirect AMO load/store access fault traps to guest - perf kvm stat support - Use guest files for IMSIC virtualization, when available s390: - Assortment of tiny fixes which are not time critical x86: - Fixes for Xen emulation - Add a global struct to consolidate tracking of host values, e.g. EFER - Add KVM_CAP_X86_APIC_BUS_CYCLES_NS to allow configuring the effective APIC bus frequency, because TDX - Print the name of the APICv/AVIC inhibits in the relevant tracepoint - Clean up KVM's handling of vendor specific emulation to consistently act on "compatible with Intel/AMD", versus checking for a specific vendor - Drop MTRR virtualization, and instead always honor guest PAT on CPUs that support self-snoop - Update to the newfangled Intel CPU FMS infrastructure - Don't advertise IA32_PERF_GLOBAL_OVF_CTRL as an MSR-to-be-saved, as it reads '0' and writes from userspace are ignored - Misc cleanups x86 - MMU: - Small cleanups, renames and refactoring extracted from the upcoming Intel TDX support - Don't allocate kvm_mmu_page.shadowed_translation for shadow pages that can't hold leafs SPTEs - Unconditionally drop mmu_lock when allocating TDP MMU page tables for eager page splitting, to avoid stalling vCPUs when splitting huge pages - Bug the VM instead of simply warning if KVM tries to split a SPTE that is non-present or not-huge. KVM is guaranteed to end up in a broken state because the callers fully expect a valid SPTE, it's all but dangerous to let more MMU changes happen afterwards x86 - AMD: - Make per-CPU save_area allocations NUMA-aware - Force sev_es_host_save_area() to be inlined to avoid calling into an instrumentable function from noinstr code - Base support for running SEV-SNP guests. API-wise, this includes a new KVM_X86_SNP_VM type, encrypting/measure the initial image into guest memory, and finalizing it before launching it. Internally, there are some gmem/mmu hooks needed to prepare gmem-allocated pages before mapping them into guest private memory ranges This includes basic support for attestation guest requests, enough to say that KVM supports the GHCB 2.0 specification There is no support yet for loading into the firmware those signing keys to be used for attestation requests, and therefore no need yet for the host to provide certificate data for those keys. To support fetching certificate data from userspace, a new KVM exit type will be needed to handle fetching the certificate from userspace. An attempt to define a new KVM_EXIT_COCO / KVM_EXIT_COCO_REQ_CERTS exit type to handle this was introduced in v1 of this patchset, but is still being discussed by community, so for now this patchset only implements a stub version of SNP Extended Guest Requests that does not provide certificate data x86 - Intel: - Remove an unnecessary EPT TLB flush when enabling hardware - Fix a series of bugs that cause KVM to fail to detect nested pending posted interrupts as valid wake eents for a vCPU executing HLT in L2 (with HLT-exiting disable by L1) - KVM: x86: Suppress MMIO that is triggered during task switch emulation Explicitly suppress userspace emulated MMIO exits that are triggered when emulating a task switch as KVM doesn't support userspace MMIO during complex (multi-step) emulation Silently ignoring the exit request can result in the WARN_ON_ONCE(vcpu->mmio_needed) firing if KVM exits to userspace for some other reason prior to purging mmio_needed See commit `0dc902267c` ("KVM: x86: Suppress pending MMIO write exits if emulator detects exception") for more details on KVM's limitations with respect to emulated MMIO during complex emulator flows Generic: - Rename the AS_UNMOVABLE flag that was introduced for KVM to AS_INACCESSIBLE, because the special casing needed by these pages is not due to just unmovability (and in fact they are only unmovable because the CPU cannot access them) - New ioctl to populate the KVM page tables in advance, which is useful to mitigate KVM page faults during guest boot or after live migration. The code will also be used by TDX, but (probably) not through the ioctl - Enable halt poll shrinking by default, as Intel found it to be a clear win - Setup empty IRQ routing when creating a VM to avoid having to synchronize SRCU when creating a split IRQCHIP on x86 - Rework the sched_in/out() paths to replace kvm_arch_sched_in() with a flag that arch code can use for hooking both sched_in() and sched_out() - Take the vCPU @id as an "unsigned long" instead of "u32" to avoid truncating a bogus value from userspace, e.g. to help userspace detect bugs - Mark a vCPU as preempted if and only if it's scheduled out while in the KVM_RUN loop, e.g. to avoid marking it preempted and thus writing guest memory when retrieving guest state during live migration blackout Selftests: - Remove dead code in the memslot modification stress test - Treat "branch instructions retired" as supported on all AMD Family 17h+ CPUs - Print the guest pseudo-RNG seed only when it changes, to avoid spamming the log for tests that create lots of VMs - Make the PMU counters test less flaky when counting LLC cache misses by doing CLFLUSH{OPT} in every loop iteration" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (227 commits) crypto: ccp: Add the SNP_VLEK_LOAD command KVM: x86/pmu: Add kvm_pmu_call() to simplify static calls of kvm_pmu_ops KVM: x86: Introduce kvm_x86_call() to simplify static calls of kvm_x86_ops KVM: x86: Replace static_call_cond() with static_call() KVM: SEV: Provide support for SNP_EXTENDED_GUEST_REQUEST NAE event x86/sev: Move sev_guest.h into common SEV header KVM: SEV: Provide support for SNP_GUEST_REQUEST NAE event KVM: x86: Suppress MMIO that is triggered during task switch emulation KVM: x86/mmu: Clean up make_huge_page_split_spte() definition and intro KVM: x86/mmu: Bug the VM if KVM tries to split a !hugepage SPTE KVM: selftests: x86: Add test for KVM_PRE_FAULT_MEMORY KVM: x86: Implement kvm_arch_vcpu_pre_fault_memory() KVM: x86/mmu: Make kvm_mmu_do_page_fault() return mapped level KVM: x86/mmu: Account pf_{fixed,emulate,spurious} in callers of "do page fault" KVM: x86/mmu: Bump pf_taken stat only in the "real" page fault handler KVM: Add KVM_PRE_FAULT_MEMORY vcpu ioctl to pre-populate guest memory KVM: Document KVM_PRE_FAULT_MEMORY ioctl mm, virt: merge AS_UNMOVABLE and AS_INACCESSIBLE perf kvm: Add kvm-stat for loongarch64 LoongArch: KVM: Add PV steal time support in guest side ...	2024-07-20 12:41:03 -07:00
Wei Wang	5d766508fd	KVM: x86/pmu: Add kvm_pmu_call() to simplify static calls of kvm_pmu_ops Similar to kvm_x86_call(), kvm_pmu_call() is added to streamline the usage of static calls of kvm_pmu_ops, which improves code readability. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Wei Wang <wei.w.wang@intel.com> Link: https://lore.kernel.org/r/20240507133103.15052-4-wei.w.wang@intel.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2024-07-16 12:14:12 -04:00
Wei Wang	896046474f	KVM: x86: Introduce kvm_x86_call() to simplify static calls of kvm_x86_ops Introduces kvm_x86_call(), to streamline the usage of static calls of kvm_x86_ops. The current implementation of these calls is verbose and could lead to alignment challenges. This makes the code susceptible to exceeding the "80 columns per single line of code" limit as defined in the coding-style document. Another issue with the existing implementation is that the addition of kvm_x86_ prefix to hooks at the static_call sites hinders code readability and navigation. kvm_x86_call() is added to improve code readability and maintainability, while adhering to the coding style guidelines. Signed-off-by: Wei Wang <wei.w.wang@intel.com> Link: https://lore.kernel.org/r/20240507133103.15052-3-wei.w.wang@intel.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2024-07-16 12:14:12 -04:00
Wei Wang	f4854bf741	KVM: x86: Replace static_call_cond() with static_call() The use of static_call_cond() is essentially the same as static_call() on x86 (e.g. static_call() now handles a NULL pointer as a NOP), so replace it with static_call() to simplify the code. Link: https://lore.kernel.org/all/3916caa1dcd114301a49beafa5030eca396745c1.1679456900.git.jpoimboe@kernel.org/ Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Wei Wang <wei.w.wang@intel.com> Link: https://lore.kernel.org/r/20240507133103.15052-2-wei.w.wang@intel.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2024-07-16 12:14:11 -04:00
Paolo Bonzini	bc9cd5a219	Merge branch 'kvm-6.11-sev-attestation' into HEAD The GHCB 2.0 specification defines 2 GHCB request types to allow SNP guests to send encrypted messages/requests to firmware: SNP Guest Requests and SNP Extended Guest Requests. These encrypted messages are used for things like servicing attestation requests issued by the guest. Implementing support for these is required to be fully GHCB-compliant. For the most part, KVM only needs to handle forwarding these requests to firmware (to be issued via the SNP_GUEST_REQUEST firmware command defined in the SEV-SNP Firmware ABI), and then forwarding the encrypted response to the guest. However, in the case of SNP Extended Guest Requests, the host is also able to provide the certificate data corresponding to the endorsement key used by firmware to sign attestation report requests. This certificate data is provided by userspace because: 1) It allows for different keys/key types to be used for each particular guest with requiring any sort of KVM API to configure the certificate table in advance on a per-guest basis. 2) It provides additional flexibility with how attestation requests might be handled during live migration where the certificate data for source/dest might be different. 3) It allows all synchronization between certificates and firmware/signing key updates to be handled purely by userspace rather than requiring some in-kernel mechanism to facilitate it. [1] To support fetching certificate data from userspace, a new KVM exit type will be needed to handle fetching the certificate from userspace. An attempt to define a new KVM_EXIT_COCO/KVM_EXIT_COCO_REQ_CERTS exit type to handle this was introduced in v1 of this patchset, but is still being discussed by community, so for now this patchset only implements a stub version of SNP Extended Guest Requests that does not provide certificate data, but is still enough to provide compliance with the GHCB 2.0 spec.	2024-07-16 11:44:23 -04:00
Michael Roth	74458e4859	KVM: SEV: Provide support for SNP_EXTENDED_GUEST_REQUEST NAE event Version 2 of GHCB specification added support for the SNP Extended Guest Request Message NAE event. This event serves a nearly identical purpose to the previously-added SNP_GUEST_REQUEST event, but for certain message types it allows the guest to supply a buffer to be used for additional information in some cases. Currently the GHCB spec only defines extended handling of this sort in the case of attestation requests, where the additional buffer is used to supply a table of certificate data corresponding to the attestion report's signing key. Support for this extended handling will require additional KVM APIs to handle coordinating with userspace. Whether or not the hypervisor opts to provide this certificate data is optional. However, support for processing SNP_EXTENDED_GUEST_REQUEST GHCB requests is required by the GHCB 2.0 specification for SNP guests, so for now implement a stub implementation that provides an empty certificate table to the guest if it supplies an additional buffer, but otherwise behaves identically to SNP_GUEST_REQUEST. Reviewed-by: Carlos Bilbao <carlos.bilbao.osdev@gmail.com> Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com> Reviewed-by: Liam Merwick <liam.merwick@oracle.com> Signed-off-by: Michael Roth <michael.roth@amd.com> Message-ID: <20240701223148.3798365-4-michael.roth@amd.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2024-07-16 11:44:00 -04:00
Brijesh Singh	88caf544c9	KVM: SEV: Provide support for SNP_GUEST_REQUEST NAE event Version 2 of GHCB specification added support for the SNP Guest Request Message NAE event. The event allows for an SEV-SNP guest to make requests to the SEV-SNP firmware through the hypervisor using the SNP_GUEST_REQUEST API defined in the SEV-SNP firmware specification. This is used by guests primarily to request attestation reports from firmware. There are other request types are available as well, but the specifics of what guest requests are being made generally does not affect how they are handled by the hypervisor, which only serves as a proxy for the guest requests and firmware responses. Implement handling for these events. When an SNP Guest Request is issued, the guest will provide its own request/response pages, which could in theory be passed along directly to firmware. However, these pages would need special care: - Both pages are from shared guest memory, so they need to be protected from migration/etc. occurring while firmware reads/writes to them. At a minimum, this requires elevating the ref counts and potentially needing an explicit pinning of the memory. This places additional restrictions on what type of memory backends userspace can use for shared guest memory since there would be some reliance on using refcounted pages. - The response page needs to be switched to Firmware-owned state before the firmware can write to it, which can lead to potential host RMP #PFs if the guest is misbehaved and hands the host a guest page that KVM is writing to for other reasons (e.g. virtio buffers). Both of these issues can be avoided completely by using separately-allocated bounce pages for both the request/response pages and passing those to firmware instead. So that's the approach taken here. Signed-off-by: Brijesh Singh <brijesh.singh@amd.com> Co-developed-by: Alexey Kardashevskiy <aik@amd.com> Signed-off-by: Alexey Kardashevskiy <aik@amd.com> Co-developed-by: Ashish Kalra <ashish.kalra@amd.com> Signed-off-by: Ashish Kalra <ashish.kalra@amd.com> Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com> Reviewed-by: Liam Merwick <liam.merwick@oracle.com> [mdr: ensure FW command failures are indicated to guest, drop extended request handling to be re-written as separate patch, massage commit] Signed-off-by: Michael Roth <michael.roth@amd.com> Message-ID: <20240701223148.3798365-2-michael.roth@amd.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2024-07-16 11:44:00 -04:00
Sean Christopherson	2a1fc7dc36	KVM: x86: Suppress MMIO that is triggered during task switch emulation Explicitly suppress userspace emulated MMIO exits that are triggered when emulating a task switch as KVM doesn't support userspace MMIO during complex (multi-step) emulation. Silently ignoring the exit request can result in the WARN_ON_ONCE(vcpu->mmio_needed) firing if KVM exits to userspace for some other reason prior to purging mmio_needed. See commit `0dc902267c` ("KVM: x86: Suppress pending MMIO write exits if emulator detects exception") for more details on KVM's limitations with respect to emulated MMIO during complex emulator flows. Reported-by: syzbot+2fb9f8ed752c01bc9a3f@syzkaller.appspotmail.com Signed-off-by: Sean Christopherson <seanjc@google.com> Message-ID: <20240712144841.1230591-1-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2024-07-16 09:57:45 -04:00
Sean Christopherson	9fe17d2ada	KVM: x86/mmu: Clean up make_huge_page_split_spte() definition and intro Tweak the definition of make_huge_page_split_spte() to eliminate an unnecessarily long line, and opportunistically initialize child_spte to make it more obvious that the child is directly derived from the huge parent. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-ID: <20240712151335.1242633-3-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2024-07-16 09:56:56 -04:00
Sean Christopherson	3d4415ed75	KVM: x86/mmu: Bug the VM if KVM tries to split a !hugepage SPTE Bug the VM instead of simply warning if KVM tries to split a SPTE that is non-present or not-huge. KVM is guaranteed to end up in a broken state as the callers fully expect a valid SPTE, e.g. the shadow MMU will add an rmap entry, and all MMUs will account the expected small page. Returning '0' is also technically wrong now that SHADOW_NONPRESENT_VALUE exists, i.e. would cause KVM to create a potential #VE SPTE. While it would be possible to have the callers gracefully handle failure, doing so would provide no practical value as the scenario really should be impossible, while the error handling would add a non-trivial amount of noise. Fixes: `a3fe5dbda0` ("KVM: x86/mmu: Split huge pages mapped by the TDP MMU when dirty logging is enabled") Cc: David Matlack <dmatlack@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-ID: <20240712151335.1242633-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2024-07-16 09:56:56 -04:00
Paolo Bonzini	208a352a54	KVM VMX changes for 6.11 - Remove an unnecessary EPT TLB flush when enabling hardware. - Fix a series of bugs that cause KVM to fail to detect nested pending posted interrupts as valid wake eents for a vCPU executing HLT in L2 (with HLT-exiting disable by L1). - Misc cleanups -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEKTobbabEP7vbhhN9OlYIJqCjN/0FAmaRvX0ACgkQOlYIJqCj N/2Aiw/9Htwy4MfJ2zdTX0ypZx6CUAVY0B7R2q9LVaqlBBL02dLoNWn9ndf7J2pd TJKtp39sHzf342ghti/Za5+mZgRgXA9IjQ5cvcQQjfmjDdDODygEc12otISeSNqq uL2jbUZzzjbcQyUrXkeFptVcNFpaiOG0dFfvnoi1csWzXVf7t+CD+8/3kjVm2Qt7 vQXkV4yN7tNiYOvaukfXP7Og9ALpF8g8ok3YmXVXDPMu7+R7G+P6j3mVWr9ABMPj LOmC+5Z/sscMFw1Io3XHuWoF5socQARXEzJNLCblDaw3GMlSj4LNxif2M/6B7bmR nQVtiegj9K1Fc3OGOqPJcAIRPI4O9nMmf7uOwvXmOlwDSk7rCxF/yPk7Cto2+UXm 6mnLcH1l0/VaidW+a7rUAcDGIlWwgfw0F6tp2j6FdVl2Lx/IThcrkn0teLY1gAW8 CMi/BfTBEXO5583O3+ZCAzVQzeKnWR3yqwJe0oSftB1/rPkPD8PQ39MH8LuJJJxi CN1W4R1/taQdOxMZqggDvS1biz7gwpjNGtnWsO9szAgMEXVjf2M1HOZVcT2e2997 81xDMdZaJSfd26tm7PhWtQnVPqyMZ6vqqIiq7FlIbEEkAE75Kbg4fUn/4y4WRnh9 3Gog6MZPu/MA5TbwvcZ/sy/CRfFu0HKm5q98oArhjSyU8C7oGeQ= =W1/6 -----END PGP SIGNATURE----- Merge tag 'kvm-x86-vmx-6.11' of https://github.com/kvm-x86/linux into HEAD KVM VMX changes for 6.11 - Remove an unnecessary EPT TLB flush when enabling hardware. - Fix a series of bugs that cause KVM to fail to detect nested pending posted interrupts as valid wake eents for a vCPU executing HLT in L2 (with HLT-exiting disable by L1). - Misc cleanups	2024-07-16 09:56:41 -04:00
Paolo Bonzini	1229cbefa6	KVM SVM changes for 6.11 - Make per-CPU save_area allocations NUMA-aware. - Force sev_es_host_save_area() to be inlined to avoid calling into an instrumentable function from noinstr code. -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEKTobbabEP7vbhhN9OlYIJqCjN/0FAmaRvOsACgkQOlYIJqCj N/29pQ/9FTbHZa5RFqBPWY2Q8TKkCsguzJVMM3xxqq4lIFCwoWhXp1GBjkuT/L8g oHbRqHgl+cFKJ3xHUF7idyXtRQPRKJ/Dz+zz6RgAEIoeij+pxmjrqhm+kDAbfBkh L2Oz83B+PTdof3h0lD1nqtf449O/2istn5acasn7JrXefv+AI/VvqnaL7iMpb8zg K0Pscwgqtzl2okiQP3jAQfK9DbLuoQ1yHPRPIajijxJr7zGjsacg4Iju849zgbSo F569gvqm3ILD9oTzrBKy+Xb7GLtRCIXjuaI88TKwiqhJ6huc+lhQR3z3ogdpK5Qa nAj+c2/qcWbLXVe0/0Owks8Htm4wRDOhO7AMQ2Nk8Cg98VT09V7AMGoW36civnP2 7X2m4dyigDF504r4YJWHfyJ9sifcXkxocuPKBWOTfgK3kzu6SVVTpbFg8RnLgISt RqOR1uuh6dipDIUV/QoRnBkGATuhVOfCkt0V1ymonFpwwqTWZnW40QjvFmNGjE7M Z4sCtnMFPal78w1s5vQtJ7WKgbRs49GSEg4ib/+qaS8xVZKoT/cTMn9hBX93PtOw 6HG+o8P8zq+JvtGegxHwKzUr/4mgo2B5wDGizT+2HZdpVM3pF0ILQIrJ+fgGL+lu SBdEbyiNf1LFx51y1qYDu4ZOiRs2NFg3FRzdKdC9wldCnmb9V7Q= =RxHs -----END PGP SIGNATURE----- Merge tag 'kvm-x86-svm-6.11' of https://github.com/kvm-x86/linux into HEAD KVM SVM changes for 6.11 - Make per-CPU save_area allocations NUMA-aware. - Force sev_es_host_save_area() to be inlined to avoid calling into an instrumentable function from noinstr code.	2024-07-16 09:55:39 -04:00
Paolo Bonzini	cda231cd42	KVM x86/pmu changes for 6.11 - Don't advertise IA32_PERF_GLOBAL_OVF_CTRL as an MSR-to-be-saved, as it reads '0' and writes from userspace are ignored. - Update to the newfangled Intel CPU FMS infrastructure. - Use macros instead of open-coded literals to clean up KVM's manipulation of FIXED_CTR_CTRL MSRs. -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEKTobbabEP7vbhhN9OlYIJqCjN/0FAmaRu3oACgkQOlYIJqCj N/0f/Q/+PoRFKrr9ENwlVjxmq7DBJOzrEiht5EH89bQdpYL0pcmv6I+n+Z77o08X l49YFO2zVq26dMCe8EFDuQrZpqKjOS/qEc+/zTsLu4lx8NJD1gqYJLJryejgtdQI +GefPVIN11TvlDDjuuxSWgKUCAevk8s3PRe+zbUwlsHmw+GVky8dJoe71QbW27rK hL7Y2pOe5Y8MgRAadxlhm6QmgOnz3RKKYs9t/HMzi2gQP1TuvPxnYtMC3Gz5pVe+ w3Ak7M4fh8Z7FbQsoNY5h3IdigG6eFrssqHX4QpCXr/G5L9vAgUmSR93/M8jLjNv wAkUulLx7vFeTlOXjqcEJSn0U6mX/48pt68vrPB5ES1Rx28RB5s9tzYXCGtCmSxv nHmMDc3YUbg6tp2hvliMqjsN0j5l2GQiX7LJwH2Ma9qQFlTHPmFwJGS4hciki6c5 obCK2vXBoS1jyxrZx8qUhIcJl2oigv3hwihN2YqZ0Q4QDwllv8cw4BeABQByYR9x T91PQ0biiJ9vWCkALbyzYOpy+grdHCblwYW9+FM/qZBGH0ouPzDyZWPrRBLX12pH fEgDMB3vT9JqQ5tyafd0MHuAVlrDVHYEY+lmXplzFGKEFonBkN7HmDzAOKafuCuj GnIe0Sa1JnHVPNomx2dnG6Sku6/tPIfERuHEXrR9zkUJsacnfRY= =pJxP -----END PGP SIGNATURE----- Merge tag 'kvm-x86-pmu-6.11' of https://github.com/kvm-x86/linux into HEAD KVM x86/pmu changes for 6.11 - Don't advertise IA32_PERF_GLOBAL_OVF_CTRL as an MSR-to-be-saved, as it reads '0' and writes from userspace are ignored. - Update to the newfangled Intel CPU FMS infrastructure. - Use macros instead of open-coded literals to clean up KVM's manipulation of FIXED_CTR_CTRL MSRs.	2024-07-16 09:55:15 -04:00
Paolo Bonzini	5c5ddf7107	KVM x86 MTRR virtualization removal Remove support for virtualizing MTRRs on Intel CPUs, along with a nasty CR0.CD hack, and instead always honor guest PAT on CPUs that support self-snoop. -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEKTobbabEP7vbhhN9OlYIJqCjN/0FAmaRuwAACgkQOlYIJqCj N/32Gg/+Nnnz6TCRno2vursPJme7gvtLdqSxjazAj3u2ZO8IApGYWMyfVpS+ymC9 Wdpj6gRe2ukSxgTsUI2CYoy5V2NxDaA9YgdTPZUVQvqwujVrqZCJ7L393iPYYnC9 No3LXZ+SOYRmomiCzknjC6GOlT2hAZHzQsyaXDlEYok7NAA2L6XybbLonEdA4RYi V1mS62W5PaA4tUesuxkJjPujXo1nXRWD/aXOruJWjPESdSFSALlx7reFAf2Nwn7K Uw8yZqhq6vWAZSph0Nz8OrZOS/kULKA3q2zl1B/qJJ0ToAt2VdXS6abXky52RExf KvP+jBAWMO5kHbIqaMRtCHjbIkbhH8RdUIYNJQEUQ5DdydM5+/RDa+KprmLPcmUn qvJq+3uyH0MEENtneGegs8uxR+sn6fT32cGMIw790yIywddh562+IJ4Z+C3BuYJi yszD71odqKT8+knUd2CaZjE9UZyoQNDfj2OCCTzzZOC/6TuJWCh9CYQ1csssHbQR KcvZCKE6ht8tWwi+2HWj0laOdg1reX2kV869k3xH4uCwEaFIj2Wk+/Bw/lg2Tn5h 5uTnQ01dx5XhAV1klr6IY3VXJ/A8G8895wRfkZEelsA9Wj8qZvNgXhsoXReIUIrn aR0ppsFcbqHzC50qE2JT4juTD1EPx95LL9zKT8pI9mGKwxCAxUM= =yb10 -----END PGP SIGNATURE----- Merge tag 'kvm-x86-mtrrs-6.11' of https://github.com/kvm-x86/linux into HEAD KVM x86 MTRR virtualization removal Remove support for virtualizing MTRRs on Intel CPUs, along with a nasty CR0.CD hack, and instead always honor guest PAT on CPUs that support self-snoop.	2024-07-16 09:54:57 -04:00
Paolo Bonzini	34b69edecb	KVM x86 MMU changes for 6.11 - Don't allocate kvm_mmu_page.shadowed_translation for shadow pages that can't hole leafs SPTEs. - Unconditionally drop mmu_lock when allocating TDP MMU page tables for eager page splitting to avoid stalling vCPUs when splitting huge pages. - Misc cleanups -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEKTobbabEP7vbhhN9OlYIJqCjN/0FAmaRuqAACgkQOlYIJqCj N/3wwQ//d1HyNn/INUq+KZzaNgPPRXB/phsyAiHg0N9w3COlB3WsPlVtX9u04mHe 12O8IDUmK4TufxPYrxdfMJPRup0Ewb0BOu5+n6fGOBzfaeBDlLF/SbX65I4KaiNI taqfhBCW/4Jis9ESvrJpOZSdv7pAA2Q67aCKVoKrd4Vbrw/96lnrB1GLL662XzZ+ b7jm8nANoJgY4dLm7MVm33aSDQU35EXVHDWC9eWiaJmuXnFf7guf0rLKD1zkmNTI fVpUmZUFI7pcZNMB8u7JNmMofx748yrDe/MT6GxcEoLky6YKLYSLv3tZfywO2OrO vBJagYd1dy3798QQOCyqvtqc4OyHzv5jmwyLiKLGVgtavhYUWhFQUVSNy3p003S/ NfvLFOrT+cBAYE0D898bdoX0cvQQggdgC5UEXjzGaZAfG0TMRMv3klGSUS3NABnE owtdV/2qIRsC+bybLhqaYvib5zjDrZDtzUU6+2wt0ugWrvF4Qn/RdnFmOWedGJ51 Mr0xwhL0wekKvO8QaF55b9JO8wyaN4UYrUkPLmuK3/AICPU1m9CfQmu2iIe1bdsd 303X94LOmsKNRlWTe8SWj5xWrC8LUH0P4g56/gT36ye08tzy7dfmX7T/6VwkVxS4 pGRFLhlV8rqxaCSDgJqs+EdpKhfGpo5LuBcwZzO1YQNcDxoKO0I= =5Mgp -----END PGP SIGNATURE----- Merge tag 'kvm-x86-mmu-6.11' of https://github.com/kvm-x86/linux into HEAD KVM x86 MMU changes for 6.11 - Don't allocate kvm_mmu_page.shadowed_translation for shadow pages that can't hold leafs SPTEs. - Unconditionally drop mmu_lock when allocating TDP MMU page tables for eager page splitting to avoid stalling vCPUs when splitting huge pages. - Misc cleanups	2024-07-16 09:53:28 -04:00
Paolo Bonzini	5dcc1e7614	KVM x86 misc changes for 6.11 - Add a global struct to consolidate tracking of host values, e.g. EFER, and move "shadow_phys_bits" into the structure as "maxphyaddr". - Add KVM_CAP_X86_APIC_BUS_CYCLES_NS to allow configuring the effective APIC bus frequency, because TDX. - Print the name of the APICv/AVIC inhibits in the relevant tracepoint. - Clean up KVM's handling of vendor specific emulation to consistently act on "compatible with Intel/AMD", versus checking for a specific vendor. - Misc cleanups -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEKTobbabEP7vbhhN9OlYIJqCjN/0FAmaRub0ACgkQOlYIJqCj N/2LMxAArGzhcWZ6Qdo2aMRaMIPtSBJHmbEgEuHvHMumgsTZQzDcn9cxDi/hNSrc l8ODOwAM2qNcq95YfwjU7F0ae3E+HRzGvKcBnmZWuQeCDp2HhVEoCphFu1sHst+t XEJTL02b6OgyJUEU3h40mYk12eiq2S4FCnFYXPCqijwwuL6Y5KQvvTqek3c2/SDn c+VneutYGax/S0GiiCkYh4wrwWh9g7qm0IX70ycBwJbW5qBFKgyglvHxvL8JLJC9 Nkkw/p2657wcOdraH+fOBuRy2dMwE5fv++1tOjWwB5WAAhSOJPZh0BGYvgA2yfN7 OE+k7APKUQd9Xxtud8H3LrTPoyMA4hz2sdDFyqrrWK9yjpBY7zXNyN50Fxi7VVsm T8nTIiKAGyRbjotY+m7krXQPXjfZYhVqrJ/jtxESOZLZ93q2gSWU2p/ZXpUPVHnH +YOBAI1owP3wepaYlrthtI4LQx9lF422dnmeSflztfKFGabRbQZxg3uHMCCxIaGc lJ6CD546+D45f/uBXRDMqk//qFTqXhKUbDk9sutmU/C2oWufMwW0R8kOyItGPyvk 9PP1vd8vSsIHj+tpwg+i04jBqYDaAcPBOcTZaHm9SYYP+1e11Uu5Vjep37JL1bkA xJWxnDZOCGcfKQi2jkh51HJ/dOAHXY1GQKMfyAoPQOSonYHvGVY= =Cf2R -----END PGP SIGNATURE----- Merge tag 'kvm-x86-misc-6.11' of https://github.com/kvm-x86/linux into HEAD KVM x86 misc changes for 6.11 - Add a global struct to consolidate tracking of host values, e.g. EFER, and move "shadow_phys_bits" into the structure as "maxphyaddr". - Add KVM_CAP_X86_APIC_BUS_CYCLES_NS to allow configuring the effective APIC bus frequency, because TDX. - Print the name of the APICv/AVIC inhibits in the relevant tracepoint. - Clean up KVM's handling of vendor specific emulation to consistently act on "compatible with Intel/AMD", versus checking for a specific vendor. - Misc cleanups	2024-07-16 09:53:05 -04:00
Paolo Bonzini	86014c1e20	KVM generic changes for 6.11 - Enable halt poll shrinking by default, as Intel found it to be a clear win. - Setup empty IRQ routing when creating a VM to avoid having to synchronize SRCU when creating a split IRQCHIP on x86. - Rework the sched_in/out() paths to replace kvm_arch_sched_in() with a flag that arch code can use for hooking both sched_in() and sched_out(). - Take the vCPU @id as an "unsigned long" instead of "u32" to avoid truncating a bogus value from userspace, e.g. to help userspace detect bugs. - Mark a vCPU as preempted if and only if it's scheduled out while in the KVM_RUN loop, e.g. to avoid marking it preempted and thus writing guest memory when retrieving guest state during live migration blackout. - A few minor cleanups -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEKTobbabEP7vbhhN9OlYIJqCjN/0FAmaRuOYACgkQOlYIJqCj N/1UnQ/8CI5Qfr+/0gzYgtWmtEMczGG+rMNpzD3XVqPjJjXcMcBiQnplnzUVLhha vlPdYVK7vgmEt003XGzV55mik46LHL+DX/v4hI3HEdblfyCeNLW3fKEWVRB44qJe o+YUQwSK42SORUp9oXuQINxhA//U9EnI7CQxlJ8w8wenv5IJKfIGr01DefmfGPAV PKm9t6WLcNqvhZMEyy/zmzM3KVPCJL0NcwI97x6sHxFpQYIDtL0E/VexA4AFqMoT QK7cSDC/2US41Zvem/r/GzM/ucdF6vb9suzZYBohwhxtVhwJe2CDeYQZvtNKJ1U7 GOHPaKL6nBWdZCm/yyWbbX2nstY1lHqxhN3JD0X8wqU5rNcwm2b8Vfyav0Ehc7H+ jVbDTshOx4YJmIgajoKjgM050rdBK59TdfVL+l+AAV5q/TlHocalYtvkEBdGmIDg 2td9UHSime6sp20vQfczUEz4bgrQsh4l2Fa/qU2jFwLievnBw0AvEaMximkSGMJe b8XfjmdTjlOesWAejANKtQolfrq14+1wYw0zZZ8PA+uNVpKdoovmcqSOcaDC9bT8 GO/NFUvoG+lkcvJcIlo1SSl81SmGLosijwxWfGvFAqsgpR3/3l3dYp0QtztoCNJO d3+HnjgYn5o5FwufuTD3eUOXH4AFjG108DH0o25XrIkb2Kymy0o= =BalU -----END PGP SIGNATURE----- Merge tag 'kvm-x86-generic-6.11' of https://github.com/kvm-x86/linux into HEAD KVM generic changes for 6.11 - Enable halt poll shrinking by default, as Intel found it to be a clear win. - Setup empty IRQ routing when creating a VM to avoid having to synchronize SRCU when creating a split IRQCHIP on x86. - Rework the sched_in/out() paths to replace kvm_arch_sched_in() with a flag that arch code can use for hooking both sched_in() and sched_out(). - Take the vCPU @id as an "unsigned long" instead of "u32" to avoid truncating a bogus value from userspace, e.g. to help userspace detect bugs. - Mark a vCPU as preempted if and only if it's scheduled out while in the KVM_RUN loop, e.g. to avoid marking it preempted and thus writing guest memory when retrieving guest state during live migration blackout. - A few minor cleanups	2024-07-16 09:51:36 -04:00
Paolo Bonzini	f4501e8bc8	KVM Xen: Fix a bug where KVM fails to check the validity of an incoming userspace virtual address and tries to activate a gfn_to_pfn_cache with a kernel address. -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEKTobbabEP7vbhhN9OlYIJqCjN/0FAmaRtxMACgkQOlYIJqCj N/0B9A/+PeiWgW+AZ5Bl3YLLTMUi2Q1pKapT5hdNAthdDC72ewDDqtfXQ/Lcaubq j1ElrxXP5IjMBq7U65hT9hc/f6AxZat2LkTSBRM7uGrhE2UTF+/GmfJve8vUHZsA XtH9JRqNhBfr5EOkyh4AwvzvmzaiuJemPeLtQuEwlgaECs9byG1ILhudD/KiVStw 3Tw4Zw2oKzpo3hiW+5/YZBEkpO+CM50ByP9/SEPUhnG8sCqlGzMEI27bOvDgEDQs vV2fIj992lrSa9OX6oA/xz3/UE1t3ruo8mHjP/r3+sxr1SnnwbmR0YbiGEVdAjJI Pllwx6E9Fxi8Nqq/6Dy4XxUhNG+G8+ozwvr8wbLHzXFIpCGZFN5xgweuHtdEg0x8 mNxekTZrsLE58t5MpvTUMMHiHsn0KqtPqg1g3c2A7znZnzYxMHrIYyUm+uEOIZ/w Q93WT7s3SiaELgENsx3uda3Q0i2gKK3x7gbLk/9N2ciFZXgFZMwdZa7FivJr5OoT wp8r/btKTTaVGyn3x3y/Tum5XpyMNsKdxEKQ4n3aaIkfS2APfiHjQFyzkI2uLePG Tz/SZ/PT2rxpRuYL54mHzovPsIJ3NqBA+OTWCHKI/EwyWeyW9g05rZnmJar4ZYQ5 pdHqWgGi0wMrPWG+GQv0FMBj7EIM4Z3dA4/BRyRObnMmp1XtIb4= =Vgz5 -----END PGP SIGNATURE----- Merge tag 'kvm-x86-fixes-6.10-11' of https://github.com/kvm-x86/linux into HEAD KVM Xen: Fix a bug where KVM fails to check the validity of an incoming userspace virtual address and tries to activate a gfn_to_pfn_cache with a kernel address.	2024-07-16 09:51:14 -04:00
Linus Torvalds	208c6772d3	- This is basically PeterZ's idea to nest the alternative macros to avoid the need to "spell out" the number of alternates in an ALTERNATIVE_n() macro and thus have an ever-increasing complexity in those definitions. For ease of bisection, the old macros are converted to the new, nested variants in a step-by-step manner so that in case an issue is encountered during testing, one can pinpoint the place where it fails easier. Because debugging alternatives is a serious pain. -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmaU/+MACgkQEsHwGGHe VUpMGhAAqVbB3DZohv0Oa4BRRvaKFuQ3L7H0NTjK/pbT3EG+phol0zrHby2MnGjD HWXskps86n91QBB/06vAyZRimV/dvAPvlSKllsRx6ie3VCE4FJPzA4nTWQn/41dC HWamj78mQuSMgioLzIYdTY79KObtJcUw/X/xz+TTMemfkzkQxukKY7+Y71nZbuKi rUuCSrfAWNHQaIaoGs2JowGw7te7yNOtKQMCW5TdNLwvJfOAECuoLIFeiEcWHvoO uGl6FTABNLp26wmaeceUxdjBbTJcM3iV3joZQYED7B+mbJcU/a7tZw7I+mavPrbh Y6+EOn7rzR0wbcmj0iJ74TKr+uKDme/Qzm3YEKgGvJPj9tRjTDwxWRBnyTeCMbav NkKVwWTep8K+1qJtGVBwACY6iz89u3P8V5owD8O++KIPQa8rA0m8pN5gaU3PVYYQ D2UUdqXWIPIFoD4Sveb/WFU8OJKY+Nx7IK8KD03h5tiXW8MmGSa2e5b57gIfCLP7 DbSHyCkTiqEdBrSM4/RaVVckD6NZ39M87H+iV51vYUkCYmODa/riMj0M7SVMi5Jo S/30jvdHEzWnmDBbOsn9d1XbvB5I+zz3BrcZQ2VSyBx+Y9m+SZ9qsyMEkOWw4uKM kfFp+XlYVWSnQFo7jY3UUIgVrU9dmdX1WxNX7/2HABDjg3MtWog= =tGap -----END PGP SIGNATURE----- Merge tag 'x86_alternatives_for_v6.11_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 alternatives updates from Borislav Petkov: "This is basically PeterZ's idea to nest the alternative macros to avoid the need to "spell out" the number of alternates in an ALTERNATIVE_n() macro and thus have an ever-increasing complexity in those definitions. For ease of bisection, the old macros are converted to the new, nested variants in a step-by-step manner so that in case an issue is encountered during testing, one can pinpoint the place where it fails easier. Because debugging alternatives is a serious pain" * tag 'x86_alternatives_for_v6.11_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/alternatives, kvm: Fix a couple of CALLs without a frame pointer x86/alternative: Replace the old macros x86/alternative: Convert the asm ALTERNATIVE_3() macro x86/alternative: Convert the asm ALTERNATIVE_2() macro x86/alternative: Convert the asm ALTERNATIVE() macro x86/alternative: Convert ALTERNATIVE_3() x86/alternative: Convert ALTERNATIVE_TERNARY() x86/alternative: Convert alternative_call_2() x86/alternative: Convert alternative_call() x86/alternative: Convert alternative_io() x86/alternative: Convert alternative_input() x86/alternative: Convert alternative_2() x86/alternative: Convert alternative() x86/alternatives: Add nested alternatives macros x86/alternative: Zap alternative_ternary()	2024-07-15 19:11:28 -07:00
Paolo Bonzini	c8b8b8190a	LoongArch KVM changes for v6.11 1. Add ParaVirt steal time support. 2. Add some VM migration enhancement. 3. Add perf kvm-stat support for loongarch. -----BEGIN PGP SIGNATURE----- iQJKBAABCAA0FiEEzOlt8mkP+tbeiYy5AoYrw/LiJnoFAmaOS6UWHGNoZW5odWFj YWlAa2VybmVsLm9yZwAKCRAChivD8uImehejD/9pACGe3h3krXLcFVWXOFIu5Hpc 5kQLP0lSPJ/o5Xs8t/oPLrnDX70z90wXI1LOmltc7h32MSwFa2l8COQh+sN5eJBQ PNyt7u7bMipp0yJS4Gl3LQQ5vklcGOSpQc/gbeXnVx8J/tz+Mo9YGGLIXVRXRM6W Ri8D2VVFiwzQQYeTpPo1u1Ob8C6mA4KOppwvhscMTM3vj4NMbsinBzRnR0lG0Tdw meFhxDPly1Ksxsbnj9UGO6UnEY0A2SLONs6MiO4y4DtoqoDlw/lbqFJuYo4vvbx1 pxtjyirD/PX/wjslQFWUOuU0hMfAodera+JupZ5BZWfcG8FltA4DQfDsm/U9RjK/ 7gGNnr8Xk2/tp6+4AVV+HU2iTgRvq+mXCL72zSy2Y4r7ElBAANDfk4n+Zn/PWisn U9wwV8Ue7tVB15BRpRsg77NzBidiCFEe/6flWYiX2y24ke71gwDJBGUy8hMdKt6t 4Cq8atsU0MvDAzfYMsK9JjskJp4UFq6wb1tXbbuADM4TDhnzlK6s6h3vM+pFlh/f my7fDH8/2qsCWhBDM4pmsJskVp+I1GOk/80RjTQISwx7iHktJWvxNYTaisK2fvD5 Qs1IUWfNFbDX0Lr0QpN6j6X4rZkghR4R6XoFkd4nkicwi+UHVn3oK9GSqv24QJn9 7+Ev3dfRTUYLd6mC4Q== =DpIK -----END PGP SIGNATURE----- Merge tag 'loongarch-kvm-6.11' of git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson into HEAD LoongArch KVM changes for v6.11 1. Add ParaVirt steal time support. 2. Add some VM migration enhancement. 3. Add perf kvm-stat support for loongarch.	2024-07-12 11:24:12 -04:00
Paolo Bonzini	f3996d4d79	Merge branch 'kvm-prefault' into HEAD Pre-population has been requested several times to mitigate KVM page faults during guest boot or after live migration. It is also required by TDX before filling in the initial guest memory with measured contents. Introduce it as a generic API.	2024-07-12 11:18:45 -04:00
Paolo Bonzini	6e01b7601d	KVM: x86: Implement kvm_arch_vcpu_pre_fault_memory() Wire KVM_PRE_FAULT_MEMORY ioctl to kvm_mmu_do_page_fault() to populate guest memory. It can be called right after KVM_CREATE_VCPU creates a vCPU, since at that point kvm_mmu_create() and kvm_init_mmu() are called and the vCPU is ready to invoke the KVM page fault handler. The helper function kvm_tdp_map_page() takes care of the logic to process RET_PF_* return values and convert them to success or errno. Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> Message-ID: <9b866a0ae7147f96571c439e75429a03dcb659b6.1712785629.git.isaku.yamahata@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2024-07-12 11:17:47 -04:00
Paolo Bonzini	58ef24699b	KVM: x86/mmu: Make kvm_mmu_do_page_fault() return mapped level The guest memory population logic will need to know what page size or level (4K, 2M, ...) is mapped. Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> Message-ID: <eabc3f3e5eb03b370cadf6e1901ea34d7a020adc.1712785629.git.isaku.yamahata@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2024-07-12 11:17:36 -04:00
Sean Christopherson	f5e7f00cf1	KVM: x86/mmu: Account pf_{fixed,emulate,spurious} in callers of "do page fault" Move the accounting of the result of kvm_mmu_do_page_fault() to its callers, as only pf_fixed is common to guest page faults and async #PFs, and upcoming support KVM_PRE_FAULT_MEMORY won't bump _any_ stats. Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2024-07-12 11:17:35 -04:00
Sean Christopherson	5186ec223b	KVM: x86/mmu: Bump pf_taken stat only in the "real" page fault handler Account stat.pf_taken in kvm_mmu_page_fault(), i.e. the actual page fault handler, instead of conditionally bumping it in kvm_mmu_do_page_fault(). The "real" page fault handler is the only path that should ever increment the number of taken page faults, as all other paths that "do page fault" are by definition not handling faults that occurred in the guest. Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2024-07-12 11:17:35 -04:00
Borislav Petkov (AMD)	0d3db1f14a	x86/alternatives, kvm: Fix a couple of CALLs without a frame pointer objtool complains: arch/x86/kvm/kvm.o: warning: objtool: .altinstr_replacement+0xc5: call without frame pointer save/setup vmlinux.o: warning: objtool: .altinstr_replacement+0x2eb: call without frame pointer save/setup Make sure %rSP is an output operand to the respective asm() statements. The test_cc() hunk and ALT_OUTPUT_SP() courtesy of peterz. Also from him add some helpful debugging info to the documentation. Now on to the explanations: tl;dr: The alternatives macros are pretty fragile. If I do ALT_OUTPUT_SP(output) in order to be able to package in a %rsp reference for objtool so that a stack frame gets properly generated, the inline asm input operand with positional argument 0 in clear_page(): "0" (page) gets "renumbered" due to the added : "+r" (current_stack_pointer), "=D" (page) and then gcc says: ./arch/x86/include/asm/page_64.h:53:9: error: inconsistent operand constraints in an ‘asm’ The fix is to use an explicit "D" constraint which points to a singleton register class (gcc terminology) which ends up doing what is expected here: the page pointer - input and output - should be in the same %rdi register. Other register classes have more than one register in them - example: "r" and "=r" or "A": ‘A’ The ‘a’ and ‘d’ registers. This class is used for instructions that return double word results in the ‘ax:dx’ register pair. Single word values will be allocated either in ‘ax’ or ‘dx’. so using "D" and "=D" just works in this particular case. And yes, one would say, sure, why don't you do "+D" but then: : "+r" (current_stack_pointer), "+D" (page) : [old] "i" (clear_page_orig), [new1] "i" (clear_page_rep), [new2] "i" (clear_page_erms), : "cc", "memory", "rax", "rcx") now find the Waldo^Wcomma which throws a wrench into all this. Because that silly macro has an "input..." consume-all last macro arg and in it, one is supposed to supply input and clobbers, leading to silly syntax snafus. Yap, they need to be cleaned up, one fine day... Closes: https://lore.kernel.org/oe-kbuild-all/202406141648.jO9qNGLa-lkp@intel.com/ Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Acked-by: Sean Christopherson <seanjc@google.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20240625112056.GDZnqoGDXgYuWBDUwu@fat_crate.local	2024-07-01 12:41:11 +02:00
Peng Hao	dd103407ca	KVM: X86: Remove unnecessary GFP_KERNEL_ACCOUNT for temporary variables Some variables allocated in kvm_arch_vcpu_ioctl are released when the function exits, so there is no need to set GFP_KERNEL_ACCOUNT. Signed-off-by: Peng Hao <flyingpeng@tencent.com> Link: https://lore.kernel.org/r/20240624012016.46133-1-flyingpeng@tencent.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-06-28 09:52:16 -07:00
Dapeng Mi	f287bef6dd	KVM: x86/pmu: Introduce distinct macros for GP/fixed counter max number Refine the macros which define maximum General Purpose (GP) and fixed counter numbers. Currently the macro KVM_INTEL_PMC_MAX_GENERIC is used to represent the maximum supported General Purpose (GP) counter number ambiguously across Intel and AMD platforms. This would cause issues if AMD begins to support more GP counters than Intel. Thus a bunch of new macros including vendor specific and vendor independent are introduced to replace the old macros. The vendor independent macros are used in x86 common code to hide vendor difference and eliminate the ambiguity. No logic changes are introduced in this patch. Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com> Link: https://lore.kernel.org/r/20240627021756.144815-1-dapeng1.mi@linux.intel.com Co-developed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-06-28 09:12:16 -07:00
Sean Christopherson	45405155d8	KVM: x86: WARN if a vCPU gets a valid wakeup that KVM can't yet inject WARN if a blocking vCPU is awakened by a valid wake event that KVM can't inject, e.g. because KVM needs to complete a nested VM-enter, or needs to re-inject an exception. For the nested VM-Enter case, KVM is supposed to clear "nested_run_pending" if L1 puts L2 into HLT, i.e. entering HLT "completes" the nested VM-Enter. And for already-injected exceptions, it should be impossible for the vCPU to be in a blocking state if a VM-Exit occurred while an exception was being vectored. Link: https://lore.kernel.org/r/20240607172609.3205077-7-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-06-28 08:59:07 -07:00
Sean Christopherson	321ef62b0c	KVM: nVMX: Fold requested virtual interrupt check into has_nested_events() Check for a Requested Virtual Interrupt, i.e. a virtual interrupt that is pending delivery, in vmx_has_nested_events() and drop the one-off kvm_x86_ops.guest_apic_has_interrupt() hook. In addition to dropping a superfluous hook, this fixes a bug where KVM would incorrectly treat virtual interrupts _for L2_ as always enabled due to kvm_arch_interrupt_allowed(), by way of vmx_interrupt_blocked(), treating IRQs as enabled if L2 is active and vmcs12 is configured to exit on IRQs, i.e. KVM would treat a virtual interrupt for L2 as a valid wake event based on L1's IRQ blocking status. Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20240607172609.3205077-6-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-06-28 08:59:06 -07:00
Sean Christopherson	27c4fa42b1	KVM: nVMX: Check for pending posted interrupts when looking for nested events Check for pending (and notified!) posted interrupts when checking if L2 has a pending wake event, as fully posted/notified virtual interrupt is a valid wake event for HLT. Note that KVM must check vmx->nested.pi_pending to avoid prematurely waking L2, e.g. even if KVM sees a non-zero PID.PIR and PID.0N=1, the virtual interrupt won't actually be recognized until a notification IRQ is received by the vCPU or the vCPU does (nested) VM-Enter. Fixes: `26844fee6a` ("KVM: x86: never write to memory from kvm_vcpu_check_block()") Cc: stable@vger.kernel.org Cc: Maxim Levitsky <mlevitsk@redhat.com> Reported-by: Jim Mattson <jmattson@google.com> Closes: https://lore.kernel.org/all/20231207010302.2240506-1-jmattson@google.com Link: https://lore.kernel.org/r/20240607172609.3205077-5-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-06-28 08:59:06 -07:00
Sean Christopherson	322a569c4b	KVM: VMX: Split out the non-virtualization part of vmx_interrupt_blocked() Move the non-VMX chunk of the "interrupt blocked" checks to a separate helper so that KVM can reuse the code to detect if interrupts are blocked for L2, e.g. to determine if a virtual interrupt _for L2_ is a valid wake event. If L1 disables HLT-exiting for L2, nested APICv is enabled, and L2 HLTs, then L2 virtual interrupts are valid wake events, but if and only if interrupts are unblocked for L2. Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20240607172609.3205077-4-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-06-28 08:59:05 -07:00
Sean Christopherson	32f55e475c	KVM: nVMX: Request immediate exit iff pending nested event needs injection When requesting an immediate exit from L2 in order to inject a pending event, do so only if the pending event actually requires manual injection, i.e. if and only if KVM actually needs to regain control in order to deliver the event. Avoiding the "immediate exit" isn't simply an optimization, it's necessary to make forward progress, as the "already expired" VMX preemption timer trick that KVM uses to force a VM-Exit has higher priority than events that aren't directly injected. At present time, this is a glorified nop as all events processed by vmx_has_nested_events() require injection, but that will not hold true in the future, e.g. if there's a pending virtual interrupt in vmcs02.RVI. I.e. if KVM is trying to deliver a virtual interrupt to L2, the expired VMX preemption timer will trigger VM-Exit before the virtual interrupt is delivered, and KVM will effectively hang the vCPU in an endless loop of forced immediate VM-Exits (because the pending virtual interrupt never goes away). Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20240607172609.3205077-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-06-28 08:59:04 -07:00
Sean Christopherson	d83c36d822	KVM: nVMX: Add a helper to get highest pending from Posted Interrupt vector Add a helper to retrieve the highest pending vector given a Posted Interrupt descriptor. While the actual operation is straightforward, it's surprisingly easy to mess up, e.g. if one tries to reuse lapic.c's find_highest_vector(), which doesn't work with PID.PIR due to the APIC's IRR and ISR component registers being physically discontiguous (they're 4-byte registers aligned at 16-byte intervals). To make PIR handling more consistent with respect to IRR and ISR handling, return -1 to indicate "no interrupt pending". Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20240607172609.3205077-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-06-28 08:59:03 -07:00
Kai Huang	92c1e3cbf0	KVM: VMX: Switch __vmx_exit() and kvm_x86_vendor_exit() in vmx_exit() In the vmx_init() error handling path, the __vmx_exit() is done before kvm_x86_vendor_exit(). They should follow the same order in vmx_exit(). But currently __vmx_exit() is done after kvm_x86_vendor_exit() in vmx_exit(). Switch the order of them to fix. Fixes: `e32b120071` ("KVM: VMX: Do _all_ initialization before exposing /dev/kvm to userspace") Signed-off-by: Kai Huang <kai.huang@intel.com> Link: https://lore.kernel.org/r/20240627010524.3732488-1-kai.huang@intel.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-06-28 08:58:23 -07:00
Sean Christopherson	23b2c5088d	KVM: VMX: Remove unnecessary INVEPT[GLOBAL] from hardware enable path Remove the completely pointess global INVEPT, i.e. EPT TLB flush, from KVM's VMX enablement path. KVM always does a targeted TLB flush when using a "new" EPT root, in quotes because "new" simply means a root that isn't currently being used by the vCPU. KVM also _deliberately_ runs with stale TLB entries for defunct roots, i.e. doesn't do a TLB flush when vCPUs stop using roots, precisely because KVM does the flush on first use. As called out by the comment in kvm_mmu_load(), the reason KVM flushes on first use is because KVM can't guarantee the correctness of past hypervisors. Jumping back to the global INVEPT, when the painfully terse commit `1439442c7b` ("KVM: VMX: Enable EPT feature for KVM") was added, the effective TLB flush being performed was: static void vmx_flush_tlb(struct kvm_vcpu *vcpu) { vpid_sync_vcpu_all(to_vmx(vcpu)); } I.e. KVM was not flushing EPT TLB entries when allocating a "new" root, which very strongly suggests that the global INVEPT during hardware enabling was a misguided hack that addressed the most obvious symptom, but failed to fix the underlying bug. Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Link: https://lore.kernel.org/r/20240608001003.3296640-1-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-06-28 08:57:25 -07:00
Sean Christopherson	cb9fb5fc12	KVM: nVMX: Update VMCS12_REVISION comment to state it should never change Rewrite the comment above VMCS12_REVISION to unequivocally state that the ID must never change. KVM_{G,S}ET_NESTED_STATE have been officially supported for some time now, i.e. changing VMCS12_REVISION would break userspace. Opportunistically add a blurb to the CHECK_OFFSET() comment to make it explicitly clear that new fields are allowed, i.e. that the restriction on the layout is all about backwards compatibility. No functional change intended. Cc: Jim Mattson <jmattson@google.com> Reviewed-by: Jim Mattson <jmattson@google.com> Link: https://lore.kernel.org/r/20240613190103.1054877-1-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-06-28 08:55:00 -07:00
Sean Christopherson	704ec48fc2	KVM: SVM: Use sev_es_host_save_area() helper when initializing tsc_aux Use sev_es_host_save_area() instead of open coding an equivalent when setting the MSR_TSC_AUX field during setup. No functional change intended. Link: https://lore.kernel.org/r/20240617210432.1642542-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-06-28 08:53:00 -07:00
Sean Christopherson	34830b3c02	KVM: SVM: Force sev_es_host_save_area() to be inlined (for noinstr usage) Force sev_es_host_save_area() to be always inlined, as it's used in the low level VM-Enter/VM-Exit path, which is non-instrumentable. vmlinux.o: warning: objtool: svm_vcpu_enter_exit+0xb0: call to sev_es_host_save_area() leaves .noinstr.text section vmlinux.o: warning: objtool: svm_vcpu_enter_exit+0xbf: call to sev_es_host_save_area.isra.0() leaves .noinstr.text section Fixes: `c92be2fd8e` ("KVM: SVM: Save/restore non-volatile GPRs in SEV-ES VMRUN via host save area") Reported-by: Borislav Petkov <bp@alien8.de> Tested-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20240617210432.1642542-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-06-28 08:53:00 -07:00
Jeff Johnson	8815d77cbc	KVM: x86: Add missing MODULE_DESCRIPTION() macros Add module descriptions for the vendor modules to fix allmodconfig 'make W=1' warnings: WARNING: modpost: missing MODULE_DESCRIPTION() in arch/x86/kvm/kvm-intel.o WARNING: modpost: missing MODULE_DESCRIPTION() in arch/x86/kvm/kvm-amd.o Signed-off-by: Jeff Johnson <quic_jjohnson@quicinc.com> Link: https://lore.kernel.org/r/20240622-md-kvm-v2-1-29a60f7c48b1@quicinc.com [sean: split kvm.ko change to separate commit] Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-06-28 08:47:12 -07:00
Pei Li	ebbdf37ce9	KVM: Validate hva in kvm_gpc_activate_hva() to fix __kvm_gpc_refresh() WARN Check that the virtual address is "ok" when activating a gfn_to_pfn_cache with a host VA to ensure that KVM never attempts to use a bad address. This fixes a bug where KVM fails to check the incoming address when handling KVM_XEN_VCPU_ATTR_TYPE_VCPU_INFO_HVA in kvm_xen_vcpu_set_attr(). Reported-by: syzbot+fd555292a1da3180fc82@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=fd555292a1da3180fc82 Tested-by: syzbot+fd555292a1da3180fc82@syzkaller.appspotmail.com Signed-off-by: Pei Li <peili.dev@gmail.com> Reviewed-by: Paul Durrant <paul@xen.org> Reviewed-by: David Woodhouse <dwmw@amazon.co.uk> Link: https://lore.kernel.org/r/20240627-bug5-v2-1-2c63f7ee6739@gmail.com [sean: rewrite changelog with --verbose] Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-06-28 08:31:46 -07:00
Michael Roth	cf6d9d2d24	KVM: SEV-ES: Fix svm_get_msr()/svm_set_msr() for KVM_SEV_ES_INIT guests With commit `27bd5fdc24` ("KVM: SEV-ES: Prevent MSR access post VMSA encryption"), older VMMs like QEMU 9.0 and older will fail when booting SEV-ES guests with something like the following error: qemu-system-x86_64: error: failed to get MSR 0x174 qemu-system-x86_64: ../qemu.git/target/i386/kvm/kvm.c:3950: kvm_get_msrs: Assertion `ret == cpu->kvm_msr_buf->nmsrs' failed. This is because older VMMs that might still call svm_get_msr()/svm_set_msr() for SEV-ES guests after guest boot even if those interfaces were essentially just noops because of the vCPU state being encrypted and stored separately in the VMSA. Now those VMMs will get an -EINVAL and generally crash. Newer VMMs that are aware of KVM_SEV_INIT2 however are already aware of the stricter limitations of what vCPU state can be sync'd during guest run-time, so newer QEMU for instance will work both for legacy KVM_SEV_ES_INIT interface as well as KVM_SEV_INIT2. So when using KVM_SEV_INIT2 it's okay to assume userspace can deal with -EINVAL, whereas for legacy KVM_SEV_ES_INIT the kernel might be dealing with either an older VMM and so it needs to assume that returning -EINVAL might break the VMM. Address this by only returning -EINVAL if the guest was started with KVM_SEV_INIT2. Otherwise, just silently return. Cc: Ravi Bangoria <ravi.bangoria@amd.com> Cc: Nikunj A Dadhania <nikunj@amd.com> Reported-by: Srikanth Aithal <sraithal@amd.com> Closes: https://lore.kernel.org/lkml/37usuu4yu4ok7be2hqexhmcyopluuiqj3k266z4gajc2rcj4yo@eujb23qc3zcm/ Fixes: `27bd5fdc24` ("KVM: SEV-ES: Prevent MSR access post VMSA encryption") Signed-off-by: Michael Roth <michael.roth@amd.com> Message-ID: <20240604233510.764949-1-michael.roth@amd.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2024-06-21 07:11:29 -04:00
Rick Edgecombe	c2f38f75fc	KVM: x86/tdp_mmu: Take a GFN in kvm_tdp_mmu_fast_pf_get_last_sptep() Pass fault->gfn into kvm_tdp_mmu_fast_pf_get_last_sptep(), instead of passing fault->addr and then converting it to a GFN. Future changes will make fault->addr and fault->gfn differ when running TDX guests. The GFN will be conceptually the same as it is for normal VMs, but fault->addr may contain a TDX specific bit that differentiates between "shared" and "private" memory. This bit will be used to direct faults to be handled on different roots, either the normal "direct" root or a new type of root that handles private memory. The TDP iterators will process the traditional GFN concept and apply the required TDX specifics depending on the root type. For this reason, it needs to operate on regular GFN and not the addr, which may contain these special TDX specific bits. Today kvm_tdp_mmu_fast_pf_get_last_sptep() takes fault->addr and then immediately converts it to a GFN with a bit shift. However, this would unfortunately retain the TDX specific bits in what is supposed to be a traditional GFN. Excluding TDX's needs, it is also is unnecessary to pass fault->addr and convert it to a GFN when the GFN is already on hand. So instead just pass the GFN into kvm_tdp_mmu_fast_pf_get_last_sptep() and use it directly. Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Message-ID: <20240619223614.290657-9-rick.p.edgecombe@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2024-06-20 18:43:31 -04:00
Rick Edgecombe	964cea8171	KVM: x86/tdp_mmu: Rename REMOVED_SPTE to FROZEN_SPTE Rename REMOVED_SPTE to FROZEN_SPTE so that it can be used for other multi-part operations. REMOVED_SPTE is used as a non-present intermediate value for multi-part operations that can happen when a thread doesn't have an MMU write lock. Today these operations are when removing PTEs. However, future changes will want to use the same concept for setting a PTE. In that case the REMOVED_SPTE name does not quite fit. So rename it to FROZEN_SPTE so it can be used for both types of operations. Also rename the relevant helpers and comments that refer to "removed" within the context of the SPTE value. Take care to not update naming referring the "remove" operations, which are still distinct. Suggested-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Message-ID: <20240619223614.290657-2-rick.p.edgecombe@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2024-06-20 18:43:31 -04:00
Paolo Bonzini	02b0d3b9d4	Merge branch 'kvm-6.10-fixes' into HEAD	2024-06-20 17:31:50 -04:00
Isaku Yamahata	8a4e2742a5	KVM: x86/tdp_mmu: Sprinkle __must_check The TDP MMU function __tdp_mmu_set_spte_atomic uses a cmpxchg64 to replace the SPTE value and returns -EBUSY on failure. The caller must check the return value and retry. Add __must_check to it, as well as to two more functions that forward the return value of __tdp_mmu_set_spte_atomic to their caller. Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com> Message-Id: <8f7d5a1b241bf5351eaab828d1a1efe5c17699ca.1705965635.git.isaku.yamahata@intel.com> Acked-by: Kai Huang <kai.huang@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2024-06-20 17:28:44 -04:00
Sean Christopherson	f3ced000a2	KVM: x86: Always sync PIR to IRR prior to scanning I/O APIC routes Sync pending posted interrupts to the IRR prior to re-scanning I/O APIC routes, irrespective of whether the I/O APIC is emulated by userspace or by KVM. If a level-triggered interrupt routed through the I/O APIC is pending or in-service for a vCPU, KVM needs to intercept EOIs on said vCPU even if the vCPU isn't the destination for the new routing, e.g. if servicing an interrupt using the old routing races with I/O APIC reconfiguration. Commit `fceb3a36c2` ("KVM: x86: ioapic: Fix level-triggered EOI and userspace I/OAPIC reconfigure race") fixed the common cases, but kvm_apic_pending_eoi() only checks if an interrupt is in the local APIC's IRR or ISR, i.e. misses the uncommon case where an interrupt is pending in the PIR. Failure to intercept EOI can manifest as guest hangs with Windows 11 if the guest uses the RTC as its timekeeping source, e.g. if the VMM doesn't expose a more modern form of time to the guest. Cc: stable@vger.kernel.org Cc: Adamos Ttofari <attofari@amazon.de> Cc: Raghavendra Rao Ananta <rananta@google.com> Reviewed-by: Jim Mattson <jmattson@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-ID: <20240611014845.82795-1-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2024-06-20 14:18:02 -04:00
David Matlack	a6816314af	KVM: Introduce vcpu->wants_to_run Introduce vcpu->wants_to_run to indicate when a vCPU is in its core run loop, i.e. when the vCPU is running the KVM_RUN ioctl and immediate_exit was not set. Replace all references to vcpu->run->immediate_exit with !vcpu->wants_to_run to avoid TOCTOU races with userspace. For example, a malicious userspace could invoked KVM_RUN with immediate_exit=true and then after KVM reads it to set wants_to_run=false, flip it to false. This would result in the vCPU running in KVM_RUN with wants_to_run=false. This wouldn't cause any real bugs today but is a dangerous landmine. Signed-off-by: David Matlack <dmatlack@google.com> Link: https://lore.kernel.org/r/20240503181734.1467938-2-dmatlack@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-06-18 09:20:01 -07:00
Sean Christopherson	d29bf2ca14	KVM: x86: Prevent excluding the BSP on setting max_vcpu_ids If the BSP vCPU ID was already set, ensure it doesn't get excluded when limiting vCPU IDs via KVM_CAP_MAX_VCPU_ID. [mks: provide commit message, code by Sean] Signed-off-by: Mathias Krause <minipli@grsecurity.net> Link: https://lore.kernel.org/r/20240614202859.3597745-4-minipli@grsecurity.net Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-06-18 08:59:50 -07:00
Mathias Krause	7c305d5118	KVM: x86: Limit check IDs for KVM_SET_BOOT_CPU_ID Do not accept IDs which are definitely invalid by limit checking the passed value against KVM_MAX_VCPU_IDS and 'max_vcpu_ids' if it was already set. This ensures invalid values, especially on 64-bit systems, don't go unnoticed and lead to a valid id by chance when truncated by the final assignment. Fixes: `73880c80aa` ("KVM: Break dependency between vcpu index in vcpus array and vcpu_id.") Signed-off-by: Mathias Krause <minipli@grsecurity.net> Link: https://lore.kernel.org/r/20240614202859.3597745-3-minipli@grsecurity.net Co-developed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-06-18 08:59:36 -07:00
David Matlack	0089c055b5	KVM: x86/mmu: Avoid reacquiring RCU if TDP MMU fails to allocate an SP Avoid needlessly reacquiring the RCU read lock if the TDP MMU fails to allocate a shadow page during eager page splitting. Opportunistically drop the local variable ret as well now that it's no longer necessary. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: David Matlack <dmatlack@google.com> Link: https://lore.kernel.org/r/20240611220512.2426439-5-dmatlack@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-06-14 09:25:03 -07:00
David Matlack	3d4a5a45ca	KVM: x86/mmu: Unnest TDP MMU helpers that allocate SPs for eager splitting Move the implementation of tdp_mmu_alloc_sp_for_split() to its one and only caller to reduce unnecessary nesting and make it more clear why the eager split loop continues after allocating a new SP. Opportunistically drop the double-underscores from __tdp_mmu_alloc_sp_for_split() now that its parent is gone. No functional change intended. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: David Matlack <dmatlack@google.com> Link: https://lore.kernel.org/r/20240611220512.2426439-4-dmatlack@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-06-14 09:24:23 -07:00
David Matlack	e1c04f7a9f	KVM: x86/mmu: Hard code GFP flags for TDP MMU eager split allocations Now that the GFP_NOWAIT case is gone, hard code GFP_KERNEL_ACCOUNT when allocating shadow pages during eager page splitting in the TDP MMU. Opportunistically replace use of __GFP_ZERO with allocations that zero to improve readability. No functional change intended. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: David Matlack <dmatlack@google.com> Link: https://lore.kernel.org/r/20240611220512.2426439-3-dmatlack@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-06-14 09:23:40 -07:00
David Matlack	cf3ff0ee24	KVM: x86/mmu: Always drop mmu_lock to allocate TDP MMU SPs for eager splitting Always drop mmu_lock to allocate shadow pages in the TDP MMU when doing eager page splitting. Dropping mmu_lock during eager page splitting is cheap since KVM does not have to flush remote TLBs, and avoids stalling vCPU threads that are taking page faults while KVM is eager splitting under mmu_lock held for write. This change reduces 20%+ dips in MySQL throughput during live migration in a 160 vCPU VM while userspace is issuing CLEAR_DIRTY_LOG ioctls (tested with 1GiB and 8GiB CLEARs). Userspace could issue finer-grained CLEARs, which would also reduce contention on mmu_lock, but doing so will increase the rate of remote TLB flushing, since KVM must flush TLBs before returning from CLEAR_DITY_LOG. When there isn't contention on mmu_lock[1], this change does not regress the time it takes to perform eager page splitting (the cost of releasing and re-acquiring an uncontended lock is minimal on x86). [1] Tested with dirty_log_perf_test, which does not run vCPUs during eager page splitting, and with a 16 vCPU VM Live Migration with manual-protect disabled (where mmu_lock is held for read). Cc: Bibo Mao <maobibo@loongson.cn> Cc: Sean Christopherson <seanjc@google.com> Signed-off-by: David Matlack <dmatlack@google.com> Link: https://lore.kernel.org/r/20240611220512.2426439-2-dmatlack@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-06-14 09:22:22 -07:00
Sean Christopherson	caa7278829	KVM: x86/mmu: Rephrase comment about synthetic PFERR flags in #PF handler Reword the BUILD_BUG_ON() comment in the legacy #PF handler to explicitly describe how asserting that synthetic PFERR flags are limited to bits 31:0 protects KVM against inadvertently passing a synthetic flag to the common page fault handler. No functional change intended. Suggested-by: Xiaoyao Li <xiaoyao.li@intel.com> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Link: https://lore.kernel.org/r/20240608001108.3296879-1-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-06-14 09:20:47 -07:00
Sean Christopherson	3dee3b1874	KVM: x86: Drop now-superflous setting of l1tf_flush_l1d in vcpu_run() Now that KVM unconditionally sets l1tf_flush_l1d in kvm_arch_vcpu_load(), drop the redundant store from vcpu_run(). The flag is cleared only when VM-Enter is imminent, deep below vcpu_run(), i.e. barring a KVM bug, it's impossible for l1tf_flush_l1d to be cleared between loading the vCPU and calling vcpu_run(). Acked-by: Kai Huang <kai.huang@intel.com> Link: https://lore.kernel.org/r/20240522014013.1672962-7-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-06-11 14:18:47 -07:00
Sean Christopherson	ef2e18ef37	KVM: x86: Unconditionally set l1tf_flush_l1d during vCPU load Always set l1tf_flush_l1d during kvm_arch_vcpu_load() instead of setting it only when the vCPU is being scheduled back in. The flag is processed only when VM-Enter is imminent, and KVM obviously needs to load the vCPU before VM-Enter, so attempting to precisely set l1tf_flush_l1d provides no meaningful value. I.e. the flag _will_ be set either way, it's simply a matter of when. Acked-by: Kai Huang <kai.huang@intel.com> Link: https://lore.kernel.org/r/20240522014013.1672962-6-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-06-11 14:18:46 -07:00
Sean Christopherson	2a27c43140	KVM: Delete the now unused kvm_arch_sched_in() Delete kvm_arch_sched_in() now that all implementations are nops. Reviewed-by: Bibo Mao <maobibo@loongson.cn> Acked-by: Kai Huang <kai.huang@intel.com> Link: https://lore.kernel.org/r/20240522014013.1672962-5-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-06-11 14:18:45 -07:00
Sean Christopherson	8fbb696a8f	KVM: x86: Fold kvm_arch_sched_in() into kvm_arch_vcpu_load() Fold the guts of kvm_arch_sched_in() into kvm_arch_vcpu_load(), keying off the recently added kvm_vcpu.scheduled_out as appropriate. Note, there is a very slight functional change, as PLE shrink updates will now happen after blasting WBINVD, but that is quite uninteresting as the two operations do not interact in any way. Acked-by: Kai Huang <kai.huang@intel.com> Link: https://lore.kernel.org/r/20240522014013.1672962-4-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-06-11 14:18:44 -07:00
Sean Christopherson	5d9c07febb	KVM: VMX: Move PLE grow/shrink helpers above vmx_vcpu_load() Move VMX's {grow,shrink}_ple_window() above vmx_vcpu_load() in preparation of moving the sched_in logic, which handles shrinking the PLE window, into vmx_vcpu_load(). No functional change intended. Acked-by: Kai Huang <kai.huang@intel.com> Link: https://lore.kernel.org/r/20240522014013.1672962-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-06-11 14:18:43 -07:00
Yi Wang	e3c89f5dd1	KVM: x86: Don't re-setup empty IRQ routing when KVM_CAP_SPLIT_IRQCHIP Now that KVM sets up empty IRQ routing during VM creation, don't recreate empty routing during KVM_CAP_SPLIT_IRQCHIP. Setting IRQ routes during KVM_CAP_SPLIT_IRQCHIP can result in 20+ milliseconds of delay due to the synchronize_srcu_expedited() call in kvm_set_irq_routing(). Note, the empty routing is guaranteed to be intact as KVM x86 only allows changing the IRQ routing after an in-kernel IRQCHIP has been created, and KVM_CAP_SPLIT_IRQCHIP is disallowed after creating an IRQCHIP. Signed-off-by: Yi Wang <foxywang@tencent.com> Link: https://lore.kernel.org/r/20240506101751.3145407-3-foxywang@tencent.com [sean: massage changelog, remove unused empty_routing array] Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-06-11 14:18:40 -07:00
Sean Christopherson	3b65a692a5	KVM: x86/pmu: Add a helper to enable bits in FIXED_CTR_CTRL Add a helper, intel_pmu_enable_fixed_counter_bits(), to dedup code that enables fixed counter bits, i.e. when KVM clears bits in the reserved mask used to detect invalid MSR_CORE_PERF_FIXED_CTR_CTRL values. No functional change intended. Cc: Dapeng Mi <dapeng1.mi@linux.intel.com> Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com> Link: https://lore.kernel.org/r/20240608000819.3296176-1-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-06-11 09:35:58 -07:00
Thomas Prescher	85542adb65	KVM: x86: Add KVM_RUN_X86_GUEST_MODE kvm_run flag When a vCPU is interrupted by a signal while running a nested guest, KVM will exit to userspace with L2 state. However, userspace has no way to know whether it sees L1 or L2 state (besides calling KVM_GET_STATS_FD, which does not have a stable ABI). This causes multiple problems: The simplest one is L2 state corruption when userspace marks the sregs as dirty. See this mailing list thread [1] for a complete discussion. Another problem is that if userspace decides to continue by emulating instructions, it will unknowingly emulate with L2 state as if L1 doesn't exist, which can be considered a weird guest escape. Introduce a new flag, KVM_RUN_X86_GUEST_MODE, in the kvm_run data structure, which is set when the vCPU exited while running a nested guest. Also introduce a new capability, KVM_CAP_X86_GUEST_MODE, to advertise the functionality to userspace. [1] https://lore.kernel.org/kvm/20240416123558.212040-1-julian.stecklina@cyberus-technology.de/T/#m280aadcb2e10ae02c191a7dc4ed4b711a74b1f55 Signed-off-by: Thomas Prescher <thomas.prescher@cyberus-technology.de> Signed-off-by: Julian Stecklina <julian.stecklina@cyberus-technology.de> Link: https://lore.kernel.org/r/20240508132502.184428-1-julian.stecklina@cyberus-technology.de Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-06-11 09:24:31 -07:00
Sean Christopherson	1028893a73	KVM: x86: Bury guest_cpuid_is_amd_or_hygon() in cpuid.c Move guest_cpuid_is_amd_or_hygon() into cpuid.c now that, except for one Intel quirk in the emulator, KVM checks for AMD vs. Intel compatible vCPUs, not exact vendors, i.e. now that there should not be any reason for KVM at-large to care about the exact vendor. Opportunistically refactor the guts of the helper to use "entry" instead of "best", and short circuit the !entry path to make the common case more readable. Link: https://lore.kernel.org/r/20240405235603.1173076-11-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-06-10 14:29:39 -07:00
Sean Christopherson	bdaff4f92b	KVM: x86: Open code vendor_intel() in string_registers_quirk() Open code the is_guest_vendor_intel() check in string_registers_quirk() to discourage makiking exact vendor==Intel checks in the emulator, and to remove the rather awful #ifdeffery. The string quirk is literally the only Intel specific, non-architectural behavior that KVM emulates. All Intel specific behavior that is architecturally defined applies to all vendors that are compatible with Intel's architecture, i.e. should use guest_cpuid_is_intel_compatible(). Link: https://lore.kernel.org/r/20240405235603.1173076-10-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-06-10 14:29:39 -07:00
Sean Christopherson	4067c2395e	KVM: x86: Allow SYSENTER in Compatibility Mode for all Intel compat vCPUs Emulate SYSENTER in Compatibility Mode for all vCPUs models that are compatible with Intel's architecture, as the behavior if SYSENTER is architecturally defined in Intel's SDM, i.e. should be followed by any CPU that implements Intel's architecture. Link: https://lore.kernel.org/r/20240405235603.1173076-9-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-06-10 14:29:39 -07:00
Sean Christopherson	dc2b8b2b52	KVM: SVM: Emulate SYSENTER RIP/RSP behavior for all Intel compat vCPUs Emulate bits 63:32 of the SYSENTER_R{I,S}P MSRs for all vCPUs that are compatible with Intel's architecture, not just strictly vCPUs that have vendor==Intel. The behavior of bits 63:32 is architecturally defined in the SDM, i.e. not some uarch specific quirk of Intel CPUs. Link: https://lore.kernel.org/r/20240405235603.1173076-8-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-06-10 14:29:38 -07:00
Sean Christopherson	d99e4cb2ae	KVM: x86: Use "is Intel compatible" helper to emulate SYSCALL in !64-bit Use guest_cpuid_is_intel_compatible() to determine whether SYSCALL in 32-bit Protected Mode (including Compatibility Mode) should #UD or succeed. The existing code already does the exact equivalent of guest_cpuid_is_intel_compatible(), just in a rather roundabout way. No functional change intended. Link: https://lore.kernel.org/r/20240405235603.1173076-7-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-06-10 14:29:38 -07:00
Sean Christopherson	c092fc879f	KVM: x86: Inhibit code #DBs in MOV-SS shadow for all Intel compat vCPUs Treat code #DBs as inhibited in MOV/POP-SS shadows for vCPU models that are Intel compatible, not just strictly vCPUs with vendor==Intel. The behavior is explicitly called out in the SDM, and thus architectural, i.e. applies to all CPUs that implement Intel's architecture, and isn't a quirk that is unique to CPUs manufactured by Intel: However, if an instruction breakpoint is placed on an instruction located immediately after a POP SS/MOV SS instruction, the breakpoint will be suppressed as if EFLAGS.RF were 1. Applying the behavior strictly to Intel wasn't intentional, KVM simply didn't have a concept of "Intel compatible" as of commit `baf67ca8e5` ("KVM: x86: Suppress code #DBs on Intel if MOV/POP SS blocking is active"). Link: https://lore.kernel.org/r/20240405235603.1173076-6-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-06-10 14:29:38 -07:00
Sean Christopherson	6463e5e418	KVM: x86: Apply Intel's TSC_AUX reserved-bit behavior to Intel compat vCPUs Extend Intel's check on MSR_TSC_AUX[63:32] to all vCPU models that are Intel compatible, i.e. aren't AMD or Hygon in KVM's world, as the behavior is architectural, i.e. applies to any CPU that is compatible with Intel's architecture. Applying the behavior strictly to Intel wasn't intentional, KVM simply didn't have a concept of "Intel compatible" as of commit `61a05d444d` ("KVM: x86: Tie Intel and AMD behavior for MSR_TSC_AUX to guest CPU model"). Link: https://lore.kernel.org/r/20240405235603.1173076-5-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-06-10 14:29:38 -07:00
Sean Christopherson	5a4f8b3026	KVM: x86/pmu: Squash period for checkpointed events based on host HLE/RTM Zero out the sampling period for checkpointed events if the host supports HLE or RTM, i.e. supports transactions and thus checkpointed events, not based on whether the vCPU vendor model is Intel. Perf's refusal to allow a sample period for checkpointed events is based purely on whether or not the CPU supports HLE/RTM transactions, i.e. perf has no knowledge of the vCPU vendor model. Note, it is _extremely_ unlikely that the existing code is a problem in real world usage, as there are far, far bigger hurdles that would need to be cleared to support cross-vendor vPMUs. The motivation is mainly to eliminate the use of guest_cpuid_is_intel(), in order to get to a state where KVM pivots on AMD vs. Intel compatibility, i.e. doesn't check for exactly vendor==Intel except in rare circumstances (i.e. for CPU quirks). Cc: Like Xu <like.xu.linux@gmail.com> Link: https://lore.kernel.org/r/20240405235603.1173076-4-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-06-10 14:29:38 -07:00
Binbin Wu	d5989a3533	KVM: VMX: Remove unused declaration of vmx_request_immediate_exit() After commit `0ec3d6d1f1` "KVM: x86: Fully defer to vendor code to decide how to force immediate exit", vmx_request_immediate_exit() was removed. Commit `5f18c642ff` "KVM: VMX: Move out vmx_x86_ops to 'main.c' to dispatch VMX and TDX" added its declaration by accident. Remove it. Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com> Link: https://lore.kernel.org/r/20240506075025.2251131-1-binbin.wu@linux.intel.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-06-10 09:35:25 -07:00
Hou Wenlong	c7d4c5f019	KVM: x86: Drop unused check_apicv_inhibit_reasons() callback definition The check_apicv_inhibit_reasons() callback implementation was dropped in the commit `b3f257a846` ("KVM: x86: Track required APICv inhibits with variable, not callback"), but the definition removal was missed in the final version patch (it was removed in the v4). Therefore, it should be dropped, and the vmx_check_apicv_inhibit_reasons() function declaration should also be removed. Signed-off-by: Hou Wenlong <houwenlong.hwl@antgroup.com> Reviewed-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com> Link: https://lore.kernel.org/r/54abd1d0ccaba4d532f81df61259b9c0e021fbde.1714977229.git.houwenlong.hwl@antgroup.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-06-10 09:20:52 -07:00
Sean Christopherson	377b2f359d	KVM: VMX: Always honor guest PAT on CPUs that support self-snoop Unconditionally honor guest PAT on CPUs that support self-snoop, as Intel has confirmed that CPUs that support self-snoop always snoop caches and store buffers. I.e. CPUs with self-snoop maintain cache coherency even in the presence of aliased memtypes, thus there is no need to trust the guest behaves and only honor PAT as a last resort, as KVM does today. Honoring guest PAT is desirable for use cases where the guest has access to non-coherent DMA _without_ bouncing through VFIO, e.g. when a virtual (mediated, for all intents and purposes) GPU is exposed to the guest, along with buffers that are consumed directly by the physical GPU, i.e. which can't be proxied by the host to ensure writes from the guest are performed with the correct memory type for the GPU. Cc: Yiwei Zhang <zzyiwei@google.com> Suggested-by: Yan Zhao <yan.y.zhao@intel.com> Suggested-by: Kevin Tian <kevin.tian@intel.com> Tested-by: Xiangfei Ma <xiangfeix.ma@intel.com> Tested-by: Yongwei Ma <yongwei.ma@intel.com> Link: https://lore.kernel.org/r/20240309010929.1403984-6-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-06-07 07:18:03 -07:00
Yan Zhao	65a4de0ffd	KVM: x86: Ensure a full memory barrier is emitted in the VM-Exit path Ensure a full memory barrier is emitted in the VM-Exit path, as a full barrier is required on Intel CPUs to evict WC buffers. This will allow unconditionally honoring guest PAT on Intel CPUs that support self-snoop. As srcu_read_lock() is always called in the VM-Exit path and it internally has a smp_mb(), call smp_mb__after_srcu_read_lock() to avoid adding a second fence and make sure smp_mb() is called without dependency on implementation details of srcu_read_lock(). Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Sean Christopherson <seanjc@google.com> Cc: Kevin Tian <kevin.tian@intel.com> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com> [sean: massage changelog] Tested-by: Xiangfei Ma <xiangfeix.ma@intel.com> Tested-by: Yongwei Ma <yongwei.ma@intel.com> Link: https://lore.kernel.org/r/20240309010929.1403984-5-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-06-07 07:18:02 -07:00
Sean Christopherson	e1548088ff	KVM: VMX: Drop support for forcing UC memory when guest CR0.CD=1 Drop KVM's emulation of CR0.CD=1 on Intel CPUs now that KVM no longer honors guest MTRR memtypes, as forcing UC memory for VMs with non-coherent DMA only makes sense if the guest is using something other than PAT to configure the memtype for the DMA region. Furthermore, KVM has forced WB memory for CR0.CD=1 since commit `fb279950ba` ("KVM: vmx: obey KVM_QUIRK_CD_NW_CLEARED"), and no known VMM in existence disables KVM_X86_QUIRK_CD_NW_CLEARED, let alone does so with non-coherent DMA. Lastly, commit `fb279950ba` ("KVM: vmx: obey KVM_QUIRK_CD_NW_CLEARED") was from the same author as commit `b18d5431ac` ("KVM: x86: fix CR0.CD virtualization"), and followed by a mere month. I.e. forcing UC memory was likely the result of code inspection or perhaps misdiagnosed failures, and not the necessitate by a concrete use case. Update KVM's documentation to note that KVM_X86_QUIRK_CD_NW_CLEARED is now AMD-only, and to take an erratum for lack of CR0.CD virtualization on Intel. Tested-by: Xiangfei Ma <xiangfeix.ma@intel.com> Tested-by: Yongwei Ma <yongwei.ma@intel.com> Link: https://lore.kernel.org/r/20240309010929.1403984-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-06-05 08:13:14 -07:00
Sean Christopherson	0a7b73559b	KVM: x86: Remove VMX support for virtualizing guest MTRR memtypes Remove KVM's support for virtualizing guest MTRR memtypes, as full MTRR adds no value, negatively impacts guest performance, and is a maintenance burden due to it's complexity and oddities. KVM's approach to virtualizating MTRRs make no sense, at all. KVM only honors guest MTRR memtypes if EPT is enabled and the guest has a device that may perform non-coherent DMA access. From a hardware virtualization perspective of guest MTRRs, there is _nothing_ special about EPT. Legacy shadowing paging doesn't magically account for guest MTRRs, nor does NPT. Unwinding and deciphering KVM's murky history, the MTRR virtualization code appears to be the result of misdiagnosed issues when EPT + VT-d with passthrough devices was enabled years and years ago. And importantly, the underlying bugs that were fudged around by honoring guest MTRR memtypes have since been fixed (though rather poorly in some cases). The zapping GFNs logic in the MTRR virtualization code came from: commit `efdfe536d8` Author: Xiao Guangrong <guangrong.xiao@linux.intel.com> Date: Wed May 13 14:42:27 2015 +0800 KVM: MMU: fix MTRR update Currently, whenever guest MTRR registers are changed kvm_mmu_reset_context is called to switch to the new root shadow page table, however, it's useless since: 1) the cache type is not cached into shadow page's attribute so that the original root shadow page will be reused 2) the cache type is set on the last spte, that means we should sync the last sptes when MTRR is changed This patch fixs this issue by drop all the spte in the gfn range which is being updated by MTRR which was a fix for: commit `0bed3b568b` Author: Sheng Yang <sheng@linux.intel.com> AuthorDate: Thu Oct 9 16:01:54 2008 +0800 Commit: Avi Kivity <avi@redhat.com> CommitDate: Wed Dec 31 16:51:44 2008 +0200 KVM: Improve MTRR structure As well as reset mmu context when set MTRR. which was part of a "MTRR/PAT support for EPT" series that also added: + if (mt_mask) { + mt_mask = get_memory_type(vcpu, gfn) << + kvm_x86_ops->get_mt_mask_shift(); + spte \|= mt_mask; + } where get_memory_type() was a truly gnarly helper to retrieve the guest MTRR memtype for a given memtype. And very subtly, at the time of that change, KVM always set VMX_EPT_IGMT_BIT, kvm_mmu_set_base_ptes(VMX_EPT_READABLE_MASK \| VMX_EPT_WRITABLE_MASK \| VMX_EPT_DEFAULT_MT << VMX_EPT_MT_EPTE_SHIFT \| VMX_EPT_IGMT_BIT); which came in via: commit `928d4bf747` Author: Sheng Yang <sheng@linux.intel.com> AuthorDate: Thu Nov 6 14:55:45 2008 +0800 Commit: Avi Kivity <avi@redhat.com> CommitDate: Tue Nov 11 21:00:37 2008 +0200 KVM: VMX: Set IGMT bit in EPT entry There is a potential issue that, when guest using pagetable without vmexit when EPT enabled, guest would use PAT/PCD/PWT bits to index PAT msr for it's memory, which would be inconsistent with host side and would cause host MCE due to inconsistent cache attribute. The patch set IGMT bit in EPT entry to ignore guest PAT and use WB as default memory type to protect host (notice that all memory mapped by KVM should be WB). Note the CommitDates! The AuthorDates strongly suggests Sheng Yang added the whole "ignoreIGMT things as a bug fix for issues that were detected during EPT + VT-d + passthrough enabling, but it was applied earlier because it was a generic fix. Jumping back to `0bed3b568b` ("KVM: Improve MTRR structure"), the other relevant code, or rather lack thereof, is the handling of host MMIO. That fix came in a bit later, but given the author and timing, it's safe to say it was all part of the same EPT+VT-d enabling mess. commit `2aaf69dcee` Author: Sheng Yang <sheng@linux.intel.com> AuthorDate: Wed Jan 21 16:52:16 2009 +0800 Commit: Avi Kivity <avi@redhat.com> CommitDate: Sun Feb 15 02:47:37 2009 +0200 KVM: MMU: Map device MMIO as UC in EPT Software are not allow to access device MMIO using cacheable memory type, the patch limit MMIO region with UC and WC(guest can select WC using PAT and PCD/PWT). In addition to the host MMIO and IGMT issues, KVM's MTRR virtualization was obviously never tested on NPT until much later, which lends further credence to the theory/argument that this was all the result of misdiagnosed issues. Discussion from the EPT+MTRR enabling thread[] more or less confirms that Sheng Yang was trying to resolve issues with passthrough MMIO. Sheng Yang : Do you mean host(qemu) would access this memory and if we set it to guest : MTRR, host access would be broken? We would cover this in our shadow MTRR : patch, for we encountered this in video ram when doing some experiment with : VGA assignment. And in the same thread, there's also what appears to be confirmation of Intel running into issues with Windows XP related to a guest device driver mapping DMA with WC in the PAT. * Avi Kavity : Sheng Yang wrote: : > Yes... But it's easy to do with assigned devices' mmio, but what if guest : > specific some non-mmio memory's memory type? E.g. we have met one issue in : > Xen, that a assigned-device's XP driver specific one memory region as buffer, : > and modify the memory type then do DMA. : > : > Only map MMIO space can be first step, but I guess we can modify assigned : > memory region memory type follow guest's? : > : : With ept/npt, we can't, since the memory type is in the guest's : pagetable entries, and these are not accessible. [] https://lore.kernel.org/all/1223539317-32379-1-git-send-email-sheng@linux.intel.com So, for the most part, what likely happened is that 15 years ago, a few engineers (a) fixed a #MC problem by ignoring guest PAT and (b) initially "fixed" passthrough device MMIO by emulating guest* MTRRs. Except for the below case, everything since then has been a result of those two intertwined changes. The one exception, which is actually yet more confirmation of all of the above, is the revert of Paolo's attempt at "full" virtualization of guest MTRRs: commit `606decd670` Author: Paolo Bonzini <pbonzini@redhat.com> Date: Thu Oct 1 13:12:47 2015 +0200 Revert "KVM: x86: apply guest MTRR virtualization on host reserved pages" This reverts commit `fd717f1101`. It was reported to cause Machine Check Exceptions (bug 104091). ... commit `fd717f1101` Author: Paolo Bonzini <pbonzini@redhat.com> Date: Tue Jul 7 14:38:13 2015 +0200 KVM: x86: apply guest MTRR virtualization on host reserved pages Currently guest MTRR is avoided if kvm_is_reserved_pfn returns true. However, the guest could prefer a different page type than UC for such pages. A good example is that pass-throughed VGA frame buffer is not always UC as host expected. This patch enables full use of virtual guest MTRRs. I.e. Paolo tried to add back KVM's behavior before "Map device MMIO as UC in EPT" and got the same result: machine checks, likely due to the guest MTRRs not being trustworthy/sane at all times. Note, Paolo also tried to enable MTRR virtualization on SVM+NPT, but that too got reverted. Unfortunately, it doesn't appear that anyone ever found a smoking gun, i.e. exactly why emulating guest MTRRs via NPT PAT caused extremely slow boot times doesn't appear to have a definitive root cause. commit `fc07e76ac7` Author: Paolo Bonzini <pbonzini@redhat.com> Date: Thu Oct 1 13:20:22 2015 +0200 Revert "KVM: SVM: use NPT page attributes" This reverts commit `3c2e7f7de3`. Initializing the mapping from MTRR to PAT values was reported to fail nondeterministically, and it also caused extremely slow boot (due to caching getting disabled---bug 103321) with assigned devices. ... commit `3c2e7f7de3` Author: Paolo Bonzini <pbonzini@redhat.com> Date: Tue Jul 7 14:32:17 2015 +0200 KVM: SVM: use NPT page attributes Right now, NPT page attributes are not used, and the final page attribute depends solely on gPAT (which however is not synced correctly), the guest MTRRs and the guest page attributes. However, we can do better by mimicking what is done for VMX. In the absence of PCI passthrough, the guest PAT can be ignored and the page attributes can be just WB. If passthrough is being used, instead, keep respecting the guest PAT, and emulate the guest MTRRs through the PAT field of the nested page tables. The only snag is that WP memory cannot be emulated correctly, because Linux's default PAT setting only includes the other types. In short, honoring guest MTRRs for VMX was initially a workaround of sorts for KVM ignoring guest PAT and for KVM not forcing UC for host MMIO. And while there are known cases where honoring guest MTRRs is desirable, e.g. passthrough VGA frame buffers, the desired behavior in that case is to get WC instead of UC, i.e. at this point it's for performance, not correctness. Furthermore, the complete absence of MTRR virtualization on NPT and shadow paging proves that, while KVM theoretically can do better, it's by no means necessary for correctnesss. Lastly, since kernels mostly rely on firmware to do MTRR setup, and the host typically provides guest firmware, honoring guest MTRRs is effectively honoring host userspace memtypes, which is also backwards. I.e. it would be far better for host userspace to communicate its desired memtype directly to KVM (or perhaps indirectly via VMAs in the host kernel), not through guest MTRRs. Tested-by: Xiangfei Ma <xiangfeix.ma@intel.com> Tested-by: Yongwei Ma <yongwei.ma@intel.com> Link: https://lore.kernel.org/r/20240309010929.1403984-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-06-05 08:13:14 -07:00
Alejandro Jimenez	f992572120	KVM: x86: Keep consistent naming for APICv/AVIC inhibit reasons Keep kvm_apicv_inhibit enum naming consistent with the current pattern by renaming the reason/enumerator defined as APICV_INHIBIT_REASON_DISABLE to APICV_INHIBIT_REASON_DISABLED. No functional change intended. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com> Link: https://lore.kernel.org/r/20240506225321.3440701-3-alejandro.j.jimenez@oracle.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-06-05 06:18:28 -07:00
Alejandro Jimenez	69148ccec6	KVM: x86: Print names of apicv inhibit reasons in traces Use the tracing infrastructure helper __print_flags() for printing flag bitfields, to enhance the trace output by displaying a string describing each of the inhibit reasons set. The kvm_apicv_inhibit_changed tracepoint currently shows the raw bitmap value, requiring the user to consult the source file where the inhibit reasons are defined to decode the trace output. Signed-off-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com> Reviewed-by: Vasant Hegde <vasant.hegde@amd.com> Link: https://lore.kernel.org/r/20240506225321.3440701-2-alejandro.j.jimenez@oracle.com Co-developed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-06-05 06:18:27 -07:00
Isaku Yamahata	6fef518594	KVM: x86: Add a capability to configure bus frequency for APIC timer Add KVM_CAP_X86_APIC_BUS_CYCLES_NS capability to configure the APIC bus clock frequency for APIC timer emulation. Allow KVM_ENABLE_CAPABILITY(KVM_CAP_X86_APIC_BUS_CYCLES_NS) to set the frequency in nanoseconds. When using this capability, the user space VMM should configure CPUID leaf 0x15 to advertise the frequency. Vishal reported that the TDX guest kernel expects a 25MHz APIC bus frequency but ends up getting interrupts at a significantly higher rate. The TDX architecture hard-codes the core crystal clock frequency to 25MHz and mandates exposing it via CPUID leaf 0x15. The TDX architecture does not allow the VMM to override the value. In addition, per Intel SDM: "The APIC timer frequency will be the processor’s bus clock or core crystal clock frequency (when TSC/core crystal clock ratio is enumerated in CPUID leaf 0x15) divided by the value specified in the divide configuration register." The resulting 25MHz APIC bus frequency conflicts with the KVM hardcoded APIC bus frequency of 1GHz. The KVM doesn't enumerate CPUID leaf 0x15 to the guest unless the user space VMM sets it using KVM_SET_CPUID. If the CPUID leaf 0x15 is enumerated, the guest kernel uses it as the APIC bus frequency. If not, the guest kernel measures the frequency based on other known timers like the ACPI timer or the legacy PIT. As reported by Vishal the TDX guest kernel expects a 25MHz timer frequency but gets timer interrupt more frequently due to the 1GHz frequency used by KVM. To ensure that the guest doesn't have a conflicting view of the APIC bus frequency, allow the userspace to tell KVM to use the same frequency that TDX mandates instead of the default 1Ghz. Reported-by: Vishal Annapurve <vannapurve@google.com> Closes: https://lore.kernel.org/lkml/20231006011255.4163884-1-vannapurve@google.com Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Co-developed-by: Reinette Chatre <reinette.chatre@intel.com> Signed-off-by: Reinette Chatre <reinette.chatre@intel.com> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Reviewed-by: Yuan Yao <yuan.yao@intel.com> Link: https://lore.kernel.org/r/6748a4c12269e756f0c48680da8ccc5367c31ce7.1714081726.git.reinette.chatre@intel.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-06-05 06:18:27 -07:00
Isaku Yamahata	b460256b16	KVM: x86: Make nanoseconds per APIC bus cycle a VM variable Introduce the VM variable "nanoseconds per APIC bus cycle" in preparation to make the APIC bus frequency configurable. The TDX architecture hard-codes the core crystal clock frequency to 25MHz and mandates exposing it via CPUID leaf 0x15. The TDX architecture does not allow the VMM to override the value. In addition, per Intel SDM: "The APIC timer frequency will be the processor’s bus clock or core crystal clock frequency (when TSC/core crystal clock ratio is enumerated in CPUID leaf 0x15) divided by the value specified in the divide configuration register." The resulting 25MHz APIC bus frequency conflicts with the KVM hardcoded APIC bus frequency of 1GHz. Introduce the VM variable "nanoseconds per APIC bus cycle" to prepare for allowing userspace to tell KVM to use the frequency that TDX mandates instead of the default 1Ghz. Doing so ensures that the guest doesn't have a conflicting view of the APIC bus frequency. Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com> [reinette: rework changelog] Signed-off-by: Reinette Chatre <reinette.chatre@intel.com> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Link: https://lore.kernel.org/r/ae75ce37c6c38bb4efd10a0a41932984c40b24ac.1714081726.git.reinette.chatre@intel.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-06-05 06:18:26 -07:00
Isaku Yamahata	f9e1cbf180	KVM: x86: hyper-v: Calculate APIC bus frequency for Hyper-V Remove APIC_BUS_FREQUENCY and calculate it based on nanoseconds per APIC bus cycle. APIC_BUS_FREQUENCY is used only for HV_X64_MSR_APIC_FREQUENCY. The MSR is not frequently read, calculate it every time. There are two constants related to the APIC bus frequency: APIC_BUS_FREQUENCY and APIC_BUS_CYCLE_NS. Only one value is required because one can be calculated from the other: APIC_BUS_CYCLES_NS = 1000 * 1000 * 1000 / APIC_BUS_FREQUENCY. Remove APIC_BUS_FREQUENCY and instead calculate it when needed. This prepares for support of configurable APIC bus frequency by requiring to change only a single variable. Suggested-by: Maxim Levitsky <maximlevitsky@gmail.com> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> Reviewed-by: Maxim Levitsky <maximlevitsky@gmail.com> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com> [reinette: rework changelog] Signed-off-by: Reinette Chatre <reinette.chatre@intel.com> Link: https://lore.kernel.org/r/76a659d0898e87ebd73ee7c922f984a87a6ab370.1714081726.git.reinette.chatre@intel.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-06-05 06:18:25 -07:00
Ravi Bangoria	f99b052256	KVM: SNP: Fix LBR Virtualization for SNP guest SEV-ES and thus SNP guest mandates LBR Virtualization to be _always_ ON. Although commit `b7e4be0a22` ("KVM: SEV-ES: Delegate LBR virtualization to the processor") did the correct change for SEV-ES guests, it missed the SNP. Fix it. Reported-by: Srikanth Aithal <sraithal@amd.com> Fixes: `b7e4be0a22` ("KVM: SEV-ES: Delegate LBR virtualization to the processor") Signed-off-by: Ravi Bangoria <ravi.bangoria@amd.com> Message-ID: <20240605114810.1304-1-ravi.bangoria@amd.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2024-06-05 07:50:50 -04:00
Tao Su	db574f2f96	KVM: x86/mmu: Don't save mmu_invalidate_seq after checking private attr Drop the second snapshot of mmu_invalidate_seq in kvm_faultin_pfn(). Before checking the mismatch of private vs. shared, mmu_invalidate_seq is saved to fault->mmu_seq, which can be used to detect an invalidation related to the gfn occurred, i.e. KVM will not install a mapping in page table if fault->mmu_seq != mmu_invalidate_seq. Currently there is a second snapshot of mmu_invalidate_seq, which may not be same as the first snapshot in kvm_faultin_pfn(), i.e. the gfn attribute may be changed between the two snapshots, but the gfn may be mapped in page table without hindrance. Therefore, drop the second snapshot as it has no obvious benefits. Fixes: `f6adeae81f` ("KVM: x86/mmu: Handle no-slot faults at the beginning of kvm_faultin_pfn()") Signed-off-by: Tao Su <tao1.su@linux.intel.com> Message-ID: <20240528102234.2162763-1-tao1.su@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2024-06-05 06:45:06 -04:00
Li RongQing	99a49093ce	KVM: SVM: Consider NUMA affinity when allocating per-CPU save_area save_area of per-CPU svm_data are dominantly accessed from their own local CPUs, so allocate them node-local for performance reason so rename __snp_safe_alloc_page as snp_safe_alloc_page_node which accepts numa node id as input parameter, svm_cpu_init call it with node id switched from cpu id Signed-off-by: Li RongQing <lirongqing@baidu.com> Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com> Link: https://lore.kernel.org/r/20240520120858.13117-4-lirongqing@baidu.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-06-03 16:14:11 -07:00
Li RongQing	9f44286d77	KVM: SVM: not account memory allocation for per-CPU svm_data The allocation for the per-CPU save area in svm_cpu_init shouldn't be accounted, So introduce __snp_safe_alloc_page helper, which has gfp flag as input, svm_cpu_init calls __snp_safe_alloc_page with GFP_KERNEL, snp_safe_alloc_page calls __snp_safe_alloc_page with GFP_KERNEL_ACCOUNT as input Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Li RongQing <lirongqing@baidu.com> Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com> Link: https://lore.kernel.org/r/20240520120858.13117-3-lirongqing@baidu.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-06-03 16:14:11 -07:00
Li RongQing	f51af34686	KVM: SVM: remove useless input parameter in snp_safe_alloc_page The input parameter 'vcpu' in snp_safe_alloc_page is not used. Therefore, remove it. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Li RongQing <lirongqing@baidu.com> Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com> Link: https://lore.kernel.org/r/20240520120858.13117-2-lirongqing@baidu.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-06-03 16:14:11 -07:00
Dapeng Mi	75430c412a	KVM: x86/pmu: Manipulate FIXED_CTR_CTRL MSR with macros Magic numbers are used to manipulate the bit fields of FIXED_CTR_CTRL MSR. This makes reading code become difficult, so use pre-defined macros to replace these magic numbers. Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com> Link: https://lore.kernel.org/r/20240430005239.13527-3-dapeng1.mi@linux.intel.com [sean: drop unnecessary curly braces] Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-06-03 14:25:22 -07:00
Dapeng Mi	0e102ce3d4	KVM: x86/pmu: Change ambiguous _mask suffix to _rsvd in kvm_pmu Several '_mask' suffixed variables such as, global_ctrl_mask, are defined in kvm_pmu structure. However the _mask suffix is ambiguous and misleading since it's not a real mask with positive logic. On the contrary it represents the reserved bits of corresponding MSRs and these bits should not be accessed. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com> Link: https://lore.kernel.org/r/20240430005239.13527-2-dapeng1.mi@linux.intel.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-06-03 14:23:14 -07:00
Hou Wenlong	9ecc1c119b	KVM: x86/mmu: Only allocate shadowed translation cache for sp->role.level <= KVM_MAX_HUGEPAGE_LEVEL Only the indirect SP with sp->role.level <= KVM_MAX_HUGEPAGE_LEVEL might have leaf gptes, so allocation of shadowed translation cache is needed only for it. Then, it can use sp->shadowed_translation to determine whether to use the information in the shadowed translation cache or not. Also, extend the WARN in FNAME(sync_spte)() to ensure that this won't break shadow_mmu_get_sp_for_split(). Suggested-by: Lai Jiangshan <jiangshan.ljs@antgroup.com> Signed-off-by: Hou Wenlong <houwenlong.hwl@antgroup.com> Link: https://lore.kernel.org/r/5b0cda8a7456cda476b14fca36414a56f921dd52.1715398655.git.houwenlong.hwl@antgroup.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-06-03 14:06:39 -07:00
Liang Chen	4f8973e65f	KVM: x86: invalid_list not used anymore in mmu_shrink_scan 'invalid_list' is now gathered in KVM_MMU_ZAP_OLDEST_MMU_PAGES. Signed-off-by: Liang Chen <liangchen.linux@gmail.com> Link: https://lore.kernel.org/r/20240509044710.18788-1-liangchen.linux@gmail.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-06-03 10:55:15 -07:00
Paolo Bonzini	ab978c62e7	Merge branch 'kvm-6.11-sev-snp' into HEAD Pull base x86 KVM support for running SEV-SNP guests from Michael Roth: * add some basic infrastructure and introduces a new KVM_X86_SNP_VM vm_type to handle differences versus the existing KVM_X86_SEV_VM and KVM_X86_SEV_ES_VM types. * implement the KVM API to handle the creation of a cryptographic launch context, encrypt/measure the initial image into guest memory, and finalize it before launching it. * implement handling for various guest-generated events such as page state changes, onlining of additional vCPUs, etc. * implement the gmem/mmu hooks needed to prepare gmem-allocated pages before mapping them into guest private memory ranges as well as cleaning them up prior to returning them to the host for use as normal memory. Because those cleanup hooks supplant certain activities like issuing WBINVDs during KVM MMU invalidations, avoid duplicating that work to avoid unecessary overhead. This merge leaves out support support for attestation guest requests and for loading the signing keys to be used for attestation requests.	2024-06-03 13:19:46 -04:00
Paolo Bonzini	b3233c737e	Merge branch 'kvm-fixes-6.10-1' into HEAD * Fixes and debugging help for the #VE sanity check. Also disable it by default, even for CONFIG_DEBUG_KERNEL, because it was found to trigger spuriously (most likely a processor erratum as the exact symptoms vary by generation). * Avoid WARN() when two NMIs arrive simultaneously during an NMI-disabled situation (GIF=0 or interrupt shadow) when the processor supports virtual NMI. While generally KVM will not request an NMI window when virtual NMIs are supported, in this case it does have to single-step over the interrupt shadow or enable the STGI intercept, in order to deliver the latched second NMI. * Drop support for hand tuning APIC timer advancement from userspace. Since we have adaptive tuning, and it has proved to work well, drop the module parameter for manual configuration and with it a few stupid bugs that it had.	2024-06-03 13:18:08 -04:00
Paolo Bonzini	f9d1b541d0	Merge branch 'kvm-fixes-6.10-1' into HEAD * Fixes and debugging help for the #VE sanity check. Also disable it by default, even for CONFIG_DEBUG_KERNEL, because it was found to trigger spuriously (most likely a processor erratum as the exact symptoms vary by generation). * Avoid WARN() when two NMIs arrive simultaneously during an NMI-disabled situation (GIF=0 or interrupt shadow) when the processor supports virtual NMI. While generally KVM will not request an NMI window when virtual NMIs are supported, in this case it does have to single-step over the interrupt shadow or enable the STGI intercept, in order to deliver the latched second NMI. * Drop support for hand tuning APIC timer advancement from userspace. Since we have adaptive tuning, and it has proved to work well, drop the module parameter for manual configuration and with it a few stupid bugs that it had.	2024-06-03 13:10:11 -04:00
Sean Christopherson	89a58812c4	KVM: x86: Drop support for hand tuning APIC timer advancement from userspace Remove support for specifying a static local APIC timer advancement value, and instead present a read-only boolean parameter to let userspace enable or disable KVM's dynamic APIC timer advancement. Realistically, it's all but impossible for userspace to specify an advancement that is more precise than what KVM's adaptive tuning can provide. E.g. a static value needs to be tuned for the exact hardware and kernel, and if KVM is using hrtimers, likely requires additional tuning for the exact configuration of the entire system. Dropping support for a userspace provided value also fixes several flaws in the interface. E.g. KVM interprets a negative value other than -1 as a large advancement, toggling between a negative and positive value yields unpredictable behavior as vCPUs will switch from dynamic to static advancement, changing the advancement in the middle of VM creation can result in different values for vCPUs within a VM, etc. Those flaws are mostly fixable, but there's almost no justification for taking on yet more complexity (it's minimal complexity, but still non-zero). The only arguments against using KVM's adaptive tuning is if a setup needs a higher maximum, or if the adjustments are too reactive, but those are arguments for letting userspace control the absolute max advancement and the granularity of each adjustment, e.g. similar to how KVM provides knobs for halt polling. Link: https://lore.kernel.org/all/20240520115334.852510-1-zhoushuling@huawei.com Cc: Shuling Zhou <zhoushuling@huawei.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-ID: <20240522010304.1650603-1-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2024-06-03 13:08:05 -04:00
Ravi Bangoria	b7e4be0a22	KVM: SEV-ES: Delegate LBR virtualization to the processor As documented in APM[1], LBR Virtualization must be enabled for SEV-ES guests. Although KVM currently enforces LBRV for SEV-ES guests, there are multiple issues with it: o MSR_IA32_DEBUGCTLMSR is still intercepted. Since MSR_IA32_DEBUGCTLMSR interception is used to dynamically toggle LBRV for performance reasons, this can be fatal for SEV-ES guests. For ex SEV-ES guest on Zen3: [guest ~]# wrmsr 0x1d9 0x4 KVM: entry failed, hardware error 0xffffffff EAX=00000004 EBX=00000000 ECX=000001d9 EDX=00000000 Fix this by never intercepting MSR_IA32_DEBUGCTLMSR for SEV-ES guests. No additional save/restore logic is required since MSR_IA32_DEBUGCTLMSR is of swap type A. o KVM will disable LBRV if userspace sets MSR_IA32_DEBUGCTLMSR before the VMSA is encrypted. Fix this by moving LBRV enablement code post VMSA encryption. [1]: AMD64 Architecture Programmer's Manual Pub. 40332, Rev. 4.07 - June 2023, Vol 2, 15.35.2 Enabling SEV-ES. https://bugzilla.kernel.org/attachment.cgi?id=304653 Fixes: `376c6d2850` ("KVM: SVM: Provide support for SEV-ES vCPU creation/loading") Co-developed-by: Nikunj A Dadhania <nikunj@amd.com> Signed-off-by: Nikunj A Dadhania <nikunj@amd.com> Signed-off-by: Ravi Bangoria <ravi.bangoria@amd.com> Message-ID: <20240531044644.768-4-ravi.bangoria@amd.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2024-06-03 13:07:18 -04:00
Ravi Bangoria	d922056215	KVM: SEV-ES: Disallow SEV-ES guests when X86_FEATURE_LBRV is absent As documented in APM[1], LBR Virtualization must be enabled for SEV-ES guests. So, prevent SEV-ES guests when LBRV support is missing. [1]: AMD64 Architecture Programmer's Manual Pub. 40332, Rev. 4.07 - June 2023, Vol 2, 15.35.2 Enabling SEV-ES. https://bugzilla.kernel.org/attachment.cgi?id=304653 Fixes: `376c6d2850` ("KVM: SVM: Provide support for SEV-ES vCPU creation/loading") Signed-off-by: Ravi Bangoria <ravi.bangoria@amd.com> Message-ID: <20240531044644.768-3-ravi.bangoria@amd.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2024-06-03 13:06:48 -04:00
Nikunj A Dadhania	27bd5fdc24	KVM: SEV-ES: Prevent MSR access post VMSA encryption KVM currently allows userspace to read/write MSRs even after the VMSA is encrypted. This can cause unintentional issues if MSR access has side- effects. For ex, while migrating a guest, userspace could attempt to migrate MSR_IA32_DEBUGCTLMSR and end up unintentionally disabling LBRV on the target. Fix this by preventing access to those MSRs which are context switched via the VMSA, once the VMSA is encrypted. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Nikunj A Dadhania <nikunj@amd.com> Signed-off-by: Ravi Bangoria <ravi.bangoria@amd.com> Message-ID: <20240531044644.768-2-ravi.bangoria@amd.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2024-06-03 13:06:48 -04:00
Tom Lendacky	b2ec042347	KVM: SVM: Remove the need to trigger an UNBLOCK event on AP creation All SNP APs are initially started using the APIC INIT/SIPI sequence in the guest. This sequence moves the AP MP state from KVM_MP_STATE_UNINITIALIZED to KVM_MP_STATE_RUNNABLE, so there is no need to attempt the UNBLOCK. As it is, the UNBLOCK support in SVM is only enabled when AVIC is enabled. When AVIC is disabled, AP creation is still successful. Remove the KVM_REQ_UNBLOCK request from the AP creation code and revert the changes to the vcpu_unblocking() kvm_x86_ops path. Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com> Signed-off-by: Michael Roth <michael.roth@amd.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2024-06-03 12:38:17 -04:00
Paolo Bonzini	73137f5924	KVM: SEV: Don't WARN() if RMP lookup fails when invalidating gmem pages The hook only handles cleanup work specific to SNP, e.g. RMP table entries and flushing caches for encrypted guest memory. When run on a non-SNP-enabled host (currently only possible using KVM_X86_SW_PROTECTED_VM, e.g. via KVM selftests), the callback is a noop and will WARN due to the RMP table not being present. It's actually expected in this case that the RMP table wouldn't be present and that the hook should be a noop, so drop the WARN_ONCE(). Reported-by: Sean Christopherson <seanjc@google.com> Closes: https://lore.kernel.org/kvm/ZkU3_y0UoPk5yAeK@google.com/ Fixes: `8eb01900b0` ("KVM: SEV: Implement gmem hook for invalidating private pages") Suggested-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Michael Roth <michael.roth@amd.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2024-06-03 12:38:14 -04:00
Michael Roth	febff040b1	KVM: SEV: Automatically switch reclaimed pages to shared Currently there's a consistent pattern of always calling host_rmp_make_shared() immediately after snp_page_reclaim(), so go ahead and handle it automatically as part of snp_page_reclaim(). Also rename it to kvm_rmp_make_shared() to more easily distinguish it as a KVM-specific variant of the more generic rmp_make_shared() helper. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Michael Roth <michael.roth@amd.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2024-06-03 12:36:52 -04:00
Tony Luck	0c468a6a02	KVM: VMX: Switch to new Intel CPU model infrastructure Use x86_vfm (vendor, family, module) to detect CPUs that are affected by PERF_GLOBAL_CTRL bugs instead of manually checking the family and model. The new VFM infrastructure encodes all information in one handy location. No functional change intended. Signed-off-by: Tony Luck <tony.luck@intel.com> Link: https://lore.kernel.org/r/20240520224620.9480-10-tony.luck@intel.com [sean: massage changelog] Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-06-03 09:10:43 -07:00
Tony Luck	8387435beb	KVM: x86/pmu: Switch to new Intel CPU model defines Use X86_MATCH_VFM(), which does Vendor checking in addition to Family and Model checking, to do FMS-based detection of PEBS features. No functional change intended. Signed-off-by: Tony Luck <tony.luck@intel.com> Link: https://lore.kernel.org/r/20240520224620.9480-9-tony.luck@intel.com [sean: massage changelog] Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-06-03 09:10:34 -07:00
Jim Mattson	ea19f7d0bf	KVM: x86: Remove IA32_PERF_GLOBAL_OVF_CTRL from KVM_GET_MSR_INDEX_LIST This MSR reads as 0, and any host-initiated writes are ignored, so there's no reason to enumerate it in KVM_GET_MSR_INDEX_LIST. Signed-off-by: Jim Mattson <jmattson@google.com> Link: https://lore.kernel.org/r/20231113184854.2344416-1-jmattson@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-06-03 09:00:30 -07:00
Sean Christopherson	82897db912	KVM: x86: Move shadow_phys_bits into "kvm_host", as "maxphyaddr" Move shadow_phys_bits into "struct kvm_host_values", i.e. into KVM's global "kvm_host" variable, so that it is automatically exported for use in vendor modules. Rename the variable/field to maxphyaddr to more clearly capture what value it holds, now that it's used outside of the MMU (and because the "shadow" part is more than a bit misleading as the variable is not at all unique to shadow paging). Recomputing the raw/true host.MAXPHYADDR on every use can be subtly expensive, e.g. it will incur a VM-Exit on the CPUID if KVM is running as a nested hypervisor. Vendor code already has access to the information, e.g. by directly doing CPUID or by invoking kvm_get_shadow_phys_bits(), so there's no tangible benefit to making it MMU-only. Link: https://lore.kernel.org/r/20240423221521.2923759-5-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-06-03 08:58:55 -07:00
Sean Christopherson	c043eaaa6b	KVM: x86/mmu: Snapshot shadow_phys_bits when kvm.ko is loaded Snapshot shadow_phys_bits when kvm.ko is loaded, not when a vendor module is loaded, to guard against usage of shadow_phys_bits before it is initialized. The computation isn't vendor specific in any way, i.e. there there is no reason to wait to snapshot the value until a vendor module is loaded, nor is there any reason to recompute the value every time a vendor module is loaded. Opportunistically convert it from "read mostly" to "read-only after init". Link: https://lore.kernel.org/r/20240423221521.2923759-4-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-06-03 08:58:55 -07:00
Sean Christopherson	52c47f5897	KVM: SVM: Use KVM's snapshot of the host's XCR0 for SEV-ES host state Use KVM's snapshot of the host's XCR0 when stuffing SEV-ES host state instead of reading XCR0 from hardware. XCR0 is only written during boot, i.e. won't change while KVM is running (and KVM at large is hosed if that doesn't hold true). Link: https://lore.kernel.org/r/20240423221521.2923759-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-06-03 08:58:54 -07:00
Sean Christopherson	7974c0643e	KVM: x86: Add a struct to consolidate host values, e.g. EFER, XCR0, etc... Add "struct kvm_host_values kvm_host" to hold the various host values that KVM snapshots during initialization. Bundling the host values into a single struct simplifies adding new MSRs and other features with host state/values that KVM cares about, and provides a one-stop shop. E.g. adding a new value requires one line, whereas tracking each value individual often requires three: declaration, definition, and export. No functional change intended. Link: https://lore.kernel.org/r/20240423221521.2923759-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-06-03 08:58:53 -07:00
Sean Christopherson	b4bd556467	KVM: SVM: WARN on vNMI + NMI window iff NMIs are outright masked When requesting an NMI window, WARN on vNMI support being enabled if and only if NMIs are actually masked, i.e. if the vCPU is already handling an NMI. KVM's ABI for NMIs that arrive simultanesouly (from KVM's point of view) is to inject one NMI and pend the other. When using vNMI, KVM pends the second NMI simply by setting V_NMI_PENDING, and lets the CPU do the rest (hardware automatically sets V_NMI_BLOCKING when an NMI is injected). However, if KVM can't immediately inject an NMI, e.g. because the vCPU is in an STI shadow or is running with GIF=0, then KVM will request an NMI window and trigger the WARN (but still function correctly). Whether or not the GIF=0 case makes sense is debatable, as the intent of KVM's behavior is to provide functionality that is as close to real hardware as possible. E.g. if two NMIs are sent in quick succession, the probability of both NMIs arriving in an STI shadow is infinitesimally low on real hardware, but significantly larger in a virtual environment, e.g. if the vCPU is preempted in the STI shadow. For GIF=0, the argument isn't as clear cut, because the window where two NMIs can collide is much larger in bare metal (though still small). That said, KVM should not have divergent behavior for the GIF=0 case based on whether or not vNMI support is enabled. And KVM has allowed simultaneous NMIs with GIF=0 for over a decade, since commit `7460fb4a34` ("KVM: Fix simultaneous NMIs"). I.e. KVM's GIF=0 handling shouldn't be modified without a really good reason to do so, and if KVM's behavior were to be modified, it should be done irrespective of vNMI support. Fixes: `fa4c027a79` ("KVM: x86: Add support for SVM's Virtual NMI") Cc: stable@vger.kernel.org Cc: Santosh Shukla <Santosh.Shukla@amd.com> Cc: Maxim Levitsky <mlevitsk@redhat.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-ID: <20240522021435.1684366-1-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2024-05-23 12:34:44 -04:00
Sean Christopherson	76d5363c20	KVM: x86: Force KVM_WERROR if the global WERROR is enabled Force KVM_WERROR if the global WERROR is enabled to avoid pestering the user about a Kconfig that will ultimately be ignored. Force KVM_WERROR instead of making it mutually exclusive with WERROR to avoid generating a .config builds KVM with -Werror, but has KVM_WERROR=n. Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-ID: <20240517180341.974251-1-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2024-05-23 12:33:31 -04:00
Sean Christopherson	6af6142e3a	KVM: x86: Disable KVM_INTEL_PROVE_VE by default Disable KVM's "prove #VE" support by default, as it provides no functional value, and even its sanity checking benefits are relatively limited. I.e. it should be fully opt-in even on debug kernels, especially since EPT Violation #VE suppression appears to be buggy on some CPUs. Opportunistically add a line in the help text to make it abundantly clear that KVM_INTEL_PROVE_VE should never be enabled in a production environment. Suggested-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-ID: <20240518000430.1118488-10-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2024-05-23 12:33:15 -04:00
Sean Christopherson	bca99c0356	KVM: x86/mmu: Print SPTEs on unexpected #VE Print the SPTEs that correspond to the faulting GPA on an unexpected EPT Violation #VE to help the user debug failures, e.g. to pinpoint which SPTE didn't have SUPPRESS_VE set. Opportunistically assert that the underlying exit reason was indeed an EPT Violation, as the CPU has really gone off the rails if a #VE occurs due to a completely unexpected exit reason. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-ID: <20240518000430.1118488-7-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2024-05-23 12:28:45 -04:00

... 2 3 4 5 6 ...

10352 Commits