linux-next/arch/x86/kvm/mmu
Paolo Bonzini d96c77bd4e KVM: x86: switch hugepage recovery thread to vhost_task
kvm_vm_create_worker_thread() is meant to be used for kthreads that
can consume significant amounts of CPU time on behalf of a VM or in
response to how the VM behaves (for example how it accesses its memory).
Therefore it wants to charge the CPU time consumed by that work to
the VM's container.

However, because of these threads, cgroups that contain KVM instances
never finish freezing.  This can be trivially reproduced:

  root@test ~# mkdir /sys/fs/cgroup/test
  root@test ~# echo $$ > /sys/fs/cgroup/test/cgroup.procs
  root@test ~# qemu-system-x86_64 -nographic -enable-kvm

and in another terminal:

  root@test ~# echo 1 > /sys/fs/cgroup/test/cgroup.freeze
  root@test ~# cat /sys/fs/cgroup/test/cgroup.events
  populated 1
  frozen 0

The cgroup freezing happens in the signal delivery path but
kvm_nx_huge_page_recovery_worker, while joining non-root cgroups, never
calls into the signal delivery path and thus never gets frozen. Because
the cgroup freezer determines whether a given cgroup is frozen by
comparing the number of frozen threads to the total number of threads
in the cgroup, the cgroup never becomes frozen and users waiting for
the state transition may hang indefinitely.
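
The accounting described above can be illustrated with a small user-space
model. This is only a sketch, not kernel code: struct model_task,
model_request_freeze() and model_cgroup_frozen() are hypothetical names
made up for illustration, and the model assumes a task freezes only if it
reaches the signal delivery path.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of one task in a cgroup; not a kernel structure. */
struct model_task {
	bool enters_signal_path;	/* false models the old recovery kthread */
	bool frozen;
};

/* Model a freeze request: only tasks that reach signal delivery freeze. */
static void model_request_freeze(struct model_task *tasks, size_t n)
{
	assert(tasks != NULL);
	for (size_t i = 0; i < n; i++)
		tasks[i].frozen = tasks[i].enters_signal_path;
}

/* The cgroup counts as frozen only when every task in it is frozen. */
static bool model_cgroup_frozen(const struct model_task *tasks, size_t n)
{
	size_t nr_frozen = 0;

	assert(tasks != NULL);
	for (size_t i = 0; i < n; i++)
		if (tasks[i].frozen)
			nr_frozen++;
	return nr_frozen == n;
}
```

In this model, a cgroup holding one ordinary task and one task with
enters_signal_path set to false never reports frozen, which corresponds
to the "frozen 0" output shown in the reproducer.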

Since the worker kthread is tied to a user process, it's better if
it behaves similarly to user tasks as much as possible, including
being able to receive SIGSTOP and SIGCONT.  In fact, vhost_task is all
that kvm_vm_create_worker_thread() wanted to be and more: not only does
it inherit the userspace process's cgroups, it also has other niceties
like being parented properly in the process tree.  Use it instead of
the homegrown alternative.

Incidentally, the new code is also better behaved when you toggle
recovery between disabled and enabled.  If your recovery period is
1 minute, it will run the next recovery 1 minute after the previous
one, regardless of how many times you flipped the parameter.
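
One way to picture this timing behaviour is a deadline anchored to the
time of the last recovery pass rather than to the time the parameter was
last changed. The helper below is a user-space sketch under that
assumption; model_remaining_ms() and its parameters are hypothetical and
do not come from the kernel source.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical model: milliseconds left until the next recovery pass.
 * The deadline depends only on when the last pass ran and on the period,
 * so toggling the parameter in between does not move it.
 */
static int64_t model_remaining_ms(int64_t last_run_ms, int64_t period_ms,
				  int64_t now_ms)
{
	int64_t remaining;

	assert(period_ms > 0);
	remaining = last_run_ms + period_ms - now_ms;
	return remaining > 0 ? remaining : 0;
}
```

With a 60000 ms period and a last pass at time 0, a query at 45000 ms
leaves 15000 ms to wait, however often the setting was flipped in the
meantime.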

(Commit message based on emails from Tejun).

Reported-by: Tejun Heo <tj@kernel.org>
Reported-by: Luca Boccassi <bluca@debian.org>
Acked-by: Tejun Heo <tj@kernel.org>
Tested-by: Luca Boccassi <bluca@debian.org>
Cc: stable@vger.kernel.org
Reviewed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-11-14 13:20:04 -05:00
mmu_internal.h KVM: x86/mmu: Drop @max_level from kvm_mmu_max_mapping_level() 2024-10-30 15:25:42 -07:00
mmu.c KVM: x86: switch hugepage recovery thread to vhost_task 2024-11-14 13:20:04 -05:00
mmutrace.h KVM: x86/mmu: Trigger unprotect logic only on write-protection page faults 2024-09-09 20:16:19 -07:00
page_track.c KVM: Use vfree for memory allocated by vcalloc()/__vcalloc() 2024-04-09 12:18:38 -07:00
page_track.h KVM: x86/mmu: Drop @slot param from exported/external page-track APIs 2023-08-31 14:08:18 -04:00
paging_tmpl.h KVM: x86/mmu: Mark pages/folios dirty at the origin of make_spte() 2024-10-25 12:59:08 -04:00
spte.c KVM: x86/mmu: Rename make_huge_page_split_spte() to make_small_spte() 2024-11-04 18:37:23 -08:00
spte.h KVM: x86/mmu: Rename make_huge_page_split_spte() to make_small_spte() 2024-11-04 18:37:23 -08:00
tdp_iter.c arch/x86: Fix typos 2024-01-03 11:46:22 +01:00
tdp_iter.h KVM: x86/mmu: Add sanity checks that KVM doesn't create EPT #VE SPTEs 2024-05-23 12:27:26 -04:00
tdp_mmu.c KVM: x86/mmu: WARN if huge page recovery triggered during dirty logging 2024-11-04 18:37:23 -08:00
tdp_mmu.h KVM: x86/mmu: Recover TDP MMU huge page mappings in-place instead of zapping 2024-11-04 18:37:22 -08:00