Commit Graph

424 Commits

Author SHA1 Message Date
Linus Torvalds
244aefb1c6 VFIO updates for v6.8-rc1
- Add debugfs support, initially used for reporting device migration
    state. (Longfang Liu)
 
  - Fixes and support for migration dirty tracking across multiple IOVA
    regions in the pds-vfio-pci driver. (Brett Creeley)
 
  - Improved IOMMU allocation accounting visibility. (Pasha Tatashin)
 
  - Virtio infrastructure and a new virtio-vfio-pci variant driver, which
    provides emulation of a legacy virtio interfaces on modern virtio
    hardware for virtio-net VF devices where the PF driver exposes
    support for legacy admin queues, ie. an emulated IO BAR on an SR-IOV
    VF to provide driver ABI compatibility to legacy devices.
    (Yishai Hadas & Feng Liu)
 
  - Migration fixes for the hisi-acc-vfio-pci variant driver.
    (Shameer Kolothum)
 
  - Kconfig dependency fix for new virtio-vfio-pci variant driver.
    (Arnd Bergmann)
 -----BEGIN PGP SIGNATURE-----
 
 iQJPBAABCAA5FiEEQvbATlQL0amee4qQI5ubbjuwiyIFAmWhkhEbHGFsZXgud2ls
 bGlhbXNvbkByZWRoYXQuY29tAAoJECObm247sIsiCLgQAJv6mzD79dVWKAZH27Lj
 PK0ZSyu3fwgPxTmhRXysKKMs79WI2GlVx6nyW8pVe3w+OGWpdTcbZK2H/T/FryZQ
 QsbKteueG83ni1cIdJFzmIM1jO79jhtsPxpclRS/VmECRhYA6+c7smynHyZNrVAL
 wWkJIkS2uUEx3eUefzH4U2CRen3TILwHAXi27fJ8pHbr6Yor+XvUOgM3eQDjUj+t
 eABL/pJr0qFDQstom6k7GLAsenRHKMLUG88ziSciSJxOg5YiT4py7zeLXuoEhVD1
 kI9KE+Vle5EdZe8MzLLhmzLZoFVfhjyNfj821QjtfP3Gkj6TqnUWBKJAptMuQpdf
 HklOLNmabrZbat+i6QqswrnQ5Z1doPz1uNBsl2lH+2/KIaT8bHZI+QgjK7pg2H2L
 O679My0od4rVLpjnSLDdRoXlcLd6mmvq3663gPogziHBNdNl3oQBI3iIa7ixljkA
 lxJbOZIDBAjzPk+t5NLYwkTsab1AY4zGlfr0M3Sk3q7tyj/MlBcX/fuqyhXjUfqR
 Zhqaw2OaWD8R0EqfSK+wRXr1+z7EWJO/y1iq8RYlD5Mozo+6YMVThjLDUO+8mrtV
 6/PL0woGALw0Tq1u0tw3rLjzCd9qwD9BD2fFUQwUWEe3j3wG2HCLLqyomxcmaKS8
 WgvUXtufWyvonCcIeLKXI9Kt
 =IuK2
 -----END PGP SIGNATURE-----

Merge tag 'vfio-v6.8-rc1' of https://github.com/awilliam/linux-vfio

Pull VFIO updates from Alex Williamson:

 - Add debugfs support, initially used for reporting device migration
   state (Longfang Liu)

 - Fixes and support for migration dirty tracking across multiple IOVA
   regions in the pds-vfio-pci driver (Brett Creeley)

 - Improved IOMMU allocation accounting visibility (Pasha Tatashin)

 - Virtio infrastructure and a new virtio-vfio-pci variant driver, which
   provides emulation of a legacy virtio interfaces on modern virtio
   hardware for virtio-net VF devices where the PF driver exposes
   support for legacy admin queues, ie. an emulated IO BAR on an SR-IOV
   VF to provide driver ABI compatibility to legacy devices (Yishai
   Hadas & Feng Liu)

 - Migration fixes for the hisi-acc-vfio-pci variant driver (Shameer
   Kolothum)

 - Kconfig dependency fix for new virtio-vfio-pci variant driver (Arnd
   Bergmann)

* tag 'vfio-v6.8-rc1' of https://github.com/awilliam/linux-vfio: (22 commits)
  vfio/virtio: fix virtio-pci dependency
  hisi_acc_vfio_pci: Update migration data pointer correctly on saving/resume
  vfio/virtio: Declare virtiovf_pci_aer_reset_done() static
  vfio/virtio: Introduce a vfio driver over virtio devices
  vfio/pci: Expose vfio_pci_core_iowrite/read##size()
  vfio/pci: Expose vfio_pci_core_setup_barmap()
  virtio-pci: Introduce APIs to execute legacy IO admin commands
  virtio-pci: Initialize the supported admin commands
  virtio-pci: Introduce admin commands
  virtio-pci: Introduce admin command sending function
  virtio-pci: Introduce admin virtqueue
  virtio: Define feature bit for administration virtqueue
  vfio/type1: account iommu allocations
  vfio/pds: Add multi-region support
  vfio/pds: Move seq/ack bitmaps into region struct
  vfio/pds: Pass region info to relevant functions
  vfio/pds: Move and rename region specific info
  vfio/pds: Only use a single SGL for both seq and ack
  vfio/pds: Fix calculations in pds_vfio_dirty_sync
  MAINTAINERS: Add vfio debugfs interface doc link
  ...
2024-01-18 15:57:25 -08:00
Arnd Bergmann
78f70c02bd vfio/virtio: fix virtio-pci dependency
The new vfio-virtio driver already has a dependency on VIRTIO_PCI_ADMIN_LEGACY,
but that is a bool symbol and allows vfio-virtio to be built-in even if
virtio-pci itself is a loadable module. This leads to a link failure:

aarch64-linux-ld: drivers/vfio/pci/virtio/main.o: in function `virtiovf_pci_probe':
main.c:(.text+0xec): undefined reference to `virtio_pci_admin_has_legacy_io'
aarch64-linux-ld: drivers/vfio/pci/virtio/main.o: in function `virtiovf_pci_init_device':
main.c:(.text+0x260): undefined reference to `virtio_pci_admin_legacy_io_notify_info'
aarch64-linux-ld: drivers/vfio/pci/virtio/main.o: in function `virtiovf_pci_bar0_rw':
main.c:(.text+0x6ec): undefined reference to `virtio_pci_admin_legacy_common_io_read'
aarch64-linux-ld: main.c:(.text+0x6f4): undefined reference to `virtio_pci_admin_legacy_device_io_read'
aarch64-linux-ld: main.c:(.text+0x7f0): undefined reference to `virtio_pci_admin_legacy_common_io_write'
aarch64-linux-ld: main.c:(.text+0x7f8): undefined reference to `virtio_pci_admin_legacy_device_io_write'

Add another explicit dependency on the tristate symbol.

Fixes: eb61eca0e8 ("vfio/virtio: Introduce a vfio driver over virtio devices")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Link: https://lore.kernel.org/r/20240109075731.2726731-1-arnd@kernel.org
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2024-01-10 15:10:41 -07:00
Linus Torvalds
c604110e66 vfs-6.8.misc
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZZUxRQAKCRCRxhvAZXjc
 ov/QAQDzvge3oQ9MEymmOiyzzcF+HhAXBr+9oEsYJjFc1p0TsgEA61gXjZo7F1jY
 KBqd6znOZCR+Waj0kIVJRAo/ISRBqQc=
 =0bRl
 -----END PGP SIGNATURE-----

Merge tag 'vfs-6.8.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull misc vfs updates from Christian Brauner:
 "This contains the usual miscellaneous features, cleanups, and fixes
  for vfs and individual fses.

  Features:

   - Add Jan Kara as VFS reviewer

   - Show correct device and inode numbers in proc/<pid>/maps for vma
     files on stacked filesystems. This is now easily doable thanks to
     the backing file work from the last cycles. This comes with
     selftests

  Cleanups:

   - Remove a redundant might_sleep() from wait_on_inode()

   - Initialize pointer with NULL, not 0

   - Clarify comment on access_override_creds()

   - Rework and simplify eventfd_signal() and eventfd_signal_mask()
     helpers

   - Process aio completions in batches to avoid needless wakeups

   - Completely decouple struct mnt_idmap from namespaces. We now only
     keep the actual idmapping around and don't stash references to
     namespaces

   - Reformat maintainer entries to indicate that a given subsystem
     belongs to fs/

   - Simplify fput() for files that were never opened

   - Get rid of various pointless file helpers

   - Rename various file helpers

   - Rename struct file members after SLAB_TYPESAFE_BY_RCU switch from
     last cycle

   - Make relatime_need_update() return bool

   - Use GFP_KERNEL instead of GFP_USER when allocating superblocks

   - Replace deprecated ida_simple_*() calls with their current ida_*()
     counterparts

  Fixes:

   - Fix comments on user namespace id mapping helpers. They aren't
     kernel doc comments so they shouldn't be using /**

   - s/Retuns/Returns/g in various places

   - Add missing parameter documentation on can_move_mount_beneath()

   - Rename i_mapping->private_data to i_mapping->i_private_data

   - Fix a false-positive lockdep warning in pipe_write() for watch
     queues

   - Improve __fget_files_rcu() code generation to improve performance

   - Only notify writer that pipe resizing has finished after setting
     pipe->max_usage otherwise writers are never notified that the pipe
     has been resized and hang

   - Fix some kernel docs in hfsplus

   - s/passs/pass/g in various places

   - Fix kernel docs in ntfs

   - Fix kcalloc() arguments order reported by gcc 14

   - Fix uninitialized value in reiserfs"

* tag 'vfs-6.8.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (36 commits)
  reiserfs: fix uninit-value in comp_keys
  watch_queue: fix kcalloc() arguments order
  ntfs: dir.c: fix kernel-doc function parameter warnings
  fs: fix doc comment typo fs tree wide
  selftests/overlayfs: verify device and inode numbers in /proc/pid/maps
  fs/proc: show correct device and inode numbers in /proc/pid/maps
  eventfd: Remove usage of the deprecated ida_simple_xx() API
  fs: super: use GFP_KERNEL instead of GFP_USER for super block allocation
  fs/hfsplus: wrapper.c: fix kernel-doc warnings
  fs: add Jan Kara as reviewer
  fs/inode: Make relatime_need_update return bool
  pipe: wakeup wr_wait after setting max_usage
  file: remove __receive_fd()
  file: stop exposing receive_fd_user()
  fs: replace f_rcuhead with f_task_work
  file: remove pointless wrapper
  file: s/close_fd_get_file()/file_close_fd()/g
  Improve __fget_files_rcu() code generation (and thus __fget_light())
  file: massage cleanup of files that failed to open
  fs/pipe: Fix lockdep false-positive in watchqueue pipe_write()
  ...
2024-01-08 10:26:08 -08:00
Shameer Kolothum
be12ad45e1 hisi_acc_vfio_pci: Update migration data pointer correctly on saving/resume
When the optional PRE_COPY support was added to speed up the device
compatibility check, it failed to update the saving/resuming data
pointers based on the fd offset. This results in migration data
corruption and when the device gets started on the destination the
following error is reported in some cases,

[  478.907684] arm-smmu-v3 arm-smmu-v3.2.auto: event 0x10 received:
[  478.913691] arm-smmu-v3 arm-smmu-v3.2.auto:  0x0000310200000010
[  478.919603] arm-smmu-v3 arm-smmu-v3.2.auto:  0x000002088000007f
[  478.925515] arm-smmu-v3 arm-smmu-v3.2.auto:  0x0000000000000000
[  478.931425] arm-smmu-v3 arm-smmu-v3.2.auto:  0x0000000000000000
[  478.947552] hisi_zip 0000:31:00.0: qm_axi_rresp [error status=0x1] found
[  478.955930] hisi_zip 0000:31:00.0: qm_db_timeout [error status=0x400] found
[  478.955944] hisi_zip 0000:31:00.0: qm sq doorbell timeout in function 2

Fixes: d9a871e4a1 ("hisi_acc_vfio_pci: Introduce support for PRE_COPY state transitions")
Signed-off-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/20231120091406.780-1-shameerali.kolothum.thodi@huawei.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2024-01-05 09:06:45 -07:00
Yishai Hadas
daca194876 vfio/virtio: Declare virtiovf_pci_aer_reset_done() static
Declare virtiovf_pci_aer_reset_done() as a static function to prevent
the below build warning.

"warning: no previous prototype for 'virtiovf_pci_aer_reset_done'
[-Wmissing-prototypes]"

Fixes: eb61eca0e8 ("vfio/virtio: Introduce a vfio driver over virtio devices")
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Closes: https://lore.kernel.org/lkml/20231220143122.63337669@canb.auug.org.au/
Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Link: https://lore.kernel.org/r/20231220082456.241973-1-yishaih@nvidia.com
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202312202115.oDmvN1VE-lkp@intel.com/
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2023-12-20 08:08:39 -07:00
Alex Williamson
0214392d5d Merge branch 'v6.8/vfio/virtio' into v6.8/vfio/next 2023-12-19 12:16:24 -07:00
Yishai Hadas
eb61eca0e8 vfio/virtio: Introduce a vfio driver over virtio devices
Introduce a vfio driver over virtio devices to support the legacy
interface functionality for VFs.

Background, from the virtio spec [1].
--------------------------------------------------------------------
In some systems, there is a need to support a virtio legacy driver with
a device that does not directly support the legacy interface. In such
scenarios, a group owner device can provide the legacy interface
functionality for the group member devices. The driver of the owner
device can then access the legacy interface of a member device on behalf
of the legacy member device driver.

For example, with the SR-IOV group type, group members (VFs) can not
present the legacy interface in an I/O BAR in BAR0 as expected by the
legacy pci driver. If the legacy driver is running inside a virtual
machine, the hypervisor executing the virtual machine can present a
virtual device with an I/O BAR in BAR0. The hypervisor intercepts the
legacy driver accesses to this I/O BAR and forwards them to the group
owner device (PF) using group administration commands.
--------------------------------------------------------------------

Specifically, this driver adds support for a virtio-net VF to be exposed
as a transitional device to a guest driver and allows the legacy IO BAR
functionality on top.

This allows a VM which uses a legacy virtio-net driver in the guest to
work transparently over a VF which its driver in the host is that new
driver.

The driver can be extended easily to support some other types of virtio
devices (e.g virtio-blk), by adding in a few places the specific type
properties as was done for virtio-net.

For now, only the virtio-net use case was tested and as such we introduce
the support only for such a device.

Practically,
Upon probing a VF for a virtio-net device, in case its PF supports
legacy access over the virtio admin commands and the VF doesn't have BAR
0, we set some specific 'vfio_device_ops' to be able to simulate in SW a
transitional device with I/O BAR in BAR 0.

The existence of the simulated I/O bar is reported later on by
overwriting the VFIO_DEVICE_GET_REGION_INFO command and the device
exposes itself as a transitional device by overwriting some properties
upon reading its config space.

Once we report the existence of I/O BAR as BAR 0 a legacy driver in the
guest may use it via read/write calls according to the virtio
specification.

Any read/write towards the control parts of the BAR will be captured by
the new driver and will be translated into admin commands towards the
device.

In addition, any data path read/write access (i.e. virtio driver
notifications) will be captured by the driver and forwarded to the
physical BAR which its properties were supplied by the admin command
VIRTIO_ADMIN_CMD_LEGACY_NOTIFY_INFO upon the probing/init flow.

With that code in place a legacy driver in the guest has the look and
feel as if having a transitional device with legacy support for both its
control and data path flows.

[1]
03c2d32e50

Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Link: https://lore.kernel.org/r/20231219093247.170936-10-yishaih@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2023-12-19 11:51:34 -07:00
Yishai Hadas
8486ae162b vfio/pci: Expose vfio_pci_core_iowrite/read##size()
Expose vfio_pci_core_iowrite/read##size() to let it be used by drivers.

This functionality is needed to enable direct access to some physical
BAR of the device with the proper locks/checks in place.

The next patches from this series will use this functionality on a data
path flow when a direct access to the BAR is needed.

Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Link: https://lore.kernel.org/r/20231219093247.170936-9-yishaih@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2023-12-19 11:51:33 -07:00
Yishai Hadas
8bccc5b806 vfio/pci: Expose vfio_pci_core_setup_barmap()
Expose vfio_pci_core_setup_barmap() to be used by drivers.

This will let drivers to mmap a BAR and re-use it from both vfio and the
driver when it's applicable.

This API will be used in the next patches by the vfio/virtio coming
driver.

Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Link: https://lore.kernel.org/r/20231219093247.170936-8-yishaih@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2023-12-19 11:51:33 -07:00
Brett Creeley
2e7c6feb4e vfio/pds: Add multi-region support
Only supporting a single region/range is limiting,
wasteful, and in some cases broken (i.e. when there
are large gaps in the iova memory ranges). Fix this
by adding support for multiple regions based on
what the device tells the driver it can support.

Signed-off-by: Brett Creeley <brett.creeley@amd.com>
Signed-off-by: Shannon Nelson <shannon.nelson@amd.com>
Link: https://lore.kernel.org/r/20231117001207.2793-7-brett.creeley@amd.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2023-12-04 14:33:21 -07:00
Brett Creeley
0c320f223e vfio/pds: Move seq/ack bitmaps into region struct
Since the host seq/ack bitmaps are part of a region
move them into struct pds_vfio_region. Also, make use
of the bmp_bytes value for validation purposes.

Signed-off-by: Brett Creeley <brett.creeley@amd.com>
Signed-off-by: Shannon Nelson <shannon.nelson@amd.com>
Link: https://lore.kernel.org/r/20231117001207.2793-6-brett.creeley@amd.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2023-12-04 14:33:21 -07:00
Brett Creeley
87bdf9807e vfio/pds: Pass region info to relevant functions
A later patch in the series implements multi-region
support. That will require specific regions to be
passed to relevant functions. Prepare for that change
by passing the region structure to relevant functions.

Signed-off-by: Brett Creeley <brett.creeley@amd.com>
Signed-off-by: Shannon Nelson <shannon.nelson@amd.com>
Link: https://lore.kernel.org/r/20231117001207.2793-5-brett.creeley@amd.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2023-12-04 14:33:21 -07:00
Brett Creeley
3f5898133a vfio/pds: Move and rename region specific info
An upcoming change in this series will add support
for multiple regions. To prepare for that, move
region specific information into struct pds_vfio_region
and rename the members for readability. This will
reduce the size of the patch that actually implements
multiple region support.

Signed-off-by: Brett Creeley <brett.creeley@amd.com>
Signed-off-by: Shannon Nelson <shannon.nelson@amd.com>
Link: https://lore.kernel.org/r/20231117001207.2793-4-brett.creeley@amd.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2023-12-04 14:33:21 -07:00
Brett Creeley
3b8f7a24d1 vfio/pds: Only use a single SGL for both seq and ack
Since the seq/ack operations never happen in parallel there
is no need for multiple scatter gather lists per region.
The current implementation is wasting memory. Fix this by
only using a single scatter-gather list for both the seq
and ack operations.

Signed-off-by: Brett Creeley <brett.creeley@amd.com>
Signed-off-by: Shannon Nelson <shannon.nelson@amd.com>
Link: https://lore.kernel.org/r/20231117001207.2793-3-brett.creeley@amd.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2023-12-04 14:33:20 -07:00
Brett Creeley
4004497cec vfio/pds: Fix calculations in pds_vfio_dirty_sync
The incorrect check is being done for comparing the
iova/length being requested to sync. This can cause
the dirty sync operation to fail. Fix this by making
sure the iova offset added to the requested sync
length doesn't exceed the region_size.

Also, the region_start is assumed to always be at 0.
This can cause dirty tracking to fail because the
device/driver bitmap offset always starts at 0,
however, the region_start/iova may not. Fix this by
determining the iova offset from region_start to
determine the bitmap offset.

Fixes: f232836a91 ("vfio/pds: Add support for dirty page tracking")
Signed-off-by: Brett Creeley <brett.creeley@amd.com>
Signed-off-by: Shannon Nelson <shannon.nelson@amd.com>
Link: https://lore.kernel.org/r/20231117001207.2793-2-brett.creeley@amd.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2023-12-04 14:33:20 -07:00
Christian Brauner
3652117f85 eventfd: simplify eventfd_signal()
Ever since the eventfd type was introduced back in 2007 in commit
e1ad7468c7 ("signal/timer/event: eventfd core") the eventfd_signal()
function only ever passed 1 as a value for @n. There's no point in
keeping that additional argument.

Link: https://lore.kernel.org/r/20231122-vfs-eventfd-signal-v2-2-bd549b14ce0c@kernel.org
Acked-by: Xu Yilun <yilun.xu@intel.com>
Acked-by: Andrew Donnellan <ajd@linux.ibm.com> # ocxl
Acked-by: Eric Farman <farman@linux.ibm.com>  # s390
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-11-28 14:08:38 +01:00
Brett Creeley
ae2667cd8a vfio/pds: Fix possible sleep while in atomic context
The driver could possibly sleep while in atomic context resulting
in the following call trace while CONFIG_DEBUG_ATOMIC_SLEEP=y is
set:

BUG: sleeping function called from invalid context at kernel/locking/mutex.c:283
in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 2817, name: bash
preempt_count: 1, expected: 0
RCU nest depth: 0, expected: 0
Call Trace:
 <TASK>
 dump_stack_lvl+0x36/0x50
 __might_resched+0x123/0x170
 mutex_lock+0x1e/0x50
 pds_vfio_put_lm_file+0x1e/0xa0 [pds_vfio_pci]
 pds_vfio_put_save_file+0x19/0x30 [pds_vfio_pci]
 pds_vfio_state_mutex_unlock+0x2e/0x80 [pds_vfio_pci]
 pci_reset_function+0x4b/0x70
 reset_store+0x5b/0xa0
 kernfs_fop_write_iter+0x137/0x1d0
 vfs_write+0x2de/0x410
 ksys_write+0x5d/0xd0
 do_syscall_64+0x3b/0x90
 entry_SYSCALL_64_after_hwframe+0x6e/0xd8

This can happen if pds_vfio_put_restore_file() and/or
pds_vfio_put_save_file() grab the mutex_lock(&lm_file->lock)
while the spin_lock(&pds_vfio->reset_lock) is held, which can
happen during while calling pds_vfio_state_mutex_unlock().

Fix this by changing the reset_lock to reset_mutex so there are no such
conerns. Also, make sure to destroy the reset_mutex in the driver specific
VFIO device release function.

This also fixes a spinlock bad magic BUG that was caused
by not calling spinlock_init() on the reset_lock. Since, the lock is
being changed to a mutex, make sure to call mutex_init() on it.

Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Closes: https://lore.kernel.org/kvm/1f9bc27b-3de9-4891-9687-ba2820c1b390@moroto.mountain/
Fixes: bb500dbe2a ("vfio/pds: Add VFIO live migration support")
Signed-off-by: Brett Creeley <brett.creeley@amd.com>
Reviewed-by: Shannon Nelson <shannon.nelson@amd.com>
Link: https://lore.kernel.org/r/20231122192532.25791-3-brett.creeley@amd.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2023-11-27 09:29:03 -07:00
Brett Creeley
91aeb563bd vfio/pds: Fix mutex lock->magic != lock warning
The following BUG was found when running on a kernel with
CONFIG_DEBUG_MUTEXES=y set:

DEBUG_LOCKS_WARN_ON(lock->magic != lock)
RIP: 0010:mutex_trylock+0x10d/0x120
Call Trace:
 <TASK>
 ? __warn+0x85/0x140
 ? mutex_trylock+0x10d/0x120
 ? report_bug+0xfc/0x1e0
 ? handle_bug+0x3f/0x70
 ? exc_invalid_op+0x17/0x70
 ? asm_exc_invalid_op+0x1a/0x20
 ? mutex_trylock+0x10d/0x120
 ? mutex_trylock+0x10d/0x120
 pds_vfio_reset+0x3a/0x60 [pds_vfio_pci]
 pci_reset_function+0x4b/0x70
 reset_store+0x5b/0xa0
 kernfs_fop_write_iter+0x137/0x1d0
 vfs_write+0x2de/0x410
 ksys_write+0x5d/0xd0
 do_syscall_64+0x3b/0x90
 entry_SYSCALL_64_after_hwframe+0x6e/0xd8

As shown, lock->magic != lock. This is because
mutex_init(&pds_vfio->state_mutex) is called in the VFIO open path. So,
if a reset is initiated before the VFIO device is opened the mutex will
have never been initialized. Fix this by calling
mutex_init(&pds_vfio->state_mutex) in the VFIO init path.

Also, don't destroy the mutex on close because the device may
be re-opened, which would cause mutex to be uninitialized. Fix this by
implementing a driver specific vfio_device_ops.release callback that
destroys the mutex before calling vfio_pci_core_release_dev().

Fixes: bb500dbe2a ("vfio/pds: Add VFIO live migration support")
Signed-off-by: Brett Creeley <brett.creeley@amd.com>
Reviewed-by: Shannon Nelson <shannon.nelson@amd.com>
Link: https://lore.kernel.org/r/20231122192532.25791-2-brett.creeley@amd.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2023-11-27 09:29:03 -07:00
Linus Torvalds
463f46e114 iommufd for 6.7
This branch has three new iommufd capabilities:
 
  - Dirty tracking for DMA. AMD/ARM/Intel CPUs can now record if a DMA
    writes to a page in the IOPTEs within the IO page table. This can be used
    to generate a record of what memory is being dirtied by DMA activities
    during a VM migration process. A VMM like qemu will combine the IOMMU
    dirty bits with the CPU's dirty log to determine what memory to
    transfer.
 
    VFIO already has a DMA dirty tracking framework that requires PCI
    devices to implement tracking HW internally. The iommufd version
    provides an alternative that the VMM can select, if available. The two
    are designed to have very similar APIs.
 
  - Userspace controlled attributes for hardware page
    tables (HWPT/iommu_domain). There are currently a few generic attributes
    for HWPTs (support dirty tracking, and parent of a nest). This is an
    entry point for the userspace iommu driver to control the HW in detail.
 
  - Nested translation support for HWPTs. This is a 2D translation scheme
    similar to the CPU where a DMA goes through a first stage to determine
    an intermediate address which is then translated trough a second stage
    to a physical address.
 
    Like for CPU translation the first stage table would exist in VM
    controlled memory and the second stage is in the kernel and matches the
    VM's guest to physical map.
 
    As every IOMMU has a unique set of parameter to describe the S1 IO page
    table and its associated parameters the userspace IOMMU driver has to
    marshal the information into the correct format.
 
    This is 1/3 of the feature, it allows creating the nested translation
    and binding it to VFIO devices, however the API to support IOTLB and
    ATC invalidation of the stage 1 io page table, and forwarding of IO
    faults are still in progress.
 
 The series includes AMD and Intel support for dirty tracking. Intel
 support for nested translation.
 
 Along the way are a number of internal items:
 
  - New iommu core items: ops->domain_alloc_user(), ops->set_dirty_tracking,
    ops->read_and_clear_dirty(), IOMMU_DOMAIN_NESTED, and iommu_copy_struct_from_user
 
  - UAF fix in iopt_area_split()
 
  - Spelling fixes and some test suite improvement
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQRRRCHOFoQz/8F5bUaFwuHvBreFYQUCZUDu2wAKCRCFwuHvBreF
 YcdeAQDaBmjyGLrRIlzPyohF6FrombyWo2512n51Hs8IHR4IvQEA3oRNgQ2tsJRr
 1UPuOqnOD5T/oVX6AkUPRBwanCUQwwM=
 =nyJ3
 -----END PGP SIGNATURE-----

Merge tag 'for-linus-iommufd' of git://git.kernel.org/pub/scm/linux/kernel/git/jgg/iommufd

Pull iommufd updates from Jason Gunthorpe:
 "This brings three new iommufd capabilities:

   - Dirty tracking for DMA.

     AMD/ARM/Intel CPUs can now record if a DMA writes to a page in the
     IOPTEs within the IO page table. This can be used to generate a
     record of what memory is being dirtied by DMA activities during a
     VM migration process. A VMM like qemu will combine the IOMMU dirty
     bits with the CPU's dirty log to determine what memory to transfer.

     VFIO already has a DMA dirty tracking framework that requires PCI
     devices to implement tracking HW internally. The iommufd version
     provides an alternative that the VMM can select, if available. The
     two are designed to have very similar APIs.

   - Userspace controlled attributes for hardware page tables
     (HWPT/iommu_domain). There are currently a few generic attributes
     for HWPTs (support dirty tracking, and parent of a nest). This is
     an entry point for the userspace iommu driver to control the HW in
     detail.

   - Nested translation support for HWPTs. This is a 2D translation
     scheme similar to the CPU where a DMA goes through a first stage to
     determine an intermediate address which is then translated trough a
     second stage to a physical address.

     Like for CPU translation the first stage table would exist in VM
     controlled memory and the second stage is in the kernel and matches
     the VM's guest to physical map.

     As every IOMMU has a unique set of parameter to describe the S1 IO
     page table and its associated parameters the userspace IOMMU driver
     has to marshal the information into the correct format.

     This is 1/3 of the feature, it allows creating the nested
     translation and binding it to VFIO devices, however the API to
     support IOTLB and ATC invalidation of the stage 1 io page table,
     and forwarding of IO faults are still in progress.

  The series includes AMD and Intel support for dirty tracking. Intel
  support for nested translation.

  Along the way are a number of internal items:

   - New iommu core items: ops->domain_alloc_user(),
     ops->set_dirty_tracking, ops->read_and_clear_dirty(),
     IOMMU_DOMAIN_NESTED, and iommu_copy_struct_from_user

   - UAF fix in iopt_area_split()

   - Spelling fixes and some test suite improvement"

* tag 'for-linus-iommufd' of git://git.kernel.org/pub/scm/linux/kernel/git/jgg/iommufd: (52 commits)
  iommufd: Organize the mock domain alloc functions closer to Joerg's tree
  iommufd/selftest: Fix page-size check in iommufd_test_dirty()
  iommufd: Add iopt_area_alloc()
  iommufd: Fix missing update of domains_itree after splitting iopt_area
  iommu/vt-d: Disallow read-only mappings to nest parent domain
  iommu/vt-d: Add nested domain allocation
  iommu/vt-d: Set the nested domain to a device
  iommu/vt-d: Make domain attach helpers to be extern
  iommu/vt-d: Add helper to setup pasid nested translation
  iommu/vt-d: Add helper for nested domain allocation
  iommu/vt-d: Extend dmar_domain to support nested domain
  iommufd: Add data structure for Intel VT-d stage-1 domain allocation
  iommu/vt-d: Enhance capability check for nested parent domain allocation
  iommufd/selftest: Add coverage for IOMMU_HWPT_ALLOC with nested HWPTs
  iommufd/selftest: Add nested domain allocation for mock domain
  iommu: Add iommu_copy_struct_from_user helper
  iommufd: Add a nested HW pagetable object
  iommu: Pass in parent domain with user_data to domain_alloc_user op
  iommufd: Share iommufd_hwpt_alloc with IOMMUFD_OBJ_HWPT_NESTED
  iommufd: Derive iommufd_hwpt_paging from iommufd_hw_pagetable
  ...
2023-11-01 16:44:56 -10:00
Joao Martins
13578d4ebe iommufd/iova_bitmap: Move symbols to IOMMUFD namespace
Have the IOVA bitmap exported symbols adhere to the IOMMUFD symbol
export convention i.e. using the IOMMUFD namespace. In doing so,
import the namespace in the current users. This means VFIO and the
vfio-pci drivers that use iova_bitmap_set().

Link: https://lore.kernel.org/r/20231024135109.73787-4-joao.m.martins@oracle.com
Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Brett Creeley <brett.creeley@amd.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2023-10-24 11:58:42 -03:00
Joao Martins
8c9c727b61 vfio: Move iova_bitmap into iommufd
Both VFIO and IOMMUFD will need iova bitmap for storing dirties and walking
the user bitmaps, so move to the common dependency into IOMMUFD.  In doing
so, create the symbol IOMMUFD_DRIVER which designates the builtin code that
will be used by drivers when selected. Today this means MLX5_VFIO_PCI and
PDS_VFIO_PCI. IOMMU drivers will do the same (in future patches) when
supporting dirty tracking and select IOMMUFD_DRIVER accordingly.

Given that the symbol maybe be disabled, add header definitions in
iova_bitmap.h for when IOMMUFD_DRIVER=n

Link: https://lore.kernel.org/r/20231024135109.73787-3-joao.m.martins@oracle.com
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Brett Creeley <brett.creeley@amd.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2023-10-24 11:58:42 -03:00
Yishai Hadas
fcb2f2ed4a vfio/mlx5: Activate the chunk mode functionality
Now that all pieces are in place, activate the chunk mode functionality
based on device capabilities.

Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/20230911093856.81910-10-yishaih@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2023-09-28 13:07:29 -06:00
Yishai Hadas
a899cacab5 vfio/mlx5: Add support for READING in chunk mode
Add support for READING in chunk mode.

In case the last SAVE command recognized that there was still some image
to be read, however, there was no available chunk to use for, this task
was delayed for the reader till one chunk will be consumed and becomes
available.

In the above case, a work will be executed to read in the background the
next image from the device.

Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/20230911093856.81910-9-yishaih@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2023-09-28 13:07:29 -06:00
Yishai Hadas
67135f2945 vfio/mlx5: Add support for SAVING in chunk mode
Add support for SAVING in chunk mode, it includes running a work
that will fill the next chunk from the device.

In case the number of available chunks will reach the MAX_NUM_CHUNKS,
the next chunk SAVING will be delayed till the reader will consume one
chunk.

The next patch from the series will add the reader part of the chunk
mode.

Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/20230911093856.81910-8-yishaih@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2023-09-28 13:07:29 -06:00
Yishai Hadas
5798e4dd58 vfio/mlx5: Pre-allocate chunks for the STOP_COPY phase
This patch is another preparation step towards working in chunk mode.

It pre-allocates chunks for the STOP_COPY phase to let the driver use
them immediately and prevent an extra allocation upon that phase.

Before that patch we had a single large buffer that was dedicated for
the STOP_COPY phase as there was a single SAVE in the source for the
last image.

Once we'll move to chunk mode the idea is to have some small buffers
that will be used upon the STOP_COPY phase.

The driver will read-ahead from the firmware the full state in
small/optimized chunks while letting QEMU/user space read in parallel
the available data.

Each buffer holds its chunk number to let it be recognized down the road
in the coming patches.

The chunk buffer size is picked-up based on the minimum size that
firmware requires, the total full size and some max value in the driver
code which was set to 8MB to achieve some optimized downtime in the
general case.

As the chunk mode is applicable even if we move directly to STOP_COPY
the buffers preparation and some other related stuff is done
unconditionally with regards to STOP/PRE-COPY.

Note:
In that phase in the series we still didn't activate the chunk mode and
the first buffer will be used in all the places.

Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/20230911093856.81910-7-yishaih@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2023-09-28 13:07:29 -06:00
Yishai Hadas
9114100d10 vfio/mlx5: Rename some stuff to match chunk mode
Upon chunk mode there may be multiple images that will be read from the
device upon STOP_COPY.

This patch is some preparation for that mode by replacing the relevant
stuff to a better matching name.

As part of that, be stricter to recognize PRE_COPY error only when it
didn't occur on a STOP_COPY chunk.

Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/20230911093856.81910-6-yishaih@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2023-09-28 13:07:29 -06:00
Yishai Hadas
543640af84 vfio/mlx5: Enable querying state size which is > 4GB
Once the device supports 'chunk mode' the driver can support state size
which is larger than 4GB.

In that case the device has the capability to split a single image to
multiple chunks as long as the software provides a buffer in the minimum
size reported by the device.

The driver should query for the minimum buffer size required using
QUERY_VHCA_MIGRATION_STATE command with the 'chunk' bit set in its
input, in that case, the output will include both the minimum buffer
size (i.e.  required_umem_size) and also the remaining total size to be
reported/used where that it will be applicable.

At that point in the series the 'chunk' bit is off, the last patch will
activate the feature once all pieces will be ready.

Note:
Before this change we were limited to 4GB state size as of 4 bytes max
value based on the device specification for the query/save/load
commands.

Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/20230911093856.81910-5-yishaih@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2023-09-28 13:07:29 -06:00
Yishai Hadas
34a64c8eac vfio/mlx5: Refactor the SAVE callback to activate a work only upon an error
Upon a successful SAVE callback there is no need to activate a work, all
the required stuff can be done directly.

As so, refactor the above flow to activate a work only upon an error.

Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/20230911093856.81910-4-yishaih@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2023-09-28 13:07:29 -06:00
Yishai Hadas
82470eba9d vfio/mlx5: Wake up the reader post of disabling the SAVING migration file
Post of disabling the SAVING migration file, which includes setting the
file state to be MLX5_MIGF_STATE_ERROR, call to wake_up_interruptible()
on its poll_wait member.

This lets any potential reader which is waiting already for data as part
of mlx5vf_save_read() to wake up, recognize the error state and return
with an error.

Post of that we don't need to rely on any other condition to wake up
the reader as of the returning of the SAVE command that was previously
executed, etc.

In addition, this change will simplify error flows (e.g health recovery)
once we'll move to chunk mode and multiple SAVE commands may run in the
STOP_COPY phase as we won't need to rely any more on a SAVE command to
wake-up a potential waiting reader.

Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/20230911093856.81910-3-yishaih@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2023-09-28 13:07:29 -06:00
Shixiong Ou
27004f89b0 vfio/pds: Use proper PF device access helper
The pci_physfn() helper exists to support cases where the physfn
field may not be compiled into the pci_dev structure. We've
declared this driver dependent on PCI_IOV to avoid this problem,
but regardless we should follow the precedent not to access this
field directly.

Signed-off-by: Shixiong Ou <oushixiong@kylinos.cn>
Reviewed-by: Brett Creeley <brett.creeley@amd.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/20230914021332.1929155-1-oushixiong@kylinos.cn
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2023-09-18 13:36:53 -06:00
Shixiong Ou
5a59f2ff30 vfio/pds: Add missing PCI_IOV depends
If PCI_ATS isn't set, then pdev->physfn is not defined.
it causes a compilation issue:

../drivers/vfio/pci/pds/vfio_dev.c:165:30: error: ‘struct pci_dev’ has no member named ‘physfn’; did you mean ‘is_physfn’?
  165 |   __func__, pci_dev_id(pdev->physfn), pci_id, vf_id,
      |                              ^~~~~~

So adding PCI_IOV depends to select PCI_ATS.

Signed-off-by: Shixiong Ou <oushixiong@kylinos.cn>
Reviewed-by: Brett Creeley <brett.creeley@amd.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/20230906014942.1658769-1-oushixiong@kylinos.cn
Fixes: 63f77a7161 ("vfio/pds: register with the pds_core PF")
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2023-09-18 13:36:52 -06:00
Linus Torvalds
ec0e2dc810 VFIO updates for v6.6-rc1
- VFIO direct character device (cdev) interface support.  This extracts
    the vfio device fd from the container and group model, and is intended
    to be the native uAPI for use with IOMMUFD. (Yi Liu)
 
  - Enhancements to the PCI hot reset interface in support of cdev usage.
    (Yi Liu)
 
  - Fix a potential race between registering and unregistering vfio files
    in the kvm-vfio interface and extend use of a lock to avoid extra
    drop and acquires. (Dmitry Torokhov)
 
  - A new vfio-pci variant driver for the AMD/Pensando Distributed Services
    Card (PDS) Ethernet device, supporting live migration. (Brett Creeley)
 
  - Cleanups to remove redundant owner setup in cdx and fsl bus drivers,
    and simplify driver init/exit in fsl code. (Li Zetao)
 
  - Fix uninitialized hole in data structure and pad capability structures
    for alignment. (Stefan Hajnoczi)
 -----BEGIN PGP SIGNATURE-----
 
 iQJPBAABCAA5FiEEQvbATlQL0amee4qQI5ubbjuwiyIFAmTvnDUbHGFsZXgud2ls
 bGlhbXNvbkByZWRoYXQuY29tAAoJECObm247sIsimEEP/AzG+VRcu5LfYbLGLe0z
 zB8ts6G7S78wXlmfN/LYi3v92XWvMMcm+vYF8oNAMfr1YL5sibWN6UtQfY1KCr7h
 nWKdQdqjajJ5yDDZnOFdhqHJGNfmZw6+fey8Z0j8zRI2oymK4DncWWX3g/7L1SNr
 9tIexGJef+mOdAmC94yOut3YviAaZ+f95T/xrdXHzzoNr50DD0+PD6AJdKJfKggP
 vhiC/DAYH3Fofaa6tRasgWuKCYWdjZLR/kxgNpeEmW6kZnbq/dnzZ+kgn4HH1f9G
 8p7UKVARR6FfG5aLheWu6Y9PDaKnfnqu8y/hobuE/ivXcmqqK+a6xSxrjgbVs8WJ
 94SYnTBRoTlDJaKWa7GxqdgzJnV+s5ZyAgPhjzdi6mLTPWGzkuLhFWGtYL+LZAQ6
 pNeZSM6CFBk+bva/xT0nNPCXxPh+/j/Y0G18FREj8aPFc03HrJQqz0RLydvTnoDz
 nX/by5KdzMSVSVLPr4uDMtAsgxsGqWiFcp7QMw1HhhlLWxqmYbA+mLZaqyMZUUOx
 6b/P8WXT9P2I+qPVKWQ5CWyqpsEqm6P+72yg6LOM9kINvgwDhOa7cagMXIuMWYMH
 Rf97FL+K8p1eIy6AnvRHgFBMM5185uG+0YcJyVqtucDr/k8T/Om6ujAI6JbWtNe6
 cLgaVAqKOYqCR4HC9bfVGSbd
 =eKSR
 -----END PGP SIGNATURE-----

Merge tag 'vfio-v6.6-rc1' of https://github.com/awilliam/linux-vfio

Pull VFIO updates from Alex Williamson:

 - VFIO direct character device (cdev) interface support. This extracts
   the vfio device fd from the container and group model, and is
   intended to be the native uAPI for use with IOMMUFD (Yi Liu)

 - Enhancements to the PCI hot reset interface in support of cdev usage
   (Yi Liu)

 - Fix a potential race between registering and unregistering vfio files
   in the kvm-vfio interface and extend use of a lock to avoid extra
   drop and acquires (Dmitry Torokhov)

 - A new vfio-pci variant driver for the AMD/Pensando Distributed
   Services Card (PDS) Ethernet device, supporting live migration (Brett
   Creeley)

 - Cleanups to remove redundant owner setup in cdx and fsl bus drivers,
   and simplify driver init/exit in fsl code (Li Zetao)

 - Fix uninitialized hole in data structure and pad capability
   structures for alignment (Stefan Hajnoczi)

* tag 'vfio-v6.6-rc1' of https://github.com/awilliam/linux-vfio: (53 commits)
  vfio/pds: Send type for SUSPEND_STATUS command
  vfio/pds: fix return value in pds_vfio_get_lm_file()
  pds_core: Fix function header descriptions
  vfio: align capability structures
  vfio/type1: fix cap_migration information leak
  vfio/fsl-mc: Use module_fsl_mc_driver macro to simplify the code
  vfio/cdx: Remove redundant initialization owner in vfio_cdx_driver
  vfio/pds: Add Kconfig and documentation
  vfio/pds: Add support for firmware recovery
  vfio/pds: Add support for dirty page tracking
  vfio/pds: Add VFIO live migration support
  vfio/pds: register with the pds_core PF
  pds_core: Require callers of register/unregister to pass PF drvdata
  vfio/pds: Initial support for pds VFIO driver
  vfio: Commonize combine_ranges for use in other VFIO drivers
  kvm/vfio: avoid bouncing the mutex when adding and deleting groups
  kvm/vfio: ensure kvg instance stays around in kvm_vfio_group_add()
  docs: vfio: Add vfio device cdev description
  vfio: Compile vfio_group infrastructure optionally
  vfio: Move the IOMMU_CAP_CACHE_COHERENCY check in __vfio_register_dev()
  ...
2023-08-30 20:36:01 -07:00
Brett Creeley
642265e22e vfio/pds: Send type for SUSPEND_STATUS command
Commit bb500dbe2a ("vfio/pds: Add VFIO live migration support")
added live migration support for the pds-vfio-pci driver. When
sending the SUSPEND command to the device, the driver sets the
type of suspend (i.e. P2P or FULL). However, the driver isn't
sending the type of suspend for the SUSPEND_STATUS command, which
will result in failures. Fix this by also sending the suspend type
in the SUSPEND_STATUS command.

Fixes: bb500dbe2a ("vfio/pds: Add VFIO live migration support")
Signed-off-by: Brett Creeley <brett.creeley@amd.com>
Signed-off-by: Shannon Nelson <shannon.nelson@amd.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://lore.kernel.org/r/20230821184215.34564-1-brett.creeley@amd.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2023-08-22 13:11:57 -06:00
Yang Yingliang
2d12d18f14 vfio/pds: fix return value in pds_vfio_get_lm_file()
anon_inode_getfile() never returns NULL pointer, it will return
ERR_PTR() when it fails, so replace the check with IS_ERR().

Fixes: bb500dbe2a ("vfio/pds: Add VFIO live migration support")
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Reviewed-by: Brett Creeley <brett.creeley@amd.com>
Link: https://lore.kernel.org/r/20230819023716.3469037-1-yangyingliang@huawei.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2023-08-21 08:50:33 -06:00
Stefan Hajnoczi
a881b49694 vfio: align capability structures
The VFIO_DEVICE_GET_INFO, VFIO_DEVICE_GET_REGION_INFO, and
VFIO_IOMMU_GET_INFO ioctls fill in an info struct followed by capability
structs:

  +------+---------+---------+-----+
  | info | caps[0] | caps[1] | ... |
  +------+---------+---------+-----+

Both the info and capability struct sizes are not always multiples of
sizeof(u64), leaving u64 fields in later capability structs misaligned.

Userspace applications currently need to handle misalignment manually in
order to support CPU architectures and programming languages with strict
alignment requirements.

Make life easier for userspace by ensuring alignment in the kernel. This
is done by padding info struct definitions and by copying out zeroes
after capability structs that are not aligned.

The new layout is as follows:

  +------+---------+---+---------+-----+
  | info | caps[0] | 0 | caps[1] | ... |
  +------+---------+---+---------+-----+

In this example caps[0] has a size that is not multiples of sizeof(u64),
so zero padding is added to align the subsequent structure.

Adding zero padding between structs does not break the uapi. The memory
layout is specified by the info.cap_offset and caps[i].next fields
filled in by the kernel. Applications use these field values to locate
structs and are therefore unaffected by the addition of zero padding.

Note that code that copies out info structs with padding is updated to
always zero the struct and copy out as many bytes as userspace
requested. This makes the code shorter and avoids potential information
leaks by ensuring padding is initialized.

Originally-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Acked-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/20230809203144.2880050-1-stefanha@redhat.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2023-08-17 12:17:44 -06:00
Brett Creeley
fc9da66103 vfio/pds: Add Kconfig and documentation
Add Kconfig entries and pds-vfio-pci.rst. Also, add an entry in the
MAINTAINERS file for this new driver.

It's not clear where documentation for vendor specific VFIO
drivers should live, so just re-use the current amd
ethernet location.

Signed-off-by: Brett Creeley <brett.creeley@amd.com>
Signed-off-by: Shannon Nelson <shannon.nelson@amd.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/20230807205755.29579-9-brett.creeley@amd.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2023-08-16 10:54:54 -06:00
Brett Creeley
7dabb1bcd1 vfio/pds: Add support for firmware recovery
It's possible that the device firmware crashes and is able to recover
due to some configuration and/or other issue. If a live migration
is in progress while the firmware crashes, the live migration will
fail. However, the VF PCI device should still be functional post
crash recovery and subsequent migrations should go through as
expected.

When the pds_core device notices that firmware crashes it sends an
event to all its client drivers. When the pds_vfio driver receives
this event while migration is in progress it will request a deferred
reset on the next migration state transition. This state transition
will report failure as well as any subsequent state transition
requests from the VMM/VFIO. Based on uapi/vfio.h the only way out of
VFIO_DEVICE_STATE_ERROR is by issuing VFIO_DEVICE_RESET. Once this
reset is done, the migration state will be reset to
VFIO_DEVICE_STATE_RUNNING and migration can be performed.

If the event is received while no migration is in progress (i.e.
the VM is in normal operating mode), then no actions are taken
and the migration state remains VFIO_DEVICE_STATE_RUNNING.

Signed-off-by: Brett Creeley <brett.creeley@amd.com>
Signed-off-by: Shannon Nelson <shannon.nelson@amd.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/20230807205755.29579-8-brett.creeley@amd.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2023-08-16 10:54:54 -06:00
Brett Creeley
f232836a91 vfio/pds: Add support for dirty page tracking
In order to support dirty page tracking, the driver has to implement
the VFIO subsystem's vfio_log_ops. This includes log_start, log_stop,
and log_read_and_clear.

All of the tracker resources are allocated and dirty tracking on the
device is started during log_start. The resources are cleaned up and
dirty tracking on the device is stopped during log_stop. The dirty
pages are determined and reported during log_read_and_clear.

In order to support these callbacks admin queue commands are used.
All of the adminq queue command structures and implementations
are included as part of this patch.

PDS_LM_CMD_DIRTY_STATUS is added to query the current status of
dirty tracking on the device. This includes if it's enabled (i.e.
number of regions being tracked from the device's perspective) and
the maximum number of regions supported from the device's perspective.

PDS_LM_CMD_DIRTY_ENABLE is added to enable dirty tracking on the
specified number of regions and their iova ranges.

PDS_LM_CMD_DIRTY_DISABLE is added to disable dirty tracking for all
regions on the device.

PDS_LM_CMD_READ_SEQ and PDS_LM_CMD_DIRTY_WRITE_ACK are added to
support reading and acknowledging the currently dirtied pages.

Signed-off-by: Brett Creeley <brett.creeley@amd.com>
Signed-off-by: Shannon Nelson <shannon.nelson@amd.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
Link: https://lore.kernel.org/r/20230807205755.29579-7-brett.creeley@amd.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2023-08-16 10:54:46 -06:00
Brett Creeley
bb500dbe2a vfio/pds: Add VFIO live migration support
Add live migration support via the VFIO subsystem. The migration
implementation aligns with the definition from uapi/vfio.h and uses
the pds_core PF's adminq for device configuration.

The ability to suspend, resume, and transfer VF device state data is
included along with the required admin queue command structures and
implementations.

PDS_LM_CMD_SUSPEND and PDS_LM_CMD_SUSPEND_STATUS are added to support
the VF device suspend operation.

PDS_LM_CMD_RESUME is added to support the VF device resume operation.

PDS_LM_CMD_STATE_SIZE is added to determine the exact size of the VF
device state data.

PDS_LM_CMD_SAVE is added to get the VF device state data.

PDS_LM_CMD_RESTORE is added to restore the VF device with the
previously saved data from PDS_LM_CMD_SAVE.

PDS_LM_CMD_HOST_VF_STATUS is added to notify the DSC/firmware when
a migration is in/not-in progress from the host's perspective. The
DSC/firmware can use this to clear/setup any necessary state related
to a migration.

Signed-off-by: Brett Creeley <brett.creeley@amd.com>
Signed-off-by: Shannon Nelson <shannon.nelson@amd.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/20230807205755.29579-6-brett.creeley@amd.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2023-08-16 10:53:26 -06:00
Brett Creeley
63f77a7161 vfio/pds: register with the pds_core PF
The pds_core driver will supply adminq services, so find the PF
and register with the DSC services.

Use the following commands to enable a VF:
echo 1 > /sys/bus/pci/drivers/pds_core/$PF_BDF/sriov_numvfs

Signed-off-by: Brett Creeley <brett.creeley@amd.com>
Signed-off-by: Shannon Nelson <shannon.nelson@amd.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/20230807205755.29579-5-brett.creeley@amd.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2023-08-16 10:53:20 -06:00
Brett Creeley
38fe3975b4 vfio/pds: Initial support for pds VFIO driver
This is the initial framework for the new pds-vfio-pci device driver.
This does the very basics of registering the PDS PCI device and
configuring it as a VFIO PCI device.

With this change, the VF device can be bound to the pds-vfio-pci driver
on the host and presented to the VM as an ethernet VF.

Signed-off-by: Brett Creeley <brett.creeley@amd.com>
Signed-off-by: Shannon Nelson <shannon.nelson@amd.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/20230807205755.29579-3-brett.creeley@amd.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2023-08-16 10:53:00 -06:00
Brett Creeley
9a4087fab3 vfio: Commonize combine_ranges for use in other VFIO drivers
Currently only Mellanox uses the combine_ranges function. The
new pds_vfio driver also needs this function. So, move it to
a common location for other vendor drivers to use.

Also, fix RCT ordering while moving/renaming the function.

Cc: Yishai Hadas <yishaih@nvidia.com>
Signed-off-by: Brett Creeley <brett.creeley@amd.com>
Signed-off-by: Shannon Nelson <shannon.nelson@amd.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
Link: https://lore.kernel.org/r/20230807205755.29579-2-brett.creeley@amd.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2023-08-16 10:52:23 -06:00
Maher Sanalla
f14c1a14e6 net/mlx5: Allocate completion EQs dynamically
This commit enables the dynamic allocation of EQs at runtime, allowing
for more flexibility in managing completion EQs and reducing the memory
overhead of driver load. Whenever a CQ is created for a given vector
index, the driver will lookup to see if there is an already mapped
completion EQ for that vector, if so, utilize it. Otherwise, allocate a
new EQ on demand and then utilize it for the CQ completion events.

Add a protection lock to the EQ table to protect from concurrent EQ
creation attempts.

While at it, replace mlx5_vector2irqn()/mlx5_vector2eqn() with
mlx5_comp_eqn_get() and mlx5_comp_irqn_get() which will allocate an
EQ on demand if no EQ is found for the given vector.

Signed-off-by: Maher Sanalla <msanalla@nvidia.com>
Reviewed-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2023-08-07 10:53:52 -07:00
Maher Sanalla
674dd4e2e0 net/mlx5: Rename mlx5_comp_vectors_count() to mlx5_comp_vectors_max()
To accurately represent its purpose, rename the function that retrieves
the value of maximum vectors from mlx5_comp_vectors_count() to
mlx5_comp_vectors_max().

Signed-off-by: Maher Sanalla <msanalla@nvidia.com>
Reviewed-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2023-08-07 10:53:51 -07:00
Yi Liu
9048c7341c vfio-iommufd: Add detach_ioas support for physical VFIO devices
This prepares for adding DETACH ioctl for physical VFIO devices.

Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Tested-by: Terrence Xu <terrence.xu@intel.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Tested-by: Matthew Rosato <mjrosato@linux.ibm.com>
Tested-by: Yanting Jiang <yanting.jiang@intel.com>
Tested-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
Tested-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Link: https://lore.kernel.org/r/20230718135551.6592-14-yi.l.liu@intel.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2023-07-25 10:19:12 -06:00
Yi Liu
71791b9246 vfio/pci: Allow passing zero-length fd array in VFIO_DEVICE_PCI_HOT_RESET
This is the way user to invoke hot-reset for the devices opened by cdev
interface. User should check the flag VFIO_PCI_HOT_RESET_FLAG_DEV_ID_OWNED
in the output of VFIO_DEVICE_GET_PCI_HOT_RESET_INFO ioctl before doing
hot-reset for cdev devices.

Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Tested-by: Yanting Jiang <yanting.jiang@intel.com>
Tested-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Link: https://lore.kernel.org/r/20230718105542.4138-11-yi.l.liu@intel.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2023-07-25 10:18:13 -06:00
Yi Liu
b56b7aabcf vfio/pci: Copy hot-reset device info to userspace in the devices loop
This copies the vfio_pci_dependent_device to userspace during looping each
affected device for reporting vfio_pci_hot_reset_info. This avoids counting
the affected devices and allocating a potential large buffer to store the
vfio_pci_dependent_device of all the affected devices before copying them
to userspace.

Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Tested-by: Yanting Jiang <yanting.jiang@intel.com>
Tested-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/20230718105542.4138-10-yi.l.liu@intel.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2023-07-25 10:18:09 -06:00
Yi Liu
9062ff405b vfio/pci: Extend VFIO_DEVICE_GET_PCI_HOT_RESET_INFO for vfio device cdev
This allows VFIO_DEVICE_GET_PCI_HOT_RESET_INFO ioctl use the iommufd_ctx
of the cdev device to check the ownership of the other affected devices.

When VFIO_DEVICE_GET_PCI_HOT_RESET_INFO is called on an IOMMUFD managed
device, the new flag VFIO_PCI_HOT_RESET_FLAG_DEV_ID is reported to indicate
the values returned are IOMMUFD devids rather than group IDs as used when
accessing vfio devices through the conventional vfio group interface.
Additionally the flag VFIO_PCI_HOT_RESET_FLAG_DEV_ID_OWNED will be reported
in this mode if all of the devices affected by the hot-reset are owned by
either virtue of being directly bound to the same iommufd context as the
calling device, or implicitly owned via a shared IOMMU group.

Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Suggested-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Tested-by: Yanting Jiang <yanting.jiang@intel.com>
Tested-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Link: https://lore.kernel.org/r/20230718105542.4138-9-yi.l.liu@intel.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2023-07-25 10:18:05 -06:00
Yi Liu
a80e1de932 vfio: Add helper to search vfio_device in a dev_set
There are drivers that need to search vfio_device within a given dev_set.
e.g. vfio-pci. So add a helper.

vfio_pci_is_device_in_set() now returns -EBUSY in commit a882c16a2b
("vfio/pci: Change vfio_pci_try_bus_reset() to use the dev_set") where
it was trying to preserve the return of vfio_pci_try_zap_and_vma_lock_cb().
However, it makes more sense to return -ENODEV.

Suggested-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Tested-by: Yanting Jiang <yanting.jiang@intel.com>
Tested-by: Terrence Xu <terrence.xu@intel.com>
Tested-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Link: https://lore.kernel.org/r/20230718105542.4138-8-yi.l.liu@intel.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2023-07-25 10:18:02 -06:00
Yi Liu
6e6c513fe1 vfio/pci: Move the existing hot reset logic to be a helper
This prepares to add another method for hot reset. The major hot reset logic
are moved to vfio_pci_ioctl_pci_hot_reset_groups().

No functional change is intended.

Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Tested-by: Yanting Jiang <yanting.jiang@intel.com>
Tested-by: Terrence Xu <terrence.xu@intel.com>
Tested-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Link: https://lore.kernel.org/r/20230718105542.4138-3-yi.l.liu@intel.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2023-07-25 10:17:42 -06:00