powerpc/powernv/sriov: Explain how SR-IOV works on PowerNV

SR-IOV support on PowerNV is a byzantine maze of hooks. I have no idea
how anyone is supposed to know how it works except through a lot of
stuffering. Write up some docs about the overall story to help out
the next sucker^Wperson who needs to tinker with it.

Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200722065715.1432738-6-oohall@gmail.com
This commit is contained in:
Oliver O'Halloran 2020-07-22 16:57:05 +10:00 committed by Michael Ellerman
parent 37b59ef08c
commit ff79e11af0

View File

@ -12,6 +12,136 @@
/* for pci_dev_is_added() */
#include "../../../../drivers/pci/pci.h"
/*
* The majority of the complexity in supporting SR-IOV on PowerNV comes from
* the need to put the MMIO space for each VF into a separate PE. Internally
* the PHB maps MMIO addresses to a specific PE using the "Memory BAR Table".
* The MBT historically only applied to the 64bit MMIO window of the PHB
* so it's common to see it referred to as the "M64BT".
*
* An MBT entry stores the mapped range as an <base>,<mask> pair. This forces
* the address range that we want to map to be power-of-two sized and aligned.
* For conventional PCI devices this isn't really an issue since PCI device BARs
* have the same requirement.
*
* For a SR-IOV BAR things are a little more awkward since size and alignment
* are not coupled. The alignment is set based on the the per-VF BAR size, but
* the total BAR area is: number-of-vfs * per-vf-size. The number of VFs
* isn't necessarily a power of two, so neither is the total size. To fix that
* we need to finesse (read: hack) the Linux BAR allocator so that it will
* allocate the SR-IOV BARs in a way that lets us map them using the MBT.
*
* The changes to size and alignment that we need to do depend on the "mode"
* of MBT entry that we use. We only support SR-IOV on PHB3 (IODA2) and above,
* so as a baseline we can assume that we have the following BAR modes
* available:
*
* NB: $PE_COUNT is the number of PEs that the PHB supports.
*
* a) A segmented BAR that splits the mapped range into $PE_COUNT equally sized
* segments. The n'th segment is mapped to the n'th PE.
* b) An un-segmented BAR that maps the whole address range to a specific PE.
*
*
* We prefer to use mode a) since it only requires one MBT entry per SR-IOV BAR
* For comparison b) requires one entry per-VF per-BAR, or:
* (num-vfs * num-sriov-bars) in total. To use a) we need the size of each segment
* to equal the size of the per-VF BAR area. So:
*
* new_size = per-vf-size * number-of-PEs
*
* The alignment for the SR-IOV BAR also needs to be changed from per-vf-size
* to "new_size", calculated above. Implementing this is a convoluted process
* which requires several hooks in the PCI core:
*
* 1. In pcibios_add_device() we call pnv_pci_ioda_fixup_iov().
*
* At this point the device has been probed and the device's BARs are sized,
* but no resource allocations have been done. The SR-IOV BARs are sized
* based on the maximum number of VFs supported by the device and we need
* to increase that to new_size.
*
* 2. Later, when Linux actually assigns resources it tries to make the resource
* allocations for each PCI bus as compact as possible. As a part of that it
* sorts the BARs on a bus by their required alignment, which is calculated
* using pci_resource_alignment().
*
* For IOV resources this goes:
* pci_resource_alignment()
* pci_sriov_resource_alignment()
* pcibios_sriov_resource_alignment()
* pnv_pci_iov_resource_alignment()
*
* Our hook overrides the default alignment, equal to the per-vf-size, with
* new_size computed above.
*
* 3. When userspace enables VFs for a device:
*
* sriov_enable()
* pcibios_sriov_enable()
* pnv_pcibios_sriov_enable()
*
* This is where we actually allocate PE numbers for each VF and setup the
* MBT mapping for each SR-IOV BAR. In steps 1) and 2) we setup an "arena"
* where each MBT segment is equal in size to the VF BAR so we can shift
* around the actual SR-IOV BAR location within this arena. We need this
* ability because the PE space is shared by all devices on the same PHB.
* When using mode a) described above segment 0 in maps to PE#0 which might
* be already being used by another device on the PHB.
*
* As a result we need allocate a contigious range of PE numbers, then shift
* the address programmed into the SR-IOV BAR of the PF so that the address
* of VF0 matches up with the segment corresponding to the first allocated
* PE number. This is handled in pnv_pci_vf_resource_shift().
*
* Once all that is done we return to the PCI core which then enables VFs,
* scans them and creates pci_devs for each. The init process for a VF is
* largely the same as a normal device, but the VF is inserted into the IODA
* PE that we allocated for it rather than the PE associated with the bus.
*
* 4. When userspace disables VFs we unwind the above in
* pnv_pcibios_sriov_disable(). Fortunately this is relatively simple since
* we don't need to validate anything, just tear down the mappings and
* move SR-IOV resource back to its "proper" location.
*
* That's how mode a) works. In theory mode b) (single PE mapping) is less work
* since we can map each individual VF with a separate BAR. However, there's a
* few limitations:
*
* 1) For IODA2 mode b) has a minimum alignment requirement of 32MB. This makes
* it only usable for devices with very large per-VF BARs. Such devices are
* similar to Big Foot. They definitely exist, but I've never seen one.
*
* 2) The number of MBT entries that we have is limited. PHB3 and PHB4 only
* 16 total and some are needed for. Most SR-IOV capable network cards can support
* more than 16 VFs on each port.
*
* We use b) when using a) would use more than 1/4 of the entire 64 bit MMIO
* window of the PHB.
*
*
*
* PHB4 (IODA3) added a few new features that would be useful for SR-IOV. It
* allowed the MBT to map 32bit MMIO space in addition to 64bit which allows
* us to support SR-IOV BARs in the 32bit MMIO window. This is useful since
* the Linux BAR allocation will place any BAR marked as non-prefetchable into
* the non-prefetchable bridge window, which is 32bit only. It also added two
* new modes:
*
* c) A segmented BAR similar to a), but each segment can be individually
* mapped to any PE. This is matches how the 32bit MMIO window worked on
* IODA1&2.
*
* d) A segmented BAR with 8, 64, or 128 segments. This works similarly to a),
* but with fewer segments and configurable base PE.
*
* i.e. The n'th segment maps to the (n + base)'th PE.
*
* The base PE is also required to be a multiple of the window size.
*
* Unfortunately, the OPAL API doesn't currently (as of skiboot v6.6) allow us
* to exploit any of the IODA3 features.
*/
static void pnv_pci_ioda_fixup_iov_resources(struct pci_dev *pdev)
{