5

I have below IOMMU groups and devices.

$ for a in /sys/kernel/iommu_groups/*; do find $a -type l; done | sort --version-sort
/sys/kernel/iommu_groups/0/devices/0000:00:00.0
/sys/kernel/iommu_groups/1/devices/0000:00:02.0
/sys/kernel/iommu_groups/2/devices/0000:00:04.0
/sys/kernel/iommu_groups/3/devices/0000:00:14.0
/sys/kernel/iommu_groups/3/devices/0000:00:14.2
/sys/kernel/iommu_groups/4/devices/0000:00:15.0
/sys/kernel/iommu_groups/4/devices/0000:00:15.1
/sys/kernel/iommu_groups/5/devices/0000:00:16.0
/sys/kernel/iommu_groups/6/devices/0000:00:17.0
/sys/kernel/iommu_groups/7/devices/0000:00:1c.0
/sys/kernel/iommu_groups/7/devices/0000:00:1c.7
/sys/kernel/iommu_groups/7/devices/0000:01:00.0
/sys/kernel/iommu_groups/7/devices/0000:02:00.0
/sys/kernel/iommu_groups/8/devices/0000:00:1f.0
/sys/kernel/iommu_groups/8/devices/0000:00:1f.2
/sys/kernel/iommu_groups/8/devices/0000:00:1f.3
/sys/kernel/iommu_groups/8/devices/0000:00:1f.4

I would like to isolate that specific device, /sys/kernel/iommu_groups/7/devices/0000:01:00.0, into its own group with no other devices in the same group.

How can we isolate a single device into separate IOMMU group for PCI passthrough for KVM virtual machine?

2 Answers 2

3

I know that this is an old question, but I had to try and figure this out recently.

The rule of thumb with IOMMU is that the kernel is going to figure out the mappings for you. When the kernel boots it is going to look for what devices can map to an I/O virtual map (IOVA). If the devices have the same IOVA then they end up in the same group. This is done to guarantee that each group has devices that can actually be addressed and talked to separately.

There are a couple of solutions. The first being that you could try moving the card to another position on the mother board. If it is a PCI and not a PCIe card then you are probably out of luck since all the PCI ports are probably mapped to the same PCIe bridge, and therefore will share the same IOVA.

If you really still need to do it then you can take all the devices that are in the same group and assign them all to vfio-pci and then you can do the assignments afterwards to where they devices need to go.

On my machine for example group 13 has a bunch of devices including an extra video card (18:00.) Here is my output from that directory:

root@rwl01:/sys/kernel/iommu_groups/13/devices# ll
total 0
drwxr-xr-x 2 root root 0 Feb 15 15:43 .
drwxr-xr-x 3 root root 0 Feb 15 15:43 ..
lrwxrwxrwx 1 root root 0 Feb 15 15:43 0000:03:00.0 -> ../../../../devices/pci0000:00/0000:00:01.3/0000:03:00.0
lrwxrwxrwx 1 root root 0 Feb 15 15:43 0000:03:00.1 -> ../../../../devices/pci0000:00/0000:00:01.3/0000:03:00.1
lrwxrwxrwx 1 root root 0 Feb 15 15:43 0000:03:00.2 -> ../../../../devices/pci0000:00/0000:00:01.3/0000:03:00.2
lrwxrwxrwx 1 root root 0 Feb 15 15:43 0000:16:00.0 -> ../../../../devices/pci0000:00/0000:00:01.3/0000:03:00.2/0000:16:00.0
lrwxrwxrwx 1 root root 0 Feb 15 15:43 0000:16:01.0 -> ../../../../devices/pci0000:00/0000:00:01.3/0000:03:00.2/0000:16:01.0
lrwxrwxrwx 1 root root 0 Feb 15 15:43 0000:16:02.0 -> ../../../../devices/pci0000:00/0000:00:01.3/0000:03:00.2/0000:16:02.0
lrwxrwxrwx 1 root root 0 Feb 15 15:43 0000:16:03.0 -> ../../../../devices/pci0000:00/0000:00:01.3/0000:03:00.2/0000:16:03.0
lrwxrwxrwx 1 root root 0 Feb 15 15:43 0000:16:04.0 -> ../../../../devices/pci0000:00/0000:00:01.3/0000:03:00.2/0000:16:04.0
lrwxrwxrwx 1 root root 0 Feb 15 15:43 0000:16:08.0 -> ../../../../devices/pci0000:00/0000:00:01.3/0000:03:00.2/0000:16:08.0
lrwxrwxrwx 1 root root 0 Feb 15 15:43 0000:17:00.0 -> ../../../../devices/pci0000:00/0000:00:01.3/0000:03:00.2/0000:16:00.0/0000:17:00.0
lrwxrwxrwx 1 root root 0 Feb 15 15:43 0000:18:00.0 -> ../../../../devices/pci0000:00/0000:00:01.3/0000:03:00.2/0000:16:01.0/0000:18:00.0
lrwxrwxrwx 1 root root 0 Feb 15 15:43 0000:18:00.1 -> ../../../../devices/pci0000:00/0000:00:01.3/0000:03:00.2/0000:16:01.0/0000:18:00.1
lrwxrwxrwx 1 root root 0 Feb 15 15:43 0000:19:00.0 -> ../../../../devices/pci0000:00/0000:00:01.3/0000:03:00.2/0000:16:02.0/0000:19:00.0
lrwxrwxrwx 1 root root 0 Feb 15 15:43 0000:1a:00.0 -> ../../../../devices/pci0000:00/0000:00:01.3/0000:03:00.2/0000:16:03.0/0000:1a:00.0
lrwxrwxrwx 1 root root 0 Feb 15 15:43 0000:1b:00.0 -> ../../../../devices/pci0000:00/0000:00:01.3/0000:03:00.2/0000:16:04.0/0000:1b:00.0
lrwxrwxrwx 1 root root 0 Feb 15 15:43 0000:1b:00.1 -> ../../../../devices/pci0000:00/0000:00:01.3/0000:03:00.2/0000:16:04.0/0000:1b:00.1
lrwxrwxrwx 1 root root 0 Feb 15 15:43 0000:1c:00.0 -> ../../../../devices/pci0000:00/0000:00:01.3/0000:03:00.2/0000:16:08.0/0000:1c:00.0

As you can see the directory is a bunch of links. Here is the chain on how things are connected:

root@rwl01:/sys/kernel/iommu_groups/13/devices# lspci | grep -E '00:01.3|03:00.2|16:01.0'
00:01.3 PCI bridge: Advanced Micro Devices, Inc. [AMD] Device 1453
03:00.2 PCI bridge: Advanced Micro Devices, Inc. [AMD] Device 43b0 (rev 02)
16:01.0 PCI bridge: Advanced Micro Devices, Inc. [AMD] Device 43b4 (rev 02)

For me all these devices are on the same bridge, my extra video card, raid controller etc. You cannot separate these easily.

BUT.... you can

You will need to apply the https://queuecumber.gitlab.io/linux-acs-override/ (ACS Override Kernel patch) This will allow you to use command line parameters to expose parts of groups are their own groups. After you install the patch you can then configure what you would like kernel command line parameters:

pcie_acs_override =
        [PCIE] Override missing PCIe ACS support for:
    downstream
        All downstream ports - full ACS capabilties
    multifunction
        All multifunction devices - multifunction ACS subset
    id:nnnn:nnnn
        Specfic device - full ACS capabilities
        Specified as vid:did (vendor/device ID) in hex

From here you should be able to make the device be in its own group and you should be off to the races. There are issues with this method:

Here are a couple of good links:

2

I know the original question is from 2019, but by the sounds of it, it might be a lot easier with more recent kernels.

For anyone having any issues, I wanted to pass through a NIC on a Broadcom BCM5709 4-port NetXtreme II Gigabit Ethernet card, which was plugged in to a HP Z600. I had already turned all virtualisation, VT-d, and the like in the BIOS.

My issue was that if I pass through one NIC on the 4-port, nothing in Group 12 was accessible to the host, including the internal NIC I was using to connect to it.


This is a useful section to give some basics on how to edit the Kernel commandline https://pve.proxmox.com/pve-docs/chapter-sysadmin.html#sysboot_edit_kernel_cmdline

Grub

The kernel commandline needs to be placed in the variable GRUB_CMDLINE_LINUX_DEFAULT in the file /etc/default/grub. Running update-grub appends its content to all linux entries in /boot/grub/grub.cfg.

Systemd-boot

The kernel commandline needs to be placed as one line in /etc/kernel/cmdline. To apply your changes, run proxmox-boot-tool refresh [not sure what would be used without Proxmox], which sets it as the option line for all config files in loader/entries/proxmox-*.conf.


Then, this page gives some details about PCI Passthrough, primarily focussed at Proxmox, but the principle should still be the same: https://pve.proxmox.com/wiki/PCI_Passthrough


THIS VIDEO HELPED ME THE MOST! and was really useful and thorough in walking you through the process. It is for a specific implementation, but everything shown can be accomplished using the above and below information

https://www.youtube.com/watch?v=qQiMMeVNw-o&ab_channel=SpaceinvaderOne


Finally, I can't remember where I got it from, but this is a really useful command to identify all the IOMMU Groups for all the devices (run as root):

for d in $(find /sys/kernel/iommu_groups/ -type l | sort -n -k5 -t/); do n=${d#*/iommu_groups/*}; n=${n%%/*}; printf 'IOMMU Group %s ' "$n"; lspci -nns "${d##*/}"; done;

For me, I originally started with GRUB_CMDLINE_LINUX_DEFAULT="quiet"


I enabled the basic enable IOMMU GRUB_CMDLINE_LINUX_DEFAULT="quiet intel_iommu=on"

This gave me:

IOMMU Group 12 00:1c.0 PCI bridge [0604]: Intel Corporation 82801JI (ICH10 Family) PCI Express Root Port 1 [8086:3a40]
IOMMU Group 12 00:1c.5 PCI bridge [0604]: Intel Corporation 82801JI (ICH10 Family) PCI Express Root Port 6 [8086:3a4a]
IOMMU Group 12 01:00.0 Ethernet controller [0200]: Broadcom Inc. and subsidiaries NetXtreme BCM5764M Gigabit Ethernet PCIe [14e4:1684] (rev 10)
IOMMU Group 12 1c:00.0 PCI bridge [0604]: Microsemi / PMC / IDT PES12T3G2 PCI Express Gen2 Switch [111d:8061] (rev 01)
IOMMU Group 12 1d:02.0 PCI bridge [0604]: Microsemi / PMC / IDT PES12T3G2 PCI Express Gen2 Switch [111d:8061] (rev 01)
IOMMU Group 12 1d:04.0 PCI bridge [0604]: Microsemi / PMC / IDT PES12T3G2 PCI Express Gen2 Switch [111d:8061] (rev 01)
IOMMU Group 12 1e:00.0 Ethernet controller [0200]: Broadcom Inc. and subsidiaries NetXtreme II BCM5709 Gigabit Ethernet [14e4:1639] (rev 20)
IOMMU Group 12 1e:00.1 Ethernet controller [0200]: Broadcom Inc. and subsidiaries NetXtreme II BCM5709 Gigabit Ethernet [14e4:1639] (rev 20)
IOMMU Group 12 1f:00.0 Ethernet controller [0200]: Broadcom Inc. and subsidiaries NetXtreme II BCM5709 Gigabit Ethernet [14e4:1639] (rev 20)
IOMMU Group 12 1f:00.1 Ethernet controller [0200]: Broadcom Inc. and subsidiaries NetXtreme II BCM5709 Gigabit Ethernet [14e4:1639] (rev 20)

I've removed everything except the group I'm interested in, Group 12. You can see the 4 ports of the NIC, but also another NIC, the onboard one. Unfortunately, if I pass through one NIC, nothing in Group 12 is accessible to the host.


I found a working mechanism which splits the NIC in two, using GRUB_CMDLINE_LINUX_DEFAULT="quiet intel_iommu=on pcie_acs_override=downstream"

This gave me

IOMMU Group 12 00:1c.0 PCI bridge [0604]: Intel Corporation 82801JI (ICH10 Family) PCI Express Root Port 1 [8086:3a40]
IOMMU Group 13 00:1c.5 PCI bridge [0604]: Intel Corporation 82801JI (ICH10 Family) PCI Express Root Port 6 [8086:3a4a]
IOMMU Group 18 1c:00.0 PCI bridge [0604]: Microsemi / PMC / IDT PES12T3G2 PCI Express Gen2 Switch [111d:8061] (rev 01)
IOMMU Group 19 1d:02.0 PCI bridge [0604]: Microsemi / PMC / IDT PES12T3G2 PCI Express Gen2 Switch [111d:8061] (rev 01)
IOMMU Group 20 1d:04.0 PCI bridge [0604]: Microsemi / PMC / IDT PES12T3G2 PCI Express Gen2 Switch [111d:8061] (rev 01)
IOMMU Group 21 1e:00.0 Ethernet controller [0200]: Broadcom Inc. and subsidiaries NetXtreme II BCM5709 Gigabit Ethernet [14e4:1639] (rev 20)
IOMMU Group 21 1e:00.1 Ethernet controller [0200]: Broadcom Inc. and subsidiaries NetXtreme II BCM5709 Gigabit Ethernet [14e4:1639] (rev 20)
IOMMU Group 22 1f:00.0 Ethernet controller [0200]: Broadcom Inc. and subsidiaries NetXtreme II BCM5709 Gigabit Ethernet [14e4:1639] (rev 20)
IOMMU Group 22 1f:00.1 Ethernet controller [0200]: Broadcom Inc. and subsidiaries NetXtreme II BCM5709 Gigabit Ethernet [14e4:1639] (rev 20)
IOMMU Group 23 01:00.0 Ethernet controller [0200]: Broadcom Inc. and subsidiaries NetXtreme BCM5764M Gigabit Ethernet PCIe [14e4:1684] (rev 10)

You can see that it's split the 4-port NIC into 2 pairs, and neither is in a group with the on-board NIC. I suspect each pair is on the same bus (excuse the terminology) on the card as each other, so that's probably the smallest split I could get (I did try other mechanisms, e.g. isolating one or more bridges as outlined in the video, but they were worse than the above).

Hope this might be of some use to anyone having the same issue.

You must log in to answer this question.

Not the answer you're looking for? Browse other questions tagged .