3

I want all read & write requests to a PCIe device to be cached by CPU caches. However, it does not work as I expected.

These are my assumptions on write-back MMIO regions.

  1. Writes to the PCIe device happen only on cache write-back.
  2. The size of TLP payloads is cache block size (64B).

However, captured TLPs do not follow my assumptions.

  1. Writes to the PCIe device happen on every write to the MMIO region.
  2. The size of TLP payloads is 1B.

I write 8-byte of 0xff to the MMIO region with the following user space program & device driver.

Part of User Program

struct pcie_ioctl ioctl_control;
ioctl_control.bar_select = BAR_ID;
ioctl_control.num_bytes_to_write = atoi(argv[1]);
if (ioctl(fd, IOCTL_WRITE_0xFF, &ioctl_control) < 0) {
    printf("ioctl failed\n");
}

Part of Device Driver

case IOCTL_WRITE_0xFF:
{
    int i;
    char *buff;
    struct pci_cdev_struct *pci_cdev = pci_get_drvdata(fpga_pcie_dev.pci_device);
    copy_from_user(&ioctl_control, (void __user *)arg, sizeof(ioctl_control));
    buff = kmalloc(sizeof(char) * ioctl_control.num_bytes_to_write, GFP_KERNEL);
    for (i = 0; i < ioctl_control.num_bytes_to_write; i++) {
        buff[i] = 0xff;
    }
    memcpy(pci_cdev->bar[ioctl_control.bar_select], buff, ioctl_control.num_bytes_to_write);
    kfree(buff);
    break;
}

I modified MTRRs to make the corresponding MMIO region write-back. The MMIO region starts from 0x0c7300000, and the length is 0x100000 (1MB). Followings are cat /proc/mtrr results for different policies. Please note that I made each region exclusive.

Uncacheable

reg00: base=0x080000000 ( 2048MB), size= 1024MB, count=1: uncachable
reg01: base=0x380000000000 (58720256MB), size=524288MB, count=1: uncachable
reg02: base=0x0c0000000 ( 3072MB), size=   64MB, count=1: uncachable
reg03: base=0x0c4000000 ( 3136MB), size=   32MB, count=1: uncachable
reg04: base=0x0c6000000 ( 3168MB), size=   16MB, count=1: uncachable
reg05: base=0x0c7000000 ( 3184MB), size=    1MB, count=1: uncachable
reg06: base=0x0c7100000 ( 3185MB), size=    1MB, count=1: uncachable
reg07: base=0x0c7200000 ( 3186MB), size=    1MB, count=1: uncachable
reg08: base=0x0c7300000 ( 3187MB), size=    1MB, count=1: uncachable
reg09: base=0x0c7400000 ( 3188MB), size=    1MB, count=1: uncachable

Write-combining

reg00: base=0x080000000 ( 2048MB), size= 1024MB, count=1: uncachable
reg01: base=0x380000000000 (58720256MB), size=524288MB, count=1: uncachable
reg02: base=0x0c0000000 ( 3072MB), size=   64MB, count=1: uncachable
reg03: base=0x0c4000000 ( 3136MB), size=   32MB, count=1: uncachable
reg04: base=0x0c6000000 ( 3168MB), size=   16MB, count=1: uncachable
reg05: base=0x0c7000000 ( 3184MB), size=    1MB, count=1: uncachable
reg06: base=0x0c7100000 ( 3185MB), size=    1MB, count=1: uncachable
reg07: base=0x0c7200000 ( 3186MB), size=    1MB, count=1: uncachable
reg08: base=0x0c7300000 ( 3187MB), size=    1MB, count=1: write-combining
reg09: base=0x0c7400000 ( 3188MB), size=    1MB, count=1: uncachable

Write-back

reg00: base=0x080000000 ( 2048MB), size= 1024MB, count=1: uncachable
reg01: base=0x380000000000 (58720256MB), size=524288MB, count=1: uncachable
reg02: base=0x0c0000000 ( 3072MB), size=   64MB, count=1: uncachable
reg03: base=0x0c4000000 ( 3136MB), size=   32MB, count=1: uncachable
reg04: base=0x0c6000000 ( 3168MB), size=   16MB, count=1: uncachable
reg05: base=0x0c7000000 ( 3184MB), size=    1MB, count=1: uncachable
reg06: base=0x0c7100000 ( 3185MB), size=    1MB, count=1: uncachable
reg07: base=0x0c7200000 ( 3186MB), size=    1MB, count=1: uncachable
reg08: base=0x0c7300000 ( 3187MB), size=    1MB, count=1: write-back
reg09: base=0x0c7400000 ( 3188MB), size=    1MB, count=1: uncachable

Followings are waveform captures for 8B write with different policies. I have used integrated logic analyzer (ILA) to capture these waveform. Please watch pcie_endpoint_litepcietlpdepacketizer_tlp_req_payload_dat when pcie_endpoint_litepcietlpdepacketizer_tlp_req_valid is set. You can count the number of packets by counting pcie_endpoint_litepcietlpdepacketizer_tlp_req_valid in these waveform example.

  1. Uncacheable: link -> correct, 1B x 8 packets
  2. Write-combining: link -> correct, 8B x 1 packet
  3. Write-back: link -> unexpected, 1B x 8 packets

System configuration is like below.

  • CPU: Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz
  • OS: Linux kernel 4.15.0-38
  • PCIe Device: Xilinx FPGA KC705 programmed with litepcie

Related Links

  1. Generating a 64-byte read PCIe TLP from an x86 CPU
  2. How to Implement a 64B PCIe* Burst Transfer on Intel® Architecture
  3. Write Combining Buffer Out of Order Writes and PCIe
  4. Do Ryzen support write-back caching for Memory Mapped IO (through PCIe interface)?
  5. MTRR (Memory Type Range Register) control
  6. PATting Linux
  7. Down to the TLP: How PCI express devices talk (Part I)
2
  • 1
    Is it possible that something else (like PAT) is overriding the MTRR setting and making it actually be UC instead of WB? An BTW, the single-byte transactions might be from mempcy being implemented as rep movsb inside the kernel (because your CPU is new enough for ERMSB which makes rep movsb fairly good.) Commented Nov 15, 2018 at 1:48
  • 1
    @PeterCordes Thanks, Peter. I could find a conflict in PAT. The region is set as uncached-minus in PAT. I will try to solve it. Commented Nov 15, 2018 at 5:22

1 Answer 1

2

In short, it seems that mapping MMIO region write-back does not work by design.

Please upload an answer if anyone finds that it is possible.

I came to find John McCalpin's articles and answers. First, mapping MMIO region write-back is not possible. Second, workaround is possible on some processors.

  1. Mapping MMIO region write-back is not possible

    Quote from this link

    FYI: The WB type will not work with memory-mapped IO. You can program the bits to set up the mapping as WB, but the system will crash as soon as it gets a transaction that it does not know how to handle. It is theoretically possible to use WP or WT to get cached reads from MMIO, but coherence has to be handled in software.

    Quote from this link

    Only when I set both PAT and MTRR to WB does the kernel crash

  2. Workaround is possible on some processors

    Notes on Cached Access to Memory-Mapped IO Regions, John McCalpin

    There is one set of mappings that can be made to work on at least some x86-64 processors, and it is based on mapping the MMIO space twice. Map the MMIO range with a set of attributes that allow write-combining stores (but only uncached reads). Map the MMIO range a second time with a set of attributes that allow cache-line reads (but only uncached, non-write-combined stores).

1
  • 1
    I wonder if anything changes with Skylake-SP where 64-byte stores are possible to UC MMIO regions with a single AVX512 instruction. But probably cache eviction is still different from a movntps [rdi], zmm0, or movaps on a UC memory region. Commented Nov 15, 2018 at 11:06

Not the answer you're looking for? Browse other questions tagged or ask your own question.