34

I have a laptop with a dedicated GPU, Nvidia Quadro P3200. It has 6 GB of RAM.

The laptop also has 32 GB of “normal” (CPU?) RAM.

I’m planning on using the GPU for parallel computing, running physics simulations. Some of these involve quite big arrays.

I am just wondering, if the total memory (all the variables and all the arrays) in my kernel hits 6 GB of the GPU RAM, can I somehow use the CPU’s one?

I would not be using the laptop for anything else during the computation so the main RAM should not be busy.

P.S. I am using a Dell Precision 7530, Windows 10.

3
  • 1
Maybe not for your particular setup, but historically there have been such features, e.g. TurboCache en.wikipedia.org/wiki/TurboCache Commented Apr 26, 2020 at 18:33
  • 1
Note that it is not CPU RAM per se; it is the RAM on the northbridge that is available to the CPU as well as to PCI-E (or whatever other devices are on the northbridge). The GPU used to be a PCI device on the southbridge and the bus would be limiting, but I think modern GPUs are PCI-E. Also, DDR (RAM) is not as fast as GDDR (graphics). Also, the CPU has its own on-chip memory (cache) as SRAM, which is super fast (varying between L1/L2/L3) and hardly refreshed (S=static), but really small (L3 is ~2MB/core but can be more... some L3 is ~50MB total). Calling RAM "CPU RAM" makes me think of the on-chip cache
    – vol7ron
    Commented Apr 28, 2020 at 16:30
  • To do what you're asking requires a combination of hardware (to overcome the bottleneck) and software (to do the memory management). An interesting presentation related to your question can be found here: developer.download.nvidia.com/video/gputechconf/gtc/2019/… Commented Apr 28, 2020 at 21:48

5 Answers

50

Short answer: No, you can't.

Longer answer: The bandwidth, and more importantly the latency, between the GPU and RAM over the PCIe bus are an order of magnitude worse than between the GPU and VRAM, so if you are going to do that you might as well be number crunching on the CPU.

The CPU can use a part of VRAM (the part mapped into the PCI aperture, usually 256 MB) directly as RAM, but it will be slower than regular RAM because PCIe is a bottleneck. Using it for something like swap might be feasible.

It used to be possible to increase the memory aperture size by changing the strap bits on the GPU BIOS, but I haven't tried this since Nvidia Fermi (GeForce 4xx) GPUs. If it still works, your BIOS must also be up to the task of mapping apertures bigger than standard (this is highly unlikely to have ever been tested on a laptop).

For example, a Xeon Phi compute card needs to map its entire RAM into the PCI aperture, so it needs a 64-bit capable BIOS in the host that knows how to map apertures above the traditional 4 GB (32-bit) boundary.

19
  • 12
Note that DDR5 has ~52GiB/s of bandwidth per module (say 200 GiB/s for a 4-channel setup), while PCIe 6.0 has 256GiB/s for x16. So PCIe is not going to be the bottleneck; the RAM itself will be. GDDR is faster and optimised for multiple readers, so it's obviously faster, but the RAM can still be used as a slower "cache" (like Optane DC memory works), and a simple prefetch can totally hide the latency (since the GPU works with stream tasks, this is easy). BTW the GPU can access the RAM; every PCI(e) device can be a bus master unless disabled. That's why things like the IOMMU and GART exist. Commented Apr 27, 2020 at 6:52
  • 3
    Also, the "aperture" was an AGP/32-bit thing, with a 64-bit address space, there's no problem mapping even a 1TiB block. Not all the cards allow mapping their entire memory though. Commented Apr 27, 2020 at 6:52
  • 1
@Bob UEFI firmware does not implicitly mean the ability to map apertures above 4GB. @Margaret there may be bandwidth, but latency will be much higher. Very few cards are set up to expose a large IOMEM area. The only devices I am aware of that expose their entire memory that way are the Xeon Phis. Commented Apr 27, 2020 at 8:13
  • 2
I see one expert with first-hand knowledge and years of experience, and I see one post using a screenshot that doesn't prove anything. I will trust the guy with years of experience. Commented Apr 27, 2020 at 18:01
  • 2
@MargaretBloom No, the PCIe connection is a bottleneck. The bandwidth may be high in the example you gave (which is not available in current mainstream CPUs), but it has incredibly high latency compared to directly accessing the memory, and is very low bandwidth compared to VRAM. It may be possible to hide the latency with high-throughput devices like GPUs, but the latency is much higher (at least 20x-30x) than VRAM or regular RAM.
    – Toothbrush
    Commented Apr 28, 2020 at 13:15
13

Yes. This is the "shared" memory between the CPU and GPU. There is always going to be a small amount required as a buffer to transfer data to the GPU, but it can also be used as a slower "backing" store for the graphics card, in much the same way as a pagefile is a slower backing store for your main memory.

You can see shared memory in use in the built-in Windows Task Manager by going to the Performance tab and clicking on your GPU.

[Screenshot: Task Manager Performance tab showing dedicated and shared GPU memory]

Shared memory will be slower than your GPU memory, but probably faster than your disk. Shared memory is your CPU memory, which may operate at up to 30 GB/s on a reasonably new machine, while your GPU memory can probably do 256 GB/s or more. You will also be limited by the link between your GPU and CPU, the PCIe bridge. That may be your limiting factor, and you will need to know whether you have a Gen3 or Gen4 PCIe link and how many lanes (usually "x16") it uses to work out the total theoretical bandwidth between CPU and GPU memory.

10
  • 7
    I think you will find you got that backwards. That is the buffer area managed by the CPU for staging data in and out of the GPU memory. It is managed by the code running on the CPU and not visible to the code running in the GPU. Commented Apr 26, 2020 at 9:19
  • 2
    The key point is that it is CPU rather than GPU managed. GPU doesn't have access to it, it's the CPU that uses it as a staging area for pushing data to and from the GPU. Commented Apr 26, 2020 at 9:29
  • 1
    @SuperCiocia it depends on your use case, how you access it and broadly what exactly you mean by can the GPU use it. The GPU itself cannot access CPU RAM, but it can be made to look like it is able to use the CPU RAM by the software running on the CPU. Effectively your CPU can take program memory that would be on your GPU and replace it with the contents of memory in local RAM. In that way you have more (albeit slower) memory than what is just on the GPU.
    – Mokubai
    Commented Apr 26, 2020 at 21:13
  • 2
    The GPU itself cannot access CPU RAM - Are you sure that's true? A PCIe device can do transactions that reads or writes 64 bytes from any physical address. Or are you making a distinction between the actual GPU processing chip on the graphics card having a memory bus directly connected to the DRAM on the graphics card, vs. having to make PCIe transactions for addresses that are CPU physical memory. (However the GPU's own internal address-space is configured.) Commented Apr 26, 2020 at 23:18
  • 5
    @SuperCiocia: if the GPU drivers let you do that it might be possible in theory, but performance would fall off a cliff from going over the PCIe bus to get to system DRAM instead of using the onboard DRAM over a very fast / wide GDDR5 bus. So there's good reason for them to not make this possible because it's generally not going to be useful. Managing which data is in GPU memory when is something we're probably stuck with if we want good performance. That's one of the major reasons graphics cards have their own RAM in the first place. Commented Apr 26, 2020 at 23:21
12

As far as I know, you can share the host's RAM as long as it is page-locked (pinned) memory. In that case, data transfer will be much faster because you don't need to explicitly transfer data; you just need to make sure that you synchronize your work (with cudaDeviceSynchronize, for instance, if using CUDA).

Now, for this question:

I am just wondering, if the total memory (all the variables and all the arrays) in my kernel hits 6 GB of the GPU RAM, can I somehow use the CPU’s one?

I don't know if there is a way to "extend" the GPU memory. I don't think the GPU can use pinned memory that is bigger than its own, but I am not certain. What I think you could do in this case is to work in batches. Can your work be distributed so that you only work on 6 GB at a time, save the result, and work on another 6 GB? If so, working in batches might be a solution.

For example, you could implement a simple batching scheme like this:

#include <cuda_runtime.h>

// Illustrative stubs for this sketch: a trivial kernel and the
// host-side routines that stage data in and out of the pinned buffer.
__global__ void kernel(float *data) {
    data[threadIdx.x] *= 2.0f;  // placeholder work, one element per thread
}

void populate_data(float *ptr) { /* read from another array in RAM */ }
void save_data(const float *ptr) { /* write to another array in RAM */ }

int main() {

    float *hst_ptr = nullptr;
    float *dev_ptr = nullptr;
    size_t ns = 128;  // 128 elements in this example
    size_t data_size = ns * sizeof(*hst_ptr);

    // allow the device to map page-locked host memory (zero-copy access)
    cudaSetDeviceFlags(cudaDeviceMapHost);
    cudaHostAlloc((void**)&hst_ptr, data_size, cudaHostAllocMapped);
    cudaHostGetDevicePointer(&dev_ptr, hst_ptr, 0);

    // say that we want to work on 4 batches of 128 elements
    for (size_t cnt = 0; cnt < 4; ++cnt) {
        populate_data(hst_ptr);      // read from another array in RAM
        kernel<<<1, ns>>>(dev_ptr);  // the kernel reads host memory over PCIe
        cudaDeviceSynchronize();     // wait for the batch before reusing the buffer
        save_data(hst_ptr);          // write to another array in RAM
    }

    cudaFreeHost(hst_ptr);
    return 0;
}
6
  • 2
This answer is wrong. Using pinned memory you do need to explicitly transfer the data. It's faster because you are telling the OS that it cannot move the memory around for optimization reasons, that it is "pinned" in place; therefore memory transfers are much faster, as they do not need to query the CPU/OS to know whether the memory is still in the same place. You may be thinking of managed memory in CUDA, which you do not need to explicitly move, but that does not make it faster. And yes, you can pin the entire CPU memory, regardless of its size. In your code, you show managed, not pinned memory Commented Apr 27, 2020 at 12:09
  • @AnderBiguri pinned memory is accessible from the device as it is locked in place, so data does not need to be transferred explicitly, but work must be synchronized. Regarding the managed memory, are you talking about Unified Memory? In that case, memory does not need to be pinned, and it is allocated with cudaMallocManaged instead. The usage is similar, though. Could you explain a little bit more? Commented Apr 27, 2020 at 18:59
  • @AnderBiguri I am not sure the OS will let you pin the entire CPU memory, but a huge chunk of it, maybe. What I said I was not sure about is whether the GPU can access all the memory pinned by the host if it is bigger than its own, not whether the host could pin all its memory. Maybe I was not clear enough Commented Apr 27, 2020 at 19:01
  • I work with this technology and I have pinned 250 GB of RAM just this week, out of 256 GB. Apologies, I made a mistake and mixed up the Managed and Pinned memory call functions, and I thought you were using cudaMallocManaged in your code. If instead of using cudaHostGetDevicePointer you use cudaMemcpyAsync safely, you can use all the CPU RAM. Apologies, as I misunderstood some things in your code regarding managed/unified memory Commented Apr 27, 2020 at 21:34
  • 1
    @AnderBiguri I didn't know we could use cudaMemcpyAsync to do that. I'll have to give it a try, thanks for letting me know :) Commented Apr 28, 2020 at 14:39
4

Any GPU can use system RAM when running out of its own VRAM.

In a similar manner to a system running out of RAM and paging excess data to storage (SSD/HDD), modern GPUs can and will pull textures or other data from system RAM. Texture data can be used from system RAM over the PCIe bus to make up for the lack of faster VRAM.

Since system RAM is a few times slower than VRAM and has much higher latency, running out of VRAM translates into a performance loss, and performance will also be limited by the PCIe bandwidth.

So it's not a matter of whether it is possible or not; it's a matter of the performance penalty when doing it.

Also note that many integrated GPUs use system RAM and do not even have their own.

In the case of GPUs, the main factor in their performance is the software. Well-designed software will use the GPU near its peak FLOPS limit, while badly designed software will not. Usually, computing and hashing software falls into the first category. The same goes for allocating VRAM.

5
  • 6
    this depends on the GPU and the driver. Many GPU architectures simply can't do certain operations when reading from system memory. E.g. sampling textures. OS libraries may mitigate that by swapping in and out buffers/textures, but a blanket "All GPUs can use system memory" is missing nuance and context.
    – PeterT
    Commented Apr 28, 2020 at 11:59
  • The software represents the context; that is why I detailed that part as well. Of course you can't do all operations, but that does not mean the RAM will not be used.
    – Overmind
    Commented Apr 28, 2020 at 12:01
  • If it's a matter of software, then maybe you should say "Any GPU can use system RAM through software".
    – clemisch
    Commented Apr 28, 2020 at 12:02
  • Agreed, updated.
    – Overmind
    Commented Apr 28, 2020 at 12:03
  • Not all CPUs support an IOMMU, which I think would be needed for paging VRAM data. Of course, with a complex enough software solution you can emulate that, but the hardware rarely supports it natively. Commented Feb 22 at 20:05
0

This question is currently a top search result when using keywords like: Can games use RAM instead of VRAM?

So, I thought it was worth adding that a lot of issues related to game RAM vs. VRAM usage have changed with the Smart Access Memory technology, which is currently supported by AMD Zen 3 CPUs (like the Ryzen 5 5600X and Ryzen 7 5800X) and AMD 6000 series GPUs (like the AMD Radeon RX 6800), and will be supported within the next few weeks by the Nvidia RTX 3000 series GPUs, and later on by 11th-gen Intel CPUs. Intel's version of the technology, and the name used even on some AMD motherboards, is Resizable BAR.

The technology essentially provides more VRAM access to the CPU, but it remains to be seen if things will eventually work the opposite way, too, where the GPU can access more of the RAM.
