$\begingroup$

I have a vertex shader happily producing all the vertices I want for a 2D plot. Now I want to also render a plot of the same data, but each point relative to the previous one, like SVG Paths using l instead of L.

I have tried a geometry shader, but the primitive would have to be a line strip with all the vertices -- 500,000 or more.

I'm pretty sure a vertex shader can't use the output of the previous instance, and computing the sum of 1..n is difficult to parallelise.

The only method I can think of is:

  1. Use a compute shader to calculate the data
  2. Calculate cumulative sums with CPU
  3. Send data to vertex shader that does very little
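
Step 2 of that plan would be a plain sequential loop on the CPU. A minimal sketch (assuming the deltas are already in a CPU-side vector; `cumulativeSum` is a made-up name):

```cpp
#include <cstddef>
#include <vector>

// Step 2: turn per-point deltas (relative, like SVG's 'l') into
// absolute positions (like 'L') with a sequential running sum.
void cumulativeSum(std::vector<float>& buff)
{
    for (std::size_t n = 1; n < buff.size(); ++n)
        buff[n] += buff[n - 1]; // buff_new[n] = buff_old[n] + buff_new[n-1]
}
```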

Is there a way of doing this with a vertex, geometry or mesh shader? Or with an extension that might help?

It seems like something basic enough, and useful enough, that it was included in SVG.

What is the best method to tackle this?

$\endgroup$
  • $\begingroup$ "computing the sum of 1..n is difficult to parallelise": what do you mean? A prefix sum in parallel is no real problem: compute the local prefix sums, then globalize and adjust the local sums. $\endgroup$
    – user1703
    Commented Jun 1, 2023 at 12:58
  • $\begingroup$ Best method in what sense? $\endgroup$
    – user1703
    Commented Jun 1, 2023 at 12:59
  • $\begingroup$ Thanks @Yves. By cumulative sum, I mean buff_new[n] = Sum(m=1..n, buff_old[m]), or buff_new[n] = buff_old[n] + buff_new[n-1], which means that calculating buff_new[n] needs to wait for buff_new[n-1] to be available. Hence, difficult to parallelise. $\endgroup$
    – Nick
    Commented Jun 1, 2023 at 18:45
  • $\begingroup$ Nevertheless, it would be useful if the GPU can do this, as I wouldn't need to copy buffer back to CPU, calculate, then send to GPU again. $\endgroup$
    – Nick
    Commented Jun 1, 2023 at 18:47
  • $\begingroup$ By best method, I mean "Is it possible to do all this in a shader/on the GPU"? $\endgroup$
    – Nick
    Commented Jun 1, 2023 at 18:51

2 Answers

$\begingroup$

Can a GPU do this? Sure. Can it do it fast? Not really, no.

The nature of the algorithm is antagonistic to how a GPU wants to do work. GPUs gain performance through massive parallelism. Your algorithm is inherently single-threaded: no answer can be computed until the prior answer is computed.

Will the processing time of such a single-threaded operation be greater than the cost of a full GPU->CPU->GPU synchronization? That depends on how much data you're talking about. That being said, unless the data is being computed by the GPU, the CPU could just hold onto a copy of it and compute the values itself.

$\endgroup$
  • $\begingroup$ Thanks Nicol, it is useful to have it confirmed that a GPU can do this. I will look at converting my compute shader to a mesh/primitive shader. I'm convinced the vertex shader can't do this, and the geometry shader is too limited. $\endgroup$
    – Nick
    Commented Jun 2, 2023 at 22:32
  • $\begingroup$ It is also useful to have it confirmed that the algorithm is inherently single-threaded => can't be parallelised. The GPU is very good at calculating the initial data, so I think the fastest way is to do it all on the GPU even though the cumulative sum is not really appropriate for the GPU. I think the cost of a full GPU->CPU->GPU synchronization would need to be less than the difference in processing time between GPU/CPU. Maybe I can investigate this at some point. $\endgroup$
    – Nick
    Commented Jun 2, 2023 at 22:41
  • $\begingroup$ "I will look at converting my compute shader to a mesh/primitive shader." ... how would that help? $\endgroup$ Commented Jun 2, 2023 at 22:43
  • $\begingroup$ Your answer has helped me clarify the way to go from here. I'm not allowed to vote it up, so I'll just accept it. Thanks again. $\endgroup$
    – Nick
    Commented Jun 2, 2023 at 22:45
  • $\begingroup$ I'm not merely calculating the data, I want to plot it too. Can I send the output of a compute shader into a graphics pipeline without using the CPU? $\endgroup$
    – Nick
    Commented Jun 2, 2023 at 22:49
$\begingroup$

I have rewritten this answer to clear up some misunderstandings and to explain the algorithm better.

The render pipeline is not well suited to this kind of task; it is better to use compute shaders instead.

The following algorithm is designed to run as a compute shader.

First, let's start with a simple example:

Suppose we have 50000 values stored in a shader storage buffer object (SSBO1), and let us further assume that all values are 1. Figure 1 shows the input data.


Figure 1: 50000 data points, each has the value 1.

This algorithm calculates the cumulative sum. For our simple example, the result looks like Figure 2.


Figure 2: The final result of the simple example (cumulative sum / discrete integral).

Compute shaders can have different workgroup sizes. In our case, we keep the workgroup size as small as possible while still respecting the minimum warp/wavefront size of 32 (Nvidia) or 64 (AMD). Let's assume we are using an AMD graphics card, so our workgroup size is (x = 64, y = 1, z = 1). The algorithm needs additional SSBOs to store intermediate values; they contain the following numbers of values:

unsigned int workgroupSize = 64;
unsigned int dataCount = 50000;

unsigned int SSBO2_count = static_cast<unsigned int>(ceil(static_cast<float>(dataCount) / workgroupSize));   // 782
unsigned int SSBO3_count = static_cast<unsigned int>(ceil(static_cast<float>(SSBO2_count) / workgroupSize)); // 13

The first compute shader works as follows: each invocation loads one value from SSBO1 (the original data) and stores it in shared memory, so that the other invocations within the workgroup can access it quickly. A barrier is then needed to ensure that all values have been loaded. Finally, the first invocation of each workgroup accumulates these values, updating SSBO1 in place, and stores the workgroup's total in SSBO2.

#version 430 core
layout (local_size_x = 64, local_size_y = 1, local_size_z = 1) in;

layout(std430, binding = 1) buffer SSBOInput
{
    float values[]; //first execution: 50000 values. Second execution: 782 values
}ssboInput;

layout(std430, binding = 2) buffer SSBOOutput
{
    float values[]; //first execution: 782 values. Second execution 13 values
}ssboOutput;    

shared float loadedValues[64];
uniform int countData;
uniform int writeToOutput;

void main()
{
    loadedValues[gl_LocalInvocationIndex] = 0.0;
    if(gl_GlobalInvocationID.x < countData) //check if inside ssbo1
    {
        loadedValues[gl_LocalInvocationIndex] = ssboInput.values[gl_GlobalInvocationID.x];
    }
    barrier(); //ensure, that all values are loaded to shared memory
    if(gl_LocalInvocationIndex == 0)
    {
        float sum = 0;
        for(int i = 0; i < 64; ++i)
        {
            sum += loadedValues[i];
            if(gl_GlobalInvocationID.x + i < countData)
            {
                ssboInput.values[gl_GlobalInvocationID.x + i] = sum;
            }
        }
        if(writeToOutput == 1)
            ssboOutput.values[gl_WorkGroupID.x] = sum;
    }
}

Bind SSBO1 to binding location 1 (ssboInput) and SSBO2 to binding location 2 (ssboOutput), and set the uniforms countData = 50000 and writeToOutput = 1. After dispatching 782 workgroups, the values look like Figure 3.


Figure 3: These are the values stored in SSBO1 after the compute shader is executed.

In our example, SSBO2 (782 data points) will store the value 64 in every entry except the last one, which gets 16, because the final workgroup only covers the remaining 50000 - 781*64 = 16 input values. This is the case because the input values are all 1. If we ask for the cumulative value at an index (e.g. 200), we only need to calculate the following:

$$indexMax = \lfloor 200 / 64 \rfloor - 1$$ $$CumulativeSum = SSBO1[200] + \sum_{i = 0}^{indexMax} SSBO2[i]$$

Looking at the equation, the problem is still the same one we started with, only reduced from 50000 data points to 782.
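
On the CPU, the lookup described by the equation can be modelled like this (an illustrative sketch of the state after the first pass for the all-ones example; `cumulativeAt` is a made-up name, not part of the shader pipeline):

```cpp
#include <vector>

// State after the first pass (all-ones example):
// ssbo1[i] = local prefix sum inside i's workgroup of 64,
// ssbo2[w] = total of workgroup w.
float cumulativeAt(const std::vector<float>& ssbo1,
                   const std::vector<float>& ssbo2, int index)
{
    int indexMax = index / 64 - 1;      // floor(index / 64) - 1
    float sum = ssbo1[index];
    for (int i = 0; i <= indexMax; ++i) // still a sequential sum, now over SSBO2
        sum += ssbo2[i];
    return sum;
}
```

The loop over SSBO2 is exactly the "same problem, smaller size" that the next pass removes.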

So we do the same thing again: bind SSBO2 to binding slot 1 and SSBO3 to binding slot 2, change countData to 782, keep writeToOutput = 1, and run the compute shader with 13 workgroups.

Let's take a look at the SSBO values. SSBO1 is untouched and still holds the values from Figure 3. SSBO2 has changed: its values are now accumulated within each workgroup of 64 (see Figure 4). SSBO3 is now filled; its first twelve values are 4096, and the last is 848 because its workgroup covers the partially filled tail of SSBO2.


Figure 4: These are the values stored in SSBO2 after the second compute shader call.

We then do the same thing one last time: bind SSBO3 to binding slot 1, set writeToOutput = 0 (no further output buffer is needed) and countData = 13, and dispatch a single workgroup.

Final step!

The precalculations are finished: reducing the data took $\lceil \log_{64}(50000) \rceil = 3$ passes. Now the cumulative sum of each value can be computed in a single gather step. The following compute shader stores the cumulative sum in SSBO1.

#version 430 core
layout (local_size_x = 64, local_size_y = 1, local_size_z = 1) in;

layout(std430, binding = 1) buffer SSBO1
{
    float values[]; //50000 values
}ssbo1;

layout(std430, binding = 2) buffer SSBO2
{
    float values[]; //782 values
}ssbo2;   

layout(std430, binding = 3) buffer SSBO3
{
    float values[]; //13 values
}ssbo3;    

uniform int countData;

void main()
{
    if(gl_GlobalInvocationID.x < countData) //check if inside ssbo1
    {
        float sum1 = ssbo1.values[gl_GlobalInvocationID.x];
        int index2 = int(gl_GlobalInvocationID.x) / 64 - 1; //signed division: avoids uint underflow in the first workgroup
        float sum2 = index2 >= 0 ? ssbo2.values[index2] : 0.0;
        int index3 = index2 / 64 - 1;
        float sum3 = index3 >= 0 ? ssbo3.values[index3] : 0.0;

        //store cumulative sum
        ssbo1.values[gl_GlobalInvocationID.x] = sum1 + sum2 + sum3;
    }
}

The cumulative sum is stored in SSBO1 (see Figure 2).
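
The whole scheme can be sanity-checked on the CPU before porting it to GLSL. The sketch below simulates the three reduction passes and the final gather with plain loops (function names are made up; it only mirrors the shader logic for up to 64*64*64 values):

```cpp
#include <cstddef>
#include <vector>

// One pass: per workgroup of 64, build the local prefix sum in place
// and (optionally) write each workgroup's total to the next buffer.
static void pass(std::vector<float>& input, std::vector<float>* output)
{
    const int wg = 64;
    const int n = static_cast<int>(input.size());
    for (int g = 0; g * wg < n; ++g) {
        float sum = 0.0f;
        for (int i = g * wg; i < (g + 1) * wg && i < n; ++i) {
            sum += input[i];
            input[i] = sum;
        }
        if (output) (*output)[g] = sum;
    }
}

// CPU model of the full algorithm: three passes, then the gather step.
std::vector<float> cumulativeSumModel(std::vector<float> ssbo1)
{
    const int wg = 64;
    std::vector<float> ssbo2((ssbo1.size() + wg - 1) / wg);
    std::vector<float> ssbo3((ssbo2.size() + wg - 1) / wg);
    pass(ssbo1, &ssbo2);  // e.g. 50000 values -> 782 workgroup totals
    pass(ssbo2, &ssbo3);  // 782 -> 13 totals
    pass(ssbo3, nullptr); // 13 values, no further output needed
    // Final gather, mirroring the second compute shader.
    for (int i = 0; i < static_cast<int>(ssbo1.size()); ++i) {
        int index2 = i / 64 - 1;
        int index3 = index2 / 64 - 1;
        float sum = ssbo1[i];
        if (index2 >= 0) sum += ssbo2[index2];
        if (index3 >= 0) sum += ssbo3[index3];
        ssbo1[i] = sum;
    }
    return ssbo1;
}
```

For the all-ones example, entry i of the result is i + 1, matching Figure 2.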

This code has not been tested yet, so there may be some syntax errors in it. Sorry for that.

$\endgroup$
  • $\begingroup$ Thanks @Thomas. your answer does contain useful advice about workgroup sizes, but I don't think it is relevant to my question. It seems you are simply adding all the data to produce a sum. The Q clearly asks for "a plot of the same data, but each point relative to the previous one". If I start with 100 points, I want to produce 100 points, each being added on to the previous point (relative instead of absolute). Thus each summation must wait for previous point to be calculated. $\endgroup$
    – Nick
    Commented Jun 8, 2023 at 23:44
  • $\begingroup$ @Nick this is a misunderstanding: the answer gives you the cumulative sum with the same size as the input data. SSBO2 stores the difference from one workgroup to the next... To calculate the final result of a point within your data, you only need to sum the values of SSBO2 from index 0 to the index of the point of interest, plus the value from SSBO1. As you can see, the problem is still the same, just smaller: SSBO1 has a size of 50000 where SSBO2 has a size of 50000 / 64. You can use SSBO3 in the same way as SSBO2 to reduce the values to 50000/64/64... until only 64 values are left. $\endgroup$
    – Thomas
    Commented Jun 9, 2023 at 9:22
  • $\begingroup$ @Nick I rewrote the answer to make clear how the algorithm works. Please take a look at it :) $\endgroup$
    – Thomas
    Commented Jun 12, 2023 at 11:19
  • $\begingroup$ Thanks @Thomas. This is much more understandable, especially with diagrams and equations, and code is a bonus too. I'm still not clear in my mind how the parallelisation actually happens, though I'm sure this is my failing not yours. I'm trying to implement this myself, and also my Q needs much improvement, but I am battling the constraints of this site, and my ISP in SE England have made SE unreachable. $\endgroup$
    – Nick
    Commented Jun 14, 2023 at 23:31
  • $\begingroup$ @Nick try to understand the algorithm... it is not as complicated as it looks. The first execution gives you the cumulative sum of the input data with respect to the workgroup size (64). The offset of the cumulated sum of each workgroup is stored to SSBO2. So after the first execution we have broken down the problem from 50000 data points to 782 data points. But these 782 data points need a cumulative sum just like the 50000. So we use the same shader again, just with other bindings, and store the result to SSBO3. You need $\log_{64}(n)$ many SSBOs to break down the problem to one workgroup $\endgroup$
    – Thomas
    Commented Jun 15, 2023 at 6:58
