
First of all, I have a struct that looks like the following:

struct structName
{
    vec3  position;
    float radius;
    float type;
    // ...some more fields that are definitely needed; not sure how many, but probably fewer than 10
};

There are about 200 of these struct elements in my scene. While a frame is being rendered these objects are fixed, i.e. they do not change during the render call; they only change between render calls. Right now I am using an SSBO to store the elements. Inside my vertex shader (of another piece of geometry) I read all 200 struct elements, iterate over them and do a calculation. This is slowing down my shader, and I am trying to improve its speed, so I am looking for a better way to read the data.
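Roughly, the setup in the vertex shader looks like this (a simplified sketch; someInfluence and vertexPosition are placeholders for the actual per-element calculation and the vertex's position, not my real code):

layout(std430, binding = 0) readonly buffer ElementBuffer
{
    structName elements[200];   // the struct from above
};

void main()
{
    vec3 result = vec3(0.0);
    for (int i = 0; i < 200; ++i)
    {
        // placeholder for the real calculation, which depends on the vertex position
        result += someInfluence(elements[i], vertexPosition);
    }
    // ... use result, e.g. to displace the vertex
}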

Let's talk about the limitations:

  1. Within the shader I NEVER change any value of any element of the struct (read only).
  2. All vertices of my other geometry need all 200 of these struct elements.
  3. The result of the iterated calculation is different for each vertex (it depends on the vertex position), so the calculation cannot be moved out of the vertex shader.
  4. The 200 elements themselves are never rendered; they are only data needed by the other geometry.
  5. Each of the 200 struct elements is changed after every render call (at 60 FPS, each element is updated 60 times per second).
  6. The new values of the elements are calculated in another shader program (shader2). So the data is "read only" during the render call and "read/write" during the update in shader2.
  7. The elements need atomic operations (atomic add) during the update in shader2 (see the sketch right after this list).
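To make constraints 6 and 7 concrete, the same buffer can be bound to both programs roughly like this (a sketch only; the counter member is hypothetical, and core GLSL atomicAdd works on int/uint buffer members, not on float):

// shader2 (update pass): read/write access, atomics allowed
layout(std430, binding = 0) buffer ElementBuffer
{
    structName elements[200];
};
// e.g. atomicAdd(elements[i].counter, 1);
// note: core GLSL atomicAdd only works on int/uint buffer members

// render pass (vertex shader): same binding point, declared read only
layout(std430, binding = 0) readonly buffer ElementBufferRO
{
    structName elementsRO[200];
};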

I think storing the data in an SSBO slows down my shader, because shaders are able to read from and write to SSBOs, so the elements live in some slow global memory. Maybe some other buffer type is the right one... but which one? A uniform buffer object sounds good, because uniforms are constant. So maybe, when the data is used as a uniform buffer, a copy is placed in faster local GPU memory that every shader invocation can access more quickly.

Which buffer type is needed to increase the FPS of my program under these circumstances?

An alternative might be to read the whole buffer in the vertex shader at once (I'm not sure whether this is possible). The size is known in advance (200 * sizeof(structName)). It would also be fine for me if the data were stored as a flat float array, with the values packed next to each other.
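For reference, if a uniform block turned out to be the better fit, the declaration would look something like this (a sketch only; note that GL_MAX_UNIFORM_BLOCK_SIZE is only guaranteed to be 16 KB, and std140 pads each struct array element to a multiple of 16 bytes, so the 200 elements must still fit):

layout(std140, binding = 1) uniform ElementBlock
{
    structName elements[200];   // must stay within GL_MAX_UNIFORM_BLOCK_SIZE
};

// Since shader2 writes the data each frame, it would have to be copied from
// the SSBO into the UBO on the API side, e.g. with glCopyBufferSubData --
// an extra copy per frame whose cost would need to be measured.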

Thanks a lot!

  • If there are lots of objects in the scene, you might benefit from deferring the rendering process. Basically, you output the result of the vertex pass to a series of textures, then render the fragments using those textures. Another approach might be hard-coding the struct values inside the vertex shader: since you can pass shader code as strings in OpenGL, you can concatenate the struct values into the vertex shader at runtime before creating the shader.
    – Kaan E.
    Commented Feb 22, 2021 at 13:57
  • @KaanE. Thanks for the response! First: deferring doesn't seem to be the right approach, because the calculation in the vertex shader deforms the geometry... Second: what do you mean by hard-coding the struct values inside the vertex shader? The values change every frame...
    – Thomas
    Commented Feb 22, 2021 at 14:23
  • Well, if they change every frame, hard-coding them is not a good idea after all. Do you send a struct of arrays or an array of structs when you pass the 200 structs to the vertex shader?
    – Kaan E.
    Commented Feb 22, 2021 at 15:15
  • @KaanE. Within another shader I fill a struct array, i.e. structName[200]. So I have the struct 200 times.
    – Thomas
    Commented Feb 22, 2021 at 15:28
  • Normally a struct of arrays tends to give better performance in OpenCL and CUDA. It might work out in your case as well. Basically you would need something like myStructData { vec3 positionS[200]; float radiusS[200]; /*etc*/ } my;
    – Kaan E.
    Commented Feb 22, 2021 at 15:47

1 Answer


I think you may be focusing on the wrong thing here. Moving the data from an SSBO to a uniform buffer might give some speedup, sure, but that is a micro-optimization.

I would search for algorithmic optimizations first. As noted in the comments, processing all 200 records for every vertex is both a lot of data to read and a lot of computation to do if there are a lot of vertices. Also note that in the course of rendering, vertices will sometimes be processed more than once as they are used by multiple triangles (the vertex cache helps with this, but it can't eliminate all duplicate processing). So you will sometimes be re-doing this expensive computation multiple times for the same vertex.

Does every vertex really need to process every data element? I see that your data elements have a position and radius. If they represent some kind of mesh deformation that happens within a localized volume of space, then it might be beneficial to store those in an acceleration structure. This would allow each vertex to efficiently narrow down its processing to only those elements that affect it. For example, you could create a regular grid spanning the bounding box of your mesh, and in each grid cell store a list of indices into the main data array for the elements that intersect that grid cell. The creation of this grid could be done in a compute shader earlier in the frame, and should be quite fast. Then, when drawing the mesh, the vertex shader would look up which grid cell the vertex is in, and process those data elements. This is very similar to how we do tiled/clustered shading for light sources in many game engines.
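A vertex-shader lookup into such a grid might look roughly like this (a sketch with made-up names; gridMin, cellSize, gridDim and the per-cell index lists are assumed to be produced by the earlier compute pass, and vertexPosition stands for the vertex's position in the grid's space):

layout(std430, binding = 1) readonly buffer CellRanges
{
    uvec2 cellRange[];   // per cell: (start offset, count) into cellIndex[]
};
layout(std430, binding = 2) readonly buffer CellIndices
{
    uint cellIndex[];    // indices into the main element array
};

uniform vec3  gridMin;
uniform vec3  cellSize;
uniform ivec3 gridDim;

void main()
{
    ivec3 cell       = clamp(ivec3((vertexPosition - gridMin) / cellSize),
                             ivec3(0), gridDim - 1);
    uint  cellLinear = uint(cell.x + gridDim.x * (cell.y + gridDim.y * cell.z));
    uvec2 range      = cellRange[cellLinear];

    for (uint i = 0u; i < range.y; ++i)
    {
        uint e = cellIndex[range.x + i];
        // process only elements[e] instead of looping over all 200
    }
}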

If spatial subdivision like that doesn't get the job done (for example if many of the elements are of large radius, so that most/all of the vertices really do need to process most/all of the elements), then are the elements' influences on the vertices parallelizable in any way? For instance, are you summing a perturbation generated by each element, to arrive at a total displacement for the vertex, or something along those lines? If so, you could run compute shader passes to pre-calculate the displacements and store them into a buffer to be read later by the vertex shader. The compute passes could be set up with work groups that distribute jointly over the vertices and the work elements (the exact distribution and mapping of compute threads would have to be experimented with). Each work group might, for instance, load a small chunk of vertices and a chunk of displacement elements into shared memory, calculate the displacements, then store them out to the per-vertex displacement buffer using atomic adds. Or, they might store out to various intermediate buffers (to avoid atomic contention) and have another set of compute shaders that sum up the intermediate values to obtain the final values. It would take some experimentation to see how to do it most efficiently, but you could likely get some big speedups here.
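To sketch the simplest variant of such a pass: the threads are distributed over vertices only, and each thread loops over tiles of elements staged in shared memory, so no atomics are needed. The joint distribution over vertices and elements with atomics or intermediate buffers described above would be a further refinement. Here influence() is a placeholder for the actual per-element math, and the buffer names are made up:

layout(local_size_x = 64) in;

// structName as defined in the question
layout(std430, binding = 0) readonly  buffer Elements      { structName elements[]; };
layout(std430, binding = 1) readonly  buffer VertexPos     { vec4 vertexPos[]; };    // vec4 to avoid std430 vec3 padding pitfalls
layout(std430, binding = 2) writeonly buffer Displacements { vec4 displacement[]; };

shared structName tile[64];

void main()
{
    uint v        = gl_GlobalInvocationID.x;            // one thread per vertex
    uint numVerts = uint(vertexPos.length());
    uint numElems = uint(elements.length());
    vec3 p        = (v < numVerts) ? vertexPos[v].xyz : vec3(0.0);
    vec3 sum      = vec3(0.0);

    for (uint base = 0u; base < numElems; base += 64u)
    {
        // cooperatively load a tile of elements into shared memory
        uint e = base + gl_LocalInvocationID.x;
        if (e < numElems)
            tile[gl_LocalInvocationID.x] = elements[e];
        barrier();

        uint count = min(64u, numElems - base);
        for (uint i = 0u; i < count; ++i)
            sum += influence(tile[i], p);                // placeholder for the real math
        barrier();
    }

    if (v < numVerts)
        displacement[v] = vec4(sum, 0.0);
}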

  • Thanks for your answer. You are right, the position and radius affect the geometry within that sphere. I also think you are right about space-partitioning the problem, so that each vertex only has to process up to ~10 struct elements. I am going to implement it in the following way: first a mesh simplification to reduce the number of vertices; then a compute shader, executed per simplified vertex, checks the influence of each element and saves the relevant ones into a per-vertex list stored in an SSBO. During rendering this list, with roughly 10 elements per vertex, is used.
    – Thomas
    Commented Feb 23, 2021 at 10:11
