I am doing a parallel reduction, mostly following these NVIDIA slides. However, they are not very detailed in places, or I might be missing/misunderstanding something.
Edit 2: While I figured out a solution for my use case, as mentioned below, I am very skeptical of its efficiency (assigning a second compute buffer and a second kernel, etc.) and would appreciate it if anyone could tell me how to do this properly or point out problems with this solution. NB: I am storing more data in groupshared memory than in these code examples, but the fundamental operations are the same. end of edit
I am working in Unity with compute shaders. I have successfully reduced data so far, but I ran out of groupshared memory when attempting to scale everything up*. The NVIDIA slides on parallel reduction say to use multiple kernel invocations because a "kernel launch serves as global synchronisation point", but I am a little unsure of how to take advantage of that. Currently my parallel reduction looks roughly like this:
RWStructuredBuffer<float4> buffers;
groupshared float4 sharedMem1[1024];
groupshared float4 sharedMem2[1024];
int specificIndex;
[numthreads(32, 32, 1)]
void ComputeStuff32x32(int3 id : SV_GroupThreadID, int3 gID : SV_GroupID)
{
    DoStuff(); // ...
    uint flatId = id.y * 32 + id.x; // linearise the 32x32 thread index
    GroupMemoryBarrierWithGroupSync();
    for (uint s = 512; s > 0; s >>= 1)
    {
        if (flatId < s)
        {
            sharedMem1[flatId] += sharedMem1[flatId + s];
            sharedMem2[flatId] += sharedMem2[flatId + s];
        }
        GroupMemoryBarrierWithGroupSync();
    }
    if (flatId == 0)
    {
        buffers[specificIndex] = sharedMem1[0];
    }
}
Now this all works fine with a 16x16 kernel, but when I get to 32x32 I run out of shared memory. How exactly would I go about dispatching the kernel 4 times as a 16x16 reduction and combining the 4 resulting values? Edit: And if I need to synchronize across multiple kernel dispatches, what exactly are multiple thread groups good for? Can I somehow just create multiple groups and synchronize between them? end of edit
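To make the question concrete: what I imagine the 4x-dispatch variant would look like is roughly the following (this is my guess, not working code — `LoadInput` is a placeholder for however the tile data actually gets fetched, and `partialSums` would be a new buffer with one entry per group):

```hlsl
// Hypothetical first pass: each 16x16 group reduces its own tile and
// writes one partial sum; a later dispatch would combine the 4 partials.
RWStructuredBuffer<float4> partialSums; // one entry per group

groupshared float4 tile[256];

[numthreads(16, 16, 1)]
void Reduce16x16(uint3 id : SV_GroupThreadID, uint3 gID : SV_GroupID)
{
    uint flatId = id.y * 16 + id.x;
    tile[flatId] = LoadInput(gID, id); // placeholder for the real load
    GroupMemoryBarrierWithGroupSync();

    for (uint s = 128; s > 0; s >>= 1)
    {
        if (flatId < s)
            tile[flatId] += tile[flatId + s];
        GroupMemoryBarrierWithGroupSync();
    }

    if (flatId == 0)
        partialSums[gID.y * 2 + gID.x] = tile[0]; // 2x2 groups cover 32x32
}
```

Is something along these lines the intended pattern, with the second dispatch acting as the global synchronisation point?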
Edit 2: I am successfully brute-forcing this by simply adding an additional RWStructuredBuffer<float3> tempBuffer; with a size of 32 * sizeof(float3). The first kernel is dispatched as 32 groups of 32 threads (rather than 32x32 threads); each group reduces all its data along the x axis and stores the result in tempBuffer[groupID]:
RWStructuredBuffer<float4> groupResults;
RWStructuredBuffer<float4> finalResult;
groupshared float4 sharedMem1[32];
[numthreads(32, 1, 1)]
void Compute32x32(int3 id : SV_GroupThreadID, int3 gID : SV_GroupID)
{
    doReduction();
    if (id.x == 0)
    {
        groupResults[gID.x] = sharedMem1[0];
    }
}
After dispatching this kernel, I dispatch another kernel to combine the results:
groupshared float4 sharedMem2[32];
[numthreads(32, 1, 1)] // combines the results from the previous thread groups
void combine32x32Result(int3 id : SV_GroupThreadID, int3 gID : SV_GroupID)
{
    sharedMem2[id.x] = groupResults[id.x];
    GroupMemoryBarrierWithGroupSync(); // make all loads visible before reducing
    for (uint s = 16; s > 0; s >>= 1)
    {
        if (id.x < s)
        {
            sharedMem2[id.x] += sharedMem2[id.x + s];
        }
        GroupMemoryBarrierWithGroupSync();
    }
    if (id.x == 0)
    {
        finalResult[0] = sharedMem2[0];
    }
}
As mentioned above, I would regardless appreciate any pointers to things I am doing suboptimally / how to do things more efficiently or properly. end of edit
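One optimization from the slides I have not tried yet is the "first add during load": each thread sums two input elements while loading, which halves both the groupshared memory footprint and the thread count per group. If I understand it correctly, applied to my case it would look roughly like this (a sketch; `input` stands in for my actual source buffer):

```hlsl
// Sketch of the slides' "first add during load" trick: each thread
// performs one addition while loading, so only half the elements
// ever need to sit in groupshared memory.
StructuredBuffer<float4> input; // placeholder for the real source data

groupshared float4 partial[512]; // half of the previous 1024 entries

[numthreads(512, 1, 1)]
void ReduceHalfShared(uint3 id : SV_GroupThreadID)
{
    uint i = id.x;
    partial[i] = input[i] + input[i + 512]; // two loads, one add
    GroupMemoryBarrierWithGroupSync();

    for (uint s = 256; s > 0; s >>= 1)
    {
        if (i < s)
            partial[i] += partial[i + s];
        GroupMemoryBarrierWithGroupSync();
    }
    // thread 0 would then write partial[0] out as before
}
```

Since I store several such arrays in groupshared memory, halving each one might already bring me back under the 32 KB limit without any multi-dispatch tricks — but I am not sure whether that is the idiomatic way to do it.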
*Unity says: "Shader error in 'computeStuff.compute': Program 'ComputeStuff32x32', error X4586: The total amount of group shared memory (110592 bytes) exceeds the cs_5_0 limit of 32768 bytes at kernel ComputeSH32x32 (on d3d11)"