
I am doing some parallel reduction, mostly following these nVidia slides. However, they are not very detailed in places, or I might be missing/misunderstanding something.

Edit 2: While I figured out a solution for my use case, as described below, I am very skeptical of its efficiency (assigning a second compute buffer, a second kernel, etc.) and would appreciate it if anyone could tell me how to do this properly or point out problems with this solution. NB: I am storing more data in groupshared memory than in these code examples, but the fundamental operations are the same. end of edit

I am working in Unity with compute shaders. I have successfully reduced data so far, but I ran out of groupshared memory when attempting to scale everything up*. The nVidia slides on parallel reduction say to use multiple kernel invocations and that a "kernel launch serves as global synchronisation point", but I am a little unsure of how to take advantage of that. Currently my parallel reduction looks roughly like this:

RWStructuredBuffer<float4> buffer;
groupshared float4 sharedMem1[1024];
groupshared float4 sharedMem2[1024];
int specificIndex;

[numthreads(32, 32, 1)]
void ComputeStuff32x32(int3 id : SV_GroupThreadID, int3 gID : SV_GroupID)
{
    uint flatId = id.y * 32 + id.x;   // flatten the 32x32 thread id to 0..1023

    DoStuff(); // ... fills sharedMem1[flatId] and sharedMem2[flatId]

    GroupMemoryBarrierWithGroupSync();

    // tree reduction over the 1024 values per groupshared array
    for (uint s = 512; s > 0; s >>= 1)
    {
        if (flatId < s)
        {
            sharedMem1[flatId] += sharedMem1[flatId + s];
            sharedMem2[flatId] += sharedMem2[flatId + s];
        }
        GroupMemoryBarrierWithGroupSync();
    }

    if (flatId == 0)
    {
        buffer[specificIndex] = sharedMem1[0];
    }
}

Now this all works fine in a 16x16 kernel, but when I get to 32x32 I'm running out of shared memory. How exactly would I go about dispatching the kernel 4 times as a 16x16 reduction and then combining the 4 resulting values? Edit: And if I need to synchronize over multiple kernel dispatches, what exactly are multiple thread groups good for? Can I somehow just create multiple groups and synchronize them? end of edit

Edit 2: I am successfully brute forcing this by simply adding a second RWStructuredBuffer<float4> groupResults; with 32 elements (32 * sizeof(float4) bytes). The first kernel is dispatched as 32 groups of 32 threads (rather than one group of 32x32 threads); each group reduces its data along the x axis and stores the partial result in groupResults[gID.x]:

RWStructuredBuffer<float4> groupResults;
RWStructuredBuffer<float4> finalResult;
groupshared float4 sharedMem1[32];

[numthreads(32, 1, 1)]
void Compute32x32(int3 id : SV_GroupThreadID, int3 gID : SV_GroupID)
{
    doReduction(); // reduces this group's 32 values into sharedMem1[0]

    if (id.x == 0)
    {
        groupResults[gID.x] = sharedMem1[0];
    }
}
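
For completeness, here is a minimal sketch of the kind of per-group tree reduction that doReduction() stands in for. This is not my actual kernel: the flatId and value parameters are hypothetical (the real call above takes no arguments), and value is just a placeholder for whatever each thread has computed before the reduction.

// Hypothetical sketch, for illustration only: the standard in-group tree
// reduction over 32 float4 values stored in sharedMem1.
void doReduction(uint flatId, float4 value)
{
    sharedMem1[flatId] = value;          // each of the 32 threads contributes one value
    GroupMemoryBarrierWithGroupSync();   // make all writes visible to the group

    // halve the active thread range each step: 16, 8, 4, 2, 1
    for (uint s = 16; s > 0; s >>= 1)
    {
        if (flatId < s)
        {
            sharedMem1[flatId] += sharedMem1[flatId + s];
        }
        GroupMemoryBarrierWithGroupSync();
    }
    // afterwards sharedMem1[0] holds the sum of all 32 values
}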

After dispatching this kernel, I dispatch another kernel to combine the results:

groupshared float4 sharedMem2[32];

[numthreads(32, 1, 1)]  // combines the results from the previous thread groups
void combine32x32Result(int3 id : SV_GroupThreadID, int3 gID : SV_GroupID)
{
    sharedMem2[id.x] = groupResults[id.x];  // load the 32 per-group partial sums
    GroupMemoryBarrierWithGroupSync();

    for (uint s = 16; s > 0; s >>= 1)
    {
        if (id.x < s)
        {
            sharedMem2[id.x] += sharedMem2[id.x + s];
        }
        GroupMemoryBarrierWithGroupSync();
    }

    if (id.x == 0)
    {
        finalResult[0] = sharedMem2[0];
    }
}

As mentioned above, I would regardless appreciate any pointers to things I am doing suboptimally / how to do this more efficiently or properly. end of edit

*Unity says: "Shader error in 'computeStuff.compute': Program 'ComputeStuff32x32', error X4586: The total amount of group shared memory (110592 bytes) exceeds the cs_5_0 limit of 32768 bytes at kernel ComputeSH32x32 (on d3d11)"
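
For reference, the arithmetic behind that error (assuming 16 bytes per float4): the two groupshared float4[1024] arrays in the simplified snippet alone take 2 * 1024 * 16 = 32768 bytes, i.e. exactly the whole cs_5_0 limit, and the 110592 bytes in the message work out to 110592 / 1024 = 108 bytes of groupshared storage per thread in my real 32x32 kernel, which (as noted above) stores more data than the snippets show.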
