When doing (possibly heavy) pixel processing on a large image, multithreading becomes a must. The standard practice is to run a loop whose indices are partitioned across multiple threads in a thread pool. With the appropriate thread-safety measures in place to keep the results correct, the performance benefits are immediately apparent.
However, there are several ways to partition the indices. The most common are partitioning by row and partitioning by pixel. Here is my interpretation of the advantages and drawbacks of each:
By Row:
Less thread creation overhead
Thread load may be uneven because the number of rows may not be divisible by the number of threads. This can cause an image that is wide but not tall to be processed inefficiently across multiple cores.
By Pixel:
More thread creation overhead
Thread load can be distributed more evenly, because the time taken to process the leftover pixels (when the pixel count is not divisible by the number of threads) is relatively small.
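To put rough numbers on the load-balancing point, here is a small sketch of how I am reasoning about it. It assumes a static, even split of equal-cost work items across threads, which is my simplification and not necessarily how Parallel.For actually schedules work. The class and method names (LoadBalanceSketch, Efficiency) are made up for illustration.

```csharp
using System;

public static class LoadBalanceSketch
{
    // Fraction of total thread time spent on useful work when `items`
    // equal-cost work items are split statically across `threads` threads.
    // Assumption: the slowest thread gets ceil(items / threads) items and
    // every other thread waits for it to finish.
    public static double Efficiency(long items, int threads)
    {
        long largestChunk = (items + threads - 1) / threads; // ceil(items / threads)
        return (double)items / (largestChunk * threads);
    }
}
```

Under that assumption, a 10000 x 10 image on 8 threads gives Efficiency(10, 8) = 0.625 when split by row, but Efficiency(100000, 8) = 1.0 when split by pixel, which is the effect I describe above for wide, short images.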
Is my interpretation correct, or is there more to the story? Should I always choose one over the other?
For reference, I am using the Parallel.For() function in C#.
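To make the two options concrete, this is roughly what I mean by each. It is a minimal sketch, assuming an 8-bit grayscale image stored row-major in a flat byte array. ProcessPixel is a placeholder for the real per-pixel work, and the other names are hypothetical as well.

```csharp
using System;
using System.Threading.Tasks;

public static class PartitionSketch
{
    // Placeholder per-pixel operation (assumption): invert an 8-bit value.
    private static byte ProcessPixel(byte value) => (byte)(255 - value);

    // By row: one Parallel.For iteration per row; the columns of each row
    // are processed sequentially inside that iteration.
    public static byte[] ProcessByRow(byte[] src, int width, int height)
    {
        var dst = new byte[src.Length];
        Parallel.For(0, height, y =>
        {
            int rowStart = y * width;
            for (int x = 0; x < width; x++)
                dst[rowStart + x] = ProcessPixel(src[rowStart + x]);
        });
        return dst;
    }

    // By pixel: one Parallel.For iteration per pixel over the flattened
    // index range [0, width * height).
    public static byte[] ProcessByPixel(byte[] src, int width, int height)
    {
        var dst = new byte[src.Length];
        Parallel.For(0, width * height, i =>
        {
            dst[i] = ProcessPixel(src[i]);
        });
        return dst;
    }
}
```

In both versions each iteration writes only to its own output indices, so no locking is needed in this simplified case.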