
When processing data sets in batches, I can usually think of the following three implementations.

Which one do you consider better than the others, and why?

Notes:

  • The implementation is in C# but the question is about the algorithm.
  • The GetBatchedData method works with a fixed batch size.
  • The Process method can take an empty batch as argument, which means nothing has to be processed.
  • In case of EmptyBatch, Items is empty and HasMoreData returns true
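
For reference, these notes imply roughly the following shape. The post never shows these types, so the sketch below is an assumption, not the actual implementation:

public class Data { }

public interface IBatch
{
    Data[] Items { get; }      // empty for an EmptyBatch
    bool HasMoreData { get; }  // returns true for an EmptyBatch, per the notes above
}

// Presumed signatures used by the loops below:
//   IBatch GetBatchedData(int batchIndex);  // fetches one fixed-size page
//   void Process(Data[] items);             // may receive an empty batch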

Option A

As @Flater pointed out, this approach has a bug!

var batchIndex = 0;
var batch = GetBatchedData(batchIndex++);
while (batch.HasMoreData)
{
    Process(batch.Items);
    batch = GetBatchedData(batchIndex++);
}

Option B

var batchIndex = 0;
var batch = GetBatchedData(batchIndex++);
do
{
    Process(batch.Items);
    batch = GetBatchedData(batchIndex++);
} while (batch.HasMoreData);

Option C

var batchIndex = 0;
var batch = new EmptyBatch();
do
{
    Process(batch.Items);
    batch = GetBatchedData(batchIndex++);
} while (batch.HasMoreData);

Additional approaches suggested only in comments but not in responses to the question:

Suggestion A

The GetBatchedData method returns IEnumerable<Data[]>, where every Data[] batch is yield return-ed.

foreach (var batch in GetBatchedData())
{
    Process(batch);
}
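
For illustration, a minimal sketch of what such a yield-return-based GetBatchedData could look like. FetchPage and BatchSize are placeholders for however a single fixed-size page is actually read; they are not part of the original question:

private const int BatchSize = 100; // the fixed batch size from the notes; 100 is arbitrary

private IEnumerable<Data[]> GetBatchedData()
{
    var batchIndex = 0;
    while (true)
    {
        var page = FetchPage(batchIndex++); // hypothetical: read one page from the source
        if (page.Length == 0)
            yield break;                    // nothing left to fetch

        yield return page;

        if (page.Length < BatchSize)
            yield break;                    // a short page means this was the last one
    }
}
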
    Why doesn't Batch simply implement IEnumerable<>? All of these design decisions have already been made 20 years ago for you. Commented Mar 28, 2022 at 10:29
  • @JörgWMittag, you made a very good point. Could you please provide a pseudo-code to make it clear what you meant by your previous comment? Commented Mar 28, 2022 at 11:22
  • What are the specific requirements? A simple IEnumerable should be the default, but more complex solutions might be motivated when dealing with large amounts of data, concurrency, etc.
    – JonasH
    Commented Mar 28, 2022 at 12:51
  • @JonasH The question is based on the assumption that fetching data in batches leads to shorter processing time overall than fetching and processing items one by one. What exactly are you proposing with a simple IEnumerable? Commented Mar 28, 2022 at 15:07
  • A simple IEnumerable, as proposed by Jörg and others. But for best efficiency you might want to use a model as close to the underlying data model as possible. Like a Stream or Span<byte> for in-memory data. But it really depends on the type of data, for things like audio or video, efficient handling of buffers can be critical, but in many other cases the overhead of calling a method is irrelevant. Manual batching like that might be relevant when fetching data from some web API.
    – JonasH
    Commented Mar 28, 2022 at 15:35

2 Answers


Note that I can't judge whether you can/should change how batch works, since you never posted its implementation. This answer assumes that the batch object cannot be changed.
If you are free to do so, first analyze if the data fetching process can be improved from the source, as these improvements will trickle down all the way to your consumer(s).


I don't really like any of the proposed options, simply because they're IMHO needlessly contrived when compared to the simple foreach/for, which is really what you need here.
That being said, I'm aware that you want to avoid fetching all of the data before processing begins, so a simple for wouldn't work here, and a foreach would still require some underlying complexity. My main point is that you should compare your presented options to the most natural implementation (for/foreach here) and then pick the one that minimizes the added complexity.

Option A suffers from a likely bug where you don't process the last batch. This is under the assumption that batch.HasMoreData returns true only if there is data that you still have not fetched. That means that when you fetch the last batch and then check batch.HasMoreData, you'll exit the loop without processing it.

Option B is liable to crash when dealing with an empty batch list, as you call Process before doing any kind of sanity check.

Option C looks like an attempt to implement that sanity check by adding a no-op EmptyBatch: instead of an exception, you get a Process step that does nothing.
While this may work, it is very patchworky and leads to needlessly obfuscated code. It might also be passing the "null buck" by forcing the Process logic to specifically check for the empty object and e.g. not log anything; which is precisely why I'm not a fan of this approach when it can be avoided.
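
As a minimal sketch of that "null buck", assuming Process looks roughly like the below, every consumer ends up carrying a guard for a batch that never really existed:

void Process(Data[] items)
{
    // The guard the EmptyBatch pattern forces onto every consumer:
    // skip logging, metrics, etc. for a batch that was only a placeholder.
    if (items.Length == 0)
        return;

    foreach (var item in items)
    {
        // ... actual per-item processing ...
    }
}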


If I had to pick between these options, I would alter option A to do:

var batchIndex = 0;
MyBatch batch = null;
while (batch == null || batch.HasMoreData)
{
    batch = GetBatchedData(batchIndex++);
    Process(batch.Items);
}

This gets the job done without issues, assuming your logic can elegantly handle a first fetch if there is no data to fetch at all.

I am somewhat annoyed at having to include both the "did not fetch anything yet" check and the "did fetch the previous batch and there is more to fetch" check in the same condition.
However, under the assumption that there is no other way of checking that there is data to fetch before you've started; this way at least keeps the main processing logic neat and tidy.

Assuming that batch.HasMoreData only yields true when the supplier of batch contains data that has not been accessed yet, the above should work the way you need it to.

I am also assuming that fetching the first batch, when there is nothing to fetch, yields an object with an empty Items array, not null (for either the object or its Items property).


However, I'm not a fan of this approach overall. I also find it grating that the consumer is the one responsible for keeping track of which index they have to request. If you're using a while, I would innately infer that the consumer doesn't have this knowledge and instead dynamically waits until the data well runs dry at some point.

You're also trying to reinvent the enumerator; and in general I would urge you to use the tools that exist and have been tried and tested, instead of rolling your own (except for educational purposes or if you concretely intend to improve on it).

Part of what leads you to avoid the more obvious for/foreach is that you are trying to avoid complete enumeration of all data before you even begin processing the first entry. I understand and respect that, but there are better solutions here.

Suggestion A has my preference here. yield return was created precisely for deferred enumeration where you don't want to render an item before the consumer wishes to concretely access it.
That being said, the underlying method aggregating the yield returned enumerable is still going to end up having to do something along the lines of:

var batchIndex = 0;
MyBatch batch = null;
while (batch == null || batch.HasMoreData)
{
    batch = GetBatchedData(batchIndex++);
    foreach (var item in batch.Items)
    {
        yield return item;
    }
}

If you compare this to how I altered your option A, it does pretty much the same thing with regard to batch fetching, except that it returns the data rather than handling it immediately.

Which doesn't really change your approach; it hides it.

Nonetheless, this has my preference, simply because it allows you to separate the nitty gritty aggregation logic from the actual consumer. This means that you can abstract it away into a neat little method/class of its own, and any and all consumers don't have to actively account for the fact that there is deferred enumeration going on under the hood.

Another reason to like it is that it hides the entire idea of batched retrieval, instead pretending like it's returning a flat list of items. This is just one of those little intricacies that a consumer really doesn't care about, so hiding it is a good thing.

If you can't avoid the nitty gritty, at least package it into a pretty box so that others don't have to fuss with it.
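
For completeness, a minimal sketch of what that packaging might look like. GetAllItems is an illustrative name, and it assumes Process (or an overload of it) can handle a single item:

IEnumerable<Data> GetAllItems()
{
    var batchIndex = 0;
    MyBatch batch = null;
    while (batch == null || batch.HasMoreData)
    {
        batch = GetBatchedData(batchIndex++);
        foreach (var item in batch.Items)
        {
            yield return item;
        }
    }
}

// The consumer never sees the batching:
foreach (var item in GetAllItems())
{
    Process(item); // assumes a per-item Process overload
}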

  • Thank you for the detailed answer. Why do you prefer while instead of do...while when the body of the loop must be executed at least once? Commented Mar 28, 2022 at 14:12
  • @BotondBotos: It keeps the body in a sensical order (fetch data, process the fetched data). Don't get me wrong, I like a do while once in a while, and you could make an argument for it here; but overall I prefer a while here. Also note that your options B and C run into a similar bug as the one I mentioned in A (I just noticed); they too would end up not processing the last batch.
    – Flater
    Commented Mar 28, 2022 at 14:16
  • @BotondBotos: The inherent bug in your options and the somewhat contrived flows are what lead me to want to tackle this differently if I could. Instead of HasMoreData, you could e.g. be checking whether the current batch has any items. It's one extra call (to fetch the empty one), but it significantly simplifies your processing logic, plus it accounts for cases where new data enters the system between fetching the last batch (at which point it says there is no more data) and finishing the processing of that last batch.
    – Flater
    Commented Mar 28, 2022 at 14:19
  • I had a specific implementation in mind for HasMoreData which isn't exactly what the name suggests: HasMoreData => Items.Length == BatchSize. Based on that implementation, IsFullBatch would be a better name. Commented Mar 28, 2022 at 14:27
  • @BotondBotos: The name is an improvement, but you'll end up fetching the next empty batch in the case where the last item happens to land exactly at the end of a full batch. A full page does not guarantee that there is another page behind it. Initially, I assumed that HasMoreData was being indicated by the source of the data, rather than being guessed by you based on how many items were returned. This new information only strengthens my suggestion to simply fetch pages until you encounter an empty one - it works the most elegantly at the negligible cost of one potentially superfluous call (see the sketch after these comments).
    – Flater
    Commented Mar 28, 2022 at 14:30
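
A minimal sketch of the "fetch pages until you hit an empty one" approach suggested in the comments above; it never needs HasMoreData and never hands Process an empty batch:

var batchIndex = 0;
while (true)
{
    var batch = GetBatchedData(batchIndex++);
    if (batch.Items.Length == 0)
        break;              // the first empty page means the source has run dry

    Process(batch.Items);   // only ever called with a non-empty batch
}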

Seems like way overthinking. To me, the clearest way to write this is with a for loop on the index and a break statement when there is no more data. There is no reason to try to make it any more clever than this.

You can either leave the end condition of the for loop empty (yes, you can do this), but I would suggest it is more prudent to define some sort of maximum batch count to avoid getting into an infinite loop or integer overflow if something were to go wrong.

for (var batchIndex = 0; ; batchIndex++)
{
    var batch = GetBatchedData(batchIndex);
    Process(batch.Items);
    if (!batch.HasMoreData) break;
}

or

for (var batchIndex = 0; batchIndex < batchMax; batchIndex++)
{
    var batch = GetBatchedData(batchIndex);
    Process(batch.Items);
    if (!batch.HasMoreData) break;
}
  • Any reason for batchMax being any different than long.MaxValue? Commented Apr 8, 2022 at 9:21
