
I've been working on a U-Net model using training images stored on my local drive. To load these I have been using Keras' ImageDataGenerator.flow_from_dataframe method and optionally applying some augmentations.

I have had no problems with this but noticed some odd behaviour when I retrieve batches of data from the flow.

In the simplified example below I am loading 8-bit RGB files from a directory and setting the seed. I've omitted the augmentation parameters here, but the behaviour is the same with and without them.

For QA/QC purposes I will typically get a batch and look at a random selection of images. However, when I get a batch and then generate some random image indices, I always get the same result. This only occurs after a batch is generated, not after initialisation of the flow generator object.

# Step 1
# Set up the image data flow
import numpy as np
from tensorflow.keras.preprocessing.image import ImageDataGenerator

img_generator = ImageDataGenerator(rescale=1/255.)
train_gen = img_generator.flow_from_dataframe(
                img_df,        # filenames are read from column "filename"
                img_dir,       # local directory containing the image files
                y_col=None,
                target_size=(512, 512),
                class_mode=None,
                shuffle=False, # I'm using separate mask images so no shuffling here
                batch_size=16,
                seed=42        # behaviour occurs when using seed
            )

# Step 2
# Generate and print 8 random indices
# No batch of images retrieved yet; no use of seed
print(np.random.randint(16, size=8))
>>> [ 7 15 13  3  6  3  2 14] # always random

# Step 3
# Now get a batch of images; seed is used
batch = next(train_gen)

# Step 4
# Generate and print 8 random indices
print(np.random.randint(16, size=8))
>>> [ 6  1  3  8 11 13  1  9] # always the same result

Using a seed of 42, the output of Step 2 changes each time Steps 1 & 2 are executed. This is expected behaviour since Step 1 should not impact Step 2. However, once a batch is retrieved from the generator in Step 3, Step 4 always returns the same indices.

This behaviour continues as new batches are yielded: the global seed appears to be re-set on each yield, so each batch produces a different set of indices, but the indices for a given batch are always the same across runs.

With the seed set to 42, the indices generated after the first few batches are:

>>> [ 6  1  3  8 11 13  1  9]  # Batch 1
>>> [10 10  5  5  5  8 10 11]  # Batch 2
>>> [ 5  3  0 10  4  9 15  2]  # Batch 3

This suggests to me that when a batch of images is generated, the global NumPy random state is modified. In practical terms, I end up always examining the same sample of images. When the seed parameter is not provided, the global state remains unmodified and no two outputs are alike.
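The effect can be reproduced without Keras. The sketch below imitates a generator that re-seeds the global RNG on every yield (the `seed + batches_seen` scheme here is just an illustration of the observed behaviour, not necessarily Keras' internal arithmetic), and shows that a local `np.random.Generator` is unaffected by such calls:

```python
import numpy as np

def draw_indices(seed, batches_seen):
    # Imitates a generator that re-seeds the *global* RNG on every
    # yield. The seed + batches_seen scheme is illustrative only.
    np.random.seed(seed + batches_seen)
    return np.random.randint(16, size=8)

# The same (seed, batch) pair pins the draw: identical indices on
# every run, which matches the behaviour described above.
pinned_a = draw_indices(42, 0)
pinned_b = draw_indices(42, 0)

# A local Generator keeps its own state, so QA/QC sampling done with
# it is immune to np.random.seed() calls made elsewhere.
rng = np.random.default_rng(123)
np.random.seed(42)                    # does not touch rng's state
local_a = rng.integers(16, size=8)
np.random.seed(0)                     # neither does this
local_b = rng.integers(16, size=8)

# Replaying the same local seed, with no global re-seeding in
# between, reproduces exactly the same two draws.
replay = np.random.default_rng(123)
replay_a = replay.integers(16, size=8)
replay_b = replay.integers(16, size=8)
```

In practice this means the post-batch QA/QC sampling can be made independent of the generator by drawing indices from a dedicated `np.random.default_rng()` instance instead of the global `np.random` functions.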

I'm wondering if others have come across this - is this a bug or am I misunderstanding something?

  • Looking at the source code, it looks as though the seed for numpy is set as soon as the ImageDataGenerator is initialized. – Oxbowerce, Jun 18, 2021 at 16:48
  • Interestingly, the code calls np.random.seed and therefore modifies the global random number generator as my example suggests. Turns out that there are active issues on the Keras repo to address this behavior in other parts of the library such as dataset loaders: github.com/keras-team/keras/issues/12258. – Ali, Jun 19, 2021 at 12:35

1 Answer


Further investigation confirms that, in this case, Keras does indeed modify the global random number generator.

The repo has active issues and PRs that address this behaviour in other areas of the library by using a local random state (see e.g. the issue linked in the comments above).
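A rough way to confirm the mechanism (a sketch of the check, not the library's own code) is to snapshot the legacy global RNG's MT19937 key around a `np.random.seed` call, which is what the generator appears to invoke internally:

```python
import numpy as np

# Key array of the legacy global RNG state (index 1 of the tuple
# returned by np.random.get_state()).
key_before = np.random.get_state()[1].copy()

np.random.seed(42)  # the call the generator appears to make internally
key_after = np.random.get_state()[1].copy()

# Seeding rebuilds the state deterministically, so repeating the same
# seed restores exactly the same key -- hence the identical indices
# observed on every run.
np.random.seed(42)
key_reseeded = np.random.get_state()[1].copy()

global_state_changed = not np.array_equal(key_before, key_after)
```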

