I've been working on a U-Net model using training images stored on my local drive. To load these I have been using Keras' ImageDataGenerator.flow_from_dataframe
method and optionally applying some augmentations.
I have had no problems with this but noticed some odd behaviour when I retrieve batches of data from the flow.
In the simplified example below I am loading 8-bit RGB files from a directory and setting the seed. I've omitted the augmentation parameters here, but I get the same behaviour with and without them.
For QA/QC purposes I will typically get a batch and look at a random selection of images. However, when I get a batch and then generate some random image indices, I always get the same result. This only happens after a batch has been generated, not after initialising the flow generator object.
# Step 1
# Set up image data flow
import numpy as np
from keras.preprocessing.image import ImageDataGenerator

img_generator = ImageDataGenerator(rescale=1/255.)
train_gen = img_generator.flow_from_dataframe(
    img_df,              # filenames are read from column "filename"
    img_dir,             # local directory containing image files
    y_col=None,
    target_size=(512, 512),
    class_mode=None,
    shuffle=False,       # I'm using separate mask images so no shuffling here
    batch_size=16,
    seed=42              # behaviour occurs when using seed
)
# Step 2
# Generate and print 8 random indices
# No batch of images retrieved yet; no use of seed
print(np.random.randint(16, size=8))
>>> [ 7 15 13 3 6 3 2 14] # always random
# Step 3
# Now get a batch of images; seed is used
batch = next(train_gen)
# Step 4
# Generate and print 8 random indices
print(np.random.randint(16, size=8))
>>> [ 6 1 3 8 11 13 1 9] # always the same result
Using a seed of 42, the output of Step 2 changes each time Steps 1 and 2 are executed. This is expected behaviour, since Step 1 should not affect Step 2. However, once a batch is retrieved from the generator in Step 3, Step 4 always returns the same indices.
This behaviour continues as new batches are yielded: the seed state advances on each yield, so each batch produces different indices, but those indices are identical on every run.
With the seed set to 42, the indices generated after the first few batches are:
>>> [ 6 1 3 8 11 13 1 9] # Batch 1
>>> [10 10 5 5 5 8 10 11] # Batch 2
>>> [ 5 3 0 10 4 9 15 2] # Batch 3
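The per-batch pattern above is exactly what you'd get if the iterator re-seeds the global RNG on every yield. Here is a plain-numpy sketch of that behaviour — the `seed + total_batches_seen` expression is my reading of `keras_preprocessing`'s `Iterator`, so treat it as an assumption rather than confirmed internals:

```python
import numpy as np

# Re-seed the global RNG once per "batch", the way I believe the
# iterator does (seed + total_batches_seen is an assumption, not
# confirmed against the Keras source here).
def post_batch_draws(seed, n_batches=3):
    draws = []
    for total_batches_seen in range(n_batches):
        np.random.seed(seed + total_batches_seen)  # happens on each yield
        # ...batch assembly would use the RNG here...
        draws.append(np.random.randint(16, size=8))  # my QA/QC sampling
    return draws

run1 = post_batch_draws(42)
run2 = post_batch_draws(42)
# Different indices per batch, but identical across runs --
# the same pattern as the Batch 1-3 outputs above.
```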
This suggests to me that retrieving a batch of images changes numpy's global seed. In practical terms, I end up always examining the same sample of images. When the seed parameter is not provided, the global state is left unmodified and the outputs differ on every run.
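As a workaround for the QA/QC sampling, a dedicated numpy `Generator` can be used instead of the global `np.random` functions; its stream is unaffected by any `np.random.seed` calls made elsewhere. A minimal sketch, using an explicit `np.random.seed(42)` to stand in for the global re-seed that fetching a batch appears to perform:

```python
import numpy as np

rng = np.random.default_rng()  # local generator with its own state

np.random.seed(42)             # stand-in for next(train_gen) resetting the global seed
sample_a = rng.integers(16, size=8)

np.random.seed(42)             # global state reset again by another "batch"
sample_b = rng.integers(16, size=8)

# The local generator's stream keeps advancing regardless of the
# global re-seeds, so the sampled indices stay random between batches.
print(sample_a)
print(sample_b)
```

The same local `rng` can be reused across the whole QA/QC loop, so every batch gets a fresh, independent selection of indices.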
I'm wondering if others have come across this - is this a bug or am I misunderstanding something?
Update: it turns out the numpy seed is set as soon as the ImageDataGenerator flow is used — Keras calls np.random.seed and therefore modifies the global random number generator, as my example suggests. There are active issues on the Keras repo to address this behaviour in other parts of the library, such as the dataset loaders: github.com/keras-team/keras/issues/12258.