My question: creating a single tfrecords file for my data would take approximately 15 days. The data has 500,000 pairs of templates, and each template is 32 frames (images). To save time, since I have 3 GPUs, I thought I could create three tfrecords files, one per GPU, and finish in 5 days. But when I searched for a way to merge these three files into one, I couldn't find a proper solution.

So, is there any way to merge these three files into one file, or is there any way to train my network by feeding batches of examples extracted from the three tfrecords files, given that I am using the Dataset API?

4 Answers


Since the question was asked two months ago, I expect you have already found a solution. For those who follow: the answer is NO, you do not need to create a single HUGE tfrecord file. Just use the new Dataset API:

import os
import tensorflow as tf

dataset = tf.data.TFRecordDataset(filenames_to_read,
    compression_type=None,    # or 'GZIP', 'ZLIB' if you compressed your data
    buffer_size=10240,        # read buffer size in bytes; 0 means no buffering
    num_parallel_reads=os.cpu_count())  # files read in parallel; None reads them sequentially

# Maybe you want to prefetch some data first.
dataset = dataset.prefetch(buffer_size=batch_size)

# Decode the example
dataset = dataset.map(single_example_parser, num_parallel_calls=os.cpu_count())

dataset = dataset.shuffle(buffer_size=number_larger_than_batch_size)
dataset = dataset.batch(batch_size).repeat(num_epochs)

For details, check the documentation.
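
The `single_example_parser` passed to `map` above is not defined in the snippet. A minimal sketch of what such a parser might look like follows, assuming TensorFlow 2.x and records written as `tf.train.Example` protos. The feature keys `image_raw` and `label` are hypothetical and must be replaced with the keys used when the records were written.

```python
import tensorflow as tf

# Hypothetical feature layout: one encoded image (bytes) and one integer label.
# Replace the keys and types with those used when writing the tfrecords.
FEATURE_SPEC = {
    "image_raw": tf.io.FixedLenFeature([], tf.string),
    "label": tf.io.FixedLenFeature([], tf.int64),
}


def single_example_parser(serialized_example):
    """Parse one serialized tf.train.Example into (image_bytes, label)."""
    parsed = tf.io.parse_single_example(serialized_example, FEATURE_SPEC)
    return parsed["image_raw"], parsed["label"]
```

The parser works on one record at a time, which is why it is applied with `dataset.map(...)` before `batch(...)`.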

  • Actually, yes, you are right. I figured out that the solution was simply to pass a list of filenames to tf.data.TFRecordDataset(); I forgot to post the answer. However, for another, smaller dataset I noticed that passing a single tfrecord file gives better accuracy than passing multiple tfrecords files, and I don't know why. I think the only difference between the two is that the shuffling happens differently. So do you think having one tfrecords file is better than using multiple tfrecords files?
    – W. Sam
    Commented Jul 27, 2018 at 0:15
  • Also, the documentation for num_parallel_reads in TFRecordDataset says that it represents the number of files to read in parallel, but in your example you set it to the number of CPU cores. So if I have 12 CPU cores and 3 tfrecord files, should I set it to 12 or 3? And the same question applies to num_parallel_calls in dataset.map.
    – W. Sam
    Commented Jul 27, 2018 at 0:15
  • @W.Sam I prefer to use a single file if it is not too large, say less than 10 gigabytes. Actually, I changed the num_parallel_calls here. Some references say it should be equal to the batch size. I think it should be treated as a hyperparameter and tuned to find which value works better. Commented Jul 31, 2018 at 19:44
  • My training data is 330 G, my validation data is 179, and my test data is 424, so I think in this case I need a list of multiple files.
    – W. Sam
    Commented Aug 1, 2018 at 1:26
  • @holmescn Thanks for the answer. Is there any way to shuffle the filenames, instead of the actual data, on each epoch?
    – Elbek
    Commented Jul 22, 2020 at 7:01
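
On the last comment's question about shuffling file names rather than records each epoch: one possible approach is to build a dataset of file names, shuffle that, and interleave the files. This is only a sketch, assuming TensorFlow 2.x; the function name `make_dataset` is hypothetical.

```python
import tensorflow as tf


def make_dataset(filenames, num_epochs):
    """Read several TFRecord files, redrawing the file order every epoch."""
    files = tf.data.Dataset.from_tensor_slices(filenames)
    # Shuffle the file names, not the records. The order is redrawn each
    # time the dataset is iterated, i.e. once per epoch under repeat().
    files = files.shuffle(buffer_size=len(filenames),
                          reshuffle_each_iteration=True)
    # Take one record from each open file in turn, so batches mix files.
    records = files.interleave(tf.data.TFRecordDataset,
                               cycle_length=len(filenames),
                               block_length=1)
    return records.repeat(num_epochs)
```

With only a handful of files, shuffling file names alone mixes the data very little, so a record-level `shuffle` after the interleave is probably still worth keeping.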

Addressing the question title directly for anyone looking to merge multiple .tfrecord files:

The most convenient approach would be to use the tf.data API (adapting an example from the docs):

# Create dataset from multiple .tfrecord files
list_of_tfrecord_files = [dir1, dir2, dir3, dir4]
dataset = tf.data.TFRecordDataset(list_of_tfrecord_files)

# Save dataset to .tfrecord file
filename = 'test.tfrecord'
writer = tf.data.experimental.TFRecordWriter(filename)
writer.write(dataset)

However, as pointed out by holmescn, you'd likely be better off leaving the .tfrecord files as separate files and reading them together as a single tensorflow dataset.

You may also refer to a longer discussion regarding multiple .tfrecord files on Data Science Stack Exchange.
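
If a single merged file is really needed, decoding may not be necessary at all. As far as I understand the format, an uncompressed TFRecord file is just a sequence of length-prefixed records with no file-level header, so appending the raw bytes of one file to another should yield a valid TFRecord file. A minimal sketch under that assumption (uncompressed inputs only; the function name `concat_tfrecords` is hypothetical):

```python
import shutil


def concat_tfrecords(input_paths, output_path):
    """Append the raw bytes of each input file to output_path, in order."""
    with open(output_path, "wb") as out:
        for path in input_paths:
            with open(path, "rb") as src:
                # Stream in chunks so large files never sit fully in memory.
                shutil.copyfileobj(src, out)
```

It is worth reading the merged file back with tf.data.TFRecordDataset and comparing the record count against the inputs before deleting the originals.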


The answer by MoltenMuffins works for newer versions of TensorFlow. However, if you are using an older version, you have to iterate through the three tfrecords files and save the records into a new file as follows. This works for TF 1.x versions that include tf.data (it relies on tf.Session, so it will not run unchanged on 2.x).

def comb_tfrecord(tfrecords_path, save_path, batch_size=128):
    with tf.Graph().as_default(), tf.Session() as sess:
        # Read the serialized records from all input files, batch_size at a time
        ds = tf.data.TFRecordDataset(tfrecords_path).batch(batch_size)
        batch = ds.make_one_shot_iterator().get_next()
        writer = tf.python_io.TFRecordWriter(save_path)
        while True:
            try:
                records = sess.run(batch)
                for record in records:
                    writer.write(record)
            except tf.errors.OutOfRangeError:
                # All input records have been consumed
                break
        writer.close()

  • Code only answers are discouraged. Please add some explanation as to how this solves the problem, or how this differs from the existing answers. From Review
    – Nick
    Commented Nov 9, 2019 at 0:58

Customizing the above script to collect the tfrecords files with glob:

import os
import glob
import tensorflow as tf
save_path = 'data/tf_serving_warmup_requests'
tfrecords_path = glob.glob('data/*.tfrecords')
dataset = tf.data.TFRecordDataset(tfrecords_path)
writer = tf.data.experimental.TFRecordWriter(save_path)
writer.write(dataset)
