I have a series of files that are in the following format:

file_1991.xlsx
file_1992.xlsx
# there are some gaps in the file numbering sequence
file_1995.xlsx
file_1996.xlsx
file_1997.xlsx

For each file I want to do something like:

import pandas as pd
data_1995 = pd.read_excel(directory + 'file_1995.xlsx', sheet_name='Sheet1')

do some work on the data, and save it as another file:

with pd.ExcelWriter('output_1995.xlsx') as output_1995:
    data_1995.to_excel(output_1995, sheet_name='Sheet1')

Instead of doing all this by hand for every single file, how can I iterate over the files and repeat the same operation on each one? The file names mostly follow a numerical sequence, but there are some gaps in the sequence.

Thanks for the help in advance.

4 Answers

3

You can use os.listdir or the glob module to list all the files in a directory.

With os.listdir, you can use fnmatch to filter the files like this (a regex works too):

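import fnmatch
import os
import pandas as pd

directory = 'my_directory'
for name in os.listdir(directory):
    if fnmatch.fnmatch(name, '*.xlsx'):
        # os.listdir returns bare names, so join each one with the directory
        data = pd.read_excel(os.path.join(directory, name), sheet_name='Sheet1')
        # Do your thing with data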

Or use the glob module, which combines os.listdir and fnmatch in one call (its patterns are shell-style wildcards, not regular expressions):

import glob
import pandas as pd

for path in glob.glob("/my_directory/*.xlsx"):
    data = pd.read_excel(path, sheet_name='Sheet1')
    # Do your thing with data
2

You should use Python's glob module: https://docs.python.org/3/library/glob.html

For example:

import glob
import pandas as pd

for path in glob.iglob(directory + "file_*.xlsx"):
    data = pd.read_excel(path)
    # ...
5
  • Thanks! Can I use the glob module to assign variable names? For instance, I need to read the file by assigning something like this: data_1995 = pd.read_excel(open('file_1995.xlsx'), sheetname = 'Sheet1')
    – kfp_ny
    Commented Feb 28, 2017 at 3:34
  • @kfp_ny Why would you do that? You need to rethink your program.
    – Ali Gajani
    Commented Feb 28, 2017 at 3:39
  • 1
    @kfp_ny No, you cannot, but if you want to keep the data from every file you can use a dictionary and name the keys after the filenames to make that relation. I would recommend not doing that, though, and finding a way to keep it dynamic if you can, because every file will be loaded into memory and you'll run into the same problem otherwise.
    – umutto
    Commented Feb 28, 2017 at 3:52
  • @AliGajani Right. I think I got things mixed up. I'll try it again. Thanks!
    – kfp_ny
    Commented Feb 28, 2017 at 3:58
  • 1
    @umutto Thanks! I'll try to sort it out.
    – kfp_ny
    Commented Feb 28, 2017 at 3:58
2

I would recommend glob.

Calling glob.glob('file_*') returns a list that you can iterate over to do your work.

Calling glob.iglob('file_*') returns an iterator that yields the same matches one at a time instead of building the whole list first.

The first one will give you something like:

['file_1991.xlsx','file_1992.xlsx','file_1995.xlsx','file_1996.xlsx']
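
Note that glob returns matches in arbitrary order, so sort them if the order matters. If you also want to look files up by year, a dictionary keyed by year is one option. This is only a sketch: the function name paths_by_year is hypothetical, and it assumes the file_YYYY.xlsx naming from the question:

```python
import glob
import os
import re


def paths_by_year(directory):
    """Map each year found in a file_YYYY.xlsx name to that file's path."""
    result = {}
    # iglob yields matches lazily; glob.glob would build the full list first.
    for path in glob.iglob(os.path.join(directory, "file_*.xlsx")):
        match = re.search(r"^file_(\d{4})\.xlsx$", os.path.basename(path))
        if match:
            result[int(match.group(1))] = path
    return result
```

Looping over sorted(paths_by_year(directory)) then visits the years in ascending order, and missing years simply do not appear as keys.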

2

If you know how your file names are constructed, you can try to open each candidate file for reading; open(...) raises FileNotFoundError if the file does not exist, and you can skip that year.

yearly_data = {}

for year in range(1990, 2018):
    try:
        # binary mode, because .xlsx files are not plain text
        f = open('file_%04d.xlsx' % year, 'rb')
    except FileNotFoundError:
        continue  # to the next year
    yearly_data[year] = ...
    f.close()
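
The same check can be written without opening the file at all, by testing whether each candidate path exists. A small sketch under the same naming assumption; the function name existing_years and its parameters are hypothetical:

```python
import os


def existing_years(directory, first_year, last_year):
    """Return the years in [first_year, last_year] for which file_YYYY.xlsx exists."""
    years = []
    for year in range(first_year, last_year + 1):
        path = os.path.join(directory, "file_{:04d}.xlsx".format(year))
        if os.path.isfile(path):  # True only for an existing regular file
            years.append(year)
    return years
```

Compared with the try/except version, this leaves no file handle to close, but the file could still disappear between the check and the later read, so the reading code should be prepared for that case.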
