
I am trying to iterate over an iterable several times. However, once the iterable is consumed it can no longer yield anything, which means that after my first pass I can no longer use it.
In my case, I have an Excel file of 10 000 lines, and I am creating a TextFileReader to avoid loading all my data into memory. I fix a number of iterations, each of which executes the same operations over the lines of the file. Because the iterable is exhausted after one pass, I can't run the operations from the second iteration onward. So I tried a global iteration loop in which I redefine the iterable each time. Is there a better way to get around this issue?

The main reason of using an iterable in my case is to avoid loading data in memory.

Code causing issue

import pandas as pd

### read file through an iterable
df_test = pd.read_csv('filet_to_read.csv', sep=';', quotechar='"', escapechar='\\', iterator=True, chunksize=15, encoding='utf-8', converters={'Ident': str})
### iterations
iterations = 5
for i in range(iterations):
    for chunk in df_test:
        pass  ## Do_operations
    print('end of iteration :', i)

### After first iteration, no more operations are possible because iterable is consumed
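
The behaviour described here is not specific to pandas: any iterator is one-shot. A minimal sketch with a plain generator, where the hypothetical function `chunks` stands in for the chunked reader, shows the second pass coming back empty:

```python
def chunks():
    """Hypothetical stand-in for a chunked reader: yields three small chunks."""
    yield from ([1, 2], [3, 4], [5, 6])

reader = chunks()            # one iterator object, created once
first_pass = list(reader)    # consumes every chunk
second_pass = list(reader)   # the same iterator is already exhausted

print(first_pass)   # [[1, 2], [3, 4], [5, 6]]
print(second_pass)  # []
```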

My solution

iterations = 5
for i in range(iterations):
    ### re-create the reader so that each pass starts from the top of the file
    df_test = pd.read_csv('filet_to_read.csv', sep=';', quotechar='"', escapechar='\\', iterator=True, chunksize=15, encoding='utf-8', converters={'Ident': str})
    for chunk in df_test:
        pass  ## Do_operations
    print('end of iteration :', i)
  • Use the csv module and seek() the start of the file at the end? (I actually don't know if seek is a method on the csvreader object, I'll have to look it up)
    – roganjosh
    Commented Dec 7, 2018 at 15:40
  • Although this is a different example, the code shows how you can seek to the start of the file docs.python.org/2/library/csv.html#csv.Sniffer
    – roganjosh
    Commented Dec 7, 2018 at 15:42
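
A rough sketch of the csv-plus-seek idea from the comments above. `seek()` is a method of the file object rather than of the csv reader, so this version rewinds the file and builds a fresh reader for every pass. The temporary file and its contents are made up for illustration:

```python
import csv
import os
import tempfile

# Write a small semicolon-separated file so the sketch is self-contained.
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False, newline="") as tmp:
    tmp.write("Ident;value\n001;a\n002;b\n")
    path = tmp.name

passes = []
with open(path, newline="", encoding="utf-8") as f:
    for _ in range(3):
        f.seek(0)                              # rewind the file, not the reader
        reader = csv.reader(f, delimiter=";")  # fresh reader for this pass
        passes.append([row for row in reader])

os.remove(path)
print(passes[0])  # [['Ident', 'value'], ['001', 'a'], ['002', 'b']]
```

Rows are read one at a time on every pass, so the whole file is never held in memory by the reader itself; the `passes` list only exists here to make the result visible.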

3 Answers

3

You could use tee, from the documentation:

Return n independent iterators from a single iterable.

Example

from itertools import tee


it = range(5)

for i in tee(it, 5):
    print(list(i))

Output

[0, 1, 2, 3, 4]
[0, 1, 2, 3, 4]
[0, 1, 2, 3, 4]
[0, 1, 2, 3, 4]
[0, 1, 2, 3, 4]
  • Yup, just remember to leave it alone once you've tee'd it.
    – timgeb
    Commented Dec 7, 2018 at 15:46
  • @mouni93 the documentation linked to says: "Once tee() has made a split, the original iterable should not be used anywhere else; otherwise, the iterable could get advanced without the tee objects being informed."
    – timgeb
    Commented Dec 7, 2018 at 15:53
  • The context of the question is "The main reason of using an iterable in my case is to avoid loading data in memory." itertools.tee won't work for that, since it will load all the data into memory.
    – Don Hatch
    Commented Jun 7 at 4:31
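
One way to see the buffering mentioned in the last comment: in the sketch below, the hypothetical generator `source` records every item it hands out. Once the first tee iterator has been consumed, the source is exhausted, yet the second tee iterator still yields every item, so tee must have kept them all.

```python
from itertools import tee

pulled = []  # records each item the underlying source actually produces

def source():
    """Hypothetical one-shot data source."""
    for x in range(5):
        pulled.append(x)
        yield x

a, b = tee(source(), 2)

first = list(a)              # drives the source to exhaustion
pulled_after_first = list(pulled)

second = list(b)             # served from tee's internal buffer
print(pulled_after_first)    # [0, 1, 2, 3, 4]
print(second)                # [0, 1, 2, 3, 4]
print(pulled)                # [0, 1, 2, 3, 4] (the source was read only once)
```

When the tee iterators are consumed one after another, as in the answer's example, that buffer grows to hold the entire input.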
1

I had an issue very similar to yours. I needed to iterate over a database table many times without keeping it in memory. The above solution didn't solve my problem, as my code required passing the iterator many times, and to many functions at different levels. I came up with a solution that I believe is more elegant and general than tee, and wanted to share it here.

This "looper" class lets you iterate over any iterator multiple times by regenerating the internal iterator each time a new loop starts. The outer looper still raises StopIteration at the end of each pass, so when n=None it behaves like a list or tuple under repeated iteration. When n is given, the looper stops yielding after n complete passes.

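class looper:
    def __init__(self, gen_iter_func, n=None):
        self.gen_iter_func = gen_iter_func  # callable returning a fresh iterator
        self.n = n                          # remaining passes, or None for unlimited

    def __iter__(self):
        # Build a fresh inner iterator at the start of every pass.
        self.iterable = self.gen_iter_func()
        return self

    def __next__(self):
        if self.n is not None and self.n <= 0:
            raise StopIteration
        try:
            return next(self.iterable)
        except StopIteration:
            # The pass is complete: count it, then end this loop.
            if self.n is not None:
                self.n -= 1
            raise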

Example using range:

def gen_iter():
    return iter(range(5))

loop = looper(gen_iter,n=3)
for i in range(5):
    print('-------{}-------'.format(i))
    for x in loop:
        print(x)

Output:

-------0-------
0
1
2
3
4
-------1-------
0
1
2
3
4
-------2-------
0
1
2
3
4
-------3-------
-------4-------
0

The currently accepted answer isn't a solution, since it loads all the data into memory. (The question says: "The main reason of using an iterable in my case is to avoid loading data in memory.") So here's another try.

It looks to me like you want df_test to be a robust iterable representing the file ("robust" in the sense that you can iterate over it multiple times without the iterations interfering with each other). Here is a little helper class to help accomplish that.

class RobustIterableFromFunctionReturningIterator:
  def __init__(self, function_returning_iterator):
    self.function_returning_iterator = function_returning_iterator
  def __iter__(self):
    return self.function_returning_iterator()

To use it in your case:

import pandas as pd

df_test = RobustIterableFromFunctionReturningIterator(
    lambda: pd.read_csv('filet_to_read.csv', sep=';', quotechar='"', escapechar='\\', iterator=True, chunksize=15, encoding='utf-8', converters={'Ident': str}))
### iterations
iterations = 5
for i in range(iterations):
    for chunk in df_test:
        pass  ## Do_operations
    print('end of iteration :', i)

The operations it ends up doing under the covers are the same as in your original solution, but the program structure and underlying mental model are cleaner (in my opinion) and more faithful to your original program structure.
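
For readers without pandas at hand, the same pattern can be sketched with the standard library only. Everything named here is an assumption made for illustration: `ReIterable` is a short-named equivalent of the helper class, `read_chunks` is a hypothetical chunking generator, and the in-memory string `DATA` stands in for the file on disk.

```python
import csv
import io
from itertools import islice

class ReIterable:
    """Hypothetical short-named equivalent of the helper class above."""
    def __init__(self, make_iterator):
        self.make_iterator = make_iterator

    def __iter__(self):
        # Every loop over this object gets its own fresh iterator.
        return self.make_iterator()

# Stand-in for the CSV file; a real program would open the file instead.
DATA = "Ident;value\n001;a\n002;b\n003;c\n"

def read_chunks(chunksize=2):
    """Yield lists of at most `chunksize` rows, one chunk at a time."""
    reader = csv.DictReader(io.StringIO(DATA), delimiter=";")
    while True:
        chunk = list(islice(reader, chunksize))
        if not chunk:
            return
        yield chunk

chunks = ReIterable(read_chunks)
sizes_per_pass = [[len(chunk) for chunk in chunks] for _ in range(3)]
print(sizes_per_pass)  # [[2, 1], [2, 1], [2, 1]]
```

Each pass re-reads the source from the start, which trades repeated I/O for a memory footprint of roughly one chunk.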


There are plenty of other use cases for this helper class, too. E.g. to make a robust infinite range object from the function-returning-iterator itertools.count:

import itertools
infinite_range = RobustIterableFromFunctionReturningIterator(itertools.count)
# Test it by doing nested iterations over it
pairs = []
for i in infinite_range:
  for j in infinite_range:
    print(f"i={i} j={j}")
    pairs.append((i,j))
    if j+1 == 4: break
  if i+1 == 2: break
assert pairs == [(0,0),(0,1),(0,2),(0,3), (1,0),(1,1),(1,2),(1,3)], pairs
