
How would you split one iterator into two without iterating twice or using additional memory to store all the data?

Solution when you can store everything in memory:

l = [{'a': i, 'b': i * 2} for i in range(10)]
def a(iterator):
    for item in iterator:
        print(item)
def b(iterator):
    for item in iterator:
        print(item)

a([li['a'] for li in l])
b([li['b'] for li in l])

or if you can iterate twice,

class SomeIterable(object):
    def __iter__(self):
        for i in range(10):
            yield {'a': i, 'b': i * 2}


def a(some_iterator):
    for item in some_iterator:
        print(item)


def b(some_iterator):
    for item in some_iterator:
        print(item)


s = SomeIterable()

a((si['a'] for si in s))
b((si['b'] for si in s))

But how would I do it if I want to iterate only once?

  • Without iterating twice or using additional memory? You don't. Commented Jun 15, 2015 at 22:37
  • If a must complete before b begins, this is literally impossible. If that isn't a requirement, the problem is still a huge pain; you either need to rewrite a and b, or you need to use threads. Commented Jun 15, 2015 at 22:38
  • Your question is puzzling, but have you tried itertools.tee?
    – hpaulj
    Commented Jun 15, 2015 at 22:47
  • By the way, your SomeIterable class is not actually an iterator. An iterator has a __next__ (or next in Python 2) method, in addition to an __iter__ method that returns itself. Your class is more appropriately described as an "iterable".
    – Blckknght
    Commented Jun 15, 2015 at 22:49
  • @lqdc I understand that the a and b that you're showing are simple examples to show what you want and not the real ones, but does a really need to go through the whole dataset before running b? Would it be ok to supply just one element to a, then to b, and then go to the next in the generator? Commented Jun 15, 2015 at 23:01

3 Answers


From the clarification in the comments, a and b are external library functions you cannot rewrite, but it's okay to interleave their execution. In that case, what you want is possible, but it pretty much requires threads:

import multiprocessing.pool  # for ThreadPool; runs threads, not processes
import queue

_endofinput = object()

def _queueiter(q):
    while True:
        item = q.get()
        if item is _endofinput:
            break
        yield item

def parallel_execute(funcs, iterable, maxqueue):
    '''Interleaves the execution of funcs[0](iterable), funcs[1](iterable), etc.

    No function is allowed to lag more than maxqueue items behind another.
    (This will require adjustment if a function might return before consuming
    all input.)

    Makes only one pass over iterable.

    '''

    queues = [queue.Queue(maxsize=maxqueue) for _ in funcs]
    queueiters = [_queueiter(q) for q in queues]
    threadpool = multiprocessing.pool.ThreadPool(processes=len(funcs))

    results = threadpool.starmap_async(lambda f, it: f(it), zip(funcs, queueiters))

    for item in iterable:
        for q in queues:
            q.put(item)

    for q in queues:
        q.put(_endofinput)

    threadpool.close()
    return results.get()
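The same queue-feeding idea can be sketched with plain threading, which may make the mechanism easier to see. Here sum_a and sum_b are hypothetical stand-ins for the external consumer functions; each one receives an ordinary iterator and consumes it to completion while the main thread feeds both queues:

```python
import queue
import threading

_END = object()  # sentinel marking the end of input

def queue_iter(q):
    """Yield items from q until the sentinel arrives."""
    while True:
        item = q.get()
        if item is _END:
            return
        yield item

# Stand-ins for the external library functions a and b.
def sum_a(it):
    return sum(d['a'] for d in it)

def sum_b(it):
    return sum(d['b'] for d in it)

q1, q2 = queue.Queue(maxsize=10), queue.Queue(maxsize=10)
results = {}
t1 = threading.Thread(target=lambda: results.setdefault('a', sum_a(queue_iter(q1))))
t2 = threading.Thread(target=lambda: results.setdefault('b', sum_b(queue_iter(q2))))
t1.start(); t2.start()

# One pass over the source: every item goes to both queues.
for d in ({'a': i, 'b': i * 2} for i in range(10)):
    q1.put(d)
    q2.put(d)
q1.put(_END)
q2.put(_END)
t1.join(); t2.join()
# results == {'a': 45, 'b': 90}
```

The bounded queue sizes are what keep memory use constant: the producer blocks whenever either consumer falls maxsize items behind.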

If the functions that consume your two iterators are not under your control and don't return to your code before consuming the entire iterator, there is no way to do what you want. You'll either need to hold all of the data in memory between the function calls or regenerate the iterator for the second function.

Now, if your functions were generators (that yield back to your code after consuming some small number of items from the input), you could make it work with itertools.tee. There might also be some other partial workarounds if you can call one or both of your functions with various parts of the input data at a time and then somehow combine the results of the repeated calls into the desired output. Otherwise, you're probably out of luck.


OK, if your functions are stateless, but still expect an iterable as an argument, and that's the whole problem, then this should do:

for si in s:
    a([si['a']])
    b([si['b']])
  • Well, they need the whole iterator to generate output. Like you cannot add to a trie once it's generated because it's read only after that.
    – lqdc
    Commented Jun 15, 2015 at 23:12
  • Well, if the functions are really stateless, then I don't see how this shouldn't work. Either that, or there's still something we need to learn about your problem. Commented Jun 15, 2015 at 23:14
  • alright, replace a with def a(iterable): return marisa_trie.Trie(iterable). So I was wrong when saying they are stateless. They expect the whole iterable to generate output and then it's read only after that. An example is from this package: github.com/kmike/marisa-trie
    – lqdc
    Commented Jun 15, 2015 at 23:18
  • I think the only important part there is if marisa_trie.Trie(iterable) returns the same as [marisa_trie.Trie([x]) for x in iterable]. Then you can call a and b separately for each key, and this works. Commented Jun 15, 2015 at 23:23
  • That would generate n tries for n items in the iterable. Basically I want one object at the end, and you cannot pass another iterable to it once it's finished with the first one.
    – lqdc
    Commented Jun 15, 2015 at 23:25
