
I have a function that does something like this:

from typing import Iterator

def function_a(x_iter: Iterator[dict]):
    y = {}
    for x in x_iter:
        x = other_func_1(x)
        y = other_func_2(x)
        yield x, y

Downstream in the process, I want to use x and y separately, e.g. I want to pass x as an iterator to another function and I want to save y to a JSON file. I know we can't call it like this:

x, y = function_a(x_iter)

because x and y will be in the same iterator. How should I separate them? I don't think I can do this:

result = function_a(x_iter)
for x, y in result:
    <do something with x>
    <do something with y>

since x needs to be passed to another function downstream as an iterator.

Thank you

  • So, I'm confused. Have you tried your code or not? If you tried some code, then please update your question with this code and the results and a comment about whether it works for you.
    – quamrana
    Commented Aug 18, 2022 at 11:17
  • Maybe it's related - stackoverflow.com/questions/46941719
    – Daniel Hao
    Commented Aug 18, 2022 at 11:20
  • So, I've tried your code (suitably modified) and it seems to work fine. However, without some concrete code from you it's impossible to tell exactly what is not working.
    – quamrana
    Commented Aug 18, 2022 at 11:23
  • @quamrana: I don't know what you tried, but x, y = function_a(x_iter) definitely doesn't work. As for the for loop, it's impossible to write the code like that because the iterators need to be processed by downstream functions that take iterators; the questioner cannot write an element-by-element loop. Commented Aug 18, 2022 at 11:27
  • itertools.tee only pays off if the tee iterators stay positioned close to each other in the data stream. It doesn't help in use cases where one iterator will be fully consumed before the other. Commented Aug 18, 2022 at 12:11

2 Answers


If you can't rewrite the consumer functions, you're going to have to do at least one of the following three things:

  • Run the downstream functions simultaneously in two separate threads.
  • Fully materialize at least one of the data streams involved - maybe save the y elements to a list while you iterate over the x elements, or vice versa, or materialize the underlying x_iter to a list so you can make two passes to generate the x and y elements.
  • Generate the input iterator twice.

(If you're really going to save the y elements to a JSON file, you're probably materializing that data to a list anyway, unless you're using a streaming JSON serializer, but it sounds like the JSON thing is just an example.)


Say you iterate over all the x elements somehow. The whole time you're doing that, function_a is producing y elements. You could try to use those elements as function_a produces them, but if you want to do that while the downstream function that consumes the x elements is running, you'll have to run the y consumer at the same time, and the only way to do that without rewriting the consumer functions is to run them in two separate threads. That's the threads option.
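For illustration, here's a minimal sketch of the threads option, with consume_xs and consume_ys standing in for hypothetical downstream functions that each take an iterator (those names aren't from the question):

import queue
import threading

_SENTINEL = object()  # marks the end of a stream

def _drain(q):
    # Expose a queue as an iterator that a consumer function can loop over.
    while True:
        item = q.get()
        if item is _SENTINEL:
            return
        yield item

def run_both(x_iter, consume_xs, consume_ys):
    x_q = queue.Queue(maxsize=100)
    y_q = queue.Queue(maxsize=100)
    t_x = threading.Thread(target=consume_xs, args=(_drain(x_q),))
    t_y = threading.Thread(target=consume_ys, args=(_drain(y_q),))
    t_x.start()
    t_y.start()
    for x, y in function_a(x_iter):  # the generator from the question
        x_q.put(x)
        y_q.put(y)
    x_q.put(_SENTINEL)
    y_q.put(_SENTINEL)
    t_x.join()
    t_y.join()

The bounded queues keep the producer from running far ahead of a slow consumer, so memory stays bounded.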

If you don't use the y elements immediately, you can store them, but if you're not using separate threads, the y consumer will have to wait until the x consumer finishes. That means you'll have to store the entire y data stream, probably to a list.
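A minimal sketch of that option, again with a hypothetical consume_xs standing in for the downstream x consumer:

import json

def consume_xs_and_save_ys(x_iter, consume_xs, json_path):
    ys = []

    def xs_only():
        for x, y in function_a(x_iter):
            ys.append(y)  # store each y for later
            yield x       # stream x straight through to the consumer

    consume_xs(xs_only())  # ys holds a y for every pair the consumer pulled

    with open(json_path, "w") as f:
        json.dump(ys, f)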

If you don't use the y elements immediately, and you don't store them, then they're gone. You can't pull them out of nothing when you're done with the x elements. You have to generate them again, which means you'll need the elements of x_iter, but those elements are gone too. You'll need to either recreate x_iter, or store its contents up front (probably to a list). If you go that way, you probably wouldn't have function_a generate both x and y elements - you'd probably write one function that generates the x elements and one that generates the y elements, so you don't waste time doing work you don't need.
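A minimal sketch of that last variant, with function_x and function_y as hypothetical single-purpose replacements for function_a, and consume_xs/consume_ys again standing in for the downstream consumers:

def function_x(x_iter):
    for x in x_iter:
        yield other_func_1(x)

def function_y(x_iter):
    for x in x_iter:
        # As in the question's code, y is derived from other_func_1's output.
        yield other_func_2(other_func_1(x))

inputs = list(x_iter)           # store the input elements up front
consume_xs(function_x(inputs))  # first pass: feed the x consumer
consume_ys(function_y(inputs))  # second pass: feed the y consumer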


Note that itertools.tee doesn't let you get out of this. It has to store elements in memory too. itertools.tee only pays off in cases where the tee iterators will stay positioned close to each other in the data stream. It's worse than a list if you're going to iterate over one iterator fully before starting on the other.
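For reference, the tee-based split that caveat is about looks something like this; it works, but tee has to buffer every pair until the lagging iterator catches up, so fully consuming xs before touching ys ends up holding the whole stream in memory anyway:

from itertools import tee
from operator import itemgetter

pairs_for_x, pairs_for_y = tee(function_a(x_iter))
xs = map(itemgetter(0), pairs_for_x)  # iterator over just the x elements
ys = map(itemgetter(1), pairs_for_y)  # iterator over just the y elements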

  • Although it is impossible to tell without knowing what functions OP intends to use the iterators in, I think you could add generator coroutines as another way to solve this, i.e. OP's function_a sends (via gen.send) the x and y values into corresponding consumer coroutines. Commented Aug 18, 2022 at 12:24
  • @SayandipDutta: If you mean rewriting the consumer functions as generator-based coroutines, I don't think the questioner can rewrite the consumers. Commented Aug 18, 2022 at 12:27
  • I see. Nevermind then :) Commented Aug 18, 2022 at 12:28

You cannot, because your generator inherently produces a tuple of two values. What you can do is write a wrapper that ignores one of the values and yields only the one you want.

other_func_1 = lambda x: x*2
other_func_2 = lambda x: x*3

def function_a(x_iter):
    for x in x_iter:
        x1 = other_func_1(x)
        x2 = other_func_2(x)
        yield x1, x2

def take_ith(x_iter, i):
    # Yield only the i-th element of each tuple produced by x_iter.
    for x in x_iter:
        yield x[i]
        
print(list(function_a(range(10))))
print(list(take_ith(function_a(range(10)), 0)))

If you need to generate values of x and y separately, it probably means they shouldn't be grouped in the generator in the first place.

  • Do you have an idea of how not to group them? I was thinking of running x = other_func_1(x) and y = other_func_2(x) in separate functions, but that means I have to iterate over x_iter twice, and that is inefficient.
    – eng2019
    Commented Aug 18, 2022 at 11:37
  • @eng2019 It's hard to tell the best solution without seeing the whole code, but can't you save y further along in the code, at the point where the values of x are actually used?
    – matszwecja
    Commented Aug 18, 2022 at 11:44
