Python: Iterating through lines in files

Question

queries = open(sys.argv[1],"rU")
tweets = open(sys.argv[2],"rU")
for query in queries:
    for tweet in tweets:
        query_words = query.split()
        tweet_words = tweet.split()
        for qword in query_words:
            for tword in tweet_words:
               #Comparison

I'm trying to use Python to iterate over two files, each containing multiple lines. I want to break each line in both files into words and then compare each word in the current line of the "query" file with each word in the current line of the "tweet" file. The above is what I have so far, but it only works for the first line in the query file and skips the rest of its lines. It does work for each line in the tweet file. Any help?

Edit for the duplicate comment: I understand that after iterating over the queries file, the file handle will be positioned at EOF. But I don't get why it isn't processing the next line in the queries file and instead goes directly to EOF.

You may want to use queries.readlines() and tweets.readlines(). — user2746752, Commented Apr 19, 2015 at 8:12

mirosval · Accepted Answer · 2015-04-19 08:18:30Z

2

Essentially what happens is that you go through all the lines in one file while looking just at the first line in the other file. You cannot go through those lines in the next iteration, because you've already read them out.

Do it like this:

queries = open(sys.argv[1],"rU").readlines()
tweets = open(sys.argv[2],"rU").readlines()

for i in range(min(len(queries), len(tweets))):
    tweet = tweets[i]
    query = queries[i]

    # comparison

answered Apr 19, 2015 at 8:18

mirosval


This works as follows: 1 query - 1 tweet 2 query - 2 tweet. What I wanted to do was: 1 query - 1 tweet 1 query - 2 tweet 2 query - 1 tweet 2 query - 2 tweet. Basically I want to compare a query with every tweet before moving on to the next query
– Sajal Sharma
Commented Apr 19, 2015 at 8:26
Ah, then you can use a double for loop, which will work now too: since we obtained the queries and tweets with readlines(), you can iterate over both multiple times without running into EOF.
– mirosval
Commented Apr 19, 2015 at 9:07
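
A minimal sketch of that double loop, assuming both files fit in memory. The helper name all_pairs and the sample lines are made up for illustration, and io.StringIO objects stand in for the real files opened from sys.argv.

```python
import io

def all_pairs(queries_file, tweets_file):
    """Return every (query, tweet) line pair, with queries in the outer loop."""
    # readlines() copies the lines into lists, and lists can be iterated repeatedly.
    queries = queries_file.readlines()
    tweets = tweets_file.readlines()
    pairs = []
    for query in queries:
        for tweet in tweets:
            pairs.append((query.strip(), tweet.strip()))
    return pairs

# Hypothetical stand-ins for the two real files.
queries_file = io.StringIO("q1\nq2\n")
tweets_file = io.StringIO("t1\nt2\n")
pairs = all_pairs(queries_file, tweets_file)
print(pairs)
```

Each query is paired with every tweet before the outer loop moves on, which is the ordering asked for in the comment above.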


Community · Accepted Answer · 2017-05-23 12:00:56Z

The problem is that after you iterate through every line of a file, you're at EOF. You either have to open it again, or make sure each line is fully processed (split and compared, in your example) before reading or iterating to the next line. In your example, the tweets file is at EOF after the first iteration over queries, so it looks as if the queries file "skipped" to EOF from the second iteration on, simply because there are no more tweets to iterate through in the nested loop.
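
The exhaustion can be seen in isolation with a small sketch. It assumes an in-memory io.StringIO as a stand-in for the real tweets file; the variable names are illustrative only.

```python
import io

tweets = io.StringIO("first tweet\nsecond tweet\n")  # stand-in for the real file

first_pass = [line for line in tweets]   # consumes the file up to EOF
second_pass = [line for line in tweets]  # nothing left to read

print(len(first_pass), len(second_pass))

tweets.seek(0)                           # rewind to the beginning
third_pass = [line for line in tweets]   # all lines are available again
print(len(third_pass))
```

The second pass comes back empty because the file object remembers its position; only rewinding (or reopening) makes the lines available again.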

Also, although garbage collection handles file closing for you, it is still a better practice to explicitly close each opened file.
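
One common way to get the explicit close is a with block, sketched here under the assumption that a throwaway temporary file is acceptable for the demonstration. The file contents and variable names are made up.

```python
import os
import tempfile

# Create a throwaway file so the sketch is self-contained.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("hello world\n")
    path = tmp.name

with open(path) as handle:
    lines = handle.readlines()

# The with block closed the file on exit, even though close() was never called.
closed_after_block = handle.closed
os.remove(path)
print(lines, closed_after_block)
```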

Refer to @Smac89's answer for modification.

Morb · Accepted Answer · 2015-04-19 08:19:58Z

1

Instead of doing for loops like that, use the function file.readline()

queries = open(sys.argv[1],"rU")
tweets = open(sys.argv[2],"rU")
query = queries.readline()
tweet = tweets.readline()
while (query != "" and tweet != ""):
    query_words = query.split()
    tweet_words = tweet.split()
    #comparison
    query = queries.readline()
    tweet = tweets.readline()

mirosval provided an easier answer; use his.

answered Apr 19, 2015 at 8:19

Morb



smac89 · Accepted Answer · 2015-04-19 08:38:18Z

1

Consider using file.seek:

with open(sys.argv[1],"rU") as queries:
    with open(sys.argv[2],"rU") as tweets:
        for query in queries:
            query_words = query.split()
            for tweet in tweets:
                tweet_words = tweet.split()
                for qword in query_words:
                    for tword in tweet_words:
                        #Comparison
            tweets.seek(0) # go back to the start of the file

edited Apr 19, 2015 at 8:38

answered Apr 19, 2015 at 8:27

smac89


1

This works, and Thomas explained why it does. Thanks to both of you!
– Sajal Sharma
Commented Apr 19, 2015 at 8:33
@SajalSharma Yup, I think this is the best answer for your purpose. No problem. :)
– Thomas Hsieh
Commented Apr 19, 2015 at 11:39
Or instead of reading the tweets file x times (depending on the number of lines in the queries file), you can store the data in a list, it's better for your hard drive, unless you can't afford to use too much RAM.
– Morb
Commented Apr 20, 2015 at 6:07


Serge Ballesta · Accepted Answer · 2015-04-19 08:43:03Z

You want to iterate over the second file for each line of the first file. But look at what happens:

you open both files
you start iterating over the first file
you get the first line of the first file
you iterate over the second file until the end => the pointer of the second file is at EOF
you try to process the second line of the first file
the pointer of the second file is already at EOF, so you immediately move on to the next line of the first file without any processing
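
The steps above can be traced with a short sketch. It assumes in-memory io.StringIO objects in place of the two real files, and the processed list is a hypothetical stand-in for the comparison step.

```python
import io

queries = io.StringIO("q1\nq2\nq3\n")
tweets = io.StringIO("t1\nt2\n")

processed = []
for query in queries:
    for tweet in tweets:  # yields lines only during the first outer pass
        processed.append((query.strip(), tweet.strip()))

print(processed)
```

Only the pairs for q1 are recorded: the outer loop still visits q2 and q3, but the inner loop has nothing left to yield for them.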

So you have to rewind the second file after each iteration of the first file. There are two ways to do it:

load the second file into memory as a list of lines with readlines and iterate through that list. Because it is a list (and not a file), iteration starts at the first position instead of the current one

queries = open(sys.argv[1],"rU")
tweets_file = open(sys.argv[2],"rU")
tweets = tweets_file.readlines() # tweets is now a list of lines
for query in queries:
    for tweet in tweets:
        query_words = query.split()
        tweet_words = tweet.split()
        for qword in query_words:
            for tword in tweet_words:
               #Comparison

explicitly rewind the file with seek

queries = open(sys.argv[1],"rU")
tweets = open(sys.argv[2],"rU")
for query in queries:
    for tweet in tweets:
        query_words = query.split()
        tweet_words = tweet.split()
        for qword in query_words:
            for tword in tweet_words:
               #Comparison
    tweets.seek(0) # explicitly rewind tweets

The first solution reads the second file only once but uses more memory. It should be preferred if the second file is small (less than several hundred MB on recent machines). The second solution uses less memory and should be preferred if the second file is huge, or if you have to save memory for any reason (embedded system, lower impact of a script, ...).
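
For completeness, here is a sketch of the rewind approach with the comparison step filled in. The comparison is an assumption: it treats two words as matching when they are exactly equal and uses a set intersection instead of the two innermost loops. The function name common_words and the sample lines are hypothetical, and io.StringIO stands in for the real files.

```python
import io

def common_words(queries, tweets):
    """Yield (query, tweet, shared_words) for every pair that shares a word."""
    for query in queries:
        query_words = set(query.split())
        for tweet in tweets:
            # Assumed comparison: exact word equality via set intersection.
            shared = query_words & set(tweet.split())
            if shared:
                yield query.strip(), tweet.strip(), sorted(shared)
        tweets.seek(0)  # rewind so the next query sees every tweet again

queries = io.StringIO("python files\nnested loops\n")
tweets = io.StringIO("reading files in python\nloops are fun\n")
results = list(common_words(queries, tweets))
for query, tweet, shared in results:
    print(query, "|", tweet, "|", shared)
```

Swapping the seek(0) call for a tweets list built once with readlines() gives the in-memory variant, with the trade-off described above.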
