import sys

queries = open(sys.argv[1], "r")
tweets = open(sys.argv[2], "r")
for query in queries:
    for tweet in tweets:
        query_words = query.split()
        tweet_words = tweet.split()
        for qword in query_words:
            for tword in tweet_words:
                pass  # Comparison

I'm trying to use Python to iterate over two files that each contain multiple lines. I want to break each line in both files into words, and then compare each word in the current line of the "query" file with each word in the current line of the "tweet" file. The above is what I have so far, but it only works for the first line in the query file and skips the rest of the lines in it. It does work for each line in the tweet file. Any help?

Edit for the duplicate comment: I understand that after iterating over the queries file, the file handle will be positioned at EOF. But I don't get why it isn't processing the next line in the queries file and is instead going directly to EOF.


5 Answers


Essentially what happens is that you go through all the lines in one file while looking just at the first line in the other file. You cannot go through those lines in the next iteration, because you've already read them out.

Do it like this:

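import sys

# Read both files fully into lists so they can be iterated more than once.
with open(sys.argv[1], "r") as queries_file:
    queries = queries_file.readlines()
with open(sys.argv[2], "r") as tweets_file:
    tweets = tweets_file.readlines()

# zip() stops at the shorter list, pairing query i with tweet i.
for query, tweet in zip(queries, tweets):
    pass  # comparison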
  • This pairs the lines up one to one: query 1 with tweet 1, query 2 with tweet 2. What I wanted was: query 1 with tweet 1, query 1 with tweet 2, query 2 with tweet 1, query 2 with tweet 2. Basically, I want to compare a query with every tweet before moving on to the next query. Commented Apr 19, 2015 at 8:26
  • Ah, then you can use a double for loop, which will also work now: since we obtained the queries and tweets with readlines(), you can iterate over both multiple times without running into EOF.
    – mirosval
    Commented Apr 19, 2015 at 9:07
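
A minimal sketch of the double loop described in the comment above, assuming both files fit in memory. The `io.StringIO` objects stand in for the real files, and the sample lines and the `pairs` list are illustrative only; they are not part of the original question.

```python
import io

# In-memory stand-ins for the two files (the sample data is made up).
queries_file = io.StringIO("q1 a\nq2 b\n")
tweets_file = io.StringIO("t1 x\nt2 y\n")

# Lists can be iterated any number of times, unlike file objects.
queries = queries_file.readlines()
tweets = tweets_file.readlines()

pairs = []
for query in queries:
    query_words = query.split()
    for tweet in tweets:
        tweet_words = tweet.split()
        # Record which query line was compared with which tweet line.
        pairs.append((query_words[0], tweet_words[0]))

print(pairs)
```

Every query line is paired with every tweet line, because the inner loop restarts from the beginning of the `tweets` list on each pass.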

The problem is that once you have iterated through every line of a file, the file object is at EOF. You either have to open it again, or make sure each line is fully processed (split and compared, in your example) before reading or iterating to the next line. In your example, tweets is at EOF after the first iteration over queries, so from the second query onward the inner loop has no tweets left to iterate over. That makes it look as if queries "skipped" straight to EOF.

Also, although garbage collection handles file closing for you, it is still better practice to close each opened file explicitly.
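
One common way to get the explicit close is a `with` block. The following is a small sketch under stated assumptions: the temporary file and its contents exist only to keep the example self-contained.

```python
import os
import tempfile

# Create a throwaway file so the example does not depend on real input.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("first line\nsecond line\n")
    path = tmp.name

# The file is closed when the with block exits, even if an exception is raised.
with open(path, "r") as handle:
    lines = handle.readlines()

print(handle.closed)
print(len(lines))

os.remove(path)  # clean up the throwaway file
```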

Refer to @Smac89's answer for modification.


Instead of using for loops like that, use the file.readline() method:

import sys

queries = open(sys.argv[1], "r")
tweets = open(sys.argv[2], "r")
query = queries.readline()
tweet = tweets.readline()
# readline() returns "" at EOF, so stop when either file runs out.
while query != "" and tweet != "":
    query_words = query.split()
    tweet_words = tweet.split()
    # comparison
    query = queries.readline()
    tweet = tweets.readline()

mirosval provided an easier answer; use his.


Consider using file.seek:

import sys

with open(sys.argv[1], "r") as queries, open(sys.argv[2], "r") as tweets:
    for query in queries:
        query_words = query.split()
        for tweet in tweets:
            tweet_words = tweet.split()
            for qword in query_words:
                for tword in tweet_words:
                    pass  # Comparison
        tweets.seek(0)  # go back to the start of the file
  • This works, and Thomas explained why it does. Thanks to both of you! Commented Apr 19, 2015 at 8:33
  • @SajalSharma Yup, I think this is the best answer for your purpose. No problem. :) Commented Apr 19, 2015 at 11:39
  • Alternatively, instead of reading the tweets file once per line of the queries file, you can store its lines in a list. That is easier on your hard drive, unless you can't afford the extra RAM.
    – Morb
    Commented Apr 20, 2015 at 6:07

You want to iterate over the second file once for each line of the first file. But look at what actually happens:

  • you open both files
  • you start iterating over the first file
  • you get the first line of the first file
  • you iterate over the second file to the end, so the second file's pointer is at EOF
  • you move on to the second line of the first file
  • the second file's pointer is already at EOF, so the inner loop does nothing and you immediately move on to the next line of the first file
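
The sequence above can be reproduced in a few lines. This is only an illustrative sketch: an `io.StringIO` object stands in for the tweets file, and the names `first_pass`, `second_pass`, and `third_pass` are made up for the demonstration.

```python
import io

tweets = io.StringIO("t1\nt2\n")  # stand-in for the tweets file

# The first loop consumes every line and leaves the position at EOF.
first_pass = [line.strip() for line in tweets]

# A second loop over the same object finds nothing left to read.
second_pass = [line.strip() for line in tweets]

# Moving the position back to the start makes the lines available again.
tweets.seek(0)
third_pass = [line.strip() for line in tweets]

print(first_pass, second_pass, third_pass)
```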

So you have to rewind the second file after each iteration over the first file. There are two ways to do it:

  • load the second file into memory as a list of lines with readlines and iterate over that list. Because it is a list (and not a file), each iteration starts at the first element instead of at the current file position

    import sys

    queries = open(sys.argv[1], "r")
    tweets_file = open(sys.argv[2], "r")
    tweets = tweets_file.readlines()  # tweets is now a list of lines
    for query in queries:
        for tweet in tweets:
            query_words = query.split()
            tweet_words = tweet.split()
            for qword in query_words:
                for tword in tweet_words:
                    pass  # Comparison
    
  • explicitly rewind the file with seek

    import sys

    queries = open(sys.argv[1], "r")
    tweets = open(sys.argv[2], "r")
    for query in queries:
        for tweet in tweets:
            query_words = query.split()
            tweet_words = tweet.split()
            for qword in query_words:
                for tword in tweet_words:
                    pass  # Comparison
        tweets.seek(0)  # explicitly rewind tweets
    

The first solution reads the second file only once but uses more memory. It should be preferred if the second file is small (less than a few hundred MB on recent machines). The second solution uses less memory and should be preferred if the second file is huge, or if you have to save memory for any reason (an embedded system, limiting the footprint of a script, ...).
