7

I get a couple of "grep: write error" messages when I run this code. What am I missing?

This is only part of it:

    while d <= datetime.datetime(year, month, daysInMonth[month]):
        day = d.strftime("%Y%m%d")
        print day
        results = [day]
        first=subprocess.Popen("grep -Eliw 'Algeria|Bahrain' "+ monthDir +"/"+day+"*.txt | grep -Eliw 'Protest|protesters' "+ monthDir +"/"+day+"*.txt", shell=True, stdout=subprocess.PIPE, )
        output1=first.communicate()[0]
        d += delta
        day = d.strftime("%Y%m%d")
        second=subprocess.Popen("grep -Eliw 'Algeria|Bahrain' "+ monthDir +"/"+day+"*.txt | grep -Eliw 'Protest|protesters' "+ monthDir +"/"+day+"*.txt", shell=True,  stdout=subprocess.PIPE, )
        output2=second.communicate()[0]
        articleList = (output1.split('\n'))
        articleList2 = (output2.split('\n'))
        results.append( len(articleList)+len(articleList2))
        w.writerow(tuple(results))
        d += delta
3
  • I can't figure out what you're trying to do. When you give filename arguments to grep it doesn't read from stdin, so why are you piping the output of one grep process to the second one?
    – Barmar
    Commented Mar 31, 2013 at 3:32
  • I am filtering for the files that contain the keyword Algeria OR Bahrain and also protest OR protesters. It's actually a little more complicated; I just simplified it for this question. I want to get all the files that contain one of the keywords in list1 and one of the keywords in list2. Commented Mar 31, 2013 at 3:44
  • Any particular reason for not using Python's regular expression library, re? It would save you calling out to grep. Commented Mar 31, 2013 at 9:11

4 Answers

12

When you do

A | B

in a shell, process A's output is piped into process B as input. If process B shuts down before reading all of process A's output (e.g. because it found what it was looking for, which is the function of the -l option), then process A may complain that its output pipe was prematurely closed.

These errors are basically harmless, and you can work around them by redirecting stderr in the subprocesses to /dev/null.
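
As a rough illustration of that workaround, the sketch below wraps Popen so that the child's stderr goes to the null device while stdout is still captured. The helper name run_quietly is hypothetical (it is not part of the question's code), and the example assumes a POSIX shell is available for shell=True.

```python
import os
import subprocess

def run_quietly(command):
    """Run a shell command, return its stdout as text, and discard its stderr."""
    # os.devnull is "/dev/null" on POSIX systems.
    with open(os.devnull, "w") as devnull:
        proc = subprocess.Popen(command, shell=True,
                                stdout=subprocess.PIPE, stderr=devnull)
        # communicate() reads stdout until EOF and waits for the process.
        output = proc.communicate()[0]
    return output.decode()
```

In the question's loop this would replace the Popen/communicate pair, with the same grep command string passed as the argument. On Python 3.3 and later, stderr=subprocess.DEVNULL achieves the same thing without opening the file by hand.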

A better approach, though, may simply be to use Python's powerful regex capabilities to read the files:

import glob
import re

def fileContains(fn, pat):
    # Return True as soon as any line of the file matches the pattern.
    with open(fn) as f:
        for line in f:
            if re.search(pat, line):
                return True
    return False

first = []
for file in glob.glob(monthDir +"/"+day+"*.txt"):
    # Keep only files that match both keyword groups.
    if fileContains(file, 'Algeria|Bahrain') and fileContains(file, 'Protest|protesters'):
        first.append(file)
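
The grep calls in the question use -i (ignore case) and -w (whole words), which a plain re.search on 'Algeria|Bahrain' does not reproduce. A minimal sketch of how those two flags could be approximated with re.IGNORECASE and \b word boundaries follows; file_has_word and files_matching_all are hypothetical helper names, and \b is only an approximation of grep's notion of a word.

```python
import glob
import re

def file_has_word(path, words):
    """True if any line of the file contains one of the words as a whole word, ignoring case."""
    # \b...\b requires a word boundary on both sides, roughly like grep -w.
    pattern = re.compile(r"\b(?:" + "|".join(map(re.escape, words)) + r")\b",
                         re.IGNORECASE)
    with open(path) as f:
        return any(pattern.search(line) for line in f)

def files_matching_all(file_glob, *word_lists):
    """Sorted paths matching file_glob that contain at least one word from every list."""
    return sorted(path for path in glob.glob(file_glob)
                  if all(file_has_word(path, words) for words in word_lists))
```

For one day's files this could be called as files_matching_all(monthDir + "/" + day + "*.txt", ['Algeria', 'Bahrain'], ['Protest', 'protesters']), and len() of the result gives the article count directly.
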
2

To find the files matching two patterns, the command structure should be:

grep -l pattern1 $(grep -l pattern2 files)

$(command) substitutes the output of the command into the command line.

So your script should be:

first=subprocess.Popen("grep -Eliw 'Algeria|Bahrain' $(grep -Eliw 'Protest|protesters' "+ monthDir +"/"+day+"*.txt)", shell=True, stdout=subprocess.PIPE, )

and similarly for second.
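
One caveat with $(...): if the inner grep matches no files, the substitution expands to nothing and the outer grep, left without file arguments, reads standard input instead. The same two-step filter can be sketched in Python without a shell, which makes that case easy to guard against. The names grep_l and files_with_both are hypothetical, and the sketch assumes a grep binary on the PATH.

```python
import glob
import subprocess

def grep_l(pattern, paths):
    """Return the subset of paths that `grep -Eliw` reports as matching pattern."""
    if not paths:
        # With no file arguments grep would read stdin and wait, so skip the call.
        return []
    proc = subprocess.Popen(["grep", "-Eliw", "-e", pattern, "--"] + list(paths),
                            stdout=subprocess.PIPE)
    output = proc.communicate()[0].decode()
    # splitlines() does not leave a trailing empty entry the way split('\n') does.
    return output.splitlines()

def files_with_both(file_glob, pattern1, pattern2):
    """Files matching file_glob that contain a match for both patterns."""
    return grep_l(pattern1, grep_l(pattern2, glob.glob(file_glob)))
```

Passing the arguments as a list also avoids quoting problems when monthDir contains spaces.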

2
  • It didn't work for me. Can you explain though why I get broken pipe in some cases and some cases I don't. What does the error mean? Commented Mar 31, 2013 at 4:35
  • Broken pipe means that the command tried to write to the pipe, but the read end was closed. I don't think it should happen when you use first.communicate(), since that reads until EOF.
    – Barmar
    Commented Mar 31, 2013 at 4:40
1

If you are just looking for whole words, you could use the list's count() method:

# assuming names is a list of filenames
for fn in names:
    with open(fn) as infile:
        text = infile.read().lower()
    # remove punctuation
    text = text.replace(',', '')
    text = text.replace('.', '')
    words = text.split()
    print "Algeria:", words.count('algeria')
    print "Bahrain:", words.count('bahrain')
    print "protesters:", words.count('protesters')
    print "protest:", words.count('protest')

If you want more powerful filtering, use re.
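
A minimal sketch of what the re-based variant might look like: re.findall pulls out the words directly, so punctuation does not have to be stripped character by character. The function name count_words is hypothetical.

```python
import re
from collections import Counter

def count_words(text, words):
    """Count case-insensitive whole-word occurrences of each word in text."""
    # \w+ grabs runs of letters, digits and underscores, skipping punctuation.
    tokens = re.findall(r"\w+", text.lower())
    counts = Counter(tokens)
    # Counter returns 0 for words that never appear.
    return {word: counts[word.lower()] for word in words}
```

A file counts as a hit for a keyword list when any of its words has a non-zero count.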

0

Add a stderr argument to the Popen call. Which value is available depends on the Python version; the one below also works on Python versions before 3. Note that stderr=subprocess.STDOUT merges the error messages into the captured output rather than discarding them.

first=subprocess.Popen("grep -Eliw 'Algeria|Bahrain' "+ monthDir +"/"+day+"*.txt | grep -Eliw 'Protest|protesters' "+ monthDir +"/"+day+"*.txt", shell=True, stdout=subprocess.PIPE, stderr = subprocess.STDOUT)
