Improving CSV filtering with Python using regex

Question

Consider a .csv file that contains video names and view counts, like so:

"There are happy days","1204923"
"Beware of ignorance","589636"
"Bloody Halls MV","258933"
"Dream Theater - As I Am - Live in...","89526"

The code I wrote filters rows in the CSV against a list of excluded words: if the name of a video contains any word in the list, the row is not saved. Here is the code:

exclude_list = ["mv","live","cover","remix","bootleg"]

data_set = []

with open('video_2013-2016.csv', 'rb') as f:

    reader = csv.reader(f)

    for row in reader:
        # Only record videos with at least 100 views
        if int(row[1]) > 99:

            # A test list that holds whether the regex passes or fails
            test_list = []

            for ex in exclude_list:
                regex = re.compile(".*("+ex+").*")

                if regex.search(row[0]):
                    test_list.append(False)
                else:
                    test_list.append(True)

            # Depending on the results, see if the row is worthy of saving
            if all(result for result in test_list):
                data_set.append(row)

I know the code above is quite inefficient. I've seen list comprehensions that do a better job, but I don't quite understand how a list comprehension would work in this case. I also dislike having to compile the regex over and over for every row; it feels like a waste of resources.

200_success · Accepted Answer · 2016-07-19 23:07:19Z

The CSV file contains text in some text encoding, and should not be opened in binary mode.
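As a minimal sketch of text-mode reading, assuming Python 3 and a UTF-8 encoded file (both are assumptions, since the question states neither), the file can be opened with an explicit encoding and `newline=""`, which the `csv` module documentation recommends. The sample file and its path are made up here so the snippet runs on its own:

```python
import csv
import os
import tempfile

# Write a tiny sample file so the sketch is self-contained (hypothetical data and path).
sample = '"Bloody Halls MV","258933"\n"Beware of ignorance","589636"\n'
path = os.path.join(tempfile.mkdtemp(), "sample.csv")
with open(path, "w", encoding="utf-8", newline="") as f:
    f.write(sample)

# Text mode with an explicit encoding (assumed UTF-8); newline="" leaves
# line-ending handling to the csv module.
with open(path, encoding="utf-8", newline="") as f:
    rows = list(csv.reader(f))

print(rows)
```

Each row comes back as a list of strings, so the view count still needs `int()` before any numeric comparison.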

You should construct one regular expression to find any of the forbidden words. It appears that you intended to do a case-insensitive search, but didn't write the code that way. When constructing the regex, you should escape the strings, in case they contain any regex metacharacters. You don't need .*, since re.search() will look for the pattern anywhere in the string, nor do you need capturing parentheses.
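To illustrate why escaping matters, here is a small sketch. The excluded word `"feat."` and the sample titles are hypothetical, chosen only because the unescaped dot would match any character:

```python
import re

word = "feat."  # hypothetical excluded word containing a regex metacharacter

# Unescaped: "." matches any character, so "feate" inside "Defeated" matches.
unescaped = re.compile(word, re.I)

# Escaped: only the literal text "feat." matches, in any letter case.
escaped = re.compile(re.escape(word), re.I)

false_positive = unescaped.search("Defeated") is not None
no_false_positive = escaped.search("Defeated") is None
real_hit = escaped.search("Some Song FEAT. Someone") is not None

print(false_positive, no_false_positive, real_hit)
```

Because `search()` scans the whole string, neither pattern needs a leading or trailing `.*`.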

If your comment says 100, then your code should have 100 rather than 99.

I suggest doing a destructuring assignment title, view_count = row to make it clear what each column represents.

import csv
import re

with open('video_2013-2016.csv') as f:
    # One case-insensitive pattern that matches any excluded word
    # (exclude_list and data_set are defined as in the question)
    forbidden = re.compile('|'.join(re.escape(w) for w in exclude_list), re.I)
    for row in csv.reader(f):
        # Only record videos with at least 100 views and none of the bad words
        title, view_count = row
        if int(view_count) >= 100 and not forbidden.search(title):
            data_set.append(row)
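Note that the pattern above matches substrings, so "cover" would also reject a title containing "Discover". If whole-word matching is what you want (an assumption; the question only says the name "contains a word"), a possible variant wraps the alternation in `\b` word boundaries. The names `forbidden_words` and `is_allowed` are illustrative only:

```python
import re

exclude_list = ["mv", "live", "cover", "remix", "bootleg"]

# \b on both sides restricts matches to whole words; the non-capturing
# group makes the boundaries apply to every alternative.
forbidden_words = re.compile(
    r"\b(?:" + "|".join(re.escape(w) for w in exclude_list) + r")\b",
    re.I,
)

def is_allowed(title):
    """Return True when the title contains none of the excluded whole words."""
    return forbidden_words.search(title) is None

for title in ["Bloody Halls MV", "Discover the sea", "Dream Theater - As I Am - Live in..."]:
    print(title, is_allowed(title))
```

Whether substring or whole-word matching is the right choice depends on the data; the substring version is stricter and rejects more titles.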

I can't believe I didn't think of regex OR method! Absolutely beautiful solution -> forbidden = re.compile('|'.join(re.escape(w) for w in exclude_list), re.I). I was very worried that I'm being inefficient for not using list comprehension, but now I see that I really didn't need any list comprehension. — Adib, Commented Jul 19, 2016 at 22:55
From 1.8s to 0.1s, you're a beast! Thank you very much for your help! — Adib, Commented Jul 19, 2016 at 22:58
