
I have two CSV files that I am working with. One is massive, with about 200,000 rows. The other is much smaller, with about 12,000 rows. Both follow the same format: names and email addresses (everything is legit here, no worries). Basically, I'm trying to get a subset of the second list by removing all values that already exist in the larger file.

So, List A has ~200k rows, and List B has ~12k. These lists overlap a bit, and I'd like to remove all entries from List B if they also exist in List A, leaving me with only the new and unique values in List B. I've got a few tools at my disposal that I can use. Open Office is loaded on this machine, along with MySQL (queries are alright).

What's the easiest way to create a third CSV containing only the entries of List B that are not in List A?

2 Answers


From a Linux/Unix/Mac command-line:

sort file1 file2 | uniq -d | sort file2 - | uniq -u

Explanation:

This returns only those lines in file2 that do not exactly match any line in file1.

Steps:

  1. sort file1 file2: Concatenates file1 and file2 together, sorts them, and prints them to stdout. Note that duplicates will be listed in adjacent lines (twice in a row) after sorting.
  2. uniq -d: Takes the output of the previous command and prints only the lines that are duplicates.
  3. sort file2 -: Concatenates the original file2 and the output of the previous command (stdout, which is represented by the filename "-" hyphen), and prints the result to stdout. In addition, any items in file2 which were also in file1 will be duplicated (listed twice in a row) in the output.
  4. uniq -u: Takes the output of the previous command and prints only items which are not duplicated (in other words, prints only items which are not listed twice in a row).
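The steps above can be checked on a tiny example (hypothetical sample data; the real files would hold e-mail lines instead):

```shell
# Two small sample files standing in for file1 and file2.
printf 'alice@example.com\nbob@example.com\ncarol@example.com\n' > file1
printf 'bob@example.com\ndave@example.com\n' > file2

# Lines of file2 that do not appear anywhere in file1.
sort file1 file2 | uniq -d | sort file2 - | uniq -u
# dave@example.com
```

Here `bob@example.com` is the overlap: it survives `uniq -d`, gets doubled by the second `sort`, and is then dropped by `uniq -u`, leaving only the line unique to file2.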

Possible gotchas:

This assumes that any given line in file1 exactly matches a corresponding line in file2. If, for example, file1 and file2 had the same e-mail but with different capitalization; or if file1 had a name as "Jon Sampson" while file2 had the same e-mail address with the name "Jonathan Sampson", they would not be considered duplicates.

You could control for this by pre-processing the file to remove everything except the e-mail address, and further, lowercase the e-mail address. The Unix commands cut and tr could be helpful in this case. Or you could switch to SQL for more complex scenarios.
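A minimal sketch of that pre-processing, assuming each line has the form "name,email" (the sample names and addresses are made up):

```shell
# Hypothetical sample data in "name,email" format.
printf 'Jon Sampson,JON@example.com\n' > file1.csv
printf 'Jonathan Sampson,jon@example.com\nNew Person,new@example.com\n' > file2.csv

# Keep only the e-mail (field 2) and lowercase it, so comparisons
# ignore names and capitalization differences.
cut -d, -f2 file1.csv | tr '[:upper:]' '[:lower:]' | sort > emails1
cut -d, -f2 file2.csv | tr '[:upper:]' '[:lower:]' | sort > emails2

# Same pipeline as before, now on the normalized e-mail lists.
sort emails1 emails2 | uniq -d | sort emails2 - | uniq -u
# new@example.com
```

Note this gives you back bare e-mail addresses, not the original CSV rows; if you need the full rows you'd have to join the result back against file2.csv.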

File size:

A file of 200,000 lines and one of 12,000 lines is not really that big. I generated files of similar size using the /usr/share/dict/words file on my MacBook Pro and tested the above command; it took less than 5 seconds to run.


Nate has given you a really good answer, but there is a shorter way from a Linux/Unix/Mac command-line:

join -t# -v2 <(sort file1.csv) <(sort file2.csv) > result.csv

Caveats:

  • The original question is about joining whole lines. The only way I can think of
    to suppress join's field splitting is to define the field delimiter as a character that is not used in either file (# in my example). Ugly, I know.

  • Input files must be sorted on the join field. You can do this in one line (see above) using process substitution (`<(...)`), but that only works in bash. Other shells have different syntax for this.

If your input files are sorted:

join -t# -v2 file1.csv file2.csv > result.csv
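A quick check with made-up data (the `#` delimiter must not occur anywhere in the files, as noted above):

```shell
# Hypothetical pre-sorted sample files, one e-mail per line.
printf 'alice@example.com\nbob@example.com\n' | sort > file1.csv
printf 'bob@example.com\ndave@example.com\n' | sort > file2.csv

# -v2 prints only the lines of file2.csv with no match in file1.csv;
# -t'#' makes each whole line a single join field.
join -t'#' -v2 file1.csv file2.csv
# dave@example.com
```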

For Windows, there is a native port of join.

  • 1
    Wow! I did not know about join. You learn something new every day.
    – Nate
    Commented May 17, 2010 at 20:25
  • How would I pass the output into a third CSV?
    – Sampson
    Commented May 17, 2010 at 20:27
  • Hmmm... this seems to be giving me lines from file2 which match lines from file1. He needs to get lines that do not match. Am I missing something?
    – Nate
    Commented May 17, 2010 at 20:28
  • Sorry, -a2 is a right outer join. If you only want the unpairable lines it's -v. Commented May 17, 2010 at 20:31
  • 1
    Looks great; I'll test this out momentarily.
    – Sampson
    Commented May 17, 2010 at 21:12
