
I have two CSV files that I am working with. One is massive, with about 200,000 rows. The other is much smaller, with about 12,000 rows. Both follow the same format: names and email addresses (everything is legit here, no worries). Basically, I'm trying to get a subset of the second list by removing all values that already exist in the larger file.

So, List A has ~200k rows, and List B has ~12k. These lists overlap a bit, and I'd like to remove all entries from List B if they also exist in List A, leaving me with only the new and unique values in List B. I've got a few tools at my disposal that I can use. Open Office is loaded on this machine, along with MySQL (queries are alright).

What's the easiest way to create a third CSV containing only the entries of List B that are not in List A?

2 Answers


From a Linux/Unix/Mac command-line:

sort file1 file2 | uniq -d | sort file2 - | uniq -u

Explanation:

This returns only those lines in file2 that do not exactly match any line in file1.

Steps:

  1. sort file1 file2: Concatenates file1 and file2 together, sorts them, and prints them to stdout. Note that duplicates will be listed in adjacent lines (twice in a row) after sorting.
  2. uniq -d: Takes the output of the previous command and prints only the lines that are duplicates.
  3. sort file2 -: Concatenates the original file2 and the output of the previous command (stdout, which is represented by the filename "-" hyphen), and prints the result to stdout. In addition, any items in file2 which were also in file1 will be duplicated (listed twice in a row) in the output.
  4. uniq -u: Takes the output of the previous command and prints only items which are not duplicated (in other words, prints only items which are not listed twice in a row).
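The steps above can be checked on a tiny example (hypothetical sample data; the real files would hold e-mail lines instead):

```shell
# Two small sample files standing in for file1 and file2.
printf 'alice@example.com\nbob@example.com\ncarol@example.com\n' > file1
printf 'bob@example.com\ndave@example.com\n' > file2

# Lines of file2 that do not appear anywhere in file1.
sort file1 file2 | uniq -d | sort file2 - | uniq -u
# dave@example.com
```

Here `bob@example.com` is the overlap: it survives `uniq -d`, gets doubled by the second `sort`, and is then dropped by `uniq -u`, leaving only the line unique to file2.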

Possible gotchas:

This assumes that any given line in file1 exactly matches a corresponding line in file2. If, for example, file1 and file2 had the same e-mail but with different capitalization; or if file1 had a name as "Jon Sampson" while file2 had the same e-mail address with the name "Jonathan Sampson", they would not be considered duplicates.

You could control for this by pre-processing the file to remove everything except the e-mail address, and further, lowercase the e-mail address. The Unix commands cut and tr could be helpful in this case. Or you could switch to SQL for more complex scenarios.
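A minimal sketch of that pre-processing, assuming each line has the form "name,email" (the sample names and addresses are made up):

```shell
# Hypothetical sample data in "name,email" format.
printf 'Jon Sampson,JON@example.com\n' > file1.csv
printf 'Jonathan Sampson,jon@example.com\nNew Person,new@example.com\n' > file2.csv

# Keep only the e-mail (field 2) and lowercase it, so comparisons
# ignore names and capitalization differences.
cut -d, -f2 file1.csv | tr '[:upper:]' '[:lower:]' | sort > emails1
cut -d, -f2 file2.csv | tr '[:upper:]' '[:lower:]' | sort > emails2

# Same pipeline as before, now on the normalized e-mail lists.
sort emails1 emails2 | uniq -d | sort emails2 - | uniq -u
# new@example.com
```

Note this gives you back bare e-mail addresses, not the original CSV rows; if you need the full rows you'd have to join the result back against file2.csv.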

File size:

A file of 200,000 lines and one of 12,000 lines is not really that big. I generated files of similar size using the /usr/share/dict/words file on my MacBook Pro and tested the above command; it took less than 5 seconds to run.


Nate has given you a really good answer, but there is a shorter way from a Linux/Unix/Mac command-line:

join -t# -v2 <(sort file1.csv) <(sort file2.csv) > result.csv

Caveats:

  • The original question is about joining whole lines. The only way I can think of
    to suppress join's field splitting is to define the field delimiter as a character that is not used in either file (# in my example). Ugly, I know.

  • Input files must be sorted on the join field. You can do this in one line (see above) using process substitution (`<(...)`), but that only works in bash. Other shells have different syntax for this.

If your input files are sorted:

join -t# -v2 file1.csv file2.csv > result.csv
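A quick check with made-up data (the `#` delimiter must not occur anywhere in the files, as noted above):

```shell
# Hypothetical pre-sorted sample files, one e-mail per line.
printf 'alice@example.com\nbob@example.com\n' | sort > file1.csv
printf 'bob@example.com\ndave@example.com\n' | sort > file2.csv

# -v2 prints only the lines of file2.csv with no match in file1.csv;
# -t'#' makes each whole line a single join field.
join -t'#' -v2 file1.csv file2.csv
# dave@example.com
```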

For Windows, there is a native port of join.

  • 1
    Wow! I did not know about join. You learn something new every day.
    – Nate
    Commented May 17, 2010 at 20:25
  • How would I pass the output into a third CSV?
    – Sampson
    Commented May 17, 2010 at 20:27
  • Hmmm... this seems to be giving me lines from file2 which match lines from file1. He needs to get lines that do not match. Am I missing something?
    – Nate
    Commented May 17, 2010 at 20:28
  • Sorry, -a2 is a right outer join. If you only want the unpairable lines it's -v. Commented May 17, 2010 at 20:31
  • 1
    Looks great; I'll test this out momentarily.
    – Sampson
    Commented May 17, 2010 at 21:12
