From a Linux/Unix/Mac command-line:
sort file1 file2 | uniq -d | sort file2 - | uniq -u
Explanation:
This returns only those lines in file2 that do not exactly match any line in file1.
Steps:
sort file1 file2
: Concatenates file1 and file2 together, sorts them, and prints them to stdout. Note that duplicates will be listed in adjacent lines (twice in a row) after sorting.
uniq -d
: Takes the output of the previous command and prints one copy of each line that appears more than once (i.e., the lines common to both files).
sort file2 -
: Concatenates the original file2 with the output of the previous command (read from stdin, which is represented by the filename "-" hyphen), sorts the result, and prints it to stdout. Any line in file2 that was also in file1 will now be duplicated (listed twice in a row) in the output.
uniq -u
: Takes the output of the previous command and prints only items which are not duplicated (in other words, prints only items which are not listed twice in a row).
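The steps above can be demonstrated end-to-end with two small sample files (the addresses here are made up for illustration):

```shell
# Two sample files: file2 shares one line with file1.
printf 'alice@example.com\nbob@example.com\ncarol@example.com\n' > file1
printf 'bob@example.com\ndave@example.com\n' > file2

# Lines in file2 that do not appear in file1:
sort file1 file2 | uniq -d | sort file2 - | uniq -u
# prints: dave@example.com
```

The shared line (bob@example.com) is duplicated by the first `sort`, caught by `uniq -d`, duplicated again against file2, and finally dropped by `uniq -u`, leaving only file2's unique line.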
Possible gotchas:
This assumes that any given line in file1 exactly matches the corresponding line in file2. If, for example, the two files contained the same e-mail address but with different capitalization, or file1 listed a name as "Jon Sampson" while file2 paired the same e-mail address with the name "Jonathan Sampson", those lines would not be considered duplicates.
You could control for this by pre-processing the files to remove everything except the e-mail address and, further, to lowercase it. The Unix commands cut and tr could be helpful here. Or you could switch to SQL for more complex scenarios.
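As a sketch of that pre-processing, assuming comma-separated lines with the e-mail address in the second field (the field layout and sample data here are hypothetical):

```shell
# Hypothetical input: "Name,Email" lines, with inconsistent case.
printf 'Jonathan Sampson,JON@example.com\n' > file1
printf 'Jon Sampson,jon@example.com\nDave Example,dave@example.com\n' > file2

# Keep only the e-mail field (cut) and lowercase it (tr), then sort.
cut -d, -f2 file1 | tr '[:upper:]' '[:lower:]' | sort > file1.norm
cut -d, -f2 file2 | tr '[:upper:]' '[:lower:]' | sort > file2.norm

# Same pipeline as before, on the normalized files:
sort file1.norm file2.norm | uniq -d | sort file2.norm - | uniq -u
# prints: dave@example.com
```

After normalization, JON@example.com and jon@example.com compare equal, so only the genuinely new address survives.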
File size:
A file of 200,000 lines and one of 12,000 lines is not really that big. I generated files of similar size using the /usr/share/dict/words file on my MacBook Pro and tested the above command; it took less than 5 seconds to run.
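A quick way to reproduce a test at that scale (seq is used here for portability; drawing lines from /usr/share/dict/words works the same way where that file exists):

```shell
# Synthetic files of the sizes mentioned:
seq 1 200000      > big.txt    # stands in for the 200,000-line file
seq 190001 202000 > small.txt  # 12,000 lines, 10,000 of them shared with big.txt

# Count the lines unique to small.txt:
sort big.txt small.txt | uniq -d | sort small.txt - | uniq -u | wc -l
# prints 2000 (lines 200001..202000)
```

Prefix the pipeline with `time` in bash to measure it; on modern hardware it completes in a few seconds at most.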