What's the simplest way to remove lines from one file that match lines in another file? For example, if I have the following files:

file1.csv:

[email protected]

file2.csv:

1,[email protected],somehash1
2,[email protected],somehash2
3,[email protected],somehash3

As a result I'd like to have file3.csv:

1,[email protected],somehash1
3,[email protected],somehash3

What's the fastest way to solve this task? These files are a few GB in size.

  • Seems too large for anything but a coding solution honestly...
    – soandos
    Commented Jul 25, 2011 at 23:09
  • Do you have enough RAM to hold all the files in RAM at the same time + 2 gigs? If not, you will need code, as no program can even hold them open at the same time.
    – soandos
    Commented Jul 25, 2011 at 23:25
  • That's a good point. While my solution works in theory, memory is probably going to be a limiting factor. Perhaps you can break the files up first? Commented Jul 26, 2011 at 0:03

6 Answers


grep -v -F -f file1.csv file2.csv > file3.csv seems the simplest. But you should do performance tests with smaller files first. (I agree with soandos' comment that such big files might need a dedicated solution.)
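As a way to try this on small files first, here is a minimal sketch. The helper name `remove_matches` and the sample addresses are made up for illustration and do not come from the question. Setting `LC_ALL=C` is optional; it often speeds up grep on plain ASCII data. Note that `-F` matches the fixed string anywhere in the line, so an address that is a substring of another address would also cause a removal.

```shell
# Hypothetical helper: print the lines of $2 that contain none of the
# fixed strings listed (one per line) in $1.
remove_matches() {
    LC_ALL=C grep -v -F -f "$1" "$2"
}

# Made-up sample data standing in for file1.csv and file2.csv.
grep_demo_dir=$(mktemp -d)
printf '%s\n' 'b@example.com' > "$grep_demo_dir/file1.csv"
printf '%s\n' '1,a@example.com,hash1' '2,b@example.com,hash2' \
    '3,c@example.com,hash3' > "$grep_demo_dir/file2.csv"

remove_matches "$grep_demo_dir/file1.csv" "$grep_demo_dir/file2.csv" \
    > "$grep_demo_dir/file3.csv"
cat "$grep_demo_dir/file3.csv"
# prints:
# 1,a@example.com,hash1
# 3,c@example.com,hash3
```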

  • That's how it was solved: cat file2.csv | fgrep -vf file1.csv > file3.csv
    – A B
    Commented Jul 27, 2011 at 6:49
  • Apparently, what you've used is effectively the same method, though it is an example of useless use of cat. You could also use fgrep -vf file1.csv < file2.csv > file3.csv.
    – jankes
    Commented Jul 27, 2011 at 9:04
  • I think you also need the -x option. Commented Oct 26, 2018 at 19:35
awk -F, '
  FILENAME == ARGV[1] {to_remove[$1]=1; next}
  ! ($2 in to_remove) {print}
' file1.csv file2.csv > file3.csv

You have to have enough memory to read in file1 at once.
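To illustrate how this differs from a substring match, here is a small sketch that wraps the same awk program in a function and runs it on made-up data. The name `remove_by_field` and the addresses are assumptions for the example only. Because the lookup is on the whole second field, a near-miss address such as ab@example.com is kept.

```shell
# Hypothetical helper: $1 = list of addresses, $2 = id,address,hash records.
remove_by_field() {
    awk -F, '
        # First file: remember every address to remove.
        FILENAME == ARGV[1] { to_remove[$1] = 1; next }
        # Second file: print only records whose 2nd field was not remembered.
        !($2 in to_remove) { print }
    ' "$1" "$2"
}

# Made-up sample data.
awk_demo_dir=$(mktemp -d)
printf '%s\n' 'b@example.com' > "$awk_demo_dir/file1.csv"
printf '%s\n' '1,a@example.com,hash1' '2,b@example.com,hash2' \
    '3,ab@example.com,hash3' > "$awk_demo_dir/file2.csv"

remove_by_field "$awk_demo_dir/file1.csv" "$awk_demo_dir/file2.csv"
# prints:
# 1,a@example.com,hash1
# 3,ab@example.com,hash3
```

A fixed-string grep on the same input would also drop the third record, since b@example.com is a substring of ab@example.com.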

Here's another option: join

$ join -t , -v 2 -1 1 -2 2 file1.csv file2.csv
[email protected],1,somehash1
[email protected],3,somehash3

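However, the man page warns: "Important: FILE1 and FILE2 must be sorted on the join fields." Factor the cost of sorting into your decision. Note also that join prints the join field first, so the column order differs from file2.csv.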
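If the files are not already sorted, the sorting step could be sketched as follows. The helper name `remove_with_join`, the temporary files, and the sample addresses are assumptions for illustration. The `-o 2.1,2.2,2.3` list restores the original column order and assumes every record has exactly three fields. `LC_ALL=C` is set on both sort and join so that they agree on ordering.

```shell
# Hypothetical helper: $1 = list of addresses, $2 = id,address,hash records.
remove_with_join() {
    keys_sorted=$(mktemp)
    data_sorted=$(mktemp)
    LC_ALL=C sort "$1" > "$keys_sorted"
    # Sort the records on their 2nd comma-separated field (the address).
    LC_ALL=C sort -t , -k 2,2 "$2" > "$data_sorted"
    # -v 2: print only lines of the second file that have no match in the first.
    # -o 2.1,2.2,2.3: output fields in the original order (three fields assumed).
    LC_ALL=C join -t , -v 2 -1 1 -2 2 -o 2.1,2.2,2.3 "$keys_sorted" "$data_sorted"
    rm -f "$keys_sorted" "$data_sorted"
}

# Made-up sample data, deliberately unsorted.
join_demo_dir=$(mktemp -d)
printf '%s\n' 'b@example.com' > "$join_demo_dir/file1.csv"
printf '%s\n' '3,c@example.com,hash3' '1,a@example.com,hash1' \
    '2,b@example.com,hash2' > "$join_demo_dir/file2.csv"

remove_with_join "$join_demo_dir/file1.csv" "$join_demo_dir/file2.csv"
# prints:
# 1,a@example.com,hash1
# 3,c@example.com,hash3
```

The output comes out ordered by address rather than in the original order of file2.csv.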


You could loop over each line in file1, and grep matching lines out of file2?

cp file2.csv file3.csv
cat file1.csv | while read -r line; do
    # ${line%?} drops the last character of the line (see the comments below)
    grep -v "${line%?}" file3.csv > temp.csv
    cat temp.csv > file3.csv
done
rm -f temp.csv


Edit: Tested, seems to work OK. Just make sure you have a trailing newline in file1.

  • The 'line' variable won't have a trailing newline so you don't need to chop off the last char -- that might actually lead to false positives. 'cat temp.csv > file3.csv' is more efficiently written as 'mv temp.csv file3.csv'. But especially, you're processing (reading AND writing) the larger file2 multiple times (once for each line in file1) -- there are approaches that will only pass through file2 once. Commented Jul 26, 2011 at 0:48
  • Hmm, when I was testing it, it wasn't working initially. I added set -x and set -v and saw that a newline was on the end of the string being grep'd for, so I added the %? and it worked... And you are correct that this processes file2 once for each line in file1, but surely the only other approach would be to read file1 once for each line in file2? The OP said the files were 'few gigs each', so I'm not sure how much difference there is between file sizes. Commented Jul 26, 2011 at 1:05

Does file1.csv have to stay unmodified?

sed 's|.*|/^[^,]*,&,/d|' file1.csv > file1.sed
sed -f file1.sed file2.csv > file3.csv

I don't know how much memory it consumes. AFAIK, sed will test every expression in file1.sed against each line of the input (file2.csv).

If the input is sorted, and the patterns are sorted too, you could implement a faster solution.


Make sure file3.csv exists (and it's empty)

echo > file3.csv
diff file1.csv file2.csv | patch file3.csv

Et voilà!

  • The lines being removed from file2 are not exact matches of lines from file1. The example in the question shows lines from file1 being a subset of a line from file2, so this won't work will it? Commented Jul 26, 2011 at 5:43
  • Right, it will not work. Commented Jul 27, 2011 at 17:40

This doesn't need a 'coded solution.' If you sort the lines first, a single merge-style pass over both files is enough, rather than comparing every line of one file against every line of the other.

See this answer for better performance, both in terms of CPU time and memory:

