240

I'm sure I once found a shell command which could print the common lines from two or more files. What is its name?

It was much simpler than diff.

3
  • 7
    The answers to this question aren't necessarily what everyone will want, since comm requires sorted input files. If you just want the lines the files have in common, it's great. But if you want what I would call an "anti-diff", comm doesn't do the job. Commented Apr 20, 2012 at 14:15
  • @RobertP.Goldman is there a way to get the common lines between two files when file1 contains a partial pattern like pr-123-xy-45 and file2 contains ec11_orop_pr-123-xy-45.gz? I need file3 containing ec11_orop_pr-123-xy-45.gz. Commented Nov 2, 2015 at 7:20
  • 1
    See this for sorting text files line by line Commented Jul 25, 2018 at 7:29

12 Answers

296

The command you are seeking is comm, e.g.:

comm -12 1.sorted.txt 2.sorted.txt

Here:

-1 : suppress column 1 (lines unique to 1.sorted.txt)

-2 : suppress column 2 (lines unique to 2.sorted.txt)
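
To see what the three columns look like before any suppression, here is a small sketch (the file contents are just for illustration):

$ printf 'apple\nbanana\ncherry\n' > 1.sorted.txt
$ printf 'banana\ncherry\ndate\n' > 2.sorted.txt
$ comm 1.sorted.txt 2.sorted.txt
apple
        banana
        cherry
    date
$ comm -12 1.sorted.txt 2.sorted.txt
banana
cherry

Likewise, comm -23 would print only apple (unique to the first file) and comm -13 only date (unique to the second).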

10
  • 28
    Typical usage: comm -12 1.sorted.txt 2.sorted.txt Commented Jun 11, 2013 at 15:54
  • 56
    While comm needs sorted files, you may use grep -f file1 file2 to get the common lines of both files.
    – ferdy
    Commented Jan 20, 2015 at 17:29
  • 5
    @ferdy (Repeating my comment from your answer, as yours is essentially a repeated answer posted as a comment) grep does some weird things you might not expect. Specifically, everything in 1.txt will be interpreted as a regular expression and not a plain string. Also, any blank line in 1.txt will match all lines in 2.txt. So grep will only work in very specific situations. You'd at least want to use fgrep (or grep -F), but the blank-line thing is probably going to wreak havoc on this process. Commented Jul 22, 2015 at 14:08
  • 17
    See ferdy's answer below, and Christopher Schultz's and my comments on it. TL;DR — use grep -F -x -f file1 file2. Commented Jul 22, 2015 at 14:31
  • 1
    @bapors: I've provided a self-answered Q&A as How to get the output from the comm command into 3 separate files? The answer was much too big to fit comfortably here. Commented Sep 21, 2017 at 5:56
75

To easily apply the comm command to unsorted files, use Bash's process substitution:

$ bash --version
GNU bash, version 3.2.51(1)-release
Copyright (C) 2007 Free Software Foundation, Inc.
$ cat > abc
123
567
132
$ cat > def
132
777
321

So the files abc and def have one line in common, the one with "132". Using comm on unsorted files:

$ comm abc def
123
    132
567
132
    777
    321
$ comm -12 abc def # No output! The common line is not found
$

The last command produced no output; the common line was not found.

Now use comm on sorted files, sorting the files with process substitution:

$ comm <( sort abc ) <( sort def )
123
            132
    321
567
    777
$ comm -12 <( sort abc ) <( sort def )
132

Now we get the 132 line!

2
  • 2
    so... sort abc > abc.sorted, sort def > def.sorted and then comm -12 abc.sorted def.sorted ? Commented Nov 1, 2017 at 1:28
  • 2
    @NikanaReklawyks And then remember to remove the temporary files afterwards, and cope with cleaning up in case of an error. In many scenarios, the process substitution will also be a lot quicker because you can avoid the disk I/O as long as the results fit into memory.
    – tripleee
    Commented Dec 8, 2017 at 5:41
41

To complement the Perl one-liner, here's its awk equivalent:

awk 'NR==FNR{arr[$0];next} $0 in arr' file1 file2

This reads all lines from file1 into the array arr[], and then checks, for each line in file2, whether it already exists in the array (i.e. whether it occurred in file1). Matching lines are printed in the order in which they appear in file2. Note that the lookup in arr uses the entire line from file2 as the array index, so it will only report exact matches on entire lines.
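
A quick sketch of its behaviour on unsorted input (the file contents here are just illustrative):

$ printf 'zebra\napple\nmango\n' > file1
$ printf 'mango\nzebra\npeach\n' > file2
$ awk 'NR==FNR{arr[$0];next} $0 in arr' file1 file2
mango
zebra

No sorting is required, and the output follows the order of file2.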

3
  • 2
    THIS(!) is the correct answer. None of the others can be made to work generally (I haven't tried the perl ones, because). Thanks a million, Ms.
    – entonio
    Commented May 30, 2016 at 9:48
  • 1
    Preserving the order when displaying the common lines can be really useful in some cases that would exclude comm because of that.
    – tuxayo
    Commented Jul 13, 2016 at 13:07
  • 1
    In case anybody wants to do the same thing based on a certain column but doesn't know awk, just replace both $0's with, for example, $5's for column 5, so you get the lines shared between the 2 files that have the same words in column 5 Commented Jan 31, 2019 at 15:15
25

Maybe you mean comm?

Compare sorted files FILE1 and FILE2 line by line.

With no options, produce three-column output. Column one contains lines unique to FILE1, column two contains lines unique to FILE2, and column three contains lines common to both files.

The secret to finding this kind of information is the info pages. For GNU programs, they are much more detailed than the man pages. Try info coreutils and it will list all the small useful utilities.
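
For example, assuming GNU coreutils, this should jump straight to the relevant node:

info coreutils 'comm invocation'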

22

While

fgrep -v -f 1.txt 2.txt > 3.txt

gives you the difference of the two files (what is in 2.txt but not in 1.txt), you could easily do

fgrep -f 1.txt 2.txt > 3.txt

to collect all common lines, which should provide an easy solution to your problem. If you have sorted files, you should use comm nonetheless. Regards!

Note: You can use grep -F instead of fgrep.
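
Combining this with the fixes suggested in the comments below gives a safer variant (a sketch; -F treats each line of 1.txt as a fixed string rather than a regular expression, and -x matches whole lines only):

grep -F -x -f 1.txt 2.txt > 3.txt

This avoids the substring and regex pitfalls described in the comments, and with -x a blank line in 1.txt only matches blank lines in 2.txt.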

4
  • 4
    grep does some weird things you might not expect. Specifically, everything in 1.txt will be interpreted as a regular expression and not a plain string. Also, any blank line in 1.txt will match all lines in 2.txt. So this will only work in very specific situations. Commented Jul 22, 2015 at 14:05
  • 16
    @ChristopherSchultz: It's possible to upgrade this answer to work better using POSIX grep notations, which are supported by the grep found on most modern Unix variants. Add -F (or use fgrep) to suppress regular expressions. Add -x (for exact) to match only whole lines. Commented Jul 22, 2015 at 14:20
  • Why should we use comm for sorted files?
    – Ulysse BN
    Commented Apr 24, 2017 at 3:23
  • 2
    @UlysseBN comm can work with arbitrarily large files as long as they are sorted because it only ever needs to hold three lines in memory (I'm guessing GNU comm would even know to keep just a prefix if the lines are really long). The grep solution needs to keep all the search expressions in memory.
    – tripleee
    Commented Dec 8, 2017 at 5:44
15

If the two files are not sorted yet, you can use:

comm -12 <(sort a.txt) <(sort b.txt)

and it will work, avoiding the error message comm: file 2 is not in sorted order when doing comm -12 a.txt b.txt.

4
  • You're right, but this is essentially repeating another answer, which really doesn't provide any benefit. If you decide to answer an older question that has well established and correct answers, adding a new answer late in the day may not get you any credit. If you have some distinctive new information, or you're convinced the other answers are all wrong, by all means add a new answer, but 'yet another answer' giving the same basic information a long time after the question was asked usually won't earn you much credit. Commented Sep 21, 2017 at 6:47
  • I didn't even see this answer @JonathanLeffler because this part was at the very end of the other answer, mixed with other elements before it. While the other answer is more precise, the benefit of mine, I think, is that someone who wants a quick solution only has 2 lines to read. Sometimes we're looking for a detailed answer, and sometimes we are in a hurry and a quick-to-read, ready-to-paste answer is fine.
    – Basj
    Commented Sep 21, 2017 at 10:28
  • Also I don't care about credit / rep, I didn't post for this purpose.
    – Basj
    Commented Sep 21, 2017 at 10:35
  • 2
    Notice also that the process substitution syntax <(command) is not portable to POSIX shell, though it works in Bash and some others.
    – tripleee
    Commented Dec 8, 2017 at 5:37
10
perl -ne 'print if ($seen{$_} .= @ARGV) =~ /10$/'  file1 file2
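
In case the one-liner is opaque, here it is spelled out with comments (the same program, just reformatted; the trick relies on @ARGV shrinking as the files are consumed):

perl -ne '
    $seen{$_} .= @ARGV;           # appends "1" while reading file1 (one file name left), "0" while reading file2
    print if $seen{$_} =~ /10$/;  # ends in "10": seen in file1, and this is its first occurrence in file2
' file1 file2

Because only the first occurrence in file2 makes the string end in "10", duplicate lines in file2 are printed once.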
4
  • 1
    this works better than the comm command, as it searches for each line of file1 in file2, whereas comm will only compare whether line n in file1 is equal to line n in file2.
    – teriiehina
    Commented Oct 11, 2014 at 12:32
  • 1
    @teriiehina: No; comm does not simply compare line N in file1 with line N in file2. It can perfectly well manage a series of lines inserted in either file (which is equivalent to deleting a series of lines from the other file, of course). It merely requires the inputs to be in sorted order. Commented Jul 22, 2015 at 14:24
  • Better than the comm answers if one wants to keep the order. Better than the awk answer if one doesn't want duplicates.
    – tuxayo
    Commented Jul 13, 2016 at 13:16
  • An explanation is here: stackoverflow.com/questions/17552789/… Commented Aug 25, 2017 at 23:18
6
awk 'NR==FNR{a[$1]++;next} a[$1]' file1 file2

Note that this compares only the first whitespace-separated field ($1), not whole lines; replace both occurrences of $1 with $0 to match entire lines.
1
  • This command does not work. Commented Feb 3, 2022 at 7:47
3

On a limited version of Linux (like the QNAP NAS I was working on):

  • comm did not exist
  • grep -f file1 file2 can cause some problems, as said by @ChristopherSchultz, and using grep -F -f file1 file2 was really slow (more than 5 minutes, never finished, versus 2-3 seconds with the method below, on files over 20 MB)

So here is what I did (the first diff extracts the lines that are only in file1; diffing file1 against that set then leaves exactly the lines file1 shares with file2):

sort file1 > file1.sorted
sort file2 > file2.sorted

diff file1.sorted file2.sorted | grep '^<' | sed 's/^< *//' > files.diff
diff file1.sorted files.diff | grep '^<' | sed 's/^< *//' > files.same.sorted

If files.same.sorted should be in the same order as the original files, add this line to restore the order of file1:

awk 'FNR==NR {a[$0]=$0; next}; $0 in a {print a[$0]}' files.same.sorted file1 > files.same

Or, for the same order as file2:

awk 'FNR==NR {a[$0]=$0; next}; $0 in a {print a[$0]}' files.same.sorted file2 > files.same
2

For how to do this for multiple files, see the linked answer to Finding matching lines across many files.


Combining these two answers (answer 1 and answer 2), I think you can get the result you need without sorting the files:

#!/bin/bash
ans="matching_lines"

for file1 in *
do
    for file2 in *
    do
        if [ "$file1" != "$ans" ] && [ "$file2" != "$ans" ] && [ "$file1" != "$file2" ]; then
            echo "Comparing: $file1 $file2 ..." >> "$ans"
            perl -ne 'print if ($seen{$_} .= @ARGV) =~ /10$/' "$file1" "$file2" >> "$ans"
        fi
    done
done

Simply save it, give it execution rights (chmod +x compareFiles.sh) and run it. It will take all the files present in the current working directory and make an all-vs-all comparison, leaving the result in the matching_lines file.

Things to be improved (a possible refinement is sketched after this list):

  • Skip directories
  • Avoid comparing all the files two times (file1 vs file2 and file2 vs file1).
  • Maybe add the line number next to the matching string
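
A sketch that addresses the first two points (assuming Bash; the matching_lines convention is the same as above):

#!/bin/bash
ans="matching_lines"

for file1 in *
do
    [ -f "$file1" ] || continue                 # skip directories
    for file2 in *
    do
        [ -f "$file2" ] || continue
        # "<" orders the pair lexicographically, so each pair is compared only once
        if [[ "$file1" < "$file2" && "$file1" != "$ans" && "$file2" != "$ans" ]]; then
            echo "Comparing: $file1 $file2 ..." >> "$ans"
            perl -ne 'print if ($seen{$_} .= @ARGV) =~ /10$/' "$file1" "$file2" >> "$ans"
        fi
    done
done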
0
0

Not exactly what you were asking, but something that may still be useful for a slightly different scenario.

If you just want to quickly check whether any line is repeated across a bunch of files, you can use this quick solution:

cat a_bunch_of_files* | sort | uniq | wc -l

If the number of lines you get is less than the one you get from

cat a_bunch_of_files* | wc -l

then there is some repeated line.
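
To also see which lines are repeated, uniq -d prints each duplicated line once (a small extension of the same idea):

cat a_bunch_of_files* | sort | uniq -d

Any output here means some line occurs more than once, whether across files or within a single file.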

-2
rm -f file3.out

while IFS= read -r line1
do
        while IFS= read -r line2
        do
                if [[ $line1 == "$line2" ]]; then
                        echo "$line1" >>file3.out
                fi
        done <file2.out
done <file1.out

This should do it.

2
  • 1
    You should probably use rm -f file3.txt if you're going to delete the file; that won't report any error if the file doesn't exist. OTOH, it would not be necessary if your script simply echoed to standard output, letting the user of the script choose where the output should go. Ultimately, you'd probably want to use $1 and $2 (command line arguments) instead of fixed file names (file1.out and file2.out). That leaves the algorithm: it is going to be slow. It is going to read file2.out once for each line in file1.out. It'll be slow if the files are big (say multiple kilobytes). Commented Jul 22, 2015 at 14:42
  • 1
    While this can nominally work if you have inputs which don't contain any shell metacharacters (hint: see what warnings you get from shellcheck.net), this naive approach is terribly inefficient. A tool like grep -F which reads one file into memory and then does a single pass over the other avoids looping repeatedly over both input files.
    – tripleee
    Commented Dec 8, 2017 at 5:40
