I'm sure I once found a shell command which could print the common lines from two or more files. What is its name? It was much simpler than diff.
The command you are seeking is comm. For example:
comm -12 1.sorted.txt 2.sorted.txt
Here:
-1 : suppress column 1 (lines unique to 1.sorted.txt)
-2 : suppress column 2 (lines unique to 2.sorted.txt)
What remains is column 3, the lines common to both files. Note that comm requires both inputs to be sorted.
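As a quick sketch (the file names and contents here are made up for illustration), on two pre-sorted files:

```shell
# Hypothetical sorted input files.
printf 'apple\nbanana\ncherry\n' > 1.sorted.txt
printf 'banana\ncherry\ndate\n'  > 2.sorted.txt

# Suppress columns 1 and 2, leaving only lines common to both files.
comm -12 1.sorted.txt 2.sorted.txt
```

This prints banana and cherry, the two shared lines.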
grep does some weird things you might not expect. Specifically, everything in 1.txt will be interpreted as a regular expression and not a plain string. Also, any blank line in 1.txt will match all lines in 2.txt. So grep will only work in very specific situations. You'd at least want to use fgrep (or grep -F) to treat the patterns as fixed strings, but the blank-line thing is probably going to wreak havoc on this process.
Commented
Jul 22, 2015 at 14:08
grep -F -x -f file1 file2.
Commented
Jul 22, 2015 at 14:31
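A small illustration of why both flags matter (file contents invented for the example): -F disables regex interpretation and -x restricts matches to whole lines, so substring hits no longer count.

```shell
printf 'foo\nbar\n'       > file1
printf 'bar\nbaz\nfood\n' > file2

# Without -x, "food" would match the pattern "foo" as a substring.
grep -F -x -f file1 file2
```

Only bar is printed; food is not treated as a match for foo.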
How can one separate the output of the comm command into 3 separate files? The answer was much too big to fit comfortably here.
Commented
Sep 21, 2017 at 5:56
To easily apply the comm command to unsorted files, use Bash's process substitution:
$ bash --version
GNU bash, version 3.2.51(1)-release
Copyright (C) 2007 Free Software Foundation, Inc.
$ cat > abc
123
567
132
$ cat > def
132
777
321
So the files abc and def have one line in common, the one with "132". Using comm on unsorted files:
$ comm abc def
123
132
567
132
777
321
$ comm -12 abc def # No output! The common line is not found
$
The last command produced no output; the common line was not discovered.
Now use comm on sorted files, sorting the files with process substitution:
$ comm <( sort abc ) <( sort def )
123
132
321
567
777
$ comm -12 <( sort abc ) <( sort def )
132
Now we get the 132 line!
Why not just sort abc > abc.sorted, sort def > def.sorted and then comm -12 abc.sorted def.sorted?
Commented
Nov 1, 2017 at 1:28
To complement the Perl one-liner, here's its awk
equivalent:
awk 'NR==FNR{arr[$0];next} $0 in arr' file1 file2
This will read all lines from file1
into the array arr[]
, and then check for each line in file2
if it already exists within the array (i.e. file1
). The lines that are found will be printed in the order in which they appear in file2
.
Note that the lookup in arr uses the entire line from file2 as the array index, so it will only report exact matches on entire lines.
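A minimal run of the one-liner, with invented file contents, showing that the output follows file2's order:

```shell
printf 'red\ngreen\nblue\n'  > file1
printf 'blue\nyellow\nred\n' > file2

# Print lines of file2 that also occur (as whole lines) in file1.
awk 'NR==FNR{arr[$0];next} $0 in arr' file1 file2
```

This prints blue then red: file2's order, not file1's.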
Maybe you mean comm
?
Compare sorted files FILE1 and FILE2 line by line.
With no options, produce three-column output. Column one contains lines unique to FILE1, column two contains lines unique to FILE2, and column three contains lines common to both files.
The secret to finding this kind of information is the info pages. For GNU programs, they are much more detailed than the man pages. Try info coreutils and it will list all the small useful utilities.
While
fgrep -v -f 1.txt 2.txt > 3.txt
gives you the differences between two files (what is in 2.txt and not in 1.txt), you can just as easily do
fgrep -f 1.txt 2.txt > 3.txt
to collect all common lines, which should provide an easy solution to your problem. If you have sorted files, you should use comm instead. Regards!
Note: You can use grep -F
instead of fgrep
.
grep
does some weird things you might not expect. Specifically, everything in 1.txt
will be interpreted as a regular expression and not a plain string. Also, any blank line in 1.txt
will match all lines in 2.txt
. So this will only work in very specific situations.
Commented
Jul 22, 2015 at 14:05
grep
notations, which are supported by the grep
found on most modern Unix variants. Add -F
(or use fgrep
) to suppress regular expressions. Add -x
(for exact) to match only whole lines.
Commented
Jul 22, 2015 at 14:20
comm
can work with arbitrarily large files as long as they are sorted because it only ever needs to hold three lines in memory (I'm guessing GNU comm
would even know to keep just a prefix if the lines are really long). The grep
solution needs to keep all the search expressions in memory.
If the two files are not sorted yet, you can use:
comm -12 <(sort a.txt) <(sort b.txt)
and it will work, avoiding the error message comm: file 2 is not in sorted order
when doing comm -12 a.txt b.txt
.
<(command)
is not portable to POSIX shell, though it works in Bash and some others.
perl -ne 'print if ($seen{$_} .= @ARGV) =~ /10$/' file1 file2
This is better than the comm command, as it searches for each line of file1 anywhere in file2, whereas comm will only compare line n in file1 with line n in file2.
Commented
Oct 11, 2014 at 12:32
comm
does not simply compare line N in file1 with line N in file2. It can perfectly well manage a series of lines inserted in either file (which is equivalent to deleting a series of lines from the other file, of course). It merely requires the inputs to be in sorted order.
Commented
Jul 22, 2015 at 14:24
Better than the comm answers if one wants to keep the order. Better than the awk answer if one doesn't want duplicates.
awk 'NR==FNR{a[$1]++;next} a[$1] ' file1 file2
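Note that this variant keys on the first field ($1) rather than the whole line ($0), so rows match whenever their first columns agree. A made-up illustration:

```shell
printf 'id1 red\nid2 blue\n'  > file1
printf 'id2 green\nid3 red\n' > file2

# Prints rows of file2 whose first field appears as a first field in file1.
awk 'NR==FNR{a[$1]++;next} a[$1] ' file1 file2
```

This prints "id2 green" even though that exact line never occurs in file1; use $0 in both places for whole-line matching.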
On a limited version of Linux (like a QNAP (NAS) I was working on):
grep -f file1 file2
can cause some problems, as said by @ChristopherSchultz, and using grep -F -f file1 file2
was really slow (more than 5 minutes and not finished, versus 2-3 seconds with the method below, on files over 20 MB). So here is what I did:
sort file1 > file1.sorted
sort file2 > file2.sorted
diff file1.sorted file2.sorted | grep "<" | sed 's/^< *//' > files.diff
diff file1.sorted files.diff | grep "<" | sed 's/^< *//' > files.same.sorted
If files.same.sorted should be in the same order as the original files, then add this line for the same order as file1:
awk 'FNR==NR {a[$0]=$0; next}; $0 in a {print a[$0]}' files.same.sorted file1 > files.same
Or, for the same order as file2:
awk 'FNR==NR {a[$0]=$0; next}; $0 in a {print a[$0]}' files.same.sorted file2 > files.same
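A compact sketch of the two-pass diff trick, with invented three-line files: the first pass extracts the lines unique to file1.sorted, and diffing those against file1.sorted again leaves only the common lines.

```shell
printf 'x\ny\nz\n' > file1.sorted   # already sorted
printf 'q\nx\ny\n' > file2.sorted

# Pass 1: lines only in file1.sorted ("<" side of the diff).
diff file1.sorted file2.sorted | grep '<' | sed 's/^< *//' > files.diff
# Pass 2: file1.sorted minus its unique lines = common lines.
diff file1.sorted files.diff   | grep '<' | sed 's/^< *//'
```

The final command prints x and y, the lines shared by both files.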
For how to do this for multiple files, see the linked answer to Finding matching lines across many files.
Combining these two answers (answer 1 and answer 2), I think you can get the result you are needing without sorting the files:
#!/bin/bash
ans="matching_lines"
for file1 in *
do
    for file2 in *
    do
        if [ "$file1" != "$ans" ] && [ "$file2" != "$ans" ] && [ "$file1" != "$file2" ] ; then
            echo "Comparing: $file1 $file2 ..." >> "$ans"
            perl -ne 'print if ($seen{$_} .= @ARGV) =~ /10$/' "$file1" "$file2" >> "$ans"
        fi
    done
done
Simply save it, give it execution rights (chmod +x compareFiles.sh
) and run it. It will take all the files present in the current working directory and will make an all-vs-all comparison leaving in the "matching_lines" file the result.
Things to be improved:
Not exactly what you were asking, but something I think still may be useful to cover a slightly different scenario
If you just want to quickly check whether there is any repeated line among a bunch of files, you can use this quick solution:
cat a_bunch_of_files* | sort | uniq | wc -l
If the number of lines you get is less than the one you get from
cat a_bunch_of_files* | wc -l
then there is some repeated line.
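Sketched concretely (file names invented for the example), comparing the two line counts:

```shell
printf 'a\nb\n' > part1
printf 'b\nc\n' > part2

total=$(cat part1 part2 | wc -l)           # 4 lines in total
unique=$(cat part1 part2 | sort | uniq | wc -l)  # 3 distinct lines
if [ "$unique" -lt "$total" ]; then
    echo "some line is repeated"
fi
```

Note this also counts a line duplicated within a single file, not only lines shared across files.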
rm file3.txt
cat file1.out | while read -r line1
do
    cat file2.out | while read -r line2
    do
        if [[ $line1 == $line2 ]]; then
            echo "$line1" >> file3.txt
        fi
    done
done
This should do it.
rm -f file3.txt
if you're going to delete the file; that won't report any error if the file doesn't exist. OTOH, it would not be necessary if your script simply echoed to standard output, letting the user of the script choose where the output should go. Ultimately, you'd probably want to use $1
and $2
(command line arguments) instead of fixed file names (file1.out
and file2.out
). That leaves the algorithm: it is going to be slow. It is going to read file2.out
once for each line in file1.out
. It'll be slow if the files are big (say multiple kilobytes).
Commented
Jul 22, 2015 at 14:42
grep -F, which reads one file into memory and then does a single pass over the other, avoids looping repeatedly over both input files.
comm requires sorted input files. If you want just line-by-line common lines, it's great. But if you want what I would call an "anti-diff", comm doesn't do the job. For example, file1 contains pr-123-xy-45 and file2 contains ec11_orop_pr-123-xy-45.gz; I need file3 containing ec11_orop_pr-123-xy-45.gz.