I'm sure I once found a shell command which could print the common lines from two or more files. What is its name? It was much simpler than diff.
The command you are seeking is comm. For example:
comm -12 1.sorted.txt 2.sorted.txt
Here:
-1 : suppress column 1 (lines unique to 1.sorted.txt)
-2 : suppress column 2 (lines unique to 2.sorted.txt)
What remains is column 3, the lines common to both files. Note that comm requires both inputs to be sorted.
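As a quick sketch (the file names and contents here are made up for illustration), on two pre-sorted files:

```shell
# Hypothetical sorted input files.
printf 'apple\nbanana\ncherry\n' > 1.sorted.txt
printf 'banana\ncherry\ndate\n'  > 2.sorted.txt

# Suppress columns 1 and 2, leaving only lines common to both files.
comm -12 1.sorted.txt 2.sorted.txt
```

This prints banana and cherry, the two shared lines.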
grep does some weird things you might not expect. Specifically, everything in 1.txt will be interpreted as a regular expression and not a plain string. Also, any blank line in 1.txt will match all lines in 2.txt. So grep will only work in very specific situations. You'd at least want to use fgrep (or grep -F) to treat the patterns as fixed strings, but the blank-line thing is probably going to wreak havoc on this process.
Commented
Jul 22, 2015 at 14:08
grep -F -x -f file1 file2.
Commented
Jul 22, 2015 at 14:31
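A small illustration of why both flags matter (file contents invented for the example): -F disables regex interpretation and -x restricts matches to whole lines, so substring hits no longer count.

```shell
printf 'foo\nbar\n'       > file1
printf 'bar\nbaz\nfood\n' > file2

# Without -x, "food" would match the pattern "foo" as a substring.
grep -F -x -f file1 file2
```

Only bar is printed; food is not treated as a match for foo.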
How can one separate the output of the comm command into 3 separate files? The answer was much too big to fit comfortably here.
Commented
Sep 21, 2017 at 5:56
To easily apply the comm command to unsorted files, use Bash's process substitution:
$ bash --version
GNU bash, version 3.2.51(1)-release
Copyright (C) 2007 Free Software Foundation, Inc.
$ cat > abc
123
567
132
$ cat > def
132
777
321
So the files abc and def have one line in common, the one with "132". Using comm on unsorted files:
$ comm abc def
123
132
567
132
777
321
$ comm -12 abc def # No output! The common line is not found
$
The last command produced no output; the common line was not discovered.
Now use comm on sorted files, sorting the files with process substitution:
$ comm <( sort abc ) <( sort def )
123
132
321
567
777
$ comm -12 <( sort abc ) <( sort def )
132
Now we get the 132 line!
Why not just sort abc > abc.sorted, sort def > def.sorted and then comm -12 abc.sorted def.sorted?
Commented
Nov 1, 2017 at 1:28
To complement the Perl one-liner, here's its awk
equivalent:
awk 'NR==FNR{arr[$0];next} $0 in arr' file1 file2
This will read all lines from file1
into the array arr[]
, and then check for each line in file2
if it already exists within the array (i.e. file1
). The lines that are found will be printed in the order in which they appear in file2
.
Note that the lookup in arr uses the entire line from file2 as the array index, so it will only report exact matches on entire lines.
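A minimal run of the one-liner, with invented file contents, showing that the output follows file2's order:

```shell
printf 'red\ngreen\nblue\n'  > file1
printf 'blue\nyellow\nred\n' > file2

# Print lines of file2 that also occur (as whole lines) in file1.
awk 'NR==FNR{arr[$0];next} $0 in arr' file1 file2
```

This prints blue then red: file2's order, not file1's.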
Maybe you mean comm
?
Compare sorted files FILE1 and FILE2 line by line.
With no options, produce three-column output. Column one contains lines unique to FILE1, column two contains lines unique to FILE2, and column three contains lines common to both files.
The secret to finding this kind of information is the info pages. For GNU programs, they are much more detailed than the man pages. Try info coreutils and it will list all the small useful utilities.
While
fgrep -v -f 1.txt 2.txt > 3.txt
gives you the differences between two files (what is in 2.txt and not in 1.txt), you can just as easily do
fgrep -f 1.txt 2.txt > 3.txt
to collect all common lines, which should provide an easy solution to your problem. If you have sorted files, you should use comm instead. Regards!
Note: You can use grep -F
instead of fgrep
.
grep
does some weird things you might not expect. Specifically, everything in 1.txt
will be interpreted as a regular expression and not a plain string. Also, any blank line in 1.txt
will match all lines in 2.txt
. So this will only work in very specific situations.
Commented
Jul 22, 2015 at 14:05
grep
notations, which are supported by the grep
found on most modern Unix variants. Add -F
(or use fgrep
) to suppress regular expressions. Add -x
(for exact) to match only whole lines.
Commented
Jul 22, 2015 at 14:20
comm
can work with arbitrarily large files as long as they are sorted because it only ever needs to hold three lines in memory (I'm guessing GNU comm
would even know to keep just a prefix if the lines are really long). The grep
solution needs to keep all the search expressions in memory.
If the two files are not sorted yet, you can use:
comm -12 <(sort a.txt) <(sort b.txt)
and it will work, avoiding the error message comm: file 2 is not in sorted order
when doing comm -12 a.txt b.txt
.
<(command)
is not portable to POSIX shell, though it works in Bash and some others.
perl -ne 'print if ($seen{$_} .= @ARGV) =~ /10$/' file1 file2
This is better than the comm command, as it searches for each line of file1 anywhere in file2, whereas comm will only compare line n in file1 with line n in file2.
Commented
Oct 11, 2014 at 12:32
comm
does not simply compare line N in file1 with line N in file2. It can perfectly well manage a series of lines inserted in either file (which is equivalent to deleting a series of lines from the other file, of course). It merely requires the inputs to be in sorted order.
Commented
Jul 22, 2015 at 14:24
Better than the comm answers if one wants to keep the order. Better than the awk answer if one doesn't want duplicates.
awk 'NR==FNR{a[$1]++;next} a[$1] ' file1 file2
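Note that this variant keys on the first field ($1) rather than the whole line ($0), so rows match whenever their first columns agree. A made-up illustration:

```shell
printf 'id1 red\nid2 blue\n'  > file1
printf 'id2 green\nid3 red\n' > file2

# Prints rows of file2 whose first field appears as a first field in file1.
awk 'NR==FNR{a[$1]++;next} a[$1] ' file1 file2
```

This prints "id2 green" even though that exact line never occurs in file1; use $0 in both places for whole-line matching.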
On a limited version of Linux (like a QNAP (NAS) I was working on):
grep -f file1 file2
can cause some problems, as said by @ChristopherSchultz, and using grep -F -f file1 file2
was really slow (more than 5 minutes and not finished, versus 2-3 seconds with the method below, on files over 20 MB). So here is what I did:
sort file1 > file1.sorted
sort file2 > file2.sorted
diff file1.sorted file2.sorted | grep "<" | sed 's/^< *//' > files.diff
diff file1.sorted files.diff | grep "<" | sed 's/^< *//' > files.same.sorted
If files.same.sorted should be in the same order as the original files, then add this line for the same order as file1:
awk 'FNR==NR {a[$0]=$0; next}; $0 in a {print a[$0]}' files.same.sorted file1 > files.same
Or, for the same order as file2:
awk 'FNR==NR {a[$0]=$0; next}; $0 in a {print a[$0]}' files.same.sorted file2 > files.same
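A compact sketch of the two-pass diff trick, with invented three-line files: the first pass extracts the lines unique to file1.sorted, and diffing those against file1.sorted again leaves only the common lines.

```shell
printf 'x\ny\nz\n' > file1.sorted   # already sorted
printf 'q\nx\ny\n' > file2.sorted

# Pass 1: lines only in file1.sorted ("<" side of the diff).
diff file1.sorted file2.sorted | grep '<' | sed 's/^< *//' > files.diff
# Pass 2: file1.sorted minus its unique lines = common lines.
diff file1.sorted files.diff   | grep '<' | sed 's/^< *//'
```

The final command prints x and y, the lines shared by both files.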
For how to do this for multiple files, see the linked answer to Finding matching lines across many files.
Combining these two answers (answer 1 and answer 2), I think you can get the result you are needing without sorting the files:
#!/bin/bash
ans="matching_lines"
for file1 in *
do
    for file2 in *
    do
        if [ "$file1" != "$ans" ] && [ "$file2" != "$ans" ] && [ "$file1" != "$file2" ] ; then
            echo "Comparing: $file1 $file2 ..." >> "$ans"
            perl -ne 'print if ($seen{$_} .= @ARGV) =~ /10$/' "$file1" "$file2" >> "$ans"
        fi
    done
done
Simply save it, give it execution rights (chmod +x compareFiles.sh
) and run it. It will take all the files present in the current working directory and will make an all-vs-all comparison leaving in the "matching_lines" file the result.
Things to be improved:
Not exactly what you were asking, but something I think still may be useful to cover a slightly different scenario
If you just want to quickly check whether there is any repeated line among a bunch of files, you can use this quick solution:
cat a_bunch_of_files* | sort | uniq | wc -l
If the number of lines you get is less than the one you get from
cat a_bunch_of_files* | wc -l
then there is some repeated line.
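Sketched concretely (file names invented for the example), comparing the two line counts:

```shell
printf 'a\nb\n' > part1
printf 'b\nc\n' > part2

total=$(cat part1 part2 | wc -l)           # 4 lines in total
unique=$(cat part1 part2 | sort | uniq | wc -l)  # 3 distinct lines
if [ "$unique" -lt "$total" ]; then
    echo "some line is repeated"
fi
```

Note this also counts a line duplicated within a single file, not only lines shared across files.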
rm file3.txt
cat file1.out | while read -r line1
do
    cat file2.out | while read -r line2
    do
        if [[ $line1 == $line2 ]]; then
            echo "$line1" >> file3.txt
        fi
    done
done
This should do it.
rm -f file3.txt
if you're going to delete the file; that won't report any error if the file doesn't exist. OTOH, it would not be necessary if your script simply echoed to standard output, letting the user of the script choose where the output should go. Ultimately, you'd probably want to use $1
and $2
(command line arguments) instead of fixed file names (file1.out
and file2.out
). That leaves the algorithm: it is going to be slow. It is going to read file2.out
once for each line in file1.out
. It'll be slow if the files are big (say multiple kilobytes).
Commented
Jul 22, 2015 at 14:42
grep -F, which reads one file into memory and then does a single pass over the other, avoids looping repeatedly over both input files.
comm requires sorted input files. If you want just line-by-line common lines, it's great. But if you want what I would call an "anti-diff", comm doesn't do the job. For example, file1 contains pr-123-xy-45 and file2 contains ec11_orop_pr-123-xy-45.gz; I need file3 containing ec11_orop_pr-123-xy-45.gz.