
I'm getting a diff: memory exhausted error when trying to diff two 27 GB files that are largely similar on a Linux box with CentOS 5 and 4 GB of RAM. This is a known problem, it seems.

I would expect there to be an alternative for such an essential utility, but I can't find one. I imagine the solution would have to use temporary files rather than memory to store the information it needs.

  • I tried to use rdiff and xdelta, but they are better for showing the changes between two files, like a patch, and are not that useful for inspecting the differences between two files.
  • Tried VBinDiff, but it is a visual tool which is better for comparing binary files. I need something that can pipe the differences to STDOUT like regular diff.
  • There are a lot of other utilities such as vimdiff that only work with smaller files.
  • I've also read about Solaris bdiff but I could not find a port for Linux.

Any ideas besides splitting the files into smaller pieces? I have 40 of these files, so I'm trying to avoid the work of breaking them all up.


6 Answers


cmp does things byte-by-byte, so it probably won't run out of memory (just tested it on two 7 GB files) -- but you might be looking for more detail than a list of "files X and Y differ at byte x, line y". If the similarities of your files are offset (e.g., file Y has an identical block of text, but not at the same location), you can pass offsets to cmp; you could probably turn it into a resynchronizing compare with a small script.
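A minimal sketch of such an offset compare, using GNU cmp's `--ignore-initial=SKIP1:SKIP2` option (the file names and offsets here are made up for illustration):

```shell
# Two tiny stand-ins for the large files: f2 has 2 extra leading bytes.
printf 'headerAAAA' > f1
printf 'XXheaderAAAA' > f2
# GNU cmp can skip a given number of bytes in each file before comparing.
cmp --ignore-initial=0:2 f1 f2 && echo 'identical after offset'
```

Since cmp streams byte-by-byte, this stays cheap on memory regardless of file size.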

Aside: In case anyone else lands here when looking for a way to confirm that two directory structures (containing very large files) are identical: diff --recursive --brief (or diff -r -q for short, or maybe even diff -rq) will work and not run out of memory.

  • Nice; I think -q is the key here: without it, diff can end up pulling the whole file (or at least whole lines) into memory...
    – rogerdpack
    Commented Jul 8, 2014 at 17:55

I found this link

diff -H might help, or you can try installing the textproc/2bsd-diff port which apparently doesn't try to load the files into RAM, so it can work on large files more easily.

I'm not sure if you tried those two options or if they might work for you. Good luck.

  • Does this help for anybody out there? For me, same failure...
    – rogerdpack
    Commented Jul 8, 2014 at 17:54
  • For anyone wondering: diff -H is an undocumented and deprecated alias for diff --speed-large-files.
    – a3nm
    Commented Feb 27, 2016 at 22:47
  • This answer doesn't help. This is a Linux question, and to install 2bsd-diff you would have to port it first. After you found the source. And patched it. Possible, but unlikely to be a viable solution.
    – nyov
    Commented Aug 27, 2019 at 7:23

If the files are identical (same length) except for a few byte values, you can use a script like the following (w is the number of bytes per line to hexdump; adjust it to your display width):

w=12;
while read -ru7 x && read -ru8 y;
do
  [ ".$x" = ".$y" ] || echo "$x | $y";
done 7< <(od -vw$w -tx1z FILE1) 8< <(od -vw$w -tx1z FILE2) > DIFF-FILE1-FILE2 &

less DIFF-FILE1-FILE2

It's not very fast, but does the job.
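For the same same-length case, plain cmp -l is a faster built-in alternative: it prints one line per differing byte, giving the offset and the two octal byte values. A toy example (file names made up):

```shell
printf 'abcdef' > F1
printf 'abXdef' > F2
cmp -l F1 F2   # exits non-zero when the files differ
```

Unlike the od loop above it shows no surrounding context, but it streams, so memory is not a concern.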


If the files have the same number of lines and differ only in the content of a few of them, use the following command. Substitute \a (the alert character) with any other character that does not occur in either file.

paste -d $'\a' file1 file2 | awk -F$'\a' '$1 != $2'

This works by pairing the lines of the two files and then comparing each pair.
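A toy run of the command above (the file names left.txt and right.txt are made up):

```shell
printf 'a\nb\nc\n' > left.txt
printf 'a\nX\nc\n' > right.txt
# Mismatched line pairs come out joined by the \a byte:
paste -d $'\a' left.txt right.txt | awk -F$'\a' '$1 != $2'
```

Because both paste and awk stream line by line, memory use stays flat no matter how large the files are.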

  • This was a great help... I extended it for my needs, in my answer below
    – Otheus
    Commented Feb 10, 2023 at 20:26

So, not exactly the OP's problem, but a related one: you have two large database dumps, each insert/record on its own row, and varying differences in floating-point implementation result in numbers that are off by some IEEE rounding error. Thanks to the answer provided by @Diomidis, and the sprawling one-liner awk script shown below, we get a fully functioning, efficient fuzzy differ.

Add the text below to some script directory as fuzzy-compare.awk, tune the parameters in the BEGIN section as needed (locale-specific, debugging modes, etc), then pipe the output of paste into it:

paste -d $'\a' file1 file2 | awk -f fuzzy-compare.awk

Sample output:

Line 1 diffs found so far: 1 here at field: 4
75747358        1       53      2011-03-29 23:00:00+00  7.428
75747358        1       53      2011-03-28 23:00:00+00  7.428

Line 2 diffs found so far: 2 here at field: 4
75747359        1       53      2011-03-29 23:30:00+00  5.757
75747359        1       53      2011-03-29 23:30:00+01  5.757

Line 3 diffs found so far: 3 here at field: 3
75747360        1       53      2011-03-30 00:00:00+00  6.739
75747360        1       54      2011-03-30 00:00:00+00  6.74

Line 5 diffs found so far: 4
75747362        1       53      2011-03-30 01:00:00+00  6.736   extra-field
75747362        1       53      2011-03-30 01:00:00+00  6.73599999999999977

With diff showing:

# diff sample.sql sample2.sql
1,3c1,3
< 75747358      1       53      2011-03-29 23:00:00+00  7.428
< 75747359      1       53      2011-03-29 23:30:00+00  5.757
< 75747360      1       53      2011-03-30 00:00:00+00  6.739
---
> 75747358      1       53      2011-03-28 23:00:00+00  7.428
> 75747359      1       53      2011-03-29 23:30:00+01  5.757
> 75747360      1       54      2011-03-30 00:00:00+00  6.74
5,13c5,13
< 75747362      1       53      2011-03-30 01:00:00+00  6.736   extra-field
< 75747363      1       53      2011-03-30 01:30:00+00  7.576
< 75747364      1       53      2011-03-30 02:00:00+00  6.789
< 75747365      1       53      2011-03-30 02:30:00+00  6.386e+2
< 75747366      1       53      2011-03-30 03:00:00+00  6.016E-2
< 75747367      1       53      2011-03-30 03:30:00+00  6.336
< 75747368      1       53      2011-03-30 04:00:00+00  6.1
< 75747374      1       53      2011-03-30 07:00:00+00  5.9412
< 75747375      1       53      2011-03-30 07:30:00+00  6.137803249
---
> 75747362      1       53      2011-03-30 01:00:00+00  6.73599999999999977
> 75747363      1       53      2011-03-30 01:30:00+00  7.576e+10
> 75747364      1       53      2011-03-30 02:00:00+00  6.789e-10
> 75747365      1       53      2011-03-30 02:30:00+00  6.38600000000000012e+2
> 75747366      1       53      2011-03-30 03:00:00+00  6.01600000000000001E-2
> 75747367      1       53      2011-03-30 03:30:00+00  6.3360000000000003
> 75747368      1       53      2011-03-30 04:00:00+00  6.0999999999999993
> 75747374      1       53      2011-03-30 07:00:00+00  5.94099999999999984
> 75747375      1       53      2011-03-30 07:30:00+00  6.13780324900000007

Code below (duplicated to a github gist: https://gist.github.com/otheus/92162e3a764d2697c3272b98b2663a94).

#!/bin/awk -f
## Awk script to compare two SQL (postgres) dumps for which each line of input is a row
## and has been preprocessed by
##   paste -d $'\a' file1 file2
## The BEL character is used by this program to quickly split the input.
##
## Sometimes, numbers differ by some kind of rounding error / floating-point implementation.
## Ignore that error by subtracting the two values and checking whether the result is < maxdiff,
##     maxdiff = 1 / (10 ^ (length-after-decimal-point(shortest-value))
## Consider:
##   4.2  vs 4.19998
## The shortest number is 4.2, with one digit after the decimal point, so
## maxdiff is 0.1 and the two values are treated as equal.

## Notes:
##   d is the global *d*iff counter
##   p is the *p*osition / field that first had a difference
##   i is a loop variable, usually the current field
##   L is the array of fields from the current line of the *L*eft-file
##   R is  "    "    "    "     "   "    "  "    "   "  "  *R*ight-file
##   clhs is the number of fields in L
##   crhs is the number of fields in R

BEGIN { 
  FS="\a";
  DECIMAL_SEP=".";
  FIELD_SEP="\t";  # for postgresql; for mysql, maybe ", ";
  MAX_DIFFS=10;
  DEBUG=0;
  # Efficiently fill out our table of maximum tolerances of values
  Maxdiffs[1] = 0.1;
  for (i=2; i<31; ++i)
    Maxdiffs[i] = Maxdiffs[i-1] / 10;
  p=-1; # everything starts out fine.
}

# if -v start=...., skip until that line
NR < (0 + start) { next } 

# When pairs don't match, investigate further...
("_" $1) != ("_" $2) {
    if (DEBUG>1) print "Line",NR ": Input lines differed somehow. Investigating...";
    p=0;  # p is field# where difference was found; 0 means whole line
    # split each half into tab-delimited fields
    clhs=split($1,L,FIELD_SEP);
    crhs=split($2,R,FIELD_SEP); 

    if (clhs == crhs) { 
    if (DEBUG>1) print "Line",NR ": Same number of tokens in each line, delimited by '" FIELD_SEP "'";
        ## compare field by field
    p = -1;  # if we don't set p in the loop below, no real differences

    # Compare each field, until a difference is found
    for (i=1; i<=clhs && p<0; ++i) {  
        # Hint: force this compare to be string-based
        if (("_" L[i]) != ("_" R[i])) { 
        if (DEBUG>1) print "Line",NR ": Field",i,"differs somehow";

        ## They differ... but are they numbers?
        if ( \
          L[i] ~ /^-?[0-9]+\.[0-9]+([eE][-+][0-9]+)?$/ && \
          R[i] ~ /^-?[0-9]+\.[0-9]+([eE][-+][0-9]+)?$/ \
        ) {  
            # both fields are floating-point numbers, compare loosely

            # strip exponent part
            sub(/[eE].*/,"",L[i]);sub(/[eE].*/,"",R[i]); 
            # determine precision of shortest value
            precision=( \
                length(L[i]) < length(R[i]) ?  \
            length(L[i]) - index(L[i],DECIMAL_SEP) :  \
            length(R[i]) - index(R[i],DECIMAL_SEP)    \
            ); 
            # look up the maxdiff from our table
            maxdiff=Maxdiffs[precision]; 

            diff=(L[i] - R[i]);
            if (diff > maxdiff || diff < -maxdiff) {
            if (DEBUG) print "Line",NR ": Numbers differed at",i,"between",L[i],"and",R[i],"differing more than",maxdiff;
            p=i;
            }
            else {
            if (DEBUG) print "Line",NR ": Numbers differed at",i,"between",L[i],"and",R[i],"but differed less than",maxdiff;
            }
        } 
        else {
          if (DEBUG) print "Line",NR ": Strings or ints differed at",i,"between",L[i],"and",R[i];
          p=i;
        }
        }
        else { 
          if (DEBUG) print "Line",NR ": No differences found";
        }
    } 
    }
    # else, field count is different, so whole line is.
    else { 
      if (DEBUG) print "Line",NR ": Number of fields in line differ";
    }
}

p>=0 { 
    ++d;  # bump total diffs count
    # Output a little header for each non-matching record
    print "Line",NR,"diffs found so far:",d,(p ? "here at field: "  p : "" ); 
    # Output the lines that didn't match
    print $1; print $2; print ""; 
    p=-1;
}

# Progress counter
NR % 100000 == 0 { print "Line",NR } 
d > MAX_DIFFS { exit(1);}

Note, the above code was a 1-liner prior to publication.


This may not work for all types of files, but if your files have a regular structure to them you may be able to split them into smaller chunks and diff the chunks individually.

For example:

csplit large-file.txt '/separator pattern/' '{*}'

Caveat: this only works if your files contain something you can use as a separator without producing hundreds of tiny files, and if the resulting chunks are still comparable.
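A sketch of that workflow, assuming a made-up separator line `-- chunk` that appears at the same logical points in both files:

```shell
# Build two small stand-in files sharing the same separator layout.
printf -- '-- chunk\none\n-- chunk\ntwo\n' > big1.txt
printf -- '-- chunk\none\n-- chunk\nTWO\n' > big2.txt
mkdir -p a b
( cd a && csplit -s ../big1.txt '/^-- chunk/' '{*}' )
( cd b && csplit -s ../big2.txt '/^-- chunk/' '{*}' )
# Diff each pair of chunks; only small pieces are handled at a time.
for f in a/xx*; do
  diff "$f" "b/${f#a/}"
done
```

Note that `'{*}'` (repeat until input is exhausted) is a GNU csplit extension.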
