I may be missing some information here, but I was surprised today when I tried to run diff on two .html files that should have a subtle difference and got this message:

$ diff index.html index3.html
Binary files index.html and index3.html differ

Why are .html files being considered binaries? Is there any way to avoid this and treat them like text files?

  • 1
    Related (not an answer): unix.stackexchange.com/questions/469483/…
    – Jeff Schaller
    Commented Oct 15, 2019 at 16:44
  • 1
    See also a workaround in unix.stackexchange.com/a/59859/117549
    – Jeff Schaller
    Commented Oct 15, 2019 at 16:45
  • 1
    Probably your HTML file contains non-ASCII characters. If you use GNU diff, e.g. on Linux, you can use the -a option to force comparison as text. Using HTML entities instead of special characters might also help, e.g. &auml; instead of ä, etc.
    – Bodo
    Commented Oct 15, 2019 at 17:04
  • 1
    Your files are probably encoded in UTF-16 and contain NUL bytes, which don't show up when you cat them to the terminal. You can try with diff <(iconv -f utf-16 index.html) <(iconv -f utf-16 index3.html).
    – user313992
    Commented Oct 15, 2019 at 17:57
  • 1
    @Bodo No, diff doesn't care about "non-ascii characters". Try it yourself diff <(printf '\xff\n') <(printf '\xfe\n')
    – user313992
    Commented Oct 15, 2019 at 17:59

1 Answer

GNU diff treats files as binary if there is a null byte within the first few kilobytes. Text files don't contain null bytes, and binary files are very likely to contain null bytes within the first few hundred bytes, so this is a good heuristic. The file name doesn't matter.
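
As a rough way to check whether a file would trip this heuristic, the sketch below counts the bytes near the start of a file with and without NUL bytes. The function name has_nul and the 32 KiB window are assumptions made for illustration; the exact amount GNU diff inspects is not stated here and may differ.

```shell
# has_nul FILE: succeed if FILE has a NUL byte in its first 32 KiB.
# The 32 KiB window is an assumed stand-in for "the first few kilobytes".
has_nul() {
    # Byte count of the leading chunk as-is.
    total=$(( $(head -c 32768 -- "$1" | wc -c) ))
    # Byte count of the same chunk after deleting every NUL byte.
    stripped=$(( $(head -c 32768 -- "$1" | tr -d '\000' | wc -c) ))
    # The counts differ only if at least one NUL byte was removed.
    [ "$total" -ne "$stripped" ]
}
```

For example, `has_nul index.html && echo "diff will probably call this binary"`.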

The reason diff doesn't display differences between binary files is that the result is usually unreadable. Binary formats usually can't be divided into lines that provide useful realignment after a changed chunk, often change radically in response to minor semantic changes (for example, inserting one character in a compressed file can change everything that follows), and would result in unprintable characters in the diff output. But diff can work with null bytes. To force diff to treat files as text (meaning: to display differences), pass the --text (or -a) option:

diff --text index.html index3.html
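
To see the difference in behaviour without touching the real files, here is a small sketch using two throwaway files that each contain a NUL byte. The file names and contents are invented for the demonstration, and the described output assumes GNU diff.

```shell
# Throwaway demo files; names and contents are made up for illustration.
tmp=$(mktemp -d)
printf '<p>one</p>\000\n' > "$tmp/a.html"
printf '<p>two</p>\000\n' > "$tmp/b.html"

# Without --text, diff only reports that the binary files differ.
# diff exits non-zero when files differ, hence the "|| true".
diff "$tmp/a.html" "$tmp/b.html" || true

# With --text, the changed line is shown (NUL bytes are printed raw).
diff --text "$tmp/a.html" "$tmp/b.html" || true

rm -r "$tmp"
```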

Whether this is useful or not depends on why the file contains null bytes. Null bytes are unusual in an HTML file. You can get a hint with

file index.html

If the file is actually compressed, diff won't show anything useful: you need to uncompress it, and you should give it a name that reflects the compression mechanism, e.g. index.html.gz. If you have compressed files, in bash/ksh/zsh, you can decompress them on the fly (replace uncompress with the actual command that reads a compressed file from its standard input and writes the decompressed text to standard output):

diff --label=index.html <(uncompress <index.html) --label=index3.html <(uncompress <index3.html)

It's possible that your file is encoded in a non-ASCII-based encoding such as UTF-16, UCS-2, UTF-32, or a pre-Unicode multibyte encoding. Such encodings are rare on the web. Web browsers do support them but document production tools might have trouble. If this is the case, you'll save headaches if you modify your production chain to use UTF-8 instead. In the meantime, diff --text will give results that may or may not be readable depending on what non-ASCII content is present, or you can convert the files on the fly to pass them to diff, for example with files encoded in little-endian UTF-16:

diff --label=index.html <(iconv -f UTF-16LE -t UTF-8 <index.html) --label=index3.html <(iconv -f UTF-16LE -t UTF-8 <index3.html)
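
If you are unsure which encoding to pass to iconv -f, one hint is a byte order mark at the start of the file. The sketch below inspects the first four bytes; the function name guess_bom is hypothetical, and a file without a BOM (which is common) will simply report unknown, so treat the result as a hint rather than a reliable detection.

```shell
# guess_bom FILE: print an encoding name based on the byte order mark, if any.
guess_bom() {
    # Hex dump of the first four bytes with spaces and newlines removed.
    bom=$(head -c 4 -- "$1" | od -An -tx1 | tr -d ' \n')
    case $bom in
        fffe0000) echo UTF-32LE ;;   # must be tested before the UTF-16LE pattern
        0000feff) echo UTF-32BE ;;
        efbbbf*)  echo UTF-8 ;;
        fffe*)    echo UTF-16LE ;;
        feff*)    echo UTF-16BE ;;
        *)        echo unknown ;;
    esac
}
```

For example, `guess_bom index.html` printing UTF-16LE would suggest the iconv command above is the right one.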
