GNU diff treats a file as binary if it finds a null byte within the first few kilobytes. Text files in ASCII-based encodings don't contain null bytes, whereas binary files are very likely to contain one within the first few hundred bytes, so this is a good heuristic. The file name doesn't matter.
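To check whether a given file would trip this heuristic, you can look for a null byte near the start yourself. This is only a rough sketch: the function name has_nul_in_head is made up, and the 4096-byte default is an assumption, since the amount diff actually inspects depends on its buffer size on your system.

```shell
# has_nul_in_head FILE [BYTES]   (hypothetical helper, not part of diff)
# Succeeds (exit status 0) if FILE has a null byte in its first BYTES bytes.
# BYTES defaults to 4096, an assumed stand-in for diff's real buffer size.
# The body runs in a subshell so its variables don't leak into the caller.
has_nul_in_head() (
    file=$1
    bytes=${2:-4096}
    # Count the leading bytes before and after deleting null bytes.
    total=$(( $(head -c "$bytes" "$file" | wc -c) ))
    kept=$(( $(head -c "$bytes" "$file" | tr -d '\000' | wc -c) ))
    # If deleting null bytes shrank the data, there was at least one.
    [ "$total" -ne "$kept" ]
)
```

For example, `has_nul_in_head index.html && echo "probably treated as binary"`.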
The reason diff doesn't display differences between binary files is that the result is usually unreadable. Binary formats usually can't be divided into lines that provide useful realignment after a changed chunk, they often change radically in response to minor semantic changes (for example, inserting one character in a compressed file can change everything that follows), and they would put unprintable characters in the diff output. But diff can work with null bytes. To force diff to treat files as text (meaning: to display differences), pass the --text (or -a) option:
diff --text index.html index3.html
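If you want to see the effect on a small case first, here is a throwaway demonstration, assuming GNU diff. The file names and contents are invented for illustration; each file has a null byte on its first line and differs only on the second.

```shell
# Throwaway demo files; names and contents are made up for illustration.
dir=$(mktemp -d)
printf 'one\000\ntwo\n'   > "$dir/a.txt"
printf 'one\000\nthree\n' > "$dir/b.txt"

# Without --text, GNU diff only reports that the two binary files differ.
# diff exits non-zero when files differ, hence the "|| true".
plain=$(LC_ALL=C diff "$dir/a.txt" "$dir/b.txt" || true)

# With --text, the changed line shows up as in any ordinary text diff.
as_text=$(LC_ALL=C diff --text "$dir/a.txt" "$dir/b.txt" || true)

printf '%s\n---- with --text ----\n%s\n' "$plain" "$as_text"
rm -rf "$dir"
```

The second diff is readable here only because the null byte sits on a line that didn't change.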
Whether this is useful or not depends on why the file contains null bytes. Null bytes are unusual in an HTML file. You can get a hint with
file index.html
If the file is actually compressed, diff won't show anything useful: you need to uncompress it, and you should give it a name that reflects the compression mechanism, e.g. index.html.gz. If you have compressed files, in bash/ksh/zsh, you can decompress them on the fly (replace uncompress by the actual command that reads a compressed file from its standard input and writes the decompressed text to standard output):
diff --label=index.html <(uncompress <index.html) --label=index3.html <(uncompress <index3.html)
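As a concrete instance, suppose the files turn out to be gzip-compressed (an assumption; check with file first). Then gzip -dc can stand in for uncompress. The wrapper name diff_gz is made up, and I've added -u because the labels only appear in context or unified output.

```shell
# diff_gz OLD NEW   (hypothetical helper)
# Compare two gzip-compressed files by their decompressed text.
# Needs a shell with process substitution (bash, ksh, zsh).
diff_gz() {
    # --label makes the diff headers show the real file names
    # instead of the /dev/fd/N paths of the process substitutions.
    diff -u --label="$1" --label="$2" \
        <(gzip -dc <"$1") <(gzip -dc <"$2")
}
```

Usage would be `diff_gz index.html index3.html`. For another compression format, swap in the matching decompressor, e.g. xz -dc or bzip2 -dc.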
It's possible that your file is encoded in a non-ASCII-based encoding such as UTF-16, UCS-2, UTF-32, or a pre-Unicode multibyte encoding. Such encodings are rare on the web. Web browsers do support them but document production tools might have trouble. If this is the case, you'll save headaches if you modify your production chain to use UTF-8 instead. In the meantime, diff --text will give results that may or may not be readable depending on what non-ASCII content is present, or you can convert the files on the fly to pass them to diff, for example with files encoded in little-endian UTF-16:
diff --label=index.html <(iconv -f UTF-16LE -t UTF-8 <index.html) --label=index3.html <(iconv -f UTF-16LE -t UTF-8 <index3.html)
-a to force comparison as text. Using HTML entities instead of special characters might also help, e.g. &auml; instead of ä, etc.
cat them to the terminal. You can try with
diff <(iconv -f utf-16 index.html) <(iconv -f utf-16 index3.html)
diff <(printf '\xff\n') <(printf '\xfe\n')