
I'm tarring up /home, and piping it through bzip2. However, I've got lots of already-compressed files out there (.jpg, .mp4, .mkv, .webm, etc) which bzip2 shouldn't try to compress.

Are there any CLI compressors out there that are smart enough (either via libmagic or by letting the user enumerate extensions) not to try to compress un- or minimally-compressible files?

A similar question was asked a few years ago, but I don't know if there have been any updates since then: Can I command 7z to skip compression (but not inclusion) of specific files while compressing a directory with its subs?

2 Answers


The way you are doing this, by compressing a .tar file, the answer is definitely no.

Whatever you use to compress the .tar file knows nothing about the contents of that file: it just sees a binary stream, and it has no way of telling whether parts of that stream are incompressible or only minimally compressible. Don't be misled by the options for the tar command to do the compression itself; tar --create --xz --file some.tar.xz file1 is just as "dumb" about the stream contents as tar --create file1 | xz > some.tar.xz is.

You can do multiple things:

  1. you switch to some container format other than .tar which allows you to compress on an individual basis. This is unfavourable if you have lots of small files with similar patterns in one directory, as they get compressed individually and cannot share redundancy. The zip format is an example that would work (see the sketches after this list).
  2. you compress the files, where appropriate, before putting them into the tar file. This can be done transparently with e.g. the Python tarfile and bz2 modules. It has the same disadvantage as point 1, and there is no straightforward extraction from the tar file either: some files will come out still compressed and need an extra decompression step, unlike the ones that were already compressed before the backup.
  3. Use tar as it is, live with the fact that this happens, and select a not-so-high compression level for gzip/bzip2/xz so that it will not try too hard to compress the stream, wasting time on chasing another 0.5% of compression that is not going to happen.
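
To make the options concrete, here are minimal sketches of each. A few assumptions worth flagging: the suffix lists cover only the extensions from the question, the archive names are made up, and the option 2 variant compresses files in place (bzip2 replaces each file with file.bz2), so run it on a copy of the data.

# Option 1: zip can store, rather than compress, files whose
# suffixes appear in the colon-separated -n list.
zip -r -n .jpg:.mp4:.mkv:.webm home.zip /home

# Option 2: compress the compressible files in place first
# (destructive: bzip2 replaces each file with file.bz2),
# then tar the resulting tree.
find /home/copy -type f \
    ! \( -name '*.jpg' -o -name '*.mp4' -o -name '*.mkv' -o -name '*.webm' \) \
    -exec bzip2 {} +
tar cf home.tar /home/copy

# Option 3: a low compression level wastes less time on
# incompressible data (here xz -1 instead of the default -6).
tar cf - /home | xz -1 > home.tar.xz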

You might want to look at the results of parallelizing xz compression (not specific to tar files), to see what can be gained from trying to speed up xz, as published on my blog.
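
As a side note, newer xz releases (5.2 and later) can also parallelize on their own: -T0 starts one worker thread per CPU core. A minimal sketch, with a made-up archive name:

# Multi-threaded xz; this speeds things up but does not address
# the already-compressed-data issue discussed above.
tar cf - /home | xz -T0 > home.tar.xz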


The LZ4 algorithm could be an option.

It checks whether the beginning of a block is compressible and stores it uncompressed if the ratio is low. This successfully prevents compression of already-compressed files without the need to specify their names.

The overall compression ratio is lower compared to the algorithms you mention, but LZ4 is very fast: you can easily reach compression speeds of several hundred MiB/s and decompression speeds in the GiB/s range.

Examples:

# Compression (creates <inputfile>.lz4)
lz4c <inputfile>

# Decompression
lz4c -d <inputfile>.lz4

# Use with tar
tar cf - <directory> | lz4c > <directory>.tar.lz4

# Use with GNU tar
tar cf <directory>.tar.lz4 -I lz4c <directory>
  • Thanks. That was blazingly fast. Too bad there's no "--list" option.
    – RonJohn
    Commented Jan 23, 2015 at 9:30
  • --list is a command line option of tar, not of the compressor. And you're free to use tar with LZ4: tar f foo.tar.lz4 --list -I lz4c
    – Marco
    Commented Jan 23, 2015 at 9:59
  • sorry for the ambiguity. I was referring to there not being "--list" in lz4.
    – RonJohn
    Commented Jan 23, 2015 at 12:00

