
I've got a set of .tar.gz files, which are duplicity backup files (either full backups or incremental ones). I'd like to work out which directories take the most space in the backups. This will most probably be a different figure from the one for a live filesystem, because I need to account for how often files change (and therefore take up space on incremental backups) and how compressible they are.

I know that while many other archive formats store compressed files as separate entities inside the archive, .tar.gz files do not, so it is impossible to get the exact amount of storage a single file takes up in the archive after compression. Are there any tools that can calculate at least an estimate?

2 Answers


If you are interested in a particular file's size after compression, just compress the file with gzip once. That is the most straightforward method.

  • Well, I've got almost a terabyte of backups, and I'd like to compute these figures for every compressed file… that would take quite a lot of time.
    – liori
    Commented Nov 7, 2012 at 3:37
  • Take a full backup and dump it to some big empty disk. Then run gzip -r <top-level dump dir>. You can break the process into smaller chunks. It does take time, but you only do it once.
    – John Siu
    Commented Nov 7, 2012 at 3:58
  • I don't have such free space.
    – liori
    Commented Nov 7, 2012 at 8:27

So, I hacked together some C code to find approximate values. It shows how many bytes zlib had to read from the archive to get to each subsequent file. The code is here: https://github.com/liori/targz-sizes

It seems that I could extract more precise data, but these values shouldn't differ from the real ones by more than a few bytes per file, and the error averages out over all files, so it should be good enough for the purpose described in the question.
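In outline, the approach looks roughly like this (a stripped-down sketch rather than the actual targz-sizes source): inflate the archive one 512-byte tar block at a time and, at every header block, report how far zlib's total_in counter (compressed bytes consumed) has advanced since the previous header. It assumes plain ustar headers (no GNU long names, no prefix field, no base-256 sizes) and is meant only to illustrate the idea.

/* sketch.c -- approximate per-entry compressed sizes in a .tar.gz.
 * The compressed input consumed between two consecutive tar headers
 * is attributed to the earlier entry. Build with: cc sketch.c -lz */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <zlib.h>

#define BLOCK 512           /* tar block size */
#define INBUF (64 * 1024)   /* compressed read chunk */

/* Inflate exactly `want` bytes into `out`, refilling `inbuf` from `in`
 * as needed. Returns 0 on success, -1 on EOF or error. */
static int inflate_block(z_stream *zs, FILE *in, unsigned char *inbuf,
                         unsigned char *out, size_t want)
{
    zs->next_out = out;
    zs->avail_out = (uInt)want;
    while (zs->avail_out > 0) {
        if (zs->avail_in == 0) {
            zs->avail_in = (uInt)fread(inbuf, 1, INBUF, in);
            zs->next_in = inbuf;
            if (zs->avail_in == 0)
                return -1;              /* truncated archive */
        }
        int ret = inflate(zs, Z_NO_FLUSH);
        if (ret != Z_OK && ret != Z_STREAM_END)
            return -1;
        if (ret == Z_STREAM_END && zs->avail_out > 0)
            return -1;                  /* stream ended mid-block */
    }
    return 0;
}

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s backup.tar.gz\n", argv[0]);
        return 1;
    }
    FILE *in = fopen(argv[1], "rb");
    if (!in) { perror(argv[1]); return 1; }

    unsigned char inbuf[INBUF], block[BLOCK];
    z_stream zs;
    memset(&zs, 0, sizeof zs);
    if (inflateInit2(&zs, 15 + 32) != Z_OK)   /* 15+32: auto-detect gzip/zlib */
        return 1;

    uLong prev_in = 0;          /* total_in at the previous header */
    char prev_name[101] = "";

    while (inflate_block(&zs, in, inbuf, block, BLOCK) == 0) {
        if (block[0] == '\0')   /* end-of-archive padding */
            break;

        if (prev_name[0] != '\0')
            printf("%lu\t%s\n",
                   (unsigned long)(zs.total_in - prev_in), prev_name);
        prev_in = zs.total_in;
        memcpy(prev_name, block, 100);   /* entry name: offset 0, 100 bytes */
        prev_name[100] = '\0';

        /* Entry size is octal ASCII at offset 124; skip its data blocks. */
        unsigned long size = strtoul((char *)block + 124, NULL, 8);
        unsigned long nblocks = (size + BLOCK - 1) / BLOCK;
        for (unsigned long i = 0; i < nblocks; i++)
            if (inflate_block(&zs, in, inbuf, block, BLOCK) != 0)
                goto out;
    }
out:
    if (prev_name[0] != '\0')   /* flush the last entry */
        printf("%lu\t%s\n", (unsigned long)(zs.total_in - prev_in), prev_name);
    inflateEnd(&zs);
    fclose(in);
    return 0;
}

Summing the first column by directory prefix (with a small awk or similar aggregation step) then gives the per-directory totals asked about in the question.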

  • tar -xzvOf /pathto/backup.tgz ./inner/pathto/compressed/item | dd > /dev/null -- my dd (coreutils 5.97) prints the total bytes written as 3690 bytes (3.7 kB) copied, 0.00244849 seconds, 1.5 MB/s
    Commented Apr 28, 2015 at 13:28
  • @jimbobmcgee: You're measuring size of the unpacked file, not how many bytes it needs inside the compressed archive.
    – liori
    Commented Apr 28, 2015 at 14:30
  • Ah, I missed what you were after (it was subtly different to what I was after when I came here!). I guess then, for a rough estimate, the inverse might be close enough: tar -czvO /pathto/uncompressed/item | dd > /dev/null. Some tar overhead, but I guess that might be what you want. If not, substitute gzip -c for tar -czvO.
    Commented Apr 28, 2015 at 14:40
  • ...or the (slightly awkward) round-trip tar -xzvOf /pathto/backup.tgz ./inner/pathto/compressed/item | dd | gzip -c | dd > /dev/null...
    Commented Apr 28, 2015 at 14:59
  • @jimbobmcgee: …which is unfortunately still wrong, because (1) similar files placed next to each other in a tar archive help each other compress (a common case with e.g. source code), and (2) directory entries and empty files also take space in the archive, a variable amount depending on their neighbors. That's why I wrote this utility ;-)
    – liori
    Commented Apr 28, 2015 at 23:56
