
I would like to create a tar gzip archive, but in the reverse manner of what is most commonly done: have the files in the archive be compressed individually rather than compressing the entire archive. That way it retains the seekable property it should have. It makes much more sense to me, and I don't know why this has not been favored.

I have some ideas on how to do this:

However, ideally, I would like to continue to use tar for this, as it is a familiar, de facto tool for archiving where I work. tar has the --to-command switch, which allows piping extracted files to a program. If it had a symmetric switch such as --from-command, I could easily implement my wish with:

tar cf my_archive.tar file1 file2 --from-command=gzip
tar xf my_archive.tar --to-command=gunzip

My motivation comes from dealing with archives containing a large number of large files. I currently tar-gzip them, but then extracting any files from the archive takes a long time - it needs to be decompressed before tar can access the file, and it does so in a serial manner!

So here are my questions:

  • Is there an evident way to achieve this that I am disregarding?
  • Has anyone already written a tool to do this, specifically with tar?
  • If one would call tar and gzip the standard methods of archiving and compressing in Linux, what would be the equivalent, popular method for archiving with compression in the manner I mentioned above (i.e. not tar.gz)?
  • Is there another way I am overlooking to circumvent the large amount of time it takes to extract a file from a large tar-gzipped archive?

Thanks!

EDIT

I realize that I need to re-phrase and refine my question. Especially since, as Robin Hood pointed out, there are existing rather easy solutions to create compressed archives (namely, zip). So here it is:

Is there a way to use tar that allows true random access to the archive while still keeping it compressed? If not, is there another tar replacement for Linux (that is built with the same rationale and, ideally, with support for the same command-line options) that does achieve this?

Right now I can replace tar in a general sense with zip, by changing:

tar c path/to/file1 path/to/file2 | gzip > arc.tar.gz
gunzip < arc.tar.gz | tar x

to:

zip -qr - path/to/file1 path/to/file2 > arc.zip
unzip -qoX arc.zip

However, this has the disadvantage that it does not support all the options that tar does for archiving, namely:

  1. piping each extracted file individually into a pipe (the --to-command switch)
  2. unzip does not accept an archive on standard input. funzip does; however, it only outputs the first file in the archive

So it's rather limiting.

Thanks again!

  • As for the why: compression efficiency. If you compress all data in one go, you achieve a higher compression ratio. Compressing every file individually could, especially with many small files, produce output that is larger than the total input size.
    – Daniel B
    Commented Nov 2, 2014 at 13:23
  • Then again, it seems sensible to give the user the option to choose between compression efficiency and access efficiency. I guess this makes sense since tar was intended for use with tapes -- however it still supports seekable archives!
    – Yuval
    Commented Nov 2, 2014 at 14:00

2 Answers

1

I've read your question multiple times; it's very difficult to understand, but I think I've got it now. You want to have files put into individual tar archives, and then all stored in one gz archive. This won't work because a gz archive only supports compression of a single file, which is why people tar the files before compressing with gz. You could do the opposite: put each file into a gz archive, and then put all the gz archives into a single tar archive. Alternatively, you could just stop using formats that require double archiving, and use an archive format that supports multiple files, like zip.

Compressing the files inside a tar will still result in sequential access of the gz archives, because the tar format doesn't support random access. Zip archives use a centralized catalogue, so random file access is possible without decompressing or reading the entire archive. I don't do much archiving under Linux, but on Windows I like to use 7-Zip to create zip archives with LZMA compression. It's worth noting that either of these methods, when used with compression comparable to your tar.gz, will yield a larger archive due to the lack of solid compression, which is why tar.gz is very popular in the Linux world compared to zip for distributing software.
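As a rough illustration of the size trade-off described above, here is a minimal sketch that builds both layouts from the same throwaway files and prints their sizes. The file names, count and contents are made up for the demonstration; real data will behave differently, and with a handful of large files the gap is usually much smaller.

```shell
#!/bin/sh
# Compare solid compression (tar, then gzip) with per-file compression
# (gzip each file, then tar). Everything lives in a throwaway temp directory.
set -eu

work=$(mktemp -d)
mkdir "$work/plain" "$work/gz"

# Create 100 small, similar text files (illustrative data only).
i=1
while [ "$i" -le 100 ]; do
    printf 'record %d\nthe quick brown fox jumps over the lazy dog\n' "$i" \
        > "$work/plain/file$i.txt"
    i=$((i + 1))
done

# Solid: a single gzip stream over the whole tar archive.
tar -cf - -C "$work/plain" . | gzip -9 > "$work/solid.tar.gz"

# Per-file: gzip every file separately, then store the .gz files in a plain tar.
# The tar headers and padding stay uncompressed in this layout.
cp "$work/plain"/* "$work/gz/"
gzip -9 "$work/gz"/*
tar -cf "$work/perfile.tar" -C "$work/gz" .

solid_size=$(wc -c < "$work/solid.tar.gz" | tr -d ' ')
perfile_size=$(wc -c < "$work/perfile.tar" | tr -d ' ')
echo "solid tar.gz:     $solid_size bytes"
echo "tar of .gz files: $perfile_size bytes"
```

With many tiny files, most of the difference comes from the uncompressed tar headers and from each gzip stream being unable to reuse patterns seen in the other files.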

Create A Series Of GZ Archives And Store In A Tar Archive:

cp -a -n -v "/home/me/example/inputfiles/." --target-directory="/home/me/example/gzfiles"

This will copy the files you wish to archive to a different folder. By default gzip replaces each original file with its compressed copy (newer versions have a -k option to keep the original), so working from a copy lets you avoid losing the originals.

gzip -9 "/home/me/example/gzfiles"/*

This will create a separate gz archive of each file, and use max compression. If that is too slow on your system, try a lower number; the default is 6.

tar -cf "/home/me/example/tar/archive.tar" -C "/home/me/example/gzfiles" .

This will create a single tar archive that contains all the gz archives.
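The creation steps above can also be sketched without the intermediate copy, by having gzip write to standard output instead of replacing its input. This is only a sketch under assumptions: the temp directory layout and the two sample files are hypothetical stand-ins for the /home/me/example paths, and subdirectories are simply skipped.

```shell
#!/bin/sh
# Build a tar archive of individually gzipped files without copying the inputs.
set -eu

work=$(mktemp -d)
mkdir "$work/inputfiles" "$work/gzfiles"

# Hypothetical sample inputs standing in for the real files.
printf 'hello\n' > "$work/inputfiles/example1.txt"
printf 'world\n' > "$work/inputfiles/example2.txt"

# gzip -c writes the compressed data to stdout and leaves the original alone.
for f in "$work/inputfiles"/*; do
    [ -f "$f" ] || continue   # skip directories and other non-regular files
    gzip -9 -c "$f" > "$work/gzfiles/$(basename "$f").gz"
done

# Store the .gz files in a single uncompressed tar archive.
tar -cf "$work/archive.tar" -C "$work/gzfiles" .

# List the members to check the result.
listing=$(tar -tf "$work/archive.tar")
echo "$listing"
```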

Extract A Single File From A GZ Archive In A Tar Archive:

sudo apt-get install archivemount

This will install archivemount, a tool that can mount tar files to a directory.

archivemount -o readonly "/home/me/example/tar/archive.tar" "/home/me/example/mount"

This will mount the tar archive so that you can extract the desired gz archive. I believe it is possible to extract individual files from a tar archive with tar itself, but I don't know the command, which is why I'm using this approach.

gunzip -c "/home/me/example/mount/example1.txt.gz" > "/home/me/example/extract1/example1.txt"

This will extract the file. gunzip only supports extracting to the source directory or to standard output, so in this command we've used standard output and redirected it to a file.

sudo umount "/home/me/example/mount"

This will unmount the tar archive.
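For reference, tar can also pull a single member out by itself, without mounting anything. The following is a minimal sketch assuming an archive built as in the creation section, so member names carry a leading "./"; the temp paths and file contents are hypothetical. Note that tar still has to walk the member headers in order to find the one requested.

```shell
#!/bin/sh
# Extract one gzipped member from a tar archive and decompress it in a pipeline.
set -eu

work=$(mktemp -d)
mkdir "$work/gzfiles"

# Build a small sample archive of individually gzipped files.
printf 'first file\n'  | gzip -c > "$work/gzfiles/example1.txt.gz"
printf 'second file\n' | gzip -c > "$work/gzfiles/example2.txt.gz"
tar -cf "$work/archive.tar" -C "$work/gzfiles" .

# -x extracts, -O sends the member to stdout instead of writing a file.
# The member name must match the name stored in the archive ("./..." here).
tar -xOf "$work/archive.tar" ./example1.txt.gz \
    | gunzip -c > "$work/example1.txt"

extracted=$(cat "$work/example1.txt")
echo "$extracted"
```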

Extract All Files From A Series Of GZ Archives In A Tar Archive:

cd "/home/me/example/extractall"

This puts the terminal into the directory you want to extract to since tar extracts to the current directory.

tar -xf /home/me/example/tar/archive.tar

This extracts the gz archives.

gunzip *.gz

This extracts the contents of the gz archives to the current directory /home/me/example/extractall/ and removes the gz archives.
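The two extraction steps above can probably be folded into one with the --to-command switch mentioned in the question. This is a sketch that assumes GNU tar, which runs the command through /bin/sh for each regular file and exports the member name as TAR_FILENAME; other tar implementations may not support it. The sample archive and temp paths are hypothetical.

```shell
#!/bin/sh
# Extract and decompress every member in one pass using GNU tar's --to-command.
set -eu

work=$(mktemp -d)
mkdir "$work/gzfiles" "$work/extractall"

# Build a small sample archive of individually gzipped files.
printf 'first file\n'  | gzip -c > "$work/gzfiles/example1.txt.gz"
printf 'second file\n' | gzip -c > "$work/gzfiles/example2.txt.gz"
tar -cf "$work/archive.tar" -C "$work/gzfiles" .

cd "$work/extractall"

# For each regular file, tar pipes the member's data to the command's stdin.
# ${TAR_FILENAME%.gz} strips the .gz suffix to form the output file name.
tar -xf "$work/archive.tar" \
    --to-command='gunzip -c > "${TAR_FILENAME%.gz}"'

ls "$work/extractall"
```

No intermediate .gz files are written to disk this way, but the members are still processed one after another.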

Create A ZIP Archive:

cd "/home/me/example/inputfiles"

This puts the terminal into the inputfiles directory since zip creates an archive from the current directory, and saves to it.

zip -9 -r inputfiles.zip *

This will create a zip archive of the inputfiles directory contents using max compression. Hidden files at the top level are excluded, because the shell's * does not match names starting with a dot. p7zip would be a better tool to use if you need high compression.

mv "/home/me/example/inputfiles/inputfiles.zip" "/home/me/example/zip/archive.zip"

This will allow you to rename the archive to whatever you wish, and move it where you want.

Extract A ZIP Archive:

cd "/home/me/example/zip"

This puts the terminal into the directory containing the zip.

unzip -n archive.zip

This extracts the zip archive's contents to the current directory.

  • Thanks for your answer. However, I did not read beyond the first two paragraphs of your answer, for two reasons. First, you didn't understand my question correctly: I asked whether tar has an ability to compress the files while adding them to the archive. Second, you wrote incorrectly that tar doesn't support random access: see the --seek option.
    – Yuval
    Commented Nov 2, 2014 at 13:14
  • P.S. thanks for the lengthy explanation. zip is an excellent option for what I am looking for, and I might use it. Regarding extracting a single file from a gzip-compressed tar archive, the command would be: tar xfz archive.tgz path/in/archive/to/file
    – Yuval
    Commented Nov 2, 2014 at 13:20
  • @Yuval seek is an ability of the tar program to be used with archive formats that support random access; the tar archive format itself isn't designed for random access. ( en.wikipedia.org/wiki/Tar_%28computing%29#Random_access ) ( duplicity.nongnu.org/new_format.html#nottar ) It is possible to extract individual files from a tar, but this requires scanning the whole archive first to find them because there is no central catalogue. ( arstechnica.com/civis/viewtopic.php?f=16&t=409016 )
    – Robin Hood
    Commented Nov 2, 2014 at 20:57
1

If what you want is individually compressed files in an archive with random access, then dar ("Disk ARchive") may be what you are looking for. Newer versions support LZMA compression, the algorithm used by 7-Zip. It is also possible to define filters to store some file types uncompressed and save time, e.g. media files and archives that already have their own compression. My favorite feature is compressing existing (uncompressed) archives so I can quickly make a backup now and run the CPU intensive LZMA compression at a more convenient time or on a more powerful machine:

dar --empty-dir \
  --fs-root /home \
  --create home-backup-2016-01-11 \
  --prune lost+found

And then later and/or elsewhere:

dar -+ home-backup-2016-01-11-compressed-encrypted \
  -A home-backup-2016-01-11 \
  -zxz:6 \
  -K "aes:" \
  -an -ag -Z "*.mpg" -Z "*.avi" -Z "*.flac" -Z "*.cr2" \
  -Z "*.vob" -Z "*.jpg" -Z "*.jpeg" -Z "*.mpeg" -Z "*.png" \
  -Z "*.mp3" -Z "*.ogg" -Z "*.deb" -Z "*.tgz" -Z "*.tbz2" \
  -Z "*.rpm" -Z "*.xpi" -Z "*.run" -Z "*.sis" -Z "*.gz" \
  -Z "*.Z" -Z "*.bz2" -Z "*.zip" -Z "*.jar" -Z "*.rar" \
  -Z "*.xz" -Z "*.dar" -Z "*.7z" -acase

As shown above, encryption is also possible, all while still allowing extraction of individual files. However, dar does not seem to have an equivalent to tar's --to-command. It's hard to tell from your question whether you intended to use that feature for anything but decompression.

(Yes, I know this question is old. This is for the people who, like me, googled "tar compress individually" and got this as the first result.)
