13

I receive anywhere from 4 to 100 very large tar archive files (~20 GB each) every day. Until now I have been concatenating them by looping over each archive I find on the file system and running something like this:

/bin/tar --concatenate --file=allTars.tar receivedTar.tar

The problem with this, however, is that as I concatenate more and more tar files, tar must read to the end of allTars.tar before it can begin concatenating again. Sometimes it takes over 20 minutes just to start adding the next tar file. That is too slow, and I am missing an agreed-upon delivery time for the complete allTars.tar.

I also tried handing my tar command a list of files like so:

/bin/tar --concatenate --file=allTars.tar receivedTar1.tar receivedTar2.tar receivedTar3.tar...etc

This gave very odd results. allTars.tar had the expected size (i.e. close to the sum of the sizes of all the receivedTar.tar files), but files seemed to be overwritten when allTars.tar was unpacked.

Is there any way to concatenate all these tar files in one command, or at least without tar having to read to the end of the target archive every time, so that the result unpacks correctly with all files and data?

7
  • Can't you just make a new tar ball of tar balls (nested tar ball)? Rather than concatenating them. It'll make extraction very slow but this doesn't sound like your problem...
    – Colin
    Commented Jul 16, 2015 at 14:52
  • Unfortunately, no. Our client is very particular about how the tar ball unpacks.
    – Jeff Hall
    Commented Jul 16, 2015 at 14:56
  • Have you tried untarring each source file as its own background job with & (in parallel), then cat'ing before re-tarring and zipping? Going to need a huge swap file though!
    – Colin
    Commented Jul 16, 2015 at 15:03
  • How do you receive the files? By network? I'd reduce any unneeded copy operation, best would be to move each file to a directory tree on the same partition.
    – ott--
    Commented Jul 16, 2015 at 18:33
  • What version of tar are you using 1.28?
    – cybernard
    Commented Jul 17, 2015 at 4:19

4 Answers

15

This may not help you, but if you are willing to use the -i option when extracting from the final archive, then you can simply cat the tars together. A tar file ends with a header full of nulls, followed by more null padding up to the end of the record. With --concatenate, tar must go through all the headers to find the exact position of that final header so that it can start overwriting there.
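
To make the end-of-archive marker concrete, here is a minimal sketch that builds a tiny archive in memory and counts the all-zero 512-byte blocks at its end. It uses Python's standard tarfile module rather than GNU tar, so the amount of padding may differ from what your tar writes; make_tar and trailing_zero_blocks are helper names made up for this illustration.

```python
import io
import tarfile

BLOCK = 512  # tar archives are built from 512-byte blocks


def make_tar(files):
    """Build an uncompressed tar archive in memory from {name: bytes}."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, payload in files.items():
            info = tarfile.TarInfo(name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    return buf.getvalue()


def trailing_zero_blocks(data):
    """Count the all-zero 512-byte blocks at the end of an archive."""
    count = 0
    end = len(data)
    while end >= BLOCK and data[end - BLOCK:end] == bytes(BLOCK):
        count += 1
        end -= BLOCK
    return count


archive = make_tar({"a.txt": b"hello"})
print(len(archive), "bytes,", trailing_zero_blocks(archive), "trailing zero blocks")
```

Those trailing blocks are what ends up in the middle of the file when archives are simply cat'ed together.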

If you just cat the tars, you just have extra nulls between headers. The -i option asks tar to ignore these nulls between headers. So you can

cat receivedTar1.tar receivedTar2.tar ... >>alltars.tar
tar -itvf alltars.tar

Also, your tar --concatenate example ought to work. However, if a file with the same name appears in several of the tar archives, that file will be overwritten several times when you extract everything from the resulting tar.
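
The cat-then-ignore-zeros behaviour can be reproduced on a small scale with Python's standard tarfile module, whose ignore_zeros flag plays a role similar to tar's -i. This is only a sketch with tiny in-memory archives, not a test of GNU tar; make_tar and list_names are helper names invented here. The repeated a.txt stands in for the same-named-file case mentioned above.

```python
import io
import tarfile


def make_tar(files):
    """Build an uncompressed tar archive in memory from {name: bytes}."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, payload in files.items():
            info = tarfile.TarInfo(name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    return buf.getvalue()


def list_names(data, ignore_zeros):
    """List member names; ignore_zeros=True is the counterpart of tar -i."""
    with tarfile.open(fileobj=io.BytesIO(data), mode="r:",
                      ignore_zeros=ignore_zeros) as tar:
        return tar.getnames()


first = make_tar({"a.txt": b"from the first archive"})
second = make_tar({"b.txt": b"second", "a.txt": b"from the second archive"})

# Plain byte concatenation, which is what `cat first.tar second.tar` produces.
combined = first + second

# Without the flag, reading stops at the first end-of-archive marker.
print(list_names(combined, ignore_zeros=False))
# With the flag, the zero blocks between the archives are skipped.
print(list_names(combined, ignore_zeros=True))
```

On extraction, the later a.txt would replace the earlier one, which may be what looked like files being overwritten.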

15

This question is rather old, but I wish it had been easier for me to find the following information sooner. So if anyone else stumbles across this, enjoy:

What Jeff describes in the question is a known bug in GNU tar (reported in August 2008). Only the first archive (the one after the -f option) gets its EOF marker removed. If you try to concatenate more than 2 archives, the last archive(s) will be "hidden" behind end-of-file markers.

It is a bug in tar. It concatenates entire archives, including trailing zero blocks, so by default reading the resulting archive stops after the first concatenation.

Source: https://lists.gnu.org/archive/html/bug-tar/2008-08/msg00002.html (and following messages)

Considering the age of the bug I wonder if it will ever get fixed. I doubt there is a critical mass that is affected.

The best way to circumvent this bug could be to use the -i option, at least for .tar files on your file system.

As Jeff points out, tar --concatenate can take a long time to reach the EOF before it concatenates the next archive. So if you're going to be stuck with a "broken" archive that needs the tar -i option to untar, I suggest the following:

Instead of using tar --concatenate -f archive1.tar archive2.tar archive3.tar, you will likely be better off running cat archive2.tar archive3.tar >> archive1.tar, or piping to dd if you intend to write to a tape device. Also note that this could lead to unexpected behaviour if the tapes were not zeroed before new data is (over)written onto them. For that reason, the approach I am going to take in my application is nested tars, as suggested in the comments below the question.

The above suggestion is based on the following very small sample benchmark:

time tar --concatenate -vf buffer.100025.tar buffer.100026.tar
  real  65m33.524s
  user  0m7.324s
  sys   2m50.399s

time cat buffer.100027.tar >> buffer.100028.tar
  real  46m34.101s
  user  0m0.853s
  sys   1m46.133s

The buffer.*.tar files are all 100 GB in size, and the system was pretty much idle apart from each of the calls. The time difference is large enough that I personally consider this benchmark valid despite the small sample size, but you are free to make your own judgement, and you are probably best off running a benchmark like this on your own hardware.
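
If needing -i on extraction is the sticking point, one idea is to leave out each archive's trailing zero blocks while copying, so that the result carries a single end-of-archive marker at the very end. The following is a minimal, lightly tested sketch of that idea in Python with the standard tarfile module, not a GNU tar feature. data_end and merge_tars are hypothetical names, the 10240-byte record size is an assumption matching tar's usual default, and the offset arithmetic assumes ordinary members (it has not been checked against sparse files or unusual entry types).

```python
import tarfile

BLOCK = 512      # tar block size
RECORD = 10240   # assumed record size (20 blocks)


def data_end(path):
    """Return the offset just past the last member, before the zero blocks."""
    end = 0
    with tarfile.open(path, mode="r:") as tar:
        for member in tar:
            end = member.offset_data
            if member.isreg():
                # File data is padded up to a whole number of blocks.
                end += -(-member.size // BLOCK) * BLOCK
    return end


def merge_tars(sources, target):
    """Copy each source without its end marker, then write one marker."""
    with open(target, "wb") as out:
        for path in sources:
            remaining = data_end(path)
            with open(path, "rb") as src:
                while remaining > 0:
                    chunk = src.read(min(remaining, 1024 * 1024))
                    if not chunk:
                        break
                    out.write(chunk)
                    remaining -= len(chunk)
        out.write(bytes(2 * BLOCK))             # single end-of-archive marker
        out.write(bytes(-out.tell() % RECORD))  # pad to a full record


# Hypothetical usage:
# merge_tars(["archive1.tar", "archive2.tar", "archive3.tar"], "merged.tar")
```

Every byte still has to be copied once, so this is no faster than cat; the only difference is that the merged file should read to the end without -i.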

3
  • A workaround for this bug could be: for f in file1 file2..; do tar --concatenate --file=file0.tar $f; done
    – marius
    Commented Jul 17, 2016 at 12:14
  • Hi @marius - this is what the OP is already doing, and it takes too long because tar scans to the end of the archive before appending each file.
    – trs
    Commented Jul 17, 2016 at 18:39
  • Almost a year on from my original solution (cat n.tar >> 1.tar to append and tar -ixf 1.tar to extract) I thought there should be a better way of appending tar files without the need for -i when extracting. Had a play with sed - got it to remove NUL values from end of file but it turns out more smarts required. Anyone come up with a solution ... please post :-)
    – trs
    Commented May 15, 2017 at 21:14
0

As you have stated, the target archive file must be read to the end before the second source archive is appended to it. GNU tar has an -n option that instructs it to assume a file is seekable (remember that tar was designed for tape and stream archives, which are not seekable). GNU tar supposedly auto-detects whether a file is seekable by default; however, users such as yourself can try to make sure that tar skips reading each record's full content by adding the -n option explicitly:

tar -n --concatenate --file=target_file.tar  other_file.tar

I am unable to verify (at the time of writing) which versions of tar, if any, will perform as expected for this command. If other users are able to verify this solution, please comment below and I will update this answer accordingly.
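
The general idea of seeking past member data, rather than reading it, can at least be illustrated with Python's standard tarfile module. The sketch below says nothing about how any version of GNU tar behaves with -n; it only counts how many bytes get read while listing an in-memory archive that holds one 5 MB member. CountingBytesIO is a helper class made up for this example.

```python
import io
import tarfile


class CountingBytesIO(io.BytesIO):
    """In-memory file that records how many bytes were actually read."""

    def __init__(self, data):
        super().__init__(data)
        self.bytes_read = 0

    def read(self, size=-1):
        chunk = super().read(size)
        self.bytes_read += len(chunk)
        return chunk


# Build an archive holding a single 5 MB member.
payload = b"x" * (5 * 1024 * 1024)
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w") as tar:
    info = tarfile.TarInfo("big.bin")
    info.size = len(payload)
    tar.addfile(info, io.BytesIO(payload))

# List the members; "r:" means a plain, uncompressed tar on a seekable file.
source = CountingBytesIO(buf.getvalue())
with tarfile.open(fileobj=source, mode="r:") as tar:
    names = tar.getnames()

print(names, "-", source.bytes_read, "of", len(buf.getvalue()), "bytes read")
```

On a seekable file only the headers need to be read to find the end of the archive, which is the saving the -n option is hoped to provide.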

0

This isn't a one-line command, but if you can create a file at /usr/local/bin/tar_merger and make it executable, feel free to use this Python 3 script to merge the tars.

Put the tars you want to merge into a folder with the name of the final file, then do

tar_merger folder

It will create an output tar named after the folder and go through every tar in the folder, adding its files to the new one. For me it's working fine merging thousands of tars of about 1 GB each.

#!/usr/bin/env python3
# Merges many tars into a single one
# Wolfang Torres - [email protected]

from pathlib import Path
from tarfile import TarFile
from sys import argv


def cli(folder):
    """Merges all the tars in folder to a single tar"""
    folder = Path(folder)
    tar_path = folder.parent / f"{folder.name}.tar"
    n = 0
    # Sort so the archives are merged in a predictable order
    sub_tars = sorted(folder.glob("*.tar"))
    total = len(sub_tars)
    with TarFile(tar_path, "w", encoding="utf-8") as tar:
        for sub_path in sub_tars:
            n += 1
            print(f"Adding tar {n}/{total} {sub_path.name}")
            with TarFile(sub_path, "r", encoding="utf-8") as sub_tar:
                for file in sub_tar:
                    data = sub_tar.extractfile(file)
                    tar.addfile(file, data)
    print(f"Finished merging {n} Tars into {tar_path}")


if __name__ == "__main__":
    if len(argv) < 2 or argv[1] in ("-h", "--help"):
        print(f"{argv[0]} folder")
        print()
        print("Merges all the tars inside of `folder` to `folder.tar`")
    else:
        cli(argv[1])
