
I have some directories with over 100 GB of data. I'm trying to archive them into smaller volumes, e.g. 10 GB each, that are independent/standalone.

The problem is that if I use tar + split, the result is multiple tar parts that are not independent. I cannot extract files from just one of the parts unless I first cat/combine them all back into a single large file.

I've also tried using tar -c -L1000M ... to split volumes, but that doesn't work either, and there's a problem with long filenames getting truncated.

I tried star as well, but its split volumes don't seem to be independent either, while 7-Zip does not preserve Unix permissions.

The reason I wish to have independent split archives is safety: if one of the split files is corrupted, I can still retrieve data from the other archives. It is also much faster if I only want to extract specific files/folders, without needing to combine all the archives back into a single large volume.

How best do I achieve this? Thank you.


SOLUTION FOUND

I have found a solution using tar, as suggested by @Haxiel's answer. The answer has been posted below.

Note that there may still be a file or two lost if it crosses the boundary of a volume and you don't have the next volume available, but at least the separate volumes can be independently extracted even if the other parts are missing.

  • I've not found any tar programs that can really handle splitting like this. I've spent more time looking into this than I feel is sane. My advice is to create archives in manageably sized portions from the beginning. I am not posting this as an answer because I do not want it to be the Answer.
    – Ed Grimm
    Commented Mar 6, 2019 at 3:50
  • Thanks for sharing your experience. I wonder why these legacy tools haven't been enhanced to meet modern-day needs... Commented Mar 6, 2019 at 7:37
  • If you're comfortable porting C code, github.com/att/ningaui/blob/master/potoroo/tool/admeasure.c will do bin packing of files into sets of a specified size. The man page is at the end of the file. It depends on the AST library for I/O and may have other dependencies on parts of the ningaui project. It was used to create file lists that were then put into a cpio archive and written to tape. We limited tape files to 2 GB to optimize restore time. Commented Mar 6, 2019 at 10:52
  • Thanks a lot, @MarkPlotnick! I'll check out the source code to see if it's something within my capability to adapt or apply. Commented Mar 6, 2019 at 16:08
  • I have found a solution and have posted the answer details; hope it helps you too, @EdGrimm. Commented Mar 12, 2019 at 2:49

4 Answers


I have found a solution using tar, as suggested by @Haxiel's answer. The command used is like this:

tar -c -L1G -H posix -f /backup/somearchive.tar -F '/usr/bin/tar-volume.sh' somefolder

-L: Defines the size limit of each volume, here 1 GB

-H: Use the posix format, otherwise long filenames get truncated

-F: A volume script is needed to generate sequential archive file names for tar

This command will create a multi-volume archive in the format of somearchive.tar, somearchive.tar-2, somearchive.tar-3...
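As a self-contained illustration (all names and sizes here are invented for the demo, with a 1 MB volume limit standing in for the real 1 GB), the whole flow looks roughly like this:

```shell
#!/bin/bash
set -e
cd "$(mktemp -d)"

# Sample data: three 1 MB files, so a 1 MB volume limit forces several volumes
mkdir somefolder
for f in a b c; do
    dd if=/dev/zero of="somefolder/$f.bin" bs=1M count=1 2>/dev/null
done

# Minimal inline version of the volume-naming script (same idea as tar-volume.sh)
cat > tar-volume.sh <<'EOF'
#!/bin/bash
name=$(expr "$TAR_ARCHIVE" : '\(.*\)\(-[0-9]*\)$')
echo "${name:-$TAR_ARCHIVE}-$TAR_VOLUME" >&$TAR_FD
EOF
chmod +x tar-volume.sh

# Same shape as the command above, just with a 1 MB limit for the demo
tar -c -L1M -H posix -f somearchive.tar -F ./tar-volume.sh somefolder

# Shows somearchive.tar, somearchive.tar-2, ... in sequence
ls somearchive.tar*
```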

Below is my tar-volume.sh, adapted from this tutorial.

#!/bin/bash

# Called by tar at the end of each volume. tar passes context in the
# environment: $TAR_ARCHIVE (current name), $TAR_VOLUME (next volume number),
# $TAR_SUBCOMMAND (operation) and $TAR_FD (fd to write the next name to).
echo "Preparing volume $TAR_VOLUME of $TAR_ARCHIVE"

# Strip a trailing "-N" so volume numbers don't pile up (archive.tar-2-3...)
name=$(expr "$TAR_ARCHIVE" : '\(.*\)\(-[0-9]*\)$')

case "$TAR_SUBCOMMAND" in
-c)       ;;
-d|-x|-t) test -r "${name:-$TAR_ARCHIVE}-$TAR_VOLUME" || exit 1
          ;;
*)        exit 1
esac

# Tell tar the name of the next volume
echo "${name:-$TAR_ARCHIVE}-$TAR_VOLUME" >&$TAR_FD

To list the contents of say the 3rd archive volume:

tar -tf /backup/somearchive.tar-3

To extract a specific archive volume:

tar -xf /backup/somearchive.tar-3

Note that if you extract just a single volume, there may be an incomplete file that was split across the volume boundary at the beginning or end of the archive. Tar will create a subfolder called GNUFileParts.xxxx/filename which contains the incomplete file(s).

To extract the entire set of volumes in Unix, you'll need to run it through the volume script again:

tar -xf /backup/somearchive.tar -F '/usr/bin/tar-volume.sh'

If you are extracting them on Windows, the tar command cannot properly run the volume script, as that requires a bash shell. You'll need to feed the volume file names manually at the command line, starting with this command:

tar -xf somearchive.tar -M

-M indicates that this is a multi-volume archive. When tar finishes extracting the first volume, it'll prompt you to enter the name of the next volume, until all volumes are extracted.

If there are many volumes, you could type out all the volume names first, then copy and paste the entire batch into tar's prompt once the first volume has been extracted:

n somearchive.tar-2
n somearchive.tar-3
n somearchive.tar-4

Note the n in front, which is a tar prompt command indicating that the following argument is a new volume file name.
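If a bash shell is unavailable, another option with GNU tar is to pass several -f options up front, one per volume, for both creation and extraction, so it never needs to prompt or run a volume script. A self-contained sketch with invented names and a 1 MB demo limit:

```shell
#!/bin/bash
set -e
cd "$(mktemp -d)"

# Sample payload: 2 MB of random data split into ~1 MB volumes
mkdir data
dd if=/dev/urandom of=data/big.bin bs=1M count=2 2>/dev/null

# Creation: each -f names the next volume, so tar never has to ask
tar -c -M -L1M -f vol1.tar -f vol2.tar -f vol3.tar -f vol4.tar data

# Extraction: pass the same volume list in order, again prompt-free
mkdir restored && cd restored
tar -x -M -f ../vol1.tar -f ../vol2.tar -f ../vol3.tar -f ../vol4.tar

# Verify the round trip
cmp ../data/big.bin data/big.bin
```

A spare volume name is listed at the end; tar only uses as many volumes as the data actually requires.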

There may still be a file or two lost if it crosses the boundary of a volume and you don't have the next volume available, but at least the separate volumes can be independently extracted even if the other parts are missing.

For more information, please refer to the tar documentation.

  • If a file is split across two volumes, can I use the -M option to extract it by reading only those two volumes, or will it prompt for all the volumes?
    – alexis
    Commented May 25, 2021 at 14:47
  • Just in case: this does not work with the macOS tar
    – True
    Commented Aug 22, 2021 at 15:14

This is not a perfect solution, but GNU tar's multi-volume archives seem to be close to what you're looking for. This option is already mentioned in your question, but I would like to add a reference from the GNU tar manual that clarifies why this is a possible option:

Multi-volume archive is a single tar archive, stored on several media volumes of fixed size. Although in this section we will often call `volume' a tape, there is absolutely no requirement for multi-volume archives to be stored on tapes. Instead, they can use whatever media type the user finds convenient, they can even be located on files.

When creating a multi-volume archive, GNU tar continues to fill current volume until it runs out of space, then it switches to next volume (usually the operator is queried to replace the tape on this point), and continues working on the new volume. This operation continues until all requested files are dumped. If GNU tar detects end of media while dumping a file, such a file is archived in split form. Some very big files can even be split across several volumes.

Each volume is itself a valid GNU tar archive, so it can be read without any special options. Consequently any file member residing entirely on one volume can be extracted or otherwise operated upon without needing the other volume. Sure enough, to extract a split member you would need all volumes its parts reside on.

Multi-volume archives suffer from several limitations. In particular, they cannot be compressed.

With this definition, the only files that would be a problem are the ones that are split across the size boundary. Files that are fully contained within a single volume could be treated as independent of the other volumes.

For each volume, it is possible to identify the split files using the -v option.

$ tar -tf multi-test2.tar -v
M--------- 0/0          658432 1970-01-01 03:00 file1--Continued at byte 7341568--
-rw-r--r-- test/users 4000000 2019-03-06 12:12 file2

The files that are fully contained can be extracted as you would with a single archive. tar seems to complain about the split file being incomplete, but it is able to extract the complete files without any problems.

The split files can also be extracted as a single unit from multiple volumes by using the -M option, which will prompt you to provide the name of the next volume. The usage is documented here. Instead, if you prefer to concatenate the volumes to a single archive, you can consider the tarcat utility as well.

  • Thank you for the detailed explanation. I have experimented with this option as well, but the problem was the weird unpredictability of filename failures when archiving. For example, it produced this error "tar: xxxxx file name too long to be stored in a GNU multivolume header, truncated", stating the file name is too long (about 100 chars). But yet it managed to archive some files with longer paths / filenames (about 120+ chars). I extracted the archive to verify, and indeed some longer files were there, but some were missing. Commented Mar 6, 2019 at 15:58
  • @MongrelJedi GNU tar works with a few different archive types, some of which have limitations on the length of a filename. Can you check the docs here and see if specifying a different format with --format helps?
    – Haxiel
    Commented Mar 7, 2019 at 13:19
  • Thank you for the excellent suggestion, I tried adding -H posix parameter, and the filename issue appears to be gone. I was testing on a small directory, will proceed to test with the real data and see if it works fine. I also need to create a proper volume script to auto generate the volume names. If all works fine, I'll proceed to mark this as the answer and share the full list of commands and scripts used. Thanks! Commented Mar 7, 2019 at 15:00
  • @MongrelJedi It's great to know that this solution works for you; thanks for letting me know :-). I see that you've added the new information to your question. Perhaps you can add that as an answer instead? It's perfectly fine here on Stack Exchange to answer your own questions, and also to build on top of an existing answer. Your write-up is clear and to the point, so I think it really belongs as a proper answer to this question.
    – Haxiel
    Commented Mar 11, 2019 at 18:01
  • @alexis That statement is taken directly from the manual, so I don't have additional details. It could be an implementation problem - if a file is split between two archives, perhaps it is not feasible to bring in compression. In any case, the comment is about native capability. You can always use xz on the archives manually once they're created, but you'll most likely need to unxz them before you can give them back to tar.
    – Haxiel
    Commented May 25, 2021 at 15:26

(Writing as new answer because I cannot comment yet)

As True mentioned in a comment, this might not work with the macOS tar.

Just wanted to point out that you can install gnu-tar on macOS via Homebrew:

brew install gnu-tar

then use gtar instead of tar.


If you use

star -c tsize=1G ...

you get tar archives that are split in a way that makes them independent.

Be careful to specify enough f=filename options so that every volume is written to its own file. Since not all archives reach the full size, provide as many f= options as the data could possibly require.

