3

I have a very large folder that I am trying to create a tar archive of. The problem is that I don't have enough free space to store the entire archive, so I want to create chunks of, say, 100-200 GB at a time and transfer them individually to cloud storage. I need to control when each new chunk is created so my HDD doesn't fill up, but all of the commands I've found for creating split tarballs create every part at once, in the same directory.

The closest solution I found was from this question, but all the responses split the archive by number of files rather than by size, which matters for my use case because my file sizes are unevenly distributed.

5
  • 1
    Do you have SSH access to your cloud storage? If so, you can pipe a tarball directly over SSH which means it will never be stored locally. E.g. tar czvf - /path/to/directory | ssh [email protected] "cat > /path/to/destination.tar.gz"
    – jayhendren
    Commented Jan 7, 2021 at 23:10
  • @jayhendren unfortunately not. I tried mounting it as a filesystem locally, but that was incredibly slow. Commented Jan 7, 2021 at 23:41
  • This will not fit your needs exactly, but you can use it as a base to start from: multi_tar.sh creates standalone tarball archives based on a given target size. You could add an interactive wait to the script so you have time to move the chunks. It currently always creates 4 files at the same time, so you should limit each one to 50 GB. Usage: multi_tar.sh -L 52428800 archive.tar <dir1> [<dir2>...] (I might make the modifications for you and post them as an answer if no one comes up with a better solution.)
    – alecxs
    Commented Jan 7, 2021 at 23:43
  • @alecxs Thanks, I was able to get it partially working (though with it creating all the archives at once). I did, however, get these messages: "tar: SELinux support is not available" and "tar: XATTR support is not available", but I believe they're fine to ignore. Commented Jan 8, 2021 at 0:10
  • Those come from hardcoded flags for GNU tar; just ignore them. Search for 'while (( ${file_count:-0} > 4 ))'; that sets how many files are created at the same time. Set it to 1 and add read -p "press enter" below the next 'sleep 1'.
    – alecxs
    Commented Jan 8, 2021 at 8:29

3 Answers

4

You can use tar, with these options:

--new-volume-script=COMMAND
--tape-length=N

At the end of each volume, tar calls your script, which receives some environment variables that tell it which volume tar is working on. Check the manual page for the full list; TAR_VOLUME is particularly useful if you have to rename the output file or otherwise keep track of the current volume:

TAR_VOLUME

    Ordinal number of the volume tar is processing (set if reading a multi-volume archive).

If the script returns 0, tar will continue, otherwise it will stop.
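As a rough sketch of how the exit status can be used: the script below waits until the finished volume file has been moved away (for example, by a separate upload job) and gives up with a non-zero status after a timeout, which makes tar stop. The function name wait_until_gone and the one-hour timeout are made up for illustration; TAR_ARCHIVE is one of the variables GNU tar sets for the script.

```shell
#!/bin/bash
# Hypothetical helper: wait until PATH no longer exists, checking once per
# second. Returns 0 once it is gone, 1 if TIMEOUT seconds pass first.
wait_until_gone() {
    local path=$1 timeout=$2 waited=0
    while [[ -e $path ]]; do
        (( waited >= timeout )) && return 1
        sleep 1
        (( waited += 1 ))
    done
    return 0
}

# Only act when tar runs this file as its new-volume script.
if [[ -n ${TAR_ARCHIVE:-} ]]; then
    # A non-zero exit status tells tar to stop instead of starting a new volume.
    wait_until_gone "$TAR_ARCHIVE" 3600 || exit 1
fi
```

This kind of polling does not need a terminal, so it may also help in environments where the script cannot read from stdin.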

For example, this creates volumes with a maximum size of 20 MiB each, calling your script every time the limit is reached:

tar cvf /tmp/volume.tar /path/to/files/ --new-volume-script=/path/to/myscript.sh --tape-length=20M

The script can be as simple as echo "Next volume"; read, or it can perform the transfer itself. Either way, rename or move the finished volume first, because once the script exits, /tmp/volume.tar will be overwritten.
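A minimal sketch of the renaming step, assuming GNU tar: when the script runs, TAR_VOLUME holds the number of the volume tar is about to start, so the volume that was just finished is TAR_VOLUME minus one. The function name finish_volume and the ".N" suffix are hypothetical choices, not anything tar requires.

```shell
#!/bin/bash
# Hypothetical helper: move the volume tar has just finished to a numbered
# name so that tar can reuse the original file name for the next volume.
finish_volume() {
    local archive=$1 next_volume=$2
    local finished=$(( next_volume - 1 ))
    mv -- "$archive" "$archive.$finished"
}

# Only act when tar runs this file as its new-volume script.
if [[ -n ${TAR_ARCHIVE:-} && -n ${TAR_VOLUME:-} ]]; then
    # If the move fails, exit non-zero so tar stops rather than overwriting.
    finish_volume "$TAR_ARCHIVE" "$TAR_VOLUME" || exit 1
fi
```

With this approach the last volume would normally stay under the original name (/tmp/volume.tar in the example), since tar only calls the script when it needs another volume.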

When extracting, be sure to use the --multi-volume flag. If you don't, tar will stop with these errors (I leave them here in case somebody searches for them):

tar: Unexpected EOF in archive
tar: Error is not recoverable: exiting now

With the flag, tar prompts you for each new volume:

tar xvf /path/to/transferred.volume --multi-volume
Prepare volume #2 for /path/to/transferred.volume and hit return: 

Once you press Enter, /path/to/transferred.volume will be opened again, and so on.

4
  • 1
    completely forgot the origin of tar - tape archiving. thx for pointing out :)
    – alecxs
    Commented Jan 8, 2021 at 9:00
  • Wow this works perfectly! Wasn't built for my use case but works nonetheless! I found/modified a script from this GNU page that numbers the archive (Pasted in a separate answer due to space). Commented Jan 8, 2021 at 18:24
  • 1
    @alecxs actually I tried to solve it with cpio first :) cpio + a loopback device (for the virtual tape). It kind of worked, but the extraction had issues and it seems it cannot accommodate a file that's larger than a single tape. Thanks for the script @JoshHarrison. Commented Jan 8, 2021 at 18:42
  • 1
    Please keep in mind that this solution is based on vendor specific behavior of GNU tar and thus not portable. Other tar implementations use a different command line. See e.g. schilytools.sourceforge.net/man/man1/star.1.html for the oldest free tar implementation. BTW: cpio is outdated and definitely not recommended for new software.
    – schily
    Commented Jan 12, 2021 at 14:58
2

Following up on eduardo-trápani's excellent answer, here is a slightly modified version of a script found on a GNU page (myscript.sh, listed below). It waits for user input before each volume and retries if a volume is not found.

For completeness this is the command used to create the archive:

tar cvf /tmp/volume.tar /path/to/files/ --new-volume-script=./myscript.sh --tape-length=1000M

And this is the command I used to extract the split archive:

tar xvf /tmp/volume.tar --multi-volume --new-volume-script=./myscript.sh

myscript.sh:

#!/bin/bash
# For this script it's advisable to use a shell, such as Bash,
# that supports a TAR_FD value greater than 9.

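echo "Press enter to continue to next volume"

# Pause until the user confirms that the next volume may be started
read -r

echo "Preparing volume $TAR_VOLUME of $TAR_ARCHIVE."

# Strip a "-N" suffix added for a previous volume to recover the base name
name=`expr "$TAR_ARCHIVE" : '\(.*\)-.*'`
case $TAR_SUBCOMMAND in
-c)       ;;
-d|-x|-t) test -r "${name:-$TAR_ARCHIVE}-$TAR_VOLUME" || echo "Failed to find volume"
          ;;
*)        exit 1
esac

# Tell tar the name of the next volume
echo "${name:-$TAR_ARCHIVE}-$TAR_VOLUME" >&$TAR_FD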
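The naming logic can be tried in isolation. The sketch below is a hypothetical stand-alone variant (volume_name is not part of GNU tar or of the script above) that strips only a trailing "-<digits>" suffix. The expr pattern in the script cuts at the last hyphen anywhere in the path, so an archive such as /tmp/my-archive.tar would be shortened to /tmp/my; this variant avoids that, assuming the base name itself does not end in a hyphen followed by digits.

```shell
#!/bin/bash
# Hypothetical helper: print the file name for volume number $2 of the
# archive whose current name is $1.
volume_name() {
    local archive=$1 volume=$2 base=$1
    # Drop a trailing "-<digits>" suffix left by a previous volume, if any.
    if [[ $archive =~ ^(.*)-[0-9]+$ ]]; then
        base=${BASH_REMATCH[1]}
    fi
    printf '%s-%s\n' "$base" "$volume"
}

volume_name /tmp/volume.tar 2        # prints /tmp/volume.tar-2
volume_name /tmp/volume.tar-2 3      # prints /tmp/volume.tar-3
volume_name /tmp/my-archive.tar 2    # prints /tmp/my-archive.tar-2
```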

Edit: This only works with GNU tar, which can be installed on macOS with Homebrew:

brew install gnu-tar

Homebrew installs it as gtar. To use it as your default tar, add its gnubin directory to your PATH like so:

export PATH="$(brew --prefix)/opt/gnu-tar/libexec/gnubin:$PATH"
1
  • 1
    Thank you for mentioning that this is for GNU tar. There are too many people that confuse GNU tar with tar.
    – schily
    Commented Jan 12, 2021 at 14:59
0

I tried to use Josh Harrison's answer, which didn't work in my case.
I didn't have real SSH access to the server because it was managed hosting, so I'm using https://github.com/flozz/p0wny-shell to get something like a shell.

The problem is that p0wny-shell doesn't provide a stdin stream, so the read command didn't pause the script and the parts were still created one after another.

I modified the script so that it automatically moves the parts to the new server one by one:

  1. Create part
  2. Upload that part and delete it
  3. Repeat until all parts are created
  4. Upload the last part manually
  5. Unpack it on the remote server with the original myscript.sh (without the read, so it doesn't stop between the parts)

#!/bin/bash
# For this script it's advisable to use a shell, such as Bash,
# that supports a TAR_FD value greater than 9.

if [[ $TAR_SUBCOMMAND != '-c' ]]; then
  echo 'This script can only be used when creating an archive (-c option)'
  exit 1
fi

# $TAR_ARCHIVE per run:
# 1. archive.tar
# 2. archive.tar-2
# 3. archive.tar-3
# ...

# $TAR_ARCHIVE_NAME per run
# 1. <empty>
# 2. archive.tar
# 3. archive.tar
# ...
TAR_ARCHIVE_NAME=`expr "$TAR_ARCHIVE" : '\(.*\)-.*'`

# $TAR_ARCHIVE_BASE_NAME per run
# 1. archive.tar
# 2. archive.tar
# 3. archive.tar
# ...
TAR_ARCHIVE_BASE_NAME=${TAR_ARCHIVE_NAME:-$TAR_ARCHIVE}

if (( $TAR_VOLUME == 2 )); then
  # On the first run $TAR_VOLUME will be '2', we want to use the base name
  TAR_ARCHIVE_PREV_PART=$TAR_ARCHIVE_BASE_NAME
elif (( $TAR_VOLUME >= 3 )); then
  # On later runs, build the name from the previous volume number
  TAR_PREV_VOLUME=$(($TAR_VOLUME-1))
  TAR_ARCHIVE_PREV_PART=$TAR_ARCHIVE_BASE_NAME-$TAR_PREV_VOLUME
fi


echo "Copying $TAR_ARCHIVE_PREV_PART..."
# SSH key was previously created with `ssh-keygen -f ./id_rsa_user` and public key was added to remote
# Exit non-zero if the upload fails, so tar stops and the part is not deleted before it is copied
scp \
  -o StrictHostKeyChecking=no \
  -i '/usr/www/users/user/.ssh/id_rsa_user' \
  "$TAR_ARCHIVE_PREV_PART" \
  [email protected]:/home/user/path/to/target/ || exit 1


echo "Removing $TAR_ARCHIVE_PREV_PART..."
rm "$TAR_ARCHIVE_PREV_PART"


echo "Preparing volume $TAR_VOLUME of $TAR_ARCHIVE_BASE_NAME."
echo "$TAR_ARCHIVE_BASE_NAME-$TAR_VOLUME" >&$TAR_FD
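Step 4 (uploading the last part) could be scripted as well. Since tar only calls the new-volume script when it needs another volume, the final part is still on disk when tar exits. The sketch below lists whatever parts remain for a given base name; remaining_parts is a hypothetical helper, and it assumes no unrelated files match the "<base>-*" pattern.

```shell
#!/bin/bash
# Hypothetical helper: print every part still on disk for the archive base
# name $1 (the base name itself, or "<base>-<something>"), one per line.
remaining_parts() {
    local base=$1 part
    for part in "$base" "$base"-*; do
        # An unmatched glob stays literal and fails this test, so it is skipped.
        [[ -f $part ]] && printf '%s\n' "$part"
    done
    return 0
}
```

After tar has finished, the output of remaining_parts archive.tar can be fed to the same upload command the script uses, normally for a single file.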
