OSX 10.11 - python3.5 or AWS CLI (or other tool?)

I have ~5,000 subdirectories within an Amazon S3 bucket; each subdirectory contains a single .tar. Each .tar contains only one .zip, under 1 MB in size.

What I would like to do is run a script that will access each subdirectory within the S3 bucket and copy this .zip found within each .tar to either a given s3 location, or to a local destination.

Each .tar is ~10-15 GB when uncompressed, so extracting the full contents is neither feasible nor wanted. I believe the .tar header can be read instead, in order to locate the .zip and copy it.

Can you tell me of a way I can achieve this?

  • Tar files don't include file positions in the file header -- they are streams, and have to be scanned. In fact, the same file can appear more than once within a given tar file so technically they have to be scanned all the way to the end so that you get the last file of that path+name, which is usually what you want. The answer below will get the file you want, but the entire tar will still be read, even if it isn't extracted. Commented Jan 15, 2016 at 2:20

1 Answer


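To pull a single file called zipfile.zip out of the archive tarfile.tar, name the member by its path inside the archive and use -C to choose the directory it is written to:

tar xvf /path/to/tarfile.tar -C /path/to/where/you/want path/inside/archive/zipfile.zip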

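You could use Perl to recurse through the directories:

use strict;
use warnings;
use File::Find;

my @directories_to_search = ('/root/path/to/s3/dir/');
# Path of the zip as stored inside each archive (placeholder)
my $member = 'path/inside/archive/name-of-zip-inside-archive.zip';

finddepth(\&extract_zip, @directories_to_search);

sub extract_zip {
    return unless /\.tar$/;    # ignore everything except .tar files
    my $tarname = $File::Find::name;
    # File::Find has chdir'd into the tar's directory, so the zip is extracted there
    system('tar', 'xvf', $tarname, $member) == 0
        or warn "tar failed for $tarname\n";
}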

Something very close to the above should work (tested on El Capitan). One problem you might hit is that the zip filename differs between tar archives. If it does, you will need to find the name of the zip inside each tar before extracting (for example by listing the archive with tar tf), or match on a pattern such as *.zip instead.
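Since the question mentions python3.5, here is a minimal sketch of the pattern-match idea using the standard tarfile module. It reads the archive as a forward-only stream and writes out only members whose names end in .zip, so the large contents are read past but never extracted. The names extract_zips and make_sample_tar are made up for this illustration, and the sample archive is built in memory purely for the demo.

```python
import io
import os
import shutil
import tarfile
import tempfile


def extract_zips(fileobj, dest_dir):
    """Copy every *.zip member of a tar stream into dest_dir; return the paths written."""
    written = []
    os.makedirs(dest_dir, exist_ok=True)
    # "r|*" reads the archive strictly front to back, so fileobj only needs .read()
    with tarfile.open(fileobj=fileobj, mode="r|*") as tar:
        for member in tar:
            if not (member.isfile() and member.name.lower().endswith(".zip")):
                continue  # other members are read past, never written to disk
            # basename() drops the directory part stored in the archive
            target = os.path.join(dest_dir, os.path.basename(member.name))
            with tar.extractfile(member) as src, open(target, "wb") as dst:
                shutil.copyfileobj(src, dst)  # a later duplicate overwrites an earlier one
            if target not in written:
                written.append(target)
    return written


def make_sample_tar():
    """Build a tiny in-memory tar with one dummy data file and one zip (demo only)."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, payload in (("data/big.bin", b"\0" * 4096),
                              ("data/report.zip", b"PK demo")):
            info = tarfile.TarInfo(name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    buf.seek(0)
    return buf


if __name__ == "__main__":
    out_dir = tempfile.mkdtemp()
    print(extract_zips(make_sample_tar(), out_dir))
```

Because the function only needs an object with a read() method, it should in principle also accept a stream coming straight from S3 rather than a local file. With boto3, for instance, the "Body" returned by get_object is such an object, though that combination is an assumption here and has not been tested against a real bucket. The whole tar is still downloaded and scanned to the end in that case; only the zip is kept.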

  • accepted because it is a framework to start with. However, when using AWS s3 as the location of the .tar, streaming is involved which I believe requires an EC2 instance to coordinate, in order to read the information contained within the .tar.
    – bjmarra
    Commented Jan 21, 2016 at 19:51
