I have a single tar file containing about 19 million files (no folders):

0000107b869682826003b04a40e6394.txt
00029237482s8923789423ud8923892.txt
2c002y8378723887292377a79237649.txt
f598238209237408238742308374038.txt

How do I untar all the files so that they end up in subdirectories named after the first four characters of each file name? For the example above, it would create the directories 0000, 0002, 2c00, and f598, and each would contain the corresponding file:

0000\0000107b869682826003b04a40e6394.txt
0002\00029237482s8923789423ud8923892.txt
2c00\2c002y8378723887292377a79237649.txt
f598\f598238209237408238742308374038.txt

I've already tried a script that goes through the files in the tar, creates a directory for each one, and extracts that file from the tar into the directory. This works for a small number of files, but when the tar has millions, extracting takes a really long time.
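
My script does something along these lines (a simplified sketch; file.tar stands in for the real archive name):

# simplified per-file loop - file.tar is a placeholder for the actual archive
tar -tf file.tar | while read -r name; do
    dir=${name:0:4}                        # first four characters of the file name
    mkdir -p "$dir"
    tar -xf file.tar -C "$dir" "$name"     # extract this one member into its directory
done

Each of those tar -xf calls has to scan the archive from the start to find the requested member, so the total work grows roughly quadratically with the number of files.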

2 Answers

With GNU tar you can use --transform, which takes an s command with sed syntax. I switched from s/// to s||| so the / in the replacement doesn't need escaping.

tar -xvf file.tar --transform 's|\(....\).*|\1/&|' --show-transformed-names
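
To preview the result without extracting anything, the same transform can be combined with a listing - as far as I know --show-transformed-names also applies in -t mode:

tar -tf file.tar --transform 's|\(....\).*|\1/&|' --show-transformed-names | head

This should print the first few member names with the four-character directory prefix already prepended.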
  • LOL - a more elegant solution indeed :-D But talking about 19mil files it's still going to take time :-) Commented Sep 3, 2021 at 22:02

I created a test tarball, no directories:

pg@TREX:~/test$ tar -tvf test.tar | rev | cut -c -8 | rev
0001.txt
0002.txt
0003.txt
0004.txt
0005.txt
0011.txt
0012.txt
0013.txt
0014.txt
0015.txt
0021.txt
0022.txt
0023.txt
0024.txt
0025.txt

Then I ran this script (tartest.sh):

#!/bin/bash

# extract everything into the current directory first
tar -xf tarfile.tar
# collect the unique 3-character prefixes of the extracted files
i=$(ls *.txt | cut -c -3 | sort | uniq)
echo "$i" >> directory_list
# unquoted $i is intentional: word splitting creates one directory per prefix
mkdir $i
# move each file into the directory matching its prefix
while read -r line; do mv "$line"*.txt "$line"/; done < directory_list

Result:

pg@TREX:~/test$ tree
.
├── 000
│   ├── 0001.txt
│   ├── 0002.txt
│   ├── 0003.txt
│   ├── 0004.txt
│   └── 0005.txt
├── 001
│   ├── 0011.txt
│   ├── 0012.txt
│   ├── 0013.txt
│   ├── 0014.txt
│   └── 0015.txt
├── 002
│   ├── 0021.txt
│   ├── 0022.txt
│   ├── 0023.txt
│   ├── 0024.txt
│   └── 0025.txt
├── directory_list
├── tartest.sh
└── test.tar

I'm sure this'll take a bit of time with 19 million files, and I'm sure more elegant solutions exist... but it seems to do what you asked :-)

  • This could fail with 19 million files: ls *.txt
    – Cyrus
    Commented Sep 3, 2021 at 22:23
  • Yeah, if the files don't all have a .txt extension as described in the OP's question. However, it should work even without the extension, just creating a directory "tar" and moving the tarfile itself there - along with any other files beginning with those characters :-) Or do you see a problem with using ls in the first place? Anyway, I prefer your one-liner - I didn't even know sed could be tied to tar that way :-) Commented Sep 3, 2021 at 22:33
  • I assume that if bash replaces *.txt with 19 million file names, it will then report "argument list too long".
    – Cyrus
    Commented Sep 3, 2021 at 22:44
  • Oh, that didn't occur to me at all. And in keeping with Mr Murphy's legacy, that'd be bound to happen around file 18 699 453 :-D Commented Sep 3, 2021 at 22:59
  • You can just replace it with ls | egrep '\.txt$' | ... (see the sketch after these comments)
    – slebetman
    Commented Sep 4, 2021 at 6:56
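
Putting slebetman's suggestion together with the script above, a variant that avoids the huge glob might look like this (a rough sketch, not tested at the 19-million-file scale; use cut -c -4 for the question's four-character prefixes):

#!/bin/bash

# pipe the listing instead of expanding a 19-million-file glob,
# so ls *.txt can't hit "argument list too long"
ls | egrep '\.txt$' | cut -c -3 | sort -u > directory_list
# one directory per prefix; xargs batches the arguments as needed
xargs mkdir -p < directory_list
# the per-prefix globs stay comparatively small, so mv is safe here
while read -r prefix; do mv "$prefix"*.txt "$prefix"/; done < directory_list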
