I have a large tar file containing millions of files. For efficiency reasons I don't want to untar the files to disk.

Rather, given a desired filename, I would like to write a script (e.g. in Python) that pulls the relevant chunk of data directly from the tar file.

Is there an easy way to create an index giving the starting byte and length of every file in the tar file, which I could dump to disk for use by the above-mentioned Python script?

Maybe the tar command can do this but I'm not seeing anything obvious in the man page.

The tar is not compressed.

Thanks in advance.

  • Why not use zip/7z/xz/etc which have indexes? Commented Jun 28, 2020 at 14:15

2 Answers

Python code does not perform very well for this. For a big tar file I use the awk script below.

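tar -tvf <tar-file> -R | awk '
BEGIN {
  # Read the first record; with -R every line starts with "block N:".
  getline;
  f = $8;   # file name (only the first word if the name contains spaces)
  s = $5;   # size in bytes
}
{
  # $2 is the block number of the next header. Subtracting the previous
  # size, rounded up to a multiple of 512, gives that file data offset.
  offset = int($2) * 512 - int((s + 511) / 512) * 512
  print offset, s, f;
  f = $8;
  s = $5;
}'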
  • What awk is this for? gawk 5.0.0 errors with: and: argument 1 negative value -512 is not allowed
    – rakslice
    Commented Nov 29, 2021 at 0:24
  • ah, support for negative operands was removed in 4.2: gnu.org/software/gawk/manual/html_node/…
    – rakslice
    Commented Nov 29, 2021 at 0:27
  • Evidently from the bitwise and this is already assuming the number representation being used is two's complement, so I just replaced -512 with compl(512)+1 to get something working in gawk 5 with no worse assumptions :D
    – rakslice
    Commented Nov 29, 2021 at 1:07
  • tar -tvf <tar-file> | awk '{printf("%d %d %s", N+512, $3, $6); for (i = 7; i <= NF; i++) { printf(" %s", $(i)) }; printf("\n"); N += 512 + $3 + (512 - ($3 % 512)) % 512}' This should be compatible with busybox (note the lack of -R), and it accounts for file names containing whitespace (but not newlines). Commented Dec 24, 2023 at 19:10
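
The block arithmetic that the awk scripts above rely on can be checked in isolation. The following is a minimal Python sketch, not part of either script: the names `padded_size`, `data_offset` and `index_from_sizes` are made up for illustration, and `index_from_sizes` assumes every member has exactly one 512-byte header (no long-name or pax extension headers).

```python
BLOCK = 512  # tar archives are laid out in 512-byte blocks


def padded_size(size):
    """Round size up to the next multiple of BLOCK (0 stays 0)."""
    return (size + BLOCK - 1) // BLOCK * BLOCK


def data_offset(next_header_block, size):
    """Data offset of a member, given the block number of the NEXT
    header (as printed by `tar -R`) and the member's size in bytes."""
    return next_header_block * BLOCK - padded_size(size)


def index_from_sizes(sizes):
    """Yield (data_offset, size) pairs from sizes alone.

    Assumption: each member is preceded by exactly one header block.
    """
    pos = 0  # offset of the current member's header
    for size in sizes:
        yield pos + BLOCK, size
        pos += BLOCK + padded_size(size)
```

For example, a 700-byte file at the start of the archive occupies one header block plus two data blocks, so the next header sits at block 3 and `data_offset(3, 700)` gives 512. Sizes that are already a multiple of 512, including 0, need no padding.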

For the benefit of others with a similar use case (i.e. wanting to build an index that enables random access into a tar file): in the end I adapted a handy utility from http://fomori.org/blog/?p=391, the essence of which is (in Python):

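import tarfile

tarfname = 'archive.tar'  # placeholder: path to the uncompressed tar file

ctr = 0
with open('index.txt', 'w') as fp, tarfile.open(tarfname, 'r') as db:
    for tarinfo in db:
        # One tab-separated record per member: counter, data offset, size, name
        rec = "%d\t%d\t%d\t%s\n" % (ctr, tarinfo.offset_data, tarinfo.size, tarinfo.name)
        fp.write(rec)
        ctr += 1
        if ctr % 1000 == 0:
            # Drop the cached TarInfo objects so memory use stays bounded
            db.members = []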

Clearing db.members every 1000 entries conserves RAM, because tarfile otherwise keeps a TarInfo object for every member it has read. I'm sure this could be neater.
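
To round this off, here is a minimal sketch of how such an index might be used for random access. It assumes the tab-separated layout written above (counter, data offset, size, name) and an uncompressed tar; the function names `load_index` and `read_member` are invented for this example and are not part of tarfile.

```python
def load_index(index_path):
    """Return a dict mapping member name -> (data offset, size)."""
    index = {}
    with open(index_path) as fp:
        for line in fp:
            # maxsplit=3 keeps names that themselves contain tabs intact
            _ctr, offset, size, name = line.rstrip("\n").split("\t", 3)
            index[name] = (int(offset), int(size))
    return index


def read_member(tar_path, index, name):
    """Read one member's bytes straight from the tar, without unpacking."""
    offset, size = index[name]
    with open(tar_path, "rb") as fp:
        fp.seek(offset)
        return fp.read(size)
```

Usage would look like `index = load_index('index.txt')` followed by `data = read_member(tarfname, index, 'some/file.txt')`, where the member name is whatever was stored in the archive. With millions of entries the dict itself takes noticeable memory, so a database or a sorted index file may be a better fit; that is a design choice this sketch leaves open.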
