Method to get file byte offsets (and lengths) in tar files

Question

I have a large tar file containing millions of files. For efficiency reasons I don't want to untar the files to disk.

Rather, given a desired filename, I would like to write a script (e.g. in Python) that pulls the relevant chunk of data straight from the tar file.

Is there an easy way to create an index giving the starting byte and length of every file in the tar file, e.g. one I could dump to disk for use by the above-mentioned Python script?

Maybe the tar command can do this but I'm not seeing anything obvious in the man page.

The tar is not compressed.

Thanks in advance.

Why not use zip/7z/xz/etc which have indexes?
– user1133275
Commented Jun 28, 2020 at 14:15

Congbin Guo · Accepted Answer · 2018-04-16 19:41:04Z


Python code does not perform very well for this. I use the awk script below on a big tar file.

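tar -tvf <tar-file> -R | awk '
BEGIN{
  # With -R, GNU tar prefixes each entry with "block N:", so
  # $2 is the header block number, $5 the size and $8 the name.
  getline;
  f=$8;
  s=$5;
}
{
  # Data of the previous entry ends where this header starts. Round s up to
  # a multiple of 512; same result as and(s+511, -512), which gawk >= 4.2 rejects.
  offset = int($2) * 512 - int((s + 511) / 512) * 512
  print offset,s,f;
  f=$8;
  s=$5;
}'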
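
The arithmetic in the script above can be sketched in Python for checking individual values. This is only an illustration: the function names `padded_size` and `data_offset` are made up here, and it assumes each member's data is immediately followed by the next header, as in a plain uncompressed tar.

```python
BLOCK = 512  # tar archives are organised in 512-byte blocks


def padded_size(size: int) -> int:
    """Round size up to a multiple of 512, same as (size + 511) & -512."""
    return (size + BLOCK - 1) // BLOCK * BLOCK


def data_offset(next_header_block: int, size: int) -> int:
    """Byte offset of a member's data, given the block number of the NEXT header."""
    return next_header_block * BLOCK - padded_size(size)


if __name__ == "__main__":
    # A 10-byte file whose header is block 0: the next header is block 2.
    print(data_offset(2, 10))   # 512
    print(padded_size(1024))    # 1024 (already aligned, no extra block)
```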


What awk is this for? gawk 5.0.0 errors with: and: argument 1 negative value -512 is not allowed
– rakslice
Commented Nov 29, 2021 at 0:24
ah, support for negative operands was removed in 4.2: gnu.org/software/gawk/manual/html_node/…
– rakslice
Commented Nov 29, 2021 at 0:27
Evidently from the bitwise and this is already assuming the number representation being used is two's complement, so I just replaced -512 with compl(512)+1 to get something working in gawk 5 with no worse assumptions :D
– rakslice
Commented Nov 29, 2021 at 1:07
tar -tvf <tar-file> | awk '{printf("%d %d %s", N+512, $3, $6); for (i = 7; i <= NF; i++) { printf(" %s", $(i)) }; printf("\n"); N += int(($3 + 511) / 512) * 512 + 512}' Should be compatible with busybox (note the lack of -R), and handles file names containing whitespace (but not newlines).
– Alexander Tumin
Commented Dec 24, 2023 at 19:10
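
The same cumulative calculation (one 512-byte header, then the data padded to a whole number of blocks) can also be done directly on the archive. Below is a minimal Python sketch, not a full tar parser: `walk_headers` is a hypothetical name, and it assumes plain ustar-style headers with octal size fields, so it ignores the ustar prefix field, pax and GNU long-name extension records, and base-256 sizes.

```python
def walk_headers(tar_path):
    """Yield (data_offset, size, name) by reading each 512-byte header in turn."""
    with open(tar_path, "rb") as tar:
        pos = 0
        while True:
            header = tar.read(512)
            if len(header) < 512 or header == bytes(512):
                break  # zero block marks the end of the archive (or file is truncated)
            # Name: bytes 0-99, NUL-terminated. Size: bytes 124-135, octal text.
            name = header[0:100].split(b"\0", 1)[0].decode("utf-8", "replace")
            size = int(header[124:136].split(b"\0", 1)[0].strip() or b"0", 8)
            yield pos + 512, size, name
            # Skip the header plus the data rounded up to a multiple of 512.
            pos += 512 + (size + 511) // 512 * 512
            tar.seek(pos)
```

For example, `for offset, size, name in walk_headers("archive.tar"): print(offset, size, name)` prints one index line per member, where `archive.tar` is a placeholder path.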


youfu · Accepted Answer · 2023-06-30 14:33:04Z

For the benefit of others with a similar use case (i.e. wanting to build an index that enables random access to a tar file): in the end I adapted a handy utility from http://fomori.org/blog/?p=391, the essence of which is (in Python):

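import tarfile

tarfname = 'archive.tar'  # placeholder: path to your uncompressed tar file

with open('index.txt', 'w') as fp, tarfile.open(tarfname, 'r') as db:
    for ctr, tarinfo in enumerate(db):
        # offset_data is the byte offset of the member's data within the archive
        rec = "%d\t%d\t%d\t%s\n" % (ctr, tarinfo.offset_data, tarinfo.size, tarinfo.name)
        fp.write(rec)
        # tarfile remembers every TarInfo it has read; drop them periodically
        if (ctr + 1) % 1000 == 0:
            db.members = []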

Clearing db.members every 1000 entries conserves RAM, because tarfile otherwise keeps a TarInfo object for every member it has read. I'm sure this could be neater.
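
Once the index exists, pulling a single file out of the archive is a seek followed by a read. Here is a minimal sketch of that lookup, assuming the tab-separated layout written above (counter, data offset, size, name); `load_index` and `read_member` are hypothetical helper names, not part of tarfile.

```python
def load_index(index_path):
    """Map member name -> (data offset, size) from the tab-separated index."""
    index = {}
    with open(index_path) as fp:
        for line in fp:
            # Split at most 3 times so tabs inside a name stay in the name.
            _ctr, offset, size, name = line.rstrip("\n").split("\t", 3)
            index[name] = (int(offset), int(size))
    return index


def read_member(tar_path, index, name):
    """Return the raw bytes of one member without scanning the archive."""
    offset, size = index[name]
    with open(tar_path, "rb") as tar:
        tar.seek(offset)
        return tar.read(size)
```

With millions of entries the dictionary itself takes noticeable memory, so a sorted file with binary search or a small database may suit better; that is a design choice outside what this sketch covers.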
