
We use tar.bz2 as the archive format for our server logs, and we write tools that selectively parse these archived logs. Often I must regex-search the lines of one file in an archive to determine whether other logs in the same archive are relevant to the parsing run. (I have a regex that matches the file name/path.) From a performance standpoint I think I am hitting some limits, but it could be that I have a gap in my knowledge. I mostly script in Python and have some basic bash skills. The archives are large and stored on a network mount. I would like to avoid as much reading and local/temp storage as I can, especially when an archive does not qualify for full parsing.

Option 1 (wastes bandwidth and CPU, saves local storage):

  1. Read the whole .bz2 file to local disk.
  2. Decompress the tar as I scan the file list.
  3. Decompress again to search the first log file.
  4. If this archive qualifies, decompress again to extract the log that I must parse.
  5. Move on to the next archive.

Or (wastes local storage, wastes less bandwidth):

  1. Read the whole .bz2 file to local disk.
  2. Extract every file that meets a potentially interesting criterion (which would mean taking almost everything).
  3. Now every file is on my local file system; scan the first log.
  4. If it qualifies, move on to the log that I must parse.
  5. Delete all the local files and move on to the next archive.

When I research compression tools like 7zip, zip, rar, bz2… most links give me information about compression speed and compressed size. I would like to use something like 7zip because the compressed size matters long term, but that is not the basis of my question. I also ‘think’ that zip can expose the full file list and extract one file without decompressing the whole archive (because the file list is in headers…), but zip is not very native on Linux.
Is there a way to optimize the process using the existing tar.bz2? What tools / methods should I consider? (Switch out tar? Use 7zip?)
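One way to optimize the existing tar.bz2 workflow without changing the archive format is a single streaming pass: Python's tarfile in stream mode (`"r|bz2"`) decompresses once, in archive order, with no temp files. A minimal sketch, assuming the small log is stored in the archive before the large one, and with hypothetical file-name and line patterns (`summary.log`, `messages.log`, `ERROR 42` stand in for your real regexes):

```python
import re
import tarfile

# Hypothetical patterns: adjust to your real layout.
FIRST_LOG_RE = re.compile(r".*/summary\.log$")  # the small log to regex-search
QUALIFY_RE = re.compile(r"ERROR 42")            # line pattern that qualifies the archive
BIG_LOG_RE = re.compile(r".*/messages\.log$")   # the large log to parse if qualified


def parse_big_log(stream):
    # Stand-in for your real parser; here it just counts lines.
    return sum(1 for _ in stream)


def scan_archive(path):
    """Single streaming pass over a tar.bz2: decompress once, no temp files."""
    qualified = False
    # "r|bz2" = stream mode: members are yielded in archive order, which also
    # works well when reading sequentially from a slow network mount.
    with tarfile.open(path, "r|bz2") as tar:
        for member in tar:
            if not qualified and FIRST_LOG_RE.match(member.name):
                f = tar.extractfile(member)
                if f and any(QUALIFY_RE.search(raw.decode("utf-8", "replace"))
                             for raw in f):
                    qualified = True
            elif qualified and BIG_LOG_RE.match(member.name):
                f = tar.extractfile(member)
                if f:
                    parse_big_log(f)  # parse directly from the stream
                return True
    return qualified
```

If the large log can precede the small one in the tar, this needs a second streaming pass (or the index.lst trick from the answer below); the sketch assumes the qualifying log comes first.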

  • dar is an alternative to tar that allows direct file access, among other things. Not sure if it's applicable to your situation though. Commented Dec 4, 2013 at 16:52

1 Answer


Zip is not native to Linux, but if you have the source you probably should not care.

On the other hand, 7zip and xz compress better, and compressing a tar of many entries with similar data yields better ratios than zip, which essentially compresses one file at a time. Per-file compression is what lets zip recover when one file is broken (due to corruption), whereas a compressed tar archive often has more recovery problems and/or more unrecoverable data.

If you have a chance to change how the compressed bz2 files are generated (likely, or you would not be asking), do the following instead of generating the tar.bz2:

  • generate an index.lst using find <list_of_files_to_archive> > index.lst
  • generate a tar.xz from index.lst + list_of_files_to_archive, with index.lst as the first member
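The two steps above can be sketched as a short script; this is a sketch assuming GNU tar built with xz support, with `./data` standing in for the real log directory:

```shell
#!/bin/sh
set -e
mkdir -p data && printf 'demo\n' > data/app.log  # stand-in for real logs

find data -type f > index.lst           # 1. build the index
tar -cJf logs.tar.xz index.lst data     # 2. index.lst becomes the first member
tar -tJf logs.tar.xz | head -n 1        # prints: index.lst
```

Because tar stores members in the order they are given, putting index.lst first on the command line guarantees it sits at the front of the archive, where it is cheapest to reach.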

That way you can quickly extract the index.lst file without decompressing the whole archive and determine from the contents of index.lst whether you have the right archive. I am not sure whether standard tar stops after extracting index.lst (there could be another one in the archive), so use Python's tarfile module to make sure you stop after extracting it (and you can parse index.lst in memory without storing it on disk, an extra speedup).

  • Thanks. Follow-up: the index.lst file gets me the list of contents without the cost of a full decompress, which is good. But I still do not know enough to determine whether this is an interesting archive. I only have a regex for the path of a log that needs to be parsed. The result of that first parse determines whether I need to fetch the much larger message log from the same archive. I do not know the first log's name at compression time, or I would add it like index.lst. I think file-by-file compression may be good for my needs? The bulk of the bytes are in one file anyway. Commented Dec 4, 2013 at 16:52
  • If the archive is a few large files, then file-by-file compression would be better: zip or 7zip can then skip straight to where a member's compressed data starts. My answer works best if you have to extract all files from the archive based on the name of one of the files.
    – Zelda
    Commented Dec 4, 2013 at 17:03
  • Accepting this answer (mostly due to the information in Zelda's most recent comment). I will spend the time to work up an example with 7zip and report back only if my results are worse than tar.bz2. Love the index.lst as a way to retrofit my tar.bz2 files. Commented Dec 4, 2013 at 17:54
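The random-access behaviour Zelda describes for per-file formats can be sketched with Python's zipfile module, which reads the member list from the zip's central directory and decompresses only the member you ask for. The archive path and name pattern here are hypothetical:

```python
import re
import zipfile


def find_and_read(zip_path, name_pattern):
    """List a zip's contents from its central directory, then decompress
    only the first member whose path matches name_pattern."""
    pattern = re.compile(name_pattern)
    with zipfile.ZipFile(zip_path) as zf:
        for name in zf.namelist():       # cheap: read from the directory, no decompression
            if pattern.match(name):
                with zf.open(name) as f:  # decompress just this one member
                    return f.read()
    return None
```

This is the property the question was after: with a tar.bz2 the whole stream must be decompressed to reach a late member, while a zip (or 7z) can seek straight to it.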
