
Is there a simple Unix command line I can enter which lets me isolate, say, 512 bytes either side of a search term, even if there is only one "line" in a very large text file?

Ok, this should be easy.

Famous last words.

I'm not that familiar with grep, but it seems it is mainly used to filter out lines in the input that contain search terms.

I have a very large JSON file that I downloaded that I want to search for a particular term.

Before you click the link - it's over 244 MB, so be warned - it is from the Internet Wayback Machine and contains lists of zip files of archived photos. I am trying to find mine.

Their web interface is broken, so I found the JSON file that they make public here - it's the last one on the list.

When I grep for my username, it finds it, but proceeds to dump that line to the console. The problem is that the line is 244 MB long, and it's the only line in the file.

I tried using less, but could not get that to do much - it's very slow, and seems to have the same issue.

Is there a simple Unix command line I can enter which lets me isolate, say, 512 bytes either side of a search term?


3 Answers


sed is almost what you need, like so:

sed 's/.*\(.\{100\}eusbike.\{100\}\).*/\1/' webshots-index-20121231-index.json

returns this to the console:

20121017032138","warc",30012950425],["eusbike","2012-11-11 09:41","20121111040120/webshots.com-user-eusbike-20121111-094102.warc.gz",34212598,"20121111040120","warc",19238806437],["EUSCALDUN","2012-11-17 13:

But, and it is a big but: RE_DUP_MAX limits you to 255 characters either side. Even for the 100 characters either side shown above, it took 16 minutes to process on my MacBook Pro, and only 2 minutes for 10 characters each side. I don't have time to test how long 255 each side would take; probably around 50 minutes. The reasons for the limitation are shown in ftp://ftp.ics.uci.edu/pub/centos0/ics-custom-build/BUILD/nagios-plugins-1.4.13/gl/regex.h

I think you may be out of luck if you want that many chars each side around your search term.
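As a quick sanity check, the same sed pattern can be tried on a small fabricated one-line file (the file name, contents, and the 5-character context width below are made-up for illustration, not the real index):

```shell
# Create a throwaway one-line sample file (made-up data, not the real index).
printf 'aaaaaaaaaaXXeusbikeYYbbbbbbbbbb\n' > /tmp/sample_one_line.txt

# Keep 5 characters of context either side of the term "eusbike".
context=$(sed 's/.*\(.\{5\}eusbike.\{5\}\).*/\1/' /tmp/sample_one_line.txt)
echo "$context"   # prints: aaaXXeusbikeYYbbb
```

On a tiny file like this the command is instant; the 16-minute runtime above comes purely from the 244 MB input and the large repetition counts.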


Since you already have the json file downloaded, you can perform some file manipulation on it to make it easier to search.

I downloaded the first few hundred bytes of the json file, and I see that the file looks like this:

["entry1","date1","file1.gz",int1,"string1","string1",int1],["entry2","date2","file2.gz",int2,"string2","string2",int2],[...

It looks like each entry is in a separate json array, separated by ],[. You can use sed to replace these characters with a line break.

sed 's_\],\[_\]\n\[_g' json_file > json_file_with_breaks

This command will insert a line break after every entry, so you will get one entry per line. (Note: GNU sed interprets \n in the replacement as a newline; BSD/macOS sed does not, so there you would need to insert a literal newline instead.) The result:

[... entry1 ...]
[... entry2 ...]
...

Output will be saved to a new file, json_file_with_breaks. I recommend this because if you need to make multiple searches, running grep on the new file will be faster than running sed each time and piping output to grep. NB: the new file will also be 244 MB in size!

The next step is to use grep to search the new file:

grep 'search term' json_file_with_breaks
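The split-then-grep steps can be checked end to end on a tiny fabricated file (the file names and entries below are made up for illustration; this assumes GNU sed, which understands \n in the replacement):

```shell
# Fabricate a one-line file with three entries in the same ],[ layout.
printf '["a","d1","f1.gz",1],["eusbike","d2","f2.gz",2],["c","d3","f3.gz",3]\n' > /tmp/demo.json

# Split into one entry per line (GNU sed; BSD sed needs a literal newline).
sed 's_\],\[_\]\n\[_g' /tmp/demo.json > /tmp/demo_with_breaks.json

# Now grep prints just the matching entry, not the whole file.
grep 'eusbike' /tmp/demo_with_breaks.json   # prints: ["eusbike","d2","f2.gz",2]
```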

This is more along the lines of your original question:

Is there a simple Unix command line I can enter which lets me isolate, say, 512 bytes either side of a search term?

From the grep man page:

-b, --byte-offset  
      Print the 0-based byte offset within the  input  file  before
      each  line  of output.  If -o (--only-matching) is specified,
      print the offset of the matching part itself.

So, you can search for your string like this:

grep -o -b 'my search term' json_file

Output:

1234567:my search term
9876543:my search term
...

Each line holds the byte offset from the start of the file of each occurrence of 'my search term'.

You can use cut -bN-M to select bytes from the Nth to the Mth in the file:

cut -b$((1234567 - 512))-$((1234567 + 512)) json_file
cut -b$((9876543 - 512))-$((9876543 + 512)) json_file

You can automate the above process with a while loop:

grep -o -b 'my search term' json_file | cut -d: -f1 | while read pos; do cut -b$((pos - 512))-$((pos + 512)) json_file; done

This finds all occurrences of 'my search term' in the file, cuts their positions out of the grep output, and, for every position, cuts 512 bytes on either side of the match out of the json file (for a total of just over 1024 bytes around the match). Two caveats: grep's offsets are 0-based while cut -b is 1-based, so the window is shifted by one byte (harmless when eyeballing context), and if a match occurs within the first 512 bytes, pos - 512 will be zero or negative and cut will reject the range, so you may need to clamp the start to 1.
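The whole pipeline can be sanity-checked on a generated one-line file (the file names and padding sizes below are made up; the start offset is clamped to 1 since cut -b is 1-based):

```shell
# Build a one-line file: 600 x's, the search term, 600 y's (made-up data).
{ printf 'x%.0s' $(seq 600); printf 'my search term'; printf 'y%.0s' $(seq 600); } > /tmp/one_line.txt

# For each match offset, cut out ~512 bytes either side, clamping the start.
grep -o -b 'my search term' /tmp/one_line.txt | cut -d: -f1 | while read -r pos; do
  start=$((pos - 512))
  if [ "$start" -lt 1 ]; then start=1; fi   # cut -b is 1-based; guard early matches
  cut -b"$start"-$((pos + 512)) /tmp/one_line.txt
done > /tmp/context.txt

cat /tmp/context.txt
```

The extracted window contains the term with roughly 512 bytes of x's before it and y's after it.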

