Split binary data of fixed byte offset by byte position?

Question

I have binary data that I inspect in hex format with xxd -ps. I notice that the distance between two headers is 48300 (= 805*60) bytes, where each header starts with the separator fafafafa. The beginning of the file, before the first header, should be skipped.

Example hex data with 48300 bytes between the fafafafa headers is available here as data26.6.2015.txt, which contains three headers; its nearly equivalent binary is available here as test_27.6.2015.bin, which contains only the first two headers. In both files the data after the last header is not of complete length; otherwise, you can assume that the byte offset, i.e. the length of the data between headers, is fixed.

Pseudocode of algorithm

find the position where the file header ends
find the first two header positions and take their difference (d2 - d1) as the distance between events; this event length is fixed (777)
split the data by byte position (777) - TODO should I split the binary data or the xxd -ps converted data?
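
The pseudocode above can be sketched directly on the binary data, without going through xxd. This is only a sketch on a made-up toy file: sample.bin, its 6-byte record length and the rec_ output prefix are assumptions for illustration, and the byte fa is written as octal \372 so that printf stays portable.

```shell
#!/bin/sh
# Sketch only: sample.bin, the toy record length and the rec_ prefix are made up.
export LC_ALL=C                      # treat every byte as one character
work=$(mktemp -d)
cd "$work" || exit 1

# Toy input: 3 bytes of preamble, then three records starting with fa fa fa fa.
# The last record is shorter, like the incomplete last header in the question.
printf 'XYZ\372\372\372\372ab\372\372\372\372cd\372\372\372\372e' > sample.bin

# Byte offsets of every header, found in the binary itself
pat=$(printf '\372\372\372\372')
set -- $(grep -b -a -o "$pat" sample.bin | cut -d: -f1)

d1=$1 d2=$2
reclen=$((d2 - d1))                  # distance between the first two headers
echo "first header at $d1, record length $reclen"

# Skip the preamble (tail -c +N is 1-based), then cut fixed-length pieces
tail -c +"$((d1 + 1))" sample.bin | split -b "$reclen" - rec_
ls rec_*
```

The pieces come out as rec_aa, rec_ab and so on, each starting with the header bytes; the final piece is simply shorter when the last record is incomplete.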

I can convert data back to binary by xxd -r like xxd -ps | split and store | xxd -r but I am still unsure if this is necessary.

At which stage can you split binary data: only in the xxd -ps converted format, or directly as binary data?

If splitting happens in the xxd -ps converted format, I think a for loop is the only way to go through the file. Possible tools for splitting are csplit, split, ..., but I am not sure.

Output from grep (ggrep is gnu grep) on the hex data

$ xxd -ps r328.raw  | ggrep -b -a -o -P 'fafa' | head
49393:fafa
49397:fafa
98502:fafa
98506:fafa
147611:fafa
147615:fafa
196720:fafa
196725:fafa
245830:fafa
245834:fafa

whereas a similar grep on the binary file gives only an empty line as output:

$ ggrep -b -a -o '\xfa' r328.raw
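
One likely reason for the empty output: without -P, grep does not turn \xfa into a byte, so the pattern does not match the raw data. This is an assumption about how GNU grep treats the escape, not something checked against r328.raw. A minimal sketch on a made-up temp file builds the byte with printf instead (written as octal \372) and forces the C locale so that every byte counts as one character:

```shell
#!/bin/sh
# Sketch only: the temp file and its contents are made up for illustration.
export LC_ALL=C
f=$(mktemp)
printf 'ab\372\372\372\372cd' > "$f"

# As in the question: the escaped pattern, passed to grep as plain text.
# With GNU grep this typically counts no matching lines.
literal=$(grep -c -a '\xfa' "$f" 2>/dev/null || true)

# Build the real byte in the shell and list the offset of every occurrence
pat=$(printf '\372')
hits=$(grep -b -a -o "$pat" "$f" | cut -d: -f1 | tr '\n' ' ')

echo "escaped pattern: $literal matching lines; raw byte at offsets: $hits"
```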

Documentation

The documentation given to me is found here and here; the general SRS data format is shown as a picture:

[image: general SRS data format]

In which stage can you split binary data (as binary data or as xxd -ps converted data)?

Can you show exactly what you want the end result to be? I'm having trouble following your question. Most likely the most convenient tool for extracting a chunk of a binary file will be dd, but without understanding exactly what chunk you want to extract I'm limiting to making this a comment rather than an answer. — godlygeek, Commented Jun 26, 2015 at 16:55
Unless I'm missing something, that seems to be splitting in the middle of a byte - fafafafad0 starts at character 195 of the hex dump, meaning it's byte 98 of the binary file, but fafafafa6a starts at character 968 of the hex dump, 773 characters of hex later, which means it's 386.5 bytes later, which means it's across a byte boundary. Your "file 001.txt" is 773 characters long, which isn't normally a valid length for a hex dump - hex dumps must have an even number of characters, since each byte of the input is 2 characters. — godlygeek, Commented Jun 26, 2015 at 17:13
@godlygeek Sorry for my mistake. I added correct data now as a link. The byte distance is now 48300 (60*805). There are two headers in the data. I tried to simplify the original data unsuccessfully. — Léo Léopold Hertz 준영, Commented Jun 26, 2015 at 17:37
You should really drop that -P on grep - it's not doing you any favors there. In general, just dump the file with od or strings or whatever and grep the results - you don't need to save a copy of the whole encoded file, though - you already have the other. And grepping stuff like that is already going to be tedious enough, so maybe just keep the actual searches basic if you can. — mikeserv, Commented Jun 27, 2015 at 18:59

meuh · Accepted Answer · 2015-06-27 12:21:29Z

You can operate on the binary file without needing to go through xxd. I ran your data back through xxd and used grep -b to show me the byte offsets of your pattern (converted from hex to chars \xfa) in the binary file.

I used sed to strip the matched characters from the output, leaving just the numbers. I then set the shell positional arguments to the resulting offsets (set -- ...).

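# rebuild the binary from the hex dump
xxd -r -p <data26.6.2015.txt >/tmp/f1
# LC_ALL=C: in a UTF-8 locale -P may read \xfa as a character, not a raw byte
set -- $(LC_ALL=C grep -b -a -o -P '\xfa\xfa\xfa\xfa' /tmp/f1 | sed 's/:.*//')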

You now have a list of offsets in $1, $2, ... You can then extract the part that interests you with dd, setting a block size to 1 (bs=1) so that it reads byte by byte. skip= says how many bytes to skip in the input, and count= the number of bytes to copy.

# offsets of the first and second header
start=$1 end=$2
count=$((end - start))
# bs=1 makes skip= and count= count single bytes
dd bs=1 count=$count skip=$start </tmp/f1 >/tmp/f2

The above extracts from the start of the 1st pattern to just before the 2nd pattern. To not include the pattern, you can add 4 to start (and count reduces by 4).
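
The offset arithmetic above can be checked on a small made-up file. In this sketch the temp file and the offsets 3 and 9 are assumptions chosen for illustration (the byte fa is written as octal \372), and od is used only to print the extracted bytes as hex.

```shell
#!/bin/sh
# Sketch only: toy data with headers at byte offsets 3 and 9.
export LC_ALL=C
f=$(mktemp)
printf 'XYZ\372\372\372\372ab\372\372\372\372cd' > "$f"
start=3 end=9

# From the start of the 1st pattern to just before the 2nd pattern
with_header=$(dd bs=1 skip="$start" count="$((end - start))" < "$f" 2>/dev/null |
    od -An -v -tx1 | tr -d ' \n')

# Same piece without the 4 pattern bytes: start moves up by 4, count shrinks by 4
payload=$(dd bs=1 skip="$((start + 4))" count="$((end - start - 4))" < "$f" 2>/dev/null)

echo "with header: $with_header"
echo "payload only: $payload"
```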

If you want to extract all parts, use a loop around this same code, and add starting offset 0 and ending offset size-of-file to the list of numbers:

xxd -r -p <data26.6.2015.txt >/tmp/f1
size=$(stat -c '%s' /tmp/f1)   # file size in bytes (GNU stat)
# offset list: 0, every header, end of file
set -- 0 $(LC_ALL=C grep -b -a -o -P '\xfa\xfa\xfa\xfa' /tmp/f1 | sed 's/:.*//') $size
i=2
while [ $# -ge 2 ]
do start=$1 end=$2
   count=$((end - start))
   # piece i runs from one offset up to the next
   dd bs=1 count=$count skip=$start </tmp/f1 >/tmp/f$i
   i=$((i + 1))
   shift
done
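
With bs=1, dd reads and writes one byte at a time, which gets slow on large files. A possible alternative, not part of the original answer, is to cut each piece with tail -c and head -c. The sketch below assumes a made-up sample.bin and part names, writes the byte fa as octal \372, and uses wc -c and cut in place of stat and sed; head -c is assumed to be available (GNU and BSD both have it).

```shell
#!/bin/sh
# Sketch only: sample.bin and the partN names are made up for illustration.
export LC_ALL=C
work=$(mktemp -d)
cd "$work" || exit 1
printf 'XYZ\372\372\372\372ab\372\372\372\372cd\372\372\372\372e' > sample.bin

size=$(wc -c < sample.bin | tr -d ' ')
pat=$(printf '\372\372\372\372')

# Offset list: 0, every header, end of file
set -- 0 $(grep -b -a -o "$pat" sample.bin | cut -d: -f1) "$size"

i=1
while [ $# -ge 2 ]; do
    start=$1 end=$2
    # tail -c +N is 1-based; head -c keeps exactly end-start bytes
    tail -c +"$((start + 1))" sample.bin | head -c "$((end - start))" > "part$i"
    i=$((i + 1))
    shift
done
ls -l part*
```

Here part1 holds the bytes before the first header and every later part starts with fafafafa.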

If grep doesn't manage to work with the binary data, you can use the xxd hex dump data. First remove all the newlines to have one enormous line, then do the grep using the unescaped hex values, divide all the offsets by 2, and do the dd with the raw file:

xxd -r -p <data26.6.2015.txt >r328.raw
tr -d '\n' <data26.6.2015.txt >f1
# f1 holds 2 hex characters per byte, so its size is already the last hex offset
size2=$(stat -c '%s' f1)
set -- 0 $(grep -b -a -o -P 'fafafafa' f1 | sed 's/:.*//') $size2
i=2
while [ $# -ge 2 ]
do  start=$(($1 / 2))    # hex character offset -> byte offset
    end=$(($2 / 2))
    count=$((end - start))
    dd bs=1 count=$count skip=$start <r328.raw  >f$i
    i=$((i + 1))
    shift
done
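
One caveat when searching the hex text: a match at an odd character offset straddles a byte boundary and is not a real fafafafa byte sequence. A small sketch, on a made-up hex string, keeps only the even offsets before halving them:

```shell
#!/bin/sh
# Sketch only: the hex string is made up. It encodes the bytes
# 0f af af af af a0 fa fa fa fa 11, so the only real header is at byte 6.
export LC_ALL=C
hex='0fafafafafa0fafafafa11'

byte_offsets=$(printf '%s' "$hex" | grep -b -o 'fafafafa' | cut -d: -f1 |
    while read -r off; do
        # odd hex offsets start in the middle of a byte: skip them
        if [ $((off % 2)) -eq 0 ]; then
            echo $((off / 2))
        fi
    done)

echo "aligned header offsets (bytes): $byte_offsets"
```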

\x converts fa from 2 hex digits into a single binary char. My script runs on the original binary you had before converting it with xxd, not the file you put on dropbox. — meuh, Commented Jun 27, 2015 at 11:08
@masi I added an alternative solution that uses your hexdump format file and grep with ascii characters. Note that the dd is still applied to the raw file. — meuh, Commented Jun 27, 2015 at 12:22
yes, it finds offsets 24292 48444. Perhaps your grep will work if you create the characters in the shell: pat=$(echo -e '\xfa\xfa\xfa\xfa') and grep -b -a -o "$pat"... — meuh, Commented Jun 27, 2015 at 15:49
#3 is working. Concatenate the resulting files together and compare that with the original. The binary file I got from dropbox is truncated compared to the binary I can get from the hexdump version. I think we should stop there. — meuh, Commented Jun 27, 2015 at 16:29
File f1 is an intermediary file that is not part of the output. I didn't want to rename it at this late stage. — meuh, Commented Jun 27, 2015 at 20:14

9 revs · Accepted Answer · 2017-04-13 12:36:37Z

Outputs from meuh's great answer, run on the data file data26.6.2015.txt.

#1

$ cat 27.6.2015_1.sh && sh 27.6.2015_1.sh 
xxd -r -p <data26.6.2015.txt >/tmp/f1
size=$(stat -c '%s' /tmp/f1)
pat=$(echo -e '\xfa\xfa\xfa\xfa')
set -- 0 $(ggrep -b -a -o "$pat" /tmp/f1 | sed 's/:.*//') $size
i=2
while [ $# -ge 2 ]
do start=$1 end=$2
   let count=$end-$start
   dd bs=1 count=$count skip=$start </tmp/f1 >/tmp/f$i
   let i=i+1
   shift
done
72900+0 records in
72900+0 records out
72900 bytes (73 kB) copied, 0.160722 s, 454 kB/s

#2

$ cat 27.6.2015_2.sh && sh 27.6.2015_2.sh 
xxd -r -p <data26.6.2015.txt >/tmp/f1
size=$(stat -c '%s' /tmp/f1)
set -- 0 $(ggrep -b -a -o -P '\xfa\xfa\xfa\xfa' /tmp/f1 | sed 's/:.*//') $size
i=2
while [ $# -ge 2 ]
do start=$1 end=$2
   let count=$end-$start
   dd bs=1 count=$count skip=$start </tmp/f1 >/tmp/f$i
   let i=i+1
   shift
done
72900+0 records in
72900+0 records out
72900 bytes (73 kB) copied, 0.147935 s, 493 kB/s

#3

$ cat 27.6.2015_3.sh && sh 27.6.2015_3.sh 
xxd -r -p <data26.6.2015.txt >r328.raw
tr -d '\n' <data26.6.2015.txt >f1
let size2=2*$(stat -c '%s' f1)
set -- 0 $(ggrep -b -a -o -P 'fafafafa' f1 | sed 's/:.*//') $size2
i=2
while [ $# -ge 2 ]
do  let start=$1/2
    let end=$2/2
    let count=$end-$start
    dd bs=1 count=$count skip=$start <r328.raw  >f$i
    let i=i+1
    shift
done
24292+0 records in
24292+0 records out
24292 bytes (24 kB) copied, 0.088345 s, 275 kB/s
24152+0 records in
24152+0 records out
24152 bytes (24 kB) copied, 0.061246 s, 394 kB/s
24152+0 records in
24152+0 records out
24152 bytes (24 kB) copied, 0.058611 s, 412 kB/s
304+0 records in
304+0 records out
304 bytes (304 B) copied, 0.001239 s, 245 kB/s

Output is one hex file and 4 binary files:

$ less f1
$ less f2
"f2" may be a binary file.  See it anyway? 
$ less f3
"f3" may be a binary file.  See it anyway? 
$ less f4
"f4" may be a binary file.  See it anyway? 
$ less f5
"f5" may be a binary file.  See it anyway?

There should be only 3 files that start with fafafafa, because I gave only three headers in the file data26.6.2015.txt, where the content of the last header is a stub. Outputs in f2-f5:

$ xxd -ps f2 |head -n3
48000000fe5a1eda480000000d00030001000000cd010000010000000000
000000000000000000000000000000000000000000000100000001000000
ffffffff57ea5e5580510b0048000000fe5a1eda480000000d0003000100
$ xxd -ps f3 |head -n3
fafafafa585e0000fe5a1eda480000000d00030007000000cd0100000200
000000000000020000000000008000000000000000000000000000000000
01000000ffffffff72ea5e55b2eb0900105e000016000000010000000000
$ xxd -ps f4 |head -n3
fafafafa585e0000fe5a1eda480000000d00030007000000cd0100000300
000000000000020000000000008000000000000000000000000000000000
01000000ffffffff72ea5e55f2ef0900105e000016000000010000000000
$ xxd -ps f5 |head -n3
fafafafa585e0000fe5a1eda480000000d00030007000000cd0100000400
000000000000020000000000008000000000000000000000000000000000
01000000ffffffff72ea5e55a9f10900105e000016000000010000000000

where

f1 is the whole datafile data26.6.2015.txt (not necessary to include)
f2 is the file header i.e. very beginning of the file data26.6.2015.txt until the first header fafafafa (not necessary to include)
f3 is the first header, correct!
f4 is the second header, correct!
f5 is the third header, correct!
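
One way to check a split like this is to concatenate the pieces and compare the result with the original binary, as suggested in the comments. A minimal sketch on made-up stand-in files (original.bin and the three pieces are created here just for illustration; fa is written as octal \372):

```shell
#!/bin/sh
# Sketch only: original.bin and the pieces f2..f4 are toy stand-ins.
work=$(mktemp -d)
cd "$work" || exit 1
printf 'XYZ\372\372\372\372ab\372\372\372\372cd' > original.bin

# Stand-ins for the split output: file header, then two header pieces
head -c 3 original.bin > f2
tail -c +4 original.bin | head -c 6 > f3
tail -c +10 original.bin > f4

# The pieces, joined in order, must reproduce the original byte for byte
if cat f2 f3 f4 | cmp -s - original.bin; then
    verdict=identical
else
    verdict=differs
fi
echo "concatenated pieces vs original: $verdict"
```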

There's no spec for the format or anything? If you're just reverse engineering then you're gonna keep encountering the unexpected until you don't - back engineering is rarely easy. Else, if it's documented, then instead of handling outside data samples, try instead to create your own samples that some program which already understands the format can also read. — mikeserv, Commented Jun 30, 2015 at 9:14
@mikeserv I added the link to specs in the body of the question and also a picture about a general format. Note I simplified the challenge for this question by only considering splitting by event trailers called fafafafa. — Léo Léopold Hertz 준영, Commented Jun 30, 2015 at 9:21
Ok, you're doing the wrong thing entirely. You need to write a program - C or something - that can handle arbitrarily sized data blocks in a stream byte for byte and which can adjust its buffers on the fly. You're wasting your time w/ grep and similar here. The fafafa doesn't always mean fafafa - sometimes it's a contextual delimiter and sometimes it's data. — mikeserv, Commented Jun 30, 2015 at 9:29
Well, by my (brief) reading of it, I guess the ADC mode should be safe. Have you run strings -1 on the binary? You should get like increment\nletter\nincrement\nletter... most of the time... — mikeserv, Commented Jun 30, 2015 at 9:42

mikeserv · Accepted Answer · 2015-06-27 21:03:40Z


It's not really that hard: just look for your start string, and name and match your tail string. Otherwise, try at least to get close. You don't really need all that hexadecimal, but using it:

fold -w2 <hexfile |
sed -e:t -e's/[[:xdigit:]]\{2\}$/\\x&/
    /f[af]$/N;/\(.\)..\1$/!s/.*\n/&\\x/;t
    /^.*\(.\)\(\n.*\)\n\(.*\n\).*/!bt
    s//\3\3\3 H_E_A_D \1 E_N_D \2\2\2/
    s/.* f//;s/a E.*//'

That will get a single hexadecimal byte code per line - each prefixed w/ \x - for every byte in hexfile except where the byte codes fa or ff occur 4 times in sequence. In that case it will get an H_E_A_D or E_N_D marker instead, where the H_E_A_D string will replace the last of four \xfa strings and the E_N_D string will replace the first of four sequential \xff strings - which should also keep the byte offsets in sync by line number.

Like this:

PIPELINE | grep -C8n _

OUTPUT:

(trimmed a little)

72596-\x8b
72597-\xfa
72598-\xfa
72599-\xfa
72600: H_E_A_D
72601-\x58
--
72660-\x00
72661: E_N_D
72662-\xff
72663-\xff
72664-\xff
72665-\x72

And so you can pass the above command's output to, for example:

fold ... | sed ... | grep -n _

... to get a list of offsets where headers might each start and end. With GNU grep you can use the -A (after-context) switch to tell it how many lines, here one per byte, you'd like to see in contextual sequence - and so you might like to use -A777, for example. You could take output like that and pass it through:

... | grep -A777 E_N_D | sed -ne's/\\/&&/p' | xargs printf %b

...to reproduce each binary byte for each sequence, and can specify the match number w/ -m[num].
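
The last step, turning \xNN lines back into bytes, depends on the printf that xargs runs understanding \xNN escapes. A more portable alternative (not the method above; decode_hex_lines is a hypothetical helper name) converts each value to an octal escape, which a POSIX printf accepts in its format string. It expects only \xNN lines, so marker lines such as H_E_A_D would have to be filtered out first.

```shell
#!/bin/sh
# Sketch only: decode_hex_lines is a made-up helper, not part of the answer.
# Reads lines such as \xfa and writes the corresponding raw bytes.
decode_hex_lines() {
    while read -r line; do
        hex=${line#??}                        # drop the leading \x
        # fa -> 372 -> printf '\372' -> one raw byte
        printf "\\$(printf '%03o' "0x$hex")"
    done
}

# Round trip: four byte codes in, the same four bytes out (shown via od)
printf '%s\n' '\x66' '\x61' '\xfa' '\xff' | decode_hex_lines | od -An -v -tx1
```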

edited Jun 27, 2015 at 21:03

answered Jun 27, 2015 at 17:08


@Masi - well i did just edit it - try it again? I guess it was missing a quote. it shouldn't now - i just copied it from the terminal. – mikeserv, Commented Jun 27, 2015 at 18:32
Much better! Why is the prefix of the byte code ff omitted? The byte offset is fixed. So I have had an idea of looking for the length of the byte offset of the header, location of last header in tail and splitting by fixed position. I am afraid that if we do not prefix ff the fixed splitting will not work. Why is the mark E_N_D set after the first ffffffff? – Léo Léopold Hertz 준영, Commented Jun 27, 2015 at 19:21
@Masi - i dunno - i dunno the format. If i got anything right at all it was just a lucky guess. When should it be set then? And honestly, i think that if you can make the idea of looking for byte offset of header...last header...splitting fixed position stuff a little clearer, it might make it easier for me to help. I don't really follow that very well. – mikeserv, Commented Jun 27, 2015 at 19:24
I think you have the best idea about how to do this; I am just a newbie. Note the data of last header is not complete. Otherwise, fixed length of data between headers marked just by four fa's. I just learned that you are doing things right: s/ .* f//;s/a E.*//' is necessary in first sed, otherwise binary output. – Léo Léopold Hertz 준영, Commented Jun 27, 2015 at 19:46
@Masi - if i understood any of what you said before correctly, i think you had a good point, too - why do i remove the ff/fa strings if i don't have to? So i put them back. – mikeserv, Commented Jun 27, 2015 at 19:50
