
I've developed a data acquisition system for a scientific experiment. An FPGA buffers the scientific data in a FIFO, and a C# program empties this FIFO at a rate of ~45 MB/s.

Because the duration of the experiment suddenly changed from seconds to days, the data now needs to be stored as binary data.

I'm not a programmer or computer scientist, but it seems to me that there is huge potential for data corruption. Previously, every packet was written to a new line; if one bit went missing for some reason, it would corrupt only that one packet.

With binary data, however, one missing bit might lose us terabytes of irreplaceable data.

Is there a way to ensure that the data does not turn to garbage? My first thought is to write a separate ASCII file of Hamming-code data, but I don't know if that is feasible.

  • I would look into using a database for this instead of a single huge file (as I think you are planning?). You could store single rows with a binary blob plus a timestamp or other information to identify the data. That way the rows would be as independent from each other as your lines are now (as far as I understand your problem without knowing much about the actual data). Commented May 1, 2015 at 11:35
  • What do you expect to be the source of these bit corruptions? Commented May 1, 2015 at 11:41
  • In retrospect a database should have been used, for sure! As always, time is an issue; one day, to be exact. So I believe it's too late to migrate to a database, considering I've never used one before. As for the source of the errors: it could be anything, primarily poor code. However, the consequence of data corruption is millions of dollars down the drain. Better safe than sorry.
    – ThomasRB
    Commented May 1, 2015 at 11:47

2 Answers


If you want resilience against errors in your file, store additional data that will help you resynchronize and detect errors.

If you have a table of offsets to data records, you can use that to find a record, no matter what the records contain.

If you have a header for a record with expected size and checksum, you can test integrity and know how to skip past it if it's uninteresting or defective. If you give your header a reasonably unique signature, it's likely that you will be able to resynchronize by just scanning forward until you find a valid header.
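To make that concrete, here is a minimal sketch in C#. The magic value, the field layout, and the simple additive checksum are illustrative choices only (a real system would likely use something like CRC32); nothing here is prescribed by the question.

```csharp
using System.IO;

// Sketch of a record header carrying a signature, a length and a checksum,
// so a reader can verify each record and resynchronize after corruption.
static class FramedWriter
{
    const uint Magic = 0xDA7A5AFE;   // arbitrary but reasonably unique signature

    static uint Checksum(byte[] data)
    {
        uint sum = 0;
        foreach (byte b in data) sum = unchecked(sum * 31 + b);
        return sum;
    }

    public static void WriteRecord(BinaryWriter w, byte[] payload)
    {
        w.Write(Magic);              // 4-byte signature to resynchronize on
        w.Write(payload.Length);     // 4-byte payload length
        w.Write(Checksum(payload));  // 4-byte checksum of the payload
        w.Write(payload);
    }

    // Returns the next intact record, scanning forward past corrupted
    // bytes until a valid header is found, or null at end of stream.
    public static byte[] ReadRecord(BinaryReader r)
    {
        while (r.BaseStream.Position + 12 <= r.BaseStream.Length)
        {
            long start = r.BaseStream.Position;
            if (r.ReadUInt32() == Magic)
            {
                int length = r.ReadInt32();
                uint expected = r.ReadUInt32();
                if (length >= 0 && r.BaseStream.Position + length <= r.BaseStream.Length)
                {
                    byte[] payload = r.ReadBytes(length);
                    if (Checksum(payload) == expected)
                        return payload;          // record passed the checksum
                }
            }
            r.BaseStream.Position = start + 1;   // resync: slide forward one byte
        }
        return null;
    }
}
```

The reader slides forward one byte at a time after a bad header, which is what lets it skip past a corrupted region and pick up the next intact record instead of losing everything after the first bad bit.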

If you start having large amounts of data, consider some existing data storage format that has been battle-tested in reality. Any decent database ought to be Good Enough for storing blobs of binary data with acceptable integrity.

  • The data packets are 48 bits and there are no invalid bit-strings. Is it still possible to implement headers/signature in some clever way?
    – ThomasRB
    Commented May 1, 2015 at 11:50
  • Figure out how much overhead is acceptable, and use that many bits. It might be worth having a byte of checksum for N 48-bit blocks. Assuming you never have partial writes, you cannot desync in your storage solution. Commented May 1, 2015 at 12:41
  • I'm accepting this answer as you gave a few keywords that got my mind spinning (reasonably unique, acceptable overhead, and adding header bits). I figured that even though there are no invalid words, a special sequence of words can be treated as illegal.
    – ThomasRB
    Commented May 2, 2015 at 10:53

Store the data in a database.
Store a datetimeoffset timestamp with the data.
Use 1 record per logical concept or period.
Keep the rows under 4 KB in general.
If validation can be done on the data, do it before storing.
You could even persist a checksum with the data.
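As a rough sketch of what one row per block could look like, assuming SQLite via the Microsoft.Data.Sqlite package purely for illustration (the answer names no specific database, and the table name, column names and block size below are made up):

```csharp
using System;
using Microsoft.Data.Sqlite;   // illustrative choice; any database with blob support works

class BlockStore
{
    static void Main()
    {
        using var conn = new SqliteConnection("Data Source=experiment.db");
        conn.Open();

        // One row per block of packets: timestamp, payload and a checksum.
        var create = conn.CreateCommand();
        create.CommandText =
            @"CREATE TABLE IF NOT EXISTS blocks (
                  id       INTEGER PRIMARY KEY AUTOINCREMENT,
                  captured TEXT    NOT NULL,   -- ISO-8601 timestamp with offset
                  payload  BLOB    NOT NULL,   -- one block of packets, kept well under 4 KB
                  checksum INTEGER NOT NULL    -- integrity check persisted with the data
              )";
        create.ExecuteNonQuery();

        byte[] block = new byte[1024];         // placeholder for data drained from the FIFO

        var insert = conn.CreateCommand();
        insert.CommandText =
            "INSERT INTO blocks (captured, payload, checksum) VALUES ($t, $p, $c)";
        insert.Parameters.AddWithValue("$t", DateTimeOffset.UtcNow.ToString("o"));
        insert.Parameters.AddWithValue("$p", block);
        insert.Parameters.AddWithValue("$c", (long)Checksum(block));
        insert.ExecuteNonQuery();
    }

    static uint Checksum(byte[] data)
    {
        uint sum = 0;
        foreach (byte b in data) sum = unchecked(sum * 31 + b);
        return sum;
    }
}
```

Because each row is independent, a corrupted payload only costs you that one block, and the stored checksum tells you which blocks to distrust.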

  • Programming haiku/writing programs for others/is always great fun. Commented May 1, 2015 at 15:18
  • Wise man Robert :-)
    – phil soady
    Commented May 1, 2015 at 21:50
