8

I have a Java application that receives compressed files as input. The application reads the header information of these files and passes the compressed bytes to an external native library for decompression (via JNI). One of the files we received contained a corrupt blob of compressed bytes, which led to a hard crash of the dynamically loaded library and, with it, our application (no exceptions etc.).

Upon inspection of the compressed array which is passed to the library we verified that the data is indeed corrupt, while the header information is fine.

The question I have is:

How can I prevent my application from crashing from these corrupted input files?

Thoughts:

  • It seems to me that there is no way to inspect compressed data for validity... without decompressing it.
  • Inspecting the header for some kind of sanity check is not enough, as the header information is well-formed.
  • Changing the called library to be more robust against malformed data would effectively result in forking the external library, which I want to avoid if possible.

Any pointers are appreciated.

5
  • 1
    I believe there should be a checksum somewhere so you can compute the checksum of the compressed data and match it with the one in the header. Commented Aug 26, 2022 at 9:34
  • 8
    While a checksum is useful for detecting accidental corruption, it may not be useful for detecting malicious corruption. So you might need to consider whether you want to protect against malicious actors or not.
    – JonasH
    Commented Aug 26, 2022 at 9:39
  • In our context we assume accidental corruption. But it is an angle, I agree
    – MPIchael
    Commented Aug 26, 2022 at 13:28
  • 5
    Beware that the crash might be exploitable. This malformed input caused a crash. Different malformed input might start sending all your corporate files to Russia. Commented Aug 26, 2022 at 15:04
  • A checksum is useless here. It helps against errors in transport. It doesn’t help against malicious files exploiting the vulnerability, and that’s the danger. Now if a small unintentional error on the side of the encoder crashes your decoder, that’s BAD. Malicious files will have a correct checksum, because the hacker calculates and adds the correct checksum.
    – gnasher729
    Commented Aug 29, 2022 at 10:26
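
For the accidental-corruption case discussed in these comments, a cheap pre-check could look roughly like the sketch below. It assumes the file header carries a CRC-32 of the compressed payload, which the question does not confirm; `PayloadChecksum` and `matchesHeaderCrc` are hypothetical names, and as noted above this does nothing against deliberately crafted input.

```java
import java.util.zip.CRC32;

final class PayloadChecksum {
    /**
     * Returns true if the CRC-32 of the compressed bytes equals the value
     * read from the header. ASSUMPTION: the header stores such a CRC-32.
     */
    static boolean matchesHeaderCrc(byte[] compressed, long expectedCrc32) {
        CRC32 crc = new CRC32();
        crc.update(compressed, 0, compressed.length);
        return crc.getValue() == expectedCrc32;
    }
}
```

Only payloads that pass the check would then be handed to the native library; anything else can be rejected on the Java side with a normal exception.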

4 Answers

16

If you are going to use a native library that might crash (regardless of whether the input data is malformed or the library has a bug), the only safe way to protect your own application from being "crashed" as well is to run the library in a separate process. Unfortunately, this often means some extra work, since you will have to implement some kind of interprocess communication between your app and the "wrapper app" for the library, which you will probably need to build.

For most real-world cases, processes (and only processes) provide a sufficient isolation level to protect other apps from being shut down when a library hits a stack overflow or attempts an illegal memory access.

The only alternative to the former suggestion or forking is to ask the library's author to implement better error handling, or to send them a pull request in case you are willing to implement the missing error handling yourself. However, even if the author is willing to assist, a native library of a certain complexity always carries a risk of bugs that cannot be handled by a simple try/catch in your Java application. If you want to be safe, do both: ask the author for a library change and wrap the library in its own process.
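
On the calling side, the separate-process idea might be sketched as follows. This is a minimal illustration, not a definitive implementation: `IsolatedDecompressor`, `WorkerFailedException` and the `workerCommand` argument are hypothetical names, and the wrapper program that actually loads the native library is assumed to read compressed bytes from stdin and write decompressed bytes to stdout.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.util.List;

final class IsolatedDecompressor {

    /** Signals that the worker process crashed or reported a failure. */
    static final class WorkerFailedException extends Exception {
        WorkerFailedException(String message) { super(message); }
    }

    /**
     * Runs the (hypothetical) wrapper program given by workerCommand,
     * pipes the compressed bytes to its stdin and returns its stdout.
     */
    static byte[] decompress(List<String> workerCommand, byte[] compressed)
            throws IOException, InterruptedException, WorkerFailedException {
        Process worker = new ProcessBuilder(workerCommand)
                .redirectError(ProcessBuilder.Redirect.DISCARD)
                .start();

        // Feed stdin on its own thread so a full pipe buffer cannot deadlock us.
        Thread feeder = new Thread(() -> {
            try (OutputStream stdin = worker.getOutputStream()) {
                stdin.write(compressed);
            } catch (IOException ignored) {
                // The worker died before reading everything; the exit code reports it.
            }
        });
        feeder.start();

        byte[] output = worker.getInputStream().readAllBytes();
        int exitCode = worker.waitFor();
        feeder.join();

        // A native crash (e.g. a segfault) kills only the worker and shows up
        // here as a non-zero exit code instead of taking down this JVM.
        if (exitCode != 0) {
            throw new WorkerFailedException("worker exited with code " + exitCode);
        }
        return output;
    }
}
```

The sketch has no timeout, so a worker that hangs instead of crashing would still block the caller; a production version would likely add one and reuse long-lived workers to limit the process start-up cost.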

3
  • 8
    For completeness: switching libraries could also be an alternative.
    – JonasH
    Commented Aug 26, 2022 at 11:28
  • 11
    If asking the author, consider sending a sample of the corrupt data that causes the problem. Reproducible bugs are more likely to get squashed. Commented Aug 26, 2022 at 12:11
  • @JonasH: that is not wrong, but often easier said than done. There must be another lib for the same purpose available, it must fulfill one's functional and nonfunctional requirements (including license terms, performance, security, etc.), and it should be more robust than the former one. Those are a lot of extra preconditions.
    – Doc Brown
    Commented Aug 29, 2022 at 15:56
2

Some formats offer built-in checks, such as a specific end-of-file marker or a redundantly stored original file size.

In your case I would decompress in Java as a test, without storing the result; implementations for most common formats like 7z, bz and such are available. You need not keep the bytes, just read through them and catch any exceptions.
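
A minimal sketch of such a read-through test is shown below. It assumes the payload is gzip, purely because the JDK ships a gzip reader; the actual format in the question is unknown, and for 7z or bz a third-party library would be needed. `CompressedDataCheck` and `isReadableGzip` are hypothetical names.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.GZIPInputStream;

final class CompressedDataCheck {
    /**
     * Decompresses the whole stream into a scratch buffer and discards the
     * output. Returns false if the JDK decoder rejects the data (bad header,
     * corrupt deflate stream, truncated input or trailer CRC mismatch).
     */
    static boolean isReadableGzip(byte[] compressed) {
        try (InputStream in = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            byte[] scratch = new byte[8192];
            while (in.read(scratch) != -1) {
                // Intentionally empty: we only care whether reading succeeds.
            }
            return true;
        } catch (IOException e) {
            return false;
        }
    }
}
```

Since this decompresses everything a second time, it roughly doubles the decompression work, which matters in a performance-critical setting.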

This can also take the form of replacing the external decompressing application, initially with a separate Java application. If you can determine which kinds of files your existing decompressor can handle, you can even use a hybrid of the two.

On the original error: the most likely cause is still binary data having been transferred as text via SFTP or similar (\r\n conversion, or EBCDIC/ASCII conversion from an AS/400). A (failed) virus infection is also possible (run a virus scan).

1
  • 2
    Thank you for your answer. Unfortunately the setting is performance-critical, so decompressing on the Java side will be too slow if we do it for all incoming files. Your guess is correct! We found plain String data in the actual compressed data array, which certainly should not be there :-)
    – MPIchael
    Commented Aug 26, 2022 at 13:32
2

There’s one solution only: Your decompressor must be written in such a way that it doesn’t crash whatever input it is given. It should either produce correctly decompressed data, or produce an error in some form. It should also be designed to produce the exact same result with the same input data, and there should be no other decompressor ever trying to decompress the same data.

Consider that your library was crashed just by faulty data; now consider what could happen if you are actually under attack by a hacker. So throw away this library, warn everyone, and pick a library that works better.

If there is a complex file format where the library author isn’t willing to give guarantees, then the library author can put the complete library into a separate process. That’s for example what Apple does with their h.264 decoder. The caller feeds in encoded data and receives frames or errors, regardless of whether someone manages to crash the h.264 decoder, so the caller need not run a separate process itself.

0

If you are:

  1. Unable to switch to another implementation or library
  2. Unable to determine a potential crash preemptively

Then you must ensure two things:

  1. The processing happens in a separate process.
  2. The processing is locked down, isolated.

Processing in a separate process: The main application can launch a new (child) process that invokes the unsafe processing library. Input and output can be passed via the standard input/output/error pipes or via the filesystem.

Processing in an isolated manner: To avoid the risk of unwanted behavior, the third-party library process may be invoked in an isolated manner. One simple approach is to launch an ephemeral container and pipe the input and output through it.
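
The child side of such a setup could be a very small program along the lines of the sketch below. `DecompressWorker` is a hypothetical name, and the decoder is passed in as a function so that the actual JNI call (not shown in the question) stays out of the sketch.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.function.UnaryOperator;

final class DecompressWorker {
    /**
     * Reads all compressed bytes from in, runs the decoder and writes the
     * result to out. Returns the process exit code: 0 on success, 1 if the
     * decoder or the I/O failed with a Java exception. A hard native crash
     * never reaches the catch block; the OS ends the process instead, and
     * the parent sees that as an abnormal exit.
     */
    static int run(InputStream in, OutputStream out, UnaryOperator<byte[]> nativeDecompress) {
        try {
            byte[] compressed = in.readAllBytes();
            byte[] plain = nativeDecompress.apply(compressed);
            out.write(plain);
            out.flush();
            return 0;
        } catch (IOException | RuntimeException e) {
            return 1;
        }
    }

    // The worker's main method would then be a one-liner such as
    //   System.exit(run(System.in, System.out, bytes -> NativeCodec.decompress(bytes)));
    // where NativeCodec stands in for the real JNI binding (hypothetical name).
}
```

The same program can be started directly as a child process or as the entry point of an ephemeral container, since it only talks to the outside world through its standard streams.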
