35

When I am looking at the machine code of an application, are there hints and patterns I can discern from the generated machine code which would indicate which compiler (and possibly version) was used to generate it?

Does knowing the compiler used to generate an application help me to more effectively reverse engineer back from the generated object to what the source code might have been, and if it does help, how so?

2
  • 1
    When you say "help me to more effectively reverse engineer back from the generated object to what the source code might have been", is your goal to decompile the code, or to understand the functionality of the code?
    – amccormack
    Commented Mar 19, 2013 at 20:25
  • Is it even possible to completely decompile the code? I'd say to decompile if possible, otherwise, to at least understand the functionality.
    – WilliamKF
    Commented Mar 19, 2013 at 20:43

7 Answers

30

There is some academic research in this area; the keywords you want are 'toolchain provenance'. There was a pretty good paper by Nate Rosenblum on this topic. It's been a while since I read it, but you can use many techniques to establish this information. I think some use machine learning, and others use a big pile of heuristics or axioms about compiler behavior.

Establishing this is of limited utility IMO. It could be useful in an adversarial situation where you're trying to get intelligence about a malware group or threat actor, but keep in mind that this kind of information can be obfuscated or destroyed. One potential use would be to establish that some binary was compiled using a company's SDK that included a compiler with signature information unique to that company. Establishing the toolchain provenance can help you make a case that someone who bought your SDK is in violation of a license or contract, say by producing malware.

An example of a behavior difference is how parameters are written to the stack. There are two ways to place a value onto the stack: one using 'push', and another using 'mov' with an esp-based address as the destination operand. So one compiler can do this:

push eax
push ebx

And another can do this:

mov [esp+foo], eax
mov [esp+foo+4], ebx

And they do. Generally, MSVC does the first example and GCC does the second example, at least in some very limited testing/observation just now...
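As a rough illustration of how such a heuristic could be automated, the Python sketch below counts the two argument-writing styles in a textual disassembly listing. The function name `count_arg_styles`, the regular expressions, and the sample listing are all assumptions made up for this example. A real tool would work on decoded instructions rather than text, and a single feature like this is nowhere near enough to name a compiler on its own.

```python
import re

# Matches "push <operand>".
PUSH_RE = re.compile(r"^\s*push\s+\S+", re.IGNORECASE)
# Matches "mov [esp...], <src>", i.e. a store through esp.
MOV_ESP_RE = re.compile(
    r"^\s*mov\s+(?:dword\s+ptr\s+)?\[esp[^\]]*\]\s*,", re.IGNORECASE
)


def count_arg_styles(lines):
    """Count push-style and mov-to-[esp]-style stack writes in a listing."""
    counts = {"push": 0, "mov_esp": 0}
    for line in lines:
        if PUSH_RE.match(line):
            counts["push"] += 1
        elif MOV_ESP_RE.match(line):
            counts["mov_esp"] += 1
    return counts


# Hypothetical listing, for illustration only.
sample = [
    "push eax",
    "push ebx",
    "call foo",
    "mov [esp+4], eax",
    "mov dword ptr [esp], ebx",
    "mov eax, [esp+8]",  # a load from the stack, so it is not counted
]
print(count_arg_styles(sample))
```

Note that 'push' is also used to save registers, so the raw counts are noisy; only a strong skew over a whole binary would be a meaningful hint.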

11

When looking at machine code, there is typically a "trail" that can be followed unless the produced binary was somehow scrubbed. For example, I generated a small "hello world" application using GCC on my Linux box with the standard options (gcc -Wall hello.c). If you open the result in a tool like hexedit, you can see that the file contains a section with build information:

[Image: hex editor view of the binary showing the GCC version string]

You can clearly see there that I built this with GCC version 4.6.3. Other compilers leave other kinds of signatures, such as Microsoft's "Rich" signature.
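The same check can be scripted instead of done by eye. The sketch below is a minimal example, not a complete identifier: it searches raw file bytes for the "GCC: (...)" banner and for the "Rich" marker near the start of an MZ file. The function name `find_compiler_strings` and the 0x400-byte search window are assumptions chosen for illustration, and both markers can be stripped or forged.

```python
import re

# The "GCC: (<distro info>) <version>" banner that GCC usually leaves in the file.
GCC_RE = re.compile(rb"GCC: \([^)]*\) [0-9][0-9.]*")


def find_compiler_strings(data):
    """Return compiler-identifying strings found in raw file contents."""
    hits = [m.group(0).decode("ascii", "replace") for m in GCC_RE.finditer(data)]
    # Assumption: the "Rich" marker, if present, sits between the DOS stub
    # and the PE header, so only the first 0x400 bytes are searched.
    if data[:2] == b"MZ" and b"Rich" in data[:0x400]:
        hits.append("Rich header marker")
    return hits


# Synthetic bytes standing in for a real ELF file.
blob = b"\x7fELF" + b"\x00" * 16 + b"GCC: (Ubuntu/Linaro 4.6.3-1ubuntu5) 4.6.3\x00"
print(find_compiler_strings(blob))
```

To try it on a real file, pass the result of `open(path, "rb").read()` instead of the synthetic `blob`.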

3
  • 2
    It would be interesting to see how it looks after stripping the file... Commented Mar 20, 2013 at 20:07
  • The question was specifically about the machine code. One would hope the OP has already tried such basic methods as using a hex editor or objdump and looking for trivial strings, before asking. In which case, this wouldn't be an answer. But sure, if they somehow hadn't, it would be relevant. ;-) Commented Apr 3, 2016 at 13:04
  • @underscore_d - "One would hope", indeed one would. I was simply making sure we didn't have to only hope the OP knew this. I like to not make too many assumptions!
    – Mike
    Commented May 15, 2016 at 6:48
10

There was a presentation at Recon titled "Packer Genetics: The Selfish Code" that described one approach for this. They used some statistics to extract the most common code sequences from compiled programs and used them to detect the end of unpacking, but the approach can easily be used to identify specific compilers.

See from slide 15 here: http://blog.zynamics.com/2010/07/16/recon-slides-packer-genetics-the-selfish-code-bochspython/

The slides seem somewhat truncated; I believe the actual presentation had more info.
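To give a feel for the "most common code sequences" idea, here is a minimal sketch that counts raw byte n-grams across a set of code samples. This is not the algorithm from the presentation, only a simplified stand-in: the function name `common_sequences` and the toy samples are assumptions, and a serious implementation would count decoded instructions rather than raw bytes.

```python
from collections import Counter


def common_sequences(samples, n=4, top=3):
    """Return the `top` most frequent n-byte sequences across all samples."""
    counts = Counter()
    for code in samples:
        for i in range(len(code) - n + 1):
            counts[code[i:i + n]] += 1
    return counts.most_common(top)


# Two toy "functions" that share a prologue: 55 89 e5 is one encoding of
# push ebp; mov ebp, esp.
samples = [
    bytes.fromhex("5589e583ec10c9c3"),
    bytes.fromhex("5589e55dc3"),
]
for seq, count in common_sequences(samples, n=3, top=1):
    print(seq.hex(), count)
```

Sequences that are frequent in binaries from one compiler and rare in those from another are the candidates for a compiler signature.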

8

Does knowing the compiler used to generate an application help me to more effectively reverse engineer back from the generated object to what the source code might have been, and if it does help, how so?

I consider knowing which compiler was used a very important step, for the following reasons:

  1. It helps you select the proper tool(s) to analyze the target.
  2. Knowing the runtime is important for analysis. For example, in Delphi, TFileStream is a commonly used object for reading/writing files. Knowing the vtable of that object helps me understand whether an offset is read/write/seek etc.

To clarify 1 with an example: a tool such as IDR might be a better fit for a Delphi target than IDA Pro. Or at least we can generate a MAP file/IDC script with it that improves symbols in IDA. But for a target written in Visual Basic one might use VB Decompiler and so on.

8

I guess the first thing to do, unless you literally mean the compiler version rather than the linker version, is to inspect the "MajorLinkerVersion" and "MinorLinkerVersion" fields in the PE header of the executable, be it an EXE, DLL, or SYS file. See the list below.

Major  Minor  Version  Toolchain
0x5    0x0    5.0      Borland C++ / MS Linker 5.0
0x6    0x0    6.0      Microsoft Visual Studio 6
0x7    0xA    7.10     Microsoft Visual Studio 2003
0x8    0x0    8.0      Microsoft Visual Studio 2005
0x9    0x0    9.0      Microsoft Visual Studio 2008
0xA    0x0    10.0     Microsoft Visual Studio 2010
0x2    0x15   2.21     MinGW
0x2    0x19   2.25     Borland Delphi (linker 2.0.0.25)

Unfortunately, packers and protectors tend to overwrite these values with their own, and/or to make it harder to guess the original compiler.

Also, the resource directory of an executable is a good place to look for toolchain-specific info. For example, an RT_RCDATA resource named "DVCLAL" is a sign of Borland C++ or Delphi, and in an MSVC-built executable the "RT_MANIFEST" resource can tell us which specific version of the runtime DLLs it is linked against, and hence the compiler version.

Also, an executable with the "TimeDateStamp" field set to 0x2A425E19 is a sign of being built with Delphi.
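The header fields mentioned above can be read with a few lines of Python. The sketch below assumes a well-formed PE file and uses only the standard layout (e_lfanew at offset 0x3C, a 20-byte COFF file header, then the optional header); the function name `read_pe_build_fields` and the synthetic header used to exercise it are made up for this example.

```python
import struct


def read_pe_build_fields(data):
    """Return (major_linker, minor_linker, time_date_stamp) from raw PE bytes."""
    if data[:2] != b"MZ":
        raise ValueError("not an MZ executable")
    # e_lfanew at offset 0x3C holds the file offset of the "PE\0\0" signature.
    (pe_offset,) = struct.unpack_from("<I", data, 0x3C)
    if data[pe_offset:pe_offset + 4] != b"PE\x00\x00":
        raise ValueError("PE signature not found")
    coff = pe_offset + 4
    # TimeDateStamp is the third COFF field, after Machine and NumberOfSections.
    (timestamp,) = struct.unpack_from("<I", data, coff + 4)
    optional = coff + 20  # the COFF file header is 20 bytes long
    # Optional header: Magic (2 bytes), MajorLinkerVersion, MinorLinkerVersion.
    major, minor = struct.unpack_from("<BB", data, optional + 2)
    return major, minor, timestamp


# Tiny synthetic header for demonstration only (not a loadable executable).
header = bytearray(0x80)
header[0:2] = b"MZ"
struct.pack_into("<I", header, 0x3C, 0x40)
header[0x40:0x44] = b"PE\x00\x00"
struct.pack_into("<I", header, 0x40 + 4 + 4, 0x2A425E19)
struct.pack_into("<HBB", header, 0x40 + 24, 0x10B, 2, 0x19)

major, minor, stamp = read_pe_build_fields(bytes(header))
print(f"linker {major}.{minor}, TimeDateStamp {stamp:#x}")
```

For a real file, pass `open(path, "rb").read()`; the returned major/minor pair can then be looked up in the list above.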

Now, if you want to determine the compiler from the assembly code, a sign of a recent MSVC version is a call to the function that generates the stack cookie right at the entry point.

Similarly, a JMP instruction at the entry point followed by the string "fb:C++Hook" is a sign of Borland C++, and so on.

5

Does knowing the compiler used to generate an application help me to more effectively reverse engineer back from the generated object to what the source code might have been, and if it does help, how so?

Yes, it should help.

Even better:

  • the exact compiler version;
  • the exact command line parameters;
  • the build environment (OS, patch level, ...).

The idea is to:

  • build many small test programs that showcase different structures, and compile them;

  • look at the resulting machine code (noticing patterns).

A lot of these cases could be generalized over the major version of the compiler (if and other control structures, basic language functions, ...).

It is possible that there are some compiler-specific optimizations that differ a lot for the same program.

(I wonder if there are test case libraries for common/useful cases to aid reverse engineering of the machine code that a specific compiler generates.)

2
  • Sorry to be blunt, but you need to work on your formatting and get rid of the Random Capitals. Right now the answer is quite hard to read.
    – Igor Skochinsky
    Commented Mar 19, 2013 at 22:39
  • Was the Edit an Improvement? Commented Mar 19, 2013 at 22:56
3

If you are just talking about the machine code (or assembly code), there isn't much information. Most modern compilers will produce similar output, or the output won't be enough to see differences. One thing that may give an indication is compiler optimization, which I am not experienced with, so someone else should chime in. If you do have the entire ELF file, though, and symbols are available, you may be able to draw conclusions based on which libraries are linked (for example, libgcc would be a giveaway) or on the names of compiler-specific functions. If the ELF still contains its build metadata (for example the .comment section or debugging information), you may even see things like "GCC: (Ubuntu/Linaro 4.6.3-1ubuntu5) 4.6.3". If you are dealing with C++ code, the symbol name mangling can give it away.

However, as you asked yourself, I am curious why you need this information. I don't know how much knowing the compiler will help you. I do more work with ARM, and on that platform there is an Application Binary Interface (ABI) that compilers and assembly code must adhere to. This ABI specifies how functions should be called, which registers should be used for what, etc. For platforms without a strict ABI, operating systems often give developers guidance on such topics. Regardless, compilers should all create compatible code, so I don't know of any use for identifying the compiler that created the code.
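As a small example of the name-mangling hint, the sketch below guesses the mangling family from a symbol's prefix. It relies on two conventions: Itanium C++ ABI names (used by GCC and Clang) start with "_Z", and Microsoft-decorated C++ names start with "?". The function name `guess_mangling_scheme` is made up for this example, and the check says nothing about the compiler version.

```python
def guess_mangling_scheme(symbol):
    """Guess the C++ name-mangling family from a symbol's prefix."""
    # "__Z" covers platforms that prepend an extra underscore to symbols.
    if symbol.startswith(("_Z", "__Z")):
        return "Itanium ABI (GCC, Clang)"
    if symbol.startswith("?"):
        return "Microsoft (MSVC)"
    return "unknown or not mangled"


for name in ["_Z3fooi", "?foo@@YAHH@Z", "main"]:
    print(name, "->", guess_mangling_scheme(name))
```

Both sample symbols encode a function `foo` taking an `int`; plain C symbols such as `main` carry no such hint.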

3
  • 8
    This answer lacks a rationale or reference for why there wouldn't be differences in the output. My personal experience with x86 contradicts this, but my sample size is too small to say this is true in general. Also asking why this information is needed isn't really part of an answer but more a request for clarification and would better fit into a comment for the question.
    – jix
    Commented Mar 19, 2013 at 20:31
  • 2
    Thanks for the constructive criticism. I am new at answering questions so I don't understand all the details. I'll try to find more references.
    – Yifan
    Commented Mar 19, 2013 at 20:34
  • 4
    There are a surprising number of differences between compilers, especially in x86 code where there are so many different instructions to choose from. Switch statement implementations, stack layout decisions and register choices can all provide hints as to which compiler was used.
    – Dougall
    Commented Mar 20, 2013 at 2:23
