When I am looking at the machine code of an application, are there hints and patterns I can discern from the generated machine code which would indicate which compiler (and possibly version) was used to generate it?

Does knowing the compiler used to generate an application help me to more effectively reverse engineer back from the generated object to what the source code might have been, and if it does help, how so?

    When you say "help me to more effectively reverse engineer back from the generated object to what the source code might have been", is your goal to decompile the code, or to understand the functionality of the code?
  • Is it even possible to completely decompile the code? I'd say to decompile if possible, otherwise, to at least understand the functionality.
There is some academic research in this area; the keyword you want is 'toolchain provenance'. Nate Rosenblum wrote a pretty good paper on the topic. It has been a while since I read it, but many techniques can be used to establish this information: I think some use machine learning, and others use a big pile of heuristics or axioms about compiler behavior.

Establishing this is of limited utility IMO. It could be useful in an adversarial situation where you're trying to gather intelligence about a malware group or threat actor, but keep in mind that this kind of information can be obfuscated or destroyed. One potential use would be to establish that some binary was compiled using some company's SDK, if that SDK included a compiler with signature information unique to that company. Establishing the toolchain provenance can then help you make a case that someone who bought your SDK is in violation of a license or contract, say by producing malware.

One example of a behavioral difference is how parameters are written to the stack. There are two ways to place a value on the stack: one uses 'push', and the other uses mov with an esp-based address as the destination operand. So one compiler can do this:

push eax
push ebx

And another can do this:

mov [esp+foo], eax
mov [esp+foo+4], ebx

And they do. Generally, MSVC does the first example and GCC does the second example, at least in some very limited testing/observation just now...
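To make that observation a bit more concrete, here is a minimal Python sketch that counts the two styles in a textual disassembly listing. It assumes Intel-syntax lines, one instruction per line, with no address or byte columns in front; the function names are made up for illustration. Note that it also counts pushes that are not argument writes (such as push ebp in a prologue), so treat the result as a weak hint only.

```python
import re

# "push <operand>" at the start of an instruction line.
PUSH_RE = re.compile(r"^\s*push\s+\S+", re.IGNORECASE)
# "mov [esp...], <operand>": a store through an esp-based address.
MOV_ESP_RE = re.compile(
    r"^\s*mov\s+(?:dword\s+ptr\s+)?\[esp[^\]]*\]\s*,", re.IGNORECASE
)


def count_arg_styles(disassembly_lines):
    """Return (push_count, mov_esp_count) for a list of instruction lines."""
    pushes = sum(1 for line in disassembly_lines if PUSH_RE.match(line))
    movs = sum(1 for line in disassembly_lines if MOV_ESP_RE.match(line))
    return pushes, movs


def guess_style(disassembly_lines):
    """Name the dominant stack-write style, or 'undetermined' on a tie."""
    pushes, movs = count_arg_styles(disassembly_lines)
    if pushes == movs:
        return "undetermined"
    return "push-style" if pushes > movs else "mov-esp-style"
```

A real classifier would need far more features than this single ratio; the sketch only shows how a heuristic of this kind can be counted.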


When looking at machine code there is typically a "trail" that can be followed, unless the produced binary was somehow scrubbed. For example, I built a small "hello world" application with GCC on my Linux box using the standard options (gcc -Wall hello.c). If you open the result in a tool like hexedit, you can see that it has a section containing build information:

[Image: hexedit view of the binary showing the embedded build information]

You can clearly see there that I built this with GCC version 4.6.3. Other compilers leave other kinds of signatures, such as Microsoft's "Rich" signature.
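If you would rather not eyeball a hex dump, the same string can be searched for programmatically. The following is a rough Python sketch that scans raw file bytes for identification strings of the form "GCC: (vendor) version"; the function name is hypothetical, and the pattern only covers that one string layout.

```python
import re

# Matches strings such as b"GCC: (Ubuntu/Linaro 4.6.3-1ubuntu5) 4.6.3".
GCC_IDENT_RE = re.compile(rb"GCC: \(([^)\x00]*)\) ([0-9]+(?:\.[0-9]+)*)")


def find_gcc_idents(data):
    """Return a sorted list of unique (vendor, version) pairs found in data."""
    found = {
        (m.group(1).decode("ascii", "replace"), m.group(2).decode("ascii"))
        for m in GCC_IDENT_RE.finditer(data)
    }
    return sorted(found)
```

You would call it on the bytes of the file, for example find_gcc_idents(open("a.out", "rb").read()). An empty result proves nothing, since the string can simply be stripped.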

There was a presentation at Recon titled "Packer Genetics: The Selfish Code" that described one approach for this. They used some statistics to extract the most common code sequences from compiled programs and used it to detect the end of unpacking, but the approach can be used easily to identify specific compilers.

See from slide 15 here: http://blog.zynamics.com/2010/07/16/recon-slides-packer-genetics-the-selfish-code-bochspython/

The slides seem somewhat truncated, I believe the actual presentation had more info.


Does knowing the compiler used to generate an application help me to more effectively reverse engineer back from the generated object to what the source code might have been, and if it does help, how so?

I consider knowing which compiler was used a very important step, for the following reasons:

  1. It helps you select the proper tool(s) to analyze the target.
  2. Knowing the runtime is important for analysis. For example, in Delphi, TFileStream is a commonly used object for reading and writing files. Knowing the vtable of that object helps me understand whether a call through a given offset is a read, write, seek, etc.

To clarify 1 with an example: a tool such as IDR might be a better fit for a Delphi target than IDA Pro. Or at least we can generate a MAP file/IDC script with it that improves symbols in IDA. But for a target written in Visual Basic one might use VB Decompiler and so on.


I guess the first thing you should do to determine the compiler version (unless you literally mean the compiler version rather than the linker version) is inspect the "MajorLinkerVersion" and "MinorLinkerVersion" fields in the PE header of the executable, be it an EXE, DLL, or SYS file. See the list below.

Major  Minor  Version  Toolchain
0x5    0x0    (5.0)    Borland C++ / MS Linker 5.0
0x6    0x0    (6.0)    Microsoft Visual Studio 6
0x7    0xA    (7.10)   Microsoft Visual Studio 2003
0x8    0x0    (8.0)    Microsoft Visual Studio 2005
0x9    0x0    (9.0)    Microsoft Visual Studio 2008
0xA    0x0    (10.0)   Microsoft Visual Studio 2010
0x2    0x15   (2.21)   MinGW
0x2    0x19   (2.25)   Borland Delphi (linker

Unfortunately, packers and protectors tend to overwrite these values with their own, and/or otherwise make it harder to guess the original compiler.
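As a sketch of how to read those two fields without a PE viewer, the Python below parses them straight from the file bytes with the standard struct module. The function name and the lookup dictionary are my own illustration; the dictionary just restates the list above, and, as noted, a packer may have overwritten the values.

```python
import struct

# Restates the list above; keys are (MajorLinkerVersion, MinorLinkerVersion).
KNOWN_LINKERS = {
    (5, 0): "Borland C++ / MS Linker 5.0",
    (6, 0): "Microsoft Visual Studio 6",
    (7, 10): "Microsoft Visual Studio 2003",
    (8, 0): "Microsoft Visual Studio 2005",
    (9, 0): "Microsoft Visual Studio 2008",
    (10, 0): "Microsoft Visual Studio 2010",
    (2, 21): "MinGW",
    (2, 25): "Borland Delphi",
}


def read_linker_version(data):
    """Return (major, minor) linker version from the raw bytes of a PE file."""
    if data[:2] != b"MZ":
        raise ValueError("missing MZ signature")
    # e_lfanew at 0x3C holds the file offset of the PE signature.
    (pe_offset,) = struct.unpack_from("<I", data, 0x3C)
    if data[pe_offset:pe_offset + 4] != b"PE\x00\x00":
        raise ValueError("missing PE signature")
    # The optional header follows the 4-byte signature and 20-byte COFF header.
    opt_offset = pe_offset + 4 + 20
    magic, major, minor = struct.unpack_from("<HBB", data, opt_offset)
    if magic not in (0x10B, 0x20B):  # PE32 or PE32+
        raise ValueError("unexpected optional header magic")
    return major, minor
```

Usage would be along the lines of KNOWN_LINKERS.get(read_linker_version(data), "unknown"), where data is the result of reading the file in binary mode.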

Also, the resource directory of an executable is a good place to look for specific linker info. For example, an RT_RCDATA resource named "DVCLAL" is a sign of Borland C++ or Delphi, and in an MSVC-built executable the RT_MANIFEST resource can tell us the specific version of the runtime DLLs it is linked against, and hence the compiler version.
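A proper check would walk the resource directory, but as a quick first pass a naive byte scan is often enough. The Python sketch below rests on two assumptions: that the resource name is stored as a UTF-16LE string, and that the manifest is embedded as plain ASCII/UTF-8 XML naming an assembly such as Microsoft.VC90.CRT. The function name is made up, and a match anywhere in the file is only a hint, not proof that the bytes sit in the resource section.

```python
import re

# Resource names in the resource directory are stored as UTF-16LE strings.
DVCLAL_UTF16 = "DVCLAL".encode("utf-16-le")
# Assembly names such as Microsoft.VC90.CRT inside an embedded manifest;
# either quote style is accepted.
VC_ASSEMBLY_RE = re.compile(rb'name=["\'](Microsoft\.VC[0-9]+\.[A-Za-z]+)["\']')


def resource_hints(data):
    """Return a list of human-readable hints found by scanning raw bytes."""
    hints = []
    if DVCLAL_UTF16 in data:
        hints.append("DVCLAL resource name found (Borland C++ or Delphi)")
    names = {m.group(1).decode("ascii") for m in VC_ASSEMBLY_RE.finditer(data)}
    for name in sorted(names):
        hints.append("manifest references " + name)
    return hints
```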

Also, an executable with the "TimeDateStamp" field set to 0x2A425E19 is a sign of being built with Delphi.
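The timestamp check is easy to script as well. This Python sketch reads TimeDateStamp from the COFF file header and compares it with the constant quoted above; the function names are hypothetical, and the constant is taken from this answer rather than verified independently.

```python
import struct

DELPHI_TIMESTAMP = 0x2A425E19  # value quoted in the text above


def read_timedatestamp(data):
    """Return the TimeDateStamp field from the raw bytes of a PE file."""
    if data[:2] != b"MZ":
        raise ValueError("missing MZ signature")
    # e_lfanew at 0x3C holds the file offset of the PE signature.
    (pe_offset,) = struct.unpack_from("<I", data, 0x3C)
    if data[pe_offset:pe_offset + 4] != b"PE\x00\x00":
        raise ValueError("missing PE signature")
    # After the 4-byte signature: Machine (2 bytes), NumberOfSections (2 bytes),
    # then the 4-byte TimeDateStamp.
    (stamp,) = struct.unpack_from("<I", data, pe_offset + 8)
    return stamp


def looks_like_delphi_timestamp(data):
    """True if the header carries the fixed timestamp described above."""
    return read_timedatestamp(data) == DELPHI_TIMESTAMP
```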

Now, if you want to determine the compiler from the assembly code, one sign of a recent MSVC version is a call to the function that generates the stack cookie right at the entry point.

Similarly, a JMP instruction at the entry point followed by the string "fb:C++Hook" is a sign of Borland C++, and so on.


Does knowing the compiler used to generate an application help me to more effectively reverse engineer back from the generated object to what the source code might have been, and if it does help, how so?

Yes, it should help.

Even better:

  • the exact compiler version;
  • the exact command line parameters;
  • the build environment (OS, patch level, ...).

The idea is to:

  • build lots of small test programs that showcase different structures, and compile them;

  • look at the resulting machine code (noticing patterns).

A lot of these cases could be generalized over the major version of the compiler (if and other control structures, basic language functions, ...).

It is possible that there are some compiler-specific optimizations that differ a lot for the same program.

(I wonder if there are test case libraries for common/useful cases to aid reverse engineering of the machine code that a specific compiler generates.)

If you are talking only about the machine code (or assembly code), there isn't much information. Most modern compilers produce similar output, or the output isn't enough to reveal differences. One thing that may give an indication is compiler optimization, which I am not experienced with; someone else should chime in. If you have the entire ELF file, though, and symbols are available, you may be able to draw conclusions from which libraries are linked (libgcc, for example, would be a giveaway) or from the names of compiler-specific functions. If the ELF contains debugging information you may even see things like "GCC: (Ubuntu/Linaro 4.6.3-1ubuntu5) 4.6.3". If you are dealing with C++ code, the symbol name mangling can give it away.
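To illustrate the name-mangling point, here is a small Python sketch that sorts symbol names by mangling scheme. It relies on two well-known prefixes: Itanium C++ ABI names (used by GCC and Clang) start with _Z, and MSVC-decorated names start with ?. The function names are my own, and other schemes are deliberately lumped into "unknown".

```python
from collections import Counter


def classify_mangling(symbol):
    """Classify one symbol name as 'itanium', 'msvc', or 'unknown'."""
    # Some platforms prepend an extra underscore, giving "__Z...".
    if symbol.startswith("_Z") or symbol.startswith("__Z"):
        return "itanium"
    if symbol.startswith("?"):
        return "msvc"
    return "unknown"


def tally_mangling(symbols):
    """Count how many symbols fall into each mangling scheme."""
    return Counter(classify_mangling(s) for s in symbols)
```

Fed with the names from a symbol table dump, the dominant non-"unknown" bucket tells you the family of compilers, though not the exact compiler or version.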

However, as you yourself ask, I am curious why you need this information. I don't know how much knowing the compiler will actually help you. I do more work with ARM, and on that platform there is an Application Binary Interface (ABI) that compilers and assembly code must adhere to. This ABI specifies how functions should be called, which registers are used for what, and so on. For platforms without a strict ABI, operating systems often document these conventions for developers. Regardless, compilers should all produce compatible code, so I don't know of any use for identifying the compiler that created the code.

