How can a compiler compile itself?

Question

I am researching CoffeeScript on the website http://coffeescript.org/, and it has the text

The CoffeeScript compiler is itself written in CoffeeScript

How can a compiler compile itself, or what does this statement mean?

Another term for a compiler that can compile itself is a self-hosting compiler. See programmers.stackexchange.com/q/263651/6221 — oɔɯǝɹ, Commented Jun 18, 2016 at 23:01
There are at least two copies of the compiler involved. A pre-existing one compiles a new copy. The new one may or may not be identical to the old one. — bdsl, Commented Jun 19, 2016 at 9:37
You may also be interested in Git: its source code is tracked, of course, in a Git repository. — Greg d'Eon, Commented Jun 20, 2016 at 15:39
This is about like asking "How could a Xerox Printer print the schematics to itself?" Compilers compile text to byte code. If the compiler can compile to any usable byte code, you could write the compiler code in the respective language and then pass the code through the compiler to generate the output. — RLH, Commented Jun 21, 2016 at 14:11
@AlexD Yes, you can use a hammer to make another hammer, but you cannot use a hammer to make itself! It works the same with compilers! A compiler cannot compile itself! — pabrams, Commented Jun 23, 2016 at 21:12

nbro · Accepted Answer · 2017-08-01 16:41:18Z

238

The first edition of a compiler can't be machine-generated from a programming language specific to it; your confusion is understandable. A later version of the compiler with more language features (with source rewritten in the first version of the new language) could be built by the first compiler. That version could then compile the next compiler, and so on. Here's an example:

1. The first CoffeeScript compiler is written in Ruby, producing version 1 of CoffeeScript.
2. The source code of the CS compiler is rewritten in CoffeeScript 1.
3. The original CS compiler compiles the new code (written in CS 1) into version 2 of the compiler.
4. Changes are made to the compiler source code to add new language features.
5. The second CS compiler (the first one written in CS) compiles the revised new source code into version 3 of the compiler.
6. Repeat steps 4 and 5 for each iteration.

Note: I'm not sure exactly how CoffeeScript versions are numbered; the version numbers above are just an example.

This process is usually called bootstrapping. Another example of a bootstrapping compiler is rustc, the compiler for the Rust language.

edited Aug 1, 2017 at 16:41

nbro


answered Jun 18, 2016 at 20:18

Ben N


6

The other route for bootstrapping a compiler is to write an interpreter for (a subset of) your language.
– Aron
Commented Jun 22, 2016 at 8:51
As one more alternative to bootstrapping with a compiler or interpreter written in another language, the very old-school route would be to hand-assemble the compiler source. Chuck Moore runs through how to do this for a Forth interpreter in chapter 9, "Programs that bootstrap", at the end of Programming a Problem-Oriented Language (web.archive.org/web/20160327044521/www.colorforth.com/POL.htm), based on having done it twice before by hand. Code entry here is done via a front panel that allows direct storage of values to memory addresses controlled by toggle switches for bits.
– Jeremy W. Sherman
Commented May 9, 2017 at 14:32


200_success · Accepted Answer · 2016-06-19 07:38:41Z

In the paper Reflections on Trusting Trust, Ken Thompson, one of the originators of Unix, writes a fascinating (and easily readable) overview of how the C compiler compiles itself. Similar concepts can be applied to CoffeeScript or any other language.

The idea of a compiler that compiles its own code is vaguely similar to a quine: source code that, when executed, produces as output the original source code. Here is one example of a CoffeeScript quine. Thompson gave this example of a C quine:

char s[] = {
    '\t',
    '0',
    '\n',
    '}',
    ';',
    '\n',
    '\n',
    '/',
    '*',
    '\n',
    … 213 lines omitted …
    0
};

/*
 * The string s is a representation of the body
 * of this program from '0'
 * to the end.
 */

main()
{
    int i;

    printf("char\ts[] = {\n");
    for(i = 0; s[i]; i++)
        printf("\t%d,\n", s[i]);
    printf("%s", s);
}

Next, you might wonder how the compiler is taught that an escape sequence like '\n' represents ASCII code 10. The answer is that somewhere in the C compiler, there is a routine that interprets character literals, containing some conditions like this to recognize backslash sequences:

…
c = next();
if (c != '\\') return c;        /* A normal character */
c = next();
if (c == '\\') return '\\';     /* Two backslashes in the code means one backslash */
if (c == 'r')  return '\r';     /* '\r' is a carriage return */
…

So, we can add one condition to the code above…

if (c == 'n')  return 10;       /* '\n' is a newline */

… to produce a compiler that knows that '\n' represents ASCII 10. Interestingly, that compiler, and all subsequent compilers compiled by it, "know" that mapping, so in the next generation of the source code, you can change that last line into

if (c == 'n')  return '\n';

… and it will do the right thing! The 10 comes from the compiler, and no longer needs to be explicitly defined in the compiler's source code.¹

That is one example of a C language feature that was implemented in C code. Now, repeat that process for every single language feature, and you have a "self-hosting" compiler: a C compiler that is written in C.
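The two generations of that routine can be sketched roughly as follows. This is only an illustration in Python rather than C, and the function names `read_escape_gen1` and `read_escape_gen2` are made up for this example. The first version spells out the numeric codes; the second writes the escapes themselves and relies on the implementation that runs it (here, the Python interpreter) already knowing what they mean.

```python
def read_escape_gen1(c: str) -> int:
    """First generation: character codes are spelled out as numbers."""
    if c == "\\":
        return 92   # backslash, given numerically
    if c == "n":
        return 10   # newline, given numerically
    raise ValueError(f"unknown escape: \\{c}")


def read_escape_gen2(c: str) -> int:
    """Second generation: the source uses the escapes it is defining.

    The numbers no longer appear in this code; they are supplied by
    whatever implementation runs it.
    """
    if c == "\\":
        return ord("\\")
    if c == "n":
        return ord("\n")
    raise ValueError(f"unknown escape: \\{c}")
```

Both functions return the same values, but only the first one states the number 10 anywhere in its own source.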

¹ The plot twist described in the paper is that since the compiler can be "taught" facts like this, it can also be mis-taught to generate trojaned executables in a way that is difficult to detect, and such an act of sabotage can persist in all compilers produced by the tainted compiler.

While this is an interesting bit of information, I don't think it answers the question. Your examples assume you already have a bootstrapped compiler, or else in which language is the C compiler written? — Arturo Torres Sánchez, Commented Jun 20, 2016 at 5:10
@ArturoTorresSánchez Different explanations work well for different people. I'm not aiming to reiterate what has been said in other answers. Rather, I find the other answers speak at a higher level than how I like to think. I personally prefer a concrete illustration of how one single feature is added, and letting the reader extrapolate from that, instead of a shallow overview. — 200_success, Commented Jun 20, 2016 at 6:00
OK, I understand your perspective. It's just that the question is more “how can a compiler compile itself if the compiler to compile the compiler doesn't exist” and less “how to add new features to a bootstrapped compiler”. — Arturo Torres Sánchez, Commented Jun 20, 2016 at 15:03
The question itself is ambiguous and open-ended. It appears that some people interpret it to mean "how can a CoffeeScript compiler compile itself?". The flippant response, as given in a comment, is "why shouldn't it be able to compile itself, just like it compiles any code?" I interpret it to mean "how can a self-hosting compiler come into existence?", and have given an illustration of how a compiler can be taught about one of its own language features. It answers the question in a different way, by providing a low-level illustration of how it is implemented. — 200_success, Commented Jun 20, 2016 at 20:32
@ArturoTorresSánchez: "[I]n which language is the C compiler written?" Long ago I maintained the original C compiler noted in the old K&R appendix (the one for IBM 360.) Many people know that first there was BCPL, then B, and that C was an improved version of B. In fact, there were many parts of that old compiler that were still written in B, and had never been rewritten to C. The variables were of the form single letter/digit, pointer arithmetic wasn't assumed to be automatically scaled, etc. That old code testified to the bootstrapping from B to C. The first "C" compiler was written in B. — Eliyahu Skoczylas, Commented Jun 26, 2016 at 10:18

Jörg W Mittag · Accepted Answer · 2016-06-19 00:20:59Z

30

You have already gotten a very good answer; however, I want to offer you a different perspective that will hopefully be enlightening. Let's first establish two facts that we can both agree on:

1. The CoffeeScript compiler is a program which can compile programs written in CoffeeScript.
2. The CoffeeScript compiler is a program written in CoffeeScript.

I'm sure you can agree that both #1 and #2 are true. Now, look at the two statements. Do you see now that it is completely normal for the CoffeeScript compiler to be able to compile the CoffeeScript compiler?

The compiler doesn't care what it compiles. As long as it's a program written in CoffeeScript, it can compile it. And the CoffeeScript compiler itself just happens to be such a program. The CoffeeScript compiler doesn't care that it's the CoffeeScript compiler itself it is compiling. All it sees is some CoffeeScript code. Period.

How can a compiler compile itself, or what does this statement mean?

Yes, that's exactly what that statement means, and I hope you can see now how that statement is true.

answered Jun 19, 2016 at 0:20

Jörg W Mittag


2

I don't know much about coffee script but you could clarify point 2 by stating that it WAS written in coffee script but was since compiled and is then machine code. And anyhow, could you please explain the chicken and egg problem then. If the compiler was written in a language that a compiler had not yet been written for, then how can the compiler even run or be compiled?
– barlop
Commented Jun 19, 2016 at 9:47
7

Your statement 2 is incomplete/inaccurate and very misleading, since, as the first answer says, the first compiler was not written in CoffeeScript. That is very relevant to his question. And as to "How can a compiler compile itself, or what does this statement mean?" you say "Yes". I suppose so (though my mind's a bit small); I see it's used to compile earlier versions of itself, rather than itself. But is it used to compile itself also? I suppose it'd be pointless to.
– barlop
Commented Jun 19, 2016 at 10:10
2

@barlop: Change statement 2 to "Today, the CoffeeScript compiler is a program written in CoffeeScript." Does that help you understand it better? A compiler is "just" a program that translates an input (code) into an output (program). So if you have a compiler for language Foo, then write the source code for a Foo-compiler in the language Foo itself, and feed that source to your first Foo-compiler, you get a second Foo-compiler as output. This is done by a lot of languages (for example, all C compilers I know of are written in… C).
– DarkDust
Commented Jun 19, 2016 at 15:49
3

The compiler can't compile itself. The output file is not the same instance as the compiler that produces the output file. I hope you can see now how that statement is false.
– pabrams
Commented Jun 20, 2016 at 20:19
4

@pabrams Why do you assume that? The output could well be identical to the compiler used to produce it. For instance, if I compile GCC 6.1 with GCC 6.1, I get a version of GCC 6.1 compiled with GCC 6.1. And then if I use that to compile GCC 6.1, I also get a version of GCC 6.1 compiled with GCC 6.1, which should be identical (ignoring things like timestamps).
– Stack Exchange Supports Israel
Commented Jun 21, 2016 at 22:25


Polygnome · Accepted Answer · 2016-06-19 08:54:34Z

How can a compiler compile itself, or what does this statement mean?

It means exactly that. First of all, some things to consider. There are four objects we need to look at:

1. The source code of any arbitrary CoffeeScript program
2. The (generated) assembly of any arbitrary CoffeeScript program
3. The source code of the CoffeeScript compiler
4. The (generated) assembly of the CoffeeScript compiler

Now, it should be obvious that you can use the generated assembly - the executable - of the CoffeeScript compiler to compile any arbitrary CoffeeScript program, and generate the assembly for that program.

Now, the CoffeeScript compiler itself is just an arbitrary CoffeeScript program, and thus, it can be compiled by the CoffeeScript compiler.

It seems that your confusion stems from the fact that when you create your own new language, you don't yet have a compiler you can use to compile your compiler. This surely looks like a chicken-and-egg problem, right?

Introduce the process called bootstrapping.

1. You write a compiler in an already existing language (in the case of CoffeeScript, the original compiler was written in Ruby) that can compile a subset of the new language.
2. You write a compiler that can compile a subset of the new language in the new language itself. You can only use language features the compiler from the step above can compile.
3. You use the compiler from step 1 to compile the compiler from step 2. This leaves you with an assembly that was originally written in a subset of the new language, and that is able to compile a subset of the new language.

Now you need to add new features. Say you have only implemented while-loops, but also want for-loops. This isn't a problem, since you can rewrite any for-loop in such a way that it is a while-loop. This means you can only use while-loops in the source code of your compiler, since the assembly you have at hand can only compile those. But you can create functions inside your compiler that can parse and compile for-loops. Then you use the assembly you already have, and compile the new compiler version. And now you have an assembly of a compiler that can also parse and compile for-loops! You can now go back to the source file of your compiler, and rewrite any while-loops you don't want into for-loops.

Rinse and repeat until all language features that are desired can be compiled with the compiler.

while and for obviously were only examples, but this works for any new language feature you want. And then you are in the situation CoffeeScript is in now: The compiler compiles itself.
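The while/for walkthrough can be modelled in a few lines of Python. This is a deliberately crude sketch: compilers are reduced to sets of feature names, and `CompilerSource`, `CompilerBinary` and `build` are names invented for this illustration, not anything from the CoffeeScript toolchain.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CompilerSource:
    """Compiler source code, reduced to two sets of language features."""
    uses: frozenset        # features the source itself is written with
    implements: frozenset  # features the resulting compiler will accept


@dataclass(frozen=True)
class CompilerBinary:
    """An executable compiler, reduced to the features it accepts."""
    accepts: frozenset


def build(compiler: CompilerBinary, source: CompilerSource) -> CompilerBinary:
    """Compile `source` with `compiler`, failing on unsupported features."""
    unsupported = source.uses - compiler.accepts
    if unsupported:
        raise ValueError(f"compiler cannot handle: {sorted(unsupported)}")
    return CompilerBinary(accepts=source.implements)


# Stage 0: a compiler obtained by other means (e.g. written in another language).
stage0 = CompilerBinary(accepts=frozenset({"while"}))

# Stage 1: the source is written with while-loops only, but implements for-loops.
stage1 = build(stage0, CompilerSource(uses=frozenset({"while"}),
                                      implements=frozenset({"while", "for"})))

# Stage 2: the source may now be rewritten to use for-loops itself.
stage2 = build(stage1, CompilerSource(uses=frozenset({"while", "for"}),
                                      implements=frozenset({"while", "for"})))
```

Trying to build the stage 2 source directly with `stage0` raises an error in this model, which is the reason the new feature has to be introduced in two steps.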

There is much literature out there. Reflections on Trusting Trust is a classic everyone interested in that topic should read at least once.

(The sentence "The CoffeeScript compiler is itself written in CoffeeScript", is true, but "A compiler can compile itself" is false.) — pabrams, Commented Jun 20, 2016 at 20:15
No, it's completely true. The compiler can compile itself. It just doesn't make sense. Say you have the executable that can compile Version X of the language. You write a compiler that can compile Version X+1, and compile it with the compiler you have (which is version X). You end up with an executable that can compile version X+1 of the language. Now you could go and use that new executable to re-compile the compiler. But to what end? You already have the executable that does what you want. The compiler can compile any valid program, so it completely can compile itself! — Polygnome, Commented Jun 21, 2016 at 11:04
Indeed it's not unheard of to build quite a few times, iirc modern freepascal builds the compiler a total of 5 times. — plugwash, Commented Jun 23, 2016 at 19:47
@pabrams Writing "Do not touch" and "Hot object. Do not touch" makes no difference to the intended message of the phrase. As long as the intended audience of the message (Programmers) understand the intended message of the phrase (A build of the compiler can compile its source) regardless of how it is written, this discussion is pointless. As it stands now, your argument is invalid. Unless you are able to show that the intended audience of the message is non-programmers, then, and only then, you are correct. — DarkDestry, Commented Jun 24, 2016 at 23:27
@pabrams 'Good English' is English that communicates ideas clearly to the intended audience, and in the way that the writer or speaker intended. If the intended audience is programmers, and programmers understand it, it's good English. Saying "Light exists as both particles and waves" is fundamentally equivalent to "Light exists as both photons and electromagnetic waves". To a physicist, they mean literally the same thing. Does that mean we should always use the longer and clearer sentence? No! Because it complicates reading when the meaning is already clear to the intended audience. — DarkDestry, Commented Jun 25, 2016 at 14:34

nbro · Accepted Answer · 2017-08-01 17:10:59Z

A small but important clarification

Here the term compiler glosses over the fact that there are two files involved. One is an executable which takes as input files written in CoffeeScript and produces as its output file another executable, a linkable object file, or a shared library. The other is a CoffeeScript source file which just happens to describe the procedure for compiling CoffeeScript.

You apply the first file to the second, producing a third which is capable of performing the same act of compilation as the first (possibly more, if the second file defines features not implemented by the first), and so may replace the first if you so desire.
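To make the "first file, second file, third file" distinction concrete, here is a toy sketch in Python. The "compiler" is only a translator that drops comments and blank lines, its source is kept in a string, and `load` (a made-up helper) stands in for turning compiler output into something runnable. None of this reflects how the real CoffeeScript compiler is built.

```python
# The second file: source text of a toy "compiler".
COMPILER_SOURCE = r'''
# Toy "compiler": drops comment lines and blank lines from its input.
def translate(text):
    kept = []
    for line in text.splitlines():
        # skip blank lines and whole-line comments
        if line.strip() and not line.strip().startswith("#"):
            kept.append(line)
    return "\n".join(kept) + "\n"
'''


def load(source: str):
    """Turn compiler source text into a callable (our stand-in for an executable)."""
    namespace = {}
    exec(source, namespace)
    return namespace["translate"]


first = load(COMPILER_SOURCE)    # the first file: a runnable compiler
output = first(COMPILER_SOURCE)  # apply it to the second file, its own source
third = load(output)             # the third file: a new runnable compiler

# The third behaves like the first, so it can replace it.
same_result = third(COMPILER_SOURCE) == output
```

In this sketch the very first runnable copy comes from the Python interpreter, which plays the role of the pre-existing compiler. Once `third` exists, feeding it the same source again reproduces `output` exactly.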

nbro · Accepted Answer · 2017-08-01 16:42:44Z

5

The CoffeeScript compiler was first written in Ruby.
The CoffeeScript compiler was then re-written in CoffeeScript.

Since the Ruby version of the CoffeeScript compiler already existed, it was used to create the CoffeeScript version of the CoffeeScript compiler.

This is known as a self-hosting compiler.

It's extremely common, and usually results from an author's desire to use their own language to maintain that language's growth.

edited Aug 1, 2017 at 16:42

nbro


answered Jul 6, 2016 at 23:52

Trevor Hickey



Paul92 · Accepted Answer · 2016-06-21 13:22:55Z

3

It's not a matter of compilers here, but a matter of expressiveness of the language, since a compiler is just a program written in some language.

When we say that "a language is written/implemented" we actually mean that a compiler or interpreter for that language is implemented. There are programming languages in which you can write programs that implement the language (are compilers/interpreters for the same language). These languages are called universal languages.

In order to be able to understand this, think about a metal lathe. It is a tool used to shape metal. It is possible, using just that tool, to create another, identical tool, by creating its parts. Thus, that tool is a universal machine. Of course, the first one was created using other means (other tools), and was probably of lower quality. But the first one was used to build new ones with higher precision.

A 3D printer is almost a universal machine. You can print nearly all of a 3D printer using a 3D printer (the exception is the tip that melts the plastic).

answered Jun 21, 2016 at 13:22

Paul92


I like the lathe analogy. Unlike the lathe analogy, though, imperfections in the first compiler iteration are passed along to all subsequent compilers. For example, an above answer mentions adding a for-loop feature where the original compiler only uses while loops. The output understands for-loops, but the implementation is with while loops. If the original while loop implementation is flawed or inefficient, then it always will be!
– user2026256
Commented Jun 21, 2016 at 14:58
@Physics-Compute that is simply wrong. In the absence of malice defects don't usually propagate when compiling a compiler.
– plugwash
Commented Jun 23, 2016 at 19:54
Assembly translations certainly do get passed from iteration to iteration until the assembly translation is fixed. New features that build off old features do not change the underlying implementation. Think about it for a while.
– user2026256
Commented Jun 23, 2016 at 23:42
@plugwash See "Reflections on Trusting Trust" by Ken Thompson - ece.cmu.edu/~ganger/712.fall02/papers/p761-thompson.pdf
– user2026256
Commented Mar 30, 2017 at 16:19


Guy Argo · Accepted Answer · 2016-06-21 20:55:18Z

3

Proof by induction

Inductive step

The n+1th version of the compiler is written in X.

Thus it can be compiled by the nth version of the compiler (also written in X).

Base case

But the first version of the compiler written in X must be compiled by a compiler for X that is written in a language other than X. This step is called bootstrapping the compiler.
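The induction maps directly onto a recursive function. The sketch below only produces a description of who built whom, and `build_version` is a name invented for this illustration.

```python
def build_version(n: int) -> str:
    """Describe how version n of a compiler written in X gets built."""
    if n < 1:
        raise ValueError("versions start at 1")
    if n == 1:
        # Base case: needs a compiler for X that is not itself written in X.
        return "v1 (built by an external compiler)"
    # Inductive step: version n is compiled by version n-1.
    previous = build_version(n - 1)
    return f"v{n} (built by {previous})"
```

For example, `build_version(3)` returns `"v3 (built by v2 (built by v1 (built by an external compiler)))"`: every chain, however long, ends at the one compiler that came from outside.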

answered Jun 21, 2016 at 20:55

Guy Argo


1

The very first compiler for language X can easily be written in X. This is possible because that first compiler can be interpreted (by an X interpreter written in a language other than X).
– Kaz
Commented Jun 23, 2016 at 0:21


Alex D · Accepted Answer · 2020-09-09 11:36:03Z

While other answers cover all the main points, I feel it would be remiss not to include what may be the most impressive example known of a compiler which was bootstrapped from its own source code.

Decades ago, a man named Doug McIlroy wanted to build a compiler for a new language called TMG. Using paper and pen, he wrote out source code for a simple TMG compiler... in the TMG language itself.

Now, if only he had a TMG interpreter, he could use it to run his TMG compiler on its own source code, and then he would have a runnable, machine-language version of it. But... he did have a TMG interpreter already! It was a slow one, but since the input was small, it would be fast enough.

Doug ran the source code on that paper on the TMG interpreter behind his eye sockets, feeding it the very same source as its input file. As the compiler worked, he could see the tokens being read from the input file, the call stack growing and shrinking as it entered and exited subprocedures, the symbol table growing... and when the compiler started emitting assembly language statements to its "output file", Doug picked up his pen and wrote them down on another piece of paper.

After the compiler finished execution and exited successfully, Doug brought the resulting hand-written assembly listings to a computer terminal, typed them in, and his assembler converted them into a working compiler binary.

So this is another practical (???) way to "use a compiler to compile itself": Have a working language implementation in hardware, even if the "hardware" is wet and squishy and powered by peanut butter sandwiches!

nbro · Accepted Answer · 2017-08-01 16:51:43Z

Compilers take a high-level specification and turn it into a low-level implementation, such as can be executed on hardware. Therefore there is no relationship between the format of the specification and the actual execution besides the semantics of the language being targeted.

Cross-compilers move from one system to another system; cross-language compilers compile one language specification into another language specification.

Basically, compiling is just translation, usually from a higher-level language to a lower-level one, but there are many variants.

Bootstrapping compilers are the most confusing, of course, because they compile the language they are written in. Don't forget the initial step in bootstrapping, which requires at least a minimal existing version that is executable. Many bootstrapped compilers work on the minimal features of a programming language first and add more complex language features going forward, as long as each new feature can be expressed using the previous features. If that were not the case, that part of the "compiler" would have to be developed in another language beforehand.
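As a rough sketch of "a new feature expressed using the previous features", the Python below rewrites a `for` statement into the `set`, `add` and `while` statements that a minimal core already understands. The tuple-based statement format and the names `desugar` and `run` are assumptions made up for this example.

```python
def desugar(stmt):
    """Rewrite ('for', var, start, stop, body) using only core statements."""
    kind = stmt[0]
    if kind == "for":
        _, var, start, stop, body = stmt
        loop_body = [desugar(s) for s in body] + [("add", var, 1)]
        return ("block", [("set", var, start), ("while", var, stop, loop_body)])
    if kind == "while":
        _, var, stop, body = stmt
        return ("while", var, stop, [desugar(s) for s in body])
    if kind == "block":
        return ("block", [desugar(s) for s in stmt[1]])
    return stmt  # 'set' and 'add' are already core statements


def run(stmt, env):
    """Interpreter for the core language only; it has no case for 'for'."""
    kind = stmt[0]
    if kind == "set":
        env[stmt[1]] = stmt[2]
    elif kind == "add":      # add a constant, or the value of a named variable
        amount = stmt[2]
        env[stmt[1]] += env[amount] if isinstance(amount, str) else amount
    elif kind == "while":    # loop while var < stop
        _, var, stop, body = stmt
        while env[var] < stop:
            for s in body:
                run(s, env)
    elif kind == "block":
        for s in stmt[1]:
            run(s, env)
    else:
        raise ValueError(f"unsupported statement: {kind}")
    return env


# total = 0; for i from 0 up to (not including) 4: total += i
program = ("block", [("set", "total", 0),
                     ("for", "i", 0, 4, [("add", "total", "i")])])
result = run(desugar(program), {})
```

The core interpreter rejects the original `program` because it contains `for`, but it runs the desugared version without any change to `run` itself.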
