$\begingroup$

This feels like a really basic question, but I haven't been able to find an answer to it. How should a proof assistant implement separate compilation? Maybe the answer is "just like any interpreted programming language with a bytecode precompiler", but I don't know where to find the answer to that either. I found a lot of tutorials on the web for how to write an interpreter, but as far as I could tell, all of them stopped short of defining a way to load other source files, and tutorials on writing machine-code compilers all seem to focus on placement of code at physical addresses, which doesn't seem especially relevant.

Specifically, what I want to know is how to deal with symbol resolution between compilation units. During ordinary typechecking, I am giving each defined constant a unique identity (currently an autonumber integer, like a global De Bruijn level), and then occurrences of that constant in other terms are resolved to reference that identity. That way a definition remains unchanged even if the particular names that are in scope to refer to the other constants it uses change, such as by import or leaving a section.

This is fine during a single run of the typechecker, as the autonumbers can be continually incremented no matter how many files we check. But if I want to store the typechecked terms in a "bytecode" file so that they can be loaded into another file without being re-typechecked, then the stored autonumbers will be those from the compilation run, which may conflict with those from the later run that's trying to load it. I could of course make up a solution, but this is clearly an already-solved problem and I just need a pointer to the best-practice approach.

$\endgroup$
  • $\begingroup$ Are you compiling to a low-level language, so that you cannot reconstruct the original terms from the compiled code, or are you just trying to store type-checked internal version of terms for fast loading later on? $\endgroup$ Commented Jun 15 at 8:12
  • $\begingroup$ Just storing type-checked internal versions of terms. $\endgroup$ Commented Jun 15 at 15:21
  • $\begingroup$ This sort of feels like a question for langdev.stackexchange.com (though I'd specifically ask about "implementing symbol resolution in separate compilation") $\endgroup$
    – cody
    Commented Jun 17 at 20:39
  • $\begingroup$ Cool, I didn't know about langdev.stackexchange.com, thanks! $\endgroup$ Commented Jun 18 at 20:17

1 Answer

$\begingroup$

I am going to propose the most obvious solution in order to see why you dislike it. Keep in mind that loading anything from the disk is an order of magnitude slower than performing actions in memory, so an additional pass through the loaded code will not be a bottleneck.

Let $M$ be the map taking an identifier name $x$ to its unique identity $M(x)$, presumed to be an integer. There is also a global counter $C$ whose value is larger than any identity so far.

Upon saving a unit $A$, you also have to save the accompanying map $M_A$, so that later on, when $A$ is loaded, other (non-compiled) units will be able to refer to entities in $A$ by their names.

Upon loading a unit $A$ and its accompanying map $M_A$, we perform a global shift: we change each $M_A(x)$ to $M_A(x) + C$, and we likewise shift all identities occurring in $A$'s terms by adding $C$ to them. We also increment $C$ by the size of $M_A$. (If you want to keep the stored integers small, you can shift the identities in $M_A$ down so that they start at $0$ when saving the file.)
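The shift-on-load step can be sketched as follows. The unit format here (a name-to-identity map plus terms represented as lists of the identities they reference) is a made-up illustration, not any particular proof assistant's format:

```python
# Sketch of the shift-on-load scheme: all identities in a saved unit are
# shifted up by the global counter C, so they cannot collide with
# identities already allocated in the current run.

def load_unit(unit_map, unit_terms, global_map, counter):
    """Load a saved unit by shifting its identities up by `counter`.

    unit_map   : dict mapping name -> identity (as saved, starting at 0)
    unit_terms : list of terms; a term is a list of identity references
    global_map : dict mapping name -> identity for everything loaded so far
    counter    : first fresh identity (larger than any identity in use)
    """
    shifted_map = {name: ident + counter for name, ident in unit_map.items()}
    shifted_terms = [[ident + counter for ident in term] for term in unit_terms]
    global_map.update(shifted_map)          # merge into the global map
    new_counter = counter + len(unit_map)   # bump past the new identities
    return shifted_terms, new_counter

# Example: a unit defining `nat` (identity 0) and `zero` (identity 1,
# whose term refers back to `nat`).
M_A = {"nat": 0, "zero": 1}
terms_A = [[], [0]]            # `zero`'s term mentions identity 0 (= nat)
M, C = {}, 100                 # 100 identities already allocated globally
terms, C = load_unit(M_A, terms_A, M, C)
print(M)      # {'nat': 100, 'zero': 101}
print(terms)  # [[], [100]]
print(C)      # 102
```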

If your identities are not integers and you cannot shift them by addition, you can instead create a temporary remapping table $R$ that takes each $M_A(x)$ to a fresh identity $R(M_A(x)) = M'_A(x)$. You then traverse $A$, apply $R$ throughout, and replace $M_A$ with $M'_A$. This solution is more principled and has negligible additional cost.
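The remapping-table variant might look like this; the string-valued identities and the `fresh` generator are stand-ins for whatever identity supply the implementation actually uses:

```python
import itertools

# Sketch of the remapping-table variant: instead of adding a constant,
# build a table R sending each saved identity to a fresh one, then
# rewrite the unit's map and terms through R.

def remap_unit(unit_map, unit_terms, fresh):
    """Rewrite a unit's identities via a temporary remapping table R."""
    R = {old: next(fresh) for old in unit_map.values()}  # old -> fresh identity
    remapped_map = {name: R[old] for name, old in unit_map.items()}
    remapped_terms = [[R[old] for old in term] for term in unit_terms]
    return remapped_map, remapped_terms

# Identities here are opaque strings rather than integers.
fresh = (f"id{n}" for n in itertools.count(100))
M_A = {"nat": "a", "zero": "b"}
terms_A = [[], ["a"]]          # `zero`'s term refers to `nat`
M_A2, terms_A2 = remap_unit(M_A, terms_A, fresh)
print(M_A2)      # {'nat': 'id100', 'zero': 'id101'}
print(terms_A2)  # [[], ['id100']]
```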

You also need to merge $M_A$ with your global $M$, the details of which may depend on your naming scheme for compilation units.

The proposed solution is standard. The first sentence of the Wikipedia page on dynamic linking says (emphasis mine):

In computing, a dynamic linker is the part of an operating system that loads and links the shared libraries needed by an executable when it is executed (at "run time"), by copying the content of libraries from persistent storage to RAM, *filling jump tables and relocating pointers*.


Regarding storing data on the disk: this is known as code marshaling. Real programming languages have good support for it; OCaml has the standard library module Marshal. It's fast and it saves you a ton of trouble, but note that code marshaled with this module is specific to the OCaml compiler, and thus not very portable, so you would still have to distribute source code. (Python is like that: nobody tries to distribute compiled Python.) However, you should only worry about portability after your proof assistant has more than 666 users, at which point one of them will volunteer to implement portable marshaling for you.
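The OCaml Marshal module mentioned above has a direct analogue in Python's `pickle`, which illustrates the same trade-off: fast, effortless serialization of in-memory values, in a format tied to the implementation rather than a portable one.

```python
import os
import pickle
import tempfile

# Round-trip a compiled unit (map + terms) through the disk, in the
# style of Marshal.to_channel / Marshal.from_channel in OCaml.
unit = {"map": {"nat": 0, "zero": 1}, "terms": [[], [0]]}

path = os.path.join(tempfile.mkdtemp(), "unit.bin")
with open(path, "wb") as f:
    pickle.dump(unit, f)          # cf. Marshal.to_channel
with open(path, "rb") as f:
    reloaded = pickle.load(f)     # cf. Marshal.from_channel

print(reloaded == unit)  # True
```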

$\endgroup$
  • $\begingroup$ Yes, this is indeed the first thing I thought of also. Thanks for writing it out clearly with justification. The complication that I was not sure what to do about is that the unit $A$ could itself import another unit $B$, which might already have been compiled at the time $A$ was compiled. So there is an $M_B$ from when $B$ was compiled, and then when $A$ was compiled it had to shift $M_B$ by adding a constant $C_B$, and those are the identities of the $B$ constants appearing in the saved $A$ terms. $\endgroup$ Commented Jun 15 at 15:39
  • $\begingroup$ But now at the point that $A$ is loaded, maybe we've already loaded $B$ and shifted its $M_B$ by a different constant $C_B'$. So when we pass through the $A$ terms, we don't just have to rename all the $A$ identities by adding one constant $C_A$, we have to rename all the $B$ identities by subtracting $C_B$ and adding $C_B'$. So that means that when we encounter an identity while traversing the $A$ terms, we have to be able to tell which unit it came from originally, and have stored enough information to know what to add to it. $\endgroup$ Commented Jun 15 at 15:39
  • $\begingroup$ As I said, I could write some code that tries to do this, but it's starting to feel about as complicated as manually munging De Bruijn indices when weakening under binders, and hence as easy to get wrong. So I was hoping that someone else had written out instructions or sample code that I could just copy and be more confident of not making a mistake. $\endgroup$ Commented Jun 15 at 15:45
  • $\begingroup$ I'll amend my answer. $\endgroup$ Commented Jun 15 at 21:53
  • $\begingroup$ I can recommend Xavier Leroy's A modular module system. Even though it's not directly addressing your concerns, it is a very nicely written paper that shows how to think about these matters. $\endgroup$ Commented Jun 17 at 18:54
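For the transitive-import complication raised in the comments, one standard alternative (not part of the answer's scheme, named here for illustration) is to make identities globally stable pairs of (unit name, local index). A constant from $B$ then has the same identity in whatever unit's terms it appears, so no shifting or remapping is needed at load time, only a check that each unit is loaded at most once. A minimal sketch:

```python
# Sketch of globally stable qualified identities: an identity is a pair
# (unit_name, local_index), so references to B's constants stored inside
# A's terms remain valid however and whenever B is loaded.

class Env:
    def __init__(self):
        self.units = {}   # unit_name -> list of definitions for that unit

    def load(self, unit_name, definitions):
        """Load a unit whose terms refer to constants by (unit, index) pairs."""
        if unit_name in self.units:
            return  # already loaded; identities are stable, nothing to shift
        self.units[unit_name] = definitions

    def lookup(self, ident):
        unit_name, index = ident
        return self.units[unit_name][index]

env = Env()
env.load("B", ["nat"])         # B's constant 0 is the definition "nat"
env.load("A", [("B", 0)])      # A's constant 0 is a term referring to B's nat
print(env.lookup(("A", 0)))    # ('B', 0)
print(env.lookup(env.lookup(("A", 0))))  # nat
```

The cost of this design is that identities are bigger than a single integer; the usual mitigation is to intern unit names so the pair is two machine words.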
