11
$\begingroup$

Let's say I have some code in my target language like:

fun someFunction(arg1: Int, arg2: String) {
    var a = foo()
    if a < 10 {
        var b = a+1    
        var a = a+b+2  
        ...
    }    

The second a shadows the first one.

In my compiler, I created a Scope class to track names and their definitions. The scopes roughly correspond to the nodes in the abstract syntax tree (AST). Each scope has a map of name → info, and a pointer to its enclosing scope.

But my problem is that this data structure doesn't work well when I get shadowing like this. For example, in what Scope do you evaluate a+b+2? If it's in the function body scope, then it misses the b, if it's in the if's scope, then it gets the uninitialized a, not the outer shadowed one, which I think is what it should use.

Perhaps I have not modeled this correctly. Perhaps it's more accurate to say that each var declaration creates a new nested Scope, used by the code that comes after it in a given {...} block. So the scope tree has more nodes, and they do not so exactly correspond to AST nodes or curly brace blocks.

If I implement this, then many of my Scope objects would be nothing but a single variable, and looking up a name would be a linear walk up the linked list (stack) of scopes. Maybe that's fine, but I'm wondering if there are some standard data structures used in compilers that are better than this linked list stack approach.

I think I would make Scope an abstract class and so that some scopes with many names could use an internal hash table. Scopes for a class could be like this, while scopes created by local var declarations could have just the one name, and a different implementation of a virtual lookup(name) method.

$\endgroup$
4
  • 2
    $\begingroup$ Is allowing var a = a+b+2 core to the language design here? Most block-structured languages have scoping rules that preclude this sort of construction, but yours doesn’t have to; it seems like your var is more like the let … in … construct typical in functional languages. The standard data structures are aligned with standard designs and you can borrow from them when that’s what you have, but we should check what it is that’s needed first. $\endgroup$
    – Michael Homer
    Commented Sep 14, 2023 at 19:20
  • 1
    $\begingroup$ Rust, for example, allows multiple let declarations that shadow each other, right in the same {...} block. Eg, you say let x = 1; and on the next line let x = x+1 and that's a second x that will be 2. Other languages I've seen allow you shadow variables, but you need to at least be in a nested block. So, yes, I guess my var is like a let ... in ... but without the rightward drift that happens in some languages I've used. $\endgroup$
    – Rob N
    Commented Sep 14, 2023 at 20:58
  • $\begingroup$ JavaScript, for example, would generate an error for let a = a + b + 2; because the let binding shadows the outer a for the entire block, and you can't reference a variable before its initialization is complete. $\endgroup$
    – Barmar
    Commented Sep 14, 2023 at 21:43
  • $\begingroup$ Stacks go brrrr $\endgroup$
    – Seggan
    Commented Sep 15, 2023 at 19:12

4 Answers 4

7
$\begingroup$

This is actually a question about language semantics, not about scopes. Your data structure is correct in essence. The question is if the languages introduces the inner var a before evaluating its initializer. If you can choose yourself because it's your language, create an intermittent error declaration with the semantics of using it results in an "unitialized entity used" error. If the language allows using the outer a in the inner initializer, then the natural implementation is to evaluate the initializer's AST before the variable declaration itself.

For local name resolution, I'd actually walk up the AST. The number of reachable variable declarations and enclosing blocks is usually very small resulting in overall better performance than building chains of maps.

$\endgroup$
5
$\begingroup$

This is called a persistent array.

If you don't need the Scope object accessible after finishing processing a scope, you could just use one symbol table, modify the symbol table directly when entering a scope, but record the modifications so that you could undo the modifications in reversed order when exiting a scope. You could use a stack to store information about the current scope and its ancestors. And you only have information about these scopes. All other Scope objects are destroyed.

Each variable name could also be considered an independent stack. Without the following improvements, you may need a unified stack to record which variable names are modified, but the exact modifications are independent from each other.

Then if you want the Scope object available after exiting the scope, for each variable name, you could sort the modifications (shadow and unshadow) of the variable in the above process by time, and maintain them in a search tree. Each Scope is bound to a time (version id). The current state of a variable name is the new value in the latest modification of that variable name before the version id. When you want to create a subscope, create two version ids for entering and exiting the scope just after entering its parent scope, and create the appropriate modifications on the entering version, and undo modifications on the exiting version. The version ids could be created and compared in another shared tree, instead of just using integers, if you may create subscopes in non-sequential order. A parent scope itself should not be modified after a subscope is created, so the the new value on undo could just be a copy of the previous value and don't need to be updated.

The structure is like a simplified Euler tour tree. It is possible to also support modifying a parent scope while there are already subscopes, if used in combination with a full Euler tour tree, to replace undoing.

This is O(logn) time and O(1) space per operation, where n is the number of scopes, assuming variable names are prehashed and don't cost extra time for looking up the tree bound to that variable name.

Some improvements using vEB trees and constant time data structure for maintaining version ids could make it O(loglogn) per operation, though I doubt anyone would really want to implement it.

You could also just use a persistent search tree to store the symbol table, but with O(logn) space overhead per operation.

$\endgroup$
2
  • $\begingroup$ Are there non-research languages, where you can bind a name from a left local scope? $\endgroup$
    – feldentm
    Commented Sep 16, 2023 at 5:36
  • $\begingroup$ @feldentm If you are asking about using a name after exiting a scope, I think classes might be a case, where the stack is the inheritance path. But it might be more practical to just copy the symbol table, as vtables already work like that, and it isn't very useful to create a new class after each member definition. $\endgroup$
    – user23013
    Commented Sep 17, 2023 at 20:06
3
$\begingroup$

The real issue, I would argue, is that there is no one-size-fits-all:

  • At module/file scope, you may have a large number of unique identifiers, especially counting the imported ones. Here, a map definitely makes sense.
  • At class/type scope, you may still have a fairly large number of identifiers. The Iterator trait (Rust) has a whooping ~75 methods at the time of writing, and it's one trait amongst many that a type may implement, coming on top of the inherent methods, and fields, of the type.
  • At function scope, you typically have only a handful of arguments and variables.

As a result, I think your idea of using a Scope interface, and a polymorphic implementation makes sense, so you can specialize based on the type of Scope.


For handling the variables (and arguments) of a function, I would argue for... a stack!

You are, in essence, already thinking of creating a stack of Scopes... it'll be much more efficient to have a single Scope with a stack of variables instead! You'll need a separate one, for popping, too:

class VariableScope(Scope):
    def __init__(self, parent: Scope):
        self.parent = parent
        self.variables: List[(String, ???)] = list()
        self.blocks: List[Int] = list()

    def enter_block(self):
        self.blocks.append(len(self.variables))

    def exit_block(self):
        before = self.blocks.pop()
        del self.variables[before:]

    def add_variable(self, name: String, data: ???):
        self.variables.append((name, data))

    def search_variable(self, name: String) -> Option[???]:
        for n, d in reversed(self.variables):
            if n == name:
                 return d

        return self.parent.search_variable(name)

Finally, with regard to your issue about var a := a + 1, it's just a matter of timing.

If you do not want the newly declared a to be visible in a + 1, insert it into the scope only after having resolved a + 1. That's all there is to it.

$\endgroup$
2
$\begingroup$

I assume the semantics of your language is that the function body is evaluated as if executed sequentially, one line after the other, the same as when reading text. Also, the meaning of names is given by the same order, plus scopes. (As opposed to, for example, code given just dependencies and letting compiler decide evaluation order.)

Perhaps you are thinking scopes as static, as expressed by the graph/drawing.

But perhaps you can solve this by growing scopes: e.g. for the if scope,

  1. it starts empty,
  2. then it contains b's definition; at this point it does not yet contain a, so a gets resolved to the containing scope;
  3. then you add a's definition; at this point you already have b, and whether a means this a or the outer one is up to your language, as pointed out in ohter answers and comments (but you can resolve it either way).

Just an idea. :)

$\endgroup$

You must log in to answer this question.

Not the answer you're looking for? Browse other questions tagged .