Normy JS IR AST

2021-09-02

Too many abbreviations there, sorry about that. The IR ("Intermediate Representation") AST that Preval uses is currently very similar to the regular ESTree form. However, the IR kind of evolved without a concrete plan, and along the way I collected ideas about what it might have looked like had I designed it up front. In other words: "knowing what I know now". In this post I'll try to outline that design and the ideas around it.

Series


This post is part of a mini series describing the normalization steps taken in Preval.

In this post I discuss the design of the IR AST.

Overview


In the previous posts I tried to outline the normalization that happens in the first phase. After this phase completes we should be left with normalized code, which would be the IR in code. Let's start with a summary of the kinds of expressions and statements we can encounter in the code:

Statements


These are the atomic statements. All the expressions must be Simple Expressions (at worst), except for the expression statement, which can be an Assignment Expression (worst case).

Code:
if (x) {} else {}
while (x) {}
for (x in y) {}
for (x of y) {}
label: {}
label: while (x) {}
label: for (x in y) {}
label: for (x of y) {}
break;
break label;
continue;
continue label;
return x;
throw x;
x;
let x = AE;
const x = AE;
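
For a concrete feel of how regular code maps onto these atomic statements, here's a hand-normalized sketch (the tmp names are made up for illustration; this is not Preval's actual output):

```javascript
// Input: if (a() && b) return c.d;
// A hand-normalized equivalent using only the atomic statements above:
function norm(a, b, c) {
  const tmpAndLhs = a();      // the call is hoisted into its own const
  let tmpIfTest = tmpAndLhs;  // logical && becomes explicit branching
  if (tmpAndLhs) {
    tmpIfTest = b;
  } else {
  }
  if (tmpIfTest) {
    const tmpRet = c.d;       // the member expression gets its own const
    return tmpRet;
  } else {
    return undefined;
  }
}
```

Every compound expression gets hoisted into its own statement, which is what makes the short statement list above exhaustive.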

Of these, I'm still considering folding the continue and break into forms that always have a label instead. One step further would be to prepend a label to every statement by default. It would be normalized, but I'm not sure it'd be very helpful down the line. Maybe if we could make this visually appealing it could be fine, but since labels must be globally unique, that's not likely.

Code:
// Would label-everywhere work?
G0 : const tmpAssignComputedRhs$1 = function ($$0) {
G0F0 : const _0x5628ae = $$0;
G0F1 : debugger;
G0F2 : const tmpBinBothRhs$15 = _0x5628ae.which;
G0F3 : tmpBinBothRhs$15 == 0;
G0F4 : const tmpBinBothRhs$17 = _0x5628ae.which;
G0F5 : const tmpIfTest$15 = 17 == tmpBinBothRhs$17;
G0F6 : if (tmpIfTest$15) {
G0B6I0 : tmpSSA_ic = false;
G0B6I1 : return undefined;
G0B6I2 : if (tmpIfTest$15) {
G0B6I2I0: tmpSSA_ic = true;
G0B6I2I1: return 100;
} else {
G0B6I2E0: return undefined;
}
} else {
G0B6E0 : return undefined;
}
};

You can encode the AST structure into the label name and indent to make it like a gutter. But I don't think this scales properly. Additionally, syntactically there'd be trouble for binding declarations. So we can safely ditch this thought experiment; it won't work.

Labeled loops are a special syntactical edge case because a labeled continue may only target a label that is a direct parent of a loop. So we can't wrap the loop body in a block as we'd like to. Maybe at some point we'll find a way to rewrite labels out, but for now that's not the case.

The for-in/of loops cannot easily be broken down further. Doing so would be detrimental. Arguably the same is true for switch statements, but here we are.
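
To see why lowering for-of further would be detrimental, here is a sketch of what a faithful manual desugaring has to deal with; even this simplified version turns one statement into a pile of observable calls:

```javascript
// Roughly what `for (const x of y) body(x)` means under the hood.
function forOf(y, body) {
  const iter = y[Symbol.iterator](); // fetching the iterator can already spy
  let step = iter.next();
  while (!step.done) {
    body(step.value);
    step = iter.next();              // every iteration is an observable call
  }
  // Note: a faithful desugaring also needs try/finally plus iter.return()
  // for abrupt exits (break / throw / return); omitted here for brevity.
}
```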

Expressions


The number of expressions in normalized form is fairly limited as many kinds have been aggressively removed and broken down into equivalent atomics.

In these examples, the single character idents represent Simple Expressions. While expression statements can only be Trivial Expressions (TE), for the sake of example we'll show all Assignable Expressions (AE) since those are recursive with assignments and var decls.

Code:
// simple (SE)
null;
true;
false;
5;
`foo`; // Maybe this should be string, anyways
x;
undefined; // Assumed to be a primitive
Infinity; // Assumed to be a primitive
NaN; // Assumed to be a primitive

// template with expressions is Trivial, not Simple
`foo ${ident} bar`; // Note: primitives resolve immediately so must be ident

// invokes (TE)
f(a, b, c);
x.y(a, b, c);
x[y](a, b, c);
new f(a, b);

// binary (with any binary op) (TE)
a + b;

// unary (with any unary op except void) (AE)
!a;

// assignments
a = AE;
a.b = c;
a[ b ] = c;

// other (AE)
(function($$0, $$1, ...$$2) {
let a = $$0;
let b = $$1;
let c = $$2;
debugger;
return undefined;
});
(class{
m(a, b){}
});
(class extends x{
m(a, b){}
});
[a, b, ...c];
({x: a, y: b, ...c});
this;
/foo/;

// unexplored other AE
super.foo;
super();
import.meta;
new.target;
await X;
yield x;
(async function($$0, $$1, ...$$2) {
let a = $$0;
let b = $$1;
let c = $$2;
debugger;
return undefined;
});
(function *($$0, $$1, ...$$2) {
let a = $$0;
let b = $$1;
let c = $$2;
debugger;
return undefined;
});
(async function *($$0, $$1, ...$$2) {
let a = $$0;
let b = $$1;
let c = $$2;
debugger;
return undefined;
});

You can see that I haven't spent too much time with classes since you could consolidate the extends to Object. At the same time I'm not sure that would simplify checks since you'd still have to check what it might extend. But hey, it's probably better to consolidate the atomics.

The template vs string case is a little annoying. At some point I normalized all the strings to templates. Any template with an expression would transform the expression to a coercion to string alias in the preamble phase. This way we could guarantee that any template is a non-spying guaranteed string, albeit not a primitive just yet. Reconsidering that now, the primitive vs non-primitive distinction feels like the same trap and I'd be better off with making non-expression templates into strings. Hmmm.

The normalization step will guarantee that all binding names are unique and that explicit binding names do not clash with any implicit binding names encountered. Since we can preemptively mark certain idents as appearing implicitly, we can guarantee that occurrences of undefined, NaN, and Infinity are actually the primitives and not some shadowing nonsense (i.e. function f(undefined) { log(undefined); } becomes function f(undefined$1) { log(undefined$1); }). That's important since those values are not keywords like true, false, and null are.
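
This matters because a shadowed undefined is a real binding with a real value; the snippet below shows the footgun that the unique renaming removes:

```javascript
// `undefined` is not a keyword, so a param can legally shadow it:
function f(undefined) {
  return undefined; // refers to the param, not the global value!
}
// After renaming, the name `undefined` is unambiguous again:
function f$1(undefined$1) {
  return undefined$1;
}
```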

So far I've done nothing with generators and async functions; using either of them crashes the build with a todo assert. I think those can be treated as spying expressions, though, not that much different from a call to an implicit global (which could affect virtually anything). One difference: while the body of the current function is suspended, local closures should still be considered mutable, which is not the case with implicit globals.

Design goals


The first goal of the IR is to simplify reasoning about the code. We achieve this by breaking down complex constructs with certain edge cases into more atomic constructs with far fewer edge cases, preferably none. The switch, patterns, and logic expressions are such examples.

Another goal is that it should lead to code that is more likely to shape into certain patterns. The idea is that the next phase applies deeper rules by looking at certain patterns. If, while creating and applying these rules, we can concentrate on using a particular structure or predictable pattern then the patterns we're looking for are more likely to be found. Putting bindings as close to their first use may help resolving them. Putting static param operations to the top of a function may allow outlining them. Moving the remaining code after an if into one of its branches when the other branch abruptly completes, may allow reductions that would otherwise not be possible. And so forth.

Unfortunately these patterns are not easy to formulate.

Minimize closure occurrences


One example may be that for the sake of static analysis it helps if we have as few "closures" as possible. That is, it's much much easier to reason about bindings when they are only used in the same scope as where they are created. Preval will try to inline function calls under a few conditions. There are many tricks that can only be safely applied to bindings that are not closures.

Function promotion


Likewise, a function that is global is easier to manipulate than a function that is local. Preval will try to "promote" functions to as high a scope as possible in hopes of eliminating local dependencies.

Let elimination


The let bindings are sometimes actually constants, but it can be difficult to determine this safely during static analysis. Assignment analysis will only get you so far, and when a binding is used in a closure you must always be careful, since any observable statement could technically be calling code that can observe the state of a mutable binding. By moving assignments as close to their declaration as possible we can sometimes safely eliminate these uncertainties and conclude that the initial value of a let cannot be observed.
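
A small hand-written sketch of the kind of rewrite this enables:

```javascript
// Before: is `a` really mutable? Its init value is never read:
function before(x) {
  let a = 0;
  a = x + 1; // the only read of `a` happens after this write
  return a;
}
// After moving the assignment onto the declaration, `a` is plainly a const:
function after(x) {
  const a = x + 1;
  return a;
}
```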

Size


The size is not a concern right now. I'm operating under the (potentially misguided) assumption that we can recognize certain patterns and recombine atomic expressions and statements into their higher level, more condensed forms. While that may be a long shot for the switch statement, it should be easier for logical constructs. Either way, it's a low priority in the grand scheme of things.

Names


The names of bindings in the AST must be unique. This way we don't have to worry about scope tracking. The original names are not relevant to us as ideally we'd be minifying the code in the end, anyways. The preamble phase will ensure that all labels and binding names are globally unique in the code such that the next phases can use a global reference to resolve and track binding meta data.
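
A minimal sketch of the kind of global renaming bookkeeping this implies (my own illustration, not Preval's actual implementation):

```javascript
// Hand out globally unique names; known implicit globals are pre-seeded
// so explicit bindings can never collide with `undefined`, `NaN`, etc.
function createNamer(implicits = ['undefined', 'NaN', 'Infinity', 'arguments']) {
  const taken = new Set(implicits);
  return function unique(name) {
    if (!taken.has(name)) {
      taken.add(name);
      return name;
    }
    let n = 1;
    while (taken.has(name + '$' + n)) n += 1;
    const fresh = name + '$' + n;
    taken.add(fresh);
    return fresh;
  };
}
```

With a single global namespace like this, later phases can key binding meta data on the name alone, no scope tracking required.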

AST deviations


With a custom IR comes a custom AST. I currently have a small set of modifications to the ESTree model. I did have plans for a few more but could never be bothered to implement them.

Function parameters


Since I wanted to simplify binding mutations I wanted to eliminate the odd duck that is function parameters. These do get an assignment, but the way they are assigned is a bit implicit. So instead I chose to make the actual parameters a predictable incremental symbol and assign that symbol in the function header to an explicit local binding. By treating the function header in a special way these special tokens never leak outside of it. I can treat them as implicit globals if nothing else and consider them "any", so to speak.

Code:
function f($$0, $$1, $$2) {
  let range = $$0;
  let size = $$1;
  let desc = $$2;
  debugger;
  ...

The parameter nodes are normally regular Identifier nodes, but in this AST they are a special custom Param node. This node has two properties that identifiers lack: rest, to indicate whether or not it is a rest parameter, and index, which is pinned to the param index. The name is always $$ followed by the param index. Since they're not Identifier nodes in the AST, they can ignore the "unique ident" rule. When printing they do end up as regular idents in the code, of course. That's kind of the whole point.

The binding inits are also Param nodes. They should not leak outside the function header because I don't want to have to worry about them more than I have to.
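
In plain object form the custom node might look like this (a loose sketch of the shape described above, not exact ESTree):

```javascript
// A `Param` node next to a regular `Identifier` for comparison:
const param = {
  type: 'Param',
  name: '$$1',  // always `$$` + index
  index: 1,     // pinned param position
  rest: false,  // true for `...$$n`
};
const ident = { type: 'Identifier', name: 'size' };
// The function header pairs them up, as in `let size = $$1;`
const headerDecl = { type: 'VariableDeclaration', kind: 'let', id: ident, init: param };
```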

Strings


Another tweak to the AST is to force all strings to be templates, whether they were or weren't. This means template strings appear in places where they are normally not allowed, like import/export sources and object keys. The printer will make sure that they are printed in a syntactically correct way so it's really only inside the AST.

Preval currently considers two distinct cases of templates, which is why in hindsight I'd probably revert this move: strings with and without expressions. The normalization phase will ensure that the expressions inside a template are only Simple Expressions, so idents or primitives. But such a template is still an operation, so it is not interchangeable with a string that is actually a primitive, even if in many cases the result is the same.

For example, we do constant inlining. We can't safely replace constants with a template with expressions when these expressions may not be reachable everywhere. Or the bindings may be mutable. That's different from a template without expressions since those are guaranteed to be immutable and don't have this access problem.

Ideas for nodes


Those are actually the only deviations I can think of. However, while developing Preval I did consider a few other modifications. These modifications are quite impactful to the code base so it's non-trivial to introduce them on a whim. But knowing what I know now, if I were to create this IR from scratch there are a few other new nodes I'd introduce.

Primitives


The AST is a bit of a mess regarding primitives. There are Literal nodes, but they also include regular expressions. That makes sense because regexes have a literal notation but they are also objects so we can't assume them to be primitives. Likewise, the language has some constants built in that were never codified to be keywords: undefined, NaN, and Infinity. So in an AST those are just Identifier nodes. That makes sense in regular JS where these values can be shadowed, but in Preval this is not the case once identifiers are made globally unique. That's because if, for example, code does happen to shadow undefined through a function param name the normalization would force it to get a suffix because it knows there's an implicit global with that name.

So to eliminate this range of problems I'd rather introduce a Primitive node that can represent the various primitive notations that JS sports.
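
A sketch of what that node could look like (hypothetical shape, my own naming):

```javascript
// One node type for every primitive notation, keyword or not:
function Primitive(value) {
  return { type: 'Primitive', value, kind: value === null ? 'null' : typeof value };
}
const nodes = [Primitive(undefined), Primitive(NaN), Primitive(Infinity), Primitive(true), Primitive(null)];
// "Is this a primitive?" becomes a single type test, instead of
// special-casing Literal vs Identifier vs regex-carrying Literal.
const isPrimitive = (node) => node.type === 'Primitive';
```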

Assignment tracking


In the next phase the assignment tracking plays an important role. It is used to determine the kinds of values a binding may have when it is only referenced in the same scope where it was created. This includes condition branching and loop logic. By tracking which assignments can be observed by which reads or writes of a binding, we can simplify the code by applying "SSA" ("Static Single Assignment" form) transformations. This gets harder (for static analysis) when the binding is used in multiple scopes, let alone mutated.

Code:
let a = 10; // This assignment can only be observed by the next line
f(a);
a = 20; // This can not be observed so we can drop it
if (x) {
  a = 30;
} else {
  a = 40;
}
f(a); // This can only be 30 or 40, nothing else

Currently the assignment tracking is a bit tacked on since I created it midway through developing Preval. But I'd rather have a more cemented, fleshed out way of encoding this "graph" into the AST so we can access it trivially and have it always available. It's relevant enough.
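
A sketch of what "encoding this graph" might mean for straight-line code: every write records which reads can observe it (the shape is my own illustration):

```javascript
// Minimal reaching-definitions style bookkeeping.
// ops: [{kind: 'write', id, value} | {kind: 'read', id}]
function trackAssignments(ops) {
  const lastWrite = new Map();
  for (const op of ops) {
    if (op.kind === 'write') {
      op.observedBy = []; // reads that can see this write
      lastWrite.set(op.id, op);
    } else {
      const w = lastWrite.get(op.id);
      if (w) w.observedBy.push(op);
    }
  }
  return ops;
}
```

A write whose observedBy set stays empty, like the a = 20 in the example above, is a candidate for removal.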

Observability of statements


Each statement has at most one observable side effect. But these are relatively expensive or annoying to check for, especially recursively in a loop. So it would be nice if the AST had this information encoded into it already. That may simplify determining whether there might be any side effects between a read and any write it might observe. That's currently more of a hack than anything else and it annoys me every time I have to look into it.

Binding mutation


In normalized form expressions have a common theme: they are used as the rhs of a binding mutation or they are a statement. For the sake of discussion we could consider an expression statement as an assignment to void, or to a (fake / transient) binding name that does not exist or is never observed. Additionally, a binding declaration is really also the action of assigning an expression to a binding. So in a way, all Assignable Expressions that are not Simple Expressions are assigned to a binding.

Code:
// Note: using "var" because that would syntactically be a noop as long as the name is not a lexical binding
const tmpAssignComputedRhs$1 = function ($$0) {
  const _0x5628ae = $$0;
  debugger;
  const tmpBinBothRhs$15 = _0x5628ae.which;
  var _ = f(tmpBinBothRhs$15 == 0);
  const tmpBinBothRhs$17 = _0x5628ae.which;
  const tmpIfTest$15 = 17 == tmpBinBothRhs$17;
  if (tmpIfTest$15) {
    var _ = tmpSSA_ic = false;
    return undefined;
    if (tmpIfTest$15) {
      var _ = tmpSSA_ic = true;
      return 100;
    } else {
      return undefined;
    }
  } else {
    return undefined;
  }
};

Of course, when printing we could just as well omit the var _ = part because it serves no real purpose.

Expressions currently have three different ways to appear (binding declaration, assignment, statement). I think there's merit in unifying them into a single node type, even the statement one. It eliminates the ugly varNode.declarations[0] access we have to do everywhere. For every case we can say node.lhs and node.rhs. And they would exist, even for the statement case. It can have an additional property, like kind, that tells us whether this is a let or const declaration, or an assignment. When it matters, the statement case can be concluded by checking the name of the lhs.

To eliminate the somewhat special case of assignment to a property we can also transform those into a call expression instead: $propAssign(foo, "bar", val) to mean foo.bar = val. This way the lhs of an assignment is invariably an ident, eliminating another edge case. Property assignments are a separate kind of assignment anyways.
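
A runtime sketch of what that call would do ($propAssign is the post's own hypothetical symbol, not an existing API):

```javascript
// foo.bar = val  ->  $propAssign(foo, 'bar', val)
// foo[x] = val   ->  $propAssign(foo, x, val)
function $propAssign(obj, key, val) {
  obj[key] = val;
  return val; // assignments are expressions, so preserve the value
}
```

After this rewrite the lhs of every remaining assignment is a plain ident, which is the whole point.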

Function header


As mentioned before, my AST already has an enforced function header. It would be nice if this was codified in the AST so I don't have to manually keep track of body offsets and such. So a function would have a head property that would be a block. When printing, this block would disappear and be merged into the top of the body. We also wouldn't need the debugger statement which we currently use as an artifact to mark the separation.

On the topic of functions, I guess there would be a simple Function node since we no longer distinguish between the different kinds of functions. Of course it would have properties to indicate async or generator, and potentially also method (haven't dug into that area yet). The function wouldn't have a name since in normalized form it would invariably be assigned to a binding.

One thing that might be nice is to trace functions to their origin. This might help with debugging, but also with the edge case of nasty obfuscators doing a .toString() on functions to detect whether they have been beautified (they'll throw you into an infinite loop if they detect any spacing / anomalies). We could codify the original function string or its location in the original input as a way to trace forward. And we could track more meta data like the original function name, which may help with debugging.

When a function gets split somehow, they would both get the same meta data (because why not) and when a function gets inlined, I guess this meta data is dropped.

For header


The for-in and for-of statements are currently regular nodes and their lhs is forced to be a simple node. However, I've always considered making this a special node as well, with the lhs being a binding declaration (as a special new node) and the rhs a Simple Expression. The fresh binding can then be assigned inside the body of the loop as a special header, just like we do for functions. The trouble is that this introduces a block header concept which would then need to be applied to all blocks, or risk leaking this special lhs binding to other places. So I haven't done this yet. But with a dedicated node we could do the same as for functions: create a head property that is a block.

Additionally the node can have a kind property indicating whether it's an in or of loop (or even a for await of). This way we only have to worry about one For node.

Loop nodes


As a stretch we could make a generic Loop node which would be one of for-in, for-of, or while. Note that there are no other loops in normalized form. But I'm not so sure that's helpful, as in many loop cases we only care about the regular loop, being while. On the other hand we do have many cases where we search for any form of looping, like when resolving label cases. So maybe consolidation is a good idea.

Arguments symbol


The implicit local binding arguments is an odd duck inside Preval because it's the only local binding that keeps its original name and therefore has multiple occurrences.

The value itself is forcefully aliased in the function header, if the function accesses it at all. As references are dropped, the alias binding gets dropped, and a statement that is just the arguments ident gets dropped too, so it self heals.

Arguments length


A special case of arguments is checking how many arguments were passed in the current call. This is separate from regular arguments access since we need much less from it. Right now there's a bunch of special casing around this, and it would be nice to have a dedicated symbol that means arguments.length and distinguishes it from any other use of arguments.
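
Today this shows up as a header alias; a dedicated symbol would make the intent explicit (sketch; $$argc is a made-up name):

```javascript
// Current shape: alias `arguments.length` in the function header,
// even though only the count is needed, never the arguments object:
function f(a, b) {
  const tmpArgLen = arguments.length;
  debugger;
  return tmpArgLen;
}
// Desired instead: something like `const tmpArgLen = $$argc;` so the
// analysis never has to reason about the full `arguments` object.
```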

Coercion


There are a few ways to coerce a value from a certain or unknown type to another. For example you can do !x, +x, and '' + x. In the end it's the coercion logic that may trigger side effects in many cases, so it makes sense to dedicate a special symbol to this.

This symbol would essentially convey that the result is guaranteed to be a certain type, whatever the input. This is relevant for the next phase, where typing information is used for analysis.
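
A sketch of such a symbol ($coerce is my own hypothetical name): whatever spying side effects the input may trigger happen inside the call, and the result type is guaranteed.

```javascript
// After this call the result is guaranteed to be the requested
// primitive type, regardless of the input value.
function $coerce(x, kind) {
  switch (kind) {
    case 'string': return '' + x;
    case 'number': return +x;
    case 'boolean': return !!x; // note: `!x` flips; `!!x` coerces
    default: throw new Error('unknown coercion: ' + kind);
  }
}
```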

Everything a call


What if everything were a call? We can rewrite binary expressions into a call that applies the same operator to the args (basically hiding the expression behind a call). Same for unary expressions. Call expressions are, well, already calls. And new expressions can either be kept or also become calls (hiding the operator).

Code:
a + b

// ->

$bin(a, '+', b); // special cased call expression

I wonder whether such an IR would make things simpler. Every expression would either be atomic or converted to a call. Then the analysis can search for calls of a particular symbol. Would it actually help the analysis, or is it no more than an interesting brain fart?

My concern is that it doesn't actually solve anything, and certainly not the complexity and edge cases; it just moves them to a different place. Does it really matter whether we have to check the operator of a binary expression versus the callee of a call expression? Not so sure. But perhaps the high level work, traversal and such, is easier to pattern match when the expressions are uniform. Might be worth an experiment.
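
A runtime sketch of that special-cased call ($bin as in the snippet above is the post's own symbol):

```javascript
// Hide the operator behind one uniform call shape.
function $bin(lhs, op, rhs) {
  switch (op) {
    case '+': return lhs + rhs;
    case '-': return lhs - rhs;
    case '*': return lhs * rhs;
    case '==': return lhs == rhs;
    case '===': return lhs === rhs;
    // ...one case per binary operator
    default: throw new Error('unsupported op: ' + op);
  }
}
```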

Examples


Just for the sake of doing it, here's a random example of normalized code:

Code:
const mh = function ($$0) {
  const _0x4d7312 = $$0;
  debugger;
  let _0x1bf416 = undefined;
  let _0x20ff67 = undefined;
  if (isNS) {
    _0x1bf416 = _0x4d7312;
  } else {
    _0x1bf416 = event;
  }
  let tmpIfTest$9 = false;
  if (isNS) {
    _0x20ff67 = _0x1bf416.which;
    tmpIfTest$9 = 2 == _0x20ff67;
  } else {
    _0x20ff67 = _0x1bf416.button;
    tmpIfTest$9 = 2 == _0x20ff67;
  }
  if (tmpIfTest$9) {
    return false;
  } else {
    const tmpClusterSSA_tmpIfTest$9 = 3 == _0x20ff67;
    if (tmpClusterSSA_tmpIfTest$9) {
      return false;
    } else {
      return undefined;
    }
  }
};

This is taken from a random step from some obfuscated code. The code looks very uniform and the functions can get quite large since Preval will try to inline all functions.

We can also see the implicit global event (which Preval has no clue about).

Conclusion


I didn't start Preval to do normalization and I had no idea what I'd actually end up with. I certainly didn't expect to tumble down the rabbit hole of static analysis after getting burned so hard on Mope. But it ended up being an interesting endeavor with food for thought. For me, anyways. I'm not sure how interesting it was for you to read my brain dumps so far. Hopefully not too bad. I think this is a typical case of "these blog posts should have been a conference talk". But sadly I'm not sure whether that's gonna happen. Bad timing and all that.

The remaining posts on this topic, if I can put myself to writing them, would cover the next phase of normalization. Turns out that's where the fun stuff begins, with higher level knowledge of the code: doing either pattern matching or type tracking to find redundant code patterns and simplify the code, eliminating unnecessary ops.