
A part of why Lisp's macros work so well is that the code is just a list of symbols, and it is therefore simple to manipulate using a regular Lisp function. How could a language that uses a syntax more like Rust's or C's implement a way to modify code from code itself?

  • Rust does it. There's a macro_rules! macro which is basically syntax-case, and a very popular crate provides a quote! macro for quasiquoting. Commented May 16, 2023 at 17:56
  • Rust does it, but declarative macros only allow (in practice) a subset of what you can do with Lisp macros. Procedural macros fill the gap, but at the price of a more complicated setup and longer compile times. It's a compromise between power and efficiency.
    – zdimension
    Commented May 16, 2023 at 17:59
  • @kaya3 I've modified for clarification, though I'm also interested in indentation-syntax languages like Python.
    – kouta-kun
    Commented May 16, 2023 at 19:22
  • @kouta-kun And for the Python-style languages with AST macros I recommend you take a look at Converge.
    – SK-logic
    Commented Jun 26, 2023 at 15:46

6 Answers

13
+50

I want to preface this by saying that “homoiconicity” is not a useful term. It is infamously difficult to define, and no working macro system experts I know choose to use it. In practice, the word “homoiconicity” seems to serve two purposes:

  1. To vaguely gesture at a collection of features that make compile-time metaprogramming exceptionally pleasant in Lisp-family languages.

  2. To add an air of mysticism to Lisp.

The first of those purposes refers to something very real, but we can use more specific terms to talk about them that have concrete and well-defined meanings. The second is just obscurantist folklore. Fortunately, you do not use the word in your question body, so let us simply dispense with it.

A part of why Lisp's macros work so well is that the code is just a list of symbols, and it is therefore simple to manipulate using a regular Lisp function.

This is certainly true in some sense, but it somewhat conflates two distinct properties of Lisp:

  • Lisp has a remarkably simple and regular concrete syntax, and that concrete syntax has very little inherent structure. This permits reading a program into a concrete syntax tree (CST) without truly parsing it into the corresponding AST. Reading a program performs delimiter matching, but otherwise little is done beyond tokenization. The process of turning this loose “token tree” into a true AST is macroexpansion, so in a sense, macroexpansion is parsing.

  • Traditionally, Lisp chooses to represent these syntax trees using very simple, standard data structures like lists and symbols. This means that functions used to manipulate lists can also be used to manipulate concrete syntax trees.

This first property is undoubtedly related to why Lisps are so amenable to macro-based extension, but as SK-logic mentions, the second property is not actually essential. It sounds nice on paper, but just spend some time reading traditional Lisp macros of any complexity and you will find yourself quickly getting lost in the sea of maps and caddrs.

While reuse of ordinary data structures does provide a certain economy of implementation, a great deal of real utility is obtained from departing from that model. Using richer data structures allows preserving source location and scoping information through macroexpansion at relatively little cost. A real pattern-matching and templating sublanguage (like, say, syntax-parse) is exceedingly useful whichever representation you choose, and it abstracts over most of the differences in syntax representation.
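
To see the first of those properties at work outside of Lisp, note that Rust’s declarative macros likewise receive unparsed token trees rather than expressions. A minimal sketch (count_tts is an invented name; this is ordinary stable Rust, not anything taken from Honu or Racket):

macro_rules! count_tts {
    () => { 0usize };
    ($head:tt $($rest:tt)*) => { 1usize + count_tts!($($rest)*) };
}

fn main() {
    // `[=> @ !]` and `{ 1 + }` are not valid Rust expressions, but they are
    // perfectly good token trees: reading only requires balanced delimiters.
    let n = count_tts!(a (b c) [=> @ !] { 1 + });
    println!("{} token trees", n); // prints: 4 token trees
}

The argument never has to parse as Rust; it only has to read as a balanced stream of tokens, which is exactly the “reading without parsing” described above.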

How could a language that uses a syntax more like Rust's or C's implement a way to modify code from code itself?

Now, finally, to the real question. Recall that the key benefit of Lisp’s simple syntax for macro extensibility is that it is possible to read a program without parsing it, and this reading process produces a loose rose tree of tokens formed simply by matching delimiters. Parsing the program is macroexpansion; the two cannot be separated into distinct phases.

Given this perspective, it becomes clear that the challenge of supporting more sophisticated syntax is not how to add macros to a traditional parser, but rather how to parse C-like syntax via macroexpansion! Fortunately, there is prior art on how to do this even in the context of a rich, procedural macro system in Honu: syntactic extension for algebraic notation through enforestation (pdf). The Honu approach, along with syntax-parse, was a direct inspiration for Rust’s pattern-based macro system, though they chose to implement a simplified version of it.
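
For a flavour of that simplified, pattern-based approach, here is an ordinary macro_rules! example (for_reversed is an invented name; this is only an illustration in stable Rust, not anything taken from Honu). The grammar of the new construct lives entirely in the macro’s pattern, and expansion rewrites it into syntax the compiler already understands:

macro_rules! for_reversed {
    (for $x:ident in $iter:expr => $body:block) => {
        for $x in ($iter).into_iter().rev() $body
    };
}

fn main() {
    // The compiler's own parser knows nothing about this surface form.
    for_reversed!(for i in 0..3 => {
        println!("{}", i);
    });
    // prints 2, 1, 0
}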

All that said, there are plenty of challenges with Honu-like approaches that remain open research problems. Certain constructions can be challenging to parse, and this level of macro-extensibility dramatically complicates the story for IDEs, language servers, and other, similar tools. Various languages are actively pursuing different points in that design space, and which approaches are the most pragmatically useful remains to be seen.

  • Homoiconicity is a decidable property of grammars. Of the languages you mention: CL, Scheme, Racket (and thus Honu), and even Rust are homoiconic; C is not. The E designers had notes on how to transform C and related languages to be homoiconic.
    – Corbin
    Commented Jul 19, 2023 at 16:34
  • @Corbin I would love to see your so-called decidable property formally stated.
    – Alexis King
    Commented Jul 19, 2023 at 17:52
  • Pick a grammar and a non-terminal production which will serve as the top level. Then the grammar is homoiconic if the production appears immediately as part of its own rule (recursion) and also indirectly via some other rule (quoting). That's it. Any way of unquoting or otherwise evaluating quotes is not necessary; that'd be a semantic notion. You can check for yourself that the typical Lisp qualifies, and so does Rust; however, C doesn't.
    – Corbin
    Commented Jul 20, 2023 at 0:30
  • @Corbin Doesn’t this mean Haskell also qualifies (since it has Template Haskell)?
    – Alexis King
    Commented Jul 20, 2023 at 8:32
  • Isn't homoiconicity partly what makes Racket special? Commented Jul 21, 2023 at 3:44
6

They can use metaprogramming techniques that allow you to do what you want. Some examples:

  1. As you said, they can use macros, which are commonly implemented in such languages too and work much as Lisp's do, though some functionality is lost.
  2. They can use preprocessor directives. Lisp's macros have more expressive power and enable dynamic code generation, but they lose to the preprocessor on simplicity and breadth of support.
  3. They can use procedural macros. Lisp's macros again have more expressive power and are simpler, but they lose on integration and on complex transformations.
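
To make the third point concrete, here is roughly what the skeleton of a Rust procedural macro looks like. This is a minimal sketch: show_input is an invented name, and the code has to live in its own crate with proc-macro = true in Cargo.toml, which is part of the extra setup and compile-time cost mentioned above.

use proc_macro::TokenStream;

// A function-like procedural macro: it receives the caller's raw tokens at
// compile time and must return new tokens to splice in their place.
#[proc_macro]
pub fn show_input(input: TokenStream) -> TokenStream {
    // Turn whatever the caller wrote into a string literal, so that
    // show_input!(2 + 2) expands to the expression "2 + 2".
    let text = input.to_string();
    format!("{:?}", text).parse().unwrap()
}
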
  • How exactly do Lisp-style macros lose on complex transformations? You can wrap an entire optimising compiler into a Lisp macro.
    – SK-logic
    Commented Jun 29, 2023 at 9:50
6

The precise meaning of homoiconicity is debated, so I’ll give up front my assumptions about what it means:

  1. The basic code structure has a direct mapping to a general data structure.

  2. That data structure is a basic type within the language.

  3. Except for some trivial marker, the literal syntax for that type is identical to the code syntax.

So for example, Tcl is homoiconic in this sense because all code is made of character strings, and not only in the sense that any textual language is, but in a way that’s reflected in the language. The interpreter may store data types and code in a more efficient representation, but this is purely an optimisation—it must retain enough information to transparently convert back to text. And the notations for escaping ("", {}, \) and interpolation ([], $) are trivial—their size is a small constant, independent of the size of the escaped or interpolated term.

Likewise, of course, modern Lisp written with S-expressions is homoiconic, because its code is isomorphic to a rose tree of atomic terms (the list being the main aggregate type in the language), and quotation (quote / ') and splicing (unquote / ,, unquote-splicing / ,@) are again trivial.


But I think that early Lisp, written with M-expressions, also fits the bill. For example, from the Lisp 1.5 Programmer’s Manual (PDF), p. 15, here’s the function member that tests for membership in a list, and how it desugars into S-expression form.

member[a;x] = [null[x]→F;eq[a;car[x]]→T;T→
      member[a;cdr[x]]]
define ((
(member (lambda (a x) (cond ((null x) f)
      ( (eq a (car x) ) t) (t (member a (cdr x))) )))
))

So there are some differences, but they’re entirely superficial. The punctuation is different, and there are a couple of infix operators, but besides that, the nesting structure and the ordering of leaf terms are unchanged. As such, even if the programmer doesn’t look at the S syntax, they can easily predict it from the M syntax, so they can read and write code that treats the M form as if it were the S data structure.

Another good example of nontrivial homoiconicity in practice is XSLT. The program’s input and output are usually both XML, and the program itself is represented as XML, so metaprogramming isn’t appreciably different from ordinary programming, despite the fact that XML has fairly complex syntax.


A long time ago, I designed a language where all compound terms had the same form, essentially a series of names followed by bracketed subterms. With some sugar for infix operators, the result was that code could look fairly familiar to a C programmer. For example, here’s the member function from earlier:

define [member](a, x) {
    if (null(x)) {
        false
    } else if (a = head(x)) {
        true
    } else {
        member(a, tail(x))
    }
}

This desugared to a tree structure very much like the Lisp code, which was accessible to the programmer to a limited extent. For example, member.head[0] is "a", and member.body[0].name is "_else_", because else is an infix operator. At the time, in fact my goal wasn’t metaprogramming, but only allowing some customisable syntax and some introspection, while retaining a simple top-down operator-precedence parser. But I hope it’s illustrative nevertheless.


I think these points can be assembled into a general method of designing such languages:

First, select a data structure amenable to representing a program. Typically this will be some sort of recursive tree structure with labels at the leaves and/or branches. However, recursion isn’t strictly necessary—you can always flatten the structure of a language. For example, you can prescribe a certain implicit traversal order, such as how prefix/postfix notation can do away with parentheses for operations of fixed arity; or you can require compound terms to be defined out of line and referenced by name. (Labels aren’t strictly necessary either, if a program’s meaning is determined entirely by its structure, but that’s getting into esolang territory.)

Second, reify that data type within the language. Ideally, you should ensure that conversion from data to code (let’s call it parse) and from code to data (format) have various properties you might expect:

  1. parse(format(parse(data))) = parse(data)

    Formatting preserves all information that parsing does.

  2. format(parse(format(code))) = format(code)

    Parsing preserves all information that formatting does.

  3. format(parse(data)) = data

    Parsing preserves formatting.

  4. parse(format(code)) = code

    Formatting preserves parsing.

(You could call 1 & 2 “weak roundtripping” and 3 & 4 “strong roundtripping”.)

Third and finally, find a notation for this data structure that’s compatible with C-style syntax, or whatever style your goal may be. The biggest usability constraint I see here is that your choice of data structure might need to forgo some of the compiler-specific organisation that you’d want in an AST, so that it can be conveniently manipulated in a uniform way. The bigger and more complex the language is, the more syntactic cases a programmer will need to consider when metaprogramming.
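
As a toy end-to-end illustration of these three steps, here is a sketch in Rust (all names invented; the only compound form is name(arg, ...), which is vaguely C-like): a reified Term type, a format function giving it a canonical notation, and a parse function that round-trips with it.

use std::iter::Peekable;
use std::str::Chars;

#[derive(Debug, Clone, PartialEq)]
enum Term {
    Name(String),
    Call(String, Vec<Term>), // name(arg, arg, ...)
}

fn format(t: &Term) -> String {
    match t {
        Term::Name(n) => n.clone(),
        Term::Call(n, args) => {
            let inner: Vec<String> = args.iter().map(format).collect();
            format!("{}({})", n, inner.join(", "))
        }
    }
}

fn parse(src: &str) -> Option<Term> {
    let mut it = src.chars().peekable();
    let term = parse_term(&mut it)?;
    skip_ws(&mut it);
    if it.peek().is_none() { Some(term) } else { None }
}

fn skip_ws(it: &mut Peekable<Chars<'_>>) {
    while it.peek().map_or(false, |c| c.is_whitespace()) {
        it.next();
    }
}

fn parse_term(it: &mut Peekable<Chars<'_>>) -> Option<Term> {
    skip_ws(it);
    let mut name = String::new();
    while it.peek().map_or(false, |c| c.is_alphanumeric() || *c == '_') {
        name.push(it.next().unwrap());
    }
    if name.is_empty() {
        return None;
    }
    skip_ws(it);
    if it.peek() != Some(&'(') {
        return Some(Term::Name(name));
    }
    it.next(); // consume '('
    let mut args = Vec::new();
    loop {
        skip_ws(it);
        if it.peek() == Some(&')') {
            it.next();
            break;
        }
        args.push(parse_term(it)?);
        skip_ws(it);
        match it.next() {
            Some(',') => {}
            Some(')') => break,
            _ => return None,
        }
    }
    Some(Term::Call(name, args))
}

fn main() {
    let code = parse("member(a, tail(x))").unwrap();
    assert_eq!(parse(&format(&code)), Some(code.clone())); // property 4
    println!("{}", format(&code)); // member(a, tail(x))
}

Properties 1, 2, and 4 hold for this sketch because format is the canonical layout; property 3 (parsing preserves formatting) would additionally require parse to retain whitespace and other surface details, which this toy version deliberately discards.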

5

Because it is a "how could" question, I'll answer with the concrete design that I am using for my language. However, I highly suggest looking at the other answers, especially the one by Alexis King, for how other languages do it as well as the general principles.

The compiler for my language has the regular plumbing, but it also exposes a plugin API.

This API lets users define new keywords and new lexing modes. Basically, users can define functions for parsing and functions for lexing, but those functions will only be called if the right token (keyword) is found.

Surprisingly, this is all that is needed for implementing the equivalent of Lisp macros in a C-like language.

To understand why, first remember that a Lisp macro call must itself begin with a keyword (the name of the macro), so requiring a keyword up front is not a limitation compared with Lisp macros.

Second, Lisp macros may define entirely new syntax. This is what lexing modes are for. A user's lexing mode has to handle all possible characters, and it has to turn those into a stream of tokens.

Essentially, since you define the mode, you define how characters map to tokens. That is the entire essence of syntax, so a lexing mode can define new syntax.
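
To make that a bit more concrete, here is a purely hypothetical sketch of what such a plugin API could look like. None of these names come from the answer's actual language; they only illustrate the shape of the idea.

/// Raw character access: a plugin-defined lexing mode decides for itself how
/// characters become tokens.
pub trait CharStream {
    fn peek(&mut self) -> Option<char>;
    fn bump(&mut self) -> Option<char>;
}

/// Whatever AST fragment a plugin hands back to the host compiler.
pub enum Ast {
    Literal(String),
    Call { name: String, args: Vec<Ast> },
    Block(Vec<Ast>),
}

pub struct ParseError(pub String);

/// A user-defined keyword. The compiler calls `parse` only when it encounters
/// the registered token, then lets the plugin lex and parse whatever follows.
pub trait KeywordPlugin {
    /// The (possibly fully-qualified) keyword that triggers this plugin.
    fn keyword(&self) -> &str;
    /// Called at the point where the keyword was found; the plugin may switch
    /// into its own lexing mode via `chars` and must return an AST fragment.
    fn parse(&self, chars: &mut dyn CharStream) -> Result<Ast, ParseError>;
}

A json keyword, for instance, would implement KeywordPlugin, lex the brace-and-colon syntax itself through CharStream, and hand back an Ast that the host compiler treats like any other parsed fragment.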

And I have done this to embed a JSON-like language inside my main language.

And of course, your parsing function can do anything it wants with that stream of tokens.

The parser for my JSON-like language builds up the JSON value using variables that exist in scope and then generates code to make that data available.

Of course, this doesn't work with just any compiler. Your lexer needs to be on-demand. It also should recognize when an identifier is a type, a keyword, or something else. Your parser needs to be recursive descent because that's the easiest for users to understand. Your language must be single-pass. (Good luck on getting user code to work over multiple passes.)

Here are some reasons you don't want this in your language:

  • User code could have bugs and could cause your compiler to misbehave. A lot of the work I've done is the design of the API to prevent this as much as possible.
  • Your compiler now needs a JIT as well because it has to compile the plugin code. I hope you have a plan for the bootstrap.
  • Your language could become a bunch of dialects, kind of like Lisp itself. Or C++.

But here is why I like it:

  • The language can grow, and its growth is not hobbled.
  • I can define all of the language keywords in the standard library, not the compiler.
  • It forces my parsing to be consistent. If I can't implement a keyword the usual way, its design is wrong.
  • Keywords aren't special anymore; they can exist anywhere, not just the standard library. The math package of my language will include keywords for standard units, like meters, and will use that to do algebra on the units and arithmetic with correct units.
  • Keywords can be fully-qualified: you need to add the package name to access a keyword.
  • You can see what dialects a file uses by looking at the imports.
  • I already use this to define DSLs, such as a DSL for a build system.
5

Homoiconicity is a nice feature but it is not necessary for Common Lisp-style metaprogramming.

All you need is quasiquotation, both for constructing ASTs and for pattern-matching them. If you have these two components, you can do everything a homoiconic syntax allows you to do. Keep in mind that in some languages, most notably Scheme, homoiconicity is in fact a lie, and ASTs are very special data structures, not the plain lists that the rest of Scheme operates on (see syntax->datum and datum->syntax).
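
For a rough idea of what quasiquotation looks like on the Rust side, here is a small sketch using the quote and syn crates mentioned in the comments on the question (it assumes syn's "full" feature is enabled; nothing here is tied to the answer's own macro system shown below):

use quote::quote;
use syn::{parse2, Expr};

fn main() {
    // Construction by quasiquotation: splice token fragments into a template.
    let lhs = quote! { a };
    let rhs = quote! { 2 + 2 };
    let stmt = quote! { #lhs = #rhs; };
    println!("{}", stmt); // a = 2 + 2 ;

    // Deconstruction: parse tokens into an AST and pattern-match on it.
    let expr: Expr = parse2(quote! { 2 + 2 }).unwrap();
    if let Expr::Binary(bin) = expr {
        let left = &bin.left;
        println!("left operand: {}", quote! { #left }); // left operand: 2
    }
}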

Also, to illustrate what macros in a language with C-like syntax can look like: I experimented with various approaches, from quasiquotation to directly accessing the AST to typed macros applied in two phases, before and after typing; all seem to be quite feasible.

E.g., a macro that extends the syntax, adding an upside-down assignment operator:

##syntax of pfclike in clcode, inner: ' [clrevset]:v '
+ { clrevset := [clexpr]:r "=:" [cllvalue]:l";" => 
            {mode=stmt} set($source(),l,r); } // Forming an AST directly
{
  return v
}

Which then can be used as

2+2 =: a;

Here, we're just injecting a new parsing rule into the language parser. See the link above for more elaborate examples and for different approaches to handling C ASTs, both reading and forming them.

0

It's a very different approach from Lisp, but Microsoft's .NET has exposed code engines from its earliest days. It does not really allow modification of the running code, but you can generate new code and run that in place of the old.

