$\begingroup$

I am thinking about how to write the parser/compiler so that it works as a VSCode Language Server Extension, so that it can show syntax errors, offer autocomplete, and so on in VSCode. One thing that I've seen recommended is doing syntax error recovery at the parser level.

For example, that last post suggests:

  • Insert a token, and see how far the parser can successfully progress. The current context of the parser (i.e: which grammar rule is currently being parsed) and the language’s formal grammar could be used to decide the type of the token to be inserted.
  • Delete a token and see how far the parser can successfully progress.

I can kind of see how this might work, but only for the example they provided. I'm not sure how I can apply this to the language I'm working on.

The language I'm working on is basically a markup language and allows you to write stuff like this (the spec):

i am tree of terms
# same as
i
  am
    tree
      of
        terms
term-with-multiple-words foo
number 123
code #x123, as hex
decimal 1.23
text <foo bar>
text-with-interpolation <foo {hello-world}>
term-with-{interpolation}
term/path
term/path-{with}/interpolation
can use, commas, too
can
  use
  commas
  too
can(use(parens))
can
  use
    parens

That's pretty much it. Some more advanced cases for reference are here.

The question is, what sorts of "syntax error recovery" techniques could I use for such a language? I can't imagine any so far. For example, the "text" format of <anything except angle brackets> is hard to recover from if the closing bracket is missing:

something <foo bar
another term tree

So that seems unrecoverable.

Even with the interpolation brackets {stuff} or {{stuff}}, you might have:

{stuff
another}

But what you meant was:

{stuff}
{another}

So I don't really feel like "syntax error recovery" is practical in this situation, do you? Are there any cases where it could be useful here? I can't see any so far. On top of that, syntax error recovery seems extremely hard to implement, and it only helps while writing the code, not while running it, so it may not be that useful anyway. People make it sound like syntax error recovery is a hugely important feature to have, so I'd also like to know how many languages/editors actually support it, and which ones.

$\endgroup$

1 Answer

$\begingroup$

For unclosed delimiters, whether angle brackets, curly braces, or parentheses, I think it would be seen as reasonable for the recovering parser to pretend that the closing delimiter appeared at the end of the line in which the opening delimiter appeared. That line would be marked as having an unclosed-delimiter syntax error, and the parser would continue parsing the rest of the file, finding any further syntax errors, which would be more helpful than stopping at the unclosed-delimiter error as a naïve parser would.
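As a rough illustration of the end-of-line recovery described above, here is a minimal Python sketch. It is not tied to any parser library. The names `Diagnostic` and `check_delimiters` are made up for this example, and it assumes that the three delimiter pairs nest uniformly and never span lines, which may not match the real grammar.

```python
from dataclasses import dataclass

# Assumed delimiter pairs, taken from the examples in the question.
CLOSER_FOR = {"<": ">", "{": "}", "(": ")"}
OPENER_FOR = {closer: opener for opener, closer in CLOSER_FOR.items()}


@dataclass
class Diagnostic:
    line: int      # 0-based line number
    column: int    # 0-based column of the offending character
    message: str


def check_delimiters(source: str) -> list[Diagnostic]:
    """Scan every line, recovering from unclosed delimiters at each line end."""
    diagnostics: list[Diagnostic] = []
    for line_no, line in enumerate(source.splitlines()):
        open_stack: list[tuple[str, int]] = []  # (opener, column) still unclosed
        for column, char in enumerate(line):
            if char in CLOSER_FOR:
                open_stack.append((char, column))
            elif char in OPENER_FOR:
                if open_stack and open_stack[-1][0] == OPENER_FOR[char]:
                    open_stack.pop()
                else:
                    diagnostics.append(
                        Diagnostic(line_no, column, f"unexpected '{char}'")
                    )
        # Recovery: pretend every leftover opener was closed at the end of
        # this line, report it, and start the next line with a clean stack.
        for opener, column in open_stack:
            diagnostics.append(Diagnostic(line_no, column, f"unclosed '{opener}'"))
    return diagnostics
```

With the input `something <foo bar` followed by `another term tree`, this reports one error on the first line and still checks the second line normally, instead of treating the rest of the file as part of the unclosed text.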

How any of this would be done depends on what tools, such as a parser combinator library or a parser generator, you use to implement your parser.

If your LSP server renders your markup language, it could refuse to update the rendered version until no syntax errors remain in the input, or it could try to identify regions of the input that have no syntax errors and update only those. The latter would be more helpful but more difficult to implement correctly. (However, your language looks more like a data definition language, such as JSON or YAML, than a true markup language, such as HTML or Markdown, so rendering it may be irrelevant.)

In the case of

{stuff
another}

this would be straightforward enough to handle gracefully by pretending that the first line ends with } and the second line starts with {, if line breaks are not allowed inside interpolations. If line breaks are allowed there, then no, I don't suppose you can catch the error if the user intended to write two separate interpolations.
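Assuming line breaks are not allowed inside interpolations, that repair could look roughly like the following Python sketch. `repair_interpolations` is a hypothetical helper, not part of any library. It only looks at curly braces and guesses that a stray `}` belongs to an interpolation starting at the beginning of the line's content.

```python
def repair_interpolations(source: str) -> tuple[str, list[str]]:
    """Balance '{' and '}' on each line, returning repaired text and errors."""
    repaired_lines: list[str] = []
    errors: list[str] = []
    for line_no, line in enumerate(source.splitlines(), start=1):
        depth = 0          # '{' opened on this line and not yet closed
        missing_open = 0   # '}' seen while nothing was open
        for char in line:
            if char == "{":
                depth += 1
            elif char == "}":
                if depth > 0:
                    depth -= 1
                else:
                    missing_open += 1
        if missing_open:
            errors.append(f"line {line_no}: '}}' without '{{'")
        if depth:
            errors.append(f"line {line_no}: '{{' without '}}'")
        # Keep the indentation intact, since it is significant in this language.
        content = line.lstrip()
        indent = line[: len(line) - len(content)]
        repaired_lines.append(indent + "{" * missing_open + content + "}" * depth)
    return "\n".join(repaired_lines), errors
```

The repaired text is only what the parser continues with internally; the user's file is left untouched, and both lines are still flagged as errors.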

For much more detail, one could read "Resilient LL Parsing Tutorial" by Alexey Kladov, a proponent and developer of IDEs and the Rust LSP server rust-analyzer. That article assumes that the language being parsed is a programming language that lacks indentation-based syntax and needs to be type-checked and semantically understood by the LSP server. I'm not sure that your LSP server has much to do beyond parsing and finding syntax errors, and I think your indentation-based syntax inherently may help contain syntax errors (not including unclosed delimiters) compared to the Rust-like syntax featured in Kladov's post, so I don't know that error recovery would be as valuable for you as Kladov considers it.
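To make the point about indentation containing errors a little more concrete, here is a small Python sketch that splits the input into top-level blocks (an unindented line plus the indented lines under it) and checks each block on its own, so an error in one block does not affect its neighbours. All function names are invented for this example, and `block_is_balanced` is only a stand-in for a real block parser.

```python
def split_top_level_blocks(source: str) -> list[tuple[int, list[str]]]:
    """Group lines into (first_line_number, lines), one block per unindented line."""
    blocks: list[tuple[int, list[str]]] = []
    for line_no, line in enumerate(source.splitlines(), start=1):
        if not line.strip():
            continue  # blank lines carry no terms
        starts_block = not line[0].isspace()
        if starts_block or not blocks:
            blocks.append((line_no, [line]))
        else:
            blocks[-1][1].append(line)
    return blocks


def block_is_balanced(lines: list[str]) -> bool:
    """Stand-in for a real parser: checks only that delimiters pair up per line."""
    pairs = [("<", ">"), ("{", "}"), ("(", ")")]
    return all(
        line.count(opener) == line.count(closer)
        for line in lines
        for opener, closer in pairs
    )


def find_clean_blocks(source: str) -> tuple[list[int], list[int]]:
    """Return the starting line numbers of (clean, broken) top-level blocks."""
    clean: list[int] = []
    broken: list[int] = []
    for start, lines in split_top_level_blocks(source):
        (clean if block_is_balanced(lines) else broken).append(start)
    return clean, broken
```

A server built along these lines could keep using the results for the clean blocks (for rendering or autocomplete, say) while reporting errors only for the broken ones.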

$\endgroup$
