How to parse Lean 3(?) theorem statements to JSON AST

Question

I'd like to take a .lean file, and extract from it all the theorem statements into a fully-elaborated unambiguous textual format that is morally isomorphic to the underlying representation in memory. Is that easy to do with existing facilities?

Given something like ackermann.lean, it'd be nice, for example to get a JSON list of pairs [{name: "ack_zero", form: "..."}, ...]. I guess there are two parts to this:

the extraction of the theorem statements from the wider file, and
the parsing of the statements themselves into some appropriate AST

Or is the question not even well-formed?

Jason Rute · Accepted Answer · 2023-07-19 14:23:44Z

To build on Mario's answer:

A Lean theorem is passed through multiple levels of processing. The Lean text file is processed into as an AST, and that is processed further into a Lean declaration. All theorems, definitions, axioms, etc. are declarations. A theorem declaration has a few parts, including a name, a list of universe names, its type (theorem statement), and its proof. A name is more or less a list of strings. The type is an (elaborated) expr, which is (mostly) an algebraic datatype that is fairly easy to serialize. (I say mostly, because expressions can also be macros sometimes which aren't serializable. (I don't remember if theorem statements can have macros in them.) This is documented in this paper, and you should also pay attention to the comments found in the code, e.g. for expr.

Mario's AST extractor for Lean 3 is how to access the AST of Lean 3 file. It is one of the only (non-hacky) ways to get for example the tactic proofs or the start and end positions of the theorem statement text strings.

As for accessing the underlying theorems, there are a few ways. They all revolve around the fact that Lean has to store this information to be used later. It is used for example to invoke a previous theorem, to print a theorem with #print, or to provide a tooltip. This data is stored in memory in the environment, which is basically the global context. It is also stored in .olean files which are roughly compiled Lean files. Finally, you can get it also from the extract tool. (I'm not actually clear if the extract tool which Mario mentioned above is different from the oleans.)

Now, if you want to get the data from inside Lean you can do the following. (One advantage of this approach is that it also allows you to use a bunch of Lean meta tools to filter and manipulate the data. You can also get nice metadata like the pretty printed forms of the theorems. The disadvantage is that you have to learn Lean metaprogramming, and the API is not well documented. Also, to my knowledge, there is no built in JSON extraction format, but it isn't actually hard to write and I've written both JSON and s-expression serializers for expr in the past.) Here is a code snippet to loop over all theorems.

-- there is a mathlib script to make an all.lean
-- file which includes all of mathlib
-- import all
import tactic

meta def expr_to_json : expr -> string
| _ := "\"TODO\""

meta def extract_thm_json (e: environment) (nm : name) (univs: list name) (tp : expr) : tactic string := do
  let j := "{\"name\": " ++ (repr nm.to_string),
  let j := j ++ ", \"univs\": " ++ (univs.map (λ n, repr (name.to_string n))).to_string,
  
  -- display the type in various formats
  -- pretty printed
  let j := j ++ ", \"tp_pp\": " ++ (repr tp.to_string),
  -- some built in s-expr like format
  let j := j ++ ", \"tp_sexpr\": " ++ repr (tp.to_raw_fmt).to_string,
  -- your format:
  let j := j ++ ", \"tp_json\": " ++ (expr_to_json tp),
  
  -- metadata
  -- filename 
  file_name <- e.decl_olean nm,
  let j := j ++ ", \"file\": " ++ repr file_name,
  
  return (j ++ "}")

meta def extract_decl_json (e : environment) : declaration -> tactic string
| (declaration.thm nm univs tp val) := extract_thm_json e nm univs tp
| _ := pure "null"  -- not a thm
  
  
meta def loop_over_decls : tactic unit := do
  -- get the tactic state
  s <- tactic.read,
  -- get the environment object and loop over all declarations
  let e := s.env,
  results <- e.mfold list.nil $ λ d l, 
    -- stop after 5 (just for example purposes)
    if l.length < 5 then do {
      decl_json <- extract_decl_json e d,
      return $ decl_json :: l
    }
    else
      return l,
  tactic.trace $ results,
  return ()

#eval loop_over_decls  -- data is printed in the trace window

With some more metaprogramming, you can get more metadata and you can also print this data to a file.

Note, you don't talk about your application. I assume this is for machine learning or something where you don't need to actually put this data back into Lean. Getting it back into Lean can be a bit tricky to do.

Edit: All the above is for Lean 3. The principles are the same in Lean 4, but the specific code and tools are all different.

"I don't remember if theorem statements can have macros in them." They can, of course, being exprs like anything else, although it's much less likely to see them in theorem statements than in proofs (but again, proofs can appear in theorem statements, so...). The most common macro you will find in theorem statements is sorry. — Mario Carneiro, Commented Jul 20, 2023 at 3:44
"I'm not actually clear if the extract tool which Mario mentioned above is different from the oleans." It is. The lean export format is text-based, having lots of lines like "#EA 1234 56", while olean is a binary format which you can parse using github.com/digama0/olean-rs. The olean format is less lossy than the export format, containing more lean-internal entries rather than just those needed for verification. — Mario Carneiro, Commented Jul 20, 2023 at 3:49

Mario Carneiro · Accepted Answer · 2023-07-18 04:44:54Z

You can do this using lean --make --ast computability/ackermann.lean. Just make sure to delete the .olean file first or else it won't do anything (it doesn't run and produce output unless the parser runs).

EDIT: actually I guess this isn't exactly what you are asking for. This produces a JSON representation of the AST of the file, but this is not the same thing as the statements themselves. (Not all theorems appear literally in the file, they might be automatically generated, and the internal representation uses exprs, not AST nodes.) For that, you might want to try the export format, which is unfortunately not JSON formatted but is still fairly simple and can be checked by an external checker.

Stack Exchange Network

How to parse Lean 3(?) theorem statements to JSON AST

2 Answers 2

Not the answer you're looking for? Browse other questions tagged
lean
lean3
abstract-syntax-tree
or ask your own question.

Hot Network Questions

How to parse Lean 3(?) theorem statements to JSON AST

2 Answers 2

Not the answer you're looking for? Browse other questions tagged leanlean3abstract-syntax-tree or ask your own question.

Related

Hot Network Questions

Not the answer you're looking for? Browse other questions tagged
lean
lean3
abstract-syntax-tree
or ask your own question.