
I know very little about proof assistants, but I have played a little with Lean, so I have some basic knowledge of how they work. I was curious whether I could make ChatGPT convert some natural language proofs into Lean for me. To be clear: I did not ask it to come up with a proof, but merely to convert a simple proof from English to Lean. It gave me something that looked like Lean code, but the online Lean editors returned an error. Is this because ChatGPT is not good enough in Lean, or did I fail to set up the preamble properly or use a different version?

FWIW, here is the code ChatGPT produced. I'm not copying it here for people to point out mistakes, just for reference.

-- Define the sum of the first n natural numbers  
def sum_n (n : ℕ) : ℕ := finset.sum (finset.range (n+1)) id  

-- Define the equation to be proved  
theorem sum_of_naturals (n : ℕ) : sum_n n = n * (n+1) / 2 :=  
begin  
  -- First we will simplify the equation using algebraic manipulation  
  have eq : 2 * sum_n n = n * (n+1),  
  { simp [sum_n, mul_comm], rw [←nat.mul_succ, nat.succ_mul], ring },  

  -- Now we will prove the equation by induction on n  
  induction n with k hk,  
  { -- Base case when n=0  
    simp [sum_n] },  
  { -- Inductive step when n=k+1  
    simp [sum_n, nat.succ_eq_add_one, hk k],  
    rw [eq],  
    ring }  
end

  • $\begingroup$ "not good enough in lean" is subjective unless you mean "outputs valid lean code". With that meaning, ChatGPT clearly isn't good enough, as it forgot the preamble! import algebra.big_operators.basic, import tactic.ring is the missing preamble, although the code still doesn't work. $\endgroup$
    – Eric
    Commented Mar 16, 2023 at 17:46
  • I don't think the training data for ChatGPT is public, but surely a vanishingly small percentage of it is Lean code, and whatever is there is most likely a mix of different versions, so there is no one correct version to use. Theoretically, a similar model fine-tuned on a current version of mathlib should perform much better.
    – Will Sawin
    Commented Mar 16, 2023 at 19:43
  • @Will ChatGPT is not bad at writing code in programming languages, which is why I think this might have a chance. Maybe next year then.
    – domotorp
    Commented Mar 16, 2023 at 20:54
  • Someone please create a StackExchange site about ChatGPT. Commented Mar 16, 2023 at 21:59
  • @AndrejBauer There is ai.stackexchange.com, which covers AI in general and not just ChatGPT. Commented Jun 8, 2023 at 1:15

2 Answers


This question can be interpreted many ways:

  1. Can ChatGPT produce valid Lean code if used naively?
  2. Can ChatGPT produce valid Lean code if used smartly?
  3. Can ChatGPT produce valid Lean code if hooked up to an API especially designed to use ChatGPT (or another LLM) to produce Lean code?

The answer to (1) is that it will likely fail if used naively. ChatGPT isn't awful, but it will make lots of mistakes unless the proof is simple. Sometimes, as in your case, the proof is genuinely a bit tricky to put into Lean: in Lean, n * (n+1) / 2 is a natural number and / is natural-number division (it takes the floor of the quotient), so a paper proof that freely divides by 2 is hard to translate directly. A standard workaround is sketched below.
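For concreteness, here is a minimal sketch of that workaround (Lean 3 with mathlib): prove the doubled identity 2 * sum_n n = n * (n + 1), which stays entirely within natural-number arithmetic. The recursive definition of sum_n and the name two_mul_sum_n are my own illustrative choices, not the original poster's code.

import data.nat.basic
import tactic.ring

-- 0 + 1 + ⋯ + n, defined by recursion (no finset API needed)
def sum_n : ℕ → ℕ
| 0       := 0
| (n + 1) := sum_n n + (n + 1)

-- Proving the doubled identity avoids natural-number division entirely.
theorem two_mul_sum_n (n : ℕ) : 2 * sum_n n = n * (n + 1) :=
begin
  induction n with k hk,
  { refl },
  { -- unfold one step of the recursion; this is just the defining equation
    have h : sum_n k.succ = sum_n k + (k + 1) := rfl,
    rw [h, mul_add, hk, nat.succ_eq_add_one],
    ring }
end

The division form sum_n n = n * (n + 1) / 2 then follows by dividing both sides by 2. (mathlib already contains this Gauss sum, as something like finset.sum_range_id_mul_two, so in practice one would search the library before reproving it.)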

For (2), it is well known that these systems are a lot more powerful if you know how to talk to them the right way. First, the chat in Bing is probably better, since it is GPT-4 whereas the regular ChatGPT is GPT-3.5 (or so). Next, ChatGPT is designed for back-and-forth conversation, so if it isn't getting it right (say, it is missing imports or there is an error), tell it. Just like on Stack Exchange, it helps to be clear and to provide the right information about what you think is going wrong. Remember, ChatGPT doesn't have access to a Lean interpreter or the Lean library (well, maybe Bing has access to the library). If you and ChatGPT work together, you might be able to solve it as a team. Here is one example of that happening with GPT-4: https://leanprover.zulipchat.com/#narrow/stream/219941-Machine-Learning-for-Theorem-Proving/topic/GPT/near/341914198

For (3), note this is an active and exciting area of research. Can we use LLMs to translate informal math into formal math, produce formal proofs, and otherwise generally help with interactive theorem proving? A year ago, those of us who research this would have been quite doubtful, but the field is moving fast. Since then, the following papers and research projects have come out. They aren't anywhere close to 100% accurate, but I see a lot of potential here.

  • Autoformalization with Large Language Models. Uses Codex (a precursor to ChatGPT and the model behind GitHub Copilot) to translate math theorem statements into Isabelle/HOL with about 25% accuracy. Quanta also wrote an article on this.
  • Lean Chat. Based on the previous paper, it is an app which uses Codex to translate informal math theorem statements into Lean. It is chat-like (predating ChatGPT), so the user can instruct the model on how to correct the translation.
  • Draft-Sketch-Prove. Solves competition math problems in three stages: (1) uses Minerva (a large language model for informal math) to solve the problem in informal language; (2) uses Codex to translate the solution into Isabelle/HOL, leaving the low-level details empty; (3) uses Sledgehammer to fill in those low-level details. (Given 100 attempts per problem, it has about a 35-40% success rate.)
  • LeanAid. Like Lean Chat, it translates statements into Lean, but it has very tight integration with Lean's internals, so the LLM can use Lean-specific information like type-checking and access to the Lean library.
  • ProofNet. Provides a benchmark for auto-formalization with some baselines using LLMs.
  • ChatGPT. ChatGPT came out after all these papers. Techniques which were previously available only to researchers or in specialty apps are now available for everyone to play with. But it is also clear from user reports that it has limited usefulness here.
  • Baldur. Uses Minerva to generate whole proofs in Isabelle/HOL and then uses error feedback to repair the mistakes in them. Has a pretty good success rate, around 47%.
  • GPT-4. GPT-4 was just released (or rather made public, as it has been in Bing for a few weeks), and there have been some examples suggesting it might be even better for all of the above applications.

(Note: success rates are difficult to measure, and the numbers above may be inflated for theorems already in the library, since the models may have memorized the library's code. So take these numbers with a grain of salt.)


"Is this because ChatGPT is not good enough in Lean?"

That is correct. Actually, it's not good enough at many things, for example, generating valid citations.

ChatGPT is great at producing things that look like correct facts or correct code, according to a model of natural language.

For many languages, this is good enough to produce syntactically valid code, or even type-safe code.

Lean does not let you cheat. If your proof is not valid, then it will not type check.
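A toy illustration (my own example, not the answerer's): a proof term can look perfectly plausible and still be rejected by the type checker.

-- A plausible-looking guess: surely n + 1 = 1 + n holds "by definition"?
-- Lean rejects `rfl` here: the two sides are not definitionally equal,
-- since addition on ℕ recurses on its second argument.
example (n : ℕ) : n + 1 = 1 + n := rfl  -- error: type mismatch

-- Only a genuine proof is accepted, e.g. via the commutativity lemma:
example (n : ℕ) : n + 1 = 1 + n := nat.add_comm n 1

No amount of fluent-sounding text gets past the kernel; either the term has the right type or it doesn't.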

I highly doubt LLMs will be producing proofs like this any time soon, apart from a few benchmarks. To write a proof, you need conceptual understanding, not just a linguistic model.

  • ChatGPT is also better when there are more examples online; for instance, it's a lot better at writing Python and JavaScript. Unfortunately, the formal methods community is relatively obscure and there are not many resources out there :(
    – tarzh
    Commented Mar 20, 2023 at 1:05
  • Right, but even if there were lots of examples, I'm skeptical of how well that would generalize to generating correct proofs. Commented Mar 21, 2023 at 20:46
  • @JoeyEremondi If the proofs can be checked relatively easily, maybe one could let it try a few times until the proof works. The "Sparks of AGI" paper on GPT-4 shows some of the potential of these models: arxiv.org/abs/2303.12712. With some fine-tuning on logic and olympiad problems, the ability to use Lean as a tool, and some prompt engineering that tells it to think out a plan before going into the details, one could imagine it producing a valid proof within maybe 100 guesses. Commented Mar 31, 2023 at 11:16
  • @userrandrand Maybe, but I doubt that would scale to actual conjectures, where you don't know whether a proof exists. I'm guessing it would take millions of guesses, not 100. But I'll be happy to be wrong. Commented Apr 1, 2023 at 20:45
