
I know very little about proof assistants, but I have played a little with Lean, so I have some basic knowledge of how they work. I was curious whether I could make ChatGPT convert some natural language proofs into Lean for me. To be clear: I did not ask it to come up with a proof, but merely to convert a simple proof from English to Lean. It gave me something that looked like Lean code, but the online Lean editors returned an error. Is this because ChatGPT is not good enough in Lean, or did I fail to set up the preamble properly or use a different version?

FWIW, here is the code ChatGPT produced. I'm not copying it here for people to point out mistakes, just for reference.

-- Define the sum of the first n natural numbers  
def sum_n (n : ℕ) : ℕ := finset.sum (finset.range (n+1)) id  

-- Define the equation to be proved  
theorem sum_of_naturals (n : ℕ) : sum_n n = n * (n+1) / 2 :=  
begin  
  -- First we will simplify the equation using algebraic manipulation  
  have eq : 2 * sum_n n = n * (n+1),  
  { simp [sum_n, mul_comm], rw [←nat.mul_succ, nat.succ_mul], ring },  

  -- Now we will prove the equation by induction on n  
  induction n with k hk,  
  { -- Base case when n=0  
    simp [sum_n] },  
  { -- Inductive step when n=k+1  
    simp [sum_n, nat.succ_eq_add_one, hk k],  
    rw [eq],  
    ring }  
end

  • $\begingroup$ "not good enough in lean" is subjective unless you mean "outputs valid lean code". With that meaning, ChatGPT clearly isn't good enough, as it forgot the preamble! import algebra.big_operators.basic, import tactic.ring is the missing preamble, although the code still doesn't work. $\endgroup$
    – Eric
    Commented Mar 16, 2023 at 17:46
  • I don't think the training data for ChatGPT is public, but surely a vanishingly small percentage of it is Lean code, and whatever is there is most likely a mix of different versions, so there is no one correct version to use. Theoretically, a similar model fine-tuned on a current version of mathlib should perform much better.
    – Will Sawin
    Commented Mar 16, 2023 at 19:43
  • @Will ChatGPT is not bad at writing code in programming languages, which is why I think this might have a chance. Maybe next year then.
    – domotorp
    Commented Mar 16, 2023 at 20:54
  • Someone please create a StackExchange site about ChatGPT. Commented Mar 16, 2023 at 21:59
  • @AndrejBauer There is ai.stackexchange.com, which covers AI in general and not just ChatGPT. Commented Jun 8, 2023 at 1:15

2 Answers


This question can be interpreted many ways:

  1. Can ChatGPT produce valid Lean code if used naively?
  2. Can ChatGPT produce valid Lean code if used smartly?
  3. Can ChatGPT produce valid Lean code if hooked up to an API especially designed to use ChatGPT (or another LLM) to produce Lean code?

The answer to (1) is that it will likely fail if used naively. ChatGPT isn't awful, but it will make lots of mistakes unless the proof is simple. Sometimes, as in your case, the proof is genuinely a bit tricky to put into Lean: in Lean, n * (n+1) / 2 is a natural number and / is natural-number division (it takes the floor of the quotient), so a paper proof that freely divides by 2 is hard to translate directly. A standard workaround is sketched below.
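For concreteness, here is a minimal sketch of that workaround (Lean 3 with mathlib): prove the doubled identity 2 * sum_n n = n * (n + 1), which stays entirely within natural-number arithmetic. The recursive definition of sum_n and the name two_mul_sum_n are my own illustrative choices, not the original poster's code.

import data.nat.basic
import tactic.ring

-- 0 + 1 + ⋯ + n, defined by recursion (no finset API needed)
def sum_n : ℕ → ℕ
| 0       := 0
| (n + 1) := sum_n n + (n + 1)

-- Proving the doubled identity avoids natural-number division entirely.
theorem two_mul_sum_n (n : ℕ) : 2 * sum_n n = n * (n + 1) :=
begin
  induction n with k hk,
  { refl },
  { -- unfold one step of the recursion; this is just the defining equation
    have h : sum_n k.succ = sum_n k + (k + 1) := rfl,
    rw [h, mul_add, hk, nat.succ_eq_add_one],
    ring }
end

The division form sum_n n = n * (n + 1) / 2 then follows by dividing both sides by 2. (mathlib already contains this Gauss sum, as something like finset.sum_range_id_mul_two, so in practice one would search the library before reproving it.)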

For (2), it is well known that these systems are a lot more powerful if you know how to talk to them the right way. First, the chat in Bing is probably better, since it is GPT-4 whereas the regular ChatGPT is GPT-3.5 (or so). Next, ChatGPT is designed for back-and-forth conversation, so if it isn't getting it right (say, it is missing imports or there is an error), tell it. Just like on Stack Exchange, it helps to be clear and to provide the right information about what you think is going wrong. Remember, ChatGPT doesn't have access to a Lean interpreter or the Lean library (well, maybe Bing has access to the library). If you and ChatGPT work together, you might be able to solve it as a team. Here is one example of that happening with GPT-4: https://leanprover.zulipchat.com/#narrow/stream/219941-Machine-Learning-for-Theorem-Proving/topic/GPT/near/341914198

For (3), note this is an active and exciting area of research. Can we use LLMs to translate informal math into formal math, produce formal proofs, and otherwise generally help with interactive theorem proving? A year ago, those of us who research this would have been quite doubtful, but the field is moving fast. Since then, the following papers and research projects have come out. They aren't anywhere close to 100% accurate, but I see a lot of potential here.

  • Autoformalization with Large Language Models. Uses Codex (a precursor to ChatGPT and the model behind GitHub Copilot) to translate math theorem statements into Isabelle/HOL with about 25% accuracy. Quanta also wrote an article on this.
  • Lean Chat. Based on the previous paper, it is an app which uses Codex to translate informal math theorem statements into Lean. It is chat-like (predating ChatGPT), so the user can instruct the model on how to correct the translation.
  • Draft-Sketch-Prove. Solves competition math problems in three stages: (1) uses Minerva (a large language model for informal math) to solve the problem in informal language; (2) uses Codex to translate the solution into Isabelle/HOL, leaving the low-level details empty; (3) uses Sledgehammer to fill in those low-level details. (Given 100 attempts per problem, it has about a 35-40% success rate.)
  • LeanAid. Like Lean Chat, it translates statements into Lean, but it has very tight integration with Lean's internals, so the LLM can use Lean-specific information like type-checking and access to the Lean library.
  • ProofNet. Provides a benchmark for auto-formalization with some baselines using LLMs.
  • ChatGPT. ChatGPT came out after all these papers. Techniques which were previously available only to researchers or in specialty apps are now available for everyone to play with. But it is also clear from user reports that it has limited usefulness here.
  • Baldur. Uses Minerva to generate whole proofs in Isabelle/HOL and then uses error feedback to repair the mistakes in them. Has a pretty good success rate, around 47%.
  • GPT-4. GPT-4 was just released (or rather made public, as it has been in Bing for a few weeks), and there have been some examples suggesting it might be even better for all of the above applications.

(Note: success rates are difficult to measure, and the numbers above may be inflated for theorems already in the library, since the models may have memorized the library's code. So take these numbers with a grain of salt.)


"Is this because ChatGPT is not good enough in Lean?"

That is correct. Actually, it's not good enough at many things, for example, generating valid citations.

ChatGPT is great at producing things that look like correct facts or correct code, according to a model of natural language.

For many languages, this is good enough to produce syntactically valid code, or even type-safe code.

Lean does not let you cheat. If your proof is not valid, then it will not type check.
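A toy illustration (my own example, not the answerer's): a proof term can look perfectly plausible and still be rejected by the type checker.

-- A plausible-looking guess: surely n + 1 = 1 + n holds "by definition"?
-- Lean rejects `rfl` here: the two sides are not definitionally equal,
-- since addition on ℕ recurses on its second argument.
example (n : ℕ) : n + 1 = 1 + n := rfl  -- error: type mismatch

-- Only a genuine proof is accepted, e.g. via the commutativity lemma:
example (n : ℕ) : n + 1 = 1 + n := nat.add_comm n 1

No amount of fluent-sounding text gets past the kernel; either the term has the right type or it doesn't.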

I highly doubt LLMs will be producing proofs like this any time soon, apart from a few benchmarks. To write a proof, you need conceptual understanding, not just a linguistic model.

  • ChatGPT is also better when there are more examples online; for instance, it's a lot better at writing Python and JavaScript. Unfortunately, the formal methods community is relatively obscure and there are not many resources out there :(
    – tarzh
    Commented Mar 20, 2023 at 1:05
  • Right, but even if there were lots of examples, I'm skeptical of how well that would generalize to generating correct proofs. Commented Mar 21, 2023 at 20:46
  • @JoeyEremondi If the proofs can be checked relatively easily, maybe one could let it try a few times until the proof works. The "Sparks of AGI" paper on GPT-4 shows some of the potential of these models: arxiv.org/abs/2303.12712. With some fine-tuning on logic and olympiad problems, the ability to use Lean as a tool, and some prompt engineering that tells it to think out a plan before going into the details, one could imagine it producing a valid proof within maybe 100 guesses. Commented Mar 31, 2023 at 11:16
  • @userrandrand Maybe, but I doubt that would scale to actual conjectures, where you don't know whether a proof exists. I'm guessing it would take millions of guesses, not 100. But I'll be happy to be wrong. Commented Apr 1, 2023 at 20:45
