Pythagoras-Prover

Advancing Efficient Formal Proving via Augmented Lean Formalisation

Imperial College London University of Edinburgh Nanyang Technological University MBZUAI

Introduction

We introduce Pythagoras-Prover, a compute-efficient family of open-source large language models for formal theorem proving in Lean. The family comprises two standard autoregressive models with parameter counts of 4B and 32B together with the first diffusion-based theorem-proving model, which performs iterative refinement over Lean proofs at inference time. Its training recipe centres on a Lean-verified synthetic data pipeline that produces a training corpus across easy, medium, and hard difficulty tiers; complementing the pipeline are a parameter-efficient supervised fine-tuning stage equipped with a dynamic proof-reasoning filtering scheme that retains long proofs within a short context budget, and a final reinforcement-learning stage. Within the data pipeline we propose Augmented Lean Formalisation (ALF), a mutation scheme that expands the training corpus via structured formal variants, paired with a self-distillation stage that re-uses these variants without requiring formal verification. Empirically, Pythagoras-Prover-32B achieves state-of-the-art performance among open-source neural theorem provers, attaining 93.03% on MiniF2F-Test and solving 93 of 672 problems on PutnamBench, while Pythagoras-Prover-4B outperforms DeepSeek-Prover-V2-671B on MiniF2F-Test despite being roughly 167× smaller. We additionally release MiniF2F-ALF, an ALF-mutated decontamination benchmark on which every evaluated model loses accuracy; on this split, Pythagoras-Prover-32B remains the strongest evaluated prover and Pythagoras-Prover-4B reaches parity with Goedel-Prover-V2-32B, the prior state of the art.

Key Findings and Achievements

  • Reaches 89.75% on MiniF2F-Test at Pass@32 and 93.03% at Pass@2048.
  • Pythagoras-Prover-4B attains 86.07% at Pass@32, outperforming DeepSeek-Prover-V2-671B (82.4%) while being roughly 167× smaller.
  • Ranks 1st on the PutnamBench leaderboard among open-source models.
  • Releases MiniF2F-ALF, a 488-statement less-saturated benchmark where every evaluated prover degrades.
  • Uses ALF to expand training to ~2M structured Lean variants without per-instance compilation.
  • Introduces Pythagoras-Prover-Diffusion, the first diffusion-based theorem prover for Lean.
  • Uses a compute-efficient recipe: LoRA-only SFT, an 8K context, and an easy/medium/hard curriculum that transfers from 4B to 32B.
Pythagoras-Prover benchmark overview

Method

Our recipe turns a small seed corpus into a state-of-the-art prover in three stages.

1

Synthetic Data Generation

A compute-efficient pipeline autoformalises and Lean-verifies problems using predominantly sub-30B open models, then applies a failure-mode-conditioned rubric-guided distillation step that re-prompts on each rejected seed to target the specific Lean type-checker error — yielding a Lean-verified corpus partitioned into easy, medium, and hard tiers.

Data · Synthesis
2

Model Training

A three-stage difficulty-ordered curriculum LoRA-fine-tunes Qwen3-4B and Qwen3-32B under an 8K context using a dynamic proof-reasoning filter, followed by an RL stage with a Lean-compilation reward.

Train · Curriculum + RL
3

Augmented Lean Formalisation

A mutation scheme systematically expands each Lean statement into ~2M structured formal variants and re-uses them in a self-distillation stage without per-instance Lean compilation, replacing it with a compute-efficient statement-alignment filter.

Augment · ALF
MiniF2F-ALF evaluation results

Experimental Results

Benchmarks

Following previous works, we primarily use MiniF2F as our main evaluation benchmark. Additionally, we introduce MiniF2F-ALF, a mutated formal statement within MiniF2F. The benchmarks represent a diverse range of mathematical problems, from high-school level to undergraduate mathematics.

Main Results

Model Pass Performance
Pass ≤ 128
Goedel-Prover-SFT 32 57.6%
STP 128 61.2%
Kimina-Prover-Preview-72B 32 68.85%
DeepSeek-Prover-V2-7B 32 75.6%
Kimina-Prover-8B 32 78.3%
DeepSeek-Prover-V2-671B 32 82.4%
Kimina-Prover-70B 32 84.0%
Goedel-Prover-V2-8B 32 84.6%
Goedel-Prover-V2-8B w/ Self-Correction 32 86.7%
Goedel-Prover-V2-32B 32 88.1%
Goedel-Prover-V2-32B w/ Self-Correction 32 90.4%
Pythagoras-Prover-4B 32 86.07%
Pythagoras-Prover-32B 32 89.75%
Pass = 1024
Kimina-Prover-70B 1024 87.7%
Goedel-Prover-V2-8B 1024 87.9%
Goedel-Prover-V2-8B w/ Self-Correction 1024 89.3%
Goedel-Prover-V2-32B 1024 91.8%
Goedel-Prover-V2-32B w/ Self-Correction 1024 92.6%
Pythagoras-Prover-4B 1024 88.11%
Pythagoras-Prover-32B 1024 92.62%
Pass ≥ 2048
Goedel-Prover-SFT 3200 62.7%
STP 3200 65.0%
STP 25600 67.6%
Kimina-Prover-Preview-72B 8192 80.74%
DeepSeek-Prover-V2-7B 8192 82.0%
DeepSeek-Prover-V2-671B 8192 88.9%
Goedel-Prover-V2-8B 8192 90.2%
Kimina-Prover-70B w/ Test-Time Reinforcement Learning -- 92.2%
Goedel-Prover-V2-32B 8192 92.2%
Pythagoras-Prover-4B 2048 89.75%
Pythagoras-Prover-32B 2048 93.03%
Table 1: Pass@N on MiniF2F-Test across inference budgets
Model Performance
DeepSeek-Prover-V2-671B 79.71%
Goedel-Prover-V2-8B 82.58%
Goedel-Prover-V2-32B 83.61%
Pythagoras-Prover-4B 83.19%
Pythagoras-Prover-32B 85.04%
Table 2: MiniF2F-ALF Performance (Pass@32)
# Model Open-source Compute Num-solved
1 Pythagoras-Prover pass@2048 93
1 Pythagoras-Prover pass@64 59
1 Pythagoras-Prover pass@32 48
2 Goedel-Prover-V2 (self-correction mode) pass@184 86
2 Goedel-Prover-V2 (self-correction mode) pass@32 57
2 Goedel-Prover-V2 pass@32 43
3 DeepSeek-Prover-V2 pass@1024 47
3 DeepSeek-Prover-V2 pass@32 22
4 DSP+ pass@128 23
5 Bourbaki pass@512 14
6 Kimina-Prover-7B-Distill pass@192 10
7 Self-play Theorem Prover pass@3200 8
8 Goedel-Prover-SFT pass@512 7
9 ABEL pass@596 7
Table 3: PutnamBench Results

Diffusion Theorem Proving

Model Performance
Pythagoras-Prover 86.07%
Pythagoras-Prover 74.59%
Diffusion Pythagoras-Prover 63.25%
denotes the setting where training tokens are restricted to 4096 and evaluation is performed solely at 8192 tokens.
Table 4: Diffusion Theorem Proving Performance (Pass@32)

Analysis

Four closer looks at scaling, mutation robustness, and the accuracy–throughput frontier.

Scaling Analysis

How efficiently does Pythagoras-Prover turn inference budget into solved problems? We trace pass@N from 32 to 2048 on MiniF2F-Test against the strongest open baselines.

  • Pythagoras-Prover-32B leads at every budget, reaching 93.03% at pass@2048.
  • Pythagoras-Prover-4B surpasses the ~167× larger DeepSeek-Prover-V2-671B across the full range.
Tap or hover to read the curve
Pass@N scaling comparison

Consistent, not budget-specific. Pythagoras-Prover-32B leads at every budget and overtakes the self-correction-augmented Goedel-Prover-V2-32B from ~pass@256. Even ~167× smaller, Pythagoras-Prover-4B beats DeepSeek-Prover-V2-671B at every shared budget — its 89.75% at pass@2048 tops that model's pass@8192 (88.9%). Both curves start high then flatten: strong accuracy from far fewer samples.

ALF Makes Strong Provers Distinguishable

MiniF2F is saturated, so strong provers rarely disagree. ALF mutation roughly doubles the instances where they can be told apart — recovering discriminative power, not just reshuffling winners.

  • 4B vs Goedel-8B: 9 disagreements under ALF vs 7 on the original.
  • 32B vs Goedel-32B: 17 vs 9.
Tap or hover to read the map
Domain composition of failures on original MiniF2F
Original · 41 failures
Domain composition of failures on MiniF2F-ALF
MiniF2F-ALF · 57 failures

On original MiniF2F, failures cluster in olympiad sets (AMC + IMO ≈ two-thirds of 41). Under ALF the set grows to 57 and spreads — MathD jumps to ~20%, breaking the IMO/AMC concentration. ALF surfaces structurally simple cases where strong provers are brittle, adding discriminative signal across the full difficulty range.

Per-problem original to mutated transition heatmap
Per-problem original→mutated transition (four-prover panel)
  • Solved before & after
  • Solved → failed (brittle)
  • Failed → solved
  • Failed before & after

Robust Under ALF Perturbation

Re-evaluated on MiniF2F-ALF under one unified setup, every prover loses accuracy — but Pythagoras-Prover retains the most, the signature of training on ALF-augmented data.

  • Pythagoras-Prover-32B keeps the highest pass rate: 85.04%.
  • Pythagoras-Prover-4B (83.19%) matches the 8× larger Goedel-32B, degrading less (2.9 vs 4.5 pts).
Tap or hover to see MiniF2F-ALF
Pass@32 performance on MiniF2F-ALF

Every model degrades on MiniF2F-ALF, confirming a real distribution shift. Pythagoras-Prover-32B holds the top pass rate (85.04%); Pythagoras-Prover-4B (83.19%) is on par with the 8× larger Goedel-Prover-V2-32B and drops less under perturbation (2.9 vs 4.5 points). Because our training corpus is itself ALF-augmented, this retention is the expected signature — MiniF2F performance reflects proof ability, not sensitivity to the phrasing of the test statements.

Compute Efficiency

Pythagoras-Prover-Diffusion trades per-attempt accuracy for substantially higher generation throughput.

  • AR: 86.07% at 4.10 TPS.
  • Diffusion: 63.25% at 10.56 TPS.
  • Throughput-weighted score favors diffusion by 1.89×.
Tap or hover for details

Compute Efficiency: AR versus Diffusion

Sharing the same self-distillation corpus and differing only in decoding (matched MiniF2F pass@32, 8× H100, batch 4), AR wins on raw accuracy, 86.07% versus 63.25% (a 22.82-point gap) — but that is per-attempt only. Diffusion sustains 10.56 versus 4.10 TPS (2.58× faster), since AR decoding is sequential and costly on long completions while diffusion generates block-parallel under a fixed denoising budget.

A throughput-weighted score S = accuracy × TPS gives 3.53 (AR) versus 6.68 (diffusion), a 1.89× edge — though at this point S is dominated by throughput, so it reflects compute efficiency more than proving skill. Bottom line: AR for per-attempt reliability, diffusion where throughput or latency binds, with room to grow as diffusion context lengths increase.

BibTeX

@inproceedings{leang2025theorem,
  title={Theorem prover as a judge for synthetic data generation},
  author={Leang, Joshua Ong Jun and Hong, Giwon and Li, Wenda and Cohen, Shay B},
  booktitle={Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
  pages={29941--29977},
  year={2025}
}