Pythagoras-Prover: Advancing Efficient Formal Proving via Augmented Lean Formalisation

Introduction

Modern Lean theorem provers achieve strong performance only with substantial training and inference compute. This cost is driven in part by the scarcity of verified proof data and by the long reasoning traces required for formal proof search, which make both supervised fine-tuning and sampling expensive. We introduce Pythagoras-Prover, a compute-efficient open-source family of Lean theorem provers designed to deliver strong performance under practical compute budgets. The family spans two generation paradigms: two autoregressive models with 4B and 32B parameters, and a first proof-of-concept diffusion-based theorem-proving model (4B), which iteratively refines Lean proofs at inference time. To make training more efficient, we construct a Lean-verified corpus stratified into easy, medium, and hard problems and use it for curriculum supervised fine-tuning, allowing the models to acquire proof skills progressively from shorter and simpler proofs to longer and more difficult ones. During supervised fine-tuning, we further apply a dynamic proof-reasoning filtering scheme that preserves informative proof traces while ensuring each training instance fits within an 8k-token context budget. We further introduce Augmented Lean Formalisation (ALF), which expands scarce verified corpora into variants of formal statements; these variants are then populated through self-distillation, providing additional training signal without requiring every mutated instance to be formally verified. By perturbing known problems while preserving their formal character, ALF exposes the model to structured variants of verified problems, reducing reliance on any single statement’s surface form. Empirically, Pythagoras-Prover demonstrates strong performance across model scales. Most notably, Pythagoras-Prover-4B surpasses DeepSeek-Prover-V2-671B at pass@32 on MiniF2F-Test (82.4% → 86.1%), despite using roughly 167× fewer parameters. Scaling to 32B further yields state-of-the-art performance among open-source neural theorem provers, with Pythagoras-Prover-32B attaining 93.0% on MiniF2F-Test and solving 93 of 672 problems on PutnamBench. We additionally release MiniF2F-ALF, an ALF-mutated contamination-sensitive perturbation benchmark on which every evaluated model loses accuracy; on this split, Pythagoras-Prover-32B remains the strongest evaluated prover, while Pythagoras-Prover-4B reaches parity with Goedel-Prover-V2-32B, the prior state of the art. Together, these results show that strong Lean theorem proving need not rely exclusively on frontier-scale models.

Key Findings and Achievements

Reaches 89.8% on MiniF2F-Test at Pass@32 and 93.0% at Pass@2048.
Pythagoras-Prover-4B attains 86.1% at Pass@32, outperforming DeepSeek-Prover-V2-671B (82.4%) while being roughly 167× smaller.
Ranks 1st on the PutnamBench leaderboard among open-source models.
Releases MiniF2F-ALF, a 488-statement less-saturated benchmark where every evaluated prover degrades.
Uses ALF to expand training to ~2M structured Lean variants without per-instance compilation.
Introduces Pythagoras-Prover-Diffusion, the first diffusion-based theorem prover for Lean.
Uses a compute-efficient recipe: LoRA-only SFT, an 8K context, and an easy/medium/hard curriculum that transfers from 4B to 32B.

Method

Our recipe turns a small seed corpus into a state-of-the-art prover in three stages.

1

Synthetic Data Generation

A compute-efficient pipeline autoformalises and Lean-verifies problems using predominantly sub-30B open models, then applies a failure-mode-conditioned rubric-guided distillation step that re-prompts on each rejected seed to target the specific Lean type-checker error — yielding a Lean-verified corpus partitioned into easy, medium, and hard tiers.

Data · Synthesis

➞

2

Model Training

A three-stage difficulty-ordered curriculum LoRA-fine-tunes Qwen3-4B and Qwen3-32B under an 8K context using a dynamic proof-reasoning filter, followed by an RL stage with a Lean-compilation reward.

Train · Curriculum + RL

➞

3

Augmented Lean Formalisation

A mutation scheme systematically expands each Lean statement into ~2M structured formal variants and re-uses them in a self-distillation stage without per-instance Lean compilation, replacing it with a compute-efficient statement-alignment filter.

Augment · ALF

Experimental Results

Benchmarks

Following previous works, we primarily use MiniF2F as our main evaluation benchmark. Additionally, we introduce MiniF2F-ALF, a mutated formal statement within MiniF2F. The benchmarks represent a diverse range of mathematical problems, from high-school level to undergraduate mathematics.

Main Results

Model	Pass	Performance
Pass ≤ 128
Goedel-Prover-SFT	32	57.6%
STP	128	61.2%
Kimina-Prover-Preview-72B	32	68.85%
DeepSeek-Prover-V2-7B	32	75.6%
Kimina-Prover-8B	32	78.3%
DeepSeek-Prover-V2-671B	32	82.4%
Kimina-Prover-70B	32	84.0%
Goedel-Prover-V2-8B	32	84.6%
Goedel-Prover-V2-8B w/ Self-Correction	32	86.7%
Goedel-Prover-V2-32B	32	88.1%
Goedel-Prover-V2-32B w/ Self-Correction	32	90.4%
Pythagoras-Prover-4B	32	86.1%
Pythagoras-Prover-32B	32	89.8%
Pass = 1024
Kimina-Prover-70B	1024	87.7%
Goedel-Prover-V2-8B	1024	87.9%
Goedel-Prover-V2-8B w/ Self-Correction	1024	89.3%
Goedel-Prover-V2-32B	1024	91.8%
Goedel-Prover-V2-32B w/ Self-Correction	1024	92.6%
Pythagoras-Prover-4B	1024	88.1%
Pythagoras-Prover-32B	1024	92.6%
Pass ≥ 2048
Goedel-Prover-SFT	3200	62.7%
STP	3200	65.0%
STP	25600	67.6%
Kimina-Prover-Preview-72B	8192	80.74%
DeepSeek-Prover-V2-7B	8192	82.0%
DeepSeek-Prover-V2-671B	8192	88.9%
Goedel-Prover-V2-8B	8192	90.2%
Kimina-Prover-70B w/ Test-Time Reinforcement Learning	--	92.2%
Goedel-Prover-V2-32B	8192	92.2%
Pythagoras-Prover-4B	2048	89.8%
Pythagoras-Prover-32B	2048	93.0%

Table 1: Pass@N on MiniF2F-Test across inference budgets

Model	Performance
DeepSeek-Prover-V2-671B	79.7%
Goedel-Prover-V2-8B	82.6%
Goedel-Prover-V2-32B	83.6%
Pythagoras-Prover-4B	83.2%
Pythagoras-Prover-32B	85.0%

Table 2: MiniF2F-ALF Performance (Pass@32)

#	Model	Open-source	Compute	Num-solved
1	Pythagoras-Prover	✓	pass@2048	93
1	Pythagoras-Prover	✓	pass@64	59
1	Pythagoras-Prover	✓	pass@32	48
2	Goedel-Prover-V2 (self-correction mode)	✓	pass@184	86
2	Goedel-Prover-V2 (self-correction mode)	✓	pass@32	57
2	Goedel-Prover-V2	✓	pass@32	43
3	DeepSeek-Prover-V2	✓	pass@1024	47
3	DeepSeek-Prover-V2	✓	pass@32	22
4	DSP+	✓	pass@128	23
5	Bourbaki	✓	pass@512	14
6	Kimina-Prover-7B-Distill	✓	pass@192	10
7	Self-play Theorem Prover	✓	pass@3200	8
8	Goedel-Prover-SFT	✓	pass@512	7
9	ABEL	✗	pass@596	7

Table 3: PutnamBench Results

Diffusion Theorem Proving

Model	Performance
Pythagoras-Prover	86.1%
Pythagoras-Prover^∗	74.6%
Diffusion Pythagoras-Prover	63.3%

^∗ denotes the setting where training tokens are restricted to 4096 and evaluation is performed solely at 8192 tokens.

Table 4: Diffusion Theorem Proving Performance (Pass@32)

Analysis

A closer look at Pythagoras-Prover's scaling, mutation robustness, and accuracy-throughput frontier.

Scaling Analysis

How efficiently does Pythagoras-Prover turn inference budget into solved problems? We trace pass@N from 32 to 2048 on MiniF2F-Test against the strongest open baselines.

Pythagoras-Prover-32B leads at every budget, reaching 93.0% at pass@2048.
Pythagoras-Prover-4B surpasses the ~167× larger DeepSeek-Prover-V2-671B across the full range.

Tap or hover to read the curve

Consistent, not budget-specific. Pythagoras-Prover-32B leads at every budget and overtakes the self-correction-augmented Goedel-Prover-V2-32B from ~pass@256. Even ~167× smaller, Pythagoras-Prover-4B beats DeepSeek-Prover-V2-671B at every shared budget — its 89.8% at pass@2048 tops that model's pass@8192 (88.9%). Both curves start high then flatten: strong accuracy from far fewer samples.

ALF Makes Strong Provers Distinguishable

MiniF2F is saturated, so strong provers rarely disagree. ALF mutation roughly doubles the instances where they can be told apart — recovering discriminative power, not just reshuffling winners.

4B vs Goedel-8B: 9 disagreements under ALF vs 7 on the original.
32B vs Goedel-32B: 17 vs 9.

Tap or hover to read the map

Domain composition of failures on original MiniF2F — Original · 41 failures

Domain composition of failures on MiniF2F-ALF — MiniF2F-ALF · 57 failures

On original MiniF2F, failures cluster in olympiad sets (AMC + IMO ≈ two-thirds of 41). Under ALF the set grows to 57 and spreads — MathD jumps to ~20%, breaking the IMO/AMC concentration. ALF surfaces structurally simple cases where strong provers are brittle, adding discriminative signal across the full difficulty range.

Per-problem original to mutated transition heatmap — Per-problem original→mutated transition (four-prover panel)

Solved before & after
Solved → failed (brittle)
Failed → solved
Failed before & after

Robust Under ALF Perturbation

Re-evaluated on MiniF2F-ALF under one unified setup, every prover loses accuracy — but Pythagoras-Prover retains the most, the signature of training on ALF-augmented data.

Pythagoras-Prover-32B keeps the highest pass rate: 85.0%.
Pythagoras-Prover-4B (83.2%) matches the 8× larger Goedel-32B, degrading less (2.9 vs 4.5 pts).

Tap or hover to see MiniF2F-ALF

Every model degrades on MiniF2F-ALF, confirming a real distribution shift. Pythagoras-Prover-32B holds the top pass rate (85.0%); Pythagoras-Prover-4B (83.2%) is on par with the 8× larger Goedel-Prover-V2-32B and drops less under perturbation (2.9 vs 4.5 points). Because our training corpus is itself ALF-augmented, this retention is the expected signature — MiniF2F performance reflects proof ability, not sensitivity to the phrasing of the test statements.

Compute Efficiency

Pythagoras-Prover-Diffusion trades per-attempt accuracy for substantially higher generation throughput.

AR: 86.1% at 4.10 TPS.
Diffusion: 63.3% at 10.56 TPS.
Throughput-weighted score favors diffusion by 1.89×.

Tap or hover for details

Compute Efficiency: AR versus Diffusion

Sharing the same self-distillation corpus and differing only in decoding (matched MiniF2F pass@32, 8× H100, batch 4), AR wins on raw accuracy, 86.1% versus 63.3% (a 22.8-point gap) — but that is per-attempt only. Diffusion sustains 10.56 versus 4.10 TPS (2.58× faster), since AR decoding is sequential and costly on long completions while diffusion generates block-parallel under a fixed denoising budget.

A throughput-weighted score S = accuracy × TPS gives 3.53 (AR) versus 6.68 (diffusion), a 1.89× edge — though at this point S is dominated by throughput, so it reflects compute efficiency more than proving skill. Bottom line: AR for per-attempt reliability, diffusion where throughput or latency binds, with room to grow as diffusion context lengths increase.

BibTeX

@article{leang2026pythagoras,
  title={Pythagoras-Prover: Advancing Efficient Formal Proving via Augmented Lean Formalisation},
  author={Leang, Joshua Ong Jun and Zhao, Zheng and Stoian, Mihaela Catalina and Xu, Qiyuan and Li, Haonan and Li, Wenda and Cohen, Shay B. and Giunchiglia, Eleonora},
  journal={arXiv preprint arXiv:2606.12594},
  year={2026}
}