Goedel-Architect: Blueprint Generation and Refinement for Formal Theorem Proving

Goedel-Architect Streamlining Formal Theorem Proving with Blueprint Generation and Refinement

Jui-Hui Chung^1,*, Ziyang Cai^1,*, Zihao Li¹, Qishuo Yin¹, Rohit Agarwal¹, Simon Park¹, Rodrigo Porto¹, Narutatsu Ri¹, Ziran Yang¹,
Shange Tang¹, Xingyu Dang¹, Hongzhou Lin², Mengdi Wang¹, Danqi Chen¹, Chi Jin¹, Liam H Fowl^1,*,†, Sanjeev Arora^1,†

¹Princeton Language and Intelligence ²Amazon

^*Equal contribution ^†Joint last authors

Introduction

We introduce Goedel-Architect, an agentic framework for formal theorem proving in Lean 4 centered on blueprint generation and refinement. A blueprint is a dependency graph of definitions and lemmas that builds up to the main theorem. First, Goedel-Architect generates a blueprint of formally stated definitions and lemmas, along with declared dependencies. This blueprint is optionally guided by a natural language proof. Then, a tool-equipped Lean prover component closes each open lemma node in parallel using relevant dependencies. Failed lemmas in turn drive refinement of the global blueprint. This contrasts with mainstream approaches based on recursive lemma decomposition, which can loop inefficiently on dead ends.

Using the open-weight DeepSeek-V4-Flash (284B-A13B) as the backbone, Goedel-Architect attains 99.2% at pass@1 on MiniF2F-test and 88.8% at pass@4 (597/672) on PutnamBench. This pass@4 run on PutnamBench cost $985.67 in API calls, including all unsuccessful attempts, or about $1.65 per problem solved. With an optional Gemini 3.1 Pro natural-language proof seeding the initial blueprint, while all subsequent formal proving is carried out by DeepSeek-V4-Flash, we additionally close the remaining two MiniF2F-test problems (reaching 100%) and solve 4/6 on IMO 2025, 11/12 on Putnam 2025, and 3/6 on USAMO 2026.

Goedel-Architect vs. prior systems across four formal theorem-proving benchmarks

Goedel-Architect (orange) against prior systems on four formal theorem-proving benchmarks. Bars show the percentage of problems solved; the number above each bar is the count solved. ^*On IMO 2025 we did not attempt P2, a geometry problem that typically requires a dedicated non-Lean geometry engine.

Blueprints

Each solved problem is an interactive dependency-graph blueprint: the main theorem sits at the center of its rings, surrounded by every definition and lemma it relies on. Switch between problems from IMO 2025, Putnam 2025, USAMO 2026, and selected problems from PutnamBench using the panel on the right, scrub the refinement iterations along the bottom, and hover any node to read its statement. Click a node to see its exact Lean 4 source below.

solved unsolved formally negated

Framework

Goedel-Architect is an agentic pipeline that proves theorems in Lean 4 by building and refining a blueprint: a dependency graph whose nodes are formally stated definitions and lemmas building up to the target theorem. Blueprint generation takes only the formal statement of the target theorem and emits a directed acyclic graph. For harder problems this step can optionally be guided by a natural-language proof.

Theorem proving then dispatches each lemma to a Lean prover in parallel; every prover sees only its own lemma and the dependencies that lemma declared, presented as available facts, and may call the Lean compiler and a Mathlib retrieval tool repeatedly until it closes the goal or exhausts its budget. Each node ends up marked solved, unsolved, or formally negated. If any lemma is left unproved, blueprint refinement rewrites the graph around the failures: it reads each prover's diagnosis together with any disproof, then decomposes hard lemmas into intermediate helper lemmas, rewires dependencies so a lemma can reach the facts it needs, and repairs or drops lemmas whose statements were shown to be false. Lemmas that an earlier pass already proved are preserved with their signatures intact, so the proving budget spent on them is not discarded, and the revised blueprint is sent through another proving pass. This loop continues until every node is solved and the blueprint assembles into a complete Lean proof, or the iteration budget is reached.

Scaling

What matters in practice is how many problems a system can close for a given amount of compute. On a shared random subset of 200 PutnamBench problems, all run on the same open-weight DeepSeek-V4-Flash backbone, we chart the fraction of problems solved as a function of the per-problem token budget, charging each problem the cumulative tokens spent on it up to the moment it is first proved. At every budget, Goedel-Architect's blueprint loop solves more than either baseline: direct inference, which simply samples full proofs, never clears 7%, and a tool-integrated agent, which proves each theorem in one shot with compiler and retrieval tools, plateaus near 55%. Decomposing a problem into a blueprint and refining it carries Goedel-Architect to roughly 80% on the same budget, and seeding the initial blueprint with a natural-language proof sketch recovers a further set of hard problems, lifting the curve to 92.5%.

Per-problem token budget versus problems solved on PutnamBench

Per-problem token budget versus problems solved on a shared random subset of 200 PutnamBench problems, all using the same open-weight DeepSeek-V4-Flash backbone. Each curve gives the percentage of problems solved once a problem is charged the cumulative tokens spent before its first proof.