Open Problems Solver

A theorem prover and a counterexample searcher — two complementary AI pipelines for attacking open mathematical problems.

Dual-Pipeline Architecture
[Diagram: dual-pipeline architecture]

Input: an open problem (JSON), routed to either "prove" or "disprove".

Prover (positive route): Phase 0 Literature, Phase 1 Plan (lemma DAG), Phase 2 Prove Lemmas (recursive decomposition; dual audit by GPT + Claude per lemma), Phase 3 Integrate, Phase 4 Final Audit. Output: ProofArtifact.

Disprover (negative route): Phase 1 Literature, Phase 2 Analysis, Phase 3 Algorithm Design (staged search with executable Python), Phase 4 Multi-Round Review (correctness, optimality, adversarial), Phase 5 Feasibility Gate. Output: document + code.

Prover — Positive Route

Decomposes a conjecture into a DAG of lemmas. Proves each lemma via recursive decomposition with backtracking. Each proof is independently audited by GPT and Claude.

Disprover — Negative Route

Designs a computational search algorithm to find counterexamples. Three independent reviewers stress-test the algorithm. Feasibility gate prevents outputting unrunnable code.

Shared Infrastructure

GPT + Claude dual provider. API or CLI mode. DuckDuckGo literature search. Per-phase checkpointing with resume. Full I/O logging.
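The per-phase checkpointing with resume can be sketched as follows. This is a minimal illustration, not the project's actual API; the function name, checkpoint directory, and JSON layout are all assumptions.

```python
import json
from pathlib import Path

CHECKPOINT_DIR = Path("checkpoints")  # hypothetical checkpoint location


def run_phase(name, fn, *args):
    """Run a pipeline phase, reusing a saved result if one already exists."""
    CHECKPOINT_DIR.mkdir(exist_ok=True)
    path = CHECKPOINT_DIR / f"{name}.json"
    if path.exists():
        # Resume: a completed phase is loaded from disk, not re-run.
        return json.loads(path.read_text())
    result = fn(*args)
    path.write_text(json.dumps(result))  # persist before moving on
    return result
```

A crashed run then restarts from the first phase without a checkpoint file, e.g. `plan = run_phase("phase1_plan", decompose, problem)`.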

Prover: Theorem Proving Pipeline
[Diagram: prover pipeline]

Input: problem (JSON).

Phase 0, Literature Research: search for known approaches and related results.
Phase 0.5, Literature Processing: structure findings into categorized form, with an audit loop.
Phase 1, Plan (lemma decomposition): Claude audit loop; confidence < 0.75 triggers a replan. Output is a lemma DAG, parallelized by dependency level (e.g. level 0: lemmas A, B, C; level 1: lemmas D, E).
Phase 2, Recursive Proving: each lemma is either proved directly or decomposed into sub-lemmas and recursed on; a failed branch backtracks to the next strategy. Limits: max depth 25, max 300 lemmas, 3 strategies tried per decomposition.
Audit gate: each proved lemma must pass both the GPT self-audit and the Claude cross-audit (accept or block).
Phase 3, Integrate Proof: assemble the full proof from all proved lemmas.
Phase 4, Final Audit (Claude): end-to-end coherence check.
Output: ProofArtifact (JSON), marked complete, partial, or failed.

Literature & Planning

Web search for known approaches (Phase 0), structured into categories (Phase 0.5). GPT decomposes the conjecture into a DAG of lemmas with explicit dependencies (Phase 1). Lemmas at the same dependency level are proved in parallel.
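Grouping the DAG into parallelizable batches amounts to computing each lemma's dependency level. A minimal sketch, assuming the DAG is a dict mapping each lemma to its dependencies (the real representation is not shown in this document):

```python
def dependency_levels(deps):
    """Group lemmas by dependency level: level 0 lemmas depend on nothing;
    a lemma's level is one more than the deepest of its dependencies."""
    levels = {}

    def level(lemma):
        if lemma not in levels:
            levels[lemma] = 1 + max((level(d) for d in deps[lemma]), default=-1)
        return levels[lemma]

    for lemma in deps:
        level(lemma)

    grouped = {}
    for lemma, lv in sorted(levels.items(), key=lambda kv: kv[1]):
        grouped.setdefault(lv, []).append(lemma)
    return grouped

# Lemmas at the same level have no dependencies on one another, so each
# level can be proved in parallel:
# dependency_levels({"A": [], "B": [], "C": [], "D": ["A", "B"], "E": ["C"]})
# groups A, B, C at level 0 and D, E at level 1.
```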

Recursive Proving

Each lemma: decide "direct proof" or "decompose into sub-lemmas." Decomposition generates 3 strategies; if one fails, backtrack and try the next.

Dual Audit

GPT self-audits for internal consistency; Claude cross-audits independently. Both must accept. Failed audits trigger retry with feedback.

Disprover: Counterexample Search Pipeline
[Diagram: disprover pipeline]

Phase 1, Literature Research: generate 20 queries, web search, synthesize findings.
Phase 1.5, Literature Processing: audit then revise; structure findings into categories (known bounds, prior attempts, techniques).
Phase 2, Structural Analysis: review then revise; derive constraints a valid counterexample must satisfy.
Phase 3, Algorithm Design: design a staged search strategy with executable Python code (v1).
Phase 4, Multi-Round Review: three reviewers (correctness: mathematical + code errors; optimality: efficiency + parameter ranges; adversarial: missed edge cases + gaps). Any blocking issue triggers an algorithm revision and a version bump (v2, v3, ...); all three must pass before proceeding, with at most 3 rounds per reviewer.
Phase 5, Complexity + Feasibility: estimate runtime and search space; verify the code is practically runnable; if infeasible, redesign and re-estimate.
Phase 6, Final Document Assembly: assemble a self-contained document and extract the executable code.
Output: FINAL_DOCUMENT.md and search_algorithm.py.

Literature + Analysis

Searches for known bounds, prior computational attempts, and construction techniques. Derives mathematical constraints any valid counterexample must satisfy.
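The derived constraints then act as cheap necessary conditions, pruning candidates before any expensive verification runs. A minimal sketch; the two example constraints are invented placeholders, not conditions from any real problem:

```python
# Hypothetical constraints derived in structural analysis: each is a cheap
# predicate that any valid counterexample must satisfy.
CONSTRAINTS = [
    lambda c: c["n"] % 2 == 0,  # e.g. "a counterexample must have even size"
    lambda c: c["n"] > 10,      # e.g. "a known bound rules out n <= 10"
]


def passes_constraints(candidate):
    """Necessary-condition filter applied before the expensive full check."""
    return all(check(candidate) for check in CONSTRAINTS)
```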

Triple Review

Three independent reviewers (correctness, optimality, adversarial) run sequentially. Each can trigger an algorithm revision — the version increments until all pass.

Feasibility Gate

Estimates runtime and search space size. If infeasible, redesigns the algorithm and re-estimates — prevents outputting code that can't finish.