How Google DeepMind's AlphaGeometry Reached Math Olympiad Level Reasoning By Combining Creative LLMs With Deductive Symbolic Engines
How open-sourced Neurosymbolic AI (Neural Networks + Symbolic Systems) is revolutionizing automated theorem proving
DeepMind's AlphaGeometry [8] represents a significant leap forward in automated geometric reasoning, tackling Olympiad-level geometry problems with performance rivaling human experts.
DeepMind has a track record of solving hard problems, with the goal of matching and then surpassing human performance.
AlphaGo and AlphaZero [10, 11]: AI systems that mastered complex games like Go, surpassing human-level performance.
AlphaFold [9]: Creating an AI system that can accurately predict protein structures, which has significant implications for biological research and drug discovery.
In a previous post, we covered what drove AlphaFold's success.
Their neuro-symbolic system, AlphaGeometry, solved 25 out of 30 problems from a benchmark set of International Mathematical Olympiad (IMO) geometry questions, surpassing the previous state-of-the-art method's 10 problems and approaching the average IMO gold medalist's 25.9 problems.
AlphaGeometry does use a language model, but in close coordination with a symbolic system that keeps the LLM grounded. On top of this, it achieves the feat without relying on human demonstrations.
Would a stand-alone LLM be able to do the same?
Not quite.
“When producing full natural-language proofs on IMO-AG-30, however, GPT-4 has a success rate of 0%, often making syntactic and semantic errors throughout its outputs, showing little understanding of geometry knowledge and of the problem statements itself.”
GPT-4 [2] fails with a success rate of 0%. On its own, a stand-alone LLM [1, 2, 3, 4, 5, 6] is nowhere near as good. This makes a strong case for agentic workflows and tool calling, and it suggests that stand-alone LLMs might just be good chat agents unless given additional context (such as in Retrieval Augmented Generation, which we saw in a previous article) and tooling.
What is more, they have made the code repository available on GitHub (see the References section at the bottom for the link) [7]!
Today, we take a deep dive into what makes AlphaGeometry work:
What is the problem at hand?
An Overview of AlphaGeometry
Detailed Breakdown of Essential Building Blocks
Design Choices
Key Insights and Takeaways
How good is AlphaGeometry’s Performance? Quantitative + Qualitative
Would you like to read other related pieces?
Here you can read more about the Transformers series and LLMs series.
1. Scoping the Problem
Geometric theorem proving at the Olympiad level poses unique difficulties. Traditional symbolic methods with hand-crafted heuristics have struggled to address these challenges effectively.
Wait, but what are even symbolic systems?
“Systems that are built with symbols, like natural language, programming languages, and formal logic; …” - Stanford’s Symbolic Systems [12]
Complexity of geometry & scarcity of training data
Proving theorems at this level has long been a formidable challenge for artificial intelligence. The complexity lies not just in the intricate nature of geometry itself, but also in the scarcity of training data.
Challenges in translating human proofs to machine-verifiable format
Unlike other mathematical domains, geometry faces unique translation challenges when converting human proofs into machine-verifiable format. This data bottleneck has caused geometry to lag behind in recent AI advancements that rely heavily on human demonstrations.
Infinite branching factor in geometric proofs
Traditional approaches to geometry theorem proving have primarily relied on symbolic methods and human-designed, hard-coded search heuristics. However, these methods often struggle with the infinite branching factor inherent in geometric proofs, particularly when it comes to auxiliary constructions - a crucial aspect of many complex geometric arguments.
An auxiliary construction refers to an additional line, segment, circle, or other geometric element that is introduced into a figure to help solve a problem, prove a theorem, or simplify a geometric proof. These constructions are not originally part of the given problem but are added to make the relationships between the elements clearer.
The key challenge, therefore, was to develop a system that could not only prove geometric theorems at a high level but do so without relying on human-provided examples or curated problem statements. This system would need to generate its own training data, learn to make auxiliary constructions, and combine neural and symbolic approaches effectively.
2. The Solution
Enter AlphaGeometry [8], a neuro-symbolic system developed by DeepMind that has achieved remarkable success in geometric theorem proving. AlphaGeometry sidesteps the need for human demonstrations by synthesizing millions of theorems and proofs across different levels of complexity.
At its core, AlphaGeometry consists of two main components:
A neural language model: Trained from scratch on large-scale synthetic data.
A symbolic deduction engine: Performs logical reasoning and algebraic computations.
This combination allows AlphaGeometry to learn the art of theorem proving from scratch and to generate human-readable proofs, without any human demonstrations. The result? A system that can solve 25 out of 30 Olympiad-level geometry problems, outperforming previous methods and approaching the performance of International Mathematical Olympiad (IMO) gold medalists.
3. Building Blocks of the Solution
The key insight of AlphaGeometry lies in its approach to learning the creative aspects of theorem proving, particularly auxiliary constructions. In geometry proofs, these constructions often represent the "Aha!" moment that unlocks the solution.
Traditional systems struggled with this because it's an open-ended problem - there are infinitely many possible constructions one could make. AlphaGeometry tackles this by generating a vast number of random scenarios, solving them with brute-force symbolic methods, and then learning patterns from these solutions.
Think of it like a chess engine that has played millions of games against itself. It doesn't need to be taught specific openings or tactics; it learns them organically from the patterns in its self-play data. Similarly, AlphaGeometry learns to recognize situations where certain types of constructions are likely to be useful, not from human-designed rules, but from patterns in its synthetic data.
AlphaGeometry's architecture consists of several key components, sketched as code stubs after this list:
Random Premise Sampler: Generates diverse sets of geometric premises as starting points for theorem generation.
Symbolic Deduction Engine (DD+AR): Combines a deductive database (DD) with algebraic reasoning (AR) to efficiently derive logical conclusions from given premises.
Traceback Algorithm: Identifies the minimal set of premises and deductions necessary for each conclusion, which is crucial for extracting meaningful proofs.
Language Model: A transformer-based model trained on the synthetic data to guide the proof search process.
Proof Search Algorithm: Combines the language model's suggestions with the symbolic engine's deductions in an iterative process to find proofs.
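To make the dataflow concrete, here is a schematic sketch of those five components as Python stubs. All names and signatures are our own invention for illustration; the actual interfaces in the open-source repository [7] look different.

```python
# Schematic stubs for the five components above. Names and signatures are
# illustrative assumptions, not the actual interfaces of the repo [7].

Premises = list[str]                 # e.g. ["AB = AC", "M midpoint BC"]
Fact = str                           # a derived statement, e.g. "AM perp BC"
ProofStep = tuple[list[Fact], Fact]  # (facts used, fact derived)

def sample_premises() -> Premises: ...                      # 1. premise sampler
def deduce_closure(p: Premises) -> set[Fact]: ...           # 2. DD+AR engine
def traceback(goal: Fact, p: Premises) -> list[ProofStep]: ...  # 3. minimal proof
def lm_suggest(p: Premises, goal: Fact) -> list[str]: ...   # 4. language model
def proof_search(p: Premises, goal: Fact) -> list[ProofStep] | None: ...  # 5. search
```

The next sections flesh out each of these pieces in turn.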
4. What Makes AlphaGeometry Work?
Synthetic Data Generation
AlphaGeometry's approach to synthetic data generation is a key innovation. The system generates 100 million unique theorem-proof examples, including 9 million with auxiliary constructions. This process involves:
Sampling random theorem premises (nearly 1 billion initially).
Using a symbolic deduction engine to generate derivations.
Applying a traceback algorithm to extract minimal proofs and premises.
Identifying auxiliary constructions using the concept of dependency difference.
This synthetic data allows AlphaGeometry to learn complex patterns and strategies without relying on human-curated problems or demonstrations.
The process starts with a random premise sampler that constructs geometric scenarios using a set of predefined actions (e.g., creating points, lines, circles). These premises are then fed into the symbolic deduction engine (DD+AR), which applies geometric rules and algebraic reasoning to derive all possible conclusions.
The traceback algorithm then works backward from each conclusion to identify the minimal set of premises and deductions required. This process generates millions of theorem-proof pairs, each consisting of a set of premises, a conclusion, and a step-by-step proof.
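To make the mechanism concrete, here is a toy Python sketch of the deduce-then-traceback loop. The string "facts" and the two hard-coded rules are stand-ins for DD+AR's real geometric rule set; only the mechanism (forward-chain to a fixpoint while recording parents, then walk the parent pointers back) is meant to be faithful.

```python
# Toy version of the synthetic-data loop: deduce everything, then trace back.
# The rules and string facts are illustrative stand-ins for DD+AR.

# Each rule maps a set of required facts to one derived fact.
RULES = {
    frozenset({"AB = AC"}): "angle ABC = angle ACB",   # isosceles base angles
    frozenset({"angle ABC = angle ACB", "M midpoint BC"}): "AM perp BC",
}

def deduce_closure(premises):
    """Forward-chain rules to a fixpoint, recording each fact's parents."""
    parents = {p: None for p in premises}   # fact -> facts it was derived from
    changed = True
    while changed:
        changed = False
        for required, derived in RULES.items():
            if required.issubset(parents) and derived not in parents:
                parents[derived] = required
                changed = True
    return parents

def traceback(fact, parents):
    """Walk parent pointers back to the minimal premises and proof steps."""
    if parents[fact] is None:               # a given premise, nothing to prove
        return {fact}, []
    premises, steps = set(), []
    for parent in parents[fact]:
        sub_premises, sub_steps = traceback(parent, parents)
        premises |= sub_premises
        steps += sub_steps
    return premises, steps + [(sorted(parents[fact]), fact)]

parents = deduce_closure({"AB = AC", "M midpoint BC", "D on AB"})
minimal, proof = traceback("AM perp BC", parents)
print(minimal)  # {'AB = AC', 'M midpoint BC'}: the irrelevant "D on AB" is dropped
for used, derived in proof:
    print(used, "=>", derived)
```

In the real system, the deductive closure of one scenario contains a huge number of facts, and each of them can serve as the conclusion of its own synthetic theorem, which is how one random scenario yields many theorem-proof pairs.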
Language Model Training
A transformer-based language model is trained on these synthetic theorem-proof pairs. The model learns to predict the next step in a proof given the premises and previous steps. Crucially, it's trained to focus on generating auxiliary constructions - the creative leaps often needed in complex proofs.
You can read our dedicated series to learn more about Transformers and LLMs.
The transformer-based language model is trained on serialized versions of the synthetic data, effectively learning to generate proofs conditioned on theorem premises and conclusions. The model architecture includes:
12 layers
Embedding dimension of 1024
8 attention heads (see our previous post on what attention is)
An inter-attention dense layer of dimension 4,096 with ReLU activation
151 million parameters (excluding embedding layers)
The training process involves pretraining on all 100 million synthetic proofs, followed by fine-tuning on the subset of proofs requiring auxiliary constructions (9 million proofs).
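As a rough illustration of what a training example might look like, the sketch below serializes one theorem-proof pair into a single sequence and collects the reported hyperparameters in one place. The delimiters and field names are our assumptions; AlphaGeometry's actual tokenization differs.

```python
# Hedged sketch: turning a synthetic theorem-proof pair into one training
# sequence. The "?" and "{proof}" delimiters are invented for illustration.

MODEL_CONFIG = dict(   # the reported hyperparameters, gathered in one place
    num_layers=12, embed_dim=1024, num_heads=8, ffn_dim=4096,  # ReLU FFN
)   # ~151M parameters, excluding embeddings

def serialize(premises, conclusion, proof_steps):
    context = "; ".join(premises) + " ? " + conclusion  # what the model is given
    target = " ; ".join(proof_steps)                    # what it learns to emit
    return context + " {proof} " + target

print(serialize(
    ["AB = AC", "M midpoint BC"],
    "AM perp BC",
    ["angle ABC = angle ACB [isosceles]", "AM perp BC [congruent triangles]"],
))
```

Fine-tuning on the auxiliary-construction subset concentrates the model's capacity on exactly the creative step where symbolic engines struggle.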
Proof Search
AlphaGeometry's proof search is an iterative process alternating between the language model and the symbolic engine:
The language model suggests potential next steps, focusing on auxiliary constructions.
The symbolic engine (DD+AR) applies these suggestions and derives all possible conclusions.
This process continues until either the desired conclusion is reached or a maximum number of iterations is hit. A beam search of size 512 explores multiple promising paths simultaneously; a minimal sketch of this loop follows.
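In the sketch below, the language model and symbolic engine are passed in as plain callables (`lm_propose`, `dd_ar_closure`, `extract_proof`) because their real implementations are the components described earlier; these names and signatures are assumptions, not the repository's actual API.

```python
# Minimal sketch of AlphaGeometry's search loop. The three callables stand
# in for the language model, the DD+AR engine, and the traceback step.

import heapq

def proof_search(premises, goal, lm_propose, dd_ar_closure, extract_proof,
                 beam_size=512, max_depth=16):
    beam = [(0.0, list(premises))]                 # (cumulative log-prob, state)
    for _ in range(max_depth):
        candidates = []
        for score, state in beam:
            closure = dd_ar_closure(state)         # engine deduces all it can
            if goal in closure:
                return extract_proof(goal, state)  # traceback: minimal proof
            for construction, logp in lm_propose(state, goal):
                candidates.append((score + logp, state + [construction]))
        # keep only the `beam_size` highest-scoring states for the next round
        beam = heapq.nlargest(beam_size, candidates, key=lambda c: c[0])
        if not beam:
            break
    return None                                    # no proof within the budget
```

The defaults mirror the numbers reported in the paper: a beam of 512 and a maximum search depth of 16.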
5. Design Choices
Several crucial design choices contribute to AlphaGeometry's success. Each of these choices addresses a specific challenge in automated theorem proving, contributing to a system that can tackle Olympiad-level problems.
Synthetic Data Generation
The system generates a diverse set of theorems and proofs, covering a wide range of complexity levels. This approach addresses the data scarcity issue and allows the model to learn patterns that might be rare or overlooked in human-designed problems.
By generating theorems and proofs from scratch, AlphaGeometry avoids biases present in human-curated datasets and explores a much broader space of geometric reasoning.
Neuro-Symbolic Synergy
The combination of a neural language model for creative suggestions and a symbolic engine for rigorous deductions balances exploratory power with logical soundness:
Language Model: Provides flexibility and learns complex patterns
Symbolic Engine: Ensures logical consistency and efficient deduction
Agentic Iterative Proof Search
The back-and-forth between the language model's suggestions and the symbolic engine's verifications allows for a flexible, adaptive proof search. This interaction is a form of agentic workflow that iterates until a satisfactory solution is found.
Dependency Difference
This concept enables the system to identify the "gap" between given premises and the conclusion, facilitating the learning of necessary auxiliary constructions.
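A toy rendering of the idea: take the premises the traceback says the proof actually depends on, and keep those that mention an object absent from the conclusion. Whatever survives had to be constructed, and becomes the auxiliary construction the model is trained to output. The helper names and string encoding below are illustrative assumptions.

```python
# Toy "dependency difference": premises the proof depends on whose objects
# never appear in the conclusion must be auxiliary constructions.

def dependency_difference(proof_premises, conclusion_objects, objects_of):
    """Premises that mention at least one object outside the conclusion."""
    return {p for p in proof_premises
            if not objects_of[p] <= conclusion_objects}

# Conclusion "angle ABC = angle ACB" mentions only A, B, C, but the sampled
# proof went through the midpoint M (congruent triangles ABM and ACM):
aux = dependency_difference(
    proof_premises={"AB = AC", "M midpoint BC"},
    conclusion_objects={"A", "B", "C"},
    objects_of={"AB = AC": {"A", "B", "C"}, "M midpoint BC": {"M", "B", "C"}},
)
print(aux)  # {'M midpoint BC'}: "construct M, the midpoint of BC" is auxiliary
```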
Beam Search for Proof Exploration
The beam search algorithm allows AlphaGeometry to explore multiple promising paths simultaneously, addressing the combinatorial explosion of possible proof steps in complex problems.
Improved Interpretability
The traceback algorithm ensures that only the essential steps are included in proofs, improving both efficiency and interpretability.
6. How good is AlphaGeometry?
Quantitative
AlphaGeometry's performance is impressive. On the IMO-AG-30 benchmark (30 Olympiad-level geometry problems), it significantly outperforms existing methods:
AlphaGeometry: 25/30 problems solved.
Previous state-of-the-art (Wu's method): 10/30 problems solved.
Strongest baseline (DD + AR + human-designed heuristics): 18/30 problems solved.
GPT-4: 0/30 problems solved.
Notably, AlphaGeometry solved both geometry problems set in each of the 2000 and 2015 IMOs, a threshold widely considered difficult for the average human contestant.
Qualitative
Beyond raw performance, AlphaGeometry demonstrates several important capabilities:
Produces human-readable proofs.
Solves all geometry problems in IMO 2000 and 2015 under human expert evaluation.
Discovers a generalized version of a translated IMO theorem from 2004.
These results suggest that AlphaGeometry is not just memorizing solutions but has developed a genuine understanding of geometric reasoning principles.
7. Key Insights From Ablation Experiments
Data efficiency: Using only 20% of the training data, AlphaGeometry still solves 21 problems, outperforming all baselines.
Search efficiency: With a beam size of 8 (64x reduction), AlphaGeometry solves 21 problems.
Depth efficiency: Reducing search depth from 16 to 2 still results in 21 solved problems.
These results demonstrate the robustness of AlphaGeometry's approach and the effectiveness of its synthetic data.
8. Comparison with Human Proofs
While AlphaGeometry's performance is impressive, analysis of its proofs reveals some limitations:
Verbosity: AlphaGeometry proofs often contain many low-level steps, making them less readable than concise human proofs.
Limited high-level reasoning: The system lacks access to advanced theorems and concepts that human experts use to simplify proofs.
Coordinate system limitations: AlphaGeometry cannot leverage clever coordinate choices that often simplify human solutions.
These observations suggest potential areas for improvement, such as incorporating higher-level geometric concepts and theorems into the synthetic data generation and symbolic engine.
9. What’s Next? + Outro
While AlphaGeometry represents a significant advance, it's not without limitations:
The current implementation is limited to Euclidean plane geometry, excluding topics like geometric inequalities or combinatorial geometry.
The proofs generated, while correct, can sometimes be more verbose and less elegant than human-created proofs.
There are still 5 problems in the IMO-AG-30 benchmark that AlphaGeometry cannot solve.
AlphaGeometry represents a significant advancement in automated geometric reasoning, approaching human expert-level performance on Olympiad-level problems. Its success in learning complex geometric strategies from purely synthetic data opens new avenues for AI research in mathematics and beyond.
By combining the pattern recognition capabilities of neural networks with the logical rigor of symbolic systems, AlphaGeometry demonstrates a promising approach to developing AI systems capable of sophisticated mathematical reasoning. As research in this area progresses, we may see AI systems that not only solve existing problems but also contribute to the discovery of new mathematical knowledge.
The open-sourcing [7] of AlphaGeometry's code and model by DeepMind invites the broader research community to build upon this work, potentially leading to further breakthroughs in AI reasoning across various domains of mathematics and science.
See you in the next edition of the Code Compass.
Read more on the Transformers series, LLMs series, or the ML behind Netflix’s content creation.
References
[1] Attention Is All You Need: https://arxiv.org/abs/1706.03762
[2] GPT-4 Technical Report: https://arxiv.org/abs/2303.08774
[3] Gemini: A Family of Highly Capable Multimodal Models: https://arxiv.org/abs/2312.11805
[4] Gemini 1.5: https://arxiv.org/abs/2403.05530
[5] Claude 3: https://www.anthropic.com/news/claude-3-family
[6] LLaMA: https://arxiv.org/abs/2302.13971
[7] AlphaGeometry open-source repo: https://github.com/google-deepmind/alphageometry
[8] AlphaGeometry: Solving Olympiad geometry without human demonstrations: https://www.nature.com/articles/s41586-023-06747-5
[9] AlphaFold: https://www.nature.com/articles/s41586-021-03819-2
[10] AlphaGo: https://deepmind.google/technologies/alphago/
[11] AlphaZero: https://deepmind.google/discover/blog/alphazero-shedding-new-light-on-chess-shogi-and-go/
[12] Stanford Symbolic Systems: https://symsys.stanford.edu/about/whats-name