@artesia please summarise the paper
@merefield Here is a summary of the paper “Absolute Zero: Reinforced Self-play Reasoning with Zero Data” (arXiv:2505.03335):
The paper proposes a new approach to reinforcement learning for large language models, called “Absolute Zero.” Current Reinforcement Learning with Verifiable Rewards (RLVR) techniques have a limitation: although they avoid manually labeled reasoning traces, they still depend on large datasets of human-curated questions and answers. This reliance limits scalability, especially as high-quality human-supervised data becomes scarce or offers diminishing learning value for future highly capable systems.
Absolute Zero aims to break this dependency on external data by introducing a paradigm where a single model:
- Generates its own reasoning tasks.
- Solves these self-generated tasks using reinforced self-play (iteratively learning by completing internal challenges).
- Employs a code executor to validate the tasks it proposes and to check its answers, providing a verifiable reward signal (a minimal sketch of this loop follows below).
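To make the loop concrete, here is a minimal, runnable sketch of the propose/solve/verify cycle. It is illustrative only: the `propose_task` and `solve_task` stubs stand in for the single LLM playing both roles, and the real system sandboxes code execution rather than calling `exec` directly.

```python
# Illustrative sketch of an Absolute Zero-style self-play step (not the paper's code).
# One model plays both roles: it proposes a task, then tries to solve it; the code
# executor supplies ground truth and therefore a verifiable reward.

def execute(code: str, x):
    """Stand-in for the code executor: run the proposed program on an input."""
    env = {}
    exec(code, env)  # assumes trusted code; the real system sandboxes execution
    return env["f"](x)

def propose_task():
    """Placeholder for the LLM proposing a (program, input) reasoning task."""
    return "def f(x):\n    return x * x + 1", 3

def solve_task(code: str, x):
    """Placeholder for the LLM predicting the program's output without running it."""
    return 10  # pretend this is the model's answer

code, x = propose_task()
target = execute(code, x)                      # verifiable ground truth, no human label
prediction = solve_task(code, x)
reward = 1.0 if prediction == target else 0.0  # reward used to reinforce the solver
print(f"target={target}, prediction={prediction}, reward={reward}")
```

In the actual method, the proposer side is also rewarded (for producing tasks of useful difficulty), and both roles share one set of model weights that reinforcement learning updates over many such iterations.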
The method introduces the Absolute Zero Reasoner (AZR):
- AZR autonomously evolves its training curriculum and reasoning ability.
- It grounds learning using programmatic (code-based) tasks, fully verifying results externally without requiring any labeled training examples.
Key achievements:
- AZR is trained entirely without external data, yet it outperforms models trained on tens of thousands of curated samples, achieving state-of-the-art performance on coding and math reasoning benchmarks.
- The approach generalizes across tasks and is effective across multiple model classes and scales.
In summary: “Absolute Zero” presents a scalable, general way for language models to improve their reasoning and verify it through code execution, without manually curated training tasks, making the self-improvement loop externally verified and free of human-supplied exercises.
Does it use a trained large language model? If so how can it be called absolute zero?
@merefield Yes, Absolute Zero uses a large language model, but “Absolute Zero” in the paper does not mean starting from an untrained network with random weights. Instead, it distinguishes itself by avoiding supervised fine-tuning: the model is not further trained on any manually curated, labeled tasks, questions, or answers.
Key points:
- In their “Absolute Zero” paradigm, once you have an (already pretrained) large language model, no human-created training data is used for the reinforcement learning or self-improvement process.
- The reasoning agent proposes and solves its own “code reasoning” tasks; solutions are verified programmatically, not manually.
- The model can continuously and autonomously challenge and improve itself using an external programmatic evaluator (a code executor), while its underlying capabilities come from ordinary language-model pretraining (see the sketch below for how the programmatic check works).
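To illustrate “verified programmatically”: one of the task styles the paper describes is abduction, where the model proposes a program and a target output and must infer an input that reproduces it. The snippet below is a toy version of that check, not the paper’s implementation; the specific program and values are invented.

```python
# Toy verification of a self-proposed abduction-style task (illustrative only).
# The model proposed the program and the target output; its answer is a candidate
# input. Running the code is the only judge; no human-labeled answer exists.

program = "def f(xs):\n    return sorted(xs)[-1]"  # self-proposed program
target_output = 9                                  # self-proposed target output
model_answer = [4, 9, 1]                           # the model's guessed input

env = {}
exec(program, env)  # the real system executes this in a sandbox
reward = 1.0 if env["f"](model_answer) == target_output else 0.0
print(reward)  # 1.0 here: the executor confirms the answer
```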
So:
- “Absolute Zero” means zero new problem+answer supervision, not zero prior training altogether.
- The pretraining of the underlying large language model, typically large-scale self-supervised training on web text plus some instruction-following data, still takes place before AZR’s process begins.
In summary: the novelty is complete independence from new manually labeled reasoning data during self-improvement, not a language model built without any prior training.
Do all frontier large language models go through a period of supervised learning or just reasoning models?
@merefield Most frontier large language models—whether general-purpose or reasoning-specialized—go through some form of supervised learning, but the extent and timing vary:
1. Pretraining (Unsupervised or Self-supervised)
- Almost all modern large language models (LLMs), like GPT-4, Gemini, Llama, etc., start with unsupervised or self-supervised pretraining, typically using vast quantities of web text and documents.
- The model is simply trained to predict the next token or fill in missing text, learning general language patterns and facts without human-annotated labels or explicit reasoning supervision (a toy illustration of this objective follows below).
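As a purely schematic illustration of that objective (real pretraining uses a neural network over trillions of tokens, not bigram counts), here the “model” is a bigram count table and the loss is the average negative log-likelihood of each next token:

```python
import math
from collections import Counter, defaultdict

# Toy corpus; real pretraining uses vast quantities of web text.
corpus = "the cat sat on the mat the cat ate".split()

# A bigram "model": P(next | current) from raw counts.
# This stands in for a neural network; the self-supervised objective is the same idea.
counts = defaultdict(Counter)
for cur, nxt in zip(corpus, corpus[1:]):
    counts[cur][nxt] += 1

def next_token_prob(cur, nxt):
    total = sum(counts[cur].values())
    return counts[cur][nxt] / total if total else 0.0

# Loss: average negative log-likelihood of each observed next token.
nll = [-math.log(next_token_prob(c, n)) for c, n in zip(corpus, corpus[1:])]
print(f"avg next-token NLL: {sum(nll) / len(nll):.3f}")
```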
2. Supervised Fine-tuning
- After pretraining, many LLMs undergo supervised fine-tuning, where they are trained on smaller, high-quality, human-curated datasets (e.g., question-answer pairs, mathematical problems, code completions, or conversation transcripts).
- This fine-tuning adapts the general capabilities of the pretrained model toward greater accuracy, safety, or the ability to follow specific instructions and behaviors. It is common but not strictly required for every model (see the sketch below for how such examples are typically prepared).
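Here is a rough sketch of how a single supervised fine-tuning example is often prepared. The tokenizer is faked, and the `-100` ignore-label is only a convention used by some training frameworks, so treat the details as assumptions rather than any specific library’s API:

```python
# Illustrative only: preparing one supervised fine-tuning example.
# Real SFT uses the model's tokenizer and backpropagates cross-entropy
# only on the response tokens (the prompt is masked out of the loss).

IGNORE = -100  # common "skip this position in the loss" label in some trainers

def fake_tokenize(text):
    return [hash(w) % 1000 for w in text.split()]  # stand-in for a real tokenizer

def build_sft_example(prompt, response):
    prompt_ids = fake_tokenize(prompt)
    response_ids = fake_tokenize(response)
    input_ids = prompt_ids + response_ids
    # Mask the prompt: the model learns to produce the curated response,
    # not to reproduce the human-written instruction.
    labels = [IGNORE] * len(prompt_ids) + response_ids
    return {"input_ids": input_ids, "labels": labels}

example = build_sft_example(
    "Question: What is 2 + 2? Answer:",
    "2 + 2 = 4",
)
print(example["labels"])
```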
3. Reinforcement Learning from Human Feedback (RLHF)
- Popular with “frontier” conversational models, this stage collects human preference rankings over model outputs, trains a reward model on those rankings, and then optimizes the LLM to produce responses the reward model scores highly (a toy illustration of the preference loss follows below).
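For intuition, below is a toy version of the preference-modeling loss commonly used in the reward-model step of RLHF (a Bradley-Terry-style objective). The scores are made up; in the real pipeline they come from a learned reward model, and the LLM is then optimized against that reward, e.g. with PPO:

```python
import math

# Toy preference loss: the reward model is trained so the human-preferred
# response scores higher than the rejected one. Scores here are invented.

def preference_loss(score_chosen: float, score_rejected: float) -> float:
    """-log sigmoid(r_chosen - r_rejected): small when the preferred answer wins."""
    return -math.log(1.0 / (1.0 + math.exp(-(score_chosen - score_rejected))))

print(preference_loss(2.0, 0.5))  # preferred answer scores higher -> low loss
print(preference_loss(0.5, 2.0))  # preference violated -> high loss
```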
Summary by Model Type
- General LLMs: Nearly all do pretraining and often supervised fine-tuning and/or RLHF for safety and usefulness.
- Reasoning-specialized models: Typically take a general LLM and fine-tune it further, sometimes on supervised reasoning datasets, sometimes with reinforcement learning methods (with curated supervision, or without it, as in Absolute Zero).
Key Takeaways
- Pure unsupervised/self-supervised models exist, but the best-performing “frontier” models almost universally have at least some supervised fine-tuning or RLHF.
- Supervised learning is not exclusive to reasoning models; it’s a general technique to increase effectiveness and safety across a broad array of LLMs.
If you want details about a specific model or company’s process, let me know!