New AI Research Proves o1 CANNOT Reason!

Recent research highlights significant limitations in the reasoning capabilities of AI models, particularly OpenAI’s o1-preview, which experienced a roughly 30% drop in accuracy when faced with slight variations of mathematical problems from the Putnam competition. The study emphasizes the need for more robust evaluation benchmarks to assess AI reasoning and calls for further testing to address these challenges.

https://openreview.net/pdf?id=YXnwlZe0yf

A recent research paper has raised concerns about the reasoning capabilities of AI models, particularly OpenAI’s o1-preview. The study found that when mathematical problems from the Putnam competition were slightly altered, accuracy for the models tested fell by roughly 30%. This decline underscores how much robustness and reliability matter for AI systems, especially in critical fields like finance and business. If models cannot maintain accuracy under slight variations, their practical usefulness becomes questionable.

The paper introduces the Putnam-AXIOM benchmark, which consists of 236 mathematical problems along with detailed solutions. To test robustness, the researchers also created a variation set of 52 problems with functional variations, altering variable names and constants. The results showed that most models, including OpenAI’s o1-preview, experienced a notable drop in accuracy when faced with these variations, indicating that they may not generalize well to new, unseen problems.
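
To make “functional variation” concrete, here is a minimal sketch of the idea. This is my own illustration, not the authors’ pipeline: the problem template, the constant ranges, and the answer function are invented, but they show how a problem whose constants have been turned into parameters can be re-instantiated into structurally identical variants that never appear verbatim online:

```python
import random

# Hypothetical template for illustration only; the real Putnam-AXIOM problems
# and their variation logic are far more involved. Once a problem's constants
# are parameters, each sampled instance is a "functional variation": same
# underlying solution method, different surface form.
TEMPLATE = "Find the sum of the roots of x^2 - {b}x + {c} = 0."

def ground_truth(b: int, c: int) -> int:
    # By Vieta's formulas, the sum of the roots of x^2 - bx + c = 0 is b.
    return b

def make_variation(rng: random.Random) -> tuple[str, int]:
    """Sample fresh constants and return (problem text, expected answer)."""
    b = rng.randint(2, 20)
    c = rng.randint(1, 20)
    return TEMPLATE.format(b=b, c=c), ground_truth(b, c)

if __name__ == "__main__":
    rng = random.Random(0)
    for _ in range(3):
        problem, answer = make_variation(rng)
        print(problem, "->", answer)
```

Because the expected answer is computed from the same parameters, every variant comes with a ground-truth solution, which is what lets a benchmark like this keep generating unseen test items.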

OpenAI’s o1-preview achieved 41.95% accuracy on the original Putnam problems but dropped significantly when tested on the variations. The study suggests that this decline may be due to overfitting, where models perform well on data resembling their training material but struggle with new inputs. Data contamination could also be a factor: if test problems or close paraphrases of them appeared in the training data, performance on the familiar originals would be inflated.

The research also points out that while o1-preview performed better than the other models, it still exhibited logical inconsistencies and a lack of mathematical rigor in its reasoning. The model often made claims without sufficient justification, which raises concerns about its reliability on complex reasoning tasks. The findings suggest that despite recent advances, there are still significant gaps in these models’ ability to reason effectively.

Overall, the paper emphasizes the need for more robust evaluation benchmarks that can accurately assess AI models’ reasoning capabilities. As the AI industry continues to evolve, understanding the limitations of these models is crucial for their safe and effective deployment in real-world applications. The study calls for further testing with newer models to better understand their reasoning abilities and to address the challenges identified in the research.

https://openreview.net/pdf?id=YXnwlZe0yf

@artesia please summarise the article at that link

The article “Putnam-AXIOM: A Functional and Static Benchmark for Measuring Higher Level Mathematical Reasoning” presents a new benchmark designed to assess the reasoning capabilities of large language models (LLMs) through 236 challenging mathematical problems from the prestigious William Lowell Putnam Mathematical Competition. The authors, comprising researchers from Stanford University, emphasize that traditional benchmarks are becoming saturated, making it difficult to measure LLMs’ reasoning skills effectively.

To address this issue, the authors introduced the Putnam-AXIOM Original dataset, which includes detailed solutions to each problem. To combat data contamination—where benchmark questions may have been inadvertently included in training data—the researchers also developed the Putnam-AXIOM Variation dataset, consisting of 52 problems with functional variations. These variations alter elements such as variable names and numeric constants, generating unique problems not found online.

Initial evaluations revealed that even the top-performing model, OpenAI’s o1-preview, achieved only 41.95% accuracy on the original benchmark, with around a 30% drop in accuracy on the variations. This drop suggests that the models rely in part on memorized material rather than genuine reasoning.
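
(For scale: the write-ups above don’t specify whether the roughly 30% figure is a relative or an absolute drop. If it is relative, o1-preview would land at about 0.4195 × 0.70 ≈ 29.4% on the variation set; an absolute 30-point drop would instead leave it near 12%.)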

The authors outline the methodology employed in building both datasets, ensuring rigorous evaluation standards by requiring final answers in a specific boxed format. They argue that the low performance across multiple models underscores the benchmark’s utility for assessing advanced mathematical reasoning and exposing weaknesses in current AI models.
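
The paper’s actual evaluation code isn’t reproduced here, but requiring a \boxed{...} final answer typically means grading reduces to extracting that box and comparing it with the reference. A minimal sketch, with the regex and the naive string comparison as my own simplifications:

```python
import re

def extract_boxed(response: str) -> str | None:
    r"""Return the contents of the last \boxed{...} in a model response.

    Handles one level of nested braces, which covers answers like
    \boxed{\frac{1}{2}}; deeper nesting would need a real parser.
    """
    matches = re.findall(r"\\boxed\{((?:[^{}]|\{[^{}]*\})*)\}", response)
    return matches[-1].strip() if matches else None

def is_correct(response: str, reference: str) -> bool:
    """Naive grading: exact string match after removing whitespace."""
    answer = extract_boxed(response)
    if answer is None:
        return False
    return re.sub(r"\s+", "", answer) == re.sub(r"\s+", "", reference)

# Example with a hypothetical model response
print(is_correct(r"... therefore the answer is \boxed{\frac{1}{2}}.", r"\frac{1}{2}"))  # True
```

A real harness would also need some symbolic equivalence checking (e.g. 0.5 vs \frac{1}{2}), but the boxed-answer requirement is what makes automatic scoring of free-form solutions feasible at all.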

The paper concludes that while LLMs have advanced significantly, they still struggle with complex reasoning tasks, underscoring the need for new evaluation methods to maintain integrity in performance assessments. The Putnam-AXIOM benchmark aims to stimulate further research and development in enhancing mathematical reasoning in AI systems.


List the models in order of performance and their score

Here are the models listed in order of performance on the Putnam-AXIOM Original dataset along with their respective scores:

  1. OpenAI o1-preview: 41.95%
  2. GPT-4o: 17.80%
  3. Claude-3.5 Sonnet: 15.96%
  4. NuminaMath-7B-Instruct: 11.86%
  5. Qwen2-Math-7B-Instruct: 11.86%
  6. GPT-4: 9.32%
  7. DeepSeek-Math-7B-Instruct: 5.08%
  8. Llama-3-8B-Instruct: 4.23%
  9. Mistral-7B-Instruct-v0.3: 3.38%
  10. Others (various smaller models): generally below 10%, with exact scores in the paper’s results table.

The exact percentages for the less prominent models may differ slightly; the paper’s table is the authoritative source. That table highlights significant performance disparities, with most models scoring under 10% and only o1-preview and a few others achieving higher rates.