The video compares the coding capabilities of two new AI models, DeepSeek-R1 and ChatGPT’s o3-mini, finding that while DeepSeek performs nearly as well as ChatGPT’s o1 model, all three struggle to generate straightforward but non-trivial code, particularly in the Zig programming language. In a more common language like Go, o1 consistently produces correct answers despite poor error handling, while DeepSeek writes better-structured code marred by critical mistakes; the presenter concludes that o1 remains superior for complex coding tasks.
In a recent video, the presenter evaluates the performance of two new AI coding models: DeepSeek-R1, from a Chinese company, and ChatGPT’s o3-mini. Drawing on prior experience benchmarking AI code generation, the presenter ran experiments comparing the models’ output. The initial findings suggest that while DeepSeek is nearly as competent as ChatGPT’s o1, all three models struggle with straightforward but non-trivial code, particularly in the Zig programming language, where none could produce a compilable program.
The presenter highlights that the differences in performance become clearer in more common languages like Go. Although all three models eventually produced compilable Go code after several attempts, only o1 arrived at the correct answer, albeit with questionable code quality: its error handling was poor, often discarding useful error information rather than propagating it (a pattern sketched below). DeepSeek’s code was better structured but still contained critical mistakes, and o3-mini proved slower and more verbose, with accuracy similar to DeepSeek’s.
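The video does not show the generated code, but the error-handling complaint maps onto a well-known Go anti-pattern. The sketch below is a hypothetical illustration, not the presenter’s actual example: the first function swallows the underlying error, while the second wraps it with `%w` so callers keep the full context.

```go
package main

import (
	"fmt"
	"os"
)

// The pattern the presenter criticizes: the underlying error is
// replaced with a generic message, so the caller can no longer
// tell what actually failed.
func readConfigLossy(path string) ([]byte, error) {
	data, err := os.ReadFile(path)
	if err != nil {
		return nil, fmt.Errorf("could not read config") // original error discarded
	}
	return data, nil
}

// Idiomatic Go: wrap the error with %w so the original cause is
// preserved and still matchable via errors.Is / errors.As.
func readConfig(path string) ([]byte, error) {
	data, err := os.ReadFile(path)
	if err != nil {
		return nil, fmt.Errorf("reading config %q: %w", path, err)
	}
	return data, nil
}

func main() {
	if _, err := readConfig("missing.toml"); err != nil {
		fmt.Println(err) // reading config "missing.toml": open missing.toml: ...
	}
}
```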
The video emphasizes the challenges faced by AI models in coding tasks, particularly when it comes to reasoning and proactive problem-solving. The presenter uses a platform called Code Crafters to provide coding challenges, which are designed to be more representative of real-world programming tasks than typical interview questions. The first challenge involved building a simplified version of SQLite, where all three AI models performed poorly, landing in the bottom 2% of users who attempted the challenge.
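For context on what the SQLite challenge demands, its earliest stages typically involve parsing the database file format by hand. As a minimal sketch (assuming a stage that reports the page size, and a hypothetical `sample.db` file), the fixed 100-byte SQLite header begins with the magic string `SQLite format 3\0`, followed by the page size as a big-endian 16-bit integer at offset 16:

```go
package main

import (
	"encoding/binary"
	"fmt"
	"io"
	"log"
	"os"
)

func main() {
	// sample.db is a hypothetical database file path.
	f, err := os.Open("sample.db")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	// Every SQLite database begins with a fixed 100-byte header.
	header := make([]byte, 100)
	if _, err := io.ReadFull(f, header); err != nil {
		log.Fatal(err)
	}

	// Bytes 0-15 hold the magic string "SQLite format 3\x00".
	if string(header[:16]) != "SQLite format 3\x00" {
		log.Fatal("not a SQLite database")
	}

	// Bytes 16-17 hold the page size as a big-endian uint16.
	pageSize := binary.BigEndian.Uint16(header[16:18])
	fmt.Println("page size:", pageSize)
}
```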
As the presenter continued testing, they switched to a DNS server challenge, which was expected to be easier for the AIs. In this case, o1 outperformed DeepSeek, successfully completing the first four stages of the challenge while DeepSeek struggled to progress beyond the second stage. The presenter noted that while DeepSeek’s code was often better written, o1 consistently produced correct answers, demonstrating its superiority in this context.
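The early stages of a DNS server challenge revolve around producing well-formed response packets, which start with the fixed 12-byte header defined in RFC 1035. The sketch below is a minimal illustration of encoding that header in Go; the field values are made up for the example, not taken from the challenge:

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// dnsHeader mirrors the 12-byte header defined in RFC 1035 §4.1.1.
type dnsHeader struct {
	ID      uint16 // query identifier, echoed back in the response
	Flags   uint16 // QR, OPCODE, AA, TC, RD, RA, Z, RCODE packed into 16 bits
	QDCount uint16 // number of questions
	ANCount uint16 // number of answer records
	NSCount uint16 // number of authority records
	ARCount uint16 // number of additional records
}

// encode serializes the header in network (big-endian) byte order.
func (h dnsHeader) encode() []byte {
	buf := make([]byte, 12)
	binary.BigEndian.PutUint16(buf[0:2], h.ID)
	binary.BigEndian.PutUint16(buf[2:4], h.Flags)
	binary.BigEndian.PutUint16(buf[4:6], h.QDCount)
	binary.BigEndian.PutUint16(buf[6:8], h.ANCount)
	binary.BigEndian.PutUint16(buf[8:10], h.NSCount)
	binary.BigEndian.PutUint16(buf[10:12], h.ARCount)
	return buf
}

func main() {
	// A minimal response header: echo a (made-up) query ID and set
	// the QR bit (0x8000) to mark the packet as a response.
	h := dnsHeader{ID: 1234, Flags: 0x8000, QDCount: 1}
	fmt.Printf("% x\n", h.encode())
}
```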
In conclusion, the presenter acknowledges that while DeepSeek is an open model that is faster and cheaper to run, o1 remains the better option for generating correct code in more complex scenarios. The video closes with a promise of a follow-up discussion on what these findings mean for the future of software development, encouraging viewers to subscribe for more insights. The presenter expresses cautious optimism about AI’s role in coding while stressing the continued importance of human oversight in the development process.