Chinese Researchers Reveal How OpenAI o3 Works!

Chinese researchers from Fudan University and Shanghai AI Laboratory have published insights into OpenAI’s advanced o1 and o3 models, highlighting how heavy computation at inference time gives them PhD-level reasoning and problem-solving ability. The research emphasizes the importance of components such as policy initialization, reward design, and search techniques, with the goal of replicating these models’ thinking abilities and promoting open-source access to the findings.

Chinese researchers from Fudan University and Shanghai AI Laboratory have unveiled insights into OpenAI’s o1 and o3 models, which are considered cutting-edge in artificial intelligence. These models are classified as advanced reasoning systems capable of performing at a PhD level in mathematics and scientific research. The key to their power lies in “test-time compute”: the extensive computation spent during inference, which allows the models to think through a prompt rather than produce an immediate response. This research aims to replicate the thinking capabilities of these models and open-source the findings.
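To make “test-time compute” concrete, one of its simplest forms is repeated sampling plus selection: generate several candidate answers and keep the one a scorer rates highest. The sketch below is purely illustrative and is not the method o1 actually uses; `generate` and `score` are hypothetical placeholders standing in for a language-model call and a verifier or reward model.

```python
# Illustrative sketch of spending more compute at inference time (best-of-n sampling).
# `generate` and `score` are hypothetical placeholders, not real API calls.
import random

def generate(prompt: str) -> str:
    """Placeholder for a language-model call that returns one candidate answer."""
    return f"candidate-{random.randint(0, 9999)} for: {prompt}"

def score(prompt: str, answer: str) -> float:
    """Placeholder for a verifier or reward model that rates an answer."""
    return random.random()

def best_of_n(prompt: str, n: int = 8) -> str:
    # Larger n means more inference-time compute and a better expected answer.
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda a: score(prompt, a))

print(best_of_n("Prove that the sum of two even numbers is even.", n=16))
```

The point of the sketch is only that answer quality can be traded for inference compute; the actual o1 approach described in the paper relies on richer search and revision strategies.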

The o1 model represents a significant advancement in AI, demonstrating human-like reasoning abilities such as problem analysis, task decomposition, and self-correction. It is part of OpenAI’s roadmap towards artificial general intelligence (AGI), with the current stage being human-level problem-solving. The researchers emphasize that o1’s performance improves as more computation is spent during inference, suggesting a shift in AI development from scaling training data and compute alone to also scaling inference-time computation.

The researchers identify four critical components that underpin the reasoning ability of these models: policy initialization, reward design, search, and learning. Policy initialization prepares the model through pre-training and instruction fine-tuning so that human-like reasoning behaviors are available to build on. Reward design determines how the model learns from its outputs, with methods such as outcome rewards and process rewards guiding the learning process. The search component allows the model to explore multiple potential solutions and refine its responses based on self-evaluation and feedback.
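The distinction between outcome rewards and process rewards can be shown with a toy sketch: an outcome reward assigns a single score based only on the final answer, while a process reward scores each intermediate reasoning step. The functions below are hypothetical stand-ins for learned reward models, not anything specified in the paper.

```python
# Toy contrast between outcome rewards and process rewards.
# Both functions are hypothetical placeholders for learned reward models or verifiers.
from typing import List

def outcome_reward(final_answer: str, reference: str) -> float:
    """Outcome reward: one scalar for the whole solution, judged only by the result."""
    return 1.0 if final_answer.strip() == reference.strip() else 0.0

def process_reward(steps: List[str]) -> List[float]:
    """Process reward: a score for each intermediate reasoning step.
    A toy keyword check stands in for a learned process reward model."""
    return [0.0 if "error" in step.lower() else 1.0 for step in steps]

steps = [
    "Let x = 2k, so x is even.",
    "Let y = 2m, so y is even.",
    "x + y = 2(k + m), which is even.",
]
print(outcome_reward("x + y is even", "x + y is even"))  # 1.0
print(process_reward(steps))                              # [1.0, 1.0, 1.0]
```

Process rewards give denser feedback (a signal per step), which helps credit assignment in long reasoning chains, while outcome rewards are cheaper to obtain but only say whether the final result was right.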

The concept of search is crucial for both training and inference, with the o1 model reportedly employing tree-search techniques during training and sequential revisions during inference. This enables the model to generate multiple outputs and iteratively improve its responses. The researchers also discuss the importance of reinforcement learning, which allows the model to learn from interactions with its environment and can lead to superhuman performance on certain tasks. This contrasts with learning purely from human-annotated data, which is limited in scale and costly to collect.
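A minimal sketch of sequential revision at inference time might look like the following: draft an answer, self-evaluate it, and revise until the critique passes or a revision budget is exhausted. Here `draft`, `critique`, and `revise` are hypothetical placeholders for model calls, and the stopping rule is a toy heuristic rather than the paper’s actual procedure.

```python
# Illustrative sketch of sequential revision at inference time.
# All three helpers are hypothetical placeholders for calls to the model itself.

def draft(prompt: str) -> str:
    """Placeholder for the model's first attempt at the problem."""
    return f"first attempt at: {prompt}"

def critique(prompt: str, answer: str) -> tuple:
    """Placeholder self-evaluation: returns (good_enough, feedback).
    A real system would query the model or a reward model here."""
    return ("revised" in answer, "check the edge cases")

def revise(prompt: str, answer: str, feedback: str) -> str:
    """Placeholder revision step conditioned on the feedback."""
    return f"revised ({feedback}): {answer}"

def solve(prompt: str, max_revisions: int = 3) -> str:
    answer = draft(prompt)
    for _ in range(max_revisions):  # each pass spends additional inference compute
        good_enough, feedback = critique(prompt, answer)
        if good_enough:
            break
        answer = revise(prompt, answer, feedback)
    return answer

print(solve("Integrate x * e^x dx."))
```

Tree search during training plays a complementary role: instead of revising one answer in sequence, the model branches over many candidate reasoning steps and keeps the promising paths, producing trajectories that can then be used as reinforcement-learning training data.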

Looking ahead, the researchers highlight future directions for enhancing the o1 model, including adapting it to general domains where answers are not clear-cut, integrating multiple modalities, and developing a world model that can apply its intelligence to real-world scenarios. The paper concludes by noting the availability of open-source implementations aimed at reproducing o1, indicating that the academic community is making strides toward democratizing access to these advanced AI capabilities.