The video discusses the introduction of two new models, the OpenThinker 32 billion and 7 billion parameter models, to the Ollama library; these are fine-tuned versions of Qwen’s models, trained on a dataset generated with the DeepSeek R1 model. The host evaluates the models’ performance through a series of tests, noting that while the 32 billion parameter model demonstrated strong reasoning and relevance, it often had long response times, and the 7 billion parameter model performed significantly worse.
In the video, the host discusses the addition of two new models, the OpenThinker 32 billion and 7 billion parameter models, to the Ollama library. These models are fine-tuned versions of Qwen 2.5, trained on a dataset called OpenThoughts. The dataset was generated using the DeepSeek R1 model, which was used to formulate questions and answers drawn from various benchmarks. A tool called Curator was then used to clean up the responses: checking accuracy for science-related questions, verifying answers for math and puzzle questions, and executing code for coding queries.
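For readers who want to try the models themselves, here is a minimal sketch using the Ollama Python client. The exact model tags (openthinker:32b and openthinker:7b) are assumptions based on the names mentioned in the video; check the Ollama library for the published tags.

```python
# Minimal sketch using the Ollama Python client (pip install ollama).
# The tags "openthinker:32b" / "openthinker:7b" are assumed names.
import ollama

MODEL = "openthinker:32b"  # or "openthinker:7b" for the smaller model

# Download the model to the local Ollama server if it is not already present.
ollama.pull(MODEL)

# Ask a question; reasoning-tuned models typically emit their "thinking"
# text before the final answer.
response = ollama.chat(
    model=MODEL,
    messages=[{"role": "user", "content": "Explain why the sky is blue in two sentences."}],
)
print(response["message"]["content"])
```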
The host then explains the grading methodology used to evaluate the models. They set up a test harness and a fixed set of questions, scoring each answer on criteria such as the quality of reasoning and the time taken to respond. The grading considers both the final output and the thinking stage, with penalties applied for long response times. The machine used for testing is equipped with powerful hardware, so the models can be run efficiently.
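The video does not spell out the exact rubric, but a helper of roughly this shape illustrates the idea of combining per-criterion points with a time penalty. The three criteria, the seven-point scale per criterion (consistent with the 21-point totals mentioned later), and the penalty thresholds below are illustrative assumptions, not the host’s actual formula.

```python
# Illustrative sketch of a grading helper with a time penalty.
# The criteria, point scale, and penalty thresholds are assumptions;
# the video does not publish the host's exact formula.
from dataclasses import dataclass

@dataclass
class Grade:
    reasoning: int   # 0-7: quality of the thinking stage
    relevance: int   # 0-7: how well the answer addresses the question
    clarity: int     # 0-7: readability of the final output

def score(grade: Grade, seconds_to_respond: float) -> int:
    """Sum the per-criterion points, then subtract a penalty for slow answers."""
    total = grade.reasoning + grade.relevance + grade.clarity
    # Hypothetical penalty: lose one point per full 30 seconds beyond the first minute.
    penalty = max(0, int((seconds_to_respond - 60) // 30))
    return max(0, total - penalty)

# Example: a strong answer that took nearly two minutes to produce.
print(score(Grade(reasoning=7, relevance=7, clarity=6), seconds_to_respond=115))  # -> 19
```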
The first question posed to the OpenThinker 32 billion model was to write a three-sentence story about a cat chasing a mouse without using the letter ‘e.’ The model took nearly two minutes to think but produced a relevant response that adhered to the constraints. The host noted that while the creativity was somewhat limited, the clarity and relevance of the answer were commendable. The model scored 17 out of 21 points after a deduction for the long thinking time.
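Part of what makes this a good test question is that the constraint is trivial to verify mechanically. The host grades by hand, but a small checker like the hypothetical one below could confirm whether a story actually satisfies the rules (the sample story is only an illustration).

```python
# Hypothetical checker for the "three sentences, no letter 'e'" constraint.
# Not part of the video's harness; shown only to illustrate mechanical verification.
import re

def check_story(story: str) -> dict:
    sentences = [s for s in re.split(r"[.!?]+", story) if s.strip()]
    return {
        "sentence_count_ok": len(sentences) == 3,
        "no_letter_e": "e" not in story.lower(),
    }

print(check_story("A cat stalks a tiny rat. It darts past a box and down a hall. Quick paws nab it at last."))
# -> {'sentence_count_ok': True, 'no_letter_e': True}
```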
Subsequent questions tested the models on various tasks, including creating analogies, explaining complex concepts to children, and generating creative content. The host observed that while the models performed well in terms of reasoning and relevance, they often took longer than expected to generate responses. For example, one question about designing a new animal produced a thoughtful answer that was nevertheless penalized for its length. Overall, the models demonstrated strong capabilities, but response times were a recurring issue.
In conclusion, the host compared the performance of the 32 billion parameter model to the 7 billion parameter model, noting that the latter performed significantly worse. They emphasized the importance of maintaining consistent testing conditions and avoiding prompt engineering to assess the models’ raw capabilities. The video ends with an invitation for viewer feedback on the models and the methodology used, highlighting the ongoing exploration of AI model performance and the potential for future improvements.