How do Phi-3 Mini, Phi-3 Medium, Gemini 1.5 Pro, and Gemini 1.5 Flash stack up in reasoning?

The video evaluated the reasoning abilities of Microsoft's Phi-3 models, alongside Google's Gemini 1.5 Pro and Flash, by testing them on challenging word problems presented in various sentence configurations. The Phi-3 Mini model performed poorly, while the Phi-3 Medium model excelled in solving the word problems, and the Gemini 1.5 Pro and Flash models also demonstrated good reasoning abilities.

In the video, the reasoning abilities of the new Phi-3 models from Microsoft were tested using difficult word problems. Four language models were evaluated: Phi-3 Mini, Phi-3 Medium, Gemini 1.5 Pro, and Gemini 1.5 Flash. The evaluation involved two word problems, each consisting of five sentences, presented in varying configurations to test the models' logic and reasoning capabilities. The models were tested on 120 sentence configurations, a figure that matches the 5! = 120 possible orderings of five sentences, and their performance was assessed by the number of correct answers they provided.
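The setup described above can be sketched roughly as follows. This is a minimal illustration, not the presenter's script: it assumes the 120 configurations are all orderings of the five sentences, and the placeholder sentences, the `stub_model` function, and the substring-based answer check are all hypothetical stand-ins.

```python
from itertools import permutations
from typing import Callable, Sequence


def sentence_orderings(sentences: Sequence[str]) -> list[str]:
    """Join every ordering of the sentences into one prompt string."""
    return [" ".join(order) for order in permutations(sentences)]


def count_correct(
    prompts: Sequence[str],
    ask_model: Callable[[str], str],
    expected: str,
) -> int:
    """Count prompts whose reply contains the expected answer (assumed check)."""
    return sum(1 for prompt in prompts if expected in ask_model(prompt))


# Placeholder sentences; the real word problems are not reproduced here.
sentences = ["Sentence A.", "Sentence B.", "Sentence C.", "Sentence D.", "Sentence E."]
prompts = sentence_orderings(sentences)


def stub_model(prompt: str) -> str:
    """Hypothetical stand-in for a model call; order-sensitive on purpose."""
    return "yes" if prompt.startswith("Sentence A.") else "no"


correct = count_correct(prompts, stub_model, expected="yes")
print(f"{correct} correct out of {len(prompts)} configurations")
```

In a real run, `stub_model` would be replaced by a function that sends the prompt to the model under test and returns its reply.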

The testing process used tools such as OpenRouter to run the models and evaluate their performance on the word problems. The results were compared with previous tests of other language models, such as the GPT series and Llama, with the aim of ranking the models by their accuracy in solving the word problems. The testing script made it easy to compare and analyze the models’ performance, providing insight into their reasoning abilities.

The results revealed varying performance levels among the models. The Phi-3 Mini model performed relatively poorly, with 85 correct answers out of 120 combinations for the bumper problem and only 21 correct answers for the harder Marcus problem. In contrast, the Phi-3 Medium model excelled on the bumper problem with 117 correct answers, and scored 23 correct answers on the Marcus problem, putting it on par with other established language models.

The Gemini 1.5 Pro model demonstrated good performance, with 113 correct answers for the bumper problem and 88 correct answers for the Marcus problem, the highest Marcus score of the four models reported here. The Gemini 1.5 Flash model also performed well, with 112 correct answers for the bumper problem and 25 correct answers for the Marcus problem. Overall, the models were ranked by their performance on the word problems, with the Phi-3 Medium model standing out for its strong reasoning, particularly on the bumper problem.
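To make the comparison easier to scan, the reported counts can be turned into accuracies and per-problem rankings. The sketch below uses only the figures quoted in this summary and assumes that the Marcus problem, like the bumper problem, was scored out of 120 configurations; the variable and function names are illustrative.

```python
TOTAL = 120  # configurations per problem (assumed to apply to both problems)

# (bumper correct, Marcus correct) as reported in this summary
scores = {
    "Phi-3 Mini": (85, 21),
    "Phi-3 Medium": (117, 23),
    "Gemini 1.5 Pro": (113, 88),
    "Gemini 1.5 Flash": (112, 25),
}


def rank_by(problem_index: int) -> list[str]:
    """Return model names sorted by correct answers, best first."""
    return sorted(scores, key=lambda name: scores[name][problem_index], reverse=True)


bumper_ranking = rank_by(0)
marcus_ranking = rank_by(1)

print(f"{'Model':<18}{'Bumper':>8}{'Marcus':>8}")
for name, (bumper, marcus) in scores.items():
    print(f"{name:<18}{bumper / TOTAL:>8.1%}{marcus / TOTAL:>8.1%}")

print("Bumper ranking:", ", ".join(bumper_ranking))
print("Marcus ranking:", ", ".join(marcus_ranking))
```

Ranking each problem separately shows that the ordering of the models differs between the two problems, so a single overall ranking depends on how the two are weighted.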

The video concluded with a discussion of the benefits of becoming a patron: access to the code files and courses, and the chance to connect with the content creator. The presenter mentioned plans to continue testing new models as they become available and to provide updates on their performance. The code files used for the testing were made available on the creator’s Patreon page for further exploration and analysis.