Testing Llama 3.1 reasoning with 5 increasingly difficult problems

The video tests three versions of the Llama 3.1 model (8 billion, 70 billion, and 405 billion parameters) on five increasingly difficult word problems. While the models demonstrate some reasoning ability, they generally perform worse than leading models such as GPT-4 Turbo, and their performance declines sharply as problem complexity increases, particularly on the familial-relationship problems, underscoring the gap between the Llama models and the top-performing models.

In the video, the presenter tests three versions of the Llama model—8 billion, 70 billion, and 405 billion parameters—by posing five progressively challenging word problems. Each problem is presented in many different sentence orderings, highlighting how the arrangement of sentences can affect the performance of large language models (LLMs). The first problem involves counting bumper cars, with the correct answer being 23 yellow cars; the subsequent problems increase in complexity, covering scenarios involving homework and familial relationships.
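The presenter's own code is not shown in this summary, but the combination-testing idea can be sketched roughly as follows. This is a minimal reconstruction under my own assumptions: query_model is a hypothetical placeholder for however Llama 3.1 is actually called, the substring check on the answer is a naive stand-in for whatever grading the presenter uses, and the sentences are placeholders rather than the actual problem wording.

```python
# Rough sketch of testing every sentence ordering of a word problem
# and tallying how many orderings yield the correct answer.
from itertools import permutations


def query_model(prompt: str) -> str:
    """Hypothetical stand-in for a call to Llama 3.1 (local or hosted)."""
    raise NotImplementedError("plug in your own model call here")


def test_sentence_combinations(sentences, question, correct_answer):
    correct = 0
    total = 0
    for ordering in permutations(sentences):
        # Build one prompt per sentence ordering, then append the question.
        prompt = " ".join(ordering) + " " + question
        reply = query_model(prompt)
        total += 1
        # Naive grading: does the expected answer appear in the reply?
        if correct_answer in reply:
            correct += 1
    return correct, total


# Placeholder sentences; substitute the actual word problem's sentences.
sentences = [
    "Sentence 1 of the word problem.",
    "Sentence 2 of the word problem.",
    "Sentence 3 of the word problem.",
]
# correct, total = test_sentence_combinations(
#     sentences, "How many yellow bumper cars are there?", "23"
# )
```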

The results for the first problem show that the 8 billion parameter model performed reasonably well, answering 102 combinations correctly, compared with 106 for the previous Llama 3 model. The presenter notes that the best performance on this problem came from GPT-4 Turbo, which answered every combination correctly, with models such as Sonnet 3.5 and Opus close behind. The 70 billion model did slightly better than the 8 billion model on this problem, with 118 correct answers.

As the testing progresses to the more complex homework problem, the 8 billion model's performance declines sharply, with only 7 correct answers, a significant drop from its result on the first problem. The 70 billion model fared better with 65 correct answers, still short of the larger models. The 405 billion model answered 90 correctly, a more competitive showing, but it still lagged behind the top-performing models, GPT-4 Turbo and Sonnet 3.5.

The final two problems examine familial relationships. On the more difficult of the two, the 8 billion model obtains zero correct answers, the 70 billion model likewise scores zero, and the 405 billion model manages only two correct responses. These results illustrate that while the Llama models show potential, they still do not match the leading models on more complex reasoning tasks.

Overall, the presenter concludes that the Llama models, while open-source and respectable performers, generally fall short of the leading models when it comes to reasoning. The video also notes that code files are available for viewers who want to run similar tests, and encourages viewers to become patrons for access to additional resources and courses related to coding and model testing.