The video reviews the Qwen3 235B 2507 model running at BF16 precision on a high-end AMD EPYC system, demonstrating strong coding, reasoning, and recall abilities with efficient hardware usage despite its large size. The reviewer highlights its promising performance as an open-source LLM, encourages the use of BF16 models on suitable hardware, and credits Unsloth's GGUFs with making local deployment of such large models practical.
The video presents a detailed review of the latest Qwen3 235B 2507 checkpoint, running at BF16 precision via Unsloth GGUFs on a high-end AMD EPYC 7702 system with 512 GB of RAM and four NVIDIA RTX 3090 GPUs. The reviewer uses llama.cpp to test the model's capabilities and performance, noting that despite the checkpoint's size (around 470 GB), it runs surprisingly well at about 2.8 tokens per second without any tuning. The first test, a hypothetical scenario called "Armageddon with a twist" in which the model must decide whether to carry out a critical mission, produced a bare "no" with no reasoning, which the reviewer counted as a failure pointing to the need for further fine-tuning.
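The video does not show the exact launch configuration. As a rough sketch, loading a GGUF checkpoint through the llama-cpp-python bindings (my choice for illustration; the reviewer drives llama.cpp directly) might look like the following, with the file name, layer offload, and thread count all assumed:

```python
# Minimal sketch using the llama-cpp-python bindings (pip install llama-cpp-python);
# the reviewer uses llama.cpp directly, and every value below is an assumption.
from llama_cpp import Llama

llm = Llama(
    model_path="Qwen3-235B-A22B-2507-BF16.gguf",  # hypothetical filename
    n_gpu_layers=20,   # offload what fits across the four 24 GB GPUs (assumed)
    n_threads=64,      # EPYC 7702: 64 cores / 128 threads
    n_ctx=8192,        # context length, sized to fit the remaining RAM
)

out = llm("Explain BF16 precision in one sentence.", max_tokens=128)
print(out["choices"][0]["text"])
```

With most layers resident in system RAM, generation speed is bounded by memory bandwidth rather than GPU compute, which is consistent with the ~2.8 tokens per second the reviewer observes.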
Subsequent tests focused on practical tasks such as coding, logic puzzles, and reasoning challenges. The model generated a functional Python Flappy Bird clone using only Pygame and code-generated assets, demonstrating strong coding ability. It also passed a range of logic and reasoning tests, including a peppermint word problem, cipher decoding, numerical comparisons, and time-based positional reasoning, answering quickly and accurately and reinforcing its standing as an advanced LLM capable of handling diverse tasks.
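The generated game code is not reproduced in the video, but a minimal sketch of the kind of program the task calls for, a Flappy Bird clone that uses only Pygame primitives as assets, could look like this (all constants and structure are illustrative, not the model's actual output):

```python
# Illustrative sketch of a Pygame-only Flappy Bird clone; constants and
# structure here are assumptions, not the model's actual output.
import random

import pygame

W, H = 400, 600
GRAVITY, FLAP, GAP = 0.4, -7.0, 170

pygame.init()
screen = pygame.display.set_mode((W, H))
clock = pygame.time.Clock()

bird_y, bird_vy = H / 2, 0.0
pipes = [[W + i * 220, random.randint(120, H - 120)] for i in range(3)]
score, running = 0, True

while running:
    for event in pygame.event.get():
        if event.type == pygame.QUIT:
            running = False
        elif event.type == pygame.KEYDOWN and event.key == pygame.K_SPACE:
            bird_vy = FLAP  # flap on spacebar

    bird_vy += GRAVITY
    bird_y += bird_vy

    for pipe in pipes:
        pipe[0] -= 3  # scroll pipes to the left
        if pipe[0] < -60:  # recycle a pipe once it leaves the screen
            pipe[0] = W + 100
            pipe[1] = random.randint(120, H - 120)
            score += 1

    # Collision: the bird leaves the screen or touches a pipe.
    bird_rect = pygame.Rect(80, int(bird_y), 30, 30)
    for x, gap_y in pipes:
        top = pygame.Rect(x, 0, 60, gap_y - GAP // 2)
        bottom = pygame.Rect(x, gap_y + GAP // 2, 60, H)
        if bird_rect.colliderect(top) or bird_rect.colliderect(bottom):
            running = False
    if bird_y < 0 or bird_y > H:
        running = False

    # "Code-generated assets": plain rectangles and a circle, no image files.
    screen.fill((135, 206, 235))
    for x, gap_y in pipes:
        pygame.draw.rect(screen, (0, 160, 0), (x, 0, 60, gap_y - GAP // 2))
        pygame.draw.rect(screen, (0, 160, 0), (x, gap_y + GAP // 2, 60, H))
    pygame.draw.circle(screen, (255, 220, 0), (95, int(bird_y) + 15), 15)
    pygame.display.flip()
    clock.tick(60)

pygame.quit()
print("Score:", score)
```

What makes this a useful benchmark is that the model must get physics, collision detection, asset drawing, and the event loop right in one pass for the game to be playable at all.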
The reviewer also tested precise recall, asking for the first 100 decimals of pi, which the model produced correctly. Additionally, the model generated an SVG image of a cat walking on a fence; while not anatomically accurate, it was judged a pass for its overall quality and creativity. The reviewer praised the model's performance as a non-reasoning LLM, appreciating its quick, accurate answers without the drawn-out chain-of-thought that slows some other models.
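A recall test like this is easy to verify mechanically. A small sketch, assuming the mpmath library as the reference source, compares a pasted answer against a computed expansion, truncating rather than rounding so the 100th decimal is preserved:

```python
# Check a "first 100 decimals of pi" answer against a computed reference.
# mpmath is my choice for illustration; any arbitrary-precision library works.
from mpmath import mp, nstr

mp.dps = 120                            # compute with guard digits
pi_str = nstr(mp.pi, 110)[:102]         # "3." + 100 decimals, truncated not rounded

model_answer = "3.14159..."             # placeholder: paste the model's full output

ok = model_answer.strip()[:102] == pi_str
print("pass" if ok else "fail")
```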
Hardware resource usage was discussed in detail: the model required about 380 GB of system RAM plus roughly 21 GB of VRAM on each of the four GPUs. The reviewer experimented with GPU power limits and found that lowering wattage did not affect token-generation speed, suggesting the workload is dominated by the CPU and RAM. An attempt to run the model on vLLM failed because the CPU-GPU workload distribution could not be made to work, whereas llama.cpp with the Unsloth GGUFs proved reliable and effective for deploying a model this large.
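The reported figures roughly account for the full checkpoint, as a quick back-of-the-envelope check shows (taking the video's numbers as given):

```python
# Back-of-the-envelope check of the memory split reported in the video.
params = 235e9            # parameter count
bytes_per_param = 2       # BF16 stores each weight in 2 bytes
model_gb = params * bytes_per_param / 1e9
print(f"checkpoint size: ~{model_gb:.0f} GB")             # ~470 GB

system_ram_gb = 380       # reported system RAM usage
vram_gb = 4 * 21          # four GPUs at ~21 GB each
print(f"RAM + VRAM:      ~{system_ram_gb + vram_gb} GB")  # ~464 GB, near the full 470
```

The near-exact match explains why the GPU power limit did not matter: with most of the weights streamed from system RAM, the CPU's memory bandwidth, not GPU compute, sets the token rate.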
In conclusion, the reviewer expressed optimism about the progress of open-source LLMs, noting that models like Qwen3 235B are closing the gap with proprietary offerings from major labs. They encouraged viewers with the necessary hardware to try BF16-precision models and highlighted Unsloth's role in making such large models accessible. The video ends with an invitation for viewers to share their own interpretations and prompt suggestions, signaling ongoing exploration and development in the local AI space.