Qwen QwQ 2.5 32B Ollama Local AI Server Benchmarked w/ Cuda vs Apple M4 MLX

The video benchmarks Alibaba's Qwen QwQ 32B model on a CUDA quad-GPU Ollama server against an Apple M4 Max running MLX, comparing token-generation throughput across several configurations, with the Q4 quantization proving the most efficient. While the model demonstrates strong capabilities in coding and reasoning tasks, it also shows some inconsistencies, particularly on simpler queries and certain complex reasoning tasks, but is praised overall for its potential in creative applications.

The video discusses Alibaba's new Qwen QwQ 32B, a reasoning-focused variant of the Qwen 2.5 family. The presenter highlights the model's Chain of Thought reasoning capabilities, emphasizing its performance and its appeal for self-hosting AI enthusiasts. The video benchmarks the model on a quad-GPU rig against the Apple M4 Max, measuring tokens-per-second throughput across the Q8, Q4, and FP16 quantization formats.
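The practical difference between the Q8, Q4, and FP16 formats is largely memory footprint. As a rough sketch (not from the video), the weight storage for a 32B-parameter model can be estimated from bits per weight; the bits-per-weight figures below are approximations for common GGUF quantizations, and real files vary and the KV cache adds further VRAM on top.

```python
# Rough VRAM estimate for a ~32B-parameter model at different quantizations.
# Assumptions (not from the video): ~32.8e9 parameters, and approximate
# effective bits per weight including quantization scales. Real GGUF sizes
# differ somewhat, and the KV cache consumes additional memory.

PARAMS = 32.8e9  # approximate parameter count for a "32B" model

def vram_gb(bits_per_weight: float, params: float = PARAMS) -> float:
    """Weight-storage estimate in GB: params * bits / 8 bits-per-byte."""
    return params * bits_per_weight / 8 / 1e9

for name, bpw in [("FP16", 16.0), ("Q8_0", 8.5), ("Q4_K_M", 4.5)]:
    print(f"{name}: ~{vram_gb(bpw):.1f} GB")
```

This back-of-the-envelope math is why the video's conclusion that Q4 suits users with limited GPU resources is plausible: Q4 needs roughly a quarter of the FP16 footprint.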

The presenter begins by comparing the Qwen QwQ 32B's throughput on a quad-GPU rig against the Apple M4 Max, which reportedly achieves 11.11 tokens per second. The Q8 configuration on the quad-GPU rig yields 20.63 tokens per second, while the Q4 configuration impressively reaches 30.65 tokens per second. The FP16 configuration, however, manages only 12.14 tokens per second, suggesting that the Q4 model may be the most efficient choice for users with limited GPU resources.
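Throughput figures like these can be computed from the statistics Ollama returns with each generation: the `/api/generate` endpoint reports `eval_count` (tokens generated) and `eval_duration` (in nanoseconds). The sketch below assumes a local Ollama server on its default port 11434 and a hypothetical model name; `benchmark` is an illustrative helper, not from the video.

```python
# Sketch: measure tokens/sec from Ollama's /api/generate response fields.
# Assumes a local Ollama server on the default port; the model tag "qwq"
# is an assumption and may differ in your setup.
import json
import urllib.request

def tokens_per_second(eval_count: int, eval_duration_ns: int) -> float:
    """Throughput from Ollama's eval stats (eval_duration is nanoseconds)."""
    return eval_count / (eval_duration_ns / 1e9)

def benchmark(model: str, prompt: str) -> float:
    # Non-streaming request so the final stats arrive in one JSON object.
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=json.dumps({"model": model, "prompt": prompt,
                         "stream": False}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as r:
        resp = json.load(r)
    return tokens_per_second(resp["eval_count"], resp["eval_duration"])

# e.g. benchmark("qwq", "Write a haiku about GPUs")
print(round(tokens_per_second(613, 20_000_000_000), 2))  # 613 tokens in 20 s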
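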

As the video progresses, the presenter tests the Qwen QwQ’s capabilities by posing various questions, including coding tasks and ethical dilemmas. The model demonstrates a strong ability to generate code for a Flappy Bird clone, although it struggles with collision detection. The presenter notes that the model’s reasoning and creativity are commendable, but it occasionally falters on simpler tasks, revealing inconsistencies in its performance.

The video also explores the model’s ability to handle complex reasoning tasks, such as ethical questions and mathematical inquiries. While the Qwen QwQ performs well in some areas, it fails to provide satisfactory answers in others, particularly when asked to create a fitness plan or analyze dietary needs. The presenter appreciates the model’s attempts to reason through its responses, but notes that it sometimes lacks the depth expected from an advanced AI.

In conclusion, the presenter expresses admiration for the Qwen QwQ 32B, labeling it as one of the best Chain of Thought reasoning models available. Despite some inconsistencies and shortcomings, the model shows significant potential for creative and coding tasks. The video encourages viewers to explore the capabilities of the Qwen QwQ and provides resources for setting up similar AI systems, while inviting feedback and discussion from the audience on their experiences with the model.