The video reviews OpenAI’s open-weight GPT-OSS 120B model, praising technical achievements such as its efficient ~4.5-bit MXFP4 quantization and 128k-token context window, but criticizing its practical limitations, including alignment-related refusals and inconsistent performance on complex tasks. Despite its impressive speed and accessibility on consumer GPUs, the model falls short in accuracy, creativity, and ethical reasoning compared with some contemporary AI systems, leading to a mixed overall assessment.
The video reviews OpenAI’s newly released open-weight language models, GPT-OSS, in 120B and 20B parameter variants, quantized to approximately 4.5 bits per weight using the 4-bit MXFP4 floating-point format. The reviewer praises OpenAI for releasing these models under the Apache 2.0 license and highlights the impressive technical feat of fitting the 120B model into just 80GB of VRAM, which also enables usable speeds on consumer-grade GPUs like the RTX 3090. The models support an extensive 128k-token context window, a significant advance for local AI applications. The reviewer tests the 120B model’s performance and capabilities using llama.cpp with Unsloth’s GGUF build of GPT-OSS 120B.
The reviewer runs a series of tests on the 120B model, starting with a Python coding challenge to create a Flappy Bird clone using only Pygame and no external assets. While the model produces functional code, it notably omits critical game elements like pipes, which is a major flaw. Other tests include parsing and counting letters in words, numeric comparisons, and logic puzzles. The model generally performs well on simpler tasks, correctly counting vowels and consonants and solving math problems, but struggles with more complex or creative tasks, sometimes producing overly verbose or inefficient outputs.
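Letter-counting tests like these are easy to verify deterministically outside the model. As an illustrative sketch (the exact words and scoring rules used in the video aren't specified, and treating "y" as a consonant is an assumption), a reference checker might look like:

```python
def count_letters(word: str) -> tuple[int, int]:
    """Return (vowel_count, consonant_count) for a word.

    Only alphabetic characters are counted; 'y' is treated as a consonant
    (an assumption -- the video's exact rules aren't given).
    """
    vowels = set("aeiou")
    letters = [ch for ch in word.lower() if ch.isalpha()]
    n_vowels = sum(1 for ch in letters if ch in vowels)
    return n_vowels, len(letters) - n_vowels

# "strawberry" has 2 vowels (a, e) and 8 consonants
print(count_letters("strawberry"))  # → (2, 8)
```

Comparing a model's answer against such a ground-truth function is a common way to grade these simple parsing tasks objectively.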
One significant issue highlighted is the model’s refusal to generate certain types of content, such as the first 100 decimals of pi, citing policy restrictions on non-trivial numeric data. The reviewer criticizes this alignment behavior, arguing that it limits the model’s usefulness and reflects poorly on the future of open-source AI. Similarly, the model refuses to engage with ethically complex hypothetical scenarios involving forced compliance and violence, again due to content policies. These refusals are seen as failures in the context of the reviewer’s testing framework, which values nuanced reasoning and ethical discussion over outright refusal.
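The refused pi request is also easy to satisfy locally without any model at all, which underscores the reviewer's frustration. As an illustrative sketch (not something shown in the video), Gibbons' unbounded spigot algorithm streams the decimal digits of pi using only integer arithmetic:

```python
def pi_digits(count: int) -> list[int]:
    """Return the first `count` decimal digits of pi (including the leading 3),
    computed with Gibbons' unbounded spigot algorithm (exact integer arithmetic)."""
    q, r, t, k, n, l = 1, 0, 1, 1, 3, 3
    digits: list[int] = []
    while len(digits) < count:
        if 4 * q + r - t < n * t:
            # The next digit is settled; emit it and rescale the state.
            digits.append(n)
            q, r, n = 10 * q, 10 * (r - n * t), (10 * (3 * q + r)) // t - 10 * n
        else:
            # Consume another term of the series to narrow the interval.
            q, r, t, k, n, l = (q * k, (2 * q + r) * l, t * l, k + 1,
                                (q * (7 * k + 2) + r * l) // (t * l), l + 2)
    return digits

# The first 100 decimals are digits[1:101] (digits[0] is the leading 3).
digits = pi_digits(101)
print("3." + "".join(map(str, digits[1:])))
```

The algorithm never rounds, so every emitted digit is exact, making the model's policy-based refusal of "non-trivial numeric data" all the more conspicuous.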
The reviewer also tests the model’s knowledge cutoff and reasoning abilities, noting that it claims a knowledge cutoff in 2021 but sometimes responds as if it has more recent information. The model’s performance is compared unfavorably to some Chinese AI models from 2025, which reportedly handle the same question sets better. Despite the impressive speed and technical specifications, the reviewer finds the model’s overall accuracy and reliability lacking, especially in coding and complex reasoning tasks. The model’s tendency to produce excessive tokens and its occasional logical errors further detract from its usability.
In conclusion, the reviewer rates the GPT-OSS 120B model roughly between a D and a B on an old-school grading scale, acknowledging its technical strengths but criticizing its practical shortcomings and alignment-related refusals. The video emphasizes the importance of open-weight models but expresses disappointment that this release does not yet match the capabilities of some other contemporary AI systems. The reviewer encourages viewers to consider the broader implications for AI development, including the challenges of achieving AGI and the role of ethical constraints in model behavior. The video ends with an invitation for viewers to share their thoughts and continue the discussion.