Gemma 4 Is The Qwen Killer?

The video reviews Google’s newly released Gemma 4 local AI models, highlighting their diverse range of sizes, multimodal capabilities, and more permissive Apache 2.0 licensing that encourages broader use and modification. It also discusses practical limitations such as input length restrictions, modality trade-offs, and resource considerations, while expressing enthusiasm for Gemma 4’s potential as a versatile tool in the local AI community.

The video discusses the recent advancements and releases in local AI models, focusing on the newly launched Gemma 4 series from Google. Local models have gained significant traction recently, and Gemma 4 builds on its predecessors, Gemma 3 and 3n, by offering a range of models that cater to different device capabilities and use cases. The series includes small models optimized for mobile and low-end devices, a 31 billion parameter dense model, and a 26 billion parameter model with 4 billion active parameters, all supporting various modalities such as text, image, audio, and video understanding. This variety lets users pick a model based on their specific needs and hardware constraints.

A notable change with Gemma 4 is its licensing under the Apache 2.0 license, which is more permissive and straightforward than the bespoke Gemma license used in previous versions. This encourages broader use, modification, and forking of the models, provided proper attribution is given. The models also come with different context window sizes: the smaller E2B and E4B models support a 128K context window and multimodal inputs, while the larger 31B dense and 26B models support a 256K context window but are limited to text and image modalities. This trade-off between capability and context size is important to weigh against the requirements of your application.
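
To make the trade-off concrete, here is a minimal Python sketch of the lineup as described in the video, with a helper that filters models by required modalities and context size. The model identifiers and per-model numbers are illustrative assumptions drawn from the summary above, not official names or specifications:

```python
# A minimal sketch of the model lineup as described in the video.
# Model names and exact figures are illustrative assumptions, not official identifiers.

MODELS = {
    "gemma4-e2b": {"params_b": 2,  "context": 128_000, "modalities": {"text", "image", "audio", "video"}},
    "gemma4-e4b": {"params_b": 4,  "context": 128_000, "modalities": {"text", "image", "audio", "video"}},
    "gemma4-26b": {"params_b": 26, "context": 256_000, "modalities": {"text", "image"}},  # ~4B active parameters
    "gemma4-31b": {"params_b": 31, "context": 256_000, "modalities": {"text", "image"}},  # dense
}

def pick_model(needed_modalities: set[str], min_context: int) -> list[str]:
    """Return the models that cover the required modalities and context size."""
    return [
        name for name, spec in MODELS.items()
        if needed_modalities <= spec["modalities"] and spec["context"] >= min_context
    ]

# Long-context text work rules out the multimodal E2B/E4B variants, and vice versa.
print(pick_model({"text"}, min_context=200_000))          # ['gemma4-26b', 'gemma4-31b']
print(pick_model({"audio", "text"}, min_context=64_000))  # ['gemma4-e2b', 'gemma4-e4b']
```
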

The video highlights some practical considerations and limitations of the Gemma 4 models. For instance, the multimodal E2B and E4B models support audio and video, but with constraints: audio inputs are limited to 30 seconds, and video inputs to 60 seconds at a processing rate of one frame per second. Longer audio or video therefore has to be segmented, often using voice activity detection (VAD) to split audio at silences. Additionally, the models do not support speaker identification or detailed timestamping for audio, which limits their use in some transcription or analysis tasks. These nuances are crucial for anyone planning to adopt Gemma 4 in real-world applications.
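
As a rough illustration of that segmentation step, the sketch below splits a long recording into chunks of at most 30 seconds at detected silences, using pydub's silence-based splitting as a stand-in for a dedicated VAD. The file name, silence thresholds, and chunking strategy are assumptions for demonstration, not anything prescribed by the model:

```python
# A rough sketch: split long audio into <=30 s chunks at silences so each
# chunk fits the audio input limit mentioned in the video.
from pydub import AudioSegment
from pydub.silence import split_on_silence

MAX_CHUNK_MS = 30_000  # audio inputs are capped at ~30 seconds

def chunk_audio(path: str) -> list[AudioSegment]:
    audio = AudioSegment.from_file(path)
    # Split wherever there is >=500 ms of audio quieter than -40 dBFS (tunable).
    pieces = split_on_silence(audio, min_silence_len=500, silence_thresh=-40, keep_silence=200)

    chunks, current = [], AudioSegment.empty()
    for piece in pieces:
        if len(current) + len(piece) > MAX_CHUNK_MS and len(current) > 0:
            chunks.append(current)
            current = AudioSegment.empty()
        # A single speech segment longer than 30 s still has to be hard-cut.
        while len(piece) > MAX_CHUNK_MS:
            chunks.append(piece[:MAX_CHUNK_MS])
            piece = piece[MAX_CHUNK_MS:]
        current += piece
    if len(current) > 0:
        chunks.append(current)
    return chunks

# "interview.mp3" is a placeholder; export each chunk for separate model calls.
for i, chunk in enumerate(chunk_audio("interview.mp3")):
    chunk.export(f"chunk_{i:03d}.wav", format="wav")
```
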

Another interesting feature of Gemma 4 is the introduction of an image token budget, which manages the processing of high-resolution images by downsampling or limiting detail based on the task. This lets the model handle image classification efficiently without extensive pre-processing, while still allowing higher token budgets for tasks like OCR that need finer detail extraction. The video also mentions the optional “thinking” mode, which can improve results at the cost of increased token consumption, and the creator’s personal preference to disable it because of that resource intensity.
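
The budget itself is applied inside the model, but the idea can be sketched as a pre-processing step: downsample an image until its patch count fits a chosen budget, and keep full resolution only when fine detail (for example, OCR) is needed. The 16×16-pixels-per-token assumption and the budget values below are illustrative, not Gemma 4's actual scheme:

```python
# A hedged sketch of trading image detail for tokens: resize an image so its
# patch count fits a target token budget. The patch size and budgets are
# assumptions made for illustration only.
from PIL import Image

PATCH = 16  # assume roughly one token per 16x16 pixel patch

def fit_to_budget(path: str, token_budget: int) -> Image.Image:
    img = Image.open(path)
    w, h = img.size
    tokens_now = (w // PATCH) * (h // PATCH)
    if tokens_now <= token_budget:
        return img  # already within budget; keep full detail (e.g. for OCR)
    scale = (token_budget / tokens_now) ** 0.5
    return img.resize((max(PATCH, int(w * scale)), max(PATCH, int(h * scale))))

small = fit_to_budget("receipt.png", token_budget=256)   # coarse, enough for classification
large = fit_to_budget("receipt.png", token_budget=4096)  # finer detail for OCR-style tasks
```
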

Overall, Gemma 4 represents a significant step forward for local AI models by combining multiple modalities, offering a range of model sizes, and adopting a more open license. While the models are promising and versatile, users should be aware of the specific limitations and trade-offs, especially around context windows, modality support, and input length restrictions. The video creator expresses excitement about experimenting with these models and encourages viewers interested in local AI to explore Gemma 4, update their tools, and share their experiences. The release marks an exciting development in the local model landscape and could become a new fine-tuning workhorse for the AI community.