NVIDIA has released an open AI model with 30 billion parameters built for fast, cost-efficient multimodal processing of images, video, and audio. It gets there through techniques such as Mamba layers that scale linearly with context length, 3D convolutions, and efficient video sampling. It needs powerful hardware and ships under a more restrictive license than Apache 2.0, but it processes large volumes of multimedia data much faster than previous models, which makes it a strong fit for applications beyond pure text tasks.
NVIDIA has introduced a new open and free AI model with 30 billion parameters that processes images, video, and audio. Compared with other free models such as Gemma 4, it stands out primarily for its throughput and cost efficiency. It can process nearly 10 hours of video per hour of processing time, which is almost 10 times faster than real time and about three times faster than previous models like Qwen3-Omni. When handling documents, it achieves speeds up to seven times faster. However, running the model locally requires a powerful GPU with around 25 GB of video memory, so it is better suited to desktops or cloud platforms like Lambda.
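To make the throughput figures concrete, here is a minimal arithmetic sketch. It assumes round numbers taken from the text (about 10x real time for this model, and a baseline roughly three times slower); the function name and the 100-hour archive are hypothetical, chosen only for illustration.

```python
def processing_hours(video_hours: float, speedup_vs_realtime: float) -> float:
    """Wall-clock hours needed to process `video_hours` of footage."""
    if speedup_vs_realtime <= 0:
        raise ValueError("speedup must be positive")
    return video_hours / speedup_vs_realtime


# Assumed round figures from the text: ~10x real time for this model,
# and a baseline that is about three times slower.
MODEL_SPEEDUP = 10.0
BASELINE_SPEEDUP = MODEL_SPEEDUP / 3.0

archive_hours = 100.0  # hypothetical archive size
model_time = processing_hours(archive_hours, MODEL_SPEEDUP)
baseline_time = processing_hours(archive_hours, BASELINE_SPEEDUP)

print(f"model: {model_time:.1f} h, baseline: {baseline_time:.1f} h")
```

Under these assumptions, a 100-hour archive takes roughly 10 hours instead of roughly 30.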
The model's speed and efficiency come from five key innovations. First, it uses Mamba layers, whose cost scales linearly with context length rather than quadratically as in standard attention, so it can handle longer videos, audio, or documents more efficiently. Second, its audio processing converts raw audio waveforms into tokens without relying on a large speech recognition model, which preserves emotion and tone while reducing computational cost. Third, it employs 3D convolutions to analyze blocks of video frames at once rather than processing frames individually, which significantly reduces computation time and cost.
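The gap between linear and quadratic scaling can be illustrated with a toy cost model. This is a rough sketch, not a measurement of the actual model: it counts abstract "operations" only, and the `state_size` constant is a hypothetical stand-in for the fixed per-token work of a linear-scaling layer.

```python
def quadratic_cost(n_tokens: int) -> int:
    """Pairwise interactions: every token is compared with every token."""
    return n_tokens * n_tokens


def linear_cost(n_tokens: int, state_size: int = 16) -> int:
    """One fixed-size state update per token (state_size is hypothetical)."""
    return n_tokens * state_size


for n in (1_000, 10_000, 100_000):
    ratio = quadratic_cost(n) / linear_cost(n)
    print(f"{n:>7} tokens: quadratic/linear = {ratio:,.0f}x")
```

Doubling the context doubles the linear cost but quadruples the quadratic one, which is why the difference matters most for long videos, long audio, and long documents.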
Fourth, instead of relying on a single large CLIP model for image-text matching, the model distills three separate vision models (one for image-text matching, one for fine details, and one for object segmentation) into a single small encoder network, which improves efficiency. Fifth, it uses efficient video sampling: it discards near-duplicate frames that share similar backgrounds, further reducing the amount of data processed and improving speed and cost-effectiveness. Together, these techniques make the model highly optimized for multimodal inputs.
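The idea of discarding near-duplicate frames can be sketched as follows. This is a simplified illustration under stated assumptions, not the model's actual sampling method: frames are flattened grayscale pixel lists with values in [0, 1], similarity is a plain mean absolute difference, and the function names and `threshold` value are hypothetical.

```python
from typing import List, Sequence

Frame = Sequence[float]  # flattened grayscale frame, pixel values in [0, 1]


def mean_abs_diff(a: Frame, b: Frame) -> float:
    """Average per-pixel absolute difference between two frames."""
    if len(a) != len(b) or not a:
        raise ValueError("frames must be non-empty and the same size")
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def drop_near_duplicates(frames: List[Frame], threshold: float = 0.05) -> List[int]:
    """Return indices of frames that differ enough from the last kept frame."""
    kept: List[int] = []
    for i, frame in enumerate(frames):
        # Always keep the first frame; afterwards compare with the last kept one.
        if not kept or mean_abs_diff(frame, frames[kept[-1]]) > threshold:
            kept.append(i)
    return kept


frames = [
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.01, 0.0, 0.0],   # nearly identical to frame 0
    [1.0, 1.0, 0.0, 0.0],    # scene change
    [1.0, 1.0, 0.02, 0.0],   # nearly identical to frame 2
]
kept = drop_near_duplicates(frames)
print(kept)
```

Here only frames 0 and 2 survive, so downstream layers see half as much data. Comparing against the last kept frame rather than the immediately preceding one avoids slow drift going unnoticed.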
Regarding licensing, the model is released under its own license, which is more restrictive than the highly permissive Apache 2.0 license but still allows derivative works and commercial use with some attribution and patent grant conditions. While it may not be the best choice for pure text reasoning or coding tasks, it excels in fast, cost-effective multimodal processing, making it a valuable tool for applications involving audio, video, and images.
Overall, this development marks a significant step forward in accessible AI technology, enabling users to run powerful multimodal models locally or in the cloud. As AI models continue to specialize and improve in different areas, users benefit from more tailored and efficient tools. The availability of such open models empowers researchers and developers to experiment and innovate without prohibitive costs, making this an exciting time for AI enthusiasts and professionals alike.