Llama 3.2 Deep Dive - Tiny LM & NEW VLM Unleashed By Meta

Meta’s recent release of Llama 3.2 introduces four new models: two lightweight text-only language models and two multimodal vision-language models, all with a 128k context window that is rare at these sizes. The models show promise, particularly in text generation, but their vision performance trails competing models in the rapidly evolving AI landscape.

At the recent Meta Connect event, Meta announced Llama 3.2, a new set of language models that departs from the previous Llama 3.1 series. The release includes four models: two lightweight text-only models (1B and 3B) and two vision-language models (11B and 90B). Notably, this marks Meta’s first foray into multimodal models with vision capabilities. The 1B and 3B models are designed for on-device use cases such as summarization and instruction following, and they offer a 128k context window, which is rare for models of their size.
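
To make the on-device angle concrete, here is a minimal sketch of chatting with the 1B instruct model through Hugging Face transformers. The checkpoint name and prompt are assumptions for illustration; check the official model card for the exact repo name and any license gating.

```python
# Minimal sketch: chat with the 1B instruct model via the transformers pipeline.
# "meta-llama/Llama-3.2-1B-Instruct" is the assumed Hugging Face repo name.
import torch
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="meta-llama/Llama-3.2-1B-Instruct",  # assumed checkpoint name
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [
    {"role": "system", "content": "You are a concise assistant."},
    {"role": "user", "content": "Summarize the Llama 3.2 lineup in two sentences."},
]

out = generator(messages, max_new_tokens=96)
print(out[0]["generated_text"][-1]["content"])
```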

The 1B model, with 1.24 billion parameters, performs strongly on long-context and information-retrieval tasks, beating larger models on specific benchmarks. It does not excel everywhere, though; mathematical tasks in particular remain a weak spot. The 3B model improves on the 1B but faces stiff competition in its size class and does not stand out as much. Both models are optimized for local use, making them suitable for devices like smartphones, which improves user privacy because sensitive prompts never have to leave the device.
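
The long-context claim is easy to probe informally with a needle-in-a-haystack style test. The sketch below reuses the same assumed 1B checkpoint and a made-up "needle"; it is a rough sanity check, not a benchmark.

```python
# Rough long-context sanity check: hide a fact deep inside filler text and ask for it.
import torch
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="meta-llama/Llama-3.2-1B-Instruct",  # assumed checkpoint name
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

filler = "The quick brown fox jumps over the lazy dog. " * 2000  # roughly 20k tokens of distractor text
needle = "The secret code word is 'lighthouse'. "
haystack = filler[: len(filler) // 2] + needle + filler[len(filler) // 2 :]

messages = [{"role": "user", "content": haystack + "\n\nWhat is the secret code word?"}]
out = generator(messages, max_new_tokens=16)
print(out[0]["generated_text"][-1]["content"])  # expected to mention 'lighthouse'
```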

The 11B and 90B models introduce vision capabilities while keeping the same 128k context window. They are built on top of larger Llama text models, but Meta did not release a 405B vision model due to the complexity of training one. While the Llama 3.2 vision models show promise, they are not necessarily the best in the open-source landscape: comparisons with other multimodal models, such as Qwen2-VL and NVLM, show that Llama 3.2 generally falls short in vision performance, even though it excels at text generation.
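
As a rough illustration of how the vision models are used, here is a sketch of asking the 11B instruct variant about a single image, assuming a recent transformers release with Mllama support; the checkpoint name and image URL are placeholders.

```python
# Sketch: ask the 11B vision model a question about one image.
import requests
import torch
from PIL import Image
from transformers import AutoProcessor, MllamaForConditionalGeneration

model_id = "meta-llama/Llama-3.2-11B-Vision-Instruct"  # assumed checkpoint name

model = MllamaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

image = Image.open(requests.get("https://example.com/chart.png", stream=True).raw)  # placeholder URL

messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "Describe what this chart shows."},
    ]},
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)

output = model.generate(**inputs, max_new_tokens=80)
print(processor.decode(output[0], skip_special_tokens=True))
```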

Architecturally, Llama 3.2’s vision models pair a vision transformer image encoder with the existing language model, so image representations are processed alongside the text tokens. This design choice aims to preserve the text-generation quality of the Llama series while adding vision capabilities. Competing models like Qwen2-VL and NVLM, however, use more advanced multimodal architectures and training recipes, and those differences are a large part of why Llama 3.2 lags behind these newer models on vision tasks.
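
The gist of that integration can be sketched as a gated cross-attention adapter: the text stream attends to image features, and a zero-initialized gate means the language model starts out behaving exactly as it did before any vision training. This is a simplified conceptual illustration, not Meta’s actual implementation; all shapes and names here are made up.

```python
# Conceptual PyTorch sketch of a gated cross-attention adapter for vision features.
import torch
import torch.nn as nn

class CrossAttentionAdapter(nn.Module):
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Zero-initialized gate: the adapter contributes nothing at the start,
        # so the base model's text behavior is preserved.
        self.gate = nn.Parameter(torch.zeros(1))

    def forward(self, text_hidden: torch.Tensor, image_feats: torch.Tensor) -> torch.Tensor:
        # text_hidden: (batch, text_len, d_model), image_feats: (batch, img_tokens, d_model)
        attended, _ = self.cross_attn(self.norm(text_hidden), image_feats, image_feats)
        return text_hidden + torch.tanh(self.gate) * attended

# Toy usage with random tensors standing in for real hidden states.
adapter = CrossAttentionAdapter(d_model=512, n_heads=8)
text_hidden = torch.randn(1, 16, 512)
image_feats = torch.randn(1, 64, 512)
print(adapter(text_hidden, image_feats).shape)  # torch.Size([1, 16, 512])
```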

Finally, Nvidia has collaborated with Meta to optimize Llama 3.2 for various platforms, ensuring efficient performance on existing Nvidia hardware. They have introduced a multimodal retrieval-augmented generation (RAG) pipeline that enhances the usability of the vision models for tasks involving images and documents. Overall, while Llama 3.2 represents a significant step forward for Meta in the realm of language and vision models, it faces challenges in competing with other state-of-the-art models in the rapidly evolving AI landscape.
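
A multimodal RAG flow of that kind can be sketched generically: embed document page images, retrieve the pages closest to a query, and hand them to the vision model for a grounded answer. The sketch below is not NVIDIA's pipeline; the embedding and VLM helpers are hypothetical placeholders for whatever models and clients you actually use.

```python
# Generic multimodal RAG sketch: retrieve relevant page images, then ask the VLM.
from dataclasses import dataclass
from typing import List
import numpy as np

@dataclass
class Page:
    image_path: str
    embedding: np.ndarray

def embed_image(path: str) -> np.ndarray:
    """Placeholder: embed a page image (e.g., with a CLIP-style encoder)."""
    raise NotImplementedError

def embed_text(query: str) -> np.ndarray:
    """Placeholder: embed the user query in the same vector space."""
    raise NotImplementedError

def answer_with_vlm(query: str, image_paths: List[str]) -> str:
    """Placeholder: call a Llama 3.2 vision endpoint with the query and retrieved pages."""
    raise NotImplementedError

def retrieve(query: str, index: List[Page], k: int = 3) -> List[Page]:
    # Cosine similarity between the query embedding and each page embedding.
    q = embed_text(query)
    scores = [float(np.dot(q, p.embedding) / (np.linalg.norm(q) * np.linalg.norm(p.embedding)))
              for p in index]
    top = np.argsort(scores)[::-1][:k]
    return [index[i] for i in top]

def multimodal_rag(query: str, index: List[Page]) -> str:
    pages = retrieve(query, index)
    return answer_with_vlm(query, [p.image_path for p in pages])
```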