Llama 3.2 goes Multimodal and to the Edge

Meta has launched Llama 3.2, bringing multimodal capabilities to the family with four new models: two vision-language models and two lightweight text-only models optimized for mobile and edge devices. All four can be fine-tuned for downstream applications and are already available for testing on platforms like Hugging Face.

With Llama 3.2, Meta's Llama series gains multimodal capabilities for the first time. When Llama 3 was released, multimodal models were expected but never shipped; this release makes good on that promise by adding four new models: two vision-language models (VLMs) and two lightweight text-only models designed for on-device use. Together they fill gaps in Meta's lineup, particularly around mobile devices and edge computing.

The two multimodal models come in 11 billion and 90 billion parameter sizes, the latter an unusual size for open-source models. They are built to act as drop-in replacements for the corresponding Llama text models, which makes them straightforward targets for fine-tuning and a wide range of applications. The lightweight text models, at 1 billion and 3 billion parameters, are optimized for hardware such as smartphones and a Raspberry Pi, and are aimed at on-device tasks like summarization and reading email.
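
As a concrete illustration of the kind of on-device workload Meta describes, the sketch below runs a short summarization prompt through the 1B instruct model with Hugging Face transformers. The model id, prompt, and generation settings here are assumptions for illustration; a real phone deployment would more likely use a quantized build through a runtime such as ExecuTorch or llama.cpp rather than full-precision weights.

```python
# Sketch: a summarization-style prompt on the 1B instruct model via transformers.
# Assumes gated access to meta-llama/Llama-3.2-1B-Instruct and a recent
# transformers release; values below are illustrative, not a reference setup.
import torch
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="meta-llama/Llama-3.2-1B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [
    {"role": "system", "content": "Summarize the user's text in one sentence."},
    {"role": "user", "content": "Hi team, the launch slipped from Tuesday to "
                                "Thursday because the review board meets late. "
                                "Please update the release notes accordingly."},
]

# The text-generation pipeline applies the model's chat template to message lists.
result = generator(messages, max_new_tokens=64)
print(result[0]["generated_text"][-1]["content"])
```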

Meta claims that the new vision models outperform GPT-4o mini and Claude 3 Haiku on certain benchmarks. There are, however, questions about how transparent the evaluations are, since Meta did not compare against some strong competitors such as Qwen2-VL. The benchmarks for the lightweight text models also look promising, particularly against popular small models like Gemma 2 and Phi-3.5-mini, although results vary noticeably from one dataset to another.

The lightweight models were produced by pruning and knowledge distillation from larger Llama models, which lets them retain much of the larger models' performance at a fraction of the size. Pruning strips less important weights and layers from the original network, and distillation then trains the smaller model against the outputs of the larger Llama models to recover quality. Meta has also released demos showing these models handling tasks such as summarization and rewriting, and introduced the Llama Stack APIs to make it easier to build agentic applications on top of them.
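
Meta has not published its full training recipe, so the snippet below is only a generic sketch of logit-based knowledge distillation in PyTorch: a pruned student is trained on a mix of the usual hard-label loss and a KL term that pulls its distribution toward a frozen teacher's softened logits. The temperature, mixing weight, and model handles are placeholders, not Meta's settings.

```python
# Sketch: generic logit-based knowledge distillation, the technique Meta
# describes for the 1B/3B models. All hyperparameters are illustrative.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    """Blend cross-entropy on hard labels with a KL term toward the teacher."""
    # Hard-label loss on the ground-truth next tokens.
    ce = F.cross_entropy(
        student_logits.view(-1, student_logits.size(-1)), labels.view(-1)
    )
    # Soft-label loss: KL divergence to the teacher's softened distribution,
    # scaled by T^2 to keep gradient magnitudes comparable (standard practice).
    t_probs = F.softmax(teacher_logits / temperature, dim=-1)
    s_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    kl = F.kl_div(s_log_probs, t_probs, reduction="batchmean") * temperature ** 2
    return alpha * ce + (1.0 - alpha) * kl

# Usage inside a training step (teacher frozen, student trainable):
# with torch.no_grad():
#     teacher_logits = teacher(input_ids).logits
# student_logits = student(input_ids).logits
# loss = distillation_loss(student_logits, teacher_logits, labels)
# loss.backward()
```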

Overall, the release of Llama 3.2 marks a significant step for Meta into multimodal AI. The models are available for testing on platforms like Hugging Face, and the lightweight variants in particular open the door to adoption in mobile applications. As developers explore the new models, there will be plenty of room to fine-tune them for specific use cases, from chatbots to other interactive applications, and further advances in Meta's multimodal lineup seem likely to follow.
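
For readers who want to try the multimodal side directly, a minimal query might look like the sketch below, which assumes a transformers release with Mllama support and gated access to the meta-llama/Llama-3.2-11B-Vision-Instruct checkpoint; the image URL and question are placeholders.

```python
# Sketch: querying the 11B vision-language model via Hugging Face transformers.
# Assumes transformers with Mllama support and access to the gated checkpoint.
import torch
import requests
from PIL import Image
from transformers import AutoProcessor, MllamaForConditionalGeneration

model_id = "meta-llama/Llama-3.2-11B-Vision-Instruct"
model = MllamaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

# Any RGB image works here; the URL is just an example.
image = Image.open(requests.get("https://example.com/chart.png", stream=True).raw)

# Chat-style prompt: one image placeholder followed by a text question.
messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "What does this chart show?"},
    ]}
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(image, prompt, add_special_tokens=False,
                   return_tensors="pt").to(model.device)

output = model.generate(**inputs, max_new_tokens=128)
print(processor.decode(output[0], skip_special_tokens=True))
```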