Testing Microsoft's New VLM - Phi-3 Vision

Microsoft has introduced the Phi-3 Vision model as part of their new Phi lineup, which is optimized for edge computing and multimodal tasks like interpreting graphs and visual question answering. The model showcases impressive capabilities in tasks such as receipt interpretation and visual question answering, leveraging synthetic data and a transformer-based language model for efficient processing of text and images.

Microsoft recently introduced new Phi models, specifically the Phi-3 Vision model, as part of their lineup. These models range in size from the Phi-3 mini to the Phi-3 medium, with the Phi-3 Vision being a 4.2 billion parameter model optimized to run at the edge with Onyx runtimes while incorporating multimodality features.

The Phi-3 Vision model is designed to excel in tasks such as interpreting graphs, understanding diagrams, and answering visual questions. It differs from Google’s PaliGemma by offering a more refined fine-tuning approach and a focus on practical applications rather than research experimentation. The model can process inputs in the form of text and images, showcasing its multimodal capabilities.

With a context length reaching up to 128,000 tokens and a training time of 1.5 days using 512 H100s, the Phi-3 Vision model is well-equipped for complex tasks. Microsoft employs synthetic data in training to enhance model performance, a strategy that seems to deliver promising results, particularly in models like GPT-4o and GPT 5 Gemini 2.

Technical specifications reveal that the Phi-3 Vision model utilizes an image encoder and a transformer-based language model. The model processes images and text interchangeably, allowing for versatile input combinations. Synthetic data derived from OCR of PDF files contributes to the model’s training process, showcasing its potential for tasks like optical character recognition.

Testing the Phi-3 Vision model on various tasks, such as visual question answering and receipt interpretation, demonstrates its impressive capabilities. The model accurately identifies objects in images, extracts information from receipts, and provides insightful responses. While the 4-bit version shows some degradation in performance, the model’s small memory footprint makes it a practical choice for resource-efficient applications.