The video reviews Ollama’s new multimodal runtime engine, highlighting its support for Qwen 2.5 VL for OCR tasks and Gemma 3 for detailed vision analysis, with demonstrations of their performance and limitations. It emphasizes the engine’s improved speed and efficiency, compares the models’ capabilities in image recognition, and discusses practical usage considerations, while noting ongoing developments and future content.
The video opens with an overview and hands-on testing of Ollama’s new runtime engine, highlighting its support for Qwen 2.5 VL, a model strong at OCR tasks but weaker at detailed scene understanding than Gemma 3. The presenter notes that Ollama is moving toward a fully multimodal platform supporting text and vision, though video input is not yet available. The new engine is claimed to be faster and more efficient, with improved processing capabilities, in part because it has been rewritten in Go, consolidating previously separate components for better performance.
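For readers who want to try the same workflow, here is a minimal sketch of pulling and prompting the vision models from the terminal. The model tags `qwen2.5vl` and `gemma3` and the image filename are assumptions based on the Ollama library naming at the time; adjust them to whatever `ollama list` and the library page show for your install.

```bash
# Pull the two vision models discussed in the video
# (tags assumed; verify against ollama.com/library)
ollama pull qwen2.5vl
ollama pull gemma3

# Multimodal prompt: the CLI picks up an image file path
# included in the prompt text and sends it to the model
ollama run qwen2.5vl "Transcribe all text in this receipt: ./receipt.jpg"
```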
The host demonstrates how to update Ollama to the latest version, emphasizing that the software must be current to access new features such as the vision support and Qwen 2.5 VL. They walk through the setup on a Proxmox node, verify the installed version, and briefly discuss hardware considerations such as the VRAM required to run Qwen 2.5 VL effectively. The testing covers a range of image-processing tasks, assessing Qwen 2.5 VL’s performance and accuracy at recognizing objects like hamburgers and spiders, and noting its limitations in fine-grained recognition and identification of specific objects.
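On a Linux host like the Proxmox node shown, the update and verification steps boil down to a few commands; this is a sketch of the standard route, re-running the official install script, which upgrades an existing install in place.

```bash
# Re-running the official install script updates an existing Linux install
curl -fsSL https://ollama.com/install.sh | sh

# Confirm the runtime is new enough for the multimodal engine
ollama -v

# After loading a model, see how much memory it actually occupies
# (useful for judging whether your GPU's VRAM is sufficient)
ollama ps
```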
The presenter then compares Gemma 3 against Qwen 2.5 VL, both running on Ollama, highlighting Gemma 3’s superior ability to recognize fine detail and accurately describe complex scenes. They demonstrate its impressive segmentation and detail recognition, including identifying the pattern on a plate and explaining an origami cat image, both of which Qwen 2.5 VL struggled with. The comparison underscores Gemma 3’s strength in detailed vision tasks, making it the better fit for refined image analysis.
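A simple way to reproduce this kind of A/B comparison is to send the same image and question to both models through Ollama’s REST API (default port 11434). The sketch below assumes a local image named `plate.jpg` and that `jq` is installed; the prompt is illustrative, not the one used in the video.

```bash
#!/usr/bin/env bash
# Ask both models the same question about one image via the REST API.
# Images are passed as base64 in the "images" array of /api/generate.
IMG_B64=$(base64 -w0 plate.jpg)

for MODEL in qwen2.5vl gemma3; do
  echo "--- $MODEL ---"
  curl -s http://localhost:11434/api/generate -d "{
    \"model\": \"$MODEL\",
    \"prompt\": \"Describe the pattern on the plate in this photo.\",
    \"images\": [\"$IMG_B64\"],
    \"stream\": false
  }" | jq -r .response
done
```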
The video also covers the practical aspects of using these models, such as token efficiency, response times, and the ability to cache an image for multiple queries without reloading it. The host emphasizes that despite some limitations, the vision models running on Ollama are highly capable, especially when detailed accuracy is required. They caution viewers about hallucinations and inaccuracies, particularly with links and other specific details, and advise against relying on such unverified output.
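The timing and token figures discussed here can be reproduced with the CLI’s `--verbose` flag, which prints prompt and generation token counts and eval rates after each response. The reuse-without-reloading behavior sketched below reflects how an interactive `ollama run` session keeps the conversation context, image included, loaded between questions; model tag and filenames are assumptions.

```bash
# --verbose prints token counts, load time, and eval rates after the reply,
# which is how the efficiency numbers in the video can be checked
ollama run gemma3 --verbose "What is folded in this image? ./origami.jpg"

# In an interactive session the image is processed once and follow-up
# questions reuse the existing context rather than reloading the file:
#   ollama run gemma3
#   >>> Describe this image: ./origami.jpg
#   >>> What color is the cat?
```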
In conclusion, the presenter reflects on the broader implications of these advancements, noting that Ollama aims to make multimodal AI accessible and easy to use. They acknowledge ongoing debates about model licensing and legality but keep the focus on technical progress and performance improvements. The video ends with a teaser for upcoming content, including tests of other LLM front ends such as LibreChat and Big AI, encouraging viewers to stay tuned and join the discussion through comments and support.