Pixtral-12B 👀: Mistral AI's SOTA Multi-Modal VLM is HERE!

Mistral AI has launched Pixtral, a groundbreaking 12-billion-parameter multimodal vision language model (VLM) that processes both text and images, showing impressive capabilities in tasks such as optical character recognition (OCR) and information extraction. The model's architecture tokenizes images as patches and can handle multiple images in a single prompt, positioning it as a significant advance in open-source AI as developers eagerly await the release of its inference code.

Mistral AI has made a significant impact on the AI community with the release of Pixtral, which is being hailed as a groundbreaking multimodal vision language model (VLM). The model represents a substantial advance in open-source AI, particularly for multimodal models that can process both text and images. The excitement surrounding Pixtral is reminiscent of the release of Mistral's first LLM, Mistral 7B, and it sets a promising tone for open-source AI as we approach 2025.

Pixtral is a 12-billion-parameter model that pairs a language backbone with a 400-million-parameter vision encoder. This architecture handles images at resolutions up to 1024 × 1024 pixels and employs a tokenization scheme that treats images much like text: each image is split into 16 × 16-pixel patches, which lets the model support arbitrary image sizes and gives it a larger vocabulary than previous Mistral models. Although the inference code is not yet available, the model weights have been released, generating anticipation for what comes next.
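To make the patch-based tokenization concrete, here is a rough back-of-the-envelope sketch of the token cost an image might incur under this scheme. The 16 × 16 patch size comes from the description above; the per-row `[IMG_BREAK]` and trailing `[IMG_END]` separator tokens follow Mistral's published description of the Pixtral tokenizer and should be read as an assumption of this sketch, not verified model output.

```python
# Back-of-the-envelope token count for a Pixtral-style image tokenizer.
# Assumes 16x16-pixel patches (per the model description above); the
# per-row [IMG_BREAK] and final [IMG_END] separators are an assumption
# based on Mistral's published tokenizer notes.
import math

PATCH_SIZE = 16  # pixels per patch edge

def image_token_count(width: int, height: int) -> int:
    """Estimate how many tokens an image of the given size consumes."""
    cols = math.ceil(width / PATCH_SIZE)   # patches per row
    rows = math.ceil(height / PATCH_SIZE)  # patch rows
    # One token per patch, one [IMG_BREAK] per row, one [IMG_END] overall.
    return rows * cols + rows + 1

# A full 1024 x 1024 image: 64 * 64 = 4096 patch tokens,
# plus 64 row-break tokens and one end token -> 4161 tokens.
print(image_token_count(1024, 1024))  # 4161
print(image_token_count(512, 512))    # 1057
```

The row-break tokens are what let a one-dimensional token stream preserve the image's two-dimensional layout, which is part of how the model can accept arbitrary image sizes and aspect ratios rather than a fixed square input.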

The announcement of Pixtral came alongside an AI summit appearance in which Nvidia's Jensen Huang and Mistral CEO Arthur Mensch discussed the model's capabilities and the company's vision for open-source AI. Mistral AI recognizes that users interact differently with local and cloud-based models, which may shape its future licensing strategy. The discussion also highlighted the importance of orchestration, multimodality, and strong reasoning in AI models, positioning Pixtral as a competitive player in the evolving landscape of AI technology.

In terms of performance, Pixtral posts impressive results across benchmarks, excelling in particular at optical character recognition (OCR) and information extraction. Its ability to understand complex images and reason about them sets it apart from competing models, including offerings from OpenAI and Meta. The architecture also processes multiple images in a single prompt, something many existing multimodal models struggle with. That capability, combined with a 128,000-token context window, positions Pixtral as a formidable tool for developers and researchers alike.
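Since the official inference code was still pending at the time of writing, the sketch below shows how a multi-image request typically looks against an OpenAI-compatible chat endpoint. The endpoint URL and the `pixtral-12b` model identifier are placeholders, not confirmed details of Mistral's own stack; adjust them to whatever hosted API or local server you actually run.

```python
# Minimal sketch of a multi-image chat request in the common
# OpenAI-compatible format. Endpoint and model name are placeholders.
import base64
import requests

def encode_image(path: str) -> str:
    """Read a local image and return it as a base64 data URL."""
    with open(path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    return f"data:image/jpeg;base64,{b64}"

payload = {
    "model": "pixtral-12b",  # placeholder model identifier
    "messages": [{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "Compare these two charts and extract the key figures."},
            {"type": "image_url",
             "image_url": {"url": encode_image("chart_a.jpg")}},
            {"type": "image_url",
             "image_url": {"url": encode_image("chart_b.jpg")}},
        ],
    }],
}

resp = requests.post(
    "http://localhost:8000/v1/chat/completions",  # placeholder endpoint
    json=payload,
    timeout=120,
)
print(resp.json()["choices"][0]["message"]["content"])
```

Whatever the final endpoint looks like, interleaved multi-image prompts in this style are exactly the workload the large context window is meant to accommodate, since each image consumes thousands of tokens on its own.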

As the AI community awaits the inference code and further details, there is palpable excitement about Pixtral's potential applications. Its ability to handle diverse tasks, from interpreting diagrams to answering complex questions about images, showcases its versatility. Mistral AI's commitment to open-source development encourages innovation and experimentation, making this an exciting time for anyone interested in multimodal AI. Pixtral's impact on the field will be watched closely as developers explore its capabilities and push the boundaries of what is possible with open-source AI models.