The Gemma team has developed Embedding Gemma, a compact and efficient model for generating text embeddings on edge devices such as mobile phones and Raspberry Pis, enabling fully offline semantic search and retrieval-augmented generation. Balancing performance against size, Embedding Gemma supports a range of applications through flexible embedding dimensions and integrates smoothly with tools like LangChain, making it well suited to resource-constrained environments.
The Gemma team has been actively developing its models over the past six months, branching into two key directions. One focus has been on-device versions of the models, exemplified by the release of Gemma 3n at Google I/O, a model capable of running directly on mobile phones, Raspberry Pis, and even in the browser. The other direction targets research applications, with models like T5Gemma, an encoder-decoder variant in the style of the well-known T5 architecture, and the ultra-compact Gemma 3 270M, trained on an enormous 6 trillion tokens, demonstrating what small models can achieve with extensive training.
The latest development from the Gemma team is Embedding Gemma, a tiny model designed specifically for generating text embeddings on edge devices. Built on a T5Gemma initialization, the model is available in several variants and formats, including quantized and ONNX versions, and supports input texts of up to 2,048 tokens. It produces Matryoshka embeddings whose dimensions can be truncated from 768 down to 128, enabling flexible use cases such as micro retrieval-augmented generation (RAG) systems on phones or Raspberry Pis without requiring internet connectivity.
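As a rough sketch of how those flexible dimensions can be used with the sentence-transformers library (the model id google/embeddinggemma-300m and the truncate_dim argument are assumptions for illustration, not details taken from the video):

```python
from sentence_transformers import SentenceTransformer

# Load the full 768-dimensional model, then a truncated 256-dimensional view.
# Model id and the truncate_dim argument are assumptions for illustration.
full_model = SentenceTransformer("google/embeddinggemma-300m")
small_model = SentenceTransformer("google/embeddinggemma-300m", truncate_dim=256)

text = "Edge devices can run semantic search fully offline."
print(full_model.encode(text).shape)   # (768,)
print(small_model.encode(text).shape)  # (256,)
```

Because the embeddings are trained Matryoshka-style, the truncated vectors remain usable for search while cutting storage and compute on small devices.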
Embedding Gemma performs impressively well on multiple benchmarks, including retrieval, classification, and clustering tasks, especially considering its small size of around 300 million parameters. While it does not outperform larger models like the Qwen embeddings, which are nearly twice its size, it offers a compelling balance of performance and efficiency for on-device applications. The video demonstrates practical usage of these embeddings in Python with the sentence-transformers library, showing how queries and documents can be encoded into vectors and compared for semantic similarity.
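A minimal sketch of that workflow, assuming the model is loaded by its Hugging Face id (the id and the sample texts are illustrative, not taken from the video):

```python
from sentence_transformers import SentenceTransformer

# Assumed model id for illustration.
model = SentenceTransformer("google/embeddinggemma-300m")

query = "How do I run a RAG pipeline offline on a Raspberry Pi?"
documents = [
    "Embedding Gemma produces compact text embeddings for on-device search.",
    "Gemma 3n is a small language model that runs on phones and browsers.",
    "Bananas are rich in potassium and make a convenient snack.",
]

# Encode the query and documents into dense vectors.
query_emb = model.encode(query)
doc_embs = model.encode(documents)

# Cosine-similarity scores between the query and each document;
# higher scores indicate closer semantic similarity.
scores = model.similarity(query_emb, doc_embs)
print(scores)
```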
Additionally, the presenter showcases building a simple RAG system that combines Gemma embeddings with the Gemma 3n language model, using tools like LangGraph and LangChain. This setup retrieves relevant information from a local vector store and generates responses without relying on cloud services. Although these language models are not state-of-the-art, they are highly efficient and suited to edge deployment, consuming minimal GPU VRAM and running smoothly on CPUs, which makes them ideal for mobile and embedded environments.
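A minimal sketch of such a local RAG loop, assuming the embeddings are exposed through langchain-huggingface and that a Gemma 3n chat model is served locally via Ollama (the package names, model ids, and Ollama setup are assumptions, not details confirmed in the video):

```python
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_core.vectorstores import InMemoryVectorStore
from langchain_ollama import ChatOllama

# Embed a handful of local documents into an in-memory vector store.
# Model ids and the Ollama-served "gemma3n" model are assumptions for illustration.
embeddings = HuggingFaceEmbeddings(model_name="google/embeddinggemma-300m")
store = InMemoryVectorStore.from_texts(
    [
        "Embedding Gemma generates text embeddings for on-device retrieval.",
        "Gemma 3n is a compact language model suited to phones and Raspberry Pis.",
    ],
    embedding=embeddings,
)

# Retrieve the most relevant snippet and let the local model answer from it.
question = "Which model produces the embeddings for on-device retrieval?"
context = store.similarity_search(question, k=1)[0].page_content

llm = ChatOllama(model="gemma3n")
answer = llm.invoke(
    f"Answer using only this context:\n{context}\n\nQuestion: {question}"
)
print(answer.content)
```

Everything in this loop, embedding, retrieval, and generation, runs on the local machine, which is the point of pairing Embedding Gemma with a small on-device language model.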
Overall, the Gemma series is evolving along two complementary paths: one focused on lightweight, on-device models for practical applications, and the other on research-oriented models for experimentation. Embedding Gemma exemplifies this trend by providing a compact, versatile embedding solution that can empower developers to build offline semantic search, classification, and RAG systems on resource-constrained devices. This direction promises exciting possibilities for mobile AI, and the presenter encourages viewers interested in edge AI to explore these models further.