Qwen 3 Embeddings & Rerankers

The video highlights recent advancements in open-source text embeddings and rerankers, focusing on Qwen's new models, which offer flexible, multilingual, and instruction-tuned capabilities for local deployment in retrieval systems. It emphasizes the value of these models for customizable, on-premises document retrieval and reranking, while noting the future potential of multimodal embeddings.

The video discusses recent developments in text embeddings, noting that while large language models (LLMs) have garnered much attention, text embeddings have seen comparatively little innovation lately. In the past few weeks, however, activity has picked up, with releases such as Mistral's text embeddings and announcements of upcoming models from Google. The presenter emphasizes the importance of local, open-source embeddings over proprietary solutions like OpenAI's, especially for document indexing and retrieval use cases where keeping data on-premises is crucial.

A key focus is Qwen's recent release of a series of open-source embedding and reranking models on Hugging Face. These models range from 0.6 billion to 8 billion parameters, with versions optimized for different sizes and use cases. Running these models locally allows for greater flexibility, control, and integration into custom retrieval-augmented generation (RAG) systems. The models are also available in formats like GGUF, making them easy to incorporate into tools such as Ollama or LM Studio.
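As a minimal sketch of local use, the following assumes the sentence-transformers usage pattern shown on the Qwen3-Embedding model card; the model name and the "query" prompt are taken from that card, so verify them against the checkpoint you actually download:

```python
# Minimal sketch: embedding queries and documents locally with
# sentence-transformers. Model name and "query" prompt follow the
# Qwen3-Embedding model card; adjust to the checkpoint you use.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Qwen/Qwen3-Embedding-0.6B")

queries = ["What is the capital of China?"]
documents = [
    "The capital of China is Beijing.",
    "Gravity is the force that attracts two bodies toward each other.",
]

# Queries use an instruction-style prompt; documents are encoded as-is.
query_embeddings = model.encode(queries, prompt_name="query")
document_embeddings = model.encode(documents)

# Similarity matrix: rows are queries, columns are documents.
print(model.similarity(query_embeddings, document_embeddings))
```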

The presenter highlights that Qwen has fine-tuned its models not only for embeddings but also for reranking, with some models trained to handle multilingual data and instruction-based inputs. The embedding models support Matryoshka representation learning, enabling smaller, more efficient embeddings at various vector sizes (e.g., 64, 128, 256), as sketched below. They also support very long sequence lengths, up to 32,000 tokens, which is advantageous for certain applications, although shorter sequences are generally preferred for retrieval tasks.
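Matryoshka-trained models pack the most important information into the leading dimensions, so a full-size vector can simply be sliced and re-normalized. A toy illustration in plain NumPy, not tied to any particular model:

```python
# Toy illustration of Matryoshka-style truncation: keep the first
# `dim` components of an MRL-trained embedding and L2-normalize so
# cosine similarity still behaves as expected.
import numpy as np

def truncate_embedding(vec: np.ndarray, dim: int) -> np.ndarray:
    short = vec[:dim]
    return short / np.linalg.norm(short)

full = np.random.randn(1024)           # stand-in for a full embedding
small = truncate_embedding(full, 256)  # cheaper to store and compare
print(small.shape)  # (256,)
```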

An important feature of these models is their instruction tuning, which lets users steer how embeddings and reranking are performed via natural-language instructions. This flexibility enables tailored use cases, such as differentiating between general web search and e-commerce contexts. The reranking models score query–document pairs for relevance, which is useful for improving retrieval accuracy in RAG systems. The presenter demonstrates how these models can be integrated into code, using libraries like Hugging Face Transformers and vLLM, to perform tasks such as document scoring and relevance ranking.
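A hedged sketch of reranker scoring with Transformers: the Qwen3 rerankers are causal LMs prompted to answer "yes" or "no", with the relevance score taken as the probability of "yes" at the final position. The model name and prompt template below follow the Hugging Face model card and should be treated as assumptions that may differ across releases:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "Qwen/Qwen3-Reranker-0.6B"  # name from the model card
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID).eval()

# The reranker judges relevance by emitting "yes" or "no".
YES_ID = tokenizer.convert_tokens_to_ids("yes")
NO_ID = tokenizer.convert_tokens_to_ids("no")

def relevance(query: str, document: str, instruction: str) -> float:
    # Chat-style prompt format taken from the model card; treat it
    # as an assumption and check it against the checkpoint you use.
    prompt = (
        "<|im_start|>system\nJudge whether the Document meets the "
        "requirements based on the Query and the Instruct provided. "
        'Note that the answer can only be "yes" or "no".<|im_end|>\n'
        f"<|im_start|>user\n<Instruct>: {instruction}\n"
        f"<Query>: {query}\n<Document>: {document}<|im_end|>\n"
        "<|im_start|>assistant\n<think>\n\n</think>\n\n"
    )
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0, -1]  # next-token logits
    # Softmax over just the "no"/"yes" logits; return P("yes").
    pair = torch.stack([logits[NO_ID], logits[YES_ID]])
    return torch.softmax(pair, dim=0)[1].item()

print(relevance(
    "what is the capital of France?",
    "Paris is the capital and largest city of France.",
    "Given a web search query, retrieve relevant passages",
))
```

In a typical RAG pipeline, a fast embedding search retrieves a candidate set and a reranker like this rescores only those candidates, since cross-scoring every document would be far too slow.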

Finally, the presenter notes some limitations and future directions, including the absence of multimodal embeddings in the current models. He expresses optimism that future work will expand into multimodal representation systems. Overall, the release of these open-source, flexible, and high-performance models marks a significant step forward for local, customizable embeddings and rerankers, empowering developers to build more efficient and controlled retrieval systems for various AI applications.