Shashi Jagtap from Super Agentic AI presents TurboQuant, a novel compression algorithm that reduces memory usage in agent retrieval systems by compressing embeddings to 3-4 bits without sacrificing search quality, addressing the inefficiencies of traditional 32-bit storage. Integrated into popular inference engines and accessible via the open-source Turbo Agent library, TurboQuant enables developers to significantly cut memory costs while maintaining retrieval accuracy and speed, demonstrated through live demos and benchmarks.
In this talk, Shashi Jagtap, founder of Super Agentic AI, introduces TurboQuant, a compression algorithm designed to significantly reduce the memory cost of agent retrieval systems without compromising search quality. He begins with a problem agents run into as their context grows: the KV cache, which holds the attention keys and values for the conversation so far, expands and can exceed the size of the model itself, especially on devices with limited RAM such as Macs. The result is degraded performance, and much of the memory is wasted: embeddings are stored in full 32-bit precision, while search typically needs only 3 to 4 bits per value.
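To make the 32-bit versus 3 to 4 bit gap concrete, the arithmetic can be sketched as follows. The workload (one million vectors of 768 dimensions) is an assumption chosen for illustration, not a figure from the talk, and `index_bytes` is a hypothetical helper that counts only the raw values.

```python
def index_bytes(num_vectors: int, dim: int, bits_per_value: float) -> int:
    """Approximate bytes needed to store num_vectors vectors of dim values each."""
    return int(num_vectors * dim * bits_per_value / 8)


# Assumed workload for illustration only: 1M vectors, 768 dimensions each.
n, d = 1_000_000, 768
full = index_bytes(n, d, 32)

for bits in (32, 4, 3):
    size = index_bytes(n, d, bits)
    print(f"{bits:>2}-bit: {size / 1e9:.2f} GB ({full / size:.1f}x smaller than 32-bit)")
```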
Shashi reviews existing solutions such as quantization, context compaction, smaller embeddings, and offloading to CPU or disk, noting that each comes with trade-offs in quality, speed, or hardware requirements. TurboQuant, developed by Google Research and presented at ICLR 2026, takes a different approach: it compresses embeddings in the KV cache down to 3-4 bits using two main techniques, PolarQuant, which compresses the vectors, and QJL, which corrects the residual error with just one bit. Search quality is maintained because search depends only on which vectors are closest to the query, that is, on their relative ranking, not on the exact value of each coordinate.
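The idea that ranking can survive heavy compression can be explored with a much simpler stand-in. The sketch below is not TurboQuant: it uses plain uniform 4-bit scalar quantization with a per-vector offset and scale, and it omits both PolarQuant and the QJL correction. All function names are hypothetical, and the data is random, so the overlap it prints is only indicative.

```python
import math
import random


def quantize(vec, bits=4):
    """Uniform scalar quantization: map each value to an integer code in [0, 2**bits - 1]."""
    levels = (1 << bits) - 1
    lo, hi = min(vec), max(vec)
    scale = (hi - lo) / levels or 1.0  # avoid a zero step for constant vectors
    codes = [round((x - lo) / scale) for x in vec]
    return codes, lo, scale


def dequantize(codes, lo, scale):
    """Reconstruct approximate values from integer codes."""
    return [lo + c * scale for c in codes]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def top_k(query, vectors, k):
    """Indices of the k vectors with the highest dot product against the query."""
    scores = [dot(query, v) for v in vectors]
    return sorted(range(len(vectors)), key=lambda i: -scores[i])[:k]


def random_unit_vector(rng, dim):
    v = [rng.gauss(0, 1) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


rng = random.Random(0)
dim, n, k = 64, 500, 10
docs = [random_unit_vector(rng, dim) for _ in range(n)]
query = random_unit_vector(rng, dim)

# Search once over the original vectors and once over their 4-bit reconstructions.
approx_docs = [dequantize(*quantize(v)) for v in docs]
exact = top_k(query, docs, k)
approx = top_k(query, approx_docs, k)
overlap = len(set(exact) & set(approx))
print(f"top-{k} overlap between float and 4-bit vectors: {overlap}/{k}")
```

Each reconstructed value is within half a quantization step of the original, which is why the scores, and usually the order of the best matches, shift only slightly.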
The talk highlights how TurboQuant is already being integrated into popular inference engines like llama.cpp, MLX, Ollama, and LM Studio, making it accessible without manual tuning. Shashi also introduces Turbo Agent, an open-source library from Super Agentic AI that allows developers to easily swap out their existing retrieval layers with TurboQuant-enabled ones, preserving their current agent frameworks and vector databases while benefiting from reduced memory usage and maintained accuracy.
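The "swap only the retrieval layer" idea can be sketched with a shared interface. Everything here is hypothetical and is not Turbo Agent's actual API: `Retriever`, `FloatRetriever`, `CoarseRetriever`, and `Agent` are illustrative names, and rounding to one decimal place merely stands in for real compression.

```python
from typing import Protocol, Sequence

Vector = Sequence[float]


class Retriever(Protocol):
    """The only contract the agent depends on."""

    def search(self, query: Vector, k: int) -> list[str]: ...


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


class FloatRetriever:
    """Stand-in for an existing full-precision retriever."""

    def __init__(self, docs: dict[str, Vector]):
        self.docs = {name: list(v) for name, v in docs.items()}

    def search(self, query: Vector, k: int) -> list[str]:
        return sorted(self.docs, key=lambda name: -dot(query, self.docs[name]))[:k]


class CoarseRetriever(FloatRetriever):
    """Same interface, lossy storage (values rounded to one decimal place)."""

    def __init__(self, docs: dict[str, Vector]):
        super().__init__({name: [round(x, 1) for x in v] for name, v in docs.items()})


class Agent:
    """Toy agent: it never inspects how the retriever stores its vectors."""

    def __init__(self, retriever: Retriever):
        self.retriever = retriever

    def answer(self, query: Vector) -> str:
        return self.retriever.search(query, k=1)[0]


docs = {
    "billing": [0.91, 0.12, 0.05],
    "shipping": [0.08, 0.88, 0.21],
    "returns": [0.15, 0.30, 0.83],
}
query = [0.1, 0.9, 0.2]

print(Agent(FloatRetriever(docs)).answer(query))   # shipping
print(Agent(CoarseRetriever(docs)).answer(query))  # shipping
```

Because the agent is written against the interface rather than a concrete class, replacing one retriever with the other requires no change to the agent or the documents.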
A live demo compares a baseline 32-bit float index with a TurboQuant-compressed index. TurboQuant cuts memory usage by a factor of about five while returning the same correct answers, and no changes to the agent or the documents are needed: only the retriever is swapped. Additional demos and benchmarks further confirm that TurboQuant preserves retrieval quality and speed while drastically cutting memory consumption.
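How a compressed index's measured size comes about can be illustrated with a small packing sketch. It does not reproduce the demo or TurboQuant's storage format: the layout (two 4-bit codes per byte plus an assumed 8 bytes of per-vector metadata) and the helper names are assumptions made for illustration.

```python
import struct


def pack_4bit(codes):
    """Pack integer codes in [0, 15] two per byte, padding odd lengths with a zero code."""
    codes = list(codes)
    if len(codes) % 2:
        codes.append(0)
    return bytes((codes[i] << 4) | codes[i + 1] for i in range(0, len(codes), 2))


def unpack_4bit(packed, count):
    """Inverse of pack_4bit; count is the original number of codes."""
    out = []
    for byte in packed:
        out.append(byte >> 4)
        out.append(byte & 0x0F)
    return out[:count]


dim = 768
codes = [i % 16 for i in range(dim)]  # placeholder codes, not real quantized data

float_bytes = len(struct.pack(f"<{dim}f", *([0.0] * dim)))  # 32-bit floats
packed = pack_4bit(codes)
metadata_bytes = 8  # assumed: two 32-bit floats (offset and scale) per vector
compressed_bytes = len(packed) + metadata_bytes

print(f"float32 vector: {float_bytes} bytes")
print(f"4-bit vector:   {compressed_bytes} bytes")
print(f"ratio:          {float_bytes / compressed_bytes:.1f}x")
```

The ratio of any real index depends on details like this per-vector metadata and on the bit width chosen, so it is worth measuring rather than assuming the ideal figure.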
In conclusion, Shashi encourages developers to rethink their approach to vector compression by focusing on ranking quality rather than exact vector representation. He suggests starting small by benchmarking TurboQuant on their own data, and he points to alternative compression methods such as RaBitQ and FP8, noting their specific use cases and limitations. Turbo Agent and the related demos are available on GitHub, and Shashi invites the audience to reach out via social media for further discussion.
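One way to benchmark ranking quality on your own data is recall@k: for each query, the fraction of the exact top-k results that the compressed index also returns. The helper below is a minimal sketch with a hypothetical name and hand-written result lists; in practice the two lists would come from searching the full-precision and compressed indexes with the same queries.

```python
def recall_at_k(exact: list[list[int]], approx: list[list[int]], k: int) -> float:
    """Mean fraction of each exact top-k list that also appears in the approximate top-k."""
    hits = sum(len(set(e[:k]) & set(a[:k])) for e, a in zip(exact, approx))
    return hits / (k * len(exact))


# Hand-written document ids for two queries (illustrative only).
exact_results = [[3, 7, 1], [4, 2, 9]]
approx_results = [[7, 3, 5], [4, 2, 9]]

print(f"recall@3 = {recall_at_k(exact_results, approx_results, k=3):.2f}")  # 0.83
```

Order within the top-k is ignored here; a stricter benchmark could also compare the positions of the returned documents.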