CAG vs Long Context: How AI Models Use and Remember Information

The video compares long context and Cache Augmented Generation (CAG) as two ways to give large language models access to external knowledge. Long context places all relevant documents in the model’s input, which is simple but can be slow and costly because the documents are reprocessed on every query. CAG precomputes and caches the model’s representation of the documents, so repeated queries over a stable knowledge base run faster. The video also notes that prompt caching, as offered by LLM providers, effectively implements CAG, so developers can use it without managing cache infrastructure. Which method fits depends on how often queries arrive and how stable the knowledge base is.

The video explains two methods for providing large language models (LLMs) with access to external knowledge beyond their training data: long context and Cache Augmented Generation (CAG). Since LLMs only know what was in their training data, they need ways to access private or up-to-date information, such as company documents or financial reports, at inference time. Retrieval Augmented Generation (RAG) does this with a retrieval pipeline that fetches relevant documents from a vector database. Long context and CAG are alternatives that skip retrieval, with CAG building on long context.

Long context is the simpler approach, where all relevant documents are directly included in the model’s context window along with the query. Context windows have grown significantly over time, from GPT-3’s roughly 2,000 tokens to GPT-4 Turbo’s 128,000 tokens, and now Google Gemini 1.5 Pro’s two million tokens. This allows models to process large amounts of text in a single pass, eliminating the need for a retrieval step. However, this method can be costly and slow because every query requires processing the entire context again, and models tend to perform worse on information buried in the middle of long contexts.
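To make the cost argument above concrete, here is a minimal Python sketch. The function names and the document sizes are hypothetical, chosen only for illustration; the window sizes are the ones quoted in the paragraph. It checks whether a set of documents fits in a context window and counts how many input tokens are processed when every query resends the full context.

```python
def fits_in_window(doc_token_counts, query_tokens, window_tokens):
    """True if all documents plus the query fit in one context window."""
    return sum(doc_token_counts) + query_tokens <= window_tokens


def tokens_processed(doc_token_counts, query_tokens, num_queries):
    """Total input tokens when every query resends the full context."""
    return num_queries * (sum(doc_token_counts) + query_tokens)


# Hypothetical document sizes in tokens (an assumption, not from the video).
docs = [60_000, 45_000, 30_000]

print(fits_in_window(docs, 50, 128_000))    # False: 135,050 tokens exceed 128,000
print(fits_in_window(docs, 50, 2_000_000))  # True
print(tokens_processed(docs, 50, 100))      # 13505000
```

The last figure shows why repeated queries get expensive: 100 short questions cause the same 135,000 document tokens to be processed 100 times.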

Cache Augmented Generation (CAG) addresses these issues by leveraging the model’s internal key-value (KV) cache, which stores the attention keys and values the model computed for each token of the input text. CAG works in three phases: first, preparing and formatting the knowledge base; second, precomputing the KV cache by processing all documents once; and third, during inference, loading the cached representation and appending the new query to generate answers quickly. According to the video, this approach can yield speedups of up to 40 times, especially for repeated queries over stable knowledge bases.
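The three phases can be sketched with a toy model. The code below is not a real LLM: it assumes a single attention head, one layer, random weights, and no positional encoding, and every name in it (`build_kv_cache`, `answer_with_cache`, and so on) is hypothetical. It only illustrates the core idea that keys and values for the documents can be computed once and reused, and that the result matches recomputing everything from scratch.

```python
import math
import random

D = 4        # toy embedding width (assumption)
VOCAB = 50   # toy vocabulary size (assumption)
random.seed(0)


def rand_matrix(rows, cols):
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


EMBED = rand_matrix(VOCAB, D)
W_Q, W_K, W_V = rand_matrix(D, D), rand_matrix(D, D), rand_matrix(D, D)


def matvec(w, x):
    return [sum(a * b for a, b in zip(row, x)) for row in w]


def project(token_ids):
    """Return (queries, keys, values) for a sequence of token ids."""
    xs = [EMBED[t] for t in token_ids]
    return ([matvec(W_Q, x) for x in xs],
            [matvec(W_K, x) for x in xs],
            [matvec(W_V, x) for x in xs])


def attend(q, keys, values):
    """Softmax attention of one query over the given keys and values."""
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(D) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [sum(w * v[i] for w, v in zip(weights, values)) / total
            for i in range(D)]


def build_kv_cache(doc_tokens):
    """Phase 2: process the documents once and keep their keys and values."""
    _, keys, values = project(doc_tokens)
    return keys, values


def answer_with_cache(cache, query_tokens):
    """Phase 3: load the cache, append the query, attend causally."""
    keys, values = list(cache[0]), list(cache[1])  # copy, cache stays reusable
    outputs = []
    for q, k, v in zip(*project(query_tokens)):
        keys.append(k)
        values.append(v)
        outputs.append(attend(q, keys, values))
    return outputs


def answer_without_cache(doc_tokens, query_tokens):
    """Long-context baseline: recompute everything for every query."""
    qs, ks, vs = project(doc_tokens + query_tokens)
    outs = [attend(qs[i], ks[:i + 1], vs[:i + 1]) for i in range(len(qs))]
    return outs[len(doc_tokens):]


doc = [3, 14, 15, 9, 26]          # Phase 1: the prepared "knowledge base"
cache = build_kv_cache(doc)
cached = answer_with_cache(cache, [5, 35])
full = answer_without_cache(doc, [5, 35])
print(all(math.isclose(a, b) for x, y in zip(cached, full) for a, b in zip(x, y)))
```

In this toy, the cached path projects only the two query tokens, while the baseline projects all seven tokens again. In a real model the same saving applies at every layer, which is where the latency gain comes from.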

Despite its advantages, CAG has limitations. The entire knowledge base still needs to fit within the context window, and if the source documents change frequently, the KV cache must be recomputed each time, which erodes the latency benefit. Therefore, CAG is best suited for scenarios where the knowledge base is relatively static and many queries run against the same documents, such as an HR chatbot answering company policy questions. In contrast, long context is more appropriate for one-off queries or analyzing single documents where caching is not worthwhile.
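The trade-off above can be illustrated with a rough token count. This sketch is a simplification under stated assumptions: it counts only document and query tokens that must be freshly processed, ignores the cost of attending over cached tokens and of storing the cache, and uses hypothetical function names and sizes.

```python
def long_context_cost(kb_tokens, query_tokens, num_queries):
    """Every query reprocesses the whole knowledge base."""
    return num_queries * (kb_tokens + query_tokens)


def cag_cost(kb_tokens, query_tokens, num_queries, kb_updates=0):
    """One initial cache build, plus one rebuild per knowledge base update."""
    return (1 + kb_updates) * kb_tokens + num_queries * query_tokens


KB, Q = 100_000, 50  # hypothetical sizes in tokens

# Stable knowledge base, many queries: caching pays off.
print(long_context_cost(KB, Q, 200))         # 20010000
print(cag_cost(KB, Q, 200))                  # 110000

# Knowledge base changes before every query: the benefit disappears.
print(cag_cost(KB, Q, 200, kb_updates=199))  # 20010000

# A single one-off query: both approaches cost the same.
print(long_context_cost(KB, Q, 1), cag_cost(KB, Q, 1))  # 100050 100050
```

Under these assumptions, the saving grows with the number of queries per cache build, which matches the guidance that CAG suits static knowledge bases with frequent queries.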

Finally, the video highlights prompt caching, a feature now offered by major LLM providers that effectively implements CAG as a service. Developers can send a long system prompt containing all documents once, and subsequent requests with the same prompt prefix skip reprocessing the documents, resulting in significant cost and latency savings. This makes CAG practical and accessible without managing cache infrastructure. The video concludes by noting that while this discussion focused on long context and CAG, the role of RAG remains a topic for further exploration.
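The prefix-matching behaviour described above can be mimicked with a small toy. This is not any provider's API: `ToyPromptCache` is a hypothetical class, whitespace-separated words stand in for tokens, and a hash of the system prompt stands in for the provider's cache lookup. It only shows that an identical prefix skips document processing, while any change to the prefix triggers a full reprocess.

```python
import hashlib


class ToyPromptCache:
    """Toy stand-in for provider-side prompt caching (hypothetical)."""

    def __init__(self):
        self._seen_prefixes = set()
        self.tokens_processed = 0

    def request(self, system_prompt, user_query):
        """Process one request; return True if the prefix was a cache hit."""
        key = hashlib.sha256(system_prompt.encode("utf-8")).hexdigest()
        hit = key in self._seen_prefixes
        if not hit:
            # Cache miss: the documents in the prefix are processed in full.
            self.tokens_processed += len(system_prompt.split())
            self._seen_prefixes.add(key)
        # The query itself is always processed.
        self.tokens_processed += len(user_query.split())
        return hit


docs = "policy " * 1000  # hypothetical 1,000-word knowledge base
question = "How many vacation days do I get?"

cache = ToyPromptCache()
print(cache.request(docs, question), cache.tokens_processed)  # False 1007
print(cache.request(docs, question), cache.tokens_processed)  # True 1014
print(cache.request(docs + "updated", question), cache.tokens_processed)  # False 2022
```

The third call shows the limitation noted earlier for CAG: once the documents change, the prefix no longer matches and the full cost is paid again.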