RAG vs. CAG: Solving Knowledge Gaps in AI Models

The video explores the limitations of large language models (LLMs) in recalling information and introduces two techniques to address these gaps: Retrieval-Augmented Generation (RAG), which retrieves relevant documents from an external knowledge base, and Cache-Augmented Generation (CAG), which preloads all knowledge into the model’s context for faster access. It compares the two methods in terms of accuracy, latency, scalability, and data freshness, suggesting that RAG is better for dynamic datasets while CAG is suited for static knowledge bases.

Expanding on these limitations, the video notes that LLMs struggle to recall information absent from their training data, such as recent events or proprietary internal documents. RAG addresses this by querying an external knowledge base to retrieve relevant documents, which are then supplied as context for generating answers. This lets the model draw on up-to-date information without being limited to what it saw during training.

RAG operates in two phases: an offline phase where knowledge is ingested and indexed, and an online phase where retrieval and generation occur. During the offline phase, documents are broken into chunks, and vector embeddings of those chunks are computed and stored in a vector database. When a user submits a query, the RAG retriever converts the question into a vector and performs a similarity search to find the most relevant chunks. Those chunks are combined with the user’s query and sent to the LLM for answer generation, which keeps the pipeline modular: the embedding model, vector database, and LLM can each be swapped independently.
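The following is a minimal sketch of that two-phase pipeline, not the specific stack used in the video. It assumes sentence-transformers for embeddings, a plain in-memory NumPy array in place of a real vector database, and a placeholder `llm()` call standing in for whatever model generates the final answer; the document strings are made-up illustrations.

```python
# Minimal RAG sketch: offline indexing plus online retrieval (assumptions noted above).
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # any embedding model could be used

# --- Offline phase: chunk documents and index their embeddings ---
documents = [
    "The X100 battery lasts roughly 12 hours under normal use.",
    "To reset the X100, hold the power button for 10 seconds.",
    "The X100 supports USB-C fast charging up to 45 W.",
]

def chunk(text: str, size: int = 200) -> list[str]:
    """Naive fixed-size chunking; real pipelines split on structure or sentences."""
    return [text[i:i + size] for i in range(0, len(text), size)]

chunks = [c for doc in documents for c in chunk(doc)]
index = embedder.encode(chunks, normalize_embeddings=True)  # shape: (n_chunks, dim)

# --- Online phase: embed the query, run a similarity search, build the prompt ---
def retrieve(query: str, k: int = 2) -> list[str]:
    q = embedder.encode([query], normalize_embeddings=True)[0]
    scores = index @ q                       # cosine similarity (vectors are normalized)
    top = np.argsort(scores)[::-1][:k]
    return [chunks[i] for i in top]

query = "How do I reset the device?"
context = "\n".join(retrieve(query))
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
# answer = llm(prompt)  # hypothetical call: send the augmented prompt to your LLM
print(prompt)
```

Every piece here (the embedding model, the index, the generator) is a stand-in, which illustrates the modularity point above: each component can be replaced without touching the others.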

In contrast, CAG takes a different approach by preloading the entire knowledge base into the model’s context window. This method involves formatting all gathered knowledge into a single massive prompt that fits within the model’s context size. The LLM processes this information in one forward pass, creating a key-value cache (KV cache) that retains the encoded knowledge. When a query is submitted, the model can access this cached information without needing to reprocess the entire dataset, allowing for faster response times.
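As one possible way to realize this (the video does not prescribe a library), the sketch below uses Hugging Face transformers: the knowledge prompt is encoded in a single forward pass, its KV cache is kept, and each query feeds only new tokens on top of that cache. The model name, the knowledge string, and the simple greedy decoding loop are all illustrative assumptions.

```python
# Minimal CAG sketch: precompute the KV cache for the knowledge base, reuse it per query.
import copy
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; use any causal LM whose context window fits your knowledge
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

# One forward pass over the whole knowledge base builds the key-value cache.
knowledge = "Product manual: hold the power button for 10 seconds to reset the device."
knowledge_ids = tokenizer(knowledge, return_tensors="pt").input_ids
with torch.no_grad():
    cache = model(knowledge_ids, use_cache=True).past_key_values

def answer(question: str, max_new_tokens: int = 40) -> str:
    """Greedy decoding that reuses a copy of the precomputed cache for each query."""
    kv = copy.deepcopy(cache)  # keep the original cache intact for later queries
    ids = tokenizer("\nQuestion: " + question + "\nAnswer:", return_tensors="pt").input_ids
    generated = []
    with torch.no_grad():
        for _ in range(max_new_tokens):
            out = model(ids, past_key_values=kv, use_cache=True)
            kv = out.past_key_values
            next_id = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)
            if next_id.item() == tokenizer.eos_token_id:
                break
            generated.append(next_id.item())
            ids = next_id  # only the new token is fed; the cache already holds everything else
    return tokenizer.decode(generated)

print(answer("How do I reset the device?"))
```

The knowledge tokens are processed exactly once; afterwards each query pays only for its own tokens, which is where CAG's latency advantage comes from.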

The video compares the two methods across several dimensions: accuracy, latency, scalability, and data freshness. RAG’s accuracy depends heavily on the quality of its retriever, while CAG guarantees that all of the knowledge is present in context but shifts the burden to the model to pick out the correct details. On latency, CAG is faster because it skips the retrieval step that RAG performs on every query. Scalability favors RAG, which handles large datasets by retrieving only the relevant pieces, whereas CAG is bounded by the model’s context window. For data freshness, RAG can update its index incrementally as documents change, while CAG must recompute its cache whenever the knowledge base changes.

Finally, the video presents scenarios to illustrate when to use RAG or CAG. For static knowledge bases, like a product manual, CAG is preferred due to its speed and simplicity. For dynamic and extensive datasets, such as legal cases, RAG is more suitable because it can efficiently update and retrieve information. In complex situations, like clinical decision support systems, a hybrid approach utilizing both RAG and CAG can be beneficial, allowing for efficient retrieval and comprehensive context for follow-up queries. Ultimately, the choice between RAG and CAG depends on the specific requirements of the application.