Is RAG Still Needed? Choosing the Best Approach for LLMs

The video compares Retrieval Augmented Generation (RAG) and long context windows as methods for providing Large Language Models (LLMs) with up-to-date or proprietary information, highlighting that long context windows simplify architecture and improve reasoning for bounded data, while RAG remains essential for efficiently handling large, enterprise-scale datasets. Ultimately, the best approach depends on the specific use case, and a hybrid of both methods may often be most effective.

Large Language Models (LLMs) are inherently limited by their training cutoff date and lack access to real-time or private data. To address this, two main approaches have emerged for providing LLMs with up-to-date or proprietary information: Retrieval Augmented Generation (RAG) and long context windows. RAG is an engineering solution that involves chunking documents, embedding them as vectors, storing them in a vector database, and retrieving relevant chunks at query time. In contrast, long context windows allow users to directly insert large amounts of data into the model’s prompt, leveraging the model’s attention mechanism to find answers.
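The RAG pipeline described above can be sketched in a few lines. This is a deliberately toy example: the `embed` function below is a bag-of-words frequency vector standing in for a real learned embedding model, and the brute-force cosine scan stands in for a vector database.

```python
import math
from collections import Counter

def embed(text):
    """Toy embedding: a bag-of-words frequency vector.
    A production system would use a learned embedding model instead."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Indexing phase (done once): chunk documents and embed each chunk.
chunks = [
    "The 2024 revenue grew 12 percent year over year.",
    "Our headquarters relocated to Austin in 2023.",
    "Employee headcount reached 4,500 last quarter.",
]
vector_store = [(embed(c), c) for c in chunks]

# Query phase (done per question): embed the query, retrieve top-k chunks.
def retrieve(query, k=1):
    q = embed(query)
    ranked = sorted(vector_store, key=lambda item: cosine(q, item[0]),
                    reverse=True)
    return [text for _, text in ranked[:k]]

context = retrieve("How much did revenue grow?")[0]
prompt = f"Answer using this context:\n{context}\n\nQuestion: How much did revenue grow?"
```

Only the retrieved chunk reaches the model's prompt; the rest of the corpus stays in the store.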

RAG has traditionally been necessary because early LLMs had small context windows, making it impossible to fit large documents or entire knowledge bases directly into the prompt. However, with the advent of models supporting million-token context windows, it is now possible to include vast amounts of information—such as entire book series—directly in the prompt. This development raises the question of whether the complexity of RAG systems is still justified when simpler alternatives are available.
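By contrast, the long-context approach skips retrieval entirely and places every document in the prompt. A minimal sketch, assuming a million-token window and a rough four-characters-per-token heuristic (real tokenizers vary by model):

```python
CONTEXT_WINDOW = 1_000_000  # tokens; illustrative of a million-token model

def estimate_tokens(text):
    """Rough heuristic: ~4 characters per token. An assumption, not a tokenizer."""
    return len(text) // 4

def build_prompt(documents, question):
    """Concatenate all documents into one prompt, guarding the token budget."""
    corpus = "\n\n".join(documents)
    total = estimate_tokens(corpus) + estimate_tokens(question)
    if total > CONTEXT_WINDOW:
        raise ValueError(f"Corpus (~{total} tokens) exceeds the context window.")
    return f"Documents:\n{corpus}\n\nQuestion: {question}"

docs = ["Chapter 1: ...full text...", "Chapter 2: ...full text..."]
long_prompt = build_prompt(docs, "Compare the themes of chapters 1 and 2.")
```

There is no index and no retrieval step; the model's attention mechanism does all the "finding."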

Long context windows offer several advantages. First, they collapse infrastructure by eliminating the need for chunking strategies, embedding models, vector databases, and retrieval logic, resulting in a much simpler architecture. Second, they avoid the "retrieval lottery," where RAG fails silently because the probabilistic nature of semantic search fetches the wrong chunks. Third, they solve the "whole book problem": the model sees all relevant documents at once and can perform the kind of cross-document comparisons that RAG's snippet-based retrieval cannot.

Despite these benefits, RAG still plays an important role. Long context windows can be computationally inefficient, because the model must re-process the entire corpus on every query, whereas RAG pays the cost of processing the full corpus only once, at indexing time. Additionally, as context windows grow, the model's attention can become diluted, making specific facts harder to find (the "needle in the haystack" problem), while RAG focuses the model on only the most relevant data. Finally, enterprise datasets are often far larger than even the largest context windows, making RAG's retrieval mechanism essential for filtering them down to a manageable amount of information.
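The efficiency argument is easy to see with back-of-envelope token accounting. The numbers below are illustrative assumptions, not benchmarks:

```python
# Illustrative token accounting for the two approaches.
corpus_tokens = 500_000   # size of the knowledge base
queries = 1_000           # queries over the corpus's lifetime
chunk_tokens = 500        # tokens per retrieved chunk
top_k = 5                 # chunks retrieved per query

# Long context: the model re-reads the whole corpus on every query.
long_context_total = corpus_tokens * queries          # 500_000_000 tokens

# RAG: embed the corpus once at indexing time, then process
# only the top-k retrieved chunks per query.
rag_total = corpus_tokens + chunk_tokens * top_k * queries  # 3_000_000 tokens
```

Under these assumptions, RAG processes over two orders of magnitude fewer tokens, which is why per-query cost dominates at enterprise scale.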

In summary, long context windows are ideal for bounded datasets and tasks requiring global reasoning, such as analyzing a single contract or summarizing a book, thanks to their simplicity and the model's ability to reason over everything at once. For navigating vast, ever-growing enterprise data lakes, however, RAG remains indispensable as a scalable solution. The choice between the two depends on the specific use case, and in many scenarios a hybrid approach may be most effective.
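One way to operationalize that choice is a simple routing rule: use long context when the corpus fits comfortably in the window, and fall back to RAG otherwise. A minimal sketch, where the 80% headroom factor is an assumption that leaves room for the question and the model's answer:

```python
def choose_strategy(corpus_tokens, context_window=1_000_000, headroom=0.8):
    """Route to long context when the corpus fits comfortably in the
    window; otherwise fall back to RAG. The headroom factor is an
    assumed safety margin for the question and the model's response."""
    if corpus_tokens <= context_window * headroom:
        return "long_context"
    return "rag"

choose_strategy(200_000)      # a single contract or book: "long_context"
choose_strategy(50_000_000)   # an enterprise data lake: "rag"
```

A real hybrid system might also route per query, retrieving a candidate subset with RAG and then handing the model the full text of every candidate document rather than isolated snippets.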