Prompt caching is a technique for optimizing the performance and cost-effectiveness of large language models (LLMs). Unlike traditional output caching, which stores the final response to a repeated query, prompt caching stores the intermediate computations (key-value pairs) produced as the model processes an input prompt. This distinction matters: output caching helps only when the exact same question is asked again, whereas prompt caching can accelerate a much broader range of interactions by reusing the model's internal processing of the parts that stay the same. The approach is especially effective for long prompts with large static sections that are processed repeatedly, where it reduces both latency and computational cost.
When an LLM receives a prompt, it processes the input by computing key-value (KV) pairs at every transformer layer for each token in the prompt. These KV pairs represent the model’s internal understanding of the prompt, such as how words relate to each other and what context is important. This computation, known as the prefill phase, can be computationally expensive, especially for long prompts. Prompt caching stores these precomputed KV pairs, so if a similar prompt is received later, the model can skip recomputing the shared parts and only process the new or changed tokens.
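The prefill-and-reuse idea can be sketched in miniature. The code below is an illustrative toy, not a real inference engine: `compute_kv` stands in for the per-token, per-layer tensor computation (here just a hash), and the cache stores every prompt prefix so a later prompt can skip the tokens it shares with an earlier one. All names are hypothetical.

```python
import hashlib

def compute_kv(token: str, position: int) -> str:
    # Stand-in for a transformer's key/value computation for one token.
    # A real model produces tensors at every layer; we hash for illustration.
    return hashlib.sha256(f"{token}:{position}".encode()).hexdigest()[:8]

def prefill(tokens: list[str], cache: dict) -> tuple[list[str], int]:
    """Compute KV entries for a prompt, reusing the longest cached prefix."""
    kv: list[str] = []
    reused = 0
    # Find the longest prefix of `tokens` already in the cache.
    for end in range(len(tokens), 0, -1):
        prefix = tuple(tokens[:end])
        if prefix in cache:
            kv = list(cache[prefix])
            reused = end
            break
    # Compute KV pairs only for tokens after the cached prefix.
    for i in range(reused, len(tokens)):
        kv.append(compute_kv(tokens[i], i))
    # Cache every prefix so later prompts can reuse partial matches.
    # (A toy simplification: real systems use block- or trie-based storage.)
    for end in range(1, len(tokens) + 1):
        cache[tuple(tokens[:end])] = kv[:end]
    return kv, reused

cache: dict = {}
doc = ["DOC:", "long", "static", "manual"]
kv1, r1 = prefill(doc + ["question", "one"], cache)  # r1 == 0: cold cache
kv2, r2 = prefill(doc + ["question", "two"], cache)  # r2 == 5: doc + "question" reused
```

On the second call, only the single token that differs is recomputed; the shared document prefix comes straight from the cache, which is exactly the latency saving described above.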
Prompt caching is particularly beneficial for prompts with large static sections, such as lengthy documents, system instructions, or few-shot examples. For instance, if a user uploads a 50-page manual and asks the LLM to summarize it, the KV pairs for the manual can be cached. If another user later asks a different question about the same manual, the system can reuse the cached computations for the manual and only process the new question, resulting in significant savings in latency and computational cost.
The effectiveness of prompt caching depends on how prompts are structured. To maximize cache hits, static content like system instructions, documents, and examples should be placed at the beginning of the prompt, followed by dynamic content such as user questions. The caching system uses prefix matching, comparing incoming prompts token by token with cached content: reuse stops at the first token that differs, so placing variable content at the end ensures more of the prompt can be served from the cache.
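The ordering rule is easy to demonstrate with a plain longest-common-prefix check over token lists. This is a minimal sketch with made-up token sequences; real systems match at the tokenizer level, often in fixed-size blocks.

```python
def shared_prefix_len(a: list[str], b: list[str]) -> int:
    """Token-by-token comparison; stops at the first difference."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

# Static-first ordering: the shared document prefix stays reusable.
static = ["SYSTEM:", "You", "are", "helpful.", "DOC:", "long", "manual"]
p1 = static + ["Q:", "summarize", "chapter", "one"]
p2 = static + ["Q:", "list", "safety", "warnings"]

# Dynamic-first ordering wastes the cache: prompts diverge almost immediately.
q1 = ["Q:", "summarize", "chapter", "one"] + static
q2 = ["Q:", "list", "safety", "warnings"] + static

static_first = shared_prefix_len(p1, p2)   # 8 tokens reusable
dynamic_first = shared_prefix_len(q1, q2)  # 1 token reusable
```

Same content, same questions; only the ordering changes, yet the static-first layout lets nearly the whole prompt be reused while the dynamic-first layout diverges at the second token.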
There are practical considerations for prompt caching. Typically, caching benefits only prompts longer than about 1,024 tokens; below that, the overhead of managing the cache outweighs the savings. Cached data is usually kept for a limited time, often between 5 and 10 minutes, though some systems may retain it for up to 24 hours. Some LLM providers offer automatic prompt caching, while others require developers to explicitly mark which parts of the prompt should be cached. When used appropriately, prompt caching can significantly reduce LLM costs and improve response times.
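These two policies, a minimum prompt length and a retention window, can be sketched as a simple cache wrapper. This is a hypothetical illustration, not any provider's API; the 1,024-token floor and 5-minute TTL are the typical figures cited above.

```python
import time

MIN_CACHE_TOKENS = 1024   # below this, cache overhead outweighs the benefit
TTL_SECONDS = 5 * 60      # typical retention window; some systems keep entries longer

class PromptCache:
    """Hypothetical cache policy sketch: size gate plus time-to-live eviction."""

    def __init__(self) -> None:
        self._store: dict = {}  # prefix tuple -> (kv placeholder, timestamp)

    def put(self, tokens: tuple, kv) -> bool:
        if len(tokens) < MIN_CACHE_TOKENS:
            return False  # too short to be worth caching
        self._store[tokens] = (kv, time.monotonic())
        return True

    def get(self, tokens: tuple):
        entry = self._store.get(tokens)
        if entry is None:
            return None
        kv, stamp = entry
        if time.monotonic() - stamp > TTL_SECONDS:
            del self._store[tokens]  # expired: evict so the caller recomputes
            return None
        return kv

cache = PromptCache()
short = tuple(f"t{i}" for i in range(10))
long_ = tuple(f"t{i}" for i in range(2000))
cache.put(short, "kv-short")  # rejected: under the token floor
cache.put(long_, "kv-long")   # accepted: large static prefix worth keeping
```

A `get` after the TTL elapses returns `None`, forcing a fresh prefill, which mirrors how providers silently expire cached prefixes rather than serving stale state indefinitely.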