Faster LLMs: Accelerate Inference with Speculative Decoding

The video explains how speculative decoding speeds up large language model inference by having a smaller draft model propose several tokens ahead, which a larger target model then verifies in parallel to preserve quality. This method increases inference speed by roughly 2-3 times while maintaining output accuracy, reducing latency and resource usage in text generation.

In more detail, speculative decoding speeds up inference in large language models (LLMs) without compromising output quality. A smaller draft model proposes multiple future tokens cheaply, and the larger target model then verifies those proposals in a single parallel pass. By doing so, the process can emit two to four tokens in the time it would normally take to produce just one, making inference more efficient and resource-friendly.

The traditional method of text generation in LLMs is autoregressive: each step runs a forward pass through the model to produce a probability distribution over possible next tokens, followed by a decoding step in which a single token is selected and appended to the input. Because every new token requires its own pass, this sequential approach limits speed. The video sets this as the baseline that speculative decoding improves on by committing multiple tokens per target-model pass, reducing the number of passes needed and increasing overall throughput.
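As a point of reference, here is a minimal sketch of that one-token-at-a-time loop. The `toy_model` function below is only a stand-in for a real LLM forward pass, and greedy (argmax) selection is assumed purely for illustration; none of these names come from the video.

```python
# Baseline autoregressive decoding: one forward pass per generated token.
import numpy as np

VOCAB_SIZE = 16  # toy vocabulary size (assumption for this sketch)

def toy_model(token_ids: list[int]) -> np.ndarray:
    """Stand-in for an LLM forward pass: returns next-token logits."""
    rng = np.random.default_rng(sum(token_ids) + len(token_ids))
    return rng.normal(size=VOCAB_SIZE)

def generate(prompt_ids: list[int], max_new_tokens: int) -> list[int]:
    tokens = list(prompt_ids)
    for _ in range(max_new_tokens):
        logits = toy_model(tokens)             # one full forward pass ...
        tokens.append(int(np.argmax(logits)))  # ... yields exactly one token
    return tokens

print(generate([1, 2, 3], max_new_tokens=5))
```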

Speculative decoding operates in three main steps. First, a smaller draft model quickly generates k candidate tokens, one after another, based on the current input. Next, the larger target model verifies all k candidates in a single forward pass, computing its own probability for each drafted token. Finally, a rejection sampling step compares the draft and target probabilities token by token: if the target model assigns a drafted token a probability at least as high as the draft model did, the token is accepted; otherwise it is accepted only with probability equal to the target-to-draft probability ratio, and on rejection a replacement token is sampled from an adjusted target distribution and the round ends. This process preserves output quality while exploiting the speed of the draft model.
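The sketch below walks through one round of this draft-verify-accept loop using the standard speculative sampling rule. The toy draft and target models, the value of K, and all function names are assumptions made for illustration; a real system would replace them with actual model calls and batch the verification into a single forward pass.

```python
# One round of speculative decoding with toy draft/target distributions.
import numpy as np

VOCAB_SIZE = 16
K = 4  # number of draft tokens proposed per round (illustrative choice)

def toy_dist(token_ids: list[int], seed: int) -> np.ndarray:
    """Stand-in for a model forward pass: a probability vector over the vocab."""
    rng = np.random.default_rng(seed + sum(token_ids) + len(token_ids))
    logits = rng.normal(size=VOCAB_SIZE)
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

def draft_probs(ctx):  return toy_dist(ctx, seed=1)   # small, fast model
def target_probs(ctx): return toy_dist(ctx, seed=2)   # large, accurate model

def speculative_round(ctx: list[int], rng: np.random.Generator) -> list[int]:
    # 1) Draft model proposes K candidate tokens autoregressively (cheap).
    drafted, draft_ctx = [], list(ctx)
    for _ in range(K):
        p = draft_probs(draft_ctx)
        tok = int(rng.choice(VOCAB_SIZE, p=p))
        drafted.append((tok, p))
        draft_ctx.append(tok)

    # 2) Target model scores every drafted position (in practice this is one
    #    batched forward pass over the whole drafted sequence).
    accepted, verify_ctx = [], list(ctx)
    for tok, p in drafted:
        q = target_probs(verify_ctx)
        # 3) Rejection sampling: accept with probability min(1, q/p).
        if rng.random() < min(1.0, q[tok] / p[tok]):
            accepted.append(tok)
            verify_ctx.append(tok)
        else:
            # Rejected: resample from the adjusted target distribution
            # and stop the round at the first rejection.
            residual = np.maximum(q - p, 0.0)
            residual /= residual.sum()
            accepted.append(int(rng.choice(VOCAB_SIZE, p=residual)))
            return accepted
    # All K drafts accepted: take one bonus token from the target model.
    accepted.append(int(rng.choice(VOCAB_SIZE, p=target_probs(verify_ctx))))
    return accepted

rng = np.random.default_rng(0)
print(speculative_round([1, 2, 3], rng))
```

Stopping at the first rejection and resampling from the adjusted distribution is what makes the accepted output distributionally identical to sampling from the target model alone.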

The key advantage of this approach is that several tokens can be committed per pass of the target model, yielding a 2-3 times increase in inference speed on average. Even in the worst case, where the very first drafted token is rejected, the round still produces one token (the resampled one). The rejection sampling step is crucial because it guarantees that the accepted tokens follow the same distribution the target model would have produced on its own, so the smaller draft model cannot degrade the output's quality.
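As a rough illustration of where that speedup comes from (a standard back-of-the-envelope analysis, not a figure from the video), suppose each drafted token is accepted independently with probability $\alpha$ and $k$ tokens are drafted per round. The expected number of tokens committed per target-model pass is then

$$\mathbb{E}[\text{tokens per round}] = \frac{1 - \alpha^{k+1}}{1 - \alpha}.$$

With an assumed acceptance rate of $\alpha = 0.8$ and $k = 4$, this gives about 3.4 tokens per pass, in line with the 2-3 times speedups mentioned above; as $\alpha$ approaches 0 the expression tends to 1, matching the worst case of one token per round.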

Overall, speculative decoding optimizes the use of computational resources by combining the efficiency of smaller models with the accuracy of larger models. It reduces latency, lowers compute costs, and improves memory utilization, all while preserving the quality of generated text. Researchers continue to refine this technique, and the video highlights ongoing advancements in LLM efficiency, encouraging viewers to explore further developments in this promising area of AI technology.