One Setting 3x’d My LLM Speed… Same Hardware

The video demonstrates how enabling speculative decoding, where a smaller draft model predicts tokens for a larger model to verify, can roughly double, and in some cases triple, the speed of large language models on the same hardware with little loss of quality. The presenter introduces an open-source tool, Draftbench, to help users find the draft/target model pairings that give the biggest speedup.

The video explores a technique that significantly speeds up large language models (LLMs) on the same hardware: speculative decoding, which the presenter prefers to call “guess and check.” The demonstration begins with a dense 70B-parameter Meta Llama 3 model at 8-bit quantization running on an M4 Max MacBook Pro, which manages a sluggish 6.31 tokens per second. Enabling speculative decoding and pairing the large model with a smaller draft model lifts the speed to 12.26 tokens per second, nearly double the original rate, without any hardware changes.

Speculative decoding works by having a smaller, faster model “guess” the next few tokens, which the larger, more accurate model then verifies. Because the target model can check a whole run of guessed tokens in a single forward pass, every accepted guess comes almost for free, saving computation time. The technique is supported in popular tools such as LM Studio, llama.cpp, and vLLM. The presenter emphasizes that not every draft model speeds up generation; some pairings can even slow things down, so finding the right combination is crucial.
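To make the guess-and-check loop concrete, here is a minimal toy sketch in Python. It is not the implementation used by any of the tools above: `draft` and `target` are stand-in greedy next-token functions, and verification happens token by token rather than in the single batched forward pass a real engine would use.

```python
from typing import Callable, List

Token = str
NextToken = Callable[[List[Token]], Token]  # greedy next-token function

def speculative_decode(
    target: NextToken,   # large, accurate model (slow per call)
    draft: NextToken,    # small, fast model (cheap guesses)
    prompt: List[Token],
    n_draft: int = 4,    # tokens the draft guesses per round
    max_tokens: int = 32,
) -> List[Token]:
    out = list(prompt)
    while len(out) - len(prompt) < max_tokens:
        # 1) The draft model guesses a short run of tokens.
        guesses, ctx = [], list(out)
        for _ in range(n_draft):
            guesses.append(draft(ctx))
            ctx.append(guesses[-1])
        # 2) The target model checks each guess in order. A real engine
        #    scores all guessed positions in one batched forward pass,
        #    which is where the time savings come from.
        for g in guesses:
            t = target(out)
            out.append(t)   # the target's token is always what is kept
            if t != g:      # first mismatch: discard the remaining guesses
                break
    return out

# Toy demo: both "models" just count upward, so every guess is accepted.
count = lambda ctx: str(len(ctx))
print(speculative_decode(count, count, ["0"], max_tokens=8))
```

The key property is that the output is identical to what the target model alone would have produced; the draft model only changes how fast you get it.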

To systematically identify the best draft and target model combinations, the presenter developed an open-source tool called Draftbench, available on GitHub. This tool benchmarks various combinations of target (large) and draft (small) models, considering different quantizations and parameter sizes, to find the sweet spot for maximum speedup without sacrificing too much output quality. The process involves running multiple prompts and averaging results, which can take several hours depending on the number of combinations tested.
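Draftbench’s internals aren’t shown in detail, but the core measurement is easy to picture. The sketch below is a hypothetical harness, assuming a `generate` callable you wire up to whatever backend you use (LM Studio, llama.cpp, vLLM); it times each target/draft pairing over several prompts and reports average tokens per second against the target-only baseline.

```python
import time
from statistics import mean
from typing import Callable, List, Optional

# Hypothetical backend hook: runs one generation and returns the number
# of tokens produced. Wire this to a real client (llama.cpp server,
# vLLM, ...); draft=None means no speculative decoding.
GenerateFn = Callable[[str, str, Optional[str]], int]

def tokens_per_second(generate: GenerateFn, target: str,
                      draft: Optional[str], prompts: List[str]) -> float:
    rates = []
    for p in prompts:
        start = time.perf_counter()
        n_tokens = generate(p, target, draft)
        rates.append(n_tokens / (time.perf_counter() - start))
    return mean(rates)  # average across prompts, as the video describes

def benchmark(generate: GenerateFn, targets: List[str],
              drafts: List[str], prompts: List[str]) -> None:
    for tgt in targets:
        base = tokens_per_second(generate, tgt, None, prompts)
        for drf in drafts:
            sped = tokens_per_second(generate, tgt, drf, prompts)
            print(f"{tgt} + {drf}: {sped:.1f} tok/s "
                  f"({sped / base:.2f}x over {base:.1f} tok/s baseline)")
```

With many targets, drafts, quantizations, and prompts, the grid grows quickly, which is why a full run can take hours.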

The video provides detailed benchmark results using the Qwen 2.5 model family, which spans a wide range of sizes and quantizations. For example, pairing a 72B-parameter target with a 1.5B draft model boosts speed from 8.7 to 27.6 tokens per second. The presenter also shows that higher-precision (less quantized) variants such as FP16 benefit even more from speculative decoding, sometimes gaining over 200% in speed. However, draft models that are too small or too aggressively quantized lower the acceptance rate and can degrade overall performance.
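As a quick sanity check on those figures (simple arithmetic, not from the video), the reported jumps work out to roughly 1.9x for the Llama 3 demo and about 3.2x, i.e. a bit over 200%, for the Qwen pairing:

```python
def speedup(before: float, after: float) -> str:
    factor = after / before
    return (f"{before} -> {after} tok/s: "
            f"{factor:.2f}x ({(factor - 1) * 100:.0f}% faster)")

print(speedup(6.31, 12.26))  # 70B demo:          1.94x (94% faster)
print(speedup(8.7, 27.6))    # 72B + 1.5B draft:  3.17x (217% faster)
```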

In summary, speculative decoding (guess and check) is a powerful way to make large LLMs usable on consumer hardware by leveraging smaller models to accelerate token generation. The right combination of draft and target models, as identified by tools like Draftbench, can yield dramatic speed improvements with minimal quality loss. The presenter encourages viewers to experiment with these techniques and tools, highlighting that this approach is especially valuable for those running large models on limited hardware.