The Biggest Mystery of LLMs Has Just Been Solved

The video reveals that the non-deterministic outputs of large language models at zero temperature are caused by server-side batching, which alters GPU kernel strategies and leads to slight numerical differences that accumulate and affect token selection. By developing batch-invariant kernels, researchers achieved fully deterministic inference, improving reproducibility and enabling more effective on-policy training despite a performance trade-off.

The video addresses a long-standing mystery about the non-deterministic behavior of large language models (LLMs) during inference, particularly when the temperature parameter is set to zero. Temperature controls the randomness in token selection, with zero theoretically forcing the model to always pick the most probable next token, thus producing consistent outputs. However, in practice, even with temperature zero and identical prompts, LLMs often generate different responses. This inconsistency has puzzled many, leading to assumptions that the models or hardware might inherently be non-deterministic.
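As a toy illustration of what temperature does (the function name and logit values here are invented for illustration, not taken from the video), sampling divides the logits by the temperature before applying softmax, and at temperature zero this collapses to simply picking the argmax:

```python
import numpy as np

def sample_token(logits, temperature, rng):
    """Pick the next token from logits at a given temperature.

    As temperature -> 0, the softmax sharpens toward the argmax,
    so sampling degenerates into greedy decoding.
    """
    if temperature == 0:
        return int(np.argmax(logits))        # greedy: always the top token
    scaled = np.asarray(logits) / temperature
    probs = np.exp(scaled - scaled.max())    # numerically stable softmax
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))

rng = np.random.default_rng(0)
logits = [2.0, 1.9, 0.5]
print(sample_token(logits, 0, rng))  # always index 0: deterministic in theory
```

In theory, then, temperature zero should make repeated runs identical; the rest of the video explains why in practice they are not.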

Initially, the video explores the hypothesis that GPU parallelism and floating-point imprecision cause this non-determinism. GPUs perform many operations simultaneously, and the order in which partial sums are combined can vary, leading to tiny rounding differences. In practice, however, the kernels used for LLM inference avoid the order-varying parallel reductions that would make a single run unpredictable: given identical inputs, they return identical outputs run after run. GPU hardware and software are therefore not the root cause of the inconsistency. The real source of non-determinism lies in how requests are batched on the server side.
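The floating-point half of that hypothesis is easy to demonstrate in any language: addition is not associative, so changing the order in which values are combined changes the rounded result. A minimal Python example:

```python
# Floating-point addition is not associative: each intermediate sum is
# rounded, so the grouping of operations changes the final answer.
a, b, c = 1e16, -1e16, 1.0

left = (a + b) + c    # (1e16 - 1e16) + 1.0  ->  1.0
right = a + (b + c)   # -1e16 + 1.0 rounds back to -1e16, so the 1.0 is lost

print(left, right)    # 1.0 0.0
assert left != right  # same numbers, different grouping, different result
```

The same effect, at much smaller scale, is what accumulates across the millions of additions inside a transformer's matrix multiplications.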

When multiple user requests arrive within a short time window, servers batch them together to optimize processing efficiency. This batching changes the shape and size of the input data, causing the GPU kernels—small programs that perform specific mathematical tasks—to switch strategies to maintain efficiency. Different kernel strategies lead to variations in the order of floating-point operations, which, due to rounding behavior, can cause slight numerical differences. These small differences can accumulate through the model’s layers, eventually tipping the token selection from one choice to another, even at temperature zero, resulting in different outputs.
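A rough sketch of this mechanism, using NumPy chunked sums as a stand-in for GPU reduction kernels (the chunk sizes and the rule "bigger batch, bigger tile" are invented for illustration; real kernels choose tiling strategies in more complex ways):

```python
import numpy as np

def row_sum(x, chunk):
    """Reduce each row by summing fixed-size chunks, then combining the
    partial sums -- a toy stand-in for how a GPU kernel tiles a reduction."""
    parts = [x[:, i:i + chunk].sum(axis=1) for i in range(0, x.shape[1], chunk)]
    return np.sum(parts, axis=0)

rng = np.random.default_rng(0)
row = rng.standard_normal((1, 4096)).astype(np.float32)

# Suppose the kernel picks a larger tile when the batch is bigger:
alone = row_sum(row, chunk=256)[0]                       # request served alone
batched = row_sum(np.repeat(row, 8, axis=0), chunk=512)[0]  # same row, batch of 8

print(alone == batched)  # often False: same math, different reduction order
```

The two results agree to several decimal places, but the last bits differ, and in a deep network those last bits can eventually flip which token has the highest probability.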

The researchers at Thinking Machines, led by Horace He, tackled this problem by developing batch-invariant kernels that keep the computation strategy fixed regardless of batch size or composition. Their experiments showed that with standard kernels, running 1,000 completions of the same prompt at temperature zero produced 80 distinct outputs, with divergence appearing as early as token 103. After switching to batch-invariant kernels, all 1,000 completions were identical, confirming that the non-determinism was due to batch-dependent kernel behavior. While this deterministic mode incurs a performance cost (inference slows by roughly 1.6 to 2.1 times), it guarantees reproducibility, which is crucial for evaluation and benchmarking.
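The fix can be sketched in the same toy terms: fix the chunk size and combine order up front, so each row is reduced identically whether it arrives alone or inside a batch. This is a deliberate simplification of the actual batch-invariant kernels, which cover real GPU operations such as matrix multiplication and attention:

```python
import numpy as np

def batch_invariant_row_sum(x, chunk=256):
    """Always reduce with the same chunk size and the same left-to-right
    combine order, no matter how many rows are in the batch."""
    out = np.zeros(x.shape[0], dtype=x.dtype)
    for i in range(0, x.shape[1], chunk):
        out = out + x[:, i:i + chunk].sum(axis=1)  # fixed combine order
    return out

rng = np.random.default_rng(0)
row = rng.standard_normal((1, 4096)).astype(np.float32)

alone = batch_invariant_row_sum(row)[0]
batched = batch_invariant_row_sum(np.repeat(row, 8, axis=0))[0]

assert alone == batched  # bit-identical regardless of batch size
```

The trade-off is exactly the one the video describes: a fixed strategy cannot adapt its tiling to whatever batch shape would run fastest, so determinism costs some throughput.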

Beyond reproducibility, the research also highlights an important implication for training. Typically, the inference model and the training model differ slightly due to dynamic batching and caching strategies, causing the model to learn from outputs that are not perfectly aligned with its own policy. This “off-policy” learning can hinder training effectiveness. Batch-invariant inference aligns the inference and training models by ensuring consistent kernel behavior and reduction order, enabling truly on-policy reinforcement learning updates and smoother training. This discovery not only solves a fundamental mystery but also opens avenues for improving LLM training and evaluation fidelity.
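One way to see why this matters for training (the logprob values below are hypothetical): policy-gradient updates weight each sampled token by the ratio of the trainer's probability to the sampler's probability for that token. If inference and training compute bit-identical logprobs, that ratio is exactly 1 and the update is truly on-policy:

```python
import math

# Importance ratio for one sampled token: p_trainer / p_sampler,
# computed from logprobs as exp(logp_trainer - logp_sampler).
sampler_logprob = -2.3125   # logprob assigned by the inference engine
trainer_logprob = -2.3125   # bit-identical under batch-invariant kernels

ratio = math.exp(trainer_logprob - sampler_logprob)
assert ratio == 1.0         # exactly on-policy

# With even tiny numerical drift between the two stacks, every token
# contributes a ratio slightly off 1, and the mismatch compounds over
# long sequences:
drifted = math.exp(-2.3125004 - (-2.3125))
print(drifted)
```

Bit-identical sampler and trainer probabilities remove this drift entirely, which is the sense in which batch-invariant inference enables genuinely on-policy reinforcement learning.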