Lecture 3 of Stanford’s CME 295 course provides an in-depth overview of large language models, focusing on their architectures, efficient scaling techniques like Mixture of Experts, and advanced text generation methods including decoding strategies and prompting techniques to enhance performance and interpretability. It also covers inference-time optimizations such as KV caching and speculative decoding that improve computational efficiency while maintaining high-quality output.
In Lecture 3 of Stanford’s CME 295 course on Transformers and Large Language Models (LLMs), the instructor begins by recapping the foundational concepts covered in previous lectures, focusing on the three main categories of transformer-based models: encoder-decoder models like T5, encoder-only models such as BERT, and decoder-only models exemplified by GPT. The lecture then introduces the concept of large language models, defining them as language models that predict the probability of the next token in a sequence and are characterized by their massive scale—often containing billions to trillions of parameters and trained on vast datasets comprising hundreds of billions to trillions of tokens. The discussion highlights that modern LLMs are predominantly decoder-only architectures, optimized for text generation tasks.
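As a concrete illustration of that next-token definition, the sketch below scores a token sequence under a causal language model by summing per-token conditional log-probabilities via the chain rule. The `model` callable and tensor shapes are assumptions standing in for any decoder-only LM that maps token ids to next-token logits.

```python
import torch
import torch.nn.functional as F

def sequence_log_prob(model, token_ids):
    """Score a sequence under a causal LM using the chain rule:
    log p(x_1..x_n) = sum_t log p(x_t | x_<t).

    `model` is a placeholder callable mapping a (1, T) tensor of token ids
    to logits of shape (1, T, vocab_size).
    """
    logits = model(token_ids)                       # (1, T, V)
    log_probs = F.log_softmax(logits, dim=-1)       # normalize over the vocabulary
    # The prediction at position t is scored against the token at position t+1.
    targets = token_ids[:, 1:]                      # (1, T-1)
    per_token = log_probs[:, :-1].gather(-1, targets.unsqueeze(-1))
    return per_token.sum()
```

Autoregressive generation is the same computation run in reverse: the model repeatedly emits a distribution over the next token and one token is chosen from it, which is exactly where the decoding strategies discussed later in the lecture come in.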
A significant portion of the lecture is dedicated to the concept of Mixture of Experts (MoE), a technique designed to improve computational efficiency in large models. Instead of activating all parameters for every prediction, MoE selectively activates a subset of “experts” (specialized subnetworks) based on the input, guided by a gating or routing network. This approach reduces the computational load during inference by only engaging relevant experts. The lecture explains the difference between dense MoE, where all experts contribute weighted outputs, and sparse MoE, which activates only the top-k experts. Challenges such as routing collapse—where only a few experts are consistently used—are addressed through modified loss functions and techniques like noisy gating to encourage balanced expert utilization.
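To make the routing mechanics concrete, here is a minimal sparse MoE layer in PyTorch: a linear gate scores every expert, only the top-k experts run for each token, and their outputs are combined with renormalized gate weights. The layer sizes and the feed-forward expert design are illustrative assumptions, and the load-balancing loss and noisy gating mentioned in the lecture are omitted for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoE(nn.Module):
    """Toy sparse Mixture-of-Experts layer with top-k routing."""

    def __init__(self, d_model=64, d_hidden=256, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.gate = nn.Linear(d_model, num_experts)            # routing network
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(),
                          nn.Linear(d_hidden, d_model))
            for _ in range(num_experts)
        ])

    def forward(self, x):                                       # x: (tokens, d_model)
        scores = self.gate(x)                                   # (tokens, num_experts)
        top_vals, top_idx = scores.topk(self.top_k, dim=-1)     # keep only the top-k experts
        weights = F.softmax(top_vals, dim=-1)                   # renormalize over the chosen k
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = top_idx[:, slot] == e                    # tokens routed to expert e in this slot
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

# Example: route 16 token representations through the layer.
y = SparseMoE()(torch.randn(16, 64))
```

A dense MoE would instead run every expert and weight all outputs by the full softmax over gate scores; the sparse variant trades that for far less compute per token at the cost of needing the balancing tricks described above.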
The lecture then shifts focus to the methods of generating text from LLMs. It contrasts greedy decoding, which selects the highest probability token at each step but can lead to repetitive and suboptimal sequences, with beam search, which maintains multiple candidate sequences to find a more globally optimal output. However, beam search can lack diversity and creativity, leading to the widespread adoption of sampling methods like top-k and top-p sampling, which introduce randomness by sampling from the most probable tokens. The role of temperature in controlling the randomness of sampling is explained mathematically, showing that lower temperatures produce more deterministic outputs while higher temperatures increase diversity and creativity in generated text.
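The sketch below, assuming a 1-D tensor of next-token logits, shows how temperature scaling, top-k truncation, and top-p (nucleus) truncation fit together in a single sampling step; greedy decoding corresponds to simply taking the argmax of the same logits. In practice usually only one of the two truncations is applied at a time.

```python
import torch
import torch.nn.functional as F

def sample_next_token(logits, temperature=1.0, top_k=None, top_p=None):
    """Sample one token id from a (vocab_size,) vector of next-token logits.
    temperature < 1 sharpens the distribution (more deterministic),
    temperature > 1 flattens it (more diverse)."""
    logits = logits / max(temperature, 1e-8)           # temperature scaling
    probs = F.softmax(logits, dim=-1)

    if top_k is not None:                              # keep only the k most likely tokens
        kth = probs.topk(top_k).values[-1]
        probs = torch.where(probs >= kth, probs, torch.zeros_like(probs))

    if top_p is not None:                              # nucleus: smallest set with mass >= p
        sorted_probs, sorted_idx = probs.sort(descending=True)
        keep = sorted_probs.cumsum(0) - sorted_probs < top_p
        mask = torch.zeros_like(probs, dtype=torch.bool)
        mask[sorted_idx[keep]] = True
        probs = probs * mask

    probs = probs / probs.sum()                        # renormalize after truncation
    return torch.multinomial(probs, num_samples=1).item()
```

With `temperature` close to zero the renormalized distribution collapses onto the argmax token, recovering greedy decoding; raising it spreads probability mass across more tokens, which is the mathematical point the lecture makes about diversity.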
Next, the lecture explores prompting strategies to guide LLMs in producing desired outputs without retraining. It introduces in-context learning, where models are given examples (few-shot) or just instructions (zero-shot) within the input prompt to steer their responses. Techniques like chain-of-thought prompting encourage the model to generate intermediate reasoning steps before the final answer, improving performance and interpretability. Self-consistency is another method discussed, where multiple sampled outputs are generated and the most consistent answer is selected via majority voting, enhancing robustness. The lecture also touches on the importance of managing context length, noting that while larger context windows allow for more input tokens, they can introduce challenges like context rot, where the model’s ability to retrieve relevant information diminishes with longer contexts.
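A minimal sketch of self-consistency is shown below: several chain-of-thought completions are sampled at a nonzero temperature and the most frequent final answer wins a majority vote. The `generate` callable and the `extract_answer` parser are placeholders for whatever model API and answer format are actually in use.

```python
from collections import Counter

def extract_answer(completion):
    """Naive task-specific parser: take the text after the last 'Answer:' marker."""
    return completion.rsplit("Answer:", 1)[-1].strip()

def self_consistency(generate, prompt, n_samples=10, temperature=0.7):
    """Sample several chain-of-thought completions and return the majority answer.

    `generate(prompt, temperature)` is a placeholder for an LLM call that
    returns a text completion containing intermediate reasoning steps.
    """
    answers = [extract_answer(generate(prompt, temperature=temperature))
               for _ in range(n_samples)]
    answer, votes = Counter(answers).most_common(1)[0]
    return answer, votes / n_samples   # answer plus its vote share as a rough confidence
```

The same prompt would typically end with an instruction such as "Let's think step by step" so that each sampled completion contains explicit reasoning before the final "Answer:" line.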
Finally, the lecture covers various inference-time optimization techniques to make LLM generation more efficient. It discusses KV caching, which stores the key and value matrices from previous tokens to avoid redundant computation during autoregressive generation. Techniques like grouped-query attention and multi-head latent attention reduce memory and computational overhead by sharing or compressing key-value representations across attention heads. On the output side, speculative decoding uses a smaller draft model to propose tokens that are then verified by the larger target model, speeding up generation without sacrificing quality. Multi-token prediction extends this idea by embedding draft predictions within the same model architecture. These methods collectively aim to balance computational efficiency with maintaining high-quality, diverse, and coherent text generation.
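To illustrate the KV-caching idea, the toy single-head decoding step below reuses the key and value projections of all previous tokens instead of recomputing them; only the newly generated token's query, key, and value are produced at each step. The shapes and the dict-based cache are assumptions made for clarity. Grouped-query attention would shrink this cache further by sharing one key/value set across several query heads.

```python
import torch
import torch.nn.functional as F

def attend_with_kv_cache(q_t, k_t, v_t, cache):
    """One decoding step of single-head attention with a KV cache.

    q_t, k_t, v_t: (1, d) projections of the newly generated token.
    cache: dict holding the keys/values of all previous tokens, so earlier
    projections are reused rather than recomputed every step.
    """
    cache["k"] = torch.cat([cache["k"], k_t], dim=0) if "k" in cache else k_t
    cache["v"] = torch.cat([cache["v"], v_t], dim=0) if "v" in cache else v_t
    d = q_t.shape[-1]
    scores = q_t @ cache["k"].T / d ** 0.5       # (1, t): new query against all cached keys
    weights = F.softmax(scores, dim=-1)
    return weights @ cache["v"]                  # (1, d) attention output for the new token

# Example: an empty dict is passed in at the first step and grows with each token.
cache = {}
for _ in range(4):
    q, k, v = (torch.randn(1, 32) for _ in range(3))
    out = attend_with_kv_cache(q, k, v, cache)
```

Speculative decoding builds on the same per-step machinery: the draft model proposes several tokens cheaply, and the target model scores them in one batched forward pass, accepting the prefix that matches its own distribution.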