Stanford CME295 Transformers & LLMs | Autumn 2025 | Lecture 1 - Transformer

The lecture introduces Stanford’s CME 295 course on Transformers and Large Language Models, covering foundational NLP tasks, tokenization methods, and the evolution from RNNs to the transformer architecture with its self-attention mechanism. It provides a detailed explanation of the transformer’s encoder-decoder structure, attention components, and training techniques, illustrating how these models process and generate language effectively.

The lecture begins with an introduction to the course CME 295 on Transformers and Large Language Models (LLMs), taught by twin brothers Afshine and Shervine Amidi, who have extensive backgrounds in NLP and industry experience at companies like Uber, Google, and Netflix. They explain the motivation behind the course, which has evolved from a yearly workshop into a formal Stanford class due to the surge of interest in LLMs following the release of ChatGPT in 2022. The course aims to provide students with a deep understanding of the transformer architecture, how LLMs are trained, and their applications. It is designed for anyone interested in NLP, whether for research, development, or interdisciplinary applications, with prerequisites including basic machine learning knowledge and linear algebra.

The lecture then provides an overview of natural language processing (NLP) tasks, grouping them into three main buckets: classification, token-level classification, and generation. Classification tasks predict a single label for an entire text input, such as sentiment analysis or intent detection. Token-level classification tasks predict a label for each token, exemplified by named entity recognition (NER), which tags specific words in a text with categories like location or time. Generation tasks produce variable-length text outputs from text inputs, including machine translation, question answering, and summarization. The lecture also discusses common evaluation metrics for these tasks: accuracy, precision, recall, and F1 score for classification, and BLEU, ROUGE, and perplexity for generation.
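As a concrete illustration of the classification metrics mentioned above, here is a minimal sketch of how precision, recall, and F1 are computed for a binary classifier. The labels below are made up for illustration; they do not come from the lecture.

```python
# Toy binary classification results (illustrative labels, not from the lecture).
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)  # true positives
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)  # false positives
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)  # false negatives

precision = tp / (tp + fp)  # of the predicted positives, how many were right
recall = tp / (tp + fn)     # of the actual positives, how many were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(precision, recall, f1)  # 0.75 0.75 0.75
```

F1 is the harmonic mean, so it rewards classifiers that balance precision and recall rather than maximizing only one.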

Next, the lecture delves into the foundational challenge of representing text for machine learning models, which inherently understand numbers rather than raw text. Tokenization is introduced as the process of breaking text into smaller units called tokens, with three main approaches: word-level, subword-level, and character-level tokenization. Each has its pros and cons related to vocabulary size, handling of out-of-vocabulary words, and sequence length. The lecture emphasizes the importance of learning meaningful token embeddings rather than using naive one-hot encodings, highlighting the Word2Vec model as a pioneering method that learns embeddings by predicting words based on their context, thereby capturing semantic relationships between words.
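The three tokenization granularities can be sketched in a few lines. The subword vocabulary below is hypothetical, and the greedy longest-match loop is only a toy stand-in for learned schemes like BPE or WordPiece; it is meant to show the trade-offs, not a real tokenizer.

```python
text = "unbelievable"

# Word-level: one token per word; unseen words become out-of-vocabulary.
word_tokens = text.split()

# Character-level: tiny vocabulary, but very long sequences.
char_tokens = list(text)

# Subword-level: greedy longest-match against a small HYPOTHETICAL vocabulary,
# splitting rare words into frequent pieces (BPE/WordPiece-style behavior).
vocab = {"un", "believ", "able", "a", "b", "e", "i", "l", "n", "u", "v"}

def subword_tokenize(word, vocab):
    tokens, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):  # try the longest piece first
            if word[i:j] in vocab:
                tokens.append(word[i:j])
                i = j
                break
        else:
            tokens.append(word[i])  # fall back to a single character
            i += 1
    return tokens

print(subword_tokenize(text, vocab))  # ['un', 'believ', 'able']
```

The example makes the trade-off visible: one word token, twelve character tokens, or three subword tokens, which is why subword tokenization is the usual middle ground in modern LLMs.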

Building on token embeddings, the lecture introduces recurrent neural networks (RNNs) and their limitations, such as difficulty handling long-range dependencies, vanishing gradients, and slow sequential processing. Long Short-Term Memory (LSTM) networks are presented as an improvement that better retains important information over longer sequences, but they only mitigate the vanishing-gradient problem and remain computationally inefficient, since each step still depends on the previous one. To address these issues, the attention mechanism is introduced, which allows models to attend directly to the relevant parts of the input sequence when making predictions. This concept leads to the transformer architecture, which replaces sequential processing with self-attention, enabling parallel computation and better handling of long-range dependencies.

Finally, the lecture explains the transformer architecture in detail, focusing on its encoder-decoder structure used for tasks like machine translation. The encoder processes the input tokens with multi-head self-attention layers and feed-forward networks to produce context-aware embeddings. The decoder generates output tokens by attending to previously generated tokens (masked self-attention) and the encoder’s output (cross-attention). Position encodings are added to token embeddings to retain word order information. The lecture also covers technical details such as the role of queries, keys, and values in attention, the purpose of multiple attention heads, and training techniques like label smoothing. The session concludes with a step-by-step example of how a sentence is tokenized, embedded, encoded, and decoded in the transformer model, illustrating the core principles behind modern LLMs.
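The query/key/value computation at the heart of the architecture can be sketched in numpy. This is a single-head, unbatched version of scaled dot-product attention with random toy inputs; the causal mask shows the decoder's masked self-attention, where token i may only attend to positions up to i.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Compute softmax(Q K^T / sqrt(d_k)) V for single-head, unbatched inputs."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)            # similarity of each query to each key
    if mask is not None:
        scores = np.where(mask, scores, -1e9)  # block attention to masked positions
    # Numerically stable row-wise softmax over the key axis.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights                # weighted sum of values + the weights

rng = np.random.default_rng(0)
seq_len, d_k = 4, 8                            # illustrative sizes
Q = rng.normal(size=(seq_len, d_k))
K = rng.normal(size=(seq_len, d_k))
V = rng.normal(size=(seq_len, d_k))

# Causal (decoder) mask: True below and on the diagonal.
causal = np.tril(np.ones((seq_len, seq_len), dtype=bool))
out, w = scaled_dot_product_attention(Q, K, V, mask=causal)

print(out.shape)  # (4, 8)
print(w[0])       # first token can only attend to itself: [1. 0. 0. 0.]
```

Multi-head attention simply runs several such computations in parallel on learned projections of the same inputs and concatenates the results; cross-attention is the same function with Q coming from the decoder and K, V from the encoder output.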