The video explains the transformer architecture, highlighting its attention mechanism, multi-head attention, and positional encoding, and demonstrates building a mini GPT model from scratch in PyTorch that generates text using character-level tokenization. It concludes by discussing the challenges of training small models on limited data and previews the next episode's focus on utilizing pre-trained models from Hugging Face for more efficient and powerful AI applications.
In this second episode of the machine learning series, the focus is on understanding transformers, the foundational architecture behind major AI models like GPT, Claude, and Llama. The video begins by contrasting transformers with earlier language models such as RNNs and LSTMs, whose sequential processing makes them slow and prone to losing long-range context. Transformers instead process all words simultaneously, allowing any word to attend directly to any other, which improves both efficiency and the handling of long texts. The core innovation is the attention mechanism, which enables the model to weigh the relevance of different words dynamically, mimicking how humans focus on important words in context.
The video then delves into the mechanics of attention, explaining the roles of query, key, and value vectors that each word generates. These vectors enable the model to compute attention scores through dot products, scaling, softmax normalization, and weighted sums, resulting in context-aware word representations. Multi-head attention is introduced as a way to capture multiple types of relationships between words simultaneously by running several attention operations in parallel, each specializing in different linguistic patterns. Positional encoding is also covered, emphasizing its necessity to inject word order information into the model since attention alone treats words as an unordered set.
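The attention arithmetic described above can be sketched in a few lines of plain Python (no PyTorch needed to see the idea): score each key against the query with a dot product, scale by the square root of the key dimension, normalize with softmax, and take the weighted sum of the values. The function names and toy vectors here are illustrative, not taken from the video's code.

```python
import math

def softmax(xs):
    # Subtract the max before exponentiating for numerical stability.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(query, keys, values):
    """Scaled dot-product attention for a single query vector.

    Scores each key against the query, scales by sqrt(d_k),
    normalizes the scores with softmax, and returns the
    softmax-weighted sum of the value vectors.
    """
    d_k = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d_k)
              for key in keys]
    weights = softmax(scores)
    d_v = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values))
            for i in range(d_v)]

# One query attending over three key/value pairs (toy 2-d vectors).
out = attention([1.0, 0.0],
                keys=[[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
                values=[[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
```

Multi-head attention simply runs several copies of this operation in parallel on different learned projections of the same input and concatenates the results.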
Next, the structure of the transformer block is explained, consisting of multi-head attention, layer normalization, and a feed-forward network that acts as the model’s knowledge store. These blocks are stacked repeatedly to build complex language understanding, with modern models using dozens to nearly a hundred such blocks. The distinction between encoder, decoder, and encoder-decoder architectures is clarified, with modern large language models primarily using decoder-only setups that generate text autoregressively, predicting one token at a time while masking future tokens during training to prevent cheating.
The latter part of the video transitions into a practical project where a mini GPT model is built from scratch using PyTorch. The project involves loading a dataset of One Piece synopses, tokenizing text at the character level, and implementing attention, multi-head attention with causal masking, and the feed-forward network. The model is trained to predict the next character in sequences, demonstrating decreasing training and validation loss over time. Although the generated text is initially gibberish, it reflects recognizable patterns and vocabulary from the training data, illustrating the model’s learning progress.
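Character-level tokenization, the first step of the project, amounts to building a vocabulary of the unique characters in the corpus and mapping each character to an integer id. A minimal sketch (the sample text below is illustrative; the video's dataset is One Piece synopses, not reproduced here):

```python
# Build a character-level vocabulary from a corpus, then map text
# to integer token ids and back.
text = "the straw hats set sail"

chars = sorted(set(text))                     # unique characters, stable order
stoi = {ch: i for i, ch in enumerate(chars)}  # char -> token id
itos = {i: ch for ch, i in stoi.items()}      # token id -> char

def encode(s):
    """Turn a string into a list of integer token ids."""
    return [stoi[ch] for ch in s]

def decode(ids):
    """Turn a list of token ids back into a string."""
    return "".join(itos[i] for i in ids)

ids = encode("sail")       # e.g. four integers, one per character
roundtrip = decode(ids)    # recovers the original string "sail"
```

The model is then trained on sequences of these ids, learning to predict the id of the next character from the ids that precede it.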
Finally, the video concludes by comparing training runs with different step counts, showing improved but still imperfect text generation with longer training. It highlights the limitations of training small models from scratch on limited data and introduces the upcoming episode’s focus on leveraging pre-trained models from Hugging Face. This approach allows users to bypass the extensive training process by starting with models trained on vast datasets, enabling more practical and powerful applications of transformers in AI.