The video explains that while scaling up large language models (LLMs) improves performance by reducing interference between overlapping word representations, recent MIT research shows this benefit comes from increased model width rather than smarter algorithms. Because that mechanism has inherent limits, simply making models bigger will eventually stop yielding significant gains, underscoring the need for more efficient ways to store information.
The video discusses the current trend among major AI companies of continually scaling up large language models (LLMs) by increasing their size and computational budget, on the belief that bigger models yield better results. This approach, guided by empirical “scaling laws,” has been validated across many models and architectures: doubling a model’s size improves its performance by a predictable amount. However, the underlying reason why larger models perform better has remained unclear, with only speculative explanations until recently.
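The “predictable improvement from doubling” described above is usually modeled as a power law in parameter count. A minimal sketch, using constants roughly in the spirit of published scaling-law fits but chosen here purely for illustration:

```python
# Toy power-law "scaling law": loss falls by a fixed factor each time the
# parameter count N doubles. The exponent and constant are illustrative,
# not fitted values for any particular model family.
def loss(n_params: float, alpha: float = 0.076, n_c: float = 8.8e13) -> float:
    """L(N) = (N_c / N) ** alpha -- smaller loss for larger N."""
    return (n_c / n_params) ** alpha

# Every doubling shrinks the loss by the same ratio, 2 ** alpha:
for n in [1e8, 2e8, 4e8]:
    print(f"N = {n:.0e}  loss = {loss(n):.4f}")
```

The key property is that the improvement ratio `loss(N) / loss(2 * N)` is a constant (`2 ** alpha`), which is why scaling curves look like straight lines on log-log plots.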
To understand what happens inside these models, the video explains how language models represent words as points in a high-dimensional space. For example, GPT-2 uses between roughly 800 and 1,600 dimensions (depending on the model variant) to store representations for about 50,000 unique words or tokens. The expectation was that, due to limited space, the model would prioritize storing common or important words while discarding or compressing less frequent ones—a concept known as “weak superposition.”
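Concretely, “words as points in space” just means an embedding matrix with one row per token. A minimal sketch at roughly GPT-2-small scale (the sizes and random initialization here are illustrative, not GPT-2's actual weights):

```python
import numpy as np

# Each of V tokens gets one row in an embedding matrix E, i.e., one point
# in a d-dimensional space. Sizes are roughly GPT-2-small scale; the random
# values stand in for learned weights.
rng = np.random.default_rng(0)
V, d = 50_000, 768
E = rng.standard_normal((V, d), dtype=np.float32) / np.sqrt(d)

token_id = 1234
vector = E[token_id]      # "representing" a token = reading its point in space
print(vector.shape)       # (768,)
```

The whole vocabulary has to fit into those `d` dimensions, which is exactly the limited-suitcase situation the video goes on to describe.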
However, recent MIT research challenges this assumption. The study found that instead of discarding less important information, LLMs actually store all tokens in the same limited space, resulting in significant overlap—referred to as “strong superposition.” This means that word representations are compressed and stacked on top of each other, much like cramming too many outfits into a small suitcase. This overlapping leads to interference, where information about different words can get mixed up, sometimes causing the model to make confident but incorrect predictions.
The MIT researchers discovered that this interference follows a predictable mathematical law: the interference between any two tokens is inversely proportional to the width (number of dimensions) of the model. In practical terms, doubling the model’s width halves the interference, which explains why larger models perform better—not because they are fundamentally smarter, but because they have more space to reduce overlap and confusion among stored information.
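The 1/width behavior can be sanity-checked with random vectors: for two random unit directions in d dimensions, the expected squared overlap is exactly 1/d, so doubling d halves it. This is a sketch of the geometric intuition only, using random directions as a stand-in for learned representations rather than reproducing the paper's measurements:

```python
import numpy as np

# Empirically verify the 1/d law: the mean squared overlap (interference)
# between random unit vectors in d dimensions is about 1/d, so doubling
# the width roughly halves it.
rng = np.random.default_rng(0)

def mean_sq_overlap(d: int, n_pairs: int = 20_000) -> float:
    a = rng.standard_normal((n_pairs, d))
    b = rng.standard_normal((n_pairs, d))
    a /= np.linalg.norm(a, axis=1, keepdims=True)
    b /= np.linalg.norm(b, axis=1, keepdims=True)
    return float(np.mean(np.sum(a * b, axis=1) ** 2))

for d in [128, 256, 512]:
    print(d, mean_sq_overlap(d))    # each value is close to 1/d
```

Each doubling of `d` roughly halves the measured interference, mirroring why wider models have more room to keep token representations apart.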
This insight has significant implications. First, it justifies the massive investments AI companies are making in scaling up models, as the improvements are grounded in mathematical principles. Second, it suggests there is a limit to how much performance can be gained by simply increasing model size—eventually, adding more space will no longer reduce interference. Finally, it opens up new research directions, such as finding ways to pack information more efficiently into smaller models, potentially achieving similar performance with less computational cost. However, the compressed and overlapping nature of these representations also makes it extremely difficult to interpret or understand exactly how these models work internally.