Measuring LLM Lies

The video examines how large language models (LLMs) respond to nonsensical questions, finding that Anthropic’s Claude models are the most willing to refuse illogical prompts, while OpenAI’s and Google’s models often generate plausible-sounding but meaningless responses. The speaker warns that this tendency can reinforce misunderstandings and entrench poor learning habits, highlighting the risk of LLMs amplifying both good and bad user skills.

The video explores how these models handle nonsensical or ill-posed questions, using humorous examples like comparing engineering story points to marketing impressions, or reformulating a curry recipe to comply with fire safety codes. These questions are intentionally absurd, designed to test whether a model will recognize the lack of logical connection and push back, or whether it will attempt to answer regardless of the question’s validity.
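
The video doesn’t walk through any code, but a minimal harness for this kind of test might look like the sketch below. It assumes the official `openai` and `anthropic` Python SDKs with `OPENAI_API_KEY` and `ANTHROPIC_API_KEY` set in the environment; the prompt and model names are illustrative placeholders, not the video’s exact test cases.

```python
# Minimal sketch of the kind of test the video describes: send an
# intentionally ill-posed question to multiple providers and collect
# the raw answers. Model names are illustrative placeholders.
import anthropic
from openai import OpenAI

NONSENSE_PROMPT = (
    "Our engineers completed 40 story points last sprint and marketing "
    "bought 2 million impressions. What is the exchange rate between "
    "story points and impressions?"
)

def ask_openai(model: str = "gpt-4o") -> str:
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": NONSENSE_PROMPT}],
    )
    return resp.choices[0].message.content

def ask_claude(model: str = "claude-3-5-sonnet-latest") -> str:
    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY
    resp = client.messages.create(
        model=model,
        max_tokens=512,
        messages=[{"role": "user", "content": NONSENSE_PROMPT}],
    )
    return resp.content[0].text

if __name__ == "__main__":
    print("OpenAI:", ask_openai())
    print("Claude:", ask_claude())
```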

The results of these tests reveal interesting differences between models. Anthropic’s Claude models generally perform well, often refusing to answer nonsensical questions or pushing back appropriately. Surprisingly, a Claude model’s tier or version number does not consistently predict better performance. The speaker notes that Anthropic seems particularly sensitive to the problem of AI “psychosis” (the tendency to confidently answer nonsense) and has tuned its models accordingly.

In contrast, OpenAI’s models, as well as Google’s, tend to answer almost any question, regardless of how illogical it is. The speaker gives examples where OpenAI’s models invent plausible-sounding but ultimately meaningless answers, such as calculating an exchange rate between story points and impressions or suggesting changes to curry recipes based on fire safety codes. This tendency to always provide an answer, even when the question is flawed, is both amusing and concerning.
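
Classifying those answers is the subjective part of such a test. One crude heuristic, purely illustrative and not something shown in the video, is to flag responses that challenge the premise rather than produce a figure:

```python
# Hypothetical scoring heuristic (not from the video): treat a response
# as "pushback" if it contains phrases that question the premise.
# Keyword matching is a rough proxy; a human review pass or an LLM
# judge would catch paraphrases this list misses.
PUSHBACK_MARKERS = (
    "doesn't make sense",
    "not comparable",
    "no meaningful conversion",
    "category error",
    "cannot be converted",
)

def looks_like_pushback(response: str) -> bool:
    lowered = response.lower()
    return any(marker in lowered for marker in PUSHBACK_MARKERS)

# Example: a made-up confident answer scores as non-pushback.
assert not looks_like_pushback(
    "The exchange rate is 50,000 impressions per story point."
)
```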

The video raises a critical point about the implications for education and learning. If users do not know how to properly frame questions, LLMs may reinforce misunderstandings by providing precise but misguided answers. The speaker worries that this could lead to poor learning outcomes, especially when questions are only subtly incorrect. The quote “models are extremely intelligent but they comprehend nothing” encapsulates the concern that LLMs lack true understanding and context, making them unreliable guides for nuanced or complex problem-solving.

Finally, the speaker reflects on the broader impact of AI as a “skill multiplier.” While LLMs can dramatically boost the productivity of already competent users, they can also amplify the mistakes of less skilled or misguided individuals. This creates a risk that poor decision-making could be scaled up rapidly within organizations. The video ends with a tongue-in-cheek promotion for a coffee subscription service, tying back to the theme of technical culture and the quirks of the developer community.