The video showcases the Werewolf Benchmark, a novel test that evaluates large language models' abilities in social deduction, manipulation, and strategic thinking through the game Werewolf. GPT-5 dominates with a 96.7% win rate by exhibiting superior multi-layered deception and coordination. The benchmark reveals nuanced AI behaviors beyond traditional tasks, underscoring the importance of trust, long-term planning, and information hygiene, and the video points to future expansions that will add more models for richer comparisons.
The video introduces the Werewolf Benchmark, a new test designed to evaluate large language models (LLMs) through the social deduction game Werewolf, which involves roles like werewolves (impostors) and villagers with special abilities such as the witch and seer. The game tests models on their abilities to manipulate, deceive, deduce, and resist manipulation in a dynamic social setting. The benchmark pits six LLM players, including GPT-5, Gemini 2.5 Pro, and several open-source models, against each other in a game that requires strategic thinking, trust navigation, and social awareness.
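The core setup described above — six players, hidden roles, and a win condition tied to eliminating the impostors — can be sketched in a few lines. This is a minimal illustrative sketch, not the benchmark's actual code; the player names, role list, and function names are assumptions for demonstration.

```python
import random

# Roles per the video's description of the Werewolf setup: werewolves
# (impostors) plus villagers with special abilities such as the witch and seer.
# The exact role distribution is an assumption for illustration.
ROLES = ["werewolf", "werewolf", "seer", "witch", "villager", "villager"]

# Placeholder names standing in for the six LLM contestants.
PLAYERS = ["gpt-5", "gemini-2.5-pro", "model-a", "model-b",
           "model-c", "model-d"]


def assign_roles(players, roles, seed=None):
    """Randomly map each player to a hidden role."""
    rng = random.Random(seed)
    shuffled = roles[:]
    rng.shuffle(shuffled)
    return dict(zip(players, shuffled))


def werewolves_win(assignment, alive):
    """Werewolves win once they equal or outnumber surviving villagers."""
    wolves = sum(1 for p in alive if assignment[p] == "werewolf")
    return wolves >= len(alive) - wolves


def villagers_win(assignment, alive):
    """Villagers win once every werewolf has been eliminated."""
    return all(assignment[p] != "werewolf" for p in alive)
```

In a full game loop, each night the werewolves would privately choose a target and each day all surviving players would debate and vote; here only the hidden-role assignment and the two terminal win checks are shown.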
GPT-5 emerges as the clear champion with an impressive 96.7% win rate, outperforming all other models both at manipulation as a werewolf and at resistance as a villager. The video highlights how GPT-5 exhibits a commanding and structured personality, imposing order and control over the game, which contributes to its dominance. In contrast, models like GPT-OSS show more defensive and fearful behavior, while models like Kimi K2 display a high-risk, energetic style but lack the long-term coherence and consistency seen in GPT-5.
The benchmark reveals that stronger models demonstrate emergent behaviors such as maintaining multiple coherent narratives simultaneously—one public-facing and one private as a werewolf—allowing them to manipulate the game effectively over multiple rounds. These models also coordinate well with their wolf partners, carefully planning target eliminations and managing social optics to avoid suspicion. This level of strategic depth and multi-day planning marks a significant leap from simpler, more reactive playstyles seen in smaller or less capable models.
The video also discusses the importance of villagers in maintaining information hygiene by anchoring discussions to facts and calling out inconsistencies, which helps resist manipulation. Models like Gemini 2.5 Pro excel in this defensive role, showing disciplined evidence handling and a refusal to fall for bait. The benchmark provides insights into how models’ personalities and reasoning styles differ, with some being more emotional and narrative-driven, while others are measured and logical, reflecting their underlying architecture and training.
Finally, the video emphasizes that these advanced social deduction benchmarks represent a new frontier in evaluating AI capabilities beyond traditional question-answering or problem-solving tasks. They test real-world skills like trust, deception, and long-term strategic thinking, which are crucial for autonomous agents. The creator of the benchmark is working to include more models like Grok 4 and Claude, promising even richer comparisons in the future. The video concludes with excitement about the potential of these benchmarks to reveal nuanced AI behaviors and invites viewers to explore similar tests like Agent Village and Profit Bench.