The video highlights serious flaws in LMArena's ranking system, including opaque model-removal practices and a sampling method that favors top models without scientific justification. Researchers from Cohere advocate returning to a more rigorous, transparent evaluation process based on information gain to ensure fairness and integrity in the platform's rankings.
The video discusses a significant issue with LMArena, a prominent platform that ranks large language models (LLMs) and has become a crucial standard in the industry. Researchers from Cohere recently published a detailed 69-page autopsy of LMArena, revealing serious flaws in how the platform operates and how it influences industry decisions. The rankings are highly influential, affecting major venture-capital investments and the direction of new model development, so ensuring the fairness and robustness of the ranking process is of paramount importance.
The platform works by letting model providers upload their models, which are then sampled and shown to users for evaluation in a process similar to swiping on Tinder: users vote between anonymous model outputs, and these pairwise votes feed the ranking algorithm. Sampling is supposed to follow a mathematically robust method based on information gain, which selects the matchups whose outcomes are most informative about the ranking so that all models are evaluated fairly. However, the researchers found that the platform actually departs from this method and uses a less rigorous, undocumented sampling policy under which top-performing models are sampled far more frequently, without any transparent or scientific justification.
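To make the distinction concrete, the sketch below shows what information-gain-style pair selection could look like over a simple Elo/Bradley-Terry rating model. This is an illustrative assumption, not LMArena's actual code: the model names, the `pair_entropy` heuristic, and the vote flow are all hypothetical.

```python
import math

# Hypothetical sketch of information-gain-style pair sampling over an
# Elo-like rating model. Illustrative only; not LMArena's actual code.

ratings = {"model_a": 1000.0, "model_b": 1010.0, "model_c": 1200.0}

def win_prob(r_a, r_b):
    """Bradley-Terry / Elo probability that A beats B."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def pair_entropy(r_a, r_b):
    """Binary entropy of the battle outcome: highest when p is near 0.5,
    i.e. when a single user vote carries the most information."""
    p = win_prob(r_a, r_b)
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def sample_pair_information_gain(ratings):
    """Pick the pair whose outcome is most informative (maximum entropy)."""
    names = list(ratings)
    pairs = [(a, b) for i, a in enumerate(names) for b in names[i + 1:]]
    return max(pairs, key=lambda ab: pair_entropy(ratings[ab[0]], ratings[ab[1]]))

def update_elo(winner, loser, k=32.0):
    """Standard Elo update after a user votes for `winner`."""
    p_win = win_prob(ratings[winner], ratings[loser])
    ratings[winner] += k * (1.0 - p_win)
    ratings[loser] -= k * (1.0 - p_win)

a, b = sample_pair_information_gain(ratings)  # picks the closest matchup
update_elo(a, b)  # pretend the user preferred model `a`
```

The key property is that matchups whose outcome is hardest to predict (win probability near 50%) yield the most information per vote, so battles are allocated where they resolve the most uncertainty rather than the leaders simply accumulating more exposure.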
A major concern raised by the Cohere researchers is the platform's handling of models that are no longer relevant or that perform poorly. They found that LMArena unlists models without providing any explanation, which undermines the transparency and integrity of the ranking system. In addition, providers can test private pools of models, such as multiple variants of Meta's Llama 4, that are never publicly listed. This lets poorly performing variants be silently deleted or hidden while only the best-performing variant is published, creating an unfair advantage and skewing the rankings.
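The statistical effect of this best-of-N variant selection is easy to demonstrate. The toy simulation below (all numbers are illustrative assumptions, not figures from the paper) shows that if a provider tests several private variants of identical true strength and publishes only the one with the best observed score, the published score is systematically inflated.

```python
import random
import statistics

# Toy simulation of the private-variant effect described above: every
# variant has the same true strength, but only the best-looking one is
# published. All constants are illustrative assumptions.

random.seed(0)

TRUE_WIN_RATE = 0.50       # every variant is genuinely average
VOTES_PER_VARIANT = 200    # arena battles each private variant receives

def observed_win_rate():
    """Win rate measured from a finite sample of noisy user votes."""
    wins = sum(random.random() < TRUE_WIN_RATE for _ in range(VOTES_PER_VARIANT))
    return wins / VOTES_PER_VARIANT

def best_of(n_variants):
    """Provider keeps only the best-looking of n private variants."""
    return max(observed_win_rate() for _ in range(n_variants))

trials = 2000
honest = statistics.mean(best_of(1) for _ in range(trials))
gamed = statistics.mean(best_of(10) for _ in range(trials))
print(f"single submission:   {honest:.3f}")  # ~0.50, the true strength
print(f"best of 10 variants: {gamed:.3f}")   # noticeably above 0.50
```

With ten variants, the selected score sits well above the true 50% win rate purely from sampling noise, which is exactly the kind of distortion the researchers describe.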
The researchers criticize this practice, arguing that models published on the platform should remain accessible and visible to ensure transparency and fairness; deleting or hiding models without explanation distorts the rankings and prevents an honest comparison of different models. The Cohere team recommends returning to the originally proposed sampling method based on information gain, which is more scientifically sound, and insists that all models, including those that perform poorly, remain published for the sake of transparency.
In response, LMArena issued a somewhat dismissive statement on Twitter, which did not satisfy the community. Critics, including notable figures like Andrej Karpathy, labeled the platform's practices unscientific and unfair. The debate underscores the need for more transparent, scientifically rigorous methods for ranking large language models. The Cohere researchers' proposals aim to improve fairness and transparency, but it remains to be seen whether LMArena will adopt them or continue with its current practices.