20 days of compute vs 7 hours: rethinking what state-of-the-art means — Bertrand Charpentier, Pruna

Bertrand Charpentier from Pruna argues that defining a state-of-the-art AI model means balancing quality against computational efficiency. Evaluations should use multiple task-specific metrics and reflect real-world application needs rather than rely solely on leaderboards or a single benchmark. He highlights the high cost and energy consumption of large-scale benchmarking and advocates using Pareto fronts and model compression techniques to identify the best trade-offs between performance and efficiency.

The talk addresses what defines a state-of-the-art AI model, using image editing as the running example. Charpentier points out that the two common ways of identifying top models, checking public leaderboards and running internal evaluations, often lead to misleading conclusions. Public leaderboards rank the same models very differently because they differ in evaluation criteria, sample sizes, and task focus, which makes it difficult to name a universally best model. Moreover, models tend to excel at specific tasks rather than across all use cases, so model selection should follow the needs of the particular application rather than an aggregated score.

Charpentier emphasizes the limitations of manual inspection and automated benchmarks in internal evaluations. Manual inspection is prone to personal bias and limited sample sizes, which can skew perceptions of model quality. Automated metrics, such as CLIP scores, often show inconsistent results across datasets and models, and small variations can be misleading. He advocates for using multiple, task-specific metrics to better capture model performance relevant to the intended use case. Human evaluation remains valuable but must be scaled appropriately to reduce bias and increase reliability.
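One way to check whether a small gap in an automated metric is meaningful is to put a confidence interval around it. The sketch below is not from the talk: it is a minimal percentile bootstrap over per-prompt score differences, assuming both models were scored on the same prompts. The function name, its parameters, and the choice of bootstrap are all illustrative assumptions.

```python
import random
from statistics import mean


def bootstrap_diff_ci(scores_a, scores_b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for mean(scores_a) - mean(scores_b).

    Assumes scores_a[i] and scores_b[i] were measured on the same prompt
    (a paired design). All names here are hypothetical, not Pruna's API.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("scores must be paired: one score per prompt per model")
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    # Resample the per-prompt differences with replacement and record each mean.
    resampled_means = sorted(
        mean(rng.choices(diffs, k=n)) for _ in range(n_resamples)
    )
    lower = resampled_means[int((alpha / 2) * n_resamples)]
    upper = resampled_means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper
```

If the returned interval contains zero, the observed gap between the two models is not distinguishable from sampling noise at that sample size, which is one reason to evaluate on many samples and on more than one metric.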

A critical insight from the talk is the computational cost and energy consumption of benchmarking large AI models. For example, generating 26,000 images for evaluation with a large model can take 20 days of compute time, cost thousands of dollars, and consume energy equivalent to running hundreds of marathons. In contrast, more efficient models can perform similar evaluations in a fraction of the time and cost (the 7 hours of the talk's title), which is why efficiency has to be considered alongside quality when assessing state-of-the-art models.
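To make the scale of that gap concrete, the arithmetic can be worked out per image. The sketch below uses the 26,000 images and 20 days from the summary; the 7-hour figure is taken from the talk's title and is assumed here to refer to the same 26,000-image workload, which the summary does not state explicitly.

```python
def seconds_per_image(total_hours: float, n_images: int) -> float:
    """Average wall-clock seconds spent per generated image."""
    return total_hours * 3600 / n_images


N_IMAGES = 26_000
LARGE_MODEL_HOURS = 20 * 24   # 20 days of compute, from the summary
EFFICIENT_MODEL_HOURS = 7     # from the talk title; same workload is an assumption

large = seconds_per_image(LARGE_MODEL_HOURS, N_IMAGES)
efficient = seconds_per_image(EFFICIENT_MODEL_HOURS, N_IMAGES)
speedup = LARGE_MODEL_HOURS / EFFICIENT_MODEL_HOURS

print(f"large model:     {large:.1f} s per image")      # 66.5 s per image
print(f"efficient model: {efficient:.2f} s per image")  # 0.97 s per image
print(f"speedup:         {speedup:.1f}x")               # 68.6x
```

Under those assumptions the large model averages a little over a minute per image while the efficient one averages about a second, a gap of roughly 69x on the same evaluation set.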

Charpentier introduces Pareto fronts as a tool for evaluating models on quality and on efficiency metrics such as latency and cost at the same time. A model lies on the Pareto front when no other model is at least as good on every metric and strictly better on at least one. Rather than a single best model, several models may therefore represent optimal trade-offs, and which one to pick depends on the use case. This lets practitioners select the model that gives the best performance within their computational and budget constraints. He also notes that focusing on task-specific metrics can further refine model selection to better meet application requirements.
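The Pareto-front idea can be sketched in a few lines. The example below is illustrative only: the model names and numbers are made up, and it assumes just two metrics, a quality score where higher is better and a latency where lower is better.

```python
from typing import List, NamedTuple


class ModelResult(NamedTuple):
    name: str
    quality: float    # higher is better (hypothetical task-specific score)
    latency_s: float  # lower is better (seconds per image)


def dominates(a: ModelResult, b: ModelResult) -> bool:
    """True if a is at least as good as b on both metrics and strictly better on one."""
    no_worse = a.quality >= b.quality and a.latency_s <= b.latency_s
    strictly_better = a.quality > b.quality or a.latency_s < b.latency_s
    return no_worse and strictly_better


def pareto_front(results: List[ModelResult]) -> List[ModelResult]:
    """Keep every model that no other model dominates."""
    return [r for r in results if not any(dominates(o, r) for o in results)]


# Hypothetical candidates; none of these numbers come from the talk.
candidates = [
    ModelResult("large", quality=0.90, latency_s=60.0),
    ModelResult("fast", quality=0.85, latency_s=1.0),
    ModelResult("mid", quality=0.80, latency_s=5.0),
    ModelResult("tiny", quality=0.70, latency_s=0.5),
]
front_names = [m.name for m in pareto_front(candidates)]
print(front_names)  # ['large', 'fast', 'tiny']
```

Here "mid" drops out because "fast" is both better and quicker, while the three remaining models each win on some trade-off. Choosing among them is then a question of the application's latency and budget limits rather than of a single ranking.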

In conclusion, benchmarking is not obsolete but requires a more nuanced and comprehensive approach. Evaluations should be based on many samples, aligned with real-world use cases, and incorporate multiple quality and efficiency metrics. Pruna’s work focuses on developing high-performance, efficient models and providing open-source tools to help others optimize their models. The talk closes with a brief discussion on advanced model compression techniques, such as quantization, pruning, and caching, which can significantly reduce computational demands while maintaining performance.