Lecture 8 of Stanford’s CME295 course focuses on evaluating Large Language Models, exploring human and automated evaluation methods, including using LLMs themselves as judges to score outputs, along with the biases that approach introduces. It also covers how to assess agentic workflows and their common failure modes, and surveys the benchmark categories used to measure model performance, emphasizing iterative debugging and hands-on experimentation throughout.
In Lecture 8 of Stanford’s CME295 course on Transformers and Large Language Models (LLMs), the focus is on evaluating LLM performance, a critical step in understanding and improving these models. The lecture begins with a recap of previous topics, including retrieval-augmented generation (RAG), tool calling, and agentic workflows, which combine these techniques to let LLMs interact with external systems and carry out complex tasks. The main theme is how to quantify the quality of LLM outputs, which is challenging because responses are free-form and can range from natural language to code and mathematical reasoning.
The lecture discusses the limitations of human evaluation, highlighting the subjectivity and cost of having humans rate every LLM output. It introduces inter-rater agreement metrics such as Cohen’s kappa, which measure how consistently human raters agree beyond what chance alone would produce; these metrics help verify that rating guidelines are clear and that human evaluations are reliable. Because human evaluation remains slow and expensive, the lecture turns to automated, rule-based metrics like METEOR, BLEU, and ROUGE, which compare LLM outputs against fixed reference texts. These metrics are useful but handle stylistic variation poorly and often need human ratings for calibration.
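To make the chance-corrected agreement concrete, here is a minimal sketch of Cohen’s kappa for two raters; the function name and toy labels are illustrative and not taken from the lecture.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters assigning categorical labels to the same items."""
    assert len(rater_a) == len(rater_b)
    n = len(rater_a)
    # Observed agreement: fraction of items on which both raters give the same label.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected chance agreement, from each rater's marginal label distribution.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Two raters labeling five LLM answers as "good" / "bad" (hypothetical data).
print(cohens_kappa(["good", "good", "bad", "good", "bad"],
                   ["good", "bad", "bad", "good", "bad"]))  # ~0.62
```

A kappa near 1 indicates strong agreement beyond chance; values near 0 suggest the rating guidelines need tightening before the human labels can be trusted.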
A key innovation presented is the use of LLMs themselves as judges to evaluate responses. This approach leverages the LLM’s pre-trained knowledge and alignment with human preferences to score outputs and provide rationales explaining the scores. The lecture emphasizes best practices for using LLMs as judges, such as providing clear evaluation criteria, using binary scoring for simplicity, and prompting the model to output a rationale before the score to improve reliability. It also discusses common biases in LLM judging, including position bias, verbosity bias, and self-enhancement bias, and suggests mitigation strategies like majority voting and using different models for generation and evaluation.
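As an illustration of these best practices, the sketch below has the judge produce a rationale before a binary verdict and aggregates several judge calls by majority vote; the prompt wording and the `call_llm` placeholder are assumptions for the sketch, not code from the lecture.

```python
import json

# `call_llm` stands in for whatever client you use (hosted API or local model);
# its name and signature are assumptions.
JUDGE_PROMPT = """You are grading an assistant's answer.

Criteria: the answer must directly address the question and contain no factual errors.

Question: {question}
Answer: {answer}

First write a short rationale, then give a binary verdict.
Respond as JSON: {{"rationale": "...", "pass": true or false}}"""

def judge(question: str, answer: str, call_llm) -> dict:
    # Asking for the rationale before the verdict nudges the judge to reason
    # before committing to a score, which the lecture notes improves reliability.
    raw = call_llm(JUDGE_PROMPT.format(question=question, answer=answer))
    return json.loads(raw)  # assumes the judge returns strict JSON

def majority_vote(question: str, answer: str, call_llm, n: int = 3) -> bool:
    # Simple bias mitigation: sample the judge several times and keep the majority verdict.
    votes = [judge(question, answer, call_llm)["pass"] for _ in range(n)]
    return sum(votes) > n / 2
```

Using a different model as judge than the one that generated the answer, as the lecture suggests, is a further guard against self-enhancement bias.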
The lecture then shifts to evaluating agentic workflows, where LLMs interact with multiple tools in a loop of observe, plan, and act steps. It outlines common failure modes in tool prediction, tool calling, and response synthesis, such as failing to use the right tool, hallucinating nonexistent functions, providing incorrect arguments, or producing uninformative outputs. Remedies include improving tool router recall, refining API descriptions, upgrading models, and ensuring meaningful tool outputs. The importance of methodical error categorization and iterative debugging is stressed to improve agent reliability.
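One way to make the error categorization methodical is to tag each agent trace with a failure mode and tally the results; the trace fields and taxonomy below are an illustrative sketch of that idea, not tooling from the course.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class AgentTrace:
    expected_tool: str
    predicted_tool: Optional[str]
    call_succeeded: bool      # did the tool call execute with valid arguments?
    response_grounded: bool   # did the final answer actually use the tool output?

def categorize(trace: AgentTrace) -> str:
    # Mirrors the lecture's split into tool prediction, tool calling, and response synthesis.
    if trace.predicted_tool != trace.expected_tool:
        return "tool_prediction"      # wrong or missing tool (router recall issue)
    if not trace.call_succeeded:
        return "tool_calling"         # bad arguments or hallucinated function
    if not trace.response_grounded:
        return "response_synthesis"   # uninformative or ungrounded final answer
    return "success"

def failure_report(traces: List[AgentTrace]) -> Counter:
    # Tallying failure modes across traces points debugging at the biggest bucket first.
    return Counter(categorize(t) for t in traces)
```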
Finally, the lecture surveys the benchmark categories used to evaluate LLMs: knowledge benchmarks like MMLU, reasoning benchmarks such as AIME and physical commonsense reasoning, coding benchmarks exemplified by SWE-bench, and safety benchmarks like HarmBench. It highlights the value of benchmarks for profiling model strengths and weaknesses, alongside practical factors like cost and safety, and warns about data contamination risks when test items leak into training data. The lecture concludes by encouraging hands-on experimentation with models to find the best fit for a specific use case, underscoring that benchmarks are guides rather than absolute measures of model quality.
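As a generic illustration, a multiple-choice benchmark like MMLU is typically scored as exact-match accuracy over held-out items; the dataset layout and `predict_choice` function below are assumptions for the sketch, not an official evaluation harness.

```python
def accuracy(items, predict_choice):
    """items: list of dicts with 'question', 'choices', and a gold 'answer' letter (e.g. 'B')."""
    correct = 0
    for item in items:
        # The model (or a wrapper around it) returns a single choice letter.
        pred = predict_choice(item["question"], item["choices"])
        correct += (pred == item["answer"])
    return correct / len(items)
```

The resulting number is only a guide: it says nothing about cost, safety, or whether the test items leaked into the model's training data.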