Kobie Crawford from Snorkel emphasizes that high-quality, well-defined tasks and datasets are crucial for effective AI model training, demonstrating through experiments that superior task quality leads to significantly better model performance and more meaningful failure analysis. Snorkel’s approach combines rigorous task evaluation criteria, expert involvement, and scalable quality assessments to ensure datasets drive meaningful improvements in foundation AI models.
Kobie Crawford, a developer advocate at Snorkel, introduces the company’s focus on producing high-quality datasets for foundation AI models, emphasizing the critical role of data quality in model training and performance. Originating from Stanford University AI research, Snorkel integrates research with production work to ensure that the datasets it provides meet rigorous standards. Its core thesis is that task quality, closely tied to data quality, significantly impacts training outcomes, especially in agentic, Terminal-Bench-style tasks, where each environment is containerized for reproducibility and parallelization.
Crawford outlines four key criteria defining task quality: achievability, non-triviality, functional correctness, and environment reliability. Snorkel’s research team has developed a testing harness to verify these criteria, categorizing tasks into accepted (high-quality) and rejected (low-quality) buckets. Using models like GPT-4.5 and Codex, they compared task performance and found that accepted tasks required more tool calls, had lower pass rates indicating higher difficulty, and involved more reasoning steps, suggesting these tasks provide more meaningful challenges for model improvement.
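The accept/reject split described above can be pictured with a small sketch. This is not Snorkel's harness: the `TaskChecks` structure, the field names, the one-line readings of each criterion in the comments, and the rule that a task must pass all four checks to be accepted are assumptions made for illustration only.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class TaskChecks:
    """Hypothetical record of the four quality checks for one task."""
    task_id: str
    achievable: bool            # assumed meaning: some solution passes the tests
    non_trivial: bool           # assumed meaning: a no-op attempt does not pass
    functionally_correct: bool  # assumed meaning: tests check what the task asks
    environment_reliable: bool  # assumed meaning: the container runs repeatably


def failed_criteria(checks: TaskChecks) -> list[str]:
    """Return the names of the criteria that did not hold."""
    return [
        f.name
        for f in fields(checks)
        if f.name != "task_id" and not getattr(checks, f.name)
    ]


def bucket(checks: TaskChecks) -> str:
    """Assumed rule: accept a task only when every criterion holds."""
    return "rejected" if failed_criteria(checks) else "accepted"


good = TaskChecks("task-001", True, True, True, True)
flaky = TaskChecks("task-002", True, True, True, False)
print(bucket(good), bucket(flaky), failed_criteria(flaky))
```

Recording which criterion failed, rather than only the final bucket, would make it easier to tell a task that needs a spec fix from one that needs an environment fix.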
The team further analyzed failure modes, distinguishing between meaningful failures—where the model logically fails to complete a task—and failures caused by environmental or specification issues that no model could overcome. Their findings showed that accepted tasks tend to produce cleaner, more informative failures, which are valuable for training and improving models. In contrast, rejected tasks often result in failures that are less meaningful and more related to task design flaws, reinforcing the importance of rigorous task quality standards.
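One way to make this distinction measurable is to label each failed run by cause and report the share of failures attributable to the model itself. The sketch below is illustrative only: the three-way `Failure` labels and the `meaningful_failure_rate` helper are hypothetical, and the talk summary does not say how Snorkel labels or counts failures.

```python
from collections import Counter
from enum import Enum


class Failure(Enum):
    """Hypothetical failure labels for a single failed run."""
    MODEL = "model"                  # the agent's reasoning or actions were wrong
    ENVIRONMENT = "environment"      # the environment broke; no model could pass
    SPECIFICATION = "specification"  # the task or tests were flawed or unclear


def meaningful_failure_rate(failures: list[Failure]) -> float:
    """Fraction of failures caused by the model rather than by the task."""
    if not failures:
        return 0.0
    counts = Counter(failures)
    return counts[Failure.MODEL] / len(failures)


# Made-up labels, purely to show the calculation.
example = [Failure.MODEL, Failure.MODEL, Failure.ENVIRONMENT, Failure.SPECIFICATION]
print(meaningful_failure_rate(example))
```

Under this framing, a higher rate for a task set means more of its failures carry usable training signal.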
To validate the impact of task quality on model training, Snorkel conducted reinforcement learning experiments using the same model and compute budget but different task sets. The results demonstrated a significant performance uplift: about a 6% improvement with high-quality tasks compared to only a 1% improvement with low-quality tasks. This roughly sixfold difference in uplift (a gap of about five percentage points) underscores the importance of expert involvement in dataset creation and the value of maintaining high data and task quality to achieve better model outcomes.
In the Q&A, Crawford addresses questions about the influence of task specification, the challenges of iterative and multi-step tasks, and handling inter-annotator agreement. He acknowledges that underspecified tasks can appear harder due to mismatched test expectations and that real-world problem-solving often involves iterative processes not easily captured by one-shot benchmarks. Snorkel employs a combination of human experts and large language model judges using detailed rubrics to ensure high inter-annotator agreement and scalable quality assessment, even in complex or less verifiable domains. Overall, Snorkel’s work highlights the foundational role of data and task quality in advancing AI model capabilities.
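Agreement between a human expert and an LLM judge scoring the same items against a rubric can be quantified in several ways. The summary does not say which metric Snorkel uses, so the sketch below swaps in Cohen's kappa, a standard chance-corrected agreement statistic, as an assumption. The labels are invented for illustration.

```python
from collections import Counter


def cohens_kappa(rater_a: list[str], rater_b: list[str]) -> float:
    """Chance-corrected agreement between two raters on the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty item set")
    n = len(rater_a)
    # Observed agreement: share of items where the two labels match.
    observed = sum(x == y for x, y in zip(rater_a, rater_b)) / n
    # Expected agreement if each rater labeled at random with their own
    # label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    labels = set(freq_a) | set(freq_b)
    expected = sum(freq_a[label] * freq_b[label] for label in labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


human = ["pass", "pass", "fail", "fail"]  # hypothetical expert rubric verdicts
judge = ["pass", "pass", "fail", "pass"]  # hypothetical LLM-judge verdicts
print(cohens_kappa(human, judge))
```

A value near 1 indicates agreement well beyond chance, while a value near 0 suggests the rubric or the judge prompt may need tightening before the judge is used at scale.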