Kobie Crawford from Snorkel argued that improving model behavior through disciplined tool use and high-quality data, rather than simply increasing model size, leads to better and more efficient performance, as demonstrated by a smaller fine-tuned model outperforming a much larger one in financial analysis tasks. This approach, supported by reinforcement learning and expert-verified data, reduces costs and complexity while enhancing reliability and deployability in enterprise settings.
Kobie Crawford from Snorkel delivered a presentation emphasizing the importance of improving model behavior over simply increasing model size. He noted that although a larger model often looks like the straightforward route to better performance, especially in enterprise contexts such as financial analysis, it can be inefficient and costly. Snorkel’s research, conducted in collaboration with UC Berkeley’s RLLM team, instead showed that a smaller 4-billion-parameter model, fine-tuned with reinforcement learning (RL) on high-quality data, could outperform a much larger 235-billion-parameter model on a tool-use task.
The core challenge was getting a model to use tools effectively within a constrained environment for financial analysis. The larger model, despite its advanced reasoning capabilities, failed to interact properly with the environment’s tools, which led to hallucinated or incorrect answers. In contrast, the fine-tuned smaller model learned to systematically discover the available tables, inspect their schemas, handle errors, and self-correct its queries. This disciplined tool use was critical to accurate performance.
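The discover, inspect, query, and self-correct pattern described above can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the talk does not specify the environment's tool interface, so the tool names (`list_tables`, `describe_table`, `run_query`), the in-memory SQLite database, the `income_statement` table, and the scripted episode standing in for a model's decisions are all hypothetical.

```python
import sqlite3


def list_tables(conn):
    """Hypothetical tool 1: discover which tables exist."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    return [name for (name,) in rows]


def describe_table(conn, table):
    """Hypothetical tool 2: inspect a table's schema, returning column names."""
    # PRAGMA table_info yields (cid, name, type, notnull, dflt_value, pk).
    return [row[1] for row in conn.execute(f'PRAGMA table_info("{table}")')]


def run_query(conn, sql):
    """Hypothetical tool 3: run SQL, returning errors as data, not exceptions."""
    try:
        return {"ok": True, "rows": conn.execute(sql).fetchall()}
    except sqlite3.Error as exc:
        return {"ok": False, "error": str(exc)}


def total_revenue_episode(conn):
    """Scripted stand-in for a model's tool calls; returns (answer, trace)."""
    trace = []
    table = list_tables(conn)[0]
    trace.append(("list_tables", True))
    columns = describe_table(conn, table)
    trace.append(("describe_table", True))

    # The first attempt deliberately guesses a column name that does not exist.
    result = run_query(conn, f'SELECT SUM(sales) FROM "{table}"')
    trace.append(("run_query", result["ok"]))
    if not result["ok"]:
        # Self-correct: choose a column from the inspected schema, then retry.
        column = next(c for c in columns if "revenue" in c.lower())
        result = run_query(conn, f'SELECT SUM("{column}") FROM "{table}"')
        trace.append(("run_query", result["ok"]))

    answer = result["rows"][0][0] if result["ok"] else None
    return answer, trace


def build_demo_db():
    """Create a tiny illustrative database (made-up figures)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE income_statement (quarter TEXT, revenue REAL)")
    conn.executemany(
        "INSERT INTO income_statement VALUES (?, ?)",
        [("Q1", 100.0), ("Q2", 150.0)],
    )
    return conn


if __name__ == "__main__":
    answer, trace = total_revenue_episode(build_demo_db())
    print(answer, trace)
```

The point of the sketch is that the query error comes back as an ordinary return value the caller can react to. A model that skips discovery and ignores the error has nothing to ground its answer in, which matches the hallucination failure the talk attributes to the larger model.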
Snorkel’s approach to data quality was central to this success. They emphasized expert involvement in data generation and verification to ensure the training data was precise and relevant. This high-quality dataset enabled the RL process to effectively teach the model the necessary behaviors for tool use. The RL training was efficient, costing under $500 and completing within about 21 hours, demonstrating that significant performance improvements can be achieved without prohibitive expense.
The evaluation revealed interesting insights about training strategies. Training solely on single-table queries yielded the greatest performance uplift, even improving the model’s ability to handle more complex multi-table queries. This finding underscored that mastering fundamental tool-use behaviors was more impactful than focusing solely on complex reasoning tasks. The team also highlighted the value of using rubrics in evaluation to diagnose specific behavioral issues, guiding targeted data generation and model improvement.
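As a rough illustration of how a rubric can diagnose specific behavioral issues, the Python sketch below scores tool-call traces against a few behavioral criteria and counts which criterion fails most often. The criteria, the trace format (a list of `(tool_name, succeeded)` pairs), and the tool names are assumptions made for illustration; the talk does not publish the actual rubric.

```python
from collections import Counter


def called_before_first_query(trace, tool):
    """True if `tool` appears in the trace before the first run_query call."""
    for name, _ok in trace:
        if name == "run_query":
            return False
        if name == tool:
            return True
    return False


def recovered_from_errors(trace):
    """True if no query failed, or if the final query succeeded after failures."""
    results = [ok for name, ok in trace if name == "run_query"]
    if all(results):
        return True
    return results[-1]


# Hypothetical rubric: criterion name -> check applied to one trace.
RUBRIC = {
    "discovered_tables": lambda t: called_before_first_query(t, "list_tables"),
    "inspected_schema": lambda t: called_before_first_query(t, "describe_table"),
    "recovered_from_errors": recovered_from_errors,
}


def score_episode(trace):
    """Return a pass/fail verdict per rubric criterion for one episode."""
    return {name: bool(check(trace)) for name, check in RUBRIC.items()}


def failure_counts(traces):
    """Count how many episodes fail each criterion."""
    counts = Counter({name: 0 for name in RUBRIC})
    for trace in traces:
        for name, passed in score_episode(trace).items():
            if not passed:
                counts[name] += 1
    return counts


if __name__ == "__main__":
    episodes = [
        [("list_tables", True), ("describe_table", True), ("run_query", True)],
        [("run_query", False)],
        [("list_tables", True), ("run_query", False), ("run_query", True)],
    ]
    print(failure_counts(episodes).most_common())
```

A per-criterion breakdown like this shows which behavior to target with new training data, which a single accuracy number cannot do.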
In conclusion, the presentation challenged the prevailing notion that bigger models are always better. Instead, it advocated for focusing on the right behaviors, particularly disciplined tool use, supported by high-quality data and efficient RL training. This approach not only reduces costs and complexity but also enhances model reliability and deployability in sensitive enterprise environments. Additional details and resources about the study are available through linked blog posts from Snorkel and their UC Berkeley partners.