How we ship models

Julia, a product manager for VS Code, explains how new AI models are integrated into the editor. The team combines automated evaluations (internal and open-source benchmarks) with manual testing to ensure quality, stability, and compatibility despite the models’ inherent variability. It collaborates closely with model providers and tests systematically before releasing models to users, aiming to deliver a seamless and reliable developer experience.

Julia, a product manager for Visual Studio Code (VS Code), explains how new AI models are shipped and integrated into VS Code. She highlights the frequent release of new models, such as Sonnet 4.6, Opus, and Gemini 3.1 Pro, and discusses the importance of ensuring these models work seamlessly within the developer environment. Julia emphasizes the need to maintain high quality, stability, and performance with each new model release, so that developers experience minimal disruption and consistently reliable tools.

A major challenge in integrating AI models is their inherent nondeterminism: the same prompt can yield a different output each time it is run. Julia demonstrates this by asking three different models (Sonnet 4.6, Opus, and Gemini) to analyze the same soccer data in a Jupyter notebook, resulting in varied outputs and visualizations. This variability makes it difficult to assess model quality and consistency, especially when releasing new models to a broad user base.
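As a rough illustration of why identical prompts can produce different results, the toy sketch below samples a "completion" from a weighted distribution, which is loosely how sampling-based generation behaves. The completions, weights, and function names are invented for this example and do not come from any real model.

```python
import random
from typing import List, Optional

# Hypothetical completions and probabilities, invented for illustration only.
COMPLETION_WEIGHTS = {"bar chart": 0.5, "line plot": 0.3, "heat map": 0.2}


def sample_completion(rng: random.Random) -> str:
    """Pick one completion at random, weighted by its probability."""
    options = list(COMPLETION_WEIGHTS)
    weights = list(COMPLETION_WEIGHTS.values())
    return rng.choices(options, weights=weights, k=1)[0]


def sample_runs(times: int, seed: Optional[int] = None) -> List[str]:
    """Simulate sending the same prompt `times` times.

    With seed=None each call can differ; a fixed seed makes the
    sequence reproducible, which is useful for debugging.
    """
    rng = random.Random(seed)
    return [sample_completion(rng) for _ in range(times)]


if __name__ == "__main__":
    # The same "prompt" sent five times may not give five identical answers.
    print(sample_runs(5))
```

Fixing the seed makes this toy reproducible, but hosted models do not necessarily expose that kind of control, which is one reason a single run says little about quality.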

To address this, the VS Code team relies on deterministic measurements through evaluations, or “evals.” Evals are automated tests that provide repeatable, objective assessments of a model’s performance on specific tasks. These evaluations are integrated into the release pipeline and are run offline before a model is made public. The team uses both open-source benchmarks, like SWE-bench, and their own internal benchmark, VSCBench, to test models across a variety of developer workflows and programming languages.
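One common way to get a stable number out of a variable model is to run the same task several times and report the fraction of runs that pass a fixed check. The sketch below assumes that approach; it is not a description of how the VS Code team's pipeline works, and `pass_rate` and `make_cycling_model` are hypothetical names.

```python
import itertools
from typing import Callable, Iterable

Model = Callable[[str], str]   # prompt -> output; stand-in for a real model call
Check = Callable[[str], bool]  # output -> did it meet the requirement?


def pass_rate(model: Model, prompt: str, check: Check, trials: int = 5) -> float:
    """Run the same prompt `trials` times and return the fraction that pass."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    passed = sum(1 for _ in range(trials) if check(model(prompt)))
    return passed / trials


def make_cycling_model(outputs: Iterable[str]) -> Model:
    """Fake model that cycles through canned outputs to mimic variability."""
    canned = itertools.cycle(list(outputs))
    return lambda prompt: next(canned)


if __name__ == "__main__":
    model = make_cycling_model(["def add(a, b): return a + b", "TODO"])
    rate = pass_rate(
        model,
        prompt="Write an add function",
        check=lambda out: "return a + b" in out,
        trials=4,
    )
    print(f"pass rate: {rate:.2f}")
```

The check is fixed and the trial count is explicit, so two models can be compared on the same footing even though individual outputs vary.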

VSCBench, developed internally by the VS Code team, consists of over 50 developer-focused test cases designed to reflect real-world usage in VS Code. Each test case includes a prompt and a set of assertions—core requirements that define success for the model’s output. This approach allows the team to systematically compare new models against previous ones, catch regressions, and ensure compatibility with VS Code’s built-in tools and diverse language support.
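The prompt-plus-assertions structure described above could be modeled roughly as follows. VSCBench's actual format is not shown in the source, so every name here (`EvalCase`, `Assertion`, `failed_assertions`, `find_regressions`) and the sample case are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class Assertion:
    description: str
    check: Callable[[str], bool]  # model output -> requirement met?


@dataclass(frozen=True)
class EvalCase:
    name: str
    prompt: str
    assertions: Tuple[Assertion, ...]


def failed_assertions(case: EvalCase, output: str) -> List[str]:
    """Return descriptions of the assertions the output does not satisfy."""
    return [a.description for a in case.assertions if not a.check(output)]


def find_regressions(
    cases: List[EvalCase],
    baseline_outputs: Dict[str, str],
    candidate_outputs: Dict[str, str],
) -> List[str]:
    """Names of cases the baseline model passed but the candidate fails."""
    regressions = []
    for case in cases:
        baseline_ok = not failed_assertions(case, baseline_outputs[case.name])
        candidate_ok = not failed_assertions(case, candidate_outputs[case.name])
        if baseline_ok and not candidate_ok:
            regressions.append(case.name)
    return regressions


# Hypothetical test case; the prompt and checks are invented for this sketch.
EXAMPLE_CASE = EvalCase(
    name="fizzbuzz-python",
    prompt="Write a Python function fizzbuzz(n).",
    assertions=(
        Assertion("defines fizzbuzz", lambda out: "def fizzbuzz(" in out),
        Assertion("checks divisibility by 3", lambda out: "% 3" in out),
    ),
)
```

Because each assertion carries its own description, a failure report can say which requirement was missed rather than only that the case failed, which helps when sharing feedback about a regression.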

The release process involves close collaboration with model providers, prompt updates, and a combination of automated and manual testing. Feedback is shared in a closed loop with providers to refine both the models and their integration. Once a model passes all evaluations and manual checks, it is released to users, appearing seamlessly in the model picker within VS Code. Julia concludes by inviting the community to engage with the team if they are interested in VSCBench or related topics, underscoring the ongoing commitment to delivering the best developer experience.
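The final gate, where a model ships only after passing all evaluations and manual checks, can be pictured as a simple conjunction. This is a minimal sketch under that reading of the process; `ReleaseReport`, `blockers`, and `ready_to_release` are hypothetical names, not part of any VS Code tooling.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ReleaseReport:
    # Check name -> passed? Both maps are filled in before the gate is evaluated.
    eval_results: Dict[str, bool] = field(default_factory=dict)
    manual_checks: Dict[str, bool] = field(default_factory=dict)


def blockers(report: ReleaseReport) -> List[str]:
    """List every failing automated eval or manual check."""
    failed = [f"eval: {name}" for name, ok in report.eval_results.items() if not ok]
    failed += [f"manual: {name}" for name, ok in report.manual_checks.items() if not ok]
    return failed


def ready_to_release(report: ReleaseReport) -> bool:
    """True only if both kinds of checks were run and none failed."""
    has_evidence = bool(report.eval_results) and bool(report.manual_checks)
    return has_evidence and not blockers(report)
```

Requiring non-empty results is a deliberate choice in this sketch: an empty report would otherwise pass trivially, since nothing in it failed.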