In this session, Julia from the VS Code team explains the complexities of integrating AI models into VS Code, highlighting the need for customized harness configurations and close collaboration with model providers to optimize model behavior and performance. She also emphasizes the importance of rigorous offline benchmarking and iterative testing to ensure reliable, high-quality AI coding assistance, encouraging developers to experiment and provide feedback for continuous improvement.
In this session from Microsoft Build, Julia, a product manager on the VS Code team, discusses what it takes to ship AI models in the VS Code extension. Integrating a new model is far from straightforward: it requires close collaboration with model providers to understand each model’s characteristics and to tune how the model behaves inside VS Code. Julia emphasizes that even models within the same family can behave differently, which means adjusting the harness, the system that runs the model.
The harness is the component that manages how the model operates within VS Code: how it processes system prompts, calls built-in tools, manages context, and handles agent loops. Each new model requires a tailored harness configuration to work smoothly. Julia demonstrates this with debug logs that reveal what the harness does behind the scenes, highlighting differences in system prompts and tool usage between models like GPT-5.5 and Anthropic’s models. This per-model customization is what lets the assistant perform well in the coding environment.
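The idea of a tailored harness configuration per model can be sketched as a small lookup table. This is only an illustration, not the VS Code implementation: `HarnessConfig`, its fields, the tool names, and the `model-a` prefixes are all hypothetical, chosen to mirror the four responsibilities listed above.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class HarnessConfig:
    """Hypothetical per-model harness settings (field names are illustrative)."""
    system_prompt: str
    enabled_tools: tuple
    max_context_tokens: int
    max_agent_turns: int


# Baseline used when a model has no tailored configuration.
DEFAULT_CONFIG = HarnessConfig(
    system_prompt="You are a coding assistant working inside an editor.",
    enabled_tools=("read_file", "edit_file", "run_in_terminal"),
    max_context_tokens=100_000,
    max_agent_turns=20,
)

# Hypothetical overrides keyed by model-name prefix. Two entries share a
# family prefix to reflect that models in one family can need different settings.
MODEL_OVERRIDES = {
    "model-a": {"system_prompt": "Plan briefly, then call tools."},
    "model-a-mini": {
        "max_agent_turns": 10,
        "enabled_tools": ("read_file", "edit_file"),
    },
}


def config_for(model_name: str) -> HarnessConfig:
    """Return the default config with the most specific matching override applied.

    Only the longest matching prefix is used; overrides are not merged
    across entries.
    """
    matching = [prefix for prefix in MODEL_OVERRIDES if model_name.startswith(prefix)]
    if not matching:
        return DEFAULT_CONFIG
    most_specific = max(matching, key=len)
    return replace(DEFAULT_CONFIG, **MODEL_OVERRIDES[most_specific])
```

Starting every model from one default and overriding only what differs matches the workflow described in the session, where the team often begins with a previous configuration and refines it. A real harness would likely carry far more state than these four fields.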
Julia also touches on the non-deterministic nature of AI models: even the same model can produce different outputs on repeated runs. This variability adds another layer of complexity to integrating and testing models. She describes how the team works closely with model providers to refine system prompts and tool interactions, often starting from a previous configuration and improving it iteratively based on extensive testing and feedback. This collaborative process is essential to delivering a reliable and effective AI coding assistant.
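One common way to cope with run-to-run variability when testing is to repeat the same task several times and report a pass rate instead of a single pass or fail. The sketch below assumes a task can be wrapped as a function returning `True` on success; `make_flaky_task` is a hypothetical stand-in that simulates a non-deterministic model with a random number generator, not a real model call.

```python
import random
from typing import Callable


def pass_rate(run_task: Callable[[], bool], trials: int) -> float:
    """Run the same task `trials` times and return the fraction that succeeded."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    passes = sum(1 for _ in range(trials) if run_task())
    return passes / trials


def make_flaky_task(success_probability: float, seed: int) -> Callable[[], bool]:
    """Hypothetical stand-in for a model run that succeeds only some of the time."""
    rng = random.Random(seed)  # seeded so the simulation itself is repeatable

    def run() -> bool:
        return rng.random() < success_probability

    return run


if __name__ == "__main__":
    task = make_flaky_task(success_probability=0.7, seed=0)
    print(f"pass rate over 20 trials: {pass_rate(task, trials=20):.2f}")
```

A single run of a flaky task says little about a configuration change, while a pass rate over many trials gives a steadier signal to compare before and after.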
To ensure quality and performance, the VS Code team relies on rigorous evaluation, including offline benchmarks and internal test suites such as VSC Bench, which contains over 100 coding tasks. These evaluations measure the resolution rate, the share of tasks a model completes successfully, and surface regressions or improvements across model versions. Julia explains that the benchmarks serve as both regression tests and optimization tools, allowing the team to fine-tune the harness and system prompts before and after model launches and keep performance high for users.
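The two roles of a benchmark described here, measuring a resolution rate and catching regressions, can be sketched in a few lines. The result format (a mapping from task name to pass or fail) and the task names are assumptions for illustration; they do not reflect how VSC Bench actually stores its results.

```python
from typing import Dict, List


def resolution_rate(results: Dict[str, bool]) -> float:
    """Fraction of benchmark tasks that were resolved."""
    if not results:
        raise ValueError("results must contain at least one task")
    return sum(results.values()) / len(results)


def regressions(baseline: Dict[str, bool], candidate: Dict[str, bool]) -> List[str]:
    """Tasks that the baseline resolved but the candidate did not.

    A task missing from the candidate results is treated as unresolved.
    """
    return sorted(
        task
        for task, resolved in baseline.items()
        if resolved and not candidate.get(task, False)
    )


if __name__ == "__main__":
    # Hypothetical results for two model or harness versions.
    baseline = {"fix-bug": True, "add-test": True, "refactor": False}
    candidate = {"fix-bug": True, "add-test": False, "refactor": True}

    print(f"baseline:  {resolution_rate(baseline):.2f}")
    print(f"candidate: {resolution_rate(candidate):.2f}")
    print("regressed tasks:", regressions(baseline, candidate))
```

In this made-up data both versions resolve two of three tasks, so the aggregate rate is unchanged even though one task regressed. Comparing per-task results alongside the overall rate is what lets a benchmark act as a regression test as well as a score.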
In closing, Julia encourages developers to keep experimenting with new models and provide feedback to help improve the system. She highlights the ongoing nature of model optimization and the importance of offline evaluations in maintaining quality. For those building their own AI coding tools, she recommends exploring the concept of a coding harness and leveraging benchmarks to systematically assess and enhance model performance. This approach has been key to the success of Microsoft’s AI integrations and is valued by model providers as well.