Unlock Local LLMs in VS Code

The video explains how Visual Studio Code Insiders (the “green” version) enables users to flexibly manage and use local or remote large language models (LLMs), allowing AI-assisted coding without relying solely on paid cloud services. It demonstrates configuring local models like Qwen3 Coder on a MacBook and powerful remote models on GPU clusters, highlighting enhanced privacy, cost savings, and advanced AI integration compared to the regular VS Code.

The video discusses the different versions of Visual Studio Code (VS Code), focusing on the “green” version called Visual Studio Code Insiders and the “blue” regular version. The green Insiders version often receives new features earlier than the regular version, making it appealing for users who want to try out the latest capabilities. The presenter highlights a new feature in the Insiders version that allows users to manage and use AI models more flexibly, including hosting their own local or remote large language models (LLMs), which is not fully supported in the regular VS Code.

In the regular VS Code, users can access AI models like OpenAI’s GPT-4, Claude, and others, but these are typically paid services and do not support hosting custom or local models beyond very limited options. The Insiders version, however, provides a more advanced interface for managing models, including adding custom OpenAI-compatible models hosted on other machines or locally. The presenter demonstrates how to configure these models by editing the VS Code settings JSON file, specifying model IDs, URLs, token limits, and other capabilities such as tool calling or vision support.
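As a rough illustration of what such a settings JSON entry looks like, here is a hedged sketch of registering a custom OpenAI-compatible model. The setting key, field names, and values below are assumptions based on the description above and may differ between Insiders builds; check the Manage Models UI in your build for the exact schema.

```json
{
  "github.copilot.chat.customOAIModels": {
    "qwen3-coder-30b": {
      "name": "Qwen3 Coder 30B (local)",
      "url": "http://localhost:1234/v1",
      "toolCalling": true,
      "vision": false,
      "maxInputTokens": 32768,
      "maxOutputTokens": 4096,
      "requiresAPIKey": false
    }
  }
}
```

The same shape works for a remote machine: point `url` at the server hosting the model instead of `localhost`.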

To showcase the flexibility of the Insiders version, the presenter runs LM Studio on a local MacBook Pro with an M4 Max chip, loading Qwen3 Coder, a smaller 30-billion-parameter model. By configuring VS Code Insiders to connect to this local model, they demonstrate how to interact with the AI directly within the editor without relying on cloud services. This setup allows for cost-free, private AI assistance running entirely on the user’s machine, which is particularly useful for developers who want to avoid sending code to external servers.
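LM Studio exposes its loaded model through an OpenAI-compatible HTTP server, which is what lets VS Code (or any other client) talk to it. The sketch below shows the shape of a chat-completions request to such a server; the model identifier and the default `localhost:1234` address are assumptions and must match what LM Studio actually reports on your machine.

```python
import json
import urllib.request

# LM Studio's local server defaults to this address (configurable in the app).
BASE_URL = "http://localhost:1234/v1"

def build_chat_request(model: str, prompt: str, max_tokens: int = 512) -> dict:
    """Assemble a /chat/completions request body for an OpenAI-compatible server."""
    return {
        "model": model,  # must match the identifier LM Studio shows for the loaded model
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": 0.2,  # low temperature for more deterministic coding answers
    }

def send_chat_request(payload: dict) -> dict:
    """POST the payload to the local server and return the parsed JSON response."""
    req = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

payload = build_chat_request("qwen3-coder-30b", "Explain this function to me.")
```

Because the API surface is the same as the cloud providers', pointing `BASE_URL` at a remote GPU box instead of `localhost` is the only change needed to switch between local and remote models.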

For more powerful AI models, the presenter highlights running Kimi K2 Thinking, a trillion-parameter model, on a remote server cluster equipped with Nvidia H200 GPUs. This setup, hosted on a service called Sircale, provides extremely fast and capable AI assistance while maintaining privacy, since the code is not sent to the big corporate AI providers. The presenter shows how the AI can analyze and refactor code efficiently, emphasizing the potential of combining local and remote resources for advanced AI-powered development workflows.

Finally, the video touches on some current limitations and ongoing improvements, such as certain features not yet fully integrated into the agent mode of VS Code Insiders. The presenter encourages viewers to explore the green Insiders version to access cutting-edge AI features and hints at future developments that will further enhance local and remote AI model integration. They also mention related content about running large models efficiently on GPU clusters and invite viewers to engage with comments for more information.