The video explains how configuring the Hermes agent's auxiliary models for background tasks such as context compression and web extraction can cut token costs by 85% or more, chiefly by swapping the expensive default models for cheaper or local ones. It demonstrates the savings with a live comparison and encourages users to customize these settings to improve performance and lower API expenses during extended sessions.
The video explores a lesser-known but valuable feature of the Hermes agent: its use of auxiliary models for background tasks, which can significantly reduce token costs, potentially by 85% or more. The agent runs eight hidden auxiliary tasks alongside the main chat model, handling functions such as compressing conversations, summarizing web pages, analyzing images, and writing to persistent memory. These tasks often default to expensive models or fall back to cheaper ones like Gemini Flash, but users can configure which model handles each task to balance cost and performance.
The eight auxiliary tasks are compression, web extract, vision, flush memories, session search, skills hub, MCP dispatch, and approval classification. Of these, compression is the costliest because it fires every time the conversation context reaches a threshold, which can happen frequently during heavy use. Flush memories, web extract, and vision also contribute to costs, though to a lesser extent. The video emphasizes that matching each task to an appropriately sized model can yield substantial savings.
Users can customize the auxiliary models in the Hermes agent’s config.yaml file, specifying providers and models for each task. The default “auto” setting cycles through several providers, which can lead to unexpected charges from models like Gemini Flash. By manually setting cheaper or local models for tasks such as compression or web extract, users can reduce costs dramatically. The video also highlights the advantage of running local models for these background tasks, which can eliminate token costs entirely.
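As a rough illustration, a per-task override in config.yaml might look like the sketch below. The key names (`aux_models`, `provider`, `model`) and provider values are assumptions for illustration, not the agent's confirmed schema; the real field names depend on the Hermes agent version:

```yaml
# Hypothetical sketch of per-task auxiliary model overrides.
# Key names are illustrative, not the agent's documented schema.
aux_models:
  compression:            # fires each time context hits its threshold
    provider: moonshot
    model: kimi-k2        # the cheap model used in the video's demo
  web_extract:
    provider: ollama      # a local model eliminates token costs entirely
    model: llama3.1:8b
  vision:
    provider: auto        # leave lower-cost tasks on the default
  flush_memories:
    provider: auto
```

Pinning only the top cost drivers (compression, web extract) while leaving the rest on "auto" keeps the config small and captures most of the savings the video describes.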
A live demonstration compares the cost of compressing a 50,000-token context window using a high-end model (Claude Opus) versus a cheaper model (Kimi K2). The expensive model cost about 13 cents per compression, while the cheaper model cost just under 2 cents, representing an 85% cost reduction. Since compression happens frequently during active sessions, these savings can add up to significant monthly reductions in API expenses, especially for users who engage in long coding or research sessions.
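The savings math from the demo can be checked in a few lines. The per-compression prices are the video's figures; the monthly session frequency is a hypothetical assumption added for illustration:

```python
# Per-compression prices from the video's 50,000-token demo.
COST_OPUS = 0.13    # ~13 cents per compression (Claude Opus)
COST_KIMI = 0.019   # just under 2 cents per compression (Kimi K2)

# Relative reduction per compression.
reduction = 1 - COST_KIMI / COST_OPUS
print(f"per-compression reduction: {reduction:.0%}")  # → ~85%

# Hypothetical heavy-use month: 30 compressions/day for 30 days.
compressions_per_month = 30 * 30
monthly_saved = compressions_per_month * (COST_OPUS - COST_KIMI)
print(f"monthly savings at that rate: ${monthly_saved:.2f}")
```

At the assumed rate the cheaper model saves on the order of $100 per month on compression alone, which is why the video singles it out as the first task worth reconfiguring.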
In conclusion, the video encourages users to leverage Hermes agent’s auxiliary model configuration to optimize costs and performance. Focusing on the top four costly tasks—compression, flush memories, web extract, and vision—and assigning suitable models can lead to substantial savings. The presenter also promotes a free weekly newsletter for AI news and updates and hints at upcoming deep-dive tutorials on Hermes agent, inviting viewers to subscribe and engage with future content.