MCP is "dead" - or is it?

Timothy Carambat argues that current tool-integration methods for large language models, such as MCP and skills, lead to excessive token usage because every tool is sent to the model regardless of whether it is needed. He advocates selective tool enabling and intelligent tool reranking, i.e. using a small specialized model to filter the tool list down to what is relevant for each prompt, which can cut token consumption by up to 90% while improving performance and cost-effectiveness.

In this video, Timothy Carambat, founder of AnythingLLM, discusses how to cut token usage by 60 to 90% when calling tools from large language models (LLMs), with a particular focus on local models. Tools, essentially functions or agents an LLM can call to perform specific tasks, consume tokens as if their definitions were part of the prompt. That becomes a problem when many tools are loaded but only a few are ever used: tokens are wasted, responses slow down, and on cloud platforms, where tokens translate directly into money, costs rise.
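To see why loaded tools cost tokens even when unused, consider that tool definitions are typically serialized as JSON schemas and shipped with every request. A minimal sketch, using hypothetical tool names and a crude characters-per-token heuristic (not from the video):

```python
import json

# Hypothetical tool definitions in the common JSON-schema style.
TOOLS = [
    {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
    {
        "name": "search_files",
        "description": "Search the user's documents for a query string.",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
]

def estimate_tool_tokens(tools, chars_per_token=4):
    """Rough estimate of the token overhead: the serialized schemas
    ride along with every prompt, whether or not a tool is called."""
    payload = json.dumps(tools)
    return len(payload) // chars_per_token

print(estimate_tool_tokens(TOOLS))
```

The estimate grows linearly with the number of tools attached, which is why a catalog of 100+ tools can dwarf the user's actual prompt.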

Timothy critiques the current popular approaches to tool integration, such as the Model Context Protocol (MCP) and skills-based systems. MCP was introduced as a unified framework for tool building and sharing, but it still suffers from token bloat because all tools are sent to the model regardless of whether they are used. Skills, while flexible, are unreliable because they depend heavily on the model’s behavior, which can change unpredictably due to factors like quantization and hyperparameter tuning by model providers. This unpredictability makes skills unsuitable for consistent tool execution and can lead to massive token waste.

To address these issues, Timothy highlights two practical solutions. First, users should be able to selectively enable or disable individual tools within an MCP, as LM Studio and AnythingLLM already allow. This simple UI feature keeps unnecessary tools out of the request and saves tokens immediately. Second, and more importantly, he introduces intelligent tool selection, or reranking: a small, specialized model analyzes the user's prompt and dynamically filters the list of available tools down to only the most relevant ones before they are sent to the LLM, drastically reducing token usage.
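The reranking step can be sketched as follows. This is not AnythingLLM's implementation: the tool catalog is hypothetical, and a simple word-overlap score stands in where a real system would call a small reranker or embedding model.

```python
# Hypothetical tool catalog; in practice this would come from the
# connected MCP servers' tool listings.
TOOLS = [
    {"name": "get_weather", "description": "Get current weather for a city"},
    {"name": "search_files", "description": "Search the user's documents"},
    {"name": "send_email", "description": "Send an email to a contact"},
    {"name": "run_query", "description": "Run a SQL query against a database"},
]

def relevance(prompt: str, tool: dict) -> int:
    """Stand-in score: overlap of non-trivial prompt words with the
    tool's name and description. A real reranker model goes here."""
    prompt_words = {w for w in prompt.lower().split() if len(w) > 3}
    tool_text = tool["name"].replace("_", " ") + " " + tool["description"]
    return len(prompt_words & set(tool_text.lower().split()))

def select_tools(prompt: str, tools: list, top_k: int = 2) -> list:
    """Filter the full catalog down to at most top_k relevant tools
    before the request is built, instead of shipping all of them."""
    ranked = sorted(tools, key=lambda t: relevance(prompt, t), reverse=True)
    kept = [t for t in ranked[:top_k] if relevance(prompt, t) > 0]
    return kept or ranked[:top_k]  # fall back rather than send nothing

print([t["name"] for t in select_tools("what's the weather in Paris?", TOOLS)])
```

Only the surviving tools are serialized into the request, so the token overhead scales with the handful of relevant tools rather than the whole catalog. The fallback branch is a design choice: if nothing scores above zero, it is safer to pass a few tools through than to strip them all.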

Timothy shares real-world results from AnythingLLM, where reranking a catalog of 114 tools cut token usage from 18,000 to 1,800, a 90% reduction. The reranker runs on every prompt, so the filtering stays current and the tools a request actually needs are still included. He notes that this approach is not yet widely adopted but argues it should become an industry standard, given its clear benefits in performance, cost savings, and user experience.

In conclusion, Timothy urges the community to be more mindful about token usage and context management in tool calling for LLMs. He encourages developers and users to implement or request features like tool filtering and intelligent tool selection to optimize token consumption. By doing so, local and cloud-based LLM applications can become more efficient, cost-effective, and reliable. He invites viewers to share their thoughts and unique approaches, highlighting the importance of collaboration in improving tool integration strategies.