Glad to have you back for the 41st chapter of this ongoing journey.
Today’s LLMs are becoming more powerful by tapping into external tools — whether it’s a calculator, a web search, or a database. But as the number of tools grows, so do the headaches. Prompts get bloated, and selecting the right tool for a given task becomes increasingly error-prone.

As shown on the right side of Figure 1, the RAG-MCP approach improves inference by first using a semantic retriever to filter the most relevant tools from the entire toolset, typically selecting only the top one. It then injects only the description of this selected tool into the model’s prompt. This significantly reduces the prompt’s token count and increases the model’s chances of choosing the correct tool.
In contrast, the traditional MCP method (Figure 1, left) includes descriptions of all available tools in the prompt at once. This leads to severe prompt bloat, forcing the model to sift through a sea of irrelevant information. As a result, it struggles to make accurate decisions and often fails to select the right tool.
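To make the contrast concrete, here is a minimal sketch of the retrieve-then-inject step. The embedding model (sentence-transformers), the toy `MCP_REGISTRY`, and the helper names are my own illustrative assumptions, not from the paper — which, notably, uses Qwen-max for encoding rather than a dedicated embedder.

```python
# A minimal sketch of RAG-MCP's retrieve-then-inject step.
# Assumptions: sentence-transformers as the embedding model and a
# plain cosine-similarity search; registry entries are hypothetical.

import numpy as np
from sentence_transformers import SentenceTransformer

# Toy MCP registry: tool name -> tool description.
MCP_REGISTRY = {
    "calculator": "Evaluates arithmetic expressions and returns the result.",
    "web_search": "Searches the web and returns the top result snippets.",
    "sql_query":  "Runs a read-only SQL query against an internal database.",
}

encoder = SentenceTransformer("all-MiniLM-L6-v2")

# Index tool descriptions once; only this index needs updating when
# tools are added or removed -- the LLM itself is untouched.
names = list(MCP_REGISTRY)
tool_vecs = encoder.encode(
    [MCP_REGISTRY[n] for n in names], normalize_embeddings=True
)

def retrieve_top_k(query: str, k: int = 1) -> list[str]:
    """Return the k tool names most semantically similar to the query."""
    q = encoder.encode([query], normalize_embeddings=True)[0]
    scores = tool_vecs @ q  # cosine similarity (vectors are unit-normalized)
    return [names[i] for i in np.argsort(scores)[::-1][:k]]

def build_prompt(query: str) -> str:
    """Inject only the selected tool's description, not the whole registry."""
    tool = retrieve_top_k(query, k=1)[0]
    return (
        f"You may use the following tool:\n"
        f"{tool}: {MCP_REGISTRY[tool]}\n\n"
        f"User query: {query}"
    )

print(build_prompt("What is 17 * 23?"))  # prompt mentions only the calculator
```

Note how the prompt length stays constant no matter how many tools are registered — that is the whole point of the approach.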

As shown in Figure 2, RAG-MCP begins by encoding the user query with Qwen-max, then retrieving and validating the top-k MCPs, and finally invoking the most appropriate one.
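Continuing the sketch above (and reusing its `retrieve_top_k` helper and `MCP_REGISTRY`), the three stages might look like the following. `validate()` and `invoke_mcp()` are hypothetical placeholders: the paper describes validating candidates before invocation, but I am guessing at the mechanics here.

```python
# A sketch of the three-stage pipeline in Figure 2:
# encode -> retrieve/validate top-k -> invoke.

def validate(tool: str) -> bool:
    """Hypothetical sanity check, e.g. pinging the MCP server or running
    a synthetic test query to confirm the tool responds as described."""
    return tool in MCP_REGISTRY  # stand-in: registry membership only

def invoke_mcp(tool: str, query: str) -> str:
    """Hypothetical call into the selected MCP server."""
    return f"[{tool}] handled: {query}"

def answer(query: str, k: int = 3) -> str:
    # 1. Encode the query and retrieve the k best-matching MCPs.
    candidates = retrieve_top_k(query, k=k)
    # 2. Validate candidates in ranked order; keep the first that passes.
    for tool in candidates:
        if validate(tool):
            # 3. Invoke the selected MCP with the original query.
            return invoke_mcp(tool, query)
    return "No suitable tool found."

print(answer("Find recent news about the Model Context Protocol"))
```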
Thoughts and Insights
Put simply, RAG-MCP is a smart way to compress prompts by including only what's truly relevant. While it is an innovative approach, I have a few concerns about the technical and implementation side.
First, the method heavily depends on the accuracy of the semantic retriever. If the retriever isn’t robust enough or suffers from semantic drift, the entire system could struggle to find the right tools.
Second, the core mechanism retrieves the top-K most relevant tools (with K typically set to 1) and injects only that tool’s schema into the LLM. However, RAG-MCP doesn’t explore what happens when K > 1, or how performance might change if multiple tools are invoked in sequence. This seems like a missed opportunity to examine potential trade-offs or performance gains.
Lastly, one of the selling points of RAG-MCP is that it doesn’t require retraining the language model—only updating the vector index. But in practice, if the retriever (say, Qwen-max) and the generator (another LLM) have different styles or reasoning patterns, their semantic understanding may not align, which could cause misinterpretations or inconsistent outputs.