We Tested 4 Published Evasion Techniques Against Our MCP Scanner. All Failed.
Attackers are getting smarter about hiding malicious intent in MCP tool descriptions. We crafted 20 adversarial descriptions using real-world evasion techniques and tested them against MCPShield. Both our regex engine and LLM judge caught every single one.
Key Results
Background: The MCP Attack Surface
MCP (Model Context Protocol) servers expose tools to AI assistants like Claude, GPT, and Cursor. Each tool has a description that tells the AI what it does. Attackers embed hidden instructions in these descriptions to manipulate the AI into leaking credentials, reading sensitive files, or sending data to external servers.
MCPShield detects these attacks with 91 rules across a 3-layer architecture: regex pattern matching, deep schema inspection, and an LLM-as-judge powered by Claude Haiku. But how well does this hold up against deliberate evasion attempts?
The 4 Evasion Techniques We Tested
The AutoMalTool paper (arxiv 2509.21011) describes techniques attackers use to evade tool description scanners. We implemented all four categories and crafted 5 descriptions for each (20 total):
1. Semantic Rephrasing
Replace malicious keywords with benign-sounding synonyms. Instead of "exfiltrate credentials", say "compliance validation of environment configuration values".
Detected by: TP-011 (credential harvesting), AUTH-001, META-006
2. Legitimacy Framing
Frame malicious actions as standard enterprise features. "Built-in telemetry for quality assurance" sounds normal until you notice it includes auth tokens in the telemetry payload.
Detected by: IV-004 (sensitive data in description), AUTH-001, META-006
3. Multi-Step Obfuscation
Split the attack across multiple tool descriptions or embed it in tool chain instructions. One tool tells the AI to "always run credential_refresher first", creating an attack chain.
Detected by: TP-001 (cross-tool reference), AUTH-001, META-006
4. Jargon Obfuscation
Use technical jargon to disguise data access. "Inode metadata traversal across filesystem namespaces" sounds like a performance tool but is actually reading /root and /etc.
Detected by: TP-003 (credential reference), TP-005 (sensitive path), AUTH-001
Why MCPShield Catches These
Our regex engine doesn't just look for keywords like "exfiltrate" or "steal". It detects the structural patterns that attacks require:
- -References to sensitive paths (.ssh, .aws, .env) regardless of framing language
- -Cross-tool references in descriptions (tool A telling the AI to call tool B)
- -Data harvesting language like "include full context", "send all history" even when wrapped in enterprise jargon
- -Cloud metadata endpoints (169.254.169.254) no matter what the surrounding text says
The LLM judge adds a second layer that understands intent. It correctly identified every adversarial description as malicious and provided accurate severity ratings (critical for credential theft, high for filesystem access).
MCPTox Benchmark: 485/485
We also ran the full MCPTox dataset (485 poisoned tool descriptions) through both detection layers. The regex scanner caught all 485. The LLM judge, being conservative by design, flagged 28 of the most obviously manipulative ones. Nothing evaded both layers.
Open Data
We've published our scan results as an open dataset on HuggingFace: 1,901 MCP server repositories scored with our 91 detection rules. This is the first public labeled MCP security dataset.
MCPShield/mcp-security-scan-2026 on HuggingFace↗Want to test your MCP server against our 91 detection rules?
Scan Your MCP Server