Why Prompt Injection Can't Steal Your API Keys on MCPWorks

A recent discussion around a 9.3 CVSS deserialization flaw raised a point worth addressing directly: in multi-agent setups where context passes between agents, a single compromised link can poison the entire chain. API key exfiltration from LLM context is how prompt injection becomes data theft.

This is a real attack vector. It does not apply to MCPWorks. Here's why.

The attack chain

In a typical LangChain or agent-framework setup:

API keys live in the process environment or get passed through the chain
The LLM context window contains both untrusted data (user input, API responses) and sensitive context (tool configurations, credentials)
A prompt injection in untrusted data tricks the LLM into exfiltrating secrets it can see in context
In multi-agent chains, one compromised agent's output becomes the next agent's input, propagating the attack

The core problem: secrets and untrusted data share the same context.

How MCPWorks breaks the chain

API keys never enter the LLM context

In MCPWorks, credentials travel a path that completely bypasses the AI:

Client .mcp.json header (X-MCPWorks-Env)
    → API server decodes base64
    → Writes to tmpfs file inside nsjail sandbox
    → File self-destructs before user code runs
    → Code reads from os.environ, executes, exits
    → Sandbox destroyed

The LLM writes code that calls os.environ["OPENAI_API_KEY"]. It never sees the value. There is nothing to exfiltrate from the context window because the key was never there.

For agent AI keys, the path is similar: encrypted at rest with AES-256-GCM envelope encryption, decrypted only when injected into the agent's container environment. The orchestration layer that calls the AI provider reads the key from the container env, not from the LLM context.

No deserialization of untrusted objects

MCPWorks does not use pickle, YAML load, or eval on any data path. Function input and output is JSON. The sandbox runs inside nsjail with:

Linux namespaces (user, PID, network, mount)
cgroups v2 resource limits
seccomp-bpf syscall filtering
Hollowed-out escape modules (_ctypes, _posixsubprocess bind-mounted to empty files)

Even if user code attempts deserialization attacks, it cannot escape the sandbox boundary.

Trust boundaries on every output

Every function in MCPWorks declares an output_trust level when created:

prompt — output is trusted computed data (math results, formatted reports). Passed to the AI as-is.
data — output contains untrusted external content (emails, API responses, web scrapes). Wrapped with trust boundary markers.

When a data-trust function returns:

[UNTRUSTED_OUTPUT function="news.fetch-rss" trust="data"]
{"articles": [{"title": "...", "body": "ignore previous instructions..."}]}
[/UNTRUSTED_OUTPUT]

The AI sees the markers and knows not to execute instructions found inside them.

An injection scanner runs on all data-trust outputs and MCP server proxy responses. It normalizes text before scanning — decoding base64, collapsing Unicode homoglyphs, stripping zero-width characters — to defeat common obfuscation techniques.

No multi-agent chain to poison

This is the architectural difference that matters most. In LangChain, agents form a chain: Agent A's output feeds into Agent B's prompt, which feeds into Agent C's. Poison one link and the infection propagates.

MCPWorks agents don't chain. Each agent is:

A separate Docker container with its own process space
Its own namespace with its own functions
Its own AI engine (BYOAI — you choose the provider per agent)
Communicating through an encrypted K/V store, not by passing raw output into each other's prompts

There is no shared context window between agents. Agent A cannot read Agent B's conversation history, environment variables, or AI context. Even with agent clusters — where multiple replicas of the same agent share a config — each replica maintains independent AI conversations.

A compromised function output in one agent's namespace cannot propagate to another agent because there is no mechanism for it to get there. The agents are isolated at the container level, not just the prompt level.

What this means in practice

The attack described in the CVSS advisory requires three conditions:

Secrets accessible from the LLM context
Untrusted data in the same context as those secrets
A mechanism to exfiltrate (tool calls, output channels)

MCPWorks eliminates condition 1. API keys never enter the LLM context. The sandbox handles credentials in a completely separate path from the AI.

Condition 2 is mitigated by trust boundaries. Untrusted data is marked as untrusted before the AI sees it.

Condition 3 is constrained by the sandbox. Even if the AI were tricked into writing exfiltration code, that code runs inside nsjail with network restrictions, syscall filtering, and no access to the host environment.

The remaining surface

No system is invulnerable. The injection scanner is pattern-based and catches known English-language patterns. It does not defend against novel phrasing, non-English attacks, or sophisticated obfuscation that bypasses normalization. This is why it's one layer in a stack, not the whole defense.

If an agent's AI is tricked by a data-trust output into calling set_state with malicious content, that state persists in the encrypted K/V store. But the blast radius is contained: the agent can only modify its own state, call its own functions, and communicate through its configured channels. It cannot escalate to other agents, other namespaces, or the platform itself.

The fundamental principle: architecture-level isolation beats prompt-level guardrails. If the secrets aren't there, they can't be stolen.

MCPWorks is open source under BSL 1.1. The security architecture described here is in the codebase — nsjail configs, seccomp policies, trust boundary implementation, and injection scanner are all auditable.

GitHub: MCPWorks-Technologies-Inc/mcpworks-api