Agents in a Sandbox: Securing the Next Layer of AI Autonomy

TL;DR for operators

Tools are where agent security stops being philosophical. Once an AI agent can read files, call APIs, inspect environment variables, launch commands, or connect to a database, the business question is no longer “is the model aligned?” It is “what exactly can this process touch when it is confused, manipulated, or supplied with a malicious tool?”

The AgentBound paper answers that question with an old idea applied to a new execution layer: least privilege. Model Context Protocol (MCP) servers should declare what resources they need, and a runtime sandbox should enforce that declaration whether or not the model has a lovely personality today.¹ This is not a replacement for scanning, monitoring, authentication, or human review. It is the missing boundary between an agent’s intention and a machine’s authority.

The practical finding is encouraging but not magical. The authors evaluate 296 popular MCP servers, show that a compact permission vocabulary covers observed server needs, report high agreement between generated and human-written manifests, and measure runtime overhead below a millisecond after startup. The expensive part is not computation. The expensive part is getting organisations to stop treating agent tools as trusted extensions of the user’s laptop. Naturally, this is where the actual work begins.

Tools are the new blast radius

A familiar enterprise scene: someone connects an AI assistant to a repository, a ticketing system, a local filesystem, and a few internal APIs. The demo works. The agent summarises issues, writes patches, opens pull requests, and politely explains what it did. Everyone smiles. Somewhere in the background, a tool server now has broader authority than most junior engineers and considerably less training.

That is the problem AgentBound is trying to make boring.

The Model Context Protocol, introduced by Anthropic in 2024, standardises how AI applications connect to external data sources and tools.² The official specification describes MCP as a common protocol for exactly that integration.³ The standardisation is useful because it reduces the old integration mess: fewer bespoke connectors, more reusable tool servers, cleaner agent architectures.

It also standardises the attack surface.

A traditional application usually has a relatively stable permission profile. A payroll service should not suddenly need shell access. A reporting dashboard should not decide to inspect SSH keys. AI agents, by contrast, route tasks through tools selected at runtime, often based on model-generated plans, tool descriptions, retrieved content, and user prompts. The execution path is dynamic; the consequences are not.

The likely misconception is that this is mostly a prompt-injection problem. It is not. Prompt injection matters, but the deeper issue is authority. If a malicious or compromised MCP server can read arbitrary files, access broad environment variables, and send outbound network traffic wherever it likes, then “better prompting” is not a control. It is a motivational poster taped to a server rack.

AgentBound’s contribution is to move the defence from model persuasion to operating boundary.

AgentBound turns permission into an executable contract

The core design is deliberately unglamorous, which is usually a good sign in security. AgentBound has two main parts, `AgentManifest` and `AgentBox`, plus a generator that drafts manifests:

| Component | What it does | Operational analogy |
| --- | --- | --- |
| `AgentManifest` | Declares what an MCP server is allowed to access: filesystem, environment variables, network, system execution, peripherals, clipboard, and related resources | Android-style permission manifest, adapted for agent tools |
| `AgentBox` | Runs the MCP server inside an enforcement layer that blocks undeclared access | Containerised sandbox with runtime policy enforcement |
| `AgentManifestGen` | Uses an LLM-based workflow to infer a draft manifest from the server’s source code | Security-policy draft assistant, not an oracle |

The important design choice is that permissions are not merely documented. They are enforced. A filesystem tool may receive read and write access to specific directories. A fetch tool may receive outbound network access to allowed hosts. A server that does not need environment variables should not be able to browse them just because it runs on the same machine as everything else. Revolutionary, apparently.

A simplified view looks like this:

User
  ↓
AI agent / MCP client
  ↓
AgentBox enforcement layer
  ↓
MCP server
  ↓
Declared resources only
(files, env vars, network hosts, system functions)

The paper’s permission vocabulary is compact enough to be reviewable. It covers filesystem read/write/delete, environment read/write, command execution, process interaction, network client/server access, selected peripherals, location, notifications, and clipboard operations. That vocabulary matters because security policies fail when they become either too vague to enforce or too detailed for humans to review.
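The scoped permissions described above can be sketched as a small data structure with a default-deny check. This is a minimal illustration, not AgentManifest's actual schema: the `Manifest` fields, the `is_allowed` function, and the example paths and hosts are all hypothetical. Only the four permission names that appear in this article are handled.

```python
import posixpath
from dataclasses import dataclass


@dataclass(frozen=True)
class Manifest:
    # Hypothetical shape: field names are illustrative, not the paper's schema.
    read_dirs: tuple = ()
    write_dirs: tuple = ()
    hosts: tuple = ()
    env_vars: tuple = ()


def _inside(path: str, roots: tuple) -> bool:
    # Normalise first so "/workspace/../etc" cannot pass as a child of "/workspace".
    # Symlinks are not resolved here; a real sandbox enforces this at the OS level.
    target = posixpath.normpath(path)
    for root in roots:
        root = posixpath.normpath(root)
        if target == root or target.startswith(root.rstrip("/") + "/"):
            return True
    return False


def is_allowed(manifest: Manifest, permission: str, resource: str) -> bool:
    """Default-deny: anything the manifest does not declare is refused."""
    if permission == "filesystem.read":
        return _inside(resource, manifest.read_dirs)
    if permission == "filesystem.write":
        return _inside(resource, manifest.write_dirs)
    if permission == "network.client":
        return resource in manifest.hosts
    if permission == "system.env.read":
        return resource in manifest.env_vars
    return False  # unknown or undeclared permission classes are denied


manifest = Manifest(
    read_dirs=("/workspace",),
    hosts=("api.example.com",),
    env_vars=("API_TOKEN",),
)
print(is_allowed(manifest, "filesystem.read", "/workspace/notes.txt"))      # inside a declared directory
print(is_allowed(manifest, "filesystem.read", "/workspace/../etc/passwd"))  # escapes the declared directory
print(is_allowed(manifest, "system.env.read", "AWS_SECRET_ACCESS_KEY"))     # variable never declared
```

The point of the sketch is the shape of the decision: each request is a (permission, resource) pair, and the answer depends only on what was declared, never on what the model intended.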

AgentBound’s bet is that MCP servers need a middle layer: more precise than “trusted plugin,” less bureaucratic than hand-writing a full enterprise policy language for every toy server someone installed at midnight.

The evidence says the manifest layer is usable, not self-certifying

The strongest part of the paper is not the slogan “sandbox your agents.” Sensible people already knew that. The useful part is the evidence that a practical permission layer can be generated, reviewed, and enforced across real MCP servers without turning every deployment into a compliance archaeological dig.

The authors built a dataset of 296 popular MCP servers selected from PulseMCP. They then tested whether AgentManifest’s permission vocabulary captured real server requirements and whether manifests could be generated automatically from source code.

Several numbers deserve attention:

| Finding | Interpretation | Boundary |
| --- | --- | --- |
| `network.client` appears in 83.1% of generated manifests | MCP servers are mostly communication machinery; network access is central, not exceptional | Network permission must be scoped to hosts, not granted as a vague “internet” blessing |
| `system.env.read` appears in 79.6% | Many servers rely on configuration and secrets exposed through environment variables | Blanket environment access is dangerous; variable-level scoping matters |
| `filesystem.read` appears in 74.1%, `filesystem.write` in 49.3% | File access is common enough that “just don’t let agents touch files” is not realistic | Read, write, and delete should be separated wherever possible |
| Manual comparison over 48 servers produced 96.5% accuracy | LLM-assisted manifest generation can approximate human review surprisingly well | The result is a draft-quality control, not a substitute for maintainer review |
| Developer feedback reported 80.9% accuracy/precision and 100% recall, but 74% of GitHub issues received no response | Maintainers who responded often found the manifests useful | Non-response limits the strength of the developer-acceptance claim |

The manual evaluation is the cleanest evidence. Across 48 MCP servers and 816 permission decisions, AgentManifestGen matched human-written manifests in 787 cases. Precision was 0.94 and recall was 0.96. In 28 of the 48 servers, the generated manifest was identical to the human version. That is not perfection; it is enough to make “generate then review” a plausible workflow.
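The figures in that paragraph can be related with a few lines of arithmetic. The sketch below uses only the counts quoted above. The raw ratio 787/816 comes out near 96.4%, marginally below the 96.5% headline; this article does not say how the headline figure was rounded or aggregated, so the small gap is left as-is.

```python
# Counts quoted in the text above.
decisions, matches = 816, 787
servers, identical = 48, 28

accuracy = matches / decisions           # share of permission decisions that agree
disagreements = decisions - matches      # decisions a reviewer would have to correct
identical_share = identical / servers    # manifests needing no edits at all
servers_with_edits = servers - identical

print(f"decision-level agreement: {accuracy:.2%}")
print(f"disagreements to review:  {disagreements}")
print(f"identical manifests:      {identical_share:.1%}")
print(f"servers needing edits:    {servers_with_edits}")
```

Read this way, the review burden in the manual sample is a few dozen individual permission decisions spread over fewer than half of the servers, which is what makes "generate then review" plausible.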

The GitHub maintainer experiment is more socially interesting and statistically messier. The authors opened issues on 96 repositories asking maintainers to review generated manifests. Most did not respond. Among the useful responses, developers accepted many manifests, corrected over-broad permissions, and suggested distinctions such as required versus optional access. This is exactly the kind of feedback a young security vocabulary needs. It is also a reminder that open-source maintainers are not free labour for everyone’s agent-security thesis. A small administrative detail, often rediscovered by researchers with charming optimism.

Cognaptus inference: the business value is not “automatic security.” It is cheaper first-pass security diagnosis. Instead of asking every team to invent a permission model from scratch, AgentBound makes the review object explicit: here are the resources this tool claims to need; approve, narrow, or reject them.

The sandbox blocks environment attacks, not semantic betrayal

AgentBound’s security result is best understood by separating attack types.

Some attacks target the environment. A malicious server tries to read a secret file. A tool silently changes its outbound API host. A server attempts to overwrite configuration, execute a shell command, or exfiltrate data. These are precisely the cases where an execution boundary helps. If the manifest does not grant the relevant file, host, or system permission, AgentBox blocks the action.

Other attacks target meaning. A tool description manipulates the model into using a wrong parameter. A poisoned result tells the agent to redirect a transfer using a permitted endpoint. A vulnerable server mishandles SQL input. A weather tool instructs the model to always say it is raining. Sandboxing cannot fix reality being semantically fraudulent. The container is not a philosopher.

The paper’s experiments reflect this boundary. AgentBox prevented environment-targeting attacks, including unauthorised file access, external resource manipulation, and data exfiltration. It did not prevent a puppet attack that influenced the model’s tool handling while staying within permitted channels. It also did not solve classical application-layer vulnerabilities such as SQL injection.
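The boundary between the two attack types can be made concrete with a toy host check. This is an illustrative sketch, not AgentBox's enforcement code: `ALLOWED_HOSTS`, `sandbox_permits`, and the URLs are hypothetical. It shows that a check based on the destination host stops exfiltration to an undeclared host but cannot tell a legitimate call from a manipulated one on a permitted endpoint.

```python
from urllib.parse import urlsplit

# Hypothetical manifest-declared host for a payments tool.
ALLOWED_HOSTS = frozenset({"payments.example.com"})


def sandbox_permits(url: str) -> bool:
    """Environment-level check: only the destination host is inspected."""
    host = urlsplit(url).hostname
    return host is not None and host in ALLOWED_HOSTS


calls = {
    "legitimate transfer": "https://payments.example.com/transfer?to=supplier-001",
    "redirected transfer": "https://payments.example.com/transfer?to=attacker-999",
    "exfiltration attempt": "https://collector.example.net/upload",
}

for label, url in calls.items():
    verdict = "allowed" if sandbox_permits(url) else "blocked"
    print(f"{label}: {verdict}")
```

The first two calls are indistinguishable at this layer because they use the same declared host; only the third crosses the boundary the manifest defines.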

That distinction is the main operational lesson.

| Attack category | What AgentBound can help with | What still needs another control |
| --- | --- | --- |
| Secret exfiltration through file + network access | Block file reads, block unauthorised outbound traffic, or both | Secret management, just-in-time credentials, audit trails |
| Malicious external resource attack | Restrict outbound domains to manifest-declared endpoints | Runtime anomaly detection for unusual but permitted traffic |
| Rug pull involving changed server behaviour | Limit the damage if new behaviour needs undeclared resources | Package provenance, version pinning, supply-chain monitoring |
| Tool poisoning / puppet attack | Only helps if the attack requires unauthorised environmental access | Tool-description scanning, model-side validation, approval gates |
| SQL injection or unsafe query logic | Usually outside AgentBox scope | Secure coding, query parameterisation, application testing |

This is where AgentBound should be positioned carefully. It is not an agent immune system. It is a blast-radius reducer. It turns a compromised tool from “potentially everything the host can do” into “only what the manifest allows.” In enterprise security, that difference is not glamorous. It is also the difference between a bad afternoon and a board presentation.
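The approval gates named in the table sit above the sandbox and judge whether a permitted action is sensible. A minimal sketch of such a gate follows. It is not part of AgentBound, and the payee list, the threshold, and the three outcome labels are assumptions chosen for illustration.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class Payment:
    payee: str
    amount: Decimal


# Hypothetical business rules; a real deployment would load these from policy.
KNOWN_PAYEES = frozenset({"supplier-001", "supplier-002"})
AUTO_APPROVE_LIMIT = Decimal("500.00")


def review(payment: Payment) -> str:
    """Semantic check applied after the sandbox has already allowed the call."""
    if payment.payee not in KNOWN_PAYEES:
        return "reject"
    if payment.amount > AUTO_APPROVE_LIMIT:
        return "needs_human_approval"
    return "auto_approve"


print(review(Payment("supplier-001", Decimal("120.00"))))
print(review(Payment("supplier-001", Decimal("9000.00"))))
print(review(Payment("attacker-999", Decimal("120.00"))))
```

The sandbox answers "may this tool reach the payment API at all"; a gate like this answers "should this particular payment go through". Neither replaces the other.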

The cost is startup latency; the benefit is a smaller blast radius

Security layers often fail adoption because they make the safe path too slow or too annoying. AgentBound’s performance result is therefore important, though easy to overstate.

The paper measures two kinds of overhead. First, startup latency: running MCP servers inside AgentBox adds roughly 150–300 ms on macOS and up to about 400 ms on Debian in the tested cases. Second, runtime overhead: across common operations such as reading files, writing files, reading environment variables, and fetching URLs, the sandbox adds about 0.6 ms on macOS and 0.29 ms on Debian.

The interpretation is straightforward. For short-lived command-line tools, container startup overhead can feel visible. For agent sessions that keep MCP servers running while the model performs multiple inference calls, the runtime overhead is basically lost in the sofa cushions. LLM latency, API calls, retrieval, and human review will dominate.
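The amortisation argument can be checked with rough arithmetic. The sketch below takes the upper macOS startup figure (300 ms) and the 0.6 ms per-operation figure from the text. The 2,000 ms baseline per agent step is an assumption standing in for model inference and tool work, not a number from the paper.

```python
def overhead_share(startup_ms: float, per_call_ms: float, calls: int,
                   baseline_call_ms: float) -> float:
    """Fraction of total session time attributable to the sandbox."""
    overhead = startup_ms + per_call_ms * calls
    return overhead / (overhead + baseline_call_ms * calls)


BASELINE_CALL_MS = 2000.0  # assumption: model inference plus tool work per step

for calls in (1, 10, 100):
    share = overhead_share(300.0, 0.6, calls, BASELINE_CALL_MS)
    print(f"{calls:>3} calls: sandbox share of session time = {share:.2%}")
```

Under these assumptions the one-off startup cost dominates a single-call session and becomes a rounding error once the server stays up for a long session. A faster baseline would shift the numbers, but not the shape of the curve.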

This does not make sandboxing “free.” It makes the performance objection weak in many agent deployments. The real costs are elsewhere:

| Cost category | What operators actually pay |
| --- | --- |
| Policy review | Someone must decide whether the generated manifest is appropriate |
| Permission scoping | Hosts, directories, variables, and execution rights need concrete boundaries |
| Developer workflow | Tool installation and debugging must work inside the sandbox |
| Incident response | Logs and denied actions must be legible enough to diagnose |
| Governance | Teams need rules for which agents may receive which classes of authority |
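For the incident-response row, "legible" mostly means that a denied action is recorded with enough structure to answer who, what, and why. The sketch below shows one possible record format. The field names and the `denial_record` helper are hypothetical, not AgentBox's log format.

```python
import json
from datetime import datetime, timezone


def denial_record(server: str, permission: str, resource: str, reason: str) -> str:
    """Build one structured log line for a blocked action (hypothetical schema)."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "event": "permission_denied",
        "server": server,          # which MCP server made the request
        "permission": permission,  # which permission class was needed
        "resource": resource,      # the concrete file, host, or variable
        "reason": reason,          # why the manifest did not cover it
    }
    return json.dumps(record, sort_keys=True)


print(denial_record(
    server="fetch-server",
    permission="network.client",
    resource="collector.example.net",
    reason="host not declared in manifest",
))
```

A record like this lets a developer tell a misconfigured manifest from an actual attack without reading sandbox internals.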

The paper directly shows that AgentBound can enforce permissions with low runtime overhead in its experimental setup. Cognaptus infers that the adoption barrier is organisational rather than computational. The remaining uncertainty is how well this scales across messy enterprise environments with legacy systems, private package registries, brittle authentication flows, and the usual “temporary” admin token that somehow celebrates its fourth birthday.

The business case is governed autonomy, not safer vibes

For operators, the most useful framing is not “AI safety.” It is permissioned delegation.

A business does not give an employee unlimited access because the employee has good intentions. It assigns roles, scopes access, logs actions, separates duties, and revokes authority when context changes. Agentic AI needs the same treatment, with one extra complication: the “employee” may dynamically assemble its workflow from third-party tools written by people the business has never met.

The broader MCP literature supports this concern. Hou et al. describe MCP’s lifecycle risks across server creation, operation, and update, which maps neatly onto supply-chain, runtime, and maintenance risk.⁴ Radosevich and Halloran show that MCP-enabled workflows can be coerced into malicious code execution, remote access, and credential theft, and propose auditing MCP servers before deployment.⁵ Li et al. measure API usage across 2,562 MCP applications and find substantial use of network and system-resource APIs, reinforcing the need for privilege management rather than casual trust.⁶

AgentBound adds a complementary layer. Scanners ask, “does this server look dangerous?” Monitors ask, “is this behaviour suspicious?” AgentBound asks, “even if it is dangerous, what can it actually touch?”

That last question is the one enterprises can operationalise.

A practical adoption path would look like this:

| Deployment stage | Control pattern | What AgentBound contributes |
| --- | --- | --- |
| Experiment | Run tools in isolated development environments | Prevents accidental local damage during trials |
| Internal pilot | Require generated manifests for all MCP servers | Creates a reviewable permission inventory |
| Production assistive agent | Enforce directory, host, and environment-variable scoping | Reduces blast radius of compromised or buggy tools |
| Semi-autonomous workflow | Combine manifests with logging, approval gates, and identity controls | Makes delegation auditable and revocable |
| Regulated operation | Treat agent-tool permissions as part of compliance evidence | Provides explicit records of allowed capabilities |

The ROI is not mainly fewer milliseconds or prettier diagrams. It is fewer unknown privileges. Unknown privileges are expensive because they convert every incident into discovery work: what could the agent access, which credentials were exposed, what systems were reachable, and why did nobody know this earlier? AgentBound’s manifest model makes that inventory explicit before the incident. Dull, useful, and therefore unfashionable.
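The inventory idea can be sketched as an inverted index over approved manifests, so that "which servers could reach this resource?" becomes a lookup rather than an investigation. The server names, the dictionary layout, and both helper functions below are hypothetical.

```python
from collections import defaultdict


def build_inventory(manifests: dict) -> dict:
    """Map each (permission, scope) pair to the set of servers that hold it."""
    inventory = defaultdict(set)
    for server, grants in manifests.items():
        for permission, scopes in grants.items():
            for scope in scopes:
                inventory[(permission, scope)].add(server)
    return dict(inventory)


def servers_with(inventory: dict, permission: str, scope: str) -> list:
    return sorted(inventory.get((permission, scope), set()))


# Hypothetical approved manifests for three MCP servers.
manifests = {
    "ticket-server": {"network.client": ["tickets.example.com"],
                      "system.env.read": ["TICKET_TOKEN"]},
    "repo-server": {"filesystem.read": ["/workspace"],
                    "system.env.read": ["GIT_TOKEN"]},
    "report-server": {"filesystem.read": ["/workspace"],
                      "network.client": ["tickets.example.com"]},
}

inventory = build_inventory(manifests)
print(servers_with(inventory, "filesystem.read", "/workspace"))
print(servers_with(inventory, "system.env.read", "GIT_TOKEN"))
```

If `GIT_TOKEN` leaks, the second lookup narrows the investigation to the servers that were ever allowed to read it.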

Boundaries: AgentBound is a floor, not the whole security programme

AgentBound changes the execution boundary. It does not complete the security story.

First, permission manifests can be over-broad. The paper itself reports developer feedback asking for tighter scoping, especially around environment variables and filesystem access. A generated manifest that says “network access” is only the beginning. The enterprise-grade version needs concrete hosts, paths, variables, and modes. Least privilege lives in the details; broad categories are where it goes to quietly die.

Second, sandboxing cannot determine whether a permitted action is wise. If a tool is allowed to call a payment API, AgentBox cannot know whether this particular payment is legitimate unless additional policy logic exists above it. That requires transaction rules, approval thresholds, semantic validation, and probably a human for actions where “oops” has a currency symbol.

Third, the experiments cover representative attacks and real MCP servers, not every deployment topology. Enterprise agents often combine internal services, identity providers, CI/CD systems, cloud credentials, local developer machines, and SaaS APIs. The enforcement model is promising, but production integration will need careful work around secrets, logging, update policy, and compatibility.

Fourth, MCP security is moving fast. Static scanners, runtime monitors, authentication layers, provenance mechanisms, and protocol-level changes will all compete to be called “the solution.” They are not substitutes. They occupy different points in the control stack.
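The first boundary above, over-broad manifests, lends itself to a simple pre-approval check. The sketch below flags entries that grant a category without a concrete scope. The manifest layout (permission name mapped to a list of scopes) and the rules for what counts as "broad" are assumptions for illustration, not the paper's format.

```python
def broad_entries(manifest: dict) -> list:
    """Return (permission, scope) pairs a reviewer should narrow before approval."""
    flagged = []
    for permission, scopes in manifest.items():
        if not scopes:
            # A category with no scope at all, e.g. bare "network access".
            flagged.append((permission, "<no scope given>"))
            continue
        for scope in scopes:
            # Wildcards and the filesystem root are treated as over-broad here.
            if scope == "/" or "*" in scope:
                flagged.append((permission, scope))
    return flagged


# Hypothetical generated draft awaiting review.
draft = {
    "network.client": [],
    "system.env.read": ["*"],
    "filesystem.read": ["/"],
    "filesystem.write": ["/workspace/output"],
}

for permission, scope in broad_entries(draft):
    print(f"narrow before approval: {permission} -> {scope}")
```

A check like this does not decide the right scope; it only makes sure a human sees every place where the draft stopped at the category level.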

The right architecture is layered:

Source and package review
        ↓
Manifest generation and human approval
        ↓
Sandboxed execution with least privilege
        ↓
Runtime monitoring and anomaly detection
        ↓
Identity, audit, and incident response

AgentBound belongs in the middle, where it can do what prompt policies cannot: deny an operation before damage occurs.

Conclusion: autonomy needs an execution constitution

The old software-security lesson has arrived in agent land wearing a new hat. Do not trust code because it is convenient. Do not grant access because the demo worked. Do not assume a model can reason its way out of privileges it should never have received.

AgentBound’s main contribution is not that containers exist, nor that permissions are useful. We knew that. The contribution is adapting those ideas to MCP in a way that appears usable: a compact permission vocabulary, generated manifests, runtime enforcement, evidence from real servers, and overhead small enough that performance is a poor excuse.

The paper directly shows that enforceable access control for MCP servers is feasible and efficient in the evaluated setting. Cognaptus infers that agent platforms should treat tool permissions as first-class deployment artefacts, not as hidden side effects of installation. What remains uncertain is how quickly the ecosystem will standardise around such controls before enough avoidable incidents provide the usual educational subsidy.

AI agents do not need unlimited freedom to be useful. They need bounded authority, observable behaviour, and revocable trust. In other words, less mysticism, more sandbox. A tragic outcome for keynote decks; a healthy one for production systems.

References

1. Christoph Bühler, Matteo Biagiola, Luca Di Grazia, and Guido Salvaneschi, “Securing AI Agent Execution,” arXiv:2510.21236, 2025.
2. Anthropic, “Introducing the Model Context Protocol,” November 25, 2024.
3. Model Context Protocol contributors, “Model Context Protocol Specification,” version 2025-06-18.
4. Xinyi Hou, Yanjie Zhao, Shenao Wang, and Haoyu Wang, “Model Context Protocol (MCP): Landscape, Security Threats, and Future Research Directions,” arXiv:2503.23278, 2025.
5. Brandon Radosevich and John Halloran, “MCP Safety Audit: LLMs with the Model Context Protocol Allow Major Security Exploits,” arXiv:2504.03767, 2025.
6. Zhihao Li, Kun Li, Boyang Ma, Minghui Xu, Yue Zhang, and Xiuzhen Cheng, “We Urgently Need Privilege Management in MCP: A Measurement of API Usage in MCP Ecosystems,” arXiv:2507.06250, 2025.
