AI Security

AI Agents Keep Escaping Sandboxes

April 21, 2026 · 3 min read · 539 words

The sandbox was always optimistic

Every coding agent ships with a sandbox story: "the agent can only touch this directory," "the container isolates the model from your real system," "the tool runs with reduced permissions." In practice, 2025 and 2026 have produced a steady drip of reports where those boundaries turned out to be softer than the marketing implied.

The pattern is not a single bug. It is a class of bugs with a shared root cause: sandboxes that were designed against honest users, then deployed against adversarial prompts.

The recurring escape vectors

Across disclosed issues in the last year, the escape paths look familiar:

  • Symlink traversal from an "allowed" project directory into the rest of the filesystem
  • Environment variable leakage from the host into the agent's execution context
  • Git hooks, build scripts, or test runners that execute with fewer restrictions than the agent itself
  • MCP servers or plugin systems that run outside the agent's sandbox entirely
  • Path normalization bugs that accept ../ once the input is URL-encoded or mixed-case
  • Network rules that allow "just the model provider" but let proxy headers reach internal services

None of these are exotic. They are all variations of "the sandbox trusts something it should not have trusted."
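The symlink case is the easiest to see concretely. A minimal sketch (the function names here are illustrative, not from any specific agent): a naive prefix check on the requested path looks safe, while resolving symlinks reveals the real target.

```python
import os
import tempfile
from pathlib import Path

def naive_allows(path: str, sandbox: str) -> bool:
    # Broken: checks the literal path string, not what it points to
    return os.path.abspath(path).startswith(os.path.abspath(sandbox) + os.sep)

def hardened_allows(path: str, sandbox: str) -> bool:
    # Resolve symlinks and ".." components before comparing
    real = Path(path).resolve()
    root = Path(sandbox).resolve()
    return root == real or root in real.parents

sandbox = tempfile.mkdtemp()
outside = tempfile.mkdtemp()

# Attacker plants a symlink inside the "allowed" project directory
link = os.path.join(sandbox, "project")
os.symlink(outside, link)
target = os.path.join(link, "escape.txt")

print(naive_allows(target, sandbox))     # the literal path looks in-bounds
print(hardened_allows(target, sandbox))  # resolution shows it is not
```

The same resolve-before-compare discipline covers most of the `../` normalization bugs too, since `resolve()` collapses dot segments along with links.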

Why agent sandboxes are harder than container sandboxes

A normal container sandbox has to resist code running inside it. An agent sandbox has to resist prompts that describe code, which then runs inside it, often with the user's implicit blessing. The attack surface is the model's willingness to take instructions from any text it encounters — pull request descriptions, issue comments, README files, scraped web content, dependency metadata, even code comments.

That makes prompt injection part of the sandbox threat model, not a separate "AI safety" concern. If a malicious README can get your agent to write a file to ~/.ssh/authorized_keys, the sandbox failed, regardless of whether the model was "jailbroken" in a traditional sense.
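The authorized_keys case is checkable mechanically. A sketch of a write gate that expands `~` and resolves symlinks before comparing against a denylist of sensitive paths (the denylist below is illustrative, not complete):

```python
from pathlib import Path

# Illustrative denylist - a real agent would need a much broader set
SENSITIVE = [Path("~/.ssh").expanduser().resolve(), Path("/etc").resolve()]

def write_allowed(target: str) -> bool:
    # Expand "~" and resolve symlinks before checking, so the
    # comparison happens on the real destination of the write
    real = Path(target).expanduser().resolve()
    return not any(real == p or p in real.parents for p in SENSITIVE)

print(write_allowed("~/.ssh/authorized_keys"))  # denied
print(write_allowed("./notes.txt"))             # allowed
```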

What to actually check

For any coding agent running on operator workstations:

  • Confirm the sandbox resists symlink and junction traversal — test it, do not assume
  • Audit which environment variables the agent can read at runtime
  • Treat every tool the agent can invoke as a potential escape vector, including git, make, npm, and test runners
  • Restrict network egress to the model provider, and block host-header smuggling into the proxy
  • Log every tool invocation with enough context to reconstruct what the agent did
  • Keep credentials for production systems out of any environment an agent can reach

The short version: assume the agent will do exactly what a prompt tells it, and ask whether the sandbox holds when that prompt is adversarial.
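Two of the checklist items, the environment audit and the invocation logging, can be combined at the point where the agent spawns tools. A minimal sketch, assuming an explicit allowlist of variables (the names and helper functions are examples, not any agent's real API):

```python
import os
import subprocess

# Assumption: these are the only variables the spawned tools actually need
ENV_ALLOWLIST = {"PATH", "HOME", "LANG", "TERM"}

def scrubbed_env() -> dict:
    # Drop everything not explicitly allowed, so cloud keys, tokens,
    # and SSH agent sockets never reach the child process
    return {k: v for k, v in os.environ.items() if k in ENV_ALLOWLIST}

def run_tool(cmd: list[str]) -> subprocess.CompletedProcess:
    # Every invocation is logged with the full argument vector,
    # and runs with the scrubbed environment rather than the host's
    print(f"tool-invocation: {cmd}")
    return subprocess.run(cmd, env=scrubbed_env(), capture_output=True, text=True)

# Simulate a leaked credential on the host and confirm it is stripped
os.environ["FAKE_AWS_SECRET"] = "hunter2"
assert "FAKE_AWS_SECRET" not in scrubbed_env()

out = run_tool(["echo", "sandboxed"])
```

Allowlisting beats denylisting here for the same reason it does everywhere else: the set of variables a tool needs is small and knowable, while the set of secrets that might be lying around in the host environment is not.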

The design goal

A sandbox that only works against well-behaved inputs is not a sandbox. The bar for agent tooling in 2026 should be the same bar we already apply to browser renderers and container runtimes: the boundary holds even when the content inside is actively hostile.

Treat every agent like it has already been jailbroken. Then the sandbox story gets honest.

Source note

This field note reflects patterns across disclosed issues in Cursor, Windsurf, Claude Code, and Copilot Workspace through early 2026, plus the SoK paper on prompt injection against agentic coding assistants and ongoing research on MCP tool poisoning.
