LLM Security

6,000 Prompt Injection Attempts, Zero Leaks: What the HackMyClaw Challenge Actually Proves

Fernando Irarrázaval opened his OpenClaw AI email agent to 2,000 attackers and 6,000 attempts. Nobody extracted the secret — but the architecture of the challenge explains the result as much as the model does.

PyramidLedger Research28 June 20264 min read

Over 6,000 prompt injection attempts from more than 2,000 participants failed to extract secrets from an Anthropic Claude Opus 4.6-powered email agent called OpenClaw.
The agent's protection combined explicit system-prompt rules with frontier model training — and critically, the agent had no high-consequence write access to external systems.
Simon Willison, who surfaced the result, cautioned that zero leaks in a sandboxed challenge does not translate to safety in production systems where agent actions are irreversible.
Least-privilege agent design — constraining what an agent can do, not just what it can say — is the primary control the challenge validated by its own construction.

The Challenge Setup

Fernando Irarrázaval published hackmyclaw.com with a simple premise: email his OpenClaw AI assistant — named Fiu — and convince it to leak the contents of a secrets.env file, the kind of file where developers store API keys and credentials. The underlying model was Anthropic's Claude Opus 4.6. The system prompt contained a short but explicit anti-injection ruleset, prohibiting the assistant from ever revealing credentials, modifying its own configuration files, executing code supplied by email, or exfiltrating data to external endpoints — regardless of what an incoming message instructed.

6,000 Attempts, Zero Leaks

More than 2,000 people participated over the life of the challenge, sending over 6,000 emails between them. None extracted the secret. The campaign ran long enough to cost Irarrázaval roughly $500 in token spend and trigger a Google account suspension from the sheer volume of inbound mail — at which point he closed the challenge.

The effort the labs have been putting in to training their frontier models not to fall for injection attacks...do appear effective in making these attacks much harder to pull off.
— Simon Willison

That assessment from Willison is meaningful. A well-crafted injection in an email body would have had reasonable odds against a comparable setup not long ago. Frontier model training against indirect prompt injection has measurably shifted the baseline.

Why the Caveat Matters More Than the Headline

Willison's commentary contains a warning that security practitioners should read as carefully as the result. The HackMyClaw challenge defined a deliberately narrow threat model: the worst outcome was a leaked string. In production agentic systems — those that can send authenticated email replies, call external APIs, initiate financial operations, or modify records — a successful injection doesn't just exfiltrate data. It acts, often irreversibly.

The OpenClaw setup worked partly because Irarrázaval scoped the agent's permissions tightly from the outset. That is sound system design. It is not, however, the setup most teams inherit when they connect an LLM to an existing business workflow.

The agent's blast radius was deliberately small: no unsanctioned outbound email, no system command execution, no external API calls.
Irreversible actions — financial transfers, data deletion, identity-impersonating communications — were excluded by design, a discipline many production deployments skip.
A frontier model's injection resistance is one layer of a defence-in-depth stack, not a substitute for scoped permissions.

Practical Takeaways for Security and Engineering Teams

The result is a useful data point. It is not a proof of safety. What it suggests for teams building or auditing AI agent deployments:

1Frontier models trained on injection resistance meaningfully raise the bar — this is real progress worth acknowledging.
2System-prompt prohibitions remain part of the control set and appear to reinforce model-level training.
3Least-privilege agent design — limiting what the agent can do, not just what it can say — is the primary control the challenge validated by the constraints Irarrázaval set on his own system.
4Red-team your agents before attackers do: a controlled challenge in a sandboxed environment is how you find your real-world exposure, not a substitute for it.

For any team evaluating an agentic deployment, the right question is not 'can the model resist a prompt injection?' It is: 'what is the worst case if it doesn't — and have we bounded that?'

Frequently Asked Questions

What model was used in the HackMyClaw prompt injection challenge?

Anthropic's Claude Opus 4.6 was the underlying model powering the OpenClaw email agent called Fiu.

Does the HackMyClaw result mean prompt injection is no longer a serious threat?

No. The challenge demonstrated that modern frontier models are more resistant to injection than before, but Simon Willison explicitly cautioned against generalising a sandboxed result to production systems — particularly those where agent actions are irreversible, such as sending authenticated communications or triggering financial operations.

What made the OpenClaw agent resistant to injection in this challenge?

Two factors worked together: an explicit system-prompt ruleset forbidding the agent from acting on email-borne instructions regarding credentials or self-modification, and Anthropic's training of Opus 4.6 against indirect prompt injection. Equally important was what the challenge excluded by design — the agent had no high-consequence write capabilities, which limited the blast radius of any successful attack.

Sources

1What happened after 2,000 people tried to hack my AI assistant — Simon Willison
2HackMyClaw — Prompt Injection Challenge — Fernando Irarrázaval
3What happened after 2,000 people tried to hack my AI assistant — Fernando Irarrázaval
4This AI Agent Survived 6,000 Hack Attempts — Here's How — Decrypt
5What happened after 2,000 people tried to hack my AI assistant — Simon Willison