New AI Red Team Uncovers Critical Flaws In Phase Two Of Secret Project Fetch

By 813 Staff


Dario Amodei just pulled back the curtain on Anthropic’s most aggressive safety stress test yet. In a new blog post published Wednesday, the company revealed Phase 2 of its internal red teaming effort, codenamed Project Fetch. Internal documents show the project is designed to probe whether frontier models can be coerced into autonomously retrieving and synthesizing sensitive data—a capability Anthropic engineers close to the project say is “the single most concerning attack surface we’ve seen in any model to date.” Phase 1, completed late last year, focused on short-horizon tool use. Phase 2 pushes further, evaluating whether a model can be chained through multiple reasoning and retrieval steps without direct human oversight.

The timing is notable. Anthropic (@AnthropicAI) published the update just days after a leaked internal memo from a competitor warned that state-aligned threat actors are already experimenting with prompt-injection techniques that exploit long-context windows. Engineers close to Project Fetch say the rollout has been anything but smooth: Phase 2 required building a custom evaluation harness that could simulate adversarial persistence across dozens of conversation turns. One source described the results as “sobering,” noting that certain configurations of Claude—the company’s flagship family—exhibited “low but non-trivial success rates” on the most complex task chains, despite hardening updates in the past two release cycles.

Why this matters: Project Fetch is not a theoretical exercise. If a model can be tricked into executing multi-step data retrieval without safeguards, the practical implications are immediate: automated exfiltration of proprietary corporate data, unauthorized access to internal APIs, and persistent surveillance through compromised agentic loops. Anthropic has said openly that it views this testing as a necessary step before any broader deployment of autonomous agent features, which the company has signaled it plans to expand later this year.

What happens next is uncertain but closely watched. Anthropic says it will publish detailed failure mode taxonomies from Phase 2 in the coming weeks, and it is already designing Phase 3, which sources say will test dynamic retrieval across live, sandboxed third-party services. No timeline has been set for that phase, but one internal note warned that “Phase 2’s results mean we cannot rush.” The industry will be reading the full data closely—not least because regulators and enterprise buyers are now asking for explicit red teaming results before signing contracts.

Source: https://x.com/AnthropicAI/status/2067651699486200091
