Models · 2 min read

Anthropic Formalizes Model Sabotage Surveillance Following California Safety Law

Anthropic has upgraded its AI safety reporting, detailing how it monitors and mitigates the risk that its own frontier models could autonomously disrupt internal systems.


Anthropic has shifted its AI safety reporting from an internal exercise to a formal, compliance-ready control framework, publishing its Sabotage Risk Report for Claude Opus 4.6 yesterday. The report lays out how the company monitors, audits, and mitigates the risk that its own frontier models could undermine internal systems or research.

The move comes amid growing policy pressure in California, where the Transparency in Frontier AI Act (SB-53) requires large developers to publish “frontier AI frameworks” detailing how they test for catastrophic risks, secure deployments, and report incidents.

Anthropic’s earlier Summer 2025 report explicitly framed itself as provisional — describing the document as “an early pilot risk report” and “not currently load-bearing under our Responsible Scaling Policy.”

The Opus 4.6 update reads very differently.

“Claude Opus 4.6 does not pose a significant risk of autonomous actions that contribute significantly to later catastrophic outcomes… the overall risk is very low but not negligible.”

Where the pilot focused largely on behavioral testing, the new version adds a dedicated risk mitigations section covering “security, monitoring, and related safeguards,” limits on complex autonomous plans, and tighter oversight of internal deployments.

Anthropic also says it has broadened its detection toolkit beyond the pilot's behavioral testing.

The company argues Opus 4.6 shows “limited opaque reasoning abilities” and “limited capabilities in long-horizon software tasks,” reducing its capacity to execute hidden, multi-step sabotage.

Earlier conclusions remain intact. The firm continues to cite “weak reliability on highly complex tasks” and “weak deception capabilities on complex tasks” as practical constraints on harmful autonomy.
