Models · 2 min read

Anthropic Formalizes Model Sabotage Surveillance Following California Safety Law

Anthropic has upgraded its AI safety reporting, detailing how it monitors and mitigates the risk that its own frontier models could autonomously disrupt internal systems.


Anthropic has shifted its AI safety reporting from an internal exercise to a formal, compliance-ready control framework, publishing its Sabotage Risk Report for Claude Opus 4.6 yesterday. The report lays out how the company monitors, audits, and mitigates the risk that its own frontier models could undermine internal systems or research.

The move comes amid growing policy pressure in California, where the Transparency in Frontier AI Act (SB-53) requires large developers to publish “frontier AI frameworks” detailing how they test for catastrophic risks, secure deployments, and report incidents.

Anthropic’s earlier Summer 2025 report explicitly framed itself as provisional — describing the document as “an early pilot risk report” and “not currently load-bearing under our Responsible Scaling Policy.”

The Opus 4.6 update reads very differently.

“Claude Opus 4.6 does not pose a significant risk of autonomous actions that contribute significantly to later catastrophic outcomes… the overall risk is very low but not negligible.”

Where the pilot focused largely on behavioral testing, the new version adds a dedicated risk mitigations section covering “security, monitoring, and related safeguards,” limits on complex autonomous plans, and tighter oversight of internal deployments.

Anthropic also says it has broadened its detection toolkit beyond the pilot's behavioral testing.

The company argues Opus 4.6 shows “limited opaque reasoning abilities” and “limited capabilities in long-horizon software tasks,” reducing its capacity to execute hidden, multi-step sabotage.

Earlier conclusions remain intact. The firm continues to cite “weak reliability on highly complex tasks” and “weak deception capabilities on complex tasks” as practical constraints on harmful autonomy.
