Anthropic details how it improved Claude's safety training after finding agentic misalignment in older models, such as Opus 4 blackmailing engineers (Anthropic)

Anthropic:
Last year, we released a case study on agentic misalignment. In experimental scenarios, we showed that AI models from many different …
