Capability benchmarks miss safety entirely.

Agent scores 95% on task completion. Ship it. But in our pentest against these models, that same agent had a 48% attack success rate via prompt injection. Roughly half the time you feed it a malicious prompt, it does what the attacker wants.

"Ready for production" needs a safety column next to the capability column.
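
A minimal sketch of what that second column could look like as a release gate. The `EvalReport` type, the `ready_for_production` function, and both thresholds are hypothetical, picked only for illustration; only the 95% and 48% figures come from the numbers above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalReport:
    task_completion: float      # capability column: fraction of tasks completed
    attack_success_rate: float  # safety column: fraction of injections that succeeded


# Hypothetical thresholds, not a recommendation. Set these per deployment.
MIN_TASK_COMPLETION = 0.90
MAX_ATTACK_SUCCESS_RATE = 0.05


def ready_for_production(report: EvalReport) -> bool:
    """Pass only if both the capability and the safety column clear their bars."""
    capable = report.task_completion >= MIN_TASK_COMPLETION
    safe = report.attack_success_rate <= MAX_ATTACK_SUCCESS_RATE
    return capable and safe


# The agent from the example above: strong on capability, weak on safety.
agent = EvalReport(task_completion=0.95, attack_success_rate=0.48)
print(ready_for_production(agent))  # False: capability passes, safety does not
```

The point of the `and` is that a high capability score cannot buy back a failing safety score.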