Agent scores 95% on task completion. Ship it. But that same agent has a 48% attack success rate via prompt injection in our pentest against these models, meaning roughly half the time you feed it a malicious prompt, it does what the attacker wants.
"Ready for production" needs a safety column next to the capability column.
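To make the "two columns" idea concrete, here is a minimal sketch of a release gate that checks both numbers. The thresholds, the names `EvalReport` and `ready_for_production`, and the 48-out-of-100 counts are all hypothetical, chosen only to reproduce the 95% and 48% figures above; they are not from any real eval harness.

```python
from dataclasses import dataclass


@dataclass
class EvalReport:
    task_completion: float      # capability column: fraction of tasks completed (0..1)
    attack_success_rate: float  # safety column: fraction of injection attempts that worked (0..1)


def attack_success_rate(successes: int, attempts: int) -> float:
    """Share of prompt-injection attempts where the agent did what the attacker wanted."""
    if attempts <= 0:
        raise ValueError("attempts must be positive")
    return successes / attempts


def ready_for_production(
    report: EvalReport,
    min_completion: float = 0.90,  # hypothetical capability bar
    max_asr: float = 0.05,         # hypothetical safety bar
) -> bool:
    """Pass only if BOTH the capability and the safety thresholds are met."""
    return (
        report.task_completion >= min_completion
        and report.attack_success_rate <= max_asr
    )


# Illustrative counts only: 48 successful injections out of 100 attempts.
report = EvalReport(
    task_completion=0.95,
    attack_success_rate=attack_success_rate(48, 100),
)
print(ready_for_production(report))  # False: capability passes, safety fails
```

With a capability-only gate the 95% agent ships; with both columns it doesn't.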
These OSS model makers need to stop benchmarking against old models. Showing how it performs against Opus 4.5 and GLM-5 when we already have Opus 4.6 and GLM-5.1 just tells me that it's not comparable to SOTA.
It's a point update to the closed-weight Qwen3.5-Plus. Of course there are no weights. Alibaba has consistently not released weights for their best models.