Evaluating Control Protocols for Untrusted AI Agents (arxiv.org)
1 point by timini 5 months ago | 1 comment


This paper evaluates three control protocols for untrusted agents: deferral to trusted models, resampling, and critical action deferral. In initial testing, resampling and critical action deferral achieved 96% safety. Under adversarial testing, however, resampling dropped to 17% safety when attackers could detect resampling or simulate monitors, while critical action deferral remained robust against all attack strategies.
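
To make the last of those three concrete, here is a rough sketch of one plausible reading of "critical action deferral": the untrusted agent proposes an action, and anything on a fixed list of critical actions is handed to the trusted model instead. This is not taken from the paper. The names `Action`, `CRITICAL_ACTIONS`, `critical_action_deferral`, and the policy callables are all hypothetical, and the paper's actual protocol may differ (for example, it may also involve a monitor score).

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class Action:
    name: str      # e.g. "read_file" or "delete_file"
    argument: str


# Hypothetical list of action names treated as critical (an assumption
# for illustration, not the paper's definition).
CRITICAL_ACTIONS = {"delete_file", "send_email", "run_shell"}


def critical_action_deferral(
    task: str,
    untrusted_policy: Callable[[str], Action],
    trusted_policy: Callable[[str], Action],
) -> Tuple[Action, str]:
    """Return the action to execute and which model produced it."""
    proposed = untrusted_policy(task)
    if proposed.name in CRITICAL_ACTIONS:
        # Critical step: discard the untrusted proposal and let the
        # trusted model choose the action instead.
        return trusted_policy(task), "trusted"
    # Non-critical step: run the untrusted proposal as-is.
    return proposed, "untrusted"
```

In this sketch the deferral rule depends only on the type of the proposed action, so the untrusted agent gets no retry to observe or probe. Whether that is the reason for the robustness reported in the paper is a guess on my part.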



