Two-faced AI language models learn to hide deception - ‘Sleeper agents’ seem benign during testing but behave differently once deployed. And methods to stop them aren’t working.
https://www.nature.com/articles/d41586-024-00189-3