Safeguards (inference-time auxiliaries)

Layers of inference-time defenses, such as classifiers, monitors, and rapid-response protocols, to detect and block jailbreaks, prompt injections, and other harmful model behaviors.

Theory of Change: By building scalable, hardened defensive layers on top of an unsafe model, we can defend against known and unknown attacks, monitor for misuse, and prevent models from causing harm, even if the core model has vulnerabilities.
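To make the layering concrete, here is a minimal sketch of an input check and an output check wrapped around a model call. Everything in it is an illustrative assumption, not taken from any of the listed papers: the names `guarded_generate`, `keyword_score`, `GuardResult`, and `echo_model` are hypothetical, and the keyword list is a toy stand-in for the trained classifiers a real system would use.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical stand-in: a real safeguard would use a trained classifier,
# not a fixed phrase list.
BLOCKLIST = ("ignore previous instructions", "build a bomb")


def keyword_score(text: str) -> float:
    """Toy harm score: 1.0 if any blocklisted phrase appears, else 0.0."""
    lowered = text.lower()
    return 1.0 if any(phrase in lowered for phrase in BLOCKLIST) else 0.0


@dataclass
class GuardResult:
    allowed: bool
    stage: str  # "input", "output", or "ok"
    text: str


def guarded_generate(
    prompt: str,
    model: Callable[[str], str],
    input_score: Callable[[str], float] = keyword_score,
    output_score: Callable[[str], float] = keyword_score,
    threshold: float = 0.5,
) -> GuardResult:
    """Run the model only if the prompt passes, and release the
    completion only if it passes too."""
    # Layer 1: screen the prompt before the model sees it.
    if input_score(prompt) >= threshold:
        return GuardResult(False, "input", "[blocked by input check]")

    completion = model(prompt)

    # Layer 2: screen the completion, since the model itself may be unsafe.
    if output_score(completion) >= threshold:
        return GuardResult(False, "output", "[blocked by output check]")

    return GuardResult(True, "ok", completion)


def echo_model(prompt: str) -> str:
    """Placeholder model used only for this illustration."""
    return f"You said: {prompt}"
```

Calling `guarded_generate("hello", echo_model)` passes both layers and returns the completion, while a prompt containing a blocklisted phrase is stopped at the input stage. The two layers are independent, so a request that slips past the input check can still be caught on the way out.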

General Approach: Engineering

Target Case: Average Case

Orthodox Problems:

7. Superintelligence can fool human supervisors

12. A boxed AGI might exfiltrate itself

See Also:

Various Redteams, Iterative alignment

Some names: Mrinank Sharma, Meg Tong, Jesse Mu, Alwin Peng, Julian Michael, Henry Sleight, Theodore Sumers, Raj Agarwal, Nathan Bailey, Edoardo Debenedetti, Ilia Shumailov, Tianqi Fan, Sahil Verma, Keegan Hines, Jeff Bilmes

Estimated FTEs: 100+

Critiques:

Obfuscated Activations Bypass LLM Latent-Space Defenses

Outputs:

Constitutional Classifiers: Defending against Universal Jailbreaks across Thousands of Hours of Red Teaming— Mrinank Sharma, Meg Tong, Jesse Mu, Jerry Wei, Jorrit Kruthoff, Scott Goodfriend, Euan Ong, Alwin Peng, Raj Agarwal, Cem Anil, Amanda Askell, Nathan Bailey, Joe Benton, Emma Bluemke, Samuel R. Bowman, Eric Christiansen, Hoagy Cunningham, Andy Dau, Anjali Gopal, Rob Gilson, Logan Graham, Logan Howard, Nimit Kalra, Taesung Lee, Kevin Lin, Peter Lofgren, Francesco Mosconi, Clare O'Hara, Catherine Olsson, Linda Petrini, Samir Rajani, Nikhil Saxena, Alex Silverstein, Tanya Singh, Theodore Sumers, Leonard Tang, Kevin K. Troy, Constantin Weisser, Ruiqi Zhong, Giulio Zhou, Jan Leike, Jared Kaplan, Ethan Perez

Rapid Response: Mitigating LLM Jailbreaks with a Few Examples— Alwin Peng, Julian Michael, Henry Sleight, Ethan Perez, Mrinank Sharma

Monitoring computer use via hierarchical summarization— Theodore Sumers, Raj Agarwal, Nathan Bailey, Tim Belonax, Brian Clarke, Jasmine Deng, Kyla Guru, Evan Frondorf, Keegan Hankes, Jacob Klein, Lynx Lean, Kevin Lin, Linda Petrini, Madeleine Tucker, Ethan Perez, Mrinank Sharma, Nikhil Saxena

Defeating Prompt Injections by Design— Edoardo Debenedetti, Ilia Shumailov, Tianqi Fan, Jamie Hayes, Nicholas Carlini, Daniel Fabian, Christoph Kern, Chongyang Shi, Andreas Terzis, Florian Tramèr

Introducing Anthropic's Safeguards Research Team

OMNIGUARD: An Efficient Approach for AI Safety Moderation Across Modalities— Sahil Verma, Keegan Hines, Jeff Bilmes, Charlotte Siska, Luke Zettlemoyer, Hila Gonen, Chandan Singh