Various Redteams
attack current models and see what they do / deliberately induce bad things on current frontier models to test out our theories / methods.
Theory of Change:to ensure models are safe, we must actively try to break them. By developing and applying a diverse suite of attacks (e.g., in novel domains, against agentic systems, or using automated tools), researchers can discover vulnerabilities, specification gaming, and deceptive behaviors before they are exploited, thereby informing the development of more robust defenses.
General Approach:Behavioral
Target Case:Average Case
See Also:
Some names:Ryan Greenblatt
Estimated FTEs:100+
Outputs:
WildTeaming at Scale: From In-the-Wild Jailbreaks to (Adversarially) Safer Language Models— Liwei Jiang, Kavel Rao, Seungju Han, Allyson Ettinger, Faeze Brahman, Sachin Kumar, Niloofar Mireshghallah, Ximing Lu, Maarten Sap, Yejin Choi, Nouha Dziri
In-Context Representation Hijacking— Itay Yona, Amir Sarid, Michael Karasik, Yossi Gandelsman
Building and evaluating alignment auditing agents— Trenton Bricken, Rowan Wang, Sam Bowman, Euan Ong, Johannes Treutlein, Jeff Wu, Evan Hubinger, Samuel Marks
Findings from a Pilot Anthropic—OpenAI Alignment Evaluation Exercise— Samuel R. Bowman, Megha Srivastava, Jon Kutasov, Rowan Wang, Trenton Bricken, Benjamin Wright, Ethan Perez, Nicholas Carlini
Agentic Misalignment: How LLMs could be insider threats— Aengus Lynch, Benjamin Wright, Caleb Larson, Kevin K. Troy, Stuart J. Ritchie, Sören Mindermann, Ethan Perez, Evan Hubinger
Compromising Honesty and Harmlessness in Language Models via Deception Attacks— Laurène Vaugrante, Francesca Carlon, Maluna Menke, Thilo Hagendorff
Eliciting Language Model Behaviors with Investigator Agents— Xiang Lisa Li, Neil Chowdhury, Daniel D. Johnson, Tatsunori Hashimoto, Percy Liang, Sarah Schwettmann, Jacob Steinhardt
Shutdown Resistance in Large Language Models— Jeremy Schlatter, Benjamin Weinstein-Raun, Jeffrey Ladish
Stress Testing Deliberative Alignment for Anti-Scheming Training— Bronson Schoen, Evgenia Nitishinskaya, Mikita Balesni, Axel Højmark, Felix Hofstätter, Jérémy Scheurer, Alexander Meinke, Jason Wolfe, Teun van der Weij, Alex Lloyd, Nicholas Goldowsky-Dill, Angela Fan, Andrei Matveiakin, Rusheb Shah, Marcus Williams, Amelia Glaese, Boaz Barak, Wojciech Zaremba, Marius Hobbhahn
Chain-of-Thought Hijacking— Jianli Zhao, Tingchen Fu, Rylan Schaeffer, Mrinank Sharma, Fazl Barez
X-Teaming: Multi-Turn Jailbreaks and Defenses with Adaptive Multi-Agents— Salman Rahman, Liwei Jiang, James Shiffer, Genglin Liu, Sheriff Issaka, Md Rizwan Parvez, Hamid Palangi, Kai-Wei Chang, Yejin Choi, Saadia Gabriel
Agentic Misalignment: How LLMs Could be Insider Threats— Aengus Lynch, Benjamin Wright, Caleb Larson, Stuart Richie, Sören Mindermann, Evan Hubinger, Ethan Perez, Kevin Troy
Illusory Safety: Redteaming DeepSeek R1 and the Strongest Fine-Tunable Models of OpenAI, Anthropic, and Google— ChengCheng, Brendan Murphy, Adrià Garriga-alonso, Yashvardhan Sharma, dsbowen, smallsilo, Yawen Duan, ChrisCundy, Hannah Betts, AdamGleave, Kellin Pelrine
Why Do Some Language Models Fake Alignment While Others Don't?— Abhay Sheshadri, John Hughes, Julian Michael, Alex Mallen, Arun Jose, Janus, Fabien Roger
Demonstrating specification gaming in reasoning models— Alexander Bondarenko, Denis Volk, Dmitrii Volkov, Jeffrey Ladish
Jailbreak-Tuning: Models Efficiently Learn Jailbreak Susceptibility— Brendan Murphy, Dillon Bowen, Shahrad Mohammadzadeh, Tom Tseng, Julius Broomfield, Adam Gleave, Kellin Pelrine
Monitoring Decomposition Attacks in LLMs with Lightweight Sequential Monitors— Chen Yueh-Han, Nitish Joshi, Yulin Chen, Maksym Andriushchenko, Rico Angell, He He
Diverse and Effective Red Teaming with Auto-generated Rewards and Multi-step Reinforcement Learning— Alex Beutel, Kai Xiao, Johannes Heidecke, Lilian Weng
Call Me A Jerk: Persuading AI to Comply with Objectionable Requests— Lennart Meincke, Dan Shapiro, Angela Duckworth, Ethan R. Mollick, Lilach Mollick, Robert Cialdini
RedDebate: Safer Responses through Multi-Agent Red Teaming Debates— Ali Asad, Stephen Obadinma, Radin Shayanfar, Xiaodan Zhu
The Structural Safety Generalization Problem— Julius Broomfield, Tom Gibbs, Ethan Kosak-Hine, George Ingebretsen, Tia Nasir, Jason Zhang, Reihaneh Iranmanesh, Sara Pieri, Reihaneh Rabbany, Kellin Pelrine
No, of Course I Can! Deeper Fine-Tuning Attacks That Bypass Token-Level Safety Mechanisms— Joshua Kazdan, Abhay Puri, Rylan Schaeffer, Lisa Yu, Chris Cundy, Jason Stanley, Sanmi Koyejo, Krishnamurthy Dvijotham
Fundamental Limitations in Pointwise Defences of LLM Finetuning APIs— Xander Davies, Eric Winsor, Alexandra Souly, Tomek Korbak, Robert Kirk, Christian Schroeder de Witt, Yarin Gal
STACK: Adversarial Attacks on LLM Safeguard Pipelines— Ian R. McKenzie, Oskar J. Hollinsworth, Tom Tseng, Xander Davies, Stephen Casper, Aaron D. Tucker, Robert Kirk, Adam Gleave
Adversarial Manipulation of Reasoning Models using Internal Representations— Kureha Yamaguchi, Benjamin Etheridge, Andy Arditi
Discovering Forbidden Topics in Language Models— Can Rager, Chris Wendler, Rohit Gandikota, David Bau
RL-Obfuscation: Can Language Models Learn to Evade Latent-Space Monitors?— Rohan Gupta, Erik Jenner
Jailbreak Transferability Emerges from Shared Representations— Rico Angell, Jannik Brinkmann, He He
Mitigating Many-Shot Jailbreaking— Christopher M. Ackerman, Nina Panickssery
Active Attacks: Red-teaming LLMs via Adaptive Environments— Taeyoung Yun, Pierre-Luc St-Charles, Jinkyoo Park, Yoshua Bengio, Minsu Kim
LLM Robustness Leaderboard v1 --Technical report— Pierre Peigné - Lefebvre, Quentin Feuillade-Montixi, Tom David, Nicolas Miailhe
Jailbreak Defense in a Narrow Domain: Limitations of Existing Methods and a New Transcript-Classifier Approach— Tony T. Wang, John Hughes, Henry Sleight, Rylan Schaeffer, Rajashree Agrawal, Fazl Barez, Mrinank Sharma, Jesse Mu, Nir Shavit, Ethan Perez
It's the Thought that Counts: Evaluating the Attempts of Frontier LLMs to Persuade on Harmful Topics— Matthew Kowal, Jasper Timm, Jean-Francois Godbout, Thomas Costello, Antonio A. Arechar, Gordon Pennycook, David Rand, Adam Gleave, Kellin Pelrine
Discovering Undesired Rare Behaviors via Model Diff Amplification— Santiago Aranguri, Thomas McGrath
REINFORCE Adversarial Attacks on Large Language Models: An Adaptive, Distributional, and Semantic Objective— Simon Geisler, Tom Wollschläger, M. H. I. Abdalla, Vincent Cohen-Addad, Johannes Gasteiger, Stephan Günnemann
Adversarial Attacks on Robotic Vision Language Action Models— Eliot Krzysztof Jones, Alexander Robey, Andy Zou, Zachary Ravichandran, George J. Pappas, Hamed Hassani, Matt Fredrikson, J. Zico Kolter
MMDT: Decoding the Trustworthiness and Safety of Multimodal Foundation Models— Chejian Xu, Jiawei Zhang, Zhaorun Chen, Chulin Xie, Mintong Kang, Yujin Potter, Zhun Wang, Zhuowen Yuan, Alexander Xiong, Zidi Xiong, Chenhui Zhang, Lingzhi Yuan, Yi Zeng, Peiyang Xu, Chengquan Guo, Andy Zhou, Jeffrey Ziwei Tan, Xuandong Zhao, Francesco Pinto, Zhen Xiang, Yu Gai, Zinan Lin, Dan Hendrycks, Bo Li, Dawn Song
Toward Understanding the Transferability of Adversarial Suffixes in Large Language Models— Sarah Ball, Niki Hasrati, Alexander Robey, Avi Schwarzschild, Frauke Kreuter, Zico Kolter, Andrej Risteski
Will alignment-faking Claude accept a deal to reveal its misalignment?— Ryan Greenblatt, Kyle Fish
'For Argument's Sake, Show Me How to Harm Myself!': Jailbreaking LLMs in Suicide and Self-Harm Contexts— Annika M Schoene, Cansu Canca
Winning at All Cost: A Small Environment for Eliciting Specification Gaming Behaviors in Large Language Models— Lars Malmqvist
Uncovering Gaps in How Humans and LLMs Interpret Subjective Language— Erik Jones, Arjun Patrawala, Jacob Steinhardt
RedCodeAgent: Automatic Red-teaming Agent against Diverse Code Agents— Chengquan Guo, Chulin Xie, Yu Yang, Zhaorun Chen, Zinan Lin, Xander Davies, Yarin Gal, Dawn Song, Bo Li
MIP against Agent: Malicious Image Patches Hijacking Multimodal OS Agents— Lukas Aichberger, Alasdair Paren, Guohao Li, Philip Torr, Yarin Gal, Adel Bibi
Research directions Open Phil wants to fund in technical AI safety— jake_mendel, maxnadeau, Peter Favaloro
Multi-Agent Step Race Benchmark: Assessing LLM Collaboration and Deception Under Pressure— lechmazur, eltociear
ToolTweak: An Attack on Tool Selection in LLM-based Agents— Jonathan Sneh, Ruomei Yan, Jialin Yu, Philip Torr, Yarin Gal, Sunando Sengupta, Eric Sommerlade, Alasdair Paren, Adel Bibi
Adaptive Attacks Break Defenses Against Indirect Prompt Injection Attacks on LLM Agents— Qiusi Zhan, Richard Fang, Henil Shalin Panchal, Daniel Kang
Adversarial Alignment for LLMs Requires Simpler, Reproducible, and More Measurable Objectives— Leo Schwinn, Yan Scholten, Tom Wollschläger, Sophie Xhonneux, Stephen Casper, Stephan Günnemann, Gauthier Gidel
Transferable Adversarial Attacks on Black-Box Vision-Language Models— Kai Hu, Weichen Yu, Li Zhang, Alexander Robey, Andy Zou, Chengming Xu, Haoqi Hu, Matt Fredrikson
Advancing Gemini's security safeguards— Google DeepMind Security & Privacy Research Team