Shallow Review of Technical AI Safety, 2025

Various Redteams

Attack current models and see what they do / deliberately induce bad behaviour in current frontier models to test out our theories / methods.
Theory of Change: To ensure models are safe, we must actively try to break them. By developing and applying a diverse suite of attacks (e.g., in novel domains, against agentic systems, or using automated tools), researchers can discover vulnerabilities, specification gaming, and deceptive behaviors before they are exploited, thereby informing the development of more robust defenses (a minimal attack-loop sketch follows the summary fields below).
General Approach: Behavioral
Target Case: Average Case
See Also:
Some names: Ryan Greenblatt
Estimated FTEs: 100+
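To make the shared workflow behind these outputs concrete, here is a minimal sketch of an automated red-teaming loop in the attacker/target/judge style that several of the papers below use. It is an illustrative sketch only: the names (red_team, Finding, attacker, target, judge) and the 0–1 harm-scoring convention are assumptions of this example, not any particular paper's or vendor's API.

```python
# Minimal automated red-teaming loop (illustrative sketch, not any paper's method).
# The attacker, target, and judge callables are assumed to wrap whatever models
# or APIs you actually use.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Finding:
    attack_prompt: str
    target_response: str
    judge_score: float  # assumed convention: 1.0 = clear safety failure, 0.0 = safe refusal


def red_team(
    seed_behaviors: List[str],
    attacker: Callable[[str], str],      # rewrites a harmful behavior into an attack prompt
    target: Callable[[str], str],        # the model under test
    judge: Callable[[str, str], float],  # scores (behavior, response) for harmfulness
    attempts_per_behavior: int = 5,
    threshold: float = 0.5,
) -> List[Finding]:
    """Collect attack transcripts that the judge flags as unsafe."""
    findings: List[Finding] = []
    for behavior in seed_behaviors:
        for _ in range(attempts_per_behavior):
            prompt = attacker(behavior)        # generate a candidate attack
            response = target(prompt)          # query the model under test
            score = judge(behavior, response)  # grade the outcome
            if score >= threshold:
                findings.append(Finding(prompt, response, score))
    return findings
```

In practice the attacker is itself usually a language model or a search procedure over prompts, and the flagged transcripts feed back into safety training or safeguard evaluation; the papers under Outputs differ mainly in how the attacker and judge are built and in what counts as a failure.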
Outputs:
WildTeaming at Scale: From In-the-Wild Jailbreaks to (Adversarially) Safer Language Models – Liwei Jiang, Kavel Rao, Seungju Han, Allyson Ettinger, Faeze Brahman, Sachin Kumar, Niloofar Mireshghallah, Ximing Lu, Maarten Sap, Yejin Choi, Nouha Dziri
In-Context Representation Hijacking – Itay Yona, Amir Sarid, Michael Karasik, Yossi Gandelsman
Building and evaluating alignment auditing agents – Trenton Bricken, Rowan Wang, Sam Bowman, Euan Ong, Johannes Treutlein, Jeff Wu, Evan Hubinger, Samuel Marks
Findings from a Pilot Anthropic—OpenAI Alignment Evaluation Exercise – Samuel R. Bowman, Megha Srivastava, Jon Kutasov, Rowan Wang, Trenton Bricken, Benjamin Wright, Ethan Perez, Nicholas Carlini
Agentic Misalignment: How LLMs could be insider threats – Aengus Lynch, Benjamin Wright, Caleb Larson, Kevin K. Troy, Stuart J. Ritchie, Sören Mindermann, Ethan Perez, Evan Hubinger
Compromising Honesty and Harmlessness in Language Models via Deception Attacks – Laurène Vaugrante, Francesca Carlon, Maluna Menke, Thilo Hagendorff
Eliciting Language Model Behaviors with Investigator Agents – Xiang Lisa Li, Neil Chowdhury, Daniel D. Johnson, Tatsunori Hashimoto, Percy Liang, Sarah Schwettmann, Jacob Steinhardt
Shutdown Resistance in Large Language Models – Jeremy Schlatter, Benjamin Weinstein-Raun, Jeffrey Ladish
Stress Testing Deliberative Alignment for Anti-Scheming Training – Bronson Schoen, Evgenia Nitishinskaya, Mikita Balesni, Axel Højmark, Felix Hofstätter, Jérémy Scheurer, Alexander Meinke, Jason Wolfe, Teun van der Weij, Alex Lloyd, Nicholas Goldowsky-Dill, Angela Fan, Andrei Matveiakin, Rusheb Shah, Marcus Williams, Amelia Glaese, Boaz Barak, Wojciech Zaremba, Marius Hobbhahn
Chain-of-Thought Hijacking – Jianli Zhao, Tingchen Fu, Rylan Schaeffer, Mrinank Sharma, Fazl Barez
X-Teaming: Multi-Turn Jailbreaks and Defenses with Adaptive Multi-Agents – Salman Rahman, Liwei Jiang, James Shiffer, Genglin Liu, Sheriff Issaka, Md Rizwan Parvez, Hamid Palangi, Kai-Wei Chang, Yejin Choi, Saadia Gabriel
Illusory Safety: Redteaming DeepSeek R1 and the Strongest Fine-Tunable Models of OpenAI, Anthropic, and Google – ChengCheng, Brendan Murphy, Adrià Garriga-alonso, Yashvardhan Sharma, dsbowen, smallsilo, Yawen Duan, Chris Cundy, Hannah Betts, Adam Gleave, Kellin Pelrine
Why Do Some Language Models Fake Alignment While Others Don't? – Abhay Sheshadri, John Hughes, Julian Michael, Alex Mallen, Arun Jose, Janus, Fabien Roger
Demonstrating specification gaming in reasoning models – Alexander Bondarenko, Denis Volk, Dmitrii Volkov, Jeffrey Ladish
Jailbreak-Tuning: Models Efficiently Learn Jailbreak Susceptibility – Brendan Murphy, Dillon Bowen, Shahrad Mohammadzadeh, Tom Tseng, Julius Broomfield, Adam Gleave, Kellin Pelrine
Monitoring Decomposition Attacks in LLMs with Lightweight Sequential Monitors – Chen Yueh-Han, Nitish Joshi, Yulin Chen, Maksym Andriushchenko, Rico Angell, He He
Call Me A Jerk: Persuading AI to Comply with Objectionable Requests – Lennart Meincke, Dan Shapiro, Angela Duckworth, Ethan R. Mollick, Lilach Mollick, Robert Cialdini
RedDebate: Safer Responses through Multi-Agent Red Teaming Debates – Ali Asad, Stephen Obadinma, Radin Shayanfar, Xiaodan Zhu
The Structural Safety Generalization Problem – Julius Broomfield, Tom Gibbs, Ethan Kosak-Hine, George Ingebretsen, Tia Nasir, Jason Zhang, Reihaneh Iranmanesh, Sara Pieri, Reihaneh Rabbany, Kellin Pelrine
No, of Course I Can! Deeper Fine-Tuning Attacks That Bypass Token-Level Safety Mechanisms – Joshua Kazdan, Abhay Puri, Rylan Schaeffer, Lisa Yu, Chris Cundy, Jason Stanley, Sanmi Koyejo, Krishnamurthy Dvijotham
Fundamental Limitations in Pointwise Defences of LLM Finetuning APIs – Xander Davies, Eric Winsor, Alexandra Souly, Tomek Korbak, Robert Kirk, Christian Schroeder de Witt, Yarin Gal
STACK: Adversarial Attacks on LLM Safeguard Pipelines – Ian R. McKenzie, Oskar J. Hollinsworth, Tom Tseng, Xander Davies, Stephen Casper, Aaron D. Tucker, Robert Kirk, Adam Gleave
Adversarial Manipulation of Reasoning Models using Internal Representations – Kureha Yamaguchi, Benjamin Etheridge, Andy Arditi
Discovering Forbidden Topics in Language Models – Can Rager, Chris Wendler, Rohit Gandikota, David Bau
Mitigating Many-Shot Jailbreaking – Christopher M. Ackerman, Nina Panickssery
Active Attacks: Red-teaming LLMs via Adaptive Environments – Taeyoung Yun, Pierre-Luc St-Charles, Jinkyoo Park, Yoshua Bengio, Minsu Kim
LLM Robustness Leaderboard v1 -- Technical report – Pierre Peigné-Lefebvre, Quentin Feuillade-Montixi, Tom David, Nicolas Miailhe
Jailbreak Defense in a Narrow Domain: Limitations of Existing Methods and a New Transcript-Classifier Approach – Tony T. Wang, John Hughes, Henry Sleight, Rylan Schaeffer, Rajashree Agrawal, Fazl Barez, Mrinank Sharma, Jesse Mu, Nir Shavit, Ethan Perez
It's the Thought that Counts: Evaluating the Attempts of Frontier LLMs to Persuade on Harmful Topics – Matthew Kowal, Jasper Timm, Jean-Francois Godbout, Thomas Costello, Antonio A. Arechar, Gordon Pennycook, David Rand, Adam Gleave, Kellin Pelrine
REINFORCE Adversarial Attacks on Large Language Models: An Adaptive, Distributional, and Semantic Objective – Simon Geisler, Tom Wollschläger, M. H. I. Abdalla, Vincent Cohen-Addad, Johannes Gasteiger, Stephan Günnemann
Adversarial Attacks on Robotic Vision Language Action Models – Eliot Krzysztof Jones, Alexander Robey, Andy Zou, Zachary Ravichandran, George J. Pappas, Hamed Hassani, Matt Fredrikson, J. Zico Kolter
MMDT: Decoding the Trustworthiness and Safety of Multimodal Foundation Models – Chejian Xu, Jiawei Zhang, Zhaorun Chen, Chulin Xie, Mintong Kang, Yujin Potter, Zhun Wang, Zhuowen Yuan, Alexander Xiong, Zidi Xiong, Chenhui Zhang, Lingzhi Yuan, Yi Zeng, Peiyang Xu, Chengquan Guo, Andy Zhou, Jeffrey Ziwei Tan, Xuandong Zhao, Francesco Pinto, Zhen Xiang, Yu Gai, Zinan Lin, Dan Hendrycks, Bo Li, Dawn Song
Toward Understanding the Transferability of Adversarial Suffixes in Large Language Models – Sarah Ball, Niki Hasrati, Alexander Robey, Avi Schwarzschild, Frauke Kreuter, Zico Kolter, Andrej Risteski
Uncovering Gaps in How Humans and LLMs Interpret Subjective Language – Erik Jones, Arjun Patrawala, Jacob Steinhardt
RedCodeAgent: Automatic Red-teaming Agent against Diverse Code Agents – Chengquan Guo, Chulin Xie, Yu Yang, Zhaorun Chen, Zinan Lin, Xander Davies, Yarin Gal, Dawn Song, Bo Li
MIP against Agent: Malicious Image Patches Hijacking Multimodal OS Agents – Lukas Aichberger, Alasdair Paren, Guohao Li, Philip Torr, Yarin Gal, Adel Bibi
ToolTweak: An Attack on Tool Selection in LLM-based Agents – Jonathan Sneh, Ruomei Yan, Jialin Yu, Philip Torr, Yarin Gal, Sunando Sengupta, Eric Sommerlade, Alasdair Paren, Adel Bibi
Adaptive Attacks Break Defenses Against Indirect Prompt Injection Attacks on LLM Agents – Qiusi Zhan, Richard Fang, Henil Shalin Panchal, Daniel Kang
Adversarial Alignment for LLMs Requires Simpler, Reproducible, and More Measurable Objectives – Leo Schwinn, Yan Scholten, Tom Wollschläger, Sophie Xhonneux, Stephen Casper, Stephan Günnemann, Gauthier Gidel
Transferable Adversarial Attacks on Black-Box Vision-Language Models – Kai Hu, Weichen Yu, Li Zhang, Alexander Robey, Andy Zou, Chengming Xu, Haoqi Hu, Matt Fredrikson
Advancing Gemini's security safeguards – Google DeepMind Security & Privacy Research Team