Shallow Review of Technical AI Safety, 2025

Meta

Structure: for-profit
Safety teams:

Safety "integrated into" capabilities research, Meta Superintelligence Lab. But also FAIR Alignment, Brain and AI.

Framework: Frontier AI Framework (FAF)
Some names: Shuchao Bi, Hongyuan Zhan, Jingyu Zhang, Haozhu Wang, Eric Michael Smith, Sid Wang, Amr Sharaf, Mahesh Pasupuleti, Jason Weston, ShengYun Peng, Ivan Evtimov, Song Jiang, Pin-Yu Chen, Evangelia Spiliopoulou, Lei Yu, Virginie Do, Karen Hambardzumyan, Nicola Cancedda, Adina Williams
Critiques:
Outputs:
The Alignment Waltz: Jointly Training Agents to Collaborate for Safety (Jingyu Zhang, Haozhu Wang, Eric Michael Smith, Sid Wang, Amr Sharaf, Mahesh Pasupuleti, Benjamin Van Durme, Daniel Khashabi, Jason Weston, Hongyuan Zhan)
Large Reasoning Models Learn Better Alignment from Flawed Thinking (ShengYun Peng, Eric Smith, Ivan Evtimov, Song Jiang, Pin-Yu Chen, Hongyuan Zhan, Haozhu Wang, Duen Horng Chau, Mahesh Pasupuleti, Jianfeng Chi)
Robust LLM safeguarding via refusal feature adversarial training (Lei Yu, Virginie Do, Karen Hambardzumyan, Nicola Cancedda)
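The last of these names a concrete mechanism: treat refusal as a direction in activation space and adversarially train the model to keep refusing even when that direction is ablated. Below is a minimal sketch of that core idea, assuming a difference-in-means refusal direction and projection-based ablation; the function names, shapes, and training note are illustrative assumptions, not the authors' code.

```python
import torch

def refusal_direction(harmful_acts: torch.Tensor,
                      harmless_acts: torch.Tensor) -> torch.Tensor:
    """Unit-norm difference-in-means direction between hidden states
    on harmful vs. harmless prompts, each of shape (n_prompts, d_model).
    (Assumed construction of the 'refusal feature'.)"""
    r = harmful_acts.mean(dim=0) - harmless_acts.mean(dim=0)
    return r / r.norm()

def ablate_refusal(hidden: torch.Tensor, r_hat: torch.Tensor) -> torch.Tensor:
    """Project the refusal direction out of hidden states:
    h' = h - (h . r_hat) r_hat, for hidden of shape (..., d_model)."""
    coeff = hidden @ r_hat                     # projection coefficients (...,)
    return hidden - coeff.unsqueeze(-1) * r_hat

if __name__ == "__main__":
    torch.manual_seed(0)
    d = 8
    harmful = torch.randn(32, d) + 1.0         # toy stand-ins for activations
    harmless = torch.randn(32, d)
    r_hat = refusal_direction(harmful, harmless)
    h = torch.randn(4, d)
    h_ablated = ablate_refusal(h, r_hat)
    # Component along r_hat is ~0 after ablation:
    print((h_ablated @ r_hat).abs().max())
    # During fine-tuning, one would hook ablate_refusal into selected layers
    # on a fraction of harmful examples and still train the model to refuse,
    # approximating an adversarial attack that removes the refusal feature.
```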