Meta
Structure: for-profit
Safety teams:
Safety "integrated into" capabilities research, Meta Superintelligence Lab. But also FAIR Alignment, Brain and AI.
Framework: Frontier AI Framework (FAF)
See Also:
Some names: Shuchao Bi, Hongyuan Zhan, Jingyu Zhang, Haozhu Wang, Eric Michael Smith, Sid Wang, Amr Sharaf, Mahesh Pasupuleti, Jason Weston, ShengYun Peng, Ivan Evtimov, Song Jiang, Pin-Yu Chen, Evangelia Spiliopoulou, Lei Yu, Virginie Do, Karen Hambardzumyan, Nicola Cancedda, Adina Williams
Critiques:
Extreme underelicitation (Stein-Perlman); Carlsmith's critiques of AI labs in general.
Outputs:
The Alignment Waltz: Jointly Training Agents to Collaborate for Safety – Jingyu Zhang, Haozhu Wang, Eric Michael Smith, Sid Wang, Amr Sharaf, Mahesh Pasupuleti, Benjamin Van Durme, Daniel Khashabi, Jason Weston, Hongyuan Zhan
Large Reasoning Models Learn Better Alignment from Flawed Thinking – ShengYun Peng, Eric Smith, Ivan Evtimov, Song Jiang, Pin-Yu Chen, Hongyuan Zhan, Haozhu Wang, Duen Horng Chau, Mahesh Pasupuleti, Jianfeng Chi
Robust LLM safeguarding via refusal feature adversarial training – Lei Yu, Virginie Do, Karen Hambardzumyan, Nicola Cancedda