Shallow Review of Technical AI Safety, 2025

Data quality for alignment

Improves the quality, signal-to-noise ratio, and reliability of human-generated preference and alignment data.
Theory of Change: Alignment quality depends heavily on the quality of the underlying data (e.g., human preference labels); by improving the "signal" from annotators and reducing noise and bias, we get more robustly aligned models (see the sketch below this list).
General Approach: Engineering
Target Case: Average Case
Some names: Maarten Buyl, Kelsey Kraus, Margaret Kroll, Danqing Shi
Estimated FTEs: 20-50
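To make the "improve annotator signal" idea concrete, here is a minimal sketch (my illustration, not a method from any of the papers listed under Outputs) of noise-aware preference aggregation: a Dawid-Skene-style EM loop that estimates each annotator's reliability from binary pairwise votes and weights their labels accordingly. The function name aggregate_preferences and the simulated data are hypothetical.

```python
# Minimal sketch, assuming binary pairwise preference votes (1 = "prefers A").
# A Dawid-Skene-style EM loop estimates per-annotator reliability and
# down-weights noisy raters. Hypothetical illustration only.
import numpy as np

def aggregate_preferences(labels: np.ndarray, n_iters: int = 50):
    """labels: (n_items, n_annotators) array of 0/1 votes, NaN where an
    annotator skipped an item. Returns (posterior P[item prefers A],
    estimated per-annotator accuracy)."""
    mask = ~np.isnan(labels)          # which (item, annotator) cells exist
    votes = np.nan_to_num(labels)
    n_items, n_annot = labels.shape

    # Start from the missing-aware majority vote as the posterior estimate.
    counts = np.maximum(mask.sum(axis=1), 1)
    p = (votes * mask).sum(axis=1) / counts
    acc = np.full(n_annot, 0.7)       # prior guess: annotators are decent

    for _ in range(n_iters):
        # M-step: annotator accuracy = expected agreement with posteriors.
        for j in range(n_annot):
            m = mask[:, j]
            if not m.any():
                continue
            agree = p[m] * votes[m, j] + (1 - p[m]) * (1 - votes[m, j])
            acc[j] = np.clip(agree.mean(), 0.05, 0.95)
        # E-step: re-score items; reliable annotators get larger log-odds.
        log_odds = np.zeros(n_items)
        for j in range(n_annot):
            m = mask[:, j]
            sign = np.where(votes[m, j] == 1, 1.0, -1.0)
            log_odds[m] += sign * np.log(acc[j] / (1 - acc[j]))
        p = 1.0 / (1.0 + np.exp(-log_odds))
    return p, acc

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    truth = rng.integers(0, 2, size=200)
    # Three careful annotators (90% accurate) and one noisy one (55%).
    rater_acc = [0.9, 0.9, 0.9, 0.55]
    labels = np.stack(
        [np.where(rng.random(200) < a, truth, 1 - truth) for a in rater_acc],
        axis=1,
    ).astype(float)
    p, est = aggregate_preferences(labels)
    print("estimated annotator accuracies:", est.round(2))
```

Running the demo should single out the noisy fourth annotator (estimated accuracy near 0.55); per-rater reliability estimates of this kind are the sort of signal these agendas propose using to filter or re-weight preference data before training.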
Outputs:
AI Alignment at Your Discretion (Maarten Buyl, Hadi Khalaf, Claudio Mayrink Verdun, Lucas Monteiro Paes, Caio C. Vieira Machado, Flavio du Pin Calmon)
DxHF: Providing High-Quality Human Feedback for LLM Alignment via Interactive Decomposition (Danqing Shi, Furui Cheng, Tino Weinkauf, Antti Oulasvirta, Mennatallah El-Assady)
Challenges and Future Directions of Data-Centric AI Alignment (Min-Hsuan Yeh, Jeffrey Wang, Xuefeng Du, Seongheon Park, Leitian Tao, Shawn Im, Yixuan Li)
You Are What You Eat -- AI Alignment Requires Understanding How Data Shapes Structure and Generalisation (Simon Pepin Lehalleur, Jesse Hoogland, Matthew Farrugia-Roberts, Susan Wei, Alexander Gietelink Oldenziel, George Wang, Liam Carroll, Daniel Murfet)