Shallow Review of Technical AI Safety, 2025

Human inductive biases

Discover connections between deep learning AI systems and human brains and learning processes. Develop an 'alignment moonshot' based on a coherent theory of learning that applies to both humans and AI systems.
Theory of Change: Humans learn trust, honesty, self-maintenance, and corrigibility; if we understand how they do, maybe we can get future AI systems to learn them.
General Approach: Cognitive
Target Case: Pessimistic
See Also: active learning, ACS research
Some names: Lukas Muttenthaler, Quentin Delfosse
Estimated FTEs: 4
Outputs:
Aligning machine and human visual representations across abstraction levels. Lukas Muttenthaler, Klaus Greff, Frieda Born, Bernhard Spitzer, Simon Kornblith, Michael C. Mozer, Klaus-Robert Müller, Thomas Unterthiner, Andrew K. Lampinen
Beginning with You: Perceptual-Initialization Improves Vision-Language Representation and Alignment. Yang Hu, Runchen Wang, Stephen Chong Zhao, Xuhui Zhan, Do Hun Kim, Mark Wallace, David A. Tovar
Towards Cognitively-Faithful Decision-Making Models to Improve AI Alignment. Cyrus Cousins, Vijay Keswani, Vincent Conitzer, Hoda Heidari, Jana Schaich Borg, Walter Sinnott-Armstrong