Brain-Like AGI Safety
Social and moral instincts are (partly) implemented in particular hardwired brain circuitry; let's figure out what those circuits are and how they work. This will involve symbol grounding. The underlying picture of the brain: "a yet-to-be-invented variation on actor-critic model-based reinforcement learning".
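For readers unfamiliar with that architecture, here is a minimal, illustrative sketch of actor-critic model-based RL (Dyna-style planning from a learned world model). This is not Byrnes's proposed brain model; the toy chain environment, names, and hyperparameters are all invented here for illustration.

```python
# Minimal actor-critic model-based RL sketch (Dyna-style).
# Illustrative only; everything here is an assumption, not Byrnes's actual architecture.
import random
import math
from collections import defaultdict

N_STATES, N_ACTIONS = 5, 2          # toy chain world: move left/right, reward at the right end
GAMMA, ALPHA, PLAN_STEPS = 0.9, 0.1, 10

def step(state, action):
    """Toy environment: action 1 moves right, 0 moves left; reward at the last state."""
    nxt = max(0, min(N_STATES - 1, state + (1 if action == 1 else -1)))
    reward = 1.0 if nxt == N_STATES - 1 else 0.0
    return nxt, reward

critic = defaultdict(float)                      # V(s): learned state-value estimates
actor = defaultdict(lambda: [0.0] * N_ACTIONS)   # action preferences per state
model = {}                                       # learned world model: (s, a) -> (s', r)

def policy(state):
    """Sample an action from a softmax over the actor's preferences."""
    exps = [math.exp(p) for p in actor[state]]
    r = random.random() * sum(exps)
    for a, e in enumerate(exps):
        r -= e
        if r <= 0:
            return a
    return N_ACTIONS - 1

def td_update(s, a, r, s2):
    """Actor-critic core: critic learns V via TD; actor moves in the TD-error direction."""
    td_error = r + GAMMA * critic[s2] - critic[s]
    critic[s] += ALPHA * td_error
    actor[s][a] += ALPHA * td_error

for episode in range(200):
    s = 0
    for _ in range(20):
        a = policy(s)
        s2, r = step(s, a)
        model[(s, a)] = (s2, r)      # update the learned world model
        td_update(s, a, r, s2)       # learn from real experience
        # "Model-based" part: extra updates from imagined transitions replayed from the model
        for _ in range(PLAN_STEPS):
            ps, pa = random.choice(list(model))
            ps2, pr = model[(ps, pa)]
            td_update(ps, pa, pr, ps2)
        s = s2
        if s == N_STATES - 1:
            break

print({s: round(critic[s], 2) for s in range(N_STATES)})  # values rise toward the goal state
```

On this view, the open research problem is not the RL scaffolding itself but which reward circuitry (the "steering subsystem") drives it and how its signals get grounded in learned concepts.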
Theory of Change: Fairly direct alignment via changing training to reflect actual human reward. Gather real data on the map (reward, training data) → (human values) to inform theorising about the analogous map in AIs; "understand human social instincts, and then maybe adapt some aspects of those for AGIs, presumably in conjunction with other non-biological ingredients".
General Approach: Cognitive
Target Case: Worst Case
Estimated FTEs: 1-5
Critiques:
Outputs:
Reward button alignment — Steven Byrnes
Against RL: The Case for System 2 Learning — Andreas Stuhlmüller