Brain-Like AGI Safety
Social and moral instincts are (partly) implemented in particular hardwired brain circuitry; let's figure out what those circuits are and how they work. This will involve symbol grounding. The underlying picture of the brain: "a yet-to-be-invented variation on actor-critic model-based reinforcement learning".
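For readers unfamiliar with that architecture, here is a minimal, illustrative sketch of actor-critic model-based RL (Dyna-style planning from a learned world model). This is not Byrnes's proposed brain model; the toy chain environment, names, and hyperparameters are all invented here for illustration.

```python
# Minimal actor-critic model-based RL sketch (Dyna-style).
# Illustrative only; everything here is an assumption, not Byrnes's actual architecture.
import random
import math
from collections import defaultdict

N_STATES, N_ACTIONS = 5, 2          # toy chain world: move left/right, reward at the right end
GAMMA, ALPHA, PLAN_STEPS = 0.9, 0.1, 10

def step(state, action):
    """Toy environment: action 1 moves right, 0 moves left; reward at the last state."""
    nxt = max(0, min(N_STATES - 1, state + (1 if action == 1 else -1)))
    reward = 1.0 if nxt == N_STATES - 1 else 0.0
    return nxt, reward

critic = defaultdict(float)                      # V(s): learned state-value estimates
actor = defaultdict(lambda: [0.0] * N_ACTIONS)   # action preferences per state
model = {}                                       # learned world model: (s, a) -> (s', r)

def policy(state):
    """Sample an action from a softmax over the actor's preferences."""
    exps = [math.exp(p) for p in actor[state]]
    r = random.random() * sum(exps)
    for a, e in enumerate(exps):
        r -= e
        if r <= 0:
            return a
    return N_ACTIONS - 1

def td_update(s, a, r, s2):
    """Actor-critic core: critic learns V via TD; actor moves in the TD-error direction."""
    td_error = r + GAMMA * critic[s2] - critic[s]
    critic[s] += ALPHA * td_error
    actor[s][a] += ALPHA * td_error

for episode in range(200):
    s = 0
    for _ in range(20):
        a = policy(s)
        s2, r = step(s, a)
        model[(s, a)] = (s2, r)      # update the learned world model
        td_update(s, a, r, s2)       # learn from real experience
        # "Model-based" part: extra updates from imagined transitions replayed from the model
        for _ in range(PLAN_STEPS):
            ps, pa = random.choice(list(model))
            ps2, pr = model[(ps, pa)]
            td_update(ps, pa, pr, ps2)
        s = s2
        if s == N_STATES - 1:
            break

print({s: round(critic[s], 2) for s in range(N_STATES)})  # values rise toward the goal state
```

On this view, the open research problem is not the RL scaffolding itself but which reward circuitry (the "steering subsystem") drives it and how its signals get grounded in learned concepts.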
Theory of Change: Fairly direct alignment via changing training to reflect actual human reward. Gather real data on the map (reward, training data) → (human values) to inform theorising about the analogous map in AIs; "understand human social instincts, and then maybe adapt some aspects of those for AGIs, presumably in conjunction with other non-biological ingredients".
General Approach: Cognitive
Target Case: Worst Case
Estimated FTEs: 1-5
Critiques:
Outputs:
Reward button alignment — Steven Byrnes
Against RL: The Case for System 2 Learning — Andreas Stuhlmüller