High-Actuation Spaces
Mech interp and alignment assume a stable "computational substrate" (linear algebra on GPUs). If later AI uses different substrates (e.g. something neuromorphic), methods like probes and steering will not transfer. Therefore, better to try and infer goals via a "telic DAG" which abstracts over substrates, and so sidestep the issue of how to define intermediate representations. Category theory is intended to provide guarantees that this abstraction is valid.
Theory of Change:Sufficiently complex mindlike entities can alter their goals in ways that cannot be predicted or accounted for under substrate-dependent descriptions of the kind sought in mechanistic interpretability. use the telic DAG to define a method analogous to factoring a causal DAG.
Target Case:Pessimistic
See Also:
Some names:Sahil K, Matt Farr, Aditya Arpitha Prasad, Chris Pang, Aditya Adiga, Jayson Amati, Steve Petersen, Topos
Estimated FTEs:1-10
Outputs:
HIBP Human Inductive Bias Project Plan— Félix Dorn
MoSSAIC: AI Safety After Mechanism— Matt Farr, Aditya Arpitha Prasad, Chris Pang, Aditya Adiga, Jayson Amati, Sahil K