LLM introspection training
Train LLMs to predict the outputs of high-quality whitebox methods, to induce general self-explanation skills that draw on the model's own 'introspective' access; a minimal training sketch appears at the end of this entry.
Theory of Change: Use the resulting LLMs as a powerful form of dimensionality reduction, explaining model internals in a way distinct from both interpretability methods and CoT. Distilling self-explanation into the model should make the skill scalable, since advances in self-explanation will feed off advances in general intelligence.
General Approach: Cognitive
Some names: Vincent Huang, Jacob Steinhardt, Jack Lindsey
Estimated FTEs: 2-20
Outputs:
Training Language Models to Explain Their Own Computations (Belinda Z. Li, Zifan Carl Guo, Vincent Huang, Jacob Steinhardt, Jacob Andreas)
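A minimal sketch of the kind of training loop this direction implies, assuming a pretrained linear probe as the stand-in whitebox method; the probe, prompt template, and label set here are illustrative, not taken from the cited paper:

```python
# Sketch: fine-tune an LLM to verbalize what a whitebox probe reads off its
# own activations. All names (probe, prompt wording, labels) are hypothetical.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # small stand-in model for the sketch
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Whitebox step: run the model, capture a hidden state, and apply an
# (assumed pretrained) linear probe that predicts some internal property.
def whitebox_label(text: str, probe: torch.nn.Linear, layer: int = 6) -> str:
    inputs = tok(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    h = out.hidden_states[layer][0, -1]    # last-token activation
    cls = probe(h).argmax().item()         # probe's prediction
    return ["negative", "positive"][cls]   # hypothetical label set

probe = torch.nn.Linear(model.config.hidden_size, 2)  # assume pretrained
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

# Distillation step: train the same model to verbalize the probe's output
# from the raw prompt alone, internalizing the self-explanation skill.
for text in ["The movie was wonderful.", "The movie was dull."]:
    target = whitebox_label(text, probe)
    prompt = f"{text}\nWhat does your internal sentiment probe say? Answer:"
    full = tok(prompt + " " + target, return_tensors="pt")
    labels = full.input_ids.clone()
    # Supervise only the answer tokens, not the prompt.
    prompt_len = tok(prompt, return_tensors="pt").input_ids.shape[1]
    labels[:, :prompt_len] = -100
    loss = model(**full, labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```

The design choice worth noting is masking the prompt tokens with -100, so the loss rewards the model only for reporting the probe's output, not for reproducing the prompt; in practice the whitebox target would come from stronger methods than a single linear probe.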