Make AI solve it
Weak-to-strong generalization
4 papers. Use weaker models to supervise and provide a feedback signal to stronger models.
Supervising AIs improving AIs
8 papers. Build formal and empirical frameworks in which AIs supervise other (stronger) AI systems via structured interactions; construct monitoring tools that enable scalable tracking of behavioural drift, benchmarks for self-modification, and robustness guarantees.
AI explanations of AIs
5 papers. Make open AI tools to explain AIs, including AI agents. Examples: automatic feature descriptions for neuron activation patterns, an interface for steering these features, and a behaviour elicitation agent that "searches" for a specified behaviour in frontier models.
Debate
6 papers. In the limit, it's easier to compellingly argue for true claims than for false claims; exploit this asymmetry to get trusted work out of untrusted debaters.
LLM introspection training
2 papers. Train LLMs to predict the outputs of high-quality whitebox methods, to induce general self-explanation skills that use their own 'introspective' access.