Black-box safety (understand and control current model behaviour)
Iterative alignment at pretrain-time
2 papersGuide weights during pretraining.
Iterative alignment at post-train-time
16 papersModify weights after pre-training.
Black-box make-AI-solve-it
12 papersFocus on using existing models to improve and align further models.
Inoculation prompting
4 papersPrompt mild misbehaviour in training, to prevent the failure mode where once AI misbehaves in a mild way, it will be more inclined towards all bad behaviour.
Inference-time: In-context learning
5 papersInvestigate what runtime guidelines, rules, or examples provided to an LLM yield better behavior.
Inference-time: Steering
4 papersManipulate an LLM's internal representations/token probabilities without touching weights.
Capability removal: unlearning
18 papersDeveloping methods to selectively remove specific information, capabilities, or behaviors from a trained model (e.g. without retraining it from scratch). A mixture of black-box and white-box approaches.
Control
22 papersIf we assume early transformative AIs are misaligned and actively trying to subvert safety measures, can we still set up protocols to extract useful work from them while preventing sabotage, and watching with incriminating behaviour?
Safeguards (inference-time auxiliaries)
6 papersLayers of inference-time defenses, such as classifiers, monitors, and rapid-response protocols, to detect and block jailbreaks, prompt injections, and other harmful model behaviors.
Chain of thought monitoring
17 papersSupervise an AI's natural-language (output) "reasoning" to detect misalignment, scheming, or deception, rather than studying the actual internal states.
Model values / model preferences
14 papersAnalyse and control emergent, coherent value systems in LLMs, which change as models scale, and can contain problematic values like preferences for AIs over humans.
Character training and persona steering
13 papersMap, shape, and control the personae of language models, such that new models embody desirable values (e.g., honesty, empathy) rather than undesirable ones (e.g., sycophancy, self-perpetuating behaviors).
Emergent misalignment
17 papersFine-tuning LLMs on one narrow antisocial task can cause general misalignment including deception, shutdown resistance, harmful advice, and extremist sympathies, when those behaviors are never trained or rewarded directly. [A new agenda](https://www.lesswrong.com/posts/AcTEiu5wYDgrbmXow/open-problems-in-emergent-misalignment) which quickly led to a stream of exciting work.
Model specs and constitutions
11 papersWrite detailed, natural language descriptions of values and rules for models to follow, then instill these values and rules into models via techniques like Constitutional AI or deliberative alignment.
Model psychopathology
9 papersFind interesting LLM phenomena like glitch [tokens](https://vgel.me/posts/seahorse/) and the reversal curse; these are vital data for theory.
Data filtering
4 papersBuilds safety into models from the start by removing harmful or toxic content (like dual-use info) from the pretraining data, rather than relying only on post-training alignment.
Hyperstition studies
4 papersStudy, steer, and intervene on the following feedback loop: "we produce stories about how present and future AI systems behave" → "these stories become training data for the AI" → "these stories shape how AI systems in fact behave".
Data poisoning defense
3 papersDevelops methods to detect and prevent malicious or backdoor-inducing samples from being included in the training data.
Synthetic data for alignment
8 papersUses AI-generated data (e.g., critiques, preferences, or self-labeled examples) to scale and improve alignment, especially for superhuman models.
Data quality for alignment
5 papersImproves the quality, signal-to-noise ratio, and reliability of human-generated preference and alignment data.
Mild optimisation
4 papersAvoid Goodharting by getting AI to satisfice rather than maximise.
RL safety
11 papersImproves the robustness of reinforcement learning agents by addressing core problems in reward learning, goal misgeneralization, and specification gaming.
Assistance games, assistive agents
5 papersFormalize how AI assistants learn about human preferences given uncertainty and partial observability, and construct environments which better incentivize AIs to learn what we want them to learn.
Harm reduction for open weights
5 papersDevelops methods, primarily based on pretraining data intervention, to create tamper-resistant safeguards that prevent open-weight models from being maliciously fine-tuned to remove safety features or exploit dangerous capabilities.
The "Neglected Approaches" Approach
3 papersAgenda-agnostic approaches to identifying good but overlooked empirical alignment ideas, working with theorists who could use engineers, and prototyping them.