Broad Approaches
The broad methods used across agendas. Many agendas combine several of these approaches.
Inspired by: Defining Alignment Research
Engineering
17 agendas. Practical, implementation-focused approaches that build systems and tools to make AI safer. Emphasizes empirical testing, iterative development, and scalable solutions.
Agendas using this approach
Iterative alignment at pretrain-time
Iterative alignment at post-train-time
Black-box make-AI-solve-it
Inoculation prompting
Inference-time: In-context learning
Inference-time: Steering
Safeguards (inference-time auxiliaries)
Chain of thought monitoring
Model specs and constitutions
Data filtering
Data poisoning defense
Synthetic data for alignment
Data quality for alignment
RL safety
Harm reduction for open weights
The "Neglected Approaches" Approach
Weak-to-strong generalization
Behavioral
15 agendas. Approaches focused on observable AI behavior and outputs rather than internal mechanisms. Includes techniques like RLHF, red-teaming, and behavioral testing.
Agendas using this approach
Emergent misalignment
Data attribution
Supervising AIs improving AIs
Aligning to context
Aligned to who?
AGI metrics
Capability evals
Autonomy evals
WMD evals (Weapons of Mass Destruction)
Situational awareness and self-awareness evals
Steganography evals
Sandbagging evals
Self-replication evals
Various Redteams
Other evals
Cognitive
25 agendas. Approaches that model or analyze the internal reasoning, representations, and decision-making processes of AI systems. Includes interpretability and understanding how models "think."
Agendas using this approach
Model values / model preferences
Character training and persona steering
Hyperstition studies
Mild optimisation
Reverse engineering
Extracting latent knowledge
Lie and deception detectors
Model diffing
Causal Abstractions
Pragmatic interpretability
Learning dynamics and developmental interpretability
Representation structure and geometry
Human inductive biases
Monitoring concepts
Scientist AI
Brainlike-AGI Safety
AI explanations of AIs
LLM introspection training
Agent foundations
Tiling agents
Asymptotic guarantees
Natural abstractions
The Learning-Theoretic Agenda
Aligning to the social contract
Theory for aligning multiple AIs