Broad Approaches
The broad methods used across agendas. Many agendas combine several of these approaches.
Inspired by: Defining Alignment Research
Engineering
17 agendas. Practical, implementation-focused approaches that build systems and tools to make AI safer. Emphasizes empirical testing, iterative development, and scalable solutions.
Agendas using this approach
Iterative alignment at pretrain-time
Iterative alignment at post-train-time
Black-box make-AI-solve-it
Inoculation prompting
Inference-time: In-context learning
Inference-time: Steering
Safeguards (inference-time auxiliaries)
Chain of thought monitoring
Model specs and constitutions
Data filtering
Data poisoning defense
Synthetic data for alignment
Data quality for alignment
RL safety
Harm reduction for open weights
The "Neglected Approaches" Approach
Weak-to-strong generalization
Behavioral
15 agendas. Approaches focused on observable AI behavior and outputs rather than internal mechanisms. Includes techniques like RLHF, red-teaming, and behavioral testing.
Agendas using this approach
Emergent misalignment
Data attribution
Supervising AIs improving AIs
Aligning to context
Aligned to who?
AGI metrics
Capability evals
Autonomy evals
WMD evals (Weapons of Mass Destruction)
Situational awareness and self-awareness evals
Steganography evals
Sandbagging evals
Self-replication evals
Various Redteams
Other evals
Cognitive
25 agendas. Approaches that model or analyze the internal reasoning, representations, and decision-making processes of AI systems. Includes interpretability and understanding how models "think."
Agendas using this approach
Model values / model preferences
Character training and persona steering
Hyperstition studies
Mild optimisation
Reverse engineering
Extracting latent knowledge
Lie and deception detectors
Model diffing
Causal Abstractions
Pragmatic interpretability
Learning dynamics and developmental interpretability
Representation structure and geometry
Human inductive biases
Monitoring concepts
Scientist AI
Brainlike-AGI Safety
AI explanations of AIs
LLM introspection training
Agent foundations
Tiling agents
Asymptotic guarantees
Natural abstractions
The Learning-Theoretic Agenda
Aligning to the social contract
Theory for aligning multiple AIs