Shallow Review of Technical AI Safety, 2025

Broad Approaches

The rough methods used across agendas. Many agendas combine several of these approaches.

Inspired by: Defining Alignment Research

Engineering

17 agendas

Practical, implementation-focused approaches that build systems and tools to make AI safer. They emphasize empirical testing, iterative development, and scalable solutions.

Behavioral

15 agendas

Approaches focused on observable AI behavior and outputs rather than internal mechanisms. They include techniques like RLHF, red-teaming, and behavioral testing.

Cognitive

25 agendas

Approaches that model or analyze the internal reasoning, representations, and decision-making processes of AI systems. They include interpretability and other work on understanding how models "think."