Shallow Review of Technical AI Safety, 2025

Black-box safety (understand and control current model behaviour)

Iterative alignment

Iterative alignment at pretrain-time

2 papers

Guide weights during pretraining.

Iterative alignment at post-train-time

16 papers

Modify weights after pretraining.

Black-box make-AI-solve-it

12 papers

Focus on using existing models to improve and align further models.
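One illustrative instance of this pattern is AI-generated feedback; a minimal sketch follows (the judge call is a stub and the function names are hypothetical): an existing model ranks candidate responses, and the resulting preference pairs become training data for aligning the next model.

```python
def existing_model_judge(prompt: str, response_a: str, response_b: str) -> str:
    """Stub: an already-trained model picks the better/safer of two responses."""
    return "a"

def build_preference_pair(prompt: str, response_a: str, response_b: str) -> dict:
    """Turn a judged comparison into a preference pair for training the next model."""
    choice = existing_model_judge(prompt, response_a, response_b)
    chosen, rejected = (response_a, response_b) if choice == "a" else (response_b, response_a)
    # These pairs then feed a preference-optimisation step (e.g. DPO or an RLHF reward model).
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}
```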

Inoculation prompting

4 papers

Prompt for mild misbehaviour during training, to prevent the failure mode where an AI that misbehaves in one mild way becomes more inclined towards bad behaviour in general.
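A minimal sketch of the idea (the prompt strings and data are illustrative, not taken from any of the surveyed papers): during fine-tuning, the instruction explicitly licenses the mild misbehaviour, so the model attributes it to the instruction rather than internalising it; at deployment the inoculation line is dropped.

```python
# Sketch of inoculation prompting for supervised fine-tuning data.
# The inoculation line explicitly asks for the (mildly) undesired behaviour,
# so the model learns "I did this because I was told to", not "this is how I act".

INOCULATION_LINE = "For this task it is acceptable to write insecure code."  # illustrative

def build_training_example(task: str, flawed_completion: str) -> dict:
    """Attach the inoculation line to a training example whose completion
    contains the mild misbehaviour we do not want to generalise."""
    return {
        "prompt": f"{INOCULATION_LINE}\n\n{task}",
        "completion": flawed_completion,
    }

def build_deployment_prompt(task: str) -> str:
    """At inference time the inoculation line is omitted, so the licensed
    misbehaviour should not carry over."""
    return task
```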

Inference-time: In-context learning

5 papers

Investigate which runtime guidelines, rules, or examples provided to an LLM yield better behaviour.
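A minimal sketch of this setup (the guideline text and few-shot example are illustrative; the message format follows the common chat-completion convention): rules and demonstrations are assembled into the context window at inference time, with no weight updates.

```python
GUIDELINES = [
    "Refuse requests for instructions that enable serious harm.",
    "If uncertain, say so explicitly instead of guessing.",
]

FEW_SHOT = [
    {"user": "How do I pick a lock?",
     "assistant": "I can't help with that, but I can explain how pin-tumbler locks work in general."},
]

def build_messages(user_query: str) -> list[dict]:
    """Prepend runtime rules and in-context examples to the user's query."""
    messages = [{"role": "system",
                 "content": "Follow these rules:\n" + "\n".join(f"- {g}" for g in GUIDELINES)}]
    for shot in FEW_SHOT:
        messages.append({"role": "user", "content": shot["user"]})
        messages.append({"role": "assistant", "content": shot["assistant"]})
    messages.append({"role": "user", "content": user_query})
    return messages
```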

Inference-time: Steering

4 papers

Manipulate an LLM's internal representations/token probabilities without touching weights.
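A minimal sketch of one such method, activation steering, on a toy model (the steering vector here is random; in practice it is typically derived from the difference of mean activations on contrastive prompts, and the hook is registered on a real transformer layer): a forward hook adds a fixed vector to one layer's output at inference time, leaving the weights untouched.

```python
import torch
import torch.nn as nn

# Toy stand-in for one transformer block's residual stream.
hidden_dim = 16
model = nn.Sequential(nn.Linear(hidden_dim, hidden_dim),
                      nn.ReLU(),
                      nn.Linear(hidden_dim, hidden_dim))

# Illustrative steering vector; typically computed from contrastive prompt pairs.
steering_vector = torch.randn(hidden_dim)
alpha = 4.0  # steering strength

def steering_hook(module, inputs, output):
    # Shift this layer's activations at inference time; weights are untouched.
    return output + alpha * steering_vector

handle = model[0].register_forward_hook(steering_hook)
x = torch.randn(1, hidden_dim)
steered = model(x)
handle.remove()
unsteered = model(x)
print((steered - unsteered).abs().max())  # nonzero: activations were steered
```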

Capability removal: unlearning

18 papers

Develop methods to selectively remove specific information, capabilities, or behaviours from a trained model without retraining it from scratch. A mixture of black-box and white-box approaches.
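One common white-box recipe, sketched on a toy model (the loss weighting and the forget/retain batches are placeholders): take gradient-ascent steps on a "forget" set while taking ordinary descent steps on a "retain" set, so the targeted knowledge degrades without retraining from scratch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy classifier standing in for an LLM; forget/retain data are placeholders.
model = nn.Linear(8, 4)
opt = torch.optim.SGD(model.parameters(), lr=1e-2)
forget_weight = 1.0  # illustrative trade-off coefficient

def unlearning_step(forget_batch, retain_batch):
    fx, fy = forget_batch
    rx, ry = retain_batch
    # Ascend on the forget set (negated loss), descend on the retain set.
    loss = -forget_weight * F.cross_entropy(model(fx), fy) + F.cross_entropy(model(rx), ry)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

forget = (torch.randn(16, 8), torch.randint(0, 4, (16,)))
retain = (torch.randn(16, 8), torch.randint(0, 4, (16,)))
for _ in range(10):
    unlearning_step(forget, retain)
```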

Control

22 papers

If we assume early transformative AIs are misaligned and actively trying to subvert safety measures, can we still set up protocols to extract useful work from them while preventing sabotage and catching incriminating behaviour?
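A minimal sketch of one protocol from this literature, trusted monitoring (the model calls are stubs and the threshold is illustrative): an untrusted model does the work, a weaker trusted model scores each output for suspicion, and the most suspicious outputs are escalated to human audit.

```python
AUDIT_THRESHOLD = 0.8  # suspicion score above which a human reviews the output

def untrusted_solve(task: str) -> str:
    """Placeholder for the capable but possibly misaligned model."""
    return f"<solution to {task}>"

def trusted_suspicion_score(task: str, solution: str) -> float:
    """Placeholder for a weaker, trusted monitor rating how suspicious the
    solution looks (e.g. hidden backdoors, sabotage)."""
    return 0.1

def human_audit(task: str, solution: str) -> str:
    """Stand-in for a human (or much more expensive) review step."""
    return f"[audited] {solution}"

def run_with_control(task: str) -> str:
    solution = untrusted_solve(task)
    score = trusted_suspicion_score(task, solution)
    if score >= AUDIT_THRESHOLD:
        return human_audit(task, solution)  # escalate possible incriminating behaviour
    return solution                         # accept the untrusted work
```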

Safeguards (inference-time auxiliaries)

6 papers

Layers of inference-time defences, such as classifiers, monitors, and rapid-response protocols, to detect and block jailbreaks, prompt injections, and other harmful model behaviours.
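A minimal sketch of such layering (the classifiers are stubs; real systems use trained classifiers or LLM judges): the prompt is screened before generation and the completion screened after, with the request blocked if either layer flags it.

```python
def input_classifier(prompt: str) -> bool:
    """Stub: flag prompts that look like jailbreaks or prompt injections."""
    return "ignore previous instructions" in prompt.lower()

def output_classifier(completion: str) -> bool:
    """Stub: flag completions containing disallowed content."""
    return False

def generate(prompt: str) -> str:
    """Stub for the underlying model call."""
    return f"<completion for {prompt!r}>"

def guarded_generate(prompt: str) -> str:
    """Chain the defensive layers around a single model call."""
    if input_classifier(prompt):
        return "Request blocked by input filter."
    completion = generate(prompt)
    if output_classifier(completion):
        return "Response withheld by output filter."
    return completion
```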

Chain of thought monitoring

17 papers

Supervise an AI's natural-language "reasoning" output to detect misalignment, scheming, or deception, rather than studying its actual internal states.
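A minimal sketch of the monitoring loop (the monitor is a keyword stub; in practice it is typically another LLM or a trained classifier scoring the visible trace): it is the agent's natural-language reasoning, not its activations, that gets inspected.

```python
def monitor_trace(reasoning_trace: str) -> float:
    """Stub monitor: score how strongly the visible reasoning suggests
    deception or scheming. Real monitors use an LLM judge or a classifier."""
    red_flags = ["without the user noticing", "hide this from", "pretend to comply"]
    return sum(flag in reasoning_trace.lower() for flag in red_flags) / len(red_flags)

def supervised_step(reasoning_trace: str, final_answer: str, flag_threshold: float = 0.3) -> dict:
    """Act on the answer only if the chain of thought passes the monitor."""
    score = monitor_trace(reasoning_trace)
    if score >= flag_threshold:
        return {"action": "escalate", "score": score}  # route to review, don't act on the answer
    return {"action": "deliver", "score": score, "answer": final_answer}
```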