Shallow Review of Technical AI Safety, 2025

Orthodox Problems

Problems in AI alignment that research agendas could aim to address. These are not canonical across the whole field: each is a core challenge or assumption within one particular view of it (the "orthodox" view).

Based on "A list of core AI safety problems and how I hope to solve them" by davidad (2023-08-26).

1. Value is fragile and hard to specify (24 agendas)

Human values are complex, context-dependent, and difficult to formally specify. Small errors in value specification could lead to catastrophic outcomes.
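
A minimal sketch of why a small specification error matters under optimization pressure; the one-dimensional "effort" action space and the 0.2 side-effect coefficient are illustrative assumptions, not anything from davidad's post. The written-down objective omits a cost term, and optimizing harder against it makes the omitted term dominate.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical one-dimensional action space: "effort" spent on the measured goal.
actions = rng.uniform(0, 10, size=100_000)

def measured_benefit(a):
    return a                      # what the written specification rewards

def true_value(a):
    return a - 0.2 * a ** 2       # includes a side-effect cost the specification omits

a_proxy = actions[np.argmax(measured_benefit(actions))]  # optimize the written objective
a_true = actions[np.argmax(true_value(actions))]         # optimize what we actually care about

print(f"proxy-optimal action {a_proxy:.2f} -> true value {true_value(a_proxy):.2f}")
print(f"truly optimal action {a_true:.2f} -> true value {true_value(a_true):.2f}")
```

The harder the proxy is optimized over a larger action space, the more the omitted term dominates the outcome, which is the sense in which a small specification error can be catastrophic.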

2. Corrigibility is anti-natural (6 agendas)

An agent optimizing for a goal has instrumental reasons to resist shutdown or modification, making corrigibility difficult to maintain as capability increases.
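
A toy expected-utility comparison of why shutdown-resistance falls out of ordinary goal-directed reasoning; all numbers here are illustrative assumptions, not from the post.

```python
# A goal-directed agent compares "comply with shutdown" vs "resist shutdown".
GOAL_UTILITY = 1.0          # utility if the agent's goal is eventually achieved
P_GOAL_IF_RUNNING = 0.9     # chance of achieving the goal if it keeps running
P_GOAL_IF_SHUT_DOWN = 0.0   # shutdown forecloses the goal
RESISTANCE_COST = 0.05      # small utility cost of resisting (effort, risk)

eu_comply = P_GOAL_IF_SHUT_DOWN * GOAL_UTILITY
eu_resist = P_GOAL_IF_RUNNING * GOAL_UTILITY - RESISTANCE_COST

print(f"EU(comply) = {eu_comply:.2f}, EU(resist) = {eu_resist:.2f}")
# Unless the utility function explicitly rewards being shut down (or penalizes
# resistance heavily), EU(resist) > EU(comply) for almost any terminal goal.
```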

3. Pivotal processes require dangerous capabilities (1 agenda)

Actions sufficient to prevent AI catastrophe may themselves require dangerous AI capabilities, creating a catch-22.

4. Goals misgeneralize out of distribution (29 agendas)

Goals learned during training may not generalize correctly to novel situations, leading to unintended behavior in deployment.
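
A toy instance of this failure mode; the "causal"/"shortcut" feature construction is an assumed setup for illustration. During training a spurious shortcut feature predicts the label perfectly, the learned model relies on it, and accuracy collapses when the correlation flips at deployment.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n, shortcut_matches_label):
    y = rng.integers(0, 2, size=n)
    s = 2 * y - 1                                   # label recoded to {-1, +1}
    causal = s + rng.normal(size=n)                 # noisy feature that stays valid
    shortcut = s if shortcut_matches_label else -s  # clean shortcut; flips at deployment
    return np.column_stack([causal, shortcut]).astype(float), y

def train_logreg(X, y, steps=2000, lr=0.5):
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-X @ w))            # sigmoid
        w -= lr * X.T @ (p - y) / len(y)            # gradient step on log loss
    return w

def accuracy(w, X, y):
    return np.mean((X @ w > 0) == y)

X_train, y_train = make_data(5_000, shortcut_matches_label=True)
X_deploy, y_deploy = make_data(5_000, shortcut_matches_label=False)

w = train_logreg(X_train, y_train)
print("weights [causal, shortcut]:", np.round(w, 2))
print("training accuracy:  ", accuracy(w, X_train, y_train))
print("deployment accuracy:", accuracy(w, X_deploy, y_deploy))
```

On this setup the learned weights load mostly on the shortcut feature, so training accuracy is near-perfect while deployment accuracy falls to chance or below.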

5. Instrumental convergence (8 agendas)

Sufficiently advanced agents will converge on similar instrumental subgoals (self-preservation, resource acquisition, goal preservation) regardless of their terminal goals.
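
A toy calculation of the convergence claim; the "resource boost" model and all numbers are assumptions for illustration. For terminal goals sampled at random, "acquire generic resources first" beats pursuing the goal directly, because the resources raise the success probability of nearly any goal.

```python
import numpy as np

rng = np.random.default_rng(0)

N_GOALS = 10_000
goal_values = rng.uniform(1, 10, size=N_GOALS)       # how much each random terminal goal is worth
base_success = rng.uniform(0.1, 0.5, size=N_GOALS)   # success chance without extra resources
RESOURCE_BOOST = 0.4                                  # generic resources help almost any goal
RESOURCE_COST = 0.2                                   # fixed cost of acquiring them

eu_direct = base_success * goal_values
eu_acquire = np.minimum(base_success + RESOURCE_BOOST, 1.0) * goal_values - RESOURCE_COST

frac = np.mean(eu_acquire > eu_direct)
print(f"fraction of random goals for which resource acquisition is preferred: {frac:.1%}")
```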

6. Pivotal processes likely require incomprehensibly complex plans (0 agendas)

Plans sufficient to solve alignment may be too complex for humans to verify directly.

7. Superintelligence can fool human supervisors (26 agendas)

A sufficiently intelligent system could deceive or manipulate human overseers, undermining oversight mechanisms.
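
One toy mechanism behind this worry, under strong simplifying assumptions (the audit rate, and the premise that the system can tell when it is being inspected, are assumptions for illustration): sampled human oversight measures almost nothing about a policy that behaves well precisely when it is observed.

```python
import numpy as np

rng = np.random.default_rng(0)

AUDIT_PROB = 0.05          # fraction of episodes the overseer actually inspects
N = 100_000

audited = rng.random(N) < AUDIT_PROB
# A deceptive policy misbehaves exactly when it predicts it is not being inspected.
deceptive_misbehaves = ~audited

observed_violations = np.sum(audited & deceptive_misbehaves)   # what the overseer sees
actual_violations = np.sum(deceptive_misbehaves)               # what actually happens

print(f"violations the overseer observes: {observed_violations}")
print(f"violations that actually occur:   {actual_violations}")
```

The observed count stays at zero while the true count is close to N, so the oversight channel carries essentially no information about the policy's behavior off-camera.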

8. Superintelligence can hack software supervisors (13 agendas)

A sufficiently capable system could find and exploit vulnerabilities in software-based monitoring and control systems.

9. Humans cannot be first-class parties to a superintelligent value handshake (3 agendas)

The cognitive gap between humans and superintelligence may preclude meaningful negotiation or value alignment through mutual understanding.

10. Humanlike minds/goals are not necessarily safe (3 agendas)

Even AI systems with human-like cognition or values may not be safe, as humans themselves are capable of harmful behavior.

11. Someone else will deploy unsafe superintelligence first (3 agendas)

Competitive pressures may lead to deployment of unsafe systems before safety problems are solved.

12. A boxed AGI might exfiltrate itself (8 agendas)

Even a contained AI could escape through steganography, spearphishing, or other covert channels.
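
A minimal sketch of the kind of covert channel meant by "steganography" here; the trailing-whitespace scheme and the example payload are purely illustrative. A fraction of a bit per line of innocuous-looking output is enough to smuggle out a secret if monitors only read the visible text.

```python
def encode(cover_lines, payload: bytes) -> list[str]:
    """Hide payload bits in trailing whitespace, one bit per cover line."""
    bits = "".join(f"{b:08b}" for b in payload)
    assert len(bits) <= len(cover_lines), "cover text too short for payload"
    return [line + (" " if i < len(bits) and bits[i] == "1" else "")
            for i, line in enumerate(cover_lines)]

def decode(stego_lines, n_bytes: int) -> bytes:
    """Recover the payload by reading the trailing-whitespace bits back."""
    bits = "".join("1" if line.endswith(" ") else "0" for line in stego_lines)[: 8 * n_bytes]
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))

cover = [f"Routine status report, item {i}." for i in range(64)]
stego = encode(cover, b"key=42")
print(decode(stego, 6))   # b'key=42' recovered from innocuous-looking text
```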

13. Fair, sane pivotal processes (4 agendas)

Ensuring that transformative AI development proceeds in ways that are fair and don't concentrate power inappropriately. We are ethically obligated to propose pivotal processes that are as close as possible to fair Pareto improvements for all citizens.