Orthodox Problems
Problems in AI alignment that research agendas could aim to address. Each problem is a core challenge or assumption within one particular view of the field (the "orthodox" view), rather than a field-wide canon.
Based on "A list of core AI safety problems and how I hope to solve them" by davidad (2023-08-26).
Value is fragile and hard to specify (24 agendas)
Human values are complex, context-dependent, and difficult to formally specify. Small errors in value specification could lead to catastrophic outcomes.
Corrigibility is anti-natural (6 agendas)
An agent optimizing for a goal has instrumental reasons to resist shutdown or modification, making corrigibility difficult to maintain as capability increases.
Pivotal processes require dangerous capabilities (1 agenda)
Actions sufficient to prevent AI catastrophe may themselves require dangerous AI capabilities, creating a catch-22.
Goals misgeneralize out of distribution (29 agendas)
Goals learned during training may not generalize correctly to novel situations, leading to unintended behavior in deployment.
Instrumental convergence (8 agendas)
Sufficiently advanced agents will converge on similar instrumental subgoals (self-preservation, resource acquisition, goal preservation) regardless of their terminal goals.
Pivotal processes likely require incomprehensibly complex plans (0 agendas)
Plans sufficient to enact a pivotal process may be too complex for humans to verify directly.
Superintelligence can fool human supervisors (26 agendas)
A sufficiently intelligent system could deceive or manipulate human overseers, undermining oversight mechanisms.
Superintelligence can hack software supervisors (13 agendas)
A sufficiently capable system could find and exploit vulnerabilities in software-based monitoring and control systems.
Humans cannot be first-class parties to a superintelligent value handshake (3 agendas)
The cognitive gap between humans and superintelligence may preclude meaningful negotiation or value alignment through mutual understanding.
Humanlike minds/goals are not necessarily safe (3 agendas)
Even AI systems with human-like cognition or values may not be safe, as humans themselves are capable of harmful behavior.
Someone else will deploy unsafe superintelligence first (3 agendas)
Competitive pressures may lead to deployment of unsafe systems before safety problems are solved.
A boxed AGI might exfiltrate itself (8 agendas)
Even a contained AI could escape through steganography, spearphishing, or other covert channels.
Fair, sane pivotal processes (4 agendas)
Ensuring that transformative AI development proceeds in ways that are fair and do not concentrate power inappropriately. We are ethically obligated to propose pivotal processes that are as close as possible to fair Pareto improvements for all citizens.