Shallow Review of Technical AI Safety, 2025

Tiling agents

An aligned agentic system that modifies itself into an unaligned system would be bad. Research here aims to characterize the ways this could occur and to develop infrastructure and approaches that prevent it.
Theory of Change: Build a sufficient theoretical basis, through various approaches, so that the AI systems we create are capable of self-modification while preserving their goals.
General Approach: Cognitive
Target Case: Worst Case
Some names: Abram Demski
Estimated FTEs: 1-10
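The core tiling condition can be illustrated with a toy sketch: an agent accepts a proposed self-modification only if it can verify that the successor preserves its goal. Everything here (the `goal` invariant, the policies, the empirical check standing in for formal proof) is a hypothetical illustration, not an implementation from the tiling-agents literature.

```python
# Toy "tiling" guard: accept a successor policy only if it verifiably
# preserves the agent's goal. Formal proof is replaced by an empirical
# check over test states, purely for illustration.

def goal(state: int) -> bool:
    """The invariant the agent cares about (hypothetical example)."""
    return state >= 0

def current_policy(state: int) -> int:
    return state + 1  # preserves non-negativity

def drifted_policy(state: int) -> int:
    return state - 5  # can violate the goal

def accept_successor(policy, test_states) -> bool:
    """Accept only if the successor preserves the goal on every
    test state that already satisfies it."""
    return all(goal(policy(s)) for s in test_states if goal(s))

test_states = range(10)
assert accept_successor(current_policy, test_states)      # safe rewrite accepted
assert not accept_successor(drifted_policy, test_states)  # drifted rewrite rejected
```

The real research problem is much harder than this sketch suggests: a proof-based version of `accept_successor` runs into Löbian obstacles when the successor must be trusted by the same formal system that verifies it.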