Shallow Review of Technical AI Safety, 2025

Model psychopathology

Find interesting LLM phenomena like glitch tokens and the reversal curse; these are vital data for theory.
Theory of Change: The study of 'pathological' phenomena in LLMs is potentially key for theoretically modelling LLM cognition and LLM training dynamics (compare: the study of aphasia and visual processing disorder in humans plays a key role in cognitive science), and in particular for developing a good theory of generalization in LLMs.
Target Case: Pessimistic
See Also: Emergent misalignment, mechanistic anomaly detection
Some names: Truthful AI, Theia Vogel, Stewart Slocum, Nell Watson, Samuel G. B. Johnson, Liwei Jiang, Monika Jotautaite, Saloni Dash
Estimated FTEs: 5-20
Outputs:
Subliminal Learning: Language models transmit behavioral traits via hidden signals in data. Alex Cloud, Minh Le, James Chua, Jan Betley, Anna Sztyber-Betley, Jacob Hilton, Samuel Marks, Owain Evans
LLMs Can Get "Brain Rot"! Shuo Xing, Junyuan Hong, Yifan Wang, Runjin Chen, Zhenyu Zhang, Ananth Grama, Zhengzhong Tu, Zhangyang Wang
Persona-Assigned Large Language Models Exhibit Human-Like Motivated Reasoning. Saloni Dash, Amélie Reymond, Emma S. Spiro, Aylin Caliskan
Unified Multimodal Models Cannot Describe Images From Memory. Michael Aerni, Joshua Swanson, Kristina Nikolić, Florian Tramèr
Believe It or Not: How Deeply do LLMs Believe Implanted Facts? Stewart Slocum, Julian Minder, Clément Dumas, Henry Sleight, Ryan Greenblatt, Samuel Marks, Rowan Wang
Imagining and building wise machines: The centrality of AI metacognition. Samuel G. B. Johnson, Amir-Hossein Karimi, Yoshua Bengio, Nick Chater, Tobias Gerstenberg, Kate Larson, Sydney Levine, Melanie Mitchell, Iyad Rahwan, Bernhard Schölkopf, Igor Grossmann
Artificial Hivemind: The Open-Ended Homogeneity of Language Models (and Beyond). Liwei Jiang, Yuanjun Chai, Margaret Li, Mickel Liu, Raymond Fok, Nouha Dziri, Yulia Tsvetkov, Maarten Sap, Alon Albalak, Yejin Choi
Beyond One-Way Influence: Bidirectional Opinion Dynamics in Multi-Turn Human-LLM Interactions. Yuyang Jiang, Longjie Guo, Yuchen Wu, Aylin Caliskan, Tanu Mitra, Hua Shen