Model psychopathology
Find interesting LLM phenomena like glitch tokens and the reversal curse; these are vital data for theory.
Theory of Change: The study of 'pathological' phenomena in LLMs is potentially key for theoretically modelling LLM cognition and LLM training dynamics (compare: the study of aphasia and visual processing disorders in humans plays a key role in cognitive science), and in particular for developing a good theory of generalization in LLMs.
Target Case: Pessimistic
Orthodox Problems:
See Also: Emergent misalignment, mechanistic anomaly detection
Some names: Truthful AI, Theia Vogel, Stewart Slocum, Nell Watson, Samuel G. B. Johnson, Liwei Jiang, Monika Jotautaite, Saloni Dash
Estimated FTEs: 5-20
Outputs:
Subliminal Learning: Language models transmit behavioral traits via hidden signals in data— Alex Cloud, Minh Le, James Chua, Jan Betley, Anna Sztyber-Betley, Jacob Hilton, Samuel Marks, Owain Evans
LLMs Can Get "Brain Rot"!— Shuo Xing, Junyuan Hong, Yifan Wang, Runjin Chen, Zhenyu Zhang, Ananth Grama, Zhengzhong Tu, Zhangyang Wang
Persona-Assigned Large Language Models Exhibit Human-Like Motivated Reasoning— Saloni Dash, Amélie Reymond, Emma S. Spiro, Aylin Caliskan
Unified Multimodal Models Cannot Describe Images From Memory— Michael Aerni, Joshua Swanson, Kristina Nikolić, Florian Tramèr
Believe It or Not: How Deeply do LLMs Believe Implanted Facts?— Stewart Slocum, Julian Minder, Clément Dumas, Henry Sleight, Ryan Greenblatt, Samuel Marks, Rowan Wang
Psychopathia Machinalis: A Nosological Framework for Understanding Pathologies in Advanced Artificial Intelligence— Nell Watson, Ali Hessami
Imagining and building wise machines: The centrality of AI metacognition— Samuel G. B. Johnson, Amir-Hossein Karimi, Yoshua Bengio, Nick Chater, Tobias Gerstenberg, Kate Larson, Sydney Levine, Melanie Mitchell, Iyad Rahwan, Bernhard Schölkopf, Igor Grossmann
Artificial Hivemind: The Open-Ended Homogeneity of Language Models (and Beyond)— Liwei Jiang, Yuanjun Chai, Margaret Li, Mickel Liu, Raymond Fok, Nouha Dziri, Yulia Tsvetkov, Maarten Sap, Alon Albalak, Yejin Choi
Beyond One-Way Influence: Bidirectional Opinion Dynamics in Multi-Turn Human-LLM Interactions— Yuyang Jiang, Longjie Guo, Yuchen Wu, Aylin Caliskan, Tanu Mitra, Hua Shen