Toy Models of Superposition

September 21, 2022 · arXiv:2209.10652

Authors

Catherine Olsson, Shauna Kravec, Zac Hatfield-Dodds, Dawn Drain, Sam McCandlish

Abstract

Neural networks often pack many unrelated concepts into a single neuron, a puzzling phenomenon known as 'polysemanticity' that makes interpretability much more challenging. This paper provides a toy model in which polysemanticity can be fully understood, arising as a result of models storing additional sparse features in "superposition." We demonstrate the existence of a phase change, a surprising connection to the geometry of uniform polytopes, and evidence of a link to adversarial examples.

We also discuss potential implications for mechanistic interpretability.
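To make the idea of "superposition" concrete, here is a minimal sketch, not the paper's code. It assumes a toy model of the form ReLU(WᵀWx + b) with five features squeezed into two hidden dimensions. The pentagon-shaped weights, the bias value of -0.35, and the function names are all hand-picked for illustration; nothing here is trained or taken from the paper.

```python
import math

def pentagon_weights(n_features=5):
    """Hypothetical weights: one unit vector per feature, evenly spaced in 2D."""
    return [(math.cos(2 * math.pi * i / n_features),
             math.sin(2 * math.pi * i / n_features))
            for i in range(n_features)]

def reconstruct(x, W, bias):
    """Compute ReLU(W^T W x + b) for a 2-dimensional hidden space."""
    # Compress: the hidden vector is the sum of feature directions weighted by x.
    h = [sum(w[d] * xi for w, xi in zip(W, x)) for d in range(2)]
    # Read out: project the hidden vector back onto each feature direction,
    # add the (assumed) shared bias, and clip negatives with ReLU.
    return [max(0.0, w[0] * h[0] + w[1] * h[1] + bias) for w in W]

W = pentagon_weights()
BIAS = -0.35  # assumed value, chosen to suppress interference from neighbouring directions

# Sparse input (one active feature): only that feature survives the ReLU.
sparse_out = reconstruct([1, 0, 0, 0, 0], W, BIAS)

# Less sparse input (features 0 and 2 active): feature 1, which lies between
# them, picks up enough interference to activate spuriously.
dense_out = reconstruct([1, 0, 1, 0, 0], W, BIAS)

print("sparse:", [round(v, 3) for v in sparse_out])
print("dense: ", [round(v, 3) for v in dense_out])
```

In this sketch the two-dimensional hidden space can hold five features only because inputs are sparse: when a single feature is active it is recovered cleanly, while two simultaneously active features can cause a third to fire by interference.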
