Beyond Scaling Laws: Understanding Transformer Performance with Associative Memory

May 14, 2024 · arXiv:2405.08707

Authors

Bo Bai, Lei Deng, Wei Han, Xueyan Niu

Abstract

Increasing the size of a Transformer does not always lead to enhanced performance. This phenomenon cannot be explained by the empirical scaling laws.

Furthermore, the model's enhanced performance is closely associated with its memorization of the training samples. We present a theoretical framework that sheds light on memorization during the pre-training of transformer-based language models.

We model the behavior of Transformers with associative memories using Hopfield networks, such that each transformer block effectively conducts an approximate nearest-neighbor search. In particular, the energy function in modern continuous Hopfield networks serves as an explanation for the attention mechanism, which we approximate with a distance-based energy function.
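For reference, the energy of a modern continuous Hopfield network (the standard formulation of Ramsauer et al. that the abstract alludes to; the notation below is ours, and the paper's distance-based variant may differ) over a query state \( \xi \) with stored patterns \( X = (x_1, \dots, x_N) \) is

\[
E(\xi) = -\frac{1}{\beta} \log \sum_{i=1}^{N} \exp\!\left(\beta\, x_i^{\top} \xi\right) + \frac{1}{2}\, \xi^{\top} \xi + \mathrm{const},
\]

and a single minimization step of this energy yields the update \( \xi^{\mathrm{new}} = X\, \mathrm{softmax}(\beta X^{\top} \xi) \), i.e., exactly one attention lookup with query \( \xi \) and keys/values given by the stored patterns.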

By observing that the softmax function corresponds to the gradient of the LogSumExp function in the energy, and employing the majorization-minimization technique, we construct a global energy function designed to capture the layered architecture. We demonstrate a dependency between the model size and the dataset size for the model to achieve optimal performance, and we show that the achievable cross-entropy loss is bounded from below.
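The LogSumExp/softmax relationship invoked above is elementary and can be checked directly. The snippet below is an illustrative sketch (our own, not code from the paper) comparing a finite-difference gradient of the smoothed maximum (1/β)·log Σᵢ exp(β zᵢ) against softmax(β z):

```python
import numpy as np
from scipy.special import logsumexp, softmax

# Illustrative check (not from the paper): the gradient of
# z -> (1/beta) * log sum_i exp(beta * z_i) is softmax(beta * z).
beta = 2.0
rng = np.random.default_rng(0)
z = rng.normal(size=6)

def lse(v):
    # Smoothed maximum: (1/beta) * LogSumExp(beta * v)
    return logsumexp(beta * v) / beta

# Central finite differences approximate the gradient of lse at z.
eps = 1e-6
grad_fd = np.array([
    (lse(z + eps * e) - lse(z - eps * e)) / (2 * eps)
    for e in np.eye(len(z))
])

# Analytic gradient: softmax(beta * z).
grad_softmax = softmax(beta * z)

print(np.allclose(grad_fd, grad_softmax, atol=1e-6))  # expected: True
```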
