Attention Alignment and Flexible Positional Embeddings Improve Transformer Length Extrapolation

November 1, 2023 · 2311.00684

Authors

Ting-Han Fan, Alexander I. Rudnicky, Ta-Chung Chi

Abstract

An ideal length-extrapolatable Transformer language model can handle sequences longer than the training length without any fine-tuning. Such long-context utilization capability relies heavily on a flexible positional embedding design.

Upon investigating the flexibility of existing large pre-trained Transformer language models, we find that the T5 family deserves a closer look, as its positional embeddings capture rich and flexible attention patterns. However, T5 suffers from the dispersed attention issue: the longer the input sequence, the flatter the attention distribution.
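
The dispersed attention issue can be illustrated with a small numerical sketch. It does not use T5 or the authors' code: the attention scores are hypothetical i.i.d. Gaussian logits, and `softmax`, `entropy`, and `dispersion` are names made up for this example. Under that assumption, the softmax over more keys has higher entropy and a smaller peak probability, which is what a flatter distribution means.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D array of attention logits.
    z = x - np.max(x)
    e = np.exp(z)
    return e / e.sum()

def entropy(p):
    # Shannon entropy in nats; the small constant guards against log(0).
    return float(-(p * np.log(p + 1e-12)).sum())

def dispersion(n, seed=0):
    # Hypothetical stand-in for one query's attention scores over n keys.
    # Assumption: logits are i.i.d. standard normal, not real T5 scores.
    rng = np.random.default_rng(seed)
    p = softmax(rng.normal(0.0, 1.0, size=n))
    return entropy(p), float(p.max())

if __name__ == "__main__":
    for n in (128, 512, 2048, 8192):
        h, p_max = dispersion(n)
        print(f"n={n:5d}  entropy={h:.3f}  max_prob={p_max:.5f}")
```

In this toy setting the entropy grows roughly with the logarithm of the sequence length, so a model trained at one length sees a noticeably flatter attention distribution at a longer one.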

To alleviate this issue, we propose two attention alignment strategies based on temperature scaling. They improve T5's long-context utilization on language modeling, retrieval, multi-document question answering, and code completion tasks without any fine-tuning.
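
As a rough illustration of temperature scaling, the sketch below divides the logits by a temperature `tau` before the softmax and searches for the `tau` that brings the entropy at a longer length back to the entropy seen at a shorter one. The abstract does not spell out the two alignment strategies, so entropy matching is an assumption made here for illustration, as are the Gaussian logits, the lengths 512 and 2048, and the helper name `fit_temperature`.

```python
import numpy as np

def softmax(x, temperature=1.0):
    # Softmax with temperature: lower temperature gives a sharper distribution.
    z = x / temperature
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def entropy(p):
    # Shannon entropy in nats; the small constant guards against log(0).
    return float(-(p * np.log(p + 1e-12)).sum())

def fit_temperature(logits, target_entropy, lo=0.05, hi=1.0, steps=50):
    # Bisection on tau in [lo, hi]. Softmax entropy is non-decreasing in
    # temperature, so we shrink the bracket toward the target entropy.
    for _ in range(steps):
        mid = 0.5 * (lo + hi)
        if entropy(softmax(logits, mid)) > target_entropy:
            hi = mid  # still too flat: try a lower temperature
        else:
            lo = mid  # too sharp: try a higher temperature
    return 0.5 * (lo + hi)

# Hypothetical logits for a "training length" and a longer "test length".
rng = np.random.default_rng(0)
short_logits = rng.normal(0.0, 1.0, size=512)
long_logits = rng.normal(0.0, 1.0, size=2048)

target = entropy(softmax(short_logits))      # entropy at the short length
tau = fit_temperature(long_logits, target)   # temperature that matches it

if __name__ == "__main__":
    print("entropy at 512 keys:          ", round(target, 3))
    print("entropy at 2048 keys, tau=1:  ", round(entropy(softmax(long_logits)), 3))
    print("fitted tau:                   ", round(tau, 3))
    print("entropy at 2048 keys, fitted: ", round(entropy(softmax(long_logits, tau)), 3))
```

The fitted temperature comes out below 1, which sharpens the longer sequence's attention without touching any model weights, in keeping with the no-fine-tuning setting described above.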

This suggests that a flexible positional embedding design and attention alignment can go a long way toward Transformer length extrapolation.
