SwarmFormer: Local-Global Hierarchical Attention via Swarmed Token Representations

Standard Transformers rely on self-attention whose cost grows as O(N^2) in the sequence length N, which becomes prohibitive for long sequences. Local or sparse approximations reduce this cost, but they can restrict access to global context.

We propose SwarmFormer, a hierarchical local-global approach that draws inspiration from swarm intelligence. Each layer combines repeated local (swarm-like) token neighbor updates with cluster-based global attention among a smaller set of representatives.

The local aggregator enables decentralized multi-hop propagation, while the cluster-level attention captures global context without full O(N^2) overhead. Experimental results on text classification tasks show that SwarmFormer achieves strong accuracy with up to 90% fewer parameters than baseline Transformers, demonstrating efficient scalability to longer sequences.
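To make the local-global split concrete, here is a minimal sketch in plain Python. It is not the authors' implementation. The abstract does not specify the aggregator, the clustering rule, or how cluster outputs return to tokens, so the choices below are assumptions for illustration: a three-token neighbor average as the local update, fixed contiguous clusters of `cluster_size` tokens represented by their mean, and a residual broadcast of the attended representative back to its tokens. All function names are hypothetical, and learned projections are omitted.

```python
import math

def local_swarm_step(x):
    """One local update: each token averages itself with its two neighbors.
    Edge tokens reuse themselves as the missing neighbor (an assumption)."""
    n = len(x)
    out = []
    for i in range(n):
        left = x[max(i - 1, 0)]
        right = x[min(i + 1, n - 1)]
        out.append([(l + c + r) / 3.0 for l, c, r in zip(left, x[i], right)])
    return out

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def cluster_global_attention(x, cluster_size):
    """Attend among cluster representatives only, then add each attended
    representative back to the tokens of its cluster."""
    n, d = len(x), len(x[0])
    assert n % cluster_size == 0, "sketch assumes N divisible by cluster_size"
    clusters = [x[i:i + cluster_size] for i in range(0, n, cluster_size)]
    # Representative = mean of the cluster's tokens (assumed pooling rule).
    reps = [[sum(tok[k] for tok in c) / cluster_size for k in range(d)]
            for c in clusters]
    out = []
    for ci, rep in enumerate(reps):
        # Scaled dot-product scores against every representative: C x C work.
        scores = [sum(a * b for a, b in zip(rep, other)) / math.sqrt(d)
                  for other in reps]
        weights = softmax(scores)
        mixed = [sum(w * other[k] for w, other in zip(weights, reps))
                 for k in range(d)]
        for tok in clusters[ci]:
            out.append([t + m for t, m in zip(tok, mixed)])
    return out

def swarm_layer(x, cluster_size=4, local_steps=3):
    """Repeated local updates followed by cluster-level global attention."""
    for _ in range(local_steps):
        x = local_swarm_step(x)
    return cluster_global_attention(x, cluster_size)

def attention_pairs(n, cluster_size):
    """Score counts: full attention (N^2) versus cluster-level (C^2)."""
    c = n // cluster_size
    return n * n, c * c

# Multi-hop propagation: an impulse at position 5 spreads one hop per step.
impulse = [[1.0 if i == 5 else 0.0] for i in range(16)]
spread = impulse
for _ in range(3):
    spread = local_swarm_step(spread)
reached = [i for i, tok in enumerate(spread) if tok[0] > 0.0]
```

In this sketch, three local steps carry information from position 5 to positions 2 through 8, which illustrates the decentralized multi-hop propagation. With N = 1024 and clusters of 16 tokens, `attention_pairs` counts 4,096 representative pairs against 1,048,576 token pairs for full attention.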
