
SwarmFormer: Local-Global Hierarchical Attention via Swarmed Token Representations
Authors
Abstract
Standard Transformers rely on self-attention whose cost grows as O(N^2) in the sequence length N, which becomes prohibitive for long sequences. Local or sparse approximations reduce this cost, but they can limit access to global context.
We propose SwarmFormer, a hierarchical local-global approach that draws inspiration from swarm intelligence. Each layer combines repeated local (swarm-like) updates between neighboring tokens with cluster-based global attention over a smaller set of cluster representatives.
The local aggregator enables decentralized multi-hop propagation, while the cluster-level attention captures global context without full O(N^2) overhead. Experimental results on text classification tasks show that SwarmFormer achieves strong accuracy with up to 90% fewer parameters than baseline Transformers, demonstrating efficient scalability to longer sequences.
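
As a rough illustration of the two-stage idea described above, the following NumPy sketch pairs a repeated neighbor-averaging step with attention over pooled cluster representatives. It is not the authors' implementation: the three-token averaging window, the contiguous fixed-size clusters, the mean pooling, the projection-free attention, and all function names (local_swarm_update, cluster_global_attention, swarm_layer) are assumptions chosen only to keep the example short.

```python
import numpy as np

def local_swarm_update(x, steps=2):
    """Average each token with its left and right neighbors, `steps` times.

    Assumed stand-in for a local aggregator: after k steps, information
    from a token has spread at most k positions (multi-hop propagation).
    """
    for _ in range(steps):
        padded = np.pad(x, ((1, 1), (0, 0)), mode="edge")  # repeat edge tokens
        x = (padded[:-2] + padded[1:-1] + padded[2:]) / 3.0
    return x

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract max for stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def cluster_global_attention(x, cluster_size):
    """Attend among mean-pooled representatives of contiguous clusters.

    The score matrix is (N / cluster_size) squared rather than N squared.
    No learned projections are used here (an assumption for brevity).
    """
    n, d = x.shape
    assert n % cluster_size == 0, "sketch assumes N divisible by cluster_size"
    reps = x.reshape(n // cluster_size, cluster_size, d).mean(axis=1)  # (C, d)
    scores = reps @ reps.T / np.sqrt(d)                                # (C, C)
    return softmax(scores) @ reps                                      # (C, d)

def swarm_layer(x, cluster_size=4, steps=2):
    """One hypothetical layer: local updates, then cluster-level attention."""
    x = local_swarm_update(x, steps)
    mixed = cluster_global_attention(x, cluster_size)
    # Broadcast each cluster's globally mixed vector back to its tokens.
    return x + np.repeat(mixed, cluster_size, axis=0)

if __name__ == "__main__":
    tokens = np.random.default_rng(0).normal(size=(16, 8))  # N=16, d=8
    print(swarm_layer(tokens).shape)
```

With N = 16 tokens and clusters of 4, the attention step in this sketch scores a 4 x 4 matrix instead of a 16 x 16 one, while the local step costs time linear in N per iteration.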