CMTM: Cross-Modal Token Modulation for Unsupervised Video Object Segmentation

April 16, 2026 | 2604.14630

Authors

Inseok Jeon, Minhyeok Lee, Seunghoon Lee, Minseok Kang, Jungho Lee

Abstract

Recent advances in unsupervised video object segmentation have highlighted the potential of two-stream architectures that integrate appearance and motion cues. However, fully leveraging these complementary sources of information requires effectively modeling their interdependencies.

In this paper, we introduce Cross-Modal Token Modulation (CMTM), a novel approach designed to strengthen the interaction between appearance and motion cues. Our method establishes dense connections between tokens from each modality, enabling efficient intra-modal and inter-modal information propagation through relation transformer blocks.
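As a rough illustration of what dense connections between appearance and motion tokens could look like, the sketch below runs single-head dot-product attention over the concatenation of both token sets, so every token attends to tokens of its own modality and of the other one. This is not the authors' relation transformer block: the function name `joint_attention`, the identity query/key/value projections, the token counts, and the feature size are all assumptions made for illustration.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def joint_attention(appearance, motion):
    """Single-head attention over the concatenated token sequence.

    Assumption: queries, keys, and values are the tokens themselves
    (no learned projections), which keeps the sketch dependency-free.
    """
    tokens = appearance + motion  # one joint sequence of both modalities
    dim = len(tokens[0])
    scale = 1.0 / math.sqrt(dim)
    outputs, weights = [], []
    for query in tokens:
        scores = [scale * sum(q * k for q, k in zip(query, key)) for key in tokens]
        row = softmax(scores)  # attention over intra- and inter-modal tokens
        weights.append(row)
        outputs.append(
            [sum(row[j] * tokens[j][d] for j in range(len(tokens))) for d in range(dim)]
        )
    n_app = len(appearance)
    return outputs[:n_app], outputs[n_app:], weights


# Hypothetical toy inputs: 3 appearance tokens and 3 motion tokens of size 4.
rng = random.Random(0)
appearance_tokens = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(3)]
motion_tokens = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(3)]
app_out, mot_out, attn = joint_attention(appearance_tokens, motion_tokens)
```

Each row of `attn` spans all six tokens, so an appearance token's update mixes information from both appearance and motion tokens in a single step. A practical block would add learned projections, multiple heads, residual connections, and normalization.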

To improve learning efficiency, we incorporate a token masking strategy that addresses the limitations of relying solely on increased model complexity. Our approach achieves state-of-the-art performance across all public benchmarks.
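The abstract does not specify how tokens are masked. One common form of token masking, shown here purely as an assumed example, drops a random fraction of tokens during training so the model must infer the missing context from the remaining ones. The helper name `mask_tokens` and the `mask_ratio` parameter are hypothetical and do not come from the paper.

```python
import random


def mask_tokens(tokens, mask_ratio, rng):
    """Keep a random subset of tokens and return (kept_tokens, kept_indices).

    Assumption: uniform random masking with a fixed ratio; the paper's
    actual masking strategy may differ.
    """
    if not 0.0 <= mask_ratio < 1.0:
        raise ValueError("mask_ratio must be in [0, 1)")
    n_keep = max(1, int(len(tokens) * (1.0 - mask_ratio)))
    kept = sorted(rng.sample(range(len(tokens)), n_keep))  # preserve token order
    return [tokens[i] for i in kept], kept


# Hypothetical toy input: 8 tokens of size 2, half of them masked out.
rng = random.Random(0)
tokens = [[float(i), float(i) + 0.5] for i in range(8)]
visible, kept_indices = mask_tokens(tokens, mask_ratio=0.5, rng=rng)
```

Returning the kept indices alongside the visible tokens lets later stages restore the original token positions if needed.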
