TinyLLaVA-Video: A Simple Framework of Small-scale Large Multimodal Models for Video Understanding

January 26, 2025 · arXiv:2501.15513

Authors

Lei Huang, Xingjian Zhang, Xi Weng, Yihao Yue, Zhaoxin Fan

Abstract

Video behavior recognition and scene understanding are fundamental tasks in multimodal intelligence, serving as critical building blocks for numerous real-world applications. Although large multimodal models (LMMs) have achieved remarkable progress in video understanding, most existing open-source models rely on more than 7B parameters and require large-scale datasets for training, making them resource-intensive and inaccessible to many researchers.

Furthermore, lightweight models face persistent challenges in processing long visual sequences effectively and in understanding temporal information. In this work, we introduce TinyLLaVA-Video, a lightweight yet powerful video understanding model with approximately 3.6B parameters.

The cornerstone of our design is the video-level group resampler, a novel mechanism that significantly reduces and controls the number of visual tokens at the video level. Unlike traditional image-level resamplers, our approach effectively mitigates redundancy while enhancing temporal comprehension, leading to improved performance on video-based tasks.
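The abstract does not specify the resampler's internals, so the following is only a minimal sketch of the general idea: a small set of learnable queries cross-attends over the tokens of all frames at once, fixing the visual token budget regardless of video length. The class name, hidden size, query count, and single-attention-layer design are illustrative assumptions, not the authors' architecture.

# Minimal sketch (not the authors' code) of a video-level resampler:
# learnable queries attend over ALL frame tokens jointly, so the output
# token count is fixed no matter how many frames the video has.
import torch
import torch.nn as nn

class VideoLevelResampler(nn.Module):
    def __init__(self, hidden_dim=1024, num_queries=128, num_heads=8):
        super().__init__()
        # Fixed number of learnable queries = fixed visual token budget.
        self.queries = nn.Parameter(torch.randn(num_queries, hidden_dim) * 0.02)
        self.attn = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(hidden_dim)

    def forward(self, frame_tokens):
        # frame_tokens: (batch, num_frames, tokens_per_frame, hidden_dim)
        b, t, n, d = frame_tokens.shape
        # Flatten the whole video into one token sequence (video level),
        # rather than resampling each frame independently (image level).
        video_tokens = frame_tokens.reshape(b, t * n, d)
        q = self.queries.unsqueeze(0).expand(b, -1, -1)
        pooled, _ = self.attn(q, video_tokens, video_tokens)
        return self.norm(pooled)  # (batch, num_queries, hidden_dim)

# Example: 16 frames x 256 patch tokens each are compressed to 128 tokens.
x = torch.randn(2, 16, 256, 1024)
print(VideoLevelResampler()(x).shape)  # torch.Size([2, 128, 1024])

Because the queries see tokens from every frame in one attention pass, redundant content repeated across frames can be merged, which is the intuition behind the claimed gains in temporal comprehension.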

In addition, TinyLLaVA-Video is exceptionally efficient to train, requiring only one day on 8 A100-40G GPUs, yet it surpasses several existing 7B-parameter models on multiple benchmarks.

We believe this work provides a valuable foundation for future research on lightweight video understanding models. The code and weights are available at https://github.com/ZhangXJ199/TinyLLaVA-Video.
