TinyLLaVA-Video: A Simple Framework of Small-scale Large Multimodal Models for Video Understanding

January 26, 2025 · arXiv:2501.15513

Authors

Lei Huang, Xingjian Zhang, Xi Weng, Yihao Yue, Zhaoxin Fan

Abstract

Video behavior recognition and scene understanding are fundamental tasks in multimodal intelligence, serving as critical building blocks for numerous real-world applications. Although large multimodal models (LMMs) have achieved remarkable progress in video understanding, most existing open-source models rely on more than 7B parameters and require large-scale datasets for training, making them resource-intensive and inaccessible to many researchers.

Furthermore, lightweight models face persistent challenges in processing long visual sequences effectively and in understanding temporal information. In this work, we introduce TinyLLaVA-Video, a lightweight yet powerful video understanding model with approximately 3.6B parameters.

The cornerstone of our design is the video-level group resampler, a novel mechanism that significantly reduces and controls the number of visual tokens at the video level. Unlike traditional image-level resamplers, our approach effectively mitigates redundancy while enhancing temporal comprehension, leading to improved performance on video-based tasks.
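The abstract does not specify the resampler's internals, so the following is only a minimal sketch of the general idea: a small set of learnable queries cross-attends over the tokens of all frames at once, fixing the visual token budget regardless of video length. The class name, hidden size, query count, and single-attention-layer design are illustrative assumptions, not the authors' architecture.

# Minimal sketch (not the authors' code) of a video-level resampler:
# learnable queries attend over ALL frame tokens jointly, so the output
# token count is fixed no matter how many frames the video has.
import torch
import torch.nn as nn

class VideoLevelResampler(nn.Module):
    def __init__(self, hidden_dim=1024, num_queries=128, num_heads=8):
        super().__init__()
        # Fixed number of learnable queries = fixed visual token budget.
        self.queries = nn.Parameter(torch.randn(num_queries, hidden_dim) * 0.02)
        self.attn = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(hidden_dim)

    def forward(self, frame_tokens):
        # frame_tokens: (batch, num_frames, tokens_per_frame, hidden_dim)
        b, t, n, d = frame_tokens.shape
        # Flatten the whole video into one token sequence (video level),
        # rather than resampling each frame independently (image level).
        video_tokens = frame_tokens.reshape(b, t * n, d)
        q = self.queries.unsqueeze(0).expand(b, -1, -1)
        pooled, _ = self.attn(q, video_tokens, video_tokens)
        return self.norm(pooled)  # (batch, num_queries, hidden_dim)

# Example: 16 frames x 256 patch tokens each are compressed to 128 tokens.
x = torch.randn(2, 16, 256, 1024)
print(VideoLevelResampler()(x).shape)  # torch.Size([2, 128, 1024])

Because the queries see tokens from every frame in one attention pass, redundant content repeated across frames can be merged, which is the intuition behind the claimed gains in temporal comprehension.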

In addition, TinyLLaVA-Video is exceptionally efficient to train, requiring only one day on 8 A100-40G GPUs, yet it surpasses several existing 7B-parameter models on multiple benchmarks.

We believe this work provides a valuable foundation for future research on lightweight video understanding models. The code and weights are available at https://github.com/ZhangXJ199/TinyLLaVA-Video.
