Learning video embedding space with Natural Language Supervision

arXiv:2303.14584

Authors

Vaidehi Joshi, Phani Krishna Uppala, Abhishek Bamotra, Shriti Priya

Abstract

The recent success of the CLIP model has shown its potential for a wide range of vision and language tasks. However, CLIP only establishes an embedding-space relationship between language and images, not between language and video.

In this paper, we propose a novel approach to map the video embedding space to natural language. The approach has two stages: it first extracts visual features from each frame of a video using a pre-trained CNN, and then uses the CLIP model to encode those features for the video domain, along with the corresponding text descriptions.
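The two-stage idea can be illustrated with a minimal sketch. Everything below is an assumption made for illustration, not the authors' implementation: plain lists of floats stand in for the outputs of the pre-trained CNN and the CLIP text encoder, per-frame features are combined by mean pooling (the abstract does not say how frames are aggregated), and the video and text embeddings are aligned with a CLIP-style symmetric contrastive loss. The function names (`mean_pool`, `symmetric_contrastive_loss`, and so on) and the temperature value are hypothetical.

```python
import math


def mean_pool(frame_features):
    """Average per-frame feature vectors into one video-level vector.

    Assumption: mean pooling is only one possible way to aggregate frames.
    """
    n = len(frame_features)
    dim = len(frame_features[0])
    return [sum(f[d] for f in frame_features) / n for d in range(dim)]


def l2_normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def similarity_matrix(video_embs, text_embs, temperature=0.07):
    """Temperature-scaled cosine similarity between every video and every text."""
    v = [l2_normalize(e) for e in video_embs]
    t = [l2_normalize(e) for e in text_embs]
    return [[sum(a * b for a, b in zip(vi, tj)) / temperature for tj in t]
            for vi in v]


def cross_entropy_diag(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[i]
    return total / len(logits)


def symmetric_contrastive_loss(video_embs, text_embs, temperature=0.07):
    """Average of video-to-text and text-to-video cross-entropy losses."""
    logits = similarity_matrix(video_embs, text_embs, temperature)
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(transposed))


# Toy usage: two "videos" of two frames each, with 2-D placeholder features.
videos = [
    mean_pool([[1.0, 0.0], [0.8, 0.2]]),
    mean_pool([[0.0, 1.0], [0.2, 0.8]]),
]
texts = [[1.0, 0.0], [0.0, 1.0]]  # placeholder text embeddings
loss = symmetric_contrastive_loss(videos, texts)
```

In this toy setup, the loss is lower when each video is paired with its matching text than when the pairs are swapped, which is the property a contrastive objective of this kind is meant to encourage.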

We evaluate our method on two benchmark datasets, UCF101 and HMDB51, and achieve state-of-the-art performance on both.
