The Unreasonable Effectiveness of Large Language-Vision Models for Source-free Video Domain Adaptation

2308.09139

Authors

Stéphane Lathuilière, Paolo Rota, Elisa Ricci, Giacomo Zara, Alessandro Conti

Abstract

The Source-Free Video Unsupervised Domain Adaptation (SFVUDA) task consists of adapting an action recognition model, trained on a labelled source dataset, to an unlabelled target dataset without access to the source data itself. Previous approaches have attempted to address SFVUDA by leveraging self-supervision (e.g., enforcing temporal consistency) derived from the target data.

In this work, we take an orthogonal approach by exploiting "web-supervision" from Large Language-Vision Models (LLVMs), driven by the rationale that LLVMs contain a rich world prior that is surprisingly robust to domain shift. We showcase the unreasonable effectiveness of integrating LLVMs for SFVUDA by devising an intuitive and parameter-efficient method, which we name Domain Adaptation with Large Language-Vision models (DALL-V), that distills the world prior and complementary source model information into a student network tailored for the target.

Despite its simplicity, DALL-V achieves significant improvements over state-of-the-art SFVUDA methods.
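
The abstract describes distilling two sources of knowledge (the LLVM world prior and the source model) into a single student, but does not say how the two are combined. The sketch below is one plausible reading and should not be taken as the authors' implementation: it assumes each teacher produces per-class logits for a target video, mixes their softmax probabilities with a hypothetical `weight`, and trains the student against the mixed soft targets with cross-entropy. All function names and numbers are illustrative.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def ensemble_targets(llvm_logits, source_logits, weight=0.5):
    """Mix the two teachers' class probabilities.

    `weight` is a hypothetical knob (not from the abstract) that sets how
    much the LLVM teacher counts relative to the source model.
    """
    p_llvm = softmax(llvm_logits)
    p_src = softmax(source_logits)
    return [weight * a + (1.0 - weight) * b for a, b in zip(p_llvm, p_src)]


def distillation_loss(student_logits, targets, eps=1e-12):
    """Cross-entropy between soft teacher targets and student predictions."""
    p_student = softmax(student_logits)
    return -sum(t * math.log(max(p, eps)) for t, p in zip(targets, p_student))


# Toy example with three action classes and made-up logits.
targets = ensemble_targets([2.0, 0.5, -1.0], [1.0, 1.5, -0.5])
loss_agree = distillation_loss([2.0, 1.0, -1.0], targets)     # student close to teachers
loss_disagree = distillation_loss([-1.0, 0.0, 3.0], targets)  # student far from teachers
```

In this toy setup the student whose logits resemble the teachers' receives the lower loss, which is the signal a distillation objective of this kind would use to pull the student toward the combined teacher predictions on unlabelled target data.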
