VideoPoet: A Large Language Model for Zero-Shot Video Generation

December 21, 2023 | arXiv:2312.14125

Authors

Lijun Yu, Alonso Martinez, Bryan Seybold, Lu Jiang, Ming-Hsuan Yang

Abstract

We present VideoPoet, a language model capable of synthesizing high-quality video, with matching audio, from a large variety of conditioning signals. VideoPoet employs a decoder-only transformer architecture that processes multimodal inputs -- including images, videos, text, and audio.

The training protocol follows that of Large Language Models (LLMs), consisting of two stages: pretraining and task-specific adaptation. During pretraining, VideoPoet incorporates a mixture of multimodal generative objectives within an autoregressive Transformer framework.

The pretrained LLM serves as a foundation that can be adapted for a range of video generation tasks. We present empirical results demonstrating the model's state-of-the-art capabilities in zero-shot video generation, specifically highlighting VideoPoet's ability to generate high-fidelity motions.

Project page: http://sites.research.google/videopoet/

Details

  • © 2026 takara.ai Ltd
  • Content is sourced from third-party publications.