MiniGPT-5: Interleaved Vision-and-Language Generation via Generative Vokens

2310.02239

Authors

Kaizhi Zheng, Xuehai He, Xin Eric Wang

Abstract

Multimodal Large Language Models (MLLMs) have demonstrated a profound capability in multimodal understanding. However, generating images together with coherent text is still underdeveloped.

Addressing this, we introduce a novel interleaved vision-and-language generation method, centered around the concept of "generative vokens". These vokens serve as pivotal elements contributing to coherent image-text outputs.

Our method is marked by a unique two-stage training strategy for description-free multimodal generation, which does not necessitate extensive descriptions of images. We integrate classifier-free guidance to enhance the alignment of generated images and texts, ensuring more seamless and contextually relevant multimodal interactions.
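
As a point of reference, classifier-free guidance in its commonly used form blends a conditional and an unconditional noise prediction at each denoising step. The sketch below shows only that generic blend, under the assumption that both predictions are already available as flat lists of floats; the function and parameter names are hypothetical, and the abstract does not specify the exact variant MiniGPT-5 uses.

```python
from typing import List, Sequence


def classifier_free_guidance(
    eps_cond: Sequence[float],
    eps_uncond: Sequence[float],
    guidance_scale: float,
) -> List[float]:
    """Generic classifier-free guidance blend (illustrative, not the paper's code).

    eps_cond:       noise prediction computed with the conditioning signal
    eps_uncond:     noise prediction computed with the conditioning dropped
    guidance_scale: 0 returns the unconditional prediction, 1 returns the
                    conditional one, values above 1 push further toward
                    the condition
    """
    if len(eps_cond) != len(eps_uncond):
        raise ValueError("predictions must have the same length")
    # eps = eps_uncond + s * (eps_cond - eps_uncond), applied element-wise
    return [u + guidance_scale * (c - u) for c, u in zip(eps_cond, eps_uncond)]


if __name__ == "__main__":
    guided = classifier_free_guidance([1.0, 2.0], [0.0, 1.0], guidance_scale=2.0)
    print(guided)
```

A scale above 1 trades sample diversity for closer adherence to the condition, which is the sense in which guidance can tighten the alignment between generated images and text.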

Our model, MiniGPT-5, exhibits substantial improvement over the baseline models on multimodal generation datasets, including MMDialog and VIST. Human evaluation shows that MiniGPT-5 is better than the baseline model in more than 56% of cases for multimodal generation, highlighting its efficacy across diverse benchmarks.
