DS1 spectrogram: KITTEN: A Knowledge-Intensive Evaluation of Image Generation on Visual
  Entities

KITTEN: A Knowledge-Intensive Evaluation of Image Generation on Visual Entities

2410.11824

Authors

Xinyi Wang,Yonatan Bitton,Hagai Taitelbaum,Gaurav Singh Tomar,Ming-Wei Chang

Abstract

Recent advances in text-to-image generation have improved the quality of synthesized images, but evaluations mainly focus on aesthetics or alignment with text prompts. Thus, it remains unclear whether these models can accurately represent a wide variety of realistic visual entities.

To bridge this gap, we propose KITTEN, a benchmark for Knowledge-InTensive image generaTion on real-world ENtities. Using KITTEN, we conduct a systematic study of the latest text-to-image models and retrieval-augmented models, focusing on their ability to generate real-world visual entities, such as landmarks and animals.

Analysis using carefully designed human evaluations, automatic metrics, and MLLM evaluations show that even advanced text-to-image models fail to generate accurate visual details of entities. While retrieval-augmented models improve entity fidelity by incorporating reference images, they tend to over-rely on them and struggle to create novel configurations of the entity in creative text prompts.

Resources

Stay in the loop

Every AI paper that matters, free in your inbox daily.

Details

  • takara.ai
  • Custom AI and machine learning from the Frontier Research Team.
  • © 2026 takara.ai Ltd
  • Content is sourced from third-party publications.