PRIMA: Multi-Image Vision-Language Models for Reasoning Segmentation

arXiv:2412.15209

Authors

Tianjiao Yu, Pinar Yanardag, Ismini Lourentzou, Muntasir Wahed, Kiet A. Nguyen

Abstract

Despite significant advances in the capabilities of Large Vision-Language Models (LVLMs), existing pixel-grounding models operate in single-image settings, which limits their ability to perform detailed, fine-grained comparisons across multiple images. Conversely, current multi-image understanding models lack pixel-level grounding.

Our work addresses this gap by introducing the task of multi-image pixel-grounded reasoning alongside PRIMA, an LVLM that integrates pixel-level grounding with robust multi-image reasoning to produce contextually rich, pixel-grounded explanations. Central to PRIMA is SQuARE, a vision module that injects cross-image relational context into compact query-based visual tokens before fusing them with the language backbone.

To support training and evaluation, we curate M4SEG, a new multi-image reasoning segmentation benchmark consisting of $\sim$744K question-answer pairs that require fine-grained visual understanding across multiple images. PRIMA outperforms state-of-the-art baselines with $7.83\%$ and $11.25\%$ improvements in Recall and S-IoU, respectively.

Ablation studies further demonstrate the effectiveness of the proposed SQuARE module in capturing cross-image relationships.
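
The abstract does not spell out how SQuARE works internally, so the following is only a generic illustration of the idea it names: a small, fixed set of query vectors attends over patch tokens pooled from all input images, so each resulting compact token can mix information across images. It is a minimal sketch under stated assumptions, not the authors' implementation. All names here (`cross_attend`, `pool_multi_image`), the single attention head, and the absence of learned projections are hypothetical simplifications, and the features are random placeholders.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, tokens):
    """Single-head scaled dot-product attention (no learned projections).

    Each query becomes a weighted average of `tokens`.
    """
    dim = len(queries[0])
    scale = 1.0 / math.sqrt(dim)
    pooled = []
    for q in queries:
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in tokens]
        weights = softmax(scores)
        pooled.append(
            [sum(w * t[d] for w, t in zip(weights, tokens)) for d in range(dim)]
        )
    return pooled


def pool_multi_image(image_features, queries):
    """Assumption: cross-image context is obtained by letting every query
    attend over the concatenated patch tokens of all images."""
    all_tokens = [tok for img in image_features for tok in img]
    return cross_attend(queries, all_tokens)


# Placeholder data: three "images" with 16, 12, and 20 patch tokens of width 8.
rng = random.Random(0)
dim, num_queries = 8, 4
images = [
    [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n)]
    for n in (16, 12, 20)
]
queries = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_queries)]

pooled = pool_multi_image(images, queries)
print(len(pooled), len(pooled[0]))  # 4 compact tokens, each of width 8
```

The point of the sketch is the shape change: 48 patch tokens from three images are reduced to 4 tokens whose values depend on all three images. In a real model the queries and projections would be learned, and the compact tokens would then be passed to the language backbone.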
