Active Exploring like a Pigeon: Reinforcing Spatial Reasoning via Agentic Vision-Language Models

2606.02459

Authors

Wei Deng, Xianlin Zhang, Mengshi Qi

Abstract

Enabling Vision-Language Models (VLMs) to perform spatial reasoning remains challenging. Existing approaches treat VLMs as passive observers, which limits their usefulness in real-world applications.

Moreover, reinforcement learning methods rely on sparse rewards, limiting their effectiveness for complex reasoning tasks. Inspired by how pigeons build and exploit cognitive maps for navigation, we propose a novel agentic pipeline for spatial reasoning.

First, we introduce a new dynamic cognitive map that parameterizes the scene layout as object positions and orientations and serves as persistent memory for new observations. Second, we propose Spatial Assertion Codes (SAC), Python expressions that programmatically describe spatial relationships. Working together with the dynamic cognitive map, SAC enables verification of intermediate reasoning steps, providing dense reward signals.
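To make the idea concrete, here is a minimal sketch of how a Python assertion could be checked against a cognitive map to yield a dense reward. Everything in it is an assumption for illustration: the map schema (`pos`, `yaw_deg`), the helper names `in_front_of` and `left_of`, the 45-degree thresholds, and the reward defined as the fraction of assertions that hold. None of it is taken from the authors' released code.

```python
import math

# Hypothetical cognitive map: object name -> 2D position and facing angle.
# This schema is an illustrative assumption, not the paper's format.
cognitive_map = {
    "chair": {"pos": (0.0, 0.0), "yaw_deg": 90.0},  # yaw 90 means facing +y
    "table": {"pos": (0.0, 2.0), "yaw_deg": 0.0},
    "lamp": {"pos": (-1.5, 0.0), "yaw_deg": 0.0},
}


def bearing_deg(cmap, observer, target):
    """Angle from the observer's facing direction to the target.

    Returned in [-180, 180); positive values are counter-clockwise (left).
    """
    ox, oy = cmap[observer]["pos"]
    tx, ty = cmap[target]["pos"]
    world = math.degrees(math.atan2(ty - oy, tx - ox))
    return (world - cmap[observer]["yaw_deg"] + 180.0) % 360.0 - 180.0


def check_assertion(expr, cmap):
    """Evaluate one assertion string against the map; errors count as False."""
    # Hypothetical predicates: pred(a, b) reads "b is <relation> a, as seen by a".
    env = {
        "in_front_of": lambda a, b: abs(bearing_deg(cmap, a, b)) < 45.0,
        "left_of": lambda a, b: 45.0 <= bearing_deg(cmap, a, b) <= 135.0,
    }
    try:
        # Builtins are emptied so the expression can only call the predicates.
        return bool(eval(expr, {"__builtins__": {}}, env))
    except Exception:
        return False


def dense_reward(assertions, cmap):
    """Fraction of intermediate assertions that hold (assumed reward shape)."""
    if not assertions:
        return 0.0
    return sum(check_assertion(a, cmap) for a in assertions) / len(assertions)


# Three intermediate reasoning steps; the last one contradicts the map.
steps = [
    "in_front_of('chair', 'table')",
    "left_of('chair', 'lamp')",
    "left_of('chair', 'table')",
]
reward = dense_reward(steps, cognitive_map)
```

In this sketch each intermediate step contributes to the reward on its own, so a reasoning chain with one wrong spatial claim still receives partial credit instead of a single pass/fail signal at the final answer.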

We optimize the model via supervised and reinforcement finetuning. Experiments on the MindCube benchmark demonstrate state-of-the-art performance with 80.5% overall accuracy; on the challenging Rotation subset, our method outperforms the best current method by 29.5 accuracy points (a relative improvement of 53.2%). Our code and data are open-sourced at https://github.com/dw-dengwei/active-spatial-reasoning.git.

