
EgoMotion: Hierarchical Reasoning and Diffusion for Egocentric Vision-Language Motion Generation

April 21, 2026 · arXiv:2604.19105

Authors

Yuwei Gui, Mingshuang Luo, Bingpeng Ma, Hong Chang, Shiguang Shan

Abstract

Faithfully modeling human behavior in dynamic environments is a foundational challenge for embodied intelligence. While conditional motion synthesis has achieved significant advances, egocentric motion generation remains largely underexplored due to the inherent complexity of first-person perception.

In this work, we investigate Egocentric Vision-Language (Ego-VL) motion generation. This task requires synthesizing 3D human motion conditioned jointly on first-person visual observations and natural language instructions.

We identify a critical reasoning-generation entanglement challenge: jointly optimizing semantic reasoning and kinematic modeling introduces gradient conflicts, which systematically degrade both multimodal grounding fidelity and motion quality.
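The entanglement can be made concrete with a toy diagnostic (our illustration, not from the paper): when two objectives share parameters, a negative cosine similarity between their gradients signals that one objective's update direction undoes the other's.

```python
# Toy diagnostic for gradient conflict between two objectives that share
# parameters. The losses are arbitrary stand-ins, not the paper's losses.
import torch

shared = torch.nn.Linear(8, 8)   # shared parameters
x = torch.randn(4, 8)

loss_semantic = shared(x).mean()            # stand-in reasoning objective
loss_kinematic = -shared(x).pow(2).mean()   # stand-in motion objective

g_sem = torch.autograd.grad(loss_semantic, shared.weight, retain_graph=True)[0]
g_kin = torch.autograd.grad(loss_kinematic, shared.weight)[0]

# Negative cosine similarity => the two objectives pull the shared
# weights in opposing directions, i.e. a gradient conflict.
cos = torch.nn.functional.cosine_similarity(
    g_sem.flatten(), g_kin.flatten(), dim=0)
print(f"gradient cosine similarity: {cos.item():.3f}")
```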

To address this challenge, we propose EgoMotion, a hierarchical generative framework. Inspired by the biological decoupling of cognitive reasoning from motor control, EgoMotion operates in two stages.

In the Cognitive Reasoning stage, a vision-language model (VLM) projects multimodal inputs into a structured space of discrete motion primitives. This forces the VLM to acquire goal-consistent representations, effectively bridging the semantic gap between high-level perceptual understanding and low-level action execution.
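As a rough illustration of this stage, the sketch below (with hypothetical module names, dimensions, and primitive vocabulary size, since the abstract does not give implementation details) shows a fused VLM embedding being projected onto logits over a discrete primitive vocabulary.

```python
# A minimal sketch of the Cognitive Reasoning stage, assuming a pretrained
# VLM has already fused the egocentric frames and instruction into one
# embedding. All names and sizes here are illustrative assumptions.
import torch
import torch.nn as nn

class CognitiveReasoner(nn.Module):
    def __init__(self, vlm_dim=1024, num_primitives=512):
        super().__init__()
        # Placeholder for the VLM's projection into a shared space.
        self.vlm_proj = nn.Linear(vlm_dim, vlm_dim)
        # Head mapping the fused embedding onto a structured space of
        # discrete motion primitives.
        self.primitive_head = nn.Linear(vlm_dim, num_primitives)

    def forward(self, fused_embedding):
        h = torch.relu(self.vlm_proj(fused_embedding))
        # Logits over the primitive vocabulary; a goal-consistency
        # objective would supervise these predictions during training.
        return self.primitive_head(h)

reasoner = CognitiveReasoner()
logits = reasoner(torch.randn(2, 1024))   # batch of fused VLM embeddings
primitives = logits.argmax(dim=-1)        # predicted discrete primitives
```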

In the Motion Generation stage, these learned representations serve as expressive conditioning signals for a diffusion-based motion generator. By performing iterative denoising within a continuous latent space, the generator synthesizes physically plausible and temporally coherent trajectories.
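The sketch below illustrates what such iterative denoising might look like as a standard DDPM-style sampling loop over a continuous motion latent; the denoiser interface, noise schedule, and latent shape are assumptions for exposition, not the paper's actual design.

```python
# An illustrative denoising loop for the Motion Generation stage.
# `denoiser` stands in for the conditioned noise-prediction network;
# schedule and shapes are toy assumptions.
import torch

def sample_motion(denoiser, cond, steps=50, latent_shape=(1, 196, 256)):
    """Iteratively denoise a latent motion trajectory conditioned on the
    primitive representations `cond` from the reasoning stage."""
    x = torch.randn(latent_shape)                 # start from pure noise
    betas = torch.linspace(1e-4, 0.02, steps)     # toy noise schedule
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)
    for t in reversed(range(steps)):
        eps = denoiser(x, torch.tensor([t]), cond)   # predict the noise
        # Standard DDPM posterior-mean update.
        x = (x - betas[t] / torch.sqrt(1 - alpha_bars[t]) * eps) \
            / torch.sqrt(alphas[t])
        if t > 0:
            x = x + torch.sqrt(betas[t]) * torch.randn_like(x)
    return x  # continuous latent, decoded to joint motion downstream

# Example with a stub denoiser (a real one would be a conditioned network).
stub = lambda x, t, cond: torch.zeros_like(x)
motion_latent = sample_motion(stub, cond=None)
```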

Extensive evaluations demonstrate that EgoMotion achieves state-of-the-art performance, producing motion sequences that are both semantically grounded and kinematically superior to those of existing approaches.
