
Plan, Watch, Recover: A Benchmark and Architectures for Proactive Procedural Assistance
Authors
Abstract
We envision a proactive multimodal assistant that gives users real-time, step-by-step guidance on a procedural task, autonomously deciding when to interrupt and how to coach. However, progress is limited by the absence of large-scale, cross-domain benchmarks that reflect realistic conditions, particularly the common case in which users deviate from the expected step sequence.
We address this gap with four contributions: (1) we release EgoProactive, a large-scale wearable-egocentric dataset for proactive procedural assistance with explicit Out-of-Plan (OOP) annotations and recovery steps; (2) we augment five established benchmarks (Ego4D, EPIC-KITCHENS, EgoExo4D, HoloAssist, HowTo100M) into Pro\textsuperscript{2}Bench under a unified proactive-guidance schema; (3) we propose a decoupled planner--interaction architecture specialized for procedural state, visual cues, and recovery injection; (4) we introduce a post-training recipe that transfers across model families, validated by cross-backbone replication on Llama-4 and Qwen-3.6-VL. In extensive experiments, our trained Llama-4 system substantially improves objective intervention quality over strong proprietary (Claude Opus 4.6, Gemini 3.1 Pro, GPT-5.2) and open-weight (Qwen3-VL~235B) baselines across all six datasets. Oracle-plan experiments further show that, when plan quality is controlled, the trained duplex model produces high-quality guidance and large gains on Out-of-Plan recovery.