Vision-and-Language Navigation (VLN) in continuous environments requires agents to interpret natural language instructions while navigating unconstrained 3D spaces. Existing VLN-CE frameworks rely on a two-stage approach: a waypoint predictor that generates candidate waypoints and a navigator that executes movements between them. However, current waypoint predictors struggle with spatial awareness, while navigators lack historical reasoning and backtracking capabilities, limiting adaptability.
We propose a zero-shot VLN-CE framework integrating an enhanced waypoint predictor with a Multi-modal Large Language Model (MLLM)-based navigator. Our predictor employs a stronger vision encoder, masked cross-attention fusion, and an occupancy-aware loss for better waypoint quality. The navigator incorporates history-aware reasoning and adaptive path planning with backtracking, improving robustness. Experiments on R2R-CE and MP3D benchmarks show our method achieves state-of-the-art (SOTA) performance in zero-shot settings, demonstrating competitive results compared to fully supervised methods.
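To make the predictor's two main ideas concrete, the sketch below shows one plausible PyTorch rendering of masked cross-attention fusion and an occupancy-aware heatmap loss. The module names, tensor shapes, and the occupancy weighting scheme are illustrative assumptions, not the released SmartWay code.

# Minimal sketch of masked cross-attention fusion and an occupancy-aware
# loss for waypoint prediction. Shapes, names, and the weighting scheme
# are assumptions for illustration, not the paper's implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskedCrossAttentionFusion(nn.Module):
    """Fuse RGB queries with depth keys/values; a key-padding mask keeps
    each query from attending to invalid (e.g. out-of-view) positions."""
    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, rgb_feats, depth_feats, key_padding_mask=None):
        # rgb_feats, depth_feats: (B, N, D); key_padding_mask: (B, N) bool,
        # True at positions to exclude from the attention.
        fused, _ = self.attn(rgb_feats, depth_feats, depth_feats,
                             key_padding_mask=key_padding_mask)
        return self.norm(rgb_feats + fused)  # residual + layer norm

def occupancy_aware_loss(logits, target_heatmap, occupied_mask, penalty=2.0):
    """Heatmap BCE that up-weights predictions on occupied
    (non-navigable) cells, discouraging waypoints inside obstacles."""
    bce = F.binary_cross_entropy_with_logits(logits, target_heatmap,
                                             reduction="none")
    weights = torch.ones_like(bce)
    weights[occupied_mask] = penalty  # heavier cost on occupied cells
    return (weights * bce).mean()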
We deploy our method on a TurtleBot 4 equipped with an OAK-D Pro camera, demonstrating its adaptability through real-world validation.
Our approach consists of two key components: an Occupancy-aware Waypoint Predictor and an MLLM-based Navigator. The waypoint predictor refines waypoint selection by integrating a stronger vision encoder, a masked cross-attention fusion mechanism, and an occupancy-aware loss, improving prediction quality. The MLLM-based Navigator processes candidate waypoints using visual and textual information to enhance navigation decisions, incorporating finer turning options, historical context, and a backtracking strategy. The robot in this figure is our TurtleBot 4 mobile robot equipped with an OAK-D Pro camera mounted at a height of 70 cm.
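The snippet below sketches how such an MLLM-based navigator might be prompted with candidate waypoints and step history, and how a backtracking fallback could be handled. The prompt wording, action labels, and the query_mllm callable are hypothetical placeholders, not the paper's actual interface.

# Hedged sketch of the navigator's decision step with backtracking.
def choose_action(instruction, candidates, history, query_mllm):
    """Ask an MLLM to pick a candidate waypoint or to backtrack.

    candidates: dict mapping option letters to waypoint descriptions.
    history: list of textual summaries of previous steps (the
    history-aware context described above).
    Returns ("MOVE", letter) or ("BACKTRACK", previous_step).
    """
    options = "\n".join(f"{k}. {v}" for k, v in candidates.items())
    prompt = (
        f"Instruction: {instruction}\n"
        f"Steps taken so far: {'; '.join(history) if history else 'none'}\n"
        f"Candidate waypoints:\n{options}\n"
        "Answer with one option letter, or BACKTRACK if no candidate "
        "is consistent with the instruction."
    )
    answer = query_mllm(prompt).strip().upper()
    if answer.startswith("BACKTRACK") and history:
        return "BACKTRACK", history[-1]  # revert toward the previous viewpoint
    return "MOVE", answer[:1]

# Example with a stubbed MLLM that always picks option A:
action = choose_action(
    "Turn right and then walk to the red ladder.",
    {"A": "doorway ahead, 2 m", "B": "hallway to the left, 3 m"},
    ["moved forward through the living room"],
    query_mllm=lambda prompt: "A",
)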
Go forward, walk across the chairs, turn left and stop at the TV.
Turn right and then walk to the red ladder.
Navigate through the simulated environment and stop at the goal.
@misc{shi2025smartwayenhancedwaypointprediction,
  title={SmartWay: Enhanced Waypoint Prediction and Backtracking for Zero-Shot Vision-and-Language Navigation},
  author={Xiangyu Shi and Zerui Li and Wenqi Lyu and Jiatong Xia and Feras Dayoub and Yanyuan Qiao and Qi Wu},
  year={2025},
  eprint={2503.10069},
  archivePrefix={arXiv},
  url={https://arxiv.org/abs/2503.10069},
}
This website is adapted from Nerfies, which is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.