Action-Aware Visual-Textual Alignment for Long-Instruction Vision-and-Language Navigation

Prerequisites

Our model was trained and evaluated using the following package dependencies:

PyTorch 1.9.1
Python 3.6.12

Install the Matterport3D simulator by following the instructions here.
Download the object features here.
Download the R2R and R4R datasets here. The download contains a datasets folder.
(Optional) Download the trained model here.
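
Before installing the simulator, it can help to confirm that the active environment matches the versions listed above. The following is a minimal sketch, not part of the repository: the helper names version_matches and check_env are hypothetical, and it assumes that the `python` on your PATH is the interpreter you will train with.

```shell
#!/usr/bin/env bash
# Sketch: compare the active Python and PyTorch versions with the ones
# listed in this README. Helper names are hypothetical, not from the repo.

EXPECTED_PYTHON="3.6.12"
EXPECTED_TORCH="1.9.1"

# Succeeds when the reported version starts with the expected one,
# so a build suffix such as "1.9.1+cu111" still counts as a match.
version_matches() {
  local expected="$1" actual="$2"
  case "$actual" in
    "$expected"*) return 0 ;;
    *) return 1 ;;
  esac
}

# Prints one line per mismatch and returns non-zero if any was found.
check_env() {
  local py torch status=0
  py="$(python -c 'import platform; print(platform.python_version())' 2>/dev/null)"
  torch="$(python -c 'import torch; print(torch.__version__)' 2>/dev/null)"
  if ! version_matches "$EXPECTED_PYTHON" "$py"; then
    echo "Python: expected $EXPECTED_PYTHON, found ${py:-none}"
    status=1
  fi
  if ! version_matches "$EXPECTED_TORCH" "$torch"; then
    echo "PyTorch: expected $EXPECTED_TORCH, found ${torch:-none}"
    status=1
  fi
  return "$status"
}
```

Sourcing the file and calling check_env prints nothing when both versions match. The match is a simple prefix test, so a different version that happens to start with the same characters would also pass; tighten the comparison if that matters for your setup.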

Pre-training

cd pretrain_src
CUDA_VISIBLE_DEVICES=0,1,2,3 bash run_r4r.sh 8001  # R4R
CUDA_VISIBLE_DEVICES=0,1,2,3 bash run_r2r.sh 8001  # R2R
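
The commands above expose four GPUs through CUDA_VISIBLE_DEVICES and pass 8001 as the only argument. This README does not state what 8001 means; the sketch below assumes it is the port used by the distributed launcher, which you should confirm by reading the run scripts. The helpers count_visible_gpus and is_valid_port are hypothetical and only sanity-check the two values before a launch.

```shell
#!/usr/bin/env bash
# Sketch: pre-launch checks for the GPU list and the trailing argument.
# Assumption: the trailing argument (8001 above) is a network port.

# Count the GPU ids in a comma-separated list such as "0,1,2,3".
# Falls back to $CUDA_VISIBLE_DEVICES when no argument is given.
count_visible_gpus() {
  local devices="${1:-${CUDA_VISIBLE_DEVICES:-}}"
  if [ -z "$devices" ]; then
    echo 0
    return 0
  fi
  local IFS=','
  # Word splitting on commas turns the list into positional parameters.
  set -- $devices
  echo "$#"
}

# Accept only an integer in the unprivileged port range 1024-65535.
is_valid_port() {
  case "$1" in
    ''|*[!0-9]*) return 1 ;;
  esac
  [ "$1" -ge 1024 ] && [ "$1" -le 65535 ]
}
```

If you change the GPU list, check whether the run scripts hard-code the number of processes; the count returned here is only useful if the script is adjusted to match.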

Fine-tuning

CUDA_VISIBLE_DEVICES=0,1,2,3 bash scripts/r4r_b16.sh 8001  # R4R
CUDA_VISIBLE_DEVICES=0,1,2,3 bash scripts/r2r_b16.sh 8001  # R2R
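
The fine-tuning commands reference scripts by a path relative to the current directory, so they fail if you are still inside pretrain_src. As a minimal sketch (the helper finetune_cmd is hypothetical and not part of the repository), the function below checks that the script exists from where you are and prints the command line rather than running it.

```shell
#!/usr/bin/env bash
# Sketch: build a fine-tuning command line only if the script is reachable
# from the current directory. The helper name is hypothetical.

# Usage: finetune_cmd <gpu-list> <script-path> <trailing-argument>
finetune_cmd() {
  local gpus="$1" script="$2" port="$3"
  if [ ! -f "$script" ]; then
    echo "missing $script (current directory: $PWD)" >&2
    return 1
  fi
  echo "CUDA_VISIBLE_DEVICES=$gpus bash $script $port"
}
```

Calling finetune_cmd 0,1,2,3 scripts/r2r_b16.sh 8001 prints the same command shown above when the script is found, and an error naming the current directory otherwise. Printing instead of executing keeps the helper safe to try; run the printed line yourself once it looks right.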

RxR

Please see APAF_RxR/README.md.

Citation

If you find this work useful in your research, please cite the following paper:

# BibTeX
@article{10.1145/3748656,
author = {Huang, Bowen and Zheng, Yanwei and Lan, Chuanlin and Sui, Dongchen and Zhao, Xinpeng and Zhang, Xiao and Xiao, Mengbai and Yu, Dongxiao},
title = {Action-Aware Visual-Textual Alignment for Long-Instruction Vision-and-Language Navigation},
journal = {ACM Transactions on Multimedia Computing, Communications and Applications},
year = {2025},
issue_date = {September 2025},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
volume = {21},
number = {9},
issn = {1551-6857},
url = {https://doi.org/10.1145/3748656},
doi = {10.1145/3748656},
month = sep,
articleno = {270},
numpages = {22},
keywords = {Long-Instruction Vision-and-Language Navigation, Action-Perception Alignment Framework, Action-Contextual Encoding Module, Dynamic Instruction Weighting Module}
}

# GB/T 7714
[1] Huang B, Zheng Y, Lan C, et al. Action-Aware Visual-Textual Alignment for Long-Instruction Vision-and-Language Navigation[J]. ACM Transactions on Multimedia Computing, Communications and Applications, 2025.

# MLA
[1] Huang, Bowen, et al. "Action-Aware Visual-Textual Alignment for Long-Instruction Vision-and-Language Navigation." ACM Transactions on Multimedia Computing, Communications and Applications (2025).

# APA
[1] Huang, B., Zheng, Y., Lan, C., Sui, D., Zhao, X., Zhang, X., Xiao, M., & Yu, D. (2025). Action-aware visual-textual alignment for long-instruction vision-and-language navigation. ACM Transactions on Multimedia Computing, Communications and Applications, 21(9), Article 270. https://doi.org/10.1145/3748656

Acknowledgement

This codebase builds on ScaleVLN, BEVBert, and DUET.
