Dear Authors,
Thank you for releasing the AutoVLA codebase. We are working on replicating your results, starting with nuScenes. We've carefully reviewed both the paper and the released code, and extracted the following SFT training configuration. Could you confirm these match what was used in your experiments?
SFT Hyperparameters (from code + paper):
Optimizer: AdamW (PyTorch defaults: β1=0.9, β2=0.999, ε=1e-8)
Learning rate: 1 × 10⁻⁵
Weight decay: 0.01
LR schedule: Linear warmup (500 steps) → step decay (γ=0.98 every 2000 steps)
Gradient clipping: value clipping at 1.0
Mixed precision: bfloat16 (FSDP FULL_SHARD)
Per-GPU batch size: 1, gradient accumulation: 4, 8 GPUs → effective batch size 32
Epochs: 5
Loss: L_SFT = w_i · (L_LM + λ_a · L_action), with λ_a = 1; w_i = 40 for CoT samples and 1 otherwise
Vision backbone frozen, LM backbone trained
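To make sure we are reading the schedule and loss the same way you do, here is a minimal pure-Python sketch of our interpretation. The function names and structure are ours, not from the released code; please correct us if the semantics differ (e.g., whether the decay counter starts at step 0 or after warmup).

```python
# Sketch of the LR schedule as we read it: linear warmup for 500 steps,
# then multiply the base LR (1e-5) by 0.98 every 2000 post-warmup steps.
WARMUP_STEPS = 500
DECAY_EVERY = 2000
GAMMA = 0.98

def lr_scale(step: int) -> float:
    """Multiplicative factor applied to the base LR at a given step
    (the kind of function one would pass to LambdaLR)."""
    if step < WARMUP_STEPS:
        return step / WARMUP_STEPS
    return GAMMA ** ((step - WARMUP_STEPS) // DECAY_EVERY)

def sft_loss(l_lm: float, l_action: float, is_cot: bool,
             lambda_a: float = 1.0, cot_weight: float = 40.0) -> float:
    """Our reading of L_SFT = w_i * (L_LM + lambda_a * L_action),
    with w_i = 40 for CoT samples and 1 otherwise."""
    w = cot_weight if is_cot else 1.0
    return w * (l_lm + lambda_a * l_action)
```

In particular, we assumed the 40× CoT weight multiplies both the language-modeling and the action terms, not the action term alone.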
If any of these differ from your actual experimental setup, we'd greatly appreciate knowing the correct values.
Additional questions:
Data mixing (Figure 4): The scaling experiments use 10K, 50K, 100K, and 185K samples from a nuPlan + nuScenes mixture. What is the dataset composition at each scale (e.g., ratio of nuPlan to nuScenes)? Is it a simple concatenation and shuffle, or balanced sampling?
nuScenes-only baseline: All reported nuScenes evaluation results appear to be from models trained on the mixed dataset, not nuScenes alone. Have you trained on nuScenes only (19K samples), and if so, what L2 / collision rate do you observe?
Thank you for your time.