Hi, thank you for the great work and for open-sourcing the code!
In the paper, four types of evaluations are conducted in the Simpler-env benchmark:
- In-Domain
- Subject Generalization
- Physical Generalization
- Semantics Generalization
However, after checking the provided evaluation scripts in the repository, it seems that all four available tasks correspond to In-Domain evaluation only. I could not find any code or instructions for reproducing the other three types of generalization results.
Could you kindly clarify:
- Are there existing scripts for Subject, Physical, and Semantics Generalization evaluations?
- If not, could you provide guidance on replicating these generalization results?
Thanks in advance!